Original Research Paper
Chemometrics and Intelligent Laboratory Systems, 22 (1994) 115-125 Elsevier Science Publishers B.V., Amsterdam
Rank determination of spectroscopic profiles by means of cross validation. The effect of replicate measurements on the effective degrees of freedom *

Bjørn Grung and Olav M. Kvalheim

Department of Chemistry, University of Bergen, N-5007 Bergen (Norway)
(Received 2 September 1992; accepted 15 June 1993)
Abstract
Grung, B. and Kvalheim, O.M., 1994. Rank determination of spectroscopic profiles by means of cross validation. The effect of replicate measurements on the effective degrees of freedom. Chemometrics and Intelligent Laboratory Systems, 22: 115-125.

In the present work the effect of sample replication and 'true' instrumental resolution on the number of significant latent variables obtained by cross validation is investigated. Three sample sets containing three and four chemical components were prepared according to mixture design. The samples were analysed by Fourier transform infrared spectroscopy, and cross validation was performed using the predicted residual error sum of squares (PRESS) algorithm with a varying number of degrees of freedom. It was found that the correct number of chemical components could be estimated precisely from the instrumental data (the X matrix), provided that the degrees of freedom were corrected for the above-mentioned effects.
INTRODUCTION
Validation is mandatory in data modelling. Thus, a model of chemical data should always be validated both in a chemical and statistical sense. The statistical validation of a model obtained
Correspondence to: O.M. Kvalheim, Department of Chemistry, University of Bergen, N-5007 Bergen (Norway).
* Paper presented at the 5th Conference on Computer Applications in Analytical Chemistry (COMPANA '92), Jena, Germany, 24-27 August 1992.
from decomposition of a data matrix into latent variables aims at deciding the number of latent variables to be included in the model, and at identifying and rejecting outliers among the samples and variables in the data matrix. There are numerous ways of dealing with the former of the two tasks, e.g., using Malinowski's F test [1] or indicator function [2], use of the zero-component region [3,4], etc. This work examines the performance of cross validation [5-7], a technique which is implemented in many chemometric software packages and therefore much used. However, as
pointed out by Droge and co-workers [8,9] and Gerritsen et al. [10], the method fails in many cases. For instance, cross validation has a tendency to overestimate the number of significant components when used in the traditional way on spectral profiles from, e.g., IR analysis - spectral data with a high correlation between many of the variables. As we shall see, this is often the result of an overestimation of the number of degrees of freedom in the data. One way of dealing with the problem of overfitting has been proposed by Wold and Sjöström [11], who use an empirical function to compensate for the increasing complexity of the model as the number of terms in the model increases.

It can be a difficult task to find the correct number of latent variables to be used in the model if the net analytical signal (NAS) [12] of the minor chemical components is small - a situation that often occurs in analysis of similar compounds. There are of course also physical reasons for obtaining additional latent variables. This phenomenon can be ascribed to two main sources: (i) additional chemical factors due to impurities and interacting species, and (ii) additional factors due to instrumental artefacts (nonlinearities, baselines, heteroscedastic noise). Factors belonging to the first group should be detected by cross validation and yield extra latent variables, while those belonging to the second group should be corrected for before cross validation is performed on the data.

In the present work it is shown that cross validation can provide good results for spectroscopic data if the efficient degrees of freedom in the system can be estimated. This can be achieved by using knowledge of the data and by performing a thorough examination of the correlational relationships among the objects and variables in the data matrix.

The outline of this work is as follows. First, the concepts of cross validation and degrees of freedom are explained. The effects of different kinds of replicates among the objects are discussed along with the importance of the 'true' instrumental resolution. The loss of degrees of freedom as the model increases its number of latent
variables is examined. Finally, it is demonstrated that when the proper corrections have been made, the method yields excellent results.
CROSS VALIDATION
Cross validation is a technique used to find the number of significant components. We shall now outline how to cross validate a data matrix X. Each row in the data matrix X is the spectrum of a sample, and each column thus holds the measurements at one wave number for all samples. In the cross validation procedure, X is divided into a number of smaller groups. One group is deleted from X, and the reduced matrix is decomposed into score and loading vectors. By means of the model established from the reduced matrix, the deleted group is predicted and the residuals between predicted and actual values are computed. This is done successively for all the groups. Every element in X is a member of one, and only one, group. The number of components giving the lowest total prediction error when predicting the deleted groups is the number of significant components. With an increasing number of terms (i.e., latent variables) the number of degrees of freedom is reduced, and a correction term is therefore needed to make comparisons between the possible models.

There are many different ways of dividing X into smaller groups. If the matrix is small, the best way to do the cross validation is to use a procedure called leave-out-one-sample-at-a-time [6]. As the name implies, every group consists of one of the row vectors in X. This procedure becomes time consuming if the number of samples is large. In that case, a faster procedure is the so-called leave-out-one-group-of-elements-at-a-time [7]. Use of this procedure ensures that none of the groups has a majority of its elements from one or a few of the objects or variables. In this work, the procedure used is called leave-out-one-block-of-samples/variables-at-a-time [13]. The cross validation procedure returns a
PRESS value (predicted residual error sum of squares). Traditionally, one has chosen the model with the number of latent variables corresponding to the lowest PRESS value. A better approach is to plot the PRESS values against the number of latent variables. The significant number of latent variables can then be found where the graph has ceased to decline. The reason behind this is that the PRESS values often decrease successively, but that the decrease eventually is insignificant compared to the larger differences between the first few PRESS values. This is easily seen in such a plot (or by using an F test), but somewhat harder to see if one only looks at the numbers.
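To make the procedure concrete, the following is a minimal Python sketch of the leave-out-one-sample-at-a-time variant. It is illustrative only: the function and variable names are ours, and the paper itself uses a block-wise deletion scheme [13].

```python
import numpy as np

def press_loo(X, max_components):
    """PRESS for 1..max_components by leave-out-one-sample-at-a-time.

    For each deleted spectrum a PCA model is built on the remaining
    rows, and the deleted row is reconstructed from an increasing
    number of loading vectors; squared residuals are accumulated.
    """
    n, _ = X.shape
    press = np.zeros(max_components)
    for i in range(n):
        X_red = np.delete(X, i, axis=0)
        mean = X_red.mean(axis=0)
        # Loading vectors of the reduced, centred matrix
        _, _, Vt = np.linalg.svd(X_red - mean, full_matrices=False)
        x = X[i] - mean
        for a in range(1, max_components + 1):
            V = Vt[:a].T                 # M x a matrix of loadings
            r = x - V @ (V.T @ x)        # residual of the deleted row
            press[a - 1] += r @ r
    return press

# Example: a noisy rank-3 matrix; the PRESS curve levels off at 3
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 50))
X += 0.01 * rng.normal(size=X.shape)
print(np.round(press_loo(X, 6), 3))
```

Plotting the returned values against the number of components and looking for the point where the curve ceases to decline reproduces the decision rule described above.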
DEGREES OF FREEDOM
Theory

The number of degrees of freedom in a data set is equal to the excess of independent variables over the number of parameters in the model [14]. This can be illustrated with the following examples.
(1) During an analysis, M independent variables have been measured on a sample. This means that we have M degrees of freedom. If the variables are normalised to a constant sum, e.g., 100, a constraint is put on the variables: after the determination of the first M - 1 variables, the last variable is determined through the constraint that the sum of the variables must be equal to 100. We have lost one degree of freedom.
(2) A data table with N independent samples and M independent variables has N x M degrees of freedom. This is the basis for the traditional correction factor used to compensate for the loss of degrees of freedom in the data set as more components are included in the model. This correction factor is (N - A - 1)(M - A), where A is the number of latent variables extracted from the data set.
(3) Interpolation to find the centre point between two points does not lead to an increase of the degrees of freedom. The interpolation has not introduced any new, independent information to the data set.
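Example (1) is easy to verify numerically. The short Python sketch below (ours, not from the paper) shows that closing the rows of a random matrix to a constant sum removes exactly one degree of freedom, visible as a drop of one in the rank of the centred matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((10, 5))
X_closed = 100 * X / X.sum(axis=1, keepdims=True)  # rows normalised to 100

# Closure fixes each row's last value once the other M - 1 are known,
# so the rank of the centred matrix drops from M to M - 1.
print(np.linalg.matrix_rank(X - X.mean(axis=0)))                # 5
print(np.linalg.matrix_rank(X_closed - X_closed.mean(axis=0)))  # 4
```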
The interpretation of N and M in example (2) is the main reason for the failure of the cross validation when analysing spectral data. Normally, N and M are regarded as the number of samples and variables in the data set, respectively. The next sections will show that this leads to a vast overestimation of the real or efficient degrees of freedom in the data set.

The effect of replicates
The interpretation of N as the number of samples in the data set is valid only as long as the data set does not contain replicates. One can distinguish three kinds of replicates:
(1) Analytical replicates. If two samples are prepared in such a way that the composition of the samples is identical, the two samples are analytical replicates.
(2) Instrumental replicates. If a sample is analysed more than once on the same instrument under the same conditions, the resulting objects in the data matrix are instrumental replicates.
(3) Unintended replicates. These are samples that were prepared as different samples, but due to the sensitivity of the instrument or the analysis they are for practical purposes indistinguishable.
Fig. 1 illustrates the concept of unintended replicates more theoretically.
Fig. 1. The sensitivity functions of three measurement processes. To be able to distinguish between similar samples, the sensitivity (the slope of the function) should be as high as possible (process A). We obtain unintended replicates if the sensitivity is poor (process C).
In the figure, the sensitivities of three measurement processes (e.g., instruments) are shown. The sensitivity of an instrument is defined as the slope of the curve obtained when plotting the results of the measurements against the amounts detected [15]. Of the three measurement processes pictured in Fig. 1, A has a very high sensitivity, B a normal sensitivity and C a very poor sensitivity. This is expressed mathematically in Eqn. 1:
dy_A/dx_A >> dy_B/dx_B >> dy_C/dx_C    (1)
Disregarding noise, two replicates have equal x and equal y coordinates regardless of which process we are looking at. Two different samples always have different x coordinates, but the difference in y coordinates depends upon the sensitivity of the instrument. Measurement A, due to its high sensitivity, always distinguishes between two different samples. Measurement C, however, has such a poor sensitivity that even quite dissimilar samples produce equal responses, thus introducing unintended replicates in the data. Measurement B shows a more normal behaviour. To be able to distinguish between similar samples, the sensitivity must be higher than the sensitivity of measurement C.

Another factor that has an impact on the outcome of the cross validation is the NAS [12]. In a system consisting of two very similar components, the minor component has a small NAS. The first eigenvalue will be much larger than the second, which may be around the size of the third eigenvalue (corresponding to noise). This makes it more difficult to distinguish between noise and information, and increases the importance of the sensitivity functions depicted in Fig. 1.

None of the above-mentioned kinds of replicates contribute to the number of efficient degrees of freedom in the data set. A sample can only contribute once to the N in the correction factor, no matter how many times the sample is analysed on the instrument. Since one should always include replicates of types (1) and (2) in an analysis to obtain information about the repeatability and the signal-to-noise ratio, one must adjust the N in the correction factor to avoid overestimation of the degrees of freedom and subsequent overfitting in the modelling process.

It is of course an easy task to identify replicates of types (1) and (2), so replicates of these kinds pose no problems. In fact, they are used to detect replicates of type (3), as these can be found by comparing the Euclidean distances between the replicates to the Euclidean distances between all the samples in the data set. The following algorithm can be used to find unintended replicates:
1. Compute the Euclidean distances between all the replicates belonging to the same group of replicates. A group of replicates consists of all the replicates of one sample. This is done for every group of prepared replicates.
2. Find the largest Euclidean distance between two replicates belonging to the same group. Use Dixon's Q test [16] to check if this distance is due to an outlier. If this is the case, check the second largest distance, and so on.
3. Compute the Euclidean distances between all the samples.
4. Use the centres of gravity for the groups of replicates.
5. Samples or centres of gravity are possible unintended replicates if the Euclidean distance between them is less than the distance found in step 2.

In step 2 we use Dixon's Q test rather than assuming a normal (or any other) distribution of the distances. The reason for this is that in most cases the number of replicates is not large enough to justify assumptions of normal distributions among them. The number of replicates is also the reason for not using the Mahalanobis distance [15] when computing the distances. With just a few replicates of every sample, the conclusions drawn when using the Mahalanobis distance will not differ significantly from the conclusions arrived at when computing the Euclidean distance. Furthermore, the differences among the replicates are mostly due to random errors, and this means that Euclidean distances can equally well be used. Finally, the Mahalanobis distance requires an inversion of a correlation matrix that is (almost) singular for this type of spectral data.
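A compact Python sketch of this algorithm is given below. It is ours, not the authors' implementation: the Q-test critical values are the common 95% table for n = 3-10 (the last entry is reused above n = 10 as a simplification), and all names are illustrative.

```python
import numpy as np
from itertools import combinations

# Dixon's Q-test critical values at the 95% level for n = 3..10
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def threshold_distance(replicate_groups):
    """Steps 1-2: largest within-group distance surviving Dixon's Q test."""
    d = sorted(np.linalg.norm(g[i] - g[j])
               for g in replicate_groups
               for i, j in combinations(range(len(g)), 2))
    while len(d) >= 3 and d[-1] > d[0]:
        q = (d[-1] - d[-2]) / (d[-1] - d[0])   # gap over range
        if q > Q_CRIT_95.get(min(len(d), 10), 0.466):
            d.pop()                            # outlier; try next largest
        else:
            break
    return d[-1]

def unintended_replicates(singles, replicate_groups):
    """Steps 3-5: flag pairs of points closer than the threshold."""
    points = [np.asarray(s) for s in singles]
    points += [np.mean(g, axis=0) for g in replicate_groups]  # centres
    limit = threshold_distance(replicate_groups)
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if np.linalg.norm(points[i] - points[j]) < limit]
```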
The list of samples and distances obtained in step 5 must be examined carefully, because situations can occur where two samples are identified as replicates of a third sample, while the distance between the two samples is too large for them to be regarded as replicates. They are therefore not members of the same group of unintended replicates. This means that N in the correction factor must be adjusted by one, and not two. If the data set contains many unintended replicates, score plots can be of help when the list of distances is to be interpreted.
The effect of data processing and the instrument's resolution
The previous section dealt with the effect of sample replicates on the degrees of freedom. It should be obvious that correlations among the variables in the data set can also affect the outcome of the cross validation. Normally, M is set equal to the number of variables in the data set - quite analogous to choosing N as the number of objects. Unfortunately, for spectroscopic data both choices may be wrong. The correct estimation of M depends on the instrument's resolution and the way the signal is sampled. For Fourier transform infrared spectroscopy (FT-IR) data, M should be derived from the number of data points in the interferogram rather than from the number of data points in the spectrum. A formula for computing the number of data points in the interferogram can be found in ref. 17:
N_s = 2(v_max - v_min)/Δv    (2)

where N_s is the number of data points, v_max and v_min the maximum and minimum frequency, and Δv the resolution. The effective degrees of freedom can then be found by multiplying N_s with the ratio of the number of frequencies used in the data set to the number of frequencies in the range of the complete spectrum recorded:

M = N_s (v_max,data set - v_min,data set)/(v_max,spectrum - v_min,spectrum)    (3)

A further decrease in the number of data points stems from the fact that only half the interferogram is used - the other half is only used for phase corrections. A second division by two is therefore required. The result of this is that the correct number for M is only a fraction of the number of variables in the spectrum. We note that there is a connection between the instrumental resolution and the importance of unintended replicates: if the resolution is too low, the effect of unintended replication will increase.
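As a worked example, the sketch below (ours; the single halving follows the phase-correction argument above, and the exact bookkeeping of the halvings is our assumption) applies Eqns. 2 and 3 to the three-component mixture's first subset, which spans 2980-2881 cm-1 out of a spectrum recorded from 4000 to 600 cm-1 at 4 cm-1 resolution:

```python
def interferogram_points(v_max, v_min, dv):
    # Eqn. 2: number of data points in the interferogram (ref. 17)
    return 2.0 * (v_max - v_min) / dv

def effective_m(v_max_sub, v_min_sub, v_max_full, v_min_full, dv):
    n_s = interferogram_points(v_max_full, v_min_full, dv)
    n_s /= 2.0  # only half the interferogram carries new information
    # Eqn. 3: scale by the fraction of the recorded range in the subset
    return n_s * (v_max_sub - v_min_sub) / (v_max_full - v_min_full)

print(effective_m(2980, 2881, 4000, 600, 4))  # about 25, close to M = 26 in Table 2
```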
The correction term

After having established that the assignment of values to N and M in the correction term is of crucial importance, what remains is a more thorough examination of the correction term itself. The standard correction term [7] is shown in Eqn. 4a:

correction term = (N - A - 1)(M - A)    (4a)
The underlying assumption is that the initial number of degrees of freedom is equal to the product of rows and columns in the data matrix. Use of this term has a peculiar consequence: one loses fewer degrees of freedom per component as the number of components in the model increases. The reason for this becomes obvious if one rearranges Eqn. 4a as shown in Eqn. 4b:

correction term = NM - NA - MA - M + A + A^2    (4b)

It is of course the last term, A^2, that causes this undesired effect. As every component extracted should result in a loss of degrees of freedom of exactly the same magnitude, we need another expression to correct for this loss. To find one, we should look at what really happens with the matrix when components are extracted: a score vector and a loading vector are determined for every component that is extracted. The determination of the score vector (with as many elements as there are rows in X) influences the degrees of freedom among the objects, while the determination of the loading
vector (with as many elements as there are columns in X) influences the degrees of freedom among the variables. Since X is assumed to be centred, the score vectors will also be centred. Accordingly, we lose N - 1 degrees of freedom as a result of the determination of the score vector. We lose a further M degrees of freedom as a result of the determination of the loading vector. This leads to the following correction term:
correction term = (N - 1)M - A(N - 1 + M)    (5)

Use of this correction term ensures that the loss of degrees of freedom per component is independent of the number of components extracted. When using this correction term, one must be careful not to extract too many components (overfitting) to prevent the degrees of freedom from becoming negative. Although this seemingly absurd situation will occur if too many components are extracted, it hardly poses any problem, since one of the objectives of cross validation is to avoid overfitting. It should also be noted that the use of the new correction term is not as important as performing the corrections on N and M. In many cases, the outcome of the cross validation does not change
whether one uses the new or the traditional correction term.
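The difference between the two terms is easy to see numerically. The sketch below (with illustrative values of N and M, not taken from the paper) prints the degrees of freedom lost per extracted component under each term:

```python
def old_term(n, m, a):
    return (n - a - 1) * (m - a)          # Eqn. 4a

def new_term(n, m, a):
    return (n - 1) * m - a * (n - 1 + m)  # Eqn. 5

n, m = 20, 50  # hypothetical sample and variable counts
for a in range(4):
    loss_old = old_term(n, m, a) - old_term(n, m, a + 1)
    loss_new = new_term(n, m, a) - new_term(n, m, a + 1)
    print(a + 1, loss_old, loss_new)
# The old loss shrinks by 2 per component (the A^2 effect);
# the new loss is constant at N - 1 + M = 69.
```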
EXPERIMENTAL
Sample description

The methods presented are tested on three different sets of samples. The first sample set consists of three liquids in mixture: (1) methyl cyclohexane; (2) ethyl benzene; (3) dibutyl ether. The data set contains 45 samples. Eleven of the 45 samples are replicates of other samples, so that the total number of different samples is 34. This data set is hereafter called the three-component mixture.

The second sample set is made of four liquids mixed in different concentrations. The liquids are: (1) 2-butanone; (2) isopropanol; (3) 2-butanol; (4) 2,3-butanediol. The total number of objects in the data matrix is 33, but there are only 23 different samples. This data set will be referred to as the four-component mixture.

The last sample set consists of 88 samples of mixtures of three different master batches of polymers. There are only 26 different samples in the data set; the rest are replicates. Further information on this data set can be found in Toft et al. [18]. This data set will be called 'Poly' in this paper.
TABLE 1

Overview of the data sets used in this analysis and a comparison of the significant number of components for the data sets and the number of chemical compounds in the mixtures

Data set                          Number of   Wave-number             Number of components   Number of chemical
                                  variables   regions (cm-1)          (standard method)      compounds
Three-component mixture, first    100         2980-2881               >19                    3
Three-component mixture, second   200         2990-2791               >19                    3
Four-component mixture, first     100         2892-2868, 1747-1723,   14                     4
                                              1066-1042, 712-688
Four-component mixture, second    186         -                       18                     4
Poly, first                       100         2978-2946, 2928-2904,   >19                    3
                                              2843-2819, 2730-2706
Poly, second                      100         1206-1107               >19                    3
Poly, third                       190         -                       13                     3
Data acquisition
All the spectra were obtained on a Perkin-Elmer 1720X FT-IR spectrometer and converted into absorbance units. The spectrometer was interfaced to a VAX station 2000 by means of the RS-232 port. The spectra from the three- and four-component mixtures were acquired with a deuterated triglycine sulphate (DTGS) detector and a circular cell. The spectra were recorded from 4000 to 600 cm-1 using 20 scans and a resolution of 4 cm-1.

FT-IR spectra of the sample set Poly were obtained with a TGS detector. A Spectra Tech horizontal attenuated total reflectance (HATR) cell with a ZnSe crystal cut with a 45° bevel was used. To compensate for the inhomogeneity of the samples, each spectrum was the mean of four independently measured spectra, each measured with a different location of the polymer in the cell. The spectra were measured from 4000 to 600 cm-1 with a resolution of 4 cm-1.

Data analysis
The data were treated on a VAX station 2000 with a program written in FORTRAN. In addition, the program SIRIUS [19] was used. All the data sets have been normalised to 100% [20] to remove the size factor from the data. To examine different parts of the data sets, they have been divided into subsets. Each subset consists of a maximum of 200 variables or wave numbers. The subsets have also been subjected to different pretreatments. Table 1 gives some information on every subset used in this analysis. The subsets within each data set are numbered in increasing order according to their order of appearance in Table 1.

RESULTS AND DISCUSSION
Results by means of standard cross validation
To illustrate the need for a better cross validation, Table 1 also shows the results from cross
validation of the different subsets by means of the standard method as compared to the wanted result. The wanted result is always one component less than the number of chemical components in the mixtures, because for closed systems the concentration of the last component is fixed when all the other concentrations are determined. Table 1 shows that cross validation always provides too many significant components.

Before showing the results of the cross validation as the method improves, it is interesting to see that the results are independent of whether one groups together and deletes variables or objects, and that they also (to a certain extent) are independent of the number of groups the matrix is divided into. For these data sets, no changes in the number of significant components were detected whether one used four or ten groups of variables in the cross validation. A similar situation occurs if one compares results obtained with a varying number of objects in each group, and in fact even if one compares results from cross validation where objects have been deleted to results where variables have been deleted. Although this is not shown here, this situation holds for all the data sets. It is therefore of no importance whether one chooses to work with variables or objects. Furthermore, as the results are the same regardless of the use of four or ten groups in the cross validation, much time is saved by doing the cross validation with only four groups. Hence, results are shown only for one kind of grouping for each subset.

Results when the effects of replicates and resolution are taken into account
In this section we investigate the effect that correcting for intended sample replicates and instrumental resolution has on the number of significant components. The corrections for sample replicates are made according to the information in the section 'Sample description', while corrections for instrumental resolution are made only for those subsets not created by means of maximum entropy [21]. Subsets created by maximum entropy no longer have the same kind of dependency among the variables as the subsets not treated this way, and the
Fig. 2. Results from the cross validation of the first subset of the data set Poly when the effects of intended replicates and instrumental resolution have been corrected for.
correction should not be made for these subsets. Unfortunately, this causes an overestimation of the number of degrees of freedom in the maximum-entropy pre-processed subsets, as there still are dependent variables in the data sets. The problem is to find a way of estimating the number of dependent variables.

Fig. 2 shows the results for the first subset of the data set Poly. As we can see, the results show a major improvement, although we obtain one component more than expected. The large move in the correct direction shows the importance of this correction. Indeed, correcting for the digital resolution and intended replicates improves the results for all the subsets, except for the subsets variable-reduced by means of maximum entropy. This is because M still has to be set to the number of variables for these subsets, thus grossly overestimating the degrees of freedom. However, there are still more corrections to be made if the cross validation is to produce perfect results.
The search for unintended replicates

All the subsets are examined for unintended replicates. Use of the algorithm described in the section 'The effect of replicates' identified unintended replicates in every subset. In fact, for the data set Poly the analysis showed that all the samples are replicates of other samples. Careful examination of the results suggests that there are only three really different samples in this data set. This has to do with the nature of the samples - they are polymers and subject to heterogeneity in sample preparation. The spectra of replicates in this data set are therefore more dissimilar, and this way of finding unintended replicates is sensitive to large discrepancies between the spectra of planned replicates. It is therefore likely that the sample set contains more than three different samples, but it is also likely that N should be reduced by a large number. The design of the samples is made such that the spread of the samples is quite small.

With N equal to 3 it is only possible to extract one component due to the correction term (if we use the old term), thus making a plot of the PRESS values highly uninteresting. We are of course interested in how the PRESS values behave after the second component is extracted. To illustrate that the PRESS values do indeed start to increase again if the degrees of freedom are reduced enough, Fig. 3 shows the results of the cross validation after N has been reduced to 9.
Fig. 3. PRESS values for the data set Poly, first subset, when the number of degrees of freedom is corrected for unintended replicates. The figure shows the situation with N equal to 9. The results are far better after this correction than before. The cross validation now yields two significant components.
TABLE 2

Overview of the number of objects, variables, replicates and the final assignments of values to N and M for the different classes

Data set and subset               Objects   Variables   Intended     Unintended   Value assigned   Value assigned
                                                        replicates   replicates   to N             to M
Three-component mixture, first    45        100         11           15           19               26
Three-component mixture, second   45        200         11           14           20               51
Four-component mixture, first     33        100         10           2            21               26
Four-component mixture, second    33        186         10           4            19               186
Poly, first                       88        100         62           -            9                26
Poly, second                      88        100         62           -            9                26
Poly, third                       88        190         62           -            9                190
This figure clearly shows how important it is to reduce the overestimation of the degrees of freedom in the system if the cross validation is to produce reliable results.

The final results from the search for unintended replicates in the subsets from the other data sets are presented in Table 2. If we correct N for unintended replicates, we find that the results for all the subsets, apart from the maximum-entropy variable-reduced subsets, are quite satisfactory. The reason for the failure of the cross validation for the maximum-entropy variable-reduced subsets is of course that we have defined M as the number of reduced variables in the correction term. Fig. 4 shows the results for the second subset of the three-component mixture.

Final results: use of a better correction term
This section shows the results of the cross validations when the correction term for the loss of degrees of freedom is changed from Eqn. 4a to Eqn. 5. N has been set to 9 for the subsets originating from the data set Poly. The results for the first subset of the data set Poly are portrayed in Fig. 5, and we see again that the results are good. This is the case for all the subsets, apart from the maximum-entropy variable-reduced subsets.
Fig. 4. PRESS values for the three-component mixture, second subset, after the results have been corrected for unintended replicates.
TABLE 3

Overview of the final results from the cross validation. The columns show the number of significant components achieved compared to the correct number and the number achieved with the standard method

Name of data set and subset       Wanted result   Result from the     Achieved result
                                                  standard method
Poly, first                       2               >19                 2
Poly, second                      2               >19                 2
Poly, third                       2               13                  3
Three-component mixture, first    2               >19                 2
Three-component mixture, second   2               >19                 2
Four-component mixture, first     3               14                  3
Four-component mixture, second    3               18                  4
Fig. 5. The final results of the cross validation for the data set Poly, first subset. The figure shows the PRESS values plotted against the number of extracted components after corrections for replicates and resolution have been made. Furthermore, the correction term is changed to (N - 1)M - A(N - 1 + M).
Table 3 compares the final results with the results from the standard method along with the expected results. It shows that the results are greatly improved by this way of cross validating.
CONCLUSION

The cross validation technique described in this paper performs well if one carefully examines the number of degrees of freedom in the system to be analysed. The number of degrees of freedom needs to be corrected for sample replicates as well as instrumental resolution if the method is to yield correct results. The standard correction term is replaced with one ensuring a constant loss of degrees of freedom throughout the extraction of components. The results are independent of the number of groups in the cross validation, and they are even independent of whether one groups and deletes variables or objects. The method does not seem to work perfectly on subsets preprocessed by means of the maximum-entropy criterion, due to an unavoidable overestimation of the degrees of freedom. A far better approach to variable reduction is to pick variables from the fingerprint region.

Regarding the impact that the overestimation of degrees of freedom has on the outcome of the cross validation, it is highly probable that a similar effect also takes place in SIMCA modelling [22], thus producing incorrect results there as well.
ACKNOWLEDGEMENTS

We would like to thank Jostein Toft, Department of Chemistry, University of Bergen, for willingly letting us use his spectra for the subsets Poly and the three-component mixture. Professor John van der Maas of the University of Utrecht is thanked for valuable comments on the subject of instrumental resolution. BG would like to thank the Royal Norwegian Council for Scientific and Industrial Research (NTNF) for a research grant.

REFERENCES

1 E.R. Malinowski, Statistical F-test for abstract factor analysis and target testing, Journal of Chemometrics, 3 (1988) 49-60.
2 E.R. Malinowski, Theory of error applied to pure test vectors in target factor analysis, Analytica Chimica Acta, 133 (1981) 99-101.
3 O.M. Kvalheim and Y.-Z. Liang, Heuristic evolving latent projections: resolving two-way multicomponent data. 1. Selectivity, latent-projective graph, datascope, local rank, and unique resolution, Analytical Chemistry, 64 (1992) 936-946.
4 Y.-Z. Liang, O.M. Kvalheim, H.R. Keller, D.L. Massart, P. Kiechle and F. Erni, Heuristic evolving latent projections: resolving two-way multicomponent data. 2. Detection and resolution of minor constituents, Analytical Chemistry, 64 (1992) 946-953.
5 F. Mosteller and D.L. Wallace, Inference in an authorship problem, Journal of the American Statistical Association, 58 (1963) 275-309.
6 M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, B, 36 (1974) 111-133.
7 S. Wold, Cross-validatory estimation of the number of components in factor and principal components models, Technometrics, 20 (1978) 397-405.
8 J.B.M. Droge and H.A. van't Klooster, An evaluation of SIMCA. Part 1. The reliability of the SIMCA pattern recognition method for a varying number of objects and features, Journal of Chemometrics, 1 (1987) 221-230.
9 J.B.M. Droge, W.J. Rinsma, H.A. van't Klooster, A.C. Tas and J. van der Greef, An evaluation of SIMCA. Part 2. Classification of pyrolysis mass spectra of Pseudomonas and Serratia bacteria by pattern recognition using the SIMCA classifier, Journal of Chemometrics, 1 (1987) 231-241.
10 M.J.P. Gerritsen, N.M. Faber, M. van Rijn, B.G.M. Vandeginste and G. Kateman, Realistic simulations of high-performance liquid chromatographic-ultraviolet data for the evaluation of multivariate techniques, Chemometrics and Intelligent Laboratory Systems, 12 (1992) 257-268.
11 S. Wold and M. Sjöström, Letter to the Editor. Comments on a recent evaluation of the SIMCA method, Journal of Chemometrics, 1 (1987) 243-245.
12 A. Lorber, Error propagation and figures of merit for quantification by solving matrix equations, Analytical Chemistry, 58 (1986) 1167-1172.
13 E. Sletten, O.M. Kvalheim, S. Kruse, M. Farstad and O. Soreide, Detection of malignant tumors by multivariate analysis of proton magnetic resonance spectra of serum, European Journal of Cancer, 26 (1990) 615-618.
14 S. Kotz and N.L. Johnson (Editors), Encyclopedia of Statistical Sciences, Wiley, New York, 1982, pp. 293-294.
15 D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman, Chemometrics: A Textbook, Elsevier, Amsterdam, 1988.
16 J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Wiley, New York, 1984.
17 P.R. Griffiths, Chemical Infrared Fourier Transform Spectroscopy, Wiley, New York, 1975.
18 J. Toft, O.M. Kvalheim, T.V. Karstang, A.A. Christy, K. Kleveland and A. Henriksen, Analysis of nontransparent polymers: mixture design, second-derivative attenuated total internal reflectance FT-IR, and multivariate calibration, Applied Spectroscopy, 46 (1992) 1002-1008.
19 O.M. Kvalheim and T.V. Karstang, A general-purpose program for multivariate data analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 235-237.
20 J.C. Davis, Statistics and Data Analysis in Geology, Wiley, New York, 1973.
21 W.E. Full, R. Ehrlich and S.K. Kennedy, Optimal configuration and information content of sets of frequency distributions, Journal of Sedimentary Petrology, 54 (1984) 117-126.
22 S. Wold, Pattern recognition by means of disjoint principal components models, Pattern Recognition, 8 (1976) 127-139.