Analytica Chimica Acta, 161 (1984) 125-134
Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands
THE IMPROVEMENT OF SIMCA CLASSIFICATION BY USING KERNEL DENSITY ESTIMATION
Part 2. Practical Evaluation of SIMCA, ALLOC and CLASSY on Three Data Sets
HILKO VAN DER VOET* and DURK A. DOORNBOS
Research Group Optimization, Laboratory for Pharmaceutical and Analytical Chemistry, State University of Groningen, A. Deusinghlaan 2, 9713 AW Groningen (The Netherlands)
(Received 5th December 1983)
SUMMARY

The performance of the new probabilistic classification method CLASSY is evaluated on three different data sets, together with its predecessors SIMCA and ALLOC. The improvement made over ALLOC is only marginal, whereas CLASSY shows better predictive ability and greater reliability than SIMCA in most cases.
The evaluation of pattern recognition techniques was considered theoretically in Part 1 of this series [1]. The present paper is concerned with what information is provided by the selected measures for predictive ability (the number of errors, NE, and the quadratic score, Q31), sharpness (Q2) and reliability (Q5) about the SIMCA method (made probabilistic as described in Part 1) and the ALLOC and CLASSY methods. Because the optimal dimensionality for the principal component (PC) class models in SIMCA and CLASSY was unknown, and because the same optimum was not expected for both methods, all possible values of class dimensionality A were examined systematically.

DATA AND COMPUTER PROGRAMS
The pattern recognition methods SIMCA, ALLOC and CLASSY were evaluated on three data sets.

Data sets
Iris data. The well known iris data from Fisher have been analysed by
several authors [2-4]. The data set consists of measurements made on flowers from three species of iris: Iris setosa, Iris versicolor and Iris virginica. Iris setosa was very easily distinguished from the other two by all methods, so only the latter two species were used here. There are four variables: sepal length, sepal width, petal length and petal width. Each class contains 50 individuals, which were divided randomly into two groups, a training and a test
group of 25 objects each. The data and the division into groups were taken from the paper by Wold [2]. After the training and test groups had been exchanged and evaluated again, the results were averaged, so that the present conclusions are based on an evaluation set of 100 objects, analysed by the leave-half-out method. Three kinds of data scaling were applied to the iris data: autoscaling, class scaling and no scaling. In autoscaling, which is the most popular kind of scaling in multivariate analysis, all variables are scaled over the entire training set to mean 0 and variance 1. In class scaling, for which an improvement of the classification results with SIMCA is claimed [5, 6], the difference from classical autoscaling is that each training class is scaled separately; this method prevents a large variance caused by between-class differences from being "scaled away". No scaling at all is a possible option for this data set, because the four variables are measured in the same units and have similar variances.

Wine data. In the wine data set, twenty chemical and physical variables were measured on 40 samples of French wine, 21 from the Bordeaux and 19 from the Bourgogne region. This data set has been analysed by several pattern recognition techniques [7] and in general a good discrimination was obtained. Evaluation was done with the leave-one-out method. This data set has an entirely different structure from the previous one. The object-to-variable ratio is only 2 (instead of 25 for the iris data) and the variables are not measured in the same units, but are very different from each other (physical properties such as absorbance, and concentrations of constituents such as organic acids and ions). This means that data scaling is essential, and class scaling was applied here.

Wine data with headspace analysis. The same 40 wine samples were additionally analysed by head-space gas-liquid chromatography (h.s./g.l.c.).
Eleven peak heights were measured in the chromatogram and used as variables for discrimination. Again the data were class-scaled and the leave-one-out method was used for evaluation.

Programs
The computer programs used for the SIMCA and ALLOC methods were SIMCA-3B, written in BASIC [5], and ALLOC, written in FORTRAN-IV [8]. For the CLASSY method, some of the SIMCA-3B programs were modified and a further BASIC program for kernel density estimation was attached. The results were evaluated by PASCAL and BASIC programs especially written for this purpose. All computations were performed on the CDC CYBER 170/760 computer of Groningen State University.

RESULTS AND DISCUSSION
Iris data
The SIMCA and CLASSY programs were evaluated with three kinds of scaling and for A = 1, 2 and 3 (A is the number of components used in both
class models). For CLASSY, A = 4 was also investigated. The ALLOC method is invariant under scaling. In addition to a full analysis with all four variables, ALLOC was also evaluated for 1, 2 or 3 variables. The selection order was obtained by the inbuilt procedure in the ALLOC package, which is based on non-error rates. In previous work [7], it was found that this was a good variable selection procedure in combination with ALLOC classification. The results are shown in Fig. 1. It can be seen that the (non)error rate is very uninformative. Almost all methods misclassify between 2 and 8 objects out of 100; SIMCA (A = 3) is worse, with 14-21 misclassifications. For each object, there are only 2 possible results (right or wrong classification), so the results can be considered as a sample from a binomial distribution. For a sample size of 100, the 95% confidence interval for the error rate is then 0.01-0.06 when 2 misclassifications are found, and 0.04-0.15 when 8 misclassifications occur. Therefore, only very large differences in the sample (non)error rates (e.g., that between SIMCA (A = 3) and the other methods) can be considered significant. In fact, it is easily shown that a difference of 5 or less in the number of misclassified objects is never significant at the 95% confidence level, regardless of the size of the evaluation set [9]. A difference of 6, 7 or 8 is significant only when all misclassifications of the best method are also made by the worst method. The conclusion must be that the (non)error rate fails to distinguish clearly between the methods investigated in the evaluation of this and, in fact, most other data sets.

The quadratic score, Q31, permits a more sensitive evaluation of the discriminatory ability. As can be seen from Fig. 1(b), both SIMCA and CLASSY show an optimum when the number of components in the class models is varied. These results confirm the conclusion of Wold [2] that the optimum dimensionality for SIMCA is maximally 2.
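The binomial confidence limits quoted above can be checked with a short script. This is an illustrative sketch using the exact (Clopper-Pearson) interval from scipy; the values in the paper were taken from the Geigy tables [9] and are rounded, so they may have been computed by a slightly different method.

```python
from scipy.stats import binomtest

# Treat the number of misclassifications as a binomial sample of size 100
# and compute exact 95% confidence intervals for the error rate.
for errors in (2, 8):
    ci = binomtest(errors, n=100).proportion_ci(confidence_level=0.95)
    print(f"{errors}/100 errors: error rate CI {ci.low:.3f} - {ci.high:.3f}")
```

The wide, overlapping intervals for 2 and 8 errors illustrate why small differences in the error count carry so little evidence.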
This is, however, not the result of any supposed underlying simplicity in the data matrix, for in that case it would be reasonable to expect the same Aopt for CLASSY. But the maximum score for CLASSY is reached for A = 3, and it is difficult to believe that the inclusion of mere errors in the class models would lead to better performance. It can be seen that CLASSY always scores better than SIMCA, and this gap tends to widen as A increases. This may be interpreted as an inability of SIMCA to "look" inside the A-dimensional hyperboxes; the method seems to make incomplete use of the information that is available in the data. The significance of the differences found in Q31 can be tested by pairing techniques [10]. As explained before, the quadratic score is just the average of the individual quadratic scores for each object evaluated. Two classification methods A and B can be compared by computing the difference dQ31 = Q31(B) - Q31(A) for each object evaluated. The null hypothesis that the mean dQ31 = 0 (no real difference between A and B) can then be tested with a t-test. The results of these pairing comparisons show that the better score for CLASSY compared to SIMCA is significant (at the 95% level) when 2 or more components are used in both methods.
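The pairing procedure described above can be sketched as follows. The per-object scores here are synthetic stand-ins (not the paper's data), and scipy's one-sample t-test on the differences plays the role of the paired t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-object quadratic scores for two classifiers A and B
# (illustrative only; not the scores from the evaluation in the text).
q_A = np.clip(rng.normal(0.88, 0.10, size=100), 0.0, 1.0)
q_B = np.clip(q_A + rng.normal(0.025, 0.05, size=100), 0.0, 1.0)

# Paired comparison: test H0 that the mean per-object difference is zero.
diff = q_B - q_A
t, p = stats.ttest_1samp(diff, 0.0)
print(f"mean dQ31 = {diff.mean():.3f}, t = {t:.2f}, p = {p:.4f}")
```

Equivalently, `stats.ttest_rel(q_B, q_A)` gives the same statistic; the point is that the test is applied to per-object differences, not to the two averaged scores.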
Fig. 1. Evaluation of the Iris data for a varying number of components (A). Solid lines and symbols: CLASSY. Dashed lines and open symbols: SIMCA. (■, □) no scaling; (●, ○) class scaling; (▲, △) autoscaling. (a) Number of errors (out of 100); (b) quadratic score; (c) sharpness; (d) reliability score.
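A minimal sketch may clarify what varying A means in the disjoint per-class principal-component modelling that underlies both SIMCA and CLASSY. The two synthetic 4-variable classes and the residual-distance assignment rule below are simplified illustrations, not the SIMCA-3B implementation (which also involves residual standard deviations and class tolerance intervals):

```python
import numpy as np

def fit_class_model(X, A):
    """Fit an A-component PC model (mean + loadings) to one training class."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:A]  # rows of Vt[:A] span the A-dimensional inside model space

def residual_ss(x, model):
    """Squared residual distance from object x to a class model."""
    mean, P = model
    centred = x - mean
    r = centred - P.T @ (P @ centred)  # part of x outside the model space
    return float(r @ r)

rng = np.random.default_rng(1)
# Two synthetic 4-variable classes, rough stand-ins for the two iris species
class1 = rng.normal([5.9, 2.8, 4.3, 1.3], 0.3, size=(25, 4))
class2 = rng.normal([6.6, 3.0, 5.6, 2.0], 0.3, size=(25, 4))

A = 2  # class dimensionality, as varied in Fig. 1
models = [fit_class_model(class1, A), fit_class_model(class2, A)]

# Assign a test object to the class whose model leaves the smaller residual
x = np.array([6.5, 3.0, 5.5, 2.0])
residuals = [residual_ss(x, m) for m in models]
assigned = int(np.argmin(residuals))
print("assigned to class", assigned + 1)
```

CLASSY differs in what happens inside the model space: instead of treating the A-dimensional projection as a featureless "hyperbox", it places a kernel density estimate there.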
In the paired comparisons (see Table 1) between the best SIMCA method (with A = 2), the best CLASSY method (with A = 3) and ALLOC, CLASSY had a significantly higher score than SIMCA for all three types of scaling. The higher score for ALLOC compared to SIMCA is significant only in the case of autoscaling. The improvement that CLASSY makes over ALLOC, finally, is only marginal (not significant).

TABLE 1

Paired comparisons between the best SIMCA, CLASSY and ALLOC models in the evaluation of the Iris data set

Method A                   Method B                    Mean dQ31 =        Difference between
                                                       Q31(B) - Q31(A)    A and B

SIMCA (A = 2) no scaling   CLASSY (A = 3) no scaling   0.025              significant (p < 0.001)
SIMCA (A = 2) no scaling   ALLOC (4 variables)         0.013              not signif. (p = 0.32)
ALLOC (4 variables)        CLASSY (A = 3) no scaling   0.012              not signif. (p = 0.10)

One might object that ALLOC was somewhat handicapped because SIMCA and CLASSY were optimized with respect to the number of principal components to use, which was of course impossible for ALLOC. To give
ALLOC an equal chance, it was also evaluated when the number of variables was reduced by variable selection. The best result was, however, obtained by using all four variables, so that the optimal ALLOC dimensionality can be regarded as 4.

The reliability analysis of the results is discussed only briefly for this data set. In most cases, reliable predictions were obtained (Fig. 1(d)). Only SIMCA (A = 1) with all three kinds of scaling, SIMCA (A = 2) without scaling and CLASSY (A = 1) with autoscaling were diffident. It may be noted that this includes the best SIMCA method in terms of predictive ability.

Wine data
The structure of the Wine data is quite different from that of the Iris data. In fact, the number of variables (20) is even greater than the number of objects in the Bourgogne class (which is 19 and, with the leave-one-out method used, often only 18). Therefore, the number of principal components which could be extracted from this class is theoretically 18 but, for programming reasons, is only 17 in practice. The results obtained after SIMCA and CLASSY analysis, using 1-17-component models for both classes, are shown in Fig. 2. Again the number of errors (Fig. 2(a)) gives indications, but almost no proof, about the superiority of one method over the other; a difference of at least 6 between the numbers of misclassified objects is the minimum requirement for detecting statistical significance. The quadratic scores (Fig. 2(b)) confirm the first impression given by the
Fig. 2. Evaluation of the Wine data for a varying number of components (A). Solid lines and symbols: CLASSY. Dashed lines and open symbols: SIMCA. (a) Number of errors (out of 40); (b) quadratic score; (c) sharpness; (d) reliability score.
error rates, but more definite conclusions can be drawn. Roughly, there are three regions in the graph. In the first region (A = 2 to 5), SIMCA scores better than CLASSY. SIMCA reaches its overall maximum score of 0.893 at A = 5. However, the difference found between the two methods is never significant. In the second region (A = 6 to 10), the SIMCA scores remain at about the same level, but the CLASSY scores increase rapidly, reaching the nearly maximum score of 0.928 at A = 6 and the overall maximum score of 0.932 at A = 10. For A = 6, 8 and 10, CLASSY outperforms SIMCA significantly (α = 0.05). In the third region (A = 11 to 17), the discriminative ability deteriorates for both methods. As the number of components approaches the number of objects in each class, the scores become very unstable, which can lead to wild variations when one more component is added. Surprisingly, the SIMCA result for A = 17 is significantly better than the CLASSY score.

The optimal class dimensionality Aopt is not as clearly seen as for the Iris data. The Q31 graphs are rather flat, and a prudent conclusion might be that Aopt for SIMCA is somewhere in the region 2-10, while for CLASSY it lies in the region 6-11. But the general form of the graphs confirms the conclusion already drawn from the Iris data, i.e., with the same data set CLASSY can handle more principal components than can SIMCA, and thereby attains a better discriminative performance. A paired comparison between the best SIMCA (A = 5) and the best CLASSY (A = 10) method was attempted, but the distribution of the differences dQ31 was so skewed that a valid t-test was impossible. A nonparametric alternative, the sign test, was also not useful, because this is a test for the median rather than for the mean score, and Q31 is meaningful as a measure of discrimination only when it is averaged over a number of objects [10].
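A skewness check of the kind that ruled out the t-test here can be sketched as follows. The differences are synthetic, and the cut-off of 1.0 is an arbitrary illustration, not the authors' criterion:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic per-object score differences: most near zero, a few large
# positive outliers; the skewed shape that invalidates a paired t-test.
diff = np.concatenate([rng.normal(0.0, 0.01, 35), rng.uniform(0.3, 0.9, 5)])

skew = stats.skew(diff)
print(f"skewness = {skew:.2f}")

# Only rely on the t-test when the differences look roughly symmetric.
if abs(skew) > 1.0:
    print("distribution too skewed for a valid t-test")
```

This pattern arises naturally when one method assigns near-certain probabilities to most objects and the other differs only on a handful of them.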
The results of some comparisons between best and near-best methods that were possible with a t-test are shown in Table 2. The above-mentioned skewness arises in any comparison between methods with very different dimensionalities for their IMS (inside model space). This may be understood by looking at the sharpness of the predictions. Figure 2(c) shows how the sharpness increases almost linearly with A. For A = 17, almost all probabilities are 1.00 or 0.00. There is no obvious explanation for this relationship, but it is clear that this increasing self-confidence of a method is justified only as long as it is accompanied by an increase in discriminative ability.

TABLE 2

Paired comparisons between the best and near-best SIMCA and CLASSY models in the evaluation of the Wine data

Method A        Method B          Mean dQ31 =        Difference between
                                  Q31(B) - Q31(A)    A and B

SIMCA (A = 5)   CLASSY (A = 6)    0.035              significant (p = 0.042)
SIMCA (A = 9)   CLASSY (A = 10)   0.047              significant (p = 0.006)

The reliability score Q5 was introduced as a measure for the trustworthiness of the probability values (see above). The numerator of Q5 is Q31 - 0.5Q2 - 0.5; it measures whether discriminative ability (Q31) and sharpness (Q2) are in concordance with each other. In Fig. 2(d), Q5 is plotted for the SIMCA and CLASSY evaluations of the Wine data. The most remarkable feature is the enormous over-confidence of both methods when A gets too large (e.g., >10). In this region, the predictions are very sharp, but this is not matched by a correspondingly low error rate. In contrast, SIMCA shows slightly, but significantly, diffident behaviour for small values of A. If the output of the selected method is to be trusted as giving real probabilities, then the choice is between CLASSY in the range A = 1-10 and SIMCA in the range A = 6-10. Again, the SIMCA method with the best discrimination (A = 5) is not reliable.

The results obtained with ALLOC were surprising. It was thought likely that the ALLOC results would suffer from the large number of variables, as has been reported [11]. However, the results with ALLOC, using all twenty variables, were nearly as good as those with the optimal CLASSY method (with A = 10); there were only 3 erroneous classifications and the quadratic score was 0.930 (vs. 3 errors and Q31 = 0.932 for CLASSY). The only point of criticism was its slightly over-confident behaviour (Q5 = -2.32). The authors of ALLOC claim that a leave-one-out method is automatically provided in their program by the nature of the kernel density estimation procedure used [8]. This is, however, only partly true, because the smoothness parameter (kernel function width) is computed on the basis of all objects in the class, including the one under classification. There is still some over-optimistic bias left in the results produced by ALLOC. The above results were obtained with a real leave-one-out method, which simply means that the program was run 40 times.
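The distinction between ALLOC's built-in "partial" leave-one-out and the real leave-one-out used above can be sketched with a one-variable kernel density estimate. Here scipy's gaussian_kde with Scott's rule merely stands in for ALLOC's smoothness parameter; in the partial variant the bandwidth factor is fixed from all objects, including the one under evaluation:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=60)  # one synthetic class, one variable

# Partial leave-one-out: the bandwidth factor is derived from ALL objects,
# including the one being evaluated.
bw_all = gaussian_kde(X).factor
partial = [gaussian_kde(np.delete(X, i), bw_method=bw_all)(X[i])[0]
           for i in range(len(X))]

# True leave-one-out: the bandwidth is recomputed on each training set with
# the evaluated object excluded (the "run the program n times" approach).
true = [gaussian_kde(np.delete(X, i))(X[i])[0] for i in range(len(X))]

print(f"mean density, partial LOO: {np.mean(partial):.4f}")
print(f"mean density, true LOO:    {np.mean(true):.4f}")
```

With Scott's rule the two bandwidths differ only marginally; the point of the sketch is the structure of the two loops, not the size of the bias, which depends on how strongly the smoothness parameter reacts to the evaluated object.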
For comparison, the partial leave-one-out method yielded Q31 = 0.950.

Wine data with head-space gas chromatography
These data are discussed to give a complete picture, for the pattern is much less clear than for the iris and wine data and there are several effects for which there is no immediate explanation. The wine data set with the h.s./g.l.c. data added has 11 variables, so that the evaluation comprised runs with A varied between 1 and 10 for SIMCA (a programme restriction) and between 1 and 11 for CLASSY. ALLOC was evaluated with all 11 variables. The results are shown in Fig. 3. The error rate is again uninformative (Fig. 3(a)): only SIMCA with A = 9 or 10 can be disqualified for bad discrimination. The SIMCA method exhibits the normal pattern for the quadratic scores (Fig. 3(b)); the scores rise to a maximum (0.943) in the early phase (A = 2-4) and then decrease when too many components are used. For CLASSY, however, there is no such pattern: the scores form a quite irregular graph with (sub)maxima at A = 1, 5 and 10, and (sub)minima at A = 3 and 8. The
Fig. 3. Evaluation of the wine data with h.s./g.l.c. for a varying number of components (A). Solid lines and symbols: CLASSY. Dashed lines and open symbols: SIMCA. (a) Number of errors (out of 40); (b) quadratic score; (c) sharpness; (d) reliability score.
general maximum Q31 is 0.975 at A = 10. Even the difference between the highest and the lowest CLASSY score is not significant; it seems that the discriminative ability of CLASSY is not much influenced by the class dimensionality. In contrast, the reliability of the CLASSY output is greatly affected by the choice of A (Fig. 3(d)). The only reliable methods are SIMCA with A = 1-6 and CLASSY with A = 1, 2 or 4 (also with A = 3, 5 or 10 if slight over-confidence is allowed). The results from ALLOC are comparable with those of CLASSY with A = 11 (as might be expected, for the only difference is a rotation of the axes); there were 2 errors, the quadratic score was 0.949 and the reliability measure was 4.24 (over-confident). The general conclusion for this data set is that there are no really important differences between SIMCA, ALLOC and CLASSY.

Some restrictions
Before general conclusions can be made, the restrictions of this study and of this kind of research in general must be considered. This paper has been concerned with three pattern recognition methods which have been evaluated by using different data sets, i.e., the approach is method-orientated, not data-orientated. The problem of how to convert real-life problems into a manageable form for the application of pattern recognition has not been discussed, although this is certainly an important step, perhaps the most important, in the overall classification process. There are of course also practical limitations with such evaluations. Here, the effect of varying A, the class dimensionality, was studied, but A was not varied for each class separately, though this is readily possible in principle with both SIMCA and CLASSY. The above results are not intended to suggest that such evaluations should be used to determine the optimal class dimensionality Aopt. That would involve too much work for any practical application. The choice of varying A systematically resulted from the lack of a simple, reliable method for finding Aopt, and this remains a big problem. Many methods have been advocated for this purpose [12-14], including a cross-validatory method which is included in the SIMCA package, but our experience with most of these methods is not very encouraging. Indeed, it was shown above that Aopt may be quite different depending on which classification method is used. Thus the dimensionality of a data matrix (in a practical rather than mathematical sense) may be rather an elusive concept, in the same way as the "optimal" variable selection.

Conclusions
The two distinct aspects of classification, predictive ability and reliability, must be considered. With respect to predictive ability, both theoretical studies [1] and practical evaluation have shown that the performance of SIMCA can be improved by applying kernel density estimation in the IMS. More information can then be extracted from the data. It appears that the optimal class dimensionality is in general higher for CLASSY than for SIMCA. The third method investigated, ALLOC, also performed very well.
Although in all cases it was possible to construct a CLASSY method with higher predictive ability and better reliability, much more research is needed to assess the possible superiority of one or the other method.

A general result concerning the reliability of the methods is the increasing sharpness as the class dimensionalities increase. This increasing sharpness is, however, only partly justified by better agreement between prediction and outcome, so that the reliability score of SIMCA and CLASSY generally decreases as A increases. This is acceptable if the method was initially diffident, but the choice of A should not be so high that over-confident predictions are obtained. The CLASSY method is almost always more self-confident than the corresponding SIMCA method. For the Iris and Wine data, this produces better reliability at the class dimensionality which gives the best discrimination.

We thank Jan Hemel for many helpful discussions on the subject.
REFERENCES

1 H. van der Voet and D. A. Doornbos, Anal. Chim. Acta, 161 (1984) 115.
2 S. Wold, Pattern Recognition, 8 (1976) 127.
3 M. Sjöström and B. R. Kowalski, Anal. Chim. Acta, 112 (1979) 11.
4 D. Coomans, D. L. Massart, I. Broeckaert and A. Tassin, Anal. Chim. Acta, 133 (1981) 215.
5 C. Albano, G. Blomqvist, D. Coomans, W. J. Dunn III, U. Edlund, B. Eliasson, S. Hellberg, E. Johansson, D. Johnels, B. Norden, M. Sjöström, B. Söderström, H. Wold and S. Wold, in A. Höskuldsson and K. Esbensen (Eds.), Proc. Symp. Applied Statistics, NEUCC, RECAU and RECKU in cooperation with the Danish Society of Theoretical Statistics, Copenhagen, 1981.
6 M. P. Derde, D. Coomans and D. L. Massart, Anal. Chim. Acta, 141 (1982) 187.
7 H. van der Voet, D. A. Doornbos, M. Meems and G. van de Haar, Anal. Chim. Acta, 160 (1984) 159.
8 J. Hermans and J. D. F. Habbema, Manual for the ALLOC Discriminant Analysis Programs, CRI Univ. Leiden, 1976.
9 Wissenschaftliche Tabellen, 7th edn., Geigy, Basel, 1968, p. 85.
10 J. Hilden and J. D. F. Habbema, in A. Alperovitch, F. T. De Dombal and F. Gremy (Eds.), Evaluation of Efficacy of Medical Action, North-Holland, Amsterdam, 1979, p. 123.
11 D. Coomans, M. Derde and D. L. Massart, Anal. Chim. Acta, 133 (1981) 241.
12 S. Wold, Technometrics, 20 (1978) 397.
13 E. R. Malinowski and D. G. Howery, Factor Analysis in Chemistry, Wiley, New York, 1980.
14 H. T. Eastment and W. J. Krzanowski, Technometrics, 24 (1982) 73.