Original Research Paper
Chemometrics and Intelligent Laboratory Systems, 11 (1991) 75-88 Elsevier Science Publishers B.V., Amsterdam
Classification of essential mint oils of different geographic origin by applying pattern recognition methods to gas chromatographic data
E. Marengo, C. Baiocchi *, M.C. Gennaro and P.L. Bertolo
Dipartimento di Chimica Analitica, Università di Torino, Via P. Giuria, 5, 10125 Torino (Italy)
S. Lanteri
Istituto di Analisi e Tecnologie Farmaceutiche ed Alimentari, Via Brigata Salerno, 16147 Genova (Italy)
W. Garrone
Laboratorio Ricerche Ferrero s.p.a., Piazzale P. Ferrero, 1, 12051 Alba (Italy)
(Received 15 June 1990; accepted 26 November 1990)
Abstract
Marengo, E., Baiocchi, C., Gennaro, M.C., Bertolo, P.L., Lanteri, S. and Garrone, W., 1991. Classification of essential mint oils of different geographic origin by applying pattern recognition methods to gas chromatographic data. Chemometrics and Intelligent Laboratory Systems, 11: 75-88. Pattern recognition techniques were applied to the classification of essential mint oils using their gas chromatographic data. A study has been completed on 59 samples of three different geographic origins: U.S.A., France and Italy. Principal component analysis provided new variables for an effective classification of the samples according to the different areas of origin. The pattern of the result was further analysed by means of a fuzzy clustering algorithm which permitted the quantification of the differences between the three classes. The chemical information contained in the gas chromatographic data was sufficient to characterize the geographic origin of the samples.
INTRODUCTION
The problem of the classification of natural products on the basis of their chemical composition is of increasing interest. The refined analytical instruments now available can provide a large amount of experimental data on both the qualitative and the quantitative composition of real samples. Unfortunately, it is usually difficult to use this large amount of information directly as classification criteria because of the complexity of the multivariate space which must be considered. In the case of the characterization of aromatic natural products, the use of gas chromatographic techniques coupled with multivariate chemometrical treatments has already proved to be effective in studies on olive oil [1], wine [2,3], coffee [4] and tea [5].
0169-7439/91/$03.50 © 1991 - Elsevier Science Publishers B.V.
No report of statistical treatments concerning the classification of mints is currently available, probably because of the complexity of the composition of their essential oils. In this paper the classification of flavours of mints of different geographic origin has been performed. The gas chromatographic data were treated by means of different pattern recognition techniques. Our objective is to discover the most suitable chemical variables in order to perform an effective geographic classification of samples of unknown origin and, at the same time, to assess the genuineness of the product or, on the contrary, the possibility of mixing with less valuable products.
EXPERIMENTAL AND METHODS
Instrumental
A Perkin-Elmer SIGMA 2000 gas chromatograph equipped with a flame ionization detector was used for direct analysis of the mint oils. No particular treatment of the samples was required. All the chemometrical calculations were performed on an IBM AT extended personal computer, supported by an INTEL 80286 mathematical co-processor.
Experimental
The gas chromatograms recorded on the investigated oil samples show about 90 peaks (Fig. 1). Twenty-eight of them were identified and the corresponding components are listed in the caption of Fig. 1. The measured percentage areas of these peaks were used as variables in the characterization study. The sum of the areas of peaks 22 and 23 was taken as a single variable, since in several chromatograms the two peaks could not be resolved, leading to a total of 27 variables. Mint flavours originating from France (F), U.S.A. (A) and from five different Italian suppliers (1-5) were considered. The Italian ones were in every case treated as a single group, devoting most attention to the classification among French, American and Italian samples.
Fig. 1. List of peaks identified in a typical gas chromatogram of a mint oil sample. Peaks: 1 = α-thujene, 2 = α-pinene, 3 = camphene, 4 = β-pinene, 5 = sabinene, 6 = myrcene, 7 = α-phellandrene, 8 = α-terpinene, 9 = a-limonene, 10 = 1,8-cineole, 11 = γ-terpinene, 12 = terpinolene, 13 = n-hexanol, 14 = 3-octanol, 15 = 1-octen-3-ol, 16 = hydrated sabinene, 17 = menthone, 18 = menthofuran, 19 = isomenthone, 20 = menthyl acetate, 21 = neomenthol, 22 = 4-terpineol, 23 = β-caryophyllene, 24 = menthol, 25 = pulegone, 26 = a-isomenthol, 27 = terpineole, 28 = germacrene. Peaks 22 and 23 in some gas chromatograms are not resolved.
Chemometrics
Principal component analysis (PCA) is a well-known technique which provides significant insight into the structure of any numerical data matrix [6,7]. It generates a set of new orthogonal variables, the principal components (PCs), each a linear combination of the original ones, so that the maximum possible amount of variance contained in the starting data matrix is concentrated in as few PCs as possible. These new variables can be used in place of the original ones for successive treatment. The coefficients of the original variables defining each PC are called 'loadings', while the projections of the experimental points on the new variables are called 'scores'. Since PCA effectively concentrates the variance of the data matrix in a smaller number of new variables, it is suitable for reducing the dimensionality of large data matrices by eliminating the non-significant PCs, which facilitates successive treatments on the reduced data. PCA was computed through the diagonalization of the variance-covariance matrix by means of the Jacobi algorithm. The data were autoscaled before PC computation in order to assign the same numerical weight to each variable.
Linear discriminant analysis (LDA) [8] is a multivariate classification technique based on the hypothesis of a normal distribution in each category, with different barycenters but the same variance-covariance matrix for all categories. Here LDA has been used as a display technique. First, the variable-to-variable ratios have been added to the original variables. In fact, in many cases of natural product identity problems, ratios have been shown to have a discriminating power higher than that of the original variables. Second, some variables or ratios have been selected by stepwise LDA, i.e. those that are the most useful in the separation of the categories. Then the canonical variables of LDA were computed: these are the directions (generally non-orthogonal) in the space of the original variables that maximize the ratio between the intercategory variance and the intracategory variance.
A hierarchical agglomerative clustering method, the Ward method [9], has been used; it produces a dendrogram of the similarities among samples (objects).
Fuzzy clustering (FC) is an unsupervised clustering method which makes use of fuzzy function theory [10]. It allows the generation of fuzzy partitions and prototypes for any numerical data [11,12]. These partitions are useful in order to corroborate a known data substructure or to suggest a likely substructure in unexplored data. The FC calculations are based on the minimization of the generalized least-squared errors functional:
$$ J = \sum_{k=1}^{N} \sum_{i=1}^{c} (u_{ik})^{m} \, d_{ik,A}^{2} $$
where $u_{ik}$ is the membership score for the kth point and ith cluster, the sums running over the N objects and the c clusters; and
$$ d_{ik,A}^{2} = \| y_k - v_i \|_{A}^{2} = (y_k - v_i)^{T} A \, (y_k - v_i) $$
where T indicates the transpose of the vector; $v_i$ is the vector of the coordinates of the centroid of the ith cluster; $y_k$ is the vector of the coordinates of the kth object; A is the matrix which defines the metric of the space considered; $d_{ik,A}^{2}$ is the squared distance between the kth object and the centroid of the ith cluster in norm A.
The parameters influencing the FC calculations are:
- A, the norm (Euclidean, diagonal or Mahalanobis), which corresponds to different definitions of the distance between points and to different shapes of the clusters (hyperspherical or hyperellipsoidal);
- m, the exponent of the weighting coefficients, which essentially controls the sensitivity to noise and is related to the hardness or softness of the classification; increasing m from 1 tends to degrade the memberships towards fuzzier states;
- c, the number of allowed clusters.
Through the coefficients u it is possible to
define the entropy of the partition, h, which is considered to be related to the quality of the partition:
$$ h = -\frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{c} u_{ik} \log_a (u_{ik}) $$
where base a depends on the chosen norm A. Lower values of h correspond to more probable partitions. The knowledge of the u coefficients for the best partition allows one to perform the classification of the experimental points (objects). Thanks to the fuzzy function algorithm, samples corresponding to hybrids can also be treated. Because of its flexibility, FC seems to be a good method for pattern recognition in our case, where some samples can correspond to mixtures of the principal classes. We performed the FC calculations by means of the FCMEAN computer program [10].
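The fuzzy clustering scheme described above can be illustrated with a minimal sketch. It is not the FCMEAN program: it assumes the Euclidean norm (A equal to the identity matrix), a natural logarithm for the entropy, a random initial partition and m > 1, and the function names and the six two-dimensional points are hypothetical.

```python
import math
import random


def fuzzy_c_means(points, c, m=1.5, max_iter=300, tol=1e-8, seed=0):
    """Fuzzy c-means with the Euclidean norm (A = identity), assuming m > 1.

    Returns (u, centroids), where u[i][k] is the membership of object k
    in cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalised so each object's scores sum to 1.
    u = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        total = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= total
    centroids = []
    for _ in range(max_iter):
        # Centroid of cluster i: mean of the objects weighted by u_ik ** m.
        centroids = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            sw = sum(w)
            centroids.append([sum(w[k] * points[k][d] for k in range(n)) / sw
                              for d in range(dim)])
        # Membership update from the squared distances d2[i] to each centroid.
        shift = 0.0
        for k in range(n):
            d2 = [sum((points[k][d] - centroids[i][d]) ** 2 for d in range(dim))
                  for i in range(c)]
            if min(d2) == 0.0:
                # Object coincides with a centroid: share membership among those.
                hits = [1.0 if x == 0.0 else 0.0 for x in d2]
                new = [x / sum(hits) for x in hits]
            else:
                inv = [x ** (-1.0 / (m - 1.0)) for x in d2]
                new = [x / sum(inv) for x in inv]
            for i in range(c):
                shift = max(shift, abs(new[i] - u[i][k]))
                u[i][k] = new[i]
        if shift < tol:
            break
    return u, centroids


def partition_entropy(u):
    """h = -(1/N) * sum_k sum_i u_ik * ln(u_ik), taking 0 * ln(0) as 0."""
    n = len(u[0])
    return -sum(x * math.log(x) for row in u for x in row if x > 0.0) / n


# Hypothetical two-dimensional scores forming two compact groups.
demo_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
demo_u, demo_centroids = fuzzy_c_means(demo_points, c=2, m=1.5)
demo_h = partition_entropy(demo_u)
```

Running the same function with several values of c and comparing the resulting entropies mirrors the way the number of clusters is judged in the text; other norms would require replacing the squared Euclidean distance with the quadratic form in A.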
RESULTS AND DISCUSSION
The variance explained by the first six PCs after a PCA computation on the 59 samples and 27 variables is listed in Table 1. In order to perform a preliminary model refinement through the elimination of non-significant discriminating variables, we calculated the residual variance contained in each variable after the first two PCs, remembering that variables whose residual variance is greater than 0.6 (after the last significant PC) can usually be eliminated. From the analysis of Table 2, in which the residual variances are listed, we decided to eliminate variables 3 and 7.
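A minimal sketch of how such residual variances could be obtained is given here. It assumes that, for autoscaled data, the residual variance of a variable after k PCs is 1 minus the sum over the retained PCs of eigenvalue times squared loading; this is an assumption about the computation, not a description of the program used by the authors. The function names and the small demo matrix are hypothetical, and the Jacobi routine is a plain cyclic version.

```python
import math


def autoscale(rows):
    """Centre every column and divide it by its standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def covariance(z):
    """Variance-covariance matrix of column-centred data."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def jacobi_eigen(matrix, max_sweeps=100, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    a = [row[:] for row in matrix]
    p = len(a)
    v = [[float(i == j) for j in range(p)] for i in range(p)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(i + 1, p))
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if a[i][j] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal element a[i][j].
                diff = a[j][j] - a[i][i]
                theta = (math.pi / 4 if diff == 0.0
                         else 0.5 * math.atan(2.0 * a[i][j] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate columns i and j of a and v
                    a[k][i], a[k][j] = (c * a[k][i] - s * a[k][j],
                                        s * a[k][i] + c * a[k][j])
                    v[k][i], v[k][j] = (c * v[k][i] - s * v[k][j],
                                        s * v[k][i] + c * v[k][j])
                for k in range(p):  # rotate rows i and j of a
                    a[i][k], a[j][k] = (c * a[i][k] - s * a[j][k],
                                        s * a[i][k] + c * a[j][k])
    return [a[i][i] for i in range(p)], v


def residual_variances(rows, n_pcs):
    """Fraction of each autoscaled variable's variance left after n_pcs PCs."""
    eigenvalues, vectors = jacobi_eigen(covariance(autoscale(rows)))
    order = sorted(range(len(eigenvalues)),
                   key=lambda idx: eigenvalues[idx], reverse=True)
    kept = order[:n_pcs]
    return [1.0 - sum(eigenvalues[idx] * vectors[j][idx] ** 2 for idx in kept)
            for j in range(len(eigenvalues))]


# Hypothetical data: variable 2 is twice variable 1, variable 3 is uncorrelated.
demo_rows = [(1.0, 2.0, 1.0), (2.0, 4.0, -2.0), (3.0, 6.0, 1.0),
             (4.0, 8.0, 1.0), (5.0, 10.0, -2.0), (6.0, 12.0, 1.0)]
demo_residuals = residual_variances(demo_rows, 1)
# Variables whose residual variance exceeds the 0.6 limit quoted in the text.
demo_candidates = [j + 1 for j, r in enumerate(demo_residuals) if r > 0.6]
```

In this toy case the first PC absorbs the two correlated variables completely, so only the third variable exceeds the 0.6 limit.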
TABLE 2
Residual variances of the first two principal components of Table 1

Variable   PC1       PC2
1          0.99066   0.24018
2          0.44414   0.31885
3          0.86554   0.78017
4          0.19667   0.12075
5          0.10004   0.08123
6          0.59642   0.30459
7          0.64911   0.59778
8          0.22196   0.20069
9          0.30832   0.19449
10         0.06353   0.05771
11         0.15096   0.14502
12         0.09104   0.08565
13         0.55468   0.35381
14         0.70860   0.15684
15         0.36541   0.28041
16         0.13198   0.12985
17         0.55951   0.28840
18         0.55195   0.21278
19         0.24919   0.15189
20         0.16383   0.13216
21         0.10456   0.09560
22         0.07725   0.03528
23         0.65801   0.33763
24         0.28584   0.22968
25         0.40571   0.39054
26         0.13911   0.11349
27         0.26387   0.26379
PCA was repeated on the new data set containing 59 samples and 25 variables, which have been renumbered from 1 to 25. The variance explained by the first PCs is listed in Table 3. The elimination of the two variables raised the variance explained by the first two PCs from 61.4% to 65.6%.
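The comparison above is simple cumulative arithmetic on the per-PC percentages of Tables 1 and 3, which a short sketch can reproduce; the variable names are illustrative only.

```python
from itertools import accumulate

# Percent variance explained per PC, copied from Table 1 (27 variables)
# and Table 3 (25 variables).
table1 = [46.62879, 14.76747, 9.28546, 6.63766, 4.21947, 3.65660]
table3 = [49.84708, 15.78107, 7.44389, 6.41131, 3.94440]

cumulative1 = list(accumulate(table1))
cumulative3 = list(accumulate(table3))

# Variance explained by the first two PCs before and after the elimination.
first_two_before = round(cumulative1[1], 1)
first_two_after = round(cumulative3[1], 1)
gain = round(cumulative3[1] - cumulative1[1], 1)
print(first_two_before, first_two_after, gain)
```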
TABLE 1
Variance explained by the first six principal components obtained from the analysis of the entire data set

PC   Variance explained   Total variance explained
1    46.62879             46.62879
2    14.76747             61.39627
3    9.28546              70.68173
4    6.63766              77.31940
5    4.21947              81.53886
6    3.65660              85.19547

TABLE 3
Variance explained by the first five principal components obtained from the analysis of the reduced data set

PC   Variance explained   Total variance explained
1    49.84708             49.84708
2    15.78107             65.62815
3    7.44389              73.07204
4    6.41131              79.48336
5    3.94440              83.42776
Fig. 2. Distribution of samples of different geographic origin plotted against PC1 (a), PC2 (b), PC3 (c).
Fig. 3. Plot of the scores of PC1 and PC2: 1 = American, 2 = French, 3 = Italian.
Figs. 2a, 2b, 2c and 3 show the distribution of the samples plotted against PC1, PC2, PC3, and PC1 and PC2 together, respectively. From Fig. 2a it is evident that PC1 accounts mainly for the difference between French and American plus Italian mint oils. From Fig. 2b we see that PC2 contains mainly information about the difference between American and Italian plus French samples. In Fig. 2c no discrimination between the geographic classes is apparent, so the information contained in the third PC, which accounts for 9.3% of the total variance,
seems not to contribute significantly to the separation of the mint oils. As can be seen from Fig. 3, the plot of the scores of the first two PCs contains sufficient information for the characterization of the investigated mint flavours, which form well-localized clusters of points corresponding to their geographic origin. French mint oils are very well separated from the others. The difference between Italian and American oils is evident, even if it is less straightforward than the differences between American/French and Italian/French flavours. Only a few Italian samples lie in a region near the American cluster. The good localization of the American mint oils with respect to the Italian ones is noteworthy, taking into account that their geographic origin is spread widely across the United States.
Fig. 4 shows a scatter diagram of the loadings of the first two PCs. It may be seen that there is no non-significant variable left (in the center); this is confirmed by the list of the residual variances of the variables for the first two PCs (Table 4). Moreover, Fig. 4 gives information on the discriminating ability of the variables; many variables have a high loading (positive or negative) on the first PC: the group of variables close to variable 10 in Fig. 4, and variables 2, 3, 7, 17 and 20 on the right. Variables 1 and 2 have a high positive loading on the second component; variables 5, 15, 16 and 21 a high negative loading. Variables 11, 13, and 17, 2, 3 and 7 also have some importance for the second component. Fig. 3 shows that American mint oils have high scores on PC2 with respect to the Italian ones. We can conclude that Italian samples are characterized with respect to the American ones by lower peak areas of compounds 1 and 12 (α-thujene and 3-octanol) and higher peak areas of compounds 16, 5, 15 and 21 (menthofuran, myrcene, menthone and β-caryophyllene).
Fig. 4 also shows that many variables give about the same information with respect to the first two components. By selecting only four variables (8, 12, 16 and 20) as representative of the main clusters of variables, about the same separation among the three categories has been obtained as with 25 variables. This is shown in Fig. 5, where the superimposed loadings and scores are reported after a varimax rotation of scores [13].
Fig. 4. Plot of the loadings of PC1 and PC2 with variables renumbered as: 1 = α-thujene, 2 = α-pinene, 3 = β-pinene, 4 = sabinene, 5 = myrcene, 6 = α-terpinene, 7 = a-limonene, 8 = 1,8-cineole, 9 = γ-terpinene, 10 = terpinolene, 11 = n-hexanol, 12 = 3-octanol, 13 = 1-octen-3-ol, 14 = hydrated sabinene, 15 = menthone, 16 = menthofuran, 17 = isomenthone, 18 = menthyl acetate, 19 = neomenthol, 20 = 4-terpineol + β-caryophyllene, 21 = menthol, 22 = pulegone, 23 = a-isomenthol, 24 = terpineole, 25 = germacrene.

TABLE 4
Residual variances for the first two principal components of Table 3

Variable   PC1       PC2
1          0.99817   0.26622
2          0.44041   0.30164
3          0.19676   0.11477
4          0.09547   0.07959
5          0.59890   0.28804
6          0.23759   0.22477
7          0.30552   0.17991
8          0.06194   0.05759
9          0.16421   0.16161
10         0.10207   0.09912
11         0.55127   0.32433
12         0.72144   0.17897
13         0.35791   0.26021
14         0.13338   0.13086
15         0.55079   0.29075
16         0.54912   0.20868
17         0.25330   0.14360
18         0.15910   0.11780
19         0.10152   0.09539
20         0.07280   0.03476
21         0.64749   0.34649
22         0.28460   0.23141
23         0.39784   0.38879
24         0.13849   0.10895
25         0.25423   0.25339

Fig. 6. Plot of the canonical variables of LDA obtained with four variables (1 variable and 3 ratios).
The separation can be improved by using canonical variables computed from variables or ratios selected by means of a stepwise procedure. By introducing in each step the variable or ratio that produces the highest increase in the Mahalanobis distance between the two closest categories, the following features have been selected, in order:
- ratio between variables 3 and 8;
- ratio between variables 12 and 2;
- variable 19;
- ratio between variables 10 and 5.
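The first step of a selection of this kind can be sketched as follows. This is only an illustration of the criterion named above, restricted to a single feature, where the Mahalanobis distance reduces to the difference between two category means divided by the pooled within-category standard deviation; later steps would need the full multivariate distance. The function names and the nine synthetic samples are hypothetical, and strictly positive peak areas are assumed so that every ratio is defined.

```python
import math
from itertools import combinations, permutations


def with_ratios(rows):
    """Return (names, data): original variables plus every ratio v_i/v_j, i != j."""
    p = len(rows[0])
    pairs = list(permutations(range(p), 2))
    names = [f"v{j + 1}" for j in range(p)]
    names += [f"v{i + 1}/v{j + 1}" for i, j in pairs]
    data = [list(r) + [r[i] / r[j] for i, j in pairs] for r in rows]
    return names, data


def min_pairwise_distance(values, labels):
    """Smallest one-dimensional Mahalanobis distance between two category
    means, using the pooled within-category variance."""
    groups = {}
    for x, g in zip(values, labels):
        groups.setdefault(g, []).append(x)
    means = {g: sum(xs) / len(xs) for g, xs in groups.items()}
    within = sum((x - means[g]) ** 2 for g, xs in groups.items() for x in xs)
    pooled = within / (len(values) - len(groups))
    gap = min(abs(means[a] - means[b]) for a, b in combinations(groups, 2))
    if pooled == 0.0:
        return math.inf if gap > 0.0 else 0.0
    return gap / math.sqrt(pooled)


def best_first_feature(rows, labels):
    """Feature (variable or ratio) that best separates the two closest categories."""
    names, data = with_ratios(rows)
    scored = [(min_pairwise_distance([r[f] for r in data], labels), names[f])
              for f in range(len(names))]
    return max(scored)


# Synthetic samples: the raw variables overlap between categories,
# while their ratio differs from one category to the next.
demo_rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.0),   # category A
             (1.0, 2.1), (2.0, 3.9), (3.0, 6.0),   # category F
             (2.1, 1.0), (3.9, 2.0), (6.0, 3.0)]   # category I
demo_labels = ["A"] * 3 + ["F"] * 3 + ["I"] * 3
demo_distance, demo_feature = best_first_feature(demo_rows, demo_labels)
```

In the synthetic example a ratio is selected rather than either raw variable, which is the behaviour the text attributes to ratios in natural product identity problems.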
Fig. 5. Superimposed plot of loadings (magnified indices) and scores obtained with four variables after a varimax rotation.

TABLE 5
Entropy values from fuzzy clustering calculations by using two different values of exponent m

Number of clusters   Entropy (m = 1.5)   Entropy (m = 3.0)
2                    0.031               0.393
3                    0.069               0.705
4                    0.162               0.889
5                    0.208               1.120
6                    0.209               1.224

Fig. 7. Dendrogram of objects (Ward method) obtained with 25 variables.
Note that the two variables of each selected ratio lie in two different clusters of Fig. 4. The separation obtained with the canonical variables of LDA is shown in Fig. 6. Because of the low ratio between the number of features (variables + ratios) and objects, the separation shown in Fig. 6 has to be considered cautiously. However, the use of ratios to improve the separation ability appears to be of considerable interest from the point of view of its application to a wider data set.
Though the difference between the geographic classes is evident, other statistical techniques were employed, namely hierarchical and fuzzy clustering methods, in order to confirm the visual results and possibly, at the same time, to quantify the differences. Figs. 7 and 8 show the dendrograms obtained (Ward method) with the 25 variables and with the 4 features selected by stepwise LDA, respectively. The clusters of France and U.S.A. are very well separated in the two dendrograms. Moreover, three (Fig. 7) or two (Fig. 8) less separated clusters can be seen in the Italian samples. Note that, in spite of the second dendrogram having been obtained with only 4 features, related to 7 variables, the left cluster of the Italian samples in Fig. 7 is very similar in its composition to the left cluster of the Italian samples in Fig. 8. So these two clusters appear to be significant.
The FC technique, briefly described in the methods section, was applied to the scores of the first two PCs obtained from the reduced data set. Two calculations were performed with different values of the exponent m, namely 1.5 and 3.0, corresponding to hard and soft classifications, respectively. Unfortunately there is at present no rule for determining the correct value to assign to the exponent. In any case, it will be shown that the final results are quite similar.
Fig. 8. Dendrogram of objects (Ward method) obtained with four variables (1 variable and 3 ratios).

TABLE 6
Membership scores for the points and the clusters determined by fuzzy clustering calculation

         Cluster membership, m = 1.5   Cluster membership, m = 3.0
Sample   1         2                   1         2
1        0.9019    0.0981              0.6003    0.3997
2        0.9889    0.0111              0.7253    0.2747
3        0.9845    0.0155              0.6990    0.3010
4        0.9272    0.0728              0.6215    0.3785
5        0.9961    0.0039              0.7621    0.2379
6        0.9426    0.0574              0.6349    0.3651
7        0.9394    0.0606              0.6268    0.3732
8        0.9569    0.0431              0.6481    0.3519
9        0.9533    0.0467              0.6413    0.3587
10       0.0001    0.9999              0.1177    0.8823
11       0.0001    0.9999              0.1329    0.8671
12       0.0001    0.9999              0.1115    0.8885
13       0.0000    1.0000              0.0934    0.9066
14       0.0007    0.9993              0.1549    0.8451
15       0.0000    1.0000              0.0677    0.9323
16       0.0000    1.0000              0.0031    0.9969
17       0.9995    0.0005              0.8754    0.1246
18       0.9988    0.0012              0.8548    0.1452
19       0.9992    0.0008              0.8719    0.1281
20       0.9997    0.0003              0.8872    0.1128
21       0.9998    0.0002              0.8889    0.1111
22       1.0000    0.0000              0.9510    0.0490
23       0.9999    0.0001              0.9455    0.0545
24       0.9989    0.0011              0.8507    0.1493
25       0.9996    0.0004              0.8822    0.1178
26       0.9993    0.0007              0.8698    0.1302
27       0.9991    0.0003              0.9024    0.0976
28       0.9993    0.0007              0.8789    0.1211
29       0.9993    0.0007              0.8719    0.1281
30       0.9997    0.0003              0.8982    0.1018
31       0.9995    0.0005              0.8734    0.1266
32       0.9994    0.0006              0.8700    0.1300
33       0.9998    0.0002              0.8993    0.1007
34       0.9988    0.0012              0.8321    0.1679
35       0.9974    0.0026              0.8153    0.1847
36       0.9997    0.0003              0.9075    0.0925
37       0.9954    0.0046              0.8055    0.1945
38       0.9999    0.0001              0.9365    0.0635
39       0.9976    0.0024              0.8152    0.1848
40       0.9985    0.0015              0.8534    0.1466
41       1.0000    0.0000              0.9447    0.0553
42       0.9987    0.0013              0.8573    0.1427
43       0.9995    0.0005              0.8471    0.1529
44       0.9998    0.0002              0.8992    0.1008
45       0.9998    0.0002              0.8961    0.1039
46       1.0000    0.0000              0.9304    0.0696
47       1.0000    0.0000              0.9580    0.0420
48       1.0000    0.0000              0.9415    0.0585
49       0.9988    0.0012              0.8211    0.1789
50       0.9981    0.0019              0.8019    0.1981
51       0.9961    0.0039              0.8150    0.1850
52       1.0000    0.0000              0.9570    0.0430
53       0.9999    0.0001              0.8765    0.1235
54       0.9999    0.0001              0.8882    0.1118
55       0.9995    0.0005              0.8813    0.1187
56       1.0000    0.0000              0.9609    0.0391
57       0.9985    0.0015              0.8521    0.1479
58       0.9978    0.0022              0.8387    0.1613
59       0.9969    0.0031              0.8236    0.1764
TABLE 7
FC membership scores for a partition of the data set in three clusters

         Cluster membership
Sample   1         2         3
1        0.0007    0.9957    0.0036
2        0.0011    0.9501    0.0488
3        0.0001    0.9963    0.0036
4        0.0004    0.9962    0.0033
5        0.0005    0.9548    0.0447
6        0.0002    0.9984    0.0014
7        0.0002    0.9980    0.0018
8        0.0000    0.9999    0.0001
9        0.0002    0.9980    0.0018
10       0.9999    0.0000    0.0000
11       0.9999    0.0000    0.0000
12       0.9999    0.0001    0.0000
13       1.0000    0.0000    0.0000
14       0.9988    0.0005    0.0006
15       0.9999    0.0000    0.0000
16       0.9998    0.0001    0.0001
17       0.0003    0.0041    0.9956
18       0.0003    0.0028    0.9969
19       0.0001    0.0013    0.9986
20       0.0002    0.0038    0.9959
21       0.0003    0.0049    0.9949
22       0.0000    0.0001    0.9999
23       0.0000    0.0000    1.0000
24       0.0005    0.0060    0.9935
25       0.0002    0.0033    0.9964
26       0.0002    0.0025    0.9973
27       0.0000    0.0004    0.9995
28       0.0001    0.0009    0.9990
29       0.0002    0.0020    0.9978
30       0.0001    0.0007    0.9992
31       0.0003    0.0050    0.9946
32       0.0003    0.0032    0.9965
33       0.0001    0.0020    0.9978
34       0.0015    0.0817    0.9168
35       0.0017    0.0371    0.9612
36       0.0000    0.0001    0.9999
37       0.0012    0.0100    0.9888
38       0.0000    0.0002    0.9998
39       0.0018    0.0482    0.9500
40       0.0002    0.0020    0.9978
41       0.0000    0.0003    0.9997
42       0.0002    0.0018    0.9981
43       0.0013    0.1618    0.8369
44       0.0002    0.0057    0.9941
45       0.0002    0.0063    0.9935
46       0.0001    0.0019    0.9980
47       0.0000    0.0003    0.9997
48       0.0001    0.0059    0.9939
49       0.0021    0.2400    0.7580
50       0.0025    0.4590    0.5385
51       0.0008    0.0049    0.9943
52       0.0000    0.0000    1.0000
53       0.0008    0.0990    0.9002
54       0.0004    0.0304    0.9691
55       0.0001    0.0016    0.9983
56       0.0000    0.0000    1.0000
57       0.0002    0.0021    0.9977
58       0.0003    0.0025    0.9971
59       0.0006    0.0053    0.9941
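Membership scores such as those of Table 7 can be screened mechanically for samples that sit between two classes. The sketch below copies the rows of a few Italian samples from Table 7 and lists those whose membership in cluster 2 exceeds a threshold. Reading cluster 2 as the American cluster is inferred from the table and the discussion, and the 0.05 threshold and the function name are illustrative assumptions, not values used by the authors.

```python
# Memberships in clusters 1, 2 and 3, copied from Table 7 for a few
# Italian samples.
memberships = {
    34: (0.0015, 0.0817, 0.9168),
    35: (0.0017, 0.0371, 0.9612),
    39: (0.0018, 0.0482, 0.9500),
    43: (0.0013, 0.1618, 0.8369),
    49: (0.0021, 0.2400, 0.7580),
    50: (0.0025, 0.4590, 0.5385),
    52: (0.0000, 0.0000, 1.0000),
    53: (0.0008, 0.0990, 0.9002),
}

CLUSTER_2 = 1  # zero-based index of cluster 2 (assumed to be the American one)


def flag_borderline(table, cluster, threshold=0.05):
    """Samples whose membership in `cluster` exceeds `threshold`, largest first."""
    hits = [(sample, scores[cluster]) for sample, scores in table.items()
            if scores[cluster] > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


flagged = flag_borderline(memberships, CLUSTER_2)
for sample, score in flagged:
    print(f"sample {sample}: {100 * score:.1f}% membership in cluster 2")
```

A lower threshold would add samples such as 39 and 35, so the cut-off should be chosen with the intended quality check in mind.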
The results concerning the most probable partition are listed in Table 5. From the entropy values the most probable partition consists in both cases of 2 clusters. These two clusters agree with the visual inspection of the dendrograms in Figs. 7 and 8. The membership scores for each sample and each cluster are listed in Table 6. If we look at the scores we can see that French mint oils form one of the clusters while Italian and American samples belong to the other cluster. This result arises because Italian and American oils are not different enough for an unsupervised method like fuzzy clustering to differentiate them. In fact the unsupervised method is not able to realize that all the American samples are localized on one side and the Italian ones on another side. In any case, we can get information on the difference between Italian and American samples by analysing the FC membership scores for a partition of the data set into three clusters (Table 7). From these data we see that the classification with three clusters is correct and the samples of different geographic origin are well separated. Fuzzy clustering also confirms the results obtained from the visual analysis of Fig. 3. From the data of Table 7 we can obtain quantitative information about the difference between American and Italian samples. In particular there are some Italian mint oils which are quite similar to the American ones: 50 (45.9%), 49 (24.0%), 43 (16.2%), 53 (9.9%) and 34 (8.2%). This similarity is highly significant since it was obtained with a low value of m (hard classification). This may have its origin in real chemical similarities or in possible mixing of Italian and American flavours. The result obtained from FCMEAN would therefore suggest subjecting samples 50, 49, 43, 53 and 34 to further quality checks.
The membership scores are nearer 1.0 and 0.0 in the case of m = 1.5, which corresponds to the hard classification, as expected. As m increases it becomes more difficult to assign every sample to a definite cluster, so usually a value of m between 1.5 and 3.0 is advisable.

TABLE 8
Variance explained of the first five PCs of the study of the Italian samples

PC   Variance explained   Total variance explained
1    26.68290             26.68290
2    20.55535             47.23825
3    11.48854             58.72680
4    8.67799              67.40479
5    6.33526              73.74004

TABLE 9
Residual variance on each variable for the first three PCs of Table 8

Variable   PC1       PC2       PC3
1          0.74425   0.69346   0.67774
2          0.29943   0.29594   0.19875
3          0.77208   0.55353   0.22568
4          0.27368   0.27368   0.22478
5          0.31552   0.26915   0.17012
6          0.47073   0.45026   0.23495
7          0.48496   0.46223   0.25763
8          0.23250   0.21475   0.06027
9          0.73367   0.51244   0.26145
10         0.52083   0.24003   0.21036
11         0.24872   0.21002   0.05980
12         0.22539   0.17982   0.05973
13         0.81079   0.29446   0.28291
14         0.83661   0.60163   0.60123
15         0.94885   0.71285   0.69840
16         0.20092   0.19842   0.14472
17         0.91648   0.25639   0.23894
18         0.43767   0.40712   0.39123
19         0.17795   0.09074   0.09074
20         0.50410   0.14341   0.13903
21         0.83147   0.12390   0.12181
22         0.31801   0.29545   0.12182
23         0.97592   0.28846   0.27294
24         0.40644   0.35690   0.23287
25         0.67232   0.58552   0.57140
26         0.76953   0.18638   0.18602
27         0.78704   0.15175   0.13599

TABLE 10
Variance explained by the first six PCs after the treatment on the reduced data set

PC   Variance explained   Total variance explained
1    30.54099             30.54099
2    23.08048             53.62147
3    13.39185             67.01331
4    7.90756              74.92087
5    6.28947              81.21033
6    4.79480              86.00513

TABLE 11
Residual variances for the first three PCs after the treatment on a reduced data set

Variable   PC1       PC2       PC3
1          0.31361   0.31108   0.20931
2          0.77503   0.52164   0.20003
3          0.27426   0.27422   0.22812
4          0.31076   0.26789   0.17722
5          0.46086   0.44647   0.21857
6          0.49565   0.46625   0.27059
7          0.22609   0.20757   0.05440
8          0.73318   0.52195   0.27550
9          0.52503   0.22491   0.19435
10         0.24435   0.20290   0.05354
11         0.22300   0.17438   0.05437
12         0.80211   0.28179   0.27083
13         0.19601   0.19274   0.13754
14         0.91380   0.24157   0.22536
15         0.44041   0.41935   0.39680
16         0.18134   0.09022   0.09018
17         0.49786   0.13892   0.13514
18         0.84269   0.11247   0.11042
19         0.31276   0.28916   0.11469
20         0.97669   0.30726   0.29184
21         0.39875   0.35086   0.22569
22         0.76627   0.19894   0.19757
23         0.79216   0.15372   0.14129

Since Italian mint oils come from five different suppliers and they do not form well-localized clusters in Fig. 3 (in comparison with the American and French ones), we decided to study further the pattern of the Italian samples. The data corresponding to the 43 Italian mint oils were autoscaled and subjected to PCA. Table 8 contains the list of the variance explained by the first PCs while Table 9 contains the residual variance on each variable for the first three PCs. From the data of Table 9 variables 1, 14, 15, judged as
non-significant, were eliminated and a new PCA was performed on the reduced data set. The list of the variance explained by the first PCs is given in Table 10 while the residual variances for three PCs are listed in Table 11. Again, an increase can be noticed in the total variance explained by the first three PCs, from 58.7 to 67.0%. Table 11, containing the residual variances, indicates that there is no non-significant variable left.
Fig. 9. Results of PCA of Italian mint oil samples. (a) Plot of the scores of PC1 and PC2, (b) plot of the scores of PC1 and PC3, (c) plot of the scores of PC2 and PC3.
Fig. 10. Plot of scores on PC1 and PC2 of Italian samples (24 variables).
Figs. 9a, 9b and 9c show the plots of the scores of the first three PCs. Fig. 10 shows the same data as Fig. 9a with the numbers of the samples, to be compared with those in the dendrogram of Fig. 11. It is evident that there is a certain separation between the mint oils coming from different suppliers, even if, unfortunately, it is impossible to explain the reason for such a separation since the suppliers were not required to declare the exact Italian origin of the samples. For the same reason, pending a more precise and accurate sampling of Italian mint oils, LDA and FC were not applied to the Italian samples alone. It may in any case be noticed that samples from suppliers 1 and 2 form well-localized and
compact clusters in approximately the same region. The other Italian mint oils are more spread out, even if they tend to form well-defined large clusters corresponding to each different supplier. For the future, in order to verify the possibility of classifying mint oils of different national and international geographic origin by gas chromatographic analysis coupled with pattern recognition methods, a rigorous sampling procedure will be carried out. The results presented in this paper concerning the classification of Italian, American and French mint oils, as well as the study of the distribution of the Italian ones, are very encouraging.
REFERENCES
1 M. Forina and E. Tiscornia, Pattern recognition methods in the prediction of Italian oil origin by their fatty acid content, Annali di Chimica (Rome), 72 (1982) 143-155.
2 I.E. Frank and B.R. Kowalski, Prediction of wine quality and geographic origin from chemical measurements by partial least-squares regression modeling, Analytica Chimica Acta, 162 (1984) 241-251.
3 W.O. Kwan and B.R. Kowalski, Classification of wines by applying pattern recognition to chemical composition data, Journal of Food Science, 43 (1978) 1320-1323.
4 K. Wada, S. Oghama, H. Sasaki and M. Shimoda, Classification of various varieties of coffee by coupling of sensory data and multivariate analysis, Agricultural and Biological Chemistry, 51 (1987) 1745-1752.
5 L. Xiande, P. Van Espen and F. Adams, Classification of Chinese tea samples according to origin and quality by principal component techniques, Analytica Chimica Acta, 200 (1987) 421-430.
6 E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry, Wiley, New York, 1978.
7 S. Wold, Pattern recognition by means of disjoint principal components models, Pattern Recognition, 8 (1976) 127-139.
8 M. Forina, R. Leardi, C. Armanino, S. Lanteri, P. Conti and P. Princi, PARVUS: An Extendable Package of Programs for Data Exploration Classification and Correlation, Elsevier Scientific Software, Amsterdam, 1988.
9 D.L. Massart and L. Kaufman, Hierarchical clustering methods, in P.J. Elving and J.D. Winefordner (Editors), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983, p. 80.
10 J.C. Bezdek, R. Ehrlich and W. Full, FCM: the fuzzy c-means algorithm, Computers and Geoscience, 10 (1984) 191-203.
11 J.C. Bezdek, M. Trivedi, R. Ehrlich and W. Full, Fuzzy clustering: a new approach for geostatistical analysis, International Journal of Systems, Measurement and Decision, 1982.
12 L.A. Zadeh, Fuzzy sets, Information and Control, 8 (1965) 338-353.
13 M. Forina, C. Armanino, S. Lanteri and R. Leardi, Methods of varimax rotation in factor analysis with applications in clinical and food chemistry, Journal of Chemometrics, 3 (1988) 115-125.