Available online at www.sciencedirect.com
Chinese Chemical Letters 22 (2011) 1241–1244 www.elsevier.com/locate/cclet
Discrimination of industrial products by on-line near infrared spectroscopy with an improved dendrogram
Jing Jing Liu, Heng Xu, Wen Sheng Cai, Xue Guang Shao *
Research Center for Analytical Sciences, College of Chemistry, Nankai University, Tianjin 300071, China
Received 16 February 2011
Available online 18 July 2011
Abstract
Near infrared (NIR) spectroscopy has shown great power and gained wide acceptance for analyzing complicated samples. The present work aims to distinguish different brands of tobacco products by using on-line NIR spectroscopy and pattern recognition techniques. Moreover, because each brand contains a large number of samples, an improved dendrogram is proposed to show the classification of the brands. The results suggest that NIR spectroscopy combined with principal component analysis (PCA) and hierarchical cluster analysis (HCA) performs well in discriminating the different brands, and that the improved dendrogram provides more information about the differences between the brands.
© 2011 Xue Guang Shao. Published by Elsevier B.V. on behalf of Chinese Chemical Society. All rights reserved.
Keywords: Near infrared spectroscopy; Principal component analysis; Hierarchical cluster analysis; Tobacco products
Discrimination of industrial products is an important task in industry [1–3]. However, because industrial products are complex, an appropriate analytical method is essential. Near infrared (NIR) spectroscopy is a simple, rapid and nondestructive analytical method that can capture several types of information in a single measurement. Combined with pattern recognition techniques, it has attracted considerable attention for variety discrimination in practical applications, e.g. biomedical diagnosis [4], discrimination of different genotypes of herbal plants [5], identification of false drugs [6] and authentication of the origin of food [7]. In particular, it has shown great power and gained wide acceptance in the tobacco industry [8–10]. In the present work, a method based on NIR spectroscopy combined with pattern recognition techniques is proposed to distinguish different brands of tobacco products. To improve the discrimination, principal component analysis (PCA) was first employed to reduce the data dimensionality and extract the information, and hierarchical cluster analysis (HCA) was then used to exhibit the separation of the classes. In addition, because each brand contains a large number of samples, an improved dendrogram was developed to show the difference between the brands.
* Corresponding author. E-mail address: [email protected] (X.G. Shao).
1001-8417/$ – see front matter © 2011 Xue Guang Shao. Published by Elsevier B.V. on behalf of Chinese Chemical Society. All rights reserved.
doi:10.1016/j.cclet.2011.04.019
1. Experimental
All the spectra were measured on a production line of a tobacco company. The samples include seven brands of products, denoted as A, B, C, D, E, F and G. These brands belong to three series: the first series includes brands A, B, C and D, the second series includes brands E and F, and the third series consists of brand G. In this work, 2100 spectra were used, 300 for each brand. The spectra were recorded on an MPA FT-NIR spectrometer (Bruker, Germany) in the wavenumber range of 4000–12,000 cm⁻¹ with a digitization interval of ca. 4 cm⁻¹. To reduce the influence of noise and background, a continuous wavelet transform (CWT) [11] was applied with the Haar filter at a scale of 20.
PCA [12] and HCA [13] were adopted to investigate the difference between the samples. PCA was employed to reduce the data dimensionality and inspect the difference between the samples. The difference is usually visualized in the space of the first two or three PCs. However, because all the tobacco products are very similar, two- or three-dimensional PC spaces cannot provide unambiguous information about the difference. Therefore, HCA was further employed, because it characterizes the similarity of the samples by the distances between sample pairs in high dimensional spaces. Typically, the similarities are represented on a two-dimensional (distance and sample index) diagram, called a dendrogram, which illustrates different levels of difference and suggests the possible clustering solutions. On the dendrogram, each linkage step is represented by a connection line. The difference between the two connected classes is represented by the distance, and each number along the sample index axis corresponds to a sample.
In this study, different brands are investigated and each brand contains a large number of samples. On the one hand, it is impossible to show all the samples on the dendrogram.
On the other hand, the separation information between the brands is expected. Therefore, an improved dendrogram was proposed. On the improved dendrogram, the centroid and variance of the samples in a brand, and the resolution between the brands were adopted. The resolution (Rs) is calculated by:
$$R_s = \frac{|m_a - m_b|}{s_a + s_b} \qquad (1)$$
where a and b are the two classes being connected, m_a and m_b are the centroids of the two classes, and s_a and s_b are the standard deviations of the Euclidean distances from the samples of each class to its centroid. Both s and Rs are labeled on the dendrogram to show the variance within a brand and the separation between the classes. Moreover, the length of each connection line is proportional to the distance between the centroids of the two connected classes.

2. Results and discussion
Fig. 1 shows the measured spectra and the preprocessed spectra of the tobacco products. The measured spectra in Fig. 1a are clearly very similar and overlap heavily. After the background and noise are removed by CWT, the spectra are even more similar to each other, as shown in Fig. 1b. Therefore, it is impossible to distinguish the brands by using the spectra directly.

Fig. 1. Measured spectra (a) and the preprocessed spectra (b) of the samples.

Because PCA is an effective data mining technique for reducing the computational burden and extracting the main information, it was performed to examine the difference between the products. Fig. 2 presents the score plot of the first two PCs. Although the cumulative variance of the first two PCs is 99.95%, the samples still overlap considerably. Therefore, to exhibit the difference clearly in high dimensional spaces, HCA was used in the following studies. Because of the complexity of the samples, the PCs explaining small variance may play an important role in the discrimination. The first eight PCs were therefore used to represent the samples.

Fig. 2. Score plot of the first two PCs for discriminating the samples.

The result of HCA is illustrated in Fig. 3a, in which each triangle represents the centroid of a brand. Two main classes can be identified. One contains brands A, B, C and D, and the other contains brands E, F and G. In the latter, brands E and F are more similar to each other than to G. This is not surprising, since brands A–D belong to the first series, brands E and F belong to the second series, and brand G belongs to the third series. Moreover, the differences between brands A and B, C and D, and E and F can be clearly seen in the figure. This suggests that similar brands within a series can also be distinguished by the method. Therefore, NIR spectroscopy combined with HCA may provide an efficient tool for discriminating tobacco brands at different levels of similarity. However, on the conventional dendrogram, only the Euclidean distances between the centroids of the brands are used to evaluate the similarity of the brands. The distribution of the samples within a brand and the separation between the brands cannot be reflected.
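
The preprocessing and PCA steps discussed above can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the authors' code: the function names `haar_cwt` and `pca_scores`, the wavelet normalization, the edge handling (`mode="same"`) and the synthetic spectra are all assumptions made here for demonstration.

```python
import numpy as np


def haar_cwt(spectra, scale=20):
    """Convolve each spectrum (one per row) with a Haar wavelet of `scale` points.

    Assumption: the wavelet is +1 over its first half and -1 over its second
    half, scaled by 1/sqrt(scale). It sums to zero, so a constant background
    is removed; points near both ends of a spectrum are edge-affected.
    """
    half = scale // 2
    kernel = np.concatenate([np.ones(half), -np.ones(half)]) / np.sqrt(scale)
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    return np.array([np.convolve(row, kernel, mode="same") for row in spectra])


def pca_scores(X, n_components=8):
    """Return PCA scores and the explained-variance ratio of each kept PC."""
    Xc = X - X.mean(axis=0)                       # mean-centre each variable
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    ratio = S**2 / np.sum(S**2)                   # variance fraction per PC
    return scores, ratio[:n_components]


# Synthetic stand-in for the on-line spectra (NOT the paper's data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 400)
band = np.exp(-((x - 0.5) ** 2) / 0.005)          # one absorption-like band
offsets = rng.normal(0.0, 0.5, size=(60, 1))      # sample-to-sample baseline shift
heights = np.repeat([1.00, 1.05, 1.10], 20)[:, None]  # three hypothetical brands
spectra = offsets + heights * band + rng.normal(0.0, 0.002, size=(60, 400))

processed = haar_cwt(spectra, scale=20)
scores, ratio = pca_scores(processed, n_components=8)
print("cumulative variance of PC1-PC2:", ratio[:2].sum())
```

Keeping eight score columns rather than two mirrors the choice made in the text, where PCs explaining little variance may still carry the differences between similar brands.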
Fig. 3. Dendrogram (a) and the improved dendrogram (b) for discriminating the brands.
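
The quantities shown on the two dendrograms, namely the centroid distances, the within-class value s and the resolution Rs of Eq. (1), can be computed as in the following sketch. It is not the authors' implementation. The names `class_stats`, `resolution` and `improved_dendrogram_table` are hypothetical, the linkage method is not stated in the paper (SciPy's `"centroid"` linkage is assumed here), and pooling the samples of two merged classes before computing their centroid and s is likewise an assumption. The data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def class_stats(scores):
    """Centroid of a class and std of its sample-to-centroid Euclidean distances."""
    centroid = scores.mean(axis=0)
    dists = np.linalg.norm(scores - centroid, axis=1)
    return centroid, dists.std(ddof=1)


def resolution(scores_a, scores_b):
    """Rs of Eq. (1): centroid distance divided by the sum of s_a and s_b."""
    m_a, s_a = class_stats(scores_a)
    m_b, s_b = class_stats(scores_b)
    return np.linalg.norm(m_a - m_b) / (s_a + s_b)


def improved_dendrogram_table(scores, labels, method="centroid"):
    """List (class a, class b, linkage distance, Rs) for every linkage step."""
    labels = np.asarray(labels)
    brands = [str(b) for b in np.unique(labels)]
    groups = {k: scores[labels == b] for k, b in enumerate(brands)}
    names = dict(enumerate(brands))
    centroids = np.array([groups[k].mean(axis=0) for k in range(len(brands))])
    Z = linkage(centroids, method=method)          # HCA on the brand centroids only
    merges = []
    for step, (i, j, dist, _) in enumerate(Z):
        i, j = int(i), int(j)
        rs = float(resolution(groups[i], groups[j]))
        merges.append((names[i], names[j], float(dist), rs))
        new = len(brands) + step                   # SciPy's index of the merged cluster
        groups[new] = np.vstack([groups[i], groups[j]])  # assumption: pool the samples
        names[new] = names[i] + "+" + names[j]
    return merges


# Synthetic scores in eight PCs for four hypothetical brands (NOT the paper's data).
rng = np.random.default_rng(1)
centres = {"A": 0.0, "B": 1.0, "E": 10.0, "F": 11.0}
scores = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 8)) + np.r_[c, np.zeros(7)]
    for c in centres.values()
])
labels = np.repeat(list(centres), 50)

for a, b, dist, rs in improved_dendrogram_table(scores, labels):
    print(f"{a} | {b}: distance = {dist:.2f}, Rs = {rs:.2f}")
```

Each row of the table corresponds to one connection line: the distance would set the length of the line, and Rs would be the label written on it. With equal numbers of samples per brand, the centroid of a merged cluster in SciPy's centroid linkage coincides with the centroid of the pooled samples.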
For a more descriptive representation, the improved dendrogram is plotted in Fig. 3b. The numbers to the right of the triangles indicate the distribution of the samples within a brand. They are the standard deviations of the Euclidean distances of the samples from each class centroid in the space of the eight PCs. In the studied cases, the values are small enough to show the similarity between the samples. The separation between the brands can be evaluated from the values of Rs labeled on the connection lines. For example, the value for the two main classes is 3.24, which indicates a complete separation because it is much larger than 1.5. However, the value for brands A and B is only 1.23, which suggests a comparatively poor separation. In addition, because the length of each connection line is proportional to the distance between the centroids of the two connected classes, the difference between the connected classes can be clearly demonstrated. However, this means that the difference between unconnected classes cannot be read directly from the distance along the sample index axis. For instance, brands D and G belong to different classes but are very close geometrically. The real difference between brands D and G can instead be evaluated by the Rs between the classes they belong to. Similarly, the real difference between brands A and C should be evaluated in the same way.
In conclusion, NIR spectroscopy coupled with pattern recognition techniques performs well in discriminating complicated tobacco products. Moreover, for classes containing a large number of samples, the improved dendrogram provides more information about the differences between the classes.

Acknowledgment
This work is supported by National Natural Science Foundation of China (No. 20835002).

References
[1] J. Sadecka, J. Tothova, P. Majek, Food Chem. 117 (2009) 491.
[2] R.M. Balabin, R.Z. Safieva, Fuel 87 (2008) 1096.
[3] W.L. Yoon, R.D. Jee, A. Charvill, et al., J. Pharm. Biomed. Anal. 34 (2004) 933.
[4] L. Munck, J. Chemometr. 21 (2007) 406.
[5] Y.H. Lai, Y.N. Ni, S. Kokot, Chin. Chem. Lett. 20 (2010) 213.
[6] S.H.F. Scafi, C. Pasquini, Analyst 126 (2001) 2218.
[7] L.J. Xie, Y.B. Ying, T.J. Ying, et al., Anal. Chim. Acta 584 (2007) 379.
[8] L.J. Ni, L.G. Zhang, J. Xie, et al., Anal. Chim. Acta 633 (2009) 43.
[9] E.D.T. Moreira, M.J.C. Pontes, R.K.H. Galvao, et al., Talanta 79 (2009) 1260.
[10] C. Tan, X. Qin, M.L. Li, Vib. Spectrosc. 51 (2009) 276.
[11] X.G. Shao, A.K.M. Leung, F.T. Chau, Accounts Chem. Res. 36 (2003) 276.
[12] S. Wold, Chemometr. Intell. Lab. Syst. 2 (1987) 37.
[13] N. Bratchell, Chemometr. Intell. Lab. Syst. 6 (1989) 105.