Engineering Applications of Artificial Intelligence 20 (2007) 415–427 www.elsevier.com/locate/engappai
Classifying NIR spectra of textile products with kernel methods

Y. Langeron, M. Doussot, D.J. Hewson, J. Duchêne

Institut Charles Delaunay (ICD), Université de technologie de Troyes (UTT), 12, rue Marie Curie, BP 2060, F-10010 Troyes Cedex, France

Received 28 January 2005; received in revised form 13 June 2006; accepted 8 July 2006. Available online 18 September 2006.
Abstract

This paper describes the use of kernel methods to classify tissue samples using near-infrared (NIR) spectra in order to discriminate between samples with or without elastane. The aim of this real-world study is to identify an alternative method to classify textile products using NIR spectroscopy in order to improve quality control, and to aid in the detection of counterfeit garments. The principles behind support vector machines (SVMs), whose main idea is to separate data linearly, are recalled progressively in order to demonstrate that the decision function obtained is the global optimal solution of a quadratic programming problem. Generally, this solution is found after embedding the data in another space F of higher dimension by means of a specific non-linear function, the kernel. For a selected kernel, one of the most important and difficult issues with SVMs is the determination of the tuning parameters. Usually, different combinations of these parameters are tested in order to obtain a machine with adequate classification ability. With the kernel alignment method used in this paper, the most appropriate kernel parameters are identified rapidly. Since in many cases the data are embedded in F, a linear principal component (PC) analysis (PCA) can be considered and studied in that space. The main properties and the algorithm of k-PCA are described here. This paper compares the prediction results obtained for a linear classifier built in the initial space with the PCs from a PCA, and those obtained in F with non-linear PCs from a k-PCA. In the present study, even though potentially discriminating wavelengths are visible on the NIR spectra, linear discriminant analysis and soft independent modelling of class analogy results show that these wavelengths are not sufficient to build a machine with correct generalisation ability. The use of a non-linear method, such as SVM and its corollary methods, kernel alignment and k-PCA, is then justified.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Support vector machine; k-principal component analysis; Kernel alignment; Standard normal variate transformation
1. Introduction

The use of near-infrared (NIR) measurements in the textile industry to identify different fabrics may contribute to the evolution of quality-control methods among manufacturers, as well as offering trade advantages due to improved conformity controls of products. In general, such controls are performed with chemicals; NIR offers an environmentally clean alternative as well as a considerable saving of time. The aim of this study was to classify pieces of fabric in a binary manner according to the presence or absence of a
fabric component by using NIR spectra. The substance chosen for this study was elastane, a fibre that contains at least 85% elastomer and thus has a great capacity to be stretched and return to its starting length. There are many methods used in NIR spectroscopy to build a classification function, such as:
- Linear discriminant analysis (LDA), which is a Bayesian approach where learning is fast and easy, but the hypothesis of a Gaussian model quickly becomes inadequate.
- K-nearest neighbours, where the prediction step may be time consuming if the size of the database is very high.
- The soft independent modelling of class analogy (SIMCA) method, which offers good performance, but requires
homogeneous distributions for all classes. This method also becomes inadequate if only limited amounts of training data are available.
- Neural networks, which can be used for pattern recognition by searching for a model that minimises an error criterion through adequate synaptic weights. However, a global optimum is not guaranteed for a given model. Moreover, it is not obvious which neural network architecture should be chosen.
In this paper, no assumptions will be made about the statistical behaviour of the training data. Instead, the problem will be treated as a geometrical one, using a support vector machine (SVM) (Burges, 1998; Cristianini and Shawe-Taylor, 2000; Gunn, 1998), whose main principles are summarised in Section 2. The basic idea behind SVMs is to project labelled data, initially defined in $\mathbb{R}^d$, into another space F of higher dimension, where a linear separation will be easier. As this projection is not obvious, a kernel function is needed. For a selected kernel, several parameters have to be tuned simultaneously in order to obtain a learning machine with the best performance. The kernel alignment method explained in Section 2 is a measure of similarity between the dot product matrix obtained in F for each data point with respect to the others, and a target matrix built from the data labels. The use of principal component (PC) analysis (PCA) provides an orthogonal projection basis in which the projected variance is maximised and the loss of information minimised. PCA is dealt with in the same section, with a particular emphasis on factor analysis in the space F (k-PCA), which enables non-linear relationships between samples in the initial data space to emerge. In Section 4, the data set is checked to ascertain whether or not it is non-linear, using the SIMCA and LDA methods, which a posteriori justifies the use of SVM. Pre-processing treatments of the NIR spectra are presented in the section devoted to data description. This project was undertaken in conjunction with the Institut Français du Textile et de l'Habillement (IFTH, Working Group ISO TC 38/WG22), who provided the spectral data.

2. Discriminating methods
2.1. Usual discriminating approaches

2.1.1. Linear discriminant analysis
Each sample $x_i$ is assigned to a class $C_i$ with the following decision rule $h(x_i)$:

$h(x_i) = x_i \cdot w + b \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 0.$   (1)

The aim of the design work is to find the optimum vector $w$ and the associated threshold value $b$. Using the Fisher criterion (Keinosuke, 1990), which measures the difference of the two means normalised by the averaged variance, this optimum is equal to

$w = \left(\tfrac{1}{2}S_1 + \tfrac{1}{2}S_2\right)^{-1}(M_2 - M_1),$   (2)

with $S_i$ the covariance matrix and $M_i$ the mean vector of the class $C_i$. The Fisher criterion does not enable the optimum value of $b$ to be found. In this paper, $b$ is taken to lie between the means of the training data projected onto the direction $w$.

2.1.2. Soft independent modelling of class analogy
This method was proposed by Svante Wold (Wold, 1976; Wold and Sjostrom, 1977), whereby each class of the main database is modelled separately by disjoint PC models with the use of a cross-validation technique. From the scatter of points around these models, tolerance volumes are constructed, indicating the space spanned by each class. The test set samples are classified as class members if they fall inside a volume. Obviously, when the tolerance volumes overlap, samples can be assigned to both classes.

2.2. Support vector machine

The aim of an SVM is to separate data, where two cases need to be distinguished, initially in the original space. If the separation is acceptable within the confines of the original space, such a separation is, by definition, linear. If a linear solution in the original space is not acceptable, the data need to be non-linearly transformed, so that the SVM procedure can be successfully applied in the new space. Therefore, the following section is divided into the separable and non-separable cases in the original space, followed by the expansion into a higher-dimension space.

2.2.1. Separable case
In such a case, it is a relatively straightforward task to find a hyperplane that linearly separates the data. The hyperplane chosen needs to be located equidistant to the closest points of each group, and can thus be seen as dividing an intermediate zone, called the margin, into two equal parts. Let us formalise the problem:

- Let the $l$ NIR spectra be a training database $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, $i = 1, \ldots, l$.
- Let $x \cdot w + b = 0$ be the equation of a separating surface, with $w$ the normal direction to this surface and $|b|/\|w\|$ its distance to a chosen origin. This hyperplane is expected to separate the positive and negative samples.
- Let $d_+$ be the distance of the nearest positive point to the separating surface.
- Let $d_-$ be the distance of the nearest negative point to the separating surface.
The optimal hyperplane H maximises the margin $M = d_+ + d_-$ under the following constraints:

$x_i \cdot w + b \geq +1$ with $y_i = +1$,   (3)

$x_i \cdot w + b \leq -1$ with $y_i = -1$.   (4)

Both can be summarised in

$y_i (x_i \cdot w + b) - 1 \geq 0 \quad \forall i.$   (5)

Every point for which the equality in (3) or (4) holds lies on a hyperplane, with the perpendicular distance from H defined by

$d_+ = d_- = \dfrac{1}{\|w\|}.$   (6)

So, the margin M becomes

$M = \dfrac{2}{\|w\|}.$   (7)

Thus, to maximise the margin, it is necessary to minimise the norm of $w$. Then, let the following quadratic optimisation problem be defined by

$\tfrac{1}{2}\|w\|^2$   (cost function to be minimised),

$y_i (x_i \cdot w + b) - 1 \geq 0 \quad \forall i$   (all constraints to be satisfied).

The solution to this optimisation problem is obtained at the saddle point of the Lagrangian function defined in its primal form by

$L_p(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i,$   (8)

$L_p(\hat{w}, \hat{b}, \alpha) \leq L_p(\hat{w}, \hat{b}, \hat{\alpha}) \leq L_p(w, b, \hat{\alpha})$ with $\alpha_i \geq 0$.   (9)

When the necessary and sufficient optimality conditions defined by Karush, Kuhn and Tucker are applied, the Lagrangian function evolves to the dual form with

$L_d(w, b, \alpha) = \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j),$   (10)

$\sum_{i=1}^{l} \alpha_i y_i = 0,$   (11)

$w = \sum_{i=1}^{l} \alpha_i y_i x_i,$   (12)

$\left[ y_i (x_i \cdot w + b) - 1 \right] \alpha_i = 0, \quad \alpha_i \geq 0.$   (13)

The optimisation problem induced by (10) under constraints (11) has one global optimal solution. Several algorithms are available for such a quadratic optimisation. For the present study, the Matlab quadprog function was used, with $b$ obtained from (13) after calculation of (10) and (12). The dual form is particularly interesting, as it requires only the labelled training data and dot products.

Spectra that have non-null $\alpha_i$ generate the separating hyperplane, and these data are named support vectors. Classifying a new test sample is realised by studying the sign of $(x_{test} \cdot w + b)$. In the separable case, all support vectors lie on the hyperplanes that delimit the margin, and verify the following relation:

$w^t w = \sum_{j=1}^{N_{SV}} \alpha_j, \quad N_{SV}$: the number of support vectors.   (14)

This last result is very important as it guarantees linear separability in the original data space and may be an empirical step for building an SVM.
2.2.2. Non-separable case
In many applications, data are not separable by a hyperplane. It is necessary, therefore, to find a compromise between maximising the separating margin and minimising the percentage of data misclassified during the training process. Such a compromise is always a question of optimisation, although in this case, relaxing constraints would generate an additional cost for the objective function. The problem becomes

$\tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$   (cost function to be minimised)   (15)

under the constraints:

$x_i \cdot w + b \geq +1 - \xi_i$ with $y_i = +1$,   (16)

$x_i \cdot w + b \leq -1 + \xi_i$ with $y_i = -1$.   (17)

It can be seen that the chosen cost function presents a linear relationship with all slack variables $\xi_i$. However, there are other ways to formulate the cost function (2-norm soft margin classifier; Cristianini and Shawe-Taylor, 2000, chapter 6). For all $x_i$ that do not satisfy the original constraints (3) and (4), a coefficient $\xi_i$ is used to relax the corresponding constraint, with the aim of minimising the additional cost in the objective function. The positive parameter C justifies the passage from constraints to objective function and expresses the importance given to the global relaxing process. C expresses whether the user accepts a certain number of misclassified individuals during the training process. The expression of the Lagrangian used in the separable case is still retained, with the only difference being that the Lagrange multipliers $\alpha$ now have an upper bound defined by C. Support vectors (with $\alpha > 0$) do not systematically lie on the hyperplanes delimiting the margin. On the other hand, those support vectors that are located on the hyperplanes correspond to $\xi_i = 0$.
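As a concrete illustration of the dual problem (10)-(13) with the box constraint $0 \leq \alpha_i \leq C$ introduced above, the following minimal Python sketch solves the dual with a generic solver. It stands in for the Matlab quadprog routine mentioned earlier (scipy's SLSQP is used here purely for illustration); the toy data and all names are illustrative, not the study's own code.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, C=10.0):
    """Soft-margin SVM dual (10)-(13): maximise L_d subject to
    sum_i alpha_i * y_i = 0 and 0 <= alpha_i <= C."""
    l = len(y)
    G = (y[:, None] * y[None, :]) * (X @ X.T)   # G[i, j] = y_i y_j (x_i . x_j)

    def neg_dual(a):                            # minimise -L_d, Eq. (10)
        return 0.5 * a @ G @ a - a.sum()

    def neg_dual_grad(a):
        return G @ a - np.ones(l)

    cons = {"type": "eq", "fun": lambda a: a @ y}          # constraint (11)
    res = minimize(neg_dual, np.zeros(l), jac=neg_dual_grad,
                   bounds=[(0.0, C)] * l, constraints=[cons],
                   method="SLSQP")
    alpha = res.x
    w = (alpha * y) @ X                                    # Eq. (12)
    # b from the KKT condition (13), using margin support vectors (0 < alpha < C)
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    b = np.mean(y[sv] - X[sv] @ w)
    return alpha, w, b

# Toy usage on two well-separated Gaussian clouds
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(+2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
alpha, w, b = svm_dual(X, y)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))                     # Eq. (7)
print("w.w =", w @ w, "sum of alpha over SVs =", alpha[alpha > 1e-6].sum())  # Eq. (14)
```

On separable data with a large C, the last two printed quantities coincide, which is the empirical check suggested by relation (14).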
When C tends towards $+\infty$, the classification rule tries to minimise the number of misclassified data, with no thought given to the margin. When C tends towards 0, the classification rule tries to maximise the margin with no concern for the number of misclassified data, which can sometimes lead to an absurd solution. It will be easier for users to adjust C if they have a good knowledge of the training database, in particular the noisy aspect of the data.

2.2.3. Embedding data in a higher dimension space
The previous section proposed a method that produced a solution as a trade-off between acceptable class separation and low misclassification in the original space. Therefore, the idea is to design a new space of representation F of higher dimension, using a mapping function $\Phi$, where the data could be perfectly separated with a hyperplane. It can reasonably be expected that such a hyperplane is more likely to be obtained if the new space has a higher dimension than the original one:
$\Phi : \mathbb{R}^d \to F, \quad x \mapsto \Phi(x).$

If such a function exists, then the algorithm studied in the preceding sections applies again, becoming

$L_d(w, b, \alpha) = \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, \Phi(x_i) \cdot \Phi(x_j).$   (18)

However, finding the best mapping function $\Phi$ is not an easy task, as the dimension of the space F can be infinite. In order to avoid this drawback, Scholkopf et al. (1999) suggested the introduction of a new function K (the kernel) such that

$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j).$

Knowing K, there is no need to know the mapping function $\Phi$, and the optimisation algorithm (18) becomes

$L_d(w, b, \alpha) = \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$   (19)

$\alpha_i \left[ y_i \left( \sum_{j} \alpha_j y_j K(x_i, x_j) + b \right) - 1 \right] = 0,$   (20)

$f(x_{test}) = \mathrm{sgn}\left( \sum_{i=1}^{n_{sv}} \alpha_i y_i K(x_i, x_{test}) + b \right), \quad 0 \leq \alpha_i \leq C.$   (21)

In the space F, the user keeps the possibility of accepting misclassified data with the same parameter C, as defined in the previous sub-section. The linear boundary defined in F becomes non-linear in the original space. The usual kernels encountered in the SVM literature are linear, Gaussian, polynomial, sigmoid, spline and Fourier. To be defined as a kernel, any function K has to satisfy Mercer's conditions (Scholkopf et al., 1999).

The current state of the art (Cristianini et al., 2001) suggests the use of a Gaussian kernel, such as

$K(x_i, x_j) = \exp\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma^2} \right).$

Such a kernel has the capacity to solve a large variety of classification problems, and can therefore be seen as a universal tool. $\sigma$ is a tuning parameter and refers to the bandwidth.

2.3. Kernel alignment

2.3.1. Definition of the alignment

- Let the $l$ data be a training database $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, $i = 1, \ldots, l$.
- Let $y$ be the $l$-dimensional column vector of labels for these data.
- Let K be the dot product matrix with $K(i, j) = \Phi(x_i) \cdot \Phi(x_j)$, as previously seen.
- Let $\langle K_1, K_2 \rangle_F$ be the Frobenius product between the two matrices $K_1, K_2$, defined as

$\langle K_1, K_2 \rangle_F = \sum_{i,j=1}^{l} K_1(i, j) \, K_2(i, j).$   (22)

The expression of the alignment between K and the matrix defined by $yy'$ is the following:

$A = \dfrac{\langle K, yy' \rangle_F}{\sqrt{\langle K, K \rangle_F \, \langle yy', yy' \rangle_F}}, \quad \text{with } -1 \leq A \leq +1.$   (23)
2.3.2. Properties of kernel alignment
The notion of alignment may be seen as a correlation coefficient between two random variables. Thus, the alignment gives an indication of the similarity between two descriptions of the same data set, K and $yy'$, where intra-class variance is minimised and inter-class variance is maximised (Cristianini et al., 2001; Kandola et al., 2002). K is the dot product matrix of each individual with the others in F, and $yy'$ can be seen as a target matrix. Therefore, an ideally designed kernel with respect to A will reveal two independent structures for a two-class classification problem. Fig. 1 illustrates this alignment concept, detailing three values of the bandwidth parameter $\sigma$. A maximum alignment is obtained for $\sigma = 0.9$ (Fig. 1b), for which the matrix K reveals two structures (see Fig. 1c2). For a very low value of $\sigma$, the matrix K is diagonal: each sample is orthogonal to all the others and can be thought of as its own class. For a high $\sigma$, the matrix K defines a single group. Achieving an SVM with these extreme values will certainly imply either overfitting or difficulties in the choice of the C parameter to minimise the cost function (15). Kernel alignment may be one way to choose the kernel parameters, i.e. the bandwidth $\sigma$ for a Gaussian kernel.
Fig. 1. Kernel alignment for a basic classification problem with two classes: (a) original data; (b) evolution of the alignment A versus the Gaussian kernel bandwidth σ; (c1)-(c3) the dot product matrix K represented for three values of σ: 0.1, 0.9 (best alignment) and 5, respectively. In (c2), the dotted lines illustrate the two different patterns.
Once σ is found, the SVM is performed with various values of C, depending on the acceptable number of misclassified individuals during the training process.
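As an illustration of Eqs. (22)-(23) and of this tuning strategy, the sketch below computes the alignment of a Gaussian kernel over a grid of bandwidths and keeps the best one. It is plain NumPy; the grid values, toy data and function names are illustrative and not those used in the study.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Dot product matrix K(i, j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def alignment(K, y):
    """Kernel target alignment A of Eq. (23), with target matrix yy'."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / np.sqrt(np.sum(K * K) * np.sum(Y * Y))

def best_bandwidth(X, y, sigmas):
    scores = [alignment(gaussian_gram(X, s), y) for s in sigmas]
    return sigmas[int(np.argmax(scores))], scores

# Example: sweep a small grid of bandwidths on toy two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.5, (30, 5)), rng.normal(+1, 0.5, (30, 5))])
y = np.hstack([-np.ones(30), np.ones(30)])
sigma_best, scores = best_bandwidth(X, y, sigmas=[0.1, 0.5, 0.9, 1.6, 5.0])
print("best sigma by alignment:", sigma_best)
```

Only the kernel parameter is fixed in this way; the C parameter still has to be selected afterwards, as described above.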
2.4. Factor analysis in space F: k-PCA

2.4.1. Linear PCA
PCA is an orthogonal transformation that enables a subspace of $\mathbb{R}^d$ to be obtained with a minimum loss of information. The most interesting part of such a factor analysis is the reduction of the dimension, where it becomes easier to represent individuals, de-noise data and perhaps reveal different classes. This subspace is obtained by computing the eigenvectors of the covariance matrix T, as defined by the following relation, where the individuals are centred:

$T = \dfrac{1}{l} \sum_{j=1}^{l} x_j x_j'.$   (24)

Each vector $v$ composing the new orthogonal basis must maximise the projected variance $v' T v$ (cost function) with a unity norm $v'v = 1$. Here, the optimisation problem has equality constraints, where the solution is given by

$\lambda v = T v.$   (25)

The first vector $v$ is the eigenvector corresponding to the highest eigenvalue $\lambda$ of the matrix T. The number of non-zero eigenvalues is equal to the rank of T.

2.4.2. k-PCA
The interest of embedding data into the expanded space F using a mapping function $\Phi$, such as

$\Phi : \mathbb{R}^d \to F, \quad x \mapsto \Phi(x),$

has already been shown. The aim of kernel PCA (k-PCA) (Mika et al., 1999; Scholkopf et al., 1998) is to complete a linear PCA in F, i.e. to find the orthogonal axes in this space onto which the individuals $\Phi(x)$ are projected. Since the embedding is done with a non-linear relationship, each hyperplane in F becomes non-linear in $\mathbb{R}^d$. As for classical PCA, this subspace is defined by the successive eigenvectors V of the covariance matrix $T_\Phi$ (after centring), with

$T_\Phi = \dfrac{1}{l} \sum_{j=1}^{l} \Phi(x_j) \Phi(x_j)'.$   (26)

The same problem as studied previously is found again, where the new orthogonal projection basis is given by

$\lambda V = T_\Phi V.$   (27)
Defining V as a linear combination of the terms $\Phi(x)$, such as $V = \sum_{j=1}^{l} \mu_j \Phi(x_j)$ with $\mu_j$ the expansion coefficients, and defining K as the dot product matrix with $K(i, j) = \Phi(x_i) \cdot \Phi(x_j)$, Eq. (27) becomes

$l \lambda \mu = K \mu.$   (28)

The problem reduces to finding the $l$ eigenvectors $\mu^k$ of K ($k = 1, \ldots, l$). As with linear PCA, the kth vector of the new basis must be normalised, for instance $(V^k \cdot V^k) = 1$, which leads to

$1 = \lambda_k \, (\mu^k \cdot \mu^k).$   (29)
Fig. 2 illustrates the k-PCA for the case of two concentric classes defined in $\mathbb{R}^2$ with 168 individuals. If the data are embedded in the space F with a higher dimension, 168 principal axes can be obtained and enable the emergence of linear separability in F from a non-linear separability in $\mathbb{R}^2$. The background grey level in Fig. 2a and b gives the projection value of each point of this initial space onto the first and second principal axes obtained in F. The first principal axis already reveals the two concentric classes. Fig. 3 shows the projection in F of the embedded data onto the first three principal axes. A linear separation can easily be considered starting from this projection. Keep in mind that the concentric lines of constant projection (iso-lines in Fig. 2a and b) are straight lines in F.

Fig. 3. Projection onto the first three principal axes for the case of two concentric classes defined in R². The elements of each class are represented by black stars and squares.

2.4.3. k-PCA algorithm
For a chosen kernel K, the successive steps of the k-PCA algorithm are:

- Compute the dot product matrix K.
- Find the eigenvectors $\mu^k$ of K.
- Normalise $\mu^k$ with (29).
The projection of x onto the nth eigenvector in F is computed as

$(V^n \cdot \Phi(x)) = \sum_{i=1}^{l} \mu_i^n \, (\Phi(x_i) \cdot \Phi(x)),$   (30)

i.e.

$(k\text{-PCA})_n(x) = \sum_{i=1}^{l} \mu_i^n \, K(x_i, x).$   (31)
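The following NumPy sketch follows these steps (Eqs. (28)-(31)) for a Gaussian kernel. It is only an illustration: the centring of K, which corresponds to the centred covariance of Eq. (26) and is assumed in the algorithm above, is written out explicitly, and all names and the concentric toy data are illustrative.

```python
import numpy as np

def gaussian_gram(X, Z, sigma):
    """K(i, j) = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kpca_fit(X, sigma, n_components):
    l = X.shape[0]
    K = gaussian_gram(X, X, sigma)
    # Centre K, i.e. work with the covariance of the centred Phi(x_j), Eq. (26)
    One = np.full((l, l), 1.0 / l)
    Kc = K - One @ K - K @ One + One @ K @ One
    # Eigen-decomposition of Kc solves (28); eigh returns ascending eigenvalues
    lam, mu = np.linalg.eigh(Kc)
    lam, mu = lam[::-1][:n_components], mu[:, ::-1][:, :n_components]
    # Normalise so that (V^k . V^k) = 1, i.e. 1 = lambda_k (mu^k . mu^k), Eq. (29);
    # the eigenvalue of K is l * lambda, cf. (28)
    mu = mu / np.sqrt(np.maximum(lam / l, 1e-12))
    return X, K, One, mu, sigma

def kpca_transform(model, Xnew):
    """Project points with Eq. (31), using the same centring as in fitting."""
    X, K, One, mu, sigma = model
    Knew = gaussian_gram(Xnew, X, sigma)
    Onew = np.full((Xnew.shape[0], X.shape[0]), 1.0 / X.shape[0])
    Kc = Knew - Onew @ K - Knew @ One + Onew @ K @ One
    return Kc @ mu

# Toy usage: two concentric classes (168 points, as in Fig. 2) become
# almost linearly separable on the first non-linear principal axes
rng = np.random.default_rng(2)
t = rng.uniform(0, 2 * np.pi, 84)
inner = 0.5 * np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(84, 2))
outer = 2.0 * np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(84, 2))
X = np.vstack([inner, outer])
scores = kpca_transform(kpca_fit(X, sigma=1.0, n_components=3), X)
print(scores.shape)   # (168, 3): the first three non-linear PCs
```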
Fig. 2. k-PCA with a Gaussian kernel for the case of two concentric classes defined in R². The background grey level gives the projection value of this space onto the first principal axis (a) and the second principal axis (b) obtained in F. The elements of each class are represented by white discs and black stars.

2.4.4. k-PCA properties
The k-PCA properties are the same as those of linear PCA:

- The obtained basis is orthogonal.
- Projections onto the first eigenvectors carry the main part of the variance.
- PCs are uncorrelated.

On the other hand, k-PCA allows up to $l$ PCs (the number of individuals) to be found, compared with $d$ PCs (the dimension of the initial space) for linear PCA.
k-PCA may be computationally more expensive than linear PCA because it needs to embed the data in F before performing a PCA. However, according to Scholkopf et al. (1998), the use of a linear SVM is often enough for classification after extracting the PCs in the space F; that is to say, the expression of the kernel function is simply $K(x_i, x_j) = (x_i \cdot x_j)$.
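A hedged sketch of that combination (non-linear PCs followed by a linear classifier), using scikit-learn for brevity; the number of components, the C value and the random stand-in data are illustrative and not those of the study.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Gaussian (RBF) k-PCA followed by a linear SVM, as suggested above.
# scikit-learn's RBF kernel is exp(-gamma * ||x - z||^2), so gamma = 1 / (2 sigma^2).
sigma = 1.6
model = make_pipeline(
    KernelPCA(n_components=20, kernel="rbf", gamma=1.0 / (2.0 * sigma**2)),
    LinearSVC(C=1.0),
)

# Random toy data standing in for the (samples x wavelengths) spectral matrix
rng = np.random.default_rng(3)
X = rng.normal(size=(102, 546))
y = np.where(rng.normal(size=102) > 0, 1, -1)
model.fit(X, y)
print(model.score(X, y))
```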
3. Data collection and analysis

3.1. Principle and protocol of data acquisition
The equipment for infrared analysis used fibre-optic technology, which was able to be adapted for two types of measurement, depending on the nature of the fabric sample. For the present study, as the samples were solid, measures of diffuse reflection were obtained using a Michelson interferometer (DiffusIR), in combination with an NIR spectrometer (MB160D, Bomem). The Michelson interferometer (diffuse reflection) used a quartz halogen lamp as the NIR source. The light emitted by this NIR source was rendered parallel by a collimator mirror. The light was then modulated by the interferometer to produce the interferogram. An intense beam of modulated light 50 mm in diameter was directed onto the fabric sample, which absorbed part of the light, thus modifying the interferogram. The remaining (diffused) light was directed into the spectrometer by use of a parabolic mirror, which enabled the sensitivity of the measurement to be improved. The surface measured was sufficiently large to obtain a representative measure of the average reflection of the fabric sample. The measurement was performed in two steps:
- Firstly, a test was performed on a "reference disc" that was placed on the detection surface. The detection surface was cleaned before each series of tests.
- The second step consisted of measuring each sample individually. Each fabric sample was covered by the reference disc in order to neutralise the other light sources present in the laboratory, which might otherwise have altered the measurement.

Each sample was measured an equal number of times on each side, in order to obtain a more representative result. At least two measures were taken for each side, before taking an average of the results to obtain a mean spectrum for each tissue sample.

3.2. Data inspection
NIR spectra were obtained from 162 tissue samples. For a given sample, wavelength absorption depends on the sample components, as well as on all the successive treatments the sample has undergone (i.e. working fibres, knitting, etc.). The NIR spectrum reflects both composition and tissue processes; thus the NIR spectra of two identical tissue samples that have undergone different treatments may show different absorption. Fig. 4 shows the mean spectrum for each class of the training database. It can be seen that potentially discriminating wavelengths occur around data points 80, 150, 180 and 250. Given such easily observable differences, a decision rule based on the spectral forms should be expected. The NIR wave number $\nu$ ranged from 3800 to 10,000 cm⁻¹, which corresponds to a wavelength $\lambda$ ranging from 1000 to 2631 nm ($\lambda = 10^7/\nu$). The device resolution led to 806 points per spectrum.

3.3. Pre-processing

3.3.1. Limiting the spectrum to the first 546 points
After visual observation of the signals, it was decided to reduce the upper limit of each spectrum to $\nu = 8000$ cm⁻¹ (i.e. the first 546 spectral points), since the rejected area obviously did not contain any discriminant information.

3.3.2. Standard normal variate (SNV) transformation
A light scattering effect may accompany the experimental process, whereby samples with the same components can show differences in absorption at the same wavelengths and strongly distort the spectrum shape.
Fig. 4. Mean NIR spectrum for the two classes of the training database. Data are limited to the first 546 points.
Specific functions are devoted to minimising the effect of light scattering, which would otherwise increase the boundary complexity of the classification task. Previous studies have suggested centring and normalising spectra before any treatment (Barnes et al., 1989; Candolfi et al., 1999; Dhanoa et al., 1994, 1995). The SNV transformation removes the slope variation from spectra caused by scatter and variation of particle size. The transformation was applied to each spectrum $i$ of length $p$ individually, by subtracting the spectrum mean and scaling with the spectrum standard deviation, with the following relationship:

$A_i^{SNV} = \dfrac{A_i - \bar{a}_i}{\sqrt{\sum_{j=1}^{p} (A_{i,j} - \bar{a}_i)^2 / (p - 1)}}, \quad \text{with } \bar{a}_i = \dfrac{1}{p} \sum_{j=1}^{p} A_{i,j}.$   (32)
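A minimal sketch of this per-spectrum transformation (Eq. (32)), assuming the spectra are stored row-wise in a NumPy array; the variable names and the random toy data are illustrative.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate transform, Eq. (32): each spectrum (row) is
    centred by its own mean and scaled by its own standard deviation."""
    A = np.asarray(spectra, dtype=float)
    mean = A.mean(axis=1, keepdims=True)          # a_bar_i
    std = A.std(axis=1, ddof=1, keepdims=True)    # denominator with (p - 1)
    return (A - mean) / std

# Example: 162 spectra limited to their first 546 points
spectra = np.random.default_rng(4).normal(size=(162, 806))[:, :546]
snv_spectra = snv(spectra)
print(snv_spectra.mean(axis=1)[:3], snv_spectra.std(axis=1, ddof=1)[:3])  # ~0 and ~1 per row
```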
Notice that the SNV transformation is different from the classical data reduction used in statistics, this transformation being applied to each spectrum separately.

3.4. Final database

The main database used for the method design and validation in the present study was composed of 162 samples: 84 samples labelled "without elastane" (−1) and 78 samples labelled "with elastane" (+1). A smaller database (the test set), composed of 24 samples, 12 labelled "without elastane" (−1) and 12 labelled "with elastane" (+1), was used to assess the overall method performance. All of these spectra were limited to the first 546 points, before being centred and normalised by the SNV transformation.

4. Results and discussion

4.1. Cross-validation process
From the main database, 30 samples labelled −1 and 30 samples labelled +1 were chosen randomly as the validation set, with the remaining 102 samples used as the learning subset. The decision rule was built from the learning subset, validated with the validation set (k-leave-one-out method), and finally tested with the test set. This operation was performed 50 times in order to build a basic criterion of performance, defined and calculated as:
- Average rate of success with the learning data set (estimation step): $\mu_{rate}^{E}$.
- Average rate of success with the validation data set (validation step): $\mu_{rate}^{V}$.
- Average rate of success with the test data set (prediction step): $\mu_{rate}^{P}$.
The rate of success in estimation, validation and prediction expresses the percentage of individuals correctly classified compared to the number of data from each
database (102 for estimation, 60 for validation and 24 for prediction). All these subsets were common to all the discriminating methods in order to compare their performances, i.e. LDA, SIMCA, SVM, and linear SVM associated with PCA and k-PCA.

4.2. LDA and SIMCA results
The Fisher decision rule was easily computed following Eqs. (1) and (2). The results are shown in Table 1. The SIMCA method was applied with the help of the Unscrambler software from CAMO Process Company (http://www.camo.com). SIMCA enables tolerance intervals to be defined (Eriksson et al., 2001), for which one sample can be assigned to both classes or simply rejected. For these reasons, three parameters were added to the basic criterion of performance: the percentages of individuals badly assigned, not assigned and assigned to both classes. The results of this criterion are shown in Table 2, where the two last columns give the false alarms made when samples were assigned to both classes. For example, the "With elastane" column represents the percentage of samples labelled a priori "without elastane" (−1) that the machine also assigned to the other class (with elastane).

The results for both the LDA and SIMCA methods show that a linear separation does not exist between the classes. Regarding Fisher LDA, the rule could not even correctly classify the data from the learning set (46% misclassified). For SIMCA, the percentage of samples assigned to both classes shows the high interlacing of the two classes, i.e. a high overlapping between the tolerance intervals. Regarding the two last columns of Table 2, it can be seen that this overlapping concerns the two classes of samples in equal proportions. Finally, these results show a weak capacity to discriminate elastane from the spectral forms from a linear point of view, whereas the wavelengths around data points 80, 150, 180 and 250 show differences (on average) between the spectra. Therefore, the use of a non-linear method such as SVM can be perfectly justified.

Table 1
Results of the performance criterion for the LDA learning machine

                   Rate of success (correctly assigned) (%)
Estimation step    54
Validation step    51
Test step          50
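For reference, the Fisher rule of Eqs. (1) and (2), whose results are reported in Table 1 above, reduces to a few lines of linear algebra. The sketch below is a hedged illustration on toy data, not the study's spectra; all names are illustrative.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher rule of Eqs. (1)-(2): w = (S1/2 + S2/2)^(-1) (M2 - M1),
    with the threshold b placed between the projected class means."""
    M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    w = np.linalg.solve(0.5 * S1 + 0.5 * S2, M2 - M1)
    b = -0.5 * (M1 + M2) @ w     # h(x) = x.w + b changes sign between the two means
    return w, b

# Toy check: the two projected class means fall on opposite sides of the threshold
rng = np.random.default_rng(6)
X1 = rng.normal(0.0, 1.0, (50, 10))     # stand-in for one class
X2 = rng.normal(1.0, 1.0, (50, 10))     # stand-in for the other class
w, b = fisher_lda(X1, X2)
print(X1.mean(axis=0) @ w + b, X2.mean(axis=0) @ w + b)   # negative, positive
```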
Table 2
Classification results with the SIMCA method

                   Correctly   Badly      Not        Assigned to    False alarm       False alarm
                   assigned    assigned   assigned   both classes   (with elastane)   (without elastane)
Estimation step    21.5        1.9        4.3        72.3           73.69             70.91
Validation step    22          3.5        6          68.5           60.34             76.66
Prediction step    18.3        1.25       8.33       72.12          66.74             77.50

These results express the classification percentage of individuals with respect to the number of data for each database, i.e. 102 for estimation, 60 for validation and 24 for prediction.
4.3. SVM and kernel alignment benefits
All decision rules were built with a Gaussian kernel due to its universal capacity (Cristianini et al., 2001), with two tuning parameters: the bandwidth $\sigma$ and the coefficient C. The most important and difficult aspect of SVM is the determination of these two parameters; therefore the strategy used was to sweep different combinations of $\sigma$ and C. The seven tentative values for $\sigma$ were {0.5, 1, 1.6, 2, 5, 10, 20}, while the seven tentative values for C were {1, 5, 10, 20, 30, 100, inf}, thus providing 49 possible combinations to optimise the classification rule with respect to $\sigma$ and C. The same basic performance criterion was computed with two additional parameters: the average margin and the average number of support vectors.

Figs. 5 and 6 illustrate the effect of the pair (C, $\sigma$). The higher the value of C, the greater the minimisation of the number of misclassified data in the estimation step, without regard for the margin. This result favours an increase in the complexity of the decision boundary, leading to a corresponding reduction of the generalisation ability. Choosing a very low value of $\sigma$, such as 0.5, is not judicious, as can be seen in Fig. 5, where such a choice implies that each sample becomes a support vector. In this case, the average rate of success in estimation is 100%, but the model is obviously overfitted (Fig. 6). It should be kept in mind that the aim of SVM is to summarise all the information with a few individuals from the learning set, which may ensure a good a posteriori prediction.

Finally, these results reveal a good combination of parameters with $\sigma = 1.6$ and C = 30. For this combination, the margin is 0.15, the number of support vectors is 72, and the rate of success in estimation is 100%, with 83% success for validation and 89% for prediction (Fig. 7). The cross-validation process required 50 machines to be built for each combination (C, $\sigma$). It was computationally more expensive to find a good combination of $\sigma$ and C in this way, as 2450 SVMs were necessary (50 machines for each of the 49 combinations of $\sigma$ and C).

The alignment method was applied in order to obtain a value of the bandwidth for which the dot product matrix K was the most appropriate for the classification problem. Fig. 8a shows the evolution of the Gaussian kernel alignment versus the bandwidth $\sigma$. It reveals that the optimal alignment was obtained for $\sigma = 1.6$. Using kernel alignment offers a good alternative for tuning the SVM by finding the most adapted value for the kernel parameter.
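A hedged sketch of this tuning strategy, using scikit-learn's SVC for brevity: the alignment of Eq. (23) fixes σ first, after which only C is swept, instead of the full 49-combination grid. The helper names, the random toy data and the exact cross-validation scheme are illustrative, not the study's own code.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def gaussian_gram(X, sigma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))

def alignment(K, y):
    Y = np.outer(y, y)
    return np.sum(K * Y) / np.sqrt(np.sum(K * K) * np.sum(Y * Y))

# Toy stand-in for the learning subset (102 SNV-treated spectra of 546 points)
rng = np.random.default_rng(5)
X = rng.normal(size=(102, 546))
y = np.where(rng.normal(size=102) > 0, 1, -1)

# Step 1: pick sigma by kernel target alignment
sigmas = [0.5, 1, 1.6, 2, 5, 10, 20]
sigma = max(sigmas, key=lambda s: alignment(gaussian_gram(X, s), y))

# Step 2: sweep C only, with cross-validation (scikit-learn's gamma = 1 / (2 sigma^2))
for C in [1, 5, 10, 20, 30, 100]:
    svc = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma**2))
    print(C, cross_val_score(svc, X, y, cv=5).mean())
```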
Fig. 5. Evolution of the margin and the number of support vectors (NSV) versus the C parameter for different values of the bandwidth σ.
The remaining step was to find the best value of C. Fig. 8b shows the image of the matrix K for the optimal alignment σ = 1.6. The difficulty in obtaining the structure corresponding to the first class (without elastane) can be seen, whereas the form due to the second class (with elastane) is clearly visible.
Fig. 6. Rate of success for the estimation step versus the C parameter for different values of the bandwidth σ.
Fig. 8. Kernel alignment results: (a) evolution of the alignment versus the Gaussian kernel bandwidth σ, with a maximum for σ = 1.6, and (b) the dot product matrix K represented for σ = 1.6.
4.4. k-PCA highlights with linear SVM results
Fig. 7. Rate of success for the validation and the test steps versus the C parameter for different values of the bandwidth σ.
In order to compare the PCA and k-PCA results, the first step was to perform a PCA on the learning set (main set) in order to reach a maximum of 546 PCs, i.e. the dimension of the learning set. The learning, validation and test sets were then projected onto these PCs. After this step, a linear SVM was built for an increasing number of PCs, with $K(x_i, x_j) = x_i \cdot x_j$ and C = ∞. Only six non-zero eigenvalues were obtained for this step. The same procedure was carried out in F after embedding the data with a Gaussian kernel and a bandwidth $\sigma = 1.6$, with the difference being that only 102 PCs were feasible, i.e. the number of samples in the learning set. Fig. 9 shows the poor results obtained with the linear PCs. With the maximum number of PCs, the rate of success in estimation was 73.7%, falling to 49.2% for validation and 52.5% for prediction. Performing a PCA in the initial space enabled noise to be removed from the data with a minimum loss of
Fig. 9. Linear SVM associated with PCA.
information, slightly improving the results when compared to Fisher LDA. The high discriminating power of the non-linear PCs can be seen in Fig. 10, where better results were obtained for the non-linear case even though the aim of PCA is not to find discriminating axes. Hence a wide choice of PCs is feasible. This possible choice illustrates that the data in F are slightly noisy, and it becomes possible to select a few non-linear components (Mika et al., 1999) with the aim of obtaining a satisfying number of support vectors, without loss of classification ability.

5. Conclusion

For this practical application with only a limited amount of training data, SVMs were able to quickly and successfully identify, in a qualitative manner, samples containing elastane, which was not the case for classical methods such as LDA or SIMCA. Obviously, it would be necessary to continue the learning step by increasing the number of samples in the database in order to get a better decision boundary with additional support vectors, with a corresponding improvement in prediction.

In Belousov et al.'s (2002) paper, one of the remaining questions was related to the selection of the optimal kernel function. An initial response could be proposed with the kernel target alignment method. As seen in this paper, for a given kernel this method tries to obtain the best dot product matrix K with respect to the a priori knowledge
about the training database classes. Kernel alignment tries to increase the discriminant property of the dot product matrix by minimising the intra-class variance and maximising the inter-class variance. It can then be used as a means of initially selecting the best kernel parameters for the SVM before optimising C.

Non-linear PCs enable non-linear discriminating relationships between individuals in the initial space to be taken into account, something that is not possible with linear PCA. In other words, k-PCA demonstrates the plasticity of the data. As explained in Scholkopf et al. (1998), after performing a k-PCA, a linear SVM was sufficient to produce good results in prediction. Moreover, while preserving a good classification, it was also possible to select some non-linear components (not necessarily the first ones) according to the desired margin or number of support vectors.

In the present study, any new tissue sample was classified by studying the sign of $f(x_{test}) = \mathrm{sgn}(x_{test} \cdot w + b)$, which assigned the sample to one of the two classes (with or without elastane). Another option would be to follow the evolution of samples for diagnosis by increasing the number of possible assignments, much like a sigmoid function in a neural network. In this way, a soft margin decision could be obtained, whereby a continuous decision surface is built which could produce different clusters in each class, and could highlight the role of some a priori knowledge such as the type of treatment (i.e. working fibres, knitting), the colour of the sample, etc.
Fig. 10. Linear SVM associated with k-PCA.
Finally, only Gaussian kernels were considered in this paper, for their universal ability to solve a large variety of classification problems. In the present case, the Gaussian kernel was constructed without reference to the data set analysed, with the obvious exception of the optimisation of σ. Such a generic approach could therefore be suitable for use in a large range of applications. In contrast, the use of an ad hoc kernel might have resulted in a slightly better classification, but would have been extremely unlikely to be applicable to other data configurations. However, other kernels could be tested to analyse their impact on the dot product matrix K, notably their ability to discriminate between different patterns. Moreover, in order to improve the accuracy of the decision boundary for the present study, it would be possible to remove outliers from the learning database, something that was not performed here.
Acknowledgements

We would like to thank S. Gunn and B. Schölkopf for their academic software designed with Matlab, which was a precious help in creating our own software used by the IFTH (Institut Français du Textile et de l'Habillement). We would also like to thank Lennart Eriksson for his help, in particular his book (Eriksson et al., 2001) on multivariate analysis methods used in chemometrics.
References

Barnes, N.J., Dhanoa, M.S., Lister, S.J., 1989. Standard normal variate transformation and detrending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 43, 772–777.
Belousov, A.I., Verzakov, S.A., Frese, J., 2002. A flexible classification approach with optimal generalisation performance: support vector machines. Chemometr. Intell. Lab. 64, 15–25.
Burges, C., 1998. A tutorial on support vector machines. Data Min. Knowl. Discov. 2, 121–169.
Candolfi, A., De Maesschalck, R., Jouan-Rimbaud, D., Hailey, P.A., Massart, L., 1999. The influence of data pre-processing in the pattern recognition of excipients near-infrared spectra. J. Pharmaceut. Biomed. 21, 115–132.
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK.
Cristianini, N., Elisseeff, A., Shawe-Taylor, J., Kandola, J., 2001. On kernel target alignment. NeuroCOLT Technical Report NC-TR-01-099.
Dhanoa, M.S., Lister, S.J., Sanderson, R., Barnes, R.J., 1994. The link between multiplicative scatter correction (MSC) and standard normal variate (SNV) transformations of NIR spectra. J. Near Infrared Spectrosc. 2, 43–47.
Dhanoa, M.S., Lister, S.J., Barnes, R.J., 1995. On the scales associated with near-infrared reflectance difference spectra. Appl. Spectrosc. 49, 765–772.
Eriksson, L., Johansson, E., Kettaneh-Wold, N., Wold, S., 2001. Multi- and Megavariate Data Analysis: Principles and Applications. Umetrics AB, Umea, Sweden.
Gunn, S., 1998. Support vector machines for classification and regression. Technical Report, 10 May 1998. Department of Electronics and Computer Science, University of Southampton, Southampton.
Kandola, J., Shawe-Taylor, J., Cristianini, N., 2002. On the extensions of kernel alignment. NeuroCOLT Technical Report NC-TR-02-120.
Keinosuke, F., 1990. Introduction to Statistical Pattern Recognition, second ed. Academic Press, San Diego, CA.
Mika, S., Schölkopf, B., Smola, A.J., Müller, K.-R., Scholz, M., Rätsch, G., 1999. Kernel PCA and de-noising in feature spaces. In: Advances in Neural Information Processing Systems, Denver, CO, USA.
Scholkopf, B., Smola, A., Muller, K.-R., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10, 1299–1319.
Scholkopf, B., Mika, S., Burges, C., Knirsch, P., Muller, K.-R., Ratsch, G., Smola, A., 1999. Input space vs. feature space in kernel-based methods. IEEE Trans. Neural Networks 10, 1000–1017.
Wold, S., 1976. Pattern recognition by means of disjoint principal components models. Pattern Recogn. 8, 127–139.
Wold, S., Sjostrom, M., 1977. SIMCA: a method for analysing chemical data in terms of similarity and analogy. In: Chemometrics Theory and Application, American Chemical Society Symposium Series, No. 52, Washington, DC, USA.