Similarity measure method based on spectra subspace and locally linear embedding algorithm


Infrared Physics and Technology 100 (2019) 57–61

Contents lists available at ScienceDirect

Infrared Physics & Technology journal homepage: www.elsevier.com/locate/infrared

Regular article




Yuhua Qin a,*, Kai Duan b, Lijun Wu b, Baoding Xu c

a College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
b R&D Center, China Tobacco Yunnan Industrial Co., Ltd, No. 367 Hongjin Road, Kunming 650231, China
c College of Information Science and Engineering, China Ocean University, Qingdao 266100, China

ARTICLE INFO

Keywords: Spectra subspace; Geodesic distance; Locally linear embedding algorithm; Near infrared spectroscopy; Similarity measure

ABSTRACT

The high dimensionality, redundancy, noise and nonlinearity of near infrared (NIR) spectra data make similarity measurement difficult. This paper presents SSLLE, a similarity measure method based on spectra subspaces and the locally linear embedding (LLE) algorithm. First, we divided the high dimensional spectra data into several subspaces according to the absorption bands of the major chemical compositions, which effectively avoids the influence of irrelevant features and noise and reduces the dimension and computational complexity of the LLE. Then, we modified the LLE algorithm by using the geodesic distance instead of the Euclidean distance, which addresses the measurement problem of the Euclidean distance in high dimensional space. To make the samples more evenly distributed, the distance calculation in LLE was also modified. For each spectra subspace, the distance matrix was calculated from the embedding obtained by mapping the high dimensional space with the modified LLE. The spectral similarity matrix of the sample set was then obtained by adding the individual distance matrices of all subspaces, so that the sample with the highest similarity can be found. To investigate the effectiveness of the algorithm, the spectral projection of the samples was analyzed first; the results showed that SSLLE distinguished tobacco samples from different areas significantly better than principal component analysis (PCA) and LLE. Second, we compared the results of searching for the sample most spectrally similar to a target tobacco; SSLLE gave the smallest differences in chemical composition and the highest consistency with the experts' recommendations compared with PCA and LLE. It also showed good robustness and precision.

* Corresponding author. E-mail address: [email protected] (Y. Qin).

https://doi.org/10.1016/j.infrared.2019.05.006
Received 28 January 2019; Accepted 12 May 2019; Available online 13 May 2019
1350-4495/© 2019 Elsevier B.V. All rights reserved.

1. Introduction

For cigarette manufacturers, maintaining a stable cigarette flavor is a basic requirement for product quality, and the most important factor is keeping the cigarette formula relatively stable. In industrial production, raw materials are in limited supply. When stocks run short, or when the price or quality of one or more tobacco leaves in the formula fluctuates, another leaf of similar quality must be selected as a replacement. Finding the alternative tobacco that is closest in quality to the target one is therefore essential for the maintenance and design of cigarette products. At present, the selection of alternative tobacco in the cigarette formula mainly depends on sensory evaluation by experts and on the compositional characteristics of the tobacco. However, these methods are time-consuming and inefficient, they involve the subjectivity of the human senses, and product quality remains difficult to control.

In recent years, near infrared (NIR) spectroscopy has been widely applied in the quantitative and qualitative analysis of complex products [1–3], since it is a fast, nondestructive and reliable analysis method. It is currently most mature in composition detection and cigarette authentication [4,5], but applications in cigarette formula aided design and maintenance are lacking. In this paper, a similarity measure technique based on NIR spectra is applied to find the alternative tobacco in the cigarette formula, so as to control and maintain product quality.

Commonly used spectral similarity measures include the Euclidean distance, the Mahalanobis distance [6] and cosine similarity [7], of which the Euclidean distance is used most often. However, NIR spectra data with up to several thousand dimensions suffer from the "curse of dimensionality" [8,9]: the discrimination between the nearest and the farthest neighbors in the sample set becomes weak in high dimensional spaces. Popular dimension reduction methods such as principal component analysis (PCA) [10] are blind to nonlinear structure and fail to recover the underlying nonlinear structure of the data, which leads to inaccurate results. Locally linear embedding (LLE) [11] is a nonlinear dimension reduction method. It preserves local structure, which means that points that are nearby in the high dimensional space remain nearby, and similarly located with respect to each other, in the low dimensional space. Ramirez-Lopez et al. [12] showed that samples that are close in NIR space are also close in terms of compositional characteristics. They evaluated different distance metrics for finding the most spectrally similar samples, and the results showed that the LLE method better reflects compositional similarity. However, LLE is very time-consuming for high dimensional data. Furthermore, the high redundancy and noise in the NIR feature space [13] greatly degrade the efficiency of the similarity measure.

Taking into account the difficulties of distance measurement in high dimensional space, this paper proposes SSLLE, a similarity measure method based on subspaces of the NIR spectra and a modified LLE algorithm. First, we divided the high dimensional spectra data into several subspaces according to the absorption bands of the major chemical compositions in the spectrum. Then, we modified the LLE algorithm in two ways: (i) the geodesic distance [14] replaces the Euclidean distance, and (ii) a modified distance calculation is introduced. The modified LLE algorithm was applied in each spectra subspace, and the spectral similarity matrix of the sample set was obtained by adding the distance matrices of the subspaces, so that the sample with the highest similarity can be found.
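To make the geodesic-distance idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the geodesic distance is approximated by shortest paths through a k-nearest-neighbour graph with Euclidean edge lengths, computed with Dijkstra's algorithm. The function names and the half-circle toy data are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph: {node: {neighbour: edge length}}."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph symmetric
    return graph


def geodesic_from(graph, source):
    """Dijkstra shortest-path lengths from `source` to every node of the graph."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Toy data (assumed for illustration): five points on a half circle of radius 1.
points = [(math.cos(t * math.pi / 4), math.sin(t * math.pi / 4)) for t in range(5)]
graph = knn_graph(points, k=2)
geo = geodesic_from(graph, 0)[4]        # path length along the curve
euc = euclidean(points[0], points[4])   # straight line across the gap
```

On this toy curve the graph path between the two end points is longer than the straight-line distance, which is the effect the geodesic distance is meant to capture for data lying on a curved structure.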

2. Materials and methods

2.1. Locally linear embedding (LLE) algorithm

The LLE algorithm was introduced by Roweis and Saul [11]. The basic idea of LLE is to find a low dimensional configuration of the data points whose embedding preserves the local geometry of each point's neighborhood. In other words, a data point can be approximately reconstructed by a linear combination of its neighbors in the mapping space. It has been widely used in image processing, gene expression analysis, data visualization, etc. [15–17].

Let $X = \{x_1, x_2, \ldots, x_N\}$ be an input sample set of $N$ points in a high $D$-dimensional data space, and let $Y = \{y_1, y_2, \ldots, y_N\}$ denote the corresponding locally linear embedding of $X$ in the low $d$-dimensional space ($d \ll D$). The embedding $Y$ of $X$ is carried out in three steps:

1. Seek the $k$ nearest neighbors. For each data point $x_i$ ($i = 1, 2, \ldots, N$), find the $k$ nearest neighbors measured by Euclidean distance.

2. Compute the reconstruction weight matrix $W$. Define the following cost function over the $k$ nearest neighbors:

$$J(w) = \sum_{i=1}^{N} \left\| x_i - \sum_{j=1}^{k} w_{ij} x_j \right\|^2 \tag{1}$$

where $x_j$ ($j = 1, 2, \ldots, k$) are the $k$ nearest neighbors of $x_i$, and $w_{ij}$ is the linear weight coefficient between $x_i$ and $x_j$. LLE finds the $w_{ij}$ by minimizing the reconstruction error. Moreover, for each $x_i$ the weights $w_{ij}$ are constrained to sum to 1, that is, $\sum_{j=1}^{k} w_{ij} = 1$. It follows that the constrained weights are invariant to translation. Thus, the reconstruction weights that characterize intrinsic geometric properties in the original data space are equally valid for local pieces of the low dimensional embedding.

3. Compute the low dimensional coordinates. The best low dimensional embedding $Y$ is computed using the reconstruction weight matrix $W$. This corresponds to choosing the low dimensional coordinates that minimize the following cost function:

$$J(y) = \sum_{i=1}^{N} \left\| y_i - \sum_{j=1}^{k} w_{ij} y_j \right\|^2 \tag{2}$$

where $y_j$ ($j = 1, 2, \ldots, k$) are the $k$ nearest neighbors of $y_i$. To ensure the uniqueness of the solution, two constraints are imposed as in Eq. (3):

$$\sum_{i=1}^{N} y_i = 0, \qquad \frac{1}{N} \sum_{i=1}^{N} y_i^{T} y_i = I \tag{3}$$

We can define a sparse, symmetric matrix $M$ from the weight matrix $W$:

$$M = (I - W)^{T} (I - W) \tag{4}$$

The embedding is obtained from the eigenvectors corresponding to the bottom $d + 1$ eigenvalues of $M$, excluding the eigenvector of the smallest eigenvalue, which is the unit vector with eigenvalue equal to 0.

2.2. The modified LLE algorithm

The LLE algorithm is suitable for nonlinear dimension reduction. However, it becomes hard to apply when the dimension of the data is too large. In particular, problems arise for spectra data, which are characterized by high dimensionality, high noise and sparse samples. In view of these problems, we modified the LLE algorithm in the following aspects.

1. Divide the spectra data into multiple subspaces. Redundant information and noise in the spectra data make the similarity measure inaccurate. Moreover, the higher the dimension, the more serious the "curse of dimensionality". Therefore, according to the absorption bands of the $P$ major chemical compositions in the spectrum, the features correlated with the chemical compositions were selected, and $P$ spectra subspaces were generated accordingly. Measuring spectral similarity within the subspaces eliminates the influence of irrelevant features and noise in the spectra, and also reduces the dimension and computational complexity of the LLE.

2. Introduce the geodesic distance instead of the Euclidean distance. In high dimensional space, Euclidean distances may not reflect the intrinsic distance between data points. A data point may then be assigned neighbors that are in fact very distant when the intrinsic geometry of the data is considered [13], which can cause LLE to misinterpret the actual data structure. The geodesic distance between two points can be thought of as their distance along the contour of an object. It is the shortest path between the two points along that contour, so the geodesic distance reflects the topological structure of the data points better than the Euclidean distance. The geodesic and Euclidean distances are compared in Fig. 1. In this paper, Dijkstra's algorithm [18] was employed to compute the geodesic distance between the sample points.

3. Modify the distance computation in LLE. Spectral data samples are usually unevenly distributed. LLE may achieve the desired results with a small $k$ in dense areas of the data, whereas in sparse areas a larger $k$ is often required to maintain the relative position of the sample points. However, the value of $k$ is fixed during the LLE computation. To optimize the distribution of the sample set and make it more even, we modified the distance computation, which is defined as:

$$D_{ij} = \frac{2\, d_G(x_i, x_j)}{M_i + M_j} \tag{5}$$

where $d_G(x_i, x_j)$ is the geodesic distance between the data points $x_i$ and $x_j$, and $M_i$ and $M_j$ are the average distances of $x_i$ and $x_j$ to the other data points, respectively. The modified distance computation makes the data set more evenly distributed and improves the accuracy and efficiency of the algorithm.

Fig. 1. The comparison of Euclidean distance and geodesic distance. The red line is the geodesic distance between the sample points $x_i$ and $x_j$, and the blue line is the Euclidean distance. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

2.3. Similarity measure method based on spectra subspace and LLE (SSLLE)

The proposed similarity measure method SSLLE, based on spectra subspaces and the LLE algorithm, is composed of three phases:

1. The first phase consists in choosing the spectra subspaces $S_1, S_2, \ldots, S_P$ that are strongly correlated with the chemical composition of the tobacco, according to the absorption positions of the $P$ chemical compositions in the spectrum. The spectra data are thus divided into $P$ subspaces.

2. In the second phase, the modified LLE algorithm is used to map each of the $P$ spectra subspaces to its low dimensional space, and the distance matrix of the embedding in each low dimensional space is calculated based on the Euclidean distance. These matrices are denoted by $V_1, V_2, \ldots, V_P$.

3. Each $V_i$ is normalized using the following formula:

$$d_{ij} = \frac{d_{ij} - d_{\min}}{d_{\max} - d_{\min}} \tag{6}$$

Moreover, if $d_{ij} > 0.3$, let $d_{ij} = +\infty$. The final similarity matrix of the sample set is obtained as $V = V_1 + V_2 + \cdots + V_P$. When looking for the sample with the highest spectral similarity to the sample $x_i$, the column holding the minimum value in the $i$th row of $V$ identifies the most similar sample.

2.4. Data sets

We used two data sets provided by a tobacco company. The first one (set A) comprised 260 stock tobacco samples with chemical composition content information, and was used for training the parameters and evaluating the compositional similarity. The second one (set B) comprised 170 pairs of formula adjustment samples (a total of 340 samples) that had previously been adjusted by the formula experts during the maintenance of the cigarette formula. Each pair contains the target sample and the replacement sample. When a certain constituent of the formula is in short supply, the experts find the most similar tobacco as a replacement by assessing the chemical compositions and by sensory evaluation. The two samples of a pair are therefore expected to have a high degree of spectral similarity. Hence, set B was used to test the consistency of the similarity measure methods with the experts.

2.5. Sample preparation and NIR spectra collection

The tobacco samples were dried for 4 h at 40 °C, milled into powder and passed through a 40-mesh sieve. The powder samples were stored in sealed bags before spectra collection. The temperature was controlled at 18–22 °C and the relative humidity was about 55% for stability. NIR spectra were recorded in reflectance mode using a Nicolet Antaris II spectrometer (ThermoFisher Scientific, Waltham, USA) in the range of 10,000 to 4000 $\mathrm{cm}^{-1}$. Each sample was measured in triplicate and the spectra were averaged to give the final spectrum. In addition, we applied Norris-Gap first derivative preprocessing (with a gap size of 11) to correct the effect of baseline offsets and highlight the useful information. The algorithm described in Section 2.3 was implemented in Matlab 2012b (The Mathworks, Natick, USA).

3. Results and discussion

3.1. Division of the spectra subspace

Total sugar, reducing sugar, nicotine and total nitrogen are the essential compositions that determine the quality of the tobacco [4]. In this study we selected four spectral feature bands, one related to each composition, according to the absorption positions in the spectrum, and formed four subset matrices: $S_1$ (92), $S_2$ (79), $S_3$ (104), $S_4$ (95). This helps to avoid the influence of noise and irrelevant features in the spectrum effectively.

3.2. Determination of the parameters k and d

There is currently no good way to determine the number of nearest neighbors $k$ and the dimensionality $d$ of the reduced space, yet the choice of these parameters is crucial to the mapping quality of the LLE. If $k$ is set too high, the mapping loses its nonlinear character and the amount of computation increases; in contrast, if it is too low, the mapping will not reflect any global properties. Moreover, if $d$ is set too high, the mapping will enhance noise; if it is too low, the data structure in the high dimensional space will not be fully reflected [19]. In this paper, we present a supervised method for determining the optimal values of the parameters $k$ and $d$. The sample set A was used to compute a residual while co-adjusting the values of $k$ and $d$. The residual is defined by:

$$R = 1 - \rho(D_v, D_c) \tag{7}$$

where $D_v$ is the distance matrix of the subspace in its mapped space, $D_c$ is the difference matrix of the corresponding chemical composition, and $\rho$ is the correlation coefficient. We used the nicotine content as the metric for $D_c$ since it is the most important factor for tobacco quality.

Fig. 2 shows the residuals for different values of $k$ and $d$ in the subspace $S_1$. For every value of $k$, the residual is smallest when $d$ is 3. Moreover, once $k$ exceeds 20, it has little effect on the residual when $d$ is 3, and the larger the $k$, the larger the amount of computation. Therefore, $k = 20$ and $d = 3$ were selected as the optimal values for the first subspace. Using this parameter estimation method, the optimal values of $k$ and $d$ for the other subspaces $S_2$, $S_3$, $S_4$ can be obtained. Except for subspace $S_2$, where the optimal $d$ was 4, the optimal $d$ of the remaining subspaces was 3, and $k = 20$ was reasonable in all subspaces.
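The parameter selection based on the residual of Eq. (7) can be sketched as follows. This is a minimal illustration rather than the authors' Matlab implementation. It assumes that ρ is the Pearson correlation taken over the upper-triangular entries of the two symmetric matrices and that the composition difference is the absolute pairwise difference. All function names and toy values are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def residual(d_v, d_c):
    """R = 1 - rho(D_v, D_c) as in Eq. (7); rho is assumed to be Pearson's r."""
    return 1.0 - pearson(upper_triangle(d_v), upper_triangle(d_c))


def composition_difference(values):
    """Absolute pairwise difference matrix of one composition (e.g. nicotine)."""
    return [[abs(a - b) for b in values] for a in values]


def select_parameters(candidates, d_c):
    """Return the (k, d) pair whose embedding distance matrix gives the smallest R.

    `candidates` maps (k, d) to the distance matrix of the embedding obtained
    with those parameters (a hypothetical input for this sketch).
    """
    return min(candidates, key=lambda kd: residual(candidates[kd], d_c))


# Toy example with made-up values: three samples and two candidate settings.
nicotine = [1.0, 2.0, 4.0]
d_c = composition_difference(nicotine)
candidates = {
    (20, 3): [[0, 1, 3], [1, 0, 2], [3, 2, 0]],  # tracks the composition exactly
    (10, 2): [[0, 3, 1], [3, 0, 2], [1, 2, 0]],  # ranks the pairs differently
}
best = select_parameters(candidates, d_c)
```

In practice each candidate matrix would come from running the modified LLE on one subspace with the given k and d; here the matrices are supplied directly so that the selection step can be read in isolation.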

Fig. 2. Residuals of different values of $k$ and $d$ in subspace $S_1$. Note that for the different values of $k$, the residual is always smallest when $d$ is 3. Once $k$ exceeds 20 there is little effect on the residual when $d$ is 3.

3.3. Similarity matrix

To compute the similarity matrix of the sample points, the embedding $Y$ first needs to be obtained for each subspace $S_1, S_2, \ldots, S_4$ using the modified LLE algorithm (with the LLE parameters set as in Section 3.2). The Euclidean distance matrix of the sample points is then calculated and normalized. For example, the calculated distance matrix $V_1$ of the first subspace $S_1$ is as follows:

$$V_1 = \begin{bmatrix} v_{11} & v_{12} & v_{13} & \cdots & v_{1n} \\ v_{21} & v_{22} & v_{23} & \cdots & v_{2n} \\ v_{31} & v_{32} & v_{33} & \cdots & v_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ v_{n1} & v_{n2} & v_{n3} & \cdots & v_{nn} \end{bmatrix} = \begin{bmatrix} 0 & 0.074 & 0.237 & \cdots & 0.233 \\ 0.074 & 0 & 0.316 & \cdots & 0.282 \\ 0.237 & 0.316 & 0 & \cdots & 0.108 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.233 & 0.282 & 0.108 & \cdots & 0 \end{bmatrix}$$

where $n$ is the number of samples, $v_{ij}$ is the distance between samples $i$ and $j$, and $v_{ij}$ is equal to $v_{ji}$. $V_1$ is a symmetric $n \times n$ matrix. The distance matrices $V_2$, $V_3$, $V_4$ of the other subspaces were computed at the same time, and the spectral similarity matrix $V$ of the sample points was integrated by adding all of the individual distance matrices $V_1, V_2, \ldots, V_4$:

$$V = \begin{bmatrix} 0 & 0.468 & 1.741 & \cdots & 1.920 \\ 0.468 & 0 & 2.014 & \cdots & 1.895 \\ 1.741 & 2.014 & 0 & \cdots & 1.523 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1.920 & 1.895 & 1.523 & \cdots & 0 \end{bmatrix}$$

The similarity measure method holds that the smaller the distance between two sample points, the higher their similarity; that is, the similarity is inversely related to the distance. Therefore, when looking for the sample with the highest similarity to the sample $v_i$, the sample $v_j$ that we are searching for is the one holding the minimum value (except 0) in the $i$th row, or equivalently in the $i$th column, of the matrix $V$.

3.4. Projection comparison

To investigate the effectiveness of SSLLE, 180 tobacco samples from four different producing areas in set A were projected to a 2-dimensional space. Fig. 3 compares the projections of PCA, LLE and SSLLE. The consensus of tobacco experts is that tobacco from the same producing area has higher similarity, so a good similarity measurement model should place tobacco from the same area as close together as possible and separate different areas as far as possible. As can be seen from Fig. 3, the SSLLE algorithm presented in this paper distinguishes tobacco from different areas significantly better than PCA and LLE.

Fig. 3. Projection comparison of PCA, LLE and SSLLE for tobacco samples from four producing areas. The SSLLE algorithm distinguished tobacco from different areas significantly better than PCA and LLE.

3.5. Comparison of the spectral similarity measure

In this section, spectral similarity models were constructed using both sample sets A and B, and the accuracy of the similarity measure was compared with that of PCA and LLE. Six principal components (cumulative variance contribution rate of 90%) were selected for the PCA method, while the neighborhood size $k$ was set to 20 and $d$ was set to 3 for the LLE algorithm. The similarity measure experiments were carried out from two different aspects.

In the first experiment, 80 tobacco samples were randomly selected from the 260 samples of set A. The similarity matrix was used to find the spectrally most similar sample among the remaining 180 samples, and the root square of difference SD [11] was used to measure the compositional differences. The SD is defined as:

$$SD = \sqrt{\frac{\sum_{i=1}^{n} (r_i - r_{si})^2}{n}} \tag{8}$$

where $n$ is the number of samples, and $r_i$ and $r_{si}$ are the nicotine contents of the selected sample and of its most similar sample, respectively.

The second experiment was conducted by randomly selecting 100 samples from the 170 pairs of samples in set B. For each one, the three spectrally nearest samples were identified, and the results were compared with the expert recommendations. The probability of consistency (the ratio of correctly recommended samples to the total number of samples) between the similarity measure method and the experts was recorded.

Table 1 shows the results of the two experiments.

Table 1
Performance comparison of different algorithms.

Algorithm    Experiment 1 (SD)    Experiment 2 (consistency with experts)
PCA          0.21                 55.8%
LLE          0.18                 70.8%
SSLLE        0.12                 81.7%

It can be seen from Table 1 that the spectral similarity matrix constructed by SSLLE gives the smallest differences in chemical composition and the highest consistency with the experts. PCA is a linear dimension reduction method, which cannot effectively reflect the essential characteristics of the data because of the nonlinear structure of spectral data, so its similarity measure result is the worst. The LLE algorithm maintains the local structure of the data in the mapping space; however, because of the high dimension and the defects of its distance calculation, its similarity measure result is worse than that of SSLLE. SSLLE improves on LLE by jointly addressing the high dimension and irrelevant features of the NIR spectra data and the disadvantages of the LLE, so its similarity measure result is much better than that of PCA and LLE.
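The combination of subspace distance matrices and the nearest-sample search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. It assumes the minimum and maximum in Eq. (6) are taken over all entries of a matrix, and that Eq. (8) is a root-mean-square difference. The function names, the `cutoff` parameter and the toy matrices are hypothetical.

```python
import math


def normalize(dist, cutoff=0.3):
    """Min-max normalize a distance matrix (Eq. 6); entries above `cutoff` become +inf.

    Assumes the matrix is not constant, so that d_max > d_min.
    """
    flat = [v for row in dist for v in row]
    d_min, d_max = min(flat), max(flat)
    scaled = [[(v - d_min) / (d_max - d_min) for v in row] for row in dist]
    if cutoff is None:
        return scaled
    return [[math.inf if v > cutoff else v for v in row] for row in scaled]


def combine(matrices, cutoff=0.3):
    """V = V_1 + V_2 + ... + V_P after normalizing each subspace matrix."""
    normalized = [normalize(m, cutoff) for m in matrices]
    n = len(matrices[0])
    return [[sum(m[i][j] for m in normalized) for j in range(n)] for i in range(n)]


def most_similar(v, i):
    """Index of the sample with the smallest combined distance to sample i (i excluded)."""
    return min((j for j in range(len(v)) if j != i), key=lambda j: v[i][j])


def sd(r, r_s):
    """Root-mean-square difference between contents r and matched contents r_s."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, r_s)) / len(r))


# Toy example with made-up distances: two subspaces, three samples.
v1 = [[0, 1, 4], [1, 0, 5], [4, 5, 0]]
v2 = [[0, 2, 10], [2, 0, 9], [10, 9, 0]]
v = combine([v1, v2])
match_for_0 = most_similar(v, 0)
spread = sd([1.0, 2.0], [1.0, 4.0])
```

Because entries above the cutoff are set to infinity before summation, a pair of samples that is far apart in any single subspace can never be selected as the closest match, which is one way to read the role of the 0.3 threshold.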

4. Conclusions

The SSLLE method, based on spectra subspaces and the LLE algorithm, can effectively improve the robustness and accuracy of the near infrared spectra similarity measure. The spectrum was divided into several subspaces, which eliminated irrelevant features and noise and reduced the dimension handled by the LLE. Meanwhile, the distance measure problem that occurs in high dimensions was addressed by using the geodesic distance instead of the Euclidean distance. The modified distance formula also made the samples more evenly distributed in the high dimensional space. The experimental results showed that the SSLLE method has good robustness and high precision. It can assist in the maintenance and design of the cigarette formula, and is of significance for the similarity measure of high dimensional data.

Acknowledgements

This work was supported by the China Tobacco Yunnan Industrial Co., Ltd [grant number 2018JC01, Study and platform construction of near infrared detection for tobacco quality based on computer vision and cloud assistance technique].

References

[1] V. Gaydou, J. Kister, N. Dupuy, Evaluation of multiblock NIR/MIR PLS predictive models to detect adulteration of diesel/biodiesel blends by vegetal oil, Chemom. Intell. Lab. Syst. 106 (2011) 190–197.
[2] L.E. Agelet, D.D. Ellis, S. Duvick, A.S. Goggi, C.R. Hurburgh, Feasibility of near infrared spectroscopy for analyzing corn kernel damage and viability of soybean and corn kernels, J. Cereal Sci. 55 (2012) 160–165.
[3] C. Meesa, F. Souard, C. Delported, E. Deconinck, P. Stoffelen, C. Stévigny, J.M. Kauffmann, K.D. Braekeleer, Identification of coffee leaves using FT-NIR spectroscopy and SIMCA, Talanta 177 (2018) 4–11.
[4] X. Liu, H. Chen, T. Liu, Application of PCA-SVR to NIR prediction model for tobacco chemical composition, Spectrosc. Spect. Anal. 27 (2007) 2460–2463.
[5] H. Maha, W. Mcclure, T. Whitaker, Applying artificial neural networks II. Using near infrared data to classify tobacco types and identify native grown tobacco, J. Near Infrared Spec. 5 (1997) 19–25.
[6] R. De Maesschalck, D.J. Rimbaud, D.L. Massart, The Mahalanobis distance, Chemom. Intell. Lab. Syst. 50 (2000) 1–18.
[7] B. Park, W.R. Windham, K.C. Lawrence, D.P. Smith, Contaminant classification of poultry hyperspectral imagery using a spectral angle mapper algorithm, Biosyst. Eng. 96 (2007) 323–333.
[8] S. Berchtold, D. Keim, H.P. Kriegel, The X-Tree: an index structure for high-dimensional data, in: Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, 1996, pp. 28–39.
[9] R. Weber, H.J. Schek, S. Blott, A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, in: Proceedings of the 24th International Conference on Very Large Databases, New York, 1998, pp. 194–205.
[10] I.T. Jolliffe, Principal Component Analysis, Springer Verlag, New York, 2002.
[11] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[12] L. Ramirez-Lopez, T. Behrens, K. Schmidt, R.A. Viscarra Rossel, J.A.M. Demattê, T. Scholten, Distance and similarity-search metrics for use with soil vis-NIR spectra, Geoderma 199 (2013) 43–53.
[13] M. Casale, N. Sinelli, P. Oliveri, V. Di Egidio, S. Lanteri, Chemometric strategies for feature selection and data compression applied to NIR and MIR spectra of extra virgin olive oils for cultivar identification, Talanta 80 (2010) 1832–1837.
[14] C. Varini, A. Degenhard, T.W. Nattkemper, ISOLLE: LLE with geodesic distance, Neurocomputing 69 (2006) 1768–1771.
[15] L. Ma, M.M. Crawford, J. Tian, Anomaly detection for hyperspectral images based on robust locally linear embedding, J. Infrared Millim. Te. 31 (2010) 753–762.
[16] A. Hadid, O. Kouropteva, M. Pietikainen, Unsupervised learning using locally linear embedding: experiments in face pose analysis, in: Proc. 16th International Conference on Pattern Recognition, Quebec City, Canada, 2002, pp. 111–114.
[17] B. Li, C.H. Zheng, D.S. Huang, L. Zhang, K. Han, Gene expression data classification using locally linear discriminant embedding, Comput. Biol. Med. 40 (2010) 802–810.
[18] E.W. Dijkstra, A note on two problems in connection with graphs, Numer. Math. 1 (1959) 269–271.
[19] R. Shan, W. Cai, X. Shao, Variable selection based on locally linear embedding mapping for near-infrared spectral analysis, Chemom. Intell. Lab. Syst. 131 (2014) 31–36.