A Partition-based Variable Selection in Partial Least Squares Regression

Chuan-Quan Li1, Zhaoyu Fang1, Qing-Song Xu1,*

1 School of Mathematics and Statistics, Central South University, Changsha 410083, P. R. China
Abstract

Partial least squares regression is one of the most popular modeling approaches for predicting spectral data and, when combined with variable selection methods, for identifying key wavelengths. However, many traditional variable selection approaches overlook the local or group structure among the covariates. In this paper, a partition-based variable selection method for partial least squares (PARPLS) is proposed. It first uses the k-means algorithm to partition the variable space and then estimates the regression coefficients within each group. Finally, these coefficients are sorted to select the important variables. Results on three near-infrared (NIR) spectroscopy datasets show that PARPLS obtains better prediction performance and selects more effective variables than its competitors.

Keywords: k-means clustering; partial least squares; variable selection

* Correspondence to: Q.-S. Xu. Email: [email protected]
1. Introduction

In recent years, spectroscopic techniques such as near-infrared (NIR) spectroscopy, infrared (IR) spectroscopy and ultraviolet-visible spectroscopy have been widely used in industry, chemistry, environmental science, biology and other fields [1,2]. A large number of multivariate spectral calibration methods have been proposed to build quantitative relations between the variables (wavelengths) and the chemical properties of interest, such as concentration, sugar content or species category. However, these complex, high-dimensional spectral datasets often contain a great deal of uninformative and redundant information, which increases the difficulty of modelling and can degrade prediction performance. Therefore, variable (wavelength) selection plays an important role in contemporary chemometric research [3-5].

Partial least squares (PLS) regression is a widely used method in the chemometrics community [6,7] that can handle high-dimensional spectroscopic datasets. The original form of PLS regression extracts multiple latent variables from the whole variable space and therefore cannot select important variables directly. Much literature on variable selection in PLS has appeared over the last two decades. Generally, these methods can be categorized into three types: filter, wrapper and embedded methods [8]. Filter methods select important variables based on the regression coefficients [9], the variable importance in projection (VIP) [10], the loading weights [11] and so on. Wrapper strategies mostly run an iterated procedure between model fitting and variable selection, such as uninformative variable elimination (PLSUVE) [12], backward variable elimination PLS [13], competitive adaptive reweighted sampling (CARS) [14] and the sure-independence-screening-based sparse partial least squares method (SIS-SPLS) [15]. Embedded methods nest the variable selection within the PLS algorithm itself through an iterative procedure, such as interval PLS [17] and the sparse PLS (SPLS) algorithm [18].

Another direction is interval wavelength selection [19,20], because functional groups absorb within relatively short wavelength bands in spectroscopic data, and the correlation within such bands is high. A series of interval selection methods have been proposed, such as the interval PLS method [17], moving window PLS [21], interval random frog [22] and stacked interval PLS [23,24]. However, an important parameter, the fixed interval length, is difficult to determine. Therefore, clustering-based algorithms have been developed to determine varying interval lengths, such as Fisher optimal subspace shrinkage (FOSS) [25] and the functional domain selection (FuDoS) method [26]. Recently, Kang [27] proposed a partition-based ultrahigh-dimensional variable screening method in the framework of generalized linear models, in which two kinds of partitions, correlation-based clustering and spatial partition-based clustering, are used to obtain the grouping information. This method is more efficient than marginal screening approaches [28] for variable screening and parameter estimation.

In this work, we focus on a partition-based variable selection method with the PLS model (PARPLS) for spectroscopic data. Motivated by the idea of spatial clustering, a novel wavelength selection method is proposed based on the PLS regression coefficients within different clusters. Because the performance of the clustering algorithm may be sensitive to the grouping and the initial values, Monte Carlo sampling is adopted to obtain more stable group information and PLS regression coefficients. Finally, variables are selected based on the absolute values of the PLS regression coefficients.

The paper is organized as follows. In Section 2, we introduce our algorithm in detail. A brief introduction to the three near-infrared spectroscopy datasets is given in Section 3. Section 4 presents and discusses the experimental results on the three datasets. Finally, Section 5 gives some conclusions.
2. Theory and methods

2.1 Notation

First, we introduce some notation. The predictor matrix X is an n × p matrix, and Y denotes a single or multiple continuous response. Before modelling, both X and Y are mean-centered.

2.2 Partial least squares
Here, we briefly introduce the partial least squares algorithm, a classical dimension reduction method that compresses the variable space into a few latent variables. The first pair of loading weights is obtained by maximizing the covariance between the predictor and response scores:

\max_{\|w\|=1,\ \|q\|=1} w^{T} X^{T} Y q

where the loading weights (w_1, q_1) are the first left and right singular vectors of X^{T} Y. The first latent variable is defined as

t_1 = X w_1                                                              (1)

X and Y are then deflated by regressing them on t_1 via least squares, and the residual errors are computed as follows:

X = t_1 p_1^{T} + E,    Y = t_1 q_1^{T} + F                              (2)

The next components t_2, ..., t_A can be computed in the same way as in formula (2) from the residual errors E and F. This deflation pattern ensures that the loading weights W (p × A) and the components T (n × A) are orthogonal. The PLS estimate of the regression coefficients β can therefore be expressed as

\hat{\beta}_{PLS} = W (P^{T} W)^{-1} (T^{T} T)^{-1} T^{T} Y              (3)
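To make this construction concrete, the following minimal R sketch (an illustration under our assumptions, not the authors' implementation) computes the components for a single mean-centered response: the first direction is the normalized X^T y (which, for a single response, coincides with the first left singular vector of X^T Y), one deflation step follows Eq. (2), and the coefficients are assembled as in Eq. (3).

```r
# Minimal single-response PLS sketch (illustration only, not the authors' code).
# Assumes X (n x p) and y (length n) are already mean-centered.
pls_sketch <- function(X, y, A) {
  n <- nrow(X); p <- ncol(X)
  W <- matrix(0, p, A)    # loading weights
  P <- matrix(0, p, A)    # X loadings
  Tm <- matrix(0, n, A)   # scores (latent variables)
  E <- X; f <- as.matrix(y)
  for (a in 1:A) {
    w <- crossprod(E, f)                 # direction maximizing covariance with the residual response
    w <- w / sqrt(sum(w^2))
    t <- E %*% w                         # latent variable t_a = E w_a, Eq. (1)
    pa <- crossprod(E, t) / sum(t^2)     # least-squares X loading
    qa <- sum(f * t) / sum(t^2)          # least-squares y loading
    E <- E - t %*% t(pa)                 # deflate X:  X = t p' + E, Eq. (2)
    f <- f - t * qa                      # deflate y:  Y = t q  + F, Eq. (2)
    W[, a] <- w; P[, a] <- pa; Tm[, a] <- t
  }
  # Regression coefficients as in Eq. (3): beta = W (P'W)^{-1} (T'T)^{-1} T'y
  beta <- W %*% solve(crossprod(P, W), solve(crossprod(Tm), crossprod(Tm, y)))
  list(coefficients = beta, scores = Tm, weights = W)
}
```

For routine use, the plsr() function of the pls package cited in Section 3.4 fits the same model.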
2.3 K-means clustering

The k-means algorithm [29] is one of the most popular and simple clustering methods. It finds a partition that minimizes the squared error between the empirical mean of each cluster and the points assigned to it:

J = \sum_{g=1}^{G} \sum_{i=1}^{n} \|x_i - \mu_g\|^{2}\, I\{x_i \in C_g\}            (4)

where \mu_g is the mean of cluster C_g and I{·} is the indicator function. There are two reasons why we use k-means here. First, the k-means algorithm requires three user-specified choices: the number of clusters G, the cluster initialization and the distance metric, but in most cases we only need to tune the number of clusters G. The Euclidean distance is chosen rather than the Mahalanobis distance, the L1 distance or other metrics because it saves considerable computational cost. As for the initialization of the cluster centers, different initial values may lead to different final clusterings; running k-means with multiple different initial partitions [29] helps to overcome local minima.

Second, k-means can be used to partition data with spatial structure, such as image data, spectral data and geospatial data. Neighboring predictors tend to be strongly correlated and may have similar effects on the response [30]. Hence, k-means based on such correlation-guided partitioning can efficiently separate clusters with spatial information [31].
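As an illustration of this partitioning step (a sketch under our assumptions, not the authors' code), the columns of the spectral matrix can be clustered with the base-R kmeans() function: treating each wavelength's n-dimensional profile as a point groups strongly correlated, neighboring wavelengths together, and multiple random starts (nstart) mitigate the sensitivity to initialization noted above.

```r
# Sketch: partition the p wavelengths (columns of X) into G clusters with k-means.
# Each column is represented by its mean-centered profile over the n samples, so
# highly correlated neighboring wavelengths tend to fall into the same cluster.
set.seed(1)
X <- matrix(rnorm(60 * 200), nrow = 60)          # placeholder spectra: 60 samples x 200 wavelengths
G <- 10                                          # number of clusters (tuning parameter)
km <- kmeans(t(scale(X, scale = FALSE)),         # cluster the columns, Euclidean distance
             centers = G, nstart = 20, iter.max = 100)
groups <- split(seq_len(ncol(X)), km$cluster)    # indices of the variables in each cluster
sapply(groups, length)                           # cluster sizes
```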
2.4 The PARPLS algorithm

2.4.1 The framework of the PARPLS algorithm

A PLS model is used to obtain the coefficient estimates β̂_g in each cluster. However, the estimated coefficients are sensitive to the number of groups. Therefore, Monte Carlo sampling is adopted to improve the stability of the estimates [32,33]; it also reduces the effect of the initial values in the k-means step. Following earlier work [14], the number of Monte Carlo samples is set to 100. Next, the p component-wise absolute coefficients |β̂_j| are sorted in decreasing order and the r largest are selected.

Note that there are two important parameters to be tuned in our algorithm: the number of groups and the number of selected variables. How to tune them is discussed in Section 2.4.2. The detailed algorithm is described below, followed by an illustrative code sketch:

Step 1. N sub-datasets are constructed, each one by randomly sampling 75% of the whole data.
Step 2. The k-means clustering algorithm is run on each sub-dataset; the number of clusters G must be specified beforehand.
Step 3. A PLS model is built on each cluster and the estimated coefficients β̂_g are calculated.
Step 4. The mean values of the β̂_g coefficients over the N sub-models are obtained.
Step 5. For a chosen thresholding parameter r, the set of indices selected by the proposed partition-based variable selection is Â = {j : |β̂_j| ranks among the r largest}.
Step 6. With the selected variables, a PLS model is fitted and applied to predict the spectra of an independent test set.
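The following R sketch outlines Steps 1-6 under stated assumptions (it is an illustrative re-implementation, not the authors' released code): the helper name parpls(), the CV-based choice of components within each cluster and the ten random k-means starts are our own choices.

```r
library(pls)  # provides plsr() and RMSEP()

# Illustrative PARPLS sketch: mean-centered X (n x p), single response y.
parpls <- function(X, y, G, r, N = 100, sub_frac = 0.75, max_comp = 10) {
  n <- nrow(X); p <- ncol(X)
  coef_sum <- numeric(p)
  for (b in seq_len(N)) {
    idx <- sample(n, size = floor(sub_frac * n))          # Step 1: Monte Carlo sub-dataset
    Xb <- scale(X[idx, ], scale = FALSE)
    yb <- scale(y[idx],  scale = FALSE)
    km <- kmeans(t(Xb), centers = G, nstart = 10)         # Step 2: partition the wavelengths
    beta_b <- numeric(p)
    for (g in seq_len(G)) {                               # Step 3: PLS model in each cluster
      vars <- which(km$cluster == g)
      ncomp <- min(max_comp, length(vars), length(idx) - 1)
      fit <- plsr(yb ~ Xb[, vars, drop = FALSE], ncomp = ncomp, validation = "CV")
      best <- which.min(RMSEP(fit, estimate = "CV")$val[1, 1, -1])
      beta_b[vars] <- coef(fit, ncomp = best)[, 1, 1]
    }
    coef_sum <- coef_sum + abs(beta_b)
  }
  coef_mean <- coef_sum / N                               # Step 4: average over the N sub-models
  selected <- order(coef_mean, decreasing = TRUE)[1:r]    # Step 5: keep the r largest coefficients
  final <- plsr(y ~ X[, selected, drop = FALSE],          # Step 6: refit on the selected variables
                ncomp = min(max_comp, r), validation = "CV")
  list(selected = sort(selected), coef_mean = coef_mean, model = final)
}
```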
2.4.2 Parameter setting/tuning

The PARPLS algorithm has two tuning parameters: the number of clusters G in Step 2 and the thresholding parameter r in Step 5. The value of G can be chosen between 1 and (p-1); when G = 1, our algorithm reduces to feature selection based on the PLS regression coefficients [9]. As for the thresholding parameter r, we follow Fan's paper [28] and set r = [αn/log(n)], where α takes values on a reasonable grid (e.g. α = 0.5, 1, 1.5, ...). As described above, a 10-fold cross-validation grid search is used to choose the optimal values of G and r based on the minimal root-mean-square error.
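A small sketch of how the two parameters could be laid out on a grid is given below (an illustration under our assumptions; the example values of n and G and the hypothetical parpls() helper from the previous sketch are ours). The r candidates follow the [αn/log(n)] rule of [28], and the pair (G, r) with the smallest 10-fold cross-validated RMSE would be retained.

```r
n <- 100                                            # number of calibration samples (example value)
alpha_grid <- c(0.5, 1, 1.5, 2)
r_grid <- unique(floor(alpha_grid * n / log(n)))    # r = [alpha * n / log(n)], per Fan [28]
G_grid <- c(5, 10, 20, 40)                          # candidate numbers of clusters (example values)
tuning_grid <- expand.grid(G = G_grid, r = r_grid)
# For each (G, r) pair, PARPLS would be run inside a 10-fold CV loop and the pair
# with the minimal cross-validated root-mean-square error retained.
head(tuning_grid)
```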
2.5 An outline of the compared methods

In this article, we compare our algorithm with other PLS variable selection methods that are widely used for high-dimensional spectral data. The five methods are the original PLS, sparse partial least squares (SPLS), PLSUVE, PLSVIP and SIS-SPLS. Their main ideas are briefly described below.

2.5.1 The SPLS algorithm
The original sparse partial least squares (SPLS) regression algorithm is inspired by sparse principal component analysis (SPCA) [34], which imposes an L1-norm penalty on a surrogate of the direction vector c. In SPLS regression, L1 and L2 penalties are added together when building the latent components:

\min_{w,c}\ \{-\kappa\, w^{T} M w + (1-\kappa)(c-w)^{T} M (c-w) + \lambda_1 \|c\|_1 + \lambda_2 \|c\|_2^{2}\}\ \ \text{s.t.}\ w^{T} w = 1        (5)

where w and c respectively denote the original and surrogate direction vectors, κ, λ1 and λ2 are penalty factors, and M = X^{T} Y Y^{T} X. The deflation and update steps differ between the two SPLS variants, based on the NIPALS and SIMPLS algorithms: the NIPALS-type variant deflates the response using the current PLS fit, whereas the SIMPLS-type variant projects the weight directions onto the orthogonal complement of the directions already selected. At each step the active set is A = {i : ĉ_i ≠ 0 or β̂_{PLS,i} ≠ 0}. The sparsity parameters λ1, λ2 and the number of latent components are determined by cross-validation.
2.5.2 The PLSUVE algorithm
The PLSUVE algorithm selects important variables by appending a set of artificial noise variables with the same dimension as the original variable space. The relevance of the j-th variable in the PLS model is assessed by the ratio of its regression coefficient b_j to its standard deviation s(b_j):

c_j = b_j / s(b_j),    j = 1, ..., 2p                                     (6)

where s(b_j) is obtained from the vector of n coefficients b_{ij} computed by leave-one-out jackknifing. The cutoff level is set to abs(max(c_noise)), the largest absolute criterion value among the artificial variables. If abs(c_j) < abs(max(c_noise)) (j = 1, ..., p), the j-th variable is regarded as uninformative; otherwise it is considered informative.
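The criterion of Eq. (6) can be sketched in R as follows (our illustration, not the plsVarSel implementation): low-amplitude random noise columns are appended to X, the PLS coefficients are jackknifed leave-one-out, and the reliability c_j is compared with the largest absolute value obtained on the artificial variables.

```r
library(pls)

# Sketch of the UVE criterion of Eq. (6); placeholder data, fixed number of components.
set.seed(1)
n <- 40; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:3] %*% c(2, -1, 1) + rnorm(n, sd = 0.1)
Xa <- cbind(X, matrix(rnorm(n * p, sd = 1e-10), n, p))  # append p low-amplitude artificial variables
ncomp <- 3

# Leave-one-out jackknife of the PLS regression coefficients.
B <- sapply(seq_len(n), function(i) {
  fit <- plsr(y[-i] ~ Xa[-i, ], ncomp = ncomp)
  coef(fit, ncomp = ncomp)[, 1, 1]
})                                        # (2p) x n matrix of coefficients b_ij

cj <- rowMeans(B) / apply(B, 1, sd)       # c_j = b_j / s(b_j), j = 1, ..., 2p
cutoff <- max(abs(cj[(p + 1):(2 * p)]))   # largest |c| among the artificial variables
informative <- which(abs(cj[1:p]) > cutoff)
informative
```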
2.5.3 The PLSVIP algorithm

The PLS regression coefficient β̂_PLS is one of the most frequently used indices for measuring variable importance and selecting individual wavelengths. Another commonly used metric for judging the importance of the X-variables on Y is the variable importance in projection (VIP). The VIP score of the j-th variable (wavelength) for a single response y is computed according to formula (7):

VIP_j = \sqrt{\, p \sum_{a=1}^{A} \big[(q_a^{2}\, t_a^{T} t_a)\,(w_{aj}/\|w_a\|)^{2}\big] \Big/ \sum_{a=1}^{A} (q_a^{2}\, t_a^{T} t_a)\,}      (7)

where w_a, t_a and q_a are the loading weights, scores and Y-loadings of the a-th component. The PLSVIP algorithm, proposed by S. Wold [35], is a filter strategy for variable selection: variable j can be eliminated if VIP_j < u for some user-defined threshold u. Here, the optimal value of u is chosen by cross-validation.
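Formula (7) can be computed directly from a fitted PLS model; the helper below is a sketch for a single response (an illustration assuming the mvr object exposes loading.weights, scores and Yloadings, as in the pls package; the threshold of 1 in the example is only a common rule of thumb).

```r
library(pls)

# VIP scores for a single-response plsr fit, following Eq. (7).
vip <- function(fit) {
  W  <- unclass(fit$loading.weights)          # p x A loading weights
  TT <- unclass(fit$scores)                   # n x A scores
  q  <- drop(fit$Yloadings)                   # A y-loadings
  p  <- nrow(W)
  SS <- q^2 * colSums(TT^2)                   # SS_a = q_a^2 * t_a' t_a
  Wnorm2 <- colSums(W^2)                      # ||w_a||^2
  sqrt(p * as.vector(W^2 %*% (SS / Wnorm2)) / sum(SS))
}

# Example: keep the wavelengths whose VIP exceeds a threshold u.
set.seed(1)
X <- matrix(rnorm(50 * 80), 50, 80)
y <- X[, 1:5] %*% rnorm(5) + rnorm(50, sd = 0.2)
fit <- plsr(y ~ X, ncomp = 5)
sel <- which(vip(fit) > 1)
length(sel)
```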
2.5.4 The SIS-SPLS algorithm

The SIS-SPLS algorithm was proposed by Xu [15] and combines the sure independence screening (SIS) model with a sparse version of the PLS (SPLS) method. In the first step of the algorithm, the variables with the largest Pearson correlation coefficients are added to the model. Its formulation is as follows:

X_1 = X\, I_{A_1}                                                        (8)

where I_{A_1} is a diagonal matrix whose i-th diagonal element equals 1 if i ∈ A_1 and 0 if i ∉ A_1. The X-variables and Y are then shrunk using the PLS deflation of equation (2). Through multiple iterative steps, this model efficiently selects variables and reduces the variable dimension with relatively low MSEP. Hence, the method preserves the attractive properties of PLS as well as the asymptotic property of the SIS method. In addition, it is similar to the CovSel algorithm [16], which adopts a forward regression strategy to select variables one at a time; the SIS-SPLS is faster and more effective than CovSel because it uses the BIC to choose important variables in each iteration.
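The initial screening stage can be sketched as follows (an illustration of the SIS idea with placeholder data, not the authors' SIS-SPLS code): variables are ranked by the absolute Pearson correlation with y and only the top d enter the subsequent sparse PLS iterations.

```r
# Sketch of the sure-independence-screening step used as the first stage of SIS-SPLS.
set.seed(1)
n <- 80; p <- 700
X <- matrix(rnorm(n * p), n, p)
y <- X[, c(10, 20, 30)] %*% c(1.5, -2, 1) + rnorm(n)

d <- floor(n / log(n))                       # screening size, as suggested for SIS [28]
score <- abs(cor(X, y))                      # marginal Pearson correlations
A1 <- order(score, decreasing = TRUE)[1:d]   # active set A_1: the top-d variables
X1 <- X
X1[, -A1] <- 0                               # equivalent to X * I_{A_1}: zero out unselected columns
```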
3. Data and software implementation

3.1 Simulation study

In order to compare the model selection accuracy of the partition-based partial least squares method with other partial least squares variable selection algorithms, a simulated dataset similar to that of reference [27] has been created. The covariates X are drawn from a multivariate normal distribution and the true model is given by

y = Xβ + ε

with n = 100, p = 500 and β_0 = (−3, 3, −3, 3, −3, 0, ..., 0), i.e. only the first five coefficients are nonzero. The noise ε follows the standard Gaussian distribution. The covariance structure of X is that of a block first-order autoregressive model with unit variance and correlation 0.9, i.e. cov(x_i, x_j) = 0.9^{|i−j|} for any i ≠ j ∈ G_α, where G_α = {j ∈ N : 10α − 9 ≤ j ≤ 10α} for α = 1, ..., 50, and cov(x_i, x_j) = 0 otherwise.
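A sketch of how this design can be generated in R is given below (our illustration of the stated design, using MASS::mvrnorm): fifty independent blocks of ten variables, each with an AR(1) covariance with correlation 0.9, the first five coefficients nonzero, and standard Gaussian noise.

```r
library(MASS)  # for mvrnorm()

set.seed(2020)
n <- 100; p <- 500
block_size <- 10; n_blocks <- p / block_size

# AR(1) covariance within each block: cov(x_i, x_j) = 0.9^{|i-j|}, unit variance.
Sigma_block <- 0.9^abs(outer(1:block_size, 1:block_size, "-"))

# Draw each block independently and bind the columns together.
X <- do.call(cbind, replicate(n_blocks,
                              mvrnorm(n, mu = rep(0, block_size), Sigma = Sigma_block),
                              simplify = FALSE))
beta <- c(-3, 3, -3, 3, -3, rep(0, p - 5))
y <- X %*% beta + rnorm(n)                   # standard Gaussian noise
dim(X); length(y)
```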
3.2 Beer dataset
The beer spectral dataset contains 60 samples and was published in [17]. The spectrophotometer uses a split detector system with a silicon (Si) detector between 400 and 1100 nm and a lead sulfide (PbS) detector from 1100 to 2500 nm. The spectral data, collected at 2 nm intervals in the range from 400 to 2250 nm, were converted to absorbance units. According to earlier research [36], the 400-1100 nm region is mainly unsystematic noise and irrelevant for prediction. Hence, we only consider the wavelengths from 1100 to 2250 nm (576 data points) in steps of 2 nm. The original extract concentration, which indicates the substrate potential for the yeast to ferment alcohol, is considered as the property of interest. The original spectra of the beer data are shown in Fig. 1(A).
3.3 Corn data

This benchmark dataset consists of 80 corn samples measured on the m5 NIR spectrometer. The spectral data include 700 points measured from 1100 to 2498 nm in steps of 2 nm. The NIR spectra of the 80 corn samples measured on the m5 instrument are treated as the predictor matrix X. In this study, two different responses are employed to investigate the performance of PARPLS. In the first dataset, the oil value is treated as the dependent variable y; in the second dataset, the starch value is considered as the property of interest. The spectra of the corn data are plotted in Fig. 2(A).
3.4 Software implementation and performance evaluation

All code except that for SIS-SPLS is written in the R language (version 3.4.0). PLS and SPLS are implemented with the pls [37] and spls [38] packages, respectively. The pls package offers two strategies for determining the number of PLS components, namely cross-validation (CV) and leave-one-out (LOO); the default is cross-validation with 10 segments. Hence, 10-fold cross-validation is used and the number of PLS components is determined by the minimal root mean square error. The standard PLSUVE approach is implemented with the R package plsVarSel [39]. In addition, iterating Steps 2 and 3 over the N sub-models is time-consuming, so we suggest adopting parallel computing techniques, e.g. the doParallel or foreach R packages.
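As an illustration of these settings (placeholder data; the parallel part only sketches how the N sub-model loop of Section 2.4.1 could be distributed, it is not the authors' script), a plsr fit with 10-fold CV and component selection by minimal RMSEP, together with a doParallel/foreach skeleton, might look as follows.

```r
library(pls)
library(doParallel)
library(foreach)

set.seed(1)
X <- matrix(rnorm(60 * 300), nrow = 60)      # placeholder calibration spectra
y <- rnorm(60)                               # placeholder response

# 10-fold cross-validation; number of components chosen at the minimal CV RMSEP.
fit <- plsr(y ~ X, ncomp = 15, validation = "CV", segments = 10)
rmsep_cv <- RMSEP(fit, estimate = "CV")$val[1, 1, -1]
best_ncomp <- which.min(rmsep_cv)
best_ncomp

# Skeleton for distributing the N Monte Carlo sub-models over several cores.
cl <- makeCluster(2)
registerDoParallel(cl)
N <- 10
res <- foreach(b = seq_len(N), .packages = "pls") %dopar% {
  idx <- sample(nrow(X), size = floor(0.75 * nrow(X)))
  # ... k-means partitioning and per-cluster PLS fits (Steps 2-3) would go here ...
  idx
}
stopCluster(cl)
length(res)
```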
Furthermore, the fitting ability of the different methods was evaluated by the root mean square error (RMSE) and the coefficient of determination (R^2). The RMSE is given as follows:

RMSE = \sqrt{\sum_{i=1}^{n} (\hat{y}_i - y_i)^{2} / n}                   (9)

where \hat{y}_i is the predicted response value and n is the number of samples. The coefficient of determination is defined by the formula

R^{2} = 1 - \sum_{i=1}^{n} (\hat{y}_i - y_i)^{2} \Big/ \sum_{i=1}^{n} (y_i - \bar{y})^{2}          (10)

where \bar{y} is the mean of the response values. RMSEC denotes the RMSE computed on the training set, and RMSEP denotes the RMSE computed on an independent test set. R_C^2 and R_P^2 respectively represent the coefficients of determination for the training set and the independent test set.
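The two metrics are straightforward to compute; a small sketch with toy values (hypothetical helper names) is given below.

```r
# RMSE and coefficient of determination as in Eqs. (9) and (10).
rmse <- function(y, yhat) sqrt(mean((yhat - y)^2))
r2   <- function(y, yhat) 1 - sum((yhat - y)^2) / sum((y - mean(y))^2)

y    <- c(1.2, 0.8, 1.5, 1.1)     # toy observed values
yhat <- c(1.1, 0.9, 1.4, 1.2)     # toy predictions
rmse(y, yhat); r2(y, yhat)
```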
4. Results and discussion

To assess the performance of PARPLS, several high-performance wavelength selection methods, including SPLS, PLSUVE and PLSVIP, are used for comparison. All the datasets are first centered to zero mean before modeling. In this study, the calibration set (75% of the samples) is used for building the model and performing variable selection, and the independent test set (25% of the samples) is then used to validate the calibration model. Several evaluation metrics are calculated: the root-mean-square error of the calibration set (RMSEC), R_C^2, the root-mean-square error of prediction on the test set (RMSEP), R_P^2, and the number of selected variables (nVAR). Each method is repeated 50 times to ensure the reproducibility and stability of the evaluation.
4.1 Simulation study

The simulation results, including the sensitivity, specificity, RMSEP and nVAR, are recorded in Table 1. From Table 1, we can see that the PARPLS model is more efficient in this setting than the other models. The SPLS model chooses only three of the true variables and misses the other two. The SIS-SPLS fails in this setting because of its lower sensitivity and specificity and larger RMSEP. The PLSVIP and PLSUVE have results similar to the PARPLS, but the PARPLS model has higher sensitivity and specificity, which means that it has a better chance of selecting the true variables. In addition, our model shows the smallest RMSEP and nVAR in this setting. Hence, our model is efficient for variable screening.

[Insert Tab.1]
4.2 Case study

4.2.1 Beer dataset

For the beer dataset, the evaluation results of the different methods are given in Tab. 2 and Fig. 1. Compared with the full-spectrum PLS, all the variable selection methods improve the prediction performance on both the cross-validation and the test set, which demonstrates that it is necessary to perform variable selection before developing a calibration model on this dataset. The prediction accuracy of the PARPLS algorithm is significantly enhanced: compared to the full-spectrum PLS model, the RMSEP and R_P^2 for PARPLS improve from 0.209 to 0.074 and from 0.944 to 0.99, respectively.

The variables selected by the different methods on the beer dataset are displayed in Fig. 1. The SPLS and PLSUVE tend to select a larger number of variables. From Fig. 1(B), (C), (D) and (F), we can see that the SPLS, PLSVIP, PLSUVE and PARPLS methods select similar regions, although their nVAR values differ slightly. According to earlier research, the 1100-1400 nm region has a larger correlation with the response than other regions, and it corresponds to the first overtone of C-H and O-H stretching bond vibrations [25]. Our method also chooses these relevant informative regions, such as 1150-1160 nm and 1260-1350 nm, and its accuracy is similar to that of SPLS, PLSVIP and PLSUVE. As for the SIS-SPLS, it selects several variables from the 1400-2498 nm region. The reason may be that the independence condition in the SIS-SPLS [15] is often violated in practice, so some unimportant variables are chosen in the presence of many noisy variables.

[Insert Tab.2]

[Insert Fig. 1]
4.2.2 Oil dataset

For the corn oil dataset, the performances of the different methods are presented in Tab. 3 and Fig. 2. From Tab. 3, we can observe that, compared to the full-spectrum PLS, our variable selection method improves the prediction accuracy, whereas the SPLS, SIS-SPLS and PLSVIP methods show decreased prediction accuracy. According to the RMSEP, a clear ranking of all the methods can be observed, with PARPLS performing best, followed by PLSVIP (Tab. 3). Compared to the full spectra, the RMSEP and R_P^2 for our algorithm improve remarkably, by 45.1% and 2.31% respectively, although our algorithm has RMSEC and R_C^2 values similar to those of the PLS algorithm. In addition, the PARPLS yields the lowest standard deviation, indicating the highest stability on this dataset.

The variables selected by the different methods on the oil dataset are displayed in Fig. 2. The SPLS tends to select a larger number of variables, while the nVAR values obtained by the SIS-SPLS are lower. From Fig. 2(D) and (F), we can observe that the wavelengths selected by the PARPLS cover several informative regions, such as those around 1700 nm, 2100 nm and 2300 nm, which correspond to the overtones of the C-H stretching mode and the combination of C-H vibrations [25]. These wavelengths are actually assigned to specific compounds related to the response [16,40]. The regions selected by the PLSVIP mainly concentrate on both ends of the full spectrum. The PLSUVE only picked region 4 successfully and failed to select the other informative wavelength regions; hence its prediction accuracy is even lower than that of the full spectrum. The wavelength regions selected with extremely high frequency, such as 1700-1744 nm, 2030-2050 nm and 2050-2300 nm, are naturally considered to be key wavelength regions.

[Insert Tab.3]

[Insert Fig. 2]
4.2.3 Starch dataset

For the corn starch dataset, the performances of the different methods are presented in Tab. 4 and Fig. 3. From Tab. 4, we can observe that, compared to the full-spectrum PLS, our variable selection method improves the prediction accuracy, whereas the SPLS, PLSUVE and PLSVIP methods show decreased prediction accuracy. By comparison, the PARPLS achieves a significant improvement in R_P^2 and RMSEP, and our algorithm yields the lowest standard deviation of RMSEP among all methods, indicating the highest stability on this dataset.

The variables selected by the different methods on the starch dataset are presented in Fig. 3. In terms of nVAR, our method chooses fewer variables than the PLSUVE and SPLS. From Fig. 3(E) and (F), we can observe that the wavelengths commonly selected by the PARPLS lie around 1500 nm, 1700 nm and 1800 nm. The selection frequencies of the intervals chosen by the SIS-SPLS are smaller than those of the PARPLS, and some wavelength intervals are missed, such as 1330 nm, 1500 nm and 2206-2228 nm. These regions correspond to the overtones of the C-H, O-H and N-H bonds and their interactions, and are meaningful for the starch content [16,40]. Therefore, our method may have a greater potential to select important informative wavelengths.

[Insert Tab.4]

[Insert Fig. 3]
5. Conclusions

In this study, we propose a partition-based variable selection method in partial least squares regression (PARPLS) for spectral data. Unlike most existing variable selection methods, PARPLS uses the k-means algorithm to adaptively split the variables into groups, thereby capturing local information, and then selects important variables based on the estimated group-wise coefficients. The applications to three NIR datasets show that our algorithm can extract effective variables and improve the prediction performance compared with several other classical PLS variable selection algorithms. In the future, this efficient and simple strategy can be combined with methods from other frameworks, such as sufficient dimension reduction (SDR) [41], functional data analysis (FDA) [42] and stacked models [43]. The method can also be extended to classification data. Therefore, we believe that PARPLS is a potentially useful tool for exploring high-dimensional data and identifying valuable variables from spectral data.
Acknowledgements

We thank the editor and the referees for constructive suggestions that substantially improved this work. We also thank Feng Wang for providing us with the high-performance computer. This work was supported by the Innovation Program of Central South University [Grant No. 502221807].
References

[1]. R.G. Brereton, J. Jansen, J. Lopes, F. Marini, A. Pomerantsev, O. Rodionova, et al. Chemometrics in analytical chemistry - part II: modeling, validation, and applications. Anal. Bioanal. Chem. 410 (2018) 6691-6704.
[2]. C. Pasquini. Near infrared spectroscopy: A mature analytical technique with new perspectives - A review. Anal. Chim. Acta 1026 (2018) 8-36.
[3]. I.E. Frank, J.H. Friedman. A Statistical View of Some Chemometrics Regression Tools. Technometrics 35 (1993) 109-135.
[4]. J.Q. Fan, J.C. Lv. A selective overview of variable selection in high dimensional feature space. Stat. Sinica 20 (2010) 101-148.
[5]. Y.H. Yun, H.D. Li, B.C. Deng, D.S. Cao. An Overview of Variable Selection Methods in Multivariate Analysis of Near-Infrared Spectra. Trac-Trend. Anal. Chem. 113 (2019) 102-115.
[6]. S. Wold, M. Sjöström, L. Eriksson. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58 (2001) 109-130.
[7]. C.Q. Li, N. Xiao, Y. Wen, S.H. He, Y.D. Xu, Y.W. Lin, et al. Collaboration patterns and network in chemometrics. Chemom. Intell. Lab. Syst. 191 (2019) 21-29.
[8]. T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø. A review of variable selection methods in Partial Least Squares Regression. Chemom. Intell. Lab. Syst. 118 (2012) 62-69.
[9]. A. Frenich, D. Jouan-Rimbaud, D. Massart, S. Kuttatharmmakul, M. Galera, J. Vidal. Wavelength selection method for multicomponent spectrophotometric determinations using partial least squares. Analyst 120 (1995) 2787-2792.
[10]. L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold. Multi- and megavariate data analysis. Umetrics, Umeå, 2001.
[11]. I. Frank. Intermediate least squares regression method. Chemom. Intell. Lab. Syst. 1 (1987) 233-242.
[12]. V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna. Elimination of Uninformative Variables for Multivariate Calibration. Anal. Chem. 68 (1996) 3851-3858.
[13]. J.A. Fernández Pierna, O. Abbas, V. Baeten, P. Dardenne. A Backward Variable Selection method for PLS regression (BVSPLS). Anal. Chim. Acta 642 (2009) 89-93.
[14]. H.D. Li, Y.Z. Liang, Q.S. Xu, D.S. Cao. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta 648 (2009) 77-84.
[15]. X. Xu, K.K. Cheng, L. Deng, J. Dong. A sparse partial least squares algorithm based on sure independence screening method. Chemom. Intell. Lab. Syst. 170 (2017) 38-50.
[16]. J.M. Roger, B.D. Palagos, E. Fernandez-Ahumada. CovSel: Variable Selection for Highly Multivariate and Multi-Response Calibration. Chemom. Intell. Lab. Syst. 106 (2011) 216-223.
[17]. L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen. Interval partial least-squares regression (iPLS). Appl. Spectrosc. 54 (2000) 413-419.
[18]. H. Chun, S. Keles. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. B 72 (2010) 3-25.
[19]. A. Höskuldsson. Variable and subset selection in PLS regression. Chemom. Intell. Lab. Syst. 55 (2001) 23-38.
[20]. L.L. Wang, Y.W. Lin, X.F. Wang, N. Xiao, Y.D. Xu, H.D. Li, et al. A selective review and comparison for interval variable selection in spectroscopic modeling. Chemom. Intell. Lab. Syst. 172 (2018) 229-240.
[21]. Y.P. Du, Y.Z. Liang, J.H. Jiang, R.J. Berry, Y. Ozaki. Spectral regions selection to improve prediction ability of PLS models by changeable size moving window partial least squares and searching combination moving window partial least squares. Anal. Chim. Acta 501 (2004) 183-191.
[22]. Y.H. Yun, H.D. Li, L.R.E. Wood, W. Fan, J.J. Wang, D.S. Cao, Q.S. Xu, Y.Z. Liang. An Efficient Method of Wavelength Interval Selection Based on Random Frog for Multivariate Spectral Calibration. Spectrochim. Acta A 111 (2013) 31-36.
[23]. V.P. Dominic, S.D. Brown. Stacked Interval Sparse Partial Least Squares Regression Analysis. Chemom. Intell. Lab. Syst. 166 (2017) 49-60.
[24]. W.D. Ni, S.D. Brown, R.L. Man. Stacked Partial Least Squares Regression Analysis for Spectral Calibration and Prediction. J. Chemom. 23 (2009) 505-517.
[25]. Y.W. Lin, B.C. Deng, L.L. Wang, Q.S. Xu, L. Liu, Y.Z. Liang. Fisher optimal subspace shrinkage for block variable selection with applications to NIR spectroscopic analysis. Chemom. Intell. Lab. Syst. 159 (2016) 196-204.
[26]. A. Park, J. Aston, F. Ferraty. Stable and predictive functional domain selection with application to brain images. Preprint (2016). arXiv:1606.02186.
[27]. J. Kang, H.G. Hong, Y. Li. Partition-based ultrahigh-dimensional variable screening. Biometrika 104 (2017) 785-800.
[28]. J.Q. Fan, J.C. Lv. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J. R. Stat. Soc. B 70 (2008) 849-911.
[29]. A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31 (2010) 651-666.
[30]. G.H. Fu, W.M. Zhang, L. Dai, Y.Z. Fu. Group Variable Selection with Oracle Property by Weight-Fused Adaptive Elastic Net Model for Strongly Correlated Data. Commun. Stat-Simul. C 43 (2014) 2468-2481.
[31]. H.J. Miller, J.W. Han. Geographic Data Mining and Knowledge Discovery. CRC Press, New York, 2009.
[32]. H.D. Li, Y.Z. Liang, D.S. Cao, Q.S. Xu. Model-population analysis and its applications in chemical and biological modeling. Trac-Trend. Anal. Chem. 38 (2012) 154-162.
[33]. N. Meinshausen, P. Bühlmann. Stability selection. J. R. Stat. Soc. B 72 (2010) 417-473.
[34]. H. Zou, T. Hastie, R. Tibshirani. Sparse Principal Component Analysis. J. Comput. Graph. Stat. 15 (2006) 265-286.
[35]. L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold. Multi- and megavariate data analysis. Umetrics, Umeå, 2001.
[36]. C.M. Andersen, R. Bro. Variable Selection in Regression - a Tutorial. J. Chemom. 24 (2010) 728-737.
[37]. B.H. Mevik, R. Wehrens, K.H. Liland, P. Hiemstra. pls: Partial Least Squares and Principal Component Regression, 2019. R package version 2.7-1.
[38]. D. Chung, H. Chun, S. Keles. spls: Sparse partial least squares (SPLS) regression and classification, 2009. R package version 2.1-0.
[39]. K.H. Liland, T. Mehmood, S. Sæbø. plsVarSel: Variable Selection in Partial Least Squares, 2017. R package version 0.9.4.
[40]. B.G. Osborne. Near-Infrared Spectroscopy in Food Analysis. 2006.
[41]. Y.W. Lin, B.C. Deng, Q.S. Xu, Y.H. Yun, Y.Z. Liang. The Equivalence of Partial Least Squares and Principal Component Regression in the Sufficient Dimension Reduction Framework. Chemom. Intell. Lab. Syst. 150 (2016) 58-64.
[42]. J.O. Ramsay, B.W. Silverman. Functional Data Analysis. Springer, New York, 2005.
[43]. W.N. Ni, S.D. Brown, R.L. Man. Stacked Partial Least Squares Regression Analysis for Spectral Calibration and Prediction. J. Chemom. 23 (2009) 505-517.
Figure Captions

Table 1. The results of the simulated example. The data are presented as (sensitivity, specificity, RMSEP and nVAR).
Table 2. The results of different methods on the beer dataset; benchmarking results are reported as mean value ± standard deviation over 50 runs.
Table 3. The results of different methods on the oil dataset; benchmarking results are reported as mean value ± standard deviation over 50 runs.
Table 4. The results of different methods on the starch dataset; benchmarking results are reported as mean value ± standard deviation over 50 runs.
Fig. 1. Selection frequencies of the selected variables within 50 runs by different methods on the beer dataset. (A) Raw NIR spectra of the beer data, (B) SPLS, (C) PLSUVE, (D) PLSVIP, (E) SIS-SPLS and (F) PARPLS.
Fig. 2. Selection frequencies of the selected variables within 50 runs by different methods on the oil dataset. (A) Raw NIR spectra of the oil data, (B) SPLS, (C) PLSUVE, (D) PLSVIP, (E) SIS-SPLS and (F) PARPLS.
Fig. 3. Selection frequencies of the selected variables within 50 runs by different methods on the starch dataset. (A) Raw NIR spectra of the starch data, (B) SPLS, (C) PLSUVE, (D) PLSVIP, (E) SIS-SPLS and (F) PARPLS.
Table 1. The results of the simulated example. The data are presented as mean value ± standard deviation of sensitivity, specificity, RMSEP and nVAR.

Methods     Sensitivity    Specificity    RMSEP          nVAR
PLS         ----           ----           2.78±0.40      500
SPLS        0.6±0          1±0            1.95±0         3±0
PLSUVE      0.988±0.047    0.975±0.012    1.232±0.32     17.4±5.84
PLSVIP      0.99±0.045     0.996±0.003    1.084±0.165    6.9±1.41
SIS-SPLS    0.836±0.165    0.806±0        2.56±0.474     99.9±0.14
PARPLS      1±0            0.996±0.003    1.07±0.149     6.9±1.77
Table 2. The results of different methods on the beer dataset; benchmarking results are reported as mean value ± standard deviation in 50 runs.

Method      RMSEC           RMSEP          Ncomp        R_C²           R_P²           nVAR
PLS         0.003±0.005     0.209±0.088    7.28±1.89    0.999±0        0.944±0.063    576
SPLS        0.051±0.023     0.092±0.025    3.16±1.68    0.996±0.002    0.988±0.013    117.7±23.36
PLSUVE      0.057±0.011     0.083±0.017    4.6±2.62     0.996±0.002    0.990±0.015    97.44±6.35
PLSVIP      0.04±0.01       0.079±0.024    8.54±2.47    0.997±0.001    0.990±0.016    68.6±18.73
SIS-SPLS    0.0143±0.005    0.591±0.199    10±0         0.999±0        0.466±0.522    100±0
PARPLS      0.048±0.01      0.073±0.02     6.6±2        0.998±0        0.990±0.01     34.2±18.1
Table 3. The results of different methods on the oil dataset; benchmarking results are reported as mean value ± standard deviation in 50 runs.

Method      RMSEC          RMSEP          Ncomp        R_C²           R_P²           nVAR
PLS         0.263±0.021    0.387±0.058    9.82±0.43    0.928±0.03     0.833±0.067    700
SPLS        0.265±0.025    0.386±0.056    9.64±0.74    0.926±0.015    0.831±0.074    487.1±147.5
PLSUVE      0.26±0.056     0.386±0.056    9.36±1.19    0.927±0.034    0.85±0.08      143.1±61.2
PLSVIP      0.297±0.05     0.362±0.073    9.18±1.22    0.906±0.032    0.80±0.124     86.2±12.76
SIS-SPLS    0.291±0.027    0.405±0.071    10±0         0.913±0.018    0.809±0.078    43.24±26.5
PARPLS      0.11±0.021     0.168±0.04     9.74±0.5     0.986±0.005    0.966±0.02     64.2±16.8
Table 4. The results of different methods on the starch dataset; benchmarking results are reported as mean value ± standard deviation in 50 runs.

Method      RMSEC          RMSEP          Ncomp        R_C²           R_P²           nVAR
PLS         0.207±0.018    0.3±0.053      9.96±0.2     0.956±0.008    0.89±0.06      700
SPLS        0.207±0.017    0.304±0.057    9.98±0.14    0.956±0.007    0.886±0.066    688.3±26.67
PLSUVE      0.335±0.116    0.478±0.154    9.4±1.55     0.871±0.113    0.713±0.199    152.9±94.77
PLSVIP      0.283±0.04     0.392±0.074    9.5±0.76     0.916±0.028    0.815±0.088    84.6±12.8
SIS-SPLS    0.29±0.042     0.454±0.101    10±0         0.911±0.027    0.711±0.139    28.2±14.81
PARPLS      0.204±0.036    0.285±0.065    10±0         0.956±0.017    0.9±0.06       61±14.74
Highlights

- Our method takes the group information in the spectral data into account by using the k-means algorithm to select variables.
- The Monte Carlo sampling method is adopted to obtain more stable group information and PLS regression coefficients.
- Our algorithm is able to obtain better prediction performance and select more effective variables than its competitors.
Competing interests
The authors declare that they have no competing interests.

Authors' contributions
The authors declare that the study was realized in collaboration with equal responsibility. All authors read and approved the final manuscript.