
A Partition-based Variable Selection in Partial Least Squares Regression

Chuan-Quan Li¹, Zhaoyu Fang¹, Qing-Song Xu¹,*

¹ School of Mathematics and Statistics, Central South University, Changsha 410083, P. R. China

Abstract

Partial least squares regression is one of the most popular modeling approaches for predicting spectral data and, in combination with variable selection methods, for identifying key wavelengths. However, many traditional variable selection approaches overlook the local or group structure among the covariates. In this paper, a partition-based variable selection method for partial least squares (PARPLS) is proposed. It first uses the k-means algorithm to partition the variable space and then estimates the regression coefficients within each group. Finally, these coefficients are sorted in order to select the important variables. The results on three near-infrared (NIR) spectroscopy datasets show that PARPLS obtains better prediction performance and selects more effective variables than its competitors.

Keywords: k-means clustering; partial least squares; variable selection

* Correspondence to: Q.-S. Xu. Email: [email protected]

1. Introduction

In recent years, different spectroscopy techniques, such as near-infrared (NIR) spectroscopy, infrared (IR) spectroscopy and ultraviolet-visible spectroscopy, have been widely used in industry, chemistry, environmental science, biology and other fields [1,2]. A large number of multivariate spectral calibration methods have been proposed to build quantitative relations between the variables (wavelengths) and the chemical properties of interest, such as concentration, sugar content or species category. However, these complex and high-dimensional spectral datasets often contain much uninformative and redundant information, which may complicate modelling and degrade prediction performance. Therefore, variable (wavelength) selection plays an important role in contemporary chemometric research [3-5].

Partial least squares (PLS) regression is a widely used method in the chemometrics community [6,7] that can handle high-dimensional spectroscopic datasets. The original form of PLS regression extracts several latent variables from the whole variable space and therefore cannot select important variables directly. A considerable literature on variable selection in PLS has appeared over the last two decades. Generally, these methods can be categorized into three types: filter, wrapper and embedded methods [8]. Filter methods select important variables based on the regression coefficients [9], the variable importance in projection (VIP) [10], the loading weights [11] and so on. Wrapper strategies mostly run an iterative procedure alternating between model fitting and variable selection, such as uninformative variable elimination (PLSUVE) [12], backward variable elimination PLS [13], competitive adaptive reweighted sampling (CARS) [14] and the sure-independence-screening based sparse partial least squares method (SIS-SPLS) [15]. Embedded methods nest the variable selection within the PLS algorithm itself and rely on an iterative procedure, such as interval PLS [17] and the sparse PLS (SPLS) algorithm [18].

Another direction is interval wavelength selection [19,20], because functional groups absorb within relatively short wavelength bands in spectroscopic data and there is high correlation within these bands. A series of interval selection methods have been proposed, such as interval PLS [17], moving window PLS [21], interval random frog [22] and stacked interval PLS [23,24]. An important parameter, the fixed interval length, is difficult to determine. Therefore, other algorithms rely on clustering to determine intervals of varying length, such as Fisher optimal subspace shrinkage (FOSS) [25] and functional domain selection (FuDoS) [26]. Recently, Kang et al. [27] proposed a partition-based ultrahigh-dimensional variable screening method in the framework of generalized linear models, in which two kinds of partitions, correlation-guided clustering and spatial partitioning, are used to obtain the grouping information. This method is more efficient than marginal screening approaches [28] for variable screening and parameter estimation.

In this work, we focus on a partition-based variable selection with the PLS model (PARPLS) for spectroscopic data. Motivated by the idea of spatial clustering, a novel wavelength selection method is proposed based on the PLS regression coefficients estimated on the different clusters. Because the performance of the clustering algorithm may be sensitive to the grouping and to the initial values, the Monte Carlo sampling method is adopted to obtain more stable group information and PLS regression coefficients. Finally, variables are selected according to the absolute values of the PLS regression coefficients.

We have organized this paper into the following sections. In Section 2, we introduce our algorithm in detail. A brief introduction of the three near-infrared spectroscopy datasets is given in Section 3. Section 4 presents and discusses the experimental results on the three datasets. Finally, Section 5 gives some conclusions.

2. Theory and methods

2.1 Notation

First, we present some notation. The predictor matrix X is an n × p matrix, and Y may be a single or a multiple continuous response. Before modelling, both X and Y are mean-centered.

2.2 Partial least squares

Here, we briefly introduce the partial least squares algorithm, a classical dimension reduction method that compresses the variable space into a few latent variables. The first pair of loading weights is obtained by maximizing the covariance between the predictor and response variables:

$$\max_{\|w\| = \|q\| = 1} \; w^{T} X^{T} Y q$$

where the loading weights $(w_1, q_1)$ are the first left and right singular vectors of $X^{T}Y$. The first latent variable is defined as

$$t_1 = X w_1 \qquad (1)$$

X and Y are then deflated by least squares estimation, and the residuals are computed as follows:

$$X = t_1 p_1^{T} + E_1, \qquad Y = t_1 q_1^{T} + F_1 \qquad (2)$$

The next components $t_2, \dots, t_A$ can be computed in the same way as in formula (2) from the residual matrices $E_1$ and $F_1$. This deflation scheme ensures that the loading weights $W \in \mathbb{R}^{p \times A}$ and the components $T \in \mathbb{R}^{n \times A}$ are orthogonal. Therefore, the PLS estimate of $\beta$ can be expressed as

$$\hat{\beta}_{PLS} = W \left( W^{T} X^{T} X W \right)^{-1} W^{T} X^{T} Y \qquad (3)$$
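In practice, formula (3) does not have to be evaluated by hand: the pls package cited later in Section 3.4 returns the same coefficient vector. The short sketch below is only an illustration with placeholder data; the simulated spectra and variable names are not from the paper.

```r
library(pls)

set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)          # placeholder "spectra": n = 100, p = 50
y <- X[, 1] - X[, 2] + rnorm(100, sd = 0.1)    # placeholder response

# 10-fold cross validation to pick the number of latent variables
fit <- plsr(y ~ X, ncomp = 10, validation = "CV", segments = 10)
ncomp_opt <- which.min(RMSEP(fit)$val["CV", 1, -1])   # drop the 0-component entry

# Regression coefficients of formula (3) for the chosen number of components
beta_hat <- drop(coef(fit, ncomp = ncomp_opt))
head(sort(abs(beta_hat), decreasing = TRUE))          # largest absolute coefficients
```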

2.3 K-means clustering

The k-means algorithm [29] is one of the most popular and simple clustering methods. It finds a partition such that the squared error between the empirical mean of a cluster and the points in the cluster is minimized. Its objective function is defined as

$$J = \sum_{g=1}^{G} \sum_{i=1}^{n} \| x_i - \mu_g \|^{2} \, \mathbb{1}\{ x_i \in C_g \} \qquad (4)$$

where $\mu_g$ is the empirical mean of cluster $C_g$.

The reasons why we use k-means here are twofold. First, the k-means algorithm requires three user-specified choices: the number of clusters G, the cluster initialization and the distance metric, but in most cases only the number of clusters G needs to be tuned. Here the Euclidean distance is chosen rather than the Mahalanobis distance, the L1 distance or other metrics, because the Euclidean distance saves much computational cost. As to the initialization of the cluster centers, different initial values may lead to different final clusterings; k-means can be run with multiple different initial partitions [29] to overcome local minima.

Second, k-means can be used to partition data that carry spatial information, such as image data, spectral data and geospatial data. Close predictors tend to have stronger correlations and may have similar effects on the response [30]. Hence, a k-means based on such correlation-guided partitioning can efficiently separate clusters with spatial information [31].
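To make the correlation-guided partitioning concrete, the sketch below clusters the p wavelength variables (the columns of X) with the base R kmeans function, treating each standardized variable profile as one point; nstart provides the multiple random initializations mentioned above. The data and the choice G = 10 are placeholders, not the paper's settings.

```r
set.seed(2)
X <- matrix(rnorm(60 * 200), 60, 200)     # placeholder spectra: n = 60, p = 200

G <- 10
# t(scale(X)) is a p x n matrix: one row (point) per standardized wavelength variable,
# so neighbouring, highly correlated wavelengths tend to fall in the same cluster.
km <- kmeans(t(scale(X)), centers = G, nstart = 20, iter.max = 100)

groups <- km$cluster                      # length-p vector of group labels
table(groups)                             # cluster sizes
```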

2.4 The PARPLS algorithm

2.4.1 The framework of the PARPLS algorithm

The PLS model is used to obtain the coefficients $\hat{\beta}_g$ in each cluster. However, the estimated coefficients are sensitive to the number of groups. Therefore, the Monte Carlo sampling method is adopted to improve the stability of the estimates [32,33]; it also reduces the effect of the initial values in the k-means step. Following earlier research [14], the number of Monte Carlo samplings can be set to 100. Next, we sort the p component-wise absolute coefficients $|\hat{\beta}|$ in decreasing order and select the r largest variables.

Note that there are two important parameters to be tuned in our algorithm: the number of groups and the number of selected variables. We discuss how to tune them in Section 2.4.2. The detailed algorithm is described in the following:

Step 1. N sub-datasets are constructed, each one by randomly sampling 75% of the whole data.

Step 2. The k-means clustering algorithm is run on each sub-dataset; the number of clusters G must be specified beforehand.

Step 3. A PLS model is built on each cluster and the estimated coefficients $\hat{\beta}_g$ are calculated.

Step 4. The mean value of the $\hat{\beta}_g$ coefficients over the N sub-models is obtained.

Step 5. For a chosen thresholding parameter r, the set of indices selected by the proposed partition-based variable selection is $\mathcal{A} = \{ j : |\hat{\beta}_j| \text{ is among the } r \text{ largest} \}$.

Step 6. With the selected variables, a PLS model is fitted and applied to predict an independent test set.
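For illustration only, the sketch below strings Steps 1-4 together under simple assumptions: base R kmeans for Step 2, pls::plsr with a fixed number of components inside each cluster for Step 3, and a plain average over the N sub-models for Step 4. It is not the authors' reference implementation, and in practice the number of components per cluster would itself be chosen by cross validation.

```r
library(pls)

parpls_rank <- function(X, y, G = 10, N = 100, ncomp = 5) {
  p <- ncol(X)
  beta_sum <- numeric(p)
  for (b in seq_len(N)) {
    # Step 1: Monte Carlo sub-dataset with 75% of the samples
    idx <- sample(nrow(X), size = floor(0.75 * nrow(X)))
    Xb <- scale(X[idx, ], scale = FALSE)
    yb <- scale(y[idx], scale = FALSE)
    # Step 2: k-means partition of the wavelength variables
    grp <- kmeans(t(scale(X[idx, ])), centers = G, nstart = 10)$cluster
    beta_b <- numeric(p)
    # Step 3: one PLS model per cluster, keeping its coefficient vector
    for (g in seq_len(G)) {
      j <- which(grp == g)
      ncg <- min(ncomp, length(j), length(idx) - 1)
      fit <- plsr(yb ~ Xb[, j, drop = FALSE], ncomp = ncg)
      beta_b[j] <- drop(coef(fit, ncomp = ncg))
    }
    # Step 4: accumulate the per-cluster coefficients for the mean
    beta_sum <- beta_sum + beta_b
  }
  abs(beta_sum / N)   # mean absolute coefficients, used for the ranking in Step 5
}
```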

2.4.2 Parameter setting/tuning

The PARPLS algorithm has two tuning parameters: the number of clusters G in Step 2 and the thresholding parameter r in Step 5. The value of G can be set in the range between 1 and (p-1). When the number of clusters G is 1, our algorithm reduces to the feature selection based on the PLS regression coefficients [9]. As to the thresholding parameter r, we can refer to Fan's paper [28]: r can be set as [αn/log(n)], where the parameter α takes values within a reasonable range (e.g. α = 0.5, 1, 1.5, ⋯). As described above, we use 10-fold CV and a grid search to choose the optimal values of G and r based on the minimal root-mean-square error.
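A hedged sketch of this tuning step is given below: the threshold grid follows r = [αn/log(n)] and is crossed with a grid for G; the helper cv_rmse(), which would score each pair by 10-fold cross-validated RMSE, is hypothetical and not part of any package.

```r
n <- 100                                       # number of calibration samples (placeholder)
alpha_grid <- c(0.5, 1, 1.5, 2)
r_grid <- floor(alpha_grid * n / log(n))       # candidate numbers of selected variables
G_grid <- c(5, 10, 20, 40)                     # candidate numbers of clusters

grid <- expand.grid(G = G_grid, r = r_grid)
# grid$rmse <- mapply(function(G, r) cv_rmse(X, y, G = G, r = r, folds = 10),
#                     grid$G, grid$r)          # cv_rmse(): hypothetical helper
# best <- grid[which.min(grid$rmse), ]
```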

2.5 An outline of the compared methods

In this article, we compare our algorithm with other PLS variable selection methods that are widely used for high-dimensional spectral data. The five methods are the original PLS, sparse partial least squares (SPLS), PLSUVE, PLSVIP and SIS-SPLS. Their main ideas are briefly described in the following subsections.

2.5.1 The SPLS algorithm

The original sparse partial least squares (SPLS) regression algorithm is based on sparse principal component analysis (SPCA) [34], which imposes an L1-norm penalty on a surrogate of the direction vector c. In SPLS regression, the L1 and L2 penalties are added jointly when building the latent components:

$$\min_{w,c} \; \left[ -\kappa\, w^{T} M w + (1-\kappa)(c - w)^{T} M (c - w) + \lambda_1 \|c\|_1 + \lambda_2 \|c\|_2^{2} \right] \quad \text{s.t.} \quad w^{T} w = 1 \qquad (5)$$

where $w$ and $c$ respectively represent the original and surrogate direction vectors, $\kappa$, $\lambda_1$ and $\lambda_2$ are penalty factors, and $M = X^{T} Y Y^{T} X$. The deflation and update steps differ between the two SPLS variants, namely the NIPALS- and SIMPLS-based algorithms: the NIPALS variant deflates Y using the current PLS fit $\hat{\beta}_{PLS}$, whereas the SIMPLS variant projects X onto the orthogonal complement of the loadings of the selected variables. In both cases the active set is $\mathcal{A} = \{ i : \hat{c}_i \neq 0 \text{ or } \hat{\beta}_{PLS,i} \neq 0 \}$. The sparsity level ($\lambda_1$, $\lambda_2$) and the number of latent components are determined by cross validation.
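The spls package cited in Section 3.4 implements this model; the sketch below shows its typical use, tuning the sparsity parameter eta and the number of components K by cross validation and then reading off the selected variables. The data are placeholders, not the paper's datasets.

```r
library(spls)

set.seed(3)
X <- matrix(rnorm(80 * 300), 80, 300)
y <- drop(X[, 1:5] %*% c(-3, 3, -3, 3, -3)) + rnorm(80)

# Cross-validated choice of the number of components K and the sparsity eta
cvfit <- cv.spls(X, y, K = 1:5, eta = seq(0.1, 0.9, 0.2))
fit <- spls(X, y, K = cvfit$K.opt, eta = cvfit$eta.opt)

selected <- which(coef(fit) != 0)   # indices of the wavelengths kept by SPLS
length(selected)
```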

2.5.2 The PLSUVE algorithm

The PLSUVE algorithm selects the important variables by adding a series of artificial random variables. These artificial variables have the same dimension as the original variable space. The fitness of the jth variable to enter the PLS model is determined by the ratio of its regression coefficient $b_j$ and its standard deviation $s(b_j)$:

$$c_j = \frac{b_j}{s(b_j)}, \qquad j = 1, \cdots, 2p \qquad (6)$$

where $s(b_j)$ can be obtained from the vector of the n leave-one-out jackknifed coefficients $b_{ij}$. The cutoff level can be set as $\mathrm{abs}(\max(c_{noise}))$ over the artificial variables. If $\mathrm{abs}(c_j) < \mathrm{abs}(\max(c_{noise}))$ $(j = 1, \cdots, p)$, the jth variable is regarded as uninformative; otherwise it is considered informative.
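A simplified sketch of this idea is given below, using the jackknife utilities of the pls package rather than the packaged plsVarSel implementation used later in the paper; the scale of the appended noise variables and the fixed number of components are assumptions made only for illustration.

```r
library(pls)

set.seed(4)
n <- 60; p <- 100
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -2, 2)) + rnorm(n)

# Append p artificial noise variables of negligible magnitude (same dimension as X)
Xa <- cbind(X, matrix(rnorm(n * p, sd = 1e-10 * sd(X)), n, p))

A <- 5
fit <- plsr(y ~ Xa, ncomp = A, validation = "LOO", jackknife = TRUE)

b <- drop(coef(fit, ncomp = A))
s <- sqrt(drop(var.jack(fit, ncomp = A)))    # jackknife standard deviation of each coefficient
cj <- b / s                                  # reliability c_j, formula (6)

cutoff <- max(abs(cj[(p + 1):(2 * p)]))      # largest |c_j| among the artificial variables
informative <- which(abs(cj[1:p]) > cutoff)  # retained (informative) original variables
```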

2.5.3 The PLSVIP algorithm

The PLS regression coefficient $\hat{\beta}_{PLS}$ has been one of the most frequently used indices for measuring variable importance and selecting individual wavelengths. Another commonly used metric for judging the importance of the X-variables on Y is the variable importance in projection (VIP). The VIP score of the jth variable (wavelength) for a single response y is computed according to formula (7):

$$\mathrm{VIP}_j = \sqrt{ p \sum_{a=1}^{A} \left[ \left( q_a^{2} \, t_a^{T} t_a \right) \left( w_{aj} / \| w_a \| \right)^{2} \right] \Big/ \sum_{a=1}^{A} \left( q_a^{2} \, t_a^{T} t_a \right) } \qquad (7)$$

where $t_a$, $w_a$ and $q_a$ are the scores, loading weights and Y-loadings of the ath component. The PLSVIP algorithm was proposed by S. Wold [35] and is a filter strategy for variable selection. Variable j can be eliminated if $\mathrm{VIP}_j < u$ for a user-defined threshold $u$. Here, the optimal value of $u$ is chosen by cross validation.
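Formula (7) can be computed directly from a fitted plsr object; the small helper below does so for a single response and is only a sketch (the fitted object fit and the threshold u are assumed to come from the earlier examples).

```r
library(pls)

vip_scores <- function(fit, ncomp) {
  W <- fit$loading.weights[, 1:ncomp, drop = FALSE]  # p x A loading weights w_a
  T <- fit$scores[, 1:ncomp, drop = FALSE]           # n x A score vectors t_a
  q <- drop(fit$Yloadings)[1:ncomp]                  # Y-loadings q_a (single response)
  p <- nrow(W)
  ss  <- q^2 * colSums(T^2)                          # q_a^2 * t_a' t_a for each component
  Wn2 <- sweep(W^2, 2, colSums(W^2), "/")            # (w_aj / ||w_a||)^2
  drop(sqrt(p * (Wn2 %*% ss) / sum(ss)))             # formula (7)
}

# v <- vip_scores(fit, ncomp = ncomp_opt)
# kept <- which(v >= u)    # u: user-defined threshold, tuned by cross validation
```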

2.5.4 The SIS-SPLS algorithm

The SIS-SPLS algorithm was proposed by Xu et al. [15] and combines the sure independence screening (SIS) model with a sparse version of the PLS (SPLS) method. In this algorithm, the variables with the largest marginal Pearson correlations are added to the model in the first step. Its formulation is as follows:

$$\hat{w} = I_{\mathcal{A}_1} w \qquad (8)$$

Here, $I_{\mathcal{A}_1}$ is a diagonal matrix whose ith diagonal element equals 1 for $i \in \mathcal{A}_1$ and 0 for $i \notin \mathcal{A}_1$. The algorithm then uses the deflation of equation (2) of PLS to shrink the X-variables and Y. Through multiple iterative steps, this model can efficiently select variables and reduce the variable dimension with a relatively low MSEP. Hence, this method preserves the attractive properties of PLS as well as the asymptotic properties of the SIS method. In addition, this method is quite similar to the covsel algorithm [16], which adopts a forward regression strategy to select each variable. The SIS-SPLS is faster and more effective than the covsel method because it uses the BIC to choose the important variables in each iteration.
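A simplified illustration of the screening step only (not the full iterative SIS-SPLS of [15]): variables are ranked by their marginal Pearson correlation with y, only the top d are kept, which has the same effect as the diagonal indicator matrix in (8), and a PLS model is then fitted on the screened set. The data and the choice of d are placeholders.

```r
library(pls)

set.seed(5)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% c(-3, 3, -3, 3, -3)) + rnorm(n)

# Sure independence screening: keep the d variables most correlated with y
d <- floor(n / log(n))
r_marg <- abs(cor(X, y))                       # p x 1 marginal Pearson correlations
A1 <- order(r_marg, decreasing = TRUE)[1:d]    # screened index set

X_screened <- X[, A1, drop = FALSE]            # same effect as multiplying by I_A1 in (8)
fit <- plsr(y ~ X_screened, ncomp = 5, validation = "CV")
```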

3. Data and software implementation

3.1 Simulation study

In order to compare the model selection accuracy of the partition-based partial least squares with that of other PLS variable selection algorithms, a set of simulated data, similar to that of reference [27], has been created. The covariates X are drawn from a multivariate normal distribution and the response is generated from the linear model $y = X\beta + \varepsilon$. The model is set up with n = 100, p = 500 and $\beta = (-3, 3, -3, 3, -3, 0_{495})^{T}$, and the noise follows the standard Gaussian distribution. The covariance structure of X is that of a block first-order autoregressive model with unit variance and correlation 0.9, i.e. $\mathrm{cov}(x_i, x_j) = 0.9^{|i-j|}$ for any $i \neq j \in \mathcal{G}_g$, where $\mathcal{G}_g = \{ j \in \mathbb{N} : 10g - 9 \leq j \leq 10g \}$ for $g = 1, \cdots, 50$, and $\mathrm{cov}(x_i, x_j) = 0$ otherwise.
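The simulated data can be generated, for example, as in the sketch below, assuming the block-AR(1) covariance just described and using MASS::mvrnorm for the multivariate normal draw.

```r
library(MASS)

set.seed(6)
n <- 100; p <- 500
block <- 10                                    # 50 blocks of 10 adjacent variables
Sigma <- matrix(0, p, p)
for (g in 1:(p / block)) {
  idx <- ((g - 1) * block + 1):(g * block)
  Sigma[idx, idx] <- 0.9^abs(outer(idx, idx, "-"))  # AR(1) with rho = 0.9 inside a block
}
diag(Sigma) <- 1

X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta <- c(-3, 3, -3, 3, -3, rep(0, p - 5))
y <- X %*% beta + rnorm(n)                     # standard Gaussian noise
```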

3.2 Beer dataset

The beer spectral dataset contains 60 samples and was published in [17]. The spectrophotometer uses a split detector system with a silicon (Si) detector between 400 and 1100 nm and a lead sulfide (PbS) detector from 1100 to 2500 nm. The spectral data, collected at 2 nm intervals in the range from 400 to 2250 nm, were converted to absorbance units. According to previous research [36], the 400-1100 nm part is mainly unsystematic noise and is irrelevant for prediction. Hence, we only consider the wavelengths from 1100 to 2250 nm (576 data points) in steps of 2 nm. The original extract concentration, which indicates the substrate potential for the yeast to ferment alcohol, is considered as the property of interest. The original spectra of the beer data are shown in Fig. 1(A).

3.3 Corn data

This benchmark dataset consists of 80 corn samples measured on the m5 NIR spectrometer. The spectral data include 700 points measured from 1100 to 2498 nm in steps of 2 nm. The NIR spectra of the 80 corn samples measured on the m5 instrument are treated as the predictor matrix X. In this study, two different responses are employed to investigate the performance of PARPLS: the oil value is treated as the dependent variable y in the first dataset, and the starch value is considered as the property of interest in the second. The spectra of the corn data are plotted in Fig. 2(A).

3.4 Software implementation and performance evaluation

All codes except the SIS-SPLS are written in the R language (version 3.4.0). PLS and SPLS are implemented with the pls [37] and spls [38] packages, respectively. There are two strategies in the pls package to determine the number of PLS components, namely cross validation (CV) and leave-one-out (LOO). The default strategy is cross validation with 10 segments. Hence, 10-fold cross validation is chosen and the number of PLS components is determined by the minimal root mean square error. The standard PLSUVE approach is implemented using the R package plsVarSel [39]. In addition, iterating Steps 2 and 3 over the N sub-models is time-consuming. Therefore, we suggest adopting parallel computing techniques, e.g. the doParallel or foreach R packages.
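One possible way to parallelize the N Monte Carlo sub-models with these packages is sketched below; the function fit_submodel(), standing for the per-subsample k-means and per-cluster PLS work, is a hypothetical helper and the worker count is a placeholder.

```r
library(doParallel)
library(foreach)

cl <- makeCluster(4)            # number of workers, adjust to the machine
registerDoParallel(cl)

N <- 100
# beta_list <- foreach(b = 1:N, .packages = "pls") %dopar% {
#   idx <- sample(nrow(X), floor(0.75 * nrow(X)))
#   fit_submodel(X[idx, ], y[idx], G = G)   # hypothetical: returns one coefficient vector
# }
# beta_mean <- Reduce("+", beta_list) / N   # Step 4: average over the N sub-models

stopCluster(cl)
```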

Furthermore, the fitting ability of the different methods was evaluated by the root mean square error (RMSE) and the coefficient of determination ($Q^2$). The RMSE is given as follows:

$$\mathrm{RMSE} = \sqrt{ \frac{ \sum_{i=1}^{n} ( \hat{y}_i - y_i )^{2} }{ n } } \qquad (9)$$

where $\hat{y}_i$ is the predicted response value and n is the number of samples. The coefficient of determination is defined by the formula:

$$Q^{2} = 1 - \frac{ \sum_{i=1}^{n} ( \hat{y}_i - y_i )^{2} }{ \sum_{i=1}^{n} ( y_i - \bar{y} )^{2} } \qquad (10)$$

where $\bar{y}$ is the mean of the response values. RMSEC represents the RMSE value obtained from the training set, and RMSEP indicates that the RMSE is calculated from an independent test set. $Q_C^2$ and $Q_P^2$ respectively represent the coefficients of determination for the training set and the independent test set.
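Given predictions on an independent test set, formulas (9) and (10) reduce to the two small functions below; y_test and y_pred are placeholders for the observed and predicted responses.

```r
# y_test: observed responses of the test set; y_pred: predictions from the calibration model
rmsep <- function(y_test, y_pred) sqrt(mean((y_pred - y_test)^2))          # formula (9)
q2p   <- function(y_test, y_pred) 1 - sum((y_pred - y_test)^2) /
                                      sum((y_test - mean(y_test))^2)       # formula (10)

# y_pred <- predict(fit, newdata = X_test, ncomp = ncomp_opt)   # e.g. from a plsr fit
# c(RMSEP = rmsep(y_test, y_pred), Q2P = q2p(y_test, y_pred))
```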

4. Results and discussion

To assess the performance of PARPLS, several high-performance wavelength selection methods, including SPLS, PLSUVE and PLSVIP, are used for comparison. All the datasets are first centered to have zero means before modeling. In this study, the calibration set (75% of the samples) is used for building the model and performing variable selection. The independent test set (25% of the samples) is then used to validate the calibration model. Several evaluation metrics are calculated: the root-mean-square error of the calibration set (RMSEC), $Q_C^2$, the root-mean-square error of prediction on the test set (RMSEP), $Q_P^2$, and the number of selected variables (nVAR). Each method is repeated 50 times to ensure reproducibility and stability of the evaluation.

4.1 Simulation study

The simulation results, including the sensitivity, specificity, RMSEP and nVAR, are recorded in Table 1. From Table 1, we can see that the PARPLS model is more efficient in this setting than the other models. The SPLS model chooses only three of the true variables and misses the other two. The SIS-SPLS is not suited to this setting because of its lower sensitivity and specificity and its larger RMSEP. The PLSVIP and PLSUVE give results similar to the PARPLS, but the PARPLS model has higher sensitivity and specificity, which means that it has a better probability of selecting the true variables. In addition, our model shows the smallest RMSEP and nVAR in this setting. Hence, we conclude that our model is efficient for variable screening.

[Insert Tab.1]

4.2 Case study

4.2.1 Beer dataset

For the beer dataset, the evaluation results of the different methods are given in Tab. 2 and Fig. 1. Compared with the full-spectrum PLS, all the variable selection methods improve the prediction performance on both the cross-validation and the test set, which demonstrates that it is necessary to perform variable selection before developing a calibration model on this dataset. The prediction accuracy of the PARPLS algorithm is significantly enhanced: compared to the PLS model on the full spectra, the values of RMSEP and $Q_P^2$ for PARPLS are improved from 0.209 to 0.074 and from 0.944 to 0.99, respectively.

The variables selected by the different methods on the beer dataset are displayed in Fig. 1. The SPLS and PLSUVE tend to select a larger number of variables. From Fig. 1(B), (C), (D) and (F), we can see that the SPLS, PLSVIP, PLSUVE and PARPLS methods select similar regions, although their nVAR values differ somewhat. According to previous research, the 1100-1400 nm region has a larger correlation with the response than other regions, and it corresponds to the first overtone of C-H and O-H stretching bond vibrations [25]. Our method also chooses these relevant informative regions, such as the regions of 1150-1160 and 1260-1350 nm, and the accuracy of our model is similar to that of the SPLS, PLSVIP and PLSUVE. As to the SIS-SPLS, it selects several variables from the region of 1400 to 2250 nm. The reason may be that the independence condition in the SIS-SPLS [15] is often invalid in practice, so that some unimportant variables are chosen in the presence of many noisy variables.

[Insert Tab.2]

[Insert Fig. 1]

4.2.2 Oil dataset

For the corn oil dataset, the performances of the different methods are presented in Tab. 3 and Fig. 2. From Tab. 3, we observe that, compared to the full-spectrum PLS, our variable selection method improves the prediction accuracy, whereas the SPLS, SIS-SPLS and PLSVIP methods show decreased prediction accuracy. According to the RMSEP, a clear ranking of all the methods can be observed in Tab. 3. Compared with the full spectra, the RMSEP and $Q_P^2$ of our algorithm improve remarkably, by 45.1% and 2.31%, although our algorithm has RMSEC and $Q_C^2$ values similar to those of the PLS algorithm. In addition, the PARPLS yields the lowest standard deviation, indicating the highest stability on this dataset.

The variables selected by the different methods on the oil dataset are displayed in Fig. 2. The SPLS tends to select a larger number of variables, while the nVAR values obtained by the SIS-SPLS are lower. From Fig. 2(D) and (F), we can observe that the wavelengths selected by the PARPLS include several informative wavelengths, such as the regions around 1700 nm, 2100 nm and 2300 nm, which correspond to the overtones of the C-H stretching mode and the combinations of C-H vibrations [25]. These wavelengths are actually assigned to specific compounds related to the response [16,40]. The regions selected by the PLSVIP mainly concentrate on both ends of the full spectrum. The PLSUVE only picked region 4 successfully but failed to select the other informative wavelength regions; hence, its prediction accuracy is even lower than that of the full spectra. The wavelength regions with extremely high selection frequency, such as 1700-1744 nm, 2030-2050 nm and 2050-2300 nm, are naturally considered to be key wavelength regions.

[Insert Tab.3]

[Insert Fig. 2]

4.2.3 Starch dataset

For the corn starch dataset, the performances of the different methods are presented in Tab. 4 and Fig. 3. From Tab. 4, we observe that, compared to the full-spectrum PLS, our variable selection method improves the prediction accuracy, whereas the SPLS, PLSUVE and PLSVIP methods show decreased prediction accuracy. By comparison, the PARPLS shows a large and significant improvement in $Q_P^2$ and RMSEP, and our algorithm yields the lowest standard deviation of RMSEP among all methods, indicating the highest stability on this dataset.

The variables selected by the different methods on the starch dataset are presented in Fig. 3. In terms of nVAR, our method chooses fewer variables than the PLSUVE and SPLS. From Fig. 3(E) and (F), we can observe that the wavelengths selected by both the SIS-SPLS and the PARPLS include 1500 nm, 1700 nm and 1800 nm. However, the selection frequencies of the intervals chosen by the SIS-SPLS are smaller than those of the PARPLS, and part of the wavelength intervals, such as 1330 nm, 1500 nm and 2206-2228 nm, are missing from its selection. These regions correspond to the overtones of the C-H, O-H and N-H bonds and their interactions, and are meaningful for the starch content [16,40]. Therefore, our method may have a greater potential advantage in selecting important informative wavelengths.

[Insert Tab.4]

[Insert Fig. 3]

5. Conclusions

In this study, we propose a partition-based variable selection in partial least squares regression (PARPLS) for spectral data. Unlike most existing variable selection methods, PARPLS uses the k-means algorithm to adaptively split the variables into groups in order to exploit local information, and then selects important variables based on the estimated group-wise coefficients. The applications to three NIR datasets show that our algorithm can extract effective variables and improve the prediction performance when compared with several other classical PLS variable selection algorithms. In the future, this efficient and simple strategy can be combined with methods from other frameworks, such as sufficient dimension reduction (SDR) [41], functional data analysis (FDA) [42] and stacked models [43]. The method can also be extended to classification data. Therefore, we believe that PARPLS is a potentially useful tool for exploring high-dimensional data and identifying valuable variables from spectral data.

Acknowledgements

We thank the editor and the referees for constructive suggestions that substantially improved this work. We also thank Feng Wang for providing us with the high-performance computer. This work was supported by the Innovation Program of Central South University [Grant No. 502221807].

References

[1]. R.G. Brereton, J. Jansen, J. Lopes, F. Marini, A. Pomerantsev, O. Rodionova, et al. Chemometrics in analytical chemistry-part II: modeling, validation, and applications. Anal. Bioanal. Chem. 410 (2018) 6691-6704.
[2]. C. Pasquini. Near infrared spectroscopy: A mature analytical technique with new perspectives - A review. Anal. Chim. Acta 1026 (2018) 8-36.
[3]. I.E. Frank, J.H. Friedman. A statistical view of some chemometrics regression tools. Technometrics 35 (1993) 109-135.
[4]. J.Q. Fan, J.C. Lv. A selective overview of variable selection in high dimensional feature space. Stat. Sinica 20 (2010) 101-148.
[5]. Y.H. Yun, H.D. Li, B.C. Deng, D.S. Cao. An overview of variable selection methods in multivariate analysis of near-infrared spectra. Trac-Trend. Anal. Chem. 113 (2019) 102-115.
[6]. S. Wold, M. Sjöström, L. Eriksson. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58 (2001) 109-130.
[7]. C.Q. Li, N. Xiao, Y. Wen, S.H. He, Y.D. Xu, Y.W. Lin, et al. Collaboration patterns and network in chemometrics. Chemom. Intell. Lab. Syst. 191 (2019) 21-29.
[8]. T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø. A review of variable selection methods in partial least squares regression. Chemom. Intell. Lab. Syst. 118 (2012) 62-69.
[9]. A. Frenich, D. Jouan-Rimbaud, D. Massart, S. Kuttatharmmakul, M. Galera, J. Vidal. Wavelength selection method for multicomponent spectrophotometric determinations using partial least squares. Analyst 120 (1995) 2787-2792.
[10]. L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold. Multi- and Megavariate Data Analysis. Umetrics, Umeå, 2001.
[11]. I. Frank. Intermediate least squares regression method. Chemom. Intell. Lab. Syst. 1 (1987) 233-242.
[12]. V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna. Elimination of uninformative variables for multivariate calibration. Anal. Chem. 68 (1996) 3851-3858.
[13]. J.A. Fernández Pierna, O. Abbas, V. Baeten, P. Dardenne. A backward variable selection method for PLS regression (BVSPLS). Anal. Chim. Acta 642 (2009) 89-93.
[14]. H.D. Li, Y.Z. Liang, Q.S. Xu, D.S. Cao. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta 648 (2009) 77-84.
[15]. X. Xu, K.K. Cheng, L. Deng, J. Dong. A sparse partial least squares algorithm based on sure independence screening method. Chemom. Intell. Lab. Syst. 170 (2017) 38-50.
[16]. J.M. Roger, B.D. Palagos, E. Fernandez-Ahumada. CovSel: Variable selection for highly multivariate and multi-response calibration. Chemom. Intell. Lab. Syst. 106 (2011) 216-223.
[17]. L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen. Interval partial least-squares regression (iPLS). Appl. Spectrosc. 54 (2000) 413-419.
[18]. H. Chun, S. Keles. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. B 72 (2010) 3-25.
[19]. A. Höskuldsson. Variable and subset selection in PLS regression. Chemom. Intell. Lab. Syst. 55 (2001) 23-38.
[20]. L.L. Wang, Y.W. Lin, X.F. Wang, N. Xiao, Y.D. Xu, H.D. Li, et al. A selective review and comparison for interval variable selection in spectroscopic modeling. Chemom. Intell. Lab. Syst. 172 (2018) 229-240.
[21]. Y.P. Du, Y.Z. Liang, J.H. Jiang, R.J. Berry, Y. Ozaki. Spectral regions selection to improve prediction ability of PLS models by changeable size moving window partial least squares and searching combination moving window partial least squares. Anal. Chim. Acta 501 (2004) 183-191.
[22]. Y.H. Yun, H.D. Li, L.R.E. Wood, W. Fan, J.J. Wang, D.S. Cao, Q.S. Xu, Y.Z. Liang. An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration. Spectrochim. Acta A 111 (2013) 31-36.
[23]. V.P. Dominic, S.D. Brown. Stacked interval sparse partial least squares regression analysis. Chemom. Intell. Lab. Syst. 166 (2017) 49-60.
[24]. W.D. Ni, S.D. Brown, R.L. Man. Stacked partial least squares regression analysis for spectral calibration and prediction. J. Chemom. 23 (2009) 505-517.
[25]. Y.W. Lin, B.C. Deng, L.L. Wang, Q.S. Xu, L. Liu, Y.Z. Liang. Fisher optimal subspace shrinkage for block variable selection with applications to NIR spectroscopic analysis. Chemom. Intell. Lab. Syst. 159 (2016) 196-204.
[26]. A. Park, J. Aston, F. Ferraty. Stable and predictive functional domain selection with application to brain images. Preprint (2016). arXiv:1606.02186.
[27]. J. Kang, H.G. Hong, Y. Li. Partition-based ultrahigh-dimensional variable screening. Biometrika 104 (2017) 785-800.
[28]. J.Q. Fan, J.C. Lv. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. B 70 (2008) 849-911.
[29]. A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31 (2010) 651-666.
[30]. G.H. Fu, W.M. Zhang, L. Dai, Y.Z. Fu. Group variable selection with oracle property by weight-fused adaptive elastic net model for strongly correlated data. Commun. Stat-Simul. C 43 (2014) 2468-2481.
[31]. H.J. Miller, J.W. Han. Geographic Data Mining and Knowledge Discovery. CRC Press, New York, 2009.
[32]. H.D. Li, Y.Z. Liang, D.S. Cao, Q.S. Xu. Model-population analysis and its applications in chemical and biological modeling. Trac-Trend. Anal. Chem. 38 (2012) 154-162.
[33]. N. Meinshausen, P. Bühlmann. Stability selection. J. R. Stat. Soc. B 72 (2010) 417-473.
[34]. H. Zou, T. Hastie, R. Tibshirani. Sparse principal component analysis. J. Comput. Graph. Stat. 15 (2006) 265-286.
[35]. L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold. Multi- and Megavariate Data Analysis. Umetrics, Umeå, 2001.
[36]. C.M. Andersen, R. Bro. Variable selection in regression - a tutorial. J. Chemom. 24 (2010) 728-737.
[37]. B.H. Mevik, R. Wehrens, K.H. Liland, P. Hiemstra. pls: Partial Least Squares and Principal Component Regression, 2019. R package version 2.7-1.
[38]. D. Chung, H. Chun, S. Keles. spls: Sparse Partial Least Squares (SPLS) Regression and Classification, 2009. R package version 2.1-0.
[39]. K.H. Liland, T. Mehmood, S. Sæbø. plsVarSel: Variable Selection in Partial Least Squares, 2017. R package version 0.9.4.
[40]. B.G. Osborne. Near-Infrared Spectroscopy in Food Analysis. 2006.
[41]. Y.W. Lin, B.C. Deng, Q.S. Xu, Y.H. Yun, Y.Z. Liang. The equivalence of partial least squares and principal component regression in the sufficient dimension reduction framework. Chemom. Intell. Lab. Syst. 150 (2016) 58-64.
[42]. J.O. Ramsay, B.W. Silverman. Functional Data Analysis. Springer, New York, 2005.
[43]. W.D. Ni, S.D. Brown, R.L. Man. Stacked partial least squares regression analysis for spectral calibration and prediction. J. Chemom. 23 (2009) 505-517.

Figure Captions

Table 1. The results of the simulated example. The data are presented as (sensitivity, specificity, RMSEP and nVAR).

Table 2. The results of different methods on the beer dataset; benchmarking results are given as mean value ± standard deviation over 50 runs.

Table 3. The results of different methods on the oil dataset; benchmarking results are given as mean value ± standard deviation over 50 runs.

Table 4. The results of different methods on the starch dataset; benchmarking results are given as mean value ± standard deviation over 50 runs.

Fig. 1. Selection frequencies of the selected variables over 50 runs by different methods on the beer dataset. (A) Raw NIR spectra of the beer data, (B) SPLS, (C) PLSUVE, (D) PLSVIP, (E) SIS-SPLS and (F) PARPLS.

Fig. 2. Selection frequencies of the selected variables over 50 runs by different methods on the oil dataset. (A) Raw NIR spectra of the oil data, (B) SPLS, (C) PLSUVE, (D) PLSVIP, (E) SIS-SPLS and (F) PARPLS.

Fig. 3. Selection frequencies of the selected variables over 50 runs by different methods on the starch dataset. (A) Raw NIR spectra of the starch data, (B) SPLS, (C) PLSUVE, (D) PLSVIP, (E) SIS-SPLS and (F) PARPLS.

Table 1. The results of the simulated example. The data are presented as mean value ± standard deviation of sensitivity, specificity, RMSEP and nVAR.

Methods  | Sensitivity | Specificity | RMSEP | nVAR
PLS      | ---- | ---- | 2.78±0.40 | 500
SPLS     | 0.6±0 | 1±0 | 1.95±0 | 3±0
PLSUVE   | 0.988±0.047 | 0.975±0.012 | 1.232±0.32 | 17.4±5.84
PLSVIP   | 0.99±0.045 | 0.996±0.003 | 1.084±0.165 | 6.9±1.41
SIS-SPLS | 0.836±0.165 | 0.806±0 | 2.56±0.474 | 99.9±0.14
PARPLS   | 1±0 | 0.996±0.003 | 1.07±0.149 | 6.9±1.77

Table 2. The results of different methods on the beer dataset; benchmarking results are given as mean value ± standard deviation in 50 runs.

Method   | RMSEC | RMSEP | Ncomp | Q_C^2 | Q_P^2 | nVAR
PLS      | 0.003±0.005 | 0.209±0.088 | 7.28±1.89 | 0.999±0 | 0.944±0.063 | 576
SPLS     | 0.051±0.023 | 0.092±0.025 | 3.16±1.68 | 0.996±0.002 | 0.988±0.013 | 117.7±23.36
PLSUVE   | 0.057±0.011 | 0.083±0.017 | 4.6±2.62 | 0.996±0.002 | 0.990±0.015 | 97.44±6.35
PLSVIP   | 0.04±0.01 | 0.079±0.024 | 8.54±2.47 | 0.997±0.001 | 0.990±0.016 | 68.6±18.73
SIS-SPLS | 0.0143±0.005 | 0.591±0.199 | 10±0 | 0.999±0 | 0.466±0.522 | 100±0
PARPLS   | 0.048±0.01 | 0.073±0.02 | 6.6±2 | 0.998±0 | 0.990±0.01 | 34.2±18.1

Table 3. The results of different methods on the oil dataset; benchmarking results are given as mean value ± standard deviation in 50 runs.

Method   | RMSEC | RMSEP | Ncomp | Q_C^2 | Q_P^2 | nVAR
PLS      | 0.263±0.021 | 0.387±0.058 | 9.82±0.43 | 0.928±0.03 | 0.833±0.067 | 700
SPLS     | 0.265±0.025 | 0.386±0.056 | 9.64±0.74 | 0.926±0.015 | 0.831±0.074 | 487.1±147.5
PLSUVE   | 0.26±0.056 | 0.386±0.056 | 9.36±1.19 | 0.927±0.034 | 0.85±0.08 | 143.1±61.2
PLSVIP   | 0.297±0.05 | 0.362±0.073 | 9.18±1.22 | 0.906±0.032 | 0.80±0.124 | 86.2±12.76
SIS-SPLS | 0.291±0.027 | 0.405±0.071 | 10±0 | 0.913±0.018 | 0.809±0.078 | 43.24±26.5
PARPLS   | 0.11±0.021 | 0.168±0.04 | 9.74±0.5 | 0.986±0.005 | 0.966±0.02 | 64.2±16.8

Table 4. The results of different methods on the starch dataset; benchmarking results are given as mean value ± standard deviation in 50 runs.

Method   | RMSEC | RMSEP | Ncomp | Q_C^2 | Q_P^2 | nVAR
PLS      | 0.207±0.018 | 0.3±0.053 | 9.96±0.2 | 0.956±0.008 | 0.89±0.06 | 700
SPLS     | 0.207±0.017 | 0.304±0.057 | 9.98±0.14 | 0.956±0.007 | 0.886±0.066 | 688.3±26.67
PLSUVE   | 0.335±0.116 | 0.478±0.154 | 9.4±1.55 | 0.871±0.113 | 0.713±0.199 | 152.9±94.77
PLSVIP   | 0.283±0.04 | 0.392±0.074 | 9.5±0.76 | 0.916±0.028 | 0.815±0.088 | 84.6±12.8
SIS-SPLS | 0.29±0.042 | 0.454±0.101 | 10±0 | 0.911±0.027 | 0.711±0.139 | 28.2±14.81
PARPLS   | 0.204±0.036 | 0.285±0.065 | 10±0 | 0.956±0.017 | 0.9±0.06 | 61±14.74

Highlights

- Our method takes into account the group information in the spectral data by using the k-means algorithm to select variables.
- The Monte Carlo sampling method is adopted to obtain more stable group information and PLS regression coefficients.
- Our algorithm is able to obtain better prediction performance and select more efficient variables than its competitors.

Competing interests
The authors declare that they have no competing interests.

Authors contributions
The authors declare that the study was realized in collaboration with the same responsibility. All authors read and approved the final manuscript.