Trends in Analytical Chemistry 113 (2019) 102–115
An overview of variable selection methods in multivariate analysis of near-infrared spectra

Yong-Huan Yun a,**, Hong-Dong Li b, Bai-Chuan Deng c, Dong-Sheng Cao d,*
a College of Food Science and Technology, Hainan University, 58 Renmin Road, Haikou, 570228, PR China
b School of Information Science and Engineering, Central South University, Changsha, 410083, PR China
c Guangdong Provincial Key Laboratory of Animal Nutrition Control, Subtropical Institute of Animal Nutrition and Feed, College of Animal Science, South China Agricultural University, Guangzhou, 510642, PR China
d Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, PR China
Abstract
Article history: Available online 2 February 2019
With the advances in innovative instrumentation and various valuable applications, near-infrared (NIR) spectroscopy has become a mature analytical technique in various fields. Variable (wavelength) selection is a critical step in multivariate calibration of NIR spectra, which can improve the prediction performance, make the calibration reliable and provide simpler interpretation. During the last several decades, there have been a large number of variable selection methods proposed in NIR spectroscopy. In this paper, we generalize variable selection methods in a simple manner to introduce their classifications, merits and drawbacks, to provide a better understanding of their characteristics, similarities and differences. We also introduce some hybrid and modified methods, highlighting their improvements. Finally, we summarize the limitations of existing variable selection methods, providing our remarks and suggestions on the development of variable selection methods, to promote the development of NIR spectroscopy. © 2019 Elsevier B.V. All rights reserved.
Keywords: Variable selection; Multivariate calibration; Near-infrared spectroscopy; Classification; Wavelength interval selection; Wavelength point selection
1. Introduction

Being simple, rapid, noninvasive, cost-effective and requiring no sample pretreatment, near-infrared (NIR) spectroscopy [1] has been adopted as a popular analytical tool for both qualitative and quantitative analysis in various fields spanning the agricultural [2,3], petrochemical [4], pharmaceutical [5,6], food [7–10], polymer [11], forestry [12], traditional Chinese medicine [13,14], environmental [15,16], biomedical and clinical sectors [17,18]. As Pasquini said, modern NIR spectroscopy is a mature analytical technique that rests on three sustaining pillars: fundamentals of vibrational spectroscopy, instrumentation, and chemometrics [19]. A multivariate calibration method from chemometrics is first applied to construct the relationship between the NIR wavelengths and the property of interest, that is, to build a predictive model. The model is then used to predict the same property from the NIR wavelengths
of unknown samples for qualitative or quantitative analysis [20–22]. With the advance of modern analytical instruments, a NIR spectrum of a sample contains hundreds or even thousands of wavelengths. For instance, a spectral range of 6000 cm−1 can yield 1557 spectral points (i.e., variables) at a high resolution of 4 cm−1 when using a Fourier transform NIR analyzer. Such high-dimensional data bring about the "curse of dimensionality" [23,24], with which many traditional statistical methods fail to cope. With the large number of spectral variables, NIR spectra usually include noise and interfering variables, which render prediction of the property of interest unreliable. To address these problems, three types of methods have been developed: regularization, dimension reduction and variable selection [25]. The regularization methods, such as ridge regression [26], elastic net [27,28], least absolute shrinkage and selection operator (LASSO) [29] and fuzzy rule-building systems [30], add a penalty term to the objective function to solve the overfitting problem caused by high dimensionality. The dimension reduction methods replace the original high-dimensional variable space by a low-dimensional space. For instance, the projection methods principal component regression (PCR) [31] and partial least squares (PLS) regression [32] replace the original variables by a few latent variables or principal components of larger variance, to reduce the impact of
collinearity, band overlaps and redundant noise irrelevant to the property of interest. However, full-spectrum PCR and PLS usually suffer from the fact that the latent variables are hardly interpretable compared with the original variables. In contrast, as only part of the variables is related to the property of interest, variable selection is based on the assumption that choosing a small number of variables can improve the prediction performance, make the calibration reliable, and provide easier interpretation. It is carried out prior to multivariate calibration methods like PLS and PCR. In fact, the variables may be informative, uninformative (noise) or interfering with respect to the property of interest [33–35]. For example, if the NIR model predicts the moisture content, the wavelengths related to the O–H bonds could be regarded as informative variables, and the other variables as uninformative or even interfering, because moisture has only O–H bonds. From this chemical point of view, variable selection is highly useful in removing uninformative or interfering variables, thus improving the prediction performance of the analytical method. Spiegelman et al. justified the improvement of PLS calibration through selection of informative variables from a mathematical point of view [36]. Yun, Liang et al. verified that, for complex analytical systems such as a vibrational spectroscopy system, it is essential to conduct variable selection to gain better prediction performance [37]. In addition to theoretical demonstrations, many experiments have also shown that variable selection can yield better prediction performance and better interpretation [38–45]. Zou et al. [46] summarized the importance of variable selection in terms of its chemical, physical and statistical basis. Overall, variable selection is a critical step in multivariate calibration of NIR spectra. The purpose of variable selection can be summarized in three aspects [47]: (1) improve the model predictive ability; (2) provide faster and more cost-effective predictors by reducing the noise or interfering variables; (3) improve interpretability with simple models. With the high number of variables in NIR datasets, it is impossible to investigate all possible combinations of variables to select the optimal set. In the situation where the number of samples is much smaller than the number of variables (large p, small n) [48,49], finding the optimal subset that satisfies the above three aspects becomes a nondeterministic polynomial-time (NP) hard optimization problem. Thus, variable selection is actually a mathematical optimization problem. Mathematical optimization searches the space of possible variable subsets and chooses the subset that is optimal or near-optimal with regard to an objective function, based on optimization algorithms and statistical strategies. This approach should be aided by computers and statistical science. In the past two decades, a large number of variable selection methods have been proposed for analyzing NIR spectra, and there are many ways to classify them. The classification categories and a brief overview of most variable selection methods are introduced in Section 3.
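To give a sense of the scale involved (an illustrative calculation based on the numbers above, not a result from the cited studies): an exhaustive search over p = 1557 wavelengths would have to examine 2^1557 − 1 ≈ 5 × 10^468 non-empty variable subsets, far beyond any realistic computational budget, which is why the heuristic and statistical search strategies discussed below are indispensable.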
In this paper, we briefly overview the variable selection methods in NIR over the last several decades while not repeating the contents of other reviews and surveys [46,50,51]. Our purposes are twofold: First, we generalize variable selection methods in a simple manner in order to clearly inform readers about the characteristics of each method, and to draw visual comparisons on their similarities and differences. Second, we summarize the problems of existing variable selection methods and give a deeper perspective on how to direct the development of the variable selection methods, which could be very beneficial for the further development of NIR spectroscopy.
2. Variable selection in multivariate calibration of NIR spectra

As described above, variable selection is a critical step in multivariate calibration for the analysis of NIR spectra. For the NIR technique, multivariate calibration [52] is defined as "a process for creating a model that relates sample properties 'y' to the intensities or absorbance 'X' at more than one wavelength or frequency of a set of known reference samples", as shown in Equation (1) [48].
y = f(X)    (1)
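As a concrete illustration of Equation (1) and of the error measures introduced in the next paragraph, the minimal sketch below builds a PLS calibration model on a calibration set and evaluates RMSEC, RMSECV and RMSEP. It is only a sketch under stated assumptions: the data are simulated, and scikit-learn's PLSRegression stands in for the chemometric toolboxes actually used in the cited studies (mostly MATLAB or R, see Table 4).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Simulated "spectra": 100 samples x 500 wavelengths; the property of interest y
# depends only on a small informative region (wavelengths 100-119).
X = rng.normal(size=(100, 500))
y = X[:, 100:120].sum(axis=1) + 0.1 * rng.normal(size=100)

# Split into a calibration set and an independent test set.
X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the calibration model y = f(X) with PLS regression.
pls = PLSRegression(n_components=5).fit(X_cal, y_cal)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

rmsec = rmse(y_cal, pls.predict(X_cal))                             # error on the calibration set
rmsecv = rmse(y_cal, cross_val_predict(pls, X_cal, y_cal, cv=10))   # cross-validation error
rmsep = rmse(y_test, pls.predict(X_test))                           # error on the independent test set
print(f"RMSEC = {rmsec:.3f}, RMSECV = {rmsecv:.3f}, RMSEP = {rmsep:.3f}")
```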
Generally, the NIR dataset is divided into a calibration set and an independent test set. The calibration set is used to build the calibration model, while the test set is an independent set used to validate it. The calibration set is further divided into a training set and a prediction set, and the prediction set is used to evaluate the error of the calibration model built on the training set. Cross-validation and bootstrap techniques, including leave-one-out [53], leave-many-out [53], Monte Carlo (MC) cross-validation [54], double cross-validation [55] and bootstrapped Latin partitions [56,57], which divide the calibration set into training and prediction sets, are commonly used to estimate the error of the calibration model. The variable selection step is conducted on the calibration set, and the calibration model is built with the selected variables [58]. When the calibration model is used for prediction, only the selected variables of the spectra of new samples are used to predict the property of interest. The root mean-squared error of calibration on the calibration set (RMSEC), the root mean-squared error of prediction on the test set (RMSEP) and the root mean-squared error of cross-validation (RMSECV) are often used to assess the accuracy of a model [59].

3. The classification of variable selection methods

Owing to the continuous features of NIR spectra, wavelength interval selection (WIS) and wavelength point selection (WPS) are suitable categories for classifying variable selection methods for NIR spectral data. WIS methods follow the fact that NIR spectra consist of continuous spectral bands: each wavelength interval, consisting of a number of contiguous variables, is regarded as a unit when conducting variable selection. WPS methods regard each wavelength point as a unit (i.e., a variable); the selected variables are thus discrete. Filter, wrapper and embedded approaches [47,60] are often used to classify variable selection methods in various fields based on the combination of evaluation metric and learning algorithm. Variable subset selection and variable ranking [47] classify the methods according to the final output of the variables. Static and dynamic approaches [61] are distinguished by the selection process. Fig. 1 shows the classification of variable selection methods based on these different principles. In this study, we generalize most variable selection methods in a simple manner in terms of the following four factors.

1) Initialization of variables: Any variable selection method must initialize the input variables. Some methods take all variables into account for initialization, directly or through a pre-processing step. Sampling methods such as Monte Carlo (MC) sampling, bootstrap sampling and binary matrix sampling (BMS) [62] are often used to generate subsets of variables in the variable space. Wavelength interval selection methods usually first divide all variables into intervals of a given width. It should be noted that the initialization of variables is a very important step, as it can even influence the final variable subset.
Fig. 1. Classification of variable selection methods.
2) Modeling method: With the selected variables, a modeling method is used to build the relationship between the variables and the property of interest. For NIR spectra, common modeling methods include multivariable linear regression (MLR), PCR, PLS, LASSO, support vector regression (SVR) [63], extreme learning machine (ELM) [64,65] and artificial neural networks (ANN) [66]. However, some univariate methods are model-free; they only evaluate the relevance between the variables and the property of interest, for instance through a correlation coefficient.

3) Evaluation metric: The prediction performance of a single variable or a variable subset is assessed with an evaluation metric, such as the correlation coefficient, the regression coefficients or the RMSECV.

4) Selection strategy: This step finds the optimal variable subset by searching the variable space. Selection strategies include filter-based, extreme value, sequential, exhaustive, intelligent optimization algorithm-based (IOA-based) and model population analysis-based (MPA-based) searches, among others.

I) Filter-based: When an evaluation metric assesses the variables and produces a ranking, the filter-based strategy eliminates the variables that do not satisfy a defined threshold value of the evaluation metric.

II) Extreme value search: The extreme value search only selects the most notable of all values, such as the lowest RMSECV or the largest absolute regression coefficients (absRC).

III) Sequential search: Sequential search [67] includes forward selection [68] and backward selection [69]. In forward selection, variables are sequentially added to an empty candidate set until the addition of further variables does not decrease the criterion. In backward selection, variables are sequentially removed from a full candidate set until the removal of further variables increases the criterion.

IV) Exhaustive search: This search considers all possible variable combinations and selects the combination with the best result, provided the number of variables or intervals is not too large.

V) IOA-based: This search uses an objective function to evaluate variable subsets based on their predictive performance (i.e., error on test data) estimated by statistical resampling or cross-validation. It evaluates multiple models using evolutionary and swarm intelligence algorithms [70] such as the genetic algorithm (GA) [71,72], particle swarm optimization (PSO) [73], firefly [74], ant colony optimization (ACO) [75] and
simulated annealing (SA) [76–78], to find the optimal variable combination that maximizes model performance.

VI) MPA-based: This search extracts statistical information and makes a selection from a large population of sub-models built with a large population of variable subsets. The MPA strategy [79,80] makes a statistical analysis of the outputs of the various generated parameters. It considers the output of interest not as a single value but as a distribution, by which one can conveniently perform various parametric/nonparametric statistical significance tests.

4. Brief overview of variable selection methods

In this section, unlike other reviews, we do not repeat the details of each method but briefly provide its structure and core characteristics. Tables 1 and 2 summarize selected variable selection methods in terms of the four factors: initialization of variables, modeling method, evaluation metric and selection strategy. With this approach, we can clearly understand the characteristics, merits and drawbacks of each method and easily make comparisons between them. A more thorough discussion based on the information in Tables 1 and 2 follows.

4.1. Wavelength point selection methods

WPS methods regard each wavelength point as a unit (i.e., a variable); the selected variables are thus discrete. As Fig. 2 shows, for WPS there are two ways of initializing variables, either all the variables or a part of the variables, resulting in different selection strategies. Regression coefficients (RC) [81], variable importance in projection (VIP) [82], selectivity ratio (SR) [83] and significance multivariate correlation (sMC) [84] are PLS parameter-based methods that take all variables into account for initialization and then filter out the unimportant variables according to certain criteria. Based on the PLS1 algorithm [85], the relationship between the spectral matrix X (n × p) and the response vector y (n × 1) can be expressed by Equation (2), which is a general form for any linear model.
y = Xb + e    (2)
The vector b is the RC, which is a single measure of the relationship between each variable and the corresponding response. Variables having a small absRC can be eliminated as uninformative, noise or interfering variables.
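A minimal sketch of this filter strategy is shown below: fit a full-spectrum PLS model, rank the wavelengths by absRC and keep those above a threshold. The quantile threshold and the number of latent variables are arbitrary illustrative choices (the cited methods each define their own criteria), and scikit-learn is again assumed.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def absrc_filter(X, y, n_components=5, quantile=0.80):
    """Keep the wavelengths whose |regression coefficient| exceeds a quantile threshold."""
    pls = PLSRegression(n_components=n_components).fit(X, y)
    b = np.abs(np.ravel(pls.coef_))        # absRC of each wavelength, cf. Equation (2)
    threshold = np.quantile(b, quantile)   # hypothetical cut-off; real methods differ here
    return np.flatnonzero(b >= threshold), b

# Example with the simulated calibration data from the previous sketch:
# selected, b = absrc_filter(X_cal, y_cal)
```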
Table 1
The four factors and characteristics of wavelength point selection methods.

| Method | Initialization of variables | Modeling method | Evaluation metric | Selection strategy | Characteristic (merit and drawback) | Ref. |
| RC | all variables | PLS/MLR | absRC | filter-based | fast, simple, needs to define a threshold, does not consider the combination effect of variables | [81] |
| VIP | all variables | PLS | VIP | filter-based | fast, simple, needs to define a threshold, does not consider the combination effect of variables | [82] |
| SR | all variables | PLS | SR | filter-based | fast, simple, needs to define a threshold, does not consider the combination effect of variables | [83] |
| sMC | all variables | PLS | sMC | filter-based | fast, simple, needs to define a threshold, does not consider the combination effect of variables | [84] |
| rPLS | all variables | PLS | RC | filter-based | fast, simple, only one parameter, does not consider the combination effect of variables | [87] |
| IPW | all variables | PLS | absRC and the standard deviation of the variable | filter-based | softly eliminates unimportant variables, high computation cost, needs to define a threshold | [88] |
| MC-UVE | all variables | PLS | mean(absRC)/std(absRC) | filter-based | high stability, needs to define a threshold, tends to select more variables | [86] |
| RT | all variables | PLS | absRC | filter-based | simple, considers the information of the response, does not consider the combination effect of variables | [91] |
| BVS | all variables | PLS | RMSECV | backward selection | simple, linear optimization, does not consider the combination effect of variables | [69] |
| SPA | all variables | MLR | maximum projection value on the orthogonal sub-spaces, RMSE | extreme value search, forward selection | selects the variables with minimum collinearity, selected variables may have a low signal-to-noise ratio (S/N) | [94] |
| SIS-PLS | all variables | PLS | RMSECV, absRC | sure independence screening | combines sure independence screening and latent variables, low prediction ability | [114] |
| CARS | Monte Carlo sampling | PLS | absRC, RMSECV | MPA-based | fast, uses EDF to continuously shrink the variable space, low stability | [89] |
| SEPA-LASSO | Monte Carlo sampling | LASSO | RMSECV | MPA-based | high stability, uses a vote rule to select the subset, needs to artificially select the optimal subset | [98] |
| IRIV | binary matrix sampling | PLS | RMSECV | MPA-based, backward selection | good ability to assess the importance of variables, classifies variables into four categories, high computation cost, multiple parameters | [35] |
| VCPA | binary matrix sampling | PLS | frequency of variables, RMSECV | MPA-based, exhaustive search | high prediction ability, uses EDF to continuously shrink the variable space, multiple parameters, selects fewer variables | [96] |
| VISSA | weighted binary matrix sampling | PLS | RMSECV | MPA-based | iteratively and smoothly shrinks the variable space, high computation cost, multiple parameters | [95] |
| BOSS | weighted bootstrap sampling | PLS | RMSECV, absRC | MPA-based | soft shrinkage to eliminate unimportant variables, high computation cost | [93] |
| IVSO | weighted binary matrix sampling | PLS | absRC, RMSECV | MPA-based | eliminates uninformative variables gradually, high stability, high computation cost | [61] |
| GA-PLS | random sampling | PLS | RMSECV | GA algorithm | global optimization, multiple parameters, the number of input variables should not exceed 200 | [72] |
| Firefly-PLS | random sampling based on uniform distribution | PLS | RMSE | Firefly algorithm | high dependence on excellent individuals, multiple parameters, tends to select fewer variables | [74] |
| SA-PLS | random sampling | PLS | Boltzmann's probability distribution, RMSE | SA algorithm | robust, high computational efficiency, easy to fall into local optima | [77] |
Table 2
The four factors and characteristics of wavelength interval selection methods.

| Method | Initialization of variables | Obtaining intervals (a) | Modeling method | Evaluation metric | Selection strategy | Characteristic (merit and drawback) | Ref. |
| iPLS | all intervals | 1st | PLS | RMSECV | extreme value search | fast, simple, no optimization of the intervals | [105] |
| fiPLS | all intervals | 1st | PLS | RMSECV | forward selection | fast, linear optimization | [106] |
| biPLS | all intervals | 1st | PLS | RMSECV | backward selection | fast, linear optimization | [107] |
| siPLS | all intervals | 1st | PLS | RMSECV | exhaustive search of the best 2/3/4 intervals | investigates all combinations of a given number of intervals, high computation cost | [107] |
| MWPLS | all intervals | 2nd | PLS | RMSECV | filter-based | obtains continuous intervals using a moving window, needs to artificially select intervals, no optimization of the intervals | [108] |
| FOSS | all intervals | 4th | PLS | absRC, RMSECV | MPA-based, weighted block bootstrap sampling | considers the correlations among the variables, high computation cost | [92] |
| OHPL | all intervals | 4th | PLS/LASSO | absRC, RMSECV | extreme value search | the intervals consider the information of the response, high computation cost | [110] |
| GA-iPLS | random sampling from all intervals | 1st | PLS | RMSECV | GA algorithm | global optimization, multiple parameters | [106] |
| iVISSA | binary matrix sampling | 3rd | PLS | RMSECV | MPA-based | high stability, high computation cost, multiple parameters | [109] |
| ICO | weighted binary matrix sampling | 1st | PLS | RMSECV | MPA-based | iteratively and smoothly shrinks the variable space, high stability, multiple parameters | [97] |
| iRF | Monte Carlo sampling | 2nd | PLS | absRC, RMSECV | MPA-based | obtains continuous intervals using a moving window, high computation cost, multiple parameters | [90] |

(a) Ways of dividing spectra into intervals. 1st: divide the full spectrum into intervals of equal width; 2nd: use a window with a fixed width moving through the whole spectrum to generate a series of intervals; 3rd: select scattered variables based on WPS methods, and then add the adjacent variables to constitute intervals; 4th: use an algorithm to divide the whole spectrum into intervals with different widths.
VIP accumulates the importance of each variable reflected by the weighting of each PLS component; the variables can be sorted in descending order of their VIP scores. SR uses target projection to quantify a variance that is proportional to the covariance between X and y, while separating the variation orthogonal to y at the same time; the ratio of these two variances is then used to determine the importance of each variable in the PLS regression model. sMC combines regression variance and residual variance from the PLS regression model to statistically determine a variable's importance. A variable with a higher value of VIP, SR or sMC is of greater importance. With the variables sorted in this way, PLS parameter-based methods often define a partition line to eliminate the variables that fall below a defined threshold; they are therefore fast and simple. As the RC of a PLS model ideally reflects the importance of the variables, it is often used, in various forms, as the evaluation metric of other methods. Monte Carlo-uninformative variable elimination (MC-UVE) [86] first employs Monte Carlo sampling to select a fixed ratio of samples in each run and then builds a calibration model. A regression coefficient vector b = [b1, …, bp] over the variables is then calculated. After N runs, a regression coefficient matrix B (N × p) is recorded. The stability of each variable j can be quantitatively measured as
sj = mean(Bj)/std(Bj),   j = 1, …, p    (3)
where mean(Bj) and std(Bj) are the mean and standard deviation of the absRC of variable j. The stability is used as the evaluation metric in MC-UVE, and a variable with large stability is regarded as important; a variable whose stability is less than a defined threshold is regarded as uninformative and is eliminated. Recursive weighted partial least squares (rPLS) [87] iteratively reweights the variables using the RC of the PLS model, thereby reflecting the importance of the variables. Iterative predictor weighting (IPW) [88], competitive adaptive reweighted sampling (CARS) [89], interval random frog (iRF) [90], iteratively variable subset optimization (IVSO) [61], the randomization test (RT) [91], Fisher optimal subspace shrinkage (FOSS) [92] and bootstrapping soft shrinkage (BOSS) [93] use absRC as the evaluation metric, as Table 1 shows. IPW is an iterative elimination procedure in which a measure of variable importance is calculated after fitting a PLS model; it uses the product of absRC and the standard deviation of the variable as a weight, and then obtains the importance of each variable by multiplying the weights over the cycles. RT calculates, for each variable, the ratio of the number of random PLS models whose absRC is bigger than the corresponding RC in the regular PLS model to the total number of random models, and gives a ranking of the variables based on this ratio; a filter-based strategy is then used to eliminate the variables whose ratio is greater than a defined threshold. The successive projections algorithm (SPA) [94] employs simple projection operations in a vector space together with forward selection to obtain subsets of variables with minimum collinearity; it is thus very suitable for NIR spectra with high collinearity. MPA-based methods include MC-UVE, CARS, iteratively retains informative variables (IRIV) [35], the variable iterative space shrinkage approach (VISSA) [95], variable combination population analysis (VCPA) [96], IVSO [61], FOSS [92], BOSS [93], iRF [90], interval combination optimization (ICO) [97] and sampling error profile analysis-LASSO (SEPA-LASSO) [98]. They have in common that they extract statistical information from a large population of sub-models. MPA is a very efficient strategy for developing variable selection methods, as it statistically analyzes the output of interest not as a single value but as a distribution. Tables 1 and 2 clearly show their differences in terms of the four factors. MPA has three key elements, as Fig. 3 shows:
Fig. 2. The flowchart of wavelength point selection methods.
1) Randomly draw N sub-datasets by a sampling method such as MC sampling, bootstrap sampling or BMS; 2) For each sub-dataset, build a sub-model; 3) Statistically analyze an outcome of interest over all N sub-models based on an evaluation metric such as the RC or the RMSECV.

CARS uses absRC as the evaluation metric and an exponentially decreasing function (EDF) as the selection strategy to competitively select key variables based on adaptive reweighted sampling. IRIV employs binary matrix sampling in the variable space to generate a large population of variable combinations; it then assesses the importance of each variable with a Mann-Whitney U test that compares the RMSECV distributions obtained with and without that variable. Similar to IRIV, VCPA first uses binary matrix sampling to generate a population of different variable combinations. It then computes the frequency of each variable in the best 10% of sub-models, and an EDF is employed to determine the number of variables to retain based on this frequency. After all EDF runs have finished, an exhaustive search is used to find the optimal variable subset among the remaining variables. VISSA and ICO use weighted binary matrix sampling to smoothly shrink the variable space, iteratively retaining the variables or intervals whose weights equal 1 based on the frequency of each variable in the best 5% of sub-models. BOSS and IVSO also use weighted sampling, with the weights of the variables determined from the absRC by analyzing a population of sub-models. iRF compares a population of intervals based on the mean absRC of the variables in each interval. SEPA-LASSO not only uses MC sampling to generate and build a population of sub-models, but also uses LASSO as the modeling method. As a penalty-based method, LASSO is a shrinkage estimator that removes redundant variables by shrinking their weights (regression coefficients) to zero; a vote rule is then used to determine the importance of the variables based on their selection frequency by LASSO.
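To make the three MPA elements concrete, the sketch below uses the MC-UVE stability of Equation (3) as the statistic of interest: sub-datasets are drawn by Monte Carlo sampling, a PLS sub-model is built on each, and the population of regression coefficients is summarized per wavelength. This is an illustrative sketch only (scikit-learn based, with arbitrary numbers of runs, sampling ratio and threshold); published implementations differ in whether signed or absolute coefficients enter the stability.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def mcuve_stability(X, y, n_runs=500, ratio=0.8, n_components=5, seed=0):
    """MC-UVE-style stability, mean(Bj)/std(Bj) over N sub-models (Equation (3))."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    B = np.empty((n_runs, p))
    for i in range(n_runs):
        idx = rng.choice(n, size=int(ratio * n), replace=False)   # 1) draw a sub-dataset
        sub_model = PLSRegression(n_components=n_components).fit(X[idx], y[idx])  # 2) build a sub-model
        B[i] = np.ravel(sub_model.coef_)
    A = np.abs(B)                                                  # 3) analyze the population of sub-models
    return A.mean(axis=0) / A.std(axis=0)

# stability = mcuve_stability(X_cal, y_cal)
# keep = np.flatnonzero(stability > np.quantile(stability, 0.8))   # the threshold is user-defined
```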
Fig. 3. The framework and three key elements of model population analysis (MPA) strategy in the variable selection methods and their corresponding factors in the simple classification.
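The exponentially decreasing function used by CARS and VCPA to shrink the variable space can be written compactly. The sketch below assumes the commonly cited parameterization in which the fraction of retained wavelengths decays from 1 (all p variables kept at the first sampling run) to 2/p (two variables at the last run); the exact constants vary between papers, so this is an assumed form rather than the authors' definition.

```python
import numpy as np

def edf_ratio(i, n_runs, p):
    """Fraction of wavelengths retained at sampling run i (1-based), assuming
    the boundary conditions ratio(1) = 1 and ratio(n_runs) = 2/p."""
    k = np.log(p / 2.0) / (n_runs - 1)
    a = np.exp(k)                      # chosen so that ratio(1) == 1
    return a * np.exp(-k * i)

p, n_runs = 1557, 50
n_kept = [max(2, int(round(edf_ratio(i, n_runs, p) * p))) for i in range(1, n_runs + 1)]
# n_kept decays smoothly from 1557 at run 1 towards 2 at run 50.
```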
IOA-based variable selection methods, including GA-PLS [72], firefly [74], ACO [75] and SA [77], are used to search for the optimal variable subset in NIR spectra. In terms of the four factors in Table 1, the major differences lie in the initialization of variables and the selection strategy. Fig. 4 shows the process of IOA-based variable selection methods. They search for the optimal variable subset in an iterative manner, with different evolutionary and swarm intelligence algorithms using different ways of initializing and updating the variable subsets. Additionally, PSO, ACO, cat swarm optimization [99], the bat algorithm [100], harmony search [101], cuckoo search [102] and firework algorithms [103] can also be used for variable selection in NIR spectra, as they have good optimization ability. Moreover, Liu et al. developed software integrating a set of IOA-based algorithms with various evaluation combinations for variable selection [104].

4.2. Wavelength interval selection methods (WIS)

WIS methods consider each wavelength interval as a unit and then find the optimal combination of intervals based on different search strategies. Each interval consists of a number of continuous variables, which is consistent with the fact that vibrational and rotational spectra consist of continuous spectral bands. As the variables selected by WPS are discrete, WIS has an advantage over WPS in the interpretation of the model. As shown in Table 2, interval PLS (iPLS) [105], forward iPLS (fiPLS) [106], backward iPLS (biPLS) [107], synergy iPLS (siPLS) [107], GA-iPLS [106], moving window PLS (MWPLS) [108], ICO [97], iRF [90], interval VISSA (iVISSA) [109], FOSS [92] and ordered homogeneity pursuit LASSO (OHPL) [110] belong to the WIS methods. As WIS methods regard intervals as units, the width of the interval is determined in different ways; Fig. 5 shows four ways of dividing spectra into intervals. iPLS was first proposed to divide the full spectrum into intervals of equal width, as Fig. 5a shows, and then select the wavelength intervals with the lowest RMSECV. fiPLS, biPLS and siPLS were proposed as improvements of iPLS to optimize the combination of intervals by sequential selection strategies (forward and backward selection) and an exhaustive search strategy, while GA-iPLS optimizes the intervals with a GA algorithm. Fig. 5b shows the second way, which uses a window with a fixed width moving through the whole spectrum to generate a series of intervals; the overlapping intervals provide more spectral information than non-overlapping ones. MWPLS was the first method to use this approach, considering all possible continuous intervals, but it cannot optimize their combinations. Changeable Size MWPLS
Fig. 4. The process of intelligent optimization algorithms (IOA)-based variable selection methods.
(CSMWPLS) [111] and Searching Combination MWPLS (SCMWPLS) [111], both based on MWPLS, were then proposed to search for an optimized interval combination. Additionally, iRF uses a reversible jump Markov chain Monte Carlo-like algorithm to optimize over all possible interval combinations, each interval being assessed by the mean absRC of all variables in the interval. As Fig. 5c shows, the third way is first to select scattered variables by WPS and then to add the adjacent variables to constitute intervals. iVISSA first selects key variables based on VISSA, and a local search is then carried out to find the combination of intervals with the lowest RMSECV by successively adding the adjacent variables into the sub-models. The fourth way (Fig. 5d) uses an algorithm to divide the whole spectrum into intervals of different widths, as in FOSS and OHPL, which use a Fisher optimal partitions algorithm to generate a series of intervals of different widths; the variables in each group have the same relevance to the response. FOSS uses weighted block bootstrap sampling, with the weights of the intervals determined from the absRC by analyzing a population of sub-models. In OHPL, sparse regression techniques such as LASSO are applied to these groups to obtain the optimal variable subset. Interval SPA (iSPA) [112], ICO [97] and sure independence screening-iPLS (SIS-iPLS) [113] are modifications of SPA, VISSA and SIS-PLS [114], respectively; they simply regard a wavelength interval as a unit and then optimize the combination of intervals. ICO employs the first way to obtain the intervals and then searches for the best combination of intervals based on the core idea of VISSA.
Fig. 5. Four ways of dividing NIR spectra into intervals. a: divide the full spectra into equal widths, b: use a window with a fixed width w moving through the whole spectra to generate a series of intervals, c: select scatter variables based on WPS methods, and then add the adjacent variables to constitute intervals, d: use an algorithm to divide the whole spectra into intervals with different widths.
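As a minimal illustration of the first way of obtaining intervals (Fig. 5a) and of the iPLS idea, the sketch below splits the wavelength axis into equal-width intervals and computes a cross-validated error for a PLS model restricted to each interval; the interval with the lowest RMSECV would then be selected (or interval combinations optimized further, as in fiPLS, biPLS and siPLS). The number of intervals, the number of latent variables and the cross-validation scheme are illustrative choices, and scikit-learn is assumed.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

def ipls_interval_rmsecv(X, y, n_intervals=20, n_components=3, cv=10):
    """Equal-width interval division (the '1st way') with an iPLS-like RMSECV per interval."""
    p = X.shape[1]
    edges = np.linspace(0, p, n_intervals + 1).astype(int)
    scores = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ncomp = min(n_components, hi - lo)   # an interval may hold fewer variables than n_components
        y_cv = cross_val_predict(PLSRegression(n_components=ncomp), X[:, lo:hi], y, cv=cv)
        scores.append(float(np.sqrt(mean_squared_error(y, y_cv))))
    best = int(np.argmin(scores))
    return edges, np.array(scores), best

# edges, scores, best = ipls_interval_rmsecv(X_cal, y_cal)
# print("best interval:", edges[best], "-", edges[best + 1], "RMSECV =", scores[best])
```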
4.3. Hybrid variable selection methods

In addition to the methods mentioned above, there are hybrid methods that combine two or three methods to select variables, such as CARS-SPA [115], CARS-GA-PLS [116], UVE-SPA-MLR [117], UVE-SPA-PLS [118], iPLS-SPA [119], siPLS-GA-PLS [120], iPLS-mIPW [121], random forest-back propagation (BP) network [122], VCPA-GA [123] and VCPA-IRIV [123]. When two methods are coupled, the latter method further optimizes the variables selected by the former; however, if the former method fails to retain the key variables, the subsequent optimization by the latter method is strongly affected. Thus, hybrid methods should not combine two methods at random but should follow a certain logic. Generally, the former method makes a rough selection to eliminate the uninformative variables, and the latter method is then used to make a fine selection. UVE-SPA-MLR first filters out the noise variables with the UVE method and then makes a fine selection with SPA. iPLS-SPA, iPLS-mIPW and siPLS-GA-PLS first use iPLS or siPLS to obtain the informative intervals; SPA, mIPW and GA are then used to make a further selection. Chen et al. [122] proposed a hybrid strategy that combines random forest [124] and a BP network to select key wavelengths in NIR spectra: the random forest first selects some informative wavelengths, which are then input into the BP network to generate a new comprehensive variable group with minimum errors. Recently, Yun et al. proposed a VCPA-based hybrid strategy [123], including VCPA-GA and VCPA-IRIV, that in a first step uses VCPA to continuously shrink and optimize the variable space from large to small; in a second step it is coupled with IRIV or GA to carry out further optimization on the variables retained from the first step.

4.4. Modified variable selection methods

Table 3 lists the modified methods and their modified factors. Generally, the development of a modified method changes or improves at least one of the four factors above (initialization of variables, modeling method, evaluation metric and selection strategy). Stability CARS (SCARS) [125] uses the stability of Equation (3) as the evaluation metric instead of the absRC to improve the instability of CARS. iRF regards intervals as units and uses a reversible jump Markov chain Monte Carlo-like algorithm embedded in random frog (RF) [126] to optimize the interval combinations. Variable permutation population analysis (VPPA) [127] uses a variable permutation analysis strategy to replace the frequency of each variable in the best 10% of sub-models used by VCPA. Stability and variable permutation (SVP) [128] selects variables in an iterative and competitive manner in both the sample space and the variable space, combining the core ideas of CARS, MC-UVE and VPPA: it first uses Equation (3) to compute the stability, then divides the variables into elite variables and normal variables according to the stability via the adaptive reweighted sampling from CARS, and then employs the variable permutation analysis from VPPA and the EDF from CARS to select the important variables from the normal variables. Automatic weighting VCPA (AWVCPA) [129] combines the variable frequency and the RC of PLS to obtain the contribution value of each variable. To generate more sub-models, MC-UVE uses Monte Carlo sampling instead of cross-validation to compute the mean and standard deviation of the RC over the sub-models. Similar offspring voting GA-PLS (sovGA-PLS) [130] modifies the GA algorithm of GA-PLS with a similar offspring voting strategy to achieve a compromise between the prediction accuracy and the reliability of the models. Large regression coefficient GA-PLS (LRC-GA-PLS) [131] builds the large regression coefficients into the structure of a proportion of the chromosomes in the initial population of the GA algorithm, steering the optimization toward the optimal solution. MC-UVE-random forest [132] uses random forest as the modeling method, and the importance values of the variables produced by the random forest are used as the evaluation metric through Equation (4).
Ij = mean(importancej)/std(importancej)    (4)
where mean(importancej) and std(importancej) are the mean and standard deviation of the importance value of variable j. Continuous wavelet transform-modified IPW (CWT-mIPW) [133] uses CWT as a pretreatment tool to preprocess all variables for initialization; mIPW then uses a hard threshold as a filter strategy to iteratively retain the most important variables after cyclically fitting a PLS model. Table 4 lists the available software and toolboxes of variable selection methods for NIR spectra, which readers may conveniently apply for non-commercial use. Some methods can be freely and directly accessed on the listed websites, whereas for others the authors should be contacted, as the code is shared on request. Table 5 lists the abbreviations used in this work.
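A sketch of the evaluation metric of Equation (4), as used in MC-UVE-random forest, is given below: random forest sub-models are built on Monte Carlo draws of the samples, and each wavelength is ranked by the ratio of the mean to the standard deviation of its impurity-based importance over the sub-models. The numbers of runs and trees and the sampling ratio are illustrative, and scikit-learn's RandomForestRegressor stands in for the implementations listed in Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_importance_stability(X, y, n_runs=50, ratio=0.8, n_trees=200, seed=0):
    """Ij = mean(importancej)/std(importancej) over Monte Carlo sub-models (Equation (4))."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    imp = np.empty((n_runs, p))
    for i in range(n_runs):
        idx = rng.choice(n, size=int(ratio * n), replace=False)
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=i, n_jobs=-1)
        imp[i] = rf.fit(X[idx], y[idx]).feature_importances_
    return imp.mean(axis=0) / (imp.std(axis=0) + 1e-12)   # small constant guards against division by zero

# importance_stability = rf_importance_stability(X_cal, y_cal)
```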
Table 3
List of the modified methods.

| Modified method | Original method | Modified factor | Reference |
| MC-UVE | UVE | initialization of variables | [86] |
| iRF | RF | initialization of variables | [90] |
| ICO | VISSA | initialization of variables | [97] |
| fiPLS, biPLS, siPLS | iPLS | selection strategy | [105–107] |
| GA-iPLS | iPLS | selection strategy | [106] |
| iVISSA | VISSA | selection strategy | [109] |
| CSMWPLS, SCMWPLS | MWPLS | selection strategy | [111] |
| iSPA | SPA | initialization of variables | [112] |
| SIS-iPLS | SIS-PLS | initialization of variables | [113] |
| SCARS | CARS | evaluation metric | [125] |
| VPPA | VCPA | evaluation metric | [127] |
| SVP | CARS, MC-UVE, VPPA | selection strategy | [128] |
| AWVCPA | VCPA | evaluation metric | [129] |
| sovGA | GA-PLS | selection strategy | [130] |
| LRC-GA-PLS | GA-PLS | selection strategy | [131] |
| MC-UVE-random forest | MC-UVE | evaluation metric, modeling method | [132] |
| CWT-mIPW | IPW | initialization of variables, selection strategy | [133] |
Table 4
The available software and toolboxes of variable selection methods for NIR spectra.

| Method | Language | Available website |
| RC, VIP, SR | MATLAB | http://www.libpls.net/index.php |
| iPLS, biPLS, fiPLS, siPLS, MWPLS | MATLAB | http://www.models.life.ku.dk/iToolbox |
| rPLS | MATLAB | http://www.models.life.ku.dk/rPLS |
| GA-PLS | MATLAB | http://www.models.life.ku.dk/GAPLS |
| SPA | MATLAB | http://www.ele.ita.br/~kawakami/spa/ |
| General SA algorithm | MATLAB | http://www.mathworks.com/matlabcentral/fileexchange/10548 |
| CARS, MC-UVE, MWPLS | MATLAB | http://www.libpls.net/index.php |
| iRF, IRIV, VCPA | MATLAB | https://ww2.mathworks.cn/matlabcentral/profile/authors/5526470-yonghuan-yun |
| VISSA, iVISSA, BOSS | MATLAB | https://ww2.mathworks.cn/matlabcentral/profile/authors/3476621-baichuan |
| BP-ANN | MATLAB | In-house code in MATLAB |
| IVSO | MATLAB | http://www.rsc.org/suppdata/c5/ra/c5ra08455e/c5ra08455e1.zip |
| ICO | MATLAB | Shared on request according to Ref. [97] |
| SIS-PLS | R | Shared on request according to Ref. [114] |
| SEPA-LASSO | MATLAB | http://www.imm.dtu.dk/projects/spasm/ (http://www.stanford.edu/~hastie/glmnet_matlab) |
| OHPL | R | Shared on request according to Ref. [110] |
| MC-UVE-random forest | MATLAB/R | https://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm, https://cran.r-project.org/web/packages/randomForest/index.html |
| Firefly | MATLAB | https://ww2.mathworks.cn/matlabcentral/fileexchange/29693-firefly-algorithm?s_tid=prof_contriblnk |
| PSO, ACO, SA | MATLAB | http://yarpiz.com/306/ypml122-evolutionary-feature-selection |
| ACO | MATLAB | Shared on request according to Ref. [75] |
| VCPA-GA, VCPA-IRIV | MATLAB | https://ww2.mathworks.cn/matlabcentral/profile/authors/5526470-yonghuan-yun |
5. The limitations of existing variable selection methods

Although a large number of variable selection methods have been developed for NIR spectra during the last several decades, there are still many problems to be faced and solved. We summarize these as stability, reliability, interpretability, applicability, the effect of outliers, the modeling method and the computation cost, as illustrated below. Such a summary can benefit both the use and the development of variable selection methods.

1. Stability: For some methods, the prediction results are unstable because different variable subsets are selected in different runs. The reason is that a random sampling method is used and the optimization proceeds in an iterative manner; the complex optimization step inevitably produces unstable results. Moreover, a squared prediction error (e.g., the RMSECV) used as the evaluation metric with the limited samples in the training set may produce unstable prediction results on the validation set. Thus, stability could be used as a metric to evaluate the prediction performance of models, as Deng et al. did [109]. An ensemble strategy [134,135] is also a good way to address the instability problem, as it can obtain more accurate, stable and robust prediction results by combining the predictions of multiple sub-models built from different subsets.

2. Reliability: Some newly suggested methods are compared with several existing methods by a simple quantitative "visual" comparison, from which it is concluded that the proposed method is better than the others [19]. However, the calculated RMSEP value is often only slightly smaller and no statistical significance test is performed, which makes such claims unreliable.

3. Interpretability: Improving interpretability with simple models is one of the three purposes of variable selection. However, many proposed methods only report the prediction performance without the necessary interpretation of the selected variables. The selection criteria of many methods are not driven by real relevance to the NIR spectral features but by a search algorithm guided by prediction accuracy, e.g., the RMSEP. Although NIR spectra have very poor features corresponding to structural information owing to the severe overlapping of spectral bands, it is still necessary to interpret the selected variables by associating them with the functional groups (C–H, O–H, S–H, N–H, etc.) of the analyte or property of interest. However, most existing methods put more emphasis on lower prediction errors than on interpretability.

4. Applicability: Newly proposed methods always report positive results, with lower prediction errors than a selection of other methods on two or three datasets. However, when other researchers apply them to other datasets, they may not obtain the ideal results. As a matter of fact, all variable selection methods are data based, and no method can be used for all kinds of datasets. Moreover, some proposed methods make an unfair comparison with other methods and do not provide information on their suitable scope and other considerations. Methods that have parameters to be optimized at the start should state this clearly.

5. Effect of outliers: As outliers have a negative impact on the robustness and predictive accuracy of a model, it is very crucial to identify and remove outliers from the measured data before modeling. However, the processes of variable selection and outlier detection usually influence each other, and the order in which they are conducted also strongly influences the modeling results, as Wen et al. [136] and Cao et al. [137] reported. When outliers are removed before variable selection, some samples appear as outliers only with a reduced set of variables; when variable selection is conducted before removing outliers, the reduced set of variables weakens the ability to identify outliers.

6. Modeling method: Some studies [138–141] use a linear modeling method to select characteristic variables and then use a nonlinear modeling method to build models and obtain prediction results. The variables selected by linear modeling methods are not necessarily relevant to nonlinear modeling methods, owing to their different principles. For instance, Chen et al. [140] first employed siPLS to select efficient spectral intervals; ELM and BP-ANN were then employed to make nonlinear models with the selected variables. In Guo's work [141], characteristic variables were first selected by UVE-PLS and then input into a generalized regression neural network, SVR and ELM to make nonlinear models for the detection of the soluble solids content of apples.

7. Computation cost: Newly proposed methods are becoming more and more complex in order to obtain lower and lower RMSEP values, which results in high computation cost. We should seek a balance between computational effort and good predictability.
Table 5
Abbreviations used in this article.

| Abbreviation | Full name | Location of first appearance |
| absRC | absolute value of Regression Coefficients | Extreme value search of Section 3 |
| ACO | Ant Colony Optimization | IOA-based of Section 3 |
| ANN | Artificial Neural Network | Modeling method of Section 3 |
| AWVCPA | Automatic Weighting Variable Combination Population Analysis | Paragraph 1 of Section 4.4 |
| biPLS/fiPLS | backward/forward interval Partial Least Squares | Paragraph 1 of Section 4.2 |
| BMS | Binary Matrix Sampling | Initialization of variables of Section 3 |
| BOSS | Bootstrapping Soft Shrinkage | Paragraph 3 of Section 4.1 |
| BP | Back Propagation | Paragraph 1 of Section 4.3 |
| BVS | Backward Variable Selection | Table 1 |
| CARS | Competitive Adaptive Reweighted Sampling | Paragraph 3 of Section 4.1 |
| CSMWPLS | Changeable Size Moving Window Partial Least Squares | Paragraph 3 of Section 4.2 |
| CWT | Continuous Wavelet Transform | Paragraph 3 of Section 4.4 |
| CWT-mIPW | Continuous Wavelet Transform - modified Iterative Predictor Weighting | Paragraph 3 of Section 4.4 |
| EDF | Exponentially Decreasing Function | Paragraph 6 of Section 4.1 |
| ELM | Extreme Learning Machine | Modeling method of Section 3 |
| FOSS | Fisher Optimal Subspace Shrinkage | Paragraph 2 of Section 4.1 |
| GA-iPLS | Genetic Algorithm - interval Partial Least Squares | Paragraph 1 of Section 4.2 |
| GA-PLS | Genetic Algorithm - Partial Least Squares | IOA-based of Section 3 |
| ICO | Interval Combination Optimization | Paragraph 5 of Section 4.1 |
| iPLS | interval Partial Least Squares | Paragraph 1 of Section 4.2 |
| iRF | interval Random Frog | Paragraph 3 of Section 4.1 |
| IRIV | Iteratively Retains Informative Variables | Paragraph 5 of Section 4.1 |
| iSPA | interval Successive Projections Algorithm | Paragraph 6 of Section 4.2 |
| IOA | Intelligent Optimization Algorithms | Selection strategy of Section 3 |
| IPW | Iterative Predictor Weighting | Paragraph 3 of Section 4.1 |
| iVISSA | interval Variable Iterative Space Shrinkage Approach | Paragraph 1 of Section 4.2 |
| IVSO | Iteratively Variable Subset Optimization | Paragraph 3 of Section 4.1 |
| LASSO | Least Absolute Shrinkage and Selection Operator | Paragraph 2 of Section 1 |
| LRC-GA-PLS | Large Regression Coefficient - Genetic Algorithm - Partial Least Squares | Paragraph 1 of Section 4.4 |
| MC | Monte Carlo | Paragraph 2 of Section 2 |
| MC-UVE | Monte Carlo - Uninformative Variable Elimination | Paragraph 3 of Section 4.1 |
| mIPW | modified Iterative Predictor Weighting | Paragraph 3 of Section 4.4 |
| MLR | Multivariable Linear Regression | Modeling method of Section 3 |
| MPA | Model Population Analysis | Selection strategy of Section 3 |
| MWPLS | Moving Window Partial Least Squares | Paragraph 1 of Section 4.2 |
| NIR | Near-Infrared | Paragraph 1 of Section 1 |
| OHPL | Ordered Homogeneity Pursuit LASSO | Paragraph 1 of Section 4.2 |
| PCR | Principal Component Regression | Paragraph 2 of Section 1 |
| PLS | Partial Least Squares | Paragraph 2 of Section 1 |
| PSO | Particle Swarm Optimization | IOA-based of Section 3 |
| RC | Regression Coefficients | Paragraph 2 of Section 4.1 |
| RF | Random Frog | Paragraph 1 of Section 4.4 |
| RMSEC | Root Mean-Squared Error of Calibration on the calibration set | Paragraph 1 of Section 2 |
| RMSECV | Root Mean-Squared Error of Cross Validation | Paragraph 1 of Section 2 |
| RMSEP | Root Mean-Squared Error of Prediction on the test set | Paragraph 1 of Section 2 |
| rPLS | recursive weighted Partial Least Squares | Paragraph 3 of Section 4.1 |
| RT | Randomization Test | Paragraph 3 of Section 4.1 |
| SA | Simulated Annealing | IOA-based of Section 3 |
| SCMWPLS | Searching Combination Moving Window Partial Least Squares | Paragraph 3 of Section 4.2 |
| SEPA-LASSO | Sampling Error Profile Analysis - Least Absolute Shrinkage and Selection Operator | Paragraph 5 of Section 4.1 |
| siPLS | synergy interval Partial Least Squares | Paragraph 1 of Section 4.2 |
| SIS-iPLS | Sure Independence Screening - interval Partial Least Squares | Paragraph 6 of Section 4.2 |
| SIS-PLS | Sure Independence Screening - Partial Least Squares | Paragraph 6 of Section 4.2 |
| sMC | significance Multivariate Correlation | Paragraph 2 of Section 4.1 |
| sovGA-PLS | similar offspring voting Genetic Algorithm - Partial Least Squares | Paragraph 1 of Section 4.4 |
| SPA | Successive Projections Algorithm | Paragraph 4 of Section 4.1 |
| SR | Selectivity Ratio | Paragraph 2 of Section 4.1 |
| SVP | Stability and Variable Permutation | Paragraph 1 of Section 4.4 |
| SVR | Support Vector Regression | Modeling method of Section 3 |
| VCPA | Variable Combination Population Analysis | Paragraph 5 of Section 4.1 |
| VIP | Variable Importance in Projection | Paragraph 2 of Section 4.1 |
| VISSA | Variable Iterative Space Shrinkage Approach | Paragraph 5 of Section 4.1 |
| VPPA | Variable Permutation Population Analysis | Paragraph 1 of Section 4.4 |
6. Perspectives on the development of new variable selection methods

With the rapid development of modern analytical instruments, the number of wavelengths in a NIR spectrum will continue to grow, especially for NIR hyperspectral image data [142,143], and some methods are not suitable for this kind of data. In addition, with the
increasing popularity of NIR spectral techniques, new application problems regarding variable selection will arise. In the face of these challenges, we should pay more attention to developing more efficient variable selection methods. Prediction accuracy, model interpretability, and computational complexity are three important pillars of any variable selection procedures [144]. It is a great challenge to develop a method that can strike a good balance
between these three pillars. Here, we summarize our remarks and suggestions on the development of new variable selection methods and highlight the improvements that might be offered.

1. Hybrid methods. With so many existing methods, we are tempted to make good use of them. Each method has its own merits and unique characteristics; we should focus on these characteristics and combine them effectively. Indeed, some hybrid methods combining different characteristics have been proposed, as described above. It should be noted that a hybrid method should be based on reasonable logic: hybrid methods are not a simple coupling of several methods, and the characteristic of one method can be embedded into another method to exploit both their advantages.

2. Considering the effect of variable combination. The effect of variable combination has a great influence on the prediction performance. Some variables are not very important on their own, but when they are combined they can give great prediction performance; these are the kinds of variables we want to select. However, none of the existing methods has solved this problem. Although the IRIV [35] method considers the effect of variable combinations by observing the differences between the RMSECV with and without each variable, it does not fundamentally capture this effect.

3. Multi-objective optimization [104,145,146]. The existing methods rely on only one objective function to search for the optimal variable subset. Multi-objective optimization deals with mathematical optimization problems involving more than one objective function to be optimized simultaneously. In variable selection for NIR spectra, not only the RMSECV or RMSEP can be regarded as an objective function; the number of variables and the number of PLS components can also be objective functions. With more than one objective optimized simultaneously, the model built with the selected variables can have great prediction performance while also being simpler and more interpretable.

4. Interpretability. A good variable selection method for NIR spectra not only improves the prediction performance but also offers good interpretability. We should not blindly pursue a lower RMSEP and ignore the interpretation of the selected variables. The methods we develop should be based on the features of the NIR spectra. In particular, if a method treats some variables related to the property of interest as control variables during its development, it will have strong interpretive power and avoid blind selection of variables.

5. Universal applicability. New methods should be based on the principles of simple use and ease of understanding, high prediction performance and low computation cost. With these characteristics, new methods will be suited to universal application, especially in the software of portable NIR spectral instruments. Moreover, developing user-friendly, standalone software [104] that assembles various methods and lowers the entry barrier is very useful for non-specialists dealing with variable selection problems.

6. Application of newly proposed algorithms. The development of new methods should also pay attention to newly emerging algorithms such as ELM and optimization algorithms such as the bat, firework and cuckoo search algorithms. The introduction of new algorithms is a good way to develop variable selection methods that can solve the problems associated with new kinds of data.

7. Conclusion

Variable selection is a critical step in multivariate analysis of NIR spectra, and it is very important to apply variable selection
methods in a proper manner. In this paper, the existing variable selection methods for NIR spectra were briefly overviewed. We generalized the variable selection methods in terms of four factors, namely initialization of variables, modeling method, evaluation metric and selection strategy, and introduced the merits and drawbacks of each method. With this visual comparison of their similarities and differences, researchers can understand them well and choose an appropriate method for their study. It is worth mentioning that the interpretation of the selected variables is also very critical, and it remains a challenge to make NIR models more interpretable. We should not merely pursue good prediction performance and ignore the interpretation of the selected variables, which should correspond to the property of interest. It should be noted that the purpose of this paper is not to conclude which variable selection method is the best for NIR spectra, nor to make a final recommendation on what the reader should choose. Owing to their different principles, all variable selection methods are data based, and they show different characteristics when faced with different kinds of data. No single method can achieve good performance for all kinds of data; each method has its applicability, merits and drawbacks, and their principles and characteristics should be considered when using them. We hope that this paper provides readers with a better understanding of the similarities and differences between the above methods and is of significant help in applying them correctly. It is always advisable to try out methods with different principles when faced with a given data problem. We introduced the modified and hybrid methods and their improvements, mentioned seven aspects of the problems affecting the existing methods, and finally gave a deeper perspective on the trends in the development of variable selection methods for NIR spectra. We hope that this paper will help researchers develop new efficient methods to address the existing problems of variable selection, which could significantly promote the development of NIR spectroscopy.
Acknowledgements
This work was financially supported by the National Natural Science Foundation of China (grant no. 21705162) and the Scientific Research Funds of Hainan University, China (KYQD(ZR)-1844).
References
[1] C. Pasquini, J. Braz. Chem. Soc. 14 (2003) 198. [2] B. Stenberg, R.A.V. Rossel, A.M. Mouazen, J. Wetterlind, Visible and near infrared spectroscopy in soil science, Elsevier, 2010, p. 163. [3] K.D. Shepherd, M.G. Walsh, J. Near Infrared Spectrosc. 15 (2007) 1. [4] S. Macho, M. Larrechi, Trac. Trends Anal. Chem. 21 (2002) 799. [5] Y. Roggo, P. Chalus, L. Maurer, C. Lema-Martinez, A. Edmond, N. Jent, J. Pharmaceut. Biomed. Anal. 44 (2007) 683. [6] J. Luypaert, D. Massart, Y. Vander Heyden, Talanta 72 (2007) 865. [7] Y. Ozaki, W.F. McClure, A.A. Christy, Near-infrared spectroscopy in food science and technology, John Wiley & Sons, 2006. [8] Z. Zhu, S. Chen, X. Wu, C. Xing, J. Yuan, Food Science & Nutrition, 2018. [9] S. Sans, J. Ferré, R. Boqué, J. Sabaté, J. Casals, J. Simó, Food Chem. 262 (2018) 178. [10] B.G. Osborne, T. Fearn, Encycl. Anal. Chem. 5 (2000) 4069. [11] N. Heigl, C. Petter, M. Rainer, M. Najam-ul-Haq, R. Vallant, R. Bakry, G. Bonn, C. Huck, J. Near Infrared Spectrosc. 15 (2007) 269. [12] M. Schwanninger, J.C. Rodrigues, K. Fackler, J. Near Infrared Spectrosc. 19 (2011) 287. [13] Y.-H. Yun, Y.-C. Wei, X.-B. Zhao, W.-J. Wu, Y.-Z. Liang, H.-M. Lu, RSC Adv. 5 (2015) 105057. [14] Y. Jiang, B. David, P. Tu, Y. Barbin, Anal. Chim. Acta 657 (2010) 9. [15] C.K. Vance, D.R. Tolleson, K. Kinoshita, J. Rodriguez, W.J. Foley, J. Near Infrared Spectrosc. 24 (2016) 1. [16] A. Gredilla, S.F.-O. de Vallejuelo, N. Elejoste, A. de Diego, J.M. Madariaga, Trac. Trends Anal. Chem. 76 (2016) 30.
[17] L. Yao, N. Lyu, J. Chen, T. Pan, J. Yu, Joint analyses model for total cholesterol and triglyceride in human serum with near-infrared spectroscopy, Spectrochim. Acta Part A: Mol. Biomol. Spectrosc. 159 (2016) 53–59. [18] G. Bale, C.E. Elwell, I. Tachtsidis, J. Biomed. Optic. 21 (2016) 091307. [19] C. Pasquini, Anal. Chim. Acta 1026 (2018) 8. [20] H. Martens, T. Naes, Multivariate calibration, John Wiley & Sons, 1989. [21] P. Dardenne, G. Sinnaeve, V. Baeten, J. Near Infrared Spectrosc. 8 (2000) 229. [22] K.R. Beebe, B.R. Kowalski, Anal. Chem. 59 (1987) 1007A. [23] J. Fan, R. Li, arXiv preprint math/0602133, 2006. [24] I.M. Johnstone, D.M. Titterington, Statistical challenges of high-dimensional data, The Royal Society, 2009. [25] B. Nadler, R.R. Coifman, J. Chemometr.: J. Chemometr. Soc. 19 (2005) 107. [26] G.C. Mcdonald, Wiley Interdiscipl. Rev. Comput. Stat. 1 (2010) 93. [27] H. Zou, T. Hastie, J. Roy. Stat. Soc. B 67 (2005) 768. [28] G.H. Fu, Q.S. Xu, H.D. Li, D.S. Cao, Y.Z. Liang, Appl. Spectrosc. 65 (2011) 402. [29] R. Tibshirani, J. Roy. Stat. Soc. B 58 (1996) 267. [30] P.B. Harrington, J. Chemometr. 5 (1991) 467. [31] P.J. Gemperline, A. Salt, J. Chemometr. 3 (1989) 343. [32] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1. [33] H.-D. Li, M.-M. Zeng, B.-B. Tan, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Metabolomics 6 (2010) 353. [34] Q. Wang, H.-D. Li, Q.-S. Xu, Y.-Z. Liang, Analyst 136 (2011) 1456. [35] Y.H. Yun, W.T. Wang, M.L. Tan, Y.Z. Liang, H.D. Li, D.S. Cao, H.M. Lu, Q.S. Xu, Anal. Chim. Acta 807 (2014) 36–43. [36] C.H. Spiegelman, M.J. McShane, M.J. Goetz, M. Motamedi, Q.L. Yue, G.L. Coté, Anal. Chem. 70 (1998) 35. [37] Y.-H. Yun, Y.-Z. Liang, G.-X. Xie, H.-D. Li, D.-S. Cao, Q.-S. Xu, Analyst 138 (2013) 6412. [38] J.P.M. Andries, Y.V. Heyden, L.M.C. Buydens, Anal. Chem. 85 (2013) 5444. [39] V. Cortés, A. Rodríguez, J. Blasco, B. Rey, C. Besada, S. Cubero, A. Salvador, P. Talens, N. Aleixos, J. Food Eng. 204 (2017) 27. [40] C. Erkinbaev, K. Henderson, J. Paliwal, Food Control 80 (2017) 197. [41] L. Huang, Y. Zhou, L. Meng, D. Wu, Y. He, Food Chem. 224 (2017) 1. [42] F.Y.H. Kutsanedzie, Q. Chen, M.M. Hassan, M. Yang, H. Sun, M.H. Rahman, Food Chem. 240 (2018) 231. [43] T. Ma, X. Li, T. Inagaki, H. Yang, S. Tsuchikawa, J. Food Eng. 224 (2018) 53. [44] H. Pu, D.-W. Sun, J. Ma, J.-H. Cheng, Meat Sci. 99 (2015) 81. [45] L. Wang, D.-W. Sun, H. Pu, J.-H. Cheng, Crit. Rev. Food Sci. Nutr. 57 (2017) 1524. [46] Z. Xiaobo, Z. Jiewen, M.J. Povey, M. Holmes, M. Hanpin, Anal. Chim. Acta 667 (2010) 14. [47] I. Guyon, A. Elisseeff, J. Mach. Learn. Res. 3 (2003) 1157. [48] H. Zou, T. Hastie, J. Roy. Stat. Soc. B 67 (2005) 301. [49] E. Candes, T. Tao, Ann. Stat. 35 (2007) 2313. [50] T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, Chemometr. Intell. Lab. Syst. 118 (2012) 62. [51] R.M. Balabin, S.V. Smirnov, Anal. Chim. Acta 692 (2011) 63. [52] M. Zeaiter, J.M. Roger, V. Bellon-Maurel, Trends Anal. Chem. 24 (2005) 437. [53] M.W. Browne, J. Math. Psychol. 44 (2000) 108. [54] Q.-S. Xu, Y.-Z. Liang, Chemometr. Intell. Lab. Syst. 56 (2001) 1. [55] T. Fearn, Double cross-validation, 2010, p. 201014. [56] P.B. Harrington, C. Laurent, D.F. Levinson, P. Levitt, S.P. Markey, Anal. Chim. Acta 599 (2007) 219. [57] P. de Boves Harrington, Trac. Trends Anal. Chem. 25 (2006) 1112. [58] M. Zeaiter, J.-M. Roger, V. Bellon-Maurel, Trac. Trends Anal. Chem. 24 (2005) 437. [59] P. Williams, K. Norris, Near-infrared technology in the agricultural and food industries, American Association of Cereal Chemists, Inc., 1987. [60] G. Chandrashekar, F. Sahin, Comput. Electr. Eng. 40 (2014) 16. [61] W. Wang, Y. Yun, B. Deng, W. Fan, Y. Liang, RSC Adv. 5 (2015) 95771. [62] H.Y. Zhang, H.Y. Wang, Z.J. Dai, M.S. Chen, Z.M. Yuan, BMC Bioinf. 13 (2012) 298–317. [63] Q. Chen, J. Zhao, C. Fang, D. Wang, Spectrochim. Acta Mol. Biomol. Spectrosc. 66 (2007) 568. [64] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Neurocomputing 70 (2006) 489. [65] X. Bian, C. Zhang, X. Tan, M. Dymek, Y. Guo, L. Lin, B. Cheng, X. Hu, Anal. Methods 9 (2017) 2983. [66] Y. Chen, S.S. Thosar, R.A. Forbess, M.S. Kemper, R.L. Rubinovitz, A.J. Shukla, Drug Dev. Ind. Pharm. 27 (2001) 623. [67] J.M. Sutter, J.H. Kalivas, Microchem. J. 47 (1993) 60. [68] F.G. Blanchet, P. Legendre, D. Borcard, Ecology 89 (2008) 2623. [69] J.A. Fernández Pierna, O. Abbas, V. Baeten, P. Dardenne, Anal. Chim. Acta 642 (2009) 89. [70] L. Brezočnik, I. Fister, V. Podgorelec, Appl. Sci. 8 (2018) 1521. [71] R. Leardi, J. Chemometr. 15 (2001) 559. [72] R. Leardi, J. Chemometr. 14 (2000) 643. [73] F. Marini, B. Walczak, Chemometr. Intell. Lab. Syst. 149 (2015) 153. [74] M. Goodarzi, L. dos Santos Coelho, Anal. Chim. Acta 852 (2014) 20. [75] F. Allegrini, A.C. Olivieri, Anal. Chim. Acta 699 (2011) 18. [76] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Science 220 (1983) 671. [77] J.H. Kalivas, Chemometr. Intell. Lab. Syst. 37 (1997) 255. [78] J.H. Kalivas, N. Roberts, J.M. Sutter, Anal. Chem. 61 (1989) 2024. [79] H.D. Li, Y.Z. Liang, D.S. Cao, Q.S. Xu, Trac. Trends Anal. Chem. 38 (2012) 154–162.
[80] B.C. Deng, Y.H. Yun, Y.Z. Liang, Chemometr. Intell. Lab. Syst. 149, Part B (2015) 166–176. [81] S. Wold, M. Sjöström, L. Eriksson, Chemometr. Intell. Lab. Syst. 58 (2001) 109–130. [82] S. Favilla, C. Durante, M.L. Vigni, M. Cocchi, Chemometr. Intell. Lab. Syst. 129 (2013) 76–86. [83] O.M. Kvalheim, J. Chemometr. 24 (2010) 496–504. [84] T.N. Tran, N.L. Afanador, L.M. Buydens, L. Blanchet, Chemometr. Intell. Lab. Syst. 138 (2014) 153. [85] M. Andersson, J. Chemometr. 23 (2010) 518. [86] W. Cai, Y. Li, X. Shao, Chemometr. Intell. Lab 90 (2008) 188. [87] Å. Rinnan, M. Andersson, C. Ridder, S.B. Engelsen, J. Chemometr. 28 (2014) 439. [88] M. Forina, C. Casolino, C.P. Millan, J. Chemometr. 13 (2015) 165. [89] H.D. Li, Y.Z. Liang, Q.S. Xu, D.S. Cao, Anal. Chim. Acta 648 (2009) 77–84. [90] Y.H. Yun, H.D. Li, L.R.E. Wood, W. Fan, J.J. Wang, D.S. Cao, Q.S. Xu, Y.Z. Liang, Spectrochim. Acta A 111 (2013) 31–36. [91] H. Xu, Z. Liu, W. Cai, X. Shao, Chemometr. Intell. Lab. Syst. 97 (2009) 189. [92] Y.-W. Lin, B.-C. Deng, L.-L. Wang, Q.-S. Xu, L. Liu, Y.-Z. Liang, Chemometr. Intell. Lab. Syst. 159 (2016) 196. [93] B.-C. Deng, Y.-H. Yun, D.-S. Cao, Y.-L. Yin, W.-T. Wang, H.-M. Lu, Q.-Y. Luo, Y.-Z. Liang, Anal. Chim. Acta 908 (2016) 63. [94] M.C.U. Araújo, T.C.B. Saldanha, R.K.H. Galvão, T. Yoneyama, H.C. Chame, V. Visani, Chemometr. Intell. Lab. Syst. 57 (2001) 65. [95] B.C. Deng, Y.H. Yun, Y.Z. Liang, L.Z. Yi, Analyst 139 (2014) 4836–4845. [96] Y.H. Yun, W.T. Wang, B.C. Deng, G.B. Lai, X.B. Liu, D.B. Ren, Y.Z. Liang, W. Fan, Q.S. Xu, Anal. Chim. Acta 862 (2015) 14–23. [97] X. Song, Y. Huang, H. Yan, Y. Xiong, S. Min, Anal. Chim. Acta 948 (2016) 19. [98] R. Zhang, F. Zhang, W. Chen, H. Yao, J. Ge, S. Wu, T. Wu, Y. Du, Chemometr. Intell. Lab. Syst. 175 (2018) 47. [99] S.-C. Chu, P.-W. Tsai, J.-S. Pan, Cat swarm optimization, Springer, 2006, p. 854. [100] X.-S. Yang, A new metaheuristic bat-inspired algorithm, Springer, 2010, p. 65. [101] Z.W. Geem, J.H. Kim, G.V. Loganathan, Simulation 76 (2001) 60. [102] X.-S. Yang, S. Deb, Cuckoo search via Lévy flights, IEEE, 2009, p. 210. [103] Y. Tan, Y. Zhu, Fireworks algorithm for optimization, Springer, 2010, p. 355. [104] Z. Liu, J. Huang, Y. Wang, D. Cao, IEEE Access 6 (2018) 20950. [105] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Appl. Spectrosc. 54 (2000) 413. [106] X. Zou, J. Zhao, Y. Li, Vib. Spectrosc. 44 (2007) 220. [107] R. Leardi, L. Nørgaard, J. Chemometr. 18 (2004) 486–497. [108] J.H. Jiang, R.J. Berry, H.W. Siesler, Y. Ozaki, Anal. Chem. 74 (2002) 3555–3565. [109] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, Y.-Z. Liang, Analyst 140 (2015) 1876. [110] Y.-W. Lin, N. Xiao, L.-L. Wang, C.-Q. Li, Q.-S. Xu, Chemometr. Intell. Lab. Syst. 168 (2017) 62. [111] Y.P. Du, Y.Z. Liang, J.H. Jiang, R.J. Berry, Y. Ozaki, Anal. Chim. Acta 501 (2004) 183–191. [112] A. de Araújo Gomes, R.K.H. Galvão, M.C.U. de Araújo, G. Véras, E.C. da Silva, Microchem. J. 110 (2013) 202. [113] J. Xu, Q.-S. Xu, C.-O. Chan, D.K.-W. Mok, L.-Z. Yi, F.-T. Chau, Anal. Chim. Acta 870 (2015) 45. [114] X. Huang, Q.-S. Xu, Y.-Z. Liang, Anal. Methods 4 (2012) 2815. [115] G. Tang, Y. Huang, K. Tian, X. Song, H. Yan, J. Hu, Y. Xiong, S. Min, Analyst 139 (2014) 4894. [116] F. Xu, X. Huang, H. Dai, W. Chen, R. Ding, E. Teye, Anal. Methods 6 (2014) 1090. [117] S. Ye, D. Wang, S. Min, Chemometr. Intell. Lab. Syst. 91 (2008) 194. [118] D. Wu, X. Chen, X. Zhu, X. Guan, G. Wu, Anal. Methods 3 (2011) 1790. [119] Q. Kong, Z. Su, W. Shen, B. Zhang, J. Wang, N. Ji, H. Ge, Guang Pu Xue Yu Guang Pu Fen Xi = Guang Pu 35 (2015) 1233. [120] F.-F. Lin, Z.-L. Chen, K. Wang, J.-S. Deng, H.-W. Xu, J. Infrared Millim. Waves 4 (2009) 008. [121] X. Fu, F.-J. Duan, T.-T. Huang, L. Ma, J.-J. Jiang, Y.-C. Li, J. Anal. Atomic Spectrom. 32 (2017) 1166. [122] H. Chen, X. Liu, Z. Jia, Z. Liu, K. Shi, K. Cai, Chemometr. Intell. Lab. Syst. 182 (2018) 101. [123] Y.-H. Yun, J. Bin, D.-L. Liu, L. Xu, T.-L. Yan, D.-S. Cao, Q.-S. Xu, Anal. Chim. Acta (2019). https://doi.org/10.1016/j.aca.2019.01.022. [124] L. Breiman, Mach. Learn. 45 (2001) 5. [125] K. Zheng, Q. Li, J. Wang, J. Geng, P. Cao, T. Sui, X. Wang, Y. Du, Chemometr. Intell. Lab 112 (2012) 48. [126] H.D. Li, Q.S. Xu, Y.Z. Liang, Anal. Chim. Acta 740 (2012) 20–26. [127] J. Bin, F. Ai, W. Fan, J. Zhou, X. Li, W. Tang, Y. Liang, Chemometr. Intell. Lab. Syst. 158 (2016) 1. [128] J. Chen, C. Yang, H. Zhu, Y. Li, W. Gui, Chemometr. Intell. Lab. Syst. 182 (2018) 188. [129] Z. Huan, H. Ke-Wei, S. Xiao-Guang, F. Zheng, L. Li-Ying, L. Wei, Z. Chun-Ying, Chin. J. Anal. Chem. 46 (2018) 136. [130] W. Zheng, X. Fu, Y. Ying, J. Chemometr. 31 (2017) e2893. [131] Y.H. Yun, D.S. Cao, M.L. Tan, J. Yan, D.B. Ren, Q.S. Xu, L. Yu, Y.Z. Liang, Chemometr. Intell. Lab 130 (2014) 76–83. [132] J. Bin, F.-F. Ai, W. Fan, J.-H. Zhou, Y.-H. Yun, Y.-Z. Liang, RSC Adv. 6 (2016) 30353.
[133] C. Da, H. Bin, S. Xueguang, S. Qingde, Analyst 129 (2004) 664. [134] D.-S. Cao, Z.-K. Deng, M.-F. Zhu, Z.-J. Yao, J. Dong, R.-G. Zhao, J. Chemometr. 31 (2017) e2922. [135] X. Bian, P. Diwu, Y. Liu, P. Liu, Q. Li, X. Tan, J. Chemometr. 32 (2018) e2940. [136] M. Wen, B.-C. Deng, D.-S. Cao, Y.-H. Yun, R.-H. Yang, H.-M. Lu, Y.-Z. Liang, Analyst 141 (2016) 5586. [137] D. Cao, Y. Liang, Q. Xu, Y. Yun, H. Li, J. Comput. Aided Mol. Des. 25 (2011) 67. [138] W. Guo, B. Lin, D. Liu, X. Zhu, Food Anal. Methods 10 (2017) 3781. [139] C. Dong, J. Li, J. Wang, G. Liang, Y. Jiang, H. Yuan, Y. Yang, H. Meng, Spectrochim. Acta Mol. Biomol. Spectrosc. 205 (2018) 227.
[140] Q. Chen, J. Ding, J. Cai, J. Zhao, Food Chem. 135 (2012) 590. [141] W. Guo, L. Shang, X. Zhu, S.O. Nelson, Food Bioprocess Technol. 8 (2015) 1126. [142] D.-W. Sun, Hyperspectral imaging for food quality analysis and control, Elsevier, 2010. [143] A. Gowen, C. O'Donnell, P. Cullen, G. Downey, J. Frias, Trends Food Sci. Technol. 18 (2007) 590. [144] J. Fan, J. Lv, Stat. Sin. 20 (2010) 101. [145] K. Deb, Multi-objective optimization, Springer, 2014, p. 403. [146] X.-S. Yang, Int. J. Bio-Inspired Comput. 3 (2011) 267.