Regression and independence based variable importance measure


Xinmin Zhang a, Takuya Wada b, Koichi Fujiwara c, Manabu Kano b,∗

a Department of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China
b Department of Systems Science, Kyoto University, Kyoto 606-8501, Japan
c Department of Material Engineering, Nagoya University, Nagoya 464-8601, Japan
∗ Corresponding author. E-mail address: [email protected] (M. Kano).

Article history: Received 22 November 2019; Revised 17 January 2020; Accepted 23 January 2020; Available online 25 January 2020

Keywords: Variable importance measure; Nonlinear systems; Gaussian process regression; Hilbert-Schmidt independence criterion

Abstract

Evaluating the importance of input (predictor) variables is of interest in many applications of statistical models. However, nonlinearity and correlation among variables make it difficult to measure variable importance accurately. In this work, a novel variable importance measure, called regression and independence based variable importance (RIVI), is proposed. RIVI is designed by integrating Gaussian process regression (GPR) and Hilbert-Schmidt independence criterion (HSIC) so that it is applicable to nonlinear systems. The results of two numerical examples demonstrate that RIVI is superior to several conventional measures including the Pearson correlation coefficient, PLS-β, PLS-VIP, Lasso, HSIC, and permutation importance with random forest in the variable identification accuracy.

1. Introduction

Supervised learning is a machine learning task of obtaining a predictive function that maps inputs to an output based on historical data of input-output pairs (Bishop, 2006). In supervised learning, models, i.e., predictive functions, are required to be accurate and explainable. In the present work, we focus on how to evaluate the importance of each input variable on the basis of nonlinear system models. A popular supervised learning method in linear regression is partial least squares (PLS) regression, which has been widely used in various fields (Kano and Nakagawa, 2008; Kano and Fujiwara, 2013; Wang et al., 2015; Ge et al., 2017; Ge, 2018; Shah et al., 2019). PLS possesses good interpretability (Wold et al., 2001; Abdi, 2010). For example, a PLS model can be easily interpreted through PLS-β or PLS-VIP. PLS-β uses the absolute values of the regression coefficients of the PLS model as a measure of variable importance (Wold et al., 2001), and PLS-VIP adopts the variable importance in projection (VIP) score (Chong and Jun, 2005). Lasso (least absolute shrinkage and selection operator) is another widely used linear regression method, which uses ℓ1-regularization to measure variable importance (Tibshirani, 1996; Hastie et al., 2015). However, for complex nonlinear industrial processes, simple parametric models lack expressive power.




Nonparametric regression models, such as random forest (RF) (Strobl et al., 2009; Verikas et al., 2011), can handle process nonlinearity and provide relatively high predictive power. RF is a nonlinear ensemble learning method, which constructs several regression trees using bootstrap samples during the training phase and then predicts the final output by averaging the predictions of all trees. Since RF is a nonparametric ensemble method, however, its interpretability is a challenging problem. To address this problem, an RF variable importance measure based on the permutation importance criterion (referred to as RF-PI) has been proposed (Genuer et al., 2010; Bühlmann, 2012). The basic idea is that if a variable is not important (the null hypothesis), then rearranging the values of that variable will not degrade prediction accuracy. However, due to process complexity, such as nonlinearity and correlation among variables, these methods do not always work well. In this study, a novel variable importance measure, called regression and independence based variable importance (RIVI), is proposed. RIVI is designed based on Gaussian process regression (GPR) and the Hilbert-Schmidt independence criterion (HSIC). GPR is a kernel-based nonparametric Bayesian learning method, which is powerful in handling process nonlinearity. HSIC is a nonlinear independence measure. By integrating GPR and HSIC into a causal analysis framework, the proposed RIVI is capable of handling highly nonlinear systems. The effectiveness of the proposed RIVI is evaluated using two numerical examples. The remainder of this paper is structured as follows. In Section 2, the conventional variable importance analysis methods, including PLS-β, PLS-VIP, Lasso, and RF-PI, are introduced. Then, the novel variable importance measure, RIVI, is presented in Section 3. In Section 4, the superiority of the proposed RIVI is illustrated


through two nonlinear numerical examples. Finally, this work is concluded in Section 5.

2. Conventional methods

In this section, the conventional variable importance analysis methods, including PLS-β, PLS-VIP, Lasso, and RF-PI, are introduced.

2.1. PLS-β

Partial least squares (PLS) is a popular regression method that can deal with multicollinearity (Wold et al., 2001; Abdi, 2010). Given the input matrix X ∈ RN×M and the output vector y ∈ RN, where N and M denote the numbers of samples and variables, respectively, PLS decomposes X and y as follows:

$X = T P^\top + E$   (1)

$y = T b + f$   (2)

$T = X W (P^\top W)^{-1}$   (3)

where T ∈ RN×K denotes the latent variable matrix whose kth column is the latent vector tk ∈ RN (k = 1, . . . , K), P ∈ RM×K denotes the loading matrix of X whose kth column is the loading vector pk ∈ RM, b ∈ RK is the loading vector of y whose kth element is the loading score bk ∈ R, and W ∈ RM×K denotes the weight matrix whose kth column is the weight vector wk ∈ RM. K is the number of retained latent variables, and E ∈ RN×M and f ∈ RN are residuals. The nonlinear iterative partial least squares (NIPALS) algorithm is commonly used to build the PLS model (Wold et al., 2001). The output estimate ŷ of the PLS model can be calculated as

$\hat{y} = X \beta_{pls}, \qquad \beta_{pls} = W (P^\top W)^{-1} b$   (4)

where βpls ∈ RM denotes the regression coefficient vector. PLS-β uses the absolute values of the elements of βpls as a measure of variable importance.

2.2. PLS-VIP

PLS-VIP is a variable importance measure based on the variable importance in projection (VIP) score, which is estimated by the PLS regression model. The VIP score measures the importance of each input variable in the projection used in a PLS model and expresses the contribution of each input variable to the output variable (Chong and Jun, 2005; Kano and Fujiwara, 2013). Mathematically, the VIP score of the jth variable, VIPj, is calculated as

$\mathrm{VIP}_j = \sqrt{\frac{M \sum_{k=1}^{K} w_{jk}^2\, b_k^2\, (t_k^\top t_k)}{\sum_{k=1}^{K} b_k^2\, (t_k^\top t_k)}}$   (5)

where wjk is the jth element of wk.

2.3. Lasso

Lasso is a popular regularization and variable selection method (Tibshirani, 1996; Hastie et al., 2015). Mathematically, Lasso solves the following optimization problem

$\hat{\beta}_{lasso} = \arg\min_{\beta}\ \frac{1}{2}\, \| y - X \beta \|_2^2 + \lambda \| \beta \|_1$   (6)

where β̂lasso ∈ RM denotes the vector of regression coefficients, ‖·‖1 and ‖·‖2 denote the ℓ1-norm and ℓ2-norm, and λ denotes the regularization parameter. Lasso imposes the ℓ1-penalty (norm) on the regression coefficients and shrinks the coefficients of irrelevant variables to zero. The absolute values of the elements of β̂lasso can be used as a measure of variable importance.

2.4. RF-PI

Random forest (RF) is an ensemble learning method, which constructs a number of regression trees during the training phase and then predicts the final output by averaging the predictions of all trees (Strobl et al., 2009; Verikas et al., 2011; Breiman, 2017). Each tree is built on a bootstrap subset, which is randomly drawn with replacement from the original dataset. At each node in the tree-growing process, the best split is chosen from a randomly selected subset of the predictor variables. A variable importance measure can be obtained in RF by using the permutation importance criterion (referred to as RF-PI) (Genuer et al., 2010). The basic principle is that if a variable is not important (the null hypothesis), the prediction accuracy will not deteriorate when the values of that variable are permuted. RF-PI uses the out-of-bag (OOB) dataset to calculate the importance score for each input variable (Bühlmann, 2012). More specifically, given the dataset D = {(x1, y1), . . . , (xN, yN)}, where xi ∈ RM and yi ∈ R are the ith observations of the input-output pairs, suppose that a sequence of B bootstrap subsets is generated and the tree grown on the bth bootstrap subset is denoted as f̂b(·). The importance measure VIj for the jth variable can be calculated as follows:

(I) Set b = 1 and find the OOB observations, denoted by OOBb.
(II) Use the tree predictor f̂b(·) to predict OOBb and then calculate the OOB error errb^oob:

$err_b^{oob} = \frac{1}{n_{OOB_b}} \sum_{i=1,\; i \in OOB_b}^{N} \left( y_i - \hat{f}_b(x_i) \right)^2$   (7)

where yi represents the measured value, f̂b(xi) represents the predicted value, and n_OOBb represents the number of samples in OOBb.
(III) For the jth variable, j = 1, . . . , M: (a) randomly permute the values of the jth variable in OOBb, and denote the permuted OOBb as OOBb,j; (b) predict OOBb,j using the tree predictor f̂b(·), and then calculate the OOB error errb,j^oob.
(IV) Repeat Steps I-III for b = 2, . . . , B.
(V) Calculate the importance measure VIj:

$VI_j = \frac{1}{B} \sum_{b=1}^{B} \left( err_{b,j}^{oob} - err_b^{oob} \right)^2.$   (8)
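For concreteness, the following is a minimal sketch of the baseline measures reviewed in this section, assuming NumPy and scikit-learn are available. Note that scikit-learn's permutation_importance evaluates permutations on a held-out split rather than on the per-tree OOB samples, so it only approximates the OOB-based RF-PI of Eqs. (7) and (8), and the fixed n_components, alpha, and n_trees values below are illustrative placeholders rather than the settings used in the paper.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split


def pls_beta_and_vip(X, y, n_components=2):
    """Return |beta_pls| (PLS-beta) and the VIP scores of Eq. (5)."""
    pls = PLSRegression(n_components=n_components).fit(X, y)
    beta = np.abs(pls.coef_).ravel()            # PLS-beta importance
    T = pls.x_scores_                           # latent scores t_k, shape (N, K)
    W = pls.x_weights_                          # unit-norm weights w_k, shape (M, K)
    b = pls.y_loadings_.ravel()                 # loadings b_k of y, shape (K,)
    ss = (b ** 2) * np.sum(T ** 2, axis=0)      # b_k^2 (t_k' t_k) per component
    vip = np.sqrt(X.shape[1] * (W ** 2 @ ss) / ss.sum())
    return beta, vip


def lasso_importance(X, y, alpha=0.1):
    """|beta_lasso| of Eq. (6); alpha plays the role of lambda."""
    return np.abs(Lasso(alpha=alpha).fit(X, y).coef_)


def rf_permutation_importance(X, y, n_trees=200, seed=0):
    """Permutation importance of an RF regressor on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(X_tr, y_tr)
    return permutation_importance(rf, X_te, y_te, n_repeats=10,
                                  random_state=seed).importances_mean
```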

3. Proposed method: RIVI

In this section, GPR and HSIC are first introduced. Then, a novel variable importance measure, RIVI, is proposed.

3.1. Gaussian process regression (GPR)


GPR is a kernel-based nonparametric Bayesian learning method, which is powerful in handling process nonlinearity (Rasmussen, 2003). GPR has the appealing property of working well on small datasets. In addition, compared with the standard cross-validation method, GPR provides a more efficient framework that automatically learns kernel parameters by maximizing the marginal likelihood of the training data. Mathematically, GPR assumes that the latent function f(x) is a random variable which follows a Gaussian process (GP) prior distribution. Let f = [f(x1), f(x2), . . . , f(xN)]⊤ denote the latent function values corresponding to the inputs. A GP can be expressed as

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$   (9)




$m(x) = \mathbb{E}[f(x)]$   (10)

$k(x, x') = \mathbb{E}\left[ (f(x) - m(x))(f(x') - m(x')) \right]$   (11)

where m(x) is the mean function and k(x, x′) is a covariance or kernel function. To avoid expensive posterior computations, m(x) is commonly set to zero (Melo, 2012; Schulz et al., 2018). According to the principles of GP (Rasmussen, 2003), the joint distribution of any finite number of GP random variables is multivariate Gaussian. Thus, the distribution of f can be expressed as









$p(f \mid X) = \mathcal{N}(0, K)$   (12)

where K ∈ RN×N denotes the covariance matrix with elements [K]ij = k(xi, xj). The marginal distribution of y can be calculated by

$p(y \mid X) = \int p(y \mid f, X)\, p(f \mid X)\, df = \mathcal{N}(0, K + \delta_\varepsilon^2 I)$   (13)

where I is the identity matrix and δε2 is the variance of an i.i.d. zero-mean Gaussian noise. Given a query sample xq ∈ RM , the latent function value f(xq ), denoted as fq , can be obtained. The joint distribution of observations and fq under the GP prior is given by

$\begin{bmatrix} y \\ f_q \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K + \delta_\varepsilon^2 I & k_q \\ k_q^\top & k_{qq} \end{bmatrix} \right)$   (14)

where kq ∈ RN denotes the covariance vector with element [kq ] j = k(xq , x j ), j = 1, 2, . . . , N and kqq = k(xq , xq ). The posterior predictive distribution of fq given y can be expressed as







$p(f_q \mid y, X, x_q) = \mathcal{N}\left( k_q^\top (K + \delta_\varepsilon^2 I)^{-1} y,\; k_{qq} - k_q^\top (K + \delta_\varepsilon^2 I)^{-1} k_q \right).$   (15)

The best estimate of fq is the mean of this distribution

$\bar{f}_q = k_q^\top (K + \delta_\varepsilon^2 I)^{-1} y.$   (16)

Traditionally, the GPR model is trained on the basis of the maximum likelihood technique with gradient descent optimization (Rasmussen, 2003; Melo, 2012).
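As a concrete illustration of Eqs. (13)-(16), the following is a minimal NumPy sketch of the GPR predictive mean with an RBF kernel. The length_scale and noise_var values are fixed placeholders here, whereas the paper learns the kernel parameters by maximizing the marginal likelihood.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 l^2)) for all row pairs of A and B."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale ** 2)


def gpr_predict_mean(X, y, Xq, length_scale=1.0, noise_var=1e-2):
    """Posterior mean of Eq. (16): f_q = k_q^T (K + delta^2 I)^{-1} y."""
    K = rbf_kernel(X, X, length_scale)      # [K]_ij = k(x_i, x_j)
    Kq = rbf_kernel(Xq, X, length_scale)    # each row is k_q^T for one query point
    alpha = np.linalg.solve(K + noise_var * np.eye(len(X)), y)
    return Kq @ alpha                       # predictive means for all queries
```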





3.2. Hilbert-Schmidt independence criterion (HSIC)

HSIC is a nonlinear independence measure that is defined on a reproducing kernel Hilbert space (RKHS) (Gretton et al., 2005; Aronszajn, 1950). Compared with other kernel-based independence measures, HSIC has the following advantages: (1) it has a simpler empirical estimate and does not require user-defined regularization terms; (2) it is more robust to outliers; (3) it does not suffer from slow learning rates (Devroye et al., 2013). Denote X and Y as two random variables. Let F be an RKHS with a nonlinear mapping φ(x) ∈ F for each x ∈ X. Similarly, let G be another RKHS with a nonlinear mapping ψ(y) ∈ G for each y ∈ Y. Then, the cross-covariance operator (Fukumizu et al., 2004) between these two RKHSs is defined as

$C_{xy} := \mathbb{E}_{xy}\left[ (\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y) \right]$   (17)

where μx = Ex[φ(x)], μy = Ey[ψ(y)], and ⊗ denotes the tensor product. HSIC is expressed as the squared Hilbert-Schmidt norm of the cross-covariance operator

$\mathrm{HSIC}(X, Y) := \| C_{xy} \|_{HS}^2.$   (18)

Eq. (18) can be rewritten in terms of kernels (Rasmussen, 2003) as

$\mathrm{HSIC}(X, Y) = \mathbb{E}_{x,x',y,y'}[k(x, x')\, l(y, y')] + \mathbb{E}_{x,x'}[k(x, x')]\, \mathbb{E}_{y,y'}[l(y, y')] - 2\, \mathbb{E}_{x,y}\big[ \mathbb{E}_{x'}[k(x, x')]\, \mathbb{E}_{y'}[l(y, y')] \big]$   (19)

where k(x, x′) = ⟨φ(x), φ(x′)⟩ and l(y, y′) = ⟨ψ(y), ψ(y′)⟩ are kernel functions, and Ex,x′,y,y′[k(x, x′)l(y, y′)] denotes the expectation over data pairs (x, y) and (x′, y′) drawn from the joint distribution pxy. Given N independent observations {(x1, y1), . . . , (xN, yN)}, the empirical HSIC is given by

$\mathrm{HSIC}(X, Y) = (N - 1)^{-2}\, \mathrm{tr}(K H L H)$   (20)

where K, H, L ∈ RN×N, [K]ij = k(xi, xj), [L]ij = l(yi, yj), and [H]ij = δij − N−1. HSIC(X, Y) = 0 if and only if the random vectors X and Y are independent. Thus, HSIC can be utilized as a dependence measure.
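The empirical estimator of Eq. (20) is straightforward to implement. The sketch below assumes Gaussian kernels for k and l with median-heuristic bandwidths, which is a common choice but not prescribed by the text.

```python
import numpy as np


def _gram(z, bandwidth=None):
    """Gaussian Gram matrix of a 1-D or 2-D sample; median-heuristic bandwidth by default."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    d2 = np.sum(z ** 2, axis=1)[:, None] + np.sum(z ** 2, axis=1)[None, :] - 2 * z @ z.T
    if bandwidth is None:
        bandwidth = np.sqrt(0.5 * np.median(d2[d2 > 0]))   # median heuristic
    return np.exp(-0.5 * d2 / bandwidth ** 2)


def hsic(x, y):
    """Empirical HSIC of Eq. (20): (N - 1)^{-2} tr(K H L H)."""
    N = len(x)
    H = np.eye(N) - np.ones((N, N)) / N                    # [H]_ij = delta_ij - 1/N
    K, L = _gram(x), _gram(y)
    return np.trace(K @ H @ L @ H) / (N - 1) ** 2
```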

3.3. Regression and independence based variable importance (RIVI)

In this section, a novel variable importance measure, denoted as RIVI, is proposed. RIVI is designed by integrating GPR and HSIC into a causal analysis framework. RIVI inherits the merits of GPR and HSIC, making it applicable to nonlinear systems. The basic idea of RIVI is to compare two models. One is the model constructed using all input variables (referred to as the full model). The other is the model constructed using the remaining input variables except for one variable (referred to as the reduced model). If the removed variable is a causal variable of the output, the dependence between the residual and the output variable is larger than the dependence derived from the full model. Based on this idea, the new variable importance measure RIVI is defined. Suppose that X ∈ RN×M and y ∈ RN are the input and output variables. Let x = [x1, . . . , xM] and y be any pair of observations of the input and output variables. RIVI first constructs a full model with all of the input variables as follows

$y = f_{full}(x_1, \ldots, x_M) + \epsilon_{full}$   (21)

where εfull denotes the estimated residual and f_full denotes the model constructed by GPR. When the model is accurate, εfull contains only the information of the noise. Second, RIVI constructs a reduced model in which one input variable is eliminated from the original dataset X. When xm is eliminated, the constructed reduced model is described as

$y = \hat{f}_m(x_1, \ldots, x_{m-1}, x_{m+1}, \ldots, x_M) + \epsilon_m$   (22)

where εm denotes the estimated residual of the reduced model. When xm is a causal variable for y, the residual εm contains the information of both the noise term and the removed variable xm. Thus, a reasonable measure of variable importance can be obtained by calculating the difference in statistical independence between the prediction residuals and the output variable before and after eliminating an input variable. To estimate this difference in statistical independence, HSIC is adopted. More specifically, the importance of xm, denoted as RIVIm, is defined as

$\mathrm{RIVI}_m = \frac{\mathrm{HSIC}(y, \epsilon_m)}{\mathrm{HSIC}(y, \epsilon_{full})}.$   (23)

The details of this algorithm are shown in Algorithm 1.

4. Case study

In this section, the effectiveness of the proposed RIVI is verified through two numerical examples. RIVI is compared with the conventional Pearson correlation coefficient (PCC), PLS-β, PLS-VIP, Lasso, RF-PI, and HSIC.


Fig. 1. Calculated variable importance of each method in numerical example 1.

Algorithm 1 RIVI
Require: Input data matrix X ∈ RN×M and output data vector y ∈ RN. x = [x1, · · · , xM] and y denote any pair of observations of the input and output variables.
1: Construct a full model by GPR with all of the variables as y = f_full(x1, · · · , xM) + εfull.
2: for m = 1, · · · , M do
3:    Construct a reduced model in which xm is eliminated from X as y = f_m(x1, · · · , xm−1, xm+1, · · · , xM) + εm.
4:    Calculate the importance score of xm as RIVIm = HSIC(y, εm)/HSIC(y, εfull).
5: end for
Ensure: The importance scores RIVIm (m = 1, · · · , M)
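A minimal Python sketch of Algorithm 1 is given below, assuming the gpr_predict_mean and hsic helpers from the sketches in Sections 3.1 and 3.2 are available; the GPR hyperparameters are kept fixed for brevity, whereas the paper learns them from the training data.

```python
import numpy as np


def rivi_scores(X, y):
    """RIVI_m = HSIC(y, eps_m) / HSIC(y, eps_full), Eq. (23)."""
    M = X.shape[1]
    # Full model, Eq. (21): y = f_full(x_1, ..., x_M) + eps_full
    eps_full = y - gpr_predict_mean(X, y, X)
    denom = hsic(y, eps_full)
    scores = np.empty(M)
    for m in range(M):
        # Reduced model without x_m, Eq. (22)
        X_red = np.delete(X, m, axis=1)
        eps_m = y - gpr_predict_mean(X_red, y, X_red)
        scores[m] = hsic(y, eps_m) / denom     # Eq. (23)
    return scores
```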

4.1. Numerical example 1

This numerical example is used to test the performance of the proposed RIVI in handling data with cross-correlated features and nonlinear input-output dependency.

4.1.1. Data generation

Data are generated from the following system:

$x_1, x_2, x_3 \sim \mathcal{N}(\mu, \Sigma)$   (24)

$x_4 \sim \mathcal{N}(0, 1)$   (25)

$x_5 = \exp(x_4^2) + e$   (26)

where x1, x2, and x3 are correlated variables that follow the normal distribution with mean μ and covariance Σ as follows



$\mu = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 1.0 & 0.9 & 0.81 \\ 0.9 & 1.0 & 0.9 \\ 0.81 & 0.9 & 1.0 \end{bmatrix}.$   (27)

x4 is independently sampled from the standard normal distribution with zero-mean and unit-variance. x5 is a nonlinear function of x4 , and the noise term e follows a standard normal distribution. The output variable y is defined as

$y = f(x_1, x_4) + \xi$   (28)

$f(x_1, x_4) = -5.0 \cos(3 x_1) + 5.0 \cos(3 x_4)$   (29)

Table 1
Confusion matrix.

                              Estimated class
                              Important    Unimportant
True class    Important       a            b
              Unimportant     c            d

where ξ is the noise term. It is sampled from the normal distribution so that its SN ratio σ equals 0.33, as follows

$\xi \sim \sigma \times \mathcal{N}(0.0, 1.0)$   (30)

$\sigma = 0.33 \times \mathrm{Var}[f(x_1, x_4)]$   (31)

where Var[·] is the variance. Notice that the output variable y is only affected by x1 and x4, and the relation between these two variables and y is nonlinear. To implement the variable importance analysis, 200 samples were generated.

4.1.2. Performance measure

To evaluate the performance of each method in identifying the important variables, a confusion matrix, which contains the information about true and predicted classes, is adopted. The top two variables in the order of variable importance estimated by each method are classified as important variables, and the bottom three variables are classified as unimportant variables. Table 1 shows the confusion matrix, in which the variables are divided into four groups. In this table, a, b, c, and d represent the number of variables identified by each method in each class. To compare the performance of the variable importance analysis methods quantitatively, three measures are calculated as follows

$\mathrm{Accuracy} = (a + d)/(a + b + c + d)$   (32)

$\mathrm{Sensitivity} = a/(a + b)$   (33)

$\mathrm{Specificity} = d/(c + d).$   (34)
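To make the setup reproducible, a minimal sketch of the data-generating system of Eqs. (24)-(31) and of the measures of Eqs. (32)-(34) is given below; the random seed and the use of the NumPy generator and of the sample variance in Eq. (31) are implementation choices not specified in the text.

```python
import numpy as np


def generate_example1(N=200, seed=0):
    """Data-generating system of Eqs. (24)-(31)."""
    rng = np.random.default_rng(seed)
    Sigma = np.array([[1.0, 0.9, 0.81],
                      [0.9, 1.0, 0.9],
                      [0.81, 0.9, 1.0]])
    x123 = rng.multivariate_normal(np.zeros(3), Sigma, size=N)   # Eq. (24)
    x4 = rng.standard_normal(N)                                  # Eq. (25)
    x5 = np.exp(x4 ** 2) + rng.standard_normal(N)                # Eq. (26)
    f = -5.0 * np.cos(3 * x123[:, 0]) + 5.0 * np.cos(3 * x4)     # Eq. (29)
    sigma = 0.33 * np.var(f)                                     # Eq. (31)
    y = f + sigma * rng.standard_normal(N)                       # Eqs. (28), (30)
    return np.column_stack([x123, x4, x5]), y


def accuracy_sensitivity_specificity(scores, true_important, n_top=2):
    """Eqs. (32)-(34) from the confusion matrix of Table 1."""
    est = set(np.argsort(scores)[::-1][:n_top])   # top-n_top variables -> "important"
    true = set(true_important)
    M = len(scores)
    a = len(est & true)            # important, estimated important
    b = len(true - est)            # important, estimated unimportant
    c = len(est - true)            # unimportant, estimated important
    d = M - a - b - c              # unimportant, estimated unimportant
    return (a + d) / M, a / (a + b), d / (c + d)
```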

4.1.3. Results and discussion

PCC, PLS-β, PLS-VIP, Lasso, RF-PI, HSIC, and the proposed RIVI were applied to the generated data. In PLS-β and PLS-VIP, the number of latent variables was set to 4.


Table 2
Comparison of variable importance analysis methods in numerical example 1.

Method     Accuracy   Sensitivity   Specificity
PCC        0.53       0.41          0.61
PLS-β      0.50       0.37          0.58
PLS-VIP    0.54       0.42          0.61
Lasso      0.61       0.52          0.68
RF-PI      0.98       0.98          0.98
HSIC       0.90       0.88          0.92
RIVI       1.00       1.00          1.00

The number of trees used in RF-PI was set to 200. In PCC and HSIC, the variable importance measure is calculated directly from the dependency between each input variable xm and the output variable y. The parameters used in each method were determined by 10-fold cross-validation. The simulation was repeated 100 times. Table 2 shows the mean identification results of each method. PCC, PLS-β, PLS-VIP, and Lasso provided poor performance with low accuracy, sensitivity, and specificity. Compared with them, HSIC achieved better performance, but it was worse than RF-PI. In comparison, the proposed RIVI achieved the best performance with the highest accuracy, sensitivity, and specificity. The detailed identification results can be observed in Fig. 1. RF-PI, HSIC, and RIVI were able to identify that x1 and x4 are important variables. Compared to RF-PI and HSIC, RIVI completely distinguished the important variables from the other variables. However, PCC, PLS-β, PLS-VIP, and Lasso did not work well at all due to nonlinearity. To examine the effect of the number of samples on the performance of the proposed RIVI, the experiments were also conducted with different numbers of samples. Fig. 2 provides the accuracy of each method for different numbers of samples. RIVI outperformed the other methods in all cases, and PCC, PLS-β, PLS-VIP, and Lasso yielded worse performance.

Fig. 2. Accuracy of each method for different numbers of samples in numerical example 1.

4.2. Numerical example 2

This numerical example is used to test the performance of the proposed RIVI in handling data with redundant features and nonlinear input-output dependency.

4.2.1. Data generation

Data are generated from the following system:

$x_1, \ldots, x_8 \sim \mathcal{N}(0, 1)$   (35)

$x_9 = 1.5 x_1 + U(-1, 1)$   (36)

$x_{10} = 1.5 x_2 + U(-1, 1)$   (37)

$y = \frac{2 x_1^2 + 1.5 x_2 + 0.1 \xi}{0.8 + (x_2 + 1.5)^2}$   (38)

where U (a, b) denotes the uniform distribution on [a, b], and ξ ∼ N (0, 1 ) is the noise term. Notice that the data contain redundant input variables and nonlinear dependency between the input and output variables. The important input variables are x1 and x2 . To implement the variable importance analysis, 500 samples were generated.
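Likewise, a minimal sketch of the data-generating system of Eqs. (35)-(38) is given below, with the fraction in Eq. (38) reconstructed as shown above; the random seed is an arbitrary choice.

```python
import numpy as np


def generate_example2(N=500, seed=0):
    """Data-generating system of Eqs. (35)-(38)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, 10))                   # x_1, ..., x_8 ~ N(0, 1), Eq. (35)
    X[:, 8] = 1.5 * X[:, 0] + rng.uniform(-1, 1, N)    # x_9,  Eq. (36)
    X[:, 9] = 1.5 * X[:, 1] + rng.uniform(-1, 1, N)    # x_10, Eq. (37)
    xi = rng.standard_normal(N)
    y = (2 * X[:, 0] ** 2 + 1.5 * X[:, 1] + 0.1 * xi) / (0.8 + (X[:, 1] + 1.5) ** 2)  # Eq. (38)
    return X, y
```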

Fig. 3. Calculated variable importance of each method in numerical example 2.

Table 3
Comparison of variable importance analysis methods in numerical example 2.

Method     Accuracy   Sensitivity   Specificity
PCC        0.78       0.46          0.87
PLS-β      0.81       0.52          0.88
PLS-VIP    0.71       0.28          0.82
Lasso      0.89       0.73          0.93
RF-PI      0.97       0.92          0.98
HSIC       0.91       0.77          0.94
RIVI       1.00       1.00          1.00

Fig. 4. Accuracy of each method for different numbers of samples in numerical example 2.

4.2.2. Results and discussion

PCC, PLS-β, PLS-VIP, Lasso, RF-PI, HSIC, and the proposed RIVI were applied to the generated data. The number of latent variables used in PLS-β and PLS-VIP was 5. In RF-PI, the number of trees was 550. The parameters used in each method were determined by 10-fold cross-validation. The simulation was repeated 100 times. Table 3 provides the mean identification results of each method. PCC, PLS-β, PLS-VIP, and Lasso achieved poor performance with low accuracy, sensitivity, and specificity. Compared with them, RF-PI, HSIC, and RIVI achieved better performance. RIVI provided the best performance with the highest accuracy, sensitivity, and specificity. Fig. 3 provides detailed identification results. PCC, PLS-β, PLS-VIP, and Lasso did not perform well. RF-PI and HSIC provided high importance scores for variables x9 and x10, although they are not the important variables. In comparison, RIVI correctly identified x1 and x2 as the most important variables. Similarly to numerical example 1, the experiments were conducted with different numbers of samples to examine the effect of the number of samples on the performance of the proposed RIVI. Fig. 4 provides the accuracy of each method for different numbers of samples. RIVI is superior to the other methods in most cases. All of the above analysis results have demonstrated the superiority of the proposed RIVI.

5. Conclusions

In this paper, a novel variable importance measure, RIVI, was developed by integrating GPR and HSIC into a causal analysis framework that makes it applicable to nonlinear systems. The usefulness and advantages of RIVI were verified through its applications to two nonlinear systems. The application results have demonstrated that RIVI significantly outperforms the conventional PCC, PLS-β, PLS-VIP, Lasso, RF-PI, and HSIC in variable identification accuracy.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Xinmin Zhang: Software, Validation, Formal analysis, Investigation, Writing - original draft. Takuya Wada: Methodology, Software, Formal analysis, Investigation, Writing - original draft. Koichi Fujiwara: Validation, Resources, Writing - review & editing. Manabu Kano: Conceptualization, Validation, Resources, Writing - review & editing, Supervision.

References

Abdi, H., 2010. Partial least squares regression and projection on latent structure regression (PLS regression). Wiley Interdisciplinary Reviews: Computational Statistics 2 (1), 97–106.
Aronszajn, N., 1950. Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (3), 337–404.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer Science & Business Media.
Breiman, L., 2017. Classification and Regression Trees. Routledge.
Bühlmann, P., 2012. Bagging, boosting and ensemble methods. In: Handbook of Computational Statistics. Springer, pp. 985–1022.
Chong, I.-G., Jun, C.-H., 2005. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems 78 (1-2), 103–112.
Devroye, L., Györfi, L., Lugosi, G., 2013. A Probabilistic Theory of Pattern Recognition, 31. Springer Science & Business Media.
Fukumizu, K., Bach, F.R., Jordan, M.I., 2004. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research 5 (Jan), 73–99.
Ge, Z., 2018. Process data analytics via probabilistic latent variable models: A tutorial review. Industrial & Engineering Chemistry Research 57 (38), 12646–12661.
Ge, Z., Song, Z., Ding, S.X., Huang, B., 2017. Data mining and analytics in the process industry: The role of machine learning. IEEE Access 5, 20590–20616.
Genuer, R., Poggi, J.-M., Tuleau-Malot, C., 2010. Variable selection using random forests. Pattern Recognition Letters 31 (14), 2225–2236.
Gretton, A., Bousquet, O., Smola, A., Schölkopf, B., 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In: International Conference on Algorithmic Learning Theory. Springer, pp. 63–77.
Hastie, T., Tibshirani, R., Wainwright, M., 2015. The lasso for linear models. Statistical Learning with Sparsity: The Lasso and Generalizations, 7–28.
Kano, M., Fujiwara, K., 2013. Virtual sensing technology in process industries: Trends and challenges revealed by recent industrial applications. Journal of Chemical Engineering of Japan 46 (1), 1–17.
Kano, M., Nakagawa, Y., 2008. Data-based process monitoring, process control, and quality improvement: Recent developments and applications in steel industry. Computers & Chemical Engineering 32 (1-2), 12–24.
Melo, J., 2012. Gaussian processes for regression: a tutorial. Technical Report.
Rasmussen, C.E., 2003. Gaussian processes in machine learning. In: Summer School on Machine Learning. Springer, pp. 63–71.
Schulz, E., Speekenbrink, M., Krause, A., 2018. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. Journal of Mathematical Psychology 85, 1–16.
Shah, D., Wang, J., He, Q.P., 2019. A feature-based soft sensor for spectroscopic data analysis. Journal of Process Control 78, 98–107.
Strobl, C., Malley, J., Tutz, G., 2009. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14 (4), 323–348.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), 267–288.
Verikas, A., Gelzinis, A., Bacauskiene, M., 2011. Mining data with random forests: A survey and results of new tests. Pattern Recognition 44 (2), 330–349.
Wang, Z.X., He, Q.P., Wang, J., 2015. Comparison of variable selection methods for PLS-based soft sensor modeling. Journal of Process Control 26, 56–72.
Wold, S., Sjöström, M., Eriksson, L., 2001. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58 (2), 109–130.