Acta Mathematica Scientia 2014,34B(2):579–592 http://actams.wipm.ac.cn

MODEL SELECTION METHOD BASED ON MAXIMAL INFORMATION COEFFICIENT OF RESIDUALS∗

Qiuheng TAN    Hangjin JIANG
Wuhan Institute of Physics and Mathematics, CAS, Wuhan 430071, China
University of CAS, Beijing 100049, China
E-mail: [email protected]

Yiming DING
Key Laboratory of Magnetic Resonance in Biological Systems, Wuhan Institute of Physics and Mathematics, CAS, Wuhan 430071, China
National Center for Mathematics and Interdisciplinary Sciences, CAS, Beijing 100049, China
E-mail: [email protected]

Abstract  Traditional model selection criteria try to strike a balance between fitting error and model complexity. Before using them, assumptions on the distribution of the response or of the noise must be made, and these assumptions may be misspecified. In this article, we give a new model selection criterion that needs fewer assumptions: based on the assumption that the noise term in the model is independent of the explanatory variables, it minimizes the strength of association between the regression residuals and the response. The maximal information coefficient (MIC), a recently proposed dependence measure, captures a wide range of associations and gives almost the same score to different types of relationships with equal noise, so MIC is used to measure the association strength. Furthermore, the partial maximal information coefficient (PMIC) is introduced to capture the association between two variables after removing the effect of a third controlling random variable. In addition, a definition of general partial relationship is given.

Key words

model selection; residual; maximal information coefficient; partial maximal information coefficient

2010 MR Subject Classification  62B10; 62P99

∗ Received February 5, 2013; revised December 20, 2013. This work was partly supported by the National Basic Research Program of China (973 Program, 2011CB707802, 2013CB910200) and the National Science Foundation of China (11201466).
† Corresponding author

1 Introduction

In the era of big data, there is a growing need for methods that point to interesting, important features of the data. Model selection plays a critical role in statistical data analysis. Generally, model selection is the task of selecting a statistical model from a set of candidate models, given data. In its most basic form, model selection is one of the fundamental tasks of scientific inquiry. Determining the principle that explains a series of observations is often linked directly to a mathematical model predicting those observations. Once a model is identified, various forms of inference, such as prediction, control, information extraction, knowledge discovery, validation, risk evaluation, and decision making, can be carried out within the framework of deductive argument [12].

Curve fitting and variable selection are two hot topics in model selection. Given a set of points and other background knowledge (for example, that the points are i.i.d. samples), one desires to select the best curve describing the function that generated the points; such a curve stands for the relationship underlying the samples. Variable selection is the process of selecting a subset of relevant variables to be used in model construction. Variable selection techniques provide three main benefits when constructing predictive models: improved model interpretability, shorter training times, and enhanced generalization by reducing overfitting.

A good model selection technique balances goodness of fit with simplicity. Among candidate models of similar predictive or explanatory power, the simplest model is the most likely to generalize. Many effective criteria, such as AIC (Akaike information criterion), BIC (Bayesian information criterion), MDL (minimum description length), and SRM (structural risk minimization), are popular in data analysis and modeling, and each of them has its advantages.

The residuals of a model also play a key role in model validation and regression analysis, and even in model selection [3, 14]. If the model fitted to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. A method called the 6-plot uses six diagnostic plots to check whether or not the residuals contain patterns [14]. The residual information criterion (RIC), a selection criterion based on the residual log-likelihood, was proposed in [18] for regression models including classical regression models, Box-Cox transformation models, weighted regression models, and regression models with autoregressive moving average errors. Taking RIC as a selection criterion, one has to derive the formulation of the criterion oneself through several steps, which is somewhat complex. Model selection criteria that depend on an estimate of the variance of the residuals, such as AIC and BIC as well as their improved versions, are no longer valid if the noise has a heavy-tailed distribution without a bounded second moment, because the variance of the residuals cannot then be estimated.

In this article, we introduce a method based on the association between the residuals and the explanatory variables, exploiting the behavior of the residuals under the true model; the method is efficient, correct, and adapted to cases in which the model is perturbed by complex noise. In our method, the maximal information coefficient (MIC), introduced by D. Reshef et al in 2011, is used as a measure of the dependency between the residuals and the explanatory variables. For a pair of variables X and Y, 0 ≤ MIC(X, Y) ≤ 1, and MIC(X, Y) = 0 if and only if X and Y are independent. MIC was called "a correlation for the 21st century" [20]. It has been widely used in many research fields, especially in biology [4, 10, 23], because it can capture a wide range of associations (both functional and non-functional), gives similar scores to equally noisy relationships of different types, and is robust to outliers.


The partial correlation coefficient is an important tool for variable selection, widely used in time series analysis and linear regression analysis; it measures the degree of association between two random variables with the effect of a set of controlling random variables removed. It is natural to ask how to measure general, not necessarily linear, association between two random variables with the effect of a set of controlling random variables removed [20]. Parallel to the partial correlation coefficient, we introduce the partial maximal information coefficient (PMIC) as a measure of partial nonlinear dependence between variables. PMIC can help us find underlying nonlinear relationships between the response and explanatory variables, even if such relationships are obscured by other strong associations.

The remainder of this article is organized as follows. In Section 2, after a brief introduction to MIC, we present a C++ program to compute MIC, and we present a model selection method based on the maximal information coefficient of the residuals. Our method works for both ordinary noise and wild noise (for example, Cauchy-distributed noise). In Section 3, we introduce the partial maximal information coefficient under a prescribed hypothesis space and apply it to variable selection; two simulation examples are presented to show that PMIC can detect obscured associations. The conclusion of our work is given in Section 4.

2 Maximal Information Coefficient and Curve Fitting

MIC (Maximal Information Coefficient, [16]), as described in this article, can detect a wide range of relationships (both functional and non-functional), and for functional relationships it provides a score that roughly equals the coefficient of determination of the data relative to the regression function. In fact, MIC(X, Y) is the mutual information I(X, Y) between the random variables X and Y normalized by the minimum entropy min{H(X), H(Y)} of X and Y, which can be written as
$$\mathrm{MIC}(X,Y)=\frac{I(X,Y)}{H(X)}=1-\frac{H(X|Y)}{H(X)} \quad\text{or}\quad \mathrm{MIC}(X,Y)=\frac{I(X,Y)}{H(Y)}=1-\frac{H(Y|X)}{H(Y)}.$$

Here H(Y|X) is the conditional entropy, the amount of information needed to describe the outcome of Y given that the value of X is known. H(Y|X) = 0 if and only if the value of Y is completely determined by the value of X; conversely, H(Y|X) = H(Y) if and only if X and Y are independent random variables. In this sense, MIC is either the percentage of the information in Y that can be explained by X, or the percentage of the information in X that can be explained by Y. Theoretically, MIC gives scores tending to 1 for almost all kinds of noiseless functional relationships, and scores tending to 0 for statistically independent variables; MIC is symmetric because mutual information is symmetric. MIC is robust to outliers because the estimations of the Shannon entropy and the conditional entropy are robust [17].

2.1 Computation of MIC

The computation of MIC is not easy. Reshef et al proposed an algorithm based on the dynamic programming principle to approximate the normalized mutual information when an n × m grid is given, and take the maximum value over a large class of grids as the approximation of MIC. Software to compute MIC in Java and R versions was presented in [16]. Following the ideas in [16], we wrote a C++ program to compute MIC, which can be used in the Matlab environment. Our program supports parallel computation for big data sets, and is faster than that in [16].
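The authors' C++/Matlab implementation is not listed in the paper. Purely as an illustration of how MIC is computed in practice, the following sketch uses the open-source minepy package, an independent implementation of the approximation algorithm of [16]; the sample relationship, the parameters, and the random seed are our own choices, not taken from the paper.

# Illustrative only: MIC of a noisy functional relationship computed with minepy,
# an open-source implementation of the approximation algorithm of [16].
import numpy as np
from minepy import MINE

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 260)
y = np.sin(10.0 * x) + np.cos(30.0 * x) + 0.5 * rng.normal(size=260)

mine = MINE(alpha=0.6, c=15)      # default grid parameters recommended in [16]
mine.compute_score(x, y)
print("MIC(x, y) =", mine.mic())  # close to 1 for weak noise, smaller as the noise grows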


The efficiency of our program is validated via computations. Specifically, we obtain MIC1(X, Y) = MIC(X, Y) = 1 when X and Y have a strictly functional relationship, such as y = |x|^0.2, x² + y² = 1, y = sin(x) + (x − 5)^5, or other kinds of functional relationship, and when X and Y are independent, MIC1(X, Y) < MIC(X, Y). We also compute the MIC of four types of noisy functional relationships:
1. y = 3x I(x > 0) − 3x I(x ≤ 0) + k × noise (two lines + noise),
2. y = sin(30x) + 2x + k × noise (sin + line + noise),
3. y = 3x² + 4 + k × noise (parabolic + noise),
4. y = sin(10x) + cos(30x) + k × noise (sin + cos + noise),
where k is the intensity of the noise, I(x) is the indicator function (I_A(x) = 1 for x ∈ A and I_A(x) = 0 for x ∉ A), and the noise is i.i.d. ∼ N(0, 1). The results we obtain in the simulation examples show the rationality and effectiveness of our computation program (see Figure 1).

Figure 1 (color online) The performance of our procedure. Left panel: the MIC values of the four types of functions are plotted against the intensity k of the noise in the data, where noise ∼ N(0, 1); the sample size here is 260. Because MIC gives almost the same score MIC(y, f(x) + noise) whenever the distribution of the noise is unchanged, for each k we sample the noise from N(0, 1) once and get one MIC value. One can see from the panel that the MIC values of these relationships decrease as the intensity of the noise in the data increases. Right panel: the values of MIC1(X, Y) − MIC(X, Y) are plotted in three cases: X ∼ N(0, 1) and Y ∼ N(0, 1); X ∼ U(0, 1) and Y ∼ U(0, 1); and X ∼ N(0, 1) and Y ∼ U(0, 1). The sample size is 260 and the procedure is repeated 300 times. It is clear that MIC1(X, Y) < MIC(X, Y).
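As a rough, non-authoritative re-creation of the left-panel experiment, the sketch below scores the four noisy relationships for increasing noise intensity k; the x design and the range of k are our assumptions, since the paper only states the sample size of 260 and noise ∼ N(0, 1).

# Rough re-creation of the Figure 1 (left panel) experiment.
# Assumptions: x ~ U(-1, 1) and k in {0, 0.5, ..., 3}; minepy is used for MIC.
import numpy as np
from minepy import MINE

rng = np.random.default_rng(1)
n = 260
x = rng.uniform(-1.0, 1.0, n)
relationships = {
    "two lines": lambda x: 3*x*(x > 0) - 3*x*(x <= 0),
    "sin+line":  lambda x: np.sin(30*x) + 2*x,
    "parabolic": lambda x: 3*x**2 + 4,
    "sin+cos":   lambda x: np.sin(10*x) + np.cos(30*x),
}

mine = MINE(alpha=0.6, c=15)
for k in np.arange(0.0, 3.5, 0.5):
    noise = rng.normal(size=n)          # one noise sample per k, as in the paper
    scores = {}
    for name, f in relationships.items():
        mine.compute_score(x, f(x) + k * noise)
        scores[name] = round(mine.mic(), 3)
    print("k =", k, scores)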


In order to show that our computation gives MIC scores tending to 0 for statistically independent variables as the sample size grows, we compute the MIC scores of simulated examples with sample sizes from 50 to 1200. The simulation is repeated 2000 times for each sample size (see Figure 2). One can see that the median and the standard deviation approach 0 as the sample size increases, as desired. So, for large sample sizes, our program outputs robustly small MIC values for independent variables.

Figure 2 (color online) Simulation of two independent variables with the sample size ranging from 50 to 1200. Left panel: the box plot of the MIC values. Middle panel: the mean, median, standard deviation, and coefficient of variation (CV = σ/mean) decrease as the sample size increases. Right panel: the mean and the median are consistent, because the difference between them is small for large sample sizes.

2.2 Curve fitting

Curve fitting is the process of constructing a curve, or mathematical function, that best fits a series of data points, possibly subject to constraints. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Usually, the data points are collected under random noise, and the model we are interested in is as follows:
$$y = f(x) + \varepsilon, \qquad \varepsilon \sim \text{i.i.d.},$$
where f stands for the relationship between the explanatory variables X and the response variable Y, and ε is the random noise. The noise ε and the explanatory variables X are independent.

The popular model selection methods, such as AIC and BIC, balance accuracy against model complexity. Compared with BIC, AIC tends to select a more complex model for a given problem. Among the models selected by different criteria, it is natural to ask which model is better, and other methods are needed to make that decision.


In traditional time series analysis, a criterion of selecting the model whose regression residuals are closest to white noise has been proposed for model selection [3, 19]; because it is computationally cumbersome, it is not widely used, but it is helpful for the problem at hand.

The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding predictions of the response computed using the regression function. Mathematically, the residual for the i-th observation in the data set is
$$\hat{\varepsilon}_i = y_i - \hat{f}(x_i),$$
where y_i denotes the i-th response in the data set and x_i the vector of explanatory variables, each set at the corresponding values found in the i-th observation. If we rewrite the residual as
$$\hat{\varepsilon}_i = y_i - \hat{f}(x_i) = (y_i - f(x_i)) + (f(x_i) - \hat{f}(x_i)),$$
where f(x) is the true underlying relationship between Y and X, then ε̂ is an estimate of the i.i.d. random noise ε. If the model fits the data correctly, the residuals approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. If the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. In fact, any signal in the residuals indicates a deviation between f̂(x) and f(x). To check whether a fit f̂(x) is good or not, one can investigate the dependence between the residuals induced by f̂(x) and the explanatory variables. Hence, minimizing the maximal information coefficient between the residuals and the explanatory variables is a reasonable criterion, because MIC is an ideal index for capturing the association between a pair of variables. It is remarkable that the proposed criterion does not need the variance of the residuals, so the method can be applied in more general situations.

Example 2.1  We consider the model
$$y := f(x) = \sin(c_1 x) + c_2\sqrt{|x|} + \varepsilon, \qquad (2.1)$$
where c_1 = 10, c_2 = 4, and ε is the random noise. We consider noise with symmetric distributions: i.i.d. normal and i.i.d. Cauchy. The Gaussian noise used in modeling is N(0, 5²), while Cauchy noise is special because the mean and the variance of the Cauchy distribution do not exist. If we have samples (y_i, x_i), i = 1, ..., N, it is possible to obtain an estimate f̂ of the model by well-established regression methods. The question is how good the model f̂ is. As is well known, one obtains different models if different regression hypotheses or cost functions are chosen; model selection looks for an appropriate model for the given data. If the functional form is known but the parameters c_1 and c_2 are unknown, one can use a suitable statistical method to estimate them. We consider two kinds of good estimators: unbiased estimators that are asymptotically normal, and unbiased interval estimators.

Eighty samples (y_i, x_i) are generated by (2.1), with Gaussian noise N(0, 5²) (Figure 3(a)) and with Cauchy noise C(0, 1) (Figure 3(b)). Suppose we have unbiased, asymptotically normal estimators for c_1 and c_2, that is, ĉ_1 ∼ i.i.d. N(10, σ²) and ĉ_2 ∼ i.i.d. N(4, σ²). Clearly, the parameter estimates are more accurate when σ is small, and as a result the model is more accurate. For both Gaussian noise and Cauchy noise, we proceed by the following steps:


Pseudocode of the simulation in Example 2.1 (asymptotically normal estimation)

for m = 1 : 1 : 15
    σ = 0.30 m
    for i = 1 : 1 : 350
        draw (ĉ_1, ĉ_2) ∼ N((10, 4), σ² I_2)
        f̂(x_j) = sin(ĉ_1 x_j) + ĉ_2 √|x_j|,  j = 1, ..., 80
        ε̂_j = y_j − f̂(x_j), j = 1, ..., 80;   ε̂ = {ε̂_j, j = 1, 2, ..., 80}
        MIC(i) = MIC(x, ε̂)
    end
    M_m = mean{MIC(i), i = 1, 2, ..., 350}
end
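A runnable version of the above pseudocode is sketched below. It assumes the minepy package for MIC and a uniform design x on [0, 5]; the paper specifies 80 samples but not the x-grid, so the design, the random seed, and the use of minepy are our assumptions.

# Runnable sketch of the asymptotically-normal simulation above (assumptions:
# minepy for MIC, x ~ U(0, 5); switch the noise line for the Cauchy case).
import numpy as np
from minepy import MINE

rng = np.random.default_rng(2)
n = 80
x = rng.uniform(0.0, 5.0, n)                     # assumed design points
eps = 5.0 * rng.normal(size=n)                   # Gaussian noise N(0, 5^2)
# eps = rng.standard_cauchy(n)                   # ...use this instead for Cauchy C(0, 1)
y = np.sin(10.0 * x) + 4.0 * np.sqrt(np.abs(x)) + eps   # model (2.1) with c1 = 10, c2 = 4

mine = MINE(alpha=0.6, c=15)
for m in range(1, 16):
    sigma = 0.30 * m
    mics = []
    for _ in range(350):
        c1_hat, c2_hat = rng.normal([10.0, 4.0], sigma)  # (c1_hat, c2_hat) ~ N((10, 4), sigma^2 I2)
        resid = y - (np.sin(c1_hat * x) + c2_hat * np.sqrt(np.abs(x)))
        mine.compute_score(x, resid)
        mics.append(mine.mic())
    print(f"sigma = {sigma:.2f}   mean MIC(x, residuals) = {np.mean(mics):.3f}")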

The curves of m versus M_m for Gaussian noise and Cauchy noise are illustrated in Figure 3(c). We can see that the mean maximal information coefficient between the residuals ε̂ and the variable x is monotone with respect to the value of σ, as expected.

Figure 3 (color online) Validation of the model selection criterion of minimizing the MIC between the residuals and the explanatory variable for a simple model with two parameters. Panel (a): simulation data points generated by model (2.1) with Gaussian noise N(0, 5²). Panel (b): data points generated by the same model with Cauchy noise C(0, 1); the sample points are messier. Panel (c): plot of the mean MIC(x, ε̂) for the unbiased asymptotically normal estimation of the parameters. Panel (d): plot of the mean MIC(x, ε̂) for the unbiased interval estimation of the parameters. In both cases, the smaller the σ value, the smaller the value of MIC(x, ε̂) obtained under Gaussian noise and under Cauchy noise, which suggests that our method tends to select a good model.

Now, suppose that we have unbiased interval estimators for c_1 and c_2: ĉ_1 ∈ [10 − σ, 10 + σ] and ĉ_2 ∈ [4 − σ, 4 + σ]. For both Gaussian noise and Cauchy noise, we proceed by the following steps:


Pseudocode of the simulation in Example 2.1 (interval estimation)

for m = 1 : 1 : 15
    σ = 0.15 m
    for i = 1 : 1 : 250
        draw (ĉ_1, ĉ_2) ∼ U([10 − σ, 10 + σ] × [4 − σ, 4 + σ])
        f̂(x_j) = sin(ĉ_1 x_j) + ĉ_2 √|x_j|,  j = 1, ..., 80
        ε̂_j = y_j − f̂(x_j), j = 1, ..., 80;   ε̂ = {ε̂_j, j = 1, 2, ..., 80}
        MIC(i) = MIC(x, ε̂)
    end
    M_m = mean{MIC(i), i = 1, 2, ..., 250}
end

The curves of m versus M_m for Gaussian noise and Cauchy noise are illustrated in Figure 3(d). We can see that the mean maximal information coefficient between the residuals ε̂ and the variable x is again monotone with respect to the value of σ.

Example 2.2  We consider a more complex model selection problem:
$$y := f(x) = c_1\sin(c_2 x) + c_3 x^2 + c_4\log(x) + \varepsilon, \qquad (2.2)$$
where c = [c_1, c_2, c_3, c_4] = [5, 3, 1, 1], and ε is the random noise. We again consider noise with symmetric distributions: i.i.d. normal and i.i.d. Cauchy. We sample (y_i, x_i), i = 1, ..., 50, according to (2.2), with x = 0.1 : 2π and with different noise (Gaussian noise N(0, 5²), Figure 4(a); Cauchy noise C(0, 1), Figure 4(b)). We consider only the unbiased estimator that is asymptotically normal. For both Gaussian noise and Cauchy noise, we proceed by the following steps:

Pseudocode of the simulation in Example 2.2 (asymptotically normal estimation)

for m = 1 : 1 : 20
    σ = 0.01 m
    for i = 1 : 1 : k   (k = 350 for Gaussian noise, k = 500 for Cauchy noise)
        draw (ĉ_1, ĉ_2, ĉ_3, ĉ_4) ∼ N((c_1, c_2, c_3, c_4), σ² I_4)
        f̂(x_j) = ĉ_1 sin(ĉ_2 x_j) + ĉ_3 x_j^2 + ĉ_4 log(x_j),  j = 1, ..., 50
        ε̂_j = y_j − f̂(x_j), j = 1, ..., 50;   ε̂ = {ε̂_j, j = 1, 2, ..., 50}
        MIC(i) = MIC(x, ε̂)
    end
    M_m = mean{MIC(i), i = 1, 2, ..., k}
end

The curves of m versus M_m for Gaussian noise and Cauchy noise are illustrated in Figure 4(c). One can see that the monotonicity of the mean maximal information coefficient between the residuals ε̂ and the variable x with respect to the value of σ holds even for this more complicated model. These examples suggest that our method favors selecting the accurate model.


Remark  Traditional methods of parameter estimation cannot deal with a model whose noise follows a Cauchy distribution. We can see from the above examples that our method is robust under the perturbation of complex noise.

Figure 4 (color online) Validation of the model selection criterion of minimizing the MIC between the residuals and the explanatory variable for a complex model with four parameters. Panel (a): simulation data points generated by model (2.2) with Gaussian noise N(0, 5²). Panel (b): simulation data points generated by the same model with Cauchy noise C(0, 1). Panel (c): the mean MIC(x, ε̂) for the unbiased asymptotically normal estimation of the parameters. For both Gaussian noise and Cauchy noise, the smaller the σ value, the smaller the value of MIC(x, ε̂) obtained, which suggests that our method tends to select a good model.
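The examples above perturb the parameters of the true functional form. The criterion can equally well be used to compare structurally different candidate fits; the sketch below is our illustration rather than an experiment from the paper, and it scores polynomial fits of several degrees by the MIC between x and their residuals (assuming the minepy package), so that the underfitting linear model is exposed by the structure left in its residuals.

# Illustration (not from the paper): using MIC(x, residuals) to compare candidate fits.
# The true curve is quadratic, so the degree-1 fit leaves visible structure in its
# residuals and receives a clearly larger score than the degree >= 2 fits.
import numpy as np
from minepy import MINE

rng = np.random.default_rng(3)
x = rng.uniform(-2.0, 2.0, 200)
y = 3.0 * x**2 + 4.0 + rng.normal(size=200)    # Gaussian noise; heavier-tailed noise could be substituted

mine = MINE(alpha=0.6, c=15)
for degree in (1, 2, 3, 5):
    coeffs = np.polyfit(x, y, degree)           # least-squares fit of the candidate polynomial
    resid = y - np.polyval(coeffs, x)
    mine.compute_score(x, resid)
    print(f"degree {degree}: MIC(x, residuals) = {mine.mic():.3f}")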

3 Partial Maximal Information Coefficient and Variable Selection

As Terry Speed said in his comment on the maximal information coefficient [20], "A very important extension of the linear correlation ρ(X, Y) between a pair of variables X and Y is the partial (linear) correlation ρ(X, Y|Z) between X and Y while a third variable, Z, is held at some value. In the linear world, the magnitude of ρ(X, Y|Z) does not depend on the value at which Z is held; in the nonlinear world, it may, and that could be very interesting. Thus, we need extensions of MIC(X, Y) to MIC(X, Y|Z)." We try to give the definition of the partial maximal information coefficient in a natural way.

3.1 Partial maximal information coefficient

Partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. The partial correlation coefficient between X and Y given the controlling variable(s) Z is the correlation coefficient between the residuals R_X and R_Y resulting from the linear regression of X and Y on Z, which is given by
$$\rho(X, Y|Z) = \rho(X - f(Z),\ Y - g(Z)), \qquad (3.1)$$
where f(Z) and g(Z) are linear regression functions of X and Y with respect to the controlling random variable(s) Z.


It is natural to introduce a partial maximal information coefficient (PMIC) between two random variables with the effect of a set of controlling random variables removed, but PMIC is more subtle than partial correlation, because it is more difficult to remove nonlinear associations between the related variables and the controlling variable(s). Rewriting the partial correlation as
$$\rho(X, Y|Z) = \rho(X, Y|H(Z)) = \rho(X - f(Z),\ Y - g(Z)), \qquad (3.2)$$
where H(Z) is the hypothesis space containing all possible linear regression functions generated by Z, and f(Z) and g(Z) are the linear regression functions, we find that the partial correlation is obtained in three steps:
(a) selecting a hypothesis space induced by the controlling variable(s) Z, namely the linear space spanned by Z;
(b) obtaining the projection (regression) functions of X and Y in the selected hypothesis space H(Z), and the corresponding residuals;
(c) computing the partial correlation coefficient, which is the correlation coefficient between the residuals from step (b).
This sheds some light on the definition of partial MIC.

Definition 3.1  The partial maximal information coefficient (PMIC) between X and Y, given a hypothesis space H(Z) induced by the controlling variable(s) Z, is the maximal information coefficient between the residuals R_X and R_Y resulting from the regression of X and Y on Z in H(Z):
$$\mathrm{PMIC}(X, Y|H(Z)) = \mathrm{MIC}(R_X, R_Y) = \mathrm{MIC}(X - f(Z),\ Y - g(Z)), \qquad (3.3)$$
where f(Z) and g(Z) are the regression functions of X and Y on the controlling random variable(s) Z in the hypothesis space H(Z).

Remark 3.1
• Denoting the PMIC between X and Y given Z by PMIC(X, Y|H(Z)) keeps in mind that PMIC is not only related to the controlling variable itself, but is also closely related to the selection of the hypothesis space H(Z).
• If X and Z are independent, it is sufficient to obtain the regression function g(Z) of Y on Z in H(Z):
$$\mathrm{MIC}(X, Y|H(Z)) = \mathrm{MIC}(X, Y - g(Z)). \qquad (3.4)$$

A further step leads to a general definition of partial associations.

Definition 3.2  The partial association between X and Y, given the hypothesis space H(Z) induced by the controlling random variable(s) Z, is given by
$$PD(X, Y|H(Z)) = D(X, Y|H(Z)) = D(X - f(Z),\ Y - g(Z)), \qquad (3.5)$$
where D(X, Y) is a dependence measure, and f(Z) and g(Z) are the regression functions of X and Y on the controlling random variable(s) Z in the hypothesis space H(Z).

Remark 3.2
• If D(X, Y) is the Pearson correlation coefficient and H(Z) is the linear space spanned by Z, then PD(X, Y|H(Z)) is the partial correlation. Similarly, if D(X, Y) is MIC(X, Y), then it is the partial MIC, MIC(X, Y|H(Z)).
• If the regression functions f(Z) and g(Z) of X and Y are obtained in different hypothesis spaces induced by Z, then
$$PD(X, Y|H(Z)) = D(X, Y|H_X(Z), H_Y(Z)). \qquad (3.6)$$


• If X and Z are independent, then
$$PD(X, Y|H(Z)) = D(X, Y|H(Z)) = D(X, Y - g(Z)). \qquad (3.7)$$

The selection of the hypothesis space H(Z) is, according to the definition of PMIC, a key step in obtaining a partial MIC. If H(Z) is not selected correctly, PMIC will not work well. The following are some candidate hypothesis spaces [2].

Hypothesis space 1 (Linear function space)
$$H(Z) = \{\mathrm{span}(Z)\}. \qquad (3.8)$$

Hypothesis space 2 (Finite dimensional function space) This generalizes the previous example:
$$H(Z) = \Big\{f(Z)\ \Big|\ f(Z) = \sum_{i=0}^{N} a_i\varphi_i(Z),\ a_i \in \mathbb{R}\Big\}, \qquad (3.9)$$
where {φ_1(Z), φ_2(Z), ..., φ_N(Z)} is the vector of basis functions of the space H(Z). If we set φ_i(Z) = Z^i, then H(Z) is the finite dimensional polynomial function space, the so-called polynomial function space of dimension N.

Hypothesis space 3 (Reproducing Kernel Hilbert Space) Let X be a compact metric space and K : X × X → R be a Mercer kernel; then, by Mercer's theorem, there is a Hilbert inner product space H_K with inner product ⟨x, y⟩ = K(x, y), and H_K is called the Reproducing Kernel Hilbert Space (RKHS). Some common Mercer kernels are the spline kernel, the box spline kernel, the Gaussian kernel, and the inverse multiquadric kernel.

Since, by the Stone-Weierstrass theorem, continuous functions can be approximated by polynomials, the finite dimensional polynomial function space is recommended as the first candidate for its computational convenience.
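As an illustration of Definition 3.1 with the polynomial hypothesis space H_n(Z) of Hypothesis space 2, the following sketch regresses X and Y on Z by polynomial least squares and takes the MIC of the residuals; the minepy package, the function name pmic, and the toy data are our own choices, not part of the paper.

# Minimal PMIC sketch with the polynomial hypothesis space H_n(Z) (names and data
# are illustrative; minepy is assumed for the MIC computation).
import numpy as np
from minepy import MINE

def pmic(x, y, z, degree=3, alpha=0.6, c=15):
    """MIC between the residuals of polynomial regressions of x and y on z."""
    rx = x - np.polyval(np.polyfit(z, x, degree), z)   # remove the H_n(z) part of x
    ry = y - np.polyval(np.polyfit(z, y, degree), z)   # remove the H_n(z) part of y
    mine = MINE(alpha=alpha, c=c)
    mine.compute_score(rx, ry)
    return mine.mic()

# Toy check: X and Y share the controlling variable Z but are otherwise unrelated,
# so MIC(X, Y) is inflated while PMIC(X, Y | H_3(Z)) should be small.
rng = np.random.default_rng(4)
z = rng.normal(size=260)
x = z**2 + 0.3 * rng.normal(size=260)
y = z**2 + 0.3 * rng.normal(size=260)

mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)
print("MIC(X, Y)           =", round(mine.mic(), 3))
print("PMIC(X, Y | H_3(Z)) =", round(pmic(x, y, z, degree=3), 3))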

3.2 Variable selection

Variable selection is very important in constructing mathematical models. Correlation and partial correlation are widely used in variable selection, especially in pattern recognition and in brain network analysis of fMRI data. Similarly, MIC and partial MIC can be helpful in variable selection, and it seems to us that they are more reliable than correlation and partial correlation, because they capture much more of the information between the explanatory variables and the response variables. In analogy with correlation, MIC and partial MIC play their role in variable selection in three steps:
• Compute the MIC values between each pair of explanatory variable and response variable;
• Mark the significant associations: an association is significant if the corresponding MIC is greater than a threshold;
• Reveal obscured associations: use partial MIC to recover obscured associations.
In what follows, we illustrate the variable selection process with two simulation examples. We always sample the explanatory variables and the noise i.i.d. from N(0, 1) with length 260.

Example 3.1  Consider
$$X = 0.01\sin(W) + Z^2 + \varepsilon_1, \qquad (3.10)$$
$$Y = \cos(U) + \varepsilon_2, \qquad (3.11)$$
where ε_1 ∼ N(0, 1) and ε_2 ∼ N(0, 1).
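A sketch of how the samples for this example and the pairwise MIC table below can be produced is given here; it assumes the minepy package, and the exact values will differ from Table 3.1 because the random samples differ.

# Sketch (not the authors' code): generate Example 3.1 data and the pairwise MIC table.
# Explanatory variables and noise are i.i.d. N(0, 1) of length 260, as stated above.
import numpy as np
from minepy import MINE

rng = np.random.default_rng(5)
n = 260
W, Z, U = rng.normal(size=(3, n))
e1, e2 = rng.normal(size=(2, n))
X = 0.01 * np.sin(W) + Z**2 + e1      # model (3.10)
Y = np.cos(U) + e2                    # model (3.11)

mine = MINE(alpha=0.6, c=15)
variables = {"X": X, "Y": Y, "Z": Z, "W": W, "U": U}
for resp in ("X", "Y"):
    row = {}
    for name, v in variables.items():
        mine.compute_score(variables[resp], v)
        row[name] = round(mine.mic(), 3)
    print(resp, row)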


To get a reasonable explanation of the model using the information from the samples in hand, we proceed in three steps.

1) Compute the MIC values between each pair of explanatory variable and response variable, and between the two response variables. The results are collected in Table 3.1.

Table 3.1  MICs in Example 3.1

MIC     X        Y        Z        W        U
X       1        0.162    0.9538   0.2766   0.1843
Y       0.162    1        0.1403   0.1864   0.7828

2) Mark the significant associations. Let MIC_0 = 0.6 (it may be different in other cases) be the threshold. According to Table 3.1, MIC(X, Z) = 0.9538 and MIC(Y, U) = 0.7828 are two significant associations. So X should be a function of Z, and Y should be a function of U.

3) Reveal the obscured associations. Although the information of Z plays an important role in X, we do not know whether X contains information from the explanatory variables W and U, because the strong association between X and Z may obscure other signals. To recover possible obscured associations, we compute the partial MICs MIC(X, U|H_n(Z)) and MIC(X, W|H_n(Z)), where the hypothesis space H_n(Z) consists of polynomials of Z of degree no greater than n. We take n = 1, ..., 10 to obtain robust partial MIC values. The computational results are shown in Figure 5(a). From the plot, we can see that the association between X and W becomes significant once the effect of Z on X is reduced, while the association between X and U remains small. Next, we check whether the strong association between Y and U obscures other associations. In the same way, we compute MIC(Y, Z|H_n(U)) and MIC(Y, W|H_n(U)) for n = 1, ..., 10; the results are plotted in Figure 5(a). One can see that both MIC(Y, Z|H_n(U)) and MIC(Y, W|H_n(U)) are small, which suggests that the association between Y and Z and the association between Y and W are both weak. Therefore, we conclude that the response variable X is a function of Z and W, and U is not an explanatory variable of X; Y is a function of U, and neither Z nor W is suitable to be selected as an explanatory variable of Y.

Example 3.2  We consider the following model:
$$X = 0.01\sin(W) + 10\tan(Z) + \varepsilon_1, \qquad (3.12)$$
$$Y = \cos(U) + Z^{0.6} + \varepsilon_2, \qquad (3.13)$$
where ε_1 ∼ N(0, 1) and ε_2 ∼ N(0, 1).

1) Compute the MIC values between each pair of explanatory variable and response variable, and between the two response variables. The results are collected in Table 3.2.

Table 3.2  MICs in Example 3.2

MIC     X        Y        Z        W        U
X       1        0.6570   0.9692   0.2668   0.1779
Y       0.6570   1        0.6365   0.1704   0.3057


2) Mark the significant associations. Let MIC_0 = 0.6 be the threshold. According to Table 3.2, MIC(X, Z) = 0.9692 and MIC(Y, Z) = 0.6365 are two significant associations. So X should be a function of Z, and Y should be a function of Z.

3) Recover the obscured associations. Let H_n(Z) (n ≤ 10) be the hypothesis space consisting of polynomials of Z of degree no greater than n. For n = 1, ..., 10, we compute MIC(X, W|H_n(Z)), MIC(X, U|H_n(Z)), MIC(Y, W|H_n(Z)), and MIC(Y, U|H_n(Z)). The results are illustrated in Figure 5(b). One can see that X is also a function of W and that Y is a function of U. Therefore, X should be a function of the variables Z and W, and Y a function of Z and U.

From Table 3.2, we can see that the MIC between the two response variables X and Y is significant. By Figure 5(b), we know that the reason may be that both X and Y have the common explanatory variable Z. Denote by MIC(X, Y|H_n(Z), H_m(Z)) the partial MIC in which the regression function of X on Z belongs to H_n(Z) and the regression function of Y on Z belongs to H_m(Z). After further computations of the partial MIC (Figure 5(c)), one can see that MIC(X, Y|H_n(Z), H_m(Z)) ≤ 0.3 for most n and m. So the association between X and Y mostly comes from Z, and the explanatory variables W and U contribute almost nothing to the association between X and Y.

Figure 5 (color online) Detecting obscured associations. Panel (a): the partial MICs under different hypothesis spaces for Example 3.1; the association between X and W is large after the effect of Z is removed. Panel (b): the partial MICs under different hypothesis spaces for Example 3.2; the associations between X and W and between Y and U are detected after the effect of Z is removed. Panel (c): the partial maximal information coefficient between the response variables; MIC(X, Y|H_n(Z), H_m(Z)) ≤ 0.3 for most 1 ≤ n, m ≤ 10.

4 Conclusion

The maximal information coefficient (MIC) captures a wide range of associations and gives similar scores to equally noisy relationships of different types [16]. On the basis of the dynamic programming ideas in [16], we wrote a C++ program to compute MIC that can be used in the Matlab environment.


The efficiency of our program is validated via computations. The residuals of a model play an important role in model validation and model selection. We proposed a model selection method that minimizes the maximal information coefficient between the model residuals and the response variable, and two simulation examples are presented to show its validity. In analogy with the partial correlation coefficient, we introduce the partial maximal information coefficient (PMIC) to capture the association between two variables with the effect of a set of controlling random variables removed in a prescribed hypothesis space. PMIC detects obscured associations, which is useful for variable selection. Our programs for the computation of MIC and partial MIC are available upon request.

References
[1] Claeskens Gerda, Hjort Nils Lid. Model selection and model averaging. Cambridge Books, 1993
[2] Cook R Dennis, Weisberg Sanford. Residuals and influence in regression. Volume 5. New York: Chapman and Hall, 1982
[3] Cucker Felipe, Zhou Ding Xuan. Learning theory: an approximation theory viewpoint. Volume 24. Cambridge University Press, 2007
[4] Das Jishnu, Mohammed Jaaved, Yu Haiyuan. Genome-scale analysis of interaction dynamics reveals organization of biological networks. Bioinformatics, 2012, 28(14): 1873–1878
[5] Fan Jianqing, Lv Jinchi. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 2010, 20(1): 101
[6] Gelfand I M, Jaglom A M, Kolmogoroff A N. Zur allgemeinen Definition der Information. Arbeiten zur Informationstheorie II. Berlin: VEB Deutscher Verlag der Wissenschaften, 1958
[7] Gorfine Malka, Heller Ruth, Heller Yair. Comment on "Detecting novel associations in large data sets". Unpublished manuscript. http://iew3.technion.ac.il/~gorfinm/files/science6.pdf, 2012
[8] Hastie Trevor, Tibshirani Robert, Friedman Jerome H. The elements of statistical learning. Volume 1. New York: Springer, 2001
[9] Heller Ruth, Heller Yair, Gorfine Malka. A consistent multivariate test of association based on ranks of distances. arXiv preprint arXiv:1201.3522, 2012
[10] Karpinets Tatiana V, Park Byung H, Uberbacher Edward C. Analyzing large biological datasets with association networks. Nucleic Acids Research, 2012, 40(17): e131
[11] Kolmogoroff Alexander. Grundbegriffe der Wahrscheinlichkeitsrechnung. Volume 3. Berlin: Springer, 1933
[12] Konishi Sadanori, Kitagawa Genshiro. Information criteria and statistical modeling. Springer Verlag, 2008
[13] Linfoot E H. An informational measure of correlation. Information and Control, 1957, 1(1): 85–89
[14] NIST/SEMATECH. e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/, 2013
[15] Rényi Alfréd. On measures of dependence. Acta Mathematica Hungarica, 1959, 10(3): 441–451
[16] Reshef David N, Reshef Yakir A, Finucane Hilary K, et al. Detecting novel associations in large data sets. Science, 2011, 334(6062): 1518–1524
[17] Shannon Claude Elwood, Weaver Warren. A mathematical theory of communication. Univ of Illinois Press, 1949
[18] Shi Peide, Tsai Chih-Ling. Regression model selection: a residual likelihood approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2002, 64(2): 237–252
[19] Shumway Robert H, Stoffer David S. Time series analysis and its applications. Springer Science+Business Media, 2010
[20] Speed Terry. A correlation for the 21st century. Science, 2011, 334(6062): 1502–1503
[21] Steuer Ralf, Kurths Jürgen, Daub Carsten O, Weise Janko, Selbig Joachim. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 2002, 18(suppl 2): S231–S240
[22] Székely Gábor J, Rizzo Maria L. Brownian distance covariance. The Annals of Applied Statistics, 2009: 1236–1265
[23] Zhang Xiujun, Liu Keqin, Liu Zhi-Ping, et al. Narromi: a noise and redundancy reduction technique improves accuracy of gene regulatory network inference. Bioinformatics, 2013, 29(1): 106–113