Nearest neighbor selection for iteratively kNN imputation


The Journal of Systems and Software 85 (2012) 2541–2552


Shichao Zhang a,b,c

a College of Computer Science and Information Technology, Guangxi Normal University, Guilin, China
b Institute of Computing Technology, The Chinese Academy of Sciences, Beijing, China
c QUIS, Faculty of Engineering and Information Technology, University of Technology Sydney, Australia

Article info

Article history: Received 14 December 2011; Received in revised form 19 May 2012; Accepted 24 May 2012; Available online 1 June 2012.

Keywords: Missing data; k nearest neighbors; kNN imputation

Abstract

Existing kNN imputation methods for dealing with missing data are designed according to Minkowski distance or its variants, and have been shown to be generally efficient for numerical variables (features, or attributes). To deal with heterogeneous (i.e., mixed-attribute) data, we propose a novel kNN (k nearest neighbor) imputation method that iteratively imputes missing data, named GkNN (gray kNN) imputation. GkNN selects the k nearest neighbors for each missing datum by calculating the gray distance between the missing datum and all the training data, rather than using traditional distance metrics such as Euclidean distance. Such a distance metric can deal with both numerical and categorical attributes. To achieve better effectiveness, GkNN regards all imputed instances (i.e., instances whose missing data have been imputed) as observed data, which, together with the complete instances (instances without missing values), are used to iteratively impute the remaining missing data. We experimentally evaluate the proposed approach and demonstrate that the gray distance is much better than the Minkowski distance at both capturing the proximity relationship (or nearness) of two instances and dealing with mixed attributes. Moreover, experimental results also show that the GkNN algorithm is much more efficient than existing kNN imputation methods.

1. Introduction

kNN imputation is designed to find k nearest neighbors for a missing datum (incomplete instance) from all complete instances (without missing values) in a given dataset, and then to fill in the missing datum with the most frequent value occurring among the neighbors if the target feature (or attribute) is categorical, referred to as the majority rule, or with the mean of the neighbors if the target feature is numerical, referred to as the mean rule. Due to its simplicity, ease of understanding, and relatively high accuracy, the k-nearest neighbor (kNN) approach has successfully been used in real data processing applications, such as surveys conducted at Statistics Canada, the U.S. Bureau of Labor Statistics, and the U.S. Census Bureau (Chen and Shao, 2000). kNN imputation is a lazy, instance-based estimation method. Different from model-based algorithms (which build estimators from all complete instances and then fill in a missing datum with the estimators), kNN needs to search all complete instances and select the k instances most relevant to a given missing datum. Certainly, model-based techniques can be used within kNN imputation once the k nearest neighbors are selected. While kNN imputation with the majority/mean rule is simple and effective in general, there are still many efforts focusing on improving its performance. We categorize existing


literature into three research directions and briefly review them as follows. The first research direction concerns the neighbor-selection part of kNN imputation algorithms; almost all improvement efforts belong to this direction. It is essentially about selecting the k nearest neighbors for a given missing datum, because different distance functions or weighting techniques (or both) only generate different sets of k nearest neighbors. Whatever distance functions or weighting techniques are selected, the goal is to find a mechanism that highlights some attributes and decreases the impact of the rest on the missing datum. This resembles a mapping that transforms the original space into a new space more suitable for the imputation task. This becomes clearer when we apply the λ-cutting rule to such an algorithm. With the λ-cutting rule, kNN imputation is carried out either on only those complete instances selected by a distance-weighted rule in which the attributes are stretched out or drawn back, or on a subspace consisting of attributes whose impact values are equal to or greater than λ, or on a combination of the two. A recent selection of the nearest neighbors is the SN (Shelly Neighbors) method, which uses only those neighbors, drawn from the k nearest neighbors, that form a shell surrounding the missing datum (Zhang, 2011). The SN approach is actually a quadratic selection of the k nearest neighbors. The second research direction concerns the computation part of kNN imputation algorithms, i.e., the estimators of missing data. While there are some popular methods, such as the (weighted) majority rule, the mean rule and the Bayesian rule, this direction has


actually advanced to a mature stage. However, new estimators of missing data still need to be constructed for applications such as cost-sensitive learning and imbalanced learning. Recently, a kNN-CF imputation method was proposed for addressing the issue of imbalanced data processing (Zhang, 2010). All the estimators of missing data can be viewed as an embedded part of single-imputation, multiple-imputation, or iterative-imputation algorithms. The last research direction concerns the output of kNN imputation algorithms. This direction usually focuses on assessing the accuracy of imputed data and/or the performance of imputation algorithms. While extant kNN imputation algorithms utilize evaluation techniques designed for other imputation algorithms, Zhang (2008) advocated a parimputation (partial imputation) approach that imputes only those incomplete instances (missing data) for which there are sufficient complete instances in a small neighborhood; the output then includes only those imputed instances for future data processing applications.

However, existing kNN imputation of missing data is based on Minkowski distance or its variants. These distance functions are often efficient for numerical variables (features, or attributes) but do not perform well for categorical ones. To deal with heterogeneous (mixed-attribute) data, in this paper we study nearest neighbor selection for iterative kNN imputation of missing data, named GkNN (gray kNN) imputation. It measures the similarity between a missing datum and its nearest neighbors with the gray distance. Gray relational analysis (GRA) is more appropriate for capturing the 'nearness' (proximity, or relationship) between two instances than Minkowski distances or other metrics (Huang and Lee, 2002). For efficiency, GkNN utilizes all the imputed instances as observed data, together with the complete instances (instances without missing values), in subsequent imputation iterations. Further, the GkNN algorithm is extended for imputing heterogeneous datasets, which significantly widens its applicable range. We experimentally evaluate the proposed approach and demonstrate that the gray distance is much better than the Minkowski distance at both capturing the proximity relationship (or nearness) of two instances and dealing with mixed attributes, and that the GkNN algorithm is much more efficient than existing kNN imputation methods.

The rest of this paper is organized as follows. In Section 2 we briefly review related work on missing value imputation. The GkNN imputation algorithm is designed in Section 3. In Section 4, a simple example is presented to illustrate the use of the GkNN imputation algorithm. Section 5 empirically evaluates the performance of the proposed method compared with typical kNN imputation methods. We conclude this paper in Section 6.

2. Related work

Real data often contain missing values, which are generated by, for example, errors, equipment failures, and changes of plans. These missing data can lead to biased results. For example, higher incomes are more likely to be missing in the reported incomes from the Current Population Survey of the United States (Mistiaen and Ravallion, 2003), which generally causes the average income to be underestimated. Therefore, missing data imputation is critical for inducing quality models in machine learning and data mining applications. Fortunately, much research on missing data treatment has been reported in the area of intelligent systems, such as Batista and Monard (2003), Little and Rubin (2002), Liu and Zhang (2012), Pearson (2005, 2006), Schafer and Graham (2002), Troyanskaya et al. (2001), Zhang (2008, 2011), Zhang et al. (2005, 2006, 2007) and Zhang (2012). Pearson (2006) categorized these methods into four different strategies: deletion, single imputation, multiple imputation, and iterative imputation. Recently, Zhang (2008, 2011) proposed a

new imputation direction, named the parimputation strategy, in which a missing datum is imputed if and only if there are sufficient complete instances in a small neighborhood of the missing datum.

Since extant data mining algorithms are designed for complete instances (without missing values), deleting incomplete instances (with missing values) is a natural strategy. The deletion strategy omits the incomplete instances and learns models only from the complete ones, which often leaves an insufficient dataset. When the ratio of missing data is large or the distribution of missing data is non-random, the deletion strategy easily results in serious bias and erroneous conclusions (Quinlan, 1989). For example, Little and Rubin (2002) compared two cases: one based only on complete data records, and one based on all records that are sufficiently complete. In the former scenario, the deletion strategy is very fast but otherwise undesirable. Moreover, they considered imputation of missing values to be a common alternative. Therefore, in comparison with other methods, missing data imputation is the commonly used strategy.

The single imputation strategy provides a single estimate for each missing datum. Popular single imputation methods include hot deck imputation and mean imputation. Hot deck imputation replaces missing values with responses from other records that satisfy certain matching conditions, for example, estimating missing values from observed values in another dataset with the same or a similar background. Mean imputation estimates missing values by the mean of appropriately selected "similar" samples. Existing single imputation methods usually handle numerical or discrete attributes separately. In fact, handling missing values of numerical attributes is very popular, as in Wang and Rao (2002) and Zhang et al. (2006). However, the aforementioned methods cannot effectively handle categorical variables. Moreover, in some methods, such as C4.5 (Quinlan, 1989, 1993), which was designed for handling discrete missing values, numerical attributes are discretized; this process may lose their true characteristics (Qin et al., 2007). A few works are designed to deal with both numerical and discrete missing values. For example, Huang and Lee (2004) proposed a gray-based nearest neighbor approach to predict missing attribute values that can handle numerical and symbolic attributes, but the method ignores the important support of the class label and does not make the best use of all the observed data. Zhang et al. (2007) argued that the observed information in instances with missing values is useful for imputing missing values, and the present paper is an extension of that work. A limitation of the single imputation strategy is that it tends to artificially reduce the variability of characterizations of the imputed dataset. With single-imputation techniques, missing values in a variable are imputed by a single plausible estimated value. However, such a single imputation cannot provide valid standard errors and confidence intervals, since it ignores the implicit uncertainty: the imputed values are not the real values, as we never know the true values of the missing data. Furthermore, imputing a missing value with a single value does not capture the sampling variability of the imputed value, or the uncertainty associated with the model used for imputation.
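As a concrete illustration of the single imputation strategy just described (column-mean imputation for numerical attributes and a most-frequent-value fill for categorical ones), a minimal Python sketch might look as follows; the function name and pandas-based interface are illustrative and not taken from the paper.

```python
import pandas as pd

def single_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Single imputation: fill each missing cell with the column mean
    (numerical attributes) or the most frequent value (categorical attributes)."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].mean())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```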
The alternatives to single imputation are to fill in the missing values with multiple imputation or iterative imputation, such as multiple imputation (MI) (Little and Rubin, 2002) and the EM algorithm (Dempster et al., 1977). In the multiple imputation strategy, several (typically fewer than 20) different imputed datasets are generated, the same analysis is applied to each, and a set of results (mean characterizations or variability estimates, i.e., standard deviations) is computed. For example, in multivariate analysis, MI methods provide good estimates of the sample standard errors because several analyses are combined. However, the MI method requires that the data follow the missing mechanism (i.e., missing


completely at random, MCAR) in order to generate a general-purpose imputation. Both the deletion strategy and the single imputation strategy may be regarded as filters according to John et al. (1994), as the imputation methods yield modified datasets that can be analyzed by standard methods without modification. The multiple imputation strategy is somewhat more involved but still does not require modification of the underlying analysis procedures; moreover, it is non-iterative in nature. In contrast, iterative imputation is analogous to the class of wrappers and can be better developed for missing data (Myllymaki, 2001). The best-known iterative method is the expectation–maximization (EM) algorithm, which formalizes the ad hoc strategy (Little and Rubin, 2002) as follows: (1) impute the missing data values; (2) estimate the data model parameters from the imputed values; (3) re-estimate the missing data values using these estimated model parameters; and (4) repeat the former steps until the algorithm converges. The EM algorithm is very popular and has been widely applied to impute missing data in real applications (Dempster et al., 1977; Little and Rubin, 2002). The EM algorithm can easily be applied when the model belongs to an exponential family, but it has a slow convergence rate. Other iterative imputation algorithms (such as Brása and Menezes, 2007; Kim et al., 2004) are based on the framework of Caruana (2001).

Compared with traditional kNN imputation methods, our GkNN algorithm has three features: (1) Different from single imputation, GkNN, an EM-like iterative imputation method, can overcome the limitations of single imputation methods; that is, it can provide valid standard errors and confidence intervals, and it accounts for the implicit uncertainty. Meanwhile, we extend it to handle both numerical and discrete missing values based on the framework of Zhu et al. (2011), so it differs from the kernel methods (Wang and Rao, 2002; Zhang et al., 2006) and the C4.5 algorithms (Quinlan, 1989, 1993). (2) The GkNN algorithm makes the best use of all the observed information, including the incomplete instances (instances with missing values); existing algorithms (e.g., the single imputation and multiple imputation strategies) only use the observed instances without missing values. (3) Nonparametric imputation is efficient when we do not know the model of the data exactly (Zhang et al., 2006); in fact, we usually have no prior knowledge about the data. Although the GkNN algorithm is EM-like, it is a nonparametric imputation method: it differs from the EM algorithm, in which both the E and M steps depend on parametric models, as well as from the parametric methods used with the multiple imputation strategy.

3. GkNN algorithm

Let X be a d-dimensional vector of factors and Y a response variable (target feature) influenced by X. In practice, one often obtains a random sample (of size n) of incomplete data associated with a population (X, Y, δ):

(X_i, Y_i, \delta_i), \quad i = 1, 2, \ldots, n,

where all the X_i's are observed, and δ_i = 0 if Y_i is missing and δ_i = 1 otherwise. In this setting, we can assume that the (X_i, Y_i)'s satisfy the following model:

Y_i = m(X_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n   (1)

where m(·) is an unknown function and the unobserved ε_i (with population ε) are independent and identically distributed (iid) random errors with mean 0 and unknown finite variance σ², independent of the iid random variables X_i.

To avoid estimating m, the kNN imputation method is used, which has been shown experimentally to be more efficient than other existing imputation methods. Based on kNN imputation, this section presents the GkNN algorithm in detail. We present the kNN estimator and the GRA (gray relational analysis) based nearest neighbor selection in Sections 3.1 and 3.2, respectively. Section 3.3 presents a data transformation for dealing with mixed-attribute datasets. The GkNN algorithm is designed in Section 3.4. The error ratio and complexity are analyzed in Section 3.5.

3.1. kNN estimator for missing data imputation

The NN method is one of the hot deck techniques designed to compensate for nonresponse (missing values) in sample surveys. It should be one of the first choices for missing data imputation when there is little or no prior knowledge about the distribution of the data. Since it was introduced by Skellam (1952), the NN imputation method has been successfully applied to a broad range of applications, such as surveys conducted at Statistics Canada, the U.S. Bureau of Labor Statistics, and the U.S. Census Bureau (Chen and Shao, 2000). Because the NN method often leads to over-fitting, it was extended to the kNN method (de Andrade and Hruschka, 2009). With the kNN method, a categorical missing value is imputed with the majority value among its k nearest neighbors, and the average (mean) of the k nearest neighbors is taken as the prediction for a numerical missing value; this is called the majority/mean rule. It is formally defined as follows. Given (X, Y, 0) and the set of its k nearest neighbors D_k = {(X_j, Y_j, 1) | j = 1, 2, ..., k}, the kNN estimator is defined as

Y = \begin{cases} \arg\max_{v} \sum_{(X_j, Y_j, 1) \in D_k} 1(Y_j = v), & \text{if } Y \text{ is categorical} \\ \dfrac{1}{k} \sum_{j=1}^{k} Y_j, & \text{if } Y \text{ is numerical} \end{cases}

where v is a value in the domain of the target feature Y and 1(Y_j = v) is an indicator function that returns 1 if its argument is true and 0 otherwise. kNN imputation is therefore a model-free method. Although the kNN estimator (majority/mean rule) is very simple, kNN imputation faces two challenging issues: (1) how to determine the optimal value of k in advance, and (2) how to select the k nearest neighbors. Lall and Sharma (1996) suggested a potential choice of k = √n for n > 100, where n is the number of observed instances in a dataset. As there is no prior information concerning the optimal k for a specific application, the value of k is set according to experimental tests in our GkNN imputation algorithm. For selecting the k nearest neighbors, the similarity between an instance and its nearest neighbors, determined from the differences between instances, should be maximal (equivalently, their difference should be minimal). The commonly used measure is the Minkowski distance (or one of its variants):

d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}

where q is a positive integer called the Minkowski coefficient. The Minkowski distance is the Manhattan distance when q = 1 and the Euclidean distance when q = 2. Usually, different Minkowski coefficients are chosen for different datasets. As Caruana (2001) pointed out, this does not mean that the kNN imputation method performs well for every distance metric. In fact, devising good distance metrics is not always easy for the kNN method. For example, it is difficult to devise a distance metric that combines distances measured between symbolical (categorical) and


numerical features (variables). Generally, Minkowski distances or other such metrics are mainly suitable for particular application domains, such as domains with only numerical attributes. Fortunately, gray relational analysis (GRA) is more appropriate for capturing the 'nearness' (or relationship) between two instances than Minkowski distances or other metrics (Huang and Lee, 2002), and it can deal well with both categorical and numerical attributes (mixed, or heterogeneous, data). In this paper, the k nearest neighbors of a missing datum are selected based on the GRA when calculating the similarities between instances (see Section 3.2).
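Before turning to the GRA-based selection, the following Python sketch makes the majority/mean rule of Section 3.1 concrete, using a plain Minkowski distance (q = 2, i.e., Euclidean) over the complete instances. The function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np
from collections import Counter

def knn_impute_value(x_miss, X_complete, y_complete, k=3, categorical=False, q=2):
    """Impute one missing target value with the majority/mean rule.

    x_miss     : 1-D array of observed attributes of the incomplete instance
    X_complete : 2-D array of attributes of the complete instances
    y_complete : target values of the complete instances
    """
    # Minkowski distance (q = 2 gives the Euclidean distance)
    dist = np.sum(np.abs(X_complete - x_miss) ** q, axis=1) ** (1.0 / q)
    nn = np.argsort(dist)[:k]                 # indices of the k nearest neighbors
    if categorical:                           # majority rule
        return Counter(y_complete[i] for i in nn).most_common(1)[0][0]
    return float(np.mean([y_complete[i] for i in nn]))   # mean rule
```

Lall and Sharma's suggestion of k ≈ √n for n > 100 can serve as a starting point for choosing k in such a sketch.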

3.2. GRA based nearest neighbor selection

To select the k nearest neighbors for GkNN imputation, we use GRA (gray relational analysis), a method of gray system theory (GST). GST is designed for tackling unascertained systems with partially known and partially unknown information. It can extract valuable information by generating and developing the partially known information, and it can correctly describe and effectively monitor a system's operational behavior. This systems engineering theory, based on the uncertainty of small samples, was developed by Deng (1982); it was named using gray as the color that indicates the amount of known information in control theory. For instance, if the internal structure and features of a system are completely unknown, the system is usually denoted a 'black box'; in contrast, a 'white box' means that the internal features of a system are fully explored. Between the white box and the black box lies the gray system, indicating that part of the information is clear while another part is still unknown. Gray system theory offers simple procedures for studying complex systems with reliable analysis results, and allows us to establish an analysis model from only a few known data. In data-starved settings, gray system theory is known to be effective and has been successfully and widely applied to real-world problems, such as image processing (Jou et al., 1999), mobile communication (Su et al., 2000), machine vision inspection (Huang and Lee, 2002, 2004), decision-making (Luo et al., 2001), stock price prediction (Wang, 2003), and system control (Jiang et al., 2002; Song et al., 2005).

GRA is mainly used to quantify the influences of various factors and the relationships among data series. It is based on the gray relational model, which measures the trend relationship between two systems or between two elements of a system. For a given system, if the development trends of two elements tend toward concordance, the relational grade is considered large; otherwise, it is regarded as small. Therefore, GRA is designed to measure the similarity of emerging trends. In this paper, we use the gray relational coefficient (GRC) to describe the trend relationship between an instance containing missing values and a reference instance (here, an instance without missing values) in a given dataset.

Consider a set of observations {x_0, x_1, ..., x_n}, where x_0 is the reference instance and x_1, ..., x_n are the observed instances. Each instance x_i has m conditional attributes, denoted x_i = (x_i(1), x_i(2), ..., x_i(m)), i = 0, 1, 2, ..., n, together with a class label D_i. Before calculating the GRC, all numerical data are normalized into [0, 1] to avoid the bias toward data of larger magnitude (detailed in Section 3.3). The GRC is then defined as

GRC(x_0(p), x_i(p)) = \frac{\min_{\forall j}\min_{\forall k} |x_0(k) - x_j(k)| + \zeta \max_{\forall j}\max_{\forall k} |x_0(k) - x_j(k)|}{|x_0(p) - x_i(p)| + \zeta \max_{\forall j}\max_{\forall k} |x_0(k) - x_j(k)|}   (2)

where ζ ∈ [0, 1] is a distinguishing coefficient (normally ζ = 0.5), i, j = 1, 2, ..., n, and k, p = 1, 2, ..., m. In Formula (2), GRC(x_0(p), x_i(p)) takes values in [0, 1] and expresses the similarity between x_0(p) and x_i(p). If GRC(x_0(p), x_1(p)) exceeds GRC(x_0(p), x_2(p)), then the similarity between x_0(p) and x_1(p) is larger than that between x_0(p) and x_2(p); otherwise the former is smaller than the latter. Moreover, if x_0 and x_i have the same value for attribute p, GRC(x_0(p), x_i(p)) equals 1 (i.e., the similarity between x_0(p) and x_i(p) is maximal); by contrast, if x_0 and x_i have completely different values for a numerical attribute p, GRC(x_0(p), x_i(p)) is minimal.

As there are too many relational coefficients to compare directly, we use mean processing to convert each series' gray relational coefficients at all points into a single mean, referred to as the gray relational grade (GRG):

GRG(x_0, x_i) = \frac{1}{m} \sum_{k=1}^{m} GRC(x_0(k), x_i(k)), \quad i = 1, 2, \ldots, n.   (3)

According to Formula (3), the GRG takes a value between zero and one. In this paper, we select the k nearest neighbors by the following rule: if GRG(x_0, x_1) is greater than GRG(x_0, x_2), the difference between x_0 and x_1 is smaller than that between x_0 and x_2; otherwise the former is larger than the latter.
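The following Python sketch computes the GRC of Formula (2) and the GRG of Formula (3) for one reference instance against a set of candidate instances. It assumes numerical attributes already normalized to [0, 1] (Section 3.3) and uses ζ = 0.5; names are illustrative.

```python
import numpy as np

def gray_relational_grades(x0, candidates, zeta=0.5):
    """Gray relational grade (GRG) between a reference instance x0 and each candidate.

    x0         : array of shape (m,), attribute values normalized to [0, 1]
    candidates : array of shape (n, m), attribute values normalized to [0, 1]
    Returns an array of n grades in [0, 1]; larger means more similar.
    """
    diffs = np.abs(np.asarray(candidates, dtype=float) - np.asarray(x0, dtype=float))
    d_min, d_max = diffs.min(), diffs.max()                 # global min/max differences
    grc = (d_min + zeta * d_max) / (diffs + zeta * d_max)   # Formula (2), element-wise
    return grc.mean(axis=1)                                 # Formula (3): mean over attributes

# The k nearest neighbors of x0 are the candidates with the largest grades, e.g.:
# nn_idx = np.argsort(-gray_relational_grades(x0, candidates))[:k]
```

For categorical attributes, Section 3.4 replaces the element-wise coefficient with 1 when the two values match and 0 otherwise; the sketch above covers only the numerical case and assumes the maximum difference is nonzero.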

According to Deng (1982), despite its simplicity, gray relational analysis satisfies four principal axioms: (1) Normality: the value of GRG(x_0, x_i) is no less than zero and no greater than one. (2) Dual symmetry: given only two observations (e.g., x_i and x_j) in the relational space, GRG(x_i, x_j) = GRG(x_j, x_i). (3) Wholeness: given more than two (i.e., three or more) observations in the relational space, GRG(x_0, x_i) seldom equals GRG(x_i, x_0) for any i. (4) Approachability: GRG(x_0, x_i) decreases as the difference between x_0(p) and x_i(p) increases (with the other quantities in Formulas (2) and (3) held constant). Based on these axioms, gray relational analysis can be regarded as a measure metric, and it offers some advantages over other measures such as the Minkowski distance. For example, gray relational analysis gives a normalized measuring function due to its normality, i.e., it measures the similarities or differences among observations for analyzing the relational structure. Moreover, it gives whole relational orders over the entire relational space due to its wholeness (Huang and Lee, 2004). Furthermore, this paper uses such relationships among instances to predict missing attribute values, according to the magnitude of the GRG, which ranges from 0 to 1.

3.3. Data transformation for heterogeneous data

In this paper, data transformation is designed to address two practical issues for dealing with heterogeneous data: the mixture of attributes and the inconsistent units of attributes. The mixture of attributes means that a dataset contains both categorical and numerical attributes, while inconsistent units mean that two attributes are measured in different units, called the unit difference. The unit difference can enlarge or reduce the impact of an attribute on selecting nearest neighbors. For example, consider only two attributes in a dataset, income and ratio (i.e., the employment ratio): attribute income can range from 500 to 250,000, whereas attribute ratio only ranges from 0 to a maximum


of 100%. Certainly, the similarity of two instances is then dominated by attribute income. In other words, the similarity of two instances is dominated by attributes of larger magnitude. To avoid this bias generated by the unit difference, the data should be transformed or normalized before selecting the k nearest neighbors in the GkNN imputation algorithm. There exist many methods for data normalization, such as min–max normalization, z-score normalization, and normalization by decimal scaling. In this paper, following Qin et al. (2007), we first transform all input attributes to obtain temporary variables with zero mean and standard deviation of 1 via

a_{ij(temp)} = \frac{a_{ij} - \bar{a}_j}{\sigma(a_j)}   (4)

where a_{ij} is the value of the jth attribute of the ith instance, and \bar{a}_j and \sigma(a_j) are the mean and standard deviation of the observed values of attribute j. We then let

a_{ij(trans)} = \frac{a_{ij(temp)} \cdot \max\{\mathrm{range}(a_{j=1(temp)}), \ldots, \mathrm{range}(a_{j=x(temp)})\}}{\mathrm{range}(a_{j(temp)})}   (5)

where a_{ij(temp)} is the temporary value of the jth attribute of the ith instance normalized via Formula (4), and a_{ij(trans)} is the final value of a_{ij} transformed by Formula (5). Note that the treatment of non-numerical attributes is explained in the last paragraph of Section 3.4.
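A small Python sketch of this two-step transformation is given below. It standardizes each numerical column (Formula (4)) and then rescales each column by the ratio of the largest standardized range to its own range; this reading of Formula (5) — bringing every attribute to the range of the widest standardized attribute — is an assumption, and the names are illustrative.

```python
import numpy as np

def transform_attributes(X: np.ndarray) -> np.ndarray:
    """Two-step transformation of numerical attributes (Formulas (4) and (5)).

    X : array of shape (n_instances, n_attributes); NaN marks missing values,
        which are ignored when computing the statistics.
    """
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    temp = (X - mean) / std                                  # Formula (4): zero mean, unit std
    ranges = np.nanmax(temp, axis=0) - np.nanmin(temp, axis=0)
    return temp * ranges.max() / ranges                      # Formula (5): align attribute ranges
```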

3.4. Algorithm design

Consider a dataset T with n instances X = {X_1, X_2, ..., X_n}, where X_I = {X_1, ..., X_r} (r ≤ n) is the set of incomplete instances (with missing values) and X_C = {X_{r+1}, ..., X_n} is the set of complete instances (without missing values). Each instance X_i = (C_i(1), ..., C_i(m), D) (i = 1, 2, ..., n) has m + 1 numerical and/or categorical attributes, where D is the class attribute. The jth attribute value of instance i is denoted V(i, j) if observed and MV(i, j) if missing, and the imputed value of MV(i, j) in the kth imputation iteration is denoted \hat{MV}_k(i, j). The pseudo-code of the GkNN algorithm is presented in Fig. 1.

Before illustrating the use of the algorithm in Section 4, we briefly interpret it as follows. Many imputation methods try to impute missing values using only the complete instances. For example, kernel-based nonparametric imputation builds nonlinear models from all complete instances and imputes missing values with these models. In practice, however, most industrial databases have a high ratio of missing values, and their complete instances are insufficient for training trustworthy models. To address the issue of insufficient information, the GkNN algorithm utilizes all imputed instances in subsequent imputation iterations. Specifically, in the first imputation iteration, because missing values have not yet been imputed, it is infeasible to compute the difference between an incomplete instance and its complete neighbors (without missing values); missing values are therefore imputed with the mean of the observed values for numerical attributes, and with the most frequent observed value for categorical attributes. From the second imputation iteration on, all the missing values are imputed again, one by one, so as to improve performance. When imputing the missing value x of instance c, we first treat x as missing (in fact, it was imputed in the former iteration), while the remaining missing values are viewed as observed (taking their latest imputed values); we then impute x based on the kNN estimator (detailed in Part 3.1 of Algorithm 1, or in Section 3.1). The imputation process is repeated until the imputed values converge or begin to cycle. With such an imputation process, the proposed GkNN algorithm makes the best use of all the observed information and achieves better imputation performance.
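Since Fig. 1 is not reproduced here, the following Python sketch outlines the iterative scheme just described: mean initialization, then repeated re-imputation of each missing cell from its gray-relational nearest neighbors until the filled-in values stop changing. It is a simplified reconstruction from the prose, not the paper's exact pseudo-code: it covers only numerical, pre-normalized data, ignores the per-class partitioning of Section 3.5, and reuses the hypothetical helpers single_impute and gray_relational_grades sketched earlier.

```python
import numpy as np
import pandas as pd

def gknn_impute(df: pd.DataFrame, k: int = 3, max_iter: int = 20,
                tol: float = 1e-4, zeta: float = 0.5) -> pd.DataFrame:
    """EM-style GkNN imputation sketch for a purely numerical DataFrame."""
    mask = df.isna()                              # remember which cells were missing
    filled = single_impute(df)                    # first iteration: mean initialization
    for _ in range(max_iter):
        prev = filled.copy()
        # re-impute every originally missing cell, one by one
        for i, j in zip(*np.where(mask.values)):
            col = df.columns[j]
            x0 = filled.drop(columns=col).iloc[i].to_numpy(dtype=float)
            others = filled.drop(index=filled.index[i])
            cand = others.drop(columns=col).to_numpy(dtype=float)
            grg = gray_relational_grades(x0, cand, zeta=zeta)   # Formula (3)
            nn = np.argsort(-grg)[:k]                           # k largest grades
            filled.iloc[i, j] = others[col].to_numpy(dtype=float)[nn].mean()  # mean rule
        # stop when the filled-in values no longer change appreciably
        change = np.abs((filled - prev).to_numpy()[mask.values]).mean()
        if change < tol:
            break
    return filled
```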


Table 1. A complete dataset.

      A     B     C     D     E     F
I1    0.2   0.4   0.9   0.6   0.5   Y
I2    0.9   0.5   0.6   0.2   0.3   B
I3    0.9   0.2   0.6   0.3   0.2   Y
I4    0.1   0.3   0.9   0.4   0.6   B
I5    0.5   0.8   0.1   0.9   0.7   Y
I6    0.6   0.7   0.1   0.8   0.9   B
I7    0.1   0.4   0.8   0.5   0.6   R
I8    0.8   0.2   0.5   0.3   0.2   R
I9    0.5   0.8   0.3   0.9   0.7   B

Finally, we describe how to extend the proposed algorithm to heterogeneous datasets. Similar to the method in Huang and Lee (2004), if x_0 and x_i have the same value for a categorical attribute p, we set GRC(x_0(p), x_i(p)) = 1 (i.e., the similarity between x_0(p) and x_i(p) is maximal); otherwise, if x_0 and x_i have different values for categorical attribute p, we set GRC(x_0(p), x_i(p)) = 0 (i.e., the similarity is minimal). Therefore, the proposed approach can be applied to numerical and categorical attributes with missing values. According to the literature (e.g., Zhu et al., 2011), heterogeneous, or mixed, attributes include both numerical and non-numerical attributes. Together with the analysis in Section 3.3, this shows that the proposed GkNN algorithm can impute missing data in datasets consisting of heterogeneous attributes.

3.5. Error ratio and complexity

Cover and Hart (1967) demonstrated that, for any number of classes, the probability of error of the NN rule is bounded between R* and 2R*, where R* denotes the Bayes (i.e., optimal) error. This error ratio can be effectively decreased by our nearest neighbor selection based on the GRG (gray relational grade), as further demonstrated in Section 4. Let n be the number of instances compared in the imputation process and m the number of attributes. The time complexity of calculating the GRGs is O(mn). The total processing time also includes sorting all the gray relational grades, which is in general greater than O(n log n). So the complexity of imputing the whole dataset (i.e., without classification) is O(k_2 m n^2 log n), where k_2 is the number of imputation iterations, which is usually very small in our experiments. With classification (assuming the dataset has k_1 classes), our algorithm performs the kNN method independently on each class for missing value imputation, so the complexity of the GkNN algorithm is O(k_1 k_2 m n_j^2 log n_j), where n_j is the number of instances in the biggest class j, i.e., the class containing the maximal number of instances. Generally speaking, n_j is less than n when k_1 > 1, so O(k_1 k_2 m n_j^2 log n_j) < O(k_2 m n^2 log n). It is clear that the computational cost of the proposed GkNN algorithm is less than that of traditional imputation algorithms, such as the kNN algorithm and the method in Huang and Lee (2004).

4. Case study

This section gives an example to illustrate the use of our GkNN algorithm. Consider a complete dataset consisting of nine instances I1, I2, ..., I9, listed in Table 1. Each instance has six attributes (one categorical, namely the last attribute F; the others are numerical).


Fig. 1. The pseudo-code of GkNN.

To highlight the nearest neighbor selection, we assume that the data have already been transformed with the method in Section 3.3, i.e., all numerical attribute values are normalized between 0 and 1. Moreover, all nine instances have the same class label (we do not show the class attribute in Table 1 for simplicity). To demonstrate the approach, we randomly remove 9 values (i.e., a missing ratio of 10%) from Table 1; the result is listed in Table 2. According to the proposed GkNN algorithm, the first step is to impute the missing values with the mean (for numerical attributes) or the majority value (for categorical attributes). The result of the first imputation iteration is presented in Table 3. From Table 3, only one imputed value is consistent with Table 1 after the first imputation iteration, namely I5(A), the value of attribute A in the 5th instance. Two imputed values are close to the real values in Table 1,

namely I1(B) and I6(A); the others clearly differ from the values in Table 1. From the second imputation iteration, the proposed GkNN algorithm (with k = 1) is employed. It starts from the first missing value of the first instance, i.e., I1(B) in the above example. I1(B) is assumed to be missing, and the remaining missing values are treated as observed values, as presented in Table 4. I1(B) is then re-imputed as follows. First, the gray relational coefficients (GRC) and relational grades (GRG) between instance I1 and the other instances are calculated:

\min_{\forall j}\min_{\forall k} |x_0(k) - x_j(k)| = 0 \quad \text{and} \quad \max_{\forall j}\max_{\forall k} |x_0(k) - x_j(k)| = 1,

where j = 1, ..., 8 and k = 1, ..., 5.

Table 2. Some values randomly removed from the dataset in Table 1.

      A     B     C     D     E     F
I1    0.2   ??    0.9   0.6   0.5   Y
I2    0.9   0.5   ??    0.2   0.3   B
I3    0.9   0.2   0.6   0.3   ??    Y
I4    0.1   0.3   0.9   0.4   0.6   ??
I5    ??    0.8   ??    0.9   0.7   Y
I6    ??    0.7   0.1   0.8   ??    ??
I7    0.1   0.4   0.8   0.5   0.6   R
I8    0.8   0.2   0.5   0.3   0.2   R
I9    0.5   0.8   0.3   0.9   0.7   B

Table 3. Imputation results after the first iteration.

      A     B     C     D     E     F
I1    0.2   0.49  0.9   0.6   0.5   Y
I2    0.9   0.5   0.59  0.2   0.3   B
I3    0.9   0.2   0.6   0.3   0.51  Y
I4    0.1   0.3   0.9   0.4   0.6   Y
I5    0.5   0.8   0.59  0.9   0.7   Y
I6    0.5   0.7   0.1   0.8   0.51  Y
I7    0.1   0.4   0.8   0.5   0.6   R
I8    0.8   0.2   0.5   0.3   0.2   R
I9    0.5   0.8   0.3   0.9   0.7   B

Table 4. Data for re-imputing the first missing value at the second imputation iteration.

      A     B     C     D     E     F
I1    0.2   ??    0.9   0.6   0.5   Y
I2    0.9   0.5   0.59  0.2   0.3   B
I3    0.9   0.2   0.6   0.3   0.51  Y
I4    0.1   0.3   0.9   0.4   0.6   Y
I5    0.5   0.8   0.59  0.9   0.7   Y
I6    0.5   0.7   0.1   0.8   0.51  Y
I7    0.1   0.4   0.8   0.5   0.6   R
I8    0.8   0.2   0.5   0.3   0.2   R
I9    0.5   0.8   0.3   0.9   0.7   B

Hence, the expression of the gray relational coefficient (GRC) is

GRC(x_0(p), x_i(p)) = \frac{\min_{\forall j}\min_{\forall k} |x_0(k) - x_j(k)| + \zeta \max_{\forall j}\max_{\forall k} |x_0(k) - x_j(k)|}{|x_0(p) - x_i(p)| + \zeta \max_{\forall j}\max_{\forall k} |x_0(k) - x_j(k)|} = \frac{0 + 0.5 \times 1}{|x_0(p) - x_i(p)| + 0.5 \times 1}

where i = 1, ..., 8; j = 1, ..., 8; k = 1, ..., 5; p = 1, ..., 5. And the expression of the gray relational grade (GRG) is

GRG(x_0, x_i) = \frac{1}{5} \sum_{k=1}^{5} GRC(x_0(k), x_i(k)), \quad i = 1, 2, \ldots, 8.

Hence, GRG(x1, x2) = 0.527, GRG(x1, x3) = 0.729, GRG(x1, x4) = 0.874, GRG(x1, x5) = 0.716, GRG(x1, x6) = 0.673, GRG(x1, x7) = 0.724, GRG(x1, x8) = 0.61, and GRG(x1, x9) = 0.549. Moreover,

GRG(x1, x4) > GRG(x1, x3) > GRG(x1, x7) > GRG(x1, x5) > GRG(x1, x6) > GRG(x1, x8) > GRG(x1, x9) > GRG(x1, x2).

Therefore, instance I4 is selected as the nearest neighbor of I1 for the 1-NN estimator, and I1(B) is re-imputed with 0.3, the value of I4(B), according to the 1-NN estimator. Similarly to I1(B), we can re-impute all the other missing values. Table 5 presents all the re-imputed missing values in the second imputation iteration. Two missing values, I2(C) and I5(C), are the same as in the previous imputation iteration. The other seven missing values have changed; two of them are now consistent with Table 1, namely I3(E) and I6(F), and the other five have become closer to their real values. Consequently, the second imputation iteration has generated more accurate imputed values than the previous iteration. The results of the third imputation iteration are presented in Table 6. Different from Table 5, two missing values, I1(B) and I2(C), have changed to become consistent with Table 1, and the others have not changed. This again indicates that the imputation performance is better than in the previous iteration.

Table 5. Imputation results after the second imputation iteration.

      A     B     C     D     E     F
I1    0.2   0.3   0.9   0.6   0.5   Y
I2    0.9   0.5   0.5   0.2   0.3   B
I3    0.9   0.2   0.6   0.3   0.2   Y
I4    0.1   0.3   0.9   0.4   0.6   R
I5    0.5   0.8   0.3   0.9   0.7   Y
I6    0.5   0.7   0.1   0.8   0.7   B
I7    0.1   0.4   0.8   0.5   0.6   R
I8    0.8   0.2   0.5   0.3   0.2   R
I9    0.5   0.8   0.3   0.9   0.7   B

Table 6. Imputation results after the third imputation iteration.

      A     B     C     D     E     F
I1    0.2   0.4   0.9   0.6   0.5   Y
I2    0.9   0.5   0.6   0.2   0.3   B
I3    0.9   0.2   0.6   0.3   0.2   Y
I4    0.1   0.3   0.9   0.4   0.6   R
I5    0.5   0.8   0.3   0.9   0.7   Y
I6    0.5   0.7   0.1   0.8   0.7   B
I7    0.1   0.4   0.8   0.5   0.6   R
I8    0.8   0.2   0.5   0.3   0.2   R
I9    0.5   0.8   0.3   0.9   0.7   B

From the fourth imputation iteration onward, no missing values change any more, which indicates that the GkNN algorithm converges in the fourth imputation iteration. In only three imputation iterations, five out of nine missing values are correctly imputed and the others are closer to their real values. Because there are only nine instances in this dataset, higher performance cannot be expected here; the GkNN algorithm is therefore empirically evaluated in the next section.

5. Experimental analysis

This section experimentally evaluates the proposed GkNN algorithm on several real datasets from the UCI repository (Blake and Merz, 1998). We first demonstrate the convergence of our algorithm against the existing methods for different missing ratios and different datasets in Section 5.1. Second, in Sections 5.2 and 5.3, the experiments show whether the algorithm can improve system performance (such as prediction accuracy) once the filled-in values have converged or begun to cycle. For comparison with our algorithm, we designed two algorithms in our experiments. The first is iterative imputation with the gray-based kNN method, denoted 'Noclassified' (i.e., an iterative edition of Huang and Lee, 2004); that is, we turn the method in Huang and Lee (2004) into an iterative imputation method based on our iterative imputation principle. We expect that our imputation method with classification outperforms the method of Huang and Lee (2004) without classification if the experimental results of our GkNN approach are better than those of the 'Noclassified' algorithm in terms of prediction accuracy and classification accuracy. The other comparison algorithm is iterative imputation based on the Euclidean-distance kNN method, denoted kNN. Note that, in Section 5.3, we employed datasets with missing values, whereas in Sections 5.1 and 5.2 we employed datasets without missing values, as we wanted to compare the imputed results with the real values. The missing values were generated by the missing at random (MAR) mechanism (Qin et al., 2007), i.e., P(δ = 1 | X = x) = 0.9 − 0.2|x − 1| if |x − 1| ≤ 4.5, and 0.1 otherwise, where δ = 1 means that x is missing.
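The following Python sketch illustrates this MAR-style deletion mechanism on one numerical attribute, removing each value x with probability 0.9 − 0.2|x − 1| when |x − 1| ≤ 4.5 and 0.1 otherwise; it is a plain reading of the stated formula, and the names are illustrative.

```python
import numpy as np

def mar_delete(x, seed: int = 0):
    """Delete entries of a numerical attribute under the stated MAR mechanism:
    P(missing) = 0.9 - 0.2|x - 1| if |x - 1| <= 4.5, and 0.1 otherwise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).copy()
    p = np.where(np.abs(x - 1) <= 4.5, 0.9 - 0.2 * np.abs(x - 1), 0.1)
    x[rng.random(x.shape) < p] = np.nan      # mark the deleted entries as missing
    return x
```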

5.1. Convergence of the imputed values

We should explain the convergence of the imputed values for our algorithm, because it is an EM-style iterative imputation method. Each iteration of the EM algorithm is guaranteed to be non-decreasing in likelihood according to the theory of maximum likelihood (Dempster et al., 1977), and the EM algorithm converges to a local maximum of the likelihood under a parametric model. For our algorithm, we are not able to give a similar proof for the non-parametric method, as few theoretical results consider the validity of kNN, so it is difficult to build a mathematical proof.

[Fig. 2. Mean change in filled-in values (y-axis) versus iteration (x-axis) for the Iris dataset (left) and the Tic-tac-toe dataset (right) at different missing ratios, comparing kNN, GkNN and Noclassified: (1) Iris (10%), (2) Tic-tac-toe (10%), (3) Iris (20%), (4) Tic-tac-toe (20%), (5) Iris (40%), (6) Tic-tac-toe (40%).]

In this section we therefore empirically show the convergence of the GkNN algorithm. The imputation method converges when the "mean change in filled-in values" reaches zero. The "mean change in filled-in values" is the distance between the mean of all values imputed in the previous iteration and the mean of all values imputed in the current iteration.
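As an illustration (not the paper's code), the snippet below computes two hedged readings of this quantity from the matrices of filled-in values of two successive iterations: the literal distance between the two means, and the mean absolute per-cell change that is commonly used as a stopping criterion.

```python
import numpy as np

def mean_change(prev_filled, curr_filled, missing_mask):
    """Two convergence measures over the cells that were originally missing."""
    prev_vals = np.asarray(prev_filled)[missing_mask]
    curr_vals = np.asarray(curr_filled)[missing_mask]
    distance_of_means = abs(curr_vals.mean() - prev_vals.mean())   # literal definition
    mean_abs_change = np.abs(curr_vals - prev_vals).mean()         # common stopping rule
    return distance_of_means, mean_abs_change
```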

Caruana (2001) noted that, in practice, the "mean change in filled-in values" usually does not drop all the way to zero for non-parametric models (such as kNN or kernel methods) and only approaches a value as close to zero as possible; this can make the algorithm converge to a cycle. In real applications, cycles are rare in parametric models if the density calculations are exact.

[Fig. 3. Experimental results (RMSE versus iteration) on the Iris dataset (left) and the Tic-tac-toe dataset (right) for the three algorithms after convergence: (7) Iris (10%), (8) Tic-tac-toe (10%), (9) Iris (20%), (10) Tic-tac-toe (20%), (11) Iris (40%), (12) Tic-tac-toe (40%).]

Caruana (2001) observed that this phenomenon is more likely in non-parametric models; in our experiments, however, we never saw such a case. The experimental results for the "mean change in filled-in values", obtained on the Iris and Tic-tac-toe datasets respectively,

show the convergence of the three algorithms in Fig. 2. In Fig. 2, subfigures (1), (3) and (5) present the results for the Iris dataset with missing ratios of 10%, 20% and 40%, respectively, and subfigures (2), (4) and (6) present the mean change in filled-in values for the Tic-tac-toe dataset with missing ratios of 10%, 20% and 40%, respectively.

[Fig. 4. Experimental results (classification accuracy versus iteration) for the three algorithms after convergence on the hepatitis, echocardiogram, soybean and water-treatment datasets, subfigures (13)–(16), respectively.]

In fact, these two datasets (Iris and Tic-tac-toe) are complete datasets without missing values, so values were removed at random at different missing ratios in order to demonstrate the characteristics of the algorithm. Results for several datasets that originally contain missing values are summarized in Table 7, which reports the number of iterations needed until the "mean change in filled-in values" stabilizes, together with the optimal number of neighbors k for the kNN estimator; for example, the missing ratio of the Soybean dataset is 6.63% and GkNN converges after 5 imputation iterations with a 3NN estimator. The results show that all three algorithms converge on the different datasets, because the "mean change in filled-in values" remains stable after some iterations; this indicates that imputing missing values with iterative imputation is reasonable. For example, on the hepatitis dataset, the "mean change in filled-in values" levels off after 5 iterations for 'Noclassified', after 5 for GkNN, and after 6 for kNN.

Table 7. Iterative times for seven datasets after the algorithms converge or begin to cycle.

Dataset            MR (k in kNN)     Noclassified   GkNN   kNN
Iris               10% (k = 2)       7              6      11
Iris               20% (k = 2)       10             8      13
Iris               40% (k = 1)       13             11     17
Tic-tac-toe        10% (k = 8)       6              5      8
Tic-tac-toe        20% (k = 5)       8              6      10
Tic-tac-toe        40% (k = 3)       10             8      12
Hepatitis          5.67% (k = 2)     5              5      6
Echocardiogram     7.69% (k = 1)     4              3      5
Soybean            6.63% (k = 3)     6              5      8
Water-treatment    2.95% (k = 5)     7              6      8

The higher the missing ratio of a dataset, the more iterations an algorithm needs. For instance, for the GkNN algorithm on the Iris dataset, the numbers of iterations are 6, 8 and 11 for missing ratios of 10%, 20% and 40%, respectively. Obviously, the higher the missing ratio, the less useful information is available to the algorithms. All of these results demonstrate that our GkNN algorithm converges fastest among the three algorithms on these datasets; for example, on the water-treatment dataset, the numbers of iterations for 'Noclassified', GkNN and kNN are 7, 6 and 8 in Table 7. Further, the "mean change in filled-in values" of our GkNN is the smallest among the three algorithms on these UCI datasets.

5.2. Experimental evaluation on prediction accuracy

To examine whether the imputed values actually become better after convergence, rather than merely converging, this section presents the results of experiments in which we predict the missing values with the three algorithms at different missing ratios. For numerical missing values, the approaches are evaluated on prediction accuracy on the Iris and Tic-tac-toe datasets at different missing ratios, namely 10%, 20% and 40%. The accuracy of prediction is measured using the root mean square error (RMSE):

RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (e_i - \tilde{e}_i)^2}   (6)

where e_i is the original attribute value, \tilde{e}_i is the estimated attribute value, and m is the total number of missing values. The larger the RMSE, the worse the prediction accuracy.
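A minimal Python sketch of this evaluation metric (Formula (6)), with illustrative names:

```python
import numpy as np

def rmse(original, estimated):
    """Root mean square error between original and imputed values (Formula (6))."""
    original = np.asarray(original, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((original - estimated) ** 2)))
```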


The results are shown in Fig. 3. From Fig. 3, for all three algorithms and at the different missing ratios, the RMSE of the second iteration decreases dramatically compared with the first imputation iteration, and the RMSE of the third iteration also improves on the second. After several imputation iterations, the RMSE of all the algorithms stabilizes, but our method shows the best performance.

5.3. Experimental evaluation on classification accuracy

For discrete missing values, the approaches are evaluated on the hepatitis, echocardiogram, soybean and water-treatment datasets, which are incomplete datasets. We assess the performance of these prediction procedures through the classification accuracy (CA), defined as

CA = \frac{1}{n} \sum_{i=1}^{n} l(IC_i, RC_i)

where n is the number of instances in the dataset, the indicator function l(x, y) equals 1 if x = y and 0 otherwise, and RC_i is the real class label of the i-th instance. Having imputed all the missing values, we use all the instances in the dataset to construct a kNN classifier and then classify each instance in the dataset; IC_i is the resulting classification of the i-th instance.
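A correspondingly small Python sketch of the CA measure, with illustrative names:

```python
import numpy as np

def classification_accuracy(predicted, real):
    """Fraction of instances whose predicted class equals the real class label."""
    predicted = np.asarray(predicted)
    real = np.asarray(real)
    return float(np.mean(predicted == real))
```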

Obviously, the larger the CA value, the more effective the algorithm. The CA results for the hepatitis, echocardiogram, soybean and water-treatment datasets are presented in subfigures (13)–(16) of Fig. 4.

6. Conclusions

We have proposed a nearest neighbor selection method for iterative kNN imputation of missing data, named GkNN imputation. Different from existing imputation methods, GkNN is able to deal with missing values in heterogeneous datasets. In this approach, gray relational analysis, which is more appropriate for capturing the 'nearness' (or relationship) between two instances than the Minkowski distance, is used to describe the relational structure of all instances and can accelerate the convergence rate of the iterative imputation. In addition, GkNN searches for nearest neighbor instances with the same class label as the instance containing the missing value, which reduces the time complexity and improves the prediction errors. Our experimental results on six UCI datasets, with the RMSE used to measure prediction accuracy and the classification error ratio, have shown that our method is superior to the kNN method and the method in Huang and Lee (2004) in terms of convergence rate.

Acknowledgments

This work is supported in part by the Australian Research Council (ARC) under large grant DP0985456; the Nature Science Foundation (NSF) of China under grant 61170131; the China "1000-Plan" National Distinguished Professorship; the China 863 Program under grant SQ2011AAJY2742; the Guangxi Natural Science Foundation under grant 2012GXNSFGA060004; the Guangxi "Bagui" Teams for Innovation and Research; the Jiangsu Provincial Key Laboratory of E-business at the Nanjing University of Finance and


Economics; and the Research Exchanges with China/India Award of The Royal Academy of Engineering.

References

Batista, G., Monard, M., 2003. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence 17, 519–533.
Blake, C., Merz, C., 1998. UCI Repository of Machine Learning Databases.
Brása, L.P., Menezes, J.C., 2007. Improving cluster-based missing value estimation of DNA microarray data. Biomolecular Engineering 24 (2), 273–282.
Caruana, R., 2001. A non-parametric EM-style algorithm for imputing missing values. In: Proc. 8th International Workshop on Artificial Intelligence and Statistics, Key West, FL.
Chen, J., Shao, J., 2000. Nearest neighbor imputation for survey data. Journal of Official Statistics 16, 113–132.
Cover, T.M., Hart, P.E., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1), 21–27.
de Andrade Silva, J., Hruschka, E.R., 2009. EACImpute: an evolutionary algorithm for clustering-based imputation. In: ISDA 2009, pp. 1400–1406.
Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
Deng, J.L., 1982. Control problems of grey systems. Systems and Control Letters 1, 288–294.
Huang, C.C., Lee, H.M., 2002. An instance-based learning approach based on grey relational structure. In: Proc. of the UK Workshop on Computational Intelligence (UKCI-02), Birmingham.
Huang, C.C., Lee, H.M., 2004. A grey-based nearest neighbor approach for missing attribute value prediction. Applied Intelligence 20, 239–252.
Jiang, B.C., et al., 2002. Machine vision-based gray relational theory applied to IC marking inspection. IEEE Transactions on Semiconductor Manufacturing 15 (4), 531–539.
John, G.H., Kohavi, R., Pfleger, K., 1994. Irrelevant features and the subset selection problem. In: Proceedings of the 11th International Conference on Machine Learning, pp. 121–129.
Jou, J.M., et al., 1999. The gray prediction search algorithm for block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology 9 (6), 843–848.
Kim, K.-Y., Kim, B.-J., Yi, G.-S., 2004. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics 5, 160.
Lall, U., Sharma, A., 1996. A nearest-neighbor bootstrap for resampling hydrologic time series. Water Resources Research 32 (3), 679–693.
Little, R., Rubin, D., 2002. Statistical Analysis with Missing Data. Wiley.
Liu, H., Zhang, S., 2012. Noisy data elimination using mutual k-nearest neighbor for classification mining. Journal of Systems and Software 85 (5), 1067–1074.
Luo, R.C., et al., 2001. Target tracking using a hierarchical grey-fuzzy motion decision-making method. IEEE Transactions on Systems, Man and Cybernetics (Part A) 31 (3), 179–186.
Mistiaen, J., Ravallion, M., 2003. Survey compliance and the distribution of income. Available at http://econ.worldbank.org.
Myllymaki, Y., 2001. Effective web data extraction with standard XML technologies. In: Proc. 10th International Conference on World Wide Web, Hong Kong.
Pearson, R., 2005. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. SIAM.
Pearson, R., 2006. The problem of disguised missing data. ACM SIGKDD Explorations Newsletter 8 (1), 83–92.
Qin, Y.S., et al., 2007. Semi-parametric optimization for missing data imputation. Applied Intelligence 27 (1), 79–88.
Quinlan, J., 1989. Unknown attribute values in induction. In: Proc. 6th International Workshop on Machine Learning, Ithaca, pp. 164–168.
Quinlan, J., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, USA.
Schafer, J., Graham, J., 2002. Missing data: our view of the state of the art. Psychological Methods 7, 147–177.
Skellam, J.G., 1952. Studies in statistical ecology: spatial pattern. Biometrika 39, 346–362.
Song, Q.B., et al., 2005. Using grey relational analysis to predict software effort with small data sets. In: 11th IEEE International Software Metrics Symposium (METRICS 2005).
Su, S.L., et al., 2000. Grey-based power control for DS-CDMA cellular mobile systems. IEEE Transactions on Vehicular Technology 49 (6), 2081–2088.
Troyanskaya, O., et al., 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17 (6), 520–525.
Wang, Y.F., 2003. On-demand forecasting of stock prices using a real-time predictor. IEEE Transactions on Knowledge and Data Engineering 15 (4), 1033–1037.
Wang, Q.H., Rao, R.N.K., 2002. Empirical likelihood-based inference under imputation for missing response data. Annals of Statistics 30, 896–924.
Zhang, C.Q., et al., 2007. An imputation method for missing values. In: PAKDD 2007, LNAI, vol. 4426, pp. 1080–1087.


Zhang, S.C., 2008. Parimputation: from imputation and null-imputation to partially imputation. IEEE Intelligent Informatics Bulletin 9 (1), 32–38.
Zhang, S.C., 2010. KNN-CF approach: incorporating certainty factor to kNN classification. IEEE Intelligent Informatics Bulletin 11 (1), 25–34.
Zhang, S.C., 2011. Shell-neighbor method and its application in missing data imputation. Applied Intelligence 36 (1), 108–118.
Zhang, S.C., et al., 2005. "Missing is useful": missing values in cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering 17 (12), 1689–1693.
Zhang, S.C., et al., 2006. Optimized parameters for missing data imputation. In: PRICAI 2006, pp. 1010–1016.
Zhang, S., 2012. Decision tree classifiers sensitive to heterogeneous costs. Journal of Systems and Software 85 (4), 771–779.
Zhu, X.F., et al., 2011. Missing value estimation for mixed-attribute datasets. IEEE Transactions on Knowledge and Data Engineering 23 (1), 110–121.

Shichao Zhang is a China “1000-Plan” Distinguished Professor and a Vice President of the Guangxi Normal University, China. He holds a PhD degree from the CIAE, China. His research interests include machine learning and information quality. He has published about 60 international journal papers and over 60 international conference papers. He is a CI for 11 competitive nation-level projects supported by China NSF, China 863 Program, China 973 Program, and Australia ARC. He is a senior member of the IEEE, a member of the ACM; serving as an associate editor for IEEE Intelligent Informatics Bulletin, served as an associate editor for IEEE Transactions on Knowledge and Data Engineering, Knowledge and Information Systems; served as a PC co-chair/vice-chair for 6 international conferences and as a general co-chair/vice-chair for 3 international conferences.