Similarity-learning information-fusion schemes for missing data imputation
Roozbeh Razavi-Far, Boyuan Cheng, Mehrdad Saif, Majid Ahmadi
Faculty of Engineering, University of Windsor, 401 Sunset Avenue, Windsor, N9B 3P4, Ontario, Canada
Highlights
• Two novel missing data imputation techniques are proposed.
• These novel imputation techniques handle both numerical and categorical features.
• Experimental comparison based on twenty-one publicly available datasets and two evaluation criteria.
• Missing scores are estimated by learning local and global similarities.
• Top estimations are fused to obtain the final estimation.
Article info
Article history: Received 5 January 2019; Received in revised form 14 June 2019; Accepted 15 June 2019; Available online xxxx
Keywords: Missing data; Imputation; Expectation–Maximization; Similarity learning; Information fusion; Dempster–Shafer theory
Abstract
Missing data imputation is a very important data cleaning task for machine learning and data mining with incomplete data. This paper proposes two novel methods for missing data imputation, named kEMI and kEMI+, that are based on the k-Nearest Neighbours algorithm for pre-imputation and the Expectation–Maximization algorithm for posterior-imputation. The former is a local search mechanism that aims to automatically find the best value for k, and the latter makes use of the best k nearest neighbours to estimate missing scores by learning global similarities. kEMI+ makes use of a novel information fusion mechanism: it fuses the top estimations through the Dempster–Shafer fusion module to obtain the final estimation. Both techniques handle numerical and categorical features. The performance of the proposed imputation techniques is evaluated by applying them to twenty-one publicly available datasets with different missingness patterns and ratios, and then compared with other state-of-the-art missing data imputation techniques in terms of standard evaluation measures, namely the normalized root mean square difference and the absolute error. The attained results indicate the effectiveness of the proposed novel missing data imputation techniques.
© 2019 Elsevier B.V. All rights reserved.
1. Introduction

Data anomalies and impurities can cause inefficient data analysis, inaccurate decisions and user inconvenience. Careless use of erroneous data can be misleading and even worthless for the users [1–3]. Therefore, data processing tasks such as data cleaning are crucial to deal with inaccurate and incomplete data and to ensure high-quality mining [2,3].

Nowadays data are collected in various ways through paper-based and online surveys, interviews, sensors and measurements,
among others [4]. There are various reasons for missing values, including imperfect manual data entry procedures, incorrect measurements, data collection problems, equipment errors, sensor failures, omitted entries in datasets, and non-response in questionnaires. Missing scores are unavoidable in almost all industrial and business sectors and are highly undesirable in machine learning, data mining and information systems [5–7]. Organizations are extremely dependent on data collection, storage capacity and data analysis for decision-making and performance improvement [5,8]. The presence of missing scores can therefore degrade the performance of subsequent data analysis tasks such as classification, regression, prediction, and decision making, since these tasks depend on complete information [6,9–12].

One important task of data cleaning is missing data imputation. Thus far, many imputation techniques have been proposed [13–17]. In general, imputation performance heavily depends on the selection of a suitable imputation technique to deal
with each incomplete dataset. The performance of each imputation technique may vary according to the type of missingness and dataset [18]. For imputing the missing scores of a dataset, the existing machine learning-based imputation methods either use the similarity of the records or the correlations of the features [19,20]. According to the similarity structure of the search and the imputation mechanism, these machine learning-based missing data imputation methods can be classified into two categories: the first category contains imputation techniques that seek the global similarity within the data structure, and the second category contains those that search for local similarities [21,22].

Global similarity-based imputation strategies use the global correlation structure of the whole dataset for imputing its missing scores. Existing global similarity-based imputation methods include Expectation–Maximization Imputation (EMI) [23,24] and imputation by means of Bayesian Principal Component Analysis (BPCA) [25]. EMI makes use of the whole dataset and seeks correlations between the features with missing scores (target features) and the features without missing scores (covariates) in order to impute numerical missing values [21]. Unlike the global similarity-based imputation approaches that use the whole dataset, the local similarity-based imputation approaches look for local similarity structures among the records for imputing the missing scores. Imputation techniques such as k-Nearest Neighbours-based Imputation (kNNI) [26,27], Local Least Squares Imputation (LLSI) [28], Iterative Local Least Squares Imputation (ILLSI) [29], and Bi-cluster based Iterative Local Least Squares Imputation (IBLLS) [30] fall in this category.

Another issue of concern is to find a technique that can impute both numerical and categorical features. In practice, a missing data imputation technique should also be time efficient, which means that its algorithm should not operate on the whole dataset in each iteration.

This paper initially proposes a novel missing data imputation technique, called kEMI, which integrates the k-nearest neighbours and expectation–maximization imputation algorithms. Initial estimation of missing values is very common, either for enhancing the accuracy of estimations [31] or for dealing with planned missingness and non-overlapping missing structures [32,33]. In this work, given an incomplete record, the proposed method performs a pre-imputation by means of a local search mechanism to automatically find the best set of donors and then uses a posterior-imputation mechanism to estimate missing scores of the incomplete records by learning global similarities among the selected donors. kEMI first automatically finds the best value for k and, thus, the k nearest neighbours. Then, it uses the EMI algorithm to impute the missing scores. The technique first looks for local solutions (donors); it has the advantage of imputing missing scores based on the k nearest neighbours instead of the whole dataset. kNN-based techniques usually find the k nearest neighbours based on the similarity of records, so correlations among features are neglected. EM is then used to search for the global similarity among the selected donors to impute the missing scores. kEMI thus focuses not only on the similarity of records but also on the correlations among the features.
It has a local search mechanism and does not require many iterations for imputing the missing scores of a given dataset, which means kEMI improves not only the accuracy but also the time efficiency.

This paper then proposes another novel missing data imputation technique, called kEMI+, that is an extension of kEMI. Similar to kEMI, kEMI+ automatically finds the best k estimations and then fuses them by resorting to the Dempster–Shafer fusion [34,35].
kEMI+ not only has all the advantages of kEMI but can also further improve estimations through a powerful fusion mechanism. Fusion techniques have been proposed in [36,37], which make use of belief functions to deal with incomplete records and uncertain and imprecise information, and to obtain the final classification. The proposed imputation techniques, kEMI and kEMI+, are evaluated on twenty-one publicly available datasets and compared to other state-of-the-art missing data imputation techniques such as EMI [23,24], DMI [18], kDMI [21], kNNI [26], LCSR [22], LRMC [38], CLRMC [19], and CLRMC-EN [19] in terms of imputation performance measures, namely the normalized root mean square difference (NRMS) and the absolute error (AE).

The organization of the paper is as follows. Section 2 provides background information about missing data imputation. The proposed novel missing data imputation techniques, kEMI and kEMI+, are formally presented in Section 3. Section 4 presents experimental results and compares kEMI and kEMI+ with other state-of-the-art missing data imputation techniques. Section 5 describes the conclusions.

2. Missing data imputation: Background

Missing data is a common problem in many data-driven applications. Missing data can significantly deviate the outcomes of data mining and knowledge discovery, and it is thus crucial to treat them properly. Discarding missing records is a very simple but inefficient technique, especially for datasets with a high missing rate. The majority of contemporary missing data treatment techniques estimate the missing scores, so-called missing data imputation [39]. Imputation of the missing scores usually requires certain assumptions about the dataset distribution [40]; indeed, improper assumptions can bias the estimations [41,42]. Various missing data imputation techniques have been developed based on machine learning or statistical learning techniques [13–17,43]. Our objective is to develop an efficient technique for treating missing scores that takes both local and global similarities into account.

Although missing data imputation can improve the data quality, care should be taken in the choice of a proper imputation technique. Some missing data imputation techniques cannot preserve the relationships among the features; indeed, a few techniques may change the underlying distributions of the data [44]. This section reviews some representative works on missing data imputation. These works consider the local similarity among the records and/or the global similarity and correlation among the features.

k-Nearest Neighbour Imputation (kNNI) [21,26,45] is an efficient technique for imputing a missing score x_ij. It first finds the k records most similar to x_i in the dataset X by means of the Euclidean distance (k is a user-defined value). If X_j is a categorical feature, the technique imputes x_ij by using the most frequent value of X_j within the k-Nearest Neighbours (k-NN); these k nearest neighbours can be found by means of the Hamming distance for categorical features. Otherwise, if X_j is a numerical feature, the technique uses the mean value of X_j over the k nearest neighbours. The imputation accuracy of kNNI is better than that of plain mean/mode imputation techniques, which calculate the mean/mode from the whole dataset X instead of the k nearest neighbours of x_i within X.
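To make the kNNI mechanism concrete, the following is a minimal Python sketch of imputing a single numerical entry from the k most similar records; the function name and interface are ours rather than part of any published implementation, and a categorical feature would take the mode of the donors instead of the mean.

    import numpy as np

    def knn_impute_entry(X, i, j, k):
        """Impute X[i, j] from the k records most similar to record i (kNNI sketch)."""
        obs = ~np.isnan(X[i])                      # features observed in the target record
        candidates = [r for r in range(len(X))
                      if r != i and not np.isnan(X[r, j]) and not np.isnan(X[r][obs]).any()]
        # Euclidean distance on the features the target record has observed
        dists = [np.linalg.norm(X[r][obs] - X[i][obs]) for r in candidates]
        donors = [candidates[t] for t in np.argsort(dists)[:k]]
        return np.mean([X[r, j] for r in donors])  # mean of the donors' values for feature j

    # toy usage: record 1 is missing its third feature
    X = np.array([[1.0, 2.0, 3.0],
                  [1.1, 2.1, np.nan],
                  [0.9, 1.9, 2.9],
                  [5.0, 5.0, 5.0]])
    print(knn_impute_entry(X, 1, 2, k=2))          # 2.95, the mean over the two closest rows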
kNNI is a simple and efficient method that performs well on datasets having a strong local correlation structure. However, it is computationally expensive for large datasets, because finding the most suitable k nearest neighbours requires searching the whole dataset.

The Expectation–Maximization Imputation (EMI) algorithm considers the mean and the covariance matrix of the data to impute the numerical missing scores of a dataset [23,24].
Fig. 1. The proposed kEMI scheme, which contains two modules for pre-imputation and posterior-imputation.
Firstly, the mean and the covariance matrix are estimated from the incomplete dataset, and the missing scores are then imputed based on them. For each record, missing scores (if any) are estimated as follows, based on the relationship between features:

    x_{mis} = \mu_{mis} + (x_{obs} - \mu_{obs}) B + e    (1)

where x_mis and x_obs are the vectors of missing and observed values of the record x, respectively, and \mu_mis and \mu_obs are the corresponding mean vectors. B = \Sigma_{obs,obs}^{-1} \Sigma_{obs,mis} is the regression coefficient matrix, i.e., the product of the inverse of the covariance matrix of the observed feature values (\Sigma_{obs,obs}) and the cross-covariance matrix of the observed and the missing values (\Sigma_{obs,mis}). Besides, e stands for the residual error with a zero mean and an unknown covariance matrix.
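A compact sketch of the estimation step in Eq. (1) is shown below; it assumes the mean vector and covariance matrix are simply estimated from the complete records, omits the residual term e, and uses variable names of our own choosing, so it illustrates only the conditional-mean part of EMI.

    import numpy as np

    def emi_estimate(x, mu, sigma):
        """Conditional-mean estimate of the missing part of one record, Eq. (1) with e omitted."""
        mis = np.isnan(x)
        obs = ~mis
        # B = Sigma_{obs,obs}^{-1} Sigma_{obs,mis}
        B = np.linalg.solve(sigma[np.ix_(obs, obs)], sigma[np.ix_(obs, mis)])
        x_hat = x.copy()
        x_hat[mis] = mu[mis] + (x[obs] - mu[obs]) @ B
        return x_hat

    # toy usage: estimate mu and sigma from the complete records, then impute the last record
    X = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [3.0, 3.0, 8.0],
                  [4.0, 2.0, 9.0], [2.5, 2.5, np.nan]])
    complete = X[~np.isnan(X).any(axis=1)]
    mu, sigma = complete.mean(axis=0), np.cov(complete, rowvar=False)
    print(emi_estimate(X[4], mu, sigma))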
This algorithm has been widely used for imputing missing scores in various applications [46,47].

Decision tree based Imputation (DMI) is a state-of-the-art decision tree-based technique for missing data imputation [48]. It merges the Decision Tree and Expectation–Maximization algorithms. It first divides the dataset into a number of horizontal segments obtained by the leaves of the decision tree. The records belonging to each leaf are expected to be similar to each other, and the correlations among the features for the records that fall within a leaf are generally higher than the correlations among the features within the whole dataset. DMI takes the records within a leaf node instead of using all records in the whole dataset [48] and then makes use of the EMI algorithm [23,24] in order to impute the missing scores among these records.

kDMI is an extended version of DMI for missing data imputation [21]. It has two levels of row-wise partitioning: the former is done by means of the decision tree algorithm and the latter is created using the k-NN algorithm, looking for the records most similar to an incomplete target record. The first level is the same as DMI, dividing the dataset into a number of horizontal segments obtained by the leaves of the decision tree; in the second level, however, it adds k-NN into the imputation process instead of executing the EMI algorithm directly [21]. The correlations among the features for the records within a leaf are higher than those over the whole dataset; however,
kDMI further makes use of k-NN in order to find the k records that are most similar to x_i among all records that fall into the leaf where x_i lands. kDMI finally feeds the best k nearest neighbours into the EMI algorithm in order to impute the missing scores of x_i.

Locality constrained sparse representation (LCSR) is a state-of-the-art technique that estimates incomplete records by a sparse linear aggregation of complete records, using sparse coefficients with the characteristics of sparsity, smoothness and preserved locality structure [22]. Low-rank matrix completion (LRMC) is a state-of-the-art technique that aims to estimate missing data by imposing a low-rank constraint and minimizing the rank of the matrix [38]. LRMC basically depends on the correlation among records and features of a matrix as a whole and, therefore, discards the specificity of individual records. In [19], two advanced variations of LRMC have been proposed for missing data imputation. Correlation-based LRMC (CLRMC) focuses on each record and looks for temporal and spatial correlation. A further enhanced version with ensemble learning (CLRMC-EN) has a multiple imputation framework, creating multiple estimations and then integrating them in order to improve the final estimation [19].

These state-of-the-art techniques for missing data imputation, EMI, kNNI, DMI, kDMI, LCSR, LRMC, CLRMC, and CLRMC-EN, are selected as a set of competitors. In Section 4, these missing data imputation techniques are compared to the proposed techniques, kEMI and kEMI+, which are described in Section 3. The techniques are compared in terms of two imputation performance criteria, namely the normalized root mean square difference (NRMS) and the absolute error (AE), which are defined as follows:
    NRMS = \frac{\| X_{estimated} - X_{complete} \|_F}{\| X_{complete} \|_F}    (2)
where X_complete and X_estimated are the originally complete dataset and the imputed dataset, respectively, and \|.\|_F stands for the Frobenius norm. For datasets with categorical features, the techniques are compared in terms of another performance criterion, called the absolute error (AE) [26], which is defined as follows:

    AE = \frac{1}{m} \sum_{i=1}^{m} I(\hat{x}_i = x_i)    (3)

where m is the number of records and I(.) stands for an indicator function that returns 1 if the estimated value \hat{x}_i and the real value x_i are the same (i.e., if the condition is true), and 0 otherwise.
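Both criteria can be computed directly from Eqs. (2) and (3); the sketch below is a straightforward transcription with illustrative variable names.

    import numpy as np

    def nrms(X_estimated, X_complete):
        """Normalized root mean square difference, Eq. (2)."""
        return np.linalg.norm(X_estimated - X_complete, 'fro') / np.linalg.norm(X_complete, 'fro')

    def absolute_error(x_hat, x):
        """AE of Eq. (3): fraction of entries imputed with exactly the original value."""
        x_hat, x = np.asarray(x_hat), np.asarray(x)
        return np.mean(x_hat == x)

    # toy usage
    X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
    X_imp = np.array([[1.1, 2.0], [3.0, 3.8]])
    print(nrms(X_imp, X_true))                               # small value: close imputation
    print(absolute_error(['A', 'B', 'A'], ['A', 'C', 'A']))  # 2/3 of the entries recovered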
3. Proposed novel missing data imputation techniques

This paper initially proposes a novel missing data imputation technique, called kEMI, which integrates the kNNI and the Expectation–Maximization based Imputation (EMI) algorithms. The general scheme for kEMI is illustrated in Fig. 1. It makes use of kNNI for pre-imputation and automatically finds the most suitable k value, and then feeds the best k nearest neighbours along with the incomplete target record to the EMI algorithm for the posterior-imputation. The EMI algorithm generally performs well on datasets having higher correlations. kEMI, accordingly, focuses on the feature correlations among the records within the best k nearest neighbours, which are generally higher than the feature correlations over the whole dataset.

In this work, kNN is selected for pre-imputation since it is one of the most common approaches in machine learning and data mining for finding local similarities. Other techniques such as k-means and fuzzy c-means could also be used for the pre-imputation; the key parameters of these algorithms can be tuned by minimizing the root mean square difference between the estimated and the original values. However, kNN is preferred here because changing the k value selects different records as donors to contribute to the estimation of the missing scores of the target record. The diversity of donors might be limited in techniques such as k-means and fuzzy c-means, since only the cluster centroids contribute to the estimation of the missing scores of the target records. Moreover, changing the k value in kNN changes the number of nearest neighbours (donors), whereas changing k in k-means and other parameters in clustering-based imputation techniques changes the data partitioning. This matters for kEMI+, since its information fusion module requires various initial estimations, and changing k in k-means can result in the generation of suboptimal partitions. As for the posterior-imputation, any technique that looks for the correlation among the features can be selected. EM is a very common method that looks for the global similarity structure and is used here for the posterior-imputation in order to increase the likelihood at each iteration.

The pseudo-code of kEMI is presented in Algorithm 1. The main steps of the kEMI algorithm are explained in the following:

Step 1 - It splits the dataset X into an observed subset X_obs and an incomplete subset X_mis.
Step 2 - It creates a pool P^t = X_obs.
Step 3 - It then selects an incomplete record x_i from X_mis and adds it into the pool, P^t = P^t ∪ {x_i}.
Step 4 - It then excludes the target features j (corresponding to missing scores in x_i) from the pool P^t and creates a complete subset S^t.
Step 5 - It then induces a random missingness in x_i within S^t.
Step 6 - It consequently imputes the newly generated missing score x_iz ∈ x_i by feeding S^t into the pre-imputation module. The pre-imputation module varies k within the range [2, ..., m_p − 1], where m_p stands for the number of records in P^t, and iteratively makes use of kNNI for estimating the missing score x̂_iz.
For a given k, the pre-imputation module estimates x_iz, returns an estimation x̂_iz, and consequently computes the RMSE, since the actual value of this synthetic missing score x_iz is available. The RMSE is the root mean square error between the actual subset S^t and the estimated subset Ŝ^t, defined as:
    RMSE = \sqrt{ \frac{1}{m_p d_p} \sum_{i=1}^{m_p} \sum_{j=1}^{d_p} \left( x_{ij} - \hat{x}_{ij} \right)^2 }    (4)

in which m_p and d_p stand for the number of records and the number of features in S^t, respectively.
Algorithm 1: kEMI
Input: A dataset with missing values X_{m×d}
Output: A dataset with all missing scores imputed, X̂
Definitions: X is the dataset; X_obs is the complete subset of X; X_mis is the incomplete subset of X; P^t is the pool, initialized to X_obs; x_i is the target record from X_mis; x̂_i is the estimated record of x_i; X_j is an incomplete feature; x̂_iz is the estimated value of x_iz; S^t is the subset of P^t excluding the incomplete features; Ŝ^t is the estimated S^t; m_p is the number of records in P^t; α_i are the k nearest neighbours of x_i; Θ_i is a subset made up of α_i and x_i; Θ̂_i is the estimated Θ_i.
begin
    Split X into X_obs and X_mis subsets
    Create a pool P^t = X_obs
    while card{X_mis} ≠ ∅ do
        P^t = P^t ∪ {x_i}    /* Select an incomplete record from X_mis and add it into the pool */
        for j = 1, ..., d do
            if x_ij = NaN then
                S^t = P^t \ {X_j}    /* Create a complete subset by excluding features with missing values */
            end
        end
        x_iz = NaN    /* Induce a random missing score into x_iz */
        for k = 2, ..., m_p − 1 do
            α_i ← kNN(x_i, S^t, k)    /* Find the k nearest neighbours of x_i within S^t using kNN */
            x̂_iz ← kNNI(x_i, α_i, k)    /* Impute x_iz using kNNI and the k nearest neighbours α_i */
            RMSE_k ← RMSE(x_iz, x̂_iz)    /* Calculate the RMSE between the actual x_iz and the imputed x̂_iz */
        end
        k ← arg min_k {RMSE_k}    /* Find the lowest RMSE and return the corresponding k */
        α_i ← kNN(x_i, P^t, k)    /* Find the k nearest neighbours of x_i within P^t using kNN */
        Θ_i = {α_i ∪ x_i}    /* Create a subset made up of x_i and α_i */
        x̂_ij ← EMI(Θ_i)    /* Impute the real missing scores x_ij in x_i using EMI and return x̂_i */
        X_obs = X_obs ∪ {x̂_i};  X_mis = X_mis \ {x_i}    /* Update X_obs and X_mis */
    end
    X̂ = X_obs
end
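The pre-imputation loop of Algorithm 1 can be sketched as follows; it is a simplified, single-probe illustration of how the best k is selected, it reuses the knn_impute_entry helper sketched in Section 2, and all names are ours. The full algorithm repeats this per incomplete record and then feeds the selected donors to EMI.

    import numpy as np

    def best_k_for_record(pool, target, rng=np.random.default_rng(0)):
        """Pre-imputation of Algorithm 1 (simplified): choose the k that best recovers
        a synthetically removed score of the target record within the pool."""
        obs_idx = np.flatnonzero(~np.isnan(target))
        z = rng.choice(obs_idx)                   # induce one random missing score x_iz
        actual = target[z]
        probe = target.copy()
        probe[z] = np.nan
        S = np.vstack([pool, probe])              # S^t: the pool plus the probed target record
        errors = {}
        for k in range(2, len(pool) + 1):         # k = 2, ..., m_p - 1
            estimate = knn_impute_entry(S, len(S) - 1, z, k)
            errors[k] = abs(estimate - actual)    # Eq. (4) reduces to |x_iz - x̂_iz| for one score
        return min(errors, key=errors.get)        # the k with the lowest error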
The pre-imputation step is performed iteratively m_p − 1 times, and the module returns m_p − 1 estimations and their RMSE values.
Step 7 - It then sorts the attained m_p − 1 RMSE values in ascending order and finds the k value that results in the lowest RMSE value.
Step 8 - kEMI then goes back to step 3 and feeds the pool P^t, which contains the incomplete record x_i, into the posterior-imputation module. The posterior-imputation module selects the k nearest neighbours α_i = {x_l}_{l=1}^{k} of x_i by means of the best value found for k in step 7.
Step 9 - It then creates a subset Θ_i = {α_i ∪ x_i} and feeds it to EMI. EMI imputes x_ij ∈ x_i and returns the completed subset Θ̂_i. If the missing feature is numerical, it imputes missing scores by means of the EMI algorithm on the best k nearest neighbours. If the missing feature is categorical, it imputes missing scores by means of the most frequent value of the target feature among the best k nearest neighbours α_i.
Step 10 - kEMI then selects the completed record x̂_i from Θ̂_i and updates X_obs = X_obs ∪ {x̂_i} and X_mis = X_mis \ {x_i}, respectively. kEMI iteratively returns to step 2 as long as card{X_mis} ≠ ∅ for imputing the rest of the incomplete records in X_mis.

This paper then proposes another novel missing data imputation technique, called kEMI+, which is an extension of kEMI obtained by integrating an information fusion module. The general scheme for kEMI+ is illustrated in Fig. 2. kEMI+ uses kNNI for pre-imputation and automatically finds a set of best k values; then, for each k, it feeds the k nearest neighbours along with the incomplete target record to the EMI module for the posterior-imputation. It iteratively calls EMI, collects a set of estimations for each missing score, and then feeds those estimations to the information fusion module. This module makes use of the Dempster–Shafer Fusion (DSF) strategy to combine these estimations.

The Dempster–Shafer (DS) theory, also known as evidence theory, was first proposed in [49]. As an uncertainty reasoning method, the DS theory has been widely used in expert systems to handle risk assessment, information reliability evaluation, and uncertainty management [34,50–53]. The DS theory is a natural way to combine the probabilities of different outcomes. It provides a simple method, Dempster's combination rule, to combine multiple pieces of evidence; hence, its application has been extended to the information fusion area [34,50–53]. The Dempster–Shafer theory focuses on combining separate pieces of information in order to calculate the probability of a certain event by using belief functions and plausible reasoning. Considering the estimations obtained by different k values as probability events, the DS theory can be used to compute and combine evidences.

The DSF module initially forms a matrix of estimations by arranging the top n² imputed values; each k value results in one estimation of a missing score, and in general there exists uncertainty in trusting the estimation of a certain k value. It then calculates the mean of each column of the matrix to obtain the imputed scores templates and, from them, the proximities. It then calculates the belief degrees by using the proximities; calculating the belief degrees helps to determine the credence of all imputed values. It then combines all the belief degrees in each column to find the final support, and finally fuses the imputed values by means of the final supports to obtain the final estimation.

The pseudo-code of kEMI+ is presented in Algorithm 2. The main steps of the kEMI+ algorithm are explained in the following:
(1) The initial steps of kEMI+ are the same as those of kEMI (Step 1 to Step 7).
Step 8 - It then sorts the attained RMSE values in ascending order and selects the top n² values of k and their corresponding k nearest neighbours α_i = {x_l}_{l=1}^{k}, in which n = max{Q}, Q = {n | n² ≤ m_p − 1, n ∈ N*}, and N* stands for the set of all natural positive integers.
Step 9 - It then creates n² subsets Θ_i(q) = {α_i(q) ∪ x_i}, q = 1, ..., n², and feeds them to EMI for the posterior-imputation. EMI iteratively imputes the missing scores in x_i and returns n² estimations x̂_ij.
Step 10 - kEMI+ then forms an imputed scores profile (Γ) from the n² imputed values x̂_ij, as shown below:

    \Gamma = \begin{bmatrix} \gamma_{1,1} & \cdots & \gamma_{1,n} \\ \gamma_{2,1} & \cdots & \gamma_{2,n} \\ \vdots & \ddots & \vdots \\ \gamma_{n,1} & \cdots & \gamma_{n,n} \end{bmatrix}    (5)

in which {γ_{1,1}, γ_{1,2}, γ_{1,3}, ..., γ_{n,n}} are the top n² imputed values.
Step 11 - kEMI+ then computes the imputed scores templates ψ = {ψ_j | j = 1, ..., n} by calculating the average over each column of the profile Γ, as shown below:

    \psi_j = \frac{1}{n} \sum_{i=1}^{n} \gamma_{i,j}    (6)
Step 12 - kEMI+ then calculates the proximity between ψ_j and each row γ_i of Γ, as shown below:

    \phi_{i,j} = \frac{\left( 1 + \| \psi_j \times J_{1,n} - \gamma_i \| \right)^{-1}}{\sum_{j=1}^{n} \left( 1 + \| \psi_j \times J_{1,n} - \gamma_i \| \right)^{-1}}    (7)
in which J_{1,n} is a 1 × n matrix whose elements are all equal to one, γ_i is the ith row of Γ, and \|.\| stands for the matrix norm.
Step 13 - It then calculates the belief degrees in order to determine the credence of all n² imputed values, as shown below:

    \beta_j(\gamma_i) = \frac{\phi_{i,j} \prod_{k \neq j} (1 - \phi_{i,k})}{1 - \phi_{i,j} \left[ 1 - \prod_{k \neq j} (1 - \phi_{i,k}) \right]}    (8)
Step 14 - According to the Dempster–Shafer rule, the final support for the imputed values in the jth column is calculated by multiplying the belief degrees β_j(γ_i) of all rows together; kEMI+ thus obtains the final support as follows:
    \lambda_j = \prod_{i=1}^{n} \beta_j(\gamma_i)    (9)
Step 15 - kEMI+ forms a Λ matrix by using λj , as shown below:
    \Lambda = \begin{bmatrix} \lambda_1 & \cdots & \lambda_j & \cdots & \lambda_n \\ \lambda_1 & \cdots & \lambda_j & \cdots & \lambda_n \\ \vdots & & \vdots & & \vdots \\ \lambda_1 & \cdots & \lambda_j & \cdots & \lambda_n \end{bmatrix}    (10)
Step 16 - Consequently kEMI+ fuses the imputed values by means of the following formula in order to obtain the final estimation γˆfinal :
    \hat{\gamma}_{final} = \frac{\| \Gamma + \Lambda \| + \| \Gamma - \Lambda \|}{n!}    (11)
Step 17 - kEMI+ then returns the imputed score x̂_ij, which is equal to the final estimation γ̂_final. It replaces x̂_ij in x_i and returns x̂_i. After that, kEMI+ updates X_obs = X_obs ∪ {x̂_i} and X_mis = X_mis \ {x_i}, respectively. kEMI+ iteratively returns to Step 2 as long as card{X_mis} ≠ ∅ for imputing the rest of the incomplete records in X_mis.

To treat categorical features, kEMI+ performs the same steps up to Step 10, except that mode imputation is used instead of the EMI algorithm employed for numerical features. Once the imputed profile Γ is formed in Step 10, it performs the following four steps in order to obtain the final estimation.
Step 11 - kEMI+ computes the frequency of occurrence of each distinct imputed value in Γ. For instance, given the following imputed profile
    \Gamma = \begin{bmatrix} A & B & A \\ C & A & D \\ B & A & D \end{bmatrix}    (12)
Please cite this article as: R. Razavi-Far, B. Cheng, M. Saif et al., Similarity-learning information-fusion schemes for missing data imputation, Knowledge-Based Systems (2019), https://doi.org/10.1016/j.knosys.2019.06.013.
6
R. Razavi-Far, B. Cheng, M. Saif et al. / Knowledge-Based Systems xxx (xxxx) xxx
Fig. 2. The proposed kEMI+ scheme, which contains three modules for pre-imputation, posterior-imputation, and Dempster–Shafer fusion.
the frequency of each distinct estimation, such as A, is calculated as the number n_A of times A occurs in Γ. This results in n_A = 4, n_B = 2, n_C = 1, n_D = 2.
Step 12 - kEMI+ then calculates the average over the k_q values corresponding to each distinct imputed value, as presented in Table 1.
Step 13 - kEMI+ then calculates the score ς_A of each distinct estimation A by multiplying its frequency n_A with the corresponding average k̄_q, i.e., ς_A = n_A k̄_q. Given the above example, the resulting scores are reported in Table 2.
Step 14 - The distinct estimation with the highest score is returned as the final estimation.

Example. A numerical example is provided here in order to show the efficiency of the fusion module in enhancing estimations. Given a target record with one missing score whose original value is 1.14, running kEMI+ yields imputed values that form the following 3 × 3 profile of the top nine estimations:
    \Gamma = \begin{bmatrix} 1.1 & 1.2 & 1.3 \\ 1.4 & 1.5 & 1.6 \\ 1.7 & 1.8 & 1.9 \end{bmatrix}    (13)
Then, by tracing the fusion steps in Algorithm 2 (Steps 11 to 17), one can obtain a final estimation equal to 1.1503. In step 11, according to Eq. (6), ψ_1 = 1.4, ψ_2 = 1.5, ψ_3 = 1.6. In step 12, ϕ_{1,1} = 0.3706, ϕ_{1,2} = 0.3310, ϕ_{1,3} = 0.2983, ϕ_{2,1} = 0.3255, ϕ_{2,2} = 0.3490, ϕ_{2,3} = 0.3255, ϕ_{3,1} = 0.2983, ϕ_{3,2} = 0.3310, ϕ_{3,3} = 0.3706. In step 13, β_1(γ_1) = 0.2166, β_1(γ_2) = 0.1749, β_1(γ_3) = 0.1518, β_2(γ_1) = 0.1793, β_2(γ_2) = 0.1960, β_2(γ_3) = 0.1793, β_3(γ_1) = 0.1518, β_3(γ_2) = 0.1749, β_3(γ_3) = 0.2166. In step 14, λ_1 = 0.0057, λ_2 = 0.0063, λ_3 = 0.0057. Finally, in step 16, γ̂_final = 1.1503. Thus, the estimation has been improved.
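For reference, the fusion steps of Eqs. (6)–(9) can be transcribed directly into code as below; with the profile of Eq. (13) it reproduces the worked values ψ = [1.4, 1.5, 1.6] and λ ≈ [0.0057, 0.0063, 0.0057] (up to rounding). The Euclidean norm in Eq. (7) and the Frobenius norm in the final aggregation of Eq. (11) are our reading of the notation, so the last returned value is only indicative.

    import math
    import numpy as np

    def ds_fuse(Gamma):
        """Dempster-Shafer fusion of an n-by-n profile of imputed values, Eqs. (6)-(11)."""
        Gamma = np.asarray(Gamma, dtype=float)
        n = Gamma.shape[0]
        psi = Gamma.mean(axis=0)                                   # Eq. (6): column templates
        # Eq. (7): proximity of each row gamma_i to each template psi_j
        dist = np.array([[np.linalg.norm(psi[j] * np.ones(n) - Gamma[i]) for j in range(n)]
                         for i in range(n)])
        phi = 1.0 / (1.0 + dist)
        phi = phi / phi.sum(axis=1, keepdims=True)
        beta = np.empty((n, n))                                    # Eq. (8): belief degrees
        for i in range(n):
            for j in range(n):
                prod = np.prod([1.0 - phi[i, k] for k in range(n) if k != j])
                beta[i, j] = phi[i, j] * prod / (1.0 - phi[i, j] * (1.0 - prod))
        lam = beta.prod(axis=0)                                    # Eq. (9): final supports
        Lam = np.tile(lam, (n, 1))                                 # Eq. (10)
        fused = (np.linalg.norm(Gamma + Lam) + np.linalg.norm(Gamma - Lam)) / math.factorial(n)
        return lam, fused                                          # Eq. (11), Frobenius reading

    lam, fused = ds_fuse([[1.1, 1.2, 1.3], [1.4, 1.5, 1.6], [1.7, 1.8, 1.9]])
    print(lam)    # ~[0.0057, 0.0063, 0.0057], matching the worked example up to rounding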
4. Experimental results

The proposed missing data imputation techniques, kEMI and kEMI+, are evaluated by applying them to the twenty-one datasets shown in Table 3 and compared with eight selected state-of-the-art missing data imputation techniques, including EMI, kNNI, DMI, kDMI, LCSR, LRMC, CLRMC, and CLRMC-EN. The datasets are complete and publicly available; they have no natural missing values. The missing scores are induced into these datasets in different manners, generating various missingness patterns. The four induced missing patterns are simple, medium, complex and blended [48,54]. In the simple pattern, a record can have at most one missing score. In the medium pattern, a record has a minimum of two and a maximum of 50% of its scores missing. In the complex pattern, an incomplete record has missing values in a minimum of 50% and a maximum of 80% of its scores. The blended pattern represents a scenario where incomplete records follow a mixture of the three patterns: 25% simple, 50% medium, and 25% complex [48,54]. In this paper, different missing ratios from 1% to 20% are also considered. Besides, two missing models are used, called uniformly distributed (UD) missing values and overall missing values [48,54]. In the UD model, each feature has the same number of missing values; in the overall model, a feature can have a higher number of missing values than another feature [48,54]. These settings result in the generation of 32 missing combinations (four missing ratios × four missing patterns × two missing models) over each dataset.
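As an illustration of the experimental protocol, the sketch below induces the simple pattern under the overall model; the medium, complex and blended patterns and the UD model require additional per-record and per-feature bookkeeping, and the function and parameter names are ours.

    import numpy as np

    def induce_simple_overall(X, ratio, rng=np.random.default_rng(0)):
        """Induce the 'simple' pattern under the 'overall' model: remove a given
        fraction of all scores with at most one missing score per record."""
        X = X.astype(float).copy()
        m, d = X.shape
        n_missing = int(round(ratio * m * d))
        if n_missing > m:
            raise ValueError("the simple pattern allows at most one missing score per record")
        rows = rng.choice(m, size=n_missing, replace=False)   # records that become incomplete
        cols = rng.integers(0, d, size=n_missing)             # one random feature per such record
        X[rows, cols] = np.nan
        return X

    # e.g. a 5% missing ratio on a 100-by-10 matrix removes 50 scores in 50 distinct records
    X_incomplete = induce_simple_overall(np.random.rand(100, 10), ratio=0.05)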
Algorithm 2: kEMI+
Input: X is an incomplete dataset.
Output: X̂ is the completed dataset.
Definitions: The definitions of kEMI also apply to kEMI+; n satisfies n = max{Q}, Q = {n | n² ≤ m_p − 1, n ∈ N*}; q is a counter.
begin
    Split X into X_obs and X_mis subsets
    Create a pool P^t = X_obs
    while card{X_mis} ≠ ∅ do
        P^t = P^t ∪ {x_i}    /* Select an incomplete record from X_mis and add it into the pool */
        for j = 1, ..., d do
            if x_ij = NaN then
                S^t = P^t \ {X_j}    /* Create a complete subset by excluding incomplete features */
            end
        end
        x_iz = NaN    /* Induce a random missing score into x_iz */
        for k = 2, ..., m_p − 1 do
            α_i ← kNN(x_i, S^t, k)    /* Find the k nearest neighbours of x_i within S^t using kNN */
            x̂_iz ← kNNI(x_i, α_i, k)    /* Impute x_iz using kNNI and the k nearest neighbours α_i */
            RMSE_k ← RMSE(x_iz, x̂_iz)    /* Calculate the RMSE between the actual x_iz and the imputed x̂_iz */
        end
        for q = 1, ..., n² do
            k_q ← arg min_k {RMSE_k}    /* Find the lowest remaining RMSE and return the corresponding k_q */
            RMSE_k = RMSE_k \ {RMSE_{k_q}}    /* Update the candidate set */
            α_i(q) ← kNN(x_i, P^t, k_q)    /* Find the k_q nearest neighbours of x_i within P^t using kNN */
            Θ_i(q) = {α_i(q) ∪ x_i}    /* Create a subset made up of x_i and α_i(q) */
            x̂_ij ← EMI(Θ_i(q))    /* Impute the real missing scores x_ij in x_i using EMI */
            γ_q = x̂_ij    /* Store the imputed values in γ_q */
        end
        Form Γ as shown in Eq. (5)
        Calculate ψ_j as shown in Eq. (6)
        Calculate the proximity ϕ_{i,j} as shown in Eq. (7)
        Calculate the belief degrees β_j(γ_i) as shown in Eq. (8)
        Calculate the final support λ_j as shown in Eq. (9)
        Form the matrix Λ by using λ_j as shown in Eq. (10)
        Obtain the final estimation γ̂_final by Eq. (11)
        Replace x̂_ij = γ̂_final and return x̂_i
        X_obs = X_obs ∪ {x̂_i};  X_mis = X_mis \ {x_i}    /* Update X_obs and X_mis */
    end
    X̂ = X_obs    /* Return the completed dataset */
end
In these experiments, we also examine whether the properties of the dataset itself, such as the missing data pattern, influence the accuracy of imputation. All datasets are rearranged so that one can distinguish between monotone and non-monotone missing scores. Fig. 3 illustrates the collapse representation of the Zoo dataset with the complex missing pattern, a 20% missing ratio and the overall missing model. This collapse representation helps in visualizing monotone and non-monotone missing scores. The first row contains a sequence of numbers standing for the feature numbers of the original dataset before the rearrangement. The last row contains a sequence of numbers standing for the number of missing scores in each feature. The leftmost column contains a sequence of numbers, where each number represents how many records have missing values in the same feature or combination of features (i.e., the same missingness pattern).
Table 1
Distinct estimations in Γ and their corresponding k_q values.
Estimation in Γ                      | Distinct estimation | k_q values | Average k̄_q
γ_{1,1}, γ_{1,3}, γ_{2,2}, γ_{3,2}   | A                   | 1, 3, 5, 7 | 4
γ_{1,2}, γ_{3,1}                     | B                   | 2, 8       | 5
γ_{2,1}                              | C                   | 6          | 6
γ_{2,3}, γ_{3,3}                     | D                   | 4, 9       | 6.5

Table 2
Distinct estimations in Γ and their corresponding scores.
Distinct estimation | Frequency | k̄_q | Score
A                   | 4         | 4    | 16
B                   | 2         | 5    | 10
C                   | 1         | 6    | 6
D                   | 2         | 6.5  | 13
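The scores in Table 2 follow directly from Steps 11–13 for categorical features; a small sketch of that scoring, fed with the (estimation, k_q) pairs of Table 1, is given below (the function name is ours).

    from collections import defaultdict

    def fuse_categorical(estimates):
        """Steps 11-14 for categorical features: score each distinct estimation by
        its frequency times the average of its k_q values and return the best one."""
        k_values = defaultdict(list)
        for value, k_q in estimates:
            k_values[value].append(k_q)
        scores = {v: len(ks) * (sum(ks) / len(ks)) for v, ks in k_values.items()}
        return max(scores, key=scores.get), scores

    # the example of Tables 1 and 2, read row by row from the profile of Eq. (12)
    estimates = [('A', 1), ('B', 2), ('A', 3), ('C', 6), ('A', 5), ('D', 4),
                 ('B', 8), ('A', 7), ('D', 9)]
    winner, scores = fuse_categorical(estimates)
    print(scores)   # {'A': 16.0, 'B': 10.0, 'C': 6.0, 'D': 13.0}
    print(winner)   # 'A', the distinct estimation with the highest score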
Fig. 3. Collapse representation for an incomplete Zoo dataset with complex missingness, missing ratio of 20%, and overall missing model. Each white (green) (red) cell stands for an observed (monotone missing) (non-monotone missing) score. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The green cells stand for the monotone missing scores, the red cells stand for the non-monotone missing scores, and the white cells stand for the observed values. The attained results indicate that the missingness pattern may slightly influence the estimations. For instance, the incomplete dataset containing 101 records (the leftmost column in the last row) is presented with the minimum possible number of rows in Fig. 3; thus, the 101 records of the dataset are collapsed into 21 rows (rows 2 to 22 of the table). For example, the 10th feature is missing in 15 records. The number of complete records is 81 (the second row of the table). The 4th row shows a record in which features 10, 12, 1, 16, 14, 7, 8, 15 and 2 are missing together; this type of missingness occurs once in this incomplete dataset (the leftmost column).
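A collapse representation of this kind can be produced by grouping records on their missingness pattern; a minimal sketch with our own naming is given below.

    import numpy as np
    from collections import Counter

    def collapse_representation(X):
        """Group records by their set of missing features and count each pattern,
        mirroring the leftmost column and the last row of Fig. 3."""
        mask = np.isnan(X)
        patterns = Counter(tuple(np.flatnonzero(row)) for row in mask)   # records per pattern
        per_feature = mask.sum(axis=0)                                   # missing scores per feature
        return patterns, per_feature

    X = np.array([[1, 2, 3], [1, np.nan, 3], [4, np.nan, 6], [np.nan, np.nan, 9]])
    patterns, per_feature = collapse_representation(X)
    print(patterns)      # pattern (1,) occurs in two records; () and (0, 1) in one each
    print(per_feature)   # [1 3 0]: feature 1 is missing in three records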
Table 3
The datasets used in these experiments.
Datasets     | # of records | # of numerical features | # of categorical features
BCW          | 683   | 9  | 0
BUPA         | 345   | 6  | 0
Dermatology  | 366   | 34 | 0
Glass        | 214   | 9  | 0
Ionosphere   | 351   | 34 | 0
Iris         | 150   | 4  | 0
Letter       | 20000 | 16 | 0
PID          | 768   | 8  | 0
Sheart       | 270   | 13 | 0
Sonar        | 208   | 60 | 0
Spam         | 4597  | 57 | 0
Wine         | 178   | 13 | 0
Yeast        | 1484  | 8  | 0
Zoo          | 101   | 17 | 0
Car          | 1728  | 0  | 6
House Votes  | 234   | 0  | 16
Mushroom     | 5644  | 0  | 22
Promoters    | 106   | 0  | 57
SPECT        | 267   | 0  | 22
Splice       | 3190  | 0  | 60
TTTEG        | 958   | 0  | 9
These imputation techniques are evaluated by imputing the missing scores over the resulting 672 incomplete datasets in terms of the normalized root mean square (NRMS) difference and the absolute error (AE).

Table 4 shows the average NRMS values obtained over the 32 experiments for each of the fourteen numerical datasets by means of each missing data imputation technique. The lowest NRMS value among all of these imputation techniques indicates the best imputation technique. The experimental results indicate that kEMI+, followed by kEMI, performs significantly better than EMI, DMI, kDMI, kNNI, LCSR, LRMC, CLRMC, and ECLRMC. The reported results in Table 4 show that kEMI+ has the lowest averaged NRMS for all fourteen datasets (see bold entries in Table 4).

Fig. 4 illustrates the boxplot of the imputation performance assessment attained for all numerical datasets listed in Table 3 with different missing ratios (1%, 5%, 10% and 20%), missing models and missingness patterns. Each box contains 448 NRMS values attained by each imputation method. A lower NRMS value indicates a better imputation performance. The obtained outcomes show that kEMI+ and kEMI outperform the other missing data imputation techniques. kEMI+ has the lowest average over all NRMS values and is the most stable technique (smallest box in Fig. 4). Fig. 5 illustrates the distribution of the averaged NRMS values over all experiments performed on all numerical datasets listed in Table 3. Each box contains 14 averaged NRMS values attained by each imputation method. The attained averaged NRMS values also show that kEMI+ outperforms the other missing data imputation techniques.

Table 5 reports the averaged NRMS values over all numerical datasets obtained by each missing data imputation technique for each missing pattern. The bold entries in Table 5 stand for the lowest averaged NRMS values. The reported results show that kEMI+ outperforms the other missing data imputation techniques for all missing data patterns. Table 6 reports the averaged NRMS values over all numerical datasets obtained by each missing data imputation technique for each missing ratio. The bold entries in Table 6 stand for the lowest averaged NRMS values. The reported results show that kEMI+ outperforms the other missing data imputation techniques for all missing ratios. Table 7 reports the averaged NRMS values over all numerical datasets obtained by each missing data imputation technique for each missing model. The bold entries in Table 7 stand for the lowest averaged NRMS values. The reported results show that kEMI+ outperforms the other missing data imputation techniques for all missing models.
Fig. 4. Distribution of the NRMS values attained by each technique. The solid dashes indicate 1st and 99th percentiles, and the solid squares stand for the averaged NRMS value attained by each imputation technique. The red crosses stand for outliers.
Fig. 5. Distribution of the averaged NRMS values attained by each missing data imputation technique.
Table 4
The average results (NRMS) over 32 experiments obtained for each of the fourteen datasets.
Datasets   | kNNI  | EMI   | DMI   | kDMI  | LCSR  | LRMC  | CLRMC | ECLRMC | kEMI  | kEMI+
Glass      | 0.115 | 0.032 | 0.011 | 0.008 | 0.039 | 0.121 | 0.110 | 0.118  | 0.005 | 0.004
Iris       | 0.115 | 0.073 | 0.046 | 0.057 | 0.042 | 0.117 | 0.070 | 0.082  | 0.022 | 0.020
Sonar      | 0.057 | 0.080 | 0.073 | 0.069 | 0.039 | 0.052 | 0.050 | 0.045  | 0.034 | 0.027
Wine       | 0.108 | 0.069 | 0.048 | 0.057 | 0.038 | 0.109 | 0.109 | 0.111  | 0.027 | 0.010
Zoo        | 0.119 | 0.115 | 0.104 | 0.095 | 0.060 | 0.096 | 0.107 | 0.090  | 0.060 | 0.036
Bupa       | 0.121 | 0.082 | 0.065 | 0.075 | 0.078 | 0.136 | 0.095 | 0.104  | 0.062 | 0.055
Sheart     | 0.120 | 0.053 | 0.042 | 0.039 | 0.058 | 0.138 | 0.136 | 0.131  | 0.030 | 0.028
Derm       | 0.117 | 0.077 | 0.068 | 0.069 | 0.064 | 0.098 | 0.118 | 0.104  | 0.056 | 0.053
Ionosphere | 0.098 | 0.188 | 0.183 | 0.127 | 0.088 | 0.097 | 0.104 | 0.095  | 0.078 | 0.061
BCW        | 0.169 | 0.102 | 0.077 | 0.080 | 0.097 | 0.163 | 0.151 | 0.151  | 0.068 | 0.047
PID        | 0.162 | 0.106 | 0.077 | 0.091 | 0.085 | 0.176 | 0.138 | 0.146  | 0.064 | 0.048
Yeast      | 0.124 | 0.112 | 0.097 | 0.099 | 0.087 | 0.121 | 0.124 | 0.118  | 0.073 | 0.056
Spam       | 0.152 | 0.148 | 0.127 | 0.118 | 0.100 | 0.147 | 0.151 | 0.147  | 0.083 | 0.055
Letter     | 0.171 | 0.223 | 0.204 | 0.169 | 0.139 | 0.169 | 0.176 | 0.167  | 0.115 | 0.079
Table 5
The averaged NRMS results over all numerical datasets obtained by each missing data imputation technique for each missing pattern.
Missing patterns | kNNI  | EMI   | DMI   | kDMI  | LCSR  | LRMC  | CLRMC | ECLRMC | kEMI  | kEMI+
Simple           | 0.105 | 0.090 | 0.080 | 0.072 | 0.066 | 0.087 | 0.081 | 0.083  | 0.048 | 0.032
Medium           | 0.114 | 0.107 | 0.082 | 0.075 | 0.072 | 0.124 | 0.115 | 0.110  | 0.053 | 0.043
Complex          | 0.146 | 0.116 | 0.093 | 0.092 | 0.085 | 0.164 | 0.157 | 0.155  | 0.059 | 0.046
Blended          | 0.134 | 0.103 | 0.093 | 0.089 | 0.066 | 0.122 | 0.117 | 0.111  | 0.060 | 0.043
Table 6
The averaged NRMS results over all numerical datasets obtained by each missing data imputation technique for each missing ratio.
Missing ratios | kNNI  | EMI   | DMI   | kDMI  | LCSR  | LRMC  | CLRMC | ECLRMC | kEMI  | kEMI+
1%             | 0.049 | 0.041 | 0.014 | 0.014 | 0.031 | 0.035 | 0.028 | 0.030  | 0.013 | 0.007
5%             | 0.099 | 0.064 | 0.051 | 0.059 | 0.053 | 0.093 | 0.085 | 0.080  | 0.029 | 0.021
10%            | 0.162 | 0.079 | 0.091 | 0.082 | 0.053 | 0.127 | 0.115 | 0.114  | 0.052 | 0.033
20%            | 0.121 | 0.103 | 0.065 | 0.072 | 0.065 | 0.184 | 0.159 | 0.165  | 0.041 | 0.039
Table 7
The averaged NRMS results over all numerical datasets obtained by each missing data imputation technique for each missing model.
Missing models | kNNI  | EMI   | DMI   | kDMI  | LCSR  | LRMC  | CLRMC | ECLRMC | kEMI  | kEMI+
Overall        | 0.091 | 0.104 | 0.062 | 0.058 | 0.073 | 0.124 | 0.123 | 0.115  | 0.041 | 0.034
UD             | 0.141 | 0.104 | 0.097 | 0.092 | 0.071 | 0.124 | 0.116 | 0.115  | 0.063 | 0.043
The difference is more significant when the UD missing model is used.

Figs. 6–9 illustrate the attained averaged NRMS values for the simple, medium, blended, and complex missing patterns, respectively, over all numerical datasets obtained by each missing data imputation technique for each combination of missing ratio and missing model (eight missing combinations in total). These figures show that kEMI+ achieves the lowest averaged NRMS values and thus outperforms the other missing data imputation techniques. The attained results in Figs. 6–9 show that kEMI+ and kEMI perform better than the other competitors, and the difference becomes more significant when the UD missing model is used and the missing ratio increases.

Fig. 10 illustrates the boxplot of the imputation performance assessment attained for all categorical datasets listed in Table 3 with different missing ratios (1%, 5%, 10% and 20%), missing models and missingness patterns. Each box contains 224 AE values attained by each imputation method. This figure merely includes those techniques that can handle categorical datasets. A larger AE value indicates a better imputation performance. The obtained outcomes show that kEMI+ and kEMI outperform the other missing data imputation techniques. kEMI+ has the largest average over all AE values and is the most stable technique (smallest box in Fig. 10).
Fig. 6. The obtained averaged NRMS values for the simple missing pattern over all numerical datasets obtained by each missing data imputation technique for each combination of the missing ratio and the missing model.
Table 8
Friedman's tables based on the attained NRMS and AE values.
Values | Source  | SS      | df   | MS      | Chi-sq  | Prob > Chi-sq
NRMS   | Columns | 14819.1 | 9    | 1646.56 | 1617.41 | 0
NRMS   | Error   | 22122.9 | 4023 | 5.5     |         |
NRMS   | Total   | 36942   | 4479 |         |         |
AE     | Columns | 1125.5  | 4    | 281.376 | 450.34  | 3.66368e−96
AE     | Error   | 474     | 636  | 0.754   |         |
AE     | Total   | 1599.5  | 799  |         |         |
Fig. 7. The obtained averaged NRMS values for the medium missing pattern over all numerical datasets obtained by each missing data imputation technique for each combination of the missing ratio and the missing model.
Fig. 9. The obtained averaged NRMS values for the complex missing pattern over all numerical datasets obtained by each missing data imputation technique for each combination of the missing ratio and the missing model.
Fig. 8. The obtained averaged NRMS values for the blended missing pattern over all numerical datasets obtained by each missing data imputation technique for each combination of the missing ratio and the missing model.
Fig. 10. Distribution of the AE values attained by each technique. The solid dashes indicate 1st and 99th percentiles, and the solid squares stand for the average AE value attained by each imputation technique. The red crosses stand for outliers.
These missing data imputation techniques are then tested in order to statistically evaluate the differences between them. The Friedman rank test at the significance level α = 0.05 is first carried out. This test is used to check the differences among the missing data imputation algorithms; in other words, it determines whether there are one or more imputation algorithms whose performance can be regarded as significantly different. As shown in Table 8, the p-values for the attained NRMS (0) and AE (3.66368e−96) values are less than the significance level of 0.05, which results in the rejection of the null hypothesis and the conclusion that at least one of these techniques has a different effect. Therefore, the statistical assessment of the differences can be further made with a post-hoc test. This test compares all the algorithms in a pairwise manner and is based on the absolute difference of the average rankings of the imputation algorithms. For a significance level α, the test determines the critical difference (CD); if the difference between the average rankings of two algorithms is greater than the CD, then the null hypothesis that the two algorithms have the same performance is rejected.
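The test procedure can be reproduced with standard tools; the sketch below uses a toy score matrix, and the tabulated constant q_0.05 for the Nemenyi critical difference is shown only for K = 4 techniques, so both are illustrative placeholders rather than the values used in this paper.

    import numpy as np
    from scipy.stats import friedmanchisquare

    # rows: experiments (datasets/combinations), columns: imputation techniques (toy NRMS values)
    scores = np.array([[0.11, 0.08, 0.05, 0.03],
                       [0.12, 0.07, 0.06, 0.04],
                       [0.15, 0.10, 0.07, 0.05],
                       [0.13, 0.09, 0.06, 0.04]])
    stat, p_value = friedmanchisquare(*scores.T)     # one argument per technique
    print(stat, p_value)                             # reject the null hypothesis when p < 0.05

    # post-hoc Nemenyi critical difference over K techniques and N experiments
    N, K = scores.shape
    q_alpha = 2.569                                  # tabulated value for K = 4, alpha = 0.05
    CD = q_alpha * np.sqrt(K * (K + 1) / (6.0 * N))
    avg_ranks = scores.argsort(axis=1).argsort(axis=1).mean(axis=0) + 1
    print(CD, avg_ranks)                             # rank gaps larger than CD are significant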
Fig. 11. Critical difference diagram for the Nemenyi test. Comparison of all missing data imputation techniques against each other in terms of NRMS (a) AE (b). Panel a (b) shows the averaged rank of each imputation technique, where the first rank indicates the method with the lowest NRMS (highest AE) value and best performance. Groups of methods that are not significantly different at α = 0.05 are connected. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The resulting critical difference (CD) diagrams for the post-hoc Nemenyi tests [55] in terms of the NRMS and AE performance measures are shown in Fig. 11(a, b), respectively. They show the average ranks of the examined methods. Those groups of missing data imputation techniques that are not significantly different are connected by thick solid (red) lines; the difference between the average ranks of these groups of techniques is less than the CD value. Fig. 11a shows that LCSR is around the fourth rank, and kDMI and DMI are around the fifth rank. They are followed by ECLRMC, EMI, CLRMC, LRMC, and kNNI (k = 9). This confirms that kNNI is the least accurate technique in terms of NRMS, while kEMI+ and kEMI are the most accurate techniques in terms of NRMS. The figure indicates that kEMI+ and kEMI significantly outperform the other competitors. It also shows that kDMI and DMI are not significantly different from each other, although they significantly outperform ECLRMC, EMI, CLRMC, LRMC, and kNNI. Fig. 11b shows that kEMI and DMI are around the third rank; they significantly outperform kDMI and kNNI (k = 9), which are around the fourth and fifth ranks. This confirms that kNNI is the least accurate technique in terms of AE, while kEMI+ is the most accurate. The figure indicates that kEMI+ significantly outperforms kEMI, DMI, kDMI, and kNNI, and that kEMI and DMI significantly outperform kDMI and kNNI.

4.1. Complexity analysis

Let us consider a dataset with m records and d attributes, and suppose that there exist d′ incomplete attributes over the whole dataset and m′ records with one or more missing values. kEMI initially splits the dataset into two subsets, O(md). Then, a while loop is repeated m′ times, in which various operations are performed: first, a for loop is used to exclude incomplete features and create a complete subset, O(d). Then, another for loop, repeated m_p − 2 times, performs kNN for a record, O(m_p k + m_p d), and kNNI on the k nearest neighbours, O(k²d′ + k³). Finally, the RMSE is calculated, O(d′). After these loops the algorithm finds the best k value, O(m_p), and performs kNN with P^t, O(m_p k + m_p d). Finally, the imputation is performed using EMI on the k nearest neighbours, O(kd² + d³). Thus, the overall complexity of kEMI upon simplification is O(md + m′d + m′m_p²k + m′m_p²d + m′m_p k²d′ + m′m_p k³ + m′m_p d′ + m′kd² + m′d³). Assuming that m′ ≪ m, it can be concluded that m_p ≈ m, and, therefore, the complexity is O(m²k + m²d + mk²d′ + mk³ + kd² + d³).
Fig. 12. Runtimes w.r.t. the number of records. Runtimes of the proposed imputation techniques are shown by the shaded areas around the averaged runtimes over 32 incomplete datasets for each experiment.
The datasets used in these experiments are typical datasets in which the number of records is much larger than the number of attributes (m ≫ d); thus, the complexity of kEMI reduces to O(m²k + mk³). The complexity of kEMI+ is approximately similar to that of kEMI; however, it takes the complexity of the Dempster–Shafer theory, O(n²), into account as a multiplicative term within the last for loop of Algorithm 2. Thus, the complexity of kEMI+ is O(md + m′m_p²k + m′m_p²d + m′m_p k²d′ + m′m_p k³ + m′n²m_p d + m′n²m_p k + m′n²kd² + m′n²d³). Assuming m′ ≪ m, it can be concluded that m_p ≈ m, and, further assuming n ≪ m, the overall complexity of kEMI+ is O(m²k + m²d + mk²d′ + mk³ + kd² + d³). The assumption made for kEMI (m ≫ d) yields the final complexity of kEMI+ as O(m²k + mk³). The complexities of all techniques used in this work for the experimental comparison are reported in Table 9, in which t stands for the number of iterations and, for LCSR, m and d stand for the dimensions of the over-complete dictionary [22].

4.2. Scalability analysis

Another issue of concern is to study the evolution of the runtimes with respect to the size of the dataset in terms of the number of records and the number of features. To study the evolution of the runtimes w.r.t. the number of records, ten synthetic datasets are generated, where each dataset contains a fixed number of features and a different number of records varying from 1000 to 50000. At each experiment, missing scores are induced over one of these synthetic datasets in different manners, forming 32 incomplete datasets with various missingness models, rates and types. Then, the proposed missing data imputation techniques are applied to these 32 incomplete datasets and the runtimes are measured accordingly. This process is repeated for each of the synthetic datasets. These experiments examine the scalability of the proposed imputation techniques w.r.t. the number of records by measuring the runtimes over all synthetic datasets. The experiments are carried out using machine 1, which is configured with a 4 × 8 core Intel i7-4710MQ processor and 12 GB RAM. Fig. 12 illustrates the runtimes on a range of 1000 up to 50000 records, with a fixed number of features.
Table 9
Complexity of the missing data imputation techniques.
Algorithms | Complexity | Assuming d ≪ m | Assuming d ≪ m ∧ k ≪ m
EMI     | O(md² + d³)                                | O(m)             | O(m)
kNNI    | O(m²d + km²)                               | O(km²)           | O(m²)
DMI     | O(m²d²d′ + md³d′)                          | O(m²)            | O(m²)
kDMI    | O(mkd′d² + d³)                             | O(mk)            | O(m)
LRMC    | O(t × min{m²d, md²})                       | O(tm)            | O(tm)
LRMC-EN | O(m²d + mt × min{(k+1)²d, (k+1)d²})        | O(m² + mtk + mt) | O(m² + mt)
LCSR    | O(d³ + m²d + md³)                          | O(m²)            | O(m²)
kEMI    | O(m²k + m²d + mk²d′ + mk³ + kd² + d³)      | O(m²k + mk³)     | O(m²)
kEMI+   | O(m²k + m²d + mk²d′ + mk³ + kd² + d³)      | O(m²k + mk³)     | O(m²)
Fig. 13. Runtimes w.r.t. the number of features. Runtimes of the proposed imputation techniques are shown by the shaded areas around the averaged runtimes over 32 incomplete datasets for each experiment.
At each experiment, the runtimes of kEMI and kEMI+ over all 32 incomplete datasets are measured. Fig. 12 shows the highest, the lowest and the mean of these runtimes for both proposed techniques, i.e., the range of runtimes (highlighted region) for each of the proposed imputation techniques. The figure shows that, in general, imputation through kEMI is computationally less expensive than kEMI+. The runtimes of both imputation techniques grow rapidly with the number of records; since the runtimes of both techniques vary with m², such a quadratic increase is to be expected.

To study the evolution of the runtimes w.r.t. the number of features, ten synthetic datasets are generated, where each dataset contains a fixed number of records and a different number of features varying from 50 to 500. Missing scores are induced over each of these synthetic datasets as in the previous set of experiments, and the runtimes are measured accordingly. These experiments examine the scalability of the proposed imputation techniques w.r.t. the number of features by measuring the runtimes over all synthetic datasets. Fig. 13 illustrates the runtimes when varying the number of features from 50 up to 500, with a fixed number of records. It shows the highest, the lowest and the mean of these runtimes for both proposed techniques, i.e., the range of runtimes (shaded region) for each of the proposed imputation techniques. The runtimes of both techniques vary with d³, and thus a sharp increase can be expected. kEMI shows better scalability than kEMI+.
5. Conclusions

In this work, two novel missing data imputation techniques are proposed that aim to learn both the similarity between records (local similarity) and the correlation among features (global similarity) within the dataset. The local similarity improves the imputation performance, and this can be further improved by taking the global similarity among the features into account. kNNI imputes missing scores using the most similar records, i.e., the nearest neighbours, whereas EMI imputes missing scores by exploiting the correlation among the features across all records. DMI and kDMI, on the other hand, initially segment the data; kDMI uses two levels of horizontal segmentation. Both impute missing scores by applying EMI to the nearest neighbours. kDMI makes use of a decision tree and kNN for the horizontal segmentation, which is computationally expensive; it also varies k and iteratively sends all nearest neighbours to EMI in order to tune the parameter k, which further increases the computational complexity. kEMI, by contrast, discards the decision-tree partitioning from the local search mechanism and directly uses kNNI to automatically tune the parameter k in the pre-imputation module. It finds the best value of k and, consequently, feeds the best k nearest neighbours to the posterior-imputation module, which makes use of EMI for the final estimation. The attained results show that kEMI outperforms the other competitors.

The second proposed missing data imputation technique, kEMI+, has three modules: pre-imputation, posterior-imputation and Dempster–Shafer fusion. kEMI+ varies k and uses kNNI to find the set of best k values, i.e., those that result in the lowest imputation error; it selects the corresponding sets of top nearest neighbours and feeds them to the posterior-imputation module, where EMI estimates the missing scores. The resulting set of best estimations is then fed into the Dempster–Shafer fusion module, which fuses these top estimations and returns the final estimation. The attained results show that kEMI+ significantly outperforms kEMI and the other competitors.

The proposed missing data imputation techniques can handle both numerical and categorical features. They could also be extended to use imputation techniques that can handle heterogeneous datasets for imputing mixed features, which appears to be a worthwhile direction for future research. The usefulness of the proposed missing data imputation techniques is validated and compared with eight competitors on twenty-one datasets with various missingness models, rates and types.
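To make the three-module pipeline concrete, the following sketch outlines kEMI+ for a single incomplete record. It is a schematic illustration only: the EMI posterior-imputation and the Dempster–Shafer fusion are replaced by simple stand-ins (a neighbour mean and an average, respectively), and all names are assumptions rather than the authors' implementation.

    # Schematic of the kEMI+ pipeline summarised above. The helpers below are
    # simplified stand-ins for illustration: the real pre-imputation module uses
    # kNNI, the posterior-imputation module uses EMI, and the fusion module uses
    # Dempster–Shafer theory.
    import numpy as np

    def nearest_neighbours(x, X, k):
        """Indices of the k complete records closest to x on its observed features."""
        obs = ~np.isnan(x)
        d = np.linalg.norm(X[:, obs] - x[obs], axis=1)
        return np.argsort(d)[:k]

    def preimputation_error(x, X, k):
        """Proxy for the kNNI pre-imputation error (spread of the k neighbours)."""
        idx = nearest_neighbours(x, X, k)
        return float(np.mean(np.std(X[idx], axis=0)))

    def posterior_estimate(x, X, k):
        """Stand-in for EMI applied to the k nearest neighbours (here: their mean)."""
        idx = nearest_neighbours(x, X, k)
        filled = x.copy()
        miss = np.isnan(x)
        filled[miss] = X[idx].mean(axis=0)[miss]
        return filled

    def kemi_plus(x, X, candidate_ks=(3, 5, 7, 9, 11), n_best=3):
        # 1) Pre-imputation: keep the n_best values of k with the lowest error.
        best_ks = sorted(candidate_ks, key=lambda k: preimputation_error(x, X, k))[:n_best]
        # 2) Posterior-imputation: one candidate estimate per retained k.
        estimates = np.array([posterior_estimate(x, X, k) for k in best_ks])
        # 3) Fusion: the paper fuses the top estimates through Dempster–Shafer
        #    theory; a plain average is used here as a placeholder for that module.
        return estimates.mean(axis=0)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.standard_normal((200, 5))      # complete records
        x = X[0].copy(); x[2] = np.nan         # one record with a missing score
        print(kemi_plus(x, X))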
References

[1] H. Muller, J. Freytag, Problems, Methods and Challenges in Comprehensive Data Cleansing, Technical Report, Humboldt-Universität zu Berlin, Institut für Informatik, 2003.
[2] Q. Abbas, A. Aggarwal, Development of a structured framework to achieve quality data, Int. J. Adv. Eng. Appl. (2010) 193–196.
[3] J. Han, M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, 2006.
[4] A. Chapman, Principles of Data Quality, Version 1.0, Technical Report, The Global Biodiversity Information Facility, Copenhagen, 2005.
[5] R. Razavi-Far, E. Zio, V. Palade, Efficient residuals pre-processing for diagnosing multi-class faults in a doubly fed induction generator, under missing data scenarios, Expert Syst. Appl. 41 (14) (2014) 6386–6399.
[6] R. Razavi-Far, M. Saif, Imputation of missing data using fuzzy neighborhood density-based clustering, in: 2016 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE, 2016, pp. 1834–1841.
[7] R. Razavi-Far, S. Chakrabarti, M. Saif, E. Zio, An integrated imputation-prediction scheme for prognostics of battery data with missing observations, Expert Syst. Appl. 115 (2) (2019) 709–723.
[8] H. Muller, F. Naumann, J. Freytag, Data Quality in Genome Databases, Technical Report, Humboldt-Universität zu Berlin, Institut für Informatik, 2003.
[9] R. Razavi-Far, E. Hallaji, M. Farajzadeh-Zanjani, M. Saif, A semi-supervised diagnostic framework based on the surface estimation of faulty distributions, IEEE Trans. Ind. Inf. 15 (3) (2019) 1277–1286.
[10] J. Bi, C. Zhang, An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme, Knowl.-Based Syst. 158 (2018) 81–93.
[11] M. Farajzadeh-Zanjani, R. Razavi-Far, M. Saif, L. Rueda, Efficient feature extraction of vibration signals for diagnosing bearing defects in induction motors, in: International Joint Conference on Neural Networks, IJCNN, 2016, pp. 4504–4511.
[12] R. Razavi-Far, E. Hallaji, M. Farajzadeh-Zanjani, M. Saif, S.H. Kia, H. Henao, G. Capolino, Information fusion and semi-supervised deep learning scheme for diagnosing gear faults in induction machine systems, IEEE Trans. Ind. Electron. 66 (8) (2019) 6331–6342.
[13] I.B. Aydilek, A. Arslan, A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm, Inform. Sci. 233 (2013) 25–35.
[14] Y. Ding, A. Ross, A comparison of imputation methods for handling missing scores in biometric fusion, Pattern Recognit. 45 (2012) 919–933.
[15] P. Kang, Locally linear reconstruction based missing value imputation for supervised learning, Neurocomputing 118 (2013) 65–78.
[16] Y. Li, L.E. Parker, Nearest neighbor imputation using spatial temporal correlations in wireless sensor networks, Inf. Fusion 15 (2014) 64–79, Special Issue: Resource Constrained Networks.
[17] H.C. Valdiviezo, S.V. Aelst, Tree-based prediction on incomplete data using imputation or surrogate decisions, Inform. Sci. 311 (2015) 163–181.
[18] M. Rahman, M. Islam, A decision tree-based missing value imputation technique for data pre-processing, in: Australasian Data Mining Conference, AusDM 2011, CRPIT, Vol. 121, 2011, pp. 41–50.
[19] X. Chen, Z. Wei, Z. Li, J. Liang, Y. Cai, B. Zhang, Ensemble correlation-based low-rank matrix completion with applications to traffic data imputation, Knowl.-Based Syst. 132 (2017) 249–262.
[20] R. Razavi-Far, M. Saif, Imputation of missing data for diagnosing sensor faults in a wind turbine, IEEE Int. Conf. Syst. Man Cybern. 3 (2015) 99–104.
[21] M. Rahman, M. Islam, kDMI: A novel method for missing values imputation using two levels of horizontal partitioning in a data set, in: ADMA 2013, Part 2, LNAI 8347, 2013, pp. 250–263.
[22] X. Feng, S. Wu, J. Srivastava, P. Desikan, Automatic instance selection via locality constrained sparse representation for missing value estimation, Knowl.-Based Syst. 85 (2015) 210–223.
[23] T. Schneider, Analysis of incomplete climate data: Estimate of mean values and covariance matrices and imputation of missing values, J. Clim. (2001) 853–871.
[24] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. B (1977) 1–38.
[25] K. Cheng, N. Law, W. Siu, Iterative bicluster-based least square framework for estimation of missing values in microarray gene expression data, Pattern Recognit. 45 (4) (2012) 1281–1289.
[26] G. Batista, M. Monard, An analysis of four missing data treatment methods for supervised learning, Appl. Artif. Intell. (2003) 519–533.
[27] C.T. Tran, M. Zhang, P. Andreae, B. Xue, L.T. Bui, An effective and efficient approach to classification with incomplete data, Knowl.-Based Syst. 154 (2018) 1–16.
[28] Z. Cai, M. Heydari, G. Lin, Iterated local least squares microarray missing value imputation, J. Bioinform. Comput. Biol. (2006) 935–958.
[29] K. Cheng, N. Law, W. Siu, Iterative bicluster-based least square framework for estimation of missing values in microarray gene expression data, Pattern Recognit. (2012) 1281–1289.
[30] H. Kim, G. Golub, H. Park, Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics (2005) 187–198.
[31] K.J. Nishanth, V. Ravi, N. Ankaiah, I. Bose, Soft computing based imputation and hybrid data and text mining: The case of predicting the severity of phishing alerts, Expert Syst. Appl. 39 (12) (2012) 10583–10589.
[32] C. Enders, Applied Missing Data Analysis, Methodology in the Social Sciences, Guilford Press, 2010.
[33] K. Muteki, J. MacGregor, T. Ueda, Estimation of missing data using latent variable methods with auxiliary information, Chemometr. Intell. Lab. Syst. 78 (2005) 41–50.
[34] O. Basir, X. Yuan, Engine fault diagnosis based on multi-sensor information fusion using Dempster–Shafer evidence theory, Inf. Fusion 8 (2007) 379–386.
[35] Y. Leung, N.-N. Ji, J.-H. Ma, An integrated information fusion approach based on the theory of evidence and group decision-making, Inf. Fusion 14 (4) (2013) 410–422.
[36] Z. Liu, Q. Pan, J. Dezert, A. Martin, Adaptive imputation of missing values for incomplete pattern classification, Pattern Recognit. 52 (2016) 85–95.
[37] Z. Liu, Q. Pan, G. Mercier, J. Dezert, A new incomplete pattern classification method based on evidential reasoning, IEEE Trans. Cybern. 45 (4) (2015) 635–646.
[38] B. Ran, H. Tan, J. Feng, Y. Liu, W. Wang, Traffic speed data imputation method based on tensor completion, Comput. Intell. Neurosci. 22 (2015).
[39] J. Bethlehem, Applied Survey Methods: A Statistical Perspective, Wiley Series in Survey Methodology, Wiley, Hoboken, NJ, 2009.
[40] J.L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, London, 1997.
[41] P. Allison, Multiple imputation for missing data: A cautionary tale, Sociol. Methods Res. 28 (2000) 301–309.
[42] I. Gheyas, L. Smith, A neural network-based framework for the reconstruction of incomplete data sets, Neurocomputing 73 (2010) 3039–3065.
[43] C.-F. Tsai, M.-L. Li, W.-C. Lin, A class center based approach for missing value imputation, Knowl.-Based Syst. 151 (2018) 124–135.
[44] C. Paul, W. Mason, D. McCaffrey, S. Fox, A cautionary case study of approaches to the treatment of missing data, Stat. Methods Appl. 17 (3) (2008) 351–372.
[45] Z. Liu, Y. Liu, J. Dezert, Q. Pan, Classification of incomplete data based on belief functions and k-nearest neighbors, Knowl.-Based Syst. 89 (2015) 113–125.
[46] E. Nejad, R. Razavi-Far, Q. Wu, M. Saif, Multiple imputation of missing residuals for fault classification: A wind turbine application, in: IEEE 14th International Conference on Machine Learning and Applications, 2015.
[47] R. Razavi-Far, M. Farajzadeh-Zanjani, M. Saif, An integrated class-imbalanced learning scheme for diagnosing bearing defects in induction motors, IEEE Trans. Ind. Inf. 13 (6) (2017) 2758–2769.
[48] M. Rahman, M. Islam, Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques, Knowl.-Based Syst. 53 (2013) 51–65.
[49] A. Dempster, Upper and lower probabilities induced by multi-valued mapping, Ann. Math. Stat. 38 (1967) 325–339.
[50] D. Gruyer, S. Demmel, V. Magnier, R. Belaroussi, Multi-hypotheses tracking using the Dempster–Shafer theory, application to ambiguous road context, Inf. Fusion 29 (2016) 40–56.
[51] S. Panigrahi, A. Kundu, S. Sural, A.K. Majumdar, Credit card fraud detection: A fusion approach using Dempster–Shafer theory and Bayesian learning, Inf. Fusion 10 (2009) 354–363.
[52] R. Haenni, S. Hartmann, Modeling partially reliable information sources: A general approach based on Dempster–Shafer theory, Inf. Fusion 7 (2006) 361–379.
[53] B. Mora, M.A. Wulder, J.C. White, An approach using Dempster–Shafer theory to fuse spatial data and satellite image derived crown metrics for estimation of forest stand leading species, Inf. Fusion 14 (2013) 384–395.
[54] M.G. Rahman, M.Z. Islam, FIMUS: A framework for imputing missing values using co-appearance, correlation and similarity analysis, Knowl.-Based Syst. 56 (2014) 311–327.
[55] P. Nemenyi, Distribution-Free Multiple Comparisons (Ph.D. thesis), Princeton University, Princeton, NJ, USA, 1963.