
FRIEND: Feature Selection on Inconsistent Data

Zhixin Qi, Hongzhi Wang*, Tao He, Jianzhong Li, Hong Gao
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

Abstract

With the explosive growth of information, inconsistent data are increasingly common. However, traditional feature selection methods lack efficiency because inconsistent data must be repaired beforehand. It is therefore necessary to take inconsistencies into consideration during feature selection, both to reduce time costs and to guarantee the accuracy of machine learning models. To achieve this goal, we present FRIEND, a feature selection approach for inconsistent data. Since the features appearing in consistency rules are highly correlated with each other, we aim to select a specified number of features from them. We prove that this specific feature selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approach.

Keywords: Feature Selection, Inconsistent Data, Mutual Information, Data Quality, Approximation

1. Introduction

With the dramatic increase in data dimensionality, the efficiency and effectiveness of machine learning algorithms are affected. High dimensionality causes models to overfit, which degrades performance [1]. The curse of dimensionality has therefore drawn considerable attention. Motivated by this, various feature selection methods have been proposed, which can be classified into three lines [2]. The first is filters, which evaluate candidate feature subsets with scoring criteria such as mutual information [3, 4, 5] or relevance and redundancy [6, 7]. These methods are efficient, but they depend heavily on the characteristics of the training data, since the feature selection process is separated from the machine learning model.

* Corresponding author. Tel.: +86 13069887146. Email addresses: [email protected] (Zhixin Qi), [email protected] (Hongzhi Wang), [email protected] (Tao He), [email protected] (Jianzhong Li), [email protected] (Hong Gao)


Another line of approaches is wrappers [8, 9, 10], which score feature subsets according to the performance of machine learning models. Although the accuracy of these approaches is satisfactory, they incur high time complexity. The third line is embedded methods [11, 12, 13]. These approaches select feature subsets using the characteristics of their own machine learning models, such as support vector machines [14] and decision trees [15]. Although these methods balance the accuracy and efficiency of feature selection, they also have drawbacks. One is overfitting. Another is inapplicability to massive data, since the time cost of each iteration is expensive.

Additionally, data quality issues are seldom considered in feature selection. In fact, data quality issues provide optimization opportunities for feature selection. On one hand, involving data quality issues in feature selection can reduce the impact of low-quality features as well as the expensive cost of data cleaning. On the other hand, detecting data quality problems often requires semantic information about the data, which provides additional evidence for feature selection. Therefore, we attempt to involve data quality issues in feature selection to increase its efficiency and effectiveness.

In this paper, we focus on consistency, an important aspect of data quality [16]. Inconsistencies are typically detected according to semantic constraints. For example, in Table 1, a semantic constraint is "[Student No.] → [Name, Major]", which means that Student No. determines Name and Major: when the values of Student No. in two tuples are identical, the values of both Name and Major in the two tuples must also be identical. Data that satisfy this constraint are consistent, and data that violate it are inconsistent. We find that the values of Student No. in the first and second tuples are identical, but the values of Name differ. Thus, Alice in the first tuple and Bob in the second are inconsistent. Similarly, the values of Student No. in the third and fourth tuples are identical, but the values of Major differ. Therefore, Mathematics in the third tuple and Computer Science in the fourth tuple violate the consistency rule.

Table 1: Students Information

Student No.   Name     Major
170302        Alice    Computer Science
170302        Bob      Computer Science
170520        Steven   Mathematics
170520        Lucy     Computer Science

With consistency rules taken into consideration, how can we select feature subsets on inconsistent data? This problem raises two challenges. On one hand, existing feature selection approaches need to repair inconsistent data before feature selection. Thus, the first challenge is how to involve consistency in the selection criteria effectively so as to choose significant features with high-quality values. On the other hand, existing methods rely only on the importance of features and the correlation among features, without considering extra semantics. Therefore, the second challenge is how to integrate the semantic information provided by constraints with the existing evidence for feature selection.

In light of these challenges, we propose a feature selection strategy for inconsistent data. Taking consistency constraints as well as data quality into consideration, we model feature selection as a multi-objective optimization problem whose constraint is the number of selected features and whose four optimization goals reflect the relevance of selected features to excluded ones, the rate of inconsistent data, the correlation with class labels, and the redundancy with already selected features. In this problem, the consistency constraints are embedded in the optimization goals, so data consistency is reflected in the evaluation of features. The problem is proven to be NP-hard. To address it, we develop an approximation algorithm with ratio bound 1.582. Our solution not only improves the efficiency of feature selection but also ensures that the most informative features are selected.

We envision the application of our solution in the following scenarios:

Data warehousing. A data warehouse contains data coming from many different sources, some of which typically does not satisfy the given consistency rules [17]. The common approach to feature selection in this setting removes the inconsistencies beforehand, which leads to high time costs. To overcome this drawback, our solution makes it possible to select features on inconsistent data directly.

Data integration. Many different databases are often integrated to provide a single unified view for users. Database integration is difficult because it requires resolving various kinds of discrepancies among the integrated databases, one of which is due to different sets of consistency rules [18]. Common methods clean the inconsistent data before selecting features on the integrated databases. However, the accuracy of data cleaning is usually unsatisfactory when the inconsistency rate of the integrated data is high, which degrades feature selection. To address this weakness, our solution makes it possible to select features directly on integrated data by involving the semantics of consistency rules, and it guarantees the effectiveness of feature selection.

In summary, our main contributions are as follows:

• We study the problem of feature selection on inconsistent data. To the best of our knowledge, this is the first work to involve data quality issues in feature selection.

• We model feature selection as an optimization problem embedding consistency constraints and the value consistency of features, so as to make sufficient use of the information provided by consistency rules. We develop an efficient algorithm for this problem with a ratio bound.

• To demonstrate the efficiency and effectiveness of the presented method, we conduct extensive experiments. The experimental results show that FRIEND performs well in feature selection on inconsistent data.

The rest of the paper is organized as follows. Mutual information and consistency rules are reviewed in Section 2. Section 3 introduces our proposed feature selection approach, FRIEND. Our experiments are discussed in Section 4. We conclude the paper in Section 5.

2. Background

Before introducing our approach, we briefly review the relevant background, including mutual information and consistency rules.

2.1. Mutual Information

Mutual information quantifies how much information is shared by two variables X and Y [19]. It is defined as

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x) p(y)},   (1)

where p(x, y) is the joint probability distribution of X and Y. From Eq. (1), if X and Y are closely related, the value of I(X; Y) is large, while I(X; Y) = 0 means the two variables are independent.

Mutual information is a classic criterion in feature selection [2]. When selecting features from a set of candidate features F, a general evaluation function J(f) is used to measure each candidate feature f ∈ F:

J(f) = \alpha \cdot I(C; f \mid S) - g(C, S, f),   (2)

where α is a coefficient that regulates the relative significance of the mutual information, C is the class label set, and S is the selected feature set. g(C, S, f) is a deviation function of f and S under C, which denotes the redundancy between f and S and can be learned from historical records. After J(f) has been computed for the candidate features in F, the features with the highest scores are added to S.
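For concreteness, the following is a minimal Python sketch of how the mutual information of Eq. (1) can be estimated from two discrete columns; the toy column values are hypothetical, and a score such as J(f) in Eq. (2) would be built by combining estimates of this kind.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Estimate I(X;Y) of Eq. (1) from two equal-length discrete sequences."""
    n = len(x)
    px = Counter(x)           # marginal counts of X
    py = Counter(y)           # marginal counts of Y
    pxy = Counter(zip(x, y))  # joint counts of (X, Y)
    mi = 0.0
    for (xv, yv), c in pxy.items():
        p_joint = c / n
        mi += p_joint * np.log(p_joint / ((px[xv] / n) * (py[yv] / n)))
    return mi

# Toy example: a feature column and a class-label column (hypothetical values).
feature = ['a', 'a', 'b', 'b', 'c', 'c']
labels  = [ 1 ,  1 ,  0 ,  0 ,  1 ,  0 ]
print(mutual_information(feature, labels))  # 0 would indicate independence
```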

2.2. Consistency Rules

Since consistency rules describe the semantic constraints of data, inconsistent data are typically identified as violations of consistency rules [20]. Various forms of consistency rules have been proposed. For convenience of discussion, we consider only three typical types in this paper: functional dependencies [21], conditional dependencies [22], and functional dependency sets [23].

A functional dependency on a relation schema R is defined as

[A] → [B],

where A and B are two features on R. Its semantic meaning is that A determines B: if t1[A] = t2[A], then t1[B] = t2[B], where t1 and t2 are two tuples. In this paper, we call both A and B the specific features, where A is the pre-feature and B is the post-feature.

For example, there is a functional dependency "[Student No.] → [Score]" on Table 2, which means that Student No. determines Score. As the table shows, t2[Student No.] = t3[Student No.] and t2[Score] = t3[Score]. We call {Student No., Score} the specific feature set, where Student No. is the pre-feature and Score is the post-feature.

A conditional dependency extends a functional dependency by specifying constant conditions. It is defined as

[A = a, B] → [C],

where A, B, and C are features on R, and a is a constant value. Its semantic meaning is that only when the value of A is a does B determine C. That is, when A = a, if t1[B] = t2[B], then t1[C] = t2[C], where t1 and t2 are two tuples. In this paper, we call A, B, and C the specific features, where A and B are the pre-features and C is the post-feature.

For example, there is a conditional dependency "[Student No. = 170304, Name] → [Score]" on Table 2, which means that only when the value of Student No. is 170304 does Name determine Score. As shown in the table, t2[Student No.] = t3[Student No.] = 170304, t2[Name] = t3[Name], and t2[Score] = t3[Score]. We call {Student No., Name, Score} the specific feature set, where Student No. and Name are the pre-features and Score is the post-feature.

A functional dependency set is a set of functional dependencies. For instance, a set of two functional dependencies is defined as

{ [A] → [B], [C] → [D] },

where A, B, C, and D are features on R. In this paper, we regard such a set as multiple independent functional dependencies, except in the special case "[A] → [B], [B] → [C]", which we treat as a merged functional dependency "[A] → [B] → [C]". In that case, A, B, and C are the specific features, where A is the pre-feature and B and C are the post-features.

For example, the functional dependency set "[Student No.] → [City], [City] → [Country]" on Table 2 means that Student No. determines City and City determines Country. We treat this as the merged functional dependency "[Student No.] → [City] → [Country]". As shown in Table 2, t2[Student No.] = t3[Student No.], t2[City] = t3[City], and t2[Country] = t3[Country]. We call {Student No., City, Country} the specific feature set, where Student No. is the pre-feature and City and Country are the post-features.

In practice, consistency rules are predefined by database administrators. Given the types of consistency rules defined above, it is easy to extract the pre-features and post-features from the given rules. Therefore, although our method leverages knowledge of pre-features and post-features for feature selection, it needs little extra knowledge, and that knowledge can be acquired easily.


Table 2: Scores Information

     Student No.   Name    City   Country   Score
t1   170302        Alice   NYC    U.S.A     92
t2   170304        Bob     NYC    U.S.A     88
t3   170304        Bob     NYC    U.S.A     88
t4   170328        Alice   PAR    FR        90
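To make the notion of a violation concrete, the sketch below checks a single functional dependency [lhs] → [rhs] on dictionary-encoded rows and reports the fraction of violating tuples; the rows reproduce Table 1, and the grouping-based check is a simplification of the detection step rather than the authors' implementation.

```python
from collections import defaultdict

def fd_violation_rate(rows, lhs, rhs):
    """Fraction of rows that violate the functional dependency [lhs] -> [rhs]."""
    groups = defaultdict(set)            # lhs value -> set of rhs values seen
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    violating = sum(1 for row in rows if len(groups[row[lhs]]) > 1)
    return violating / len(rows)

# Rows modeled after Table 1.
rows = [
    {"Student No.": "170302", "Name": "Alice",  "Major": "Computer Science"},
    {"Student No.": "170302", "Name": "Bob",    "Major": "Computer Science"},
    {"Student No.": "170520", "Name": "Steven", "Major": "Mathematics"},
    {"Student No.": "170520", "Name": "Lucy",   "Major": "Computer Science"},
]
print(fd_violation_rate(rows, "Student No.", "Name"))   # 1.0: both groups disagree on Name
print(fd_violation_rate(rows, "Student No.", "Major"))  # 0.5: the 170520 group disagrees on Major
```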

3. The Proposed Approach

In this section, we introduce our feature selection approach, FRIEND. The design of this approach has two goals. One is to reduce the time cost of feature selection on inconsistent data. The other is to guarantee the effectiveness of feature selection by involving the extra semantics provided by consistency rules. For the first goal, we integrate data inconsistencies into the process of selecting features instead of repairing the inconsistent data beforehand. This integration achieves high efficiency by avoiding the expensive time cost of inconsistent data repairing. For the second goal, we design the feature selection criteria with consistency rules taken into consideration.

3.1. Problem Definition

Since the goal of FRIEND is to reduce time costs and guarantee the effectiveness of feature selection on inconsistent data, we consider how to achieve both high efficiency and high accuracy. To save the expensive time cost of inconsistent data repairing, we involve data inconsistencies in feature selection directly. The problem is then how to select the specific features. To ensure the effectiveness of specific feature selection, we consider four factors related to accuracy. First, since consistency constraints reflect dependencies among specific features, selected features that are more related to the excluded ones are more informative; hence, the correlation with unselected features is essential. Second, since value inconsistencies of features are involved in feature selection, the rate of inconsistent data matters. Third, the more the selected specific features are related to the class labels, the more correctly the class labels can be predicted or distinguished; thus, the correlation with class labels is also an important factor. Last but not least, the redundancy with the already selected feature set matters, because the selected features will be trained and tested together in machine learning models; the lower their redundancy, the more informative they are. We therefore define the specific feature selection (SFS for brief) problem with these four factors.

Consider a dataset D(F, C), where F is the candidate feature set and C is the class label set. φ denotes the set of consistency rules on D. From each rule φAB ∈ φ, a pre-feature set FA and a post-feature set FB are extracted. The maximum number of selected specific features is r0, and S denotes the selected feature set. The SFS problem is defined as

\max \sum_{k=1}^{r_0} \left( \alpha \cdot I(C; F_k \mid S) - \beta \cdot g(C, S, F_k) + \gamma \cdot f(F_k) - \delta \cdot r(F_k) \right), \quad \text{where } \alpha + \beta + \gamma + \delta = 1,

s.t.

g(C, S, F_k) = \frac{1}{|C|} \sum_{C_i \in C} \frac{1}{|S|} \sum_{S_i \in S} I(S_i; F_k \mid C_i),

f(F_k) = \sum_{F_{B_j} \in F_B} I(F_{B_j}; F_k \mid S) \quad \forall F_k \in F_A,

f(F_k) = \sum_{F_{A_i} \in F_A} I(F_{A_i}; F_k \mid S) \quad \forall F_k \in F_B,

where I(C; Fk|S) denotes the conditional mutual information between C and Fk given S, g(C, S, Fk) denotes the redundancy between S and Fk under C, f(Fk) denotes the correlation between Fk and the unselected specific features, and r(Fk) denotes the inconsistency rate of Fk. α is a coefficient that adjusts the relative importance of the correlation between Fk and C, β regulates the relative significance of the redundancy between Fk and the selected features, γ adjusts the relative importance of the relevance between pre-features and post-features, and δ regulates the relative significance of the rate of inconsistent data.

We use Eq. (1) to compute I(C; Fk|S). g(C, S, Fk) is calculated as the average over each Ci in C of g(Ci, S, Fk), where g(Ci, S, Fk) is in turn the average redundancy between each Si in S and Fk given Ci. When Fk belongs to FA, it is a pre-feature; for each pre-feature Fk, we use \sum_{F_{B_j} \in F_B} I(F_{B_j}; F_k \mid S) to represent the overall semantic correlation between Fk and each post-feature FBj given S, i.e., f(Fk). When Fk belongs to FB, it is a post-feature; for each post-feature Fk, we use \sum_{F_{A_i} \in F_A} I(F_{A_i}; F_k \mid S) to denote the overall semantic correlation between Fk and each pre-feature FAi given S, i.e., f(Fk). r(Fk) is the fraction of inconsistent values in Fk.

Thus, the SFS problem is a multi-objective optimization problem with four optimization goals reflecting the correlation with class labels, the redundancy with selected features, the relevance of selected specific features to excluded ones, and the rate of inconsistent data. The constraint of SFS is the number of selected features. Because the four objectives have different scales, we adopt Z-score standardization [24] to convert them to the same scale; measuring them with unified Z-scores guarantees their comparability. Theorem 1 shows the hardness of the SFS problem.

Theorem 1. Specific Feature Selection is an NP-hard problem.

Proof of Theorem 1. We can reduce the MAXIMUM COVERAGE problem [25] to a special case of this problem. Since MAXIMUM COVERAGE is well known to be NP-hard, SFS is also NP-hard.
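Since the four terms of the objective live on different scales, they are standardized before being weighted. The sketch below shows one way such a combined score could be computed for a set of candidate features; the raw per-feature values are hypothetical, and the default weights are the values chosen later in Section 4.4.

```python
import numpy as np

def zscore(v):
    """Z-score standardization of a vector of raw objective values."""
    v = np.asarray(v, dtype=float)
    std = v.std()
    return (v - v.mean()) / std if std > 0 else np.zeros_like(v)

def sfs_scores(relevance, redundancy, rule_corr, inconsistency,
               alpha=0.3, beta=0.3, gamma=0.2, delta=0.2):
    """Combine the four standardized terms of the SFS objective:
    alpha*I(C;Fk|S) - beta*g(C,S,Fk) + gamma*f(Fk) - delta*r(Fk)."""
    return (alpha * zscore(relevance)
            - beta * zscore(redundancy)
            + gamma * zscore(rule_corr)
            - delta * zscore(inconsistency))

# Hypothetical raw values for three candidate features.
print(sfs_scores(relevance=[0.8, 0.5, 0.1],
                 redundancy=[0.2, 0.6, 0.1],
                 rule_corr=[0.4, 0.3, 0.0],
                 inconsistency=[0.05, 0.30, 0.50]))
```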

3.2. Our Solution

In this subsection, we develop an approximation algorithm for the SFS problem with a constant approximation ratio. The basic idea is to select the specific features with the highest scores of the optimization function, that is, the features with the most correlation with the unselected specific features, the lowest inconsistency rate, the most correlation with class labels, and the least redundancy with the already selected feature set.

Our SFS algorithm selects specific features with a greedy strategy [26], which finds the feature with the highest score of the evaluation function in each iteration. Based on this, the selected specific features are determined.

Before running the SFS algorithm, we specify r0, the number of specific features to be selected, by considering two cases (a rough sketch of the Case 2 procedure follows Example 1 below):

Case 1. Under time or space limitations, the number of iterations of the SFS algorithm is restricted, and accordingly the number of specific features to be selected is limited. In this case, r0 is given by users.

Case 2. Without time or space limitations, we determine r0 with the following steps:

1. We use mRMR incremental selection [6] to select n (a preset large number) sequential features from the input features F. This yields n nested feature sets S1 ⊂ S2 ⊂ ... ⊂ Sn−1 ⊂ Sn.

2. We compare S1, ..., Sk, ..., Sn (1 ≤ k ≤ n) to find the range of k, denoted by R, within which the cross-validation classification error ek is consistently small.

3. Within R, we find the smallest classification error e* = min ek. r0 is chosen as the smallest k corresponding to e*.

Therefore, in our solution, we treat r0 as an input of the SFS algorithm.

The pseudo-code of specific feature selection is shown in Algorithm 1. The input is a dataset D(F, C), the pre-feature set FA, the post-feature set FB, the candidate feature set S, and the maximum number of selected specific features r0. The algorithm produces the selected specific feature set Sr0. First, we calculate the score of the evaluation function J(Fk) for each specific feature Fk in FA and FB (Lines 1-7). Then, we use the greedy strategy to select the feature with the highest J(Fk) and mark it as Fk* (Lines 8-9). Once a feature is obtained, we add it to the selected specific feature set Sr0 and delete it from the candidate set Fs (Lines 10-12). Finally, we return the selected feature set as the output (Line 13). We give an example to illustrate Algorithm 1.

Example 1. Consider Table 1 as the input. The pre-feature is Student No., the post-features are Name and Major, and r0 = 2. Suppose J(Student No.) = 0.67, J(Name) = 0.62, and J(Major) = 0.42. With the greedy strategy, we first add Student No. to Sr0 because it has the highest score. Then, we delete Student No. from Fs and obtain Name. Since the maximum size of Sr0 is reached, the iteration stops. Thus, the selected specific feature set Sr0 of Table 1 is {Student No., Name}.
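Below is a rough sketch of the Case 2 procedure for choosing r0. It is only an illustration under several assumptions: scikit-learn is available, a plain mutual-information ranking stands in for mRMR incremental selection, a decision tree is used as the cross-validation model, and a tolerance parameter approximates the notion of a range with consistently small error.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def choose_r0(X, y, n_max=20, tol=0.01):
    """Pick r0 as the smallest nested-prefix size whose cross-validation
    error is within tol of the best error (X is a NumPy feature matrix)."""
    ranking = np.argsort(-mutual_info_classif(X, y))[:n_max]  # stand-in for an mRMR ranking
    errors = []
    for k in range(1, len(ranking) + 1):
        acc = cross_val_score(DecisionTreeClassifier(), X[:, ranking[:k]], y, cv=5).mean()
        errors.append(1.0 - acc)
    best = min(errors)
    return next(k for k, e in enumerate(errors, start=1) if e <= best + tol)
```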


Algorithm 1 Specific Feature Selection
Input: a dataset D(F, C), pre-feature set FA, post-feature set FB, candidate feature set S, a number r0.
Output: Selected specific feature set Sr0.
1: Fs ← FA ∪ FB
2: Sr0 ← ∅, num ← 0
3: for each Fk ∈ Fs do
4:   if Fk ∈ FA then
5:     J(Fk) ← α · I(C; Fk|S) − β · g(C, S, Fk) + γ · \sum_{FBj ∈ FB} I(FBj; Fk|S) − δ · r(Fk)
6:   else
7:     J(Fk) ← α · I(C; Fk|S) − β · g(C, S, Fk) + γ · \sum_{FAi ∈ FA} I(FAi; Fk|S) − δ · r(Fk)
8: while num < r0 do
9:   Fk* ← argmax_{Fk ∈ Fs} J(Fk)
10:  Sr0 ← Sr0 ∪ {Fk*}
11:  Delete Fk* from Fs
12:  num ← num + 1
13: return Sr0
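A minimal Python sketch of the greedy loop of Algorithm 1 follows; it assumes a scoring function that returns J(Fk) for each candidate (for instance, built from the standardized terms sketched earlier) and illustrates the selection logic rather than the authors' C++ implementation.

```python
def greedy_sfs(pre_features, post_features, score, r0):
    """Greedy specific feature selection following Algorithm 1:
    score every candidate once (Lines 1-7), then repeatedly take the
    highest-scoring remaining feature until r0 are chosen (Lines 8-13)."""
    candidates = set(pre_features) | set(post_features)   # Fs <- FA U FB
    j = {fk: score(fk) for fk in candidates}               # J(Fk) for each candidate
    selected = []                                           # Sr0
    while candidates and len(selected) < r0:
        best = max(candidates, key=j.get)                   # Fk* <- argmax J(Fk)
        selected.append(best)
        candidates.remove(best)
    return selected

# Example 1 revisited: fixed scores stand in for the computed J values.
j_values = {"Student No.": 0.67, "Name": 0.62, "Major": 0.42}
print(greedy_sfs(["Student No."], ["Name", "Major"],
                 score=j_values.get, r0=2))   # ['Student No.', 'Name']
```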

Parameter Setting. To measure the relative importance of the four optimization goals, i.e., the correlation with class labels, the redundancy with selected features, the relevance of selected specific features to excluded ones, and the rate of inconsistent data, the SFS algorithm has four parameters, α, β, γ, and δ. To set proper values for these parameters, we test their impact on the effectiveness of feature selection, which is reflected by the accuracy of machine learning models trained on the features obtained by the SFS algorithm. In our problem definition, α + β + γ + δ = 1. To satisfy this constraint, we set the parameters one by one: when testing the impact of a parameter, we allocate the remaining weight equally to the other three parameters. After the best values of three parameters have been obtained, the last one is determined.

Time Complexity Analysis. The time complexity of Algorithm 1 is determined by the number of pre-features in FA (denoted by m), the number of post-features in FB (denoted by n), and the size of the dataset D (denoted by |D|). The cost of computing J(Fk) for each specific feature Fk is O((m+n)|D|), and the cost of the greedy strategy is O((m+n)r0). Hence, the complexity of Algorithm 1 is O((m+n)|D|). We show the approximation ratio of Algorithm 1 in Theorem 2.

Theorem 2. Algorithm 1 is a (1 + 1/(e−1)) ≈ 1.582 approximation. That is, for the selected specific feature set S_{r_0} and an optimal specific feature set S^*_{r_0}, we have

\frac{|S^*_{r_0}|}{|S_{r_0}|} \le 1 + \frac{1}{e-1}.


This approximation ratio is optimal.

Proof. We first prove the following claims.

Claim 1. x_{k+1} \ge z_k / r_0, where x_{k+1} is the number of features newly selected in the (k+1)-th step, z_k is the number of features of S^*_{r_0} that remain unselected after the k-th step, and r_0 is the maximal number of selected specific features.

Proof of Claim 1. At each step, SFS selects the feature F_k that maximizes the scoring function J(F_k). Since the optimal solution uses |S^*_{r_0}| features to cover r_0 features, some feature must cover at least a 1/r_0 fraction of the at least z_k remaining unselected features of S^*_{r_0}. Hence, x_{k+1} \ge z_k / r_0.

Claim 2. z_{k+1} \le (1 - 1/r_0)^{k+1} \cdot |S^*_{r_0}|, where z_{k+1} is the number of features of S^*_{r_0} that remain unselected after the (k+1)-th step.

Proof of Claim 2. The claim holds for k = 0. Assume inductively that z_k \le (1 - 1/r_0)^{k} \cdot |S^*_{r_0}|. Then

z_{k+1} \le z_k - x_{k+1} \le z_k \left(1 - \frac{1}{r_0}\right) \;\text{[Claim 1]}\; \le \left(1 - \frac{1}{r_0}\right)^{k+1} \cdot |S^*_{r_0}|.

Proof of Theorem 2. It follows from Claim 2 that

z_{r_0} \le \left(1 - \frac{1}{r_0}\right)^{r_0} \cdot |S^*_{r_0}| \le \frac{|S^*_{r_0}|}{e}.

Hence, |S_{r_0}| = |S^*_{r_0}| - z_{r_0} \ge (1 - 1/e) \cdot |S^*_{r_0}|, i.e., |S^*_{r_0}| / |S_{r_0}| \le 1 + 1/(e-1) \approx 1.582.

Algorithm 1 maximizes the correlation with unselected specific features using the extra semantic information provided by consistency rules, minimizes the inconsistency rate of the selected features, maximizes the correlation with class labels, and minimizes the redundancy of the selected feature set. By considering these factors comprehensively, it achieves high effectiveness and produces near-optimal specific features for feature selection.

4. Experiments

To verify the performance of our proposed approach, FRIEND, we conduct extensive experiments on real-life data along three dimensions: 1) the time costs of FRIEND; 2) the accuracy of machine learning models with the features selected by FRIEND; and 3) the impact of the parameters.

4.1. Experimental Setup

We select two real-world inconsistent data sets (Pollution(1) and Census(2)) and four UCI public data sets(3) (Human Activity, Gisette, Arcene, and Dorothea) with various dimensions and data sizes. Their basic information is shown in Table 3.

(1) https://www.kaggle.com/sogun3/uspollution
(2) https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
(3) http://archive.ics.uci.edu/ml/datasets.html

Table 3: Datasets Information

Name             #-Features   #-Instances
Pollution        28           1746661
Census           68           2949942
Human Activity   561          10299
Gisette          5000         13500
Arcene           10000        900
Dorothea         100000       1950

All experiments are conducted on a machine with two Intel(R) Xeon(R) E5-2609 [email protected] CPUs and 32 GB of memory, running CentOS 7. All the algorithms are implemented in C++ and compiled with g++ 4.8.5.

In the experiments, we adopt mRMR [6], FCBF [7], CMIM [3], ReliefF [27], Simba [28], LVF [29], and QBB [30] as baselines for FRIEND. mRMR, FCBF, and CMIM are relevance-redundancy-based algorithms; ReliefF and Simba are distance-based methods; LVF and QBB are consistency-based algorithms. These seven approaches are widely acknowledged in feature selection [31, 32, 33], and we compare FRIEND with them on all data sets.

The error rate is the rate of inconsistent data. The error rates of Pollution and Census are 4.33% and 14.44%, respectively. For the UCI data sets, we inject errors with a uniform distribution by modifying values to other values within the attribute's range. Running time is used to measure the efficiency of feature selection; we run each test 5 times and report the average time cost in milliseconds. The accuracy of the machine learning models is used to evaluate the effectiveness of feature selection, defined as

Accuracy = (#-correctly predicted labels) / (#-predicted labels).
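The sketch below illustrates one plausible version of such a uniform error-injection step; the exact protocol used in the experiments may differ in details such as which columns and cells are eligible.

```python
import random

def inject_errors(rows, column, error_rate, seed=0):
    """Randomly overwrite a fraction of the values in `column` with a different
    value drawn uniformly from the other values observed in that column."""
    rng = random.Random(seed)
    domain = sorted({row[column] for row in rows})
    n_errors = int(error_rate * len(rows))
    for idx in rng.sample(range(len(rows)), n_errors):
        choices = [v for v in domain if v != rows[idx][column]]
        if choices:
            rows[idx][column] = rng.choice(choices)
    return rows

# e.g. inject_errors(rows, "Name", error_rate=0.2) corrupts 20% of the Name values.
```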

Since we focus on inconsistent data in this paper, the pre-features and post-features in the consistency rules of each data set are given beforehand; they are listed in Table 4. In addition, the number of selected features for each data set and the other parameters used in the experiments are listed in Table 5.

4.2. Efficiency of FRIEND

We compare the efficiency of FRIEND with that of mRMR, FCBF, CMIM, ReliefF, Simba, LVF, and QBB. Since consistency-based feature selection methods are capable of handling some noise [33], LVF and QBB do not have to repair inconsistent data before feature selection. However, inconsistent data repairing is necessary for mRMR, FCBF, CMIM, ReliefF, and Simba.

Table 4: Pre-Features and Post-Features Information

Pollution
  Pre-features: State Code, County Code, Site Num, NO2 Mean, NO2 1st Max Value, NO2 1st Max Hour, O3 Mean, O3 1st Max Value, O3 1st Max Hour, SO2 Mean, SO2 1st Max Value, SO2 1st Max Hour, CO Mean, CO 1st Max Value, CO 1st Max Hour
  Post-features: NO2 AQI, O3 AQI, SO2 AQI, CO AQI

Census
  Pre-features: dAge, dAncstry1, dIncome1, dOccup, dTravtime, iSex, iDisabl1, iEnglish, iYearwrk, iPerscare, iOthrserv, iRspouse, iRlabor, iRownchld, iMobility
  Post-features: dPoverty

Human Activity
  Pre-features: the 1st-15th, 41st-55th, 81st-95th, 121st-135th, 161st-175th, 266th-280th, 345th-359th, 424th-438th features
  Post-features: the 201st-205th, 214th-218th, 227th-231st, 240th-244th, 253rd-257th, 503rd-507th, 516th-520th, 529th-533rd, 542nd-546th features

Gisette
  Pre-features: the 1st-5000th features
  Post-features: the class feature

Arcene
  Pre-features: the 1st-10000th features
  Post-features: the class feature

Dorothea
  Pre-features: the 1st-100000th features
  Post-features: the class feature

Table 5: Parameters Information

Parameter              Setting
#-selected features    Pollution: 9; Census: 55; Human Activity: 5; Gisette: 20; Arcene: 5; Dorothea: 10
max tries in LVF       Pollution: 500; Census: 1000; Human Activity: 5000; Gisette: 25000; Arcene: 50000; Dorothea: 100000
max tries in QBB       Pollution: 500; Census: 1000; Human Activity: 5000; Gisette: 25000; Arcene: 50000; Dorothea: 100000
δ in FCBF              Pollution: 0.0001; Census: 0.0001; Human Activity: 0.0006; Gisette: 0.001; Arcene: 0.001; Dorothea: 0.002
Decision Tree          criterion: GINI; splitter: RANDOM; presort: TRUE; min weight fraction leaf: 0; min samples leaf: 1; min samples split: 2
K-Nearest Neighbor     n neighbors: 5; n jobs: -1
Logistic Regression    solver: SAG; multi class: MULTINOMIAL; max iter: 1000

Thus, before running the relevance-redundancy-based and distance-based feature selection algorithms, we repair each inconsistent value to the most frequent of the corresponding consistent values, which is the most efficient method for inconsistent data repairing and recovers all the injected errors [34]. mRMR-RP, FCBF-RP, CMIM-RP, ReliefF-RP, and Simba-RP refer to mRMR, FCBF, CMIM, ReliefF, and Simba without the data repairing process, respectively.
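The repair step used for these baselines assigns, within each group of tuples sharing the same pre-feature value, the most frequent post-feature value to every tuple in the group. A simplified sketch of this rule for a single dependency [lhs] → [rhs] follows; it illustrates the idea rather than the cost-based method of [34].

```python
from collections import Counter, defaultdict

def repair_most_frequent(rows, lhs, rhs):
    """Repair violations of [lhs] -> [rhs] by setting rhs to the most frequent
    value observed among tuples sharing the same lhs value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row[rhs])
    majority = {k: Counter(v).most_common(1)[0][0] for k, v in groups.items()}
    for row in rows:
        row[rhs] = majority[row[lhs]]
    return rows
```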

Figure 1: Results of Efficiency Compared with Relevance-Redundancy-Based Algorithms. Panels: (a) Time Costs on Human Activity; (b) Time Costs on Gisette; (c) Time Costs on Arcene; (d) Time Costs on Dorothea. Each panel plots time costs (ms) against the error rate.

Table 6: Time Costs on Real-World Data Sets (Unit: ms)

Method       Pollution    Census
FRIEND       608442       13950425
mRMR         1305898      18936844
FCBF         1373097      18268834
CMIM         1455760      18878200
ReliefF      2486333      15462168
Simba        2103799      38632976
LVF          599344       29557355
QBB          553207       28587896
mRMR-RP      607234       14806924
FCBF-RP      674433       14138914
CMIM-RP      757096       14748280
ReliefF-RP   1787669      11332248
Simba-RP     1405135      34503056

Figure 2: Results of Efficiency Compared with Distance-Based Algorithms. Panels: (a) Time Costs on Human Activity; (b) Time Costs on Gisette; (c) Time Costs on Arcene; (d) Time Costs on Dorothea. Each panel plots time costs (ms) against the error rate.

As depicted in Figures 1 and 2, as the error rate rises, the time costs of FRIEND, mRMR-RP, FCBF-RP, CMIM-RP, ReliefF-RP, and Simba-RP remain stable, while those of mRMR, FCBF, CMIM, ReliefF, and Simba increase sharply. This is because the relevance-redundancy-based and distance-based algorithms need data repairing when selecting features on inconsistent data, and the time cost of data repairing is expensive. In addition, as the error rate increases, the time costs of mRMR, FCBF, CMIM, ReliefF, and Simba rise almost linearly, because the time cost of repairing inconsistent data is related to the error rate. Note that inconsistent data must be detected before it can be repaired; thus, even when the error rate is 0%, the time costs of mRMR, FCBF, CMIM, ReliefF, and Simba are still higher than those of mRMR-RP, FCBF-RP, CMIM-RP, ReliefF-RP, and Simba-RP, respectively.

As illustrated in Figure 3, the time costs of FRIEND, LVF, and QBB are unaffected by the error rate. As the error rate increases, the efficiency of FRIEND is more stable than that of LVF and QBB. In addition, when the data size is larger, the running time of FRIEND is less than that of the other two algorithms.

The observations on the UCI data sets also hold on the real-world data sets.


Figure 3: Results of Efficiency Compared with Consistency-Based Algorithms. Panels: (a) Time Costs on Human Activity; (b) Time Costs on Gisette; (c) Time Costs on Arcene; (d) Time Costs on Dorothea. Each panel plots time costs (ms) against the error rate.

As shown in Table 6, the time costs of FRIEND, LVF, and QBB are lower than the running times of the other algorithms, because the relevance-redundancy-based and distance-based algorithms require the inconsistent data repairing process. We also observe that the efficiency of FRIEND on Pollution is slightly lower than that of LVF and QBB, while the time cost of FRIEND on Census is significantly lower than that of the other two algorithms. Since the data size of Census is larger than that of Pollution, this case confirms the observation on the UCI data sets. Thus, it can be concluded that, in terms of efficiency, FRIEND outperforms the relevance-redundancy-based and distance-based algorithms when inconsistent data exist.

4.3. Accuracy of Machine Learning Models

Since the accuracy of machine learning models reflects the effectiveness of feature selection, we adopt three classic machine learning models: Decision Tree [35] (DT for brief), K-Nearest Neighbor [36] (KNN for brief), and Logistic Regression [37] (LR for brief). We report the accuracy of these models on the various data sets with the features selected by FRIEND, mRMR, FCBF, CMIM, ReliefF, Simba, LVF, and QBB to compare the effectiveness of these feature selection approaches.

Figure 4: Results of DT. Panels: (a) Accuracy on Human Activity; (b) Accuracy on Gisette; (c) Accuracy on Arcene; (d) Accuracy on Dorothea. Each panel plots accuracy against the error rate for FRIEND and the baseline algorithms.

As depicted in Figures 4, 5, and 6, as the error rate rises, the accuracy of DT, KNN, and LR with the features selected by FRIEND remains stable, whereas with the features obtained by the other algorithms the accuracy of all three models drops rapidly. There are two reasons for this. On one hand, the relevance-redundancy-based and distance-based algorithms include the inconsistent data repairing process, so the effectiveness of their feature selection depends on the accuracy of data repairing. As the error rate increases, the accuracy of data repairing gets worse, and the low accuracy of inconsistent data repairing degrades feature selection. On the other hand, FRIEND involves the semantics of consistency rules during feature selection, which allows it to select features effectively, unaffected by variations in the error rate.

In addition, we notice that in Figures 4, 5(a), 5(b), and 6, when the error rate is less than 20%, the machine learning models do not achieve the best accuracy with the features selected by FRIEND or by the consistency-based algorithms.

Figure 5: Results of KNN. Panels: (a) Accuracy on Human Activity; (b) Accuracy on Gisette; (c) Accuracy on Arcene; (d) Accuracy on Dorothea. Each panel plots accuracy against the error rate for FRIEND and the baseline algorithms.

When the error rate is equal to or greater than 20%, the accuracy of the machine learning models with the features obtained by FRIEND is higher than that with the features selected by the other algorithms. Similarly, in Figures 5(c) and 5(d), when the error rate is less than 10%, the accuracy of KNN with the features obtained by FRIEND or by the consistency-based algorithms is not the highest, whereas when the error rate is equal to or greater than 10%, KNN achieves better accuracy with the features selected by FRIEND than with the features obtained by the other algorithms. This is because when the error rate is low, the inconsistent data repairing process preserves the accuracy of the relevance-redundancy-based and distance-based algorithms, while when the error rate is high, these methods suffer from the low accuracy of data repairing, whereas FRIEND and the consistency-based algorithms do not. Meanwhile, the semantics of the consistency rules helps FRIEND select effective features.

The observations on the UCI data sets remain the same on the real-world data sets. As shown in Table 7, the accuracy of the decision tree and of logistic regression with the features selected by FRIEND is slightly lower than the best accuracy.

Figure 6: Results of LR. Panels: (a) Accuracy on Human Activity; (b) Accuracy on Gisette; (c) Accuracy on Arcene; (d) Accuracy on Dorothea. Each panel plots accuracy against the error rate for FRIEND and the baseline algorithms.

This is because the error rates of both Pollution and Census are less than 20%. In addition, the accuracy of KNN on Census is the best with the features selected by FRIEND, while that on Pollution is not, because the error rate of Census is above 10% while that of Pollution is below 10%. The experimental results on the real-world data sets confirm the observations on the UCI data sets.

Therefore, FRIEND always performs significantly better in efficiency than the relevance-redundancy-based and distance-based algorithms. When the error rate is low, even though the effectiveness of the relevance-redundancy-based and distance-based algorithms exceeds that of FRIEND, the difference is quite small. Thus, it can be concluded that FRIEND is suitable for feature selection on inconsistent data in terms of both efficiency and effectiveness.

4.4. Parameters of the SFS Algorithm

Table 7: Accuracy of Machine Learning Models on Real-World Data Sets (Unit: %)

Decision Tree
            FRIEND   mRMR    FCBF    CMIM    ReliefF   Simba   LVF     QBB
Pollution   88.82    85.05   68.57   91.61   79.64     89.16   87.34   86.62
Census      87.60    87.64   87.48   87.48   87.63     87.46   87.60   87.58

K-Nearest Neighbor
            FRIEND   mRMR    FCBF    CMIM    ReliefF   Simba   LVF     QBB
Pollution   74.52    73.45   62.71   52.15   76.01     53.95   53.45   53.45
Census      91.44    89.35   87.16   89.44   89.38     89.37   89.46   89.46

Logistic Regression
            FRIEND   mRMR    FCBF    CMIM    ReliefF   Simba   LVF     QBB
Pollution   68.28    67.80   68.34   68.05   68.10     68.04   68.06   68.06
Census      88.25    88.28   87.06   88.25   88.04     88.12   88.27   88.18

Since the SFS algorithm has four parameters, i.e., α (the relative importance of the correlation with class labels), β (the relative significance of the redundancy with selected features), γ (the relative importance of the correlation with unselected specific features), and δ (the relative significance of the error rate), we test the impact of these parameters on feature selection. To evaluate their impact and select proper values, we measure the accuracy of the machine learning models. We vary the error rate from 10% to 50% and range α, β, and γ from 0.1 to 0.5, respectively. The experimental results are shown in Figure 7.

As illustrated in Figure 7, when α and β are set to 0.3 and γ and δ are set to 0.2, the SFS algorithm achieves the best accuracy. This is because the correlation with class labels and the redundancy with selected features have a greater impact on feature selection than the correlation with unselected specific features and the error rate. Thus, we set α and β to 0.3 and γ and δ to 0.2 in the SFS algorithm.

According to the above discussion, we conclude that FRIEND selects features on inconsistent data efficiently and effectively, which makes it suitable for various scenarios.

5. Conclusion

In this paper, we proposed FRIEND, a feature selection approach for inconsistent data. The method reduces the time cost of feature selection and guarantees the accuracy of machine learning models. Experimental results demonstrate the efficiency and effectiveness of FRIEND.


Figure 7: Results of Parameter Selection. Panels: (a) Accuracy on Human Activity varying α; (b) Accuracy on Human Activity varying β; (c) Accuracy on Human Activity varying γ; (d) Accuracy on Gisette varying α; (e) Accuracy on Gisette varying β; (f) Accuracy on Gisette varying γ. Each panel plots accuracy against the parameter value for error rates from 10% to 50%.

An important direction for future work is to take other data quality dimensions, such as completeness and currency, into consideration. By considering various aspects of data quality, feature selection can achieve higher fault tolerance on dirty data. Another line of work is to combine our method with other feature selection approaches, such as wrappers and embedded methods, so that features can be selected efficiently and effectively in various scenarios.

Acknowledgement

This paper was partially supported by NSFC grants U1509216, U1866602, and 61602129, National Sci-Tech Support Plan 2015BAH10F01, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026, and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.

References

[1] Q. Zou, J. Zeng, L. Cao, R. Ji, A novel features ranking metric with application to scalable visual and bioinformatics data classification, Neurocomputing 173 (2016) 346–354.
[2] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40 (1) (2014) 16–28.
[3] F. Fleuret, Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research 5 (Nov) (2004) 1531–1555.
[4] N. Hoque, D. Bhattacharyya, J. K. Kalita, Mifs-nd: a mutual information-based feature selection method, Expert Systems with Applications 41 (14) (2014) 6371–6385.
[5] M. Han, W. Ren, Global mutual information-based feature selection approach using single-objective and multi-objective optimization, Neurocomputing 168 (2015) 47–54.
[6] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1226–1238.
[7] L. Yu, H. Liu, Feature selection for high-dimensional data: A fast correlation-based filter solution, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 856–863.
[8] D. Rodrigues, L. A. Pereira, R. Y. Nakamura, K. A. Costa, X.-S. Yang, A. N. Souza, J. P. Papa, A wrapper approach for feature selection based on bat algorithm and optimum-path forest, Expert Systems with Applications 41 (5) (2014) 2250–2258.
[9] P. Bermejo, J. A. Gámez, J. M. Puerta, Speeding up incremental wrapper feature subset selection with naive bayes classifier, Knowledge-Based Systems 55 (2014) 140–147.


[10] M. Mafarja, S. Mirjalili, Whale optimization approaches for wrapper feature selection, Applied Soft Computing 62 (2018) 441–453.
[11] R. Sheikhpour, M. A. Sarram, R. Sheikhpour, Particle swarm optimization for bandwidth determination and feature selection of kernel density estimation based classifiers in diagnosis of breast cancer, Applied Soft Computing 40 (2016) 113–131.
[12] J. Yang, H. Xu, P. Jia, Effective search for genetic-based machine learning systems via estimation of distribution algorithms and embedded feature reduction techniques, Neurocomputing 113 (2013) 105–121.
[13] S. Wang, J. Tang, H. Liu, Embedded unsupervised feature selection, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Press, 2015, pp. 470–476.
[14] X. Zhang, G. Wu, Z. Dong, C. Crawford, Embedded feature-selection support vector machine for driving pattern recognition, Journal of the Franklin Institute 352 (2) (2015) 669–685.
[15] S. T. Monteiro, R. J. Murphy, Embedded feature selection of hyperspectral bands with boosted decision trees, in: Geoscience and Remote Sensing Symposium (IGARSS), 2011 IEEE International, IEEE, 2011, pp. 2361–2364.
[16] G. Cong, W. Fan, F. Geerts, X. Jia, S. Ma, Improving data quality: Consistency and accuracy, in: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, 2007, pp. 315–326.
[17] M. Arenas, L. Bertossi, J. Chomicki, Consistent query answers in inconsistent databases, in: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, 1999, pp. 68–79.
[18] M. Lenzerini, Data integration: A theoretical perspective, in: Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, 2002, pp. 233–246.
[19] M. Bennasar, Y. Hicks, R. Setchi, Feature selection using joint mutual information maximisation, Expert Systems with Applications 42 (22) (2015) 8520–8532.
[20] W. Fan, F. Geerts, Foundations of data quality management, Synthesis Lectures on Data Management 4 (5) (2012) 1–217.
[21] W. Fan, F. Geerts, X. Jia, A. Kementsietsidis, Conditional functional dependencies for capturing data inconsistencies, ACM Transactions on Database Systems (TODS) 33 (2) (2008) 6.


[22] W. Fan, F. Geerts, J. Li, M. Xiong, Discovering conditional functional dependencies, IEEE Transactions on Knowledge and Data Engineering 23 (5) (2011) 683–698.
[23] W. Fan, Dependencies revisited for improving data quality, in: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, 2008, pp. 159–170.
[24] G. W. Milligan, M. C. Cooper, A study of standardization of variables in cluster analysis, Journal of Classification 5 (2) (1988) 181–204.
[25] S. Khuller, A. Moss, J. Naor, et al., The budgeted maximum coverage problem, Information Processing Letters 70 (1) (1999) 39–45.
[26] T. K. Truong, K. Li, Y. Xu, Chemical reaction optimization with greedy strategy for the 0–1 knapsack problem, Applied Soft Computing 13 (4) (2013) 1774–1780.
[27] M. Robnik-Šikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Machine Learning 53 (1-2) (2003) 23–69.
[28] R. Gilad-Bachrach, A. Navot, N. Tishby, Margin based feature selection - theory and algorithms, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 43.
[29] H. Liu, R. Setiono, et al., A probabilistic approach to feature selection - a filter solution, in: ICML, Vol. 96, Citeseer, 1996, pp. 319–327.
[30] M. Dash, H. Liu, Hybrid search of feature subsets, in: Pacific Rim International Conference on Artificial Intelligence, Springer, 1998, pp. 238–249.
[31] Z. Chen, C. Wu, Y. Zhang, Z. Huang, B. Ran, M. Zhong, N. Lyu, Feature selection with redundancy-complementariness dispersion, Knowledge-Based Systems 89 (2015) 203–217.
[32] Q. Hu, W. Pan, Y. Song, D. Yu, Large-margin feature selection for monotonic classification, Knowledge-Based Systems 31 (2012) 8–18.
[33] M. Dash, H. Liu, Consistency-based search in feature selection, Artificial Intelligence 151 (1-2) (2003) 155–176.
[34] P. Bohannon, W. Fan, M. Flaster, R. Rastogi, A cost-based model and effective heuristic for repairing constraints by value modification, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, ACM, 2005, pp. 143–154.
[35] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81–106.
[36] L. E. Peterson, K-nearest neighbor, Scholarpedia 4 (2) (2009) 1883.


[37] J. Zhu, T. Hastie, Kernel logistic regression and the import vector machine, in: Advances in Neural Information Processing Systems, 2002, pp. 1081–1088.


Zhixin Qi is a doctoral student in the School of Computer Science and Technology, Harbin Institute of Technology. She received her M.S. degree from Harbin Institute of Technology. Her research interests include data quality, data cleaning, and big data management.

Hongzhi Wang is a professor and doctoral supervisor at Harbin Institute of Technology. He received his Ph.D. degree from Harbin Institute of Technology. He was awarded a Microsoft fellowship, the title of Chinese excellent database engineer, and an IBM Ph.D. fellowship. His research interests include big data management, data quality, and graph data management.

Tao He is an undergraduate in the School of Computer Science and Technology, Harbin Institute of Technology. His research interests include query processing, data cleaning, and data quality.

Jianzhong Li is a professor and doctoral supervisor at Harbin Institute of Technology. He is a fellow of CCF. In the past, he worked as a visiting scholar with the University of California at Berkeley and as a visiting professor with the University of Minnesota. His research interests include databases, parallel computing, wireless sensor networks, etc.

Hong Gao is a professor and doctoral supervisor at Harbin Institute of Technology. She is a senior member of CCF. She received her Ph.D. degree from Harbin Institute of Technology. Her research interests include databases, parallel computing, wireless sensor networks, etc.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Author Contribution

Zhixin Qi: Conceptualization, Methodology, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization. Hongzhi Wang: Resources, Writing - original draft, Writing - review & editing, Supervision. Tao He: Software, Validation, Data curation. Jianzhong Li: Funding acquisition. Hong Gao: Project administration.
