Effective Bayesian-network-based missing value imputation enhanced by crowdsourcing

Journal Pre-proof

Chen Ye, Hongzhi Wang, Wenbo Lu, Jianzhong Li

PII: S0950-7051(19)30534-9
DOI: https://doi.org/10.1016/j.knosys.2019.105199
Reference: KNOSYS 105199

To appear in: Knowledge-Based Systems

Received date: 11 January 2019
Revised date: 3 November 2019
Accepted date: 5 November 2019

Please cite this article as: C. Ye, H. Wang, W. Lu et al., Effective Bayesian-network-based missing value imputation enhanced by crowdsourcing, Knowledge-Based Systems (2019), doi: https://doi.org/10.1016/j.knosys.2019.105199. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier B.V.

Declaration of Interest Statement

Declaration of interests ☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.



Effective Bayesian-network-based missing value imputation enhanced by crowdsourcing

Chen Ye^{a,b}, Hongzhi Wang^{a,∗}, Wenbo Lu^{a}, Jianzhong Li^{a}

^{a} Harbin Institute of Technology, Harbin, P.R. China
^{b} Hangzhou Dianzi University, Hangzhou, P.R. China

Abstract


During the process of data collection, incompleteness is one of the most serious data quality problems to deal with. Traditional imputation methods mostly rely on statistics and machine learning techniques. However, both types of methods are limited in their accuracy because they lack sufficient information about the missing data. To obtain more information, recent methods resort to external sources such as knowledge bases or the worldwide web. Unfortunately, such methods may still be of little help, since there may exist little information about the missing values in the knowledge bases, or too much noise on the web. To tackle these issues, this paper adopts crowdsourcing as the external source, where hundreds of thousands of ordinary workers on the platform can provide high-quality information based on contextual knowledge and human cognitive ability. To reduce the cost, a joint model is proposed for imputation, which integrates crowdsourcing into the process of Bayesian inference. We first construct a Bayesian network for the attributes in the dataset; the missing attribute values are then inferred by Bayesian inference. To improve the accuracy of the Bayesian inference, we outsource a small number of informative tasks to the crowd workers, where the informative tasks are selected based on uncertainty and influence. The proposed approach is evaluated with extensive experiments using real-world datasets with a simulated crowd and two real crowdsourcing platforms. The experimental results show that our approach achieves better performance than other imputation approaches.

Keywords: Missing values, Bayesian network, Crowdsourcing


1. Introduction

Missing values have negative consequences on data quality, leading to the loss of important information. For instance, missing values may exist in the

∗ Corresponding author.
Email addresses: [email protected] (Chen Ye), [email protected] (Hongzhi Wang), [email protected] (Wenbo Lu), [email protected] (Jianzhong Li)

Preprint submitted to Knowledge-Based Systems, November 3, 2019


medical area, and the incomplete information tends to lead to biased results, producing a negative influence on health [16]. Due to its importance, much effort has been devoted to dealing with missing values. Traditional imputation methods can be broadly divided into statistical methods [32, 33, 39] and machine-learning-based methods [8, 21, 25, 28, 31, 34, 38, 42, 43]. On the statistical side, regression imputation [32, 39] is a popular method, which fits response variables with explanatory variables and predicts missing values with the fitted model [29]. However, the fitting quality depends on the selection of the independent variables and the completeness of the training dataset [29, 30]. On the machine learning side, the k-nearest neighbor (KNN) [25, 42] and support vector machine (SVM) [28] are widely adopted. These machine-learning-based methods are typically supervised, and heavily depend on both the training datasets and the selected features [12]. In summary, traditional approaches predict missing values based only on the incomplete dataset, and thus can reach only limited imputation accuracy due to the lack of sufficient knowledge about the missing data [23].
To reach a higher imputation accuracy, some recent work resorted to external sources such as knowledge bases [5, 29], the worldwide web [22, 23], and crowd experts1 [5, 40] for obtaining the missing value information. However, there are also some limitations, since there may exist little information about the missing values in the knowledge bases, too much noise on the web, and unavailable experts in practice. Motivated by these concerns, we propose a novel imputation approach enhanced by crowdsourcing, a public platform to publish tasks and collect feedback from hundreds of thousands of ordinary workers (i.e., the crowd). As the crowd has contextual knowledge and cognitive ability, it can collect enough information about the incomplete tuples [19].
The crowd is not free, and if there are large numbers of tasks, crowdsourcing can be expensive. To reduce the cost, an intuitive idea is that, for the missing values with enough information in the data, statistical/machine learning techniques are used for inference; otherwise, crowdsourcing is adopted for extra information. As the basic framework, we choose the Bayesian network (BN), since using a BN as the imputation model makes it possible to preserve the statistical relationships between variables, as well as to deal with high-dimensional datasets [8]. As a BN makes connections among variables (i.e., attributes), it is also possible to fill the missing values of all the attributes by Bayesian inference in one pass. Crowdsourcing is then utilized to answer only the necessary tasks to improve the accuracy of the Bayesian inference.
Challenges. Although combining the Bayesian network with crowdsourcing for imputation has many benefits, it brings two challenges as follows.
• As the effectiveness of Bayesian inference strongly depends on the structure of the Bayesian network, proper structure construction for the network is needed. However, constructing a proper Bayesian network based on incomplete data tends to be difficult. With a large amount of missing values, we cannot obtain precise parameters, which may lead to ambiguity and mistakes.
• Since we use crowdsourcing to provide additional information for the Bayesian inference, both the cost and accuracy of the crowd feedback should be considered. Specifically, given a finite budget, only a small number of incomplete tuples can be sent to the crowdsourcing platform. As the reliabilities of various workers usually differ, the crowd feedback also cannot be guaranteed to be totally consistent and correct.

1 Note that in [5, 40], the expert crowd is adopted, where workers (i.e., experts) are assumed to give correct answers for all the tasks. This setting differs from real crowdsourcing platforms, where crowd workers may provide wrong answers.


In this paper, we address these challenges. To tackle the first one, we develop a Bayesian network construction algorithm specifically for incomplete data. The algorithm is based on the relevant relationships among attributes. In this way, to infer the missing values of one attribute, we only need the information of its relevant attributes. For the second challenge, to reduce the cost of crowdsourcing, we propose two principles to select the most beneficial tuples for workers to answer, and model them as two optimization problems. One principle focuses on the uncertainty factor, selecting the incomplete tuples that are the most uncertain for the Bayesian inference. We propose an algorithm to solve it and prove its optimality. The other principle focuses on the influence factor, selecting the incomplete tuples that influence the largest number of remaining incomplete tuples. This problem is proven to be NP-hard, and thus we develop a greedy approximation algorithm for it. Moreover, to improve the accuracy of the crowd feedback, we first use a qualification test [1, 2] to eliminate low-quality workers or spammers. We then assign a task to multiple workers and aggregate their answers [3, 20].
Contributions. We summarize our contributions as follows.
• To impute the missing values, we construct a Bayesian network based on the relevant relationships among the attributes. We also develop a basic missing value imputation algorithm according to the Bayesian inference.
• To further improve the effectiveness of imputation, we integrate crowdsourcing into the process of the Bayesian inference. To reduce the cost and ensure the accuracy, we propose two principles for crowd tuple selection, and adopt practical strategies for crowd feedback.


• We conducted extensive experiments on the real-world datasets using a simulated crowd and two real crowdsourcing platforms. The results show that the imputation quality of the Bayesian inference is significantly improved via the power of crowdsourcing.

The paper is organized as follows. We first analyze related work in Section 2. The problem definition and the whole framework are described in Section 3. We propose the algorithms of the Bayesian network construction as well as the basic missing value imputation in Section 4. Section 5 improves the basic missing


value imputation approach with the help of crowdsourcing. We analyze the experimental results in Section 6 and provide conclusions in Section 7.

2. Related Work


Missing value imputation has been widely studied in recent years. Traditional imputation methods can be broadly divided into statistical methods [32, 33, 39] and machine-learning-based methods [8, 21, 25, 28, 31, 34, 38, 42, 43]. Imputation approaches based on statistics, such as regression imputation [32, 39] and expectation maximization (EM) [33], are usually developed for numeric attributes only. These approaches normally assume a desired data distribution based on the complete data, and predict missing values so as to get closer to the desired data distribution. Machine-learning-based approaches mainly include k-nearest neighbor (KNN) [25, 42], support vector machine (SVM) [28], clustering [34, 43], naive Bayesian methods [8, 21], neural networks [31], and random forest techniques [38]. These approaches solve imputation problems as classification problems and use the complete data as training data. As all these approaches infer missing values based on the incomplete datasets, they can only reach a limited imputation accuracy with little information about the missing data. There are also some recent approaches that resort to knowledge bases [5, 29], the worldwide web [22, 23], and crowd experts [5, 40] for extra information. However, there are also some kinds of missing values on which they cannot work well, since there is either too much noise on the web or no available knowledge bases/experts for the missing value information. Different from these approaches, our approach is still effective when missing values widely exist, as we adopt crowdsourcing as the external source for extra information. With selection algorithms for crowd tasks and quality control mechanisms for crowd feedback, the proposed method is more likely to obtain high-quality information within a finite budget, which boosts the performance of the Bayesian inference. Another line of related work is known as task assignment [11, 44, 46].
To achieve a high overall quality, these methods select the most informative tasks and assign them to the crowd. They compute an uncertainty score for each task based on the collected answers, then select the k most uncertain tasks and assign them to the workers. In our paper, we adopt the uncertainty strategy to select the most uncertain tuples for the Bayesian inference process, which is a specific application. In particular, we define the uncertainty in two ways, among which the influence-based method differs from existing works as it considers the influence of a piece of crowd feedback on the other remaining uncertain tuples. There are also several efforts that aim to improve the quality of the Bayesian inference [4, 24] and crowd feedback [9, 14, 41, 45, 47], which are orthogonal to our work. With a more accurate Bayesian inference, our framework can infer more types and larger scales of missing values. A full discussion of the above topics lies beyond the scope of this study. Nevertheless, inferring the missing values by combining the Bayesian inference and crowdsourcing is usually more


reliable than just analyzing the data on hand. Thus, they can be treated as relatively trusted resources. In the preliminary version of this paper [40], we proposed a novel imputation approach combining the Bayesian network with crowdsourcing, where the crowd tuples are selected based on influence and the expert crowd is adopted. This method has two drawbacks. First, it ignores the uncertainty factor in the Bayesian inference, which is a critical factor affecting the imputation result. Second, it considers workers on the crowdsourcing platform as experts, which is not realistic in real-world applications. In this paper, we enrich the preliminary version with a crowd tuple selection approach based on uncertainty. Moreover, we adopt aggregation strategies for the case of noisy crowd feedback. In the experiments, we compare the performance of the uncertainty-based and influence-based crowd tuple selection algorithms on real crowdsourcing platforms as well as a simulated noisy crowd.

3. Framework


In this section, we first provide the problem definition of missing value imputation in Section 3.1, and then present an overview of the whole framework in Section 3.2.


3.1. Problem definition
Considering a set T of tuples, each tuple t ∈ T is made up of attributes V = {V1, V2, ..., Vn} with finite values2. For attribute Vi, if its value in tuple t is null, we consider ti as an incomplete tuple towards Vi. The incomplete tuple set towards Vi is denoted as Mi = {ti^1, ti^2, ..., ti^{wi}}, where wi represents the total number of incomplete tuples towards Vi. All the incomplete tuples towards V = {V1, V2, ..., Vn} are then represented as a family of sets M = {M1, M2, ..., Mn}. Note that two sets Mi and Mj in M may overlap, but for a tuple t ∈ Mi ∩ Mj, the focuses in Mi and Mj are the missing values towards the different attributes Vi and Vj in t, respectively.
To infer the missing values, we adopt the Bayesian network for probabilistic knowledge. The Bayesian network is a directed acyclic graph, in which each node represents a variable with a finite number of states. Each edge connects a pair of nodes, and a conditional probability distribution is attached to each variable. By treating each attribute in the tuple as a variable in the Bayesian network, the Bayesian network can estimate the conditional probability distribution of each attribute. The missing values of each attribute can then be imputed by maximizing the conditional probability. Thus, we convert the missing value imputation problem into a maximum a posteriori estimation (MAP) problem on the Bayesian network, and propose our problem definition based on the Bayesian network as follows.

2 For a continuous attribute, we divide its possible values into finite intervals and infer its most possible interval.


[Figure 1: The Whole Framework Overview. The learning phase covers relevance score calculation, attribute order determination, and Bayesian network construction; the inference phase covers Bayesian inference, uncertainty- and influence-based crowd tuple selection, the crowd platform with its feedback, and updates to the incomplete tuples.]


Definition 1. (Problem Definition) Given a set T of incomplete tuples and a Bayesian network G = (V, E), with V as the attribute set of T and E as the set of relationships among attributes, the goal is to find the most possible value vi∗ for each incomplete tuple towards attribute Vi ∈ V, conditioned on the evidence E. Suppose that attribute Vi has ri possible values, denoted as {vi^1, vi^2, ..., vi^{ri}}. Each missing value towards Vi can then be estimated by maximizing the conditional probability as follows:

vi∗ = arg max_{vi^k} P(Vi = vi^k | E = e).   (1)


In summary, given a set T of incomplete tuples, the goal of our problem is to construct a Bayesian network G, then obtain an estimation M∗ = {M1∗, M2∗, ..., Mn∗} for the family of incomplete tuple sets M = {M1, M2, ..., Mn} based on G.
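The bookkeeping behind this problem statement can be sketched in a few lines. The sketch below is our own illustration, not the authors' code: tuples are dictionaries over attributes, `None` marks a null, and the function collects, per attribute, the indices of the tuples that are incomplete towards it.

```python
# Illustrative sketch (not the authors' code): build the family of incomplete
# tuple sets M = {M_1, ..., M_n}. Tuples are dicts over attributes, and None
# marks a missing (null) value.
from collections import defaultdict

def build_incomplete_sets(tuples, attributes):
    """Return {V_i: [indices of tuples incomplete towards V_i]}."""
    M = defaultdict(list)
    for idx, t in enumerate(tuples):
        for attr in attributes:
            if t.get(attr) is None:
                M[attr].append(idx)
    return dict(M)

tuples = [
    {"A": 1, "B": None, "C": 3},
    {"A": None, "B": 2, "C": None},
    {"A": 5, "B": 7, "C": 9},
]
M = build_incomplete_sets(tuples, ["A", "B", "C"])
# Tuple 1 appears in both M["A"] and M["C"]: as noted above, the sets may
# overlap, but each set focuses on the missing value towards its own attribute.
```

Here `M` equals `{"B": [0], "A": [1], "C": [1]}`, matching the definition that tuple 1 belongs to both M_A and M_C.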


3.2. Framework Overview
To solve this problem, we propose a solution combining Bayesian inference and crowdsourcing. Figure 1 shows the whole framework of our solution. It contains two phases: the learning phase and the inference phase. In the learning phase, we construct a Bayesian network based on the incomplete tuples in Section 4.1. The attributes are ordered, and relevant relationships between attributes are identified. A relevance-based algorithm is then developed to find the proper parents for each attribute in order. Through this algorithm, a Bayesian network is constructed by iteratively adding nodes and edges. In the inference phase, we fill the missing values by combining Bayesian inference and crowd feedback. In Section 4.2, a basic imputation approach based on the Bayesian inference is proposed. Given a Bayesian network, the conditional probability distribution (CPD) for each possible value of each attribute is


computed. The missing values are then imputed by maximizing the conditional probabilities. To further improve the accuracy of imputation, we propose two strategies to select several incomplete tuples as the crowd tuples in Section 5.1. The first strategy, based on uncertainty, selects the most uncertain tuples, whose crowd feedback can greatly boost the Bayesian inference. The second strategy, based on influence, selects the most influential tuples, whose crowd feedback can benefit the largest number of remaining uncertain tuples. In Section 5.2, the selected crowd tuples are sent to the crowdsourcing platform, and the crowd feedback is obtained. To improve the accuracy, we adopt several practical strategies to aggregate the crowd feedback. As a result, we obtain the updates of the incomplete tuples by combining the results of the crowd feedback aggregation and the Bayesian inference.


4. Missing Value Imputation


In this section, we propose the imputation method without crowdsourcing as the basic version of our approach. As shown in Section 3.2, our approach contains two phases. Section 4.1 describes the first phase and Section 4.2 proposes the algorithm for the second phase without crowdsourcing. The crowdsourcing-enhanced version of our approach will be discussed in Section 5.

4.1. Bayesian Network Construction


Traditional Bayesian network construction approaches, such as K2 [6] and Hill-climbing [35], are mainly suitable for complete data, and obtain causal relationships among the attributes according to a strict score function. However, for incomplete data, we measure the relevant relationships among the attributes instead of causal relationships for the following reasons.
• The causal relationship may not always exist among attributes, as it indicates that one attribute value is the result of the occurrence of the other attribute value.
• It is hard to measure the causal relationship based on incomplete data. The computational accuracy of the strict score function may be affected by the missing values, and the judgment of causal relationships will be accordingly affected.


• As a comparison, relevant relationships always exist in the data, and are easier to evaluate via relevance scores. Also, it is sufficient to impute missing values based on relevant relationships among the attributes.

Therefore, we give up building the Bayesian network according to causal relationships but instead discover relevant relationships among the attributes. To achieve this goal, we build the nodes in the order of reliability instead of a random order. The edges are iteratively added to the network according to the relevance score.


[Figure 2: A Bayesian network example (nodes A–G).]


In the following parts, we discuss the approaches of ordering the attributes and computing the relevance score between the attributes in turn, then propose a Bayesian network construction algorithm.
Attribute Order Determination. During the construction of the Bayesian network, we build the nodes according to the order of the attributes V = {V1, V2, ..., Vn} and add edges in turn. That is to say, for each attribute Vi, we first create a node for Vi. We then select Pa(Vi) among the existing nodes {V1, V2, ..., Vi−1}, and add edges from each attribute Vj ∈ Pa(Vi) to Vi, respectively. Since the order of the attributes mainly determines the parents' scope of each attribute, a proper order is needed. We choose to order the attributes according to their reliability, which ensures that the edges representing relevant relationships among the attributes are generated from a more reliable attribute to a less reliable one. Therefore, we can infer the missing values of attribute Vi more accurately according to its more reliable parents. Clearly, the fewer missing values an attribute has, the more reliable the attribute is. Thus, we order the attributes V = {V1, V2, ..., Vn} according to their numbers of missing values w = {w1, w2, ..., wn}.
Relevance Score Calculation. As the Pearson Correlation Coefficient [18] can accurately measure the correlation between attributes, we use it as the score function to measure the relevant relationship between attributes.


Definition 2. (Relevance Score) For each pair of attributes (Vi, Vj), the relevance score is defined as:

Corr(Vi, Vj) = |Cov(Vi, Vj)| / sqrt(Var(Vi) · Var(Vj))   (2)

where Cov(Vi, Vj) represents the covariance between Vi and Vj, and Var(Vi), Var(Vj) represent the variances of Vi and Vj, respectively.
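As an illustration, Eq. (2) amounts to the absolute Pearson correlation. The sketch below is our own; the paper does not specify how missing entries enter the computation, so restricting to the tuple pairs where both attributes are observed (pairwise deletion) is an assumption.

```python
# A hedged sketch of the relevance score in Definition 2: the absolute Pearson
# correlation over the tuples where both attributes are observed (None marks
# a missing value). Pairwise deletion is our assumption, not the paper's.
import math

def relevance_score(xs, ys):
    """|Corr(V_i, V_j)| computed over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return abs(cov) / math.sqrt(vx * vy)

# Perfectly anti-correlated attributes still score 1.0: only the strength of
# the relevant relationship matters for parent selection, not its direction.
print(relevance_score([1, 2, 3, None], [6, 4, 2, 5]))  # ≈ 1.0
```

Taking the absolute value of the covariance reflects the definition's use of |Cov(Vi, Vj)|: a strongly negative correlation is just as useful for inference as a strongly positive one.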

Bayesian Network Construction. According to the previous discussions, we construct the Bayesian network by creating nodes in the order of the reliability of


Algorithm 1 BN construction (T)


Input: A set T of incomplete tuples.
Output: A Bayesian network G.
1: for i ← 1 to n do compute wi, the number of missing values of Vi;
2: V1, V2, ..., Vn ← sort V according to w in ascending order;
3: for i ← 1 to n do
4:   Create a node for Vi;
5:   Set Pa(Vi) = ∅, add-node = ∅, new-score = 0, old-score = 0;
6:   for j ← 1 to i − 1 do
7:     new-score = Corr(Vi, Vj);
8:     if new-score > old-score then
9:       old-score = new-score;
10:      add-node = add-node ∪ {Vj};
11:  if add-node ≠ ∅ then
12:    for each Vj ∈ add-node do
13:      Add Vj to Pa(Vi);
14:      Add an edge from Vj to Vi;
15: return G.


the attributes, and iteratively add edges for each node to maximize the relevance score. Algorithm 1 shows our Bayesian network construction algorithm. We first sort the attributes according to their number of missing items (Lines 1-2). For each attribute, we create a node (Lines 3-4) and find its parents according to the relevance score (Lines 6-10). Finally, we create edges from the parents to each node and calculate the conditional probability of each node conditioned on its parents (Lines 11-17). As a result of Algorithm 1, we obtain a Bayesian network G based on the relevant relationships among the attributes.
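The construction loop can be sketched compactly. This is our reading of Algorithm 1, not the authors' code: `corr` stands in for Corr(Vi, Vj), and each node greedily keeps, as parents, every earlier attribute that raises the running best relevance score.

```python
# A compact sketch (our reading, not the authors' implementation) of
# Algorithm 1: attributes are ordered by their number of missing values, and
# each node keeps every earlier attribute that strictly improves the running
# best relevance score as a parent (edge v_j -> v_i).
def build_bn(attributes, missing_counts, corr):
    order = sorted(attributes, key=lambda a: missing_counts[a])
    parents = {}
    for i, vi in enumerate(order):
        pa, best = [], 0.0
        for vj in order[:i]:           # only already-created (more reliable) nodes
            score = corr(vi, vj)
            if score > best:           # strictly improving candidates only
                best = score
                pa.append(vj)
        parents[vi] = pa
    return order, parents

# Toy relevance table mirroring the first three nodes of Example 1.
table = {("B", "A"): 0.8, ("C", "A"): 0.7, ("C", "B"): 0.5}
order, parents = build_bn(["A", "B", "C"], {"A": 0, "B": 1, "C": 2},
                          lambda x, y: table.get((x, y), 0.0))
# parents == {"A": [], "B": ["A"], "C": ["A"]}
```

As in Example 1, C keeps A but not B, because Corr(C, B) does not exceed the running best score Corr(C, A).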


Example 1. We use Figure 2 to show the whole process of Algorithm 1. Suppose that there are seven attributes in the data. The attributes are sorted according to their total numbers of missing values in ascending order (i.e., A-G). We first create a node for A. Since no node is ordered before A, we set Pa(A) = ∅. The algorithm then comes to the second iteration and creates a node for B. We get new-score = Corr(B, A) > old-score = 0, and add A to Pa(B). In the third iteration, a node for C is created. In the first loop, we get new-score = Corr(C, A) > old-score = 0, and thus add A to Pa(C). In the second loop, we compute new-score = Corr(C, B) < Corr(C, A) = old-score, and finish the loop with Pa(C) = {A}. Similar procedures are conducted for the rest of the nodes, and the Bayesian network in Figure 2 is obtained.
Time Complexity Analysis. The time complexity of the reliability computation and sorting in Lines 1-2 is O(nm + n log n), where m is the total number of tuples, and n is the number of attributes. Lines 3-17 contain n iterations, and the algorithm seeks the parents among the (i − 1) preceding attributes for each node Vi in each iteration. Thus, the cost is O(Σ_{i=1}^{n} (i − 1)) = O(n^2). For each candidate attribute Vj, the algorithm calculates Corr(Vi, Vj), and the cost is O(m^2).


Overall, the time complexity of Algorithm 1 is O(n^2 m^2). As n is usually a small number, the algorithm can be executed quickly.

4.2. Bayesian Inference

Given the Bayesian network G consisting of attributes V = {V1, V2, ..., Vn}3, if the value of attribute Vi of tuple ti ∈ Mi is missing while the values of the other attributes can be observed, the missing value vi∗ can be inferred, according to Eq. (1), by estimating the following equation:

vi∗ = arg max_{vi^k} P(Vi | V1, ..., Vi−1, Vi+1, ..., Vn).   (3)


Definition 3. (Bayesian Inference) According to Pearl's conclusion [26], given the Bayesian network G, each attribute V is conditionally independent of all its non-descendants given its parents Pa(V), where Pa(V) refers to the attributes directly preceding attribute V. Thus, we infer missing values by estimating the probability conditioned only on their parents, i.e., using the following equation instead of Eq. (3):

vi∗ = arg max_{vi^k} P(Vi = vi^k | Pa(Vi)).   (4)


Note that when Pa(Vi) = ∅, vi∗ = arg max_{vi^k} P(Vi = vi^k). To maximize the influence of the attributes with high reliability and fewer missing values, we develop an algorithm that fills the missing values attribute by attribute from top to bottom, i.e., from the most reliable attribute to the least reliable one. Filling the missing values according to the reliability of the attributes allows us to impute a missing value conditioned on the more reliable attributes that are directly linked to it. Note that for the top attribute without parents, the value that maximizes its marginal distribution is filled. When dealing with the missing values of other attributes, we fill them by calculating their conditional probability distributions. We formally give the process of Bayesian inference in Algorithm 2. The algorithm fills the missing values in the order of the attributes' reliability and finds the most likely value for each missing value in the missing set.


Example 2. Consider Example 1. When we infer the value of D, the values of A, B and C are all complete. Thus, we can compute the conditional probability of D according to the knowledge of B and C and find the most likely value for D by maximizing P (D|B, C). We then come to deal with G. Similarly, we aim to compute the conditional probability P (G|D, E). Although in tuple ti , the value of D is initially unknown, we filled it in the above step. Therefore, our algorithm ensures that the parents of each missing value are known when it is to be filled.
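The top-down inference order illustrated in Example 2 can be sketched as follows. The `cpd` lookup is a hypothetical structure of our own (mapping an attribute and its parents' values to a distribution over its possible values); in the paper the CPDs are estimated from the data.

```python
# A hedged sketch of Algorithm 2's top-down inference: attributes are visited
# in reliability order, so each missing value is imputed by maximizing the CPD
# conditioned on already-known (or already-filled) parent values.
# cpd[attr][parent_values] -> {value: probability} is our own illustrative shape.
def bn_impute(tuples, order, parents, cpd):
    for attr in order:                                    # most reliable first
        for t in tuples:
            if t[attr] is None:
                key = tuple(t[p] for p in parents[attr])  # parents already filled
                dist = cpd[attr][key]
                t[attr] = max(dist, key=dist.get)         # MAP value, Eq. (4)
    return tuples

cpd = {
    "A": {(): {"a1": 0.7, "a2": 0.3}},                    # top node: marginal
    "B": {("a1",): {"b1": 0.2, "b2": 0.8},
          ("a2",): {"b1": 0.9, "b2": 0.1}},
}
out = bn_impute([{"A": None, "B": None}], ["A", "B"], {"A": [], "B": ["A"]}, cpd)
# out == [{"A": "a1", "B": "b2"}]
```

Because A is filled before B is visited, B's parent value is always available when its CPD is consulted, mirroring the guarantee stated at the end of Example 2.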

3 To facilitate the expression and keep consistency, the attributes V = {V1, V2, ..., Vn} that appear below are all considered to have been sorted.


Algorithm 2 BNMVI (T, G, M)


Input: The set T of tuples, the generated Bayesian network G, and the incomplete tuple sets M = {M1, M2, ..., Mn}.
Output: The complete tuples T∗.
1: for i ← 1 to n do
2:   for each ti ∈ Mi do
3:     vi∗ ← arg max_{vi^k} P(Vi = vi^k | Pa(Vi));
4:     Vi ← vi∗;
5: return T∗.

Theorem 1 shows that the proposed algorithm obtains the optimal solution for the problem defined in Definition 3.


Theorem 1. Given a Bayesian network G and the whole incomplete tuple set M = {M1, M2, ..., Mn}, for each Mi, the maximum a posteriori estimation of Vi, which is treated as the most possible value vi∗, can be computed according to Eq. (4).


Proof. We prove the theorem by induction. (1) When i = 1, the missing value set M1 refers to the missing items corresponding to attribute V1, which is at the top of G. As Pa(V1) = ∅, the most possible value v1∗ for each missing item of V1 is calculated by v1∗ = arg max_{v1^k} P(V1 = v1^k). (2) Assume that when i = n − 1, the missing values in the set Mn−1 can be


calculated by v∗_{n−1} = arg max_{v_{n−1}^k} P(Vn−1 = v_{n−1}^k | Pa(Vn−1)). When i = n, the missing values of the attributes before Vn−1 have been imputed before the above step, and the missing values of the attribute Vn−1 are imputed in the above step. Thus, the imputations of the attributes {V1, V2, ..., Vn−1} before Vn are all accomplished. As Pa(Vn) is contained in {V1, V2, ..., Vn−1} and the values of Pa(Vn) are all known, the missing values in the set Mn can be calculated by vn∗ = arg max_{vn^k} P(Vn = vn^k | Pa(Vn)). Thus, the theorem is proved.


Time Complexity Analysis. For each attribute, the algorithm fills the missing values one by one (Lines 1-5). Its time complexity is O(|M| · l), where |M| = max{|M1|, |M2|, ..., |Mn|}, l = max{l1, l2, ..., ln}, |Mi| is the number of tuples in Mi, li is the number of possible values of Vi, and n is the number of attributes. In the worst case, |M| is equal to m, the total number of tuples. In summary, the total time complexity of Algorithm 2 is O(nlm). Although we have proved that the most likely value can be calculated in order, we cannot ensure that the most likely value is the correct value, especially when the information is not enough for the Bayesian inference. As the performance of the Bayesian inference mainly relies on the initial information, it cannot work well when the initial information is not enough to obtain a precise estimation. For example, after calculating the conditional


probabilities of D with three possible values in Figure 2, the results corresponding to these values are P(D = d1 | B, C) = 0.34, P(D = d2 | B, C) = 0.36, and P(D = d3 | B, C) = 0.3. The Bayesian inference may make mistakes if it simply chooses D = d2 as the correct value among such similar probabilities. Thus, we use crowdsourcing to provide extra information to help improve the accuracy of the Bayesian inference in the next section.


To improve the accuracy, we adopt crowdsourcing to provide external knowledge for the Bayesian inference. Obviously, the tuples whose imputed values are uncertain should be sent for crowdsourcing. However, considering the cost of crowdsourcing, it is not realistic to send all the uncertain tuples to the crowd. In particular, when the data set is very large, the number of uncertain tuples is also relatively large. Therefore, we prefer to select several representative tuples, called crowd tuples, to send to the crowd. In this section, we first present our crowd tuple selection strategies and then propose the method to utilize crowd feedback in the Bayesian inference.


5.1. Crowd Tuple Selection

As we select crowd tuples to provide external information for the Bayesian inference, it is unnecessary to crowdsource the tuples that have already been imputed by the Bayesian inference with a high probability. Thus, we only need to select the tuples with a high uncertainty for crowdsourcing. Even for such uncertain tuples, we do not have to send all of them to the crowd. On the one hand, sending all such tuples is costly. On the other hand, due to the relationships between tuples, once some tuples are filled with certain values, other tuples can be accurately filled. Motivated by this, we develop two crowd tuple selection algorithms in this section. Before introducing the algorithms, we define the uncertain tuple set. We then develop two algorithms, called the uncertainty-based algorithm and the influence-based algorithm, to select crowd tuples from the uncertain set. As the maximum a posteriori estimate (denoted as pvi∗) represents the upper bound of the probability of the filled value of Vi in an incomplete tuple ti, we select ti as an uncertain tuple if pvi∗ is smaller than a given threshold θ.


Definition 4. (Uncertain tuple set) Given a Bayesian network G, the whole incomplete tuple set M = {M1, M2, . . . , Mn}, and a lower threshold θ, for each incomplete tuple set Mi, the uncertain tuple set is defined as:

$M_i^u = \{t_i \mid p_{v_i^*} < \theta,\; t_i \in M_i\}$,   (5)

where $p_{v_i^*} = \max_{v_i} P(V_i = v_i \mid Pa(V_i))$. The whole uncertain tuple set is then defined as Mu = {M1u, M2u, . . . , Mnu}. We then introduce the uncertainty-based algorithm and the influence-based algorithm based on the whole uncertain tuple set.
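As a concrete illustration of Definition 4, the following sketch (with a hypothetical `candidate_probs` layout standing in for the output of the Bayesian inference; this is not the authors' implementation) selects the tuples whose maximum a posteriori probability falls below θ:

```python
# Illustrative sketch of Definition 4 (hypothetical data layout).
# `candidate_probs` maps each incomplete tuple id to the conditional
# probabilities P(V_i = v_i^k | Pa(V_i)) of its candidate values.

def uncertain_tuple_set(candidate_probs, theta):
    """Return the uncertain tuple set M_i^u: the tuples whose maximum
    a posteriori probability p_{v_i^*} stays below the threshold theta."""
    return [tid for tid, probs in candidate_probs.items()
            if max(probs) < theta]

# Tuple "t2" is uncertain: its best candidate only reaches 0.36 < 0.7.
M_u = uncertain_tuple_set({"t1": [0.9, 0.1],
                           "t2": [0.34, 0.36, 0.3]}, theta=0.7)
```

Lowering θ shrinks the uncertain set and therefore the crowdsourcing budget; raising it trades more crowd questions for higher confidence.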

5.1.1. Uncertainty-based Algorithm

For the uncertain tuple set Miu of attribute Vi, a basic method to improve the accuracy of the Bayesian inference is to estimate the uncertainty of each tuple and select the most uncertain tuples. As the uncertainty of a tuple can be derived by measuring the disagreement among the probabilities of the different values, we adopt entropy [7], which is widely used to measure uncertainty, to measure this disagreement. The uncertain score of a tuple can then be quantified as the entropy over the probabilities with which each of the possible values is predicted. For each uncertain tuple ti ∈ Miu, we formulate the uncertain score as follows:

$\mathrm{score}(t_i) = -\sum_k p(v_i^k) \log p(v_i^k)$,   (6)

where $p(v_i^k) = P(V_i = v_i^k \mid Pa(V_i))$. Intuitively, the uncertainty-based crowd tuple selection strategy is to select a subset of tuples with the highest uncertainty, i.e., the maximal total uncertain score. The problem is defined as follows, where the uncertain score of Miq is the sum of score(ti) for all ti ∈ Miq due to the independence of the uncertain scores.


Definition 5. (Uncertainty-based Crowd Tuple Selection) For each missing set Mi, given the uncertain tuple set Miu and a number q, it selects a subset of uncertain tuples Miq ⊆ Miu satisfying: (1) the size |Miq| ≤ q; (2) the uncertain score of Miq is maximized.


To solve this problem, we design an uncertainty-based algorithm. It is a greedy algorithm that selects the uncertain tuple with the highest uncertain score as the crowd tuple at each stage. Algorithm 3 shows the uncertainty-based crowd tuple selection algorithm. In Lines 3-5, we select a tuple that maximizes the uncertain score in each iteration.

Algorithm 3 Crowd unc (T, G, Mu, q)


Input: The set T of tuples, the generated Bayesian network G, the whole uncertain tuple set Mu = {M1u, M2u, . . . , Mnu}, and the number q of crowd tuples.
Output: The whole crowd tuple set Mq = {M1q, M2q, . . . , Mnq}.
1: for i ← 1 to n do
2:   Set Miq ← ∅;
3:   for r ← 1 to q do
4:     ti ← arg max_{ti ∈ Miu \ Miq} −Σ_k p(vik) log p(vik);
5:     Miq ← Miq ∪ {ti};
6: return Mq.

Theorem 2 shows that the uncertainty-based algorithm obtains the optimal solution for the problem defined in Definition 5.
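A minimal sketch of Algorithm 3 for a single attribute, assuming the candidate-value probabilities p(vik) of each uncertain tuple are already available from the Bayesian inference (the data layout is illustrative, not the authors' implementation):

```python
import math

def entropy_score(probs):
    # Eq. (6): score(t_i) = -sum_k p(v_i^k) * log p(v_i^k)
    return -sum(p * math.log(p) for p in probs if p > 0)

def crowd_unc(uncertain_probs, q):
    """Uncertainty-based selection for one attribute (Algorithm 3):
    greedily keep the q tuples with the highest entropy scores."""
    scores = {tid: entropy_score(p) for tid, p in uncertain_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:q]

M_u = {"t1": [0.6, 0.4],
       "t2": [0.34, 0.36, 0.3],   # near-uniform -> most uncertain
       "t3": [0.65, 0.35]}
M_q = crowd_unc(M_u, q=1)
```

Because the uncertain scores are independent (cf. Theorem 2), the stage-by-stage greedy choice is equivalent to sorting once and taking the top q tuples, which is what the sort above does.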


Theorem 2. Given the whole uncertain tuple set Mu = {M1u , M2u , . . . , Mnu } and a number q, for each uncertain tuple set Miu , the total uncertain score of Miq is maximized according to Algorithm 3.


Proof. We prove the theorem by showing that Algorithm 3 has the greedy choice property and the optimal substructure property. (1) It has the greedy choice property. Let Miq be an optimal solution of Miu whose crowd tuples {ti1, ti2, . . . , tiq} are sorted by decreasing uncertain score. Let the first crowd tuple in Miq be tik. If k = 1, then we are done. If k ≠ 1, then we can remove tuple tik and add tuple ti1, because score(ti1) ≥ score(tik). Since score(Miq − {tik} ∪ {ti1}) ≥ score(Miq), ti1 is in an optimal solution. (2) It has the optimal substructure property. Let Miq be an optimal solution to Miu; then Miq′ = Miq − {ti1} is an optimal solution to Miu′ = Miu − {ti1}. Thus, we prove the theorem.


Time Complexity Analysis. The algorithm computes score(ti ) for each uncertain tuple {ti |ti ∈ Miu \ Miq } in r iterations, each of which needs to scan all the uncertain tuples in {Miu \ Miq }. Thus, the time complexity of Algorithm 3 is O(r · |M u \ M q |), where |M u | = max{|M1u |, |M2u |, . . . , |Mnu |}, |M q | = max{|M1q |, |M2q |, . . . , |Mnq |}, |Miu | is the number of tuples in Miu , and |Miq | is the number of tuples in Miq .


5.1.2. Influence-based Algorithm

Although selecting the most uncertain tuples as crowd tuples can improve the accuracy of the Bayesian inference, it ignores their influence on the remaining uncertain tuples. Without the consideration of influence, crowd feedback may not be used sufficiently to impute other uncertain tuples. To overcome this drawback, we propose an influence-based algorithm which not only considers the uncertainty of the incomplete tuples but also takes the information of the other uncertain tuples into account. The basic idea is that, for each uncertain set, we select as crowd tuples a set of uncertain tuples which can influence the highest number of remaining uncertain tuples, so as to increase the benefits of crowd feedback. The influence of a crowd tuple set Miq is calculated by the following equation, which counts the influenced uncertain tuples:

$INF(M_i^q) = \sum_{t_i \in M_i^u} I\big(|p'_{v_i^*} - p_{v_i^*}| \neq 0\big)$,   (7)

where I(·) is an indicator function:

$I\big(|p'_{v_i^*} - p_{v_i^*}| \neq 0\big) = \begin{cases} 1, & \text{if } |p'_{v_i^*} - p_{v_i^*}| \neq 0; \\ 0, & \text{if } |p'_{v_i^*} - p_{v_i^*}| = 0, \end{cases}$

and p′vi∗ is the updated probability of the most likely value vi∗ when the missing values in the uncertain tuples ti ∈ Miq are all known from the crowd. We then define the problem based on Eq. (7) as follows.


Definition 6. (Influence-based Crowd Tuple Selection) For each incomplete tuple set Mi, given the uncertain tuple set Miu and a number q, it selects a subset of uncertain tuples Miq ⊆ Miu satisfying: (1) the size |Miq| ≤ q; (2) INF(Miq) is maximized.

Theorem 3 shows the difficulty of this problem.

Theorem 3. Influence-based crowd tuple selection is NP-hard.


Proof. We prove the theorem by showing that the influence-based crowd tuple selection problem and the maximum coverage problem are equivalent under L-reduction. Recall that an instance of the maximum coverage problem (U, S, k) consists of a set of elements U = {u1, u2, . . . , u|U|}, a collection of subsets S = {S1, S2, . . . , S|S|} where Si ⊆ U, and a number k. The problem aims to select k subsets S∗ ⊆ S to maximize the number of covered elements $|\bigcup_{S \in S^*} S|$. Let F = Maximum Coverage Problem(U, S, k) and G = Influence-based Crowd Tuple Selection Problem(Miu, C, q), where Miu = {t1, t2, . . . , t|Miu|} is an uncertain set, C = {C1, C2, . . . , C|C|} is the collection of influence sets of the uncertain tuples, with C(ti) = {tj | |p′vj∗ − pvj∗| ≠ 0, tj ∈ Miu}, and q is the number of crowd tuples. Define a transformation f from G to F by Miu = U, C = S, and q = k. Given the optimal solution S∗ = {S1, S2, . . . , Sk} of F, the optimal solution of G is Miq = {ti | C(ti) = Si}. An instance of a crowd tuple set corresponds to that of a maximum coverage set of the same size. Thus, f is an L-reduction with α = β = 1. Define a transformation g from F to G by U = Miu, S = C, and k = q. Given the optimal solution Miq = {t1, t2, . . . , tk} of G, the optimal solution of F is S∗ = {Si | Si = C(ti)}. Since a maximum coverage set corresponds to a crowd tuple set of the same size, g is an L-reduction with α = β = 1. As it is well known that the maximum coverage problem is NP-hard, the influence-based crowd tuple selection problem is also NP-hard.

Algorithm 4 Crowd inf (T, G, Mu, q)


Input: The set T of tuples, the generated Bayesian network G, the whole uncertain tuple set Mu = {M1u, M2u, . . . , Mnu}, and the number q of crowd tuples.
Output: The whole crowd tuple set Mq = {M1q, M2q, . . . , Mnq}.
1: for i ← 1 to n do
2:   Miq ← ∅;
3:   for r ← 1 to q do
4:     ti ← arg max_{ti ∈ Miu \ Miq} INF(Miq ∪ {ti}) − INF(Miq), where INF(Miq) ← Σ_{ti ∈ Miu} I(|p′vi∗ − pvi∗| ≠ 0);
5:     Miq ← Miq ∪ {ti};
6: return Mq.

Due to the difficulty of the influence-based crowd tuple selection problem, we design an approximate algorithm that adopts the greedy strategy. It selects at each stage the uncertain tuple which can influence the highest number of remaining uncertain tuples as the crowd tuple, until the number of selected crowd tuples reaches the given number q. Algorithm 4 shows the greedy crowd tuple selection algorithm. In Lines 3-5, we select a tuple that maximizes the marginal influence in each iteration. Theorem 4 shows the approximation ratio bound of the proposed algorithm.

Theorem 4. The approximation ratio of our greedy influence-based crowd tuple selection algorithm is 1 − 1/e.


Proof. We have proved in Theorem 3 that the influence-based crowd tuple selection problem and the maximum coverage problem are equivalent under L-reduction, and in both reductions |k| = |q|. The reductions are S-reductions with size amplification n in the number of subsets and questions in the maximum coverage problem and the influence-based crowd tuple selection problem, respectively. Thus, an approximation algorithm for one of the problems gives us an equal approximation algorithm for the other. Since the greedy algorithm for maximum coverage chooses the set containing the largest number of uncovered elements at each stage, it achieves an approximation ratio of 1 − 1/e, where e is the base of the natural logarithm [13]. Therefore, our greedy algorithm, which selects the uncertain tuple affecting the most unselected tuples at each stage, has the same approximation ratio of 1 − 1/e.
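For reference, the maximum-coverage greedy invoked in the proof can be sketched as follows (a generic textbook routine with illustrative set names, not part of the authors' system):

```python
def greedy_max_coverage(subsets, k):
    """Textbook greedy for maximum coverage: at each stage pick the subset
    covering the most still-uncovered elements. This is the routine whose
    (1 - 1/e) guarantee transfers to Algorithm 4 through the reduction."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(subsets, key=lambda name: len(subsets[name] - covered))
        chosen.append(best)
        covered |= subsets[best]
    return chosen, covered

S = {"S1": {1, 2, 3}, "S2": {3, 4}, "S3": {5}}
chosen, covered = greedy_max_coverage(S, k=2)
# First pick: S1 (3 new elements); then S2 and S3 each add one new
# element, and the tie is broken by iteration order.
```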


Time Complexity Analysis. The algorithm computes INF(Miq ∪ {ti}) for each uncertain tuple {ti | ti ∈ Miu \ Miq} by calculating the new probability of the most likely value of each ti ∈ Miu over r iterations, each of which needs to scan all the uncertain tuples in Miu. Thus, the time complexity of Algorithm 4 is O(r · |Mu \ Mq| · |Mu|), where |Mu| = max{|M1u|, |M2u|, . . . , |Mnu|}, |Mq| = max{|M1q|, |M2q|, . . . , |Mnq|}, |Miu| is the number of tuples in Miu, and |Miq| is the number of tuples in Miq.

Remark. As the uncertainty-based algorithm and the influence-based algorithm have different pros and cons, neither is strictly superior to the other. Specifically, the uncertainty-based algorithm selects the most uncertain tuples to achieve a high overall quality. Its drawback is that it ignores the influence among the remaining tuples and may ask the crowd to do a lot of redundant work. The influence-based algorithm, in contrast, uses the crowdsourced information to deduce some other task results, thus saving the cost of asking the crowd to do those tasks. However, the influence-based algorithm may reduce the imputation quality if the crowd makes mistakes in its answers, since those mistakes will be amplified in the other related imputation tasks.
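To make Eq. (7) and the greedy choice in Algorithm 4 concrete, here is a toy sketch; the helper `updated_posterior` is a hypothetical oracle standing in for re-running the Bayesian inference with the crowd-resolved tuples fixed (it is not part of the paper's implementation):

```python
def inf_score(M_q, M_u, posterior, updated_posterior):
    """Eq. (7) sketch: count the uncertain tuples whose most-likely-value
    probability changes once the tuples in M_q are resolved by the crowd."""
    updated = updated_posterior(M_q)
    return sum(1 for t in M_u if updated[t] != posterior[t])

def crowd_inf(M_u, q, posterior, updated_posterior):
    # Greedy loop of Algorithm 4: add the tuple with the largest marginal
    # influence gain INF(M_q + {t}) - INF(M_q) until q tuples are chosen.
    M_q = []
    for _ in range(q):
        best = max((t for t in M_u if t not in M_q),
                   key=lambda t: inf_score(M_q + [t], M_u,
                                           posterior, updated_posterior))
        M_q.append(best)
    return M_q

# Toy network effect: resolving t1 also shifts the posterior of t2,
# resolving t3 affects only itself, and resolving t2 changes nothing.
post = {"t1": 0.5, "t2": 0.55, "t3": 0.6}

def toy_update(resolved):
    new = dict(post)
    if "t1" in resolved:
        new["t1"], new["t2"] = 1.0, 0.9
    if "t3" in resolved:
        new["t3"] = 1.0
    return new

chosen = crowd_inf(list(post), q=1,
                   posterior=post, updated_posterior=toy_update)
```

Since INF(Miq) is the same constant for every candidate within one stage, maximizing INF(Miq ∪ {ti}) directly is equivalent to maximizing the marginal gain.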

5.2. Bayesian Inference with Crowd Feedback

As the Bayesian inference infers the missing values in order according to M = {M1, M2, . . . , Mn}, we can acquire the uncertain tuple set Mu = {M1u, M2u, . . . , Mnu} during the inference process. After crowd tuple selection, the whole crowd tuple set Mq = {M1q, M2q, . . . , Mnq} is sent to the crowd. This section describes the process of Bayesian inference with crowd feedback.

Crowd Feedback Aggregation. As the feedback of the workers on a real crowdsourcing platform cannot be guaranteed to be totally correct, we collect the imputation feedback of multiple workers for each incomplete tuple. To discover the correct imputation result, we adopt the aggregation strategy in [20] for the crowd feedback.

Algorithm 5 CrowdBNMVI (T, M, q)


Input: The set T of tuples, the corresponding incomplete tuple set M = {M1, M2, . . . , Mn}, and the number q of crowd tuples.
Output: The complete tuples T∗.
1: G ← BN construction(T);
2: for i ← 1 to n do
3:   Miu ← ∅;
4:   for each ti ∈ Mi do
5:     if pvi∗ ≥ θ then
6:       vi∗ ← arg max_{vik} P(Vi = vik | Pa(Vi));
7:       Vi ← vi∗;
8:     else
9:       Add ti to Miu;
10: Mu ← {M1u, M2u, . . . , Mnu};
11: Mq ← Crowd unc(T, G, Mu, q) or Crowd inf(T, G, Mu, q);
12: Send the tuples in Mq to the crowdsourcing platform;
13: Impute the tuples in Mq with the aggregated crowd feedback;
14: T∗ ← BNMVI(T, G, Mu \ Mq);
15: return T∗.
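The threshold split at the heart of Algorithm 5 (Lines 2-9) can be sketched as follows; the `posteriors` layout is a hypothetical stand-in for the Bayesian inference output, not the authors' implementation:

```python
def split_by_threshold(incomplete, posteriors, theta=0.7):
    """Sketch of Lines 2-9 of Algorithm 5 (illustrative data layout):
    impute a tuple directly when its MAP probability reaches theta,
    otherwise defer it to the uncertain set M^u for crowdsourcing.

    posteriors: tuple id -> {candidate value: probability}."""
    imputed, uncertain = {}, []
    for t in incomplete:
        dist = posteriors[t]
        v_star = max(dist, key=dist.get)
        if dist[v_star] >= theta:
            imputed[t] = v_star      # confident: keep the MAP value
        else:
            uncertain.append(t)      # uncertain: candidate crowd tuple
    return imputed, uncertain

post = {"t1": {"a": 0.9, "b": 0.1},
        "t2": {"d1": 0.34, "d2": 0.36, "d3": 0.3}}
filled, M_u = split_by_threshold(["t1", "t2"], post)
```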


We summarize the whole process in Algorithm 5. It contains the following two steps.

Step 1. Imputation by the Bayesian inference (Lines 2-7). For each ti ∈ Mi, we compute the most possible value vi∗ by maximizing the conditional probability, and set a lower threshold θ for the probability of the most likely value. When the probability of the most possible value has an absolute advantage over the other candidates (pvi∗ ≥ θ), we set it as the correct value. Otherwise, crowdsourcing starts in Step 2.

Step 2. Imputation by the Bayesian inference and crowd feedback (Lines 8-14). For each ti ∈ Mi, when the probability of the most possible value vi∗ is similar to those of the other candidates (pvi∗ < θ), we treat the tuple as an uncertain tuple and add it to the uncertain set Miu. We then select the crowd tuple set Miq with the uncertainty-based algorithm or the influence-based algorithm in Section 5.1 and send it to the crowd. Finally, we obtain the aggregated crowd feedback and infer the remaining uncertain tuples with the Bayesian inference algorithm.

Time Complexity Analysis. Given the time complexity of Algorithms 1-4, the time complexity of Algorithm 5 is O(n²m² + nl|M| + r · |Mu \ Mq| · |Mu|), where n is the number of attributes, m is the total number of tuples, |M| = max{|M1|, |M2|, . . . , |Mn|}, l = max{l1, l2, . . . , ln}, |Mi| is the number of tuples in Mi, li is the number of possible values of Vi, |Mu| = max{|M1u|, |M2u|, . . . , |Mnu|}, |Mq| = max{|M1q|, |M2q|, . . . , |Mnq|}, |Miu| is the number of tuples in Miu, and |Miq| is the number of tuples in Miq.

6. Experiments

In this section, we evaluate the proposed method on five real-world datasets. The experimental results clearly demonstrate the advantages of the proposed method by incorporating the crowd feedback into the Bayesian inference. We first discuss the experimental setup in Section 6.1. With a simulated crowd, we explore the effects of our parameters in Section 6.2. In Section 6.3, we test the proposed method on two well-known public crowdsourcing platforms, i.e., Amazon Mechanical Turk (AMT) [1] and Figure Eight (formerly known as CrowdFlower) [2].

6.1. Experimental Setup


Algorithms. For the proposed method, we implement the basic imputation algorithm BNMVI as well as the crowdsourcing-enhanced imputation algorithm CrowdBNMVI. We denote CrowdBNMVI with Crowd unc and Crowd inf as BNMVI+Crowd unc and BNMVI+Crowd inf, respectively. As baselines, the following missing value imputation methods are implemented:
• Mode: This is the naive imputation approach which uses the most frequent value as the correct value.
• KNNI [27]: This approach is the k-nearest-neighbor imputation, where missing values are imputed using the values calculated from the k nearest observed data. In the experiments, we set the number of neighbors to 3.


• KNNI [27]: This approach is the k nearest neighbor imputation where missing values are imputed using the values calculated from the k nearest observed data. In the experiments, we set the cluster number to 3. • SVM [27]: This approach is based on the use of kernel functions to construct a hyperplane, which allows the data instances to be linearly separable. In the experiments, we use SVM based on the radial basis function (RBF) kernel.


• MIBOS [37]: This approach fills categorical missing values based on a novel similarity model among incomplete tuples. It constructs the similarity matrix of tuples and further obtains the nearest undifferentiated tuple set of each tuple to iteratively infer the missing data.
• CMI [43]: This approach divides the original dataset into clusters, and the missing values of a tuple are patched up with the plausible values generated from the tuple's cluster. In the experiments, we set the cluster number to 5 and choose the candidates with the maximum probability as substitutions.


Table 1: The statistics of the datasets

dataset      # tuples   # attributes   attribute type               # classes
Zoo          101        17             15 categorical/2 numerical   2-7
Olympics     1624       8              categorical                  2-85
Chess        3196       37             categorical                  2
Earthquake   10,024     12             2 categorical/10 numerical   5-81
Disease      69,925     12             7 categorical/5 numerical    2-8

• Certain [10]: This approach imputes missing values under the constraints of certain rules which are defined over attributes. In the experiments, to ensure the accuracy of the rules, we assume that there is a functional dependency from attribute A to B when the probability P (B|A) = 1.
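A small sketch of the rule check behind the Certain baseline as configured above, under the stated assumption that a functional dependency A → B is accepted when P(B|A) = 1 on the observed data (attribute names and data layout are illustrative):

```python
from collections import defaultdict

def holds_certain_rule(tuples, a, b):
    """Accept the rule a -> b when every observed value of attribute a
    maps to exactly one value of attribute b, i.e., P(B|A) = 1."""
    images = defaultdict(set)
    for t in tuples:
        images[t[a]].add(t[b])
    return all(len(values) == 1 for values in images.values())

rows = [{"A": "x", "B": 1}, {"A": "x", "B": 1}, {"A": "y", "B": 2}]
fd_holds = holds_certain_rule(rows, "A", "B")
```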


Datasets. To comprehensively test the proposed approaches, we conduct extensive experiments on five real-world datasets. Their basic information is shown in Table 1. These datasets are all complete, with initially no missing values. Note that for each continuous numerical attribute, we divide its possible values into finite intervals and infer its most possible interval. # classes indicates the range of the number of possible values (intervals) for each attribute. The datasets are described as follows.


• Zoo4 . This dataset contains one attribute describing the name of each animal and the other attributes describing the various characteristics of different animals. We use 17 attributes as our test data, in which the attributes are mostly boolean attributes, such as Hair, Feathers, and Eggs.


• Olympics4: The dataset contains the award information of seven events of the Summer Olympic Games from 1896 to 2012. There are 8 categorical attributes, i.e., Year, City, Sport, Discipline, Athlete, Country, Gender, Event, and Medal, in the dataset.
• Chess5: The dataset contains a set of board descriptions for chess endgames. There are 37 categorical attributes in the dataset. The first 36 attributes describe the board, and the last (37th) attribute represents "win" or "nowin" for the game.


• Earthquake6: This dataset covers all the recorded earthquakes in Turkey from 27/09/1910 to 27/09/2017, with the intensity filter set to 3.5 to 9.0. We use 12 categorical attributes, i.e., date, lat, long, city, direction, dist, depth, xm, md, richter, ms, and mb, as our test data.
• Disease7: This dataset collects patient information during a medical examination. It includes three types of information, i.e., factual information,

4 https://www.kaggle.com/the-guardian/olympic-games#summer.csv
5 http://archive.ics.uci.edu/ml/
6 https://www.kaggle.com/caganseval/earthquake
7 https://www.kaggle.com/sulianova/cardiovascular-disease-dataset


results of medical examination, and information given by the patient. We use all the attributes except the id as our test data.

Performance Measures. To evaluate the performance, we test these methods in terms of effectiveness and efficiency. The effectiveness is measured by Recall, Precision, and F-measure. We randomly remove values from the original dataset according to different percentages as the missing data and apply the imputation methods to fill the removed values. Let truth be the set of removed values and found be the set of filled values returned by the imputation algorithms (not including the missing values that fail to be filled).

• Recall is calculated as Recall = |truth ∩ found| / |truth|, which represents the proportion of missing values that are accurately filled.

• Precision is calculated as Precision = |truth ∩ found| / |found|, denoting the proportion of filled values that are correct.

• F-measure is calculated as F-measure = 2 · (Precision · Recall) / (Precision + Recall), which represents the overall accuracy [36].
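The three effectiveness measures can be computed directly from the truth and found sets; a sketch follows, representing each filled cell as a (tuple id, attribute, value) triple (an illustrative encoding, not the authors' code):

```python
def imputation_metrics(truth, found):
    """Recall, Precision and F-measure as defined above; `truth` and
    `found` are sets of (tuple id, attribute, value) triples."""
    hit = len(truth & found)
    recall = hit / len(truth) if truth else 0.0
    precision = hit / len(found) if found else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure

truth = {("t1", "A", "x"), ("t2", "A", "y"), ("t3", "B", "z")}
found = {("t1", "A", "x"), ("t2", "A", "wrong")}
r, p, f = imputation_metrics(truth, found)
```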


• Efficiency is measured by the processing time of each missing value imputation method.

6.2. Crowd Simulations on Real-world Data


To explore the performance of the proposed methods under different parameter settings, we first conduct experiments on the Zoo, Chess, and Earthquake datasets with a simulated crowd. For each crowd tuple, we simulate the feedback from 10 workers and aggregate the crowd feedback using [20]. We follow a practical guideline [15] to simulate the different worker characteristics of the crowd. According to a study on the crowd population at real-world crowdsourcing services [17], we assign the default worker population as follows: 4 reliable workers (90% probability of providing the correct answer), 4 sloppy workers (50% probability), and 2 spammers (10% probability). In the following parts, the distribution of the worker types is the same as discussed unless stated otherwise.

The effect of missing rates and data sizes. We first compare our proposed methods with the baseline methods under different missing rates (denoted as Mis Rate) and data sizes. For BNMVI+Crowd unc and BNMVI+Crowd inf, we set the percentage of crowd tuples among all the incomplete tuples (denoted as Crowd rate) to 20% and the lower threshold θ to 0.7. In the experiments varying data sizes, Mis Rate is set to 10%. The results are shown in Figure 3 and Figure 4, respectively. We can see that our methods BNMVI+Crowd unc and BNMVI+Crowd inf achieve a significantly higher F-measure in most cases. The reason is that, with the help of crowdsourcing, BNMVI+Crowd unc and BNMVI+Crowd inf obtain more information about the uncertain tuples. It is also worth mentioning that our basic method BNMVI still performs better than the other methods in both Precision and Recall. We obtain a comparable Precision

Figure 3: Performance comparison under different settings of missing rates

due to the sufficient use of the information of the relevant attributes, and a higher Recall is achieved by making full use of the information of the filled values. For Certain, as it imputes missing values according to accurate rules, a high Precision is achieved. However, as few tuples satisfy the rules, the corresponding Recall is extremely low in Figures 3(b)(e)(h). As different missing rates influence the number of filled values, the Recall is not stable. The machine learning baselines KNN and SVM share a similar performance. As they use the complete part of the data for training, their F-measure decreases with the increase in the missing rate and rises with the increase in the data size. For MIBOS, whose similarity among tuples is defined by value equality, the Precision is similar to that of CMI. In particular, as shown in Figures 3(b)(e)(h), as few tuples can satisfy value equality on every attribute, MIBOS fills a limited number of missing values, which leads to a low Recall. CMI, which considers similarities over all the attributes, has more opportunities to fill missing values; thus, it has a higher Recall, as shown in Figures 3(b)(e)(h). However, considering all the attributes with value similarities limits the number of similar tuples found by CMI. Instead, by imputing missing values according to the Bayesian network in order, our methods can infer the current value of an attribute according to its parents, whose missing values have already been imputed. Therefore, as shown in Figures 3(b)(e)(h), the Recall of our methods is even

Figure 4: Performance comparison under different settings of data sizes

higher. Similar results can also be observed in Figure 4.

To further show the performance of the Bayesian inference with crowd feedback, we vary the parameters in the following parts and compare the accuracy of BNMVI with and without crowdsourcing. Note that as the proposed methods can always obtain all the imputation results for the uncertain tuples, i.e., Precision = Recall = F-measure, we only report the results in terms of the F-measure.

The effect of the lower threshold θ. First, we evaluate the impact of the lower threshold θ. In the experiments, we fix the Crowd rate and the Mis Rate at 20% and 50%, respectively. The results are shown in Figures 5(a)(d)(g). From Figure 5(a), we observe that the F-measure of BNMVI+Crowd inf and BNMVI+Crowd unc increases with the increase in the threshold θ, reaches the highest value when θ reaches 0.7, and then decreases. This is because, with the increase of θ, more uncertain tuples are inferred with the information of crowd feedback. When θ is higher than 0.7, the accuracy of the basic Bayesian inference is already good enough, so the accuracy is not further improved. Moreover, BNMVI+Crowd unc achieves a better performance than BNMVI+Crowd inf when θ is small, while the situation reverses when θ is higher. This is because Crowd inf, which selects the most influential tuple each time, can influence more tuples with

Figure 5: The performance of Bayesian inference with crowd feedback under different settings of lower threshold (Left), crowd rate (Middle), and Error rate (Right).

the increase in the lower threshold. When we set θ to 0.7, BNMVI+Crowd inf achieves 90% of the F-measure, while BNMVI+Crowd unc achieves 89%. BNMVI+Crowd inf and BNMVI+Crowd unc improve the F-measure of BNMVI by 23%-26%. The improvement is attributed to the extra knowledge from the crowd. Similar results are also achieved in Figures 5(d)(g).

The effect of Crowd rates. We then evaluate the influence of the Crowd rate. In the experiments, we set the lower threshold θ to 0.75 and the Mis Rate to 50%. The results are shown in Figures 5(b)(e)(h). We can see that the F-measure of both BNMVI+Crowd unc and BNMVI+Crowd inf increases with the increase in the crowd rate and then tends to plateau. In Figure 5(b), when we set the Crowd rate to 20%, the F-measure of BNMVI+Crowd inf reaches almost 89%. In addition, BNMVI+Crowd inf and BNMVI+Crowd unc improve the F-measure of BNMVI by 23%-25%.

The effect of crowd feedback quality. In the above experiments, we simulate the crowd feedback of 10 workers, in which each worker has a certain probability of providing the correct answer. In order to show the effect of the quality of the crowd feedback on the Bayesian inference, in this part we change the probability of providing correct answers for each worker. In the experiments, we fix the Crowd rate at 20%, the Mis Rate at 50%, and the lower


[Figure 6 appears here: time cost (s) of Mode, KNN, SVM, MIBOS, CMI, Certain, BNMVI, BNMVI+Crowd_unc, and BNMVI+Crowd_inf on (a) Zoo and (b) Chess, varying Mis_Rate from 0.05 to 0.40, and on (c) Earthquake, varying the data size (*1250) from 1 to 8.]

Figure 6: Efficiency evaluation


threshold at 0.75. The results are shown in Figures 5(c)(f)(i). The x-axis represents the error rate (Error_rate for short) of the output of the aggregation approach [20]. We observe that the F-measure of both BNMVI+Crowd_inf and BNMVI+Crowd_unc decreases with the increase in the Error_rate on all the datasets. When the Error_rate reaches 35%, 20%, and 27%, respectively, BNMVI+Crowd_inf and BNMVI+Crowd_unc perform worse than BNMVI. It can be concluded that the quality of the crowd feedback aggregation must be ensured to improve the accuracy of the Bayesian inference. According to [20], even when only 1 out of 8 workers is reliable, the approach of [20] can still obtain the correct aggregated answers. Therefore, the accuracy of BNMVI can be improved with the help of crowdsourcing in most cases. We further analyze the quality of the crowd feedback and the aggregation results in Section 6.3.

Efficiency. We also evaluate the efficiency of the proposed methods and the baseline methods with regard to missing rates and data sizes. Figure 6 shows the experimental results. We can see that our method BNMVI has a time cost comparable to the baseline methods. BNMVI+Crowd_unc and BNMVI+Crowd_inf need to simulate and aggregate the crowd feedback, which requires additional time. In particular, due to its complex calculation, the time cost of MIBOS increases linearly with the data size in Figure 6(c), while the time cost of the other methods remains relatively stable.

Table 2: Crowd performance on real crowdsourcing platforms

Datasets    Algorithms    # HITs    Time       Aggregation Accuracy
Zoo         Crowd_unc     168       2h12min    0.850
Zoo         Crowd_inf     240       1h18min    0.844
Olympics    Crowd_unc     1083      4h45min    0.686
Olympics    Crowd_inf     1728      5h9min     0.714
Disease     Crowd_unc     2979      2h52min    0.834
Disease     Crowd_inf     2985      2h46min    0.821

6.3. Testing on real crowdsourcing platforms

In this section, we illustrate the performance of the proposed approaches on real crowdsourcing platforms. We implement BNMVI+Crowd_unc and BNMVI+Crowd_inf with Amazon Mechanical Turk and Figure Eight, respectively.
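As context for the aggregation used throughout this section, each uncertain tuple is answered by several workers and the answers are combined by majority voting. A minimal sketch (the cell identifiers and vote values below are illustrative, not taken from the experiments):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent answer among the workers' submissions."""
    return Counter(answers).most_common(1)[0][0]

# Three replicated HITs per uncertain cell (illustrative values).
hits = {
    ("tuple7", "legs"): ["4", "4", "2"],
    ("tuple9", "tail"): ["yes", "no", "yes"],
}
imputed = {cell: majority_vote(votes) for cell, votes in hits.items()}
print(imputed)  # -> {('tuple7', 'legs'): '4', ('tuple9', 'tail'): 'yes'}
```

With three replicated HITs per tuple, a single wrong worker is outvoted, which is why the per-attribute aggregation accuracies reported below mostly stay above 60%.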


[Figure 7 appears here: nine panels for Zoo (a-c), Olympics (d-f), and Disease (g-i) — accuracy per Attribute ID for Crowd_unc and Crowd_inf; total and incorrect HITs submitted per Worker ID; and percentage of assignments completed over time (min).]

Figure 7: Crowd worker affinity and quality from different aspects: aggregation accuracy (Left), HITs/quality by worker (Middle), and HITs Completion (Right).


To control result quality, each uncertain tuple was replicated into three HITs; that is, each uncertain tuple was imputed by three different workers. We paid workers 3 cents for completing each HIT. To filter out spammers, only workers whose approval rates were higher than 90% were allowed to accept our HITs. The crowd feedback was aggregated by majority voting. We tested three real-world datasets: Zoo, Olympics, and Disease. On the Zoo dataset, the lower threshold θ was set to 0.5, q was set to 10, and the Mis_Rate was set to 0.2. On the Olympics dataset, the lower threshold θ was set to 0.3, q was set to 100, and the Mis_Rate was set to 0.2. On the Disease dataset, the lower threshold θ was set to 0.1, q was set to 1000, and the Mis_Rate was set to 0.05.

We first compare Crowd_unc and Crowd_inf in terms of the number of HITs, completion time, and aggregation accuracy. We use Accuracy to evaluate the majority-voting result of the crowd feedback. Table 2 shows the respective results for the three datasets. We can see that on all the datasets, the number of HITs (i.e., tuples) submitted in Crowd_unc is lower than that in Crowd_inf. As the total number of missing values is equal for both approaches, the average number of missing values per tuple in Crowd_unc is greater than that in Crowd_inf. This is because incomplete tuples with a large number of missing values have a high probability of being treated as uncertain in Crowd_unc, while incomplete tuples with few missing values tend to be treated as the influence



maximum in Crowd_inf. As a result, compared with Crowd_inf, the workers in Crowd_unc needed more time to impute more missing values with less information. This also coincides with the feedback aggregation accuracy: on the Zoo and Disease datasets, which pose relatively easy binary-choice tasks, the result of Crowd_unc is slightly better than that of Crowd_inf, while on the Olympics dataset, the result of Crowd_inf is clearly better than that of Crowd_unc, since each tuple provides more information for workers while requiring fewer missing values to be imputed.

We then focus on three other issues: the aggregation accuracy for each attribute, the distribution of work among workers, and the response time of the HITs. Figure 7 shows the results. For each attribute, Figures 7(a)(d)(g) show the aggregation accuracy of Crowd_unc and Crowd_inf, respectively. From these figures, we observe that the aggregation accuracy of the crowd feedback for each attribute exceeds 60% in most cases. This result indicates that majority voting over the crowd feedback can effectively filter out wrong feedback from the 3 workers. For each worker, Figures 7(b)(e)(h) show the number of HITs completed by that worker and the number of errors that worker made. The workers are plotted along the x-axis in decreasing order of the number of HITs they processed. As can be seen, the more HITs a worker completed, the more errors that worker made; however, the relation between the two is not strong. Benefiting from our filtering strategy, workers with low approval rates were not allowed to accept our HITs. Thus, most workers were able to impute the missing values correctly and submitted very few incorrect answers. Figures 7(c)(f)(i) show the fraction of HITs completed as a function of time. Naturally, on all the datasets, the longer we wait, the more HITs are completed. The results show that workers are eager to accept the HITs in the first few minutes. For all the datasets, 80% of the HITs are completed within 180 min. This indicates that it does not take too long for workers to accept and complete tasks.

Table 3: Performance comparison

Datasets    Algorithms         F-measure
Zoo         BNMVI              0.759
Zoo         BNMVI+Crowd_unc    0.843
Zoo         BNMVI+Crowd_inf    0.839
Olympics    BNMVI              0.436
Olympics    BNMVI+Crowd_unc    0.562
Olympics    BNMVI+Crowd_inf    0.601
Disease     BNMVI              0.684
Disease     BNMVI+Crowd_unc    0.743
Disease     BNMVI+Crowd_inf    0.731
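The F-measure values above combine precision and recall in the standard way [36]. A sketch of how such a score can be computed for imputation, assuming exact-match scoring against ground truth (a common convention; the helper name and cell values are ours, and the paper's exact definition may differ):

```python
# Illustrative F-measure for imputation under exact-match scoring.

def f_measure(imputed, truth):
    """imputed/truth: {cell_id: value}; truth covers all missing cells."""
    correct = sum(1 for c, v in imputed.items() if truth.get(c) == v)
    precision = correct / len(imputed) if imputed else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = {"c1": "a", "c2": "b", "c3": "c", "c4": "d"}
imputed = {"c1": "a", "c2": "x", "c3": "c"}  # filled 3 of 4 cells, 2 correct
print(round(f_measure(imputed, truth), 3))  # -> 0.571
```

Here precision is 2/3 and recall is 2/4, giving F = 4/7 ≈ 0.571.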

Finally, we compare the performance of BNMVI, BNMVI+Crowd_unc, and BNMVI+Crowd_inf. Table 3 shows the results for the three datasets. We observe that on all the datasets, the crowd feedback enhances the performance of the Bayesian inference. For the Zoo and Disease datasets, BNMVI+Crowd_unc performs the best, owing to the high quality of the crowd feedback. Compared to BNMVI, the improvements in the F-measure are 8% and 6%, respectively. For the Olympics dataset, benefiting from the crowd tuple selection approach Crowd_inf, the F-measure of BNMVI+Crowd_inf is the highest. Compared to BNMVI, BNMVI+Crowd_inf achieves a significant improvement in the F-measure (i.e., 16%).

7. Conclusion


Imputing the missing values of tuples has mostly relied on statistics-based and machine-learning-based methods. Due to their very complex models, such imputation can hardly be done both efficiently and effectively. In this paper, we proposed a method for missing value imputation that combines a Bayesian network with crowdsourcing. We first build the Bayesian network based on the incomplete data and then propose a Bayesian inference algorithm that fills the missing values based on limited variables. To obtain enough evidence, crowdsourcing is used to improve the accuracy of the Bayesian inference. Two crowd tuple selection algorithms are also presented to reduce the cost. Experimental results on real-world datasets with a simulated crowd and two real crowdsourcing platforms show that our methods achieve much higher accuracy than the existing methods. Our future work includes adapting the method to heterogeneous data and massive data.

Acknowledgements


This work was partially supported by NSFC grants U1509216, U1866602, and 61602129, and by Microsoft Research Asia.

References

[1] https://www.mturk.com/. [2] https://www.figure-eight.com. [3] C. C. Cao, J. She, Y. Tong, and L. Chen. Whom to ask? jury selection for decision making tasks on micro-blog services. PVLDB, 5(11):1495–1506, 2012.


[4] F. Chu, Y. Wang, D. S. Parker, and C. Zaniolo. Data cleaning using belief propagation. In Proc. of IQIS, pages 99–104, 2005. [5] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proc. of SIGMOD, pages 1247–1261, 2015. [6] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine learning, 9(4):309–347, 1992.




[7] I. Dagan and S. P. Engelson. Committee-based sampling for training probabilistic classifiers. In Proc. of ICML, pages 150–157, 1995. [8] M. Di Zio, M. Scanu, L. Coppola, O. Luzi, and A. Ponti. Bayesian networks for imputation. Journal of the Royal Statistical Society: Series A (Statistics in Society), 167(2):309–322, 2004. [9] J. Fan, G. Li, B. C. Ooi, K.-l. Tan, and J. Feng. icrowd: An adaptive crowdsourcing framework. In Proc. of SIGMOD, pages 1015–1030, 2015. [10] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1-2):173–184, 2010. [11] S. L. Goldberg, D. Z. Wang, and T. Kraska. Castle: Crowd-assisted system for text labeling and extraction. In Proc. of HCOMP, 2013.


[12] S. Hao, N. Tang, G. Li, J. He, N. Ta, and J. Feng. A novel cost-based model for data repairing. IEEE Trans. Knowl. Data Eng., 29(4):727–742, 2017.


[13] D. S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Proc. of Approximation algorithms for NP-hard problems, pages 94–143, 1996. [14] H. Hu, Y. Zheng, Z. Bao, G. Li, J. Feng, and R. Cheng. Crowdsourced poi labelling: Location-aware result inference and task assignment. In Proc. of ICDE, pages 61–72, 2016.


[15] N. Q. V. Hung, N. T. Tam, L. N. Tran, and K. Aberer. An evaluation of aggregation techniques in crowdsourcing. In Proc. of WISE, pages 1–15, 2013. [16] K. J. Janssen, A. R. T. Donders, F. E. Harrell, Y. Vergouwe, Q. Chen, D. E. Grobbee, and K. G. Moons. Missing covariate data in medical research: to impute is better than to ignore. Journal of clinical epidemiology, 63(7):721– 727, 2010. [17] G. Kazai, J. Kamps, and N. Milic-Frayling. Worker types and personality traits in crowdsourcing relevance labels. In Proc. of CIKM, pages 1941– 1944, 2011.


[18] I. Lawrence and K. Lin. A concordance correlation coefficient to evaluate reproducibility. Biometrics, pages 255–268, 1989. [19] G. Li, J. Wang, Y. Zheng, and M. J. Franklin. Crowdsourced data management: A survey. IEEE Trans. Knowl. Data Eng., 28(9):2296–2319, 2016. [20] Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, and J. Han. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In Proc. of SIGMOD, pages 1187–1198, 2014.



[21] X.-B. Li. A Bayesian approach for estimating and replacing missing categorical data. Journal of Data and Information Quality, 1(1):3, 2009. [22] Z. Li, L. Qin, H. Cheng, X. Zhang, and X. Zhou. TRIP: an interactive retrieving-inferring data imputation approach. IEEE Trans. Knowl. Data Eng., 27(9):2550–2563, 2015. [23] Z. Li, M. A. Sharaf, L. Sitbon, S. W. Sadiq, M. Indulska, and X. Zhou. A web-based approach to data imputation. World Wide Web, 17(5):873–897, 2014. [24] C. Mayfield, J. Neville, and S. Prabhakar. Eracer: a database approach for statistical inference and data cleaning. In Proc. of SIGMOD, pages 75–86, 2010.


[25] R. Pan, T. Yang, J. Cao, K. Lu, and Z. Zhang. Missing data imputation by K nearest neighbours based on grey relational structure and mutual information. Appl. Intell., 43(3):614–632, 2015. [26] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. 2014.


[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.


[28] K. Pelckmans, J. D. Brabanter, J. A. K. Suykens, and B. D. Moor. Handling missing values in support vector machine classifiers. Neural Networks, 18(5-6):684–692, 2005. [29] Z. Qi, H. Wang, J. Li, and H. Gao. FROG: inference from knowledge base for missing value imputation. Knowl.-Based Syst., 145:77–90, 2018. [30] A. M. Sefidian and N. Daneshpour. Missing value imputation using a novel grey based fuzzy c-means, mutual information based feature selection, and regression model. Expert Syst. Appl., 115:68–94, 2019.


[31] N. A. Setiawan, P. Venkatachalam, and A. F. M. Hani. Missing attribute value prediction based on artificial neural network and rough set theory. In Proc. of BMEI, pages 306–310, 2008. [32] Y. Shan and G. Deng. Kernel pca regression for missing data estimation in dna microarray analysis. In Proc. of ISCAS, pages 1477–1480, 2009.

[33] X. Su, T. M. Khoshgoftaar, and R. Greiner. Using imputation techniques to help learn accurate classifiers. In Proc. of ICTAI, pages 437–444, 2008.

[34] C. Tsai, M. Li, and W. Lin. A class center based approach for missing value imputation. Knowl.-Based Syst., 151:124–135, 2018.



[35] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 65(1):31–78, 2006. [36] C. J. Van Rijsbergen. Information retrieval. 1979.

[37] S. Wu, X. Feng, Y. Han, and Q. Wang. Missing categorical data imputation approach based on similarity. In Proc. of SMC, pages 2827–2832, 2012. [38] J. Xia, S. Zhang, G. Cai, L. Li, Q. Pan, J. Yan, and G. Ning. Adjusted weight voting algorithm for random forests in handling missing values. Pattern Recognition, 69:52–60, 2017.


[39] K. Yang, J. Li, and C. Wang. Missing values estimation in microarray data with partial least squares regression. Computational Science–ICCS 2006, pages 662–669, 2006. [40] C. Ye, H. Wang, J. Li, H. Gao, and S. Cheng. Crowdsourcing-enhanced missing values imputation based on Bayesian network. In Proc. of DASFAA, pages 67–81, 2016.


[41] C. J. Zhang, L. Chen, Y. Tong, and Z. Liu. Cleaning uncertain data with a noisy crowd. In Proc. of ICDE, pages 6–17, 2015. [42] S. Zhang. Nearest neighbor selection for iteratively knn imputation. Journal of Systems and Software, 85(11):2541–2552, 2012.


[43] S. Zhang, J. Zhang, X. Zhu, Y. Qin, and C. Zhang. Missing value imputation based on data clustering. Trans. Computational Science, 1:128–138, 2008. [44] X. Zhang, G. Li, and J. Feng. Crowdsourced top-k algorithms: An experimental evaluation. PVLDB, 9(8):612–623, 2016. [45] Y. Zheng, R. Cheng, S. Maniu, and L. Mo. On optimality of jury selection in crowdsourcing. In Proc. of EDBT, 2015. [46] Y. Zheng, J. Wang, G. Li, R. Cheng, and J. Feng. QASCA: A quality-aware task assignment system for crowdsourcing applications. In Proc. of SIGMOD, pages 1031–1046, 2015.


[47] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In Proc. of NIPS, pages 2195–2203, 2012.
