International Journal of Approximate Reasoning 130 (2021) 150–169
Multivariate statistical matching using graphical modeling

Pier Luigi Conti a, Daniela Marella b,*, Paola Vicard c, Vincenzina Vitale d

a Dipartimento di Scienze Statistiche, Sapienza Università di Roma, Italy
b Dipartimento di Scienze della Formazione, Università Roma Tre, Italy
c Dipartimento di Economia, Università Roma Tre, Italy
d Dipartimento di Scienze Sociali ed Economiche, Sapienza Università di Roma, Italy
Article history: Received 30 April 2020; Received in revised form 16 November 2020; Accepted 2 December 2020; Available online 10 December 2020.
Keywords: Bayesian networks; Dependence structure; Collapsibility; Statistical matching; Uncertainty
Abstract

The goal of statistical matching, at a macro level, is the estimation of the joint distribution of variables separately observed in independent samples. The lack of joint information on the variables of interest leads to uncertainty about the data generating model. In this paper we propose the use of graphical models to deal with statistical matching uncertainty for multivariate categorical variables. The use of Bayesian networks in the statistical matching context makes it possible both to introduce extra-sample information on the dependence structure between the variables of interest and to use such information to factorize the joint probability distribution, decomposing a multivariate dependence into lower-dimensional components according to the graph. This representation of the joint probability distribution, taking advantage of local relationships, simplifies both parameter estimation and the evaluation of statistical matching quality in a multivariate context. A simulation experiment is performed in order to evaluate the performance of the proposed methodology with and without auxiliary information, as well as to compare it with the saturated multinomial model, in terms of uncertainty reduction. Finally, an application to a real case is provided. Results show a considerable improvement in the quality of statistical matching when the dependence structure is taken into account. © 2020 Elsevier Inc. All rights reserved.
1. Introduction

Information for statistical analysis is frequently available in different micro databases. Each database contains only some of the variables of interest. This is a serious drawback when the interest is in the joint analysis of variables that are not jointly observed. Statistical matching aims at combining information obtained from different non-overlapping sample surveys referring to the same target population. Formally, let (X, Y, Z) be a discrete random variable (r.v.) with joint probability mass function (pmf) P_XYZ. Without loss of generality, it will be assumed that X = (X_1, ..., X_H), Y = (Y_1, ..., Y_K) and Z = (Z_1, ..., Z_T) are random vectors of dimension H, K, T, respectively. Furthermore, let A and B be two independent samples of n_A and n_B independent and identically distributed (i.i.d.) records from (X, Y, Z). In sample A we only have observations of the variables X and Y; in sample B we only have observations of the variables X and Z. Therefore, observations in file A have missing Z values, and
* Corresponding author.
E-mail addresses: [email protected] (P.L. Conti), [email protected] (D. Marella), [email protected] (P. Vicard), [email protected] (V. Vitale).
https://doi.org/10.1016/j.ijar.2020.12.006
0888-613X/© 2020 Elsevier Inc. All rights reserved.
observations in file B have missing Y values. The common variables X, observed in both data sets, are called matching variables. Missing data are due to the observational mechanism, i.e. missingness is by design. At a micro level, the main goal of statistical matching consists in constructing a complete synthetic data set where all the variables of interest are jointly observed; for details see D'Orazio et al. [7]. At a macro level, the main goal of statistical matching consists in estimating the joint distribution of (X, Y, Z) from the samples A and B. Unless special (and generally restrictive) assumptions are made, the joint distribution of (X, Y, Z) is not identifiable (identification problem), since the parameters describing the statistical relationship between Y and Z are not estimable due to the lack of joint information on Y and Z given X. In other words, the sample information provided by A and B is actually unable to discriminate among a set of plausible models for (X, Y, Z), leading to uncertainty about the data generating model. For instance, in a parametric setting and for K = T = 1, the estimation problem cannot be solved "pointwise"; one can only identify ranges of values containing all the pointwise estimates obtainable under the models compatible with the available sample information. Such intervals are named uncertainty intervals. Articles tackling uncertainty assessment in the statistical matching problem for parametric models, assuming i.i.d. observations, are Kadane [13], Rubin [21], Moriarity and Scheuren [16], Rässler [19], and D'Orazio et al. [6]. Uncertainty in statistical matching, mainly in a nonparametric setting, is addressed in Conti et al. [1,2,3,4]. In order to overcome the identification problem, alternative techniques have been proposed in the literature.
A first group of techniques is based on the conditional independence assumption between Y and Z given X (CIA), see Okner [17]; the basic idea is that the joint distribution of (X, Y, Z) is identifiable under the observation mechanism generating samples A and B, thanks to the independence of Y and Z given X. A second group of techniques uses external auxiliary information regarding the statistical relationship between Y and Z, e.g. an additional available sample C with (X, Y, Z) jointly observed, as in Singh et al. [23]. Under this approach, the three samples A, B and C are assumed independent and composed of i.i.d. observations. Clearly, the basic assumption is that the observations in sample C are generated by the same model generating the observations in samples A and B, respectively. Unfortunately, this assumption, as well as the CIA, cannot be tested. Moreover, the appropriateness of the two assumptions above is questionable: the CIA is rarely met in practice (its suitability is discussed in several papers, cf. Sims [22], Rodgers [20]), and external auxiliary information consisting of a complete sample C is hardly ever available. In this paper we propose the use of Bayesian networks (BNs) to deal with statistical matching uncertainty for multivariate categorical data. Uncertainty in statistical matching for categorical data is analyzed in [6], [2], [4]. In [6] the multinomial model is used and maximum likelihood estimates (MLEs) of the parameters are computed through the EM algorithm. [2] deals with the statistical matching problem in a nonparametric setting; the notion of uncertainty is introduced and a measure of uncertainty is then proposed. Finally, in [4] the authors propose to estimate, in the case K = T = 1, the probability distribution function of the variables that are not jointly observed based on an iterative proportional fitting algorithm, and show how to evaluate its reliability.
In all the aforementioned papers the possibility of reducing the matching uncertainty by means of restrictions on the support of the variables is illustrated. For categorical data such restrictions have been defined in terms of structural zeros on the joint distribution of Y and Z. The main motivation of the present paper is to propose a new technique for multivariate statistical matching that avoids the use of hard-to-defend assumptions, such as the CIA. The lack of the CIA implies that, in general, the statistical model for (X, Y, Z) is not identifiable. As a consequence, a careful evaluation of the uncertainty related to statistical matching is needed. In the case of multivariate matching, modeling the dependence structure of the involved variables is of particular relevance, because it affects the quality of statistical matching results in terms of closeness of the estimated joint distribution of (X, Y, Z) to the "true" one. A natural tool for modeling dependence relationships among discrete variables is the Bayesian network methodology, which is used throughout the present paper. The first attempt to use BNs for statistical matching of multivariate discrete data is in Endres and Augustin [11], where the CIA is assumed thanks to the connection between conditional independence and the d-separation criterion. Under the CIA, both the dependence structure and the BN parameters are estimable from the sample data and there is no uncertainty at all. However, when the CIA model is not adequate, the final dataset may be significantly different from the one that would have been obtained if complete observations of (X, Y, Z) had been collected, and the application of standard inferential procedures may result in highly misleading estimates. BNs have also been used in data integration problems using a calibration approach [8,9].
Bayesian networks, in principle, allow us to adopt a "good" model for the inter-dependence of the variables involved in the matching process, and hence to improve the final quality of statistical matching results. More specifically, the association structure between the variables is represented by means of a directed acyclic graph (DAG), and the joint pmf can be factorized according to the conditional independencies entailed by the DAG. In the statistical matching context, the use of BNs allows one: (i) to introduce extra-sample information on the dependencies between specific Y and Z components; (ii) to use this information to factorize the joint pmf. As a consequence, the advantages are twofold. On the one hand, computational complexity is reduced and parameter estimation is simplified thanks to the factorization of the joint pmf into the product of lower-dimensional components with a smaller number of parameters. On the other hand, statistical matching uncertainty is confined to those (few) factors of the joint pmf containing both Y and Z variables, while all the other factors (the majority), containing either X and Y or X and Z variables only, can be estimated from sample A or sample B, respectively, without uncertainty. When the CIA is assumed, as in Endres and Augustin [11], the advantage of using BNs is essentially computational, due to the absence of uncertainty in this case. In this paper the CIA is no longer assumed. Therefore, differently from Endres and Augustin [11], BNs can help tackle the multivariate statistical matching problem both computationally and by targeting
Fig. 1. (a) Example of a DAG and (b) moral graph of DAG (a).
the effort for uncertainty assessment only to those components of the joint pmf relative to variables separately observed in the two available samples. The main contributions of the present paper are essentially two.
1. A methodology based on Bayesian networks for multivariate statistical matching when the CIA is no longer assumed.
2. A new measure of uncertainty to evaluate the quality of the obtained results, accounting for the uncertainty in the dependence structure when it is not given by subject-matter knowledge.
The paper is organized as follows. In Sect. 2 basics on Bayesian networks are given and the concept of uncertainty in statistical matching when BNs are used is illustrated. A measure of the total matching error is given in Sect. 2.3. Sect. 3 presents and discusses the results of a simulation study, performed to evaluate the performance of the proposed approach with and without auxiliary information, and to compare it with the saturated multinomial model in terms of uncertainty reduction. In Sect. 4 an application to real data is presented. We conclude with a brief summary in Sect. 5.

2. Uncertainty in statistical matching using graphical models

Bayesian networks are multivariate statistical models satisfying sets of conditional independence statements encoded in a directed acyclic graph (DAG), see Pearl [18] and Cowell et al. [5]. A graph is a pair G = (V, E) consisting of a set of vertices V and a set of directed edges E between pairs of nodes. In particular, directed acyclic graphs are considered: it is not possible to start from a node and return to the same node following the arrow directions. Each node corresponds to a random variable, and missing arrows between nodes imply (conditional) independence between the corresponding variables. In a BN each node, say x_h, is associated with the distribution of the corresponding variable given its parents, pa(x_h), i.e. all nodes linked to x_h by an arrow pointing to x_h.
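To make these two ingredients concrete, a BN over three binary nodes can be stored as a DAG (here, a parent list) plus one conditional probability table per node; the joint pmf is then the product of each node's table entry given its parents. The chain X1 → X2 → X3 and all numbers below are purely illustrative, not taken from the paper.

```python
from itertools import product

# DAG as a parent list (illustrative chain X1 -> X2 -> X3).
parents = {"X1": [], "X2": ["X1"], "X3": ["X2"]}

# One CPT per node, keyed by the tuple of parent values (illustrative numbers):
# P(X1), P(X2 | X1), P(X3 | X2) for binary variables.
cpt = {
    "X1": {(): {0: 0.6, 1: 0.4}},
    "X2": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "X3": {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.1, 1: 0.9}},
}

def joint_pmf(assignment):
    """Product over nodes of P(x_h | pa(x_h))."""
    p = 1.0
    for node, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_vals][assignment[node]]
    return p

# Sanity check: the factorized joint pmf sums to 1 over all configurations.
total = sum(joint_pmf(dict(zip(parents, v))) for v in product([0, 1], repeat=3))
```

Summing the product over all 2^3 configurations returns 1, a quick check that the tables are proper conditional distributions.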
Hence a BN consists of two components: the DAG and the set of distribution parameters. For instance, consider the random vector X = (X_1, ..., X_H). Its joint pmf can be factorized according to the DAG as follows

P(x_1, \ldots, x_H) = \prod_{h=1}^{H} P(x_h \mid pa(x_h))    (1)

where P(x_h | pa(x_h)) is the probability distribution associated to node x_h given its parents pa(x_h), h = 1, ..., H. Let fa(x_h) = {x_h} ∪ pa(x_h) be the family of x_h and let ch(x_h) be the set of children of x_h, i.e. the set of nodes connected with x_h by an arrow starting from x_h. The clan of x_h is defined as clan(x_h) = fa(x_h) ∪ ch(x_h). Furthermore, we say that two vertices x_h and x_h' are adjacent if there is an edge connecting them. A graph is said to be complete if all nodes are adjacent to each other. Given a set A ⊆ V, the induced subgraph is G_A = (A, E_A), where E_A is obtained from E by keeping those edges connecting two vertices in A. A complete subgraph is a subgraph which is complete. The skeleton of a DAG is the undirected graph obtained by replacing all arrows with lines. A v-configuration is a triple of nodes, say (X_i, X_k, X_j), having the structure X_i → X_k ← X_j. The moral graph of a DAG is the undirected graph obtained from the DAG by replacing all directed edges with undirected edges (lines) and by joining with a line the parent nodes of common children in the v-configurations. A clique is a maximal complete subgraph (i.e. a subgraph that becomes incomplete if another node is added to it). In Fig. 1 (a) an example of a DAG is shown. Considering node X3, X1 is its parent while X4, X5 and X6 are its children. Moreover, the family of X3, fa(X3), is given by X1 and X3, and the clan of X3 is fa(X3) together with its children X4, X5 and X6. Fig. 1 (b) shows the moral graph of the DAG in Fig. 1 (a); all the arrows are replaced by lines, and the pairs of nodes (X2, X3) and (X5, X8) are joined by undirected edges since in the DAG they are parents of
Fig. 2. Uncertainty evaluation flowchart.
the common children X5 and X7, respectively. Moreover, the subgraph induced by (X3, X4, X6) is a complete subgraph, i.e. a clique. In the statistical matching scenario, the non-identifiability of the statistical model for (X, Y, Z), due to the lack of joint observations on the variables of interest, implies that both components of the BN (i.e. the DAG and its parameters) cannot be estimated from the lower-dimensional datasets A and B. In fact, when BNs are used to deal with statistical matching, one needs: (i) to specify the dependence structure of the variables (X, Y, Z); (ii) to estimate the parameters, i.e. the local probability distributions associated to the edges between the components of Y and Z. As a consequence, two kinds of uncertainty have to be accounted for:
1. uncertainty regarding the DAG, that is, the dependence structure between the variables of interest;
2. uncertainty regarding the parameters of the statistical relationship between Y and Z (the conditional probability table entries) given the DAG, i.e. given the factorization of the joint pmf for (X, Y, Z).
These two kinds of uncertainty will be analyzed in Sect. 2.1 and Sect. 2.2, respectively. Fig. 2 shows the main steps of the approach described in Sections 2.1-2.3. Boxes represent the steps and arrows show the flow of the proposed method. The steps are the following:
Step 1 Estimate the DAGs of X, (X, Y) and (X, Z) from A ∪ B, A and B, respectively. The estimation procedure is described in Sect. 2.1.
Step 2 Insert extra-sample information on the association structure between Y and Z into the DAGs. For instance, if variable Y_k is associated to Z_t, then a link between the vertices Y_k and Z_t must be added.
Step 3 Define the class of plausible DAGs for (X, Y, Z) (4) and compute the uncertainty due to the dependence structure (6), as described in Sect. 2.1.
Step 4 Select a model from the class of plausible DAGs defined in Step 3.
Step 5 Estimate the parameters of the local pmfs associated to the edges between the components of Y and Z, as described in Sect. 2.2, and compute the parameter estimation uncertainty (9).
Step 6 Compute the total uncertainty (12) by adding the dependence structure uncertainty to the parameter estimation uncertainty.

2.1. Uncertainty in the dependence structure

Let P_XYZ be the joint pmf of (X, Y, Z) associated to the DAG G_XYZ = (V, E), with V = V_XYZ and E = E_XYZ. Let us denote by G_XY = (V_XY, E_XY) and G_XZ = (V_XZ, E_XZ) the DAGs estimated via samples A and B, respectively. In order to estimate the DAGs G_XY and G_XZ, the following procedure can be applied. First, the DAG G_X = (V_X, E_X) is estimated on the basis of the overall sample A ∪ B. Secondly, given G_X, the association structure for (X, Y) and (X, Z) is estimated through the sample data in A and B, respectively. As far as P_XYZ is concerned, unless special assumptions are made, one can only say that it lies in the class of all joint pmfs for (X, Y, Z) satisfying estimate collapsibility over Y and Z, respectively. Formally, we say that the joint pmf P_XYZ is estimate collapsible over Z_t if
\hat{P}(X, Y, Z \setminus \{Z_t\}) = \hat{P}_{G_{XYZ} \setminus \{Z_t\}}(X, Y, Z \setminus \{Z_t\}).    (2)
Fig. 3. (a) Example of DAG with Y and Z conditionally independent given X , (b) and (c) examples of two independence equivalent DAGs, (d) graphical representation of the independence equivalence class constituted by DAGs (b) and (c).
In other words, the estimate \hat{P}(X, Y, Z \setminus \{Z_t\}) of P(X, Y, Z \setminus \{Z_t\}), obtained by marginalizing the maximum likelihood estimate (MLE) of P(X, Y, Z) under the original DAG G_XYZ, coincides with the MLE under the DAG G_{XYZ} \setminus \{Z_t\} obtained from G_XYZ by removing the vertex Z_t; see Kim and Kim [14]. Estimate collapsibility over a set Z is defined similarly. In terms of graphs, c-removability is a concept equivalent to estimate collapsibility. A vertex Z_t ∈ Z is c-removable from G_XYZ if any two vertices in clan(Z_t) are adjacent, except when both vertices belong to pa(Z_t), or if it is contained in one and only one clique (see Theorem 1 of Kim and Kim [14]). As an example, consider Fig. 1 (a). Node X7 is c-removable since its clan is constituted by two parents only, while node X6 is c-removable since it belongs to one single clique. Furthermore, the set Z = (Z_1, ..., Z_T) is sequentially c-removable if all vertices in Z can be ordered in such a way that they can be c-removed according to that ordering (Theorem 4 of Kim and Kim [14]). Taking into account the example of Fig. 1 (a), nodes X7 and X5 (in this order) are sequentially c-removable. An analogous condition is required for estimate collapsibility over Y. Then, the class of plausible joint pmfs for (X, Y, Z) can be described as follows
\mathcal{P}_{XYZ} = \{ P_{XYZ} : \hat{P}(X, Y) = \hat{P}_{G_{XY}}(X, Y), \; \hat{P}(X, Z) = \hat{P}_{G_{XZ}}(X, Z) \}    (3)

where \hat{P}_{G_{XY}}(X, Y) (\hat{P}_{G_{XZ}}(X, Z)) is the MLE of P(X, Y) (P(X, Z)) under the DAG G_XY (G_XZ). The class (3) can be equivalently defined, using the graphical structure of the model, as the class of plausible DAGs G_XYZ where the variables Z and Y are c-removable, respectively. Formally

\mathcal{G}_{XYZ} = \{ G_{XYZ} : Z \text{ is c-removable}, \; Y \text{ is c-removable} \}.    (4)
Note that in the statistical matching context, the X variables are often socio-demographic variables (such as age, sex, education, geographical area of residence, occupational status, etc.) relevant in explaining variations in the response variables Y and Z, respectively. Then, DAGs with edges from Y and/or Z to X are not admitted in the class (4). Note that the condition (4) requires that the set Z = (Z_1, ..., Z_T) is c-removable from G_XYZ and the set Y = (Y_1, ..., Y_K) is c-removable from G_XYZ. Thus, the class (4) requires the blocked c-removability defined in Definition 1 below.

Definition 1. Two sets of variables Y, Z are said to be blocked c-removable from G_XYZ if Y and Z are each sequentially c-removable from G_XYZ.

An example of variables Y, Z that are blocked c-removable is in Sect. 3 (Fig. 4). In terms of estimate collapsibility, Definition 1 means that the MLEs obtained from the original model G_XYZ numerically coincide with those obtained after collapsing over Y and Z, respectively. The most favorable case occurs, for instance, under the CIA, when the class (3) is composed of a single joint pmf defined as P(X, Y, Z) = P(X) P(Y|X) P(Z|X). Equivalently, in graphical terms, this means that the class (4) collapses into one single graph given by G^CIA_XYZ = G_XY ∪ G_XZ, where Y and Z are d-separated by the set X. For an account of d-separation we refer to Lauritzen [15]. Note that such a network always belongs to the class (4). Under the CIA, both the dependence structure and the BN parameters are estimable from the sample data and there is no uncertainty at all. The class (4) can be expressed as the set of Markov equivalence classes compatible with the subgraphs G_XY and G_XZ. Two DAGs are independence equivalent if and only if they have the same skeleton and the same v-configurations. For instance, consider the case where H = K = T = 1 and variable X is directly connected to Y and Z.
The nodes Y and Z may or may not be connected. If they are not connected, the DAG is that shown in Fig. 3a. If Y and Z are connected, the arrow joining them may have either of the two possible orientations (DAGs in Figs. 3b-c), since neither v-configurations nor directed cycles are created. Therefore the number of equivalence classes in the class (4) is equal to 2, corresponding to the presence and the absence of the edge between the two variables Y and Z, respectively. The graphical representations of the two equivalence classes are shown in Figs. 3a and 3d. When the CIA does not hold, an uncertainty measure for the dependence structure among the variables of interest, i.e. a measure to compare the plausible independence equivalence classes in (4) in terms of graph density/sparsity, has to be constructed. To this aim, the Structural Hamming Distance (SHD, for short), for comparing equivalence class structures, can be used [24]. Such a distance, applied to independence equivalence classes, can be defined as the number of edge insertions or deletions necessary to transform a given equivalence class into another. As a consequence, the SHD induces on the class (4) a partial order based on the number of edges between the Z and Y components.
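A minimal sketch of this distance on skeletons follows. The full SHD of [24] also penalizes orientation differences between equivalence classes, but for counting edge insertions and deletions only the undirected edge sets matter, so the symmetric difference suffices; graph encodings and node names below are illustrative.

```python
def skeleton(edges):
    """Undirected edge set: each directed edge (u, v) becomes frozenset({u, v})."""
    return {frozenset(e) for e in edges}

def shd(edges_a, edges_b):
    """Number of edge insertions/deletions between the two skeletons."""
    return len(skeleton(edges_a) ^ skeleton(edges_b))

# Illustrative example: a chosen matching graph with no Y-Z edge versus a
# "true" graph containing one extra edge Y1 -> Z1.
g_star = [("X1", "Y1"), ("X1", "Z1")]
g_true = [("X1", "Y1"), ("X1", "Z1"), ("Y1", "Z1")]
```

In this toy case `shd(g_star, g_true)` equals 1: a single edge insertion turns the chosen graph into the true one.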
All the DAGs in the class (4) have the same vertex set V_XYZ, and may differ in the edge set E_XYZ, specifically in the number and direction of edges between the Y and Z components. Hence, for each DAG G_XYZ in the class (4) the following inequality holds

G_L \subseteq G_{XYZ} \subseteq G_U.    (5)

G_L = G^CIA_XYZ, with V_L = V_XYZ and E_L = E_XY ∪ E_XZ, is the smallest DAG, with no edges between Y and Z. G_U is a DAG where all pairs of distinct vertices in Y and Z are connected by a unique directed edge, in either direction, without introducing directed cycles. The numbers of edges between Y (K-variate) and Z (T-variate) in G_L and in G_U are denoted by E^L_YZ and E^U_YZ, and are given by E^L_YZ = 0 and E^U_YZ ≤ KT, respectively. Note that the inequality is strict when not all possible edges between Y and Z can be added without generating directed cycles. Suppose G*_XYZ in the class (4) is chosen as a matching graph for (X, Y, Z), but the "true" graph is G_XYZ (still in the class (4), of course). The discrepancy between G*_XYZ and the true G_XYZ, in terms of edges between Y and Z, is the matching error due to the association structure. A small matching error means that the chosen association structure G*_XYZ is close to the true G_XYZ, and hence replacing G_XYZ by G*_XYZ does not produce a large error. The matching error measure is the SHD between G_XYZ and G*_XYZ, that is the minimum number of edge additions and/or removals necessary to convert the estimated graph G*_XYZ into the true graph. Although the sample data do not allow the matching error to be estimated, relationship (5) provides a useful upper bound, namely:

ME(G^*_{XYZ}, G_{XYZ}) = SHD(G^*_{XYZ}, G_{XYZ}) \le E^U_{YZ} - E^L_{YZ} \le KT.    (6)
For K = T = 1 the distance between G_L and G_U, in terms of number of edges between Y and Z, is equal to 1. From now on, expression (6) will be used as the matching error measure due to the association structure. In many circumstances, subject-matter knowledge provides useful extra-sample information for choosing a plausible DAG from the class (4) or, at least, for reducing its size. In fact, the equivalence classes in (4) that are inconsistent with such information are eliminated, with a consequent reduction of the upper bound in (6). In other cases, the dependence structure can be elicited by experts. As stressed in Geiger et al. [12], qualitative dependencies among variables can often be asserted with confidence, whereas numerical assessments are subject to a great deal of hesitancy. For example, an expert may willingly state that some variables in Y are related to some variables in Z, yet would not provide a numeric quantification of these relationships. Moreover, the knowledge of structural zeros or inequality constraints on the joint probability distribution, as discussed in Sect. 2.2, provides auxiliary information on the presence of the corresponding edges. When such extra-sample information on the association structure is available, and it refers to links between Y and Z, the bounds (5) can be tightened. Clearly, the lower bound can be improved with information on the presence of at least one edge between the Y and Z variables, while the upper bound can be improved with information on the absence of at least one edge between the Y and Z variables.

2.2. Uncertainty in the parameters estimation

Suppose that a DAG G*_XYZ has been selected from the class (4) and let P*_XYZ be the joint pmf associated to G*_XYZ. According to G*_XYZ, the unknown joint pmf P*_XYZ can be factorized into local probability distributions, some of which can be directly estimated from the available sample information, while others cannot.
In the case of categorical variables, uncertainty is dealt with as in D'Orazio et al. [6], where parameter uncertainty is estimated according to the maximum likelihood principle. Suppose that, in the factorization of P*_XYZ, the only parameter that cannot be estimated is the joint probability distribution P*(X_h, Y_k, Z_t). Let (X_h, Y_k, Z_t) have a multinomial joint distribution with vector parameter θ* = {θ*_ijl}, where θ*_ijl = P*(X_h = i, Y_k = j, Z_t = l), for i = 1, ..., I, j = 1, ..., J, and l = 1, ..., L. Analogously to (3), as far as θ* is concerned, one can only say that it lies in the following reduced parameter space

R = \{ \theta^* : \sum_l \theta^*_{ijl} = \hat\theta_{ij\cdot}, \; \sum_j \theta^*_{ijl} = \hat\theta_{i\cdot l}, \; \theta^*_{ijl} \ge 0, \; \sum_{ijl} \theta^*_{ijl} = 1 \}    (7)

where

\hat\theta_{ij\cdot} = \frac{n^A_{ij\cdot}}{n^A_{i\cdot\cdot}} \cdot \frac{n^A_{i\cdot\cdot} + n^B_{i\cdot\cdot}}{n_A + n_B}, \qquad \hat\theta_{i\cdot l} = \frac{n^B_{i\cdot l}}{n^B_{i\cdot\cdot}} \cdot \frac{n^A_{i\cdot\cdot} + n^B_{i\cdot\cdot}}{n_A + n_B}    (8)

are the marginal MLEs of (X_h, Y_k) and (X_h, Z_t) from samples A and B, respectively. More specifically, n^A_{ij·} in (8) denotes the number of observations in sample A such that (X_h = i, Y_k = j), and similar definitions hold for the other quantities appearing in (8). The parameter estimate which maximizes the likelihood function is not unique, and the set of all MLEs is the likelihood ridge. All the distributions in the likelihood ridge are equally informative, given the data.
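The marginal MLEs (8) are simple functions of the cell counts in the two samples; a sketch for a single matching variable X_h follows, where `n_A_ij[i][j]` is the count of records in A with (X_h = i, Y_k = j) and similarly for `n_B_il`. All counts are made up for illustration.

```python
def marginal_mles(n_A_ij, n_B_il):
    """Marginal MLEs (8) of (X_h, Y_k) and (X_h, Z_t) from samples A and B."""
    I = len(n_A_ij)
    nA_i = [sum(row) for row in n_A_ij]   # n^A_{i..}
    nB_i = [sum(row) for row in n_B_il]   # n^B_{i..}
    nA, nB = sum(nA_i), sum(nB_i)
    # theta_hat_{ij.} = (n^A_{ij.} / n^A_{i..}) * (n^A_{i..} + n^B_{i..}) / (n_A + n_B)
    th_ij = [[n_A_ij[i][j] / nA_i[i] * (nA_i[i] + nB_i[i]) / (nA + nB)
              for j in range(len(n_A_ij[i]))] for i in range(I)]
    # theta_hat_{i.l} = (n^B_{i.l} / n^B_{i..}) * (n^A_{i..} + n^B_{i..}) / (n_A + n_B)
    th_il = [[n_B_il[i][l] / nB_i[i] * (nA_i[i] + nB_i[i]) / (nA + nB)
              for l in range(len(n_B_il[i]))] for i in range(I)]
    return th_ij, th_il

# Illustrative counts: I = J = L = 2, n_A = n_B = 100.
th_ij, th_il = marginal_mles([[30, 20], [10, 40]], [[25, 25], [30, 20]])
```

Each set of marginal estimates sums to 1 by construction, since the pooled weights (n^A_{i··} + n^B_{i··})/(n_A + n_B) sum to 1 over i.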
Uncertainty measures can be defined by analyzing the characteristics of the parameter space R. Since the reduced parameter space (7) is a convex set, each parameter θ*_ijl lies in an interval θ*L_ijl ≤ θ*_ijl ≤ θ*U_ijl, whose upper and lower bounds cannot in general be expressed in closed form. The relationship θ*L_ijl ≤ θ*_ijl ≤ θ*U_ijl means that the uncertainty set for each θ*_ijl is an interval. A straightforward uncertainty measure for θ*_ijl can then be defined as the interval length θ*U_ijl − θ*L_ijl. Clearly, if θ*L_ijl = θ*U_ijl there is no uncertainty on θ*_ijl. A measure of uncertainty proposed by Rässler [19] is given by

UM(\theta^*) = \frac{\sum_{ijl} (\theta^{*U}_{ijl} - \theta^{*L}_{ijl})}{M}    (9)

where M is the number of uncertain parameters, that is, parameters such that θ*L_ijl ≠ θ*U_ijl. However, (9) provides just a rough approximation of the size of R. A more sophisticated uncertainty measure has been introduced by D'Orazio et al. [6], where each parameter θ* ∈ R is associated with a plausible probability distribution for (X, Y, Z). Let \hat\theta^*_{ijl} denote the estimate of θ*_ijl, for each (i, j, l); then the matching error due to the BN parameters is given by

ME[\hat\theta^*, \theta^* \mid G^*_{XYZ}] = \frac{\sum_{ijl} |\hat\theta^*_{ijl} - \theta^*_{ijl}|}{M} \le \frac{\sum_{ijl} (\theta^{*U}_{ijl} - \theta^{*L}_{ijl})}{M} = UM(\theta^*).    (10)
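For a single three-way table (X_h, Y_k, Z_t) without further constraints, the interval endpoints have the familiar Fréchet form given the margins (8), namely max(0, θ̂_{ij·} + θ̂_{i·l} − θ̂_{i··}) and min(θ̂_{ij·}, θ̂_{i·l}); under that assumption (an assumption of this sketch, not a general closed form), (9) can be computed directly.

```python
def uncertainty_measure(th_ij, th_il):
    """Rässler-type uncertainty measure (9), using Fréchet bounds for the
    cell probabilities theta*_{ijl} given the margins (8). Valid for the
    unconstrained single-table case assumed in this sketch."""
    widths = []
    I = len(th_ij)
    for i in range(I):
        th_i = sum(th_ij[i])                    # theta_hat_{i..}
        for j in range(len(th_ij[i])):
            for l in range(len(th_il[i])):
                lo = max(0.0, th_ij[i][j] + th_il[i][l] - th_i)
                hi = min(th_ij[i][j], th_il[i][l])
                if hi - lo > 1e-12:             # an uncertain parameter
                    widths.append(hi - lo)
    return sum(widths) / len(widths) if widths else 0.0
```

With degenerate margins every interval collapses to a point and the measure is 0, reflecting the no-uncertainty case; the margins below are purely illustrative.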
In order to exclude some parameter vectors in R, it is important to introduce extra-sample information characterizing the phenomenon under study. The introduction of such constraints is useful for reducing the overall parameter uncertainty. Clearly, the amount of reduction depends on the informativeness of the imposed constraints. Sometimes constraints are so informative that the likelihood ridge reduces to a unique point. These constraints can be defined in terms of structural zeros (θ*_ijl = 0 for some (i, j, l)) and inequality constraints between pairs of components of θ* (θ*_ijl ≤ θ*_i'j'l' for some (i, j, l), (i', j', l')). Note that any combination of structural zeros and inequality constraints leads to a subspace of R which is closed and convex. The presence of constraints on the components of θ* implies both the restriction of the parameter space R to a subspace and the reduction of the likelihood ridge. Specifically, as in the unconstrained case, the parameter estimate maximizing the likelihood function is not unique, and the set of all maximizers is the constrained likelihood ridge. When constraints are imposed, the likelihood function maximization problem may be solved using two different strategies, depending on whether the constrained subspace: 1) has a non-empty intersection with the unconstrained likelihood ridge; 2) has an empty intersection with the unconstrained likelihood ridge. In the second case the likelihood function has no local maxima in the interior of the constrained subspace; it may possess local maxima at its boundary. The construction of the constrained likelihood ridge is more complicated and is essentially based on iterative methods. An algorithm for likelihood function maximization in the multinomial case is described in D'Orazio et al. [6]. Clearly, the larger the number of directed edges between the components of Y and Z, the larger the number of uncertain parameters to be estimated in the factorization of the joint distribution P*_XYZ.

Remark 1. This approach can be extended to the case where more than one local probability distribution (more edges) needs to be estimated. To this aim, the edges should be ordered according to their reliability and time/logical priority. Once the edge ordering is stated, the BN parameters are estimated one at a time through an iterative procedure, starting from the most reliable edge and ending with the least reliable one.
In other words, the factorization of the joint probability allows a multivariate dependence to be decomposed into lower-dimensional components that are useful for evaluating statistical matching uncertainty in the multivariate context.

2.3. Total matching error in graphical modeling

If the true BN (G_XYZ, θ) is estimated by (G*_XYZ, θ̂*), as a total matching error measure we can consider the following

TME[(G_XYZ, θ); (G*_XYZ, θ̂*)] = ME(G_XYZ, G*_XYZ) + ME(θ*, θ̂* | G*_XYZ).   (11)

Although the sample data do not allow the total matching error (11) to be estimated, the analyses in Sect. 2.1 and Sect. 2.2 provide a useful upper bound for it. In fact, from (6) and (10), it is immediate to see that

TME[(G_XYZ, θ); (G*_XYZ, θ̂*)] ≤ K_T + UM(θ*).   (12)

Hence, (12) can be proposed as a measure of how reliable the use of (G*_XYZ, θ̂*) is as a surrogate of the actual BN (G_XYZ, θ). In other words, (12) can be used to evaluate the quality of the statistical matching procedure.

3. Simulation study

In this section a simulation experiment is performed in order to evaluate the performance of the proposed methodology. Let (X, Y, Z) be a random vector and suppose that the association structure between the variables of interest can be assumed known, for instance because it is elicited by experts. In such a case the network structure can be essentially viewed as
Fig. 4. Bayesian network for (X, Y, Z).

Table 1. Conditional probability distributions associated to the nodes of the BN in Fig. 4.
a source of qualitative auxiliary information. The present simulation aims at evaluating how accounting for the association structure contributes to uncertainty reduction. The considered BN is shown in Fig. 4. The variables (X, Y, Z) are dichotomous rvs, with conditional probability distributions reported in Table 1. We first check blocked c-removability of Y and Z. On the basis of Theorems 1 and 4 of Kim and Kim [14], Y = (Y_1, Y_2) is sequentially c-removable according to the ordering (Y_1, Y_2), while Z = (Z_1, Z_2) is sequentially c-removable according to the ordering (Z_1, Z_2) or (Z_2, Z_1). In fact, Y_1 is c-removable since its clan is composed of two parents only, X_1 and Y_2, which are not adjacent. Then Y_2 is c-removable since it belongs to a single clique. It can analogously be seen that (Z_1, Z_2) (as well as (Z_2, Z_1)) is sequentially c-removable. The joint pmf P(X, Y, Z) can be factorized according to the graph in Fig. 4 as follows:

P(X, Y, Z) = P(X_1, X_2, X_3) P(Y_1 | X_1, Y_2) P(Y_2 | X_2, Z_1) P(Z_1 | X_2) P(Z_2 | X_2)
           = P(X_1, X_3 | X_2) P(Y_1 | X_1, Y_2) P(X_2, Y_2, Z_1) P(Z_2 | X_2)   (13)

where P(X_2, Y_2, Z_1) is the only distribution that cannot be estimated from the information available in samples A and B. Such a distribution has been estimated by performing a simulation in R consisting of the following steps:

Step 1. 5000 i.i.d. observations have been generated from the graph in Fig. 4 using the package gRain.
Step 2. To reproduce the statistical matching situation, the original file has been randomly split into two datasets of 2500 units each. The variable Z = (Z_1, Z_2) has been removed from the first dataset (sample A) and the variable Y = (Y_1, Y_2) has been removed from the second dataset (sample B).
Step 3. The joint distribution P(X_2, Y_2, Z_1) in (13) has been estimated by the EM algorithm as described in D'Orazio et al. [6]. More specifically, in order to explore the likelihood ridge we ran the EM algorithm with 100000 different starting points; each run gave rise to a global maximum likelihood result. The function em.cat of the package cat has been used.

In the simulation, two scenarios have been considered: (i) S0, representing the initial situation of absence of auxiliary information on the joint distribution of (X_2, Y_2, Z_1); (ii) S1, representing the situation where a structural zero for the cell (X_2, Y_2, Z_1) = (0, 1, 0) is introduced.
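Steps 1-3 are carried out in R with gRain and em.cat of the package cat. Purely as an illustrative sketch of the same EM scheme (function names and the pure-Python formulation are ours, not the paper's code), the following self-contained code estimates the 2×2×2 table P(X_2, Y_2, Z_1) from a sample A observing (X_2, Y_2) and a sample B observing (X_2, Z_1):

```python
from itertools import product

def em_match(counts_a, counts_b, n_iter=5000, theta0=None):
    """EM for the joint pmf of three binary variables (X2, Y2, Z1) when
    sample A provides (X2, Y2) counts and sample B provides (X2, Z1) counts.

    counts_a[(x, y)], counts_b[(x, z)]: observed cell counts.
    Returns a dict theta[(x, y, z)]: the ridge point reached from theta0.
    """
    cells = list(product((0, 1), repeat=3))
    theta = dict(theta0) if theta0 else {c: 1 / 8 for c in cells}
    for _ in range(n_iter):
        full = {}
        for x, y, z in cells:
            sxy = sum(theta[(x, y, w)] for w in (0, 1))  # current P(X2=x, Y2=y)
            sxz = sum(theta[(x, w, z)] for w in (0, 1))  # current P(X2=x, Z1=z)
            # E-step: split each partially observed count over the missing axis
            full[(x, y, z)] = (counts_a[(x, y)] * theta[(x, y, z)] / sxy
                               + counts_b[(x, z)] * theta[(x, y, z)] / sxz)
        total = sum(full.values())
        theta = {c: v / total for c, v in full.items()}  # M-step: normalize
    return theta
```

Running em_match from many random starting points and recording the per-cell minima and maxima traces out the likelihood ridge, as done for Table 2. A structural zero can be imposed simply by zeroing the corresponding cell in theta0: the multiplicative E-step allocation keeps it at zero.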
Table 2. True cell counts (n_ijl), true probabilities θ_ijl, lower bound θ^0L_ijl (θ^1L_ijl) and upper bound θ^0U_ijl (θ^1U_ijl) of the probability estimates range, and average value θ̄^0_ijl (θ̄^1_ijl) in 100000 runs of EM under scenario S0 (S1); CIA estimates θ̂^CIA_ijl.

X2 Y2 Z1 | n_ijl | θ_ijl | θ^0L_ijl | θ^0U_ijl | θ̄^0_ijl | θ^1L_ijl | θ^1U_ijl | θ̄^1_ijl | θ̂^CIA_ijl
0  0  0  |  617  | 0.120 | 0.000 | 0.126 | 0.052 | 0.126 | 0.126 | 0.126 | 0.051
0  0  1  |  377  | 0.072 | 0.067 | 0.194 | 0.142 | 0.068 | 0.068 | 0.068 | 0.143
1  0  0  | 1175  | 0.234 | 0.095 | 0.305 | 0.183 | 0.095 | 0.305 | 0.183 | 0.182
1  0  1  |  329  | 0.067 | 0.000 | 0.210 | 0.122 | 0.000 | 0.210 | 0.122 | 0.123
0  1  0  |    0  | 0.000 | 0.000 | 0.126 | 0.075 | 0.000 | 0.000 | 0.000 | 0.075
0  1  1  | 1402  | 0.287 | 0.159 | 0.285 | 0.211 | 0.285 | 0.285 | 0.285 | 0.210
1  1  0  |  378  | 0.078 | 0.006 | 0.216 | 0.128 | 0.006 | 0.216 | 0.128 | 0.129
1  1  1  |  722  | 0.142 | 0.000 | 0.210 | 0.088 | 0.000 | 0.210 | 0.088 | 0.087
Table 3. Uncertainty measures and relative root mean squared error under the two scenarios S0 and S1.

X2 Y2 Z1 | UM_0  | UM_1  | RRMSE_0 | RRMSE_1
0  0  0  | 0.126 | 0.000 | 0.640   | 0.023
0  0  1  | 0.126 | 0.000 | 0.989   | 0.103
1  0  0  | 0.210 | 0.210 | 0.301   | 0.302
1  0  1  | 0.210 | 0.210 | 1.119   | 1.122
0  1  0  | 0.126 | 0.000 | –       | –
0  1  1  | 0.126 | 0.000 | 0.275   | 0.017
1  1  0  | 0.210 | 0.210 | 0.939   | 0.940
1  1  1  | 0.210 | 0.210 | 0.512   | 0.513
In Table 2 the extremes of the likelihood ridge found by running EM 100000 times are reported. More specifically, Table 2 reports the true cell counts n_ijl, the true probabilities θ_ijl, the CIA estimates θ̂^CIA_ijl, and the ranges of probability estimates [θ^0L_ijl, θ^0U_ijl] and [θ^1L_ijl, θ^1U_ijl] in 100000 runs of EM under the two scenarios S0 and S1, respectively. Moreover, the average values θ̄^s_ijl, for s = 0, 1, of the 100000 estimates are shown as representative parameter values among those in the likelihood ridge.

First of all, as shown in D'Orazio et al. [6], under scenario S0 the CIA solution is always included in the range of plausible values found through EM. When the auxiliary information regarding the structural zero is not considered, the EM algorithm provides a non-null estimate for the structural-zero cell; the same holds for the estimate under the CIA. Under S0, from Table 2, the average value θ̄^0_010 = 0.075 is obtained, coinciding with the CIA estimate, and the range of plausible values for θ_010 varies between 0 and 0.126. When auxiliary information is considered, an overall reduction of the ranges of some of the estimated probabilities can be observed, as shown in Table 3, where the uncertainty for each cell is reported. Furthermore, since the analysed variables are binary, the introduction of the structural zero (X_2, Y_2, Z_1) = (0, 1, 0) implies that the parameters θ_000 = 0.126, θ_001 = 0.068 and θ_011 = 0.285 are uniquely determined. In Table 3, the uncertainty for each cell under S0 (UM_0) and S1 (UM_1) is reported: the introduction of auxiliary information reduces the ranges of some estimated cell probabilities. Finally, the dispersion of the frequency distribution over the 100000 simulations is measured by the relative root mean squared error (RRMSE), given by
RRMSE(θ̂_ijl) = (1/θ_ijl) √[ (1/100000) Σ_{t=1}^{100000} (θ̂^t_ijl − θ_ijl)² ].   (14)
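Given the per-cell trajectories of the EM runs, the ridge width UM and the RRMSE of (14) can be computed as follows (a sketch with names of our choosing; the paper's computations were done in R):

```python
import math

def uncertainty_and_rrmse(runs, truth):
    """runs: list of dicts, one EM estimate per random starting point;
    truth: dict of true cell probabilities.

    Returns, per cell, UM = max - min of the estimates (the width of the
    explored likelihood ridge) and the RRMSE of (14), which is undefined
    (returned as None) when the true probability equals 0.
    """
    result = {}
    for cell, true_p in truth.items():
        est = [run[cell] for run in runs]
        um = max(est) - min(est)
        if true_p > 0:
            rrmse = math.sqrt(sum((e - true_p) ** 2 for e in est) / len(est)) / true_p
        else:
            rrmse = None
        result[cell] = (um, rrmse)
    return result
```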
The RRMSE_0 and RRMSE_1 values computed under scenarios S0 and S1, respectively, are reported in the last two columns of Table 3. A remarkable decrease of the RRMSE can be noticed in each cell where the structural zero produces an uncertainty reduction, i.e. cells (X_2, Y_2, Z_1) = (0, 0, 0), (0, 0, 1) and (0, 1, 1). In Figs. 5-7, the densities of the parameters θ_000, θ_001 and θ_011 have been approximated by the frequency distributions over the 100000 simulations. More specifically, Figs. 5-7 show the likelihood ridge without constraint (left diagram) and with constraint (right diagram); the vertical bar represents the true probability. Note that the distribution of each parameter under the constraint is more concentrated than the initial one, where no constraints are imposed. We next evaluate the effectiveness of the BN factorization in multivariate statistical matching, in terms of uncertainty reduction, with respect to the saturated multinomial model. In our simulation experiment all the variables are dichotomous,
Fig. 5. Likelihood ridge for cell (X_2, Y_2, Z_1) = (0, 0, 0) without constraint (left) and with constraint (right).
Fig. 6. Likelihood ridge for cell (X_2, Y_2, Z_1) = (0, 0, 1) without constraint (left) and with constraint (right).
therefore the saturated multinomial model has 2^7 = 128 cells. The EM algorithm with 100000 different starting points is run under scenario S1. Table 4 shows the simulation results in terms of UM and RRMSE under scenario S1, for the saturated multinomial model (UM^S_1 and RRMSE^S_1) and for the BN model (UM^BN_1 and RRMSE^BN_1), respectively. It is worth noting that each of the eight configurations of the triplet (X_2, Y_2, Z_1) is associated with 16 possible configurations of the remaining four variables; therefore, the values reported in Table 4 are to be intended as average values. The UM^BN_1 and RRMSE^BN_1 average values are smaller than the UM^S_1 and RRMSE^S_1 ones for all eight cells. These results confirm the advantage, in terms of uncertainty reduction, obtained by exploiting the information about the association structure and the conditional independence statements entailed by the BN in the statistical matching problem.

4. An application to real data: statistical matching among factors promoting sharing mobility in Italy

In this section, the proposed methodology is applied to an empirical dataset. We analyse data coming from the 2017 survey on Aspects of Daily Life (ADL) carried out by the Italian National Institute of Statistics (Istat) on the resident population, interviewing a representative sample of 20952 households and 48855 people. The ADL survey collects information on citizens' habits as well as on the main social aspects of daily life, allowing the well-being of individuals and households to be evaluated with reference to specific themes of interest.
Fig. 7. Likelihood ridge for cell (X_2, Y_2, Z_1) = (0, 1, 1) without constraint (left) and with constraint (right).
Table 4. Uncertainty measure and relative root mean squared error for the saturated multinomial model (UM^S_1, RRMSE^S_1) and the BN model (UM^BN_1, RRMSE^BN_1), under scenario S1.

X2 Y2 Z1 | UM^S_1 | UM^BN_1 | RRMSE^S_1 | RRMSE^BN_1
0  0  0  | 0.012  | 0.000   | 0.466     | 0.152
0  0  1  | 0.014  | 0.000   | 0.745     | 0.223
1  0  0  | 0.029  | 0.014   | 0.490     | 0.323
1  0  1  | 0.022  | 0.014   | 1.550     | 1.429
0  1  0  | –      | –       | –         | –
0  1  1  | 0.022  | 0.000   | 0.430     | 0.124
1  1  0  | 0.022  | 0.014   | 1.384     | 1.125
1  1  1  | 0.017  | 0.014   | 0.576     | 0.525
For the present application, a subset of 14 variables, out of the 686 original ones, has been selected by taking into account important promoting factors of sharing mobility in Italy. Furthermore, only individuals aged 18 years and over have been considered, so that our dataset consists of 34249 sample units and 14 variables.¹ For statistical matching purposes, the 14 variables of interest have been divided into three groups: X variables of dimension H = 4, Y variables of dimension K = 6, and Z variables of dimension T = 4, as reported in Table 5. In addition, the dataset has been randomly split into two subsets, sample A and sample B, composed of n_A = 17125 and n_B = 17124 units, respectively. In the original sample data, the 14 variables of interest are observed in both samples A and B. To perform statistical matching, the Y variables have been removed from sample B, and the Z variables have been removed from sample A. As a result, the X variables are common to both samples A and B, the Y variables are specific to sample A, and the Z variables are specific to sample B; cf. Table 5. The four common X variables, observed in both samples, are essentially related to socio-demographic aspects of individuals.

In order to learn the final BN under the statistical matching scenario, the graphs G_X, G_XY and G_XZ have been estimated by means of the hill-climbing algorithm (with BIC score) implemented in the R package bnlearn. First of all, the graph G_X, defined over the common variables, has been estimated on the basis of all n_A + n_B observations in samples A and B, taking into account the following forbidden edge directions:

1. SEX → AGE and AGE → SEX: sex and age do not influence each other;
2. EDU → AGE, OCCUP → AGE, OCCUP → SEX, EDU → SEX: educational level and professional status do not affect the sex and age of an individual;
3. OCCUP → EDU: the professional status does not affect the educational level; generally the opposite holds, since people with higher levels of education have better job prospects.
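The estimation above relies on bnlearn's hc function in R. Purely as an illustrative sketch (our own simplified, addition-only search, not the paper's code), a BIC-scored hill climber can honour such a blacklist of forbidden directed edges as follows:

```python
import math
from itertools import product

def bic_local(data, child, parents):
    """BIC contribution of one node: log-likelihood of child given its
    parents minus the usual (log N)/2 penalty per free parameter."""
    levels = {v: sorted({row[v] for row in data}) for v in data[0]}
    counts, par_tot = {}, {}
    for row in data:
        key = (tuple(row[p] for p in parents), row[child])
        counts[key] = counts.get(key, 0) + 1
    for (pk, _), c in counts.items():
        par_tot[pk] = par_tot.get(pk, 0) + c
    loglik = sum(c * math.log(c / par_tot[pk]) for (pk, _), c in counts.items())
    n_params = (len(levels[child]) - 1) * math.prod(len(levels[p]) for p in parents)
    return loglik - 0.5 * math.log(len(data)) * n_params

def creates_cycle(parents, frm, to):
    """Adding frm -> to creates a cycle iff 'to' is already an ancestor of 'frm'."""
    stack, seen = [frm], set()
    while stack:
        v = stack.pop()
        if v == to:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def hill_climb(data, blacklist=frozenset()):
    """Greedy structure search over edge additions only, skipping any
    directed edge listed in the blacklist (a simplification of the
    add/remove/reverse search used by bnlearn's hc)."""
    nodes = list(data[0])
    parents = {v: [] for v in nodes}
    improved = True
    while improved:
        improved = False
        best, best_gain = None, 1e-9
        for frm, to in product(nodes, nodes):
            if frm == to or (frm, to) in blacklist or frm in parents[to]:
                continue
            if creates_cycle(parents, frm, to):
                continue
            gain = (bic_local(data, to, parents[to] + [frm])
                    - bic_local(data, to, parents[to]))
            if gain > best_gain:
                best, best_gain = (frm, to), gain
        if best:
            parents[best[1]].append(best[0])
            improved = True
    return parents
```

Each candidate edge is scored locally, so only the child's term of the BIC decomposition is recomputed; bnlearn's hc additionally considers edge deletions and reversals.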
1 For some variables, further class aggregations were computed with respect to the original ones.
Table 5. The selected variables for statistical matching.

X variables
AGE: Age in years (18-24; 25-34; 35-44; 45-54; 55-64; 65-74; 75+)
SEX: Gender (male; female)
EDU: Educational level (lower than high school; high school; bachelor's degree or higher)
OCCUP: Professional activity (employed; in search of employment; inactive (excluding students); student)

Y variables
NET USE: the frequency of use of the Internet during the last 12 months (daily; not daily)
MOBILE NET USE: the frequency of use of a mobile connection (less than 3 months; over 3 months)
PUBLIC MEANS USE: the frequency of use of public means (daily or a few times a week; a few times a month/year; never or the service doesn't exist)
SPORT: the frequency of playing sport activities (no; occasionally; regularly)
FRIENDS: the frequency of hanging out with friends (less than once a week; up to once a week)
PRIVATE CAR USE: the frequency of use of one's own private car (never; a few times a week/month/year; daily)

Z variables
CARS NUMBER: the number of cars per family (no cars; one; two or more)
BIKES NUMBER: the number of bikes per family (no bikes; one; two; three or more)
BIKESHARING: the use of bikesharing in the last 12 months (no; yes)
CARSHARING: the use of carsharing in the last 12 months (no; yes)
Fig. 8. Graph G_XYZ under the CIA assumption. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
Next, according to the procedure illustrated in Sect. 2.1, the graphs G_XY and G_XZ have been estimated on the basis of the data in sample A (n_A observations) and sample B (n_B observations), respectively, taking into account the following constraints:

• the estimated conditional independence structure in G_X;
• edge directions from Y and Z variables to X ones are forbidden.

The graph G_XYZ has been built as the union of G_XY and G_XZ, given that the structure over the common X variables is the same in both graphs. The resulting graph, shown in Fig. 8, corresponds to the CIA assumption: the Y variables (pink nodes) and the Z variables (green nodes) are conditionally independent given the X variables (blue nodes). Extra sample information regarding the relationships between the components of Y and Z can be introduced as illustrated in Fig. 8. In this regard, the development of new information technologies and systems, like digital platforms, has promoted the diffusion of sharing mobility services. Since most carsharing (CARSHARING) services need a mobile connection to locate and share the vehicle as well as to pay the rental, the propensity to use the internet on one's own mobile device
Fig. 9. Graph G_XYZ taking into account extra sample information on qualitative dependencies.
(MOBILE NET USE) is considered an important enabling factor. Furthermore, the frequency of playing sport activities (SPORT) affects the number of bikes per family (BIKES NUMBER), see [10]. Then, G_XYZ has been augmented in order to take into account the relationships among the aforementioned variables and, as a result, to reduce the statistical matching uncertainty. Correspondingly, the edges MOBILE NET USE → CARSHARING and SPORT → BIKES NUMBER, coloured in orange, are added to the graph in Fig. 8. The resulting graph is shown in Fig. 9. Moreover, three additional arcs (represented with dashed lines) have been included in order to ensure sequential c-removability. The resulting graph is in Fig. 10. In particular, the nodes belonging to sample A are c-removable in the following order: FRIENDS - SPORT - PUBLIC MEANS USE - PRIVATE CAR USE - NET USE - MOBILE NET USE. The nodes belonging to sample B are c-removable in the following order: CARS NUMBER - BIKES NUMBER - BIKESHARING - CARSHARING. Other orderings are possible. Once the extra sample information on qualitative dependencies among the components of Y and Z has been included in the graph, the subsequent step consists in choosing a DAG in the class of plausible DAGs. Operationally, our proposal is to choose the DAG nearest to the one under the CIA (Fig. 8). This choice offers a relevant computational advantage when evidence propagation algorithms are used to impute missing Z values in sample A and missing Y values in sample B. Based on the DAG in Fig. 10, the joint pmf factorizes into the product of the following local probability distributions:
P(X, Y, Z) = P(AGE) · P(SEX) · P(EDU | AGE, SEX) · P(OCCUP | AGE, EDU, SEX) ·
  P(NET USE | AGE, EDU) · P(SPORT | AGE, EDU) · P(FRIENDS | AGE, SPORT) ·
  P(BIKES NUMBER | AGE, EDU, SPORT) · P(MOBILE NET USE | AGE, NET USE, EDU) ·
  P(PRIVATE CAR USE | OCCUP, EDU, SEX) · P(CARS NUMBER | AGE, BIKES NUMBER) ·
  P(BIKESHARING | CARSHARING, EDU) · P(CARSHARING | AGE, MOBILE NET USE, EDU) ·
  P(PUBLIC MEANS USE | NET USE, PRIVATE CAR USE).   (15)
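As a quick check of the dimension reduction entailed by a factorization like (15), the following sketch (level counts taken from Table 5; the function name is ours) compares the saturated table over all 14 variables with the two joint tables that actually need to be explored by EM:

```python
import math

# Number of categories per variable, taken from Table 5 of the application.
levels = {"AGE": 7, "SEX": 2, "EDU": 3, "OCCUP": 4,
          "NET USE": 2, "MOBILE NET USE": 2, "PUBLIC MEANS USE": 3,
          "SPORT": 3, "FRIENDS": 2, "PRIVATE CAR USE": 3,
          "CARS NUMBER": 3, "BIKES NUMBER": 4,
          "BIKESHARING": 2, "CARSHARING": 2}

def table_cells(variables):
    """Number of cells of the joint table over the given variables."""
    return math.prod(levels[v] for v in variables)

saturated = table_cells(levels)  # joint table over all 14 variables
carsharing_tab = table_cells(("CARSHARING", "AGE", "EDU", "MOBILE NET USE"))
bikes_tab = table_cells(("BIKES NUMBER", "AGE", "EDU", "SPORT"))
```

The saturated model has 1741824 cells, while the two unidentified local tables have 84 and 252 cells (336 in total), matching the counts discussed below.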
The only two local probability distributions involving both Y and Z variables, namely P(BIKES NUMBER | AGE, EDU, SPORT) and P(CARSHARING | AGE, MOBILE NET USE, EDU), cannot be estimated by means of the information in samples A and B. As shown in the simulation study in Sect. 3, by running the EM algorithm with 100000 different starting points we are able to estimate the range of the above probabilities in terms of their joint pmf, providing the extremes of the likelihood ridge [θ^L_ijkl, θ^U_ijkl] for each cell. The main results are shown in Appendix A. In particular, Tables 6 and 7 report the true probabilities θ_ijkl, the lower and upper bounds of the probability estimates [θ^L_ijkl, θ^U_ijkl], the average value θ̄_ijkl over the 100000 estimates, and the CIA estimate
Fig. 10. Graph G_XYZ with additional edges to ensure sequential c-removability.
θ̂^CIA_ijkl (i.e. based on the factorization of the DAG in Fig. 8). Finally, in the last two columns, for each parameter the uncertainty UM, given by the difference between the upper and lower bounds of the probability estimates, and the RRMSE values are computed. The RRMSE measures the dispersion of the frequency distribution of the 100000 simulated values for each cell; looking at its values in both tables, it is small for almost all cell probabilities except for a few cases, all related to the category "yes" of the variable CARSHARING (see Table 6), which is characterized by a small probability, since in 2017 sharing mobility was still not widespread in Italy. We point out that the CIA estimate θ̂^CIA_ijkl is not always included in the corresponding uncertainty interval, since other constraints have been added in the graph corresponding to the statistical matching uncertainty scenario. Finally, since the extra sample information (MOBILE NET USE → CARSHARING and SPORT → BIKES NUMBER) involves two components of Y and of Z, respectively, it is straightforward to prove that the uncertainty in the dependence structure is (K − 2)(T − 2) = 8. As far as the uncertainty in the parameter estimation is concerned, from (9) we obtain UM(θ*) = 1.84, where θ* is a vector of length M = (84 + 252). It is worth noting that the comparison with the saturated multinomial model, as provided in the simulation study, is infeasible in this application, since the number of cell probabilities to be estimated would be 1741824. This number is drastically reduced to 336 (84 + 252) when the conditional independence statements in the graph in Fig. 10 and the corresponding factorization (15) are taken into account, showing the important role of Bayesian networks in simplifying statistical matching in multivariate contexts.

5. Conclusions

In this paper the use of Bayesian networks to deal with statistical matching uncertainty for multivariate categorical variables is proposed.
The use of BNs is motivated by several advantages: (i) extra sample information on qualitative dependencies between the components of Y and Z can be included in statistical matching; (ii) this information can be used to factorize the joint pmf according to the conditional independencies entailed by the graph; (iii) including extra sample information significantly improves the quality of statistical matching. The representation of the joint pmf, taking advantage of local relationships, simplifies both parameter estimation and statistical matching uncertainty evaluation in a multivariate context, since a smaller number of lower-dimensional parameters has to be estimated. Moreover, the modularity of the graphical model allows one to deal separately with: 1) subgraphs induced by nodes (variables) belonging to the same sample; 2) subgraphs induced by variables observed in different samples. In other words, parameters affected by uncertainty are separated from those directly estimable from the available sample information. In this way, computational complexity is limited to some subsets of variables.
Such considerations have been confirmed in the simulation study, with results showing the advantage of BNs in the statistical matching context in terms of uncertainty and computational complexity reduction. Finally, the application proved the feasibility of the proposed approach in real contexts. The paper has been developed in a statistical matching macro perspective. Nevertheless, BNs also allow a straightforward extension to a micro approach: missing Z values in sample A and missing Y values in sample B can be imputed from the given BN by efficient evidence propagation algorithms. Finally, the proposed methodology shows a good performance in terms of uncertainty reduction, due to the role of qualitative auxiliary information. Such information is more easily accessible than the structural zeros or inequality constraints discussed in Sect. 2.2.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors are grateful to the referees for very careful reading of the manuscript and thoughtful comments.

Appendix A. Tables of application results

Table 6. True probabilities θ_ijkl for P(CARSHARING, AGE, EDU, MOBILE NET USE), lower bound θ^L_ijkl and upper bound θ^U_ijkl of the probability estimates range, average value θ̄_ijkl in 100000 runs of EM, CIA estimates θ̂^CIA_ijkl, UM and RRMSE values.
N  | CARSHARING | AGE   | EDU                         | MOBILE NET USE     | θ_ijkl  | θ^L_ijkl | θ^U_ijkl | θ̄_ijkl | θ̂^CIA_ijkl | UM      | RRMSE^a
1  | no | 18-24 | high school                 | over 3 months      | 0.01483 | 0.01403 | 0.01544 | 0.01504 | 0.01480 | 0.00141 | 0.03194
2  | no | 18-24 | high school                 | less than 3 months | 0.03404 | 0.03346 | 0.03487 | 0.03386 | 0.03456 | 0.00141 | 0.01366
3  | no | 18-24 | bachelor's degree or higher | over 3 months      | 0.00120 | 0.00072 | 0.00101 | 0.00097 | 0.00158 | 0.00029 | 0.19938
4  | no | 18-24 | bachelor's degree or higher | less than 3 months | 0.00453 | 0.00487 | 0.00515 | 0.00491 | 0.00440 | 0.00028 | 0.08517
5  | no | 18-24 | less than high school       | over 3 months      | 0.00791 | 0.00742 | 0.00760 | 0.00754 | 0.00728 | 0.00018 | 0.04762
6  | no | 18-24 | less than high school       | less than 3 months | 0.01329 | 0.01359 | 0.01377 | 0.01365 | 0.01394 | 0.00018 | 0.02807
7  | no | 25-34 | high school                 | over 3 months      | 0.02879 | 0.02782 | 0.02946 | 0.02865 | 0.02820 | 0.00164 | 0.01978
8  | no | 25-34 | high school                 | less than 3 months | 0.02978 | 0.02867 | 0.03031 | 0.02948 | 0.03037 | 0.00164 | 0.02121
9  | no | 25-34 | bachelor's degree or higher | over 3 months      | 0.01194 | 0.00981 | 0.01259 | 0.01149 | 0.01318 | 0.00278 | 0.08063
10 | no | 25-34 | bachelor's degree or higher | less than 3 months | 0.01729 | 0.01611 | 0.01889 | 0.01721 | 0.01693 | 0.00278 | 0.04941
11 | no | 25-34 | less than high school       | over 3 months      | 0.01381 | 0.01255 | 0.01297 | 0.01272 | 0.01238 | 0.00042 | 0.07944
12 | no | 25-34 | less than high school       | less than 3 months | 0.00873 | 0.00942 | 0.00984 | 0.00966 | 0.01030 | 0.00042 | 0.10783
13 | no | 35-44 | high school                 | over 3 months      | 0.04701 | 0.04547 | 0.04704 | 0.04603 | 0.04633 | 0.00157 | 0.02347
14 | no | 35-44 | high school                 | less than 3 months | 0.02683 | 0.02641 | 0.02797 | 0.02742 | 0.02726 | 0.00156 | 0.02893
15 | no | 35-44 | bachelor's degree or higher | over 3 months      | 0.01997 | 0.01853 | 0.02113 | 0.01970 | 0.02168 | 0.00260 | 0.04385
16 | no | 35-44 | bachelor's degree or higher | less than 3 months | 0.01693 | 0.01493 | 0.01753 | 0.01636 | 0.01536 | 0.00260 | 0.05972
17 | no | 35-44 | less than high school       | over 3 months      | 0.03489 | 0.03425 | 0.03443 | 0.03429 | 0.03256 | 0.00018 | 0.01721
18 | no | 35-44 | less than high school       | less than 3 months | 0.01139 | 0.01175 | 0.01193 | 0.01189 | 0.01347 | 0.00018 | 0.04473
19 | no | 45-54 | high school                 | over 3 months      | 0.06073 | 0.05951 | 0.06075 | 0.05981 | 0.05930 | 0.00124 | 0.01637
20 | no | 45-54 | high school                 | less than 3 months | 0.02120 | 0.02093 | 0.02217 | 0.02188 | 0.02216 | 0.00124 | 0.03628
21 | no | 45-54 | bachelor's degree or higher | over 3 months      | 0.01799 | 0.01771 | 0.01849 | 0.01801 | 0.01974 | 0.00078 | 0.01418
22 | no | 45-54 | bachelor's degree or higher | less than 3 months | 0.01180 | 0.01128 | 0.01206 | 0.01176 | 0.00935 | 0.00078 | 0.02175
23 | no | 45-54 | less than high school       | over 3 months      | 0.06324 | 0.06334 | 0.06364 | 0.06337 | 0.06168 | 0.00030 | 0.00231
24 | no | 45-54 | less than high school       | less than 3 months | 0.01215 | 0.01174 | 0.01205 | 0.01201 | 0.01368 | 0.00031 | 0.01262
25 | no | 55-64 | high school                 | over 3 months      | 0.05028 | 0.04949 | 0.05017 | 0.04959 | 0.04862 | 0.00068 | 0.01405
26 | no | 55-64 | high school                 | less than 3 months | 0.01060 | 0.01065 | 0.01132 | 0.01123 | 0.01172 | 0.00067 | 0.06117
27 | no | 55-64 | bachelor's degree or higher | over 3 months      | 0.01442 | 0.01461 | 0.01536 | 0.01481 | 0.01565 | 0.00075 | 0.03107
28 | no | 55-64 | bachelor's degree or higher | less than 3 months | 0.00674 | 0.00568 | 0.00643 | 0.00622 | 0.00508 | 0.00075 | 0.08453
29 | no | 55-64 | less than high school       | over 3 months      | 0.07282 | 0.07195 | 0.07213 | 0.07196 | 0.07153 | 0.00018 | 0.01184
30 | no | 55-64 | less than high school       | less than 3 months | 0.00715 | 0.00787 | 0.00805 | 0.00804 | 0.00841 | 0.00018 | 0.12408
31 | no | 65-74 | high school                 | over 3 months      | 0.03390 | 0.03405 | 0.03435 | 0.03407 | 0.03355 | 0.00030 | 0.00513
32 | no | 65-74 | high school                 | less than 3 months | 0.00318 | 0.00268 | 0.00297 | 0.00295 | 0.00309 | 0.00029 | 0.07285
33 | no | 65-74 | bachelor's degree or higher | over 3 months      | 0.01124 | 0.01149 | 0.01156 | 0.01150 | 0.01182 | 0.00007 | 0.02342
34 | no | 65-74 | bachelor's degree or higher | less than 3 months | 0.00283 | 0.00251 | 0.00257 | 0.00257 | 0.00163 | 0.00006 | 0.09440
35 | no | 65-74 | less than high school       | over 3 months      | 0.09305 | 0.09281 | 0.09294 | 0.09282 | 0.09191 | 0.00013 | 0.00257
36 | no | 65-74 | less than high school       | less than 3 months | 0.00242 | 0.00257 | 0.00269 | 0.00269 | 0.00338 | 0.00012 | 0.10840
37 | no | 75+   | high school                 | over 3 months      | 0.01746 | 0.01771 | 0.01783 | 0.01771 | 0.01811 | 0.00012 | 0.01457
38 | no | 75+   | high school                 | less than 3 months | 0.00064 | 0.00025 | 0.00036 | 0.00036 | 0.00030 | 0.00011 | 0.43503
39 | no | 75+   | bachelor's degree or higher | over 3 months      | 0.00625 | 0.00632 | 0.00632 | 0.00632 | 0.00654 | 0.00000 | 0.01137
40 | no | 75+   | bachelor's degree or higher | less than 3 months | 0.00047 | 0.00043 | 0.00043 | 0.00043 | 0.00021 | 0.00000 | 0.08953
41 | no | 75+   | less than high school       | over 3 months      | 0.12284 | 0.12295 | 0.12319 | 0.12295 | 0.12160 | 0.00024 | 0.00096
42 | no | 75+   | less than high school       | less than 3 months | 0.00044 | 0.00004 | 0.00028 | 0.00028 | 0.00054 | 0.00024 | 0.35242
Table 6 (continued)

N  | CARSHARING | AGE   | EDU                         | MOBILE NET USE     | θ_ijkl  | θ^L_ijkl | θ^U_ijkl | θ̄_ijkl | θ̂^CIA_ijkl | UM      | RRMSE^a
43 | yes | 18-24 | high school                 | over 3 months      | 0.00020 | 0.00000 | 0.00141 | 0.00040 | 0.00027 | 0.00141 | 2.29963
44 | yes | 18-24 | high school                 | less than 3 months | 0.00123 | 0.00000 | 0.00141 | 0.00101 | 0.00063 | 0.00141 | 0.39061
45 | yes | 18-24 | bachelor's degree or higher | over 3 months      | 0.00000 | 0.00000 | 0.00029 | 0.00004 | 0.00008 | 0.00029 | –
46 | yes | 18-24 | bachelor's degree or higher | less than 3 months | 0.00044 | 0.00000 | 0.00029 | 0.00025 | 0.00022 | 0.00029 | 0.45761
47 | yes | 18-24 | less than high school       | over 3 months      | 0.00000 | 0.00000 | 0.00018 | 0.00006 | 0.00003 | 0.00018 | –
48 | yes | 18-24 | less than high school       | less than 3 months | 0.00018 | 0.00000 | 0.00018 | 0.00012 | 0.00005 | 0.00018 | 0.46377
49 | yes | 25-34 | high school                 | over 3 months      | 0.00026 | 0.00000 | 0.00164 | 0.00081 | 0.00052 | 0.00164 | 2.95746
50 | yes | 25-34 | high school                 | less than 3 months | 0.00093 | 0.00000 | 0.00164 | 0.00083 | 0.00056 | 0.00164 | 0.60190
51 | yes | 25-34 | bachelor's degree or higher | over 3 months      | 0.00035 | 0.00000 | 0.00278 | 0.00110 | 0.00067 | 0.00278 | 3.23461
52 | yes | 25-34 | bachelor's degree or higher | less than 3 months | 0.00190 | 0.00000 | 0.00278 | 0.00168 | 0.00086 | 0.00278 | 0.46315
53 | yes | 25-34 | less than high school       | over 3 months      | 0.00009 | 0.00000 | 0.00042 | 0.00024 | 0.00004 | 0.00042 | 2.39778
54 | yes | 25-34 | less than high school       | less than 3 months | 0.00018 | 0.00000 | 0.00042 | 0.00018 | 0.00004 | 0.00042 | 0.80672
55 | yes | 35-44 | high school                 | over 3 months      | 0.00073 | 0.00000 | 0.00156 | 0.00101 | 0.00085 | 0.00156 | 0.79211
56 | yes | 35-44 | high school                 | less than 3 months | 0.00044 | 0.00000 | 0.00156 | 0.00055 | 0.00050 | 0.00156 | 1.18883
57 | yes | 35-44 | bachelor's degree or higher | over 3 months      | 0.00076 | 0.00000 | 0.00260 | 0.00143 | 0.00110 | 0.00260 | 1.40854
58 | yes | 35-44 | bachelor's degree or higher | less than 3 months | 0.00099 | 0.00000 | 0.00260 | 0.00117 | 0.00078 | 0.00260 | 0.85660
59 | yes | 35-44 | less than high school       | over 3 months      | 0.00003 | 0.00000 | 0.00018 | 0.00014 | 0.00011 | 0.00018 | 4.16863
60 | yes | 35-44 | less than high school       | less than 3 months | 0.00006 | 0.00000 | 0.00018 | 0.00004 | 0.00005 | 0.00018 | 0.95557
61 | yes | 45-54 | high school                 | over 3 months      | 0.00061 | 0.00000 | 0.00124 | 0.00095 | 0.00109 | 0.00124 | 0.80422
62 | yes | 45-54 | high school                 | less than 3 months | 0.00038 | 0.00000 | 0.00124 | 0.00029 | 0.00041 | 0.00124 | 0.98327
63 | yes | 45-54 | bachelor's degree or higher | over 3 months      | 0.00044 | 0.00000 | 0.00077 | 0.00048 | 0.00100 | 0.00077 | 0.58826
64 | yes | 45-54 | bachelor's degree or higher | less than 3 months | 0.00032 | 0.00000 | 0.00077 | 0.00029 | 0.00047 | 0.00077 | 0.79575
65 | yes | 45-54 | less than high school       | over 3 months      | 0.00026 | 0.00000 | 0.00030 | 0.00026 | 0.00022 | 0.00030 | 0.26147
66 | yes | 45-54 | less than high school       | less than 3 months | 0.00003 | 0.00000 | 0.00030 | 0.00004 | 0.00005 | 0.00030 | 2.36425
67 | yes | 55-64 | high school                 | over 3 months      | 0.00038 | 0.00000 | 0.00067 | 0.00057 | 0.00089 | 0.00067 | 0.67430
68 | yes | 55-64 | high school                 | less than 3 months | 0.00023 | 0.00000 | 0.00067 | 0.00010 | 0.00022 | 0.00067 | 0.91455
69 | yes | 55-64 | bachelor's degree or higher | over 3 months      | 0.00038 | 0.00000 | 0.00075 | 0.00054 | 0.00079 | 0.00075 | 0.73195
70 | yes | 55-64 | bachelor's degree or higher | less than 3 months | 0.00023 | 0.00000 | 0.00075 | 0.00020 | 0.00026 | 0.00075 | 0.96564
71 | yes | 55-64 | less than high school       | over 3 months      | 0.00012 | 0.00000 | 0.00018 | 0.00017 | 0.00025 | 0.00018 | 0.51096
72 | yes | 55-64 | less than high school       | less than 3 months | 0.00009 | 0.00000 | 0.00018 | 0.00001 | 0.00003 | 0.00018 | 0.93839
73 | yes | 65-74 | high school                 | over 3 months      | 0.00018 | 0.00000 | 0.00029 | 0.00028 | 0.00062 | 0.00029 | 0.64064
74 | yes | 65-74 | high school                 | less than 3 months | 0.00006 | 0.00000 | 0.00029 | 0.00001 | 0.00006 | 0.00029 | 1.05920
75 | yes | 65-74 | bachelor's degree or higher | over 3 months      | 0.00006 | 0.00000 | 0.00006 | 0.00005 | 0.00060 | 0.00006 | 0.27719
76 | yes | 65-74 | bachelor's degree or higher | less than 3 months | 0.00000 | 0.00000 | 0.00006 | 0.00001 | 0.00008 | 0.00006 | –
77 | yes | 65-74 | less than high school       | over 3 months      | 0.00015 | 0.00000 | 0.00012 | 0.00012 | 0.00032 | 0.00012 | 0.18310
78 | yes | 65-74 | less than high school       | less than 3 months | 0.00000 | 0.00000 | 0.00012 | 0.00000 | 0.00001 | 0.00012 | –
79 | yes | 75+   | high school                 | over 3 months      | 0.00009 | 0.00000 | 0.00011 | 0.00011 | 0.00033 | 0.00011 | 0.28517
80 | yes | 75+   | high school                 | less than 3 months | 0.00000 | 0.00000 | 0.00011 | 0.00000 | 0.00001 | 0.00011 | –
81 | yes | 75+   | bachelor's degree or higher | over 3 months      | 0.00003 | 0.00000 | 0.00000 | 0.00000 | 0.00033 | 0.00000 | 1.00000
82 | yes | 75+   | bachelor's degree or higher | less than 3 months | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00001 | 0.00000 | –
83 | yes | 75+   | less than high school       | over 3 months      | 0.00020 | 0.00000 | 0.00024 | 0.00024 | 0.00043 | 0.00024 | 0.17711
84 | yes | 75+   | less than high school       | less than 3 months | 0.00000 | 0.00000 | 0.00024 | 0.00000 | 0.00000 | 0.00024 | –
a
The sign “-” is in correspondence of true cell probabilities equal to 0.
Table 7
True probabilities θ_ijkl for P(BIKES NUMBER, AGE, EDU, SPORT), lower bound θ^L_ijkl and upper bound θ^U_ijkl of the probability estimates range, average value θ̄_ijkl in 100000 runs of EM, CIA estimates θ^CIA_ijkl, UM and RMMSE^a values.

N | BIKES NUMBER | AGE | EDU | SPORT | θ_ijkl | θ^L_ijkl | θ^U_ijkl | θ̄_ijkl | θ^CIA_ijkl | UM | RMMSE^a
1 | 0 | 18-24 | high school | regularly | 0.00482 | 0.00000 | 0.01414 | 0.00594 | 0.00642 | 0.01414 | 0.59515
2 | 0 | 18-24 | high school | no | 0.00780 | 0.00000 | 0.01424 | 0.00615 | 0.00664 | 0.01424 | 0.40253
3 | 0 | 18-24 | high school | occasionally | 0.00210 | 0.00000 | 0.00754 | 0.00218 | 0.00231 | 0.00754 | 0.65162
4 | 0 | 18-24 | bachelor’s degree or higher | regularly | 0.00058 | 0.00000 | 0.00189 | 0.00078 | 0.00079 | 0.00189 | 0.68398
5 | 0 | 18-24 | bachelor’s degree or higher | no | 0.00088 | 0.00000 | 0.00188 | 0.00084 | 0.00085 | 0.00188 | 0.40567
6 | 0 | 18-24 | bachelor’s degree or higher | occasionally | 0.00038 | 0.00000 | 0.00089 | 0.00028 | 0.00028 | 0.00089 | 0.52024
7 | 0 | 18-24 | less than high school | regularly | 0.00161 | 0.00000 | 0.00618 | 0.00230 | 0.00192 | 0.00618 | 0.82957
8 | 0 | 18-24 | less than high school | no | 0.00549 | 0.00000 | 0.00765 | 0.00427 | 0.00369 | 0.00765 | 0.33102
9 | 0 | 18-24 | less than high school | occasionally | 0.00085 | 0.00000 | 0.00297 | 0.00114 | 0.00091 | 0.00297 | 0.82718
10 | 0 | 25-34 | high school | regularly | 0.00628 | 0.00000 | 0.02118 | 0.00852 | 0.00907 | 0.02118 | 0.68294
11 | 0 | 25-34 | high school | no | 0.01282 | 0.00000 | 0.02222 | 0.01063 | 0.01150 | 0.02222 | 0.34685
12 | 0 | 25-34 | high school | occasionally | 0.00286 | 0.00000 | 0.00802 | 0.00319 | 0.00319 | 0.00802 | 0.61281
13 | 0 | 25-34 | bachelor’s degree or higher | regularly | 0.00380 | 0.00000 | 0.01182 | 0.00491 | 0.00504 | 0.01182 | 0.58532
14 | 0 | 25-34 | bachelor’s degree or higher | no | 0.00561 | 0.00000 | 0.01159 | 0.00481 | 0.00495 | 0.01159 | 0.36882
15 | 0 | 25-34 | bachelor’s degree or higher | occasionally | 0.00178 | 0.00000 | 0.00652 | 0.00268 | 0.00261 | 0.00652 | 0.87957
16 | 0 | 25-34 | less than high school | regularly | 0.00149 | 0.00000 | 0.00444 | 0.00219 | 0.00176 | 0.00444 | 0.84252
17 | 0 | 25-34 | less than high school | no | 0.00765 | 0.00381 | 0.01077 | 0.00730 | 0.00629 | 0.00696 | 0.18489
18 | 0 | 25-34 | less than high school | occasionally | 0.00085 | 0.00000 | 0.00256 | 0.00132 | 0.00102 | 0.00256 | 0.94173
(continued on next page)
Table 7 (continued). The lines below list the columns sequentially for rows 19–87, in the order: N; BIKES NUMBER; AGE; EDU; SPORT; θ_ijkl; θ^L_ijkl; θ^U_ijkl; θ̄_ijkl; θ^CIA_ijkl; UM; RMMSE^a.
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+ 18-24 18-24 18-24 18-24 18-24 18-24 18-24 18-24 18-24 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 35-44 35-44 35-44 35-44 35-44 35-44
high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher
regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally
0.00499 0.01480 0.00251 0.00298 0.00578 0.00155 0.00193 0.01536 0.00105 0.00461 0.01606 0.00269 0.00210 0.00438 0.00108 0.00234 0.02572 0.00143 0.00283 0.01720 0.00152 0.00140 0.00511 0.00073 0.00234 0.03448 0.00169 0.00207 0.01477 0.00099 0.00085 0.00566 0.00053 0.00228 0.04724 0.00158 0.00058 0.01060 0.00032 0.00047 0.00385 0.00026 0.00137 0.08003 0.00114 0.00467 0.00523 0.00181 0.00061 0.00047 0.00012 0.00131 0.00237 0.00079 0.00502 0.00794 0.00245 0.00318 0.00280 0.00114 0.00099 0.00485 0.00076 0.00461 0.01010 0.00242 0.00353 0.00426 0.00140
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00790 0.00000 0.00000 0.00001 0.00000 0.00000 0.00000 0.00000 0.00000 0.01520 0.00000 0.00000 0.00255 0.00000 0.00000 0.00000 0.00000 0.00000 0.02388 0.00000 0.00000 0.00808 0.00000 0.00000 0.00341 0.00000 0.00000 0.03987 0.00000 0.00000 0.00908 0.00000 0.00000 0.00353 0.00000 0.00000 0.07816 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.01974 0.02328 0.00998 0.01099 0.01102 0.00689 0.00708 0.01858 0.00360 0.02147 0.02509 0.01132 0.00743 0.00760 0.00523 0.00909 0.02907 0.00477 0.01218 0.02235 0.00767 0.00569 0.00722 0.00314 0.00874 0.03788 0.00525 0.00664 0.01779 0.00309 0.00241 0.00749 0.00181 0.00750 0.05242 0.00504 0.00152 0.01126 0.00067 0.00122 0.00504 0.00043 0.00352 0.08372 0.00204 0.01118 0.01125 0.00739 0.00109 0.00109 0.00085 0.00412 0.00418 0.00287 0.01517 0.01514 0.00795 0.00736 0.00739 0.00635 0.00439 0.00546 0.00256 0.01574 0.01610 0.00997 0.00860 0.00864 0.00691
0.00640 0.01365 0.00323 0.00366 0.00536 0.00202 0.00298 0.01403 0.00158 0.00671 0.01487 0.00357 0.00260 0.00372 0.00131 0.00369 0.02338 0.00200 0.00460 0.01481 0.00297 0.00198 0.00416 0.00109 0.00442 0.03072 0.00274 0.00329 0.01290 0.00161 0.00138 0.00519 0.00106 0.00447 0.04489 0.00306 0.00101 0.00980 0.00046 0.00096 0.00385 0.00036 0.00263 0.07955 0.00154 0.00472 0.00490 0.00166 0.00046 0.00049 0.00015 0.00121 0.00242 0.00055 0.00584 0.00738 0.00205 0.00298 0.00293 0.00153 0.00107 0.00379 0.00060 0.00436 0.00967 0.00209 0.00284 0.00428 0.00154
0.00673 0.01469 0.00331 0.00423 0.00629 0.00231 0.00233 0.01173 0.00118 0.00712 0.01618 0.00369 0.00338 0.00485 0.00171 0.00295 0.02009 0.00155 0.00500 0.01709 0.00315 0.00240 0.00525 0.00129 0.00359 0.02719 0.00216 0.00351 0.01459 0.00163 0.00127 0.00524 0.00096 0.00397 0.04394 0.00266 0.00105 0.01113 0.00046 0.00086 0.00362 0.00030 0.00236 0.07897 0.00137 0.00447 0.00462 0.00161 0.00055 0.00059 0.00019 0.00133 0.00257 0.00063 0.00563 0.00713 0.00198 0.00313 0.00307 0.00162 0.00109 0.00390 0.00063 0.00470 0.01026 0.00231 0.00296 0.00439 0.00162
0.01974 0.02328 0.00998 0.01099 0.01102 0.00689 0.00708 0.01068 0.00360 0.02147 0.02508 0.01132 0.00743 0.00760 0.00523 0.00909 0.01387 0.00477 0.01218 0.01980 0.00767 0.00569 0.00722 0.00314 0.00874 0.01400 0.00525 0.00664 0.00971 0.00309 0.00241 0.00408 0.00181 0.00750 0.01255 0.00504 0.00152 0.00218 0.00067 0.00122 0.00151 0.00043 0.00352 0.00556 0.00204 0.01118 0.01125 0.00739 0.00109 0.00109 0.00085 0.00412 0.00418 0.00287 0.01517 0.01514 0.00795 0.00736 0.00739 0.00635 0.00439 0.00546 0.00256 0.01574 0.01610 0.00997 0.00860 0.00864 0.00691
0.76762 0.30235 0.86131 0.65811 0.36609 0.84452 1.05626 0.17422 1.03599 0.94154 0.30256 0.90867 0.68519 0.37603 0.82376 1.15896 0.15483 0.99683 1.12369 0.25177 1.51469 0.85870 0.31595 1.01609 1.34628 0.14538 1.06606 0.98363 0.18882 1.03367 0.91913 0.16300 1.31344 1.33042 0.08176 1.30882 1.00451 0.09004 0.71508 1.16099 0.06492 0.47398 1.15195 0.01812 0.60937 0.47793 0.43534 0.65159 0.45468 0.50438 1.02860 0.58419 0.37835 0.63007 0.60437 0.39587 0.60085 0.45463 0.50915 0.92126 0.81562 0.31090 0.69130 0.60235 0.33474 0.67857 0.48047 0.41088 0.74382
Table 7 (continued). The lines below list the columns sequentially for rows 88–156, in the order: N; BIKES NUMBER; AGE; EDU; SPORT; θ_ijkl; θ^L_ijkl; θ^U_ijkl; θ̄_ijkl; θ^CIA_ijkl; UM; RMMSE^a.
N: 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
35-44 35-44 35-44 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+ 18-24 18-24 18-24 18-24 18-24 18-24 18-24 18-24 18-24 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44 45-54 45-54 45-54
less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school
regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally
0.00181 0.00923 0.00102 0.00444 0.01048 0.00269 0.00231 0.00280 0.00108 0.00237 0.01571 0.00143 0.00345 0.01028 0.00187 0.00175 0.00280 0.00061 0.00210 0.01553 0.00166 0.00155 0.00616 0.00096 0.00064 0.00210 0.00041 0.00196 0.01834 0.00140 0.00038 0.00309 0.00018 0.00020 0.00093 0.00000 0.00114 0.02234 0.00064 0.00537 0.00336 0.00222 0.00058 0.00058 0.00026 0.00184 0.00178 0.00073 0.00526 0.00549 0.00239 0.00327 0.00237 0.00172 0.00093 0.00245 0.00044 0.00400 0.00882 0.00242 0.00301 0.00406 0.00158 0.00152 0.00651 0.00082 0.00523 0.01031 0.00237
0.00000 0.00145 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00493 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00459 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00766 0.00000 0.00000 0.00136 0.00000 0.00000 0.00000 0.00000 0.00000 0.01717 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00708 0.01213 0.00360 0.01712 0.01718 0.01116 0.00631 0.00634 0.00513 0.00909 0.01880 0.00477 0.01176 0.01515 0.00765 0.00519 0.00545 0.00314 0.00874 0.01858 0.00525 0.00653 0.00855 0.00308 0.00228 0.00238 0.00179 0.00750 0.02020 0.00504 0.00152 0.00355 0.00067 0.00078 0.00079 0.00042 0.00352 0.02274 0.00204 0.01093 0.01101 0.00751 0.00115 0.00115 0.00086 0.00452 0.00459 0.00297 0.01272 0.01272 0.00799 0.00648 0.00647 0.00601 0.00393 0.00414 0.00252 0.01518 0.01542 0.01002 0.00878 0.00894 0.00681 0.00690 0.00819 0.00359 0.01644 0.01657 0.01106
0.00187 0.00931 0.00095 0.00446 0.01049 0.00224 0.00216 0.00314 0.00106 0.00225 0.01537 0.00118 0.00298 0.01029 0.00188 0.00146 0.00321 0.00079 0.00200 0.01540 0.00119 0.00150 0.00637 0.00068 0.00038 0.00171 0.00028 0.00152 0.01767 0.00100 0.00028 0.00314 0.00012 0.00013 0.00061 0.00004 0.00058 0.02182 0.00033 0.00464 0.00479 0.00162 0.00048 0.00052 0.00016 0.00135 0.00263 0.00062 0.00487 0.00625 0.00166 0.00261 0.00257 0.00131 0.00077 0.00294 0.00043 0.00414 0.00932 0.00197 0.00295 0.00441 0.00160 0.00117 0.00644 0.00057 0.00431 0.01010 0.00216
0.00163 0.00819 0.00083 0.00488 0.01111 0.00253 0.00232 0.00333 0.00117 0.00203 0.01379 0.00106 0.00292 0.00999 0.00184 0.00140 0.00307 0.00076 0.00210 0.01590 0.00126 0.00141 0.00585 0.00065 0.00051 0.00210 0.00038 0.00159 0.01761 0.00107 0.00028 0.00301 0.00013 0.00023 0.00098 0.00008 0.00064 0.02134 0.00037 0.00453 0.00469 0.00163 0.00056 0.00060 0.00020 0.00135 0.00261 0.00064 0.00468 0.00593 0.00165 0.00260 0.00255 0.00135 0.00091 0.00324 0.00053 0.00416 0.00907 0.00204 0.00262 0.00389 0.00143 0.00144 0.00724 0.00073 0.00445 0.01012 0.00231
0.00708 0.01068 0.00360 0.01712 0.01718 0.01116 0.00631 0.00634 0.00513 0.00909 0.01387 0.00477 0.01176 0.01515 0.00765 0.00519 0.00545 0.00314 0.00874 0.01399 0.00525 0.00653 0.00855 0.00308 0.00228 0.00238 0.00179 0.00750 0.01254 0.00504 0.00152 0.00219 0.00067 0.00078 0.00079 0.00042 0.00352 0.00557 0.00204 0.01093 0.01101 0.00751 0.00115 0.00115 0.00086 0.00452 0.00459 0.00297 0.01272 0.01272 0.00799 0.00648 0.00647 0.00601 0.00393 0.00414 0.00252 0.01518 0.01542 0.01002 0.00878 0.00894 0.00681 0.00690 0.00819 0.00359 0.01644 0.01657 0.01106
0.81908 0.21703 0.79862 0.65619 0.34148 0.67419 0.51684 0.48375 0.69574 0.83621 0.17615 0.79090 0.63658 0.28551 0.77950 0.53273 0.41632 0.94309 0.89398 0.17399 0.76260 0.79965 0.26091 0.71424 0.67227 0.29913 0.73383 0.86709 0.13952 0.85249 0.88521 0.13848 0.90728 0.80066 0.38751 0.86461 0.05788 0.89158 0.43158 0.78622 0.58394 0.44763 0.43339 0.61559 0.52047 0.72260 0.66248 0.49280 0.51174 0.60730 0.44234 0.54846 0.56358 0.71290 0.40109 0.92392 0.66977 0.37051 0.66605 0.52952 0.45062 0.67490 0.77125 0.23102 0.81162 0.57120 0.33960 0.73332
(continued on next page)
Table 7 (continued). The lines below list the columns sequentially for rows 157–225, in the order: N; BIKES NUMBER; AGE; EDU; SPORT; θ_ijkl; θ^L_ijkl; θ^U_ijkl; θ̄_ijkl; θ^CIA_ijkl; UM; RMMSE^a.
N: 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+
45-54 45-54 45-54 45-54 45-54 45-54 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+ 18-24 18-24 18-24 18-24 18-24 18-24 18-24 18-24 18-24 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54
bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school
regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally
0.00263 0.00321 0.00105 0.00239 0.01066 0.00128 0.00362 0.00858 0.00204 0.00166 0.00251 0.00082 0.00210 0.01104 0.00166 0.00184 0.00502 0.00088 0.00055 0.00126 0.00050 0.00225 0.01267 0.00134 0.00032 0.00152 0.00035 0.00023 0.00032 0.00012 0.00076 0.01028 0.00029 0.00654 0.00420 0.00219 0.00111 0.00041 0.00018 0.00210 0.00181 0.00070 0.00388 0.00429 0.00108 0.00274 0.00181 0.00126 0.00035 0.00169 0.00035 0.00672 0.00990 0.00371 0.00365 0.00488 0.00199 0.00166 0.00453 0.00093 0.00835 0.01136 0.00435 0.00403 0.00406 0.00181 0.00266 0.00829 0.00140
0.00000 0.00000 0.00000 0.00000 0.00106 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00114 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00357 0.00000 0.00000 0.00030 0.00000 0.00000 0.00000 0.00000 0.00000 0.00605 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00005 0.00000
0.00697 0.00694 0.00518 0.00908 0.01490 0.00477 0.01190 0.01397 0.00759 0.00471 0.00482 0.00311 0.00873 0.01512 0.00525 0.00658 0.00797 0.00308 0.00207 0.00225 0.00175 0.00750 0.01611 0.00504 0.00152 0.00248 0.00067 0.00056 0.00056 0.00042 0.00352 0.01161 0.00204 0.01364 0.01361 0.00748 0.00199 0.00199 0.00089 0.00480 0.00489 0.00293 0.00936 0.00938 0.00759 0.00512 0.00512 0.00500 0.00237 0.00240 0.00227 0.01926 0.02016 0.00997 0.00987 0.00990 0.00684 0.00660 0.00747 0.00357 0.02147 0.02396 0.01124 0.00931 0.00955 0.00524 0.00898 0.01291 0.00477
0.00237 0.00342 0.00118 0.00171 0.01231 0.00087 0.00273 0.00954 0.00171 0.00128 0.00285 0.00068 0.00156 0.01264 0.00091 0.00140 0.00594 0.00063 0.00036 0.00163 0.00026 0.00115 0.01421 0.00075 0.00018 0.00222 0.00007 0.00009 0.00045 0.00002 0.00023 0.01125 0.00013 0.00571 0.00590 0.00210 0.00083 0.00088 0.00030 0.00143 0.00280 0.00067 0.00360 0.00466 0.00113 0.00209 0.00204 0.00101 0.00041 0.00178 0.00021 0.00552 0.01191 0.00273 0.00330 0.00489 0.00181 0.00106 0.00590 0.00051 0.00639 0.01426 0.00336 0.00326 0.00461 0.00170 0.00143 0.01076 0.00072
0.00212 0.00303 0.00107 0.00185 0.01257 0.00097 0.00253 0.00866 0.00160 0.00122 0.00266 0.00066 0.00182 0.01378 0.00109 0.00119 0.00495 0.00055 0.00043 0.00178 0.00032 0.00135 0.01491 0.00090 0.00015 0.00163 0.00007 0.00013 0.00053 0.00004 0.00035 0.01158 0.00020 0.00557 0.00576 0.00201 0.00069 0.00074 0.00024 0.00166 0.00320 0.00079 0.00339 0.00429 0.00119 0.00188 0.00185 0.00098 0.00066 0.00235 0.00038 0.00481 0.01050 0.00236 0.00303 0.00450 0.00165 0.00166 0.00839 0.00085 0.00542 0.01234 0.00281 0.00258 0.00370 0.00130 0.00225 0.01531 0.00118
0.00697 0.00694 0.00518 0.00908 0.01384 0.00477 0.01190 0.01397 0.00759 0.00471 0.00482 0.00311 0.00873 0.01398 0.00525 0.00658 0.00797 0.00308 0.00207 0.00225 0.00175 0.00750 0.01254 0.00504 0.00152 0.00218 0.00067 0.00056 0.00056 0.00042 0.00352 0.00556 0.00204 0.01364 0.01361 0.00748 0.00199 0.00199 0.00089 0.00480 0.00489 0.00293 0.00936 0.00938 0.00759 0.00512 0.00512 0.00500 0.00237 0.00240 0.00227 0.01926 0.02016 0.00997 0.00987 0.00990 0.00684 0.00660 0.00747 0.00357 0.02147 0.02396 0.01124 0.00931 0.00955 0.00524 0.00898 0.01286 0.00477
0.49288 0.44409 0.77273 0.76351 0.27035 0.80448 0.61053 0.34084 0.69458 0.53839 0.41962 0.64511 0.82173 0.25926 0.76680 0.68629 0.35415 0.74199 0.68896 0.47842 0.70393 0.79618 0.20744 0.84630 0.89156 0.51268 0.85415 0.76510 0.54703 0.86953 0.96099 0.11879 1.16258 0.41234 0.73520 0.61247 0.40986 1.46188 1.24778 0.51895 0.78143 0.70603 0.51612 0.49460 0.89603 0.45710 0.59884 0.61125 1.14041 0.30759 0.80768 0.51585 0.44751 0.57042 0.48215 0.39738 0.58326 0.72646 0.43366 0.76386 0.50243 0.47770 0.56021 0.43315 0.45265 0.54396 0.73867 0.39272 0.77703
Table 7 (continued). The lines below list the columns sequentially for rows 226–252, in the order: N; BIKES NUMBER; AGE; EDU; SPORT; θ_ijkl; θ^L_ijkl; θ^U_ijkl; θ̄_ijkl; θ^CIA_ijkl; UM; RMMSE^a.
N: 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+ 3+
55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+ 75+
high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school high school high school high school bachelor’s degree or higher bachelor’s degree or higher bachelor’s degree or higher less than high school less than high school less than high school
regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally regularly no occasionally
0.00274 0.00564 0.00172 0.00155 0.00199 0.00085 0.00166 0.00517 0.00073 0.00079 0.00196 0.00032 0.00053 0.00082 0.00029 0.00105 0.00505 0.00047 0.00018 0.00061 0.00006 0.00018 0.00015 0.00003 0.00029 0.00508 0.00012
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00011 0.00000
0.00985 0.00999 0.00746 0.00421 0.00428 0.00310 0.00780 0.00860 0.00507 0.00296 0.00299 0.00243 0.00184 0.00188 0.00168 0.00645 0.00690 0.00480 0.00087 0.00090 0.00064 0.00022 0.00022 0.00022 0.00350 0.00541 0.00204
0.00186 0.00701 0.00112 0.00112 0.00257 0.00059 0.00076 0.00742 0.00042 0.00044 0.00239 0.00016 0.00029 0.00138 0.00021 0.00037 0.00631 0.00022 0.00005 0.00084 0.00002 0.00003 0.00019 0.00001 0.00008 0.00530 0.00004
0.00172 0.00588 0.00108 0.00083 0.00181 0.00044 0.00124 0.00936 0.00074 0.00053 0.00220 0.00025 0.00019 0.00079 0.00014 0.00060 0.00663 0.00040 0.00007 0.00073 0.00003 0.00006 0.00024 0.00002 0.00015 0.00516 0.00009
0.00985 0.00999 0.00746 0.00421 0.00428 0.00310 0.00780 0.00860 0.00507 0.00296 0.00299 0.00243 0.00184 0.00188 0.00168 0.00645 0.00690 0.00480 0.00087 0.00090 0.00064 0.00022 0.00022 0.00022 0.00350 0.00530 0.00204
0.64643 0.44256 0.70406 0.54964 0.54291 0.62805 0.81278 0.51626 0.95875 0.75684 0.37886 0.89941 0.69584 0.83941 0.80357 0.89784 0.31238 1.07931 0.90376 0.41542 1.01056 0.85665 0.42992 0.93344 1.14508 0.08476 1.45433
a The sign “-” indicates true cell probabilities equal to 0.
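The quantities tabulated above can be illustrated with a minimal sketch. Under the conditional independence assumption (CIA) the joint probability of variables never jointly observed is estimated through the common variables, while Fréchet-type bounds conditional on the common variables delimit the uncertainty range for each cell; the CIA point estimate always falls inside those bounds, consistent with the tabulated θ^CIA_ijkl lying between θ^L_ijkl and θ^U_ijkl. The example below uses made-up probability tables and generic variables X, Y, Z; it is not the paper's estimator for the four-way BIKES NUMBER × AGE × EDU × SPORT table.

```python
import numpy as np

# Hypothetical setting: X and Y are never jointly observed; Z is observed
# with both.  Under the CIA, theta_xy = sum_z P(x|z) P(y|z) P(z), and
# Frechet bounds conditional on Z give the uncertainty interval.

p_z = np.array([0.6, 0.4])                  # P(Z)
p_x_given_z = np.array([[0.7, 0.3],         # P(X | Z): rows index z, cols x
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.5, 0.5],         # P(Y | Z): rows index z, cols y
                        [0.9, 0.1]])

def cia_estimate(i, j):
    """CIA point estimate of P(X = i, Y = j)."""
    return float(np.sum(p_z * p_x_given_z[:, i] * p_y_given_z[:, j]))

def frechet_bounds(i, j):
    """Lower/upper bounds of P(X = i, Y = j), conditioning on Z."""
    px, py = p_x_given_z[:, i], p_y_given_z[:, j]
    low = float(np.sum(p_z * np.maximum(0.0, px + py - 1.0)))
    up = float(np.sum(p_z * np.minimum(px, py)))
    return low, up

theta = cia_estimate(0, 0)        # 0.6*0.7*0.5 + 0.4*0.2*0.9 = 0.282
low, up = frechet_bounds(0, 0)    # (0.16, 0.38)
assert low <= theta <= up         # the CIA estimate lies inside the bounds
```

In the multivariate categorical case treated in the paper, the same idea is applied cell by cell after the Bayesian-network factorization has reduced the joint distribution to lower-dimensional components.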