Cultural consensus theory for multiple consensus truths


Journal of Mathematical Psychology 56 (2012) 452–469

Contents lists available at SciVerse ScienceDirect

Journal of Mathematical Psychology journal homepage: www.elsevier.com/locate/jmp

R. Anders ∗, W.H. Batchelder

Department of Cognitive Sciences, University of California Irvine, United States

Article info

Article history: Received 25 October 2012; received in revised form 22 January 2013; available online 1 March 2013

Keywords: Cultural consensus theory; Latent class models; Latent trait models; Signal detection theory

Abstract

Cultural Consensus Theory (CCT) is a popular information pooling methodology used in the social and behavioral sciences. CCT consists of cognitive models designed to determine a consensus truth shared by a group of informants (respondents), and to better understand the cognitive characteristics of the informants (e.g. level of knowledge, response biases). However, prior to this paper, no CCT models had been developed that allow the informant responses to come from a mixture of two or more consensus answer patterns. The major advance in the current paper is to endow one of the popular CCT models, the General Condorcet Model (GCM) for dichotomous responses, with the possibility of having several latent consensus answer patterns, each corresponding to a different, latent subgroup of informants. In addition, we augment the model to allow the possibility of questions having differential difficulty (cultural saliency). This is the first CCT finite-mixture model, and it is named the Multi-Culture GCM (MC-GCM). The model is developed axiomatically, and a notable property is derived that can suggest the appropriate number of mixtures for a given data set. The model is extended in a hierarchical Bayesian framework, and its application is successfully demonstrated on both simulated and real data, including a new experimental data set on political viewpoints. Published by Elsevier Inc.

1. Introduction

Since its inception in the mid 1980s, Cultural Consensus Theory (CCT) has become a popular methodology in the social and behavioral sciences, especially in cultural anthropology (e.g., Romney & Batchelder, 1999; Weller, 2007). CCT is an approach to information pooling (aggregation, data fusion), and the primary goal of CCT is to estimate consensus knowledge or beliefs that are shared by a group of informants. The data relevant to CCT have a standard item-response structure, consisting of the responses from each informant to a series of relevant questions chosen by the researcher. CCT consists of a set of models, each of which corresponds to a particular questionnaire format (e.g., true/false, multiple choice, ordered categories, continuous responses). The most popular CCT model is known as the General Condorcet Model (GCM), and it is applicable to dichotomous, true/false questions (Batchelder & Anders, 2012; Batchelder & Romney, 1986, 1988, 1989; Karabatsos & Batchelder, 2003; Oravecz, Vandekerckhove, & Batchelder, in press). The structure of the GCM is that of a standard signal detection model with dichotomous correct answers (signal/noise; true/false) and latent hit and false alarm rates, which are heterogeneous across informants. However, unlike most applications of signal detection models, it is assumed that the researcher does not know the culturally ‘correct answers’ to the questions a priori.¹ Instead the GCM treats these answers as latent variables, which are estimated from the informants’ response data.

While the GCM has been successfully applied to a number of data sets by a variety of researchers (see summary in Weller, 2007), the model has been limited by the assumption that all informants share the same consensus answers to the questions asked. In particular, there have been applications of CCT methodology to various data sets that suggest a single, shared consensus truth may not characterize the informants’ responses (e.g., Hruschka, Kalim, Edmonds, & Sibley, 2008; Romney, 1999). Until now, there have been no operative CCT models that allow the informant responses to come from a mixture of two or more consensus answer patterns. Batchelder and Romney (1989) discussed the possibility of a model that allowed multiple consensus answer patterns; however, they did not develop its properties, nor did they estimate the model. Apart from CCT, Mueller and Veinott (2008) attempted to remedy the problem of mixed consensus answer patterns with a single-parameter, binomial-probability model that was applied by a finite mixture modeling algorithm; the model did not include any parameters to recover the cognitive characteristics of the informants (e.g. level of knowledge, response biases), or the latent consensus answers of each culture.

The major advance in the current paper is to endow the GCM with the possibility of having several latent answer keys, each corresponding to a different, latent subgroup of informants. The proposed model recovers not only these different answer keys, but also the respondents’ cognitive characteristics. In addition, we augment the new model to allow the possibility of the questions having differential difficulty (cultural saliency), using techniques from psychometric test theory presented earlier for the GCM in Batchelder and Romney (1988) and Karabatsos and Batchelder (2003). We name the current model the Multi-Culture GCM (MC-GCM), because unlike the GCM, it postulates that the response data result from a mixture of GCM models, where each component corresponds to a latent subgroup of informants who share cultural knowledge.

This paper builds on new results in Batchelder and Anders (2012), in which novel, testable mathematical properties of the GCM were developed as well as hierarchical Bayesian inference for the model, which can be conducted using JAGS (Just Another Gibbs Sampler, Plummer, 2003): a freely available software package for Bayesian inference that uses Markov chain Monte Carlo samplers, including Gibbs and slice samplers. In addition, several posterior predictive model checks were developed, the most important of which assesses the adequacy of the central GCM assumption of a single underlying consensus answer pattern. As the first CCT paper to use hierarchical inference, it showed with both real and simulated data that the GCM, endowed with hierarchical Bayesian assumptions, was able to estimate consensus knowledge with small groups of informants much more efficiently than algorithms based on descriptive statistics, such as the majority rule.

∗ Corresponding author. E-mail addresses: [email protected] (R. Anders), [email protected] (W.H. Batchelder). 0022-2496/$ – see front matter. Published by Elsevier Inc. doi:10.1016/j.jmp.2013.01.004

¹ In CCT applications, the culturally correct answers need not be the empirically correct answers.
This paper retains the approach in Batchelder and Anders (2012) by providing hierarchical Bayesian inference for the MCGCM, along with several mathematical properties of the new model that are a consequence of its mixture assumptions. The key result of the multi-culture extension of the GCM is a theorem about the structure of the informant-by-informant correlations in their responses over items. The addition of multiple cultures has a derivable effect on the structure of the scree plot obtained from a minimum residual factor analysis of the informant-by-informant correlation matrix, and this derived result enables one to apply related diagnostics to assess the number of consensus components in the response data. The derivation also leads to the construction of a useful Bayesian posterior predictive test to determine if the number of cultures used by the MC-GCM appropriately satisfies the underlying latent consensus answer patterns in the response data. The MC-GCM is applied to simulated data, and it is shown that it is able to estimate multiple consensus answer patterns along with latent consensus membership parameters, which assign each informant to a particular consensus component of the model. Then the model is applied successfully to two published data sets that have been known to fail the one-culture assumption of the GCM. Finally, it is applied to new experimental data involving two cultures. The paper consists of five sections. After the Introduction, Section 2 provides axioms that specify the MC-GCM. In addition, mathematical consequences of the multi-culture assumption are developed and compared to the special case of the GCM with a single culture. Finally in Section 2.3, hyperdistributions and hyperpriors are developed for hierarchical Bayesian inference with the model. In Section 3, information is provided to appropriately fit the MC-GCM to data in a hierarchical Bayesian framework, taking into account that it is a finite mixture model. 
In addition, this section develops the posterior predictive checks for the model that assess model fit on the basis of culture number and item difficulty. Section 4 demonstrates the model and its results on both simulated and real data sets. With simulated data in Section 4.1, it is shown that the model is able to correctly recover the number of underlying consensus components in the data as well as provide accurate estimates of the other parameters of the model. Then three sets of real data are analyzed with the model in Section 4.2, and new results for two published data sets are discovered. Finally, a data set from a new experiment involving shared political beliefs is analyzed, and the model successfully extracts two distinct political viewpoints. Section 5 contains the general discussion.

2. The MC-GCM

2.1. Specification of the model

Assume that each of $N$ informants answers ‘true’ or ‘false’ to each of a set of $M$ items, and let the responses be the realization of a random response profile matrix $X = (X_{ik})_{N \times M}$, where

\[
X_{ik} =
\begin{cases}
1 & \text{if } i \text{ responds true to item } k \\
0 & \text{if } i \text{ responds false to item } k.
\end{cases}
\tag{1}
\]

The MC-GCM is defined by five axioms that represent a generalization of the GCM in Batchelder and Anders (2012) in two important ways. First, the MC-GCM generalizes the assumption that all of the informants share the same answer key, and second, the model allows items to have heterogeneous difficulty (differential cultural salience). To see how this works, consider the GCM in Batchelder and Anders (2012). In essence the model is a general signal detection model except that the correct answers (signal/noise, true/false) are latent rather than known to the experimenter. Thus the data in (1), whose structure is shared by the GCM, cannot be scored for corrects and errors, and instead it is assumed that a latent answer key $Z = (Z_k)_{1 \times M}$ is shared by all informants, where

\[
\forall k, \quad
Z_k =
\begin{cases}
1 & \text{if item } k \text{ is true for the culture} \\
0 & \text{if item } k \text{ is false for the culture.}
\end{cases}
\]

Each informant is assumed to have a hit rate $H_i$ and a false alarm rate $F_i$, subject to $0 \leq F_i \leq H_i \leq 1$ in order to identify the model. Then the GCM makes the assumption that

\[
\Pr(X_{ik} = 1 \mid H_i, F_i, Z_k) =
\begin{cases}
H_i & \text{if } Z_k = 1 \\
F_i & \text{if } Z_k = 0.
\end{cases}
\]

In most applications of the GCM, the hit and false alarm rates are specified in terms of the so-called double high threshold model (DHT) (Macmillan & Creelman, 2005), which is also a property of the MC-GCM as revealed in (4). By comparison, the MC-GCM is the first CCT model to allow multiple answer keys, where the data in (1) are assumed to come from a mixture of GCM models, each characterized by a different answer key. More specifically, the model assumes that there are $T \geq 1$ answer keys, $Z = \{Z_1, \ldots, Z_T\}$, rather than the single answer key $Z$ of the GCM. Each informant is assumed to share exactly one of the answer keys with a subgroup of the informants, and the model specifies the latent group membership parameters as $E = (e_i)_{1 \times N}$, with $e_i \in \{1, \ldots, T\}$, where $Z_{e_i}$ is the answer key for informant $i$. Similar to the GCM, the MC-GCM specifies hit-rate and false-alarm-rate parameters (see Axiom 3G) for each informant, denoted respectively by $H = (H_i)_{1 \times N}$ and $F = (F_i)_{1 \times N}$, subject to $\forall i$, $0 \leq F_i \leq H_i \leq 1$. Similar to the design of a classical double high threshold model, the hit- and false-alarm-rate parameters are reparameterized into two parameters: the informant probabilities of knowing an answer, $D_i$, and the informant probabilities for guessing a true (‘1’) response, $g_i$, which are both in $[0, 1]$, as in Axiom 4G. Finally, as will be explained in detail later, the MC-GCM allows items to have differential difficulty.


Axiom 1G (Cultural Truths). There is a collection of answer keys, $Z = \{Z_1, \ldots, Z_T\}$, for some $T \geq 1$, and each informant’s answer key, $Z_{e_i}$, is specified by informant membership parameters $E = (e_i)_{1 \times N}$, where $\forall i$, $e_i \in \{1, \ldots, T\}$.

Axiom 2G (Conditional Independence). The response profile matrix satisfies conditional independence given by

\[
\Pr[X = (x_{ik})_{N \times M} \mid Z, H, F, E] = \prod_{i=1}^{N} \prod_{k=1}^{M} \Pr(X_{ik} = x_{ik} \mid Z_{e_i k}, H_i, F_i)
\tag{2}
\]

for all possible realizations $(x_{ik})$ of the response profile matrix.

Axiom 3G (Marginal Response Probabilities). The marginal probabilities in (2) are given by

\[
\Pr(X_{ik} = x_{ik} \mid Z_{e_i k}, H_i, F_i) =
\begin{cases}
H_i & \text{if } x_{ik} = 1 \text{ and } Z_{e_i k} = 1 \\
F_i & \text{if } x_{ik} = 1 \text{ and } Z_{e_i k} = 0 \\
(1 - H_i) & \text{if } x_{ik} = 0 \text{ and } Z_{e_i k} = 1 \\
(1 - F_i) & \text{if } x_{ik} = 0 \text{ and } Z_{e_i k} = 0.
\end{cases}
\tag{3}
\]

Axiom 4G (Double High Threshold). Hit- and false-alarm-rate parameters for each informant are reparameterized by

\[
\forall i, \quad H_i = D_i + (1 - D_i) g_i, \qquad F_i = (1 - D_i) g_i.
\tag{4}
\]

Axioms 1G–4G are similar to Axioms 1G–4G of the GCM in Batchelder and Anders (2012) except for the addition of multiple answer keys. In particular, Axiom 1G sets the MC-GCM as a finite-mixture model with the assumption that a collection of answer keys pertains to the informants rather than a single answer key, and that each informant belongs to one answer key, $Z_{e_i} = (Z_{e_i k})_{1 \times M}$, determined by the informant’s membership parameter, $e_i$. The second axiom of conditional independence is typical of the assumptions made in item-response theory (e.g., Embretson & Reise, 2000), as well as by other models that have parameters for respondents and items. Axiom 3G specifies the marginal distributions of (2) in terms of hit and false alarm rates, as usually defined in signal detection models (e.g., Macmillan & Creelman, 2005); Batchelder and Romney (1988) showed that in order to identify the GCM, the informants’ false alarm rates must not exceed their hit rates, and consequently we adopt that restriction here. The four limbs in (3) refer respectively to hits, false alarms, misses, and correct rejections relative to the informant’s answer key membership. Axiom 4G reparameterizes the hit and false alarm rates into a DHT design with two parameters, $D = (D_i)_{1 \times N}$ and $G = (g_i)_{1 \times N}$, where $D_i$ is the informant’s probability of knowing any given item’s answer and responding accordingly for his or her culture, while $g_i$ is the probability of guessing true (‘1’) if the informant does not know the correct answer, which occurs with probability $1 - D_i$.

Next, Axiom 5G introduces heterogeneous item difficulty for the MC-GCM. While the GCM defined in Batchelder and Anders (2012) was analyzed under the case of homogeneous item difficulty, the possibility of modeling heterogeneous item difficulty for the GCM was first introduced in Batchelder and Romney (1988). The introduction of heterogeneous item difficulty was achieved by indexing the $D_i$ in (4) by both the informant and the item, to obtain $D_{ik} \in [0, 1]$. However, in order to avoid too many parameters, a Rasch model adaptation (Fischer & Molenaar, 1995) is applied to $D_{ik}$, which specifies the $D_{ik}$ in terms of the informant’s ability, $\Theta = (\theta_i)_{1 \times N}$ in $[0, 1]$, and the item’s difficulty, $\Delta = (\delta_k)_{1 \times M}$ in $[0, 1]$. In this way, only $N + M$ parameters are used in the heterogeneous item difficulty case of the model by using $\theta_i$ and $\delta_k$ to calculate $D_{ik}$, rather than the $N \times M$ parameters that would be needed if each $D_{ik}$ were an estimated parameter.

Axiom 5G (Item Difficulty). If item difficulty is assumed heterogeneous, then the probability of knowing an answer to a particular item is

\[
D_{ik} = \frac{\theta_i (1 - \delta_k)}{\theta_i (1 - \delta_k) + \delta_k (1 - \theta_i)},
\tag{5}
\]

where $\delta_k \in [0, 1]$ and $\theta_i \in [0, 1]$. Eq. (5) is a reparameterization of the Rasch model formula, which is usually presented as

\[
D_{ik} = [1 + \exp(-(\alpha_i - \beta_k))]^{-1}, \qquad -\infty < \alpha_i, \beta_k < \infty.
\]

In this form there is an obvious non-identifiability since a positive constant can be added to all of the parameters without changing the values of the $D_{ik}$. If one defines the parameters in (5) by one-to-one transformations in terms of the usual parameters for the Rasch model as follows,

\[
\theta_i = \frac{e^{\alpha_i}}{1 + e^{\alpha_i}}, \qquad \delta_k = \frac{e^{\beta_k}}{1 + e^{\beta_k}}, \qquad 0 < \theta_i, \delta_k < 1,
\]

it is easy to see that plugging these parameters into (5) yields the usual Rasch formula. This establishes that (5) is a reparameterization of the Rasch formula. All reparameterizations of the usual Rasch form have a corresponding non-identifiability. Crowther, Batchelder, and Hu (1995) discuss the nature of the non-identifiability problem for the reparameterization in (5), and we handle this non-identifiability issue with the hierarchical Bayesian settings of the model in Section 2.3.

The idea of using a form of the Rasch model to represent item difficulty has been implemented in the GCM in Karabatsos and Batchelder (2003) and Oravecz et al. (in press); however, in both of these cases, inference was carried out with a Bayesian, fixed-effects formulation. In this paper, Axiom 5G incorporates item difficulty into the MC-GCM, and in Section 2.3, we provide the Bayesian hierarchical specification for the model. Note that in (5), $D_{ik}$ increases as informant ability, $\theta_i$, increases or as item difficulty, $\delta_k$, decreases. Further, in the case that all items are equally difficult, one can set $\forall i, k$; $\delta_k = 1/2$, and in this case $\forall i, k$; $D_{ik} = \theta_i$, which is the form of the model with only Axioms 1G–4G. As with other specifications of the Rasch model, there is no interaction between informant ability and item difficulty, as can be seen by

\[
\operatorname{logit}(D_{ik}) = \ln \frac{\theta_i}{1 - \theta_i} - \ln \frac{\delta_k}{1 - \delta_k}.
\]
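The equivalence between (5) and the usual Rasch form can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names and the numeric values are illustrative assumptions, and only the Python standard library is used.

```python
import math


def knowledge_prob(theta: float, delta: float) -> float:
    """D_ik from Eq. (5); theta is informant ability, delta is item difficulty."""
    return theta * (1 - delta) / (theta * (1 - delta) + delta * (1 - theta))


def rasch_prob(alpha: float, beta: float) -> float:
    """Usual Rasch form: D_ik = 1 / (1 + exp(-(alpha - beta)))."""
    return 1.0 / (1.0 + math.exp(-(alpha - beta)))


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def hit_false_alarm(d: float, g: float) -> tuple[float, float]:
    """Hit and false alarm rates from Eq. (4), with D_ik used in place of D_i."""
    return d + (1 - d) * g, (1 - d) * g


# Illustrative (assumed) parameter values, strictly inside (0, 1).
theta, delta = 0.8, 0.3
alpha, beta = logit(theta), logit(delta)  # inverse of the one-to-one transformations
d = knowledge_prob(theta, delta)

print(d, rasch_prob(alpha, beta))        # the two forms agree
print(logit(d), alpha - beta)            # additive on the logit scale
print(knowledge_prob(theta, 0.5))        # delta = 1/2 gives D_ik = theta
```

With $\delta_k = 1/2$ the last line returns the ability itself, which matches the homogeneous-difficulty case described above.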

The MC-GCM defined by Axioms 1G–5G has 3N parameters for the informants, which include the competence, bias, and membership parameters, and T · M + M item parameters for the cultural truth sets if item difficulty is heterogeneous, or T · M if item difficulty is homogeneous, in which case δ = (0.5)1×M . The response profile matrix in (1) has N · M data points, so as long as Min{N , M } ≥ T + 4, there is not a greater number of model parameters than data points. Fig. 1 illustrates a processing tree of how the MC-GCM functions as a classical double high-threshold signal detection model with a latent signal, Z , but in addition to the item index, k, Z is also specialized by the culture to which each informant belongs to, which is represented by ei . Depending on the answer key for an informant given his or her culture, Zei k , with probability Dik the informant knows the correct answer and responds Xik = Zei k , but with probability (1 − Dik ) the informant does not know the correct answer and guesses ‘true’ with probability gi , or guesses ‘false’ with probability 1 − gi .
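The processing tree just described translates directly into a data simulator. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `simulate_mcgcm` and its argument names are hypothetical, cultures are indexed from 0 instead of 1, and no hierarchical priors are involved (all parameters are passed in as fixed values).

```python
import random


def simulate_mcgcm(keys, membership, theta, delta, g, rng):
    """Simulate an N x M response matrix X from the processing tree.

    keys[t][k]    : answer Z_tk in {0, 1} for culture t and item k
    membership[i] : culture index e_i of informant i (0-based here, an assumption)
    theta[i]      : informant ability; delta[k]: item difficulty; g[i]: guessing bias
    """
    X = []
    for i, e_i in enumerate(membership):
        row = []
        for k, d_k in enumerate(delta):
            # Eq. (5): probability that informant i knows the answer to item k.
            D_ik = theta[i] * (1 - d_k) / (theta[i] * (1 - d_k) + d_k * (1 - theta[i]))
            if rng.random() < D_ik:
                row.append(keys[e_i][k])  # knows: respond with own culture's answer
            else:
                row.append(1 if rng.random() < g[i] else 0)  # guess 'true' w.p. g_i
        X.append(row)
    return X


keys = [[1, 0, 1, 1, 0, 0], [0, 0, 1, 0, 1, 1]]  # T = 2 hypothetical answer keys
membership = [0, 0, 1, 1]                        # N = 4 informants

# With theta = 1 every informant knows every item, so each row reproduces a key.
X = simulate_mcgcm(keys, membership, [1.0] * 4, [0.5] * 6, [0.5] * 4, random.Random(1))
print(X)
```

Lowering the `theta` values below 1 introduces guessing, so rows then deviate from the answer keys at a rate governed by $1 - D_{ik}$.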

Fig. 1. Illustration of the MC-GCM with heterogeneous item difficulty.

The likelihood function for the model with T > 1 defined by Axioms 1G–5G is provided by

\[
\begin{aligned}
L[X = (x_{ik})_{N \times M} \mid Z, \Theta, \Delta, G, E]
= \prod_{i=1}^{N} \prod_{k=1}^{M} \;
& [D_{ik} + (1 - D_{ik}) g_i]^{x_{ik} Z_{e_i k}}
\cdot [(1 - D_{ik}) g_i]^{x_{ik} (1 - Z_{e_i k})} \\
& \cdot [(1 - D_{ik})(1 - g_i)]^{(1 - x_{ik}) Z_{e_i k}}
\cdot [D_{ik} + (1 - D_{ik})(1 - g_i)]^{(1 - x_{ik})(1 - Z_{e_i k})}.
\end{aligned}
\tag{6}
\]

Eq. (6) has four terms in brackets that correspond to the four limbs of (3) by the DHT model. By observing that the terms in the four exponent positions in (6) are dichotomous 1–0 variables, one can easily see that for every combination of question and informant response, exactly one of the exponents is one and the other three are zero.

2.2. Properties of the model

Two important properties of the model that are instrumental in its proper application are included below in Sections 2.2.1 and 2.2.2. These properties are used to define statistics that are important for ascertaining the number of separate answer keys needed to fit the data in (1), and they can also be used to assess whether or not heterogeneity in item difficulty as specified in Axiom 5G is necessary. We will apply the model using a Bayesian hierarchical framework, and we will show in Section 3.2 how to use these statistics as posterior predictive model checks. After the following properties are discussed and related to posterior predictive checks, the Bayesian specification of the model is included in Section 2.3.

2.2.1. Spearman’s law property

Batchelder and Anders (2012) analyzed the properties of the GCM, which is a special case of the MC-GCM defined by Axioms 1G–5G. In particular, the GCM is defined by the first three axioms, under the restriction of a single consensus truth ($T = 1$), and often the DHT assumption in Axiom 4G is added when the GCM is applied to data. An important property of the GCM is that the correlation between the responses of two informants over items takes a simple form. To display this, we begin by defining $K$ to be a random variable that selects a random item index,

\[
\forall k = 1, \ldots, M; \quad \Pr(K = k) = 1/M.
\tag{7}
\]

Recall that the correlation between two random variables, $\rho(X, Y)$, is given by

\[
\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \frac{E(XY) - E(X)E(Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.
\tag{8}
\]

Then the property derived in Observation 1 of Batchelder and Anders (2012) is that for all distinct pairs of informants, $i$ and $j$,

\[
\rho(X_{iK}, X_{jK}) = \rho(X_{iK}, Z_K)\,\rho(X_{jK}, Z_K),
\tag{9}
\]

where for example, $\rho(X_{iK}, Z_K)$ is the correlation between informant $i$’s responses and the latent answer key, $Z$. Eq. (9) leads to a consequence that for all distinct $i, j, h, l$

\[
\rho(X_{iK}, X_{jK})\,\rho(X_{hK}, X_{lK}) = \rho(X_{iK}, X_{lK})\,\rho(X_{hK}, X_{jK}),
\tag{10}
\]

which is a form of Spearman’s (1904) famous law of tetrads. For Spearman, the law expresses the property that there is a single factor behind the test–test correlations, and he coined this ‘general intelligence’. In our case, (9) and its consequence in (10) imply that there is a single factor behind the informant-by-informant correlations, namely the shared consensus answer key. As developed in Batchelder and Anders (2012), this property shows that an observed, informant-by-informant correlation matrix, $M = (r_{ij})_{N \times N}$, will have a simple structure as follows. Eq. (9) provides that the off-diagonal terms in $M$ from the data have a one-factor approximation. Specifically, there exists a column vector, $a = (a_i)_{N \times 1}$, such that the off-diagonal terms of $M$ are approximated within sampling variability by the formula

\[
M \approx a \cdot a^{T},
\tag{11}
\]

where T stands for transpose. In practice, the column vector $a$ can be found by minimizing the sum of squared errors given by

\[
SS(a) = \sum_{i=1}^{N} \sum_{j>i}^{N} (r_{ij} - a_i a_j)^2.
\tag{12}
\]
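To make the minimization in (12) concrete, the sketch below fits a single factor by coordinate descent, updating one $a_i$ at a time while the others are held fixed. This is an illustrative stand-in chosen for brevity, not the MINRES routine the paper relies on; `fit_one_factor`, `ss`, the starting value, and the number of sweeps are all assumptions.

```python
def fit_one_factor(R, sweeps=500, start=0.5):
    """Minimise SS(a) = sum_{i<j} (r_ij - a_i * a_j)^2 by coordinate descent.

    R is a symmetric N x N list of lists; its diagonal entries are ignored.
    """
    n = len(R)
    a = [start] * n
    for _ in range(sweeps):
        for i in range(n):
            num = sum(R[i][j] * a[j] for j in range(n) if j != i)
            den = sum(a[j] ** 2 for j in range(n) if j != i)
            if den > 0:
                a[i] = num / den  # exact minimiser of SS in a_i, others fixed
    return a


def ss(R, a):
    """Sum of squared off-diagonal errors from Eq. (12)."""
    n = len(R)
    return sum((R[i][j] - a[i] * a[j]) ** 2 for i in range(n) for j in range(i + 1, n))


# A noise-free example: off-diagonal entries follow the one-factor form exactly.
true_a = [0.9, 0.7, 0.5, 0.8, 0.6]
R = [[true_a[i] * true_a[j] for j in range(5)] for i in range(5)]
a_hat = fit_one_factor(R)
print([round(x, 4) for x in a_hat], ss(R, a_hat))
```

On observed correlations the fit is only approximate, and the size of `ss(R, a_hat)` gives a rough sense of how much structure a single factor leaves unexplained.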

The single factor and potential succeeding factors, which ignore the structural ones in the main diagonal entries in $M$, may be obtained by performing a standard minimum residual factor analysis (MINRES) of $M$ (Comrey, 1962); a readily available function, fa(), in the R (R Core Team, 2012) ‘psych’ package (Revelle, 2012) can perform MINRES and obtain these factors and their corresponding eigenvalues. The standard pattern indicating a one-factor solution, and hence a selection of the GCM over the MC-GCM with $T > 1$, is a relatively sizable eigenvalue for the first factor, followed by a large drop in the second eigenvalue, which starts a linear decline in the eigenvalues, as its associated eigenvector and the remaining eigenvectors fit residual noise in $M$. The left-most plot in Fig. 2 shows a visual representation of eigenvalues, known as a scree plot, for many different response matrices that all exhibit the single-factor design of eigenvalues. Unfortunately, as discussed in Batchelder and Anders (2012), there are no universally accepted statistical tests for the adequacy of a one-factor approximation to $M$; however, it is possible to develop posterior predictive model checks based on this property.

Since the Spearman property is clearly developed in the GCM, it is reasonable to consider whether a comparable property exists for the MC-GCM. New in this paper is the finding, with proof, that a property directly related to (10) exists for the MC-GCM; it is presented in Theorem 1 and proven in Appendix A. As will be demonstrated, Theorem 1 provides an important basis for the model. In particular, it provides foundations for investigative techniques that can suggest the number of underlying cultures in a given data set, and provides for a posterior predictive check to examine whether the inference results of the MC-GCM, specified at a particular number of cultures, satisfy the consensus structure of the data.

Theorem 1. For the MC-GCM defined by Axioms 1G–3G, for any number of truths, $T$,

\[
\forall i, j = 1, \ldots, N \text{ with } i \neq j, \quad
\rho(X_{iK}, X_{jK}) = \rho(X_{iK}, Z_{e_i K})\,\rho(X_{jK}, Z_{e_j K})\,\rho(Z_{e_i K}, Z_{e_j K}).
\tag{13}
\]


Fig. 2. Overlapped scree plots of 50 data sets simulated by the hierarchical MC-GCM; from left to right, T = 1, 2, 3, and 4 truths. There were N = 15 informants per latent truth, M = 40 items, and the average competency of each run was between 0.35 and 0.75.

Theorem 1 is an interesting extension of Observation 1 in Batchelder and Anders (2012) because (13) reduces to the one-factor property of the GCM in (9) if the informants’ answer keys are all identical and therefore perfectly correlated. However, when the answer keys are different, (13) shows that an additional multiplicative factor, $\rho(Z_{e_i K}, Z_{e_j K})$, is added to the formula for the correlation between two informants’ responses over items. As is demonstrated next, it is easy to see that this added term results in a violation of the one-factor structure of the informant-by-informant correlation matrix.

Suppose there are two answer keys with correlation $\rho(Z_1, Z_2) = u \in (-1, 1)$. Next consider the structure of the informant-by-informant correlation matrix $M$. Using Theorem 1, the terms from the model are

\[
\rho(X_{iK}, X_{jK}) =
\begin{cases}
\rho(X_{iK}, Z_{e_i K})\,\rho(X_{jK}, Z_{e_j K}) & \text{if } e_i = e_j \\
u\,\rho(X_{iK}, Z_{e_i K})\,\rho(X_{jK}, Z_{e_j K}) & \text{if } e_i \neq e_j.
\end{cases}
\tag{14}
\]

To see the effect of the additional factor in (13), suppose the true values of the correlations rather than their observations are inserted into (11). Now without loss of generality, suppose that the informants are ordered so that the informants with $e_i = 1$ are indexed from $i = 1, \ldots, N_1$ and those with $e_i = 2$ are indexed from $i = N_1 + 1, \ldots, N$. Then when the first factor $a = (a_i)_{N \times 1}$ is extracted using MINRES from the correlation matrix $(\rho(X_{iK}, X_{jK}))_{N \times N}$, it is easy to see that the fit is not perfect. The terms with the same answer key will be underestimated and those with different answer keys will be overestimated, that is

\[
\rho(X_{iK}, X_{jK}) - a_i a_j
\begin{cases}
> 0 & \text{if } e_i = e_j \\
< 0 & \text{if } e_i \neq e_j.
\end{cases}
\]
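The sign pattern can be illustrated numerically by building the model correlations in (14) for two cultures and fitting one factor to them. The sketch below rests on assumed values (ten informants, equal $\rho(X_{iK}, Z_{e_i K}) = 0.6$ for everyone, and $u = 0.6$), and it uses a simple coordinate-descent least-squares fit as a stand-in for MINRES; all function names are hypothetical.

```python
def fit_one_factor(R, sweeps=500, start=0.5):
    """Least-squares one-factor fit to the off-diagonal entries of R."""
    n = len(R)
    a = [start] * n
    for _ in range(sweeps):
        for i in range(n):
            num = sum(R[i][j] * a[j] for j in range(n) if j != i)
            den = sum(a[j] ** 2 for j in range(n) if j != i)
            if den > 0:
                a[i] = num / den  # best a_i given the other loadings
    return a


def two_culture_correlations(c, membership, u):
    """Model correlations from Eq. (14); c[i] stands for rho(X_iK, Z_{e_i K})."""
    n = len(c)
    return [[c[i] * c[j] * (1.0 if membership[i] == membership[j] else u)
             for j in range(n)] for i in range(n)]


c = [0.6] * 10                   # assumed informant-to-key correlations
membership = [1] * 5 + [2] * 5   # first five informants in culture 1, rest in culture 2
u = 0.6                          # assumed correlation between the two answer keys

R = two_culture_correlations(c, membership, u)
a = fit_one_factor(R)
residual = [[R[i][j] - a[i] * a[j] for j in range(10)] for i in range(10)]

print(round(residual[0][1], 4))  # same culture: positive residual
print(round(residual[0][9], 4))  # different cultures: negative residual
```

In this symmetric setting the fitted loadings compromise between the within-culture and between-culture correlations, which is why the within-culture pairs end up underestimated and the between-culture pairs overestimated.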

An easy way to see this is to note that the least squares estimates of the $a_i$ represent a compromise between fitting the upper limb of (14) and fitting its lower limb, which has an additional term $u$ for the correlation between the two keys. The consequence of this when observed informant-by-informant correlations are inserted into (12) is that the residual matrix, $M^{(2)} = M - a \cdot a^{T}$, will have a sign pattern given by

\[
M^{(2)} =
\begin{pmatrix}
+ & - \\
- & +
\end{pmatrix},
\]

and MINRES will extract a second significant factor to represent this non-random structure. In simulation studies patterned after the two answer key case, it was observed that if the two answer keys are not perfectly correlated, then the expected sign pattern in the residual matrix occurred and a second non-random factor was extracted. Following on from the case of two distinct answer keys, we conducted a number of simulation studies with T ∈ {1, 2, 3, 4} distinct answer keys. In the large majority of these simulations it

was observed that T sizable, non-random factors occurred in the scree plot. These results are summarized in Fig. 2, which contains 50 data sets per T case, generated by the hierarchical MC-GCM under homogeneous item difficulty, with the priors specified in Section 2.3; except that the group average competencies were uniformly sampled between 0.35 and 0.75. The figure shows that under these specifications, T sizable factors were nearly always evident for data with T number of truths. For example, the second plot of the figure contains the scree plots of T = 2 truths, in which for each line representing the series of eigenvalues for each data set, there are T = 2 large eigenvalues followed by a series of substantially smaller eigenvalues that fit a linear-decreasing trend; this pattern is similar in the next two plots that respectively show related trends in the series of eigenvalues for T = 3 and T = 4 truths. However, as explained by the discussion of Theorem 1 that (13) reduces to the one factor property of the GCM in (9), the results of the scree plot analyses performed on simulated data sets with highly correlated answer keys agreed with this property in that most of these data sets had less than T sizable eigenvalues. In addition, other investigations showed that a lower number of sizable factors were also found in sets without highly correlated truths in which there were low group average competencies, typically at values less than 0.35; this result was exaggerated with lower numbers of informants (e.g. <8) while larger numbers of informants were found to counteract this effect. Thus, the single-versus-multi scree plot design discussed provides the researcher with an idea of whether the GCM or MC-GCM is more appropriate. If the MC-GCM is selected, it is not guaranteed that the number of latent truths in the data is directly reflected by the number of sizable factors in the scree plot. 
However, the number of sizable factors may help the researcher localize which T values to apply the MC-GCM with. It is possible to develop a posterior predictive model check based on a graphic representation of the eigenvalue structure in the scree plot to see if the posterior predictive data exhibit a similar pattern of factors as in the real data as shown in Fig. 5, and this is described in detail in Section 3.2. So far, we have been working with the MC-GCM defined by Axioms 1G–4G that assume homogeneous item difficulty, and now we need to discuss the effect of item heterogeneity in Axiom 5G. First consider the GCM (T = 1) that is one culture with a single answer key, and item heterogeneity, which continues to have a single factor property. However, when heterogeneous item difficulty is included in the GCM, it is possible to show that (9) is violated in that for all distinct pairs of informants, instead ρ(XiK , XjK ) > ρ(XiK , ZK )ρ(XjK , ZK ); and in certain cases, the exact one-factor property does not hold as a property of the model. However, simulations by Maher (1987) showed that the onefactor property of the GCM still holds to a good approximation


under heterogeneous item difficulty. Furthermore, numerous simulations performed here with the MC-GCM likewise resulted in T sizable factors for simulated data with T truths. For this reason, we continue to employ the eigenvalue posterior predictive model check developed and discussed in Batchelder and Anders (2012) for the MC-GCM, even in cases of heterogeneous item difficulty; our departure, however, is to perform the eigenvalue check via a graphical representation, in order to display a series of eigenvalues rather than only the ratio of the first to the second. This check is discussed in detail in Section 3.2.

2.2.2. Informant responses and item difficulty

A property of both the MC-GCM and GCM is that under homogeneous item difficulty, the probability of an informant having a correct response for any item is the same when gi = 1/2; to see this, note from (4) that Hi = (1 − Fi) = (1 + Di)/2. Under the assumption that the model holds, Batchelder and Anders (2012) showed that due to this property, the variance of responses over informants for each item, namely the entries in the columns of (1), will be nearly equal in the data under homogeneous item difficulty, even with heterogeneous gi values. From this fact, one can create a single statistical measure of the variation in responses over informants by each item by taking, in turn, the variance of these item column variances, which we call the Variance Dispersion Index (VDI):

\[
\mathrm{VDI}(\mathbf{X}) = \frac{1}{M}\sum_{k=1}^{M} V_k^{2} - \left(\frac{1}{M}\sum_{k=1}^{M} V_k\right)^{2}, \tag{15}
\]

where

\[
V_k = P_k(1 - P_k) \quad \text{and} \quad P_k = \frac{1}{N}\sum_{i=1}^{N} X_{ik}.
\]
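As a minimal sketch of (15), the VDI of a response matrix could be computed as below. It assumes the data are stored as an N-by-M array of 0/1 responses (informants in rows, items in columns); the function name `vdi` and the toy matrix are illustrative and are not part of the original analysis.

```python
import numpy as np

def vdi(X):
    """Variance Dispersion Index, Eq. (15), of an N-by-M 0/1 response matrix."""
    X = np.asarray(X, dtype=float)
    P = X.mean(axis=0)        # P_k: proportion of 1-responses to item k
    V = P * (1.0 - P)         # V_k: response variance of item k over informants
    # Variance of the V_k over items: mean of squares minus squared mean.
    return np.mean(V ** 2) - np.mean(V) ** 2

# Toy 4-informant by 3-item matrix (made-up values, for illustration only).
X_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [1, 1, 1],
                  [1, 1, 0]])
vdi_toy = vdi(X_toy)
```

In the toy matrix the item proportions are P = (1, 0.5, 0.5), so V = (0, 0.25, 0.25) and the index is simply the (population) variance of those three values.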

In Batchelder and Anders (2012) the VDI served as a useful posterior predictive model check for the GCM when compared with another model that implied that there should be large differences in the item column variances. It turns out that the VDI also provides a posterior predictive check to determine whether item heterogeneity in Axiom 5G should be included in the MC-GCM when fitting any particular data set. First, in the case of T = 1, it is obvious that the introduction of item heterogeneity will increase the dispersion of the item column variances in (1). This is true because easy items will tend to have mostly 1’s or mostly 0’s in their corresponding columns, whereas hard items will tend to have a greater mix of 1’s and 0’s due to more guessing. Based on a number of simulation studies of the MC-GCM with T > 1, we found that when the other parameters were held constant, allowing the δk in (5) to vary usually resulted in a higher value of the VDI in comparison to setting δk = 1/2 for all items and answer keys, in which case from (5) Dik = θi. However, through analysis of simulated and real data, we found cases in which the addition of multiple cultures alone cannot reproduce the level of VDI in the data; in these cases, one may include item heterogeneity in an effort to match the VDI level. An example of this is illustrated in the VDI check for the real data in the right plot of Fig. 10, which will be discussed in detail later.

2.3. Bayesian specification of the model

The present paper will use a Bayesian hierarchical framework for both simulating and analyzing data with the model. A Bayesian specification of the model is included below. The classical GCM with homogeneous item difficulty was first estimated hierarchically in Batchelder and Anders (2012); we retain their

hierarchical approach in the Bayesian formulation of the MC-GCM, as well as introduce a new hierarchical specification for informant membership and item difficulty parameters. Our assumptions for the hierarchical distributions for the MC-GCM, which will be followed with an explanation, are:

\[
\begin{aligned}
Z_{tk} &\sim \mathrm{Bernoulli}(p_t)\\
\delta_k &\sim \mathrm{Beta}\big(\mu_{\delta}\tau_{\delta},\ (1-\mu_{\delta})\tau_{\delta}\big)\\
\theta_i &\sim \mathrm{Beta}\big(\mu_{\theta e_i}\tau_{\theta e_i},\ (1-\mu_{\theta e_i})\tau_{\theta e_i}\big)\\
g_i &\sim \mathrm{Beta}\big(\mu_{g e_i}\tau_{g e_i},\ (1-\mu_{g e_i})\tau_{g e_i}\big)\\
e_i &\sim \mathrm{Categorical}(\lambda),
\end{aligned}
\]

where t ∈ {1, . . . , T} and T is pre-specified. It is natural to assume that the Ztk ∈ {0, 1} are drawn from a Bernoulli distribution, and since the δk, θi, and gi have space [0, 1], we assume distributions naturally in the same range, which we choose to be beta distributions with mean µ and ‘precision’ τ (e.g., Kruschke, 2011). A frequently-employed alternative to using the beta hierarchical distribution for parameters in [0, 1] is to model the logit or probit of the parameters with a multivariate Gaussian distribution; however, we prefer to work with the untransformed model parameters; see Merkle, Smithson, and Verkuilen (2011) for a discussion of these alternatives. One may notice that by indexing the priors of each θi and gi by the informant’s group membership, ei, we are able to avoid doubly-indexing these parameters and hence optimize the model by omitting superfluous instances of parameters. Finally, an advantageous characteristic of the model as specified above is that each culture, or each cluster of informants, has the possibility for a separate group mean and group standard deviation of competency and bias. The hyperpriors that we use throughout the paper are:

\[
\begin{aligned}
p_t &\sim \mathrm{Uniform}(0, 1)\\
\mu_{\theta t} &\sim \mathrm{Beta}(\alpha, \alpha), & \alpha &= 2\\
\tau_{\theta t} &\sim \mathrm{Gamma}\big(\mu_{\tau_\theta}^{2}/\sigma_{\tau_\theta}^{2},\ \mu_{\tau_\theta}/\sigma_{\tau_\theta}^{2}\big), & \mu_{\tau_\theta} &= 10,\ \sigma_{\tau_\theta} = 10\\
\mu_{g t} &= 1/2\\
\tau_{g t} &\sim \mathrm{Gamma}\big(\mu_{\tau_g}^{2}/\sigma_{\tau_g}^{2},\ \mu_{\tau_g}/\sigma_{\tau_g}^{2}\big), & \mu_{\tau_g} &= 10,\ \sigma_{\tau_g} = 10\\
\mu_{\delta} &= 1/2\\
\tau_{\delta} &\sim \mathrm{Gamma}\big(\mu_{\tau_\delta}^{2}/\sigma_{\tau_\delta}^{2},\ \mu_{\tau_\delta}/\sigma_{\tau_\delta}^{2}\big), & \mu_{\tau_\delta} &= 10,\ \sigma_{\tau_\delta} = 10\\
\lambda &\sim \mathrm{Dirichlet}(L), & L &= (1)_{1\times T}.
\end{aligned}
\]

Since there is no prior reason to suspect any particular proportion of consensus ‘true’ answers for each culture’s consensus truth set, Ztk, a diffuse uniform hyperprior on pt seems appropriate. A mildly informative beta, Beta(α, α), with α = 2, is used for the group mean of each cluster’s competencies, µθt, as one would expect that the average competency is unlikely to be near the extreme values of 0 or 1. With regard to the hyperpriors of the beta distribution precision parameters, gamma distributions are often used in Bayesian frameworks (e.g., Kruschke, 2011, Ch. 9), particularly with a parameterization such that µ and σ² are respectively the mean and variance of the gamma; this parameterization often mimics Gaussian curves restricted to the positive half-line. In particular, we selected values that produce a diffuse gamma; other researchers may prefer other values. In addition, for cases in which item heterogeneity is used, the item difficulty mean for each truth is set to 0.50 in order to solve the standard identifiability problem with the Rasch formula, as each item’s difficulty is relative to all other items’ difficulty. The first Bayesian application of item difficulty for CCT was for a fixed-effects version of the GCM by Karabatsos and Batchelder (2003); in that case, the identifiability problem was solved by setting one of the items’ δk to 0.50, and this method is likewise an option if one desires a fixed-effects version of the MC-GCM. Finally, λ is set such that, as a prior belief, an informant is equally likely to belong to any group.
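To make the specification above concrete, the following sketch draws one data set from the hierarchical priors under homogeneous item difficulty (Dik = θi). It is not the JAGS code of Appendix B. Two points are assumptions: the response rule Hi = Di + (1 − Di)gi and Fi = (1 − Di)gi is the usual GCM form, chosen because it reproduces Hi = 1 − Fi = (1 + Di)/2 at gi = 1/2, and all function names and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2012)  # arbitrary seed, for reproducibility only

def beta_mean_precision(mu, tau):
    """Draw from Beta(mu * tau, (1 - mu) * tau), the mean/precision form."""
    return rng.beta(mu * tau, (1.0 - mu) * tau)

def gamma_mean_sd(mean, sd, size):
    """Draw from a gamma with given mean and sd: shape = mean^2/sd^2, rate = mean/sd^2."""
    shape = mean ** 2 / sd ** 2
    rate = mean / sd ** 2
    return rng.gamma(shape, 1.0 / rate, size=size)  # NumPy takes scale = 1/rate

def simulate_mcgcm(N, M, T):
    """Simulate N informants, M items, T cultures (homogeneous item difficulty)."""
    # Hyperpriors as listed in Section 2.3.
    p = rng.uniform(0.0, 1.0, size=T)
    mu_theta = rng.beta(2.0, 2.0, size=T)
    tau_theta = gamma_mean_sd(10.0, 10.0, T)
    tau_g = gamma_mean_sd(10.0, 10.0, T)
    lam = rng.dirichlet(np.ones(T))
    # Answer keys per culture, then membership, competence and bias per informant.
    Z = rng.binomial(1, p[:, None], size=(T, M))
    e = rng.choice(T, size=N, p=lam)
    theta = beta_mean_precision(mu_theta[e], tau_theta[e])
    g = beta_mean_precision(0.5, tau_g[e])
    # ASSUMED response rule (standard GCM form, not restated in this section).
    hit = theta + (1.0 - theta) * g          # P(X_ik = 1 | Z_{e_i k} = 1)
    false_alarm = (1.0 - theta) * g          # P(X_ik = 1 | Z_{e_i k} = 0)
    prob_one = np.where(Z[e] == 1, hit[:, None], false_alarm[:, None])
    X = rng.binomial(1, np.clip(prob_one, 0.0, 1.0))
    return X, Z, e, theta, g

X, Z, e, theta, g = simulate_mcgcm(N=30, M=40, T=2)
```

With mean 10 and standard deviation 10, the gamma hyperprior has shape 1 and rate 0.1, which is why it is diffuse over the precision parameters.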

3. Hierarchical Bayesian analysis with the MC-GCM

This section discusses techniques and methods to appropriately fit the MC-GCM to data in a hierarchical Bayesian framework. Section 3.1 covers important details for how to apply and assess the results of the MC-GCM, given that it is a finite mixture model. Section 3.2 develops the posterior predictive checks to assess appropriate fit of the model. Then Section 3.3 summarizes a generalized routine for applying the model.

3.1. Applying the MC-GCM

The model is applicable with standard inference software such as JAGS (Plummer, 2003) or OpenBUGS (Thomas, O’Hara, Ligges, & Sturtz, 2006). The present paper applies the model with JAGS, run from R through the packages rjags and R2jags (Plummer, 2012); the JAGS model code for the MC-GCM is included in Appendix B. As a finite-mixture model applied in a Bayesian framework, the model poses a few challenges for the Bayesian user, which are typical of finite mixture models. The first challenge is due to the so-called label-switching phenomenon (Stephens, 2000). As a result of label-switching, many of the values automatically delivered by inference software, such as the posterior means of the parameters, R̂, and the Deviance Information Criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002), may be misrepresented; thus the label-switching issue must be addressed prior to interpreting these statistics. In our case, label-switching occurs because the T cultures may be ordered, or labeled, in any way in the inference results, so long as the ei membership values correspond. Consequently, any parameters with a t index, and the associated ei, may be labeled differently across different chains.
As discussed by Stephens (2000), many approaches are possible to rectify this issue, and the present paper resolves it with the typical approach of post-processing the MCMC chains, in which the parameter samples across chains that exhibit inconsistent labeling are swapped in order to achieve one consistent labeling. The second challenge of using this finite-mixture model in a Bayesian framework arises from a mixture-number phenomenon, in which some chains may contain a different number of mixtures than others. For example, the MC-GCM pre-specified with T = 4 but applied to T = 2 data may occasionally produce chains of just T = 2 mixtures by converging the other two cultures on the mean of the hyperprior (for example, all posterior mean Ztk values for these two cultures will be near 0.50), and not clustering any informants on these cultures. Thus in such cases, the researcher obtains some chains that do not contain the same number of mixtures as sought by the initial specification of T. Further, the R̂ values for a number of parameters across chains with different mixtures will be misrepresented, or at values that clearly violate convergence criteria (see Gelman, Carlin, Stern, & Rubin, 2004). To resolve this issue, we recommend running a sizable number of chains, such as six or more, and retaining the chains that contain the pre-specified number of mixtures. Incidentally, if the model has a strong tendency to select a certain number of mixtures over the preset T possibilities, we have found through our simulated and real data analyses that this may be an early indication that this number of mixtures will be preferred by the model selection criteria. When comparing the MC-GCM across different mixture numbers (T values) on a data set, it is recommended that the same number of chains be used for each T, so that the fit statistics across models are comparable.
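The post-processing described above can be sketched as follows. This is one simple possibility, not the procedure actually used for the paper: each chain is aligned to a reference chain by searching all T! permutations for the one whose per-chain posterior means of Ztk are closest to the reference, which is only practical for small T. The helper names and the tolerance `tol` used to flag a culture as "near 0.50" are hypothetical choices.

```python
import itertools
import numpy as np

def align_labels(z_ref, z_chain):
    """Permutation perm such that z_chain[perm] best matches z_ref.

    z_ref, z_chain: (T, M) arrays of per-chain posterior means of Z_tk.
    """
    T = z_ref.shape[0]
    best = min(itertools.permutations(range(T)),
               key=lambda perm: np.abs(z_chain[list(perm)] - z_ref).sum())
    return np.array(best)

def relabel_memberships(e_samples, perm):
    """Translate membership samples from a chain's labels to the reference labels."""
    inverse = np.argsort(perm)   # perm[new] = old, so inverse[old] = new
    return inverse[e_samples]

def empty_cultures(z_mean, e_samples, tol=0.1):
    """Flag cultures with no assigned informants and all Z means near 0.50."""
    T = z_mean.shape[0]
    counts = np.bincount(np.ravel(e_samples), minlength=T)
    near_prior = np.all(np.abs(z_mean - 0.5) < tol, axis=1)
    return (counts == 0) & near_prior

# Illustration with made-up posterior means: the second chain has labels swapped.
z_reference = np.array([[0.9, 0.1, 0.8], [0.1, 0.9, 0.2]])
z_swapped = z_reference[[1, 0]]
perm = align_labels(z_reference, z_swapped)
e_fixed = relabel_memberships(np.array([0, 0, 1]), perm)
```

A chain for which `empty_cultures` flags any culture would be treated as exhibiting the mixture-number phenomenon and set aside, in line with the recommendation to retain only chains with the pre-specified number of mixtures.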
The two aforementioned phenomena, label-switching and mixture-number, are easily spotted when viewing the posterior means of the Ztk values by chain in a plot; each phenomenon is evident by a typical pattern. The first phenomenon is revealed when the Ztk means by chain have decisively converged at divergent values by culture for many of the items; if this is the case, then the ei values will show a similarly decisive difference across chains. The second phenomenon is easily noticed if the posterior means of the Ztk for all items within one or more mixtures have all converged near the mean of the hyperprior, near 0.50. After the label-switching and mixture-number issues are addressed, one can then accurately calculate the posterior means, R̂ values, and DIC values. The DIC is calculated by evaluating the likelihood in (6) with the posterior samples (for calculating the DIC from MCMC samples, see Gelman et al., 2004). In short, note that $\mathrm{DIC} = \overline{\mathrm{Dev}} + p_{\mathrm{Dev}}$, and $\mathrm{Dev}(\phi) = -2\log L(\mathbf{X} \mid \phi) + C$, where φ = {Z, Θ, ∆, G, E}, and C is a constant that effectively cancels out when the DIC is compared across models. We denote by $\mathbf{Dev}$ the array of numbers obtained by evaluating Dev(φ) at each set of samples. Then $\overline{\mathrm{Dev}} = \mathrm{Mean}(\mathbf{Dev})$, and the robust approximation to $p_{\mathrm{Dev}}$, as suggested by Gelman et al. (2004), is calculated using the rule $p_{\mathrm{Dev}} = \mathrm{Var}(\mathbf{Dev})/2$, which is also the method of calculation for $p_{\mathrm{Dev}}$ that is automatically delivered by the current version of JAGS. When the DIC is compared across models, smaller values of DIC are preferred. Although the DIC has been criticized in the hierarchical setting due to shrinkage (Carlin & Spiegelhalter, 2007), the simulation and real data studies to follow demonstrate its success and adequacy for the MC-GCM. However, one may consider using other model selection criteria, such as the Bayesian Predictive Information Criterion (BPIC; Ando, 2007), which relates to DIC as $\mathrm{BPIC} = \overline{\mathrm{Dev}} + 2p_{\mathrm{Dev}}$.
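The two criteria just described reduce to a few lines once the deviance has been evaluated at each retained posterior sample. The sketch below assumes those deviance values are already available as a one-dimensional array (computing them requires the likelihood in (6), which is not reproduced here); the function name is hypothetical, and the use of the sample variance (ddof = 1) is a choice of this sketch.

```python
import numpy as np

def dic_bpic(dev_samples):
    """DIC and BPIC from deviance values evaluated at each posterior sample."""
    dev = np.asarray(dev_samples, dtype=float)
    dev_bar = dev.mean()              # mean deviance over samples
    p_dev = dev.var(ddof=1) / 2.0     # effective number of parameters, Var(Dev)/2
    return {"DIC": dev_bar + p_dev,
            "BPIC": dev_bar + 2.0 * p_dev,
            "pDev": p_dev}

# Made-up deviance values, for illustration only.
criteria = dic_bpic([100.0, 102.0, 104.0, 106.0])
```

Because the constant C is shared by all models fit to the same data, only differences in these values between models are meaningful.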
Typically, we use a model selection criterion to select a model in cases where more than one specification of the MC-GCM (different T values or heterogeneous/homogeneous item difficulty) satisfies the posterior predictive checks that will be introduced in the next section. In our demonstrations that follow, the DIC delivered the same conclusive results as the BPIC in model selection when more than one case of the model satisfied the posterior predictive checks. User-friendly software for popular CCT models is a recent development: the first modern GUI for the basic GCM was released by Oravecz et al. (in press; see footnote 2). It remains a future project to develop such a standalone GUI for the MC-GCM that will automatically handle both the label-switching and mixture-number phenomena discussed previously.

3.2. Posterior predictive checks and model selection

Two important posterior predictive checks, the graphical eigenvalue and VDI checks based on Sections 2.2.1 and 2.2.2 respectively, are suggested in order to ensure an appropriate application of the MC-GCM to the data. The eigenvalue check ascertains whether the specification of T is appropriate for a given data set, and the VDI check is used to determine whether the assumption of homogeneous or heterogeneous item difficulty is more appropriate. With regard to the eigenvalue check based on Section 2.2.1, an appropriate posterior of the MC-GCM will predict data that mimic the pattern of eigenvalues in the actual data, and this may be observed graphically. In order to pass the graphical posterior predictive check, the posterior predictive data should exhibit a similar pattern in the series of eigenvalues as that of the actual data; ideally, though not necessarily, the eigenvalues of the posterior predictive data should overlap the actual data’s eigenvalues. Figs. 5, 11 and 14 contain examples of posterior predictive data that satisfy

2 This GUI may be downloaded from: http://bayesian.zitaoravecz.net.


Fig. 3. Examples of violations of the eigenvalue graphical posterior predictive check. The first two plots pertain to the hot–cold and political data sets analyzed in Section 4.2, run with the MC-GCM at T = 1, while the third and fourth plots pertain to the N = 15 · T by M = 40 items, T = 3 data analyzed in Section 4.1, run with the MC-GCM at T = 1 and T = 2 respectively. The actual data’s eigenvalues are represented by the black line while the gray region depicts the eigenvalue series of 500 randomly-sampled posterior predictive data sets.

the eigenvalue check, which range from adequate to strong fits of the data’s factor design. The second plot of Fig. 11 exhibits an adequate fit by showing a similar pattern in the series of eigenvalues that does not necessarily overlap all eigenvalues, while a strong fit is displayed in the first plot of the figure, which exhibits both of these qualities. Examples of posterior predictive data that violate the check are included in Fig. 3. The first two plots pertain to data with two sizable eigenvalues while the third and fourth pertain to data with three sizable eigenvalues. In the first plot, although the first factor is somewhat fitted by the overlapping gray, the second, sizable factor is completely missed by the posterior predictive data. In the second plot, the first important factor is missed, and although the second factor is contained in the gray region, the non-linear pattern between this factor and the succeeding ones is not captured. In the third plot, while the first factor is locally captured by only a marginal amount of posterior predictive data, the second and third factors are completely missed, as well as the pattern of the succeeding factors. In the fourth plot, the full pattern of eigenvalues is also notably missed, but the first two factors of the data are at least captured (see footnote 3).

Now with regard to the VDI check based on Section 2.2.2, an appropriate posterior should predict data sets that exhibit response variability over items (VDI) similar to that in the data. While it has been found that the addition of multiple truths tends to increase the VDI, the VDI of the data may still not be fit by the posterior predictive data if item difficulty is in fact heterogeneous. Thus in this check, the user will calculate the VDI in (15) for the posterior predictive data across inference runs and verify whether the data’s VDI level is reasonably fit. Specifically, our check includes verifying whether the data’s VDI level is at a value greater than the 2.5th percentile and less than the 97.5th percentile of the distribution of VDI levels from the posterior predictive data. This check is visualized in Figs. 10 and 14.

For cases in which more than one T specification of the MC-GCM satisfies both the eigenvalue and VDI posterior predictive checks, one may apply a model selection criterion to the inference runs that have satisfied the checks in order to select just one interpretation. The present paper employs DIC, which rewards fit but penalizes based on the effective number of parameters; it is discussed in the context of CCT modeling by Karabatsos and Batchelder (2003).
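The percentile rule for the VDI check can be sketched as below. The sketch assumes the posterior predictive data sets have already been generated and stacked into an array of shape (S, N, M), one N-by-M response matrix per retained posterior sample; how they are generated depends on the fitted model and is not shown. Function names are illustrative.

```python
import numpy as np

def vdi(X):
    """Variance Dispersion Index of an N-by-M 0/1 response matrix, Eq. (15)."""
    P = np.asarray(X, dtype=float).mean(axis=0)
    V = P * (1.0 - P)
    return V.var()   # population variance: mean(V^2) - mean(V)^2

def vdi_check(X_obs, X_rep, lower=2.5, upper=97.5):
    """Pass if the observed VDI lies strictly inside the central 95% of replicated VDIs."""
    vdi_obs = vdi(X_obs)
    vdi_rep = np.array([vdi(x) for x in X_rep])
    lo, hi = np.percentile(vdi_rep, [lower, upper])
    return bool(lo < vdi_obs < hi), vdi_obs, (lo, hi)
```

A failed check under homogeneous item difficulty, with the observed VDI above the upper percentile, is the situation in which the text suggests adding item heterogeneity.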

3 The first two plots show two of the real data sets, the hot–cold and political data sets analyzed in Section 4.2, fit with the MC-GCM specified at a single truth (T = 1); both are instead satisfied by the T = 2 MC-GCM. The third and fourth plots show the simulated data with T = 3 truths from Section 4.1, analyzed by the MC-GCM at T = 1 and T = 2 respectively; these data are instead satisfied by the T = 3 MC-GCM.

Finally, as an optional addition, the researcher may also be interested in whether the posterior predictive data resemble the actual data. This may be visualized by a heat map, in which the intensity or darkness of color (by node) represents the percentage of posterior predictive data sets that match the actual data. Usually, one may wish to sort the informants from highest to lowest competency; adequate posterior predictive data for the MC-GCM would then include high match percentages across most nodes for the competent informants, and low-to-medium percentages for the low-competency informants. A depiction of this is included in Fig. 6. The occurrence of low-to-medium percentages for low-competency informants is considered adequate, as a great amount of their responses are due to random guessing, and hence both the actual generating parameters and an appropriate posterior predictive MC-GCM would perform fairly equally, at approximately 50%, for many of these informant responses. One may notice that the figure contains a few white squares for highly competent individuals. These generally occur as a model property: highly competent informants are much more likely to respond correctly than incorrectly for each item, and in these cases the informant responded incorrectly in the data.

3.3. Generalized routine for fitting the model to data

The suggested, generalized approach for fitting the MC-GCM to data includes the following procedure. First, the researcher should obtain the scree plot of the data using MINRES on M in (11), as illustrated in Figs. 4 and 9; from this plot, he or she may obtain an idea of which T values are sensible to specify in applications of the model. After running a number of inferences with the model specified at various T values, label-switching, mixture-number, and convergence issues should be checked for and addressed if needed. Afterward, statistics such as the posterior means, R̂, and DIC/BPIC can be appropriately calculated.
Then one can perform the two posterior predictive checks, the eigenvalue and VDI checks, and optionally the heat map check in Fig. 6, across candidates. The candidate that has the best model selection criterion value, such as DIC, out of all candidates that also satisfy the posterior predictive checks is then selected for posterior analysis.

4. MC-GCM application to data

In this section, the MC-GCM is applied to both simulated and real data sets via hierarchical Bayesian inference. The simulated data analysis is included in Section 4.1, in which the model is fit to simulated data with T = 2 and T = 3 cultures. Also, a test of the robustness of the model is included: whether it can recover cultures with as few as six informants per culture. Next in Section 4.2,

the model is fit to three real data sets. The first two real data sets were previously examined in CCT-related papers and were found not to contain a single consensus truth; the sets respectively pertain to disease and postpartum hemorrhage beliefs. The third data set is newly collected data about respondents’ knowledge of American political parties. All Bayesian runs analyzed in this paper involved six chains of 2000 samples, no ‘thinning’, and dropping the first 1000 samples as ‘burn-in’, hence retaining 6000 samples for analysis for each run. Any label-switching and mixture-number phenomena discussed in Section 3.1 were checked for and addressed; in model comparisons, the same number of chains was compared across models, which in every case was at least three chains. It was verified that the R̂ statistic was below 1.1 for every node between all chains (see Gelman et al., 2004, for an explanation of MCMC sampling terms), and that the autocorrelation function was sufficiently low to indicate that thinning was indeed not necessary. In addition, random selections of trace plots for each parameter were inspected and none of them indicated concern for inadequate mixing.
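For readers who wish to reproduce the convergence criterion, a basic version of the R̂ statistic for a single node can be computed from the retained samples as sketched below. This is the classical between/within-chain variance ratio described by Gelman et al. (2004), without chain splitting; inference software may report a slightly different variant, and the function name is illustrative.

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction for one node; chains has shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_plus = (n - 1) / n * W + B / n          # pooled estimate of the posterior variance
    return np.sqrt(var_plus / W)

# Mirrors the run settings in the text: six chains of 2000 samples with the
# first 1000 dropped as burn-in. The draws are placeholder noise, not model output.
rng = np.random.default_rng(0)
raw = rng.normal(size=(6, 2000))
kept = raw[:, 1000:]                            # 6 x 1000 = 6000 retained samples
converged = rhat(kept) < 1.1
```

Chains that still carry inconsistent labels or a different number of mixtures would inflate B for the affected nodes, which is why R̂ is computed only after the post-processing of Section 3.1.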

Fig. 4. Scree plots of the T = 2 (left) and T = 3 (right) data sets of size N = 15 · T and M = 40 that were simulated by the hierarchical MC-GCM.

4.1. Simulated data analysis with the MC-GCM

In this section we demonstrate the hierarchical Bayesian application of the MC-GCM by working through the generalized procedure for applying the model to data; in particular, we apply the model to data that are simulated with T = 2 and T = 3 answer keys and homogeneous item difficulty. In addition, we conclude this section with a test of the MC-GCM’s ability to recover the answer key when applied to very small data sets of only a few informants per culture. Each data set presented was generated hierarchically by parameters that were randomly sampled from distributions in the range of the priors specified in Section 2.3. In order to simplify the presentation of the model, in every case of the simulated data demonstrations that follow, we used the MC-GCM with homogeneous item difficulty, defined by Axioms 1G–4G. Yet we note that in numerous simulation studies, the model performs comparably well with the inclusion of item difficulty, and these parameters are also successfully recovered. Heterogeneous item difficulty by Axiom 5G is included and reported on in our real data analysis in Section 4.2. The T = 2 and T = 3 data sets were simulated with N = 15 · T informants by M = 40 items. Specifically, the hyperparameters used were: pt = [0.24, 0.66], µθ = [0.55, 0.44], τθ = [3.6, 7.8], τg = [13.8, 21.4] for the T = 2 set; and pt = [0.71, 0.41, 0.55], µθ = [0.67, 0.54, 0.48], τθ = [6.9, 7.5, 7.3], τg = [13.5, 18.0, 19.1] for the T = 3 set. Although the data are simulated, we begin the presentation with the assumption that the researcher is not aware of how many latent truths reside in the data. In order to examine this property, we begin with scree plots of M from the data, which are displayed in Fig. 4.
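A rough stand-in for such a scree computation is sketched below. Two simplifications are assumptions of this sketch: M is taken to be the informant-by-informant matrix of response correlations, and a plain eigendecomposition is used in place of the MINRES factoring applied in the paper, so the values will differ somewhat from those in Fig. 4. The toy data use fixed competence for every informant and two hand-made answer keys rather than the hierarchical simulation.

```python
import numpy as np

def scree_eigenvalues(X):
    """Eigenvalues, largest first, of the informant-by-informant correlation matrix.

    X is an N-by-M 0/1 response matrix. An informant who gives the same answer
    to every item has zero variance and would produce NaNs here.
    """
    R = np.corrcoef(np.asarray(X, dtype=float))   # rows are informants
    return np.sort(np.linalg.eigvalsh(R))[::-1]

# Toy two-culture data: 15 informants per culture, 40 items, uncorrelated keys,
# and every informant matching their culture's key with probability 0.9.
rng = np.random.default_rng(4)
keys = np.vstack([np.tile([1, 0], 20), np.tile([1, 1, 0, 0], 10)])
membership = np.repeat([0, 1], 15)
X_demo = rng.binomial(1, np.where(keys[membership] == 1, 0.9, 0.1))
eigs = scree_eigenvalues(X_demo)
```

Plotting `eigs` against its index gives the scree plot; in data of this kind the first two values stand well above the remaining, slowly declining ones.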
The data set on the left suggests evidence for a two-truth design, in which there are two sizable eigenvalues evident and the rest are much smaller and decline with an approximate linear trend. Comparatively, the data set on the right displays a similar design but for three-truths. Although it is unlikely by the look of these scree plots, as shown by (13), it may be possible that a different number of latent truths are evident, and so it is recommended to run the MC-GCM with other local T values; particularly, we try values from T = 2 to T = 8. The GCM (T = 1) is already improper to apply as the scree plots clearly show multifactor designs, which violate the GCM’s required single-factor design, and hence violate the single-truth property of the model formulated in (9); though when one applies the T = 1 model to either data set, the eigenvalue posterior predictive check is failed,

Fig. 5. Graphical posterior predictive check of the scree plots of the T = 2 (left) and T = 3 (right) 500 randomly-sampled, posterior predictive data sets from the hierarchical MC-GCM. The black line depicts the actual data’s series of eigenvalues while the gray region depicts the series for the posterior predictive data.

Table 1
DIC values for the simulated data sets. Rows give the number of truths in the simulated data; columns give the T specified for the MC-GCM.

Data     T = 2    T = 3    T = 4    T = 5    T = 6    T = 7    T = 8
T = 2    1717.7   1770.8   1842.9   2352.5   2107.7   1888.1   1849.2
T = 3    4650.2   2038.5   2110.3   2440.0   2112.0   2106.5   2242.0

as demonstrated for the T = 3 simulated data in the third plot of Fig. 3. An interesting result occurred in the seven inference runs of T = 2 to T = 8 on each data set (which was also exhibited in a few other data sets analyzed with a strong signal): the MC-GCM consistently returned only T = 2 and T = 3 mixtures respectively across many inference runs on each data set, despite the specification of larger T values. Although the pre-specified number of T mixtures was not returned in many runs, we still investigate the results of these inferences. All inference runs with T values larger than that of the actual data satisfied the eigenvalue and VDI posterior predictive checks. In particular, the eigenvalue check was satisfied for each data set by the model clustering all of the informants into the same number of cultural truths as in the actual data, and none in the other cultural truths, which had non-meaningful posterior mean truths that generally converged near the mean of the hyperprior (see the mixture-number phenomenon in Section 3.1). Nonetheless, multiple cases of the model satisfy both of the posterior predictive checks, and so a model selection criterion, such as the DIC, is used to determine the appropriate model. To select the appropriate model, we calculate the DIC for each run; in the interest of space, we later present and discuss the plots of the posterior predictive checks only for the models with the lowest DIC for each data set. Table 1 contains the DIC values for the two data sets. Although a critique of DIC is that it has a tendency to prefer models with more parameters, the DIC selected

Fig. 6. Heat map as a graphical posterior predictive check in which darkness represents how often the posterior predictive data matched the actual data for T = 3 out of 6000, using the hierarchical MC-GCM.

the correct number of truths for each data set and did not prefer additional truth clusters of informants. As the model consistently returned the same number of mixtures despite increases in T, the DIC values in Table 1 mostly differ due to penalties for the additional, unnecessary parameters that are specified for mixtures without informants assigned to them. Note, however, that in the case of the T = 2 MC-GCM fit to the T = 3 data, the DIC is substantially larger than the other DIC values, owing to the difficulty of fitting three mixtures with only two possible groups. Next we present the eigenvalue posterior predictive check for the T = 2 and T = 3 inference runs that were selected by the DIC on each of their respective data sets. Fig. 5 contains the graphical posterior predictive eigenvalue check, including 500 randomly-sampled posterior predictive data sets for each model in gray, and we see that the pattern of eigenvalues for the posterior predictive data closely resembles the actual data’s pattern in black for each data set. Thus from the posterior predictive check, the inference results satisfy the model’s important multi-truth Spearman property in (13) in a way that resembles the actual data’s factors, and we may appropriately use the posterior results of the model. When one calculates this check for the MC-GCM specified at T values lower than those that generated the data, the check is violated by patterns similar to those shown in Fig. 3. With regard to the VDI check, as we know the generated data have homogeneous item difficulty, we do not include a presentation of the VDI check, although it was satisfied by the posterior predictive data; this check is included in the real data analysis of Section 4.2. However, one may be interested in observing how well the posterior predictive data match the actual data.
The average percent over all nodes for how often the 6000 T = 2 posterior predictive data sets matched each node of the actual data was 63.7% while it was 70.2% for T = 3 data; this difference was mainly found due to a higher proportion of informants in the T = 2 data set having lower competency. This may also be represented with more detail in a visual heat map as shown for T = 3 in Fig. 6. The informants are sorted from most competent to least competent, and an adequate prediction as discussed in Section 3.2 is evident as there are high match percentages across most nodes for the competent informants, and low-to-medium percentages for the low competency informants. As CCT models are traditionally used to detect latent truths and measure informant response characteristics, we next present the model’s capacity to recover parameters that generated the

data, which is contained in Fig. 7 for the T = 2 data. The T = 2 data is a good example because it reveals the recovery effects on a group with average competency $\bar{D}_{t=1} = 0.56$ with a number of experts, five with Di ≈ 0.8, and on another group with lower average competency $\bar{D}_{t=2} = 0.42$ that only has one expert. In the top row of the figure, we see that the posterior of the latent truth parameter, Zei k, exhibits a strong recovery of the generating values with a Pearson correlation of r = 1.00 (the circles representing the posterior means of the Zei k fully overlap the squares representing the generating truths) for the latent truth of the average competency group, t = 1 (the first plot), and an almost full recovery for the below average competency group, t = 2 (the second plot), with r = 0.94, as one of the posterior truths, Z2,16, is on the incorrect side of its generating value. The correlation between the posterior mean truths for the two cultures is −0.41. The first two plots in the second row display the competency and bias recovery, denoted similarly to the truth plots, but now also with 95% Highest Density Intervals (HDI, also known as HPD; for an explanation see Kruschke, 2011). In most cases, 64 out of 68 nodes, the 95% HDI overlaps the generating values. In each plot, the informants are sorted by their mode posterior group membership, and we can see that two groups exist with different means and spreads of competency and bias. In the competency plot, the bottom row of numbers is the generating group membership, and as shown in the figure, the posterior successfully clustered all informants in accordance with their generating group. To the right of these two plots are the hyperparameters, which are also mostly contained in the 95% HDIs, even with diffuse priors for inference as in Section 2.3.
The next simulation study to be discussed was patterned after a recovery study in Batchelder and Anders (2012), where a test of the robustness of the newly-specified, Bayesian hierarchical GCM was performed, and the model was found to successfully recover the latent truths and informant parameters for a data set with only six ¯ of 0.5, on 40 items. We informants with an average competency, D, were interested in seeing if the MC-GCM could also achieve such results with small numbers of informants, and if separate, welldefined cultures may be obtained with as few as six informants as well. We applied the hierarchical MC-GCM on many simulations of T = 3 data from the hierarchical model that had only 6 informants per culture, with each culture having an average competency between 0.50 and 0.60 on M = 40 items. The MC-GCM consistently recovered the latent truths, informant memberships, and other parameters. We include the results of a typical simulation in Fig. 8, with hyperparameters: pt = [0.53, 0.55, 0.53], µθ = [0.54, 0.57, 0.58], τθ = [8.8, 5.7, 5.6], τg = [21.6, 4.7, 13.4]. The results of the T = 3 MC-GCM applied on the data are reported, which satisfied all of the posterior predictive checks, and had the most-preferred DIC value. As shown in the top plot of Fig. 8, three distinct truths are recovered by the posterior in which the posterior means correlate highly with their respective generating truths by Pearson r’s of 0.92, 0.94, and 0.95. The third culture had the best recovery and the fewest ‘indecisive’ posterior truth means away from 0 to 1, which may be attributed to having the highest group ¯ t , and more experts, as it had the highest group competency, D variance in competencies: while the mean group competencies ¯ t = [0.57, 0.57, 0.61], the group variance competencies were D were Var(Dt ) = [0.005, 0.020, 0.027]. 
Similarly, a previous study by Batchelder and Anders (2012) that applied the GCM to a 6-by-40 data set also had the most exceptional recovery when experts were included. That data set had the settings of $\bar{D}$ = 0.5, Var(D) = 0.097, and D = [0.92, 0.88, 0.28, 0.32, 0.26, 0.34]; the two experts were highly influential, and when they agreed, their responses outweighed all incorrect responses of the other informants in the determination of the posterior mean, even during 2–4


Fig. 7. Hierarchical, T = 2 MC-GCM-generated data modeled by the hierarchical MC-GCM at T = 2. The top row, left to right, contains the posterior means (circles) with 95% HDI’s and generating parameters (squares) of the two truth cultures sorted from the lowest to highest posterior means, in order of the first culture; then beginning with the second row are the competency, bias, and hyperparameters in which the informants are sorted by their posterior mode group membership, by increasing order of the parameter’s posterior mean. For the competency plot, the text below indicates each informant’s generating group membership.

disagreeing response-splits. Thus, having a select number of experts may considerably increase the strength and recoverability of a cultural truth, even when this group may have a lower group average competency compared to a group without experts. In the bottom plot of Fig. 8, the recovery of the informant competencies is displayed, in which the posterior mean competencies are denoted as circles with 95% HDI’s and the generating ones as squares. These informants are sorted by their posterior mode culture, from lowest to highest competency, and the text below indicates their actual generating group membership. A strong recovery is indicated as the posterior mean is quite close to the generating value for each informant; this was also the case for the recovery of each informant’s bias, which mostly ranged from 0.35 to 0.75. In addition, the surrounding HDI’s are fairly tight for all but the two lowest-competency informants. The appropriate recovery of the separate means and variances of each group is reflected by the trend of the posterior means, though the second cluster had the weakest recovery. With regard to cluster recovery, the text at the bottom of the figure shows that all informants assigned to cluster ‘1’ were correctly clustered together, and so on for the other two clusters, by the posterior mode $e_i$; for all informants, at least 90% of the $e_i$ samples were the same value, and thus the posterior mode $e_i$ is a valid measure. In particular, the two lowest-competency informants had 90% of their $e_i$ samples at the same value, which may explain their wider HDI’s in posterior competency, while the remaining 14 informants had 99% matching $e_i$ samples. In conclusion, the MC-GCM is similarly robust to the GCM, as it can accurately recover the parameters of small data sets, such as ones containing only 6 informants per culture on 40 items, in which the informants have only average group competencies.

4.2. Real data analysis with the MC-GCM

In this section, we first present the MC-GCM applied on two sets of real data that CCT researchers could not previously interpret appropriately with a GCM, due to its single-truth restriction. Following this presentation, we introduce a new experiment, and provide an analysis of the data with the MC-GCM. Of the two data sets that could not be appropriately analyzed with the CCT GCM, the first is a classic CCT data set, which prior to this paper, was assumed to either have no cultures in Romney (1999) or perhaps multiple cultures in Batchelder and Anders (2012). The data set, known as the hot–cold data set, was originally presented in an anthropological field research paper (Weller, 1984) for the study of folk medical beliefs. In an urban Guatemalan setting, each of N = 23 female informants provided dichotomous responses to 27 disease terms, indicating whether each disease requires a ‘hot’ or ‘cold’ remedy. The purpose of asking about the type of remedy was to test a hypothesis of some anthropologists that ancient beliefs about humoral medicine characterized the beliefs of Guatemalan women. The response profile data in the form of (1) is provided in Romney (1999, Table 1). The second data set is a recent influential data set in Hruschka et al. (2008) that served to debunk a widely-practiced yet incorrect interpretation of the first installment of CCT in Romney, Weller, and Batchelder (1986), namely the idea that a ratio of the first-to-second eigenvalues of $M = (r_{ij})_{N \times N}$ from the data of 3-to-1 or greater is sufficient evidence to assume that the data support the single-truth property of the GCM (Weller, 2007). While we will see that the ratio of the first to second eigenvalues is much larger than 3:1 for the data, it was found by Hruschka et al. (2008) that this data set as a whole is not appropriately analyzable with the GCM due to significant differences in the response patterns found between selected subgroups of informants. Now that the MC-GCM is developed and applicable, we can fortunately fit it to the data, and see how the model will cluster the informants. The data set consisted of 149 women with variable experience in birth attendance, who answered 234 questions regarding postpartum hemorrhage in Bangladesh. We begin the analysis with a scree plot of both data sets in Fig. 9.
The hot–cold data on the left display a strong pattern for two latent truths which may be applied to the informants, and this was particularly the reason why the GCM was proven invalid on this data set. Although the postpartum data had three pre-specified groups, where one might hypothesize significant between-group consensus differences, the scree plot on the right conversely shows a very large ratio of the first to second eigenvalues. Nevertheless, it shows potential for two latent truths to be applicable to the informants, as the second eigenvalue is sizable compared to the ones beyond it, and an apparent linearly decreasing trend of the eigenvalues does not begin until the third factor.
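The scree plots discussed above are built from the eigenvalues of the informant-by-informant correlation matrix. The sketch below shows one plausible way to compute that eigenvalue series with NumPy; the function name `scree_eigenvalues` and the toy data (two groups of ten informants answering from two independent answer keys, with 10% of responses flipped) are assumptions made for illustration and are not the hot–cold or postpartum data.

```python
import numpy as np

def scree_eigenvalues(responses):
    """Eigenvalues, largest first, of the informant-by-informant correlation matrix."""
    x = np.asarray(responses, dtype=float)
    corr = np.corrcoef(x)               # rows are informants, columns are items
    return np.linalg.eigvalsh(corr)[::-1]

# Toy two-truth data: each group shares its own answer key, with noisy responses
rng = np.random.default_rng(1)
key_a = rng.integers(0, 2, size=200)
key_b = rng.integers(0, 2, size=200)
keys = np.vstack([key_a] * 10 + [key_b] * 10)
flips = rng.random(keys.shape) < 0.1
toy = np.where(flips, 1 - keys, keys)

eig = scree_eigenvalues(toy)
first_to_second = eig[0] / eig[1]       # the ratio used in the 3-to-1 rule of thumb
second_to_third = eig[1] / eig[2]       # large when a second factor is present
```

In this toy setting the first two eigenvalues stand well above the rest, which is the kind of pattern the text reads as evidence for two latent truths. Note that `np.corrcoef` returns NaN for an informant whose responses are constant, so such rows would need separate handling on real data.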


Fig. 10. VDI post-predictive check on the Hot–Cold T = 2 under homogeneous (same ID) and heterogeneous (diff ID) item difficulty (left) and Postpartum T = 4 under homogeneous item difficulty and T = 2 under heterogeneous item difficulty (right). The vertical line represents the VDI of the actual data while the curves represent the distribution of VDI values from the posterior predictive data.

Fig. 11. Graphical posterior predictive check of the scree plots of the T = 2 MC-GCM on hot–cold data (left) and T = 2 MC-GCM on the postpartum data (right) from 500 randomly-sampled posterior predictive data sets from the hierarchical MC-GCM. The black line depicts the actual data’s series of eigenvalues while the gray region depicts the series for the posterior predictive data.

Fig. 8. The results of the posterior from the hierarchical MC-GCM run on an N = 6 per culture simulated data set with M = 40 items. The top plot contains the posterior mean truth values, $Z_{e_i k}$, over items sorted from lowest to highest in order of the first culture, and the legend contains the Pearson correlations with the actual truth at r = [0.92, 0.94, 0.95]. The bottom plot contains the posterior mean informant competencies (circles), $D_i$, with 95% HDI’s and generating competencies (squares); these informants are sorted by their posterior mode culture, from lowest to highest competency, and the text below indicates their actual generating parameter culture.

As we did for the simulated data, we ran each data set with the hierarchical MC-GCM and calculated the associated DIC values for a variety of T specifications from 2 to 8, each under both item homogeneity and item heterogeneity. The DIC preferred the T = 2 case for the hot–cold data and the T = 4 case for the postpartum data under item homogeneity, and the T = 2 case for

Fig. 9. Scree plots of the Hot–Cold (left) and Postpartum (right) data sets.

the hot–cold data and T = 2 case for the postpartum data under item heterogeneity. As the data were not simulated, it is not readily known which assumption of item difficulty properly fits the item-response characteristics of the data as explained in Section 2.2.2, and we use the VDI in (15) to determine the appropriate model specification. The VDI was calculated as a posterior predictive check for both the hot–cold data and postpartum data for all values of T analyzed; the VDI check was satisfied in all runs for the hot–cold data but was only satisfied under item heterogeneity in the postpartum data inference. The VDI results of the particular models preferred by the DIC stated above are displayed in Fig. 10. We see that for the hot–cold data on the left, a better fit of the VDI is established by the item-heterogeneous T = 2 MC-GCM, and this model also had a significantly better DIC (DIC = 643.9) compared to the item-homogeneous case (DIC = 872.6). With regard to the postpartum data on the right, as was the case for multiple T specifications of the item-homogeneous MC-GCM, the VDI of the data is completely missed by the item-homogeneous T = 4 MC-GCM, but it is well-captured by the posterior predictive data of the item-heterogeneous T = 2 MC-GCM, and this model also had a significantly better DIC (DIC = 41 619.81) compared to the item-homogeneous case (DIC = 43 874.5). Next we examine the important eigenvalue posterior predictive checks for each data set, and Fig. 11 contains the results. Beginning with the hot–cold data, the posterior predictive data exhibit a pattern of eigenvalues similar to that of the real data, as expected from the property of Theorem 1 for the MC-GCM; thus, combined with the selection of the T = 2 fit based on the DIC, we consider the posterior interpretation to be appropriate for the hot–cold data. Likewise, the T = 2 fit of the postpartum data satisfies the check by providing a pattern of eigenvalues very similar to that of the real data, although it does underestimate the second factor.
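Both checks above compare a statistic of the actual data with the distribution of the same statistic over posterior predictive data sets. For a scalar statistic such as the VDI, this can be sketched as a percentile computation. The snippet below is a hedged illustration: the function names are hypothetical, the statistic values are assumed to have been computed elsewhere, and treating the check as passed when the observed value falls in the central 95% of the predictive distribution is an assumption about the decision rule rather than something stated in the text.

```python
import numpy as np

def predictive_percentile(observed, replicated):
    """Percentile (0 to 100) of the observed statistic among replicated statistics."""
    replicated = np.asarray(replicated, dtype=float)
    return 100.0 * np.mean(replicated <= observed)

def passes_check(observed, replicated, level=0.95):
    """Assumed rule: pass if the observed value lies in the central `level` mass."""
    tail = 100.0 * (1.0 - level) / 2.0
    pct = predictive_percentile(observed, replicated)
    return bool(tail <= pct <= 100.0 - tail)

# Hypothetical statistics from 100 posterior predictive data sets
replicated_stats = np.arange(1, 101)
pct = predictive_percentile(91, replicated_stats)
ok = passes_check(91, replicated_stats)
```

The same function can be applied eigenvalue by eigenvalue to obtain a numerical counterpart of the graphical scree check.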


Fig. 12. Statistics of the posterior distributions from the hierarchical MC-GCM run on the hot–cold data at T = 2 truths. The top plot contains the posterior mean truth values, $Z_{e_i k}$, over items sorted from lowest to highest in order of the first culture. The bottom plot contains the posterior mean informant competencies, $D_i$, with 95% HDI’s; these informants are sorted by their culture, from lowest to highest competency, and a “1” denotes placement in culture 1 while a “2” denotes placement in culture 2.

As the optimal models, selected by having the lowest DIC values, are verified to pass the two important posterior predictive checks, namely the eigenvalue and VDI checks, we can appropriately continue with an analysis and presentation of the posterior results from the inference. Beginning with the hot–cold data, the posterior results are presented in Fig. 12. In the top plot of Fig. 12 are the posterior mean item truth values, $Z_{e_i k}$, for the two cultures, and the items are sorted by the first culture’s truths from lowest to highest. While some truth values overlap, it is observed that the two truths are distinctly different, as the Pearson r correlation of the two cultures’ posterior mean truths across items is −0.11. In particular, 11 out of 27 items have fully-opposite posterior truth means, such as 0 versus 1, and 6 other items differ in their truth means from 0.3 to 0.7. The 10 remaining items, on which there was a common consensus among all informants, had a tendency to be common ‘diseases’ that often have fever and/or inflammation as a symptom, which informants agreed needed a cold remedy, such as: flu, arthritis, rheumatism, diphtheria, whooping cough, tuberculosis, malaria. One hot-remedy item that informants agreed upon was allergies. Furthermore, the above diseases, which informants tended to agree on, mostly have direct treatments for recovery. In contrast, the 11 items with disagreement among informants involved complicated diseases, many of which are without any directly known treatments for full recovery, or have complicated treatments, such as: cancer, smallpox, chicken pox, polio, tetanus, rubella, mumps, measles, diabetes, and intestinal influenza; incidentally, a number of these diseases are only known to be avoided by vaccinations. Thus, on these difficult-to-cure diseases, informants may have partitioned into two groups by responding to these items based on their preferred hypothetical methods of remedy during sickness. In the second plot of Fig. 12 are the posterior mean informant competencies, $D_i$, with 95% HDI’s sorted by their mode culture, in order of their competency. The posterior resulted in two cultures with similar means ($\mu_{D_1}$ = 0.49 and $\mu_{D_2}$ = 0.51) yet markedly different variances ($\tau_{D_1}$ = 15.8 and $\tau_{D_2}$ = 10.3). The first cluster of informants consisted of 13 informants while the second consisted of 10 informants. The informants are clustered by their mode posterior $e_i$, and although the mode was used, for all but one informant, the mean posterior $e_i$ for each chain was within 0.15 of $e_i$ = 1 or $e_i$ = 2. Thus there was a decisive clustering of two groups with near-equal group competencies in their respective truths. This interpretation of the data using the MC-GCM is the first Bayesian CCT model application on the hot–cold data that successfully passes the checks of the eigenvalue pattern and VDI statistic. Next, we begin to inspect the posterior results of the postpartum data set. As mentioned, the optimal model selected was set at T = 2 cultural truth groups with heterogeneous item difficulty. The posterior means of the two cultural truths, $Z_1$ and $Z_2$, were moderately correlated at a value of r = 0.55 across the M = 234 items, and more items were believed to be false for each culture, as the posterior means were $p_1$ = 0.39 and $p_2$ = 0.44.
Of the two groups of informants, the first group of 17 informants was considerably more competent at a posterior mean $\mu_{D_1}$ = 0.68 versus the second at $\mu_{D_2}$ = 0.53, and each had about the same variance of $\tau_{D_1}$ = 9.4 and $\tau_{D_2}$ = 9.2. However, the first group was much tighter toward neutral biases of g = 0.5, with $\tau_{g_1}$ = 11.9, than the second group, with $\tau_{g_2}$ = 4.5. In order to better interpret the groups of informants from the posterior results, we may consider covariates of the data set that were collected during the interview of each informant; particularly, we use the identifiers of each informant that pertain to their experience in birth attendance. Of the 149 women interviewed, 14 were Skilled Birth Attendants (SBA’s) that received formal training in birth attendance, 49 Traditional Birth Attendants (TBA’s) that are without formal training but employ traditional Bangladeshi techniques in birth attendance, and 98 Lay-Women (LW) that are without any occupational experience in birth attending. We begin by noting that although the real data set consisted of informants with three separate identifiers, 14 SBA’s, 49 TBA’s, and 98 LW, the best fit of the data with the MC-GCM consisted of a two-culture, or two-cluster design, which also included item heterogeneity; and such a two-culture design, as discussed below, is supported by the data’s scree plot in Fig. 9. Fig. 13 contains the posterior clustering and mean competencies of the informants in a similar way to that presented for the hot–cold data, yet with a meaningful identifier for each informant: where “S” denotes SBA, “T” denotes TBA, and “L” denotes LW. With regard to the MC-GCM’s posterior results on the clustering, there is an interesting relationship between the previous statistical analysis on the data by Hruschka et al. (2008) and the present one. Particularly, question–response analysis by Hruschka et al. (2008) found that the small SBA group significantly differed from the other two on a variety of questions, but that the other two larger groups, the TBA’s and LW, did not notably differ from each other; this analysis relates to the small, yet informative second factor in the scree plot of Fig. 9, as the SBA group significantly differ from the other groups


Fig. 13. The posterior clustering and mean informant competencies with 95% HDI’s of the inference results of the heterogeneous item difficulty T = 2 MC-GCM on the postpartum data. The bottom text indicates the original identity of the informant: S = Skilled Birth Attendant (SBA), T = Traditional Birth Attendant (TBA), and L = Lay Woman (LW).

but only constitute 9% of all informants. In the present analysis, a three-group cluster of the data by the MC-GCM was not preferred by the DIC over the two-group cluster, thus such three identities do not necessitate three distinct cultural truths. However, as Fig. 13 shows, 13 out of the 14 SBA’s are clustered in one group with four TBA’s, and all others in the next group. From this note, it seems that 13 of the 14 SBA’s have a considerable skill level in their trade while only a few of the 49 TBA’s, four, have a training or knowledge comparable to the SBA’s; however, 3 of these 4 TBA’s have the lowest competencies in the SBA group, so they are expectedly not as competent in the cultural truth of academically-trained SBA’s. With regard to the second cluster of informants in Fig. 13, one of the SBA’s is included in this group. Presumably, this SBA prefers traditional views of postpartum hemorrhage in line with TBA’s and the general Bangladeshi culture of LW, instead of the academically-trained knowledge of SBA’s. The second interesting property of this particular cluster is that the TBA’s are not clearly, generally the most competent members of this group in the subject matter; though, they are also not the least competent, as no TBA is in the bottom 20. These properties may also be due to the fact that there are proportionally far fewer TBA’s, 49, compared to LW, 98. By comparing the two clusters, note that the predominantly SBA group had a higher mean competency than the TBA/LW group, and more neutral biases as well; thus their cultural knowledge was more cohesive. In addition, while the mode posterior $e_i$ was used to determine each informant’s latent clustering, the posterior mean $e_i$ for 145 out of 149 informants in each chain was within 0.03 of $e_i$ = 1 or $e_i$ = 2 and within 0.23 for the remaining 4 informants. A concluding mention of the analysis is that this two-cluster design was preferred over all others, and hence the beliefs of the TBA’s and LW, reflected by the questions asked, were not distinct enough to form two distinct groups using the MC-GCM; even when the memberships of the T = 3 case of the MC-GCM on the data were checked, the same 13 SBA’s were in one group while the other two groups had an indecisive mixture of TBA’s and LW’s. Thus, the MC-GCM and practices included within performed an adequate job of selecting an appropriate number of cultures to describe the data, which is supported both by the scree plot in Fig. 9 and the statistical analysis of the data previously done by Hruschka et al. (2008).
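The clustering statements above rest on each informant's posterior mode membership and on how concentrated the sampled memberships are around that mode. A minimal sketch of that bookkeeping follows, assuming the membership samples for one chain are available as an integer array with one row per MCMC draw and one column per informant, and that any label switching has already been resolved. The function name `mode_membership` and the small array are illustrative only.

```python
import numpy as np

def mode_membership(samples):
    """Return each informant's modal label and the share of draws at that label."""
    samples = np.asarray(samples)
    labels = np.unique(samples)
    # counts[l, i] = number of draws in which informant i carries label labels[l]
    counts = np.stack([(samples == lab).sum(axis=0) for lab in labels])
    mode = labels[counts.argmax(axis=0)]
    share = counts.max(axis=0) / samples.shape[0]
    return mode, share

# Hypothetical membership draws: 4 draws (rows) for 3 informants (columns)
draws = np.array([[1, 2, 2],
                  [1, 2, 1],
                  [1, 1, 2],
                  [1, 2, 2]])
mode, share = mode_membership(draws)
decisive = share >= 0.90    # threshold chosen here for illustration
```

A share close to 1 corresponds to the decisive clusterings described in the text, while a lower share flags informants whose membership is uncertain.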

Experiment

We were interested in whether the MC-GCM could successfully recover information about different cultures based on political knowledge. Forty-five participants answered the same 38-item questionnaire, provided in Appendix C, about their knowledge of political party beliefs. However, two cultures were experimentally instated by assigning 23 participants to answer the questionnaire using their knowledge of what a typical Republican believes, while the other 22 participants were assigned similarly, but with respect to what a typical Democrat believes. The participants were assigned to a party randomly, and their own political viewpoints were not collected. The participants surveyed were university students with ages ranging from 17 to 24; the mean and median ages were 19.3 and 19, respectively. The main results of the analysis are included in Fig. 14. Firstly, different T cases of the MC-GCM were tried on the data. However, a strong preference for the T = 2 case was obtained over many attempts at different T cases. For example, when running the model specified at T = 3 or more mixtures, all chains usually contained only T = 2 mixtures. As expected by this preference in the sampling, the DIC was also the lowest for the T = 2 case and particularly, it preferred the T = 2 case with heterogeneous item difficulty at DIC = 1669.1, versus homogeneous item difficulty with DIC = 2380.7. This preference by the DIC was in line with our prior belief that some questions on the questionnaire would be easier to answer correctly than others. The posterior predictive check results of the heterogeneous item difficulty T = 2 MC-GCM are included in the left column of Fig. 14. With regard to the eigenvalue posterior predictive check, the inference results produce a pattern similar to that of the real data, which shows activity in the second factor, supporting a two-truth design, though some of the posterior predictive data underestimate this second factor. With respect to the VDI check, an interesting result is evident in which the heterogeneous item difficulty model actually produced lower VDI’s on average more often than the homogeneous item difficulty model. However, the check is still passed at the 95% level as the VDI of the data is in the 91st percentile of the posterior predicted data. With respect to the clustering results of the MC-GCM on the data, as shown in the top-right plot of Fig. 14, two groups were


Fig. 14. The inference results of the hierarchical MC-GCM on the political data set. The left column of plots contain the posterior predictive eigenvalue and VDI checks while the right column contains the posterior truths of each culture and the competencies and clusterings of the informants. The black line in the eigenvalue check represents the series of eigenvalues for the actual data while the vertical line in the VDI check is the VDI of the actual data.

obtained, which are distinctly Democrat and Republican, as the informants were assigned. These two groups had similar variance in competency with $\tau_{D_R}$ = 3.9 and $\tau_{D_D}$ = 3.8, yet the Republican group had a higher average competency $\mu_{D_R}$ = 0.57 than the Democratic group $\mu_{D_D}$ = 0.47. The Republican group included 21 informants that were all assigned Republican, while the other group of size 24 were all assigned Democrat with the exception of two informants. All of the informants, except those two, had answer patterns that correlated highly with the consensus answer key for the group they were assigned to. These two exceptions were assigned to the Republican party but as they were clustered into the Democratic group, it was found that they had among the lowest posterior mean competencies of 0.2, and their responses were negatively correlated with the Republican posterior mean cultural truth at −0.24 and −0.12, but positively correlated with the Democratic posterior mean cultural truth at 0.22 and 0.16. Our interpretation is that these two subjects may not have followed the instructions. The posterior mean $e_i$’s of all the informants were within 0.03 of their mode membership, and so these two informants were rarely clustered in the other group in the posterior results of the model. As the posterior group mean competency of the Democratic group was lower, and as we will show, the posterior mean truths for the Democratic group were slightly less decisive on some items, these factors also contribute to the greater likelihood of clustering these two Republican informants into the Democratic group. The posterior mean truths of each cluster are represented in the bottom right plot of Fig. 14. In order to obtain a strong clustering, especially with the unexpected level of political knowledge that young university students might have, the questionnaire was formulated such that almost all questions would be answered oppositely by each cluster. In addition, more questions (23 out of 38) were formulated such that “True” would be the expected response for the Republican party, and “False” for the Democratic party. According to the figure, the clustering of the responses of the informants shows a strong recovery of two divergent cultural truths. In fact, the Pearson r correlation of the posterior mean truths of the two cultures was $\rho(Z_D, Z_R)$ = −0.96. Overall, the Democratic group’s cultural truth was slightly more

flexible than the Republican’s, as it had three more less-decisive posterior mean truths than the Republican group, which were values not at either 0 or 1. Between the two groups, the four questions that had considerable indecisive posterior truth means between 0.35 and 0.75 were coincidentally questions that involved highly-debated topics such as abortion (Item 12), same-sex marriage (Item 15), phone/email monitoring with regard to terrorism prevention (Item 16), and whether unions should be able to provide unlimited campaign funding (Item 14); these items correspond respectively to questionnaire items 29, 15, 2, and 6 in Appendix C. Incidentally, these questions had among the highest posterior mean item difficulties, $\delta_k$, at 0.79, 0.90, 0.71, and 0.85. Particularly out of all questions, the same-sex marriage question had the highest posterior mean item difficulty. According to the ordering of truths in Fig. 14, items 9 to 21 had the highest values of item difficulty between $\delta_k$ = 0.6 and 0.9 while the other items had lower difficulties ranging from $\delta_k$ = 0.06 to 0.55; and thus there was some relationship between larger values of item difficulty and how close the posterior mean $Z_k$ is to 1 or 0. The lowest posterior mean item difficulty was question 24, regarding whether “the Supreme court is too conservative”, at a value of 0.06. The next three lowest posterior mean $\delta_k$’s were near 0.13 and involved questions 7, 9, and 11 in Appendix C, which involved the government as it relates to issues of social change, fiscal limitations on private enterprises, and poverty.

5. Discussion

The GCM for dichotomous responses has been widely-used in the social and behavioral sciences, especially cultural anthropology, but data sets have been acquired in these fields that are not appropriately interpretable by the single-truth design of the GCM. The present paper resolves this problem by developing the MC-GCM, which is the first CCT model applicable to latent multi-truth data, and can cluster informants according to each cluster’s shared latent truth. The model is clearly presented via axioms and essential properties, and is also estimated in a hierarchical Bayesian framework. Based on the fundamental GCM properties, a method


for determining when to use the MC-GCM over the GCM is provided, as well as an important posterior predictive check, based on informant-by-informant correlations, to verify if the model predicts data with a consensus structure similar to the actual data; in addition, another posterior predictive model check, based on the VDI in (15), is provided to determine if the inclusion of item heterogeneity with the model is necessary. When one wants to compare different numbers of finite mixtures with the MC-GCM that pass these checks, a model selection criterion, such as the DIC or BPIC, is suggested, and these are shown to select the appropriate model in simulated data and not select over-fitted mixtures. Furthermore, on real data, these model selection criteria, in combination with the VDI and eigenvalue posterior predictive checks, were found to appropriately select mixture numbers of the MC-GCM that were supported by the factor-design of the real data, in addition to prior statistical analyses of the data. However, the user of the MC-GCM may consider using other or additional model selection criteria or posterior predictive checks, depending on his or her preferences for model-selection judgments. As an alternative to using the DIC, BPIC, or Bayes Factor as model selection criteria, a proxy method comparable to the product space method by Lodewyckx et al. (2011) may be employable. In our experience applying the MC-GCM to both simulated and real data, the preferred number of T mixtures exhibited across chains in the MC-GCM was also preferred by the DIC and strongly satisfied the posterior predictive checks. Here, a heuristic for calculating the number of mixtures in a chain is to count the number of unique mode membership values, $e_i$, across informants.
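The counting heuristic just described can be sketched in a few lines. The code below assumes each chain's membership samples are stored as an integer array with one row per draw and one column per informant; the function names `occupied_mixtures` and `mixture_frequencies` and the tiny example chains are hypothetical. It counts the unique modal labels within each chain and, as a natural extension, tallies how often each count occurs over a collection of chains.

```python
import numpy as np
from collections import Counter

def occupied_mixtures(chain_samples):
    """Number of unique modal membership labels across informants in ONE chain."""
    chain_samples = np.asarray(chain_samples)
    modes = []
    for column in chain_samples.T:              # one column per informant
        values, counts = np.unique(column, return_counts=True)
        modes.append(int(values[counts.argmax()]))
    return len(set(modes))

def mixture_frequencies(chains):
    """Relative frequency of each occupied-mixture count over many chains."""
    tally = Counter(occupied_mixtures(c) for c in chains)
    total = sum(tally.values())
    return {k: v / total for k, v in sorted(tally.items())}

# Hypothetical chains: labels may range up to the specified T, few are occupied
chain_a = [[1, 1, 2], [1, 1, 2], [1, 3, 2]]     # modes 1, 1, 2 -> 2 mixtures
chain_b = [[4, 4, 4], [4, 4, 4]]                # modes 4, 4, 4 -> 1 mixture
chain_c = [[1, 2, 3], [1, 2, 3]]                # modes 1, 2, 3 -> 3 mixtures
freqs = mixture_frequencies([chain_a, chain_a, chain_b, chain_c])
```

Because only the number of distinct labels is counted, this summary does not depend on which numeric label a cluster happens to receive in a given chain.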
Then with sufficient computing power, one could run 1000 or more chains of the MC-GCM specified at T = 10, to obtain a related statistic for the relative probability of each number of mixtures on a given data set. Another modification that a user may wish to employ is heterogeneous item difficulty specialized by culture. That is, the previous $\delta_k$ would be doubly-indexed by culture as $\delta_{tk}$, where $\mu_\delta$ remains 1/2 for each culture yet $\tau_{\delta_t}$ is specialized by culture, using the priors specified in Section 2.3. Then (5) for calculating the $D_{ik}$ would be

$$D_{ik} = \frac{\theta_i (1 - \delta_{e_i k})}{\theta_i (1 - \delta_{e_i k}) + \delta_{e_i k} (1 - \theta_i)}, \tag{16}$$

where the t subscript on the $\delta_{tk}$ is indexed by the informant’s group membership, $e_i$. In our analysis of using a $\delta_{tk}$ rather than $\delta_k$ on real data, the $\delta_k$ consistently provided better fits, as the DIC preferred the $\delta_k$ in each case, and the mixing using $\delta_k$ was significantly better as well. However, there may be real data sets that have a strong case for heterogeneous item difficulty across cultures in which this specification of the model is necessary. In addition, much larger data sets may be needed for the inclusion of $\delta_{tk}$. Using this setting of item difficulty is a development for future research. In the spirit that CCT models are often used as measurement tools, in which the posterior means are important for understanding the latent truths and informant characteristics, it was shown that the model performs well in locally recovering generating values of simulated data of both parameters and hyperparameters. Now, researchers may use the MC-GCM to measure such qualities, as reflected by the parameters, on multi-truth data sets. In additional research, we found that the MC-GCM is robust in that it was able to recover latent truths and informant parameters exceptionally well, even with as few as N = 6 informants per truth in an M = 40 item setting with three truths, provided that the average group competencies are 0.5 or higher. This was inspired by a previous test of robustness of the GCM in Batchelder and Anders (2012), which likewise exhibited a strong recovery when applied to a data set of just N = 6 informants on one truth.
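Returning to the culture-specific item difficulty of Eq. (16), the computation can be written as a short vectorized function. This is a sketch under stated assumptions: the array layout (abilities of length N, a T-by-M difficulty matrix, zero-based membership labels) and the function name `competency_by_culture` are illustrative choices, not part of the paper's JAGS code.

```python
import numpy as np

def competency_by_culture(theta, delta, membership):
    """Eq. (16): D_ik from ability theta_i and difficulty delta_{e_i k}.

    theta:      shape (N,)   informant abilities in (0, 1)
    delta:      shape (T, M) item difficulties per culture in (0, 1)
    membership: shape (N,)   zero-based culture label e_i for each informant
    """
    theta = np.asarray(theta, dtype=float)[:, None]
    d = np.asarray(delta, dtype=float)[np.asarray(membership)]  # row e_i per informant
    return theta * (1.0 - d) / (theta * (1.0 - d) + d * (1.0 - theta))

# Hypothetical values: two informants, two cultures, two items
theta = [0.8, 0.5]
delta = [[0.5, 0.2],    # culture 0
         [0.5, 0.8]]    # culture 1
membership = [0, 1]
D = competency_by_culture(theta, delta, membership)
```

When an item's difficulty equals 1/2, the expression reduces to D_ik = θ_i, which is one way to read the choice of keeping the difficulty mean at 1/2 for each culture.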


Finally, new interpretations of two published data sets that were not previously appropriately analyzed were presented using the MC-GCM. First, it was verified that the classic CCT Guatemalan hot–cold data set may be appropriately interpreted in a two-truth design whereas before it was considered to be without any clear consensus structure. Particularly, two markedly distinct truths were retrieved in addition to a decisive clustering of the informants into two, nearly-equally sized groups. The second data set on Bangladeshi postpartum hemorrhage beliefs, was most-preferred by the MC-GCM under the T = 2 case. Incidentally, the real data had three groups, but both the analytical methods in the present paper, and statistical measures by researchers found evidence mainly for one small group and a second much larger group. The posterior clustering results of the MC-GCM found just this, and the model successfully passed all posterior predictive results. The analyses from this paper have further found that the MCGCM is a generally robust model, which can successfully recover latent truths, clusters, and informant parameters on data sets that have as few as six informants per culture. However despite these advantages, as mentioned previously, due to label-switching and mixing phenomena, which are naturally typical of many finite mixture models in Bayesian inference, it is not an easy model to apply for an entry-level Bayesian user. Since CCT is historically known for its accessibility and use in a variety of fields within the social sciences, it is intended that this model will be eventually, readily-usable by other researchers. Thus, it remains a future project to develop and deploy the computer software for a fully user-friendly GUI of the model that can automatically handle any issues of label-switching or mixing phenomena for when it may occur. 
Despite these practical hurdles, the development, analysis, and application of the model in the present paper are considered to be a significant leap forward in the toolkit and applicability of CCT.

Acknowledgments

Work on this paper was supported by grants to the second author from the U.S. Air Force Office of Scientific Research (AFOSR), the Army Research Office (ARO), and the Research Fellowship from the Oak Ridge Institute for Science and Education (ORISE) to support research with Michael Young at the Wright Patterson Air Force Base. We would like to sincerely thank Daniel Hruschka and Lynn Sibley for making available their postpartum hemorrhage data set. We would also like to thank the Action Editor and Referee for their constructive feedback, as well as Michael Lee and Zita Oravecz for their helpful advice.

Appendix A. Proof of Theorem 1

Theorem 1. For the MC-GCM defined by Axioms 1G–3G, it is found that for any number of truths, T,

$$\forall i, j = 1, \ldots, N \ni i \neq j, \quad \rho(X_{iK}, X_{jK}) = \rho(X_{iK}, Z_{e_i K})\,\rho(X_{jK}, Z_{e_j K})\,\rho(Z_{e_i K}, Z_{e_j K}).$$

Proof. With the foundations of (1), (2), and (7), we first use (8) to get

$$\rho(X_{iK}, X_{jK}) = \frac{E(X_{iK} X_{jK}) - E(X_{iK})E(X_{jK})}{\sqrt{\mathrm{Var}(X_{iK})\,\mathrm{Var}(X_{jK})}}.$$

Let

$$\alpha_{11} = \sum_{k=1}^{M} Z_{e_i k} Z_{e_j k}/M, \qquad \alpha_{10} = \sum_{k=1}^{M} Z_{e_i k}(1 - Z_{e_j k})/M,$$

$$\alpha_{01} = \sum_{k=1}^{M} (1 - Z_{e_i k}) Z_{e_j k}/M, \qquad \alpha_{00} = \sum_{k=1}^{M} (1 - Z_{e_i k})(1 - Z_{e_j k})/M.$$

Clearly, $\alpha_{11} + \alpha_{10} + \alpha_{01} + \alpha_{00} = 1$. Now,

$$E(X_{iK}) = E[E(X_{iK} \mid K)] = \frac{1}{M}\sum_{k=1}^{M} \Pr(X_{ik} = 1) = (\alpha_{11} + \alpha_{10})H_i + (\alpha_{01} + \alpha_{00})F_i,$$

$$E(X_{jK}) = E[E(X_{jK} \mid K)] = \frac{1}{M}\sum_{k=1}^{M} \Pr(X_{jk} = 1) = (\alpha_{11} + \alpha_{01})H_j + (\alpha_{10} + \alpha_{00})F_j,$$

and

$$E(X_{iK} X_{jK}) = \alpha_{11} H_i H_j + \alpha_{10} H_i F_j + \alpha_{01} H_j F_i + \alpha_{00} F_i F_j.$$

From these results it is easy to see that

$$\begin{aligned} E(X_{iK} X_{jK}) - E(X_{iK})E(X_{jK}) ={}& H_i H_j[\alpha_{11} - (\alpha_{11} + \alpha_{10})(\alpha_{11} + \alpha_{01})] \\ &+ H_i F_j[\alpha_{10} - (\alpha_{11} + \alpha_{10})(\alpha_{10} + \alpha_{00})] \\ &+ F_i H_j[\alpha_{01} - (\alpha_{01} + \alpha_{00})(\alpha_{11} + \alpha_{01})] \\ &+ F_i F_j[\alpha_{00} - (\alpha_{01} + \alpha_{00})(\alpha_{10} + \alpha_{00})]. \end{aligned}$$

Now the four terms in brackets can be simplified as follows. Let $A = \alpha_{11}\alpha_{00} - \alpha_{10}\alpha_{01}$. Then simple algebra shows that

$$E(X_{iK} X_{jK}) - E(X_{iK})E(X_{jK}) = A(H_i H_j + F_i F_j - H_i F_j - F_i H_j) = A(H_i - F_i)(H_j - F_j);$$

for example, the term with $H_i H_j$ becomes

$$\alpha_{11}(1 - \alpha_{11}) - (\alpha_{11}\alpha_{01} + \alpha_{10}\alpha_{01} + \alpha_{11}\alpha_{10}) = \alpha_{11}(\alpha_{10} + \alpha_{01} + \alpha_{00}) - (\alpha_{11}\alpha_{01} + \alpha_{10}\alpha_{01} + \alpha_{11}\alpha_{10}) = A,$$

and the other three terms simplify similarly. Thus we have derived that

$$\rho(X_{iK}, X_{jK}) = \frac{A(H_i - F_i)(H_j - F_j)}{\sqrt{\mathrm{Var}(X_{iK})\,\mathrm{Var}(X_{jK})}}.$$

Next we turn to the second term in the theorem. From (9) about the GCM, as shown in Batchelder and Anders (2012), we established that

$$\rho(X_{iK}, Z_{e_i K}) = \frac{\pi_i(1 - \pi_i)(H_i - F_i)}{\sqrt{\mathrm{Var}(X_{iK})\,\mathrm{Var}(Z_{e_i K})}},$$

where in our case $\pi_i = \alpha_{11} + \alpha_{10}$ and, analogously, $\pi_j = \alpha_{11} + \alpha_{01}$. Further,

$$\rho(Z_{e_i K}, Z_{e_j K}) = \frac{E(Z_{e_i K} Z_{e_j K}) - E(Z_{e_i K})E(Z_{e_j K})}{\sqrt{\mathrm{Var}(Z_{e_i K})\,\mathrm{Var}(Z_{e_j K})}},$$

and

$$\mathrm{Var}(Z_{e_i K}) = E(Z_{e_i K}^2) - [E(Z_{e_i K})]^2 = \pi_i(1 - \pi_i),$$

and since

$$E(Z_{e_i K} Z_{e_j K}) - E(Z_{e_i K})E(Z_{e_j K}) = \alpha_{11} - (\alpha_{11} + \alpha_{10})(\alpha_{11} + \alpha_{01}) = A,$$

it is easy to see that

$$\rho(X_{iK}, Z_{e_i K})\,\rho(X_{jK}, Z_{e_j K})\,\rho(Z_{e_i K}, Z_{e_j K}) = \frac{A(H_i - F_i)(H_j - F_j)}{\sqrt{\mathrm{Var}(X_{iK})\,\mathrm{Var}(X_{jK})}}. \qquad \square$$

Appendix B. JAGS model code

Multiple Culture General Condorcet Model (MC-GCM)

    # Axioms 1G-4G (Homogeneous Item Difficulty)
    model{
      # Data: Y[i,k] is the response of informant i to item k
      for (i in 1:n){
        for (k in 1:m){
          pY[i,k] <- (D[i]*z[k,e[i]]) + ((1-D[i])*g[i])
          Y[i,k] ~ dbern(pY[i,k])
        }
      }
      # Parameters: group membership, competence, guessing bias
      for (i in 1:n){
        e[i] ~ dcat(pe)
        D[i] ~ dbeta(dmu[e[i]]*dth[e[i]], (1-dmu[e[i]])*dth[e[i]])
        g[i] ~ dbeta(gmu[e[i]]*gth[e[i]], (1-gmu[e[i]])*gth[e[i]])
      }
      # Consensus answer key for each truth t
      for (t in 1:T){
        for (k in 1:m){
          z[k,t] ~ dbern(p[t])
        }
      }
      # Hyperparameters
      alph <- 2
      gsmu <- 10
      gssig <- 10
      dsmu <- 10
      dssig <- 10
      pe[1:T] ~ ddirch(alpha)
      for (t in 1:T){
        alpha[t] <- 1
        gmu[t] <- .5
        gth[t] ~ dgamma(pow(gsmu,2)/pow(gssig,2), gsmu/pow(gssig,2))
        dmu[t] ~ dbeta(alph,alph)
        dth[t] ~ dgamma(pow(dsmu,2)/pow(dssig,2), dsmu/pow(dssig,2))
        p[t] ~ dunif(0,1)
      }
    }

    # Axioms 1G-5G (Heterogeneous Item Difficulty)
    model{
      # Data: competence D[i,k] depends on ability theta[i] and difficulty delta[k]
      for (i in 1:n){
        for (k in 1:m){
          D[i,k] <- (theta[i]*(1-delta[k])) / ((theta[i]*(1-delta[k]))+(delta[k]*(1-theta[i])))
          pY[i,k] <- (D[i,k]*z[k,e[i]]) + ((1-D[i,k])*g[i])
          Y[i,k] ~ dbern(pY[i,k])
        }
      }
      # Parameters: group membership, ability, guessing bias
      for (i in 1:n){
        e[i] ~ dcat(pe)
        theta[i] ~ dbeta(dmu[e[i]]*dth[e[i]], (1-dmu[e[i]])*dth[e[i]])
        g[i] ~ dbeta(gmu[e[i]]*gth[e[i]], (1-gmu[e[i]])*gth[e[i]])
      }
      # Consensus answer keys and item difficulties
      for (k in 1:m){
        for (t in 1:T){
          z[k,t] ~ dbern(p[t])
        }
        delta[k] ~ dbeta(idmu*idth, (1-idmu)*idth)
      }
      # Hyperparameters
      alph <- 2
      gsmu <- 10
      gssig <- 10
      dsmu <- 10
      dssig <- 10
      pe[1:T] ~ ddirch(alpha)
      idsmu <- 10
      idssig <- 10
      idmu <- .5
      for (t in 1:T){
        alpha[t] <- 1
        gmu[t] <- .5
        gth[t] ~ dgamma(pow(gsmu,2)/pow(gssig,2), gsmu/pow(gssig,2))
        dmu[t] ~ dbeta(alph,alph)
        dth[t] ~ dgamma(pow(dsmu,2)/pow(dssig,2), dsmu/pow(dssig,2))
        p[t] ~ dunif(0,1)
      }
      idth ~ dgamma(pow(idsmu,2)/pow(idssig,2), idsmu/pow(idssig,2))
    }

Appendix C. Political questionnaire

All questions were prefaced with, e.g., ‘‘A typical Republican (believes) . . .’’, or what a typical Democrat believes, depending on the informant’s experiment group assignment.

1. . . . that the current economic crisis was caused by excessive government support policies.
2. . . . that the government should monitor our phone/email conversations to check for terrorist activity.
3. . . . that some communist policies would be effective if applied in the United States.
4. . . . that illegal immigration is a considerable problem in the United States that should be given attention.


5. . . . that private corporations should be able to contribute unlimited amounts to electoral campaigns.
6. . . . that unions should be able to contribute unlimited amounts to electoral campaigns.
7. . . . that politicians should make decisions affecting social change and social groups based on their morals.
8. . . . that it is acceptable for individuals to rely on the government during difficult financial times.
9. . . . that there are too many fiscal limitations on private enterprises in the United States.
10. . . . that the United States should use military interventions to end totalitarian regimes across the world.
11. . . . that poverty and inequality in the United States are the results of excessive government support.
12. . . . would support higher taxes on large businesses in order to keep Social Security benefits for the elderly.
13. . . . would support higher taxes on individuals making more than $1,000,000.
14. . . . that Medicare and Medicaid should be expanded to support more people.
15. . . . that the issue of same-sex marriage should be decided by individual state governments.
16. . . . would prefer to see the Affordable Care Act repealed.
17. . . . that mining and oil-extracting companies face too many environmental limitations or hurdles.
18. . . . would likely support a candidate who claims religion and/or prayer as a guiding force in his/her decision-making.
19. . . . would prefer to support a flat tax for individuals of all income brackets.
20. . . . would prefer to support a flat tax for private businesses of all sizes.
21. . . . that the government should work toward maintaining Social Security solvent for many generations to come.
22. . . . that continued government support for the unemployed diminishes their incentive to find employment.
23. . . . that enacting gun control legislation amounts to a violation of the Second Amendment.
24. . . . that the current Supreme Court is too conservative.
25. . . . that the U.S. should invest more in the military than in Social Security and other social programs that support low-income families and the elderly.
26. . . . that judges making unpopular decisions on social issues are activist judges.
27. . . . that a good way to secure our border with Mexico is to build a wall across it.
28. . . . that it is a good idea to allow individual states to demand immigration papers during routine encounters with the local or state police.
29. . . . that the current Congress should limit abortion less.
30. . . . that the government’s size is unsustainably large.
31. . . . that generally, there are too many people who rely on government assistance.
32. . . . has a desire to see the federal government shrink.
33. . . . considering the number of recent shootings, that a new debate on gun control and gun rights is necessary.
34. . . . that the United States should intervene more readily in foreign conflicts and affairs.
35. . . . that the Dream Act is a good step toward a more general solution to immigration issues.
36. . . . that the military is better with the repeal of the don’t ask don’t tell policy.


37. . . . that the federal government should resume defending the Defense of Marriage Act.
38. . . . would likely vote for a candidate who describes himself/herself as an atheist or agnostic.

References

Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 94, 443–458.
Batchelder, W. H., & Anders, R. (2012). Cultural consensus theory: comparing different concepts of cultural truth. Journal of Mathematical Psychology, 56, 316–332.
Batchelder, W. H., & Romney, A. K. (1986). The statistical analysis of a general Condorcet model for dichotomous choice situations. In B. Grofman, & G. Owen (Eds.), Information pooling and group decision making: proceedings of the second University of California Irvine conference on political economy (pp. 103–112). Greenwich, Conn.: JAI Press.
Batchelder, W. H., & Romney, A. K. (1988). Test theory without an answer key. Psychometrika, 53, 71–92.
Batchelder, W. H., & Romney, A. K. (1989). New results in test theory without an answer key. In Roskam (Ed.), Mathematical psychology in progress (pp. 229–248). Heidelberg, Germany: Springer-Verlag.
Carlin, B., & Spiegelhalter, D. (2007). Comment on estimating the integrated likelihood via posterior simulation using the harmonic mean identity, by A. E. Raftery, M. A. Newton, J. M. Satagopan, and P. N. Krivitsky. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, & M. West (Eds.), Bayesian Statistics 8 (pp. 403–406). Oxford: Oxford University Press.
Comrey, A. L. (1962). The minimum residual method of factor analysis. Psychological Reports, 11, 15–18.
Crowther, C. S., Batchelder, W. H., & Hu, X. (1995). A measurement-theoretic analysis of the fuzzy logic model of perception. Psychological Review, 102, 396–408.
Embretson, S., & Reise, S. (2000). Item response theory for psychologists. Mahwah, N.J.: Erlbaum.
Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: recent developments and applications. New York: Springer-Verlag.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (second ed.). Boca Raton, FL.: Chapman & Hall/CRC.
Hruschka, D. J., Kalim, N., Edmonds, J., & Sibley, L. (2008). When there is more than one answer key: cultural theories of postpartum hemorrhage in Matlab, Bangladesh. Field Methods, 20, 315–337.
Karabatsos, G., & Batchelder, W. H. (2003). Markov chain estimation methods for test theory without an answer key. Psychometrika, 68, 373–389.
Kruschke, J. K. (2011). Doing Bayesian data analysis: a tutorial with R and BUGS.
Lodewyckx, T., Kim, W., Lee, M., Tuerlinckx, F., Kuppens, P., & Wagenmakers, E. (2011). A tutorial on Bayes factor estimation with the product space method. Journal of Mathematical Psychology, 55, 331–347.
Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: a user’s guide (second ed.). Mahwah, N.J.: Erlbaum.
Maher, K. M. (1987). A multiple choice model for aggregating group knowledge and estimating individual competencies. Ph.D. dissertation, University of California, Irvine.
Merkle, E. C., Smithson, M., & Verkuilen, J. (2011). Hierarchical models of simple mechanisms underlying confidence in decision making. Journal of Mathematical Psychology, 55, 57–67.
Mueller, S., & Veinott, E. (2008). Cultural mixture modeling: identifying cultural consensus (and disagreement) using finite mixture modeling. In Proceedings of the cognitive science society.
Oravecz, Z., Vandekerckhove, J., & Batchelder, W. H. Bayesian cultural consensus theory. Field Methods (in press).
Plummer, M. (2003). JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling.
Plummer, M. (2012). rjags: Bayesian graphical models using MCMC. R package version 2.2.0-3. http://CRAN.R-project.org/package=rjags.
R Core Team (2012). R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, ISBN: 3-900051-07-0.
Revelle, W. (2012). psych: procedures for psychological, psychometric, and personality research. Evanston, Illinois: Northwestern University, R package version 1.2.1.
Romney, A. K. (1999). Cultural consensus as a statistical model. Current Anthropology, 40, 103–115.
Romney, A. K., & Batchelder, W. H. (1999). Cultural consensus theory. In R. Wilson, & F. Keil (Eds.), The MIT encyclopedia of the cognitive sciences (pp. 208–209). Cambridge, MA.: The MIT Press.
Romney, A. K., Weller, S. C., & Batchelder, W. H. (1986). Culture as consensus: a theory of culture and informant accuracy. American Anthropologist, 88, 313–338.
Spearman, C. E. (1904). ‘General intelligence’ objectively determined and measured. American Journal of Psychology, 15, 72–101.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B, 64, 583–640.
Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 62, 795–809.
Thomas, A., O’Hara, B., Ligges, U., & Sturtz, S. (2006). Making BUGS open. R News, 6, 12–17.
Weller, S. W. (1984). Cross-cultural concept of illness: variation and validation. American Anthropologist, 86, 341–351.
Weller, S. W. (2007). Cultural consensus theory: applications and frequently asked questions. Field Methods, 19, 339–368.