Unsupervised consensus analysis for on-line review and questionnaire data

Information Sciences 283 (2014) 241–257
http://dx.doi.org/10.1016/j.ins.2014.06.015

Stephen L. France a,*, William H. Batchelder b,1

a Lubar School of Business, University of Wisconsin – Milwaukee, Milwaukee, WI 53201, United States
b School of Social Sciences, University of California, Irvine, Irvine, CA 92697, United States

* Corresponding author. Tel.: +1 4142294596. E-mail address: [email protected] (S.L. France).
1 The second author acknowledges the support of a grant from the Army Research Office (ARO) and a Fellowship from the Oak Ridge Institute for Science and Education (ORISE).

Article history: Received 15 January 2013; Received in revised form 4 January 2014; Accepted 9 June 2014; Available online 30 June 2014

Keywords: Clusterwise; Consensus; k-Means; Maximum likelihood

Abstract

We describe a set of Cultural Consensus Theory (CCT) models for analyzing review and questionnaire data. The basic single culture/cluster model can be used to estimate user competencies, user biases, and aggregate review scores. The model is unsupervised and only utilizes the input review scores. A maximum likelihood approach is used to estimate the model. We expand existing work by developing a clusterwise multi-culture continuous CCT model, for which we use the acronym CONSCLUS (CONSensus CLUStering). The original single culture CCT model is a special one-cluster case of CONSCLUS. We show that when all user competencies are equal, CONSCLUS is equivalent to k-means clustering. CONSCLUS is estimated using an alternating least squares variant of the algorithm for k-means clustering, which we denote as CCT-Means. CONSCLUS is a partitioning clustering technique. We describe extensions to CONSCLUS to incorporate fuzzy clustering and overlapping clustering. We run a series of simulation experiments using generated data with random error. We test both the single cluster and multiple cluster models. These experiments show that CONSCLUS is able to recover aggregate rating values and latent cluster assignments better than a range of other aggregation methods. The performance increase over the other aggregation methods is particularly strong when the users have varying competencies. We give an illustrative example using the Movielens dataset. We give a set of recommendations for the practical implementation of CONSCLUS on real world data and show how the user competencies can be used to gain insight into these data that cannot be gained from simple partitioning clustering.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Recent years have seen an explosion in the amount of user generated content available on the internet. In particular, there has been an increase in the amount of on-line review data generated by consumers. Given the very large volumes of data, there is great scope for the use of data mining models and algorithms. Consider a data set containing multiple product reviews. The reviews can be stored in a sparse n user × m product matrix. There has been a large amount of work under the banner of recommender systems for data of this type [1,9,35]. Given a set of reviews, a recommender system would



typically predict scores for {user, product} pairs for which reviews are not available. Recommender system techniques can be placed under the broad categories of content filtering [44] and collaborative filtering [19,32,37,59]. Content filtering techniques utilize product information, while collaborative filtering techniques utilize review scores and correlational measures between users [47] or between items [38,51]. These categories are not mutually exclusive and recommender systems can incorporate both content filtering and collaborative filtering [40].

Our proposed set of techniques models an aggregate answer or knowledge component for each question/review item and models each user's competency with respect to the knowledge components. Here, competency is defined as a measure of inverse error variance. Consumer preferences and social effects are accounted for by clustering consumers and modeling cluster level preferences. Given the cluster level preferences, each cluster can be considered to have its own knowledge or culture. We describe both a simple single cluster model and a multiple cluster model. The multiple cluster model utilizes cultural consensus theory and uses a clusterwise scheme to model both cluster membership and cluster level effects. The models described in this paper utilize only review scores and so, in the terminology of recommender systems, can be thought of as collaborative rather than content techniques; they are particularly useful when review content information is not available or is of poor quality.

A simple collaborative method for analyzing review scores is to average them; in fact, the average review score is usually reported on review websites. However, an average review score does not account for differing reviewer competencies. For a set of reviews where there are more reviewers with idiosyncratic preferences than reviewers who are unbiased and competent, the arithmetic mean may give a poor overall score.

2. Cultural consensus theory

Cultural Consensus Theory (CCT) [11,12,14,15,48] is an approach to information pooling (aggregation, data fusion). Since its inception, CCT has been utilized across the social and behavioral sciences, especially cultural anthropology. The initial application of CCT [12,48] was to estimate folk medical beliefs in a sample of Guatemalan women using two questions about each of a set of diseases: (1) is the disease contagious? and (2) does the disease require a hot or cold remedy? While exact answers for disease contagiousness are known to modern science and the characterization of diseases as requiring hot or cold remedies comes from ancient folk medical beliefs, isolated cultures may share different beliefs and thus have different consensus answers for such questions. The disease data were analyzed using the classical CCT implementation for dichotomous data, and the model was estimated using a maximum likelihood estimation scheme with fixed bias [13,17,48]. The output parameters were the disease classifications ({hot, cold} and {contagious, non-contagious}) and estimates of how competent the women were with respect to the belief system of their culture.

The primary goal of CCT is to estimate consensus answers from a set of raters (respondents), each of whom provides responses to questions about some aspect of their shared knowledge or beliefs. CCT consists of a set of parametric, cognitive models, each corresponding to a different questionnaire format, e.g., true/false, multiple choice, ordered categories, continuous responses. CCT specifies the consensus answers to the questions as latent variables rather than as known a priori to the researcher. In addition, the models specify latent parameters for the competences (degrees of cultural knowledge) and response biases of the informants.

The continuous CCT model [14] is designed to analyze numerical ratings. For a set of users giving numerical ratings for a set of items, the continuous CCT model gives a set of user competencies, a competency weighted aggregate value for each item, and optionally, a set of user biases. Continuous CCT can be applied to any set of numerical ratings data where multiple users rate multiple items. An exam grading application for continuous CCT is described in [14]. Here, CCT was used to analyze ratings for 50 essays, each rated by 14 raters. Each rater was given the essay prompt, a grading rubric, and some example graded essays. Utilizing CCT for essay grading allows raters to be evaluated, and the competency weighted aggregate values give greater weight to more competent raters than to less competent raters.

We describe the continuous single culture CCT model and extend it to account for multiple cultures. We implement a fixed point estimation procedure for the basic model and combine this procedure with a k-means like clustering procedure for the multiple culture model. We run a series of Monte Carlo experiments to test how well both the basic model and the extended model recover model parameters from error perturbed data. We give an illustrative example, showing how CCT can be applied to on-line review data. Our work provides two major advances in the development of CCT. First, the extended CCT model allows for multiple clusters of users, each with its own latent consensus answer pattern. Second, our estimation approach is able to handle large amounts of missing data, as is often the case with on-line reviews.
Most on-line review scores are collected as ordinal scale data using a Likert scale. However, we treat the data as continuous, for three reasons. The first reason is precedent: most collaborative filtering techniques take on-line review scores as continuous and utilize continuous correlational measures, as researchers have found that these measures give better predictive performance than ordinal, rank order measures [32,59]. The second reason is that there is evidence that when respondents are asked for a specific score (for example, a direct Likert rating scale of 1–10), the underlying latent continuous distances between categories are almost even [36], and thus the data can be considered to be approximately interval scale. Third, a previous CCT paper dealing with a different CCT model for continuous response data showed that the model was a good approximation to Likert scale data [15].


2.1. Models and assumptions

Define X as an n user × m item matrix of questionnaire answers or review scores, z as a 1 × m answer key vector of latent scores, and d as an n × 1 vector of user competencies. A set of axioms or assumptions, described in [14], defines the basic continuous CCT model.

Assumption 1. Common Truth. For a single culture, there is a fixed answer key z, where z_k \in (-\infty, \infty) is the correct answer to question k.

The single culture assumption can be tested empirically from the data. A procedure to check whether raters belong to the same culture is described in [7]. A factor analysis is performed on the inter-rater correlations and a scree plot of the eigenvalues is examined for evidence of a one-factor structure.

Assumption 2. Random Error. Let x_{ik} be the answer given by user i for item k, where x_{ik} \in (-\infty, \infty). The answer is defined in (1) as the sum of the answer key value and random error.

x_{ik} = z_k + \epsilon_{ik}    (1)

where the error values \epsilon_{ik} are normally distributed and have the expected value E(\epsilon_{ik}) = 0.

Assumption 3. Local Independence. The values of \epsilon_{ik} are mutually stochastically independent.

Assumptions 2 and 3 are similar to the equivalent error assumptions for multiple linear regression, so similar considerations should be applied to consensus analysis. When performing consensus analysis, some analysis of residuals could ensure that the assumptions of consensus analysis hold. For example, a Durbin–Watson test [26,27] could be used to test for autocorrelation and a violation of the assumption of independent errors. Residuals could be examined using residual plots or statistical tests of normality, such as the Anderson–Darling [8] and Shapiro–Wilk [53] tests. One must bear these assumptions in mind when modeling, but past work [43,46] has shown that when performing parametric modeling on Likert scale rating/questionnaire data, parametric techniques are relatively robust to violations of the normality and error distribution assumptions and have higher statistical power than nonparametric techniques.

Assumption 4. Inhomogeneous Variances. Define competence as the inverse error variance. Let d_i \in (0, \infty) be a competence parameter for user i and let d_i = \sigma^{-2}(\epsilon_{ik}), where \sigma(\epsilon_{ik}) is the error standard deviation.

The definition of competency is related to error variance with respect to a latent underlying score. Variance adjusted means [30, p. 145] are utilized in multi-sensor data fusion in order to increase the influence of more reliable high quality sensors at the expense of less reliable low quality sensors. Here, a sensor's contribution to an aggregate score is inversely related to its overall variance over time. This works well when the value being measured is relatively constant over time, but if the value is changing rapidly, a high quality sensor would still have high variance. Separating the error variance from the item variance avoids this problem.

The continuous CCT model described in [14] utilizes a maximum likelihood estimation framework. Maximum likelihood estimation is a common methodology for fitting statistical models; a comparison of maximum likelihood estimation to basic least squares estimation is given in [42]. Maximum likelihood allows more flexibility and gives the ability to fit more complex models than least squares estimation, but requires some basic distributional assumptions. A range of metrics, such as the Akaike information criterion [2], can be used along with maximum likelihood estimation to choose the correct statistical model from a range of models [18]. An important factor is the speed of estimation. Initial experimentation showed that the fixed point maximum likelihood estimation procedure described in this paper is 50–80 times faster than a gradient based approach and over 100 times faster than a Bayesian approach incorporating MCMC analysis to estimate the joint posterior distribution of the model parameters. The thrust of this paper is to produce a technique that can be used with large datasets and can simultaneously fit model parameters for multiple cultures. Thus, we utilize maximum likelihood estimation for a combination of flexibility, speed, and scalability.

Assuming a known answer matrix X and Gaussian errors, the general likelihood equation for the Gaussian distribution is given in (2).

L(\boldsymbol{\mu}, \boldsymbol{\sigma} \mid X) = \prod_{k=1}^{m} \prod_{i=1}^{n} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left( -\frac{(x_{ik} - \mu_k)^2}{2\sigma_i^2} \right)    (2)

The vector of means \boldsymbol{\mu} corresponds to the latent answer key vector z. The vector of variances \boldsymbol{\sigma}^2 can be replaced by a vector of competencies d, using the definition given in Assumption 4. The resulting likelihood function is given in (3).

L(\mathbf{d}, \mathbf{z} \mid X) = \prod_{k=1}^{m} \prod_{i=1}^{n} \sqrt{\frac{d_i}{2\pi}} \exp\left( -\frac{d_i (x_{ik} - z_k)^2}{2} \right)    (3)
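To make the single culture model concrete, the following minimal sketch (Python/NumPy; the variable and function names are our illustrative choices, not from the paper) simulates responses under Assumptions 1–4 and evaluates the log of the likelihood in (3):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 50, 30                       # users, items
z = rng.uniform(1, 5, size=m)       # latent answer key (Assumption 1)
d = rng.uniform(0.5, 4.0, size=n)   # competencies = inverse error variances (Assumption 4)

# Responses are the answer key plus independent Gaussian error (Assumptions 2-3);
# user i's error standard deviation is d_i ** -0.5.
X = z[None, :] + rng.normal(0.0, d[:, None] ** -0.5, size=(n, m))

def log_likelihood(d, z, X):
    """Log of the single culture likelihood in Eq. (3)."""
    resid2 = (X - z[None, :]) ** 2
    return float(np.sum(0.5 * np.log(d[:, None] / (2 * np.pi))
                        - 0.5 * d[:, None] * resid2))

print(log_likelihood(d, z, X))
```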


The presence of rater bias can be problematic for psychologically based models of rater behavior [50]. If bias is present and unaccounted for, it can lead to distorted effect sizes and measurement errors [28,33]. We model two types of bias, additive bias and multiplicative bias. To account for these biases, additional bias parameters can be added to (1) to give (4), which results in the bias adjusted likelihood equation given in (5). The additive bias acts as a shift parameter and the multiplicative bias acts as a scaling or sensitivity parameter. For the exam grading scenario given previously, a negative additive bias would be indicative of a tough grader and a positive additive bias would be indicative of an easier grader. A grader who graded most essays close to the middle of the scale would have a lower multiplicative bias than a grader who graded more to the extremes of the scale.

x_{ik} = b_{Mi} z_k + b_{Ai} + \epsilon_{ik}    (4)

L(\mathbf{d}, \mathbf{z}, \mathbf{b}_A, \mathbf{b}_M \mid X) = \prod_{k=1}^{m} \prod_{i=1}^{n} \sqrt{\frac{d_i}{2\pi}} \exp\left( -\frac{d_i (x_{ik} - b_{Ai} - b_{Mi} z_k)^2}{2} \right)    (5)

2.2. Model extensions and fitting

The model described in the previous section assumes only one culture or cluster. Recent attempts have been made to incorporate multiple cultures into dichotomous CCT [7]. We introduce a clusterwise implementation of continuous CCT, which we denote as CONSCLUS (CONSensus CLUStering), in order to account for multiple user clusters. Clusterwise methods are used to simultaneously cluster a group of users and fit a model to each individual cluster. For example, clusterwise linear regression techniques [23,24] fit a separate linear regression function for each cluster. Let P be an n × r cluster assignment matrix,² where p_{il} = 1 if user i is a member of cluster l and p_{il} = 0 otherwise. Each user is assigned to exactly one cluster, so \sum_{l=1}^{r} p_{il} = 1 for all i. The likelihood function for the clusterwise model is given in (6).

L(P, D, Z, B_A, B_M \mid X) = \prod_{k=1}^{m} \prod_{i=1}^{n} \prod_{l=1}^{r} \left( \sqrt{\frac{d_{il}}{2\pi}} \right)^{p_{il}} \exp\left( -\frac{p_{il} d_{il} (x_{ik} - b_{Ail} - b_{Mil} z_{kl})^2}{2} \right),    (6)

where n is the number of users, m is the number of items, and r is the number of clusters. The model can be estimated by maximizing the log-likelihood function. This approach has been used to estimate both the single culture continuous CCT model [14] and the single culture dichotomous CCT model [10]. The log-likelihood function is given in (7).

LL(P, D, Z, B_A, B_M \mid X) = \sum_{k=1}^{m} \sum_{i=1}^{n} \sum_{l=1}^{r} p_{il} \left( \log\sqrt{\frac{d_{il}}{2\pi}} - \frac{d_{il} (x_{ik} - b_{Ail} - b_{Mil} z_{kl})^2}{2} \right)    (7)

The values of x_{ik} are fixed from the data, the values of z_{kl}, d_{il}, b_{Ail}, and b_{Mil} are continuous latent parameters, and the cluster assignments p_{il} are binary latent parameters. The resulting optimization problem is a mixed integer nonlinear program. The log-likelihood function is summed across all i, k, and l. Overall, the continuous parameters can be stored in an r × m answer key matrix Z, an n × r competency matrix D, and optionally an n × r additive bias matrix B_A and an n × r multiplicative bias matrix B_M. Though competencies and biases are defined for every combination of rater i and cluster l, in practice these parameters only need to be recorded when p_{il} = 1 and can be estimated separately for each cluster. Thus, competencies and biases can be summarized in the n × 1 vectors d, b_A, and b_M. Missing data can be dealt with either by estimating missing data prior to estimating the model, using a technique such as statistical imputation [49], or by setting the components of the likelihood function where missing data are present to 1 (which reduces to 0 in the log-likelihood function). For a single cluster l, Eq. (7) can be maximized with respect to the within cluster continuous parameters for users with p_{il} = 1 by partial differentiation, setting the partial derivatives to 0 and solving the resulting first order conditions. The first order conditions are given below.

d_{il} = \frac{m}{\sum_{k=1}^{m} (x_{ik} - b_{Ail} - b_{Mil} z_{kl})^2}    (8)

z_{kl} = \frac{\sum_{i=1}^{n} d_{il} b_{Mil} (x_{ik} - b_{Ail})}{\sum_{i=1}^{n} d_{il} b_{Mil}^2}    (9)

b_{Ail} = \frac{\sum_{k=1}^{m} (x_{ik} - b_{Mil} z_{kl})}{m}    (10)

b_{Mil} = \frac{\sum_{k=1}^{m} z_{kl} (x_{ik} - b_{Ail})}{\sum_{k=1}^{m} z_{kl}^2}    (11)

2 Papers in the CCT literature use k as a question/item iterator, while partitioning clustering papers use k as the number of clusters. We follow the notation of the CCT literature and we denote the number of clusters as r and the cluster iterator as l.
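The first order conditions define the fixed point scheme used throughout this paper: iterate the updates until the parameters converge (discussed further below). The following minimal sketch (Python/NumPy; the NaN-based missing data handling and the per-user observation counts are our illustrative adaptations) implements the bias-free updates (8) and (9); the bias updates (10) and (11) can be interleaved in the same loop:

```python
import numpy as np

def fit_cct(X, n_iter=200, tol=1e-9, d_max=1e6):
    """Fixed point estimation of the bias-free single culture model:
    alternate the competency update (8) and the answer key update (9)
    until convergence. NaN entries of X are treated as missing, so each
    user's Eq. (8) denominator sums only observed residuals."""
    X = np.asarray(X, dtype=float)
    obs = ~np.isnan(X)                    # observed-response mask
    Xf = np.where(obs, X, 0.0)
    z = Xf.sum(axis=0) / obs.sum(axis=0)  # initialize at the item means
    d = np.ones(X.shape[0])
    for _ in range(n_iter):
        # Eq. (8): d_i = m_i / sum_k (x_ik - z_k)^2 over observed items,
        # bounded above by d_max as recommended in the text.
        resid2 = np.where(obs, (Xf - z) ** 2, 0.0).sum(axis=1)
        d_new = np.minimum(obs.sum(axis=1) / np.maximum(resid2, 1e-12), d_max)
        # Eq. (9) with b_A = 0, b_M = 1: the competency-weighted item mean.
        z_new = (d_new[:, None] * Xf).sum(axis=0) / (d_new[:, None] * obs).sum(axis=0)
        converged = np.max(np.abs(z_new - z)) < tol and np.max(np.abs(d_new - d)) < tol
        z, d = z_new, d_new
        if converged:
            break
    return z, d
```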


The parameters can be estimated by iteratively solving Eqs. (8), (9), and optionally (10) and (11) until the parameter values converge. In initial experiments, it was found that the basic model without bias is fully identified, i.e., only one unique combination of parameters gives an optimal solution. However, when bias is introduced into the model, there is some lack of identifiability. A short illustrative proof and a description of a methodology for dealing with identifiability are given in [29]. Given a model with either additive or multiplicative bias, a fully identified model with no bias is fit and then either (i) a single value of z_k is fixed to the value in the fully identified model or (ii) every value of z_k is fixed to the value in the fully identified model. The full model is then run with the fixed parameters set to the values found for the sub-model. For (i), the single value of z_k to be fixed is chosen as the z_k with the lowest error, i.e., the lowest value of \sum_{i=1}^{n} (x_{ik} - z_k)^2.

Initial experiments found that sequentially applying (8)–(11) to estimate the parameters maximizing the model log-likelihood gave good results. The parameters converged quickly on a range of datasets. When there are only a few users, a bound may have to be placed on the value of d. For example, if there is only one user i, then z_k can be set to x_{ik} for all k and d_i can take any value in (0, \infty) without any effect on the likelihood function. Setting the upper bound on d to a number larger than any of the estimated competencies prevents the value of d_i shooting off to infinity during the estimation procedure.

To solve the overall estimation problem, the binary values of P must be estimated along with the continuous parameters. The estimation of P is a partitioning problem. Partitioning clustering is a combinatorially difficult problem to solve; in fact, partitioning clustering with Euclidean distances is NP-hard [5]. For a set of n users partitioned into r classes, there are approximately r^n / r! partitions [58]. To work around the problem of computational complexity, we utilize an algorithm similar to that described by [39] for k-means clustering. To help motivate the algorithm, we first describe the links between k-means clustering and continuous CCT.

Definition 2.1. Partitioning least squares clustering of users. Given n users, m items, an n × m ratings matrix X, and r cluster partitions, the optimal solution to the least squares clustering problem is given by minimizing the sum of squares error criterion given in (12). Each user is a member of a single cluster l. The cluster centroid is defined as \bar{x}_{lk} = \sum_{i \in C_l} x_{ik} / |C_l|, where |C_l| is the number of users in cluster l.

SSE = \sum_{k=1}^{m} \sum_{l=1}^{r} \sum_{i \in C_l} (x_{ik} - \bar{x}_{lk})^2    (12)

Theorem 2.1. For a given partition P, if all competencies are equal and there are no bias parameters in the model, then the least squares clustering solution centroids \bar{x}_l are equal to the optimal consensus analysis answer keys z_l for all l = 1, ..., r.

Proof. The maximization of the CCT log-likelihood (7) can, without any loss of generality, be rewritten as the minimization criterion given in (13).

-LL(P, D, Z, B_A, B_M \mid X) = \sum_{k=1}^{m} \sum_{l=1}^{r} \sum_{i \in C_l} \left( \frac{d_{il} (x_{ik} - b_{Ail} - b_{Mil} z_{kl})^2}{2} - \log\sqrt{\frac{d_{il}}{2\pi}} \right)    (13)

The likelihood function is separable by cluster. For a given cluster l, setting the derivative of (13) with respect to z_{kl} to 0 for the users within the cluster gives the first order condition in (9). If all competencies are equal to some fixed constant d_f and biases are not incorporated in the model (i.e., all b_{Ail} = 0 and all b_{Mil} = 1), then (9) can be expressed as (14). This is the definition of the cluster centroid, so optimizing z_{kl} gives the cluster centroid for cluster l.

z_{kl} = \frac{\sum_{i=1}^{n} d_i b_{Mi} (x_{ik} - b_{Ai})}{\sum_{i=1}^{n} d_i b_{Mi}^2} = \frac{\sum_{i \in C_l} d_f x_{ik}}{\sum_{i \in C_l} d_f} = \frac{\sum_{i \in C_l} x_{ik}}{|C_l|}    (14)  □
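As an illustrative numerical check of Theorem 2.1 (ours, not from the paper), the weighted update (9) with equal competencies and no bias returns the ordinary column means of Eq. (12):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(1, 5, size=(20, 10))   # ratings from one cluster's users
d = np.full(20, 2.5)                   # equal competencies d_f, no bias

z = (d[:, None] * X).sum(axis=0) / d.sum()   # Eq. (9) with b_A = 0, b_M = 1
assert np.allclose(z, X.mean(axis=0))        # equals the centroid of Eq. (12)
```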

Corollary 2.2. When all competencies are equal and there is no bias parameter, the CCT answer key is equal to the answer key calculated using the arithmetic mean.

Proof. This follows from Theorem 2.1, as the definition of the cluster centroid for question k is identical to the arithmetic mean of the answers for question k. □

If all competencies are not equal and bias parameters are incorporated into the model, then CONSCLUS gives the same centroids as least squares clustering with weighted, bias adjusted centroids. Let (15) be the equation for partitioning least squares clustering, where the centroid is weighted by user competencies and offset by bias parameters. The first order conditions for calculating the cluster centroids are identical to those described by (9). However, (15) is not identical to (13). If (15) were optimized with respect to both the values of d_{il} and \bar{x}_{lk}, then a value of SSE = 0 could easily be obtained by setting all values of d_{il} = 0. The log term derived from the Gaussian distribution is essential for identifying the competencies.


SSE = \sum_{k=1}^{m} \sum_{l=1}^{r} \sum_{i \in C_l} d_{il} (x_{ik} - b_{Ail} - b_{Mil} \bar{x}_{lk})^2    (15)

Our proposed algorithm is an adaptation of the original k-means algorithm described in [39]. The algorithm is a type of ALS (alternating least squares) algorithm. It alternates between updating the cluster centroids (weighted by competencies and adjusted for bias) and cluster membership parameters, and updating the bias and competency parameters. The full algorithm, which we name the CCT-Means algorithm, is given below.

Algorithm 2.1. CCT-Means algorithm.

Inputs: An n user × m item review matrix X, the number of clusters r, and the minimum number of users in each cluster, min_l.

Outputs: An n × r cluster membership matrix P, an r × m matrix of answer keys Z, an n × 1 vector of user competencies d, and optionally an n × 1 vector of user additive biases b_A and an n × 1 vector of user multiplicative biases b_M.

Algorithm steps:
1. Calculate initial cluster centroids z_l for each cluster l. There are numerous methods of selecting random solutions; a number of methods are summarized in [58]. We implement two methods.
(a) Given a minimum value a_k and a maximum value b_k for each review item k, randomly sample each value of z_{kl} from the uniform distribution on [a_k, b_k].
(b) From [56]. Assign each user randomly to one of the r clusters, ensuring that |C_l| ≥ min_l for all l = 1, ..., r, i.e., all clusters have at least the minimum number of users. Calculate the centroid z_l for each cluster l, using (9).
2. Set the iteration counter It = 1 and start the main iterative process.
3. Assign each user i to the cluster l that minimizes the distance given in (16), where b_{Ail} = 0 if additive bias is not incorporated into the model and b_{Mil} = 1 if multiplicative bias is not incorporated into the model. If It > 1, record the number of users that have been reassigned to a new cluster; denote this value as |ΔP|.

f(i, l) = \sum_{k=1}^{m} (x_{ik} - b_{Ail} - b_{Mil} z_{kl})^2    (16)

4. For each cluster l, use the fixed point estimation scheme to calculate the answer key z_l using (9), and for all users calculate the user competency parameters d_{il} using (8), and optionally the additive bias parameters b_{Ail} using (10) and the multiplicative bias parameters b_{Mil} using (11).
5. If It > 1 and |ΔP| = 0, terminate the algorithm; otherwise set It := It + 1 and return to step 3.

The algorithm is efficient and in initial experiments it always converged to a locally optimal solution. These experiments showed that the algorithm is quicker than a combinatorial steepest descent local search algorithm by a factor of 50–100. As with the k-means algorithm, CCT-Means is not guaranteed to converge to the globally optimal solution, so multiple random starts are recommended in order to give a good chance of finding this solution. Another advantage of the CCT-Means algorithm is that there is a large body of literature for k-means clustering; innovations developed for k-means clustering can easily be adapted for the CCT-Means algorithm. These innovations include techniques for initializing the algorithm, determining the number of clusters, variable/item standardization, variable/item selection, and detecting influential observations. See [58] for a synthesis and summary. A minimal sketch implementation of the bias-free case is given below.
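The sketch below (Python/NumPy) covers the bias-free case of Algorithm 2.1; initialization method (a), complete data, and no enforcement of the minimum cluster size are our simplifying assumptions:

```python
import numpy as np

def cct_means(X, r, n_outer=100, n_inner=50, seed=0, d_max=1e6):
    """Sketch of Algorithm 2.1 without bias parameters: alternate
    (i) assigning each user to the cluster minimizing Eq. (16) and
    (ii) re-estimating z and d within each cluster via Eqs. (8)-(9)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Step 1, method (a): sample each z_kl uniformly from the item ranges.
    Z = rng.uniform(X.min(axis=0), X.max(axis=0), size=(r, m))
    d = np.ones(n)
    assign = np.full(n, -1)
    for _ in range(n_outer):
        # Step 3: squared distance of each user to each cluster key, Eq. (16).
        dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        new_assign = dist.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # Step 5: stop when |dP| = 0
            break
        assign = new_assign
        # Step 4: within-cluster fixed point updates of z (9) and d (8).
        for l in range(r):
            members = np.flatnonzero(assign == l)
            if members.size == 0:                # empty clusters left unchanged
                continue
            Xl = X[members]
            for _ in range(n_inner):
                Z[l] = (d[members, None] * Xl).sum(axis=0) / d[members].sum()
                resid2 = ((Xl - Z[l]) ** 2).sum(axis=1)
                d[members] = np.minimum(m / np.maximum(resid2, 1e-12), d_max)
    return assign, Z, d
```

A fuller implementation would add the bias updates (10)–(11), enforce min_l, and use multiple random starts.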

3. Experimentation

The CCT techniques described in this paper are unsupervised, so besides the log-likelihood, there is no available measure of overall fit or success. To demonstrate the utility of these techniques, we ran a series of experiments on generated data. It is common to test unsupervised techniques by generating data from the underlying distributions with error and then estimating the model parameters using the error perturbed data; see, for example, [3,21]. We took this experimental approach and generated sets of responses from answer keys with random error. The success of the experiments was measured by testing how well the techniques recovered the underlying model parameters relative to a set of comparison techniques.

3.1. Experiment 1: Single culture

The first experiment was designed to test the basic continuous CCT model with one cluster or culture. Ratings were randomly sampled with error from a set of answer keys, generated to give a range of answer values. The major purpose of the experiment was to show that continuous CCT recovers a set of latent answer keys better than a range of other aggregate measures from ratings sampled with random error. The chosen aggregate measures were the arithmetic mean (AM), the geometric mean (GM), the Gauss arithmetic–geometric mean (AGM) [4], and the variance adjusted mean (VAM) [30, p. 145],


which is commonly used in data pooling applications. In addition, a minimum residual factor analysis (MFA) [22] was performed, as per classical dichotomous CCT, on the inter-user correlations, and the first factor was taken as the user competency vector d. The rationale behind this choice of metrics was to choose several commonly used aggregation measures and several variance adjusted measures. Both the VAM and the MFA method use variance adjusted measures and are intermediate techniques between the standard aggregate measures and CCT. The MFA method gives additional weight to users with large factor scores, i.e., those users with high inter-user correlations. The VAM weights ratings using an inverse measure of the overall user error variance, while CCT weights ratings using an inverse measure of specific error variance.

The experimental factors are listed in Table 1. Overall, there were 5 × 5 × 5 × 6 × 2 × 5 × 5 = 37,500 different combinations of factor levels. We ran 10 replications for each combination of experimental factor levels, giving a total of 37,500 × 10 = 375,000 experimental runs. The steps performed for each experimental run are detailed below; a sketch of the generation procedure follows the list.

1. Create a continuous scale from the lowest possible answer to the largest possible answer.
2. For each review item, randomly select a "true" review score from the underlying distribution (normal or uniform). The range for the uniform distribution is the range of the scale (e.g., from 1–5 if the upper bound is 5). The mean for the normal distribution is the midpoint of the scale and the standard deviation is set so that the range of the scale is equal to 6 standard deviations.
3. Determine the latent competencies. As per Assumption 4, d_i = \sigma^{-2}(\epsilon_{ik}), and so the error standard deviation is equal to d_i^{-1/2}. The competencies are determined by setting the maximum error standard deviation to be 1/y of the continuous range defined in step 1. Values of y = 3, 6, and 9 define the minimum, average, and maximum competencies respectively. The competencies are then selected using the selection scheme described in Table 1.
4. Randomly sample each error variable \epsilon_{ik} from the normal distribution using the standard deviation derived from the competency. Calculate X, setting x_{ik} = z_k + \epsilon_{ik} for each combination of i and k. Select the Likert scale value closest to the sampled continuous value.
5. Randomly remove "missing %" of the data. Components of the likelihood function with missing x_{ik} are counted as 0 in the log-likelihood function.
6. Estimate the cultural consensus model by optimizing the maximum log-likelihood for the continuous parameters.
7. On the same data, estimate the answer key z for each of the comparison aggregation techniques.
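A sketch of one generation run (Python/NumPy; the function and parameter names are ours, and the competency sampling roughly corresponds to scheme 2 with uniformly random competencies):

```python
import numpy as np

def generate_run(n=200, m=200, scale=5, missing=0.4, seed=0):
    """One Experiment 1 style dataset: sample a true answer key, derive
    error SDs from competencies, round responses to the Likert scale,
    and blank out a fraction of entries (steps 1-5)."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(1, scale, size=m)             # step 2, uniform true scores
    sd_max = (scale - 1) / 3.0                    # step 3: max error SD = range/3
    sd = rng.uniform(sd_max / 3.0, sd_max, size=n)
    X = z[None, :] + rng.normal(0.0, sd[:, None], size=(n, m))   # step 4
    X = np.clip(np.rint(X), 1, scale)             # nearest Likert category
    X[rng.random((n, m)) < missing] = np.nan      # step 5: missing data
    return X, z, sd ** -2.0                       # data, true key, competencies

X, z_true, d_true = generate_run()
# Recovery for an estimated key z_hat: MSE(z) = np.sum((z_true - z_hat) ** 2)
```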
For each experimental run, we recorded the mean squared error between the true latent answer key z and the estimated answer key \hat{z}, so that MSE(z) = \sum_{k=1}^{m} (z_k - \hat{z}_k)^2. The data were analyzed using a multi-factor ANOVA with the experimental factors as independent variables and MSE(z) as the dependent variable. All factors were significant with p < 0.001 except for the number of items. Results showing the relative performance of the techniques are summarized in Fig. 1 and Table 2; Fig. 1 plots the value of MSE(z) for each aggregation method relative to the amount of missing data. The performance of CCT is significantly better than that of the other two variance adjusted techniques (VAM and MFA). Table 2 gives the methods in order of performance. Alongside the performance is an indicator of whether the mean MSE(z) is significantly less (<) or greater (>) than the means for every other method at the 95% significance level using the conservative Scheffé [52] post hoc test. To examine the significance of the different experimental factors for the specific CCT runs, we ran an ANOVA (analysis of variance) test with the experimental factors as independent variables and MSE(z) as the dependent variable. To help interpret the effect sizes for the CCT technique, in Table 3 we list the marginal means for MSE(z) for the CCT experimental runs and two different values of each factor.

Table 1. Design for experiment 1.

Factor    | Level                      | Description
Technique | AM, GM, AGM, VAM, MFA, CCT | The aggregation scheme used for the data
No. items | 200, 400, 600, 800, 1000   | The number of items to be evaluated/reviewed
No. users | 200, 400, 600, 800, 1000   | The number of users reviewing the items
Scale     | 5, 6, 7, 8, 9, 10          | The upper bound for the scale (the lower bound is 1). The most common review scale is 1–5, but other scales with finer granularities are also used on some review websites
True dist | Normal, uniform            | The distribution from which the true answer keys are sampled
Missing % | 0, 20, 40, 60, 80          | The percentage of missing data
Scheme    | 1, 2, 3, 4, 5              | 1 – all users have the average competency; 2 – random competencies selected from the uniform distribution; 3 – 25% competent (with 4 times the average competency) and 75% non-competent (with 1/4 of the average competency); 4 – 50% competent and 50% non-competent; 5 – 75% competent and 25% non-competent


Fig. 1. Technique performance relative to missing data.

Table 2. Effect sizes for MSE(z).

Method | MSE(z)  | CCT | MFA | VAM | AM | AGM | GM
CCT    | 0.02030 | –   | <   | <   | <  | <   | <
MFA    | 0.03476 | >   | –   |     | <  | <   | <
VAM    | 0.03628 | >   |     | –   | <  | <   | <
AM     | 0.04285 | >   | >   | >   | –  | <   | <
AGM    | 0.05513 | >   | >   | >   | >  | –   | <
GM     | 0.10423 | >   | >   | >   | >  | >   | –

(< / > indicates that the row method's mean MSE(z) is significantly less/greater than the column method's at the 95% level, Scheffé post hoc test; blank cells indicate non-significant differences.)

Table 3. Effect sizes for MSE(z).

Factor     | Level 1 | Level 2 | Mean 1  | Mean 2  | Cohen's d
No. items  | 200     | 1000    | 0.02753 | 0.01699 | 0.02112
No. users* | 200     | 1000    | 0.02001 | 0.02040 | 0.60303
Scale*     | 5       | 10      | 0.00707 | 0.03658 | 2.01647
True dist* | Norm    | Unfrm   | 0.01679 | 0.02381 | 0.44138
Missing*   | 0       | 80      | 0.01705 | 0.11017 | 0.58429
Scheme*    | 3       | 5       | 0.03441 | 0.04285 | 1.38417

Fig. 2. MSE(z) across the number of users and amount of missing data.

For continuous values, we use the highest and lowest levels of the factor. For the scheme factor, we give the two schemes with the largest effect difference. A star (*) indicates that the difference between the levels given in the table is significant at the 95% level using a Scheffé post hoc test. We also give the Cohen's d effect size (a large effect > 0.8, a medium effect > 0.5, and a small effect > 0.2). All effect sizes are at least medium, except for the number of items and the true distribution. The small effect size for the number of items verifies the lack of significance for the Scheffé test.


From Table 3, one can see that the largest effect sizes are for the number of users and for the percentage of missing data. There is a large effect size for scale, but the mean absolute error (MAE) values are linear with respect to the size of the scale, which is to be expected. In Fig. 2, MSE(z) is plotted across both the number of users and the amount of missing data. For small numbers of users and large amounts of missing data there is a slight increase in error.

The CCT competency weighted mean is essentially the arithmetic mean weighted by the competencies of the raters. A graph of the out-performance of CCT over the arithmetic mean, relative to the scheme used for generating the competencies and the amount of missing data, is given in Fig. 3. For the equal d scheme, all users have equal competency, so if the biases b are 0, Eq. (9) reduces to the simple mean and the optimal CCT solution is the arithmetic mean solution. For the case where competencies are normally distributed around a central point, CCT outperformed the arithmetic mean. However, the largest out-performance was for the schemes where users have either low or high competency. In particular, for the scheme where 75% of users are competent and 25% of users are not competent, CCT greatly outperformed the arithmetic mean. CCT minimizes the effect of the non-competent users by down-weighting them in the calculation of the aggregate ratings.

Overall, continuous CCT gave strong recovery of the answer keys relative to both the simple aggregation methods and the two other variance adjusted methods. The arithmetic mean had the best performance of the simple aggregation methods, which suggests that the geometric mean is not an appropriate measure for aggregating ratings data. The two other variance adjusted metrics, the variance adjusted mean and the minimum residual factor analysis, both gave stronger recovery of the answer keys than the basic arithmetic mean. However, CCT outperformed both these methods. This suggests that there is strong utility in modeling competency as a function of specific error variance rather than as a function of overall item variance.

3.2. Experiment 2: Multiple clusters

The second experiment extended the first experiment to multiple clusters. In this experiment, we tested how well CONSCLUS recovers cluster membership parameters from error perturbed ratings data. As per Experiment 1, we randomly generated answer keys. We used the adjusted Rand index of cluster recovery to measure the proportion of correctly recovered p_{il} cluster assignments. The adjusted Rand index (adj-Rand) [34,57] is based upon the Rand index [45], but accounts for random error, so random recovery of cluster membership would give an expected value of adj-Rand = 0. The within cluster recovery of z was tested in Experiment 1, so results testing this measure are omitted for the sake of brevity.

In order to vary the difficulty of cluster recovery, we sampled a cluster specific answer key z_{kl} for each cluster l and also sampled an overall general answer key \bar{z}_k. For each cluster l and item k, the test answer key value is defined in (17). When λ = 1 and there is a small error standard deviation, the cluster membership parameters should be relatively easy to recover. When λ = 0, cluster recovery will be random, and CONSCLUS should not be expected to recover cluster membership parameters above a baseline random rate. Intermediate values of λ allow the difficulty of cluster recovery to be controlled via the experiment.

z^*_{kl} = \lambda z_{kl} + (1 - \lambda) \bar{z}_k    (17)
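The interpolation in (17) is a single vectorized operation; in the illustrative NumPy fragment below (names ours), λ = 1 gives fully distinct cluster keys and λ = 0 a single shared key:

```python
import numpy as np

rng = np.random.default_rng(2)
m, r, lam = 200, 4, 0.5
z_cluster = rng.uniform(1, 5, size=(r, m))    # cluster specific answer keys
z_general = rng.uniform(1, 5, size=m)         # shared general answer key
z_test = lam * z_cluster + (1.0 - lam) * z_general[None, :]   # Eq. (17)
```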

The experimental design includes most of the factors from Experiment 1, plus additional factors for the number of clusters, the error standard deviation as a proportion of the variable range (as opposed to the fixed competency schemes of Experiment 1), the cluster preference weight λ, and the CCT-Means initialization procedure. Due to the additional factors, only a subset of levels from the original factors was used. The experimental factors are listed in Table 4. Overall, there are 2 × 2 × 2 × 2 × 3 × 3 × 4 × 5 × 2 = 5760 different combinations of factor levels. We ran four replications for each combination of factor levels, giving a total of 5760 × 4 = 23,040 experimental runs. For each experimental run, we carried out 50 random starts for both k-means clustering and CCT-Means clustering, giving a total of 1,152,000 runs for each method. We then selected the best solution out of the 50 for each method. For CCT, we took the solution with the maximum log-likelihood. For k-means clustering, we took the solution with the minimum value of the k-means optimization criterion. The experimental steps are listed below.

Fig. 3. Graph of performance increase by scheme.


Table 4. Design for experiment 2.

Factor         | Level                  | Description
No. items      | 200, 400               | As per Table 1
No. users      | 200, 400               | As per Table 1
Scale          | 5, 10                  | As per Table 1
True score     | Normal, uniform        | As per Table 1
Missing %      | 0, 40, 80              | As per Table 1
Error st. dev. | 1/9, 1/7, 1/5          | The error standard deviation as a proportion of the overall range
No. clus.      | 2, 3, 4, 5             | The number of generated clusters/cultures
Weight λ       | 0, 0.25, 0.50, 0.75, 1 | The proportion of knowledge accounted for by the cluster specific answer key
Initialize     | 1, 2                   | 1 – randomly select points from the variable range; 2 – use the random partitioning procedure described in [56]

1. Create a continuous scale from the lowest possible answer to the largest possible answer.
2. For each review item, randomly select a "true" general review score \bar{z}_k from the underlying distribution (normal or uniform).
3. For each combination of review item k and cluster l, randomly select a "true" review score from the underlying distribution (normal or uniform). Using the experimental value of λ, calculate z^*_{kl} as per (17).
4. Determine cluster membership. Randomly assign each user i to a cluster l. If user i is assigned to cluster l then p_{il} = 1, otherwise p_{il} = 0.
5. Determine the latent competencies. Calculate the competencies as per Experiment 1.
6. Randomly sample the error \epsilon_{ik} as per Experiment 1 and calculate x_{ik} as x_{ik} = \sum_{l=1}^{r} p_{il} z^*_{kl} + \epsilon_{ik}. Select the item category with the closest value to the sampled continuous value.
7. Randomly remove "missing %" of the data and treat it as per Experiment 1.
8. Estimate the CONSCLUS model by running the CCT-Means algorithm.
9. Estimate the k-means clustering model using the basic k-means algorithm.
10. Evaluate the recovery of cluster membership.

Across experimental runs, the average value of adj-Rand for k-means clustering was 0.82155 and the average value of adj-Rand for CONSCLUS was 0.82762. Using the Scheffé post hoc test, the difference between these values is significant at the 95% confidence level. To test the effects of the different experimental factors on the performance of CONSCLUS, we ran an ANOVA test with the experimental factors as independent variables and the value of the adjusted Rand index (adj-Rand) for the CONSCLUS runs as the dependent variable. All factors were significant with p < 0.001 except for the initialization technique. The marginal means for the minimum and maximum levels of each factor are given in Table 5, along with the Cohen's d statistic. A star (*) indicates that the difference between the levels is significant using a Scheffé post hoc test. The Cohen's d value for the weight λ dwarfs the other values. This is because the majority of the experimental variance is due to the value of λ; this variance is included in the Cohen's d calculations for the other factors, lowering their Cohen's d values.

By far the most influential factor is the value of λ. As expected, the value of adj-Rand for λ = 0 is that of random recovery, which is approximately 0, as the adjusted Rand index accounts for random error. If λ = 1 then recovery is perfect for both k-means clustering and CONSCLUS. To examine how recovery increases as λ increases, in Fig. 4 the average value of adj-Rand is plotted against λ. As λ increases from 0 to 0.25, there is a large increase in the value of adj-Rand, from random recovery to a value averaging from 0.65 to 0.87 depending on the number of clusters. For λ = 0.5 and λ = 0.75 there is very strong cluster recovery. This suggests that CONSCLUS is still effective at determining cluster structure when a general answer key accounts for a large proportion of each cluster's answer key. The moderate amount of noise added to the experimental data (error with μ = 0 and σ ranging from 1/9 × Range to 1/5 × Range) was not enough to prevent good cluster recovery (see Fig. 5).

Table 5. Effect sizes for adj-Rand.

Factor         | Level 1 | Level 2 | Mean 1 | Mean 2 | Cohen's d
No. items*     | 200     | 400     | 0.7422 | 0.7562 | 0.05548
No. users*     | 200     | 400     | 0.7409 | 0.7570 | 0.05050
Scale*         | 5       | 10      | 0.7450 | 0.7519 | 0.02658
True dist*     | Norm    | Unfrm   | 0.7320 | 0.7648 | 0.11237
Missing %*     | 0       | 80      | 0.7825 | 0.6991 | 0.30341
Range divisor* | 5       | 9       | 0.7216 | 0.7685 | 0.16023
No. clusters*  | 2       | 5       | 0.8091 | 0.6998 | 0.29248
Initialize     | 1       | 2       | 0.7491 | 0.7478 | 0.00070
Weight λ       | 0       | 1       | 0.0001 | 1.0000 | 13.36813


Fig. 4. Graph of adj-Rand relative to k for CONSCLUS.

Fig. 5. Comparison between CONSCLUS and k-means clustering.

As per Experiment 1, we examined the relative performance of the tested techniques. In Fig. 5, average values of adj-Rand are given for each combination of levels of λ, missing %, and number of users. Separate tables are given for k-means clustering and for CONSCLUS. The tables are shaded in gray, with darker levels of gray corresponding to larger values of adj-Rand. A third table gives the percentage performance improvement of CONSCLUS over k-means. For high values of λ and low to mid values of missing data, the improvement is 0, as both k-means clustering and CONSCLUS gave perfect cluster recovery. In cases where the results were not perfect, CONSCLUS outperformed k-means. The increase in performance is negatively correlated with the level of performance, i.e., poorer k-means solutions give greater scope for performance increase. Overall, CONSCLUS gave strong recovery of underlying latent cluster membership and should be used in place of k-means for clustering review/questionnaire data when it is suspected that users have differing competencies.

3.3. Clustering extensions

CONSCLUS and k-means clustering are both non-hierarchical partitioning clustering methods. Here, given a cluster assignment matrix P, p_{il} = 1 if user i is assigned to cluster l, otherwise p_{il} = 0. A user can only be assigned to one cluster, i.e., given r clusters, \sum_{l=1}^{r} p_{il} = 1 for all i. Partitioning clustering is commonly used in situations where there is a need to split items into disjoint groups. For example, partitioning clustering methods are often used in marketing segmentation applications where marketing managers wish to partition customers into segments and send targeted advertising campaigns to consumers in chosen segments [31]. However, partitioning clustering is not the only type of non-hierarchical clustering. Consider a situation where \sum_{l=1}^{r} p_{il} = 1 for all i still holds, but p_{il} \in [0, 1]. This type of clustering is referred to as fuzzy clustering [25]. Here, p_{il} is continuous and a user has proportional memberships of multiple clusters. Fuzzy clustering gives a representation of cluster membership when there is a desire to model user membership across multiple clusters. Another method of assigning users to multiple clusters is overlapping clustering [41,54,55]. Here, cluster membership variables are fixed to be discrete, so that p_{il} = 0 or p_{il} = 1, but the cluster membership constraint is relaxed, so that \sum_{l=1}^{r} p_{il} ≥ 1.


We describe an extension to CONSCLUS to allow for fuzzy clustering solutions. The basic fuzzy clustering model is given in (18). The equation is identical to (12) except that the cluster assignment matrix parameters p_{il} \in [0, 1] are raised to a "fuzzy" constant q, where q > 1.

SSE = \sum_{k=1}^{m} \sum_{l=1}^{r} \sum_{i=1}^{n} p_{il}^{q} (x_{ik} - \bar{x}_{lk})^2    (18)

The model can be fit using the c-means algorithm [16,20,25]. This algorithm is similar to the k-means algorithm, except that the cluster assignment values are calculated using a ratio of distances, as per (19), and the cluster centroid for cluster l is calculated from all users that have a value of p_{il} > 0, as per (20). As q → 1, the clustering solution tends towards a partitioning solution and as q increases, the solution becomes more "fuzzy", with a more even distribution of cluster membership across clusters. At exactly q = 1, the solution is undefined.

p_{il} = \frac{1}{\sum_{j=1}^{r} \left( \frac{\sum_{k=1}^{m} (x_{ik} - \bar{x}_{lk})^2}{\sum_{k=1}^{m} (x_{ik} - \bar{x}_{jk})^2} \right)^{\frac{1}{q-1}}}    (19)

\bar{x}_{lk} = \frac{\sum_{i=1}^{n} p_{il}^{q} x_{ik}}{\sum_{i=1}^{n} p_{il}^{q}}    (20)

The CONSCLUS log-likelihood function given in (7) can be adapted to give a fuzzy clustering solution by performing a transition similar to that from (12) to (18). The resulting log-likelihood function is given in (21). Logic similar to that given in Theorem 2.1 and Corollary 2.2 could be applied to link this model to c-means clustering, but we omit this for the sake of brevity.

\ell(D, Z, B_A, B_M \mid X) = \sum_{k=1}^{m} \sum_{i=1}^{n} \sum_{l=1}^{r} p_{il}^{q} \left( \log\sqrt{\frac{d_{il}}{2\pi}} - \frac{d_{il} (x_{ik} - b_{Ail} - b_{Mil} z_{kl})^2}{2} \right)    (21)

The CCT-Means algorithm can be adapted to fit (21) by modifying steps 3 and 4 to calculate the fuzzy membership matrix P using (22), which is modified from (19), and to calculate the cluster centroid/answer key for each cluster l using (23), which is modified from (9) using (20) to account for partial cluster membership. After applying (22), the cluster assignments should be scaled so that \sum_{l=1}^{r} p_{il} = 1 for all i. As per CONSCLUS, the component ratings of the centroid are weighted by competency and, when bias is modeled, the distances between user answers and centroids are corrected for bias. A sketch of one fuzzy update is given after the equations.

p_{il} = \frac{1}{\sum_{j=1}^{r} \left( \frac{\sum_{k=1}^{m} (x_{ik} - b_{Ai} - b_{Mi} z_{kl})^2}{\sum_{k=1}^{m} (x_{ik} - b_{Ai} - b_{Mi} z_{kj})^2} \right)^{\frac{1}{q-1}}}    (22)

z_{kl} = \frac{\sum_{i=1}^{n} d_i b_{Mi} p_{il}^{q} (x_{ik} - b_{Ai})}{\sum_{i=1}^{n} d_i b_{Mi}^2 p_{il}^{q}}    (23)
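A minimal sketch of one fuzzy update in the bias-free case (Python/NumPy; names ours), with the threshold conversion to overlapping clusters, described next, shown as a comment:

```python
import numpy as np

def fuzzy_step(X, Z, d, q=2.0):
    """One alternating update of the fuzzy extension (bias-free case):
    memberships via Eq. (22), then answer keys via Eq. (23)."""
    # Squared distance of each user i to each cluster key l.
    dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2) + 1e-12
    # Eq. (22): p_il proportional to dist^(-1/(q-1)), scaled to sum to 1.
    P = dist ** (-1.0 / (q - 1.0))
    P /= P.sum(axis=1, keepdims=True)
    # Eq. (23): competency- and membership-weighted answer keys.
    W = d[:, None] * P ** q                      # n x r weights
    Z_new = (W.T @ X) / W.sum(axis=0)[:, None]   # r x m
    return P, Z_new

# Overlapping clusters via the threshold T described below:
# P_overlap = (P >= T).astype(int)
```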

The fuzzy clustering solutions created by the fuzzy extension to CONSCLUS can be converted to overlapping clustering solutions by setting some threshold T, so that if the fuzzy clustering membership parameter p_{il} ≥ T then the overlapping clustering membership parameter p_{il} = 1, and if p_{il} < T then p_{il} = 0. The threshold T and the fuzzy parameter q work in harness to determine the level of overlap of the clustering solution. For some fixed value of T, as q → 1, the overlapping clustering solution tends towards a partitioning clustering solution, and as q increases above 1, the level of solution overlap increases. Lower values of T result in higher sensitivity, as overlap increases quickly relative to higher values of T.

3.4. Movielens example

In this section, we show how CONSCLUS can be applied to a real world data example, that of analyzing movie review data. We utilize the Movielens dataset [31]. The Movielens dataset contains 1,000,209 reviews of 3952 movies from 6040 users. To aid problem tractability, the dataset was preprocessed by removing any movies with fewer than 10 reviews. The remaining data include 999,733 reviews of 3464 movies by 6040 users. We found solutions for each number of clusters from r = 2 to r = 20. For each value of r, we ran the CCT-Means algorithm 100 times and took the optimal solution, i.e., the solution with the maximum likelihood.

The process of determining an "optimal" solution depends on the problem domain and on a researcher's own statistical philosophy and background. If one were to take an exploratory data analysis [62] approach, one may consider solutions for multiple values of r to have valid interpretations. However, researchers often prefer to designate an optimal value for the number of clusters. There are numerous methods for determining the optimal number of clusters for k-means clustering; many are described in [56]. We implemented the Gap method described in [60], as this method has been found to give good results on a variety of input data. The rationale behind the Gap method is that a good clustering solution should fit the data much better than it would fit data randomly generated from the uniform distribution. The Gap statistic for r clusters is given in (24). Here, SSW(r) is the total within cluster sum of squares for the CCT-Means solution and SSW_b(r) is the total within cluster sum of squares for the CCT-Means solution applied to the bth dataset generated from the uniform distribution in


the range of the original data. The value of r chosen is the smallest value of r for which Gap(r) ≥ Gap(r+1) − s_{r+1}, where s_{r+1} = \sqrt{1 + 1/B}\, \sigma_{r+1} and \sigma_{r+1} is the standard deviation of \log(SSW_b(r+1)) across the B reference datasets.

Gap(r) = \frac{1}{B} \sum_{b=1}^{B} \log(SSW_b(r)) - \log(SSW(r))    (24)
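A sketch of the Gap computation and selection rule (Python/NumPy; the fit callable, assumed to return the total within cluster sum of squares SSW(r) for a CCT-Means solution, is our illustrative interface):

```python
import numpy as np

def gap_select(X, fit, r_max=20, B=20, seed=0):
    """Gap method of [60] applied to CCT-Means, per Eq. (24). fit(X, r)
    must return the total within cluster sum of squares SSW(r)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gap, s = np.zeros(r_max + 1), np.zeros(r_max + 1)
    for r in range(1, r_max + 1):
        # B reference datasets drawn uniformly over the data's range.
        ref = np.array([np.log(fit(rng.uniform(lo, hi, size=X.shape), r))
                        for _ in range(B)])
        gap[r] = ref.mean() - np.log(fit(X, r))        # Eq. (24)
        s[r] = np.sqrt(1.0 + 1.0 / B) * ref.std()      # s_r = sqrt(1 + 1/B) * sigma_r
    for r in range(1, r_max):
        if gap[r] >= gap[r + 1] - s[r + 1]:            # selection rule
            return r
    return r_max
```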

We set B to be 20 and calculated the Gap value for each value of r. The optimal value of r was 8. To visualize the solution, we calculated distances between the users using a correlation distance, as per user based collaborative filtering [19]. We then ran classical multidimensional scaling (MDS) [61] on the resulting distances. Fig. 6 gives a scatter plot of users on the three dimensions from the classical MDS analysis that account for the maximum variance. The users are color-coded by cluster; each cluster has a different color and symbol. Most users are clustered together towards the center of the figure. Several clusters are very compact and are close to the overall "user" mean, while other clusters are more spread out.

One would expect the more compact user clusters to be closer to the overall mean. To aid intuition, we modeled an overall knowledge score using a one cluster solution and calculated the correlations between the overall knowledge component and the clusters in the 8 cluster solution. These correlations are given for each cluster in Table 6, along with the number of users in the cluster, the average answer key score, the average user competency, and the standard deviation of the user competency. The correlation between the overall score and the cluster score is highest for clusters six and seven. The within cluster competencies are low for cluster seven, indicating a larger within cluster variance. One can see in the visualization in Fig. 6 that the clusters with lower average competencies are relatively more spread out than the clusters with higher average competencies. None of the competencies are equal to the maximum allowed competency Max d, which suggests that enough data were available to efficiently estimate the model.

The last three rows of Table 6 give some basic demographic information for the clusters. The users are split into age range categories of 0–18, 18–24, 25–34, 35–44, 45–54, and 55+. The median age category is given for each cluster. There is little difference between the groups, but cluster four has slightly older demographics than the other clusters. Clusters three and four are significantly more female than the overall dataset, which is 72% male and 28% female.

The Movielens dataset has one or more of eighteen categories assigned to each movie. In order to gain further insight into cluster behavior, the average answer key score for each combination of cluster and category is given in Fig. 7. Cluster one has generally positive reviewers, with high average answer key scores across most categories. Cluster seven has low average answer key scores across most categories, but has high scores for the documentary and film-noir categories, indicating possible high brow movie preferences. Cluster three has low values for film-noir and documentaries, indicating more mainstream, entertainment focused tastes. For marketing segmentation applications, product recommendations or promotions can be tailored to the idiosyncrasies of the individual clusters. In this analysis, we do not explicitly analyze the user competencies, as the users in the dataset are anonymized. However, in a marketing context, high competency users could be targeted for panels and for new product evaluations.

Fig. 6. Classical MDS visualization of cluster configuration.


Table 6. Cluster information.

          | C1    | C2    | C3    | C4    | C5    | C6    | C7    | C8
Corr      | 0.664 | 0.565 | 0.462 | 0.736 | 0.821 | 0.912 | 0.835 | 0.867
No. users | 559   | 625   | 736   | 779   | 850   | 1003  | 659   | 829
Avg. z_lk | 3.348 | 3.137 | 3.577 | 3.104 | 3.810 | 3.163 | 2.733 | 3.472
Avg. d_i  | 1.635 | 1.328 | 1.498 | 1.370 | 1.785 | 1.479 | 1.179 | 1.833
SD d_i    | 0.698 | 0.601 | 0.639 | 0.581 | 0.708 | 0.550 | 0.623 | 0.746
%F        | 26.30 | 21.92 | 34.37 | 39.79 | 26.82 | 25.92 | 24.13 | 25.93
%M        | 73.70 | 78.08 | 65.63 | 60.21 | 73.18 | 74.08 | 75.87 | 74.07
Med age   | 25–34 | 25–34 | 25–34 | 35–44 | 25–34 | 25–34 | 25–34 | 25–34

Fig. 7. Average answer key score by cluster and category.

Fig. 8. Fuzzy cluster membership by cluster.

To obtain a fixed error bound in a marketing research survey, fewer higher competency users would be required than randomly selected users, so the ability to select high competency users would reduce marketing research costs.

In addition to the partitioning clustering solution, we created an eight cluster fuzzy clustering solution, using the fuzzy extension to CONSCLUS described previously. The values of z for the fuzzy clusters were strongly correlated with the values of z for the partitioning clusters: each fuzzy cluster had a correlation of at least 0.7 with its most highly correlated partitioning cluster, and every pair of partitioning or fuzzy clusters had a correlation of at least 0.4, indicating some shared knowledge across all clusters.

To illustrate the fuzzy cluster assignments contained in the assignment matrix P, for each cluster we utilized the first two dimensions of the classical MDS user plot created for Fig. 6, added the value of p_ij as a third dimension, and created a heat-map contour plot of p_ij, which is given in Fig. 8. In the heat map, lighter yellow values indicate higher values of p_ij and darker red values indicate lower values. As in Fig. 6, there is a high density of points close to the center of the map, and one can see how the regions of high p_ij differ across clusters. A plotting sketch for figures of this kind follows.
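This is a minimal matplotlib sketch, assuming the MDS coordinates from the earlier sketch and an n-by-r fuzzy membership matrix P; the panel layout and color map are our choices, not the paper's.

```python
import matplotlib.pyplot as plt

def plot_memberships(coords, P):
    """One contour panel per cluster; the z-value is the membership p_ij."""
    r = P.shape[1]
    fig, axes = plt.subplots(2, (r + 1) // 2, figsize=(14, 6),
                             sharex=True, sharey=True)
    for k, ax in enumerate(axes.ravel()[:r]):
        # Interpolated contour of membership strength; yellow = high p_ij
        cs = ax.tricontourf(coords[:, 0], coords[:, 1], P[:, k],
                            levels=12, cmap="YlOrRd_r")
        ax.set_title(f"Cluster {k + 1}")
    fig.colorbar(cs, ax=axes, shrink=0.8, label="p_ij")
    plt.show()
```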


4. Conclusions and future work

In this paper, we expand previous work on cultural consensus analysis and develop a clusterwise version of continuous cultural consensus analysis, which we denote CONSCLUS. CONSCLUS is a partitioning technique and can be formulated in a similar manner to k-means clustering. However, in CONSCLUS, a Gaussian maximum likelihood formulation is used to assign competencies to users and to weight users by competency when calculating the overall aggregate answer key/review values for items/questions. The user competencies are derived not from prior information, but from patterns within the data. The CCT-Means algorithm is used to fit CONSCLUS. In initial attempts to fit CONSCLUS, we utilized local search based techniques; however, given the combinatorial complexity of partitioning clustering, these techniques were too slow to be practical for larger datasets. CCT-Means is an alternating least squares type of algorithm: it calculates cluster assignments as in the basic k-means clustering algorithm, but rather than calculating the cluster centroids as simple arithmetic means, the cluster competencies, the cluster biases, and the cluster answer key (a weighted centroid) are calculated using the continuous CCT procedure. The algorithm gives good performance on relatively large datasets. As k-means is a relatively mature technique, there is a large body of literature on k-means clustering and on innovations for initializing, optimizing, and choosing the number of clusters for the k-means method. For future work, the CCT-Means procedure could be adapted to use some of these innovations and to use k-means algorithms designed for massive datasets, for example [6].
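As a rough illustration of this alternating structure only, the following sketch replaces the paper's Gaussian maximum likelihood fixed-point estimation with a crude approximation in which a user's competency is the (capped) inverse of his or her residual variance; everything here is a simplification for exposition, not the actual CCT-Means update.

```python
import numpy as np

def cct_step(X, Z, max_d=10.0):
    """One assignment + weighted-centroid pass. X: users x items ratings,
    Z: r x items current answer keys. Returns labels and updated Z."""
    # 1. Assign each user to the closest current answer key (k-means style)
    dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    Z_new = Z.copy()
    for k in range(Z.shape[0]):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue
        bias = (Xk - Z[k]).mean(axis=1, keepdims=True)        # per-user shift
        resid_var = ((Xk - bias - Z[k]) ** 2).mean(axis=1)
        d = np.minimum(1.0 / np.maximum(resid_var, 1e-8), max_d)  # competency
        # 2. Answer key = competency-weighted centroid of bias-corrected ratings
        Z_new[k] = (d[:, None] * (Xk - bias)).sum(axis=0) / d.sum()
    return labels, Z_new
```

Iterating cct_step to convergence mirrors the assign/update alternation of k-means; note that when all competencies are equal the weighted centroid collapses to the arithmetic mean, i.e. a plain k-means update.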
In the example section, we showed how the Gap statistic could be used to choose the number of clusters. This is but one of many methods for choosing the number of clusters in k-means clustering. A large scale empirical and theoretical analysis of the adaptation of different cluster-number selection methods from the k-means literature would be a logical next step in developing the CONSCLUS technique further.

The k-means method for partitioning clustering has an analogous fuzzy clustering method, fuzzy c-means clustering. We implemented an extension to CONSCLUS for fuzzy clustering, utilizing an alternating least squares extension of the c-means algorithm rather than the k-means algorithm. We described how, by using a threshold parameter in conjunction with the fuzziness parameter, a range of overlapping clustering solutions can be derived from the fuzzy clustering solutions. Different interpretations of consensus analysis solutions can be found by examining the fuzzy, partitioning, and overlapping clustering solutions together. Future work on fuzzy consensus clustering could involve an examination of the distribution of the fuzzy cluster membership parameters and an analysis of algorithms for fuzzy clustering.

We ran a series of experiments on generated data in order to test the performance of the methods described in this paper. The data were generated by first randomly drawing competencies, biases, and answer keys, and then generating user ratings with random error (a sketch of this setup is given just before the references). For the single cluster/culture scenario, the task was to recover the underlying answer keys from the error-perturbed ratings; CCT outperformed a range of both simple aggregation techniques and variance-adjusted aggregation techniques. For multiple cluster data, CONSCLUS outperformed k-means clustering on the task of recovering cluster membership parameters from error-perturbed ratings data. The fixed-point CCT fitting procedure was very robust with respect to missing data: as long as sufficient data were present, the algorithm worked well even with high percentages of missing values, though there was some drop-off in performance for small datasets with large amounts of missing data. In such cases there are insufficient data to estimate the model, and some of the competencies shoot off to very high values; this problem can be mitigated by setting an upper bound on the competencies. For future work, a Bayesian prior could be added to help estimate the model when there are insufficient data to efficiently estimate the maximum likelihood.

Overall, the major purpose of an exploratory or unsupervised technique is to help gain insight into the data. We showed how CONSCLUS could be used to analyze a set of on-line review data. We used cluster assignments, answer key values, and user competency statistics, along with user demographics and item category information, to build an overall picture of the data and to gain insight into the specific characteristics of the different clusters. Much of this analysis could be done using a standard clustering algorithm, such as k-means clustering, but the ability to characterize users by competence and to weight scores by user competency enables additional insight. When calculated for review data, the competencies generated by CONSCLUS could be used to help customer service and product management teams filter useful reviews. They could also be used to select subgroups of consumers for marketing research purposes. Future work could include utilizing competencies derived from CCT and CONSCLUS to help improve and tune other data analysis techniques and to improve the accuracy of supervised data mining techniques. For example, competencies could be used to weight users in user-based collaborative filtering (sketched below) or to weight observations when tuning a prediction technique such as support vector machines.
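As a concrete example of the first suggestion, here is a hedged sketch of competency-weighted user-based collaborative filtering; the combination rule (similarity multiplied by competency) and all names are our illustrative choices, not a method from the paper.

```python
import numpy as np

def predict_rating(X, d, user, item, k_neighbors=30):
    """Predict X[user, item] from competency-weighted neighbor ratings.
    X: dense users x items rating matrix, d: CCT-derived competencies."""
    sims = np.corrcoef(X)[user]            # user-user similarities
    sims[user] = -np.inf                   # exclude the target user
    neighbors = np.argsort(sims)[::-1][:k_neighbors]
    w = np.clip(sims[neighbors] * d[neighbors], 0, None)  # sim x competency
    if w.sum() == 0:
        return X[:, item].mean()           # fall back to the item mean
    return float(w @ X[neighbors, item] / w.sum())
```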

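For completeness, the following sketches the kind of generative setup described in the simulation discussion above. The specific distributions, ranges, and the inverse relationship between competency and error spread are assumptions for illustration, not the paper's exact experimental design.

```python
import numpy as np

def generate_data(n_users=500, n_items=100, r=4, seed=0):
    rng = np.random.default_rng(seed)
    clusters = rng.integers(0, r, n_users)          # latent memberships
    Z = rng.uniform(1.0, 5.0, size=(r, n_items))    # per-cluster answer keys
    d = rng.uniform(0.5, 2.5, n_users)              # user competencies
    b = rng.normal(0.0, 0.3, n_users)               # user response biases
    # Gaussian error whose spread shrinks as competency grows (assumption)
    noise = rng.normal(0.0, 1.0, (n_users, n_items)) / d[:, None]
    X = Z[clusters] + b[:, None] + noise
    return X, clusters, Z, d, b
```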

References

[1] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Trans. Knowl. Data Eng. 17 (2005) 734–749.
[2] H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Control 19 (1974) 716–723.
[3] U. Akkucuk, J.D. Carroll, PARAMAP vs. Isomap: a comparison of two nonlinear mapping algorithms, J. Classif. 23 (2006) 221–254.
[4] G. Almkvist, B. Berndt, Gauss, Landen, Ramanujan, the arithmetic–geometric mean, ellipses, π, and the Ladies Diary, Am. Math. Mon. 95 (1988) 585–608.
[5] D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn. 75 (2009) 245–248.
[6] K. Alsabti, S. Ranka, V. Singh, An Efficient k-Means Clustering Algorithm, Working Paper, Syracuse University, 1998.
[7] R. Anders, W.H. Batchelder, Cultural consensus theory for multiple consensus truths, J. Math. Psychol. 56 (2012) 452–469.
[8] T.W. Anderson, D.A. Darling, A test of goodness of fit, J. Am. Stat. Assoc. 49 (1954) 765–769.
[9] A. Ansari, S. Essegaier, R. Kohli, Internet recommendation systems, J. Market. Res. (JMR) 37 (2000) 363–375.
[10] A. Aßfalg, E. Erdfelder, CAML–maximum likelihood consensus analysis, Behav. Res. Methods 44 (2012) 189–201, http://dx.doi.org/10.3758/s13428-011-0138-0.
[11] W. Batchelder, A. Romney, Test theory without an answer key, Psychometrika 53 (1988) 71–92, http://dx.doi.org/10.1007/BF02294195.
[12] W.H. Batchelder, R. Anders, Cultural consensus theory: comparing different concepts of cultural truth, J. Math. Psychol. 56 (2012) 316–332.
[13] W.H. Batchelder, A.K. Romney, The statistical analysis of a general Condorcet model for dichotomous choice situations, in: B. Grofman, G. Owen (Eds.), Information Pooling and Group Decision Making: Proceedings of the Second University of California, Irvine Conference on Political Economy, JAI Press, Greenwich, CT, 1986, pp. 103–112.
[14] W.H. Batchelder, A.K. Romney, New results in test theory without an answer key, in: E.E. Roskam (Ed.), Mathematical Psychology in Progress, Springer-Verlag, Heidelberg, Germany, 1989, pp. 229–248.
[15] W.H. Batchelder, A. Strashney, A.K. Romney, Cultural consensus theory: aggregating continuous responses in a finite interval, in: S.K. Chai, J.J. Salerno, P.L. Mabrey (Eds.), Social Computing, Behavioral Modeling, and Prediction 2010, Springer-Verlag, New York, NY, 2010, pp. 98–107.
[16] J.C. Bezdek, A convergence theorem for the fuzzy ISODATA clustering algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1980) 1–8.
[17] S.P. Borgatti, ANTHROPAC 4.983, Analytic Technologies, Natick, MA, 1992.
[18] H. Bozdogan, Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions, Psychometrika 52 (1987) 345–370, http://dx.doi.org/10.1007/BF0229436.
[19] J.S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, Madison, WI, 1998, pp. 43–52.
[20] R.L. Cannon, J.V. Dave, J.C. Bezdek, Efficient implementation of the fuzzy c-means clustering algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 248–255.
[21] L. Chen, A. Buja, Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis, J. Am. Stat. Assoc. 104 (2009) 209–219.
[22] A.L. Comrey, The minimum residual method of factor analysis, Psychol. Rep. 11 (1962) 15–18, http://dx.doi.org/10.2466/pr0.1962.11.1.1.
[23] W. DeSarbo, R. Oliver, A. Rangaswamy, A simulated annealing methodology for clusterwise linear regression, Psychometrika 54 (1989) 707–736, http://dx.doi.org/10.1007/BF02296405.
[24] W.S. DeSarbo, W.L. Cron, A maximum likelihood methodology for clusterwise linear regression, J. Classif. 5 (1988) 249–282, http://dx.doi.org/10.1007/BF01897167.
[25] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybern. 3 (1973) 32–57.
[26] J. Durbin, G.S. Watson, Testing for serial correlation in least squares regression: I, Biometrika 37 (1950) 409–428.
[27] J. Durbin, G.S. Watson, Corrections to part I: Testing for serial correlation in least squares regression: I, Biometrika 38 (1951) 177–178.
[28] B. Di Eugenio, M. Glass, The kappa statistic: a second look, Comput. Linguist. 30 (2004) 95–101, http://dx.doi.org/10.1162/08912010477363340.
[29] S.L. France, W.H. Batchelder, Maximum likelihood item easiness models for test theory without an answer key, Educ. Psychol. Meas., in press, 1–22.
[30] D.L. Hall, S.A. McMullen, Mathematical Techniques in Multisensor Data Fusion, Artech House, Boston, MA, 2004.
[31] M. Harper, Movielens Data Set, 2006.
[32] J. Herlocker, J.A. Konstan, J. Riedl, An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms, Inform. Ret. 5 (2002) 287–310, http://dx.doi.org/10.1023/A:102044390983.
[33] W.T. Hoyt, Rater bias in psychological research: when is it a problem and what can we do about it?, Psychol. Methods 5 (2000) 64–86.
[34] L.J. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1985) 193–218.
[35] D. Iacobucci, P. Arabie, A. Bodapati, Recommendation agents on the internet, J. Interact. Market. 14 (2000) 2–11.
[36] R. Kennedy, C. Riquier, B. Sharp, Practical applications of correspondence analysis to categorical data in market research, J. Target. Measur. Anal. Market. 5 (1996) 56–70.
[37] S.K. Lee, Y.H. Cho, S.H. Kim, Collaborative filtering with ordinal scale-based implicit ratings for mobile music recommendations, Inf. Sci. 180 (2010) 2142–2155.
[38] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Comput. 7 (2003) 76–80.
[39] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: L.M. Le Cam, J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Statistics, University of California Press, Berkeley, CA, 1967, pp. 281–297.
[40] P. Melville, R.J. Mooney, R. Nagarajan, Content-boosted collaborative filtering for improved recommendations, in: Eighteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, Menlo Park, CA, USA, 2002, pp. 187–192.
[41] W.D. Mulder, Optimal clustering in the context of overlapping cluster analysis, Inf. Sci. 223 (2013) 56–74.
[42] I.J. Myung, Tutorial on maximum likelihood estimation, J. Math. Psychol. 47 (2003) 90–100.
[43] G. Norman, Likert scales, levels of measurement and the ‘‘laws’’ of statistics, Adv. Health Sci. Educ. 15 (2010) 625–632.
[44] M. Pazzani, D. Billsus, Content-based recommendation systems, in: P. Brusilovsky, A. Kobsa, W. Nejdl (Eds.), The Adaptive Web, vol. 4321, Springer, Berlin/Heidelberg, 2007, pp. 325–341.
[45] W.M. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc. 66 (1971) 846–850.
[46] J.L. Rasmussen, Analysis of Likert-scale data: a reinterpretation of Gregoire and Driver, Psychol. Bull. 105 (1989) 167–170, http://dx.doi.org/10.1037/0033-2909.105.1.167.
[47] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open architecture for collaborative filtering of netnews, in: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, CSCW '94, ACM, New York, NY, USA, 1994, pp. 175–186.
[48] A.K. Romney, S.C. Weller, W.H. Batchelder, Culture as consensus: a theory of culture and informant accuracy, Am. Anthropol. 88 (1986) 313–338.
[49] D.B. Rubin, Multiple imputation after 18+ years, J. Am. Stat. Assoc. 91 (1996) 473–489, http://dx.doi.org/10.1080/01621459.1996.1047690.
[50] F.E. Saal, R.G. Downey, M.A. Lahey, Rating the ratings: assessing the psychometric quality of rating data, Psychol. Bull. 88 (1980) 413–428, http://dx.doi.org/10.1037/0033-2909.88.2.413.
[51] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th International Conference on World Wide Web, WWW '01, ACM, New York, NY, USA, 2001, pp. 285–295.
[52] H. Scheffé, The Analysis of Variance, first ed., John Wiley & Sons, New York, NY, 1959.
[53] S.S. Shapiro, M.B. Wilk, An analysis of variance test for normality (complete samples), Biometrika 52 (1965) 591–611.
[54] R.N. Shepard, P. Arabie, Additive clustering: representation of similarities as combinations of discrete overlapping properties, Psychol. Rev. 86 (1979) 87–123.
[55] T.C. Silva, L. Zhao, Uncovering overlapping cluster structures via stochastic competitive learning, Inf. Sci. 247 (2013) 40–61.
[56] D. Steinley, Local optima in k-means clustering: what you don't know may hurt you, Psychol. Methods 8 (2003) 294–304.


[57] D. Steinley, Properties of the Hubert–Arabie adjusted Rand index, Psychol. Methods 9 (2004) 386–396, http://dx.doi.org/10.1037/1082-989X.9.3.386.
[58] D. Steinley, k-Means clustering: a half-century synthesis, British J. Math. Stat. Psychol. 59 (2006) 1–34.
[59] X. Su, T.M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. Artif. Intell. 2009 (2009), 19 pages.
[60] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) 63 (2001) 411–423.
[61] W.S. Torgerson, Multidimensional scaling, I: Theory and method, Psychometrika 17 (1952) 401–419.
[62] J.W. Tukey, Exploratory Data Analysis, first ed., Addison-Wesley, Reading, MA, 1977.