Handling missing values in multiple factor analysis

Food Quality and Preference 30 (2013) 77–85

François Husson*, Julie Josse

Applied Mathematics Department, Agrocampus Rennes, 65 Rue de Saint-Brieuc, 35042 Rennes Cedex, France

Article history: Received 27 February 2013; Received in revised form 24 April 2013; Accepted 28 April 2013; Available online 14 May 2013.

Keywords: Exploratory multivariate analysis; Missing values; Multi-table data; Multiple factor analysis; Napping data

Abstract: Handling missing values is an unavoidable problem in the practice of statistics. We focus on multiple factor analysis (MFA) in the sense of Escofier and Pagès (2008), a principal component method that simultaneously takes into account several multivariate datasets composed of continuous and/or categorical variables. The suggested strategy to deal with missing values, named regularised iterative MFA, is derived from a method available in principal component analysis which consists in alternating a step of estimation of the axes and components with a step of estimation of the missing values. The pattern of missing values considered can be structured, with missing rows in some datasets. Simulations and real examples that cover several situations in sensory analysis are used to illustrate the methodology. We focus on the important issue of the maximum number of products that can be assessed during an evaluation task. © 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Multivariate exploratory data analysis methods such as principal component analysis (PCA) or multiple correspondence analysis (MCA) are often used in the context of sensory analysis. These methods are roughly similar in their main purpose, which is to explore data tables with individuals in rows and variables in columns (continuous variables in PCA and categorical variables in MCA). The core idea is to describe the dataset with a small number of uncorrelated variables (the principal components) in order to reduce the dimensionality of the dataset while keeping as much information as possible. The methods are mainly used as descriptive tools to sum up and visualise data. Nowadays, information is more and more often structured in multiple datasets. This is partly explained by the increasing ease with which measurements can be made and stored. However, the main reason is that taking into account several sources of information allows a more comprehensive approach to the phenomena under study. The data thus have a multi-table structure: I individuals are described by J groups of variables. Several multi-table methods are available in the literature (Kroonenberg, 2008) to explore and visualise this kind of data. We focus here on multiple factor analysis (MFA) in the sense of Escofier and Pagès (2008). In MFA, the number of variables and the type of the variables (continuous or categorical) may vary from one group to another, but within each group the nature of the variables is the same. The aims of MFA are similar to those of PCA or

MCA, namely studying the similarities between individuals in terms of all the variables, studying links between variables and relating these two studies. But beyond these traditional aims, new issues arise because of the data's structure. MFA also studies the links between groups and compares the information brought by each group. From a technical point of view, the first step of MFA consists in coding the categorical variables using an indicator matrix of dummy variables. Then, each indicator matrix is weighted as is done in MCA (Greenacre & Blasius, 2006), i.e. each dummy variable is divided by the square root of the proportion of individuals taking the associated category. More formally, let us denote the matrix X = [X_1, ..., X_J], with X_j the matrix of continuous variables if group j is composed of continuous variables and X_j the indicator matrix of dummy variables if group j is composed of categorical variables. Then, we define D_R as the diagonal matrix (whose size is the number of columns of X) where each element is equal to one for continuous variables and to the square root of the proportion for the dummy variables. The first step of MFA thus consists in computing the matrix X D_R^(1/2). As usual, this matrix is centred: Z = X D_R^(1/2) − M, with M the I × K matrix M = 1 m′, where m is the barycentre of X. The second step of MFA is a weighting step that balances the influence of the groups of variables, ensuring that the analysis is not dominated by a single group. A similar situation is found in standardised PCA, where variables are weighted by the inverse of the standard deviation in order to avoid the analysis being dominated by the variables with the largest variances. In practice, all the variables of a group j of continuous variables are weighted by the inverse of the square root of the first eigenvalue λ_j1 associated with the PCA of group j and


all the columns of the indicator matrix associated with a group j of categorical variables are weighted by the inverse of the square root of the first eigenvalue λ_j1 associated with the MCA of group j. The last step of MFA consists in performing a PCA on the global weighted data table Z Λ^(−1/2) = [Z_1/√λ_11, ..., Z_J/√λ_J1]. MFA provides the same graphical representations as PCA (representation of individuals and variables) but, due to the group structure, specific representations are also available, such as the group representation and the superimposed representation. The group representation allows one to visualise the relationships between groups at a global level. Two groups are linked when the relative positions of individuals in one group are similar to the relative positions of individuals in the other. The superimposed representation focuses on the resemblances and differences between groups at the individual level. For instance, it allows one to highlight individuals that are specific with respect to one group. There are many application examples of MFA in sensory analysis. A classical one concerns the comparison of evaluations performed by several panels: each panel evaluates the same set of products with common or different attributes and MFA allows us to compare the product space given by each panel (Lê, Pagès, & Husson, 2008b; Pagès, Bertrand, Ali, Husson, & Lê, 2007; Tu, Valentin, Husson, & Dacremont, 2010). MFA is also often used to analyse data obtained from specific data collection methods such as napping (Dehlholm, Brockhoff, Meinert, Aaslyng, & Bredie, 2012; Pagès, 2005) or CATA (Dooley, Lee, & Meullenet, 2010). Another example is the use of MFA to study the link between sensory data and other kinds of data, such as instrumental data (Pagès & Husson, 2005) or consumers' descriptions (Ares, Giménez, Barreiro, & Gámbaro, 2010). Whatever the kind of data, missing values are ubiquitous and can occur for plenty of reasons.
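The three MFA steps described above (column weighting by D_R, centring, and balancing each group by its first eigenvalue) can be sketched, for the all-continuous case, as follows. This is a minimal sketch, not the authors' implementation: the function name and toy group sizes are ours, and the eigenvalue convention (squared singular value divided by I) is one common choice.

```python
import numpy as np

def mfa_weighted_table(tables):
    """Sketch of the MFA preprocessing for continuous groups only:
    centre each group (D_R is then the identity) and divide it by the
    square root of the first eigenvalue of its own PCA."""
    weighted = []
    n = tables[0].shape[0]  # number of individuals I
    for Xj in tables:
        Zj = Xj - Xj.mean(axis=0)                               # centring
        lam1 = np.linalg.svd(Zj, compute_uv=False)[0] ** 2 / n  # first eigenvalue of group j
        weighted.append(Zj / np.sqrt(lam1))                     # balance the group's influence
    return np.hstack(weighted)                                  # global weighted table

rng = np.random.default_rng(0)
groups = [rng.normal(size=(21, 9)), rng.normal(size=(21, 9))]  # e.g. olfaction, tasting
Z = mfa_weighted_table(groups)
print(Z.shape)  # (21, 18): after this step, MFA is a plain PCA of Z
```

After this weighting the first eigenvalue of every group equals one, so no single group can dominate the first dimension of the global PCA.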
In addition, the risk of being confronted with missing data increases with the number of sources of information. It is also common for some individuals to have missing entries for all the variables of one (or more) group(s), resulting in a missing data pattern specific to multiple tables (see Fig. 1). For instance, it may be difficult to taste more than a certain number of products due to sensory saturation. Consequently, it frequently happens that only a subset of the products is given to each assessor. This leads to the pattern of missing values in Fig. 1 if rows correspond to products and groups to assessors. Several methods in the literature handle missing values in the multiple multivariate framework. One can quote the works of Commandeur (1991) for generalised Procrustes analysis (GPA), Tomasi and Bro (2005) for PARAFAC models, and Van de Velden and Bijmolt (2006) and van de Velden and Takane (2011) for generalised canonical correlation analysis. All these approaches consist in minimising a criterion on the observed data by introducing a weight matrix. The idea is to estimate the parameters (such as the dimensions) skipping the missing values. The algorithms involved are mainly of the alternating weighted least squares type.

Fig. 1. Pattern of missing values specific to multiple multivariate datasets with missing rows for some groups of variables.

In this paper, we suggest an algorithm to perform multiple factor analysis on an incomplete dataset. The pattern of missing values can be random, with missing values scattered in the data, or structured as in Fig. 1. Section 2 briefly reviews how to perform principal component methods such as PCA and MCA with missing values. Since PCA is the core of MFA, it gives the foundation for an algorithm to deal with missing values in MFA. The properties of the method are highlighted in Section 3 through a simulation study and examples in sensory analysis.

2. Materials and methods

2.1. PCA with missing values

PCA in the complete case provides a subspace that maximises the variability of the projected points or, equivalently, that minimises the distances between individuals and their projections onto the subspace. It is equivalent to finding a matrix of low rank S that provides the best approximation of the I × K matrix X in the least squares sense, i.e. to finding two matrices F (I × S) and U (K × S) that minimise the following criterion:

C = ||X − M − FU′||² = Σ_{i=1..I} Σ_{k=1..K} ( x_ik − m_k − Σ_{s=1..S} f_is u_ks )²   (1)

With the additional constraints that the columns of U are orthogonal and of unit norm, the solution is given by the eigenvectors of the inner-product matrix XX′, namely the principal components F̂ (the scores matrix, such that the variance of each column is equal to the corresponding eigenvalue), and the eigenvectors of the covariance matrix X′X, namely the principal axes Û (the loadings matrix or coefficient matrix). The solution can be obtained either by the singular value decomposition of (X − M) or by the use of an alternating least squares (ALS) algorithm. A common approach to dealing with missing values in PCA consists in ignoring them, i.e. minimising the least squares criterion (Eq. (1)) over all non-missing entries. This can be achieved by introducing a weight matrix W in the criterion, with W_ik = 0 if x_ik is missing and W_ik = 1 otherwise:



Σ_{i=1..I} Σ_{k=1..K} w_ik ( x_ik − m_k − Σ_{s=1..S} f_is u_ks )²   (2)

Contrary to the complete case, there is no explicit solution minimising criterion (2) and it is necessary to resort to iterative algorithms. Two kinds of algorithms are available to minimise this criterion: weighted alternating least squares algorithms, such as the one suggested by Gabriel and Zamir (1979), or the iterative PCA algorithm (Josse & Husson, 2012; Kiers, 1997). In the latter algorithm, an imputation of the missing values is achieved during the estimation process. More precisely, the algorithm is the following:

1. Initialization l = 0: X^0. Missing elements are replaced by initial values such as the mean of each variable.
2. Step l:
   a. PCA is performed on the completed dataset to estimate the parameters (M̂^l, F̂^l, Û^l); the superscript l means that the quantity is estimated at step l (for instance, M̂^l is the mean matrix estimated at step l); S dimensions are kept.
   b. Missing values are imputed with the reconstruction formula, i.e. the fitted values X̂^l = M̂^l + F̂^l (Û^l)′; the new imputed dataset is X^l = W ∗ X + (1 − W) ∗ X̂^l (with ∗ the Hadamard product): observed values stay the same and missing values are replaced by the fitted ones.
   c. Means (and standard deviations) are updated.


Fig. 2. Representation of the individuals obtained by the MFA on the complete dataset (top left), the MFA on the incomplete dataset with an imputation by the mean of the variables (top right) and with regularised iterative MFA (bottom left). Individuals in red have missing values for all the variables of the group olfaction. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3. Steps 2(a) of estimation of the parameters via PCA, 2(b) of imputation of the missing values and 2(c) are repeated until convergence.

After each imputation step, the means of the variables change; consequently, the vector of means has to be updated after each imputation step. It is important to keep the same point of view regarding scaling if one wishes to perform a standardised PCA: as for the means, a rescaling step is incorporated after each imputation step. The scaling process should be considered as part of the analysis. Surprisingly, including this rescaling step is not trivial and cannot be done, for instance, in some other algorithms that have been proposed to handle missing values in PCA, such as the Nipals algorithm (Christofferson, 1969; Wold, 1966). That is why the iterative PCA algorithm is preferred to perform PCA on incomplete datasets (Josse & Husson, 2012). Overfitting problems commonly occur in this kind of algorithm. They can be avoided thanks to a regularisation procedure that consists in thresholding the singular values obtained when performing the PCA in step 2(a) of the algorithm. A regularised version of the algorithm, the regularised iterative PCA (Josse & Husson, 2012), is then used in practice.

2.2. MCA with missing values

The most popular approach to performing MCA on an incomplete dataset consists in using the missing single method (van der Heijden, 2003). This method creates an extra category per variable for the missing values and performs the MCA on the new dataset. Josse, Chavent, Liquet, Husson and Pagès (2012) have defined a new method named regularised iterative MCA. Their algorithm works on the indicator matrix involved in MCA. It first initialises the missing values of the indicator matrix with, for instance, the proportion of the category (which is the equivalent of the mean imputation for continuous variables). Then MCA is performed on the imputed indicator matrix to obtain an estimation of the principal components and axes, and the missing values are imputed using the reconstruction formula. The algorithm is in the same vein as the one used in PCA. However, it is necessary to take into account the specificities of MCA. For instance, after each imputation step,



Fig. 3. Representation of the mean individuals and of the partial individuals which have missing values. The representation is obtained with the MFA on the complete dataset (top left), the MFA on the incomplete dataset with an imputation by the mean of the variables (top right) and with regularised iterative MFA (bottom left).

the column margins of the indicator matrix change, and it is thus necessary to incorporate a step that updates the margins, in a similar way as for the mean and standard deviation in PCA. Regularised iterative MCA was applied to a sorting task example in Josse (2010) and showed better performance than the other methods dedicated to handling missing values in MCA.

2.3. The iterative MFA algorithm

Since the core of MFA is a weighted PCA, the algorithm suggested to handle missing values in MFA is inspired by the regularised iterative PCA algorithm, but the group structure and the specific weighting have to be taken into account. More precisely, the algorithm is carried out using the following steps:

1. Initialization l = 0: X^0. Missing elements are replaced by initial values such as the mean of each variable for continuous variables and the proportion of the category for each dummy variable, calculated from the non-missing entries. Note that the entries can be non-integer for the dummy variables, but the sum of the entries corresponding to one individual and one categorical variable has to be equal to one. The matrices D_R^0 and M̂^0 are calculated. The first eigenvalue of each group of variables is computed and Λ^0 = diag(λ_11, ..., λ_J1).
2. Step l:
   a. The global table Z^(l−1) (Λ^(−1/2))^(l−1) = (X^(l−1) (D_R^(1/2))^(l−1) − M̂^(l−1)) (Λ^(−1/2))^(l−1) is constructed.

   b. A PCA is performed on this global table to estimate the dimensions (F̂^l, Û^l); S dimensions are kept.
   c. The fitted values are estimated as Ẑ^l = F̂^l (Û^l)′ (Λ^(1/2))^(l−1) and the missing values are then imputed with X̂^l = (M̂^(l−1) + F̂^l (Û^l)′ (Λ^(1/2))^(l−1)) (D_R^(−1/2))^(l−1). The new imputed dataset is X^l = W ∗ X + (1 − W) ∗ X̂^l.
   d. From this new completed matrix, the matrices D_R^l, M̂^l and Λ^l are calculated.
3. Steps 2(a) to 2(d) are repeated until convergence.

This algorithm also suffers from overfitting and is therefore regularised as in PCA, leading to the regularised iterative MFA algorithm. More precisely, the S singular values of MFA, i.e. the S singular values associated with the PCA on the weighted matrix (step 2b), are thresholded: each singular value √λ_s is replaced by (λ_s − σ²)/√λ_s, with σ² the mean of the last eigenvalues. More details

concerning the rationale of the regularisation procedure are given in Josse and Husson (2012). Note that the number of dimensions S used in the algorithm has to be chosen a priori. This choice is crucial and difficult. In the complete-case framework, different methods can be used to select this number, such as the scree test, permutation tests or methods based on distributional assumptions (Jolliffe, 2002). However, they cannot be straightforwardly applied in the incomplete case. On the contrary, the cross-validation method described in Bro, Kjeldahl,



Fig. 4. Group representation obtained by an MFA on the complete dataset (top left), an MFA on the incomplete dataset with an imputation by the mean of the variables (top right) and with regularised iterative MFA (bottom).

Smilde, and Kiers (2008) and in Josse and Husson (2011) for PCA has the great advantage of being easily extended to the incomplete case. The rationale of the method is, for a fixed number of dimensions, to remove a cell of the data matrix and to predict it with the reconstruction formula built on the data without this cell. The procedure is repeated for each cell of the data and for different numbers of dimensions. The number that minimises the mean squared error of prediction is kept. Since MFA is a particular weighted PCA, this methodology can be adapted to MFA, but work remains to be done to validate the quality of this extension.
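As a concrete illustration of Section 2.1, here is a minimal sketch of the regularised iterative PCA scheme: mean initialisation, SVD on the completed table, regularised reconstruction, imputation of the missing cells only. The function name is ours, the rescaling step of standardised PCA is omitted, and the noise variance σ² is taken as the mean of the discarded eigenvalues, as in the text.

```python
import numpy as np

def regularised_iterative_pca(X, S=2, n_iter=500, tol=1e-9):
    """Impute np.nan entries of X by regularised iterative PCA (a sketch).
    W is the weight matrix: 1 for observed cells, 0 for missing ones."""
    W = ~np.isnan(X)
    Xl = np.where(W, X, np.nanmean(X, axis=0))   # step 0: mean imputation
    for _ in range(n_iter):
        m = Xl.mean(axis=0)                      # update the means
        U, s, Vt = np.linalg.svd(Xl - m, full_matrices=False)
        d = s ** 2                               # proportional to the eigenvalues
        sigma2 = d[S:].mean()                    # noise variance: mean of last eigenvalues
        s_reg = np.maximum(d[:S] - sigma2, 0.0) / np.sqrt(d[:S])  # thresholded singular values
        Xhat = m + (U[:, :S] * s_reg) @ Vt[:S]   # regularised reconstruction
        Xnew = np.where(W, X, Xhat)              # keep observed cells, impute missing ones
        if np.sum((Xnew - Xl) ** 2) < tol:       # stop when the imputation stabilises
            return Xnew
        Xl = Xnew
    return Xl

# Toy example: an essentially rank-2 table with one cell removed.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(30, 5))
X_miss = X.copy()
X_miss[0, 0] = np.nan
X_imp = regularised_iterative_pca(X_miss, S=2)
```

Since the toy data are essentially rank two, the imputed cell should land near the true value; the amount of shrinkage grows as the signal weakens relative to σ².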

3. Results

In the first sub-section, the behaviour and the outputs of regularised iterative MFA are illustrated on a small sensory dataset. Then, in the second sub-section, simulations are performed in the context of napping data collection, focusing on the major issue of the maximum number of products that can be assessed during an evaluation task. Finally, the method is illustrated on a dataset with categorical variables.

3.1. Simulations on a small sensory dataset

The method is illustrated on an extract of the dataset "wine" (Escofier & Pagès, 2008) in which 21 wines are described by nine olfactive and nine tasting attributes. The purpose of the analysis is to describe the wines but also to see the differences between the olfactive and the tasting perceptions. The nine olfaction variables are quality of smell, fruity, flower, spice, plant, phenolic, aroma intensity, aroma persistency and aroma quality; the nine tasting variables are attack intensity, acidity, astringency, bitterness, alcohol, balance, smooth, intensity and harmony. All the variables are scored from 1 to 9 by panellists and the mean of the scores is calculated for each product. MFA is performed on the complete dataset. The configuration of the wines, i.e. the consensus or mean configuration, the superimposed representation and the configuration

of the groups on the first two dimensions of MFA are represented in Figs. 2–4 (top left) and are named the true configurations. To create a specific pattern of missing values as illustrated in Fig. 1, missing values are inserted for six wines on all the variables of the olfaction group. The missing entries concern specific wines (1VAU, 2ING and T1), which contribute a lot to the construction of the first dimensions of MFA, as well as non-specific wines (4EL, 2BEA and 1FON), which are close to the centre of gravity. The proportion of missing data (14%) is not very large, but the pattern of missing entries is difficult because all the missing entries concern the same wines and the same group of variables. Then, from the incomplete dataset, two strategies are compared. The first one consists in imputing each missing entry with the mean of the variable and then performing an MFA on the completed dataset; this method is called the "mean imputation" and is still implemented by default in much software. The second strategy consists in performing the regularised iterative MFA algorithm with two dimensions; the number of dimensions has been chosen by cross-validation. Fig. 2 shows that the true consensus configuration and the one obtained with regularised iterative MFA are very close. This is confirmed by the RV coefficient (Escoufier, 1973; Josse, Pagès, & Husson, 2008) between the two configurations (on the first two dimensions), which is equal to 0.966. The configuration obtained from the mean imputation is distorted and far from the true configuration (the RV coefficient is equal to 0.750). The results of the superimposed representation (Fig. 3) also agree with the previous comments. In this representation, each individual is represented by its "mean" point as in Fig. 2, but there is also one partial point associated with each group of variables; the "mean" point is at the barycentre of the partial points. More details about this representation are available in Escofier and Pagès (2008). The partial points are well recovered with regularised iterative MFA, even for the individuals with missing entries for the olfaction group: regularised iterative MFA uses the relationships between the variables of the olfaction and tasting groups to calculate the position of the partial points. If the mean imputation is used on a dataset where some individuals have missing entries for all the variables of one group, then the partial points of these individuals for that group are at the barycentre of the graph. This is expected since the mean imputation does not use any information from the other group, and it is of course a very bad representation (see graph on top right, Fig. 3). The group representation obtained with regularised iterative MFA is close to the representation obtained on the complete dataset (Fig. 4). This is expected since the configurations of the mean individuals are close (Fig. 3). We can notice that the percentage of variance on the first two dimensions is greater for regularised iterative MFA (87.03%) than for the MFA on the complete dataset (78.61%). This can be explained by the fact that, at the end of the algorithm, the missing values are imputed with the principal components and axes; consequently, the structure of the dataset is artificially reinforced. The imputed values are then considered as observed values (although they are not) and the percentage of variability is overestimated.

3.2. Simulation with napping data

Napping (Pagès, 2005), also named projective mapping or placing, is a holistic approach in which each judge is asked to position products on a tablecloth according to his/her sensory perception: two products are close if they are perceived as similar from a sensory point of view and distant if they are perceived as different. The classical data table obtained from this method contains the products in rows and the coordinates of the products on the tablecloth

for each judge in columns (i.e. two columns per judge, one for the x-coordinates and one for the y-coordinates). The data are continuous and the variables are structured into groups, one group corresponding to one judge. These data may be analysed by MFA, the INDSCAL model (Carroll & Chang, 1970), distatis (Abdi & Valentin, 2007) or GPA, for instance. A napping evaluation was conducted by students in a food science master's degree from Agrocampus, and a complete dataset is available in which 99 judges (French consumers) evaluated 12 luxury perfumes. MFA is first performed on this complete dataset, leading to what we call the "true configuration" in the sense that it is the configuration obtained on the complete dataset. Then we consider that each panellist evaluates only a sample of n products (n varying from 6 to 11) among the 12 products. In practice it is common that each panellist does not evaluate the whole set of products, for two main reasons. First, it is difficult to position more than a certain number of products on a tablecloth because the task is intellectually demanding. In addition, after assessing many products a phenomenon of sensory saturation may occur. A D-optimal design was used to select the n products evaluated by each judge: the panellists do not all evaluate the same subset of products, and all 12 products are evaluated the same number of times. The products that are not evaluated by a judge have missing values for the x-coordinate and the y-coordinate, leading to a missing data pattern as illustrated in Fig. 1. Note that these missing values are not really missing but planned by an experimental design; however, the methodology to deal with missing values can be applied in such a situation.

Fig. 5. Average of the RV coefficient between the consensus configuration obtained by an MFA on the complete dataset and the one obtained by an MFA on the incomplete dataset with an imputation by the mean of the variables (dashed line) and by regularised iterative MFA (solid line). Results are given for different numbers of products assessed and for different subsets of consumers.
The RV coefficient is then calculated between two matrices with 12 rows (for the 12 products) and two columns (for the two dimensions of the MFA). The first matrix corresponds to the true consensus configuration and the second corresponds either to the consensus configuration obtained by regularised iterative MFA or to the one obtained by the mean imputation. Fifty simulations are done for each number of products (n) to avoid a configuration being close to the true consensus configuration by chance, and the mean of the RV coefficients over these 50 simulations is calculated. This procedure is also applied for different subsets of panellists (12, 25, 50 and 75). More precisely, for a subset of 25 panellists, the


MFA on the complete data (with 25 panellists) is performed and gives the "true configuration". Then missing values are inserted in this dataset so that each panellist describes only a sample of products (6 to 11 products). On this incomplete dataset, regularised iterative MFA as well as the mean imputation are performed and the RV coefficients are calculated. Fig. 5 represents the mean of the RV coefficients as a function of the number of products for each subset of panellists. The RV coefficient between the true consensus configuration and the one obtained with regularised iterative MFA is close to one when the number of consumers is large, whatever the number of products evaluated per judge: it is equal to 0.9 even when the 99 judges evaluate only six products. For a small number of judges (12 or 25), the RV coefficient is close to one (greater than 0.9) when each judge evaluates 9–11 products, but if the number of missing values is too large, then the consensus configuration obtained with regularised iterative MFA is far from the consensus configuration obtained with the MFA on the complete subset. Note that, whatever the number of judges and the number of products evaluated per judge, regularised iterative MFA always gives better results than the mean imputation. This simulation study opens perspectives on the number of products that can be compared during a sensory evaluation. When there are many judges, it is possible to give a small number of products to each judge and still obtain accurate results for all the products. This gives the opportunity to explore a larger number of products in a sensory session. For instance, a product space with 30 perfumes could be studied in a napping session because, thanks to regularised iterative MFA, each judge can perform a sensory evaluation on a smaller subset of perfumes. Of course, this result depends on the type of products.
If the evaluation of the products is very similar from one judge to another, then it is easy to learn from the other judgements and the number of missing values can be large. Indeed, as shown with the wine example, regularised iterative MFA performs well when there are strong relationships between variables. In a napping description, the individual judgements are often not well reproduced from one judge to another and it is not possible to have a lot of missing values. Another simulation study was conducted using this dataset. The consensus configuration of the MFA on the complete dataset (with all 12 products and the 99 judges) is here considered as the reference configuration. First, the RV coefficient between this reference configuration and the consensus configuration from the MFA on a complete subset of 25 judges (who evaluated all the products) is calculated and given in the first row of Table 1. Then, the RV coefficients are calculated between the reference configuration and the consensus configurations obtained with regularised iterative MFA performed on a subset of 50 judges that evaluated a subset of 6–11 products (rows 2–7 of Table 1). Results in Table 1 correspond to the mean of the RV over 100 simulations. The results are striking: for the same number of evaluations, it is better to have 25 judges that evaluate all the products (row 1, RV = 0.92) than 50 judges that evaluate half of the products (row 2, RV = 0.81). But, if the 50 judges evaluate nine products or more (rows 5–7), then the consensus configurations are closer to the reference one. The same simulation procedure has been performed for 10 panellists assessing all the products (row 8) and then 20 panellists assessing from 6 (to have the same number of evaluations) to 11 products; the results are similar to the previous ones. A comment can be made regarding these two simulation studies.
Generating missing values from a complete dataset is not exactly the same as having real missing values, i.e. judges focusing their evaluation only on a subset of products. In the latter case, we can expect the tasting task to be easier and to lead to a more relevant evaluation. Nevertheless, these simulation studies give clues about the behaviour of the method with respect to the number of products.
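The RV coefficient (Escoufier, 1973) used throughout these comparisons has a simple closed form based on the cross-product matrices of the two configurations. The Python sketch below is illustrative only (the analyses in this paper were run in R with FactoMineR; the function name is ours):

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two configurations X and Y whose rows
    describe the same products. Both are column-centred, and the
    coefficient compares their I x I cross-product matrices."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    WX = Xc @ Xc.T  # inter-product scalar products for configuration X
    WY = Yc @ Yc.T  # same for configuration Y
    num = np.trace(WX @ WY)
    den = np.sqrt(np.trace(WX @ WX) * np.trace(WY @ WY))
    return num / den
```

The coefficient lies between 0 and 1 and equals 1 for two configurations that are identical up to rotation, translation, and scaling, which is why it is a natural measure of agreement between consensus maps.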

Table 1
The first row of each sub-table corresponds to the RV coefficient between the consensus configuration on the complete dataset (with all the products and the 99 judges) and the consensus configuration obtained with MFA on a subset of judges who evaluated all the products. The subsequent rows correspond to the RV coefficient between the consensus configuration on the complete dataset and the consensus configurations obtained with regularised iterative MFA on a subset of judges who evaluated a subset of products (the standard deviation of the mean of the RV over the 100 simulations is given in parentheses).

Row | Number of panellists | Number of products evaluated per panellist | RV coefficient
 1  | 25 | 12 | 0.920 (3.58 × 10⁻³)
 2  | 50 |  6 | 0.804 (7.32 × 10⁻³)
 3  | 50 |  7 | 0.872 (5.48 × 10⁻³)
 4  | 50 |  8 | 0.911 (4.49 × 10⁻³)
 5  | 50 |  9 | 0.936 (3.31 × 10⁻³)
 6  | 50 | 10 | 0.954 (2.49 × 10⁻³)
 7  | 50 | 11 | 0.966 (1.30 × 10⁻³)
 8  | 10 | 12 | 0.807 (6.94 × 10⁻³)
 9  | 20 |  6 | 0.645 (1.29 × 10⁻²)
10  | 20 |  7 | 0.731 (1.15 × 10⁻²)
11  | 20 |  8 | 0.786 (8.79 × 10⁻³)
12  | 20 |  9 | 0.833 (7.00 × 10⁻³)
13  | 20 | 10 | 0.859 (5.57 × 10⁻³)
14  | 20 | 11 | 0.881 (4.90 × 10⁻³)
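The incomplete designs in rows 2-7 and 9-14 of Table 1 amount to drawing, for each judge, a random subset of products; the remaining products become missing rows in that judge's group of variables. A minimal Python sketch of such a pattern (the function name is ours and the simulations in the paper were run in R):

```python
import numpy as np

def missing_pattern(n_products, n_per_judge, n_judges, seed=0):
    """Boolean mask of shape (n_products, n_judges): True where a product
    is NOT evaluated by a judge, i.e. a missing row in that judge's table.
    Each judge is assigned a random subset of n_per_judge products."""
    rng = np.random.default_rng(seed)
    mask = np.ones((n_products, n_judges), dtype=bool)
    for j in range(n_judges):
        kept = rng.choice(n_products, size=n_per_judge, replace=False)
        mask[kept, j] = False  # these products are actually evaluated
    return mask
```

Applying such a mask to a complete dataset, running regularised iterative MFA, and computing the RV coefficient with the complete-data consensus, repeated 100 times, gives the kind of results reported in Table 1.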

3.3. Sorting task descriptions done by two panels

This last example focuses on categorical variables and shows the ability of the method to deal with groups of categorical variables with missing values. The sorting task descriptions of the same set of 12 products (the luxury perfumes of Section 3.2) provided by two panels are analysed. In the first panel, 98 panellists evaluated all the products. In the second panel, 46 panellists evaluated eight products among the 12; the products were assigned to the judges according to a balanced incomplete block design. The complete dataset is analysed in Cadoret, Lê, and Pagès (2009). The aim of this analysis is to take into account all the available information to produce a product space while balancing the influence of both panels. Regularised iterative MFA is performed considering two groups of categorical data, one group per panel. Each variable in each group corresponds to the clusters formed by one judge. The first group has 98 variables without missing values whereas the second group has 46 variables with four missing values each. Note that, contrary to the previous simulations (Section 3.2), the missing values are real in the sense that the panellists evaluating a subset of products really evaluated only this subset. Fig. 6 represents the mean configuration obtained with the method using all the data at hand. Notice that in this case, with one complete and one incomplete group, the graph of the groups of variables and the graph of the partial points must be interpreted with caution, since the relationship between the two groups is mechanically increased. In other words, if the aim of the analysis were to compare the results obtained from a complete and an incomplete dataset, it would be better to use the following strategy: compare the results of the MCA performed on the complete dataset (with 98 panellists) to those of the MCA on the incomplete dataset (with 46 panellists); more precisely, the RV coefficient could be computed between both MCA configurations.

4. Conclusion

Regularised iterative MFA allows one to perform MFA with missing values even when there are rows of missing values in a multi-table dataset. This method is particularly efficient when

[Fig. 6: consensus configuration of the 12 perfumes on dimensions 1 and 2 (Dim 1: 19.51% of inertia). Regularised iterative MFA results obtained on the categorical data set concerning the sorting task description of 12 perfumes performed by 98 panellists who evaluated the 12 products and the sorting task description of the same products performed by 46 panellists who evaluated eight products among the 12.]

there are strong relationships between the variables and between the groups. During the algorithm, an imputation of the missing values is performed. This imputation takes into account the similarities between individuals as well as the relationships between variables and between groups, since it uses the principal components and the axes. Regularised iterative MFA can thus be used as a single imputation method, and it is consequently possible to perform other statistical methods on the completed dataset. However, after an imputation, we have to keep in mind that the imputed values are not observed values but only predicted ones. As a consequence, the information carried by the observed values appears artificially reinforced, because it is also used to predict the missing values. Methods such as multiple imputation (Little & Rubin, 2002) can be considered to take into account the uncertainty of the predictions.

It is common practice in the complete case to perform principal component methods before clustering. Using MFA as a preprocessing step before clustering a multi-table dataset is a way to denoise the data and to balance the influence of the groups of variables when computing distances between individuals. This methodology can easily be extended to the incomplete case by performing the clustering on the principal components obtained by regularised iterative MFA.

The regularised iterative MFA algorithm is implemented in the imputeMFA function of the missMDA package (Husson & Josse, 2013) for the R software (R Development Core Team, 2012). This function takes as inputs an incomplete dataset with groups of continuous and/or categorical variables and the number of dimensions used in the algorithm. It returns the completed matrix X, with the completed continuous variables and the completed indicator matrices. This matrix X is then used as an input of the MFA function of the FactoMineR package (Husson, Josse, Lê, & Mazet, 2013; Lê, Josse, & Husson, 2008a) through the argument tab.comp. Results of MFA with missing values are thus obtained. Proceeding in two steps, first an imputation and then an MFA, allows one to use this method as an imputation method by keeping only the results of the first step. However, the imputation of the categorical variables is not directly obtained; only an imputation of the indicator matrices is given. These imputed values can be considered as degrees of membership of the associated categories and, consequently, it is possible to impute a categorical variable with its most plausible value. In the missMDA package, the imputePCA and imputeMCA functions are also available to perform PCA and MCA with missing values and to impute continuous and categorical variables.

From a sensory point of view, the simulation study highlights that it may be a better strategy to have many judges assess fewer products than to have a smaller number of judges assess all the products. However, it is difficult to know whether the results obtained from incomplete data, even with many judges, can be interpreted accurately; one needs to know how much credit can be given to those results. Consequently, it would be very useful to have confidence areas around the results.
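The alternating scheme at the heart of the algorithm (impute, re-estimate the axes and components, re-impute) can be sketched for the plain PCA case on which regularised iterative MFA is built. The Python code below is a simplified illustration, not the missMDA implementation: the soft-thresholding of singular values only loosely mimics the actual regularisation, and the argmax helper illustrates the ‘‘most plausible category’’ rule for imputed indicator rows.

```python
import numpy as np

def iterative_svd_impute(X, ncp=2, shrink=0.0, tol=1e-10, max_iter=1000):
    """Alternate (i) a rank-ncp SVD fit and (ii) imputation of the missing
    cells by the fitted values, starting from column-mean imputation.
    `shrink` pulls the retained singular values towards zero, a crude
    stand-in for the regularisation discussed in the paper."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        s_reg = np.maximum(s[:ncp] - shrink, 0.0)
        fitted = (U[:, :ncp] * s_reg) @ Vt[:ncp] + mu
        new = np.where(miss, fitted, X)  # observed cells are never altered
        if np.sum((new - filled) ** 2) < tol:
            return new
        filled = new
    return filled

def impute_category(indicator_rows, labels):
    """Map imputed indicator rows (degrees of membership) back to the
    most plausible category."""
    return [labels[int(np.argmax(r))] for r in np.atleast_2d(indicator_rows)]
```

On data with a strong low-rank structure, the imputed cells converge close to the values a complete-data analysis would have produced, which is consistent with the observation above that the method works best when variables and groups are strongly related.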

References

Abdi, H., & Valentin, D. (2007). Some new and easy ways to describe, compare, and evaluate products and assessors. In D. Valentin, L. Pelletier, & D. Z. Nguyen (Eds.), New trends in sensory evaluation of food and non-food products (pp. 5–15). Ho Chi Minh, Vietnam National University: Ho Chi Minh Publishing House.
Ares, G., Giménez, A., Barreiro, C., & Gámbaro, A. (2010). Use of an open-ended question to identify drivers of liking of milk desserts. Comparison with preference mapping techniques. Food Quality and Preference, 21(3), 286–294.
Bro, R., Kjeldahl, K., Smilde, A. K., & Kiers, H. A. L. (2008). Cross-validation of component models: A critical look at current methods. Analytical and Bioanalytical Chemistry, 390, 1241–1251.
Cadoret, M., Lê, S., & Pagès, J. (2009). A factorial approach for sorting task data (FAST). Food Quality and Preference, 20, 410–417.
Carroll, J. D., & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of ‘‘Eckart–Young’’ decomposition. Psychometrika, 35, 283–319.
Christofferson, A. (1969). The one-component model with incomplete data. PhD dissertation, Uppsala University, Institute of Statistics.
Commandeur, J. J. F. (1991). Matching configurations. DSWO Press, Leiden University.
Dehlholm, C., Brockhoff, P. B., Meinert, L., Aaslyng, M. D., & Bredie, W. L. P. (2012). Rapid descriptive sensory methods – Comparison of free multiple sorting, partial napping, napping, flash profiling and conventional profiling. Food Quality and Preference, 26, 267–277.
Dooley, L., Lee, Y.-S., & Meullenet, J.-F. (2010). The application of check-all-that-apply (CATA) consumer profiling to preference mapping of vanilla ice cream and its comparison to classical external preference mapping. Food Quality and Preference, 21, 394–401.
Escofier, B., & Pagès, J. (2008). Analyses factorielles simples et multiples; objectifs, méthodes et interprétation (4th ed.). Paris: Dunod.
Escoufier, Y. (1973). Le traitement des variables vectorielles. Biometrics, 29, 751–760.
Gabriel, K., & Zamir, S. (1979). Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21, 236–246.
Greenacre, M., & Blasius, J. (2006). Multiple correspondence analysis and related methods. London: Chapman & Hall/CRC.
Husson, F., & Josse, J. (2013). missMDA: Handling missing values with/in multivariate data analysis (principal component methods). R package version 1.6.
Husson, F., Josse, J., Lê, S., & Mazet, J. (2013). FactoMineR: Factor analysis and data mining with R. R package version 1.21. http://www.cran.R-project.org/package=FactoMineR.
Jolliffe, I. T. (2002). Principal component analysis. Springer.
Josse, J. (2010). Prise en compte des données manquantes en analyse exploratoire des données. PhD dissertation, Agrocampus Ouest.
Josse, J., Chavent, M., Liquet, B., & Husson, F. (2012). Handling missing values with regularized iterative multiple correspondence analysis. Journal of Classification, 29(1), 91–116.
Josse, J., & Husson, F. (2011). Selecting the number of components in PCA using cross-validation approximations. Computational Statistics and Data Analysis, 56(6), 1869–1879.
Josse, J., & Husson, F. (2012). Missing values in exploratory multivariate data analysis methods. Journal de la SFdS, 153(2), 79–99.
Josse, J., Pagès, J., & Husson, F. (2008). Testing the significance of the RV coefficient. Computational Statistics and Data Analysis, 53, 82–91.
Kiers, H. A. L. (1997). Weighted least squares fitting using ordinary least squares algorithms. Psychometrika, 62, 251–266.
Kroonenberg, P. M. (2008). Applied multiway data analysis. Wiley series in probability and statistics.
Lê, S., Josse, J., & Husson, F. (2008a). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1), 1–18.
Lê, S., Pagès, J., & Husson, F. (2008b). Methodology for the comparison of sensory profiles provided by several panels: Application to a cross-cultural study. Food Quality and Preference, 18(7), 179–184.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. New York: Wiley series in probability and statistics.
Pagès, J. (2005). Collection and analysis of perceived product inter-distances using multiple factor analysis; application to the study of ten white wines from the Loire Valley. Food Quality and Preference, 16, 642–649.
Pagès, J., Bertrand, C., Ali, R., Husson, F., & Lê, S. (2007). Compared sensory analysis of eight biscuits by French and Pakistani panels. Journal of Sensory Studies, 22, 665–686.
Pagès, J., & Husson, F. (2005). Multiple factor analysis with confidence ellipses: A methodology to study the relationships between sensory and instrumental data. Journal of Chemometrics, 19, 138–144.
R Development Core Team (2012). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Tomasi, G., & Bro, R. (2005). PARAFAC and missing values. Chemometrics and Intelligent Laboratory Systems, 75(2), 163–180.
Tu, V. P., Valentin, D., Husson, F., & Dacremont, C. (2010). Cultural differences in food description and preference: Contrasting Vietnamese and French panellists on soy yogurts. Food Quality and Preference, 21, 602–610.
Van de Velden, M., & Bijmolt, T. H. A. (2006). Generalized canonical correlation analysis of matrices with missing rows: A simulation study. Psychometrika, 71(2), 323–331.
Van de Velden, M., & Takane, Y. (2011). Generalized canonical correlation analysis with missing values. Computational Statistics, 27, 551–571.
van der Heijden, P. G. M., & Escofier, B. (2003). Multiple correspondence analysis with missing data. In Analyse des correspondances. Rennes: Presses Universitaires de Rennes.
Wold, H. (1966). Nonlinear estimation by iterative least squares procedures. In Research papers in statistics (pp. 411–444). Wiley.