Food Quality and Preference 19 (2008) 146–155 www.elsevier.com/locate/foodqual
Permutation tests for Generalized Procrustes Analysis

R. Xiong a, K. Blot a,*, J.F. Meullenet b, J.M. Dessirier a

a Unilever R&D, 40 Merritt Blvd, Trumbull, CT 06611, USA
b Department of Food Science, University of Arkansas, 2650 N Yong Ave, Fayetteville, AR 72704, USA
Received 2 September 2006; received in revised form 2 February 2007; accepted 11 March 2007 Available online 23 March 2007
Abstract

Generalized Procrustes Analysis (GPA) is a useful tool for sensory professionals to analyze sensory data, especially those from free choice profiling. Over a decade ago, Wakeling introduced a permutation test for determining if the GPA consensus is significant. However, Wakeling's permutation test lacks specification of an explicit null hypothesis, resulting in different interpretations of what a "significant consensus" signifies. In this paper, a new GPA permutation test analogous to the well-established ANOVA permutation procedure is proposed. The proposed approach emphasizes that the null hypothesis dictates how data are permuted to test specific null hypotheses within GPA (e.g., product effect, assessor effect and interaction effect). The only assumption behind permutation testing is exchangeability, which is discussed. Applications of the proposed GPA permutation test are provided using three datasets (two actual datasets and one random dataset).

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Permutation test; GPA; ANOVA; MANOVA; Free choice profiling; Conventional profiling
1. Introduction

Permutation testing is a non-parametric technique that can permit tests of statistical significance for essentially any null hypothesis by repeatedly randomizing the original dataset (see e.g., Anderson, 2001; Good, 2000). For each randomization of the original dataset, a test statistic is computed. The result is a sampling distribution of the test statistic under the null hypothesis, which is used to assess the statistical significance of the test statistic obtained from the original dataset. Permutation tests can produce reliable and sometimes exact p-values for data violating conventional normal theory assumptions by preserving the original data structure. Notably, for each permutation, the null hypothesis dictates how to randomly exchange or rearrange the original observations (see Good, 2000). The null hypothesis also dictates how to interpret the permutation test results. As
* Corresponding author. Tel.: +1 203 381 4321; fax: +1 203 381 5485. E-mail address: [email protected] (K. Blot).
0950-3293/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.foodqual.2007.03.003
there are multiple ways to permute observations when multiple factors are involved in a study, special attention must be given to setting up the proper null hypothesis to permit the correct significance test and interpretation. In addition, permutation testing for multiple-way ANOVA (analysis of variance), especially for two-way ANOVA, is well documented and can be used to test different null hypotheses (see Anderson, 2001; Good, 2000), but this has not been explored by sensory professionals. Generalized Procrustes Analysis (GPA, see Gower, 1975) is a multivariate exploratory technique that involves transformations (i.e., translation, rotation, reflection, isotropic rescaling) of individual data matrices to provide optimal comparability. The average of the individual matrices is called the consensus matrix. The individual and consensus matrices are typically submitted to Principal Components Analysis (PCA) and projected onto a lower dimensional space. This space provides a vantage point to compare individual data and to visualize the consensus. In theory, the matrix transformation processes can be assumed to help home in on underlying perceptual phenomena independent of scaling artefacts.
GPA is popular among sensory professionals as a way to combine data from different individual assessors (Dijksterhuis & Heiser, 1995). GPA is especially useful for free choice profiling, because it can accommodate different numbers and kinds of attributes among assessors. Moreover, GPA can be used to visually describe different effects, such as product differences, assessor agreement, and repeatability. However, there is no clearly established way to test the significance of each of these factorial effects, including the interactions. For instance, a test of the product by assessor effect may be critical. If this interaction is significant, it implies that assessors do not perceive the products in the same way (i.e., the product main effect is not invariant across assessors). Although previous work has been designed to examine the significance of the consensus variance using Monte Carlo simulation (King & Arents, 1991) or permutation testing (Wakeling, Raats, & MacFie, 1992; Wu, Guo, de Jong, & Massart, 2002), no study has explicitly adapted the ANOVA permutation test to analogous significance testing for GPA. Therefore, the objective of this paper is to show how permutation tests designed for factorial ANOVA can be adapted to GPA to permit analogous main effect and interaction significance tests.

2. Theory

2.1. Permutation testing

Permutation testing is a non-parametric technique that can permit tests of statistical significance for any null hypothesis by repeatedly permuting the original dataset. The theory for the permutation test has evolved from the works of Fisher (1935) and Pitman (1937). The permutation test procedure can be summarized in the following six steps (Good, 2000):

(1) Set up the null and alternative hypotheses (H0 and Ha) of interest.
(2) Choose a test statistic t(x) that distinguishes H0 from Ha.
(3) Compute the test statistic t(x0) for the original observations.
(4) Permute the observations according to H0.
(a) Randomly rearrange the observations by assigning them to new treatment conditions.
(b) Compute the test statistic t(x) for the new arrangement.
(c) Repeat steps a–b N times (e.g., N = 1000).
(5) Compute the distribution of the test statistic t(x) and the corresponding p-value for t(x0). The p-value is calculated as (m + 1)/(N + 1), where m typically represents the number of t(x) values that are as small as or smaller than the observed t(x0) (e.g., F statistic), while N is the total number of t(x) values calculated. Note that for some test statistics, m can represent the number of t(x) values that are as large as or larger than the
observed t(x0). The "1" in the numerator and denominator denotes that the observed value t(x0) is a member of the set of possible values.
(6) Draw a conclusion.
(a) Reject H0 if the p-value is less than α (the significance level).
(b) Do not reject H0 if the p-value is equal to or greater than α.

According to Fisher (1936), the permutation test agrees with the normal theory tests when the normal theory assumptions are met (Anderson, 2001). The only assumption of the permutation test is exchangeability. Exchangeability entails that under a true H0, the original observations stay fixed while being assigned randomly to alternative treatment conditions. These exchanges are done multiple times, resulting in several possible new arrangements that are all equally likely to occur. Each new arrangement, or permutation, has a unique set of treatment–observation pairings, resulting in a test statistic outcome that provides one point on the H0 sampling distribution. The outcomes themselves are not equally likely to occur: increasingly extreme values (large or small) have a vanishingly small likelihood of occurring. The original dataset is considered one other possible arrangement under H0 that contributes one other possible outcome to the H0 sampling distribution. All these outcomes are possible under H0 given that the original dataset involves random assignment of treatments to experimental units, random ordering of treatments for each experimental unit, etc. In this way, the experimental approach controls for bias that is not assumed under H0 and that may influence the outcomes contributing to the empirical sampling distribution under H0. This exchangeability assumption is met for experimental designs and inferred for quasi-experimental designs. An implication of exchangeability is that errors are independent and identically distributed, following any common distribution (e.g., lognormal) (Anderson, 2001; Good, 2000).
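As an illustration, the six steps can be sketched in a few lines of code. The two-sample setting and the difference-of-means statistic below are hypothetical choices for demonstration; the p-value follows the (m + 1)/(N + 1) formula above.

```python
import numpy as np

def permutation_test(x, y, n_perm=1000, seed=0):
    """Generic permutation test on the difference of means: compute the
    observed statistic t(x0), repeatedly rearrange the pooled data under
    H0 (no group difference), and report p = (m + 1) / (N + 1)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    t0 = x.mean() - y.mean()              # step 3: observed statistic t(x0)
    m = 0
    for _ in range(n_perm):               # step 4: permute according to H0
        rng.shuffle(pooled)
        t = pooled[:len(x)].mean() - pooled[len(x):].mean()
        if abs(t) >= abs(t0):             # count outcomes at least as extreme
            m += 1
    return (m + 1) / (n_perm + 1)         # step 5: p-value
```

With clearly separated groups the p-value approaches its lower bound of 1/(N + 1); exchangeability of the pooled observations under H0 is what justifies the shuffle.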
Note that because all permutation arrangements are equally likely to occur, a large random subset (e.g., 1000) of the total possible permutations should produce a representative sampling distribution.

2.2. Prior work: significance testing and GPA

King and Arents (1991) reported that GPA always produces a consensus space or plot even when the data are randomly generated. Thus, for instance, any apparent product differences observed in the consensus space may not be real. To determine when GPA provides systematic information rather than random output, King and Arents proposed a Monte Carlo approach to test the significance of Rc (the percentage of consensus variance in the total variance). The sampling distribution of Rc was obtained by repeating the GPA analysis on hundreds of random datasets generated from uniform distributions. If Rc for the original
data was larger than the 95th percentile of the sampling distribution, then it was inferred that ". . . the consensus plot as given by the GPA most likely represents some true consensus among panelists regarding the relationships among the samples evaluated" (King & Arents, 1991, p. 41). Since King and Arents (1991) generated random data independently for each sensory variable, the correlations among the variables in the original data were not preserved. This approach was shown to result in inflated Rc for the random outcomes, and thus, an overly conservative significance test (see Wakeling et al., 1992). Moreover, the approach did not permit specific ways to randomize the data to test for specific main or interaction effects, beyond testing the overall consensus variation against an overall random reference point. Wakeling et al. (1992) modified King and Arents' method by permuting the original data. The original data consisted of product labels (rows) and attributes (columns). Whole rows were permuted for a given exchange, so that a product and its associated observations across all attributes were assigned randomly to another product. The exchange of whole rows preserves the correlational structure between the attributes. Moreover, the rows were permuted specifically within each assessor's matrix, consistent with free choice profiling, which does not permit exchange of data across assessors using different attributes. Again, the general objective was to test the likelihood that the original observed Rc came from a distribution of randomly generated Rc values. A small likelihood (p ≤ .05) indicated significance. A significant result was difficult to interpret because there was no explicit null hypothesis given for the test.
As we will show in the following discussion, their test was actually consistent with what would be done to test the null hypothesis (H0) of no product differences, since restricted permutations within each assessor's matrix controlled for the assessor main effect. Extending Wakeling's GPA permutation test, Wu et al. (2002) proposed a randomization F-test for each dimension to determine how many dimensions could be considered meaningful, in essence providing a potential substitute for the traditional scree plot. The randomization F-test suffers from the same problem as Wakeling's permutation test using Rc: lack of an explicit statement of the null hypothesis and its proper interpretation. As discussed above, the permutation test always requires the null hypothesis because H0 determines how to permute the observations. To the best of our knowledge, there are no explicit null hypotheses for the significance test of factorial effects in GPA in the literature (King & Arents, 1991; Kunert & Qannari, 1999; Wakeling et al., 1992; Wu et al., 2002). For instance, the significance of Rc may imply the following null hypothesis (H0) and alternative hypothesis (Ha):
H0: Rc = R0
Ha: Rc > R0

where R0 is the percentage of consensus variance in the total variance that is due to noise, or truly random data, which can be estimated through simulation. This null hypothesis does not tell us how to permute the observations or how to interpret the permutation test result. Wakeling et al. (1992) proposed permuting the rows of each assessor's data matrix, apparently to preserve correlations among sensory attributes, but this permutation procedure is not directly linked to this null hypothesis (H0: Rc = R0). Moreover, a by-product of the scarcity of explicit null hypotheses for significance testing on GPA has been unavoidably different interpretations of what exactly is being tested. There are at least five interpretations in the literature: whether the consensus is false or true (King & Arents, 1991); whether the consensus is significant or not (Wakeling et al., 1992); whether the consensus is meaningful (Kunert & Qannari, 1999); whether the consensus is artifactual or not (Wu et al., 2002); and whether a consensus is reached after the GPA transformations (XLStat, version 6.3, 2006). Clearly then, stating the null hypothesis for each of the factorial effects, performing the corresponding permutations, and conducting the relevant significance tests should help localize, manage, and explain the GPA output. This set of procedures has not previously been employed to test factorial effects for GPA. Without a proper connection between the null hypotheses and the permutation procedure, the interpretation of the GPA can be misleading. The following sections make the connection between each null hypothesis and its permutation procedure explicit.

2.3. Permutation testing for ANOVA

The tests for GPA factorial effects and for factorial ANOVA are analogous. Permutation tests for factorial ANOVA are well documented and can be adapted to GPA factorial testing. The simplest factorial model, the linear two-way ANOVA model, is given as

Yijk = μ + Ai + Bj + ABij + eijk  (i = 1, 2, ..., a; j = 1, 2, ..., b; k = 1, 2, ..., n)

where μ is the population mean, Ai is the effect of the ith level of factor A, Bj is the effect of the jth level of factor B, ABij is the interaction effect of the ijth combination of factors A and B, eijk is the error associated with observation Yijk, a and b are the numbers of levels of factors A and B, respectively, and n is the number of replications. Since there are two main effects and one interaction effect, there are three null and alternative hypotheses for the two-way ANOVA problem. The null hypotheses (H0) for the interaction and main effects are given, respectively, as follows:

H0 (interaction effect): ABij = 0  (i = 1, 2, ..., a; j = 1, 2, ..., b)
H0 (main effect A): A1 = A2 = ... = Aa
H0 (main effect B): B1 = B2 = ... = Bb
Null hypothesis testing usually starts with the test of a significant interaction AB (given that replicated data are available). The permutation test is more complicated for the interaction effect than for the main effects. For the interaction test AB, the main effects must first be removed from the observations Yijk. The main effects are unknown, but can be estimated by calculating the means for each level of a factor. Main effects of the factors can then be "removed" by subtracting the appropriate mean from each observation to obtain residuals (Anderson, 2001). For a single variable Y, the removal of the main effects of factors A and B to compute the residuals is given as

rijk = Yijk − Ȳi·· − Ȳ·j· + Ȳ···

where rijk represents the residuals after removing main effects, Ȳi·· is the mean for level i of factor A, Ȳ·j· is the mean for level j of factor B, and Ȳ··· is the overall mean. Under the null hypothesis (i.e., H0 is true), the residuals rijk are exchangeable with one another, unrestricted across the AB conditions, and can be permuted (Anderson, 2001; Good, 2000). The permutation test for interaction effects is not an exact test because the residuals are estimated using sample means. As the sample size (n) increases, the estimates of the means improve and the permutation test approaches an exact test; hence it is referred to as an asymptotically exact test (Anderson, 2001). If the interaction effect is not significant (i.e., the null hypothesis associated with the interaction cannot be rejected) or is ignored for various reasons (e.g., aggregating is done across replications, thus leaving only a product main effect to test), the main effects can then be tested. When the permutation test is used for testing main effects, the effects of the remaining factors must be controlled. This control is usually achieved by permuting the observations of the factor within each level of the remaining factors.
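The double-centering step that removes the main effects can be written directly in array form. The (a, b, n) data layout and the NumPy implementation below are assumptions for illustration only:

```python
import numpy as np

def interaction_residuals(Y):
    """Remove the A and B main effects from Y (shape: a levels of A,
    b levels of B, n replicates) by double-centering, yielding
    rijk = Yijk - Ybar_i.. - Ybar_.j. + Ybar_..."""
    mean_a = Y.mean(axis=(1, 2), keepdims=True)   # Ybar_i..
    mean_b = Y.mean(axis=(0, 2), keepdims=True)   # Ybar_.j.
    return Y - mean_a - mean_b + Y.mean()
```

For purely additive data (no interaction and no error) these residuals are exactly zero, which is why they are the right quantities to permute under H0: ABij = 0.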
In the case of two factors A and B in a factorial design, the main effect of factor A can be tested by permuting the observations within each level of factor B, while the effect of factor B can be tested by permuting the observations within each level of factor A (Good, 2000). The F-value can be used as the test statistic for the two-way ANOVA permutation test, although Good (2000) suggested better alternative test statistics for testing the interaction and main effects. The permutation procedure for the interaction and main effects is summarized as follows:

(1) Under H0: ABij = 0 (i = 1, 2, ..., a; j = 1, 2, ..., b) (no interaction effect), all the residuals rijk are exchangeable → randomly rearrange the residuals rijk.
(2) Under H0: A1 = A2 = ... = Aa (no main effect of factor A), all the observations Yijk within each level of factor B are exchangeable → within each level of factor B, randomly rearrange the observations Yijk.
(3) Under H0: B1 = B2 = ... = Bb (no main effect of factor B), all the observations Yijk within each level of factor A are exchangeable → within each level of factor A, randomly rearrange the observations Yijk.
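A restricted shuffle for testing the main effect of A (scheme 2, controlling for B) might look like the following sketch; the (a, b, n) array layout is again an assumption:

```python
import numpy as np

def permute_within_B(Y, rng):
    """Main-effect-A test: rearrange observations only within each level
    j of factor B, so the B effect is held fixed. Y has shape (a, b, n)."""
    Yp = Y.copy()
    a, b, n = Y.shape
    for j in range(b):
        block = Yp[:, j, :].copy().reshape(-1)  # all a*n values at level j of B
        rng.shuffle(block)
        Yp[:, j, :] = block.reshape(a, n)
    return Yp
```

Swapping the roles of the two factors gives scheme 3, while scheme 1 shuffles the interaction residuals without any restriction.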
Note that any rearrangement always involves the whole row of data, across all variables (attributes), so the single-attribute case generalizes to the multiple-attribute case. The two-way ANOVA permutation test can be extended to the multiple-way ANOVA permutation test (Good, 2000). Note that parametric ANOVA can be applied to the above two-way factorial design data if normal theory assumptions are satisfied. However, in a balanced design, the permutation test has a threefold advantage over parametric ANOVA: it is essentially exact for main effects given that all or a large subset of the permutations are tested; it is not restricted by an assumption of normality; and it is as powerful as or more powerful than the parametric approach (Good, 2000). In addition, the parametric ANOVA approach may not be applicable for testing the interaction and main effects within the context of GPA (e.g., due to insufficient degrees of freedom), but the permutation test is generally applicable. In an unbalanced design, however, the permutation test of the main effects will be confounded with interactions, so that the two cannot be tested separately (Good, 2000).

2.4. Permutation testing for GPA

A GPA study is usually a factorial design with at least two factors (products and assessors or panels), so the approach discussed above for the two-way ANOVA permutation test can be practically applied to GPA. For convenience of discussion, assume that a panel of assessors evaluated multiple products in a factorial design experiment. Analogous to the two-way ANOVA approach, there are three possible effects of interest: product, assessor, and product by assessor. There are at least three possible ways to permute the data, assuming the conditions are exchangeable under H0. As usual, we start with the null hypothesis of no interaction effect (H0: ABij = 0).
If the data are from different assessors using different sets of attributes (i.e., free choice profiling), the AB levels logically cannot be exchanged; thus, no direct permutation test for this free choice profiling interaction is available. In contrast, if all assessors evaluated the products using the same set of attributes, the same or similar sensory protocol (including the scales, references, etc.), and the same number of replications (i.e., conventional profiling), the AB levels are presumably exchangeable and can be permuted. The residuals rijk are calculated and permuted to obtain the sampling distribution of a test statistic. Either Rc or the sum of squares of residuals (SSR, also called the loss function, see Kunert & Qannari, 1999) can be used as the test statistic. SSR will be used in this study because its calculation is simpler than that of Rc. Whether H0 of no A × B interaction is rejected or not depends on the p-value of the test statistic for the original data. If the interaction effect is not significant or the data cannot be exchanged, we may test the main effects of product (denoted as factor A) and assessor (denoted as factor B).
Under the null hypothesis (H0) of no product differences (H0: A1 = A2 = ... = Aa), the observations are exchangeable for conventional profiling as well as free choice profiling, because the data are exchanged within rather than across assessors. Products are permuted within each assessor matrix, or in other words, within each level of factor B, thereby controlling for the main effect of B. According to H0: A1 = A2 = ... = Aa, the interpretation of the product effect is that there is a significant overall difference among the products. If there is no significant overall difference (i.e., H0 is not rejected at α), the product consensus coordinates, or average product locations, are not reliably different, suggesting that there is no need for the GPA consensus plot. No overall difference among the products may indicate that there is either no "real" (physical) difference among the products and/or no perceived difference by the assessors. If the overall difference is significant (i.e., H0 is rejected at α), then at least one product has consensus coordinates that are significantly different, suggesting that the consensus plot is useful for further visual examination of the relationships among the products. As with the ANOVA approach, however, the permutation test does not tell us which products are significantly different. How to test which products are significantly different is beyond the scope of this article, but the authors will present a new method in a forthcoming paper. Unlike testing the product effect, testing the assessor effect may involve a violation of the exchangeability (or iid) assumption, because data must be exchanged across assessors. If different assessors use dissimilar attribute sets (i.e., in terms of quantity and/or quality) to evaluate the same products, the scores between assessors cannot come from identical distributions and hence are not exchangeable, so it is untenable to conduct the permutation test.
The data from conventional profiling can satisfy the exchangeability assumption given that all assessors use the same set of attributes to evaluate the same products, while the data from free choice profiling usually violate exchangeability given that dissimilar attribute sets are typically used across assessors. If the exchangeability assumption is met under the null hypothesis of no assessor effect (H0: B1 = B2 = ... = Bb), then the observations should be permuted across assessors within each product. This procedure is useful for assessing overall panel performance when assessors use the same set of attributes. Similar to the two-way ANOVA permutation test, the permutation procedure for the two-way GPA permutation test is summarized as follows:

(1) Test the interaction effect. Under H0: ABij = 0 (i = 1, 2, ..., a; j = 1, 2, ..., b) (no interaction effect), if the data are exchangeable, then do the permutation test as follows:
(a) Calculate the test statistic SSR for the original dataset.
(b) Calculate the means Ȳi·· and Ȳ·j· for each level of A and B, and the grand mean Ȳ···.
(c) Calculate the interaction residuals for each cell using rijk = Yijk − Ȳi·· − Ȳ·j· + Ȳ···.
(d) Repeat steps b and c for all attributes.
(e) Rearrange the residuals rijk randomly. In most cases, the number of all possible rearrangements is too large for a computer to exhaust, so we usually randomly select R (e.g., 300) of these permutations.
(f) Calculate the test statistic SSR for each of the rearranged datasets.
(g) Calculate a p-value and compare it with a predetermined significance level α (usually 0.05) to make a decision. If the null hypothesis is not rejected at α, we proceed to test the main effects.
(2) Test the main effect of product. Under H0: A1 = A2 = ... = Aa (no product effect), if the data are exchangeable, then do the permutation test as follows:
(a) Calculate the test statistic SSR for the original dataset.
(b) Randomly rearrange the data for A within each level of B, R = 300 times.
(c) Calculate the SSR for each of the rearranged datasets.
(d) Calculate a p-value and compare it with a predetermined significance level α to make a decision.
(3) Test the main effect of assessor. Under H0: B1 = B2 = ... = Bb (no assessor effect), if the data are exchangeable, then do the permutation test as follows:
(a) Calculate the test statistic SSR for the original dataset.
(b) Randomly rearrange the data for B within each level of A, R = 300 times.
(c) Calculate the SSR for each of the rearranged datasets.
(d) Calculate a p-value and compare it with a predetermined significance level α to make a decision.

For multivariate GPA permutation, there are two ways to rearrange the data: (1) separate randomization for each attribute (King & Arents, 1991); and (2) simultaneous randomization of all attributes (Wakeling et al., 1992). The simultaneous randomization approach is recommended because it preserves the correlation structure among the attributes (Wakeling et al., 1992) and is more efficient.
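The product-effect test (step 2) can be sketched end to end with a deliberately minimal GPA: translation and rotation only, equal-sized matrices, and no isotropic scaling. This is an illustrative stand-in under those assumptions, not Gower's full algorithm or the authors' implementation.

```python
import numpy as np

def gpa_ssr(mats, n_iter=10):
    """Minimal GPA: center each assessor matrix (translation), iteratively
    rotate each onto the running consensus (orthogonal Procrustes via SVD),
    and return the residual sum of squares (SSR) around the consensus."""
    X = [m - m.mean(axis=0) for m in mats]             # translation
    for _ in range(n_iter):
        consensus = np.mean(X, axis=0)
        for i, x in enumerate(X):
            u, _, vt = np.linalg.svd(x.T @ consensus)  # Procrustes rotation
            X[i] = x @ (u @ vt)                        # rotation/reflection
    consensus = np.mean(X, axis=0)
    return sum(np.sum((x - consensus) ** 2) for x in X)

def product_effect_pvalue(mats, n_perm=300, seed=0):
    """H0: no product effect. Whole rows (products) are shuffled within
    each assessor's matrix, preserving attribute correlations, and SSR
    is recomputed for each rearrangement."""
    rng = np.random.default_rng(seed)
    ssr0 = gpa_ssr(mats)
    m = 0
    for _ in range(n_perm):
        perm = [mat[rng.permutation(len(mat))] for mat in mats]
        if gpa_ssr(perm) <= ssr0:    # small SSR = tight consensus
            m += 1
    return (m + 1) / (n_perm + 1)
```

A small p-value says the observed consensus is tighter than consensuses built from row-scrambled assessor matrices, mirroring the SSR-based test described above.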
3. Datasets

3.1. Eight expert panel data

The eight panel data were from research comparing the capabilities of eight expert descriptive panels used by
Unilever for evaluating face creams. The eight expert panels were from different parts of the world and had been trained to varying degrees prior to this study. Some panels used Quantitative Descriptive Analysis (QDA) while others used the Spectrum Method (SM). Each individual panel used its own set of sensory descriptors, with some overlap across panels. Each individual panel also used its own testing procedure, body site tested (volar forearm vs. face), scales and references. The same 12 samples were evaluated by the panels for two repetitions. The two-way factorial model can be used to make several useful tests, such as: (a) a test of the product effect across the group of eight expert panels (free choice profiling) (Fig. 1a); (b) tests of product, assessor, and product by assessor effects within a given panel (conventional profiling) (Fig. 1b); (c) a test of the replication effect within a given panel, where data are averaged across products (conventional profiling) (Fig. 1c); and (d) a test of the replication effect within a given assessor, where data are averaged across products (conventional profiling) (Fig. 1d). For the present purposes, 'a' and 'b' will be shown to illustrate the two-way GPA permutation test procedure.
3.2. USA panel data

The datasets from the panels were conventional profiling data. For example, the USA panel, which was one of the eight expert panels, consisted of seven assessors and used 32 descriptive attributes for its sensory evaluation. All seven assessors evaluated 12 products for two replications using the same methodology and testing sites. This USA panel dataset was used to demonstrate the application of the permutation test for two-way GPA. For these data, as shown in Fig. 1b, there were at least three null hypotheses to be tested: no interaction of product and assessor, no assessor effect and no product effect.

3.3. Random data

A dataset was randomly generated using a uniform distribution (0–100), as was used by King and Arents (1991), Wakeling et al. (1992) and Wu et al. (2002). This dataset had eight individual matrices and each matrix had 25 rows (samples) and 5 columns (attributes). These configurations were used by Wakeling et al. (1992) and
Fig. 1. Data structure used for global panel alignment work (note: p is the number of assessors varying from panel to panel and Mk is the number of attributes used by the kth panel).
Wu et al. (2002). The prediction is that the products are most likely not different if all assessors scored the products randomly, thus providing a criterion for examining the permutation procedures presented here.

4. Results and discussion

4.1. Eight expert panel data

The dataset including all eight panels (Fig. 1a) follows the free choice profiling format, so only the product effect was tested (although the replication effect could also be tested). Under H0: no product effect, the observations from the 12 products could be exchanged within each panel. For 12 products evaluated by eight panels, there were (12!)^8 possible permutations, too many to exhaust for practical purposes. A reasonable number of permutations was considered to be 300, which was also used by Wakeling et al. (1992) and Wu et al. (2002). These 300 permutations were randomly selected from the (12!)^8 possibilities to conduct the GPA permutation test. The SSR0 value for the original observations was 16749.3. The minimum SSR value (SSRmin) obtained from the 300 permuted datasets was 56908.0, and the corresponding p-value was 0.0033 (= (0 + 1)/(300 + 1)), approaching the 100th percentile in this case. This p-value was approximated based on 300 permutations. Since SSR0 ≪ SSRmin, the "true" p-value should be much less than 0.0033. To obtain a more precise estimate of the p-value, the number of permutations can be increased. In this study, the p-value of 0.0033 is already quite low and reliable given the set significance level α = 0.05, so more accurate estimates of the p-value were unnecessary. We concluded that there was a significant overall difference among the products at α = 0.05. Therefore, the consensus plot (Fig. 2) was deemed useful to visually identify the differences among the products. Fig. 2 indicates that products A, B, F, G, J, K were more likely to be perceived as different, and products B, C, D, E, H, I and L were more likely to be perceived as the same or similar.
Fig. 2. Average configuration (consensus) plot.

An analytical method is needed for accurately comparing the differences among the products. It is also possible to use permutation tests to do pairwise comparisons of the products.
4.2. USA panel data

Since the USA panel data were conventional profiling data, three null hypotheses (no product by assessor effect, no product effect, no assessor effect) were tested. The null hypothesis of no interaction of product by assessor was tested using the interaction residuals. The SSR0 for the original dataset was 203960. The SSR values for the 300 permuted datasets are given in Fig. 3, with a minimum SSRmin of 201899.4. The p-value estimated from 300 permutations was 0.03, suggesting a weak but significant interaction of product by assessor at α = 0.05. The significant interaction indicates that the assessors perceived the products differently. When the interaction is significant, the typical recommendation is not to proceed with analyzing the main effects of the factors included in the interaction. In this paper, however, we shall proceed to test the main effects of product and assessor for demonstration purposes. Under the null
Fig. 3. Frequency of sum of squares of residuals (SSR) for interaction effect using USA panel data (300 permutations).
hypothesis of no product effect, the observations were rearranged within each assessor matrix 300 times; the corresponding SSR values are given in Fig. 4. The p-value estimated from 300 permutations was 0.0033 < α = 0.05, indicating a significant overall difference among the products. The consensus plot for the 12 products is given in Fig. 5 and can be used to visually examine differences between the products. The product effect can also be analyzed with the randomization F-test (Wu et al., 2002), whose results are given in Table 1. The table shows that 11 of the 23 dimensions had p-values less than α = 0.05, indicating significant overall differences among the products on those dimensions. This finding agrees with the results obtained using SSR (which takes all dimensions into account simultaneously). As Wu et al. (2002) demonstrated, the randomization F-test could be used to determine the number of significant dimensions in the context of GPA. However, when the total number of dimensions is large, the randomization F-test usually gives multiple answers for the number of significant dimensions. Take Table 1 as an example: dimensions 1–6, 8–9, 12, 14 and 22 were significant at α = 0.05, so the number of significant dimensions could be 6, 9, 12, 14 or 22. Thus,
it is difficult to know the true number of significant dimensions. This suggests that the randomization F-test is useful for determining whether the products differ overall on each dimension, but not reliable for determining the number of significant dimensions.

Under the null hypothesis of no assessor effect, the observations were rearranged within each product (i.e., within levels of the product factor) 300 times; the corresponding SSR values are given in Fig. 6. The estimated p-value was 0.0033, suggesting that the assessor effect was significant at α = 0.05. This indicates that the assessors did not agree in their perceptions of the products, and it directs attention to further work needed to identify which assessors differ from the others.

The same USA panel dataset was also analyzed by MANOVA. The MANOVA results showed that all effects (product, assessor and interaction) were significant (p < 0.0001) in terms of Wilks' Lambda. MANOVA and the GPA permutation tests thus gave similar results, although the two approaches may produce different results for some datasets if the normality assumption is not met. The authors also found that MANOVA may simply not be executable when the number of variables is larger than the number of observations. In contrast, the GPA permutation test does not have such sensitive constraints and can be useful for evaluating individual assessor performance (e.g., Fig. 1d).

Fig. 4. Frequency of sum of squares of residuals for product effect using USA panel data (300 permutations).

Fig. 5. Average configuration using USA panel data (12 products, where subscripts represent reps).

Table 1
Results from randomization F-test for USA panel data (300 permutations)

Dimension    F1 (original)    Probability (F1 (permuted) > F1 (original))
1            113.22           0.00
2             30.48           0.00
3             17.25           0.00
4              9.87           0.00
5              7.55           0.02
6              8.17           0.01
7              5.32           0.89
8             14.03           0.00
9             12.64           0.00
10             7.69           0.35
11             8.98           0.08
12             9.95           0.04
13             9.96           0.06
14            12.48           0.00
15             7.60           0.86
16            10.53           0.16
17            11.37           0.10
18            10.37           0.32
19            10.28           0.50
20             7.81           0.98
21             9.62           0.73
22            14.52           0.04
23            13.35           0.14
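The three null hypotheses tested for the USA panel differ only in which observations are treated as exchangeable. The following Python sketch (our own function names, for conventional profiling data where all assessors score the same attributes) shows one way each permutation scheme could be implemented; the test statistic would then be recomputed on each permuted dataset.

```python
import numpy as np

def permute_for_product_effect(X, rng):
    """H0: no product effect -- shuffle the product rows independently
    within each assessor's matrix (assessor identity is preserved)."""
    return [x[rng.permutation(x.shape[0])] for x in X]

def permute_for_assessor_effect(X, rng):
    """H0: no assessor effect -- for each product (row index), shuffle
    that product's observations across assessors."""
    stacked = np.stack(X)                      # (assessors, products, attributes)
    for i in range(stacked.shape[1]):
        stacked[:, i] = stacked[rng.permutation(stacked.shape[0]), i]
    return list(stacked)

def permute_for_interaction(X, rng):
    """H0: no product-by-assessor interaction -- double-centre the data
    to obtain interaction residuals, then permute those residuals freely
    across all (assessor, product) cells."""
    stacked = np.stack(X).astype(float)
    prod_mean = stacked.mean(axis=0, keepdims=True)
    assr_mean = stacked.mean(axis=1, keepdims=True)
    grand = stacked.mean(axis=(0, 1), keepdims=True)
    resid = stacked - prod_mean - assr_mean + grand
    flat = resid.reshape(-1, stacked.shape[2])
    flat = flat[rng.permutation(flat.shape[0])]
    return list(flat.reshape(stacked.shape))
```

Each scheme leaves the data structure required by its null hypothesis intact: the first preserves assessor blocks, the second preserves product blocks, and the third removes both main effects before permuting.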
4.3. Random data

Only the null hypothesis of no product difference was tested with the proposed approach, so that the example could be considered in light of either free choice or conventional profiling. The SSR0 for the original dataset was 260751.6, and the SSR values for 1000 permuted datasets are given in Fig. 7. Since the corresponding p-value was 0.063 > α = 0.05, there were no significant overall differences among the 25 products, as expected because the data were randomly generated. Wakeling's permutation test on the random data showed that the percentile for Rc (0.462) was 93.4%, less than 95%, indicating that the GPA consensus was not "significant". Thus, Wakeling's test and the proposed permutation test yield the same conclusion here. The results from the randomization F-test in Table 2 show that the first four dimensions were not significant at α = 0.05 but the fifth dimension was, suggesting that some products
Fig. 6. Frequency of sum of squares of residuals for assessor effect using USA panel data (300 permutations).
Fig. 7. Frequency of sum of squares of residuals for product effect using random data (1000 permutations).
Table 2
Results from randomization F-test for random data (1000 permutations)

Dimension    F1 (original)    Probability (F1 (permuted) > F1 (original))
1            5.02             0.24
2            4.86             0.08
3            3.06             0.67
4            2.85             0.49
5            3.23             0.015
were significantly different from others. This dataset again illustrates that the randomization F-test may not be reliable for determining the number of significant dimensions. One drawback of the randomization F-test is that it inflates the Type I error by testing product differences on each individual dimension, especially when the number of dimensions is large. One remedy is to use the experiment-wise error rate (α′) instead of the comparison-wise error rate (α), where α′ = α/n and n is the number of dimensions. In this case, α′ = α/n = 0.05/5 = 0.01, resulting in no significant dimensions.
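This experiment-wise (Bonferroni-type) correction is easy to verify numerically from the Table 2 probabilities:

```python
# Per-dimension probabilities from Table 2 (random data, 1000 permutations)
probs = [0.24, 0.08, 0.67, 0.49, 0.015]

alpha = 0.05
n = len(probs)
alpha_exp = alpha / n                      # experiment-wise rate: 0.05/5 = 0.01

# Comparison-wise testing flags dimension 5 (p = 0.015 < 0.05);
# the experiment-wise rate flags none (0.015 > 0.01).
sig_comparisonwise = [p for p in probs if p < alpha]
sig_experimentwise = [p for p in probs if p < alpha_exp]
print(len(sig_comparisonwise), len(sig_experimentwise))  # prints "1 0"
```

This confirms that a single dimension appears significant under the comparison-wise rate, but none survive the experiment-wise correction.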
5. Conclusion

For any significance test, it is essential to state the null hypothesis H0 explicitly, a priori. The null hypothesis dictates how to conduct the test of significance and how to interpret the results. In this paper, we demonstrated specifically how to set up H0 and permute the data to permit tests of main effects and interactions within the context of GPA. The proposed GPA permutation test can easily be extended to test various null hypotheses of interest in multiple-way factorial designs. If replication is considered as a factor or block (e.g., Fig. 1c and d), its effect can be tested with the proposed approach. The sum of squares of residuals (SSR) is used as the test statistic for the proposed permutation test. Ideally, different test statistics should be designed for different null hypotheses (Good, 2000), a topic that demands further research.

Acknowledgements

We thank our internal Unilever panel leaders for permitting use of the present sensory datasets, as well as the panels for their descriptive work. Thanks are also given to the referees for their useful comments.

References

Anderson, M. J. (2001). Permutation tests for univariate or multivariate analysis of variance and regression. Canadian Journal of Fisheries and Aquatic Sciences, 58, 626–639.

Dijksterhuis, G. B., & Heiser, W. (1995). The role of permutation tests in exploratory multivariate data analysis. Food Quality and Preference, 6, 263–270.

Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.

Fisher, R. A. (1936). The coefficient of racial likeness and the future of craniometry. Journal of the Royal Anthropological Institute of Great Britain and Ireland, 66, 57–63.

Good, P. (2000). Permutation tests: A practical guide to resampling methods for testing hypotheses (2nd ed.). New York: Springer.

Gower, J. C. (1975). Generalized procrustes analysis. Psychometrika, 40, 33–51.

King, B. M., & Arents, P. A. (1991). Statistical test of consensus obtained from generalized procrustes analysis of sensory data. Journal of Sensory Studies, 6, 37–48.

Kunert, J., & Qannari, E. M. (1999). A simple alternative to generalized procrustes analysis: Application to sensory profiling data. Journal of Sensory Studies, 14, 197–208.

Pitman, E. J. G. (1937). Significance tests which may be applied to samples from any populations. Supplement to the Journal of the Royal Statistical Society, 4, 119–130.

Wakeling, I. N., Raats, M. M., & MacFie, H. J. H. (1992). A new significance test for consensus in generalized procrustes analysis. Journal of Sensory Studies, 7, 91–96.

Wu, W., Guo, Q., de Jong, S., & Massart, D. L. (2002). Randomisation test for the number of dimensions of the group average space in generalized procrustes analysis. Food Quality and Preference, 13, 191–200.