Food Quality and Preference 19 (2008) 146–155 www.elsevier.com/locate/foodqual
Permutation tests for Generalized Procrustes Analysis

R. Xiong a, K. Blot a,*, J.F. Meullenet b, J.M. Dessirier a

a Unilever R&D, 40 Merritt Blvd, Trumbull, CT 06611, USA
b Department of Food Science, University of Arkansas, 2650 N Yong Ave, Fayetteville, AR 72704, USA
Received 2 September 2006; received in revised form 2 February 2007; accepted 11 March 2007 Available online 23 March 2007
Abstract

Generalized Procrustes Analysis (GPA) is a useful tool for sensory professionals to analyze sensory data, especially those from free choice profiling. Over a decade ago, Wakeling introduced a permutation test for determining if the GPA consensus is significant. However, Wakeling's permutation test lacks specification of an explicit null hypothesis, resulting in different interpretations of what a "significant consensus" signifies. In this paper, a new GPA permutation test analogous to the well-established ANOVA permutation procedure is proposed. The proposed approach emphasizes that the null hypothesis dictates how data are permuted to test specific null hypotheses within GPA (e.g., product effect, assessor effect and interaction effect). The only assumption behind permutation testing is exchangeability, which is discussed. Applications of the proposed GPA permutation test are provided using three datasets (two actual datasets and one random dataset).

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Permutation test; GPA; ANOVA; MANOVA; Free choice profiling; Conventional profiling
1. Introduction

Permutation testing is a non-parametric technique that can permit tests of statistical significance for essentially any null hypothesis by repeatedly randomizing the original dataset (see e.g., Anderson, 2001; Good, 2000). For each randomization of the original dataset, a test statistic is computed. The result is a sampling distribution of the test statistic under the null hypothesis, which is used to assess the statistical significance of the test statistic obtained from the original dataset. Permutation tests can produce reliable and sometimes exact p-values for data violating conventional normal theory assumptions by preserving the original data structure. Notably, for each permutation, the null hypothesis dictates how to randomly exchange or rearrange the original observations (see Good, 2000). The null hypothesis also dictates how to interpret the permutation test results. As
* Corresponding author. Tel.: +1 203 381 4321; fax: +1 203 381 5485. E-mail address: [email protected] (K. Blot).
0950-3293/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.foodqual.2007.03.003
there are multiple ways to permute observations when multiple factors are involved in a study, special attention must be given to setting up the proper null hypothesis to permit the correct significance test and interpretation. In addition, permutation testing for multiple-way ANOVA (analysis of variance), especially for two-way ANOVA, is well documented and can be used to test different null hypotheses (see Anderson, 2001; Good, 2000), but this has not been explored by sensory professionals. Generalized Procrustes Analysis (GPA, see Gower, 1975) is a multivariate exploratory technique that involves transformations (i.e., translation, rotation, reflection, isotropic rescaling) of individual data matrices to provide optimal comparability. The average of the individual matrices is called the consensus matrix. The individual and consensus matrices are typically submitted to Principal Components Analysis (PCA) and projected onto a lower dimensional space. This space provides a vantage point to compare individual data and to visualize the consensus. In theory, the matrix transformation processes can be assumed to help home in on underlying perceptual phenomena independent of scaling artefacts.
GPA is popular among sensory professionals as a way to combine data from different individual assessors (Dijksterhuis & Heiser, 1995). GPA is especially useful for free choice profiling, because it can accommodate different numbers and kinds of attributes among assessors. Moreover, GPA can be used to visually describe different effects, such as product differences, assessor agreement, and repeatability. However, there is no clearly established way to test the significance of each of these factorial effects, including the interactions. For instance, a test of the product by assessor effect may be critical. If this interaction is significant, it implies that assessors do not perceive the products in the same way (i.e., the product main effect is not invariant across assessors). Although previous work has been designed to examine the significance of the consensus variance using Monte Carlo simulation (King & Arents, 1991) or permutation testing (Wakeling, Raats, & MacFie, 1992; Wu, Guo, de Jong, & Massart, 2002), no study has explicitly adapted the ANOVA permutation test to analogous significance testing for GPA. Therefore, the objective of this paper is to show how permutation tests designed for factorial ANOVA can be adapted to GPA to permit analogous main effect and interaction significance tests.

2. Theory

2.1. Permutation testing

Permutation testing is a non-parametric technique that can permit tests of statistical significance for any null hypothesis by repeatedly permuting the original dataset. The theory for the permutation test has evolved from the works of Fisher (1935) and Pitman (1937). The permutation test procedure can be summarized in the following six steps (Good, 2000):

(1) Set up the null and alternative hypotheses (H0 and Ha) of interest.
(2) Choose a test statistic t(x) that distinguishes H0 from Ha.
(3) Compute the test statistic t(x0) for the original observations.
(4) Permute the observations according to H0.
(a) Randomly rearrange the observations by assigning them to new treatment conditions.
(b) Compute the test statistic t(x) for the new arrangement.
(c) Repeat steps a–b N times (e.g., N = 1000).
(5) Compute the distribution of the test statistic t(x) and the corresponding p-value for t(x0). The p-value is calculated as (m + 1)/(N + 1), where m typically represents the number of t(x) values that are as small as or smaller than the observed t(x0) (e.g., F statistic), while N is the total number of t(x) values calculated. Note that for some test statistics, m can represent the number of t(x) values that are as large as or larger than the
observed t(x0). The "1" in the numerator and denominator denotes that the observed value t(x0) is a member of the set of possible values.
(6) Draw a conclusion.
(a) Reject H0 if the p-value is less than α (the significance level).
(b) Do not reject H0 if the p-value is equal to or greater than α.

According to Fisher (1936), the permutation test agrees with the normal theory tests when the normal theory assumptions are met (Anderson, 2001). The only assumption of the permutation test is exchangeability. Exchangeability entails that under a true H0, the original observations stay fixed while being assigned randomly to alternative treatment conditions. These exchanges are done multiple times, resulting in several possible new arrangements that are all equally likely to occur. Each new arrangement, or permutation, has a unique set of treatment–observation pairings, resulting in a test statistic outcome that provides one point on the H0 sampling distribution. The outcomes themselves are not equally likely to occur: increasingly extreme values (large or small) have a vanishingly small likelihood of occurring. The original dataset is considered one other possible arrangement under H0 that contributes one other possible outcome to the H0 sampling distribution. All these outcomes are possible under H0 given that the original dataset involves random assignment of treatments to experimental units, random ordering of treatments for each experimental unit, etc. In this way, the experimental approach controls for bias that is not assumed under H0 and that may influence the outcomes contributing to the empirical sampling distribution under H0. This exchangeability assumption is met for experimental designs and inferred for quasi-experimental designs. An implication of exchangeability is that errors are independent and identically distributed, following any common distribution (e.g., lognormal) (Anderson, 2001; Good, 2000).
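As an illustration, the six steps can be sketched in a few lines of code. The two-sample setting and the difference-of-means statistic below are hypothetical choices for demonstration; the p-value follows the (m + 1)/(N + 1) formula above.

```python
import numpy as np

def permutation_test(x, y, n_perm=1000, seed=0):
    """Generic permutation test on the difference of means: compute the
    observed statistic t(x0), repeatedly rearrange the pooled data under
    H0 (no group difference), and report p = (m + 1) / (N + 1)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    t0 = x.mean() - y.mean()              # step 3: observed statistic t(x0)
    m = 0
    for _ in range(n_perm):               # step 4: permute according to H0
        rng.shuffle(pooled)
        t = pooled[:len(x)].mean() - pooled[len(x):].mean()
        if abs(t) >= abs(t0):             # count outcomes at least as extreme
            m += 1
    return (m + 1) / (n_perm + 1)         # step 5: p-value
```

With clearly separated groups the p-value approaches its lower bound of 1/(N + 1); exchangeability of the pooled observations under H0 is what justifies the shuffle.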
Note that because all permutation arrangements are equally likely to occur, a large random subset (e.g., 1000) of the total possible permutations should produce a representative sampling distribution.

2.2. Prior work: significance testing and GPA

King and Arents (1991) reported that GPA always produces a consensus space or plot even when the data are randomly generated. Thus, for instance, any apparent product differences observed in the consensus space may not be real. To determine when GPA provides systematic information rather than random output, King and Arents proposed a Monte Carlo approach to test the significance of Rc (the percentage of consensus variance in the total variance). The sampling distribution of Rc was obtained by repeating the GPA analysis on hundreds of random datasets generated from uniform distributions. If Rc for the original
data was larger than the 95th percentile of the sampling distribution, then it was inferred that ". . . the consensus plot as given by the GPA most likely represents some true consensus among panelists regarding the relationships among the samples evaluated" (King & Arents, 1991, p. 41). Since King and Arents (1991) generated random data independently for each sensory variable, the correlations among the variables in the original data were not preserved. This approach was shown to result in inflated Rc for the random outcomes, and thus, an overly conservative significance test (see Wakeling et al., 1992). Moreover, the approach did not permit specific ways to randomize the data to test for specific main or interaction effects, beyond testing the overall consensus variation against an overall random reference point. Wakeling et al. (1992) modified King and Arents' method by permuting the original data. The original data consisted of product labels (rows) and attributes (columns). Whole rows were permuted for a given exchange, so that a product and its associated observations across all attributes were assigned randomly to another product. The exchange of whole rows preserves the correlational structure between the attributes. Moreover, the rows were permuted specifically within each assessor's matrix, consistent with free choice profiling, which does not permit exchange of data across assessors using different attributes. Again, the general objective was to test the likelihood that the original observed Rc came from a distribution of randomly generated Rc values. A small likelihood (p ≤ .05) indicated significance. A significant result was difficult to interpret because there was no explicit null hypothesis given for the test.
As we will show in the following discussion, their test was actually consistent with what would be done to test the null hypothesis (H0) of no product differences, since restricted permutations within each assessor's matrix controlled for the assessor main effect. Extending Wakeling's GPA permutation test, Wu et al. (2002) proposed a randomization F-test for each dimension to determine how many dimensions could be considered meaningful, in essence providing a potential substitute for the traditional scree plot. The randomization F-test suffers from the same problem as Wakeling's permutation test using Rc: lack of an explicit statement of the null hypothesis and its proper interpretation. As discussed above, the permutation test always requires the null hypothesis because H0 determines how to permute the observations. To the best of our knowledge, there are no explicit null hypotheses for the significance test of factorial effects in GPA in the literature (King & Arents, 1991; Kunert & Qannari, 1999; Wakeling et al., 1992; Wu et al., 2002). For instance, the significance of Rc may imply the following null hypothesis (H0) and alternative hypothesis (Ha):
H0: Rc = R0
Ha: Rc > R0

where R0 is the percentage of consensus variance in the total variance that is due to noise, or truly random data, which can be estimated through simulation. This null hypothesis does not tell us how to permute the observations or how to interpret the permutation test result. Wakeling et al. (1992) proposed permuting the rows of each assessor's data matrix, apparently to preserve correlations among sensory attributes, but this permutation procedure is not directly linked to this null hypothesis (H0: Rc = R0). Moreover, a by-product of the scarcity of explicit null hypotheses for significance testing on GPA has been unavoidably different interpretations of what exactly is being tested. There are at least five interpretations in the literature: whether the consensus is false or true (King & Arents, 1991); whether the consensus is significant or not (Wakeling et al., 1992); whether the consensus is meaningful (Kunert & Qannari, 1999); whether the consensus is artifactual or not (Wu et al., 2002); and whether a consensus is reached after the GPA transformations (XLStat, version 6.3, 2006). Clearly then, stating the null hypothesis for each of the factorial effects, performing the corresponding permutations, and conducting the relevant significance tests should help localize, manage, and explain the GPA output. This set of procedures has not previously been employed to test factorial effects for GPA. Without a proper connection between the null hypotheses and the permutation procedure, the interpretation of the GPA can be misleading. The following sections make the connection between each null hypothesis and its permutation procedure explicit.

2.3. Permutation testing for ANOVA

The tests for GPA factorial effects and for factorial ANOVA are analogous. Permutation tests for factorial ANOVA are well documented and can be adapted to GPA factorial testing. The simplest factorial model, the linear two-way ANOVA model, is given as

Yijk = μ + Ai + Bj + ABij + eijk  (i = 1, 2, ..., a; j = 1, 2, ..., b; k = 1, 2, ..., n)

where μ is the population mean, Ai is the effect of the ith level of factor A, Bj is the effect of the jth level of factor B, ABij is the interaction effect of the ijth combination of factors A and B, eijk is the error associated with observation Yijk, a and b are the numbers of levels of factors A and B, respectively, and n is the number of replications. Since there are two main effects and one interaction effect, there are three null and alternative hypotheses for the two-way ANOVA problem. The null hypotheses (H0) for the interaction and main effects are given, respectively, as follows:

H0 (interaction effect): ABij = 0  (i = 1, 2, ..., a; j = 1, 2, ..., b)
H0 (main effect A): A1 = A2 = ... = Aa
H0 (main effect B): B1 = B2 = ... = Bb
Null hypothesis testing usually starts with the test of a significant interaction AB (given that replicated data are available). The permutation test is more complicated for the interaction effect than for the main effects. For the interaction test AB, the main effects must first be removed from the observations Yijk. The main effects are unknown, but can be estimated by calculating the means for each level of a factor. Main effects of the factors can then be "removed" by subtracting the appropriate mean from each observation to obtain residuals (Anderson, 2001). For a single variable Y, the removal of the main effects of factors A and B to compute the residuals is given as

rijk = Yijk − Ȳi·· − Ȳ·j· + Ȳ···

where rijk represents the residuals after removing main effects, Ȳi·· is the mean for level i of factor A, Ȳ·j· is the mean for level j of factor B, and Ȳ··· is the overall mean. Under the null hypothesis (i.e., H0 is true), the residuals rijk are exchangeable with one another, unrestricted across the AB conditions, and can be permuted (Anderson, 2001; Good, 2000). The permutation test for interaction effects is not an exact test because the residuals are estimated using sample means. As the sample size (n) increases, the estimates of the means improve and the permutation test approaches an exact test; hence it is referred to as an asymptotically exact test (Anderson, 2001). If the interaction effect is not significant (i.e., the null hypothesis associated with the interaction cannot be rejected) or is ignored for various reasons (e.g., aggregating is done across replications, thus leaving only a product main effect to test), the main effects can then be tested. When the permutation test is used for testing main effects, the effects of the remaining factors must be controlled. This control is usually achieved by permuting the observations of the factor within each level of the remaining factors.
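The double-centering step that removes the main effects can be written directly in array form. The (a, b, n) data layout and the NumPy implementation below are assumptions for illustration only:

```python
import numpy as np

def interaction_residuals(Y):
    """Remove the A and B main effects from Y (shape: a levels of A,
    b levels of B, n replicates) by double-centering, yielding
    rijk = Yijk - Ybar_i.. - Ybar_.j. + Ybar_..."""
    mean_a = Y.mean(axis=(1, 2), keepdims=True)   # Ybar_i..
    mean_b = Y.mean(axis=(0, 2), keepdims=True)   # Ybar_.j.
    return Y - mean_a - mean_b + Y.mean()
```

For purely additive data (no interaction and no error) these residuals are exactly zero, which is why they are the right quantities to permute under H0: ABij = 0.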
In the case of two factors A and B in a factorial design, the main effect of factor A can be tested by permuting the observations within each level of factor B, while the effect of factor B can be tested by permuting the observations within each level of factor A (Good, 2000). The F-value can be used as the test statistic for the two-way ANOVA permutation test, although Good (2000) suggested better alternative test statistics for testing the interaction and main effects. The permutation procedure for the interaction and main effects is summarized as follows:

(1) Under H0: ABij = 0 (i = 1, 2, ..., a; j = 1, 2, ..., b) (no interaction effect), all the residuals rijk are exchangeable → randomly rearrange the residuals rijk.
(2) Under H0: A1 = A2 = ... = Aa (no main effect of factor A), all the observations Yijk within each level of factor B are exchangeable → within each level of factor B, randomly rearrange the observations Yijk.
(3) Under H0: B1 = B2 = ... = Bb (no main effect of factor B), all the observations Yijk within each level of factor A are exchangeable → within each level of factor A, randomly rearrange the observations Yijk.
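A restricted shuffle for testing the main effect of A (scheme 2, controlling for B) might look like the following sketch; the (a, b, n) array layout is again an assumption:

```python
import numpy as np

def permute_within_B(Y, rng):
    """Main-effect-A test: rearrange observations only within each level
    j of factor B, so the B effect is held fixed. Y has shape (a, b, n)."""
    Yp = Y.copy()
    a, b, n = Y.shape
    for j in range(b):
        block = Yp[:, j, :].copy().reshape(-1)  # all a*n values at level j of B
        rng.shuffle(block)
        Yp[:, j, :] = block.reshape(a, n)
    return Yp
```

Swapping the roles of the two factors gives scheme 3, while scheme 1 shuffles the interaction residuals without any restriction.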
Note that any rearrangement always involves the whole row of data, across all variables (attributes), so the single-attribute case generalizes to the multiple-attribute case. The two-way ANOVA permutation test can be extended to the multiple-way ANOVA permutation test (Good, 2000). Note that parametric ANOVA can be applied to the above two-way factorial design data if normal theory assumptions are satisfied. However, in a balanced design, the permutation test has a threefold advantage over parametric ANOVA: it is essentially exact for main effects given that all or a large subset of the permutations are tested; it is not restricted by an assumption of normality; and it is as powerful as or more powerful than the parametric approach (Good, 2000). In addition, the parametric ANOVA approach may not be applicable for testing the interaction and main effects within the context of GPA (e.g., due to insufficient degrees of freedom), but the permutation test is generally applicable. In an unbalanced design, however, the permutation test of the main effects will be confounded with interactions, so that the two cannot be tested separately (Good, 2000).

2.4. Permutation testing for GPA

A GPA study is usually a factorial design with at least two factors (products and assessors or panels), so the approach discussed above for the two-way ANOVA permutation test can be practically applied to GPA. For convenience of discussion, assume that a panel of assessors evaluated multiple products in a factorial design experiment. Analogous to the two-way ANOVA approach, there are three possible effects of interest: product, assessor, and product by assessor. There are at least three possible ways to permute the data, assuming the conditions are exchangeable under H0. As usual, we start with the null hypothesis of no interaction effect (H0: ABij = 0).
If the data are from different assessors using different sets of attributes (i.e., free choice profiling), the AB levels logically cannot be exchanged; thus, no direct permutation test for this free choice profiling interaction is available. In contrast, if all assessors evaluated the products using the same set of attributes, the same or similar sensory protocol (including the scales, references, etc.), and the same number of replications (i.e., conventional profiling), the AB levels are presumably exchangeable and can be permuted. The residuals rijk are calculated and permuted to obtain the sampling distribution of a test statistic. Either Rc or the sum of squares of residuals (SSR, also called the loss function, see Kunert & Qannari, 1999) can be used as the test statistic. SSR will be used in this study because its calculation is simpler than that of Rc. Whether H0 of no A × B interaction is rejected or not depends on the p-value of the test statistic for the original data. If the interaction effect is not significant or the data cannot be exchanged, we may test the main effects of product (denoted as factor A) and assessor (denoted as factor B).
Under the null hypothesis (H0) of no product differences (H0: A1 = A2 = ... = Aa), the observations are exchangeable for conventional profiling as well as free choice profiling, because the data are exchanged within rather than across assessors. Products are permuted within each assessor matrix, or in other words, within each level of factor B, thereby controlling for the main effect of B. According to H0: A1 = A2 = ... = Aa, the interpretation of the product effect is that there is a significant overall difference among the products. If there is no significant overall difference (i.e., H0 is not rejected at α), the product consensus coordinates, or average product locations, are not reliably different, suggesting that there is no need for the GPA consensus plot. No overall difference among the products may indicate that there is either no "real" (physical) difference among the products and/or no perceived difference by the assessors. If the overall difference is significant (i.e., H0 is rejected at α), then at least one product has consensus coordinates that are significantly different, suggesting that the consensus plot is useful for further visual examination of the relationships among the products. As with the ANOVA approach, however, the permutation test does not tell us which products are significantly different. How to test which products are significantly different is beyond the scope of this article, but the authors will present a new method in a forthcoming paper. Unlike testing the product effect, testing the assessor effect may involve a violation of the exchangeability (or iid) assumption, because data must be exchanged across assessors. If different assessors use dissimilar attribute sets (i.e., in terms of quantity and/or quality) to evaluate the same products, the scores between assessors cannot come from identical distributions and hence are not exchangeable, so it is untenable to conduct the permutation test.
The data from conventional profiling can satisfy the exchangeability assumption given that all assessors use the same set of attributes to evaluate the same products, while the data from free choice profiling usually violate exchangeability given that dissimilar attribute sets are typically used across assessors. If the exchangeability assumption is met under the null hypothesis of no assessor effect (H0: B1 = B2 = ... = Bb), then the observations should be permuted across assessors within each product. This procedure is useful for assessing overall panel performance when assessors use the same set of attributes. Similar to the two-way ANOVA permutation test, the permutation procedure for the two-way GPA permutation test is summarized as follows:

(1) Test the interaction effect. Under H0: ABij = 0 (i = 1, 2, ..., a; j = 1, 2, ..., b) (no interaction effect), if the data are exchangeable, then do the permutation test as follows:
(a) Calculate the test statistic SSR for the original dataset.
(b) Calculate the means Ȳi·· and Ȳ·j· for each level of A and B, and the grand mean Ȳ···.
(c) Calculate the interaction residuals for each cell using rijk = Yijk − Ȳi·· − Ȳ·j· + Ȳ···.
(d) Repeat steps b and c for all attributes.
(e) Rearrange the residuals rijk randomly. In most cases, the number of all possible rearrangements is too large for a computer to exhaust, so we usually randomly select R (e.g., 300) of these permutations.
(f) Calculate the test statistic SSR for each of the rearranged datasets.
(g) Calculate a p-value and compare it with a predetermined significance level α (usually 0.05) to make a decision. If the null hypothesis is not rejected at α, we proceed to test the main effects.
(2) Test the main effect of product. Under H0: A1 = A2 = ... = Aa (no product effect), if the data are exchangeable, then do the permutation test as follows:
(a) Calculate the test statistic SSR for the original dataset.
(b) Randomly rearrange the data for A within each level of B, R = 300 times.
(c) Calculate the SSR for each of the rearranged datasets.
(d) Calculate a p-value and compare it with a predetermined significance level α to make a decision.
(3) Test the main effect of assessor. Under H0: B1 = B2 = ... = Bb (no assessor effect), if the data are exchangeable, then do the permutation test as follows:
(a) Calculate the test statistic SSR for the original dataset.
(b) Randomly rearrange the data for B within each level of A, R = 300 times.
(c) Calculate the SSR for each of the rearranged datasets.
(d) Calculate a p-value and compare it with a predetermined significance level α to make a decision.

For multivariate GPA permutation, there are two ways to rearrange the data: (1) separate randomization for each attribute (King & Arents, 1991); and (2) simultaneous randomization of all attributes (Wakeling et al., 1992). The simultaneous randomization approach is recommended because it preserves the correlation structure among the attributes (Wakeling et al., 1992) and is more efficient.
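The product-effect test (step 2) can be sketched end to end with a deliberately minimal GPA: translation and rotation only, equal-sized matrices, and no isotropic scaling. This is an illustrative stand-in under those assumptions, not Gower's full algorithm or the authors' implementation.

```python
import numpy as np

def gpa_ssr(mats, n_iter=10):
    """Minimal GPA: center each assessor matrix (translation), iteratively
    rotate each onto the running consensus (orthogonal Procrustes via SVD),
    and return the residual sum of squares (SSR) around the consensus."""
    X = [m - m.mean(axis=0) for m in mats]             # translation
    for _ in range(n_iter):
        consensus = np.mean(X, axis=0)
        for i, x in enumerate(X):
            u, _, vt = np.linalg.svd(x.T @ consensus)  # Procrustes rotation
            X[i] = x @ (u @ vt)                        # rotation/reflection
    consensus = np.mean(X, axis=0)
    return sum(np.sum((x - consensus) ** 2) for x in X)

def product_effect_pvalue(mats, n_perm=300, seed=0):
    """H0: no product effect. Whole rows (products) are shuffled within
    each assessor's matrix, preserving attribute correlations, and SSR
    is recomputed for each rearrangement."""
    rng = np.random.default_rng(seed)
    ssr0 = gpa_ssr(mats)
    m = 0
    for _ in range(n_perm):
        perm = [mat[rng.permutation(len(mat))] for mat in mats]
        if gpa_ssr(perm) <= ssr0:    # small SSR = tight consensus
            m += 1
    return (m + 1) / (n_perm + 1)
```

A small p-value says the observed consensus is tighter than consensuses built from row-scrambled assessor matrices, mirroring the SSR-based test described above.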
3. Datasets

3.1. Eight expert panel data

The eight panel data were from research comparing the capabilities of eight expert descriptive panels used by
Unilever for evaluating face creams. The eight expert panels were from different parts of the world and had been trained to varying degrees prior to this study. Some panels used Quantitative Descriptive Analysis (QDA) while others used the Spectrum Method (SM). Each individual panel used its own set of sensory descriptors, with some overlap across panels. Each individual panel also used its own testing procedure, body site tested (volar forearm vs. face), scales and references. The same 12 samples were evaluated by the panels for two repetitions. The two-way factorial model can be used to make several useful tests, such as: (a) a test of the product effect across the group of eight expert panels (free choice profiling) (Fig. 1a); (b) tests of product, assessor, and product by assessor effects within a given panel (conventional profiling) (Fig. 1b); (c) a test of the replication effect within a given panel, where data are averaged across products (conventional profiling) (Fig. 1c); and (d) a test of the replication effect within a given assessor, where data are averaged across products (conventional profiling) (Fig. 1d). For the present purposes, 'a' and 'b' will be shown to illustrate the two-way GPA permutation test procedure.
3.2. USA panel data

The datasets from the panels were conventional profiling data. For example, the USA panel, which was one of the eight expert panels, consisted of seven assessors and used 32 descriptive attributes for its sensory evaluation. All seven assessors evaluated 12 products for two replications using the same methodology and testing sites. This USA panel dataset was used to demonstrate the application of the permutation test for two-way GPA. For these data, as shown in Fig. 1b, there were at least three null hypotheses to be tested: no interaction of product and assessor, no assessor effect and no product effect.

3.3. Random data

A dataset was randomly generated using a uniform distribution (0–100), as was used by King and Arents (1991), Wakeling et al. (1992) and Wu et al. (2002). This dataset had eight individual matrices and each matrix had 25 rows (samples) and 5 columns (attributes). These configurations were used by Wakeling et al. (1992) and
Fig. 1. Data structure used for global panel alignment work (note: p is the number of assessors varying from panel to panel and Mk is the number of attributes used by the kth panel).
Wu et al. (2002). The prediction is that the products are most likely not different if all assessors scored the products randomly, thus providing a criterion for examining the permutation procedures presented here.

4. Results and discussion

4.1. Eight expert panel data

The dataset including all eight panels (Fig. 1a) follows the free choice profiling format, so only the product effect was tested (although the replication effect could also be tested). Under H0: no product effect, the observations from the 12 products could be exchanged within each panel. For 12 products evaluated by eight panels, there were (12!)^8 possible permutations, too many to exhaust for practical purposes. A reasonable number of permutations was considered to be 300, which was also used by Wakeling et al. (1992) and Wu et al. (2002). These 300 permutations were randomly selected from the (12!)^8 possibilities to conduct the GPA permutation test. The SSR0 value for the original observations was 16749.3. The minimum SSR value (SSRmin) obtained from the 300 permuted datasets was 56908.0, and the corresponding p-value was 0.0033 (= (0 + 1)/(300 + 1)), approaching the 100th percentile in this case. This p-value was approximated based on 300 permutations. Since SSR0 ≪ SSRmin, the "true" p-value should be much less than 0.0033. To obtain a more precise estimate of the p-value, the number of permutations can be increased. In this study, the p-value of 0.0033 is already quite low and reliable given the set significance level α = 0.05, so more accurate estimates of the p-value were unnecessary. We concluded that there was a significant overall difference among the products at α = 0.05. Therefore, the consensus plot (Fig. 2) was deemed useful to visually identify the differences among the products. Fig. 2 indicates that products A, B, F, G, J, K were more likely to be perceived as different, and products B, C, D, E, H, I and L were more likely to be perceived as the same or similar.
Fig. 2. Average configuration (consensus) plot.

An analytical method is needed for accurately comparing the differences among the products. It is also possible to use permutation tests to do pairwise comparisons of the products.
4.2. USA panel data

Since the USA panel data were conventional profiling data, three null hypotheses (no product by assessor effect, no product effect, no assessor effect) were tested. The null hypothesis of no interaction of product by assessor was tested using the interaction residuals. The SSR0 for the original dataset was 203960. The SSR values for the 300 permuted datasets are given in Fig. 3, with a minimum SSRmin of 201899.4. The p-value estimated from 300 permutations was 0.03, suggesting a weak but significant interaction of product by assessor at α = 0.05. The significant interaction indicates that the assessors perceived the products differently. When the interaction is significant, the typical recommendation is not to proceed with analyzing the main effects of the factors included in the interaction. In this paper, however, we shall proceed to test the main effects of product and assessor for demonstration purposes. Under the null
Fig. 3. Frequency of sum of squares of residuals (SSR) for interaction effect using USA panel data (300 permutations).
hypothesis of no product effect, the observations were rearranged within each assessor matrix 300 times; the corresponding SSR values are given in Fig. 4. The p-value estimated from 300 permutations was 0.0033 < α = 0.05, indicating a significant overall difference among the products. The consensus plot for the 12 products is given in Fig. 5 and can be used to visually examine differences between the products. The product effect can also be analyzed with the randomization F-test (Wu et al., 2002), whose results are given in Table 1. The table shows that 11 of the 23 dimensions had p-values less than α = 0.05, indicating significant overall differences among the products on those dimensions. This finding agrees with the results obtained using SSR (which takes all dimensions into account simultaneously). As Wu et al. (2002) demonstrated, the randomization F-test could be used to determine the number of significant dimensions in the context of GPA. However, when the total number of dimensions is large, the randomization F-test usually gives multiple answers for the number of significant dimensions. Take Table 1 as an example: dimensions 1–6, 8–9, 12, 14 and 22 were significant at α = 0.05, so the number of significant dimensions could be 6, 9, 12, 14 or 22. Thus,
it is difficult to know the true number of significant dimensions. This suggests that the randomization F-test is useful for determining whether the products differ overall on each dimension, but not reliable for determining the number of significant dimensions.

Under the null hypothesis of no assessor effect, the observations were rearranged within each product (i.e., within levels of the product factor) 300 times; the corresponding SSR values are given in Fig. 6. The estimated p-value was 0.0033, suggesting that the assessor effect was significant at α = 0.05. This indicates that the assessors did not agree in their perceptions of the products, and it directs attention to further work needed to identify which assessors differ from the others.

The same USA panel dataset was also analyzed by MANOVA. The MANOVA results showed that all effects (product, assessor and interaction) were significant (p < 0.0001) in terms of Wilks' Lambda. MANOVA and the GPA permutation tests thus gave similar results, although the two approaches may produce different results for some datasets if the normality assumption is not met. The authors also found that MANOVA may simply not be executable when the number of variables is larger than the number of observations. In contrast, the GPA permutation test does not have such sensitive constraints and can be useful for evaluating individual assessor performance (e.g., Fig. 1d).

Fig. 4. Frequency of sum of squares of residuals for product effect using USA panel data (300 permutations).

Fig. 5. Average configuration using USA panel data (12 products, where subscripts represent reps).

Table 1
Results from randomization F-test for USA panel data (300 permutations)

Dimension    F1 (original)    Probability (F1 (permuted) > F1 (original))
1            113.22           0.00
2             30.48           0.00
3             17.25           0.00
4              9.87           0.00
5              7.55           0.02
6              8.17           0.01
7              5.32           0.89
8             14.03           0.00
9             12.64           0.00
10             7.69           0.35
11             8.98           0.08
12             9.95           0.04
13             9.96           0.06
14            12.48           0.00
15             7.60           0.86
16            10.53           0.16
17            11.37           0.10
18            10.37           0.32
19            10.28           0.50
20             7.81           0.98
21             9.62           0.73
22            14.52           0.04
23            13.35           0.14
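The three null hypotheses tested for the USA panel differ only in which observations are treated as exchangeable. The following Python sketch (our own function names, for conventional profiling data where all assessors score the same attributes) shows one way each permutation scheme could be implemented; the test statistic would then be recomputed on each permuted dataset.

```python
import numpy as np

def permute_for_product_effect(X, rng):
    """H0: no product effect -- shuffle the product rows independently
    within each assessor's matrix (assessor identity is preserved)."""
    return [x[rng.permutation(x.shape[0])] for x in X]

def permute_for_assessor_effect(X, rng):
    """H0: no assessor effect -- for each product (row index), shuffle
    that product's observations across assessors."""
    stacked = np.stack(X)                      # (assessors, products, attributes)
    for i in range(stacked.shape[1]):
        stacked[:, i] = stacked[rng.permutation(stacked.shape[0]), i]
    return list(stacked)

def permute_for_interaction(X, rng):
    """H0: no product-by-assessor interaction -- double-centre the data
    to obtain interaction residuals, then permute those residuals freely
    across all (assessor, product) cells."""
    stacked = np.stack(X).astype(float)
    prod_mean = stacked.mean(axis=0, keepdims=True)
    assr_mean = stacked.mean(axis=1, keepdims=True)
    grand = stacked.mean(axis=(0, 1), keepdims=True)
    resid = stacked - prod_mean - assr_mean + grand
    flat = resid.reshape(-1, stacked.shape[2])
    flat = flat[rng.permutation(flat.shape[0])]
    return list(flat.reshape(stacked.shape))
```

Each scheme leaves the data structure required by its null hypothesis intact: the first preserves assessor blocks, the second preserves product blocks, and the third removes both main effects before permuting.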
4.3. Random data

Only the null hypothesis of no product difference was tested with the proposed approach, so that the example could be considered in light of either free choice or conventional profiling. The SSR0 for the original dataset was 260751.6, and the SSR values for 1000 permuted datasets are given in Fig. 7. Since the corresponding p-value was 0.063 > α = 0.05, there were no significant overall differences among the 25 products, as expected because the data were randomly generated. Wakeling's permutation test on the random data showed that the percentile for Rc (0.462) was 93.4%, less than 95%, indicating that the GPA consensus was not "significant". Thus, Wakeling's test and the proposed permutation test yield the same conclusion here. The results from the randomization F-test in Table 2 show that the first four dimensions were not significant at α = 0.05 but the fifth dimension was, suggesting that some products
Fig. 6. Frequency of sum of squares of residuals for assessor effect using USA panel data (300 permutations).
Fig. 7. Frequency of sum of squares of residuals for product effect using random data (1000 permutations).
Table 2
Results from randomization F-test for random data (1000 permutations)

Dimension    F1 (original)    Probability (F1 (permuted) > F1 (original))
1            5.02             0.24
2            4.86             0.08
3            3.06             0.67
4            2.85             0.49
5            3.23             0.015
were significantly different from others. This dataset again illustrates that the randomization F-test may not be reliable for determining the number of significant dimensions. One drawback of the randomization F-test is that it inflates the Type I error by testing product differences on each individual dimension, especially when the number of dimensions is large. One remedy is to use the experiment-wise error rate (α′) instead of the comparison-wise error rate (α), where α′ = α/n and n is the number of dimensions. In this case, α′ = α/n = 0.05/5 = 0.01, resulting in no significant dimensions.
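This experiment-wise (Bonferroni-type) correction is easy to verify numerically from the Table 2 probabilities:

```python
# Per-dimension probabilities from Table 2 (random data, 1000 permutations)
probs = [0.24, 0.08, 0.67, 0.49, 0.015]

alpha = 0.05
n = len(probs)
alpha_exp = alpha / n                      # experiment-wise rate: 0.05/5 = 0.01

# Comparison-wise testing flags dimension 5 (p = 0.015 < 0.05);
# the experiment-wise rate flags none (0.015 > 0.01).
sig_comparisonwise = [p for p in probs if p < alpha]
sig_experimentwise = [p for p in probs if p < alpha_exp]
print(len(sig_comparisonwise), len(sig_experimentwise))  # prints "1 0"
```

This confirms that a single dimension appears significant under the comparison-wise rate, but none survive the experiment-wise correction.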
5. Conclusion

For any significance test, it is essential to state the null hypothesis H0 explicitly, a priori. The null hypothesis dictates how to conduct the test of significance and how to interpret the results. In this paper, we demonstrated specifically how to set up H0 and permute the data to permit tests of main effects and interactions within the context of GPA. The proposed GPA permutation test can easily be extended to test various null hypotheses of interest in multiple-way factorial designs. If replication is considered as a factor or block (e.g., Fig. 1c and d), its effect can be tested with the proposed approach. The sum of squares of residuals (SSR) is used as the test statistic for the proposed permutation test. Ideally, different test statistics should be designed for different null hypotheses (Good, 2000), a topic that demands further research.

Acknowledgements

We thank our internal Unilever panel leaders for permitting use of the present sensory datasets, as well as the panels for their descriptive work. Thanks are also given to the referees for their useful comments.

References

Anderson, M. J. (2001). Permutation tests for univariate or multivariate analysis of variance and regression. Canadian Journal of Fisheries and Aquatic Sciences, 58, 626–639.

Dijksterhuis, G. B., & Heiser, W. (1995). The role of permutation tests in exploratory multivariate data analysis. Food Quality and Preference, 6, 263–270.

Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.

Fisher, R. A. (1936). The coefficient of racial likeness and the future of craniometry. Journal of the Royal Anthropological Institute of Great Britain and Ireland, 66, 57–63.

Good, P. (2000). Permutation tests: A practical guide to resampling methods for testing hypotheses (2nd ed.). New York: Springer.

Gower, J. C. (1975). Generalized procrustes analysis. Psychometrika, 40, 33–51.

King, B. M., & Arents, P. A. (1991). Statistical test of consensus obtained from generalized procrustes analysis of sensory data. Journal of Sensory Studies, 6, 37–48.

Kunert, J., & Qannari, E. M. (1999). A simple alternative to generalized procrustes analysis: Application to sensory profiling data. Journal of Sensory Studies, 14, 197–208.

Pitman, E. J. G. (1937). Significance tests which may be applied to samples from any populations. Supplement to the Journal of the Royal Statistical Society, 4, 119–130.

Wakeling, I. N., Raats, M. M., & MacFie, H. J. H. (1992). A new significance test for consensus in generalized procrustes analysis. Journal of Sensory Studies, 7, 91–96.

Wu, W., Guo, Q., de Jong, S., & Massart, D. L. (2002). Randomisation test for the number of dimensions of the group average space in generalized procrustes analysis. Food Quality and Preference, 13, 191–200.