Assessing sensory panel performance using generalizability theory

Paul Talsma
Givaudan UK Ltd., Kennington Road, Ashford, Kent TN24 0LT, United Kingdom
Tel.: +44 (0) 1233 644090. E-mail address: [email protected]

Article history: Received 4 September 2014; received in revised form 21 January 2015; accepted 27 February 2015.

Keywords: Generalizability theory; Panel performance; Coefficient U; Panel reliability; Variance component; Statistical power

Abstract

Generalizability theory provides a framework for assessing panel reliability, both for the panel as a whole and for individual panellists. The variability of the sensory panel scores is split into products, panellists, replications, and the interactions between these terms. Reliability is defined as product variance over total variance. Coefficient G includes only product-related terms in the denominator and focuses on the ordering of the product scores. Coefficient U is introduced, which includes all variance components in the denominator and focuses on the absolute values as well as the ordering. It is shown that this latter feature provides important additional information about panel performance. An algorithm is described which excludes panellists one by one and evaluates the contribution of each panellist to total test reliability. The focus is on changes in the coefficients and variance components after removal compared to before, allowing an in-depth evaluation of the performance of individual panellists. When coefficient U increases by an amount deemed relevant (0.05 on average over all attributes, or 0.10 for a single attribute) after removal of a panellist, the panellist qualifies for exclusion from the statistical analysis. The total number of panellists to be excluded is limited to a maximum of 20% of the panel size. It is shown that a statistical power calculation is a useful addition to a reliability analysis, checking whether panel discrimination meets a pre-set standard. It is explained how the reliability algorithm and the power calculation can be implemented using MS Excel. The common criteria for assessing panel performance (discrimination, consensus, and repeatability) are defined in terms of generalizability theory variance components: discrimination focuses on maximising product variance, consensus on minimising all components containing panellist, and repeatability on minimising all components containing replicate. It is discussed how reliability results obtained using this methodology can be used for panel management. Two examples from studies carried out by the Givaudan sensory panel in Ashford, UK, are given.

1. Introduction

This article builds upon a recent publication in Food Quality and Preference (Verhoef, Huijberts, & Vaessen, 2014) in which generalizability theory (Brennan, 2001) is proposed for evaluating panel performance. Verhoef et al. give a good introduction to the background and subject matter. At Givaudan we have been using generalizability theory to assess panel performance for 4 years. This article describes three additions to the framework suggested by Verhoef et al., and gives two examples of applications in research studies.

In generalizability theory (G theory), the sensory panel data variance is split into its components: products, panellists, replications, and the interactions between these. G theory can be used for reliability analysis (a G-study) or for decision making (a D-study). In a G-study, the focus is on obtaining estimates of variance components and using these to assess the reliability of the raw data scores. In a D-study, the data are collected for the specific purpose of making a decision. At Givaudan, we want to make decisions about products or fragrance samples, for example which one has a higher fragrance intensity than the others, so for us the D-study framework applies. We use the product means for decision making, and we therefore focus on the reliability of the product means rather than of the raw data. The variance components of the product means are derived as indicated in Table 1.




Table 1. Variance components in a sensory panel study: derivation and reliability coefficients. nP = number of panellists; nR = number of replicates. The G and U columns indicate whether the component is included in the denominator of that coefficient.

Variance component                  Derivation                                         G   U
Product                             σ²(product)                                        ✓   ✓
Panellist                           σ²(panellist)/nP                                       ✓
Replicate                           σ²(replicate)/nR                                       ✓
Product by panellist                σ²(product by panellist)/nP                        ✓   ✓
Product by replicate                σ²(product by replicate)/nR                        ✓   ✓
Panellist by replicate              σ²(panellist by replicate)/(nP · nR)                   ✓
Product by panellist by replicate   σ²(product by panellist by replicate)/(nP · nR)    ✓   ✓
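As an illustration of the Table 1 computations, the following minimal Python sketch (illustrative only; the production implementation described later uses SAS and MS Excel, and the dictionary keys and function names here are assumptions) derives the D-study components for product means from the raw variance components and computes G and U.

```python
# Illustrative sketch of the Table 1 computations (not the paper's SAS/Excel code).
# vc holds the raw (G-study) variance components; the D-study components for
# product means divide each term by the number of panellists and/or replicates.

def d_study_components(vc, n_panellists, n_replicates):
    return {
        "product": vc["product"],
        "panellist": vc["panellist"] / n_panellists,
        "replicate": vc["replicate"] / n_replicates,
        "product*panellist": vc["product*panellist"] / n_panellists,
        "product*replicate": vc["product*replicate"] / n_replicates,
        "panellist*replicate": vc["panellist*replicate"] / (n_panellists * n_replicates),
        "product*panellist*replicate":
            vc["product*panellist*replicate"] / (n_panellists * n_replicates),
    }

def coefficients_g_u(d):
    # G: only product and product-containing interactions in the denominator.
    g_denom = (d["product"] + d["product*panellist"]
               + d["product*replicate"] + d["product*panellist*replicate"])
    # U: all seven components in the denominator.
    u_denom = sum(d.values())
    return d["product"] / g_denom, d["product"] / u_denom
```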

Table 2. Algorithm for evaluating the panel performance of individual panellists.

Step 1 (all panellists): calculate the reliability of the whole panel (full set), G and U.
Step 2 (for all panellists): remove one panellist and recalculate.
Step 3 (for each panellist): does removing the panellist increase U by a relevant amount? If yes, remove.
Step 4 (reliable panellists): recalculate the coefficients for the reduced set.

2. Calculation

We estimate the respective variances in Table 1 using SAS PROC VARCOMP with minimum norm quadratic unbiased (MINQUE) estimation (Hartley, Rao, & LaMotte, 1978). The MIVQUE0 method in SAS (METHOD = MIVQUE0) produces unbiased estimates that are locally best quadratic unbiased given that the true ratio of each component to the residual error component is zero. Negative estimates, which sometimes occur and are caused by variability in the data or outliers, are set to zero. We normally present the variance components as percentages adding up to 100. This gives good insight into possible causes of a lack of panel reliability. Reliability is defined as product variance/total variance, where the total variance is calculated as the sum of the variance components indicated in Table 1 for reliability coefficients G and U. Coefficient G is the usual reliability (generalizability) coefficient and is called the univariate quality index by Verhoef et al. (2014).

2.1. Coefficient U

Coefficient U has been proposed by Brennan (2001) and we consider it a useful addition. Unlike G, it takes all possible sources of unreliability into account. Whereas G focuses on the ordering of the product scores, U focuses on both the ordering and the absolute scores. U is always ≤ G. If, for example, we find that for our panel G = 0.91 and U = 0.50, we can see that by and large the panel have excellent agreement about the ordering of the product scores across replicates, but the low value of U shows that the following three variance components are relatively large in combination: the main effect of panellist, the main effect of replicate, and the panellist by replicate interaction. With some straightforward arithmetic it can be derived that the combined variance of these three components equals 90% of the product variance (100% × [1/U − 1/G]). For a more informed view, the actual variance components should be studied. Their interpretation is explained in detail below.
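The arithmetic can be made explicit. Writing P for the product variance, R_G for the sum of the other components in the denominator of G (the product-containing interactions), and R_abs for the three components just mentioned (panellist, replicate, panellist by replicate), the definitions give

\[ G = \frac{P}{P + R_G}, \qquad U = \frac{P}{P + R_G + R_{abs}}, \]

so that

\[ \frac{1}{U} - \frac{1}{G} = \frac{(P + R_G + R_{abs}) - (P + R_G)}{P} = \frac{R_{abs}}{P}. \]

With G = 0.91 and U = 0.50 this gives 1/0.50 − 1/0.91 = 2.00 − 1.10 ≈ 0.90, i.e. 90% of the product variance.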


2.2. Algorithm

Coefficients G and U are useful for evaluating the performance of the panel as a whole, but they do not give insight into the performance of individual panellists. For this we use an algorithm whose steps are set out in Table 2. We start by calculating coefficients G and U for the panel as a whole, and then proceed by removing one panellist at a time and recalculating the coefficients. Coefficient U with the panellist removed is compared to U for the whole panel; if the reliability increases by a relevant amount when the panellist is excluded, the panellist is considered to have a negative effect on the overall test reliability. As criteria for relevance we use: either U increases by 0.05 or more on average over all assessments together, or U increases by 0.10 for one single assessment. For the second criterion it is also checked that U does not decrease on average over all assessments together. Panellists thus identified have a negative effect on test reliability, which may have several causes, such as underperforming on the day or inexperience/lack of training. This line of reasoning only applies, however, if the occasional panellist is not performing on the day while the majority of the panel is. For this reason, we only use the reduced set for reporting when no more than 20% of panellists have been removed from the analysis; otherwise, the full set is used. The cut-off level of 20% has been agreed upon within Givaudan. It is the criterion that sensory test leaders use and have used for a long time, because it fits their perception of an appropriate boundary between the occasional panellist not performing and the majority of the panel not performing. It is also used in the Flavours division of Givaudan, which arrived at it independently. The test leader should always check the data and evaluate whether they agree with the decisions taken by the algorithm. Having the algorithm in place does have the advantage, however, that the criteria for panellist removal, and which (sub)set to use for the report, are all specified in advance before the test is done.
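The following Python sketch outlines the algorithm of Table 2 (the production version is implemented in SAS and Excel VBA; the function `coefficient_u`, standing in for the full variance-component estimation pipeline for one attribute, is an assumed placeholder):

```python
# Sketch of the Table 2 algorithm: leave each panellist out once, compare U
# against the full-panel value, and report the reduced set only if at most
# 20% of the panel is removed.

def select_reliable_panellists(data, panellists, attributes, coefficient_u):
    base_u = {a: coefficient_u(data, panellists, a) for a in attributes}
    flagged = []
    for p in panellists:
        kept = [q for q in panellists if q != p]
        deltas = [coefficient_u(data, kept, a) - base_u[a] for a in attributes]
        mean_delta = sum(deltas) / len(deltas)
        # Criterion 1: U rises by >= 0.05 on average over all assessments.
        # Criterion 2: U rises by >= 0.10 for a single assessment, provided
        # U does not decrease on average over all assessments.
        if mean_delta >= 0.05 or (max(deltas) >= 0.10 and mean_delta >= 0):
            flagged.append(p)
    if len(flagged) <= 0.20 * len(panellists):
        return [p for p in panellists if p not in flagged]
    return list(panellists)  # more than 20% flagged: report the full set
```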

2.3. Discrimination: Statistical power analysis

Reliability coefficients can be very low (close to zero) even when the panel has performed adequately. The reason is the definition of reliability as product variance over total variance: when the products are very similar, product variance may be very low, leading to low values for the coefficients. For this reason, we supplement the reliability analysis with a statistical power analysis.


Table 3. Variance components in a sensory panel study: interpretation. The letters in brackets indicate the aspect(s) of panel performance each component relates to (D = discrimination, C = consensus, R = repeatability).

Product [D]: differences among the products in scoring level.
Panellist [C]: the panellists score at different levels.
Replicate [R]: the replicates are scored at different levels.
Product by panellist [C]: the panellists score the products differently, and may even order them differently altogether.
Product by replicate [R]: some products are scored differently across replicates.
Panellist by replicate [C, R]: some panellists score differently across replicates.
Product by panellist by replicate [C, R]: some panellists score some products differently across replicates.

We want to calculate what size of difference between products can be detected with 95% power, and compare that to our aim of being able to detect differences of 10% of the scale used, or less. We have used the 10% cut-off for 4 years and it works well in practice. For most of our tests, the panel meets this cut-off level. It is generally easier to meet for attributes that are tested all the time, as the panel have a lot of experience with such attributes (and are well trained on them), and more difficult to meet for new attributes with which the panellists have less experience. The difference the panel is able to detect may vary across attributes; to be able to evaluate the study as a whole, we normally take the average over the attributes used to come up with a single number. The power analysis is done using the usual formula (Machin, Campbell, Tan, & Tan, 2009)

N = f(α, β) × (σ/δ)²,

where N is the number of data records, f(α, β) = (z_{1−α/2} + z_{1−β})² is a function of the type I and type II errors, σ² is an estimate of the variability in the data (for which we use the MSE of the ANOVA), and δ is the difference which can be detected. Using α = 0.05 two-sided and β = 0.05, the equation is solved for δ.
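A small Python sketch of this calculation (illustrative; the paper's version runs in MS Excel, and N is taken here as the number of records per product, as in the Excel application described in Section 2.7):

```python
# Solve N = f(alpha, beta) * (sigma / delta)^2 for delta:
# delta = sigma * sqrt(f / N), with sigma^2 estimated by the ANOVA MSE.
from math import sqrt
from scipy.stats import norm

def detectable_difference(mse, n_records, n_products, alpha=0.05, beta=0.05):
    f = (norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)) ** 2  # f(alpha, beta)
    n_per_product = n_records / n_products
    return sqrt(f * mse / n_per_product)
```

With α = 0.05 two-sided and β = 0.05, f(α, β) ≈ (1.96 + 1.64)² ≈ 13.0.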

2.4. Assessing the performance of individual panellists

The tools described above give a panel leader a good basis for assessing the performance of the panel as a whole and of individual panellists. The reliability coefficients are useful, but most useful of all is looking at the variance components themselves. For ease of interpretation we normally present these as percentages of the total variance, adding up to 100. Most panel leaders will be used to looking at the following aspects of panel performance: (a) discrimination: is the panel able to distinguish between the products? Are individual panellists able to distinguish between the products? (b) Consensus: do the panellists agree with each other? Do they score the products in a similar way? (c) Repeatability: are the panellists' scores consistent over time? Table 3 shows how these different aspects of panel performance can be assessed using variance components.

2.4.1. Discrimination

For optimal discrimination, we want the product variance percentage to be high. Coefficients G and U are therefore both important for assessing the discrimination of the panel as a whole: the closer they are to 1, the better the discrimination. As explained above, the discrimination of individual panellists can be evaluated by removing them from the set and recalculating G and U. If a panellist has a negative influence on panel discrimination, the reliability will go up when she has been removed from the set.

2.4.2. Consensus

The consensus entries in Table 3 show all the variance components associated with a lack of consensus: all the components in which panellist is included. For each of these, the smaller the better.

The panellist effect identifies panellists who consistently score lower or higher than the rest of the panel. The product by panellist interaction means that the panellists have scored the products differently; they might even have ordered them differently altogether, which has the potential to make the study very difficult to interpret. The panellist by replicate interaction means that some panellists score differently across replicates: the difference in scoring level across replicates is not the same for all panellists. The three-way interaction (product by panellist by replicate) is the most difficult to interpret. It means that the magnitude of each of the two-way interactions between two factors depends on the third factor; for example (consensus): the product by panellist interaction differs across replicates.

2.4.3. Repeatability

It can be seen in Table 3 that there are several aspects to (lack of) repeatability; all have replicate included. The main effect of replicate means that the panel as a whole has scored at a different level across replicates. The product by replicate interaction means that some products have been scored differently across replicates: the difference between the product scores over the replicates is not the same for all products. The panellist by replicate interaction was already discussed under consensus, and one can argue that it belongs to both aspects of panel performance: if some panellists score differently across replicates while others score approximately the same, this can be interpreted as a lack of consensus but also as a lack of repeatability for the panellists whose scores vary. The three-way interaction can be interpreted in terms of repeatability as: the panellist by replicate interaction depends on the product.

2.5. Feedback for panel leaders and panellists

Our panel leaders get reliability output containing the values of coefficients G and U, the change in U on removal of each panellist, and a listing of which panellists meet the criteria for removal. Coefficients G and U are recalculated for the reduced set with all panellists meeting the criteria removed. In addition, this output contains listings of the raw data for each panellist, so that the panel leader can check the raw data for selected panellists. The percentages of variance are added up according to Table 3 for discrimination, consensus and repeatability, and these sums are provided to the panel leader for each panellist. To give all three concepts a positive interpretation going in the same direction, we use 100 minus the sum of the relevant variance components (as indicated in Table 3) for consensus and repeatability. So a maximum score of 100 for consensus means that the four variance components contributing to lack of consensus in Table 3 are all zero.
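A minimal sketch of these feedback scores in Python (illustrative; the dictionary keys are assumptions), taking as input the variance components already expressed as percentages summing to 100:

```python
def feedback_scores(pct):
    # Components contributing to lack of consensus (contain panellist) and to
    # lack of repeatability (contain replicate), per Table 3.
    consensus_terms = ["panellist", "product*panellist",
                       "panellist*replicate", "product*panellist*replicate"]
    repeatability_terms = ["replicate", "product*replicate",
                           "panellist*replicate", "product*panellist*replicate"]
    return {
        "discrimination": pct["product"],
        "consensus": 100 - sum(pct[t] for t in consensus_terms),
        "repeatability": 100 - sum(pct[t] for t in repeatability_terms),
    }
```

Applied to the full-panel column of Table 6 below, this reproduces the published scores: discrimination 19.1, consensus 100 − (12.4 + 24.1 + 0.3 + 24.4) = 38.8, and repeatability 100 − (10.2 + 9.7 + 0.3 + 24.4) = 55.4.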



For ease of interpretation we use a traffic light system (colour coding) for all three concepts. Take discrimination as an example. If including the panellist in the set has a positive or neutral effect (discrimination is the same or higher when she is included than when she is not), this is coded green. If including the panellist has a slightly negative effect (discrimination goes down when she is included, by up to 2.5%), this is coded orange. Decreases of more than 2.5% are coded red. The panel leader gets a sheet with the colour codes and percentages; the panel themselves only get to see the colours. We have found that the cut-offs of 0% and 2.5% work well in practice for panel feedback. For feedback, low thresholds are desirable, so that panellists can be pre-warned that they might be underperforming.

The panel leader will normally check which panellists have been removed from the analysis and why: how many, and is it the occasional panellist not performing on the day or a large proportion of the panel? If it is a large proportion, then either the panel find the task very difficult, or the products are very similar; a power calculation helps to distinguish between the two. The panel leader will also check the performance of panellists in terms of discrimination, consensus and repeatability for the respective assessments in the test, both for a single panel and over time. Selected panellists may need additional training if their performance is repeatedly coded orange or red. An individual profile can be made: on which attributes are they not performing, and how often? The panel leader will then decide on additional training or possible disciplinary action, depending on the reasons for not performing, such as the wrong attitude, lack of motivation, or lack of improvement. Automation of the process using the algorithm has been a large advantage. Over the years, several panel leaders have tried their own approach to removing panellists from the analysis after looking at the results, but the results they get in terms of discrimination are generally worse than what the algorithm produces.
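The colour coding described at the start of this section is simple threshold logic; a minimal sketch (illustrative Python, with the decrease expressed in percentage points as above):

```python
def traffic_light(decrease):
    # decrease: how much the score drops when the panellist is included.
    if decrease <= 0.0:
        return "green"   # positive or neutral effect of inclusion
    if decrease <= 2.5:
        return "orange"  # slightly negative effect
    return "red"         # decrease of more than 2.5%
```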

2.6. Reporting of panel results

To our clients we normally report the results of an ANOVA of the assessments used in the test, with factors product, panellist and replicate: product LS means and the significance of pairwise comparisons between products, for which we normally use the Tukey correction for multiple comparisons. The LS means and significances are also used when panel results are discussed internally and presented to management. With the methods described above in place for several years now, it is becoming increasingly customary to report the reliability coefficients to management as well, together with the means and significances. When a more in-depth review of a study is done, power analysis results are also reported. We have, for instance, compared the performance of our fragrance sensory panels at the different locations across the world by doing the same test at each location (Ashford, New Jersey, Sao Paulo, Singapore) and comparing the results in terms of the ordering of the products, the significances found, the reliabilities, and the power analysis. We found good agreement, and all panels performed consistently at very acceptable levels on all the criteria used.

2.7. Doing the calculations using MS Excel

Givaudan is a global company and we have several sensory centres across the world performing sensory panel studies for the Fragrance division. We have therefore rolled out the methods described here from Ashford (UK) to the other locations: New Jersey, Singapore, Sao Paulo, and several other locations in Latin America. To maximise ease of use we use MS Excel for this and avoid any supporting statistical software. The formulae for the sums of squares of a three-factor ANOVA were obtained from the well-known textbook of Kirk (1968). In Kirk's terminology (p. 220) the design is called the CRF-pqr design: a completely randomised factorial with numbers of levels p, q, and r, respectively. The formulae he provides for calculating the sums of squares do not require matrix inversion and are quickly and easily incorporated into and calculated by MS Excel. The values obtained are then used to calculate the mean squares by dividing by the degrees of freedom, and the expected mean square formulae are used to calculate the variance components needed for the reliability analysis. The error variance σ²e is estimated using the mean square of the three-way interaction (products * panellists * replicates), and the other terms are then calculated using the respective formulae for the expected mean squares, as per Table 4 below. When the respective ANOVA variances have been estimated, the computations of Table 1 are used to calculate the generalizability theory variance components and coefficients G and U. Note that the generalizability theory D-study focuses on the means of the products, and that the variance component estimates used for the reliability analysis therefore have to be calculated using the formulae given in Table 1. All these calculations have been programmed in Visual Basic. The same applies to the algorithm for removing one panellist at a time and recalculating the coefficients: working out the changes in U when a panellist is removed, deciding, using the criteria mentioned above, which panellists qualify for removal, and then recalculating the coefficients for the reduced set. This Excel application works well for balanced, complete designs. Our recommended design internationally, which we use most of the time, is a single panel assessing a number of products, where all panellists assess all products and do the same number of replicates (usually two). The results obtained from studies with this design using the Excel calculations tend to be very similar to those obtained with SAS: the same panellists removed, very similar coefficients, etc.

Table 4. Analysis of variance table for a completely randomised factorial (CRF-pqr) design with three factors. MS = SS/df for each row; in the F column, [i/7] denotes the MS of row i divided by the MS of row 7.

Row  Source                  SS         df               F      E(MS)
1    Products (A)            SS(A)      p−1              [1/7]  σ²e + qr σ²a
2    Panellists (B)          SS(B)      q−1              [2/7]  σ²e + pr σ²b
3    Replicates (C)          SS(C)      r−1              [3/7]  σ²e + pq σ²c
4    Pro * Pan (AB)          SS(AB)     (p−1)(q−1)       [4/7]  σ²e + r σ²ab
5    Pro * Rep (AC)          SS(AC)     (p−1)(r−1)       [5/7]  σ²e + q σ²ac
6    Pan * Rep (BC)          SS(BC)     (q−1)(r−1)       [6/7]  σ²e + p σ²bc
7    Pro * Pan * Rep (ABC)   SS(ABC)    (p−1)(q−1)(r−1)         σ²e
8    Total                   SS(Total)  pqr−1
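A Python sketch of this step (illustrative; the paper implements it in Excel/VBA): the E(MS) equations of Table 4 are inverted to recover the ANOVA variance components from the mean squares, with negative estimates set to zero as in the SAS analysis.

```python
def anova_variance_components(ms, p, q, r):
    """ms maps Table 4 sources ('A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC') to
    their mean squares; p, q, r are the numbers of products, panellists
    and replicates."""
    e = ms["ABC"]  # sigma_e^2 is estimated by the three-way interaction MS
    components = {
        "product": (ms["A"] - e) / (q * r),
        "panellist": (ms["B"] - e) / (p * r),
        "replicate": (ms["C"] - e) / (p * q),
        "product*panellist": (ms["AB"] - e) / r,
        "product*replicate": (ms["AC"] - e) / q,
        "panellist*replicate": (ms["BC"] - e) / p,
        "product*panellist*replicate": e,
    }
    return {k: max(v, 0.0) for k, v in components.items()}
```

These raw components are then passed through the Table 1 formulae (division by the numbers of panellists and replicates) before G and U are computed.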



Table 5. Reliability coefficients for study 1.

Time (h)    G       U
0           0.97    0.95
2           0.97    0.95
4           0.98    0.96

We have found that for more complex, unbalanced analyses the Excel application is not well suited and may give results that are seriously off the mark. The reason is that in such cases the variance component estimation becomes much more complicated, and SAS PROC VARCOMP does a much better job.

Likewise, we have developed an application for power analysis, in which the straightforward power formula given in Section 2.3 above has been implemented in MS Excel. The user enters the MSE from the ANOVA, the number of products tested, and the number of data records; the N per product is calculated as the ratio of these latter two numbers. The programme then shows the difference between products which can be detected with 95% power.

The method is demonstrated using two studies performed by the sensory panel in Ashford.

3. Results

3.1. Study 1

This was a fragrance intensity over time study carried out in Q1 2013. The intensity of four cleaning products was assessed at 0, 2, and 4 h after applying the fragrance. Each time point was evaluated by a separate panel; the panel sizes were 19, 21, and 18, respectively. The total Ashford panel consists of a pool of about 35 part-time Givaudan employees. It can be seen that the reliabilities are excellent for all time points (see Table 5). Accordingly, no panellists were excluded by the algorithm. The statistical power analysis showed that, on average, the difference between products which could be detected with 95% power was 10.1 on a 0–100 line scale. This is in accordance with our aim of being able to detect differences of 10% of the scale used.

3.2. Study 2

This was a malodour counteraction study carried out in Q1 2012. Three fragranced dishwashing liquids and a control (malodour only) were tested for fragrance and malodour intensity using 16 panellists. The malodour reliabilities obtained were excellent (G = 0.98, U = 0.96), but the fragrance reliabilities much less so (see Table 6 below). Three out of 16 panellists were removed from the analysis by the algorithm, resulting in acceptable fragrance intensity reliability. The reliability results for fragrance intensity are shown for the full panel and for the subset of reliable panellists (three panellists removed). In addition, results are shown with only the panellist with the largest effect on the U coefficient removed (Panellist One; an increase of 0.51).

It can be seen that for the full set both reliability coefficients are very low. 24.1% of the variance is product by panellist interaction, indicating that the panel could not agree on the ordering of the products. The delta value is 5.07, which is low and very acceptable given that we aim for a delta of 10 (10% of the scale used). In this study, discriminating between the products for fragrance intensity was very difficult because the fragrance intensities were very low. As a consequence, even with the low value of delta, no significant differences between products were found (Tukey correction for multiple comparisons was used).

When Panellist One is removed from the analysis, both G and U increase to a level that can almost be deemed acceptable; we normally strive for G of 0.8 or higher and U of 0.7 or higher. The panellist effect is reduced from 12.4% to 0%, indicating that this panellist scores at a very different level from the other panellists. The replicate main effect and the two-way interactions involving replicate, combined, are reduced from 20.2% to 4.1%, indicating that this panellist is largely responsible for the inconsistencies across replicates; accordingly, repeatability goes up by 23.8 when the panellist is removed. The product by panellist interaction is reduced from 24.1% to 9.5%, indicating that this panellist is responsible for a large part of the disagreement in the panel about the ordering of the products; indeed, consensus goes up from 38.8 to 73.8 when the panellist is removed. Tables 7 and 8 show the raw data of Panellist One and of the panel as a whole without Panellist One.

Table 6. Fragrance intensity reliability results of study 2 for the panel as a whole (N = 16), with one panellist removed, and with three panellists removed.

                                      All panellists   One panellist   Three panellists
                                      included         removed         removed
Reliability
  G                                   0.25             0.73            0.80
  U                                   0.19             0.70            0.78
Variance components (%)
  Product                             19.1             69.6            78.4
  Panellist                           12.4             0.0             0.0
  Replicate                           10.2             4.1             2.6
  Product * panellist                 24.1             9.5             4.9
  Product * replicate                 9.7              0.0             0.0
  Panellist * replicate               0.3              0.0             0.0
  Product * panellist * replicate     24.4             16.7            14.1
  Discrimination (a)                  19.1             69.9            78.4
  Consensus (a)                       38.8             73.8            81.0
  Repeatability (a)                   55.4             79.2            83.3
LS means (b)
  Control                             1.0 A            0.0 A           0.0 A
  Product 1                           2.1 A            1.5 AB          0.9 AB
  Product 2                           3.8 A            2.7 AB          3.1 BC
  Product 3                           3.3 A            3.6 B           4.1 C
Delta from power analysis             5.07             4.38            4.81

(a) Discrimination = product variance, expressed as a percentage. Consensus = 100 − (the sum of all variance components containing panellist, expressed as percentages). Repeatability = 100 − (the sum of all variance components containing replicate, expressed as percentages).
(b) The intensity scale used ranges from 0 to 100. Product means sharing the same letter are not statistically significantly different from each other at the 5% level.



Table 7. Fragrance intensity data for Panellist One, study 2.

            Replicate 1   Replicate 2   Mean
Control     18.0          15.0          16.5
Product 1   0.0           22.0          11.0
Product 2   15.0          27.0          21.0
Product 3   0.0           0.0           0.0

Table 8. Fragrance intensity data (means and standard errors) for all panellists except Panellist One, study 2.

            Replicate 1   Replicate 2   Mean
Control     0.0 (0.00)    0.0 (0.00)    0.0 (0.00)
Product 1   0.3 (0.27)    2.7 (1.04)    1.5 (0.57)
Product 2   2.5 (0.99)    2.9 (1.10)    2.7 (0.73)
Product 3   3.3 (1.49)    3.8 (1.72)    3.6 (1.12)

It can be seen that Panellist One scored three of the four products at a very different (much higher) level than the others, and that her repeatability for products 1 and 2 is very poor. It also appears that Panellist One misidentified the control (no fragrance) as product 3. All this evidence together points to the conclusion that this panellist had an off-day: her results seriously distort the fragrance intensity results for the panel as a whole.

Returning to Table 6, removing the three panellists indicated by the algorithm leads to acceptable reliability for the fragrance intensity assessments: G = 0.80 and U = 0.78. The main effects (other than product) and the two-way interactions together amount to only 7.5% of the total variance. An interesting finding emerges when comparing these results to those with only Panellist One removed: although the delta value increases slightly when two more panellists are removed, more statistical significance between the products is found. This is due to the improved reliability and the corresponding increase in product variance.

4. Discussion and conclusions

Generalizability theory has proven to be a very useful tool for assessing panel performance, both for the panel as a whole and for individual panellists. We have now been using the algorithm for identifying panellists with unreliable results for 4 years, in numerous studies. The automation of the process has greatly helped us turn reports around more quickly and efficiently whilst maintaining the desired quality of the statistical analysis. A panel leader should always carefully check the analysis results and might decide to deviate from the algorithm, for instance by removing a panellist who was already known, before the analysis, not to have performed up to standard on the day. A clear advantage of using the algorithm, though, is that the rules for leaving panellists out of the analysis are in place before the analysis is done and before the data are collected. This takes away any suspicion that the test leader or someone else might have looked at different subsets and then presented the results which appear best ("p-hunting"). When using our algorithm, panellists may be removed from the analysis because including them leads to a decrease in panel reliability (coefficient U) which is deemed relevant. In addition, the power analysis checks whether the panel is able to detect relevant differences between products with adequate power.

Our approach does not include statistical testing for aspects of panel reliability such as those described by Latreille et al. (2006). These authors propose an ANOVA model very similar to the one used here and derive statistical hypotheses from it about aspects of panel reliability: discrimination, repeatability, and consensus. Their framework is

very elaborate and well-defined, but there are several problems associated with using statistical testing to assess aspects of panel reliability. Tests tend to become significant more readily when more data are collected, which is independent of panel performance. This will lead to finding more discrimination (product variance tests becoming significant), but also to finding more significance for the respective measurement error variance components, potentially leading to erroneous conclusions about consensus and repeatability. Tests for interactions tend to have low power (Edwards, 1993), and the differences the respective tests are able to detect with sufficient power have to be checked for relevance, if indeed it can be established at all which differences are considered relevant for each individual test. Finally, when assessing panel performance with statistical tests, a lot of tests usually need to be performed, with the associated risk of finding significance by chance (multiple testing). For these reasons we focus on relevant contributions to panel reliability, in combination with a power analysis, and do not use statistical testing for aspects of panel performance.

We have shown how discrimination, consensus, and repeatability can be defined in terms of the variance components of generalizability theory and quantified by presenting the components as percentages and adding the relevant ones. In this way feedback can be given to the panel leader and to the individual panellists. Coefficient U is a useful addition to the reliability analysis because of its focus on the absolute scores produced by the panel, which distinguishes it from coefficient G. Statistical power analysis gives independent information in addition to a reliability analysis: it can be used to evaluate the discrimination achieved against a standard, and so provides a criterion for deciding, in the case of low reliability, whether the panel has underperformed or the products were very similar. Study 2 above showed such a case of low reliability and good discrimination (a low delta value), so with acceptable performance of the panel: the task of distinguishing between fragrances with very low intensity was simply very difficult.

We have found the reliability analysis to be especially useful for Quantitative Descriptive Analysis (QDA) and odour profiling studies. In QDA, the panellists agree on a list of attributes they will use to evaluate the products and then score the products using these attributes. In odour profiling, the panel assesses all products using a list of odour descriptors such as floral, fruity, etc. Not all attributes discriminate equally well, and the reliability analysis can help select the attributes which have been scored reliably. We normally do Principal Components Analysis (PCA) on the products by attributes means table, and it has been known for a long time (Gorsuch, 1983, chap. 16) that factor loadings become more stable and replicable when one uses variables with good reliability. Accordingly, we have found that the PCA solutions we get are substantially better (a higher percentage of variance explained in two or three dimensions, clearer structure) when limited to reliable attributes only. We are still working on the best cut-off level for acceptable reliability for QDA and odour profiling. G > 0.8 is most probably too strict, because it leads to throwing out too many attributes. We have recently had some good experiences with G > 0.5, which led to a clear PCA structure whilst keeping in a good number of attributes.
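A sketch of this attribute-selection step (illustrative Python; the function and variable names are assumptions, and a plain SVD stands in for whatever PCA software is actually used):

```python
import numpy as np

def pca_on_reliable_attributes(means, g_values, cutoff=0.5):
    """means: products x attributes table of panel means;
    g_values: coefficient G per attribute."""
    keep = np.asarray(g_values) > cutoff               # retain reliable attributes
    x = means[:, keep] - means[:, keep].mean(axis=0)   # column-centre the table
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u * s                                     # product scores per component
    explained = s**2 / np.sum(s**2)                    # variance proportion per component
    return scores, explained, keep
```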
References

Brennan, R. L. (2001). Generalizability theory. Springer Verlag.
Edwards, L. K. (1993). Applied analysis of variance in behavioral science. Marcel Dekker Inc.
Gorsuch, R. L. (1983). Factor analysis. Psychology Press.
Hartley, H. O., Rao, J. N., & LaMotte, L. (1978). A simple synthesis-based method of variance component estimation. Biometrics, 34, 233–244.
Kirk, R. E. (1968). Experimental design: Procedures for the behavioral sciences. Wadsworth Publishing Company Inc.
Latreille, J., Mauger, E., Ambroisine, L., Tenenhaus, M., Vincent, M., Navarro, S., et al. (2006). Measurement of the reliability of sensory panel performances. Food Quality and Preference, 17(5), 369–375.
Machin, D., Campbell, M. J., Tan, S. B., & Tan, S. H. (2009). Sample size tables for clinical studies. Wiley-Blackwell.
Verhoef, A., Huijberts, G., & Vaessen, W. (2014). Introduction of a quality index, based on generalizability theory, as a measure of reliability for univariate and multivariate sensory descriptive data. Food Quality and Preference.