Food Quality and Preference 14 (2003) 247–256 www.elsevier.com/locate/foodqual
Proficiency testing for sensory ranking panels: measuring panel performance

Jean A. McEwan a,*, Raija-Liisa Heiniö b, E. Anthony Hunter c, Per Lea d

a Department of Consumer and Sensory Sciences, Campden & Chorleywood Food Research Association, Chipping Campden, Gloucestershire, GL55 6LD, UK
b VTT Biotechnology, PO Box 1500, FIN-02044 VTT, Finland
c Biomathematics & Statistics Scotland, JCMB, King's Buildings, Edinburgh EH9 3JZ, UK
d MATFORSK, Osloveien 1, N-1430 Ås, Norway

Received 23 January 2002; received in revised form 22 June 2002; accepted 1 July 2002

* Corresponding author. Present address: MMR Food and Drink Research Worldwide, Wallingford House, High Street, Wallingford, OX10 0DB, UK. Tel.: +44-1491-824-999; fax: +44-1491-824-666. E-mail address: [email protected] (J.A. McEwan).
Abstract

Proficiency testing in sensory analysis is an important step towards demonstrating that results from one sensory panel are consistent with the results of other sensory panels. The uniqueness of sensory analysis poses some specific problems for measuring the proficiency of the human instrument (panel). As part of an EU supported project, ProfiSens, 14 panels undertook ranking of sweetness on five samples of apple juice. Four panels were designated ‘validation’ panels, whose data were used firstly to establish the expected ranking results, and secondly to set the performance criteria that a trained sensory panel would be expected to achieve. Four key measures of a panel’s performance were investigated: the ability to rank the samples in the correct order; the significance level associated with differences between samples; the number of pairs of samples that a panel found to be different at a specified level of significance; and the degree of agreement between assessors within a panel. For each of these criteria an ‘expected result’ was considered, as well as an overall measure of performance. The data from the remaining panels were analysed, and the level of performance was recorded for each of the stated criteria. Results indicated different levels of performance, and the research also revealed the importance of the choice of validation panels and of the screening of samples prior to testing. A simpler performance scheme was proposed to address issues relating to attaching arbitrary weightings to each of the performance criteria, and to address potential problems associated with combining different measurement criteria into a single performance score. While there is still potential for further development and refinement, ProfiSens has made a significant contribution to proficiency testing for sensory analysis.
© 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Proficiency testing; Panel performance; Ranking; Expected results
1. Introduction

Proficiency testing (PT) schemes have been widely studied and are commonly used in chemical and instrumental analysis as official inter-comparison studies. PT schemes are not yet well developed in sensory evaluation, mainly because of the different character of sensory analysis compared with chemical or other instrumental analysis. However, participation in a PT scheme is the most effective way to demonstrate the ability of a sensory panel, and for accredited laboratories it is even a necessity. For sensory panels, performance in PT is of utmost importance in demonstrating the repeatability and reproducibility of results. Several studies have investigated methods to measure and screen the performance of individual sensory assessors (e.g. Næs, 1998; Rossi, 2001), but the performance of the sensory panel as a whole does not appear to have been as widely studied. Yet it is the results of the panel as a single unit that are the basis for decision making on the sensory quality of a product, and poor performance by individual assessors will unavoidably result in a lower performance of the panel. Sensory evaluation is unique in using human beings as the objective measuring instrument.
The results of a sensory panel are perceptions, and these are not necessarily easy to relate to chemical or physical values. In chemical analysis it is usually the amount (concentration) of the analysed substance in a test material that is measured, whereas in sensory analysis it is the sensory response to a chemical or physical stimulus that is measured. The human instrument tends to measure in a comparative mode, so sensory results are always relative. In chemical analysis the quantity of the analyte is usually measured and reported as a numerical value, whereas sensory results can be expressed in different ways according to the sensory method used (e.g. ranking, rating). Thus, the lack of sensory PT schemes may also be a consequence of the difficulty of setting up criteria for measuring panel performance. In this and previous studies (McEwan, 2000a, 2000b, 2001a, 2001b) such criteria were developed.

In chemical analysis the ‘true value’ of an analyte is used as a comparison measure for the results, but in sensory analysis this concept can be somewhat ambiguous. Instead of a ‘true value’, the new concept of an ‘expected value’ was introduced, established using validation panels prior to the full inter-comparison trials. Although not official PT studies, results from some inter-comparison studies using profiling as the sensory method are available (e.g. ESN, 1996; Hunter & McEwan, 1998; McEwan et al., 2002; Pagès & Husson, 2001). The ranking method was chosen for this inter-comparison study as it differs considerably from profiling and is a somewhat simpler test. Ranking can be considered a difference test, and is less demanding in terms of organisation, data capture and data processing. In addition, for ranking, familiarity of the panel with the tested product is not as important as it is for profiling.

The data used in this study come from the EU supported project ProfiSens (contract number SMT4-CT98-2227). The results of the project are published as ‘International Guidelines for Proficiency Testing in Sensory Analysis’ (Lyon, 2001). Detailed reports of the four ranking inter-comparison studies implemented during the research (two trials on apple juice, one on tomato soup and one on custard) are available (McEwan, 2000a, 2001a).
1.1. Objectives and scope

This paper concentrates on the development of a performance scheme for the ranking test, using the findings from a preliminary trial on apple juice (McEwan, 2000a) as a basis to refine and test a scheme for measuring panel performance that could be used for proficiency testing. It considers some of the experiences and knowledge gained by the ProfiSens participants to suggest improvements for future schemes, and poses some issues and challenges for further research. However, the main aim of this paper is to outline the performance measures considered after the preliminary tests, and the subsequent development of these measures using data from a second proficiency test on apple juice.

2. First steps in developing performance measures

2.1. Learning from the first proficiency test on apple juice

While this paper concentrates on the results of a second proficiency testing experiment on sweetness in apple juice, it is important to consider the experiences from the first study, which used similar test materials (Lyon, 2001; McEwan, 2000a). Five samples of apple juice were spiked with a glucose–fructose mix to represent different levels of sweetness. The samples underwent a preliminary screening by the provider of the glucose–fructose spikes, and were judged to be suitable. However, the results from the proficiency test showed that the task was ‘too easy’, and consequently did not test the panels. This highlighted the importance of using validation panels, prior to the proficiency testing round, to establish whether the test materials are suitable. The first proficiency test yielded information on how panel performance could be established, and also provided useful information on how to improve the test procedure. Unlike the case for descriptive profiling (McEwan, 2000b), it was possible to measure panel performance, and Fig. 1 illustrates the procedure.

Fig. 1. Procedure for establishing the expected result and setting performance criteria.
2.2. Scheme for establishing and measuring panel performance

The scheme (Fig. 1) comprises four key steps to measure a panel’s performance: the ability to rank the samples in the correct order; the significance level associated with differences between samples; the number of pairs of samples that a panel found to be different at a specified level of significance; and how well the assessors within a panel agreed with each other.

The first step is to tabulate the rank data and calculate the panel rank mean for each sample. The Pearson correlation coefficient then allows agreement between the panel rank means and the ‘expected panel rank means’ to be calculated. The correlation should be significant at P < 0.10, and a negative correlation would be rejected, as this would indicate that the panel had ranked the samples in the wrong order (or forgot to recode the data if ranked from high to low). In this event, the participating laboratory would be failed. The 10% significance level was chosen to minimise the possibility of rejecting a panel that may have a satisfactory level of performance as defined by the other criteria.

To determine how well each panel discriminated between the samples, a Friedman rank test is undertaken and the level of significance recorded. Based on the validation panel results, panels are judged against the ‘expected significance level’.

Kendall’s coefficient of concordance (W) can be used to measure the agreement between assessors in a panel, which is related to the overall level of discrimination. Generally, a lack of agreement between assessors would be reflected by a poor result in Steps 2a and 3; however, W provides a single measure of how well the panel works together to produce a given level of performance. Panel performance can be measured in relation to the ‘expected concordance level’.

Having established an expected significance level, the next step is to determine which pairs of samples are different at a specified level of significance (for example 1, 5 and 10%). This can be achieved using a suitable multiple comparison test, for example Conover’s method (Conover, 1999). From these results, panel performance can be measured in relation to the ‘expected number of sample differences’.

Finally, the information gathered in Steps 1–3 can be aggregated to obtain an overall performance score, enabling overall performance to be compared with the ‘expected overall performance’.
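To make Step 1 concrete, the short sketch below computes a panel's rank means and correlates them with expected rank means, applying the P < 0.10 rule and the rejection of negative correlations described above. It is only an illustrative sketch in Python, not the software used by ProfiSens: the example ranks are invented, the expected means are the values later derived for the second apple juice trial (Table 3), and a two-sided P-value from scipy is assumed.

```python
import numpy as np
from scipy.stats import pearsonr

# Ranks 1-5 assigned by each assessor to the five samples (one row per assessor,
# one column per sample, after conversion to a common rank direction).
# The numbers are illustrative only, not data from the paper.
panel_ranks = np.array([
    [5, 4, 3, 2, 1],
    [5, 3, 4, 2, 1],
    [4, 5, 3, 1, 2],
    [5, 4, 2, 3, 1],
])

panel_rank_means = panel_ranks.mean(axis=0)

# Expected rank means, as later derived from the validation stage (Table 3).
expected_rank_means = np.array([4.9, 4.0, 3.0, 1.8, 1.3])

r, p = pearsonr(panel_rank_means, expected_rank_means)  # two-sided P-value

# Step 1 gate: reject a negative correlation outright; otherwise require P < 0.10.
step1_score = 1 if (r > 0 and p < 0.10) else 0
print(f"r = {r:.3f}, P = {p:.3f}, Step 1 score = {step1_score}")
```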
2.3. Comments on the procedure

When setting performance criteria using validation panels prior to the proficiency testing round, some thought should be given to whether the test material is suitable. In general, if the expected overall performance scores are high, the validation laboratories can discriminate between the samples, can rank the samples in the right order, can detect differences between the specified samples, and the assessors within each panel agree with each other. This will normally indicate that the Proficiency Scheme Co-ordinator should go ahead with the main trial, unless the untrained panels also perform well. If the validation panels’ performance is low, particularly that of the trained panels, the selection of samples may need to be revisited and a repeat validation stage organised with new samples. If the expected overall performance scores are ‘average’, the data should be carefully considered again, to be confident that the panels in the main inter-comparison will be able to discriminate between the samples, before recommending that the main trial goes ahead. It is also possible that the task is too easy (reflected by high performance scores), so that the ring trial would not be sufficiently demanding. In such circumstances, the Co-ordinator may recommend that the sample differences are made smaller.

Having set the performance criteria, and having made the decision to carry on with the main trial, one performance level should be designated as the ‘expected result’ for each step in the performance scheme. Participating laboratories will therefore be judged on their performance in each of the critical performance measures, and not just on the basis of ‘overall performance’. It is important that the expected result is achievable for each of the performance criteria. For example, if the expected result is set too high (e.g. at a ‘very good’ level), then it is likely that few panels will be as good as ‘expected’ in the main inter-comparison. For this reason, in coming to a decision on the ‘expected result’, it is also important to consider what might reasonably be ‘expected’ of a trained sensory panel in whose ability to perform sensory ranking tests one would normally have confidence. Normally, only one validation stage should be necessary, as prior screening and pre-testing should have sorted out any problems of sample differences being too large or too small.

3. Materials and methods

The procedure for the second trial to rank sweetness of apple juice is described below.

3.1. Samples and procedure—validation stage
3.1.1. Samples
Commercially available apple juice (brand name Apple Nectar, aseptically packed in 1 l Tetra Brik packages, manufactured by Valio, Finland) was spiked with five mixtures of sugars (glucose and fructose, pharmaceutical quality) and diluted with bottled natural mineral water (brand name Evian, packed in 50 cl polyester bottles, manufactured by Evian, France); the resulting samples were ranked according to their perceived sweetness. Each mixture comprised 50 ml of apple juice, 50 ml of water and 6.5 g of a sugar blend.
Table 1
Sugar blends and 3-digit numbers used to code the products for the two replicate assessments

Sample   Glucose (%)   Fructose (%)   Code (Rep 1)   Code (Rep 2)
1        75            25             611            853
2        63            37             208            460
3        50            50             436            798
4        37            63             986            199
5        25            75             538            887
The chosen sugar mixtures (Table 1) represented noticeable and just-detectable differences between pairs of samples, and the products were coded with 3-digit numbers, the blinding codes being provided by a ProfiSens participant not taking part in the ranking test. Each participating laboratory prepared the samples approximately one hour before the evaluation, and the two replicates were prepared separately.

3.1.2. Homogeneity testing of samples
Homogeneity of the samples was tested by sensory triangle tests, performed by the panels of two laboratories participating in the ProfiSens project. The five samples were prepared according to the instructions on two occasions (A and B), thus representing potential within-sample variation. Five triangle tests were undertaken, one on each sample, where A and B were presented as the odd sample an equal number of times and all six possible serving orders were used. A panel of 18 assessors undertook the assessment at each laboratory. The results of all triangle tests were non-significant.

3.1.3. Panels
The inter-comparison study was completed by 14 sensory panels from eight European countries (UK, Ireland, Spain, Italy, Denmark, Norway, Sweden and Finland). Ten panels were trained, coded A to J in the results, and four were untrained, coded F1, I1, J1 and K1. All the trained panels were highly experienced, whilst the untrained panels were included to provide some guidance on potential lower levels of performance. No special training on the ranking method was required, but 5 of the 14 panels received training prior to the tests. Panel size ranged from 8 to 16 assessors, with a mean of 11 assessors.
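As a side note to the homogeneity check in Section 3.1.2, the snippet below shows how the significance of a single triangle test can be assessed with an exact binomial calculation (chance probability of a correct choice 1/3). This is a generic sketch, not the analysis script used by the ProfiSens laboratories, and the count of correct responses is invented for illustration.

```python
from scipy.stats import binom

def triangle_test_p_value(correct: int, n_assessors: int) -> float:
    """Exact one-sided binomial P-value for a triangle test (guessing probability 1/3)."""
    # P(X >= correct) under the null hypothesis of no perceptible difference.
    return binom.sf(correct - 1, n_assessors, 1.0 / 3.0)

# Illustrative count: 8 of 18 assessors pick the odd sample.
p = triangle_test_p_value(correct=8, n_assessors=18)
print(f"P = {p:.3f}")  # well above 0.05 here, i.e. no evidence of a batch difference
```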
3.1.4. Procedure
The sample ingredients were distributed to the participants by courier, and the samples were prepared by each laboratory according to the detailed instructions agreed by a ProfiSens working group. The assessors were asked to rank the apple juice samples according to perceived sweetness intensity. The ranking method followed the ISO standard (ISO 8587, 1988). The panels were allowed to use their normal procedure, either ‘1 = most’ and ‘5 = least’, or ‘1 = least’ and ‘5 = most’; before the data were analysed, the rank orders were converted to the former system. Seven of the 14 panels used the ascending rank order and seven the descending order. Approximately 40 ml of each sample was served to the assessors. During the assessments, all panels used an appropriate palate cleanser between samples. Recommendations about the conditions for evaluation were closely followed, including serving temperature, visual masking of the samples and other physical requirements for the assessment.

Both the validation testing and the full inter-comparison study involved replicate assessments by all participating sensory panels. These assessments were completed on consecutive days or, at most, within 1 week. Half the laboratories used manual data collection and the rest computer-assisted data collection. Each participant recorded information about the test procedure using a common checklist, and any deviations from the instructions were noted. Such information could be used as feedback on possible causes of less than satisfactory panel performance.

3.2. Experimental design

The five apple juice samples spiked with different sugar blends were served to the assessors in different orders according to the experimental design provided in the instructions. Each laboratory used the same design, in which the order of presentation followed a modified Williams Latin square to minimise potential order and carry-over effects. In both replicate sessions, each assessor ranked the five 3-digit coded samples according to perceived sweetness.
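The presentation orders in Section 3.2 followed a modified Williams Latin square supplied in the project instructions; that exact design is not reproduced here. As a hedged illustration of the general idea only, the sketch below constructs a standard Williams design for five samples (for an odd number of treatments the construction needs 2n sequences to balance first-order carry-over), to which assessors could be assigned cyclically.

```python
def williams_design(n: int) -> list[list[int]]:
    """Standard Williams Latin square construction for n treatments (labelled 0..n-1).

    Returns n sequences for even n, 2n for odd n, balanced for first-order carry-over.
    """
    # Base sequence 0, 1, n-1, 2, n-2, ...
    base, low, high = [0], 1, n - 1
    for i in range(1, n):
        base.append(low if i % 2 == 1 else high)
        if i % 2 == 1:
            low += 1
        else:
            high -= 1
    rows = [[(x + k) % n for x in base] for k in range(n)]
    if n % 2 == 1:                          # odd n: add the mirror-image sequences
        rows += [list(reversed(r)) for r in rows]
    return rows

# Ten presentation orders for five samples.
for order in williams_design(5):
    print(order)
```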
The primary data were collected in an Excel™ file by each laboratory and sent to the supplier of the blinding codes to be decoded, before being distributed to the laboratories undertaking the statistical analysis.

3.3. Data analysis

3.3.1. Validation panels
Data from the four validation panels (two trained and two untrained) were analysed to determine whether all panels ranked the samples in the same order, and to test their ability to discriminate between the samples. The Pearson correlation coefficient was calculated to measure the agreement between panel rank means, whilst a Friedman rank test (O’Mahony, 1986) was undertaken on the two replicate rankings separately and the P-value recorded. The Conover multiple comparison value (Conover, 1999) at the 5% level of significance was then calculated, and from this the number of pairs of samples that were significantly different was determined. The Conover method was used because it takes into account the number of assessors and the sum of ranks. To measure agreement between assessors in a panel, the coefficient of concordance (Kendall & Gibbons, 1990) was calculated on each replicate ranking for each panel. From this information, levels of performance were set for each of the criteria outlined in Fig. 1.

3.3.2. Proficiency test panels
For each panel’s data, the panel rank mean was calculated and correlated with the ‘expected rank mean’. Panels that had a significant (P < 0.05) positive correlation with the expected result passed the first stage, and their performance was analysed further. A Friedman rank test was undertaken on the data of each panel, followed by a Conover multiple comparison test. This information was used to establish how many of the 10 possible sample pairs were significantly different at the 5% level of significance. Finally, the coefficient of concordance was calculated to provide a measure of agreement between assessors in the panel, i.e. the panel concordance.
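To illustrate the calculations listed in Section 3.3, the sketch below takes one assessor-by-sample rank matrix and computes the Friedman test P-value, Kendall's coefficient of concordance (W) and Conover's (1999) critical difference for rank sums at the 5% level, from which the number of significantly different sample pairs follows. It assumes the no-ties formulae; scipy's Friedman statistic uses the chi-square approximation, so its P-values may differ slightly from those reported by the panels, and the example ranks are invented.

```python
import numpy as np
from scipy.stats import friedmanchisquare, t

def panel_summary(ranks: np.ndarray, alpha: float = 0.05):
    """ranks: b assessors (rows) x k samples (columns), each row a permutation of 1..k."""
    b, k = ranks.shape
    col_sums = ranks.sum(axis=0)

    # Friedman test (chi-square approximation).
    _, p_value = friedmanchisquare(*[ranks[:, j] for j in range(k)])

    # Kendall's coefficient of concordance (no-ties form).
    s = ((col_sums - b * (k + 1) / 2) ** 2).sum()
    w = 12 * s / (b ** 2 * (k ** 3 - k))

    # Conover's critical difference for rank sums (no ties): two samples differ
    # if their rank sums differ by more than this value.
    a1 = (ranks ** 2).sum()                 # sum of squared ranks
    b1 = (col_sums ** 2).sum() / b
    df = (b - 1) * (k - 1)
    lsd = t.ppf(1 - alpha / 2, df) * np.sqrt(2 * b * (a1 - b1) / df)

    n_sig_pairs = sum(
        abs(col_sums[i] - col_sums[j]) > lsd
        for i in range(k) for j in range(i + 1, k)
    )
    # lsd / b expresses the critical difference on the mean-rank scale.
    return p_value, w, lsd / b, n_sig_pairs

# Illustrative ranks for a panel of four assessors and five samples.
ranks = np.array([
    [5, 4, 3, 2, 1],
    [5, 3, 4, 2, 1],
    [4, 5, 3, 1, 2],
    [5, 4, 2, 3, 1],
])
p, w, mc5, n_pairs = panel_summary(ranks)
print(f"Friedman P = {p:.3f}, W = {w:.2f}, MC-5% = {mc5:.2f}, significant pairs = {n_pairs}")
```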
4. Results and discussion

4.1. Validation panel results

4.1.1. Panel rank means, Friedman test and Conover multiple comparison
Two trained and two untrained panels were selected to investigate the procedure for setting the ‘expected results’ and the resulting performance criteria. The criteria for each score were based on the results from the two inter-comparison trials on apple juice. Thus, a specified score for each step in the procedure represents an expected value (or result), each score having a specific criterion associated with it. The exception is the first step where, as described below, a pre-step is required prior to setting the expected result.

The first step is to calculate the panel mean ranks (Table 2) and then to set the expected panel rank means. The overall sample rank order corresponded well across all panels, and Panel A showed the most discrimination between samples. For this reason, Panel A was used to define the expected sample rank means, as shown in Table 3. Thus, for proficiency testing, the validation panels are used to adjust or fine-tune the expected panel rank means. Table 4 shows the Pearson correlation coefficient between the ‘expected rank means’ and the rank means for each validation panel. Based on these results, the following performance criteria were set, where a score of 1 represents the expected result.
Score 0   if P > 0.10, or if the correlation is negative
Score 1   if P ≤ 0.10   (‘expected result’)
If a panel shows a significant negative correlation with the expected rank means, this results in an immediate decision that the laboratory is not proficient.

The second step is to establish the level of significance associated with the sample differences. The results of the Friedman test (Table 2) indicate that Panels A and E have P ≤ 0.001 on both replicates, whilst Panel F1 has P = 0.003 and P = 0.013, and Panel I1 has P ≤ 0.001 and P = 0.002. Therefore the following performance criteria were set.
Score 0   if P > 0.05
Score 1   if P ≤ 0.05
Score 2   if P ≤ 0.01   (‘expected result’)
Score 3   if P ≤ 0.001
The third step is to identify the number of pairs of samples that are significantly different at a specified level of significance. With five samples there are potentially 10 pairs that can show significant differences, and it was decided to work at the 5% level of significance, as in Step 2a. From Table 2 it can be observed that Panel A achieved 10 significant pairs in the first replicate, whilst Panel F1 achieved only three significant pairs in the second replicate. Therefore, the following performance criteria were set.

Score 0   if 0 or 1 significant difference
Score 1   if 2 or 3 significant differences
Score 2   if 4 or 5 significant differences
Score 3   if 6 or 7 significant differences
Score 4   if 8 significant differences   (‘expected result’)
Score 5   if 9 significant differences
Score 6   if 10 significant differences
Table 2
Results of the Friedman and multiple comparison tests for the four validation panels

                          Panel A        Panel E        Panel F1       Panel I1
Sample                    Rep 1  Rep 2   Rep 1  Rep 2   Rep 1  Rep 2   Rep 1  Rep 2
611/853                   5.0    4.8     4.4    4.6     3.9    4.3     4.9    4.6
208/460                   4.0    4.1     3.8    3.2     3.4    3.2     3.6    3.4
436/798                   2.8    3.1     2.9    3.0     3.6    2.3     2.8    2.5
986/199                   1.9    1.7     2.1    2.4     2.3    2.8     2.0    3.0
538/887                   1.3    1.3     1.8    1.8     1.8    2.4     1.8    1.5
P-value                   0.000  0.000   0.000  0.000   0.003  0.013   0.000  0.002
MC-5%                     0.49   0.52    0.98   1.03    1.11   1.16    1.02   1.19
n (assessors)             9      9       12     12      12     12      8      8
Significant differences   10     9       6      6       5      3       6      6
Table 3
Mean panel data for Panel A over two replicates, used to calculate the expected mean rank

Sample    Rep 1   Rep 2   Expected result
611/853   5.0     4.8     4.9
208/460   4.0     4.1     4.0
436/798   2.8     3.1     3.0
986/199   1.9     1.7     1.8
538/887   1.3     1.3     1.3

Table 4
Correlation between the panel mean data and the expected mean rank

Panel   Replicate 1 (P-value)   Replicate 2 (P-value)
A       0.997 (0.000)           0.998 (0.000)
E       0.997 (0.000)           0.957 (0.006)
F1      0.924 (0.012)           0.812 (0.048)
I1      0.981 (0.002)           0.997 (0.026)

Table 5
Coefficient of concordance (W) for the validation panels

Panel   Replicate 1   Replicate 2
A       0.91          0.90
E       0.48          0.42
F1      0.33          0.27
I1      0.65          0.53
4.1.2. Coefficient of concordance
The coefficient of concordance calculated for the validation panels is shown in Table 5 for both replicate assessments. This forms Step 2b of the performance scheme. Based on these results, the following performance criteria were specified.

Score 0   if W < 0.60
Score 1   if W ≥ 0.60
Score 2   if W ≥ 0.70
Score 3   if W ≥ 0.80   (‘expected result’)
Score 4   if W ≥ 0.90

5. Setting the final expected result criteria

Based on the performance criteria scores given for Steps 1–3 above, a total possible score of 14 (1+3+6+4) is achievable. If the ‘expected results’ from Steps 1–3 are added together, a total score of 10 (1+2+4+3) is specified. Given that a panel can score 1 less than the expected result on any step (1, 2a, 2b or 3), the expected overall score was set as the interval 9–10.

Score 10.1–14.0   Better than expected
Score 9–10        ‘Expected result’
Score < 9.0       Less than expected
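For reference, the score mappings set out in Sections 4.1.1 and 4.1.2, together with the overall 9–10 band, can be written down compactly. The sketch below is one possible encoding of the criteria as reconstructed here; the function names and the example inputs are this sketch's own, not ProfiSens nomenclature.

```python
def step1_score(r: float, p: float) -> int:
    """Correlation of panel rank means with the expected rank means (Step 1)."""
    return 1 if (r > 0 and p <= 0.10) else 0

def step2a_score(p: float) -> int:
    """Friedman test significance level (Step 2a)."""
    if p <= 0.001: return 3
    if p <= 0.01:  return 2   # 'expected result'
    if p <= 0.05:  return 1
    return 0

def step2b_score(w: float) -> int:
    """Kendall's coefficient of concordance (Step 2b)."""
    if w >= 0.90: return 4
    if w >= 0.80: return 3    # 'expected result'
    if w >= 0.70: return 2
    if w >= 0.60: return 1
    return 0

def step3_score(n_significant_pairs: int) -> int:
    """Number of sample pairs significant at the 5% level, out of 10 (Step 3)."""
    bands = [(10, 6), (9, 5), (8, 4), (6, 3), (4, 2), (2, 1)]  # (threshold, score)
    for threshold, score in bands:
        if n_significant_pairs >= threshold:
            return score
    return 0

def overall_verdict(total: float) -> str:
    """Step 4: compare the summed score (maximum 14) with the expected 9-10 band."""
    if total > 10:
        return "better than expected"
    if total >= 9:
        return "expected result"
    return "less than expected"

# Example with invented, replicate-averaged inputs for one panel.
total = step1_score(0.97, 0.005) + step2a_score(0.0005) + step2b_score(0.55) + step3_score(5)
print(total, overall_verdict(total))
```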
5.1. Proficiency test panel results

5.1.1. Panel rank means, Friedman test and Conover multiple comparison
Table 6(a) and (b) shows the panel mean ranks, the results of the Friedman test (P-value), the multiple comparison value at 5% significance and the number of significant pairs at this significance level.

5.1.2. Step 1—correlation with expected rank means
Table 7 shows the results of the correlation between the panel sample rank means and the expected sample rank means. With the exception of Panel G, all panels received a score of 1 on both replicate assessments.

5.1.3. Step 2a—level of significance associated with sample differences
Table 6(a) and (b) shows the P-value corresponding to the significance level associated with sample differences on undertaking the Friedman test. The performance score achieved is recorded in Table 8.

5.1.4. Step 2b—agreement between assessors—coefficient of concordance
Table 7 shows the coefficient of concordance used to measure the agreement between assessors in a panel, whilst the allocated score is shown in Table 8. This measure distinguishes between panels in terms of agreement between assessors.
5.1.5. Step 3—significantly different sample pairs
Table 6(a) and (b) shows the number of significantly different pairs out of a possible total of 10, together with the allocated performance score given in Table 8. It can be seen that panels differ in their ability to discriminate between pairs of samples, in spite of having similar P-values. This illustrates the importance of looking at specific sample differences, rather than just at the overall test result.

5.2. Panel performance

Table 9 summarises the performance of each panel for each of the criteria, where Step 4 is the overall performance, based on the sum of the individual scores. Panel G failed the trial on the basis of its performance in Step 1. All other panels achieved the expected result in Step 1, whilst for Step 2a only Panel F1 was below expected. Only Panels A, C and J were above expected for Steps 2b and 3. With respect to Step 4, the overall score, most panels were below the expected result (9–10), but six panels (B, D, E, H, I and I1) were no more than 2 points below the lower bound of the expected interval. It may be considered, for example, that Panel F1 needs more training, which is unsurprising for an untrained panel. Fig. 2 shows the final overall performance scores, illustrating the expected result band. The graph shows that most panels performed at around the same level, even though overall they scored slightly below the specified overall expected score.
Table 6
Panel rank means, Friedman and multiple comparison results

(a) Replicate 1
Panel   611   208   436   986   538   P-value   MC-5%   n    Significant differences
A       5.0   4.0   2.8   1.9   1.3   0.000     0.49    9    10
B       4.5   3.6   2.4   2.7   1.8   0.000     0.91    14   5
C       4.9   4.1   2.8   2.2   1.0   0.000     0.34    10   10
D       3.8   4.6   3.0   1.3   2.3   0.000     0.99    9    6
E       4.4   3.8   2.9   2.1   1.8   0.000     0.98    12   6
F       4.2   3.4   3.3   2.3   1.8   0.003     1.10    12   4
G       3.1   2.4   3.0   2.8   2.8   0.799     1.25    16   0
H       4.6   3.8   3.2   1.9   1.5   0.008     0.79    11   7
I       4.3   3.5   2.9   2.0   2.3   0.000     1.22    10   4
J       5.0   4.0   2.9   1.8   1.3   0.000     0.39    10   10
F1      3.9   3.4   3.6   2.3   1.8   0.003     1.11    12   5
I1      4.9   3.6   2.8   2.0   1.8   0.000     1.02    8    6
J1      4.8   3.5   2.3   2.2   2.2   0.000     0.92    12   7
K1      4.3   3.5   3.6   2.1   1.6   0.000     0.96    12   7

(b) Replicate 2
Panel   853   460   798   199   887   P-value   MC-5%   n    Significant differences
A       4.8   4.1   3.1   1.7   1.3   0.000     0.52    9    9
B       4.6   3.6   3.2   2.3   1.3   0.000     0.73    14   9
C       5.0   3.9   3.0   2.1   1.0   0.000     0.29    10   10
D       4.2   3.3   3.6   2.7   1.1   0.000     1.08    9    5
E       4.6   3.2   3.0   2.4   1.8   0.000     1.03    12   6
F       4.3   3.6   2.5   2.7   1.9   0.002     1.08    12   4
G       2.8   3.1   3.1   2.8   3.3   0.844     1.14    16   0
H       4.5   3.7   3.3   2.3   1.2   0.000     0.81    11   6
I       5.0   3.5   2.9   1.9   1.7   0.000     0.81    10   7
J       5.0   3.7   3.0   2.3   1.0   0.000     0.48    10   10
F1      4.3   3.2   2.3   2.8   2.4   0.013     1.16    12   3
I1      4.6   3.4   2.5   3.0   1.5   0.002     1.19    8    6
J1      4.1   3.3   2.6   3.2   1.9   0.000     1.17    12   4
K1      3.9   4.0   3.3   2.3   1.4   0.000     0.97    12   5
Table 7
Pearson correlation between panel and expected sample rank means, and the coefficient of concordance

        Pearson’s correlation        Coefficient of concordance
Panel   Replicate 1   Replicate 2    Replicate 1   Replicate 2
A       0.997         0.998          0.91          0.90
B       0.922         0.979          0.46          0.65
C       0.984         0.990          0.95          0.96
D       0.828         0.873          0.62          0.56
E       0.998         0.957          0.48          0.42
F       0.977         0.934          0.34          0.37
G       0.106         0.424          0.04          0.02
H       0.996         0.945          0.70          0.68
I       0.967         0.979          0.34          0.72
J       0.999         0.975          0.93          0.90
F1      0.924         0.812          0.33          0.27
I1      0.981         0.877          0.65          0.53
J1      0.910         0.817          0.54          0.26
K1      0.955         0.943          0.50          0.49
6. Discussion, conclusions and implications

Based on the worked examples for the apple juice, the performance scheme proposed in Fig. 1 provided a procedure for defining criteria that enable proficiency scheme providers to discriminate between panels with different levels of performance. However, it is useful to review a number of issues arising from the apple juice trials and from the work on tomato soup and custard (McEwan, 2000a, 2001a).

6.1. Homogeneity testing of samples

The ProfiSens project used triangle tests on samples prepared on two occasions by two laboratories to determine whether the samples were homogeneous. In a real proficiency testing scheme it is recommended that more rigorous homogeneity testing is undertaken.
Table 8
Scores for each panel for Steps 2a, 2b and 3, based on the information in Table 6(a) and (b)

        Discrimination    Concordance       Number of sample differences
Panel   Rep 1   Rep 2     Rep 1   Rep 2     Rep 1   Rep 2
A       3       3         4       4         6       5
B       3       3         0       1         2       5
C       3       3         4       4         6       6
D       3       3         1       0         3       2
E       3       3         0       0         3       3
F       2       2         0       0         2       2
G       0       0         0       0         0       0
H       2       3         2       1         3       3
I       3       3         0       2         2       3
J       3       3         4       4         6       6
F1      2       1         0       0         2       1
I1      3       2         1       0         3       3
J1      3       3         0       0         3       2
K1      3       3         0       0         3       2
Table 9
Summary of performance for each of the three Steps, where Step 4 is the sum of the others (average over replicates; maximum value is 14)

Panel      Step 1   Step 2a   Step 2b   Step 3   Step 4
A          1        3         4         5.5      13.5
B          1        3         0.5       3.5      8
C          1        3         4         6        14
D          1        3         0.5       2.5      7
E          1        3         0         3        7
F          1        2         0         2        5
G          0        0         0         0        0
H          1        2.5       1.5       3        8
I          1        3         1         2.5      7.5
J          1        3         4         6        14
F1         1        1.5       0         1.5      4
I1         1        2.5       0.5       3        7.0
J1         1        3         0         2.5      6.5
K1         1        3         0         2.5      6.5
Expected   1        2         3         4        9–10
Fig. 2. Panel performance on apple juice based on an overall performance score.
For example, to test within-sample variation, 4–5 batches at each of three separate laboratories would provide additional confidence of homogeneity. This also raises the question of the efficiency and appropriateness of the triangle test method for this application. For example, paired comparison tests against a designated ‘control’ sample could be one solution; in the case of five batches, five paired comparisons (including control versus control) would be undertaken for each sample. A difference-from-control test was also considered, but it was felt that it would not be as sensitive to small batch-to-batch differences.

6.2. Screening, pre-testing and validation

The importance of screening, pre-testing and validation cannot be over-emphasised. As demonstrated in a custard trial to rank thickness, the initial screening and pre-testing resulted in samples that showed (too) large differences (McEwan, 2001a). However, in spite of reducing the increments of thickening agent for the main trial, the ranking task was still ‘too easy’. A second screening and pre-test would have had every chance of detecting that the task was still not sufficiently difficult. It is therefore recommended that samples are always screened and pre-tested, giving the best chance for the validation phase to be used successfully to set the performance criteria and expected results. In the case of apple juice, the results reported in this paper benefited from a previous round, as reported in McEwan (2000a).
6.3. Setting performance criteria

Whilst setting the criteria for the apple juice trial was relatively straightforward, experience with custard was not so clear-cut. This indicated the importance of working through several scenarios prior to finalising the performance criteria. In addition, it is important to consider the experience of the validation laboratories, as results from panels with expertise in a product category could lead to the expected results being set too high. In this exercise, certain significance levels were chosen, but it should be remembered that these should be chosen on the basis of the data and reviewed for each new product, making use of the cumulative experience from previous screening tests and ring trials.

Moreover, several iterations were required to develop the performance criteria through a scoring system. This highlighted the importance of considering performance at each step in the scheme, rather than just a final overall score. In addition, the risk of inadvertently over-weighting a step in measuring performance became very apparent in earlier versions of the Performance Scheme, and this needs to be considered. In fact, it was felt that discrimination between pairs of samples was the most important step, and Fig. 3 shows a graphical representation of panel performance based on this criterion alone.

External review of the research was of the opinion that an earlier scheme was too tough, such that only expert (product-specific) panels, and not ‘general’ trained panels, could achieve the expected result. It was recommended that the expected result should be a range, which is what is shown in Fig. 3. It was felt important that proficiency testing should not just be about good performance, but should also demonstrate to panels that, while they are satisfactory, there is room for improvement. However, it was recognised that a confidence interval around the expected result would offer a more statistically robust procedure, and this needs to be included in future performance schemes.

The relative performance of the panels is much the same as in Fig. 2, with a few switches in position for those panels lying between the solid (score = 8) and dotted (score = 5) lines in Fig. 3. However, as well as using sample discrimination as the criterion for performance, this representation now has two performance lines: the expected result as before (score = 8) and, additionally, a lower performance boundary (score = 5).

Fig. 3. Panel performance on apple juice based on pairwise sample discrimination.
In this scenario Panels C, J and A are still performing above expected, whilst Panels F and F1 are performing at a lower level than all the other panels. However, the eight panels between the two lines can be said to have achieved a satisfactory level of performance.

6.4. Further research and considerations

The ProfiSens project undertook all the work on ranking with five samples, as this was felt to be a reasonable number for a ranking test and to allow different perceptual intervals between samples to be incorporated. Nonetheless, the process of setting performance criteria for between four and six samples will be similar to that outlined in this paper. Ranking only three samples is not sufficiently challenging for a proficiency test, whilst it was felt that it would not be good practice to suggest using more than six samples, due to sensory fatigue. However, if more than six samples were used, there would be more than 15 pairs of samples that could differ significantly. In that event, some further consideration would need to be given to how many significant differences are reasonable, given a sufficiently challenging test.

Clearly, there are still issues to be considered in fine-tuning the Performance Scheme. As the sizes of the perceptual differences between samples vary, some pairs of samples will be easier to discriminate between than others, and it will be important to build this into future performance schemes. It could also be useful to develop a better weighting procedure for each of the steps, though recognising that weighting schemes are open to criticism. Moreover, the concept of confidence intervals around the expected result (value) is an attractive option, worthy of further investigation.

One final issue is the ability to compare results across ring trials, as laboratories will clearly wish to demonstrate improvement over time. However, the Performance Scheme will differ for different products and will depend on how challenging the task is in terms of perceptible differences between samples. More thought is required here because, at present, results can mainly be compared only within a ring trial.

Significant progress has been made by the ProfiSens project, and it is hoped that the challenges offered by this paper will be taken up, not just for ranking, but for the many other sensory tests used throughout different sectors of the sensory community.
Acknowledgements

The work reported in this paper was part of an EU supported project called ProfiSens (SMT4-CT98-2227), which ran from September 1998 to August 2001 and was co-ordinated by CCFRA, UK. The collaboration of all 17 partners is gratefully acknowledged. The CCFRA contribution to this project was partly supported by funding which forms part of CCFRA’s member-funded research programme. The BioSS contribution was partly funded by the Scottish Executive Environment and Rural Affairs Department. Valio, Finland is acknowledged for the provision of the apple juice samples.
References

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: John Wiley & Sons.
ESN. (1996). A European sensory and consumer study. A case study on coffee. UK: CCFRA.
Hunter, E. A., & McEwan, J. A. (1998). Evaluation of an international ring trial for sensory profiling of hard cheese. Food Quality and Preference, 9(5), 343–354.
ISO 8587. (1988). Sensory analysis. Methodology—Ranking.
Kendall, M., & Gibbons, J. D. (1990). Rank correlation methods (5th ed.). London: Edward Arnold.
Lyon, D. H. (2001). International guidelines for proficiency testing in sensory analysis. CCFRA Guideline 35. Campden & Chorleywood Food Research Association.
McEwan, J. A. (2000a). Proficiency testing for sensory ranking tests: statistical guidelines. Part 1. R&D Report No. 118. CCFRA.
McEwan, J. A. (2000b). Proficiency testing for sensory profile tests: statistical guidelines. Part 1. R&D Report No. 119. CCFRA.
McEwan, J. A. (2001a). Proficiency testing for sensory ranking tests: statistical guidelines. Part 2. R&D Report No. 126. CCFRA.
McEwan, J. A. (2001b). Proficiency testing for sensory profile tests: statistical guidelines. Part 2. R&D Report No. 127. CCFRA.
McEwan, J. A., Hunter, E. A., van Gemert, L. J., & Lea, P. (2002). Proficiency testing for sensory profile panels: measuring panel performance. Food Quality and Preference, 13, 181–190.
Næs, T. (1998). Detecting individual differences among assessors and differences among replicates in sensory profiling. Food Quality and Preference, 3, 107–110.
O’Mahony, M. (1986). Sensory evaluation of food: statistical methods and procedures. New York: Marcel Dekker.
Pagès, J., & Husson, F. (2001). Inter-laboratory comparison of sensory profiles: methodology and results. Food Quality and Preference, 12, 297–309.
Rossi, F. (2001). Assessing sensory panelist performance using repeatability and reproducibility measures. Food Quality and Preference, 12, 467–497.