Revisiting Fisher’s ‘Lady Tasting Tea’ from a perspective of sensory discrimination testing

Food Quality and Preference 43 (2015) 47–52
Jian Bi (Sensometrics Research and Service, Richmond, VA 23236, USA) and Carla Kuesten (Amway, Ada, MI, USA)

Article history: Received 1 June 2014; Received in revised form 5 February 2015; Accepted 10 February 2015; Available online 17 February 2015.

Keywords: Lady Tasting Tea; 'M + N' experiment; Fisher's exact test; Hypergeometric distribution; Mantel–Haenszel test; Odds ratio; Common odds ratio; Cohen's d; Difference test; Effect size; Performance of trained sensory panels and panelists

Abstract

The Lady Tasting Tea is a famous real story in the history of the development of statistics, related to R.A. Fisher, one of the greatest statisticians and founders of modern statistics. The main learning and insight offered by this paper from revisiting the story are that the methodology of conventional sensory difference tests can and should be expanded to cover the 'M + N' method with larger M and N. Unlike the conventional discrimination tests, which use multiple sets of 'M + N' samples with small M and N based on a binomial model, the 'M + N' tests with larger M and N can reach statistical significance in a single trial using only one set of 'M + N' samples, based on a hypergeometric distribution in Fisher's exact test. This paper explores the applications of the new methods, particularly in assessing the performance of trained sensory panels and panelists. The connection of the odds ratio or common odds ratio with Cohen's standardized mean difference d is also discussed.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

The "Lady Tasting Tea" is a famous real story in the history of the development of statistics in the 20th century, related to R.A. Fisher, one of the greatest statisticians and founders of modern statistics. The story took place in Cambridge, England, in the late 1920s. A group of university dons, their wives, and some guests were sitting around an outdoor table for afternoon tea. A lady, a colleague of Fisher's, claimed to be able to judge whether tea or milk was poured into the cup first. Fisher designed a classic experiment to test the lady's claim. According to Fisher's daughter, the story is real; an account of this seemingly trivial event, which had a most profound impact on the history of modern statistics, is given in Box (1978). In the second chapter of R.A. Fisher's text The Design of Experiments (1935), Fisher described the experiment and the test of significance. The lady was given 8 cups of tea, in 4 of which tea was poured first and in 4 of which milk was poured first, and was told to guess which 4 had milk added first. Fisher's exact test, i.e., a permutation test based on an exact hypergeometric distribution, was used to test the lady's claim.

Revisiting the story, we were surprised to find that this designed experiment and the test used by Fisher in 1935 are a typical sensory analysis and a sensory discrimination test! It is a real sensory test of the discrimination ability of a panelist! We can regard it as a forerunner of modern sensory analysis. We noted that Fisher's Lady Tasting Tea experiment is just an 'M + N' method with M = N = 4, known as the octad method in the sensory field (Harris & Kalmus, 1949); Gridgeman (1959) called it a double-tetrad. Fisher's exact test is based on the assumption that both sets of marginal counts in a 2 × 2 table are naturally fixed. This assumption is rarely encountered in practice, though Fisher's exact test is widely used in situations with small samples. We are pleased to notice that this assumption is perfectly and naturally satisfied in the 'M + N' experiments. The main learning and insight offered by this paper from revisiting the story are that the methodology of conventional sensory difference tests can and should be expanded to cover the 'M + N' method with larger M and N based on a new model.


The main objective of this paper is to explore the applications of the new methods as sensory discrimination tests, especially in measuring and tracking the performance of trained sensory panels and panelists.

2. The 'M + N' method

The 'M + N' method is a generalization, and can be considered a framework, of many forced-choice methods used widely in the sensory field, including the m-AFC, the triangle, the specified and unspecified tetrad, and the specified and unspecified two-out-of-five tests, etc. In an 'M + N' method, there are two groups of samples: one group contains M samples of A and the other contains N samples of B. The panelist is asked to sort the samples into their appropriate classes (A and B). There are two versions of the 'M + N' method, i.e., specified and unspecified. In the specified version, the characteristics of the samples in each class are identified (e.g., stronger or weaker) and the sorting of the samples into the two specified classes is carried out accordingly. In the unspecified version, the stimuli are sorted without being identified. Harris and Kalmus (1949) used an unspecified octad test (M = N = 4) for measuring sensitivity to phenylthiourea (P.T.C.); it is commonly referred to as the Harris–Kalmus test. Lockhart (1951) was perhaps the first to discuss the 'M + N' method and gave the probabilities of a correct response by chance in some versions of the method. Peryam (1958) described the 'multiple standards' test, which is essentially equivalent to an unspecified 4-AFC (Bi, Lee, & O'Mahony, 2010). Amerine, Pangborn, and Roessler (1965), when mentioning 'M + N' tests, called them 'multi-sample' tests. Basker (1980) discussed polygonal testing, which is similar to the 'M + N' test, but he did not distinguish between the specified and unspecified versions. Frijters (1988) and Smith (1989) briefly discussed the 'M + N' test and the probabilities of a correct response by chance in it; it is noted that the chance probabilities they give for the 'M + N' test with M = N differ, because they also did not distinguish the specified and unspecified versions. Recently, Bi, Lee, and O'Mahony (2014) revisited the 'M + N' method and used a computer-intensive approach for Thurstonian models of various versions (more than 40) of the 'M + N' method.

The 'M + N' method with small numbers of M and N (<4) is widely used as a forced-choice method in sensory difference testing. Multiple sets of 'M + N' samples are needed in a difference test, and the binomial model is used for analysis of the proportion of correct responses observed. However, the 'M + N' method with larger numbers of M and N is rarely used as a discrimination method, considering the memory disadvantages and physiological carry-over limitations, though Bi, Lee, and O'Mahony (2014) suggest that the 'M + N' test with M = N > 3 as a forced-choice method may have potential applications in visual, and sometimes manual, assessment of products, e.g., cosmetics, considering the low chance probability and the high similarity between the specified and unspecified tests for the 'M + N' with larger M and N. From Fisher's Lady Tasting Tea experiment, we realized that the 'M + N' method with larger M and N is a very useful and appropriate experiment for testing the discriminating ability of trained sensory panels and panelists, and deserves to be treated as a kind of discrimination method with a new model.
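The difference between the chance probabilities of the specified and unspecified versions mentioned above can be made concrete with a one-line computation. The following is our own illustration, not part of the paper's supplementary code:

# Chance probability of a perfect sort in an 'M + N' test with M = N:
# the specified version has one correct assignment out of choose(2M, M),
# while the unspecified version accepts either labelling of the two groups.
M <- 4
1 / choose(2 * M, M)   # specified octad:   1/70, about 0.0143
2 / choose(2 * M, M)   # unspecified octad: 2/70, about 0.0286

These are the one-sided and two-sided p-values for a perfect response that reappear in Section 3.2.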
In the commonly used discrimination tests using the 'M + N' method with small M and N, e.g., the triangle method (M = 2 and N = 1), we cannot reach a statistical conclusion from only one set of M + N samples, because even for a perfect response on one set of samples the chance probability (1/3) is still larger than any acceptable significance level. Hence, replications are needed. Interestingly, however, we can reach a statistical conclusion based on the responses in a 2 × 2 table for only one set of 'M + N' samples with larger M and N using Fisher's exact test. It is not necessary for a panelist to identify all the M and N samples perfectly in the 'M + N' experiment in order to measure and test his or her discriminating ability. This is the advantage of the 'M + N' method with larger M and N, and it is a different perspective from that taken in conventional sensory difference tests.

In the 'M + N' method, M and N may differ. However, as Fisher (1935) remarks, balanced designs are justified, while imbalanced designs have no apparent advantage. It is also worth noting that M = N is desirable because there is no middle stimulus to be mis-categorized, as indicated by Ennis (2013). In the situations considered in this paper, we discuss and recommend the 'M + N' method with larger M and N only for the versions with M = N > 3, i.e., the balanced designs with larger M and N. The results of the 'M + N' experiment for a panelist can be presented in a 2 × 2 table as in Table 1, where x denotes the number of correct identifications of product A in a set of 'M + N' samples.

3. Using the 'M + N' experiment with larger M and N and Fisher's exact test to assess the discrimination ability of individual panelists

3.1. Fisher's exact test

The test is 'exact' in the sense that the actual distribution of the statistic, rather than a large-sample approximation, is used for the test. Fisher's exact test is based on a hypergeometric distribution and can be used for the analysis of data in a 2 × 2 table such as Table 1. The one-sided exact test using the specified 'M + N' method has the null hypothesis H0: p_A = p_N, i.e., the probabilities of a "Product A" response for product A and for Not A (product B) are the same, and the alternative hypothesis Ha: p_A > p_N. The hypotheses are equivalent to H0: λ = 1 versus Ha: λ > 1, where λ in (1) is an odds ratio:

  λ = [p_A / (1 − p_A)] / [p_N / (1 − p_N)].    (1)

The two-sided exact test should be used for the unspecified 'M + N' method. Under the null hypothesis, and with the marginal totals fixed at M, the number x of correct identifications of product A follows a hypergeometric distribution with the expectation and variance in (2) and (3) and the probability function in (4):

  E(x) = M/2,    (2)

  V(x) = M² / [4(2M − 1)],    (3)

  f(x | M) = C(M, x) · C(M, M − x) / C(2M, M),    (4)

where C(n, k) denotes the binomial coefficient "n choose k".

For the exact test, the probabilities in (4) are then summed over the observed or more extreme values of x. Table 2 gives the probabilities of each possible selection x in the specified 'M + N' method (one-sided test) with M = N = 4–15.

Table 1
Format of data summary for the 'M + N' experiment.

                 | "Product A" response | "Product Not A" response | Total
Product A        | x                    | M − x                    | M
Product Not A    | M − x                | x                        | M
Total            | M                    | M                        | 2M


Table 2
Probabilities for correct selections in Fisher's exact tests using the specified 'M + N' methods with M = N. Each row lists the probabilities for x = 0, 1, ..., M correct selections.

M = N = 4:  0.0143 0.2286 0.5143 0.2286 0.0143
M = N = 5:  0.0040 0.0992 0.3968 0.3968 0.0992 0.0040
M = N = 6:  0.0011 0.0390 0.2435 0.4329 0.2435 0.0390 0.0011
M = N = 7:  0.0003 0.0143 0.1285 0.3569 0.3569 0.1285 0.0143 0.0003
M = N = 8:  0.0001 0.0050 0.0609 0.2437 0.3807 0.2437 0.0609 0.0050 0.0001
M = N = 9:  0.0000 0.0017 0.0267 0.1451 0.3265 0.3265 0.1451 0.0267 0.0017 0.0000
M = N = 10: 0.0000 0.0005 0.0110 0.0779 0.2387 0.3437 0.2387 0.0779 0.0110 0.0005 0.0000
M = N = 11: 0.0000 0.0002 0.0043 0.0386 0.1544 0.3026 0.3026 0.1544 0.0386 0.0043 0.0002 0.0000
M = N = 12: 0.0000 0.0001 0.0016 0.0179 0.0906 0.2320 0.3157 0.2320 0.0906 0.0179 0.0016 0.0001 0.0000
M = N = 13: 0.0000 0.0000 0.0006 0.0079 0.0492 0.1593 0.2831 0.2831 0.1593 0.0492 0.0079 0.0006 0.0000 0.0000
M = N = 14: 0.0000 0.0000 0.0002 0.0033 0.0250 0.0999 0.2248 0.2936 0.2248 0.0999 0.0250 0.0033 0.0002 0.0000 0.0000
M = N = 15: 0.0000 0.0000 0.0001 0.0013 0.0120 0.0581 0.1615 0.2670 0.2670 0.1615 0.0581 0.0120 0.0013 0.0001 0.0000 0.0000

Table 3
Cumulative probabilities (p-values) for correct selections in Fisher's exact tests using the specified 'M + N' methods with M = N. Each row lists the p-values for x = 0, 1, ..., M correct selections; values at or below 0.05 are significant at alpha = 0.05.

M = N = 4:  1.0000 0.9857 0.7571 0.2429 0.0143
M = N = 5:  1.0000 0.9960 0.8968 0.5000 0.1032 0.0040
M = N = 6:  1.0000 0.9989 0.9600 0.7165 0.2835 0.0400 0.0011
M = N = 7:  1.0000 0.9997 0.9854 0.8569 0.5000 0.1431 0.0146 0.0003
M = N = 8:  1.0000 0.9999 0.9949 0.9340 0.6904 0.3096 0.0660 0.0051 0.0001
M = N = 9:  1.0000 1.0000 0.9983 0.9717 0.8265 0.5000 0.1735 0.0283 0.0017 0.0000
M = N = 10: 1.0000 1.0000 0.9995 0.9885 0.9106 0.6719 0.3281 0.0894 0.0115 0.0005 0.0000
M = N = 11: 1.0000 1.0000 0.9998 0.9955 0.9569 0.8026 0.5000 0.1974 0.0431 0.0045 0.0002 0.0000
M = N = 12: 1.0000 1.0000 0.9999 0.9983 0.9804 0.8898 0.6579 0.3421 0.1102 0.0196 0.0017 0.0001 0.0000
M = N = 13: 1.0000 1.0000 1.0000 0.9994 0.9915 0.9424 0.7831 0.5000 0.2169 0.0576 0.0085 0.0006 0.0000 0.0000
M = N = 14: 1.0000 1.0000 1.0000 0.9998 0.9965 0.9715 0.8716 0.6468 0.3532 0.1284 0.0285 0.0035 0.0002 0.0000 0.0000
M = N = 15: 1.0000 1.0000 1.0000 0.9999 0.9986 0.9866 0.9284 0.7670 0.5000 0.2330 0.0716 0.0134 0.0014 0.0001 0.0000 0.0000

Table 3 gives the cumulative probabilities, i.e., p-values, of the exact tests using the 'M + N' method. It is remarkable that a significant result does not require all correct responses. The R built-in function 'dhyper' can be used to calculate the probabilities of the possible x values in a hypergeometric distribution, and the R built-in function 'phyper' can be used to calculate the cumulative probabilities. For example, for the test using the 'M + N' method with M = N = 4, the probabilities and cumulative probabilities for x = 0, 1, 2, 3, 4 are obtained as below:

>dhyper(seq(0,4),4,4,4)
[1] 0.01428571 0.22857143 0.51428571 0.22857143 0.01428571
>phyper(seq(0,4),4,4,4)
[1] 0.01428571 0.24285714 0.75714286 0.98571429 1.00000000
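Note that the one-sided p-values in Table 3 are upper-tail probabilities P(X ≥ x), which can be obtained from 'phyper' directly by setting lower.tail = FALSE. A minimal sketch (our own illustration) for M = N = 4:

# One-sided p-values P(X >= x) of Table 3 for the specified 'M + N' test with M = N = 4.
M <- 4
x <- 0:M
round(phyper(x - 1, M, M, M, lower.tail = FALSE), 4)
# [1] 1.0000 0.9857 0.7571 0.2429 0.0143

Looping the same call over M = 4–15 reproduces the whole of Table 3.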

3.2. Data analysis for The Lady Tasting Tea

For the experiment using the 'M + N' method with M = N = 4, which is the one used in The Lady Tasting Tea, the data are listed in Table 4.

Table 4
Outcome of Fisher's tea-tasting experiment.

Poured first | Guess poured first: milk | Guess poured first: tea | Total
Milk         | 3                        | 1                       | 4
Tea          | 1                        | 3                       | 4
Total        | 4                        | 4                       | 8

The null hypothesis is that the lady has no discriminating ability, i.e., H0: λ = 1. Only if the lady perfectly distinguished the 8 cups, i.e., selected exactly the 4 cups with milk poured first, can the null hypothesis be rejected and her claim be accepted at the 0.05 alpha level, with the corresponding p-value = 0.0143 (i.e., 1/70, see Table 3) for a one-sided test and p-value = 0.0286 for a two-sided test. Fisher (1935) did not give the results of the actual experiment, but according to Salsburg (2001, p. 8), H.F. Smith, a colleague of Fisher who was there, claimed that the lady correctly identified all 8 cups; in that case the lady's faculty of discrimination should be accepted. However, when the original tea-tasting experiment is described in the literature, a hypothetical outcome like that in Table 4 is typically discussed, e.g., in Agresti (1992). Based on the data in Table 4, the corresponding p-value for Fisher's exact test is 0.2429 for a one-sided test (see Table 3).
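This p-value is simply the sum of the hypergeometric probabilities in (4) for x = 3 and x = 4, which can be checked with one line of base R (our own illustration):

# One-sided p-value for the hypothetical outcome in Table 4: P(X >= 3) with M = N = 4.
sum(dhyper(3:4, 4, 4, 4))   # 0.2428571, i.e., 17/70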


Hence, there is not enough evidence to support the lady's claim. For a two-sided test, the exact probabilities are doubled accordingly. Using the R built-in program 'fisher.test', the p-value = 0.2429 for a one-sided test can be obtained:

>fisher.test(cbind(c(3,1),c(1,3)),alternative = "greater")

        Fisher's Exact Test for Count Data

data:  cbind(c(3, 1), c(1, 3))
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio
  6.408309

3.3. Power and sample size of Fisher's exact test

Fisher did not give the power of the exact test. However, there are many references on the power function of Fisher's exact test (see, e.g., Bennett & Hsu, 1960; Casagrande, Pike, & Smith, 1978; Duchateau, McDermott, & Rowlands, 1998). It is noted that there are two types of power for Fisher's exact test: one is the conditional power, and the other is the "overall" or average power. The conditional power is conditional on the fixed marginal total for the "A" response. Because in the 'M + N' experiment with M = N the fixed marginal number is always M, we calculate only the conditional power, with the power function in

  Power(λ | M) = Σ_{i = k0}^{M} g(i | M),    (5)

where k0 is the critical value such that Σ_{i = k0}^{M} f(i | M) ≤ α and Σ_{i = k0 − 1}^{M} f(i | M) > α, and

  g(x | M) = C(M, x) · C(M, M − x) · λ^x / Σ_{i = 0}^{M} C(M, i) · C(M, M − i) · λ^i,

where λ is the odds ratio defined in (1). For calculating the power of Fisher's exact test, a specified odds ratio λ, a marginal number M, and a Type I error α are needed. The practical meaning of the odds ratio λ is discussed further in Section 5 of this paper. The R code 'fisherpow' can be used to calculate the conditional power of Fisher's exact test for given M = N, λ, and α. For example, for M = N = 8, λ = 6, and α = 0.1, the estimated conditional power is about 0.64.

>fisherpow(8,6,0.1)
M = N = 8, lambda = 6, alpha = 0.1 power = 0.6411
[1] 0.6411385
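The 'fisherpow' (and 'fishersam') codes are provided in the paper's supplementary material. For reference, the conditional power in (5) can also be computed directly with base R; the following is a minimal sketch of our own, with an illustrative function name and arguments:

# Conditional power of Fisher's exact test for the 'M + N' design with M = N,
# following Eq. (5): find the critical value k0 under the central hypergeometric
# distribution, then sum the noncentral hypergeometric probabilities g(x | M).
fisher_power <- function(M, lambda, alpha) {
  x  <- 0:M
  f0 <- dhyper(x, M, M, M)                      # f(x | M), Eq. (4)
  k0 <- min(x[rev(cumsum(rev(f0))) <= alpha])   # smallest k0 with upper tail <= alpha
  g  <- choose(M, x) * choose(M, M - x) * lambda^x
  g  <- g / sum(g)                              # g(x | M), noncentral hypergeometric
  sum(g[x >= k0])                               # power = P(x >= k0 | lambda)
}
fisher_power(8, 6, 0.1)                         # about 0.64, as in the example above
sapply(4:20, fisher_power, lambda = 6, alpha = 0.1)   # cf. the 'fishersam' listing below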

We might also be interested in the sample size, i.e., the number M (= N) in our situation, that produces a specified power for Fisher's exact test. It is noted that the power of Fisher's exact test is not a strictly monotone increasing function of the sample size M = N, because the distribution is discrete. We can therefore calculate a series of powers over a range of M = N values for a specified odds ratio and alpha level, and then select an appropriate M = N. The R code 'fishersam' can be used for the calculations. For example, for specified λ = 6 and α = 0.1, the estimated conditional powers of Fisher's exact tests for M = N = 4–20 are listed below. If we want a power of at least 0.7, M = N = 10 appears to be an appropriate number; the estimated power in that situation is about 0.76.

>fishersam(6,0.1)
 M  N     Power
 4  4 0.2109032
 5  5 0.1186686
 6  6 0.4581118
 7  7 0.3263210
 8  8 0.6411385
 9  9 0.5173425
10 10 0.7649726
11 11 0.6650576
12 12 0.5602988
13 13 0.7714992
14 14 0.6856531
15 15 0.8456168
16 16 0.7793706
17 17 0.8962863
18 18 0.8470493
19 19 0.9305638
20 20 0.8948695

4. Assessing discrimination ability of trained sensory panels

4.1. The Mantel–Haenszel test

It is desirable to assess discrimination ability not only for individual panelists but also for panels. For a panel with k panelists, we can conduct the 'M + N' experiment for each panelist and then use the Mantel–Haenszel test (also called the Cochran–Mantel–Haenszel test) (Agresti, 2013) to conduct an overall test of whether the panel has a significant discrimination ability. The null hypothesis is H0: λ1 = λ2 = ... = λk = 1, with the alternative hypothesis that the true common odds ratio is larger than one. The common odds ratio is a weighted mean of the individual odds ratios. The Mantel and Haenszel (1959) statistic is

  MH = Σ_{i = 1}^{k} (x_i − E0(x_i)) / √[ Σ_{i = 1}^{k} V0(x_i) ],    (6)

where x_i is the observed count in the first cell of the 2 × 2 table for the ith panelist, i = 1, 2, ..., k, and E0(x_i) and V0(x_i) are as in (2) and (3) in the situation where all Mi = Ni = M0. The MH statistic follows an approximate standard normal distribution. It should be mentioned that the 'M + N' experiments for the individual panelists could be different. The R built-in program 'mantelhaen.test' can be used for the test; it can also give a conditional maximum likelihood estimate of the common odds ratio and its confidence interval.

4.2. Numerical example

Suppose there is a trained sensory panel composed of 8 panelists, and the 'M + N' experiment with M = N = 5 was conducted for each of the 8 panelists. The observed counts for the 8 panelists are (3, 2; 2, 3) (i.e., 3 and 2 in the first column, 2 and 3 in the second), (4, 1; 1, 4), (2, 3; 3, 2), (3, 2; 2, 3), (5, 0; 0, 5), (4, 1; 1, 4), (4, 1; 1, 4), and (5, 0; 0, 5). The data should be arranged as a 2 × 2 × 8 array in R, which can be produced as shown below:

Paneldat <- array(c(3,2,2,3, 4,1,1,4, 2,3,3,2, 3,2,2,3, 5,0,0,5, 4,1,1,4, 4,1,1,4, 5,0,0,5),
                  dim = c(2, 2, 8),
                  dimnames = list(Product = c("A", "B"),
                                  Response = c("'A'", "'B'"),
                                  Panelist = c("1", "2", "3", "4", "5", "6", "7", "8")))
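Before calling the built-in test, the MH statistic in (6) can be computed directly from these data as a check. The following is a minimal sketch of our own (not the authors' code); its one-sided normal-approximation p-value is close to, though not identical with, the exact conditional p-value reported below:

# MH statistic of Eq. (6) for the example panel data, with M = N = 5 and
# E0(x) = M/2, V0(x) = M^2/(4(2M - 1)) from Eqs. (2) and (3).
M  <- 5
x  <- Paneldat[1, 1, ]                  # correct "A" responses per panelist
E0 <- M / 2
V0 <- M^2 / (4 * (2 * M - 1))
MH <- sum(x - E0) / sqrt(length(x) * V0)
MH                                      # about 4.24
pnorm(MH, lower.tail = FALSE)           # one-sided p-value, about 1e-05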


Using the R program 'mantelhaen.test', we can get the test results, including an estimated common odds ratio. The statistic "S", i.e., the observed total of the counts x in the first row and first column of the 2 × 2 tables, is 30, and the p-value is smaller than 0.001. This suggests that the panel as a whole has a significant discriminating ability. The estimated common odds ratio is 6.872, which is significantly larger than 1.

>mantelhaen.test(Paneldat,exact = T,alternative = "g")

        Exact conditional test of independence in 2 x 2 x k tables

data:  Paneldat
S = 30, p-value = 1.946e-05
alternative hypothesis: true common odds ratio is greater than 1
95 percent confidence interval:
 2.916054      Inf
sample estimates:
common odds ratio
         6.871695

5. Using the odds ratio to measure discriminating ability of panels and panelists

5.1. Odds ratio (or common odds ratio) λ as an index of effect-size

It is noted that using the R functions 'fisher.test' and 'mantelhaen.test', an estimate of the odds ratio or the common odds ratio, along with a confidence interval, is obtained. These measures are conditional maximum likelihood (ML) estimators calculated using the non-central hypergeometric distribution. The odds ratio in a Fisher's exact test using an 'M + N' method with larger M and N is always related to the probabilities of identification for the two samples under comparison. The relationship between the two probabilities p_A, p_N and the odds ratio λ is given in (1), where p_A is the probability of the response "A" for sample A, while p_N is the probability of the response "A" for Not A, i.e., sample B. For example, in the tea-tasting experiment, p_A is the probability of the response "milk first" for the cups in which milk was poured first, while p_N is the probability of the response "milk first" for the cups in which tea was poured first. The relationship between p_A and p_N expressed by the odds ratio λ is unchanged in Fisher's exact tests when M and N change in the 'M + N' method with larger M and N. Because p_A and p_N reflect the inherent difference between the samples under comparison, the odds ratio can be used as a valid index of effect-size, and it is better than the difference p_A − p_N or the ratio p_N/p_A (Fleiss, 1994; Haddock, Rindskopf, & Shadish, 1998).

The odds ratio λ can be any number between 0 and infinity; λ = 1 means that the panelist has no discriminating ability, in other words, that the responses are totally random. The odds ratio is an important statistic and an index for measuring panelists' discriminating abilities. Because the odds ratio is not symmetrically distributed, a log transformation, log λ, is often used. For a given estimate of λ and its confidence interval in the output of the built-in functions 'fisher.test' and 'mantelhaen.test', which are based on conditional maximum likelihood estimation (Hollander, Wolfe, & Chicken, 2014, p. 520), the variance of log λ̂ can be estimated from

  V(log λ̂) = [(log λ_L − log λ̂) / z_{1−α}]²,    (7)

where λ_L denotes the lower limit of the confidence interval of λ given by 'fisher.test' or 'mantelhaen.test', and z_{1−α} denotes the (1 − α) percentile of the standard normal distribution. For a two-sided test, z_{1−α} should be replaced by z_{1−α/2}. The R code 'logtrans' can be used for the log transform of the odds ratio, with the estimated λ̂, λ_L, and Type I error α as input; the output includes log λ̂ and V(log λ̂). For example, for the data in Section 4.2, with estimated λ = 6.8717, λ_L = 2.9161, and α = 0.05 for a one-sided test, the log-transformation results are log λ̂ = 1.9274 and V(log λ̂) = 0.2716.

>logtrans(6.8717,2.9161,0.05,1)
[1] 1.9274 0.2716

5.2. Conversion of odds ratio λ into Cohen's D (or d)

Cohen's D (or d) (Cohen, 1969) in (8) is a standardized mean difference, which has been widely used in the behavioral sciences as an index of effect size for continuous data. Cohen's d is an estimator of the effect-size D:

  D = (μ_e − μ_c) / σ_c.    (8)

We can convert from a log odds ratio to the standardized mean difference, i.e., Cohen's d, using

  d = log OR × √3 / π.    (9)

This approach connecting Cohen's d and the log odds ratio was originally proposed by Hasselblad and Hedges (1995); see also Chinn (2000), Sanchez-Meca, Marin-Martinez, and Chacon-Moscoso (2003), and Borenstein, Hedges, Higgins, and Rothstein (2009) for its use in practice. The basic idea behind (9) is that the logistic and normal distributions differ little except in the tails (see, e.g., Finney, 1978, pp. 362–368; Chinn, 2000). The standard logistic distribution has variance π²/3, so a difference in log odds can be converted to an approximate difference in normal equivalent deviates by dividing by π/√3, which is about 1.81. Hence Cohen's d is just the log odds ratio multiplied by the constant √3/π, which is about 0.55. Based on (9), it is easy to show that the variance of Cohen's d is

  Var(d) = Var(log OR) × 3 / π².    (10)

That is, the variance of Cohen's d is the variance of the log odds ratio multiplied by the constant 3/π² (i.e., about 0.304). For example, for a log odds ratio of 1.9274 with variance 0.2716, the estimated Cohen's d is about 1.06 with variance 0.0826.

> 1.9274 * sqrt(3)/pi
[1] 1.062631
> 0.2716 * 3/pi^2
[1] 0.0825565
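The whole chain from the confidence limit of the odds ratio to Cohen's d can be wrapped in a few lines of base R. The following is a minimal sketch of our own, with an illustrative function name (it is not the supplementary 'logtrans' code):

# Variance of log(odds ratio) from a one-sided lower confidence limit, Eq. (7),
# and conversion to Cohen's d and its variance, Eqs. (9) and (10).
or_to_cohen_d <- function(lambda_hat, lambda_L, alpha) {
  log_or  <- log(lambda_hat)
  v_logor <- ((log(lambda_L) - log_or) / qnorm(1 - alpha))^2   # Eq. (7)
  d       <- log_or * sqrt(3) / pi                             # Eq. (9)
  v_d     <- v_logor * 3 / pi^2                                # Eq. (10)
  c(log_or = log_or, var_log_or = v_logor, d = d, var_d = v_d)
}
round(or_to_cohen_d(6.8717, 2.9161, 0.05), 4)
#      log_or var_log_or          d      var_d
#      1.9274     0.2716     1.0626     0.0825

The small difference from the 0.0826 reported above arises only from rounding the variance of the log odds ratio before the conversion.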

Table 5
Comparisons of discrimination tests using 'M + N' with small M and N, and with larger M and N.

Aspect               | 'M + N' with small M and N                                                | 'M + N' with larger M and N
Example              | Tetrad with M = N = 2                                                     | Octad with M = N = 4
Response             | Yes/no                                                                    | In a 2 × 2 table
Null hypothesis      | Proportion of correct responses (Pc) = Pc0                                | Odds ratio (OR) = 1
Statistical model    | Binomial test                                                             | Fisher's exact test, Mantel–Haenszel test
Index of effect-size | d′; in psychometric functions Pc = f(d′), which differ among the methods  | Odds ratio, OR = f(p_A, p_N), which is the same for changed M and N


6. Discussion and conclusions

This paper revisits Fisher's Lady Tasting Tea experiment and test from the perspective of sensory discrimination testing. From our viewpoint, the experiment and test are closely relevant to the sensory field and can be regarded as Fisher's heritage left to sensory analysis methodology and sensometrics. The main conclusion of the paper is that the 'M + N' method with larger M and N deserves to be considered a new kind of sensory discrimination test. Table 5 compares the discrimination tests using the 'M + N' method with small M and N, as conventional forced-choice methods, and with larger M and N, as new discrimination methods. Further research is needed to empirically compare the results of discrimination tests using the conventional forced-choice methods and the new methods. It is expected that discrimination tests using the new methods can provide more information about the difference between products, or about the discriminability of panels and panelists, than tests using the conventional forced-choice methods. One piece of evidence is that a statistical conclusion can be reached from the responses for only one set of samples in the new methods, which is impossible with the conventional forced-choice methods. In addition, the conversion of the odds ratio or common odds ratio into Cohen's standardized mean difference d is also discussed. We recommend use of the 'M + N' method with M = N > 3 as the basic designs; use of Fisher's exact test to detect the discriminating ability of individual panelists; use of the Mantel–Haenszel test to detect the discriminating ability of panels; and use of the odds ratio, the common odds ratio, and other effect-size indices to measure the discriminating ability of panelists and panels. The same procedures can be used for panel and consumer discrimination tests. All the analyses can be completed in available R software using built-in programs and the codes provided with the paper. The R codes used in the paper are available from the Supporting information in the online version of this paper.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.foodqual.2015.02.009.

References

Agresti, A. (1992). A survey of exact inference for contingency tables. Statistical Science, 7, 131–177.
Agresti, A. (2013). Categorical data analysis (3rd ed.). New Jersey: Wiley.

Amerine, M. A., Pangborn, R. M., & Roessler, E. B. (1965). Principles of sensory evaluation of food. New York, NY: Academic Press.
Basker, D. (1980). Polygonal and polyhedral taste testing. Journal of Food Quality, 3, 1–10.
Bennett, B. M., & Hsu, P. (1960). On the power function of the exact test for the 2 × 2 contingency table. Biometrika, 47, 393–398.
Bi, J., Lee, H.-S., & O'Mahony, M. (2010). d′ and variance of d′ for four-alternative forced choice (4-AFC). Journal of Sensory Studies, 25, 740–750.
Bi, J., Lee, H.-S., & O'Mahony, M. (2014). Estimation of Thurstonian models for various forced-choice sensory discrimination methods in a form of the 'M + N' test. Journal of Sensory Studies, 29, 325–338.
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. New York: Wiley.
Box, J. F. (1978). R.A. Fisher: The life of a scientist. New York: John Wiley & Sons Inc.
Casagrande, J. T., Pike, M. C., & Smith, P. G. (1978). The power function of the "exact" test for comparing two binomial distributions. Applied Statistics, 27, 176–180.
Chinn, S. (2000). A simple method for converting an odds ratio to effect size for use in meta-analysis. Statistics in Medicine, 19, 3127–3131.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Duchateau, L., McDermott, B., & Rowlands, G. J. (1998). Power evaluation of small drug and vaccine experiments with binary outcomes. Statistics in Medicine, 17, 111–120.
Ennis, J. M. (2013). A Thurstonian analysis of the two-out-of-five test. Journal of Sensory Studies, 28(4), 297–310.
Finney, D. J. (1978). Statistical method in biological assay (3rd ed.). London and High Wycombe: Charles Griffin.
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.
Fleiss, J. L. (1994). Measure of effect size for categorical data. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 245–260). New York: Russell Sage Foundation.
Frijters, J. E. R. (1988). Sensory difference testing and the measurement of sensory discriminability. In J. R. Piggott (Ed.), Sensory analysis of foods (2nd ed.). London: Elsevier.
Gridgeman, N. T. (1959). The lady tasting tea and allied topics. Journal of the American Statistical Association, 54, 776–783.
Haddock, C. K., Rindskopf, D., & Shadish, W. R. (1998). Using odds ratios as effect sizes for meta-analysis of dichotomous data: A primer on methods and issues. Psychological Methods, 3, 339–353.
Harris, H., & Kalmus, H. (1949). The measurement of taste sensitivity to phenylthiourea (P.T.C.). Annals of Eugenics, 15, 24–31.
Hasselblad, V., & Hedges, L. V. (1995). Meta-analysis of screening and diagnostic tests. Psychological Bulletin, 117, 167–178.
Hollander, M., Wolfe, D. A., & Chicken, E. (2014). Nonparametric statistical methods (3rd ed.). New Jersey: Wiley.
Lockhart, E. (1951). Binomial systems and organoleptic analysis. Food Technology, 5, 428–431.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Peryam, D. R. (1958). Sensory difference tests. Food Technology, 12, 231–236.
Salsburg, D. (2001). The Lady Tasting Tea: How statistics revolutionized science in the twentieth century. W.H. Freeman/Owl Book.
Sanchez-Meca, J., Marin-Martinez, F., & Chacon-Moscoso, S. (2003). Effect-size indices for dichotomized outcomes in meta-analysis. Psychological Methods, 8(4), 448–467.
Smith, G. L. (1989). An introduction to statistics for sensory analysis experiments. Aberdeen, Scotland: Torry Research Station, Ministry of Agriculture, Fisheries and Food.