The 9-point hedonic scale: Are words and numbers compatible?



Food Quality and Preference 21 (2010) 1008–1015

Contents lists available at ScienceDirect

Food Quality and Preference journal homepage: www.elsevier.com/locate/foodqual

Laura Nicolas a, Coline Marquilly a, Michael O’Mahony b,*

a Department Food Science, AgroParisTech, Paris, France
b Department Food Science & Technology, University of California, Davis, California, USA

Article info

Article history: Received 28 August 2009; received in revised form 12 March 2010; accepted 19 May 2010; available online 25 May 2010

Keywords: Food acceptance; Chocolate; 9-point hedonic scale; Ranking; Cognitive strategies; Relative strategy; Absolute strategy; Words; Numbers; Stimulus equalizing bias

Abstract

The original 9-point scale, developed by the U.S. Army for menu planning for their canteens, consisted of a series of nine verbal categories representing degrees of liking from ‘dislike extremely’ to ‘like extremely’. For subsequent quantitative and statistical analysis, the verbal categories are generally converted to numerical values: ‘like extremely’ as ‘9’, ‘dislike extremely’ as ‘1’. Yet, sometimes what is termed a 9-point hedonic scale is an unstructured numerical scale, labeled at the ends with ‘dislike extremely’ and ‘like extremely’. The former scale requires consumers to categorize foods according to how much they are liked or not; the latter requires the consumers to differentiate numerically between the foods in terms of the relative degree of liking for each. Foods that were placed in the same verbal category on the former scale might be given different numerical scores on the latter. To illustrate this, consumers rated five chocolates, in a series of experiments, on these two types of 9-point scale (verbal categories only vs numbers only) and the proportion responding differently to the two scales ranged from 100% to 79%. This indicated that numerical data obtained from the two types of 9-point scale were not interchangeable. It also suggested that consumers were using different cognitive strategies for verbal categories and numbers. To check that the difference was not caused by the fact that the verbal categories were bipolar and the numbers unipolar, the experiment was repeated using a bipolar number scale (−4 through 0 to +4). The same results were obtained. For comparison, a 9-point hedonic scale including both verbal categories and numbers together was also used. The results for this scale showed a greater similarity to the version of the 9-point scale consisting only of verbal categories than to the unstructured numerical version.

Stimulus equalizing bias was used as a tool to make a preliminary investigation into the cognitive strategies involved for the two versions of the scale. The hypothesized relative strategy was confirmed for the unstructured numerical scale but the hypothesized absolute strategy was not confirmed for the scale using only verbal categories; the strategy appeared to have relative elements. Regardless of the precise nature of the cognitive strategies used for the two versions of the scale, they do not give the same results and data obtained from each version should be compared with caution.

© 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +1 530 756 5493. E-mail address: [email protected] (M. O’Mahony).
0950-3293/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.foodqual.2010.05.017

1. Introduction

Food product development and the launching of new products in the market require some measure of whether the products are liked or not. Various measures of liking have been reviewed (Rosas-Nexticapa, Angulo, & O’Mahony, 2005). However, one of the best known measures of liking is the 9-point hedonic scale (Peryam & Girardot, 1952; Peryam & Pilgrim, 1957), introduced as an aid to menu planning for US soldiers in their canteens. The scale comprises a series of nine verbal categories ranging from ‘dislike extremely’ to ‘like extremely’ and is described as such in sensory texts (e.g. Amerine, Pangborn, & Roessler, 1965; Kemp, Hollowood, & Hort, 2009; Lawless & Heymann, 1998). For subsequent quantitative and statistical analysis, the verbal categories are generally converted to numerical values: ‘like extremely’ as ‘9’, ‘dislike extremely’ as ‘1’ (Peryam, Polemis, Kamen, Eindhoven, & Pilgrim, 1960; Stone & Sidel, 1993). Scale values derived by Jones, Peryam, and Thurstone (1955), as well as later experimental results (Stroh, 1998), indicated that this scale is best considered as ordinal rather than interval. Peryam and Girardot (1952) and Peryam et al. (1960) conceded this possibility but asserted that parametric analysis was still a viable approach on practical grounds. The more recent Labeled Affective Magnitude scale (Cardello & Schutz, 2004; Schutz & Cardello, 2001) and the Labeled Hedonic Scale (Lim, Wood, & Green, 2009) use the same phrases with unequal spacing on their hedonic line scales.

A brief look at the literature will soon show that there are variations in the way that 9-point hedonic scales are presented. Some authors are precise in the way they describe their scales. Yeu, Lee, and Lee (2008) describe precisely their numerical 9-point scale, labeled only at its end points with ‘like extremely’ and ‘dislike extremely’, while Hein, Jaeger, Carr, and Delahunty (2008) make it clear that they were using Peryam and Girardot’s scale with only verbal categories. Other authors are not always as precise and the reader is required to infer the exact nature of the scale they were using. It can be inferred, rightly or wrongly, that Taylor, Fasina, and Bell (2008) used the same numerical scale, labeled at the ends, as Yeu et al. (2008), while Sabbe, Verbeke, Deliza, Matta, and van Damme (2009) apparently used the same scale with an extra label, ‘neither like nor dislike’, at the midpoint. Although it was not described in detail in her paper, Chung’s (2009) 9-point hedonic scale consisted of nine empty category boxes (no numbers, no labels) labeled at appropriate ends with ‘dislike extremely’ and ‘like extremely’ (Chung, 2010). With apparent contradictions in the description, it is not completely clear what variation Nepote, Olmedo, Mestrallet, and Grosso (2009) were using, while it might seem that Villanueva and Da Silva (2009) were using a scale with both words and numbers.

With even these few examples of the variation in what are called 9-point hedonic scales, it is worth demonstrating that the data derived from such scales are not necessarily interchangeable. Only two versions of the scale were considered in the present study. The first version used only the original verbal categories (Peryam & Girardot, 1952; Peryam & Pilgrim, 1957) and will be referred to as the ‘words only’ scale. The second, unstructured version using only numbers will be referred to as the ‘numbers only’ scale. Other versions such as Chung’s (2009) were not considered. Because what is described as a 9-point hedonic scale can be either of these versions, it is worth considering the numerical data that would be derived from each.
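As an illustration of how numerical data are derived from the ‘words only’ version, the conventional conversion of verbal categories to the values 1 to 9 can be sketched as follows. This is a minimal sketch and not the authors' analysis code. The seven intermediate labels are the standard wording of the Peryam and Pilgrim scale, which the article quotes only in part, and the function name and example responses are hypothetical.

```python
# Standard wording of the nine verbal categories of the 9-point hedonic
# scale, ordered from least to most liked. The article quotes only some
# of these labels; the rest follow the conventional wording (assumption).
HEDONIC_LABELS = [
    "dislike extremely",
    "dislike very much",
    "dislike moderately",
    "dislike slightly",
    "neither like nor dislike",
    "like slightly",
    "like moderately",
    "like very much",
    "like extremely",
]

# Conventional coding: 'dislike extremely' -> 1 ... 'like extremely' -> 9.
LABEL_TO_SCORE = {label: i for i, label in enumerate(HEDONIC_LABELS, start=1)}


def words_to_numbers(responses):
    """Convert verbal category responses to their 1-9 numerical codes."""
    return [LABEL_TO_SCORE[r.strip().lower()] for r in responses]


# Hypothetical responses from one consumer for five products.
example = ["like slightly", "like moderately", "like moderately",
           "like very much", "like extremely"]
print(words_to_numbers(example))  # [6, 7, 7, 8, 9]
```

Note that the two products placed in ‘like moderately’ receive the same code; a ‘numbers only’ scale would allow the consumer to separate them.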
It can be hypothesized that consumers would respond to a task of assigning foods to verbal categories (‘words only’ scale) in a different way from assigning numerical scores to represent the comparative degree of liking for a set of foods (‘numbers only’ scale). There have been indications in the literature that this hypothesis might be correct. Villegas-Ruiz, Angulo, and O’Mahony (2008) noted that, of consumers assessing three yoghurts on a ‘words only’ 9-point hedonic scale, 28% gave identical responses to two of the yoghurts, and all of these reported that the tied scores did not represent equal liking. These pairs of yoghurts would have been given different scores on a ‘numbers only’ scale. A further judge tied three yoghurts on the scale but again did not like them equally. Yao et al. (2003), in a cross-cultural study, provided more support by noting that the range of scores given by consumers for the ‘words only’ and ‘numbers only’ versions of the 9-point hedonic scale was not the same. The ‘numbers only’ version tended to elicit a wider range of scores.

At this point it is worth discussing possible cognitive strategies involved in scaling. For intensity scaling, one hypothesized model is the absolute cognitive strategy (Zwislocki, 1983; Zwislocki & Goodman, 1980). Essentially, this hypothesizes that a stimulus to be assessed for its intensity is compared to a set of internal intensity exemplars, each associated with an intensity value. This is precisely how a calibrated instrument works. An alternative model is the relative cognitive strategy (Mellers, 1983a, 1983b). Here, the numerical rating for a given stimulus intensity is assigned relative to the ratings given to the other stimuli in the experiment. Another way of describing this is that the process is akin to ranking and using the scale numbers to describe the spacing between the ranks. The evidence in the literature favors the relative model unless a judge has been calibrated in some way.
Context effects provide evidence for the relative model (Parducci, 1963, 1965, 1968; Schifferstein & Frijters, 1992; Stillmann, 1993; Vickers & Roberts, 1993). Indeed, Mellers used such effects to argue against Zwislocki. Lawless and Malone (1986) also used context effects to argue that intensity scaling was relative. They further supported this position by the fact that judges would space stimuli with relatively small physical differences, as well as stimuli with relatively large physical differences, across the whole length of the scale. Poulton (1979) called this phenomenon stimulus equalizing bias. Still further evidence comes from a protocol called rank-rating, which was designed to reduce errors of forgetting in an intensity rating task, should judges be using a relative strategy (Jeon, O’Mahony, & Kim, 2004; Kim & O’Mahony, 1998; Koo, Kim, & O’Mahony, 2002; Lee, Kim, & O’Mahony, 2001; Metcalf & Vickers, 2002; Park, Jeon, O’Mahony, & Kim, 2004). In this protocol, a judge is able to re-taste stimuli as often as desired and review and alter his scores as often as is required. Using this protocol, a judge is able to check whenever he forgets the intensity of stimuli or the scores given to stimuli tasted earlier in the experimental session. He can thus avoid the problem of giving a higher score to one of two stimuli when that stimulus was actually perceived as less intense than the other (or vice versa). This is a luxury unavailable with a serial monadic protocol, where a judge merely continues tasting stimuli without re-tasting as a check. This latter protocol would be more suited to a judge who had sufficient calibration to use an absolute cognitive strategy. Thus it would seem that a relative strategy is superior for intensity scaling. Recently, Colyar, Eggett, Steele, Dunn, and Ogden (2009) showed that rank-rating was advantageous not only for intensity measurements but also for hedonic comparisons.

The cognitive strategies for hedonic scaling have not been determined experimentally. Accordingly, the choice of an experimental protocol that would be most compatible with a cognitive strategy used for hedonic scaling can be problematic and is surely a matter for future research. In the absence of such knowledge, one can only make hypotheses which may or may not be true.
Looking to the results from intensity scaling, and encouraged by the results of Colyar et al. (2009), it was hypothesized that the numbers on a hedonic scale would elicit a relative cognitive strategy. However, using the labels on the scale would be more akin to matching the stimuli to personal hedonic categories or exemplars. It might be hypothesized that this is more akin to the absolute cognitive strategy. If this were so, it would lend some depth to the hypothesis that a ‘words only’ and a ‘numbers only’ scale might give a different distribution of scores. The first experiment investigated whether consumers responded in the same way to a ‘words only’ and to a ‘numbers only’ scale. The second experiment was a modification of the first, using a bipolar ‘numbers only’ scale, to see whether it might correspond more to the bipolarity of the ‘words only’ scale. The third experiment investigated the cognitive strategies used for the two versions, using a method based on Poulton’s (1979) stimulus equalizing bias. For clarity, the specific goals for each experiment will be stated as each experiment is described.

2. Experiment 1

Because it may be misunderstood, it is important to emphasize the exact goals of this experiment. The first goal is to investigate, for a given rank order of preference, whether a judge will express those preferences in the same way or differently, using a ‘words only’ or a ‘numbers only’ scale. Will the pattern of responses be the same or different on the two scales? Thus, it is important that the order of preference remains the same when using both scales. It is also important that the consumers are given adequate instructions so that they know how to use the scales appropriately.

Having achieved this first and major goal, it would be of secondary interest to investigate how the consumers would respond to a scale consisting of both words and numbers, even though the use of such scales is not common in the literature. It would be interesting to compare various types of data derived from such a scale to see whether consumers responded in a way more similar to the ‘words only’ or the ‘numbers only’ scale or whether their pattern of response might be completely different. This would give information regarding whether words or numbers were more salient.

2.1. Consumers

Eighty-two consumers of chocolate (43 F, 39 M, age range 17–60 yrs) were tested. They were students, staff and friends sampled from the campus of the University of California, Davis.

2.2. Stimuli

Consumers were required to taste five types of chocolate: ‘Milk Chocolate Kisses’, ‘Milk Chocolate Kisses with Almonds’, ‘Special Dark Chocolate Kisses’, ‘Milk Chocolate Nuggets with Toffee and Almonds’ (The Hershey Co., Pennsylvania, USA); ‘Dark Chocolate Caramel Treasures’ (Nestlé Co., Glendale, California, USA). All of these chocolates were ‘bite sized’ and were presented in their individual wrappers.

Because a rank-rating (Kim & O’Mahony, 1998) procedure was used, the scales were presented on cardboard strips. Each strip was 75 cm long and 9 cm wide and was marked so that it was divided into nine equal categories. For the ‘numbers only’ scale the categories were numbered from 1 to 9 with numbers 2.5 cm tall. For the ‘words only’ scale, the categories were labeled with capitalized lower case writing (capitals 1 cm high), written over two lines. For the ‘words and numbers’ scale, the numbers were written above the words (same sizes as above) in each category. All writing was in black on a white background. Preliminary experiments with placing words above the numbers and vice versa, as well as varying the sizing of each on the various scales, indicated no differences in consumers’ responses.

2.3. Procedure

For the hedonic scaling, a rank-rating protocol (Jeon, O’Mahony, & Kim, 2004; Kim & O’Mahony, 1998; Koo, Kim, & O’Mahony, 2002; Lee, Kim, & O’Mahony, 2001; Metcalf & Vickers, 2002; Park, Jeon, O’Mahony, & Kim, 2004) was chosen because, with intensity scaling, it had been demonstrated to minimize any effects of forgetting the tastes of the stimuli during the experiment. Using this protocol, a judge is able to re-taste stimuli as often as desired and review and alter his scores as often as is required. Consequently, a judge commits far fewer errors (giving a more intense stimulus a lower score and vice versa) than with a serial monadic protocol, where a judge merely continues tasting stimuli without re-tasting as a check. This latter protocol would be more suited to a judge who had sufficient calibration to use an absolute cognitive strategy. It has also been shown that the advantage of the rank-rating protocol applies not only to comparisons of intensity but also to hedonic comparisons (Colyar et al., 2009).

For the experiment, consumers were placed in a taste booth with the experimenter behind, to provide help should it be needed and give instructions when necessary. The experimenter took care to be inconspicuous and not to ‘crowd’ the consumers. After establishing rapport, and noting the consumers’ demographics, the experiment began. The consumers were first presented with five transparent plastic cups, each filled with one of the five sets of chocolates. They tasted each type of chocolate and ranked them for preference. They responded by placing the five cups in a row, with the chocolates they liked the most on the right and least on the left. This corresponded with the direction of the 9-point scale. For this task and all later tasks, they were able to sample each type of chocolate ad lib. When they had completed this task, the experimenter checked to make sure that the cups were ranked in the appropriate direction. The consumers were then told that it was expected that this rank order would stay constant for the next few minutes required for the experiment but that it was perfectly fine to change their minds. Should they do so, they were told that it was very important to inform the experimenter. No consumers changed their minds. Had they done so, their data would have been eliminated from the experimental results. This was to ensure that the consumers were performing according to the goal of the experiment, namely, to investigate responses to different scales for a given rank order of liking.

After the ranking, the consumers were required to rate the five chocolates on the ‘numbers only’ 9-point hedonic scale. They were told that the reason for this was to indicate the degree of difference between the rankings they had just performed. This was illustrated visually by varying the distance between the first and second ranked stimuli. Consumers responded by placing the cups in appropriate positions along the scale. When their responses had been recorded, the cups were placed back in front of the consumer in their rank order, as a reminder, and the scale was removed. Next, the procedure was repeated using the ‘words only’ 9-point scale. The ‘words only’ scale was placed in front of the consumers and it was explained that this scale was used to indicate whether they liked or disliked the chocolates, and whether they liked or disliked the chocolates a lot or only slightly. After this, the chocolates were returned to their ranks, as a further reminder. Finally, the ‘words and numbers’ scale was used in the same way. For this, the consumers were told falsely that this was the normal scale used for making these measurements and no further explanation was given, so as not to attract the consumer to the words or the numbers.
On completion, the consumers were asked whether they paid more attention to the words or the numbers, or a mixture of both. After using each rating scale, the experimenter checked with the consumer to make sure that the scale had been used correctly.

Because they were the main focus of the experiment, the presentation of the ‘words only’ and ‘numbers only’ scales was counterbalanced over consumers so as to minimize any order effects. However, so as not to introduce any extraneous effects into this counterbalancing, and thus interfere with the main goal of the research, the ‘words and numbers’ scale was always presented last. Also, it had been decided to measure the reaction to the ‘words and numbers’ scale when both ‘numbers only’ and ‘words only’ scales were relatively fresh in the consumers’ memories. The goal here was not to make a detailed comparison of the ‘words and numbers’ scale with the other scales; it was mainly to get an idea of whether words or numbers were more salient and to what extent. The experiment lasted approx. 5–10 min.

2.4. Results and discussion

The numerical scores for the ‘numbers only’ scale and those derived from the ‘words and numbers’ scale were noted for each consumer. For the ‘words only’ scale, the numbers 1–9 were assigned to the verbal categories in the usual manner (‘dislike extremely’, ‘1’; ‘like extremely’, ‘9’). None of the 82 consumers gave the same set of ratings for the ‘words only’ and ‘numbers only’ scales. This is not surprising because the tasks demanded by each scale were different. It also suggested that the cognitive strategies required for using each scale were different. Yet, no consumer used the scales in a manner incompatible with their first rank order. With the ‘words only’ scale, there were ties but none reversed the order of the first ranking.

With the ‘words and numbers’ scale, 29 consumers gave data identical to the ‘words only’ scale while only seven gave data identical to the ‘numbers only’ scale. This effect was independent of whether the ‘words only’ scale or the ‘numbers only’ scale was presented first. However, the majority of consumers (46) gave data identical to neither of the two scales; they were using either a different or some form of mixed cognitive strategy. Of these, 35 consumers reported that they paid more attention to the words, even when their responses did not correspond with their ‘words only’ responses. Five reported that they paid more attention to the numbers and six to both. From this, it might be hypothesized that words were more salient than numbers.

Another comparison between the scales can be made by considering the mean ratings for each type of chocolate on the various scales (see Table 1). From the table, it can be seen that the mean scores for each chocolate derived from the ‘numbers only’ scale were consistently and significantly lower than for the other two scales (p < 0.001). This is not surprising because if the chocolates were generally liked, the scales with verbal labels only allowed scores ranging from 6 to 9. For the ‘numbers only’ scale, such chocolates were allowed scores below six without any implication of disliking the chocolates. The only exception was the lack of significance for the trend with Milk Chocolate Nuggets with Toffee and Almonds (not even at p < 0.05). On the other hand, the means for the ‘words only’ and the ‘words and numbers’ scales were not significantly different (p > 0.02), demonstrating the salience of the verbal labels. These results were independent of whether the ‘words only’ scale or the ‘numbers only’ scale was presented first.

Table 1. Mean hedonic ratings for the unipolar ‘numbers only’, ‘words only’ and ‘words and numbers’ scales for each of the five chocolate samples in Experiment 1.

Chocolates                                        Numbers only   Words only   Words and numbers
Milk Chocolate Kisses                             5.0            6.1          6.1 a
Milk Chocolate Kisses with Almonds                5.8            6.4          6.3
Special Dark Chocolate Kisses                     4.5            5.5          5.4
Milk Chocolate Nuggets with Toffee and Almonds    5.3            6.0          5.9
Dark Chocolate Caramel Treasures                  6.7            7.0          6.9

a Means that are not joined by horizontal underlinings are significantly different (p < 0.001).

Because it is a common computation, it is interesting to compare the pattern of the significant differences between the chocolates for the three scales. This can be seen in Fig. 1. The overall pattern of significant differences for the ‘words only’ and ‘words and numbers’ scales is the same. Yet, it would be unwise to generalize this result and assume that one scale is merely a substitute for the other. Again, it can be seen that the mean scores for the ‘numbers only’ scale were lower and also that there were more significant differences; the latter is not surprising given the wider possible range of scores. Yet, interestingly, the rank order of scores was also different for the ‘numbers only’ scale. At first, this might seem counter-intuitive. Yet, on consideration it is quite possible. First it should be noted that the differences between the mean scores were small. In such a situation, although many of the chocolates in the ‘words only’ scale were ‘tied’ in the same categories, the remainder would be sufficient to produce the rank order shown in Fig. 1, where Milk Chocolate Kisses had a higher mean score (yet not significantly so) than Milk Chocolate Nuggets with Toffee and Almonds. Yet, for those consumers who placed the two chocolates in the same category when using the ‘words only’ scale, there might have been an overall preference for the Milk Chocolate Nuggets with Toffee and Almonds. Accordingly, when this could be expressed on the ‘numbers only’ scale, the overall order of mean scores between the two chocolates could be reversed.

The point of the comparisons in Fig. 1 is merely to illustrate that even though the general trends are the same, the differences between the ‘numbers only’ and the ‘words only’ scales can produce different patterns of response. It could be risky to ignore such differences and dismiss them as trivial. The authors are also aware that with much larger samples, the slight differences between the means could all become significant, as well as the rank orders changing.

3. Experiment 2

It is possible to argue that the differences between the ‘numbers only’ and the ‘words only’ scale could be due to polarity.
The ‘words only’ scale was bipolar but the ‘numbers only’ scale was unipolar. As a control, the experiment was repeated using a bipolar set of numbers (‘−4’ through ‘0’ to ‘+4’) for the ‘numbers only’ and the ‘words and numbers’ scales. If there had been an effect of polarity, it might be hypothesized that the responses to the ‘words only’ and ‘numbers only’ scales might now be more in accord.

3.1. Consumers

A further sample of 82 consumers of chocolate (41 F, 41 M, age range 17–50 yrs) was tested. It comprised students, staff and friends sampled from the campus of the University of California, Davis.

3.2. Stimuli

The stimuli used were the same as in Experiment 1. The only exception was that the ‘numbers only’ scale and the ‘words and numbers’ scale were renumbered with bipolar rather than unipolar numbers. Accordingly, they were numbered from ‘−4’ through ‘0’ to ‘+4’.

3.3. Procedure

The procedure was the same as in Experiment 1.

3.4. Results and discussion

Fig. 1. Mean hedonic ratings for five chocolates assessed on ‘numbers only’, ‘words only’ and ‘words and numbers’ hedonic scales with unipolar numbers in Experiment 1; N = 82 (p < 0.05).

The analysis proceeded as in Experiment 1, except that for statistical purposes, the numerical data had to be transformed to a unipolar numbering system (‘1–9’).
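The transformation from the bipolar numbers to the ‘1–9’ system can be sketched as below. The article does not state the formula used; a simple shift of five units is assumed here because it is the linear mapping that sends −4, 0 and +4 to 1, 5 and 9. The function name and the example ratings are hypothetical.

```python
def bipolar_to_unipolar(score):
    """Map a bipolar rating (-4 ... +4) onto the 1-9 coding.

    Assumes a simple shift of +5, which preserves the spacing between
    ratings; the article does not spell out the transformation.
    """
    if not -4 <= score <= 4:
        raise ValueError(f"bipolar score out of range: {score}")
    return score + 5


# Hypothetical bipolar ratings from one consumer for five chocolates.
ratings = [-4, -1, 0, 2, 4]
print([bipolar_to_unipolar(r) for r in ratings])  # [1, 4, 5, 7, 9]
```

Because the shift is constant, differences between ratings, and therefore any test on those differences, are unchanged by the recoding.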

Of the 82 consumers, 74 gave different ratings for the ‘numbers only’ and ‘words only’ scales. The use of a bipolar numerical scale had only a small effect (eight consumers) of inducing consumers to use the ‘words only’ and ‘numbers only’ scales in the same way. Interestingly, of these eight consumers, five used all three scales in the same way. As in Experiment 1, no individual consumer used the scales in a way incompatible with their first ranking. Also, as with Experiment 1, the bulk of the evidence would suggest that consumers tend to use different cognitive strategies for each scale.

With the ‘words and numbers’ scale, 36 consumers gave data identical to the ‘words only’ scale while only nine gave data identical to the ‘numbers only’ scale. That is not including the five further consumers who gave identical results for all three scales. The rest of the consumers (32) gave data for the ‘words and numbers’ scale identical to neither of the two scales. Of these, 18 consumers reported that they paid more attention to the words, three to the numbers and 11 to both. Again, words would appear more salient than numbers.

In Experiment 1, the effects noted did not appear to depend on whether the ‘words only’ or the ‘numbers only’ scale was presented first. However, in this experiment, there was an effect. When the ‘words only’ scale was presented immediately prior to the ‘words and numbers’ scale, a substantial number of consumers (24/37) used the two scales in the same way, while no consumers used the ‘words and numbers’ scale in the same way as the ‘numbers only’ scale. In the reverse condition, the effect was not the same. Only nine consumers used the ‘words only’ and the ‘words and numbers’ scales in the same way while eight (rather than zero) used the ‘words and numbers’ scale in the same way as the ‘numbers only’ scale.
This would suggest that there was some ‘carry over’ memory effect induced by the ‘words only’ scale that was far weaker for the ‘numbers only’ scale. Again, it points to the salience of the words.

As in the first experiment, mean ratings for each type of chocolate were considered for each of the three scales. These are given in Table 2. From the table, it can be seen that the picture is similar to that for Experiment 1 (see Table 1) except that the Milk Chocolate Nuggets with Toffee and Almonds are no longer the exception. The significance levels were the same as in Experiment 1.

As in Experiment 1, the pattern of significant differences between the chocolates was examined to see whether they coincided for the three scales; the results are presented in Fig. 2. The rank orders and patterns of significance were different from Experiment 1, which is not surprising because the consumers were different. Unlike Experiment 1, the rank orders for the ‘words only’ and the ‘words and numbers’ scales were different. The rank order for the ‘words and numbers’ scale was the same as that for the ‘numbers only’ scale, although there were more significant differences for the latter and the scores were again consistently lower. Overall, the same general conclusions can be drawn from this experiment as from Experiment 1.

Table 2. Mean hedonic ratings for the bipolar ‘numbers only’, ‘words only’ and ‘words and numbers’ scales for each of the five chocolate samples in Experiment 2.

Chocolates                                        Numbers only   Words only   Words and numbers
Milk Chocolate Kisses                             5.0            5.6          5.5 a
Milk Chocolate Kisses with Almonds                5.7            6.3          6.2
Special Dark Chocolate Kisses                     4.7            5.3          5.2
Milk Chocolate Nuggets with Toffee and Almonds    7.2            7.3          7.3
Dark Chocolate Caramel Treasures                  5.5            5.9          5.8

a Means that are not joined by horizontal underlinings are significantly different (p < 0.001).

Fig. 2. Mean hedonic ratings for five chocolates assessed on ‘numbers only’, ‘words only’ and ‘words and numbers’ hedonic scales with bipolar numbers in Experiment 2; N = 82 (p < 0.05).
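As a consistency check on the consumer counts reported for Experiments 1 and 2, the tallies can be added up and the proportion of consumers who rated differently on the ‘words only’ and ‘numbers only’ scales computed. The sketch below uses only counts stated in the text; the dictionary keys and variable names are illustrative labels, not the authors' terminology.

```python
N = 82  # consumers tested in each of Experiments 1 and 2

# How each consumer's 'words and numbers' data compared with the other scales.
exp1 = {"same as words only": 29, "same as numbers only": 7,
        "same as neither": 46}
exp2 = {"same as words only": 36, "same as numbers only": 9,
        "same on all three scales": 5, "same as neither": 32}

# Self-reported attention among the 'same as neither' consumers.
attention_exp1 = {"words": 35, "numbers": 5, "both": 6}
attention_exp2 = {"words": 18, "numbers": 3, "both": 11}

assert sum(exp1.values()) == N
assert sum(exp2.values()) == N
assert sum(attention_exp1.values()) == exp1["same as neither"]
assert sum(attention_exp2.values()) == exp2["same as neither"]

# Proportion rating differently on 'words only' versus 'numbers only'.
differ_exp1 = 82 / N  # Experiment 1: none of the 82 gave the same ratings
differ_exp2 = 74 / N  # Experiment 2: 74 of the 82 gave different ratings
print(f"{differ_exp1:.0%}, {differ_exp2:.0%}")  # 100%, 90%
```

The reported counts are internally consistent, and in both experiments at least nine consumers in ten used the two scales differently.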

4. Experiment 3

The fact that consumers treat ‘numbers only’ differently from ‘words only’ is not surprising because the two scales deal with different questions. The goal of this experiment was to take the investigation a little further by making a preliminary investigation into the cognitive strategies involved with both tasks. For this, Poulton’s stimulus equalizing bias, which applies to the relative cognitive strategy, was used (Poulton, 1979). The experiment was performed as before with the ‘numbers only’ scale and the ‘words only’ scale using all five chocolates. Then, for each consumer, the chocolates that had the highest and lowest scores were removed and the experiment was repeated with the remaining three chocolates.

Poulton (1979) stated that with stimulus equalizing bias: “the observer uses his full range of responses, whatever the size of the range of stimuli. He simply magnifies his response scale to fit a small stimulus range.” In the context of this experiment, it would mean that, if the ‘numbers only’ scale elicited a relative cognitive strategy, the range of scores for the three chocolates would increase when they were tested on their own and be greater than when they were tested in the context of five chocolates. The three chocolates would spread to cover more of the scale.

Yet, with the ‘words only’ scale, the consumers can be hypothesized as categorizing the chocolates according to their liking/disliking categories: in other words, matching to exemplars. With such an absolute strategy, stimulus equalizing bias would not be expected to have an effect. Once a stimulus had been categorized as, say, “liked moderately”, there would be no logical reason to change this categorization (say, to “like very much”) should other more liked stimuli be removed from the scale. Thus, if the ‘words only’ hedonic scale was eliciting the absolute cognitive strategy, it is hypothesized that the stimulus equalizing bias effect would be absent. The authors were happy to discover a similar argument used by Lawless and Malone (1986).

4.1. Consumers One hundred consumers of chocolate (53 F, 47 M, age range 17–53 yrs) were tested. They were students, staff and friends sampled from the campus of the University of California, Davis.

4.2. Stimuli The stimuli and scales were the same as those used in Experiments 1 and 2.

4.3. Procedure The first part of the experiment was the same as for Experiment 1. Consumers ranked the chocolates and rated them on the ‘numbers only’ and ‘words only’ scales, in the same way as before. Half the consumers used the ‘words only’ scale first while half used the ‘numbers only’ scale first. As before, after finishing each scale, it was removed and the cups were placed back in front of the consumer in their rank order. The most liked and least liked chocolates (ranks 1st and 5th) were then removed from the rankings and the experimenter explained that she would like to repeat the experiment with three chocolates rather than five. The consumer was told to ‘start afresh’ and forget about the chocolates that had been removed. They were assured that the experiment was not a test of memory but merely to see how consumers felt when the circumstances had changed. The experiment then proceeded using the ‘numbers only’ and ‘words only’ scales in the same order as before.

Half the consumers started the experiment using five chocolates, as described above. The other half began the experiment with only three chocolates and then continued with five chocolates. This latter condition was achieved by first requiring the consumers to rank the five chocolates in order and then removing the most and least liked chocolates. The experiment was then begun using only three chocolates. After the first two rating scales had been used, the consumer was told that the experimenter wished to repeat the experiment with the two previously removed chocolates added back in. Suitable instructions were given as before. Thus, both the order of use of the scales and the numbers of chocolates used were counterbalanced over judges. The experiment lasted approx. 7–16 min.

4.4. Results and discussion The numerical scores for the ‘numbers only’ and ‘words only’ scales were recorded for each consumer.
For the ‘words only’ scale, the numbers 1–9 were assigned to the verbal categories in the traditional manner. It is worth noting that when five chocolates were used 81/100 consumers gave different responses for the ‘words only’ and ‘numbers only’ scales; with three chocolates the proportion was comparable (79/100). This confirms the results of Experiment 1, yet not as strongly. Consider first the ‘numbers only’ scale, because stimulus equalizing bias had been noted for numerical scales (intensity scales). In this experiment, the ranges for the ‘numbers only’ scale were calculated by subtracting the lowest score from the highest score. The mean range for the three chocolates tasted on their own was 4.2, while in the context of five chocolates the range for these three was only 3.4. This difference was significant (t-test, p < 0.001). This confirms the effect of stimulus equalizing bias for hedonic measurement. However, it is worth noting that despite the significant difference between the means, only 53 consumers followed this trend while 40 consumers showed no change (seven showed the reverse trend). Of the consumers who showed no change, all gave exactly the same scores, suggesting a possible memory effect; some consumers might have remembered their prior responses. The measures of range discussed above, provide the most direct way of measuring the stimulus equalizing bias. A tempting yet erroneous secondary measure would be to compare the scores for the most liked and the least liked of the three chocolates in both

conditions. Consider first the most liked chocolate. When the three chocolates were tested on their own, the mean score was 7.8 but in the context of five chocolates, it was only 7.3; this difference was significant (t-test, p < 0.001). This could be seen to support the stimulus equalizing bias, yet the measure is not free of possible scaling drift. The change in the mean score could simply be the result of consumers changing to use a higher or lower end of the scale than before. This can be seen when comparing the scores of the least liked of the three chocolates. They were: context of three, mean = 3.6; context of five, mean = 3.8 (t-test, p < 0.05). This would appear to disconfirm the effect. Because of the possibility of scaling drift interfering with the results, it is recommended that this method of analysis is not used. Finally, it is interesting to see whether the range for the three stimuli tested alone was as great as the total range for the five stimuli. The range for the five stimuli was 6.5 which was significantly greater (t-test, p < 0.001) than the range for the three stimuli tested alone (4.2). Thus, the stimulus equalizing bias, as quoted by Poulton (1979) was confirmed as far as expanding the range was concerned. But it was not confirmed as far as ‘‘using his full range of responses”. However, even expansion that is not as great as the range for the five stimuli, is enough to support the hypothesis that responses for the ‘numbers only’ scale are generated using a relative cognitive strategy. Secondly, the ‘words only’ scale can be considered. The mean range for the three chocolates tested on their own was 2.7, while in the context of five chocolates, the range for these three was 2.5. This difference was significant (t-test, p < 0.05). This confirms the effect of stimulus equalizing bias for the ‘words only’ scale. 
However, the effect was far less than for the ‘numbers only’ scale; only 28 consumers followed this trend while 14 showed the reverse trend. The majority (58) showed no change, which might be considered evidence of use of an absolute strategy. Yet, of the 58 consumers who showed no change, 55 gave exactly the same scores, again suggesting a memory effect. This casts doubt on an interpretation of the results as support for an absolute strategy. The range for the three stimuli tested alone was compared with the total range for the five stimuli, which was significantly greater (2.7 vs. 4.8: t-test, p < 0.001). Again, the stimulus equalizing bias had only been confirmed as far as expanding the range was concerned. An initial hypothesis had been that responses to the ‘words only’ scale would be initiated using an absolute cognitive strategy rather than the relative cognitive strategy used for the ‘numbers only’ scale. However, the occurrence of stimulus equalizing bias suggests a possible effect of the relative cognitive strategy, albeit far less strong than for the ‘numbers only’ scale. There are several possibilities. The placement of the labels for the ‘words only’ scale along the rank-rating strip might have suggested some numerical properties to which the consumers responded. Repetition of the experiment with the labels displayed in a non-ordinal manner might shed some light on this.
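The range analysis described in this section can be illustrated with a short sketch. It is not the authors' analysis code: the scores below are invented for illustration, the function names are hypothetical, and a paired t statistic is assumed because the same consumers rated the same three chocolates in both conditions (the text says only ‘t-test’).

```python
from math import sqrt
from statistics import mean, stdev


def score_range(scores):
    """Range of one consumer's scores: highest score minus lowest score."""
    return max(scores) - min(scores)


def classify_change(range_alone, range_in_context):
    """Label one consumer by how the range of the three chocolates changed."""
    if range_alone > range_in_context:
        return "expanded"      # the trend predicted by stimulus equalizing bias
    if range_alone < range_in_context:
        return "contracted"    # the reverse trend
    return "no change"


def paired_t(xs, ys):
    """Paired t statistic for two equal-length lists (assumed test, see lead-in)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical scores for four consumers: the same three middle-ranked
# chocolates rated alone, and rated in the context of all five chocolates.
alone = [[8, 5, 3], [7, 6, 4], [6, 5, 4], [9, 6, 2]]
in_context = [[7, 5, 4], [7, 6, 4], [6, 5, 3], [7, 5, 3]]

ranges_alone = [score_range(s) for s in alone]
ranges_context = [score_range(s) for s in in_context]
labels = [classify_change(a, c) for a, c in zip(ranges_alone, ranges_context)]
t_stat = paired_t(ranges_alone, ranges_context)

print(labels)
print(round(t_stat, 3))
```

Counting the labels gives the kind of breakdown reported above (consumers following the trend, showing no change, or showing the reverse trend), while the t statistic addresses only the difference between mean ranges.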

5. General discussion The main finding of the experiments was that the majority of consumers responded differently to the ‘words only’ and ‘numbers only’ scales. This was true whether the numbers were unipolar or bipolar. The percentage of consumers giving different results for the ‘numbers only’ and ‘words only’ scales ranged from 79% to 100%. As ‘words only’ and ‘numbers only’ scales are both in use and both are reported in the literature as being ‘a 9-point hedonic scale’, it should be cautioned that the numerical data derived from these two scales are not interchangeable.
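The comparison behind these percentages can be sketched as follows. This is an illustrative sketch only: the category wording is the conventional 9-point hedonic wording and is assumed here rather than taken from the scale sheets used in the experiments, the responses are invented, and the names `WORD_TO_SCORE` and `proportion_different` are hypothetical.

```python
# Conventional verbal categories, lowest to highest (assumed wording).
HEDONIC_CATEGORIES = [
    "dislike extremely",
    "dislike very much",
    "dislike moderately",
    "dislike slightly",
    "neither like nor dislike",
    "like slightly",
    "like moderately",
    "like very much",
    "like extremely",
]

# Traditional conversion: 'dislike extremely' -> 1 ... 'like extremely' -> 9.
WORD_TO_SCORE = {label: i for i, label in enumerate(HEDONIC_CATEGORIES, start=1)}


def words_to_scores(labels):
    """Convert one consumer's verbal responses to the numbers 1-9."""
    return [WORD_TO_SCORE[label] for label in labels]


def proportion_different(words_data, numbers_data):
    """Fraction of consumers whose converted 'words only' scores differ
    from their 'numbers only' scores for at least one stimulus."""
    differing = sum(
        words_to_scores(words) != list(numbers)
        for words, numbers in zip(words_data, numbers_data)
    )
    return differing / len(words_data)


# Hypothetical responses from three consumers to five chocolates in rank order.
words_data = [
    ["like very much", "like moderately", "like moderately",
     "like slightly", "dislike slightly"],
    ["like extremely", "like very much", "like slightly",
     "neither like nor dislike", "dislike moderately"],
    ["like very much", "like moderately", "like slightly",
     "like slightly", "neither like nor dislike"],
]
numbers_data = [
    [9, 7, 6, 4, 1],
    [9, 8, 6, 5, 3],
    [8, 6, 5, 3, 2],
]

share = proportion_different(words_data, numbers_data)
print(f"{share:.0%} of consumers responded differently on the two scales")
```

In this toy data the first and third consumers place two chocolates in the same verbal category while separating them numerically, which is the pattern the text describes.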

There is a subtle question. It was seen that the ‘numbers only’ and ‘words only’ scales gave different results for nearly all the consumers. The rank order of liking was visible all the time throughout the experiment as a reminder. The ‘numbers only’ scale gave the same rank order with the spacing between the ranks differing. Yet, for the ‘words only’ scale, ties were introduced. They were not incompatible with the ranks; there were no reversals. Yet, the possibility remains that the rank order on display could have had some biasing effect on the spacing for the ‘words only’ scale. If anything, a ranking with no ties might be expected to elicit fewer ties for the ‘words only’ scale. Yet, if this bias was present, it would not negate the conclusions drawn from the experiment.

Using the ‘words and numbers’ scale, it was found for the consumers tested here, that words were more salient than numbers. It could be hypothesized that words suggest more associations than numbers and so demand more attention. Yet, this speculation is a matter for further research.

Of course, the difference in response to the two scales was largely determined by the instructions given to the consumers. Yet, the goal of the experiment was not to determine whether words and numbers were incompatible, per se. The goal was not to determine whether responses to the ‘words only’ and the ‘numbers only’ scales were different under the same experimental conditions. In that case, the same instructions or no instructions would have been given for both scales. In fact, for such a goal, the authors would have designed a completely different experiment. The goal was to determine whether responses were different when sufficient instructions had been given to ensure that consumers understood clearly the task demanded for each scale.
Because of preliminary experiments, the authors felt it important that consumers should fully understand the tasks that the two scales required, namely giving a picture of the relative likings of the stimuli vs. categorizing the stimuli as liked or disliked and by how much. It cannot be pretended that these are the same task. In some preliminary experiments, where precise instructions were not given, some consumers used the numerical scale in the reverse order. They gave scores of ‘1’ or ‘2’ to foods that they liked the most. Also, with some consumers who had reported that they liked all the stimuli and were not given specific instructions, it was found that they tended to use the ‘words only’ scale as though it were a ‘numbers only’ scale, by spreading their scores across the whole scale length (some having scores less than ‘5’). Yet, scores below ‘5’ would imply dislike of the stimuli. It was only when this was mentioned that the consumers realized their mistake. Consumers are fallible; they can misunderstand or not always concentrate or think through their task. Accordingly, the authors ensured that consumers were given appropriate instructions so that they completely understood the task at hand. Such an approach is desirable for any form of sensory testing. It means, however, that the present results can only apply to the condition where consumers understood the task required because precise instructions on how to use the scales had been given. In other circumstances, it is possible that different results might apply.

The results would suggest that consumers were using different cognitive strategies for numbers and words. It might be hypothesized that the cognitive strategy used with numbers was relative and the strategy used with words was absolute. However, Experiment 3 did not fully support this; it merely demonstrated that the strategies were different.
Yet, as suggested above, failure of the ‘words only’ scale to resist the stimulus equalizing bias may have been an artifact of the way the categories for the ‘words only’ scale were displayed along a strip. It is interesting to note that some consumers’ reports indicated that when they considered a single chocolate they thought absolutely, but when considering all chocolates, they thought relatively. This suggests the possibility of context effects.

The rank-rating protocol was used to reduce the effects of forgetting. Yet for such a protocol, it can be argued that any context effects would be strong. The authors are aware that a serial monadic protocol is more often used for hedonic judgements, to minimize context effects. Yet, the strategy used here was to see whether responses to the ‘words only’ scale and the ‘numbers only’ scale differed under controlled conditions, where issues of forgetting how the chocolates tasted or the scores to be given to them were minimized. This was to establish that the effect existed. Yet, the question remains regarding what would happen should a serial monadic protocol be used. Would reduction of context alter the results? Would an effect of forgetting the tastes of the stimuli or forgetting the scores given to them cause artifacts? Perhaps if stimulus ‘A’ was ranked as liked more than ‘B’, forgetting the tastes of the chocolates or the scores given to them might result in ‘B’ getting a higher score than ‘A’. At the time of writing, an experiment using a serial monadic protocol is in progress.

Probably, the most common use for hedonic scales is to get numerical values for liking and determine significant differences with the usual parametric statistics (ANOVA, etc.). Despite the fact that such a statistical analysis is questionable, in that assumptions for parametric analysis are broken, this procedure was followed in both Experiments 1 and 2, so as to mimic other studies. In both experiments, the patterns of significance for the ‘words only’ and the unipolar and bipolar ‘numbers only’ scales were different, while the latter scales were unsurprisingly more discriminating.

Regarding Poulton’s (1979) stimulus equalizing bias, the effect was noticed for both ‘numbers only’ and ‘words only’ scales. Had all consumers used relative cognitive strategies for the ‘numbers only’ scale and absolute cognitive strategies for the ‘words only’ scale, this would not have happened.
Admittedly, the effect was stronger for the ‘numbers only’ scale but it was not absent for the ‘words only’ scale. Also, the responses to the scales were not homogeneous. This suggests the possibility of a variety of strategies. Certainly, the performance on the ‘numbers only’ scale suggested a strong presence of the relative strategy. For the ‘words only’ scale, this was reduced. These results do not confirm the hypothesis that a ‘words only’ scale would use an absolute cognitive strategy but they do lean, to a certain extent, in that direction. On the subject of cognitive strategies, it is worth hypothesizing that what was presented as a dichotomy (absolute vs. relative) might really be two ends of a continuum. A naïve judge may always use a relative strategy, and a calibrated judge may always use an absolute strategy. These judges could be seen as residing at opposite ends of the continuum. However, it is logically possible that a person with some experience of the type of foods being presented, might use a mixed cognitive strategy, with elements of both relative and absolute approaches. This person could be envisioned as on the continuum somewhere between the relative and absolute ends. The results in Experiment 3 are complicated by the possibility of judges remembering their scores from one scale to the next. This memory effect can be reduced by increasing the time between using the scales. Yet, as the time increases, judges tend to change their minds regarding preferences; this is not a problem for intensity scales. The trick for hedonic scales is to find the right ‘window’. The interval should be long enough to induce forgetting yet not to induce preference change. This is a topic for further research. There is some ambiguity in Poulton’s (1979) definition of stimulus equalizing bias. He states that the bias induces a judge to ‘‘use his full range of responses whatever the size of the range of stimuli”. 
In the present context, this could simply mean that the three chocolates alone would increase their range of responses but not necessarily as far as the range for the five chocolates. Here, ‘‘full range” refers to the full range for a set of three stimuli only, which might not be the same as the full range for a set of five stimuli. On

the other hand, “full range” might imply that the range for three and five chocolates would be the same: there is the same ‘full range’ for any number of stimuli. Poulton (1979) is not clear on this. Most of the literature on the subject refers to visual or auditory stimuli, using magnitude estimation (e.g. Jones & Woskow, 1962). There is little research concerning taste or flavor hedonics. Further research should step back and investigate the stimulus equalizing bias with both intensity and hedonic scales for taste and flavor. More needs to be learned about stimulus equalizing bias before it can be used as a tool for investigating cognitive strategies.

Besides issues concerning the precise cognitive strategies used for the scales, it is important to be precise about the conclusions to be drawn from this research. It can be concluded that when consumers are given appropriate instructions, data derived from the ‘words only’ and ‘numbers only’ 9-point hedonic scales used here are not interchangeable. However, this does not necessarily mean that words and numbers are incompatible per se. The research does not address issues of whether words and numbers can or cannot be compatible in other situations.

References

Amerine, M. A., Pangborn, R. M., & Roessler, E. B. (1965). Principles of sensory evaluation of food. New York, NY: Academic Press.
Cardello, A. V., & Schutz, H. G. (2004). Numerical scale-point location for constructing the LAM (labeled affective magnitude) scale. Journal of Sensory Studies, 19, 341–346.
Chung, S.-J. (2009). Effects of milk type and consumer factors on the acceptance of milk among Korean female consumers. Journal of Food Science, 74, S286–S295.
Chung, S.-J. (2010). Personal communication.
Colyar, J. M., Eggett, D. L., Steele, F. M., Dunn, M. L., & Ogden, L. V. (2009). Sensitivity comparisons of sequential monadic and side-by-side presentation protocols in affective consumer testing. Journal of Food Science, 74, S322–S327.
Hein, K. A., Jaeger, S. R., Carr, B. T., & Delahunty, C. M. (2008). Comparison of five common acceptance and preference methods. Food Quality and Preference, 19, 651–661.
Jeon, S.-Y., O’Mahony, M., & Kim, K.-O. (2004). A comparison of category and line scales under various experimental protocols. Journal of Sensory Studies, 19, 49–66.
Jones, F. N., & Woskow, M. H. (1962). On the relationship between estimates of magnitude of loudness and pitch. American Journal of Psychology, 75, 669–671.
Jones, L. V., Peryam, D. R., & Thurstone, L. L. (1955). Development of a scale for measuring soldiers’ food preferences. Food Research, 20, 512–520.
Kemp, S. E., Hollowood, T., & Hort, J. (2009). Sensory evaluation: A practical handbook. Chichester, UK: Wiley-Blackwell.
Kim, K. O., & O’Mahony, M. (1998). A new approach to category scales of intensity I: Traditional vs. rank-rating. Journal of Sensory Studies, 13, 241–249.
Koo, T.-Y., Kim, K.-O., & O’Mahony, M. (2002). Effects of forgetting on performance on various intensity scaling protocols: Magnitude estimation and labeled magnitude scale (Green scale). Journal of Sensory Studies, 17, 177–192.
Lawless, H. T., & Heymann, H. (1998). Sensory evaluation of food. Principles and Practices. New York, NY: Chapman & Hall.
Lawless, H. T., & Malone, G. J. (1986). A comparison of rating scales: Sensitivity, replicates and relative measurement. Journal of Sensory Studies, 1, 155–174.
Lee, H.-J., Kim, K.-O., & O’Mahony, M. (2001). Effects of forgetting on various protocols for category and line scales of intensity. Journal of Sensory Studies, 16, 327–342.
Lim, J., Wood, A., & Green, B. G. (2009). Derivation and evaluation of a labeled hedonic scale. Chemical Senses, 34, 739–751.
Mellers, B. A. (1983a). Evidence against “absolute” scaling. Perception and Psychophysics, 33, 523–526.
Mellers, B. A. (1983b). Reply to Zwislocki’s views on “absolute” scaling. Perception and Psychophysics, 34, 405–408.
Metcalf, K. L., & Vickers, Z. (2002). Taste intensities of oil-in-water emulsions with varying fat content. Journal of Sensory Studies, 17, 379–390.
Nepote, V., Olmedo, R. H., Mestrallet, M. G., & Grosso, N. R. (2009). A study of the relationships among consumer acceptance, oxidation chemical indicators, and sensory attributes in high-oleic and normal peanuts. Journal of Food Science, 74, S1–S8.
Parducci, A. (1963). Range-frequency compromise in judgment. Psychological Monographs, 77, 1–50.
Parducci, A. (1965). Category judgment: A range-frequency model. Psychological Review, 72, 407–418.
Parducci, A. (1968). The relativism of absolute judgments. Scientific American, 219, 84–90.
Park, J.-Y., Jeon, S.-Y., O’Mahony, M., & Kim, K.-O. (2004). Induction of scaling errors. Journal of Sensory Studies, 19, 261–271.
Peryam, D. R., & Girardot, N. F. (1952). Advanced taste-test method. Food Engineering, 24(July), 58–61, 194.
Peryam, D. R., & Pilgrim, F. J. (1957). Hedonic scale method of measuring food preferences. Food Technology, 11(September), 9–14.
Peryam, D. R., Polemis, B. W., Kamen, J. M., Eindhoven, G., & Pilgrim, F. J. (1960). Food preference of men in the US armed forces. Chicago, IL: Department of the Army Quartermaster Research and Engineering Command-Quartermaster Food and Container Institute for the Armed Forces.
Poulton, E. C. (1979). Models for biases in judging sensory magnitudes. Psychological Bulletin, 86(4), 777–803.
Rosas-Nexticapa, M., Angulo, O., & O’Mahony, M. (2005). How well does the 9-point hedonic scale predict purchase frequency? Journal of Sensory Studies, 20, 313–331.
Sabbe, S., Verbeke, W., Deliza, R., Matta, V. M., & van Damme, P. (2009). Consumer liking of fruit juices with different Açai (Euterpe oleracea Mart) concentrations. Journal of Food Science, 74, S171–S176.
Schifferstein, H. N. J., & Frijters, J. E. R. (1992). Contextual and sequential effects on judgments of sweetness intensity. Perception and Psychophysics, 52, 243–255.
Schutz, H. G. A., & Cardello, A. V. (2001). A labeled affective magnitude (LAM) scale for assessing food liking/disliking. Journal of Sensory Studies, 16, 117–159.
Stillmann, J. A. (1993). Context effects in judging taste intensity: A comparison of variable line and category rating methods. Perception and Psychophysics, 54, 477–484.
Stone, H., & Sidel, J. L. (1993). Sensory evaluation practices (2nd ed.). San Francisco, CA: Academic Press.
Stroh, S. (1998). Response bias and memory effects for selected scaling and discrimination protocols (128p). MS thesis, University of California, Davis, CA.
Taylor, T. P., Fasina, O., & Bell, L. N. (2008). Physical properties and consumer liking of cookies prepared by replacing sucrose with Tagatose. Journal of Food Science, 73, S145–S151.
Vickers, Z., & Roberts, A. (1993). Liking of popcorn containing different levels of salt. Journal of Sensory Studies, 8, 83–99.
Villegas-Ruiz, X., Angulo, O., & O’Mahony, M. (2008). Hidden and false “preferences” on the structured 9-point hedonic scale. Journal of Sensory Studies, 23, 780–790.
Villanueva, N. D. M., & Da Silva, M. A. A. P. (2009). Comparative performance of the nine-point hedonic, hybrid and self-adjusting scales in the generation of internal preference maps. Food Quality and Preference, 20, 1–12.
Yao, E., Lim, J., Tamaki, K., Ishii, R., Kim, K.-O., & O’Mahony, M. (2003). Structured and unstructured 9-point hedonic scales: A cross cultural study with American, Japanese and Korean consumers. Journal of Sensory Studies, 18, 115–139.
Yeu, K., Lee, Y., & Lee, S.-Y. (2008). Consumer acceptance of an extruded soy-based high-protein breakfast cereal. Journal of Food Science, 73, S20–S25.
Zwislocki, J. J. (1983). Absolute and other scales: Question of validity. Perception and Psychophysics, 33, 593–594.
Zwislocki, J. J., & Goodman, D. A. (1980). Absolute scaling of sensory magnitudes: A validation. Perception and Psychophysics, 28, 28–38.