Labeled Magnitude Scales: A critical review

Food Quality and Preference 26 (2012) 151–158

Hendrik N.J. Schifferstein*, Department of Industrial Design, Delft University of Technology, Landbergstraat 15, 2628 CE Delft, The Netherlands

Article info

Article history: Received 26 August 2011; Received in revised form 29 April 2012; Accepted 30 April 2012; Available online 8 May 2012.

Keywords: Labeled Magnitude Scale; Category rating; Magnitude estimation; Absolute scaling; Context effects; Ceiling effects

Abstract

Labeled Magnitude Scales (LMS) have gained substantial popularity in the sensory community. They were claimed to outperform traditional response methods, such as category rating and magnitude estimation, because they allegedly generated ratio-level data, enabled valid comparison of individual and group differences, and were not susceptible to ceiling effects (e.g., Green, Shaffer, & Gilmore, 1993; Lim, Wood, & Green, 2009). However, none of these claims seems to be well-founded. Although responses on the LMS are highly similar to those obtained through magnitude estimation, it is questionable whether any of these methods yields ratio-level data. In addition, comparing LMS data between individuals and groups may be invalid, because LMS data vary with manipulation of experimental context. Furthermore, restricting the LMS at the upper end of the scale possibly makes it susceptible to ceiling effects. Therefore, none of the original claims seems to hold. Moreover, the LMS holds a disadvantage compared to more traditional scaling methods, in that no simple cognitive algebraic model seems to underlie its responses, which makes it unclear what LMS responses exactly signify.

© 2012 Elsevier Ltd. All rights reserved.

Contents

1. Introduction
2. Does the LMS yield ratio-level data?
   2.1. Measurement level of magnitude estimation
   2.2. Measurement level of the LMS
3. Does the LMS enable valid comparisons over individuals or groups?
   3.1. The zero point
   3.2. Maximum intensity
   3.3. Anchoring intermediate steps
4. Is the LMS susceptible to ceiling effects?
5. Conclusion
Acknowledgement
References

1. Introduction

A major challenge in psychological measurement is to find a scale with ratio properties that remains unaffected by the context in which it is used (Stevens, 1946; Zwislocki & Goodman, 1980). Such an ideal scale should have a well-defined zero point, the psychological distances between units on the scale should all be equal, and the scale should provide the same outcome independent of any changes

* Tel.: +31 15 278 7896; fax: +31 15 278 7179. E-mail address: [email protected]
0950-3293/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.foodqual.2012.04.016


in the environment in which it is used, such as changes in background temperature and lighting or the presence of other stimuli. Being able to assess the effects of different stimuli on such a scale would make it possible to compare different areas of individual experience to one another. Just like an economic system, in which every object has a monetary value that can be used in exchanging objects, being able to use a universal scale of intensity would make it possible to determine whether the brightness of a specific light would be more intense than the smell of a distant waste plant. Having such a universal scale is important for comparing the abilities of various (groups of) individuals. For example, in clinical testing of sensory abilities it is necessary to have a standard of normal sensory functioning in order to decide whether a specific


H.N.J. Schifferstein / Food Quality and Preference 26 (2012) 151–158

Fig. 1. The original Category Ratio Scale (Borg, 1982).

individual exhibits reduced sensory performance or lacks an important sensory ability (e.g., Gent, Frank, & Mott, 1997). In addition, a universal intensity scale would be useful for determining the interrelations between different sensory impressions produced by a single source. For example, in order to determine the degree to which a specific sensory aspect of an object dominates the overall perception of the object, it would be helpful if we could determine the intensities of the individual sensations that are produced by the object (e.g., Schifferstein, Otten, Thoolen, & Hekkert, 2010). Furthermore, a scale with ratio properties allows the researcher to determine whether stimulus A is twice as intense as stimulus B. However, while a measurement procedure leading to a context-independent scale with ratio properties is already hard to define precisely for physical phenomena, such as temperature or weight (e.g., Cook & Rabinowicz, 1963), it is even harder to develop one for a psychophysical concept, such as perceived intensity. Nonetheless, several investigators have suggested that a Category Ratio Scale (CR; see Fig. 1) (e.g., Borg, 1982; Griep, Borg, Collys, & Massart, 1998; Marks, Borg, & Westerlund, 1992) or a Labeled Magnitude Scale (LMS; see Fig. 2) (e.g., Bartoshuk et al., 2002; Green, Shaffer, & Gilmore, 1993; Green et al., 1996) may have properties that approach a perfect ratio scale, and would make it possible to compare intensity ratings over different individuals, different senses, and different contexts.

By its particular layout, and by anchoring various categories to values that seem to have an absolute meaning, the developers suggest that many downsides of other scaling methods might be overcome. In addition, the method should enable a person to respond directly, and the responses should be easy to understand and interpret (Borg & Borg, 2001, p. 27). In recent years, Labeled Magnitude Scales have gained substantial popularity in the field of sensory evaluation. Besides variants of the LMS that focus on measuring perceived stimulus intensity (e.g., Bartoshuk et al., 2002; Green et al., 1996), additional variants have been developed to measure stimulus liking (the labeled affective magnitude scale, LAM) (e.g., Guest, Essick, Patel, Prajapati, & McGlone, 2007; Lim, Wood, & Green, 2009; Schutz & Cardello, 2001), perceived satiety (Cardello, Schutz, Lesher, & Merrill, 2005; Zalifah, Greenway, Caffin, D'Arcy, & Gidley, 2008), and the perceived degree of dissimilarity between stimuli (Kurtz, White, & Hayes, 2000). All these scales have in common that the relative spacing of verbal anchors is derived using a magnitude estimation procedure, in which participants try to preserve the ratios of the subjective magnitudes between the stimuli. The mean responses to the verbal anchors determine their relative positions on a line scale, which is used as the response scale in subsequent studies. Just like category rating and magnitude estimation, labeled magnitude scaling intends to provide a direct scaling method that participants can use to represent perceived sensory magnitudes (e.g., an intensity, degree of liking, or degree of dissimilarity) in a simple and straightforward way.
Several studies have appeared in the literature comparing the properties of data obtained with Labeled Magnitude Scales to category ratings, responses on line scales, or magnitude estimates with respect to a number of pragmatic criteria, such as their user-friendliness, the ability to show differences among products or among consumer segments, the reliability of ratings, and their predictive value for specific behaviors such as food habits (e.g., Lawless, Popper, & Kroll, 2010; Lawless, Sinopoli, & Chapman, 2010). This focus on pragmatic criteria carries the danger of disregarding the assessment of the validity of the LMS as a measurement tool. These studies fail to answer the question of how the labeled magnitude scaling technique works and whether it indeed works in the way it was designed to work. If the original claims appear to be unfounded, the LMS does not provide a solution to the deficits in the existing scales, and if the LMS does not provide an improved alternative to the existing scaling methods, it basically loses its reason for existence. Therefore, in the present paper we critically review the original claims made by proponents of the CR and LMS scales (e.g., Green et al., 1993; Lim et al., 2009) that (1) the LMS yields ratio-level data; (2) the LMS enables a valid comparison of individual and group differences; and (3) the LMS is not susceptible to ceiling effects.

2. Does the LMS yield ratio-level data?

Fig. 2. The oral Labeled Magnitude Scale (Green et al., 1993).

The spacing of the verbal descriptors on the LMS has been derived using the method of magnitude estimation (Green et al., 1993). Magnitude estimation (ME) was developed in the 1950s by Stevens (Stevens, 1955; Stevens & Galanter, 1957), who claimed that magnitude estimation yielded ratio-level data. During the development of the LMS, ratings were shown to be similar to ME responses, which was taken as evidence supporting the claim that the LMS also yielded ratio-level data (Green et al., 1993, 1996). Therefore, testing the claim that the LMS provides ratio-level data basically boils down to evaluating the assumption that ME yields ratio-level data. This brings us back to an elaborate debate in basic


psychophysics that is by no means resolved, between the proponents of ‘difference’ scaling methods such as category rating scales and line scales, versus the proponents of ‘ratio’ scaling methods such as magnitude estimation. To understand this debate, we need to go back to the roots of psychophysical scaling research.

2.1. Measurement level of magnitude estimation

In the early days of psychophysics, researchers held the view that people were unable to provide a direct estimate of any stimulus property they perceived. Therefore, they developed indirect ways of measuring perceived intensity. Fechner (1860) developed the concept of the cumulated just-noticeable-difference (JND) scale. He used the absolute detection threshold to define the zero point on the scale, and the differential threshold to define the unit of the scale. By assuming that the degree of confusion between stimuli could be used as a measure of the psychological distance between those stimuli, he was able to derive scale values for a set of stimuli. This led to Fechner's logarithmic psychophysical law, in which perceived intensity for taste or smell is a logarithmic function of the concentration of the substance.

Unfortunately, this indirect way of deducing a scale from threshold measurements was extremely laborious, and investigators started to experiment by asking participants to provide direct ratings of perceived intensities. They provided participants with a number of named categories that reflected an increase in intensity over the categories, and techniques were developed to determine the psychological distances between the verbal category descriptors. In this way, a scale could be derived for which the psychological distances between the category descriptors were equal (Thurstone's method of equal-appearing intervals, see Torgerson, 1958). To avoid the issue that category distances might not be psychologically equal over the scale, some investigators only named the two extremes of the scales and provided numbered categories, graphically equally spaced categories, or unstructured line scales, hence assuming that category distances were equally spaced on the number continuum or in the perception of length.
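Fechner's construction can be made concrete with a minimal sketch. The Weber fraction of 10% below is a hypothetical value chosen only for illustration: counting JND steps up from the absolute threshold yields a logarithmic function of stimulus intensity.

```python
import math

def jnd_count(concentration, threshold=1.0, weber_fraction=0.1):
    """Number of JND steps between the absolute threshold and the given
    concentration, assuming Weber's law: each JND multiplies the
    concentration by (1 + weber_fraction)."""
    return math.log(concentration / threshold) / math.log(1 + weber_fraction)

# Each doubling of the concentration adds the same number of JND steps,
# which is Fechner's logarithmic law in discrete form.
step = jnd_count(2.0) - jnd_count(1.0)
for c in (1.0, 2.0, 4.0, 8.0):
    assert abs(jnd_count(2 * c) - jnd_count(c) - step) < 1e-9
```

The laboriousness the text mentions is visible here: in practice each JND step had to be measured empirically rather than derived from an assumed Weber fraction.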
Empirically, investigators found that the outcomes of the accumulated JND scales corresponded quite well with mean ratings on category scales (e.g., McBride, 1983). However, because the category scales were limited at either end, participants tended to avoid the extremes of the scales and response distributions tended to be skewed near the upper and lower end of the scale, which had a biasing effect on the mean responses: the so-called floor and ceiling effects (Poulton, 1979). According to Stevens (Stevens, 1955; Stevens & Galanter, 1957), the limitations of the restricted category scales could be overcome by asking participants to use numbers for judging the magnitude of the perceived intensity (magnitude estimation, ME). Since participants were allowed to use any number, the response scale was unrestricted and no ceiling effects were expected. In order to provide a reference point on the scale, participants were typically instructed to assign numbers in relation to the number assigned to a standard stimulus. If a stimulus was twice as intense as the standard and the standard was given the number '10', the response to the stimulus should be '20'. Because response distributions of these magnitude estimates tended to be skewed, geometric stimulus means were usually calculated instead of arithmetic means. When Stevens plotted the mean of the log-transformed ME responses as a function of the logarithm of the substance concentration, he typically found a linear relationship, which suggested a power law relating ME responses to physical concentration. Stevens and Galanter (1957) found that mean responses on category scales tended to be non-linearly, but monotonically, related to mean ME responses for many psychophysical


continua, in particular for those in which an intensity dimension was involved (the so-called prothetic continua). Stevens developed the psychophysical power law in the behavioristic tradition, implying that he was only interested in studying the relationship between stimulus (S) and response (R) (i.e., S–R), and made no assumptions about how individuals processed the stimulus information. The individual was regarded as a black box. With the rise of the cognitive psychology movement, however, researchers tried to fill in the black box by modeling the cognitive processes that underlie participants’ psychophysical judgments under different types of task instructions. Hence, cognitive psychology uses an S–O–R perspective on direct judgment, in which O stands for the role of the organism. Following the S–O–R perspective, we can assume that perceiving stimulus Si leads to a representation on an internal intensity scale, the perceived sensation si. The relationship between Si and si is called the psychophysical function. When participants are asked to judge stimulus Si, their response Ri will be based on the perceived sensation si. The relationship between si and the observable response Ri is called the judgment function or the response-output function (Anderson, 1970). Hence, according to the S–O–R perspective, the relationship between the stimulus Si and the response Ri is the composite of the psychophysical function researchers are typically interested in (what do the participants perceive?) and the response-output function (how do participants generate responses based on what they perceive?). The challenge for psychophysicists is to derive the values of si from the values of Ri, while the response-output function is essentially unknown. Many investigators make the implicit assumption that the response-output function is a linear function, without testing this assumption. 
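The compositional problem can be illustrated with a toy model. The functions and exponents below are hypothetical, chosen only to show the point: the observable S–R relation confounds the psychophysical function with the response-output function.

```python
import math

def sensation(S, beta=0.5):
    # hypothetical psychophysical function: s = S ** beta
    return S ** beta

def response(s, gamma=2.0):
    # hypothetical non-linear response-output function: R = s ** gamma
    return s ** gamma

# The observable S-R relation is the composite R = S ** (beta * gamma).
# A researcher who implicitly assumes a *linear* response-output function
# would read off an exponent of 1.0 and misestimate the true
# psychophysical exponent (0.5 in this toy model).
S1, S2 = 4.0, 16.0
observed = math.log(response(sensation(S2)) / response(sensation(S1))) / math.log(S2 / S1)
assert abs(observed - 1.0) < 1e-9
```

The point is not that these particular functions hold, but that infinitely many (psychophysical, response-output) pairs produce the same observable data unless the response-output function is pinned down empirically.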
Fortunately, Anderson's (1981) Information Integration Theory provides tools to determine the shape of the response-output function empirically and thus allows us to test whether the latter assumption is correct. The internal representations of perceived sensations si are used in the mental operations that participants perform during an experiment. The cognitive models that describe these operations give us insight into how people derive their responses. Modeling these processes has proven beneficial in several areas of judgmental processing (e.g., Birnbaum & Stegner, 1981; Frijters, 1979). For instance, according to ME instructions participants are required to calculate the ratio in sensation magnitude between the perceived sensation si and the perceived intensity of the reference stimulus sref. Hence, the cognitive model underlying ME suggests that Ri is a function of si/sref. Alternatively, if a category rating procedure is used, the response Ri is a function of the intensity differences between si and the intensities of the stimuli that define the endpoints of the scale. If smin and smax represent the stimuli that correspond to the extreme categories on the response scale, Ri is a function of (si − smin)/(smax − smin). If s0 represents a meaningful zero point on the intensity scale, the response Ri is a function of (si − s0)/(smax − s0) (Birnbaum, 1978). The advantage of having cognitive models underlying psychophysical methods is that these models can be tested empirically. For instance, although ME instructs people to judge ratios between perceived intensities, we cannot be sure a priori that people actually mentally calculate these ratios and that ME responses can be used to generate a ratio scale of perceived intensity, unless the properties of the internal representations si are known and the nature of the response-output function is determined empirically. Analogous tests can be conducted for responses on category rating or line scales.
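The two candidate models can be sketched as follows; the scale values, modulus, and category count below are illustrative, not taken from any cited study.

```python
def me_response(s, s_ref, modulus=10.0):
    # 'ratio' model assumed by magnitude estimation instructions:
    # R_i is a function of s_i / s_ref
    return modulus * s / s_ref

def category_response(s, s_min, s_max, n_categories=9):
    # 'difference' model for a bounded category scale:
    # R_i is a function of (s_i - s_min) / (s_max - s_min)
    return 1 + (n_categories - 1) * (s - s_min) / (s_max - s_min)

# A sensation twice as intense as the reference receives twice its number...
assert me_response(6.0, 3.0) == 20.0
# ...while on a 9-point category scale the same sensation is placed by its
# distance from the stimuli anchoring the scale endpoints.
assert category_response(6.0, s_min=1.0, s_max=11.0) == 5.0
```

Note that both models take the same internal values s as input; the empirical question is which operation participants actually perform on them.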
As stated above, in empirical studies category ratings and magnitude estimates have typically been found to be related by a non-linear, monotonic relationship (Stevens & Galanter, 1957). However, according to the cognitive modeling approach such a monotonic relationship conflicts with the assumption that category rating and magnitude estimation tasks imply performing two different cognitive operations on the same set of internal representations. If responses were derived by calculating either differences si − sj (category scale) or ratios si/sj (magnitude estimation) on a single set of intensity scale values, this should produce a fan pattern according to the rules of arithmetic (see Fig. 3). Therefore, Torgerson (1961) suggested that people do not use two different cognitive operations in category rating and magnitude estimation tasks. Instead, they use only a single cognitive operation (either difference or ratio), and subsequently use different response-output transformation functions for the category rating and the magnitude estimation task (e.g., Birnbaum, 1978; Birnbaum & Elmasian, 1977; Mellers, Davis, & Birnbaum, 1984). Birnbaum's empirical work suggests that for many continua a 'difference' operation is most plausible. For continua that do not contain a natural, well-defined zero point, such as taste and smell intensity, ratios cannot be calculated. Therefore, a 'difference' operation is most logical (De Graaf & Frijters, 1988; Schifferstein, 1995). On the other hand, for continua with a clearly defined zero point, such as line length and area size, a 'ratio' operation may be more plausible (Schneider & Bissett, 1988).

Fig. 3. A comparison of calculated model predictions with experimental judgment data. The left panel shows the linear fan pattern resulting when arithmetic ratios are plotted as a function of arithmetic differences for the combinations of all integers between 1 and 7. The right panel shows a typical example of an experimental psychophysical study in which judgments of ratios are plotted as a function of judgments of differences between two stimuli. The data shown here involve heaviness judgments for all combinations of seven weights varying from 50 to 200 g with 25 g differences (data from Birnbaum & Veit, 1974).

2.2. Measurement level of the LMS

If a cognitive 'difference' model is most plausible for many psychophysical continua, perceived intensities should not be derived from ME data by assuming a 'ratio' model, and ME is unlikely to produce ratio-level data. If Birnbaum is correct, the results of many magnitude estimation experiments assessing taste or smell intensity have been analyzed and interpreted incorrectly. Because the quality of the LMS highly depends on the assumption that ME yields ratio-level data, this also calls the validity of LMS responses into question.
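The linear fan in the left panel of Fig. 3 is a consequence of arithmetic alone, and can be reproduced in a few lines (using the integers 1 through 7, as in the figure):

```python
from collections import defaultdict

values = range(1, 8)
pairs = [(a, b) for a in values for b in values]

# Group arithmetic ratios by arithmetic difference: if subjects computed
# both operations on a single set of scale values, each difference level
# would map onto a *spread* of ratios (the fan), not a single curve.
ratios_by_diff = defaultdict(set)
for a, b in pairs:
    ratios_by_diff[a - b].add(a / b)

# e.g. a difference of 2 arises from 3/1, 4/2, 5/3, 6/4 and 7/5,
# i.e. five distinct ratios for one and the same difference.
assert len(ratios_by_diff[2]) == 5
```

That the experimental data (right panel of Fig. 3) do not show this spread is precisely Torgerson's argument that only one cognitive operation is being performed.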
Perhaps, however, other evidence can be found that supports the validity of the LMS as a measurement instrument. The LMS presents the outcomes of ME experiments in a linear graphical format and, thereby, combines elements of a 'ratio' scaling method with those of a 'difference' method. During LMS scale development (Green et al., 1993), participants were presented with descriptions of specific oral sensations (e.g., 'the bitterness of celery', 'the pain from biting on your tongue') and general statements that differed only in the adjectives and adverbs used (e.g., 'a barely detectable oral sensation', 'a very strong oral sensation') and were asked to assign magnitude estimates that expressed their intensity relative to other oral sensations of all kinds. The latter adverb–adjective combinations were later depicted on a vertical line scale, at positions reflecting the geometric means of their ME responses. In contrast to category scales developed with Thurstonian scaling techniques, where descriptors are typically distributed quite evenly over the scale (Torgerson, 1958), on the LMS the low-intensity descriptors (e.g., barely detectable, weak) were clustered closely together at the lower end of the scale, whereas the high-intensity descriptors were spaced far apart at the upper end. As a consequence, the locations of the descriptors on the LMS give the impression of a logarithmic distribution (see Fig. 2). In the original version of the CR scale the descriptors were depicted together with a set of evenly spaced numbers (Griep et al., 1998; Marks et al., 1992), although the graphical spacing did not concur with the natural spacing between the numerical values (e.g., the graphical distance between 0 and 1 was larger than the graphical distance between 3 and 4; see Fig. 1). The LMS and its later versions typically used a vertical (Bartoshuk et al., 2004; Green et al., 1993, 1996; Lim et al., 2009) or horizontal (Bartoshuk et al., 2002) line without numbers.
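The construction of the scale can be sketched as follows; the raw magnitude estimates below are invented for illustration and are not the data of Green et al. (1993).

```python
import math

def geometric_mean(xs):
    # ME response distributions are skewed, so geometric rather than
    # arithmetic means are used to locate each verbal anchor
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# hypothetical raw magnitude estimates for three anchors (not real data)
raw = {
    "weak": [3.0, 5.0, 6.0, 10.0, 8.0],
    "very strong": [30.0, 60.0, 50.0, 45.0, 80.0],
    "strongest imaginable": [80.0, 120.0, 100.0, 90.0, 110.0],
}
means = {label: geometric_mean(xs) for label, xs in raw.items()}

# Normalize so the top anchor sits at 100 mm on the line scale; because
# the geometric means grow multiplicatively, low-intensity anchors cluster
# near the bottom while high-intensity anchors spread out near the top --
# the quasi-logarithmic look of the LMS.
top = means["strongest imaginable"]
positions = {label: 100 * m / top for label, m in means.items()}
```

The key point for the argument in the text: the positions are derived through a ratio procedure, yet they are then presented to respondents as literal distances on a line.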
Participants were originally instructed to 'first determine which descriptor most appropriately describes the intensity of the sensation, then fine-tune your rating by placing a mark on the scale at the proper location between that descriptor and the next most appropriate one' (Green et al., 1993; p. 690). This instruction is very similar to what one would expect for a category scale with evenly spaced descriptors, in which participants are likely to base their judgments on the distances between the different locations. No mention is made of the ratios between the various locations, from which the current


positions of the descriptors were derived. Hence, one may wonder whether users of the LMS tend to use the distances on the scale as measures of intensity, instead of the ratios from which the scale was derived. Possibly, the somewhat unusual spacing on the scale provides clues that they should not use the distances in the way typical of category scales. In many recent studies using LMS scales, participants are not explicitly instructed how they should use the scale, but are only required to rate perceived intensity or degree of liking by putting a mark on the line (e.g., Diamond & Lawless, 2001; Lawless, Sinopoli et al., 2010). But this leaves open the question of how they should use it. Surprisingly, and unlike traditional category rating and ME tasks, the original LMS instruction does not provide a simple cognitive model that is likely to underlie participants' judgments. By combining elements of ratio estimation (i.e., the derivation of descriptor locations through magnitude estimation, logarithmic spacing) and difference estimation (i.e., the graphical display format suggesting meaningful distances), the cognitive operations participants perform when using the scale are left unspecified. In fact, participants may employ multiple strategies in using the LMS, depending on their personal preference for focusing on 'ratio' or 'difference' elements in the task. Possibly, participants decide to pick the two descriptors that are closest to their perception and then rate perceived intensity by an approximately linear interpolation between the two marks corresponding to these descriptors. Such a strategy implies picking a section of the logarithmically spaced scale ('ratio') and then using this part of the scale to evaluate the distances within this section ('difference'). This two-step strategy bears similarities to the category partitioning method developed by Heller (1985).
His method instructs participants to first categorize stimuli using their general frame of reference by selecting the verbal category to which the stimulus seems to belong. Subsequently, the participant can fine-tune the final judgment by picking one of the ten numerical responses provided within the overall category (see Ellermeier, Westphal, & Heidenfelder, 1991). Alternatively, Cardello, Lawless, and Schutz (2008) suggested that participants may use an LMS as if it were a category scale, because a large number of participants in their study (50–65%) made responses on a 200 mm scale that were at or very near the crosshatch marks for the verbal anchors. If participants employ a categorical response strategy, the LMS does not work as intended, and response distributions are likely to be biased. In a follow-up study, Lawless, Sinopoli, et al. (2010) found that the number of categorical responses decreased when the length of the scale was increased. However, explicitly encouraging participants to use the space between the anchors did not reduce the rate of categorical responding. Possibly, other subtle changes in experimental procedure may enhance or attenuate the 'ratio' or 'difference' character of the results. For instance, Lim et al. (2009) stress the importance of using only participants who have ample training in ME, thereby probably increasing the 'ratio' character of their expected outcomes. However, if participants need special training in order to use the LMS, this suggests that people find the LMS hard, and perhaps counterintuitive, to use. This makes us wonder whether the LMS is appropriate for screening large consumer groups, as is done in research on PROP taster status (e.g., Bartoshuk et al., 2004). Also telling in this respect are reports on the properties of the response distributions obtained with the LMS.
In many cases, similar to ME experiments, significantly skewed response distributions have been encountered, and logarithmic transformations have typically been applied to correct for deviations from normality (e.g., Green et al., 1993). However, in other instances, suggestive of the use of a 'difference' model, normal


or close-to-normal response distributions were found (e.g., Lim et al., 2009). Although normal response distributions are desirable for statistical analyses, their occurrence contradicts the validity of the LMS as a 'ratio' scaling technique. Since the LMS combines elements from the 'difference' and the 'ratio' approaches, the cognitive operation that people perform remains unknown. Then what do the responses mean? Borg, who developed the LMS's precursor, does not seem to take this problem very seriously. Borg and Borg (2001) state: 'That we are using a "ratio scaling method" does not mean that the responses obtained with the method perfectly fits a mathematical model. We are always dealing with human responses and conceptions of numbers and not with abstract mathematical concepts. Responses obtained with a kind of "semi-ratio scale", i.e., a scale, reflecting most people's experiences, can then be handled statistically and mathematically as belonging to a ratio scale [...]' (p. 22). Indeed, we can calculate the appropriate statistics, depending on the characteristics of the response distributions. But what does 'A is significantly larger than B' mean, if we do not know what A and B stand for? This reduces the LMS to an S–R scaling technique, in which we know what the responses to stimuli are, but we have no idea how to interpret them. As described above, Anderson's (1981) Information Integration Theory has shown its usefulness in elucidating the cognitive models underlying other direct scaling methods, such as category rating and magnitude estimation, and in determining the measurement properties of the corresponding scales (Birnbaum, 1978). Hence, this could be a valuable approach for investigating the properties of LMS scaling.
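The hypothesized two-step 'pick descriptors, then interpolate' strategy discussed above can be sketched as follows; the anchor labels and millimetre positions are illustrative, not the published LMS values.

```python
# LMS-style anchor positions in mm (illustrative, not the published values)
ANCHORS = [("barely detectable", 1.4), ("weak", 6.0), ("moderate", 17.0),
           ("strong", 35.0), ("very strong", 53.0),
           ("strongest imaginable", 100.0)]

def two_step_response(lower_label, upper_label, fraction):
    """Hypothesized strategy: pick the two descriptors bracketing the
    sensation (positions derived via a 'ratio' procedure), then
    interpolate linearly between them (a 'difference' use of the line)."""
    pos = dict(ANCHORS)
    lo, hi = pos[lower_label], pos[upper_label]
    return lo + fraction * (hi - lo)

# a sensation judged 'halfway between moderate and strong'
mark = two_step_response("moderate", "strong", 0.5)
assert mark == 26.0
```

A mark produced this way is neither a pure ratio judgment nor a pure difference judgment, which is exactly why the resulting numbers are hard to interpret.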

3. Does the LMS enable valid comparisons over individuals or groups?

Comparing results from psychophysical experiments between groups or between individuals always involves assumptions, because you cannot be sure that the response scales in the different conditions are really measuring the same thing. People in the different groups may have different biological 'hardware' (biological effect) (e.g., Bartoshuk, Duffy, & Miller, 1994), they all have different previous experiences (context effect) (e.g., Zoeke & Sarris, 1983), they may interpret scale elements (e.g., distances, symbols, words, numbers) in different ways (e.g., Schifferstein, Smeets, & Hallensleben, 2011), and they may generate their ratings in different ways (response strategy) (e.g., Anderson, 1970). Nevertheless, several authors have developed procedures that are claimed to make a scale 'absolute'. For instance, if we can find a continuum to which all participants respond similarly, whether they are in group A or B, the responses on this continuum can be used as a reference. If all responses are then rescaled relative to a point on the reference continuum, the rescaled responses can be compared. The critical assumption in this approach is that the reference continuum should in no way be related to the continuum of interest. Early studies investigating the effects of PROP taster status used the perceived saltiness of NaCl as the reference continuum (e.g., Bartoshuk, Rifkin, Marks, & Hooper, 1988). However, later studies showed that PROP taster status was related to the number of taste buds in humans (Bartoshuk et al., 1994), which may affect NaCl saltiness perception as well. Therefore, in subsequent studies brightness or darkness perception (visual) or loudness perception (auditory) were used as reference continua, because they are more likely to yield valid comparisons. However, even in the latter case we should be aware that the coupling between sensory continua
has been qualified as loose and tends to be susceptible to context effects (e.g., Marks, 1992; Schifferstein, 1995). Other researchers have tried to enhance the absolute character of response scales by adding reliable, context-independent verbal anchors on the response continuum. For instance, the minimum and maximum of the scale may be defined by a biological limitation (lowest or highest intensity imaginable) or by providing a reference sample and linking it to an appropriate location on the response scale. In addition, descriptors may be added to clarify what certain positions on the continuum imply. If people agree on the meaning of these descriptors, they will be more likely to use the response continuum in similar ways, even when they take part in different experimental conditions. The latter approach has also been used in the development of the LMS. However, it remains an empirical question whether these measures are really effective: do participants adhere to instructions? Do they really assign a specific perceived intensity to a specific point of the scale, independent of the experimental context? We will now discuss the different anchoring strategies.

3.1. The zero point

On scales of intensity the zero point is essential, as it marks the logical starting point of the scale: the point at which no intensity is perceived for the target characteristic. Yet, the zero point may pose a perceptual problem, because not all sensory continua contain a well-defined zero point. For instance, what would be a stimulus that defines the zero point for the sense of taste? Would that be distilled water? Due to the presence of substances in human saliva, the taste of distilled water differs from the taste we perceive when no stimulus is inserted in the mouth (Bartoshuk, 1968; McBurney & Shick, 1971; O’Mahony, 1972). So, should the taste of saliva be used as the zero point?
Or should participants rinse their mouths with distilled water before they start tasting, so that distilled water has no taste at the start of the session (De Graaf, Frijters, & van Trijp, 1987)? Other sensory modalities are likely to pose similar problems, just like the neutral category (neither like nor dislike) on a hedonic scale. Establishing the zero point on an internal continuum is not just an empirical problem, but also a theoretical one. It can only be accomplished by evaluating what constitutes an appropriate neutral reference point for all possible types of stimuli under all possible presentation modes.

On the response continuum, the zero point of no sensation is represented either by the number 0 (e.g., in magnitude estimation) or by the minimum end point of the scale (e.g., in line scales and category scales). But also on the response continuum, the zero point poses logical problems. If participants are asked to judge ratios between stimuli, they may have trouble estimating certain ratios, because dividing by 0 is not possible. Analogously, in magnitude estimation 0 responses lead to problems in data processing, because log-transformations are not possible (e.g., Owen & DeRouen, 1980). Unfortunately, this issue has received only limited attention in the magnitude estimation literature, because interest centers on the estimation of the exponent of the psychophysical function (the slope after log-transformation) rather than on the constant (the intercept of the regression line after log-transformation). On category scales and line scales, on the other hand, response distributions tend to be skewed near the end point of the scale, because the response scale is restricted. As a consequence, the mean stimulus rating near the end of the scale is biased (e.g., Poulton, 1979). The absence of a well-defined zero point on a perceptual continuum presents a logical problem: it precludes the existence of a real ratio scale.
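The zero-response problem in magnitude estimation is easy to demonstrate. The sketch below (stimulus and response values are invented for illustration) fits Stevens’ power law R = c·S^n by least squares in log–log coordinates; a single response of 0 makes the transform undefined.

```python
import math

# Hypothetical magnitude estimates for five stimulus concentrations.
stimulus = [1.0, 2.0, 4.0, 8.0, 16.0]
response = [3.0, 5.0, 8.5, 14.0, 24.0]

# log R = log c + n * log S, so the exponent n is the slope after
# log-transformation and log c is the intercept.
xs = [math.log(s) for s in stimulus]
ys = [math.log(r) for r in response]
k = len(xs)
mx, my = sum(xs) / k, sum(ys) / k
exponent = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))  # slope = n
log_c = my - exponent * mx                     # intercept = log c

print(f"estimated exponent n = {exponent:.2f}")

# A zero response cannot enter this analysis at all:
try:
    math.log(0.0)
except ValueError:
    print("log(0) is undefined; a 0 response breaks the fit")
```

Common workarounds, such as adding a small constant to every response before taking logs, distort the intercept in particular, which may help explain why the constant receives so little attention relative to the exponent.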
Hence, the absence of a well-defined zero point affects all scaling methods to the same extent: if a continuum does not contain a valid zero point, the measurement level of the scale can be interval level at best. As regards

the response continuum, both the category scale and the LMS provide a well-defined minimal end point, but responses tend to be biased due to end effects. Furthermore, magnitude estimation has to deal with a logical problem, because some ratios do not exist when intensities are zero. Therefore, none of the existing methods can solve this problem in a satisfactory way.

3.2. Maximum intensity

In many psychophysical studies, an arbitrary maximum is defined by presenting a certain high-intensity stimulus that anchors the top of the scale. Providing reference samples makes a scale relative by definition, because other studies may use different references. For an absolute scale that can be used independently of context, it is important to define a maximum intensity that is unequivocal under all circumstances in which the scale can be used. However, it is extremely hard to derive standards that are identical across individuals or groups, because people cannot share one another’s perceptions. According to Borg’s range model (Borg, 1982), the subjective dynamic range can be assumed to be the same across individuals as a first approximation, despite differences in physical dynamic range (i.e., the different stimulus intensities a person is able to process). Although Borg (1982) initially cautioned that this assumption might not be suitable for all sense modalities, Teghtsoonian (1973) had argued that ‘for all perceptual continua the maximum range of subjective magnitude is constant’. Subsequently, Borg (1994) argued that ‘maximal intensities are about the same for most sensory modalities (except for pain)’. If Borg’s assumption is correct, anchoring the scale using minimum and maximum intensity anchors provides a stable frame of reference that fixes the scale and prevents its meaning from drifting.
Borg makes it plausible that people have similar dynamic ranges by pointing out that most objects and events in the human experiential world have specific boundaries. The human sensory organs are adapted to pick up the specific intensity ranges that are relevant to our daily life. Our similarities in biology therefore ensure that people largely process similar stimulus events, and it makes sense to assume that dynamic ranges are similar across individuals.

For the different variants of the LMS, different descriptors have been used to define the end point of the scale. Borg (1982) anchored the top of the Category Ratio Scale as ‘maximal’ (see Fig. 1). Green et al. (1993) asked participants to label the top of their oral sensation scale as the ‘strongest imaginable sensation’ (Fig. 2). In the gLMS (Bartoshuk et al., 2002) the scale was anchored at the top with ‘strongest imaginable sensation of any kind’, implying that these authors assume that the strongest imaginable sensation may differ between modalities. Maximum intensity anchors that are independent of the type of stimuli rated support the absolute character of the scale, because these end anchors are relevant for any type of stimulus. However, because the maximum intensity level across multiple sensory domains is in many cases higher than the maximum intensity in one specific domain, these general anchors may lead to compression of responses at the upper end of the scale. Indeed, scales with domain-specific maximum anchors such as ‘strongest imaginable sweetness’ tend to result in steeper psychophysical functions than scales with maximum anchors covering multiple domains, such as ‘strongest imaginable sensation’ (e.g., Green et al., 1996). This issue will be discussed further below in the section on ceiling effects.

3.3. Anchoring intermediate steps

In everyday language, people compare sensations using intensity adjectives (e.g., strong) along with adverbs that modify the adjectives (e.g., very).
These adverb–adjective combinations have
been used extensively in the development of category scales. According to reference frame theories (see Zoeke & Sarris, 1983), the descriptions of personal perceptual events occur within a reference frame that is determined by day-to-day experiences with a specific stimulus dimension, and this frame becomes stabilized in memory. These descriptions occur in categories of everyday language. As such, they are communicable and absolute, because no physical reference stimuli are needed to pinpoint particular scale elements. Our similarities in culture and language ensure that certain verbal expressions are closely linked to certain perceived intensities. Therefore, naming specific levels in the dynamic ranges we perceive helps people to understand, use, and interpret a scale. In contrast, people are not used to giving numbers to the intensities of the events they perceive in daily life. However, although everyday adjectives help in defining the different areas of the scale, it is questionable whether they can do so in an absolute way. The meaning of adverb–adjective combinations such as ‘very small’ and ‘quite large’ is relative and typically depends on the nouns they are used with: a small elephant is likely to be larger than a large mouse. As a consequence, the anchoring value of the adjectives may be limited. LMS researchers have tried to overcome these limitations by asking participants to scale the remembered intensities of specific events (Bartoshuk et al., 2002). Because these events are defined in more detail, they are likely to be less context-dependent. The challenge here is to find events that everyone knows, or that are likely to have a similar impact on many different people. Although anchoring procedures seem to make sense intuitively, it is an empirical question whether they are effective. Changing the spacing of the verbal anchors affects mean responses in a predictable way (e.g., Cardello et al., 2008).
But can anchoring also be effective in making the scale absolute? We think not. Despite considerable efforts invested in finding adequate descriptors for response scales, all types of direct response scales have been found to be susceptible to manipulations of experimental context. In addition, the presentation of reference standards and extensive training are unable to eliminate the effects of experimental context (Olabi & Lawless, 2008). Although category scales have been criticized for being susceptible to contextual effects, magnitude estimates (Ellermeier et al., 1991; Mellers, 1983; Poulton, 1979) and LMS responses (Diamond & Lawless, 2001; Lawless, Horne, & Spiers, 2000) have been shown to be susceptible to manipulations of stimulus sets as well. According to Schifferstein (1995), manipulating stimulus range, frequency, or spacing in experimental contexts with multiple sensory dimensions results in relative shifts between internal stimulus continua and thus affects the relationships between the internal stimulus representations s_i. Hence, these context effects do not reside in response selection processes and are unlikely to be eliminated by changing the response procedure.

4. Is the LMS susceptible to ceiling effects?

According to Stevens (1971), a ratio scaling method should not impose restrictions on participants. People should feel free to use any number they want, and the response continuum should not be restricted. As a consequence, ceiling effects are unlikely to occur in magnitude estimation, whereas avoidance of end categories is commonly observed for category scales (e.g., Stevens & Galanter, 1957). To avoid truncation of responses, Borg (1982) suggested leaving the response scale open at the top, so that people may give higher responses than the scale maximum. For instance, if participants perceive a sensation during an experiment that is stronger than any sensation they could imagine beforehand, they should be able to report it as such. Therefore, on some of Borg’s
scales participants had the opportunity to give higher ratings than the proposed maximum (Borg & Borg, 2001) (Fig. 1). However, it remains to be established whether this approach really avoids all ceiling effects. Although a scale may not be limited physically, people may have implicit maximum response limits they do not surpass. In addition, currently used LMS scales are typically restricted at the top with a maximum intensity level specified (Fig. 2). Although the labels used are extreme, the possibility exists that stimuli presented during an experiment are stronger than participants could imagine beforehand, which makes the scale susceptible to ceiling effects again. Due to the differences in adverb spacing between the category scale and the LMS, the sensitivity of the scales in assessing particular product differences is likely to differ. The LMS is most likely to be very sensitive near the upper end, because descriptors are spaced widely apart from each other. Due to the large size of the maximum category, the number of responses in this category also tends to be somewhat larger for the LMS than for the category scale (see Lawless, Popper, et al., 2010). The increased sensitivity may be an advantage when the LMS is used to differentiate between products with high intensities (for the LMS) or with high liking scores (for the LAM). Conversely, a linear category scale is likely to be more sensitive in detecting differences between low-intensity stimuli (compared with LMS) or products with low liking scores (compared with LAM). Because commercial consumer testing often involves many products that already appeal to consumers, the affective LMS is likely to have an advantage here over a hedonic category scale. 
Indeed, Schutz and Cardello (2001) and El Dine and Olabi (2009) reported some advantages in product differentiation for the LAM scale for highly liked items, but others have found no such advantages (Lawless, Popper, et al., 2010; Lawless, Sinopoli, et al., 2010), or have even found an opposite effect (Hein, Jaeger, Carr, & Delahunty, 2008). Cardello et al. (2008) found that descriptor spacing affected mean product ratings, but did not affect the ability to differentiate products: when wider spacing results in proportionally larger errors, product differentiation remains the same. In conclusion, we found no convincing evidence that LMS scales are less susceptible to ceiling effects than category scales. Although participants may use the upper response category somewhat more often and mean responses may be somewhat higher for the LMS than for category scales, this does not result in consistent improvements in the ability to differentiate between products.
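The mechanism behind such ceiling effects can be sketched in a small simulation (all parameter values are assumptions for illustration, not estimates from any study): when normally distributed internal impressions near the top of a bounded 0–100 scale are clipped at the maximum, the observed mean is pulled below the underlying mean, whereas stimuli in the middle of the scale are reported essentially without bias.

```python
import random

random.seed(0)
SCALE_MAX = 100.0

def mean_observed(true_mean, sd=15.0, n=20_000):
    """Mean rating when normally distributed impressions are clipped to [0, SCALE_MAX]."""
    total = 0.0
    for _ in range(n):
        r = random.gauss(true_mean, sd)
        total += min(max(r, 0.0), SCALE_MAX)
    return total / n

mid = mean_observed(50.0)   # far from both ends: tracks the true mean
top = mean_observed(95.0)   # near the ceiling: truncated, biased downward

print(f"true 50 -> observed mean {mid:.1f}")
print(f"true 95 -> observed mean {top:.1f}")
```

The same clipping also skews the response distribution near the end point, which is the end-effect bias noted by Poulton (1979).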

5. Conclusion

Although labeled magnitude scaling has gained substantial popularity within the sensory community, the claims supporting its use as an absolute, direct response scaling method originally made by its developers are not supported by the literature. Moreover, according to an S–O–R view on perception, the LMS holds a disadvantage compared to more traditional scaling methods such as category scales and magnitude estimation, in that no simple cognitive algebraic model underlies its responses. The absence of such a model should make us wonder: what exactly do LMS responses signify? If we are unsure about the cognitive processes participants use to generate LMS responses, or if participants employ different strategies in generating these responses, we cannot be sure what these responses represent and how we should analyze them. A model relating internal representations to overt responses assists in choosing the appropriate mathematical transformations to optimize statistical procedures and data representation (e.g., in choosing between geometric or arithmetic means). In the absence of an underlying cognitive model, we can only use a behavioristic approach to LMS data analysis, in which we statistically demonstrate that one stimulus gives higher responses than another
stimulus. But is such an outcome useful, if we do not know what these responses represent? Until the latter question can be answered, it seems unwise to promote the use of LMS scales in sensory research.

Acknowledgement

I would like to thank Jan E.R. Frijters, who was my mentor during my Ph.D. research, who taught me the principles of psychophysical research, and who initially discouraged me from writing the current paper, but will probably be pleased with the end result.

References

Anderson, N. H. (1970). Functional measurement and psychophysical judgment. Psychological Review, 77, 153–170.
Anderson, N. H. (1981). Foundations of information integration theory. New York: Academic Press.
Bartoshuk, L. M. (1968). Water taste in man. Perception & Psychophysics, 3, 69–72.
Bartoshuk, L. M., Duffy, V. B., Fast, K., Green, B. G., Prutkin, J., & Snyder, D. J. (2002). Labeled scales (e.g., category, Likert, VAS) and invalid across-group comparisons: What we have learned from genetic variation in taste. Food Quality and Preference, 14, 125–138.
Bartoshuk, L. M., Duffy, V. B., Green, B. G., Hoffman, H. J., Ko, C. W., Lucchina, L. A., et al. (2004). Valid across-group comparisons with labeled scales: The gLMS versus magnitude matching. Physiology & Behavior, 82, 109–114.
Bartoshuk, L. M., Duffy, V. B., & Miller, I. J. (1994). PTC/PROP tasting: Anatomy, psychophysics, and sex effects. Physiology & Behavior, 56, 1155–1171.
Bartoshuk, L. M., Rifkin, B., Marks, L. E., & Hooper, J. E. (1988). Bitterness of KCl and benzoate: Related to genetic status for sensitivity to PTC/PROP. Chemical Senses, 13, 517–528.
Birnbaum, M. H. (1978). Differences and ratios in psychological measurement. In N. J. Castellan & F. Restle (Eds.), Cognitive theory (Vol. III, pp. 33–74). Hillsdale, NJ: Erlbaum.
Birnbaum, M. H., & Elmasian, R. (1977). Loudness “ratios” and “differences” involve the same psychophysical operation. Perception & Psychophysics, 22, 383–391.
Birnbaum, M. H., & Stegner, S. E. (1981). Measuring the importance of cues in judgment for individuals: Subjective theories of IQ as a function of heredity and environment. Journal of Experimental Social Psychology, 17, 159–182.
Birnbaum, M. H., & Veit, C. T. (1974). Scale convergence as a criterion for rescaling: Information integration with difference, ratio, and averaging tasks. Perception & Psychophysics, 15, 7–15.
Borg, G. (1982). A category scale with ratio properties for intermodal and interindividual comparisons. In H. G. Geissler & P. Petzold (Eds.), Psychophysical judgment and the process of perception (pp. 25–34). Berlin: VEB Deutscher Verlag der Wissenschaften.
Borg, G. (1994). Psychophysical scaling: An overview. In J. Boivie, P. Hansson, & U. Lindblom (Eds.), Touch, temperature, and pain in health and disease: Mechanisms and assessments (pp. 27–39). Seattle, WA: IASP Press.
Borg, G., & Borg, E. (2001). A new generation of scaling methods: Level-anchored ratio scaling. Psychologica, 28, 15–45.
Cardello, A. V., Lawless, H. T., & Schutz, H. G. (2008). Effects of extreme anchors and interior label spacing on labeled magnitude scales. Food Quality and Preference, 21, 323–334.
Cardello, A. V., Schutz, H. G., Lesher, L. L., & Merrill, E. (2005). Development and testing of a labeled magnitude scale of perceived satiety. Appetite, 44, 1–13.
Cook, N. H., & Rabinowicz, E. (1963). Physical measurement and analysis. Reading, MA: Addison-Wesley.
De Graaf, C., & Frijters, J. E. R. (1988). “Ratios” and “differences” in perceived sweetness intensity. Perception & Psychophysics, 44, 357–362.
De Graaf, C., Frijters, J. E. R., & van Trijp, H. C. M. (1987). Taste interaction between glucose and fructose assessed by functional measurement. Perception & Psychophysics, 41, 383–392.
Diamond, J., & Lawless, H. T. (2001). Context effects and reference standards with magnitude estimation and the labeled magnitude scale. Journal of Sensory Studies, 16, 1–10.
El Dine, A. N., & Olabi, A. (2009). Effect of reference foods in repeated acceptability tests: Testing familiar and novel foods using 2 acceptability scales. Journal of Food Science, 74, S97–S105.
Ellermeier, W., Westphal, W., & Heidenfelder, M. (1991). On the “absoluteness” of category and magnitude scales of pain. Perception & Psychophysics, 49, 159–166.
Fechner, G. T. (1860). Elemente der Psychophysik. Leipzig: Breitkopf und Härtel.
Frijters, J. E. R. (1979). The paradox of discriminatory nondiscriminators resolved. Chemical Senses, 5, 355–358.
Gent, J. F., Frank, M. E., & Mott, A. E. (1997). Taste testing in clinical practice. In A. M. Seiden (Ed.), Taste and smell disorders (pp. 146–158). New York: Thieme Medical.
Green, B. G., Dalton, P., Cowart, B., Shaffer, G., Rankin, K., & Higgins, J. (1996). Evaluating the ‘labeled magnitude scale’ for measuring sensations of taste and smell. Chemical Senses, 21, 323–334.
Green, B. G., Shaffer, G. S., & Gilmore, M. M. (1993). Derivation and evaluation of a semantic scale of oral sensation magnitude with apparent ratio properties. Chemical Senses, 18, 683–702.
Griep, M. I., Borg, E., Collys, K., & Massart, D. L. (1998). Category ratio scale as an alternative to magnitude matching for age-related taste and odour perception. Food Quality and Preference, 9, 67–72.
Guest, S., Essick, G., Patel, A., Prajapati, R., & McGlone, F. (2007). Labeled magnitude scales for oral sensations of wetness, dryness, pleasantness and unpleasantness. Food Quality and Preference, 18, 342–352.
Hein, K. A., Jaeger, S. R., Carr, B. T., & Delahunty, C. M. (2008). Comparison of five common acceptance and preference methods. Food Quality and Preference, 19, 651–661.
Heller, O. (1985). Hörfeldaudiometrie mit dem Verfahren der Kategorienunterteilung (KU) (Hearing field audiometry using the method of category partitioning (CP)). Psychologische Beiträge, 27, 478–493.
Kurtz, D. B., White, T. L., & Hayes, M. (2000). The labeled dissimilarity scale: A metric of perceptual dissimilarity. Perception & Psychophysics, 62, 152–161.
Lawless, H. T., Horne, J., & Spiers, W. (2000). Contrast and range effects for category, magnitude and labeled magnitude scales in judgements of sweetness intensity. Chemical Senses, 25, 85–92.
Lawless, H. T., Popper, R., & Kroll, B. J. (2010). Comparison of the labeled affective magnitude (LAM) scale, an 11-point category rating scale and the traditional nine-point hedonic scale. Food Quality and Preference, 21, 2–12.
Lawless, H. T., Sinopoli, D., & Chapman, K. W. (2010). A comparison of the labeled affective magnitude scale and the 9-point hedonic scale and examination of categorical behavior. Journal of Sensory Studies, 25, 54–66.
Lim, J., Wood, A., & Green, B. G. (2009). Derivation and evaluation of a labeled hedonic scale. Chemical Senses, 34, 739–751.
Marks, L. E. (1992). The slippery context effect in psychophysics: Intensive, extensive, and qualitative continua. Perception & Psychophysics, 51, 187–198.
Marks, L. E., Borg, G., & Westerlund, J. (1992). Differences in taste perception assessed by magnitude matching and by category-ratio scaling. Chemical Senses, 17, 493–506.
McBride, R. L. (1983). A JND-scale category-scale convergence in taste. Perception & Psychophysics, 34, 77–83.
McBurney, D. H., & Shick, T. R. (1971). Taste and water taste of twenty-six compounds for man. Perception & Psychophysics, 10, 249–252.
Mellers, B. A. (1983). Evidence against “absolute” scaling. Perception & Psychophysics, 33, 523–526.
Mellers, B. A., Davis, D. M., & Birnbaum, M. H. (1984). Weight of evidence supports one operation for “ratios” and “differences” of heaviness. Journal of Experimental Psychology: Human Perception and Performance, 10, 216–230.
Olabi, A., & Lawless, H. T. (2008). Persistence of context effects after training and with intensity references. Journal of Food Science, 73(4), S185–S189.
O’Mahony, M. (1972). The interstimulus interval for taste: 1. The efficiency of expectoration and mouthrinsing in clearing the mouth of salt residues. Perception, 1, 209–215.
Owen, W. J., & DeRouen, T. A. (1980). Estimation of the mean for lognormal data containing zeroes and left-censored values with applications to the measurement of worker exposure to air contaminants. Biometrics, 36, 707–719.
Poulton, E. C. (1979). Models for biases in judging sensory magnitude. Psychological Bulletin, 86, 777–803.
Schifferstein, H. N. J. (1995). Contextual effects in difference judgments. Perception & Psychophysics, 57, 56–70.
Schifferstein, H. N. J., Otten, J. J., Thoolen, F., & Hekkert, P. (2010). An experimental approach to assess sensory dominance in a product development context. Journal of Design Research, 8(2), 119–144.
Schifferstein, H. N. J., Smeets, M. A. M., & Hallensleben, R. (2011). Stimulus sets can induce shifts in descriptor meanings in product evaluation tasks. Acta Psychologica, 138, 237–243.
Schneider, B., & Bissett, R. (1988). “Ratio” and “difference” judgments for length, area, and volume: Are there two classes of sensory continua? Journal of Experimental Psychology: Human Perception and Performance, 14, 503–512.
Schutz, H. G., & Cardello, A. V. (2001). A labeled affective magnitude (LAM) scale for assessing food liking/disliking. Journal of Sensory Studies, 16, 117–159.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Stevens, S. S. (1955). The measurement of loudness. Journal of the Acoustical Society of America, 27, 815–829.
Stevens, S. S. (1971). Issues in psychophysical measurement. Psychological Review, 78, 426–450.
Stevens, S. S., & Galanter, E. H. (1957). Ratio scales and category scales for a dozen perceptual continua. Journal of Experimental Psychology, 54, 377–411.
Teghtsoonian, R. (1973). Range effects in psychophysical scaling and a revision of Stevens’s law. American Journal of Psychology, 86, 3–27.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
Torgerson, W. S. (1961). Distances and ratios in psychological scaling. Acta Psychologica, 19, 201–205.
Zalifah, M. K., Greenway, D. R., Caffin, N. A., D’Arcy, B. R., & Gidley, M. J. (2008). Application of labelled magnitude satiety scale in a linguistically-diverse population. Food Quality and Preference, 19, 574–578.
Zoeke, B., & Sarris, V. (1983). A comparison of ‘frame of reference’ paradigms in human and animal psychophysics. In H. G. Geissler, J. F. J. M. Buffant, E. L. J. Leeuwenburg, & V. Sarris (Eds.), Modern issues in perception (pp. 283–317). Berlin: VEB Deutscher Verlag der Wissenschaften.
Zwislocki, J. J., & Goodman, D. A. (1980). Absolute scaling of sensory magnitudes: A validation. Perception & Psychophysics, 28, 28–38.