Effect of monetary reward and food type on accuracy and assessment time of untrained sensory panelists in triangle tests


Accepted Manuscript

PII: S0950-3293(16)30195-1
DOI: http://dx.doi.org/10.1016/j.foodqual.2016.09.007
Reference: FQAP 3200

To appear in: Food Quality and Preference

Received Date: 16 October 2015
Revised Date: 20 September 2016
Accepted Date: 21 September 2016

Please cite this article as: Loucks, J.N., Eggett, D.L., Dunn, M.L., Steele, F.M., Jefferies, L.K., Effect of Monetary Reward and Food Type on Accuracy and Assessment Time of Untrained Sensory Panelists in Triangle Tests, Food Quality and Preference (2016), doi: http://dx.doi.org/10.1016/j.foodqual.2016.09.007

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Effect of Monetary Reward and Food Type on Accuracy and Assessment Time of Untrained Sensory Panelists in Triangle Tests

Jessilee N. Loucks^a, Dennis L. Eggett^b, Michael L. Dunn^a, Frost M. Steele^a, Laura K. Jefferies^a

^a Department of Nutrition, Dietetics and Food Science, Brigham Young University, S-221 ESC, Provo, UT 84602, USA
^b Department of Statistics, Brigham Young University, 223 TMCB, Provo, UT 84602, USA

Corresponding Author: Laura K. Jefferies, Department of Nutrition, Dietetics and Food Science, Brigham Young University, S-221 ESC, Provo, UT 84602; Phone: 801-422-9290; Fax: 801-422-0258; [email protected]

Abstract

The effects of monetary reward amount and test food type on triangle test accuracy and assessment time for untrained panelists were explored. Monetary compensation is commonly used to reward panelists for their time and effort. While studies have documented that paying panelists can influence hedonic ratings, research on its possible influence on triangle test accuracy or the time taken to assess products is lacking. Relatedly, some studies suggest that assessment time influences accuracy. Furthermore, little research has been conducted on whether the test food itself, and its general likeability, biases panelist accuracy or assessment time. Two liked foods, chocolate chip cookies and cheddar cheese, and two not-as-well-liked foods, green olives and lima beans, were tested. In addition to relating the two response variables to the main effects, interactions with overall expected hedonic liking of the test foods, panelist age, gender, time of day of the test, and day of the week of the test were analyzed. Results indicate that monetary compensation amount did not influence panelist accuracy or assessment time. However, accuracy increased significantly with longer assessment times; the effect of the "liked" versus "not-as-well-liked" food categorization was inconclusive. Expected hedonic liking, gender, age, and time of day of the test were significant, suggesting that test accuracy and assessment time are likely influenced by multiple intrinsic and extrinsic factors. Overall, the results suggest that using broad demographics of untrained consumers for triangle tests yields data that are not consistently or strongly biased by payment amount or food type.

Keywords: Monetary reward; Food type; Incentive; Triangle test; Accuracy; Assessment time

1. Introduction

Three unique observations have been made about using humans as instruments in sensory panels: first, panelists change over time; second, panelists as a group are inconsistent; and third, bias greatly affects human panelists (Meilgaard, Civille, & Carr, 2007). A bias in sensory evaluation is defined as anything that influences a panelist in such a way that their scores do not represent the actual sensory experience (Lawless & Heymann, 2010). Researchers minimize biases by using techniques such as, but not limited to, randomizing samples and labeling them with 3-digit blinding codes. While sensory researchers make great efforts to minimize widely recognized biases as much as possible, the actual impact of some lesser recognized biases is unknown. In particular, the amount of panelist reward, whether direct (monetary compensation) or indirect (the food product being evaluated), and its possible effect on triangle test accuracy and assessment time have not been studied in detail. Although monetary payment is commonly used to incentivize panelists to participate in panels and has also been found to affect resulting data, how these results should guide product developers is uncertain. Bell (1993), for example, found that monetarily rewarded and unrewarded panelist groups scored the exact same foods significantly differently during hedonic testing. He concluded that the presence or lack of monetary incentive influenced consumer scoring; however, it was impossible to determine which scores more accurately represented consumer liking, and he noted that even if unrewarded panelists were to yield more accurate results, payment remains an incentive for participation.
Additionally, two other research groups found that in a performance-based incentivized double triangle test, panelists who were paid had greater accuracy in selecting the odd sample for some foods (Berglund, Lau, & Holm, 1993;

Lau, Post, & Kagan, 1995). With regard to how much sensory panelists should be compensated, Lawless & Heymann (2010) warn that monetary incentives should be only enough to elicit panelist participation in evaluation, and not so extreme as to make the incentive the sole motivation for participation. Little research has explored the possible effects of the degree of liking or disliking of the food being evaluated on panelist accuracy and assessment time in sensory testing. Even though consumer evaluation is primarily conducted with those who are 'likers' or consumers of a product, not all foods are liked equally among and between individuals. Consequently, the degree of liking of the test food may influence panelists' willingness to evaluate foods conscientiously (Meilgaard, Civille, & Carr, 2007). Much of the literature on identifying general incentivizing factors for sensory panelists or survey participants either fails to establish the extent of their effects, has explored the role of payment based on panel performance rather than participation alone, or simply does not exist (Bell, 1993; Berglund, Lau, & Holm, 1993; Lau, Post, & Kagan, 1995; Stone, Bleibaum, & Thomas, 2012). The marked difference in results for rewarded and unrewarded panelists in Bell's study raises the question of whether increasing levels of monetary reward, without the pressure of "performance," might lead to greater panelist attentiveness to sample differences, especially when not-as-well-liked foods are evaluated, even by willing participants. In addition to measuring respondent accuracy, some researchers suggest that panelist motivation can be estimated and easily compared by measuring the relative time required to make a judgement (Brüggen & Dholakia, 2010; Zagorsky & Rhoton, 2008).
A study of general survey participants found that the highest payment, $40, increased interview length for interview-type surveys and the number of items answered for mail surveys compared to a $20 incentive and no incentive (Zagorsky & Rhoton, 2008). Research with web-based surveys also found that motivational rewards increased the number of completed surveys, the words per survey, the comments per survey, and the time spent on the survey (Brüggen & Dholakia, 2010). These results suggest a positive relationship between payment amount and data quality in these contexts, in part due to the increased time taken to complete the surveys. Trained sensory panelists report that extra income is among the most influential motivational factors for participation (Lund, Jones, & Spanitz, 2009), but how payment affects the amount of time they take during testing, or their accuracy in discrimination tests, is unknown. The objective of this study was to evaluate the effects of potential incentivizing biases stemming from monetary incentive amount and food type, and their potential interaction, on untrained panelists by measuring triangle test accuracy and panelist assessment time. The effects of panelist age and gender, overall hedonic scores of the test foods, time of day of the test, and day of the week on which the test was conducted were also studied. It was hypothesized that increased payment amount and evaluation of well-liked foods would lead to increased accuracy and assessment time.

2. Methods

Three preliminary steps were conducted in preparation for the main study triangle tests. The first was to survey a general consumer populace, using an online survey, regarding their past acceptance of a wide variety of foods with expected differences in liking. From these results, two liked and two not-as-liked foods were selected. Then, two samples were selected for each food with the intent that they differ. Each pair was validated for difference using a two-step process, presenting each pair in an initial triangle test and a difference-from-control test.
After validation, the main study, a series of triangle tests, was conducted in which panelists (in each monetary incentive group) received the same payment for evaluating each food pair. All tests were performed at the Brigham Young University Sensory Lab (BYUSL) (Provo, Utah, U.S.A.); sensory data were collected using Compusense five® software (Guelph, Ontario, Canada). The Brigham Young University Institutional Review Board approved all tests.
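The triangle tests used throughout the study compare correct-answer counts against the 1/3 chance level, and Section 2.2 screens food pairs by the proportion of distinguishers (% pd). As a hedged illustration (the counts below are hypothetical, and this is not the authors' analysis code), both quantities can be sketched in a few lines:

```python
# Sketch of standard triangle-test arithmetic (hypothetical counts).
from math import comb

def triangle_p_value(correct, n):
    """One-sided exact binomial P(X >= correct) under p = 1/3 guessing."""
    return sum(comb(n, k) * (1/3)**k * (2/3)**(n - k)
               for k in range(correct, n + 1))

def proportion_distinguishers(correct, n):
    """% pd: fraction of panelists presumed to truly detect the difference.
    Observed proportion correct = pd + (1 - pd)/3, solved for pd."""
    pc = correct / n
    return max(0.0, (3 * pc - 1) / 2)

# Hypothetical screening panel: 14 of 33 panelists correct.
p = triangle_p_value(14, 33)
pd = proportion_distinguishers(14, 33)
print(f"p-value = {p:.3f}, %pd = {100 * pd:.1f}%")
```

With these hypothetical counts, % pd comes out at about 13.6%, under the ≤15% screening guideline described in Section 2.2.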

2.1. Preliminary Test 1: Determination of test foods

In order to determine two well-liked and two not-as-well-liked foods, two online surveys were completed, via convenience sampling, by members of the BYUSL panelist database. Panelists were asked the following question (without the food to taste) for a variety of food products: "Considering your overall impression of (insert food name here), how much do you like or dislike it/them?" Responses were scored on a discrete 9-point hedonic scale with corresponding numerical values (9 = like extremely, 1 = dislike extremely). The average numerical score for each surveyed food was calculated. The first survey asked panelists (n = 1,004) to report their degree of liking/disliking for 29 different foods; a subsequent survey (n = 1,102) expanded the list of foods to 48. Surveyed foods were presented in a randomized order. Data were collected using Qualtrics, LLC survey software (Provo, Utah, U.S.A.). Panelists were not compensated for survey participation. Foods that scored a mean of ≥8 were categorized as well-liked. Foods with a mean >5 to ≤6.5 were considered generally not-as-well-liked, but not disliked. From these surveys, four foods were selected for subsequent study: well-liked, chocolate chip cookies and cheddar cheese; not-as-well-liked, lima beans and green olives with pimentos. Then, two samples were selected for each food with the intent that they differ. Each pair was validated for difference using the two-step process described in Sections 2.2 and 2.3. Table 1 describes the product treatment differences for each food pair.

2.2. Preliminary Test 2: Triangle test treatment validation

Treatments for each food pair to be used in the main study were tested in a preliminary step to validate that the differences were large enough to be detected by panelists but not too obvious. The threshold for validation for all food pairs was set at a guideline of ≤15% proportion of distinguishers (% pd). Since the use of % pd can be product- and situation-dependent, this value was selected in consideration of the wide variety of test foods and their individual characteristics; the goal was that samples within each food pair fell within acceptable consumer expectations and that differences between the pairs were not too obvious (Lawless & Heymann, 2010). Table 1 shows the number of correct responses/number of panelists and the % pd for each food. Triangle tests were conducted as described by Pilgrim and Peryam (1958) with additional specifications. For each test, 30 to 35 untrained panelists were selected from the BYUSL panelist database based on their willingness to try the food (Stone, Bleibaum, & Thomas, 2012) and an absence of allergy to it. Untrained panelists were used for three reasons: trained panelists may have recognized specific training foods as similar to those used in the study; finding enough panelists with the breadth of skill required to become trained in four very different food types would have been challenging, as each payment group evaluated all foods; and trained panelists would likely question why payment amounts in this study differed from payments received during their training or participation in other panels. Panelists, who ranged from frequent attendees to brand new, were also recruited to balance gender and the age groups 18-29, 30-39, 40-49, 50-59, and 60 years and older. Prior to evaluating the samples, panelists successfully completed a practice triangle test on paper to demonstrate that they understood the test protocol. Samples were presented side-by-side on a tray with a napkin, an unsalted cracker, a 5.5 fl. oz. (162.7 mL) translucent plastic cup with room-temperature filtered water, and an appropriate eating utensil. Product-specific serving conditions for each food are described in Table 2. Each panelist received samples in a randomized order, and sample containers were labeled with 3-digit blinding codes. Panelists were asked to select the odd sample among the three presented. After participation, panelists were compensated monetarily for their time and effort. Panelists who participated in this step were not included in the main study triangle tests.

2.3. Preliminary Test 3: Verification of similar differences across foods

Each food pair was then subjected to a difference-from-control test to ensure that the degree of difference within food pairs and across all food products was approximately the same; the goal was that all food pairs were similarly different. One treatment from each food pair was selected to be a labeled control. This control (Treatment 2 as designated in Table 1 for all products), along with a blinded control and the other treatment, was presented side-by-side. Using a 100-point line scale with six guiding descriptors set at 0, 20, 40, 60, 80, and 100 points (left to right: No Difference, Very Slight Difference, Slight Difference, Moderate Difference, Large Difference, Extreme Difference), panelists were asked to indicate the degree of difference of each sample from the labeled control by marking anywhere on the line (Muñoz, Civille, & Carr, 1992). Food pair treatments and sample presentations were the same as described above and in Table 2, except that the cheddar cheese serving size was reduced to 1 cube. The degree of difference within food pairs and across all food products was calculated in a 2-step process. For each food pair, the mathematical difference between the mean score for the blind control and that for the other treatment sample was calculated (Table 3).
The maximum acceptable difference between sample pair mean scores was predetermined to be <20 points (20% of the 100-point scale).
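The two-step check that food pairs are "similarly different" can be illustrated as follows. This is a sketch only: the mean scores below are hypothetical, and the point thresholds are taken from the criteria stated in this section.

```python
# Sketch (hypothetical scores, not the study's data) of the two-step
# difference-from-control check described in Section 2.3.

def pair_difference(blind_control_mean, treatment_mean):
    """Absolute difference between blind control and treatment means."""
    return abs(treatment_mean - blind_control_mean)

def similarly_different(pair_diffs, max_pair=20.0, max_spread=10.0):
    """Step 1: every pair's mean difference < 20 points.
    Step 2: the spread of those differences across products < 10 points."""
    within = all(d < max_pair for d in pair_diffs)
    spread = max(pair_diffs) - min(pair_diffs)
    return within and spread < max_spread

# Hypothetical mean scores on the 100-point difference-from-control scale.
diffs = [pair_difference(8.0, 22.5),   # cookies
         pair_difference(10.0, 26.0),  # cheese
         pair_difference(7.5, 19.0),   # lima beans
         pair_difference(9.0, 20.5)]   # olives
print(diffs, similarly_different(diffs))
```

With these hypothetical means, every pair differs by less than 20 points and the pair differences span less than 10 points, so the set would pass both criteria.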

The differences between all food pair mean scores met this criterion. Second, the differences between the sample pair mean differences were computed to ensure that, while all pairs differed by <20 points, the mean differences were also similar across all food products. The maximum acceptable range for these between-product differences was predetermined to be <10 points. Because mean data were available, significance between pair samples was determined using Tukey's procedure with significance set at 5%; all sample pair means were statistically different. For each food type test, 36 to 40 untrained panelists were selected from the BYUSL panelist database using the criteria mentioned earlier, except that all participants were ≥20 years old. After participation, panelists were compensated for their time and effort. Panelists who participated in this preliminary study were not included in the main study triangle tests.

2.4. Main Study: Triangle Tests

Upon recruiting, panelists from the BYUSL database who had not participated in preliminary panels were randomly placed into 4 monetary incentive treatment groups (approximately 150 panelists/group): $3, $5, $10, or $20 (Figure 1). The $3 payment served as a control, in keeping with the typical payment amount for most visits to the BYUSL. Grouping panelists by payment amount, rather than food type, was done to eliminate the bias that would likely occur if panelists were invited to test the same foods for different compensation amounts, as well as to prevent panelists from deducing that subsequent panel visits were related to earlier ones. Each group participated in a series of 4 triangle tests over a 7-month period, testing each of the food products one time each (Figure 1). Test variables (payment amount, food type tested, day of the week tested, and time of day tested) were balanced between and across test groups.
Each visit was treated as an independent panel; thus, panelists were not told that subsequent triangle panels were linked to earlier visits.

No panelist tested more than 1 product within the same week, and no 2 days of the week were repeated as testing days for any group. Tests were also scheduled such that a single food product was not tested more than once per week; in practice, approximately 2 weeks separated successive tests of each product. Furthermore, panel start times were balanced between morning (9 panels) and afternoon (7 panels). Panelists were recruited, prepared, and tested using the same criteria and written practice test described earlier. While completing the test survey, but before evaluating the samples, panelists were asked, "Considering your overall impression of (insert food name here), how much do you like or dislike it/them?" and responded on a discrete 9-point hedonic scale (9 = like extremely, 1 = dislike extremely). The purpose of this question was to assess each panelist's personal expected liking/disliking of the food being tested. Samples were presented to panelists as described earlier and in Table 2 (except for cheddar cheese, which had been reduced to 1 cube). Assessment time was determined by recording, to the nearest second, how long it took panelists to finish the test once they received their samples.

2.5. Data Analysis

Data were analyzed for significance using Statistical Analysis Software version 9.3 (Cary, North Carolina, U.S.A.). Panelist accuracy data (number of correct answers) were analyzed using a mixed model logistic analysis due to their binary nature (correct versus incorrect). Panelist assessment time data were analyzed using a mixed model analysis of covariance due to their continuous nature (time in seconds). Additionally, a mixed model analysis of variance using Tukey's procedure (with a 5% level of significance) for both accuracy and assessment time was conducted by comparing accuracy and assessment time results for each individual visit (first through fourth), to ensure that earlier triangle test visits did not have a possible training effect on panelists as they returned for subsequent panel visits. Independent variables of panelist, payment amount, food type, panelist-specific expected liking score of the food being tested, age group, gender, day of the week of the test, time of day the test was taken, visit order, and all potential interactions were included in the initial models for the accuracy and assessment time analyses. Adjusted P values, using Tukey's procedure, are reported for the odds ratios for accuracy based on payment amount comparisons (Table 4) and for assessment time results by age group comparisons (Table 6), in keeping with standard mixed model analyses. Accuracy data on the probability of selecting the correct answer by gender and age (Table 5) are not adjusted, as adjustments were only made for pairwise comparisons of specific interest and to maintain α = 0.05. Independent variables and their respective interactions that were not significant were removed from each model and the data reanalyzed.
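The accuracy analysis reports its payment-amount comparisons as odds ratios (Table 4). As a minimal illustration of what such a ratio measures (the counts below are hypothetical, and the actual study fitted a full mixed model in SAS 9.3, not this two-group calculation):

```python
# Sketch: forming an odds ratio for triangle-test accuracy between two
# payment groups (hypothetical counts; not the study's mixed-model estimate).

def odds(correct, incorrect):
    """Odds of a correct answer: correct / incorrect."""
    return correct / incorrect

def odds_ratio(correct_a, incorrect_a, correct_b, incorrect_b):
    """Odds of a correct answer in group A relative to group B."""
    return odds(correct_a, incorrect_a) / odds(correct_b, incorrect_b)

# Hypothetical: group A 60/90 correct/incorrect vs group B 75/75.
or_a_vs_b = odds_ratio(60, 90, 75, 75)
print(f"OR = {or_a_vs_b:.2f}")  # OR < 1 means group A had lower odds
```

An odds ratio below 1, as reported for the $10 versus $20 comparison in Section 3.1, indicates lower odds of a correct answer in the first group.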

3. Results

3.1. Accuracy

Both payment amount (p = 0.0227) and food type (p = 0.0005) were significant with respect to panelist accuracy. However, the only significant difference in accuracy with respect to payment amount was between the $10 and $20 payments (OR: 0.68, p = 0.02) (Table 4); no other pair of payment amounts differed significantly in accuracy. There was no significant interaction between expected hedonic liking scores for any of the four foods and accuracy (p = 0.6377). However, panelists selected the "odd" sample among green olive samples, a not-as-well-liked food, significantly more often than for the well-liked foods, cheese and cookies. The gender and age interaction was significant (p = 0.0140). Table 5 shows the probability (as a percentage) of panelists (n = 605), by gender and age category, of selecting the correct sample in a triangle test. Higher probability values represent a greater likelihood for the specified gender/age group to select the correct sample. Comparing individual interaction probabilities, women ages 40-59 years appear less likely to select the correct sample than men in the same age range, while women ages 18-39 years appear more likely to select the correct sample than men their age. As men aged, their probability of a correct answer increased consistently over each category, with men ≥60 years having a greater probability of a correct answer than their female counterparts, whose ability to select the correct answer was lowest in this age range. Women's highest probability of selecting the correct answer was in their 30s. The relationship between panelist accuracy and assessment time was also significant (p < 0.0001): for every additional second a panelist took to complete the test, their chance of selecting the correct sample increased by 0.2%. It is also of note that there was no significant food type by payment amount interaction with respect to panelist accuracy.

3.2. Assessment Time

Food type (p < 0.0001), but not payment amount, significantly influenced the amount of time panelists took to complete the test. There was no significant food type by payment interaction with respect to assessment time. Panelists took significantly longer to assess foods that were well-liked compared to foods that were not as well-liked. Panelist-specific expected liking of the test food was significant (p < 0.0001): for every 1-point increase on the 9-point hedonic scale, the time a panelist spent taking the test increased by an average of 7.5 seconds. The effect of panelist age was also significant (p < 0.0001). Table 6 shows the differences in assessment time among all age groups. The two oldest age groups took significantly longer than all younger groups, with the exception of the differences between the 50-59 year olds and the 20-29 (p = 0.38) and 30-39 year olds (p = 0.06). Interestingly, the time of day the test started was significant (p = 0.026) with regard to assessment time: for each hour later in the day that the test started, panelists took an average of 6.84 seconds longer to complete it. Furthermore, the day of the week on which the test was given was also significant for assessment time (p = 0.0245) (data not shown).
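Two of the per-unit effects above can be turned into concrete magnitudes. A sketch, assuming the 0.2%-per-second figure acts multiplicatively on the odds of a correct answer (an assumption about the model's scale, not stated explicitly in the results) and using the reported 7.5 seconds per hedonic point:

```python
# Sketch: magnitudes implied by the reported per-unit effects.

def odds_multiplier(extra_seconds, per_second=1.002):
    """Cumulative odds multiplier if each extra second raises the odds
    of a correct answer by 0.2% (assumed multiplicative)."""
    return per_second ** extra_seconds

def expected_extra_time(hedonic_point_increase, per_point=7.5):
    """Extra assessment seconds per point of expected hedonic liking."""
    return per_point * hedonic_point_increase

gain = odds_multiplier(60)  # a panelist taking 60 extra seconds
print(f"odds multiplied by {gain:.3f} (~{(gain - 1) * 100:.1f}% higher)")
print(f"{expected_extra_time(2):.1f} s longer for a 2-point liking increase")
```

Under these assumptions, a panelist taking a minute longer would have roughly 13% higher odds of a correct answer, and a 2-point higher expected liking score corresponds to about 15 seconds more assessment time.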

4. Discussion

4.1. Accuracy

As the only significant difference in accuracy was between payments of $10 and $20, this finding is assumed to be an anomaly. Still, while payment amounts in this study ranged from typical to high, it could be argued that an even higher payment amount might trigger a sense of increased responsibility for performance that was simply not captured by our upper limit of $20. As payment amounts <$3 and >$20 were not tested, the effects of such amounts are unknown. In studies on mail surveys, researchers found that the response rate increased more when payment rose from $0 to $1 than when it rose from $1 to $5 or even from $5 to $20 (James & Bolstein, 1992). This suggests that minimal amounts of money are needed to solicit participation in a survey, but that greater amounts do not further increase the response. It is possible that in our study, similar to that of James and Bolstein, a small amount of money is needed to motivate untrained panelists to attend an external panel at a central location test (CLT) site, but further increases may not lead to increased panelist accuracy during the test. This may be because panelists at CLT sites feel their most important job is to arrive at the location and perform the desired task.

In light of these findings, it appears (1) that while payment amount likely incentivizes panelists to participate in panels, its influence on intrinsic panelist behavior, such as effort or perception, is nominal; perhaps panelists associate payment with participation as a whole and not with their performance; and (2) that the ability to determine differences between samples is far more complex than the presence of one or a few extrinsic incentivizing factors. The influence of food type, while statistically significant, showed an increase in accuracy for only one of the two not-as-well-liked foods, green olives. Perhaps assessment using more food types in both categories would yield a clearer and more consistent differentiation between them; alternatively, food type overall may not be of practical significance. The lack of significance between expected liking and accuracy supports this. Mean expected liking scores for the food types were as follows: chocolate chip cookies (8.3), cheddar cheese (8.0), lima beans (7.1), and green olives (5.8). As the relationships between expected hedonic liking scores for any of the four foods and accuracy were not significant, and as there was no consistent pattern between accuracy and food type, these results tend to support the idea that how much a food is liked/disliked has little to no bearing on panelists' ability to discern between similar samples. This is supported by Lund, Jones, & Spanitz (2009), who concluded that a general motivator for people to participate in and remain on a trained sensory panel was a general interest in food.

4.2. Assessment time

The observation that panelists who like a particular food spend more time evaluating it, and that panelists who spend more time evaluating test samples are more likely to answer a triangle test correctly, suggests that panelists may be more accurate with better-liked foods. The extra time spent with better-liked foods may reflect panelists "savoring" foods they liked, hurrying through those they did not, or both. While the amount of food subjects ate was not measured, they may have consumed greater quantities of the foods they liked, which would have contributed to longer assessment times.

4.3. Accuracy and Assessment Time: Other potential influences

4.3.1. Age and Gender

It is clear that panelist age, and the age by gender interaction, have some effect on panelist behavior. In a study that explored how well panelists could discriminate the presence or absence of the spice marjoram in a soup, elderly subjects were less likely to detect it than young subjects (Cain, Reid, & Stevens, 2001). This is further supported by evidence of a decreased ability of older subjects to distinguish between mixtures of taste and odor compounds compared to younger subjects (Kaneda, Maeshima, Goto, Kobayakawa, Ayabe-Kanamura, & Saito, 2000). As the results of the present study instead suggest an increased likelihood of distinguishing between similar samples with increasing age, the ability to do so may be due to other influences such as cultural background (Wilkie, Wooton, & Paton, 2004). Other possible explanations are that the studies observing a decline in sensory discrimination ability did not examine the same breadth of food types used in this one, that the aromatic compounds and intensities of this study's foods differed from those of other studies, or that the panelists over age 60 years in the present study had not collectively reached the point of diminished sensory abilities seen in the cited studies. Although it was men who showed greater accuracy with increasing age, while both genders took longer to complete the test as they aged, the correlation between cognitive performance in males and females varies by task, living conditions, education, and numerous other factors (Weber, Skirbekk, Freund, & Herlitz, 2014). In academic settings, the relationship between accuracy and time to complete a test is mixed, as some studies suggest that speed is inversely

proportional to accuracy (Larner, 2015), while others conclude that there is no relationship (Bridges, 1985; Beaulieu & Frost, 1994). It is important to note that panelists in the present study, unlike those in an academic setting, were unaware that they were being timed and that the "test" they were taking was not performance-based.

4.3.2. Panel Timing

The effect of panel start time on assessment time may seem minor, but when summed across a day it becomes appreciable. Based on these findings, a panelist who arrives at 3:00 PM might spend over 40 seconds longer evaluating the samples than a panelist who comes in at 9:00 AM. With panelists in this study taking about 220 seconds on average to complete the evaluation, roughly 18% (40 seconds) of that time could be attributed to the later start time for panelists participating in a later panel, again demonstrating the importance of this variable as a factor in paneling. While accuracy tends to increase with increased assessment time for some panelists, research in academic settings reveals that test-takers perform better in the morning than in the afternoon, due to cognitive fatigue as the day progresses (Sievertsen, Gino, & Piovesan, 2016; Wise, Ma, Kingsbury, & Hauser, 2010). Day of the week was found to be significant for assessment time here, in contrast to Wise and others, who found that this variable was not significant in educational tests. Such patterns may be related to workday accumulation as reported by Benedetti, Diefendorff, Gabriel, and Chandler (2015). These researchers found that as a workday continues, workers put less effort into accomplishing tasks; however, the negative effect of the continuing workday may be diminished when a break from workday tasks is taken. When participating in a CLT, a panelist's normal workday is disrupted when they leave their

workspace to complete a consumer acceptance test or difference test. Additionally, considering the conclusions of Boulianne (2013), who found social and nonmonetary factors to be highly motivating to questionnaire participants, it may be that after a long workday a visit to the CLT not only acts as a reprieve from work but is also additionally motivating because it affords social interaction. On the other hand, it could be that panelists feel less time-constrained later in the day and thus take longer, or that the long workday makes concentrating on the test harder.
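The back-of-envelope start-time arithmetic in Section 4.3.2 (6.84 extra seconds per hour later, a roughly 40-second difference between 9:00 AM and 3:00 PM, against an average assessment of about 220 seconds) can be reproduced directly:

```python
# Reproducing the Section 4.3.2 arithmetic from the reported effect sizes.
PER_HOUR = 6.84            # extra seconds per hour later in the day
MEAN_TEST_SECONDS = 220.0  # approximate average assessment time

extra = PER_HOUR * (15 - 9)        # 3:00 PM vs 9:00 AM start (6 hours)
share = extra / MEAN_TEST_SECONDS  # fraction of an average assessment
print(f"{extra:.1f} s longer (~{share * 100:.0f}% of the average time)")
```

The six-hour gap yields about 41 extra seconds, roughly 18-19% of the average assessment time, matching the approximate figures quoted in the discussion.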

5. Conclusions

At CLT sites, panelists commonly receive monetary incentives to encourage them to attend and to take the test thoughtfully. It appears that an increased monetary incentive will not conclusively increase panelist accuracy or assessment time in triangle tests. Increased assessment time directly influences accuracy, but it is accompanied by a multiplicity of other factors: demographics, test day, and test time were all found to be significant. Food type and higher expected hedonic scores increased the amount of time taken to perform the test. These results suggest that using broad demographics of untrained consumers for triangle tests yields data that are not consistently or strongly biased by payment amount or food type. However, more research is required to further explore these conclusions.

Acknowledgements

The authors thank the Brigham Young University Sensory Lab and research assistants Maren Clark Long, Jessie Winfield, and Jared Simmons for assistance in gathering data.

References

Beaulieu, R. P., & Frost, B. (1994). Another look at the time-score relationship. Perceptual and Motor Skills, 78, 40-42.

Bell, R. (1993). Some unresolved issues of control in consumer tests: The effects of expected monetary reward and hunger. Journal of Sensory Studies, 8(4), 329-340. doi:10.1111/j.1745-459X.1993.tb00223.x

Benedetti, A. A., Diefendorff, J. M., Gabriel, A. S., & Chandler, M. M. (2015). The effects of intrinsic and extrinsic sources of motivation on well-being depend on time of day: The moderating effects of workday accumulation. Journal of Vocational Behavior, 88, 38-46. doi:10.1016/j.jvb.2015.02.009

Berglund, P. T., Lau, K., & Holm, E. T. (1993). Improvement of triangle test data by use of incentives. Journal of Sensory Studies, 8(4), 301-316. doi:10.1111/j.1745-459X.1993.tb00221.x

Boulianne, S. (2013). Examining the gender effects of different incentive amounts in a web survey. Field Methods, 25(1), 91-104. doi:10.1177/1525822X12453113

Bridges, K. R. (1985). Test-completion speed: Its relationship to performance on three course-based objective examinations. Educational and Psychological Measurement, 45, 29-35.

Brüggen, E., & Dholakia, U. (2010). Determinants of participation and response effort in web panel surveys. Journal of Interactive Marketing, 24, 239-250. doi:10.1016/j.intmar.2010.04.004

Cain, W. S., Reid, F., & Stevens, J. C. (1990). Aging and the discrimination of flavor. Journal of Nutrition for the Elderly, 9(3), 3-15. doi:10.1300/J052v09n03_02

James, J., & Bolstein, R. (1992). Large monetary incentives and their effect on mail survey response rates. Public Opinion Quarterly, 56(4), 442-453. doi:10.1086/269336

Kaneda, H., Maeshima, K., Goto, N., Kobayakawa, T., Ayabe-Kanamura, S., & Saito, S. (2000). Decline in taste and odor discrimination abilities with age, and relationship between gustation and olfaction. Chemical Senses, 25(3), 331-337. doi:10.1093/chemse/25.3.331

Larner, A. J. (2015). Speed versus accuracy in cognitive assessment when using CSIs. Progress in Neurology and Psychiatry, Jan/Feb, 21-24.

Lau, K., Post, G., & Kagan, A. (1995). Using economic incentives to distinguish perception bias from discrimination ability in taste tests. Journal of Marketing Research, 32, 140-151.

Lawless, H. T., & Heymann, H. (2010). Sensory evaluation of food: Principles and practices (2nd ed.). New York, NY: Springer.

Lund, C., Jones, V., & Spanitz, S. (2009). Effects and influences of motivation on trained panelists. Food Quality and Preference, 20, 295-303. doi:10.1016/j.foodqual.2009.01.004

Meilgaard, M., Civille, G. V., & Carr, B. T. (2007). Sensory evaluation techniques (4th ed.). Boca Raton, FL: Taylor & Francis.

Muñoz, A. M., Civille, G. V., & Carr, B. T. (1992). Sensory evaluation in quality control. New York, NY: Van Nostrand Reinhold.

Pilgrim, F. J., & Peryam, D. R. (Eds.). (1958). Sensory testing methods: A manual. Chicago, IL: ASTM International.

Sievertsen, H. H., Gino, F., & Piovesan, M. (2016). Cognitive fatigue influences students' performance on standardized tests. PNAS, 113(28), 2621-2624.

Stone, H., Bleibaum, R. N., & Thomas, H. A. (2012). Sensory evaluation practices (4th ed.). Boston, MA: Elsevier/Academic Press.

Weber, D., Skirbekk, V., Freund, I., & Herlitz, A. (2014). The changing face of cognitive gender differences in Europe. PNAS, 111(32), 11673-11678.

Wilkie, K., Wootton, M., & Paton, E. (2004). Sensory testing of Australian fragrant, imported fragrant, and non-fragrant rice aroma. International Journal of Food Properties, 7(1), 27-36.

Wise, S. L., Ma, L., Kingsbury, G. G., & Hauser, C. (2010). An investigation of the relationship between time of testing and test-taking effort. https://www.nwea.org/resources/investigation-relationship-time-testing-test-taking-effort/ Accessed 07.14.16.

Zagorsky, J. L., & Rhoton, P. (2008). The effects of promised monetary incentives on attrition in a long-term panel survey. Public Opinion Quarterly, 72, 502-513. doi:10.1093/poq/nfn025

Table 1
Details of each food product and the differences between the two treatments presented in the triangle tests. Where applicable, products were homogenized (if not from the same lot) before the different treatments were created. % pd refers to the percentage of true discriminators, calculated as % pd = [(3C − N) / (2N)] × 100, where C = number of correct responses and N = number of panelists.

Product | Treatment 1 | Treatment 2 | Difference Type | Correct responses / total panelists (% pd)
Chocolate Chip Cookies | Store-bought cookies rebaked at 300 °F (148.9 °C) (conduction bake) for 2 minutes | Store-bought cookies rebaked at 300 °F (148.9 °C) (conduction bake) for 6 minutes | Appearance, flavor, texture | 14/32 (15.6%)
Cheddar Cheese | Store brand sharp cheese | Store brand extra sharp cheese | Ripening time | 12/32 (6.02%)
Lima Beans | 49.5% water, 49.5% lima beans, 1% sucrose; heated to 150 °F (65.6 °C) | 49% water, 49% lima beans, 2% sucrose; heated to 150 °F (65.6 °C) | Sweetness | 12/33 (4.5%)
Green Olives | Name brand pitted, pimento-stuffed green olives | Store brand pitted, pimento-stuffed green olives | Brand difference | 14/32 (15.6%)
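The % pd values in the last column come from the chance-corrected discrimination formula in the caption (a triangle test carries a 1/3 guessing probability). A minimal sketch of the calculation (the function name `percent_pd` is our own):

```python
def percent_pd(correct: int, total: int) -> float:
    """Percentage of true discriminators in a triangle test.

    Corrects the observed proportion correct for the 1-in-3 chance
    of guessing right: pd = (3*Pc - 1) / 2, truncated at zero.
    """
    pc = correct / total           # observed proportion correct
    pd = (3 * pc - 1) / 2          # chance-corrected proportion
    return max(pd, 0.0) * 100      # negative estimates reported as 0

# For the cookies: percent_pd(14, 32) reproduces the reported 15.6%
```

Small rounding differences from the published values are possible (e.g., this formula gives 6.25 rather than 6.02 for the cheese row).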

Table 2
Serving conditions for each product.

Product | Serving Temp | Container | Utensil | Amount | Other
Chocolate Chip Cookies | Room temperature | 1/4 lb. boat (113.4 g) | None | 2 cookies, 11 oz (312 g) | None
Cheddar Cheese | Room temperature | 3.25 fl. oz. portion cup (96.1 mL) | Toothpick | 1 or 2 1-inch cubes of cheese (~32 g) | Served with 2 unsalted crackers
Lima Beans | 170-180 °F (76.7-82.2 °C) from steam table | 3.25 fl. oz. portion cup (96.1 mL) | Fork | 1.5 oz of lima beans, strained (42.5 g) | None
Green Olives | Room temperature | 3.25 fl. oz. portion cup (96.1 mL) | Fork | 8 green olives | None

Table 3
Results from each difference-from-control test, showing means and differences of means for each food pair. Differences between means were significant for all food pairs (n = 36 each).

Food | Sample | Sample Mean | Difference Between Sample Means | P Value
Cookies | Blind control | 22.2 | 11.4 | 0.005
Cookies | Test sample | 33.6 | |
Cheddar Cheese | Blind control | 28.4 | 18.5 | 0.000
Cheddar Cheese | Test sample | 46.9 | |
Lima Beans | Blind control | 32.4 | 11.7 | 0.033
Lima Beans | Test sample | 44.1 | |
Green Olives | Blind control | 30.3 | 15.6 | 0.001
Green Olives | Test sample | 45.9 | |

[Figure 1 flow: Surveys to determine potential test foods → Well-liked foods (≥ 8) and not-as-well-liked foods (> 5 to ≤ 6.5) → Triangle test treatment validation (15% or fewer proportion of discriminators) → Verification of triangle treatment differences across products in difference-from-control test (difference of all treatment differences < 10%) → Main study.]

Figure 1
Flow chart of preliminary studies leading to the main study triangle tests, grouped by food expected liking and payment groups. Panelist groups were blocked by money amounts of $3, $5, $10 and $20.

Table 4
Accuracy results comparing odds ratios across payment levels. Accuracy is based on whether the triangle test was answered correctly. A positive estimated difference in log odds indicates that Payment 1 panelist accuracy is higher than that of Payment 2; a negative difference indicates that it is lower. The odds ratio compares the probability of answering the triangle test correctly under Payment 1 with the probability under Payment 2: an odds ratio >1 indicates that the probability is greater under Payment 1, and the opposite is true if the odds ratio is <1.

Payment 1 ($) | Payment 2 ($) | Estimated Difference in Log Odds | Odds Ratio | Adjusted Lower Confidence Interval | Adjusted Upper Confidence Interval | Standard Error | Adjusted P Value
3 | 5 | -0.154 | 0.86 | 0.60 | 1.23 | 0.1417 | 0.696
3 | 10 | 0.122 | 1.13 | 0.78 | 1.63 | 0.1415 | 0.826
3 | 20 | -0.259 | 0.77 | 0.54 | 1.10 | 0.1385 | 0.240
5 | 10 | 0.276 | 1.32 | 0.93 | 1.86 | 0.1340 | 0.167
5 | 20 | -0.105 | 0.90 | 0.64 | 1.26 | 0.1314 | 0.855
10 | 20 | -0.381 | 0.68 | 0.49 | 0.96 | 0.1317 | 0.020
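The odds-ratio column is the exponential of the estimated difference in log odds, so the two columns can be cross-checked directly; a quick illustrative sketch:

```python
import math

def odds_ratio(log_odds_diff: float) -> float:
    """Convert an estimated difference in log odds into an odds ratio."""
    return math.exp(log_odds_diff)

# $10 vs $20 comparison: exp(-0.381) rounds to the tabled 0.68,
# i.e., the odds of a correct answer are lower at $10 than at $20
```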

Table 5
Probability (expressed as a proportion) of selecting the correct answer in a triangle test for each comparative gender/age group. Lower and upper confidence intervals show the limits of the probabilities.

Gender | Age | n | Probability | Lower Confidence Interval | Upper Confidence Interval | Standard Error
Female | 18-29 | 152 | 0.442 | 0.395 | 0.489 | 0.100
Female | 30-39 | 143 | 0.545 | 0.457 | 0.630 | 0.176
Female | 40-49 | 47 | 0.369 | 0.276 | 0.473 | 0.214
Female | 50-59 | 54 | 0.461 | 0.373 | 0.551 | 0.182
Female | 60+ | 35 | 0.342 | 0.244 | 0.457 | 0.240
Male | 18-29 | 43 | 0.383 | 0.338 | 0.431 | 0.100
Male | 30-39 | 38 | 0.406 | 0.333 | 0.485 | 0.159
Male | 40-49 | 36 | 0.417 | 0.332 | 0.507 | 0.182
Male | 50-59 | 26 | 0.445 | 0.350 | 0.544 | 0.199
Male | 60+ | 31 | 0.508 | 0.415 | 0.600 | 0.187

Table 6
Differences in mean assessment times between age groups. Time was measured from when panelists received their samples to when they indicated they were finished. The difference in means equals the mean time for Age Group 1 minus the mean time for Age Group 2: a positive value means that Age Group 1 took longer to complete the test than Age Group 2; a negative value means that Age Group 1 took less time. n = 605.

Age Group 1 (years) | Age Group 2 (years) | Difference in Means (seconds) | Adjusted Lower Confidence Interval | Adjusted Upper Confidence Interval | Standard Error | Adjusted P Value
20-29 | 30-39 | 13.7 | -11.3 | 38.8 | 9.2 | 0.563
20-29 | 40-49 | 28.8 | 1.1 | 56.4 | 10.1 | 0.037
20-29 | 50-59 | -9.0 | -37.1 | 19.0 | 10.3 | 0.904
20-29 | 60+ | -33.5 | -64.1 | -2.9 | 11.2 | 0.024
30-39 | 40-49 | 15.0 | -17.2 | 47.2 | 11.8 | 0.707
30-39 | 50-59 | -22.8 | -55.6 | 10.1 | 12.0 | 0.322
30-39 | 60+ | -47.2 | -82.3 | -12.1 | 12.8 | 0.002
40-49 | 50-59 | -37.8 | -72.3 | -3.3 | 12.6 | 0.024
40-49 | 60+ | -62.2 | -99.2 | -25.3 | 13.5 | <0.0001
50-59 | 60+ | -24.4 | -61.1 | 12.2 | 13.4 | 0.361
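The adjusted confidence intervals for these age-group comparisons appear consistent with estimate ± c·SE for a single adjusted critical multiplier c ≈ 2.73 (our inference from the tabled numbers, not stated by the authors; a multiplier of this size is typical of multiplicity-adjusted pairwise comparisons). A sketch recovering the implied multiplier:

```python
def implied_multiplier(lower: float, upper: float, se: float) -> float:
    """Half-width of a confidence interval divided by its standard error."""
    return (upper - lower) / (2 * se)

# Three (lower CI, upper CI, SE) triples taken from the table
rows = [(-11.3, 38.8, 9.2), (1.1, 56.4, 10.1), (-99.2, -25.3, 13.5)]
multipliers = [implied_multiplier(lo, hi, se) for lo, hi, se in rows]
# each value comes out near 2.73
```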

Highlights
• Monetary compensation amount is not a primary factor in triangle test accuracy or assessment time for untrained panelists.
• Accuracy increased with increased assessment time.
• Assessment time and accuracy were affected differently by the level of liking of test foods.
• Demographic factors, the time of day the test began, and the day of the week the test was conducted were significant factors influencing assessment time.