Transportation Research Part C 103 (2019) 129–141
Transportation Research Part C journal homepage: www.elsevier.com/locate/trc
Using Internet-based marketplaces to conduct surveys: An application to airline itinerary choice models
Mohammad Ilbeigi a, Virginie Lurkin b, Laurie A. Garrow c,⁎
a Bowling Green State University, Department of Construction Management, Bowling Green, OH 43403-0001, United States
b TU/e, Eindhoven University of Technology, Eindhoven, the Netherlands
c Georgia Institute of Technology, School of Civil and Environmental Engineering, 790 Atlantic Drive, Atlanta, GA 30332-0355, United States
ARTICLE INFO
ABSTRACT
Keywords: Survey design Online surveys Amazon Mechanical Turk Air travel behavior Airline itinerary choice
Within the transportation community, there has been increasing interest in using online outsourcing platforms such as Amazon Mechanical Turk (AMT) to conduct surveys. To date, transportation researchers’ use of AMT has been justified based on findings from studies in other fields. That is, to the best of our knowledge, no study has evaluated how the distribution of responses to each question, and the behavioral models estimated from AMT survey data, compare to those obtained from survey data collected on a traditional platform for a travel behavior application. This paper fills an important gap in the literature by examining (1) whether the distributions of responses from AMT and Qualtrics (a traditional market research firm) respondents are statistically equivalent, and (2) whether itinerary choice models estimated from these two surveys are statistically equivalent. Results show that AMT and Qualtrics respondents reported similar air trip characteristics and were drawn from a similar geographic distribution, but they exhibited distinct sociodemographic characteristics. After controlling for different age distributions in the two datasets, we found that airline itinerary choice models estimated from the AMT and Qualtrics survey data produced similar results, with the key difference related to price sensitivities. Our study provides preliminary evidence on the viability of using AMT and similar online outsourcing platforms for air travel behavior studies.
1. Introduction and motivation
Motivated by the ability to collect survey data more cheaply and quickly, and from a larger, more diverse participant pool, than traditional data collection approaches allow, many researchers have used Amazon Mechanical Turk (AMT) and other internet-based marketplaces to conduct surveys (see Berinsky et al., 2012; Cantor et al., 2014; Searles and Ryan, 2015). A report by the Pew Research Center finds that during the week of Dec. 7–11, 2015, 89% of tasks posted by academics on AMT consisted of surveys (Hitlin, 2016). Since AMT’s launch in 2005, there has been exponential growth in scholarly references that mention “Amazon Mechanical Turk” (Garrow et al., 2018; Harmes and DeSimone, 2015). To put this growth in context, in 2017 there were more than 10,600 references that mentioned “Amazon Mechanical Turk” (Garrow et al., 2018, Fig. 1, adapted from Harmes and DeSimone, 2015). The exponential growth in AMT articles is not isolated to particular fields, although the use of AMT to conduct surveys is more prevalent in the social sciences, including political science, psychology, and industrial organization (e.g., see Berinsky et al., 2012; Highhouse and Zhang, 2015; Landers and Behrend, 2015; Searles and Ryan, 2015). To date, there have been limited studies that have used AMT
⁎
Corresponding author. E-mail addresses:
[email protected] (M. Ilbeigi),
[email protected] (V. Lurkin),
[email protected] (L.A. Garrow).
https://doi.org/10.1016/j.trc.2019.03.025 Received 3 July 2018; Received in revised form 22 February 2019; Accepted 26 March 2019 Available online 15 April 2019 0968-090X/ © 2019 Elsevier Ltd. All rights reserved.
survey data within the transportation behavioral research and supply chain fields. A search of the keyword “Mechanical Turk” conducted in June 2018¹ revealed that just seven articles in Transportation Research Parts A to E and Transportation had used AMT. These articles study a variety of topics, including plug-in hybrid electric vehicles (Krupa et al., 2014), electric vehicles (Helveston et al., 2015), autonomous vehicles (Haboucha et al., 2017; Deb et al., 2018), supply chains (Cantor et al., 2014), design of subway maps (Guo et al., 2017), and carpooling behaviors (Neoh et al., 2018). A similar search conducted in Production and Operations Management (POMS) in June 2018 revealed seven articles that had used AMT to study supply chain–related research questions; these studies cover remanufactured products (Abbey et al., 2017), customer perceptions of peak supply events (Dixon et al., 2017), management of innovation projects (Hutchison-Krupat and Chao, 2014), peer-to-peer networks (Jiang et al., 2017), framing effects in inventory control decisions (Tokar et al., 2016), scoring versus ranking decision processes within innovation management (Cui and Kumar, in press), and the applicability of AMT for behavioral operations management applications (Lee, Seo, and Siemsen, 2018). Many researchers have questioned whether survey data collected through online marketplaces are similar to survey data collected through traditional survey platforms (Hitlin, 2016), and our approach is similar to that of other researchers. For example, Gosling and colleagues compared the demographics of over 3,000 AMT participants with those in a large internet sample (Gosling et al., 2004). Krupa and colleagues compared the demographics of their sample of AMT participants with a state population (Krupa et al., 2014), and a report by the Pew Research Center compared the demographics of an AMT sample to the U.S. population (Hitlin, 2016).
There are two key validity concerns that are often discussed in the literature (see Berinsky et al., 2012; Cantor et al., 2014; Hutchison-Krupat and Chao, 2014). First, researchers question whether the distribution of responses associated with each question is statistically equivalent across both populations. Second, researchers question whether behavioral models estimated from data from these two populations provide similar interpretations and statistically equivalent results. On one hand, it is straightforward to investigate whether samples and models estimated on data obtained from online marketplaces and traditional survey platforms are similar—the researcher only needs to execute the same survey on both platforms. The challenge is that survey costs can be prohibitive, i.e., it can cost an order of magnitude more to execute the same survey with a traditional market research firm (such as Qualtrics) versus an online marketplace (such as AMT). This paper addresses this gap in the transportation literature by administering the same survey instrument for a U.S.-based population on AMT and Qualtrics. The survey was focused on air travelers and used a mock-up of an online travel agency website to collect data for estimating airline itinerary choice models. This paper complements a previous study conducted by the research team (Garrow et al., 2018). That study focused on the mechanics of how to conduct surveys in AMT and discussed unexpected challenges encountered in the survey execution that other transportation researchers will hopefully be able to avoid. These challenges do not influence the quality of survey results but relate to managing the data collection process and ensuring a respondent did not inadvertently take the survey multiple times. 
The focus of this current paper is analyzing the survey results to address two fundamental research questions in the context of a transportation-related application: (1) Are the distributions of responses from AMT and Qualtrics respondents statistically equivalent? and (2) Are itinerary choice models estimated from these two surveys statistically equivalent?
2. Statistical analysis
2.1. Are AMT and Qualtrics respondents similar?
To investigate whether survey respondents from AMT and Qualtrics were statistically equivalent, we used chi-square tests to determine whether the distribution of responses to each question on the survey was statistically equivalent. When using the chi-square test of homogeneity, the null hypothesis is:
H0: AMT and Qualtrics populations are homogeneous with respect to the categories.
The test statistic is given as
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \tag{1} \]
where
\(o_{ij}\) is the observed frequency count for category i and population j,
\(e_{ij}\) is the expected frequency count for category i and population j,
r is the number of bins in the corresponding categorical question,
c is the number of populations (i.e., two in our case).
Conceptually, if the population frequencies for AMT and Qualtrics are similar, then we expect the proportions of respondents in each bin to be similar for AMT and Qualtrics. When using the chi-square test for testing the homogeneity of populations, the null hypothesis is rejected if
\[ \chi^2_{\text{statistic}} > \chi^2_{k,\alpha} \tag{2} \]
where
k is the degrees of freedom and is equal to (number of columns − 1) × (number of rows − 1),
\(\alpha\) is the significance level (in our case, \(\alpha\) = 0.05).
A statistic above the critical value means that the observed frequency counts are far apart from the expected frequency counts and we conclude that the proportions for one or more categories differ for AMT and Qualtrics respondents. For additional information on the chi-square test, see Sheskin (2000).
¹ We conducted the search in June 2018, but the search included all years for which the journal had been published.
A two-proportions test is conducted to determine which categories were a major influence on the rejection of the null hypothesis. In particular, if
\[ \frac{\left| P_{1ij} - P_{2ij} \right|}{\sqrt{P(1 - P)\left(\dfrac{1}{n_{1j}} + \dfrac{1}{n_{2j}}\right)}} > 1.96 \tag{3} \]
then the proportions for response category i of question j are considered statistically different between the two surveys at the 0.05 level, where
\[ P = \frac{Y_{1ij} + Y_{2ij}}{n_{1j} + n_{2j}} \]
\(P_{1ij}\) is the proportion of response category i for question j in the AMT survey,
\(P_{2ij}\) is the proportion of response category i for question j in the Qualtrics survey,
\(Y_{1ij}\) is the number of people who selected response i for question j in the AMT survey,
\(Y_{2ij}\) is the number of people who selected response i for question j in the Qualtrics survey,
\(n_{1j}\) is the total number of responses for question j in the AMT survey,
\(n_{2j}\) is the total number of responses for question j in the Qualtrics survey.
2.2. Are itinerary choice models similar for AMT and Qualtrics respondents?
To investigate whether itinerary choice models estimated using AMT and Qualtrics data were statistically equivalent, we used data from the mock-up travel agency website that showed respondents a set of outbound itineraries customized to their most recent air travel experience, and asked them to select their preferred itinerary. Following the methodology presented by Louviere et al. (2000) for datasets composed of revealed preference and stated preference data, we combined the Qualtrics and AMT datasets and used a nested logit (NL) model to account for potentially different scale factors of the error term. This allows us to account for the possibility that, even if the parameter values (and ratios) appear to be comparable, differences might persist in the amount of variance explained by the model, which would give an indication of the data quality. We used NL models to estimate the probability that individual z chose alternative i from the set of alternatives K in the choice set C. Formally, suppressing the index for the individual, the NL probability is given as:
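\[ P_i = \frac{e^{V_i/\mu_m} \left( \sum_{j \in A_m} e^{V_j/\mu_m} \right)^{\mu_m - 1}}{\sum_{l=1}^{M} \left( \sum_{j \in A_l} e^{V_j/\mu_l} \right)^{\mu_l}}, \quad 0 < \mu_m \le 1 \tag{4} \]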
where alternatives are grouped into M nests, i.e., \(i \in A_m\), \(m = 1, 2, \ldots, M\). In our case, all of the AMT alternatives are grouped into one nest, all of the Qualtrics alternatives are grouped into a second nest, and the logsum parameter of the Qualtrics nest is constrained to be 1 whereas the scale (or \(\mu\)) associated with the AMT nest is estimated from the data. After accounting for scale differences, we estimated two discrete choice models. In the first model (that contains both the AMT and Qualtrics data) we defined the utility function so that the parameters were constrained to be the same for both the AMT and Qualtrics populations. In the second model (that also contains both the AMT and Qualtrics data) we allowed one or more itinerary choice parameters to differ for the AMT and Qualtrics populations. To determine whether consumer preferences for airline itinerary attributes differed, we used the likelihood ratio (LR) test to compare the goodness of fit of these two models. As an example, consider a situation where we want to know whether preferences for the number of connections vary for the AMT and Qualtrics populations. One NL model, referred to as the unrestricted model, estimates separate connection parameters for the AMT and Qualtrics populations, or \(\beta_{\#Cnx\_MTurk}\) and \(\beta_{\#Cnx\_Qualtrics}\). The other NL model, referred to as the restricted model, constrains these two parameters to be the same. The null hypothesis is:
H0: The goodness of fit of the two models is not statistically different.
The LR statistic is
\[ \chi^2 = -2(LL_R - LL_U) \tag{5} \]
where
\(LL_R\) is the log likelihood of the restricted model,
\(LL_U\) is the log likelihood of the unrestricted model.
The null hypothesis is rejected if
\[ \chi^2_{\text{statistic}} > \chi^2_{k,\alpha} \tag{6} \]
where
k is the degrees of freedom and is equal to \(n_U - n_R\), where \(n_U\) is the number of parameter estimates in the unrestricted model and \(n_R\) is the number of parameter estimates in the restricted model,
\(\alpha\) is the significance level (in our case, \(\alpha\) = 0.05).
Rejecting the null hypothesis means that the unrestricted model should be preferred (and that consumer preferences for a given itinerary attribute differ for the AMT and Qualtrics populations).
2.3. Survey instrument and participant recruitment process
The data for this study were collected by executing the same survey instrument on two distinct platforms: AMT and Qualtrics. AMT is an internet-based marketplace that enables researchers to post surveys to a potential pool of respondents. Qualtrics is a “traditional” marketing firm that maintains a panel of respondents. We drew respondents from a national U.S. panel on both survey platforms. Specifically, we posted the survey on AMT to workers who met the qualification of U.S. residence. Qualtrics sent the survey to nationally representative panels maintained by Qualtrics or their partners. We designed the survey instrument using an annual survey of air passengers conducted by the Resource Systems Group (see Nicolae et al., 2016). The survey contained five parts and had a median completion time of six minutes. With the exception of the itinerary choice exercise, all responses were based on a drop-down list, and respondents had to respond to all questions to receive payment for taking the survey (i.e., forced responses were used). The first part contained screening questions. Only those individuals who had taken a domestic air trip in the last 12 months that they or their company paid for were eligible to complete the survey. Airline employees were not eligible to take the survey.
The second part asked questions about the respondent’s most recent trip, e.g., what were the originating and terminating airports, how far in advance was the flight booked, what day of week did the trip start, etc.; see Fig. 1 for the specific questions asked in this section and the responses included in the drop-down menu. The third part asked respondents how much they agreed or disagreed with statements that solicited information about how they valued airline itinerary attributes, e.g., “I only fly certain airlines,” “Departure and arrival times are more important to me than price,” etc.; see Fig. 2 for the specific questions asked in this section and the potential responses. The fourth part of the survey presented a mock-up of an online travel agency site, shown in Fig. 3. Respondents were asked to select their preferred outbound itinerary. The itineraries were tailored to the individual’s most recent air travel experience and were based on choice sets we used in prior studies (see Lurkin et al., 2017, for more information on how these itinerary choice sets were generated). The set of itineraries shown to a particular respondent was customized based on the respondent’s most recent air trip. Specifically, choice sets were customized based on the respondent’s most recent origin, destination, departure day of week, advance purchase period, and product type (i.e., whether the individual purchased a high-yield fare for a first class, business class, or unrestricted coach product or a low-yield fare for a restricted coach product). We only asked individuals one trade-off question. The fifth and final part of the survey collected sociodemographic characteristics, including age, gender, household income, number of individuals in the household, and home zip code; see Fig. 4 for the specific questions asked in this section and the potential responses. We executed the survey on AMT from October to November 2016 and on Qualtrics in March 2017. 
Our analysis database contains 690 responses from AMT and 554 responses from Qualtrics. Our original intent was to obtain 1000 responses each from AMT and Qualtrics; however, we were unable to obtain more than 690 responses from AMT over the course of a two-month period and thus terminated with fewer responses. Given unexpected programming costs associated with executing the survey on AMT, we also lowered our target number of responses on Qualtrics to fit within our remaining budget. We paid $305.25 to conduct the survey on AMT and $3535.00 to conduct the survey on Qualtrics. Neither of these costs includes our programming time (see Garrow et al., 2018, for a description of how we executed the survey on AMT). Given that we only executed these surveys at one point in time, we cannot assess the variability of results or consistency in sampling on AMT.
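The tests described in Section 2.1 can be illustrated with a short Python sketch. This is a minimal illustration rather than the code used in the study: the function names and the example counts are hypothetical, and the critical values are the 0.05-level values quoted in this paper.

```python
import math

# Critical chi-square values at the 0.05 level, as quoted in this paper.
CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.81}


def chi_square_homogeneity(counts_1, counts_2):
    """Return (chi-square statistic, degrees of freedom) for two samples.

    counts_1 and counts_2 hold the observed counts per response category,
    in the same category order, for the two populations (Eq. (1), c = 2).
    """
    if len(counts_1) != len(counts_2):
        raise ValueError("both samples need the same response categories")
    n1, n2 = sum(counts_1), sum(counts_2)
    total = n1 + n2
    statistic = 0.0
    for o1, o2 in zip(counts_1, counts_2):
        row_total = o1 + o2
        if row_total == 0:
            raise ValueError("drop categories that no respondent selected")
        for observed, n in ((o1, n1), (o2, n2)):
            expected = row_total * n / total  # e_ij under homogeneity
            statistic += (observed - expected) ** 2 / expected
    dof = (len(counts_1) - 1) * (2 - 1)  # (rows - 1) x (columns - 1)
    return statistic, dof


def two_proportion_z(y1, n1, y2, n2):
    """Pooled two-proportion z statistic for one response category (Eq. (3))."""
    pooled = (y1 + y2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (y1 / n1 - y2 / n2) / std_err


if __name__ == "__main__":
    # Hypothetical counts for a three-category question (not survey data).
    amt = [300, 250, 140]
    qualtrics = [200, 200, 154]
    stat, dof = chi_square_homogeneity(amt, qualtrics)
    print(f"chi-square = {stat:.2f}, dof = {dof}, "
          f"reject H0: {stat > CRITICAL_05[dof]}")
    for i, (y1, y2) in enumerate(zip(amt, qualtrics), start=1):
        z = two_proportion_z(y1, sum(amt), y2, sum(qualtrics))
        print(f"category {i}: z = {z:.2f}, differs: {abs(z) > 1.96}")
```

For a question with only two categories, the squared z statistic equals the chi-square statistic, which is a convenient consistency check on both functions.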
Fig. 1. Comparison of trip characteristics between AMT and Qualtrics.
3. Results
3.1. Similarities and differences of the AMT and Qualtrics populations
Figs. 1, 2, and 4 summarize the results of the chi-square tests and show the frequency distribution for each question. Note that on the figures, the counts that are underlined and bolded correspond to the categories that were statistically different. The \(\chi^2\) statistic calculated from Eq. (1) and the corresponding p-value for each question are shown on the bottom left of each subfigure. A p-value less than or equal to 0.05 leads to a rejection of the null hypothesis, and we conclude that the distributions of responses from AMT and Qualtrics are statistically different. However, a p-value greater than 0.05 means that we fail to reject the null hypothesis, and there is no evidence that the distributions are statistically different.
In terms of prior trip characteristics, the chi-square analysis shows that AMT and Qualtrics respondents are quite similar: we find no statistical evidence to reject the null hypotheses for the following questions: (1) How often do you make air trips? (2) When was your last air trip? (3) Who paid for your ticket? (4) How long before your trip did you purchase your ticket? (5) What was the primary reason you flew? (6) What day of the week did you depart? (Possible responses include Saturday, Sunday, … Friday, I’m not sure but it was a weekday, and I’m not sure but it was a weekend.) The only prior trip characteristic that differed between AMT and Qualtrics respondents was related to group size, with Qualtrics respondents more likely to travel in groups of six or more. We conclude that the trip characteristics between AMT and Qualtrics respondents are similar, with the exception of group size.
Fig. 2. Importance of airline attributes to AMT and Qualtrics respondents.
Fig. 2 reveals that AMT and Qualtrics respondents value airline attributes differently. AMT respondents are more focused on price and are less sensitive to airline, equipment, and departure times than Qualtrics respondents.
AMT respondents are more likely to agree and/or strongly agree with the statements: (1) “I generally shop for the cheapest flights and do not consider other factors,” and (2) “Price is more important to me than carrier.” Conversely, Qualtrics respondents are more likely to agree and/or strongly agree with the statements: (1) “I only fly certain airlines,” (2) “I avoid small propeller and regional jet aircraft,” and (3) “Departure and arrival times are more important to me than price.” Qualtrics respondents are more likely to be neutral to the statement: “Departure and arrival times are more important to me than carrier.” We conclude that AMT and Qualtrics respondents value airline attributes differently; however, the chi-square analysis does not provide insight into why these different distributions are occurring. On the one hand, these differences could be caused by fundamental differences in taste preferences between the AMT and Qualtrics populations, i.e., all else being equal, AMT respondents are more price-sensitive than Qualtrics respondents. On the other hand, these differences could arise if preferences vary across sociodemographic groups and the distribution of these groups varies between the AMT and Qualtrics populations. For example, if the AMT population has a higher percentage of young adults who are price sensitive, this could influence the distributions shown in Fig. 2.
Fig. 3. Itinerary choice screen used in survey. Note: Airline logos were displayed above the prices but are suppressed here due to copyright restrictions. More alternatives than those shown on the figure were presented to respondents.
Fig. 4. Comparison of sociodemographic characteristics between AMT and Qualtrics respondents.
Fig. 4 shows that the sociodemographic characteristics of AMT and Qualtrics respondents do indeed differ. Proportionately more AMT respondents are male, younger (in particular, between the ages of 25–34), and live alone. Proportionately more AMT respondents earn lower incomes, particularly in the $10,000–$19,999 and $30,000–$39,000 ranges; proportionately fewer AMT respondents earn higher incomes, particularly in the $150,000–$199,999 and $250,000+ ranges.
Fig. 5. Comparison of Qualtrics and AMT respondents by areas of the U.S.
Geographically, the distribution of AMT and Qualtrics respondents differs. Fig. 5, created in Tableau version 10.1, shows the frequency distribution and percentage of respondents by four main geographic areas in the U.S. Qualtrics has a more geographically balanced sample than AMT. Both survey platforms recruited more participants from states with large populations, e.g., California, Texas, Florida, and New York are the states with the largest populations and correspond to the states with the largest number of respondents in both samples. A chi-square test shows that the proportion of responses across the 50 states is different, although these differences are not likely to be materially important (\(\chi^2\) statistic of 68.85 > \(\chi^2_{49,0.05}\) = 66.34). The five states contributing most to the differences are New Jersey, Tennessee, Delaware, New Hampshire, and Massachusetts.
3.2. Itinerary choice models
The chi-square analysis showed that proportionately more AMT respondents were younger and had lower annual household incomes than the Qualtrics respondents. The question of interest for the second part of the analysis is determining whether, after controlling for sociodemographic differences, AMT and Qualtrics respondents exhibit similar or different preferences for itinerary characteristics. Table 1 shows a baseline model that includes several variables. These variables include airline constants, departure time intervals, elapsed time (defined as the difference between when the last leg of the itinerary arrives at the gate and the first leg of the itinerary departs from the gate), number of connections, and price. In selecting our baseline model, we estimated dozens of models that interacted price with different sociodemographic variables (e.g., age, gender, and income). The baseline model shown in Table 1 represents the model that best fit the data and provided intuitive results.
As expected, the baseline itinerary choice model shows that customers prefer morning flights, or those that depart between 12 midnight and 9:59 AM. Similar to regression models, in discrete choice models when a categorical variable with M categories is used in the specification, at most M − 1 category parameters can be estimated (and one category must be set as the reference). In the case of departure times, the 12 midnight to 9:59 AM category is set as the reference (and has a parameter value of zero) and the two parameter estimates associated with the afternoon period (10 AM – 3:59 PM) and evening period (4 PM – 11:59 PM) are negative. Thus, the morning departure period has the most positive parameter estimate, and is preferred over the afternoon and evening departure periods. In the baseline model, the departure time parameter estimates are restricted to be the same for both the AMT and Qualtrics populations; thus, this estimate is shown on a single cell that includes both the AMT and Qualtrics rows for the given departure time interval. Similar logic applies to the interpretation of elapsed time, number of connections, and price. As elapsed time increases, the probability of taking an itinerary decreases (as seen by the parameter estimate of −0.0040). Customers also prefer nonstop itineraries (represented by zero connections) over connecting itineraries (as seen by the parameter estimate of −1.4758). As price increases, the probability of selecting an itinerary decreases (the parameter estimate associated with price is −0.0152). Our baseline model includes separate airline constants for AMT and Qualtrics respondents. This is to control for the different geographic distributions in the two samples, e.g., respondents in Georgia will be more likely to fly Delta given the airline has a major hub in Atlanta, whereas respondents in Texas will be more likely to fly Southwest, American, or United given those airlines have hubs in Dallas and Houston.
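To make the interpretation of these coefficients concrete, the Python sketch below evaluates the NL probability of Eq. (4) for three hypothetical itineraries, using the baseline estimates reported in Table 1. It is an illustration under stated assumptions, not the estimation code used in the study: the itineraries and function names are hypothetical, and all three itineraries are assumed to be operated by the reference ("Other") airline so that the airline constant is zero.

```python
import math

# Baseline estimates reported in Table 1 (airline constant omitted because
# the hypothetical itineraries use the reference "Other" airline).
BETA = {
    "afternoon": -0.1662,
    "evening": -0.6417,
    "elapsed_min": -0.0040,
    "connections": -1.4758,
    "price": -0.0152,
}
# Logsum (scale) parameters: estimated for AMT, fixed at 1 for Qualtrics.
MU = {"AMT": 0.8081, "Qualtrics": 1.0}

# Hypothetical choice set (not taken from the survey).
ITINERARIES = {
    "nonstop_morning": {"elapsed_min": 180, "connections": 0, "price": 320},
    "onestop_afternoon": {"afternoon": 1, "elapsed_min": 275,
                          "connections": 1, "price": 250},
    "onestop_evening": {"evening": 1, "elapsed_min": 300,
                        "connections": 1, "price": 230},
}


def utility(itinerary):
    """Linear-in-parameters utility V_i for one itinerary."""
    return sum(BETA[name] * itinerary.get(name, 0.0) for name in BETA)


def nested_logit_probs(utilities, nest_of, mu):
    """Choice probabilities from Eq. (4).

    utilities: alternative -> V_i; nest_of: alternative -> nest label;
    mu: nest label -> logsum parameter in (0, 1].
    """
    inside = {}  # per-nest sum of exp(V_j / mu_m)
    for alt, v in utilities.items():
        m = nest_of[alt]
        inside[m] = inside.get(m, 0.0) + math.exp(v / mu[m])
    denominator = sum(total ** mu[m] for m, total in inside.items())
    probs = {}
    for alt, v in utilities.items():
        m = nest_of[alt]
        probs[alt] = (math.exp(v / mu[m]) * inside[m] ** (mu[m] - 1.0)
                      / denominator)
    return probs


def choice_probs(platform):
    """Probabilities for a respondent whose alternatives all sit in one nest."""
    utilities = {name: utility(it) for name, it in ITINERARIES.items()}
    nest_of = {name: platform for name in ITINERARIES}
    return nested_logit_probs(utilities, nest_of, MU)


if __name__ == "__main__":
    for platform in ("AMT", "Qualtrics"):
        print(platform, {k: round(p, 3) for k, p in choice_probs(platform).items()})
```

Because every alternative shown to a respondent belongs to that respondent's platform nest, Eq. (4) collapses to a logit in which utilities are divided by the nest's logsum parameter, which is why that parameter acts as a scale on the AMT data.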
Table 1 Model results.
| Variable | Baseline Model | Model 1: Price | Model 2: Time of Day | Model 3: # Connections | Model 4: Elapsed Time |
|---|---|---|---|---|---|
| Airline Constants | | | | | |
| American AMT | −0.2307 (−1.76) | −0.2544 (−1.77) | −0.2027 (−1.63) | −0.2148 (−1.77) | −0.2296 (−1.76) |
| Delta AMT | −0.1609 (−1.28) | −0.1734 (−1.26) | −0.1414 (−1.18) | −0.1493 (−1.28) | −0.1598 (−1.28) |
| United AMT | −0.4588 (−3.04) | −0.4878 (−2.96) | −0.4201 (−2.92) | −0.4163 (−2.93) | −0.4556 (−3.01) |
| American Qualtrics | −0.5756 (−3.29) | −0.5560 (−3.17) | −0.6066 (−3.46) | −0.5600 (−3.19) | −0.5736 (−3.27) |
| Delta Qualtrics | −0.6217 (−3.52) | −0.6116 (−3.46) | −0.6371 (−3.60) | −0.6119 (−3.46) | −0.6211 (−3.51) |
| United Qualtrics | −0.5796 (−2.98) | −0.5919 (−3.04) | −0.6087 (−3.13) | −0.5853 (−3.00) | −0.5788 (−2.97) |
| Other (ref.) | 0 | 0 | 0 | 0 | 0 |
| Departure Time | | | | | |
| Morning 12 – 9:59 AM (ref.) | 0 | 0 | 0 | 0 | 0 |
| Afternoon 10 AM – 3:59 PM, AMT and Qualtrics (constrained equal) | −0.1662 (−2.68) | −0.1841 (−2.78) | | −0.1542 (−2.58) | −0.1652 (−2.65) |
| Afternoon 10 AM – 3:59 PM AMT | | | −0.0677 (−0.94) | | |
| Afternoon 10 AM – 3:59 PM Qualtrics | | | −0.3422 (−3.30) | | |
| Evening 4 – 11:59 PM, AMT and Qualtrics (constrained equal) | −0.6417 (−7.17) | −0.6905 (−7.19) | | −0.6009 (−6.63) | −0.6384 (−6.94) |
| Evening 4 – 11:59 PM AMT | | | −0.4668 (−4.76) | | |
| Evening 4 – 11:59 PM Qualtrics | | | −0.9470 (−6.55) | | |
| Elapsed Time (minutes) | | | | | |
| Elapsed Time, AMT and Qualtrics (constrained equal) | −0.0040 (−6.43) | −0.0043 (−6.40) | −0.0039 (−6.43) | −0.0038 (−6.27) | |
| Elapsed Time AMT | | | | | −0.0040 (−5.14) |
| Elapsed Time Qualtrics | | | | | −0.0041 (−4.82) |
| Number of Connections | | | | | |
| Number of Connections, AMT and Qualtrics (constrained equal) | −1.4758 (−10.70) | −1.5688 (−10.69) | −1.4259 (−10.40) | | −1.4695 (−10.17) |
| Number of Connections AMT | | | | −1.3036 (−7.57) | |
| Number of Connections Qualtrics | | | | −1.5932 (−10.07) | |
| Price | | | | | |
| Price, AMT and Qualtrics (constrained equal) | −0.0152 (−15.43) | | −0.0148 (−14.85) | −0.0145 (−13.16) | −0.0151 (−14.29) |
| Price AMT | | −0.0175 (−10.97) | | | |
| Price Qualtrics | | −0.0139 (−12.12) | | | |
| Scale | | | | | |
| Mu AMT | 0.8081 (−3.13) | 0.8868 (−1.50) | 0.7691 (−3.83) | 0.7480 (−3.54) | 0.8030 (−2.80) |
| Mu Qualtrics (fixed) | 1 | 1 | 1 | 1 | 1 |
| Measures of Model Fit; LL(0) = −3520.88 | | | | | |
| LL(model) | −2960.56 | −2958.21 | −2955.94 | −2959.59 | −2960.55 |
| ρ²₀ | 0.159 | 0.160 | 0.160 | 0.159 | 0.159 |
| Tests of Null Hypotheses Against Restricted Model | | | | | |
| −2(LLR − LLU) | N/A | 4.70 | 9.24 | 1.94 | 0.02 |
| Degrees of freedom (DOF) | N/A | 1 | 2 | 1 | 1 |
| χ²_DOF | N/A | 3.84 | 5.99 | 3.84 | 3.84 |
Note: Parameter estimate (and t-stat) are reported in rows above the measures of model fit.
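The fit measure and the test statistic for Model 1 can be reproduced from the log-likelihoods reported in Table 1. The Python sketch below does so as a worked check; the function names are hypothetical, and it assumes the usual definition ρ²₀ = 1 − LL(model)/LL(0), which is consistent with the values reported in the table.

```python
# Log-likelihoods reported in Table 1.
LL_NULL = -3520.88      # LL(0)
LL_BASELINE = -2960.56  # restricted model: one price coefficient
LL_MODEL_1 = -2958.21   # unrestricted model: separate AMT/Qualtrics price

CHI2_CRITICAL_1_DOF = 3.84  # 0.05-level critical value with 1 degree of freedom


def likelihood_ratio(ll_restricted, ll_unrestricted):
    """LR statistic of Eq. (5): -2 (LL_R - LL_U)."""
    return -2.0 * (ll_restricted - ll_unrestricted)


def rho_squared(ll_model, ll_null):
    """Fit relative to the null model: 1 - LL(model) / LL(0)."""
    return 1.0 - ll_model / ll_null


lr_price = likelihood_ratio(LL_BASELINE, LL_MODEL_1)
reject_equal_price = lr_price > CHI2_CRITICAL_1_DOF

if __name__ == "__main__":
    print(f"LR statistic (Model 1 vs. baseline): {lr_price:.2f}")
    print(f"Reject equal price coefficients: {reject_equal_price}")
    print(f"rho-squared, baseline: {rho_squared(LL_BASELINE, LL_NULL):.3f}")
    print(f"rho-squared, Model 1:  {rho_squared(LL_MODEL_1, LL_NULL):.3f}")
```

The same two functions apply to any restricted and unrestricted pair, provided the critical value is chosen for the number of equality restrictions imposed.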
Finally, the scale parameter of 0.81 is statistically significant in the baseline model, which, as described earlier, means that differences between the AMT and Qualtrics populations might persist in the amount of variance explained by the model, which would give an indication of the data quality.
Models 1 to 4 in Table 1 investigate whether AMT and Qualtrics respondent preferences for fares, departure time of day, number of connections, and elapsed times are statistically equivalent. For example, consider Model 1 (labeled the Price model). Model 1 is identical to the baseline model but estimates separate price coefficients for AMT and Qualtrics. Using the terminology described in Section 2.2, the baseline model represents the “restricted” model as it restricts the price coefficients to be the same across AMT and Qualtrics, i.e., \(\beta_{price}^{AMT} = \beta_{price}^{Qualtrics}\), whereas the Model 1 Price column contains the “unrestricted” model as it estimates separate coefficients for the AMT and Qualtrics populations, \(\beta_{price}^{AMT} \neq \beta_{price}^{Qualtrics}\). We can statistically compare Model 1 to the baseline model by computing the LR test, or \(\chi^2 = -2(LL_R - LL_U) = -2(-2960.56 - (-2958.21)) = 4.70\). Note that \(LL_R\) corresponds to the baseline model as it restricts the price coefficient to be the same, and \(LL_U\) corresponds to Model 1 as it allows the price coefficient to differ between these two populations. Since the likelihood ratio test statistic of 4.70 > \(\chi^2_{1,0.05}\) = 3.84, where the degrees of freedom is given as one (as we are imposing one equality relationship among the parameters), we reject the null hypothesis and find evidence that the price coefficients for the AMT and Qualtrics populations differ. Applying similar logic across Models 2–4, we find that we cannot reject the null hypothesis and conclude that AMT and Qualtrics respondents exhibit statistically equivalent preferences for departure time of day, number of connections, and elapsed time, but have different price sensitivities.
The question of interest then from Table 1 is why price sensitivities vary across the AMT and Qualtrics populations. To investigate if price sensitivities were related to differences in sociodemographic characteristics, we estimated two additional models, shown in Table 2. Note that Model 1 now becomes the baseline model against which we need to test our hypothesis and is thus shown as the “restricted” model in Table 2. Model 5 estimated separate coefficients for price depending on whether the age of the respondent was 18–24, 25–64, or 65+. In Model 5, the coefficients for Price × Ages 18–24 for AMT and Qualtrics are identical; that is, these coefficients were constrained to be the same. This can be seen because both the coefficients and the t-stats are identical in the respective Price × Ages rows (e.g., both the Price × Ages 18–24 AMT and Price × Ages 18–24 Qualtrics rows have price coefficients of −0.0209 and t-stats of −8.88).
Model 6 further relaxed the specification by allowing the coefficients to vary depending on whether the respondent participated in the AMT or Qualtrics survey. Model 5 fits the data better than Model 1 and the \(\chi^2\) statistic of 5.84 < \(\chi^2_{3,0.05}\) = 7.81, which rejects the null hypothesis that Model 1 is the preferred model. Similarly, when we compare Models 5 and 6 we see that Model 6 fits the data better and a \(\chi^2\) statistic comparing these two models (not shown in Table 2) also rejects the null hypothesis that Model 5 is the preferred model (in this case \(\chi^2\) statistic of 4.92 < \(\chi^2_{3,0.05}\) = 7.81). The scale difference that was present in earlier specifications is no longer statistically significant at the 0.05 level when price is interacted with age and allowed to differ between AMT and Qualtrics respondents.
To understand the practical implications of the price sensitivities between AMT and Qualtrics respondents, we calculated values of time for Models 1 and 6, shown in Table 3. Consistent with a priori expectations, values of time increase with age. Among respondents who are under 65, Qualtrics respondents have slightly higher values of time than AMT respondents. Such a result follows earlier findings with respect to the chi-square tests of travelers’ itinerary preferences. Although the value of time for those 65 and older is higher for the AMT respondents, the t-statistic is not significant. We conclude that, with the exception of price, AMT and Qualtrics respondents exhibit similar preferences for airline characteristics.
As a final note, as part of our analysis we also compared the quality of responses between the AMT and Qualtrics surveys (detailed in Garrow et al., 2018). In the case of our survey, we had a programming error that prevented us from accurately collecting the time each individual spent on each question and on the entire survey. We also only asked one discrete choice modeling trade-off question.
Thus, to examine if the quality of responses differed between the Qualtrics and AMT respondents, we examined if there were lexicographic patterns in the responses for six sequential questions in which the respondent answered strongly disagree, disagree, neutral, agree, or strongly agree. The quality of responses was similar between the two surveys, with the most notable (yet minor) difference seen in straight-lining responses (e.g., always answering the same response for a sequence of questions). Straight-lining was just 1.3% in the AMT sample and slightly higher (3.4%) in the traditional Qualtrics panel.
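A straight-lining check of the kind described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function names `is_straight_lined` and `straight_lining_rate`, the 1–5 coding of the Likert scale, and the four example respondents are all hypothetical.

```python
def is_straight_lined(answers):
    """Return True if every answer in the sequence is identical."""
    return len(set(answers)) == 1


def straight_lining_rate(responses):
    """Share of respondents who gave the same answer to every question."""
    flagged = sum(1 for answers in responses if is_straight_lined(answers))
    return flagged / len(responses)


# Hypothetical data: each row holds one respondent's answers to six sequential
# Likert questions, coded 1 (strongly disagree) through 5 (strongly agree).
example_responses = [
    [3, 3, 3, 3, 3, 3],  # straight-lined: same answer to all six questions
    [1, 2, 4, 4, 5, 3],
    [4, 4, 5, 4, 2, 4],
    [2, 3, 2, 3, 2, 3],  # alternating pattern, not flagged by this simple check
]

rate = straight_lining_rate(example_responses)
print(f"Straight-lining rate: {rate:.1%}")
```

The check only flags identical answers across the whole sequence; other low-effort patterns, such as the alternating row above, would need additional rules.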
4. Conclusions

To the best of our knowledge, this paper represents the first study that has examined if and how the population of AMT survey respondents differs from the population of survey respondents from a traditional survey panel for an air travel behavior application. It also represents the first study that has examined if air travel behavior models estimated from AMT versus a traditional panel are similar or different. This paper fills an important research gap by demonstrating that online marketplaces, such as AMT, can be used to conduct air travel behavior-related studies. As our study demonstrates, although there were underlying differences in the sociodemographic characteristics between the AMT and Qualtrics respondents, itinerary choice models that controlled for sociodemographic differences yielded similar results. This is an important finding for the air transportation community, as it demonstrates the viability of using AMT for air travel behavior studies.

However, for a particular research application, there are pros and cons of using AMT versus a traditional panel that researchers should consider. In our experience, it was difficult to obtain more than 500 responses from the United States. Different software platforms exist for helping to manage the recruitment and survey execution processes on AMT (e.g., see TurkPrime, 2017), but this will add to the overall costs. Qualtrics and similar companies promote themselves as being able to target a nationally representative population, but depending on the qualifications the researcher needs for a study, even these larger companies can encounter problems obtaining a desired sample size. For example, in a different study, we used a company with a traditional panel to recruit participants residing or working in particular U.S. metro areas who had one-way commutes of at least 30 min and individual annual incomes of more than $200K.
Even though this company reached out to partners for lists of potential participants meeting these criteria, they were unable to deliver on providing 500 completed responses for the survey, and we needed to lower our criteria on
Table 2
Models that interact price and age.

| Variable | Model 1: Price | Model 5: Price × Ages | Model 6: Price × Ages × Survey Type |
|---|---|---|---|
| Airline Constants | | | |
| American AMT | −0.2544 (−1.77) | −0.2329 (−1.76) | −0.2542 (−1.77) |
| Delta AMT | −0.1734 (−1.26) | −0.1549 (−1.22) | −0.1666 (−1.21) |
| United AMT | −0.4878 (−2.96) | −0.4573 (−3.01) | −0.4836 (−2.94) |
| American Qualtrics | −0.5560 (−3.17) | −0.5805 (−3.31) | −0.5562 (−3.16) |
| Delta Qualtrics | −0.6116 (−3.46) | −0.6173 (−3.48) | −0.5981 (−3.37) |
| United Qualtrics | −0.5919 (−3.04) | −0.5677 (−2.91) | −0.5779 (−2.96) |
| Other (ref.) | 0 | 0 | 0 |
| Departure Time | | | |
| Morning 12 – 9:59 AM (ref.) | 0 | 0 | 0 |
| Afternoon 10 AM – 3:59 PM | −0.1841 (−2.78) | −0.1699 (−2.72) | −0.1860 (−2.81) |
| Evening 4 – 11:59 PM | −0.6905 (−7.19) | −0.6462 (−7.19) | −0.6886 (−7.18) |
| Elapsed Time (minutes) | −0.0043 (−6.40) | −0.0040 (−6.42) | −0.0042 (−6.39) |
| Number of Connections | −0.0043 (−6.40) | −1.4896 (−10.78) | −1.5672 (−10.68) |
| Price | | | |
| Price AMT (all ages) | −0.0175 (−10.97) | | |
| Price × Ages 18–24 AMT | | −0.0209 (−8.88) | −0.0229 (−6.73) |
| Price × Ages 25–64 AMT | | −0.0149 (−14.40) | −0.0170 (−10.63) |
| Price × Ages 65+ AMT | | −0.0105 (−3.74) | −0.0054 (−0.71) |
| Price Qualtrics (all ages) | −0.0139 (−12.12) | | |
| Price × Ages 18–24 Qualtrics | | −0.0209 (−8.88) | −0.0202 (−5.88) |
| Price × Ages 25–64 Qualtrics | | −0.0149 (−14.40) | −0.0133 (−10.33) |
| Price × Ages 65+ Qualtrics | | −0.0105 (−3.74) | −0.0116 (−3.81) |
| Scale | | | |
| Mu AMT | 0.8868 (−1.50) | 0.8154 (−3.00) | 0.8883 (−1.55) |
| Mu Qualtrics (fixed) | 1 | 1 | 1 |
| Measures of Model Fit; LL(0) = −3520.88 | | | |
| LL(model) | −2958.21 | −2955.39 | −2952.93 |
| $\rho_0^2$ | 0.160 | 0.161 | 0.161 |
| Tests of Null Hypotheses against Model 1 | | | |
| $-2(LL_R - LL_U)$ | N/A | 5.64 | 10.56 |
| Degree of freedom (DOF), $\chi^2_{DOF}$ | N/A | 3, 7.81 | 3, 7.81 |

Note: Parameter estimate (and t-stat) are reported in rows above the measures of model fit.
both the one-way commute time as well as the individual income to obtain our desired number of completed surveys. AMT and other online platforms are quickly evolving. There have been two recent changes within AMT that are particularly relevant to transportation researchers who are interested in conducting surveys. At the time we conducted our survey, it was not possible to target respondents with particular sociodemographic characteristics. Now, there is a multitude of characteristics that can be used to target surveys to a particular population. As of June 2018, these included “traditional” characteristics such as age, gender,
Table 3
Values of time for models that interact price and age ($/h).

| | AMT | Qualtrics | All |
|---|---|---|---|
| Ages 18–24 | 11.09 | 12.57 | 11.59 |
| Ages 25–64 | 15.00 | 19.12 | 16.28 |
| Ages 65+ | 47.29 | 21.88 | 22.99 |
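The AMT and Qualtrics entries in Table 3 are consistent with the standard value-of-time ratio: the elapsed-time coefficient divided by the price coefficient, converted from minutes to hours. The sketch below applies that ratio to the Model 6 estimates from Table 2. The paper does not state the formula explicitly, so its use here is an assumption, and the function name `value_of_time` is hypothetical. Small differences from Table 3 are expected, presumably because the published coefficients are rounded.

```python
def value_of_time(beta_time_per_min, beta_price_per_dollar):
    """Value of time in $/h: ratio of time and price coefficients, times 60 min/h."""
    return 60.0 * beta_time_per_min / beta_price_per_dollar


# Model 6 estimates from Table 2.
BETA_ELAPSED_TIME = -0.0042  # utility per minute of elapsed time
PRICE_COEFFICIENTS = {       # utility per dollar, by survey and age group
    ("AMT", "18-24"): -0.0229,
    ("AMT", "25-64"): -0.0170,
    ("AMT", "65+"): -0.0054,
    ("Qualtrics", "18-24"): -0.0202,
    ("Qualtrics", "25-64"): -0.0133,
    ("Qualtrics", "65+"): -0.0116,
}

# Values reported in Table 3 ($/h), included only for comparison.
TABLE_3 = {
    ("AMT", "18-24"): 11.09,
    ("AMT", "25-64"): 15.00,
    ("AMT", "65+"): 47.29,
    ("Qualtrics", "18-24"): 12.57,
    ("Qualtrics", "25-64"): 19.12,
    ("Qualtrics", "65+"): 21.88,
}

for key, beta_price in PRICE_COEFFICIENTS.items():
    vot = value_of_time(BETA_ELAPSED_TIME, beta_price)
    survey, ages = key
    print(f"{survey:>9} {ages:>5}: computed {vot:6.2f} $/h, Table 3 reports {TABLE_3[key]:.2f} $/h")
```

The large value for AMT respondents aged 65+ follows from dividing by a price coefficient close to zero (−0.0054), which is also the coefficient with the weakest t-stat in Table 2.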
income, educational attainment, auto ownership, and residential information (e.g., rent or own), as well as other characteristics such as online purchase histories and technology usage factors (e.g., whether the respondent has accounts on Facebook, Google, Twitter, and a variety of other technology factors). AMT’s pricing page contains information on the characteristics that can be used to target surveys to a particular population (Amazon Mechanical Turk, 2018b).

A second change that has occurred within AMT is that they are expanding the ability for online workers in different countries to deposit their earnings into an online bank account. As of June 2018, only workers from the U.S. or India could deposit their earnings into a bank account; all others were given their earnings in the form of an Amazon.com gift card (Amazon Mechanical Turk, 2018a). Thus, although the potential number of AMT workers is large, currently it is easier to survey participants from the U.S. or India. This will likely change if and when AMT expands the ability for workers from other countries to deposit earnings into a bank account.

Finally, it is important to note that researchers who desire to target a specific geography will likely encounter issues obtaining a desired sample size using online platforms. Similar challenges are faced by researchers today using traditional panels; that is, the more restrictive the qualifying criteria, the more expensive the survey data collection is. However, unlike a traditional survey firm that can purchase potential lists of respondents from different sources to meet the desired quota, a researcher who uses AMT or similar online platforms is restricted to a single population, namely AMT workers. As an example, consider a researcher who wants to conduct a survey of individuals in North Dakota. In our sample, we had no respondents from this state, even though we did not exclude them from the survey.
Thus, the use of AMT and other online platforms will likely be useful, at least in the near term, for transportation studies focused on broad geographic areas within the U.S. and India.

In conclusion, it will be exciting to see how these online workplaces evolve in the next decade and to see if other studies within the transportation community confirm the key findings from this study. Although we are not at a point to be able to offer best practice recommendations for when to conduct surveys in AMT versus traditional panels across a range of research contexts, in our experience, researchers should consider qualification criteria, geographic constraints, and the amount of internal resources they have to program and manage the survey execution platform when choosing between these platforms.

Acknowledgements

Funding for this research was provided by a NASA Learn Grant with Dr. Brian German as the lead investigator. Sharon Dunn of Type Right Editing edited the final manuscript and Wenhui Yang created Fig. 5.

References

Abbey, J.D., Kleber, R., Souza, G.C., Voigt, G., 2017. The role of perceived quality risk in pricing remanufactured products. Prod. Oper. Manage. 26 (1), 100–115.
Amazon Mechanical Turk, 2018a. Getting paid. Available online at <https://requester.mturk.com/pricing>. Accessed June 4, 2018.
Amazon Mechanical Turk, 2018b. Pricing. Available online at <https://requester.mturk.com/pricing>. Accessed June 4, 2018.
Berinsky, A.J., Huber, G.A., Lenz, G.S., 2012. Evaluating online labor markets for experimental research: Amazon.com’s Mechanical Turk. Polit. Anal. 20 (3), 351–368.
Cantor, D.E., Blackhurst, J.V., Cortes, J.D., 2014. The clock is ticking: The role of uncertainty, regulatory focus, and level of risk on supply chain disruption decision making behavior. Transp. Res. Part E 72, 159–172.
Cui, Z., Kumar, P.M.S., Goncalves, D., 2019. Scoring vs. ranking: an experimental study of idea evaluation processes. Prod. Oper. Manage. 28 (1), 176–188.
Deb, S., Strawderman, L., Carruth, D.W., DuBien, J., Smith, B.K., Garrison, T.M., 2018. Development and validation of a questionnaire to assess pedestrian receptivity toward fully autonomous vehicles. Transp. Res. Part C 84, 178–195.
Dixon, M.J., Victorino, L., Kwortnik, R.J., Verma, R., 2017. Surprise, anticipation, and sequence effects in design of experiential services. Prod. Oper. Manage. 26 (5), 945–960.
Garrow, L.A., Chen, Z., Ilbeigi, M., Lurkin, V., 2018. A new twist on the gig economy: conducting surveys on Amazon Mechanical Turk. Transportation. https://doi.org/10.1007/s11116-018-9962-8.
Gosling, S.D., Vazire, S., Srivastava, S., John, O.P., 2004. Should we trust web-based studies? A comparative analysis of six preconceptions about internet questionnaires. Am. Psychol. 59, 93–104.
Guo, Z., Zhao, J., Whong, C., Mishra, P., Wyman, L., 2017. Redesigning subway map to mitigate bottleneck congestion: an experiment in Washington DC using Mechanical Turk. Transp. Res. Part A 106, 158–169.
Haboucha, C.J., Ishaq, R., Shiftan, Y., 2017. User preferences regarding autonomous vehicles. Transp. Res. Part C 78, 37–49.
Harmes, P.D., DeSimone, J.A., 2015. Caution! MTurk workers ahead – fines doubled. Indus. Org. Psychol. 8 (2), 183–190.
Helveston, J.P., Lui, Y., Feit, E.M., Fuchs, E., Klampfl, E., Michalek, J., 2015. Will subsidies drive electric vehicle adoption? Measuring consumer preferences in the U.S. and China. Transp. Res. Part A 73, 96–112.
Highhouse, S., Zhang, D., 2015. The new fruit fly for applied psychological research. Indus. Org. Psychol. 8 (2), 179–183.
Hitlin, P., 2016. Research in the crowdsourcing age: A case study. Pew Research Center. Available at <http://www.pewinternet.org/2016/07/11/research-in-the-crowdsourcing-age-a-case-study/>. Downloaded April 9, 2017.
Hutchison-Krupat, J., Chao, R.O., 2014. Tolerance for failure and incentives for collaborative innovation. Prod. Oper. Manage. 23 (8), 1265–1285.
Jiang, L., Dimitrov, S., Martin, B., 2017. P2P marketplaces and retailing in the presence of consumers’ valuation uncertainty. Prod. Oper. Manage. 26 (3), 509–524.
Krupa, J.S., Rizzo, D.M., Eppstein, M.J., Lanute, D.B., Gaalema, D.E., Lakkaraju, K., Warrender, C.E., 2014. Analysis of a consumer survey on plug-in hybrid electric vehicles. Transp. Res. Part A 14–31.
Landers, R.N., Behrend, T.S., 2015. An inconvenient truth: arbitrary distinctions between organizational, Mechanical Turk, and other convenience samples. Indus. Org. Psychol. 8 (2), 142–164.
Lee, Y.S., Seo, Y.W., Siemsen, E., 2018. Running behavioral operations experiments using Amazon’s Mechanical Turk. Prod. Oper. Manage. 27 (5), 767–783.
Louviere, J.J., Hensher, D.A., Swait, J.D., 2000. Stated Choice Methods: Analysis and Application. Cambridge University Press, Cambridge, UK.
Lurkin, V., Garrow, L.A., Higgins, M.J., Newman, J.P., Schyns, M., 2017. Accounting for price endogeneity in airline itinerary choice models: an application to continental U.S. markets. Transp. Res. Part A 100, 228–246.
Neoh, J.G., Chipula, M., Marshall, A., Tewkesbury, A., 2018. How commuters’ motivations to drive relate to propensity to carpool: evidence from the United Kingdom and the United States. Transp. Res. Part A 110, 128–148.
Nicolae, M., Ferguson, M., Garrow, L., 2016. Measuring the benefit of offering auxiliary services: do bag-checkers differ in their sensitivities to differences in airline itinerary attributes? Prod. Oper. Manage. 25 (10), 1689–1708.
Searles, K., Ryan, J.B., 2015. Researchers are rushing to Amazon’s Mechanical Turk. Should they? Washington Post. Available at <http://www.washingtonpost.com/blogs/monkey-cage/wp/2015/05/04/researchers-are-rushing-to-amazons-mechanical-turk-should-they/>. Downloaded April 10, 2017.
Sheskin, D.J., 2000. Handbook of Parametric and Nonparametric Statistical Procedures, second ed. Chapman & Hall/CRC, Boca Raton, FL.
Tokar, T., Aloysius, J., Waller, M., Hawkins, D.L., 2016. Exploring framing effects in inventory control decisions: violations of procedure invariance. Prod. Oper. Manage. 25 (2), 306–329.
TurkPrime, 2017. TurkPrime’s powerful Mechanical Turk toolkit. <https://www.turkprime.com/Service/MTurkToolkit>. Accessed April 21, 2017.