The Journal of Choice Modelling 9 (2013) 3–13
Contents lists available at ScienceDirect
The Journal of Choice Modelling journal homepage: www.elsevier.com/locate/jocm
Simplified probabilistic choice set formation models in a residential location choice context Alireza Zolfaghari n, Aruna Sivakumar 1, John Polak 2 Centre for Transport Studies, Imperial College London, London, United Kingdom
a r t i c l e i n f o
abstract
Available online 10 January 2014
The implementation of a theoretically sound, two-stage discrete-choice modelling paradigm incorporating probabilistic choice sets is impractical when the number of alternatives is large, which is a typical case in most spatial choice contexts. In the context of residential location choice, Kaplan et al. (2009), (2011), (2012) (KBS) developed a semicompensatory choice model incorporating data of individuals searching for dwellings observed using a customised real estate agency website. This secondary data is used to compute the probability of considering a choice set that takes the form of an ordered probit model. In this paper, we illustrate that the simplicity of the KBS model arises because of an unrealistic assumption that individuals0 choice sets only contain alternatives that derive from their observed combination of thresholds. Relaxing this assumption, we introduce a new probabilistic choice set formation model that allows the power set to include all potential choice sets derived from variations in thresholds0 combinations. In addition to extending the KBS model, our proposed model asymptotically approaches the classical Manski model, if a suitable structure is used to categorise alternatives. In order to illustrate the biases inherent in the original KBS approach, we compare it with our proposed model and the MNL model using a Monte Carlo experiment. The results of this experiment show that the KBS model causes biases in predicted market share if individuals are free to choose from any potential choice sets derived from combinations of thresholds. & 2014 Elsevier Ltd. All rights reserved.
Keywords: Choice set formation Semi-compensatory models Ordered probit
1. Introduction Standard discrete choice models are powerful in choice situations where individuals0 choice sets are known to the analyst with a degree of confidence. However, there are many choice contexts where this is not the case. Examples are typically found in spatial choices such as residential location choice where the number of alternatives evaluated is constrained by individuals0 limited capacity for gathering and processing information (Fotheringham et al., 2000). The inappropriate use of the universal choice set in modelling can lead to significant misspecification errors where the actual choice sets considered by decision makers are different from the universal choice set (see Stopher, 1980). Experimental research suggests that in a complex choice situation (e.g., choice among a large number of alternatives) decision makers adopt non-compensatory screening strategies (e.g., elimination-by-aspects, see Tversky, 1972) to reduce the n
Corresponding author. Tel.: þ44 2075946086. E-mail addresses:
[email protected] (A. Zolfaghari),
[email protected] (A. Sivakumar),
[email protected] (J. Polak). 1 Tel.: þ44 2075946036. 2 Tel.: þ44 2075946089.
1755-5345/$ - see front matter & 2014 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.jocm.2013.12.004
4
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
number of alternatives to a smaller number before using a compensatory decision rule to make a final decision (see Manrai and Andrews, 1998). This has led to the view that decisions may be made in two stages: (a) non-compensatory stage and (b) compensatory stage. Manski (1977) proposed a two-stage discrete-choice modelling framework incorporating probabilistic choice sets. The general Manski model considers all potential choice sets and requires summation over the power set G which is a set of all non-empty subsets of the master set. The size of the power set G, however, increases exponentially with the increase in the number of alternatives. Therefore, it is practically impossible to estimate the general Manski model with a large number of alternatives in the master choice set. Some authors have tried to overcome the computation burden of the Manski approach by imposing a priori restrictions on the composition of choice sets based on some exogenous evidence from the choice set formation process (see Swait and Ben-Akiva, 1987; Siddarth et al., 1995; Andrews and Manrai, 1998; Zheng and Guo, 2008; Kaplan et al., 2009; Hicks and Schnier, 2010). In the context of residential location choice, Kaplan et al. (2009), (2011), (2012) (KBS, hereafter) developed a probabilistic choice set model incorporating data of individuals searching for dwellings observed using a customised real estate agency website. This secondary data was used to compute the probability of considering a choice set that takes the form of an Ordered Probit model. In this paper, we illustrate that the simplicity of the KBS model arises because of an unrealistic assumption that individuals0 choice sets only contain alternatives that derive from their observed combination of thresholds. By relaxing this assumption we introduce a new probabilistic choice set formation model that allows the choice set space to include all potential choice sets derived from variations in the combinations of thresholds. In addition to extending the KBS model, our proposed model asymptotically approaches the classical Manski model, provided a suitable structure is used to categorise alternatives. In order to illustrate the biases inherent in the original KBS approach, we compare it with our proposed model and the MNL model using a Monte Carlo experiment. The results of this experiment show that the KBS model causes biases in predicted market share if individuals are free to choose from any potential choice sets derived from combinations of thresholds. The rest of this paper is organised as follows. Section 2 reviews the literature in choice set formation problem discussing different approaches and methodologies applied to reduce the computation burden of the Manski model. Section 3 elaborates more on the model developed by Kaplan et al. (2009), (2011), (2012)and discusses the shortcomings of this model. Section 4 presents the proposed modelling framework of this research. Section 5 presents the details of the Monte Carlo experiment, including data generation, different model estimation and evaluation criteria as well as a discussion of the findings. Section 6 concludes this paper. 2. Literature review An important issue in models with a massive universal choice set concerns the fact that decision makers do not choose from the universal choice set when they are facing a large number of alternatives. This issue is known in the literature as the choice set formation problem (Thill, 1992, in the context of destination choice). Since the observed choice data do not reveal any information about the actual choice sets, researchers have proposed different methodologies to deal with the issue of choice set formation based on decision theory.3 Thill (1992) and more recently Pagliara and Timmermans (2009) reviewed different choice set formation approaches applied in the spatial choice context. Focusing on residential choice, different choice set formation approaches are reviewed in the following sub-sections. 2.1. Deterministic choice set formation In the deterministic (or exogenous) choice set formation approach, it is assumed that decision makers0 choice sets can be determined deterministically and exogenous to the choice process. The choice set for each decision maker, in the deterministic approach, is typically generated by restricting the universal choice set using deterministic constraints on one or more attributes of alternatives. Many applications of this approach exist in the literature for different choice contexts. In the mode choice context, for example, knowledge that an individual has no driving licence may lead the analyst to an a priori removal of the auto-mode from the individual0 s choice set (see Ben-Akiva and Lerman, 1974; Train, 1980). In the destination/activity location choice context, a distance threshold, for example, is determined deterministically and used to generate the choice sets based on the hypothesis that individuals do not consider alternatives that would imply travelling more than a specified maximum distance (see Parsons and Hauber, 1998; Termansen et al., 2004; Scott, 2006). Some authors have also proposed applying the importance sampling approach (without considering the alternativespecific correction terms) to incorporate more behavioural realism in the choice set formation stage for spatial models. For example, in the residential location choice component of ILUTE, households0 choice sets are constructed by taking 75% of a random sample from dwellings that were within 15 km of their previous locations and the remaining 25% from dwellings that were not within the 15 km threshold (Farooq and Miller, 2012). Elgar et al. (2009) also applied the importance sampling approach (without alternative-specific correction terms) in a firm location choice context by oversampling of alternatives 3 There have been some attempts to model choices as an outcome of a search process based on the search theory rather than the decision theory and using the “choice process” data rather than the choice data (see Lerman and Mahmassani, 1985; Caplin and Dean, 2011).
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
5
around the anchor-points (anchor-points are points of significance to decision makers – such as work locations – which lead decision makers to assign more importance to contiguous locations when they are looking for a residential location). In general, the definition of choice sets in the deterministic constraint approach (or the importance sampling approach without correction) depends on the analysts0 assumptions as to which variable and what thresholds should be applied to define the choice sets. Zolfaghari et al. (2012) empirically compared alternative exogenous choice set formation approaches within London residential location choice context. The results of aggregate and disaggregate validations of the model suggest that the models with exogenous choice sets in fact perform worse than the universal choice set approach in the residential location choice context. 2.2. Two-stage choice set formation models Some authors have argued that deterministically constructing choice sets based on an analyst0 s judgement involves uncertainty and have thus attempted to incorporate the uncertainty involved in the choice set formation process by probabilistic modelling of choice sets and choices based on the Manski (1977) formulation. The general form of Manski0 s two-stage model is as follows: P i;n ¼ ∑ PðijC n ÞQ n ðC n Þ
ð1Þ
Cn A G
where, P i;n is the unconditional probability of alternative i being chosen, PðijC n Þ is the conditional probability of alternative i being chosen given choice set C n , Q n ðC n Þ is the probability of individual n0 s choice set being C n ; and G is a set of all nonempty subsets of the universal choice set M. In a marketing research and brand choice context, different choice set generation strategies are studied. Manrai and Andrews (1998) reviewed different specifications of choice generation in marketing literature and classified various choice set generation models such as: (a) memory-based or stimulus-based, (b) attribute-based or brand-based, etc. In transportation and environmental planning literature on the other hand, the scope of choice set generation strategies is mainly limited to an attribute-based approach where an individual is assumed to generate her choice set by eliminating alternatives based on one or multiple constraints on the attributes of alternatives (i.e., a constraint-based model). Swait and Ben-Akiva (1987) formulated a two-stage probabilistic model through a constraint-based non-compensatory process for the choice set generation process and the utility maximisation compensatory process for the actual choice (i.e., a semi-compensatory choice set formation model). Since choice sets in general are a latent construct to the analyst, and there is usually nothing observed about them except the most preferred alternative for each individual, the thresholds are assumed to be random variables (i.e., a random constraint model). In order to incorporate the heterogeneity of individuals in the choice set formation stage, one can parameterise the means of distributions of thresholds as a function of individual characteristics. Thresholds are assumed to follow different distributions in the literature, with normal distributions being assumed in Swait and Ben-Akiva (1987), Andrews and Srinivasan (1995), Cantillo and Ortúzar (2005) and logistic distributions in Ben-Akiva and Boccara (1995) and Basar and Bhat (2004). The major drawback of the Manski model is that it is computationally intensive since the number of theoretically possible choice sets increases exponentially as the size of M increases (i.e., jGj ¼ 2jMj 1). This approach is therefore impractical for spatial choice models, which are characterised by large number of alternatives. Morikawa (1995) proposed an alternative formulation for two-stage choice set formation models in order to reduce the computation complexity of the Manski model. Following the Manski formulation, some authors proposed to restrict the composition of the powerset based on some simplifying assumptions regarding the behavioural realism of the choice set formation process. This approach is referred to as the simplified probabilistic choice set formation in this paper and is reviewed separately here. 2.3. Single-stage choice set formation models Following the competing destination model of Fotheringham (1983), some authors have viewed choice sets as fuzzy sets so as to be able to incorporate the uncertainties of choice sets in discrete choice models (see Cascetta and Papola, 2001; Cascetta et al., 2007). Wu and Rangaswamy (2003) provided a detailed discussion on this approach. In a residential location choice context, Martínez et al. (2009) developed the Constrained Multinomial Logit (CMNL) model where each alternative has a degree of membership to the choice set which is determined based on some constraints on the attributes of alternatives. This approach is also referred to as the soft constraint approach, in which the constraints can be violated to some extent (i.e., as opposed to the crisp constraint approach of the Manski model). Bierlaire et al. (2010) compared the single-stage framework with the two-stage modelling framework using synthetic data and concluded that the CMNL model developed by Martínez et al. (2009) is unable to reproduce the results of the Manski model and should be considered as a model in its own right. 2.4. Taste-driven choice set formation models Horowitz and Louviere (1995) hypothesised that the choice set may simply be a reflection of preference rather than a separate construct. They argued that choice sets provide no information beyond that contained in the utility function. The
6
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
utility function, however, is never known with certainty to the analyst and must be estimated from the data. Since any information about preferences may be useful for improving estimation efficiency, Horowitz and Louviere (1995) argue that the specifications of choice sets can be used to improve estimation efficiency. Hence, they employed a range of survey questions to elicit individuals0 choice sets and used that to improve the specification of the utility function (see Horowitz and Louviere, 1995; Parsons et al., 1999). Swait (2001) proposed a new model of choice set generation belonging to the GEV (Generalised Extreme Value) family of discrete choice models (i.e., the GenL model). The choice sets in the GenL model are taste-driven, similar to the proposal of Horowitz and Louviere (1995). An interesting feature of the GenL model is that the model neither uses exogenous information nor incorporates latent variables describing the choice sets. 2.5. Simplified probabilistic choice set formation models Some authors have tried to overcome the computation burden of the Manski model while still accommodating the probabilistic nature of choice sets by restricting the composition of set G based on some simplifying assumptions regarding the behavioural realism of the choice set formation process. By restricting G, we in fact assume that an aggregated set of alternatives can be in or out of the choice set together. Assuming that an individual either chooses from the choice set that contains the feasible/available alternatives (considering the constraints of the individual) or is free to choose from the universal choice set, Swait and Ben-Akiva (1987) proposed a model with the restricted G as P i;n ¼ K½ PðijMn;rðiÞ ÞQ n ðM n;rðiÞ Þ þ PðijM n ÞQ n ðMn Þ
ð2Þ
Gn ¼ f M n;1 ; …; M n;R ; M n g
ð3Þ
where, Gn is the restricted choice set space and K is a normalisation constant to account for the restrictions imposed on the composition of the choice sets. In this approach, the universal choice set is subdivided into R non-empty, mutually exclusive and collectively exhaustive subsets M n;r (i:e:; M n ¼ [ Rr ¼ 1 M n;r and M n;r \ M n;r0 ¼ ∅; 8 r 0 ar). Hence, an individual is either free to choose from the universal choice set M n or he is limited to one of its subsets which meets the individual0 s constraints. The captivity model (Dogit model) developed by Gaudry and Dagenais (1979) can be seen as a special case of this model. In the Dogit model, individuals are either captive to an alternative (e.g., low-income workers may be captive to public transportation) or free to choose among all the alternatives, which can be interpreted as special case of the model presented in Eq. (2) if we assume that jM n;r j ¼ 1; r ¼ 1::R. The model proposed by Siddarth et al. (1995) based on secondary data on consumer purchase history is also equivalent to the Swait and Ben-Akiva0 s (1987) model with restricted G as described in Eq. (2). The GenL model proposed by Swait (2001) is also developed by restricting the choice set space based on some exogenous evidence from the choice process, even though the behavioural motivation of the GenL model is different from the two-stage models. In the marketing literature, Andrews and Manrai (1998) developed a feature-based elimination model that assumed that decision makers apply a sequence of non-compensatory rules to form their choice sets based on the features of the alternatives. In this model, alternatives in the universal choice set are categorised based on their features in a hierarchical structure. This model can also be considered as a simplified two-stage model because the G set is restricted to include the choice sets matching the hierarchical structure, not all of the subsets of the universal choice set. In the destination choice context, Zheng and Guo (2008) developed a random constraint model for destination choice with 27 alternatives. The restricted set G is constructed by assuming that a destination alternative is included in an individual0 s potential choice set if, and only if, all other alternatives that are closer to the individual0 s trip origin are also included in the choice set. By restricting the composition of set G based on the abovementioned spatial contiguity assumption, Zheng and Guo reduced the total number of possible choice sets from 2jMj 1 to L r jMj for each trip origin. Hicks and Schnier (2010) developed a simplified probabilistic choice set formation model using different macrodefinitions of spatial regions in order to focus on the micro-level spatial decision making and to investigate the sensitivity of the results to alternative macro-level spatial definitions. Again, they demonstrated that by assuming some structure on the composition of choice sets (i.e., restricting G); the dimensionality problems associated with the general Manski model can be reduced. In a residential location choice context, Kaplan et al. (2009), (2011), (2012) developed a probabilistic choice set formation model incorporating data of individuals searching for dwellings observed using a real estate agency website. The KBS model uses observed choice set data as well as chosen alternative data in order to reduce the computational complexity of probabilistic choice set models. Estimation of the KBS model, therefore, requires observation of individual0 s thresholds, such as number of bedrooms and price thresholds, etc. as discrete categorical variables during the search process (see Fig. 1). 3. The shortcomings of the KBS model The KBS model assumes that the choice sets derived by observed individuals0 thresholds is the “true choice set”. Kaplan et al. (2009) argued that by observing the “true choice set” using a customised real estate agency website, the unconditional choice probability of an alternative does not require enumeration over all potential choice sets (unlike the Manski model)
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
Combination of Thresholds
Master Set
7
Multinomial Ordered Probit Model
Choice Set Corresponding to the Selected Criteria
Choice Process
MNL Model
Fig. 1. Conceptual framework of the KBS model.
and can be written as P i;n ¼ PðijC n ÞQ n ðC n Þ
ð4Þ
where PðijC n Þ is the conditional probability of alternative i being chosen given the “true choice set” C n , Q n ðC n Þ is the probability that the “true choice set” is C n . In the original KBS model the conditional probability is assumed to have a MNL structural form. It also assumes that the probability of selecting the choice set C n is equal to the probability of selecting a combination of thresholds. Further assuming that choices of different thresholds are independent, the following choice set probability is proposed: Q n ðC n Þ ¼ Pðt n1;n \ t n2;n \ … \ t nK;n Þ ¼ Pðt n1;n Þ Pðt n2;n Þ … Pðt nK;n Þ
ð5Þ
As mentioned earlier, in the KBS model individuals0 thresholds on different criteria, such as number of beds, prices, etc., as well as their characteristics such as income, age, etc., are observed during the search-by-criteria stage using a real estate agency website. Since the thresholds are categorical variables (pre-determined values) and are naturally ordered, the probability of individual n to select the thresholds t n of the kth criterion (i.e.,Pðt nk;n Þ) is modelled using an Ordered Probit formulation in the KBS model. Assuming that the probability of selecting a threshold is related to characteristics of an individual, the dependent variable t nk;n is characterised as a function of individual0 s characteristics and an error term as t nk;n ¼ αk Z k;n þ εk;n
ð6Þ
where Z k;n is the vector of explanatory variables (i.e., individual0 s characteristics), αk is the model parameters and εk;n is the normal error term. The further extensions of the KBS model also incorporate the correlations among different criteria in the noncompensatory choice set formation stage using the Multinomial Ordered Probit model developed by Bhat and Srinivasan (2005) as well as incorporating a flexible error structure in the compensatory choice stage based on Mixed Logit formulation (see Kaplan et al., 2012). The overall approach in all of these extensions is the same, however, and the unconditional choice probability is still represented by Eq. (4). The non-compensatory choice set formation model based on the Multinomial Ordered Probit formulation and the compensatory choice model based on the Multinomial Logit or the Mixed Logit formulation are estimated jointly using a maximum likelihood approach in the KBS model. Here, we argue that the simplicity of the KBS model arises not because of observing the thresholds and adding more information, rather by restricting G to include only specific choice sets similar to the other models presented earlier. In the KBS model, it is assumed that each combination of thresholds corresponds to a choice set. This effectively means that in this approach the choice set space G is limited to include R ði:e:; R ¼ I 1 I 2 …I K Þ mutually exclusive and collectively exhaustive subsets of the universal choice set corresponding to each combination of thresholds (I k is the number of categories for kth criterion). Mathematically speaking, the choice set space G corresponding to different combinations of thresholds (R) is given as G ¼ f C 1 ; C 2 ; …; C R g
ð7Þ
where C r is the choice set corresponding to rth combination of thresholds. Since PðijC n Þ ¼0 if i 2 = C n , the general Manski model with this restricted G reduces to Eq. (4) in the KBS model. It should be noted that a two-stage model is probabilistic when the composition of the choice set is not known with certainty and non-zero occurrence probabilities are attached to two or more sets (Manrai and Andrews, 1998). With this in mind, the KBS model cannot be considered as a probabilistic model, because the non-zero occurrence probabilities are only attached to one set (the observed choice set). This is an important drawback of the KBS model because the summation of choice probabilities would not equal to one. Moreover, in the KBS model, it is assumed that by observing the combination of thresholds for each individual during the search process, we are in fact observing the “true choice set” of that individual. It should be borne in mind that observing the choice sets with certainty is impossible (Shocker et al., 1991). All points
8
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
considered, we believe that the KBS model is in fact a deterministic model and it should not be considered as a probabilistic choice set formation model. It should be also noted that the joint estimation of the choice set formation stage and the choice stage, as suggested by Kaplan et al. (2009), is not necessarily since the probability of choosing an alternative given a choice set and the probability of selecting a combination of thresholds are independent in the KBS model. In summary, there are both methodological and conceptual shortcomings in the KBS model. From the conceptual perspective, the assumption that the “true choice set” can be observed by observing individuals0 thresholds is not empirically supported. From the methodological perspective, meanwhile, the summation of choice probabilities over the alternatives in the KBS model presented in Eq. (4) does not add up to one which results in incorrect predictions of the market shares when the model is applied. The KBS model is interesting, however, in that it introduces an innovative approach to estimate choice models based on observed individuals0 thresholds on different criteria in the web-based store environment. 4. The proposed model Punj and Moore (2009) proposed a conceptual model of information search and choice set formation in a web-based store environment based on constructs that are known to affect consumer behaviour in online settings such as eliminationtype screening strategies (e.g., thresholds on different attributes), electronic decision aids, etc. They conducted a survey of 120 undergraduate students choosing an apartment to rent near a hypothetical university in order to test different hypotheses about student0 s searching behaviour and choice set formation in an online setting. They concluded that (1) when more time is available, the number of search iterations conducted increases, as well as the number of alternatives in the individuals0 choice sets, (2) when many alternatives are available consumers conduct fewer search iterations but they do not actually examine fewer alternatives (they have a larger choice set), and (3) consumers form larger choice sets when many alternatives are available but form smaller choice sets when more time is available. Thresholds for different criteria in real estate websites, as used in the KBS model, therefore, are actually tools to simplify the information search process and it would be behaviourally unrealistic to assume that an individual0 s choice set can only contain alternatives corresponding to a combination of thresholds. Here, we assume that the choice sets can be the union of two or more sets of alternatives derived from a combination of thresholds and propose the following model: P i;n ¼ ∑C n A G PðijC n ÞQ n ðC n Þ
ð8Þ
G ¼ f C 1 ; C 2 ; …; C R ; fC 1 ; C 2 g; fC 1 ; C 3 g; …; fC 1 ; C 2 ; C 3 g; …; fC 1 ; C 2 ; …; C R gg
ð9Þ
where C 1 ; C 2 ; …; C R are an aggregated set of alternatives based on the combination of thresholds, R ¼ I 1 I 2 …I K is the total combination of K attributes0 thresholds, I 1 ; I 2 ; …; I K are the number of ordered categories in each attribute0 s threshold. The alternatives in the proposed model are, therefore, categorised to larger sets and the choice set space is constructed based on these sets of aggregated alternatives. The aggregated set of alternatives, similar to the KBS model, is derived from a combination of thresholds on different attributes implemented as searching tools in real estate agency websites. Each criterion used in the search process is represented by different threshold values as discrete categorical variables. The size of set G grows exponentially as the number of thresholds and number of criteria increases (i:e:; jGj ¼ 2R 1). Hence, in addition to extending the KBS model, our proposed model asymptotically approaches the classical Manski model when the universal choice set is partitioned in such a way that each alternative belongs to a combination of thresholds. It should be noted that the computation burden of this model is much less than the original Manski (1977) model since alternatives are grouped into R categories and the choice set space thus grows exponentially with the number of categories rather than the number of alternatives, as in the original Manski model. In the most general case, the probability of selecting an aggregated set of alternatives (i.e., P n ðC r Þ) should be calculated based on the chosen thresholds on different attributes (as well as the decision maker0 s socioeconomics) on each search iteration. However, since one of the main objectives of this paper is to compare the proposed model against the KBS model, here we assumed that the chosen thresholds on different attributes are only observed for the final search iteration similar to the KBS approach. The probabilities of selecting aggregated sets of alternatives, therefore, can be modelled as a Multinomial Ordered Probit model:4 P n ðC r Þ ¼ Pðt n1;r ÞPðt n2;r Þ…Pðt nk;r Þ
ð10Þ
8 C r A C ¼ f C 1 ; C 2 ; …; C R g
ð11Þ
It should be noted that the probability of selecting a combination of thresholds is equal to the choice set probability in the KBS model (see Eq. 5). The choice set probability in the proposed model, however, is calculated as following: Q n ðC n Þ ¼ K∏C r A C P n ðC r A CÞ∏C r 2= C n ð1 P n ðC r A CÞÞ
ð12Þ
where K is a normalisation constant in order to allow the choice set probabilities over G to sum to one. 4
For the sake of simplicity, we do not incorporate the correlations between the thresholds in our formulation (see Bhat and Srinivasan, 2005).
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
9
Here, we assumed that probabilities of selecting aggregated sets of alternatives are independent, similar to the independent availability assumption of random constraint models (see Swait and Ben-Akiva, 1987). Further research is required in order to understand the implication of this assumption in the calculation of the choice set probabilities. The estimation of the proposed model requires novel data collected during the search process in an online environment. The overall estimation approach has the following stages: first, the Multinomial Ordered Probit model presented in Eq. (10) is estimated using the Maximum Likelihood approach (ML). Then, the parameters of choice model of Eq. (8) is estimated using the ML approach, while the choice set probabilities are calculated following Eq. (12) based on the estimated parameters of the Multinomial Ordered Probit model. Considering the residential choice example of Fig. 2, where the dwelling price and dwelling size thresholds are used in the search process, and there are 3 ordered categories for dwelling price and dwelling size thresholds, the universal choice set is partitioned into 9 categories (i.e., R ¼3 3 ¼9), each corresponding to an aggregated set of alternatives. Decision makers0 characteristics such as household income and household size can potentially influence the probability of selecting a combination of thresholds (i.e., the Multinomial Ordered Probit model). The choice set space in this example contains 511 members (i.e., jGj ¼ 29 1Þ as illustrated in Fig. 2. In the choice stage, different attributes such as dwelling price, dwelling size, school quality, commute time to workplace location, etc. can potentially influence the choice probabilities (i.e., the MNL model). Guo (2004) and Schirmer et al. (2012) comprehensively reviewed the previously developed residential location choice models from an empirical perspective and summarised influencing factors of residential location choice decisions. It should be noted that the original KBS model assumes that the choice set space can only include the choice sets corresponding to these partitions. In our proposed model, however, the choice set space includes different variations of these partitions. (see, Fig. 2).
KBS Model Choice Set Space (9 members)
Proposed Model Choice Set Space (511 members)
Fig. 2. The KBS model vs. the proposed model: an example of real estate website.
10
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
The proposed model assumes that an individual choice set can include the alternatives corresponding to one or more combination of thresholds (i.e., one or more categories) and cannot include, for example, half of the alternatives from one category and another half from another category. Hence, both the KBS model and our proposed model assume some predetermined restrictions on the composition of the choice set space, and the simplification, compared to the original Manski model, happens because of these restrictions. This is also analogous to the model developed by Hicks and Schnier (2010), except that, in the Hicks and Schnier0 s (2010) model, observed spatial regulations are used to categorise the alternatives rather than the observed thresholds. The proposed model can also be viewed as a variation of the Swait0 s (2001) GenL model where, instead of latent grouping of alternatives, alternatives are categorised based on observed thresholds. Similar to the GenL model, the proposed model assumes that the choice set space is part of the model specification. In the next section, we conduct a Monte Carlo experiment to compare the proposed model against the original Manski model and to show the bias inherent in the KBS model using simulated data. 5. Monte Carlo simulation In order to illustrate the performance of our approach against the original Manski model and to show the bias inherent in the KBS model, we have conducted two Monte Carlo case studies. The simulation in general has three stages: (i) simulation of alternatives (dwelling units), (ii) simulation of decision makers (households), and (iii) assigning alternatives to decision makers based on the proposed choice model. The generation of alternatives was conducted by randomly scattering dwellings units across a region which comprised five Travel Analysis Zones (TAZ). For each alternative, we generated two attributes: (i) price and (ii) location (TAZ). Dwelling prices and dwelling locations (TAZs) were generated independently. This might not be true in general but for the purposes of our analysis this independency assumption is not limiting. The number of alternatives in each case study is different and is described in the following sub-sections. For each household we generated three characteristics: (i) income, (ii) TAZ of the workplace location, and (iii) price threshold. Here again, the generation of incomes and workplace locations are independent although it might not be true in reality. We adopted quite a large sample size (NH ¼1000) in order to avoid biases associated with small sample sizes. The price threshold was generated based on the following latent threshold as a function of household income: t nn ¼ αI n þ εn
ð13Þ
where I n is the income of household n, and εn is a standard normal error term. The postulated parameter of the latent threshold is assumed to be one (i.e., α¼1). We assumed that alternatives are classified into NCAT categories (different for each case study as discussed later) based on their prices in a hypothetical real estate agency website. Without loss of generality, it can also be assumed that the number of alternatives in each category is the same. This study assumes that the true behaviour is represented by the Manski model since most empirical studies have confirmed the superiority of the Manski model against the MNL model and models with deterministic choice sets (see Swait and Ben-Akiva, 1987; Zheng and Guo, 2008). The alternatives, therefore, were assigned to decision makers based on the two-stage probabilistic choice set formation model following a Monte Carlo sampling process. This procedure generates a random number between 0 and 1, and compares it to the cumulative choice probability of alternatives. The alternative that has a cumulative probability interval which contains the random number is assigned to the households. The following utility function is assumed for the compensatory stage U in ¼ β1 T T in þ β2 P i þ εin
ð14Þ
where TT in is the travel time from the TAZ of dwelling i to TAZ of the workplace of household, P i is the price of alternative i divided by 1000, and εin is the error term, which is independent and identically distributed (IIA) across alternatives and decision makers. Finally, the postulated parameters for compensatory stage are assumed as: β1 ¼ β2 ¼ 1. 5.1. Case study 1 The first case study concerns hypothetical choice situations with 10, 8 and 6 alternatives in order to be able to implement the original Manski model. The simulation had two stages. First, we assumed that the Manski model is the true model; therefore, we used Manski0 s choice probabilities to generate the chosen alternatives for all households in the sample. Secondly, we estimated three models (M1, M2 and M3) varying in number of categories for each choice situation. Model estimation was achieved using a randomly drawn 75% from the sample with the remaining 25% being kept aside as a holdout sample. M1 corresponded to the original Manski model where the number of alternatives is equal to the number of categories. M2 and M3 were estimated based on grouping of alternatives into different categories. For simplicity, we kept the number of alternatives within each category the same as discussed before. Table 1 summarises the specifications of the different models. The 25% hold-out sample was used to compare the predicted market share of the proposed model (with a different number of categories) against the true market share (market share of the hold-out sample). Two performance measures: (a) Root Mean
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
11
Table 1 Case study 1 – estimated models. Choice situation1
Number of alternatives
Number of categories
M1 M2 M3
NALT ¼ 10 NALT ¼ 10 NALT ¼ 10
NCAT ¼10 NCAT ¼5 NCAT ¼2
Choice Situation 2 M1 NALT ¼ 8 M2 NALT ¼ 8 M3 NALT ¼ 8
NCAT ¼8 NCAT ¼4 NCAT ¼2
Choice situation 3 M1 NALT ¼ 6 M2 NALT ¼ 6 M3 NALT ¼ 6
NCAT ¼6 NCAT ¼3 NCAT ¼2
Table 2 Estimated parameters and prediction tests of case study 1 (average of 10 runs). M1 NALT ¼6 NCAT ¼6 β1 β2
M2 NALT ¼ 6
M3 NALT ¼ 6
NCAT ¼3 β1 β2
Estimate 0.44 0.28
Prediction test on the hold-out sample RMSE 1.83 MADE 1.68
RMSE MADE
15.53 15
M1 NALT ¼8
M2 NALT ¼ 8
NCAT ¼8 β1 β2
Estimate 0.95 0.92
t-stat 3.43 3.86
Estimate 0.16 0.16
Prediction test on the hold-out sample RMSE 5.54 MADE 3.76
RMSE MADE
18.58 14.94
M1 NALT ¼10
M1 NALT ¼ 10
NCAT ¼10 Estimate t-stat β1 0.9 2.8 β2 0.95 3.25 Prediction test on the hold-out sample RMSE 7.28 MADE 6.1
NCAT ¼5 β1 β2
Estimate 0.06 0.06
RMSE MADE
24.06 20.45
t-stat 2.51 5.86
NCAT ¼ 2 β1 β2
Estimate 0.11 0.19
RMSE MADE
43.57 36.51
t-stat 15.41 14.42
M3 NALT ¼ 8
NCAT ¼4 β1 β2
Estimate 0.95 1.02
t-stat 13.93 13.78
t-stat 7.21 8.88
NCAT ¼ 2 β1 β2
Estimate 0.13 0.2
RMSE MADE
80.26 55.91
t-stat 18.52 20.02
M1 NALT ¼ 10 t-stat 4.21 3.88
NCAT ¼ 2 β1 β2
Estimate 0.01 0.24
RMSE MADE
86.13 67.3
t-stat 19.33 15.46
Square Error (RMSE), and (b) Mean Absolute Percentage Error (MAPE) were used for the evaluation of the proximity of the predicted and true market share of different models. We have also reported the estimated parameters for different models; however, we could not compare the estimated parameters against the postulated true parameters, except for M1, because of the different structure of the models. The average results of ten Monte Carlo simulation runs are tabulated in Table 2. The results also show that the performance of the proposed model is very sensitive to the imposed structure on the choice set space. Hence, similar to the GenL model (Swait, 2001), the choice set space should be considered as a part of the model specification. As expected, however, the RMSE and MAPE measures of proximity suggest that the performance of M1 is better than M2 and the performance of M2 is better than M3. The results show that the proposed model with restricted choice set space cannot approximate the model with the full choice set space (i.e., the Manski model) and it should be considered as a model in its own right for searching and choice set formation. Further empirical studies are required to evaluate the performance of the proposed approach and to determine a suitable structure (i.e., number of thresholds and number of categories) before applying the proposed model in practice. This shapes the future work of this study. 5.2. Case study 2 The second case study concerns a hypothetical choice situation among 50 alternatives. Clearly, estimation of the original Manski model with 50 alternatives is not feasible given the current computing power. In the first stage of simulation of this
12
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
Table 3 Estimated Parameters and Prediction Tests of Case Study 2 (Average of 10 runs). Proposed model
Estimate
t-stat
KBS
Estimate
t-stat
MNL
Estimate
t-stat
β1 β2
1.01 0.97
10.19 9.96
β1 β2
0.83 0.62
6.99 6.81
β1 β2
0.24 0.21
17.86 12.77
RMSE MADE
6.49 4.16
RMSE MADE
3.93 2.75
Prediction test on the hold-out sample RMSE 1.98 MADE 1.34
case study, therefore, the proposed model was used with five categories to generate the chosen alternatives for all households in the sample. Here again, for the sake of simplicity, we kept the number of alternatives within in each category the same (i.e., ten alternatives in each category). In the second stage, we fitted the KBS model and the MNL model to the simulated data. For comparison purposes, we also fitted the proposed model with five categories (assumed to be a true model in this case study). Here again, model estimation was achieved using 75% the simulated data with the remainder kept as a hold-out sample. The hold-out sample was used to compare the predicted market share of the proposed model against the KBS model and the MNL model using RMSE and MADE measures. The estimated parameters are also reported but, as discussed earlier, we cannot draw any conclusions from the estimated parameter because of the different structures of the models. Table 3 tabulates the results of the averages of ten Monte Carlo simulation runs. The prediction tests using the hold-out sample method show that the KBS model cannot reproduce the market shares with good approximation whereas the MNL model retrieves the true market shares more accurately. The KBS model is therefore not suitable for choice situations in which the chosen alternative is not a member of the choice set derived from observed thresholds. In fact, where the chosen alternatives for a household are not a member of the choice set derived from their observed thresholds, that household should be eliminated in order to be able to estimate the KBS model. Our simulation confirms that, on average, only 19% of households will end up with alternatives that belong to the choice set derived from their observed thresholds of the final search iteration. The results of this case study illustrate that the KBS model causes biases in predicted market share if we take into account the probabilistic nature of choice sets and assume that individuals are free to choose from any potential choice sets derived from combinations of thresholds.
6. Conclusion Estimation of probabilistic choice set formation models based on Manski (1977) formulation is impractical even for only a moderately large number of alternatives, as is typically the case in most spatial choice contexts. This paper has described and numerically illustrated that the simplicity of the simplified choice set formation model developed by Kaplan et al. (2009), (2011), (2012) arises because of unrealistic assumptions that they made regarding the choice set space. It has also proposed a novel simplified probabilistic choice set formation approach to model simultaneously both choice set formation and choices of households in online real estate websites. Using this simulated dataset, the performance of the proposed model against the Manski model in predicting the true market share was evaluated using the hold-out sample technique. The results show that the proposed model cannot approximate the model with the full choice set space and should be considered as a model in its own right for searching and choice set formation. The bias inherent in the KBS model was also illustrated by comparing it against the proposed model and the MNL model. Finally, although all empirical studies have confirmed the superiority of the two-stage choice set formation models against the MNL model, the predictive power of the proposed simplified probabilistic choice set formation model should also be investigated using real data. This remains a task for future research. The validity and the implications of assuming that probabilities of selecting aggregated sets of alternatives are independent should also be empirically examined. References Andrews, R.L., Manrai, A.K., 1998. Feature-based elimination: model and empirical comparison. Eur. J. Oper. Res. 111, 248–267. Andrews, R.L., Srinivasan, T., 1995. Studying consideration effects in empirical choice models using scanner panel data. J. Mark. Res., 30–41. Basar, G., Bhat, C., 2004. A parameterized consideration set model for airport choice: an application to the San Francisco Bay Area. Transp. Res. Part B: Methodol. 38, 889–904. Ben-Akiva, M., Boccara, B., 1995. Discrete choice models with latent choice setsn 1. Int. J. Res. Mark. 12, 9–24. Ben-Akiva, M., Lerman, S., 1974. Some estimation results of a simultaneous model of auto ownership and mode choice to work. Transportation 3, 357–376. Bhat, C.R., Srinivasan, S., 2005. A multidimensional mixed ordered-response model for analyzing weekend activity participation. Transp. Res. Part B: Methodol. 39, 255–278. Bierlaire, M., Hurtubia, R., Flötteröd, G., 2010. Analysis of implicit choice set generation using a constrained multinomial logit model. Transp. Res. Rec.: J. Transp. Res. Board 2175, 92–97. Cantillo, V., Ortúzar, J., 2005. A semi-compensatory discrete choice model with explicit attribute thresholds of perception. Transp. Res. Part B: Methodol. 39, 641–657. Caplin, A., Dean, M., 2011. Search, choice, and revealed preference. Theor. Econ. 6, 19–48. Cascetta, E.,Paliara, F., Axhausen, K., 2007. The Use of Dominance Variables in Choice Set Generation.
A. Zolfaghari et al. / Journal of Choice Modelling 9 (2013) 3–13
13
Cascetta, E., Papola, A., 2001. Random utility models with implicit availability/perception of choice alternatives for the simulation of travel demand. Transp. Res. Part C: Emerg. Technol. 9, 249–263. Elgar, I., Farooq, B., Miller, E.J., 2009. Modeling location decisions of office firms. Transp. Res. Rec.: J. Transp. Res. Board 2133, 56–63. Farooq, B., Miller, E.J., 2012. Towards integrated land use and transportation: a dynamic disequilibrium based microsimulation framework for built space markets. Transp. Res. Part A: Policy Pract 46, 1030–1053. Fotheringham, A., Brunsdon, C., Charlton, M., 2000. Quantitative Geography: Perspectives on Spatial Data Analysis. Sage Publications Ltd., London. Fotheringham, A.S., 1983. A new set of spatial-interaction models: the theory of competing destinations. Environment and Planning A 15, 15–36. Gaudry, M.J.I., Dagenais, M.G., 1979. The dogit model. Transp. Res. 13B, 105–111. Guo, J.Y., 2004. Addressing Spatial Complexities in Residential Location Choice Models (Ph.D). The University of Texas, Austin. Hicks, R.L., Schnier, K.E., 2010. Spatial regulations and endogenous consideration sets in fisheries. Resour. Energy Econ. 32, 117–134. Horowitz, J.L., Louviere, J.J., 1995. What is the role of consideration sets in choice modeling? Int. J. Res. Mark. 12, 39–54. Kaplan, S., Bekhor, S., Shiftan, Y., 2009. Two-stage model for jointly revealing determinants of noncompensatory conjunctive choice set formation and compensatory choice. Transp. Res. Rec.: J. Transp. Res. Board 2134, 153–163. Kaplan, S., Bekhor, S., Shiftan, Y., 2011. Development and estimation of a semi-compensatory residential choice model based on explicit choice protocols. Ann. Reg. Sci., 1–30. Kaplan, S., Shiftan, Y., Bekhor, S., 2012. Development and estimation of a semi-compensatory model with a flexible error structure. Transp. Res. Part B: Methodol. 46, 291–304. Lerman, S., Mahmassani, H., 1985. The econometrics of search. Environ. Plan. A 17, 1009–1024. Manrai, A.K., Andrews, R.L., 1998. Two-stage discrete choice models for scanner panel data: an assessment of process and assumptions. Eur. J. Oper. Res. 111, 193–215. Manski, C., 1977. The structure of random utility models. Theory Decis. 8, 229–254. Martínez, F., Aguila, F., Hurtubia, R., 2009. The constrained multinomial logit: a semi-compensatory choice model. Transp. Res. Part B: Methodol. 43, 365–377. Morikawa, T., 1995. A hybrid probabilistic choice set model with compensatory and noncompensatory choice rules. In: Proceedings of the 7th World Conference on Transport Research, World Transport Research, Sydney, Australia, pp. 317–325. Pagliara, F., Timmermans, H., 2009. Choice set generation in spatial contexts: a review. Transportation Letters 1, 181–196. Parsons, G.R., Hauber, A.B., 1998. Spatial boundaries and choice set definition in a random utility model of recreation demand. Land Econ. 74, 32–48. Parsons, G.R., Massey, D.M., Tomasi, T., 1999. Familiar and favorite sites in a random utility model of beach recreation. Mar. Resour. Econ. 14, 299–315. Punj, G., Moore, R., 2009. Information search and consideration set formation in a web-based store environment. J. Bus. Res. 62, 644–650. Schirmer, P.,Van Eggermond, M.A.B., Axhausen, K.W., 2012. Reviewing measurements in residential location choice models. In: Proceedings of the 13th International Conference on Travel Behaviour Research (IATBR), Toronto, Canada. Scott, D.M., 2006. Constrained destination choice set generation: a comparison of GIS-based approaches. Transportation Research Board 85th Annual Meeting, Washington DC, United States. Shocker, A.D., Ben-Akiva, M., Boccara, B., Nedungadi, P., 1991. Consideration set influences on consumer decision-making and choice: issues, models, and suggestions. Mark. Lett. 2, 181–197. Siddarth, S., Bucklin, R.E., Morrison, D.G., 1995. Making the cut: modeling and analyzing choice set restriction in scanner panel data. J. Mark. Res., 255–266. Stopher, P., 1980. Captivity and choice in travel-behavior models. Transp. Eng. J. 106, 427–435. Swait, J., 2001. Choice set generation within the generalized extreme value family of discrete choice models. Transp. Res. Part B: Methodol. 35, 643–666. Swait, J., Ben-Akiva, M., 1987. Incorporating random constraints in discrete models of choice set generation. Transp. Res. Part B: Methodol. 21, 91–102. Termansen, M., Mcclean, C., Skov-Petersen, H., 2004. Recreational site choice modelling using high-resolution spatial data. Environ. Plan. B: Plan. Des. 36, 1085–1099. Thill, J.-C., 1992. Choice set formation for destination choice modelling. Prog. Hum. Geogr. 16, 361–382. Train, K., 1980. A structured logit model of auto ownership and mode choice. Rev Econ. Stud. 47, 357–370. Tversky, A., 1972. Elimination by aspects: a theory of choice. Psychol. Rev. 79, 281. Wu, J., Rangaswamy, A., 2003. A fuzzy set model of search and consideration with an application to an online market. Mark. Sci., 411–434. Zheng, J., Guo, J., 2008. Destination choice model incorporating choice set formation. Transportation Research Board 87th Annual Meeting, 13-1-2008 to 171-2008, Washington DC, United States. Zolfaghari, A.,Sivakumar, A., Polak, J.W., 2012. Choice set formation in residential location choice modelling: empirical comparison of alternative approaches. Transportation Research Board (TRB), Washington DC, US.