World Development 127 (2020) 104796
Commentary
Good identification, meet good data
Andrew Dillon a,⇑, Dean Karlan a, Christopher Udry a, Jonathan Zinman b
a Northwestern University, United States
b Dartmouth College, United States
Article info
Article history: Accepted 1 December 2019
Abstract
Causal inference lies at the heart of social science, and the 2019 Nobel Prize in Economics highlights the value of randomized variation for identifying causal effects and mechanisms. But causal inference cannot rely on randomized variation alone; it also requires good data. Yet the data-generating process has received less consideration from economists. We provide a simple framework to clarify how research inputs affect data quality and discuss several such inputs, including interviewer selection and training, survey design, and investments in linking across multiple data sources. More investment in research on the data quality production function would considerably improve causal inference generally, and poverty alleviation specifically.
© 2019 Elsevier Ltd. All rights reserved.
⇑ Corresponding author.
E-mail addresses: [email protected] (A. Dillon), karlan@northwestern.edu (D. Karlan), [email protected] (C. Udry), [email protected] (J. Zinman).
https://doi.org/10.1016/j.worlddev.2019.104796
0305-750X/© 2019 Elsevier Ltd. All rights reserved.
1. Introduction
The 2019 Nobel Prize in Economics acknowledged the intellectual leadership of Abhijit Banerjee, Esther Duflo and Michael Kremer in using an experimental approach to fighting poverty in developing countries. Experimental methods have contributed to a revolution in development economics, providing a powerful tool for understanding fundamental questions of economic life and what works, what does not and why in fighting poverty.
Naturally, learning requires not just good identification but also good data. Poor data quality can lead to biased inference, low statistical power and limited external validity (via classical or non-classical measurement error). Fortunately, researchers have some control over data quality: it is endogenous to research design choices like data source, sampling strategy, survey content, and fieldwork implementation protocols.
Yet whereas randomized controlled trials became a “movement”, work to understand data quality and how research design choices affect it (i.e., the data quality production function) has not risen to movement status (yet). For example, standard power calculations have few choice variables: they presume unbiased inference rather than considering non-classical measurement error as a potential confounder, and they neglect to consider how data source (e.g., administrative vs. household or firm survey) and elicitation method (e.g., survey administration channel and/or question/task content) may affect the variance of key outcomes.
We argue that more explicit consideration of the data quality production function will improve future research. With a better understanding of how design choices affect data quality and inference, researchers could identify “free lunches” where there is a clear best practice; for example, using automated data audit procedures. Researchers also could make more-informed tradeoffs between unbiased inference, statistical power, and external validity; for example, when deciding between investing in relatively precise GPS measurement of farm size and increasing sample size by relying on relatively imprecise estimates from farmers.
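The measurement-versus-sample-size tradeoff described above can be made concrete with a small power calculation. The following is a minimal sketch, not a method from this commentary: it assumes a two-arm trial with equal allocation, treats the mismeasured variable as the outcome, and assumes purely classical (mean-zero, independent) error, so that measurement error only inflates variance. The function names, the budget, the per-observation costs, and the error magnitudes are all hypothetical. Non-classical error would also introduce bias, which this sketch ignores.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n, outcome_sd, error_sd, alpha=0.05, power=0.80):
    """MDE for a two-arm trial with equal allocation and classical outcome error.

    Classical measurement error adds error_sd**2 to the outcome variance,
    which widens the standard error of the difference in means.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    observed_var = outcome_sd ** 2 + error_sd ** 2
    standard_error = sqrt(4 * observed_var / n)  # n / 2 units per arm
    return multiplier * standard_error


def affordable_sample(budget, cost_per_observation):
    """Largest even sample size that fits within the budget."""
    n = int(budget // cost_per_observation)
    return n - (n % 2)


# Hypothetical inputs: budget and costs share a currency unit;
# standard deviations are in units of the outcome.
budget = 60_000
true_sd = 1.0
designs = {
    "GPS measurement": {"cost": 30, "error_sd": 0.2},
    "farmer self-report": {"cost": 15, "error_sd": 1.0},
}

results = {}
for name, design in designs.items():
    n = affordable_sample(budget, design["cost"])
    mde = minimum_detectable_effect(n, true_sd, design["error_sd"])
    results[name] = (n, mde)
    print(f"{name}: n={n}, MDE={mde:.3f}")
```

Under these made-up numbers the two designs deliver similar minimum detectable effects, which is the point of the exercise: which design wins depends on the relative cost and noise of each measurement method, and those are exactly the parameters of the data quality production function that are rarely known in advance.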
2. A research design problem and the data quality production function
Consider a researcher seeking to maximize credible knowledge K, which is created through attention not only to causal inference, C, but also to statistical power, P, and external validity, E. The researcher is constrained by a budget, B, and a data quality production function, D(c), that depends on a vector of design choices c. So the researcher’s problem is to maximize K(C, P, E) subject to B and D(c).
This framework allows for the likely possibility that there are free lunches as well as tradeoffs. Indeed, we believe that we and many others have been working well in the interior of D(c), that is, short of the data quality attainable at our current costs. To take just one example, we have likely paid insufficient attention to work in neighboring disciplines on best practices. This is likely particularly true as economic research expands its boundaries (Dhar, Jain, & Jayachandran, 2018; Heckman, Jagelka, & Kautz, 2019).
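One way to write the researcher’s problem more explicitly is sketched below. This is an illustrative formalization, not notation from the commentary: the cost function \kappa(c), the multiplier \lambda, and the simplification that C, P, and E depend on c only through a scalar D(c) (suppressing direct channels such as sample size) are all assumptions.

```latex
% Assumed notation: \kappa(c) is a hypothetical cost of design choices c,
% and \lambda is the multiplier on the budget constraint.
\[
\max_{c}\; K\bigl(C(D(c)),\, P(D(c)),\, E(D(c))\bigr)
\quad \text{subject to} \quad \kappa(c) \le B .
\]
% First-order condition for a single design choice c_i at an interior optimum:
\[
\left(
\frac{\partial K}{\partial C}\frac{\partial C}{\partial D}
+ \frac{\partial K}{\partial P}\frac{\partial P}{\partial D}
+ \frac{\partial K}{\partial E}\frac{\partial E}{\partial D}
\right)\frac{\partial D}{\partial c_i}
= \lambda\,\frac{\partial \kappa}{\partial c_i} .
\]
```

In this notation, a free lunch is a choice c_i for which the left-hand side is positive while \partial\kappa/\partial c_i is approximately zero, so the condition cannot hold with equality and moving c_i raises K at no budgetary cost.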
With additional space, we could itemize improvements that we could have made in the measurement of subjective well-being, environmental externalities, intimate partner violence, and soil quality on farms at no cost other than our own learning.
This framework can also help researchers optimize the many tradeoffs we face in seeking to maximize K: How should we spend our marginal dollar given binding research budget constraints? On reducing non-classical measurement error that confounds causal inference? On increasing power by reducing classical measurement error in a key outcome, or by increasing sample size? On improving external validity by increasing response rates, or by reaching a broader sample? Answers to these questions highlight the importance of understanding the data quality production technology, dD(.)/dc, in estimating the returns to data quality. Four examples are the design of survey instruments, the design of field data collection protocols, sampling, and the integration of multiple data sources.
2.1. How design choices affect data quality: reducing measurement error through questionnaire design
Household surveys are likely to remain a linchpin of development economics, and it has long been understood that survey data can be rife with measurement error.¹ Much less understood is what kind of measurement error arises, how it affects inference if left uncorrected or unaccounted for, and what to do about it. Methodological research that builds on measurement error prediction bias models (e.g., Hyslop and Imbens (2001)) and tests across different models will have high returns in disentangling the mechanisms of measurement error that we often call recall bias, telescoping, framing, social desirability bias, priming, or interviewer effects.
These concerns relate both to the mismeasurement of key constructs (e.g., measures of or inputs to development), and to how measurement itself can change parameter estimates (e.g., treatment effects) and thereby confound inference. Much investment in questionnaire design is cost-neutral, particularly if data quality improvements require different questions rather than additional modules to measure key concepts.²
Increased attention to the measurement of difficult-to-measure or unobserved variables that are fundamental to understanding development also has high potential returns, but this margin of data quality production presents researchers with difficult tradeoffs. High measurement costs per observation limit the overall number of observations in a particular study and hence the study’s statistical power; indeed, measurement error corrections may require multiple measures per construct and observation, within a survey or across surveys over time.
An early example of this tradeoff is found in discussions of welfare measurement. Income measures are straightforward to implement, but measurement error concerns have resulted in greater emphasis on measuring consumption, assets, and/or subjective well-being (Deaton & Zaidi, 2002). Another classic example is that of agricultural production data, which are most useful and often best recalled at the plot level, but many households cultivate multiple plots and assigning inputs and outputs to each is time-consuming. And an increasingly classic example is that of sex-disaggregated data, where activities, ownership, and control are foundational to estimating intrahousehold models yet increase questionnaire length and are difficult to validate.³
1 Bound, Brown, and Mathiowetz (2001) provide a review of the measurement error literature in labor economics.
2 De Weerdt, Gibson, and Beegle (2019) assess the nascent survey experiment literature.
3 For an example, see Donald, Koolwal, Annan, Falb, and Goldstein (2017), who discuss validating measures of women’s agency.
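The value of collecting multiple measures per construct, discussed in this section, can be illustrated with the textbook case of classical error in a regressor. The sketch below is an assumption-laden illustration rather than anything proposed in the commentary: it assumes a standard normal true regressor, independent mean-zero normal errors, and a bivariate linear model, and every function name and parameter value is hypothetical. Under those assumptions the OLS slope is attenuated toward zero by the reliability ratio, and averaging k independent measures raises that ratio.

```python
import random


def reliability(signal_var, error_var, k=1):
    """Share of the variance of the mean of k noisy measures that is true signal."""
    return signal_var / (signal_var + error_var / k)


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    return cov / var


def simulate_slope(n, k, beta=1.0, error_sd=1.0, seed=0):
    """Estimate beta when the regressor is the mean of k error-ridden measures."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * xt + rng.gauss(0, 1) for xt in x_true]
    # Each observed regressor averages k independent noisy readings of x_true.
    x_obs = [sum(xt + rng.gauss(0, error_sd) for _ in range(k)) / k for xt in x_true]
    return ols_slope(x_obs, y)


# Compare the simulated slope with the value implied by the reliability ratio
# (true beta is 1.0, so the implied slope equals the reliability itself).
for k in (1, 2, 4):
    implied = reliability(1.0, 1.0, k)
    estimate = simulate_slope(n=20_000, k=k)
    print(f"k={k}: implied slope={implied:.2f}, simulated slope={estimate:.2f}")
```

Because each additional measure raises the cost per observation, a researcher with a fixed budget trades a less attenuated estimate against a smaller sample. The sketch covers only classical error; the mechanisms listed in this section (recall bias, social desirability bias, and so on) need not behave this way.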
New elicitation methods, like relying more on natural language responses (Athey & Imbens, 2019), offer promise for improving measurement while relaxing tradeoffs and research budget constraints.
2.2. How design choices affect data quality: fieldwork implementation
More-rigorous fieldwork implementation protocols potentially improve data quality, but also increase the cost per observation. One example is the investment that researchers make in the survey data-generating process. This process can be classified into stages: survey instrument design, interviewer recruitment, piloting, training, team construction, respondent assignment protocol, and field monitoring. A meta-analysis of the interviewer effects literature establishes that interviewer behavioral traits and demographic characteristics influence survey responses and by extension data quality (West & Blom, 2017). Response rates and response biases are particularly influenced by specific interviewer characteristics (e.g., age, ethnicity, experience, and education), behaviors (e.g., formal versus conversational interview style), and cognitive and noncognitive skills (e.g., mathematical ability, reading, attention to detail, and empathy). Beaman, Keleher, and Magruder (2018) investigate referral approaches to interviewer recruitment, finding that recruiting through interviewer networks disadvantages women.
Despite amassing billions of interviewer-collected data points, many of which would not have existed without implemented RCTs, we have little causal evidence on how best to assure data quality through field implementation choices or on the return to interviewer skills. Many choices related to recruitment, training and respondent assignment protocols are likely to be cost-neutral. Others involve tradeoffs, e.g., between larger sample size and more back-checking, or between more productive interviewer teams and lower supervision costs.
2.3. How design choices affect data quality: sampling and mode of administration
Trends in richer countries suggest that increasing attention to sampling techniques will be important. On the downside, survey response rates have fallen substantially, raising concerns about external validity and non-classical measurement error in putatively representative samples. On the upside, online survey sampling and administration techniques have improved dramatically, suggesting that online surveying could be a broadly viable substitute for in-person interviews when communities have internet or cellular services. Increased access to mobile technology also lowers the costs of tracking survey respondents in panel surveys, playing an important role in reducing attrition.
2.4. How design choices affect data quality: integrating data sources
Household surveys have been the primary data collection technology for research based on randomized controlled trials in developing countries. Alternative and complementary data sources, including remotely sensed data, wearable technology such as accelerometers, and administrative data, have potentially high returns for data quality, often with the potential to link with household survey data (see, for example, Meyer and Mittag (2019)). The integration of remotely sensed data to delineate sample frames and to measure land area, crop yields, electricity usage, and deforestation holds the potential to enrich household surveys and validate self-reported information. Such data present a clear tradeoff for the researcher: they help achieve wider external validity, but without linkage to household surveys they offer limited nuanced information for estimating economic models of behavior. Accelerometers provide
measures of time use, providing an alternative to self-reported time use, and of physical activity, an unobserved mechanism in studies that estimate labor and health intervention effects. Yet wearable technology is costly and raises issues of noncompliance, so there can be tradeoffs with sample size and selection. Identifying the effects of wearable technology on the precision, unbiasedness, and external validity of treatment effect estimates would provide just a few additional examples of research on dD(.)/dc that would inform how researchers approach causal inference.
3. Conclusion
The 2019 Nobel Memorial Prize in Economic Sciences highlights the important contribution of one aspect of research design in improving knowledge of poverty and its alleviation. We highlight the researcher’s broader decision problem in creating such knowledge. Critically, research design choices like questionnaire design, fieldwork implementation protocols, sampling choices, and data source choices are inputs into a data quality production function. This function is often poorly understood, and so there are countless opportunities to improve knowledge with systematic, methodological work on how research design choices affect data quality and ultimately inference quality. Good identification, meet good data.
Acknowledgements
We gratefully acknowledge our many academic and implementing collaborators, including Innovations for Poverty
Action field teams, who have taught us about the data generating process. We appreciate the insightful comments of Kathleen Beegle and Steven Glazerman on this manuscript.
References
Athey, S., & Imbens, G. (2019). Machine learning methods that economists should know about. Annual Review of Economics, 11, 685–725.
Beaman, L., Keleher, N., & Magruder, J. (2018). Do job networks disadvantage women? Evidence from a recruitment experiment. Journal of Labor Economics, 36(1), 147–161.
Bound, J., Brown, C., & Mathiowetz, N. (2001). Measurement error in survey data. In Heckman, J. & Leamer, E. (Eds.), Handbook of econometrics (pp. 3705–3843). New York: North Holland Publishing, Elsevier.
Deaton, A., & Zaidi, S. (2002). A Guide to Aggregating Consumption Expenditures, Living Standards Measurement Study, Working Paper 135.
De Weerdt, J., Gibson, J., & Beegle, K. (2019). What can we learn from experimenting with survey methods? Annual Review of Economics 3: Submitted. DOI: 10.1146/annurev-resource-103019-105958.
Dhar, D., Jain, T., & Jayachandran, S. (2018). Reshaping adolescents’ gender attitudes: Evidence from a school-based experiment in India. http://faculty.wcas.northwestern.edu/~sjv340/reshaping_gender_attitudes.pdf.
Donald, A., Koolwal, G., Annan, J., Falb, K., & Goldstein, M. (2017). Measuring women’s agency. Working Paper Series 8148. Washington, DC: World Bank Group.
Heckman, J., Jagelka, T., & Kautz, T. (2019). Some contributions of economics to the study of personality. NBER Working Paper No. 26459.
Hyslop, R., & Imbens, G. (2001). Bias from classical and other forms of measurement error. Journal of Business & Economic Statistics, 19(4), 475–481.
Meyer, B., & Mittag, N. (2019). Using linked survey and administrative data to better measure income: Implications for poverty, program effectiveness and holes in the safety net. American Economic Journal: Applied Economics, 11(2), 176–204.
West, B. T., & Blom, A. G. (2017). Explaining interviewer effects: A research synthesis. Journal of Survey Statistics and Methodology, 5(2), 175–211.