Challenges in using RCTs for evaluation of large-scale public programs with complex designs: Lessons from Peru

World Development 127 (2020) 104798

Commentary

Javier Escobal ⇑, Carmen Ponce

Group for the Analysis of Development (GRADE), Lima, Peru

Article history: Accepted 1 December 2019

Keywords: RCTs; Impact evaluation; Multifaceted intervention; Small sample bias; Peru

Abstract

The use of randomized control trials (RCTs) to evaluate public policies and interventions in developing countries faces several challenges. These include limited budgets to finance the sample designs and sample sizes required to evaluate multifaceted interventions, potential small-sample bias arising from such limited samples, and difficulties in random assignment when participants self-exclude from parts of the intervention. In addition, institutional challenges arise when seeking to evaluate large-scale interventions implemented within a state bureaucracy, as compared to evaluations of small NGO pilots. This short article discusses the practical challenges facing RCTs when used as a public policy and program evaluation mechanism. The discussion is based on the impact evaluation of a public project that offered several productive interventions to rural households who were already receiving conditional cash transfers.

© 2019 Elsevier Ltd. All rights reserved.

RCTs have attracted increasing attention as a useful tool for evaluating policies and interventions in developing countries. However, as researchers conducting program evaluations "experiment" with this toolbox, challenges arise. These include the need to design and evaluate programs that bundle together an array of interventions, and the need to understand what works and why in order to adjust and scale up interventions that prove successful. Further, the sample sizes and sample designs required to evaluate such complex multi-armed interventions are limited by budget restrictions. Finally, the governmental bureaucracies implementing the evaluated interventions present additional problems, stemming from the potentially high political costs of committing to a randomized rollout schedule and of delaying deployment until the baseline survey is implemented.

This short article discusses how these challenges were tackled when evaluating Haku Wiñay ("Growing Together" in Quechua) and the lessons learned. This Peruvian public intervention focuses on the development of productive and entrepreneurial skills to help rural households strengthen their income generation capacities, diversify livelihoods and enhance food security. The intervention is implemented over a 24-month period1 and targets poor households already participating in the conditional cash transfer program Juntos. With the Cooperation Fund for Social Development (FONCODES) as the executing agency, the Peruvian Ministry of Development and Social Inclusion (MIDIS) supervised how the intervention was deployed. Once the targeted population was selected, and given the annual budget allocation restrictions, it was possible to coordinate the rollout schedule for the period 2013–2018 with MIDIS. This allowed the external evaluation team to randomly assign eligible towns to control and treatment groups.

The intervention included four components: (1) a "family production systems" component, designed to help households adopt simple, low-cost technological innovations in agriculture, fisheries or livestock activities by providing productive assets and technical assistance; (2) a "healthy housing" component, aimed at implementing safe kitchens and fostering access to safe water and efficient solid waste management; (3) an "inclusive rural businesses" component, designed to promote business initiatives by funding and organizing grant competitions, and by helping those interested in participating to organize and prepare business plans to pursue those grants; and (4) a "financial education" component, involving training to improve access to and use of the formal financial system and to promote savings2.

Such a multifaceted intervention affects a wide array of outcomes, including but not limited to income, consumption, nutrition and health. In addition, the intervention affects savings, asset accumulation, entrepreneurial abilities, and empowerment, which may have longer-term impacts.3 Some of the components of the intervention were demand-driven, increasing the complexity of the program and the evaluation strategy4.

⇑ Corresponding author. E-mail addresses: [email protected] (J. Escobal), [email protected] (C. Ponce).
1 The investment per family is estimated at USD 1,300.
2 Escobal and Ponce (2015) describe the non-randomized pilot that was used to inform the final design of the intervention.

https://doi.org/10.1016/j.worlddev.2019.104798
0305-750X/© 2019 Elsevier Ltd. All rights reserved.

1. Methodological challenges

RCT-based evaluations face a host of methodological challenges. First, as previously mentioned, limited budgets mean the sample size is typically smaller than desired to achieve precision. According to the evaluation literature, using prior knowledge about conditions that may affect exogeneity or predict heterogeneity in the estimated effects can substantially improve precision, especially through stratification (Deaton & Cartwright, 2018). Therefore, the evaluation focused exclusively on similar Highland territories and stratified the sample by region. At the household level, the sample included only potential beneficiaries who were receiving transfers from Juntos. Due to the targeting strategy of Juntos, all households in the sample lived in extreme poverty, both in terms of income and of assets.

Despite having a (clustered) randomized sample of households5, a small sample size can yield unbalanced treatment and control subsamples. Several methodologies are suggested in the literature to achieve balance between samples, including propensity score matching, genetic matching and entropy balancing, among others (Duflo, Glennerster, & Kremer, 2007; Hainmueller, 2012; Imbens & Wooldridge, 2009). Since the household sample selected for evaluation was not perfectly balanced between treatment and control groups, entropy-balancing weights were calculated for control households in order to ensure balance between the subsamples in a set of variables that could affect the level and change of the outcomes being evaluated6.
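To illustrate the reweighting step, entropy balancing (Hainmueller, 2012) can be computed by solving a convex dual problem: control-group weights are exponential in a linear index of the covariates, with coefficients chosen so that the weighted control means exactly match the treatment-group means. The sketch below is a minimal illustration with hypothetical variable names, not the study's actual estimation code:

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balance(X_control, target_means):
    """Entropy-balancing weights for control units: the weighted covariate
    means of the control group are forced to equal `target_means` (the
    treatment-group means), while keeping the weights as close as possible
    (in entropy terms) to uniform."""
    Z = X_control - target_means  # center covariates at the treatment means

    def dual(lam):
        # log partition function; its gradient is the weighted mean of Z,
        # so minimizing drives the moment constraints to zero
        return np.log(np.exp(Z @ lam).sum())

    res = minimize(dual, np.zeros(Z.shape[1]), method="BFGS")
    w = np.exp(Z @ res.x)
    return w / w.sum()  # normalized weights, summing to one
```

In practice the balanced weights would then enter the outcome regressions as survey weights for the control subsample; the terms `X_control` and `target_means` here stand in for the evaluated covariates (e.g., education and sex of the household head, regional location).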
As with other public programs, participation in Haku Wiñay was voluntary7, so the estimated effect derived from contrasting the two subsamples conveys the effect of the program on households that were targeted, but not necessarily treated (the intention-to-treat, or ITT, estimator). Following Duflo et al. (2007), it was possible to use the randomized selection of towns as an instrument to estimate the average treatment effect on treated households (ATET). Key to the identification of the ATET was the program's compliance with the agreed rollout schedule, which ensured that no households in control towns were treated, allowing compliance with the monotonicity assumption8.

It is important to highlight that the ATET estimator may be biased whenever treatment externalities affect the outcomes of non-participant households. To explore potential externalities, the survey included several questions on this issue. Although, as expected, non-participant households had relatives and friends

3 The increase in total household income attributable to the intervention represented 8% of pre-intervention income. Other positive impacts include a sharp increase in food security, a reduction in firewood consumption, and improvements in financial literacy and empowerment indicators. Escobal and Ponce (2016) describe the impacts of the intervention in more detail using the RCT design. The intervention also affected time allocation for younger members of the household, as well as gender gaps in work and study time (Ponce & Escobal, 2019).
4 Banerjee et al. (2015) evaluate similar multifaceted programs. However, their interventions are not run through government implementing agencies and minimize the demand-driven component of the intervention, factors that facilitate the evaluation design.
5 Eligible households were randomly selected within each randomly selected town.
6 12 of the 14 tested characteristics achieved balance when the entropy weights were used. These included education and sex of the household head, regional location, and other indicators associated with labor market dynamics and agricultural technology at the town level.
7 86% of households in treated towns decided to participate.
8 The monotonicity assumption requires that the probability of participating in the program be equal or greater for households in treatment towns than for those in control towns.

who had participated in the program, most reported not having benefited from any program activity or asset transfer.

The final challenge we want to highlight concerns treatment heterogeneity. As in other demand-driven programs, the design aimed to be flexible enough to fit households' needs and interests. There is extensive evidence that this is a powerful feature of economic opportunity programs because it empowers beneficiaries and improves program effectiveness (Linn, Hartmann, Kharas, Kohl, & Massler, 2010). Despite these advantages, this flexibility introduces challenges for the impact evaluation strategy because it makes the treatment heterogeneous and, more importantly, endogenously heterogeneous. Participants may self-exclude, especially from the third component. Furthermore, they may participate in the grant competition but lose to other participants. Although the sample design and size were not suitable for testing the impact of each component, it was possible to explore potential impact differences across treated households stemming from exposure to different treatment intensities. Since the program established a maximum amount of money to be transferred per household, a monetary equivalent of the treatment was calculated for each treated household. Households were ranked accordingly, and entropy-balancing weights were re-estimated for the upper- and lower-tercile subsamples. Although balance improved, it was far from perfect and the results were not as robust. Further work is needed to assess methodological alternatives for addressing such heterogeneous treatments, both in terms of sample size and the relative assessment of different components.
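Under the one-sided noncompliance described above (households in control towns are never treated), the ATET reduces to the familiar Wald/IV estimator: the ITT effect scaled by the take-up rate among assigned households. The following is a minimal sketch with illustrative variable names, not the study's actual estimation code (which would also account for town-level clustering):

```python
import numpy as np

def itt_and_atet(y, assigned, treated):
    """ITT effect and Wald/IV estimate of the average treatment effect on
    the treated, using random assignment (here, at the town level) as an
    instrument for actual participation.  With one-sided noncompliance,
    ATET = ITT / take-up rate among assigned households."""
    y, assigned, treated = map(np.asarray, (y, assigned, treated))
    itt = y[assigned == 1].mean() - y[assigned == 0].mean()
    take_up = treated[assigned == 1].mean()  # share of assigned who participate
    return itt, itt / take_up
```

With the 86% take-up observed in treated towns, the ATET is roughly the ITT effect inflated by a factor of 1/0.86; the scaling is valid only because the rollout schedule guaranteed that no control-town household was treated.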

2. Institutional challenges

While evaluators of NGO projects are usually granted control over research questions and RCT planning, external evaluators of public programs who wish to conduct RCT-based evaluations need to overcome specific political and administrative challenges. The evaluators need to engage in advance with the office in charge of designing and implementing the intervention in order to build mutual trust. In particular, the evaluators need to dispel public officials' concerns about their motivations (whether their priorities are more aligned with an academic career than with advancing government policies), about the focus of the evaluation on the intervention's goals, and about their willingness to accommodate the program's practical constraints. Mutual trust allows the evaluators to understand the actual implementation strategy, identify potential problems that may affect internal validity, and adjust the randomization accordingly. In the evaluation discussed here, this process culminated in an administrative decree protecting the program rollout from last-minute changes driven by public officials' political or administrative concerns or interests. Furthermore, working with the program implementation office to ensure that no announcements were made about future implementation of the program in control towns, or before the baseline survey in treated towns, was critical to avoiding estimation bias.

Finally, our experience evaluating multi-armed programs suggests that RCTs should be conducted by external evaluators but designed as part of a broader learning system embedded in the institutions in charge of the design and evaluation of development programs (Escobal & Ponce, 2015, 2016; Escobal, Ponce, Pajuelo, & Espinoza, 2012). Such a system should draw on complementary evidence gathered through qualitative and quantitative methodologies.
Complementary approaches are critical when interventions are not fully amenable to randomization, when small samples reduce the precision of causal estimates, or when the outcomes are not fully suitable for quantitative analysis.


References

Banerjee, A., Duflo, E., Goldberg, N., Karlan, D., Osei, R., Pariente, W., Shapiro, J., Thuysbaert, B., & Udry, C. (2015). A multifaceted program causes lasting progress for the very poor: Evidence from six countries. Science, 348(6236), 1260799.
Deaton, A., & Cartwright, N. (2018). Understanding and misunderstanding randomized controlled trials. Social Science & Medicine, 210, 2–21.
Duflo, E., Glennerster, R., & Kremer, M. (2007). Using randomization in development economics research: A toolkit. In T. Schultz & J. Strauss (Eds.), Handbook of development economics (Vol. 4, pp. 3895–3962). Elsevier.
Escobal, J., & Ponce, C. (2015). Combining social protection with economic opportunities in rural Peru: Haku Wiñay. Policy in Focus, The International Policy Centre for Inclusive Growth, 12(2), 22–25.
Escobal, J., & Ponce, C. (2016). Combinando protección social con generación de oportunidades económicas: Una evaluación de los avances del programa Haku Wiñay (p. 195). Lima: Fundación Ford; GRADE.


Escobal, J., Ponce, C., Pajuelo, R., & Espinoza, M. (2012). Estudio comparativo de intervenciones para el desarrollo rural en la Sierra sur del Perú (p. 160). Lima: Fundación Ford; GRADE.
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1), 25–46.
Imbens, G., & Wooldridge, J. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1), 5–86.
Linn, J., Hartmann, A., Kharas, H., Kohl, R., & Massler, B. (2010). Scaling up the fight against rural poverty: An institutional review of IFAD's approach. Global Economy & Development Working Paper 43, October. The Brookings Institution.
Ponce, C., & Escobal, J. (2019). Reshaping the gender gap in child time use: Unintended effects of a program expanding economic opportunities in the Peruvian Andes. Lima: GRADE. Available at http://www.grade.org.pe/publicaciones/time-use-gap/.