Journal of Development Economics 99 (2012) 486–496

The impact of teacher subject knowledge on student achievement: Evidence from within-teacher within-student variation☆

Johannes Metzler (University of Munich and Monitor Group, Germany)
Ludger Woessmann⁎ (University of Munich and Ifo Institute for Economic Research, CESifo and IZA, Poschingerstr. 5, 81679 Munich, Germany)

Article info

Article history: Received 18 October 2011; received in revised form 8 June 2012; accepted 9 June 2012
JEL classification: I20; O15
Keywords: Teacher knowledge; Student achievement; Peru

Abstract

Teachers differ greatly in how much they teach their students, but little is known about which teacher attributes account for this. We estimate the causal effect of teacher subject knowledge on student achievement using within-teacher within-student variation, exploiting a unique Peruvian 6th-grade dataset that tested both students and their teachers in two subjects. Observing teachers who teach both subjects in one-classroom-per-grade schools, we circumvent omitted-variable and selection biases using a correlated random effects model that identifies from differences between the two subjects. After correction for measurement error, one standard deviation in subject-specific teacher achievement increases student achievement by about 9% of a standard deviation in math. Effects in reading are significantly smaller and mostly not significantly different from zero. Effects also depend on the teacher-student match in ability and gender.

1. Introduction

One of the biggest puzzles in educational production today is the teacher quality puzzle: while there is clear evidence that teacher quality is a key determinant of student learning, little is known about which specific observable characteristics of teachers account for this impact (e.g., Hanushek and Rivkin, 2010; Rivkin et al., 2005; Rockoff, 2004). In particular, there is little evidence that the characteristics most often used in hiring and salary decisions, namely teachers' education and experience, are crucial for teacher quality. Virtually the only attribute that has more frequently been shown to be significantly correlated with student achievement is teachers' academic skill, as measured by scores on achievement tests (cf. Eide et al., 2004; Hanushek and Rivkin, 2006; Wayne and Youngs, 2003). The problem with the latter evidence, however, is that issues of omitted variables and non-random selection are very hard to address when estimating causal effects of teacher characteristics.¹

In this paper, we estimate the impact of teachers' academic skills on student achievement, circumventing problems from unobserved teacher traits and non-random sorting of students to teachers. We use a unique primary-school dataset from Peru that contains test scores in two academic subjects not only for each student, but also for each teacher. This allows us to identify the impact of teachers' academic performance in a specific subject on students' academic performance in that subject, while holding constant any student characteristics and any teacher characteristics that do not differ across subjects. We can observe whether the same student taught by the same teacher in two different academic subjects performs better in one of the subjects if the teacher's knowledge is relatively better in this subject. We develop a correlated random effects model that generalizes conventional fixed-effects models with fixed effects for students, teachers, and subjects. Our model identifies the effect based on within-teacher within-student variation and allows us to test the overidentification restrictions implicit in the fixed-effects models.

☆ We thank Sandy Black, Jere Behrman, David Card, Paul Glewwe, Eric Hanushek, Patrick Kline, Michael Kremer, Javier Luque, Lant Pritchett, Martin West, the editor Duncan Thomas, two anonymous referees, and seminar participants at UC Berkeley, Harvard University, the University of Munich, the Ifo Institute, the Glasgow congress of the European Economic Association, the CESifo Area Conference on Economics of Education, the IZA conference on Cognitive and Non-Cognitive Skills, and the Frankfurt conference of the German Economic Association for helpful discussion and comments. Special gratitude goes to Andrés Burga León of the Ministerio de Educación del Perú for providing us with reliability statistics for the teacher tests. Woessmann is thankful for the support and hospitality of the W. Glenn Campbell and Rita Ricardo-Campbell National Fellowship of the Hoover Institution, Stanford University. Support by the Pact for Research and Innovation of the Leibniz Association and the Regional Bureau for Latin America and the Caribbean of the United Nations Development Programme is also gratefully acknowledged.
⁎ Corresponding author. E-mail addresses: [email protected] (J. Metzler), [email protected] (L. Woessmann). URL: http://www.cesifo.de/woessmann (L. Woessmann). doi:10.1016/j.jdeveco.2012.06.002
¹ Recently, Lavy (2011) estimates the effect of teaching practices on student achievement.


We can additionally restrict our analysis to small, mostly rural schools with at most one teacher per grade, excluding any remaining possibility that parents may have chosen a specific teacher for their children in both subjects based on the teacher's specific knowledge in one subject, and excluding bias from non-random classroom assignment more generally (cf. Rothstein, 2010).

We find that teacher subject knowledge exerts a statistically and quantitatively significant impact on student achievement. Once corrected for measurement error in teacher test scores, our results suggest that a one standard deviation increase in teacher test scores raises student test scores by about 9% of a standard deviation in math. Results in reading are significantly smaller and indistinguishable from zero in most specifications. We also find evidence that the effect depends on the ability and gender match between teachers and students: above-median students are not affected by variation in teacher subject knowledge when taught by below-median teachers, and female students are not affected by variation in teacher subject knowledge when taught by male teachers. The average math result means that if a student were moved, for example, from a teacher at the 5th percentile of the distribution of teacher math knowledge to a teacher at the median of this distribution, the student's math achievement would increase by 14.3% of a standard deviation by the end of the school year. Thus, teacher subject knowledge is a relevant observable factor that is part of overall teacher quality.
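To make the percentile arithmetic explicit, here is a minimal check of our own (assuming an approximately normal distribution of teacher knowledge; this is an illustration, not a computation from the paper's data): the move from the 5th percentile to the median spans about 1.64 teacher standard deviations, which, multiplied by the measurement-error-corrected math effect of 0.087, yields the 14.3% figure.

    from scipy.stats import norm

    # Distance from the 5th percentile to the median of a standard normal
    # distribution of teacher subject knowledge, in teacher SD units.
    sd_move = norm.ppf(0.50) - norm.ppf(0.05)   # about 1.645

    beta_math = 0.087   # corrected math effect per teacher SD (Section 4.1)
    print(round(sd_move * beta_math, 3))        # 0.143, i.e., 14.3% of a student SD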


Methodologically, our identification strategy for teacher effects extends the within-student comparisons in two different subjects proposed by Dee (2005, 2007) as a way of holding student characteristics constant.² Because the teacher characteristics analyzed by Dee — gender, race, and ethnicity — do not vary within teachers, that approach uses variation across different teachers, so that the identifying variation may still be affected by selection and unobserved teacher characteristics. By contrast, our approach uses variation within individual teachers, a feature made possible by the fact that subject knowledge does vary within teachers. In our correlated random effects model, we in fact reject the overidentification restriction implied in the fixed-effects model that teacher knowledge effects are the same across subjects. While, in addition to teacher subject knowledge per se, the identified estimate may also capture the effect of other features correlated with teacher subject knowledge, such as passion for the subject or experience in teaching the subject curriculum, robustness tests confirm that results are not driven by observed between-subject differences in instruction time, teaching methods, and student motivation.

The vast literature on education production functions hints at teacher knowledge as one — possibly the only — factor quite consistently associated with growth in student achievement.³ In his early review of the literature, Hanushek (1986, p. 1164) concluded that "the closest thing to a consistent finding among the studies is that 'smarter' teachers, ones who perform well on verbal ability tests, do better in the classroom" — although even on this account, the existing evidence is not overwhelming.⁴ Important studies estimating the association of teacher test scores with student achievement gains in the United States include Ehrenberg and Brewer (1995), Ferguson (1998), Ferguson and Ladd (1996), Hanushek (1971, 1992), Murnane and Phillips (1981), Rockoff et al. (2011), Rowan et al. (1997), and Summers and Wolfe (1977). Examples of education production function estimates with teacher test scores in developing countries include Harbison and Hanushek (1992) in rural Northeast Brazil, Tan et al. (1997) in the Philippines, Bedi and Marshall (2002) in Honduras, and Behrman et al. (2008) in rural Pakistan.⁵ However, the existing evidence is still likely to suffer from bias due to unobserved student characteristics, omitted school and teacher variables, and non-random sorting and selection into classrooms and schools (cf. Glewwe and Kremer, 2006). Obvious examples of such bias include cases where better-motivated teachers incite more learning and also accrue more subject knowledge; where parents with a strong preference for educational achievement choose schools or classrooms with better teachers and also further their children's learning in other ways; and where principals assign students with higher learning gains to better teachers. Estimates of teacher effects that convincingly overcome such biases are so far missing in the literature.⁶

An additional problem when estimating teacher effects is measurement error, which is likely to attenuate estimated effects (cf. Glewwe and Kremer, 2006, p. 989). The available measure may proxy only poorly for the concept of teachers' subject knowledge, as in most existing studies the examined skill is not subject-specific. Furthermore, any specific test will measure teacher subject knowledge only with considerable noise. In this paper, we have access to separate tests of teachers' subject-specific knowledge in math and reading that are each generated from batteries of test questions using psychometric modeling. These provide consistent estimates of the internal reliability of the tests, which we use to adjust for biases due to measurement error based on classical measurement-error theory.

Understanding the causes of student achievement matters not least because of its substantial economic consequences. Differences in student achievement tests can account for an important part of the difference in long-run economic growth across countries, in particular between Latin America and other world regions as well as within Latin America (Hanushek and Woessmann, 2011, 2012). Peru has been doing comparatively well in terms of school attendance, but not in terms of educational quality as measured by achievement tests (cf. Luque, 2008; World Bank, 2007). Its average performance on international student achievement tests is dismal compared to developed countries (cf. OECD, 2003) and just below the average of other Latin American countries. Of the 16 countries participating in a Latin American comparative study in 2006, Peruvian 6th-graders ranked 9th and 10th, respectively, in math and reading (LLECE, 2008).

In what follows, Section 2 derives our econometric identification strategy. Section 3 describes the dataset and descriptive statistics. Section 4 reports our results and robustness tests.

² Further examples using this between-subject method to identify effects of specific teacher attributes such as gender, ethnicity, credentials, teaching methods, instruction time, and class size include Altinok and Kingdon (2012), Ammermüller and Dolton (2006), Clotfelter et al. (2010), Dee and West (2011), Lavy (2010), and Schwerdt and Wuppermann (2011).
³ The Coleman et al. (1966) report, which initiated this literature, first found that verbal skills of teachers were associated with better student achievement.
⁴ A decade later, based on 41 estimates of teacher test-score effects, Hanushek (1997, p. 144) finds that "of all the explicit measures [of teachers and schools] that lend themselves to tabulation, stronger teacher test scores are most consistently related to higher student achievement." Eide et al. (2004, p. 233) suggest that, compared to more standard measures of teacher attributes, "a stronger case can be made for measures of teachers' academic performance or skill as predictors of teachers' effectiveness." See also Hanushek and Rivkin (2006).
⁵ See Behrman (2010) and Glewwe and Kremer (2006) for recent reviews of the literature on education production functions in developing countries.
⁶ Note that because of the association of teacher subject knowledge with other teacher traits, even an experimental setup that randomly assigns teachers to students would not be able to identify the effect of teacher subject knowledge.

2. Empirical identification

2.1. The correlated random effects model


We consider an education production function with an explicit focus on teacher skills:

y_{i1} = \beta_1 T_{t1} + \gamma U_{t1} + \alpha Z_i + \delta X_{i1} + \mu_i + \tau_{t1} + \varepsilon_{i1}   (1a)

y_{i2} = \beta_2 T_{t2} + \gamma U_{t2} + \alpha Z_i + \delta X_{i2} + \mu_i + \tau_{t2} + \varepsilon_{i2}   (1b)



where y_i1 and y_i2 are test scores of student i in subjects 1 and 2 (math and reading in our application below), respectively. Teachers t are characterized by subject-specific teacher knowledge T_tj and non-subject-specific teacher characteristics U_t such as pedagogical skills and motivation. (The latter differ across the two equations only if the student is taught by different teachers in the two subjects.) Additional factors are non-subject-specific (Z_i) and subject-specific (X_ij) characteristics of students and schools. The error term is composed of a student-specific component μ_i, a teacher-specific component τ_t, and a subject-specific component ε_ij. The coefficient vectors β1, β2, and γ characterize the impact of all subject-specific and non-subject-specific teacher characteristics that constitute the overall teacher quality effect as estimated by value-added studies (cf. Hanushek and Rivkin, 2010). As indicated earlier, several sources of endogeneity bias, stemming from unobserved student and teacher characteristics or from non-random sorting of students to schools and teachers, are likely to hamper identification of the effect of teacher subject knowledge in Eqs. (1a) and (1b). Following Chamberlain (1982), we model the potential correlation of the unobserved student effect μ_i with the observed inputs in a general setup:

\mu_i = \eta_1 T_{t1} + \eta_2 T_{t2} + \theta_1 U_{t1} + \theta_2 U_{t2} + \chi Z_i + \phi X_{i1} + \phi X_{i2} + \omega_i   (2)

where ω_i is uncorrelated with the observed inputs. We allow the η and θ parameters to differ across subjects, but assume that the parameters on the subject-specific student characteristics are the same across subjects. Substituting Eq. (2) into Eqs. (1a) and (1b) and collecting terms yields the following correlated random effects model:⁷

y_{i1} = (\beta_1 + \eta_1) T_{t1} + \eta_2 T_{t2} + (\gamma + \theta_1) U_{t1} + \theta_2 U_{t2} + (\alpha + \chi) Z_i + (\delta + \phi) X_{i1} + \phi X_{i2} + \tau_{t1} + \varepsilon'_{i1}   (3a)

y_{i2} = \eta_1 T_{t1} + (\beta_2 + \eta_2) T_{t2} + \theta_1 U_{t1} + (\gamma + \theta_2) U_{t2} + (\alpha + \chi) Z_i + \phi X_{i1} + (\delta + \phi) X_{i2} + \tau_{t2} + \varepsilon'_{i2}   (3b)

where ε′_ij = ε_ij + ω_i. In this model, teacher scores in each subject enter the reduced-form equation of both subjects. The η coefficients capture the extent to which standard models would be biased by the omission of unobserved teacher factors, while the β parameters represent the effect of teacher subject knowledge. The model setup with correlated random effects allows us to test the overidentification restrictions implicit in conventional fixed-effects models, which are nested within the unrestricted correlated random effects model (see Ashenfelter and Zimmerman, 1997). Since the seminal contribution to cross-subject identification of teacher effects by Dee (2005), available fixed-effects estimators (commonly implemented in first-differenced estimations) implicitly impose that teacher effects are the same across subjects. Rather than imposing such a restriction from the start, after estimating the system of Eqs. (3a) and (3b) we can straightforwardly test whether β1 = β2 = β and whether η1 = η2 = η. Only if these overidentification restrictions cannot be rejected can we also specify correlated random effects models that restrict the β coefficients, the η coefficients, or both to be the same across subjects in Eqs. (3a) and (3b).

A correlated random effects model that enacts both of these restrictions is equivalent to conventional fixed-effects models that perform within-student comparisons in two subjects to eliminate bias from unobserved non-subject-specific student characteristics. These have been applied to estimate effects of non-subject-specific teacher attributes such as gender, ethnicity, and credentials (e.g., Clotfelter et al., 2010; Dee, 2005, 2007). Note, however, that in order to identify β and γ in specifications like Eqs. (3a) and (3b) that restrict coefficients to be the same across subjects, it has to be assumed that the assignment of students to teachers is random and, in particular, is not related to students' subject-specific propensity for achievement. Otherwise, omitted teacher characteristics such as pedagogical skills and motivation, comprised in the teacher-specific error component τ_t, could bias estimates of the observed teacher attributes.

In order to avoid such bias from omitted teacher characteristics, we propose to restrict the analysis to samples of students who are taught by the same teacher in the two subjects. In such a setting, U_t1 = U_t2 = U_t and τ_t1 = τ_t2 = τ_t, so that the education production functions simplify to:

y_{i1} = (\beta_1 + \eta_1) T_{t1} + \eta_2 T_{t2} + (\gamma + \theta_1 + \theta_2) U_t + (\alpha + \chi) Z_i + (\delta + \phi) X_{i1} + \phi X_{i2} + \tau_t + \varepsilon'_{i1}   (4a)

y_{i2} = \eta_1 T_{t1} + (\beta_2 + \eta_2) T_{t2} + (\gamma + \theta_1 + \theta_2) U_t + (\alpha + \chi) Z_i + \phi X_{i1} + (\delta + \phi) X_{i2} + \tau_t + \varepsilon'_{i2}   (4b)

⁷ This correlated random effects specification is similar to siblings models of earnings effects of education (Ashenfelter and Krueger, 1994; Card, 1999). We thank David Card for alerting us to this interpretation of our model.

With the additional restrictions that β1 = β2 = β and η1 = η2 = η, this model has a straightforward first-differenced representation:

y_{i1} - y_{i2} = \beta (T_{t1} - T_{t2}) + \delta (X_{i1} - X_{i2}) + (\varepsilon'_{i1} - \varepsilon'_{i2})   (5)

which is equivalent to including both student and teacher fixed effects in a pooled specification. In this specification, any teacher characteristic that is not subject-specific (U_t and τ_t) drops out. Identification of teacher effects is still possible in our setting because teacher subject knowledge varies across subjects. Note that this specification can be identified only in samples where students are taught by the same teacher in the two subjects. This makes it impossible to identify teacher effects for attributes that do not vary across the two subjects within the same teacher. Thus, this specification cannot identify effects of teacher attributes that are not subject-specific, such as gender and race.⁸ But it allows eliminating bias from standard omitted teacher variables when estimating the effect of teacher subject knowledge.

A final remaining concern might be that the matching of students and teachers is not random with respect to their relative performance in the two subjects. For example, if there are two classrooms in the same grade in a school, it is in principle conceivable that students who are better in math than in reading are assigned to one of the classes and that a teacher who is better in math than in reading is assigned to teach them. Therefore, to avoid bias from non-random allocation of teachers to students, we finally restrict our main sample to schools that have only one classroom per grade. This eliminates any bias from sorting between classrooms within the grade of a school. While this restriction comes at the cost of identification from a sample of relatively small and mostly rural schools that are not necessarily representative of the total population, additional analyses suggest that the sample restriction is not driving our results. By restricting the analysis to schools in mostly rural regions where parents have access to only one school, the restricted sample additionally eliminates any possible concern of non-random selection of schools by parents based on the relative performance of the schools and the students in math vs. reading. Note that, by eliminating non-random teacher selection, the sample restriction also rules out bias from prior differences in student achievement, as students cannot be allocated to particular teachers on the basis of within-student performance differences between the subjects.⁹

⁸ The within-teacher identification also accounts for possible differences between teachers in their absenteeism from the classroom, an important topic of recent research on teachers in developing countries (cf. Banerjee and Duflo, 2006). Note also that in a group of six developing countries with data, teacher absenteeism rates are lowest in Peru (Chaudhury et al., 2006).
⁹ A remaining potential source of non-random sorting might be selective drop-out based on differences in teacher subject knowledge between the two subjects. Such selectivity seems quite unlikely, though, and drop-out by grade 6 is very uncommon in Peru. In the 2004 Demographic and Health Survey, 97.0% of 12-year-olds — the relevant age for grade 6 — are currently enrolled in school, and 93.4% of those aged 15–19 have attained grade 6 (Filmer, 2007).
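As a concrete illustration of the first-differenced estimator in Eq. (5), the following minimal sketch (ours, on synthetic data; variable names and the data-generating process are assumptions for illustration, not the EN 2004 data) regresses the within-student score difference on the within-teacher knowledge difference with classroom-clustered standard errors. The student effect μ_i and all non-subject-specific teacher characteristics cancel in the difference.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n_class, n_per = 300, 12                    # classrooms, students per classroom
    cls = np.repeat(np.arange(n_class), n_per)  # classroom id of each student

    # Teacher knowledge in subject 1 (math) and subject 2 (reading)
    T = rng.normal(size=(n_class, 2))
    beta = 0.09                                 # true effect used in the simulation

    # Student scores: the student effect mu drops out of the difference
    mu = rng.normal(size=n_class * n_per)
    y1 = beta * T[cls, 0] + mu + rng.normal(scale=0.9, size=mu.size)
    y2 = beta * T[cls, 1] + mu + rng.normal(scale=0.9, size=mu.size)

    # Eq. (5): score difference on teacher-knowledge difference,
    # standard errors clustered at the classroom level
    res = sm.OLS(y1 - y2, sm.add_constant(T[cls, 0] - T[cls, 1])).fit(
        cov_type="cluster", cov_kwds={"groups": cls})
    print(res.params[1], res.bse[1])            # recovers beta up to noise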


One way to test whether student–teacher assignment is indeed random in our sample with respect to relative capabilities in the two subjects is to include measures of subject-specific student characteristics (X_ij) in the model. In robustness specifications, we will therefore include a subject-specific index of student motivation. Furthermore, the observed between-subject variation in teacher subject knowledge may correlate with other, partly unobserved subject-specific teacher characteristics, such as passion for the subject or experience in teaching the curriculum of the subject. The estimates of our models are thus best interpreted as the effect of teacher subject knowledge together with the other subject-specific traits that go with it. To the extent that features like subject-specific passion are the outcome of the knowledge level, rather than vice versa, estimates that do not control for other subject-specific teacher characteristics capture the full reduced-form effect of teacher subject knowledge. But to test to what extent subject-specific teacher knowledge may capture more than just plain knowledge, we also present robustness specifications that include teacher characteristics other than subject knowledge that might vary between subjects, namely subject-specific measures of teaching hours and teaching methods. Finally, teachers whose first language is not Spanish may not only be relatively better in math than in reading, but may also put relatively less effort into teaching Spanish. Since the native language is likely correlated between students and their teachers, we will test for robustness in a sample restricted to students whose first language is Spanish.

2.2. Correcting for measurement error

Any given test can only be an imperfect measure of our explanatory variable of interest, teachers' subject knowledge T. A first source of measurement error arises if the available measure proxies only poorly for the concept of teachers' subject knowledge. In most existing studies, the tested teacher skills can be only weakly tied to the academic knowledge in the subject in which student achievement is examined and cannot distinguish between subject-specific cognitive and general noncognitive teacher skills (see Rockoff et al., 2011 for a notable exception). Often, the examined skill is not subject-specific, either because it is based on a rudimentary measure of verbal ability (e.g., Coleman et al., 1966) or because it is aggregated across a range of skills including verbal and pedagogic ability as well as knowledge in different subjects (e.g., Summers and Wolfe, 1977). A second source of measurement error arises because any specific test will measure teacher subject knowledge only with considerable noise. This is most evident when the result of a single math question is used as an indicator of math skills (cf. Rowan et al., 1997), but it applies more generally because the reliability of any given item battery is not perfect. As is well known, classical measurement error in the explanatory variable gives rise to downward bias in the estimated effects. Glewwe and Kremer (2006, p. 989) argue that this is likely to be a serious problem in estimating school and teacher effects. Assume that, in any given subject s, true teacher test scores T* are measured with classical measurement error e:

T_{is} = T^*_{is} + e_{is}, \qquad E(e_{is}) = \mathrm{Cov}(T^*_{is}, e_{is}) = 0   (6)

Suppose that T* is the only explanatory variable in an education production function like Eqs. (1a) and (1b) with mean-zero variables, and that e and ε are uncorrelated. Classical measurement-error theory tells us that using measured T instead of true T* as the explanatory variable leads to the well-known attenuation bias, where the true effect β_s is asymptotically attenuated by the reliability ratio λ_s (e.g., Angrist and Krueger, 1999):

y_{is} = \beta_s \lambda_s T_{is} + \tilde{\varepsilon}_{is}, \qquad \lambda_s = \frac{\mathrm{Cov}(T_{is}, T^*_{is})}{\mathrm{Var}(T_{is})} = \frac{\mathrm{Var}(T^*_{is})}{\mathrm{Var}(T^*_{is}) + \mathrm{Var}(e_{is})}   (7)

Thus, the unbiased effect of T* on y is given by β_s = β̂_s/λ_s.¹⁰ The problem is that, in general, we do not know the reliability λ with which a variable is measured. But in the current case, where the explanatory variable is a test score derived using psychometric modeling, the underlying psychometric test theory in fact provides estimates of the reliability ratio, the most commonly used statistic being Cronbach's α (Cronbach, 1951). The method underlying Cronbach's α is based on the idea that reliability can be estimated by splitting the underlying test items into two halves and treating them as separate measures of the underlying concept. Cronbach's α is the mean of all possible split-half coefficients resulting from different splittings of a test and as such estimates reliability as the ratio of true variance to observed variance of a given test. It is routinely calculated as a measure of the reliability of test metrics, and it is provided as one of the psychometric properties in our dataset.¹¹ Of course, such reliability measures cannot inform us about the test's validity, that is, how good a measure it is of the underlying concept (teacher subject knowledge in our case). But they provide us with information on the measurement error contained within the test metric derived from the underlying test items. Note that differences between the test metric and the underlying concept only reduce total reliability further, rendering our adjustments a conservative estimate of the "true" effect of teacher subject knowledge on student achievement.

¹⁰ For a derivation of the measurement-error correction formula in the first-differenced model, see the working-paper version of this paper (Metzler and Woessmann, 2010). While, in contrast to the standard first-differenced model, our correlated random effects model has more than one explanatory variable, results of the correlated random effects model without controls are very close to the ones reported here.
¹¹ In Rasch modeling, there are additional alternative measures of the reliability ratio, such as the person separation reliability. In the EN 2004 teacher test, these are very close to Cronbach's α (at 0.73–0.77 in math and 0.58–0.62 in reading), so that applying them provides very similar (even slightly larger) estimates of the true effect.
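For concreteness, the sketch below (ours, on simulated dichotomous items, not the EN 2004 item data) shows the two computations used here: Cronbach's α from an item-response matrix, and the disattenuation β_s = β̂_s/λ_s, e.g., 0.064/0.74 ≈ 0.087 for math in Table 2.

    import numpy as np

    def cronbach_alpha(items):
        """items: (n_persons, k_items) array of item scores."""
        k = items.shape[1]
        item_var = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_var / total_var)

    rng = np.random.default_rng(1)
    ability = rng.normal(size=(500, 1))          # true teacher knowledge
    # 20 noisy dichotomous items driven by the same underlying ability
    items = (ability + rng.normal(scale=1.2, size=(500, 20)) > 0).astype(float)

    lam = cronbach_alpha(items)                  # reliability ratio lambda
    beta_hat = 0.064                             # attenuated regression estimate
    print(lam, beta_hat / lam)                   # disattenuated effect beta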
3. Data and descriptive statistics

To implement our identification strategy, which is rather demanding in terms of specific samples and of testing both students and teachers in two separate subjects, we employ data from the 2004 Peruvian national evaluation of student achievement, the "Evaluación nacional del rendimiento estudiantil" (EN 2004). The study tested a representative sample of 6th-graders in mathematics and reading, covering a total of over 12,000 students in nearly 900 randomly sampled primary schools. The sample is representative at the national level and for comparisons of urban vs. rural areas, public vs. private schools, and multi-grade¹² vs. "complete" schools. The two tested subjects, math and reading, are taught separately in the students' curriculum. The student tests were scheduled for 60 min each and were developed to be adequate for the curriculum and knowledge of the tested 6th-graders. As a unique feature of EN 2004, not only students but also their teachers were required to take tests in their respective subjects. The teacher tests were developed independently from the student tests, so that


the two do not have common items. While in most studies in the existing literature tested teacher skills can be only weakly tied to the academic knowledge in the subject in which student achievement is examined (cf. Wayne and Youngs, 2003), the separate subject-specific teacher tests in EN 2004 allow for an encompassing measurement of teacher subject knowledge in two specific subjects.¹³ Both student and teacher tests in both subjects were scaled using Rasch modeling by the Unit for Quality Measurement (UMC) of the Peruvian Ministry of Education.

Of the 12,165 students in the nationally representative dataset, 56% (6819 students) were taught by the same teacher in math and reading. Of the latter, 63% (4302 students) are in schools that have only one class in 6th grade. The combination of being served by the same teacher in both subjects in schools with only one classroom in 6th grade is quite frequent, as the Peruvian school system is characterized by many small schools spread out over the country to serve the dispersed population. As the main focus of our analyses will be on this "same-teacher one-classroom" (STOC) sample (although we also report cross-sectional estimates for the full sample below), we focus on this sample here.¹⁴

Table A1 in Appendix A reports descriptive statistics. Both student and teacher test scores are scaled to a mean of 0 and a standard deviation of 1. The first language was Spanish for 87% of the students and a native language for the others. Descriptive test scores reveal that Spanish-speaking students fare significantly better than students whose first language is not Spanish, although their teachers perform only slightly better. Both students and teachers score better in urban (compared to rural) areas, in private (compared to public) schools, and in complete (compared to multi-grade) schools. In terms of teacher education, 28% of the teachers hold a university degree, as opposed to a degree from a teaching institute. Both math and reading are taught for an average of about 6 h per week. Subject-specific inspection (not shown) reveals only a few substantive subject-specific differences in sub-populations. Boys and girls score similarly in reading, but boys outperform girls on average in math. While male and female teachers score similarly on the math test, male teachers score worse than female teachers in reading. Teachers with a university degree score worse in math but slightly better in reading compared to teachers with a degree from an institute.

Throughout the paper, we express results as standard deviations of student achievement per standard deviation of teacher achievement. While no external benchmark is available to provide context for the size of the achievement variation on either test in EN 2004, it is worth noting that on international tests, the average learning gain of students during one year of schooling (the "grade-level equivalent") is usually equivalent to roughly 25–35% of a standard deviation of student achievement. For Peruvian 15-year-olds in 10th and 11th grade in PISA 2009, the grade-level equivalent equals 31.0% of a standard deviation (no similar data are available for 6th-graders). Apart from this, we can contextualize the test-score variation in terms of the observed differences in teacher and student test scores between sub-groups of the population. For example, the difference in teacher math knowledge between rural and urban areas is 41.3% of a teacher standard deviation, whereas the difference in student knowledge between rural and urban areas is 79.2% of a student standard deviation (similar values in reading). This suggests that, at least on the available measures, the teacher population appears more homogeneous than the student population in the rural–urban comparison.

¹² Multi-grade schools, in which several grades are served in the same class by the same teacher, are a widespread phenomenon in many developing countries where parts of the population live in sparsely populated areas (cf. Hargreaves et al., 2001). For example, the remoteness of communities in the Andes and the Amazon basin means a lack of critical mass of students to provide complete schools that have at least one classroom per grade.
¹³ Evidence on content-related validity comes from the fact that both teacher test measures do not show a significant departure from unidimensionality. Principal component analysis of the Rasch standardized residuals shows a first eigenvalue of 1.7 in math and 1.5 in reading, indicating that both tests measure a dominant dimension — math and reading subject knowledge, respectively. Additional validity stems from the fact that at least five subject experts in math and reading, respectively, certified the adequacy of the test items for both teacher tests.
¹⁴ While the restricted STOC sample is certainly not a random sample of the full population of Peruvian 6th-graders (see Table A2 in Appendix A), a matching procedure reported below indicates that the restriction is not driving our results.

4. Results

4.1. Main results

As a reference point for the existing literature and the subsequent analyses, Table 1 reports cross-sectional regressions of student test scores on teacher test scores and different sets of regressors in the two subjects. Throughout the paper, standard errors are clustered at the classroom level. As the first four columns show, there are statistically significant associations between student and teacher test scores in both subjects in the full sample, both in a bivariate setting and after adding a set of controls. The associations are substantially smaller when controlling for the background characteristics, in particular the three school characteristics — urban location, private operation, and being a multi-grade school — and the reading estimate remains only marginally significant. Student achievement in both subjects is significantly positively associated with Spanish as a first language, urban location, private operation, being a complete school, and a student motivation index. Female students perform better in reading but worse in math, and reading (but not math) achievement is significantly positively associated with having a female teacher. As the subsequent columns reveal, the significant associations between student and teacher test scores also hold in the sample of students taught by the same teacher in both subjects and in the same-teacher one-classroom (STOC) sample, although point estimates are reduced (and become insignificant in reading in the STOC sample), indicating possible selection bias in the full sample.

Table 2 presents the central result of our paper, the correlated random effects model of Eqs. (3a) and (3b) estimated by seemingly unrelated regressions. By restricting the student sample to those who are taught by the same teacher in both subjects in schools with only one classroom per grade (STOC sample), we avoid bias stemming from within-school sorting.¹⁵ In our model, the effect of teacher subject knowledge on student achievement in math, β1, is given by the regression coefficient on the teacher math test score in the math equation minus the regression coefficient on the teacher math test score in the reading equation (see Eqs. (3a) and (3b)), and vice versa for reading. In the STOC sample, the point estimates are surprisingly close to the cross-sectional estimates with basic controls, supporting the view that the teacher-student assignment is indeed as good as random in this sample. The estimates of this within-teacher within-student specification indicate that an increase in teacher math test scores by one standard deviation raises student math test scores by about 6.4% of a standard deviation. A χ² test reveals that this effect is statistically highly significant. In reading, albeit also positive, the estimate is much lower and does not reach statistical significance. In fact, the tests of the overidentification restrictions reject the hypothesis that the effect of interest is the same in the two subjects, advising against estimating the simplified first-differenced model of Eq. (5).¹⁶

¹⁵ Results on the βs are qualitatively similar in the larger same-teacher sample without restriction to schools with only one classroom per grade (see Table A3 in Appendix A). In that sample, student math scores are significantly related to teacher math scores after controlling for teacher reading scores, and student reading scores are significantly related to teacher reading scores after controlling for teacher math scores. While the coefficient estimates on the differing-subject teacher scores are positive in both models in that sample, they are smaller than the same-subject coefficients and statistically insignificant in the unrestricted model.
¹⁶ If we denote the coefficient on the same-subject teacher test score in Eqs. (3a) and (3b) by ψ_j, this test is implemented by testing ψ1 − η1 = ψ2 − η2, where the ηs are estimated as the coefficients on the other-subject teacher test score in each equation. Given that ψ_j = η_j + β_j, this provides a test of β1 = β2.
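Because both equations of the system contain the same regressors, the SUR point estimates coincide with equation-by-equation OLS, so the implied β and the test described in footnote 16 can be reproduced with a stacked regression. The sketch below (ours, on simulated data; all column names are assumptions, and the paper's SUR estimation by maximum likelihood may differ in details) estimates subject-specific coefficients on both teacher scores with classroom-clustered standard errors and tests β1 = β2 as ψ1 − η1 = ψ2 − η2.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n_class, n_per = 300, 12
    cls = np.repeat(np.arange(n_class), n_per)
    T = rng.normal(size=(n_class, 2))            # teacher math/reading knowledge
    mu = rng.normal(size=cls.size)               # student effect
    y1 = 0.09 * T[cls, 0] + mu + rng.normal(scale=0.9, size=cls.size)  # math
    y2 = 0.02 * T[cls, 1] + mu + rng.normal(scale=0.9, size=cls.size)  # reading

    # Stack the two equations: one row per student-subject observation
    df = pd.DataFrame({
        "score": np.concatenate([y1, y2]),
        "is_math": np.repeat([1.0, 0.0], cls.size),
        "t_math": np.tile(T[cls, 0], 2),
        "t_read": np.tile(T[cls, 1], 2),
        "classroom": np.tile(cls, 2)})

    X = pd.DataFrame({
        "const_math": df.is_math,                     # subject intercepts
        "const_read": 1 - df.is_math,
        "psi_math": df.is_math * df.t_math,           # same-subject teacher score
        "psi_read": (1 - df.is_math) * df.t_read,
        "eta_math": (1 - df.is_math) * df.t_math,     # other-subject teacher score
        "eta_read": df.is_math * df.t_read})
    m = sm.OLS(df.score, X).fit(cov_type="cluster",
                                cov_kwds={"groups": df.classroom})

    print(m.params["psi_math"] - m.params["eta_math"])  # implied beta, math
    print(m.params["psi_read"] - m.params["eta_read"])  # implied beta, reading
    # Overidentification test beta_math = beta_read (footnote 16)
    print(m.wald_test("psi_math - eta_math = psi_read - eta_read",
                      scalar=True).pvalue)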


Table 1
Cross-sectional regressions. Dependent variable: student test score in math and reading, respectively. Columns (1)-(4): full sample; columns (5)-(8): same-teacher sample; columns (9)-(12): same-teacher one-classroom (STOC) sample.

Full sample                        OLS                                  SUR
                                   Math (1)          Reading (2)        Math (3)           Reading (4)
Teacher test score                 0.264*** (0.027)  0.219*** (0.023)   0.094*** (0.019)   0.027* (0.018)
Student female                                                          -0.094*** (0.025)  0.063** (0.027)
Student 1st language Spanish                                            0.709*** (0.076)   0.748*** (0.066)
Urban area                                                              0.323*** (0.066)   0.392*** (0.067)
Private school                                                          0.798*** (0.063)   0.708*** (0.055)
Complete school                                                         0.262*** (0.069)   0.281*** (0.069)
Teacher female                                                          0.025 (0.040)      0.104*** (0.038)
Teacher university degree                                               0.022 (0.045)      0.039 (0.041)
Student motivation                                                      0.517*** (0.065)   0.195*** (0.046)
Hours                                                                   0.017 (0.012)      0.006 (0.010)
Teaching method (8 indicators)                                          Yes                Yes
F (OLS) / chi2 (SUR)               92.46             89.82              806.76
Observations (students)            12,165            12,165             6461
Classrooms (clusters)              893               893                500

Same-teacher sample                OLS                                  SUR
                                   Math (5)          Reading (6)        Math (7)           Reading (8)
Teacher test score                 0.147*** (0.041)  0.182*** (0.032)   0.073*** (0.022)   0.032* (0.018)
Student female                                                          -0.074** (0.029)   0.057* (0.030)
Student 1st language Spanish                                            0.765*** (0.081)   0.741*** (0.071)
Urban area                                                              0.332*** (0.069)   0.382*** (0.069)
Private school                                                          0.701*** (0.099)   0.570*** (0.077)
Complete school                                                         0.273*** (0.077)   0.290*** (0.076)
Teacher female                                                          -0.010 (0.059)     0.100* (0.053)
Teacher university degree                                               0.004 (0.064)      0.024 (0.060)
Student motivation                                                      0.701*** (0.076)   0.315*** (0.058)
Hours                                                                   0.022 (0.015)      0.014 (0.012)
Teaching method (8 indicators)                                          Yes                Yes
F (OLS) / chi2 (SUR)               12.79             32.71              522.98
Observations (students)            6819              6819               4987
Classrooms (clusters)              521               521                398

STOC sample                        OLS                                  SUR
                                   Math (9)          Reading (10)       Math (11)          Reading (12)
Teacher test score                 0.119** (0.054)   0.158*** (0.042)   0.061** (0.025)    0.017 (0.023)
Student female                                                          -0.098*** (0.035)  0.024 (0.035)
Student 1st language Spanish                                            0.639*** (0.079)   0.623*** (0.072)
Urban area                                                              0.321*** (0.077)   0.364*** (0.078)
Private school                                                          0.725*** (0.115)   0.562*** (0.092)
Complete school                                                         0.360*** (0.081)   0.371*** (0.077)
Teacher female                                                          -0.110 (0.070)     0.066 (0.066)
Teacher university degree                                               0.060 (0.079)      0.058 (0.075)
Student motivation                                                      0.638*** (0.083)   0.382*** (0.077)
Hours                                                                   0.001 (0.018)      -0.003 (0.016)
Teaching method (8 indicators)                                          Yes                Yes
F (OLS) / chi2 (SUR)               4.96              13.97              470.36
Observations (students)            4302              4302               3121
Classrooms (clusters)              346               346                261

Notes: Clustered standard errors in the SUR models are estimated by maximum likelihood. Robust standard errors (adjusted for clustering at the classroom level) in parentheses. Significance: *** 1%, ** 5%, * 10%.

We can only speculate about the reasons why results differ markedly between math and reading.

Table 2
The effect of teacher test scores: correlated random effects models. Dependent variable: student test score in math and reading, respectively.

                                            Unrestricted model (1)              Model restricting η1 = η2 (2)
                                            Math              Reading           Math              Reading
Implied β_subject                           0.064***          0.014             0.060***          0.019
chi2 (β_subject = 0)                        13.138            0.599             9.585             1.095
Prob > chi2                                 [0.0003]          [0.439]           [0.002]           [0.295]
Regression estimates:
  Teacher test score in same subject        0.047 (0.037)     0.039 (0.034)     0.063*** (0.025)  0.022 (0.020)
  Teacher test score in other subject       0.026 (0.036)     -0.017 (0.032)    0.004 (0.019)     0.004 (0.019)
chi2 (η_math = η_reading) [Prob > chi2]     0.585 [0.444]                       –
chi2 (β_math = β_reading) [Prob > chi2]     6.948*** [0.008]                    3.477* [0.062]
chi2                                        353.45                              354.57
Observations (students)                     3936                                3936
Classrooms (clusters)                       316                                 316
β_subject, corrected for measurement error  0.087***          0.022             0.081***          0.029
λ (Cronbach's α)                            0.74              0.64              0.74              0.64

Notes: Regressions in the two subjects estimated by seemingly unrelated regressions (SUR). Sample: same-teacher one-classroom (STOC). Implied β represents the effect of the teacher test score, given by the difference between the estimate on the teacher test score in the respective subject in the equation for the student test score in that subject and in the equation for the student test score in the other subject (see Eqs. (3a) and (3b)). Regressions include controls for student gender, student 1st language, urban area, private school, complete school, teacher gender, and teacher university degree. Clustered standard errors in the SUR models are estimated by maximum likelihood. Robust standard errors (adjusted for clustering at the classroom level) in parentheses. Significance: *** 1%, ** 5%, * 10%.

A first option is that teacher subject knowledge may be genuinely more relevant in the production function of math than of reading skills. Another possibility is that by 6th grade, reading skills are already far developed and less responsive to teacher influence, whereas math skills still develop strongly. It could be that the teacher reading test has more noise than the math test, so that estimates suffer from greater attenuation bias. It could also be that the underlying true distribution of teacher knowledge is tighter in reading than in math (relative to the student distribution in each subject), so that a one standard-deviation change in teacher scores reflects smaller variation in actual knowledge in reading than in math.

As discussed in Section 2.2 above, the estimates still suffer from attenuation bias due to measurement error in the teacher test scores. Following Eq. (7), the unbiased effect of teacher subject knowledge on student achievement is given by β_s = β̂_s/λ_s. In the EN 2004, the reliability ratio (Cronbach's α) of the teacher math test is λ_M = 0.74, and the reliability ratio of the teacher reading test is λ_R = 0.64 (in line with the argument just made). Thus, as reported at the bottom of Table 2, the true effect of teacher math knowledge on student math achievement turns out to be 0.087: an increase in student math achievement by 8.7% of a standard deviation for a one standard deviation increase in teacher math knowledge.

The overidentification tests do not reject that the selection term η (given by the coefficient on other-subject teacher scores in each equation) is the same in the two subjects, allowing us to estimate a restricted model that imposes η1 = η2 in column (2). Results are hardly different in this restricted model, although the similarity of effects across subjects is only marginally rejected in this model.¹⁷ The estimate of η is close to zero in the STOC sample, a further indication that selection effects may indeed be eliminated by the sample restriction to students who have the same teacher in both subjects in schools that have only one classroom in the grade.

¹⁷ Table A4 in Appendix A shows results of basic first-differenced specifications, which would be warranted when effects do not differ across subjects and which effectively represent an average of our estimates in math and in reading. The working-paper version (Metzler and Woessmann, 2010) provides further details on first-differenced specifications.


In the larger same-teacher sample (shown in Table A3 in Appendix A), the estimate of η is positive and (marginally) significant, indicating positive selection effects in that sample.

4.2. Evidence on heterogeneous effects

To test whether the main effect hides effect heterogeneity in sub-populations, we estimate our model separately on stratified sub-samples (Table 3). We start by looking at the following sub-groups: female vs. male students, students with Spanish vs. a native language as their first language, urban vs. rural areas, private vs. public schools, complete vs. multi-grade schools, male vs. female teachers, and teachers with a university vs. institute degree. With only two exceptions, both math and reading results are robust within the sub-samples, and the difference between the respective split samples is not statistically significant.¹⁸ The exceptions are that the math results are no longer statistically significant in the two smallest sub-samples, private schools and teachers with a university degree. While the small sample sizes are most likely behind this pattern, the point estimates in these two sub-samples are also close to zero, and the difference between the effect for teachers with a university degree and the effect for teachers without one is statistically significant. With this exception, there is little evidence that the effect of teacher subject knowledge differs across these sub-populations.

The non-effect for teachers holding a university degree might indicate that the effect of teacher subject knowledge flattens out at higher levels of teacher skills. To test whether the effect of teacher subject knowledge is non-linear, Table 3 also reports results for sub-samples of teachers below and above the median in terms of their knowledge level, averaged across the two subjects. There is no evidence for heterogeneous effects. The same is true in models that interact the teacher test-score difference with a continuous measure of the average test-score level or with test-score levels of the two subjects separately. We also experimented with additional types of non-linear specifications, with similar results. At least within our sample of Peruvian teachers, there is no indication that the effect of teacher subject knowledge levels out at higher knowledge levels.

Table 3 also reports results separately for students achieving below and above the median achievement level, averaged across the two subjects. The math results are significant in both sub-samples. The effect in reading becomes marginally significant in the sub-sample of students above the median level and is significantly higher than the effect for low-performing students. To analyze this pattern further, the left panel of Table 4 looks deeper into the "ability match" between teachers and students. It provides estimates for four sub-samples based on whether students and teachers, respectively, have below- or above-median achievement. Here, the sub-sample of above-median students taught by below-median teachers falls out of the general pattern in that the point estimate in math is small and statistically insignificant. This suggests that the teacher-student ability match matters: if teachers are relatively low-performing, their subject knowledge does not matter much for students who are already relatively high-performing. This type of heterogeneity is an interesting aspect of education production functions that is frequently overlooked.

Similarly, the right panel of Table 4 looks at the teacher-student gender match. Given the apparent descriptive gender differences in math vs. reading achievement, and given the recent evidence that being assigned to a same-gender teacher may affect student outcomes (Carrell et al., 2010; Dee, 2005, 2007), it shows estimates for

¹⁸ We also do not find that the effect of teacher subject knowledge interacts significantly with teaching hours.
This suggests that the teacher-student ability match matters: If teachers are relatively low-performing, their subject knowledge does not matter much for students who are already relatively high-performing. This type of heterogeneity is an interesting aspect of education production functions that is frequently overlooked. Similarly, the right panel of Table 4 looks at the teacher-student gender match. Given the apparent descriptive gender differences in math vs. reading achievement, and given the recent evidence that being assigned to a same-gender teacher may affect student outcomes (Carrell et al., 2010; Dee, 2005, 2007), it shows estimates for 18 We also do not find that the effect of teacher subject knowledge interacts significantly with teaching hours.

similar sub-sample models based on the gender match between teachers and students. Results suggest that female students do not profit from higher teacher math knowledge when they are taught by a male teacher. They do so when taught by a female teacher, as do male students independent of the teachers' gender. This suggests that the importance of teacher subject knowledge also depends on the teacher-student gender match: For girls, student–teacher gender interactions may be important in facilitating the transmission of math knowledge from teachers to students. Although the findings generally suggest that results can be generalized to most sub-populations, the fact that our main sample of analysis — the same-teacher one-classroom (STOC) sample — is not a representative sample of the Peruvian student population raises the question whether results derived on the STOC sample can be generalized to the Peruvian student population at large. Schools with only one 6th-grade class that employ the same teacher to teach both subjects are more likely to be found in rural rather than urban areas and in multi-grade schools, and accordingly on average perform lower on the tests (see Table A2 in Appendix A). To test whether the sample specifics prevent generalization, we perform the following analysis. We first draw a 25-percent random sample from the original EN 2004 population, which is smaller than the STOC sample. We then employ nearest-neighbor propensity score matching (caliper 0.01) to pick those observations from the STOC sample that are comparable to the initial population. The variables along which this STOC sub-sample is made comparable to the initial population are student test scores in both subjects, teacher test scores in both subjects, and dummies for different school types (urban, public, and multi-grade). On this sub-sample of the STOC sample that is comparable to the whole population, we estimate our withinteacher within-student specification. In a bootstrapping exercise, we repeated this procedure 1000 times. The regression analyses yield an average teacher math score effect of 0.065, with an average χ2 of 7.524 (statistically significant at the 3% level), for an average number of 1739 observations (detailed results available on request). These results suggest that the teacher test‐score effect of our main specification is not an artifact of the STOC sample, but is likely to hold at a similar magnitude in the Peruvian 6th-grade student population at large.

4.3. Specification tests As a test for the randomness of the student–teacher assignment, we can add subject-specific measures of student and teacher attributes to the within-teacher within-student specification. Such extended models also address the question to what extent subject-specific teacher knowledge may capture other subject-specific features correlated with it. Results are reported in Table A5 in Appendix A. We first introduce an index of subject-specific student motivation (column (1)). 19 Subject-specific differences in student motivation may be problematic if they are pre-existing and systematically correlated with the difference in teacher subject knowledge between subjects. On the other hand, they may also be the result of the quality of teaching in the respective subject, so that controlling for differences in student motivation between the two subjects would lead to an underestimation of the full effect of teacher subject knowledge. In any event, although student motivation is significantly associated with student test scores in both subjects, and although the motivation effect is significantly larger in math than in reading, its inclusion hardly

19 The student motivation index, scaled 0–1, is calculated as a simple average of five survey questions corresponding to each subject such as “I like to read in my free time” (reading) or “I have fun solving mathematical problems” (math), on which students can choose disagree/sometimes true/agree. The resulting answer is coded as 0 for the answer displaying low motivation, 0.5 for medium motivation, and 1 for high motivation.



Table 3
The effect of teacher test scores in sub-samples. Dependent variable: student test score in math and reading, respectively.

                                     No                                            Yes                                           Difference
Sub-sample split                     Implied β  Prob>chi2  Obs.   Clusters        Implied β  Prob>chi2  Obs.   Clusters        Diff.    Prob>chi2
                                     (1)        (2)        (3)    (4)             (5)        (6)        (7)    (8)             (9)      (10)
Math:
Student female                       0.078***   [0.0002]   2048   205             0.050**    [0.043]    1888   301             0.028    [0.346]
Student 1st language Spanish         0.096**    [0.025]    530    114             0.060***   [0.003]    3406   293             0.036    [0.461]
Urban area                           0.076***   [0.001]    1859   203             0.063**    [0.014]    2077   113             0.013    [0.710]
Private school                       0.073***   [0.0001]   3436   283             -0.022     [0.753]    500    33              0.094    [0.177]
Complete school                      0.071***   [0.006]    1823   213             0.067***   [0.005]    2113   103             0.005    [0.898]
Teacher male                         0.082***   [0.0004]   2100   153             0.039      [0.131]    1836   163             0.043    [0.214]
Teacher university degree            0.087***   [0.00003]  2815   227             0.003      [0.931]    1121   89              0.084**  [0.030]
Teacher average score above median   0.059**    [0.036]    2062   167             0.103***   [0.0003]   1874   149             -0.044   [0.263]
Student average score above median   0.057***   [0.010]    1970   290             0.076***   [0.006]    1966   242             -0.019   [0.585]

Reading:
Student female                       -0.005     [0.846]    2048   205             0.033      [0.180]    1888   301             -0.038   [0.250]
Student 1st language Spanish         0.026      [0.637]    530    114             0.011      [0.551]    3406   293             0.015    [0.795]
Urban area                           0.007      [0.805]    1859   203             0.025      [0.268]    2077   113             -0.018   [0.623]
Private school                       0.016      [0.417]    3436   283             -0.016     [0.745]    500    33              0.032    [0.550]
Complete school                      0.037      [0.219]    1823   213             0.000      [0.990]    2113   103             0.037    [0.320]
Teacher male                         0.024      [0.313]    2100   153             0.005      [0.863]    1836   163             0.019    [0.600]
Teacher university degree            0.020      [0.350]    2815   227             0.003      [0.919]    1121   89              0.017    [0.626]
Teacher average score above median   -0.006     [0.872]    2062   167             0.009      [0.771]    1874   149             -0.015   [0.756]
Student average score above median   -0.017     [0.520]    1970   290             0.035      [0.109]    1966   242             -0.052*  [0.100]

Notes: For each sub-sample, regressions in the two subjects are estimated jointly by seemingly unrelated regressions (SUR). Sample: same-teacher one-classroom (STOC), stratified into two sub-samples based on whether the characteristic in the head column is true or not. Implied β represents the effect of the teacher test score, given by the difference between the estimate on the teacher test score in the respective subject in the equation for the student test score in that subject and in the equation for the student test score in the other subject (see Eqs. (3a) and (3b)). Regressions include controls for student gender, student 1st language, urban area, private school, complete school, teacher gender, and teacher university degree. Clustered standard errors (adjusted for clustering at the classroom level) in the SUR models are estimated by maximum likelihood. Significance: *** 1%, ** 5%, * 10%.

Next, we introduce subject-specific teacher measures (column (2) of Table A5). First, we add weekly teaching hours to control for the possibility that teachers may prefer to teach more in the subject they know better, or may come to know better the subject they teach more. The effect of teaching hours is, however, not significant in either subject. Second, we add controls for how teachers teach the curriculum in the two subjects.20 Teachers report for each subject whether they use subject-specific books, work books, local, regional, or national guidelines, and other help to design their subject curriculum. The eight indicators of teaching methods are jointly statistically significant in both subjects. However, the effect of teacher subject knowledge is again unaffected by controlling for these subject-specific teacher characteristics. Together, the models with subject-specific controls confirm the significant causal effect of teacher subject knowledge in math.

Another concern with the identification, given the relatively large share (13%) of students whose main language is not Spanish, is that teachers who are not native Spanish speakers may not only be relatively worse at teaching Spanish than math, but may also exert less effort in teaching Spanish (even after controlling for their knowledge of Spanish). As long as the native language is correlated between students and their teachers (due to local concentrations of indigenous peoples), one way to test for such bias is to compare the full-sample estimate to an estimate on the sample of native Spanish-speaking students only (shown in Table 3). The effect of teacher test scores on student achievement in this reduced sample is 0.06 (very close to the full sample) and highly significant, indicating that lower relative effort of non-native Spanish teachers in teaching reading is unlikely to drive our results.

20 EN 2004 asks teachers to describe which items they use to design the subject curriculum: working books, school curriculum, institutional educational projects, regional educational projects, 3rd cycle basic curriculum structure, and/or readjusted curricular programs.
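The same robustness logic can be seen in the subject-differenced (math minus reading) form of the model, where any subject-invariant student, teacher, or school characteristic drops out and subject-specific controls enter as differences. The sketch below is a stripped-down first-difference analogue of the specification-test models, not the authors' full correlated random effects estimator; column names are again hypothetical.

```python
import statsmodels.api as sm

def differenced_effect(df):
    """Regress the math-reading score difference on the corresponding
    differences in teacher knowledge and subject-specific controls;
    subject-invariant characteristics drop out of the model."""
    d = df.assign(
        d_score=df["student_math"] - df["student_read"],
        d_teacher=df["teacher_math"] - df["teacher_read"],
        d_motivation=df["motivation_math"] - df["motivation_read"],
        d_hours=df["hours_math"] - df["hours_read"],
    )
    X = sm.add_constant(d[["d_teacher", "d_motivation", "d_hours"]])
    fit = sm.OLS(d["d_score"], X).fit(
        cov_type="cluster", cov_kwds={"groups": d["classroom_id"]})
    # If assignment is as good as random, the coefficient on d_teacher
    # should barely move when the subject-specific controls are added.
    return fit.params["d_teacher"]
```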

4.4. Interpretation of effect size

In order to assess whether the overall effect stems from exposure of a student to a teacher for one year or for several years, we can observe in the data how long each teacher has been with the tested class. A caveat with such an analysis is that 6th-grade teachers' score differences between math and reading may be correlated with previous teachers' score differences between subjects, although it is not obvious that this is a substantial issue in the subject-differenced analysis. It turns out that 39% of the STOC sample has been taught by the specific teacher for only one year, an additional 38% for two years, and only the remaining 23% for more than two years. Table A6 in Appendix A reports results of estimating the model separately for the three sub-groups. The point estimate in the sub-sample of students who have been taught by the teacher for only one year is even larger than the point estimate in our main specification, suggesting that the estimated effect mostly captures a one-year effect. Note that in this sub-sample, there is actually also a (marginally) significant effect of teacher subject knowledge on student achievement in reading. At the same time, the math effects are of a similar magnitude in the sub-samples of teachers who have been with the class for longer than one year. With limited statistical power in these smaller samples, it is hard, however, to reject that the effect increases with tenure, at least when there is some fade-out of teacher effects (Kane and Staiger, 2008; see Metzler and Woessmann, 2010 for a more extensive discussion).

When the estimates of our main specification are viewed as the effects of having been with a teacher of a certain subject knowledge for one year, the measurement-error corrected estimate of the effect in math means that if a student who is taught by a teacher at, e.g., the 5th percentile of the subject knowledge distribution was moved to a teacher with median subject knowledge, the student's math achievement would be raised by 14.3% of a standard deviation by the end of the school year.
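The percentile counterfactuals here and below can be approximately reproduced from the corrected math estimate of roughly 0.09 under a simplifying assumption we introduce for illustration, namely that teacher subject knowledge is normally distributed; the paper's exact figures may instead rest on the empirical distribution.

```python
from scipy.stats import norm

BETA = 0.087  # measurement-error corrected math effect, ~9% of an SD

# Moving one teacher from the 5th percentile to the median:
print(BETA * (norm.ppf(0.50) - norm.ppf(0.05)))  # ~0.143 SD

# Raising every teacher below the p-th percentile up to it raises the
# average student by BETA * E[max(z_p - Z, 0)] for standard normal Z:
def avg_gain(p):
    z = norm.ppf(p)
    return BETA * (z * norm.cdf(z) + norm.pdf(z))

for p in (0.95, 0.75, 0.50):
    print(p, round(avg_gain(p), 3))  # roughly 0.14, 0.07, 0.035
```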


Table 4
The effect of teacher test scores in sub-samples matching students and teachers on achievement and gender.

Achievement level:

| Sub-sample | Implied β (1) | Prob > χ² (2) | Obs. (3) | Clusters (4) |
|---|---|---|---|---|
| Math: Teacher low level, student low level | 0.091⁎⁎⁎ | [0.009] | 1132 | 159 |
| Math: Teacher low level, student high level | 0.017 | [0.642] | 930 | 117 |
| Math: Teacher high level, student low level | 0.068 | [0.130] | 838 | 131 |
| Math: Teacher high level, student high level | 0.133⁎⁎⁎ | [0.002] | 1036 | 125 |
| Reading: Teacher low level, student low level | −0.039 | [0.358] | 1132 | 159 |
| Reading: Teacher low level, student high level | 0.040 | [0.472] | 930 | 117 |
| Reading: Teacher high level, student low level | −0.022 | [0.612] | 838 | 131 |
| Reading: Teacher high level, student high level | 0.020 | [0.619] | 1036 | 125 |

Gender:

| Sub-sample | Implied β (5) | Prob > χ² (6) | Obs. (7) | Clusters (8) |
|---|---|---|---|---|
| Math: Teacher female, student female | 0.093⁎⁎⁎ | [0.002] | 1012 | 146 |
| Math: Teacher female, student male | 0.073⁎⁎⁎ | [0.010] | 1088 | 148 |
| Math: Teacher male, student female | −0.014 | [0.698] | 876 | 155 |
| Math: Teacher male, student male | 0.083⁎⁎⁎ | [0.005] | 960 | 157 |
| Reading: Teacher female, student female | 0.037 | [0.230] | 1012 | 146 |
| Reading: Teacher female, student male | 0.011 | [0.713] | 1088 | 148 |
| Reading: Teacher male, student female | 0.030 | [0.390] | 876 | 155 |
| Reading: Teacher male, student male | −0.028 | [0.454] | 960 | 157 |

Dependent variable: student test score in math and reading, respectively. For each sub-sample, regressions in the two subjects are estimated jointly by seemingly unrelated regressions (SUR). Sample: same-teacher one-classroom (STOC), stratified into four sub-samples based on the combined teacher–student characteristic in the row heading. Teacher/student low/high level is defined by whether the teacher/student average test score is below/above the median. Implied β represents the effect of the teacher test score, given by the difference between the estimate on the teacher test score in the respective subject in the equation for the student test score in that subject and its estimate in the equation for the student test score in the other subject (see Eqs. (3a) and (3b)). Regressions include controls for student gender, student 1st language, urban area, private school, complete school, teacher gender, and teacher university degree. Clustered standard errors (adjusted for clustering at the classroom level) in the SUR models are estimated by maximum likelihood. Significance at ⁎⁎⁎1%, ⁎⁎5%, and ⁎10%.

If one raised all 6th-grade teachers who are currently below the 95th (75th, 50th) percentile of the subject knowledge distribution to that percentile, average student performance would increase by 13.9% (7.0%, 3.5%) of a standard deviation. Additional improvements could be expected from raising teacher knowledge in grades 1 to 5. Of course, this would mean substantial improvements in teacher knowledge, but ones that appear conceivable considering the low average teacher performance in the given developing-country context.

It is also illuminating to compare our estimated size of the effect of teacher subject knowledge to the size of recent estimates of the overall teacher effect based on relative student test-score increases exerted by different teachers. Rockoff (2004, pp. 247–248) concludes for the U.S. school system that a “one-standard-deviation increase in teacher quality raises test scores by approximately 0.1 standard deviations in reading and math on nationally standardized distributions of achievement.” Rivkin et al. (2005) find a similar magnitude. It is impossible to relate our results directly to this finding, both because the Peru finding may not generalize to the United States and because the ranking of teachers by student test-score increases will not perfectly match their ranking by subject knowledge, so the two metrics are not fully comparable. But it is still interesting to note that our measurement-error corrected estimate of the effect of teacher subject knowledge falls in the same ballpark.

5. Conclusion

The empirical literature analyzing determinants of student learning has tackled the issue of teacher impacts from two sides: measuring the impact of teachers as a whole and measuring the impact of distinct teacher characteristics. Due to recent advances in the first stream of literature, using newly available rich panel datasets, it has been firmly established that overall teacher quality is an important determinant of student outcomes, i.e., that teachers differ strongly in their impact on student learning (Hanushek and Rivkin, 2010). The second stream of literature examines which specific teacher characteristics may be responsible for these strong effects and constitute the unobserved bundle of overall teacher quality. Answers to this question are particularly useful for educational administration and policy-making. In the empirical examination of teacher attributes such as education, experience, salaries, test scores, and certification, only teacher knowledge measured by test scores has reasonably consistently been found to be associated with student achievement (Hanushek, 1986; Hanushek and Rivkin, 2006). However, identification of causal effects of teacher characteristics is often hampered by omitted student and teacher characteristics and by non-random selection into classrooms.

This paper proposed a new way to identify the effect of teacher subject knowledge on student achievement, drawing on the variation in subject-matter knowledge of teachers across two subjects and the commensurate achievement of their individual students across the subjects. By restricting the sample to students who are taught by the same teacher in both subjects, in schools that have only one classroom per grade, this identification approach circumvents the usual bias from omitted student and teacher variables and from non-random selection and placement. Furthermore, possible bias from measurement error was addressed by drawing on psychometric test statistics on the reliability ratios of the underlying teacher tests.

Drawing on data on math and reading achievement of 6th-grade students and their teachers in Peru, we find a significant effect of teacher math knowledge on student math achievement, but (with few sub-group exceptions) not in reading. A one-standard-deviation increase in teacher math knowledge raises student achievement by about 9% of a standard deviation. The results suggest that teacher subject knowledge is indeed one observable factor that is part of what makes up as-yet unobserved teacher quality, at least in math.

While the results are robust in most sub-populations, we also find interesting signs of effect heterogeneity depending on the teacher–student match. First, the effect does not seem to be present when high-performing students are taught by low-performing teachers. Second, female students are not affected by the level of teacher math knowledge when taught by male teachers. Thus, effects of teacher subject knowledge may depend on the subject and on the ability and gender match between teachers and students.


Such aspects of education production are frequently overlooked, and understanding the mechanisms behind them provides interesting directions for future research.

Overall, the results suggest that teacher subject knowledge should be on the agenda of educational administrators and policy-makers. Attention to teacher subject knowledge seems to be in order in recruitment policies, teacher training practices, and compensation schemes. However, before policy priorities can be established, additional knowledge is needed about the relative cost of improving teacher knowledge, both through different means of improving teacher knowledge and compared to other means of improving student achievement.

The results have to be interpreted in the given developing-country context with relatively low academic achievement. Although average student achievement in Peru is close to the average of Latin American countries (LLECE, 2008), it is far below the average of developed countries. For example, when Peruvian 15-year-olds took the Programme for International Student Assessment (PISA) test in 2002, their average math performance was a full two standard deviations below the OECD-country average, at the very bottom of the 41 (mostly developed) countries that had taken the test by that time (OECD, 2003). The extent to which the current results generalize to developed countries thus remains an open question for future research.

Appendix A. Supplementary material

Supplementary material to this article can be found online at http://dx.doi.org/10.1016/j.jdeveco.2012.06.002.

References

Altinok, Nadir, Kingdon, Geeta, 2012. New evidence on class size effects: a pupil fixed effects approach. Oxford Bulletin of Economics and Statistics 74 (2), 203–234.
Ammermüller, Andreas, Dolton, Peter, 2006. Pupil–teacher gender interaction effects on scholastic outcomes in England and the USA. ZEW Discussion Paper 06-060. Centre for European Economic Research, Mannheim.
Angrist, Joshua D., Krueger, Alan B., 1999. Empirical strategies in labor economics. In: Ashenfelter, Orley, Card, David (Eds.), Handbook of Labor Economics, vol. 3A. North-Holland, Amsterdam, pp. 1277–1366.
Ashenfelter, Orley, Krueger, Alan B., 1994. Estimates of the economic return to schooling from a new sample of twins. The American Economic Review 84 (5), 1157–1173.
Ashenfelter, Orley, Zimmerman, David J., 1997. Estimates of the returns to schooling from sibling data: fathers, sons, and brothers. The Review of Economics and Statistics 79 (1), 1–9.
Banerjee, Abhijit, Duflo, Esther, 2006. Addressing absence. The Journal of Economic Perspectives 20 (1), 117–132.
Bedi, Arjun S., Marshall, Jeffery H., 2002. Primary school attendance in Honduras. Journal of Development Economics 69 (1), 129–153.
Behrman, Jere R., 2010. Investment in education: inputs and incentives. In: Rodrik, Dani, Rosenzweig, Mark (Eds.), Handbook of Development Economics, vol. 5. North-Holland, Amsterdam, pp. 4883–4975.
Behrman, Jere R., Ross, David, Sabot, Richard, 2008. Improving quality versus increasing the quantity of schooling: estimates of rates of return from rural Pakistan. Journal of Development Economics 85 (1–2), 94–104.
Card, David, 1999. The causal effect of education on earnings. In: Ashenfelter, Orley, Card, David (Eds.), Handbook of Labor Economics, vol. 3A. North-Holland, Amsterdam, pp. 1801–1863.
Carrell, Scott E., Page, Marianne E., West, James E., 2010. Sex and science: how professor gender perpetuates the gender gap. Quarterly Journal of Economics 125 (3), 1101–1144.
Chamberlain, Gary, 1982. Multivariate regression models for panel data. Journal of Econometrics 18 (1), 5–46.
Chaudhury, Nazmul, Hammer, Jeffrey, Kremer, Michael, Muralidharan, Karthik, Rogers, F. Halsey, 2006. Missing in action: teacher and health worker absence in developing countries. The Journal of Economic Perspectives 20 (1), 91–116.
Clotfelter, Charles T., Ladd, Helen F., Vigdor, Jacob L., 2010. Teacher credentials and student achievement in high school: a cross-subject analysis with student fixed effects. The Journal of Human Resources 45 (3), 655–681.
Coleman, James S., Campbell, Ernest Q., Hobson, Carol F., McPartland, James M., Mood, Alexander M., Weinfeld, Frederic D., et al., 1966. Equality of Educational Opportunity. Government Printing Office, Washington, D.C.
Cronbach, Lee J., 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16 (3), 297–334.
Dee, Thomas S., 2005. A teacher like me: does race, ethnicity, or gender matter? The American Economic Review 95 (2), 158–165.


Dee, Thomas S., 2007. Teachers and the gender gaps in student achievement. The Journal of Human Resources 42 (3), 528–554.
Dee, Thomas S., West, Martin R., 2011. The non-cognitive returns to class size. Educational Evaluation and Policy Analysis 33 (1), 23–46.
Ehrenberg, Ronald G., Brewer, Dominic J., 1995. Did teachers' verbal ability and race matter in the 1960s? Coleman revisited. Economics of Education Review 14 (1), 1–21.
Eide, Eric, Goldhaber, Dan, Brewer, Dominic, 2004. The teacher labour market and teacher quality. Oxford Review of Economic Policy 20 (2), 230–244.
Ferguson, Ronald F., 1998. Can schools narrow the black-white test score gap? In: Jencks, Christopher, Phillips, Meredith (Eds.), The Black-White Test Score Gap. Brookings Institution, Washington, D.C., pp. 318–374.
Ferguson, Ronald F., Ladd, Helen F., 1996. How and why money matters: an analysis of Alabama schools. In: Ladd, Helen F. (Ed.), Holding Schools Accountable: Performance-based Reform in Education. Brookings Institution, Washington, D.C., pp. 265–298.
Filmer, Deon, 2007. Educational Attainment and Enrollment around the World. Development Research Group, The World Bank (Peru 2004 data last updated 25 Sep 2007) [accessed 24 March 2012]. Available from econ.worldbank.org/projects/edattain.
Glewwe, Paul, Kremer, Michael, 2006. Schools, teachers, and education outcomes in developing countries. In: Hanushek, Eric A., Welch, Finis (Eds.), Handbook of the Economics of Education, vol. 2. North-Holland, Amsterdam, pp. 945–1017.
Hanushek, Eric A., 1971. Teacher characteristics and gains in student achievement: estimation using micro data. The American Economic Review 61 (2), 280–288.
Hanushek, Eric A., 1986. The economics of schooling: production and efficiency in public schools. Journal of Economic Literature 24 (3), 1141–1177.
Hanushek, Eric A., 1992. The trade-off between child quantity and quality. Journal of Political Economy 100 (1), 84–117.
Hanushek, Eric A., 1997. Assessing the effects of school resources on student performance: an update. Educational Evaluation and Policy Analysis 19 (2), 141–164.
Hanushek, Eric A., Rivkin, Steven G., 2006. Teacher quality. In: Hanushek, Eric A., Welch, Finis (Eds.), Handbook of the Economics of Education, vol. 2. North-Holland, Amsterdam, pp. 1051–1078.
Hanushek, Eric A., Rivkin, Steven G., 2010. Generalizations about using value-added measures of teacher quality. The American Economic Review 100 (2), 267–271.
Hanushek, Eric A., Woessmann, Ludger, 2011. The economics of international differences in educational achievement. In: Hanushek, Eric A., Machin, Stephen, Woessmann, Ludger (Eds.), Handbook of the Economics of Education, vol. 3. North-Holland, Amsterdam, pp. 89–200.
Hanushek, Eric A., Woessmann, Ludger, 2012. Schooling, educational achievement, and the Latin American growth puzzle. Journal of Development Economics 99 (2), 497–512.
Harbison, Ralph W., Hanushek, Eric A., 1992. Educational Performance of the Poor: Lessons from Rural Northeast Brazil. A World Bank Book. Oxford University Press, Oxford.
Hargreaves, Eleanore, Montero, C., Chau, N., Sibli, M., Thanh, T., 2001. Multigrade teaching in Peru, Sri Lanka and Vietnam: an overview. International Journal of Educational Development 21 (6), 499–520.
Kane, Thomas J., Staiger, Douglas O., 2008. Estimating teacher impacts on student achievement: an experimental evaluation. NBER Working Paper 14607. National Bureau of Economic Research, Cambridge, MA.
Laboratorio Latinoamericano de Evaluación de la Calidad de la Educación (LLECE), 2008. Los Aprendizajes de los Estudiantes de América Latina y el Caribe: Primer Reporte de los Resultados del Segundo Estudio Regional Comparativo y Explicativo. Oficina Regional de Educación de la UNESCO para América Latina y el Caribe (OREALC/UNESCO), Santiago, Chile.
Lavy, Victor, 2010. Do differences in schools' instruction time explain international achievement gaps in math, science, and reading? Evidence from developed and developing countries. NBER Working Paper 16227. National Bureau of Economic Research, Cambridge, MA.
Lavy, Victor, 2011. What makes an effective teacher? Quasi-experimental evidence. NBER Working Paper 16885. National Bureau of Economic Research, Cambridge, MA.
Luque, Javier, 2008. The Quality of Education in Latin America and the Caribbean: The Case of Peru. Abt Associates, San Isidro.
Metzler, Johannes, Woessmann, Ludger, 2010. The impact of teacher subject knowledge on student achievement: evidence from within-teacher within-student variation. CESifo Working Paper 3111. CESifo, Munich.
Murnane, Richard J., Phillips, Barbara R., 1981. What do effective teachers of inner-city children have in common? Social Science Research 10 (1), 83–100.
Organisation for Economic Co-operation and Development (OECD), 2003. Literacy Skills for the World of Tomorrow: Further Results from PISA 2000. OECD, Paris.
Rivkin, Steven G., Hanushek, Eric A., Kain, John F., 2005. Teachers, schools, and academic achievement. Econometrica 73 (2), 417–458.
Rockoff, Jonah E., 2004. The impact of individual teachers on student achievement: evidence from panel data. The American Economic Review 94 (2), 247–252.
Rockoff, Jonah E., Jacob, Brian A., Kane, Thomas J., Staiger, Douglas O., 2011. Can you recognize an effective teacher when you recruit one? Education Finance and Policy 6 (1), 43–74.
Rothstein, Jesse, 2010. Teacher quality in educational production: tracking, decay, and student achievement. Quarterly Journal of Economics 125 (1), 175–214.
Rowan, Brian, Chiang, Fang-Shen, Miller, Robert J., 1997. Using research on employees' performance to study the effects of teachers on students' achievement. Sociology of Education 70 (4), 256–285.
Schwerdt, Guido, Wuppermann, Amelie C., 2011. Is traditional teaching really all that bad? A within-student between-subject approach. Economics of Education Review 30 (2), 365–379.


Summers, Anita A., Wolfe, Barbara L., 1977. Do schools make a difference? The American Economic Review 67 (4), 639–652.
Tan, Jee-Peng, Lane, Julia, Coustère, Paul, 1997. Putting inputs to work in elementary schools: what can be done in the Philippines? Economic Development and Cultural Change 45 (4), 857–879.

Wayne, Andrew J., Youngs, Peter, 2003. Teacher characteristics and student achievement gains: a review. Review of Educational Research 73 (1), 89–122.
World Bank, 2007. Toward high-quality education in Peru: standards, accountability, and capacity building. A World Bank Country Study. The World Bank, Washington, D.C.