International Journal of Educational Development 45 (2015) 77–88
Contents lists available at ScienceDirect
International Journal of Educational Development journal homepage: www.elsevier.com/locate/ijedudev
Reading achievement progress across countries Maciej Jakubowski a, Artur Pokropek b,* a b
Faculty of Economic Sciences, Warsaw University, Poland Educational Research Institute (IBE), Poland
A R T I C L E I N F O
A B S T R A C T
Article history: Received 5 July 2014 Received in revised form 12 September 2015 Accepted 22 September 2015
This paper discusses a method to compare progress in reading achievement across countries. The method uses measures of achievement in primary schools from PIRLS and compares them to secondary school results from PISA. Results describe an interval in which estimates of progress can lie, depending on the comparability of these two assessments. Progress estimates are also adjusted for age, gender and other characteristics that differ between countries and surveys. The paper also demonstrates how useful these estimates are for policy analysis by comparing achievement progress across early and late tracking countries. ß 2015 Elsevier Ltd. All rights reserved.
Keywords: PISA PIRLS Value added Human capital Early selection Tracking
1. Introduction International surveys of students, such as the Progress in International Reading Literacy Study (PIRLS) and the Programme for International Student Assessment (PISA), assess representative samples of students from different countries to provide estimates of their average level of skills and knowledge related to reading competencies (von Davier et al., 2013). Country rankings produced by these surveys usually attract considerable attention, while more in-depth analyses of the factors that may influence these results are discussed less often (Grek, 2009). Although countries can compare their students’ skills levels to those of other participating countries, cross-sectional surveys like PIRLS or PISA provide limited guidance to policy makers. Average country performance is only partly affected by teaching quality; between-country differences in such factors as parents’ education, a country’s economic and social development, or school enrolment levels usually play important roles in defining student outcomes (Fuchs and Wo¨ßmann, 2008). To assess country’s performance at the secondary level it seems useful to take into account primary school performance. Countryspecific factors, like economic prosperity, teacher’s professionalism,
* Corresponding author. E-mail addresses:
[email protected] (M. Jakubowski),
[email protected] (A. Pokropek). http://dx.doi.org/10.1016/j.ijedudev.2015.09.011 0738-0593/ß 2015 Elsevier Ltd. All rights reserved.
culture and social capital, parents’ educational attainment or school resources tend to be similar and similarly affect students at both primary and secondary levels. Other factors, for example tracking policies or changes in curriculum, might affect students at different ages differently. Thus, by adjusting secondary school results for primary school achievement, or by looking at the achievement progress between primary and secondary level, one can see how the factors that vary between primary and secondary school affect performance. These factors will be more driven by policy changes or the effectiveness of secondary schools than by country-specific characteristics that can be rarely changed by policies, at least in the short-term. This idea is close to difference-in-differences method often used in cross-country analyses (see Hanushek and Wo¨ßmann, 2006, for an application in education which is extended in this paper). It is also similar to longitudinal studies that compare student performance over time for some organizational units, for example, across US states or across the same schools (see Card and Payne, 2002, for an analysis of US states, and Hoxby, 2000, for an analysis of schools). In each case, the idea is to exclude factors that affect both data points similarly and to relate outcome changes to factors and policies that vary between the two data points. Although such approach has its limitations, which we discuss below, there is no doubt that it has a potential to provide more useful secondary schools policy indicators than average scores collected at one moment in time. To enable this type of comparison, it is first necessary to calculate internationally comparable reading achievement scores for primary
78
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
and secondary schools and then to report point and interval estimates of reading achievement progress or development across countries, which we attempt to do in the first part of the paper. These data enable answering different types of research questions, and in this paper we focus on three. First we compare how achievement progress of boys and girls varies across countries. Second we compare how performance gaps develop differently. Finally, we present an example of analysis comparing achievement progress between countries that track students very early (usually by segregating them into academic and vocational tracks around the age of 10 or 12) and those who have the same curriculum for all students until the age of 15 or 16. Our approach combines several known statistical and psychometric methods but to our knowledge is the first time that they are applied to international comparisons. Thus, the final goal of this paper is to present this novel methodology. Although some international comparisons of achievement between different grades are already available for mathematics and science, e.g. 4th and 8th grade results in TIMSS (Trends in International Mathematics and Science Study), our study differs in three aspects. First, we present reliable international comparisons of achievement progress in reading literacy, which were previously unavailable. Second, we calculate differences for a longer period of time, between the age of 10 and 15, or between the 4th and 9th grades. This opens a way, for example, to the analysis of tracking policy, which we present in this paper. Finally, we propose a way to correct point estimates and a novel approach to estimate complex standard errors that account for all sources of errors when comparing results from two international student surveys. Previous studies used two strategies in the attempt to increase comparability. The first was to adjust and compare country-level statistics. For example, Brown et al. (2007) report a similarity in country rankings obtained from studies like PISA, PIRLS, but also TIMSS and IALS (The International Adult Literacy Survey). They note, however, that although measures of central tendency provide a relatively consistent picture, that is not the case with measures of dispersion, suggesting that differences between surveys, such as different scaling models, might limit possible comparisons. Rindermann (2007) and Jakubowski (2010) compare countrylevel statistics and both suggest that results of PIRLS and PISA are more consistent after adjusting for differences in average student age. In the second approach, student results are compared within groups of students that share similar characteristics. For example, Carnoy and Rothstein (2013) compare PISA and TIMSS results within student groups defined by social classes. This enables them to adjust for the large share of disadvantaged students in the U.S. and obtain a more comparable picture of performance across countries. We re-estimate student performance using micro-data from both PIRLS and PISA assessments. To our knowledge, this is the first attempt to apply the same psychometric approach to data from different studies in order to increase their comparability. Although authors comparing PIRLS or TIMSS with PISA usually mention differences in measurement models which might affect the comparisons, these issues were never fully addressed. In a typical approach authors simply re-standardize country averages or other statistics to put them on the same scale without properly addressing complex issues of scale comparability (see for example Hanushek and Wo¨ßmann, 2006). This motivates our approach to re-estimate student performance from the micro-data using the same measurement model for both PIRLS and PISA and to estimate errors that arise when comparing different achievement surveys. It should be noted, however, that we can only account for errors that can be corrected or estimated with the available data. For example, our results might be affected by survey response, dropout or enrolment rates. In fact, studies reported that these factors
might mildly affect comparisons, while they do not change country rankings or analytical results based on larger groups of countries.1 Anyhow, the results for countries with low response, high exclusion, or high drop-out rates should be taken with more caution. Performance comparisons between primary and secondary schools might be even more sensitive to these factors as enrolment is usually nearly universal in primary schools while it might decrease at the secondary level for some countries. We provide relevant data in Table A6 in the Appendix to inform readers about the coverage of samples in each country. In general, we do not find any correlation between the population coverage of 15year-olds as reported by PISA and our measures of achievement progress. Thus, while some caution is in order when interpreting results for countries with smaller coverage rates, we do not find evidence that these factors affect our results. Our analysis also suggests that more effort is necessary to provide data that allow reliable comparisons across countries, especially for developing countries where the participation rates in large-scale assessments are usually lower. We compare reading achievement in primary school, as measured by PIRLS, to reading achievement of 15-year-olds in the PISA survey. The results of PIRLS 2001 are compared with results from PISA 2000, while the results from PIRLS 2006 are compared to results from PISA 2009. While PIRLS is entirely devoted to assessing reading achievement, only PISA 2000 and PISA 2009 focused on reading and provide the most reliable comparison with PIRLS. Thus, to increase reliability we decided to compare surveys that focus on reading achievement. We also compare surveys that were administered more or less at the same time. This is due to three reasons. The first one we already discussed and is related to the lack of reliable reading assessments that would follow the same cohort of students over time.2 Secondly, taking surveys from different moments in time would increase the probability that outcomes are affected by factors external to the education system, for example, economic crisis, migration or political changes that might happen over a longer period of time. Finally, comparing student outcomes adjacent in time allows us to exclude the effect of policies similarly affecting students at different grades. Otherwise, it would be difficult to separate policies that differently affect primary and secondary school students from those that affect all students similarly but could change between assessments. Thus, our results do not represent progress of individual students over time but rather learning opportunities between primary and secondary level based on outcomes of students attending schools at the same time. Achievement is compared using random draws of test items from both surveys giving a range of results possible to obtain under differently constructed tests. Thus, our results describe an interval in which estimates of progress can lie, depending on the comparability of the two assessments. We also adjust progress estimates for differences in the distribution of student background characteristics and for differences in testing age across countries and surveys. The results are precise enough to compare changes in achievement across countries, even after taking into account the fact that different combinations of test items would give different estimates of progress. Results are provided for all students and for subpopulations defined by gender, immigrant status, and 1 Hanushek and Wo¨ßmann (2011) reported correlation between response and enrolment rates and outcomes of various international surveys of achievement. On the other hand, Micklewright and Schnepf (2012) analyse non-response for PISA in the UK to report that the relationship between performance bias and response rates is not straightforward with both positive and negative biases possible depending on the particular response pattern in a country. 2 For example, one could compare PIRLS 2001 to PISA 2006 which covered similar cohort of students, but as PISA 2006 has focused on science rather than reading the comparisons would be much less reliable.
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
proficiency level. These results provide detailed evidence on how students in different educational systems progress from the age of 10 to 15. The paper is organized as follows. Section 2 discusses similarities between PIRLS and PISA and reviews literature on this topic. Section 3 discusses the research that explores effects of early tracking policies that separate students into different study programmes between primary and secondary education. Section 4 describes the data and provides details on how the original data were re-scaled and adjusted to provide more reliable comparisons. Section 5 discusses errors arising in comparisons of achievement results from different surveys and describes our method of estimating these errors. Section 6 provides results comparing reading achievement progress across countries. Section 7 presents an empirical analysis that uses these results to compare achievement progress between early tracking and non-tracking countries. Section 8 concludes. 2. Similarities between PIRLS and PISA Both PIRLS and PISA aim to measure student achievement in reading in an internationally comparable way. Although carried out by different organizations, these two studies have many things in common. They use similar methodologies common nowadays to international surveys of student achievement (for details see technical reports for both studies, for example OECD, 2009 and Martin et al., 2003). Both use paper and pencil tests and are based on representative random sample. PISA is using a two-stage sampling with schools sampled first and then around 35 students sampled from all classrooms (OECD, 2009: 66). PIRLS is also using two-stage sampling where on the first level schools are sampled but on the second level entire classrooms are randomly chosen and tested (Martin et al., 2003: 56–57). In PISA, a minimum of 150 schools and a minimum sample size of 4500 is required (OECD, 2009: 66), In PIRLS an effective sample size of 400 students under simple random sampling is required with 150 schools sampled (Martin et al., 2003: 57). In PISA a response rate of 80% of selected students in participating schools is required (OECD, 2009: 66), while in PIRLS 85% was obligatory (Martin et al., 2003: 125). This requirements results in median sample size of 3635 for PIRLS (Martin et al., 2003: 57) and median sample size 4926 for PISA (OECD, 2009: 400). The main difference between the two surveys is that PISA is age based, while PIRLS is grade based so the distribution of students across grades and age is different. This is separately addressed in our study, while we use the original replication methods and survey weights to separately calculate point and interval estimates for each study (for details see Section 5). The experts involved often work for both studies, and despite some apparent differences, their general goal is similarly stated: both want to provide internationally comparable measures of what students can do in reading. The main difference is that PIRLS is conducted in primary schools while PISA measures achievement in secondary schools. That provides an opportunity to compare how countries differ in their students’ achievement progress from primary to secondary education. PISA is generally considered to be a test aimed at measuring the literacy needed to function in real-life situations, while PIRLS is more closely related to countries’ school curricula. Nevertheless, a careful analysis of test content and framework assumptions for both PIRLS and PISA reveals many commonalities (Mullis et al., 2006, see Table A1 in the Appendix). Constructs measured in both assessments are defined similarly. Reading literacy in PIRLS is defined as ‘‘as the ability to understand and use those written language forms required by society and/or valued by the individual’’ (Mullis et al., 2006: 3). In PISA reading literacy is defined as: ‘‘The
79
capacity to understand, use, reflect on and engage with written texts, in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society’’ (OECD, 2006: 46). In both assessments around half of the items are multiple choice (50% in PIRLS and 55% in PISA), while the remaining half are constructed response questions. Both PIRLS and PISA define similar cognitive processes (retrieving, interpreting, reflecting and evaluating). Among differences the most visible one is that PIRLS uses continuous texts only, while PISA includes also graphs, forms and lists. Another difference is that PIRLS includes literary experience among reading purposes, while PISA focuses on acquiring information, although it is less clear how that affects reading literacy measurement. Overall, as PISA and PIRLS aim at measuring similar broadly defined constructs of reading literacy, the differences seem to be of less importance: ‘‘comparisons of the PIRLS and PISA (. . .) demonstrates how different International consensus-building process can result in somewhat similar approaches to assessment (Mullis et al., 2006: 106). We are aware of only one study that empirically addresses these issues using item-level data (see Grisay et al., 2009). Their work suggests that psychometric prosperities of items are very similar in PIRLS and PISA. The distribution of item difficulty was similar in both studies and about 80% of the item difficulty variance was explained by a common factor. Also, the mean absolute magnitude of differential item functioning (DIF)3 was similar. DIF was found mostly in non-Indo-European languages and/or in countries with lower GDP per capita. As already noted in the introduction, studies report broad alignment of the average results of PIRLS and PISA. However, certain discrepancies were noted and the studies suggest that they can be limited through adjustments increasing comparability. For example, Brown et al. (2007) note that while average results are consistent, the measures of dispersion are less correlated than measures of central tendency. Rindermann (2007) reports that age adjusted correlation of country means between PIRLS and PISA reading results equals 0.84. The main conclusion from these studies is that although there are some differences between the content and methods in PIRLS and PISA, these two surveys are similar, and country results from both are highly correlated, while the data should be adjusted to allow more meaningful comparisons. 3. The effects of early tracking on the achievement progress between primary and secondary school The comparisons of achievement progress across countries can provide important policy insights which are not available when looking at a single country data. Student progress can be related to differences in policies across countries, especially when these vary with regard to secondary schools but are similar at the primary level. One example is analysis of the effects of early tracking policy – a policy that is difficult to assess using a single country data as usually it applies to all students. We define tracking school systems as those that separate students into schools which differ in educational programmes or objectives. Usually the best students are selected into academic schools and vocational training is provided for less talented. This definition differs from the one commonly used in the United States where tracking often relates to streaming by ability within the same school. Although both policies are based on a similar mechanism of student sorting, the early tracking policy as we 3 Meaningful comparisons between groups require that the meaning of an item and some of its psychometric characteristics are the same for each group. Test items with large DIF are usually excluded from a study as the probability of correct response for these items varies across countries even for people of similar ability.
80
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
define it might have more profound educational and social effects as it totally separates large groups of students from each other.4 In most developed countries students are tracked into different programmes in secondary schools, but countries differ in the timing of tracking. No country separates students into different programmes earlier than the 4th grade. Some countries, however, track students to vocational schools as early as at the age of 10, while others keep the same programme for all secondary school students. Early tracking of students might have consequences not only on student achievement but also on social integrity. In late tracking systems students usually follow the same curriculum till they are 15 or 16 with many of them having large chances of enrolling into higher education. On the other hand, it is often argued that not all students can excel academically, while good vocational education would enable them to successfully enter labour market. Theoretical considerations of the effects of tracking predict different results depending on the assumptions and outcomes considered. Most studies focus on how the distribution of academic achievement is affected by tracking or streaming policies and that is also the perspective taken in this study. In addition, economic literature considers how tracking can address labour market demand for different sets of skills (see Meier and Schu¨tz, 2007, for a review of theoretical arguments), while sociologists often discuss the impact of tracking on social stratification and inequality and include other important aspects like motivation and self-esteem in their analyses (see Ansalone, 2010, for an example). A more complex approach would also recognize the interaction between tracking or streaming policies with the organization of teaching and curriculum. Regardless of the perspective taken, the basic mechanism driving the effects of tracking depends on the strength and characteristics of the so-called peer effects (see Wilkinson et al., 2010, for a comprehensive approach that discusses how peer effects interplay with structural and organizational factors). Tracking policy usually distributes students across schools at least partly according to their ability. It means that under tracking regime students are in classrooms that are less diverse comparing to the system with no selection. Rich literature documents that individual student achievement strongly depends on the ability and other characteristics of classroom or school peers, e.g. the more able are the other peers the better is the student’s achievement. The effects might vary across groups of students (for reviews of literature see Sacerdote, 2011; Wilkinson et al., 2010). Theoretically, the results of tracking will depend on whether the peer effects are linear (the same positive effect for all students despite their ability levels) or non-linear. If the effect is stronger for weaker students then tracking will have an overall negative effect: the weaker students will be harmed more by tracking than the most able students will benefit from it. If the opposite is true, and the peer effects are more positive for the most able students, then tracking will have an overall positive effect. In practice the strength and character of peer effects and thus the results of tracking policy will vary depending on how teaching is organized. For example, if in a non-tracking system teachers have no means to individualize work with students then mixing might have a stronger negative effect on the best students. However, if teaching can be individualized then the negative effect might be limited. In a tracking system, if vocational school students abandon learning general skills early then tracking might have a highly detrimental impact on them, while if they at least partly
4
See Meier and Schu¨tz (2007), for a discussion of how tracking and streaming are differently defined in the literature. See Oakes (2005), for an in-depth discussion of tracking (streaming) policies in the US context. See Wilkinson et al. (2010), for a generalized model of how segregating students affects their achievement under different institutional arrangements.
follow a general curriculum then these effects will be less severe. Similarly, if countries segregate students without realistic opportunities to change tracks the effects of tracking will probably be more negative if compared to countries where despite the segregation students are given chances to switch between tracks or upgrade their diploma later on. The overall effect of tracking is thus an empirical question and depends on teaching practices and organization of school systems. In one of the most influential papers analysing the effects of early selection on student achievement Hanushek and Wo¨ßmann (2006) argued that tracking does not help in increasing overall student performance while it increases educational inequalities. Specifically, they showed that standard deviation of scores in secondary schools (PISA) in tracking countries is higher than in non-tracking countries controlling for score variation in primary schools (PIRLS). Ammermueller (2005), Waldinger (2006) and Jakubowski (2010) used the same PIRLS, TIMSS and PISA data as Hanushek and Woessmann to come to different conclusions. While Ammermueller suggested that tracking negatively affects student performance and increases inequalities, Waldinger found that tracking has no effect on the relation between family background and achievement and concluded that there is no evidence of its negative impact on equity. Jakubowski (2010) found that tracking has negative impact on overall performance, but also suggested that this might be due to stronger effect in Eastern European countries and might be confounded with the effect of other policies common to these countries. Brunello and Checchi (2007) analysed IALS data, another popular source for international comparisons. Their findings are also ambiguous, suggesting different effects for literacy and earnings. In this paper, we calculate achievement progress estimates that can be used to see if student progress differently across early tracking and late tracking countries. This analytical example is presented in Section 7. Our results provide new data and new insights previously unavailable in the literature. We present comparisons that are more reliable thanks to adjustments discussed in the following sections. We also present results for students at different proficiency levels to separately estimate effects of tracking for low- and high-achieving students. 4. The data used in this study This study analyses publicly available datasets provided by the organizers of PIRLS and PISA. Data and documentation are accessible online (for PIRLS at timss.bc.edu; for PISA at pisa.oecd.org). The data used in this paper differ in two respects: First, we re-scaled performance scores using the same two parameter logistic Item Response Theory model (2PL IRT) with plausible values5 for both PIRLS and PISA to increase comparability. Results for PISA were standardized to have mean 500 and standard deviation 100 across all participating in comparisons countries/ economies while PIRLS mean was set to be 400 and standard deviation equals 100. This shift of minus 100 score points was applied to reflect the supposed score point gain between the age of 10 and 15. This shift is arbitrary and while it does not affect crosscountry comparisons in relative terms it emphasizes student progress during five years of education between primary and secondary school. The number is based on different studies of student progress which generally provide similar results, showing that the average grade gain is close to 1/5 of standard deviation 5 2PL IRT model is a statistical model commonly used in applied measurement for estimating item parameters and students’ proficiency (for a general overview see Hambleton et al., 1991, or DeMars, 2010). Plausible values are the least biased and the most efficient estimates of proficiency that might be estimated using the IRT methodology and widely used in large-scale assessments (for a general overview see Wu, 2005 or Von Davier et al., 2009).
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
(see Ingels et al., 1994). This suggests that the average score point gain between the age of 10 (mean age in PIRLS) and 15 (mean age in PISA) is close to 100 score points (20 score point equals 1/5 of a standard deviation on the PISA scale) and our results can be seen as differences in average gains across five years of education. In any case, the results should be interpreted in relative terms as they compare achievement progress among countries considered in our study. Second, we separated data for England and Scotland, as they participated in PIRLS independently and were published under ‘‘United Kingdom’’ in PISA. Only data for Scotland were considered, as 2000 data for England were withdrawn from official PISA publications due to a large non-response rate. For the comparison between PIRLS 2006 and PISA 2009, we also separately analysed data from Canadian provinces, treating them as other countries. To a large extent, Canadian provinces have separate educational institutions, and they participated separately in PIRLS 2006. There are other datasets that allow comparisons of achievement progress across countries. TIMSS allows direct comparisons within the same study between the 4th and 8th grade. These results are available in TIMSS reports for mathematics and science and for countries that participated in both options. Our analysis, however, allows comparison of achievement progress in reading and over a longer period of education. It provides new results that allow comparison of reading literacy between primary and secondary levels. As many countries separate students into different programmes after the 8th grade, it also allows analysing the results of such selection, which we discuss in the last section.
81
population of students indexed by i in K countries indexed by k, our regression model can be written as follows: yik ¼
K X
bk0 Dk þ b1 ðag e¯k ageik Þ þ eik
K¼1
where Dk is a dummy variable denoting a country, ag e¯k is a country’s median age, and eik is the error term. Regressions were run separately for each survey, so estimated coefficients b0 provide country estimates for students of the same age that can be subtracted to obtain achievement progress estimates. To eliminate non-linearity in the effect of age a square term was added to the regressions. Probability weights were standardized to sum up to 1000 for each country and gender, so each country and both genders weight equally in regressions. In addition, PIRLS data were additionally reweighted to match distribution of student characteristics in PISA. Original survey weights were multiplied by conditional probabilities estimated using logit regression with a dependent variable equal 1 for PISA and 0 for PIRLS data and a set of student characteristics as control variables. Tarozzi (2007) argues that such reweighting produces datasets that are balanced in terms of background characteristics and allow comparisons of outcome. The main results presented in this paper take into account differences in gender and age distribution. For brevity, additional results adjusting for student socio-economic background are not presented as they were qualitatively similar. These results are, however, available upon request from the authors. 5.2. Controlling for random errors
5. Methods The paper uses a novel methodology to compare results of different international surveys of student achievement. We focus on limiting the bias in point estimates by adjusting samples and using the same scaling methods. In addition, we propose a complex method to estimate standard errors that account for different sources of error. Wu (2010) has listed five most important sources of error in comparing results from international surveys across time and student cohorts: 1. Sampling errors that are related to sampling students from populations. 2. Measurement errors that are related to test precision. 3. Construct-discrepancy errors that are related to differences in constructs measured by the two tests. 4. IRT-model misspecification errors that are related to misspecification of models used to estimate performance scores. 5. Population-discrepancy errors that are related to comparing performance using samples that are representative of different populations. Below we discuss how all these sources of errors are addressed in our research. 5.1. Controlling for systematic errors The population-discrepancy error is the systematic error which in our case can be limited by adjusting the samples to make them representative to similarly defined populations. Obviously, students sampled in PIRLS and PISA are from different populations. Besides the fact that they differ in age, they follow different sampling schemes and cover different school grades. We adjust PIRLS samples to make them more comparable with PISA samples and eliminate age affects by estimating country effects in a regression model with additionally reweighted data. For a
PIRLS and PISA are complex surveys of student populations, involving multistage sampling designs and plausible values as student outcomes. We take sampling errors into account by using original methods employed in each survey.6 We also account for the measurement errors by using plausible values method to estimate student achievement. The sampling and measurement errors are calculated separately for each study using the original methods and added to the remaining errors discussed below. Crucial for our approach is the estimation of the constructdiscrepancy errors and the misspecification error of the IRT models. We estimate these two errors together calling the sum of them the linking error. Our approach to estimate it is similar to the resampling methods already proposed in the linking literature (Haberman et al., 2009; Rijmen et al., 2009). The previous attempts, however, were based on the jackknife procedure, while we apply half-sample replicates method. In our approach sets of items vary more across replications, providing the necessary variation to estimate the IRT-model misspecification error. This would not be possible in the jackknife approach where item replication sets differ by only one excluded item. To our knowledge this is not only the first time the half-sample approach is used to estimate linking error but also the first time when the item resampling method is used to estimate both construct-discrepancy errors and the misspecification error for international studies. Our procedure to estimate the linking error follows three steps. First, items are randomly sampled from each survey. Each item is sampled independently of others with the same probability of selection equal to ½. Thus, on average, half of the items are sampled from the item pool, but the actual number of test items differs across simulated samples (see Table A2 in the Appendix). 500 item samples were drown for each survey. In the second step, 6 PIRLS is using the jackknife method, while PISA uses the balanced repeated replication method (BRR). Both the jackknife and the BRR are replication methods that provide unbiased estimates of standard errors for complex survey designs (for more details see Efron, 1982).
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
82
student achievement scores are estimated using the same 2PL IRT model. In the third step, outcomes from each item sample were standardized to show a distribution that was congruent with the PISA reading performance scale. The shift needed to place results on the PISA scale was calculated with respect to the average statistics across all replications (average of 500 means and average of 500 standard deviations). Thus, although scores in each replication could still differ in mean value and distribution, they followed the PISA reading performance scale on average. Similar procedure was applied to PIRLS. The linking error was computed from 500 replications of student achievement scores. This mimics the methodology known from survey statistics although in our case sampling units are not respondents but test items. Following Efron (1982: 61–73) halfsample unbiased estimate of standard deviation is 2 3 2 1=2 J0 ˆ uˆ j Þ X ð u 5 ˆ HS ¼ 4 SD J0 1 j¼1 where J0 is a total number of possible half-samples, uˆ is the estimation of proficiency based on all items, and uˆ j is the estimate from item sample j.
The procedures described above allows us to account for all random errors connected to our analysis and estimate final standard error for any statistics using this formula: qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 SE ¼ SE2PIRLS þ SE2PISA þ link where SEPIRLS and SEPISA are sampling and measurement errors calculated in the same way as in the original surveys, and where link corresponds to the linking error (the sum of the constructdiscrepancy and the IRT-model misspecification errors) obtained using the replication methods described above. 6. Results The main results comparing achievement progress across countries are summarized in Table 1 and were obtained after taking into account the effect of student age and balancing gender across countries and studies. These results are similar to the unadjusted ones (see Table A3 in the Appendix), except for countries that have a very different mean student age in PIRLS. One of the findings from these results is that while countries that show lower performance in primary school seem to experience greater
Table 1 Estimates of achievement progress adjusted for student age and gender (results presented on the PISA scale with a standard deviation of 100 across the OECD countries). PIRLS 2001 to PISA 2000 All students
Argentina Austria Belgium (Flemish) Belgium (French) Bulgaria Canada Canada Alberta Canada British Columbia Canada Nova Scotia Canada Ontario Canada Quebec Czech Republic Denmark England France Germany Greece Hong Kong-China Hungary Iceland Indonesia Israel Italy Latvia Lithuania Luxembourg Macedonia Netherlands New Zealand Norway Poland Qatar Romania Russian Federation Scotland Singapore Slovak Republic Slovenia Spain Sweden Taipei Trinidad and Tobago United States
PIRLS 2006 to PISA 2009
Boys
Girls
All students
Progress
S.E.
Progress
S.E.
Progress
S.E.
119.9
(10.2)
107.0
(9.6)
132.4
(13.1)
18.9 127.5
93.6
(7.9) (4.8)
(6.0)
10.1 122.2
84.2
(8.4) (5.0)
(6.8)
28.5 132.2
103.2
121.2 89.6 97.1 130.2 83.1 139.9
(9.4) (5.8)
Boys
Girls
Progress
S.E.
Progress
S.E.
Progress
S.E.
69.3 103.8 134.5 19.2
(5.3) (5.0) (5.8) (8.6)
59.8 98.0 128.0 6.9
(6.3) (5.5) (7.4) (9.8)
79.0 109.3 140.6 32.4
(6.2) (6.2) (6.4) (8.7)
103.0 100.4 108.5 106.7 125.5
(6.1) (5.9) (5.4) (5.9) (5.8)
95.1 91.3 108.9 100.4 121.5
(6.8) (7.0) (6.9) (6.7) (7.0)
110.4 109.0 107.9 112.4 129.4
(7.1) (6.5) (6.2) (6.8) (6.4)
72.5 90.6 111.1 80.0
(5.2) (5.0) (5.4) (5.3)
69.1 93.0 101.9 68.6
(6.0) (5.9) (6.2) (6.2)
76.8 88.3 120.0 91.7
(6.0) (5.8) (5.9) (5.9)
98.9 72.1 131.4 123.1 101.7 73.6 67.8 65.6 23.6
(6.6) (5.9) (5.3) (7.3) (5.6) (4.7) (5.3) (5.1) (4.2)
92.9 59.9 124.6 115.3 92.7 59.5 62.4 52.9 9.8
(6.9) (6.5) (6.0) (7.6) (6.7) (5.5) (5.9) (5.8) (5.0)
104.5 85.0 137.6 131.0 110.4 87.1 74.4 79.1 39.2
(7.5) (7.0) (5.8) (7.7) (6.5) (5.4) (6.2) (5.7) (4.9)
94.9 125.1 147.2 123.1 153.8 74.1 22.9 113.7 95.7 83.7 103.8 110.5 74.2 97.4 125.3 94.8
(6.9) (4.5) (5.8) (5.1) (6.0) (7.1) (6.5) (5.4) (5.3) (5.2) (5.5) (4.6) (5.1) (6.2) (6.0) (6.1)
90.4 120.8 138.8 112.9 146.4 64.9 14.1 117.4 93.8 69.8 92.6 101.8 67.9 90.7 116.0 92.4
(7.4) (5.6) (6.4) (5.7) (6.8) (8.0) (6.8) (6.5) (5.8) (6.3) (6.3) (5.0) (5.5) (7.0) (6.9) (6.9)
99.6 129.1 155.1 132.8 160.6 84.5 32.4 109.6 97.9 97.8 114.4 118.9 81.4 103.9 134.4 97.0
(7.2) (5.3) (6.6) (5.9) (5.9) (7.9) (7.2) (6.5) (5.8) (5.5) (6.0) (5.3) (5.9) (6.9) (6.6) (6.5)
(7.0)
113.6 80.5 89.0 134.5 75.8 132.1
(6.3) (5.3) (8.5) (8.9) (7.2) (5.6)
105.7 71.7 80.2 138.6 68.9 123.3
(7.1) (6.2) (9.1) (9.7) (8.4) (6.1)
(7.1) (6.1) (9.3) (9.6) (7.6) (6.5)
94.7 83.5 49.1
(9.8) (5.8) (6.9)
100.2 70.6 37.6
(12.0) (7.3) (7.5)
88.4 95.7 61.6
(10.2) (6.9) (8.3)
83.6
(6.6)
72.9
(7.1)
94.6
(7.4)
142.2 140.6
(6.4) (6.3)
135.7 129.8
(7.8) (7.4)
148.3 150.8
(8.0) (7.0)
85.8 74.3 138.8
(8.7) (7.9) (6.5)
88.4 62.7 135.9
(10.4) (8.5) (7.6)
81.7 85.7 140.7
(9.9) (8.3) (7.6)
78.0
(6.3)
71.1
(7.0)
85.5
(6.8)
105.7
(9.2)
99.9
(11.0)
111.3
(9.1)
Note: S.E. stands for the standard error. See Section 5 for detailed explanation how errors were computed.
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
progress, in many cases they are still outperformed by countries that show high performance in primary school. According to PIRLS results, in Norway, Iceland, Scotland, and Poland students perform relatively poorly in primary schools, but our result show that thanks to great achievement progress, they perform above average in secondary school. This also happens to some extent in the French-speaking part of Belgium and in France, although performance levels in PIRLS are closer to average. Spain and Slovenia are two countries that show below-average performance in primary school and relatively greater achievement progress, but not enough for students to perform above average in secondary school. New Zealand, Canada and Hong Kong-China show the greatest achievement progress from already-high performance levels in PIRLS. The high PISA rankings of these two countries and one economy can be attributed to good performance in primary schools and effective learning at the secondary level. To a lesser degree, the Flemish part of Belgium shows relatively greater achievement progress even as it shows one of the best performance levels in primary school. Additional results for Canadian provinces are based on the comparison between PIRLS 2006 and PISA 2009. Quebec is the province with the greatest achievement progress, while it also shows the lowest scores in primary school. At the secondary level, Quebec’s performance is similar to that of other provinces. Alberta shows the highest performance in both PISA and PIRLS. Nova Scotia shows the lowest performance in PISA, despite having the secondgreatest achievement progress due to the relatively poor performance in primary school. British Columbia shows the least achievement progress and was outperformed in PISA by Ontario, which shows slightly more progress. Despite these differences, all provinces considered in our study show similar levels of performance in secondary school and average or above-average achievement progress. A group of countries, including the United States, Israel and Singapore, together with Chinese Taipei, shows stable performance levels relative to others. The Netherlands, the Czech Republic, Greece and the Slovak Republic, which show achievement progress slightly below, but statistically similar to, the average, can also be added to this group. These countries differ, however, in their levels of performance. Israel is below average in both PIRLS and PISA, Greece is slightly below average, and the other countries are above average. The list of countries with relatively small achievement progress consists almost entirely of those that perform above average in primary school. However, these countries differ greatly in their performance levels and achievement progress. For example, Sweden shows relatively less achievement progress, but still outperforms most of the countries at the secondary level, while the Russian Federation was among the top performers in PIRLS, but is among the lowest-performing countries in PISA. Romania is one of the countries that shows a poor performance in primary school and even worse performance in secondary school. It is also worth noting that among countries with relatively small achievement progress, most have school systems that select and group students into different types of secondary schools, usually academic or vocational, at an early age. Countries like Germany, Hungary and Austria select students at the age of 10 or 11, while countries like Luxembourg, Bulgaria, the Russian Federation, Lithuania and Italy all track students into different types of school before the age of 15. Sweden and Denmark are the only countries with small achievement progress and no early selection of students. We explore these differences in more details in the following section. We also report results for groups of students. Additional results for boys and girls presented in Table 1 show interesting differences in achievement progress within countries. In nearly all countries
83
girls progress more than boys. The difference in achievement progress is closest to zero in England, Scotland, Canada (with Nova Scotia and Quebec having the smallest differences), Hong KongChina, Israel, Singapore, Romania, Denmark, the United States, the Netherlands, New Zealand and Belgium. Large differences in favour of girls are observed in Luxembourg, the Slovak Republic, Italy, Lithuania and Argentina. In these countries, the reading performance advantage of girls increases significantly between primary and secondary level. Both comparisons, between PIRLS 2001 and PISA 2000 and between PIRLS 2006 and PISA 2009, show similar differences in achievement progress between boys and girls, providing additional support for our methodology. Results from the two comparisons can be analysed together to see if the estimates of achievement progress provide a consistent picture between the age of 10 and 15. Obviously, these comparisons are based on data from different student cohorts and from different years. Some discrepancies between results are expected, as student performance might improve or decline over time due to policy changes or other factors. However, the results should be similar in most countries as both comparisons aim to measure achievement progress in similar ways and in most cases, school systems remain unchanged. In fact, for those countries with data available for both comparisons, the two comparisons give consistent estimates of achievement progress. The largest discrepancies are seen in the Russian Federation, where estimates based on PIRLS 2006 and PISA 2009 are much smaller than the estimates based on the earlier tests, and for Hong Kong-China, where the PIRLS 2001 and PISA 2000 comparison suggests greater achievement progress. Note, however, that the standard errors for these two countries are among the largest. Among all the countries considered, the linking error is the largest in Hong Kong-China, which suggests that in this case some PIRLS and PISA items might measure different aspects of achievement. Additional results provide analysis of how achievement gaps evolve over time across countries. Measures of age- and genderadjusted achievement progress at different percentiles help to evaluate how performance inequality evolves between primary and secondary education. Table 2 shows changes in the gap in reading performance (the difference between the 90th and 10th percentiles) between PIRLS 2006 and PISA 2009 and between PIRLS 2001 and PISA 2000. These differences capture the width of the distribution of student performance – in other words, they measure the performance gap between the highest- and the lowest-scoring students in each country. Changes in these differences show whether performance gaps widen or narrow relative to other countries. The gap in reading performance increased to large extent in countries like Qatar, Belgium, Luxembourg, France, Sweden, the Netherlands, Austria, Germany and Lithuania. In Qatar, Belgium, France and the Netherlands, this is due to greater achievement progress among the best-performing students compared to other countries. In the rest of the countries listed above, achievement progress was greater among the best-performing students when only students from the same country were compared. But when achievement progress was compared across countries, both lowand high-scoring students showed relatively little progress. Thus, not only did inequalities in educational outcomes increase in Luxembourg, Sweden, Austria, Germany and Lithuania, but achievement progress at all performance levels was small, especially among the lowest-achieving students. In several countries, the gap in reading performance narrowed between primary and secondary education. This is most evident in England, Poland, Romania and Scotland. However, only in Poland and Scotland do both poor and high performers show greater progress relative to students at similar performance levels in other
84
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
Table 2 Performance gap between 10th and 90th percentiles and its change over time (results presented on the PISA scale with a standard deviation of 100 across the OECD countries). Cnt/Prov/Econ
PIRLS 2001 to PISA 2000 PIRLS
PISA
Change
Argentina Austria Belgium (Flemish) Belgium (French) Bulgaria Canada Canada Alberta Canada British Columbia Canada Nova Scotia Canada Ontario Canada Quebec Czech Republic Denmark England France Germany Greece Hong Kong-China Hungary Iceland Indonesia Israel Italy Latvia Lithuania Luxembourg Macedonia Netherlands New Zealand Norway Poland Qatar Romania Russian Federation Scotland Singapore Slovak Republic Slovenia Spain Sweden Taipei Trinidad and Tobago United States
245.2
266.9
21.7
264.3 260.3
230.0
258.9 246.8
248.1
PIRLS 2006 to PISA 2009
5.4 13.5
245.1 277.4 253.2 224.2 252.1 248.4
12.0 39.5 4.1 13.3 12.8 9.5
295.3 249.9 229.8
270.6 242.7 260.0
24.7 7.1 30.2
282.5
229.8
52.7
309.1 246.9
277.1 267.4
32.1 20.5
274.1 232.9 290.2
255.9 245.0 263.9
18.2 12.1 26.2
233.4
252.9
19.5
287.6
272.6
15.0
7. The association between early tracking and achievement progress We use adjusted PIRLS and PISA data presented above to compare achievement progress in countries that select students into different tracks before the age of 15 (early tracking countries) and countries that keep students in the same schools till they are 15 or even 16 (late tracking countries). In both comparisons there are 19 early-tracking and 19 late-tracking countries or school systems (see Table A4 in the Appendix for details; note, however, that for PIRLS 2006/PISA 2009 comparison Canadian provinces are treated separately). Early-tracking and late-tracking countries differ in terms of average achievement, GDP per capita and
PISA
Change
222.9 205.0 221.6 261.5
257.4 249.2 277.9 268.2
34.5 44.2 56.3 6.8
237.9 246.9 254.3 248.2 229.5
248.4 253.0 241.0 240.9 243.3
10.5 6.1 13.3 7.2 13.8
228.2 273.0 233.5 221.3
227.4 252.0 282.7 254.9
0.9 21.0 49.1 33.7
221.3 227.0 232.6 196.5 296.2 238.0 208.4 206.7 215.5
230.6 243.1 250.8 189.4 285.9 251.5 218.8 233.4 269.0
9.3 16.1 18.2 7.0 10.2 13.5 10.4 26.7 53.5
200.7 276.0 226.5 252.9 201.6 250.9 231.9 264.6 249.0 239.8 239.4 238.3 215.1 221.5 265.3 242.5
240.8 266.1 241.9 234.3 273.5 232.7 237.7 246.8 257.8 235.7 242.5 237.7 258.6 233.4 281.6 257.5
40.0 9.8 15.5 18.6 71.9 18.1 5.8 17.8 8.8 4.0 3.1 0.6 43.4 11.9 16.3 15.0
18.0
257.1 237.9 249.0 237.5 239.3 257.8
countries, while within each country poor performers progress more. In England and Romania, the poorest-performing students remained at relatively similar levels compared to low-scoring students in other countries, but the best students show less achievement progress. Thus, while in Romania and England the reduction of the performance gap is due to less achievement progress among top performers, Poland and Scotland managed to narrow their achievement gaps while showing relatively great achievement progress among the best students.
PIRLS
spending per student (see Table A5 in the Appendix). Achievement in primary school is slightly higher among the early-tracking countries, although the difference is not very large (only 5 score points according to PIRLS 2001 and around 14 score points according to PIRLS 2006). This difference might be at least partly due to the fact that early tracking countries are richer on average (among countries participating in both PIRLS 2006 and PISA 2009 the average GDP per capita is 26,501 USD PPP in earlytracking vs. 23,445 in late-tracking countries). On the other hand, cumulative expenditure per student since the age of 6 up to the age of 15 is around 14% lower in early-tracking countries. Both groups of countries have, however, a very similar share (around 96%) of 15-year-olds enrolled in secondary schools. Thus, the comparison based on PISA data is not biased by higher or lower drop-out rate or different participation patterns in one of the groups. Participation in the primary education is nearly universal in the countries considered. Table 2 provides the main results regarding the association between tracking policies and the reading achievement progress. The data were obtained from the results in Table 1 by treating each country or school system as an independent observation. On average, students in tracking countries show lower achievement progress between primary and secondary education. The difference
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
is relatively large as it is around one third of the standard deviation on the PISA scale. The results are consistent across comparisons suggesting that the difference in achievement progress between early- and late-tracking countries is stable over time. The table also displays achievement progress among boys and girls. Although for both genders achievement progress is larger in non-tracking countries, the gap for boys is larger by 7–9 score points. This suggests that tracking might be more harmful for reading progress among boys. This is worrisome as in general boys have much lower reading skills than girls and lower progress of boys between primary and secondary education might increase this gap. This might be due to the fact that males are more often found in vocational education and thus are more affected by tracking policy. Among secondary school students in the 28 EU countries 56% of males are in vocational education comparing to only 45% of females (see Eurostat database, indicator tps00055; see Rodgers and Boyer, 2006, for the analysis of international data.) Additional results are presented in the bottom part of the table providing evidence on differences in achievement progress not only by gender but also between low- and high-performing students. Low performing students have lower achievement progress in early-tracking countries regardless of gender, but in absolute terms the negative impact of tracking is much more evident for boys. The achievement progress gap among low performing boys in early and late tracking countries is above 40 score points and is about 50% larger than the gap among the high-performing boys. This suggests that early selection might be harmful for low performing students, especially low performing boys. This is not surprising as low-achieving students are mostly among those who are tracked into vocational schools, while without tracking they remain in general schools. As we have already said, boys are also more often found in vocational schools. Thus, under tracking they have less opportunities to improve their general skills. In addition, the larger effect for males might be exacerbated by a negative peer effect related to having smaller share of females in the classroom. Many studies report that smaller share of girls in a classroom has a negative impact on achievement (see Sacerdote, 2011, for a review of these results). This might suggest that part of the effect of tracking is due to the lack of teaching of general skills in vocational schools but part of it might be due to negative peer effects, also those related to gender distribution across different tracks. It should be emphasized that these effects might be stronger for reading literacy as girls usually outperform boys in reading by a large margin. The benefit of using achievement progress measures instead of just comparing country achievement in secondary education is clear when one looks at performance in tracking and non-tracking countries. Clearly, there are many other differences between, for example, German and Polish systems that affect achievement levels. However, many of them similarly affect performance in both primary and secondary schools. For example, family living conditions or teacher’s regulations are more similar among German students at both educational levels than, for example, among primary school students in Germany and Poland. Our method of comparing achievement progress instead of achievement level excludes all factors that similarly affect student achievement in primary and secondary education. In other words, we construct a quasi-experiment where we compare achievement changes in one group of countries (early tracking countries) to achievement changes in another group of countries (late tracking countries). This method, very often called difference-in-differences method, helps to account for factors stable in each country when evaluating policies using cross-country comparisons. On the other hand, this analysis has important limitations, even after all the adjustments of PIRLS and PISA data to make them more comparable. First, the sample size of countries available for
85
Table 3 Estimates of achievement progress difference between early- and late-tracking countries (‘‘early’’ minus ‘‘late’’). PIRLS 2001 and PISA 2000
PIRLS 2006 and PISA 2009
All students
Boys
Girls
All students
Boys
Girls
38.61 (1.61)
41.17 (1.81)
35.86 (1.82)
30.60 (0.94)
34.83 (1.08)
25.88 (1.06)
PIRLS 2006 and PISA 2009 Achievement progress difference among low-performers
Achievement progress difference among high-performers
All students
Boys
Girls
All students
Boys
Girls
37.04
42.04
36.24
23.65
27.06
20.07
Note: Results presented on the PISA scale with a standard deviation of 100 across the OECD countries. Results calculated from country estimates. Standard errors provided in parentheses. Low- and high-performing students are those at the 10th and the 90th percentile of achievement distribution, respectively. Results for PIRLS 2001 and PISA 2000 provide qualitatively the same results and are available upon request from the authors.
analysis is very small. This limitation is true, however, for any kind of cross-country comparison of system level policies. Second, while difference-in-differences method helps to exclude factors similarly affecting results in primary and secondary education, there are remaining differences between countries in factors or policies that differently affect achievement of primary and secondary school students. For example, the results presented in Table 3 suggest that even among the high-performing students achievement progress is lower in early tracking countries. As these students are usually in academic or comprehensive tracks in both groups of countries, this might signal that there are other factors affecting difference in achievement progress between countries with early and late selection. It might be also true, however, that students in early tracking countries that are selected to academic tracks have, for example, smaller incentives to learn comparing to students in the other group of countries who are still before the selection at the age of 15 or 16. Overall, although in our view the above analysis is useful and gives the most reliable insights one can have based on international comparisons of student achievement data, the above limitations should be taken into account and further research is needed to disentangle effects of different policies on student achievement progress. 8. Conclusions This paper provides internationally comparable results on reading achievement progress between the ages of 10 and 15. The estimates were obtained by comparing individual results from PIRLS 2001 or PIRLS 2006 for 10-year-old students, to results from PISA 2000 or PISA 2009 for 15-year-old students. PIRLS and PISA are two international surveys of student achievement that, while different in some assumptions, still provide comparable assessments of reading. We adjust our estimates of achievement progress for differences in student age and we balance gender distribution by equally weighting boys and girls. We believe age must be taken into account to make valid comparisons because of the large differences in student ages found in PIRLS; gender imbalance could also bias the results because of large differences in reading achievement between boys and girls. Finally, our estimates of standard errors include a linking error that is obtained via simulation with random draws of items taken to see if different sets of items produce different results. Thus, our results provide adjusted estimates of achievement progress that also account for a linking error. We provide results for all students as well as for subpopulations defined by gender. We also provide results for
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
86
changes in performance gaps between the lowest- and highestperforming students. The results show remarkable consistency between the PIRLS 2001 to PISA 2000 comparison and the PIRLS 2006 to PISA 2009 comparison. In most cases, countries that saw great achievement progress in one comparison have similar results in the other. For example, Iceland, New Zealand and Norway show relatively greater progress than other countries in both comparisons. Students in Bulgaria progress less than those in other countries, while students in the United States maintain their standing in relative terms according to both comparisons. Across all countries, gender differences in progress estimates are similar in both comparisons. Anyhow, large cross-country variation in the primary to secondary school achievement progress suggests that countries should analyse these estimates carefully and relate them to policies that could be responsible for more limited development of their students or a relatively larger increase in performance or gender gaps. The paper provides an example of the usefulness of reliable achievement progress comparisons for policy analysis. Analysis of
differences in reading achievement progress across early and late tracking countries reveals not only that average achievement progress seems to be lower in early tracking countries, but also that the difference is mainly due to very low achievement progress among low-performing students, especially boys. The analysis suggests that while an early tracking policy might not be harmful for the best students, it seriously lowers achievement progress of the weakest students. This is something policy makers should consider when thinking about policies increasing educational opportunities for all students. Acknowledgment This research was supported by a financial grant from the Polish Ministry of Science and Higher Education.
Appendix See Tables A1–A6
Table A1 Comparison of test content and framework assumptions for PIRLS and PISA.
Definition of reading
Reading texts Content of the assessment
PIRLS (Mullis et al., 2006)
PISA (OECD, 2006)
Reading literacy is defined as the ability to understand and use those written language forms required by society and/or valued by the individual Continuous texts
The capacity to understand, use, reflect on and engage with written texts, in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society. Continuous texts and non-continuous texts, including graphs, forms and lists Multiple-choice (55%) and constructed response question (45%) dichotomous and partial credit scoring
Multiple-choice (50%) and constructed response question (50%), dichotomous and partial credit scoring (1) Focus on and retrieve explicitly stated information (2a) Make straightforward inferences (2b) Interpret and integrate ideas and information (3) Examine and evaluate content, language, and textual elements Reading for literary experience (narrative fictions); reading to acquire and use information
Cognitive processes
Contexts, purposes and text type
(1) Retrieving information (2) Interpreting texts (3) Reflecting on and evaluating texts
Reading for private use, for public use, for work, and for education
Table A2 Numbers of items across 500 replications.
PISA 2009 PIRLS 2006 PISA 2000 PIRLS 2001
Total item pool
Mean number of items sampled across replications
Standard deviation of the number of items sampled across replications
Minimum number of items sampled across replications
Maximum number of items sampled across replications
131 125 129 98
65.71 62.48 64.11 46.92
6.02 5.49 6.00 4.94
49 44 44 36
83 82 87 32
Table A3 Unadjusted reading achievement progress (PISA scale, standard deviation = 100). PIRLS 2001 to PISA 2000 All students
Argentina Austria Belgium (Flemish) Belgium (French) Bulgaria Canada Canada Alberta Canada British Columbia Canada Nova Scotia
Boys
Progress
S.E.
Progress
128.0
11.3
112.6
24.1 129.8
PIRLS 2006 to PISA 2009
7.9 4.9
16.0 124.6
Girls S.E. 9.9
8.3 4.9
All students
Progress
S.E.
138.4
13.2
34.2 134.7
9.4 5.7
Boys
Girls
Progress
S.E.
Progress
S.E.
Progress
S.E.
73.8 106.0 137.2 32.0
5.4 5.1 5.9 8.8
64.9 101.2 131.5 20.8
6.4 5.5 7.4 9.9
82.0 111.2 143.3 44.6
6.3 6.2 6.4 8.7
104.1 100.4 109.4
6.2 5.8 5.4
96.8 92.3 110.5
6.8 7.0 7.0
111.0 109.3 108.2
7.1 6.4 6.3
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
87
Table A3 (Continued ) PIRLS 2001 to PISA 2000 All students Progress Canada Ontario Canada Quebec Czech Republic Denmark England France Germany Greece Hong Kong-China Hungary Iceland Indonesia Israel Italy Latvia Lithuania Luxembourg Macedonia Netherlands New Zealand Norway Poland Qatar Romania Russian Federation Scotland Singapore Slovak Republic Slovenia Spain Sweden Taipei Trinidad and Tobago United States
96.5
PIRLS 2006 to PISA 2009
Boys S.E.
5.9
Girls
Progress
86.1
S.E.
7.0
Progress
All students S.E.
105.4
Boys
Girls
Progress
S.E.
Progress
S.E.
Progress
S.E.
107.4 127.1
5.9 5.8
101.7 123.4
6.6 7.0
113.1 130.3
6.8 6.4
84.4 92.2 114.6 86.6
5.1 5.0 5.4 5.4
82.8 94.9 105.5 76.6
6.0 6.0 6.2 6.3
86.3 89.0 122.5 97.1
5.9 5.8 5.9 6.0
99.7 79.7 130.5 133.3 101.9 72.6 83.5 75.0 53.4
6.6 6.0 5.3 7.6 5.6 4.7 5.2 5.1 3.9
94.9 68.8 124.0 126.9 93.0 59.3 78.9 63.4 41.0
6.8 6.6 6.0 7.9 6.7 5.4 5.9 5.7 4.6
105.5 90.9 136.6 139.2 109.8 86.6 86.6 86.7 66.1
7.4 7.0 5.8 7.9 6.5 5.4 6.2 5.7 4.5
97.5 125.2 146.8 121.7 155.7 88.3 34.2 112.3 99.7 87.8 102.7 112.1 85.3 97.8 131.4 97.2
7.0 4.6 5.7 5.1 6.1 7.4 6.5 5.4 5.3 5.3 5.5 4.7 5.1 6.3 6.3 6.2
93.6 121.6 139.2 112.5 149.2 78.4 26.2 116.6 98.3 74.2 92.1 104.3 79.4 91.5 123.7 96.0
7.5 5.6 6.3 5.6 6.8 8.3 6.8 6.5 5.8 6.4 6.3 5.1 5.5 6.9 7.2 7.0
101.4 129.0 154.9 131.5 162.6 97.1 42.0 108.3 100.7 100.7 113.3 120.1 90.7 103.8 138.4 98.8
7.3 5.2 6.5 5.8 5.9 8.0 7.2 6.5 5.8 5.6 5.9 5.3 6.0 6.9 6.8 6.6
7.0
117.1 83.7 90.0 139.5 80.1 131.9
6.4 5.4 8.5 8.9 7.2 5.6
109.2 75.5 80.9 142.9 74.1 122.1
7.1 6.2 9.1 9.7 8.3 6.1
124.0 92.0 99.1 136.1 86.6 141.4
7.1 6.1 9.3 9.6 7.6 6.3
94.2 84.7 59.2
9.7 5.8 7.2
98.6 71.6 47.1
12.0 7.4 7.5
88.1 97.9 69.1
10.2 6.8 8.3
84.8
6.6
73.8
7.2
96.3
7.3
142.7 141.3
6.5 6.4
135.8 130.7
7.8 7.4
149.3 152.1
8.0 7.0
67.5 77.4 137.8
7.7 7.9 6.4
66.3 65.8 134.3
8.6 8.4 7.5
68.2 88.7 141.6
8.4 8.3 7.4
82.9
6.2
76.5
6.9
89.4
6.8
107.0
9.2
101.0
11.1
112.4
9.1
Note: S.E. stands for the standard error. See Section 5 for detailed explanation how errors were computed.
Table A4 Early- and late-tracking countries. Early-tracking countries
Austria, Belgium (Flemish), Belgium (French), Bulgaria, Czech Republic, Germany, Greece, Hungary, Italy, Lithuania, Luxembourg, Macedonia, Netherlands, Romania, Russian Federation, Singapore, Slovak Republic, Slovenia, Trinidad and Tobago Argentina, Canada (with all provinces), Denmark, England, France, Hong Kong, Iceland, Indonesia, Israel, Latvia, New Zealand, Norway, Poland, Qatar, Scotland, Spain, Sweden, Taipei, United States
Late-tracking countries
Source: Table IV.3.2a, OECD, 2010; Eurydice.
Table A5 Mean scores and drop-out rates across early- and late-tracking countries. Results presented on the PISA scale with a mean of 500 for PISA and a mean of 400 for PIRLS and a standard deviation of 100 across the OECD countries. Average score in PIRLS (re-scaled and age-adjusted)
Early-tracking countries Late-tracking countries
Average score in PISA (re-scaled and age-adjusted)
2001
2006
2000
2009
387.04 382.31
400.71 384.53
463.15 497.03
478.35 492.77
School enrolment share at the age of 15 (PISA 2009) (%)
GDP per capita (in equivalent USD converted using PPPs)
Cumulative expenditure per student aged 6–15 (in equivalent USD converted using PPPs)
96.6 96.2
26,501 23,445
57,084 66,031
Source: Average scores are based on own calculations from the PIRLS and PISA datasets. Enrolment shares were calculated from the data presented in Table 11.1 in the PISA 2009 technical report (OECD, 2012). GDP per capita and cumulative expenditure per student were taken from Table IV.3.21 in the PISA 2009 report (OECD, 2010).
M. Jakubowski, A. Pokropek / International Journal of Educational Development 45 (2015) 77–88
88
Table A6 Coverage rates for PIRLS 2006 and PISA 2009. PIRLS 2006
PISA 2009
Country
Coverage rate (%)
Country
Participation rate (%)
Austria Belgium (Flemish) Belgium (French) Bulgaria Canada Alberta Canada British Columbia Canada Nova Scotia Canada Ontario Canada Quebec Germany Denmark England France Hong-Kong Hungary Indonesia Iceland Israel Italy Lithuania Luxembourg Latvia Netherlands Norway New Zealand Poland Qatar Romania Russia Scotland Singapore Spain Slovakia Slovenia Sweden Taipei Trinidad and Tobago
97 91 95 94 96 94 96 87 81 92 96 92 95 97 97 98 90 93 97 92 99 92 90 71 95 95 94 97 97 81 95 97 94 93 96 99 94
Austria Belgium
87 94
Bulgaria Canada
72 84
90 86 87 90 89 87 53 93 84 86 78 87 81 92 91 87 93 89 99 77 86 94 89 95 92 93 90 78
United States
82
Germany Denmark England France Hong-Kong Hungary Indonesia Iceland Israel Italy Lithuania Luxembourg Latvia Netherlands Norway New Zealand Poland Qatar Romania Russia Scotland Singapore Spain Slovakia Slovenia Sweden Taipei Trinidad and Tobago United States
82
Source: Marc Jonas, ‘‘PIRLS 2006 Sampling Weights and Participation Rates’’, chapter 9 in PIRLS 2006 Technical Report; Coverage Index 3, Table 11.1, PISA 2009 technical report (OECD, 2012).
References Ammermueller, A., 2005. Educational Opportunities and the Role of Institutions ZEW Discussion Paper No. 05-44. Ansalone, G., 2010. Tracking: educational differentiation or defective strategy. Educ. Res. Q. 34 (2), 3–17. Brown, G., Micklewright, J., Schnepf, S., Waldmann, R., 2007. International surveys of educational achievement: how robust are the findings? J. R. Stat. Soc. (Ser. A) 170 (3), 623–646. Brunello, G., Checchi, D., 2007. Does school tracking affect equality of opportunity? New international evidence. Econ. Policy 22 (52), 782–861. Card, D., Payne, A., 2002. School finance reform, the distribution of school spending, and the distribution of student test scores. J. Public Econ. 83 (1), 49–82. Carnoy, M., Rothstein, R., 2013. What Do International Tests Really Show About U.S. Student Performance? Economic Policy Institute. Retrieved from: www. epi.org/publication/us-student-performance-testing/ DeMars, C., 2010. Item Response Theory. Oxford University Press, New York. Efron, B., 1982. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia, PA.
Fuchs, T., Wo¨ßmann, L., 2008. What accounts for international differences in student performance? A re-examination using PISA data. Empir. Econ. 32 (2–3), 433–464. Grek, S., 2009. Governing by numbers: the PISA ‘effect’ in Europe. J. Educ. Policy 24 (1), 23–37. Grisay, A., Gonzalez, E., Monseur, C., 2009. Equivalence of item difficulties across national versions of the PIRLS and PISA reading assessments. In: von Davier, M., Hastedt, D. (Eds.), IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, vol. 2. . Haberman, S., Lee, Y., Qian, J., 2009. Jackknifing techniques for evaluation of equating accuracy ETS Research Report No. RR-09-39, Princeton, NJ. Hambleton, R.K., Swaminathan, H., Rogers, H.J., 1991. Fundamentals of Item Response Theory. SAGE Publications, Thousand Oaks. Hanushek, E., Wo¨ßmann, L., 2011. Sample selectivity and the validity of international student achievement tests in economic research. Econ. Lett. 110 (2), 79–82. Hanushek, E., Wo¨ßmann, L., 2006. Does educational tracking affect performance and inequality? Differences-in-differences evidence across countries. Econ. J. R. Econ. Soc. 116 (510), C63–C76, 03. Hoxby, C., 2000. Peer Effects in the Classroom: Learning from Gender and Race Variation. NBER Working Papers 7867National Bureau of Economic Research, Inc.. Ingels, S.J., Dowd, K.L., Baldridge, J.D., Stipe, J.L., Bartot, V.H., Frankel, M.R., Owings, J., Quinn, P., 1994. National Education Longitudinal Study of 1988. Second Follow-Up: Student Component Data File User’s Manual. National Center for Education Statistics, Washington, DC. Jakubowski, M., 2010. Institutional tracking and achievement growth. Exploring difference-in-differences approach to PIRLS TIMSS and PISA data. In: Dronkers, J. (Ed.), Quality and Inequality of Education. Cross-National Perspectives. Springer, Dordrecht/London. Martin, M., Mullis, I., Kennedy, A., 2003. PIRLS 2001 Technical Report. Boston College, Chestnut Hill, MA. Meier, V., Schu¨tz, G., 2007. The Economics of Tracking and Non-Tracking Ifo Working Paper No. 50, Munich. Micklewright, J., Schnepf, S., Skinner, C., 2012. Non-response biases in surveys of schoolchildren: the case of the English Programme for International Student Assessment (PISA) samples. J. R. Stat. Soc. Ser. A R. Stat. Soc. 175 (4), 915–938. Mullis, I., Kennedy, A., Martin, M., Sainsbury, M., 2006. PIRLS Assessment Framework and Specifications, 2nd ed. TIMSS & PIRLS International Study Center, Boston College, Chestnut Hill, MA. Oakes, J., 2005. Keeping Track. How Schools Structure Inequality, 2nd ed. University Press, Yale. OECD, 2006. Assessing Scientific, Reading and Mathematical Literacy. OECD, Paris. OECD, 2009. PISA 2006 Technical Report. OECD, Paris. OECD, 2010. PISA 2009 Results: What Makes a School Successful? Resources, Policies and Practices. vol. IV. OECD, Paris. OECD, 2012. PISA 2009 Technical Report. OECD, Paris. Rijmen, F., Manalo, J., von Davier, A., 2009. Asymptotic and sampling-based standards errors for two population invariance measures in the linear equating case. Appl. Psychol. Meas. 33 (3), 222–237. Rindermann, H., 2007. The g-factor of international cognitive ability comparisons: the homogeneity of results in PISA, TIMSS, PIRLS and IQ-tests across nations. Eur. J. Pers. 21 (5), 667–706. Rodgers, Y., Boyer, T., 2006. Gender and racial differences in vocational education: an international perspective. Int. J. Manpow. 27 (4), 308–320. Sacerdote, B., 2011. Peer effects in education: how might they work, how big are they and how much do we know thus far? Handb. Econ. Educ. 3, 249–277. Tarozzi, A., 2007. Calculating comparable statistics from incomparable surveys, with an application to poverty in India. J. Bus. Econ. Stat. 25 (3), 314–336. von Davier, M., Gonzalez, E., Kirsch, I., Yamamoto, K., 2013. The Role of International Large-Scale Assessments: Perspectives from Technology, Economy, and Educational Research. Springer, Dordrecht. Von Davier, M., Gonzalez, E., Mislevy, R., 2009. What are plausible values and why are they useful. In: von Davier, M., Hastedt, D. (Eds.), IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, vol. 2. . Waldinger, F., 2006. Does Tracking Affect the Importance of Family Background on Students’ Test Scores? London School of Economics, Mimeo. Wilkinson, I., Hattie, J., Parr, J., Townsend, M., Fung, I., Ussher, C., Thrupp, M., Lauder, H., Robinson, T., 2010. Influence of Peer Effects on Learning Outcomes: A Review of the Literature. Auckland UniServices Limited, The University of Auckland. http://eric.ed.gov/?id=ED478708 (retrieved February 2015). Wu, M., 2005. The role of plausible values in large-scale surveys. Stud. Educ. Eval. 31 (2), 114–128. Wu, M., 2010. Measurement sampling, and equating errors in large-scale assessments. Educ. Meas.: Issues Pract. 29 (4), 15–27.