Hierarchical Linear Models
D Rindskopf, CUNY Graduate Center, New York, NY, USA
© 2010 Elsevier Ltd. All rights reserved.
In many educational contexts, data occur in a nested structure. For example, we may measure the achievement level of a child in a particular subject matter. That child is located (nested) within a particular class of students, and we might expect that the child’s performance would have been different had she been in a different class of students. Further, the child’s class may be one of many within a particular school, and that school may be one of many within a school system, which is located within a larger educational or political unit (e.g., county, state, or region). Each level of the structure may have its own characteristics that influence the achievement of the child. Hierarchical linear models (and more general models, to be discussed later) are designed to properly model influences on outcome variables at all levels of such a hierarchy, including the proper handling of data dependencies created by the nested structure.

In this article, we describe a variety of models that are designed to accommodate a nested data structure. Some models are linear, but others are nonlinear; some involve nesting of people within higher-level units (such as classes), while other models are applicable to repeated measures that are nested within individuals (including longitudinal studies). We begin with a very simple example, and then show how the principles involved can be used to expand the scope and type of models that may be used. These principles are all expansions or generalizations of those used in ordinary least squares (OLS) regression; often, the models can be viewed as a series of regressions for different levels of the data structure. Those unfamiliar with regression should read the article on univariate linear regression in this volume before reading the rest of this one.

Although we will not discuss it in detail, these methods also resolve the problem known as aggregation bias. It is well known that correlations among variables at a higher level (e.g., measures on states) do not equal (and may be opposite in sign to) correlations among variables at the individual level. (A related problem that arises in categorical data analysis is known as Simpson’s paradox.) Multilevel models are constructed so that inferences can be made at the right level, whichever level that happens to be for the problem at hand.

We discuss two examples of multilevel models. The first example is for a situation in which students are nested within colleges, and we wish to predict academic performance from one student-level and one college-level variable. The second example considers longitudinal data,
where a series of observations is made on each person over a period of time. Here, the nesting is of a different sort: Observations (over time) are nested within individuals.
Example 1: Two Levels, One Predictor at Each Level, Linear Model

For a first example, assume that the administrations of 10 colleges in a particular country want to see how well scores on an academic-achievement test given to all students in the last year of secondary school predict performance of students in their first year of college. If we consider each college separately, we may imagine that within each college, researchers fit a regression model to the data; the equation for school j would then be:

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}$$
In this equation, Y represents the grade of a student after the first year of college, X represents the score of the student on the secondary school-achievement test, i is used to refer to a specific student, and j to refer to a specific school. The coefficients are as in a linear regression model, except that each school has its own intercept ($\beta_{0j}$) and slope ($\beta_{1j}$). As we shall be examining the intercept and slope of each school, and modeling them, it is important to be sure that these are quantities of interest. For the slope, there is ordinarily no issue (except perhaps multiplication or division by a constant in order to make a one-unit change more easily interpreted).

In order to make the intercept meaningful, we will often have to transform X. One common strategy is centering, which means to create a new variable by subtracting the mean of X from each observed value of X. (For this article, we assume that the mean is computed for the whole data set, not for each college individually; this is called grand mean centering.) In any regression, the intercept represents the predicted value of Y when all predictors have a value of 0; therefore centering X will cause the intercept in each college’s regression equation to equal the predicted value of Y for a student who is average on X (based on students from all included colleges). Using the exact sample mean is not necessary; sometimes, it is more important to use a round number for ease of interpretation. For example, if the mean is 497, it might be easier to subtract 500. Or, if a score of 600 is typically required for admissions, one might subtract 600 from all X values instead.
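As a concrete illustration, here is a minimal Python sketch of grand-mean centering; the data frame, column names, and values are hypothetical and serve only to show the transformation.

```python
import pandas as pd

# Hypothetical records: one row per student, with a college identifier,
# the secondary-school test score X, and the first-year college grade Y.
students = pd.DataFrame({
    "college": [1, 1, 2, 2, 3, 3],
    "X": [512, 478, 530, 495, 601, 488],
    "Y": [2.9, 2.4, 3.1, 2.7, 3.8, 2.5],
})

# Grand-mean centering: subtract the overall mean of X (across all colleges),
# so that the intercept of a within-college regression on X_c is the predicted
# grade for a student who is average on X.
students["X_c"] = students["X"] - students["X"].mean()

# Alternatively, subtract a convenient round number (e.g., 500), as noted above.
students["X_500"] = students["X"] - 500
```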
In addition to subtracting some value from X, one might also want to control the scaling further. In the United States, for example, where Scholastic Aptitude Test (SAT) scores have a mean near 500 and a standard deviation near 100, one might divide by 100 so that a one-unit change in the transformed variable is more meaningful than a one-unit change on the original scale. This makes the slope more easily interpreted (and sometimes improves numerical stability). From this point on, we will assume that X has been suitably scaled.

The term $r_{ij}$ is a residual, indicating that the students’ scores on the secondary school test will typically not predict their grades in college perfectly. The variance of the residuals can differ from college to college, but often modeling is begun by assuming that it is the same in each; we will denote this within-college residual variance by $\sigma^2$, which is why we do not use a subscript for college (j).

The intercepts and slopes will vary from college to college; some of this variation is just sampling variability due to using a sample of students within each college, but the other variability is due to true differences among colleges. The true variability in intercepts will be denoted by $\tau_{00}$, and the true variability in slopes will be denoted by $\tau_{11}$. (There is also a covariance between the intercepts and slopes, denoted $\tau_{01}$, but this is seldom of much interest.) If $\tau_{00}$ and $\tau_{11}$ are small, then the intercepts and slopes do not vary much among colleges, and one prediction equation will work equally well in each. The colleges will gain by knowing this, because now their prediction equation is known with greater precision due to the contribution of information from other colleges. Of course, unless the colleges are very similar, it is unlikely that both $\tau_{00}$ and $\tau_{11}$ will be small, which leads to a consideration of two issues: First, can we explain, based on characteristics of the colleges, why some have larger and others smaller intercepts and/or slopes? Second, even though the colleges differ, can we still use information from other colleges to help improve prediction in each college?

We may believe that characteristics of the colleges will have some relationship to either the intercept or the slope of their prediction equations. For example, it may be that schools differ in the difficulty of their courses, or in their grading standards. These may affect the intercepts or slopes. One might use the college’s average value on X as a proxy for selectivity or difficulty; this could be a predictor for intercepts and/or slopes. Another possible measure would be the selection ratio, that is, the proportion of applicants who are admitted to the college. We will now write the model for variability among colleges out more formally. The equations, which will first be presented without predictors, are:

$$\beta_{0j} = \gamma_{00} + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + u_{1j}$$
The first equation says that the intercept in a particular college (j) is equal to some overall average plus a deviation of that college’s intercept from the overall average. The second equation expresses the same idea for the slopes from each college. The variances of the residuals ($u_{0j}$ in the first equation and $u_{1j}$ in the second) are the values $\tau_{00}$ and $\tau_{11}$ mentioned previously. In this instance, they are called unconditional residual variances, because they are not conditioned (dependent) on any college-level predictors. When we analyze the data, we will obtain estimates of both the mean intercept and slope (called fixed effects in most software) and the variances of the residuals (called random effects in most software).

Now suppose that the residual variances are large enough that we believe it worthwhile to try to account for some of their variation by using a college-level variable such as the proportion of applicants admitted. In realistic applications, there are often many potential variables, and the variables that help predict differences in intercepts may not be the same variables that help predict differences in slopes. In this simple introduction, however, we will use only one such variable, which we will call W; individual values of W are denoted $W_j$, with the subscript j indicating that the value of W will vary across colleges. The extended model at the college level will now consist of the following equations:

$$\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + \gamma_{11} W_j + u_{1j}$$
The variances $\tau_{00}$ and $\tau_{11}$ of the residual terms $u_{0j}$ and $u_{1j}$ now represent conditional variances; that is, the variances remaining after accounting for variation in intercepts and slopes using information provided by the predictor W. These variances should now be smaller than in the unconditional model if W proves to be a good predictor, and measures analogous to R-square in the usual regression context may be calculated.

Multilevel models are very much like ordinary regression models in many ways: the outcome variable may be continuous or categorical. If continuous, it may be normally distributed or have some other distribution (e.g., exponential for survival models). If the dependent variable is categorical, models similar to logistic regression (for binary outcomes) or its extensions to polytomous or ordinal outcomes are possible. The right-hand side of any of the equations may require polynomial terms (e.g., the square and cube) for continuous predictors, dummy variables (or other coding methods) for categorical predictors, and product terms for interactions. The model may be intrinsically nonlinear, so that more complex models must be fit. All of these are possible, but outside the scope of this introductory article.
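To make the two levels concrete, the following sketch simulates data from this model and fits it with the MixedLM routine in Python's statsmodels package (one of several programs that can fit such models). The names, sample sizes, and parameter values are invented for illustration; the point is only the correspondence between the equations above and a fitted mixed model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_colleges, n_students = 10, 100

# College-level predictor W (e.g., a standardized selection ratio) and the
# level-2 residuals u_0j, u_1j with variances tau_00 and tau_11.
W = rng.normal(size=n_colleges)
u0 = rng.normal(scale=0.30, size=n_colleges)
u1 = rng.normal(scale=0.10, size=n_colleges)

rows = []
for j in range(n_colleges):
    b0 = 2.8 + 0.20 * W[j] + u0[j]        # beta_0j = gamma_00 + gamma_01*W_j + u_0j
    b1 = 0.50 + 0.15 * W[j] + u1[j]       # beta_1j = gamma_10 + gamma_11*W_j + u_1j
    X = rng.normal(size=n_students)       # centered, scaled admission-test scores
    Y = b0 + b1 * X + rng.normal(scale=0.40, size=n_students)   # r_ij, variance sigma^2
    rows.append(pd.DataFrame({"college": j, "X": X, "W": W[j], "Y": Y}))
data = pd.concat(rows, ignore_index=True)

# Substituting the level-2 equations into the level-1 equation gives one model:
# Y = gamma_00 + gamma_01*W + gamma_10*X + gamma_11*W*X + u_0j + u_1j*X + r_ij,
# i.e., fixed effects for X, W, and their product, plus a random intercept and
# random slope for X within each college.
fit = smf.mixedlm("Y ~ X * W", data, groups=data["college"], re_formula="~X").fit()
print(fit.summary())   # fixed effects (the gammas)
print(fit.cov_re)      # estimated tau_00, tau_01, tau_11
print(fit.scale)       # estimated sigma^2
```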
Example 2: Repeated Measures and Longitudinal Data

In many educational applications, people are measured more than one time. In some cases, these measurements extend over a long period of time; measurements may be daily, weekly, monthly, or yearly, for example. In other cases, the time sequence is not of great importance; for example, most models for test scores do not take into account that some items are administered before others. In either case, we can use the concept of nesting to reconceptualize the model for this type of data: we view observations as being nested within an individual. (In some instances, it might seem more natural to view time of observation as crossed with individuals, but this viewpoint is not usually considered in the multilevel literature, because doing so would reduce the utility of the model in many common situations, as we hope will become apparent shortly.)

As an example, suppose that a researcher wants to examine the growth of children’s ability to sight-read words. Each month during the school year, she has every child in a class attempt to read a list of 20 words, and counts the number of words that the child can correctly read. We will treat the number of words read as if it were a continuous outcome variable to simplify the discussion in this section, and will later deal with the complexities that this type of outcome necessitates. The researcher wants to model the number of words read by each child as a function of the number of months into the school year. The time variable may be coded in a number of ways; one might be to start the variable Months at 0 for the first measurement period, and increase by one for each month. Another approach would be to use 0 for the last period of measurement, and number preceding periods as –1, –2, etc. Each possible choice determines the meaning of the intercept in the model. Again, for simplicity we will use a linear model, but realize that a straight line is unlikely to be a perfect description of such growth; we write the equation for each student as follows:

$$Y_{ij} = \pi_{0j} + \pi_{1j} M_{ij} + e_{ij}$$
where the subscript j is used to represent the individual student, i is used to represent a measurement occasion, $Y_{ij}$ is the number of words read correctly by student j at occasion i, $M_{ij}$ is the number of months (on whatever scale is chosen; here we will assume 0 is used for the first time of measurement) for measurement of student j at occasion i, and $e_{ij}$ is the residual. The intercept $\pi_{0j}$ and slope $\pi_{1j}$ are different for each student, and again we will allow these to be predicted by variables at the higher level, which in this case is the individual student. For example, boys and girls may start at different levels, and they may progress at different rates. We may represent this by the following level-2 equations, in which F is equal to one for females and zero for males:

$$\pi_{0j} = \beta_{00} + \beta_{01} F_j + r_{0j}$$
$$\pi_{1j} = \beta_{10} + \beta_{11} F_j + r_{1j}$$
Of course the structure of all of these equations is exactly the same as the case in the first example where students were nested within colleges; only the context has changed. In these equations, the intercepts $\beta_{00}$ and $\beta_{10}$ represent the beginning number of words read ($\pi_0$) and the rate of change in words per month ($\pi_1$) for an average male (i.e., F = 0) in the study. The slopes in these equations, $\beta_{01}$ and $\beta_{11}$, represent the difference in beginning number of words read and rate of change for females as compared to males. Again, the choice of coding makes a difference; gender could be coded in other ways, which would change the interpretation of each of the coefficients in these equations. The residual terms represent the amount by which the true values of $\pi_{0j}$ and $\pi_{1j}$ differ from that predicted for the average male or female.

This method of representing the model has several advantages over other methods. Most importantly, from a conceptual point of view, this is a model for each individual’s growth pattern and for deviations of individuals from the average pattern. The focus is on the individual, rather than a group. As a consequence of this focus, in principle (and at present, in practice also) missing data are not a problem, as long as the cause for missingness is not related to the outcome variable. Each child could have a different pattern of months during which he or she was measured, and in fact the times of measurement do not even have to follow a common pattern. A child could be measured at months 1.25, 3.75, and so on, rather than exactly at the beginning of each month – the model remains the same. (Of course, in practice one must measure each individual a reasonable number of times, and with those times spaced in a reasonable way, in order to have accurate estimates of that child’s growth pattern, but this is another matter.)
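Again as an illustrative sketch (with invented parameters, sample sizes, and variable names), the same statsmodels routine can fit this growth model, with observations nested within children and gender predicting both the intercept and the slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_children, months = 30, np.arange(9)    # 9 monthly occasions, coded 0-8

rows = []
for j in range(n_children):
    female = j % 2                                    # F_j: 1 = female, 0 = male
    pi0 = 4.0 + 1.0 * female + rng.normal(scale=1.0)  # pi_0j = beta_00 + beta_01*F_j + r_0j
    pi1 = 1.5 + 0.3 * female + rng.normal(scale=0.2)  # pi_1j = beta_10 + beta_11*F_j + r_1j
    words = pi0 + pi1 * months + rng.normal(scale=1.0, size=months.size)  # adds e_ij
    rows.append(pd.DataFrame({"child": j, "month": months,
                              "female": female, "words": words}))
data = pd.concat(rows, ignore_index=True)

# Random intercept and random slope for month within each child; the fixed
# effects of female and month:female estimate beta_01 and beta_11.
fit = smf.mixedlm("words ~ month * female", data,
                  groups=data["child"], re_formula="~month").fit()
print(fit.summary())
```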
Generalized Linear and Nonlinear Models

Two general categories of complications frequently arise with multilevel models. As mentioned previously, not all dependent variables are continuous, and not all continuous variables have a normal distribution. In addition, not all relationships are linear, so nonlinear models must sometimes be considered. This section briefly describes some approaches to each issue.

In the second extended example described, the outcome was the number of words read correctly from a list of 20 words presented to the student. Equivalently, one might calculate the percentage or proportion of the list that was read correctly. If proportions were not extreme (e.g., if most were between 0.2 and 0.8), then using these in a model with a normal distribution might be a good
approximation to the correct analysis. A more accurate approach would be to consider each word as a trial, and the number correct as coming from a binomial distribution with an underlying probability of success for each trial that is constant across words. (An even more accurate model would allow words to have varying difficulties, and possibly to differ in other ways also, as in item-response theory. Often the simplifying assumption of equal difficulty is a good enough approximation to work well in practice.)

The binomial model is closely related to a more widely known model, logistic regression. For these models, the outcome that is modeled is usually the logarithm of the odds of a correct response, commonly called the logit of the probability of a correct response. The right-hand side of the level-1 equation for individual responding still has a linear form in these models; the left-hand side is the transformed value ln(o/(1 − o)), also denoted logit(o) in some sources, where o is the probability of a student responding correctly. The model for each individual is written in the following form:

$$\ln\left(\frac{o_{ij}}{1 - o_{ij}}\right) = \pi_{0j} + \pi_{1j} M_{ij}$$
where $o_{ij}$ is the expected (not observed) probability of person j answering correctly during month i. The residual is not explicitly listed in this equation; instead, we state that $Y_{ij} \sim \mathrm{Binomial}(20, o_{ij})$; that is, the observed number of words correctly read has a binomial distribution based on 20 trials, with probability $o_{ij}$ of being correct on each trial. This is an example of a generalized linear model; these are extensions of linear models that include a linear predictor, a transformation (link function) of the expected value of the dependent variable, and a distributional form that is more general than the normal distribution. In this case, the logit is the transformation and the binomial is the distributional form.

Another example requiring the use of generalized linear models is when the outcome is a count, but the count is not the number of successes in a fixed number of trials, or it is a generally small count out of a very large number of trials (in which case we are using an approximation to the binomial distribution). A researcher might count the number of times a student shows an aggressive behavior toward classmates, or the number of times the student raises his or her hand for the attention of the teacher rather than responding without being called on. If the period of observation is the same for each student and time of observation, then generally the Poisson distribution provides the simplest probability model for counts. A slight extension is required when the time of observation varies, so that the model is for the rate of behavior per period of time (number of hand raises per hour, e.g., when the time of observation may vary between 5 and 20 min). If F represents the expected number of acts and Z represents the amount of time for an observational period,
so that F/Z is a rate per unit time, the level-1 equation can be written in the following form:
$$\ln\left(\frac{F_{ij}}{Z_{ij}}\right) = \pi_{0j} + \pi_{1j} M_{ij}$$
As with the binomial model, there is no explicit error term in the equation, because the count variable in the equation is an expected rather than an observed frequency; in this case, the observed count is assumed to have a Poisson distribution.

A related model with the same form is the discrete-time survival model, in which we model the influences on the occurrence of an event. In education, we might wish to determine factors affecting whether, and if so when, a student in college will drop out. Modeling whether or not the student drops out involves an ordinary logistic regression (or, possibly, a multilevel logistic regression with students nested within colleges if we observe many colleges). But to model the length of time to dropout is slightly more complicated. We observe each student over a series of semesters, and note whether he or she has dropped out or is still enrolled for that semester. The main aspects of this situation that differ from others with a binary outcome in each time period are that (1) dropout can occur in at most one period for each student, and (2) the sequence of observations can be stopped (censored) due to graduation, so that the event we are looking for (dropout) may not be observed for that individual. It may not be obvious, but information from students who never drop out is nonetheless useful. Survival models developed in somewhat parallel ways in three areas: medicine (for obvious reasons), engineering (where they are generally known as failure time models), and sociology (where they are known as event history models).
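As a sketch of how such data are typically arranged, the following hypothetical example builds a person-period data set (one record per student per semester at risk) and fits an ordinary discrete-time hazard model by logistic regression; a multilevel version would add random effects for colleges. All names and dropout probabilities are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# One record per student per semester enrolled; 'dropout' equals 1 only in the
# semester in which the student drops out. Students who reach semester 8
# without dropping out are censored (their records simply end without a 1).
rows = []
for student in range(200):
    for semester in range(1, 9):
        hazard = 0.10 if semester <= 2 else 0.04   # invented per-semester dropout probabilities
        event = rng.random() < hazard
        rows.append({"student": student, "semester": semester, "dropout": int(event)})
        if event:
            break                                  # dropout can occur at most once per student
person_period = pd.DataFrame(rows)

# Discrete-time hazard model: logistic regression of the dropout indicator on
# time; student-level or college-level predictors could be added to the formula.
fit = smf.logit("dropout ~ semester", data=person_period).fit(disp=False)
print(fit.summary())
```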
Shrinkage Estimates

In many educational situations (in particular, test administration), one main objective is to estimate the ability (or some other characteristic) of individuals. The most obvious estimate is simply the total number of items answered correctly (or the total points earned, for nondichotomous items). But if the quantity of information available for different individuals is not the same, this obvious estimate has statistical flaws. To take an extreme example, if one individual was administered two items (randomly sampled from an item pool) and got both right, while another was administered a hundred items and got 90 correct, would ability estimates of 1.00 and 0.90 (or monotonic transformations of these) seem reasonable? Would you be willing to believe that the first person has a higher ability than the second, even though he or she answered only two items? If not, what would a reasonable procedure be? It turns out that a reasonable procedure is to shrink all estimates toward the mean (much like the phenomenon of regression to
the mean). The more information you have (e.g., the more test items), the less shrinkage there would be for that individual. Hierarchical models provide a general framework within which one can calculate differential shrinkage for individuals, depending not only on the amount of information about the individual provided by the statistic of interest (the standard error of the estimate), but also on other (auxiliary but informative) data (e.g., the amount of training the individual has received in the subject matter). Hierarchical models can also be used to obtain improved estimates for units other than people (classes, schools, or higher level units).
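A small numeric sketch may help; the precision-weighted formula below is the standard empirical-Bayes style compromise between an individual's own data and the overall mean, and the numbers (overall mean, between-person variance) are invented for illustration.

```python
def shrunken_estimate(p_hat, n_items, mu, tau2):
    """Shrink an observed proportion correct toward the overall mean.

    p_hat   : individual's observed proportion correct
    n_items : number of items administered to that individual
    mu      : overall mean proportion correct
    tau2    : between-person variance of true ability (treated as known here)
    """
    # Approximate sampling variance of p_hat (evaluated at the overall mean so
    # that a perfect score on two items does not get zero sampling variance).
    sampling_var = mu * (1 - mu) / n_items
    # More information (more items) -> smaller sampling variance -> larger
    # weight on the individual's own data -> less shrinkage toward mu.
    weight = tau2 / (tau2 + sampling_var)
    return weight * p_hat + (1 - weight) * mu

mu, tau2 = 0.70, 0.01
print(shrunken_estimate(1.00, 2, mu, tau2))    # 2 items, both right: pulled strongly toward 0.70
print(shrunken_estimate(0.90, 100, mu, tau2))  # 100 items, 90 right: shrunk only slightly
```

With these invented values, the student who answered 90 of 100 items ends up with the higher shrunken estimate, in line with the intuition described above.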
Meta-Analysis

A situation in which it is not obvious that multilevel models can be applied is meta-analysis. In meta-analysis, a large number of studies about a topic are summarized by a measure of effect size, and by other characteristics of the study (age of subjects, whether a randomized design was used, etc.). The objective is to determine what the average effect size is, whether it varies (beyond sampling variability) from study to study, and if so whether study characteristics can be used to predict that variability. If we view subjects as nested within studies, it is easy to see that this falls into the realm of multilevel models, but with the individual-level data missing (i.e., we only see summary statistics for the study as a whole). Most programs for multilevel models (and some additional specialized programs) can perform such analyses.
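To see the connection, the following sketch computes a simple random-effects summary from invented study-level effect sizes and sampling variances, using the moment (DerSimonian–Laird) estimator of the between-study variance; multilevel or specialized meta-analysis software would estimate the same two-level structure, typically by maximum likelihood.

```python
import numpy as np

# Hypothetical effect sizes (e.g., standardized mean differences) and their
# within-study sampling variances, one pair per study.
effects = np.array([0.30, 0.10, 0.45, 0.20, 0.60])
variances = np.array([0.02, 0.03, 0.05, 0.01, 0.04])
k = len(effects)

# Fixed-effect (inverse-variance) weights and the Q heterogeneity statistic.
w = 1.0 / variances
fixed_mean = np.sum(w * effects) / np.sum(w)
Q = np.sum(w * (effects - fixed_mean) ** 2)

# DerSimonian-Laird moment estimate of the between-study variance (tau^2):
# studies play the role of level-2 units, subjects the role of level-1 units.
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

# Random-effects weights, pooled effect, and its standard error.
w_re = 1.0 / (variances + tau2)
pooled = np.sum(w_re * effects) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
print(f"tau^2 = {tau2:.3f}, pooled effect = {pooled:.3f} (SE = {se:.3f})")
```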
Recommended Reading

Several excellent textbooks are available for learning more about multilevel models. Raudenbush and Bryk (2002), whose notation is used here, and Goldstein (2003) cover a wide range of material on these models. Each is also associated with software for fitting these models: HLM (Raudenbush et al., 2004) for the former and MLwiN (Rasbash et al., 2005) for the latter. Other introductory books include Hox (1995, 2002) and Snijders and Bosker (2000). Gelman and Hill (2007) have a unique viewpoint, and many interesting approaches not found elsewhere. Other references in the reading list include more advanced works, as well as specialized references on various aspects of multilevel models.
Summary and Conclusion

When units are nested within higher level units, multilevel modeling should be used for three major reasons: (1) it is the correct analysis (which ordinarily would be a
sufficient reason), (2) viewing the model in this way leads to insights about relationships that are otherwise elusive, and (3) we avoid certain perplexing problems, such as aggregation bias. (Note that not all nesting produces a multilevel model; only when both levels are random effects is this true. For example, people nested within treatments, as in a typical ANOVA, do not result in the type of model discussed here, because levels of treatment are typically fixed in number and are not a random sample from a population of treatments.)

Multilevel models can be used not only in obvious situations, such as when we have children in classes (and/or schools), but also where subjects are nested within studies (meta-analysis), or where observations are nested within people (repeated measures and longitudinal studies). Using these models enables us to estimate effects within each level of a study, as well as between levels. Specialized software is readily available, and several major statistical packages now include the capability to fit multilevel models.

See also: Growth Modeling; Survival Data Analysis; Univariate Linear Regression.
Bibliography

Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press.
Goldstein, H. (2003). Multilevel Statistical Models, 3rd edn. London: Edward Arnold.
Hox, J. J. (1995). Applied Multi-Level Analysis, 2nd edn. Amsterdam: TT-Publikaties.
Hox, J. J. (2002). Multilevel Analysis: Techniques and Applications. Mahwah, NJ: Erlbaum.
Rasbash, J., Steele, F., Browne, W. J., and Prosser, B. (2005). A User’s Guide to MLwiN, Version 2.0. Bristol: University of Bristol.
Raudenbush, S. W. and Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd edn. Thousand Oaks, CA: Sage.
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., and Congdon, R. (2004). HLM 6: Hierarchical Linear and Nonlinear Modeling. Chicago, IL: Scientific Software International.
Snijders, T. and Bosker, R. (2000). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. London: Sage.
Further Reading

Braun, H. (1989). Empirical Bayes methods: A tool for exploratory analysis. In Bock, R. D. (ed.) Multilevel Analysis of Educational Data, pp. 19–55. New York, NY: Academic Press.
Davidian, M. and Giltinan, D. M. (1995). Nonlinear Models for Repeated Measurement Data. Boca Raton, FL: Chapman and Hall.
Efron, B. and Morris, C. (1977). Stein’s paradox in statistics. Scientific American 236, 119–127.
Littell, R. C., Milliken, G. A., Stroup, W. W., Wolfinger, R. D., and Schabenberger, O. (2006). SAS for Mixed Models, 2nd edn. Cary, NC: SAS Institute.
Longford, N. T. (1993). Random Coefficient Models. Oxford: Clarendon Press.
Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). WinBUGS – a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing 10, 325–337.
McCulloch, C. E. and Searle, S. R. (2001). Generalized, Linear, and Mixed Models. New York: Wiley-Interscience.
Pinheiro, J. C. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS. New York: Springer.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics 24, 323–355.
Singer, J. D. and Willett, J. B. (2003). Applied Longitudinal Data Analysis. Oxford, UK: Oxford University Press.
Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models. Boca Raton, FL: Chapman and Hall/CRC Press.
Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: Wiley.
Spiegelhalter, D. J., Thomas, A., Best, N. G., and Lunn, D. (2003). WinBUGS User Manual. (Version 1.4). Cambridge: MRC Biostatistics Unit. http://www.mrc-bsu.cam.ac.uk/bugs (accessed August 2009).
Relevant Websites

http://www.cmm.bristol.ac.uk – Center for Multilevel Modeling.
http://www.ssicentral.com – Scientific Software International (contains many good references to books, articles, and websites).
http://www.ats.ucla.edu – UCLA Statistical Computing (contains useful resources for multilevel models and other advanced statistical procedures).