
Analysis and Interpretation of Multivariate Data

D J Bartholomew, The London School of Economics and Political Science, London, UK
© 2010 Elsevier Ltd. All rights reserved.

Glossary

Binary response – A variable which can take only one of two values, such as a yes/no answer to a question.
Categorical variable – A variable for which the observed values fall into categories, such as country of birth or socioeconomic class.
Correlation coefficient – A measure of the strength of the relationship between two variables. Conventionally, 0 denotes no correlation, +1 a perfect positive correlation, and −1 a perfect negative correlation.
Correlation matrix – A square table giving the correlation coefficients between all pairs of variables.
Dendrogram – A graphical way of presenting the results of cluster analysis; also called a tree diagram – hence the dendro- part of the name.
Latent variable – A variable whose value cannot be observed.
Manifest variable – A variable whose value can be observed.
Product moment correlation coefficient – A correlation coefficient measuring the closeness of a relationship to a straight line.

Introduction

Elementary statistical analysis is concerned with the case where we have just one observation per individual. In such cases, our interest is usually in answering questions about how large the measurements are or how variable they are. The answers to such questions are provided by measures of location and dispersion. For further information, we often look at the frequency distribution to see the pattern of variation which lies behind these summary measures. With multivariate data we may still be interested in looking at the variables individually, but there is now the possibility of investigating the relationships between variables. When we do this, we are doing multivariate analysis. In education, for example, there may be measures of performance in a variety of school subjects.


One might be interested in how these were related to one another: for example, whether performance in mathematics was related to that in history. Or we might wish to bring demographic, or other, variables into the picture by asking whether overall performance depended on birth order in the family or on the time spent doing homework. When only two variables are involved, we speak of a bivariate analysis, but this, of course, is only a special case of multivariate analysis, where more than two variables are involved. Multivariate data, therefore, arise whenever we make more than one observation on an individual. In education, such measures might be of performance in tests, demographic features, type of school attended, and so forth.

The Data Matrix

All multivariate data can be set out in what is called a data matrix, and it is convenient to think about it in the following way. The conventional approach is to make the rows of the matrix correspond to persons, or whatever the basic observational unit is; the columns correspond to variables. It will be convenient to use x to denote any variable. In practice, such variables will have names, like score in French or school attended, but all such possibilities are represented by our x. A typical data matrix with n individuals and p variables will then appear as follows:

$$
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$

Two subscripts are needed to identify an entry in the table. The first tells us which row (individual) the entry is in, and the second which column (variable): $x_{37}$ is thus the value for the third individual of the seventh variable. Such data might result from a survey in which a sample of n individuals gives answers to p questions, or it might record the scores of individuals on a set of tests. In practice, some of the values may be missing, because they were never obtained, have been lost, or have been withheld for some reason. This may materially affect what we can learn from the data. We have spoken of the entries in the data matrix as values of some variable, like a test score, but that is not always the case.


Any piece of information can be entered in a data matrix. It might, for example, be a place of residence or a nationality. Again, this will determine what kind of analysis we can do, but it does not affect the interpretation of the layout.
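To make the layout concrete, here is a minimal sketch, in Python, of a small data matrix held as a pandas DataFrame. The pupils, variable names, and values are all invented, and one entry is left missing to show how an unobserved value appears.

```python
# A small data matrix: rows are individuals, columns are variables.
# All names and values are invented for illustration; the missing maths
# score shows how an unobserved entry appears (NaN).
import pandas as pd

data = pd.DataFrame(
    {
        "french_score": [62, 71, 55, 80],     # metrical variable
        "maths_score": [58, 90, None, 75],    # metrical, one value missing
        "school": ["A", "A", "B", "B"],       # categorical variable
    },
    index=["pupil_1", "pupil_2", "pupil_3", "pupil_4"],
)

# x_ij: the entry for individual i (row) and variable j (column).
x_31 = data.iloc[2, 0]   # third individual, first variable -> 55
print(data)
print(x_31)
```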


Types of Data

Data come in many forms. The result of an examination may be recorded simply as pass or fail. For any individual this is often described as a binary response; collectively, such responses are expressed as proportions or percentages. Sometimes an outcome may consist of one of a series of ordered categories, as, for example, when the categories are: poor, fair, good, very good, and excellent. Otherwise, we may measure ability by a numerical score or on a continuous scale, as we would with something like weight. For simplicity, all of these levels of measurement may be grouped into two classes, which we shall call categorical and metrical. The latter term is used to cover both discrete and continuous measured variables, since we do not usually distinguish between them in multivariate analysis and it is convenient to have a single term to cover both.

In many practical multivariate problems, we shall have a mixture of types, as commonly happens with data derived from sample surveys. For example, some questions require categorical answers, such as place of residence and nationality. Others may be metrical, as when we ask for income or age. The form which the analysis takes will depend very much on whether the variables are metrical or categorical but, conceptually, the logic underlying the methods will be similar.

Whether the data are obtained by experiment or by observation is another factor which affects the way they are interpreted. In an experimental situation, which is rather rare in the educational context, the conditions under which the variable values are obtained are controlled by the experimenter. One may, for example, test pupils under a variety of environmental conditions, where the individuals placed into each group can be chosen at random. More usually, we simply observe what is going on, in which case we have no control over which factors, or combinations of factors, are present. In such cases, it is much more difficult to infer causation.

An entirely different distinction has to be made between variables which are observed, called manifest, and those which are not, called latent. At first sight, this distinction may seem absurd because latent variables yield no data and hence would appear to be redundant. However, there are contexts in which there is reason to believe that there are other variables which may be relevant but which cannot be observed. If these latent variables are in causal relationships with observable variables, we may wish to determine their influence and obtain other information about them indirectly through their effects, which we can observe.


Finally, we need to remind ourselves that individual values of a manifest variable may be unobserved not because the variable is latent, in the above sense, but because they have been lost, because someone refused to answer a question, or because, for example, a candidate in an examination did not have time to complete all the questions in a test.

The Correlation Matrix

Multivariate analysis is about relationships and, in particular, relationships between variables taken in pairs. The strength of these relationships is measured by correlation coefficients. The first step in any summarization of multivariate data is thus to obtain the pair-wise correlations between variables. In the case of metrical variables, we normally use product moment correlations. These are measures of linear relationship; if a relationship is not linear, it may therefore be sensible to transform the variables first in order to make the relationships approximately linear. If we denote the product moment correlation between variables i and j by $r_{ij}$, we may set the results out in the correlation matrix:

$$
\begin{pmatrix}
1 & r_{12} & r_{13} & \cdots & r_{1p} \\
r_{21} & 1 & r_{23} & \cdots & r_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & r_{p3} & \cdots & 1
\end{pmatrix}
$$

Since p will usually be much smaller than n, this matrix is much smaller than the data matrix and thus already represents a considerable reduction in the amount of information we have to take in. The diagonal elements are unity because each variable is perfectly correlated with itself. Almost all analyses of the data matrix start from the correlation matrix, and it is always important to inspect this matrix before going further. The correlation between two variables is, of course, the covariance between the standardized variables. By starting with the correlation coefficients, we ignore the scales of the original variables. This is often desirable because, in much of social science, the scales are themselves arbitrary. There are occasions, however, where the scales of measurement are relevant and should be taken account of in the analysis, but we shall not pursue this possibility here.

What we have said relates to the case where all the variables are metrical. If some or all of them are not, the same kind of information can be conveyed by other types of correlation coefficient. Whether or not this will be the first step in the appropriate form of analysis will depend on what analysis is being carried out. With pairs of binary variables, for example, there are many possible coefficients, of which the so-called tetrachoric coefficient is the closest to the product moment coefficient. There are other coefficients appropriate for pairs of categorical (ordered or unordered) variables and for pairs where one variable is metrical and the other categorical.
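As a sketch of this first summarization step, the following Python fragment computes the product moment correlation matrix of a simulated n × p data matrix of metrical variables; the subject names are invented.

```python
# Computing the p x p correlation matrix from an n x p data matrix of
# metrical variables; the scores here are simulated for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, p = 200, 4
X = pd.DataFrame(
    rng.normal(size=(n, p)),
    columns=["maths", "french", "history", "science"],
)

R = X.corr()         # product moment (Pearson) correlations; diagonal is 1
print(R.round(2))    # p x p: far smaller than the n x p data matrix
```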

Summarization and Inference

The forms of multivariate analysis which we shall review below fall broadly into two categories. First, there are descriptive methods, which aim to summarize multivariate data in a form where its message is more easily grasped. The calculation of a correlation matrix is an example of a descriptive method which is often a first step on the way to a more complete summarization. Second, there are inferential, or model-based, methods. In such cases, our analysis is of a sample from some population, and the purpose is to learn something about the population from which the sample has been drawn, rather than about the sample itself. Thus, we might take a sample of schools and wish to infer something about the population of all schools. The step from the sample to the population can only be made, of course, if we know how the sample has been drawn. In practice, the method of sampling needs to be probabilistic. Probability theory then provides the link that we need to pass from sample to population. An alternative, but equivalent, way of expressing much the same idea is to suppose that the data have been generated by a probability model. The purpose of the analysis is then expressed by saying that we wish to infer what the model is from the values which it yields. For example, a set of test scores obtained by a group of pupils on a particular occasion may be thought of as just one sample among those that might have been obtained on other occasions.

There is a conceptual problem in moving from the univariate to the multivariate case. In univariate statistics, it is common to assume that the sample has come from a normal distribution. Even if this is not so, it is often possible to transform the variable to make it approximately normal. In the multivariate case, things are much more complicated because there are many multivariate distributions which have something in common with the normal distribution to choose from. The obvious candidate is the multivariate normal distribution, which not only has the important property that the marginal distributions are normal but also that the regressions are linear. For this reason, most of the theory of multivariate analysis is based on the assumption that the population distribution is multivariate normal. In the social sciences in general, and the educational sciences in particular, this assumption is highly suspect because many distributions of educational quantities are far from normal (test scores are sometimes an exception). Although we shall mention several methods below which were originally derived on the assumption of normality, such methods are therefore less applicable, and less generally used, in the social sciences. Many textbooks of multivariate analysis are, however, based on the multivariate normal distribution and are thus less relevant to educational practice.

Dependence and Interdependence

The correlation matrix treats all variables on the same basis, and any analysis which does so may be described as interdependence analysis. Sometimes, however, the variables do not have the same status. In such cases, our interest may be in how some variables depend upon others; this arises particularly when there is a temporal ordering of the variables. We may then be interested in knowing how the later ones in the sequence depend on those which come earlier. For example, this might enable us to make predictions about the values of the later variables. The conceptually simplest, and best-known, example of dependence analysis is regression analysis, where we have a set of variables whose values we wish to use to predict some other variable. For example, we might use the scores obtained in a job selection exercise to predict the performance of a candidate on the job. Such predictions are essential in judging the aptitude of potential pilots, for example, when it is too costly or dangerous to test their ability in actual flight.

Methods

We shall now briefly review some of the multivariate methods which are available. These reviews contain no technical material because they are designed to explain the main idea behind each method in such a way as to distinguish it from the others. Almost all of the methods are described in more detail, and illustrated, in Bartholomew et al. (2008), and in other articles in this encyclopedia. We begin with descriptive methods of the interdependence variety.

Cluster Analysis

The object is to put sample members who are similar to one another into the same group, or cluster. Looked at in terms of the data matrix, we put together sample members (rows of the table) which are similar. This could be done crudely by eye, but we need a method that is more objective and which can be programmed for a computer. To do this, we construct a measure of the similarity between any pair of rows, or groups of rows.


The simplest methods, called hierarchical, proceed by first grouping the closest pair of individuals and then, at each stage, grouping the closest pair of clusters already formed, and so on. In this way a tree, or dendrogram, may be constructed. The decision on when to terminate the clustering procedure is essentially subjective and usually depends on the meaningfulness of the clusters.
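As an illustration, the sketch below runs an agglomerative (hierarchical) clustering on a simulated data matrix using SciPy and draws the resulting dendrogram; the data and the choice of average linkage are illustrative only.

```python
# Agglomerative (hierarchical) clustering of a simulated data matrix:
# the closest rows are merged first, then the closest clusters, and the
# successive merges are displayed as a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=0.0, size=(10, 3)),   # one loose "group" of rows
    rng.normal(loc=3.0, size=(10, 3)),   # another, shifted away from it
])

Z = linkage(X, method="average", metric="euclidean")  # merge history
dendrogram(Z)     # tree diagram; cutting it at some height defines clusters
plt.show()
```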

Multidimensional Scaling

This takes cluster analysis a stage further by locating sample members in space according to their distances apart. Similarity and distance are equivalent concepts, one being the inverse of the other: similar objects are close together and dissimilar objects are far apart. As the eye can only easily take in information displayed in two or, at most, three dimensions, the main interest lies in finding plots in two dimensions. The problem is then to locate sample members in a space of small dimension so that their distances apart are as near as possible to the distances calculated from the data matrix. The idea is often illustrated by reference to map construction. Atlases and books of reference often give tables showing the distances between towns or cities. If we had only the distances, we could attempt to reconstruct the map. This is a simpler problem in that we know in advance that the map is two-dimensional. If we do not know the dimensionality, we have to try different numbers, in turn, to get a good fit. The problem as we have described it is, more exactly, metrical scaling, because the distances are treated as real distances. Nonmetric scaling is the term used when all we are entitled to assume is that the distances are ordered. This arises because, in social applications, we may feel confident in ranking the distances without wishing to go as far as assigning them precise values.
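The map analogy can be acted out directly. The sketch below, using scikit-learn and simulated data, starts from a table of pairwise distances and recovers a two-dimensional configuration; the recovered map is determined only up to rotation and reflection.

```python
# Metric multidimensional scaling: given only a table of pairwise
# distances, recover a 2-D configuration whose distances match it.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
true_map = rng.uniform(size=(10, 2))      # the "map" we pretend not to know
D = squareform(pdist(true_map))           # the distance table we observe

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)             # recovered up to rotation/reflection
print(coords)
```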

Principal Components Analysis

This method is applicable only when the elements of the data matrix are metrical variables. Its prime purpose is to reduce the dimensionality of the data. If there were only two columns in our data matrix, we could plot the rows as points in the plane. Inspection of the plot would then show the salient features of the data. For example, if the points fell close to a line, this fact would be clearly evident and we would recognize that the position of each point could be described by the single number defining its position on the line. If the data matrix has more than two columns, it will not be so easy to visualize the position, but we can still use the same geometrical ideas. Dimension reduction of this kind has been a requirement of educational testing for many years. Candidates may obtain scores on a variety of subjects, or on repeated tests on the same subject, and we then wish to summarize them in, say, a single number. The traditional way of doing this has been to add them up, or average them, and use the result as a measure of performance. Principal components analysis (PCA) is a way of determining whether or not this is a reasonable process and whether one number can provide an adequate summary. The correlation matrix is the starting point of PCA; the method produces linear functions of the variables which have the property that they are uncorrelated with one another and are ordered according to the amount of the total variation for which they account.
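A minimal sketch of PCA in Python: the test scores are simulated so that a single underlying ability drives them, and standardizing the variables first is equivalent to starting from the correlation matrix.

```python
# PCA starting from the correlation matrix (equivalently, PCA on the
# standardized variables). The four test scores are simulated so that a
# single underlying ability drives them all.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
ability = rng.normal(size=(300, 1))
scores = ability + 0.5 * rng.normal(size=(300, 4))   # correlated test scores

Z = StandardScaler().fit_transform(scores)
pca = PCA().fit(Z)

# If the first proportion is large, a single number (the first component)
# is an adequate summary of each candidate's performance.
print(pca.explained_variance_ratio_)
```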

Latent Variable Methods

Here we come to a family of methods which are often treated as quite distinct but which, in reality, are essentially the same. As in PCA, the key idea is to look for a much reduced number of variables in terms of which to describe the relationships in the data. Their novel feature is that the description is in terms of what we called latent (or unobserved) variables. One way of expressing their purpose is to say that they ask whether the correlation matrix can be explained by supposing that it arises from the dependence of the observed variables on a small set of unobserved variables. The oldest and best known of these methods is factor analysis. This applies when all of the variables involved, observed and unobserved, are metrical. Other methods include latent class analysis, latent profile analysis, and latent trait analysis. The relationships among these methods can be conveniently expressed in the fourfold table set out as Table 1. This table is not exhaustive, as many hybrid models are possible, but it conveys the essential structure of the situation.

Classification of Latent Variable Models

In practice, categorical variables will often be binary, especially in educational applications, where the observed variables may be right/wrong answers to test items (Table 1). There are extensive and self-contained treatments of these topics, especially latent trait analysis, but in this introductory overview it is important to emphasize the essential similarity rather than the differences.

Table 1  Observed variables and latent variables

                      Observed variables (x)
Latent variables      Metrical                   Categorical
Metrical              Factor analysis            Latent trait analysis
Categorical           Latent profile analysis    Latent class analysis


In practice, it is important to obtain a parsimonious summary of the data in terms of the latent variables. This often involves trying to identify them with substantive entities. For example, the early work on factor analysis, which was invented by Spearman in 1904, was concerned with his attempt to establish the existence of a common factor which he called general ability, or g. Latent trait analysis has similar objectives when the observed variables are binary.
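As a sketch of the metrical–metrical cell of Table 1, the following fragment fits a one-factor model with scikit-learn's FactorAnalysis to simulated test scores; the latent "general ability", the loadings, and the sample size are all invented for illustration.

```python
# A one-factor model (the metrical/metrical cell of Table 1) fitted with
# scikit-learn; the latent variable g is simulated but never "observed".
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
g = rng.normal(size=(500, 1))                       # latent variable
loadings = np.array([[0.9, 0.8, 0.7, 0.6]])
X = g @ loadings + 0.5 * rng.normal(size=(500, 4))  # observed test scores

fa = FactorAnalysis(n_components=1).fit(X)
print(fa.components_)      # estimated loadings: how each score reflects g
g_hat = fa.transform(X)    # estimated factor scores for each individual
```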

Regression Analysis

Regression analysis is the oldest, and probably the most widely used, multivariate technique in the social sciences. Unlike the preceding methods, regression is an example of dependence analysis, in which the variables are not treated symmetrically. In regression analysis, the object is to obtain a prediction of one variable, given the values of the others. To accommodate this change of viewpoint, a different terminology and notation are used. The variable being predicted is usually denoted by y and the predictor variables by x, with subscripts added to distinguish one from another. In linear multiple regression, we look for a linear combination of the predictors (often called regressor variables). For example, in educational research, we might be interested in the extent to which school performance could be predicted by home circumstances, age, or performance on a previous occasion. In practice, regression models are estimated by least squares using appropriate software. Important practical matters concern the selection of the best regressor variables, testing the significance of their coefficients, and setting confidence limits to the predictions.
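A minimal sketch of linear multiple regression by least squares, using statsmodels on simulated data; the regressors standing in for prior attainment and homework are invented.

```python
# Linear multiple regression by least squares with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 150
prior = rng.normal(size=n)       # stand-in for performance on a previous occasion
homework = rng.normal(size=n)    # stand-in for time spent doing homework
performance = 1.0 + 0.8 * prior + 0.3 * homework + rng.normal(size=n)

X = sm.add_constant(np.column_stack([prior, homework]))
fit = sm.OLS(performance, X).fit()   # least-squares estimates
print(fit.summary())                 # coefficients, significance tests,
                                     # and confidence limits
```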

Discriminant Analysis

Discriminant analysis is most simply thought of as regression analysis when the variable to be predicted is binary. Suppose that individuals belong to one of two categories, to which we may assign the values 0 and 1. On each individual, we measure a number of quantities to help us determine the category to which the individual belongs. How do we use these variables to decide the category to which the individual is to be allocated? One way is to ask what linear combination of the variables best discriminates between the two categories. This is essentially the same as asking for a way of predicting the category to which the individual belongs. To determine what is called the discriminant function, we need a sample for which the correct category is known. From this, we determine the regression equation, treating the dependent variable as binary.
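The sketch below, on simulated two-group data, obtains a linear discriminant rule with scikit-learn; as the text notes, an essentially equivalent function can be found by regressing the 0/1 category label on the measurements.

```python
# A linear discriminant rule for two known groups; the groups and
# measurements are simulated for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
group0 = rng.normal(loc=0.0, size=(50, 3))    # measurements, category 0
group1 = rng.normal(loc=1.0, size=(50, 3))    # measurements, category 1
X = np.vstack([group0, group1])
y = np.repeat([0, 1], 50)                     # known categories (training data)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)           # weights of the linear discriminant function
print(lda.predict(X[:5]))  # allocating individuals to a category
```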

Path Analysis

Path analysis uses regression analysis to elucidate causal relations among a set of variables. For it to be applicable, the variables must be ordered in time, so that it is meaningful to speak of one variable being caused by those which precede it. A set of regression equations is then estimated, in which each variable is regressed on those which precede it in time and on which it is presumed to depend.
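A bare-bones sketch of path analysis as a chain of least-squares regressions on temporally ordered variables; the ordering x1 before x2 before x3, and the data, are invented.

```python
# Path analysis as a chain of regressions on temporally ordered variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 300
x1 = rng.normal(size=n)                          # earliest variable
x2 = 0.6 * x1 + rng.normal(size=n)               # presumed to depend on x1
x3 = 0.5 * x2 + 0.2 * x1 + rng.normal(size=n)    # depends on both predecessors

# One equation per variable, regressed on those preceding it in time.
eq2 = sm.OLS(x2, sm.add_constant(x1)).fit()
eq3 = sm.OLS(x3, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(eq2.params)   # estimated path from x1 to x2
print(eq3.params)   # estimated paths from x1 and x2 to x3
```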

Correspondence Analysis

Correspondence analysis originated in an attempt to elucidate the bivariate relationship between pairs of categorical variables. In that sense, it may be regarded as a kind of correlation analysis. Multiple correspondence analysis extends the method to more than two categorical variables, and in that form it may be regarded as having similar objectives to PCA for metrical variables.
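Simple correspondence analysis can be computed directly as a singular value decomposition of the standardized residuals of a two-way contingency table. The sketch below does this in plain NumPy on an invented table; dedicated routines exist, but these few lines show where the coordinates come from.

```python
# Simple correspondence analysis from first principles: an SVD of the
# standardized residuals of a two-way contingency table (invented counts).
import numpy as np

N = np.array([[20, 10, 5],
              [10, 15, 10],
              [5, 10, 20]], dtype=float)   # rows and columns are categories

P = N / N.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses
S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)

U, sv, Vt = np.linalg.svd(S)
row_coords = (np.diag(r**-0.5) @ U)[:, :2] * sv[:2]     # 2-D row coordinates
col_coords = (np.diag(c**-0.5) @ Vt.T)[:, :2] * sv[:2]  # 2-D column coordinates
print(row_coords)
print(col_coords)
```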

Multilevel Modeling

Multilevel modeling has found many applications in educational research; the models are also known under the names hierarchical linear models, mixed models, and random effects models. It is used because it matches the structure so often found in educational systems, where there is a hierarchy of levels. For example, at the lowest level we have pupils, who are grouped into classes, which are part of schools; these, in turn, may be part of a region, which is itself embedded in a country. Sampling may take place at some or all of these levels. Thus, suppose that there is a linear relationship between performance in an examination and parental income within a class; that relationship may then vary from class to class and from school to school. The slope and intercept of any estimated regression equation may vary as we move between different levels of the hierarchy. Formally, we may allow the parameters of any regression model we fit to vary by treating them as random variables. It then makes more sense to estimate the parameters of the populations from which the sample is supposed to be drawn. From this brief description, it is easy to see how the various names for multilevel modeling are derived: there are several levels in the population being studied, the models are linear, the levels are arranged in a hierarchy, the models for each level are mixed together, and their coefficients are regarded as random.
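A minimal sketch of a two-level random intercept model (pupils within schools) using statsmodels; the data and the variable names are invented. Passing re_formula="~income" as well would allow the slope to vary between schools.

```python
# A two-level random intercept model: pupils (level 1) nested within
# schools (level 2). Data and variable names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
school = np.repeat(np.arange(20), 25)                     # 20 schools x 25 pupils
school_effect = rng.normal(scale=2.0, size=20)[school]    # school-level variation
income = rng.normal(size=500)
score = 50 + 2.0 * income + school_effect + rng.normal(size=500)

df = pd.DataFrame({"score": score, "income": income, "school": school})
fit = smf.mixedlm("score ~ income", df, groups=df["school"]).fit()
print(fit.summary())   # fixed effects plus the between-school variance
```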

Structural Equations Modeling

Factor analysis aims to explain the correlations among a set of observed variables in terms of a smaller set of latent variables, or factors. Structural equation modeling takes this one step further by supposing that the latent variables themselves are interrelated; in particular, it is usually supposed that there are linear relations among them. It is therefore similar to path analysis, except that the paths are between latent variables. This means that the regression equations have to be estimated indirectly from what we can observe, namely the correlations (or covariances) among the observed variables. The starting point is thus the same as for factor analysis, but more is asked of the method. In essence, we have to work out what we would expect the covariances to be if the model, which specifies the linear relations among the latent variables, were true. The parameters of the model are then estimated by choosing their values so as to make the expected covariances as close as possible to those observed. In practice, this is a difficult numerical problem which can only be solved efficiently by using appropriate computer software. The method is thus often known by the acronyms designating the many software packages which are now available, for example, LISREL, AMOS, EQS, Mplus, and GLLAMM. As with any complex method of analysis, it is important to check the conclusions carefully, as it is often the case that different models can lead to very similar covariance structures.
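Dedicated software is normally used, but the underlying logic can be sketched in a few lines: write down the covariance matrix implied by the model and choose parameter values that bring it as close as possible to the observed one. The fragment below does this for a one-factor model on simulated data, using unweighted least squares rather than the more efficient fitting criteria that real packages employ.

```python
# Covariance-structure fitting in miniature: for a one-factor model, the
# implied covariance matrix is Sigma(theta) = lambda lambda' + diag(psi),
# and theta is chosen to bring Sigma(theta) close to the observed S.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
f = rng.normal(size=(1000, 1))
X = f @ np.array([[0.9, 0.7, 0.5]]) + 0.4 * rng.normal(size=(1000, 3))
S = np.cov(X, rowvar=False)                   # observed covariances

def discrepancy(theta):
    lam = theta[:3]                           # factor loadings
    psi = theta[3:] ** 2                      # residual variances (kept positive)
    Sigma = np.outer(lam, lam) + np.diag(psi)
    return np.sum((S - Sigma) ** 2)           # unweighted least squares

fit = minimize(discrepancy, x0=np.ones(6))
print(fit.x[:3])                              # estimated loadings
```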


See also: Cluster Analysis: Overview; Correspondence Analysis; Discrimination and Classification; Latent Class Models; Multidimensional Scaling; Principal Components Analysis; Structural Equation Models.


Bibliography

Bartholomew, D. J., Steele, F., Moustaki, I., and Galbraith, J. I. (2008). Analysis of Multivariate Social Science Data, 2nd edn. London: Chapman and Hall/CRC.

Further Reading

Bartholomew, D. J. and Knott, M. (1999). Latent Variable Models and Factor Analysis, Kendall's Library of Statistics, Vol. 7, 2nd edn. London: Arnold.
Bollen, K. A. (1989). Structural Equations with Latent Variables. New York: Wiley.
Cox, T. F. and Cox, M. A. A. (2001). Multidimensional Scaling, 2nd edn. London: Chapman and Hall/CRC.
Everitt, B. S. and Dunn, G. (2001). Applied Multivariate Data Analysis, 2nd edn. London: Arnold.
Goldstein, H. (2003). Multilevel Statistical Models, 3rd edn. London: Arnold.
Gordon, A. D. (1999). Classification, 2nd edn. London: Chapman and Hall/CRC.
Greenacre, M. and Blasius, J. (eds.) (1994). Correspondence Analysis in the Social Sciences. San Diego, CA: Academic Press.
Heinen, T. (1996). Latent Class and Discrete Latent Trait Models. Thousand Oaks, CA: Sage.
Jolliffe, I. T. (1986). Principal Component Analysis. New York: Springer.
Krzanowski, W. J. and Marriott, F. H. C. (1995). Multivariate Analysis, Part 1: Distributions, Ordination and Inference. London: Arnold.
Krzanowski, W. J. and Marriott, F. H. C. (1995). Multivariate Analysis, Part 2: Classification, Covariance Structures, and Repeated Measurements. London: Arnold.
Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Boca Raton, FL: Chapman and Hall/CRC.