Statistical Analysis, Special Problems of: Outliers
Toby Lewis, University of East Anglia, Norwich, UK
© 2001 Elsevier Ltd. All rights reserved. This article is reproduced from the previous edition, volume 22, pp. 15002–15006, © 2001, Elsevier Ltd.
Abstract

Outliers, or outlying observations, are values in data which appear aberrant or unrepresentative. They occur commonly and have to be dealt with. Unless an outlier is explainable, e.g., as a mis-recording, action must be based on the discrepancy between it and the model for the data. If a test shows the outlier to be inconsistent with the model, it is characterized as discordant. A discordant outlier may be identified as of intrinsic interest, rejected as a nuisance item, or incorporated in the main data via a revised model. Multiple outliers may be tested either one at a time, with testing continuing so long as successive values are found discordant, or en bloc. The phenomena of masking and swamping are described. A main alternative to detecting and testing outliers is accommodation: inference procedures are used which down-weight extreme values automatically, without regard to whether or not they are discordant. In complex situations such as regression on several variables, designed experiments, multivariate data, and multilevel data, a data point may be inconsistent with the pattern of the main data, and thus outlying, yet not perceptible by inspection. Such an outlier is detectable by a statistical examination of residuals.
Anyone analyzing data may expect sometimes to encounter one or more observed values which appear to be inconsistent or out of line with the rest of the data. Such a value is called an outlying observation or outlier. Judged subjectively, it looks discrepant or surprising. There are two main ways of dealing with outliers, discordancy testing (Sections Outliers in Univariate Samples, Multiple Outliers: Block and Consecutive Tests) and accommodation (Section Accommodation of Outliers). In simpler types of data set, the presence of an outlier is evident by inspection. In more complex data situations such as designed experiments, an outlying observation may disrupt the expected pattern of results, though its presence is not obvious by inspection and a statistical procedure is needed to detect it (Section Outliers in More Complex Data Situations).
Two Simple Examples

A Disputed Paternity Case

In 1921 Mrs Kathleen Gaskill was sued for divorce in the British courts by her husband, a serving soldier, because she had given birth to a child 331 days after he had left the country to serve abroad. He claimed that the child could not be his, since 331 days was not credible as a duration of pregnancy. She denied adultery and claimed, with medical support, that the length of pregnancy, though unusual, was genuine. Viewed as a duration of pregnancy, 331 days is an outlier. In the legal case (which the court decided in favor of the wife), the only evidence of the alleged adultery was the abnormal length of this duration. This exemplifies the situation where the judgment of an outlier depends purely on its position in relation to a main data set.

Table 1 gives the durations of pregnancy for 1669 women who gave birth in a maternity hospital in the north of England in 1954 (summarized from Newell, 1964). Clearly 331 days, exceeded certainly by 1 case and possibly by 2 out of 1669 in this data set, cannot be regarded as an incredible duration.
Table 1  Durations of pregnancy for 1669 women

Duration (weeks)    Number of women
11–20                             4
21–25                            10
26–30                            23
31–35                           113
36–38                           333
39–41                           962
42                              144
43                               53
44                               12
45                               10
46                                3
47                                1
56                                1
Predicting College Grades

Figure 1 shows a scatter plot of points (x, y) for a small sample of 22 college students; x is the student's score in an aptitude test, used as a predictor of college performance, and y is the student's freshman year grade-point average. There are two outlying points A(xA, yA) and B(xB, yB); the desired predictive relationship of y from x could be modeled satisfactorily by a linear regression omitting these two points. Should they be included in the relationship or not? This is discussed in Section Outliers in More Complex Data Situations. See Linear Hypothesis: Regression (Basics).

Figure 1  Scatter plot of grade-point average y vs predictive test score x for 22 college students.
Outliers in Univariate Samples

The nature of outliers and the range of issues in dealing with them are first discussed in the simple situation of a univariate sample of n observations x1, x2, ..., xn of a variable X. Denote the observations arranged in ascending order by x(1), x(2), ..., x(n), with x(1) < x(2) < ... < x(n). If, say, the upper extreme observation x(n) appears (to an analyst or investigator examining the data) to be not only higher than the rest of the sample but surprisingly high, it is declared an outlier. This is a subjective judgment. Is the outlier credible as a value of X? This depends on the form the investigator is assuming for the distribution of X, e.g., whether it is, say, normal or on the other hand a skew distribution such as the exponential, in which a high value x(n) is less surprising. Call this assumed model for the distribution of X the basic model and denote it by F; x(n) is an outlier relative to model F (e.g., normal), but may not be an outlier relative to a different model (e.g., exponential). Table 2, with dotplot Figure 2, gives the scores by 30 children in a reading accuracy test. There is an upper outlier x(30) = 62 and possibly a second one x(29) = 53. The variable X is here the reading accuracy test score obtained by a child in a population of children of the relevant age group and educational characteristics. A reasonable working assumption is that its distribution is normal; this is supported by a plot of ordered values against normal order scores, which is roughly linear.
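The normality check mentioned here can be made explicit. The following sketch (my own illustration in Python, assuming numpy and scipy are available; the article itself does not give code) plots-in-numbers the Table 2 scores against approximate normal order scores, using the common Blom plotting positions as a stand-in for the exact expected normal order statistics:

```python
import numpy as np
from scipy import stats

# Reading accuracy scores for 30 children (Table 2), already in ascending order
scores = np.array([2, 4, 5, 8, 9, 12, 12, 12, 12, 14, 16, 16, 18, 19, 20,
                   21, 23, 23, 23, 26, 27, 27, 28, 30, 34, 38, 40, 42, 53, 62])

# Approximate normal order scores via Blom plotting positions (i - 3/8)/(n + 1/4)
n = len(scores)
positions = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
order_scores = stats.norm.ppf(positions)

# A roughly linear plot of ordered data against normal order scores supports the
# normal basic model; the correlation summarizes how close to linear it is
r = np.corrcoef(order_scores, scores)[0, 1]
print(f"correlation of ordered scores with normal order scores: {r:.3f}")
```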
Figure 2  Dotplot of the 30 test scores in Table 2.

Causes of Outliers

No statistical issue arises if there is a tangible explanation for the presence of an outlier, e.g., if in the above example the score was misrecorded, or component subscores were wrongly totaled, or if the child tested was known to be unrepresentative of the population under study. The outlying value would simply be removed from the data set, or corrected where appropriate, e.g., in the case of wrong totaling. Otherwise, a statistical judgment is required for interpreting the outlier and deciding how to deal with it, based on the apparent discrepancy between it and the pattern the investigator has in mind for the data – the basic model F. This judgment requires a test of the null hypothesis or working hypothesis H that all n observations belong to F. On this hypothesis, the upper outlier x(n) is a value from the distribution of a variable X(n), the highest observation in a random sample of size n from F. If H is rejected by a statistical test at some level (e.g., 5 percent), x(n) is judged to be not credible (at the level of the test) as the highest value in a sample of size n from F. It is then said to be a discordant outlier (relative to the basic model F); the test of H is a discordancy test. (See Hypothesis Testing in Statistics; Significance, Tests of. See also Section Other Meanings of the Word 'Outlier'.) In a Bayesian approach, there is no direct analogue of a test of discordancy. Its place is taken by procedures designed to obtain a posterior probability assessment of the 'degree of contamination' of the data. This means the degree to which the data contain observations ('contaminants') generated not by F but by some other model G; a basic model is used which contains additional parameters relating to a possible G (see Bayesian Statistics).
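Because the discordancy test refers the outlier to the distribution of the sample maximum under H, its critical value can be approximated by simulation. The sketch below (my own illustration in Python, not from the article) estimates the upper 5 percent point of T = (x(n) − x̄)/s for normal samples of size n = 30; it should come out close to the tabulated value 2.74 quoted later:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 100_000

# Simulate the null distribution of T = (max - mean)/sd for normal samples of size n
samples = rng.standard_normal((reps, n))
T = (samples.max(axis=1) - samples.mean(axis=1)) / samples.std(axis=1, ddof=1)

# Upper 5 percent point of the simulated null distribution
print(f"estimated 5% critical value for n={n}: {np.quantile(T, 0.95):.2f}")
```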
Treatment of Outliers: Retention, Identification, Rejection, Incorporation

If the result of the test is to accept H and find x(n) nondiscordant, all the data can be taken to come from F, including the outlier. Although it has been thought to look somewhat unusual, it is consistent with having arisen purely by chance under the basic model and can be retained in the data set. When on the other hand the outlier is adjudged discordant, there are three possibilities. If the focus of interest is the outlier itself (as in the disputed paternity case), it is identified as a positive item of information, inviting interpretation or further examination; it might reveal, for instance, a novel teaching method causing an exceptional test score. The other two possibilities, rejection and incorporation, arise when the investigator's concern is with the main data set, and the aim of the research is to study characteristics of the underlying population, e.g., to estimate the mean of a distribution of examination scores. The outlier is only of concern insofar as it affects these inferences about the underlying population. It is not of interest in itself; for example, the students giving rise to points A and B in Figure 1 are effectively unknown! Either the model F is firmly adopted without challenge, and estimates or tests of the population parameters are to be based on F; or there is the option of changing the model. If the model is not under question and inferences are to be based on it, the outlier, being adjudged inconsistent with F, is a nuisance item and should be rejected.
Table 2  Reading accuracy test scores (in ascending order) for 30 children

 2   4   5   8   9  12  12  12  12  14  16  16  18  19  20
21  23  23  23  26  27  27  28  30  34  38  40  42  53  62

Historically, outlying values have been studied since the mid-eighteenth century, often in the context of astronomical observations. Until the 1870s rejection was often the only procedure envisaged for dealing with outliers, as typified by such publications as Peirce (1852) 'Criterion for the rejection of doubtful observations' and Stone (1868) 'On the rejection of discordant observations.' The fact that an outlier is omitted for purposes of analyzing a data set does not necessarily mean that it is erroneous or undesirable. Consider, for example, Figure 3. The ordinates are annual figures for the population of the UK, in millions, published in each year's Annual Abstract of Statistics. Those for 1985 through 1990 are estimates based on the 1981 census, while the outlier at 1991, the true value from the 1991 census, is the only genuine data point! The alternative possibility, when the outlier is adjudged discordant relative to F, is that all the observations, including the outlier, belong to the same distribution, but this distribution is not satisfactorily represented by the initial basic model F. The analysis then needs to start afresh with a more appropriate choice of model F* (e.g., a skew distribution instead of a symmetrical F), in relation to which x(n) is not discordant. It was an outlier relative to F, but is not an outlier relative to F*, which satisfactorily models the entire data set. It has been incorporated in the data set in a non-discordant fashion, and the outlier problem disappears.
Other Meanings of the Word 'Outlier'

In this entry, an observation which appears inconsistent with the main data set is called an outlier; this is a subjective judgment. If a statistical test provides evidence that it cannot reasonably be regarded as consistent with the main data set, it is a discordant outlier. It is important to bear in mind, however, that some writers use the word 'outlier' for a discordant outlier, and refer to an observation which looks anomalous but may or may not be discordant as a 'suspicious' or 'suspect' value. Note also the different meaning given to the word 'outlier' in graphical displays in exploratory data analysis (EDA), such as boxplots. With Q1 and Q3 denoting (essentially) the lower and upper quartiles in the sample, observations greater than Q3 + k(Q3 − Q1) or less than Q1 − k(Q3 − Q1) are flagged as outliers. These values are sometimes outliers and sometimes not. With the typical value of 1.5 for k, a normal sample of size 100 has more than 50 percent chance of containing one or more of these 'outliers'! (See Exploratory Data Analysis: Univariate Methods.)
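That claim about normal samples can be checked by simulation. The following sketch (my own illustration in Python; constants other than k = 1.5 and n = 100 are arbitrary) flags points outside the boxplot fences and estimates the chance that a standard normal sample of size 100 contains at least one flagged point:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, k = 100, 20_000, 1.5
count = 0
for _ in range(reps):
    x = rng.standard_normal(n)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Boxplot rule: flag observations beyond the fences Q1 - k*IQR and Q3 + k*IQR
    flagged = (x < q1 - k * iqr) | (x > q3 + k * iqr)
    count += flagged.any()

print(f"P(at least one flagged point in a normal sample of {n}): {count / reps:.2f}")
```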
Multiple Outliers: Block and Consecutive Tests

Any particular discordancy test (taking univariate samples as context) is specified by three items: the particular outlier configuration (e.g., a lower outlier; an upper and lower outlier pair), the distribution of the data on the working hypothesis (e.g., normal with unknown mean and variance), and the test statistic. Taking Table 2 for illustration, there is a discordancy test for x(30) = 62, in a normal sample, based on its difference from the sample mean x̄ = 22.53 in relation to the sample standard deviation s = 14.13, with test statistic T = (x(n) − x̄)/s = (62 − 22.53)/14.13 = 2.79. For n = 30 the tabulated 5 percent point of T on hypothesis H is 2.74; 62 is declared discordant (just) at the 5 percent level. The observation x(29) = 53, which is also high, can now be tested for discordancy in relation to the other 28 values; this is a consecutive test procedure for 2 outliers. The value of T is (53 − 21.17)/12.22 = 2.61, with P ≈ 8 percent; the evidence of discordancy is weak. Alternatively, 62 and 53 could be considered together as an outlying pair in relation to the other 28 observations. This can be tested, in a block test procedure for 2 outliers, by the statistic (x(n−1) + x(n) − 2x̄)/s, which here is (53 + 62 − 2 × 22.53)/14.13 = 4.95, to be compared with the tabulated 1 percent point 4.92. So, on this block test, there is quite strong evidence that (62, 53) are a discordant pair.
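The arithmetic of the consecutive and block tests just described can be reproduced directly from the Table 2 scores. A minimal Python sketch (variable names are mine; the critical values in the comments are those quoted in the text, not computed here):

```python
import numpy as np

scores = np.array([2, 4, 5, 8, 9, 12, 12, 12, 12, 14, 16, 16, 18, 19, 20,
                   21, 23, 23, 23, 26, 27, 27, 28, 30, 34, 38, 40, 42, 53, 62])

# Test of the single upper outlier x(30) = 62 against the full sample
T1 = (scores.max() - scores.mean()) / scores.std(ddof=1)
print(f"T for 62: {T1:.2f}   (tabulated 5% point for n = 30 is 2.74)")

# Consecutive test: remove 62, then test x(29) = 53 against the remaining values
rest = scores[:-1]
T2 = (rest.max() - rest.mean()) / rest.std(ddof=1)
print(f"T for 53 after removing 62: {T2:.2f}   (P about 8%)")

# Block test of the pair (53, 62) using the full-sample mean and sd
Tblock = (scores[-2] + scores[-1] - 2 * scores.mean()) / scores.std(ddof=1)
print(f"block statistic for (53, 62): {Tblock:.2f}   (tabulated 1% point is 4.92)")
```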
Masking and Swamping

If x(29) were nearer to x(28) = 42, the T-value for x(30) would be higher than 2.79, and the evidence for the discordancy of x(30) would be stronger. This is called the masking effect. A discordancy test of the most extreme observation is rendered insensitive by the nearness of the next most extreme observation; the second observation has masked the first. Conversely, a block test could be insensitive if the second observation x(n−1) is close to x(n−2) and not outlying. The pair x(n), x(n−1) might not reach discordancy level when tested jointly, even though x(n) on its own is discordant in relation to the other n − 1 observations. This is called the swamping effect of x(n−1). If x(n−1) is unduly high, there is the risk of masking; if it is unduly low, there is the risk of swamping.
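To make the masking effect concrete, suppose (purely as a hypothetical modification of the Table 2 data, not something considered in the article) that a second score of 62 were present. The statistic for the maximum then drops, as this Python sketch shows:

```python
import numpy as np

scores = np.array([2, 4, 5, 8, 9, 12, 12, 12, 12, 14, 16, 16, 18, 19, 20,
                   21, 23, 23, 23, 26, 27, 27, 28, 30, 34, 38, 40, 42, 53, 62])

def T_max(x):
    """Discordancy statistic (max - mean)/sd for the upper extreme."""
    return (x.max() - x.mean()) / x.std(ddof=1)

print(f"original sample:   T = {T_max(scores):.2f}")   # about 2.79

# Hypothetical second observation equal to 62 inflates the mean and sd, so the
# statistic for the maximum falls -- the new extreme value masks the old one
masked = np.append(scores, 62)
print(f"with a second 62:  T = {T_max(masked):.2f}")   # noticeably smaller
```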
Figure 3  Some annual figures for the UK population (millions), 1985–1991.

Accommodation of Outliers
When the aim is to make inferences about a population of values of a variable X by analyzing a sample of observations of X, there is an alternative to detecting outliers and testing them for discordancy. This is to use procedures (testing or estimation) which are relatively unaffected by possible outliers, by giving extreme values less weight than the rest of the observations, whether they are discordant or not. Such a procedure is called robust against the presence of outliers (see Robustness).
There is a great range of procedures available in the literature. The approach is well illustrated by the following simple methods for robust estimation of the mean and variance of a univariate sample x(1) < x(2) < ... < x(n).
Trimming

The mean is calculated after one or more extreme values have been omitted from the sample. Omission of the r lowest and the s highest values gives the (r, s)-fold trimmed mean

(x(r+1) + x(r+2) + ... + x(n−s)) / (n − r − s).
Winsorizing

Instead of omitting the r lowest values x(1), ..., x(r), they are each replaced by the nearest value which is retained unchanged, i.e., x(r+1); similarly for the s highest values. The mean of the resulting modified sample, still of size n, is the (r, s)-fold Winsorized mean

{(r + 1)x(r+1) + x(r+2) + ... + x(n−s−1) + (s + 1)x(n−s)} / n.

The choice of r and s depends on various factors such as the form of the distribution of X (e.g., symmetrical or asymmetrical) and the level of robustness required. The greatest possible amount of trimming or Winsorizing, with r = s, gives the sample median. Robust estimates of variance can be based in various ways on the deviations x − m, where m is a trimmed or Winsorized mean, instead of the usual x − x̄. For an extensive account of accommodation issues and procedures, see Barnett and Lewis (1994).
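The trimmed and Winsorized means defined above are straightforward to compute. A minimal Python sketch (the function names are mine), applied to the Table 2 scores with r = s = 1:

```python
import numpy as np

def trimmed_mean(x, r, s):
    """(r, s)-fold trimmed mean: drop the r lowest and s highest values."""
    x = np.sort(np.asarray(x, dtype=float))
    return x[r:len(x) - s].mean()

def winsorized_mean(x, r, s):
    """(r, s)-fold Winsorized mean: replace the r lowest values by x(r+1) and
    the s highest values by x(n-s), then average all n values."""
    x = np.sort(np.asarray(x, dtype=float))
    x[:r] = x[r]                          # pull the r lowest values up to x(r+1)
    x[len(x) - s:] = x[len(x) - s - 1]    # pull the s highest values down to x(n-s)
    return x.mean()

scores = [2, 4, 5, 8, 9, 12, 12, 12, 12, 14, 16, 16, 18, 19, 20,
          21, 23, 23, 23, 26, 27, 27, 28, 30, 34, 38, 40, 42, 53, 62]
print(trimmed_mean(scores, 1, 1), winsorized_mean(scores, 1, 1))
```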
Outliers in More Complex Data Situations

In a regression of a variable y on a single regressor variable x, an outlying point is apparent by inspection, as instanced by Figure 1. For a regression on 2 or more regressor variables x1, x2, ..., this is no longer the case. The data may nonetheless contain an outlying point (or more than one), i.e., a point (x, y) which departs to a surprising degree from the overall pattern of the data. Its observed response value y deviates 'surprisingly' from the fitted value from the regression equation; in other words, it has an outlying residual. Detection of an outlying point requires essentially a statistical examination of residuals. To illustrate this, consider the regression data in Figure 1, with two outlying points A and B. Denote the fitted regression line CD by y = a + bx (a = −1.528, b = 0.003528), with residual mean square s² = 0.148. The residual for the outlying point A is yA − (a + bxA) = 1.4 − 2.447 = −1.047, and its estimated standard deviation is

s √{1 − 1/n − (xA − x̄)² / Σ(x − x̄)²} = 0.371.

A discordancy test statistic for the point A is its studentized residual (the residual divided by its estimated standard deviation), viz. −1.047/0.371 = −2.82, i.e., 2.82 in absolute value; this is just significant at the 5 percent level. On a similar test the outlying point B is not discordant at all. The most outlying point in the regression (in this case, A) is the one with the largest studentized residual, ignoring sign. See Linear Hypothesis: Regression (Basics). An equivalent criterion to the studentized residual for testing discordancy is the proportional reduction in the residual sum of squares when the outlier is omitted, or, equivalently again, the increase in the maximized loglikelihood. The test then reflects directly how much the fit of the regression model to the data is improved on omission of the outlier. This approach – the deletion method – obviously extends to the detection and testing of multiple outliers, with the choice between block and consecutive deletion (see Likelihood in Statistics). It is also the appropriate approach to outliers in multivariate data; see Multivariate Analysis: Discrete Variables (Overview). Similar considerations apply to the detection and testing of outliers in designed experiments and outliers in contingency tables, with use of residual-based or deletion procedures. See Experimental Design: Overview. Accommodation procedures are also available in the literature. Outliers also occur in time series (see Time Series: General). For details of methods for dealing with outliers in the various data situations mentioned above, see Barnett and Lewis (1994). Some work has been done on outliers in sample surveys, but the methodology for outliers in this vital field urgently needs developing. See Sample Surveys: Model-based Approaches. In multilevel, or hierarchical, data the concept of an outlier becomes more complex, and much work needs to be done. See Hierarchical Models: Random and Fixed Effects; Statistical Analysis: Multilevel Methods; Lewis and Langford (2001). See also Langford and Lewis (1998), from which the following passage is quoted:

In a simple educational example, we may have data on examination results in a two-level structure with students nested within schools, and either students or schools may be considered as being outliers at their respective levels in the model. Suppose ... that ... a particular school is found under test to be a discordant outlier; we ... need to ascertain whether it is discordant owing to a systematic difference affecting all the students measured within that school or because one or two students are responsible for the discrepancy. At student level, an individual may be outlying with respect to the overall relationships found across all schools or be unusual only in the context of his or her particular school. ... [Again,] masking (see Section Masking and Swamping) can apply across the levels of a model. The outlying nature of a school may be masked because of another similarly outlying school or by the presence of a small number of students within the school whose influence brings the overall relationships within that school closer to the average for all schools.
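Returning to the simple regression setting of Figure 1: the studentized residuals used above to test points A and B can be computed for any (x, y) data. A minimal Python sketch of the calculation (the data here are made up for illustration; the article's 22 student records are not reproduced):

```python
import numpy as np

def studentized_residuals(x, y):
    """Studentized residuals for a simple linear regression of y on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s2 = np.sum(resid ** 2) / (n - 2)          # residual mean square
    # Estimated sd of each residual: s * sqrt(1 - h_i), with leverage
    # h_i = 1/n + (x_i - xbar)^2 / sum((x - xbar)^2)
    h = 1.0 / n + (x - x.mean()) ** 2 / sxx
    return resid / np.sqrt(s2 * (1.0 - h))

# Illustrative data only: a roughly linear relationship with one aberrant response
x = np.array([1000, 1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400])
y = np.array([2.0, 2.1, 2.4, 2.5, 1.4, 2.9, 3.0, 3.2, 3.4])   # the 1.4 is aberrant

t = studentized_residuals(x, y)
print(np.round(t, 2))   # the most outlying point has the largest |t|
```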
Conclusion

'The concern over outliers is old and undoubtedly dates back to the first attempt to base conclusions on a set of statistical data ... The literature on outliers is vast ...' (Beckman and Cook, 1983). As this article has shown, outliers occur commonly in data, and the problem of how to deal with them in any particular case is an unavoidable one. A massive armory of principle, model, and method has built up over the years. For a review of the relevant literature, see the extensive coverage of outlier methodology by Barnett and Lewis (1994), whose book, which includes some 970 references, still remains the only comprehensive treatment of the subject. A valuable survey of many outlier problems is also given in Beckman and Cook (1983).
Bibliography

Barnett, V., Lewis, T., 1994. Outliers in Statistical Data, third edn. Wiley, Chichester, UK.
Beckman, R.J., Cook, R.D., 1983. Outliers (with discussion). Technometrics 25, 119–163.
Chatterjee, S., Hadi, A.S., 1988. Sensitivity Analysis in Linear Regression. Wiley, New York.
Cook, R.D., Weisberg, S., 1982. Residuals and Influence in Regression. Chapman and Hall, London.
von Eye, A., Schuster, C., 1998. Regression Analysis for Social Sciences. Academic Press, San Diego, CA.
Hawkins, D.M., 1980. Identification of Outliers. Chapman and Hall, London.
Hoaglin, D.C., Mosteller, F., Tukey, J.W., 1983. Understanding Robust and Exploratory Data Analysis. Wiley, New York.
Langford, I.H., Lewis, T., 1998. Outliers in multilevel data (with discussion). Journal of the Royal Statistical Society A 161, 121–160.
Lewis, T., Langford, I.H., 2001. Outliers, robustness and the detection of discrepant data. In: Leyland, A.H., Goldstein, H. (Eds.), Multilevel Modelling of Health Statistics. Wiley, Chichester, UK.
McKee, J.M., 1994. Multiple Regression and Causal Analysis. F E Peacock, Itasca, IL.
Newell, D.J., 1964. Statistical aspects of the demand for maternity beds. Journal of the Royal Statistical Society A 127, 1–33.
Peirce, B., 1852. Criterion for the rejection of doubtful observations. Astronomical Journal 2, 161–163.
Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. Wiley, New York.
Rousseeuw, P.J., van Zomeren, B.C., 1990. Unmasking multivariate outliers and leverage points (with discussion). Journal of the American Statistical Association 85, 633–651.
Stone, E.J., 1868. On the rejection of discordant observations. Monthly Notices of the Royal Astronomical Society 28, 165–168.