Comment on regression analysis of air quality data


Atmospheric Environment 35 (2001) 2423–2425

Technical note

G.P. Ayers
CSIRO, Division of Atmospheric Research, PMB 1, Aspendale, Victoria 3195, Australia
Received 31 August 2000; received in revised form 31 October 2000; accepted 14 November 2000

Abstract

A simple numerical experiment demonstrates that reduced major axis regression, rather than standard linear regression, is likely to be the method of choice for the regression analysis of air quality datasets. The case used for illustration is that of determining equivalence of alternative PM10 measurement systems. © 2001 Published by Elsevier Science Ltd.

Keywords: Air quality; Gravimetric PM10; Data analysis; Regression; Reduced major axis regression

E-mail address: [email protected] (G.P. Ayers).
TEOM® is a registered trademark of Rupprecht and Patashnick, Inc.

1. Introduction

Linear regression is a term coined long ago in the social sciences for the procedure of fitting a straight line to a set of data points (x_i, y_i), where each y_i is a measured value dependent on an exactly known independent variable, x_i. However, in many situations in atmospheric sciences, there is not an exactly known set of x_i's, so standard linear regression may not give an optimal fit. The purpose here is to illustrate this point, using an example of current relevance in Australia.

Implementation of a nationally consistent PM10 measurement system across all state and federal jurisdictions in Australia requires an analysis of the comparative performance of the two most common measurement systems currently in use, the Australian Standard Gravimetric (high-volume sampler) method, and the TEOM continuous measurement system. The Australian Standard for PM10 aerosol measurement practices (AS3580.9.6) is written in terms of high-volume filter sampling and gravimetric determination of PM10, taken as samples of 24 h duration. The Standard provides for the assessment of equivalence for alternative measurement systems such as the TEOM in terms of a minimum of 10 measurements made at an appropriate measurement site by collocated high-volume sampler and candidate measurement systems. The test specifications to be met by the candidate sampler derive from standard linear regression of the candidate sampler data against the high-volume sampler data, and are written in terms of slope (required to be 1.0 ± 0.1), intercept (to be 0 ± 5 µg m⁻³) and correlation coefficient (to be ≥0.97).

However, it is argued below that the evaluation of equivalence in terms of standard linear regression of bivariate PM10 (or other air quality) data sets is not optimal, and that an unbiased test of equivalence, or evaluation of underlying linear relationships generally, would be better achieved using reduced major axis (RMA) regression.
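The acceptance limits quoted above are simple enough to express as a check on a fitted line. The following is a minimal sketch only: the names `EquivalenceCriteria` and `meets_criteria` are hypothetical, the limits are copied from the text as read here, and the Standard itself should be consulted for the authoritative wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EquivalenceCriteria:
    """Acceptance limits as quoted in the text (hypothetical container)."""
    slope_target: float = 1.0
    slope_tolerance: float = 0.1        # slope must lie within 1.0 +/- 0.1
    intercept_tolerance: float = 5.0    # intercept within 0 +/- 5 ug m^-3
    min_correlation: float = 0.97       # assumed to be a lower bound on r
    min_pairs: int = 10                 # minimum number of collocated samples


def meets_criteria(slope, intercept, r, n_pairs, criteria=EquivalenceCriteria()):
    """Return True only if every quoted limit is satisfied."""
    return (
        n_pairs >= criteria.min_pairs
        and abs(slope - criteria.slope_target) <= criteria.slope_tolerance
        and abs(intercept) <= criteria.intercept_tolerance
        and r >= criteria.min_correlation
    )


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(meets_criteria(slope=1.02, intercept=1.5, r=0.98, n_pairs=12))
    print(meets_criteria(slope=0.84, intercept=4.6, r=0.87, n_pairs=28))
```

Because the test is applied to the fitted slope and intercept, any systematic bias in the regression method feeds directly into the pass or fail outcome, which is the concern raised in the text.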

2. RMA regression

Standard linear regression involves minimisation of the sum of the squared deviations between the dependent variable data values and the fitted values, as illustrated schematically in Fig. 1. An underlying assumption is that deviations between the observations and fitted line all occur in the y direction, with the independent variable values being known precisely. This is appropriate, for example, in a laboratory experiment in physics, in which a voltage might be varied in precisely known increments while the response of a circuit or machine is measured. In this case, the distinction between measured (dependent) and specified (independent) variables is clear.
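For reference, the least-squares criterion described above can be written out explicitly. The symbols a (intercept), b (slope), n (number of data points) and the bars denoting sample means are notation assumed here for illustration; they do not appear in the original text.

```latex
\begin{align}
S(a,b) &= \sum_{i=1}^{n} \left( y_i - a - b\,x_i \right)^2 , \\
b_{\mathrm{OLS}} &= \frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n} (x_i-\bar{x})^2},
\qquad
a_{\mathrm{OLS}} = \bar{y} - b_{\mathrm{OLS}}\,\bar{x} .
\end{align}
```

Only deviations in y enter S(a, b), which is the formal statement of the assumption that the x values are known exactly.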

1352-2310/01/$ - see front matter © 2001 Published by Elsevier Science Ltd. PII: S1352-2310(00)00527-6


Fig. 1. Standard linear regression minimises the sum of the squared deviations in the y direction (y − Y), while RMA regression minimises the product of the x and y deviations from the fitted line, (x − X)(y − Y), in effect minimising the areas of triangles (shown shaded). Figure adapted from Davis (1986).

However, in situations such as the comparison of data generated by collocated air quality measurement systems, there is no real separation into dependent and independent variables, and deviations between fitted and observed data values will occur in both x and y directions due to random measurement errors. Under these circumstances, a regression model that minimises deviations only in the y direction is not ideal, a situation well recognised in fields such as biometrics or geology, where measured data are often regressed in search of underlying structural relationships. For linear fitting of such data, the reduced major axis (RMA) method is recommended (Davis, 1986). In the RMA method, the linear fit is achieved by minimising the product of the x and y deviations between the data values and fitted values, as depicted schematically in Fig. 1. For additional discussion of the RMA method and formulae for calculating the slope and intercept and their standard error estimates, see Davis (1986), or equivalent texts.

Here, we return to the particular focus that stimulated this work, the question as to whether PM10 equivalence testing according to the Australian Standard is optimal using standard linear regression. Based on the discussion given above, we hypothesise that RMA regression would be more appropriate, and test this hypothesis via a numerical experiment.
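The RMA line has a simple closed form: its slope is the ratio of the standard deviations of y and x, carrying the sign of their covariance, and the line passes through the point of means. The sketch below compares it with standard regression when both variables carry random error. It is illustrative only: the function names (`ols_fit`, `rma_fit`, `compare_fits`) are hypothetical, `TRUE_PM10` is an evenly spaced placeholder series rather than measured data, and the errors are assumed to be uniform relative errors drawn with Python's `random` module.

```python
import math
import random


def _moments(x, y):
    """Return the means and the sums of squares and cross-products about them."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return mx, my, sxx, syy, sxy


def ols_fit(x, y):
    """Standard regression of y on x: slope = Sxy / Sxx."""
    mx, my, sxx, _, sxy = _moments(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx


def rma_fit(x, y):
    """Reduced major axis fit: slope = sign(Sxy) * sqrt(Syy / Sxx)."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    return slope, my - slope * mx


def compare_fits(true_values, rel_error=0.33, repeats=2000, seed=0):
    """Average slope and intercept from both methods over repeated noisy pairs."""
    rng = random.Random(seed)
    fits = {"ols": ols_fit, "rma": rma_fit}
    sums = {name: [0.0, 0.0] for name in fits}
    for _ in range(repeats):
        # Both "instruments" see the same true value with independent errors.
        x = [t * (1.0 + rng.uniform(-rel_error, rel_error)) for t in true_values]
        y = [t * (1.0 + rng.uniform(-rel_error, rel_error)) for t in true_values]
        for name, fit in fits.items():
            slope, intercept = fit(x, y)
            sums[name][0] += slope
            sums[name][1] += intercept
    return {name: (s / repeats, i / repeats) for name, (s, i) in sums.items()}


# Placeholder "true" concentrations (hypothetical, not measured data).
TRUE_PM10 = [5.0 + 1.5 * k for k in range(28)]

if __name__ == "__main__":
    for name, (slope, intercept) in compare_fits(TRUE_PM10).items():
        print(f"{name}: mean slope {slope:.3f}, mean intercept {intercept:.2f}")
```

With these assumptions the true relationship is y = x, so any persistent departure of the average slope from one, or of the average intercept from zero, reflects the fitting method rather than the data.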

3. Numerical comparison of standard regression and RMA regression

Keywood et al. (1999) have reported gravimetric PM10 data recorded using a Micro Orifice Uniform Deposit Impactor in airsheds of six Australian cities on a total of 28 days. Adopting this dataset as representative of Australian urban PM10 levels, two sets of "fictitious" PM10 data were generated, notionally corresponding to two hypothetical, collocated PM10 sampling devices. In the absence of measurement uncertainty, the two datasets would both correspond to the values reported by Keywood et al. (1999), and a perfect correlation between the datasets would result. However, measurements in three Australian cities using a variety of collocated PM10 sampling systems reported by Ayers et al. (1999) showed PM10 deviations between sampling systems of as much as ±40% in measured PM10 levels under routine monitoring conditions. To mimic this result, each data element in the two "fictitious" PM10 datasets was allocated a random error in the range ±33% using the random number function of the Lahey Fujitsu Fortran 95 programming language. The two modified datasets were then subjected to regression using both standard linear regression and RMA regression. This process was repeated 10,000 times.

4. Results and discussion

The distribution of correlation coefficients obtained from the 10,000 pairs of datasets is displayed in Fig. 2. The degradation from a perfect correlation caused by adding random errors in the range ±33% to the data is clear from the R distribution. However, an average R of 0.87 still implies a very strong structural relationship between the datasets, which on average should be a linear relationship with slope of unity and intercept of zero.

Figs. 3 and 4 display the comparable frequency distributions for regression slope and intercept derived from the 10,000 experiments, with each plot containing distributions derived from both the standard linear regression and RMA regression. It is immediately evident from the figures that only the RMA regression returns, on average, the expected linear slope and effectively zero intercept that is the underlying structural relationship in the data. The standard linear regression returns a slope that on average is biased low, and an intercept that on average is biased high. As discussed by Davis (1986), these are exactly the biases to be expected of standard linear regression when applied to data with random variability distributed in both x and y directions: slope will be underestimated and intercept overestimated by the standard regression.

There is evidence in the literature that this bias is seen in practice where standard linear regression, rather than RMA regression, has been used to analyse observational data. One recent example is the case of indoor air quality, in which Freijer and Bloemen (2000) note with respect to their regression of indoor air quality data on outdoor air quality data (their Fig. 3) that the structural relationships visually evident in the data are not reflected by the standard regression, and that an alternative regression model would better represent the central trend in the data. Some years before, Keene et al. (1986) had reached the same conclusion when analysing ionic constituent ratios in rainwater composition datasets.

Fig. 2. Frequency distribution of correlation coefficient, R, from 10,000 pairs of "fictitious" PM10 datasets. The values shown on the plot are the mean, 95% confidence interval for the mean, and median values for the distribution.

Fig. 3. Frequency distribution of regression slope, from 10,000 pairs of "fictitious" PM10 datasets. Open circles: standard linear regression; filled circles: RMA regression. The values shown on the plot are the mean, 95% confidence interval for the mean, and median values for the distributions.

Fig. 4. Frequency distribution of regression intercept, from 10,000 pairs of "fictitious" PM10 datasets. Open circles: standard linear regression; filled circles: RMA regression. The values shown on the plot are the mean, 95% confidence interval for the mean, and median values for the distributions.

5. Conclusions

Linear regression of air quality data in applications such as the equivalence testing of alternative PM10 aerosol sampling systems, in which both variables are measured, should account for random variance in both the x and y variables, rather than solely in the y variable. Therefore, it is recommended that the use of reduced major axis (RMA) regression replace the use of standard linear regression in the Australian Standard Gravimetric PM10 method (AS 3580.9.6-1990).

References

Ayers, G.P., Keywood, M.D., Gras, J.L., Cohen, D., Garton, D., Bailey, G.M., 1999. Chemical and physical properties of Australian fine particles: a pilot study. Final Report to Environment Australia, Canberra, 97 pp.

Davis, J.C., 1986. Statistical and Data Analysis in Geology, 2nd Edition. Wiley, New York.

Freijer, J.I., Bloemen, H.J.Th., 2000. Modelling relationships between indoor and outdoor air quality. Journal of Air and Waste Management Association 50, 292–300.

Keene, W.C., Pszenny, A.A.P., Galloway, J.N., Hawley, M.E., 1986. Sea-salt corrections and interpretation of constituent ratios in marine precipitation. Journal of Geophysical Research 88, 5122–5130.

Keywood, M.D., Ayers, G.P., Gras, J.L., Gillett, R.W., Cohen, D., 1999. Relationships between size segregated mass concentration data and ultrafine particle number concentrations in urban areas of Australia. Atmospheric Environment 33, 2907–2913.