Multivariate linear regression with missing values

Multivariate linear regression with missing values

G Model ACA-232785; No. of Pages 4 ARTICLE IN PRESS Analytica Chimica Acta xxx (2013) xxx–xxx Contents lists available at ScienceDirect Analytica C...

599KB Sizes 0 Downloads 165 Views

G Model ACA-232785; No. of Pages 4

ARTICLE IN PRESS Analytica Chimica Acta xxx (2013) xxx–xxx

Contents lists available at ScienceDirect

Analytica Chimica Acta journal homepage: www.elsevier.com/locate/aca

Multivariate linear regression with missing values Yaser Beyad, Marcel Maeder ∗ Department of Chemistry, University of Newcastle, Newcastle, NSW 2308, Australia

h i g h l i g h t s

g r a p h i c a l

a b s t r a c t

• Optimal linear regression analysis of data sets with missing values.

• Explicit algorithm, no iterations. • Suitable for any pattern of missing data.

a r t i c l e

i n f o

Article history: Received 10 April 2013 Received in revised form 2 August 2013 Accepted 12 August 2013 Available online xxx Keywords: Linear regression Missing values

a b s t r a c t This contribution presents and discusses an efficient algorithm for multivariate linear regression analysis of data sets with missing values. The algorithm is based on the insight that multivariate linear regression can be formulated as a set of individual univariate linear regressions. All available information is used and the calculations are explicit. The only restriction is that the independent variable matrix has to be non-singular. There is no need for imputation of interpolated or otherwise guessed values which require subsequent iterative refinement. Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved.

1. Introduction A literature search on the expression ‘missing values’ or ‘incomplete data’ results in large numbers of hits. The vast majority of these are in the field of statistics where the literature on incomplete data sets and techniques for their analysis is very extensive with several monographs dedicated to the subject [1–6]. There is also a series of reviews which discuss the advantages of the different techniques for the analysis of data sets with missing values [7–10]. In comparison, the number of publications dealing with missing values in the chemistry/chemometrics literature is substantially smaller. Imputation methods that attempt at replacing the missing data with appropriately calculated of otherwise guessed and iteratively refined values are most prominent [11–22]. The simplest imputation is based on interpolation which can only

∗ Corresponding author. Tel.: +61 249215478. E-mail address: [email protected] (M. Maeder).

be utilised if individual or at most very limited ranges of data are missing. Any iterative process requires an optimisation strategy where the convergence criterion depends on the subsequent goal for the analysis of the reconstructed data set. Imputation is necessary if the complete data set is to be analysed, a prominent example in chemistry is classification where only the complete matrix can be subject to the available classification algorithms [19,23]. Alternative non-iterative techniques in the chemistry/chemometrics literature are limited and include deletion [24,25] and maximum likelihood methods, the latter rely on independently pre-defined distributions [26,27]. The goal of the present contribution is not to determine the missing values, it is to perform the multivariate linear regression of an incomplete data set in an explicit, non-iterative manner. The method uses all available information while ignoring missing values in a way that the regression result gets minimum effect from missingness. The proposed method is known within the statistics literature and has been proposed by Yates in a very early publication on incomplete data sets [28] but it seems to have found very

0003-2670/$ – see front matter. Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.aca.2013.08.027

Please cite this article in press as: Y. Beyad, M. Maeder, Multivariate linear regression with missing values, Anal. Chim. Acta (2013), http://dx.doi.org/10.1016/j.aca.2013.08.027

G Model ACA-232785; No. of Pages 4

ARTICLE IN PRESS Y. Beyad, M. Maeder / Analytica Chimica Acta xxx (2013) xxx–xxx

2

little recognition in the chemistry literature. In the current work the algorithm is described followed by demonstration of its application in resolving a chemical system where a substantial part of the data is missing. Then the tolerance of the proposed method for missingness is illustrated by the analysis of a computer generated data set.

A

=

Dr,:

Cr,:

2. The algorithm Linear regression is a very fundamental computational procedure which forms the basis of many more elaborate algorithms, e.g. non-linear regression and the alternating least squares (ALS) algorithm [29,30]. We define linear regression in a general matrix notation and develop the concepts using multiwavelength spectrophotometric absorbance data. Absorption spectra, collected during a chemical process, e.g. as a function of time in kinetics or as a function of reagent addition in a titration, can be collected as the rows of a data matrix, D. If nm spectra are collected at nl wavelengths, the dimensions of D are nm × nl. In the valid range where absorbance varies linearly with concentrations, according to Beer–Lambert’s law, this matrix can be decomposed into the product of matrix C of concentrations and matrix A of molar absorptivities. The matrix C, of dimensions nm × nc, contains as columns the concentration profiles of the nc absorbing components; the matrix A, of dimensions nc × nl, contains their row-wise molar absorption spectra. Due to experimental and instrumental imperfections the decomposition is not perfect and the differences are collected in a matrix R of residuals, Eq. (1). D=C×A+R

(1)

Without loss of generality, linear regression can be defined as the task of computing the matrix A for given matrices D and C which minimises the sum of squares over all elements of the residual matrix R (In the example discussed later, the matrix C is defined by an appropriate chemical model in model based analysis). A is computed via the pseudo-inverse of C, C+ , which can be written as given in Eq. (2) but there are numerically superior algorithms [31]. A = (Ct C)

−1 t

C D orA = C+ D

(2)

The prerequisite of course is that C has to have full rank. The calculation of C from D and A can be performed analogously, Eq. (3). C = DAt (AAt )

−1

orC = DA+

(3)

In this contribution the algorithm is discussed that deals with linear regression where sections of the data matrix D are missing. The missing sections can range from one or several single data points to one or several reasonably large areas of the data matrix. A very obvious method to deal with missing data is to delete either all rows or columns of D that contain corrupted data. This approach is known as list-wise deletion [2] and is represented in Fig. 1. Removal of all rows that are corrupted results in a data matrix that is reduced in the first dimension, Dr,: and also, for the subsequent analysis, in a reduced concentration matrix, Cr,: . A is then calculated as A = (Cr,: )+ Dr,: and seems not to be affected by the truncations. This is however not the case; there is a clear loss of information and the spectra of the species that mainly exist in the removed part of the data are less well defined. Correspondingly, the removal of corrupted columns reduces the data matrix in the second dimension, D:,r , and consequently also the matrix of spectra, A:,r , the matrix C is not affected. In this case A is calculated as A:,r = C+ D:,r , with a loss of information on all species spectra in the affected wavelength range. Fig. 1 clearly demonstrates that in either mode substantial valid sections, the shaded areas, of the data are

A:,r

D:,r

=

C

Fig. 1. Listwise deletion; top, removal of the partially corrupted spectra, bottom, removal of the partially corrupted wavelengths.

not used for the calculations. Naturally, the same principles apply for the calculation of C from D and A. The more efficient algorithm explained here is based on Eq. (2), the computation of the matrix A from the matrices D and C. The central concept is the recognition that the computation of A as a function of D and C can be performed column-wise, i.e. at wavelength i the column A:,i is computed from the corresponding column D:,i of the data matrix, see Fig. 2 for a schematic diagram. The reduced section of the ith column of D that contains uncorrupted data is marked in red, the notation is Dr,i . Dr,i is the product of the equivalent section of C, notation Cr,: , and the ith column of A, notation A:,i . Thus, at this wavelength the molar absorptivities are calculated as A:,i = (Cr,: )+ Dr,i

(4)

This calculation is repeated for all columns of D and A. In this way all ‘good’ information of the data matrix is used and thus A is defined as well as possible. Essentially, the multivariate linear regression analysis is reduced to a finite series of univariate linear regression analyses. This insight allows the exclusion of invalid data at each individual wavelength and thus no valid data are discarded. The computation of A is optimal. Alternatively, the columns of the matrix C can be computed from the appropriate sections of D and A; Eq. (4) is rewritten as Ci,: = Di,r (A:,r )+

(5)

Both algorithms can be implemented in Matlab in a few lines, see Scheme 1. Note, NaN’s (‘not-a-number’ in Matlab) are used

A:,i

=

=

Dr,i Dr,i

A:,i

Cr,:

Cr,:

Fig. 2. Step-wise linear regression; at wavelength i the valid section of the data, marked in red, are used to define the corresponding ith column of A. (For interpretation of the references to color in this figure legend and in the text, the reader is referred to the web version of the article.)

Please cite this article in press as: Y. Beyad, M. Maeder, Multivariate linear regression with missing values, Anal. Chim. Acta (2013), http://dx.doi.org/10.1016/j.aca.2013.08.027

ARTICLE IN PRESS

G Model ACA-232785; No. of Pages 4

Y. Beyad, M. Maeder / Analytica Chimica Acta xxx (2013) xxx–xxx

for i = 1:nl index = isfinite(D(:,i)); A(:,i) = C(index,:)\D(index,i); end

(b) for i = 1:nm index = isfinite(D(i,:)); C(i,:) = D(i,index)/A(:,index); end

Scheme 1. Matlab code for (a) the computation of the matrix A according to Eq. (4); (b) the computation of C according to Eq. (5).

to indicate the missing values, NaN’s are identified by the isfinite command. If there are no missing values at a particular wavelength the algorithm automatically uses the complete information. More importantly, any pattern of sections of non-valid data can exist in the data matrix. In such cases the column Dr,i and the matrix Cr,: are composed of several parts but otherwise there are no changes to the algorithm; the Matlab code in Scheme 1 is still valid. Naturally, there is a limit to the size of the missing section(s) that can be tolerated. If all elements of a complete column are invalid this column can be deleted altogether. Moreover, if the remaining matrix Cr,: is not full rank then its pseudo-inverse cannot be computed. 3. Applications The application of the method is demonstrated by the analysis of data obtained by a spectrophotometric titration from our on-going research where a sizable section of the data matrix has absorption readings that are beyond the linear range of the spectrophotometer. Such values are effectively missing and were not used in the analysis. The chemical system is the dimerisation of hydrogen sulfite to disulfite, Eq. (6). K

2− 2HSO− 3 S2 O5 (+H2 O)

(6)

The dimerisation constant is low (log K = −1.2) therefore high total concentrations have to be used to result in the formation of a significant and observable amount of the dimer. At the required high concentrations, the solutions feature excessively high absorption readings in some regions of the data matrix. A typical spectrophotometric titration (titration of 0.35 M NaHSO3 into water up to a total concentration of 0.1 M) is represented in Fig. 3. Fig. 3(b) displays the resulting absorption spectra for all absorbing species, in particular the spectrum of the disulfite is well characterised. An important aspect for techniques dealing with the analysis of data with missing values is the so-called tolerance for missing values. In order to illustrate the tolerance of the proposed method to the increasing size of the missing section a computer generated data set for spectrophotometric titration of a diprotic acid AH2 to form AH− and A2− is used. The concentration profiles are defined by the chemical model and considered as known throughout the analysis. The spectra of the three species are reasonably well separated Gaussians (as long as the matrix A of molar absorptivities is of full rank, the details about the spectra are irrelevant for the present analysis). Section (a) of Fig. 4 represents the three concentration profiles as well as a graphical representation of the range of missing data: the dashed line on top stands for the missing data while the full line represents the range of valid data, essentially all spectra under the solid line are retained in the matrices Cr,: and Dr,i , in the notation of Eq. (4). Section (b) of Fig. 4 represents the rank analysis of the section of the matrix C that remains for the calculation of the pseudo-inverse for increasing range of missing data. This is very similar to an Evolving Factor Analysis (EFA) plot where sections of the data matrix D are analysed [32]. Section (c) shows the computed molar absorptivities at the respective max for all three species, as an indication of the tolerance of the method for missingness. The interpretation is straightforward: as long as there is some significant contribution of the species AH2 in the valid region, Cr,: has full rank and the spectrum of AH2 is calculated accurately. Once the missing part is large enough that there is no contribution of AH2 in the remaining data matrix, the rank of Cr,: collapses to 2 and as a result the matrix Cr,: is singular and the molar absorptivity of AH2 is meaningless. It is possible to understand or interpret the proposed algorithm as weighted least-squares fitting. Note that the computations still have to be performed row- or column-wise [33]. This has been recognised in [15] where the ultimate goal is the Principal Component Analysis (PCA) of the completed matrix D. Note that the

(a)

3000

(b)

5.0

S2O52-

Molar absorpvity (1/M cm)

2500

2

Abs

1.5

1

0.5

1

0.5

0 240

260

280

300

320

0

Molar absorpvity (1/M cm)

(a)

3

2000 1500

HSO3-

4.0 3.0 2.0 1.0 0.0 230

260 290 wavelength (nm)

320

1000

SO2 (aq)

500 0

SO32230

240

250

260

270

280

290

300

310

320

330

wavelength (nm) vol added (ml)

wavelength

Fig. 3. (a) 3D representation of a spectrophotometric titration of 0.35 M NaHSO3 titrated into water; absorbance values above 1.5 are excluded from the analysis. (b) The calculated spectra of all species.

Please cite this article in press as: Y. Beyad, M. Maeder, Multivariate linear regression with missing values, Anal. Chim. Acta (2013), http://dx.doi.org/10.1016/j.aca.2013.08.027

ARTICLE IN PRESS

G Model ACA-232785; No. of Pages 4

Y. Beyad, M. Maeder / Analytica Chimica Acta xxx (2013) xxx–xxx

4

(a)

4. Conclusion 0.5

The Matlab codes given in Scheme 1 can replace the standard Matlab equations A = C\D or C = D/A whenever there are missing values in the data matrix D. Computation times are marginally slower. If linear regression, i.e. the computation of the matrix A (or C) is the only goal of the analysis, there is nothing to be added. However, if a secondary goal is to estimate the missing values, provided the model of Eq. (1) is valid, the product of C and A, as computed using the present algorithm, produces the missing data in an elegant and efficient way.

Concentraon (M)

0.4

0.3

0.1

Singular values Molar absorpvies

References 1.3

2.3

3.3

4.3

5.3

6.3

1.3

2.3

3.3

4.3

5.3

6.3

1.2 1.0 0.8 0.6 0.4 0.2 0.0

(c)

A-2

AH-

0.2

0.0

(b)

AH2

140 120

AH2

100 80

AH-

60 40 20 0

A-2 1.3

2.3

3.3

pH

4.3

5.3

6.3

Fig. 4. (a) Concentration profiles for the titration of a diprotic acid, the solid bar indicating the range of spectra retained, the dashed line representing the missing section of the data; (b) the singular values for the remaining concentration profiles; (c) the calculated molar absorptivities for AH2 , AH− and A2− at the respective wavelengths of maximum absorbance (the true values are 38, 35 and 25).

equivalent algorithm for row-wise computation, according to Eq. (5), in Matlab code can be found in ‘The Missing Toolbox’ [34]. There are no references and explanation on the website; to our knowledge, there is no publication on this algorithm in the chemistry/chemometrics literature.

[1] J.L. Schafer, Analysis of Incomplete Multivariate Data, Taylor & Francis, London, 1997. [2] P.D. Allison, Missing Data, SAGE Publications, Thousand Oaks, California, 2001. [3] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley, Hoboken, New Jersey, 2002. [4] C.K. Enders, Applied Missing Data Analysis, Guilford Publications, New York, 2010. [5] S. Van Buuren, Flexible Imputation of Missing Data, CRC Press, Inc. Taylor & Francis Group, Boca Raton, Florida, 2012. [6] Y. Dodge, Analysis of Experiments with Missing Data, John Wiley, New York, 1985. [7] A.A. Afifi, R.M. Elashoff, J. Am. Stat. Assoc. 61 (1966) 595–604. [8] A.B. Anderson, A. Basilevsky, D.P.J. Hum, in: P.H. Rossi, J.D. Wright, A. Anderson (Eds.), Handbook of Survey Research, Accademic Press, New York, 1983, pp. 415–492. [9] H.O. Hartley, R.R. Hocking, Biometrics 14 (1971) 174–194. [10] R.J.A. Little, J. Am. Stat. Assoc. 87 (1992) 1227–1237. [11] B. Walczak, D.L. Massart, Chemom. Intell. Lab. Syst. 58 (2001) 15–27. [12] T.H. Brayden, P.A. Poropatic, J.L. Watanabe, Anal. Chem. 60 (1988) 1154–1158. [13] P.R.C. Nelson, P.A. Taylor, J.F. MacGregor, Chemom. Intell. Lab. Syst. 35 (1996) 45–65. [14] R. Bro, Chemom. Intell. Lab. Syst. 38 (1997) 149–171. [15] B. Grung, R. Manne, Chemom. Intell. Lab. Syst. 42 (1998) 125–139. [16] P.K. Hopke, P. Paatero, H. Jia, R.T. Ross, R.A. Harshman, Chemom. Intell. Lab. Syst. 43 (1998) 25–42. [17] G. Tomasi, R. Bro, Chemom. Intell. Lab. Syst. 75 (2005) 163–180. [18] I. Stanimirova, S. Serneels, P.J. Van Espen, B. Walczak, Anal. Chim. Acta 581 (2007) 324–332. [19] I. Stanimirova, B. Walczak, Talanta 76 (2008) 602–609. [20] T.G. Mercer, L.E. Frostick, A.D. Walmsley, Talanta 85 (2011) 2599–2604. [21] J. Josse, M.E. Timmerman, H.A.L. Kiers, Chemom. Intell. Lab. Syst. (2013). [22] H.A.L. Kiers, Psychometrika 62 (1997) 251–266. [23] Y. Liu, S.D. Brown, Chemom. Intell. Lab. Syst. 120 (2013) 106–115. [24] J.B. Hemel, H. van der Voet, F.R. Hindriks, W. van der Slik, Anal. Chim. Acta 193 (1987) 255–268. [25] V.V. Lopes, J.C. Menezes, Chemom. Intell. Lab. Syst. 78 (2005) 1–10. [26] D.T. Andrews, P.D. Wentzell, Anal. Chim. Acta 350 (1997) 341–352. [27] B. Walczak, D.L. Massart, Chemom. Intell. Lab. Syst. 58 (2001) 29–42. [28] F. Yates, Emp. J. Exp. Agric. 1 (1933) 129–142. [29] M. Maeder, A.D. Zuberbuehler, Anal. Chem. 62 (1990) 2220–2224. [30] R. Tauler, B. Kowalski, S. Fleming, Anal. Chem. 65 (1993) 2040–2047. [31] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, 2012. [32] M. Maeder, Anal. Chem. 59 (1987) 527–530. [33] M. Maeder, Y.M. Neuhold, Practical Data Analysis in Chemistry, Elsevier Science, Amsterdam, 2007. [34] http://www.models.life.ku.dk/algorithms

Please cite this article in press as: Y. Beyad, M. Maeder, Multivariate linear regression with missing values, Anal. Chim. Acta (2013), http://dx.doi.org/10.1016/j.aca.2013.08.027