The informative converse paradox: Windows into the unknown

Chemometrics and Intelligent Laboratory Systems 107 (2011) 124–138

Harald Martens ⁎
Centre for Integrative Genetics CIGENE, Dept. of Mathematical Sciences and Technology, Norwegian University of Life Sciences, N-1430 Ås, Norway
Nofima Mat AS, N-1432 Ås, Norway

Article history: Received 23 December 2010; Received in revised form 17 February 2011; Accepted 19 February 2011; Available online 26 February 2011

Keywords: Causal modeling; Incomplete; Residuals; Informative converse; Spectroscopy; Optimized EMSC

Abstract

Model-based interpretation of empirical data is useful. But unanticipated phenomena (interferences) can give erroneous model parameter estimates, leading to wrong interpretation. However, for multi-channel data, interference phenomena may be discovered, described and corrected for by analysis of the lack-of-fit residual table — although with a strange limitation, which is here termed the Informative Converse paradox: When a data table (rows × columns) is approximated by a linear model, and the model-fitting is done by row-wise regression, only the column-wise interference information can be correctly obtained, and vice versa. These “windows into the unknown” are here explained mathematically. They are then applied to multi-channel mixture data — artificial simulations as well as spectral NIR powder measurements — to demonstrate discovery after incomplete row-wise curve fitting and column-wise multivariate regression. The analysis shows how the Informative Converse paradox is the basis for selectivity enhancement in multivariate calibration. Data-driven model expansion for statistical multi-response analyses (ANOVA, N-way models etc.) is proposed.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Incomplete knowledge is the rule

Mathematical models of real-world systems are often incomplete, either because they reflect incomplete knowledge or because they are consciously simplified for a given purpose. Incomplete models can be very useful, e.g. to encapsulate knowledge and to study specific aspects of a system. However, when an incomplete mathematical model is fitted to empirical data, unmodeled phenomena can cause grave systematic alias errors in the resulting parameter estimates. This is a general and well-known problem in e.g. linear statistical modeling, but it is handled differently in different model types. Some pragmatic modeling methods automatically correct for unidentified interferences by purely data-driven means, for instance within the field of multivariate calibration [1]. More causally oriented methods require conscious, explicit modeling of interferences. This difference warrants clarification. More importantly, there is a need for generic methodology that reveals and describes unanticipated phenomena in data, beyond what is already understood and modeled — i.e. windows into the unknown.

This paper demonstrates the potentially damaging effect of unanticipated phenomena on parameter estimates due to incomplete modeling. But it also shows a way to study these phenomena, and how this is used implicitly in multivariate regression modeling. The methodology to be presented is generic for models fitted by unrestricted linear least squares, and probably for other situations as well. Multivariate calibration of spectroscopic data of chemical samples will be used as an example, since it is easy to visualize.

⁎ Corresponding author at: Centre for Integrative Genetics CIGENE, Norwegian University of Life Sciences, N-1430 Ås, Norway. Tel.: +47 95075025; fax: +47 64970333. E-mail address: harald.martens@nofima.no. doi:10.1016/j.chemolab.2011.02.007

1.2. How does multivariate calibration compensate for unidentified interferences?

In multivariate calibration [1], multichannel input data that are highly non-selective due to more or less unidentified interferences may be converted into selective output information about analytes. In most cases the interference handling is done implicitly. It would be desirable if unexpected and unidentified interferences could be more easily discovered and characterized from calibration data, because multivariate calibration could then give more valuable contributions to the over-all scientific process of knowledge generation.

Traditional univariate calibration employs a very simple model c ≈ f1(y) relating c, the concentration of the analyte of interest, to y, a given measured property, e.g. light absorbance at a certain wavelength. Once the calibration model f1(·) has been established, it may be used for predicting c from y in new samples, ĉ = f1(y). The calibration model is often linear and based on a presumed causal relationship between the two, y ≈ f2(c), e.g. in spectroscopy. To give correct results, such a univariate calibration requires a selective, one-to-one correspondence between c and y. Chemical substances that interfere with measurement y have to be removed physically from the samples prior to measurement, otherwise they will masquerade as “analyte signal” and thus cause alias errors in the predicted value ĉ. In contrast, multivariate calibration [1] allows many selectivity problems to be removed mathematically, and thereby allows samples to be


measured without any physical sample clean-up. Multivariate calibration is therefore used extensively in chemistry and many other fields. Linear calibration models are often used, because of their simplicity and their good approximation ability. Several different linear modeling methods are used for multivariate calibration, but they often give similar results, and they all may be formulated as predictors of the form ĉ = f1(y), where y (1 × q) is a vector of measurements at q different channels.

Multivariate calibration can provide clean information from dirty data — selective concentration estimates ĉ from non-selective measurements y. But what makes it work so well? Why are the concentration predictions ĉ free of alias errors, in spite of gross, unidentified interferences in y? And beyond the selectivity enhancement clean-up, can the nature of the “dirt” itself be identified from the data?

Statistically speaking, the multivariate calibration methods work — implicitly or explicitly — by compensating for response similarities between analyte and interferences (unexpected chemical constituents and other chemical or physical effects present). This can be done in different ways, all of which seek to estimate and correct for analyte/interference covariances, in terms of between-variables spectral overlap and/or in terms of between-sample intercorrelations. But statistical covariance correction is difficult to understand, at least for non-statisticians. This paper shows that multivariate residual analysis simplifies the covariance correction and offers additional discovery tools, applicable even in more classical statistical analyses.

1.3. Paradox: what we know the least about can be seen most clearly

The general problem addressed in this paper is the following: Assume that we want to analyze a system that is controlled by several causal phenomena, but where we only anticipate some of these phenomena — the others are totally unexpected. How will the latter affect our ability to quantify and interpret the former? To what extent can we discover and describe even these unexpected phenomena?

This paper shows how new insight about the “dirt” — the unexpected, unidentified interferences — can be obtained from multi-channel data. In fact, it proves that we can get more accurate information about the unexpected interferences, about which we know nothing, than about the expected analytes, about which at least we know the spectra or the concentrations. Moreover, the nature of this accurate, new information about the unexpected interference phenomena is the converse of the background knowledge that we used when modeling the observed data. This means that if our knowledge about the analytes consists of rows (analyte spectra), then the accurate information about the interferences consists only of the converse columns (interference concentrations), and vice versa. All other parameters will have alias errors in their estimates. Therefore it is here called the Informative Converse (IC) paradox.

Fig. 1a) illustrates a simple situation where the IC paradox applies: A set of q different properties (attributes, channels; also called “variables”) have been measured in n different samples (locations in time and space, individuals; “objects”), and collected in a data table Y (n × q). We expect the variations in the measured signals Yexpected to have been caused by varying levels of a recognized phenomenon (e.g. a chemical constituent) — or of several such expected phenomena — plus random measurement noise. But unknown to us, there are other, completely unexpected sources of variation Yunexpected in the system that also affect the measured data Y. The IC analysis shows what information can and cannot be derived from Y, depending on what is known about the analyte(s) in Yexpected and on the nature of Yunexpected.

Technically, the basic IC analysis combines two well-known techniques: linear regression (projection) to model the empirical data Y in terms of prior knowledge about the analytes, followed by a bi-linear decomposition of the lack-of-fit residuals in Y. The IC analysis to be presented here is generic for additive systems. It will be illustrated in a calibration framework, but applies equally well to other linear regression-based modeling methods used in statistics, such as multi-response analysis of variance (ANOVA).

Fig. 1. The Informative Converse, illustrated for the linear two-constituent mixture model Y = cs′ + dz′ + F. a) Mixtures with unanticipated interference problems: multichannel data table Y is the sum of an expected analyte contribution, an unexpected but systematic interference contribution, and random errors F. b) Hypothesis H1 (only Y and s known): modeling mixture data Y by known analyte spectrum s yields correct interference concentration estimates d, but erroneous estimates of analyte concentrations c and interference spectrum z. c) Hypothesis H2 (only Y and c known): modeling mixture data Y by known analyte concentrations c yields correct interference spectrum z, but erroneous estimates of analyte spectrum s and interference concentrations d.

The IC analysis to be presented here is graphically oriented and mathematically simple — almost a banality — and not even new. To perform Principal Component Analysis (PCA) of residuals in Y after fitting a model Y ≈ f(X) by e.g. Ordinary Least Squares (OLS) or Partial Least Squares (PLS) regression has surely been done by numerous researchers in statistics and chemometrics. But the specific interpretation opportunities that the residual PCA offers do not seem to be well known, or at least not used much in practice. The IC paradox associated with the interpretation of lack-of-fit residuals by PCA was,


to the author's knowledge, first briefly shown by Martens and Næs [1]. The present paper builds on an in-depth presentation of the paradox that the author gave at the PLS'07 conference [2]. It will be defined and illustrated in the case of one analyte and one interference phenomenon; the extension to multiple analytes and multiple interferences (Appendix A) is straightforward.

2. Methods

2.1. Notation and terminology

Matrices, column vectors and scalars are here denoted by upper-case bold-face (Y), lower-case bold-face (c) and lower-case italics (n), respectively. Row vector z′ is the transpose of column vector z, while z⊥s means that vector z is orthogonal to vector s. The symbol ~ means “proportional to”; E(·) means the expectation.

2.2. Modeling of systems with linear responses

Assume that we have measured n samples with respect to q variables, i.e. data table Y (n × q), e.g. light absorbance spectra or mass spectra recorded at q channels. We intend to use these data to study an unknown aspect of a certain chemical analyte that we know is present in these samples and affecting our measurements Y. We symbolize the analyte's characteristic profile by row vector s′ (1 × q); it may e.g. represent its light absorbance spectrum. The analyte is present in the n mixtures at levels c (n × 1) (e.g. its concentration vector). We expect the measured signals to be generated by their outer vector product, according to the linear model of the system:

Y = cs′ + F    (1a)

where F (n × q) represents unstructured modeling errors, e.g. due to random errors. For simplicity we assume the errors in F to be random, independent and identically distributed, so that the expectation of their covariance with the model parameters is zero: E(F′c) = 0, E(Fs) = 0. The structure model in Eq. (1a) corresponds to the classical calibration model [1], but is also a generic model for causal systems with linear response to one underlying phenomenon. In the special case that both spectrum s and concentration vector c are correctly known beforehand, the error F can of course be quantified by simple subtraction. More interestingly, if only spectrum s is known, while its concentration vector c is unknown, the latter may be estimated by fitting Y to s row-wise by least squares regression (projection) over the q channels:

ĉ = Ys(s′s)⁻¹    (1b)

The multi-component version of this least squares fitting was employed by e.g. Sheldrick [3] and is shown in Appendix A. If, on the other hand, only concentration vector c is known, while its spectrum s is unknown, then the latter may be estimated by fitting Y to c column-wise by least squares regression (projection) over the n samples:

ŝ = Y′c(c′c)⁻¹    (1c)

2.3. Modeling with incomplete prior knowledge

But what if — unknown to us — an unanticipated interference phenomenon is also present, for instance an additional, unexpected chemical constituent which also affects measurements Y? Let the unanticipated constituent's unknown spectrum be termed z (q × 1), while its unknown concentration vector is termed d (n × 1). Then, unknown to us, the signals of the analyte and the interference are generated according to the linear causal model:

Y = cs′ + dz′ + F    (2)

Again, since the errors in F are random and thus expected to be independent and identically distributed, their expected covariances with these unknown, but systematic vectors are also zero: E(F′d) = 0 and E(Fz) = 0. Hence, in the following we only focus on expected systematic estimation errors due to dz′. For simplicity, Eq. (2) is here considered to represent the sum of contributions from only two chemical constituents — one single analyte and one single interference. Appendix A shows how the same applies to mixtures of several analytes and/or several interferences.

2.4. The informative converse paradox

2.4.1. What we know the least about can be identified most clearly

Fig. 1a) illustrates the situation when interpreting empirical data in light of incomplete prior understanding: A two-way table Y (n × q) of observed data (n objects or situations characterized by q variables or properties) is decomposed into the sum of a contribution from expected sources, a contribution from unexpected sources and a random error term. The question is what happens when we assess the expected contribution in light of prior knowledge, while being unaware of the unexpected contribution. Without loss of generality, we assume that the causality behind these data is described by the two-constituent model in Eq. (2). The “converse parameter” of a given parameter of a constituent is the OTHER matrix way (row or column) of the OTHER constituent (modeled or unmodeled): The interference concentration (column vector d) and the analyte spectrum (row vector s′) are thus the converse of each other, while the interference spectrum (row vector z′) and the analyte concentration (column vector c) are the converse of each other. In analytical chemistry, the Informative Converse paradox is relevant under two different situations with respect to prior knowledge. Fig. 1b) and c) specify these two different modeling situations and form two different hypotheses, H1 and H2, depending on whether the employed prior knowledge is given row-wise (s′) or column-wise (c):

H1. From known analyte spectra, only the interference concentrations can be determined without alias errors.

Fig. 1b) outlines the first hypothesis of this paper: It addresses the case when we, in addition to mixture measurements Y, know the characteristic spectrum s of the analyte a priori. The analyte concentration vector c is unknown. The presence of the interference with parameters d and z is not even anticipated; therefore it is of course not included in the model of Y (Eq. (1a)) used for estimating c (Eq. (1b)). Introducing an extra modeling step — multivariate analysis of the lack-of-fit residuals — can reveal information about dz′. But the paradox is that only the concentration vector d of the unanticipated constituent can thus be estimated correctly from known Y and s. Neither the unknown analyte concentration vector c nor the unanticipated interference spectrum z can be estimated without alias problems; to the extent that spectra s and z overlap, the estimate of c and the estimate of z will be erroneous. This is easy to prove algebraically: Conceptually, we can always write the unknown interference spectrum z as a linear function of the known analyte spectrum s plus a residual z⊥s which is orthogonal to s:

z = sa + z⊥s    (3a)


where¹ a is the coupling parameter between the known analyte spectrum s and the unknown interference spectrum z, defined so that s′z⊥s = 0. We have z⊥s = z − sa, and Eq. (2) may be rewritten:

Y = cs′ + d(sa + z⊥s)′ + F    (3b)

= (c + da)s′ + dz′⊥s + F.    (3c)

Eq. (3c) implies that Y can equivalently be described by

Y = c1s′ + tv′ + F    (3d)

where c1 represents the analyte concentration to be estimated according to the incomplete model in Eq. (1a), while t and v′ represent a bilinear structure² that describes possible unexpected interference structure remaining unmodeled in Y. Erroneously replacing Eq. (3d) by the oversimplified model (1a) yields the least squares estimator from Eq. (1b), ĉ1 = Ys(s′s)⁻¹. This is of course a good estimator of c1, but gives a biased estimator of the true c; Eq. (3c) shows that ĉ = ĉ1 gives an estimated analyte concentration with biased expectation E(ĉ) = c + da. So, unfortunately, unless the analyte and interference spectra s and z are non-overlapping (a = 0), the analyte concentration ĉ obtained from a causal, but incomplete mixture model must be expected to be contaminated by the unexpected interference's unknown concentrations.

However, the main idea in this paper is that multivariate analysis of the lack-of-fit residuals after this modeling can shed valuable light on the unexpected interference itself: We insert ĉ1 into Eq. (3d) and obtain a crude lack-of-fit residual from the incomplete model, F̂1 = Y − ĉ1s′. The signal remaining in F̂1 represents Y after projection on s and is by definition orthogonal to s: F̂1′s = 0. It is expected to contain information about the unexpected interference, plus random noise. The non-random information in F̂1 may be extracted by PCA according to the bi-linear model

F̂1 = tv′ + F̂    (3e)

where tv′ models the systematic structure in F̂1, and F̂ = F(I − s(s′s)⁻¹s′) with F̂′s = 0. If only one interference is present in the data, then we expect only one major principal component in F̂1, with score vector t (n × 1) and loading vector v′ (1 × q). We may choose the loading vector v as an estimate of the interference spectrum z and the obtained score vector t as a scaled estimate of concentration vector d. Unfortunately, this gives an erroneous estimate of the true spectrum z. A comparison of Eq. (3e) via Eq. (3d) to Eq. (3c) shows that if ẑ = v, then E(ẑ) ~ z⊥s, i.e. only reflecting the aspects of z that are orthogonal to the known analyte spectrum s. However, to the extent that the overlap between s and z is not complete (z⊥s ≠ 0), d̂ = t gives an estimate proportional² to the true, but unknown concentration d of the unanticipated interference, E(d̂) ~ d. This is evident from comparing Eqs. (3c) and (3d). So column vector d is considered the informative converse of the known row vector s, and therefore has a good estimate d̂, while both ĉ and ẑ have systematic alias errors. This will be illustrated for artificial data, with and without strong random noise F.

¹ Had we known z, then we could have written a = (s′s)⁻¹s′z and z⊥s = (I − s(s′s)⁻¹s′)z.
² A bilinear product t·v′ has an affine ambiguity t·v′ ≡ t̃·ṽ′, where t̃ is proportional to t and ṽ is proportional to v, but the proportionality constant is unknown. The reason is that if r is an arbitrary proportionality constant ≠ 0, we have t·v′ = t·1·v′ = t·(r·r⁻¹)·v′ = (tr)·(r⁻¹v′) = t̃·ṽ′. Here, r is chosen so that the parameter of main interest (t or v) has a positive mean.
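The H1 chain above (project Y on the known spectrum s, then PCA of the lack-of-fit residual) can be checked numerically. The following minimal sketch uses synthetic, assumed peak shapes and sizes (not the paper's own data) and an SVD for the one-component PCA:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 15, 60

# Hypothetical overlapping spectra: each has one unique and one shared peak
x = np.arange(q)
peak = lambda mu, w: np.exp(-0.5 * ((x - mu) / w) ** 2)
s = peak(15, 4) + peak(30, 4)          # known analyte spectrum
z = peak(45, 4) + peak(30, 4)          # unknown interference spectrum, overlaps s

c = rng.uniform(0.2, 1.0, n)           # true analyte concentrations (unknown)
d = rng.uniform(0.2, 1.0, n)           # true interference concentrations (unknown)
Y = np.outer(c, s) + np.outer(d, z)    # noise-free mixture data, Eq. (2) with F = 0

# Eq. (1b): row-wise projection of Y on the known spectrum s
c_hat = Y @ s / (s @ s)

# Lack-of-fit residual and its first principal component (via SVD)
F1 = Y - np.outer(c_hat, s)
U, sv, Vt = np.linalg.svd(F1, full_matrices=False)
t, v = U[:, 0] * sv[0], Vt[0]

corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print(corr(t, d))      # ~1: the residual scores recover the interference concentrations
print(corr(c_hat, c))  # < 1: the analyte concentrations are aliased, E(c_hat) = c + da
print(corr(v, z))      # < 1: the loading only recovers the part of z orthogonal to s
```

With noise-free data the residual is exactly rank one, so the score vector t reproduces d up to scale and sign, while both ĉ and the loading v carry alias errors, as Eqs. (3c)–(3e) predict.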


H2. From known analyte concentrations, only the interference spectra can be determined without alias errors.

Fig. 1c) addresses the opposite hypothesis, concerning situations where we only know Y and the analyte concentration vector c. The interference is still not even anticipated and hence not included in the mixture model. Then, as the following algebraic development and subsequent illustrations will show, only the converse of c, the interference spectrum z, can be correctly estimated. But the estimated analyte spectrum ŝ and the estimated interference concentration vector d̂ may be erroneous. Mathematically, we are always free to let the unknown interference concentration d be represented by a simple linear function of the known analyte concentration c plus a residual d⊥c:

d = cb + d⊥c    (4a)

where³ b is the coupling constant and d⊥c = d − cb is the part of d orthogonal to c; c′d⊥c = 0. Eq. (2) may then be rewritten:

Y = cs′ + (cb + d⊥c)z′ + F    (4b)

= c(s′ + bz′) + d⊥cz′ + F.    (4c)

The model of Y may now be rewritten

Y = cs1′ + tv′ + F    (4d)

where s1 is some abstract spectrum that corresponds to the incomplete model (Eq. (1a)). If we do not anticipate the interference, we rely on the oversimplified model (Eq. (1a)), from which this abstract spectrum is estimated (Eq. (1c)) as ŝ1 = Y′c(c′c)⁻¹. This is a good estimator for s1, but not for the true analyte spectrum s; comparison to Eq. (4c) shows that the estimated analyte spectrum ŝ = ŝ1 in fact is expected to be E(ŝ′) = s′ + bz′. So unless the analyte and interference concentrations c and d are unrelated (b = 0), spectral estimate ŝ will be contaminated by the unexpected interference's spectrum z.

We can always compute the crude residual from the incomplete causal modeling, F̂1 = Y − cŝ1′, where c′F̂1 = 0. This residual may in turn be approximated by the bilinear model

F̂1 = tv′ + F̂    (4e)

PCA of the residual matrix F̂1 according to Eq. (4e) may now yield information about the unexpected, unmodeled interference. Unfortunately, comparing Eq. (4c) with Eq. (4d) shows that the interference concentration estimate d̂ = t is an erroneous, aliased estimate of d, since E(d̂) = d⊥c only contains the aspects of d that are orthogonal to the known analyte concentration c. However, to the extent that the interference concentration is not proportional to the analyte concentration (d⊥c ≠ 0), a comparison of Eqs. (4c) and (4d) also shows that when the unknown interference spectrum z is estimated as ẑ = v, this loading vector v may be expected to be proportional to the true, but unknown spectrum z of the unexpected interference: E(ẑ) ~ z. So in this case row vector z is the informative converse of the known column vector c, with good estimate ẑ, while ŝ and d̂ have systematic alias errors. This will be

³ Had we known d, we could have written b = (c′c)⁻¹c′d and d⊥c = (I − c(c′c)⁻¹c′)d.


illustrated for artificial data as well as for measured near infrared (NIR) light absorbance spectra of powder mixtures.

2.4.2. Running backward re-estimation brings no new information

Given this unanticipated, but correctly estimated interference spectrum ẑ = v from Eq. (4e), we may try to use it in order to get better estimates of the problematic parameters ŝ and d̂ — parameters that in the first round got erroneous expectations s′ + bz′ and d − cb. Unfortunately, this fails: A reasoning similar to that in Eqs. (3a)–(3e) shows that considering the good estimate ẑ as a “known analyte spectrum” only leads back to sensible estimates of its informative converse, c, which we already knew, while d̂ and ŝ will still have alias errors, albeit of a different kind. This will be illustrated for measured NIR spectra.

2.5. The converse parameters

In summary, this mathematical formalism has shown why concentration d of the unexpected interference may be called the informative converse of the known analyte spectrum s, while the interference spectrum z is the informative converse of the known analyte concentration c: They are both nicely estimated (except for a scaling constant), in spite of being unknown — in fact being totally unanticipated. In contrast, the unknown but expected parameter of the analyte will be estimated with more or less alias errors, and so will the non-converse parameter of the interference. In other words, the converse parameter — the unknown parameter about which we apparently know the least — is the only parameter that we can expect to estimate correctly. And it is estimated more clearly than the unknown parameter of the analyte, which we erroneously might think we should know more about, since we anticipated its presence, included it in our causal mathematical model (Eq. (1a)) and estimated it by a standard statistical procedure (least squares regression).
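The H2 derivation (Eqs. (4a)–(4e)) can be verified in the same way; a minimal sketch with synthetic, assumed peak shapes, not the paper's own data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 15, 60
x = np.arange(q)
peak = lambda mu, w: np.exp(-0.5 * ((x - mu) / w) ** 2)
s = peak(15, 4) + peak(30, 4)          # true analyte spectrum (unknown here)
z = peak(45, 4) + peak(30, 4)          # unknown interference spectrum
c = rng.uniform(0.2, 1.0, n)           # known analyte concentrations
d = rng.uniform(0.2, 1.0, n)           # unknown interference concentrations
Y = np.outer(c, s) + np.outer(d, z)    # noise-free mixtures, Eq. (2) with F = 0

# Eq. (1c): column-wise projection of Y on the known concentration vector c
s_hat = Y.T @ c / (c @ c)              # aliased: E(s_hat) = s + b*z

# PCA of the lack-of-fit residual, Eq. (4e), via SVD
F1 = Y - np.outer(c, s_hat)
U, sv, Vt = np.linalg.svd(F1, full_matrices=False)
t, v = U[:, 0] * sv[0], Vt[0]

corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print(corr(v, z))      # ~1: the residual loading recovers the interference spectrum
print(corr(t, d))      # < 1: the scores only recover the part of d orthogonal to c
print(corr(s_hat, s))  # < 1: the analyte spectrum estimate is contaminated by z
```

Here it is the loading v that reproduces z up to scale and sign, while ŝ and the scores t carry the alias errors, mirroring the H1 case.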
An exception to this paradox is when the analyte and the unanticipated interference are orthogonal to each other in the relevant row or column direction. Orthogonal concentration vectors with c′d = 0 are not to be expected except for designed experiments, in which case the interference is not really unexpected. If the analyte and interference concentrations vary independently of each other, and the number of available samples in Y is very large, we may hope that c′d ≈ 0, but since d is unknown and even unexpected, we cannot be sure about that. Yet another exception is when the spectra of the analyte(s) and chemical interference(s) are truly non-overlapping or otherwise orthogonal, so that s′z = 0, or at least s′z ≈ 0. This is implicitly assumed in many traditional chemical analyses, sometimes over-optimistically: It may e.g. be true if the spectral peaks are so sharp that they do not overlap in the region employed. But it might not be true for unexpected instrumental phenomena (autofluorescence, light scattering, temperature effects, constituent interactions etc.) that might also affect the Y-measurements. And it is certainly not true for diffuse NIR spectra, where the absorbance peaks of different chemical constituents and physical phenomena usually are broad and strongly overlapping. The Informative Converse paradox will here be illustrated for simple two-constituent mixture data. How it applies also to more complex mixtures is defined in Appendix A.
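The backward re-estimation argument of Section 2.4.2 can also be checked numerically. This self-contained sketch (same synthetic, assumed data shapes as above) runs the H2 analysis, then treats the recovered ẑ as a "known spectrum" and repeats the H1-type analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 15, 60
x = np.arange(q)
peak = lambda mu, w: np.exp(-0.5 * ((x - mu) / w) ** 2)
s = peak(15, 4) + peak(30, 4)          # true analyte spectrum (never given to the model)
z = peak(45, 4) + peak(30, 4)          # true interference spectrum
c = rng.uniform(0.2, 1.0, n)           # known analyte concentrations
d = rng.uniform(0.2, 1.0, n)           # unknown interference concentrations
Y = np.outer(c, s) + np.outer(d, z)

# H2 step: from known c, the residual-PCA loading recovers z (up to scale and sign)
s_hat = Y.T @ c / (c @ c)
F1 = Y - np.outer(c, s_hat)
v = np.linalg.svd(F1, full_matrices=False)[2][0]   # z_hat, proportional to z

# Backward step: treat z_hat = v as a "known spectrum" and project Y on it
d_hat = Y @ v / (v @ v)                # still aliased: mixes d with c
F2 = Y - np.outer(d_hat, v)
U2, sv2, _ = np.linalg.svd(F2, full_matrices=False)
t2 = U2[:, 0] * sv2[0]                 # residual scores of the backward step

corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print(corr(t2, c))     # ~1: only the converse of z_hat, the already-known c, comes back
print(corr(d_hat, d))  # < 1: d still cannot be estimated without alias error
```

The backward step only returns a clean estimate of c, which was known from the start, confirming that no new information is gained.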

2.6. Data for illustration

2.6.1. Artificial mixture data

The paradox will first be illustrated with an artificially constructed example, with n = 15 mixtures and q = 303 variables, mimicking e.g. idealized UV or IR spectra of two chemical compounds. The first row in Fig. 2 shows the non-negative concentrations and spectral profiles of an analyte used in Eq. (1a) to generate one-constituent data YExpected. The second row likewise shows the contributions from the “unanticipated” interference, along with the final two-constituent mixture data Y generated from Eq. (2). The simulated data table in Fig. 2f) will be used as input data to subsequent mathematical modeling, as if they had been measured, first error-free and then with random noise added.

The analyte and the interference properties were constructed so that they each have two positive peaks — one unique and one shared — both in their concentration vectors c and d (Fig. 2a and d) and in their spectral profiles s and z (Fig. 2b and e). Their differences and partial overlap are visually obvious in both the row and the column way. Hence, if a statistically estimated parameter is seen to have negative elements or more than two clear peaks, then it obviously has systematic alias errors.

2.6.2. NIR spectra of powder mixtures

The paradox is also illustrated with real spectroscopic measurements on a set of two-constituent chemical mixtures [4]: Wheat protein (“gluten”) and wheat carbohydrate (“starch”) powders were mixed to mimic powdered samples in e.g. the food, feed or pharmaceutical industry. The homogenized powder samples were filled to different sample thicknesses in different sample holders, and measured repeatedly after different degrees of sample compression, in transmission T at 100 channels in the 850–1050 nm near infrared wavelength range, in a Foss Tecator Infratec spectrophotometer.

NIR spectra measured through the powder mixtures depend not only on the chemical absorbances of the constituents, but also on the samples' effective optical path length, which is controlled by the sample thickness and by the light scattering. The latter, in turn, is controlled by differences in the particle diameter and refractive index of the powdered constituents, as well as by variations in how hard the powders were compacted before the light transmission measurements. The light scattering variations are here considered irrelevant and have to be eliminated in order for this data set to illustrate the IC paradox. Therefore, a new version of the Extended Multiplicative Signal Correction (EMSC; [4]) model, optimized for response linearity and described in Appendix B, was developed and applied. It is here called Optimized EMSC (OEMSC).

The over-all experimental design of the data set consisted of gluten/starch mixtures in five mixture ratios, ranging from pure gluten to pure starch. Each powder mixture was measured in five different sample holders with different, randomly chosen powder thicknesses. Each sample holder was measured at two different degrees of powder packing (loose, firm) and in two technical parallels (spectra measured twice). For the sake of illustration without discussions about statistical over-fitting in the present paper, the five sample holders with different light scattering levels were split into three different sets:

2.6.2.1. Linearization set. The EMSC model was optimized on two of the five sample holders, representing n = 40 samples with all five analyte levels (gluten fractions) c = [0, 0.25, 0.50, 0.75 and 1]′.

2.6.2.2. Calibration set. The optimized EMSC model was then applied also to the spectra of two other sample holders, to be used as calibration samples. To demonstrate the power of the Informative Converse paradox, only four analyte levels c = [0.25, 0.50, 0.75 and 1]′ were used, in total n = 32 spectra; the pure interference, corresponding to c = 0, is not present in this calibration set.

2.6.2.3. Test set. Finally, the same optimized EMSC model was also applied to the spectra from the remaining sample holder, with all five analyte levels (gluten fractions c = [0, 0.25, 0.50, 0.75 and 1]′, in total n = 20 spectra). These are used as test data, to demonstrate the predictive ability of different calibration models.

In the following, the gluten constituent is then regarded as the analyte and the starch as “unknown” interference present in unknown quantities. In contrast to the previous artificial data, their

H. Martens / Chemometrics and Intelligent Laboratory Systems 107 (2011) 124–138


Fig. 2. Design of artificial mixtures. The true model behind the noise-free artificial data Y (n × q) for q = 303 variables in n = 15 objects (simulating e.g. q spectral channels in n chemical samples). a) Analyte concentration vector c (n × 1). b) Analyte spectral response vector s′ (1 × q). c) Analyte contribution YExpected = cs′, imaged in 2D as an (n × q) matrix; white = 0, black = maximum response. d) Interference concentration vector d (n × 1). e) Interference spectral response vector z′ (1 × q). f) Error-free data Y (n × q) = YExpected + YUnexpected = cs′ + dz′; white = 0, black = maximum.
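The two-constituent design of Fig. 2 can be sketched in a few lines of NumPy. This is a minimal illustration only: the peak positions, widths and concentration patterns below are assumptions chosen to reproduce the qualitative features described above (two positive peaks each, one unique and one shared, with partial overlap in samples 7–9), not the exact values used in the paper.

```python
import numpy as np

n, q = 15, 303
x = np.arange(q)

def peak(center, width):
    # Gaussian band used as an idealized spectral peak (illustrative shape)
    return np.exp(-0.5 * ((x - center) / width) ** 2)

# Spectra: one unique peak each, plus one peak shared around channel 150
s = peak(60, 15) + peak(150, 15)    # analyte spectrum s' (1 x q)
z = peak(150, 15) + peak(240, 15)   # interference spectrum z' (1 x q)

# Concentrations: two sample groups each, overlapping in samples 7-9
c = np.zeros(n); c[1:4] = [0.5, 1.0, 0.5]; c[6:9] = [0.4, 0.8, 0.4]
d = np.zeros(n); d[6:9] = [0.3, 0.6, 0.3]; d[11:14] = [0.5, 1.0, 0.5]

# Error-free two-constituent data, Eq. (2): Y = cs' + dz'
Y = np.outer(c, s) + np.outer(d, z)
print(Y.shape, np.linalg.matrix_rank(Y))  # -> (15, 303) 2
```

As expected for an error-free two-constituent mixture, the data table has rank 2.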

concentrations here sum to 1; still they behave relatively independently in Eq. (2), as long as the data model has no offset.

Multivariate calibration consists of developing a calibration model that later can predict the unknown analyte concentration from multichannel Y-measurements, ĉi = f1(yi), in any new object i = 1, 2, … of the kind calibrated for. For real-world samples, y-measurements may be affected by one or more unidentified interference phenomena (unexpected chemical constituents, light scattering etc.). Sometimes, neither the number of interference types, nor their concentration variations or spectral characteristics are known. But by multivariate regression modeling, even unknown interferences can often be modeled — implicitly — and thus corrected for. The data needed for such mathematical selectivity enhancement are a set of "dirty", but representative calibration samples, for which both the spectra Y (n × q) and their known analyte concentrations c (n × 1) are available, to allow predictive modeling of the type c ≈ f(Y). (Note that in conventional calibration and regression notation this would be written y ≈ f(X) instead.)

To illustrate the basis for why multivariate calibration works so well, even in such situations with incomplete causal understanding, both hypotheses H1 (Fig. 1b) and H2 (Fig. 1c) will be demonstrated for these NIR powder mixture data. Normally, calibration models are pragmatically centered with respect to the column means, in order to remove constant contributions, for better approximation of nonlinearities etc. Mean-centering would be particularly useful in this case, where the concentrations of the analyte and the "unexpected"

interfering constituent starch sum to 1. However, a non-centered model will here be used initially, to allow a more causally oriented modeling.

3. Results

3.1. Artificial, noise-free mixture data

Based on the analyte structure in Fig. 2a, b) and the interference structure in Fig. 2d, e), the artificial mixture response data set Y (15 × 303) was constructed from Eq. (2). It is displayed in Fig. 2f) as a 2D matrix (objects × variables), in Fig. 3a) as a 3D structure (objects × variables × response) and in Fig. 4a) as a set of n 1D spectra. In the following, one of the analyte properties (column vector c or row vector s′) will be considered known, and the other is to be estimated from data matrix Y in light of this "prior knowledge" of the causal structure in Eq. (1a). The contribution dz′ in Eq. (2) is in the following considered as being due to an unanticipated, unidentified constituent, to be identified as far as possible by the IC analysis.

3.1.1. Knowing only the analyte spectrum s

Hypothesis H1 is first illustrated on the error-free data: Fig. 3 shows what happens if the analyte spectrum s (Fig. 3c) is considered known a priori, and this knowledge is correct, but incomplete. To estimate the unknown analyte concentrations c in the n objects, the mixture data table Y is projected row-wise on s′ by


Fig. 3. Known analyte spectrum only yields correct interference conc. a) "Observed", noise-free multi-channel data Y (Fig. 2f), plotted in 3D as an (n × q × response) matrix. b) Erroneously estimated analyte concentration vector ĉ, obtained by fitting Y-rows to s′. c) Input: known analyte spectrum, vector s. d) Residual after removal of estimated analyte effect F̂1 = Y − ĉs′, plotted in 3D as an (n × q) matrix. e) PCA of F̂1: correctly estimated interference concentration score vector d̂. f) PCA of F̂1: erroneously estimated interference loading spectrum ẑ.

Eq. (1b). Fig. 3b) shows that the obtained analyte concentration estimate ĉ is wrong — it has nonzero values in three sample regions instead of only the two present in the true c (Fig. 2a). What happened is that the partial overlap between s and z around variable #150 (Fig. 2b, e) implies that a ≠ 0 in Eq. (3a). Since E(ĉ) = c + da, this caused d to contaminate ĉ with a positive alias error contribution, da (Eq. (3c)), which is clear in samples 2–4, but also evident in objects 7–9.

Fig. 3d) shows that the table of residuals F̂1 has clearly structured patterns. This proves the presence of one or more "unexpected", but interfering constituents in Y. A simple SVD-based PCA of F̂1 yielded only one major component, corresponding to only one interfering constituent or phenomenon. Its spectral loading estimate ẑ (Fig. 3f) is also obviously erroneous, since it has three peaks, including a negative peak around channel 50, instead of two positive peaks (Fig. 2e). The reason is that F̂1 has already been projected on s, so ẑ = z⊥s, with expectation z − sa. Hence, ẑ shows a negative alias error of peaks from s already used for modeling Y. However, d̂ (Fig. 3e) is indistinguishable from the true, "unknown" interference concentration d (Fig. 2d), apart from a scaling factor.

3.1.2. Knowing only the analyte concentration vector c

Hypothesis H2 is then illustrated: Fig. 4 shows what happens when the data table Y (Fig. 4a) is projected by Eq. (1c) on the known analyte concentration vector c (Fig. 4b). Y is seen to have signal in three variable regions, while the unknown analyte spectrum s only has two peaks (Fig. 2b). The obtained analyte spectrum estimate ŝ (Fig. 4c) now has three nonzero peaks, instead of only two. What happened now is that the partial overlap between c and d in samples 7–9 causes a nonzero b in Eq. (4a). Thus the estimate ŝ′ = (s′ + bz′) is

expected to have a positive alias error contribution bz′ (Eq. (4c)) from the unexpected and hence unmodeled spectrum z′.

Non-zero, structured residuals in F̂1 = Y − cŝ′ (Fig. 4d) again showed the presence of "unexpected" unmodeled structure. PCA of F̂1 gave only one clear PC, which yielded an erroneous concentration profile d̂ (Fig. 4e), with three groups of samples, including negative concentrations in samples 12–14, instead of just two. However, like F̂1 itself, the PCA loading ẑ = v (Fig. 4f) shows signal only in two variable regions. In fact it is indistinguishable from the true, "unknown" z (Fig. 2e), apart from a scaling factor. This is because row vector z′ is the informative converse of the known column vector c. Hence, for these error-free, artificial data, the results correspond to those expected theoretically: only the converse of the employed background knowledge is estimated without alias errors when studying the lack-of-fit residuals after the initial, incomplete modeling.

3.2. Artificial, noisy mixture data

Fig. 5 shows the same analysis, now after random, independent, identically distributed normal noise F was added to Y in Eq. (2). Assuming again the analyte concentration c to represent true, but incomplete causal prior knowledge, the fit gave an erroneous — and of course noisy — estimate of s. The PCA of F̂1 now showed many small singular values, but still only one major one. The results are very similar to those for noise-free data in Fig. 4: the interference concentration d̂ is erroneous, but ẑ, the informative converse of the prior knowledge c, is correctly estimated apart from its scaling constant.
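Both hypotheses can be reproduced numerically. The NumPy sketch below is a minimal illustration (the Gaussian peak shapes and random concentration vectors are assumptions, not the paper's exact design): it fits Y row-wise to a known s (H1) and column-wise to a known c (H2), and checks by SVD of the residuals that only the converse parameter is recovered alias-free, up to scale and sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 15, 303
x = np.arange(q)
g = lambda c0, w: np.exp(-0.5 * ((x - c0) / w) ** 2)
s = g(60, 15) + g(150, 15)     # analyte spectrum (known in H1)
z = g(150, 15) + g(240, 15)    # interference spectrum (unknown)
c = rng.random(n)              # analyte concentrations (known in H2)
d = rng.random(n)              # interference concentrations (unknown)
Y = np.outer(c, s) + np.outer(d, z)

# H1: rows of Y fitted to the known spectrum s, as in Eq. (1b)
c_hat = Y @ s / (s @ s)
F1 = Y - np.outer(c_hat, s)
u, sv, vt = np.linalg.svd(F1, full_matrices=False)
d_hat = u[:, 0] * sv[0]        # interference concentration scores

# H2: columns of Y fitted to the known concentrations c, as in Eq. (1c)
s_hat = c @ Y / (c @ c)
F1b = Y - np.outer(c, s_hat)
u2, sv2, vt2 = np.linalg.svd(F1b, full_matrices=False)
z_hat = vt2[0]                 # interference spectral loading

# Only the converse of the prior knowledge is alias-free (up to scale/sign)
corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print(round(corr(d_hat, d), 3), round(corr(z_hat, z), 3))  # -> 1.0 1.0
```

Both converse correlations come out at 1.0, while the directly fitted ĉ and ŝ carry the alias contributions da and bz′ described above.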

Fig. 4. Known analyte conc. only yields correct interference spectrum. a) "Observed", noise-free multi-channel response data Y (Fig. 2f), plotted as 1D spectra. b) Input: known analyte concentration vector c. c) Estimated analyte spectrum, vector ŝ, obtained by fitting Y-columns to c′. d) Residual after removal of estimated analyte effect F̂1 = Y − cŝ′. e) PCA of F̂1: erroneously estimated interference concentration score vector d̂. f) PCA of F̂1: correctly estimated interference loading spectrum ẑ.

3.3. Real mixture data

3.3.1. Measurements

Fig. 6 shows the NIR absorbance input spectra Y obtained from the powder mixtures, before and after the optimized EMSC pre-processing, for the linearization set, the calibration set and the test sample set. The figure shows that the OEMSC has removed most of the systematic light scattering effects due to sample thickness, sample packing etc.

3.3.2. Knowing only the analyte concentration vector c

Fig. 7a) again shows the n = 32 NIR mixture spectra Y from the calibration set, with the four different protein levels in c clearly evident. To estimate the presumably "unknown" gluten spectrum s, these were projected on the known gluten concentration c (Fig. 7b) by Eq. (1c). The resulting estimate ŝ (solid curve) is clearly erroneous, compared to the actually measured spectrum of pure gluten (thin dashed curve). This systematic statistical error was made in spite of the fact that calibration samples 1–5 and 16–20 contained pure analyte and thus constituted 10 clear replicates of spectrum s (conf. Fig. 7b). Hence the estimation problem is not due to lack of informative calibration samples, but to the lurking variable d that interferes with the unconstrained least squares regression.

Fig. 7d) shows the remaining residual, F̂1 = Y − cŝ′. When submitted to non-centered PCA, these residuals F̂1 showed one dominant component, which yielded clearly aliased concentration estimates d̂ = d⊥c of the interference starch (Fig. 7e, conf. Eq. (4c)). However, the spectrum of the interference, z, i.e. the converse of the knowledge employed (c), was correctly estimated by the PCA of residuals F̂1, apart from a scaling factor: the estimate ẑ from the

calibration set (Eq. (4d), solid curve) is virtually indistinguishable from the measured starch spectrum (Fig. 6e, thin dashed curve), rescaled to the same spectral mean. Note that samples with pure starch z were not present in the calibration set, where min(c) = 0.25. Hence, hypothesis H2 (Fig. 1c) was found to apply also for the NIR measurements. All the calibration samples fitted the bilinear model very well; after subtraction of d̂ẑ′, only negligible variance remained (not shown here).

In analogy to Fig. 4), the theory of the informative converse also worked in the case when the analyte spectrum s was assumed known, while c, d and z were considered unknown. But that will not be pursued here. Instead, hypothesis H1 (Fig. 1b) will now be used "backwards", to illustrate why multivariate calibration modeling works so well in the presence of unidentified interferences:

3.3.3. Trying to use the correctly estimated interference spectrum ẑ

Is it possible to break out of the deadlock of Fig. 7)? Maybe one could use the correctly estimated spectrum ẑ, obtained from Y and c in Fig. 7f), as another "known" input, and thereby improve the estimates of parameters d and s? If successful, that would be a way to use hypothesis H1 (Fig. 1b) "backwards" to break out of the "jail of prior knowledge" represented by c in H2 (Fig. 1c). Fig. 8 shows what happens if H1 is used in reverse: Fig. 8d) shows the measured input data Y again, with an ordinate now scaled to include zero. Y was projected on ẑ from Fig. 7f (redrawn in Fig. 8f) by d̂ = Yẑ(ẑ′ẑ)⁻¹, in analogy to Eq. (1b). Unfortunately, this second attempt at estimating the interference concentration, d̂ (Fig. 8e), is badly over-estimated.

The residuals F̂1 = Y − d̂ẑ′ (Fig. 8a) clearly indicate remaining structure in the data. PCA of these residuals gave one major component.

Fig. 5. For noisy input data, known analyte conc. only yields a good estimate of the interference spectrum. a) "Observed", noisy multi-channel response data YNoisy = Y + F. b) Input: known analyte concentration vector c. c) Estimated analyte spectrum, vector ŝ, obtained by fitting Y-columns to c. d) Residual after removal of estimated analyte effect F̂1 = Y − cŝ′. e) PCA of F̂1: estimated interference concentration score vector d̂. f) PCA of F̂1: estimated interference loading spectrum ẑ.

From that, we obtained a second estimate of the loading vector of the analyte spectrum (ŝ, Fig. 8c), which was now quite different, but still not like the true analyte spectrum (dashed curve, Fig. 7c). Only the score vector ĉ (Fig. 8b) was well estimated, proportional to the true analyte concentration c (Fig. 7b), which we already knew. This shows that our attempt to break out of the "knowledge jail" was unsuccessful — the estimates of s and d remain erroneous. But on the other hand it illustrates why multivariate calibration and prediction work: hypothesis H2 (Fig. 1c) is used — implicitly — in the calibration phase, for estimating the informative converse of the input analyte concentration c, i.e. the covariance pattern of z spanned by vector ẑ = v (Eq. (4d–e)). Then, once calibrated, this spectral interference information ẑ is used in a "backwards" version of H1 (Fig. 1b) for predicting the analyte concentration ĉ in new spectra Y according to Eq. (3d–e).

3.3.4. Multivariate calibration related to the IC paradox

Due to its simple two-constituent structure, the NIR residuals (Fig. 8a) lend themselves well to bilinear PCA analysis, where the loading (Fig. 8c) obviously summarizes the residuals well. Fig. 9a) demonstrates that this interference-corrected analyte spectrum is the so-called net analyte signal. This, in turn, corresponds to the calibration coefficients for the analyte from multivariate calibration. Fig. 9a) shows two other ways to estimate it: the crosses represent the coefficients for predicting gluten concentration c from direct unmixing of Y by projection on both s′ and z′, when both are considered known (with S spanning s and z in Eq. (A.1b) in Appendix A). The solid line, completely overlapping with this, is the corresponding regression coefficient vector b̂ from an un-centered Partial Least Squares Regression (PLSR, [5]) of c (one regressand) on Y (q regressors), using

two PCs in the linear regression model c = Yb + f (the unconventional notation is used because Y will also be considered as regressands below). Fig. 9c) shows the corresponding predictive ability: the predictions ĉ = Yb̂ in the 20 test set samples are quite good, except for small negative values near c = 0. The latter is probably an extrapolation error due to slight response nonlinearities remaining in the NIR data that were not corrected for by the regression modeling, since c = 0 was not part of the calibration set.

Fig. 9b) and d) show that when an offset is included in the calibration model, c = 1c0 + Yb + f, already a one-component bilinear regression, both by Principal Component Regression (PCR) and PLSR, gave a predictive model ĉ = 1ĉ0 + Yb̂ with improved predictive ability, even at the zero analyte level. But the regression coefficient (Fig. 9b) had now changed, both in scale (much smaller) and shape (less pronounced peak at 900 nm). Note that it is negative at all wavelengths, indicating that the protein powder has lower light scattering than the starch powder. This scattering information is apparently retained by the response-optimized EMSC, while it was modeled and removed in the originally published EMSC treatment [4]. This shows that conventional, regression-based multivariate calibration is strongly related to the Informative Converse paradox via the net analyte signal. It also shows that direct causal interpretation of regression coefficients as the analyte's response can be misleading.

4. Discussion

4.1. The converse gives fresh insight

Mathematical models are used extensively in science. In general, they represent conscious simplifications, designed to focus on

Fig. 6. Diffuse near-infrared absorbance spectra of protein and starch powder mixtures. Left: Optical density (O.D.) spectra Yinput = log(1/transmittance), measured through dry powders at q = 100 wavelength NIR channels between 850 and 1050 nm. Right: O.D. spectra Y after Optimized Extended Multiplicative Signal Correction, OEMSC. Top: Linearization set with n = 40 spectra of five protein levels 0, 0.25, 0.5, 0.75 and 1. Middle: Calibration set with n = 30 new spectra of four protein levels 0.25, 0.5, 0.75 and 1. Bottom: Test set with n = 20 new spectra of five protein levels 0, 0.25, 0.5, 0.75 and 1.

essential system traits. But for real-world systems, where our knowledge is incomplete, the models may be over-simplified and hence incomplete. When incomplete models are fitted to data, the resulting parameter estimates are prone to have alias errors. This is well known (see Ref. [1]). Still, models represent a compact representation of knowledge. They are useful for many purposes, e.g. for quantitative prediction.

This paper has substantiated the paradox of the Informative Converse algebraically and demonstrated its prediction consequences: an explicit multivariate analysis of lack-of-fit residuals can lead us to discover and describe unexpected phenomena. When instead done implicitly in multivariate calibration, it provides selectivity enhancement. Such analysis is of course only possible if many informative variables are measured and modeled simultaneously — not just one variable, as in some main-stream science traditions. Thus, the Informative Converse paradox confirms the experience of many chemometricians that in complex systems it is important to measure more variables than strictly necessary, and to employ explorative multivariate data modeling. But with the linear algebraic formalism introduced, some causal aspects of explorative analysis are placed on a more rational basis.

To what extent were the present results dependent on the simplicity of the present system, with only two causal constituents and a linear response? Adding random normally distributed noise to the artificial data Y did not change the conclusions (Fig. 5). Appendix A shows that the algebraic formalism holds also for systems with

several analytes and/or interferences. Hence, the IC paradox appears to be valid for linear models in general. Of course, when extensive errors were introduced into the "known" c or s inputs, this left more of the true analyte contributions in Y unmodeled, which in turn generated more PCA components and made it more difficult to identify the true structure of the converse (not shown).

Although not yet proven rigorously, the IC paradox is apparently not limited to full-rank LS projections of the type used in Eq. (1b) and (1c). In fact, it seems to represent a fundamental phenomenon that is more or less independent of the method of estimation used, such as weighted Least Squares (LS), Generalized LS, Maximum Likelihood (ML) etc., as long as the estimator accomplishes the subtraction of the fitted, known structure, so that only interference information and noise is left in the multivariate residuals.

The optimized EMSC pre-processing method (Appendix B), which has not been published previously, appears equally successful in all three sets of spectra, which indicates that the optimization process itself has not led to overfitting in the linearization set (Fig. 6b).

4.2. Inverse vs classical calibration modeling

All multivariate calibration methods yield predictors of the type ĉ = f1(y) for each new individual sample. Implicitly or explicitly they correct for unexpected interferences in measurements y, but in different ways. "Inverse" and "classical" methods are known to differ

Fig. 7. Known analyte conc. only yields correct interference spectrum. a) O.D. calibration spectra Y (30 × 100), showing the gluten/starch mixture ratios. b) Input: known analyte concentration vector c (30 × 1). c) Erroneously estimated analyte spectrum ŝ (solid curve), scaled and superimposed on the true analyte spectrum s (thin dashed). d) Residual after removal of estimated analyte effect F̂1 = Y − cŝ′. e) PCA of F̂1: erroneously estimated interference concentration score vector d̂. f) PCA of F̂1: correctly estimated interference loading spectrum ẑ (solid), rescaled and superimposed on the true interference spectrum (thin, dashed).

in how the least squares estimation systematically affects the slope of the predicted-vs-measured analyte plots and hence the statistical, population-based performance of the predictors. That is of no concern here. But these methods also differ in how they separate the analyte information and the interference information: "inverse" calibration methods such as regression-based PLSR were in Ref. [1] termed "forward", since they employ the predictive model direction ĉ = f1(y) also in the calibration model, c ≈ f1(Y). The unexpected interferences in Y are implicitly identified by linear "components" (combinations of the regressor variables) that automatically seek to pick up the contributions of both analytes and interferences. Fig. 9a) showed how the regression model generated the "net analyte signal" to overcome the selectivity problems due to interference. The subspace spanned by these abstract, orthogonal components is convenient for graphical inspection and thus for qualitative discovery of unexpected patterns. But it does not provide an explicit, quantitative separation of the analyte and interference information.

On the other hand, "classical" calibration methods such as Estimated Generalized Least Squares (EGLS) were termed "reverse" in Ref. [1], since the modeling direction is reversed from calibration to prediction: Y ≈ f2(c) for the calibration sample set, but ĉ = f1(y) for prediction of new individual samples. They identify the interference covariance matrix from the multivariate residuals from Eq. (1c), F̂1 = Y − cŝ′, and from that they construct the net analyte signal. That can also work very well in terms of predictive ability. These more causally oriented calibration models give an explicit separation of the apparent analyte information and the interference information. But

this paper has demonstrated how misleading the estimated spectral characteristics of the analyte may be, when the analyte and interference concentrations happen to be intercorrelated (Figs. 4a, 5a and 7a). Still, as demonstrated in Figs. 4f, 5f and 7f, the spectral features of the unexpected interferences may be studied in the major principal components of F̂1 (Eq. (3e)), or equivalently, in the first few eigenvectors of this estimated covariance matrix.

4.3. Consequences for multi-response ANOVA and other linear models

The Informative Converse paradox may be useful for assessing modeling deficiencies in generic data modeling methods of the types commonly used in chemometrics. For instance, both in analysis of variance (ANOVA) with many response variables and in multivariate regression and discriminant analyses, a set of q Y-variables in Y (n × q) is regressed on a set of p X-variables in X (n × p) over n objects according to the linear model, in analogy to Eq. (1a) with X = c, plus an offset:

Y = 1b0 + XB + F.    (5a)

The effects B and offsets b0 may be estimated by conventional OLS (full-rank multivariate regression and linear discriminant analysis) or restricted OLS (ANOVA), or by stabilized regression methods such as PLSR, ridge regression, the elastic net etc. Kettaneh-Wold [6] showed, for instance, how mixture data of the type presented here can be analyzed with respect to the underlying design by PLSR.
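If Y additionally contains an unexpected systematic contribution dz′, the residual-analysis strategy discussed here can still recover the Y-profile z′. A minimal NumPy sketch of this idea follows; the design X, effect matrix B and sine-shaped interference profile z are arbitrary illustrative assumptions, and the offset 1b0 is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 40, 2, 80
X = rng.normal(size=(n, p))               # known design / regressor variables
B = rng.normal(size=(p, q))               # their true multivariate effects
d = rng.normal(size=n)                    # unexpected systematic source
z = np.sin(np.linspace(0, np.pi, q))      # its characteristic Y-profile (assumed shape)
Y = X @ B + np.outer(d, z) + 0.01 * rng.normal(size=(n, q))

# Fit the incomplete linear model Y = XB + F, column-wise by OLS
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
F = Y - X @ B_hat                         # multivariate lack-of-fit residuals

# PCA of the residuals: the dominant loading recovers z up to scale and sign,
# whereas d can only be recovered orthogonal to X
_, sv, vt = np.linalg.svd(F, full_matrices=False)
z_hat = vt[0]
print(round(abs(np.corrcoef(z_hat, z)[0, 1]), 2))  # -> 1.0
```

One dominant singular value in F signals a single unmodeled phenomenon, and its loading vector is the alias-free window into the unknown interference.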

Fig. 8. Reversing the modeling process: the estimated interference spectrum only yields the analyte conc., which was already known. a) O.D. calibration spectra Y (30 × 100); same as Fig. 7a). b) Erroneously estimated interference concentration vector d̂ (30 × 1). c) Input: correctly estimated interference spectrum ẑ (Fig. 7f). d) Residual after removal of estimated interference effect F̂1 = Y − d̂ẑ′. e) PCA of F̂1: correctly estimated analyte concentration score vector ĉ. f) PCA of F̂1: erroneously estimated analyte loading spectrum ŝ.

However, assume that Y, unknown to us, had been caused by an additional, unexpected but systematic source of variability, according to the true model

Y = 1b0 + XB + dz′ + F.    (5b)

The unexpected Y-contribution dz′ is lacking in the ANOVA model in Eq. (5a). To the extent that d happens to correlate with the design X, the effect estimates will have serious alias errors, rendering the traditional significance tests correspondingly misleading. However, according to hypothesis H2, the IC paradox allows new discoveries to be made by multivariate analysis of the Y-residuals. The interference levels d cannot be estimated without negative alias errors from X. But the characteristic Y-profile z′ may be found. Interpretation of ẑ in light of more or less tacit background knowledge may in turn lead to improved ways to study the system, e.g. by measuring also d and including it in X, by ensuring that d is always constant, or by designing the next experiment so that d is known and orthogonal to the existing variables in X. If d remains unknown, care may at least be taken to increase the probability that it varies independently of X.

If more than one systematic phenomenon in Y remains unmodeled after ANOVA, multivariate regression or linear discriminant analysis, there would of course be several major principal components in F̂A. Additional knowledge or assumptions about the system's constituents may then be needed in order to identify the individual phenomena, i.e. using Multivariate Curve Resolution of some sort to resolve the resulting bilinear ambiguity in the components. This may e.g. include projection on a database of spectra of potential interferences, or involve more generic structural assumptions, such as non-negativity

and simplex-intersect structure [7], monotonicity and unimodality [8], as well as pursuing "interestingness" in terms of simple component structure (e.g. Varimax rotation [9]) or non-normality (e.g. Independent Component Analysis [10]).

The informative converse is interesting in conjunction with so-called L-PLS regression [11–13], in which e.g. a matrix Y (n × q) may be regressed both column-wise on a matrix X (n × p), as in Eq. (5b), and row-wise on another matrix Z (m × q). Since the two alternating regressions take place in different information modes (e.g. sample rows or variable columns), different deflation and residual-analysis strategies may utilize the converse paradox differently. Subsequent PCA of the residuals in the different matrices is then possible. Since the three-matrix L-PLS generalizes into the multi-matrix Domino-PLS [14], the converse paradox applies also there. The paradox is probably related to Andersson's Direct Orthogonalization [15], but that remains to be elucidated.

The analyte spectra and the interference concentrations are so tightly inter-related that a meaningful group acting upon both can probably be defined. The analyte concentrations and interference spectra relate to another group. The formal definition of these groups is now in the process of being established.

4.4. Generalization of the informative converse paradox

If some of the lack of fit in Y is due to non-linear X–Y relationships inadequately described by the linear modeling, a more careful analysis is required. There are many ways for the X–Y relationship to deviate from linearity, and some of them may be difficult to describe within the present framework. But at least the simplest non-

Fig. 9. Multivariate calibration for c = gluten conc. a) Calibration coefficient spectrum obtained by non-centered PLS regression (solid line) and by linear unmixing (x). b) Calibration coefficient spectrum obtained by centered PLS regression (solid line) and PCR (•). c) Analyte concentrations c in the test set, true vs. predicted from Y (NIR), for non-centered PLS regression (□) and linear unmixing (x). d) Analyte concentrations c in the test set, true vs. predicted from Y (NIR), for centered PLS regression (□) and PCR (•).

linearity case is where a non-linear transform f(·) applies directly to the concentrations in the causal model

Y = f(X)B + f(d)z′ + F.     (5c)

Then a non-linear (e.g. polynomial) regression of Y on X and a subsequent SVD of the residuals in Y may still yield the expected spectrum z of the interference by PCA of the Y-residuals. Work is in progress to elucidate if and how this applies in cases with various other types of non-linear causal relationships.

Theoretically, the Informative Converse paradox should also apply to N-way modeling of tensor data: lack-of-fit residuals should contain pristine information about unexpected structures in the N-way directions not used for regression in the model fitting. The paradox should also apply to meta-models of non-linear ODE and PDE models. But this remains to be tested.

The Informative Converse paradox may also apply to general learning systems based on multivariate observation. Information in the column mode may be regarded as the rules governing how attributes are related; this may be described by scientific theory. In contrast, the row mode then represents our experience of how different parts of the world actually seem to resemble each other in time and space. The constituent "spectra" s and z could then represent theoretical descriptions of how different observed variables are related to each other by the phenomena affecting the system, while c and d could describe the empirical patterns of how the phenomena manifest themselves under different conditions. That may indicate why both explorative and confirmative modeling are useful.

5. Conclusion: only the unknown can be seen clearly

The main claim of this paper is: when data are interpreted by an unconstrained least squares fit to an over-simplified causal model, the presence of unanticipated, and hence unmodeled, phenomena in the data can give serious alias errors in the estimated model parameters. But by subsequent multivariate residual analysis, the unanticipated phenomena in the data can be estimated without alias errors, although only with respect to the unknown parameters that are the converse of the explanatory model inputs. The Informative Converse paradox, first briefly reported by Martens and Næs [1], has here been described mathematically and demonstrated on artificial and real mixture data. It points to the importance of ensuring sufficiently high-dimensional observation, not only in terms of the number of cases or objects n, but also in the number of variables q. This paper demonstrates mathematically the importance of analyzing the multivariate lack-of-fit residuals by multivariate analysis. It is truly a paradox that the one aspect of a system which we seemingly know the least about a priori — the converse of our prior knowledge — is the ONLY parameter that can be safely estimated in incompletely specified linear systems, unless the different phenomena affecting the system data Y are truly independent.


Acknowledgments

Morten Bech Rye is thanked for cooperation in developing the optimized EMSC, as part of his M.Sc. thesis. This study was supported by the National Program for Research in Functional Genomics in Norway (FUGE) (RCN grant no. NFR151924/S10), by the Norwegian eScience program (eVITA) (RCN grant no. NFR178901/V30) and by the Agricultural Food Research Foundation of Norway.

Appendix A. Informative converses in the case of more analytes and interferences

The Informative Converse paradox was only illustrated for simple two-constituent mixtures. But it applies also to more complex mixtures. If there are p analytes, with concentrations C (n × p) = [c1, c2, …, cp] and spectra S (q × p) = [s1, s2, …, sp], the additive mixture model in Eq. (1a) may be rewritten

Y = CS′ + F.     (A.1a)

If the analyte spectra S are known, the p analytes are then resolved from each other by the multicomponent version of Eq. (1b):

Ĉ = YS(S′S)⁻¹.     (A.1b)

If instead their concentrations C are known, the analyte spectra can be resolved from each other by the multicomponent version of Eq. (1c):

Ŝ = Y′C(C′C)⁻¹.     (A.1c)
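As a numerical sanity check, the two multicomponent estimators (A.1b) and (A.1c) recover C and S exactly when the mixture model (A.1a) is complete and noise-free. A minimal sketch in Python/NumPy (all variable names are illustrative, not from any original software):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 3, 50                      # samples, analytes, spectral channels
S = rng.uniform(0.0, 1.0, (q, p))        # known analyte spectra, one per column
C = rng.uniform(0.0, 1.0, (n, p))        # true analyte concentrations
Y = C @ S.T                              # noise-free additive mixture (A.1a)

# Eq. (A.1b): row-wise least-squares unmixing with known spectra S
C_hat = Y @ S @ np.linalg.inv(S.T @ S)

# Eq. (A.1c): column-wise estimation of the spectra with known concentrations C
S_hat = Y.T @ C @ np.linalg.inv(C.T @ C)
```

With F = 0 both estimates are exact up to floating-point rounding; the interesting behaviour only appears once an unmodeled interference is added, as in Eq. (A.2a) below.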

But assume that, unknown to us, Eq. (A.1a) is over-simplified; the true causal structure behind Y is

Y = CS′ + DZ′ + F     (A.2a)

because Y also contains clear contributions from m unknown interferences with concentrations D (n × m) = [d1, d2, …, dm] and spectra Z (q × m) = [z1, z2, …, zm]. If these interference spectra Z overlap with the analyte spectra S, and/or the interference concentrations D overlap with the analyte concentrations C, then the analytes are not properly resolved from the interferences. In analogy to Eqs. (3)–(4), this can be seen by writing

Z = SA + Z⊥S     (A.2b)

and

D = CB + D⊥C.     (A.2c)
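The alias error caused by spectral overlap, and the converse information left in the residuals, can be simulated in a few lines. The sketch below (Python/NumPy; all names and the synthetic Gaussian band shapes are illustrative assumptions) fits the over-simplified model (A.1b) to noise-free data generated by (A.2a), then decomposes the residuals by SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 30, 50
wl = np.linspace(0.0, 1.0, q)

# Known analyte spectra S (two bands) and one unknown interference
# spectrum z that overlaps both of them:
S = np.column_stack([np.exp(-((wl - 0.3) / 0.1) ** 2),
                     np.exp(-((wl - 0.7) / 0.1) ** 2)])
z = np.exp(-((wl - 0.5) / 0.15) ** 2)

C = rng.uniform(0.0, 1.0, (n, 2))            # true analyte concentrations
d = rng.uniform(0.0, 1.0, n)                 # unknown interference concentrations
Y = C @ S.T + np.outer(d, z)                 # true structure (A.2a), noise-free

# Over-simplified unmixing (A.1b): the overlap a = (S'S)^(-1) S'z aliases
# the interference into the concentration estimates.
C_hat = Y @ S @ np.linalg.inv(S.T @ S)
alias = np.abs(C_hat - C).max()              # clearly non-zero

# Residual analysis: F1 = Y - C_hat S' satisfies F1 S = 0.  An SVD of F1
# recovers the interference CONCENTRATIONS d exactly (the informative
# converse), while the loading only recovers z orthogonalized to S.
F1 = Y - C_hat @ S.T
U, sv, Vt = np.linalg.svd(F1, full_matrices=False)
r_scores = abs(np.corrcoef(U[:, 0], d)[0, 1])    # ~1: d recovered
r_loading = abs(np.corrcoef(Vt[0], z)[0, 1])     # <1: z distorted by overlap
```

Running this, `r_scores` is 1 to machine precision while `r_loading` is clearly below 1: after row-wise regression on known spectra, only the row-mode (concentration) side of the interference is recovered without distortion.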

Inserting expression (A.2b) or (A.2c) into the true mixture model (Eq. (A.2a)) shows that the estimates from Eq. (A.1b) or (A.1c) must be expected to give erroneous results. The size of these alias errors of course depends on the strength of the overlap matrices A and B. However, the advantage of the informative converse is that the m interferences may then be studied by multivariate decomposition of the residual matrix F̂1 = Y − ĈS′ after having applied Eq. (A.1b), with the property F̂1S = 0, or F̂1 = Y − CŜ′ after Eq. (A.1c), with the property F̂1′C = 0. An SVD of F̂1,

F̂1 = TV′ + F̂     (A.3)

will correctly identify an important aspect of the unexpected interference contribution DZ′ in terms of the bilinear subspace model TV′. The informative converse subspaces D or Z of the interferences will be correctly identified by this process, except for the effect of random noise in F. But identifying the individual concentration vectors [d1, d2, …, dm] and spectra [z1, z2, …, zm] is now subject to a slightly more serious "rotational" ambiguity in the SVD: with several (m) unexpected interferences clearly affecting Y, their unknown concentrations and spectra may still be studied by SVD of the residuals. After applying Eq. (5a) or (5b), there will now be several unknown variation patterns in the corresponding residual matrix F̂1, and these can be extracted as several statistically valid components by SVD. The subspace TV′ of components is expected to span the subspace of the unexpected interferences, DZ′. But we would have an ambiguity when estimating the individual parameters in D and Z, due to affine transformation ("axis rotation"), since TV′ = (TR⁻¹)(RV′), where R is an arbitrary but invertible matrix. Even this ambiguity may be resolved by Multivariate Curve Resolution [7–10], if additional assumptions about the system can be justified. Other than that, the basic Informative Converse paradox remains the same: the unknown parameter type that we know the least about is the one that we can estimate the most clearly — to the extent that it affects the multichannel measurements in Y.

Appendix B. Preprocessing to correct for both additive and multiplicative interference problems by Optimized Extended Multiplicative Signal Correction (OEMSC)

Near-independence between model spectra and other constituent spectra, s′z ≈ 0, was originally assumed by the author and colleagues when developing the model-based Multiplicative Signal Correction (MSC) preprocessing. As shown in Ref. [4], this can cause gross alias errors, which led to the development of the EMSC preprocessing, which was used here for removing strong light scattering from diffuse NIR spectra.
The present optimization and subsequent use of the EMSC model was based on diffuse transmission NIR measurements from [4], in which transmitted light is measured at q = 100 wavelength channels in the range 850–1050 nm. The transmittance T was measured for a range of powder samples and first linearized by the conventional absorbance transform Yinput = log10(1/T) for each wavelength in each sample, according to Beer's law. However, due to widely varying thickness and powder compression in the samples, the obtained NIR spectral data table Yinput shows gross effects of light scattering in addition to the chemical variations (see e.g. Fig. 6a). The two powder sources are expected to have different particle size and refractive index, so light scattering in itself is not a problem. But the variations due to sample thickness and powder compression give scattering interferences that needed to be eliminated in order for this data set to illustrate the Informative Converse paradox. To estimate and separate out most of the light scattering effect, EMSC was therefore applied to the data. A normal EMSC model [4] may be written

yi,input′ ≈ βi y0′ + αi1′ + δiλ′     (B-1a)

where each input spectrum yi′ (1 × q), i = 1, 2, …, n, is modeled in terms of a reference spectrum y0′ scaled by the multiplicative parameter βi, plus an additive baseline consisting of a constant offset αi1′ and a slanting effect δiλ′. A least squares projection of yi′ on [y0, 1, λ]′ yields, after some rearrangement, estimates of the parameters αi, δi and βi, which are then used for the combined additive/multiplicative correction

yi,corrected′ = (yi,input′ − α̂i1′ − δ̂iλ′) / β̂i.     (B-1b)

This EMSC model was found to give a selective, but somewhat nonlinear response to analyte (gluten) concentration c. The EMSC was therefore optimized with respect to response linearity.
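The row-wise projection and correction of Eqs. (B-1a)–(B-1b) can be sketched in a few lines. The Python/NumPy toy below (synthetic spectra; all function and variable names are illustrative, not from the original software) maps scaled and baseline-shifted copies of a reference spectrum back onto that reference:

```python
import numpy as np

def emsc(Y, y0, lam):
    """Basic EMSC sketch: fit each row of Y to [y0, 1, lam] by least squares
    (Eq. B-1a) and return the additively/multiplicatively corrected
    spectra (Eq. B-1b)."""
    q = len(y0)
    M = np.column_stack([y0, np.ones(q), lam])      # projection basis [y0, 1, lambda]
    coef, *_ = np.linalg.lstsq(M, Y.T, rcond=None)  # (3, n): beta, alpha, delta
    beta, alpha, delta = coef
    return (Y - alpha[:, None] - np.outer(delta, lam)) / beta[:, None]

# Synthetic check: rows built as beta*y0 + alpha + delta*lambda are all
# mapped exactly back onto y0 by the correction.
lam = np.linspace(0.0, 1.0, 100)
y0 = np.exp(-((lam - 0.5) / 0.2) ** 2)
rng = np.random.default_rng(1)
beta = rng.uniform(0.8, 1.2, 5)
alpha = rng.uniform(-0.1, 0.1, 5)
delta = rng.uniform(-0.1, 0.1, 5)
Y = beta[:, None] * y0 + alpha[:, None] + np.outer(delta, lam)
Y_corr = emsc(Y, y0, lam)
```

On real spectra the chemical variation also enters the projection, which is exactly the bias that motivates the OEMSC extension described next.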


The OEMSC consists in expanding the standard EMSC model to include a bilinear contribution γiw′, in order to prevent chemical information cs′ and/or dz′ in yi,corrected from affecting the estimates of the "physical" parameters αi, δi and βi, and hence from being lost in the correction (B-1b). The OEMSC model is written

yi,input′ ≈ βi y0′ + αi1′ + δiλ′ + γiw′     (B-1c)

where w′ (1 × q) is a "linearization spectrum". When w is known, the parameters αi, δi, βi and γi may be estimated by projection of yi,input′ on [y0, 1, λ, w]′, and Eq. (B-1b) is then used for the correction. However, the linearization spectrum w′ is unknown and must somehow be estimated. The present OEMSC is based on assuming that w is to be found within the main covariance structure seen in the available data Yinput. Most of the variation in Yinput was here described by only 5 principal components. Hence, w was defined as an unknown linear combination of the first 5 principal component loadings P (q × 5) of Yinput:

w′ = τP′     (B-1e)

The five unknown score values in the vector τ (1 × 5) were estimated, once and for all, by SIMPLEX optimization of a criterion reflecting the predictive ability of Y (n × q) w.r.t. c (n × 1) = gluten content in a set of n linearization samples. The criterion was computed as the Mean Squared Error of Prediction of c from Y (repeatedly estimated, for computational speed, by leverage-corrected PCR [1]), weakly penalized for rank deficiency in the matrix [y0, 1, λ, w]′ to ensure that w does not become collinear with [y0, 1, λ]. To avoid concerns about overfitting, the linearization spectra used for estimating the vector τ (Fig. 6a–b) were not used again for estimating spectra or regression coefficients.
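The optimization loop can be sketched on synthetic data. The Python sketch below is a simplified stand-in, not the original procedure: SciPy's Nelder–Mead plays the role of the SIMPLEX optimizer, and a plain one-term prediction error of c replaces the leverage-corrected PCR MSEP with rank-deficiency penalty; all names and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, q = 40, 60
lam = np.linspace(-1.0, 1.0, q)
y0 = np.exp(-lam ** 2)                        # reference spectrum
s = np.exp(-((lam - 0.4) / 0.2) ** 2)         # analyte spectrum
c = rng.uniform(0.0, 1.0, n)                  # analyte concentrations
beta = rng.uniform(0.8, 1.2, n)               # multiplicative scatter
alpha = rng.uniform(-0.1, 0.1, n)             # additive offsets
Y = beta[:, None] * (y0 + np.outer(c, s)) + alpha[:, None]

# First 5 PC loadings of Y: the basis for the linearization spectrum (B-1e)
P = np.linalg.svd(Y - Y.mean(0), full_matrices=False)[2][:5].T   # (q, 5)

def correct(tau):
    """EMSC correction with w = P tau included in the projection basis;
    only the offset/slope terms are subtracted, as in Eq. (B-1b)."""
    M = np.column_stack([y0, np.ones(q), lam, P @ tau])
    coef, *_ = np.linalg.lstsq(M, Y.T, rcond=None)
    return (Y - coef[1][:, None] - np.outer(coef[2], lam)) / coef[0][:, None]

def criterion(tau):
    """Stand-in criterion: squared error of a one-term linear prediction
    of c from the corrected spectra (the paper uses leverage-corrected
    PCR MSEP with a rank-deficiency penalty instead)."""
    Z = correct(tau)
    X1 = np.column_stack([np.ones(n), Z @ s])
    resid = c - X1 @ np.linalg.lstsq(X1, c, rcond=None)[0]
    return float(np.mean(resid ** 2))

res = minimize(criterion, np.zeros(5), method="Nelder-Mead")   # SIMPLEX stand-in
```

The optimized τ gives a criterion value no worse than the plain EMSC starting point τ = 0, illustrating how the linearization spectrum is tuned once and then reused.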

References

[1] H. Martens, T. Næs, Multivariate Calibration, J. Wiley & Sons Ltd, Chichester, UK, 1989.
[2] H. Martens, S. Wold, PLS tools for exploring causalities, Introductory Lecture, PLS'07 — The 5th International Symposium on PLS and Related Methods: Causalities Explored by Indirect Observation, Ås, Norway, Sept 4–9, 2007.
[3] B. Sheldrick, Computer analysis of protein mixtures, Biochem. J. 123 (5) (1971) 996.
[4] H. Martens, J. Pram Nielsen, S. Balling Engelsen, Light scattering and light absorbance separated by extended multiplicative signal correction. Application to near-infrared transmission analysis of powder mixtures, Anal. Chem. 75 (3) (2003) 394–404.
[5] S. Wold, H. Martens, H. Wold, The multivariate calibration problem in chemistry solved by the PLS method, in: Matrix Pencils, Lecture Notes in Mathematics 973, Springer Verlag, 1983, pp. 286–293.
[6] N. Kettaneh-Wold, Analysis of mixture data with partial least squares, Chemom. Intell. Lab. Syst. 14 (1992) 57–69.
[7] H. Martens, Factor analysis of chemical mixtures: non-negative factor solutions for spectra of cereal amino acids, Anal. Chim. Acta 112 (4) (1979) 423–442.
[8] R. Tauler, B.R. Kowalski, S. Fleming, Multivariate curve resolution applied to spectral data from multiple runs of an industrial process, Anal. Chem. 65 (15) (1993) 2040–2047.
[9] H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis, Psychometrika 23 (3) (1958) 187–200.
[10] P. Comon, Independent component analysis, a new concept? Signal Process. 36 (3) (1994) 287–314.
[11] S. Wold, S. Hellberg, T. Lundstedt, M. Sjöström, H. Wold, PLS modelling with latent variables in two or more dimensions, in: Proceedings, PLS Model Building: Theory and Applications, Symposium Frankfurt am Main, September 1987.
[12] H. Martens, E. Anderssen, A. Flatberg, L.H. Gidskehaug, M. Høy, F. Westad, A. Thybo, M. Martens, Regression of a data matrix on descriptors of both its rows and of its columns via latent variables: L-PLSR, Computational Statistics and Data Analysis (PLS'01 Special Issue) 48 (2005) 103–123.
[13] S. Sæbø, T. Almøy, A. Flatberg, A.H. Aastveit, H. Martens, LPLS-regression: a method for improved prediction and classification through inclusion of background information on predictor variables, Chemom. Intell. Lab. Syst. 91 (2) (2008) 121–132.
[14] H. Martens, Domino PLS: a framework for multi-directional path modelling, in: T. Aluja, J. Casanovas, V.E. Vinzi, A. Morineau, M. Tenenhaus (Eds.), Proc. PLS'05 Intl Symposium "PLS and Related Methods", SPAD Groupe Test&Go, 2005, pp. 125–132.
[15] C. Andersson, Direct orthogonalization, Chemom. Intell. Lab. Syst. 47 (1) (1999) 51–63.