Chemometrics and Intelligent Laboratory Systems 130 (2014) 1–5
Contents lists available at ScienceDirect
Chemometrics and Intelligent Laboratory Systems journal homepage: www.elsevier.com/locate/chemolab
Modeling and correction of Raman and Rayleigh scatter in fluorescence landscapes Paul H.C. Eilers a,⁎, Pieter M. Kroonenberg b a b
Erasmus University Medical Center, Rotterdam, The Netherlands Department of Child & Family Studies, Leiden University, Leiden, The Netherlands
a r t i c l e
i n f o
Article history: Received 9 January 2013 Received in revised form 3 September 2013 Accepted 5 September 2013 Available online 17 September 2013 Keywords: Smoothing Background subtraction Rayleigh scatter Raman scatter Fluorescence excitation–emission spectroscopy
a b s t r a c t Rayleigh and Raman scatter in fluorescence (excitation–emission) landscapes are a nuisance in two-way and three-way data modeling. We provide a method to clean individual emission spectra. The scatter can be represented accurately by Gaussian peaks, characterized by location, width and height. The analytic signal of interest effectively acts as a background to the scatter peaks. Modeling it locally as a smooth curve, using penalized least squares, allows accurate estimation of the parameters of scatter peaks. Once the peaks are modeled, they can be subtracted from the spectrum, almost completely removing the artifacts. Apart from local smoothness, no assumptions are made about the fluorescence spectra. © 2013 Elsevier B.V. All rights reserved.
1. Introduction One of the success stories of multi-way models is their application to chemical spectra [1,2]. Here we consider fluorescence excitation– emission spectroscopy in which the components, representing chemical species, are known to be bilinear. Unfortunately, Raman and Rayleigh scatter usually are present. They can be quite strong, ruining the good fit of a model to the data. Several techniques have been proposed for handling scatter in twoway and especially three-way fluorescence data when the intention is to model the signal. Most available methods [3–5] attempt to eliminate the offending artifacts. In Bahram et al. [4] the following methods are mentioned: downweighting of the scatter region [6–8], subtraction of a standard [9], inserting missing values for the scatter [10], inserting zeroes outside the data area [8], avoiding the part of the data matrix which includes the scatter, applying constraints to the model for the signal [11,12], and modeling the scatter via two-way component models [8]. Their own proposal is to eliminate the scatter where signal and scatter overlap via interpolation, using the model to be fitted, thus intimately linking the handling of scatter to the fitting of the signal. In addition, Engelen et al. [3] proposed to treat each point of the scatter as an outlying observation with respect to the model and to eliminate it if it is too far away from the main body of the data. Attempts have also been made to indirectly model the scatter and the signal more or less independently using component models, especially within the context of three-way ⁎ Corresponding author. E-mail addresses:
[email protected] (P.H.C. Eilers),
[email protected] (P.M. Kroonenberg). 0169-7439/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.chemolab.2013.09.002
data [8]. Each of the mentioned papers gives further comments and references about the background to the problems and ways in which attempts have been made to solve the fitting of the signal in the presence of scatter. Note that all but the papers by McKnight et al. [9] and Zepp et al. [5] deal with modeling three-way data, indicating the need for appropriately handling scatter ridges in that context. The proposal in the present paper attacks the core of the problem, in the sense that it defines explicit and well-fitting models for both the Raman and Rayleigh scatter, making relatively mild assumptions. In particular, the assumption is that the ridges in the landscape can be modeled with separate normal distributions along the excitation wavelength. With this approach it will also be possible to assess in great detail the nature of the scatter itself. Due to the regularity in the scatter, estimates of the fluorescence intensity can be made at the same time for those locations in the landscape where it overlaps with the signal. The result of our fitting procedure is that the scatter in the excitation– emission matrix landscape can be eliminated in a very precise manner, without compromising the signal itself. Moreover, the estimation is independent of the type of signal and how it is being modeled. Thus, we use the characterization that Data = Scatter + Signal + Noise, and our concern here is exclusively with the Scatter part. Our procedure cleans individual emission spectra, so it is also useful outside the realm of multi-way data analysis. 2. Models for scatter Figs. 1 and 2 show a typical excitation–emission fluorescence landscape. Rayleigh scatter (both primary and secondary), as well as Raman scatter are visible as diagonal ridges. The primary and secondary Rayleigh
P.H.C. Eilers, P.M. Kroonenberg / Chemometrics and Intelligent Laboratory Systems 130 (2014) 1–5
sqrt (Intensity)
2
15 10 5
0 320 310 300 290 280 270
500 450
260
400
250
350 300
Excitation (nm) 240 230
250 200
Emission (nm)
Fig. 1. Waterfall view of an excitation–emission landscape. The diagonal ridges, from left to right, represent artifacts caused by primary Rayleigh scatter, Raman scatter and secondary Rayleigh scatter. To improve visibility of the peaks, the vertical scale has been set proportional to the square root of fluorescence intensity. The three spectra drawn with thicker lines are shown individually in Fig. 3.
scatter ridges center quite closely to the lines λem = λex and λem = 2λex, respectively. Raman scatter also forms a straight ridge, but its center line does not follow such a simple relationship: slope and offset vary between data sets. The data were obtained from the Department of Food Science, in the Faculty of Life Sciences of the University of Copenhagen (http://www. models.kvl.dk/Fluorescence). See also Bro et al. [13]. Because the secondary Rayleigh scatter manifests itself in a region where there is essentially no analytic signal, we can get rid of it very simply, by limiting the excitation wavelength to values below 450 nm. The perspective view already gives the impression that the observed scatter signals have a Gaussian shape, along the emission wavelength. This impression is amplified by Fig. 3, where we show cuts through the fluorescence landscape at three excitation wavelengths. Our strategy will be to model the scatter in individual emission spectra by
Sample 4, replicate 1 320 310
Excitation wavelength (nm)
300 290
Gaussian curves using non-linear regression. Three parameters describe such a curve completely: position, height and width. Assuming that good starting values are provided, fitting of Gaussian curves to data is not difficult. The problem is that we do not only have the Gaussian curves but also the signal. In the upper and lower panels of Fig. 3 the scatter peaks are clearly separated from the signal, or the signal under the peak is weak. The middle panel shows a less favorable situation. Both primary Rayleigh and Raman scatter are superimposed on a strong signal. Trying to fit a Gaussian curve locally will lead to useless results. To solve this problem, we locally model the signal as a smooth baseline. Locally means that for each excitation wavelength and each type of scatter we select an appropriate section of the emission wavelengths, which is known to contain the scatter peak. Because of the linear trend of the centers of the scatter peaks and their stable widths, it is not hard to define such windows. Details will be presented below. The model for the fluorescence intensity is, at excitation wavelength j and emission wavelength i: yij ¼ bij þ g xi ; α j ; μ j ; σ j þ eij ;
ð1Þ
where bij represents the signal as a smooth baseline, xi the emission wavelength, and eij random noise. The Gaussian shape is described by the function g; exp αj, gives its height, while μj determines its location and σj its width, as can be seen from its definition: 0 2 1 x −μ i j B C g xi ; α j ; μ j ; σ j ¼ exp@α j − A: 2σ 2j
To simplify the equations, we drop the subscript pffiffiffiffiffiffi j when no confusion can occur. Note that we do not divide g by σ 2π as is the case for a normal density. Our objective function is S¼
X i
2
ðyi −bi −g ðxi ; α; μ; σ ÞÞ þ λ
X 2 2 Δ bi :
ð3Þ
i
The first term in Eq. (3) is the familiar sum of squares of differences between data and model. The second term is a penalty on roughness of b, based on second order differences of b; note that Δ2bi = bi − 2bi − 1 + bi − 2. When b is smooth, second order differences will be relatively small and this term will not contribute much to the objective function. The parameter λ balances fidelity to the data and smoothness of the baseline. The penalty term is inspired by the Whittaker smoother [14]. The objective function is non-linear in the parameters α, μ and σ. If we consider small changes δα, δμ and δσ, the first-order Taylor expansion gives: 2 g ðx; μ þ δμ; σ þ δσ Þ≈gðx; μ; σ Þ δα þ uδμ=σ þ u δσ=σ ;
280
ð2Þ
ð4Þ
270 260 250 240 230 250
300
350
400
450
500
Emission wavelength (nm) Fig. 2. Image view of the excitation–emission landscape. The diagonal ridges, from left to right, represent artifacts caused by primary Rayleigh scatter, Raman scatter and secondary Rayleigh scatter. To improve visibility, the color scale has been set proportional to the square root of fluorescence intensity.
where u = (x − μ) / σ. Using the partial derivatives, the non-linear regression problem becomes linear in δα, δμ and δσ and will be easy to solve. With proper starting values it converges in a handful of iterations. If no baseline were present, this is an effective approach to fit a Gaussian shape to scatter peaks. A simple way to handle the baseline is to use back-fitting. Alternatingly one fits the Gaussian peak after correction for the baseline and one recomputes the baseline (by Whittaker smoothing) after removing the peak. Convergence is relatively slow: about 100 iterations can be needed to arrive at a precision of four significant figures. On the other hand, on a modern PC, one iteration takes about a millisecond, so it takes only a few seconds to correct one type of scatter for all excitation wavelengths.
P.H.C. Eilers, P.M. Kroonenberg / Chemometrics and Intelligent Laboratory Systems 130 (2014) 1–5
Good starting values are crucial. The first step is to estimate the baseline by asymmetric least squares [15]. The maximum after subtraction of the baseline estimate gives a good starting value for α, the logarithm of the height of the peak. For (secondary) Rayleigh scatter we set the initial value of μ equal to (twice) the excitation wavelength and σ equal to 5. For Raman scatter the same procedure can be applied, except for the center. From an image like the one in Fig. 2 it is easy, for humans, to read off the center at minimum and the maximum excitation wavelength. Linear interpolation then gives a reliably starting value for any excitation wavelength. The window should be wide enough to include parts of the signal on both sides of a scatter peak, to get a reliable baseline estimate. On the other hand we wish to model the signal only locally as a smooth baseline, and we should avoid taking the window so wide that it contains both Rayleigh and Raman scatter peaks. In addition we found it effective to use e only the observations in the central part of the window, with j x−μe =σ b2, for fitting the Gaussian peak. We observed that when scatter signals are weak, the algorithm can diverge hopelessly, because the Gaussian peak is not pronounced enough to determine its parameters unequivocally. To stabilize the procedure we introduce penalties that prevent the parameter estimates from straying away too far from the initial values. If we let θ = [αμσ] 2 3 be the vector of coefficients, the penalty is given by ∑ j¼1 κ j θ j − θ j . The elements of θ are chosen by inspection of the fluorescence landscape. The penalty weights are κ1 = 0.01, κ2 = 0.1 en κ3 = 10. In our experience these values are not very critical. Fig. 4 illustrates the results of Rayleigh scatter correction for five different values of the excitation wavelength. After the correction for both Rayleigh and Raman scatter, the fluorescence landscape in Fig. 1 looks as shown in Fig. 5. The artifacts have been annihilated to a high degree. As we have indicated, human interaction is needed, to set some initial values. We have not tried to automate all steps of our algorithm. We expect users to experiment with parameter values before embarking on large-scale automatic correction efforts.
3
3. Discussion Rayleigh and Raman scatter peaks can be accurately modeled as Gaussian peaks (the scatter) on top of a local smooth baseline (the analytic signal). Both components can be estimated reliably by a combination of penalized non-linear regression for the peak parameters (location, width and height) and penalized linear regression for the local signal. Two penalties are used: one to enforce smoothness of the signal estimate, the other to keep peak parameters within a desired range. No assumptions, except local smoothness, are being made on the structure of the analytic signal. Once their parameters have been estimated, peaks can be subtracted from the spectra. Equivalently the local smooth signal estimate can be filled in. In both cases very good correction is obtained for the example spectra we present here. No data points are lost and no interpolation is needed. Individual emission spectra can be corrected, because no model (like a three-way structure for a set of spectra) for the data is needed. We removed the influence of secondary Rayleigh scatter by limiting the excitation wavelength. If that is unacceptable, it can be eliminated by our procedure too. After correction straightforward three-way modeling can take place, because a complete and cleaner data array is obtained. No weights or robust algorithms are needed. This is an attractive advantage. The improvement in model fit, after cleaning, depends strongly on the quality of the data at hand. We have not tried to make formal comparisons to other procedures, in order to quantify improvements. There are too many options to consider. Although we developed our procedure in the context of three-way modeling, it has wider applicability. The emission spectrum is being corrected for each excitation wavelength separately. No bilinear or trilinear structure is assumed. An assumption of our procedure is that the emission spectrum is smooth, compared to the scatter peaks. This is what we have observed in practice. But in special cases, with a strongly varying signal, one should expect a reduction in performance.
Sample 4; excitation = 320 20 15 10 5 0 200
250
300
350
400
450
500
400
450
500
400
450
500
Sample 4; excitation = 275 200 150 100 50 0 200
250
300
350
Sample 4; excitation = 230 150 100 50 0 200
250
300
350
Emission wavelength Fig. 3. Cross sections of the excitation–emission landscape at three excitation wavelengths.
4
P.H.C. Eilers, P.M. Kroonenberg / Chemometrics and Intelligent Laboratory Systems 130 (2014) 1–5
Data and estimated baseline 200
Corrected data and scatter 30
ex = 230
20
100
0
10 0 250
260
270
280
200
0 250
260
270
280
30 ex = 235
270
280
290
200 ex = 240
0 250
260
270
280
290
270
280
290
200
260
270
280
290
0
0 260
270
280
290
30 ex = 245
−2 260
270
280
290
2
20
100
0
10 280
290
300
200
0 260
270
280
290
300
30 ex = 250
−2 260
270
280
290
300
2
20
100
0
10 0 270
−2 250 2
10
270
280
20
100
0 260
270
0
30
0 260
260
2
10 260
−2 250
20
100 0 250
Residuals 2
280
290
300
Emission wavelength (nm)
0 270
280
290
300
−2 270
Emission wavelength (nm)
280
290
300
Emission wavelength (nm)
Fig. 4. Results of Rayleigh and scatter correction by the proposed method. The left panels show the data and the estimated baseline curves for five excitation wavelengths. The middle panels present the corresponding baseline-corrected data and the fitted Gaussian curve. The right panels show the residuals.
We assume that the scatter signal can be modeled well by a Gaussian curve and our results support this. But it might not always be the case. To cater for asymmetry and longer tails, the skew-t distribution could be a candidate [16], or a non-parametric peak model might be considered [17].
As we have shown, a very good correction can be obtained. It is conceivable that with more advanced models for the parameters scatter peaks might work even better. The parameters of the primary and secondary Rayleigh peaks will be related, suggesting their joint estimation. We have not pursued this line of research. Acknowledgment
Sample 4, replicate 1
Thanks to Rasmus Bro for setting up the KVL website with many valuable data sets, from which we obtained our example data. References
100 50 0 320
450 400
300
350
280
300
260
Excitation (nm)
240
250
Emission (nm)
Linear
Sample 4, replicate 1
100 50 0 320
450 400
300
350
280
300
260
Excitation (nm)
240
250
Emission (nm)
Fig. 5. Waterfall plots of the original fluorescence landscape (top) and of the results of the primary Rayleigh and Raman scatter correction by the proposed method (bottom).
[1] A. Smilde, R. Bro, P. Geladi, Multi-way Analysis with Applications in the Chemical Sciences, Wiley, Chicester, UK, 2004. [2] P.M. Kroonenberg, Applied Multiway Data Analysis, Wiley, Hoboken, NJ, 2008. [3] S. Engelen, S.F. Møller, M. Hubert, Automatically identifying scatter in fluorescence data using robust techniques, Chemom. Intell. Lab. Syst. 86 (2007) 35–51. [4] M. Bahram, R. Bro, C. Stedmon, A. Afkami, Handling Rayleigh and Raman scatter for PARAFAC modeling of fluorescence data using interpolation, J. Chemom. 20 (2006) 99–105. [5] R.G. Zepp, W.M. Sheldon, M.A. Moran, Dissolved organic fluorophores in southeastern US coastal waters: correction method for eliminating Rayleigh and Raman scattering peaks in excitation–emission matrices, Mar. Chem. 89 (2004) 15–36. [6] R. Bro, N.D. Sidiropoulos, A.K. Smilde, Maximum likelihood fitting using simple least squares algorithms, J. Chemom. 16 (2002) 387–400. [7] R.D. Jiji, K.S. Booksh, Mitigation of Rayleigh and Raman spectral inferences in multiway calibration of expectation–emission matrix fluorescence spectra, Anal. Chem. 72 (2002) 718–725. [8] Å. Rinnan, K.S. Booksh, R. Bro, First order Rayleigh scatter as a separate component in the decomposition of fluorescence landscapes, Anal. Chim. Acta 537 (2005) 349–358. [9] D.M. McKnight, E.W. Boyer, P.K. Westerhoff, P.T. Doran, T. Kulbe, D.T. Andersen, Spectrofluorometric characterisation of dissolved organic matter for indication of precursor organic matter and aromaticity, Limnol. Oceanogr. 46 (2001) 36–48. [10] J. Christensen, V.T. Povlsen, J. Sørensen, Application of fluorescence spectroscopy and chemometrics in the evaluation of cheese during storage, J. Dairy Sci. 86 (2003) 1101–1107. [11] C.M. Andersen, R. Bro, Practical aspects of PARAFAC modelling of fluorescence excitation–emission data, J. Chemom. 17 (2003) 200–215.
P.H.C. Eilers, P.M. Kroonenberg / Chemometrics and Intelligent Laboratory Systems 130 (2014) 1–5 [12] R. Bro, N.D. Sidiropoulos, Least squares algorithms under unimodality and nonnegativity constraints, J. Chemom. 12 (1998) 223–247. [13] R. Bro, A. Rinnan, N.M. Faber, Standard error of prediction for multilinear PLS 2. Practical implementation in fluorescence spectroscopy, Chemom. Intell. Lab. Syst. 75 (2005) 69–76.
5
[14] P.H.C. Eilers, A perfect smoother, Anal. Chem. 75 (2003) 3631–3636. [15] P.H.C. Eilers, Parametric time warping, Anal. Chem. 76 (2004) 404–411. [16] A. Azzalini, A. Capitanio, Distributions generated by perturbation of symmetry with emphasis on a multivariate skew-t distribution, J. R. Stat. Soc. Ser. B 65 (2003) 367–389. [17] P.H.C. Eilers, Unimodal smoothing, J. Chemom. 19 (2005) 317–328.