Analysis of raw EEM fluorescence spectra - ICA and PARAFAC capabilities



Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 205 (2018) 320–334


Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy journal homepage: www.elsevier.com/locate/saa

Jorge Costa Pereira*, Alberto A.C.C. Pais, Hugh D. Burrows

CQC, Department of Chemistry, University of Coimbra, Coimbra P-3004 535, Portugal

ARTICLE INFO

Article history: Received 12 April 2018; Received in revised form 3 July 2018; Accepted 8 July 2018; Available online 11 July 2018

Keywords: EEM; Fluorescence spectra; PARAFAC; ICA; Performance comparison; Semi-empirical simulations

ABSTRACT

Excitation-emission fluorescence spectroscopy is a versatile technique that has been used to detect, characterize and quantify residual Dissolved Organic Matter (DOM) in aquatic domains. PARAllel FACtor Analysis (PARAFAC) has been extensively used in the analysis of excitation-emission matrices (EEM), allowing for a better identification and quantification of the contributions resulting from spectral decomposition. In this work we have adapted Independent Component Analysis (ICA) to make it suitable for the analysis of three-way EEM datasets, and tested the performance of ICA and PARAFAC on three available datasets (Claus, Dorrit and drEEM). Semi-empirical simulation allowed us to assess the impact of (a) sample size, (b) signal sources, (c) composition dependencies and (d) the presence of unspecific signal contributions (e.g. light scattering) upon both algorithms. PARAFAC and ICA have similar performances in processing ideal three-way EEM datasets but, in the presence of non-trilinear responses, ICA leads to a more realistic approach, yielding a better decomposition of contributing sources and their identification and quantification. This makes this algorithm more suitable for the analysis of real, raw EEM data, without the need for preprocessing to remove any unspecific contributions.

© 2018 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail address: [email protected] (J. Costa Pereira).

https://doi.org/10.1016/j.saa.2018.07.025
1386-1425/© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Fluorescence is a very powerful technique, able to detect a large range of residual organic material dissolved in water [1]. It has the major advantages of low analytical costs, very rapid analysis, very high sensitivity and a large linear analytical range, which make it very convenient for the characterization of residual organic matter dissolved in aquatic environments [2]. The technique is versatile, since it allows the study and characterization of samples using different types of spectra: excitation spectra, where fluorescence is measured at a fixed wavelength while the excitation range is scanned; emission spectra, where a given excitation wavelength is used and the respective emission is measured; and synchronous fluorescence spectra, where excitation and emission wavelengths are scanned simultaneously while keeping their difference constant. In this work, sample information is enhanced by recording excitation-emission matrices (EEM) to produce a fluorescence surface. The decomposition of these EEM datasets allows the characterization and quantification of the dissolved organic matter (DOM) present in environmental samples and other matrices [2,3]. Advances in electronics and signal processing abilities are leading to the standard use of other techniques, such as time-resolved fluorescence [1], which can also use the two numerical techniques focused on here.

1.1. EEM Analysis

1.1.1. PARAFAC

PARAllel FACtor Analysis (PARAFAC) is a very powerful multivariate data analysis method suited for the decomposition of 3-way and 4-way multivariate systems into lower dimensional matrices [4,5]. Because of its ability to directly extract information with chemical meaning [5], PARAFAC is nowadays considered a standard algorithm to process EEM data [6]. In the case of three-dimensional EEM datasets, PARAFAC performs a three-way tensor decomposition

$$\mathrm{EEM} = A \otimes B \otimes C + U \tag{1}$$

If the first array dimension of EEM is related to the sample identification, matrix A contains the sample scores, and B and C the respective loadings, related to the excitation and emission spectra. U represents the unjustified residual information of the initial dataset. Solving Eq. (1) requires an alternating least-squares (ALS) approach, which permits the imposition of various restrictions [5]. In the case of EEM, non-negativity provides PARAFAC with the ability to directly retrieve information with chemical meaning. The main drawback of PARAFAC is its difficulty in dealing with non-trilinear information such as light scattering phenomena (Rayleigh and Raman) [5,7], although several approaches have been described in the literature to compensate for this deficiency [7-9].
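To make the trilinear structure of Eq. (1) concrete, the following minimal Python sketch builds a noise-free EEM array from a set of scores and loadings, so that each element is the sum over components of the product of one score and two loadings. The function name `trilinear_eem`, the toy dimensions and the numerical values are illustrative assumptions, not taken from this work, and plain Python lists are used only to keep the sketch free of dependencies.

```python
def trilinear_eem(A, B, C):
    """Build an (I, J, Q) array from scores A (I x k) and loadings B (J x k), C (Q x k).

    Each element is eem[i][j][q] = sum over f of A[i][f] * B[j][f] * C[q][f],
    i.e. the structural part of Eq. (1) without the residual U.
    """
    n_factors = len(A[0])
    return [
        [
            [sum(a_i[f] * b_j[f] * c_q[f] for f in range(n_factors)) for c_q in C]
            for b_j in B
        ]
        for a_i in A
    ]


# Toy example (hypothetical values): I = 2 samples, J = 3, Q = 2, k = 2 components.
A = [[1.0, 0.0], [0.5, 2.0]]              # sample scores
B = [[0.0, 1.0], [1.0, 0.5], [0.2, 0.0]]  # loadings of the second mode
C = [[1.0, 0.0], [0.5, 1.0]]              # loadings of the third mode
eem = trilinear_eem(A, B, C)
```

In this toy case the first sample contains only the first component, so its landscape is the outer product of the first columns of B and C. A contribution that cannot be written as such a product of one excitation profile and one emission profile is what the text calls non-trilinear information.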

1.1.2. Independent Component Analysis

Independent Component Analysis (ICA) is based on the Blind Source Separation (BSS) algorithm [10], and has been developed to deconvolute mixed signals by maximizing their respective independence [11]. Measured signals (X) may be decomposed into signal sources, representing specific individual signal contributions, and respective weights, using the following representation [12]

$$X = S \cdot A + V \tag{2}$$

where S and A stand for the loadings of the signal sources and the respective mixing information matrix (the scores), respectively. Similarly to Eq. (1), the matrix V corresponds to the unjustified residual information. In previous work, ICA signal deconvolution [13-15] was shown to be a very powerful method for dealing with spectral information, allowing the identification and quantification of the contributing species present in different systems. However, the ICA algorithm requires X to be a bidimensional (two-way) dataset. In order to process EEM information, ICA requires array unfolding as a preprocessing step and folding back as a post-processing step, i.e., an array reshape. In the absence of further restrictions, ICA solutions have to be carefully assessed before being converted to chemical information. Data simulation and the use of adequate blind samples are advisable in order to become familiar with ICA [13,14]. In this study, we aim to establish a parallel between PARAFAC and ICA to compare the methods and their modeling performances, and to establish the correspondence between Eqs. (1) and (2) in terms of loadings and scores.

2. Procedures

2.1. Datasets

The selected datasets are all representative of EEM information. The first two, Claus and Dorrit, correspond to very simple systems (synthetic mixtures prepared in the laboratory) in which the number of components and the respective composition are known, and are used as blind samples. In order to better evaluate the two algorithms, extra datasets are simulated using semi-empirical and other simulations.

2.1.1. Claus

The Claus dataset [5] consists of five simple laboratory-made samples (I = 5). Each sample contains different amounts of tryptophan, tyrosine and phenylalanine (k0 = 3) dissolved in phosphate buffered water. The samples were measured by fluorescence (excitation 240–300 nm, emission 250–450 nm, 1 nm intervals) on a PE LS50B spectrofluorometer with an excitation slit-width of 2.5 nm, an emission slit-width of 10 nm and a scan-speed of 1500 nm/s. This dataset is available online [16]. Fig. 1 represents all samples (I = 5) contained in the Claus dataset, with the respective mixture composition detailed in Table 1.

Fig. 1. Representation of excitation-emission fluorescence matrix (EEM) contour plots of samples (I = 5) contained in the Claus dataset; horizontal scale refers to excitation (240–300 nm) and vertical scale to emission (250–450 nm). Higher fluorescence intensities in red.

Table 1. Details of mixture composition in the Claus dataset, represented in Fig. 1. All concentrations in μM.

| Mixtures | m01 | m02 | m03 | m04 | m05 |
| --- | --- | --- | --- | --- | --- |
| Tryptophan | 2.67 | 0 | 0 | 1.58 | 0.88 |
| Tyrosine | 0 | 13.3 | 0 | 5.44 | 4.4 |
| Phenylalanine | 0 | 0 | 900 | 355 | 297 |

Tryptophan (CAS# 73-22-3); tyrosine (CAS# 60-18-4); phenylalanine (CAS# 63-91-2).

2.1.2. Dorrit

The Dorrit dataset, EEM(I,J,Q) [7], consists of I = 27 synthetic samples containing different concentrations of four analytes (hydroquinone, tryptophan, phenylalanine and DOPA) (k0 = 4) measured on a Perkin-Elmer LS50 B fluorescence spectrometer. Each fluorescence landscape corresponding to an individual sample consists of J = 121 emission wavelengths (241–481 nm) and Q = 24 excitation wavelengths (200–315 nm taken every 5 nm). According to the authors, this fluorescence dataset is three-way and ideally trilinear. It is available online [17]. From the original 27 spectra, those with missing information (NaN) were eliminated, resulting in a total of 18 spectra (I = 18). Fig. 2 represents the first twelve selected samples contained in this dataset, and the respective mixture composition is detailed in Table 2.

Table 2. Details of mixture composition in the selected Dorrit dataset, represented in Fig. 2. All concentrations in μM.

| Mixtures | Phenylalanine | Tryptophan | Hydroquinone | DOPA |
| --- | --- | --- | --- | --- |
| m01 | 1400 | 0.25 | 6 | 40 |
| m02 | 1400 | 2 | 1.75 | 70 |
| m03 | 1400 | 1 | 3.5 | 100 |
| m04 | 2800 | 0 | 0 | 10 |
| m05 | 0 | 1 | 275 | 5 |
| m06 | 0 | 0.5 | 28 | 2.5 |
| m07 | 175 | 0 | 8 | 0 |
| m08 | 5600 | 0.5 | 0.875 | 0 |
| m09 | 2800 | 1 | 10 | 20 |
| m10 | 1400 | 1 | 28 | 5 |
| m11 | 3200 | 1 | 6 | 8 |
| m12 | 2800 | 4 | 0 | 18 |
| m13 | 4700 | 2 | 3.5 | 2.5 |
| m14 | 1400 | 10 | 28 | 80 |
| m15 | 700 | 25 | 56 | 0 |
| m16 | 1400 | 10 | 0 | 20 |
| m17 | 3200 | 16 | 4 | 10 |
| m18 | 1400 | 1 | 20 | 220 |

Phenylalanine (CAS# 63-91-2); tryptophan (CAS# 73-22-3); hydroquinone (CAS# 123-31-9); DOPA (CAS# 59-92-7).

2.1.3. drEEM

This dataset is included in the drEEM toolbox for MATLAB [6]. It consists of I = 224 real samples collected during four surveys of the San Francisco Bay for residual dissolved organic matter (DOM) present in the environment, measured using excitation in the 230–455 nm range and fluorescence emission in the 290–680 nm range. The dataset is available online [18]. Fig. 3 shows the first twelve excitation-emission fluorescence matrix contour plots contained in the drEEM dataset.

Fig. 2. Excitation-emission fluorescence matrix (EEM) contour plots of the first twelve selected samples contained in the Dorrit dataset; horizontal scale corresponds to excitation (200–315 nm) and vertical scale to emission (241–481 nm). Higher fluorescence intensities in red.

Fig. 3. Representation of excitation-emission fluorescence matrix (EEM) contour plots of the first twelve mixtures contained in the drEEM dataset; horizontal scale is related to excitation (230–455 nm) and vertical scale to emission (290–680 nm). Higher fluorescence intensities in red.

2.2. Simulations

In order to test the ICA and PARAFAC algorithms under well-controlled conditions, free of experimental errors and other artifacts (e.g. lack of linearity of the signal response to the concentration of each analyte present in the mixtures), we needed to perform dataset simulations. From Table 1 it is possible to see that the first three samples (m01–m03) in the Claus dataset each contain a single component (tryptophan, tyrosine and phenylalanine, respectively). These samples were used as representative of the three compounds. Each spectrum was scaled to the [0,1] fluorescence range, and simulated mixture spectra were obtained by multiplying each scaled spectrum by the respective analyte contribution (0–100%) and adding the three contributions. The presence of Rayleigh scattering was also studied. In order to simulate first order scatter, a zeroed matrix, M0(J,Q) = 0.0, with the same dimensions as the source matrices was used, and a Lorentzian function was applied at λEm = λEx. The sources used are presented in Fig. 4 (first row, s1–s4), as well as the first four simulated samples of the pure simulation data (second row, a01–a04) and of the simulation data including light scattering (bottom row, b01–b04).
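The simulation recipe described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy wavelength grids and source matrices, and in particular the Lorentzian width (`width_nm`) are hypothetical, since the text does not report the width used. Contributions are expressed here as fractions (0 to 1) rather than percentages.

```python
def scale_unit(matrix):
    """Rescale a (J x Q) spectrum to the [0, 1] range."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


def mix(sources, fractions):
    """Additive mixture: element-wise sum of each source times its contribution."""
    n_rows, n_cols = len(sources[0]), len(sources[0][0])
    return [
        [sum(f * s[j][q] for f, s in zip(fractions, sources)) for q in range(n_cols)]
        for j in range(n_rows)
    ]


def rayleigh_scatter(em_nm, ex_nm, width_nm=5.0):
    """First order scatter: Lorentzian ridge centred on em = ex, peak height 1.

    width_nm is the full width at half maximum (an assumed, illustrative value).
    """
    half = width_nm / 2.0
    return [[half**2 / ((em - ex) ** 2 + half**2) for ex in ex_nm] for em in em_nm]


# Toy grids and sources (hypothetical values, rows = emission, columns = excitation).
em_nm = [250.0, 275.0, 300.0]
ex_nm = [250.0, 300.0]
s1 = scale_unit([[0, 2], [4, 1], [0, 0]])
s2 = scale_unit([[1, 1], [1, 3], [5, 1]])
scatter = rayleigh_scatter(em_nm, ex_nm)

sample = mix([s1, s2], [0.6, 0.4])                          # pure simulation
with_scatter = mix([s1, s2, scatter], [0.6, 0.4, 1.0])      # scatter added on top
```

Because the scatter ridge follows the diagonal em = ex, it cannot be written as the product of one excitation profile and one emission profile, which is the property that makes it a useful non-trilinear test contribution.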

2.3. Data Analysis

Dataset manipulation, algorithm performance evaluation and the respective graphical output were processed using the Octave programming language [19]. The R-project programming language [20] was used as the platform to run the packages FastICA [21] and Multiway [22], which were used for the ICA and PARAFAC decompositions, respectively.

2.4. Performance Evaluation

Evaluation of the EEM modeling performance of these methods is important. With PARAFAC, there is a specific indicator, the Core Consistency Diagnostic (CORCONDIA) [23]. This parameter evaluates the coherence of the PARAFAC model with respect to the original data. There is no corresponding indicator for ICA, and other performance estimators are used in the present work. In the analysis of EEM datasets, there are three important aspects to address: model fitting ability, analyte identification and the respective quantification. These are discussed below.

2.4.1. Modeling Fitting Ability

Model fitting ability may be evaluated by the RMSE (root mean squared model error). Since this number depends on the scale of the signal, the relative value, RRMSE (%), is used instead

$$\mathrm{RRMSE} = 100 \times \frac{\sqrt{\dfrac{1}{IJQ}\displaystyle\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{q=1}^{Q}\left(y_{ijq} - y^{*}_{ijq}\right)^{2}}}{\dfrac{1}{IJQ}\displaystyle\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{q=1}^{Q} y_{ijq}} \tag{3}$$

where y*ijq represents the model estimates for the signal yijq obtained for I samples over J and Q spectral coordinates in the excitation-emission fluorescence spectra.


Fig. 4. Representation of sources (s1–s4) used for semi-empirical simulation and first four simulated spectra of mixtures in the simple case (a01–a04) and with the addition of Rayleigh (s4) light scatter (b01–b04); horizontal scale corresponds to excitation (240–300 nm) and vertical scale to emission (250–450 nm). Higher fluorescence intensities in red.

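Pearson's R-squared estimate (R2), also known as the coefficient of determination, may also be used as an indicator of relative information recovery

$$R^{2} = 100 \times \frac{\left(s^{2}_{(Y,Y^{*})}\right)^{2}}{s^{2}_{(Y,Y)} \cdot s^{2}_{(Y^{*},Y^{*})}} \tag{4}$$

where s2(U,V) denotes the covariance between U and V, so that s2(Y,Y) is the variance of Y. This allows the model ability to be estimated in terms of overall signal recovery.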
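A minimal sketch of the two fitting indicators is given below. It assumes that RRMSE is the root mean squared error divided by the mean measured signal, and that R2 is the squared Pearson correlation between data and model estimates, both in percent and computed over all (i, j, q) elements. The helper names are illustrative and are not those of the Octave scripts used in this work.

```python
from math import sqrt


def flatten(array):
    """Flatten an arbitrarily nested list of numbers into one flat list of floats."""
    if isinstance(array, (int, float)):
        return [float(array)]
    return [v for item in array for v in flatten(item)]


def rrmse(y, y_model):
    """Relative RMSE (%): RMSE divided by the mean measured signal (assumed normalisation)."""
    obs, est = flatten(y), flatten(y_model)
    rmse = sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs))
    return 100.0 * rmse / (sum(obs) / len(obs))


def r_squared(y, y_model):
    """Squared Pearson correlation (%) between measured and modelled signal."""
    obs, est = flatten(y), flatten(y_model)
    mean_o, mean_e = sum(obs) / len(obs), sum(est) / len(est)
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return 100.0 * cov**2 / (var_o * var_e)


# Toy data: the model reproduces the shape of the signal but with a constant offset.
y = [[1.0, 2.0], [3.0, 4.0]]
y_offset = [[2.0, 3.0], [4.0, 5.0]]
error_pct = rrmse(y, y_offset)         # sensitive to the offset
recovery_pct = r_squared(y, y_offset)  # insensitive to the offset
```

The toy case shows why the two indicators are reported together: a correlation-based recovery can stay at its maximum while the relative error is still large.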

2.4.2. Identification

In the spectral analysis of EEM, it is very important to characterize each emitting species and then use this information to identify and recognize the presence of each contribution. The ability for correct identification may be assessed via the Relative Recovery index between signal sources (R2(S), Eq. (4)), after the total unfolding of each (J, Q) matrix into a single column vector (JQ, 1).
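The unfolding step mentioned here, which is also the reshape that ICA requires for three-way data, may be sketched as below. The row-major ordering of the (J, Q) elements and the function names are assumptions made for illustration; any ordering works provided the same one is used when folding back.

```python
def unfold(eem):
    """Reshape an (I, J, Q) array into a (J*Q, I) matrix with one column per sample."""
    columns = [[v for row in sample for v in row] for sample in eem]  # I vectors of length J*Q
    return [list(values) for values in zip(*columns)]                 # transpose to (J*Q, I)


def fold(matrix, n_rows, n_cols):
    """Inverse of unfold: rebuild an (I, n_rows, n_cols) array from a (n_rows*n_cols, I) matrix."""
    return [
        [list(column[j * n_cols:(j + 1) * n_cols]) for j in range(n_rows)]
        for column in zip(*matrix)
    ]


# Toy array (hypothetical values): I = 2 samples, J = 3, Q = 2.
eem = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]
X = unfold(eem)          # two-way matrix, 6 rows and 2 columns
restored = fold(X, 3, 2)  # back to the three-way layout
```

The same `fold` call can be applied to a matrix of estimated sources (one column per source) to display each source as a (J, Q) landscape.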

2.4.3. Quantification

The analysis of EEM also requires system evaluation in terms of composition. Since ICA and PARAFAC information is presented in relative units (scores), the composition Relative Recovery (R2(A), Eq. (4)) was also used to evaluate the ability of each tested algorithm to obtain the respective contributions.

3. Results and Discussion

3.1. Number of Contributions

For each tested dataset, the fitting ability of ICA and PARAFAC was used to detect the number of contributions present, by applying each algorithm with an increasing number of factors (k). Results are summarized in Table 3.

Table 3. Evolution of modeling performance of ICA and PARAFAC on three test datasets (Claus, Dorrit and drEEM) by increasing the number of contributions (k).

| k | Claus ICA R2 | Claus ICA RRMSE | Claus PARAFAC R2 | Claus PARAFAC RRMSE | Dorrit ICA R2 | Dorrit ICA RRMSE | Dorrit PARAFAC R2 | Dorrit PARAFAC RRMSE | drEEM ICA R2 | drEEM ICA RRMSE | drEEM PARAFAC R2 | drEEM PARAFAC RRMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 68.86 | 98.72 | 68.13 | 102.82 | 82.59 | 98.79 | 75.18 | 119.45 | 78.81 | 121.46 | 63.93 | 156.11 |
| 2 | 91.06 | 56.26 | 89.36 | 63.00 | 90.86 | 73.16 | 87.83 | 86.61 | 91.02 | 81.71 | 66.69 | 151.26 |
| 3 | 99.98 | 2.74 | 99.95 | 4.41 | 94.81 | 55.71 | 91.91 | 71.38 | 97.14 | 46.86 | 69.96 | 145.05 |
| 4 | 99.99 | 1.62 | 99.96 | 3.87 | 97.03 | 42.41 | 93.73 | 63.11 | 98.98 | 28.04 | 72.67 | 139.46 |
| 5 | 100.00 | 0.00 | 99.96 | 3.73 | 98.51 | 30.16 | 95.26 | 55.09 | 99.21 | 24.70 | 74.91 | 134.48 |
| 6 | 100.00 | 0.00 | 99.96 | 3.79 | 99.16 | 22.67 | 96.39 | 48.23 | 99.37 | 22.19 | 77.42 | 128.50 |
| 7 | 100.00 | 0.00 | 99.98 | 3.08 | 99.42 | 18.82 | 97.12 | 43.16 | 99.49 | 19.91 | 79.22 | 123.87 |
| 8 | 100.00 | 0.00 | 99.98 | 2.95 | 99.56 | 16.42 | 97.81 | 37.73 | 99.60 | 17.57 | 80.88 | 119.38 |
| 9 | 100.00 | 0.00 | 99.98 | 2.85 | 99.64 | 14.77 | 98.05 | 35.58 | 99.66 | 16.22 | 82.24 | 115.48 |

Alg. - algorithm; R2 - information recovery (%); RRMSE - model relative error (%).

For all tested cases in Table 3, it is evident that ICA better describes each tested dataset, consistently displaying a higher recovery (R2) and a lower estimated error (RRMSE). The main reason for the higher ability of ICA in data modeling is that it is a totally free soft-modeling algorithm, focused on signal processing and conceived specifically to maximize the independence of the deconvoluted information for a given number of signal contributions [11]. PARAFAC was conceived specifically to directly retrieve physically and chemically meaningful information, without the need for “factor rotation” [4]. For this purpose, several restrictions on the trilinear decomposition of the signal are available, such as the non-negativity restriction [4,5].

The Claus dataset consists of five mixtures (I = 5) of three components (k0 = 3). Theoretically, with this very simple system, very good modeling results are expected with both ICA and PARAFAC by imposing k = 3. From Table 3 we see that, in progressing from the k = 2 model to k = 3, the relative residual error (RRMSE) decreases dramatically (around 20 and 14 times for ICA and PARAFAC, respectively). By increasing k to 4, RRMSE keeps decreasing, while the information recovery (R2) still increases slightly. With k = 5, the RRMSE becomes zero with ICA, and the recovered information attains its maximum value (100.00%). This ability for total signal recovery may be related to the use of a very limited number of samples. The existence of experimental error and deviations from ideality, such as lack of response linearity, absorption phenomena, interferences and other effects, does not allow the ICA method to obtain RRMSE = 0.00% and R2 = 100.00% when imposing k = 3. Inspection of Fig. 1 shows that contour plots m02, m03, m04 and m05 reveal some signal abnormalities due to the preprocessing step, which consisted of the removal of unspecific light scattering from these samples and subsequent smoothing of the fluorescence in order to create a well-behaved dataset for trilinear decomposition. This data manipulation may be detected with ICA.

The Dorrit dataset consists of an ensemble of 18 samples (I = 18) prepared in the laboratory by mixing four compounds (k0 = 4), see Table 2. ICA and PARAFAC are not able to recover all the information with k = 4 (R2 = 97.0% and 93.7%, respectively), and the model fitting error is very high (RRMSE = 42.4% and 63.1%, respectively). By inspecting the first 12 samples, Fig. 2, it is possible to detect several unspecific contributions such as Rayleigh (first and second order) and Raman light scattering, with some other spectral imperfections related to noise. By increasing to a value of k = 4 + 2 (the second term, 2, accounting for the Rayleigh contributions) it is possible to see that the ICA fitting ability increases to R2 = 99.2%, while PARAFAC keeps a lower recovery of 96.4%. However, ICA and PARAFAC still show high fitting errors (RRMSE = 22.7% and 48.2%, respectively).

The real samples case, the drEEM dataset, has the benefit of a large number of samples (I = 224). However, the system is not known a priori, and is thus difficult to use for algorithm testing and comparison.
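The improvement factors quoted above for the Claus dataset can be checked directly from the RRMSE values in Table 3. The short sketch below (variable and function names are illustrative) computes the ratio between consecutive model errors as k increases.

```python
# RRMSE (%) for the Claus dataset, k = 1..5, copied from Table 3.
rrmse_claus = {
    "ICA": [98.72, 56.26, 2.74, 1.62, 0.00],
    "PARAFAC": [102.82, 63.00, 4.41, 3.87, 3.73],
}


def improvement_factors(errors):
    """Ratio RRMSE(k) / RRMSE(k + 1) for consecutive k; None where the next error is zero."""
    return [prev / nxt if nxt > 0 else None for prev, nxt in zip(errors, errors[1:])]


factors = {alg: improvement_factors(values) for alg, values in rrmse_claus.items()}
# Index 1 of each list is the k = 2 -> k = 3 step discussed in the text.
```

The k = 2 to k = 3 step gives ratios of about 20.5 (ICA) and 14.3 (PARAFAC), in line with the values quoted in the text, whereas every other finite ratio is below 2, which is consistent with three contributions for this dataset.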
From Table 3, it is evident that PARAFAC was unable to deal with this information without preprocessing the signals to remove unspecific light emission (Rayleigh and Raman scattering), requiring k = 8 to recover 80% of the information with expected errors around 120%. With ICA, the effective number of contributions is not totally clear: going from k = 4 to k = 5 produces a gain of 1% in R2 and a decrease of about 4% in model error, whereas going from k = 5 to k = 6 does not significantly change the information recovery (variation < 0.2%) but slightly improves the model error, reducing RRMSE by a further 2%. However, ICA still displays a very high residual error (≈ 25% with k = 5).

3.2. Identification

Both algorithms, ICA and PARAFAC, allow spectral identification of each contributing component. The ICA approach deals directly with the signal sources of each individual contribution, and the deconvoluted signals are directly stored in a loading matrix S, Eq. (2). In the PARAFAC case, signal sources are obtained via the respective loadings B and C in Eq. (1), resulting in a smoothed loading perspective.

ICA and PARAFAC estimated sources for the Claus (k = 3 and k = 4, Fig. 5), Dorrit (k = 4 and k = 7, Fig. 6) and drEEM (k = 4 and k = 7, Fig. 7) datasets are presented.

In the Claus dataset case (k0 = 3, I = 5), Fig. 5, it can be seen that the first three samples (m01–m03) are very similar to the ICA (a1–a3) and PARAFAC (b1–b3) estimated sources obtained by imposing k = 3. As seen in Table 1, these samples each consist of a single component, and the fact that the ICA and PARAFAC estimated sources are similar to these samples reinforces the view that these algorithms are extracting information related to the signal sources of the individual contributions. For identification purposes, m01 in Fig. 5 corresponds to tryptophan (maximum emission at Ex ≈ 275 nm, Em ≈ 357 nm), the source depicted as a1 and b1 in the ICA and PARAFAC cases; m02 corresponds to tyrosine (maximum emission at Ex ≈ 274 nm, Em ≈ 305 nm), depicted as a2 and b2; and m03 corresponds to phenylalanine (maximum emission at Ex ≈ 256 nm, Em ≈ 286 nm), extracted as a3 and b3.

There is additional information in the Claus dataset, as previously discussed in Section 3.1. In Fig. 5 we see that, on setting k = 4 for ICA (c1–c3) and PARAFAC (d1–d3), the first three detected sources are consistent with the previously detected ones (a1–a3 and b1–b3, respectively); however, major differences appear in the last one. The ICA c4 source resembles background emission details including Rayleigh light scatter, while PARAFAC proposes a signal source that closely matches the first one in shape and position.

In the case of the Dorrit dataset (k0 = 4, I = 18), Fig. 6, ICA and PARAFAC present several differences in the estimated sources, and only a2 and b2 seem consistent, probably being due to tryptophan (maximum emission at Ex ≈ 275 nm, Em ≈ 357 nm). Other sources are distorted and contain scatter information (e.g. a1 and a4). In order to remove extraneous contributions from the Dorrit dataset, we proceeded further by imposing k = 7. Following this, information recovery is increased (R2 from 97.0 to 99.4% and from 93.7 to 97.1% with ICA and PARAFAC, respectively) and model error reduced (RRMSE from 42.4 to 18.8% and from 63.1 to 43.2%, respectively). This allows (i) the isolation of non-specific light scattering (c7) and noise (c6), and (ii) the refinement of the EEM sources for the analytes in the present study, allowing the identification of phenylalanine (maximum emission at Ex ≈ 256 nm, Em ≈ 286 nm) in c5. The remaining sources (c1, c3 and c4) are related to hydroquinone and DOPA. Hydroquinone may be oxidized to produce benzoquinone and other products, which may explain the findings.

With respect to the PARAFAC results, imposing k = 4 and k = 7 in the Dorrit dataset analysis, Fig. 6, produces a good coherence in all three sources (b2-d2, b3-d3 and b4-d4), while b1 seems to be decomposed into d1 and d5. Estimated sources d6 and d7 are quite extraneous and diverge from those obtained with ICA (c6 and c7).


Fig. 5. Representation of first three Claus samples (m01–m03) and estimated sources obtained with ICA (a01–a03 and c01–c04) and PARAFAC (b01–b03 and d01–d04) by imposing k = 3 (series a01–a03 and b01–b03) and k = 4 (c01–c04 and d01–d04). Higher fluorescence intensities in red.

Fig. 6. Representation of Dorrit dataset estimated sources obtained with ICA (a01–a04 and c01–c07) and PARAFAC (b01–b04 and d01–d07) by imposing k = 4 (a01–a04 and b01–b04) and k = 7 (c01–c07 and d01–d07). Higher fluorescence intensities in red.

Fig. 7. Representation of drEEM dataset estimated sources obtained with ICA (a01–a04 and c01–c07) and PARAFAC (b01–b04 and d01–d07) by imposing k = 4 and k = 7, respectively. Higher fluorescence intensities in red.

In the case of real environmental samples (drEEM dataset, I = 224), it is not totally clear from Table 3 how many contributions should be used in the EEM decomposition. With k = 4, we can recover 99.0% and 72.7% of the EEM information using ICA and PARAFAC, respectively. However, the estimated model errors are very high (28.0% with ICA and 139.5% with PARAFAC). By increasing the number of contributions to k = 7, information recovery increases slightly (99.5% for ICA and 79.2% for PARAFAC), with the additional benefit of a lower model error (19.9% for ICA). However, the PARAFAC model error is still extremely high (RRMSE of 123.9%).

In Fig. 7, the resolution of the drEEM dataset with k = 4 for ICA (a1–a4) and PARAFAC (b1–b4) is presented. The ICA estimated sources only present evidence of one signature of dissolved organic matter (a1), together with signatures of second order Rayleigh (a2 and a3) and first order Rayleigh (a4) light scattering. From the literature [24], this broad signal can be identified as a “UV humic-like” feature. The PARAFAC results (b1–b4) also allow identification of this humic acid signature (b1), but the other sources (b2–b3) are speculative and may be related to maxima of light scattering.

Considering the modeling difficulties with k = 4, and in order to better explore this dataset, we also imposed k = 7 in the EEM deconvolution. The ICA results (c1–c7) maintain the relevance of the “UV humic-like” signature (c1) and further decompose the scattered light emission into several independent contributions. In the PARAFAC case, results similar to those of the k = 4 case were obtained (detection of humic acid and other meaningless sources). For us, the analysis of this dataset was a little disappointing: we were expecting to detect different DOM contributions in natural waters and could only detect the humic acid contribution. This subject will be further discussed in the simulation section.


3.3. Quantification

The Claus and Dorrit datasets have the advantage of being well defined in terms of composition. Taking advantage of this, Fig. 8 shows the corresponding calibration curves. From Fig. 8, we observe that the Claus dataset is a well-behaved system, with linear dependencies of fluorescence on concentration. The PARAFAC composition estimates (b1–b3) are much more accurate than those of ICA (a1–a3), but a similar linear dependency is found in both cases. For the Dorrit dataset, it is evident that the system does not present an ideal dependence of fluorescence on the respective analyte concentration. It is also possible to see that the ICA estimates are less accurate than those of PARAFAC. In this dataset several complications may be involved, related to (a) lack of linearity, (b) unspecific contributions and (c) chemical equilibria (degradation of hydroquinone). These subjects are also addressed by employing simulations.

Fig. 8. Calibration curves for Claus (a01–a03 and b01–b03) and Dorrit (c01–c04 and d01–d04) datasets obtained with ICA (a01–a03 and c01–c04) and PARAFAC (b01–b03 and d01–d04) by imposing k = 3 (a01–a03 and b01–b03) and k = 4 (c01–c04 and d01–d04). Higher fluorescence intensities in red.


Table 4. Effect of sample size on ICA and PARAFAC model ability (R2 and RRMSE) and global information recovery in terms of signal sources (R2(S)) and composition (R2(A)), obtained by imposing k = 3.

| # | ICA R2 | ICA RRMSE | ICA R2(S) | ICA R2(A) | PARAFAC R2 | PARAFAC RRMSE | PARAFAC R2(S) | PARAFAC R2(A) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 100.00 | 0.00 | 95.65 | 94.77 | 99.94 | 3.82 | 99.18 | 98.21 |
| 5 | 100.00 | 0.00 | 95.78 | 91.31 | 99.92 | 4.47 | 97.50 | 99.49 |
| 10 | 100.00 | 0.00 | 96.05 | 91.45 | 99.95 | 3.74 | 99.04 | 99.69 |
| 20 | 100.00 | 0.00 | 97.97 | 94.15 | 99.95 | 3.58 | 99.51 | 99.87 |
| 30 | 100.00 | 0.00 | 99.06 | 95.88 | 99.95 | 3.52 | 99.48 | 98.51 |
| 40 | 100.00 | 0.00 | 99.80 | 99.56 | 99.96 | 3.24 | 99.70 | 99.65 |
| 50 | 100.00 | 0.00 | 99.48 | 97.37 | 99.95 | 3.39 | 99.58 | 99.60 |
| 60 | 100.00 | 0.00 | 99.29 | 99.26 | 99.95 | 3.45 | 99.42 | 99.82 |
| 80 | 100.00 | 0.00 | 99.30 | 98.39 | 99.95 | 3.35 | 99.46 | 99.31 |
| 100 | 100.00 | 0.00 | 97.62 | 97.98 | 99.95 | 3.31 | 99.61 | 99.87 |
| 150 | 100.00 | 0.00 | 99.01 | 97.42 | 99.95 | 3.27 | 99.45 | 99.19 |
| 200 | 100.00 | 0.00 | 99.29 | 97.76 | 99.95 | 3.53 | 99.27 | 99.73 |

# - sample size; Alg. - algorithm; R2 - information recovery (%); RRMSE - model relative error (%); R2(S) - source recovery (%); R2(A) - system composition recovery (%).

3.4. Simulation

From the previous examples we concluded that ICA and PARAFAC are able to deal with EEM datasets. Unfortunately, all the tested datasets present some difficulties for comparing the performance of the two methods. The Claus dataset is very limited in terms of available samples (I = 5). It is also somewhat distorted by the data preprocessing used to remove scatter information, which slightly disturbs the respective data analysis. The Dorrit dataset contains more samples (I = 18 valid samples), and all information corresponds to the raw fluorescence signal. However, it spans a very large concentration range, causing fluorescence to lose its linear dependence upon concentration, see Fig. 8. Finally, the drEEM dataset presents a large number of samples (I = 224) with raw recorded fluorescence signal. Unfortunately, this dataset is not well described in terms of composition and presents only one relevant signal with which to compare and test the algorithms.

Using the information on the Claus dataset (Table 1 and Fig. 1), the first three samples correspond to single standard solutions. After scaling the fluorescence signal (to the [0,1] range), these three spectra were used as simulated sources, and all tested samples were computed so as to obey (i) a linear dependence upon the concentration of the emitting species and (ii) a purely additive response of the total light emission across all emitting species present in each mixture. In this way we were able to test the impact of the number of samples, unspecific contributions (Rayleigh light scatter) and linear dependencies on concentration (assuming simple chemical reactions), signal sources and conjugated cases of signals and composition. In Fig. 4 we present the signal sources used (s1–s3) and the simulated first order Rayleigh light scatter (s4) as the simulation basis.

In Tables 4 and 5, we present the effect of sample size on ICA and PARAFAC modeling ability in the absence and in the presence of simulated first order Rayleigh light scattering. In the absence of extraneous contributions to fluorescence, both algorithms, ICA and PARAFAC, recover most of the significant information (R2 > 99.9%), but PARAFAC presents a higher model error (RRMSE ≈ 3.5%). Considering the recovery of source information and composition, even with a very limited number of samples (I = 3) PARAFAC is extremely accurate (R2(S) and R2(A) ≥ 99.5%). Surprisingly, ICA is less accurate than PARAFAC for source and composition estimates. It requires a larger number of samples (I ≥ 30) in order to achieve 99.0% of source recovery, and it retains only around 98% of the information related to composition. In Fig. 9 it is evident that ICA (in blue) presents less accurate results for signal source recovery (a1) and composition estimates (b1), but tends to stabilize with I > 50. In the presence of light scattering (simulated first order Rayleigh light scattering), by keeping k = 3, ICA and PARAFAC obtain high EEM information recovery (R2 ≥ 99%) but with larger estimated errors (RRMSE ≈ 10 and 18% for ICA and PARAFAC). However,

Table 5
Effect of sample size in ICA and PARAFAC model ability (R2 and RRMSE) and global information recovery in terms of signal sources (R2 (S)) and in composition respective (R2 (A)) in the presence of simulated light scatter.

      ICA (k = 3)                         PARAFAC (k = 3)
#     R2       RRMSE    R2 (S)   R2 (A)   R2       RRMSE    R2 (S)   R2 (A)
3     100.00   0.00     76.49    84.31    –        –        –        –
5     99.77    6.99     89.92    94.82    98.95    15.76    99.62    99.34
10    99.61    9.32     89.91    86.64    98.29    21.66    99.51    99.92
20    99.26    12.91    92.70    89.41    97.65    25.22    99.68    98.85
30    99.13    13.69    93.40    90.41    98.09    22.57    99.35    99.83
40    99.29    11.92    92.41    91.46    98.31    20.18    99.71    99.78
50    99.36    11.27    93.08    92.98    98.29    19.85    99.29    99.76
60    99.36    11.27    94.21    92.48    98.45    18.87    99.67    99.77
80    99.40    10.95    94.28    93.02    98.43    18.98    99.56    99.90
100   99.35    11.47    94.34    92.41    98.37    19.48    99.46    99.87
150   99.35    11.28    94.45    92.67    98.46    18.62    99.47    99.82
200   99.41    10.86    93.73    93.17    98.46    18.66    99.52    99.76

      ICA (k = 4)                         PARAFAC (k = 4)
#     R2       RRMSE    R2 (S)   R2 (A)   R2       RRMSE    R2 (S)   R2 (A)   R2 (S)*  R2 (A)*
3     100.00   0.00     76.40    84.21    –        –        –        –        –        –
5     100.00   0.00     96.59    91.10    98.96    15.71    99.60    99.76    25.23    34.04
10    100.00   0.00     97.36    91.93    98.30    21.59    99.26    98.83    24.27    67.72
20    100.00   0.00     97.50    92.33    97.67    25.03    98.89    97.86    7.99     55.24
30    100.00   0.00     97.85    92.08    98.12    22.31    98.81    99.47    7.46     27.91
40    100.00   0.00     97.88    95.20    98.33    20.04    98.21    99.80    16.84    51.27
50    100.00   0.00     97.97    94.95    98.33    19.55    99.03    98.78    15.42    57.92
60    100.00   0.00     97.98    97.36    98.48    18.67    98.95    99.06    2.98     40.84
80    100.00   0.00     98.06    96.67    98.46    18.74    99.03    99.11    16.82    60.66
100   100.00   0.00     97.94    97.85    98.39    19.30    98.24    98.34    10.37    14.56
150   100.00   0.00     98.02    97.71    98.53    18.15    99.49    99.85    38.09    42.73
200   100.00   0.00     97.99    97.55    98.49    18.43    99.20    99.66    21.00    62.37

# - Sample size; Alg. - algorithm; R2 - information recovery (%); RRMSE - model relative error (%); R2 (S) - source recovery (%); R2 (A) - system composition recovery (%); * - global estimate.


Fig. 9. Evolution of source (a1–a3) and composition (b1–b3) recovery estimates with sample size for ICA (open dots in blue) and PARAFAC (stars in red).

PARAFAC preserves a higher accuracy in source and composition recoveries (R2 (S) and R2 (A) ≥ 99.5%) while ICA leads, on average, to recoveries of about 93% for both, see Fig. 9, plots a2 and b2. By adding an extra contribution to deal with “light scattering”, Table 5 and plots a3 and b3 in Fig. 9, with a large number of samples (I ≥ 60) ICA improves its global accuracy to 97% in all source and composition contributions. Despite totally failing with respect to light scattering, PARAFAC remains unaffected in all other estimates related to source and system composition.

Sim01 and Sim02 simulation sources (s1–s3 and s1–s4) are represented in Figs. 10 and 11, respectively (a1–a3 and a1–a4), along with the estimated sources obtained with ICA (b1–b3 and b1–b4) and with PARAFAC (c1–c3 and c1–c4). In the case of Sim01, Fig. 10, the number of independent contributions is k0 = 3. By imposing k = 3 contributions in the ICA and PARAFAC algorithms we obtained estimated sources, (b1–b3) and (c1–c3) respectively, that are perfectly related to the sources used in the simulation (a1–a3). In Sim02, Fig. 11, similar results are obtained for the first three contributions, where (b1–b3) and (c1–c3) correspond to the sources used (a1–a3). However, the non-trilinear contribution similar to first order Rayleigh light scattering (a4) is correctly retrieved with ICA (b4), while PARAFAC totally fails this identification (c4).

In order to better evaluate other effects, such as the presence of light scattering and linear dependencies in signal sources and in simulated system composition, I = 60 was kept as a reasonable sample size. Since, in the simulated cases, ICA is “exact” in modeling the overall EEM data (R2 = 100.00% and RRMSE = 0.00%), algorithm performance is compared at the smallest k value for which ICA reaches 100% model recovery. In Table 6 we summarize the results obtained.

From Table 6 several observations and rules can be drawn. The number of independent contributions (k) present in each tested simulation is increased by 1 by adding one extraneous emission such as simulated Rayleigh light scattering. Keeping composition contributions independent (a1–a3 or a1–a4), when all sources are represented independently (s1 s2 s3) or in a combined way (s12 s13 s23 s123), the number of independent contributions is still 3. If some of these sources are not represented (e.g. in Sim03, s1 s2 s12), the number of independent contributions matches the number of independent sources represented (k = 2 in this case). Keeping signal sources independent (s1 s2 and s3), the presence of a dependent composition contribution (e.g. a3 = (X-a2)) reduces the number of independent contributions present in the system (k = 2 instead of 3). All dependent sources associated with dependent compositions (e.g. Sim07) only affect the number of independent contributions by adding 1. Interestingly, if there are dependent sources not related to dependent contributions (e.g. in Sim09 with sources (s1 s2 s12) and contributions (a1 (X-a1) a3) and Sim15 with sources (s1 s2 s3 s23) and contributions ((X-a2) a2 a3 a4)) the number of independent contributions seems to be ruled by the number of independent sources.
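
One way to see why linear dependencies change the number of independent contributions is to look at the numerical rank of the unfolded data matrix. The sketch below is only an illustration under stated assumptions. It is not the procedure used to obtain the k values in Table 6. The source spectra are random stand-ins, a combined source is assumed to be a plain sum (`s12 = s1 + s2`), the constant `X_CONST` in the dependent contribution is hypothetical, and the helper `n_independent` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_channels = 60, 200

# Hypothetical unfolded source spectra (stand-ins for s1, s2, s3).
s1, s2, s3 = rng.random((3, n_channels))
s12 = s1 + s2  # assumption: a combined source is the sum of simple ones

# Random contributions in 0-100 for each sample.
a1, a2, a3 = rng.uniform(0.0, 100.0, size=(3, n_samples))
X_CONST = 100.0  # hypothetical constant in the dependent contribution (X - a2)


def n_independent(contribs, sources, center=False):
    """Numerical rank of the unfolded data matrix D = A @ S."""
    A = np.column_stack(contribs)   # (samples, components)
    S = np.vstack(sources)          # (components, channels)
    D = A @ S
    if center:
        D = D - D.mean(axis=0)      # remove the mean spectrum over samples
    return int(np.linalg.matrix_rank(D))


k_indep = n_independent([a1, a2, a3], [s1, s2, s3])
k_dep_source = n_independent([a1, a2, a3], [s1, s2, s12])
k_dep_comp_raw = n_independent([a1, a2, X_CONST - a2], [s1, s2, s3])
k_dep_comp_centered = n_independent(
    [a1, a2, X_CONST - a2], [s1, s2, s3], center=True
)
print(k_indep, k_dep_source, k_dep_comp_raw, k_dep_comp_centered)
```

Under these assumptions the independent case has rank 3 and the combined source reduces it to 2. The dependent contribution reduces the rank only after the mean spectrum is removed, because the constant term otherwise acts as an extra direction. How the dependent contributions were generated and whether the data were centred is not stated in this section, so the sketch should not be read as reproducing every k in Table 6.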


Fig. 10. Contribution identification in Sim01; on top (a1–a3) used sources, ICA (b1–b3) and PARAFAC (c1–c3) estimates obtained by imposing k = 3. Higher fluorescence intensities in red.

In terms of modeling ability, PARAFAC is also able to retain the relevant information (R2 ≥ 99%), with estimated errors in the range 3–4% in the absence of extraneous signals and 10–15% in the presence of light scattering effects. Considering only relevant fluorescence contributions (excluding light scattering estimates), in both cases, signal source recoveries are in general very good (R2 (S) > 95%). Considering relevant contribution estimates (R2 (A)), ICA seems to maintain its tendency to be less accurate than PARAFAC, the latter conceived to deconvolve independent signals. Accuracy seems to be affected by increasing the number of contributions, by their dependencies and by other unspecific contributions. The worst ICA estimate was obtained in Sim20 (dependent sources (s12 s13 s23 s123) and dependent compositions (a1 a2 a3 (X-a3)) in the presence of unspecific contribution (r1, b1)); with PARAFAC, the worst estimate was obtained in Sim18 (dependent sources (s12 s13 s23 s123) with independent contributions (a1 a2 a3 a4) and in the presence of unspecific contribution (r1, b1)).

From these simulations under ideal conditions (linear response of fluorescence with concentration and a totally additive fluorescence response without other associated phenomena) we may conclude:

a) ICA and PARAFAC are sensitive to signal and system composition dependencies, making them vulnerable to undetectable contributions in real samples (e.g. being unable to resolve the drEEM humic acid signal into other contributions);
b) PARAFAC, despite its small modeling error (≈ 3–5% RRMSE), is extremely accurate in dealing with standard EEM datasets in the absence of extraneous contributions such as light scattering;
c) In the presence of light scattering phenomena, PARAFAC is still very robust and efficient in estimating signal sources and respective contributions for the relevant fluorescent species present in solution;
d) ICA is slightly less accurate than PARAFAC in estimating signal sources, but it allows the correct identification of each emitting component;
e) ICA is less accurate for estimating system compositions, especially when the sample size is low (I < 50);
f) Overall, in the presence of light scattering, ICA allows a more realistic approach to EEM, making it more suitable for analyzing raw data.
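
The non-trilinear character of a scatter-like contribution, referred to in the points above, can be illustrated with a small numerical sketch. In a trilinear model each component contributes a landscape that is an outer product of one emission and one excitation profile, that is, a rank-one matrix. The grids, band positions and widths below are hypothetical, and the Gaussian ridge is only a crude stand-in for first order Rayleigh scatter.

```python
import numpy as np


def band(axis, center, width):
    """Unit-height Gaussian band used as a stand-in spectral profile."""
    return np.exp(-0.5 * ((axis - center) / width) ** 2)


def rank_one_share(M):
    """Fraction of the sum of squares captured by the best rank-one fit."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


# Hypothetical emission and excitation grids (nm).
em = np.arange(250.0, 451.0, 2.0)
ex = np.arange(250.0, 451.0, 10.0)

# Fluorophore: the emission band does not move with excitation, so the
# landscape is an outer product of one emission and one excitation profile.
fluor = np.outer(band(em, 340.0, 20.0), band(ex, 280.0, 15.0))

# Scatter-like ridge: the band is centred where emission equals excitation,
# so its position shifts from one excitation wavelength to the next.
scatter = band(em[:, None], ex[None, :], 5.0)

print(round(rank_one_share(fluor), 3))
print(round(rank_one_share(scatter), 3))
```

The fluorophore landscape is captured essentially completely by a single outer product, whereas the diagonal ridge is not, which is consistent with a trilinear component being unable to describe scatter in these simulations.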

4. Conclusions

In this work we have compared the performance of the ICA and PARAFAC algorithms in dealing with three-way multivariate EEM datasets.


Fig. 11. Contribution identification in Sim02; on top (a1–a4) used sources, ICA (b1–b4) and PARAFAC (c1–c4) estimates obtained by imposing k = 4. Higher fluorescence intensities in red.

Available datasets (Claus, Dorrit and drEEM) were tested with both algorithms but, due to some experimental limitations (limited number of samples, preprocessed spectra, lack of linear dependency, lack of information on the real system in terms of composition and independent contributions), it was not possible to completely evaluate the performance of both algorithms.

Table 6
Evaluation of effects of linear dependencies in signal sources and system composition and the presence of light scattering in ICA and PARAFAC modeling performance (R2 and RRMSE) and information recovery related with signal sources (R2 (S)) and estimated composition (R2 (A)).

                                                    ICA                                 PARAFAC
Sim.   Sources              Contrib.            k   R2       RRMSE    R2 (S)   R2 (A)   R2       RRMSE    R2 (S)*  R2 (A)*
Sim01  s1 s2 s3             a1 a2 a3            3   100.00   0.00     95.65    93.21    99.95    3.24     95.65    99.94
Sim02  s1 s2 s3 r1          a1 a2 a3 b1         4   100.00   0.00     96.59    91.89    98.83    15.84    96.59    95.46
Sim03  s1 s2 s12            a1 a2 a3            2   100.00   0.00     98.41    77.53    99.94    3.89     98.41    78.80
Sim04  s1 s2 s12 r1         a1 a2 a3 a4         3   100.00   0.00     98.80    82.73    99.10    14.78    98.80    60.74
Sim05  s1 s2 s3             a1 a2 (X-a2)        2   100.00   0.00     95.65    93.84    99.95    3.47     95.65    99.95
Sim06  s1 s2 s3 r1          a1 a2 (X-a2) b1     3   100.00   0.00     92.68    91.96    98.69    16.55    92.68    99.86
Sim07  s1 s2 s12            a1 a2 (X-a2)        2   100.00   0.00     98.41    85.79    99.93    4.01     98.41    83.81
Sim08  s1 s2 s12 r1         a1 a2 (X-a2) b1     3   100.00   0.00     98.80    85.21    99.27    12.47    98.80    83.11
Sim09  s1 s2 s12            a1 (X-a1) a3        2   100.00   0.00     98.41    88.50    99.94    3.95     98.41    85.28
Sim10  s1 s2 s12 r1         a1 (X-a1) a3 b1     3   100.00   0.00     98.80    80.80    99.28    13.29    98.80    89.68
Sim11  s1 s2 s3 s23         a1 a2 a3 a4         3   100.00   0.00     95.66    81.02    99.95    3.39     95.66    86.39
Sim12  s1 s2 s3 s23 r1      a1 a2 a3 a4 b1      4   100.00   0.00     96.29    83.60    99.14    13.42    96.29    87.01
Sim13  s1 s2 s3 s23         a1 a2 a3 (X-a3)     3   100.00   0.00     95.28    69.80    99.95    3.40     95.28    92.26
Sim14  s1 s2 s3 s23 r1      a1 a2 a3 (X-a3) b1  4   100.00   0.00     96.58    83.01    99.24    12.89    96.58    87.13
Sim15  s1 s2 s3 s23         (X-a2) a2 a3 a4     3   100.00   0.00     95.28    77.09    99.92    3.99     95.28    90.78
Sim16  s1 s2 s3 s23 r1      (X-a2) a2 a3 a4 b1  4   100.00   0.00     96.59    85.34    99.25    11.61    96.59    91.75
Sim17  s12 s13 s23 s123     a1 a2 a3 a4         3   100.00   0.00     95.65    94.36    99.94    3.60     95.65    95.91
Sim18  s12 s13 s23 s123 r1  a1 a2 a3 a4 b1      4   100.00   0.00     96.59    94.26    98.98    14.55    96.59    59.56
Sim19  s12 s13 s23 s123     a1 a2 a3 (X-a3)     3   100.00   0.00     95.65    65.23    99.91    4.40     95.65    65.94
Sim20  s12 s13 s23 s123 r1  a1 a2 a3 (X-a3) b1  4   100.00   0.00     96.29    57.27    99.11    13.36    96.29    65.34

Alg. - algorithm; Sim. - simulation label; Sources - si and sij for simple and combined source, r1 - Rayleigh; Contrib. - individual simulated contribution: ai random value (0–100), (X-ai) contribution dependent on ai; R2 - information recovery (%); RRMSE - model relative error (%); R2 (S) - global source recovery (%); R2 (A) - global system composition recovery (%); * - excluding scatter estimates.
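
The figures of merit named in the table footnote could be computed along the following lines. This is a hedged sketch only: the exact definitions used for the tabulated values are not given in this section, so the formulas below are assumptions, namely a coefficient of determination for R2, a root mean squared residual divided by the mean signal for RRMSE, and a mean squared Pearson correlation after matching components for the recoveries. The function names `model_fit` and `recovery` are hypothetical.

```python
import numpy as np
from itertools import permutations


def model_fit(X, X_hat):
    """Return (R2, RRMSE) in percent; both definitions are assumptions."""
    res = X - X_hat
    r2 = 100.0 * (1.0 - np.sum(res ** 2) / np.sum((X - X.mean()) ** 2))
    rrmse = 100.0 * np.sqrt(np.mean(res ** 2)) / np.mean(X)
    return r2, rrmse


def recovery(true, est):
    """Mean squared Pearson correlation (%) between matched columns.

    true: (n, k) reference profiles; est: (n, m) estimates with m >= k.
    Estimated columns are matched to the references by trying every
    assignment, which is affordable for the small k used here. Squared
    correlation is insensitive to the sign and scale ambiguity of ICA.
    """
    k = true.shape[1]
    m = est.shape[1]
    # Cross block of the correlation matrix: c2[i, j] = corr(true_i, est_j)^2.
    c2 = np.corrcoef(true.T, est.T)[:k, k:] ** 2
    best = max(
        permutations(range(m), k),
        key=lambda p: sum(c2[i, j] for i, j in enumerate(p)),
    )
    return 100.0 * np.mean([c2[i, j] for i, j in enumerate(best)])


rng = np.random.default_rng(0)
S_true = rng.random((200, 3))
# Permuted, rescaled and sign-flipped copy: should still count as recovered.
S_est = S_true[:, [2, 0, 1]] * np.array([-2.0, 0.5, 3.0])
print(recovery(S_true, S_est))
```

Matching components before scoring matters because neither algorithm returns components in a fixed order, and ICA estimates are defined only up to sign and scale.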


Using real fluorescence spectra (included in the Claus dataset) to create semi-empirical simulations, we could evaluate the effects of sample size, of the presence of dependent information (sources and composition) and of unspecific fluorescence contributions with simulated scattered light. In the absence of unspecific contributions and using the correct number of independent contributions (k), PARAFAC, despite its relatively higher fitting error, yields more accurate estimates for signal sources and respective contributions, making this algorithm superior to ICA. With the correct number of independent contributions (k) and in the presence of unspecific contributions (e.g. light scattering phenomena), ICA leads to a more realistic approach, allowing a better decomposition of contributing sources with the benefit of correct contribution identification and respective quantification. The ICA algorithm is more suitable for the analysis of raw EEM data, without the need for any preprocessing to remove unspecific contributions. Based on our simulation results, ICA concentration estimates may be improved using resampling techniques. We also found that, when used with care in the direct analysis of raw EEM datasets, PARAFAC leads to worse model fitting errors, although it seems to be able to “suppress” unspecific information and keep good source and composition estimates, allowing a fairly good approximation for identification and quantification purposes.

Acknowledgments

The Coimbra Chemistry Centre (CQC) is supported by FCT, through the project UI0313/QUI/2013, also co-funded by FEDER/COMPETE 2020-UE.
