Chemometrics and Intelligent Laboratory Systems 147 (2015) 121–130
Parallel factor (PARAFAC) analysis on total synchronous fluorescence spectroscopy (TSFS) data sets in excitation–emission matrix fluorescence (EEMF) layout: Certain practical aspects
Keshav Kumar ⁎, Ashok Kumar Mishra
Department of Chemistry, Indian Institute of Technology-Madras, Chennai-600036, India
Article info
Article history: Received 7 May 2015; Received in revised form 5 August 2015; Accepted 7 August 2015; Available online 14 August 2015
Keywords: PARAFAC; TSFS; EEMF; Missing values; Trilinear; Core-consistency
Abstract
In a recently developed procedure for parallel factor (PARAFAC) analysis of total synchronous fluorescence spectroscopy (TSFS) data, the non-trilinearity of TSFS data sets was addressed by representing the TSFS data in an excitation–emission matrix fluorescence (EEMF) layout, which gives the data the trilinear structure that PARAFAC analysis requires. This representation generates a large number of variables that contain no experimentally acquired fluorescence information. Such variables must be handled properly before TSFS data in EEMF layout are subjected to PARAFAC analysis. Based on our understanding of the mechanism by which PARAFAC analysis of TSFS data in EEMF layout works, we considered three ways of handling such variables: (i) assigning a value of zero, (ii) assigning missing (NaN) values, and (iii) assigning a combination of zero and NaN values. We evaluated each of these possibilities and compared the outcomes of PARAFAC analysis with respect to two parameters: (i) the proximity between the actual and retrieved TSFS profiles of all three fluorophores and (ii) the time taken for the PARAFAC algorithm to converge. The results showed that better analytical results are obtained when all variables with no experimentally acquired information are set to missing (NaN) values, although the computational time is then significantly longer. PARAFAC analysis tends to converge prematurely when a value of zero is assigned to all such variables. The present work also shows that a combination of zero and missing values makes it possible to optimize the computational time and still retrieve PARAFAC-separated TSFS profiles from the EEMF layout with reasonable purity. © 2015 Elsevier B.V. All rights reserved.
⁎ Corresponding author. Tel.: +91 44 22574207; fax: +91 44 22574202. E-mail address: [email protected] (K. Kumar).
http://dx.doi.org/10.1016/j.chemolab.2015.08.008
0169-7439/© 2015 Elsevier B.V. All rights reserved.
1. Introduction
In recent years, the application of multi-way data analysis techniques in analytical chemistry has increased significantly. Essentially, a multi-way data analysis technique makes data interpretation more effective for analysing the components of a multi-component system. Parallel factor (PARAFAC) analysis is one of the most commonly used multi-way data analysis techniques [1–6]. PARAFAC has mostly been applied to three-way data sets such as (i) sample × excitation wavelength × emission wavelength, obtained from excitation–emission matrix fluorescence (EEMF) spectroscopy, (ii) sample × m/z values (i.e. mass/charge) × elution profile, obtained from chromatographic techniques coupled with mass spectrometry, and (iii) sample × spectral profile × elution profile, obtained from chromatographic techniques coupled with a UV–vis spectrophotometer or fluorometer [7–18]. PARAFAC is known to have the second-order advantage, i.e. it can identify and quantify the components of a multi-component system even in the presence of unknown interferences [19–24]. A successful application of PARAFAC analysis requires that the three-way array have a trilinear structure; otherwise, the results obtained will not be chemically, spectroscopically or mathematically meaningful [1–6,25,26]. A three-way data set is classified as trilinear if (i) each of the three modes (i.e. dimensions) contains an equal number of factors, (ii) the intensity of the instrumental response for each factor is directly proportional to its concentration and (iii) the shape of the instrumental profile of each component along a given mode is invariant to changes in the other modes [1,4–6,25–28]. The trilinear decomposition of a three-way data set X of dimension I × J × K using PARAFAC analysis can be expressed using Eq. (1).
$$x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk} \qquad (1)$$
Here $x_{ijk}$, $a_{if}$, $b_{jf}$, $c_{kf}$ and $e_{ijk}$ are the elements of X, A, B, C and E, respectively. A is a matrix of dimension I × F, B is a matrix of dimension J × F, C is a matrix of dimension K × F and E is a three-way array of dimension I × J × K. The matrices A, B and C contain the profiles of each of the F factors
along the first, second and third modes, respectively; E contains the residual information (i.e. the unexplained variance of the data set) [1,4–6].
Total synchronous fluorescence spectroscopy (TSFS) [25,26,28–30] is an unconventional steady-state fluorescence technique whose analytical utility has been demonstrated by the successful analysis of environmental [25,26,28,31], biological [32,33], pharmaceutical [33,34] and petrochemical [26,28–30,35–37] samples. TSFS essentially describes the variation of synchronous fluorescence (SF) spectra [38,39] as a function of the wavelength offset (Δλ = λem − λex, where λem is the emission wavelength and λex is the excitation wavelength) [25,26,28–31]. Like EEMF spectroscopy [40,41], TSFS captures the fluorescence of all the fluorophores of a multifluorophoric mixture [25,26,28–31]. In our recent work [42], we showed that TSFS offers certain analytical advantages over EEMF for the analysis of multifluorophoric samples. For example, (i) Rayleigh scattering, which carries no fluorescence information, can be excluded from the data set at the data acquisition stage in TSFS mode, whereas mathematical operations are required to eliminate it from EEMF data sets, and (ii) TSFS-based fingerprints are more distinctive than EEMF-based fingerprints [42]. However, the TSFS data structure is intrinsically different, and its trilinear decomposition using PARAFAC analysis is not as straightforward as it is for EEMF data sets. In another recent work [25], we showed that the shape of the SF spectrum of a fluorophore changes with the offset, which gives a non-trilinear structure to the TSFS-based three-way data, sample × excitation profile × offset. Thus, TSFS-based three-way data in this layout must not be subjected to PARAFAC analysis directly.
In the same work [25], we further showed that the trilinear decomposition of TSFS-based three-way data using PARAFAC analysis can be achieved by converting the data to an EEMF-like data set. Arranging the TSFS data in an EEMF-like layout, sample × excitation wavelength × (excitation wavelength + offset), gives the TSFS-based three-way data the required trilinear structure. There are, however, certain practical aspects that need to be considered when performing PARAFAC analysis on a TSFS data set in EEMF-like layout. One serious issue is the generation of a large number of variables in the EEMF layout that contain no experimentally acquired information. Such variables must be handled properly; otherwise, they might (i) increase the computational time significantly or (ii) cause premature convergence of the PARAFAC analysis. The objective of the present work is to understand and compare the various ways of handling this issue. To date, there has been no work on the practical aspects of PARAFAC analysis of TSFS data sets in EEMF layout. The specific objective of the present work is to optimize the computational time while retrieving the PARAFAC-separated pure TSFS spectrum of each component from the EEMF layout.
2. Theory
The PARAFAC model of a three-way data set X, represented by Eq. (1), is generally fitted using the alternating least squares (ALS) algorithm with the objective of minimising the Frobenius norm of the residual array E [1–6,25,28]. The theory of fitting the PARAFAC model is given in the supplementary information of the present work; theoretical aspects of PARAFAC analysis can also be found elsewhere in the literature [1–6,25,28]. PARAFAC fits all F factors simultaneously, i.e. it is based on a non-sequential algorithm [4–6,25,28]. It is therefore essential that the number of factors used to fit the PARAFAC model equal the number of components of the multi-component system. The most commonly used method for finding the optimum number of factors is the core consistency diagnostic (CCD) test [5,9,25,28,43]. A core consistency percentage of less than 70% indicates a lack of trilinearity and overfitting of the data by the PARAFAC model [9,25,28,43]. A brief description of the CCD test is also given in Supplementary data.
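As an illustration of what the ALS fit evaluates at each step, the following minimal Python sketch computes the model part of Eq. (1) from three loading matrices and the squared Frobenius norm of the residual. It is not the authors' MATLAB code; the function names and the toy loadings are assumptions chosen only for illustration.

```python
def parafac_reconstruct(A, B, C):
    """Model part of Eq. (1): x_hat[i][j][k] = sum over f of A[i][f]*B[j][f]*C[k][f]."""
    n_factors = len(A[0])
    return [[[sum(A[i][f] * B[j][f] * C[k][f] for f in range(n_factors))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]


def residual_sum_of_squares(X, X_hat):
    """Squared Frobenius norm of E = X - X_hat, the quantity ALS tries to reduce."""
    return sum((x - xh) ** 2
               for Xi, Xhi in zip(X, X_hat)
               for Xij, Xhij in zip(Xi, Xhi)
               for x, xh in zip(Xij, Xhij))


# Toy loadings (made-up numbers): I = 2 samples, J = 2, K = 3, F = 2 factors.
A = [[1.0, 0.0], [0.5, 2.0]]
B = [[1.0, 0.2], [0.3, 1.0]]
C = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
X_hat = parafac_reconstruct(A, B, C)
```

In an actual ALS run, A, B and C would be updated in turn until the residual sum of squares stops changing appreciably.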
3. Experimental section
The present work is a natural extension of our previous work in this area; therefore, we used the same system: dilute aqueous mixtures of the three fluorophores benzo[a]pyrene, chrysene and pyrene. Moreover, we performed the TSFS measurements on this system with the instrumental parameters used in our previous work.
3.1. Chemicals and sample preparation
Three polycyclic aromatic hydrocarbons (PAHs), benzo[a]pyrene (BaP), chrysene (CY) and pyrene (PY), were purchased from Sigma-Aldrich and used without further purification. Separate stock solutions of each of the three PAHs were prepared in analytical grade acetone; further dilutions were also made with acetone. The required amounts of BaP, CY and PY were taken from the respective dilute stock solutions into 23 different vials of 10 mL volume. The acetone was removed completely by passing nitrogen gas, and 5 mL of triple-distilled water was added to the residue to make the working solutions. The concentrations of BaP, CY and PY in the samples of the calibration set are reported in Table 1.
3.2. Instrument and data acquisition
A Fluoromax 4 (Horiba Jobin Yvon) spectrofluorometer equipped with a 150 W xenon lamp as the excitation source was used to collect the TSFS data sets of the samples. The instrumental parameters used for TSFS data acquisition, namely (i) excitation wavelength range, (ii) offsets (Δλ's), (iii) integration time and (iv) band-passes of the excitation and emission monochromators, are reported in Table 2.
3.3. Software used
In the present work, MATLAB R2008b was used to process the fluorescence data. PARAFAC analysis was performed using PLS_Toolbox 5.0.3 (Eigenvector Research), written in the MATLAB language. All the computational work was performed on a laptop with an Intel Core i3 processor, a 64-bit operating system and 3 GB of RAM.
Table 1 Concentrations of chrysene, pyrene and benzo[a]pyrene in the samples of the calibration set.
Sample | Chrysene (CY) ×10^−8 M | Pyrene (PY) ×10^−8 M | Benzo[a]pyrene (BaP) ×10^−8 M
1 | 1.48 | 1.86 | 0.00
2 | 0.74 | 0.93 | 1.65
3 | 1.48 | 0.56 | 1.15
4 | 0.74 | 0.93 | 1.65
5 | 1.48 | 0.00 | 1.65
6 | 1.48 | 0.93 | 0.82
7 | 0.44 | 0.30 | 0.65
8 | 1.04 | 1.86 | 0.49
9 | 0.00 | 1.86 | 1.65
10 | 0.37 | 0.47 | 2.47
11 | 0.37 | 2.80 | 0.41
12 | 2.23 | 0.47 | 0.41
13 | 2.97 | 0.00 | 0.00
14 | 0.18 | 3.26 | 0.21
15 | 2.60 | 0.23 | 0.21
16 | 1.48 | 1.86 | 1.65
17 | 0.74 | 0.93 | 0.82
18 | 0.00 | 3.73 | 0.00
19 | 1.48 | 0.93 | 1.65
20 | 0.00 | 0.00 | 3.30
21 | 0.18 | 0.23 | 2.88
22 | 1.48 | 1.86 | 0.82
23 | 0.74 | 1.86 | 1.65
Table 2 Instrumental parameters used for TSFS data acquisition on the Fluoromax-4 fluorometer.
Instrumental parameter | Value
Excitation wavelength range | 250–600 nm with a step size of 2 nm
Offset (Δλ) | 20–150 nm with a step size of 10 nm
Band-passes for excitation/emission monochromators | 7/7 nm
Integration time | 0.01 s
3.4. MATLAB codes
The codes required for (i) representing TSFS data (excitation wavelength × Δλ) in EEMF layout (excitation wavelength × (excitation wavelength + highest offset)), (ii) creating the three-way array, sample × excitation wavelength × (excitation wavelength + offset), of TSFS data in EEMF layout, (iii) assigning zero or missing (NaN) values to variables in the EEMF layout and (iv) extracting TSFS data back from the EEMF layout were written in the MATLAB language. These codes are given in the appendix.
Fig. 1. Raw and blank subtracted TSFS spectra of BaP, CY and PY. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)
4. Results and discussion
4.1. Data arrangement
The TSFS data sets acquired in the excitation wavelength range 250–600 nm (step size 2 nm) at offsets of 20–150 nm (step size 10 nm) for all 23 samples were arranged to create a three-way array of dimension 23 × 176 × 14 (i.e. sample × excitation wavelength × Δλ). TSFS contour plots of pure BaP, CY, PY and the solvent blank are shown in Fig. 1. There is significant spectral overlap among the chosen fluorophores. The fluorescence of BaP, CY and PY also overlaps significantly with the Raman scattering of the solvent (water), which appears diagonally. Thus, Raman correction of the TSFS data sets of all 23 samples is necessary before they are processed by any multi-way data analysis technique. Blank subtraction was performed to eliminate Raman scattering from the TSFS data sets; the blank-subtracted TSFS contour plots of BaP, CY and PY are also shown in Fig. 1. As we have shown in our previous work [25,42], the close correspondence between the fluorescence intensity expressions of a fluorophore in TSFS and EEMF modes, given in Eqs. (2) and (3), respectively, enables the presentation of TSFS data in an EEMF-like layout.
$$I_{TSFS}(c, \lambda_{ex}, \Delta\lambda) = K c d\, E_X(\lambda_{ex})\, E_M(\lambda_{ex} + \Delta\lambda) \qquad (2)$$
$$I_{EEMF}(c, \lambda_{ex}, \lambda_{em}) = K c d\, E_X(\lambda_{ex})\, E_M(\lambda_{em}) \qquad (3)$$
where $I_{TSFS}$ and $I_{EEMF}$ are the fluorescence intensities of a fluorophore in TSFS and EEMF mode, respectively; K is a constant accounting for instrumental parameters, c is the concentration of the fluorophore, d is the path length, and $E_X$ and $E_M$ are the excitation and emission profiles, respectively. Mathematically, an EEMF layout of dimension excitation wavelength × (excitation wavelength + highest Δλ) is required to contain a TSFS data set of dimension excitation wavelength × Δλ. In the present work, an EEMF-like layout of dimension 176 × 326 is required for each of the 23 samples to hold its TSFS data set of dimension 176 × 14. The TSFS data set of each sample contains 2464 variables, whereas the EEMF-like layout has 57,376; that is, 54,912 variables (57,376 − 2464) contain no experimentally acquired information. Each EEMF layout thus contains less than 5% experimentally acquired fluorescence data, and the remaining ~95% is devoid of any such information. Conversion of a TSFS data set to an EEMF-like data set gives it the trilinearity required for PARAFAC analysis; however, a large number of variables that contain no experimentally acquired information are generated. Fig. 2 shows schematically the TSFS data set of dimension 176 × 14 represented in the EEMF layout of dimension 176 × 326 (excitation × (excitation + highest Δλ)). The TSFS data lie along the fourteen white strips that appear diagonally in the EEMF layout.
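The bookkeeping above can be checked with a short Python sketch. It is not the authors' appendix code. It assumes an index-based mapping in which the value at excitation row i and offset d is placed in column i + d, an assumption inferred from the stated 176 × 326 size; the `fill` options and function names are hypothetical. The "mixed" option places NaN between the first and last strips and zero elsewhere, the combination the paper examines in Section 4.6.

```python
import math

N_EX = 176                           # excitation points: 250-600 nm, 2 nm step
OFFSETS = list(range(20, 151, 10))   # 14 offsets: 20-150 nm, 10 nm step
N_EM = N_EX + max(OFFSETS)           # 326 columns, as stated in the text


def tsfs_to_eemf(tsfs, fill="nan"):
    """Place a 176 x 14 TSFS matrix on the diagonal strips of a 176 x 326 layout.

    Assumption: row i, offset d maps to column i + d (index-based mapping).
    fill = "zero", "nan" or "mixed" (NaN between first and last strip, zero outside).
    """
    if fill not in ("zero", "nan", "mixed"):
        raise ValueError("fill must be 'zero', 'nan' or 'mixed'")
    nan = float("nan")
    outer = nan if fill == "nan" else 0.0    # outside the band of strips
    inner = 0.0 if fill == "zero" else nan   # between first and last strip
    first, last = OFFSETS[0], OFFSETS[-1]
    layout = []
    for i in range(N_EX):
        row = [outer] * N_EM
        for col in range(i + first, i + last + 1):
            row[col] = inner
        for k, d in enumerate(OFFSETS):
            row[i + d] = tsfs[i][k]          # measured value on its strip
        layout.append(row)
    return layout


def count_nan(layout):
    """Number of NaN cells in a layout."""
    return sum(1 for row in layout for v in row if math.isnan(v))


tsfs = [[1.0] * len(OFFSETS) for _ in range(N_EX)]   # dummy intensities
total, measured = N_EX * N_EM, N_EX * len(OFFSETS)
print(total, measured, total - measured)             # 57376 2464 54912
print(count_nan(tsfs_to_eemf(tsfs, "nan")))          # 54912
print(count_nan(tsfs_to_eemf(tsfs, "mixed")))        # 20592
```

Under this assumed mapping, the "mixed" fill leaves 20,592 of 57,376 cells as NaN, about 36%, which agrees with the proportion reported for the combined strategy.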
The variables in the dark portions of the layout contain no information. Before analysing the various ways of handling these variables, it is necessary to summarise briefly the steps by which the TSFS profile of each fluorophore is created from its PARAFAC-separated EEMF layout. (i) PARAFAC analysis of TSFS data in EEMF-like layout generates an excitation profile and an emission profile (i.e. excitation + offset) for each fluorophore. (ii) An EEMF-like spectrum is created by multiplying the excitation and emission profiles obtained. (iii) A number of synchronous fluorescence spectra are obtained from the EEMF spectrum by moving diagonally at the chosen offsets (Δλ's) between the excitation and emission axes. (iv) The synchronous fluorescence spectra are then arranged as a function of increasing offset to create the TSFS profile of each fluorophore. From the mathematical steps given in Eqs. S4–S7 of Supplementary data, it can be seen that PARAFAC analysis involves the Khatri–Rao product [5,44], which multiplies only certain pairs of elements of the loading matrices containing the excitation, emission and contribution profiles of each factor. In other words, each element of the first column of the first matrix is multiplied only with each element of the first column of the second matrix. Similarly, each element of the second column of the first matrix is multiplied with each element of the second column of the second matrix. To make this clear, Eq. (4) gives an example of the Khatri–Rao product between two matrices B and C.
$$B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}_{2 \times 2} \quad \text{and} \quad C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}_{2 \times 2}$$
$$B\,|\otimes|\,C = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}_{2 \times 2} |\otimes| \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}_{2 \times 2} = \begin{pmatrix} b_{11}c_{11} & b_{12}c_{12} \\ b_{11}c_{21} & b_{12}c_{22} \\ b_{21}c_{11} & b_{22}c_{12} \\ b_{21}c_{21} & b_{22}c_{22} \end{pmatrix}_{4 \times 2} \qquad (4)$$
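The column-wise pairing in Eq. (4) can be written as a short Python sketch. The function name `khatri_rao` and the numeric matrices are illustrative assumptions, not part of the authors' MATLAB code.

```python
def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x F) and C (K x F), giving (J*K x F)."""
    n_factors = len(B[0])
    if len(C[0]) != n_factors:
        raise ValueError("B and C must have the same number of columns")
    # Row (j, k) of the result holds B[j][f] * C[k][f] for every factor f,
    # so column f of B only ever meets column f of C.
    return [[B[j][f] * C[k][f] for f in range(n_factors)]
            for j in range(len(B))
            for k in range(len(C))]


B = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
print(khatri_rao(B, C))  # [[5, 12], [7, 16], [15, 24], [21, 32]]
```

The row order matches Eq. (4): the rows are indexed first by the row of B and then by the row of C.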
From our understanding and the discussion presented so far, there are three possibilities for handling the variables that contain no information when TSFS data are converted to an EEMF-like data set: (i) only zeros, (ii) only missing values or (iii) a combination of zeros and missing values. We evaluate each of these possibilities and compare the outcomes of the PARAFAC analysis with respect to two parameters: (i) the closeness between the actual and retrieved TSFS profiles of all three fluorophores and (ii) the time taken for the PARAFAC algorithm to converge. It should be noted that the goal of any computational data analysis technique is to achieve correct results in a short computational time.
4.2. Data pre-processing, constraints and parameters used for PARAFAC analysis
Fig. 2. Schematic of TSFS data sets in EEMF like layout.
In the present work, PARAFAC analysis of the TSFS data in EEMF layout, of dimension 23 × 176 × 326, was performed with the following data pre-processing, constraints and parameters. (i) No conventional mathematical pre-processing was performed; however, the solvent blank TSFS spectrum was subtracted from the TSFS spectrum of each of the 23 samples of the calibration set to eliminate Raman scattering from the fluorescence data. (ii) Non-negativity constraints were imposed on each of the three modes because fluorescence intensity and concentration values cannot be negative. (iii) The maximum number of iterations was set to 10,000. (iv) The stop criterion was a relative or absolute change in fit of 10^−6. (v) The maximum calculation time was set to 3600 s.
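The termination settings (iii)–(v) can be read as a simple stopping rule. The sketch below is one plausible reading in Python; it is not the PLS_Toolbox implementation, and the function name, argument names and the order in which the criteria are checked are assumptions.

```python
def stop_reason(prev_fit, fit, iteration, elapsed_s,
                tol=1e-6, max_iter=10_000, max_time_s=3600.0):
    """Return why the fitting loop should stop, or None to keep iterating.

    prev_fit and fit are the residual sums of squares of two consecutive
    iterations. The order of the checks is an assumption.
    """
    abs_change = abs(prev_fit - fit)
    rel_change = abs_change / prev_fit if prev_fit > 0 else 0.0
    if rel_change < tol:
        return "relative change in fit"
    if abs_change < tol:
        return "absolute change in fit"
    if iteration >= max_iter:
        return "maximum iterations"
    if elapsed_s >= max_time_s:
        return "maximum time"
    return None
```

With residual sums of squares as large as those of raw fluorescence counts, the relative criterion would normally be met well before the absolute one.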
4.3. PARAFAC analysis on TSFS data sets in EEMF layout with zeros
As discussed above, PARAFAC analysis involves a special type of product between the elements of the loading matrices. Moreover, only a particular set of elements is selected from the EEMF layout to create the pure TSFS profile. Thus, assigning a value of zero to the variables devoid of any information seems a good choice to begin with. PARAFAC analysis of the TSFS data set in EEMF layout containing zeros was performed with different numbers of factors. The core consistency percentage was calculated for each PARAFAC model and plotted against the number of factors used; the core consistency plot is shown in Fig. 3(a). The core consistency percentage was found to be one hundred for every model up to ten factors. However, we preferred the three-factor PARAFAC model because there are three fluorophores in the mixture. Details of the PARAFAC modelling, such as the number of iterations before convergence, the relative and absolute changes in fit and the time taken for convergence, are summarised in Table 3. The PARAFAC model converged in a very short time and explained only 24.29% of the variance of the data set. The loading vectors along the second and third modes for each of the three factors were subjected to the Khatri–Rao product. The resulting vector for each of the three factors was then reshaped to create its mathematical EEMF profile. In the next step, we selected the synchronous spectra from the mathematical EEMF profile at the 14 offsets to create the TSFS profile of each of the three factors, shown in Fig. 4. The mathematically retrieved TSFS spectra do not correlate at all with the experimentally acquired TSFS spectra of the fluorophores. This is due to (i) the excess of zeros in the EEMF layout, which imparted non-trilinearity to the data set, and (ii) premature convergence of the PARAFAC model.
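The retrieval step described above (multiply the excitation and emission loadings of one factor, then read the synchronous spectra off the diagonals) can be sketched in Python as follows. For a single factor, the reshaped Khatri–Rao product is the outer product of the two loading vectors. The function names, the toy loadings and the index-based mapping (row i, offset d, column i + d) are assumptions for illustration, not the authors' appendix code.

```python
def eemf_from_loadings(ex, em):
    """Outer product of one factor's excitation and emission loading vectors."""
    return [[e * m for m in em] for e in ex]


def tsfs_from_eemf(eemf, offsets):
    """Read the synchronous spectra back off the diagonal strips.

    Assumption: the value for row i and offset d sits in column i + d.
    """
    return [[row[i + d] for d in offsets] for i, row in enumerate(eemf)]


# Toy factor (made-up loadings): 3 excitation points, offsets of 1 and 2 columns.
ex = [1.0, 2.0, 3.0]
em = [0.0, 10.0, 20.0, 30.0, 40.0]
tsfs = tsfs_from_eemf(eemf_from_loadings(ex, em), offsets=[1, 2])
print(tsfs)  # [[10.0, 20.0], [40.0, 60.0], [90.0, 120.0]]
```

Each column of the result is one synchronous spectrum, so arranging the columns by increasing offset gives the TSFS profile of the factor.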
Furthermore, the model explained only ~25% of the variance of the data set, which means that most of the variance is unexplained. Thus, in principle, the residuals must contain the spectral information of interest. The residual loading vectors along the second and third modes were also subjected to the Khatri–Rao product and, following the scheme discussed above, we created a TSFS profile, which is also shown in Fig. 4. The TSFS profile created from the residual information of the PARAFAC model was found to contain certain fluorescence characteristics of BaP, CY and PY. This shows that the PARAFAC model could not model the TSFS data set adequately. To test
whether the quality of the retrieved spectra could be improved by performing PARAFAC analysis without constraints, we removed the non-negativity constraints from each of the three modes. The retrieved TSFS spectra of the three factors again showed no correspondence with the experimentally acquired TSFS spectra of the fluorophores. Thus, both PARAFAC models, constrained and unconstrained, failed to achieve a trilinear decomposition of the TSFS data sets in EEMF layout.
4.4. Missing values and PARAFAC analysis
It is well established that ALS-based PARAFAC analysis needs certain modifications before it can be employed for the trilinear decomposition of data sets containing missing values (NaN) [45–48]. One way of handling this problem is the expectation maximisation (EM) approach [45–50], which is a special case of the single imputation algorithm. The EM approach explained by Tomasi and Bro [45] is briefly presented in Supplementary data; a number of reports in the literature explain various aspects of the EM algorithm in detail [45,49,50]. In the next part of the work, PARAFAC analysis is carried out on TSFS data in EEMF layout in which the variables with no information are set to missing values. The EM approach discussed above and given in Supplementary data was used to handle these variables. The PARAFAC algorithm available in PLS_Toolbox 5.0.3 has the necessary MATLAB code to handle missing values [51].
4.5. PARAFAC analysis on TSFS data in EEMF layout containing missing values
The EEMF layout containing the TSFS data set has approximately 95% of its variables set to missing values. PARAFAC analysis of such a data set is a computationally challenging task. However, as discussed above, because (i) special kinds of multiplication (e.g. Khatri–Rao or Hadamard product) are involved in the PARAFAC algorithm [5] and (ii) a unique diagonal relationship exists between TSFS and EEMF data sets [25,28], it is possible to perform such calculations. PARAFAC analysis of the TSFS data set in EEMF layout with missing values was performed with the constraints and parameters specified earlier in the present work. The core consistency plot showing the core-consistency percentage of
Fig. 3. Core consistency analysis plot of PARAFAC model of TSFS data in EEMF layout where (a) variables with no experimentally acquired information (~95%) are set to zero, (b) variables with no experimentally acquired information are set to NaN, (c) variables with no experimentally acquired information are set to combination of zero (~59%), and NaN (~36%) (d) variables with no experimentally acquired information are set to combination of zero (~33%) and NaN (~62%).
Table 3 Mathematical parameters obtained from three-factor PARAFAC analyses of the TSFS data set in EEMF layout containing zeros, NaN or a combination of zero and NaN values.
Parameter | Model A | Model B | Model C | Model D
Constraints on three modes | Non-negative | Non-negative | Non-negative | Non-negative
Relative change in fit | 7.7751 × 10^−7 | 1.7709 × 10^−7 | 9.9194 × 10^−7 | 7.7709 × 10^−7
Absolute change in fit | 7.5400 × 10^8 | 9.1829 × 10^7 | 1.2533 × 10^8 | 8.1829 × 10^7
Number of iterations | 93 | 8342 | 511 | 4764
Iteration or calculation time | 17 s | 2428 s | 139 s | 1334 s
Cause of termination | Relative change in fit | Relative change in fit | Relative change in fit | Relative change in fit
Core consistency analysis | Unable to predict correct number of factors | Predicted the correct number of factors precisely | Unable to predict correct number of factors precisely | Unable to predict correct number of factors precisely
Model A: PARAFAC analysis on TSFS data in EEMF layout containing zeros; Model B: PARAFAC analysis on TSFS data in EEMF layout containing missing (NaN) values (amount of missing values is 95%); Model C: PARAFAC analysis on TSFS data in EEMF layout containing a combination of zero and missing (NaN) values (amount of missing values is 36%); Model D: PARAFAC analysis on TSFS data in EEMF layout containing a combination of zero and missing (NaN) values (amount of missing values is 62%).
the PARAFAC model fitted with an increasing number of factors is given in Fig. 3(b). The core consistency drops from one hundred to zero when the number of factors is changed from three to four. Thus, a three-factor PARAFAC model is a legitimate choice for approximating the TSFS data. The evaluation parameters of the PARAFAC model, such as the number of iterations before convergence, the relative and absolute changes in fit and the time taken for convergence, are reported in Table 3. The time taken for convergence is much longer than in the previous case, where a value of zero was assigned to the variables with no experimental information. There are 95% missing values in the EEMF layout, and estimating them in a way that also ensures trilinear decomposition of the entire data set is certainly a time-consuming task. The PARAFAC model explained approximately 80% of the variance of the data set. The loading vectors of the three factors along the second and third modes were multiplied together to generate their mathematical EEMF profiles, shown in Fig. 5(a). We compared the PARAFAC-retrieved EEMF profiles of these three factors with the experimentally collected EEMF spectra of BaP, CY and PY, given in Fig. 5(b). There is no linear correlation between the variables along the emission-wavelength axis of the experimentally acquired EEMF and the variables along the “excitation wavelength + offset” axis of the mathematically retrieved EEMF spectrum. Therefore, the comparison can only be carried out along the excitation wavelength axes, where a perfect linear correlation exists between the variables of the mathematically retrieved and experimentally acquired EEMF. The mathematically retrieved profiles of the first, second and third factors correlate with the EEMF
spectra of CY, PY and BaP, respectively. The EEMF profiles of these three factors do contain some mathematical artefacts, but it is still easy to recognise the BaP, CY and PY fluorescence. The slight differences between the experimentally acquired EEMF and that obtained from PARAFAC analysis of TSFS are also due to the fact that, at the instrumental level, the fluorescence data acquisition processes in EEMF and TSFS modes are conceptually different. Moreover, Raman scattering signals tend to interfere with fluorescence to different extents in TSFS and EEMF modes [42]. Nevertheless, the results clearly show that PARAFAC analysis with the EM approach is capable of retrieving the spectral profiles from a multi-component system even if a large proportion of the variables in the data set carry no experimentally acquired information. The TSFS profile of each of the three factors was created by collecting the data from its mathematical EEMF profile at the appropriate offsets between the emission and excitation axes. The PARAFAC-separated TSFS profiles of all three fluorophores, viz. BaP, CY and PY, shown in Fig. 6, are found to be pure. The mathematical artefacts present in the mathematical EEMF profile of CY did not appear in its TSFS profile. Thus, by assigning NaN instead of zero to all the variables in the EEMF layout that have no experimentally acquired information, it is possible to retrieve the EEMF as well as the pure TSFS profiles of all the fluorophores. However, there is a practical issue with the use of missing values: the computational time for PARAFAC analysis becomes very long (>2400 s). This needs to be addressed because the success of any analytical procedure demands that it be doable in a short
Fig. 4. TSFS profile of first factor, second factor, third factor and residual data set obtained from PARAFAC analysis of TSFS data in EEMF layout where variables with no experimentally acquired information are set to zero. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)
Fig. 5. (a) EEMF spectra of first factor (CY), second factor (PY) and third factor (BaP) obtained from PARAFAC analysis on TSFS data in EEMF layout with ~95% missing values and (b) experimentally acquired EEMF spectra of BaP, CY and PY. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)
amount of time, in particular when a large number of samples need to analysed. 4.6. PARAFAC analysis on TSFS data in EEMF layout with combination of zero and missing value: Optimization of the computational time To optimize the computational time while fitting PARAFAC model to TSFS data in EEMF layout, we tried the combination of zero and missing values. We assigned missing value (NaN) to the variables lying in the regions between first and fourteenth diagonal strips and a value of zero to the variables lying left side of the first diagonal strip and right side of the fourteenth diagonal strip in the EEMF layout, shown in Fig. 2. This way of handling the variables with no information makes sense because presence of zero in the layout would ensure faster convergence of PARAFAC model and missing values between the diagonals strips would ensure that fluorescence information is not lost. Using this approach amount, we could reduce amount of missing values (NaN) in EEMF layout from 95% to approximately 36%. Core consistency plot obtained from the PARAFAC analyses performed with different number of factors is given in Fig. 3(c). It can be seen that core consistency percentage decreases with the increase in the number of factors used to fit the PARAFAC model. We preferred three-factor PARAFAC model that has hundred percent core consistency value; moreover, beyond this with addition of every extra factor a decrease of 10% or more is observed in core-consistency percentage. Number of iteration before convergence, relative and absolute changes in fit and
time taken for convergence for the three-factor PARAFAC model are reported in Table 3. It can be observed that we could reduce the computational time for fitting the PARAFAC model from >2400 seconds to approximately 140 seconds. Thus, by reducing the amount of NaN values in the EEMF layout by approximately 60%, we could perform the analysis at a five times faster speed. The obtained three-factor PARAFAC model explains approximately 82% of the variance of the data set. Following the protocol, we created the PARAFAC separated TSFS profile of each of the three factors, shown in Fig. 7. In terms of spectral information, these TSFS profiles are less rich than those of the previous case (i.e. the PARAFAC model with NaN only). However, they retain the skeleton of BaP, CY and PY, which is very easy to identify. Nevertheless, a difference can be observed between the quality of the TSFS profiles of the fluorophores obtained from this PARAFAC model and that of the previous one.

Fig. 6. TSFS profile of BaP, CY and PY obtained from PARAFAC analysis of TSFS data in EEMF layout where variables with no experimentally acquired information are set to NaN. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)

Fig. 7. TSFS profile of BaP, CY and PY obtained from PARAFAC analysis of TSFS data in EEMF layout where variables with no experimentally acquired information are set to combination of zero (~59%) and NaN (~36%). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)

To see whether the quality of results can be improved by increasing the percentage of missing values in the EEMF layout, we assigned missing values (NaN) to certain variables on the left side of the first white strip and the right side of the fourteenth white strip in the EEMF layout shown in Fig. 2. The amount of missing values in the EEMF layout is thereby increased from 36% to 62% (approximately twice). The core consistency results, shown in Fig. 3(d), are similar to those obtained in the previous case (i.e. ~36% NaN values in the EEMF layout). Knowing that there are three fluorophores, we preferred a three-factor PARAFAC model, as we have done throughout the present work. The various parameters (i.e. iterations before convergence, calculation time, relative changes in fit, etc.) for this model are summarised in Table 3. There is a significant increase in the computational time involved in the PARAFAC analysis, primarily due to the increase in the percentage of missing values in the EEMF layout. The PARAFAC model explained 80% of the variance of the data set. The TSFS profiles of each of the three factors are given in Fig. 8. It can be seen that the retrieved TSFS profiles of BaP, CY and PY were similar to those obtained in the previous case, shown in Fig. 7. From the obtained results, it can be inferred that increasing the amount of missing values on the left side of the first white strip and the right side of the fourteenth white strip in the EEMF layout (shown in Fig. 2) does not improve the quality of the retrieved TSFS profiles much. This is because the variables between the first and fourteenth diagonal strips in the EEMF layout contain the same information in both cases and are not influenced by the amount of missing values outside these strips.
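The zero/NaN combination described in Section 4.6 can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code. It assumes that the offsets are expressed as whole steps of the excitation grid, so that each offset maps onto one diagonal strip of the layout; the function names, the toy data and the `outside`/`between` parameters are hypothetical and introduced only for illustration.

```python
import math

NAN = float("nan")


def tsfs_to_eemf_layout(tsfs, offsets, outside=0.0, between=NAN):
    """Place TSFS data on the diagonals of an EEMF-like layout.

    tsfs[i][j] is the intensity at excitation index i and offsets[j].
    Offsets are assumed to be integer multiples of the grid step.
    Cells between the first and last strips that were not measured get
    `between`; cells outside that band get `outside`.
    """
    n_ex = len(tsfs)
    lo, hi = min(offsets), max(offsets)
    n_em = n_ex + (hi - lo)  # width of the emission axis
    layout = []
    for i, row in enumerate(tsfs):
        out = [outside] * n_em
        # Fill the band spanned by the first and last diagonal strips.
        for col in range(i, i + (hi - lo) + 1):
            out[col] = between
        # Overwrite the strip positions with the measured intensities.
        for off, value in zip(offsets, row):
            out[i + off - lo] = value
        layout.append(out)
    return layout


def nan_fraction(layout):
    """Fraction of layout cells that are NaN."""
    cells = [v for row in layout for v in row]
    return sum(math.isnan(v) for v in cells) / len(cells)


# Toy data: 4 excitation points, 3 offsets (1, 3 and 5 grid steps).
tsfs = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0],
        [10.0, 11.0, 12.0]]
offsets = [1, 3, 5]

mixed = tsfs_to_eemf_layout(tsfs, offsets)                  # zero + NaN
all_nan = tsfs_to_eemf_layout(tsfs, offsets, outside=NAN)   # NaN only

print(nan_fraction(mixed))    # 0.25
print(nan_fraction(all_nan))  # 0.625
```

In this toy case the mixed layout carries far fewer NaN cells than the NaN-only layout while leaving the band between the strips untouched, which mirrors the trade-off discussed above.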
To make the present study comprehensive, we also performed quantitative analysis with the PARAFAC models obtained from TSFS data in EEMF layout containing (i) only missing values, (ii) a combination of zeros and missing values where the percentage of missing values is 36% and (iii) a combination of zeros and missing values where the percentage of missing values is 62%. The loading matrices obtained from PARAFAC analysis of these three data sets were subjected to principal component regression (PCR) analysis. The regression equations relating the actual and PCR predicted amounts of the three fluorophores are summarised in Table 4. It can be seen that the regression parameters (slope and square of the correlation coefficient, R²) are comparable for the three models. The obtained results clearly show that a combination of zero and missing values in the EEMF layout gives a good qualitative and quantitative picture of the system being analysed in a shorter time.
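To make the meaning of the regression parameters concrete, the following Python sketch fits a line Y = m*X + c between actual and predicted concentrations and computes R². It covers only this final comparison step, not the PCR itself, and it assumes an ordinary least-squares fit, which the text does not state explicitly. The function name and the concentration values are hypothetical and are not taken from the study.

```python
def fit_line(actual, predicted):
    """Least-squares fit of predicted = m * actual + c.

    Returns (m, c, r2), where r2 is the squared Pearson correlation
    coefficient between actual and predicted values.
    """
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary")
    m = sxy / sxx              # slope
    c = mean_y - m * mean_x    # intercept
    r2 = sxy ** 2 / (sxx * syy)
    return m, c, r2


# Hypothetical actual and predicted concentrations (arbitrary units).
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.8, 2.9, 3.6, 4.7]

m, c, r2 = fit_line(actual, predicted)
print(f"Y = {m:.2f}*X + {c:.2f}, R2 = {r2:.3f}")
```

A slope close to 1, an intercept close to 0 and an R² close to 1 indicate good agreement between actual and predicted amounts, which is the basis on which the three models are compared.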
Fig. 8. TSFS profile of BaP, CY and PY obtained from PARAFAC analysis of TSFS data in EEMF layout where variables with no experimentally acquired information are set to combination of zero (~33%) and NaN (~62%). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)
Table 4
Regression equation Y = m*X + c relating the actual and predicted concentrations of the fluorophores, where Y is the predicted concentration, X is the actual concentration, m is the slope and c is the intercept. R² is the square of the correlation coefficient between actual and predicted concentrations.

Model     CY                              PY                              BaP
Model B   Y = 0.90*X + 0.1, R² = 0.912    Y = 0.86*X + 0.19, R² = 0.861   Y = 0.95*X + 0.06, R² = 0.954
Model C   Y = 0.88*X + 0.11, R² = 0.883   Y = 0.84*X + 0.16, R² = 0.841   Y = 0.93*X + 0.03, R² = 0.932
Model D   Y = 0.90*X + 0.1, R² = 0.902    Y = 0.86*X + 0.19, R² = 0.853   Y = 0.95*X + 0.06, R² = 0.942

Model B: PARAFAC analysis on TSFS data in EEMF layout containing missing (NaN) values (amount of missing values is 95%); Model C: PARAFAC analysis on TSFS data in EEMF layout containing a combination of zero and missing (NaN) values (amount of missing values is 36%); Model D: PARAFAC analysis on TSFS data in EEMF layout containing a combination of zero and missing (NaN) values (amount of missing values is 62%).
5. Conclusions

From the various possibilities that we tested in the present work, it seems that setting the variables with no experimentally acquired information in the EEMF layout to missing (NaN) values and subjecting them to PARAFAC analysis is the best way to retrieve pure TSFS profiles of the fluorophores. However, an excess of missing values in the EEMF layout containing TSFS data makes PARAFAC analysis a time-consuming analytical procedure. On the other hand, PARAFAC analysis on TSFS data in EEMF layout with a value of zero assigned to all the variables with no experimentally acquired information was found to converge prematurely, thereby not allowing curve resolution. In the present work, it was also shown that by using a combination of zero and missing values one can optimize the computational time and still retrieve the TSFS profile of each fluorophore. However, it is still essential to develop an algorithm or procedure that will allow further optimization of the time required for PARAFAC analysis on TSFS data in EEMF layout. This, in turn, would enhance the analytical efficiency of the combination of TSFS and PARAFAC analysis.

Acknowledgments

The authors thank the Council of Scientific and Industrial Research (CSIR), New Delhi, India, for providing the financial support to carry out the work.

Appendix A. MATLAB Codes

(I) Code for representing the TSFS data (excitation wavelength × offset) in EEMF like layout (i.e. excitation wavelength × (excitation wavelength + offset)) where a value of zero is assigned to all the variables with no experimentally acquired information

(II) Code for replacing zero with missing (NaN) values in EEMF layout

(III) Code for creating three way arrays of EEMF layout (excitation + offset) × excitation

(IV) Code for extracting the TSFS data from EEMF like layout at various offsets to create TSFS data sets

where “X” is the TSFS data created by extracting the SF spectra at various offsets from “Y” (an EEMF like layout containing TSFS data).
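As a rough illustration of the extraction step in (IV), and not the authors' MATLAB code, a minimal Python sketch is given here. It assumes that the offsets are whole steps of the excitation grid and that the first column of the layout corresponds to the first excitation point plus the smallest offset. The names `X` and `Y` follow the sentence above; the function name and the toy layout are hypothetical.

```python
NAN = float("nan")


def eemf_layout_to_tsfs(layout, offsets):
    """Read the synchronous fluorescence (SF) spectra back from the layout.

    layout[i][k] is the cell at excitation index i and emission index k.
    For each offset, the SF spectrum lies on the diagonal k = i + offset - min(offsets).
    Returns X with X[i][j] = intensity at excitation i and offsets[j].
    """
    lo = min(offsets)
    return [[row[i + off - lo] for off in offsets]
            for i, row in enumerate(layout)]


# Toy EEMF-like layout: 3 excitation points, offsets of 0 and 2 grid steps.
# Measured values sit on two diagonals; NaN fills the gap between them and
# zero fills the cells outside the band.
Y = [[1.0, NAN, 2.0, 0.0, 0.0],
     [0.0, 3.0, NAN, 4.0, 0.0],
     [0.0, 0.0, 5.0, NAN, 6.0]]

X = eemf_layout_to_tsfs(Y, [0, 2])
print(X)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

Because only the strip positions are read, the result does not depend on whether the remaining cells were filled with zero or NaN.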
Appendix B. Supplementary data

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.chemolab.2015.08.008.