Analytica Chimica Acta 1087 (2019) 20–28
A new scattering correction method of different spectroscopic analysis for assessing complex mixtures
Long Li, Yankun Peng*, Yongyu Li, Fan Wang
College of Engineering, China Agricultural University, Beijing, 100083, China
Highlights
A new scattering correction method was proposed.
Additive effect elimination and the spectral ratio method can reduce the effect of light scattering.
LRC-SR, FD-SR and OPS-SR significantly improved the models' prediction performance compared with other methods.
The method is useful for solving problems in assessing complex mixtures.
Article info
Article history: Received 19 June 2019; Received in revised form 21 August 2019; Accepted 27 August 2019; Available online 30 August 2019
Keywords: Scattering correction; Multivariate calibration; Quantitative spectroscopic analysis; Addition coefficient elimination; Multiplicative coefficient elimination; Complex mixtures

Abstract
Physical properties such as particle size distribution and compactness have significant confounding effects on the spectral signals of complex mixtures, which multivariate linear calibration methods such as partial least-squares (PLS) cannot effectively model or correct. These effects therefore significantly deteriorate the predictive ability of calibration models in quantitative spectral analysis of complex mixtures. Here, new scattering correction methods are proposed to estimate the additive and multiplicative parameters that describe light scattering effects in each spectrum and hence mitigate the detrimental influence of additive and multiplicative effects on quantitative spectroscopic analysis of complex mixtures. Three correction methods are proposed to estimate the addition coefficient under two alternative assumptions, namely, that this coefficient is or is not related to the wavelength. After the addition coefficient has been eliminated, the multiplicative parameter can be eliminated by a simple but very efficient spectral ratio method. Furthermore, linear models are built with key variables, and the predictive performance of these models is verified using the root-mean-square error of the prediction datasets. The proposed methods were tested on one apple dataset and two publicly available benchmark datasets (i.e., near-infrared spectral data of meat and powder mixture samples) and compared with some existing correction methods. The results showed that (1) additive effects of different types of samples can be eliminated by different methods and (2) these methods can appreciably improve quantitative spectroscopic analysis of complex mixture samples. This study indicates that accurate quantitative spectroscopic analysis of complex mixtures can be achieved through the combination of additive effect elimination and the spectral ratio method.

Abbreviations: MSC, multiplicative signal correction; SNV, standard normal variate; PCA, principal component analysis; PLS, partial least squares; MLR, multiple linear regression; OPLEC, optical path length estimation and correction; OPLECm, modified optical path length estimation and correction; FD, first derivative; LRC, linear regression correction; OPS, orthogonal spatial projection; SR, spectral ratio method; CCARS, correlation analysis combined with competitive adaptive reweighted sampling; RMSE, root-mean-square error; RMSEC, root-mean-square error of the calibration set; RMSEP, root-mean-square error of the prediction set; RMSECV, root-mean-square error of cross validation of the calibration set.
* Corresponding author. College of Engineering, China Agricultural University, No. 17 Tsinghua East Road, Beijing, 100083, China. Tel.: +86 10 6273770; fax: +86 10 62737997.
E-mail addresses: [email protected] (L. Li), [email protected] (Y. Peng), [email protected] (Y. Li), [email protected] (F. Wang).
https://doi.org/10.1016/j.aca.2019.08.067
0003-2670/© 2019 Elsevier B.V. All rights reserved.
1. Introduction
The number of applications of spectroscopic techniques such as near-infrared (NIR), mid-infrared (MIR), and Raman spectroscopy has increased significantly across a range of detection sectors, such as the pharmaceutical and agricultural product fields [1–4]. However, when spectroscopic instrumentation is used to analyze complex mixture samples, the spectra are affected by sample-to-sample variability in physical properties. The light scattering effects caused by the physical differences between samples (e.g., particle size and shape, sample packing, and sample surface) can mask the spectral variations related to differences in the contents of chemical compounds in the samples. The light scattering effects mainly consist of two parts: an additive effect and a multiplicative effect. An additive effect primarily causes a baseline drift in a spectrum, whereas a multiplicative effect can "scale" the entire spectrum [5,6]. The presence of significant additive and multiplicative effects in spectral data can invalidate commonly used multivariate linear calibration methods such as principal component analysis (PCA), partial least squares (PLS) and multiple linear regression (MLR). Therefore, eliminating the additive and multiplicative effects in raw spectra and extracting the spectral information that is linearly related to the target chemical component are critical for quantitative spectroscopic analysis.
Multiplicative scatter correction (or multiplicative signal correction, MSC) and standard normal variate (SNV) scatter correction are the preprocessing techniques most widely used to remove the scattering effects caused by variations in the physical properties of samples [5,7]. The MSC method regresses each individual spectrum on a reference spectrum (such as the average spectrum) to estimate an intercept and a slope, and then corrects the spectrum by subtracting the intercept and dividing by the slope.
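The MSC correction described above, with SNV included for comparison, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the cited works: the function names `msc` and `snv`, the synthetic spectra, and the use of the mean spectrum as reference are hypothetical choices made for the example.

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a reference,
    then subtract the fitted intercept and divide by the fitted slope."""
    ref = X.mean(axis=0) if reference is None else np.asarray(reference)
    # np.polyfit fits every spectrum at once when y has shape (n_wavelengths, n_samples)
    slope, intercept = np.polyfit(ref, X.T, 1)
    return (X - intercept[:, None]) / slope[:, None]

def snv(X):
    """Standard normal variate: centre and scale each spectrum by its own
    mean and standard deviation."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

# Hypothetical data: one absorption band, distorted per sample
# by a scale factor a_i and an offset b_i.
wavelengths = np.linspace(850.0, 1050.0, 101)
base = np.exp(-((wavelengths - 950.0) / 30.0) ** 2)
a = np.array([0.8, 1.0, 1.3])
b = np.array([0.05, -0.02, 0.10])
X_raw = a[:, None] * base + b[:, None]

X_msc = msc(X_raw)
X_snv = snv(X_raw)
```

Because these synthetic spectra differ only in scale and offset, both corrections map them onto a single curve. With real samples, chemical differences are mixed into the fitted slope and intercept, so the corrected spectra are only as good as the assumption that the chemistry matches the reference.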
Similarly, the SNV method calculates the mean and standard deviation of each entire spectrum, then subtracts the mean from the raw spectrum and divides by the standard deviation to obtain the corrected spectrum. However, correction for light scattering effects is reliable only when the chemical change between the spectrum to be corrected and the reference spectrum is negligible, which is difficult to achieve in some practical scenarios. In addition, both the MSC and SNV methods need to process the entire spectrum or a wide wavelength band, and preprocessing of a single wavelength is meaningless; hence, the information at each spectral point cannot be effectively utilized. Other light scattering correction methods, such as EISC [8] and ISC [9], have the same problems as MSC and SNV.
To overcome these limitations, Chen et al. [10,11] developed a novel multiplicative effect correction approach, i.e., optical path length estimation and correction (OPLEC). The OPLEC method uses a two-step procedure to correct light scattering in a preprocessed spectrum after the baseline has been removed. First, the multiplication coefficients are obtained from a linear relationship with the raw spectrum, and then the multiplicative effects in the raw spectra of the test samples are removed by a dual-calibration strategy. Jin et al. [6] modified the OPLEC method (OPLECm) so that the multiplication coefficients accounting for multiplicative effects on the raw spectra are obtained by solving constrained optimization problems. Obviously, the performance of the OPLECm method
depends on the quality of the two linear correction models, but balancing the prediction effect of both linear models is sometimes difficult.
In this study, three different scattering correction methods were proposed for spectroscopic analysis. Light scattering correction includes two parts, i.e., elimination of the addition coefficient and elimination of the multiplication coefficient. To eliminate the addition coefficient for different types of samples, first derivative (FD), linear regression correction (LRC) and orthogonal spatial projection (OPS) were used, and their effects were compared under two different underlying assumptions, i.e., that the addition coefficient is or is not related to the wavelength. After the addition coefficient has been eliminated, the multiplication coefficient can be eliminated by a simple spectral ratio (SR) method. Furthermore, correlation analysis combined with competitive adaptive reweighted sampling (CCARS) was proposed to obtain an effective combination of different wavelength ratio points. These methods differ from existing scatter correction methods in two respects: (1) analysis of the ratio information of individual wavelengths is meaningful, and (2) the methods are no longer limited to each spectrum containing only one fixed multiplication coefficient.
The aims of this study were (1) to theoretically analyze the feasibility of the FD-SR, LRC-SR and OPS-SR methods for eliminating the addition and multiplication coefficients; (2) to select the key variables by the CCARS method and establish a multivariate linear correction model; and (3) to evaluate the performance of different correction methods using the root-mean-square error (RMSE).
2. Theory
2.1. Multivariate scattering model
For ideal transparent solutions containing J chemical components, the theoretical absorption spectrum of the ith sample, according to the Lambert-Beer law, has a linear relationship with the J chemical components. The relationship between the theoretical absorption spectrum and the J chemical components can be expressed by the following equation:
$$X_{i,\mathrm{chem}} = p \sum_{j=1}^{J} c_{i,j}\, s_j, \quad i = 1, 2, \ldots, I \tag{1}$$
$$X_i = a_i X_{i,\mathrm{chem}} + b_i \mathbf{l} \tag{2}$$
where the row vector $X_{i,\mathrm{chem}}$ is the theoretical absorption spectrum of sample i; $p$ is related to the optical path length and, ideally, is a constant; $c_{i,j}$ is the concentration of the jth chemical component in the ith sample; and the row vector $s_j$ describes the light absorption capacity of the jth chemical component, which is mainly related to the type of chemical component. Given the row vector $X_{i,\mathrm{chem}}$ and the values of $c_{i,j}$, a linear model can be obtained by multiple linear regression, and this model can be applied to predict the concentration of the jth chemical component in other samples. However, in the spectral detection of mixtures, the sample is usually a nontransparent medium, such as an apple. These mixture samples can be simplified into dispersions; because of the scattering effect, it is challenging to keep the optical path length in the sample constant. Thus, the scattered light may cause deviations from equation (1), where the variable $p$ becomes related to the physical properties of different samples. The deviation of the absorption spectrum is expressed by equation (2), where the row vector $X_i$ is the real absorption spectrum of sample i; $a_i$ is a multiplication coefficient related to the optical path length of different mixture samples, which can "scale" the entire raw spectrum and leads to a multiplicative bias of the Lambert-Beer law; $b_i$ is an addition coefficient that represents the baseline of the spectrum; and the vector $\mathbf{l} = [1, 1, 1, \ldots, 1]$ is introduced for matrix formality. Therefore, eliminating the multiplication and addition coefficients (i.e., $a_i$ and $b_i$ in equation (2)) is key to light scattering correction, which can improve the linear correlation between the spectrum and the target chemical component.
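To make the roles of the symbols in equations (1) and (2) concrete, the following sketch generates synthetic spectra from the model. All numerical values, array sizes and variable names are hypothetical; the only point is to show how $a_i$ and $b_i$ distort a spectrum that is otherwise linear in the concentrations.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_components, n_wavelengths = 6, 2, 50
p = 1.0  # nominal optical path length factor in equation (1)

# Hypothetical concentrations c_{i,j} and pure-component row vectors s_j
C = rng.uniform(0.1, 1.0, size=(n_samples, n_components))
S = rng.uniform(0.0, 1.0, size=(n_components, n_wavelengths))

# Equation (1): X_chem[i] = p * sum_j C[i, j] * S[j]
X_chem = p * C @ S

# Equation (2): X[i] = a_i * X_chem[i] + b_i * l, with l a vector of ones
a = rng.uniform(0.7, 1.3, size=n_samples)
b = rng.uniform(-0.1, 0.1, size=n_samples)
ones = np.ones(n_wavelengths)
X = a[:, None] * X_chem + b[:, None] * ones

# If the true a_i and b_i were known, the chemical spectrum would be recovered exactly
X_recovered = (X - b[:, None] * ones) / a[:, None]
```

In practice neither coefficient is known, which is why the following sections estimate or cancel them instead of inverting equation (2) directly.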
2.2. Elimination of addition coefficient based on two different assumptions
The baseline of the spectrum is mainly determined by the detection environment, e.g., the state of the spectrometer and the sample temperature. Baseline drift is the main cause of the addition coefficient, which is one of the reasons for the failure of the linear multivariate correction model. The purpose of correcting the baseline drift is to eliminate $b_i$ in equation (2). Two different hypotheses for the addition coefficient are proposed: (1) the addition coefficient is not related to the wavelength ($\lambda$) and represents only an up-and-down drift of the whole spectrum [5,7], so it depends only on the sample; (2) the addition coefficient is related to the wavelength, so it depends on both the sample and the wavelength, and the real spectrum can be described by the following extended multiplicative signal correction (EMSC) model [8]:
$$X_i = a_i X_{i,\mathrm{chem}} + b_i \mathbf{l} + w_i \lambda + d_i \lambda^2 + \varepsilon_i \tag{3}$$
where the coefficients $w_i$ and $d_i$ represent the spectral transformation related to $\lambda$ and $\lambda^2$ between different samples; the vectors $\lambda$ and $\lambda^2$ are the wavelength and wavelength squared, respectively; and the vector $\varepsilon_i$, which is the random measurement noise, is added to the spectrum of the ith sample.
The real spectrum can be expressed by equation (2) when the addition coefficient is not related to the wavelength. In this case, the first-derivative (FD) and the linear-regression-correction (LRC) preprocessing methods can eliminate the effect of the addition coefficient on the real spectrum, i.e., $X_i$. In equations (1) and (2), only $s_j$ is related to the wavelength; therefore, the FD method can be expressed by the following equation:
$$D_i = a_i p \sum_{j=1}^{J} c_{i,j} \frac{\partial s_j}{\partial \lambda} + \frac{\partial b_i}{\partial \lambda} = a_i p \sum_{j=1}^{J} c_{i,j}\, s_j' \tag{4}$$
where $D_i$ is the spectrum of the ith sample after FD preprocessing; $s_j$ is differentiable with respect to the wavelength; and $s_j'$ is the derivative of $s_j$. FD preprocessing is computed by moving-window second-order polynomial fitting.
The LRC method estimates the intercept of the regression of each individual spectrum on the reference spectrum and then corrects the spectrum by subtracting the intercept. The reference spectrum is the average spectrum ($X_{i,\mathrm{mean}}$) in this study, and the LRC method can be expressed by the following two equations:
$$X_i = a_{i,\mathrm{ref}} X_{i,\mathrm{mean}} + b_{i,\mathrm{ref}} \mathbf{l} \tag{5}$$
$$L_i = X_i - b_{i,\mathrm{ref}} \mathbf{l} \tag{6}$$
where $L_i$ is the spectrum of the ith sample after LRC preprocessing, and $a_{i,\mathrm{ref}}$ and $b_{i,\mathrm{ref}}$ are obtained from the linear regression of $X_i$ on $X_{i,\mathrm{mean}}$. In contrast to the corresponding variable in the MSC method, $a_{i,\mathrm{ref}}$ may contain parts related to the target chemical component; thus, the LRC method keeps the multiplication coefficient in the spectrum and leaves it for subsequent correction.
The real spectrum can be expressed by equation (3) when the addition coefficient is related to the wavelength and wavelength squared. The addition coefficient can then be removed by the orthogonal spatial projection (OSP) method, i.e., by projecting the measured spectrum $X_i$ onto the orthogonal complement of the space spanned by the row vectors of $Q = (\mathbf{l}; \lambda; \lambda^2)$. The OSP method can be expressed by the following equation [10]:
$$Z_i = X_i (E - Q^{+}Q) = (a_i X_{i,\mathrm{chem}} + b_i \mathbf{l} + w_i \lambda + d_i \lambda^2 + \varepsilon_i)(E - Q^{+}Q) = a_i p \sum_{j=1}^{J} c_{i,j}\, k_j + \varepsilon_i^{*}, \quad k_j = s_j (E - Q^{+}Q), \quad \varepsilon_i^{*} = \varepsilon_i (E - Q^{+}Q) \tag{7}$$
where $Z_i$ is the spectrum of the ith sample after OSP preprocessing and $E$ represents an appropriately dimensioned identity matrix. $k_j$ and $\varepsilon_i^{*}$ are the row vectors obtained by projecting $s_j$ and $\varepsilon_i$ onto the orthogonal complement of the space spanned by the row vectors of $Q$.
Three elimination methods for the addition coefficient are thus proposed under two assumptions, i.e., that the addition coefficient is or is not related to the wavelength, and they can be calculated by equations (3)–(7). The effects of the three addition coefficient elimination methods are illustrated with data analysis based on three different datasets.
2.3. Elimination of multiplication coefficient based on the ratio method
The multiplication coefficient is another reason for the failure of the linear multivariate correction model. Although the additive effects can be readily removed by preprocessing methods, the multiplicative effects resulting from changes in the sample's effective optical path length are rather difficult to correct. Without proper correction, these effects can deteriorate the predictive performance of multivariate linear calibration models [6]. Suppose that one wavelength ($\lambda = h$) can be found in the raw spectra at which the data are related only to the multiplication coefficient of the different samples or are linearly related to the concentration of the target chemical component. Under this assumption, the multiplication coefficient can be eliminated by the following equation:
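$$R_{i,\lambda=h} = \frac{X_{i,\mathrm{pre}}}{x_{i,\lambda=h}} = \frac{a_i p \sum_{j=1}^{J} c_{i,j}\, s_{j,\mathrm{pre}}}{a_i k} = g \sum_{j=1}^{J} c_{i,j}\, s_{j,\mathrm{pre}}, \quad g = \frac{p}{k} \tag{8}$$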
where $R_{i,\lambda=h}$ is the spectrum of the ith sample after elimination of the multiplication coefficient; $X_{i,\mathrm{pre}}$ is the spectrum of the ith sample after elimination of the addition coefficient by preprocessing, i.e., the FD, LRC or OSP method; $s_{j,\mathrm{pre}}$ is the preprocessed data of $s_j$; $k$ is an equilibrium coefficient, which depends on the wavelength and is not related to the concentration of the target chemical component; and $g$ is an equivalent multiplication coefficient, which may be related to $\lambda$. In addition, $g$ is considered linearly dependent on $\lambda$ in this study. The key to this method is therefore finding the $\lambda = h$ point, after which the multiplication coefficient $a_i$ can be eliminated by a simple ratio.
2.4. Selection of feature variables
The reasonable selection of characteristic variable points is important for improving the quality of the multivariate linear correction model. The scattering correction data at $\lambda = h$ can be obtained by equation (8), and the data at different wavelength points can be used to establish a multiple linear correction model. The whole ratio dataset for a sample can be expressed as equation (9):
$$R_i = \left[ R_{i,\lambda=q},\; R_{i,\lambda=q+\Delta\lambda},\; \ldots,\; R_{i,\lambda=f} \right] \tag{9}$$
where $R_i$ is a three-dimensional matrix (an $I \times N \times N$ matrix, where N is the number of wavelength points); the variable $q$ is the initial wavelength; and $f$ is the final wavelength. For one sample, the combination of different wavelength points makes up a massive matrix. In this study, the CCARS method is used to select characteristic variables. The CCARS method has two steps: first, the correlation between the whole ratio data and the concentrations of the target chemical component is analyzed, and irrelevant wavelength points are deleted according to the principle of minimum error; second, the wavelengths that are linearly correlated with the concentration of the chemical component are retained for competitive adaptive reweighted sampling (CARS) analysis. In some studies, the CARS method has been confirmed to be effective in the selection of characteristic variables in a spectrum, which can simplify the model and improve its robustness [12,13].
The Pearson correlation coefficient (r) is used to evaluate the correlation between a single data point and the chemical component concentrations. Correlation analysis, the first step in selecting effective variables, itself consists of two steps: first, the Pearson correlation coefficient matrix between each ratio vector and the target chemical component vector is obtained; then, variables with a relatively small Pearson correlation coefficient in the $R_i$ matrix are eliminated, and the root-mean-square error of cross validation of the calibration set (RMSECV) is obtained after a multivariate linear correction model has been established between the retained vectors and the target chemical component. The second step of the correlation analysis is a cyclic process; i.e., the variables with relatively small r values are removed continually. Thus, the optimal value of r can be determined by locating the turning point in the plot of RMSECV versus r. The flowchart of this scattering correction method is shown in Fig. 1.
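The ratio dataset of equations (8) and (9) and the Pearson screening step can be sketched as follows. This is an illustrative NumPy version, not the MATLAB code from the Supplementary Material: the function names, the synthetic spectra and the threshold `r_min = 0.9` are assumptions. In the study the threshold is chosen from the RMSECV curve, and the CARS step that follows is not reproduced here.

```python
import numpy as np

def ratio_tensor(X_pre):
    """R[i, m, h] = X_pre[i, m] / X_pre[i, h], i.e. every wavelength divided
    by every candidate reference wavelength (equations (8) and (9))."""
    return X_pre[:, :, None] / X_pre[:, None, :]

def pearson_map(R, y):
    """Pearson r between each ratio vector R[:, m, h] and the target y."""
    Rc = R - R.mean(axis=0)
    yc = y - y.mean()
    num = np.tensordot(yc, Rc, axes=(0, 0))
    den = np.sqrt((Rc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant ratio vectors (for example m == h) have zero variance; report r = 0 there
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Hypothetical preprocessed spectra: a background s0 plus one analyte band s1,
# scaled per sample by a multiplication coefficient a_i.
rng = np.random.default_rng(1)
n_samples, n_wl, h0 = 30, 12, 4
s0 = rng.uniform(0.5, 1.0, n_wl)
s1 = rng.uniform(0.2, 0.6, n_wl)
s1[h0] = 0.0                      # assumption: the analyte does not contribute at index h0
c = rng.uniform(0.0, 1.0, n_samples)
a = rng.uniform(0.7, 1.3, n_samples)
X_unscaled = s0 + c[:, None] * s1
X_pre = a[:, None] * X_unscaled

R = ratio_tensor(X_pre)           # shape (I, N, N)
r = pearson_map(R, c)             # shape (N, N)

r_min = 0.9                       # hypothetical screening threshold
keep = np.abs(r) >= r_min         # ratio points passed on to variable selection
```

In this synthetic case the ratios taken against index `h0` are exactly linear in the concentration, so their correlation is essentially 1, which mirrors the role of the $\lambda = h$ point in equation (8).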
The code of CARS can be downloaded free of charge from http://code.google.com/p/carspls. The MATLAB code of the OPS, LRC and SR methods is available in the Supplementary Material. The FD method can be implemented with PLS Toolbox version 8.0.2 (Eigenvector Research, Inc., USA).
Fig. 1. The flowchart of this new scattering correction method.
2.5. Modeling method and model evaluation
As the most widely used multivariate calibration method in chemometrics [14], PLS regression finds latent variables that maximize the covariance between the spectral data and the target chemical component concentration, taking into account the relationship between the data and the target while decomposing the data. To establish a robust PLS model, the samples were divided into a calibration set and a prediction set. The performance of the PLS model was evaluated using the RMSEP; a lower RMSEP indicates better predictive performance.
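The evaluation scheme described above can be sketched with a small single-response PLS routine and an RMSE function. This is a minimal sketch, assuming a PLS1 recursion written directly in NumPy and synthetic data with a 3:1 split; it is not the PLS software used in the study, and the names `pls1_fit`, `pls1_predict` and `rmse` are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (PLS1 with deflation of X and y)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # weight vector: direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                     # scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        qk = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - qk * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in the original X space
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical data: 120 samples driven by two latent factors, split 3:1
rng = np.random.default_rng(42)
n = 120
T = rng.normal(size=(n, 2))
loadings = rng.uniform(0.0, 1.0, size=(2, 20))
X = T @ loadings + 0.001 * rng.normal(size=(n, 20))
y = T @ np.array([1.5, -0.8]) + 12.0 + 0.01 * rng.normal(size=n)

X_cal, y_cal = X[:90], y[:90]          # calibration set
X_pred, y_ref = X[90:], y[90:]         # prediction set

model = pls1_fit(X_cal, y_cal, n_components=2)
rmsec = rmse(y_cal, pls1_predict(model, X_cal))
rmsep = rmse(y_ref, pls1_predict(model, X_pred))
```

Here the number of components is fixed at two because the synthetic data have two latent factors; with measured spectra it would normally be chosen by cross validation (RMSECV) on the calibration set.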
3. Material and methods
3.1. Apple data
A total of 120 Fuji apples were purchased from a local supermarket in Beijing. The apples had been harvested in Yantai, Shandong, China, in October 2018. All the samples were packed separately in polyethylene bags, placed in a portable thermostat at 0 °C and transferred to the laboratory. The apple samples were kept at room temperature for 24 h before visible near-infrared (Vis-NIR) spectra were collected. The experiment was divided into two steps: spectral collection and physicochemical value determination. Both steps were completed within 5 days.
Vis-NIR spectra of the apple samples were acquired using a self-developed Vis-NIR detection device, as shown in Fig. S1. The spectra were acquired within the wavelength range of 640–1100 nm by a Vis-NIR spectrometer (STS-NIR, Ocean Optics, USA) equipped with a linear array charge-coupled device detector. To enhance the signal, a 10 mm collimating lens (74-DA, Ocean Optics, USA) was attached to the spectrometer. The probe was a tungsten halogen lamp (998070-1, Welch Allyn, USA). The device used a dark chamber (camera obscura) to exclude ambient light.
Following the method described in NY/T 2637-2014 (China), the soluble solids content (SSC) was determined after the spectral measurement using traditional destructive tests as a reference. The SSC of the apple juice was recorded with a temperature-corrected digital refractometer (RA-620, KEM, Japan) and expressed in % at 20 °C. Each apple was sampled along the equator and measured four times for averaging. The average SSC of the 120 apple samples was 12.19%. The samples were divided into a calibration set and a prediction set in a 3:1 ratio. In this paper, feature variable selection and PLS model establishment were conducted on the calibration set, and the prediction set was used to validate the calibration model.
3.2. Tecator data: near-infrared spectra of meat samples
This benchmark spectral dataset consists of the NIR absorbance spectra of 240 meat samples recorded on a Tecator Infratec food and feed analyzer working in the wavelength range of 850–1050 nm with an interval of 2 nm by the NIR transmission principle. Each sample contains finely chopped pure meat with different moisture, fat, and protein contents. In our study, only the moisture dataset was used to demonstrate this scattering correction method, and the samples were divided into a calibration set and a prediction set in a 3:1 ratio. The other datasets, such as the protein and fat contents, are analogous to the moisture content dataset. The Tecator data are available at http://lib.stat.cmu.edu/datasets/tecator.
3.3. Powder mixture data: near-infrared spectra of powder mixture samples
This NIR spectral dataset consists of the NIR absorbance spectra of 100 powder mixture samples that contained five mixtures of gluten and starch powder with different weight ratios (1:0, 0.75:0.25, 0.5:0.5, 0.25:0.75, 0:1). For each of the powder mixtures, five samples were randomly selected and packed into five different glass containers, and two transmission NIR spectra were collected from each. Subsequently, each sample was packed more firmly, and two more transmission spectra were collected, resulting in a total of 100 spectra. The spectra were then transformed into absorbance spectra, i.e., $A = \log(1/T)$. The samples were divided into a calibration set and a prediction set in a 3:2 ratio. The calibration set contained 60 samples with gluten/starch ratios equal to 1:0, 0.5:0.5, and 0:1, and the prediction set contained 40 samples with the two other mixtures, as in the paper by Chen et al. [10]. More experimental details can be found in the original paper by Martens et al. [8].
4. Results and discussion
4.1. Apple data
As described in the theory section, scattering correction in an absorption spectrum mainly includes two steps: elimination of addition coefficients and elimination of multiplication coefficients. The raw spectra of the apple samples are presented in Fig. 2A, and noticeable absorption bands were observed at 680, 760, 840 and 970 nm. These absorption peaks were mainly associated with peel chlorophyll content [15], moisture content [16] and sugar content [17]. The spectra preprocessed by LRC and OSP, which had the same absorption peaks as the raw spectra, are shown in Fig. 2C and D. The positions of the spectral absorption peaks changed after FD processing (Fig. 2B); e.g., the absorption peak of the moisture band shifted to approximately 950 nm. Absorption peaks in the raw spectra are extreme points, where the first derivative with respect to the wavelength is zero, which is the main reason the peak positions change after FD preprocessing. Theoretically, all three preprocessing methods can be used to eliminate the addition coefficients of the raw spectra. In practice, the optimal preprocessing method can be identified by comparing the PLS modeling results for different datasets.
The second step of scatter correction was to eliminate the multiplication coefficients by a simple division operation on the preprocessed spectra (i.e., the spectra after FD, LRC, or OSP preprocessing). The SR matrix can be calculated by equations (8) and (9). Three Pearson correlation coefficient matrices based on the ratio datasets of the different preprocessing methods are shown in Fig. 3. The highest correlation was 0.9261, which was obtained using the FD preprocessing ratio matrix at the $\lambda = 917$ nm / $\lambda = 903$ nm point. The maximum correlation coefficients calculated for the OPS and LRC preprocessed ratio datasets were 0.9180 and 0.8787, respectively. Correlation analysis was also carried out on the preprocessed spectra, as shown in Fig. S2. Comparison of the preprocessed spectra and the ratio spectra showed that the Pearson correlation coefficient of single-point data was significantly higher for the latter than for the former.
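The three addition-coefficient eliminations compared in this section (FD, LRC and OSP) can be sketched in NumPy as follows. This is a hedged illustration, not the authors' MATLAB or PLS Toolbox code: the function names, the window length of 11 points, the rescaling of the wavelength axis inside `osp` (done only for numerical conditioning, since it spans the same space as 1, λ and λ²) and the synthetic spectra are all assumptions.

```python
import numpy as np

def first_derivative(X, wavelengths, window=11):
    """First derivative by moving-window second-order polynomial fitting.
    Only window positions that lie fully inside the spectrum are returned."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    A = np.vander(offsets, 3, increasing=True)     # columns: 1, t, t^2
    weights = np.linalg.pinv(A)[1]                 # weights giving the fitted slope at t = 0
    step = wavelengths[1] - wavelengths[0]         # assumes evenly spaced wavelengths
    D = np.apply_along_axis(np.correlate, 1, X, weights, mode="valid")
    return D / step

def lrc(X):
    """Linear regression correction: regress each spectrum on the mean spectrum
    and subtract only the fitted intercept (the slope is left in place)."""
    slope, intercept = np.polyfit(X.mean(axis=0), X.T, 1)
    return X - intercept[:, None]

def osp(X, wavelengths):
    """Project spectra onto the orthogonal complement of span{1, lambda, lambda^2}."""
    u = (wavelengths - wavelengths.mean()) / (wavelengths.max() - wavelengths.min())
    Q = np.vstack([np.ones_like(u), u, u ** 2])
    projector = np.eye(u.size) - np.linalg.pinv(Q) @ Q
    return X @ projector

# Hypothetical spectra: two bands, per-sample scale a_i and baseline terms
rng = np.random.default_rng(2)
wavelengths = np.linspace(640.0, 1100.0, 231)
band0 = np.exp(-((wavelengths - 760.0) / 40.0) ** 2)
band1 = np.exp(-((wavelengths - 970.0) / 35.0) ** 2)
c = rng.uniform(0.2, 1.0, 8)
a = rng.uniform(0.7, 1.3, 8)
b = rng.uniform(-0.1, 0.1, 8)
w = rng.uniform(-1e-4, 1e-4, 8)
d = rng.uniform(-1e-7, 1e-7, 8)

X_chem = band0 + c[:, None] * band1
X_const = a[:, None] * X_chem + b[:, None]                       # baseline as in equation (2)
X_poly = X_const + w[:, None] * wavelengths + d[:, None] * wavelengths ** 2   # equation (3), no noise

D = first_derivative(X_const, wavelengths)
L = lrc(X_const)
Z = osp(X_poly, wavelengths)
```

In this noise-free setting FD and OSP leave only the multiplicative factor $a_i$ on the derivative or projected chemical signal, which is the precondition for the ratio step. LRC removes the regression intercept but keeps the slope, in line with its description in Section 2.2.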
Fig. 2. (A) Raw absorption spectra of apple samples. (B) FD preprocessing. (C) LRC preprocessing. (D) OSP preprocessing.
Fig. 3. Results of correlation analysis between a single-ratio vector and SSC data: (A) FD preprocessing; (B) LRC preprocessing; (C) OSP preprocessing, in which the intensity of the color indicates the value of the Pearson correlation coefficient. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
The redundant information in the ratio spectral matrix often leads to the failure of multiple linear correction models. Therefore, steps to obtain valid variables were necessary before establishing the multivariate linear correction model. The changing trend in the RMSECV value after removing the data points with relatively small r values used the FD preprocessing ratio matrix, as shown in Fig. 4. When removing the data points with r values less than 0.83 in the FD preprocessing ratio matrix, the RMSECV was the smallest, and the effective information of the set was the largest. In addition, when removing the data points with r less than 0.53 in the LRC preprocessing ratio matrix and removing the data points with r less than 0.64 in the OPS preprocessing ratio matrix, the RMSECV was the smallest, as shown in Fig. S3. After the correlation analysis, 107, 1032, and 186 vectors were retained for the FD preprocessing ratio matrix, LRC preprocessing ratio matrix and OPS preprocessing ratio matrix, respectively. These effective vectors became the input variables of the CARS method. Fig. S4 shows the changing trend in the number of sampled variables (Fig. S4A), RMSECV values (Fig. S4B) and regression coefficient path of each variable (Fig. S4C) with increasing sampling runs from CARS operation. The number of variable values first declined quickly from sampling runs 1 to 20 and then entered a period of slow selection during sampling runs 20 to 50. The best subset with the lowest RMSECV value was marked by the vertical line denoted by an asterisk at sampling run 29. For the FD preprocessing ratio matrix, 11 SR data points were retained for the establishment of the
Fig. 4. The changing trend in the RMSECV value after removing the data points with relatively small r values using the FD preprocessing ratio matrix.
PLS model. For the LRC preprocessing ratio and OPS preprocessing ratio matrices, 5 SR data points and 16 SR data points were retained, respectively. The operation process of CARS for the LRC preprocessing ratio and OPS preprocessing ratio matrices is shown in Figs. S5 and S6. After the CARS method for the calibration samples, the key variables were adopted to mitigate the detrimental influence of additive effects and multiplicative effects on the predictive ability of the SSC. PLS calibration models with different preprocessing methods combined with the CARS selection method were also established for comparison purposes. Fig. 5 compares the predictive performance of the calibration models with different preprocessing methods for the SSC and the corresponding optimal PLS models with and without the application of preprocessing methods (e.g., MSC, SNV, OPLECm, FD-SR, LRC-SR and OPS-SR). For the raw spectra, the RMSE values for the calibration (RMSEC) and prediction sets (RMSEP) were 0.8254% and 0.8388%, respectively. The MSC, SNV and OPLECm preprocessing methods succeeded in improving the quality of the predictions of the PLS calibration model for the apple data, although the reasons for its success on this particular dataset were unclear. As expected, FD-SR, LRC-SR and OPS-SR had significantly better predictive abilities than SNV, MSC and OPLECm
Fig. 5. Predictive performance of the PLS models built on the key variables after application of the CARS method to apple datasets preprocessed by different methods (black square, the raw spectra; purple circle, SNV; yellow triangle up, MSC; cyan triangle downward, OPLECm; green diamond, OPS-SR; red five-pointed star, FD-SR; blue hexagonal star, LRC-SR). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
26
L. Li et al. / Analytica Chimica Acta 1087 (2019) 20e28
Fig. 6. Results of correlation analysis between a single-ratio vector and target chemical composition concentration vector (moisture content): (A) FD preprocessing; (B) LRC preprocessing; (C) OSP preprocessing, in which the intensity of the color indicates the value of the Pearson correlation coefficient. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
correction methods. Among FD-SR, LRC-SR and OPS-SR, the PLS calibration model established using LRC-SR key variables had the best predictive performance, and the RMSE values for the calibration and prediction sets were 0.3096% and 0.3350%, respectively. To further prove the effectiveness of the method, we discussed the results of the Tecator dataset and powder mixture dataset in the following two sections. 4.2. Tecator data As in the apple raw spectral data, there are significant additive effects and multiplicative effects on the Tecator data, as shown in Fig. S7 [6]. The three Pearson correlation coefficient matrices based on ratio datasets subjected to the different preprocessing methods are shown in Fig. 6. The highest correlation coefficient was 0.9890, which was also obtained using the FD preprocessing ratio matrix at the l ¼ 940 nm= l ¼ 928 nm point. In addition, the maximum correlation coefficients calculated by OPS and LRC preprocessed ratio datasets were 0.9780 and 0.9814, respectively. The single-point correlation analysis of the Tecator data was similar to that of the apple data; i.e., the highest correlation coefficient was obtained from the FD preprocessing ratio matrix. After the correlation analysis of the Tecator data, 770, 5109, and 823 vectors were retained for the FD preprocessing ratio matrix,
Fig. 7. Predictive performance of the PLS models built on the key variables after application of the CARS method to Tecator datasets preprocessed by different methods (black square, the raw spectra; purple circle, SNV; yellow triangle up, MSC; cyan triangle downward, OPLECm; green diamond, OPS-SR; red five-pointed star, FD-SR; blue hexagonal star, LRC-SR). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
LRC preprocessing ratio matrix and OPS preprocessing ratio matrix, respectively. These effective vectors became the input variables of the CARS method, and the calibration model for moisture content was then built on the selected key variables. Fig. 7 compares the predictive performance for moisture content of the optimal PLS calibration models built with and without preprocessing (MSC, SNV, OPLECm, FD-SR, LRC-SR and OPS-SR). SNV and MSC, which can also reduce the effect of light scattering, gave better prediction performance than the raw spectra. Furthermore, the OPLECm preprocessing method surprisingly succeeded in improving the quality of the predictions of the PLS calibration model for the Tecator data, with RMSE values of 0.7567% and 0.8745% for the calibration and prediction sets, respectively. FD-SR, LRC-SR and OPS-SR showed significantly better predictive performance than the SNV and MSC preprocessing methods. Among FD-SR, LRC-SR and OPS-SR, the PLS calibration model built on the OPS-SR key variables had the best predictive performance, with RMSE values of 0.6019% and 0.6198% for the calibration and prediction sets, respectively.

4.3. Powder mixture data

Ideally, five independent curves could be seen from the raw spectra. In practice, because of the addition and multiplication coefficients, this ideal situation does not occur, as shown by the raw spectra in Fig. 8A. As expected, some of the key wavelengths were linearly related to the multiplication factor, which was independent of the target chemical concentration. For instance, when the OPS spectra were divided by the λ = 1014 nm data, five independent curves could be seen from the spectra, as shown in Fig. 8B. The FD and LRC spectra also had such a key wavelength point, although at different wavelengths.
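The effect of dividing by a key wavelength can be illustrated with a minimal sketch. It assumes that the additive coefficient has already been removed, so that each measured spectrum is a sample-specific multiplication coefficient times a chemically determined spectrum. The pure-component band shapes, the range of the coefficient `b` and the helper name `ratio_spectra` are hypothetical choices for illustration only; they are not the measured gluten/starch data or the authors' implementation.

```python
import numpy as np

def ratio_spectra(spectra, ref_index):
    """Divide each spectrum (row) by its own value at the reference wavelength."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra[:, [ref_index]]          # shape (n_samples, 1)
    return spectra / ref

# Hypothetical pure-component spectra on a 2 nm grid (assumption, not real data).
wavelengths = np.linspace(850.0, 1050.0, 101)
gluten = 1.0 + 0.5 * np.exp(-((wavelengths - 900.0) / 20.0) ** 2)
starch = 1.0 + 0.5 * np.exp(-((wavelengths - 960.0) / 20.0) ** 2)

# Five gluten fractions, four replicates each, as in a five-mixture design.
fractions = np.repeat([1.0, 0.75, 0.5, 0.25, 0.0], 4)
chemical = np.outer(fractions, gluten) + np.outer(1.0 - fractions, starch)

# Sample-specific multiplication coefficient (assumed wavelength-independent).
rng = np.random.default_rng(0)
b = rng.uniform(0.6, 1.4, size=fractions.size)
measured = b[:, None] * chemical

# Divide by the value at 1014 nm: the coefficient b cancels in every ratio.
ref_index = int(np.argmin(np.abs(wavelengths - 1014.0)))
ratios = ratio_spectra(measured, ref_index)
```

In this synthetic case the replicates of each mixture differ in `measured` but coincide in `ratios`, so only five distinct curves remain, one per composition.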
A reasonable selection of characteristic variable points is important for improving the predictive performance of PLS models, as confirmed in the analyses of the previous two datasets. Therefore, the key variables of the powder mixture datasets were also selected by the CCARS method. The predictive performance of the PLS models built on the key variables is shown in Fig. 9. The optimal PLS calibration model based on the key variables of the raw calibration spectra did not give satisfactory predictions for the powder mixture datasets, although the results were better than those obtained with the MSC and SNV methods. In addition, the application of the frequently used multiplicative light scattering correction methods, MSC and SNV, did not cause any significant changes in the RMSEP values for the powder mixture datasets. The minimum RMSEP value obtained by the OPLECm method was 0.0115%, which was lower than those obtained by the MSC and SNV methods. Among the FD-SR, LRC-SR and OPS-SR methods, the PLS calibration model established using OPS-SR
Fig. 8. Raw spectra and ratio spectra of the five mixtures of gluten and starch powder with different weight ratios (black lines, 1:0; green lines, 0.75:0.25; red lines, 0.5:0.5; yellow lines, 0.25:0.75; blue lines, 0:1). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
Fig. 9. Predictive performance of PLS models built on the key variables after application of the CARS method to powder mixture datasets preprocessed by different methods (black square, the raw spectra; purple circle, SNV; yellow triangle up, MSC; cyan triangle downward, OPLECm; green diamond, OPS-SR; red five-pointed star, FD-SR; blue hexagonal star, LRC-SR). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
key variables had the best prediction effect, and the RMSE values for the calibration and prediction sets were 0.0016% and 0.0021%, respectively. This result was better than those for previously reported methods; e.g., the RMSEP was equal to 0.005% using the EMSC method [8]. This remarkable improvement further confirmed the effectiveness of these three addition coefficient elimination methods combined with a simple ratio method in mitigating the detrimental influence of additive and multiplicative effects on the spectroscopic quantitative analysis of heterogeneous mixture samples.
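The single-ratio correlation screening applied to the three datasets in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `ratio_correlation_map`, the synthetic data and the retention threshold of 0.9 are all hypothetical, and the spectra are assumed to be free of additive effects before the ratios are formed.

```python
import numpy as np

def ratio_correlation_map(spectra, y):
    """Pearson correlation between every ratio vector x_i / x_j and y.

    Entry [i, j] is the correlation for numerator wavelength i and
    denominator wavelength j; constant ratio vectors (i == j) are set to 0.
    """
    X = np.asarray(spectra, dtype=float)
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    y_norm = np.sqrt(np.sum(yc ** 2))
    n_wl = X.shape[1]
    corr = np.zeros((n_wl, n_wl))
    for j in range(n_wl):
        R = X / X[:, [j]]                  # all ratios with denominator j
        Rc = R - R.mean(axis=0)
        denom = np.sqrt(np.sum(Rc ** 2, axis=0)) * y_norm
        num = Rc.T @ yc
        with np.errstate(divide="ignore", invalid="ignore"):
            corr[:, j] = np.where(denom > 1e-12, num / denom, 0.0)
    return corr

# Synthetic check: the target equals one specific band ratio (assumption).
rng = np.random.default_rng(1)
base = rng.uniform(0.5, 1.5, size=(30, 8))   # scatter-free spectra
y = base[:, 2] / base[:, 5]                  # hypothetical concentration
b = rng.uniform(0.6, 1.4, size=30)           # multiplication coefficients
X = b[:, None] * base                        # spectra with multiplicative effect

corr = ratio_correlation_map(X, y)
threshold = 0.9                              # hypothetical retention threshold
keep = np.argwhere(np.abs(corr) >= threshold)
```

Because the multiplication coefficient cancels in every ratio, the pair (2, 5) reaches a correlation of 1 in this synthetic case despite the scatter in `X`. The retained index pairs in `keep` play the role of the effective vectors that are passed on to variable selection.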
5. Conclusions

Separating the spectral contributions due to variations in chemical composition from the additive and multiplicative effects caused by light scattering is crucial for the accurate quantitative analysis of complex mixture samples using spectroscopic instruments. In this paper, three novel scattering correction methods were proposed for spectroscopic analysis. The light scattering correction consists of two parts, i.e., elimination of the
addition coefficient and elimination of the multiplication coefficient. Three different methods for eliminating the addition coefficient have been proposed based on two different hypotheses, i.e., whether the addition coefficient is related to the wavelength (λ). After the addition coefficient has been eliminated, the multiplication coefficient can be eliminated by a simple but effective ratio method. Furthermore, to simplify the model and improve its robustness, correlation analysis combined with the competitive adaptive reweighted sampling (CCARS) method is used to select key variables that can be used to establish the PLS model. The performances of FD-SR, LRC-SR and OPS-SR were tested on an apple dataset, a powder mixture dataset and another publicly available benchmark spectral dataset. The experimental results reveal that all three methods achieve satisfactory quantitative results from the spectroscopic measurements of the different mixture samples. Among these methods, the LRC-SR method has the best predictive performance on the apple dataset. By contrast, the OPS-SR method has the best predictive performance on the other two datasets. The reason for this difference is related to the additive effects of different types of samples. Unlike other existing methods, the proposed scattering correction methods are not limited to the case in which each spectrum contains only one fixed multiplication coefficient. Furthermore, compared with other existing methods designed for the same purpose, FD-SR, LRC-SR and OPS-SR have the advantage of high precision and hence have great potential in quantitative spectroscopic analysis of complex mixture samples.

Funding

This work was financially supported by the National Key R&D Program of China (Grant no. 2016YFD0400905-5).
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors gratefully acknowledge the financial support provided by the National Key R&D Program of China (Grant no. 2016YFD0400905-5).
Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.aca.2019.08.067.

References

[1] Z.P. Chen, G. Fevotte, A. Caillet, D. Littlejohn, J. Morris, Advanced calibration strategy for in situ quantitative monitoring of phase transition processes in suspensions using FT-Raman spectroscopy, Anal. Chem. 80 (2008) 6658–6665.
[2] F. Davrieux, F. Allal, G. Piombo, B. Kelly, J.B. Okulo, M. Thiam, O.B. Diallo, J.M. Bouvet, Near infrared spectroscopy for high-throughput characterization of shea tree (Vitellaria paradoxa) nut fat profiles, J. Agric. Food Chem. 58 (2010) 7811–7819.
[3] D.R. Willett, J.D. Rodriguez, Quantitative Raman assays for on-site analysis of stockpiled drugs, Anal. Chim. Acta 1044 (2018) 131–137.
[4] O. Jović, T. Smolić, I. Primožić, T. Hrenar, Spectroscopic and chemometric analysis of binary and ternary edible oil mixtures: qualitative and quantitative study, Anal. Chem. 88 (2016) 4516–4524.
[5] P. Geladi, D. MacDougall, H. Martens, Linearization and scatter-correction for near-infrared reflectance spectra of meat, Appl. Spectrosc. 39 (1985) 491–500.
[6] J.W. Jin, Z.P. Chen, L.M. Li, R. Steponavicius, S.N. Thennadil, J. Yang, R.Q. Yu, Quantitative spectroscopic analysis of heterogeneous mixtures: the correction of multiplicative effects caused by variations in physical properties of samples, Anal. Chem. 84 (2012) 320–326.
[7] R.J. Barnes, M.S. Dhanoa, S.J. Lister, Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra, Appl. Spectrosc. 43 (1989) 772–777.
[8] H. Martens, J.P. Nielsen, S.B. Engelsen, Light scattering and light absorbance separated by extended multiplicative signal correction. Application to near-infrared transmission analysis of powder mixtures, Anal. Chem. 75 (2003) 394–404.
[9] I.S. Helland, T. Næs, T. Isaksson, Related versions of the multiplicative scatter correction method for preprocessing spectroscopic data, Chemometr. Intell. Lab. Syst. 29 (1995) 233–241.
[10] Z.P. Chen, J. Morris, E. Martin, Extracting chemical information from spectral data with multiplicative light scattering effects by optical path-length estimation and correction, Anal. Chem. 78 (2006) 7674–7681.
[11] Z.-P. Chen, L.-J. Zhong, A. Nordon, D. Littlejohn, M. Holden, M. Fazenda, L. Harvey, B. McNeil, J. Faulkner, J. Morris, Calibration of multiplexed fiber-optic spectroscopy, Anal. Chem. 83 (2011) 2655–2659.
[12] W. Wang, Y. Peng, H. Sun, X. Zheng, W. Wei, Real-time inspection of pork quality attributes using dual-band spectroscopy, J. Food Eng. 237 (2018) 103–109.
[13] H. Li, Y. Liang, Q. Xu, D. Cao, Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration, Anal. Chim. Acta 648 (2009) 77–84.
[14] A.M. Rady, D.E. Guyer, Rapid and/or nondestructive quality evaluation methods for potatoes: a review, Comput. Electron. Agric. 117 (2015) 31–48.
[15] M.N. Merzlyak, A.E. Solovchenko, A.A. Gitelson, Reflectance spectral features and non-destructive estimation of chlorophyll, carotenoid and anthocyanin content in apple fruit, Postharvest Biol. Technol. 27 (2003) 197–211.
[16] Z. Guo, W. Huang, Y. Peng, Q. Chen, Q. Ouyang, J. Zhao, Color compensation and comparison of shortwave near infrared and long wave near infrared spectroscopy for determination of soluble solids content of ‘Fuji’ apple, Postharvest Biol. Technol. 115 (2016) 81–90.
[17] Z. Xiaobo, Z. Jiewen, M.J. Povey, M. Holmes, M. Hanpin, Variables selection methods in near-infrared spectroscopy, Anal. Chim. Acta 667 (2010) 14–32.