An optimization strategy for waveband selection in FT-NIR quantitative analysis of corn protein



Journal of Cereal Science 60 (2014) 595–601

Contents lists available at ScienceDirect

Journal of Cereal Science journal homepage: www.elsevier.com/locate/jcs

Hua-Zhou Chen a,*, Qi-Qing Song a, Guo-Qiang Tang a, Li-Li Xu b

a College of Science, Guilin University of Technology, Guilin 541004, China
b School of Ocean, Qinzhou University, Qinzhou 535000, China

Article info

Article history: Received 15 February 2014. Received in revised form 17 July 2014. Accepted 18 July 2014. Available online 16 August 2014.

Keywords: Corn protein; FT-NIR; Waveband optimization; Model stability

Abstract

An optimization strategy for waveband selection of corn protein was developed based on Fourier transform near infrared (FT-NIR) spectrometry. The optimized moving window partial least squares (OMWPLS) framework was proposed based on different sample partitions for modeling stability. A global-optimal model and some local-optimal models were selected by OMWPLS screening through the full scanning range. The modified optical path length estimation and correction (OPLECm) technique was utilized for further data preprocessing of the OMWPLS-selected wavebands. We finally acquired an optimal and stable model, for which the root mean square error and the correlation coefficient of prediction were 0.413 (%) and 0.939, respectively, and the modeling waveband was 5158–4857 cm⁻¹ with 40 wavenumbers. This selected waveband achieved high accuracy in validation. Moreover, many alternative wavebands with acceptable predictive results were also found. This finding seems valuable and quite practical for the design of a corn-specific NIR instrument. The waveband selection framework confirms the feasibility of FT-NIR quantitative analysis of corn protein. FT-NIR spectrometry combined with waveband optimization is expected to be an alternative technology for detection of corn chemical components, which is essential for the implementation of chemometric methods in the analysis of corn quality. © 2014 Elsevier Ltd. All rights reserved.

* Corresponding author. College of Science, Guilin University of Technology, 12 Jian'gan Road, Guilin 541004, China. Tel.: +86 773 5896179. E-mail address: [email protected] (H.-Z. Chen). http://dx.doi.org/10.1016/j.jcs.2014.07.009 0733-5210/© 2014 Elsevier Ltd. All rights reserved.

List of abbreviations

F: PLS factor, i.e. the number of latent variables for PLS regression
f(p): the quadratic function of p in the OPLECm algorithm
FT-NIR: Fourier transform near infrared
inf(RP): the infimum of RP
MWPLS: moving window partial least squares
NIR: near infrared
OMWPLS: optimized moving window partial least squares
OPLECm: the modified optical path length estimation and correction
p: the multiplicative parameter vector in the OPLECm algorithm
PLS: partial least squares regression
PLS-VIP: the PLS models with VIP scores
r: the number of active chemical components in the OPLECm algorithm
R: correlation coefficient
RC: the R for calibration
RP: the R for prediction
RV: the R for validation
RP,m: the mean value of RP
RP,sd: the standard deviation of RP
RMSE: root mean square error
RMSEC: the RMSE for calibration
RMSEP: the RMSE for prediction
RMSEV: the RMSE for validation
RMSEPm: the mean value of RMSEP
RMSEPsd: the standard deviation of RMSEP
SN: the starting point of the moving window in the MWPLS method
sup(RMSEP): the supremum of RMSEP
VIP: the variable importance in the projection
W: the size of the moving window of the MWPLS method

1. Introduction

Near infrared (NIR) spectroscopy is an analytical technique that relates the light absorbed by a sample to its chemical and physical composition (Burns and Ciurczak, 2008; Givens et al., 1997). An NIR measurement provides spectral data containing the integrated information of a sample, including the absorption responses of all of its components as well as some measurement noise (Ozaki, 2012; Rodriguez-Saona et al., 2004). Fourier transform near infrared (FT-NIR) spectrometry improves spectral reproducibility and the accuracy and precision of wavelength discrimination by using Fourier transform technology (Jiang et al., 2012; Sinija and Mishra, 2009), and mainly reflects the absorption of the overtones and combinations of vibrations of X–H functional groups (such as C–H, O–H and N–H). This technique is commonly used in the fields of agriculture, food, environment, biomedicine, etc. (Bondi et al., 2012; Salgó and Gergely, 2012; Sarembaud et al., 2008; Sousa et al., 2007).

Corn is one of the most important cereal grains in the world. Protein is one of the key ingredients for evaluating the nutritional quality of corn (Ngonyamo-Majee et al., 2008). Protein molecules contain various X–H functional groups, which absorb significantly in the FT-NIR region. FT-NIR spectroscopy therefore has potential for rapid quantitative analysis of protein content in corn (Andrade et al., 2009; Yu et al., 1998). Compared with the traditional chemical measurement (the Kjeldahl method), FT-NIR determination of protein can be much faster once a stable calibration model has been established: it takes only 2 or 3 min, including the spectroscopic measurement. Nevertheless, establishing the calibration model requires reference measurements of protein by the Kjeldahl method, and a stable optimal model has to be constructed through repeated calibrations. As corn is a complex system containing various components, protein is not easy to quantify directly with FT-NIR spectrometry. Developing effective chemometric methods is therefore a major task.

Chemometrics is an essential part of NIR spectroscopy in cereal analysis. NIR instrumentation must always be complemented with chemometric analysis to extract the useful information present in the spectra of the analyte, separating it both from irrelevant information and from spectral noise. The most widely used chemometric techniques are principal component analysis (PCA), for qualitative analysis of the data, and partial least squares (PLS) regression, for quantitative prediction of the parameters of interest (Naes et al., 2002; Nicolai et al., 2007; Cen and He, 2007). PLS regression has a built-in capacity to deal with the collinearity of the spectra based on latent variables (Burnham et al., 1999; Igne et al., 2010). However, the spectroscopic analysis of a single component requires mitigating the interference of other components and noise, and PLS struggles to improve the modeling results when the signal-to-noise ratio of the waveband is not adequately high. Wavelength selection is a key technique for improving the effectiveness of spectral prediction, reducing method complexity and designing parameters for specialized spectrometers. Few studies have been reported on the determination of protein content in corn samples by FT-NIR spectrometry with waveband optimization. Waveband selection is therefore a promising topic, and chemometric methods for waveband optimization should be developed for FT-NIR analysis.

Moving window partial least squares (MWPLS) is a well-known chemometric algorithm for waveband selection (Jiang et al., 2002; Du et al., 2004). However, FT-NIR analysis requires a modeling-validation division of the samples, and the modeling process requires the samples to be partitioned into calibration and prediction sets. Differences in the partitioning of the calibration set lead to fluctuations in the MWPLS waveband selections and thus to unstable predictive results. It is therefore necessary to find a stable strategy that optimizes the MWPLS waveband selection method while taking repeated partitioning of the calibration set into account.

Besides waveband selection, data preprocessing is another aspect of waveband optimization. Spectroscopic measurements are subject to interferences such as light scattering, baseline drift and the experimental conditions, so the spectroscopic data include noise in addition to the information of the analyte. Enhancing the signal-to-noise ratio is therefore necessary for any selected waveband. An effective data preprocessing method is the Modified Optical Path Length Estimation and Correction (OPLECm), which realizes the full potential of quantitative spectroscopic analysis of complex systems by using a dual-calibration strategy (Chen et al., 2013; Jin et al., 2012; Wang et al., 2011). The OPLECm method is expected to be useful for noise reduction in the selected-waveband data of corn samples.

An optimization strategy for FT-NIR spectroscopic waveband selection of corn protein was studied in this work. To account for modeling stability, we optimized the MWPLS method based on the mean value and standard deviation of the results over many different partitions of the calibration samples. The optimized MWPLS framework (hereafter referred to as OMWPLS) was then used for waveband selection in the FT-NIR quantitative analysis of corn protein. Moreover, the potential of the OPLECm technique for data preprocessing was also studied for the OMWPLS-selected wavebands.

2. Materials and methods

2.1. Samples and measurements

A total of 158 corn samples were collected from plants in a 920-square-foot field in northeast China. By variety, they were mostly conventional corn and sweet corn, together with some high-oil corn and high-lysine corn. As the harvest probably influences the quality of corn, we harvested the samples mechanically within 3 days after the corn became ripe. A sunny day was chosen for harvesting, and the pre-experimental physical treatment was completed immediately so that the off-line quality of the corn kernels stayed close to their on-line quality. In the pre-experimental preparation, each ear of corn was separated into kernels, which were then ground, milled, sifted through a sieve with a 1 mm aperture and dried. Finally, 158 dry powdered corn samples were obtained, with 50 g of crushed and sieved particles weighed out on a precision balance.

Quantitative calibration of corn protein by FT-NIR spectrometry requires a traditional chemical measurement as the reference standard, and the predicted and validated protein contents are evaluated by comparison with that standard. The reference protein content was measured by the commonly used Kjeldahl method (AACC Method 46-10). The distillation unit includes the Kjeldahl flasks, digestion heaters and distillation flask. The corn samples were digested in sulfuric acid, the ammonia was distilled and the excess acid was titrated; finally, a conversion factor of 6.25 was used, as for feedstuffs. The measured protein contents of the 158 corn samples, in weight percentage, ranged from 8.1% to 14.5%, with a mean value of 10.6% and a standard deviation of 1.2%.
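The last step of the Kjeldahl procedure is a single multiplication. The snippet below is a minimal sketch of that arithmetic only; the function names are illustrative and not taken from the paper, and the only value borrowed from the text is the factor 6.25.

```python
# Nitrogen-to-protein conversion factor quoted in the text for feedstuffs.
NITROGEN_TO_PROTEIN = 6.25


def protein_percent(n_pct, factor=NITROGEN_TO_PROTEIN):
    """Convert Kjeldahl nitrogen (weight %) to crude protein (weight %)."""
    return n_pct * factor


def nitrogen_percent(protein_pct, factor=NITROGEN_TO_PROTEIN):
    """Inverse conversion: crude protein (weight %) back to nitrogen (weight %)."""
    return protein_pct / factor
```

Under this factor, the reported mean protein content of 10.6% corresponds to about 1.70% nitrogen.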


The spectral measurement was performed using a Nexus™ 870 FT-NIR spectrometer (Thermo Nicolet Corporation, MA, USA) with its diffuse reflectance accessory. The absorbances of the 158 samples were recorded over the full scanning range of 10,000–4000 cm⁻¹. Each spectrum was the average of 32 scans at a resolution of 16 cm⁻¹. The temperature was controlled at 25 ± 1 °C and the relative humidity at 46 ± 1% RH throughout the spectral scanning process. As the samples were milled solid powders, scattering and noise probably occurred when light passed through them, which may reduce the measurement accuracy. To minimize the noise in the spectral data, we measured each sample in triplicate and used the mean spectrum for establishing the FT-NIR models. The models were further evaluated by comparing the predicted values with the Kjeldahl-measured protein contents [Axelson, 2010].

2.2. Sample division and model indicators

According to the calibration-prediction-validation procedure [Chen et al., 2013, 2014], some of the samples were randomly selected for validation and the remaining samples were used for modeling. The modeling samples were divided into a calibration set and a prediction set many times, and calibration and prediction were performed for each division i. The root mean square error (RMSE) and correlation coefficient (R) for each pair of calibration and prediction sets were denoted as RMSEC(i), RC(i), RMSEP(i) and RP(i). Based on all divisions, the mean value and standard deviation of the RMSE and R for prediction were calculated and denoted as RMSEPm, RP,m, RMSEPsd and RP,sd, where RMSEPm and RP,m evaluate the prediction accuracy and RMSEPsd and RP,sd evaluate the modeling stability. A lower RMSEPm (or alternatively a higher RP,m) indicates higher accuracy of the model, and lower RMSEPsd and RP,sd indicate higher stability. Based on the mean value and standard deviation, the RMSEP and RP values are taken to lie within the intervals (RMSEPm − RMSEPsd, RMSEPm + RMSEPsd) and (RP,m − RP,sd, RP,m + RP,sd), respectively, so the values of RMSEPm + RMSEPsd and RP,m − RP,sd determine the ultimate level of model prediction. On this basis, the minimum RMSEPm + RMSEPsd was chosen as the main indicator for selecting the optimal model parameters, and the maximum RP,m − RP,sd was used as an indispensable reference. For convenience, RMSEPm + RMSEPsd and RP,m − RP,sd are denoted as the supremum of RMSEP (sup(RMSEP)) and the infimum of RP (inf(RP)), respectively.

Finally, the selected model with the optimal parameters was evaluated against the randomly selected validation samples, which were not involved in the modeling optimization process. They were regarded as the prediction set, while the original modeling set was used as the calibration set. The RMSE and R for the validation samples were then calculated and denoted as RMSEV and RV, respectively.

A total of 158 corn samples were measured. For the calibration-prediction-validation procedure, all samples were divided into three sets at a ratio of approximately 2:1:1. First, 40 samples were randomly selected as the independent validation set, and the remaining 118 samples were used for calibration and prediction. We assigned 78 samples to calibration and 40 to prediction, and the calibration-prediction partition was repeated 50 times. The partitioning was fully random, so we obtained 50 different pairs of calibration and prediction sets. To establish objective and representative models, sup(RMSEP) and inf(RP) were calculated from the mean values and standard deviations, and they were then used as the key indicators for model optimization.
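The partitioning scheme and the two indicators described above can be sketched in a few lines. This is an illustrative sketch, not the authors' MATLAB implementation: the helper names are hypothetical, the random seed is arbitrary, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
import math
import random
import statistics


def rmse(actual, predicted):
    """Root mean square error between reference and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between reference and predicted values."""
    ma, mp = statistics.fmean(actual), statistics.fmean(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(va * vp)


def hold_out(indices, n_out, rng):
    """Randomly split indices into (remaining, held_out) with n_out held out."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return shuffled[n_out:], shuffled[:n_out]


def stability_indicators(rmsep_values, rp_values):
    """Return (sup(RMSEP), inf(RP)) = (mean + sd, mean - sd) over all partitions."""
    sup_rmsep = statistics.fmean(rmsep_values) + statistics.stdev(rmsep_values)
    inf_rp = statistics.fmean(rp_values) - statistics.stdev(rp_values)
    return sup_rmsep, inf_rp


# 158 samples: 40 fixed validation samples, then 50 random 78/40 partitions.
rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch only
modeling, validation = hold_out(range(158), 40, rng)
partitions = [hold_out(modeling, 40, rng) for _ in range(50)]  # (calibration, prediction)
```

In use, a model would be fitted on each calibration set, `rmse` and `pearson_r` evaluated on the matching prediction set, and the 50 resulting values passed to `stability_indicators`.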


2.3. The framework for OMWPLS method

The MWPLS method provides a good wavelength combination with high spectral signals for PLS modeling (Jiang et al., 2002; Du et al., 2004). A set of consecutive wavelengths is designated as a window. Two adjustable parameters determine the window (i.e. a waveband). One is the starting point of the window (SN), which determines the position of the waveband; the other is the size of the window (W), which means that W consecutive wavelengths are included in the window. If SN varies, the window moves; if W varies, the window size changes. By moving the window and changing its size, PLS models are established for all possible windows in the full scanning range, and the optimal analytical wavebands can be selected. The PLS factor (F) is a tunable parameter that corresponds to the number of latent variables in PLS regression. Thus there are three critical parameters for MWPLS modeling: the starting point (SN), the size (W) and the PLS factor (F) (Chen et al., 2011, 2013). PLS models were established for all possible combinations of the parameter group (SN, W, F), and the corresponding values of RMSEPm, RP,m, RMSEPsd and RP,sd were obtained. We then calculated sup(RMSEP) for each group (SN, W, F) and stored the modeling results in a sup(RMSEP) cube whose 3 axes are SN, W and F.

For the optimization of MWPLS (i.e. the OMWPLS framework), F was selected according to the average modeling results over the different divisions into calibration and prediction sets, so that the optimized F reflects the stability and practicality of the PLS models. The optimal F for each window (each combination of SN and W) was selected according to the minimum value in the sup(RMSEP) cube. The global-optimal model, with the informative window and the optimal F, was then selected by comparing the minimum values of sup(RMSEP) over all wavebands. Accordingly, the optimal combination of the parameter group (SN, W, F) can be determined.

Practical applications impose restrictions on the position and size of the analytical wavebands, such as production costs and material properties. The global-optimal waveband cannot meet all the demands of actual conditions, so local-optimal wavebands corresponding to different positions and sizes are also significant. There are two ways to find the local-optimal wavebands. The first considers the predictive results corresponding to the starting points (the SN axis of the sup(RMSEP) cube): SN is held fixed while the parameter group (W, F) takes all possible values, and the local-optimal model for that SN is selected by the minimum value of sup(RMSEP). The second considers the predictive results corresponding to the window size (the W axis of the sup(RMSEP) cube): W is held fixed while the parameter group (SN, F) takes all possible values, and the local-optimal model for that W is selected. In practice, the selection for each SN (or W) was performed on an SN-axis (or W-axis) projection of the sup(RMSEP) cube obtained by minimizing over W and F (or SN and F). This projection also shows the trend of sup(RMSEP) as SN (or W) varies.

When the OMWPLS framework is applied, the ranges of the parameters SN, W and F should be set according to the analyte. The OMWPLS-selected wavebands can be further evaluated by PLS models with variable importance in the projection (VIP) scores (Teófilo et al., 2009).

2.4. The preprocessing method of OPLECm

As the FT-NIR spectra were collected in diffuse reflectance mode, the multiplicative scattering effect may override the spectral information of the component [Hiroaki et al., 2002; Peirs et al., 2003]. For multiplicative effect correction, OPLECm is a simple and effective pretreatment method with improved robustness [Jin et al., 2012; Wang et al., 2011]. For spectral measurements with multiplicative effects caused by changes in the optical path length, the measured spectrum ($A_j$, a row vector) of sample j composed of T chemical components can be approximated by the following model:

$$A_j = p_j \sum_{t=1}^{T} C_{j,t}\, a_t, \qquad j = 1, 2, \ldots, N \qquad (1)$$

where $C_{j,t}$ is the content of the t-th chemical component in the j-th sample and $a_t$ represents the spectrum of the t-th pure chemical component. The coefficient $p_j$ accounts for the multiplicative effects in the spectral measurement of the j-th sample caused by changes in the optical path length due to physical variations. Assuming that the first component is the target analyte and that $C_{j,t}$ represents the unit-free concentration (i.e. $C_{j,1} + C_{j,2} + \ldots + C_{j,T} = 1$), Formula (1) can be rewritten as the following equation, where $\Delta a_t = a_t - a_2$:

$$A_j = p_j C_{j,1}\, \Delta a_1 + p_j a_2 + \sum_{t=3}^{T} p_j C_{j,t}\, \Delta a_t \qquad (2)$$
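The step from Formula (1) to Equation (2) can be made explicit. The short derivation below is a sketch that uses only the unit-sum constraint on the concentrations stated in the text and introduces no new symbols: the second component is eliminated and the terms are regrouped.

```latex
\begin{aligned}
C_{j,2} &= 1 - C_{j,1} - \sum_{t=3}^{T} C_{j,t}, \\
A_j &= p_j \Big[ C_{j,1}\, a_1
      + \Big( 1 - C_{j,1} - \sum_{t=3}^{T} C_{j,t} \Big) a_2
      + \sum_{t=3}^{T} C_{j,t}\, a_t \Big] \\
    &= p_j C_{j,1}\, (a_1 - a_2) + p_j a_2
      + \sum_{t=3}^{T} p_j C_{j,t}\, (a_t - a_2).
\end{aligned}
```

Writing $\Delta a_t = a_t - a_2$ in the last line recovers Equation (2).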

Equation (2) shows a linear relationship between $A_j$ and $p_j$, and also between $A_j$ and $p_j C_{j,1}$. If the multiplicative parameter $p_j$ for the calibration samples is available, two calibration models can be built by multivariate linear calibration methods (such as PLS). For simplicity, the same number of latent variables is generally used in the two calibration models. The content of the chemical component in the prediction samples can then be obtained by applying the two calibration models to the spectral data of the prediction samples. The estimation of the multiplicative parameter vector p for the calibration samples is therefore the key to correcting the multiplicative effects. OPLECm provides a refined method for estimating p by decomposing the spectral matrix A into score and loading matrices that consist of r columns (where r represents the number of active chemical components). Since the absolute value of p is not required (p is assumed to be no less than unity), p can be estimated by finding the minimum value of the quadratic function f(p). The number r can also be estimated during the estimation of p, without knowing the pure spectra of the chemical components in the samples.

All computational algorithms, including the repeated sample divisions, the OMWPLS modeling framework and the OPLECm preprocessing method, were implemented in MATLAB (version 7.11) on a PC with a 4-core CPU and 4 GB of memory.

3. Results and discussion

The FT-NIR spectra of the 158 corn samples were measured over the full scanning range (10,000–4000 cm⁻¹). The full range comprises 778 wavenumbers available for modeling, so the quantitative analytical models for protein had high complexity. To improve prediction accuracy and reduce computational complexity, waveband selection was performed with the OMWPLS framework described above.

3.1. Waveband selection by the OMWPLS framework

Optimal wavebands were selected with the OMWPLS framework by testing the possible combinations of the parameter group (SN, W, F) while accounting for model stability and robustness. For the FT-NIR determination of protein content in corn, F was varied from 1 to 20. The window-searching range for the MWPLS method covered the full scanning range of 10,000–4000 cm⁻¹ with 778 wavelengths. SN was varied from 1 to 778 with a step of 1, so that all possible values of SN were tested. To search all possible windows, W would have to vary from 1 to 778 with a step of 1. In this case, the number of windows would be as high as 302,253. Combined with the varying F, we would have had to establish a total of 5,899,520 PLS models and calculate the six model indicators RMSEPm, RP,m, RMSEPsd, RP,sd, sup(RMSEP) and inf(RP) over the 50 different partitions of calibration and prediction sets. This computational workload is too large for a common computing device, even for the PC with a 4-core CPU and 4 GB of memory that we used. To reduce the workload while keeping the modeling representative, we varied W as follows: consecutively from 1 to 100, from 105 to 500 with a step of 5, from 510 to 770 with a step of 10, and finally 778. This setting created a total of 47,308 windows and 800,620 models for testing, a number that the device could handle smoothly.

Based on the minimum sup(RMSEP), with the corresponding inf(RP) as the reference, the OMWPLS models for the FT-NIR analysis of protein were selected. The minimum sup(RMSEP) corresponding to each starting point (in wavenumber) is shown in Fig. 1, and the minimum sup(RMSEP) corresponding to each waveband size is shown in Fig. 2. The global-optimal model may be regarded as the most valuable reference for designing the optical splitting system of spectroscopic instruments. However, local-optimal models whose prediction results are close to that of the global-optimal model can also be a good choice. These models address the restrictions on the position and size of the waveband in instrument design, such as production costs and material properties. Figs. 1 and 2 show that the minimum sup(RMSEP) and values close to it were obtained in the long-wave region, which indicates that the global-optimal model and some local-optimal models were located there. The zoomed-in sub-figures (a) and (b) in Figs. 1 and 2 make it easier to identify the starting points and window sizes that give the minimum sup(RMSEP), and the wavebands with the same parameters simultaneously have high values of inf(RP).

The modeling parameters SN, W and F and the prediction results sup(RMSEP) and inf(RP) corresponding to the global-optimal and some local-optimal models are summarized in Table 1, where sup(RMSEP) and inf(RP) were calculated as RMSEPm + RMSEPsd and RP,m − RP,sd, respectively. To verify the OMWPLS-selected wavebands, these wavebands were further used for re-establishing PLS models with VIP scores (denoted as PLS-VIP models), and the results are shown alongside. For comparison, the prediction results of the PLS model and the PLS-VIP model established on the full scanning range (10,000–4000 cm⁻¹) are also listed in Table 1. Table 1 shows that the sup(RMSEP) of the selected OMWPLS models was clearly lower, and the inf(RP) clearly higher, than those of the full scanning range; thus the prediction accuracy and stability of the OMWPLS models were significantly improved. In addition, the global-optimal and local-optimal models selected by OMWPLS were computationally much simpler than the full-range PLS models because they use much smaller windows. The results also showed that the PLS-VIP models on the selected wavebands improved the predictive results only slightly, which means that the OMWPLS framework is able to extract stable and robust modeling parameters and that the variables in these wavebands have little collinear effect.
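The selection logic applied to the sup(RMSEP) cube (optimal F per window, global optimum, and local optima along the SN or W axis) can be sketched as plain dictionary reductions. This is an illustration under stated assumptions: the cube is represented as a hypothetical Python dictionary keyed by (SN, W, F), the toy values are invented for the example, and the function names are not from the paper.

```python
def optimal_f_per_window(cube):
    """For each window (SN, W), keep the F with the smallest sup(RMSEP)."""
    best = {}
    for (sn, w, f), value in cube.items():
        if (sn, w) not in best or value < best[(sn, w)][1]:
            best[(sn, w)] = (f, value)
    return best


def global_optimum(cube):
    """Return ((SN, W, F), sup(RMSEP)) with the smallest sup(RMSEP) overall."""
    return min(cube.items(), key=lambda item: item[1])


def local_optima(cube, axis):
    """Minimise over the other two parameters for each fixed SN (axis=0) or W (axis=1)."""
    best = {}
    for key, value in cube.items():
        fixed = key[axis]
        if fixed not in best or value < best[fixed][1]:
            best[fixed] = (key, value)
    return best


# Toy cube with invented sup(RMSEP) values, keyed by (SN, W, F).
toy_cube = {
    (1, 2, 1): 0.52, (1, 2, 2): 0.47,
    (1, 3, 1): 0.50, (1, 3, 2): 0.45,
    (2, 2, 1): 0.49, (2, 2, 2): 0.46,
}

per_window = optimal_f_per_window(toy_cube)
best_key, best_value = global_optimum(toy_cube)
by_start = local_optima(toy_cube, axis=0)  # SN-axis projection
by_size = local_optima(toy_cube, axis=1)   # W-axis projection
```

The two projections `by_start` and `by_size` play the role of the curves plotted against starting point and window size, respectively.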


Fig. 1. The minimum sup(RMSEP) corresponding to each starting point (in wavenumber). Subfigures (a) and (b) are zoomed-in views of the regions marked (a) and (b) in the main figure.

Fig. 2. The minimum sup(RMSEP) corresponding to each waveband size. Subfigures (a) and (b) are zoomed-in views of the regions marked (a) and (b) in the main figure.

We obtained the parameters of the global-optimal model from Table 1. The starting wavenumber was 6061 cm⁻¹ and the optimal window size was 80, which gives the corresponding waveband of 6061–5451 cm⁻¹. Besides the global-optimal waveband, the local-optimal wavebands provide appreciable predictive results (see Table 1) and are quite useful for some practical applications. If the selected models cannot meet the demands of specific conditions, other models can still be found within the OMWPLS framework, with the wavebands and parameters read from Figs. 1 and 2. Additionally, Table 1 shows that some of the selected local-optimal wavebands overlap with the global-optimal waveband.

The optimal wavebands we found (located in the first overtone region) differ somewhat from the wavebands conventionally attributed to the N–H group (6667–6000 and 5000–4500 cm⁻¹). The NIR first overtone region (7200–5400 cm⁻¹) proved to be informative for corn protein in our previous research (Chen et al., 2014). We believe that our selected optimal wavebands, mainly located in and around the region of 6200–5400 cm⁻¹, are practical informative responses that were not included in the conventional assignments. The accuracy of FT-NIR calibration models for corn protein may be further improved if this newly found region is added to enlarge the informative wavebands. This new finding seems valuable and quite practical for the design of a corn-specific NIR instrument.

Table 1. The modeling prediction accuracy and stability of PLS models based on the optimal wavebands selected by the OMWPLS framework.

Waveband (cm⁻¹)   W     F    PLS model                 PLS-VIP model
                             sup(RMSEP)   inf(RP)      sup(RMSEP)   inf(RP)
10000–4000        778   13   0.539        0.850        0.528        0.885
6162–5513         85    9    0.447        0.926        0.446        0.930
6154–5474         89    7    0.448        0.927        0.448        0.929
6061–5451 *       80    6    0.444        0.928        0.445        0.930
6054–5397         86    8    0.447        0.925        0.449        0.928
6046–5420         82    7    0.450        0.926        0.451        0.921
5505–4980         69    7    0.465        0.911        0.465        0.915
5158–4857         40    6    0.450        0.913        0.449        0.911
5150–4857         39    6    0.450        0.910        0.449        0.912

* The best (global-optimal) waveband and its modeling results.
W: the size of the moving window of the MWPLS method. F: PLS factor, i.e. the number of latent variables for PLS regression. PLS model: the partial least squares regression model. PLS-VIP model: PLS models with VIP scores. sup(RMSEP): the supremum of RMSEP. inf(RP): the infimum of RP. VIP: the variable importance in the projection. RMSEP: the root mean square error of prediction. RP: the correlation coefficient for prediction.
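The waveband limits reported in Table 1 can be reproduced from the starting wavenumber and the window size. The sketch below assumes that the 778 data points are evenly spaced over 10,000–4000 cm⁻¹ with both end points included (a spacing of about 7.72 cm⁻¹); this spacing is inferred from the numbers in the text rather than stated by the authors, and the function name is illustrative.

```python
FULL_START_CM1 = 10000.0  # high-wavenumber end of the scanning range
FULL_END_CM1 = 4000.0     # low-wavenumber end of the scanning range
N_POINTS = 778            # number of wavenumbers in the full range

# Assumed uniform spacing between neighbouring data points (about 7.72 cm^-1).
SPACING_CM1 = (FULL_START_CM1 - FULL_END_CM1) / (N_POINTS - 1)


def waveband_end(start_cm1, window_size):
    """Low-wavenumber end of a window holding `window_size` consecutive points."""
    return start_cm1 - (window_size - 1) * SPACING_CM1


global_optimal_end = waveband_end(6061.0, 80)  # about 5451 cm^-1
short_band_end = waveband_end(5158.0, 40)      # about 4857 cm^-1
```

Both values round to the limits listed in Table 1 (5451 and 4857 cm⁻¹), which supports the assumed spacing.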

3.2. Waveband optimization by OPLECm technique

The prediction results of the OMWPLS models are expected to improve further if more information is extracted and noise is reduced.


Table 2. The modeling prediction results of the OMWPLS-selected wavebands based on the raw spectral data and on data preprocessed by OPLECm and by the Savitzky-Golay (SG) smoother.

Waveband (cm⁻¹)   W     Raw data                  OPLECm                    Savitzky-Golay smoother
                        sup(RMSEP)   inf(RP)      sup(RMSEP)   inf(RP)      sup(RMSEP)   inf(RP)
10000–4000        778   0.539        0.850        0.503        0.861        0.513        0.857
6162–5513         85    0.447        0.926        0.423        0.936        0.439        0.933
6154–5474         89    0.448        0.927        0.422        0.932        0.431        0.922
6061–5451         80    0.444        0.928        0.421        0.938        0.424        0.933
6054–5397         86    0.447        0.925        0.417        0.934        0.425        0.932
6046–5420         82    0.450        0.926        0.428        0.923        0.430        0.917
5505–4980         69    0.465        0.911        0.425        0.926        0.438        0.919
5158–4857 *       40    0.450        0.913        0.413        0.939        0.419        0.938
5150–4857         39    0.450        0.910        0.415        0.936        0.427        0.934

* The best waveband after preprocessing and its modeling results.
W: the size of the moving window of the MWPLS method. OPLECm: the modified optical path length estimation and correction. sup(RMSEP): the supremum of RMSEP. inf(RP): the infimum of RP. RMSEP: the root mean square error of prediction. RP: the correlation coefficient for prediction.
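The sup(RMSEP) columns of Table 2 can be compared directly. The following sketch only transcribes the tabulated values and computes simple summaries; the dictionary layout and function names are illustrative choices, not part of the original analysis.

```python
# sup(RMSEP) values transcribed from Table 2: (raw, OPLECm, Savitzky-Golay).
TABLE2_SUP_RMSEP = {
    "10000-4000": (0.539, 0.503, 0.513),
    "6162-5513": (0.447, 0.423, 0.439),
    "6154-5474": (0.448, 0.422, 0.431),
    "6061-5451": (0.444, 0.421, 0.424),
    "6054-5397": (0.447, 0.417, 0.425),
    "6046-5420": (0.450, 0.428, 0.430),
    "5505-4980": (0.465, 0.425, 0.438),
    "5158-4857": (0.450, 0.413, 0.419),
    "5150-4857": (0.450, 0.415, 0.427),
}


def relative_reduction(raw, processed):
    """Percentage decrease of sup(RMSEP) relative to the raw-data value."""
    return 100.0 * (raw - processed) / raw


def best_band(column):
    """Waveband with the smallest sup(RMSEP) in column 0 (raw), 1 (OPLECm) or 2 (SG)."""
    return min(TABLE2_SUP_RMSEP, key=lambda band: TABLE2_SUP_RMSEP[band][column])


# Does OPLECm beat SG, and SG beat the raw data, for every waveband?
consistent_ranking = all(raw > sg > oplecm for raw, oplecm, sg in TABLE2_SUP_RMSEP.values())
```

For the 5158–4857 cm⁻¹ waveband, `relative_reduction(0.450, 0.413)` gives a decrease of roughly 8% in sup(RMSEP) after OPLECm preprocessing.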

To separate the spectral variation caused by multiplicative light scattering, the OPLECm preprocessing technique was applied to the raw spectra in each selected waveband, and the PLS models were re-established within the OMWPLS framework. This gave the optimal values of sup(RMSEP) and inf(RP) for each waveband (see Table 2), with no more than 8 PLS factors. For comparison, the modeling results for data preprocessed with the Savitzky-Golay smoother (Savitzky and Golay, 1964) are also listed. Three points can be drawn from Table 2. First, data preprocessing improved the prediction results in every waveband, whether the OPLECm technique or the Savitzky-Golay smoother was used. Second, OPLECm preprocessing performed better than Savitzky-Golay smoothing. Third, the best model on raw data is not necessarily the best model after preprocessing. The minimum sup(RMSEP) was 0.413 (%), with a corresponding inf(RP) of 0.939, and the optimal waveband after preprocessing changed to 5158–4857 cm⁻¹ with 40 wavenumbers. Besides this best waveband, Table 2 lists many alternative wavebands that give acceptable prediction results.

For the best preprocessed OMWPLS model, we examined the OPLECm procedure in more detail. Because the estimation of p requires the number r to be determined, the minimum of f(p) is plotted against r in Fig. 3. The minimum f(p) decreased markedly as r increased from 1 to 6, whereas larger values of r led to slow growth in min f(p). This indicates that most of the active chemical component information relevant to p was contained in the first 6 principal components of the spectral data. The value of r was therefore set automatically to 6 in the OPLECm technique so that OPLECm provided the best preprocessing.

Fig. 3. Relationship between minimum f(p) and the number of active components (r) in the OPLECm technique for the best data-preprocessing OMWPLS model.

3.3. Model validation

The 40 randomly selected validation samples, which were excluded from the model optimization process, were used to evaluate the best OMWPLS model with OPLECm preprocessing (the waveband of 5158–4857 cm⁻¹ with 40 wavenumbers). The PLS model was established from the spectral data and the actual contents (measured by the Kjeldahl method) of the calibration and prediction samples, and the regression coefficients were obtained with the selected optimal modeling parameters. The FT-NIR predicted values for the validation samples were then calculated from the regression coefficients and the spectral data. The predicted protein values for the 40 validation samples, obtained by PLS modeling with 5 latent variables, were close to the actual contents, with RMSEV and RV of 0.443 (%) and 0.924, respectively. The correlation between the predicted values and the actual contents is shown in Fig. 4; the two were highly correlated. The validation results for the random validation samples were satisfactory, which we attribute to the modeling stability and robustness achieved during model optimization.

Fig. 4. The correlation relationship between the best FT-NIR predicted values and actual contents of protein for the 40 validation samples.

4. Conclusions

An optimization strategy for the quantitative determination of protein in corn was developed based on FT-NIR spectrometry with waveband optimization. Our results confirmed that the OMWPLS framework maintains modeling stability during waveband selection. They also confirmed that the selected global-optimal and local-optimal wavebands contained informative spectroscopic responses of protein. This optimization strategy could help the rapid NIR determination of corn protein. FT-NIR spectrometry therefore offers the opportunity, and potentially the ability, to determine the protein content in corn. In fact, the FT-NIR spectra contain many hidden details that can help reveal a practical way of maturing high-quality corn.

We screened the full scanning range in detail, searching for informative wavebands with the OMWPLS modeling algorithm. The stability of OMWPLS modeling was confirmed by redoing the partition of the calibration samples. The selected global-optimal wavebands of 6061–5451 cm⁻¹ (without preprocessing) and 5158–4857 cm⁻¹ (with OPLECm preprocessing) indicate that the absorption of the chemical groups of protein is concentrated mainly in a specific long-wave region. This finding seems valuable and quite practical for the design of a corn-specific NIR instrument.

The waveband selection framework confirms that quantitative analysis of corn protein by FT-NIR spectrometry is feasible when modeling stability and robustness are taken into account. The combination of the OMWPLS framework and OPLECm preprocessing can serve as a chemometric technique for optimizing FT-NIR waveband selection for the determination of corn protein. FT-NIR spectrometry combined with waveband optimization is expected to be an alternative technology for the detection of corn chemical components, which is essential for the implementation of NIR methods in the analysis of corn quality.

Acknowledgments

This work was supported by the Natural Scientific Foundation of China (11226219, 61164020), the Natural Scientific Foundation of Guangxi (2012GXNSFBA053013, 2013GXNSFDA019001), and the Scientific Research Project of Guangxi Education Office (201203YB085).

References

Andrade, M.F., Salatier, B., Marcelo, A., 2009. Sources and times of nitrogen application on irrigated corn crop. Semin. Cienc. Agrar. 30, 275–283.
Axelson, D., 2010. Data Preprocessing for Chemometric and Metabonomic Analysis. Kingston, Ontario.
Bondi, R.W., Igne, B., Drennen, J.K., 2012. Effect of experimental design on the prediction performance of calibration models based on near-infrared spectroscopy for pharmaceutical applications. Appl. Spectrosc. 66, 1442–1453.


Burnham, A.J., Macgregor, J.F., Viveros, R., 1999. Latent variable multivariate regression modeling. Chemom. Intell. Lab. Syst. 48, 167–180.
Burns, D.A., Ciurczak, E.W., 2008. Handbook of Near-infrared Analysis, third ed. Taylor and Francis, New York.
Cen, H., He, Y., 2007. Theory and application of near infrared reflectance spectroscopy in determination of food quality. Trends Food Sci. Technol. 18, 72–83.
Chen, H.Z., Ai, W., Feng, Q.X., Jia, Z., Song, Q.Q., 2014. FT-NIR spectroscopy and Whittaker smoother applied to joint analysis of duel-components for corn. Spectrochim. Acta Part A: Mol. Biomol. Spectrosc. 118, 752–759.
Chen, H.Z., Pan, T., Chen, J.M., Lu, Q.P., 2011. Waveband selection for NIR spectroscopy analysis of soil organic matter based on SG smoothing and MWPLS methods. Chemom. Intell. Lab. Syst. 107, 139–146.
Chen, H.Z., Tang, G.Q., Song, Q.Q., Ai, W., 2013. Combination of modified optical path length estimation and correction and moving window partial least squares to waveband selection for the Fourier transform near-infrared (FT-NIR) determination of pectin in shaddock peel. Anal. Lett. 46, 2060–2074.
Du, Y.P., Liang, Y.Z., Jiang, J.H., Berry, R.J., Ozaki, Y., 2004. Spectral regions selection to improve prediction ability of PLS models by changeable size moving window partial least squares and searching combination moving window partial least squares. Anal. Chim. Acta 501, 183–191.
Givens, D.I., De Boever, J.L., Deaville, E.R., 1997. The principles, practices and some future applications of near infrared spectroscopy for predicting the nutritive value of foods for animals and humans. Nutr. Res. Rev. 10, 83–114.
Hiroaki, I., Toyonori, N., Eiji, T., 2002. Measurement of pesticide residues in food based on diffuse reflectance IR spectroscopy. IEEE Trans. Instrum. Meas. 51, 886–890.
Igne, B., Reeves III, J.B., McCarty, G., Hively, W.D., Lund, E., Hurburgh Jr., C.R., 2010. Evaluation of spectral pretreatments, partial least squares, least squares support vector machines and locally weighted regression for quantitative spectroscopic analysis of soils. J. Near Infrared Spectrosc. 18, 167–176.
Jiang, J.H., Berry, R.J., Siesler, H.W., Ozaki, Y., 2002. Wavelength interval selection in multicomponent spectral analysis by moving window partial least-squares regression with applications to mid-infrared and near-infrared spectroscopic data. Anal. Chem. 74, 3555–3565.
Jiang, H., Liu, G.H., Xiao, X.H., Yu, S., Mei, C.L., Ding, Y.H., 2012. Classification of Chinese soybean paste by Fourier transform near-infrared (FT-NIR) spectroscopy and different supervised pattern recognition. Food Anal. Methods 5, 928–934.
Jin, J.W., Chen, Z.P., Li, L.M., Steponavicius, R., Thennadil, S.N., Yang, J., Yu, R.Q., 2012. Quantitative spectroscopic analysis of heterogeneous mixtures: the correction of multiplicative effects caused by variations in physical properties of samples. Anal. Chem. 84, 320–326.
Naes, T., Isaksson, T., Fearn, T., Davies, T., 2002. A User-friendly Guide to Multivariate Calibration and Classification. NIR Publications, Chichester, UK.
Ngonyamo-Majee, D., Shaver, R.D., Coors, J.G., Sapienza, D., Correa, C.E.S., Lauer, J.G., Berzaghi, P., 2008. Relationships between kernel vitreousness and dry matter degradability for diverse corn germplasm: I. Development of near-infrared reflectance spectroscopy calibrations. Anim. Feed Sci. Technol. 142, 247–258.
Nicolai, B.M., Beullens, K., Bobelyn, E., Peirs, A., Saeys, W., Theron, K.I., Lammertyn, J., 2007. Non-destructive measurement of fruit and vegetable quality by means of NIR spectroscopy: a review. Postharvest Biol. Technol. 46, 99–118.
Ozaki, Y., 2012. Near-infrared spectroscopy - its versatility in analytical chemistry. Anal. Sci. 28, 545–563.
Peirs, A., Tirry, J., Verlinden, B., Darius, P., Nicolaï, B.M., 2003. Effect of biological variability on the robustness of NIR-models for soluble solids content of apples. Postharvest Biol. Technol. 28, 269–280.
Rodriguez-Saona, L.E., Khambaty, F.M., Fry, F.S., Dubois, J., Calvey, E.M., 2004. Detection and identification of bacteria in a juice matrix with Fourier transform-near infrared spectroscopy and multivariate analysis. J. Food Prot. 67, 2555–2559.
Salgó, A., Gergely, S., 2012. Analysis of wheat grain development using NIR spectroscopy. J. Cereal Sci. 56, 31–38.
Sarembaud, J., Platero, G., Feinberg, M., 2008. Stability studies of agrifood reference materials under different conditions of storage by near-infrared spectroscopy. Food Anal. Methods 1, 227–235.
Savitzky, A., Golay, M.J.E., 1964. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627–1637.
Sinija, V.R., Mishra, H.N., 2009. FT-NIR spectroscopy for caffeine estimation in instant green tea powder and granules. LWT Food Sci. Technol. 42, 998–1002.
Sousa, A.C., Lucio, M.M.L.M., Bezerra Neto, O.F., Marcone, G.P.S., Pereira, A.F.C., Dantas, E.O., Fragoso, W.D., Araujo, M.C.U., Galvão, R.K.H., 2007. A method for determination of COD in a domestic wastewater treatment plant by using near-infrared reflectance spectrometry of seston. Anal. Chim. Acta 588, 231–236.
Teófilo, R.F., Martins, J.P.A., Ferreira, M.M.C., 2009. Sorting variables by using informative vectors as a strategy for feature selection in multivariate regression. J. Chemom. 23, 32–48.
Wang, K., Chi, G.Y., Lau, R., Chen, T., 2011. Multivariate calibration of near infrared spectroscopy in the presence of light scattering effect: a comparative study. Anal. Lett. 4, 824–836.
Yu, P., Huber, J.T., Santos, F.A.P., Simas, J.M., Theurer, C.B., 1998. Effects of ground, steam-flaked, and steam-rolled corn grains on performance of lactating cows. J. Dairy Sci. 81, 777–783.