Near infrared spectral variable optimization by final complexity adapted models combined with uninformative variables elimination - A validation study


PII: S0030-4026(19)31918-7
DOI: https://doi.org/10.1016/j.ijleo.2019.164019
Reference: IJLEO 164019
To appear in: Optik
Received Date: 30 October 2019
Accepted Date: 6 December 2019

Please cite this article as: Song X, Huang Y, Tian K, Min S, Near infrared spectral variable optimization by final complexity adapted models combined with uninformative variables elimination-A validation study, Optik (2019), doi: https://doi.org/10.1016/j.ijleo.2019.164019


Xiangzhong Song a, Yue Huang a,*, Kuangda Tian b, Shungeng Min b

a College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100083, P.R. China
b College of Science, China Agricultural University, Beijing 100193, P.R. China

* Corresponding author. Tel.: +86 10 62733091. Email address: [email protected] (Y. Huang).

Abstract:

A combination method for spectral variable selection was proposed in this study. In this method, predictive-property-ranked variable reduction with final complexity adapted models (FCAM) was used for further variable refinement after uninformative variables elimination (UVE). Two near infrared spectral (NIRS) datasets were investigated to evaluate the quantitative performance of the proposed method. The results showed that UVE-FCAM selected far fewer variables than UVE alone while giving better prediction on both datasets. Moreover, both the prediction performance and the modeling stability of UVE-FCAM proved better than those of UVE-SPA, a widely used combination method. The overall results demonstrate that UVE-FCAM is a promising alternative for variable optimization and that FCAM has the potential to serve as an effective variable refinement step after other variable selection methods.

Keywords: Uninformative variables elimination (UVE), Final complexity adapted models (PPRVR-FCAM), Near-infrared spectroscopy (NIRS), Variable selection

1. Introduction

In analytical chemistry, multivariate calibration algorithms are widely used to establish regression models on spectral data. With the development of instrumental technology, the spectra obtained usually consist of hundreds or even thousands of response signals, which are treated as a large number of variables. In most cases, however, only a few informative variables are responsible for the property of interest, or even for the whole sample. The remaining noisy or uninformative variables interfere with the prediction performance of the calibration model or make the calibration process very time-consuming. Therefore, many variable selection methods have been developed in the past decades, such as interval PLS (iPLS) [1], moving window PLS (MWPLS) [2, 3], uninformative variables elimination (UVE) [4-6], the successive projections algorithm (SPA) [7, 8], competitive adaptive reweighted sampling (CARS) [9], genetic algorithm-PLS (GA-PLS) [10, 11], iteratively retaining informative variables (IRIV) [12], the variable iterative space shrinkage approach (VISSA) [13-15], and other integrated variable optimization approaches [16-19]. All these methods focus on selecting informative variables or spectral regions according to a specific strategy. Every strategy has its own advantages and, of course, disadvantages, so more and more combinations of different strategies have been developed to achieve complementary advantages [20-25].

As a classic variable selection method, UVE has been used to address these problems and to improve model quality by eliminating irrelevant information and noise from the data matrix. Commonly, a regression model is then developed by PLS with the chosen variables. Calibration and prediction models built on the selected characteristic information can be better than those built on the full spectrum, because the characteristic wavelengths rather than the raw spectra are used to develop the model. Although UVE eliminates uninformative variables very effectively, it does not try to find the optimal variable subset with the best prediction performance, and the number of variables it retains is still large. To overcome these drawbacks, several variable selection methods have been used for further refinement after UVE, such as iPLS, GA and SPA [21, 23, 24, 26]. Among these combinations, UVE-SPA, proposed by our group, may be one of the most popular choices [27-32]. Benefiting from the advantages of both treatments, UVE-SPA selects informative variables with as little multi-collinearity as possible. However, the stability of UVE-SPA is unsatisfactory: it tends to select different variables in different runs, probably because any change in the input variables can strongly influence the selection results of SPA [7].

Predictive-property-ranked variable reduction with final complexity adapted models, abbreviated as FCAM, is a promising stepwise variable selection method [26, 33, 34]. In this method, variable reduction is carried out iteratively according to the absolute values of the regression coefficients. Although the computational burden of FCAM is usually heavy, especially when the spectra contain many variables, its stability has been shown to be better than that of UVE-iPLS and UVE-GA [26]. Considering the poor stability of the traditional UVE-based combination methods, a new combination strategy was proposed in this study, using FCAM as a variable refining method after UVE. The proposed method was applied to two near infrared spectral datasets, for the protein content of corn and the fipronil content of a pesticide formulation. Meanwhile, UVE-SPA, a commonly used UVE-based selection method, was chosen for performance comparison.

2. Theory and algorithms

2.1. FCAM method

FCAM is a variable selection method based on the regression coefficients of a PLS model. A flowchart and detailed information about FCAM have been reported elsewhere [26]; its main steps can be summarized as follows (an illustrative code sketch is given after the list):

(a) Build a full-spectrum PLS model on the training set with the best model complexity A determined by cross validation. The absolute value of the PLS regression coefficient of each variable, abbreviated as REG, is used as the variable property for variable reduction. Meanwhile, the root mean squared error of cross validation (RMSECV) of this PLS model is assessed by five-fold cross validation on the training set.

(b) Rank the variables in descending order of the REG values generated in step (a), and remove the variable with the smallest REG.

(c) Build a new PLS model with A latent variables on the retained variables; its RMSECV is assessed in the same way as in step (a).

(d) Rank the variables in descending order of the REG values generated in step (c), and eliminate the variable with the smallest REG.

(e) Repeat steps (c)-(d) until the number of remaining variables equals A.

(f) For the remaining A variables, continue repeating steps (c)-(d) until only one variable remains, except that the PLS model complexity used in step (c) is decreased gradually so that it always equals the number of remaining variables.

(g) Find the lowest RMSECV value, denoted RMSECVmin, among all recorded RMSECV values. The cutoff value, denoted RMSECVcrit, is calculated according to Equation (1):

RMSECVcrit² = F(α, M, M) × RMSECVmin²    (1)

where α is the significance level of the one-tailed F-test and M is the number of degrees of freedom of both the numerator and the denominator. In this work, α was set to 0.05 and M to the number of training set samples.

(h) Denote the RMSECV value obtained in step (a) as RMSECVFS. If RMSECVcrit > RMSECVFS, then RMSECVcrit is set to RMSECVFS. The smallest variable set whose RMSECV is less than or equal to RMSECVcrit is taken as the best variable set selected by FCAM.
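To make the loop in steps (a)-(h) concrete, a minimal Python sketch is given below. It is not the authors' MATLAB implementation: it assumes scikit-learn's PLSRegression, five-fold cross validation and a single response, and the helper names (rmsecv, fcam_select) are introduced here only for illustration.

```python
# Minimal sketch of the FCAM backward elimination (steps a-h); an illustration
# assuming scikit-learn PLS and five-fold CV, not the authors' original code.
import numpy as np
from scipy.stats import f
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error


def rmsecv(X, y, n_lv, cv=5):
    """Root mean squared error of cross validation for a PLS model with n_lv LVs."""
    pred = cross_val_predict(PLSRegression(n_components=n_lv), X, y, cv=cv)
    return float(np.sqrt(mean_squared_error(y, pred.ravel())))


def fcam_select(X, y, best_lv, alpha=0.05):
    """Return the smallest variable subset whose RMSECV does not exceed RMSECVcrit."""
    retained = list(range(X.shape[1]))
    history = []                                    # (subset, RMSECV) at every size
    rmsecv_fs = rmsecv(X, y, best_lv)               # step (a): full-spectrum RMSECV
    while True:
        n_lv = min(best_lv, len(retained))          # step (f): shrink model complexity
        history.append((list(retained), rmsecv(X[:, retained], y, n_lv)))
        if len(retained) == 1:
            break
        pls = PLSRegression(n_components=n_lv).fit(X[:, retained], y)
        reg = np.abs(np.ravel(pls.coef_))           # REG = |PLS regression coefficient|
        retained.pop(int(np.argmin(reg)))           # steps (b)-(d): drop smallest REG
    # steps (g)-(h): F-test based cutoff of Eq. (1), with M = number of training samples
    m = len(y)
    rmsecv_min = min(r for _, r in history)
    rmsecv_crit = min(np.sqrt(f.ppf(1 - alpha, m, m)) * rmsecv_min, rmsecv_fs)
    return min((s for s, r in history if r <= rmsecv_crit), key=len)
```

Here best_lv plays the role of the full-spectrum complexity A chosen in step (a). As a rough numerical illustration of the cutoff, with M = 50 training samples and α = 0.05 the critical F value is roughly 1.6, so RMSECVcrit is about 1.26 × RMSECVmin unless it is capped by RMSECVFS.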

2.2. UVE method

UVE is a variable selection method based on the stability value of the PLS model and has been described in many studies [4-6]. Therefore, only its two key points are summarized here (a code sketch follows the two points).

(a) Stability value. The stability value c of each variable is obtained from hundreds of repeated Monte Carlo calculations and can be regarded as a measure of how consistently that variable contributes to the model. It is calculated by Monte Carlo cross validation, which has been shown to be more effective than leave-one-out cross validation [5, 6, 35]. In this work, the Monte Carlo cross validation was repeated 500 times, and in each repeat 80% of the training samples were selected randomly to construct the PLS model.

(b) Cutoff threshold. The maximum absolute value of c among the artificial random variables is used as the cutoff threshold; variables whose absolute value of c falls below this threshold are regarded as uninformative and eliminated.
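A compact sketch of these two ingredients is given below. It is only an illustration under stated assumptions: following the UVE literature [4, 5], the stability value is taken as the ratio of the mean to the standard deviation of each variable's PLS regression coefficient over the Monte Carlo runs, artificial random variables are appended to the spectra, and scikit-learn's PLSRegression stands in for the original MATLAB code; the function name uve_select is ours.

```python
# Illustrative sketch of UVE with Monte Carlo resampling (500 runs, 80% of the
# training samples per run) and a cutoff taken from appended random variables.
import numpy as np
from sklearn.cross_decomposition import PLSRegression


def uve_select(X, y, n_lv, n_runs=500, frac=0.8, noise_scale=1e-10, seed=0):
    """Return indices of variables whose |c| exceeds the noise-based cutoff."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X_aug = np.hstack([X, noise_scale * rng.random((n, p))])   # append random variables
    coefs = np.empty((n_runs, 2 * p))
    for i in range(n_runs):                                    # Monte Carlo resampling
        idx = rng.choice(n, size=int(frac * n), replace=False)
        pls = PLSRegression(n_components=n_lv).fit(X_aug[idx], y[idx])
        coefs[i] = np.ravel(pls.coef_)
    c = coefs.mean(axis=0) / coefs.std(axis=0)                 # stability value c
    cutoff = np.abs(c[p:]).max()                               # max |c| of random variables
    return np.flatnonzero(np.abs(c[:p]) > cutoff)              # keep informative variables
```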

2.3. UVE-FCAM method

UVE-FCAM is a combined variable selection method: uninformative variables are first eliminated by UVE, and the informative variables retained by UVE are then further refined by FCAM (see the sketch after this list). In terms of variable selection, the advantages of UVE-FCAM can be summarized in three aspects:

(a) UVE-FCAM selects far fewer variables than UVE alone with comparable prediction performance, because FCAM retains as few informative variables as possible without loss of prediction ability.

(b) Both the selection performance and the computational burden of FCAM are improved significantly in the combined method, since most uninformative variables have already been eliminated by UVE.

(c) Owing to the good stability of FCAM, the stability of UVE-FCAM is expected to be better than that of UVE-SPA.
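Under the same assumptions as the two sketches above, the combined selector is simply their composition; uve_select and fcam_select are the illustrative helpers defined earlier, not the authors' code.

```python
# Sketch of the combined UVE-FCAM selector: UVE prunes first, FCAM refines the rest.
import numpy as np


def uve_fcam_select(X, y, n_lv, alpha=0.05):
    kept_by_uve = uve_select(X, y, n_lv)                             # step 1: eliminate noise variables
    refined = fcam_select(X[:, kept_by_uve], y, n_lv, alpha=alpha)   # step 2: FCAM refinement
    return kept_by_uve[np.asarray(refined)]                          # indices into the original spectrum
```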

3. Data collection and treatment

3.1. Corn

This dataset consists of the spectra of 80 corn samples measured on three different NIR spectrometers. Each NIR spectrum consists of 700 variables covering 1100-2498 nm with an interval of 2 nm. In this study, the spectra obtained on the instrument denoted "m5" were used, and the protein content was taken as the property of interest. The dataset was divided into a calibration set (50 samples) and an independent test set (30 samples) by the Kennard-Stone (KS) algorithm [36].

3.2. Pesticide

The pesticide dataset consists of 90 pesticide formulation samples. Following our previous study [37], a 10% acetamiprid reagent spiked with different concentrations of fipronil was used to prepare the 90 fipronil-fortified samples, each with a total mass of approximately 10 g. The mass percentage of fipronil ranged from 0.1% to 4.5%: 35 samples had concentrations between 0.1% and 1.0%, and 55 samples between 1.0% and 4.5%. NIR spectra were acquired on an FT-NIR spectrometer (Spectrum ONE NTS, PerkinElmer, USA) over the wavenumber range of 4000-10000 cm-1 at 4 cm-1 resolution, each spectrum being the average of 32 scans. Spectra of fipronil technical (>99% purity, w/w) were obtained using the integrating sphere and sampling rotator accessories under the same acquisition conditions. The dataset was divided randomly into a calibration set (50 samples) and an independent test set (40 samples).

3.3. Calculation

All algorithms were implemented in MATLAB (Version 2012a, The MathWorks, Natick, MA, USA). Code for the SPA algorithm was obtained from www.ele.ita.br/kawakami/spa/. The KS,

UVE and FCAM algorithms were programmed in-house. To compare the intrinsic features of the algorithms, all calculations were based on the original spectra without any pre-treatment. PLS regression was used as the multivariate calibration method. For each PLS model, the optimal number of latent variables (LVs) was chosen by five-fold cross validation, with the maximum number of latent variables set to 15 for both datasets. The root mean square errors of cross validation (RMSECV), calibration (RMSEC) and prediction (RMSEP) were used for model evaluation. In addition, each variable selection method was repeated 100 times to evaluate its stability.
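For reference, this evaluation protocol can be sketched as follows. The kennard_stone helper is a simplified max-min implementation written only for this example, and the PLS modeling again relies on scikit-learn rather than the MATLAB code actually used; the names and defaults are assumptions.

```python
# Sketch of the evaluation protocol: Kennard-Stone split, LV selection by
# five-fold cross validation (up to 15 LVs), then RMSEC / RMSECV / RMSEP.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error


def kennard_stone(X, n_cal):
    """Pick n_cal calibration samples by the max-min Euclidean distance criterion."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    picked = list(np.unravel_index(d.argmax(), d.shape))         # start from the farthest pair
    while len(picked) < n_cal:
        rest = [i for i in range(len(X)) if i not in picked]
        # next sample: the one farthest from its nearest already-picked neighbour
        picked.append(rest[int(d[np.ix_(rest, picked)].min(axis=1).argmax())])
    return np.array(picked)


def rmse(a, b):
    return float(np.sqrt(mean_squared_error(a, b)))


def evaluate_pls(X_cal, y_cal, X_test, y_test, max_lv=15, cv=5):
    """Choose the number of LVs by cross validation and report the three error terms."""
    cv_err = {lv: rmse(y_cal, cross_val_predict(PLSRegression(n_components=lv),
                                                X_cal, y_cal, cv=cv).ravel())
              for lv in range(1, max_lv + 1)}
    best_lv = min(cv_err, key=cv_err.get)
    pls = PLSRegression(n_components=best_lv).fit(X_cal, y_cal)
    return {"nLVs": best_lv,
            "RMSECV": cv_err[best_lv],
            "RMSEC": rmse(y_cal, pls.predict(X_cal).ravel()),
            "RMSEP": rmse(y_test, pls.predict(X_test).ravel())}
```

For the corn data, for example, kennard_stone(X, 50) would give the calibration indices and the remaining 30 samples would form the independent test set.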

4. Results and discussion

4.1. Corn data

The results of the different methods on the corn dataset are presented in Table 1. The prediction performance of all four variable selection methods was better than that of the full spectrum, demonstrating that the quantitative performance of NIR spectral analysis can indeed be improved by proper variable selection. The prediction performance ranked as UVE-FCAM > UVE > UVE-SPA > FCAM. All methods including UVE performed better than FCAM alone, which conversely indicates that the performance of FCAM can be further improved by first removing more uninformative variables from the full spectrum. More importantly, the computational burden of FCAM was reduced significantly by UVE, since UVE retained only about 86 of the 700 variables in the full spectrum. Compared with the 86 variables retained by UVE, UVE-FCAM selected only 22 variables with slightly better prediction performance, indicating that FCAM is an effective variable refining method after UVE. In contrast, although UVE-SPA selected even fewer variables than UVE-FCAM, its prediction was inferior to that of UVE, since certain informative variables may be eliminated along with the collinear variables removed by SPA.

The mean spectrum of the corn samples and the accumulated selection frequencies of the variables retained by the four variable selection methods are illustrated in Fig. 1a-e, respectively. The spectral regions of 1700-1800 nm and 2100-2200 nm, marked by grey shadows, are assigned to the first overtone of C-H stretching and to various combination vibrations of N-H in the protein structure, respectively, and have been identified as the informative regions in Fu's report [38]. Most variables selected by the four methods are concentrated in these two informative regions, which explains why all four variable selection methods performed better than no selection. Although all four methods also selected some uninformative variables outside the informative regions, the selection frequencies of such uninformative variables were much lower for UVE-FCAM than for UVE, FCAM and UVE-SPA. In particular, the distribution of the variables selected by UVE-SPA was more scattered than that of UVE-FCAM even though fewer variables were selected, reflecting that the stability of UVE-FCAM was better than that of UVE-SPA. This is because some informative variables within the two informative regions may be eliminated by SPA along with the collinear variables it removes, since the chance of collinearity between adjacent variables within a specific waveband is usually high, especially for NIR data [15].

4.2. Pesticide system

The spectrum of each pesticide sample consists of 6001 variables. After 100 repeated runs of

each variable selection method, the results are listed in Table 2. Although the average number of variables retained by UVE-FCAM was far smaller than that of UVE, its predictive performance was still comparable with that of UVE, again validating that FCAM is a good approach for refining the variables retained by UVE. The finding that UVE greatly reduces the computational burden of FCAM was also confirmed on this system, as the average number of variables to be refined by FCAM dropped from 6001 to about 1300. However, both FCAM alone and UVE-SPA performed poorly, probably because some important variables were eliminated, as both retained no more than 20 variables.

The mean spectra of the pesticide formulation and of fipronil technical are shown in Fig. 2a and 2b, respectively. Four major characteristic absorption peaks of fipronil technical are marked in grey shadow: the peaks in the 4400-4600 cm-1 region are assigned to combination bands of C-H stretching, the peaks near 5000 cm-1 to combination bands of N-H stretching, the weak peak near 6000 cm-1 to the first overtone of aromatic C-H stretching, and the peaks in the 6500-6800 cm-1 range to the first overtone of N-H stretching. Because the content of fipronil technical in the pesticide formulation ranged from 0.1% to 5% (w/w), its characteristic peaks can hardly be identified in the spectrum of the formulation. Nevertheless, these four characteristic peaks were still considered the informative regions for establishing the calibration model of the fipronil technical content. The variables obtained by UVE, FCAM, UVE-FCAM and UVE-SPA are presented in Fig. 2c-f, respectively. As shown in Fig. 2c and Fig. 2e, both UVE and UVE-FCAM retained variables from all the informative regions mentioned above, but UVE selected many more uninformative variables outside these four regions than UVE-FCAM. In contrast, FCAM and UVE-SPA selected insufficient variables, from only two or three informative regions, which explains their poor prediction performance.

It is worth noting that although FCAM and UVE-SPA each retained fewer than 20 variables, their distributions were quite different, as shown in Fig. 2d and Fig. 2f. The distribution of the variables retained by UVE-SPA was more dispersed, and their selection frequencies were all below 50 within the 100 repeats, probably because SPA is designed to select variables with minimum collinearity. In contrast, for FCAM the selection frequency exceeded 80 within the 100 repeats for most informative variables, indicating that the stability of FCAM is superior to that of UVE-SPA. Owing to the good stability of FCAM, the selection frequencies of most informative variables obtained by UVE-FCAM exceeded 60 within the 100 repeats, illustrating that the stability of UVE-FCAM was better than that of UVE-SPA. For example, 5026 cm-1 is an informative variable, yet it was selected by UVE-SPA only 6 times within the 100 repeats. The stability of UVE-SPA was clearly unsatisfactory compared with the other methods.

5. Conclusion

The UVE-FCAM method was proposed for spectral variable selection in this study, and its performance was evaluated on two NIR datasets. The results show that UVE-FCAM selects far fewer variables than UVE alone with comparable or even slightly better prediction performance on both datasets, indicating that FCAM is a useful variable refining method after UVE. Meanwhile, since most uninformative variables have already been eliminated by UVE, the computational burden of FCAM is reduced significantly compared with FCAM alone. In addition, the results show that UVE-FCAM outperforms UVE-SPA in terms of both prediction performance and stability. Therefore, UVE-FCAM is a good alternative method for selecting a more refined variable set, and FCAM has the potential to be used as an effective variable refinement step after other variable selection methods without loss of prediction performance.

Declaration of Interest Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors gratefully acknowledge the financial support provided by the Open Fund of the State Key Laboratory of Water Resource Protection and Utilization in Coal Mining (Grant No. SHJT-17-42.17) and the Fundamental Research Funds for the Central Universities of China (No. 3142017100).

References:

[1] M.G. Nespeca, W.D. Pavini, J.E. Oliveira, Multivariate filters combined with interval partial least square method: A strategy for optimizing PLS models developed with near infrared data of multicomponent solutions. Vib. Spectrosc. 102 (2019) 97-102.
[2] J.H. Jiang, R.J. Berry, H.W. Siesler, Y. Ozaki, Wavelength interval selection in multicomponent spectral analysis by moving window partial least-squares regression with applications to mid-infrared and near-infrared spectroscopic data. Anal. Chem. 74 (2002) 3555-3565.
[3] J.M. Chen, Z.W. Yin, Y. Tang, T. Pan, Vis-NIR spectroscopy with moving-window PLS method applied to rapid analysis of whole blood viscosity. Anal. Bioanal. Chem. 409 (2017) 2737-2745.
[4] V. Centner, D.L. Massart, O.E. Noord, et al., Elimination of uninformative variables for multivariate calibration. Anal. Chem. 68 (1996) 3851-3858.
[5] Q.J. Han, H.L. Wu, C.B. Cai, L. Xu, R.Q. Yu, An ensemble of Monte Carlo uninformative variable elimination for wavelength selection. Anal. Chim. Acta 612 (2008) 121-125.
[6] J.T. Rocha, L. Oliveira, J.C. Dias, et al., Sulfur determination in Brazilian petroleum fractions by mid-infrared and near-infrared spectroscopy and partial least squares associated with variable selection methods. Energ. Fuel 30 (2016) 698-705.
[7] A.A. Gomes, R.H. Galvão, M.U. Araújo, et al., The successive projections algorithm for interval selection in PLS. Microchem. J. 110 (2013) 202-208.
[8] T. Mizutani, M. Tanaka, Efficient preconditioning for noisy separable nonnegative matrix factorization problems by successive projection based low-rank approximations. Mach. Learn. 107 (2018) 643-673.
[9] K.Y. Zheng, Q.Q. Li, J.J. Wang, et al., Stability competitive adaptive reweighted sampling (SCARS) and its applications to multivariate calibration of NIR spectra. Chemometr. Intell. Lab. 112 (2012) 48-54.
[10] M.W. Assis, D.O. Fusco, R.C. Costa, et al., PLS, iPLS, GA-PLS models for soluble solids content, pH and acidity determination in intact dovyalis fruit using near-infrared spectroscopy. J. Sci. Food Agr. 98 (2018) 5750-5755.
[11] C.S. Miaw, C. Assis, A.R. Silva, et al., Determination of main fruits in adulterated nectars by ATR-FTIR spectroscopy combined with multivariate calibration and variable selection methods. Food Chem. 254 (2018) 272-280.
[12] Y.H. Yun, W.T. Wang, M.L. Tan, et al., A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration. Anal. Chim. Acta 807 (2014) 36-43.
[13] B.C. Deng, P. Ma, C.C. Lin, et al., A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals. Analyst 140 (2015) 1876-1885.
[14] B.C. Deng, Y.H. Yun, D.S. Cao, et al., A bootstrapping soft shrinkage approach for variable selection in chemical modeling. Anal. Chim. Acta 908 (2016) 63-74.
[15] Y.H. Yun, W.T. Wang, B.C. Deng, et al., Using variable combination population analysis for variable selection in multivariate calibration. Anal. Chim. Acta 862 (2015) 14-23.
[16] C.M. Andersen, R. Bro, Variable selection in regression - a tutorial. J. Chemometrics 24 (2010) 728-737.
[17] X.B. Zou, J.W. Zhao, M.J. Povey, M. Holmes, H.P. Mao, Variables selection methods in near-infrared spectroscopy. Anal. Chim. Acta 667 (2010) 14-32.
[18] T. Mehmood, K.H. Liland, L. Snipen, S. Sabo, A review of variable selection methods in partial least squares regression. Chemometr. Intell. Lab. 118 (2012) 62-69.
[19] L.C. Paula, A.S. Soares, T.W. Soares, Epistasis-based FSA: Two versions of a novel approach for variable selection in multivariate calibration. Eng. Appl. Artif. Intel. 81 (2019) 213-222.
[20] S.F. Ye, D. Wang, S. Min, Successive projections algorithm combined with uninformative variable elimination for spectral variable selection. Chemometr. Intell. Lab. 91 (2008) 194-199.
[21] J. Li, C. Zhao, W. Huang, C. Zhang, Y. Peng, A combination algorithm for variable selection to determine soluble solid content and firmness of pears. Anal. Methods 6 (2014) 2170-2180.
[22] G. Tang, Y. Huang, K. Tian, et al., A new spectral variable selection pattern using competitive adaptive reweighted sampling combined with successive projections algorithm. Analyst 139 (2014) 4894-4902.
[23] Q. Ouyang, J. Zhao, Q. Chen, Measurement of non-sugar solids content in Chinese rice wine using near infrared spectroscopy combined with an efficient characteristic variables selection algorithm. Spectrochim. Acta A 151 (2015) 280-285.
[24] M. Barycki, A. Sosnowska, K. Jagiello, T. Puzyn, Multi-objective genetic algorithm (MOGA) as a feature selecting strategy in the development of ionic liquids' quantitative toxicity-toxicity relationship models. J. Chem. Inf. Model. 58 (2018) 2467-2476.
[25] A.S. Saad, A.M. AlAlamein, M.M. Galal, Traditional versus advanced chemometric models for the impurity profiling of paracetamol and chlorzoxazone: Application to pure and pharmaceutical dosage forms. Spectrochim. Acta A 205 (2018) 376-380.
[26] G. Tang, X. Song, J. Hu, H. Yan, K. Qiu, Characterization of a pesticide formulation by medium wave near-infrared spectroscopy with uninformative variable elimination and successive projections algorithm. Anal. Lett. 47 (2014) 2570-2579.
[27] J. Elder, The apparent paradox of complexity in ensemble modeling. Handbook of Statistical Analysis and Data Mining Applications (Second Edition), (2018) 705-718.
[28] R.M. Balabin, S.V. Smirnov, Variable selection in near-infrared spectroscopy: Benchmarking of feature selection methods on biodiesel data. Anal. Chim. Acta 692 (2011) 63-72.
[29] H. Yang, B. Kuang, A.M. Mouazen, Quantitative analysis of soil nitrogen and carbon at a farm scale using visible and near infrared spectroscopy coupled with wavelength reduction. Eur. J. Soil Sci. 63 (2011) 410-420.
[30] N. Omidikia, M. Kompany-Zareh, Uninformative variable elimination assisted by Gram-Schmidt orthogonalization/successive projection algorithm for descriptor selection in QSAR. Chemometr. Intell. Lab. 128 (2013) 56-65.
[31] Z. Li, J. Wang, Y. Xiong, Z. Li, S. Feng, The determination of the fatty acid content of sea buckthorn seed oil using near infrared spectroscopy and variable selection methods for multivariate calibration. Vib. Spectrosc. 84 (2016) 24-29.
[32] E. Giese, O. Winkelmann, S. Rohn, J. Fritsche, Determining quality parameters of fish oils by means of 1H nuclear magnetic resonance, mid-infrared, and near-infrared spectroscopy in combination with multivariate statistics. Food Res. Int. 106 (2018) 116-128.
[33] J.P. Andries, Y.V. Heyden, L.M. Buydens, Predictive-property-ranked variable reduction with final complexity adapted models in partial least squares modeling for multiple responses. Anal. Chem. 85 (2013) 5444-5453.
[34] J.P. Andries, Y.V. Heyden, L.M. Buydens, Predictive-property-ranked variable reduction in partial least squares modelling with final complexity adapted models: Comparison of properties for ranking. Anal. Chim. Acta 760 (2013) 34-45.
[35] Q.S. Xu, Y.Z. Liang, Monte Carlo cross validation. Chemometr. Intell. Lab. 56 (2001) 1-11.
[36] R.W. Kennard, L.A. Stone, Computer aided design of experiments. Technometrics 11 (1969) 137-148.
[37] K. Qiu, X. Song, G. Tang, S.G. Min, Determination of fipronil in acetamiprid formulation by attenuated total reflectance-mid-infrared spectroscopy combined with partial least squares regression. Anal. Lett. 46 (2013) 2388-2399.
[38] G.H. Fu, Q.S. Xu, H.D. Li, D.S. Cao, Y.Z. Liang, Elastic net grouping variable selection combined with partial least squares regression (EN-PLSR) for the analysis of strongly multi-collinear spectroscopic data. Appl. Spectrosc. 65 (2011) 402-408.

Figure captions:

Figure 1 Spectra of corn (a) and selected frequencies of variables obtained by different methods on corn samples within 100 repeats: (b) UVE; (c) FCAM; (d) UVE-FCAM and (e) UVE-SPA.

Figure 2 Spectra of pesticide formulation (a), fipronil technical (b) and selected frequencies of variables obtained by different methods on pesticide samples within 100 repeats: (c) UVE; (d) FCAM; (e) UVE-FCAM and (f) UVE-SPA.

Tables

Table 1 Results of different methods on the corn dataset.

Methods          nVAR a           nLVs b      RMSECV          RMSEC           RMSEP
PLS              700              14          0.114           0.062           0.100
UVE-PLS          86.2±15.7 c      11.8±0.6    0.036±0.010     0.018±0.007     0.032±0.010
FCAM-PLS         25.8±17.1        8.6±2.1     0.060±0.014     0.046±0.012     0.071±0.010
UVE-FCAM-PLS     22.0±8.3         9.8±1.7     0.019±0.010     0.013±0.009     0.031±0.015
UVE-SPA-PLS      17.0±3.2         11.0±1.3    0.034±0.009     0.022±0.006     0.046±0.011

a Number of variables; b Number of latent variables; c Statistical results given as mean value ± standard deviation from 100 repeats.

Table 2 Results of different methods on the pesticide dataset.

Methods          nVAR a            nLVs b      RMSECV          RMSEC           RMSEP
PLS              6001              14          0.098           0.032           0.103
UVE-PLS          1287.3±181.2 c    14.0±0.9    0.072±0.003     0.034±0.006     0.091±0.006
FCAM-PLS         19.1±4.4          9.2±1.0     0.071±0.009     0.059±0.013     0.140±0.022
UVE-FCAM-PLS     79.3±29.7         11.9±0.5    0.045±0.007     0.029±0.005     0.090±0.009
UVE-SPA-PLS      14.0±3.2          12.4±2.0    0.068±0.009     0.049±0.008     0.126±0.026

a Number of variables; b Number of latent variables; c Statistical results given as mean value ± standard deviation from 100 repeats.