Comparison of several variable selection methods for quantitative analysis and monitoring of the Yangxinshi tablet process using near-infrared spectroscopy


Infrared Physics and Technology 105 (2020) 103188

Contents lists available at ScienceDirect

Infrared Physics & Technology journal homepage: www.elsevier.com/locate/infrared




Yong Chen, Hui Ma, Qing Zhang, Siyu Zhang, Ming Chen, Yongjiang Wu

College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China

ARTICLE INFO

Keywords: Yangxinshi tablet; Process analysis; Variable selection; Near infrared spectroscopy

ABSTRACT

Variable selection methods can simplify the modelling process and improve the accuracy of models constructed for rapid monitoring using near-infrared (NIR) spectroscopy. In this study, NIR spectroscopy was applied in combination with chemometrics to determine the icariin, salvianolic acid B, and puerarin content, as well as the relative density, during the concentration process of a Yangxinshi tablet (YXST) water extraction solution. Partial least squares with several variable selection algorithms was used for the modelling. Variable selection methods, including synergy interval partial least squares (Si-PLS), a genetic algorithm, competitive adaptive reweighted sampling (CARS), and their combinations, were comparatively applied to calibrate the regression model. The performances of the models obtained were systematically evaluated according to the correlation coefficients of prediction (Rp), the relative standard error of prediction (RSEP), and the residual predictive deviation (RPD). For icariin, salvianolic acid B, and puerarin, the Si-PLS model gave the best results. The Rp, RSEP, and RPD of Si-PLS were 0.9905, 7.84%, and 7.38 for icariin; 0.9781, 9.48%, and 4.80 for salvianolic acid B; and 0.9668, 12.81%, and 3.92 for puerarin, respectively. For the relative density, the CARS-Si-PLS model gave the best prediction results, with Rp, RSEP, and RPD values of 0.9725, 0.47%, and 4.31, respectively. This study demonstrated that variable selection methods can simplify the modelling process and improve the accuracy of the models. In addition, the methods developed in this study are suitable for quality monitoring of the concentration process of YXST water extraction solution, and they provide a reference for real-time monitoring of the concentration processes used in the preparation of other types of traditional Chinese medicine.

1. Introduction

Yangxinshi tablet (YXST) is a traditional Chinese medicine (TCM) composed of 13 medicinal herbs, including Radix Astragali, Radix Codonopsis, Radix Salviae Miltiorrhizae, Radix Puerariae, and Folium Epimedii [1]. At present, YXST is widely used for the clinical treatment of coronary disease, angina pectoris, myocardial infarction, hyperlipidemia, and hyperglycemia [2]. The production process of YXST mainly comprises extraction, concentration, alcohol precipitation, and drying. The concentration process plays an essential role in TCM production; however, it is energy-consuming, and the methods commonly used for its quality control rely heavily on empirical approaches. In addition, quality control is difficult to achieve because key quality indicators are not monitored [3,4]. Therefore, a rapid and accurate method is required to monitor the intermediate quality during the pharmaceutical concentration process.

Near infrared (NIR) spectroscopy is a fast, accurate, and non-destructive analytical technique that has been extensively applied in multiple areas [5]. In most cases, it requires no sample preparation or manipulation with harmful chemicals, solvents, or reagents [6]. These advantages significantly reduce the analysis time and are in line with the trend towards environmental protection. At the same time, NIR spectroscopy can achieve a simultaneous quantitative determination of multiple key quality indicators, and can replace subjective and non-standard empirical methods in the pharmaceutical industry. However, NIR absorption spectra are often complex and heavily overlapped, which hinders the direct quantitative analysis of the samples [7]. In addition, the unrelated and collinear spectral variables used for NIR modelling may severely affect the accuracy and predictive ability of the model [8]. Thus, the development of

Corresponding author. E-mail address: [email protected] (Y. Wu).

https://doi.org/10.1016/j.infrared.2020.103188 Received 28 August 2019; Received in revised form 7 January 2020; Accepted 7 January 2020 Available online 08 January 2020 1350-4495/ © 2020 Elsevier B.V. All rights reserved.


samples were obtained for analysis. The sample solution was centrifuged at 13,000 rpm for 10 min to obtain supernatant.

appropriate variable selection methods to simplify the calibration process and achieve better quantification is urgently needed. To effectively extract the relevant information, variable selection methods can be applied to eliminate variables that are not directly correlated with the property of interest and to remove potential interference. Given the advantages of variable selection, an increasing number of variable selection methods have been proposed [9]. Such methods include continuous interval and separate wavenumber selection approaches [10]. At present, continuous interval selection methods, such as synergy interval partial least squares (Si-PLS) [11–13] and moving window partial least squares [14], and separate wavenumber selection methods, such as uninformative variable elimination [15], competitive adaptive reweighted sampling (CARS) [16,17], genetic algorithms (GAs) [18,19], and random frog [20], are the most commonly used. Both types of variable selection methods can improve the stability and interpretability of the models, although their optimisation effects differ significantly owing to their different principles [21]. Continuous interval selection methods collect effective information by dividing the global spectrum into several equidistant subintervals and evaluating the performance of the local models. When the most efficient subinterval is determined, collinear and irrelevant variables are unavoidably introduced, no matter how small the subset is. Separate wavenumber selection methods measure the importance of each variable individually. However, owing to the high dimensionality of the spectral data, those methods rely on random sampling at the initialisation step, and this randomness makes it difficult to ensure that the best variables are selected. It is therefore worth considering whether methods based on different principles can complement each other when combined.

However, thus far few studies have systematically compared the optimisation achieved by the two types of selection methods and their combination. This study presents a rapid and efficient way to quantitatively monitor the YXST concentration process through NIR spectroscopy. Icariin, salvianolic acid B, and puerarin are the principal bioactive compounds in YXST, and the relative density is an important indicator for the concentration process; thus, these four were selected as key quality indicators for the concentration process of YXST. To improve the efficiency of the model construction, Si-PLS, GA, CARS, and their combinations were used to select the variables for each key indicator, in order to reduce the complexity of the modelling and determine the best method for analysing the spectral data.

2.3. Reference assays

The supernatant collected was filtered through a 0.45 µm syringe filter prior to the HPLC analysis. Quantitative analysis of icariin, salvianolic acid B, and puerarin was carried out on an Agilent 1200 high-performance liquid chromatographic system (Agilent Technologies, USA) equipped with a vacuum degasser, a quaternary pump, an autosampler, a thermostatic column compartment, and a diode array detector. Separations were conducted on an Agilent Eclipse XDB-C18 column (250 mm × 4.6 mm, 5 µm) at 30 °C. The mobile phase consisted of methanol (A) and 0.3% phosphoric acid aqueous solution (B). The gradient elution procedure was as follows: 25% A for minutes 0–15, 25–40% A for minutes 15–17, 40–45% A for minutes 17–30, 45–55% A for minutes 30–31, 55–60% A for minutes 31–43, and 60–100% A for minutes 43–45. The flow rate of the mobile phase was maintained at 1 mL/min and the injection volume was 5 µL. UV detection was applied at the following wavelengths: 250 nm for minutes 0–23, 286 nm for minutes 23–33, and 270 nm for minutes 33–45. After each run, a 5 min equilibration was required. The relative density of the samples was determined using the formula ρ = m/V, where V is the volume of the sample and m is the corresponding mass. In this study, 1 mL of sample was accurately weighed at 92 °C to determine the relative density.

2.4. NIR instrument and data acquisition

NIR spectra of the YXST water extraction solution were obtained using an Antaris II Fourier-transform near infrared analyser from Thermo Fisher Scientific, Inc. (Madison, WI, USA) equipped with an indium gallium arsenide detector. Each spectrum was scanned with a 5 mm path length in absorbance mode. In addition, each spectrum was an average of 32 scans within the range of 10,000–4000 cm−1 with an 8 cm−1 resolution. The average spectrum from three measurements was used as the final spectrum of each sample to quantify the four key indicators. The final spectral data were measured at intervals of 3.857 cm−1, which resulted in 1557 variables.
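The gradient and detection programme given in Section 2.3 can be written down as a small lookup, which makes it easy to check the mobile-phase composition at any time point. The sketch below is only illustrative: it assumes linear ramps between the stated breakpoints and assigns a boundary time to the later detection window, neither of which is stated in the text, and the function names are hypothetical.

```python
from bisect import bisect_right

# Breakpoints (time in min, % methanol A) taken from the gradient in Section 2.3.
GRADIENT = [(0, 25), (15, 25), (17, 40), (30, 45), (31, 55), (43, 60), (45, 100)]
# Detection windows (start min, end min, wavelength in nm) from Section 2.3.
DETECTION = [(0, 23, 250), (23, 33, 286), (33, 45, 270)]


def percent_methanol(t_min):
    """Percentage of mobile phase A at t_min, assuming linear ramps (assumption)."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-45 min run")
    times = [t for t, _ in GRADIENT]
    # Index of the breakpoint that closes the segment containing t_min.
    i = min(bisect_right(times, t_min), len(GRADIENT) - 1)
    (t0, a0), (t1, a1) = GRADIENT[i - 1], GRADIENT[i]
    return a0 + (a1 - a0) * (t_min - t0) / (t1 - t0)


def detection_wavelength(t_min):
    """UV wavelength (nm) at t_min; a boundary time goes to the later window (assumption)."""
    for start, end, nm in reversed(DETECTION):
        if start <= t_min <= end:
            return nm
    raise ValueError("time outside the 0-45 min run")


if __name__ == "__main__":
    for t in (0, 16, 25, 37, 45):
        print(t, percent_methanol(t), detection_wavelength(t))
```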

2. Material and methods

2.1. Chemicals and reagents

Crude materials (Radix Astragali, Radix Codonopsis, Radix Salviae Miltiorrhizae, Radix Puerariae, Folium Epimedii, Radix Rehmanniae, Radix Angelicae Sinensis, Ganoderma Lucidum, and Radix Glycyrrhiza) were supplied by Qingdao Growful Pharmaceutical Co., Ltd. Standard icariin (≥98%), salvianolic acid B (≥98%), and puerarin (99.5%) were purchased from Chengdu Must Bio-Technology Co., Ltd. (Chengdu, China). HPLC-grade methanol was obtained from Merck (Darmstadt, Germany). Deionised water was purified using a Milli-Q purification system (Millipore, Bedford, MA, USA). All other reagents were of analytical grade.

2.5. Spectral pre-processing and division of samples

Different pre-processing methods, including the first derivative (1st D), the second derivative, Savitzky–Golay (SG) smoothing, and Norris smoothing, were applied to eliminate spectral baseline drift, noise, and other interference factors. To avoid bias in the sample selection and ensure a uniform distribution, the samples were then divided into calibration (162 samples) and validation (53 samples) sets using the sample set partitioning based on joint X-Y distance (SPXY) algorithm [22].
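The SPXY partitioning named in Section 2.5 can be sketched as follows. This is a minimal pure-Python illustration of the general idea (a Kennard-Stone style selection on the sum of the X and y distances, each scaled by its maximum), not the implementation used in the study; the function names `spxy_split` and `_pairwise` are hypothetical, and the spectra are assumed to be given as lists of floats.

```python
import math


def _pairwise(rows):
    """Euclidean distance matrix between rows (each row is a list of floats)."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(rows[i], rows[j])
    return d


def spxy_split(X, y, n_cal):
    """Return (calibration indices, validation indices); assumes 2 <= n_cal <= len(X)."""
    n = len(X)
    dx = _pairwise(X)
    dy = _pairwise([[v] for v in y])
    max_dx = max(max(r) for r in dx) or 1.0
    max_dy = max(max(r) for r in dy) or 1.0
    # Joint distance: spectral and reference-value distances, each scaled by its maximum.
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)] for i in range(n)]
    # Start from the two samples that are farthest apart.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i0, j0 = max(pairs, key=lambda p: d[p[0]][p[1]])
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    while len(selected) < n_cal:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(sorted(remaining), key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, sorted(remaining)
```

For the data set described here the call would be `spxy_split(spectra, reference_values, 162)`, leaving 53 samples for validation. With 215 spectra of 1557 variables the pure-Python distance loop is slow but workable; an array library would normally be used instead.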

2.6. Wavenumber variable selection methods

NIR spectroscopy combined with variable selection methods was applied to establish partial least squares (PLS) models for monitoring the four quality indicators mentioned above during the concentration process of YXST. In this study, six model types (Global-PLS, Si-PLS, GA-PLS, CARS-PLS, GA-Si-PLS, and CARS-Si-PLS) were compared to select the optimised variables for modelling. Global-PLS used the entire spectral region to establish the PLS models, whereas Si-PLS, GA-PLS, and CARS-PLS used the variables selected by Si-PLS, GA, and CARS, respectively. For GA-Si-PLS and CARS-Si-PLS, Si-PLS was first used to select efficient spectral intervals, and GA or CARS was then used to select the most appropriate variables from these intervals. The performances of the final models were evaluated based on the following indexes: correlation coefficients of calibration

2.2. Concentration process sample preparation

Seven batches of YXST water extraction solution were prepared from suitable amounts of Radix Astragali, Radix Codonopsis, Radix Salviae Miltiorrhizae, and Radix Puerariae. Each batch of the above solution was concentrated in a 5 L three-necked flask at a heating temperature of 95 °C. Taking the actual concentration process into account, samples were collected at 10 min intervals during the first 3 h, and at 5 min intervals during the remaining time of the concentration process. A total of 215


(Rc), correlation coefficients of prediction (Rp), root-mean-square error of calibration (RMSEC), root-mean-square error of prediction (RMSEP), relative standard error of calibration (RSEC), relative standard error of prediction (RSEP), residual predictive deviation (RPD), and root-mean-square error of cross-validation (RMSECV).

Si-PLS was proposed by Nørgaard in 2000, and many researchers have shown interest in it owing to its excellent variable selection [23,24]. Each local model is evaluated based on the RMSECV value from leave-one-out cross-validation, and the regions corresponding to the smallest RMSECV value are selected.

GA is an adaptive optimisation algorithm based on the principles of heredity and natural selection, as described by Leardi [25]. The main operations of GA are selection, crossover, and mutation [26]. When GA is applied to variable selection, termination is determined by the number of iterations; in this study, 100 iterations were applied and the other parameters were set to their default values. GA is also a stochastic algorithm, the results of which vary between runs. Therefore, GA was run five times, and the variables from the run with the lowest RMSECV value were used to build the final PLS model.

CARS is a variable selection algorithm based on the principle of "survival of the fittest," and can eliminate uninformative variables and reduce the influence of collinear variables on the model when selecting the critical variables [27,28].

2.7. Software

Data acquisition, spectral pre-treatments, and chemometric analysis were conducted using the RESULT™ software suite (version 3.0, Thermo Nicolet, USA), TQ Analyst (version 8.0, Thermo Nicolet, USA), and MATLAB (version 8.1, MathWorks, Natick, MA, USA).

3. Results and discussion

3.1. Reference data analysis

The concentration distribution of the four indicators is shown in Fig. 1. Owing to human manipulation, the relative density was distributed uniformly at 1.1568 g/mL. Icariin, salvianolic acid B, and puerarin were fully separated using HPLC and quantified from the system (Fig. S1). The content distributions of the three marker components were within the ranges of 0.1106–2.7675, 0.2839–7.7530, and 0.0940–0.4930 mg/mL, respectively. The trends of the three marker components were similar, and some of the maximum concentration points were presumably due to fluctuations in the quality of the raw materials.

3.2. Raw spectral data analysis

The absorbance spectra of the 215 YXST samples were directly measured using FT-NIR spectroscopy. Fig. 2a shows the raw NIR spectra of the YXST samples at wavenumbers of 10,000–4000 cm−1. As the figure indicates, the spectra exhibited intense absorption bands at 6944 and 5155 cm−1, which were assigned to the first O–H overtone and the combination of stretching and deformation of the O–H group in water, respectively [29]. These two bands are typical water peaks of aqueous extraction solutions in TCM preparations, and they mask any other bands present in these spectral regions. In addition, icariin, salvianolic acid B, and puerarin are all composed of only three chemical elements (C, H, and O); that is, these three compounds are rich in C–H and O–H bonds, which contribute to the appearance of the NIR spectra and also lead to a high degree of band overlapping. Compared with the second derivative and Norris smoothing, NIR spectra pre-processed using the 1st D/SG (Fig. 2b) gave a better model performance. Therefore, the initial NIR spectra were treated using the 1st D and SG smoothing prior to further study. Smoothing reduces the effects of noise; however, the spectral data are still extremely complex, which can influence the stability and precision of the models. Therefore, it was necessary to apply variable selection algorithms to select effective regions for modelling. Based on the SPXY method, the 215 YXST samples were divided into a calibration set of 162 samples and a prediction set of 53 samples.

3.3. Si-PLS models

The global spectrum of 4,000–10,000 cm−1 was divided into 10, 11, 12,…, 25 intervals combined with 2, 3, or 4 subintervals using Si-PLS, as shown in Table 1 (taking icariin as an example). The optimum

Fig. 1. Concentration distribution of icariin, salvianolic acid B, puerarin, and the relative density.


Fig. 2. (a) Raw NIR spectra and (b) spectra pre-processed using 1st D/SG of the concentration process of YXST water extraction solution.
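The first-derivative Savitzky-Golay pre-processing shown in Fig. 2b can be illustrated with a small sketch. The window width and polynomial order used in the study are not reported, so the 5-point, second-order filter below is purely an assumption chosen because its weights are well known; the function name is hypothetical, and the two points at each edge of the spectrum are simply dropped.

```python
# Weights of the 5-point, second-order Savitzky-Golay first-derivative filter.
# The window width is an assumption; the study does not state the one it used.
SG_D1_5PT = (-2.0, -1.0, 0.0, 1.0, 2.0)
SG_D1_NORM = 10.0


def sg_first_derivative(spectrum, spacing=1.0):
    """First derivative of an evenly spaced spectrum; edge points are dropped."""
    half = len(SG_D1_5PT) // 2
    if len(spectrum) < len(SG_D1_5PT):
        raise ValueError("spectrum shorter than the filter window")
    derivative = []
    for i in range(half, len(spectrum) - half):
        window = spectrum[i - half:i + half + 1]
        weighted = sum(w * v for w, v in zip(SG_D1_5PT, window))
        derivative.append(weighted / (SG_D1_NORM * spacing))
    return derivative


if __name__ == "__main__":
    # A quadratic test signal: the filter recovers the exact slope 2x at x = 2 and x = 3.
    print(sg_first_derivative([x * x for x in range(6)]))
```

For real spectra, `spacing` would be the data interval (3.857 cm−1 here), and a library routine such as `scipy.signal.savgol_filter` with `deriv=1` would normally be preferred because it also handles the edges.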

combination of subintervals and PLS components was chosen according to the lowest leave-one-out RMSECV. The results in Table 1 indicate that the Si-PLS model gave a lower RMSECV value (0.0488 mg/mL) than the Global-PLS model, which confirms its superior prediction accuracy. Fig. 3 shows that the best selected intervals, 7, 12, 18, and 20, consisted of spectral variables spanning the ranges of 5805–6102, 7309–7606, 9114–9407, and 9708–10,001 cm−1, respectively. For salvianolic acid B (Fig. S2) and puerarin (Fig. S3), the best subinterval combinations were 5890–6260, 8134–8505, 8883–9253, and 9257–9627 cm−1, and 5701–5936, 8809–9045, and 9527–9762 cm−1, respectively, which are similar to those of icariin, with all avoiding the water peak regions. The best subinterval combinations for the relative density corresponded to 4250–4497, 4501–4748, 5504–5751, and 5755–6001 cm−1. The similarity of the intervals selected by Si-PLS for the three target compounds is consistent with their similar structures.

Table 1. Calibration results of the Si-PLS model for icariin with different spectral interval selections.

No. of intervals   PLS components   Selected intervals   RMSECV
10                 10               [4 8 9 10]           0.0560
11                 8                [4 9 10 11]          0.0551
12                 9                [4 10 11 12]         0.0615
13                 10               [5 11 12 13]         0.0511
14                 8                [5 11 13 14]         0.0526
15                 9                [5 12 13 15]         0.0605
16                 10               [6 13 14 15]         0.0547
17                 8                [6 14 15 16]         0.0501
18                 8                [6 11 16 18]         0.0562
19                 8                [7 16 17 18]         0.0537
20                 9                [7 12 18 20]         0.0488
21                 8                [7 17 19 21]         0.0557
22                 8                [7 8 18 22]          0.0564
23                 9                [8 18 22 23]         0.0503
24                 9                [8 14 18 24]         0.0550
25                 7                [8 9 21 25]          0.0562

3.4. GA-PLS and CARS-PLS models

Table 2 shows the GA-PLS results for icariin, salvianolic acid B, puerarin, and the relative density. The number of selected variables, Rp, RSEP, and RPD of the GA-PLS models were 159, 0.9715, 13.49%, and 4.33 for icariin; 178, 0.9247, 17.33%, and 2.63 for salvianolic acid B; 173, 0.9340, 17.91%, and 2.83 for puerarin; and 196, 0.9646, 0.53%, and 3.93 for the relative density, respectively. CARS-PLS, which removes unimportant variables in a stepwise manner, was also applied for variable selection. The Rp, RSEP, and RPD values for the CARS-PLS models were 0.9559, 16.73%, and 3.40 for icariin; 0.8408, 24.64%, and 1.88 for salvianolic acid B; 0.9194, 19.72%, and 2.54 for puerarin; and 0.9417, 0.67%, and 3.06 for the relative density, respectively. The raw spectra contain 1557 variables, and after GA or CARS selection, fewer than 200 variables were retained for further study. Because they treat each wavenumber as a separate variable, these methods are more efficient than Si-PLS in reducing the number of variables.

3.5. GA-Si-PLS and CARS-Si-PLS models

In addition to direct variable selection, GA and CARS were also used in this study to optimise the variables after selection using Si-PLS. Taking icariin as an example, Fig. 4a shows the relationship between the number of variables and the RMSECV values in the calibration set of the GA-Si-PLS model. The curve descends quickly at first and then levels off. The lowest RMSECV value occurs at point 60, indicating that 60 is the optimum number of variables. Fig. 4b shows a histogram of the frequency of selection of each variable after 100 runs

Fig. 3. Optimal spectral intervals selected using Si-PLS with nine PLS components of icariin.
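The interval search behind Table 1 and Fig. 3 can be sketched as an exhaustive enumeration of interval combinations. The sketch below is an illustration under stated assumptions rather than the software used in the study: the spectrum is split into near-equal intervals with any remainder points given to the leading intervals (an assumption about the splitting rule), and `score` is a hypothetical callable that would return the leave-one-out RMSECV of a PLS model built on the given columns.

```python
from itertools import combinations


def split_intervals(n_vars, n_intervals):
    """Split variable indices 0..n_vars-1 into near-equal contiguous intervals."""
    base, extra = divmod(n_vars, n_intervals)
    intervals, start = [], 0
    for i in range(n_intervals):
        size = base + (1 if i < extra else 0)  # remainder goes to the leading intervals
        intervals.append(list(range(start, start + size)))
        start += size
    return intervals


def best_interval_combination(n_vars, n_intervals, score, max_combined=4):
    """Try every combination of 2..max_combined intervals; return (lowest score, combo)."""
    intervals = split_intervals(n_vars, n_intervals)
    best = None
    for k in range(2, max_combined + 1):
        for combo in combinations(range(n_intervals), k):
            columns = [c for i in combo for c in intervals[i]]
            value = score(columns)  # hypothetical: RMSECV of a PLS model on these columns
            if best is None or value < best[0]:
                best = (value, combo)
    return best


if __name__ == "__main__":
    intervals = split_intervals(1557, 20)
    # Intervals 7, 12, 18 and 20 (1-based), the icariin optimum in Table 1.
    print(sum(len(intervals[i - 1]) for i in (7, 12, 18, 20)))  # prints 310
```

Under this splitting rule the four intervals selected for icariin contain 310 variables, which agrees with the Si-PLS variable count in Table 2. For 20 intervals the search covers 6175 combinations, each of which would require a full cross-validation in practice.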


Table 2. Comparison of results based on different variable selection methods. LVs, Variables, Rc, RMSEC, RMSECV, RSEC, and the first Bias column refer to calibration; Rp, RMSEP, RSEP, the second Bias column, and RPD refer to prediction.

Model         LVs  Variables  Rc      RMSEC   RMSECV  RSEC (%)  Bias (cal.)       Rp      RMSEP   RSEP (%)  Bias (pred.)     RPD

Icariin
Global-PLS    8    1557       0.9977  0.0377  0.0742  4.07      6.23 × 10^−16     0.9696  0.0654  13.93     −8.30 × 10^−3    4.12
Si-PLS        9    310        0.9984  0.0318  0.0488  3.43      3.16 × 10^−16     0.9905  0.0368  7.84      −6.80 × 10^−3    7.38
GA-PLS        7    159        0.9937  0.0626  0.0789  6.77      8.90 × 10^−16     0.9715  0.0633  13.49     −1.42 × 10^−2    4.33
CARS-PLS      10   113        0.9992  0.0229  0.0389  2.47      −4.98 × 10^−16    0.9559  0.0785  16.73     2.23 × 10^−4     3.40
GA-Si-PLS     10   60         0.9980  0.0357  0.0476  3.86      −2.23 × 10^−15    0.9889  0.0397  8.46      −5.20 × 10^−3    6.79
CARS-Si-PLS   9    61         0.9986  0.0298  0.0388  3.22      −1.24 × 10^−15    0.9881  0.0411  8.76      −7.50 × 10^−3    6.61

Salvianolic acid B
Global-PLS    9    1557       0.9974  0.1105  0.2183  4.59      3.28 × 10^−16     0.9339  0.1486  16.27     1.06 × 10^−2     2.80
Si-PLS        9    388        0.9992  0.0628  0.1117  2.61      2.36 × 10^−15     0.9781  0.0866  9.48      −1.83 × 10^−4    4.80
GA-PLS        9    178        0.9928  0.1831  0.2354  7.61      3.76 × 10^−15     0.9247  0.1583  17.33     1.58 × 10^−4     2.63
CARS-PLS      9    44         0.9983  0.0887  0.1188  3.68      8.20 × 10^−16     0.8408  0.2250  24.64     3.93 × 10^−2     1.88
GA-Si-PLS     10   70         0.9984  0.0870  0.1129  3.62      1.33 × 10^−15     0.9771  0.0885  9.69      −1.45 × 10^−2    4.76
CARS-Si-PLS   8    64         0.9991  0.0663  0.0875  2.76      6.50 × 10^−15     0.9780  0.0867  9.50      7.73 × 10^−4     4.79

Puerarin
Global-PLS    9    1557       0.9940  0.0507  0.1023  6.66      1.17 × 10^−15     0.9534  0.0479  15.13     1.60 × 10^−3     3.32
Si-PLS        10   248        0.9978  0.0308  0.0516  4.04      4.68 × 10^−15     0.9668  0.0406  12.81     8.17 × 10^−4     3.92
GA-PLS        10   173        0.9859  0.0779  0.1024  10.22     9.32 × 10^−16     0.9340  0.0567  17.91     8.90 × 10^−3     2.83
CARS-PLS      9    92         0.9970  0.0361  0.0562  4.74      −1.69 × 10^−16    0.9194  0.0625  19.72     −2.20 × 10^−3    2.54
GA-Si-PLS     9    70         0.9969  0.0368  0.0471  4.83      −7.03 × 10^−16    0.9499  0.0496  15.67     6.80 × 10^−3     3.23
CARS-Si-PLS   8    50         0.9972  0.0351  0.0431  4.60      1.03 × 10^−15     0.9563  0.0464  14.66     3.10 × 10^−3     3.43

Relative density
Global-PLS    8    1557       0.9971  0.0031  0.0064  0.30      −8.92 × 10^−16    0.9662  0.0052  0.52      −1.10 × 10^−3    3.97
Si-PLS        4    260        0.9972  0.0030  0.0058  0.30      −9.33 × 10^−16    0.9466  0.0065  0.65      2.84 × 10^−4     3.11
GA-PLS        8    196        0.9933  0.0047  0.0060  0.46      −9.15 × 10^−16    0.9646  0.0053  0.53      −1.40 × 10^−3    3.93
CARS-PLS      6    54         0.9978  0.0027  0.0035  0.26      −8.37 × 10^−16    0.9417  0.0068  0.67      −1.60 × 10^−3    3.06
GA-Si-PLS     4    50         0.9937  0.0045  0.0054  0.44      −8.76 × 10^−16    0.9649  0.0053  0.53      −1.00 × 10^−3    3.89
CARS-Si-PLS   2    5          0.9915  0.0053  0.0055  0.51      −7.58 × 10^−16    0.9725  0.0047  0.47      −4.38 × 10^−4    4.31
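The prediction indices reported in Table 2 can be computed from paired reference and predicted values. The paper does not print the formulas, so the sketch below uses definitions that are common in the NIR literature and should be read as assumptions: RMSEP is the root of the mean squared residual, RSEP is the root of the residual sum of squares divided by the sum of squared reference values (as a percentage), RPD is the sample standard deviation of the reference values divided by RMSEP, and the bias is the mean of predicted minus reference. The function name is hypothetical.

```python
import math
import statistics


def prediction_metrics(y_ref, y_pred):
    """RMSEP, RSEP (%), RPD and bias under the assumed definitions described above."""
    if len(y_ref) != len(y_pred) or len(y_ref) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    residuals = [p - r for p, r in zip(y_pred, y_ref)]
    sse = sum(e * e for e in residuals)
    rmsep = math.sqrt(sse / len(y_ref))
    # Relative standard error: residual sum of squares relative to the squared references.
    rsep = 100.0 * math.sqrt(sse / sum(r * r for r in y_ref))
    # RPD: spread of the reference values relative to the prediction error.
    rpd = statistics.stdev(y_ref) / rmsep
    bias = sum(residuals) / len(residuals)
    return {"RMSEP": rmsep, "RSEP": rsep, "RPD": rpd, "Bias": bias}


if __name__ == "__main__":
    print(prediction_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

Because RPD divides a fixed spread by RMSEP, any reduction in RMSEP on the same prediction set raises RPD proportionally, which is the pattern seen between the Global-PLS and Si-PLS rows.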

by GA for icariin. The dashed boundary line in Fig. 4b indicates that variables selected with a frequency of ≥7 are retained in the final model. The variable selection by CARS-Si-PLS (taking icariin as an example) is shown in Fig. 5. The curve in Fig. 5a indicates that the number of sampled variables decreased sharply during the first stage of the sampling runs and then remained steady during the second. In Fig. 5b, the RMSECV values descended slowly at first and then increased quickly during the variable selection process. The initial decrease indicates that some uninformative variables were eliminated; because Si-PLS had already been applied, only a few irrelevant variables were left, and the RMSECV values therefore descended slowly. The subsequent increase in the RMSECV value may result from the elimination of useful variables. The change in the regression coefficient of each variable with an increasing number of sampling runs is shown in Fig. 5c. The blue asterisk line marks the sampling run at which the RMSECV value was the lowest. After this line, the RMSECV values begin to increase owing to the removal of key variables. Finally, a total of 61 variables were selected for icariin by CARS-Si-PLS. As shown in Table 2, compared with the simple CARS or GA models, the combination of separate wavenumber selection methods and Si-PLS improved the prediction accuracy of the NIR models. This phenomenon can be attributed to the initialisation of the CARS and GA variables. The initialisation of the CARS variables relies on Monte Carlo sampling, whereas the GA applies random sampling. The application of Si-PLS, which selects those bands that have a strong correlation with the target compound, steers the sampling step away from the interference bands.

3.6. Results of different variable selection methods

Icariin, salvianolic acid B, and puerarin are all flavonoids composed of benzene rings and phenolic hydroxyl groups. The dual combined frequency of the C–C and C–H stretching vibrations on the benzene ring showed a strong absorption peak at 9259 cm−1, and the peak at 6000 cm−1 was the first-order frequency absorption of the stretching vibration. The series of absorption peaks at 10,000 cm−1 was due to the secondary frequency doubling of the phenolic hydroxyl groups. The absorption peaks conforming to these structures were all selected using Si-PLS. The wavenumbers of the selected regions for the relative density were less than 6001 cm−1 (Fig. S4), which may indicate the difference between the physical and chemical properties through NIR

Fig. 4. (a) Plot of RMSECV versus number of variables selected by GA-Si for icariin. The red dashed line indicates that the optimal number of variables is 60. (b) The histogram of the frequency of selections of each variable after 100 runs by GA for icariin. The blue dashed line indicates the boundary. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
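The frequency-based selection summarised in Fig. 4b can be sketched as a simple count over repeated GA runs. This is a hedged illustration only: `runs` is assumed to be a list in which each element holds the variable indices chosen in one run, and the function name is hypothetical. The GA itself is not reproduced here.

```python
from collections import Counter


def select_by_frequency(runs, min_count):
    """Keep variables that were selected in at least min_count of the runs."""
    # set(run) ensures a variable is counted at most once per run.
    counts = Counter(v for run in runs for v in set(run))
    return sorted(v for v, c in counts.items() if c >= min_count)


if __name__ == "__main__":
    toy_runs = [[1, 2, 3], [2, 3], [3, 4]]
    print(select_by_frequency(toy_runs, 2))  # prints [2, 3]
```

With the 100 runs and the threshold of 7 described for icariin, the call would be `select_by_frequency(runs, 7)`.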


Fig. 5. CARS variable selection on spectra dataset (a) plot of number of sampled variables versus number of sampling runs for icariin; (b) RMSECV values versus number of sampling runs; (c) regression coefficients of each variable versus number of sampling runs.
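The sharp initial drop and later plateau in the number of sampled variables (Fig. 5a) follow from the exponentially decreasing function that CARS uses to decide how many variables survive each sampling run. The sketch below shows only this retention schedule, using the form commonly given for CARS, r_i = a·exp(−k·i) with a = (p/2)^(1/(N−1)) and k = ln(p/2)/(N−1), so that all p variables are kept in the first run and two in the last. The number of sampling runs (50) is an assumption, since the paper does not report it, and the function name is hypothetical.

```python
import math


def cars_retention_schedule(n_vars, n_runs):
    """Number of variables retained at sampling runs 1..n_runs (n_runs >= 2)."""
    a = (n_vars / 2.0) ** (1.0 / (n_runs - 1))
    k = math.log(n_vars / 2.0) / (n_runs - 1)
    # Ratio a * exp(-k * i) falls from 1 at the first run to 2 / n_vars at the last.
    return [round(n_vars * a * math.exp(-k * i)) for i in range(1, n_runs + 1)]


if __name__ == "__main__":
    # 310 variables: the icariin intervals kept by Si-PLS; 50 runs is an assumed setting.
    schedule = cars_retention_schedule(310, 50)
    print(schedule[:5], schedule[-5:])
```

In the full algorithm each run also fits a PLS model on a Monte Carlo subset of samples and keeps the variables with the largest regression-coefficient weights; those steps are omitted here.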

spectroscopy. Meanwhile, it can be inferred that Si-PLS was the best among the six types of variable selection methods for icariin, salvianolic acid B, and puerarin, as indicated in Table 2. When Si-PLS is applied to icariin, the RMSECV is reduced from 0.0742 to 0.0488, the RMSEP is reduced from 0.0654 to 0.0368, and the correlation coefficient of the calibration model increases from 0.9977 to 0.9984, which confirms the superior prediction accuracy of Si-PLS over that of the Global-PLS model. The correlation between the NIR-predicted value and the reference value for icariin is shown in Fig. 6. Compared with Si-PLS, all GA-PLS and CARS-PLS models performed worse, which may be because the wavenumber point selection methods neglect the fact that vibrational and rotational spectra have continuous spectral bands. In a number of studies, GA and CARS were used in combination with continuous interval selection methods. With a simple additional operation, GA-Si-PLS and CARS-Si-PLS can establish a model using only 1/6 to 1/4 of the variables used by Si-PLS. For the relative density, the CARS-Si-PLS model performed best. Only two latent variables (LVs) were used for the modelling, and only 5 wavenumber variables remained, which is 1/311 of the global spectrum. However, the variable selection of GA-Si-PLS and CARS-Si-PLS required two steps, whereas Si-PLS required only one.

Fig. 6. Reference values versus NIR-predicted values of icariin in the calibration and prediction sets of the Si-PLS model.

4. Conclusion

In this paper, six types of variable selection methods were systematically compared for the wavenumber selection of icariin, salvianolic acid B, puerarin, and the relative density during the concentration process of the YXST water extraction solution. Finally, Si-PLS was regarded as the optimal method for the wavenumber selection of icariin, salvianolic acid B, and puerarin, whereas CARS-Si-PLS was considered the best variable selection method for the relative density. According to the model results, the values of RSEP were 7.84%, 9.48%, 12.81%, and 0.47% for icariin, salvianolic acid B, puerarin, and the relative density, respectively. In addition, relative to the Global-PLS models, the RPD value increased from 4.12 to 7.38, 2.80 to 4.80, 3.32 to 3.92, and 3.97 to 4.31, respectively. After variable selection, the modelling process was markedly simplified, and the models obtained are in line with the trend towards environmental protection. The methods proposed in this study can be used to construct accurate models for quality monitoring of the concentration process of the YXST water extraction solution. Furthermore, they provide a reference for the real-time monitoring of other concentration processes in the preparation of traditional Chinese medicine.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors gratefully acknowledge the financial support provided by the National Major Scientific and Technological Special Project for "Significant New Drugs Development" (Grant No. 2013ZX09402203 and No. 2018ZX09201010).

Appendix A. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.infrared.2020.103188.

References

[1] W.J. Lou, K. Yang, M.Q. Zhu, Y.J. Wu, X.S. Liu, Y. Jin, Application of particle swarm optimization-based least square support vector machine in quantitative analysis of extraction solution of Yangxinshi tablet using near infrared spectroscopy, J. Innov. Opt. Heal. Sci. 6 (2014) 1–9.
[2] Y.R. Li, K. Yang, Q.Y. Shi, B.W. Liu, Y. Jin, X.S. Liu, Z.L. Jiang, L.J. Luan, Y.J. Wu, Development of a method using high-performance liquid chromatographic fingerprint and multi-ingredients quantitative analysis for the quality control of Yangxinshi Pian, J. Sep. Sci. 38 (2015) 2989–2994.
[3] L.Y. Tao, B.W. Liu, J. Ye, D. Sun, X.S. Liu, Y. Chen, Y.J. Wu, Characterization of toad skin for traditional Chinese medicine by near-infrared spectroscopy and chemometrics, Anal. Lett. 50 (2017) 1292–1306.
[4] H.Y. Pan, L. Wang, Q. Zhang, L.Y. Tao, G.Q. Chen, J.L. Chen, Y.L. Ding, L.H. Wang, Quality evaluation of ephedrae herba by near infrared spectroscopy, Anal. Lett. 51 (2018) 2849–2859.
[5] H.Z. Jiang, S.C. Yoon, H. Zhuang, W. Wang, Y.F. Li, C.J. Lu, N. Li, Non-destructive assessment of final color and pH attributes of broiler breast fillets using visible and

near-infrared hyperspectral imaging: a preliminary study, Infrared Phys. Techn. 92 (2018) 309–317.
[6] L.Y. Tao, Z.L. Zhong, J.S. Chen, Y.J. Wu, X.S. Liu, Mid-infrared and near-infrared spectroscopy for rapid detection of Gardeniae Fructus by a liquid-liquid extraction process, J. Pharmaceut. Biomed. 145 (2017) 1–9.
[7] L.H. Yin, J.M. Zhou, D.D. Chen, T.T. Han, B.S. Zheng, A. Younis, Q.S. Shao, A review of the application of near-infrared spectroscopy to rare traditional Chinese medicine, Spectrochim. Acta A. 221 (2019) 1386–1425.
[8] M. Lei, X.H. Yu, M. Li, W.X. Zhu, Geographic origin identification of coal using near-infrared spectroscopy combined with improved random forest method, Infrared Phys. Techn. 92 (2018) 177–182.
[9] Y.W. Lin, B.C. Deng, L.L. Wang, Q.S. Xu, L. Liu, Y.Z. Liang, Fisher optimal subspace shrinkage for block variable selection with applications to NIR spectroscopic analysis, Chemometr. Intell. Lab. 159 (2016) 196–204.
[10] Q. Ouyang, J.W. Zhao, W.X. Pan, Q.S. Chen, Real-time monitoring of process parameters in rice wine fermentation by a portable spectral analytical system combined with multivariate analysis, Food Chem. 190 (2016) 135–141.
[11] G.Y. Ding, B.Q. Li, Y.Q. Han, A.N. Liu, J.R. Zhang, J.M. Peng, M. Jiang, Y.Y. Hou, G. Bai, A rapid integrated bioactivity evaluation system based on near-infrared spectroscopy for quality control of Flos Chrysanthemi, J. Pharmaceut. Biomed. 131 (2016) 391–399.
[12] H.L. Ma, J.W. Wang, Y.J. Chen, J.L. Cheng, Z.T. Lai, Rapid authentication of starch adulterations in ultrafine granular powder of Shanyao by near-infrared spectroscopy coupled with chemometric methods, Food Chem. 215 (2017) 108–115.
[13] E.A. Petrakis, M.G. Polissiou, Assessing saffron (Crocus sativus L.) adulteration with plant-derived adulterants by diffuse reflectance infrared Fourier transform spectroscopy coupled with chemometrics, Talanta 162 (2017) 558–566.
[14] M.J. Zhang, S.Z. Zhang, J. Iqbal, Key wavelengths selection from near infrared spectra using Monte Carlo sampling-recursive partial least squares, Chemometr. Intell. Lab. 128 (2013) 17–24.
[15] Z.G. Han, S.G. Cai, X.L. Zhang, Q.F. Qian, Y.Q. Huang, F. Dai, G.P. Zhang, Development of predictive models for total phenolics and free p-coumaric acid contents in barley grain by near-infrared spectroscopy, Food Chem. 227 (2017) 342–348.
[16] Y. Li, J. Zhang, T. Li, H.G. Liu, J.Q. Li, Y.Z. Wang, Geographical traceability of wild Boletus edulis based on data fusion of FT-MIR and ICP-AES coupled with data mining methods (SVM), Spectrochim. Acta A. 177 (2017) 20–27.
[17] L.J. Yao, Y. Tang, Z.W. Yin, T. Pan, J.M. Chen, Repetition rate priority combination method based on equidistant wavelengths screening with application to NIR analysis of serum albumin, Chemometr. Intell. Lab. 162 (2017) 191–196.
[18] A.C.D.O. Neves, A.A.D. Araújo, B.L. Silva, P. Valderrama, P.H. Março, K.M.G.D. Lima, Near infrared spectroscopy and multivariate calibration for simultaneous determination of glucose, triglycerides and high-density lipoprotein in animal plasma, J. Pharmaceut. Biomed. 66 (2012) 252–257.
[19] K.H. Wong, V. Razmovski-Naumovski, K.M. Li, G.Q. Li, K. Chan, Differentiation of Pueraria lobata and Pueraria thomsonii using partial least square discriminant analysis (PLS-DA), J. Pharmaceut. Biomed. 84 (2013) 5–13.
[20] C. Zhang, H. Jiang, F. Liu, Y. He, Application of near-infrared hyperspectral imaging with variable selection methods to determine and visualize caffeine content of coffee beans, Food Bioprocess Tech. 10 (2017) 213–221.
[21] Y. Yang, L. Wang, Y.J. Wu, X.S. Liu, Y. Bi, W. Xiao, Y. Chen, On-line monitoring of extraction process of Flos Lonicerae Japonicae using near infrared spectroscopy combined with synergy interval PLS and genetic algorithm, Spectrochim. Acta A. 182 (2017) 73–80.
[22] R.K.H. Galvão, M.C.U. Araujo, G.E. José, M.J.C. Pontes, E.C. Silva, T.C.B. Saldanha, A method for calibration and validation subset partitioning, Talanta 67 (2005) 736–740.
[23] X.B. Zhan, S.L. Jiang, Y.L. Yang, J. Liang, T.L. Shi, X.W. Li, Ultrasonic characterization of aqueous mixture comprising insoluble and soluble substances with temperature compensation, Chemometr. Intell. Lab. 159 (2016) 12–19.
[24] S. Qi, Q. Ouyang, Q.S. Chen, J.W. Zhao, Real-time monitoring of total polyphenols content in tea using a developed optical sensors system, J. Pharmaceut. Biomed. 97 (2014) 116–122.
[25] R. Leardi, A.L. González, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemometr. Intell. Lab. 41 (1998) 195–207.
[26] D. Jouan-Rimbaud, D. Massart, Genetic algorithms as a tool for wavelength selection in multivariate calibration, Anal. Chem.
67 (1995) 4295–4301. W. Fan, Y. Shan, G.Y. Li, H.Y. Lv, H.D. Li, Y.Z. Liang, Application of competitive adaptive reweighted sampling method to determine effective wavelengths for prediction of total acid of vinegar, Food Anal. Method. 5 (2012) 585–590. H.D. Li, Y.Z. Liang, Q.S. Xu, D.S. Cao, Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration, Anal. Chim. Acta 648 (2009) 77–84. L.J. Ma, Y.F. Peng, Y.L. Pei, J.Q. Zeng, H.R. Shen, J.J. Cao, Y.J. Qiao, Z.S. Wu, Systematic discovery about NIR spectral assignment from chemical structural property to natural chemical compounds, Sci. Rep. 9 (2019) 1–17.