Comparison of PLS algorithms in gasoline and gas oil parameter monitoring with MIR and NIR


Chemometrics and Intelligent Laboratory Systems 78 (2005) 74 – 80 www.elsevier.com/locate/chemolab

Cândida C. Felício a, Lígia P. Brás a, João A. Lopes a, Luís Cabrita b, José C. Menezes a,*

a Centre for Chemical and Biological Engineering, IST, Technical University of Lisbon, Av. Rovisco Pais, P-1049-001, Lisbon, Portugal
b Quality Control, PETROGAL Sines Refinery, P-7521-952 Sines, Portugal

Received 30 March 2004; received in revised form 22 December 2004; accepted 24 December 2004
Available online 25 February 2005

Abstract

The objective of the present work was to compare different partial least squares (PLS) algorithms applied to mid- and near-infrared data. The techniques applied were single PLS, multiblock PLS and serial PLS. The comparison was based on the values of $Q^2_y$ determined by leave-one-out validation, on the mean square error of prediction obtained using 80% of the data as calibration set and 20% as validation set, and on the 95% confidence intervals for this parameter. The regression coefficients of all algorithms were also compared after selecting the number of latent variables. Three parameters were used in this study: flash point in gas oil, and benzene and research octane number in gasoline. Serial PLS gave the best results in all analysed cases, followed by single PLS with either MIR or NIR. Multiblock PLS gave results intermediate between those of the two single-PLS models. However, for the studied parameters, the best calibration model was single PLS, since its results were quite accurate and were achieved in less time.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Gasoline; Gas oil; Partial least squares; Multiblock partial least squares; Mid-infrared spectroscopy; Near-infrared spectroscopy

1. Introduction

Recently, Brás et al. [11] evaluated the potential benefits of combining NIR and MIR spectra for the prediction of protein, moisture, fat and fiber content of soybean flour. These authors compared MIR- and NIR-based partial least squares (PLS) models with those obtained with multiblock PLS, in which both spectral data sets were combined to model the same flour property. The present work applies a similar approach to assess the advantages of combining NIR and MIR spectroscopic methods for the prediction of important parameters of gasoline (RON, benzene) and gas oil (flash point). Furthermore, the modelling performance of the different PLS-based regression techniques was compared: single PLS (PLS), multiblock PLS (MB-PLS) and serial PLS (S-PLS).

A number of parameters of gasoline and gas oil must be analysed before these products can be commercialised, either because of vehicle engine specifications or because of environmental requirements. The gasoline and gas oil samples analysed in the present work were obtained at the most important Portuguese refinery (Petrogal's refinery, GALPENERGIA S.A.). In recent years, directives have been approved concerning the limits of the research octane number (RON) and benzene, while flash point is an important parameter related to engine performance; hence the importance of accurate measurement methods for these parameters.

The octane number is a fuel-performance property of gasoline that indicates the resistance of a motor fuel to knock. This property is determined by measuring a standard knock intensity in a standardized single-cylinder, variable-compression-ratio engine, where the sample's performance is compared to accepted references, as specified in ASTM D2699 Method [1]. As a gasoline additive, benzene increases the octane rating and reduces knocking. Its negative health effects and the possibility of benzene entering the groundwater have led to strict regulation of the benzene content of gasoline. Benzene is determined by gas chromatography, according to the ASTM D3606 Method [2]. The flash point is the temperature at which the gas oil will ignite inside the cylinder. It is important that this parameter is within certain limits: if the temperature is higher than specified, the gas oil will not ignite, and if it is lower, the engine wastes energy. The flash point takes about 20 min to determine by traditional analysis, ASTM D93 [3].

Gasoline and gas oil are clear liquids and are easily measured by vibrational spectroscopic methods, which shorten the analysis time and require smaller sample volumes and cheaper measurement instruments. Hence, vibrational spectroscopic techniques such as near-infrared (NIR) and mid-infrared (MIR) spectroscopy combined with multivariate calibration methods have been extensively used in the analysis of physical and chemical properties of these petrochemical products [4–9]. The MIR region of the electromagnetic spectrum (4000–400 cm⁻¹) is related to transitions between vibrational states of molecules, while overtone and combination bands of molecular vibrations are responsible for the NIR spectrum (12,500–4000 cm⁻¹) [10]. For this reason, the peaks of a MIR spectrum are more specific, sharper and better defined than those of a NIR spectrum.

* Corresponding author. Tel.: +351 218 417 347; fax: +351 218 419 197. E-mail address: [email protected] (J.C. Menezes).
0169-7439/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2004.12.009

2. Theory

2.1. Partial least squares

In partial least squares, the predictor block X and the response vector y are decomposed using a given number of latent variables [12,13], according to Eqs. (1) and (2):

$$X = T P^{T} + E \qquad (1)$$

$$y = T q^{T} + f \qquad (2)$$

T and P are, respectively, the scores and loadings for X, q is the loading vector for y, and E and f are the residuals for X and y, respectively. These residuals are obtained by deflation for each new latent variable r ($E_r = E_{r-1} - t_r p_r^{T}$ and $f_r = f_{r-1} - t_r q_r$).

2.2. Multiblock partial least squares

Multiblock partial least squares regression consists of the application of partial least squares to explanatory variables blocked into meaningful subsets for the prediction of the same parameter [14–16]. In the present case there are two predictor blocks generated by different measurement methods, corresponding to the NIR and MIR spectral data sets. In terms of prediction, the results of MB-PLS and single PLS are the same when the latter is applied to a matrix consisting of both blocks. The advantage of the multiblock approach is the combination of information from different sources or methods and, in particular, the separation of each block's contribution to the description of the response variable. The method employed in this work was proposed by Westerhuis and Coenegracht [16].

Eqs. (1) and (2) also apply to multiblock PLS, but in this case T is the matrix of super-scores. The major difference between the MB-PLS and single PLS algorithms is that MB-PLS determines a matrix of scores, called super-scores (T), from the scores of each block (S), which are computed separately. In the multiblock algorithm, the block-weights (u) and the block-scores (s) are determined for both blocks (MIR and NIR). In Eqs. (3), (4), (6) and (7), the subscript r denotes the latent variable and the superscript b denotes the block. In the first iteration (r = 1), the matrices $E_0^{MIR}$ and $E_0^{NIR}$ are the original $X^{MIR}$ and $X^{NIR}$ spectral matrices, and $f_0$ is the original vector of the quality variable (flash point, benzene or RON).

Eq. (3) describes how the weights for each block (block-weights) are obtained. The matrices containing each block of deflated data ($E^{MIR}$ and $E^{NIR}$) are regressed on the residuals of the response variable, f, yielding the block-weights ($u^{MIR}$ and $u^{NIR}$). The block-scores are obtained by multiplying each block of deflated data by the corresponding block-weights (Eq. (4)).

$$u_r^{b} = \frac{\left(E_{r-1}^{b}\right)^{T} f_{r-1}}{\left\lVert \left(E_{r-1}^{b}\right)^{T} f_{r-1} \right\rVert} \qquad (3)$$

$$s_r^{b} = E_{r-1}^{b} u_r^{b} \qquad (4)$$

The block-scores are then paired according to Eq. (5).

$$S_r = \left[\, s_r^{MIR} \mid s_r^{NIR} \,\right] \qquad (5)$$

Using the block-scores, the super-weights (w) and the super-scores (t) are obtained by regression.

$$w_r = \frac{S_r^{T} f_{r-1}}{\left\lVert S_r^{T} f_{r-1} \right\rVert} \qquad (6)$$

$$t_r = S_r w_r \qquad (7)$$

Once the super-scores are created, the algorithm proceeds as in single PLS. In order to make new predictions ($\hat{y}$), a regression vector (b) can be obtained as in single PLS. New data are paired in order to obtain a global matrix ($X = [X^{MIR} \mid X^{NIR}]$), which can be used to obtain new predictions (Eq. (8)).

$$\hat{y} = X b \qquad (8)$$
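The multiblock iteration of Eqs. (3) to (7) can be illustrated with a short NumPy sketch. This is not the authors' Matlab code: the function names `pls1` and `mbpls`, the synthetic data, and in particular the choice to deflate each block with the super-score t are assumptions made here for illustration. Under that assumption the sketch reproduces the statement above that MB-PLS and single PLS on the paired matrix give the same regression vector.

```python
import numpy as np

def pls1(X, y, n_lv):
    """Single-block PLS1 (NIPALS) on centred data; returns the regression vector b."""
    E, f = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)              # unit-length weight vector
        t = E @ w                           # scores
        tt = t @ t
        p = E.T @ t / tt                    # X-loadings
        qr = (f @ t) / tt                   # y-loading
        W.append(w); P.append(p); q.append(qr)
        E = E - np.outer(t, p)              # deflation of X (Section 2.1)
        f = f - qr * t                      # deflation of y (Section 2.1)
    W, P = np.column_stack(W), np.column_stack(P)
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def mbpls(blocks, y, n_lv):
    """Multiblock PLS following Eqs. (3)-(7).
    Assumption: every block is deflated with the super-score t."""
    E = [Xb.copy() for Xb in blocks]
    f = y.copy()
    W, P, q = [], [], []
    for _ in range(n_lv):
        U = [Eb.T @ f for Eb in E]
        U = [u / np.linalg.norm(u) for u in U]                 # Eq. (3): block-weights
        S = np.column_stack([Eb @ u for Eb, u in zip(E, U)])   # Eqs. (4)-(5): block-scores
        w = S.T @ f
        w /= np.linalg.norm(w)                                 # Eq. (6): super-weights
        t = S @ w                                              # Eq. (7): super-score
        tt = t @ t
        qr = (f @ t) / tt
        # Loadings and effective weights expressed in the paired space [X_MIR | X_NIR]
        P.append(np.concatenate([Eb.T @ t / tt for Eb in E]))
        W.append(np.concatenate([wb * u for wb, u in zip(w, U)]))
        q.append(qr)
        E = [Eb - np.outer(t, Eb.T @ t / tt) for Eb in E]      # super-score deflation
        f = f - qr * t
    W, P = np.column_stack(W), np.column_stack(P)
    return W @ np.linalg.solve(P.T @ W, np.array(q))          # b for Eq. (8)

# Synthetic stand-ins for the two spectral blocks (not the paper's data)
rng = np.random.default_rng(0)
n = 30
X_mir = rng.normal(size=(n, 12))
X_nir = rng.normal(size=(n, 8))
y = X_mir[:, 0] - 0.5 * X_nir[:, 1] + 0.1 * rng.normal(size=n)
X_mir -= X_mir.mean(axis=0)
X_nir -= X_nir.mean(axis=0)
y -= y.mean()

b_mb = mbpls([X_mir, X_nir], y, n_lv=3)
b_pls = pls1(np.hstack([X_mir, X_nir]), y, n_lv=3)
print(np.allclose(b_mb, b_pls))
```

The block-weights and super-weights collected inside the loop are what a multiblock analysis would inspect to separate each block's contribution; the sketch discards them and keeps only b.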


2.3. Serial partial least squares

The difference between serial partial least squares (S-PLS) and multiblock partial least squares (MB-PLS) is that the former treats the blocks in a serial mode instead of in a parallel mode. S-PLS can be written as follows [17]:

$$X_1 = T_1 P_1^{T} + E_1 \qquad (9)$$

$$X_2 = T_2 P_2^{T} + E_2 \qquad (10)$$

$$y = T_1 q_1^{T} + T_2 q_2^{T} + f \qquad (11)$$

$X_i$ is predictor block i; $T_i$, $P_i$ and $E_i$ are, respectively, the scores, the loadings and the residuals for predictor block $X_i$; and f is the residual vector for the response y. While in MB-PLS the predictors are connected by the scores matrix, in S-PLS they are only connected by the response vector y [17]. The regression model can be written as:

$$y = X_1 b_1 + X_2 b_2 + f \qquad (12)$$

where $b_1$ and $b_2$ are the regression coefficients for each predictor block. Unlike MB-PLS, S-PLS allows a different number of latent variables to be used for each block. The first model is calculated using the residuals from the second model, and the second model is calculated using the residuals from the first model. This procedure is repeated iteratively, which makes this method more time-consuming than the other methods applied in this work. The iterative method is as follows:

1. Set $f_2 = y$.
2. Calculate the first PLS model with $X_1$ and $f_2$ (using LV1 latent variables).
3. Calculate the residuals $f_1 = y - T_1 q_1^{T}$.
4. Calculate the second PLS model with $X_2$ and $f_1$ (using LV2 latent variables).
5. Calculate the residuals $f_2 = y - T_2 q_2^{T}$.
6. Return to step 2 until convergence is achieved.

The first block used was the MIR spectra, as this ordering provided the best results.

2.4. Model validation

All models were validated by leave-one-out cross-validation. For each model, the cross-validated explained variance in y ($Q^2_y$), given by Eq. (13), was determined using different numbers of latent variables.

$$Q^2_y = \left[ 1 - \frac{(y - \hat{y})^{T} (y - \hat{y})}{y^{T} y} \right] \cdot 100 \qquad (13)$$
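As an illustration of how Eq. (13) is evaluated under leave-one-out cross-validation, the sketch below recalibrates a model n times, each time predicting the sample that was left out. The helper names `pls1` and `q2_loo`, the synthetic data, and the choice to centre each calibration set with its own means are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def pls1(X, y, n_lv):
    """Compact PLS1 (NIPALS) on centred data; returns the regression vector b."""
    E, f = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        tt = t @ t
        p = E.T @ t / tt
        qr = (f @ t) / tt
        W.append(w); P.append(p); q.append(qr)
        E = E - np.outer(t, p)
        f = f - qr * t
    W, P = np.column_stack(W), np.column_stack(P)
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def q2_loo(X, y, n_lv):
    """Cross-validated explained variance of Eq. (13), in percent."""
    n = len(y)
    y_hat = np.empty(n)
    for i in range(n):
        cal = np.arange(n) != i                       # leave sample i out
        x_mean, y_mean = X[cal].mean(axis=0), y[cal].mean()
        b = pls1(X[cal] - x_mean, y[cal] - y_mean, n_lv)
        y_hat[i] = y_mean + (X[i] - x_mean) @ b       # predict the left-out sample
    y_c = y - y.mean()                                # mean-centred reference values
    res = y - y_hat                                   # unchanged by common centring
    return (1.0 - (res @ res) / (y_c @ y_c)) * 100.0

# Synthetic example (hypothetical data): nearly linear response, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 1.0]) + 0.05 * rng.normal(size=25)

q2_by_lv = {lv: q2_loo(X, y, lv) for lv in range(1, 6)}
for lv, q2 in q2_by_lv.items():
    print(lv, round(q2, 2))
```

Plotting such values against the number of latent variables gives curves of the kind used to choose the model dimension.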

Table 1
Values for RMSEP and corresponding 95% confidence interval limits for the number of latent variables chosen in Fig. 3

Parameter              Statistic                  PLS-MIR   PLS-NIR   MB-PLS   S-PLS
Flash point (LV=10)    Superior limit             6.05      4.85      8.01     5.16
                       RMSEP                      3.38      2.98      3.74     2.95
                       Inferior limit             1.87      1.71      1.98     1.83
Benzene (LV=8)         Superior limit (×10⁻²)     8.47      18.29     8.73     9.69
                       RMSEP (×10⁻²)              6.41      12.80     6.62     8.00
                       Inferior limit (×10⁻²)     4.59      8.69      4.76     6.09
RON (LV=4)             Superior limit             2.34      0.63      2.22     0.84
                       RMSEP                      1.83      0.52      1.76     0.58
                       Inferior limit             1.38      0.42      1.35     0.45

In Eq. (13), y is the vector of mean-centred quality parameter values for each sample and $\hat{y}$ contains the corresponding mean-centred model predictions. Each element of the $\hat{y}$ vector was obtained by projecting the corresponding sample onto a model calibrated with the remaining samples (leave-one-out cross-validation).

A second model validation consisted of the determination of the Root Mean Square Error of Prediction (RMSEP). For each determination, the data were randomly divided into a validation set (20%) and a calibration set (80%). The model was constructed with the calibration set and the predictions for the validation set were determined. This process was repeated 200 times, leading to an average RMSEP. The error vectors were used to determine a mean RMSEP and the corresponding 95% confidence interval (Table 1). Together, $Q^2_y$, RMSEP and the 95% confidence intervals for RMSEP gave a fair idea of each model's robustness.
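The repeated random-partition procedure can be sketched as follows. The text does not say how the 95% confidence interval was derived from the error vectors, so this sketch takes the 2.5th and 97.5th percentiles of the 200 RMSEP values as one possible choice; that choice, the function names, the synthetic data and the ordinary least-squares stand-in model (used only to keep the example short; any PLS variant could be passed in through `fit` and `predict`) are all assumptions.

```python
import numpy as np

def rmsep_random_splits(X, y, fit, predict, n_repeats=200, val_fraction=0.2, seed=0):
    """Mean RMSEP and a percentile-based 95% interval over random 80/20 splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_val = max(1, int(round(val_fraction * n)))
    rmseps = np.empty(n_repeats)
    for k in range(n_repeats):
        perm = rng.permutation(n)
        val, cal = perm[:n_val], perm[n_val:]        # 20% validation, 80% calibration
        model = fit(X[cal], y[cal])
        err = y[val] - predict(model, X[val])
        rmseps[k] = np.sqrt(np.mean(err ** 2))
    lower, upper = np.percentile(rmseps, [2.5, 97.5])  # assumed interval definition
    return rmseps.mean(), lower, upper, rmseps

# Stand-in calibration model: least squares with an intercept (not PLS)
def fit_ols(X_cal, y_cal):
    A = np.column_stack([np.ones(len(y_cal)), X_cal])
    return np.linalg.lstsq(A, y_cal, rcond=None)[0]

def predict_ols(coef, X_new):
    return coef[0] + X_new @ coef[1:]

# Synthetic data (hypothetical)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([0.8, -1.1, 0.3, 0.0, 2.0]) + 0.5 * rng.normal(size=100)

mean_rmsep, lower, upper, rmseps = rmsep_random_splits(X, y, fit_ols, predict_ols)
print(round(mean_rmsep, 3), round(lower, 3), round(upper, 3))
```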

3. Experimental

A total of 249 gasoline samples and 128 gas oil samples were obtained at the Petrogal S.A. Refinery in Sines. Table 2 shows the sample statistics for the considered parameters. The reference values for RON, benzene and flash point were provided by Petrogal S.A. Refinery and were measured using, respectively, the EN ISO 5164, ASTM D3606 and ASTM D93 standard methods.

Table 2
Summary of some statistics for flash point, benzene and RON in the samples analysed

                      Flash point    Benzene          RON
Range                 67–87 °C       0.3–2.3% v/v     89.1–101.1 min
Mean                  77.4 °C        1.0% v/v         96.3 min
Standard deviation    5.1 °C         0.4% v/v         3.1 min
Number of samples     45             133              188

Each sample was scanned with mid- and near-infrared radiation. Spectra in the NIR region were collected by immersion of an adjustable optical path length transflectance probe (Flex C22-Quartz probe from SOVIAS AG) into the samples. The probe was coupled to an ABB Bomem MB-160 spectrometer (InGaAs detector) through an AXIOM fiber-optics interface. Spectra were obtained with a resolution of 16 cm⁻¹, taking 32 scans per spectrum from 9400 to 4500 cm⁻¹, and using an optical path length of 6 mm.

Mid-infrared spectra were obtained on an FTLA 2000 FTIR spectrometer (ABB Bomem), with a DTGS detector and a SiC source, using a ZnSe Attenuated Total Reflection (ATR) accessory (MIRacle, Pike Technologies). The recorded spectra cover the 4000–600 cm⁻¹ region and resulted from 40 scans, using an 8 cm⁻¹ resolution.

In the NIR spectra, the region between 5000 and 4500 cm⁻¹ was removed since it corresponds to a region of saturation of the optical fiber. In the MIR spectra, two regions were eliminated: the absorption of atmospheric CO2, ranging from 2410 to 2390 cm⁻¹, and the region ranging from 4000 to 3300 cm⁻¹, which contained only experimental noise. For all parameters, the only pre-processing applied consisted of dividing the MIR and NIR spectra by their maximum absorbance value, as the magnitude of absorbance in the NIR spectra was much higher than in the MIR spectra, which could otherwise lead to erroneous conclusions. All calculations were performed with Matlab 6.5 software (The MathWorks).
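The region removal and scaling described for the MIR and NIR spectra can be sketched in a few lines of NumPy. The wavenumber grids and random "spectra" below are placeholders, the helper names are hypothetical, and reading "their maximum absorbance value" as the single largest value in each spectral block (rather than the maximum of each individual spectrum) is an assumption of this sketch.

```python
import numpy as np

def drop_regions(wavenumbers, spectra, regions):
    """Remove the columns whose wavenumber falls inside any (low, high) region."""
    keep = np.ones(wavenumbers.shape, dtype=bool)
    for low, high in regions:
        keep &= ~((wavenumbers >= low) & (wavenumbers <= high))
    return wavenumbers[keep], spectra[:, keep]

def scale_by_block_max(spectra):
    """Divide a whole block by its largest absorbance (assumed interpretation)."""
    return spectra / spectra.max()

# Placeholder data: 10 samples on evenly spaced wavenumber grids (cm^-1)
rng = np.random.default_rng(0)
wn_mir = np.arange(600.0, 4001.0, 4.0)
wn_nir = np.arange(4500.0, 9401.0, 8.0)
mir = rng.uniform(0.01, 0.5, size=(10, wn_mir.size))
nir = rng.uniform(0.1, 3.0, size=(10, wn_nir.size))

# Regions named in the text: atmospheric CO2 and noise (MIR), fiber saturation (NIR)
wn_mir_kept, mir_kept = drop_regions(wn_mir, mir, [(2390.0, 2410.0), (3300.0, 4000.0)])
wn_nir_kept, nir_kept = drop_regions(wn_nir, nir, [(4500.0, 5000.0)])

mir_scaled = scale_by_block_max(mir_kept)
nir_scaled = scale_by_block_max(nir_kept)
print(mir_scaled.shape, nir_scaled.shape)
```

After this step both blocks lie on the same 0 to 1 absorbance scale, so neither dominates a combined model purely because of its magnitude.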

4. Results and discussion

Fig. 1 shows the values of $Q^2_y$ for all algorithms tested (PLS with MIR, PLS with NIR, MB-PLS and S-PLS) with different numbers of latent variables. In S-PLS regression, every possible combination of the number of latent variables used for each predictor block model was evaluated, and the best combination was selected. However, for comparison purposes, the $Q^2_y$ values presented for S-PLS at each latent variable entry correspond to the best combination from the set that goes from one latent variable to that number. For instance, the result presented for 2 latent variables corresponds to the best of the 4 possible combinations (1+1, 1+2, 2+1 and 2+2). These combinations are listed in Table 3.

Fig. 1. Comparison of the cross-validated values (determined by leave-one-out cross-validation) of explained y-variance ($Q^2_y$) in the modelling of (A) flash point, (B) benzene and (C) RON, for the different regression methods (PLS (MIR), PLS (NIR), MB-PLS and S-PLS), plotted against the number of latent variables (1 to 15).

Table 3
Number of latent variables used in the first block (MIR) and the second block (NIR) for the construction of S-PLS models (curves represented in Fig. 2)

       Flash point     Benzene         RON
LV     MIR    NIR      MIR    NIR      MIR    NIR
1      1      1        1      1        1      1
2      2      2        2      2        2      2
3      3      2        3      3        3      3
4      3      2        4      2        1      4
5      3      2        5      5        2      5
6      6      2        6      5        6      5
7      7      2        7      7        5      7
8      8      2        8      8        1      8
9      8      2        9      8        3      9
10     10     2        10     8        1      10
11     10     2        11     9        1      11
12     12     1        12     8        1      12
13     13     1        13     8        1      13
14     14     1        13     8        1      14
15     14     1        13     8        1      15

As can be seen in Fig. 1, both flash point and benzene were better modelled with MIR data than with NIR data, unlike RON. For all parameters, the values of $Q^2_y$ achieved by PLS with the best spectroscopic technique were very similar to those achieved by serial PLS, while multiblock PLS appears as an intermediate between PLS with mid- and near-infrared data.

Fig. 2 shows the RMSEP values for each parameter with different numbers of latent variables. For flash point, the best predictions were achieved with single PLS with NIR (Fig. 2A), contrary to what had been concluded from the analysis of $Q^2_y$ (Fig. 1A), which might be due to overfitting. For the other parameters, the prediction errors were consistent with the calibration.

Fig. 3 shows the regression coefficients of the calibration models using 10 latent variables for flash point, 8 for benzene and 4 for RON. The latent variable selection was based on the values of $Q^2_y$ and RMSEP (85% and 2.25 °C for flash point, 96% and 0.064% v/v for benzene, and 97% and 0.52 min for RON).

For flash point, S-PLS does not use information from the NIR matrix, therefore leading to a calibration model based only on MIR data, which is the same as a single PLS applied to such data. This is why, in Fig. 1A, both algorithms show the same $Q^2_y$ (for more than 7 latent variables).

For benzene, S-PLS uses information from both the MIR and NIR matrices and gives better calibration results. However, in terms of predictions, S-PLS gives values as accurate as those of MIR only for more than 9 latent variables. It can be seen from Table 3 that for fewer than 9 latent variables, S-PLS uses the same number of latent variables for MIR and NIR. Since, for this parameter, the information in MIR gives a better model, serial PLS is effectively averaging better and worse information. Only above 9 latent variables does it use more latent variables for MIR than for NIR, achieving the same predictive results.

When modelling RON, S-PLS leads to the same results as single PLS with NIR data, as can be seen in Fig. 3C (null regression coefficients for the MIR matrix) and also in Table 3, where the number of latent variables used for NIR is higher than for MIR. For all parameters, MB-PLS uses information from both matrices, which explains why its results are intermediate between those of single PLS with NIR and with MIR.
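The S-PLS results discussed above rest on the alternating residual fits of Section 2.3 (steps 1 to 6) combined with a search over pairs of latent-variable numbers. A minimal NumPy sketch of both is given below; the function names `pls1` and `spls`, the convergence test on the coefficient change, the synthetic blocks, and the use of the calibration residual sum of squares to compare pairs are assumptions of this sketch (the paper selects combinations on $Q^2_y$).

```python
import itertools
import numpy as np

def pls1(X, y, n_lv):
    """PLS1 (NIPALS) on centred data; returns the regression vector b."""
    E, f = np.array(X, dtype=float), np.array(y, dtype=float)
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        norm = np.linalg.norm(w)
        if norm < 1e-12:                       # nothing left to model
            break
        w /= norm
        t = E @ w
        tt = t @ t
        p = E.T @ t / tt
        qr = (f @ t) / tt
        W.append(w); P.append(p); q.append(qr)
        E -= np.outer(t, p)
        f -= qr * t
    if not W:
        return np.zeros(X.shape[1])
    W, P = np.column_stack(W), np.column_stack(P)
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def spls(X1, X2, y, lv1, lv2, tol=1e-10, max_iter=1000):
    """Serial PLS: each block is fitted to the residual left by the other."""
    b1, b2 = np.zeros(X1.shape[1]), np.zeros(X2.shape[1])
    f2 = y.copy()                              # step 1
    for _ in range(max_iter):
        b1_new = pls1(X1, f2, lv1)             # step 2
        f1 = y - X1 @ b1_new                   # step 3 (X1 b1 equals T1 q1^T)
        b2_new = pls1(X2, f1, lv2)             # step 4
        f2 = y - X2 @ b2_new                   # step 5 (X2 b2 equals T2 q2^T)
        change = max(np.abs(b1_new - b1).max(), np.abs(b2_new - b2).max())
        b1, b2 = b1_new, b2_new
        if change < tol:                       # step 6 (assumed stopping rule)
            break
    return b1, b2

# Synthetic centred blocks standing in for MIR (X1) and NIR (X2)
rng = np.random.default_rng(1)
n = 60
X1, X2 = rng.normal(size=(n, 4)), rng.normal(size=(n, 3))
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)
y = X1 @ np.array([1.0, -0.5, 0.3, 0.8]) + X2 @ np.array([0.7, -1.2, 0.4])
y = y + 0.1 * rng.normal(size=n)
y -= y.mean()

# All (LV1, LV2) pairs up to 2 latent variables: 1+1, 1+2, 2+1 and 2+2
combos = list(itertools.product(range(1, 3), repeat=2))
rss = {}
for lv1, lv2 in combos:
    b1, b2 = spls(X1, X2, y, lv1, lv2)
    r = y - X1 @ b1 - X2 @ b2
    rss[(lv1, lv2)] = float(r @ r)

# Sanity check: with full-rank blocks the alternation reaches the joint least-squares fit
b1_full, b2_full = spls(X1, X2, y, 4, 3)
b_ls = np.linalg.lstsq(np.hstack([X1, X2]), y, rcond=None)[0]
print(np.allclose(np.concatenate([b1_full, b2_full]), b_ls, atol=1e-6))
```

Because each pair requires its own iterative fit, the number of models grows with the square of the maximum number of latent variables, which is consistent with the longer computing time reported for S-PLS.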

Fig. 2. Comparison of the root mean square error of prediction (RMSEP) in the modelling of (A) flash point (RMSEP in °C), (B) benzene (RMSEP in % v/v) and (C) RON (RMSEP in min), for the different regression methods, plotted against the number of latent variables, where 80% of the data was selected for training and 20% for validation (random partition).

Fig. 3. Raw spectra and model regression coefficients, plotted against wavenumber (cm⁻¹), for single PLS with MIR and NIR spectra, MB-PLS and S-PLS optimal models: (A) flash point, (B) benzene and (C) RON.

5. Conclusions

In this work, NIR and MIR spectroscopy were used alone or in combination for the prediction of some relevant parameters of gasoline (benzene and RON) and gas oil (flash point). Furthermore, the modelling performance of different PLS-based regression methods was compared. For all three parameters, calibrations with $Q^2_y$ above 85% were achieved with serial PLS and single PLS.

Serial PLS was a very robust algorithm, always achieving the same or better results than the best single PLS model, while multiblock PLS behaves as an average of the two single-PLS models. For the studied parameters, however, the advantages of using serial PLS over single PLS were very small: it is a very time-consuming algorithm and only marginal gains were observed.

Acknowledgements

The authors would like to thank Petrogal-GALPENERGIA S.A. in Portugal for providing the gasoline and gas oil samples. The help of Sonia Mendes in the experimental work is gratefully acknowledged. LPB and JAL thank the Foundation for Science and Technology in Portugal for financial support (POCTI BD/10302/2002 and POCTI BPD/7194/2001, respectively).

References

[1] Method D2699 knock characteristics of motor fuels by the research method, Annual book of ASTM standards, American Society for Testing and Materials, Philadelphia, PA, 1999.
[2] Method D3606 standard test method for determination of benzene and toluene in finished motor and aviation gasoline by gas chromatography, Annual book of ASTM standards, American Society for Testing and Materials, Philadelphia, PA, 1999.
[3] Method D093-02a standard test methods for flash-point by Pensky-Martens Closed Cup Tester, American Society for Testing and Materials, Philadelphia, PA, 1995.
[4] D.A. Burns, E.W. Ciurczak, Handbook of near-infrared analysis, 2nd edition, Marcel Dekker, NY, 2001.
[5] J.J. Kelly, C.H. Barlow, T.M. Jinguji, J.B. Callis, Prediction of gasoline octane numbers from near-infrared spectral features in the range 660–1215 nm, Anal. Chem. 61 (1989) 313–320.
[6] G.E. Fodor, K.B. Kohl, R.L. Mason, Analysis of gasolines by FT-IR spectroscopy, Anal. Chem. 68 (1996) 23–30.
[7] G. Bohács, Z. Ovádi, A. Salgó, Prediction of gasoline properties with near infrared spectroscopy, J. Near Infrared Spectrosc. 6 (1998) 341–348.
[8] A.F. Parisi, L. Nogueiras, H. Prieto, On-line determination of fuel quality parameters using near-infrared spectrometry with fibre optics and multivariate calibration, Anal. Chim. Acta 238 (1990) 95–100.
[9] J.B. Cooper, K.L. Wise, W.T. Welch, M.B. Sumner, B.K. Wilt, R.R. Bledsoe, Comparison of near-IR and mid-IR spectroscopies for the determination of BTEX in petroleum fuels, Appl. Spectrosc. 51 (1997) 1613–1620.
[10] J.M. Chalmers, Spectroscopy in process analysis, Sheffield Academic Press, UK, 2000.
[11] L.P. Brás, S.A. Bernardino, J.A. Lopes, J.C. Menezes, Multiblock PLS as an approach to compare and combine NIR and MIR spectra in calibrations of soybean flour, accepted for publication in Chemom. Intell. Lab. Syst. (2004).
[12] R.W. Gerlach, B.R. Kowalski, H.O.A. Wold, Anal. Chim. Acta 112 (1979) 417–421.
[13] H. Martens, T. Naes, Multivariate calibration, John Wiley & Sons, Chichester, UK, 1989.
[14] J. MacGregor, C. Jaeckle, C. Kiparissides, M. Koutoudi, Process monitoring and diagnosis by multiblock PLS methods, Proc. Syst. Eng. 40 (1994) 826–838.
[15] J. Westerhuis, T. Kourti, J. MacGregor, Analysis of multiblock and hierarchical PCA and PLS models, J. Chemom. 12 (1998) 301–321.
[16] J. Westerhuis, P. Coenegracht, Multivariate modelling of the pharmaceutical two-step process of wet granulation and tableting with multiblock partial least squares, J. Chemom. 11 (1997) 379–392.
[17] A. Berglund, S. Wold, A serial extension of multiblock PLS, J. Chemom. 13 (1999) 461–471.