Characterisation of heavy oils using near-infrared spectroscopy: Optimisation of pre-processing methods and variable selection

Characterisation of heavy oils using near-infrared spectroscopy: Optimisation of pre-processing methods and variable selection

Analytica Chimica Acta 705 (2011) 227–234 Contents lists available at ScienceDirect Analytica Chimica Acta journal homepage: www.elsevier.com/locate...

4MB Sizes 0 Downloads 25 Views

Analytica Chimica Acta 705 (2011) 227–234

Contents lists available at ScienceDirect

Analytica Chimica Acta journal homepage: www.elsevier.com/locate/aca

Characterisation of heavy oils using near-infrared spectroscopy: Optimisation of pre-processing methods and variable selection Jérémy Laxalde a,b , Cyril Ruckebusch a , Olivier Devos a , Noémie Caillol b , Franc¸ois Wahl b , Ludovic Duponchel a,∗ a b

LASIR, CNRS - Université Lille1 Sciences et technologies, bât. C5, 59655 Villeneuve d’Ascq Cedex, France IFP-Lyon, Direction Physique et Analyse, Département Caractérisation des Produits, Rond-point de l’échangeur de Solaize, BP 3, 69360 Solaize, France

a r t i c l e

i n f o

Article history: Received 21 December 2010 Received in revised form 12 May 2011 Accepted 18 May 2011 Available online 1 July 2011 Keywords: Genetic algorithm Variable selection Spectral pre-processing Near infrared spectrocopy Partial least squares regression Heavy oils

a b s t r a c t In this study, chemometric predictive models were developed from near infrared (NIR) spectra for the quantitative determination of saturates, aromatics, resins and asphaltens (SARA) in heavy petroleum products. Model optimisation was based on adequate pre-processing and/or variable selection. In addition to classical methods, the potential of a genetic algorithm (GA) optimisation, which allows the co-optimisation of pre-processing methods and variable selection, was evaluated. The prediction results obtained with the different models were compared and decision regarding their statistical significance was taken applying a randomization t-test. Finally, the results obtained for the root mean square errors of prediction (and the corresponding concentration range) expressed in %(w/w), are 1.51 (14.1–99.1) for saturates, 1.59 (0.7–61.1) for aromatics, 0.77 (0–34.5) for resins and 1.26 (0–14.7) for asphaltens. In addition, the usefulness of the proposed optimisation method for global interpretation is shown, in accordance with the known chemical composition of SARA fractions. © 2011 Elsevier B.V. All rights reserved.

1. Introduction Heavy oils are defined as petroleum cuts with boiling point higher than 350 ◦ C [1]. The different classical heavy oil cuts are Vacuum Gas Oil (VGO: 350–550 ◦ C), Atmospheric Residue (AR: 350 ◦ C+ ), Vacuum Residue (VR: 550 ◦ C+ ) or DesAsphaltened Oil (DAO). Heavy oils are very complex matrices of hydrocarbon chains with alkylation degrees ranging from 25 to more than 200 carbon atoms. These samples also contain a high level of heteroelements (sulphur, nitrogen and oxygen) and metals (nickel and vanadium). The main commercial products derived from heavy oils are fuels for ships and power stations, lubricant oils and bitumen. However, the demand for these commercial products is very low compared to heavy oils resources, which are twice more abundant than conventional light crude oils. Therefore, heavy oils represent an attractive petroleum resource as long as they can be converted to valuable fuels. Nevertheless, the refinery of these products is difficult and, thus, specific processes and catalysts must be developed. Besides, the characterisation of heavy oil products requires the analysis of different chemical and physical properties, which are essential for the monitoring and the development of these technologies. Reference methods are usually costly as well as time and volume

∗ Corresponding author. Tel.: +33 320 434902; fax: +33 320 436765. E-mail address: [email protected] (L. Duponchel). 0003-2670/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2011.05.048

consuming especially for heavy oils which are dark and very viscous (solid at room temperature). The most widespread of these analyses is the determination of saturate, aromatic, resin and asphalten (SARA) contents. The use of near infrared (NIR) spectroscopy is widespread in analytical chemistry because of its speed, low operating cost, accuracy and simplicity. Absorption bands resulting from X-H overtones and combinations of fundamental vibrations of the molecules are observed. This makes NIR a very attractive spectroscopic method for the analysis of hydrocarbon mixtures. In many situations, NIR spectroscopy does not require sample preparation. On top of that, several properties can be simultaneously predicted from a single spectrum. However, the method is characterized by a low specificity i.e. absorption bands are overlapped and low sensitivity i.e. large variations of chemical properties lead to small spectral changes. Thus, the development of a NIR analysis requires building chemometric models to correlate the useful part of the spectral information to the target property(ies) that the analyst aims to predict indirectly. The potential of NIR to determine physical and chemical properties of petroleum products as an alternative to conventional analyses is well-established [2]. However, only a few studies deal with the prediction of SARA. Prediction models for these properties were developed by Blanco et al. [3] for bitumen, and by Aske et al. [4] and Hannisdal et al. [5] for crude oils. Satya et al. [6] have predicted SARA contents for petroleum fractions of hydrocarbon

228

J. Laxalde et al. / Analytica Chimica Acta 705 (2011) 227–234

Table 1 Overview of the models developed for the prediction of SARA using NIR spectroscopy. Petroleum fraction (number of samples)

Propertya : range %(w/w)

Pre-processing method

Spectral range (cm−1 )

LVb

RMSEP %(w/w)

Crude oil (18) [4]

S: 24–83 A: 13–43 R: 4–25 @: 0–12 S: 23–51 A: 26–48

1st derivative

10,000–4000

8 6 7 8 4 3

2.78 2.39 1.41 0.98 2.82 1.47

R: 7–30 @: 0.2–13 S: 34–70 A: 18–41 R: 6–27 @: 0–9 S: 3–10

2nd derivative 1st derivative 1st or 2nd derivative

6 2 1 1 1 5 5

1.46 0.44 2.3 1.7 1.1 1.5 0.6

A: 34–55 R: 18–34 @: 16–30

2nd derivative 2nd derivative 1st derivative

5 5 3

2.1 2.2 1.3

Crude oil (20) [5]

C12 + (11) C25 + (11) [6]

Bitumen (66) [3]

a b

2nd derivative 2nd derivative

2nd derivative

5900–5570 5900–5570 4670–4570 9000–8680 8330–4160

6060–5560 4480–4020

9000–4020

S; saturates; A: aromatics; R: resins; @: asphaltens. Number of latent variables.

chains with alkylation degrees higher than 12 and 25 carbon atoms (respectively C12 + and C25 + ). Table 1 provides an overview of the experimental conditions and the results obtained in these studies. As it can be observed, chemometric models were usually developed using PLS regression on full-range NIR spectra and first- or secondderivative as pre-processing. An additional comment concerns the small number of samples considered in most of these studies. To improve the predictive ability of multivariate calibration models which can be affected by noise, as well as physical and chemical interferences [7], data pre-processing and/or variable selection methods are usually considered in NIR spectroscopy. Considering pre-processing first, many different methods are available in the literature such as scattering correction, derivatives or baseline correction [8–10]. These methods were developed to address and remove specific phenomena affecting the original spectra, such as scattering. Pre-processing methods and their combinations are usually tested on the basis of a priori knowledge, or on trial and error procedures. On the other hand, variable selection can be used to improve predictive ability and robustness of the calibration models [11]. Variable selection methods (see Xiabo et al. [12] for a recent review) are usually classified in: (i) classical approaches such as manual approach (based on spectroscopic knowledge), (ii) “univariate” and “sequential” selection methods [11,13]; sophisticated methods such as successive projections algorithm (SPA) [14] and uninformative variable elimination (UVE) [15]; (iii) interval base algorithms such as interval partial least squares (iPLS) [16], windows PLS [17–19] and iterative PLS [20,21] and (iv) elaborated search-based strategies such as simulated annealing (SA) [22–24], artificial neural networks (ANN) [25,26] and optimisation with genetic algorithm (GA) [27–29]. Data pre-treatment and variable selection, and their different possible combinations, are usually performed independently and explored in a subjective manner. Consequently, it is seldom possible to conclude whether optimal prediction is obtained for the model retained or not. Besides, many other issues should be considered such as the order in which the data pre-treatment are applied [30], or the statistical significance of the difference of the prediction results [31–33]. To try to overcome these limitations, co-optimisation of pre-processing and variable selection methods by GA was recently proposed [34]. GA is a powerful tool both to explore (global search) and exploit (local search) the solution space. In principle, the implementation of an optimisation algorithm for the exploration of numerous pre-processing methods and for the global optimal seeking should rationalize

the development of NIR prediction models and, thus, potentially improve the prediction results. In the procedure applied in this study, 32 pre-processing methods, with various parameter settings, were selected. As pre-processing methods and variable selection are coded in the same chromosome, the different possible combinations of pre-processing and variable selection are explored and, finally, the most relevant solution should be obtained. Nevertheless, GA also presents significant drawbacks. Among them, GA optimisation requires important computation time, which was here speed up by parallel computing. Moreover, the configuration of GA models may be found difficult because of the numerous users’ defined parameters. In practice, for most of them, only few levels and/or short ranges can be spanned, as will be discussed in this paper. Since many parameters are optimised, overfitting and random correlation are issues to be carefully considered [35]. To get around this problem, final model parameters are fixed according to an approach based on selection frequency, i.e. the variables with lowest frequency are eliminated. Finally, validation is performed on an external dataset. The present paper is to our knowledge the first to investigate the potential of NIR for the quantitative prediction of SARA properties for a database containing both VGO and AR samples (same initial boiling point). When compared to commercial products which have to correspond to specifications, property values observed for VGO and AR samples cover a much wider range. Different chemometric approaches for the optimisation of prediction models are investigated. We report the results obtained for full spectra PLS models, GA-based methodology for the optimisation of pre-processing and GA-based methodology for the co-optimisation of pre-processing and variable selection. These results are compared and their statistical significance is assessed. The relevance of selected pre-processing methods for spectra correction and the agreement between the selected variables and the chemical composition of target properties are discussed. 2. Materials and methods 2.1. Heavy oils database The database is composed of 230 samples, of which 110 VGO and 120 AR samples. These samples are effluents from heavy oil quality upgrading processes. Considering the industrial nature of the problem investigated, they were selected in order to cover as

J. Laxalde et al. / Analytica Chimica Acta 705 (2011) 227–234

229

Table 2 Composition of the database. Property

Number of available reference values

Saturates Aromatics Resins Asphaltenes Asphaltenes C7

136 136 136 131 (24)a 171 (91)a

a b c

Property range %(w/w) 14.1–99.1 0.7–61.1 0–34.5 0–12.8 0–14.7

Reference method

Fidelityb %(w/w)

ASTM D2007/D4124 ASTM D2007/D4124 ASTM D2007/D4124 ASTM D2007/D4124 NF T60-115

r = 1.5 r = 1.5 r = 1.5 r = 0.14 × [Cc ] R = 0.2 + 0.2 × [Cc ]

In parenthesis: number of non-zero values. r and R correspond to repeatability and reproducibility values, respectively. [C] correspond to the concentration of the analysed sample.

exhaustively as possible the existing concentration range of each property, the different geographic origins of crude oils (Middle East, Venezuela, Canada, Russia, Nigeria, China) and the different processes (hydrocracking and hydrotreatment of VGO, AR and VR samples). Table 2 provides a description of the composition of the database. It should be noted first that not all the reference analyses were performed for each property resulting in a lot of missing Y-values. Secondly, as VGO samples do not contain asphalten compounds, only a few non-zero values are available for asphalten properties. To clarify, among the 120 AR samples, 24 non-zero reference values are available for asphaltens and 91 for asphaltens C7. For this reason, models for asphalten predictions were developed from asphaltens C7. To select these samples, a simple classification model was used in a first step (not described here).

temperature stabilisation of the sample is carried out in a 0.5 mm heated cell for 10 min. Spectra are recorded at 100 ± 0.1 ◦ C over the 12,000–4000 cm−1 spectral range. 100 scans were taken at a resolution of 4 cm−1 . Two spectra were recorded for each sample on two different aliquots. The difference between them is checked on a wavenumber-to-wavenumber basis. The result should not exceed 5 × 10−3 (in absorbance unit) for any wavenumber for a spectral acquisition to be considered correct. Otherwise, the procedure is repeated. Between two samples, the calibration cell is cleaned with toluene circulation. Prior to every acquisition, a single beam background of a sapphire reference block is collected.

2.2. Reference methods

The proposed methodology for optimisation follows the common principles of GA optimisation which are reminded first. To give an overview of the process, the different optimisation steps and the parameters to be adjusted in the procedure are given in Fig. 1 and Table 3, respectively. Two stages can be distinguished: the initial and the evolutionary stages. The initial stage consists in the creation of the first generation of individuals (solutions), created by a random draw of bit values. A population size of 256 individuals is chosen. The evolutionary stage consists in five steps: the evaluation, the selection, the recombination, the mutation, and the reinsertion. These steps are repeated until one of the optimisation

Reference values for the different properties were obtained by using standardized reference methods. The SARA fractionation i.e. the quantitative determination of saturate, aromatic, resin and asphalten contents, is based on liquid chromatography separation according to the polarity and was performed following ASTM D2007 and ASTM D4124 methods. The asphalten C7 content is the mass percentage of compounds that are insoluble in n-heptane and soluble in heated toluene. It is determined by using the NF T60-115 method. It can be noted that the two reference methods for asphaltens do not lead to exactly the same values as the fractionation procedures are significantly different. Values obtained for the fidelity of the different methods are reported in Table 2. These values correspond to repeatability (r) measurements for SARA and reproducibility (R) for asphaltens C7. The repeatability and reproducibility are obtained from the standard deviation of repeated measurements (see ASTM procedure). These r and R values are arbitrarily used as an indicator of the model performance to be reached. That is, it is considered that the RMSEP should be comparable to the ASTM even if the final uncertainty of the prediction is actually expressed in Eq. (1):

2.4. Genetic algorithm for optimisation of pre-processing and variable selection



prediction =

2 RMSEP2 + ASTM

(1)

where  prediction is the total error on the measurement by model prediction, RMSEP is the root mean square error of prediction and  ASTM is the r or R value. 2.3. NIR spectral acquisition An ABB Bomem FT-NIR spectrometer, the ResidIR, was used for NIR spectral acquisition. It is equipped with a DTGS detector and specially designed for heavy fuel sampling in transmission mode. The sample, contained in an aluminium cup, is liquefied in a heated pedestal at 100 ◦ C during 10 min. The cup is then manually pulled up to the sampling point. A peristaltic pump allows the sample circulation into the cell and then into the vent. After a few seconds, the circuit is closed manually by a valve. The

Fig. 1. Scheme of GA optimisation [34].

230

J. Laxalde et al. / Analytica Chimica Acta 705 (2011) 227–234

Table 3 GA parameters for pre-processing methods and variable selection co-optimisation. Corresponding GA step

Parameters

Value

Initialisation stage

256 Random 1–4 10%

Evaluation

Population size Pre-processing initialisation Consecutive pre-processing Initialisation probability for variables Fitness value

Selection

Maximum number of latent values Selection method

Recombination Mutation Reinsertion Convergence criteria

Crossover rate Mutation probability Reinsertion method Duplicates rate Maximum of generations

RMSECV (10 iterations of 10 fold) 20 Stochastic universal sampling 50% (single point) 0.5% Elitism replacement 50% 150

criteria is met. The first step of the evolutionary stage consists in the individual’s evaluation by a fitness function. In the case of PLS model optimisation, the objective is the minimisation of the prediction error. The model prediction ability is evaluated by the root mean square error of cross-validation (RMSECV). The cross validation is performed with 10 iterations of 10 random folds cross validation in order to reduce the risk of overfitting [36,37]. The optimal number of PLS latent variables (LVs) retained for the model is automatically fixed by identifying a significant knee on the RMSECV vs. the number of LVs plot. The maximum number of LVs is set to 20. During the next step, the best half of the population, 128 individuals, is selected according to their fitness value following the stochastic universal sampling method [38]. By using this method, all the solutions obtained from the generation can take part in the creation of the new generation. However, in order to improve the overall fitness of the population, individuals with high fitness values get a higher probability of being selected. These 128 individuals (parents) are then used for the creation of 128 children by recombination using a random single point crossover. Then, chromosomes can be subjected to mutations that consist in an inversion of random bit values. Mutations allow exploring some solutions very similar to the optimised ones keeping fitness unchanged. Moreover, mutations tend to slow the GA convergence and thus, reduce the risk of overfitting and the risk of convergence towards a local optimum. The probability of mutation is set to 0.5%. Finally, the new generation is composed of the 128 children and the 128 individuals of the previous generation with the best fitness values (elitism reinsertion). The GA optimisation procedure is stopped if there are 50% of duplicate solutions in the new generation or if the maximum number of generation is reached out (set here to 150). A methodology for the simultaneous optimisation of preprocessing and variable selection in PLS regression was recently proposed by Devos and Duponchel in Ref. [34] which provides full details about the method and assessment of its performances. A description of this strategy in the context of our study is given in the following. Two approaches were investigated for model optimisation: the simultaneous optimisation of pre-processing methods and variable selection (GA co-optimisation) and the optimisation of pre-processing method on full spectra (GA pre-processing optimisation). For the implementation of GA co-optimisation, variable selection is classically encoded in a chromosome as follows. Each wavenumber is represented by a bit which is set to “1 if the wavenumber is selected and to “0 if it is not. In this study, spectra were divided in blocks of 20 variables (38 cm−1 spectral range) for initialisation. 10% of the initial NIR variables were selected per chromosome. Pre-processing methods are encoded on 5 bits

Fig. 2. Encoding the co-optimisation problem [34].

in the first part of the chromosome (Fig. 2). Table 4 provides a list of the 32 different pre-processing methods with various settings that are available. These methods were selected to widely cover the range of pre-processings usually applied for chemometrics. During initialisation, the same selection probability is given to each of the 32 pre-processing methods. A user-defined parameter p, which defined the maximum number of consecutive pre-processing methods, was introduced in order to limit the number of methods that can be selected during the optimisation. In the case of GA pre-processing optimisation, models were developed on the full spectra. For this purpose, variable selection bit values were fixed to 1.

2.5. Exploitation of GA results GA was performed on the calibration set and the maximum number p of pre-processing methods was varied from 1 to 4 for both pre-processing and co-optimisation. Nevertheless, it was observed that, when combining 3 or 4 pre-processing methods, unsatisfactory combinations could occur. Indeed, pre-processing methods of the same kind could be selected. Therefore, it was decided in the following to compute only GA models allowing a maximum of 2 consecutive pre-processing methods both for optimisation and co-optimisation procedure. To reduce the risk of convergence in a local optimum, five independent runs were performed for each optimisation. At each run’s last generation, the best half of individuals is kept according to their fitness values. Consequently, 640 combinations of pre-processing methods and variable selection are then available. Thus, GA offers the choice between several optimal solutions (or close to optimal). An approach based on selection frequency criteria was established for the selection of final model parameters. First, the pre-processing methods were fixed to the most frequently selected combination. In case of GA co-optimisation, among the individuals that have this pre-processing combination, the selection frequency of variables was then calculated. Variables that had a selection frequency lower than 20% were eliminated to reduce the risk of random correlations [35]. Then, a PLS model was developed on calibration set using the selected combination of pre-processing methods and variables. Finally, model performances were evaluated using the validation set and the RMSEP was calculated.

J. Laxalde et al. / Analytica Chimica Acta 705 (2011) 227–234

231

Table 4 The 32 pre-processing methods available for GA co-optimisation (K: polynomial order; F: number of points in the filter; M: derivative order). Methods

Filtering

No pre-processing Savitzky–Golay smoothing (K, F)

Baseline correction

Detrend (K)

Normalisation

Multiplicative scatter correction (MSC) Normalize, sum equals to unity Standard normal variate (SNV) Savitzky–Golay derivative (K, F, M)

Derivative

Scaling and centering

Baseline correction

Parameters

Autoscale Mean center Median center Square root mean scale Weighted least squares baseline – WLSB (K)

For each property, calibration and validation sets were constructed by random selection and correspond to 75% and 25% of the samples, respectively. The distribution of VGO/AR samples in the validation set is the following: 25/7, 22/10, 24/9, 0/22 for saturate, aromatic, resin, and asphalten C7 properties, respectively. Comparing the prediction results obtained with different models (different pre-processings, different structures, etc.) is an important issue to prevent unjustified conclusions. Different statistical methods were proposed in the literature [31–33] for the comparison of the predictive ability of two calibration methods. However, in practice the statistical assumption under which the proposed test would be strictly applicable are seldom met [32]. In this work, a randomization t-test based on Monte Carlo simulations is applied as proposed in reference [31]. Randomization sets are obtained by interchanging labels of some of the samples in the validation set. This procedure uses the distribution of prediction errors as a measure of prediction accuracy. Thus, two models are compared using their difference between RMSEP as a statistical test. In this study, a two side test was used for the comparison of models. This test was based on the null hypothesis (RMSEP of model A and model B is the same). This null hypothesis is rejected if the calculated probability value (p-value) is equal or lower than 0.05. Finally, interpretation of the models was performed in order to check the relevance of selected variables and to evaluate their accordance with the SARA fractions chemical composition. SARA fractionation is based on separation according to the polarity. Therefore, the polarity of the compounds increases from saturate to asphalten fractions while the saturation degree decreases. In NIR spectra of heavy oils, absorption bands are mainly due to C–H bond vibrations. However, these C-H bonds belong to chemical groups with different saturation degrees. The main absorption bands of heavy oil NIR spectra can be divided in three groups according to their saturation degrees. The absorption bands in the 6050–5500 and 4500–4000 cm−1 spectral ranges were attributed

Gene coding

K

F

M

2 3 2 3 2 3 1 2 3

15 15 7 7 21 21

2 2 2 3 3 3 2 2 2 3 3 3

7 15 21 7 15 21 7 15 21 7 15 21

00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111

1 1 1 1 1 1 2 2 2 2 2 2

1 2 3

to C–H vibration bonds in saturated chemical groups, the ones between 4750 and 4500 cm−1 to unsaturated compound group and ones in the 9000–6500 cm−1 spectral range to asphaltens. As NIR spectra are inherently rich and as very low spectral variations are usually observed, the complete interpretation of selected variable cannot be made. 3. Results and discussion 3.1. Spectral interpretation Fig. 3 represents the NIR spectra obtained for the 230 heavy oil samples. The common NIR absorption bands of hydrocarbons 2.5

2

Absorbance

Category

1.5

1

0.5

0 9000

8000

7000

6000

5000

−1

Wavenumbers (cm ) Fig. 3. NIR spectra of the heavy oil database.

4000

232

J. Laxalde et al. / Analytica Chimica Acta 705 (2011) 227–234

Table 5 Results obtained from classical PLS, pre-processing optimisation by GA (GAPLS), and pre-processing and variable selection co-optimisation by GA models (Co-GAPLS). SARA contents

Method

Number of variables (spectral range (cm−1 ))

Pre-processing

LVa

RMSEC %(w/w)

RMSEP %(w/w)

Saturates

PLS GAPLS Co-GAPLS PLS GAPLS Co-GAPLS PLS GAPLS Co-GAPLS PLS GAPLS Co-GAPLS

1557 (7000–4000) 1557 (7000–4000) 460 1557 (7000–4000) 1557 (7000–4000) 520 1557 (7000–4000) 1557 (7000–4000) 500 2604 (9000–4000) 2604 (9000–4000) 600

1st derivative + mean center WLSBb (3c ) + mean center WLSBb (3c ) + mean center 1st derivative + mean center WLSBb (3c ) + mean center Mean center + detrend (2c ) 1st derivative + mean center WLSBb (3c ) + mean center WLSBb (1c ) + autoscale 1st derivative + mean center 1st derivative + mean center detrend (2c ) + mean center

7 8 9 6 9 10 8 9 9 3 3 8

1.75 1.31 1.38 1.86 1.17 0.99 1.24 0.88 0.86 1.41 1.41 0.63

2.09 1.51 1.82 2.06 1.68 1.59 0.82 0.80 0.77 1.26 1.26 1.28

Aromatics

Resins

Asphaltens C7

a b c

Number of latent variables. WLSB: weighted least square baseline method. Order of polynomial.

compounds are observed. Absorption bands between 4500 and 4000 cm−1 correspond to the combination of C–H stretching and bending of CH2 and CH3 . The weak band between 4750 and 4500 cm−1 can be assigned to the combination of fundamental vibrations in unsaturated groups. The first overtone of the fundamental C–H stretching mode is observed between 6050 and 5500 cm−1 . The weak absorption centred at 7000 cm−1 is due to the combination of the C–H fundamental bending and stretching first overtone modes. The band centred at 8000 cm−1 can be attributed to the second overtones of C–H stretching. When looking at the main features of the complete series of NIR spectra, two different types of spectral shapes can be clearly distinguished. The first type corresponds to NIR spectra that do not present any baseline offset and/or absorption slope. These spectra are characteristic of samples without asphalten compounds (VGO or AR). On the other hand, the second type of spectral shape is characterized by a baseline slope in the 9000–6500 cm−1 spectral region and a baseline offset. These features are characteristic of samples containing asphalten compounds. Indeed, this baseline slope corresponds to the tail of the absorption band in the visible region due to electronic transitions caused by ␲–␲* and n–␲* transitions of asphalten molecules [39]. Regarding the baseline offset, it can be attributed to light scattering caused by asphalten aggregates [4]. In addition, it should be noticed that spectral distortion is increasing with asphalten content. To reduce the influence of baseline slope, which is mainly observed between 9000 and 6500 cm−1 , calibration models for the prediction of saturates, aromatics and resins were performed on a reduced domain (7000–4000 cm−1 ). However, for the predictions of asphaltens C7, the full spectral range was used.

3.2. SARA prediction The results obtained by classical PLS, GA pre-processing optimisation and GA co-optimisation are summarized in Table 5. In order to compare the prediction performance of the different models, a two-side randomisation test was performed. The results obtained are indicated in Table 6. As mentioned earlier, if the significance level (p-value) calculated [31] is below 0.05, then the null hypothesis (RMSEP of model A and model B is the same) is rejected. For saturate predictions, results obtained (RMSEC and RMSEP) with GA pre-processing optimisation were 1.31 and 1.51%(w/w), respectively. This full-spectrum 8 LVs model was obtained using weighted least square baseline (WLSB, polynomial order 3) preprocessing. WLSB pre-processing performed very efficiently for aromatic and resin models also. From Fig. 4a, the effect of baseline correction can be visually assessed. When comparing the prediction performance of the GA pre-processing model to the classical PLS model using the randomisation test, the different between the two RMSEP values (1.51 and 2.09%(w/w), respectively) was found significant (see Table 5, p-value = 0.05). On the contrary, no significant difference can be demonstrated for the GA co-optimisation model (RMSEP = 1.82) for the prediction of saturate when compared to the GA pre-processing model (see Table 5, p-value = 0.07). The same conclusion can be raised for the comparison with the classical PLS model, which is non-intuitive but can be attributed to the different distribution of the prediction errors for the different models [31]. Regarding the prediction of aromatics, the lower RMSEP (1.59%(w/w)) was obtained using GA co-optimisation model. Comparing this value to the predictive power of the classical PLS model (RMSEP = 2.09%(w/w)), the difference is found significant

Table 6 Randomisation test for statistical comparison of different prediction methods: classical PLS, pre-processing methods optimisation by GA (GAPLS) and pre-processing methods and variable selection co-optimisation by GA models (Co-GAPLS). SARA contents

Prediction method

RMSEP %(w/w)

Saturates

PLS GAPLS Co-GAPLS PLS GAPLS Co-GAPLS PLS GAPLS Co-GAPLS PLS GAPLS Co-GAPLS

2.09 1.51 1.82 2.06 1.68 1.59 0.82 0.80 0.77 1.26 1.26 1.28

Aromatics

Resins

Asphaltens C7

p-Value obtained by randomization test [31] Comparison with PLS

Comparison with GAPLS

– 0.05 0.14 – 0.07 0.05 – 0.45 0.40 – 1 0.24

– – 0.07 – – 0.63 – – 0.6240 – – 0.24

J. Laxalde et al. / Analytica Chimica Acta 705 (2011) 227–234

b

1

0.8

0.7

0.6

0.4

0.4

0.1

0.2

1

0.5

0.8

0.3

0.6

0.1

0.4

−0.1

0.2

6500

6000

5500

5000

4500

−0.3 7000

0 4000

6500

−1

1.3

1

5000

4500

0 4000

d 0.7

1

0.5

0.8

0.3

0.6

0.1

0.4

−0.1

0.2

0.8

0.7

0.6

0.4

0.4

0.1

0.2

6500

6000

5500

5000

4500

0 4000

−1

Wavenumbers (cm )

Selection frequency Absorbance

Absorbance

1

−0.2 7000

5500

Wavenumbers (cm−1)

Wavenumbers (cm )

c

6000

−0.3 9000

8000

7000

6000

5000

Selection frequency

−0.2 7000

0.7

1

Selection frequency Absorbance

Absorbance

1.3

Selection frequency

a

233

0 4000

−1

Wavenumbers (cm )

Fig. 4. (a) GA co-optimisation results for saturates. Spectra were pre-processed by WLSB with a polynomial order 3. Selected variables are represented with bars and it intensity indicates the selection frequency of each block. (b) GA co-optimisation results for aromatics. Spectra were pre-processed by detrend with a polynomial order 2. Selected variables are represented as previously. (c) GA co-optimisation results for resins. Spectra were pre-processed by WLSB with a polynomial order 1. Selected variables are represented as previously. (d) GA co-optimisation results for asphaltens C7. Spectra were pre-processed by detrend with a polynomial order 2. Selected variables are represented as previously.

(see Table 5, p-value = 0.05). However, no significant difference can be demonstrated when compared to the GA pre-processing model. Whatever the procedure, the RMSEP values obtained for resin predictions were found very similar, i.e. 0.82, 0.80 and 0.77%(w/w) for classical PLS, GA pre-processing and GA co-optimisation, respectively. Obviously, no meaningful difference is observed. However, GA co-optimisation model was chosen as the number of variables was greatly reduced, which improves model robustness [11]. Regarding the predictions of asphaltens C7, RMSEP being in that case 1.26%(w/w) for classical PLS and GA pre-processing, and 1.28%(w/w) in the case of GA co-optimisation, the prediction errors were found equivalent. But it should be noted that the GA cooptimisation model is less parsimonious, as 5 more latent variables were required. 3.3. Variable selection GA co-optimisation allows potentially improving the interpretation of the results focusing on the selected features of the spectra. Even in situations where no significant prediction improvement can be demonstrated, this may be considered as an additional advantage. For this purpose, the block selected in the GA co-optimisation procedure were first assigned to the different chemical group (saturated, unsaturated and asphalten groups). In a second step, the ratio of these selected blocks in each group was examined for in order to perform a global interpretation of selected variables. Fig. 4 provides the results obtained for each SARA property. For saturate predictions, the model consists of 460 variables (23 blocks of 20 variables). A WLSB method with a polynomial order 3 was applied for pre-processing. Fig. 4a shows that roughly half of the selected regions (12 blocks) correspond to absorption bands in the NIR regions which are characteristic of saturated group. In addition, 2 blocks were found in the NIR region corresponding to

unsaturated group absorption. As the corresponding PLS coefficients are negative, an inverse correlation is supposed. In addition, 9 variable blocks could not be attributed to known absorption. For aromatics, the GA co-optimisation model contains 520 variables selected over 26 blocks (see Fig. 4b). Pre-processing was a second-order polynomial detrend in that case, which allowed to correct baseline variations in the 7000–4000 cm−1 spectral range. The number of selected variables (8 blocks) corresponding to saturate group is smaller compared to saturate model and the number of folds corresponding to aromatics is increased to 4. Regarding resin GA co-optimisation model (500 variables, 25 blocks), spectra were corrected by WLSB with a polynomial order 1. Inspection of Fig. 4c shows that the absorption slope in the 7000–6500 cm−1 region is not completely removed and 4 blocks were selected in this region. It was observed that 11 blocks are associated to saturate groups, and 2 to unsaturated group, while 7 of them remain unattributed. The GA co-optimisation model for asphaltens retains 600 variables (30 folds) (see Fig. 4d). The number of selected variables for saturate group is lower than for the other properties (6 blocks) while 5 folds are located on aromatic absorption. A set of 14 blocks is selected in the range 9000–6500 cm−1 which corresponds to the electronic absorption of asphalten aggregates. In order to perform a global interpretation of selected variables, the repartition of the selected blocks through the three absorptions groups is examined. The ratio between selected blocks and the ones centred on saturate chemical groups globally decreases from saturate to asphalten models. The maximum ratio of selected blocks assigned to unsaturated group is obtained for the aromatic model. For resin model, the baseline slope was not totally removed between 7000 and 6500 cm−1 and 4 blocks were selected in this spectral range. Fig. 5 confirms resin content is highly correlated with the presence (or absence) of asphaltens, as expected (resins are very important for asphaltens stability [41]). The non uniform

234

J. Laxalde et al. / Analytica Chimica Acta 705 (2011) 227–234

GA co-optimisation was investigated to identify correspondence with the known chemical composition of SARA fractions. Despite a full interpretation cannot be expected, the selected variables were found globally in accordance with the composition of SARA fractions.

14

Asphalten content %(w/w)

12

10

References 8

6

4

2

0 0

5

10

15

20

25

30

35

Resin content %(w/w) Fig. 5. Plot of asphalten vs. resin contents.

dependence of resin and asphalten contents is a consequence of the sample chemical nature and history. Effectively, the hydrocarbons distribution of samples depends on the type of refining process it comes from. Therefore, the slope of absorption can be relevant for resin predictions. Finally, half of considered variables for asphalten model are situated in the 9000–6500 cm−1 spectral range confirming the relevance of this spectral range for the description of asphaltens [39]. In conclusion, the global interpretation has shown that the variable selection was in accordance with chemical composition. However, in order to evaluate in depth the significance of selected variables, especially the ones that remained unattributed, additional strategy such as backward interval PLS (biPLS) [40] could be performed. 4. Conclusion Models were developed for the quantitative prediction of SARA contents in heavy oils. For this purpose, 230 VGO and AR samples were collected to broadly cover the concentration range of the properties of interest. The results obtained for the prediction of saturates, aromatics, resins and asphaltens C7 were 1.51, 1.59, 0.77 and 1.26%(w/w), respectively. The potential of an objective approach based on GA for the optimisation or co-optimisation of pre-processing and variable selection was evaluated. In order to compare the predictive ability of the models developed using the different methods, a randomisation test was carried out. Significant improvement when compared to classical PLS could be concluded for saturate predictions using the GA pre-processing model and for aromatic predictions using GA co-optimisation model. For resin predictions, the different approaches were found equivalent. However, the GA co-optimisation model was retained as the number of variables was reduced. In addition, the location of variables selected by

[1] I. Merdrignac, D. Espinat, Oil Gas Sci Technol. 62 (2007) 7–32. [2] H. Chung, Appl. Spectrosc. Rev. 42 (2007) 251–287. [3] M. Blanco, S. Maspoch, I. Villarroya, X. Peralta, J.M. Gonzalez, J. Torres, Anal. Chim. Acta 126 (2001) 378–382. [4] N. Aske, H. Kallevik, J. Sjöblom, Energy Fuels 15 (2001) 1349–1357. [5] A. Hannisdal, P.I.V. Hemmingsen, J. Sjöblom, Ind. Eng. Chem. Res. 44 (2005) 1349–1357. [6] S. Satya, R.M. Roehner, M.D. Deo, F.V. Hanson, Energy Fuels 21 (2007) 998–1005. [7] H. Martens, T. Naes, Multivariate Calibration, John Wiley & Sons, Chichester, UK, 1989. [8] A. Rinnan, F. Van Den Berg, S.B. Engelsen, TrAC, Trends Anal. Chem. 28 (2009) 1201–1222. [9] B.M. Nicolaï, K. Beullens, E. Bobelyn, A. Peirs, W. Saeys, K.I. Theron, J. Lammertyn, Postharvest Biol. Technol. 46 (2007) 99–118. [10] A. Candolfi, R. De Maesschalck, D. Jouan-Rimbaud, P.A. Hailey, D.L. Massart, J. Pharm. Biomed. Anal. 21 (1999) 115–132. [11] C.H. Spiegelman, M.J. McShane, M.J. Goetz, M. Motamedi, Q. Li Yue, G.L. Coté, Anal. Chem. 70 (1998) 35–44. [12] B.Z. Xiabo, J.W. Zhao, M.J.W. Povey, M. Holmes, M. Hanpin, Anal. Chim. Acta 667 (2010) 14–32. [13] M.L. Thompson, Int. Stat. Rev. 46 (1978) 1–19. [14] M.C.U. Araújo, T.C.B. Saldanha, R.K.H. Galvão, T. Yoneyama, H.C. Chame, V. Visani, Chemometr. Intell. Lab. Syst. 57 (2001) 65–73. [15] V. Centner, D.-L. Massart, Anal. Chem. 68 (1996) 3851–3858. [16] A.S.L. Nørgaard, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Appl. Spectrosc. 54 (2000) 413–419. [17] S. Kasemsumran, Y.P. Du, K. Maruo, Y. Ozaki, Chemometr. Intell. Lab. Syst. 82 (2006) 97–103. [18] Y.P. Du, Y.Z. Liang, J.H. Jiang, R.J. Berry, Y. Ozaki, Anal. Chim. Acta 501 (2004) 183–191. [19] Y. Zheng, X. Lai, S.W. Bruun, H. Ipsen, J.N. Larsen, H. Løwenstein, I. Søndergaard, S. Jacobsen, J. Pharm. Biomed. Anal. 46 (2008) 592–596. [20] D. Chen, W. Cai, X. Shao, Chemometr. Intell. Lab. Syst. 87 (2007) 312–318. [21] C. Abrahamsson, J. Johansson, A. Sparen, F. Lindgren, Chemometr. Intell. Lab. Syst. 69 (2003) 3–12. [22] S. Kirkpatrick, M.P. Vecchi, Science 220 (1983) 671–680. [23] J.H. Kalivas, N. Roberts, J.M. Sutter, Anal. Chem. 61 (1989) 2024–2030. [24] H. Swierenga, F. Wülfert, O.E. de Noord, A.P. de Weijer, A.K. Smilde, L.M.C. Buydens, Anal. Chim. Acta 411 (2000) 121–135. [25] Z. Boger, Anal. Chim. Acta 490 (2003) 31–40. [26] R. Todeschini, D. Galvagni, J.L. Vilchez, M. del Olmo, N. Navas, TrAC, Trends Anal. Chem. 18 (1999) 93–98. [27] R. Leardi, J. Chromatogr. A 1158 (2007) 226–233. [28] R. Leardi, J. Chemometr. 15 (2001) 559–569. [29] J. Koljonen, E.M. Torbjörn, J.T. Alander, J. Near Infrared Spectrosc. 16 (2008) 189–197. [30] O.E. de Noord, Chemometr. Intell. Lab. Syst. 23 (1994) 65–70. [31] H. Van der Voet, Chemometr. Intell. Lab. Syst. 25 (1994) 313–323. [32] D.M. Haaland, E.V. Thomas, Anal. Chem. 60 (1988) 1193–1202. [33] Y. Roggo, L. Duponchel, C. Ruckebusch, J.-P. Huvenne, J. Mol. Struct. 654 (2003) 253–262. [34] O. Devos, L. Duponchel, Chemometr. Intell. Lab. Syst. (2011). [35] D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Chemometr. Intell. Lab. Syst. 35 (1996) 213–220. [36] K. Baumann, H. Albert, M. von Korff, J. Chemometr. 16 (2002) 339–350. [37] G. Cruciani, M. Baroni, S. Clementi, G. Costantino, D. Riganelli, B. Skagerberg, J. Chemometr. 6 (1992) 335–346. [38] T. Blickle, L. Thiele, Evol. Comput. 4 (1996) 361–394. [39] O.C. Mullins, Anal. Chem. 62 (1990) 508–514. [40] R. Leardi, L. Norgaard, J. Chemometr. 18 (2004) 486–497. [41] J.G. Speight, Oil Gas Sci Technol. 59 (2004) 467–477.