Copyright © IFAC Dynamics and Control of Process Systems, Corfu, Greece, 1998
INDUSTRIAL SPECTRAL DATA CALIBRATION FROM MINIMAL DATA SETS
A.K. Conlin, E.B. Martin and A.J. Morris
Centre for Process Analysis, Chemometrics and Control, University of Newcastle, Newcastle upon Tyne, NE1 7RU, U.K.
Tel: +44 (0) 191 222 5382; Fax: +44 (0) 191 222 5292
e-mail: [email protected]; [email protected]
Abstract: The need for on-line chemical property information in industrial processing has never been greater than it is today in the chemical manufacturing industries. However, due to process limitations and the need to reduce costs, obtaining sufficient relevant data to enable accurate and robust calibrations to be derived is a major challenge. This paper examines an approach to obtaining a calibration model based upon a limited amount of data. The method, termed data augmentation, is illustrated through application to an industrial batch process. Copyright © 1998 IFAC

Keywords: Partial Least Squares; Data Augmentation; Batch Processing
1. INTRODUCTION

Traditionally, after data has been collected from an industrial process for the purposes of generating an analytical calibration model (e.g. through NIR or Raman spectrometry), the data is split into a calibration and a validation data set. This approach, while appropriate for applications where large data sets can be obtained, can prove problematical in processes where data is limited, e.g. in batch processing. Typically, for a batch process, only a few batches might be available from which to form the calibration data set, as a result of production costs, time constraints and only a limited number of production runs being made, especially in market-driven manufacturing environments. Consequently, an approach is required that allows calibration models to be built from small data sets. Such a technique is data augmentation, which is based upon the idea of the smoothed bootstrap (Efron and Tibshirani, 1993; Raviv and Intrator, 1996). Data augmentation involves the generation of a number of data sets, each generated through the addition of noise to the original data set. A calibration model is then built for each individual data set using the statistical technique of Partial Least Squares (PLS). The models obtained from each individual data set are then combined to give a single augmented calibration model. In practice, however, any method, or combination of methods, can be used to develop the final calibration model (Breiman, 1996a, b; Wolpert, 1992).
Two approaches to data augmentation are possible; the first relates to the addition of noise to both the process and reference variables, whilst the second requires the augmentation of noise to the process variables only. It has been shown that the augmentation of noise to both the reference and process variables was unsatisfactory (Conlin et al., 1998) for the industrial and simulated data sets investigated, despite this being the approach advocated by Lee and Holt (1994) and Raviv and Intrator (1996). The pragmatic reasoning behind this is, in part, that by augmenting the reference data, the very data requiring estimation is corrupted.
A second, more theoretical, explanation is given in Conlin (1996). This is based upon the concepts of sufficient process information and data object density, ideas that are intrinsically related to the concept of data sparsity. The idea of sufficient process information in the data can be sub-divided into sufficient information in the process variables and sufficient information in the reference variables. High data dimensionality is typically associated with the process variables, where there are hundreds of variables (wavelengths) compared to only a few data objects, whereas for the reference variables there are generally only one or two variables compared with a few data objects. By investigating the ratio of objects to reference variables, it is evident that there is little or no sparsity in the reference data set. Therefore the addition of noise to the reference variables appears to result, not in a reduction of data sparsity, but in a degradation of the ability of the calibration process to produce a model that can predict accurately. Thus a more robust calibration model can be attained without the addition of noise to the reference variables. For these reasons, this paper focuses upon the augmentation of noise to the process variables alone.

The first part of the paper introduces the technique of data augmentation. The methodology is demonstrated on a single industrial data set, details of which are presented in Section 3. Section 4 describes the different approaches to splitting the data into calibration and validation data sets. Following this, Section 5 reviews the results from the application of the data augmentation procedure to the spectra-wise split data set. Section 6 examines the potential benefits that can be obtained, i.e. increased accuracy and enhanced robustness, when the technique is used to generate calibration models from limited batch data, and finally Section 7 draws some conclusions and suggests further work.
2. DATA AUGMENTATION TECHNIQUE

The background and further details of the data augmentation technique are described in Conlin (1996), whilst Conlin et al. (1998) focuses upon the application of the methodology to an industrial data set. The following ten steps describe how a single Partial Least Squares (PLS) calibration model can be developed from a number of noise augmented data sets; a minimal code sketch of the procedure is given after the list.

1. Pre-process the data, i.e. remove objects where the reference value was either not, or incorrectly, recorded.

2. Divide the original data set into calibration and validation data sets.

3. Determine the standard deviation for each variable, in this case for each wavelength.

4. Determine the augmentation factor 'I', i.e. how many new data sets are to be created. Twenty data sets were used in this study. In practice, the augmentation factor will be specific to the data set under investigation but will theoretically lie between ten and thirty, based upon the argument of the Central Limit Theorem.

5. Determine the range over which the augmentation noise will be scaled, i.e. what percentage of the standard deviation of each of the variables in the original data will be used. The Gaussian augmentation noise is scaled according to the standard deviation of each particular wavelength, i.e. those wavelengths that vary the most have the most noise added.

6. For each data point, 'I' new points are created through the addition of Gaussian noise to each wavelength.

7. A PLS calibration model is then developed for each of the 'I' noise augmented data sets. The number of latent variables used was selected using leave-one-out cross-validation.

8. The 'I' calibration models are reduced to vectors of regression coefficients for the original wavelengths, which are then averaged to produce the final augmented calibration model for that particular noise level.

9. Steps 6 to 8 are repeated for different levels of augmentation noise. The augmentation noise was varied between 5% and 50% of the variation of the original data set, in steps of 5%, thus resulting in ten groups of 'I' noise augmented data sets.

10. The augmented calibration model is then used to predict the validation reference data.
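The following sketch illustrates steps 3 to 8. It is not from the original paper: it assumes scikit-learn's PLSRegression as the PLS implementation and NumPy for the noise generation, and the function names (augmented_calibration_models, augmented_predict) are illustrative rather than the authors' own.

```python
# A sketch of steps 3-8, assuming scikit-learn's PLSRegression (the paper
# does not name any software).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score


def augmented_calibration_models(X_cal, y_cal, noise_pct, n_sets=20,
                                 max_lv=10, seed=0):
    """Fit one PLS model per noise-augmented copy of the calibration data.

    X_cal: (n_objects, n_wavelengths) spectra; y_cal: (n_objects,) reference
    values; noise_pct: noise level as a percentage of each wavelength's
    standard deviation (step 5); n_sets: the augmentation factor 'I'
    (step 4, twenty in this study).
    """
    rng = np.random.default_rng(seed)
    sigma = X_cal.std(axis=0)  # step 3: per-wavelength standard deviation
    models = []
    for _ in range(n_sets):
        # Step 6: add zero-mean Gaussian noise, scaled so that the most
        # variable wavelengths receive the most noise.
        noise = rng.normal(0.0, 1.0, X_cal.shape) * (noise_pct / 100.0) * sigma
        X_aug = X_cal + noise
        # Step 7: choose the number of latent variables by leave-one-out CV
        # (cross_val_score returns negative MSE, hence the sign flip).
        mse = lambda lv: -cross_val_score(
            PLSRegression(n_components=lv), X_aug, y_cal,
            cv=LeaveOneOut(), scoring="neg_mean_squared_error").mean()
        best_lv = min(range(1, max_lv + 1), key=mse)
        models.append(PLSRegression(n_components=best_lv).fit(X_aug, y_cal))
    return models


def augmented_predict(models, X):
    """Step 8: because PLS is a linear predictor, averaging the 'I' model
    predictions is equivalent to averaging the 'I' fitted coefficient
    vectors (and intercepts), as the paper describes."""
    return np.mean([m.predict(X).ravel() for m in models], axis=0)


# Steps 9-10: sweep the noise level from 5% to 50% in steps of 5% and score
# each averaged model on the validation spectra, e.g.
#   for pct in range(5, 55, 5):
#       models = augmented_calibration_models(X_cal, y_cal, pct)
#       mvsse = np.mean((y_val - augmented_predict(models, X_val)) ** 2)
```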
3. DATA SET DETAILS The data set was collected from an industrial batch reaction process and comprises spectral and reference data on two related measures. These are the concentrations of certain reactants in a batch reaction. The analysis is only reported for one of the reference variables. The spectral data consists of NIR spectra taken during several experimental batches. The spectra were mainly recorded at the start and end of the experimental runs as these were the times of greatest interest in this particular batch process. At the start, the quality measures were used to confirm the correct loading of the reactor, whilst towards the end of the reaction the quality measures were utilised
to predict the end point of the reaction. Intermediate samples were also taken for analysis to provide a more comprehensive data set.
The range of wavelengths used was 1000-2100 nm with a 2 nm step size, giving 551 spectral data points. The data set consisted of 118 spectra taken from sixteen experimental batch runs. At this stage it is necessary to consider how the data is split into a calibration and a validation data set, as it is possible to split the data objects in two contrasting ways. The first approach involves the random allocation of the individual spectra to the calibration and validation data sets, referred to here as a spectra-wise split. This method of splitting the data is essentially unrealistic when applied to the type of data considered here, namely batch process data. A more acceptable approach involves the assignment of all the spectra from an individual experimental batch, which typically comprises between four and eleven spectra, to either the calibration or the validation data set. This is referred to as a batch-wise split.

4. ALTERNATIVE CALIBRATION AND VALIDATION DATA SPLITS

Based upon the above definitions, it is possible to define how the various calibration and validation data sets were created in this work. For the spectra-wise split, used in Section 5, the 118 spectra were randomly split to give a calibration data set of 80 objects, the remainder of the objects being placed in the validation data set. During the pre-processing and analysis of the data, a number of objects were removed due to missing and incorrect recording of certain values. This resulted in the final spectra-wise calibration and validation data sets comprising 74 and 37 objects respectively. In order to gain an initial feel for the potential of the data augmentation approach to handle small data sets, the original spectra-wise split data was reduced through the elimination of every other data object. Thus a second data set, comprising 37 and 19 objects in the calibration and validation data sets respectively, was generated.

The alternative approach, the batch-wise split, was investigated to assess the ability of the data augmentation technique to provide a robust calibration model from only a few batches of process data (see Section 6). Initially, ten batches were placed in the calibration data set, instead of spectra from all sixteen batches, and the remaining six batches were used for validation. It is noted that the total number of spectra in the calibration and validation data sets was the same as for the spectra-wise split of the data. The data was first examined and the spectra from those batches giving either the maximum or minimum values for the reference variable were placed in the calibration data set. After this initial step, the remaining batches were assigned at random to the calibration and validation sets until the data sets were approximately of equal size to those of the spectra-wise approach, i.e. n_c ≈ 74 and n_v ≈ 37. The exact details of the split are given in Table 1, where the letters in the final column indicate whether a particular batch was allocated to the calibration data set, C, or the validation data set, V.

Table 1 Batch data details

Batch Number   No. of Spectra in Batch   Calibration / Validation
1              11                        C
2              7                         V
3              7                         V
4              8                         V
5              3                         V
6              4                         C
7              9                         C
8              10                        C
9              7                         V
10             6                         C
11             5                         V
12             11                        C
13             7                         C
14             7                         C
15             4                         C
16             5                         C

The industrial interest is in the ability of the methodology to produce a robust calibration model from a minimal number of batch reactor runs. After discussions, it was decided to reduce the number of batches used to develop a calibration model to three. The calibration batches were selected using the method outlined above, while the validation batch was selected at random. The remaining twelve batches were used to simulate later at-line application of the calibration models.

Table 2 Batch Allocation for Minimal Data Set

Calibration         1, 8, 14
Validation          9
Simulated at-line   2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 15, 16

Table 2 details which batches were placed in the calibration, validation and simulated at-line data sets. The numbers relate to the batch numbers given in Table 1. Details of the analysis of these two batch-wise split data sets are given in Section 6.
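The two splitting schemes described above can be sketched as follows. The data layout is assumed, not taken from the paper: `batches` holds one batch label per spectrum, so a batch-wise split keeps all spectra from a batch in the same data set, and the batch-wise helper follows the selection rule above (extreme-reference batches forced into calibration, the rest assigned at random).

```python
# Illustrative splitting helpers; names and signatures are assumptions.
import numpy as np


def spectra_wise_split(n_spectra, n_cal, rng):
    """Randomly allocate individual spectra to calibration/validation."""
    order = rng.permutation(n_spectra)
    return order[:n_cal], order[n_cal:]


def batch_wise_split(batches, y, n_cal_batches, rng):
    """Force the batches containing the extreme reference values into the
    calibration set, then assign the remaining batches at random."""
    forced = {batches[int(np.argmin(y))], batches[int(np.argmax(y))]}
    rest = np.array([b for b in np.unique(batches) if b not in forced])
    rng.shuffle(rest)
    cal = list(forced) + list(rest[: n_cal_batches - len(forced)])
    in_cal = np.isin(batches, cal)
    return np.where(in_cal)[0], np.where(~in_cal)[0]
```

For the minimal data set of Table 2, the same helper applies with n_cal_batches reduced to three, the held-out batches then being retained to simulate at-line use.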
5. ERROR REDUCTION THROUGH DATA AUGMENTATION
The results from the application of the data augmentation approach to a spectra-wise data split are presented to demonstrate the power of the technique. The error measure used in this work, unless otherwise stated, is the mean sum of squared error for the validation data set, MVSSE.

Table 3 Results for Data Augmentation

Noise Augmentation (%)   MVSSE, Full Data Set   MVSSE, Reduced Data Set
0†                       8.39                   18.21
5                        8.35                   15.79
10                       8.42                   14.23
15                       8.10                   14.85
20                       8.60                   12.41
25                       9.54                   11.94
30                       10.16                  11.50
35                       12.30                  *
40                       15.95                  *
45                       20.91                  *
50                       26.98                  *
In Table 3, '†' indicates the mean validation sum of squared error based upon a calibration model generated without the addition of augmentation noise, whilst '*' indicates that that particular level of augmentation noise was not applied to the calibration data set. It was found that for both the full and reduced data sets an improvement in the errors for the validation data set resulted. For the full data set, these are summarised in Table 3 and Figure 1.
Fig. 1 Full data set with process variable augmentation only

The level of process variable augmentation noise for the full data set which resulted in the minimum value for the MVSSE was 15%. For the reduced data set, a marked improvement in the results for the validation data set was observed based upon the calibration model generated from the addition of 30% augmentation noise to the original spectra (Table 3 and Figure 2). This resulted in a 40% drop in the prediction error over the original non-augmented PLS calibration. Based upon a calibration model derived from the augmentation of 30% noise to the original data, comparable results to those obtained for the PLS model trained on the full uncorrupted original data set were achieved for the validation data set. This implies that the use of noise augmentation to the original data can produce calibrations almost as good as those obtained from the collection of twice the amount of calibration data. This could result in significant cost reductions associated with reactor product testing and commissioning.

Fig. 2 Reduced data set with process variable augmentation only
This paper has so far demonstrated that, in an industrial setting where the acquisition of data can prove difficult, expensive and/or time-consuming, the addition of noise to the available process data can produce calibration benefits. These take the form of calibration models with enhanced robustness, resulting in improved prediction for the reference validation data. With these more accurate, robust calibrations, it should then be possible to improve the manufacturing capabilities of the reactor, and hence the profitability, as a result of a reduction in the batch testing phase. However, to fully realise the potential of data augmentation based calibrations, it is necessary to consider a batch-wise split of the data set.
6. MINIMAL DATA SETS AND DATA AUGMENTATION
The batch-wise split data was also analysed using Partial Least Squares and the results compared with the spectra-wise split (Table 4). The errors are reported in terms of the mean sum of squared error for the training (calibration) data set (MTSSE), the validation data set (MVSSE), and, in the final column, the overall error measure, which involved summing
the calibration and validation errors prior to calculating the mean value (MSSE). The overall fit in terms of the MSSE was similar for both the spectra-wise and the batch-wise splits, with the batch-wise split being only slightly larger in magnitude.
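Written out, using the n_c, n_v notation of Section 4 (this formalisation is inferred from the text, and is consistent with the tabulated values, which weight the two contributions by the number of objects):

$$\mathrm{MTSSE} = \frac{1}{n_c}\sum_{i=1}^{n_c}\bigl(y_i - \hat{y}_i\bigr)^2, \qquad \mathrm{MVSSE} = \frac{1}{n_v}\sum_{j=1}^{n_v}\bigl(y_j - \hat{y}_j\bigr)^2,$$

$$\mathrm{MSSE} = \frac{n_c\,\mathrm{MTSSE} + n_v\,\mathrm{MVSSE}}{n_c + n_v}.$$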
Table 4 Error comparison for alternative data splits

Split     MTSSE   MVSSE   MSSE
Spectra   12.78   8.39    11.32
Batch     9.92    17.40   12.41
For the batch-wise data split, the errors for the calibration data set (MTSSE) are lower than those for the spectra-wise split, whilst the validation errors (MVSSE) are higher. These results can be explained by considering the nature of the data. Each batch describes a separate event; thus by sampling across all sixteen batches a more general model is obtained, and consequently the calibration error will be higher whilst the validation error will be lower. If, on the other hand, the data is sampled in a batch sense, then the model built will be more specific, providing lower calibration errors but higher validation errors. However, by increasing the number of batches in the calibration data set as more data becomes available, this initial limitation of the approach will be addressed. Alternatively, by analysing the data in a batch-wise sense the full potential of data augmentation can be realised. In practice, the resultant calibration model could be updated as information from each new batch becomes available, thus improving the robustness and predictive ability of the calibration model.
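That updating scheme could be sketched as below. This is an illustration of the suggestion above, not a procedure from the paper, and it reuses the hypothetical augmented_calibration_models function from the Section 2 sketch: each newly completed batch is appended to the calibration pool and the augmented model is refit.

```python
# Illustrative batch-wise update; noise_pct would be tuned per data set.
import numpy as np


def update_calibration(X_cal, y_cal, X_new, y_new, noise_pct=15.0):
    """Append a new batch's spectra/reference values and refit."""
    X_cal = np.vstack([X_cal, X_new])
    y_cal = np.concatenate([y_cal, y_new])
    models = augmented_calibration_models(X_cal, y_cal, noise_pct)
    return models, X_cal, y_cal
```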
Finally, the results for the data set comprising three calibration batches and one validation batch are given in Table 5, with the combined MSSE plotted in Figure 3. The mean sum of squared errors for the original (i.e. not augmented) data set compare favourably with those reported in Section 5 for the full and halved data sets. The use of data augmentation does result in lower prediction errors being achieved.

Table 5 Results for minimal data set calibration

Noise Augmentation (%)   Individual MSSE            Combined MSSE
                         Calibration   Validation
0                        10.96         15.04         11.78
5                         6.52         17.45          8.70
10                        5.92         16.95          8.13
15                       10.10         14.61         11.00
20                        7.60         14.40          8.96
25                       13.81          6.52         12.36
30                       15.76          5.68         13.76
35                       21.50          7.02         18.60
40                       22.84          5.58         19.39
45                       20.49          5.54         17.50
50                       27.32          7.88         23.43
55                       35.15         12.37         30.59

Fig. 3 Combined mean sum of squared error (MSSE) plot

Table 6 Prediction errors for the remaining batches

Noise Augmentation (%)   MSSE
0                        17.27
5                        18.48
10                       19.26
15                       16.45
20                       17.25
25                       16.35
30                       16.47
35                       18.35
40                       23.02
45                       21.48
50                       25.89
55                       31.78
Fig. 4 Prediction errors for the remaining batches

However, the question as to which augmentation noise level produces the best calibration model needs to be addressed. From Figure 3 and Table 5, the augmentation noise level giving the lowest error for the calibration data set is seen to be 10%. However, once the errors for the remaining batches are considered (Table 6 and Figure 4), the picture becomes
a little less clear. This is because the use of a 10% noise level produces poorer predictions for the remaining batches, whilst an augmentation noise level in the range of 15 to 25% would have resulted in better predictions. The effect of varying the augmentation factor, 'I', was investigated in Conlin (1996). It was found that for a simulated data set the optimum 'I' was approximately ten. It is stressed, however, that this optimum will vary with the data set. Thus it is possible that, if a higher 'I' had been used in the calibration step, the results might have been less ambiguous. From the study, it appears that a satisfactory calibration model can be obtained through the use of small data sets, as demonstrated by the use of three batches to generate the calibration model and one batch for validation.
7. CONCLUSIONS

The potential of data augmentation for the development of a robust calibration model has been demonstrated on an industrial NIR data set. The addition of Gaussian noise to the original data set has been shown to be effective in reducing the prediction error for a validation data set. From the results, there is a clear indication that the larger the data set, the smaller the level of noise augmentation required. If the data set comprises sufficient information, then the addition of augmentation noise will only serve to degrade the performance of the calibration model. For small data sets the converse is true: higher levels of noise augmentation are required, but there is a limit beyond which the model performance will once again be degraded; this is currently under investigation.
The main objective of the paper has been to show that, through the use of data augmentation, satisfactory models can be derived from very small data sets. The reduced data sets included in this analysis have been based upon reducing the number of calibration objects to half the original number of spectra; further reductions have entailed handling the data in a batch-wise sense. Here the data was reduced to just three batches from sixteen, and from seventy-four spectra to just twenty-eight. This split can be achieved in one of two ways: either treating the data spectrum by spectrum, or using the spectra for a batch as a pre-defined group and then selecting groups to give the required reduction in data set size. The main problem with the spectra-wise approach is that all sixteen batches were required to obtain a calibration model, as the spectra used in the calibration set were obtained from all the batches. However, by adopting a batch-wise split, it is possible to reduce the number of batches required to give an acceptably robust and accurate predictor. In this example three batches formed the calibration data set and one batch the validation data set. Through the application of the data augmentation approach, it has been shown that a reduction in the validation error is achievable if small data sets are used. Further work involves more detailed analysis of this approach (Conlin, 1996). Also on-going is the application of data augmentation to a wider variety of data sets in order to determine the effect of the various user-defined parameters.
8. ACKNOWLEDGEMENTS

The authors acknowledge the support of the following organisations: the Centre for Process Analysis, Chemometrics and Control; the EU BRITE/EURAM Project INTELPOL, No. BE-7009; the Consortium Partners; and the EPSRC for AKC's PhD Studentship.

REFERENCES

Breiman, L. (1996a). Stacked regressions. Machine Learning, 24, 1, 49-64.

Breiman, L. (1996b). Bagging predictors. Machine Learning, 24, 2, 123-140.

Conlin, A.K. (1996). Complex sensor data analysis through data augmentation. PhD Thesis, University of Newcastle.

Conlin, A.K., E.B. Martin and A.J. Morris (1998). Data augmentation: An improved approach to the analysis of spectroscopic data. Presented at SSC-5, Lahti, Finland; accepted for publication in Chemometrics and Intelligent Laboratory Systems.

Efron, B. and R. Tibshirani (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Lee, S.E. and B.R. Holt (1994). A consistent model building procedure for artificial neural networks with limited process data. Dept. of Chemical Engineering, University of Washington, Seattle. Personal communication.

Raviv, Y. and N. Intrator (1996). Bootstrapping with noise: An effective regularization technique. Connection Science, 8, 3-4, 355-372.

Wolpert, D.H. (1992). Stacked generalisation. Neural Networks, 5, 241-259.