Computers and Chemical Engineering 29 (2005) 1647–1659
Variable selection and data pre-processing in NN modelling of complex chemical processes

Stavros Papadokonstantakis a, Stephan Machefer b,∗∗, Klaus Schnitzlein b,∗, Argyrios I. Lygeros a

a School of Chemical Engineering, National Technical University of Athens, Athens, GR-157 80, Greece
b Department of Chemical Reaction Engineering, Brandenburg Technical University, Cottbus, Burger Chaussee 2, 03044 Cottbus, Germany

∗ Corresponding author. Tel.: +49 355 691111; fax: +49 355 691110.
∗∗ Co-corresponding author.
E-mail addresses: [email protected] (S. Machefer); [email protected], [email protected] (K. Schnitzlein).
Received 16 March 2004; received in revised form 9 December 2004; accepted 31 January 2005. Available online 15 April 2005.
doi:10.1016/j.compchemeng.2005.01.004
Abstract

Neural network models nowadays represent a powerful tool for the identification of complicated processes. However, since they belong to the category of data-driven "black box" models, they cannot avoid the consequences of the "garbage in–garbage out" rule. This work proposes a simultaneous data balancing-variable selection procedure, which is based on traditional statistical techniques and modern information theoretic approaches. It is implemented on a complicated dataset of restricted quality, which refers to a commercial aldol condensation unit (BASF). Based on the pre-processed database, a neural model for the prediction of the process yield has been developed. The results verify the importance of the pre-processing stage in terms of generalization accuracy as well as of a simpler network structure due to the data-variable selection procedure. Finally, an analysis of the model trends has been carried out to assess qualitative characteristics of the model, which was then used in industrial test runs and resulted in an improvement of the process operation.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: Neural networks; Process modelling; Data preprocessing; Variable selection; Variable entropy; Variable information
1. Introduction

Neural networks (NNs) belong to the most promising modelling techniques of our time. As "universal approximators" they can handle non-linear multivariate systems of any complexity level, while as "black box" models they do not require extensive knowledge about the process to be modelled. Instead, they are based on databases and are tolerant to the faults and noise included in them (Hornik, Stinchcombe, & White, 1989). Although they are not new in concept, the interest in them has increased significantly in the last decade, mainly due to the tremendous evolution of digital computing. Some published applications of neural networks in chemical engineering topics are concerned with fault diagnosis in chemical plants (Venkatasubramanian & Chan, 1989), system identification and control (Polland, Broussard,
Garrison & San, 1992; Psichogios & Ungar, 1991), sensor data analysis (Piovoso & Owens, 1991), and chemical composition analysis (Weixiang, Dezhao, & Shangxu, 1998). As a result of their modelling success, a number of software packages aimed at the design and development of neural networks have become available. Most of these packages include a variety of options concerning the design of the neural network architecture, the parameters of the training algorithm, the stop criteria and the model analysis. Above all, they provide a user-friendly environment, which hides the intricacies of the complicated mathematical and computational training procedure from the user and makes neural network development much easier. However, the neural network modelling task also involves the data pre-processing stage, which can be decisive for the development of a successful model: it must always be kept in mind that, no matter how powerful the neural network modelling technique may be, it cannot escape the "garbage in–garbage out" curse of "black box" modelling. More specifically, as far as commercial databases are concerned, raw data obtained from plant operation should not
be used unprocessed in identification studies. First of all, the special cases of the start-up and shutdown of the process have to be recognized and the respective data removed. Secondly, outliers should be detected in order to avoid using data that correspond to measurement errors or operation faults. One should always be aware of the fact that not all outliers are of the aforementioned nature; they may also refer to extreme yet possible process set-ups and therefore contain useful information. The cooperation of the neural network model developer with the process engineers is always of great importance for all these aspects, as well as for the variable selection. The proper variables have to be selected out of a large number of potential input variables, which may also be strongly correlated with each other due to the process control system. Furthermore, for effective modelling the data must be information rich over the process operation range (Neelkanten & Guiver, 1998). A well-balanced data set may cure some of the problems arising from the fact that the ranges of the variables in industrial data sets are usually restricted because of the operational limitations set by the process control system. From the short description made above, it is clear that the data pre-processing stage is in fact a complicated task of data and variable selection, in which a substantial number of aspects must be taken into account. Because of the complex, multidimensional problems to which neural network modelling is usually applied, no standard data pre-processing procedure has been developed so far that treats all these aspects and could therefore be integrated and automatically performed in a neural network software package. Instead, it is very common that users, relying to a great extent on their intuition and on the idiosyncrasies of a specific data set, follow case-oriented data pre-processing methods. In this paper, a data pre-processing method is proposed that combines typical statistical techniques (Hair, Anderson, Tatham, & Black, 1998) with information theoretic approaches for variable selection and data preparation in neural network modelling (Sridhar, Barlett, & Seagrave, 1998). This method deals with the data selection procedure in a systematic way, setting various variable selection levels simultaneously and indicating at which points the experience of the process engineers is valuable. A database regarding a commercial aldol condensation process has been used for the implementation of the method as well as of the neural network modelling technique. The product of this process constitutes an important intermediate for the production of fungicidal agro-chemicals. The complexity of the modelling task, which makes a neural network modelling effort worthwhile, arises mainly from the side reactions in combination with the crystallization occurring in the semi-batch reactor, as will be discussed later. Finally, the paper deals with the effect of the proposed pre-processing methods on the model accuracy and generalization capability, analyzes the performance of the model in terms of theoretical consistency in some case studies and validates the model predictions with operational test runs on the plant.
2. Combinatory data pre-processing method

Typically, data pre-processing methods for neural network modelling consist of two stages. In the first stage, trivial techniques for missing data and outlier detection perform a first-level data screening. Then, based on these results, more sophisticated multivariate techniques are implemented in order to decrease the dimensionality of the input variable space (variable selection) and to homogenize the information distribution of the input and output variables (data sample selection). Usually, the techniques of this second stage are either statistical or heuristic in nature, principal components analysis (PCA) and clustering techniques being two typical representatives, respectively. However, techniques of both kinds require expertise from a process engineer in order to be implemented and comprehended properly. On the one hand, PCA in its linear or non-linear versions, although based on a sound statistical theory, needs some of its parameters to be set heuristically or by a trial-and-error procedure, e.g. the number of principal components extracted (Hair et al., 1998). Furthermore, the fact that it achieves input dimensionality reduction by grouping the variables into principal components, which are the model inputs, may have the advantage of using all the available variables, but makes the comprehension and analysis of the model considerably less straightforward. This is an important drawback when a "black box" model is going to be used for testing optimization scenarios. On the other hand, clustering techniques, besides their more heuristic nature, involve the setting of parameters with unclear process-related meaning, e.g. distance measures, type of clustering algorithm, number of derived clusters (Hair et al., 1998), and are therefore less comprehensible to the non-expert process engineer. The approach proposed here differs significantly in this second stage, since it is based on information theory and more specifically on so-called entropy measures for both data sample and variable selection. The main advantage of this approach is that the parameters that need to be set are more easily understood by a process engineer than those of the aforementioned techniques, helping him in this way to contribute with his experience to the data pre-processing procedure. However, the combination with the techniques of the first stage as well as with simple statistical correlation measures is still important, since it diminishes the influence of unrepresentative data samples, makes the results more stable and compensates for the smaller but inevitably existent heuristic part of the method. The entire procedure is presented in Fig. 1.

2.1. First stage procedures

For the two steps of the first stage, namely missing data and outlier detection, various methods are available in the literature (Hair et al., 1998). The typical procedure of eliminating the respective cases is best suited when the extent of missing data is small compared to the available data, and one would not risk biasing the data by applying any of the available replacement methods. However, the interaction of the data analyst with the process engineers is valuable in order for the reasons behind missing values to become clear and for other procedures for handling them to be considered. More important is the contribution of the process engineer to the detection of outliers. This may initially be achieved, for instance, with the univariate method of standard scores (z-scores). The data values of each variable are converted so as to have a mean of 0 and a standard deviation of 1, and a threshold value of 3 is used as a guideline for outlier detection, namely when the z-score of a variable exceeds this range, the respective case is considered to be an outlier (Hair et al., 1998). Whether such a case is included or not in further analysis is highly subjective. The interaction with the process engineer may verify a shutdown or start-up procedure during that period, faults in the measurement system or even process experimenting. If the procedures followed in these two steps result in a very narrow range for a candidate input variable of the modelling task, it is excluded from further analysis ("first-level variable selection"). As shown in Fig. 1, the key term of variable resolution determines whether a range is narrow or not. This parameter is also essential for the rest of the analysis (stage 2) and refers to the measurement accuracy and to the sensitivity of the process behaviour to variable changes. These terms are much better understood by an engineer with experience in the process, and therefore this parameter can be set rather objectively.
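For concreteness, the following is a minimal sketch of the univariate z-score screening described above, written in Python with NumPy; the function and variable names are ours and the data are purely illustrative, and whether a flagged sample is actually discarded remains a decision to be taken together with the process engineers.

```python
import numpy as np

def zscore_outlier_mask(x, threshold=3.0):
    """Flag samples whose standardized value exceeds the guideline threshold of 3."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

# Illustrative use on one candidate variable (a yield-like quantity):
rng = np.random.default_rng(0)
values = rng.normal(loc=86.0, scale=0.8, size=200)  # synthetic, roughly constant operation
values[10] = 92.0                                   # an injected faulty measurement
suspects = zscore_outlier_mask(values)
print(np.where(suspects)[0])  # indices to review with the process engineers, not to delete blindly
```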
Fig. 1. The steps of the data pre-processing procedure.
2.2. Second stage procedures

After the first-stage data screening, information theoretic approaches are used in order to create a balanced data set. The measure used in this paper is the entropy, or information, presented by a variable. More specifically, in the case of continuous variables the calculation of this measure requires a division of the domain space of the variable into a finite number of intervals. The variable is considered to have a constant value in each interval, thereby converting a continuous variable X into a discrete one, which can take on S values {x_1, x_2, ..., x_i, ..., x_S}. If N is the number of available samples in a data set, and N_i the number of cases in which X = x_i, then the probability of this event is calculated by Eq. (1):

p_i = P(X = x_i) = N_i / N    (1)

and the entropy H(X) of the variable X is defined by Eq. (2):

H(X) = Σ_i Z_i,  where  Z_i = p_i ln(1/p_i) if p_i ≠ 0, and Z_i = 0 if p_i = 0    (2)

If X can take only one value, then H(X) = 0, indicating that the variable contains no useful information for a "black box" modelling task. This is actually the equivalent of eliminating a variable because of a very narrow range, which was mentioned before as "first-level selection". The information in X is maximized when there is equal probability of occurrence of each of the S possible values; this maximum value is H(X)_max = ln(S) (Swingler, 1996). The entropy percentage EP = 100 H(X)/H(X)_max expresses how close the entropy of the variable is to these two limits. Fig. 2 depicts such an example. A and B may be considered as two cases of distribution for the same variable X, which takes values in a specific range determined by the variable minimum and maximum. Ten intervals are used for the division of the domain space in both cases, and the significantly narrower distribution in case A is clearly reflected by the EP measures (43% and 87%, respectively). For the analysis made above, the variable range and the number of intervals into which it is divided (the S possible discrete variable values) are the basis for all necessary calculations. The variable range is determined by the operating conditions of the process and the methods followed in the first stage.
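As an illustration of Eqs. (1) and (2), the sketch below computes the EP measure for one variable, given its valid range and resolution; the helper name and the discretization details (simple equal-width binning over the range) are our own assumptions.

```python
import numpy as np

def entropy_percentage(x, resolution, lo=None, hi=None):
    """EP = 100 * H(X) / ln(S) for a variable discretized at the given resolution.

    x          : 1-D array with the samples of one variable.
    resolution : interval width used to discretize the variable range.
    lo, hi     : valid range of the variable (defaults to the observed min/max).
    """
    lo = x.min() if lo is None else lo
    hi = x.max() if hi is None else hi
    n_intervals = max(int(np.ceil((hi - lo) / resolution)), 1)   # S possible discrete values
    counts, _ = np.histogram(x, bins=n_intervals, range=(lo, hi))
    p = counts / counts.sum()                 # p_i = N_i / N, Eq. (1)
    p = p[p > 0]                              # Z_i = 0 whenever p_i = 0, Eq. (2)
    entropy = -(p * np.log(p)).sum()          # H(X) = sum_i p_i ln(1/p_i)
    h_max = np.log(n_intervals)               # maximum entropy ln(S)
    return 100.0 * entropy / h_max if h_max > 0 else 0.0
```

A uniformly spread variable gives an EP close to 100%, while a narrow, peaked distribution such as case A in Fig. 2 gives a much lower value.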
Fig. 2. The entropy percentage (EP) measure in case of a narrow (A) and a wide (B) distribution (EPmax denotes the ideal variable distribution).
Fig. 3. The data elimination procedure according to the EP-algorithm (N is the initial number of samples in the database).
To determine the number of intervals, one needs to set a value for the variable resolution, which is a highly process-related parameter. This means that usually its selection is limited and the engineer is always aware of its physical meaning. After this parameter is set, the data pre-processing procedure enters the second stage by testing whether the variables are well distributed over their range (EP calculation). The EP values calculated in this step are used as the baseline for the effort to improve them, which takes place in the next step. If the quantity of samples is satisfactory, this may be done by sample elimination, otherwise sample duplication may be used (Swingler, 1996). In this paper we test the sample elimination approach (EP-algorithm), which is presented in Fig. 3 and sketched in code below. This algorithmic scheme is implemented on each variable separately. In the first iteration (j = 0) the elimination of the first sample (i = 1) is tried and the respective EP measure is calculated. Then the sample is restored in the database and the elimination of the next sample is tested. When all N samples have been considered during this first iteration, the one that led to the maximum EP value is eliminated (elimination set ES_1, j = 1), reducing the data quantity to N − 1. The algorithm then examines whether (a) the EP value has reached a predetermined threshold (thr1), (b) the EP value has started to decrease, or (c) too many samples have been eliminated (thr2). If none of these conditions is satisfied, the algorithm continues with the next iteration, which repeats the procedure described above, attempting to reduce the remaining samples by one more. At the end of each iteration, j is equal to the number of data samples eliminated so far. Gradually, as j grows, the effect of the elimination sets ES_1, ES_2, ..., ES_j is tested, ES_j consisting of the set of samples eliminated in the previous iteration (ES_{j−1}) plus one of the remaining samples, after all possible combinations have been tested and the set of conditions is satisfied.
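A compact sketch of this greedy elimination loop is given below; it reuses the entropy_percentage helper from the previous sketch. The brute-force search mirrors Fig. 3 for clarity rather than efficiency, and thr2 is interpreted here as a cap on the number of eliminated samples (the text also admits reading it as a floor on the number of remaining samples).

```python
import numpy as np

def ep_eliminate(x, resolution, thr1=90.0, thr2=100, lo=None, hi=None):
    """Greedily remove the sample whose elimination maximizes EP (cf. Fig. 3).

    Stops when EP reaches thr1, when EP no longer improves,
    or when thr2 samples have been eliminated.
    Returns the indices (into the original array) of the eliminated samples.
    """
    remaining = list(range(len(x)))          # x is a 1-D NumPy array for one variable
    eliminated = []
    best_ep = entropy_percentage(x, resolution, lo, hi)
    while len(eliminated) < thr2 and best_ep < thr1:
        trial_eps = [
            entropy_percentage(np.delete(x[remaining], k), resolution, lo, hi)
            for k in range(len(remaining))
        ]
        k_best = int(np.argmax(trial_eps))
        if trial_eps[k_best] <= best_ep:     # condition (b): EP no longer improves
            break
        best_ep = trial_eps[k_best]
        eliminated.append(remaining.pop(k_best))
    return eliminated
```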
From the results of this algorithm a "second-level variable selection" may be made, namely a variable may be excluded from the candidate input variables of the modelling task if its EP value has not reached thr1, or if it requires a large number of samples to be eliminated in order to achieve a satisfactory EP value. The algorithm presented above detects a set of samples (ES(X)_j) that should be eliminated in a one-dimensional way, namely for variable X. After the algorithm has been applied to all the variables, the union of the sets for the variables that passed the second-level screening is eliminated from the database. Two potential problems may arise: (a) the achieved EP level for each variable may be lower than that after the one-dimensional implementation of the algorithm and (b) too many samples may be eliminated. Although the first- and second-level variable selections soften these problems, a potential remedy is also the readjustment of the two threshold values (thr1 and thr2), so that fewer data samples are rejected in the first place. The meaning of these two thresholds is easily understood in modelling terms and therefore the process engineer who implements this method can feel rather confident about the values tested. After this step of the second stage, the goal is to have a well-balanced and information-rich data set regarding the variables selected in the second level. However, depending on the data quantity and, more specifically, on the "data to variables ratio", one may wish to adjust the input space dimensionality further by excluding even more variables or by including some of the variables for which no clear decision was made according to the procedure described above. This may be done with the information theoretic subset selection (ITSS) algorithm. The conditional entropy of variable Y, when variable X is given, expresses the information in the target variable Y that remains unexplained when the input variable X is known (X, as a vector, denotes the set of selected input variables).
Therefore, the goal of the ITSS approach is the minimization of this conditional entropy or, alternatively, the maximization of the following ratio U(Y|X), which is a normalized expression of the mutual information between Y and X:

U(Y|X) = [H(Y) − H(Y|X)] / H(Y) × 100%    (3)

where H(Y) is the entropy calculated for the dependent variable Y and H(Y|X) the conditional entropy, X being a candidate subset of the input variables for the prediction of Y. The generation of the candidate subsets of variables can be based on simple or more sophisticated search procedures. In this paper the simple forward selection procedure is followed, which generates a new subset by adding one input variable at a time to the current subset, so that the new, enlarged input vector maximizes the ratio U(Y|X) (a code sketch of this forward selection is given at the end of this subsection). The steps of the algorithm are described below:

(1) Calculate all U(Y|X_1), X_1 consisting each time of one variable.
(2) Choose that X_1 which gives the maximum value of U(Y|X_1).
(3) Calculate all U(Y|X_2), X_2 consisting each time of two variables, one of which is the variable already chosen in step 2.
(4) Choose that X_2 which gives the maximum value of U(Y|X_2).

Continue with triplets and so on until all the variables are included or U(Y|X) ≈ 100%. The "third-level variable selection" depends on the results of this algorithm. It may indicate that a subset of the variables selected in the second level is enough for the prediction of the target variable, or that more variables must be added to compensate for the missing part of the information. Before making this final variable selection, it is useful to consider the correlation matrix, in order to be aware of the interdependencies between the final set of variables. Considering the correlation matrix at this point, where the data set has already been qualitatively improved, rather than in the original data set, helps to obtain more reliable results, as will be shown in Section 4.
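The following is a minimal sketch of this forward selection based on Eq. (3), again for discretized variables; the discretization helper, the entropy estimate from joint label counts and all function names are our own choices and not those of the original ITSS implementation (Sridhar et al., 1998).

```python
import numpy as np

def discretize(x, resolution):
    """Map a continuous variable onto interval indices of width `resolution`."""
    return np.floor((x - x.min()) / resolution).astype(int)

def entropy_of(labels):
    """Empirical entropy of (possibly multi-column) discrete labels."""
    _, counts = np.unique(labels, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def u_ratio(y, xs):
    """U(Y|X) = 100 * (H(Y) - H(Y|X)) / H(Y), Eq. (3); y and xs are discretized columns."""
    h_y = entropy_of(y.reshape(-1, 1))
    x_block = np.column_stack(xs)
    h_y_given_x = entropy_of(np.column_stack(xs + [y])) - entropy_of(x_block)  # H(Y|X) = H(X,Y) - H(X)
    return 100.0 * (h_y - h_y_given_x) / h_y

def forward_itss(y, candidates):
    """Greedy forward selection; `candidates` maps variable name -> discretized column."""
    selected, history = [], []
    while len(selected) < len(candidates):
        scores = {name: u_ratio(y, [candidates[s] for s in selected] + [col])
                  for name, col in candidates.items() if name not in selected}
        best = max(scores, key=scores.get)
        selected.append(best)
        history.append((best, scores[best]))      # cumulative information curve, cf. Fig. 9
        if scores[best] >= 100.0 - 1e-9:
            break
    return history
```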
3. Process description

A brief description of the present type of process is given in order to illustrate the complexity level of the modelling task as a typical field of application for NN modelling.

3.1. Complex reactive crystallization

The reaction system of concern includes the aldol addition of two aldehydes, R1CH2CHO (E1) and R2CHO (E2), and the subsequent elimination of water due to the presence of an alkaline catalyst. The primary aldol condensation product is R2CH=C(CHO)R1 (P1).
Fig. 4. The most relevant reaction paths in the present aldol condensation process.
In aldol reactions the occurrence of side reactions is of particular relevance: the more species that can be enolized, the more secondary products have to be expected. Even though only R1CH2CHO can form an enol, since R2CHO has no hydrogen atoms in the α-position, multiple supplementary reactions may be expected in which R1CH2CHO reacts as the nucleophile in the present system (Clayden, Greeves, Warren, & Wothers, 2001). Actually, two secondary products are observed. The self-addition of R1CH2CHO leads to the parallel product R1CH2CH=C(CHO)R1 (P2). In a consecutive reaction step the primary product R2CH=C(CHO)R1 reacts as an electrophile with the enolized R1CH2CHO, forming R2CH=C(R1)CH=C(CHO)R1 (P3). Fig. 4 summarizes the relevant primary and secondary reactions. Every reaction product, regardless of whether it is primary or secondary, can be expected to form crystals. The appearance of side reactions in combination with crystallization, which includes nucleation, crystal growth, agglomeration, breakage, etc., increases the complexity of the system dramatically. Nearly all process variables affect both reaction and crystallization. In particular, the intricate influence of crystallization on the selectivity, which is crucial for the current system, has hardly been modelled so far. In Berry and Ng (1997) and Kelkar and Ng (1999) merely approximative modelling approaches and phase diagrams in transformed coordinates have been developed for complex reactive crystallization systems. Since these approaches are only sufficiently accurate for design purposes and have not been validated so far, they may only be partly transferable to rigorous modelling attempts. A further challenge in reactive crystallization modelling is the implementation of the hydrodynamics in order to include mixing limitations, which determine effects like mass transfer, segregation, local supersaturation, crystal agglomeration and breakage. This has been shown to be the crucial factor for model accuracy and scale-up performance (Zauner & Jones, 2002). Apart from the modelling efforts, neither the solubilities of the products P1, P2 and P3 in the reaction mixture nor any crystallization-related kinetic information is known. The experimental effort to obtain the corresponding model parameters would be tremendous.
3.2. Semi-batch operation

The extent of side reactions can be reduced by keeping the concentration of E1 low. This is usually accomplished by applying semi-batch operation. E1 is fed slowly to a mixture of solvent, catalyst and E2, which has previously been filled into one of the reactors. During and after the E1 feed the product crystals are formed. Having attained the final specification, the reactor content is sent through a liquid–solid separation unit. Parts of the mother liquor as well as of the washing liquor are used as reaction solvent for the next charge. As batch operations involve intensive cleaning cycles, two reactors, denoted A and B, are run in alternation. A principal process scheme is given in Fig. 5. The development of a deterministic dynamic model for the present process would be an extremely sophisticated task to accomplish due to the high experimental requirements, the computational effort arising from the necessity to consider fluid dynamics, and the lack of elementary knowledge concerning the complex reactions accompanied by crystallization.
4. Commercial database—implementation of pre-processing methods
Fig. 5. Sketch of the present reactive crystallization process.
The training and validation of the NN model were based upon industrial data provided by BASF (Schwarzheide). The 19 candidate input variables and the single output variable (the primary product yield of P1, simply denoted as P in the following) are presented in Table 1. From the description of the variables it is clear that operational variables of the reactor and the separation unit as well
as laboratory variables relative to the reagent quality have been taken into account. The commercial database covers a period of 4 years and consists of 1453 data samples (reactor runs) for both reactors. In selecting data for model building, however, it is important to ensure that they represent normal operating states, in order to avoid spurious predictions from unusual conditions. Therefore, before the implementation of more sophisticated pre-processing methods, the cases of unit start-up and shut-down as well as those corresponding to process faults were excluded from the study.
Table 1
Input variables, their ranges and resolution

Variable description | Notation (units) | Resolution | Initial range (both reactors) | Outliers removal, reactor A | Outliers removal, reactor B | Final range, reactor A | Final range, reactor B
Primary product yield | P (% w/w) | 0.5 | 82.0–92.0 | 84.0–89.0 | 84.1–89.0 | 84.0–89.0 | 84.2–88.9
Mass aldehyde (reagent 1) | E1 (kg) | 1.0 | 1493–1525 | 1519–1520 | 1519–1520 | 1519–1520 | 1519–1520
Mass aldehyde (reagent 2) | E2 (kg) | 1.0 | 1525–1549 | 1547–1548 | 1547–1548 | 1547–1548 | 1547–1548
Reactive part in reagent 1 | RP-E1 (% w/w) | 0.1 | 98.3–99.2 | 98.4–99.2 | 98.4–99.2 | 98.8–99.2 | 98.7–99.2
Non-reactive part in reagent 1 | NRP-E1 (% w/w) | 0.1 | 0.4–0.7 | 0.6–0.7 | 0.6–0.7 | 0.6–0.7 | 0.6–0.7
Toluene in reagent 1 | Tol-E1 (% w/w) | 0.1 | 0.0–0.4 | 0.0–0.3 | 0.0–0.3 | 0.0–0.2 | 0.0–0.2
Remaining impurities in reagent 1 | RC-E1 (% w/w) | 0.1 | 0.1–0.3 | 0.1–0.3 | 0.1–0.3 | 0.1–0.3 | 0.1–0.2
Mass catalyst (aqueous solution) | Cata (kg) | 0.5 | 159–167 | 160–167 | 160–167 | 162–167 | 161–167
Dilution of catalyst-solution | Cata-c (% w/w) | 0.2 | 14.9–17.0 | 14.9–17.0 | 14.9–17.0 | 14.9–17.0 | 15.2–16.6
Volume solvent in reactor | Sol-R (m3) | 0.1 | 3.8–4.7 | 3.8–4.4 | 4.0–4.4 | 4.0–4.2 | 4.0–4.2
Water-part in solvent | W-Sol (% w/w) | 0.2 | 14.9–18.2 | 14.9–17.5 | 15.0–17.5 | 14.9–17.0 | 15.0–18.2
Average reactor temperature | ART (°C) | 0.5 | 18.6–21.5 | 19.4–21.5 | 19.4–20.6 | 19.4–21.3 | 19.4–20.6
Reagent 1 feed duration | E1FD (min) | 1.0 | 243–360 | 302–309 | 301–311 | 303–304 | 302–305
Reagent 1 initial feed rate | E1SFR (kg/h) | 1.0 | 199–310 | 295–310 | 284–305 | 297–302 | 297–301
Delay after reagent 1 feed | D1 (min) | 30 | 180–1667 | 300–870 | 240–900 | 300–570 | 330–660
Duration dilution | D2 (min) | 30 | 90–600 | 120–360 | 150–390 | 150–300 | 150–300
Washing liquor in separation unit | Sol-SU (m3) | 0.2 | 0.3–7.2 | 3.5–6.3 | 3.3–6.2 | 3.8–5.5 | 3.7–5.9
Feed-steps to separation unit | SUS | 1.0 | 1–24 | 14–23 | 13–23 | 15–20 | 15–20
Duration separation procedure | DSP (s) | 30 | 100–250 | 100–250 | 120–250 | 120–240 | 120–250
Duration drying procedure | DPD (s) | 30 | 210–660 | 570–600 | 570–600 | 570–600 | 570–600

Dark-shaded variables (E1, E2, NRP-E1, DPD) designate those deselected after the first, statistical pre-processing step (outliers removal). Light-shaded variables (Tol-E1, E1FD, E1SFR, D2, SUS, DSP) were excluded as a result of the advanced, information theoretic pre-processing methods (final range).
This has limited the number of samples to 1317, which were then pre-processed according to the methods described in Section 2.

4.1. First-level variable selection and data elimination

Table 2 summarizes the results of the data elimination process. The two reactors were treated separately, since the results of an initial neural network modelling approach and the exchange of knowledge with the process operators indicated such a need. As will be shown in the following section, the individual treatment of the two reactors did not only lead to better correlations but also increased the model plausibility. Starting with the missing data screening, the large amount of eliminated data was due to faults of the computerized registration systems, which occurred within this period and are typical for "real-life" systems like industrial processes. The removal of outliers that followed specified a first set of ranges for a valid model. These ranges, presented in Table 1, have also been used for the first-level variable selection, namely four variables (E1, E2, NRP-E1 and DPD) were excluded from further analysis due to their very narrow ranges. The resolution of the variables presented in Table 1 makes this exclusion clearer and provides the basis for the next steps.

4.2. Second-level variable selection and data elimination

With the variable resolutions and valid ranges set, the implementation of the EP-algorithm has diminished the quantity of data (Table 2) in order to provide a more balanced and qualitatively better data set. The EP value of 90% was used for thr1 and the value of 100 for thr2, that is, the data elimination stops when the EP value of the remaining data
becomes greater than 90% or when no more than 100 data samples are left. Fig. 6 shows the results of the implementation of the EP-algorithm. It consists of two parts: the one-dimensional improvement of the EP values of all the variables for reactors A and B in the lower part, and the amount of data eliminated in order for this improvement to take place in the upper part. A first indication for the second-level variable selection is given by the lower part of Fig. 6, namely two of the variables (DSP, E1FD) clearly failed to reach the predetermined EP-value threshold for both reactors, although the maximum allowed number of data samples to be eliminated had been reached. Therefore, these two variables were excluded from further analysis. Moreover, the rest of the variables were divided into two classes: (a) those that reached the EP-value threshold by eliminating a small number of data samples; (b) those that needed the elimination of a large number of data samples. The classification of the variables according to this criterion was similar for both reactors, namely the variables D1, Sol-SU, W-Sol and Cata-c belong to the first class and the rest to the second one. The only exception is the variable ART, which behaved differently in entropy terms in the two reactors. The four variables of the first class for reactor B, plus the ART variable for reactor A, have therefore been selected (second-level variable selection) to determine the final amount of data samples. More specifically, the union of the sets of samples that had been eliminated by the one-dimensional implementation of the EP-algorithm was excluded from the final modelling task. This last step of the elimination process has led to the "Final range" of Table 1. Figs. 7 and 8 present the final EP values in the case of reactor A for the two classes of variables, respectively.
Table 2
Pre-processing: effect of each step on the data basis, the input dimensionality and the ratio between available data and variables

Pre-processing step | Reactor A: data | Reactor A: variables | Reactor A: ratio | Reactor B: data | Reactor B: variables | Reactor B: ratio
Initial data / potential variables | 666 | ~30 | ~22 | 651 | ~30 | ~20
Missing data elimination / unmeasurability, correlations, unsteady distribution | 501 | 20 | 25.1 | 487 | 20 | 24.4
Outliers removal / narrow range | 394 | 16 | 24.6 | 358 | 16 | 22.4
EP-algorithm (first class) | 142 | 4 | 35.5 | 191 | 5 | 38.2
EP-algorithm (first + second class) | 34 | 14 | 2.4 | 0 | 14 | 0
ITSS-algorithm, Pearson | 142 | 9 | 15.8 | 191 | 9 | 21.2
Fig. 6. Top: eliminated data after the one-dimensional implementation of the EP-algorithm. The bold lines mark the upper elimination limit for each reactor (thr2 = 100). Bottom: actual improvement of the EP values (thr1 = 90%). Dashed lines: before EP-algorithm; full lines: after EP-algorithm.
Fig. 7. EP-values before and after the implementation of the EP-algorithm for the five selected variables of reactor A (first class variables).
The impact of the last data elimination step according to the union of the "one-dimensional" sets can be clearly seen here. For the variables of the first class there is, as expected, some deterioration compared to the one-dimensional improvement. However, substantially higher EP values have still been achieved for those variables that initially (after the outliers removal) had low entropy measures. As expected, this is not observed for the variables of the second class. In this case the entropy level is approximately the same as the initial one, since the union of eliminated sets did not include the sets from the one-dimensional implementation of the EP-algorithm for the second class of variables. If this had been done, the quantity of the remaining data would have been too small to build a neural model, as shown in Table 2. The analysis made above is similar for reactor B, and the respective figures have therefore been omitted.
Fig. 8. EP-values before and after the implementation of the EP-algorithm for the non-selected variables of reactor A (second class variables).
Fig. 9. Cumulative information theoretic curve according to the ITSS algorithm for reactor A.
4.3. Third-level variable selection and data elimination

Applying the ITSS algorithm has shown that most of the information that the target variable contains (98.7% for reactor A and 97.9% for reactor B) is already explained by the second-level variable selection (first class variables). This is shown by the typical curve of cumulative information presented in Fig. 9 for reactor A; reactor B behaves in the same way. However, four more (second class) variables were included in the input variable set of the modelling task (third-level variable selection), namely RP-E1, RC-E1, Cata and Sol-R for reactor A, plus the ART variable for reactor B. These variables complete the missing part of the information, leading to U ratios of 99.7% and 100% for reactors A and B, respectively. Moreover, some of these variables (RP-E1, RC-E1) are theoretically expected to influence the primary product yield significantly, by characterizing the reagent (E1) quality. The interest in the influence of the rest (ART, Cata, Sol-R) lies in the fact that these variables, unlike the reagent quality, can easily be controlled and therefore be set to the values indicated by the model in an optimization procedure. Of course, it should be noted that these variables account for a very small amount of additional information compared to the variables chosen in the second level. However, when chosen alone, these variables can also explain a significant part of the information. This is shown in Fig. 10 for reactor A, and a similar behaviour was observed for reactor B. This kind of behaviour indicates that these additional variables may be partly correlated with the variables selected in the second level. Table 3 verifies this by presenting the Pearson correlation coefficients between the selected variables for both reactors. Such information is very important when a neural model is going to be used in an optimization procedure. The effect of the pre-processing procedure on the initial variables is summarised in Table 1.
Table 3
Pearson's correlation coefficients (r); upper triangular matrix: reactor A, lower triangular matrix: reactor B

       | Cata-c  | W-Sol  | D1     | Sol-SU  | ART     | RP-E1  | Cata    | Sol-R   | RC-E1
Cata-c | 1       | −0.22 a | −0.05 | −0.56 a | −0.12   | 0.11   | 0.62 b  | −0.33 b | 0.01
W-Sol  | −0.07   | 1      | −0.02  | 0.21 a  | 0.06    | 0.30 a | −0.21 b | 0.28 b  | 0.26 b
D1     | 0.01    | −0.02  | 1      | 0.06    | −0.17   | 0.03   | 0.13    | −0.17   | −0.16
Sol-SU | −0.65 a | 0.21 a | 0.08   | 1       | 0.15    | −0.11  | −0.33 b | 0.31 b  | 0.13
ART    | −0.53 a | 0.12   | −0.17  | 0.43 a  | 1       | −0.01  | −0.10   | −0.02   | 0.01
RP-E1  | 0.06    | 0.33 b | 0.06   | −0.06   | 0.13    | 1      | 0.08    | 0.07    | 0.03
Cata   | 0.56 b  | −0.03  | 0.20 b | −0.40 b | −0.66 b | 0.03   | 1       | −0.52 a | −0.11
Sol-R  | −0.48 b | 0.12   | −0.11  | 0.40 b  | 0.79 b  | 0.20 a | −0.63 a | 1       | 0.29 a
RC-E1  | 0.06    | 0.23 b | −0.05  | 0.04    | 0.31 b  | 0.11   | −0.13   | 0.31 a  | 1

Correlations with absolute values of r over 0.20 are significant (alpha level ≈ 0.01, d.f. = 140).
a Significant correlations between variables selected in the second level or between the additional variables.
b Significant correlations between variables selected in the second level and the additional variables selected in the third level.
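For completeness, a correlation screen of this kind can be produced directly from the pre-processed data matrix; the sketch below is our own and simply flags the pairs exceeding the |r| > 0.20 guideline of the table footnote, without reproducing the exact significance test used here.

```python
import numpy as np

def correlation_screen(data, names, r_cut=0.20):
    """Pearson correlation matrix with the pairs exceeding |r_cut| listed separately.

    data  : 2-D array, one column per (already pre-processed) variable.
    names : column names, e.g. ["Cata-c", "W-Sol", "D1", "Sol-SU", "ART",
                                "RP-E1", "Cata", "Sol-R", "RC-E1"].
    """
    r = np.corrcoef(data, rowvar=False)
    flagged = [(names[i], names[j], round(float(r[i, j]), 2))
               for i in range(len(names)) for j in range(i + 1, len(names))
               if abs(r[i, j]) > r_cut]
    return r, flagged
```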
The nine non-shaded variables of Table 1 are included in the input variable set of the neural model, while the rest serve as restrictions. The meaning of the variables forming the restriction set is that an input sample may be characterized as an "extrapolation case" if these variables have values outside their ranges, even though the variables that actually contribute to the model's prediction do not. The prediction accuracy in this case may be significantly lower, as in any other extrapolation attempt.
5. Results and discussion

The training of the NNs was carried out with ATLANtec's NeuroModel software package (Version 1.41). This package performs the data normalization automatically with a linear transformation into the range [0.1, 0.9]. It uses multilayer perceptrons (MLP) with one hidden layer and a version of the EBP algorithm with momentum term for the training procedure.
Fig. 10. Cumulative information theoretic curve (ITSS algorithm) of the additional variables included in the third level of variable selection for reactor A.
The nodes of the hidden layer use a sigmoid transfer function, and the number of hidden nodes is an adjustable parameter. In this paper only network architectures with "data to weights" ratios above 2 have been tried, that is, NNs with no more than five nodes in the hidden layer in both cases. In each case, ten randomly selected starting values for the initialization of the network weights were tested. 15% of the available data were randomly selected and left out for validation purposes. The NN training procedure is stopped either after a predetermined number of epochs (30,000, because the software package requires a larger number of epochs compared to the usual EBP algorithm) or when the achieved modelling accuracy reaches a predetermined threshold (0.7, expressed as correlation factor). Both thresholds are carefully set to avoid overfitting, which would lead to decreased generalization properties of the neural model. Fig. 11 shows the improvement in terms of mean square error (MSE) and correlation factor (R2) for the validation set after the implementation of the advanced pre-processing methods (APP). These values were obtained for the "winner" neural model among those tried in this work.
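The NeuroModel package itself cannot be reproduced here, but the described set-up (a single hidden layer of sigmoid nodes, linear input scaling to [0.1, 0.9], a 15% random hold-out and momentum-based gradient training) can be approximated, for instance with scikit-learn, as sketched below; the learning rate, the momentum value and the use of MLPRegressor are our assumptions, and only the inputs are scaled in this simplified version.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def scale_01_09(a, lo, hi):
    """Linear scaling of each column into [0.1, 0.9]."""
    return 0.1 + 0.8 * (a - lo) / (hi - lo)

def train_yield_model(X, y, hidden_nodes=5, seed=0):
    """X: (n_samples, 9) pre-processed inputs; y: product yield P (both assumed given)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    Xs = scale_01_09(X, lo, hi)
    X_tr, X_val, y_tr, y_val = train_test_split(Xs, y, test_size=0.15, random_state=seed)
    net = MLPRegressor(hidden_layer_sizes=(hidden_nodes,),   # <= 5 keeps the data-to-weights ratio above 2
                       activation="logistic",                # sigmoid hidden nodes
                       solver="sgd", momentum=0.9, learning_rate_init=0.01,
                       max_iter=30000, random_state=seed)
    net.fit(X_tr, y_tr)
    r2_val = net.score(X_val, y_val)                         # generalization check on the hold-out charges
    return net, r2_val
```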
Fig. 11. Effect of the advanced pre-processing methods (APP) on neural networks modeling performance.
From this figure it is also noticeable that the model for reactor A behaves better than the one for reactor B. A more detailed analysis follows.
5.1. Neural network training—simultaneous implementation of both reactors

As a first attempt, both reactors were modelled simultaneously with one neural network. This means that mixed data of charges of both reactors were used for the NN training. This procedure seemed quite plausible, since both reactors have basically the same geometry and are operated identically. Surprisingly, training the NN gave the results presented in Fig. 12. A division of the training results into two areas is obvious, indicating that the NN still requires input information in terms of additional variables in order to model the process consistently. Consultation with the process operating personnel uncovered different fouling behaviours in the two reactors, which are very likely the reason for this specific training result. Influences like fouling are in general very complicated to consider within a NN, since appropriate variables have to be found and included, the time domain of which is not only the duration of one run but a whole campaign of runs. This also means that the quantity of the data basis becomes an even more important prerequisite. In our particular case the data basis does not cover a longer, coherent time span. An alternative approach has therefore been followed in this work to overcome this problem, that is, modelling the two reactors separately, as will be shown in the following.

Fig. 12. Simultaneous implementation of reactors A and B within one neural network model—training results with rejected data: comparison between modelled and measured output variable P (product yield).

5.2. Neural network training—separate treatment of both reactors

The training results for the neural networks of each reactor are given in Fig. 13. The data of 118 and 166 charges were used for training purposes in the case of reactors A and B, respectively. Rating a prediction as satisfactory as long as the absolute deviation from the actual measurement of the P yield lies between −1% and +1%, the model can be regarded as successful: about 85.6% and 72.3% of the fitted data lie within these limits for the NN models of reactors A and B, respectively. Even though the training results are quite similar for both reactors, reactor B shows a shift towards positive deviations, as shown in Fig. 14. Summarizing the training results, it can be noted that the separate modelling of the reactors led to satisfactory results with hardly any deviations in P yield beyond 2%. The somewhat less accurate training results for reactor B are likely to be put down to long-term effects, like fouling, which could not be incorporated into the neural network because of the restricted data basis. A convenient incorporation of fouling-related input variables would presuppose that several time spans between two reactor clean-outs be recorded to capture the effect of proceeding fouling. This prerequisite was not fulfilled in the present case.
5.3. Neural network validation

The part of the pre-processed data basis that had not been utilized to train the neural networks was used as test data,
Fig. 13. Neural network training: comparison between modelled and measured output variable P (product yield).
Fig. 14. Neural network training: absolute deviation between modelled and measured output variable P (product yield).
namely 24 sets for reactor A and 25 sets for reactor B. Since these data points lie within the training range of each of the input variables, the validation procedure can be considered as interpolation. The validation results displayed in Fig. 15 are very similar to those of the training procedure, indicating that the NN generalizes adequately. The NNs of reactors A and B succeeded in 87.5% and 68% of all validation attempts, respectively. Only one prediction had an error in excess of 2%.
Fig. 16. Neural network validation: absolute deviation between modelled and measured output variable P (product yield).
Again the model for reactor B tends to produce positive errors, which goes along with the trend of the training procedure (see Fig. 16). The training and validation procedures above proved that data interpolation can be realized in a qualitatively as well as quantitatively reliable manner by means of neural networks. By applying a rigorous data pre-processing stage and a systematic variable selection, the application of NNs has led to an improvement in a case where the database is very restricted. The next section deals with the exemplary application of the developed NNs in order to improve process operation and consequently to increase the product yield.
Fig. 15. Neural network validation: comparison between modelled and measured output variable P (product yield).
5.4. Model application

The developed NNs have been applied in order to identify optimization potential for the process. The catalyst mass is believed to strongly affect the product quantity. However, since the conversion approaches the equilibrium state, the catalyst should theoretically not affect the yield as far as the reaction is concerned. Apart from that, not only the main reaction path but also the side reactions are catalyzed, and moreover the interaction with the thermodynamic equilibrium is not foreseeable. This is why no assured predictions can be made without a model. Consequently, the catalyst mass is selected as an exemplary variable to show the efficient application of NNs in this type of system. Additionally, the purity of reagent 1 has also been taken into consideration in order to check the extrapolation performance of the NNs and to uncover restrictions of NN modelling. The reagent's purity is the only variable whose impact on the product yield is known a priori, since it is nearly independent of the complexity of the system. The fewer impurities in the E1 feed, the higher the product yield must be. Furthermore, this relationship may be expected to be linear.
Fig. 17. Causality plot—reactor A: effect of catalyst mass (Cata) and reagent purity (RP-E1) on product yield (P) as predicted by the NN model.
A qualitative evaluation of the model's extrapolation performance can therefore be made without experimental confirmation. Figs. 17 and 18 show the dependency of the product yield (P) upon the catalyst mass (Cata) and the E1 purity (RP-E1) calculated by the NNs for reactors A and B. It must be pointed out here that these variables do not correlate with each other according to Table 3. Table 1 presented the variable ranges of the NN training procedure, which mark out the interpolation region in these figures.
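Causality plots of this kind can be generated from any trained model by varying Cata and RP-E1 over a grid, slightly beyond the training ranges to expose the extrapolation behaviour, while the remaining inputs are held at representative values; the sketch below assumes a model object with a scikit-learn style predict method (e.g. from the training sketch earlier in this section) and inputs scaled consistently with training.

```python
import numpy as np

def causality_grid(net, x_ref, i_cata, i_rpe1, cata_range, rpe1_range, n=25):
    """Predicted yield P over a (Cata, RP-E1) grid with all other inputs fixed at x_ref.

    net            : trained model exposing .predict
    x_ref          : 1-D array with one representative (scaled) input sample
    i_cata, i_rpe1 : column indices of Cata and RP-E1 in the input vector
    cata_range, rpe1_range : (min, max) tuples, possibly extended beyond the training range
    """
    cata = np.linspace(*cata_range, n)
    rpe1 = np.linspace(*rpe1_range, n)
    P = np.empty((n, n))
    for a, c in enumerate(cata):
        for b, r in enumerate(rpe1):
            x = x_ref.copy()
            x[i_cata], x[i_rpe1] = c, r
            P[a, b] = net.predict(x.reshape(1, -1))[0]
    return cata, rpe1, P
```

Grid points outside the ranges of Table 1 are extrapolations and, as discussed below, should only be read qualitatively.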
Fig. 18. Causality plot—reactor B: effect of catalyst mass (Cata) and reagent purity (RP-E1) on product yield (P) as predicted by the NN model.
Consequently, product yields for reactor A calculated for Cata > 167 kg or Cata < 162 kg and RP-E1 > 99.2% w/w or RP-E1 < 98.8% w/w belong to the extrapolation region. In the case of reactor B the extrapolation region starts above 167 kg or below 161 kg concerning Cata, and for values higher than 99.2% w/w or lower than 98.7% w/w with respect to RP-E1. As expected, the trends are well indicated by the NN in the case of interpolation. In the extrapolation regions things are somewhat different. Contrary to expectation, extrapolation to RP-E1 values beyond 99.2% w/w in the NN for reactor A leads to a sharp decrease in P for nearly the whole Cata range. Furthermore, linearity between RP-E1 and P only holds in the interpolated region. In the case of reactor B the extrapolation to higher Cata values leads to an implausible reverse effect of RP-E1 on P: the reasonable positive, nearly linear relationship between P and RP-E1 at lower catalyst masses changes to a decrease of P with increasing RP-E1 for Cata > 167 kg, which is no longer comprehensible. These deficiencies concerning extrapolation performance are typical for NNs and every other entirely statistical type of model and must always be kept in mind while working with neural networks. The extrapolation of Cata, the variable of major interest concerning the optimization potential, gives qualitatively not unreasonable trends, even though the functional correlation between Cata and P is completely different for the two reactors from a mathematical as well as a phenomenological viewpoint: the product yield increases exponentially with higher catalyst masses as calculated and predicted for reactor A, whereas a certain saturation behaviour at raised catalyst loads is observed in the case of reactor B. Catalyst saturation is what one observes in catalysis in general and would therefore be the more reasonable development. However, it cannot be concluded in our case that the mathematical relationship as calculated for reactor A is wrong, since the initial molar ratio between catalyst and reagent is rather low, and the differences could as well be put down to the different conditions in the two reactors. Nevertheless, the interpolation ability of the developed NNs is quite good, as has been shown during model training as well as validation, and the extrapolated trends are qualitatively reasonable. Test runs with moderately raised catalyst masses have been carried out on the plant in order to verify the model predictions. Figs. 19 and 20 indicate that the trend predicted by the model is confirmed by the measured yields. A considerable increase in product yield could be observed in both reactors. For reactor A, P gained 1.3% w/w as the catalyst mass was raised by about 2.5%. The effect on reactor B is somewhat smaller, since the same change in catalyst mass led to only 0.7% w/w more product yield. It can also be seen that the further the variable Cata moves into the extrapolated region, the more the model deviates from the real process values. The trend is, however, well predicted by the neural networks. Therefore, the additional objective of drawing causalities between variables in complex chemical processes that elude rigorous calculation, in order to increase the product yield, has been shown to work well with the modelling approach presented in this study.
Fig. 19. Test runs for reactor A at raised catalyst loads—comparison with NN interpolations as well as extrapolations.

Fig. 20. Test runs for reactor B at raised catalyst loads—comparison with NN interpolations as well as extrapolations.

6. Conclusions

The data pre-processing stage in neural network modelling is of great importance, especially in the case of commercial data sets with restricted quality. The simultaneous "data balancing-variable selection" procedure proposed in this paper has been implemented in the complex case of a commercial aldol condensation process, and the results have shown that the neural network generalization accuracy has been improved in comparison with that achieved when only the typical "outliers removal" procedure was implemented. Furthermore, the neural model structure has been significantly simplified, since the pre-processing procedure has resulted in nine input variables out of the initial 19. This has made the model analysis a much easier task, with fewer assumptions made about the correlations and the independence of the input variables. Based on the model accuracy and analysis convenience, a typical case study has been carried out regarding the effect of two of the most important process variables (Cata, RP-E1) on the process yield. Generally, the qualitative behaviour of the neural model was comprehensible enough to encourage the realization of industrial test runs with the proposed values of greater catalyst mass. The results have verified the model trends, leading the process to improved operation. However, the accuracy of the model predictions in this extrapolation region has started to deteriorate. Therefore, a hybrid first principles–neural network model dealing with the problem of extrapolation accuracy is part of the authors' future plans.

Acknowledgements

Financial support provided by BASF Schwarzheide GmbH and knowledge interchange with its research and operating sections, particularly with Udo Rotermund, Olaf Otto, Rainer Noack and Jan Rudloff, is gratefully acknowledged.

References
Berry, D., & Ng, K. (1997). Synthesis of reactive crystallization processes. AIChE Journal, 43(7), 1737–1750.
Clayden, J., Greeves, N., Warren, S., & Wothers, P. (2001). Organic chemistry. Oxford: Oxford University Press.
Hair, J., Anderson, R., Tatham, R., & Black, W. (1998). Multivariate data analysis. Prentice-Hall International.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Kelkar, V., & Ng, K. (1999). Design of reactive crystallization systems incorporating kinetics and mass-transfer effects. AIChE Journal, 45(1), 69–81.
Neelkanten, R., & Guiver, J. (1998). Applying neural networks. Hydrocarbon Processing, 77(9), 91–96.
Piovoso, M., & Owens, A. (1991). Sensor data analysis using neural networks. In Proceedings of the CPC-IV international conference on chemical process control, Padre Island, TX (p. 101).
Polland, J., Broussard, M., Garrison, D., & San, K. (1992). Process identification using neural networks. Computers and Chemical Engineering, 16(4), 253–270.
Psichogios, D., & Ungar, L. (1991). Direct and indirect model based control using artificial neural networks. Industrial Engineering and Chemistry Research, 30(12), 2564–2573.
Sridhar, D., Barlett, E., & Seagrave, R. (1998). Information theoretic subset selection for neural network models. Computers and Chemical Engineering, 22(4), 613–626.
Swingler, K. (1996). Applying neural networks. London: Academic Press.
Venkatasubramanian, V., & Chan, K. (1989). A neural network methodology for process fault diagnosis. AIChE Journal, 35(12), 1993–2002.
Weixiang, Z., Dezhao, C., & Shangxu, H. (1998). Potential function based neural networks and its application to the classification of complex chemical patterns. Computers and Chemistry, 25(2), 385–391.
Zauner, R., & Jones, A. (2002). On the influence of mixing on crystal precipitation processes – Application of the segregated feed model. Chemical Engineering Science, 57(5), 821–831.