Computers chem. Engng, Vol. 16, No. 4, pp. 413-423, 1992
Printed in Great Britain. All rights reserved
0098-1354/92 $5.00 + 0.00
Copyright © 1992 Pergamon Press Ltd

EXPLORATORY DATA ANALYSIS: A COMPARISON OF STATISTICAL METHODS WITH ARTIFICIAL NEURAL NETWORKS

B. JOSEPH, F. H. WANG and D. S.-S. SHIEH

Chemical Engineering Department, Washington University, St Louis, MO 63130-4899, U.S.A.

(Received 23 January 1991; final revision received 1 October 1991; received for publication 10 December 1991)
Abstract-Knowledge acquisition is perceived as a major bottleneck in the deployment of expert systems. This problem is more severe in knowledge-based control systems because the knowledge is often difficult to extract and needs to be constantly updated to reflect changing processing conditions. This article considers the problem of extracting processing knowledge directly from historical plant operational data. First we consider the features of the problem. The applicability of machine learning algorithms is considered. Quadratic regression, induction and artificial neural network modeling are applied to a process selected from composite manufacturing. The results show that, while neural networks are a promising new approach, there is a need to focus attention on the preprocessing of data to cope with data reduction and process nonlinearity. Methods based on statistical analysis will play an important role in this preprocessing.
1. INTRODUCTION

During the last few years, there has been a significant push towards improving the quality and the productivity of goods manufactured in this country. For the most part, the literature on this subject emphasizes the implementation of statistical quality control (SQC/SPC) techniques: gathering of quality data, monitoring the quality through statistical techniques to detect deviations, and taking corrective actions to remedy off-quality situations. An equally important first step is the recognition of all the variables that can influence the quality and an understanding of why off-specification events arise. This requires a thorough understanding of the physical characteristics of the process and of the effect of the many primary and secondary factors on product quality.

Many of these factors arise from changes taking place in the process. Raw material suppliers and characteristics can vary over a period of time, as can the influence of age on material and equipment. Such factors can adversely affect the quality and are difficult to characterize. Usually, an understanding of the effect of these factors on the quality is absorbed over a period of time by the operators of the system through experience. Experienced operators are good at anticipating off-quality situations and taking proper corrective actions. Process engineers use this knowledge to identify key variables and to track their effect on the quality. We refer to this as processing knowledge. Processing knowledge is an important source of productivity and quality gains.

Recent advances in artificial intelligence, expert systems and artificial neural networks offer an opportunity to capture and employ this knowledge in knowledge-based control systems that imitate the skill of experienced operators (Waterman, 1986; Love et al., 1987; Norman et al., 1988; D'Ambrosio et al., 1988). A major difficulty seen in the implementation of knowledge-based control systems is the acquisition of the knowledge. Problems arise because experienced operators often find it difficult to express their processing knowledge in precise, logical sequences. Another difficulty is the fact that the knowledge base changes with time and therefore requires periodic updating to reflect current operating conditions. Thus there is also an issue of representing this knowledge in such a way that new knowledge can be added and obsolete knowledge removed without disrupting the knowledge-based control system.

There is a strong incentive to search for computer aids to acquire process control knowledge from past operational data. With the advent of computer process control and the push for increased monitoring for SPC/SQC purposes, there is a wealth of sensor data stored in machine-readable form. The availability of algorithms to scan the large
volumes of data will have immediate application in overcoming the bottleneck of knowledge acquisition. In this article, we look at some classical (statistics based) techniques and compare them with some novel methods using artificial neural networks. The methods are evaluated through application to a representative batch manufacturing process, namely the autoclave curing of composites. The results show the relative strengths and weaknesses of these two approaches. We have selected the autoclave curing of composite materials as an example because it is an industrially important manufacturing process and is susceptible to a large rejection rate of the final product. We have also accumulated some experience working on the models for this process. Due to the high cost associated with the parts built using this process and their strategic importance in defense, there has been considerable interest on the part of industry in intelligent processing of this batch manufacturing operation (Campbell et al., 1988).

2. BACKGROUND
2.1. Nature of the problem considered
For the purposes of this article we restrict ourselves to acquiring knowledge of the following type: given a set of quantitative measurements on past operating conditions of a process, generate a relationship between the product qualities and the measurements. We will also restrict ourselves to static relationships (time series data are not considered). To cite an example, consider the autoclave curing of composites, which is a batch process (this problem is discussed in more detail in a later section). Here the product quality is measured in terms of the thickness of the final product and the maximum void size present. The measurements include properties of the raw material fed, temperature of the feed, curing temperature used, pressure applied, time at which pressure is applied, time at which temperature in the autoclave is increased, etc. The knowledge sought is the influence of these variables on the product quality. In a sense, this is equivalent to building a process model; but it is more than that. We can consider, for example, other qualitative variables such as the supplier of the raw material, the stock number from which the raw material came, the shift in which the production was carried out, the name of the operator in the plant, etc. All these other variables may or may not have an influence on the product quality; some variables are directly causal, and some are less relevant. The latter variables are mostly qualitative in nature and therefore inductive
techniques (Quinlan, 1983, 1986; Cendrowska, 1987; Breiman et al., 1984) are more appropriate for relating them to the product quality. One important question that arises is how to use prior knowledge about the process in an effective way to reduce the problem. Ideally we would like to complement prior knowledge with the new data that is available. This issue is discussed in a related paper (Shieh and Joseph, 1992b). For the present we assume that prior knowledge is used to screen out irrelevant data records and attributes to reduce the dimensionality of the problem.

2.2. Regression methods
Regression is one of the classical techniques used to find relationships among variables. Linear regression is usually tried first. One requirement for regression is that the data be well-distributed. Past operational data cannot be expected to have the nice distribution that is possible with well-planned experiments. More than likely, the data tend to be clustered around regions and skewed with respect to certain variables. Another difficulty with regression is the problem of coping with nonlinearity. If the nonlinearity is known to occur in a certain way (prior knowledge), then it is easier to treat. Otherwise, a quadratic fit can be attempted provided the nonlinearity is not severe. Since regression methods are well known, we will not discuss them further.

2.3. Artificial neural network methods
Artificial neural networks have evolved as a powerful computational paradigm in the past few years. Neural network models consist of simple processing units in a highly interconnected network. Information processing is achieved through activation and inhibition of the interconnections among the units in the network. The network structure is based on our present understanding of biological nervous systems. The models do share many of the computational properties that we attribute to human intelligence. Among these properties are association, discrimination, self-organization, self-stabilization and learning. The computational properties depend upon the topology of the neural network-the number of layers, the number of nodes in each layer and the interconnection scheme. They also depend upon the functions used to determine the activation of the nodes, the output from each node and the change in interconnection strength over time. Many different architectures have been proposed for artificial neural networks. For a discussion and taxonomy, see Lippmann (1987). The selection of the
architecture for a given application depends on the problem characteristics. To select a suitable neural network model for the above problem, the following three factors are considered: (1) the nature of the inputs: most process operational data are continuous values, so only the models that take continuous values as inputs are considered; (2) availability of the desired outputs: since process operational data are available, information about the desired outputs is known, and therefore supervised neural networks are more appropriate; (3) linearity: most chemical processes are nonlinear, so the selected model must be able to handle nonlinearity. Keeping these three factors in mind, a neural network model called the "multi-layer perceptron" is selected for use in this research. Multi-layer perceptrons are feedforward networks with one or more layers of nodes between the input and output layers. The learning rule selected to train the neural network is the back-propagation algorithm, originally developed by Rumelhart et al. (1986). One point worth noting is that the knowledge contained in a neural network is not explicitly stated as rules anywhere in the model. In other words, the knowledge representation is implicit. This characteristic has its advantages and disadvantages. The advantage is its adaptability and ease of modification. The disadvantage is the difficulty of explaining and integrating the knowledge.

2.4. Inductive methods

A third possible approach to acquiring knowledge, particularly suited for implementation in the form of rules in knowledge-based systems, is based on induction from examples. These algorithms treat each record (or instance) of historical data as an example from which one attempts to learn the rule(s) that govern the underlying relationships. The terminology in this area of research is slightly different from that of the statistical literature.
The independent variables in a data set are called attributes. The dependent variable is called the class. The problem of classification is reduced to a problem of building a decision tree which branches on selected attributes. By branching down the decision tree (comparing attribute values at each node in the decision tree), we can arrive at a classification for a given new instance of independent variables. For a detailed discussion on induction methods in machine learning, the reader is
referred to Quinlan (1983, 1986). Quinlan called his algorithm ID3 (Iterative Dichotomizer 3rd). Although the original ID3 algorithm as proposed by Quinlan is restricted to attributes that are categorical (qualitative) in nature, Breiman et al. (1984) have developed a similar algorithm (called CART, Classification And Regression Trees) for problems that have numerical attribute values. The advantages of the inductive algorithms are as follows: (1) the algorithms are very simple and can be easily coded for use on a computer; (2) a large amount of data can be condensed into a simple decision tree; (3) the resulting decision tree is fully comprehensible and can be easily justified by the experts; and (4) the decision trees can be readily converted to production rules. The disadvantages associated with inductive methods are also many. First, the resulting decision trees are sometimes large and impractical, especially with noisy data. In such a case, either some pretreatment of the data or a pruning of the decision tree is necessary. If the data themselves are meaningful, then a comprehensible decision tree is almost always obtained. But what happens if we have incremental data? Most likely, we may get a decision tree from the first batch of data and a slightly different decision tree from the second batch of data. This leads to the second problem: the reconciliation of decision trees resulting from different batches of data. The combination of these two characteristics could severely restrict the application of ID3 to the engineering domain. This handicap can be overcome to some extent by proper pretreatment of data using statistical pattern recognition techniques. We conducted a preliminary study to evaluate the application of an ID3-type algorithm to gather processing knowledge from operational data generated by an autoclave process simulator.
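To illustrate the induction idea, here is a minimal sketch of the ID3-style split criterion: at each node, the attribute with the largest information gain (reduction in entropy of the class labels) is chosen for branching. This is an illustration only, not the implementation used in the study; the records and quality labels below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(records, labels, attributes):
    """Pick the attribute whose split yields the largest information gain."""
    base = entropy(labels)
    def gain(attr):
        split = {}
        for rec, lab in zip(records, labels):
            split.setdefault(rec[attr], []).append(lab)
        remainder = sum(len(subset) / len(labels) * entropy(subset)
                        for subset in split.values())
        return base - remainder
    return max(attributes, key=gain)

# Hypothetical qualitative records: curing pressure level and feed purity
records = [
    {"pressure": "low",  "purity": "high"},
    {"pressure": "low",  "purity": "low"},
    {"pressure": "high", "purity": "high"},
    {"pressure": "high", "purity": "low"},
]
labels = ["reject", "reject", "accept", "accept"]   # product quality class
print(best_attribute(records, labels, ["pressure", "purity"]))  # prints: pressure
```

A full ID3 tree recurses on each subset with the remaining attributes until the labels at a node are pure.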
The objective of the study was to classify which operational conditions are suitable for a desired product quality when the properties of the feedstock are given. The numerical operational data were first converted to qualitative data. The resulting decision tree is comprehensible and justifiable, as expected. But, due to the conversion from quantitative to qualitative data, the accuracy of representation is reduced significantly. As a result, the knowledge acquired by the ID3 algorithm is not suitable in a knowledge-based control environment. As we will see later, inductive techniques can complement both regression and neural network models.

2.5. Feature selection

A problem faced by all three types of classifiers above is the large number of features (or attributes) associated with the classification problem. In order to
use these algorithms efficiently, it is necessary to do some preliminary screening so that only the relevant attributes are selected for use in the classification step. This preliminary screening is called feature selection. Consider that objects, i.e. data, are points in a hyperspace and that an object is described by a number of features (or attributes). Then each feature can be regarded as a dimension. Feature selection aims to reduce the dimensionality of the hyperspace without reducing the accuracy of the classifiers. Obviously, some information could be lost during the dimension reduction. However, the following reasons justify the task of feature selection: (1) redundant measurements or highly correlated variables can be eliminated without loss of information (the problem of collinearity); (2) the high dimensionality of the hyperspace makes it difficult to apply complex classification techniques such as those using an estimate of class conditional probability density functions (pdfs). Hand (1981) has shown that, using a linear discriminant classifier, the misclassification rate initially decreases as the number of features increases, but then begins to degrade. In order to choose the best subset of features among the original set of features for the classifier, it is necessary to have an index or measure on which feature selection is based. The most direct and reliable index of the performance of a classifier is the misclassification rate (also called the error rate). Unfortunately, for most classifiers, the calculation of the error rate is very time-consuming and needs large volumes of data. Instead of the error rate, other measures such as Wilks' Λ and the conditional F-ratio are used (James, 1985). One of the most effective and efficient methods for feature selection is stepwise selection (Devijver and Kittler, 1982). It starts with no features in the selected subset. Features are added one at a time to this subset.
The feature having the smallest Wilks' Λ (or the largest conditional F-ratio) is added to the selected subset. Next, select the feature from the remaining set which, when paired with the selected features, gives the smallest Λ. This is called forward selection. Then start the backward selection procedure, which examines the decrease in the goodness measure when one of the features in the selected subset is deleted. If the decrease is below a specified threshold, then delete this feature. Repeat the cycle of forward and backward selection until the measure of goodness cannot be significantly increased with further inclusion of features. The final subset of features presents the dimensions of the hyperspace in
which the data are well-separated into groups according to their classes. There is another procedure for feature extraction in statistical pattern recognition which reduces the number of dimensions by linearly transforming the original dimension D of features into a smaller dimension, d. Algorithms in this category include principal component analysis (Draper and Smith, 1981) and partial least squares (Geladi and Kowalski, 1986). Because the resulting d transformed features are not physically meaningful, this type of feature extractor will not be considered in this study. However, this method may have applications in the use of artificial neural networks.

3. A COMPOSITE ALGORITHM FOR PREPROCESSING OF DATA
Table 1 compares the characteristics of induction methods with those of regression techniques. Both methods have characteristics that are desirable when processing routine operational data. Chemical process systems are usually nonlinear and have large interactions among variables. These characteristics favor the use of induction methods over regression analysis. The operational data, however, are mostly quantitative, and the most desirable knowledge for use in process control is in the form of quantitative formulae which can be used for estimation or prediction. These last two points make the use of induction algorithms less appealing. In this section, we propose a preprocessing algorithm based on Breiman's inductive classification algorithms to divide the operational data into regions within which we have a greater chance of finding a numerical relationship. For problems with nonnumerical input, ID3 or PRISM type algorithms can be used to achieve a similar decomposition of the data. This will have the advantage of isolating the region of association within which we can seek local quantitative relationships using regression methods. Figure 1 shows the schematic of the proposed composite algorithm. A more detailed description and evaluation of the algorithm is given in Shieh and Joseph (1992a). We believe that this classification step will bring together data that are similar in nature and hence are more likely to be randomly distributed with regard to

Table 1. Characteristics of statistical and induction methods

   Multiple regression                 Induction
1. Assumes linear systems             No such requirement needed
2. Assumes randomly generated data    No such assumption needed
3. Quantitative data                  Qualitative (or mixed) data
4. Results in mathematical formulas   Results in production rules
5. Not sensitive to noise             Sensitive to noise
6. Compact representation             Decision trees can be large
[Fig. 1. Schematic of the proposed composite algorithm: an initial set of operational data containing qualitative and quantitative information is split by inductive classification (using CART or ID3) into subsets 1 and 2, which are further classified if necessary; the quantitative association within each subset is then found using regression or ANN.]
the other variables. Also, by restricting the range of a quantitative fit, there is a greater likelihood of obtaining a better model. A disadvantage of this approach, especially with small input data sets, is that it reduces the sample size in each region. In the proposed algorithm, multiple quadratic regression is suggested instead of the commonly used linear regression because chemical process operational data are usually highly nonlinear. The composite algorithm has the advantage that the nonlinearity is reduced by restricting the range of the input variables. The application of the composite algorithm to problems involving only quantitative data proceeds as follows. Given a set of operational data, conduct multiple regression on the data. If they are well fitted, then stop; otherwise conduct the composite algorithm, which consists of the following steps:

Step 1. Apply Breiman's CART (Breiman et al., 1984) to the data set to select a feature for partitioning the data. Group the data into two subsets using this feature.

Step 2. For each subset, conduct multiple quadratic regression analysis. Calculate the R2 measure and test goodness of fit. If R2 indicates the subset is well fitted, then stop; otherwise repeat Step 1 using each subset.
In Step 2, the number of records must be much larger than the number of parameters to be estimated in the multiple quadratic regression. For instance, given 4 features in the record, the parameters in a quadratic function include the intercept, 4 first-order terms, 6 cross-product terms and 4 square terms. In other words, the number of records needs to be larger than 15. However, to make the regression meaningful, a number much larger than 15 is necessary. The choice of an R2 value as a criterion is very subjective and usually depends on the context in which it is used. To determine an association relationship, i.e. a correlation, the threshold of R2 could be as low as 0.30 in a social science domain. In this study, a high R2 criterion would end up with a large decision tree while a low R2 gives imprecise knowledge. The following application study suggests a value of R2 in the range 0.85-0.95 when testing goodness of fit for the training set.

4. AUTOCLAVE CURING OF COMPOSITES
The autoclave process is used to produce fiber-reinforced composite materials. In this process, fiber mats impregnated with unreacted thermosetting resin (called prepregs) are laid upon a tool, which is then covered with a variety of materials, and the set-up is inserted into an autoclave for processing. A schematic of the lay-up is given in Fig. 2. By properly manipulating the pressure and temperature profiles in the autoclave, excess resin can be squeezed out of the laminate and the resin can be cured to produce a composite material with the desired properties. Quality criteria for the final product include fiber/resin ratio, thickness, extent of curing and void size. Because there are strong interactions among the effects of autoclave temperature, autoclave pressure and raw material properties (e.g. impurity content, weight fraction of resin), this process is a good example to demonstrate how much knowledge can be extracted from routine data for use in process control. The autoclave curing of composites is not well understood and good processing is still an art. The process is subject to significant part rejection rates. Knowledge obtained from past historical data is therefore extremely valuable.

[Fig. 2. A schematic of the lay-up of an autoclave process, showing the nylon bag, glass breather, porous release cloth and aluminum dam.]
Variations in the properties of raw materials and different specifications of the product will require different modifications to the standard operational plan. With help from knowledge acquisition techniques, i.e. ANN, induction and regression, the following questions faced by the process engineer are expected to be answered: (i) What input variables have a significant effect on the product thickness and void size (the feature selection problem)? (ii) Given raw material properties and selected operating conditions, predict the product output measurements, i.e. thickness and void size. (iii) Given raw material properties, how does one adjust the operating conditions to improve the product quality? Undoubtedly, the best answers to the above questions could be obtained by carrying out fundamental processing research. But that may take many years and considerable manpower. With the proposed techniques, we expect to get rough answers in a short time. Also, the acquired knowledge can give researchers some insight into the process. In this example, void size and thickness are the two quality measurements to be controlled. To illustrate the characteristics of the different approaches to knowledge acquisition, we consider only variations in the following variables, whose relationship with the quality variables is to be sought. The data are generated from an autoclave simulator developed at Washington University (Wu, 1990). Historical logs of process data were generated by randomly varying the process parameters around a fixed operation plan. Table 2 gives a typical data set that might appear in the operational log of the process.

Table 2. A typical operation log

Variable      Description                       1        2        3
q1            Thickness (cm)                    1.689    1.685    1.820
q2            Void size (cm)                    0.005    0.012    0.080
wt (x1)       Wt fraction of resin              0.38     0.418    0.436
im (x2)       Impurity in feed                  0.072    0.060    0.081
init_b (x3)   Initial void size (cm)            0.0653   0.0520   0.0997
init_t (x4)   Initial temperature (K)           296.8    290.2    299.0
pr (x5)       Autoclave pressure (atm)          3.8      3.2      2.5
t1 (x6)       First holding temperature (K)     382.3    378.1    384.2
t2 (x7)       Second holding temperature (K)    444.2    437.2    448.1

5. RESULTS AND DISCUSSION

In this section, the effects of nonlinearity, feature selection, sample size and noise on the methods of multiple quadratic regression, the composite algorithm and artificial neural networks are examined. As mentioned in the previous section, there are seven attributes involved, namely the raw material properties (wt, im and init_b) and the operating conditions (pr, t1, t2 and init_t). The output is either the thickness or the void size (see Table 2). Three batches of operating data with 99 records each were generated using the autoclave simulator. These batches are named TRAIN.SET1, TRAIN.SET2 and TEST.SET. The first two are used for training while the third set is used for testing the fit. TRAIN.SET1 and TRAIN.SET2 are merged to create a larger, 198-record, training set (called TRAIN.SET3). The performance criteria for comparing the methods and the comparison of the different methods under each considered situation are presented next. The results are presented in terms of the accuracy of output prediction. There are many ways of indexing accuracy. The most common is the percentage error, i.e.

    error_i (%) = (Y_i - P_i)/Y_i x 100%,    i = 1, 2, ..., n,    (1)

where Y_i = actual output value, P_i = prediction of Y_i and n = number of records in the test set. In this study, the void size can span from 0.0 to 0.14 cm. Percentage error is not a good index of performance for a measurement which varies over such a wide range. Since there is no perfect measure of accuracy, three different presentations (the mean absolute error ε, R2 and graphs of actual output values vs predicted output values) are used to measure and illustrate the performance of the different methods. R2 is defined by equation (4) and ε is defined as follows:

    ε_i = |Y_i - P_i|,    ε = Σ(ε_i)/n,    (2)

where Y_i = actual output value, P_i = prediction of Y_i and n = number of records in the test set.
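The accuracy measures used in this comparison are straightforward to compute; as a small sketch, the snippet below reuses the three actual thickness records from Table 2 together with hypothetical (made-up) predictions.

```python
import numpy as np

def percent_error(y, p):
    """Per-record percentage error, equation (1)."""
    return (y - p) / y * 100.0

def mean_abs_error(y, p):
    """Mean absolute error: average of |Y_i - P_i| over the test set."""
    return np.abs(y - p).mean()

def r_squared(y, p):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    return 1.0 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2)

y = np.array([1.689, 1.685, 1.820])   # actual thickness (cm), from Table 2
p = np.array([1.70, 1.68, 1.80])      # hypothetical predictions
print(mean_abs_error(y, p))           # ≈ 0.012
print(r_squared(y, p))                # ≈ 0.95
```

Note that one large outlier inflates the squared-error sum in R2 far more than it inflates the mean absolute error, which is why both measures are reported.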
Table 3. The results for the prediction of thickness

                             Quadratic regression    Neural network method
Training set (TRAIN.SET3)
    R2                       0.9473                  0.9847
    ε                        0.0142                  0.0079
Test set (TEST.SET)
    R2                       0.9186                  0.9999
    ε                        0.0168                  0.0105
Graphs of actual output values vs predicted output values are plotted to illustrate the deviation (or the fit) of the predictions from the actual values. This kind of graph gives a clear, direct sense of how good the mapping is. On the other hand, the mean absolute error ε is calculated to give a quantitative evaluation. In order to study the effect of nonlinearity, the performances of the predictions of the different output measurements, thickness and void size, need to be compared. Since these two output measurements are of a different magnitude of scale, it is not appropriate to make comparisons by either examining the graphs or by ε. Therefore, R2 is utilized. Generally, R2 is defined as:
    R2 = 1 - Σ(Y_i - P_i)^2 / Σ(Y_i - Ȳ)^2    (4)
where Ȳ is the mean of Y_i. R2, a popular measure used in linear regression, measures how much of the variation in Y_i can be explained by a regression. Sometimes, a method models the whole population well except for a few outliers. Since the square of the error is used in the calculation of R2, one exceptional outlier can hide the good performance on the rest of the data. For such cases, the mean absolute error ε gives a better measure of performance since it weights every point the same. In this section, the different methods are compared based on all three of these measures: graphs, ε and R2.

5.1. Effect of nonlinearity
To predict the thickness, given the raw material properties and operating conditions, a multiple quadratic regression is conducted on the training set TRAIN.SET3. The R2 for this regression is 0.9473, which is considered to be a very good fit. Then, the test set TEST.SET is used to test how the obtained regression formula performs on data beyond the training set, and this result also is satisfactory. Therefore, it is not necessary to utilize the composite algorithm described earlier. If linear regression is applied instead of quadratic regression, an R2 of 0.6653 is obtained. This implies that this case is mildly nonlinear. The same training set and test set are then used to train and test the neural network method, respectively. The neural network results are found to be even better. The results obtained by using multiple quadratic regression and the neural network are shown in Table 3. The neural network consists of seven input nodes, four nodes in the hidden layer and one output node (a 7-4-1 net). To predict the void size, a multiple quadratic regression is conducted on the same training set TRAIN.SET3. The R2 for this regression is 0.6954. This suggests that the void size case is highly nonlinear. To see if further improvement is possible, the composite algorithm is employed. The detailed process of using the composite algorithm on the training set, from the one-node tree to a three-node tree, is shown in Table 4. The results in the neural network column are obtained by using a four-layer (7-5-3-1) neural network with back-propagation training. Observing the overall R2 of each tree, improved performance can be seen from the one-node tree to the three-node tree for both methods. However, the improvement is much more pronounced for the regression method than for the neural network method. It is interesting to note that the R2 of the second node in the three-node tree is very poor for both methods. Scrutinizing the data in this node, it is found to be highly skewed. The composite method suggests further branching of this node, but there are not enough records in the training set for this branching to be meaningful.
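For concreteness, a 7-4-1 multi-layer perceptron trained by back-propagation can be sketched as below. This is a minimal illustration with plain gradient descent on squared error; the 198 records are synthetic stand-ins (a made-up nonlinear target), not the simulator data, and the learning rate and epoch count are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 7 inputs (raw material properties + operating conditions), 4 hidden, 1 output
W1 = rng.normal(0.0, 0.5, (7, 4)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 0.5, (4, 1)); b2 = np.zeros(1)

# Synthetic stand-in for a 198-record training set (NOT the simulator data)
X = rng.uniform(0.0, 1.0, (198, 7))
s = X @ rng.uniform(-1.0, 1.0, (7, 1))
y = s + s ** 2                          # arbitrary mildly nonlinear target
y = (y - y.mean()) / y.std()            # scale the output

lr, losses = 0.05, []
for epoch in range(5000):               # back-propagation training loop
    H = sigmoid(X @ W1 + b1)            # hidden layer activations
    out = H @ W2 + b2                   # linear output node
    err = out - y
    losses.append(float((err ** 2).mean()))
    # Gradients of the mean squared error via the chain rule
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = (err @ W2.T) * H * (1.0 - H)   # propagate error through the sigmoid
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

pred = sigmoid(X @ W1 + b1) @ W2 + b2
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

The weights after training are the implicit knowledge representation discussed in Section 2.3: no rule in the model states the input-output relationship explicitly.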
Subsequently, the test set TESTSET is use-d to examine the performance of the resulting three trees
Table 4. The results for the prediction of void size using the composite algorithm (on the training set TRAIN.SET3)

                                        R2 (quadratic    R2 (neural    Sample
  Tree                                  regression)      network)      size
  1. One-node tree (no partitioning)    0.6954           0.8077        198
  2. Two-node tree
     Overall                            0.8186           0.8404        198
     Node 1: Pr < 3.2                   0.7911           0.8334        114
     Node 2: Pr ≥ 3.2                   0.9190           0.9256         84
  3. Three-node tree
     Overall                            0.8947           0.8507        198
     Node 1: Pr < 2.6                   0.8928           0.8734         63
     Node 2: 2.6 ≤ Pr < 3.2             0.4324           0.4070         51
     Node 3: Pr ≥ 3.2                   0.9190           0.9256         84
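The four-layer back-propagation network cited in Table 4 can be sketched in outline. This is a minimal 7-5-3-1 sigmoid network trained by batch gradient descent on synthetic data; the learning rate, weight initialization and target function are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of a 7-5-3-1 sigmoid network trained by back-propagation.
# All numerical choices (init scale, learning rate, epochs) are assumptions.
import numpy as np

rng = np.random.default_rng(2)

sizes = [7, 5, 3, 1]   # layer widths as in the paper's four-layer net
W = [rng.normal(0, 0.5, (sizes[i], sizes[i + 1])) for i in range(3)]
b = [np.zeros(sizes[i + 1]) for i in range(3)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.uniform(0, 1, (100, 7))
t = sigmoid(X @ rng.normal(0, 1, (7, 1)))   # synthetic smooth target in (0, 1)

def forward(X):
    """Return the activations of every layer, input included."""
    acts = [X]
    for Wi, bi in zip(W, b):
        acts.append(sigmoid(acts[-1] @ Wi + bi))
    return acts

lr = 0.5
losses = []
for epoch in range(2000):
    acts = forward(X)
    err = acts[-1] - t
    losses.append(float(np.mean(err ** 2)))
    delta = err * acts[-1] * (1 - acts[-1])        # output-layer delta
    for i in range(2, -1, -1):
        grad_W = acts[i].T @ delta / len(X)
        grad_b = delta.mean(axis=0)
        if i > 0:                                  # propagate before updating W[i]
            delta = (delta @ W[i].T) * acts[i] * (1 - acts[i])
        W[i] -= lr * grad_W
        b[i] -= lr * grad_b

print(losses[0], losses[-1])   # mean-squared error before and after training
```

The inner loop applies the chain rule layer by layer, computing the next layer's delta from the pre-update weights before taking the gradient step.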
Fig. 3. Graph of predicted vs measured void size (TEST.SET data and one-node tree). [Two panels: quadratic regression model and neural network; axes are predicted vs real Y (void size, cm).]
The results are shown in Table 5 and Fig. 3. Neural network methods are superior except in the case of the three-node tree. One explanation is that there are too few records for each node, resulting in an "overfit" of the training set. To get further improvement, one should start with a larger training set. This case illustrates that the composite algorithm is more effective in the case of regression models. Note that only 10-15 points are present in the region of large void size where the nonlinearity predominates. Clearly, a larger number of data points is needed in this region to get a better fit.
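The composite idea can be sketched minimally, assuming a single split threshold (Pr = 3.2, as in Table 4): an induction step partitions the records, and a separate regression is fitted in each node. The data, attributes and coefficients below are synthetic stand-ins for the autoclave records.

```python
# Sketch of the composite algorithm's core step: partition on a threshold
# found by induction, fit one regression per node, compare with a single
# global fit. Threshold and data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
pr = rng.uniform(2.0, 4.0, 200)    # pressure-like attribute
x = rng.uniform(0.0, 1.0, 200)     # second attribute
# the response behaves differently on the two sides of the split
y = np.where(pr < 3.2, 0.10 - 0.02 * x + 0.01 * pr, 0.02 * x)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def fit_node(mask):
    """Least-squares fit restricted to the records in one node."""
    A = np.column_stack([np.ones(mask.sum()), pr[mask], x[mask]])
    beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    return A @ beta

y_hat = np.empty_like(y)
for mask in (pr < 3.2, pr >= 3.2):
    y_hat[mask] = fit_node(mask)

# single global fit over all records, for comparison
A = np.column_stack([np.ones_like(pr), pr, x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(r2(y, A @ beta), r2(y, y_hat))   # partitioned fit scores higher
```

Because each node's behaviour is simple once the records are separated, the per-node fits recover the piecewise structure that defeats the single global model, which is the effect Table 4 reports.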
Table 5. The results for the prediction of void size using the composite algorithm (on the test set TEST.SET)

                            R2                          e
  Tree                      Regression   Neural net    Regression   Neural net
  1. One-node tree          0.4962       0.6567        0.0119       0.0077
  2. Two-node tree          0.6108       0.6617        0.0087       0.0075
  3. Three-node tree        0.5850       0.3343        0.0081       0.0096

5.2. Effect of feature selection

To demonstrate the usefulness of feature selection, the training set TRAIN.SET3 is examined using a discrimination analysis. In this study, the statistical software package SAS (1988) is used to conduct the discrimination analysis. Table 6 shows the results of the discrimination analysis for the output variable thickness. A significance level of 0.15 (the default value in SAS) is used. Details of the calculation can be found in James (1985). For the case of thickness prediction, five out of the seven attributes included in the training set are

Table 6. The results of discrimination analysis (feature selection)

  Step   Variable entered   Variable removed   Conditional F-ratio   Significance level   Wilks' Λ
  1      im                 None               28.232                0.0001               0.5287
  2      pr                 None               7.198                 0.0002               0.4283
  3      t1                 None               7.342                 -                    0.3476
  4      wt                 None               2.530                 0.0620               0.3211
  5      init_t             None               1.912                 0.1332               0.3021
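The selection in Table 6 was produced with SAS's stepwise discrimination analysis; as a rough regression-based analogue of the same idea, forward selection by partial F-ratio can be sketched as follows. The attribute names, synthetic data and entry threshold are illustrative assumptions, not the SAS procedure or the paper's records.

```python
# Forward stepwise selection by partial F-ratio (a regression analogue of
# the SAS discrimination analysis used in the paper). Synthetic data: the
# response depends only on the first five of seven attributes.
import numpy as np

rng = np.random.default_rng(3)
names = ["im", "t1", "pr", "wt", "init_t", "init_b", "t2"]   # illustrative
X = rng.uniform(0, 1, (200, 7))
y = X[:, :5] @ np.array([1.0, 0.8, 0.6, 0.3, 0.2]) + rng.normal(0, 0.05, 200)

def rss(cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    A = np.column_stack([np.ones(len(X))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

selected, remaining = [], list(range(7))
f_enter = 10.0                       # entry threshold (an assumption)
while remaining:
    base = rss(selected)
    # partial F-ratio for adding each remaining candidate attribute
    scores = {c: (base - rss(selected + [c]))
                 / (rss(selected + [c]) / (len(X) - len(selected) - 2))
              for c in remaining}
    best = max(scores, key=scores.get)
    if scores[best] < f_enter:       # no candidate is worth entering
        break
    selected.append(best)
    remaining.remove(best)

print([names[c] for c in selected])  # recovers the informative attributes
```

As in Table 6, attributes enter in decreasing order of explanatory power, and the procedure stops once the remaining candidates fall below the entry criterion.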
Table 7. The effect of feature selection

                               Regression method               Neural network method
                               7 attributes   5 attributes     7 attributes   5 attributes
  Training set (TRAIN.SET3)
    R2                         0.9473         0.9426           0.9847         0.9863
    e                          0.0142         0.0150           0.0079         0.0076
  Test set (TEST.SET)
    R2                         0.9186         0.9198           0.9999         0.9999
    e                          0.0168         0.0169           0.0105         0.0096
selected by the discrimination analysis as relevant to the variation of thickness: im, t1, pr, wt and init_t. Two multiple quadratic regressions are conducted on the training set TRAIN.SET3, one using all seven attributes and the other using the five relevant attributes. Subsequently, the test set TEST.SET is used to examine the performance of the two resulting regressions beyond the training set. The results are shown in Table 7 and Fig. 4. The same procedure is performed again, this time using a three-layer neural network instead of the multiple regression. The results are shown in Table 7 and Fig. 5. The results show that five attributes are sufficient to predict thickness accurately for both the regression and neural network methods. In other words, the two attributes, init_b and t2, have negligible impact on thickness values. The merits of feature selection can be summarized as follows:

1. It reduces the memory necessary for storing the coefficients of regression formulae or the weights of artificial neural networks.
2. It reduces the processing and computing time.
3. It reduces the required sample size of a training set.

In the present example, these advantages might not seem significant. However, in real applications, the large number of attributes in the operational log will definitely require the incorporation of feature selection before a model is sought using regression or neural networks.

5.3. Effect of noise

In the real world, due to sensor limitations and environmental disturbances, we almost never deal with perfect data. Therefore, it is important to examine how the knowledge acquisition techniques perform with noisy data.
Fig. 4. Effect of feature selection on thickness prediction (TEST.SET data and quadratic regression model). [Two panels: using 7 features and using 5 features; axes are predicted vs real Y (thickness, cm).]
Fig. 5. Effect of feature selection on thickness prediction (TEST.SET data and ANN). [Two panels: using 7 features and using 5 features; axes are predicted vs real Y (thickness, cm).]

Table 8. The effect of noise on the prediction of void size

                   R2                                e
                   Quadratic      Neural net        Quadratic      Neural net
                   regression     method            regression     method
  Noise free       0.6108         0.6567            0.00867        0.00767
  5% noise         0.5708         0.6430            0.00908        0.00802
  10% noise        0.4942         0.6209            0.00950        0.00801
  15% noise        0.4182         0.6481            0.00998        0.00795

To demonstrate the effect of noise, some random noise is added to both the attribute and output measurements of the training set TRAIN.SET3 and the test set TEST.SET. In this study, three levels of noise are considered, i.e. 5, 10 and 15% of the total range of each attribute and output measurement. A two-node tree is used for the multiple regression method, and a one-node tree is used for the neural network method. The training set is again used to train, and the test set is used to examine the performance of both methods beyond the training set. The results for the prediction of void size are shown in Table 8. The results show that the performance of the regression method degrades as the noise level increases. This is expected, since the noise adds to the output measurements an unpredictable factor, which cannot be accounted for by the method. The performance of the neural network method, on the other
hand, is not affected by the considerable levels of noise. This is because of the relatively large number of processing nodes and interconnections, as mentioned in Section 2.
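The procedure of Section 5.3 can be sketched as follows, assuming a single attribute and a quadratic fit: uniform noise at 5, 10 and 15% of each variable's range is added to both input and output before fitting, and R2 is tracked. All quantities are synthetic stand-ins, not the values of Table 8.

```python
# Sketch of the noise experiment: corrupt inputs and outputs with uniform
# noise proportional to each variable's range, refit, and track R2.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 300)
y_clean = 0.05 + 0.1 * x + 0.08 * x ** 2      # void-size-like response

def r2_at_noise(level):
    """Add +/- level * range uniform noise to x and y, refit a quadratic."""
    xn = x + rng.uniform(-level, level, x.size) * (x.max() - x.min())
    yn = y_clean + rng.uniform(-level, level, x.size) * (y_clean.max() - y_clean.min())
    A = np.column_stack([np.ones_like(xn), xn, xn ** 2])
    beta, *_ = np.linalg.lstsq(A, yn, rcond=None)
    resid = yn - A @ beta
    return 1 - np.sum(resid ** 2) / np.sum((yn - yn.mean()) ** 2)

r2s = [r2_at_noise(lv) for lv in (0.0, 0.05, 0.10, 0.15)]
print(r2s)   # fit quality degrades as the noise level rises
```

For the regression model this degradation is unavoidable, since the output noise is by construction unpredictable from the attributes, which is the behaviour reported for the regression column of Table 8.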
6. CONCLUSION
The feasibility of extracting quantitative relationships between process variables and product quality was investigated using the autoclave curing of composites as an example. The results show that if the relationships are mildly nonlinear, then both quadratic regression and neural net models perform well in modeling the relationship. However, the presence of strong nonlinearity and interactions among variables confounds both modeling approaches, although the neural network models performed slightly better. Feature extraction is considered an important prerequisite for pretreatment of data. It can greatly reduce the dimensionality of the system to be modeled by eliminating superfluous variables. Feature extraction has been extensively studied in the statistics literature, and excellent algorithms are available in most statistical analysis packages. In this simple application, the reduction achieved was not significant, but in real-world applications, where one
might be concerned with hundreds of input variables, feature extraction will be a necessary component of any knowledge acquisition algorithm.

One approach to overcoming the problem of nonlinearity and interaction among variables is to partition the data using an induction algorithm. We showed, using the autoclave simulation example, that this approach can lead to much better performance using either quadratic regression or artificial neural networks. This idea is explored further and discussed in detail in a related paper (Shieh and Joseph, 1992a). The advantage of inductive analysis of the data is that it reduces the data to comprehensible knowledge that can be incorporated into, for example, expert systems. The advantage of regression techniques is that they yield a compact mathematical model from the data. The advantage of neural networks lies in their ability to represent highly nonlinear relationships and their tolerance to noise in the data, although the results are not explicitly available (they are hidden as weights associated with each node in the network). All three approaches should be used to analyze past operational data. The composite algorithm provides a convenient way to integrate the inductive method with regression or ANN modeling. The results on the autoclave simulation show that such a synergistic combination of feature selection, induction, regression and ANN modeling can be used for exploratory analysis of routine data to discover processing knowledge that is otherwise hidden in the mass of numerical and categorical data. Because of the limitations of operational data (not totally random, limited range of variables, missing pieces of data etc.), it can only be used in an exploratory sense and not in a confirmatory sense. However, the knowledge extracted from past history can be used to guide the design of further experiments.

Acknowledgement-F. H. Wang would like to thank McDonnell Douglas Corporation for granting her leave of absence to complete her dissertation at Washington University.
REFERENCES
Breiman L., J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees. Wadsworth, Belmont, CA (1984).
Campbell F. C. et al., Computer aided curing of composites. Internal Report AFWAL-TR-86-4060, Machine Research Laboratory, Wright Patterson AFB, OH (1988).
Cendrowska J., PRISM: an algorithm for inducing modular rules. Int. J. Man-Mach. Stud. 27, 349 (1987).
D'Ambrosio B., Real-time process management for materials composition in chemical manufacturing. IEEE Expert 2, 80 (1987).
Devijver P. A. and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice-Hall, London (1982).
Draper N. R. and H. Smith, Applied Regression Analysis, 2nd Edn. Wiley, New York (1981).
Geladi P. and B. R. Kowalski, Partial least-squares regression: a tutorial. Analyt. Chimica Acta 185, 1 (1986).
Hand D. J., Discrimination and Classification. Wiley, New York (1981).
James M., Classification Algorithms. Wiley, New York (1985).
Lippmann R. P., An introduction to computing with neural nets. IEEE ASSP Mag., p. 4 (1987).
Love P. L. and M. Simaan, Automatic recognition of primitive changes in manufacturing process signals. Patt. Recog. 21, 333 (1988).
Norman P. W. and S. Naveed, Knowledge acquisition analysis and structuring for the construction of real-time supervisory expert systems. Chemical Engng Res. Des. 66, 477 (1988).
Quinlan J. R., Learning efficient classification procedures and their application to chess end games. Machine Learning (Michalski R. S. et al., Eds), p. 463. Morgan Kaufmann, Los Altos (1983).
Quinlan J. R., Induction of decision trees. Mach. Learn. 1, 81 (1986).
Rumelhart D. E., J. L. McClelland and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA (1986).
SAS Institute, SAS/STAT User's Guide: For Personal Computers, Release 6.03 Edn. SAS Institute Inc., Cary, NC (1988).
Shieh D. S. and B. Joseph, Exploratory data analysis using induction partitioning and regression trees. I & EC Res. (1992a). In press.
Shieh D. S. and B. Joseph, Model-based feature selection. Computers chem. Engng (1992b). In press.
Waterman D. A., A Guide to Expert Systems. Addison-Wesley, Reading, MA (1986).
Wu H. T., Knowledge based control of composite material manufacturing processes. D.Sc. Dissertation, Department of Chemical Engineering, Washington University, St Louis (1990).