Computers chem. Engng, Vol. 16, No. 4, pp. 413-423, 1992
Printed in Great Britain. All rights reserved
0098-1354/92 $5.00 + 0.00
Copyright © 1992 Pergamon Press Ltd

EXPLORATORY DATA ANALYSIS: A COMPARISON OF STATISTICAL METHODS WITH ARTIFICIAL NEURAL NETWORKS

B. JOSEPH, F. H. WANG and D. S.-S. SHIEH

Chemical Engineering Department, Washington University, St Louis, MO 63130-4899, U.S.A.

(Received 23 January 1991; final revision received 1 October 1991; received for publication 10 December 1991)
Abstract-Knowledge acquisition is perceived as a major bottleneck in the deployment of expert systems. This problem is more severe in knowledge-based control systems because the knowledge is often difficult to extract and needs to be constantly updated to reflect changing processing conditions. This article considers the problem of extracting processing knowledge directly from historical plant operational data. First we consider the features of the problem. The applicability of machine learning algorithms is considered. Quadratic regression, induction and artificial neural network modeling are applied to a process selected from composite manufacturing. The results show that, while neural networks are a promising new approach, there is a need to focus attention on the preprocessing of data to cope with data reduction and process nonlinearity. Methods based on statistical analysis will play an important role in this preprocessing.
1. INTRODUCTION

During the last few years, there has been a significant push towards improving the quality and the productivity of goods manufactured in this country. For the most part, the literature on this subject emphasizes the implementation of statistical quality control (SQC/SPC) techniques: gathering of quality data, monitoring the quality through statistical techniques to detect deviations, and taking corrective actions to remedy off-quality situations. An equally important first step is the recognition of all the variables that can influence the quality and an understanding of why off-specification events arise. This requires a thorough understanding of the physical characteristics of the process and of the effect of the many primary and secondary factors on product quality.

Many of these factors arise from changes taking place in the process. Raw material suppliers and characteristics can vary over a period of time, as can the influence of age on material and equipment. Such factors can adversely affect the quality and are difficult to characterize. Usually, an understanding of the effect of these factors on the quality is absorbed over a period of time by the operators of the system through experience. Experienced operators are good at anticipating off-quality situations and taking proper corrective actions. Process engineers use this knowledge to identify key variables and to track their effect on the quality. We refer to this as processing knowledge. Processing knowledge is an important source of productivity and quality gains.

Recent advances in artificial intelligence, expert systems and artificial neural networks offer an opportunity to capture and employ this knowledge in knowledge-based control systems that imitate the skill of experienced operators (Waterman, 1986; Love et al., 1987; Norman et al., 1988; D'Ambrosio et al., 1988). A major difficulty seen in the implementation of knowledge-based control systems is the acquisition of the knowledge. Problems arise because experienced operators often find it difficult to express their processing knowledge in precise, logical sequences. Another difficulty is the fact that the knowledge base changes with time and therefore requires periodic updating to reflect current operating conditions. Thus there is also an issue of representing this knowledge in such a way that new knowledge can be added and obsolete knowledge removed without disrupting the knowledge-based control system.

There is a strong incentive to search for computer aids to acquire process control knowledge from past operational data. With the advent of computer process control and the push for increased monitoring for SPC/SQC purposes, there is a wealth of sensor data stored in machine-readable form. The availability of algorithms to scan the large
volumes of data will have immediate application in overcoming the bottleneck of knowledge acquisition. In this article, we look at some classical (statistics based) techniques and compare them with some novel methods using artificial neural networks. The methods are evaluated through application to a representative batch manufacturing process, namely the autoclave curing of composites. The results show the relative strengths and weaknesses of these two approaches. We have selected the autoclave curing of composite materials as an example because it is an industrially important manufacturing process and is susceptible to a large rejection rate of the final product. We have also accumulated some experience working on the models for this process. Due to the high cost associated with the parts built using this process and their strategic importance in defense, there has been considerable interest on the part of industry in intelligent processing of this batch manufacturing operation (Campbell et al., 1988).

2. BACKGROUND
2.1. Nature of the problem considered
For the purposes of this article we restrict ourselves to acquiring knowledge of the following type: given a set of quantitative measurements on past operating conditions of a process, generate a relationship between the product qualities and the measurements. We will also restrict ourselves to static relationships (time series data are not considered). To cite an example, consider the autoclave curing of composites, which is a batch process (this problem is discussed in more detail in a later section). Here the product quality is measured in terms of the thickness of the final product and the maximum void size present. The measurements include properties of the raw material fed, temperature of the feed, curing temperature used, pressure applied, time at which pressure is applied, time at which temperature in the autoclave is increased, etc. The knowledge sought is the influence of these variables on the product quality. In a sense, this is equivalent to building a process model; but it is more than that. We can consider, for example, other qualitative variables such as the supplier of the raw material, the stock number from which the raw material came, the shift in which the production was carried out, the name of the operator in the plant, etc. All these other variables may or may not have an influence on the product quality; some variables are directly causal, and some are less relevant. The latter variables are mostly qualitative in nature and therefore inductive
techniques (Quinlan, 1983, 1986; Cendrowska, 1987; Breiman et al., 1984) are more appropriate for relating them to the product quality. One important question that arises is how to use prior knowledge about the process in an effective way to reduce the problem. Ideally we would like to complement prior knowledge with the new data that is available. This issue is discussed in a related paper (Shieh and Joseph, 1992b). For the present we assume that prior knowledge is used to screen out irrelevant data records and attributes to reduce the dimensionality of the problem.

2.2. Regression methods
Regression is one of the classical techniques used to find relationships among variables. Linear regression is usually tried first. One requirement for regression is that the data be well-distributed. Past operational data cannot be expected to have the nice distribution that is possible with well-planned experiments. More than likely, the data tend to be clustered around regions and skewed with respect to certain variables. Another difficulty with regression is the problem of coping with nonlinearity. If the nonlinearity is known to occur in a certain way (prior knowledge), then it is easier to treat. Otherwise, a quadratic fit can be attempted provided the nonlinearity is not severe. Since regression methods are well known, we will not discuss them further.

2.3. Artificial neural network methods
Artificial neural networks have evolved as a powerful computational paradigm in the past few years. Neural network models consist of simple processing units in a highly interconnected network. Information processing is achieved through activation and inhibition of the interconnections among the units in the network. The network structure is based on our present understanding of biological nervous systems. The models do share many of the computational properties that we attribute to human intelligence. Among these properties are association, discrimination, self-organization, self-stabilization and learning. The computational properties depend upon the topology of the neural network-the number of layers, the number of nodes in each layer and the interconnection scheme. They also depend upon the functions used to determine the activation of the nodes, the output from each node and the change in interconnection strength over time. Many different architectures have been proposed for artificial neural networks. For a discussion and taxonomy, see Lippmann (1987). The selection of the
architecture for a given application depends on the problem characteristics. To select a suitable neural network model for the above problem, the following three factors are considered: (1) the nature of the inputs: most process operational data are continuous values, so only the models that take continuous values as inputs are considered; (2) availability of the desired outputs: since process operational data are available, information about the desired outputs is known, and therefore supervised neural networks are more appropriate; (3) linearity: most chemical processes are nonlinear, so the selected model must be able to handle nonlinearity. Keeping these three factors in mind, a neural network model called the "multi-layer perceptron" is selected for use in this research. Multi-layer perceptrons are feedforward networks with one or more layers of nodes between the input and output layers. The learning rule selected to train the neural network is the back-propagation algorithm, originally developed by Rumelhart et al. (1986). One point worth noting is that the knowledge contained in a neural network is not explicitly stated as rules anywhere in the model. In other words, the knowledge representation is implicit. This characteristic has its advantages and disadvantages. The advantage is its adaptability and ease of modification. The disadvantage is the difficulty of explaining and integrating the knowledge.

2.4. Inductive methods

A third possible approach to acquiring knowledge, particularly suited for implementation in the form of rules in knowledge-based systems, is based on induction from examples. These algorithms treat each record (or instance) of historical data as an example from which one attempts to learn the rule(s) that govern the underlying relationships. The terminology in this area of research is slightly different from that of the statistical literature.
The independent variables in a data set are called attributes. The dependent variable is called the class. The problem of classification is reduced to a problem of building a decision tree which branches on selected attributes. By branching down the decision tree (comparing attribute values at each node in the decision tree), we can arrive at a classification for a given new instance of independent variables. For a detailed discussion on induction methods in machine learning, the reader is
referred to Quinlan (1983, 1986). Quinlan called his algorithm ID3 (Iterative Dichotomizer 3rd). Although the original ID3 algorithm as proposed by Quinlan is restricted to attributes that are categorical (qualitative) in nature, Breiman et al. (1984) have developed a similar algorithm (called CART, Classification And Regression Trees) for problems that have numerical attribute values. The advantages of the inductive algorithms are as follows: (1) the algorithms are very simple and can be easily coded for use on a computer; (2) a large amount of data can be condensed into a simple decision tree; (3) the resulting decision tree is fully comprehensible and can be easily justified by the experts; and (4) the decision trees can be readily converted to production rules. The disadvantages associated with inductive methods are also many. First, the resulting decision trees are sometimes large and impractical, especially with noisy data. In such a case, either some pretreatment of the data or a pruning of the decision tree is necessary. If the data themselves are meaningful, then a comprehensible decision tree is almost always obtained. But what happens if we have incremental data? Most likely, we may get a decision tree from the first batch of data and a slightly different decision tree from the second batch of data. This leads to the second problem: the reconciliation of decision trees resulting from different batches of data. The combination of these two characteristics could severely restrict the application of ID3 to the engineering domain. This handicap can be overcome to some extent by proper pretreatment of data using statistical pattern recognition techniques. We conducted a preliminary study to evaluate the application of an ID3-type algorithm to gather processing knowledge from operational data generated by an autoclave process simulator.
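To illustrate the induction idea, here is a minimal sketch of the ID3-style split criterion: at each node, the attribute with the largest information gain (reduction in entropy of the class labels) is chosen for branching. This is an illustration only, not the implementation used in the study; the records and quality labels below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(records, labels, attributes):
    """Pick the attribute whose split yields the largest information gain."""
    base = entropy(labels)
    def gain(attr):
        split = {}
        for rec, lab in zip(records, labels):
            split.setdefault(rec[attr], []).append(lab)
        remainder = sum(len(subset) / len(labels) * entropy(subset)
                        for subset in split.values())
        return base - remainder
    return max(attributes, key=gain)

# Hypothetical qualitative records: curing pressure level and feed purity
records = [
    {"pressure": "low",  "purity": "high"},
    {"pressure": "low",  "purity": "low"},
    {"pressure": "high", "purity": "high"},
    {"pressure": "high", "purity": "low"},
]
labels = ["reject", "reject", "accept", "accept"]   # product quality class
print(best_attribute(records, labels, ["pressure", "purity"]))  # prints: pressure
```

A full ID3 tree recurses on each subset with the remaining attributes until the labels at a node are pure.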
The objective of the study was to classify which operational conditions are suitable for a desired product quality when the properties of the feedstock are given. The numerical operational data were first converted to qualitative data. The resulting decision tree is comprehensible and justifiable, as expected. But, due to the conversion from quantitative to qualitative data, the accuracy of representation is reduced significantly. As a result, the knowledge acquired by the ID3 algorithm is not suitable in a knowledge-based control environment. As we will see later, inductive techniques can complement both regression and neural network models.

2.5. Feature selection

A problem faced by all three types of classifiers above is the large number of features (or attributes) associated with the classification problem. In order to
use these algorithms efficiently, it is necessary to do some preliminary screening so that only the relevant attributes are selected for use in the classification step. This preliminary screening is called feature selection. Consider that objects, i.e. data, are points in a hyperspace and that an object is described by a number of features (or attributes). Then each feature can be regarded as a dimension. Feature selection aims to reduce the dimensionality of the hyperspace without reducing the accuracy of the classifiers. Obviously, some information could be lost during the dimension reduction. However, the following reasons justify the task of feature selection: (1) redundant measurements or highly correlated variables can be eliminated without loss of information (the problem of collinearity); (2) the high dimensionality of the hyperspace makes it difficult to apply complex classification techniques such as those using an estimate of class conditional probability density functions (pdfs). Hand (1981) has shown that, using a linear discriminant classifier, the misclassification rate initially decreases as the number of features increases, but then begins to degrade. In order to choose the best subset of features among the original set of features for the classifier, it is necessary to have an index or measure on which feature selection is based. The most direct and reliable index of the performance of a classifier is the misclassification rate (also called the error rate). Unfortunately, for most classifiers, the calculation of the error rate is very time-consuming and needs large volumes of data. Instead of the error rate, other measures such as Wilks' Λ and the conditional F-ratio are used (James, 1985). One of the most effective and efficient methods for feature selection is stepwise selection (Devijver and Kittler, 1982). It starts with no features in the selected subset. Features are added one at a time to this subset.
The feature having the smallest Wilks' Λ (or the largest conditional F-ratio) is added to the selected subset. Next, select the feature from the remaining set which, when paired with the selected features, gives the smallest Λ. This is called forward selection. Then start the backward selection procedure, which examines the decrease in the goodness measure when one of the features in the selected subset is deleted. If the decrease is below a specified threshold, then delete this feature. Repeat the cycle of forward and backward selection until the measure of goodness cannot be significantly increased with further inclusion of features. The final subset of features presents the dimensions of the hyperspace in
which the data are well-separated into groups according to their classes. There is another procedure for feature extraction in statistical pattern recognition which reduces the number of dimensions by linearly transforming the original dimension D of features into a smaller dimension, d. Algorithms in this category include principal component analysis (Draper and Smith, 1981) and partial least squares (Geladi and Kowalski, 1986). Because the resulting d transformed features are not physically meaningful, this type of feature extractor will not be considered in this study. However, this method may have applications in the use of artificial neural networks.

3. A COMPOSITE ALGORITHM FOR PREPROCESSING OF DATA
Table 1 compares the characteristics of induction methods with those of regression techniques. Both methods have characteristics that are desirable when processing routine operational data. Chemical process systems are usually nonlinear and have large interactions among variables. These characteristics favor the use of induction methods over regression analysis. The operational data, however, are mostly quantitative, and the most desirable knowledge for use in process control is in the form of quantitative formulae which can be used for estimation or prediction. These last two points make the use of induction algorithms less appealing. In this section, we propose a preprocessing algorithm based on Breiman's inductive classification algorithms to divide the operational data into regions within which we have a greater chance of finding a numerical relationship. For problems with nonnumerical input, ID3 or PRISM type algorithms can be used to achieve a similar decomposition of the data. This will have the advantage of isolating the region of association within which we can seek local quantitative relationships using regression methods. Figure 1 shows the schematic of the proposed composite algorithm. A more detailed description and evaluation of the algorithm is given in Shieh and Joseph (1992a). We believe that this classification step will bring together data that are similar in nature and hence are more likely to be randomly distributed with regard to

Table 1. Characteristics of statistical and induction methods

   Multiple regression                 Induction
1. Assumes linear systems             No such requirement needed
2. Assumes randomly generated data    No such assumption needed
3. Quantitative data                  Qualitative (or mixed) data
4. Results in mathematical formulas   Results in production rules
5. Not sensitive to noise             Sensitive to noise
6. Compact representation             Decision trees can be large
[Fig. 1. Schematic of the proposed composite algorithm: an initial set of operational data containing qualitative and quantitative information is split by inductive classification (using CART or ID3) into subsets 1 and 2, which are further classified if necessary; the quantitative association within each subset is then found using regression or ANN.]
the other variables. Also, by restricting the range of a quantitative fit, there is a greater likelihood of obtaining a better model. A disadvantage of this approach, especially with small input data sets, is that it reduces the sample size in each region. In the proposed algorithm, multiple quadratic regression is suggested instead of the commonly used linear regression because chemical process operational data are usually highly nonlinear. The composite algorithm has the advantage that the nonlinearity is reduced by restricting the range of the input variables. The application of the composite algorithm to problems involving only quantitative data proceeds as follows. Given a set of operational data, conduct multiple regression on the data. If they are well fitted, then stop; otherwise conduct the composite algorithm, which consists of the following steps:

Step 1. Apply Breiman's CART (Breiman et al., 1984) to the data set to select a feature for partitioning the data. Group the data into two subsets using this feature.

Step 2. For each subset, conduct multiple quadratic regression analysis. Calculate the R2 measure and test goodness of fit. If R2 indicates the subset is well fitted, then stop; otherwise repeat Step 1 using each subset.
In Step 2, the number of records must be much larger than the number of parameters to be estimated in the multiple quadratic regression. For instance, given 4 features in the record, the parameters in a quadratic function include the intercept, 4 first-order terms, 6 cross-product terms and 4 square terms. In other words, the number of records needs to be larger than 15. However, to make the regression meaningful, a number much larger than 15 is necessary. The choice of an R2 value as a criterion is very subjective and usually depends on the context in which it is used. To determine an association relationship, i.e. a correlation, the threshold of R2 could be as low as 0.30 in a social science domain. In this study, a high R2 criterion would end up with a large decision tree while a low R2 gives imprecise knowledge. The following application study suggests a value of R2 in the range 0.85-0.95 when testing goodness of fit for the training set.

4. AUTOCLAVE CURING OF COMPOSITES
The autoclave process is used to produce fiber-reinforced composite materials. In this process, fiber mats impregnated with unreacted thermosetting resin (called prepregs) are laid upon a tool, which is then covered with a variety of materials, and the set-up is inserted into an autoclave for processing. A schematic of the lay-up is given in Fig. 2. By properly manipulating the pressure and temperature profiles in the autoclave, excess resin can be squeezed out of the laminate and the resin can be cured to produce a composite material with the desired properties. Quality criteria for the final product include fiber/resin ratio, thickness, extent of curing and void size. Because there are strong interactions among the effects of autoclave temperature, autoclave pressure and raw material properties (e.g. impurity content, weight fraction of resin), this process is a good example to demonstrate how much knowledge can be extracted from routine data for use in process control. The autoclave curing of composites is not well understood and good processing is still an art. The process is subject to significant part rejection rates. Knowledge obtained from past historical data is therefore extremely valuable.

[Fig. 2. A schematic of the lay-up of an autoclave process, showing the nylon bag, glass breather, porous release cloth and aluminum dam.]
Variations in the properties of raw materials and different specifications of the product will require different modifications to the standard operational plan. With help from knowledge acquisition techniques, i.e. ANN, induction and regression, the following questions faced by the process engineer are expected to be answered: (i) What input variables have a significant effect on the product thickness and void size (the feature selection problem)? (ii) Given raw material properties and selected operating conditions, predict the product output measurements, i.e. thickness and void size. (iii) Given raw material properties, how does one adjust the operating conditions to improve the product quality? Undoubtedly, the best answers to the above questions could be obtained by carrying out fundamental processing research. But that may take many years and considerable manpower. With the proposed techniques, we expect to get rough answers in a short time. Also, the acquired knowledge can give researchers some insight into the process. In this example, void size and thickness are the two quality measurements to be controlled. To illustrate the characteristics of the different approaches to knowledge acquisition, we consider only variations in the following variables, whose relationship with the quality variables is to be sought. The data are generated from an autoclave simulator developed at Washington University (Wu, 1990). Historical logs of process data were generated by randomly varying the process parameters around a fixed operation plan. Table 2 gives a typical data set that might appear in the operational log of the process.

Table 2. A typical operation log

Variable      Description                       1        2        3
q1            Thickness (cm)                    1.689    1.685    1.820
q2            Void size (cm)                    0.005    0.012    0.080
wt (x1)       Wt fraction of resin              0.38     0.418    0.436
im (x2)       Impurity in feed                  0.072    0.060    0.081
init_b (x3)   Initial void size (cm)            0.0653   0.0520   0.0997
init_t (x4)   Initial temperature (K)           296.8    290.2    299.0
pr (x5)       Autoclave pressure (atm)          3.8      3.2      2.5
t1 (x6)       First holding temperature (K)     382.3    378.1    384.2
t2 (x7)       Second holding temperature (K)    444.2    437.2    448.1

5. RESULTS AND DISCUSSION

In this section, the effects of nonlinearity, feature selection, sample size and noise on the methods of multiple quadratic regression, the composite algorithm and artificial neural networks are examined. As mentioned in the previous section, there are seven attributes involved, namely the raw material properties (wt, im and init_b) and the operating conditions (pr, t1, t2 and init_t). The output is either the thickness or the void size (see Table 2). Three batches of operating data with 99 records each were generated using the autoclave simulator. These batches are named TRAIN.SET1, TRAIN.SET2 and TEST.SET. The first two are used for training while the third set is used for testing the fit. TRAIN.SET1 and TRAIN.SET2 are merged to create a larger, 198-record, training set (called TRAIN.SET3). The performance criteria for comparing the methods and the comparison of the different methods under each considered situation are presented next. The results are presented in terms of the accuracy of output prediction. There are many ways of indexing accuracy. The most common is the percentage error, i.e.

    error_i (%) = (Y_i - P_i)/Y_i x 100%,    i = 1, 2, ..., n,    (1)

where Y_i = actual output value, P_i = prediction of Y_i and n = number of records in the test set. In this study, the void size can span from 0.0 to 0.14 cm. Percentage error is not a good index of performance for a measurement which varies over such a wide range. Since there is no perfect measure of accuracy, three different presentations (the mean absolute error ε, R2 and graphs of actual output values vs predicted output values) are used to measure and illustrate the performance of the different methods. R2 is defined by equation (4) and ε is defined as follows:

    ε_i = |Y_i - P_i|,    ε = Σ(ε_i)/n,    (2)

where Y_i = actual output value, P_i = prediction of Y_i and n = number of records in the test set.
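The accuracy measures used in this comparison are straightforward to compute; as a small sketch, the snippet below reuses the three actual thickness records from Table 2 together with hypothetical (made-up) predictions.

```python
import numpy as np

def percent_error(y, p):
    """Per-record percentage error, equation (1)."""
    return (y - p) / y * 100.0

def mean_abs_error(y, p):
    """Mean absolute error: average of |Y_i - P_i| over the test set."""
    return np.abs(y - p).mean()

def r_squared(y, p):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    return 1.0 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2)

y = np.array([1.689, 1.685, 1.820])   # actual thickness (cm), from Table 2
p = np.array([1.70, 1.68, 1.80])      # hypothetical predictions
print(mean_abs_error(y, p))           # ≈ 0.012
print(r_squared(y, p))                # ≈ 0.95
```

Note that one large outlier inflates the squared-error sum in R2 far more than it inflates the mean absolute error, which is why both measures are reported.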
Table 3. The results for the prediction of thickness

                             Quadratic regression    Neural network method
Training set (TRAIN.SET3)
    R2                       0.9473                  0.9847
    ε                        0.0142                  0.0079
Test set (TEST.SET)
    R2                       0.9186                  0.9999
    ε                        0.0168                  0.0105
Graphs of actual output values vs predicted output values are plotted to illustrate the deviation (or the fit) of the predictions from the actual values. This kind of graph gives a clear, direct sense of how good the mapping is. On the other hand, the mean absolute error ε is calculated to give a quantitative evaluation. In order to study the effect of nonlinearity, the performances of the predictions of the different output measurements, thickness and void size, need to be compared. Since these two output measurements are of a different magnitude of scale, it is not appropriate to make comparisons by either examining the graphs or by ε. Therefore, R2 is utilized. Generally, R2 is defined as:
    R2 = 1 - Σ(Y_i - P_i)^2 / Σ(Y_i - Ȳ)^2    (4)
where Ȳ is the mean of Y_i. R2, a popular measure used in linear regression, measures how much of the variation in Y_i can be explained by a regression. Sometimes, a method models the whole population well except for a few outliers. Since the square of the error is used in the calculation of R2, one exceptional outlier can hide the good performance on the rest of the data. For such cases, the mean absolute error ε gives a better measure of performance since it weights every point the same. In this section, the different methods are compared based on all three of these measures: graphs, ε and R2.

5.1. Effect of nonlinearity
To predict the thickness, given the raw material properties and operating conditions, a multiple quadratic regression is conducted on the training set TRAIN.SET3. The R2 for this regression is 0.9473, which is considered to be a very good fit. Then, the test set TEST.SET is used to test how the obtained regression formula performs on data beyond the training set, and this result also is satisfactory. Therefore, it is not necessary to utilize the composite algorithm described earlier. If linear regression is applied instead of quadratic regression, an R2 of 0.6653 is obtained. This implies that this case is mildly nonlinear. The same training set and test set are then used to train and test the neural network method, respectively. The neural network results are found to be even better. The results obtained by using multiple quadratic regression and the neural network are shown in Table 3. The neural network consists of seven input nodes, four nodes in the hidden layer and one output node (a 7-4-1 net). To predict the void size, a multiple quadratic regression is conducted on the same training set TRAIN.SET3. The R2 for this regression is 0.6954. This suggests that the void size case is highly nonlinear. To see if further improvement is possible, the composite algorithm is employed. The detailed process of using the composite algorithm on the training set, from the one-node tree to a three-node tree, is shown in Table 4. The results in the neural network column are obtained by using a four-layer (7-5-3-1) neural network with back-propagation training. Observing the overall R2 of each tree, improved performance can be seen from the one-node tree to the three-node tree for both methods. However, the improvement is much more pronounced for the regression method than for the neural network method. It is interesting to note that the R2 of the second node in the three-node tree is very poor for both methods. Scrutinizing the data in this node, it is found to be highly skewed. The composite method suggests further branching of this node, but there are not enough records in the training set for this branching to be meaningful.
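For concreteness, a 7-4-1 multi-layer perceptron trained by back-propagation can be sketched as below. This is a minimal illustration with plain gradient descent on squared error; the 198 records are synthetic stand-ins (a made-up nonlinear target), not the simulator data, and the learning rate and epoch count are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 7 inputs (raw material properties + operating conditions), 4 hidden, 1 output
W1 = rng.normal(0.0, 0.5, (7, 4)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 0.5, (4, 1)); b2 = np.zeros(1)

# Synthetic stand-in for a 198-record training set (NOT the simulator data)
X = rng.uniform(0.0, 1.0, (198, 7))
s = X @ rng.uniform(-1.0, 1.0, (7, 1))
y = s + s ** 2                          # arbitrary mildly nonlinear target
y = (y - y.mean()) / y.std()            # scale the output

lr, losses = 0.05, []
for epoch in range(5000):               # back-propagation training loop
    H = sigmoid(X @ W1 + b1)            # hidden layer activations
    out = H @ W2 + b2                   # linear output node
    err = out - y
    losses.append(float((err ** 2).mean()))
    # Gradients of the mean squared error via the chain rule
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = (err @ W2.T) * H * (1.0 - H)   # propagate error through the sigmoid
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

pred = sigmoid(X @ W1 + b1) @ W2 + b2
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

The weights after training are the implicit knowledge representation discussed in Section 2.3: no rule in the model states the input-output relationship explicitly.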
Subsequently, the test set TESTSET is use-d to examine the performance of the resulting three trees
Table 4. The results for the prediction of void size using the composite algorithm (on the training set TRAIN.SET3)

                                        R2 (quadratic    R2 (neural    Sample
  Tree                                  regression)      network)      size
  1. One-node tree (no partitioning)    0.6954           0.8077        198
  2. Two-node tree
     Overall                            0.8186           0.8404        198
     Node 1: Pr < 3.2                   0.7911           0.8334        114
     Node 2: Pr ≥ 3.2                   0.9190           0.9256         84
  3. Three-node tree
     Overall                            0.8947           0.8507        198
     Node 1: Pr < 2.6                   0.8928           0.8734         63
     Node 2: 2.6 ≤ Pr < 3.2             0.4324           0.4070         51
     Node 3: Pr ≥ 3.2                   0.9190           0.9256         84
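The four-layer back-propagation network cited in Table 4 can be sketched in outline. This is a minimal 7-5-3-1 sigmoid network trained by batch gradient descent on synthetic data; the learning rate, weight initialization and target function are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of a 7-5-3-1 sigmoid network trained by back-propagation.
# All numerical choices (init scale, learning rate, epochs) are assumptions.
import numpy as np

rng = np.random.default_rng(2)

sizes = [7, 5, 3, 1]   # layer widths as in the paper's four-layer net
W = [rng.normal(0, 0.5, (sizes[i], sizes[i + 1])) for i in range(3)]
b = [np.zeros(sizes[i + 1]) for i in range(3)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.uniform(0, 1, (100, 7))
t = sigmoid(X @ rng.normal(0, 1, (7, 1)))   # synthetic smooth target in (0, 1)

def forward(X):
    """Return the activations of every layer, input included."""
    acts = [X]
    for Wi, bi in zip(W, b):
        acts.append(sigmoid(acts[-1] @ Wi + bi))
    return acts

lr = 0.5
losses = []
for epoch in range(2000):
    acts = forward(X)
    err = acts[-1] - t
    losses.append(float(np.mean(err ** 2)))
    delta = err * acts[-1] * (1 - acts[-1])        # output-layer delta
    for i in range(2, -1, -1):
        grad_W = acts[i].T @ delta / len(X)
        grad_b = delta.mean(axis=0)
        if i > 0:                                  # propagate before updating W[i]
            delta = (delta @ W[i].T) * acts[i] * (1 - acts[i])
        W[i] -= lr * grad_W
        b[i] -= lr * grad_b

print(losses[0], losses[-1])   # mean-squared error before and after training
```

The inner loop applies the chain rule layer by layer, computing the next layer's delta from the pre-update weights before taking the gradient step.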
Fig. 3. Graph of predicted vs measured void size (TEST.SET data and one-node tree). [Two panels: quadratic regression model and neural network; axes are predicted vs real Y (void size, cm).]
The results are shown in Table 5 and Fig. 3. Neural network methods are superior except in the case of the three-node tree. One explanation is that there are too few records for each node, resulting in an "overfit" of the training set. To get further improvement, one should start with a larger training set. This case illustrates that the composite algorithm is more effective in the case of regression models. Note that only 10-15 points are present in the region of large void size where the nonlinearity predominates. Clearly, a larger number of data points is needed in this region to get a better fit.
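The composite idea can be sketched minimally, assuming a single split threshold (Pr = 3.2, as in Table 4): an induction step partitions the records, and a separate regression is fitted in each node. The data, attributes and coefficients below are synthetic stand-ins for the autoclave records.

```python
# Sketch of the composite algorithm's core step: partition on a threshold
# found by induction, fit one regression per node, compare with a single
# global fit. Threshold and data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
pr = rng.uniform(2.0, 4.0, 200)    # pressure-like attribute
x = rng.uniform(0.0, 1.0, 200)     # second attribute
# the response behaves differently on the two sides of the split
y = np.where(pr < 3.2, 0.10 - 0.02 * x + 0.01 * pr, 0.02 * x)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def fit_node(mask):
    """Least-squares fit restricted to the records in one node."""
    A = np.column_stack([np.ones(mask.sum()), pr[mask], x[mask]])
    beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    return A @ beta

y_hat = np.empty_like(y)
for mask in (pr < 3.2, pr >= 3.2):
    y_hat[mask] = fit_node(mask)

# single global fit over all records, for comparison
A = np.column_stack([np.ones_like(pr), pr, x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(r2(y, A @ beta), r2(y, y_hat))   # partitioned fit scores higher
```

Because each node's behaviour is simple once the records are separated, the per-node fits recover the piecewise structure that defeats the single global model, which is the effect Table 4 reports.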
Table 5. The results for the prediction of void size using the composite algorithm (on the test set TEST.SET)

                            R2                          e
  Tree                      Regression   Neural net    Regression   Neural net
  1. One-node tree          0.4962       0.6567        0.0119       0.0077
  2. Two-node tree          0.6108       0.6617        0.0087       0.0075
  3. Three-node tree        0.5850       0.3343        0.0081       0.0096

5.2. Effect of feature selection

To demonstrate the usefulness of feature selection, the training set TRAIN.SET3 is examined using a discrimination analysis. In this study, the statistical software package SAS (1988) is used to conduct the discrimination analysis. Table 6 shows the results of the discrimination analysis for the output variable thickness. A significance level of 0.15 (the default value in SAS) is used. Details of the calculation can be found in James (1985). For the case of thickness prediction, five out of the seven attributes included in the training set are

Table 6. The results of discrimination analysis (feature selection)

  Step   Variable entered   Variable removed   Conditional F-ratio   Significance level   Wilks' Λ
  1      im                 None               28.232                0.0001               0.5287
  2      pr                 None               7.198                 0.0002               0.4283
  3      t1                 None               7.342                 -                    0.3476
  4      wt                 None               2.530                 0.0620               0.3211
  5      init_t             None               1.912                 0.1332               0.3021
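The selection in Table 6 was produced with SAS's stepwise discrimination analysis; as a rough regression-based analogue of the same idea, forward selection by partial F-ratio can be sketched as follows. The attribute names, synthetic data and entry threshold are illustrative assumptions, not the SAS procedure or the paper's records.

```python
# Forward stepwise selection by partial F-ratio (a regression analogue of
# the SAS discrimination analysis used in the paper). Synthetic data: the
# response depends only on the first five of seven attributes.
import numpy as np

rng = np.random.default_rng(3)
names = ["im", "t1", "pr", "wt", "init_t", "init_b", "t2"]   # illustrative
X = rng.uniform(0, 1, (200, 7))
y = X[:, :5] @ np.array([1.0, 0.8, 0.6, 0.3, 0.2]) + rng.normal(0, 0.05, 200)

def rss(cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    A = np.column_stack([np.ones(len(X))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

selected, remaining = [], list(range(7))
f_enter = 10.0                       # entry threshold (an assumption)
while remaining:
    base = rss(selected)
    # partial F-ratio for adding each remaining candidate attribute
    scores = {c: (base - rss(selected + [c]))
                 / (rss(selected + [c]) / (len(X) - len(selected) - 2))
              for c in remaining}
    best = max(scores, key=scores.get)
    if scores[best] < f_enter:       # no candidate is worth entering
        break
    selected.append(best)
    remaining.remove(best)

print([names[c] for c in selected])  # recovers the informative attributes
```

As in Table 6, attributes enter in decreasing order of explanatory power, and the procedure stops once the remaining candidates fall below the entry criterion.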
Table 7. The effect of feature selection

                               Regression method               Neural network method
                               7 attributes   5 attributes     7 attributes   5 attributes
  Training set (TRAIN.SET3)
    R2                         0.9473         0.9426           0.9847         0.9863
    e                          0.0142         0.0150           0.0079         0.0076
  Test set (TEST.SET)
    R2                         0.9186         0.9198           0.9999         0.9999
    e                          0.0168         0.0169           0.0105         0.0096
selected by the discrimination analysis as relevant to the variation of thickness: im, t1, pr, wt and init_t. Two multiple quadratic regressions are conducted on the training set TRAIN.SET3, one using all seven attributes and the other using the five relevant attributes. Subsequently, the test set TEST.SET is used to examine the performance of the two resulting regressions beyond the training set. The results are shown in Table 7 and Fig. 4. The same procedure is performed again, this time using a three-layer neural network instead of the multiple regression. The results are shown in Table 7 and Fig. 5. The results show that five attributes are sufficient to predict thickness accurately for both the regression and neural network methods. In other words, the two attributes, init_b and t2, have negligible impact on thickness values. The merits of feature selection can be summarized as follows:

1. It reduces the memory necessary for storing the coefficients of regression formulae or the weights of artificial neural networks.
2. It reduces the processing and computing time.
3. It reduces the required sample size of a training set.

In the present example, these advantages might not seem significant. However, in real applications, the large number of attributes in the operational log will definitely require the incorporation of feature selection before a model is sought using regression or neural networks.

5.3. Effect of noise

In the real world, due to sensor limitations and environmental disturbances, we almost never deal with perfect data. Therefore, it is important to examine how the knowledge acquisition techniques perform with noisy data.
Fig. 4. Effect of feature selection on thickness prediction (TEST.SET data and quadratic regression model). [Two panels: using 7 features and using 5 features; axes are predicted vs real Y (thickness, cm).]
Fig. 5. Effect of feature selection on thickness prediction (TEST.SET data and ANN). [Two panels: using 7 features and using 5 features; axes are predicted vs real Y (thickness, cm).]

Table 8. The effect of noise on the prediction of void size

                   R2                                e
                   Quadratic      Neural net        Quadratic      Neural net
                   regression     method            regression     method
  Noise free       0.6108         0.6567            0.00867        0.00767
  5% noise         0.5708         0.6430            0.00908        0.00802
  10% noise        0.4942         0.6209            0.00950        0.00801
  15% noise        0.4182         0.6481            0.00998        0.00795

To demonstrate the effect of noise, some random noise is added to both the attribute and output measurements of the training set TRAIN.SET3 and the test set TEST.SET. In this study, three levels of noise are considered, i.e. 5, 10 and 15% of the total range of each attribute and output measurement. A two-node tree is used for the multiple regression method, and a one-node tree is used for the neural network method. The training set is again used to train, and the test set is used to examine the performance of both methods beyond the training set. The results for the prediction of void size are shown in Table 8. The results show that the performance of the regression method degrades as the noise level increases. This is expected, since the noise adds to the output measurements an unpredictable factor, which cannot be accounted for by the method. The performance of the neural network method, on the other
hand, is not affected by the considerable levels of noise. This is because of the relatively large number of processing nodes and interconnections, as mentioned in Section 2.
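The procedure of Section 5.3 can be sketched as follows, assuming a single attribute and a quadratic fit: uniform noise at 5, 10 and 15% of each variable's range is added to both input and output before fitting, and R2 is tracked. All quantities are synthetic stand-ins, not the values of Table 8.

```python
# Sketch of the noise experiment: corrupt inputs and outputs with uniform
# noise proportional to each variable's range, refit, and track R2.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 300)
y_clean = 0.05 + 0.1 * x + 0.08 * x ** 2      # void-size-like response

def r2_at_noise(level):
    """Add +/- level * range uniform noise to x and y, refit a quadratic."""
    xn = x + rng.uniform(-level, level, x.size) * (x.max() - x.min())
    yn = y_clean + rng.uniform(-level, level, x.size) * (y_clean.max() - y_clean.min())
    A = np.column_stack([np.ones_like(xn), xn, xn ** 2])
    beta, *_ = np.linalg.lstsq(A, yn, rcond=None)
    resid = yn - A @ beta
    return 1 - np.sum(resid ** 2) / np.sum((yn - yn.mean()) ** 2)

r2s = [r2_at_noise(lv) for lv in (0.0, 0.05, 0.10, 0.15)]
print(r2s)   # fit quality degrades as the noise level rises
```

For the regression model this degradation is unavoidable, since the output noise is by construction unpredictable from the attributes, which is the behaviour reported for the regression column of Table 8.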
6. CONCLUSION
The feasibility of extracting quantitative relationships between process variables and product quality was investigated using the autoclave curing of composites as an example. The results show that if the relationships are mildly nonlinear, then both quadratic regression and neural net models perform well in modeling the relationship. However, the presence of strong nonlinearity and interactions among variables confounds both modeling approaches, although the neural network models performed slightly better. Feature extraction is considered an important prerequisite for pretreatment of data. It can greatly reduce the dimensionality of the system to be modeled by eliminating superfluous variables. Feature extraction has been extensively studied in the statistics literature, and excellent algorithms are available in most statistical analysis packages. In this simple application, the reduction achieved was not significant, but in real-world applications, where one
might be concerned with hundreds of input variables, feature extraction will be a necessary component of any knowledge acquisition algorithm.

One approach to overcoming the problem of nonlinearity and interaction among variables is to partition the data using an induction algorithm. We showed, using the autoclave simulation example, that this approach can lead to much better performance using either quadratic regression or artificial neural networks. This idea is explored further and discussed in detail in a related paper (Shieh and Joseph, 1992a). The advantage of inductive analysis of the data is that it reduces the data to comprehensible knowledge that can be incorporated into, for example, expert systems. The advantage of regression techniques is that they yield a compact mathematical model from the data. The advantage of neural networks lies in their ability to represent highly nonlinear relationships and their tolerance to noise in the data, although the results are not explicitly available (they are hidden as weights associated with each node in the network). All three approaches should be used to analyze past operational data. The composite algorithm provides a convenient way to integrate the inductive method with regression or ANN modeling. The results on the autoclave simulation show that such a synergistic combination of feature selection, induction, regression and ANN modeling can be used for exploratory analysis of routine data to discover processing knowledge that is otherwise hidden in the mass of numerical and categorical data. Because of the limitations of operational data (not totally random, limited range of variables, missing pieces of data etc.), it can only be used in an exploratory sense and not in a confirmatory sense. However, the knowledge extracted from past history can be used to guide the design of further experiments.

Acknowledgement-F. H. Wang would like to thank McDonnell Douglas Corporation for granting her leave of absence to complete her dissertation at Washington University.
REFERENCES
Breiman L., J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees. Wadsworth, Belmont, CA (1984).
Campbell F. C. et al., Computer aided curing of composites. Internal Report AFWAL-TR-86-4060, Machine Research Laboratory, Wright Patterson AFB, OH (1988).
Cendrowska J., PRISM: an algorithm for inducing modular rules. Int. J. Man-Mach. Stud. 27, 349 (1987).
D'Ambrosio B., Real-time process management for materials composition in chemical manufacturing. IEEE Expert 2, 80 (1987).
Devijver P. A. and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice-Hall, London (1982).
Draper N. R. and H. Smith, Applied Regression Analysis, 2nd Edn. Wiley, New York (1981).
Geladi P. and B. R. Kowalski, Partial least-squares regression: a tutorial. Analyt. Chimica Acta 185, 1 (1986).
Hand D. J., Discrimination and Classification. Wiley, New York (1981).
James M., Classification Algorithms. Wiley, New York (1985).
Lippmann R. P., An introduction to computing with neural nets. IEEE ASSP Mag., p. 4 (1987).
Love P. L. and M. Simaan, Automatic recognition of primitive changes in manufacturing process signals. Patt. Recog. 21, 333 (1988).
Norman P. W. and S. Naveed, Knowledge acquisition analysis and structuring for the construction of real-time supervisory expert systems. Chemical Engng Res. Des. 66, 477 (1988).
Quinlan J. R., Learning efficient classification procedures and their application to chess end games. Machine Learning (Michalski R. S. et al., Eds), p. 463. Morgan Kaufmann, Los Altos (1983).
Quinlan J. R., Induction of decision trees. Mach. Learn. 1, 81 (1986).
Rumelhart D. E., J. L. McClelland and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA (1986).
SAS Institute, SAS/STAT User's Guide: For Personal Computers, Release 6.03 Edn. SAS Institute Inc., Cary, NC (1988).
Shieh D. S. and B. Joseph, Exploratory data analysis using induction partitioning and regression trees. I & EC Res. (1992a). In press.
Shieh D. S. and B. Joseph, Model-based feature selection. Computers chem. Engng (1992b). In press.
Waterman D. A., A Guide to Expert Systems. Addison-Wesley, Reading, MA (1986).
Wu H. T., Knowledge based control of composite material manufacturing processes. D.Sc. Dissertation, Department of Chemical Engineering, Washington University, St Louis (1990).