Nonlinear regression: a hybrid model


Computers & Operations Research 26 (1999) 799—817

Shouhong Wang*
Department of Marketing/Business Information Systems, College of Business & Industry, University of Massachusetts Dartmouth, North Dartmouth, MA 02747-2300, USA
Received April 1997; received in revised form October 1998

Abstract

Using traditional parametric methods of regression analysis, one must make assumption(s) about the form of the regression equation, which may not be valid. In high-dimensional cases with a reasonable sample size, nonparametric techniques of regression analysis (kernel, nearest neighbor, and spline smoothing) do not perform well because of the ‘‘curse of dimensionality’’. This paper proposes a nontraditional nonlinear regression model for cases in which the sample space is high dimensional and the relationship between the independent variables and the dependent variable is arbitrary. This research suggests combining the linear regression analysis method with self-organizing feature maps, the algorithm for high-dimensional convex polytopes, and back-propagation neural networks. When the sample set is pre-processed by a linear regression function, the self-organizing feature maps can be used to detect clusters of misrepresented sample points when they exist. Using the algorithm for high-dimensional convex polytopes, the sample data points in each of these clusters are sorted into two classes, each of which is supposed to lie on one of the two sides of the pursued regression function. These groups of data points are then used to train the back-propagation neural network. The function represented by the trained neural network, which represents the boundary between the two groups, is the nonlinear regression function for the original data set. © 1999 Elsevier Science Ltd. All rights reserved.

Scope and purpose

Neural networks have received a great deal of attention in many research and application areas. One of the advantages of neural networks is their adaptive ability to generalize data from the real world. Taking this advantage, many researchers use neural networks for nonlinear regression analysis and have reported positive experimental results in their applications. However, little research has addressed two correlated issues concurrently: the unpredictability of a single neural network in nonlinear regression, and possible ways of improving the performance of neural networks in nonlinear regression analysis. In this paper a hybrid model which combines the linear regression method, neural networks, and the algorithm for convex polytopes is developed for nonlinear regression.


This paper contests that a neural network model could be a more useful nonlinear regression tool if it successfully incorporates human knowledge (heuristics) and other regression techniques.

Keywords: Nonlinear regression; Linear regression; Self-organizing feature maps; Convex hull; Algorithm for convex polytopes; Back-propagation neural networks; Heuristics

1. Introduction

Regression is one of the most important statistical methods applied to science, engineering, economics, and management [1]. Since so many fields apply regression techniques, there are difficulties in defining the term ‘‘regression’’ precisely [2]. Nevertheless, a definition of regression given by Johnson and Wichern [3] is used to begin the discussion. According to this definition, regression analysis is the statistical methodology for predicting values of one or more response (dependent) variables from a collection of predictor (independent) variable values. One of the simplest and most useful regression techniques is linear regression, which assumes a linear relationship between the independent variables and the dependent variable. The linear regression model has its limitations due to the assumption of a linear relationship between the predictors and the response. For this reason, nonlinear regression models, such as the polynomial regression model [2] and the nonlinear regression analysis model [4], were introduced. However, none of these nonlinear regression models can be applied without assumptions about the form of the regression equation or its parameter(s).

More importantly, local averaging is the basis for most nonparametric regression methods such as kernel/nearest neighbor [5, 6], spline smoothing [7, 8], and B-splines [9, 10]. When local averaging is applied, the regression function at a data point is estimated by using a weighted average of the responses of all data points. However, these local averaging techniques do not perform well in high-dimensional cases with a reasonable sample size. This is caused by the fact, known as the ‘‘curse of dimensionality’’, that high-dimensional space is mostly empty (see discussions in [11]). In other words, nonparametric techniques based on local computations often result in poor performance for a reasonably high-dimensional regression problem. Projection pursuit [11, 12] avoids the sparsity of the high-dimensional space by investigating low-dimensional projections. Since projection pursuit uses a sum of empirically determined univariate functions of linear combinations of the predictors to approximate the regression function, it is poorly suited to deal with highly nonlinear structures.

Recently, artificial neural networks have received a great deal of attention in the operations research area [13]. Since Masson and Wang [14] published an extensive review of neural networks in an operational research journal, numerous articles on applications of neural networks in operations research and management have been published (e.g., [15—19]). Neural networks require few assumptions beyond the training sample data set for modeling a system; however, simple neural network models often have mixed performance results [14, 20]. Specifically, although neural networks can be used for nonlinear curve fitting, simple neural networks commonly fail to overcome the overfitting problem. In light of this, research (e.g., [13, 14, 21, 22]) has suggested that techniques other than generic learning algorithms (e.g., constrained learning and hybrid modeling techniques) could improve the performance of neural networks in modeling.


Research [23] argues that a pure neural network based nonlinear regression model is unable to cope with problems of overfitting unless it is supplemented by problem-specific knowledge. There is even less research on neural network regression models which are not based on local computations.

This paper presents a regression model which is a combination of the standard linear regression function (LRF), self-organizing feature maps [24], the algorithm for convex polytopes (ACP) [25], and back-propagation neural networks [26] for nonlinear regression analysis. In this methodology, no assumptions about the statistical properties of the sample data set, nor assumptions about the form of the regression function, are required. To simplify the discussion without loss of generality, our discussion is primarily restricted to cases of a single response (dependent) variable. That is, it is supposed that

    y = F(X) + error,    (1.1)

where y is the dependent variable, X = (x_1, x_2, ..., x_m) is the vector of m independent variables, F is a function to be estimated, and the error is assumed to have zero mean; its distribution may well depend on the value of X.

2. Linear and nonlinear regression analysis

The LRF method has been widely used in regression analysis because of the advantages of its simplicity and robustness. It assumes, rightly or wrongly, that F(X) in Eq. (1.1) is linear. It performs relatively well, especially for a moderate sample size. However, if the population does not possess a linear relationship between the dependent variable and the independent variables, the linear regression function (LRF) is an approximation of the true regression function rather than the true function itself. In quite general cases, the LRF approximates the unknown nonlinear true regression function (TRF) and divides the TRF into convex segments, as shown in Fig. 1. Useful generic information can be revealed through a comparison of the TRF and the LRF in the general case depicted in Fig. 1. When an observation to be estimated falls within region A, the LRF overestimates it. On the other hand, if an observation to be estimated falls within region B, the LRF underestimates it. This information can be interpreted in another way.

Fig. 1. The relationships between the true regression function and the linear regression function.


That is, there are clusters of overestimates and underestimates resulting from the LRF. These are distributed in a staggered fashion on the two sides of the LRF if the true regression function is nonlinear. Clusters on a one-dimensional line can be readily identified by an algorithm. In two-dimensional cases, clusters can be detected by sight. However, in high-dimensional cases, clusters are often difficult to identify (cf. [27]). This research uses self-organizing feature maps (SOFM) [24] to identify the clusters of overestimates and underestimates resulting from the LRF.

In many cases, a misestimation cluster is not strictly on one side of the LRF, but is divided by the LRF into two parts. One part (called the major misestimation cluster) is significantly larger than the other (the minor misestimation cluster) if the ‘‘true’’ function is nonlinear. Each major misestimation cluster and the part of the LRF which supports the cluster can be considered a convex hull, as shown in Fig. 2. Note that, if the ‘‘true’’ function is extremely bumpy within a region, one may apply LRF and SOFM again to the cluster to find more than one convex hull within this region, as shown in Fig. 3. Note also that, due to the ‘‘curse of dimensionality’’, it is often impossible to verify multiple convex hulls within a cluster in high-dimensional cases. For each major misestimation cluster it can be imagined that the part of the ‘‘true’’ function in this region is convex and divides the hull into two parts. Each of these parts contains an almost equal number of data points if the minor misestimation cluster is significantly small. Note that we use the general term ‘‘convex’’ for both ‘‘concave-up’’ and ‘‘concave-down’’ properties. To find the location of such a convex curve, an iteration process is applied. We first find the subset of data points which constitute a shell of the convex hull; these shell points are eliminated from the hull, and another convex hull is then found for the remaining points. This procedure continues until the number of eliminated data points is almost equal to the number of data points within the remaining hull.

Fig. 2. A major misestimation cluster is a convex hull.

Fig. 3. Several convex hulls can be detected by applying LRF and SOFM again.


In two-dimensional cases, the shell of a convex hull can be detected by sight. However, in high-dimensional cases, convex hulls are often difficult to identify [28]. This research uses the algorithm for convex polytopes (ACP) [25] to find the shell points of a convex hull in general high-dimensional cases. In practice, the number of data points in the corresponding minor misestimation cluster is often taken into account when sorting the data points in the misestimation cluster into two groups such that each contains almost the same number of data points and resides on one of the two sides of the pursued regression function. We will return to this point later in this paper.

Once the two groups of data points are identified for all of the misestimation clusters, the regression problem becomes a classification problem. A nonlinear function which divides the entire sample data set into two almost equal parts would be the regression function to be found. Note that the only assumptions used in this step are that the error in Eq. (1.1) has a mean of zero and that it is distributed evenly on both sides of the regression function. To find such a nonlinear classification boundary, the research uses a constrained back-propagation neural network model, a modification of the standard back-propagation neural network (BPNN) [26] algorithm.

In summary, the nonlinear regression analysis model proposed in this paper is a hybrid of four fundamental models which are commonly used in traditional and nontraditional data analysis: LRF, SOFM, ACP and BPNN. Prior to a discussion of each of the fundamental models in our nonlinear regression context and the presentation of a formal algorithm for the proposed nontraditional nonlinear regression model, an informal description of the procedures used in the proposed model is given in Table 1.

3. Clustering and self-organizing feature maps

A definition of clusters for our purposes is: a cluster consists of observations that are close together in terms of Euclidean distance, and the clusters themselves are clearly separated [27]. There are three general types of statistical clustering methods: hierarchical [29, 30], partitioning [29] and overlapping [31]. Only the second of these methods, however, is strictly compatible with the definition of clustering used in this research. Further, in a typical partitioning clustering method, an initial specification of cluster ‘‘centers’’ is required. Observations are then assigned to the clusters according to their nearest cluster centers. The cluster ‘‘centers’’ are refined and the observations are reallocated. The procedure continues until some type of stability is reached. The best known of the partitioning procedures is the k-means algorithm [29]. A major difficulty in using the k-means algorithm is the selection of k. This difficulty is fatal in the application of the k-means clustering algorithm to our problem. The traditional clustering techniques are thus excluded from consideration in this study.

An alternative to statistical methods of cluster analysis is the self-organizing feature map (SOFM) [24]. Self-organizing feature maps learn useful features of their input patterns through an unsupervised, or competitive, learning process. The neural network depicted in Fig. 4 is the two-layer SOFM used in this study. The nodes at the lower layer (input nodes) receive inputs presented by the sample data points. The nodes at the upper layer (output nodes) represent the organization map of the input patterns after the unsupervised learning process. In this study, the upper layer is a one-dimensional array.


Table 1
The proposed nontraditional nonlinear regression analysis model

1. Apply LRF to the sample data set, and obtain residual information for the data points.
2. Apply SOFM to generate a one-dimensional map, and identify misestimation clusters based on the residual information from the LRF.
3. For each cluster, define a convex hull.
4. For each cluster, using ACP, sort the data points into two groups.
5. Use BPNN to find the boundary between the two groups of data points over all of the clusters.
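As a concrete starting point, the following sketch implements step 1 of Table 1 (the LRF pre-processing, which also supplies the SSE and MSE used in steps 2 and 3 of the formal algorithm in Section 6) in Python with NumPy. It is a minimal illustration under our own variable names; the paper itself does not prescribe an implementation.

import numpy as np

def fit_lrf(X, y):
    """Step 1 of Table 1: fit the linear regression function (LRF) by least
    squares and return its coefficients, the signed residuals, and SSE/MSE.
    X is an (n, m) array and y an (n,) array, both normalized to [0, 1]."""
    A = np.hstack([np.ones((len(y), 1)), X])      # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
    residuals = y - A @ beta                      # > 0: LRF underestimates, < 0: overestimates
    sse = float(residuals @ residuals)            # sum of squares for error
    mse = sse / (len(y) - X.shape[1] - 1)         # mean square error
    return beta, residuals, sse, mse

The signed residuals returned here are what the SOFM of Section 3 combines with the map locations to expose misestimation clusters.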

Every lower-layer node is connected to every upper-layer node via a variable connection weight. The unsupervised learning process in SOFM can be briefly described as follows. The connection weights are assigned small random numbers at the beginning. The incoming input vector presented by a sample data point is received by the input nodes. The input vector is transmitted to the output nodes via the connections. The activation of the output nodes depends upon the input. In a ‘‘winner-take-all’’ competition, the output node with the weights most similar to the input vector becomes active. In the learning stage, the weights are updated following the Kohonen learning rule, a type of Hebb's law [24]:

    w(new) = w(old) + η[X − w(old)],    (3.1)

where w is the weight matrix, X is the input vector, and η is the learning rate (0 < η < 1), which decreases over time. The weight update only occurs for the active output node and its topological neighbors (Fig. 4).


Fig. 4. Self-organizing feature maps.

The neighborhood is large at the beginning and slowly decreases in size over time. In the unsupervised learning of SOFM, when the learning rate η is reduced to zero, the learning process eventually halts. Note that SOFM is a dynamic system which learns the topological relations and abstract structures in the high-dimensional input vectors using a low-dimensional space (one-dimensional in this study) for representation. This representation is an approximation of the local, instead of global, density of the input vectors. That is, although the low-dimensional self-organizing map cannot give the real density of the input vectors in the entire high-dimensional space, it does reveal the relative density of a limited field compared with its neighborhood. This model is used in this study because approximations of the local density of the misestimations are sufficient for the identification of their clusters. The reader is referred to [24] for a more detailed discussion and other application cases.

Suppose the X values of the data points for the regression analysis are presented to the SOFM and they are self-organized on the map represented by the output nodes of the SOFM. The one-dimensional map can then be depicted graphically in a two-dimensional space by adding information about residuals, as shown in Fig. 5. The horizontal axis represents the locations of the output nodes of the SOFM. A line indicates that a data point with its X values activates the output node of the SOFM at the corresponding location. The height (positive or negative) of the line indicates the residual of the data point in accordance with the LRF. A practical and illustrative case of SOFM for high-dimensional input data can be found in [32]. To make clusters clearer and more regular for the subsequent processes (discussed later in this paper), one may use a narrow band surrounding the LRF to screen out data points of less importance in identifying the nonlinearity, as shown in Fig. 6. However, this might cause a loss of information when the ‘‘true’’ nonlinear function is bumpy. Note that the one-dimensional map supplemented by information on residuals is easier to process, by computer or by human sight, in order to identify misestimation clusters.
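For illustration, the following is a minimal NumPy sketch of the one-dimensional Kohonen map described above, using as defaults the settings reported for the example in Section 7 (100 output nodes, learning rate 0.01, 200 epochs, initial neighbourhood of 100). The uniform update over the whole neighbourhood and the linear decay schedules are simplifications for brevity, not the paper's exact implementation; map_to_nodes returns the horizontal-axis positions used in Fig. 5.

import numpy as np

def train_sofm_1d(X, n_nodes=100, epochs=200, lr0=0.01, radius0=None, seed=0):
    """One-dimensional self-organizing feature map: a line of output nodes,
    winner-take-all competition, and the Kohonen update of Eq. (3.1) applied
    to the winner and its neighbours, with shrinking rate and neighbourhood."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.uniform(0.0, 0.1, size=(n_nodes, m))            # small random initial weights
    if radius0 is None:
        radius0 = n_nodes                                   # start with a wide neighbourhood
    for epoch in range(epochs):
        frac = epoch / epochs
        eta = lr0 * (1.0 - frac)                            # learning rate decays towards zero
        radius = max(1, int(radius0 * (1.0 - frac)))        # neighbourhood shrinks over time
        for x in X[rng.permutation(n)]:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))   # winner-take-all
            lo, hi = max(0, winner - radius), min(n_nodes, winner + radius + 1)
            W[lo:hi] += eta * (x - W[lo:hi])                # Kohonen update, Eq. (3.1)
    return W

def map_to_nodes(X, W):
    """Location of the active output node for each data point (x axis of Fig. 5)."""
    return np.array([np.argmin(np.linalg.norm(W - x, axis=1)) for x in X])

Plotting the residuals from the LRF sketch against map_to_nodes(X, W) reproduces the kind of display in Fig. 5, from which misestimation clusters are read off.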

4. Algorithm for convex polytopes

Theoretically, the ‘‘true’’ nonlinear regression function can be as bumpy as a simple linkage of all of the sample data points, or as smooth as a straight line. Our nonlinear regression model makes a general assumption that the nonlinearity can be detected by identifying the misestimation clusters on the SOFM based on the pre-processing result of the LRF.


Fig. 5. Information revealed by the LRF is used for verification of the result of SOFM.

Fig. 6. Use a narrow band to screen out data points of less importance in identifying the nonlinearity.

As discussed earlier in Section 2, in principle one may apply LRF further to a cluster to find a higher nonlinearity structure. However, such a process is practically difficult to justify when the sample size is moderate, especially in high-dimensional cases. Therefore, it will not be emphasized in this study. Suppose the misestimation clusters identified from the SOFM are in turn considered subsets of the sample data for convex segments of the ‘‘true’’ regression function. To find the trend of a convex segment, each cluster is considered a convex hull of the subset of the sample data, and is ‘‘peeled’’ layer by layer by identifying its shell data points, as shown in Fig. 7.

Next, we discuss the convex hull problem. The convex hull of a set of data points S is the smallest convex set which includes all of the data points. For instance, the convex hull of a set S in the plane is the set of points within the smallest convex polygon which embraces all points of S (see Fig. 8). In Fig. 8 it is easy to judge that point b is not a member of the polygon of the convex hull, because it causes a concavity. The obvious brute-force algorithm computes possible straight lines between the points in S and tests whether they form the boundary of the set. This simple algorithm can easily be generalized to higher-dimensional cases, but it quickly becomes useless as the dimensionality increases due to the computational complexity [28].


Fig. 7. ‘‘Peel’’ a convex hull in the high-dimensional space.

Fig. 8. Convex hull in the plane.

The development of algorithms for the convex hull problem began in the late 1960s [33]. Since then there have been quite a few efficient algorithms for the two-dimensional and three-dimensional convex hull problem [34, 35]. An efficient algorithm for the convex hull problem in four and more dimensions, first suggested by Chand and Kapur [25], is called the algorithm for convex polytopes (ACP). So far there have been few other competitive algorithms published in the literature. ACP finds the shell data points of a convex hull by computing the boundary of the convex polytope. ACP is based on the basic theorem that each facet f of a convex hull in M-dimensional space is defined by M points and contains M sub-facets; each sub-facet is in turn defined by M − 1 points, and each sub-facet is shared with exactly one other facet [25]. The algorithm is briefly described for our specific purposes as follows.

Suppose S_1 is the set of data points of the cluster being investigated, and S_2 is the set of points which are the projections of the data points in S_1 onto the LRF plane. SS, the union of S_1 and S_2, is considered a convex hull. Arbitrarily select m points from S_2. Since these data points are located in the LRF plane, which must be a facet of the hull, the sub-hull constructed by these points becomes a hull facet to work on. This initial hull facet is added to R, the return set of convex hull facets. The sub-facets of this facet are stored in C, a working set of unmatched sub-facets. After this initial step, an iteration procedure is performed for each sub-facet c in C until C becomes empty. In the iteration procedure, each data point in SS is scanned. If any point creates a facet f sharing c with any facet within R, then f is added to R. Set C is then rebuilt through a loop: if a sub-facet of f is already in C (waiting for f to be found), then it is deleted from C; otherwise, it is added to C. After this iteration procedure, the extreme data points of the entire convex hull can readily be identified from the facet set R. The shell points of the cluster are these extreme data points excluding the projection points in the LRF plane.


The formal ACP algorithm used in the current study to find the shell points of the convex hull of a misestimation cluster was developed based on [25].
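The full ACP is lengthy, so the sketch below substitutes SciPy's Qhull-based ConvexHull purely to illustrate what "finding the shell points" of one misestimation cluster means; it is a stand-in for, not a reproduction of, the Chand and Kapur algorithm. X_cluster, y_cluster and beta follow the naming of the earlier LRF sketch, and the "QJ" option joggles the input to sidestep degenerate geometry.

import numpy as np
from scipy.spatial import ConvexHull   # Qhull-based hull; stands in here for ACP

def shell_points(X_cluster, y_cluster, beta):
    """Return indices of the shell data points of one misestimation cluster.

    The cluster points (X, y) are combined with their projections onto the LRF
    hyperplane (X, X*beta); the extreme points of that combined set are found,
    and the projection points are discarded, leaving the shell (cf. Fig. 7)."""
    A = np.hstack([np.ones((len(y_cluster), 1)), X_cluster])
    y_proj = A @ beta                                   # projections onto the LRF plane
    pts = np.vstack([np.column_stack([X_cluster, y_cluster]),
                     np.column_stack([X_cluster, y_proj])])
    hull = ConvexHull(pts, qhull_options="QJ")          # extreme points of the combined set
    n = len(y_cluster)
    return np.array(sorted(v for v in hull.vertices if v < n))   # drop projection points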

5. Back-propagation neural networks

In the last few years, interest in neural networks has grown dramatically. One of the most popular neural networks is the layered neural network with the back-propagation least mean square error learning algorithm (BPLMS) [26]. This research uses a BPNN with one hidden layer and a single output node. Its topology is shown in Fig. 9. The neural network performs a transformation from an input vector to an output. In this study the input vector is G = (x_0, x_1, x_2, ..., x_m, y), where x_0 = 1 by the definition of BPNN, (x_1, x_2, ..., x_m) = X are the m-dimensional independent variables for the regression analysis, and y is the dependent variable for the regression analysis. The values of the x_i's and y are normalized to [0, 1] for computational convenience. The neural network performs a transformation, that is,

    Z = Φ(G).    (5.1)

Using the notation shown in Fig. 9 and the definition of BPLMS neural networks, we have

    Z = {1 + exp[−(v_0 + Σ_{i=1}^{h} v_i (1 + exp(−U_i · G))^{−1})]}^{−1},    (5.2)

where h is the number of hidden nodes, U_i (i = 1, ..., h) is the weight vector of the inputs into the ith hidden node, and v_i (i = 0, 1, ..., h) is the weight on the connection from the ith hidden node to the single output node. In this study a training data pair is (G, Z_p) (or (G, Z_n)) if the data point is supposed to have a positive (or negative) residual against the regression function to be found, where Z_p and Z_n are less than 1, say, 0.8 and 0.2, respectively. Thus, the output scalar Z represents the classification score of a data point with (m + 1) dimensionality. After training, the neural network can generate a classification boundary, and this boundary is the regression function to be sought. Suppose that the cutoff Z score for separating the two classes is 0.5.

Fig. 9. One hidden layer, single output neural network.


The boundary between the two classes at time t is then the intersection of the Z-function (Eq. (5.2)) and Z = 0.5, that is

    {1 + exp[−(v_0 + Σ_{i=1}^{h} v_i (1 + exp(−U_i · G))^{−1})]}^{−1} = 0.5.    (5.3)

Note that this is an implicit function representing the relationship between X and y. To extract the value of y for a given X after the BPNN is trained, one can employ a simple numerical method (e.g., the Newton method) to search for y. Cybenko [36] proved that, using the BPLMS algorithm, layered neural networks with only one hidden layer and sigmoid function nodes can closely approximate any continuous function. That is, in principle, there is no need for more than one hidden layer in order to generate an arbitrary function. However, given a set of training data for the standard BPNN, the final classification result represented by the neural network could be unpredictable [22]. In our case the classification boundary represented by the Z-function must satisfy the condition that y is a one-to-one mapping function of X (see Fig. 10). The condition for maintaining a one-to-one mapping relationship between X and y can be found from Eq. (5.3) in general; that is

    ∂Z/∂y ≠ 0,  y ∈ [0, 1].    (5.4)

In our case the values of Z represent ascending classification scores. Eq. (5.4) is thus specified to

    ∂Z/∂y > 0,  y ∈ [0, 1].    (5.5)

In other words, the Z-function is monotonic in y (see the discussion in [21]). To ensure that the neural network generates a monotonic function, the BPNN is imposed with monotonicity constraints during the training process. The reader is referred to [21] for a general discussion of monotonic BPNN. During training for a sample point, if the monotonic condition is violated due to a large change of weights, the learning rate is decreased for that sample point. Because this monotonicity constraint does not change the principle of error hill climbing, the adaptive nature of the learning algorithm is not changed. This technique was fully discussed in [21] and applied in [17, 18, 22, 37, 38]. At this moment, there is no answer regarding how many hidden nodes are needed for a particular classification problem. In other words, the ‘‘optimal’’ number of hidden nodes is contingent upon the unknown curvature of the nonlinear boundary. A neural network with too few hidden nodes may not be able to generate a complicated curvature, while a neural network with too many hidden nodes may cause oscillation and overlearning and impair the ability of generalization if its learning process is not constrained. In our case, because the neural network learning process is imposed with the monotonic constraints, a relatively large number of hidden nodes for the neural network is usually selected (see [21] for a detailed discussion).
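A minimal sketch of this constrained training step is given below, assuming the squared-error BPLMS objective with targets Z_p = 0.8 and Z_n = 0.2, and checking the monotonicity condition only at the current pattern. The full condition in [21] is stated over the whole input range, so this is a simplification of the idea, not the exact constraint used in the paper.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_step(U, v, G, target, eta, min_eta=1e-4):
    """One back-propagation update for a single pattern G = (1, x_1, ..., x_m, y)
    with target Z_p (e.g. 0.8) or Z_n (e.g. 0.2).  If the tentative update
    violates the monotonicity condition dZ/dy > 0 (Eq. (5.5)) at G, the
    learning rate for this pattern is halved and the update is retried."""
    while eta > min_eta:
        a = sigmoid(U @ G)                       # hidden activations
        Z = sigmoid(v[0] + v[1:] @ a)            # network output, Eq. (5.2)
        dZ = 2.0 * (Z - target) * Z * (1.0 - Z)  # error signal at the output node
        new_v = v.copy()
        new_v[0] -= eta * dZ
        new_v[1:] -= eta * dZ * a
        new_U = U - eta * np.outer(dZ * v[1:] * a * (1.0 - a), G)
        # dZ/dy at this pattern (y is assumed to be the last input component):
        a2 = sigmoid(new_U @ G)
        Z2 = sigmoid(new_v[0] + new_v[1:] @ a2)
        dZ_dy = Z2 * (1.0 - Z2) * np.sum(new_v[1:] * a2 * (1.0 - a2) * new_U[:, -1])
        if dZ_dy > 0:                            # monotonic in y: accept the update
            return new_U, new_v
        eta *= 0.5                               # violated: decrease learning rate, retry
    return U, v                                  # give up on this pattern for this epoch

In an outer loop one would cycle over all patterns of the two groups for many epochs, exactly as with an ordinary BPLMS network.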

Fig. 10. Classification boundary should be a one-to-one mapping function.


In summary, the two groups of data points identified by the processes of LRF, SOFM, and ACP are used to train a BPNN imposed with monotonic constraints. The function given by the intersection of the surface represented by the BPNN and the plane at the cutoff value (0.5) is the nonlinear regression function for the sample data.
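Once the network is trained, y can be read off the implicit boundary numerically. The paper mentions a Newton-type search; the sketch below instead uses bisection, which works here because the monotonicity constraint makes Z strictly increasing in y. U and v are the hidden-layer and output-layer weight arrays of Eq. (5.2), with y assumed to be carried as the last input component.

import numpy as np

def z_score(U, v, x_vec, y):
    """Eq. (5.2): network output for input G = (1, x_1, ..., x_m, y)."""
    G = np.concatenate(([1.0], x_vec, [y]))
    hidden = 1.0 / (1.0 + np.exp(-(U @ G)))        # sigmoid activations of the hidden nodes
    return 1.0 / (1.0 + np.exp(-(v[0] + v[1:] @ hidden)))

def regress_y(U, v, x_vec, tol=1e-6):
    """Solve Z(X, y) = 0.5 for y on [0, 1] by bisection.

    Monotonicity (Eq. (5.5)) makes Z increasing in y, so the regression value
    is the unique root of Z - 0.5 whenever the boundary crosses this column."""
    f_lo = z_score(U, v, x_vec, 0.0) - 0.5
    f_hi = z_score(U, v, x_vec, 1.0) - 0.5
    if f_lo > 0 or f_hi < 0:
        return 0.0 if f_lo > 0 else 1.0            # boundary does not cross [0, 1] here
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if z_score(U, v, x_vec, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)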

6. An algorithm for the nonlinear regression model

As discussed in the previous sections, it is assumed in LRF that the regression function is linear. This may or may not be valid for a particular regression problem; however, a linear discriminant analysis is able to reveal useful information about misestimation clusters. Properly designed SOFM, ACP and BPNN can be further used to process these clusters and obtain a nonlinear function which gives better estimates for the sample data than the LRF, in the sense of eliminating the misestimation clusters. Thus, a regression model, called LSCB, which combines LRF, SOFM, ACP, and BPNN, is able to deal with nonlinear regression problems and enhance the performance of LRF in regression, especially when the linear model has failed the F-test and a nonlinear model is sought. In this model, heuristics (or subjective judgement) are applied in identifying clusters on the map, as discussed earlier in this paper. It is safe to say that the applied rules affect the performance of the regression analysis, although sensitivity analyses have not yet been investigated in any detail. However, since LSCB uses LRF as a starting point, the performance of LSCB would not be worse than that of LRF in terms of recognized indices such as the sum of squares for error and the mean square error. More importantly, LSCB avoids overfitting by providing the user of the model with visualized nonlinearity in the SOFM. Compared with other statistical nonlinear regression techniques, which often fail to control overfitting due to the difficulty of using a priori overfitting-control parameters, LSCB is robust. An algorithm of LSCB is summarized as follows.

Step 1. Given a sample data set S for the regression analysis, each data point s ∈ S is a pair (X, y), where X = (x_1, x_2, ..., x_m) is the vector of m independent variables and y is the dependent variable. For each data point s define a data point s_X such that s_X is the X vector of s. Normalize the x_j (j = 1, ..., m) and y sample data on [0, 1].

Step 2. Apply the LRF method to the sample data set S, and obtain the LRF and the corresponding SSE (sum of squares for error) and MSE (mean square error).

Step 3. Delete data points which are close to the LRF (e.g. within the interval of ±0.1 MSE) from S. The remainder is S_1. Define S_1X such that s_X ∈ S_1X if its corresponding data point s ∈ S_1.

Step 4. Identify misestimation clusters by performing the following substeps.
Step 4.1. Create a SOFM having m input nodes and N output nodes, where N ≥ |S_1X| (the cardinality of S_1X). Train the SOFM with S_1X following an algorithm of SOFM.
Step 4.2. Plot the map graphically in a two-dimensional space. The horizontal axis represents the locations of the output nodes of the SOFM. A line indicates that a data point s_X activates the output node of the SOFM at the corresponding location. The height (positive or negative) of the line indicates the residual of the data point in accordance with the LRF.


Step 4.3. Identify misestimation clusters V_up (up = 1, 2, ..., UP), where UP is the number of clusters of data points with positive residuals, and V_dn (dn = 1, 2, ..., DN), where DN is the number of clusters of data points with negative residuals.

Step 5. Based on the identified misestimation clusters V_up (up = 1, 2, ..., UP) and V_dn (dn = 1, 2, ..., DN), and by performing the following substeps, find two sets of data points, S_+ and S_−, such that if a data point s ∈ S_+ it is supposed to have a positive residual, and if s ∈ S_− it is supposed to have a negative residual.
Step 5.1. Initialize S_+ = ∅ and S_− = ∅.
Step 5.2. For each cluster V ∈ {V_up} (up = 1, 2, ..., UP) perform the following substeps (a code sketch of this peeling loop is given after Step 7).
Step 5.2.1. Identify the set of data points S_M which constitute the major misestimation cluster. Identify the set of data points S_m which constitute the minor misestimation cluster. Let S_shell = ∅.
Step 5.2.2. Apply ACP and find the shell points S_P of S_M.
Step 5.2.3. Let S_shell = S_shell ∪ S_P and S_M = S_M − S_P. If |S_shell| ≈ |S_M| + |S_m|, then go to the next step; otherwise repeat Step 5.2.2.
Step 5.2.4. Let S_+ = S_+ ∪ S_shell and S_− = S_− ∪ S_M ∪ S_m.
Step 5.3. For each cluster V ∈ {V_dn} (dn = 1, 2, ..., DN) perform the following substeps.
Step 5.3.1. Identify the set of data points S_M which constitute the major misestimation cluster. Identify the set of data points S_m which constitute the minor misestimation cluster. Let S_shell = ∅.
Step 5.3.2. Apply ACP and find the shell points S_P of S_M.
Step 5.3.3. Let S_shell = S_shell ∪ S_P and S_M = S_M − S_P. If |S_shell| ≈ |S_M| + |S_m|, then go to the next step; otherwise repeat Step 5.3.2.
Step 5.3.4. Let S_+ = S_+ ∪ S_M ∪ S_m and S_− = S_− ∪ S_shell.

Step 6. Using S_+ and S_−, train a BPNN imposed with monotonic constraints in all x_j (j = 1, ..., m) until the BPNN correctly classifies the two classes of data points, S_+ and S_−. Suppose the cutoff value for the classification is 0.5; then the function

    v_0 + Σ_{i=1}^{h} v_i (1 + exp(−U_i · G))^{−1} = 0    (6.1)

is the regression function to be sought (see Eq. (5.3)).

Step 7. Test the regression function generated by the LSCB model using S. If the SSE of the regression function is lower than that of the LRF, the regression function is verified; otherwise, use the LRF.
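As promised in Step 5.2, a compact sketch of the peel-until-balanced loop is given below. It again substitutes SciPy's ConvexHull for ACP and uses our own variable names (X_major, y_major for the major cluster, n_minor for the size of the minor cluster, beta for the LRF coefficients); it illustrates Steps 5.2.1 through 5.2.3 rather than reproducing the paper's implementation.

import numpy as np
from scipy.spatial import ConvexHull   # stand-in for ACP, as in the Section 4 sketch

def split_cluster(X_major, y_major, beta, n_minor):
    """Sketch of Step 5.2: peel the major cluster layer by layer until the
    peeled shell is about as large as the remaining points plus the minor
    cluster, then return (shell indices, remaining indices)."""
    A = np.hstack([np.ones((len(y_major), 1)), X_major])
    proj = A @ beta                                       # projections onto the LRF plane
    remaining = list(range(len(y_major)))                 # indices still inside the hull (S_M)
    shell = []                                            # indices peeled off so far (S_shell)
    m = X_major.shape[1]
    while len(shell) < len(remaining) + n_minor and len(remaining) > m + 2:
        pts = np.vstack([np.column_stack([X_major[remaining], y_major[remaining]]),
                         np.column_stack([X_major[remaining], proj[remaining]])])
        hull = ConvexHull(pts, qhull_options="QJ")        # one ACP pass (Step 5.2.2)
        layer = {remaining[v] for v in hull.vertices if v < len(remaining)}
        if not layer:                                     # degenerate case: stop peeling
            break
        shell.extend(sorted(layer))                       # S_shell = S_shell U S_P
        remaining = [i for i in remaining if i not in layer]   # S_M = S_M - S_P
    return shell, remaining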

7. An example

When proposing a method for regression analysis, real data examples are not suitable for demonstrating the merit of the method because the true regression function is unknown to us.


In this section a simulated data example is used to demonstrate a brief procedure of the application of this model, and to show how the neural network based model can generate a flexible nonlinear regression function without overfitting. A difficult regression problem borrowed from Silverman [39], which is intended to emulate the motorcycle impact data, was employed in this study. A random sample of 50 (x, y) pairs was generated with the x drawn from a uniform distribution in the interval [0, 1] and the y given by

    y = error,                           0 ≤ x ≤ 0.2,
    y = sin{2π[(1 − x)/0.8]} + error,    0.2 < x ≤ 1,    (7.1)

with the error randomly generated from N[0, max(0.25, x)]. The standard deviation of the noise is small and constant for x ≤ 0.25, and then increases linearly with x. Fig. 11 shows a scatterplot of such a sample. This data sample was pre-processed using the LRF method. The LRF is also shown in Fig. 11, compared with the true regression function. In this one-dimensional case, SOFM was not actually needed to identify misestimation clusters, because one can identify them from Fig. 11 directly. For demonstration purposes, a SOFM with two input nodes and 100 output nodes was employed to identify clusters. The learning rate was set to 0.01, and the initial neighbourhood was set to 100. The learning process took 200 epochs. The generated feature map supplemented by the residuals revealed misestimation clusters, as shown in Fig. 12. Based on the principle that each cluster should contain a certain number of data points (e.g., more than 7) to avoid unnecessary fluctuation in the generated regression function, four clusters were identified. The clusters were processed by ACP, and the two groups of data were then determined through ‘‘peeling’’ processes. The two groups of data were employed to train the BPNN with 10 hidden nodes.

Fig. 11. The difficult problem example.
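For reference, the following sketch generates a sample of the kind described by Eq. (7.1). The noise scale max(0.25, x) is our reading of the description in the text (a standard deviation that is constant up to x = 0.25 and then grows linearly with x), so the exact constant should be treated as an assumption rather than the paper's specification.

import numpy as np

def generate_difficult_sample(n=50, seed=0):
    """Simulated data of Section 7 (after Silverman [39]): n (x, y) pairs with
    x ~ U[0, 1], signal 0 for x <= 0.2 and sin{2*pi*(1 - x)/0.8} for x > 0.2,
    plus Gaussian noise with standard deviation max(0.25, x)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)
    signal = np.where(x <= 0.2, 0.0, np.sin(2.0 * np.pi * (1.0 - x) / 0.8))
    noise = rng.normal(0.0, np.maximum(0.25, x))
    return x, signal + noise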


Fig. 12. The self-organizing feature map for the example.

Table 2
The neural network weight matrices for the regression function

  i     w_i1          w_i2           w_i3          v_i (i = 0, ..., 9)
  0        —              —              —            -3.388751
  1     2.816025       0.934335       1.132337        -0.365969
  2    -8.439781      15.609353      -2.943498         3.961270
  3     1.906457      -3.215684       1.979390         1.688855
  4     4.520600       0.764509      -9.700676        -2.157058
  5     7.521598     -25.716063      -4.714221        -7.890989
  6    -1.919772      -8.655909       3.935504         4.956757
  7     2.791535       0.661416       1.298725        -0.174102
  8     9.789718     -25.258087       1.470996         7.983838
  9    -5.987590      -1.488367       7.348166         3.762371

After 4928 learning epochs, the BPNN generated the final weight matrices listed in Table 2, corresponding to the symbolic diagram of Fig. 9. The grouped data and the regression curve represented by the BPNN for Eq. (6.1) are shown in Fig. 13. In this case, since the nonlinear regression function is complicated, a simple neural network model without the assistance of other methods can hardly generate an acceptable function close to the true regression function.

8. Simulation results

Based on the consideration that a few ad hoc examples are not sufficient to demonstrate the effectiveness of the proposed method, a preliminary simulation experiment was carried out which compared the LSCB model to the robust LRF model in nonlinear regression analysis. The experiment was designed to demonstrate that the proposed method can reduce the ‘‘curse of dimensionality’’ in nonlinear regression. In all of the experiments 300 sample data points were generated according to Eq. (1.1). To make the data points evenly spread over the sample space, data relating to the independent variables were drawn from a uniform distribution in the interval [0, 1]. To make the error term typical, the error was drawn from a normal distribution N(0, σ²).


Fig. 13. The two groups of the data and the final regression result of the example.

The standard deviation σ was equal to 0.1. Since the sum of squares for error (SSE) is the measure of the unexplained variation in regression analysis, it was used as the performance measure for the comparison. This experiment was designed to investigate the behavior of the SSE of LRF and LSCB, not the difference in SSE between LRF and LSCB, across the three levels of dimensionality. Correspondingly, the factorial experimental design method was employed. Five replications were generated for each of the six cells (2 × 3) in a two-factor full factorial experimental design. To reduce experimental complexity, attention was restricted to the sine curve with multiple variables. The following factors were used:

Factor 1 - Technique (2 levels):
  Level 1: LRF
  Level 2: LSCB

Factor 2 - Dimensionality (3 levels):
  Level 1: y = sin 2π(x_1 + x_2) + error,
  Level 2: y = sin 2π(x_1 + x_2 + x_3) + error,
  Level 3: y = sin 2π(x_1 + x_2 + x_3 + x_4) + error.

The observed average SSEs obtained for the two models are shown in Table 3. The table shows that the LSCB model improves the regression performance of the LRF method by reducing the SSE by more than 50% in our experiments. The table also reveals that the relative performance of LSCB at different dimensionality levels is as stable as that of LRF. It can then be concluded that the LSCB model does not suffer significantly from the ‘‘curse of dimensionality’’.


Table 3
Relative performance of LSCB and LRF

  Level of dimensionality of          Average sum of squares for error
  the regression function             LRF          LSCB
  1 (two dimensions)                  118           47
  2 (three dimensions)                127           61
  3 (four dimensions)                 146           68

In a two-factor ANOVA of the simulated data, Factor 1 (LRF or LSCB) was found to be significant (F(1, 24) = 1231, p < 0.001). The result showed that there was an overall significant difference between LSCB and LRF in terms of a lower SSE.

9. Conclusions and discussion

In this paper, we have shown that a hybrid model which is a combination of LRF, SOFM, ACP and BPNN can be used in nonlinear regression analysis. An example and simulation results were also presented. Generally speaking, the development of a regression function is based on the model developer's prior knowledge. Yet this type of knowledge typically comes from a large amount of individual experience data, and should be verified or tested with a data set relevant to the issue being examined. As shown in this paper, the SOFMs are used to learn local densities of the input vectors and to identify the misestimation clusters based on the pre-processing results of LRF. These clusters are further processed using ACP and BPNN to obtain a regression function. This method is self-protective because its regression accuracy, on average, could only be better than that of the robust regression method LRF.

Compared with other traditional statistical techniques in regression, the proposed LSCB model has certain advantages. For instance, LSCB has very flexible theory requirements in comparison with parametric methods. This regression model does not require any a priori knowledge or assumptions about the structure of the regression function. As shown in the paper, LSCB does not rely on local computations in regression analysis; rather, it maps the whole high-dimensional space onto a low-dimensional space and detects the two data groups at the global level. The simulation result has demonstrated that LSCB's performance is normally at least as robust as the LRF technique and that LSCB does not suffer from the ‘‘curse of dimensionality’’. Clearly, much work remains to be done in order to make systematic comparisons of LSCB with other nonlinear regression techniques.

Like any other regression model, LSCB has its assumptions and limitations. Firstly, this model assumes that the purpose of the regression analysis is data reduction rather than data interpretation. In other words, this model emphasizes the representation of a large number of data points by means of an easily storable analytic expression. It also allows the provision of an accurate estimate for data given a finite number of possibly inaccurate data points. At the same time, the model deemphasizes the intuitive explanation of the relationship between the independent variables and the dependent variable.


Thus, a major limitation of this proposed regression model is that the coefficients of the generated regression function are not explicitly interpretable. Furthermore, this technique lacks the explanatory value of using independent variables which can be logically related to the dependent variable. This is also a drawback of other statistical techniques such as splines, especially in cases where the dimensionality is low. Secondly, this method assumes that the user of this model is able to judge whether the misestimation clusters shown on the SOFM are convex hulls. As discussed earlier in this paper, a misestimation cluster may not actually be a convex hull if the ‘‘true’’ regression function is bumpy (Fig. 3). LRF should then be applied to the cluster again to reveal smaller misestimation clusters within it. The model does not specify the number of data points that should be contained in a cluster, nor how many times one should apply LRF to the data. The user is assumed to be able to trade off overfitting against oversmoothing based on the particular problem to be solved, keeping the two extreme unfavourable outcomes of regression analysis in mind: overfitting the sample data points might result in a simple linkage of all of the sample data points, and oversmoothing the regression function will result in a straight line (i.e., the LRF). From our point of view, a mere set of sample data without any knowledge of the problem domain is not enough for a nonlinear regression analysis, especially when the sample size is moderate in high-dimensional cases. We believe that a good regression method must allow the user to incorporate other rational knowledge or heuristics related to the particular problem to be solved. We also believe that no single regression method is absolutely better than any other. Thirdly, the proposed model is computationally inefficient in the BPNN part, although LRF, SOFM and ACP take little time (less than a few seconds) to solve a regression problem. To overcome this problem it is necessary to use another computationally efficient device to accomplish the classification task suggested by this regression model and to generate an arbitrarily complex one-to-one function. Further research work is needed to investigate this issue.

Acknowledgements

The author is indebted to the Editor and two anonymous referees for their valuable comments.

References

[1] Kleinbaum D, Kupper L, Muller K. Applied regression analysis and other multivariate methods, 2nd ed. Boston, MA: PWS-Kent, 1988.
[2] Draper N, Smith H. Applied regression analysis. New York: Wiley, 1981.
[3] Johnson R, Wichern D. Applied multivariate statistical analysis. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[4] Bates D, Watts D. Nonlinear regression analysis and its applications. New York: Wiley, 1988.
[5] Cleveland WS. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 1979;74:828—36.
[6] Rosenblatt M. Curve estimation. Annals of Mathematical Statistics 1971;42:1815—42.
[7] De Boor C. A practical guide to splines. New York: Springer, 1978.
[8] Eubank RL. Spline smoothing and nonparametric regression. New York: Dekker, 1988.
[9] Flavell R, Meade N, Salkin G. The gilt market: models and model-based training. Journal of the Operational Research Society 1994;45(4):392—408.


[10] Steeley JM. Estimating the gilt-edged term structure: basis splines and confidence intervals. Journal of Business Finance & Accounting 1991;18(4):513—29.
[11] Friedman JH, Stuetzle W. Projection pursuit regression. Journal of the American Statistical Association 1981;76(376):817—23.
[12] Huber PJ. Projection pursuit. The Annals of Statistics 1985;13(2):435—75.
[13] Burke L. Introduction to artificial neural systems for pattern recognition. Computers & Operations Research 1991;18:211—20.
[14] Masson E, Wang Y. Introduction to computation and learning in artificial neural networks. European Journal of Operational Research 1990;47:1—28.
[15] Fletcher D, Goss E. Forecasting with neural networks. Information & Management 1993;24:159—67.
[16] Subramanian V, Hung MS, Hu MY. An experimental evaluation of neural networks for classification. Computers & Operations Research 1993;20(7):769—82.
[17] Wang S. Neural network approach to generating the learning curve. INFOR 1993;31(3):136—50.
[18] Wang S. Neural network techniques for monotonic nonlinear models. Computers & Operations Research 1994;21(2):143—54.
[19] Yoon Y, Swales G, Margavio TM. A comparison of discriminant analysis versus artificial neural networks. Journal of the Operational Research Society 1993;44(1):51—60.
[20] Lippmann R. An introduction to computing with neural nets. IEEE ASSP Magazine 1987;2:4—22.
[21] Archer NP, Wang S. Application of the back propagation neural network algorithm with monotonicity conditions for two-group classification problems. Decision Sciences 1993;24(1):60—75.
[22] Wang S. The unpredictability of the standard back propagation neural networks in classification applications. Management Science 1995;41(3):555—9.
[23] Wang S. An insight into the standard back-propagation neural network model for regression analysis. OMEGA: International Journal of Management Science 1998;26(1):133—40.
[24] Kohonen T. Self-organization and associative memory, 3rd ed. Berlin: Springer, 1989.
[25] Chand DR, Kapur SS. An algorithm for convex polytopes. Journal of the Association for Computing Machinery 1970;17(1):78—86.
[26] Rumelhart D, McClelland J, the PDP Research Group. Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations. Cambridge, MA: The MIT Press, 1986.
[27] Gnanadesikan R, Kettenring JR. Discriminant analysis and clustering. Statistical Science 1989;4(1):34—69.
[28] Lee DT, Preparata FP. Computational geometry — a survey. IEEE Transactions on Computers 1984;C-33(12):1072—101.
[29] Hartigan JA. Clustering algorithms. New York: Wiley, 1975.
[30] Romesburg HC. Cluster analysis for researchers. Malabar, FL: Robert E. Krieger, 1990.
[31] Seber GAF. Multivariate observations. New York: Wiley, 1984.
[32] Wang S. A self-organizing approach to cluster analysis: a case of detecting problems in teaching. In: Yen D, editor. Transactions of the Conference of the Association of Chinese Management Educators, Chicago, IL. School of Business, Miami University, Oxford, OH, 1996:419—28.
[33] Bass LJ, Schubert SR. On finding the disc of minimum radius containing a given set of points. Mathematics of Computation 1967;12:712—4.
[34] Graham RL. An efficient algorithm for determining the convex hull of a finite planar set. Information Processing Letters 1972;1:132—3.
[35] Preparata FP, Hong SJ. Convex hulls of finite sets in two and three dimensions. Communications of the ACM 1977;20(2):87—93.
[36] Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 1989;2:303—14.
[37] Archer NP, Wang S. Fuzzy set representation of neural network classification boundaries. IEEE Transactions on Systems, Man and Cybernetics 1991;21:735—42.
[38] Wang S. Generating fuzzy membership functions. Fuzzy Sets and Systems 1994;61(1):71—81.
[39] Silverman BW. Spline smoothing: the equivalent variable kernel method. The Annals of Statistics 1984;12:898—916.