Information Sciences 176 (2006) 772–798 www.elsevier.com/locate/ins
On detecting nonlinear patterns in discriminant problems

Chih-Yang Tsai
School of Business, The State University of New York at New Paltz, 75 South Manheim Blvd., New Paltz, NY 12561, USA
Received 2 May 2004; accepted 5 January 2005
Abstract

We propose a two-stage model for detecting nonlinear patterns in discriminant problems and for solving the problem. The model deploys a Linear Programming Based Discriminator (LPBD) in stage one for treating linear patterns and a Probabilistic Neural Network (PNN) in stage two for handling nonlinear patterns. The LPBD in stage one divides the decision space into a clear zone, where observations are (almost) linearly separable, and a gray zone, where nonlinear patterns are more likely to occur. The PNN in stage two analyzes the gray zone and determines whether a significant nonlinear pattern exists in the observations. Our goal is to avoid using a nonlinear model unless the PNN strongly suggests one, in order to maintain good interpretability and avoid overfitting. Our computational study demonstrates the effectiveness of the two-stage model in both classification accuracy and computational efficiency.

Keywords: Linear programming; Probabilistic neural networks; Discriminant problems
1. Introduction

The discriminant problem, also known as the classification or pattern recognition problem, seeks to develop functions that separate observations with known class membership and to apply the result in classifying future observations based on some chosen feature variables. We propose a two-stage model combining two non-parametric discriminant models, a Linear Programming Based Discriminator (LPBD) [4] and a Probabilistic Neural Network (PNN) [54]. The former is a linear model and the latter is a nonlinear model. The model is developed from a simple idea: nonlinear patterns generally appear near the border area between classes. In stage one, we first try to separate observations from different classes with the LPBD. When the underlying patterns are nonlinear, the LPBD alone cannot clearly separate the observations. Stage one thus divides the samples into a clear zone, where classes are (almost) linearly separable, and a gray zone with a mix of observations from various classes. Observations in the gray zone are those near the border lines between classes. If those observations form a few large clusters by class, it is more likely that there is a nonlinear pattern in the instance. Otherwise, those small clusters may just be noise rather than generalizable patterns. This analysis is performed in stage two using the PNN. Because the PNN is applied to observations in the gray zone only, the computational time for executing this nonlinear model is significantly reduced. The decision rule then compares the classification accuracy of the gray zone observations made by the LPBD and the PNN. If the improvement from applying the PNN in the gray zone is statistically significant, we conclude that the evidence of a nonlinear pattern is strong and our final model is obtained by combining the LPBD result for the clear zone and the PNN result for the gray zone. Otherwise, we ignore the result from stage two and use the LPBD only.

Despite the concern over its normality and equal covariance assumptions on population distributions, Fisher's Linear Discriminant Function (LDF) [15] remains a very popular approach. There are several benefits to applying a linear discriminant model such as the LDF. First, a linear model is more intuitive for a decision maker to understand and to relate its output to the application of interest (interpretability). Second, developing a linear model in general requires fewer computational resources than a nonlinear model. Third, a linear model is less prone to overfitting. Overfitting occurs when a model attempts to describe instance-specific noise rather than generalizable patterns.

The main reason for choosing these two models for our two-stage approach is that they compensate for each other's weaknesses, making the decision model more capable and efficient. An LPBD is efficient because solving linear programs is fast in practice, and it can handle large instances by incorporating a row generation algorithm. However, it does not handle nonlinear patterns. On the other hand, a PNN is capable of treating nonlinear
patterns, especially multi-modal patterns. Its weaknesses include longer training times and the need to retain all training examples for future classifications. In our approach, the clear zone created by the LPBD can be considered the core of either linear or nonlinear patterns. Thus the LPBD reduces the computational time and storage requirement of the PNN by leaving only the gray zone observations to the PNN.

Our computational study shows the effectiveness of the combined model compared to two benchmark approaches, a stepwise LDF and a stand-alone PNN. For instances exhibiting a clear linear pattern and instances lacking a strong linear or nonlinear pattern, the classification accuracy from stage one of the two-stage model is comparable to the LDF model. For instances with a strong nonlinear pattern, the two-stage model outperforms the LDF and is comparable to the stand-alone PNN, which consumes more computational resources than the two-stage model.

In the next section, we provide a brief introduction to the two discriminant models used in this study. The two-stage model is introduced in Section 3. Section 4 describes the design of our computational study and discusses the test results, followed by the conclusion of this study in Section 5.
2. Review of LPBD and PNN models

This section provides a review of the LPBD and PNN models used in this study. For discriminant problems, we consider observations drawn from $K$ populations (classes) labeled $\theta_i$, $i \in \{1, \ldots, K\}$. An observation with $p$ features is denoted as a $p$-vector $x = [x_1, x_2, \ldots, x_p]^T$. Let $n_k$ be the number of observations drawn from $\theta_k$ and $n = \sum_{i=1}^{K} n_i$. The samples from class $i$ can be represented as an $n_i \times p$ matrix, $A_i$, and $A$ is defined as $[A_1^T \mid \ldots \mid A_K^T]^T$. $A_i$ is also used to denote the set of samples from class $i$ and $A = \cup_{i=1}^{K} A_i$ unless it causes confusion.

Under the assumption that $\theta_i$, $i = 1, \ldots, K$, is distributed multivariate normal with the same covariance matrix for all classes, Fisher's LDF is known to produce optimal separation functions that best separate class means with respect to their common population variance. Criticisms of this model include that the assumption is hardly met in practice and its inability to treat nonlinear patterns. Östermark and Höglund [47] compare several linear and nonlinear multivariate statistical approaches with mathematical programming based approaches on generated instances. Their study does not find any universal support in favor of any approach. In our computational experiments, we focus on data obtained from real world applications, except for two generated test instances used to control the nonlinear border line.

Non-parametric approaches such as neural networks, which do not rely on any assumptions about population distributions, have received great attention recently. In addition, a multi-layer perceptron [62] can theoretically approximate any nonlinear function [14]. However, a major concern in implementing a neural
network model is its long training time. Blum and Rivest [5] show that even training a small neural network is NP-complete [22]. In addition, there is no one-to-one correspondence between the weights of a multi-layer perceptron and the feature variables, making the model result harder to decipher. Our proposed two-stage model is capable of treating nonlinear patterns from two or more classes while preserving high interpretability and computational efficiency.

The LPBD develops hyperplanes to separate observations from the $K$ classes. This procedure is referred to as "discrimination" or "separation" [33]. The resulting hyperplanes are then used for the assignment of class membership to new cases. This part is usually called "classification" or "allocation" [33]. For the LPBD, both discrimination and classification are based on a measure of distance between an observation and the hyperplanes. Based on a Bayes decision rule, the PNN attempts to estimate the distribution of $\theta_i$ using the Gaussian function, whose parameters $\sigma_i$ (smoothing factors) are obtained from a training procedure. Classification of a future observation is done by computing the average distance (defined by the Gaussian function) from the observation to each class of samples in the training set and assigning the observation to the class for which the minimum average distance occurs. Although the definitions of distance in the two models are different, they both provide direct associations between the model parameters and the feature variables. Both models adopt a similar decision rule to piecewise separate the classes, i.e., assigning $x$ to class $i$ if

$$U_i(x) > U_j(x), \quad \forall j \in \{1, \ldots, K\},\ j \neq i \tag{1}$$
where $U_l(x)$ is the discriminant function for class $l \in \{1, \ldots, K\}$. PNN related discussions can be found as early as the 1960s (see, for example, [10,49,53]). Mangasarian [39] shows that a Multi-Surface Method Tree (MSMT), obtained from generating many more linear functions to zigzag between classes, can be used to handle nonlinear border lines between classes, at the cost of sacrificing the higher interpretability and generalizability of a linear model. That is the reason why, instead of creating an MSMT for nonlinear border lines, we analyze the data in the gray zone using a PNN, taking advantage of its ability to treat nonlinear patterns, especially multi-modal patterns. More details about the two models are given next.

2.1. LPBD model

An LPBD applies a goal programming formulation whose objective is to minimize a penalty function that can be specified in various combinations of deviations and interior distances [25] for different concerns of a decision maker. The formulation was introduced by Freed and Glover [16,17] as a linear program. It triggered a series of discussions in the 1980s [24,18,42,20,25,26]. A major concern in the LP based model is the issue of null solutions, in which the coefficients of the
separation function, $U_i(x) - U_j(x)$, obtained from an optimal solution to the LP are all zeros. Normalization of the parameters by adding an additional constraint is proposed and discussed in [20,25,26] to avoid this phenomenon. A computational study by Freed and Glover [19] indicates that among the objectives of minimizing the maximum deviation, minimizing the sum of interior distances, and minimizing the sum of deviations (MSD), the MSD has the best performance in terms of classification accuracy. Most of the previous work focuses on two-class problems, and extension to multi-class problems is made by sequentially (pairwise) separating out one class at a time [26]. Gochet et al. [27] proposed another formulation with a normalization constraint focusing on classification problems with more than two classes. However, some but not all of the separation functions could still have all zero coefficients. Thus, a sequential separation approach is still needed. Comparisons of performance between LP based discriminant models and others can be found in [31,32,45,47]. Bennett and Mangasarian [3,4] proposed an LP formulation that minimizes the sum of mean group deviations as follows:

$$\text{(MD)}\quad \min \sum_{i=1}^{K} \sum_{j=1, j\neq i}^{K} \frac{e^T z_{ij}}{n_i}$$
$$\text{s.t.}\quad A_i(w_i - w_j) - (c_i - c_j)e + z_{ij} \geq e, \quad i \neq j,\ i, j = 1, \ldots, K \tag{2}$$
$$z_{ij} \geq 0, \quad w_i, w_j, c_i, c_j \ \text{unrestricted in sign}$$
In Formulation MD, $w_i, w_j$ are $p$-vectors and $(w_i - w_j)$ forms the coefficient vector of the linear separation function between classes $i$ and $j$, while $c_i, c_j$ are scalars and $(c_i - c_j)$ becomes the constant term of the separation function. $z_{ij}$ is an $n_i$-vector recording the degree of non-negative deviation from the separation plane between class $i$ and class $j$ for each observation in $A_i$, and $e$ is an $n_i$-vector with all elements equal to 1. Once an optimal solution $(\bar{w}, \bar{c}, \bar{z})$ is found, an observation $x$ is classified into class $i$ if

$$(\bar{w}_i^T - \bar{w}_j^T)x - (\bar{c}_i - \bar{c}_j) > 0, \quad \forall j \in \{1, \ldots, K\},\ j \neq i \tag{3}$$

based on decision rule (1), or equivalently

$$(\bar{w}_i^T - \bar{w}_j^T)x - (\bar{c}_i - \bar{c}_j) \geq 1, \quad \forall j \in \{1, \ldots, K\},\ j \neq i \tag{4}$$

based on Constraint (2) in Formulation MD if the $A_i$'s are linearly separable. This formulation, a special case of the general formulation in [26], does not require a normalization constraint to avoid the null optimal solution, due to the following proposition proved by Bennett and Mangasarian [4].

Proposition 1. $A_i$'s, $i \in \{1, \ldots, K\}$, are piecewise-linearly separable if and only if the optimal objective function value of Formulation MD is zero. In addition, the null solution in which $w_i - w_j = 0$ for $i \neq j$, $i, j \in \{1, \ldots, K\}$, happens if and only if

$$\frac{1}{K-1} \sum_{j=1, j\neq i}^{K} \frac{e^T A_j}{n_j} = \frac{e^T A_i}{n_i}, \quad i = 1, \ldots, K$$
and even in such a case there exists an alternative optimal solution to Formulation MD in which $w_i - w_j \neq 0$.

Applications of this formulation can be found in [40,41]. The 5-class example solved in three iterations and four LPs by the sequential approach in [27] is solved by one LP under Formulation MD. Since LP is polynomially solvable [34] by interior point methods, and the simplex method used in popular commercial LP solvers is in most cases quite efficient, solving Formulation MD should be efficient. A study shows that even though the cycling problem rarely happens in real applications, commercial LP solvers are capable of handling it efficiently [23].

In our model, we use the optimal solution $(\bar{w}, \bar{c}, \bar{z})$ to divide the $R^p$ space into two zones, a clear zone and a gray zone. The clear zone CZ is the union of the clear zones of all classes, i.e., $\cup_{i=1}^{K} CZ_i$, where

$$CZ_i = \{x \mid (\bar{w}_i^T - \bar{w}_j^T)x - (\bar{c}_i - \bar{c}_j) \geq 1,\ j = 1, \ldots, K,\ j \neq i\} \tag{5}$$
An observation $x \in \theta_i$ located in $CZ_i$ is correctly classified, while an observation $x \in \theta_j$, $j \neq i$, is misclassified. The pattern in CZ is relatively clear even if the classes are not linearly separable, i.e., there are very few misclassified cases in this zone. The remaining part is called the gray zone $GZ = \cup_{i=1}^{K} GZ_i$, where

$$GZ_i = \{x \mid (\bar{w}_i^T - \bar{w}_j^T)x - (\bar{c}_i - \bar{c}_j) \geq 0,\ j = 1, \ldots, K,\ j \neq i\} \setminus CZ_i \tag{6}$$
An observation in this zone is classified as a member of $\theta_i$ based on (3). The $A_i$'s are piecewise linearly separable if for all $x \in A_i$, $x \in CZ_i$. Otherwise, GZ usually contains most of the misclassified cases, as it covers the boundary area between classes. In Fig. 1, we show a three-class example with two feature variables. The solid border lines $L_{ij}$ represent the function $(\bar{w}_i - \bar{w}_j)x - (\bar{c}_i - \bar{c}_j) = 0$, and the dashed lines $G_{ij}$ correspond to the function $(\bar{w}_i - \bar{w}_j)x - (\bar{c}_i - \bar{c}_j) = 1$ for $i, j \in \{1, 2, 3\}$ and $i \neq j$. Thus, an observation $t$ from class $i$ has $0 < z_{ij}^t < 2$ if it falls in the gray zone between $G_{ij}$ and $G_{ji}$.

Fig. 1. Clear and gray regions of a 3-class example.
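The construction above can be made concrete for two classes. The sketch below is a hypothetical illustration, not the author's code: it builds Formulation MD with scipy.optimize.linprog; since only the differences $w_1 - w_2$ and $c_1 - c_2$ enter the two-class model, it optimizes those directly and then tags observations as clear or gray zone members according to (5) and (6).

```python
# Hypothetical sketch (not the author's code): Formulation MD for K = 2 solved
# with scipy.optimize.linprog, followed by the clear/gray zone split of (5)-(6).
import numpy as np
from scipy.optimize import linprog

def lpbd_two_class(A1, A2):
    """Minimize the sum of mean group deviations (Formulation MD, K = 2)."""
    n1, p = A1.shape
    n2 = A2.shape[0]
    # decision vector: [v (p entries), g (1 entry), z12 (n1 entries), z21 (n2 entries)]
    cost = np.concatenate([np.zeros(p + 1),
                           np.full(n1, 1.0 / n1), np.full(n2, 1.0 / n2)])
    # A1 v - g e + z12 >= e   rewritten as   -A1 v + g e - z12 <= -e
    top = np.hstack([-A1, np.ones((n1, 1)), -np.eye(n1), np.zeros((n1, n2))])
    # A2 (w2 - w1) - (c2 - c1) e + z21 >= e   rewritten as   A2 v - g e - z21 <= -e
    bot = np.hstack([A2, -np.ones((n2, 1)), np.zeros((n2, n1)), -np.eye(n2)])
    bounds = [(None, None)] * (p + 1) + [(0, None)] * (n1 + n2)
    res = linprog(cost, A_ub=np.vstack([top, bot]), b_ub=-np.ones(n1 + n2),
                  bounds=bounds, method="highs")
    v, g = res.x[:p], res.x[p]
    return v, g, res.fun

def classify_with_zone(x, v, g):
    """Apply rule (3) and tag x as clear or gray zone according to (5)-(6)."""
    s = x @ v - g                       # (w1 - w2)^T x - (c1 - c2)
    label = 1 if s > 0 else 2
    zone = "clear" if abs(s) >= 1.0 else "gray"
    return label, zone

# toy usage with two overlapping clouds; a positive objective means the two
# classes are not piecewise-linearly separable (Proposition 1)
rng = np.random.default_rng(0)
A1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))
A2 = rng.normal([2.0, 2.0], 1.0, size=(40, 2))
v, g, obj = lpbd_two_class(A1, A2)
print(round(obj, 4), classify_with_zone(A1[0], v, g))
```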
2.2. PNN model

A PNN [54] is capable of treating nonlinear and multi-modal patterns. The model was proposed in the 1960s by Specht [53]. Those properties fit nicely with the behavior of the observations in the gray zone. The PNN adopts a Bayes decision rule, i.e., assigning an observation $x$ to class $i \in \{1, \ldots, K\}$ if

$$c_i m_i f_i(x) > c_j m_j f_j(x), \quad \forall j \neq i,\ j \in \{1, 2, \ldots, K\} \tag{7}$$
where $c_k$, $m_k$, and $f_k(\cdot)$ are the cost of misclassifying a class $k$ case, the prior probability of occurrence of class $k$, and the probability density function of class $k$, respectively. Thus, inequality (7) is a piecewise nonlinear discriminator. Since in most applications the population distributions are not known, the PNN estimates the population density function with a multivariate Parzen window estimator [10,49] on the collected training data set. The most popular kernel function used in a Parzen window estimator is the Gaussian function. Note that the use of the Gaussian function has nothing to do with any normality assumptions. We denote a case in the training set whose class membership is known as $y$, a $p$-vector, and the case whose class membership is to be determined as $x$. The density function $f_k(x)$ is estimated by

$$f_k(x) = \frac{1}{(2\pi)^{p/2} \prod_{i=1}^{p} \sigma_i} \cdot \frac{1}{n_k} \sum_{y \in A_k} \exp\!\left(-\frac{1}{2}(x - y)^T C^{-1}(x - y)\right), \quad k \in \{1, \ldots, K\} \tag{8}$$

where $C$ is a $p \times p$ diagonal matrix with elements $C_{ij} = 0$ if $i \neq j$ and $C_{ii} = \sigma_i^2$, $\sigma_i > 0$. The $\sigma_i$, $i \in \{1, \ldots, p\}$, are called the smoothing factors of the PNN. Eq. (8) can be interpreted as a measure of the average distance, defined by the Gaussian function, from $x$ to all the $y$ in $A_k$. When $c_i = c_j$ and $m_i = m_j$ for $i, j \in \{1, \ldots, K\}$, $i \neq j$, an observation $x$ is assigned to the class closest to it
based on (7), due to the negative exponent in the Gaussian function. This model is also referred to as a weighted PNN [44], which allows adjustment for the disparate magnitudes and significance of feature variables through its smoothing factors $\sigma_i$. However, in practice it is seldom justified to allow all $\sigma_i$ to vary independently, for training time and overfitting concerns. The widely adopted basic PNN uses only one smoothing factor $\sigma$ by setting $\sigma_i = \sigma$ for $i \in \{1, \ldots, p\}$.

The value of the smoothing factors $\sigma_i$ is very important to the performance of a PNN. When they are too small, the method is close to a nearest neighbor method (overfitting). When they are too large, the pattern described by the PNN becomes blurred. To determine the value of the smoothing factor for a basic PNN, there are simple heuristics depending solely on the problem dimension and the covariance matrix of the training sample (see [36] for example). On the other hand, the smoothing factor can also be obtained by applying nonlinear optimization techniques to a continuous error function [52] whose derivatives can be easily evaluated when the kernel function is Gaussian. Given an observation $x \in \theta_i$, the continuous error function is defined as

$$e(x \mid \theta_i) = [1 - p(x \mid \theta_i)]^2 + \sum_{j=1, j\neq i}^{K} [p(x \mid \theta_j)]^2 \tag{9}$$
where $p(x \mid \theta_t) = f_t(x) / \sum_{l=1}^{K} f_l(x)$. In the training process, a training case $x$ is held out from the rest of the training set when evaluating the error function (9) with respect to $x$. The training process finds the best smoothing factor by applying a gradient descent search method [43] to the following unconstrained optimization problem:

$$\min_{\sigma_i} \left\{ \sum_{l=1}^{K} \sum_{x \in A_l} e(x \mid A_l^h) \;\middle|\; A_l^h = A_l \setminus \{x\},\ A_j^h = A_j,\ j \neq l,\ \sigma_i > 0,\ i = 1, \ldots, p \right\} \tag{10}$$

The described process can be considered as a four-layer feedforward neural network [54,55]. It has an input layer of $p$ neurons, a pattern layer with $n$ neurons, a summation layer of $K$ neurons, and a single-neuron decision layer. The network structure is given in Fig. 2. The input layer takes each input $x$ and feeds its elements to the neurons in the pattern layer. Each neuron in the pattern layer belongs to one of the $K$ classes and represents a case from that class in the training set. The pattern layer evaluates the term $\exp\!\left(-\frac{1}{2}(x - y)^T C^{-1}(x - y)\right)$ in Eq. (8) using the smoothing factors between the input layer and the pattern layer. The summation layer completes the computation of (8) and the decision layer assigns $x$ to a class according to (7).

Fig. 2. PNN (p features, K classes).
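A minimal numerical sketch of this classifier (hypothetical, not the paper's implementation): it evaluates the Parzen density (8) with a single smoothing factor and applies decision rule (7) under equal misclassification costs and equal priors, so an observation is assigned to the class with the largest estimated density.

```python
# Hypothetical sketch of the basic PNN of Section 2.2.
import numpy as np

def pnn_density(x, Ak, sigma):
    """Gaussian Parzen estimate f_k(x) of Eq. (8) with C = sigma^2 * I."""
    p = Ak.shape[1]
    d2 = np.sum((Ak - x) ** 2, axis=1)            # squared distance to each training case
    norm = (2.0 * np.pi) ** (p / 2.0) * sigma ** p
    return np.mean(np.exp(-0.5 * d2 / sigma ** 2)) / norm

def pnn_classify(x, classes, sigma):
    """classes: list of (label, A_k) pairs; return the label maximizing (7)."""
    scores = [(label, pnn_density(x, Ak, sigma)) for label, Ak in classes]
    return max(scores, key=lambda t: t[1])[0]

# toy usage
rng = np.random.default_rng(1)
A1 = rng.normal(0.0, 1.0, size=(30, 2))
A2 = rng.normal(3.0, 1.0, size=(30, 2))
print(pnn_classify(np.array([2.5, 2.5]), [(1, A1), (2, A2)], sigma=0.5))
```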
There are several advantages of using a PNN in addition to the nonlinear and multi-modal properties mentioned earlier. First, a PNN's network structure is totally dictated by the dimensionality of the samples, as opposed to a multi-layer perceptron whose network structure is determined by a trial-and-error or rule-of-thumb approach. Second, the training process only takes one pass over the training set to compute the optimal smoothing factor. Although this takes longer than training an LPBD, it is still much faster than a backpropagation [62] based training algorithm on a multi-layer perceptron. Third, it works well on small training sets, and when the sample size increases, it provides a very close approximation to the real density function. Nevertheless, its major drawback is the need to store all training cases in the pattern layer for future classifications. Applications of PNNs are found in various areas such as image processing and medical diagnosis (see [1,2,11,13,46,50,59] for example).

Most variations of the basic PNN were developed by applying revised training algorithms, relaxing the assumption of $\sigma_i = \sigma$ for $i \in \{1, \ldots, p\}$, and allowing more complicated network topologies. For example, Galleske and Castellanos [21] used a recursive algorithm to determine the values of the smoothing factors where each link between the input and pattern layers has its own distinct smoothing factor. By so doing, they demonstrated that the resulting PNN has better generalization ability. Yang and Chen [64] developed an expectation–maximization based training algorithm, which is shown to be less susceptible to numerical difficulties. This algorithm not only produces the values of the smoothing factors but also estimates mean vectors to be used as pattern layer nodes instead of adopting all training examples. It also obtains the weights for the links between the pattern and summation layers through training instead of simple summations. Further relaxation of the network topology brings the PNN to the radial-basis function network (RBFN) introduced by
Broomhead and Lowe [8]. RBFNs apply radial basis functions such as Gaussian functions as the activation functions for their hidden layer and linear activation functions for their output layer (corresponding to the summation layer in a PNN). In addition to using estimated mean vectors, RBFNs allow complete connections between the pattern and output layers, whose connection weights are obtained from the training process. Park and Sandberg [48] showed that RBFNs can theoretically approximate any function. RBFNs have been used for solving various types of problems such as classification [35] and time series forecasting [51,56]. A hybrid algorithm combining a PNN and an RBFN was proposed by Huang and Ma [28] and Huang [29] with the purpose of preserving the advantages and avoiding the disadvantages of the two underlying models.

All the variations discussed produce better classification results on the test instances adopted, but they also take longer to train because more parameters need to be estimated. When estimated mean vectors are used, they are usually far fewer than the training examples, resulting in a smaller memory requirement in implementation. They nevertheless demand some design work to determine the network topology. We apply the basic PNN in this study instead of those variations because of its simplicity in network design and efficiency in training. The benchmark models are also chosen based on the same criteria.
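To make the training criterion of (9) and (10) concrete, the sketch below evaluates the hold-one-out error and selects the single smoothing factor of a basic PNN. The paper trains with a gradient descent search [43]; the sketch scans a coarse grid of sigma values instead, an assumed simplification that is enough to illustrate the criterion.

```python
# Hypothetical sketch of smoothing-factor selection via the hold-one-out error (9)-(10).
import numpy as np

def gauss_density(x, Ak, sigma):
    # kernel part of Eq. (8); the common normalizing constant cancels in p(x|theta)
    d2 = np.sum((Ak - x) ** 2, axis=1)
    return np.mean(np.exp(-0.5 * d2 / sigma ** 2))

def holdout_error(classes, sigma):
    """Total error (10): each training case is held out of its own class."""
    total = 0.0
    for l, Al in enumerate(classes):
        for t in range(Al.shape[0]):
            x = Al[t]
            held = np.delete(Al, t, axis=0)                     # A_l \ {x}
            dens = [gauss_density(x, held if k == l else Ak, sigma)
                    for k, Ak in enumerate(classes)]
            post = np.array(dens) / (np.sum(dens) + 1e-300)     # p(x | theta_j)
            total += (1.0 - post[l]) ** 2 + np.sum(np.delete(post, l) ** 2)  # Eq. (9)
    return total

def pick_sigma(classes, grid=np.linspace(0.05, 2.0, 40)):
    return min(grid, key=lambda s: holdout_error(classes, s))

# toy usage
rng = np.random.default_rng(2)
classes = [rng.normal(0.0, 1.0, size=(25, 2)), rng.normal(2.0, 1.0, size=(25, 2))]
print(pick_sigma(classes))
```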
3. The two-stage discriminant model

This section describes the two-stage approach, including its preprocessing procedure and a row pricing scheme for large data sets. Wang [61] proposed a hybrid approach to improve the approximation of the nonlinear borderline between classes. It uses an LDF as its linear discriminator and analyzes a misty region near the linear borderline using Self-Organizing Feature Maps; a small two-class data set from a real application is included in that study. The Support Vector Machine (see Burges [9] for a good tutorial) is a generalized model of which an LPBD or a PNN can be considered a special case, obtained by choosing a linear or a nonlinear kernel function respectively. Nevertheless, its linear kernel requires solving a quadratic program instead of a linear program. The model is built for two-class instances, although it can be extended to multi-class cases by sequentially separating one class from the rest (pairwise separation). Lee et al. [38] recently proposed a multi-category version of the model which preserves some preferred theoretical properties of the binary version.

3.1. Preprocessing

One source of overfitting is the inclusion of noise variables in the model. In [41], the authors find that, for a two-class cancer diagnosis
application, a single separating plane with 3 variables (out of 30) performs the best. For feature reduction, we can either consolidate or drop variables to reduce the number of feature variables. In this study, we avoid methods like Principal Component Analysis, which creates a smaller number of new variables by applying linear combinations to the original variables [12]. Although this type of approach preserves a good portion of the total variance, it may impair the interpretability of the model result, as we are not always able to associate application-specific meanings with the newly created variables. Most of the reduction algorithms that drop feature variables are ad hoc to the discriminant model being used. Reduction algorithms developed specifically for Formulation MD [6,7] focus on two-class instances and require solving a revised problem. For a basic PNN, Tsai [60] proposed an iterative reduction algorithm based on its corresponding weighted PNN. Because our goal is to apply a basic PNN to observations in the gray zone using the LPBD as an intermediate step, feature reduction is performed for the PNN approach, but the result is used by both the LPBD and the PNN.

The iterative reduction algorithm due to Tsai [60] removes non-contributing variables based on the values of their corresponding smoothing factors $\sigma_i$ in the weighted PNN. The variables being dropped have smoothing factors significantly higher than those that remain. The decision on the cutoff point for dropping may vary from application to application and is subject to the user's judgment. This process is repeated iteratively until no variables are dropped. In order to have comparable magnitudes of the smoothing factors, all variables are standardized. As shown by the stability theorems in Appendix A, the LPBD based on Formulation MD and the weighted PNN are invariant to standardization. The remaining variables have closer $\sigma_i$'s, which preserves the stability and generalizability of the basic PNN, which has a common smoothing factor $\sigma$. Although the basic PNN is not invariant to standardization (see Appendix A), the reduction algorithm improves the robustness of classification results due to standardization.
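A possible shape of this reduction loop is sketched below. It is a hypothetical reconstruction of the idea attributed to [60], not the original algorithm: per-feature smoothing factors are fitted by minimizing the hold-one-out error (10) with a Nelder-Mead search (standing in for the gradient descent of [43]), and the cutoff rule that drops a feature when its sigma exceeds a multiple of the median is an illustrative assumption, since the paper leaves that judgment to the user.

```python
# Hypothetical sketch of iterative feature reduction via weighted-PNN smoothing factors.
import numpy as np
from scipy.optimize import minimize

def holdout_error_weighted(classes, sigmas):
    """Hold-one-out total error (10) for a weighted PNN (one sigma per feature)."""
    total = 0.0
    for l, Al in enumerate(classes):
        for t in range(Al.shape[0]):
            x = Al[t]
            dens = []
            for k, Ak in enumerate(classes):
                Y = np.delete(Ak, t, axis=0) if k == l else Ak
                d2 = np.sum(((Y - x) / sigmas) ** 2, axis=1)
                dens.append(np.mean(np.exp(-0.5 * d2)))
            post = np.array(dens) / (np.sum(dens) + 1e-300)
            total += (1.0 - post[l]) ** 2 + np.sum(np.delete(post, l) ** 2)
    return total

def fit_weighted_sigmas(classes, p):
    # optimize log(sigma) so each sigma_i stays positive
    obj = lambda logs: holdout_error_weighted(classes, np.exp(logs))
    res = minimize(obj, x0=np.zeros(p), method="Nelder-Mead")
    return np.exp(res.x)

def reduce_features(classes, cutoff=3.0):
    keep = np.arange(classes[0].shape[1])
    while True:
        sub = [Ak[:, keep] for Ak in classes]
        # standardize so the fitted sigma_i are on a comparable scale
        stacked = np.vstack(sub)
        mu, sd = stacked.mean(axis=0), stacked.std(axis=0) + 1e-12
        sub = [(Ak - mu) / sd for Ak in sub]
        sigmas = fit_weighted_sigmas(sub, len(keep))
        drop = sigmas > cutoff * np.median(sigmas)     # assumed cutoff rule
        if not drop.any() or drop.all():
            return keep
        keep = keep[~drop]

# toy usage: feature 0 is informative, feature 1 is pure noise
rng = np.random.default_rng(3)
A1 = np.column_stack([rng.normal(0, 1, 60), rng.normal(0, 1, 60)])
A2 = np.column_stack([rng.normal(3, 1, 60), rng.normal(0, 1, 60)])
print(reduce_features([A1, A2]))
```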
3.2. Problem solving procedure

The steps of the two-stage model are detailed below.

Step 0. Initialization: Determine a significance level $\alpha$ for testing whether the PNN in stage two improves classification accuracy over that of the LPBD in the gray zone.

Step 1. The LPBD Stage: Without loss of generality, we can set $w_1 = 0$ and $c_1 = 0$ in Formulation MD. Thus, the number of variables in the formulation is $(K-1)p + (K-1)n + (K-1)$ and the number of constraints is $(K-1)n$. Although the number of variables and constraints is large, the constraint matrix is sparse due to the piecewise property of Formulation MD, which compares two classes at a time. Once an optimal solution $(\bar{w}, \bar{c}, \bar{z})$ is obtained, we retain the values of $\bar{w}_i$ and $\bar{c}_i$ for future classifications. The optimal solution also generates the clear zone CZ and the gray zone GZ defined by (5) and (6) respectively. Fig. 3 shows a two-class example with a nonlinear borderline between the classes and the clear/gray zones. The clear zone has only a few misclassified cases, while the gray zone has a mix of cases from both classes.

Fig. 3. A two-class example with 400 observations.

Step 2. The PNN Stage: For the PNN in stage two, the training set is reduced to $A \leftarrow A \setminus \{x \mid x \in A \text{ and } x \in CZ\}$. Thus, the LPBD serves as an observation reduction algorithm which can significantly cut the computational effort of the PNN compared to a stand-alone PNN. Since the PNN can handle multi-modal patterns, we do not need to identify misclassified clusters along the borderline as in the method proposed by Wang [61].

Step 3. Model selection: We compare the classification accuracy for the gray zone observations produced by the LPBD and the PNN. Assuming $p_{LPBD}$ and $p_{PNN}$ are the proportions of cases correctly classified by the LPBD and PNN respectively, we test the alternative hypothesis $H_1$: $p_{PNN} > p_{LPBD}$ against $H_0$: $p_{PNN} = p_{LPBD}$ at the significance level chosen in Step 0. If the null hypothesis is rejected, we confirm the existence of a nonlinear pattern and recommend a nonlinear model for the problem. Otherwise, we conclude that a linear model is adequate.

Step 4. Classification: If a linear model is chosen in Step 3, we use the LPBD as our final model and classify observation $x$ to class $i$ if $x \in CZ_i \cup GZ_i$, $i \in \{1, \ldots, K\}$, i.e., $x$ satisfying inequality (3). Otherwise, we adopt a
classification rule which combines the results from the LPBD and the PNN: we classify $x$ to class $i$, $i \in \{1, \ldots, K\}$, if either $x \in CZ_i$ or $x$ satisfies inequality (7) from stage two, where the stage-two PNN is trained on all observations in GZ.
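The Step 3 comparison can be sketched as a one-sided test on two proportions over the same gray zone cases. The pooled two-proportion z-test below is one plausible reading; the paper only states that the comparison uses a proportion distribution (Section 4.3), so the exact statistic is an assumption. The usage line plugs in the Instance 1 gray zone figures from Table 1.

```python
# Hypothetical sketch of the Step 3 decision using a pooled one-sided
# two-proportion z-test (the exact test statistic is an assumption).
import math

def needs_stage_two(correct_pnn, correct_lpbd, n_gray, alpha=0.01):
    """Reject H0: p_PNN = p_LPBD in favor of H1: p_PNN > p_LPBD on the gray zone."""
    p1, p2 = correct_pnn / n_gray, correct_lpbd / n_gray
    pooled = (correct_pnn + correct_lpbd) / (2.0 * n_gray)
    se = math.sqrt(2.0 * pooled * (1.0 - pooled) / n_gray)
    if se == 0.0:
        return False
    z = (p1 - p2) / se
    z_crit = {0.05: 1.645, 0.01: 2.326}[alpha]   # one-sided critical values
    return z > z_crit

# Instance 1 gray zone from Table 1: 51 cases, LPBD 66.67% (34), PNN 92.16% (47)
print(needs_stage_two(correct_pnn=47, correct_lpbd=34, n_gray=51))   # -> True
```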
3.3. Pricing algorithm and extensions for LPBD

This section discusses a row pricing algorithm used to solve large instances and some extensions that can be derived from the LPBD. Unfortunately, no similar treatment is available for the PNN.

When there is a large number of observations in the training set, it is difficult to consider all the constraints they generate in Formulation MD simultaneously, due to computer memory limitations. Our pricing algorithm, detailed in Fig. 4, selects a subset of the constraint set, proportional to the size of $n_i$, for the initial formulation. After solving the LP, inactive rows are priced out of the formulation and a certain number of rows that are violated by the current solution are priced into the formulation. The process continues until no violated rows are found. The threshold $L$ in the algorithm is used to limit the size of the new formulation and to ensure that constraints from each combination of class $i$ and class $j$ have a chance to be examined in each iteration. To reduce the number of iterations, we do not price out inactive constraints corresponding to observations located very close to the $G_{ij}$'s (Fig. 1). This is done by adding a small value to the right hand side of the pricing-out criterion in Step Pricing Out/In. The algorithm allows us to solve the largest test instance in our computational study.

Fig. 4. Pricing algorithm for LPBD.

When more observations are collected, the pricing algorithm can be modified to update the model without solving the problem from scratch. Another extension is to use the LPBD to identify a core subset of members in each class by selecting a portion of $CZ_i$ that is away from the border line. For example, a core set defined by

$$\{x \mid x^T(\bar{w}_i - \bar{w}_j) - (\bar{c}_i - \bar{c}_j) \geq 1.5 + \epsilon,\ x \in A_i,\ \forall j \neq i,\ j = 1, \ldots, K\} \tag{11}$$
might represent the most loyal customers in market segment i.
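The sketch below illustrates the pricing loop for the two-class case, again with scipy.optimize.linprog. It is a hypothetical reconstruction of the scheme described above rather than the algorithm of Fig. 4 (which is not reproduced here); the tolerances and the most-violated-first selection are illustrative assumptions, while the defaults of a 20% initial sample and L = 80 follow the settings reported in Section 4.2.

```python
# Hypothetical two-class sketch of the row pricing scheme for Formulation MD.
import numpy as np
from scipy.optimize import linprog

def solve_restricted(A1, A2, S1, S2):
    """Formulation MD (K = 2) restricted to the rows indexed by S1 and S2."""
    p = A1.shape[1]
    R1, R2 = A1[S1], A2[S2]
    m1, m2 = len(S1), len(S2)
    cost = np.concatenate([np.zeros(p + 1), np.full(m1, 1.0 / A1.shape[0]),
                           np.full(m2, 1.0 / A2.shape[0])])
    top = np.hstack([-R1, np.ones((m1, 1)), -np.eye(m1), np.zeros((m1, m2))])
    bot = np.hstack([R2, -np.ones((m2, 1)), np.zeros((m2, m1)), -np.eye(m2)])
    res = linprog(cost, A_ub=np.vstack([top, bot]), b_ub=-np.ones(m1 + m2),
                  bounds=[(None, None)] * (p + 1) + [(0, None)] * (m1 + m2),
                  method="highs")
    return res.x[:p], res.x[p]

def lpbd_pricing(A1, A2, init_frac=0.2, L=80, keep_eps=0.05, max_iter=50):
    rng = np.random.default_rng(0)
    n1, n2 = A1.shape[0], A2.shape[0]
    S1 = list(rng.choice(n1, max(1, int(init_frac * n1)), replace=False))
    S2 = list(rng.choice(n2, max(1, int(init_frac * n2)), replace=False))
    for _ in range(max_iter):
        v, g = solve_restricted(A1, A2, S1, S2)
        m1 = A1 @ v - g          # margins of class 1 rows; satisfied when >= 1
        m2 = g - A2 @ v          # margins of class 2 rows; satisfied when >= 1
        # price out inactive rows, keeping those close to the G_ij lines
        S1 = [i for i in S1 if m1[i] <= 1.0 + keep_eps]
        S2 = [i for i in S2 if m2[i] <= 1.0 + keep_eps]
        # price in up to L of the most violated excluded rows per class
        viol1 = [i for i in np.argsort(m1) if i not in S1 and m1[i] < 1.0][:L]
        viol2 = [i for i in np.argsort(m2) if i not in S2 and m2[i] < 1.0][:L]
        if not viol1 and not viol2:
            return v, g
        S1 += viol1
        S2 += viol2
    return v, g

# toy usage
rng = np.random.default_rng(4)
A1 = rng.normal(0.0, 1.0, size=(500, 2))
A2 = rng.normal(3.0, 1.0, size=(500, 2))
print(lpbd_pricing(A1, A2))
```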
4. Computational experiments

Although our computational test focuses on five data sets collected from real applications, we add two randomly generated data sets in which the nonlinear border line can be controlled. This section describes the data sets we use for test purposes and discusses the test results.

4.1. Two randomly generated data sets

The first test instance has a longer and deeper wave in its nonlinear frontier line, generated by a trigonometric function, so it contains a clear nonlinear pattern. The second instance has a small and shallow wave which is difficult to describe by either a linear or a nonlinear model; it is therefore used to illustrate the case where noise is present instead of a clear nonlinear pattern. The first test instance is generated by the following procedure.

1. Two hundred cases are uniformly distributed in a rectangular area for each class. For class one, $x_1 \sim U(0, 10)$, $x_2 \sim U(0, 10)$. For class two, $x_1 \sim U(10, 20)$, $x_2 \sim U(0, 10)$.
2. A nonlinear frontier is created by setting $x_1 \leftarrow x_1 + 2\,\frac{x_1}{10}\sin(0.8 x_2)$ for class one and $x_1 \leftarrow x_1 + 2\,\frac{20 - x_1}{10}\sin(0.8 x_2)$ for class two. This function is revised from the one in [61].
3. The data set is rotated by an angle of $\pi/4$, i.e., $x_1 \leftarrow x_1\cos(\pi/4) - x_2\sin(\pi/4)$ and $x_2 \leftarrow x_1\sin(\pi/4) - x_2\cos(\pi/4)$.
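The generation procedure translates directly into a few lines of code. The sketch below follows the reconstruction above; the sign conventions in the rotation step are recovered from the text and should be treated as an assumption.

```python
# Hypothetical sketch of the first generated instance (instance 2 uses freq = 4.0).
import numpy as np

def generate_instance_one(n_per_class=200, freq=0.8, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: uniform rectangles for the two classes
    x1_a = rng.uniform(0.0, 10.0, n_per_class)
    x2_a = rng.uniform(0.0, 10.0, n_per_class)
    x1_b = rng.uniform(10.0, 20.0, n_per_class)
    x2_b = rng.uniform(0.0, 10.0, n_per_class)
    # step 2: bend the frontier with a sine wave
    x1_a = x1_a + 2.0 * (x1_a / 10.0) * np.sin(freq * x2_a)
    x1_b = x1_b + 2.0 * ((20.0 - x1_b) / 10.0) * np.sin(freq * x2_b)
    # step 3: rotate by pi/4 (signs as reconstructed from the text)
    c, s = np.cos(np.pi / 4.0), np.sin(np.pi / 4.0)
    A1 = np.column_stack([x1_a * c - x2_a * s, x1_a * s - x2_a * c])
    A2 = np.column_stack([x1_b * c - x2_b * s, x1_b * s - x2_b * c])
    return A1, A2

A1, A2 = generate_instance_one()
print(A1.shape, A2.shape)
```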
The separation function $(\bar{w}_1 - \bar{w}_2)x = \bar{c}_1 - \bar{c}_2$ generated by the LPBD is $0.55x_1 + 0.576x_2 = 8.393$, with a non-zero objective function value of 0.294, indicating that the two classes are not linearly separable. Fig. 3 shows that the observations in the gray zone form a few clusters for both classes. The second test instance is generated by changing $\sin(0.8x_2)$ in the first test instance to $\sin(4x_2)$ so that the border line between the two groups is zigzagged. We expect that a nonlinear model is more likely to overfit the second instance than the first one.

4.2. Design of the experiments

In addition to the two generated data sets, we choose five data sets from publicly available databases, three from medical applications and two from agricultural applications. We provide a brief description of each data set below. Results from solving the seven data sets are summarized in Tables 1 and 2.
Table 1
Summary of computational results

| | 1. G-0.8 | 2. G-4.0 | 3. Diab. | 4. Diag. | 5. Prog. | 6. Iris | 7. Satellite |
|---|---|---|---|---|---|---|---|
| 1. N. classes | 2 | 2 | 2 | 2 | 2 | 3 | 6 |
| 2. N. cases | 400 | 400 | 768 | 682 | 81 | 150 | 4435/2000 |
| 3. N. vars. | 2 | 2 | 8 | 9 | 32 | 4 | 36 |
| 4. Red. vars. | 2 | 2 | 3 | 3 | 3 | 2 | 16 |
| Analysis of gray zone | | | | | | | |
| 5. N. cases | 51 | 46 | 397 | 39 | 46 | 11 | 1531/731^a |
| 6. LPBD | 66.67% | 54.35% | 64.30% | 61.54% | 60.87% | 90.00% | 62.71%^b |
| 7. PNN | 92.16% | 54.35% | 59.28% | 71.80% | 69.57% | 54.55% | 78.54%^b |
| 8. N. vars. | | | | 2 | 2 | | |
| Overall classification accuracy (cross validation) | | | | | | | |
| 9. Stage I | 93.50% | 93.75% | 76.04% | 96.33% | 69.14% | 98.00% | 83.05%^b |
| 10. Stage I–II | 96.75% | 93.75% | 74.09% | 96.92% | 74.07% | 96.00% | 88.10%^b |
| Benchmark | | | | | | | |
| 11. PNN | 97.75% | 94% | 76.56% | 96.63% | 67.90% | 96.67% | 88.20%^b |
| 12. LDF | 93.50% | 93.75% | 77.10% | 96.20% | 69.10% | 98.00% | 83.50%^b |
| 13. LDF var. | 2 | 2 | 5 | 8 | 3 | 4 | 27 |

^a 1531 cases in training set and 731 in test set.
^b From the 2000 cases in the test set.
Table 2
Summary of benchmark test

| Alternative hypothesis | Row # in Table 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Analysis of stage one and stage two | | | | | | | | |
| 1. H1: p(GZ_PNN) > p(GZ_LPBD) | (7) > (6) | ** | | | | | (*) | ** |
| 2. H1: p(I–II) > p(I) | (10) > (9) | ** | | | | | | ** |
| Benchmark test | | | | | | | | |
| 3. H1: p(PNN) > p(LDF) | (11) > (12) | ** | | | | | | ** |
| 4. H1: p(PNN) ≠ p(I–II) | (11) ≠ (10) | | | | | | | |
| 5. H1: p(I–II) > p(LDF) | (10) > (12) | ** | | | | | | ** |
| 6. H1: p(LDF) ≠ p(I) | (12) ≠ (9) | | | | | | | |

* Significant at 5% level; ** significant at 1% level; ( ) significant in the opposite direction of H1.
Instances 1 & 2: These two data sets are the two randomly generated instances described earlier.

Instance 3: The diabetes data set is obtained from the STATLOG project [58]. It contains eight attributes from 768 female patients of Pima Indian heritage from the Phoenix (Arizona, USA) area. The class variable is a binary variable indicating whether the patient shows signs of diabetes according to the World Health Organization's criterion.

Instance 4: The Wisconsin Breast Cancer Diagnosis data set [40] contains 682 cases after removing 16 cases with missing values and one observation considered an outlier. Each case records measures of nine nuclear features from Fine Needle Aspirates (FNA) taken from a patient, and the class variable indicates whether the case is benign or malignant based on a more accurate biopsy test.

Instance 5: The Wisconsin Breast Cancer Prognostic data set contains 194 observations after removing four observations with missing values [41]. For each patient, 32 features are measured. Another variable in the database records whether the patient had a recurrence of the breast cancer and when it happened (in months after the previous surgery). For non-recurrent cases, it records the known cancer-free time based on the last checkup visit. We separate the database into two groups, the recurrent and non-recurrent groups. Since the longest time until recurrence recorded in the database is 78 months, we remove all patients in the non-recurrent group with a cancer-free time of less than 78 months. The resulting file has 46 recurrent cases and 35 non-recurrent cases. The two breast cancer data sets can be found at an ftp site [63].

Instance 6: The Iris data set has four feature variables measuring 150 cases from three species of iris, 50 cases for each species. The data set is available in some statistical textbooks, for example [33].

Instance 7: The Satellite data set also comes from the Statlog database [58]. It contains satellite images of a small subarea of a large scene. The data set is obtained by taking 6435 images, each containing a 3 × 3 pixel neighborhood from an 82 × 100 pixel area. Each pixel is described by four digital images of the same scene in different spectral bands. As a result, 36 variables are used to depict a 3 × 3 area. The goal is to determine the topography or land use of the central pixel of each 3 × 3 pixel neighborhood. There are six non-empty classes recorded by the class variable.
Due to the relatively small size of the data sets, except for the satellite image set, the hold-one-out method [37] is adopted for cross validation. Thus, for a data set with $n$ observations there are $n$ runs, each with $n - 1$ observations in the training process and one held-out observation in the test process. The satellite data set is divided into two subsets, 4435 cases in the training set and 2000 cases in the test set. Cross validation is done by applying the training result from the 4435 cases to the 2000 test cases. Due to its size, the pricing algorithm is used in stage one for this set, with 20% of the rows included in the initial formulation; in each iteration we allow no more than 80 constraints to be added from each pair of $i$ and $j$, or 2400 constraints in total from all pairs.

We use two approaches, a stepwise LDF (linear) and a stand-alone PNN (nonlinear), as benchmarks for the two-stage algorithm. All tests are conducted on a Pentium III 500 MHz PC with 128 MB memory, using CPLEX [30] as the LP solver and SPSS [57] for solving the stepwise LDF. The PNN code was adapted and modified from the C++ code in [43].

4.3. The result

Computational results are listed in Tables 1 and 2. The first four rows of Table 1 list the number of classes, observations, original feature variables, and feature variables after reduction for each data set. The second portion of the table also contains four rows indicating the number of cases in the gray zone, the classification accuracy of the LPBD and PNN on the gray zone cases, and the number of feature variables used by the PNN in stage two if further reduced from row (4). The next part of Table 1 records the classification accuracy of using the LPBD alone and of the two-stage approach. The last part reports the classification accuracy of the two benchmark approaches and the number of variables selected by the benchmark stepwise LDF using the default choice in SPSS. The feature variables used by the benchmark PNN are the same as those reported in row (4).

Table 2 shows the results of testing whether one approach produces better classification accuracy than the other using a proportion distribution. The first column shows the alternative hypothesis of each test, where the null hypothesis (not listed in the table) says the two proportions are equal. The second column lists the two corresponding rows in Table 1 that are being compared. The last seven columns indicate for each test instance whether the result is statistically significant at $\alpha = 1\%$ or $\alpha = 5\%$. A blank cell means not significant at $\alpha = 5\%$.

1. Feature reduction: The feature reduction method reduces a significant portion of variables for all but the two generated sets, which have only two variables (Rows 3 and 4 in Table 1). Comparing the classification accuracy
between the LPBD and the stepwise LDF (Rows 9 and 12 of Table 1 and Row 6 of Table 2), we find that the reduction method, although not designed for the LPBD, performs quite well for the LPBD. Further reduction of feature variables is performed before stage two for Instances 4 and 5 to avoid overfitting (Row 8 of Table 1).
2. Gray zone analysis: This is the most important analysis in this study. In five of the seven test instances, the PNN in stage two correctly classifies at least as many gray zone cases as the LPBD does (Rows 6 and 7 in Table 1). Statistical tests (Row 1 of Table 2) show that the PNN performs significantly better than the LPBD for Instances 1 and 7 at $\alpha = 1\%$. This indicates strong nonlinear behavior in the gray zone for those two instances. In Instances 3 and 6, the LPBD outperforms the PNN in the gray zone due to overfitting. Instance 6 is almost linearly separable because the gray zone contains a very small portion of the total observations, and the LPBD outperforms the PNN in the gray zone at the 5% level. The parentheses in Table 2 indicate that the test result is significant in the reverse direction of the alternative hypothesis. Instances 3 and 5 have lower classification accuracy from both approaches compared to the other instances. For those two instances, we find that a great portion of their observations falls in the gray zone. Their patterns are not as clear as the others'. As a result, neither a linear model nor a nonlinear model performs well (lower than 80% classification accuracy).
3. Value of stage two: The value of using the two-stage model as opposed to using stage one alone can be observed by comparing Rows 9 and 10 of Table 1 and Row 2 of Table 2. Instances 1 and 7 exhibit significant improvement ($\alpha = 1\%$) when stage two is added to the model. The result supports our gray zone analysis, in which the same two instances exhibit a nonlinear pattern in their gray zones. We conclude that stage two is needed in the final model when the gray zone analysis finds a strong (at $\alpha = 1\%$) indication of a nonlinear pattern.
4. Comparing the two benchmark approaches: Before we compare the two-stage model with the benchmark approaches, we first test the performance of the two benchmark models, a stand-alone PNN and a stepwise LDF. The results in Rows 11 and 12 of Table 1 and Row 3 of Table 2 indicate that the nonlinear model performs significantly better ($\alpha = 1\%$) than the linear model only when the nonlinear pattern in the gray zone is very significant ($\alpha = 1\%$), in the cases of Instances 1 and 7. Again, the result is consistent with our gray zone analysis.
5. Two-stage model versus benchmark nonlinear model: From Rows 10 and 11 of Table 1 and Row 4 of Table 2, there is no significant performance difference between the two-stage model and the stand-alone PNN. Both are capable of handling nonlinear patterns, but the two-stage model consumes fewer computational resources.
6. Two-stage model versus benchmark linear model: Comparing the two-stage model and the benchmark LDF (Rows 10 and 12 of Table 1 and Row 5 of Table 2), the two-stage model significantly outperforms the benchmark model for the two instances (1 and 7) with strong nonlinear patterns.
7. LPBD versus LDF: Rows 9 and 12 of Table 1 and Row 6 of Table 2 indicate that there is no significant difference between the performance of the two linear models. The reason for using the LPBD instead of the LDF is to preserve the non-parametric property of the two-stage model.
8. Stage one as a data reduction algorithm for stage two: For Instances 1 and 7, where stage two is needed, stage one leaves only 12.75% and 36.55% of the total observations, respectively, in the gray zone (Rows 2 and 5 of Table 1). For the largest data set, the satellite set, stage one takes 515 s in 21 solving-pricing iterations to solve the LP, and stage two takes 205 s to train the PNN on the 1531 cases in the gray zone. The stand-alone PNN takes 1050 s to train on the 4435 cases. The two-stage model thus achieves a 31% time saving. With more computer memory available, the saving would be even larger because pricing iterations would not be needed in that case.
9. Pricing for large instances: The satellite set with 4435 cases results in a total of 22260 constraints in Formulation MD. However, our pricing algorithm with p = 0.2 puts only 5148 constraints in the initial formulation. The number falls in the subsequent 21 iterations, with the next highest count being 3635 constraints. This allows us to solve large instances with limited computer memory.
5. Conclusion

This study proposes a two-stage approach for discriminant problems. The LPBD adopted for stage one separates the observations into a clear zone and a gray zone. The separation of the two zones provides better insight into the behavior of the data. The clear zone covers the observations representing the core segment of each class. The gray zone provides rich information regarding the existence of a nonlinear pattern and serves as an indicator of the overall classification accuracy. We summarize the findings from our computational experiments as follows.

(A) If the classification accuracy obtained from the PNN is significantly better than that from the LPBD in the gray zone at the $\alpha = 1\%$ level, there is strong evidence of a nonlinear pattern (Instances 1 and 7).
(a) The combined result from stage one and stage two serves as the final decision model.
(b) The external validity of the model's capability of treating nonlinear patterns is supported by the benchmark test using the stand-alone PNN. It also indicates that even though the PNN is applied only to the gray zone observations in stage two, the two-stage model does not lose information about the overall distributions/patterns of the classes. Thus, the PNN on the gray region serves not only as a tool to detect nonlinearity but also as a critical part of the two-stage discriminant model.

(B) If the PNN does not statistically outperform the LPBD in the gray zone at the $\alpha = 1\%$ level, the LPBD in stage one alone should be adopted as the final decision model. This rule applies to two types of instances.
(a) Almost linearly separable instances: If the gray zone contains a small portion of the observations, both linear and nonlinear models can produce high classification accuracy (Instances 2, 4, and 6). Although Instance 2 has a built-in nonlinear pattern, its signal is not strong enough to warrant the treatment of stage two.
(b) Inseparable instances: In this case, the gray zone still contains a great portion of the observations. Neither a linear model nor a nonlinear model can produce good classification accuracy (Instances 3 and 5).
As a result, in both cases there is no need to use a nonlinear model. The external validity of the LPBD is supported by the benchmark LDF approach. It also indicates that the LDF remains a viable linear discriminant model regardless of whether its basic assumptions are met. In addition, it is available in most statistical software packages.

(C) In the two-stage model, observations in the clear zones are treated by the LPBD only. It is nevertheless possible that when we apply the stand-alone PNN to all observations, some of the clear zone observations may receive class memberships different from those assigned by the LPBD. However, we believe that this should have little impact on our decision model for the following two reasons.
(a) For both almost linearly separable instances (Instances 2, 4, and 6) and almost nonlinearly separable instances (Instances 1 and 7), this happens to only a very small number of the clear zone observations, if any. The reason is that under such circumstances the majority of the observations fall into the clear zones, which should overlap largely with the high-density areas produced by the estimated pdfs from the stand-alone PNN.
(b) For instances that cannot be separated well by either a linear or a nonlinear model (Instances 3 and 5), more than half of the observations fall into the gray zone, leaving fewer observations in the clear zones. As no clear pattern exists in those instances, it is likely that more clear zone observations may receive inconsistent classification results. However, when the classification accuracy is not satisfactory, we do not accept the model, so the phenomenon does not have any impact on those instances.

One weakness of this study is that, among the five instances from real applications, only Instance 7 demonstrates a clear nonlinear pattern. We would hope to obtain stronger support from more cases in which a strong nonlinear pattern exists. However, this might mean that real world applications usually either contain clear linear patterns, as in Instances 4 and 6, or lack any strong linear/nonlinear pattern, as in Instances 3 and 5. They seldom possess clear nonlinear patterns such as the one found in Instance 7. This is further supported by the results from the two generated instances (Instances 1 and 2), where nonlinear patterns are deliberately included. In Instance 1, the nonlinear pattern is strong and is picked up by the PNN, whereas in Instance 2, the nonlinear pattern is too weak for a nonlinear model to perform significantly better than a linear model. In addition, if the data is not almost piecewise separable, either a pairwise (sequential) linear model or a stand-alone PNN should be adopted. Nevertheless, none of our test instances suggests such a phenomenon, as the two-stage model and the stand-alone PNN have comparable performance on all test sets.

In conclusion, by combining the two discriminant models, we create a two-stage model with good interpretability and computational efficiency while preserving the capacity to treat nonlinear patterns. Alternatively, this model can be embedded as an efficient intermediate step in another decision model, which may apply a more advanced nonlinear model when nonlinear patterns are detected by the two-stage model.
Appendix A. Stability theorems

This section establishes some stability qualities for the two models in the two-stage approach with respect to affine transformations. They build the theoretical foundation for the standardization and feature selection in the preprocessing process described in Section 3.1. Let $x$ be a $p$-vector and $x' = Fx + c$ be the affine transformation of $x$, where $F$ is a $p \times p$ rotation matrix and $c$ is a translation vector. The primal–dual stability theorem introduced in this section depicts the relationship between the primal and dual solutions to Formulation MD with respect to the original data set $A$ and the transformed data set $A' = AF^T + ec^T$.
The dual of MD can be written as

$$\text{(dMD)}\quad \max \sum_{i=1}^{K} \sum_{j=1, j\neq i}^{K} e^T u_{ij}$$
$$\text{s.t.}\quad \sum_{j=1, j\neq i}^{K} (A_i^T u_{ij} - A_j^T u_{ji}) = 0, \quad i = 1, \ldots, K \tag{12}$$
$$\sum_{j=1, j\neq i}^{K} (-e^T u_{ij} + e^T u_{ji}) = 0, \quad i = 1, \ldots, K \tag{13}$$
$$0 \leq u_{ij} \leq \frac{1}{n_i} e, \quad i \neq j,\ i, j = 1, \ldots, K \tag{14}$$
where constraints (12)–(14) correspond to the $w$, $c$ and $z$ variables in the primal formulation, respectively. Let MDt be the MD formulation on the transformed data $A_i'$, $i = 1, \ldots, K$, and dMDt be the dual of MDt. We have the following primal–dual stability theorem, extended from the stability theorem in [25].

Theorem 2 (MD primal–dual stability theorem). If $(\bar{w}, \bar{c}, \bar{z})$ is an optimal solution to MD and $\bar{u}$ is an optimal solution to dMD with an objective function value $\bar{s}$, the following two results hold when $F$ is non-singular.

1. $w_i^* = (F^T)^{-1}\bar{w}_i$, $c_i^* = \bar{c}_i + c^T(F^T)^{-1}\bar{w}_i$, $z_{ij}^* = \bar{z}_{ij}$ is an optimal solution to MDt with an objective function value $\bar{s}$.
2. $\bar{u}$ is an optimal solution to dMDt with an objective function value $\bar{s}$.

Proof of Theorem 2. Glover et al. [25] provided a simple proof of the first part of the theorem for their general formulation. We give an alternative proof for Formulation MD that also proves the second part of the theorem. Formulation MD is guaranteed to have a finite optimal solution since $\sum_{i=1}^{K}\sum_{j=1, j\neq i}^{K} e^T z_{ij} \geq 0$ and $w_i = 0$, $c_i = 0$, $z_{ij} = e$ is a feasible solution with an objective function value $K(K-1)$. MDt on the transformed data set $A'$ can be written as

$$\text{(MDt)}\quad \min \sum_{i=1}^{K} \sum_{j=1, j\neq i}^{K} \frac{e^T z_{ij}}{n_i}$$
$$\text{s.t.}\quad (A_i F^T + ec^T)(w_i - w_j) - (c_i - c_j)e + z_{ij} \geq e, \quad i \neq j,\ i, j = 1, \ldots, K \tag{15}$$
$$w_i, w_j, c_i, c_j \ \text{unrestricted in sign}, \quad z_{ij} \geq 0$$
Substituting $w^*$, $c^*$, and $z^*$ into constraint (15) of MDt, we have

$$(A_i F^T + ec^T)(F^T)^{-1}(\bar{w}_i - \bar{w}_j) - \left[\bar{c}_i + c^T(F^T)^{-1}\bar{w}_i - \bar{c}_j - c^T(F^T)^{-1}\bar{w}_j\right]e + \bar{z}_{ij} = A_i(\bar{w}_i - \bar{w}_j) - (\bar{c}_i - \bar{c}_j)e + \bar{z}_{ij} \geq e$$

from (2). Thus, $(w^*, c^*, z^*)$ is feasible for MDt with an objective function value $\bar{s}$. We now show that the optimal solution to dMD, $\bar{u}_{ij}$, is also feasible for dMDt. The only difference between dMD and dMDt is in Constraint (12), which for dMDt can be written as

$$\sum_{j=1, j\neq i}^{K} \left[(F A_i^T + ce^T)u_{ij} - (F A_j^T + ce^T)u_{ji}\right] = 0, \quad i = 1, \ldots, K \tag{16}$$

Constraint (16) evaluated at $\bar{u}_{ij}$ can be rewritten as

$$F \sum_{j=1, j\neq i}^{K} (A_i^T \bar{u}_{ij} - A_j^T \bar{u}_{ji}) + c \sum_{j=1, j\neq i}^{K} (e^T \bar{u}_{ij} - e^T \bar{u}_{ji}) = 0 \tag{17}$$
since both summation terms are zero due to constraints (12) and (13). Thus $\bar{u}_{ij}$ is a feasible solution to dMDt with objective function value $\bar{s}$. We have shown that MDt and dMDt have the same objective function value $\bar{s}$ under the solutions $(w^*, c^*, z^*)$ and $\bar{u}$ respectively. Hence, they are optimal solutions to MDt and dMDt respectively. $\square$

This theorem implies that the classification result of MD is invariant to standardization, because standardization is a special case of the non-singular affine transformation and $z_{ij}^* = \bar{z}_{ij}$. In the next theorem, we prove the stability theorem for the weighted PNN.

Theorem 3 (weighted PNN stability theorem). If $\bar{\sigma}_i$, $i = 1, \ldots, p$, are optimal smoothing factors for a weighted PNN with respect to the total error function (10) and $\prod_{i=1}^{p} f_{ii} \neq 0$, where the $f_{ii}$ are the diagonal elements of $F$, then $\sigma_i^* = f_{ii}\bar{\sigma}_i$ is an optimal solution for the PNN on the transformed data set $A'$ with the same total error.

Proof of Theorem 3. Note that the only part of (8) that affects the result of (7) is the exponent $(x - y)^T C^{-1}(x - y)$, or $\sum_{i=1}^{p} (x_i - y_i)^2/\sigma_i^2$. Let $A'$ be a transformed training data set and $x' = Fx + c$, $x' \in A'$, be a held out observation in the training process. For a transformed observation $y' = Fy + c$, $y' \in A' \setminus \{x'\}$, the exponent portion of (8) can be written as

$$(x' - y')^T (C^*)^{-1} (x' - y') = (x - y)^T F^T (C^*)^{-1} F (x - y) = \sum_{i=1}^{p} (x_i - y_i)^2 f_{ii}^2 / (\sigma_i^*)^2$$
Thus if $\bar{\sigma}_i$ minimizes (10) for the original data set, $\sigma_i^* = f_{ii}\bar{\sigma}_i$ minimizes (10) for the transformed data set with the same minimum. $\square$

This theorem shows that the classification result of a weighted PNN is invariant to standardization, since standardization is a special case of the specified affine transformation. However, since in implementation the heuristic search method does not guarantee an optimal solution, different starting points and step sizes may result in different solutions for which the invariance property does not hold.
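A small numerical check of Theorem 3 (a hypothetical illustration, not part of the paper): under a diagonal scaling $x' = Fx + c$, the weighted-PNN exponent is unchanged once each smoothing factor is rescaled to $\sigma_i^* = f_{ii}\bar{\sigma}_i$.

```python
# Hypothetical numerical check of the invariance stated in Theorem 3.
import numpy as np

rng = np.random.default_rng(5)
p = 4
x, y = rng.normal(size=p), rng.normal(size=p)
sigma = rng.uniform(0.5, 2.0, size=p)       # original smoothing factors
f = rng.uniform(0.5, 2.0, size=p)           # diagonal of F (all nonzero)
c = rng.normal(size=p)                      # translation vector

exponent = np.sum((x - y) ** 2 / sigma ** 2)
x_t, y_t = f * x + c, f * y + c             # transformed observations
exponent_t = np.sum((x_t - y_t) ** 2 / (f * sigma) ** 2)
print(np.isclose(exponent, exponent_t))     # -> True
```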
References

[1] M.R. Azimi-Sadjadi, W. Gao, T.H.V. Haar, D. Reinke, Temporal updating scheme for probabilistic neural network with application to satellite cloud classification: further results, IEEE Transactions on Neural Networks 12 (5) (2001) 1196–1203.
[2] R.L. Bankert, Cloud classification of AVHRR imagery in maritime regions using a probabilistic neural network, Journal of Applied Meteorology 33 (8) (1994) 909–918.
[3] K. Bennett, O.L. Mangasarian, Robust linear programming discrimination of two linearly inseparable sets, Optimization Methods and Software 1 (1992) 23–34.
[4] K. Bennett, O.L. Mangasarian, Multicategory discrimination via linear programming, Optimization Methods and Software 3 (1993) 27–39.
[5] A.L. Blum, R.L. Rivest, Training a 3-node neural network is NP-complete, Neural Networks 5 (1992) 117–127.
[6] P.S. Bradley, O.L. Mangasarian, W.N. Street, Feature selection via mathematical programming, INFORMS Journal on Computing 10 (2) (1998) 209–217.
[7] E.J. Bredensteiner, K. Bennett, Feature minimization within decision trees, Computational Optimizations and Applications 10 (2) (1998) 111–126.
[8] D. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.
[9] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998) 121–167.
[10] T. Cacoullos, Estimation of a multivariate density, Annals of the Institute of Statistical Mathematics (Tokyo) 18 (2) (1966) 179–189.
[11] S.R. Chettri, R.F. Cromp, Probabilistic neural network architecture for high-speed classification of remotely sensed imagery, Telematics and Informatics 10 (3) (1993) 187–198.
[12] Y. Chtioui, D. Bertrand, D. Barba, Reduction of the size of the learning data in a probabilistic neural network by hierarchical clustering. Application to the discrimination of seeds by artificial vision, Chemometrics and Intelligent Laboratory Systems 35 (1996) 175–186.
[13] S.A. Corne, S.J. Carver, W.E. Kunin, J.J. Lennon, W.W.S. van Hees, Predicting forest attributes in southeast Alaska using artificial neural networks, Forest Science 50 (2) (2004) 259–276.
[14] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems 2 (1989) 303–314.
[15] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics (1936) 179–188.
[16] N. Freed, F. Glover, Simple but powerful goal programming models for discriminant problems, European Journal of Operational Research 7 (1981) 44–60.
796
C.-Y. Tsai / Information Sciences 176 (2006) 772–798
[17] N. Freed, F. Glover, A linear programming approach to the discriminant problem, Decision Sciences 12 (1981) 68–74. [18] N. Freed, F. Glover, Linear programming and statistical discrimination—the LP side, Decision Sciences 13 (1981) 172–175. [19] N. Freed, F. Glover, Evaluating alternative linear programming models to solve the twogroup discriminant problem, Decision Sciences 17 (1986) 151–162. [20] N. Freed, F. Glover, Resolving certain difficulties and improving the classification power of LP discriminant analysis formulations, Decision Sciences 17 (1986) 589–595. [21] I. Galleske, J. Castellanos, Optimization of the kernel functions in a probabilistic neural network analyzing the local pattern distribution, Neural Computation 14 (5) (2002) 1183– 1194. [22] M. Garey, D. Johnson, Computers and Intractability: A Guide to the Theory of NPCompleteness, Freeman, New York, 1979. [23] S.I. Gass, S. Vinjamuri, Cycling in linear programming problems, Computers & Operations Research 31 (2004) 303–311. [24] L. Glorfeld, N. Gaither, On using linear programming in discriminant problems, Decision Sciences 13 (1982) 167–171. [25] F. Glover, S. Keene, B. Duea, A new class of models for the discriminant problem, Decision Sciences 19 (1988) 269–280. [26] F. Glover, Improved linear programming models for discriminant analysis, Decision Sciences 21 (1990) 771–785. [27] W. Gochet, A. Stam, V. Srinivasan, S. Chen, Multigroup discriminant analysis using linear programming, Operations Research 45 (2) (1997) 213–225. [28] D.S. Huang, S.D. Ma, A new radial basis probabilistic neural network model, in: The 3rd International Conference on Signal Processing Proceedings, 14–18 October1996, Beijing, China, vol. 2, pp. 1449–1452. [29] D.S. Huang, Radial basis probablistic neural networks: model and application, International Journal of Pattern Recognition and Artificial Intelligence 13 (7) (1999) 1083–1101. [30] ILOG CPLEX 6.5, ILOG, Inc., Mountain View, CA. [31] E.A. Joachimsthaler, A. Stam, Four approaches to the classification problem in discriminant analysis: an experimental study, Decision Sciences 19 (1988) 322–333. [32] E.A. Joachimsthaler, A. Stam, Mathematical programming approaches for the classification problem in two-group discriminant analysis, Multivariate Behavioral Research 25 (1990) 427– 454. [33] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, second ed., Prentice Hall, New Jersey, 1988. [34] N. Karmarkar, A new polynomial time algorithm for linear programming, Combinatorica 4 (1984) 375–395. [35] B. Kegl, A. Krzyzak, H. Niemann, Radial basis function networks in nonparametric classification and function learning, in: Proceedings of the International Conference on Pattern Recognition, International Conference on Pattern Recognition, Brisbane, Australia, 16–20 August 1998, pp. 565–570. [36] W.L.G. Koontz, K. Fukunaga, Asymptotic analysis of a nonparametric estimate of a multivariate density function, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1972) 967–974. [37] P.A. Lachenbruch, An almost unbiased method of obtaining confidence interval for the probability of misclassification in discriminant analysis, Biometrics (December) (1967) 639– 645. [38] Y. Lee, Y. Lin, G. Wahba, Multicategory support vector machines: application to the classifications of microarray data and satellite radiance data, Journal of the American Statistical Association 99 (465) (2004) 67–81.
C.-Y. Tsai / Information Sciences 176 (2006) 772–798
797
[39] O.L. Mangasarian, Mathematical programming in neural networks, ORSA Journal on Computing 5 (4) (1993), 349–360. [40] O.L. Mangasarian, W.H. Wolberg, Cancer diagnosis via linear programming, SIAM News 23 (5) (1990), 1 & 18. [41] O.L. Mangasarian, W.N. Street, W.H. Wolberg, Breast cancer diagnosis and prognosis via linear programming, Operations Research 43 (4) (1995) 570–577. [42] E.P. Markowski, C.A. Markowski, Some difficulties and improvements in applying linear programming formulations to the discriminant problem, Decision Sciences 16 (1985) 237–247. [43] T. Masters, Advanced Algorithms for Neural Networks, John Wiley & Sons, New York, 1995. [44] D. Montana, A weighted probabilistic neural network, Advances in Neural Information Processing Systems 4 (1992) 1110–1117. [45] R. Nath, W.M. Jackson, T.W. Jones, A comparison of the classical and the linear programming approaches to the classification problem in discriminant analysis, Journal of Statistics and Computer Simulation 41 (1992) 73–93. [46] R.K. Orr, Use of a probabilistic neural network to estimate the risk of mortality after cardiac surgery, Medical Decision Making 17 (2) (1997) 178–185. ¨ stermark, R. Ho¨glund, Addressing the multigroup discriminant problem using multivar[47] R. O iate statistics and mathematical programming, European Journal of Operational Research 108 (1) (1998) 224–237. [48] J. Park, I.W. Sandberg, Universal approximation using radial-basis function networks, Neural Computation 3 (2) (1991) 246–257. [49] E. Parzen, On estimation of a probability density function and mode, Annals of Mathematical Statistics 33 (1962) 1065–1076. [50] M.E. Regelbrugge, R. Calalo, Probabilistic neural network approaches for autonomous identification of structural dynamics, Journal of Intelligent Material Systems and Structures 3 (4) (1992) 572–584. [51] V.M. Rivas, J.J. Merelo, P.A. Castillo, M.G. Arenas, J.G. Castellano, Evolving RBF neural networks for time-series forecasting with EvRBF, Information Sciences 165 (2004) 207–220. [52] H. Schioler, U. Hartmann, Mapping neural network derived from the Parzen window estimator, Neural Networks 5 (1992) 903–909. [53] D. Specht, Generation of polynomial discriminant functions for pattern recognition, IEEE Transactions on Electronic Computers EC-16 (3) (1967) 308–319. [54] D. Specht, Probabilistic neural network, Neural Networks 3 (1) (1990) 109–118. [55] D. Specht, Probabilistic neural networks and the polynomial adaline as complementary techniques for classification, IEEE Transactions on Neural Network 1 (1) (1990) 111–121. [56] A.F. Sheta, K. De Jong, Time-series forecasting using GA-tuned radial basis functions, Information Sciences 133 (3–4) (2001) 221–228. [57] SPSS Release 10.1.0, SPSS, Inc., Chicago, IL. [58] STATLOG Project datasets. Available from:
. [59] W.P. Sweeney, M.T. Musavi, J. Guidi, Classification of chromosomes using a probabilistic neural network, Cytometry 16 (1) (1994) 17–24. [60] C.-Y. Tsai, An iterative feature reduction algorithm for probabilistic neural networks, Omega 28 (2000) 513–524. [61] S. Wang, A self-organizing approach to managerial nonlinear discriminant analysis: a hybrid method of linear discriminant analysis and neural networks, INFORMS Journals on Computing 8 (2) (1996) 118–124. [62] P.J. Werbos, The Roots of Backpropagation—From Ordered Derivatives to Neural Networks and Political Forecasting, John Wiley & Sons Inc., New York, 1994.
798
C.-Y. Tsai / Information Sciences 176 (2006) 772–798
[63] Wisconsin Breast Cancer Database. Available from: . [64] Z.R. Yang, S. Chen, Robust maximum likelihood training of heteroscedastic probabilistic neural networks, Neural Networks 11 (4) (1998) 739–747.