2D spiral pattern recognition with possibilistic measures

Pattern Recognition Letters 19 (1998) 141–147

Sameer Singh

School of Computing, University of Plymouth, Plymouth PL4 8AA, UK
E-mail: [email protected]

Received 8 April 1997; revised 22 October 1997

Abstract

The main task for a 2D spiral recognition algorithm is to learn to discriminate between data distributed on two distinct strands in the x–y plane. This problem is of critical importance since it incorporates temporal characteristics often found in real-time applications, i.e. the spiral coils with time. Previous work with this benchmark has seen poor results with statistical methods such as discriminant analysis, and tedious procedures are needed to obtain better results with neural networks. This paper describes a fuzzy approach which outperforms previous work in terms of both the recognition rate and the speed of recognition. The paper presents the new approach and results on the validation and test sets. The results show that it is possible to solve the spiral problem in a relatively small amount of time with the fuzzy approach (up to 100% correct classification on the validation and test sets; 77.2% correct classification with cross-validation using the leave-one-out method). © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Possibilistic reasoning; Spiral data; Pattern recognition; Fuzzy membership

1. Introduction

The 2D spiral data set was proposed by Alexis Wieland of MITRE Corporation and now forms one of the important benchmarks at the Carnegie Mellon repository (the spiral data are available at ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/project/connect/bench/). This pattern recognition problem is interesting for several reasons: (i) the problem is almost impossible to solve using a linear method such as discriminant analysis; (ii) backpropagation networks and their relatives encounter considerable problems with error reduction on this data set, with long training times (Touretzky and Pomerleau, 1989); (iii) it serves as an important neural network benchmark since it is possible to display the 2D receptive fields of any unit in the neural network; (iv) the data have some temporal characteristics, i.e. the radius and angle of the spiral vary with time; and (v) several real-time applications involving similar data, for example spiral feed data in manufacturing, are in need of better classifiers. In addition, the success of classifiers in solving the 2D spiral problem will encourage work on solving the same problem in n dimensions. In this paper, a new approach to solving the spiral problem is proposed.

The nature of the problem is shown in Fig. 1, which plots the spiral. The data points for the two classes C_k (k = 1, 2) spiral around each other, for a total of 97 patterns in each class (N = 194). In order to separate the two classes, a solution can be attempted using either an iterative learning approach, as with neural networks, or a statistical non-iterative method. The performance of the classifier is measured in terms of the

0167-8655/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII: S0167-8655(97)00163-3


Fig. 1. Spiral data in two dimensions.

overall recognition rate and the speed of recognition, i.e. the time taken for the pattern recognition algorithm to execute and produce results (or, equivalently, the number of epochs for neural networks). In addition, the recognition method should ideally remain reliable with noisy data.

The spiral task is difficult to solve using the backpropagation method of training neural networks. In general, large networks with complicated architectures can yield good results after long training times. Lang and Witbrock (1988) solved the spiral problem using a 2 × 5 × 5 × 5 × 1 backpropagation network with learning times of 18,900 epochs, 22,300 epochs and 19,000 epochs on three different trials. By changing the error function, these authors were able to obtain better results, with an average of 11,000 epochs over three trials. The neural network learns the patterns by using its first hidden layer to divide the input space into two regions, and uses the second layer to combine these first-layer features to produce a curved response. The network is, however, under-constrained, i.e., it does not know how to respond to points which do not lie on the spiral. Touretzky and Pomerleau (1989) report: "Given this kind of freedom, back-propagation almost never develops a perfect solution". Fahlman (1988) used the same multilayer perceptron architecture with a quickpropagation algorithm, reducing the training times to 4500 epochs, 12,300 epochs and 6800 epochs over three trials. The cascade correlation algorithm has been found to perform better on this problem, solving it in 1700 epochs when averaged over 100 trials (Fahlman and Lebiere, 1990). In addition, the spiral problem has been solved using data encoding methods with reasonable success (Chua et al., 1995; Jia and Chua, 1995).

There are some serious limitations with the neural network approach to solving the 2D spiral problem. The most obvious and crucial one is the long training time. This makes neural networks unattractive for applications that need training in real time, or need solutions that evolve as new training data are added. In addition, setting up and achieving the optimal neural network configuration can take several trials, and an optimal configuration will not necessarily work for another spiral generated with different parameters, including different density and radius values. Moreover, random starting weights over different trials can generate very different solutions at completion. Finally, completely correct classification on the training set does not ensure the same on the test set (Kosko, 1992). Hence, robust statistical approaches to the above problem are desirable. Unfortunately, established methods such as discriminant analysis are not very suitable: at best only 50% of the patterns can be correctly classified. In this context, this paper aims to explore fuzzy pattern recognition for solving the spiral problem in real time. A possibilistic approach to decision making is described in the next section. This methodology outperforms previous methods applied to the spiral data both in terms of the recognition rate and the speed.

2. Possibilistic proximity measures

The theory of possibility states that for a fuzzy set A = {a_1, ..., a_n}, one may calculate the degree to which an individual observation a_i, i = 1, ..., n, belongs to A. This may be referred to as the membership of a_i in A, written μ_A(a_i), and a well-defined quadratic function can be used for its calculation for parametric data (Mamdani and Gaines, 1981; Zadeh, 1965, 1987). The membership can also be calculated for an external datum that does not belong to A but is within the range of measurements already included in A. The maximum possible membership is 1.0, usually attained at the mean measurement when the possibility function approximates a Gaussian function. The


membership in A of a measurement outside the range of values already included in A is 0. Using this approach, Singh and Steinl (1996) have shown that, by calculating the membership of the individual feature values of a test pattern in different classes C_k, one may assign the test pattern to the class for which it has the highest membership, 1 ≤ k ≤ K. They also find that the technique works within acceptable time limits even for very large data sets.

For data that are either non-parametric or have a highly non-linear structure, a fuzzy classifier system based on proximity measures for membership computation may be used (Pal and Majumder, 1986). The data are initially divided into training and test sets. The purpose of the classifier is to allocate a class to the test data on the basis of the training data on which it has been trained, and its performance is measured by the proportion of correct test predictions that it makes. For a given test pattern X, the fuzzy classifier computes the membership of X in the different classes C_1, ..., C_j, ..., C_m, where 1 ≤ j ≤ m. The membership of X in class C_j can be expressed as μ_j(X), and the test pattern is allocated to the class for which the membership function yields the maximum value. The overall process may be explained mathematically as follows: consider an unknown pattern X represented by a point in a multi-dimensional space Ω consisting of m pattern classes C_1, ..., C_m. Let R_1, ..., R_j, ..., R_m be the reference vectors, where R_j, associated with C_j, contains h_j prototypes such that R_j^(l) ∈ R_j, l = 1, 2, ..., h_j,

μ_j^(l)(X) = { 1 + ( d(X, R_j^(l)) / F_d )^{F_e} }^{-1.0},    (1)

where μ_j^(l)(X) is the membership of X in class C_j as determined through the class sample l, and d(X, R_j^(l)) is the distance between X and R_j^(l). In Eq. (1), F_e and F_d are positive constants that determine the degree of fuzziness in the membership space. The main purpose of using the membership function is to map an n-dimensional feature space into an m-dimensional membership space, which is a unit hypercube, satisfying the following conditions:

μ_j^(l)(X) → 1 as d(X, R_j^(l)) → 0,    (2)

μ_j^(l)(X) → 0 as d(X, R_j^(l)) → ∞,    (3)

μ_j^(l)(X) increases as d(X, R_j^(l)) decreases.    (4)
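Eq. (1) can be illustrated with a short sketch; the function name and the use of Euclidean distance are assumptions for illustration, not the paper's code:

```python
import math

def membership(x, prototype, f_d=1.0, f_e=1.0):
    """Proximity-based membership of Eq. (1):
    mu = [1 + (d(X, R)/F_d)**F_e]**(-1.0).
    f_d and f_e are the positive fuzziness constants F_d and F_e."""
    d = math.dist(x, prototype)  # Euclidean distance d(X, R_j^(l))
    return 1.0 / (1.0 + (d / f_d) ** f_e)
```

By construction the value tends to 1 as the distance tends to 0 and to 0 as the distance grows, matching conditions (2)–(4).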

The test pattern X is allocated to class i if

μ_i(X) ≥ μ_j(X) for i ≠ j, where i, j = 1, ..., m, and

μ_j(X) = max_{l = 1, ..., h_j} ( μ_j^(l)(X) ).    (5)
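The allocation rule of Eq. (5) takes, for each class, the best membership over that class's prototypes and chooses the class with the highest score. A minimal sketch (the dictionary layout and Euclidean distance are illustrative assumptions):

```python
import math

def classify(x, prototypes_by_class, f_d=1.0, f_e=1.0):
    """Allocate x to the class j maximising
    mu_j(X) = max over l of mu_j^(l)(X), per Eqs. (1) and (5)."""
    def mu(a, r):
        d = math.dist(a, r)
        return 1.0 / (1.0 + (d / f_d) ** f_e)
    # Best prototype membership per class, then the winning class.
    scores = {c: max(mu(x, r) for r in protos)
              for c, protos in prototypes_by_class.items()}
    return max(scores, key=scores.get)
```

In effect this is a fuzzy nearest-prototype rule: the class of the closest prototype wins, since membership decreases monotonically with distance.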

In this manner, the class of X is the class for which it has the highest membership value. This approach is, in a way, based on finding the nearest neighbour of the test pattern in the training data: the class of the nearest neighbour is assigned as the class of the test pattern. The method of class assignment is, however, considerably different from the well-known k-nearest-neighbour method. The established k-NN method "...involves finding a hypersphere around a point X which contains K points (independent of their class), and then assigning X to the class having the largest number of representatives inside the hypersphere" (Bishop, 1995, p. 57). This method achieves only limited success with the spiral data, as shown in Fig. 1: generating a spherical boundary around a test pattern (black square) encloses an equal number of class C1 patterns (white circles) and C2 patterns (black circles). Previous results with this approach show between 50% and 55% correct classification of spiral data. In the proposed method, however, only one nearest neighbour is found from the training set, so the quality of the classification depends on the sensitivity of the method for detecting the true nearest neighbour. The fuzzy approach to classification is generic to any test domain, but the possibility calculation methods may need modification in different domains, e.g., by manipulating the fuzzy parameters F_d and F_e in Eq. (1). One of the important advantages of the fuzzy pattern recognition method is that possibilistic algorithms can perform decision making with equal or weighted influence of input features, and with rejection thresholds under which decisions based on low possibility counts are discarded (Pal and Majumder, 1986; Bishop, 1995, p. 28). Similar to the k-NN method, the Gaussian approach has serious limitations when solving the 2D spiral problem. The main difficulty lies with the


procedure used for calculating class memberships. The spiral data coil symmetrically around the origin, and for both classes the x and y measurement means are 0. Possibility calculations using the traditional approach would therefore be less sensitive to the temporal nature of the helix, and a number of test patterns may have equal memberships in different classes. In order to solve the spiral problem, a slightly modified version of the fuzzy membership method using Eq. (1) is used for our purposes. The revised procedure for membership computation is explained below:
1. Label the data as training set, validation set V and test set T. The benchmark consists of N = 194 patterns in each of these sets.
2. Separate class C1 and class C2 training data into two training files F1 and F2.
3. For every pattern p_v = (x_v, y_v) in the validation set V, perform the following steps.
4. Find the upper and lower bounds of x_v for class C1 from F1. These may be represented as x_v(lb_1) and x_v(ub_1), where lb_1 and ub_1 are the positions at which the lower and upper bounds are found in the training data array for class C1. If x_v > x for all x ∈ F1, then x_v has only a lower bound. Similarly, if x_v < x for all x ∈ F1, then x_v has only an upper bound. If C1 has a total of h samples, the upper bound of x_v in C1 is x_i such that x_i − x_v < x_k − x_v for all x_i ≥ x_v and x_k ≥ x_v, x_i ∈ C1, x_k ∈ C1, 1 ≤ i ≤ h, 1 ≤ k ≤ h, i ≠ k. A lower bound can be found similarly.
5. For F1, calculate class memberships for the following cases:
Case 1: p_v already exists in F1 as a class C1 pattern: μ_1(x_v) = 1.0; μ_1(y_v) = 1.0.
Case 2: x_v exists in F1 at position i, 1 ≤ i ≤ N, but y_v does not occur at the same position: μ_1(x_v) = 1.0; μ_1(y_v) = 1.0/(1.0 + η_1), where η_1 = |y_v − y_i|; as η_1 → 0, μ_1(y_v) → 1.0.
Case 3: y_v exists in F1 at position j, but x_v does not occur in the same combination: μ_1(y_v) = 1.0; μ_1(x_v) = 1.0/(1.0 + η_2), where η_2 = |x_v − x_j|; as η_2 → 0, μ_1(x_v) → 1.0.

Case 4: Neither x_v nor y_v occurs in F1: μ_1(x_v) = 1.0/(1.0 + η_3) if η_3 ≤ η_4, else μ_1(x_v) = 1.0/(1.0 + η_4), where η_3 = |x_v − x_v(lb_1)| and η_4 = |x_v(ub_1) − x_v|; likewise μ_1(y_v) = 1.0/(1.0 + η_5) if η_5 ≤ η_6, else μ_1(y_v) = 1.0/(1.0 + η_6), where η_5 = |y_v − y_v(lb_1)| and η_6 = |y_v(ub_1) − y_v|.
6. Perform steps 4 and 5 on F2 to calculate μ_2(x_v) and μ_2(y_v).
7. Derive an optimal function ξ such that: if ξ(μ_1(x_v), μ_1(y_v)) > ξ(μ_2(x_v), μ_2(y_v)), then p_v ∈ C1, else p_v ∈ C2. The results shown later have used the multiplication, exponential multiplication and min–max functions for possibility combination.
8. Determine the recognition rate from the correctly classified patterns.
9. Test the approach on the test set T by following steps 2 to 8.
In the above discussion, the membership μ in case 2 is inversely proportional to the distance between the target and the actual y value at the same position. In case 4, when both the x and y test values are absent from the training set, the membership is inversely proportional to the distance between the test value and either its upper or its lower bound, whichever is nearer in the training file. It may be observed that in the described method F_e = 1 and F_d = 1. In the above algorithm, if x_v or y_v is found in F1 at more than one position, the membership function is calculated at each of these positions and the highest value is chosen. In rare cases x_v or y_v may also have two or more upper and lower bounds at different positions in F1. As before, the membership function is computed for each of these cases and the highest value is taken as μ_1(x_v) or μ_1(y_v), as the case may be.
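The per-coordinate part of steps 4 and 5 can be sketched as follows. This is an illustrative simplification (the helper `coord_membership` is hypothetical; cases 2–3, which pair x and y values found at the same training position, are omitted here):

```python
def coord_membership(v, train_vals):
    """Membership of a single coordinate value v in one class:
    1.0 on an exact match, otherwise 1/(1 + eta), where eta is the
    distance to the nearer of the lower/upper bounding training
    values (case 4 of the text)."""
    if v in train_vals:
        return 1.0                             # exact match
    lower = [t for t in train_vals if t < v]   # lower-bound candidates
    upper = [t for t in train_vals if t > v]   # upper-bound candidates
    gaps = []
    if lower:
        gaps.append(v - max(lower))            # eta to the lower bound
    if upper:
        gaps.append(min(upper) - v)            # eta to the upper bound
    return 1.0 / (1.0 + min(gaps))
```

Note the F_e = F_d = 1 simplification of Eq. (1) is built in: the membership is simply the reciprocal of one plus the gap.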

3. Experimental details

The spiral data consist of a training set, a validation set and a test set. Each set consists of 194 patterns, 97 samples in each of the two classes


(spiral density = 1, max. radius = 6.5). Since the spiral problem is artificial, it is possible to generate spirals of varying complexity by manipulating the spiral parameters. The validation set is particularly important for a neural network, so that candidate architectures can be evaluated on this data before the final performance is measured on the test set; in our context, it serves as a supplementary test set. The validation and test sets are generated by keeping either the x or the y measurements fixed as found in the training set and offsetting the other variable by a set amount δ, i.e. if the training set is (x, y), then the validation set is (x, y + δ) and the test set is (x + δ, y). Since the two strands of the spiral are mirror images of each other and completely symmetrical, the test patterns are difficult to classify correctly. In the fuzzy approach, validation is not strictly required as the procedure is preset; hence, both the validation and the test sets were used for testing purposes.

The performance of the system was cross-validated using the leave-one-out method (Weiss and Kulikowski, 1991). The results of such validation in spiral data analysis should be considered with caution. Any cross-validation method that takes some test patterns away from the spiral training set for test purposes and trains on the remainder is not well suited to this problem: with some training data removed, the strands of the spiral have gaps and any classifier is vulnerable to wrong predictions. Bishop (1995) also notes that validation methods for temporal data are not fully developed and conventional techniques can give misleading results. Hence, in order to minimise the effect of patterns misclassified due to missing data, this paper has used the leave-one-out method, where only one test pattern at a time is tested against N − 1 training patterns (N = 194) in an iterative manner.
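For reference, the benchmark can be regenerated with the standard two-spiral generator commonly attributed to Lang and Witbrock; the constants below (angle step π/16, linearly shrinking radius schedule) are assumptions matching density = 1 and max. radius = 6.5, not taken from this paper. The offset construction of the validation and test sets follows the text:

```python
import math

def two_spirals(n=97, max_radius=6.5):
    """Two interlocking spiral strands, n points per class."""
    c1, c2 = [], []
    for i in range(n):
        angle = i * math.pi / 16.0
        radius = max_radius * (104 - i) / 104.0
        x, y = radius * math.cos(angle), radius * math.sin(angle)
        c1.append((x, y))
        c2.append((-x, -y))  # second strand mirrors the first
    return c1, c2

def offset_sets(train, delta):
    """Validation set offsets y, test set offsets x, by a fixed delta."""
    validation = [(x, y + delta) for x, y in train]
    test = [(x + delta, y) for x, y in train]
    return validation, test
```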

4. Results

The 2D spiral problem is virtually impossible to solve using linear discriminant analysis. The success rate is measured in terms of the test patterns that are correctly predicted to belong to their respective target classes, using the training data to make this prediction. This is represented in percentage as the


proportion of correctly classified patterns to the total number of patterns tested. Linear discriminant analysis results in 50% recognition; when cross-validated with the leave-one-out method, the proportion correct drops to 48%.

The fuzzy method, when testing the validation set, computes μ_1(x_v), μ_1(y_v), μ_2(x_v) and μ_2(y_v) for any validation pattern (x_v, y_v) and allocates it to class C1 if ξ(μ_1(x_v), μ_1(y_v)) > ξ(μ_2(x_v), μ_2(y_v)). Three forms of the function ξ were tested. The min–max method: if max(min(μ_1(x_v), μ_1(y_v)), min(μ_2(x_v), μ_2(y_v))) = min(μ_1(x_v), μ_1(y_v)), then p_v ∈ C1, else p_v ∈ C2. The multiplication method: if μ_1(x_v) · μ_1(y_v) > μ_2(x_v) · μ_2(y_v), then p_v ∈ C1, else p_v ∈ C2. The exponential multiplication method: if e^{μ_1(x_v)} · e^{μ_1(y_v)} > e^{μ_2(x_v)} · e^{μ_2(y_v)}, then p_v ∈ C1, else p_v ∈ C2. All these functions yield 100% correct classification on the validation and test sets; with the leave-one-out cross-validation method on the training set, however, the following results were obtained: 75.12% (min–max method), 76.68% (exponential multiplication) and 77.20% (multiplication method).

Using the multiplication method, Fig. 2 shows the difference μ_1(p_v) − μ_2(p_v) when cross-validating with the leave-one-out method. The decisions regarding the allocation of classes to test patterns are based on this difference: the higher the difference, the more confidence we have in the decision. If the difference is positive, the test pattern belongs to class C1, otherwise to C2. Fig. 2 also shows the classification success on the x axis. The aim of the classifier is to reduce the number of misclassifications. Ideally, positive and negative differences should be equally frequent, since the training set consists of an equal number of patterns in both classes.
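The three combination functions ξ described above can be written directly; the function names are illustrative, and each returns True when p_v should be allocated to C1:

```python
import math

def xi_minmax(m1x, m1y, m2x, m2y):
    """Min-max rule: C1 wins if its weaker membership is the larger."""
    return max(min(m1x, m1y), min(m2x, m2y)) == min(m1x, m1y)

def xi_mult(m1x, m1y, m2x, m2y):
    """Multiplication rule: compare products of memberships."""
    return m1x * m1y > m2x * m2y

def xi_exp_mult(m1x, m1y, m2x, m2y):
    """Exponential multiplication rule: compare products of e^mu."""
    return math.exp(m1x) * math.exp(m1y) > math.exp(m2x) * math.exp(m2y)
```

Note that since e^a · e^b = e^{a+b} and the exponential is monotone, the exponential multiplication rule is equivalent to comparing membership sums.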
Fig. 2 shows, however, that there are more positive differences than negative ones; therefore many more class C2 patterns were classified as belonging to class C1 than vice versa.

In order to further test the reliability of the fuzzy method for recognising spiral data, the benchmark was corrupted with noise. Uniform and Gaussian noise was added to the training data to produce a test set. The noise was first generated with a different seed for each trial and transformed into the [0, 1] range. The total offset was the noise array multiplied by a maximum offset limit, i.e., δ = δ_max × noise, where noise is a value within the [0, 1] range. Hence, the minimum noise that can be added


Fig. 2. Scatter plot of the difference μ_1(p_v) − μ_2(p_v). Misclassified patterns are marked at the lower end and correctly classified patterns at the top.

to a test pattern is 0 and the maximum is the maximum offset δ_max. Increasing δ_max increases the average offset in the system. The noise used was additive and non-cumulative in nature. Recognition results are shown in Table 1.

Table 1 shows that the fuzzy approach is reliable and consistent. The maximum offset limit δ_max was increased from 0.1 to 4.1. This is a significant range considering that the radius of the spiral is

only 6.5. In practice, most offsets would range between zero and the distance between two spiral strands; the sole purpose of the larger offsets was to observe the system's behaviour in such cases. The recognition rates for uniform noise are all above 93%, and for Gaussian noise all above 86%. The system also shows a very graceful degradation in performance with increasing offsets. In some cases the performance actually improves with higher noise,

Table 1
The performance of the fuzzy algorithm on spiral data contaminated with varying additive noise, in terms of the recognition rate R (%) on the test set (T) and the validation set (V)

Max. offset   Uniform noise        Gaussian noise
δ_max         T        V           T        V
0.1           99.0     100         99.0     100
0.5           99.5     100         97.9     97.9
0.9           100      100         96.4     93.3
1.3           99.5     99.5        92.8     88.1
1.7           99.5     99.5        91.7     92.8
2.1           97.9     99.5        89.2     88.6
2.5           97.4     99.5        92.8     91.7
2.9           95.9     99.5        90.2     88.6
3.3           93.3     94.8        94.3     93.8
3.7           93.8     94.8        89.2     88.5
4.1           93.3     94.3        93.3     86.1


e.g., for Gaussian noise a recognition rate of 90.2% for δ_max = 2.9 but 94.3% for δ_max = 3.3. This may be due to two different reasons: (i) it is well established that noise can, up to a certain level, improve classification (Jim et al., 1995); or (ii) since the noise distributions for different trials are generated with different seeds, the total amount of noise added to the system may actually be smaller even with a higher offset limit.
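The noise procedure described above, δ = δ_max × noise with the noise array rescaled into [0, 1], can be sketched as follows; the exact random generator and seeding scheme are assumptions, since the paper does not specify them:

```python
import random

def noisy_test_set(train, delta_max, kind="uniform", seed=0):
    """Additive, non-cumulative noise: draw a noise array with a
    per-trial seed, rescale it into [0, 1], then offset every
    coordinate by delta = delta_max * noise."""
    rng = random.Random(seed)
    if kind == "uniform":
        raw = [rng.random() for _ in range(2 * len(train))]
    else:  # Gaussian noise
        raw = [rng.gauss(0.0, 1.0) for _ in range(2 * len(train))]
    lo, hi = min(raw), max(raw)
    scaled = [(r - lo) / (hi - lo) for r in raw]  # transform into [0, 1]
    it = iter(scaled)
    return [(x + delta_max * next(it), y + delta_max * next(it))
            for x, y in train]
```

With this scheme the smallest possible offset is 0 and the largest is δ_max, as stated in the text.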

5. Conclusion

Spiral data exhibit several interesting spatial and temporal characteristics. The fuzzy algorithm classifies the data in both the validation and the test set with up to 100% accuracy. Since the approach is non-iterative in nature, the results are produced almost instantaneously (a test pattern is classified in less than 0.5 s on a Pentium Pro 200 machine). The technique is faster than conventional methods as no training is required. Although cross-validation is not the best-suited validation technique for this type of data, because the removal of data leads to discontinuities in the spiral, fuzzy recognition with the leave-one-out method yields a high success rate of 77.2%. This is much better than what can be achieved with linear discriminant analysis. The new approach is also resistant to noise, with excellent results in the presence of both uniform and Gaussian noise. The present technique can be extended to solve the problem in n dimensions (n strands of a spiral on different planes). The proposed technique is in its elementary stage and promises to be useful for small and noisy data sets which need classification in real time. Further work should be initiated on similar lines to obtain a generic classifier powerful enough to recognise data in any domain.


References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Chua, H.C., Jia, J., Chen, L., Gong, Y., 1995. Solving the two-spiral problem through input data encoding. Electronics Letters 31 (10), 813–814.
Fahlman, S.E., 1988. Faster-learning variations on back-propagation: An empirical study. In: Proc. 1988 Connectionist Models Summer School. Morgan Kaufmann, Los Altos, CA.
Fahlman, S.E., Lebiere, C., 1990. The Cascade-Correlation learning architecture. In: Touretzky, D.S. (Ed.), Advances in Neural Information Processing Systems 2. Morgan Kaufmann, Los Altos, CA.
Jia, J., Chua, H., 1995. Solving two-spiral problem through input data representation. In: Proc. IEEE Internat. Conf. Neural Networks, Vol. 1, pp. 132–135.
Jim, K., Horne, B., Giles, C.L., 1995. Effects of noise on convergence and generalisation in recurrent networks. In: Tesauro, G., Touretzky, D., Leen, T. (Eds.), Neural Information Processing Systems 7. MIT Press, Cambridge, MA, p. 649.
Kosko, B., 1992. Neural Networks and Fuzzy Systems – A Dynamical Systems Approach to Machine Intelligence. Prentice Hall, Englewood Cliffs, NJ.
Lang, K.J., Witbrock, M.J., 1988. Learning to tell two spirals apart. In: Proc. 1988 Connectionist Models Summer School. Morgan Kaufmann, Los Altos, CA.
Mamdani, E.H., Gaines, B.R., 1981. Fuzzy Reasoning and its Applications. Academic Press, New York.
Pal, S.K., Majumder, D.D., 1986. Fuzzy Mathematical Approach to Pattern Recognition. Wiley, New York.
Singh, S., Steinl, M., 1996. Fuzzy search techniques in knowledge-based systems. In: Proc. 6th Internat. Conf. on Data and Knowledge Systems for Manufacturing and Engineering (DKSME'96), Tempe, AZ.
Touretzky, D.S., Pomerleau, D.A., 1989. What's hidden in the hidden layers? Byte (August issue), 227–233.
Weiss, S.M., Kulikowski, C.A., 1991. Computer Systems that Learn. Morgan Kaufmann, San Mateo, CA.
Zadeh, L.A., 1965. Fuzzy Logic and its Applications. Academic Press, New York.
Zadeh, L.A., 1987. Fuzzy sets as a basis for a theory of possibility. In: Yager, R.R., Ovchinnikov, S., Tong, R.M., Nguyen, H.T. (Eds.), Fuzzy Sets and Applications. Wiley/Interscience, New York, pp. 193–218.