A fuzzy binary neural network for interpretable classifications


Neurocomputing 121 (2013) 401–415


Robert Meyer 1, Simon O'Keefe *

York Centre for Complex Systems Analysis, University of York, York YO10 5GH, United Kingdom

* Corresponding author. Tel.: +44 1904 325375. E-mail addresses: [email protected] (R. Meyer), [email protected] (S. O'Keefe).
1 Present address: Bernstein Center for Computational Neuroscience, TU Berlin, Marchstrasse 23, D-10587 Berlin.
http://dx.doi.org/10.1016/j.neucom.2013.05.030

Article history: Received 14 December 2012; received in revised form 25 April 2013; accepted 6 May 2013; available online 18 June 2013. Communicated by V. Palade.

Abstract

Classification is probably the most frequently encountered problem in machine learning (ML). The most successful ML techniques, like multi-layer perceptrons or support vector machines, constitute very complex systems, and the underlying reasoning processes behind a classification decision are most often incomprehensible. We propose a classification system based on a hybridization of binary correlation matrix memories and fuzzy logic that yields interpretable solutions to classification tasks. A binary correlation matrix memory is a simple single-layered network consisting of a matrix with binary weights and easy-to-understand dynamics. Fuzzy logic has proven to be a suitable framework for reasoning under uncertainty and for modelling human language concepts. The use of binary correlation matrix memories and of fuzzy logic facilitates interpretability. Two fuzzy recall algorithms carry out the classification. The first one resembles fuzzy inference, uses fuzzy operators, and can directly be translated into a fuzzy ruleset in human language. The second recall algorithm is based on a well known classification technique, namely fuzzy K-nearest neighbour classification. The proposed classifier is benchmarked on six different data sets and compared to other systems, namely a multi-layer perceptron, a support vector machine, an adaptive neuro-fuzzy inference system, and fuzzy and standard K-nearest neighbour classification. Besides its advantage of being interpretable, the proposed system shows strong performance on most of the data sets.

Keywords: neural networks; machine learning; supervised classification; fuzzy reasoning

1. Introduction

Classification is probably one of the most frequently encountered tasks in machine learning and artificial intelligence. Accordingly, a wide variety of different techniques exist, encompassing methods such as genetic algorithms, decision trees, multi-layer perceptrons (MLP) and support vector machines (SVM). However, these usually constitute very complex systems, especially the latter techniques. Following the underlying reasoning of classification decisions made by such systems is therefore a non-trivial task. For example, due to their structure of several layers and many nonlinearities, multi-layer perceptrons work more as black boxes than as transparent classification systems. Similar objections also apply to non-linear support vector machines, which project data into a very high dimensional space. On the contrary, fuzzy inference and fuzzy rule systems are based on linguistic descriptions and can be easy to interpret by the user [1,2]. Nonetheless, interpretable fuzzy rule systems often lack accuracy and perform poorly compared to the previously named approaches. This trade-off between performance and accuracy on the one side and interpretability and transparency on the other is rather fundamental to machine learning.


So-called neuro-fuzzy techniques combine both methodologies to create synergies and try to exploit the advantages of both worlds. This encompasses combinations of fuzzy logic with genetic algorithms [3–5], associative memories [6–9], multi-layer neural networks [10–13], or support vector machines to improve robustness [14]. This paper proposes a new classifier that enhances interpretability and transparency by forming a hybrid of a specific type of neural network, the binary correlation matrix memory (BCMM), and fuzzy logic. The standard learning techniques for binary correlation matrix memories are augmented and a novel training procedure is presented. This new training technique is combined with two new classification methods to place novel data items into previously learned classes. The first classification method is based on fuzzy logic operators, whereas the second implements a fast and direct fuzzy K-nearest neighbour classification.

2. System design

2.1. Binary correlation matrix memories (BCMMs)

Binary correlation matrix memories are a form of neural network that operates as an associative memory. The inputs, outputs and weights of the network take only binary values, i.e. the weights are either set to 0 or 1.



The concept dates back to the 1960s: Willshaw et al. [15] created a network representation of an optical system that was able to learn association relations. Nowadays BCMMs are most often used in combination with suitable hardware to implement very fast K-nearest neighbour classification [16–21].

In neural network terms, we take as inputs an array of binary elements. A single layer of weights connects these inputs to output nodes that calculate a weighted sum of their inputs. Each output node applies a threshold, which may be local or competitive. A convenient and practical format to visualize this single-layer network is an n by m grid of straight lines, where the left hand side of the grid represents the n input nodes from which the row lines originate. The network's m output nodes are located at the bottom, where the column lines terminate. Each intersection of a row and a column line corresponds to a connection between the related input row and output column. A circle or dot is placed at the intersection when the weight is set to 1, otherwise the weight is assumed to be 0. Fig. 1 shows a binary correlation matrix visualized as a grid. There exists an augmented version of BCMMs termed improved binary correlation matrix memory (LCMM). The weights of an LCMM can take integer values from 0 up to a maximum value L, compare [22–24].

The network operates only with binary values, so all inputs and outputs in a particular set of associations must be reduced to an array of bits, i.e. a binary pattern. Binary input patterns and their associated output patterns, which in the case of classification problems usually encode class labels, can be stored in and retrieved from (improved) binary correlation matrix memories. The procedures are termed training and recall, respectively. For details of the original training and recall procedures of the BCMM see [21] and for the LCMM see [24]. In outline, to store an association between binary patterns we set to one all weights between input and output bits which are both set to 1. In terms of the matrix representation, if we present an input and an output, we set a weight to one where the active rows and columns intersect. We can represent this as

M^{01} = y^{01} (x^{01})^T

where x^{01} and y^{01} are vectors of binary elements, T denotes the transpose of a vector, and the resulting M^{01} is a matrix of binary elements. Further association storage results in further weights set to 1, but weights never revert back to 0. The storage capacity of the network depends on the size of the inputs and outputs, and on the number of bits set to 1 in each pair of patterns.

To recall an associated pattern, we present the input x^{01}. For each row that has a 1 bit on the input, the weights in the matrix determine its contribution to the net output. Effectively, for each output we count the '1' bits in the input that overlap with the '1' bits in that column of the weight matrix.

Fig. 1. A grid visualization of a binary correlation matrix memory. The BCMM exhibits 10 input bins (denoted by X) and 10 output bins (denoted by Y). Line intersections with a black dot indicate a weight set to 1; intersections without a dot represent weights set to 0.

We can represent this as O = M^{01} x^{01}, which gives us a vector O of integers. We apply a threshold to this to recover the binary output pattern.

Before the training and recall operations can be applied to a set of given training and test data comprising vectors of continuous input values and their corresponding class labels, the input vectors need to be transformed into binary ones. The procedure is to quantize each continuous value, that is, to replace the continuous value with a set of bits, each of which represents a particular range of values of the original variable.

2.2. Quantization

The quantization method applied in this paper is equi-width binning [21]. Thus, given a set of continuous n-dimensional training data, each of the n dimensions of the data is divided into D_i intervals of equal size, with i being the index of a dimension. The interval boundaries of the outermost bins are chosen from the data: the first boundary is the minimum value encountered for the specific dimension, whereas the last boundary is chosen to be equal to the maximum value of all data items in the corresponding dimension. A data item is therefore translated into a binary vector exhibiting n ones while all other entries are zeros, i.e.

x^{01}_{i:j} = \begin{cases} 1 & \text{if } a_{ij} \le x_i < a_{i(j+1)} \\ 0 & \text{otherwise} \end{cases}   (2.1)

where x^{01}_{i:j} represents the jth bin of the ith dimension, x_i is the real value of instance x in the ith dimension, and a_{ij} is the lower bound of the jth interval. Note that the last interval includes the final upper bound. Moreover, novel test instances with values that are located outside any of these intervals are simply mapped to the outermost bin of the corresponding dimension. To form the complete binary vector the bins of all dimensions are simply concatenated. The order of dimensions is not important as long as it is consistent for all data items. To avoid confusion of real-valued vectors and binary ones, the notation x^{01} is used to indicate binary vectors. Furthermore, if input bins are considered, the double indexing x^{01}_{i:j} refers to a vector entry, not to a matrix entry: due to concatenation the vector contains several dimensions, and in order not to mistake the vector indexing with normal matrix indexing, the vector indexing of bins is augmented by a colon ":". For an overview of the most prevalent notation see Box 1.

Box 1 – Notation used throughout the paper.

x: integer or real-valued vector.
x^{01}: binary conversion of the input vector.
x^{01}_i: dimension vector, the ith dimension of the binary vector x^{01}.
x^{01}_{i:j}: single bin, the jth bin of the ith dimension of the binary vector x^{01}.
n: number of dimensions.
x̂: kernel conversion of the input vector; the above notations of the binary vector apply as well.
y^{01}: binary output pattern that is associated with a binary input pattern x^{01}; since only a single bin in y^{01} is set to 1, simply y as an integer value is often used to refer to the index of the bin set to 1.
M: integer matrix.
w_k = M(k): kth column of the integer matrix.
w_{ki}: ith dimension of the kth column.
w_{ki:j}: single entry, the jth entry of the ith dimension of the kth column.
M^{01}: binary matrix; all other notations used above for integer columns apply as well.
D_i: number of intervals and bins in the ith dimension.
s(x^{(2)}, x^{(1)}), s(w_k, x^{(1)}): (discretized) similarity measure between two patterns x^{(1)}, x^{(2)}, or between a pattern x^{(1)} and a matrix column w_k.

Training:
P: set of training patterns.
X: subset of the training patterns with a specific label, called the actual class set.
Z: subset of the training patterns whose instances have a different label than the instances in X, also called the opposite class set. Thus, P = X ∪ Z and X ∩ Z = ∅.
N = |X|: number of instances in the actual class.
N' = |Z|: number of instances with an opposite class label.
X_S: sampled set containing instances of the actual class, X_S ⊆ X.
Z_S: sampled opposite class set, Z_S ⊆ Z.
S: sample size if X_S and Z_S are randomly sampled.
#bins: number of bins in each dimension for the B/LCMM; this notation summarizes that for each of the n dimensions the values D_i are chosen to be equal.

Recall:
κ: support of the membership functions.
y_k: output of column k if an input pattern is applied via one of the two recall methods (a single value for fuzzy inference recall, a vector for FKNN recall).
μ_c(x): final membership value of instance x to class c.
R: number of columns for all classes in the matrix.
K: parameter of the K NN recall.

Furthermore, with regard to the data space, the division of each dimension into a set of non-overlapping connected intervals partitions the space into a regular grid with hyperrectangular cells. This can also be seen in Fig. 2 on the right hand side.
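As an illustration, the following Python sketch implements the equi-width binning of Eq. (2.1) and the concatenation of the per-dimension bins into one binary vector. It is our own minimal rendering of the procedure, not code from the paper, and all function and variable names are hypothetical.

    import numpy as np

    def interval_bounds(X_train):
        # First and last interval boundaries per dimension are taken from the training data.
        return X_train.min(axis=0), X_train.max(axis=0)

    def to_binary_pattern(x, lo, hi, n_bins):
        # Translate one continuous instance x into a binary vector with exactly one
        # bin set to 1 per dimension (Eq. (2.1)); values outside the training range
        # are mapped to the outermost bins.
        n = len(x)
        pattern = np.zeros(n * n_bins, dtype=np.uint8)
        for i in range(n):
            width = (hi[i] - lo[i]) / n_bins
            j = int((x[i] - lo[i]) / width) if width > 0 else 0
            j = min(max(j, 0), n_bins - 1)   # clip; the last interval includes its upper bound
            pattern[i * n_bins + j] = 1
        return pattern

    # usage sketch: lo, hi = interval_bounds(X_train); b = to_binary_pattern(x, lo, hi, 25)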


2.3. Training

Given a classification task, the output vector usually encodes the classification decision, which in most cases is a corresponding class label. Therefore, the basic idea is to store each training item into a column that belongs to the corresponding class. In a naive approach the correlation matrix memory contains only as many columns as there are classes. The K-nearest neighbour approach in the literature [19–21] realizes the opposite extreme: every training item is stored in a separate column and the output of the BCMM directly encodes specific items in the training set. Nonetheless, this requires much memory and does not really provide any compression of the training data.

In both cases a binary input pattern x^{01} is associated with a binary output pattern y^{01} that contains only one bin set to 1 and the rest set to 0. Let M denote the integer weight matrix; then the storage of an input pattern into the matrix can be formulated as

M(y) ← M(y) + x^{01}   (2.2)

where y denotes the index of the bin in the output vector y^{01} that is set to 1 and "←" means replace the left hand side with the right hand side. Thus, the binary input pattern x^{01} is simply added to a specific matrix column. Note that in case y exceeds the number of columns within matrix M, the column vector x^{01} is simply concatenated to the matrix. Since an input vector x^{01} is added to, and not or-ed with, a matrix column, the matrix M contains integer values. Nonetheless, it can easily be converted into a BCMM or an LCMM simply by replacing all values larger than 1 with 1 or all values larger than L with L, respectively. If none of these operations has been performed yet and a distinction between a BCMM and an LCMM is not necessary, the matrix is termed B/LCMM.

This paper presents a novel training algorithm that heuristically creates correlation matrix memories that reside in between the two extremes of storing all instances of the same class into a single column and storing each and every item into a separate column. Thus, the crucial part of training is to determine a suitable value of y, i.e. choosing a column to store a pattern into. In the following, the algorithm considers the classes in the training set sequentially. If there are l classes there are l training procedures resulting in the partial matrices M_1 to M_l; the final matrix is created by concatenating the partial matrices

M = M_1 ○ M_2 ○ ⋯ ○ M_l   (2.3)

with "○" the concatenation operator. But before the new training algorithm is introduced, two important aspects and prerequisites are briefly discussed.
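As a minimal sketch of the storage operation just described (our own illustration with hypothetical names, not the authors' implementation), Eq. (2.2) and the clipping that turns the integer matrix into a BCMM or LCMM could look as follows:

    import numpy as np

    def store(M, y, x01):
        # Eq. (2.2): add the binary pattern x01 to column y of the integer matrix M,
        # appending new columns if column y does not exist yet.
        if y >= M.shape[1]:
            M = np.hstack([M, np.zeros((M.shape[0], y + 1 - M.shape[1]), dtype=int)])
        M[:, y] += x01
        return M

    def to_bcmm(M):
        return np.minimum(M, 1)   # binary correlation matrix memory

    def to_lcmm(M, L):
        return np.minimum(M, L)   # weights capped at the maximum value L

    # Eq. (2.3): the final matrix is the concatenation of the per-class partial
    # matrices, e.g. M = np.hstack([M1, M2, M3]).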

Fig. 3. Example of artificially generated classification regions of two three-dimensional data items.

Fig. 2. Three data items are stored into a binary correlation matrix memory. The resulting regions from which patterns could be reconstructed are shown on the right hand side.



Fig. 4. The two different kernel shapes and the resulting discrete kernel values for the input bins. A triangular kernel is shown on the left and a parabolic kernel on the right hand side. The maximum value of each kernel is 1/n with n dimensions.

2.3.1. Storing multiple instances

Storing multiple instances into one and the same column corresponds to a lossy compression of the information provided by the training set (besides the loss of information caused by the discretization of the pattern space). Given a trained (binary) correlation matrix memory with several patterns per column, it is usually impossible to reconstruct the original set of binary input vectors stored in one specific column. The problem is that the matrix forgets the specific combination of bins set to 1 in each dimension for a specific pattern and simply merges the information of all patterns. Basically, a pattern that can be reconstructed from a matrix column can exhibit any of the possible combinations of set bins in each dimension that can be found among the training patterns stored in the column. This is visualized in Fig. 2 with a two-dimensional example. As one can see, the second column of the matrix stores only a single instance. Accordingly, the stored binary pattern can easily be reconstructed because it can only have originated from the red grid cell spanned by x_{1:4} and x_{2:5}. However, although only two patterns have been stored in the first column, there are four possible reconstructions, as depicted by the blue grid cells. Not only the original binary patterns with x_{1:2} = x_{2:2} = 1 and x_{1:5} = x_{2:4} = 1 can be reconstructed, but also x_{1:2} = x_{2:4} = 1 and x_{1:5} = x_{2:2} = 1. This behaviour occurs for all pairs of items that are stored in the same column and belong to different grid cells. More precisely, the stored patterns span another, larger n-dimensional hyperrectangle of which all corner cells contain such artificial data. Grid cells or regions that do not incorporate real but artificial data will be termed artificial classification regions. A three-dimensional example of this behaviour is given in Fig. 3. Hence, given d data items with n dimensions stored in the same column, in the worst case there are d^n − d classification regions that do not contain a single training instance. Nonetheless, usually the number of artificial regions is much smaller, since data items might belong to the same cell or the cells are not shifted against each other along each dimension. Still, the number of artificial classification regions is usually massive compared to the number of cells in which instances are actually located. This behaviour becomes problematic when the matrix generates artificial data that overlaps with training patterns of a different class, which in turn leads to misclassification. Accordingly, the proposed training algorithm tries to avoid such issues by storing suitable combinations of training patterns into a column.
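To make the notion of artificial classification regions concrete, the following small example (our own illustration, mirroring the two-pattern case of Fig. 2) enumerates the grid cells that can be reconstructed from a single column after two patterns from different cells have been stored in it:

    from itertools import product

    # Two 2-D binary patterns, given as the index of the active bin per dimension.
    pattern_a = (2, 2)   # bins x_{1:2} and x_{2:2}
    pattern_b = (5, 4)   # bins x_{1:5} and x_{2:4}

    # After storing both in one column, the column only remembers which bins were
    # ever set in each dimension, not in which combinations they occurred.
    bins_per_dim = [set(p) for p in zip(pattern_a, pattern_b)]   # [{2, 5}, {2, 4}]

    reconstructable = set(product(*bins_per_dim))
    artificial = reconstructable - {pattern_a, pattern_b}
    print(sorted(reconstructable))   # [(2, 2), (2, 4), (5, 2), (5, 4)] -> four cells
    print(sorted(artificial))        # [(2, 4), (5, 2)] -> two artificial regions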

2.3.2. Superimposing a kernel

To better capture the structure of continuous pattern spaces, Hodge et al. [20,21] superimpose integer kernels onto binary input patterns that represent either a (discretized) Manhattan or a squared Euclidean distance. A similar approach is used here, with the exception that the kernel values are not restricted to integer values.

The parabolic kernel input to a binary correlation matrix memory is computed as follows:

\hat{x}_{i:j} = \frac{1}{n}\left(1 - \frac{(\xi_i - j)^2}{D_i^2}\right)   (2.4)

where \hat{x}_{i:j} is the kernel input value of the jth input bin of the ith dimension, ξ_i denotes the position of the bin to which the ith value of the input pattern is actually mapped (i.e. the bin of the ith dimension of a binary input pattern that is set to 1), D_i is the number of bins for the particular dimension, and n represents the number of dimensions. Likewise, a triangular kernel is computed as

\hat{x}_{i:j} = \frac{1}{n}\left(1 - \frac{|\xi_i - j|}{D_i}\right).   (2.5)

The bin at position ξ_i, which represents the interval incorporating the data instance, is assigned the highest value of 1/n, and the values decrease monotonically as the positions of the bins move further away from ξ_i. The general shapes of such kernels are given in Fig. 4. Because the kernel inputs are normalized, only parts of these kernels are actually superimposed onto an input pattern; the rest is truncated at the borders of the block of bits representing that dimension. The process of superimposing a triangular kernel on an input pattern is shown in Fig. 5. Furthermore, this kernel conversion can be used to determine a (discretized) similarity measure between two patterns x^{(1)} and x^{(2)}, or between a pattern x^{(1)} and a matrix column w_k:

s(x^{(2)}, x^{(1)}) = \sum_{i=1}^{n} \max_j \left( x^{01(2)}_{i:j}\, \hat{x}^{(1)}_{i:j} \right)   (2.6)

s(w_k, x^{(1)}) = \sum_{i=1}^{n} \max_j \left( w_{k\,i:j}\, \hat{x}^{(1)}_{i:j} \right).   (2.7)

(To avoid confusion: in Eqs. (2.6) and (2.7), x^{01(2)}_{i:j} \hat{x}^{(1)}_{i:j} and w_{k\,i:j} \hat{x}^{(1)}_{i:j} denote multiplications of the corresponding values; the maximum is therefore taken over the results of the multiplications, not over the two values.)
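A possible Python rendering of the kernel conversion (Eqs. (2.4) and (2.5)) and of the column similarity of Eq. (2.7) is sketched below. It follows the formulas above under our own naming conventions, stores patterns and columns as lists of per-dimension vectors, and is not taken from the paper.

    import numpy as np

    def kernel_input(xi_positions, bins_per_dim, kind="triangular"):
        # Kernel conversion x_hat of a binary pattern; xi_positions[i] is the index
        # of the bin set to 1 in dimension i, bins_per_dim[i] is D_i.
        n = len(xi_positions)
        x_hat = []
        for xi, D in zip(xi_positions, bins_per_dim):
            j = np.arange(D)
            if kind == "parabolic":                          # Eq. (2.4)
                vals = (1.0 / n) * (1.0 - (xi - j) ** 2 / D ** 2)
            else:                                            # Eq. (2.5)
                vals = (1.0 / n) * (1.0 - np.abs(xi - j) / D)
            x_hat.append(vals)   # values stay within the block of bins of this dimension
        return x_hat

    def similarity(column, x_hat):
        # Eq. (2.7): s(w_k, x) = sum_i max_j ( w_{k i:j} * x_hat_{i:j} ),
        # where `column` is a list of per-dimension weight vectors.
        return sum(float(np.max(w_i * xh_i)) for w_i, xh_i in zip(column, x_hat))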

2.3.3. Training algorithm

The basic idea of the training algorithm is to group training patterns together that are similar according to a certain metric and to store them together in columns such that no faulty overlap of artificial regions with clusters of different classes occurs. As a consequence, the algorithm aims for a few general columns that cover a lot of the data items and several columns that contain very few and exceptional instances, which are usually located close to the borders of opposite class clusters or even overlap into such clusters. One might consider this as following human thought processes by creating general rules that explain most of the data, which are augmented by several exceptions that specify difficult cases.



Fig. 5. Technique to superimpose a kernel onto an input pattern. First, the pattern is converted into the binary format via equi-width binning. Next, a triangular kernel is superimposed.

Principally, the algorithm tries to model the borderline between clusters of different classes. The training instances are selected randomly and it is tested whether their storage produces overlap with different classes. If it does, the item is removed from the column and a different column is tried. The corresponding formula to unlearn a stored pattern is simply

M(y) ← M(y) − x^{01}.   (2.8)

Testing whether an item can be safely inserted into a column or not is based on a heuristic. The test is whether the insertion of a pattern into a column creates an artificial classification region that is located too close to patterns of a different class label. Let X be the set of instances of the class to be stored and Z the set of instances with a different class label (for multi-class problems Z contains instances of several classes). If P is the set of all training patterns then P = X ∪ Z and X ∩ Z = ∅. Let x ∈ X be the instance that was chosen to be stored into the matrix M. Next let w_k = M(k) be the column that is tested to see whether x can safely be inserted; for convenience w_k denotes the column after x has already been stored into it as in Eq. (2.2). Furthermore, let X_S be a set of S patterns with X_S ⊂ X, and let Z_S ⊂ Z be defined similarly. If S ≥ |X| or S ≥ |Z|, these are simply the original sets X and Z, otherwise S patterns are randomly sampled from X and Z (for a discussion of the sampling procedure see below). The test is whether the insertion of x causes a conflict with patterns of a different class. The storage of x in w_k is rejected if

∃ z ∈ Z_S : s_{CR}(w_k, z, x) ≥ s_{crit}(z)   (2.9)

with

s_{CR}(w_k, z, x) = \sum_{i=1}^{n} \max_j \left( w_{k\,i:j}\, \hat{z}_{i:j} \right)   (2.10)
                  − \min_i \left( \max_j \left( w_{k\,i:j}\, \hat{z}_{i:j} \right) − \max_j \left( x^{01}_{i:j}\, \hat{z}_{i:j} \right) \right).   (2.11)

s_{CR} measures the similarity of the closest classification region caused by storing x into w_k to the kernel conversion ẑ of an opposite class instance z ∈ Z_S. The second term, min_i ( max_j ( w_{k i:j} ẑ_{i:j} ) − max_j ( x^{01}_{i:j} ẑ_{i:j} ) ), ensures that only the closest classification region that is actually caused by the storage of the new actual item x is considered.

The critical similarity or threshold s_{crit}(z) is computed as

s_{crit}(z) = \frac{ \max_{x \in X_S} \left( \sum_{i=1}^{n} \max_j x^{01}_{i:j}\, \hat{z}_{i:j} \right) + 1 }{ 2 }.   (2.12)

The value of s_{crit} represents a border located halfway between the closest item of X_S and an item z ∈ Z_S. If the insertion is rejected, the instance is unlearned (Eq. (2.8)), the next column is tried, and the procedure is repeated. If the actual item does not fit into any column, it is stored into a new, empty one. This whole procedure is repeated for every instance of the actual class set X.

This mechanism results in protected areas within the data space in which no artificial classification regions are allowed to exist. As seen in Fig. 6, this usually prevents the artificial regions from overlapping with data of different classes. For instance, the data items annotated with (a) and (b) are not allowed by this method to be stored in one and the same column: they would generate a pathological classification region located inside one of the protected areas. Nonetheless, this technique constitutes only a heuristic. As also shown in Fig. 6, if the items of the other classes are not densely clustered, there might exist open holes inside a data cluster which do allow artificial regions to be placed in them. Still, this heuristic usually works very well in practice, as shown below for several benchmark tests.

The critical similarity s_{crit} between an item of the opposite class and an artificial region corresponds to half the shortest distance to one of the instances of the actual class. Using only half of the distance ensures that the overlap between artificial regions is also avoided: a separating border is created that lies exactly halfway between the closest instances of the actual and opposite classes. Hence, artificial classification regions are only placed on the side of the border that belongs to their class. One can see that if another class is trained, the critical similarity between this next class and all instances of the other classes must be one of the already computed critical similarities or an even larger one. Thus, no overlap can occur.

The algorithmic complexity of the training algorithm for the training of a single class is briefly discussed now. Given X_S = X and Z_S = Z, the computation of all critical similarities already requires O(NN'), with N = |X| the number of training items of the actual class and N' = |Z| the number of instances of all the other classes. Additionally, testing where an instance can be inserted costs O(rNN'), with r denoting the number of columns created by the algorithm.



In the worst case r = N also holds, and each and every instance is stored in a separate column. Usually, the amount of time needed is less, because if a data item can be inserted into a specific column, the remaining ones do not need to be tested any longer. Hence, only a fraction of r is considered. Moreover, another heuristic can help to quickly identify a suitable column for an instance. The algorithm keeps track of the number of instances already stored in each column. Accordingly, the columns are tried in decreasing order of the number of stored items. Thus, the most general columns, which already cover a lot of data, are tried first. Because they are most general, the location of the stored data is most often far away from opposite classes, and the artificial classification regions that might be produced by an insertion are unlikely to overlap pathologically with data of other classes. Hence, it is more likely that a new instance can be stored in these general columns than in an exceptional column that includes items very close to other class clusters. If a suitable data structure is applied which sorts columns in the manner of a linear bucketsort (for a potential instantiation of such a bucketsort queue see the appendix of [25]), the ordering of the columns can be done in O(N) and does not increase the overall complexity of the algorithm.

There is another way to reduce the complexity of the algorithm. Instead of considering all data items for measuring the critical similarities and testing the columns, only a random sample of fixed size S is drawn. Thus, the measuring of similarities becomes O(S^2) and the testing reduces to O(rNS). Nevertheless, this introduces the possibility of more overlap between data of different classes and artificial classification regions. Since only a sample of the instances of other classes (|Z_S| = S) is considered to test a new insertion into a specific column, an instance of a different class that is too close to an artificial region might unfortunately not get selected. Accordingly, the faulty overlap remains undetected.

Nonetheless, the phenomenon of false insertions is somewhat counterbalanced by also sampling the data used for computing the critical similarity (|X_S| = S). Accordingly, the real closest item might not have been selected in the sample. Consequently, the highest similarity of an instance of a different class to one of the items of the actual class is overestimated. As a result, the protected areas increase and it becomes harder to insert the actual instance into a column. In practice this counterbalance works quite well, as the empirical data suggests. The pseudocode of the training method can be found in the appendix. Moreover, the training method is sketched as a flowchart in Fig. 7. Finally, the algorithm requires three parameters: the number of bins per dimension #bins, the sample size S, and the type of kernel function. Note that the number of bins could be chosen independently for each dimension, but for simplicity throughout the paper we use the same number of bins per dimension, i.e. D_i = D_j = #bins for i, j = 1, …, n.

Fig. 6. Visualization of the heuristic which tries to ensure that instances of the same class cannot be stored in the same column if they create artificial classification regions overlapping with data of different classes. Here, for demonstration purposes, an infinite cell grid is assumed and the kernel type is triangular. No artificial region is allowed closer than half the distance between an item of a different class and the closest item of the actual class. The resulting protected areas where no artificial regions are allowed are sketched by the dotted lines. For example, the instances annotated with (a) and (b) cannot be trained into the same B/LCMM column. Nevertheless, this only constitutes a heuristic, because if the opposite class cluster is not densely packed, there can exist holes where artificial regions are allowed. This failure of the heuristic is shown in the middle of the image (Uncovered Space).
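The column-insertion test of Eqs. (2.9)–(2.12) can be sketched as follows. This is a simplified, unoptimized reading of the training flowchart under our own naming, reusing the per-dimension layout of the earlier sketches; it is not the authors' implementation.

    import numpy as np

    def s_cr(column, x01, z_hat):
        # Eqs. (2.10)-(2.11): similarity of the closest artificial region created by
        # inserting x into `column` (already containing x) to the opposite-class kernel z_hat.
        col_terms = [float(np.max(w_i * zh_i)) for w_i, zh_i in zip(column, z_hat)]
        x_terms = [float(np.max(xb_i * zh_i)) for xb_i, zh_i in zip(x01, z_hat)]
        return sum(col_terms) - min(c - xt for c, xt in zip(col_terms, x_terms))

    def s_crit(X_S_binary, z_hat):
        # Eq. (2.12): halfway between z and the closest sampled actual-class instance.
        best = max(sum(float(np.max(xb_i * zh_i)) for xb_i, zh_i in zip(x01, z_hat))
                   for x01 in X_S_binary)
        return (best + 1.0) / 2.0

    def insertion_rejected(column, x01, Z_S_hat, X_S_binary):
        # Eq. (2.9): reject if any sampled opposite-class instance lies too close.
        return any(s_cr(column, x01, z_hat) >= s_crit(X_S_binary, z_hat)
                   for z_hat in Z_S_hat)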

2.4. Recall

Two new recall methods are proposed which actually carry out a classification decision for new, previously unknown test instances. The pseudocode of both algorithms is provided in the appendix. The techniques are independent of the previously introduced training method. Thus, the user can apply the different recall schemes without retraining the network. The first method, based on fuzzy logic, uses a BCMM; the second, which implements a fuzzy K-nearest neighbour classifier, is based on LCMMs. Accordingly, networks trained by the previously discussed learning algorithm have to be modified simply by reducing too large weights to 1 or L, respectively. For more information about fuzzy logic and fuzzy K-nearest neighbour classification see [1,26].

Fig. 7. Flow chart of the training algorithm. The computation of the critical similarities s_crit is done in the box marked with "*". This procedure needs a further loop, which considers the set X_S and which is not shown.

2.4.1. Fuzzy inference recall

As previously discussed in Section 2.3.2, superimposing kernel functions enables the computation of similarity measures. These measures resemble fuzzy membership functions, which are used for this first recall method. Basically, the fuzzy inference recall technique uses smaller kernels than the training algorithm. To ensure interpretability, the kernels implement symmetric membership functions and the decay from 1 to 0 is equal along every dimension. The computation of the fuzzy inference kernels works slightly differently compared to the previously introduced methods in Eqs. (2.4) and (2.5). There, the kernels span an entire dimension and their peak value is 1/n, with n denoting the number of dimensions. In contrast, shorter kernels with peaks at 1 are used for the fuzzy inference recall. The support κ of such a fuzzy kernel has to be determined by the user. Accordingly, the user can choose how fuzzily the problem domain should be interpreted and how fuzzy the borders between different classes become. The corresponding parabolic and triangular kernels are computed as

\hat{x}_{i:j} = \max\left( 1 - \frac{(\alpha_{i:\xi_i} - \alpha_{i:j})^2}{\kappa^2/4},\; 0 \right)   (2.13)

and

\hat{x}_{i:j} = \max\left( 1 - \frac{|\alpha_{i:\xi_i} - \alpha_{i:j}|}{\kappa/2},\; 0 \right)   (2.14)

with \alpha_{i:j} denoting the middle value of the interval that is represented by the jth bin of the ith dimension and \alpha_{i:\xi_i} being the middle value of the interval where the bin of x^{01}_i is actually set to 1. For example, if the border values of a bin x^{01}_{i:j} are 1 and 2 (i.e. a_{ij} = 1, a_{i(j+1)} = 2), then \alpha_{i:j} = 1.5.

A BCMM in combination with such fuzzy kernels can compute a fuzzy inference if the recall is modified and fuzzy operators are applied. The output of column k is computed as follows:

y_k = T_{i=1}^{n} \left( S_{j=1}^{D_i} \left( w^{01}_{k\,i:j}\, \hat{x}_{i:j} \right) \right)   (2.15)

where T and S are a t-norm and a t-conorm, respectively. Two widely used t-norms are the min operation (μ_1 ∧ μ_2) and the algebraic product (μ_1 · μ_2):

μ_1 ∧ μ_2 = min(μ_1, μ_2)   (2.16)

μ_1 · μ_2 = μ_1 μ_2.   (2.17)

Accordingly, the corresponding t-conorms are the max operation (μ_1 ∨ μ_2) and the algebraic sum (μ_1 +̇ μ_2):

μ_1 ∨ μ_2 = max(μ_1, μ_2)   (2.18)

μ_1 +̇ μ_2 = μ_1 + μ_2 − μ_1 μ_2.   (2.19)

For further t-norms and t-conorms and their properties with respect to fuzzy operations see [27].

Each weight that is stored into the matrix is interpreted as a fuzzy set with a fuzzy membership function defined by the kernel with support of size κ. Accordingly, the matrix becomes a fuzzy rule set and each rule contains fuzzy sets combined in conjunctive normal form. First, the composition between the fuzzy sets and the crisp input value is implicitly performed by multiplying the binary weights with the corresponding kernel values. Subsequently, a fuzzy or-operator is applied among the fuzzy sets of a specific dimension and the resulting sets of each dimension are combined via a fuzzy and-operation. Moreover, since the training algorithm allows several columns per class, the BCMM works as a fuzzy inference engine with several rules per class. Accordingly, the final classification decision is computed by a further or-ing of the individual rule outputs

μ_c(x) = S_{k=1}^{r}\, y_k   (2.20)

where μ_c(x) determines the final membership value of instance x to class c, given the outputs y_k of the r different columns that represent the rules for class c. If definite classification decisions are required, the instance is assigned to the class with the highest membership value. Thus, the recall complexity for a single new instance is O(R), with R denoting the total number of rules of all classes. Nonetheless, one has to bear in mind that a binary correlation matrix discretizes the pattern space and leads to a grid partitioning. Consequently, the fuzzy membership functions do not exhibit a smooth decay from 1 to 0, but rather constitute step functions.
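As an illustration of the fuzzy inference recall of Eqs. (2.15) and (2.20), the following sketch (our own code with hypothetical names, shown for the min/max norm pair only) computes the class memberships for one input:

    import numpy as np

    def fuzzy_inference_recall(columns_by_class, x_hat):
        # columns_by_class: {class_label: [column, ...]}, each column a list of
        # per-dimension binary weight vectors; x_hat: fuzzy kernel conversion of the
        # input (Eqs. (2.13)/(2.14)) as a list of per-dimension vectors.
        memberships = {}
        for label, columns in columns_by_class.items():
            rule_outputs = []
            for col in columns:
                # Eq. (2.15) with S = max over the bins of a dimension and T = min over dimensions
                per_dim = [float(np.max(w_i * xh_i)) for w_i, xh_i in zip(col, x_hat)]
                rule_outputs.append(min(per_dim))
            # Eq. (2.20): or-ing (max) over the rules, i.e. the columns, of the class
            memberships[label] = max(rule_outputs)
        return memberships   # assign the instance to the class with the largest membership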



When the system is translated into human language, and if desired, these step functions can again be smoothed and replaced by the continuous kernel functions. Such a translation can be performed by simply interpreting every weight as a fuzzy set with a corresponding fuzzy membership function and assigning a human readable label to it. Next, every column is rewritten as a fuzzy rule in conjunctive normal form in human language. The ruleset might be quite large, especially in terms of the number of fuzzy sets. Accordingly, merging and compressing algorithms should be applied.

2.4.2. Fuzzy K-nearest neighbour recall

It was shown that the kernel methods (see Eqs. (2.4) and (2.5)) can be used to compute similarities that reflect distances. Basically, the new fuzzy K NN recall considers the K closest artificial classification regions within the entire network to classify a new instance. The fuzzy K NN recall method can be applied to LCMMs where the maximum weight threshold L is equal to the parameter K (L should equal K because a larger value of L than K is useless, whereas a smaller value might disturb the classification). The usage of an LCMM ensures more stability because it shows how trustworthy an interval of a specific classification region is. Accordingly, the fuzzy K NN classifier takes into account that intervals containing higher integer weights represent the values of several training instances and not a single one. Thus, if the closest classification region to a new instance exhibits integer weights of value K in each dimension, it also manifests the K closest regions that are taken into account for fuzzy K NN classification. Consequently, no other, less close classification region needs to be considered.

To formalize this mathematically, the easiest way is to interpret a dimension of an LCMM column in combination with an input kernel instance x̂ as a multiset of kernel values. The integer weight of the matrix is not multiplied by a kernel value but rather indicates how often this kernel value is accounted for in the multiset. Let "⊗" be a new operator that takes an integer vector on the left hand side and a real-valued kernel vector of the same length on the right, and returns a vector containing multiple values of the kernel vector. Within this new vector each element occurs as often as indicated by the corresponding integer value of the first vector at the same position. For instance

(0, 1, 2, 1, 0, 2, 0)^T ⊗ (0.375, 0.125, 0.25, 0.375, 0.5, 0.375, 0.25)^T = (0.375, 0.375, 0.375, 0.25, 0.25, 0.125)^T.   (2.21)

Note that the three 0.375 values are caused by two different integer weights, the 1 at the fourth position and the 2 at the sixth position. Next, let "K max" be a function that takes a vector and returns a vector of length K containing the K best values of the input vector in decreasing order. For example, for K = 4,

K max( (0.375, 0.375, 0.375, 0.25, 0.25, 0.125)^T ) = (0.375, 0.375, 0.375, 0.25)^T.   (2.22)

Accordingly, these operations can be used to compute the K closest artificial classification regions in each column; for the kth column

y_k = \sum_{i=1}^{n} K\max( w_{ki} \otimes \hat{x}_i ).   (2.23)

y_k in turn is a vector of length K containing the similarities of the K closest artificial regions to the input pattern in the corresponding matrix column. Subsequently, the vectors y_k of length K of all columns of all classes are concatenated to form one very large vector, or pool of similarities, denoted by \mathring{y} = y_1^T ○ y_2^T ○ ⋯ ○ y_R^T, where R is the number of columns in the whole matrix M. Again, from this pool the K best values are chosen and denoted by \dot{y}:

\dot{y} = K\max( \mathring{y} ).   (2.24)

The class labels of the columns from which these K best values originate need to be remembered in the following step. Next, these values are used to compute the final classification output

μ_c(x) = \frac{ \sum_{j=1,\, \dot{y}_j \to c}^{K} \dot{y}_j }{ \sum_{j=1}^{K} \dot{y}_j }.   (2.25)

Note that \dot{y}_j denotes the jth entry of vector \dot{y}. The expression \dot{y}_j → c simply states that the jth value in the K-best vector \dot{y} needs to be associated with class c, i.e. it needs to have originated from an LCMM column that belongs to this specific class. μ_c(x) denotes the membership value of instance x to class c. If a definite classification is needed, the maximum membership over all classes determines the class of x. Although this K NN classification usually relies on artificial classification regions, it works very well in practice, as will be shown in the next section.

Compared to the fuzzy inference recall, this method also introduces a new parameter: for the fuzzy inference method the membership support κ is user chosen, whereas for the fuzzy K NN recall technique the user needs to specify the value of K. Moreover, due to the normalization of the output μ_c by \sum_{j=1}^{K} \dot{y}_j, every point in the data space can be assigned a class label, and there are no unknown items. In contrast, due to the smaller kernels and the logical operators, only instances close to real data or artificial classification regions are classified by the fuzzy inference method: assuming a feasibly small value of κ, a membership of 0 to all classes is assigned to new instances far away from the training data. Furthermore, for the fuzzy K NN technique the complexity of recalling a single item is larger than for the fuzzy inference recall: O(KR). In comparison to the prior recall method, the fuzzy K NN recall can be considered the more robust technique. The fuzzy inference recall is (almost) only based on the closest classification region, whereas fuzzy K-nearest neighbour takes the K closest regions into account. Accordingly, outliers and misleading or overlapping instances have less influence on the classification.
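The fuzzy K NN recall of Eqs. (2.23)–(2.25) can be sketched in the same style (again our own illustrative code with hypothetical names, assuming each LCMM column is a list of per-dimension integer weight vectors):

    import numpy as np

    def column_top_k(column, x_hat, K):
        # Eq. (2.23): vector y_k of the K best similarities within one column.
        per_dim_topk = []
        for w_i, xh_i in zip(column, x_hat):
            expanded = np.repeat(xh_i, w_i)                       # the multiset operator of Eq. (2.21)
            top = np.sort(expanded)[::-1][:K]                     # K max of Eq. (2.22)
            per_dim_topk.append(np.pad(top, (0, K - len(top))))   # pad if fewer than K values exist
        return np.sum(per_dim_topk, axis=0)

    def fuzzy_knn_recall(columns_with_labels, x_hat, K):
        # Eqs. (2.24)-(2.25): pool the K best similarities over all columns and classes.
        pool = [(s, label) for col, label in columns_with_labels
                for s in column_top_k(col, x_hat, K)]
        best = sorted(pool, key=lambda t: t[0], reverse=True)[:K]
        total = sum(s for s, _ in best)
        return {c: sum(s for s, l in best if l == c) / total
                for c in {label for _, label in best}}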

3. Performance evaluation

3.1. Methods

The proposed system has been applied to six classification tasks from the UCI machine learning library [28] (for a short overview of the different sets see Table 1). The classifier is compared to five other machine learning techniques, including multi-layer perceptrons (MLP), support vector machines (SVM), fuzzy and standard K-nearest neighbour classification (FK NN and K NN), and an adaptive neuro-fuzzy inference system (ANFIS).



Table 1. Overview of the different data sets.

Name                  Classes   Instances (per class)            Dimensions   Misc
Iris                  3         150 (50/50/50)                   4            Biological data, flower classification, slight overlap between two classes
Wine                  3         178 (59/71/48)                   13           Biological/chemical data, classification of wines from different regions in Italy, very high dimensionality
Glass                 6         214 (76/70/29/17/13/9)           9            Industrial data, classification of different glass types, strongly overlapping classes, uneven distribution of instances per class
Liver                 2         345 (200/145)                    6            Medical data, diagnosis of liver disorder, heavily overlapping classes
Pima diabetes         2         768 (500/268)                    8            Medical data, diagnosis of diabetes, heavily overlapping classes
Image segmentation    7         2310 (330 per class)             16           Image recognition, large data set, some class overlap, high dimensionality

All B/LCMMs were tested with a parabolic and a triangular kernel. The fuzzy inference was implemented with the min and product t-norms and the corresponding max and algebraic sum t-conorms (cf. Section 2.4.1). The number of bins #bins (for simplicity the same number of bins is used for each dimension, hence #bins is used instead of D_i), the value of S of the training algorithm, the value of κ for fuzzy inference recall, and the value of K for FK NN recall have been optimized via an exhaustive search and a full factorial design. The parameter values were chosen as follows: #bins ∈ {25, 50, 75, 100, 150, 200, 250, 300, 400, 500}, S ∈ {25, 50, 75, …, 200}, κ ∈ {0.2, 0.4, 0.6, …, 2.0}, and K ∈ {1, 3, 5, …, 19}. The exhaustive search was performed for the first four data sets (iris, wine, glass, and liver) with 10 trials of 10-fold cross-validation (100 runs in total). The best settings were used for a larger run of 100 trials of 10-fold cross-validation (1000 runs in total), and these values are compared to the benchmark classifiers. The classification was chosen according to the maximum class membership; ties were counted as errors. For the Pima and image sets, configurations were chosen according to the settings found for the first four data sets, especially based on the findings for the liver and the wine data sets: the Pima diabetes set constitutes a medical diagnosis task like the liver set, and the image segmentation set is high dimensional like the wine data set. (In order to make the task less easy, three dimensions of the image segmentation data set containing information on the location of a pixel have been removed, because these highly correlated with the classes.) Due to the size of these sets, only 20 trials of 10-fold cross-validation have been performed for benchmarking.

In order to allow a fair comparison with the other classifiers, the benchmark systems have been optimized with exhaustive search as well. The optimization was based on 10 trials of 10-fold cross-validation for the first four training sets. The best configurations were chosen for a further run of 100 trials of 10-fold cross-validation. Moreover, the configurations for the Pima and image sets were based on the settings for the liver and wine data sets and were run for 20 trials of 10-fold cross-validation. The values of K for the K-nearest neighbour as well as the fuzzy K-nearest neighbour classifiers have been optimized using the same range of values as for the second recall algorithm.

The similarity function for the FK NN classifier was chosen to be

s(x, y) = \frac{1}{\left( \sum_{i}^{n} |x_i - y_i| \right)^{2/(m-1)}}   (3.1)

which is based on the suggestions in [26], with a value of m = 2. Furthermore, as one can see, the similarity approaches infinity as the distance approaches zero. To avoid undesired effects of non-representable large numbers, the maximum similarity was set to 10^20; if a larger similarity is measured, it is simply set to 10^20.

The multi-layer perceptron consisted of two hidden layers with variable numbers of units. The output layer contains as many nodes as there are classes. An exhaustive search was performed over the topology and the combinations of numbers of units in both hidden layers (between 1 and 12 nodes per layer). The preferred training method was Bayesian regularization [29] with the MATLAB neural network toolbox [30]. In addition to a training and test set, the network requires a validation set. Accordingly, the training set that results from 10-fold cross-validation is further randomly partitioned into a smaller training and a validation set. The size of the validation set is chosen to be 20% of the cross-validation training set.

The LIBSVM [31] package was used for the support vector machine benchmarking system. Multi-class soft-margin SVMs with a radial basis function kernel were chosen. The width of the kernel γ as well as the soft-margin cost parameter C were optimized with an exhaustive grid search over γ ∈ {2^-15, 2^-13, 2^-11, …, 2^5} and C ∈ {2^-5, 2^-3, 2^-1, …, 2^15}.

The ANFIS is based on the fuzzy logic toolbox in MATLAB [32,33]. First, a fuzzy inference system is generated via subtractive clustering [34,35] for different values of the cluster influence range r, chosen from r ∈ {0.25, 0.3, 0.35, …, 1}. The resulting inference system and the fuzzy membership functions are optimized by a combination of gradient descent backpropagation and a least-squares method. The membership functions are chosen to be Gaussian in order to guarantee differentiability. As for the MLP, overfitting was prevented by dividing the training data further into a smaller training and validation set. Since the original system is not suitable for multiclass classification, one ANFIS per class was used for one-against-all classification.



Table 2. The test performances (test error in %, standard deviations in brackets) and the corresponding ranks of all classifiers; the best performance per data set is the classifier with rank 1. In the case of the iris, wine, glass, and liver data sets, 100 trials of 10-fold cross-validation have been applied; for the Pima diabetes and image segmentation data sets, 20 trials of 10-fold cross-validation were performed. The differences in performance have been tested with the Wilcoxon rank sum test, see appendix. Each cell gives rank: test error (± standard deviation).

Name    Fuzzy inf. recall    Fuzzy k-NN recall    MLP                  k NN                 SVM                  Fuzzy k NN           ANFIS
Iris    4: 4.01 (±5.46)      3: 3.89 (±4.75)      5: 4.08 (±4.86)      6: 4.42 (±5.08)      7: 4.83 (±4.76)      1: 3.24 (±4.62)      2: 3.31 (±4.34)
Wine    6: 2.06 (±3.30)      3: 1.97 (±3.16)      7: 3.09 (±5.02)      4: 1.98 (±3.21)      2: 1.93 (±3.19)      1: 1.18 (±2.51)      6: 2.92 (±4.1)
Glass   5: 31.92 (±9.18)     1: 24.46 (±8.39)     6: 34.01 (±9.70)     3: 27.04 (±9.28)     2: 25.96 (±9.21)     4: 28.10 (±9.06)     7: 35.63 (±9.05)
Liver   7: 37.29 (±6.69)     6: 33.82 (±7.25)     3: 30.56 (±7.90)     5: 33.97 (±7.25)     4: 31.02 (±7.59)     1: 26.97 (±7.14)     2: 29.67 (±7.23)
Pima    5: 24.79 (±4.66)     3: 24.06 (±4.84)     7: 29.44 (±4.95)     6: 28.41 (±4.40)     4: 26.46 (±4.47)     2: 23.84 (±4.35)     1: 23.06 (±4.24)
Image   5: 12.32 (±1.94)     3: 10.13 (±1.79)     7: 12.84 (±8.26)     2: 9.84 (±1.63)      1: 7.75 (±1.85)      6: 12.53 (±1.96)     4: 11.55 (±1.91)

Table 3. The parameter settings of the best configurations of the proposed classifier for each data set.

Set                  Recall I (fuzzy inference)                                      Recall II (fuzzy K NN)
Iris                 S=150, #bins=75, κ=0.6, parabolic kernel, prod/sum norms        S=125, #bins=75, K=1, parabolic kernel
Wine                 S=100, #bins=25, κ=0.8, triangular kernel, prod/sum norms       S=175, #bins=50, K=5, triangular kernel
Glass                S=125, #bins=25, κ=0.6, triangular kernel, prod/sum norms       S=150, #bins=200, K=1, triangular kernel
Liver                S=125, #bins=400, κ=0.4, triangular kernel, prod/sum norms      S=125, #bins=150, K=11, triangular kernel
Pima diabetes        S=125, #bins=150, κ=0.4, triangular kernel, prod/sum norms      S=100, #bins=200, K=3, triangular kernel
Image segmentation   S=100, #bins=200, κ=1.8, triangular kernel, min/max norms       S=100, #bins=500, K=1, parabolic kernel

The input data for all benchmarking systems except the ANFIS has been linearly rescaled to the interval [0,1]. The proposed system does not require any previous rescaling of the data, because the binning and kernel computation implicitly rescale the input vectors. The ANFIS also implicitly rescales the data (the radii r are calculated over the unit hypercube of the data). Of the software used in this evaluation, the SVM is by far the fastest, and the MLP by far the slowest. The runtime of our proposed algorithm is comparable to that of the ANFIS, with parameter exploration for the ANFIS and for the new classifier taking roughly the same time, noting that our code is not optimized in any way.

3.2. Results

The test performances and rankings of all classifiers are shown in Table 2. As shown, the novel training algorithm combined with fuzzy K NN recall significantly outperforms all benchmark systems on the glass data set (for significance tests see the appendix). Similarly good results are obtained for the iris, wine, Pima and image data sets. Both recall techniques score worst for the liver data set. The settings of the best B/LCMM classifiers for each data set are shown in Table 3.

Fig. 8 shows summaries of the exhaustive search for the wine and glass data sets. The 3D plots visualize the test error as a function of the number of bins #bins and the sample size S for both recall methods; the values of κ and K are fixed. Plots for all first four data sets are provided and discussed in [25]. Table 4 shows the average number of columns in the matrix for the best configurations of the training algorithm. The size of the matrix can be compared to the number of support vectors generated by the SVM training and the number of unique fuzzy sets of the ANFIS classifier.

Finally, Table 5 lists the number of comparisons necessary during training; this reflects the complexity and runtime of the training procedure. A comparison is defined as the computation of a kernel distance between an instance and a matrix column. The number of comparisons can be compared to the runtime of the best multi-layer perceptrons. Accordingly, an MLP iteration is defined as a full forward and backward pass for a single instance.

3.3. Discussion

The proposed classifier shows a good performance for the majority of the data sets. In the case of the glass data set the proposed system outperforms all benchmark systems, whereas the system scores third best for most of the other data sets. In the case of the liver data set the proposed classifier achieves rather low results in comparison to the benchmark classifiers. This is quite surprising, since the system classifies the Pima data set well, which has similar characteristics to the liver set.

There are certain differences among the recall techniques. The FK NN recall works more robustly than the fuzzy inference recall on the tested data sets. This is not surprising because, given larger values of K, FK NN recall considers more classification regions than fuzzy inference recall, whose decision is only based on the closest (i.e. the most similar) region. Accordingly, the first recall method is much more susceptible to noise in the data.

Furthermore, all configurations react robustly to changes of their parameters. As Fig. 8 suggests, good performances can be achieved for very wide ranges of parameter settings. For the other data sets (see [25]) the results of the extensive parameter search show similar plateau-like graphs as in Fig. 8. However, there are some noticeable tendencies.



Fig. 8. Summary of the exhaustive search of parameters for 10 trials of 10-fold cross-validation. The 3D plots show the test error of both recall methods with respect to the different numbers of bins and sample sizes for the glass data set, the values of κ and K are fixed to 1.6 and 1, respectively.

Table 4. The number of columns for each data set. From the two systems, i.e. fuzzy inference and fuzzy K NN recall, the columns of the one exhibiting the better test error are shown. For comparison, the middle columns show the number of support vectors (SVs) required on average by the best SVM classifier. The standard deviations are shown in brackets; the relative size in percent of the training set is also given. The rightmost column shows the number of unique antecedent fuzzy sets generated by the subtractive clustering algorithm of the ANFIS.

Set                  Training set size   #Columns          in %             SVs               in %             ANFIS fuzzy sets
Iris                 135                 16.42 (±1.48)     12.16 (±1.09)    12.54 (±1.43)     9.29 (±1.06)     32 (±0)
Wine                 160                 29.90 (±1.29)     18.68 (±0.81)    51.43 (±1.96)     32.14 (±1.23)    177.29 (±21.94)
Glass                193                 47.70 (±2.00)     24.71 (±1.04)    118.45 (±3.44)    61.37 (±1.78)    141.47 (±20.31)
Liver                311                 78.24 (±2.54)     25.16 (±0.82)    211.23 (±4.56)    67.92 (±1.47)    24.01 (±0.38)
Pima diabetes        691                 166.41 (±5.20)    24.08 (±0.75)    347.22 (±6.07)    50.25 (±0.88)    36.80 (±7.69)
Image segmentation   2079                177.01 (±9.48)    8.51 (±0.46)     1003.10 (±6.78)   48.25 (±0.33)    552.80 (±14.23)

Table 5. The number of comparisons for each data set. From the two systems (fuzzy inference and fuzzy K NN recall) the comparisons of the one exhibiting the better test error are shown. A comparison is defined as a kernel distance computation between an instance and a matrix column or between two instances. The rightmost column shows the number of iterations required by the best MLP. An iteration is defined as a full forward propagation of an input pattern and a full backward propagation of the error with the corresponding weight adjustments. The standard deviations are shown in brackets.

Set                  Training set size   LCMM comparisons      MLP iterations
Iris                 135                 26,147 (±1776)        3507 (±3007)
Wine                 160                 47,493 (±5248)        4445 (±1310)
Glass                193                 54,641 (±4507)        6346 (±1674)
Liver                311                 137,240 (±9280)       8445 (±3095)
Pima diabetes        691                 498,358 (±37,166)     677,304 (±40,619)
Image segmentation   2079                760,839 (±97,031)     117,640 (±39,715)

Usually, for the training algorithm the performance strengthens with an increase in the number of bins per dimension. Accordingly, more bins imply a finer discretization of the original pattern space.

However, there are some noticeable tendencies. Usually, the test performance of the training algorithm improves as the number of bins per dimension increases, since more bins imply a finer discretization of the original pattern space. Nonetheless, beyond a certain level, a further increase in the number of bins has only a minor effect: the partitioning of the pattern space is then already fine enough to capture the slightest differences between pairs of patterns. Increasing the sample size S also tends to benefit the test performance of the training algorithm, although a small to moderate sample size can already achieve good results. As with the number of bins, the positive effect of increasing the sample size saturates after a certain amount, as shown in Fig. 8 (and the results given in [25]).

The settings of the parameters κ and K for the recall methods are not as straightforward as the other parameters and depend heavily on the classification task. However, determining good recall parameters is rather inexpensive: since the training and recall methods are completely independent, different recall settings can be tried with one and the same matrix without any retraining.
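Because training and recall are decoupled, a search over the recall parameters can reuse a single trained matrix, as sketched below in Python; train_blcmm, recall_fuzzy_knn and the parameter names are hypothetical placeholders rather than the interface of the actual implementation.

# Minimal sketch of the decoupled parameter search described above.
# train_blcmm and recall_fuzzy_knn are hypothetical stand-ins for the
# training and FK-NN recall procedures of the proposed classifier.

def parameter_search(train_data, val_data, bins, sample_size, kappas, ks,
                     train_blcmm, recall_fuzzy_knn, error):
    # Expensive step: build the B/LCMM matrix once per (bins, sample size).
    matrix = train_blcmm(train_data, num_bins=bins, sample_size=sample_size)

    # Cheap step: sweep the recall parameters on the very same matrix,
    # no retraining required.
    best = None
    for kappa in kappas:          # fuzzy set support sizes
        for k in ks:              # number of neighbours for FK-NN recall
            predictions = recall_fuzzy_knn(matrix, val_data, kappa=kappa, k=k)
            err = error(predictions, val_data)
            if best is None or err < best[0]:
                best = (err, kappa, k)
    return matrix, best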


The training algorithm requires the computation of many distances between different patterns to determine the best similarities. Nonetheless, due to sampling and the heuristic ordering of columns, the number of comparisons listed in Table 5 is much smaller than the worst case of cubic complexity. Although the number of iterations required by the MLP seems much smaller than the number of comparisons needed by the training algorithm, one has to bear in mind that an iteration includes a full forward and backward pass through the network, which is itself already quite expensive, and that the complexity of a single MLP training iteration also depends on the topology of the network. In contrast, the procedure defined as a comparison, that is a kernel distance calculation, is very cheap, requires only a few CPU operations, and its cost scales only linearly with the number of bins a column or a binary pattern exhibits. Furthermore, the number of comparisons is determined by the size of the sampling set and the number of training instances; the difficulty of the training set has only a small effect on the runtime of the algorithm. On the contrary, the MLP runtime depends on the pattern distribution of the data sets. Although only several thousand iterations on average are necessary for most of the data sets, the average number of iterations needed to train the Pima diabetes data set was even higher than the number of comparisons required by B/LCMM training.

Moreover, as Table 4 shows, the training algorithm compresses the information provided by the data set well, especially if one compares the number of columns of a matrix with the number of support vectors stored by an SVM or the unique fuzzy sets computed by the ANFIS. Nonetheless, one has to bear in mind that the encoding of a single dimension of a column depends on the number of bins and is therefore usually larger than the 32 or 64 bits required to encode the floating point value of one dimension of a support vector, or the two floating point values encoding the support and mean of an ANFIS Gaussian membership function.

Finally, interpretability is a somewhat subjective concept and hard to quantify. Accordingly, none of the results listed in the previous sections really measures interpretability. Nonetheless, it will be argued why the system can be considered highly comprehensible. First of all, the proposed system's simplicity is one of its key features. The system is based on a simple matrix representation and corresponds to a single-layered network. Accordingly, training does not involve minimizing a high dimensional cost function, as for MLPs, or projecting data instances into a very high dimensional space, as for SVMs, but is solely based on heuristically grouping similar data instances together. Secondly, for a non-linear support vector machine one cannot compute the weight vector of the separating hyperplane in the higher dimensional space; hence, assigning meaning to the parameter settings and the support vectors is almost impossible. Likewise, the relation between the weights of a multi-layer perceptron and the original pattern space is incomprehensible: the network output is formed by a complex interplay of nodes in different layers based on many weights, so recognizing which weight becomes active for which specific site or location in the original space is very difficult. In contrast, each bin of a B/LCMM matrix uniquely and unambiguously maps to an interval in a specific dimension.

Thus, the recall procedures of the fuzzy binary correlation matrix memory classifier also work intuitively. First, fuzzy inference recall is explicitly designed to translate into a fuzzy inference system. This can be done by simply interpreting a weight in the matrix as a fuzzy set with a membership function derived from the kernel computation. If the membership function is also smoothed and assigned a human label, the binary correlation matrix memory directly maps to a fuzzy rule set in human language. Nonetheless, the resulting ruleset might be quite large due to the number of fuzzy sets, so further pruning and merging of rules should be applied; for a detailed example of a translation see [25]. Secondly, fuzzy K-nearest neighbour recall also produces comprehensible classifiers. Because every bin maps to an interval of a specific dimension, the combination of bins in all dimensions maps to a unique hyperrectangular classification region in the original pattern space. The distances, or rather similarities, between a new instance and the closest of these regions determine the classification. Furthermore, in particular for large values of K and L, bins with large weights map to intervals that represent important regions for classification: one can simply state that the higher the weight, the more important the interval becomes for the classification task. Examining the B/LCMM matrix directly provides the user with information about how classification decisions are formed by the network. On the contrary, investigating the weight distribution and topology of a multi-layer perceptron, or the parameter settings and support vector set of an SVM, provides little to no insight into the underlying reasoning processes.
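To illustrate the bin-to-interval mapping discussed above, the following Python sketch translates a single trained column into a crude human-readable rule by reporting, for every dimension, the interval of its most strongly weighted bin; equi-width binning is assumed, the names and example values are hypothetical, and the simplification to one bin per dimension is ours rather than the translation procedure of [25].

import numpy as np

def column_to_rule(column, mins, maxs, bins_per_dim, feature_names, class_label):
    # column is assumed to be an array of shape (n_dims, bins_per_dim) of
    # weights; equi-width binning over [mins, maxs] per dimension is assumed.
    clauses = []
    widths = (maxs - mins) / bins_per_dim
    for i, name in enumerate(feature_names):
        j = int(np.argmax(column[i]))          # most strongly weighted bin
        lo = mins[i] + j * widths[i]
        hi = lo + widths[i]
        clauses.append(f"{name} is roughly in [{lo:.2f}, {hi:.2f}]")
    return "IF " + " AND ".join(clauses) + f" THEN class = {class_label}"

# Hypothetical example: 2 features, 4 bins each.
col = np.array([[0, 3, 1, 0],
                [0, 0, 2, 5]])
print(column_to_rule(col, np.array([0.0, 0.0]), np.array([8.0, 4.0]), 4,
                     ["petal length", "petal width"], "versicolor"))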

4. Conclusion

A novel classification technique based on a hybridization of fuzzy logic and binary neural networks has been presented. A new training algorithm and two new recall algorithms have been introduced to extract information from a given training set and to carry out classification decisions for novel data patterns. The proposed system yields good classification performance, behaves robustly over wide ranges of parameter settings, and provides a comprehensible and interpretable classifier for the different benchmark data sets. These good performances suggest that the fuzzy binary correlation matrix memory can successfully be applied to real world problems.

Moreover, since the system constitutes a novel approach to classification and this paper is the first investigation of its dynamics, there is still plenty of room for further improvement. Future work might involve testing and designing novel kernels and augmenting the given training and recall algorithms. The training algorithm requires many comparisons for large sample sizes; a better and more efficient heuristic to determine whether a new instance can safely be inserted into a B/LCMM column may therefore be worth pursuing. Investigating suitable pruning techniques for fuzzy inference recall can further enhance interpretability. Other quantization mechanisms that rely on the training pattern statistics, such as equi-frequency binning, might improve the performance further. Certainly, an analytical discussion of the system can give more insight into its dynamics; mathematical properties of the training algorithm, such as the probability of finding the best solution in terms of classification error or the most compact matrix, might be worth analysing.
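As an illustration of the equi-frequency alternative mentioned above, the following Python sketch contrasts equi-width bin edges with quantile-based (equi-frequency) bin edges on a skewed toy feature; it is a generic illustration of the technique, not part of the proposed system.

import numpy as np

def equi_width_edges(values, num_bins):
    # Bin edges at equal distances over the observed range.
    return np.linspace(values.min(), values.max(), num_bins + 1)

def equi_frequency_edges(values, num_bins):
    # Bin edges at quantiles, so each bin receives roughly the same
    # number of training values.
    qs = np.linspace(0.0, 1.0, num_bins + 1)
    return np.quantile(values, qs)

x = np.array([0.1, 0.2, 0.2, 0.3, 0.4, 5.0])   # skewed toy feature
print(equi_width_edges(x, 3))       # approx [0.1, 1.73, 3.37, 5.0]
print(equi_frequency_edges(x, 3))   # approx [0.1, 0.2, 0.33, 5.0]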

Appendix A. Significance tests

Table A1.

Table A1. Significance tests of performance differences with the non-parametric Wilcoxon rank-sum test. The average test errors are shown in brackets. Test results are reported as [p < 0.01, N = 100, U = 4000, D = 0.3], where p denotes the test's p-value, N is the number of measurements and U is the rank-sum value of the Wilcoxon rank-sum test. The significance level was chosen to be 0.01. To quantify the effect size, Cohen's D has been calculated; according to Cohen [36], a value of about D = 0.2 describes a small effect, D = 0.5 a medium effect, and D ≥ 0.8 a large effect.

Iris data set
Re.I (4.01%) vs. MLP (4.08%): Not significant
Re.I (4.01%) vs. K NN (4.42%): Not significant
Re.I (4.01%) vs. FK NN (4.83%): [p < 0.001, N = 1000, U = 9.53e+05, D = 0.16]
Re.I (4.01%) vs. SVM (3.24%): Not significant
Re.I (4.01%) vs. ANFIS (3.31%): [p < 0.01, N = 1000, U = 1.03e+06, D = 0.15]
Re.II (3.89%) vs. MLP (4.08%): Not significant
Re.II (3.89%) vs. K NN (4.42%): Not significant
Re.II (3.89%) vs. FK NN (4.83%): [p < 0.001, N = 1000, U = 9.50e+05, D = 0.19]
Re.II (3.89%) vs. SVM (3.24%): Not significant
Re.II (3.89%) vs. ANFIS (3.31%): [p < 0.01, N = 1000, U = 1.03e+06, D = 0.13]

Wine data set
Re.I (2.06%) vs. MLP (3.09%): [p < 0.001, N = 1000, U = 9.42e+05, D = 0.25]
Re.I (2.06%) vs. K NN (1.98%): Not significant
Re.I (2.06%) vs. FK NN (1.94%): Not significant
Re.I (2.06%) vs. SVM (1.18%): [p < 0.001, N = 1000, U = 1.06e+06, D = 0.30]
Re.I (2.06%) vs. ANFIS (2.92%): [p < 0.001, N = 1000, U = 9.52e+05, D = 0.23]
Re.II (1.97%) vs. MLP (3.09%): [p < 0.001, N = 1000, U = 9.39e+05, D = 0.28]
Re.II (1.97%) vs. K NN (1.98%): Not significant
Re.II (1.97%) vs. FK NN (1.94%): Not significant
Re.II (1.97%) vs. SVM (1.18%): [p < 0.001, N = 1000, U = 1.06e+06, D = 0.28]
Re.II (1.97%) vs. ANFIS (2.92%): [p < 0.001, N = 1000, U = 9.49e+05, D = 0.26]

Glass data set
Re.I (31.92%) vs. MLP (34.01%): [p < 0.001, N = 1000, U = 9.36e+05, D = 0.22]
Re.I (31.92%) vs. K NN (27.04%): [p < 0.001, N = 1000, U = 1.15e+06, D = 0.53]
Re.I (31.92%) vs. FK NN (25.95%): [p < 0.001, N = 1000, U = 1.18e+06, D = 0.65]
Re.I (31.92%) vs. SVM (28.10%): [p < 0.001, N = 1000, U = 1.11e+06, D = 0.42]
Re.I (31.92%) vs. ANFIS (35.63%): [p < 0.001, N = 1000, U = 8.79e+05, D = 0.42]
Re.II (24.46%) vs. MLP (34.01%): [p < 0.001, N = 1000, U = 7.27e+05, D = 1.05]
Re.II (24.46%) vs. K NN (27.04%): [p < 0.001, N = 1000, U = 9.17e+05, D = 0.29]
Re.II (24.46%) vs. FK NN (25.95%): [p < 0.001, N = 1000, U = 9.51e+05, D = 0.17]
Re.II (24.46%) vs. SVM (28.10%): [p < 0.001, N = 1000, U = 8.81e+05, D = 0.42]
Re.II (24.46%) vs. ANFIS (35.63%): [p < 0.001, N = 1000, U = 6.77e+05, D = 1.31]

Pima data set
Re.I (24.79%) vs. MLP (29.44%): [p < 0.001, N = 200, U = 3.01e+04, D = 0.97]
Re.I (24.79%) vs. K NN (28.41%): [p < 0.001, N = 200, U = 3.18e+04, D = 0.79]
Re.I (24.79%) vs. FK NN (26.46%): [p < 0.001, N = 200, U = 3.60e+04, D = 0.37]
Re.I (24.79%) vs. SVM (23.84%): Not significant
Re.I (24.79%) vs. ANFIS (23.06%): [p < 0.001, N = 200, U = 4.47e+04, D = 0.39]
Re.II (24.06%) vs. MLP (29.44%): [p < 0.001, N = 200, U = 2.89e+04, D = 1.10]
Re.II (24.06%) vs. K NN (28.41%): [p < 0.001, N = 200, U = 3.04e+04, D = 0.93]
Re.II (24.06%) vs. FK NN (26.46%): [p < 0.001, N = 200, U = 3.43e+04, D = 0.52]
Re.II (24.06%) vs. SVM (23.84%): Not significant
Re.II (24.06%) vs. ANFIS (23.06%): Not significant

Image segmentation data set
Re.I (12.32%) vs. MLP (12.84%): [p < 0.001, N = 200, U = 4.74e+04, D = 0.09]
Re.I (12.32%) vs. K NN (9.84%): [p < 0.001, N = 200, U = 5.29e+04, D = 1.31]
Re.I (12.32%) vs. FK NN (7.73%): [p < 0.001, N = 200, U = 5.86e+04, D = 2.51]
Re.I (12.32%) vs. SVM (12.53%): Not significant
Re.I (12.32%) vs. ANFIS (11.55%): [p < 0.001, N = 200, U = 4.42e+04, D = 0.40]
Re.II (10.13%) vs. MLP (12.84%): Not significant
Re.II (10.13%) vs. K NN (9.84%): Not significant
Re.II (10.13%) vs. FK NN (7.73%): [p < 0.001, N = 200, U = 5.34e+04, D = 1.37]
Re.II (10.13%) vs. SVM (12.53%): [p < 0.001, N = 200, U = 2.77e+04, D = 1.28]
Re.II (10.13%) vs. ANFIS (11.55%): [p < 0.001, N = 200, U = 3.19e+04, D = 0.77]
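The style of test reported in Table A1 can be reproduced with standard routines. The following Python sketch, based on SciPy, computes a two-sided Wilcoxon rank-sum p-value, the raw rank sum of the first sample, and Cohen's D for two hypothetical arrays of per-trial test errors; it is an illustration of the procedure, not the original evaluation script.

import numpy as np
from scipy.stats import ranksums, rankdata

def compare_errors(errors_a, errors_b, alpha=0.01):
    # Wilcoxon rank-sum test plus Cohen's D for two sets of test errors.
    stat, p = ranksums(errors_a, errors_b)          # two-sided rank-sum test

    # Raw rank sum of the first sample over the pooled data.
    pooled = np.concatenate([errors_a, errors_b])
    ranks = rankdata(pooled)
    rank_sum = ranks[:len(errors_a)].sum()

    # Cohen's D with a pooled standard deviation.
    na, nb = len(errors_a), len(errors_b)
    pooled_sd = np.sqrt(((na - 1) * np.var(errors_a, ddof=1)
                         + (nb - 1) * np.var(errors_b, ddof=1)) / (na + nb - 2))
    d = (np.mean(errors_a) - np.mean(errors_b)) / pooled_sd

    return {"p": p, "U": rank_sum, "D": abs(d), "significant": p < alpha}

# Hypothetical per-trial test errors of two classifiers.
rng = np.random.default_rng(0)
a = rng.normal(0.24, 0.03, 1000)
b = rng.normal(0.26, 0.03, 1000)
print(compare_errors(a, b))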

Appendix B. Training

The training algorithm (see Algorithm Appendix B.1) includes several subroutines: testing the actual insertion (see Algorithm Appendix B.2), computing the critical similarity between instances of another class and the actual class (see Algorithm Appendix B.3), and testing whether an artificial region is placed in a protected area (see Algorithm Appendix B.4). The queue q is a special data structure based on bucket sort that sorts the columns of the B/LCMM matrix in decreasing order of stored instances. Furthermore, the instruction in line 16 of Algorithm Appendix B.1 returns the index of the column with the most stored instances. Next, the pointer within the queue is moved to return the index of the column with the second most instances, and so on, iterating through all columns in decreasing order of stored instances. In case a suitable column is found for storage, the pointer is reset to the front item of the queue.

Algorithm B.1. TRAININGII

Require: The set of all binary input vectors binary_inputs for a specific class c
Require: The type of kernel function kernel
Require: The sample size S
Require: The set of all binary input vectors other_binary_items of all other classes
Ensure: A trained B/LCMM matrix M_c for the actual class
1:  Initialize an empty matrix M_c
2:  Initialize a bucket queue q that orders the columns according to the number of stored instances
3:  Initialize the critical similarity vector sim with as many entries as there are binary items of other classes and set all values to −1
4:  for all x^{01} in binary_inputs do
5:    Select a random x^{01} from binary_inputs that has not been selected before
6:    if q is empty then
7:      M_c(1) := x^{01}
8:      Update q
9:    else
10:     for all entries in q do
11:       if q is at the end then
12:         Set y to exceed the number of columns by one to concatenate x^{01}
13:         M_c(y) := M_c(y) + x^{01}
14:         Update q
15:       else
16:         Get the next index y from q
17:         M_c(y) := M_c(y) + x^{01}
18:         if SUITABLEINSERTION(binary_inputs, kernel, S, other_binary_items, M_c(y), x^{01}, sim) then
19:           Update q
20:           break
21:         else
22:           M_c(y) := M_c(y) − x^{01}
23:         end if
24:       end if
25:     end for
26:   end if
27: end for
28: M*_c = clear(M_c)
29: return M*_c
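The bucket queue q used in Algorithm Appendix B.1 can be realised with a dictionary of buckets. The following Python sketch is an illustrative implementation under our own naming, not the code used in the experiments; it keeps one bucket per column size so that column indices can be visited in decreasing order of stored instances.

from collections import defaultdict

class BucketQueue:
    # Orders B/LCMM column indices by the number of instances stored in them.

    def __init__(self):
        self.size_of = {}                    # column index -> stored instances
        self.buckets = defaultdict(set)      # stored instances -> column indices

    def update(self, column, new_size):
        # Move a column to the bucket matching its new number of stored instances.
        old = self.size_of.get(column)
        if old is not None:
            self.buckets[old].discard(column)
        self.size_of[column] = new_size
        self.buckets[new_size].add(column)

    def in_decreasing_order(self):
        # Yield column indices from the fullest column to the emptiest,
        # mirroring the iteration order described for Algorithm Appendix B.1.
        for size in sorted(self.buckets, reverse=True):
            for column in self.buckets[size]:
                yield column

# Hypothetical usage: three columns holding 2, 5 and 5 instances.
q = BucketQueue()
for col, size in [(0, 2), (1, 5), (2, 5)]:
    q.update(col, size)
print(list(q.in_decreasing_order()))   # e.g. [1, 2, 0]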

In Algorithm Appendix B.2 it is first checked whether the critical similarity between an item of the opposite class and any item of the actual class has already been computed, see line 4. If not, it is calculated via sampling in Algorithm Appendix B.3. Note that the expression i = indexof(z^{01}) in line 3 returns the index of the item z^{01} in the similarity vector sim.

Algorithm B.2. SUITABLEINSERTION

Require: The set of all binary input vectors binary_inputs for a specific class c
Require: The type of kernel function kernel
Require: The sample size S
Require: The set of all binary input vectors other_binary_items of all other classes
Require: The column of the B/LCMM matrix that has just been modified w_k
Require: The actual binary item x^{01}
Require: The similarity vector sim
Ensure: TRUE or FALSE, whether x^{01} can safely be stored into w_k or not
1:  Create a random sample Z of size S from other_binary_items
2:  for all z^{01} ∈ Z do
3:    i = indexof(z^{01})
4:    if sim_i == −1 then
5:      s_best = GETBESTSIMILARITY(binary_inputs, kernel, S, z^{01})
6:      sim_i = (s_best + 1)/2
7:    end if
8:    if TOOCLOSEREGION(kernel, w_k, x^{01}, sim_i, z^{01}) then
9:      return FALSE
10:   end if
11: end for
12: return TRUE

Furthermore, Algorithm Appendix B.3 is relatively straightforward. Note that the reflected distance, or rather similarity, is not computed in the original data space but in the space discretized via equi-width binning.

Algorithm B.3. GETBESTSIMILARITY

Require: The set of all binary input vectors binary_inputs for a specific class c
Require: The type of kernel function kernel
Require: The sample size S
Require: The binary item z^{01} of another class
Ensure: The largest similarity s_best that reflects the closest distance
1:  Set s_best to 0
2:  Compute ẑ from the kernel and z^{01}
3:  Create a random sample X of size S from binary_inputs
4:  for all x^{01} ∈ X do
5:    Select a random x^{01} from X that has not been selected before
6:    s_best ← max(s_best, Σ_{i=1}^{n} Σ_{j=1}^{K_i} ẑ_{i.j} x^{01}_{i.j})
7:  end for
8:  return s_best

Finally, Algorithm Appendix B.4 checks whether an artificial classification region gets too close to an instance of another class. The critical threshold has already been computed in a loop within Algorithm Appendix B.2. The computation of mindiff in line 7 ensures that only classification regions caused by the new instance x^{01} are considered. Accordingly, the value of mindiff is subtracted from the value that reflects the distance sum, see line 10.

Algorithm B.4. TOOCLOSEREGION

Require: The type of kernel function kernel
Require: The column of the B/LCMM matrix that has just been modified w_k
Require: The actual binary item x^{01}
Require: A similarity threshold s_crit
Require: The binary item z^{01} of another class
1:  Initialize the temporary variable mindiff and set it to 1; it will contain the smallest difference between the similarity of x^{01} and z^{01} and the similarity of applying z^{01} to the column w_k
2:  Initialize the variable sum and set it to 0; it will contain the result of applying z^{01} to the column w_k
3:  Compute ẑ from the kernel and z^{01}
4:  for all n dimensions, indexed by i do
5:    temp1 = Σ_{j=1}^{K_i} ẑ_{i.j} x^{01}_{i.j}
6:    temp2 = max_j (ẑ_{i.j} w_{k,i.j})
7:    mindiff := min(mindiff, temp2 − temp1)
8:    sum := sum + temp2
9:  end for
10: if (sum − mindiff) ≥ s_crit then
11:   return TRUE
12: else
13:   return FALSE
14: end if

Appendix C. Recall

The first recall algorithm (Algorithm Appendix C.1) also requires the novel test instance to be translated into a binary one. The algorithm returns a vector containing the membership value for each class. If a definite answer is required, the maximum value determines the class label.

Algorithm C.1. RECALLI

Require: The type of kernel function kernel
Require: A fully trained BCMM matrix M with weight vectors indexed by w_k
Require: A novel binary instance x^{01}
Require: A fuzzy set support size κ
Require: A t-norm and a t-conorm function
Ensure: A membership vector result containing the membership values for each of l classes
1:  Compute the kernel instance x̂ from the kernel, κ and x^{01}
2:  for all l classes, indexed by c do
3:    Initialize temp_class_mem with 0
4:    for all r columns of class c, indexed by k do
5:      Initialize temp_col_mem with 1
6:      for all n dimensions, indexed by i do
7:        Initialize temp_dim_mem with 0
8:        for all D_i bins, indexed by j do
9:          temp_dim_mem ← t-conorm(temp_dim_mem, w_{k,i.j} x̂_{i.j})
10:       end for
11:       temp_col_mem ← t-norm(temp_col_mem, temp_dim_mem)
12:     end for
13:     temp_class_mem ← t-conorm(temp_class_mem, temp_col_mem)
14:   end for
15:   result_c = temp_class_mem
16: end for
17: return result

For the pseudocode of K NN recall see Algorithm Appendix C.2 below.

Algorithm C.2. RECALLII

Require: The type of kernel function kernel
Require: A fully trained LCMM matrix M with weight vectors indexed by w_k
Require: A novel binary instance x^{01}
Require: A value of K
Ensure: A membership vector result containing the membership values for each of l classes
1:  Compute the kernel instance x̂ from the kernel, κ and x^{01}
2:  Initialize ẙ as an empty vector
3:  for all R columns in the matrix, indexed by k do
4:    y_k = Σ_{i=1}^{n} Kmax(w_{k,i} ⊗ x̂_i)
5:    Concatenate y_k to ẙ
6:  end for
7:  ẏ = Kmax(ẙ)
8:  norm_sum = Σ_{j=1}^{K} ẏ_j
9:  for all l classes, indexed by c do
10:   Initialize temp_sum with 0
11:   for K times, indexed by j do
12:     if ẏ_j belongs to class c then
13:       temp_sum ← temp_sum + ẏ_j
14:     end if
15:   end for
16:   result_c = temp_sum / norm_sum
17: end for
18: return result
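To relate the pseudocode above to executable code, the following minimal Python sketch mirrors the nested loops of Algorithm Appendix C.1 with max as t-conorm and min as t-norm; the data layout (one array of shape (columns, dimensions, bins) per class) and all names are illustrative assumptions rather than the implementation used in the experiments.

import numpy as np

def recall_fuzzy_inference(class_matrices, kernel_instance):
    # Simplified version of Algorithm C.1 (RECALLI).
    # class_matrices: dict mapping class label -> array of shape (r, n, D),
    #     the columns stored for that class.
    # kernel_instance: array of shape (n, D), the kernelised test instance x-hat.
    # Returns a dict of class memberships; t-norm = min, t-conorm = max.
    result = {}
    for label, matrix in class_matrices.items():
        class_mem = 0.0
        for column in matrix:                       # one classification region
            # Per dimension: t-conorm over the bins (here: max of w * x-hat).
            dim_mem = (column * kernel_instance).max(axis=1)
            # Combine dimensions with the t-norm (here: min).
            col_mem = dim_mem.min()
            # Combine columns/regions with the t-conorm (here: max).
            class_mem = max(class_mem, col_mem)
        result[label] = class_mem
    return result

# Hypothetical toy example: 2 dimensions x 3 bins, one column per class.
x_hat = np.array([[0.2, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
matrices = {"A": np.array([[[0, 1, 0], [0, 0, 1]]]),
            "B": np.array([[[1, 0, 0], [0, 1, 0]]])}
memberships = recall_fuzzy_inference(matrices, x_hat)
print(memberships, max(memberships, key=memberships.get))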

References

[1] L.H. Tsoukalas, R.E. Uhrig, Fuzzy and Neural Approaches in Engineering, 1st edition, John Wiley & Sons, Inc., New York, NY, USA, 1996.
[2] T. Ross, Fuzzy Logic with Engineering Applications, John Wiley, 2004.
[3] H. Ishibuchi, Evolutionary multiobjective optimization and multiobjective fuzzy system design, in: Proceedings of the 5th International Conference on Soft Computing as Transdisciplinary Science and Technology, CSTST '08, ACM, New York, NY, USA, 2008, pp. 3–4.
[4] H. Ishibuchi, Evolutionary multiobjective design of fuzzy rule-based systems, in: IEEE Symposium on Foundations of Computational Intelligence, FOCI 2007, 2007, pp. 9–16.
[5] Z. Wang, V. Palade, Building interpretable fuzzy models for high dimensional data analysis in cancer diagnosis, BMC Genomics 12 (Suppl. 2) (2011) S5.
[6] F.-L. Chung, T. Lee, On fuzzy associative memory with multiple-rule storage capacity, IEEE Trans. Fuzzy Syst. 4 (3) (1996) 375–384.
[7] P. Sussner, M. Valle, Implicative fuzzy associative memories, IEEE Trans. Fuzzy Syst. 14 (6) (2006) 793–807.
[8] P. Sussner, M. Valle, Fuzzy associative memories and their relationship to mathematical morphology, in: Handbook of Granular Computing, Wiley-Interscience, 2008 (Chapter 32).
[9] E. Aykin, S. O'Keefe, A fuzzy classifier based on correlation matrix memories, in: Proceedings of the 10th WSEAS International Conference on Fuzzy Systems, FS'09, World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, 2009, pp. 63–68.
[10] H. Takagi, I. Hayashi, NN-driven fuzzy reasoning, International Journal of Approximate Reasoning 5 (3) (1991) 191–212.
[11] D. Nauck, R. Kruse, NEFCLASS—a neuro-fuzzy approach for the classification of data, in: Proceedings of the 1995 ACM Symposium on Applied Computing, ACM Press, 1995, pp. 461–465.
[12] D. Nauck, R. Kruse, A neuro-fuzzy method to learn fuzzy classification rules from data, Fuzzy Sets Syst. 89 (3) (1997) 277–288.
[13] V. Palade, R.J. Patton, F.J. Uppal, J. Quevedo, S. Daley, Fault diagnosis of an industrial gas turbine using neuro-fuzzy methods, in: Preprints of the 15th IFAC World Congress, 2002.
[14] R. Batuwita, V. Palade, FSVM-CIL: fuzzy support vector machines for class imbalance learning, IEEE Trans. Fuzzy Syst. 18 (3) (2010) 558–571.
[15] D.J. Willshaw, O.P. Buneman, H.C. Longuet-Higgins, Non-holographic associative memory, Nature 222 (5197) (1969) 960–962.
[16] P. Zhou, J. Austin, J. Kennedy, A binary correlation matrix memory k-NN classifier with hardware implementation, in: BMVC'98, 1998.
[17] P. Zhou, J. Austin, A PCI bus based correlation matrix memory and its application to k-NN classification, in: Proceedings of the Seventh International Conference on Microelectronics for Neural, Fuzzy and Bio-Inspired Systems, MicroNeuro '99, 1999, pp. 196–204.
[18] P. Zhou, J. Austin, J. Kennedy, A high performance k-NN classifier using a binary correlation matrix memory, in: Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, MIT Press, Cambridge, MA, USA, 1999, pp. 713–719.
[19] M. Weeks, V. Hodge, S. O'Keefe, J. Austin, K. Lees, Improved AURA k-nearest neighbour approach, in: Proceedings of the 7th International Work-Conference on Artificial and Natural Neural Networks: Part II: Artificial Neural Nets Problem Solving Methods, IWANN '03, Springer-Verlag, Berlin, Heidelberg, 2003, pp. 663–670.
[20] V.J. Hodge, K.J. Lees, J.L. Austin, A high performance k-NN approach using binary neural networks, Neural Networks 17 (3) (2004) 441–458.
[21] V.J. Hodge, J. Austin, A binary neural k-nearest neighbour technique, Knowl. Inf. Syst. 8 (2005) 276–291.
[22] N. Shah, Improving the recognition properties of binary neural networks using dynamic encoders, in: Proceedings of the 5th International Conference on Recent Advances in Soft Computing, 2004, pp. 566–571.
[23] N. Shah, Using the Hough transform in binary neural networks with dynamic encoders, in: Proceedings of the 6th International Conference on Recent Advances in Soft Computing, 2006, pp. 60–65.
[24] N. Shah, S. O'Keefe, J. Austin, The improved correlation matrix memory (CMML), in: International Joint Conference on Neural Networks, IJCNN 2007, 2007, pp. 1168–1173.
[25] R. Meyer, A Fuzzy Binary Correlation Matrix Memory Classification System, Master's thesis, University of York, 2011.
[26] J.M. Keller, M.R. Gray, J.A. Givens, A fuzzy K-nearest neighbor algorithm, IEEE Trans. Syst. Man Cybern. 15 (4) (1985) 580–585, http://dx.doi.org/10.1109/TSMC.1985.6313426.
[27] M.M. Gupta, J. Qi, Theory of T-norms and fuzzy inference methods, Fuzzy Sets Syst. 40 (1991) 431–450.
[28] A. Frank, A. Asuncion, UCI Machine Learning Repository, 2010, URL 〈http://archive.ics.uci.edu/ml〉.
[29] F.D. Foresee, M.T. Hagan, Gauss–Newton approximation to Bayesian learning, Int. Conf. Neural Networks 3 (1997) 1930–1935.
[30] MATLAB, The MathWorks Inc., Natick, Massachusetts, 2011, 〈http://www.mathworks.com/〉.
[31] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intelligent Syst. Technol. 2 (2011) 27:1–27:27, software available at 〈http://www.csie.ntu.edu.tw/~cjlin/libsvm〉.
[32] J.-S.R. Jang, Fuzzy modeling using generalized neural networks and Kalman filter algorithm, in: Proceedings of the Ninth National Conference on Artificial Intelligence, vol. 2, AAAI'91, AAAI Press, 1991, pp. 762–767.
[33] J.-S. Jang, ANFIS: adaptive-network-based fuzzy inference system, IEEE Trans. Syst. Man Cybern. 23 (3) (1993) 665–685.
[34] S. Chiu, Fuzzy model identification based on cluster estimation, J. Intelligent Fuzzy Syst. 2 (1994) 267–278.
[35] R.R. Yager, D.P. Filev, Generation of fuzzy rules by mountain clustering, J. Intelligent Fuzzy Syst. 2 (1994) 209–219.
[36] J. Cohen, A power primer, Psychol. Bull. 112 (1) (1992) 155–159.

Robert Meyer received his B.Sc. in Cognitive Science at the University of Osnabrueck in 2010 and his M.Sc. in Natural Computation at the University of York in 2011. Currently he is a member of the Bernstein Center of Computational Neuroscience in Berlin and pursues his PhD in Neuroinformatics at the Technical University Berlin. His research interests encompass bio-inspired machine learning and information processing in spiking neural networks with applications to stimulus encoding in primary visual cortex.

Simon O'Keefe, Lecturer, Department of Computer Science. Simon O'Keefe, MA (Cantab), M.Sc. (Lancaster), M.Sc., D.Phil. (York) has been a member of both the Advanced Computer Architectures Group and the NonStandard Computation Research Group in Computer Science, and has now become part of YCCSA. His research interests are principally in neural networks and pattern recognition, data mining and novel computation. He has been involved in a wide variety of projects funded by EPSRC, EU and other bodies, applying neural networks to industrial modelling, biological modelling and data mining, and image analysis.