Computer Methods and Programs in Biomedicine 106 (2012) 287–307
An associative memory approach to medical decision support systems

Mario Aldape-Pérez a,b,∗, Cornelio Yáñez-Márquez a, Oscar Camacho-Nieto a, Amadeo J. Argüelles-Cruz a

a Center for Computing Research, CIC-IPN Building, Nueva Industrial Vallejo, G.A. Madero, Mexico City 07738, Mexico
b Superior School of Computing, ESCOM-IPN Building, Lindavista, G.A. Madero, Mexico City 07738, Mexico

Article history: Received 10 September 2010; received in revised form 1 April 2011; accepted 13 May 2011.

Keywords: Associative memories; Decision support systems; Supervised Machine Learning algorithms; Pattern classification

Abstract: Classification is one of the key issues in medical diagnosis. In this paper, a novel approach to performing pattern classification tasks is presented. This model is called the Associative Memory based Classifier (AMBC). Throughout the experimental phase, the proposed algorithm is applied to help diagnose diseases; in particular, it is applied to seven different diagnosis problems in the medical field. The performance of the proposed model is validated by comparing the classification accuracy of AMBC against the performance achieved by twenty other well-known algorithms. Experimental results show that AMBC achieved the best performance in three of the seven pattern classification problems in the medical field. Similarly, it should be noted that our proposal achieved the best classification accuracy averaged over all datasets. © 2011 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
Expert systems (ESs) as we know them today have their origins in the ground-breaking work of Feigenbaum, Buchanan and Lederberg [1–3] in the late sixties and early seventies. From that time until now, demonstrable successes of ESs have resulted in the emergence of knowledge-based applications and, more particularly, of decision support systems. Unlike most daily decisions, many health-care decisions have important implications for the quality of life of the patient, and involve significant uncertainties and trade-offs. The uncertainties may concern the diagnosis, the accuracy of available diagnostic tests, the prevalence of the disease and its attendant risk factors. For such complex decisions, which are inherently affected by so many uncertainties, it is indispensable to have computational tools that help identify which variables of the problem should have a major impact on the decision. It is also necessary to apply effective mathematical models, as well as efficient algorithms, that reduce the level of uncertainty in the diagnosis of the disease. Early models of learning matrices appeared more than four decades ago [4–6], and since then associative memories have attracted the attention of major research groups worldwide. From a connectionist perspective, an associative memory can be considered a special case of the neural computing approach to pattern recognition [7–9]. Furthermore, associative memories have a number of properties, including rapid,
This work was supported by the Science and Technology National Council of Mexico under Grant No. 174952, by the National Polytechnic Institute of Mexico (Project No. SIP-IPN 20101709) and by the ICyTDF (Grants No. PIUTE10-77 and PICSO10-85).
∗ Corresponding author at: Center for Computing Research, CIC-IPN Building, Nueva Industrial Vallejo, G.A. Madero, Mexico City 07738, Mexico. Tel.: +52 55 5729 6000x52032, 56584; fax: +52 55 5754 0506. E-mail address: [email protected] (M. Aldape-Pérez). URL: http://www.aldape.org.mx (M. Aldape-Pérez).
0169-2607/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2011.05.002
computationally efficient best-match retrieval and intrinsic noise tolerance, that make them ideal for many applications [10–12]. As a consequence, associative memories have emerged as a computational paradigm for solving pattern recognition tasks efficiently. The first known mathematical model of associative memory is the Lernmatrix, developed in 1961 by the German scientist Karl Steinbuch, who published his article in the German journal Kybernetik [4]. Eight years after the Lernmatrix, Scottish scientists created the Correlograph, an elementary optical device able to behave as an associative memory [13]. In 1972, supported by UCLA, James A. Anderson proposed his Interactive Memory [12]. In April of the same year, Teuvo Kohonen, then a professor at the Helsinki University of Technology, introduced his Correlation Matrix Memories [9]. Three months later, Kaoru Nakano from Todai (Tokyo University) unveiled his Associatron [14]. In that year, Shun-ichi Amari, a professor at Todai, published a theoretical work on self-organizing nets of threshold elements [15]. This work by Amari set a precedent for what would become one of the most important associative memory models: the Hopfield model. The ideas of Anderson and Kohonen, and to some extent Nakano's, gave rise to the model currently known as the Linear Associator. In 1982, John J. Hopfield [16] published an iterative model based on spin glasses. Two years later he published a second article, in which he introduced an extension of the original model: a continuous model [17]. Hopfield's results caused great excitement throughout the associative memory and neural network communities, so much so that many scientists who had so far remained on the sidelines became interested in these topics. Thus, in the late 1980s many scientists took the classic models and gave birth to new kinds of associative memories [18]. Hopfield models are also appealing to many cognitive modelers because of their apparent similarity to human episodic memory: they can recall patterns after a single exposure using a Hebbian learning rule, and they are capable of retrieval from partial or noisy patterns [19]. The Hopfield model is one of the most popular models that use Hebbian learning, and it owes some of its advantages regarding the learning and recall of altered patterns to Hebbian learning rules [20]. Among the myriad contributions and innovations in the field of associative memories, an associative memory approach for pattern recognition termed the Distributed Hierarchical Graph Neuron (DHGN) was presented by Khan and Muhamad Amin [21]. DHGN is a scalable, distributed, one-shot learning pattern recognition algorithm which uses graph representations for pattern matching without increasing the computational complexity of the algorithm. This model has been successfully tested on character patterns with structural and random distortions. The pattern recognition process is completed in one shot and within a fixed number of steps [22–26]. In 1998, Ritter et al. [27] introduced a novel class of artificial neural networks, called morphological neural networks, in which the operations of multiplication and addition are replaced by addition and maximum (or minimum),
respectively. By taking the maximum (or minimum) of sums instead of the sum of products, morphological network computation is nonlinear before any application of a nonlinear activation function. The main difference between morphological associative memories [27–38] and classical associative memories, such as the Linear Associator and the Hopfield model, is that, while the classical models base their operation on the usual addition and multiplication operations over the ring of rational numbers, morphological memories are based on two lattice operations, dilation and erosion, which are immersed in a belt [39]. According to Haykin [40], the Hopfield model is a classical example of a recurrent neural network which is, at the same time, an associative memory model. Even though the Hopfield model has been a cornerstone for both neural networks and associative memories, it has two crippling disadvantages. First, the model shows a very low recall capacity, about 0.15n, where n is the dimension of the stored patterns. Second, the Hopfield model is autoassociative, which means that it is not able to learn, and thus recall, input patterns that are different from the output patterns. In the late 1980s, Kosko [41] developed a heteroassociative memory from two Hopfield memories with the aim of remedying the second disadvantage of the Hopfield model. The resulting Bidirectional Associative Memory (BAM) is based on an iterative algorithm, just as the Hopfield model is. Associative models, including both associative memories and BAMs, have found applications in many fields of human endeavour. They have been widely used to create knowledge databases for expert agents [42], as classifiers [43], for data compression [44], fingerprint recognition [45] and border detection [46], and as English–Spanish/Spanish–English translators [47,48], among others. In this paper, an associative memory based classifier is presented. The paper is organized as follows. In Section 2, a succinct description of associative memory fundamentals is presented. In Section 3, the foundations of the Associative Memory based Classifier (AMBC) are presented. Section 4 provides a concise description of the most important characteristics of the datasets that were used as test sets to validate the experimentation. Section 5 provides a brief description of each of the algorithms that were used during the experimental phase. Section 6 discusses how to make a consistent comparison between the classification performance achieved by our proposal and that achieved by some well known algorithms on different pattern classification problems. Section 7 describes how the experimental phase was conducted. In Section 8, the classification accuracy results achieved by each of the compared algorithms on seven different pattern classification problems in the medical field are presented. Finally, the advantages of the Associative Memory based Classifier, as well as some conclusions, are discussed in Section 9.
2. Associative memories
An associative memory M is a system that relates input patterns and output patterns as follows:
x → M → y
with x and y the input and output pattern vectors, respectively. Each input vector forms an association with its corresponding output vector. For each positive integer μ, the corresponding association will be denoted as (x^μ, y^μ). An associative memory M is represented by a matrix whose ij-th component is m_ij. An associative memory M is generated from an a priori finite set of known associations, called the fundamental set of associations. If μ is an index, the fundamental set is represented as {(x^μ, y^μ) | μ = 1, 2, ..., p}, with p as the cardinality of the set. The patterns that form the fundamental set are called fundamental patterns. If it holds that x^μ = y^μ ∀ μ ∈ {1, 2, ..., p}, M is autoassociative; otherwise it is heteroassociative, in which case it is possible to establish that ∃ μ ∈ {1, 2, ..., p} for which x^μ ≠ y^μ. If we consider the fundamental set of patterns {(x^μ, y^μ) | μ = 1, 2, ..., p}, where n and m are the dimensions of the input patterns and output patterns, respectively, it is said that x^μ ∈ A^n, A = {0, 1}, and y^μ ∈ A^m. Then the j-th component of an input pattern x^μ is x_j^μ ∈ A. Analogously, the i-th component of an output pattern y^μ is represented as y_i^μ ∈ A. Therefore, the fundamental input and output patterns are represented as follows:

x^μ = (x_1^μ, x_2^μ, ..., x_n^μ)^t ∈ A^n,    y^μ = (y_1^μ, y_2^μ, ..., y_m^μ)^t ∈ A^m

A distorted version of a pattern x to be recalled will be denoted as x̃. An unknown input pattern to be recalled will be denoted as x^ω. If, when an unknown input pattern x^ω is fed to an associative memory M, the output corresponds exactly to the associated pattern y^ω, it is said that recalling is correct.

3. Associative memory based classifier

In any associative memory there are two phases that determine the particular performance of each model, namely the learning phase and the classification phase. In our proposal, the Associative Memory based Classifier (AMBC), besides the two phases that are intrinsic to every associative memory, incorporates a procedure for estimating the quality of learning. In what follows, let M be an associative memory whose ij-th component is denoted by m_ij. Let y^μ ∈ A^m, A = {0, 1}, be the μ-th fundamental output pattern of size m ∈ Z+, and let μ be an index such that μ ∈ {1, 2, ..., p}, with p as the cardinality of the set. Let x^ω ∈ R^n be an unknown input pattern to be classified, where n ∈ Z+ is the dimension of the unknown input pattern, and let r be an index such that r ∈ {1, 2, ..., (2^n − 1)}.

Definition 3.1. Let x^1, x^2, ..., x^p be fundamental input patterns; the mean vector x̄ is obtained according to the following expression:

x̄ = (1/p) ∑_{μ=1}^{p} x^μ    (1)

Definition 3.2. Let x̂^1, x̂^2, ..., x̂^p be displaced input patterns, obtained according to the following expression:

x̂^μ = (x^μ − x̄)    ∀ μ ∈ {1, 2, ..., p}    (2)

Definition 3.3. Let m ∈ Z+ be the number of different classes and let k ∈ {1, 2, ..., m} be the class to which a fundamental input pattern x^μ belongs.

Definition 3.4. Let y^1, y^2, ..., y^p be the fundamental output patterns; the i-th component of each fundamental output pattern is coded according to the following expression:

y_i^μ = 1 if i = k, and y_i^μ = 0 if i ∈ {1, 2, ..., k − 1, k + 1, ..., m}    ∀ μ ∈ {1, 2, ..., p}    (3)

Definition 3.5. Let x^ω ∈ R^n be an unknown input pattern to be classified; a displaced unknown input pattern to be classified, denoted as x̂^ω, is obtained according to the following expression:

x̂^ω = (x^ω − x̄)    (4)

Definition 3.6. Let A = {0, 1} and let e^r be the r-th learning reinforcement vector of size n, represented as:

e^r = (e_1^r, e_2^r, ..., e_n^r)^t ∈ A^n    (5)

Definition 3.7. Let A = {0, 1} and let n ∈ Z+ be the dimension of an input pattern. Given an integer value r ∈ Z+, the IntegerToVector operator takes r as input and returns a column vector e^r with the value of r expressed in its binary representation. Note that e_1^r is the Most Significant Bit (MSB) while e_n^r is the Least Significant Bit (LSB).

In order to understand the IntegerToVector operator, consider the following example.

Example 3.1. Let A = {0, 1}, let n = 4 and let r = 11. Obtain the r-th learning reinforcement vector of size n by applying the IntegerToVector operator as stated in Definition 3.7. To convert an integer to its binary representation, we divide r by two repeatedly until the quotient is zero. If we take the remainder of each division, then the number 11 can be expressed as

11 = (1 × 2^3) + (0 × 2^2) + (1 × 2^1) + (1 × 2^0)

After applying the IntegerToVector operator, we obtain the 11th learning reinforcement vector of size 4:

e^11 = (1, 0, 1, 1)^t

In summary, the IntegerToVector operator yields a column vector e^r with r ∈ Z+ expressed in its binary representation.
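To make Definition 3.7 concrete, the following is a minimal Python sketch of the IntegerToVector operator; it reproduces the result of Example 3.1. The function name integer_to_vector and the use of NumPy are our own illustrative choices, not part of the original paper.

```python
import numpy as np

def integer_to_vector(r, n):
    # IntegerToVector operator (Definition 3.7): the n-bit binary
    # representation of r, most significant bit first.
    return np.array([(r >> (n - 1 - j)) & 1 for j in range(n)])

# Example 3.1: r = 11, n = 4 yields e^11 = (1, 0, 1, 1)^t
print(integer_to_vector(11, 4))   # -> [1 0 1 1]
```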
3.1. Learning phase

Find adequate operators and a way to generate an associative memory M that will store the p associations of the fundamental set. Note that there are m different classes, so each one of the input patterns belongs to a class k ∈ {1, 2, ..., m}, represented by a column vector whose components are coded as stated in expression (3). Obtain an associative memory M by performing the following steps:

1. Given the fundamental set of associations {(x^μ, y^μ) | μ = 1, 2, ..., p}, obtain the displaced fundamental set of associations {(x̂^μ, y^μ) | μ = 1, 2, ..., p} using expression (1), expression (2) and expression (3).
2. Consider each one of the p associations (x̂^μ, y^μ), so that an m × n matrix is obtained according to the following expression:

y^μ · (x̂^μ)^t = [ y_i^μ x̂_j^μ ]_{m×n}    (6)

where

y^μ · (x̂^μ)^t = (y_1^μ, y_2^μ, ..., y_m^μ)^t · (x̂_1^μ, x̂_2^μ, ..., x̂_n^μ) =
[ y_1^μ x̂_1^μ   ···   y_1^μ x̂_j^μ   ···   y_1^μ x̂_n^μ ]
[      ⋮                  ⋮                   ⋮       ]
[ y_i^μ x̂_1^μ   ···   y_i^μ x̂_j^μ   ···   y_i^μ x̂_n^μ ]
[      ⋮                  ⋮                   ⋮       ]
[ y_m^μ x̂_1^μ   ···   y_m^μ x̂_j^μ   ···   y_m^μ x̂_n^μ ]    (7)

3. Obtain an associative memory M by adding all the p matrices according to the following expression:

M = ∑_{μ=1}^{p} y^μ · (x̂^μ)^t    (8)

In this way, the ij-th component of the associative memory M is expressed as follows:

m_ij = ∑_{μ=1}^{p} y_i^μ x̂_j^μ

3.2. Classification phase

This phase consists of finding the class to which an unknown input pattern x^ω ∈ R^n belongs. Finding the class means obtaining the y^ω ∈ A^m that corresponds to x^ω.

Definition 3.8. Let y^ω be a column vector that represents the classification result of a displaced test pattern x̂^ω ∈ R^n; the i-th component of y^ω is obtained according to the following expression:

y_i^ω = 1 if ∑_{j=1}^{n} m_ij · x̂_j^ω · e_j^r = θ, and y_i^ω = 0 otherwise    (9)

where θ represents the maximum threshold value

θ = ⋁_{h=1}^{m} [ ∑_{j=1}^{n} m_hj · x̂_j^ω · e_j^r ]    (10)

and ⋁ is the maximum operator.

3.2.1. Classification accuracy

The classification accuracy of any algorithm can be estimated by taking into account the overall number of test patterns that are correctly classified. In the present paper, classification accuracy results were estimated using the following expression:

accuracy(T) = [ ∑_{ω=1}^{|T|} assess(x^ω) ] / |T| ,    x^ω ∈ T    (11)

where T is the set of unknown input patterns to be classified (the test set). Each time the classification result of a test pattern x^ω ∈ T is equal to the actual condition of that pattern, an integer value equal to 1 is assigned to the assessment function, as shown in the following expression:

assess(x^ω) = 1 if classify(x^ω) = y, and 0 otherwise    (12)

where y is the actual condition of a test pattern x^ω and classify(x^ω) returns the classification result of the test pattern x^ω given by the AMBC algorithm.

3.3. AMBC algorithm

This section describes the proposed algorithm, called the Associative Memory based Classifier (AMBC). The algorithm is divided into three phases: the first is the learning phase, the second is the learning reinforcement phase and the third is the classification phase. Given the fundamental set of patterns {(x^μ, y^μ) | μ = 1, 2, ..., p}, with p as the cardinality of the set, an Associative Memory based Classifier is obtained by following the steps outlined below.
Fig. 1 – Learning phase. At this stage we obtain the mean vector x̄, according to expression (1).
3.3.1. Learning phase

1. Let n be the dimension of each input pattern in the fundamental set, grouped into m different classes.
2. Obtain the mean vector x̄, according to expression (1).
3. Obtain the displaced input patterns x̂^1, x̂^2, ..., x̂^p, according to expression (2).
4. Each one of the input patterns belongs to a class k ∈ {1, 2, ..., m}, represented by a column vector whose components are assigned as y_k^μ = 1 and y_j^μ = 0 for j = 1, 2, ..., k − 1, k + 1, ..., m, as stated in expression (3).
5. Create a classifier using expression (6), expression (7) and expression (8).

As a result of the learning phase, we obtain an associative memory M (Figs. 1–3).
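The following NumPy sketch summarizes the learning phase (expressions (1)–(3) and (6)–(8)). It is only an illustration, assuming patterns are stored row-wise in an array X and class labels are the integers 1..m; the function name ambc_learn is ours, not the paper's.

```python
import numpy as np

def ambc_learn(X, labels, m):
    """Learning phase sketch.

    X      : p x n array, one fundamental input pattern per row
    labels : length-p integer array with the class k (1..m) of each pattern
    m      : number of classes
    """
    x_bar = X.mean(axis=0)                      # mean vector, expression (1)
    X_disp = X - x_bar                          # displaced patterns, expression (2)
    Y = np.zeros((X.shape[0], m))
    Y[np.arange(X.shape[0]), labels - 1] = 1.0  # one-hot output patterns, expression (3)
    M = Y.T @ X_disp                            # sum of outer products, expressions (6)-(8)
    return M, x_bar
```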
Fig. 2 – Learning phase. At this stage we obtain the displaced input patterns x̂^1, x̂^2, ..., x̂^p, according to expression (2).

3.3.2. Learning reinforcement phase
1. Initialize r = 1.
2. Initialize r_max = 2^n − 1.
3. Use the IntegerToVector operator to get the r-th learning reinforcement vector of size n, as stated in expression (5).
4. Classify the fundamental set of patterns {(x̂^μ, y^μ) | μ = 1, 2, ..., p} that was used during the learning phase, according to expression (9), so that an r-th classification accuracy parameter is obtained.
5. Store both parameters (the r-th classification accuracy parameter and the r-th learning reinforcement vector).
6. Compare the r-th classification accuracy parameter with the (r − 1)-th classification accuracy parameter. The best classification accuracy value is stored.
7. Increment r.
8. If r < r_max, go to Step 3 of Section 3.3.2.
9. Else, go to Step 10 of Section 3.3.2.
10. End of the learning reinforcement phase.
As a result of the learning reinforcement phase, we obtain the r-th learning reinforcement vector of size n, which allows us to reinforce learning.
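A compact sketch of the learning reinforcement phase, together with the classification rule of expressions (9) and (10), is given below. It assumes the memory M and mean vector x̄ produced by the learning-phase sketch above; names such as ambc_classify and ambc_reinforce are illustrative only, and ties at the maximum threshold are resolved here by taking the first maximal row.

```python
import numpy as np

def ambc_classify(M, x_bar, x, e):
    # Expressions (9)-(10): displace the pattern, mask it with the learning
    # reinforcement vector e, and pick the row of M that reaches the maximum
    # threshold value (ties broken by the first maximum).
    scores = M @ ((x - x_bar) * e)
    return int(np.argmax(scores)) + 1          # predicted class k in 1..m

def ambc_reinforce(M, x_bar, X, labels):
    # Try every r in 1..2^n - 1 and keep the reinforcement vector that
    # classifies the fundamental set best (the quality-of-learning estimate).
    n = X.shape[1]
    best_e, best_hits = None, -1
    for r in range(1, 2 ** n):
        e = np.array([int(b) for b in np.binary_repr(r, width=n)])
        hits = sum(ambc_classify(M, x_bar, x, e) == k for x, k in zip(X, labels))
        if hits > best_hits:
            best_e, best_hits = e, hits
    return best_e, best_hits
```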
3.3.3. Classification phase
Given an unknown input pattern x^ω ∈ R^n to be classified and the r-th learning reinforcement vector of size n, obtain the unambiguously recalled class vector y^ω.

1. Obtain the displaced input test pattern x̂^ω, according to expression (2).
2. Classify the displaced input test pattern x̂^ω, according to expression (9).

The classification phase is applied repeatedly for each unknown input pattern x^ω ∈ R^n to be classified. In order to illustrate each step of the proposed algorithm, consider the following example.

Notation 3.1. Numerical values of the patterns used in this example were randomly taken from the Haberman survival dataset [49]. Each instance of this database has three attributes and a class label. The most important characteristics of this dataset are summarized in Table 3, while a more detailed description of its contents appears in Section 4.1.

Example 3.2. Let p = 8 be the cardinality of the fundamental set of associations and let n = 3 be the dimension of the fundamental input patterns. The fundamental set of associations consists of pairs {(x^μ, y^μ) | μ = 1, 2, ..., 8}. Each input pattern x^μ is a column vector whose components take values in R. Similarly, each output pattern y^μ is a column vector whose components are assigned according to expression (3). Fundamental input patterns x^1, x^2, x^3, x^4 belong to class 1, while x^5, x^6, x^7, x^8 belong to class 2. The fundamental input patterns are as follows:

x^1 = (30, 64, 1)^t,  x^2 = (30, 62, 3)^t,  x^3 = (30, 65, 0)^t,  x^4 = (31, 59, 2)^t
x^5 = (34, 59, 0)^t,  x^6 = (34, 66, 9)^t,  x^7 = (38, 69, 21)^t,  x^8 = (39, 66, 0)^t

As indicated in step 2 of Section 3.3.1, obtain the mean vector x̄, according to expression (1):

x̄ = (33.25, 63.75, 4.5)^t

As indicated in step 3 of Section 3.3.1, obtain the displaced input patterns x̂^1, x̂^2, ..., x̂^8, according to expression (2); that is

x̂^1 = (−3.25, 0.25, −3.50)^t,  x̂^2 = (−3.25, −1.75, −1.50)^t
x̂^3 = (−3.25, 1.25, −4.50)^t,  x̂^4 = (−2.25, −4.75, −2.50)^t
x̂^5 = (0.75, −4.75, −4.50)^t,  x̂^6 = (0.75, 2.25, 4.50)^t
x̂^7 = (4.75, 5.25, 16.50)^t,  x̂^8 = (5.75, 2.25, −4.50)^t

Fig. 3 – Learning phase. As a result of the learning phase, we obtain an associative memory M.

Once the displaced input patterns x̂^1, x̂^2, ..., x̂^8 are available, obtain their corresponding output patterns according to expression (3); that is

y^1 = y^2 = y^3 = y^4 = (1, 0)^t,    y^5 = y^6 = y^7 = y^8 = (0, 1)^t

As indicated in step 5 of Section 3.3.1, create a classifier using expression (6), expression (7) and expression (8):

M =
[ −12  −5  −12 ]
[  12   5   12 ]

As a result of the learning phase, we obtain an associative memory M whose dimensions are m × n.
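As a quick, self-contained check of the learning-phase numbers above, the following snippet (ours, not part of the original paper) reproduces x̄ and M for Example 3.2:

```python
import numpy as np

# Fundamental input patterns of Example 3.2 (one per row) and their classes
X = np.array([[30, 64, 1], [30, 62, 3], [30, 65, 0], [31, 59, 2],
              [34, 59, 0], [34, 66, 9], [38, 69, 21], [39, 66, 0]], dtype=float)
labels = np.array([1, 1, 1, 1, 2, 2, 2, 2])

x_bar = X.mean(axis=0)                      # -> [33.25 63.75  4.5 ]
Y = np.zeros((8, 2)); Y[np.arange(8), labels - 1] = 1.0
M = Y.T @ (X - x_bar)                       # -> [[-12. -5. -12.], [12. 5. 12.]]
print(x_bar, M, sep="\n")
```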
The next step is to apply the learning reinforcement phase. Throughout this phase, we conduct an iterative process to find the r-th learning reinforcement vector of size n that allows us to reinforce learning (Figs. 4–6). As indicated in step 1 and step 2 of Section 3.3.2, initialize r = 1 and r_max = 2^3 − 1. Applying step 3 of Section 3.3.2, we have

e^1 = (0, 0, 1)^t

Fig. 4 – Learning reinforcement phase. Use the IntegerToVector operator to get the r-th learning reinforcement vector of size n, as stated in expression (5).

Fig. 5 – Learning reinforcement phase. Classify the fundamental set of patterns {(x̂^μ, y^μ) | μ = 1, 2, ..., p} that was used during the learning phase, according to expression (9), so that an r-th classification accuracy parameter is obtained.

As indicated in step 4 of Section 3.3.2, classify the displaced fundamental set of patterns that was used during the learning phase. Calculate the first component of the vector y^1 according to expression (9) from Definition 3.8:

(−12 × −3.25 × 0) + (−5 × 0.25 × 0) + (−12 × −3.50 × 1) = 42

Calculate the second component of the vector y^1 according to expression (9) from Definition 3.8:

(12 × −3.25 × 0) + (5 × 0.25 × 0) + (12 × −3.50 × 1) = −42

so that

M · x̂^1 · e^1 = (42, −42)^t,    θ = 42

According to expression (10) from Definition 3.8, the maximum threshold value for this pattern is θ = 42; in this way, the classification result of the displaced test pattern x̂^1 is obtained according to expression (9) from Definition 3.8:

y^1 = (42, −42)^t → (1, 0)^t

As can be seen, the classification result of the displaced test pattern x̂^1 is equal to the actual condition of that pattern; thus, the displaced test pattern x̂^1 was correctly classified (Figs. 7–9).

For the second pattern, we have the following result:

M · x̂^2 · e^1 = (18, −18)^t,    θ = 18

thus, the classification result of the displaced test pattern x̂^2 is

y^2 = (18, −18)^t → (1, 0)^t

As can be seen, the displaced test pattern x̂^2 was correctly classified.
Fig. 6 – Learning reinforcement phase. As a result of the learning reinforcement phase, we obtain the r-th learning reinforcement vector of size n.

Fig. 7 – Classification phase. Given an unknown input pattern x^ω ∈ R^n to be classified and the r-th learning reinforcement vector of size n, obtain the unambiguously recalled class vector y^ω.

For the third pattern, we have the following result:

M · x̂^3 · e^1 = (54, −54)^t,    θ = 54

thus, the classification result of the displaced test pattern x̂^3 is

y^3 = (54, −54)^t → (1, 0)^t

As can be seen, the displaced test pattern x̂^3 was correctly classified.

For the fourth pattern, we have the following result:

M · x̂^4 · e^1 = (30, −30)^t,    θ = 30

thus, the classification result of the displaced test pattern x̂^4 is

y^4 = (30, −30)^t → (1, 0)^t

As can be seen, the displaced test pattern x̂^4 was correctly classified.

For the fifth pattern, we have the following result:

M · x̂^5 · e^1 = (54, −54)^t,    θ = 54

thus, the classification result of the displaced test pattern x̂^5 is

y^5 = (54, −54)^t → (1, 0)^t

As can be seen, the classification result of the displaced test pattern x̂^5 is different from the actual condition of that pattern, that is

(1, 0)^t ≠ (0, 1)^t

consequently, the displaced test pattern x̂^5 was not correctly classified.
Fig. 8 – Classification phase. Obtain the displaced input test pattern x̂^ω, according to expression (2).

Fig. 9 – Classification phase. Classify the displaced input test pattern x̂^ω, according to expression (9).

For the sixth pattern, we have the following result:

M · x̂^6 · e^1 = (−54, 54)^t,    θ = 54

thus, the classification result of the displaced test pattern x̂^6 is

y^6 = (−54, 54)^t → (0, 1)^t

As can be seen, the displaced test pattern x̂^6 was correctly classified.

For the seventh pattern, we have the following result:

M · x̂^7 · e^1 = (−198, 198)^t,    θ = 198

thus, the classification result of the displaced test pattern x̂^7 is

y^7 = (−198, 198)^t → (0, 1)^t

As can be seen, the displaced test pattern x̂^7 was correctly classified.

For the eighth pattern, we have the following result:

M · x̂^8 · e^1 = (54, −54)^t,    θ = 54

thus, the classification result of the displaced test pattern x̂^8 is

y^8 = (54, −54)^t → (1, 0)^t

As can be seen, the classification result of the displaced test pattern x̂^8 is different from the actual condition of that pattern, that is

(1, 0)^t ≠ (0, 1)^t
consequently, the displaced test pattern x̂^8 was not correctly classified.

In summary, for r = 1, six out of eight patterns were correctly classified. According to step 5 of Section 3.3.2, store the classification performance for r = 1. As indicated in step 7 of Section 3.3.2, increment r. This procedure is repeated for each of the patterns, but with different values of r.

Notation 3.2. If we take the number of patterns that were correctly classified for each value of r, we can estimate the quality of learning; that is, if we take the number of patterns that were correctly classified, we can identify the r-th learning reinforcement vector of size n that allows us to reinforce learning.

Table 1 shows the classification performance achieved with different values of r, ranging from r = 1 to r = 4. Table 2 shows the classification performance achieved with different values of r, ranging from r = 5 to r = 7.

Table 1 – Classification performance achieved with different values of r, ranging from r = 1 to r = 4.

Pattern | r = 1 | r = 2 | r = 3 | r = 4
x̂^1 | √ | × | √ | √
x̂^2 | √ | √ | √ | √
x̂^3 | √ | × | √ | √
x̂^4 | √ | √ | √ | √
x̂^5 | × | × | × | √
x̂^6 | √ | √ | √ | √
x̂^7 | √ | √ | √ | √
x̂^8 | × | √ | × | √
Accuracy (%) | 75.0 | 62.5 | 75.0 | 100.0

Table 2 – Classification performance achieved with different values of r, ranging from r = 5 to r = 7.

Pattern | r = 5 | r = 6 | r = 7
x̂^1 | √ | √ | √
x̂^2 | √ | √ | √
x̂^3 | √ | √ | √
x̂^4 | √ | √ | √
x̂^5 | × | × | ×
x̂^6 | √ | √ | √
x̂^7 | √ | √ | √
x̂^8 | √ | √ | √
Accuracy (%) | 87.5 | 87.5 | 87.5

We can see from Table 1 that for r = 1 there were two instances wrongly classified, for r = 2 there were three instances wrongly classified, for r = 3 there were two instances wrongly classified, and for r = 4 all instances were correctly classified. As we can see from Table 2, for r = 5, r = 6 and r = 7 there was only one instance wrongly classified. As a result of the learning reinforcement phase, we obtain the r-th learning reinforcement vector of size n that allows us to reinforce learning. Considering the results shown in Table 1 and Table 2, we can see that the best classification performance is achieved for r = 4. In this case, the 4th learning reinforcement vector of size n is e^4 = (1, 0, 0)^t.

The next step is to apply the classification phase as stated in Section 3.3.3. Given an unknown input pattern x^ω ∈ R^n to be classified and the r-th learning reinforcement vector of size n, obtain the unambiguously recalled class vector y^ω.

Notation 3.3. Numerical values of the test patterns (unknown input patterns to be classified) used in this example were randomly taken from the Haberman survival dataset [49]. Each instance of this database has three attributes and a class label. The most important characteristics of this dataset are summarized in Table 3, while a more detailed description of its contents appears in Section 4.1.

Table 3 – Characteristics of datasets used in the experimental phase.

Dataset | Instances | Attributes | Missing values
1. Haberman | 306 | 3 | No
2. Liver | 345 | 6 | No
3. Inflammation | 120 | 6 | No
4. Diabetes | 768 | 8 | No
5. Breast | 699 | 9 | Yes
6. Heart | 270 | 13 | No
7. Hepatitis | 155 | 19 | Yes

The test set consists of the following patterns:

x^9 = (31, 65, 4)^t,  x^10 = (33, 58, 10)^t,  x^11 = (41, 60, 23)^t,  x^12 = (41, 64, 0)^t

x^9 and x^10 belong to class 1, while x^11 and x^12 belong to class 2. In the same way as with the training patterns, obtain a set of displaced test patterns. As indicated in step 1 of Section 3.3.3, obtain the displaced test patterns x̂^9, x̂^10, ..., x̂^12, according to expression (2); that is

x̂^9 = (−2.25, 1.25, −0.50)^t,  x̂^10 = (−0.25, −5.75, 5.50)^t
x̂^11 = (7.75, −3.75, 18.50)^t,  x̂^12 = (7.75, 0.25, −4.50)^t

As we can see from Table 1, the r-th learning reinforcement vector of size n that allows us to reinforce learning is e^4. As indicated in step 2 of Section 3.3.3, classify the displaced test patterns x̂^9, x̂^10, ..., x̂^12 according to expression (9). For the ninth pattern, we have the following result:

M · x̂^9 · e^4 = (27, −27)^t
According to expression (10) from Definition 3.8, the maximum threshold value for this pattern is θ = 27; thus, the classification result of the displaced test pattern x̂^9 is obtained according to expression (9) from Definition 3.8:

y^9 = (27, −27)^t → (1, 0)^t

As can be seen, the displaced test pattern x̂^9 was correctly classified.

For the tenth pattern, we have the following result:

M · x̂^10 · e^4 = (3, −3)^t,    θ = 3

thus, the classification result of the displaced test pattern x̂^10 is

y^10 = (3, −3)^t → (1, 0)^t

As can be seen, the displaced test pattern x̂^10 was correctly classified.

For the eleventh pattern, we have the following result:

M · x̂^11 · e^4 = (−93, 93)^t,    θ = 93

thus, the classification result of the displaced test pattern x̂^11 is

y^11 = (−93, 93)^t → (0, 1)^t

As can be seen, the displaced test pattern x̂^11 was correctly classified.

For the twelfth pattern, we have the following result:

M · x̂^12 · e^4 = (−93, 93)^t,    θ = 93

thus, the classification result of the displaced test pattern x̂^12 is

y^12 = (−93, 93)^t → (0, 1)^t

As can be seen, the displaced test pattern x̂^12 was correctly classified.

In summary, in this example we have shown the steps of the proposed algorithm. We obtained a vector of size n that allows us to reinforce learning, and we have also shown the behavior of the proposed algorithm when classifying unknown input patterns.
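The classification of the four test patterns above can be reproduced with a few lines of code. This is only an illustrative check: the arrays below are the example's values, while the variable names are ours.

```python
import numpy as np

M = np.array([[-12.0, -5.0, -12.0], [12.0, 5.0, 12.0]])
x_bar = np.array([33.25, 63.75, 4.5])
e4 = np.array([1, 0, 0])                       # 4th learning reinforcement vector

tests = np.array([[31, 65, 4], [33, 58, 10], [41, 60, 23], [41, 64, 0]], dtype=float)
for x in tests:
    scores = M @ ((x - x_bar) * e4)            # expressions (9)-(10)
    print(scores, "-> class", np.argmax(scores) + 1)
# classes printed: 1, 1, 2, 2 (patterns 9-12, all correctly classified)
```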
3.4. AMBC algorithm complexity analysis
Complexity theory investigates the amount of computational resources needed to execute an algorithm. An algorithm is a finite set of precise rules for a computational procedure that solves a problem [50]. It is generally accepted that an algorithm provides a satisfactory solution when it produces a correct answer efficiently. The efficiency of an algorithm can be estimated in two ways. One measure of efficiency is the time required by the computer to solve a problem using a given algorithm. A second measure of efficiency is the amount of memory required to implement that algorithm when the input data are of a given size. In this section we analyze the behavior of the proposed algorithm taking into account time complexity as well as space complexity.
3.4.1. Time complexity
The worst-case time complexity of an algorithm is defined as a function of the size of the input. For a given input size, the worst-case time complexity is the maximal number of execution steps needed to execute the program on an arbitrary input of that size. The operations used to measure time complexity can be single-precision floating point comparison, single-precision floating point addition, single-precision floating point division, variable assignment, logical comparison, or any other elemental operation. The following notation is defined:
EO: elemental operation. n: dimension of input patterns. p: cardinality of the fundamental set of patterns.
Notation 3.4. Only the learning reinforcement phase will be analyzed, since it is the phase that requires the greatest number of elemental operations.
***************************************
Learning Reinforcement Phase
***************************************
 1  r_max = (2^(n));
 2  for r = 1 : r_max - 1
 3      class_hit = 0;
 4      class_miss = 0;
 5      e_r = int_to_vector(r);
 6      for i = 1:p
 7          y_mu_1 = sum(x_mu(i) .* e_r .* M(1));
 8          y_mu_2 = sum(x_mu(i) .* e_r .* M(2));
 9          if y_mu_1 > y_mu_2
10              class_label = class_1;
11          else
12              class_label = class_2;
13          end
14          if class_label == x_mu(i, n)
15              class_hit = class_hit + 1;
16          else
17              class_miss = class_miss + 1;
18          end
19      end
20  end
***************************************
***************************************
Time complexity analysis
***************************************
Line 1:   1 EO, assignment
Line 2:   max_iter EO, comparison
Line 3:   max_iter EO, assignment
Line 4:   max_iter EO, assignment
Line 5:   max_iter · n EO, assignment
Line 6:   max_iter · p EO, comparison
Line 7a:  max_iter · n · p EO, multiplication
Line 7b:  max_iter · n · p EO, multiplication
Line 7c:  max_iter · n · p EO, addition
Line 7d:  max_iter · p EO, assignment
Line 8a:  max_iter · n · p EO, multiplication
Line 8b:  max_iter · n · p EO, multiplication
Line 8c:  max_iter · n · p EO, addition
Line 8d:  max_iter · p EO, assignment
Line 9:   max_iter · p EO, comparison
Line 10:  max_iter · p EO, assignment
Line 14:  max_iter · p EO, comparison
Line 15:  max_iter · p EO, assignment
***************************************

The total number of Elemental Operations, with max_iter = 2^n − 1, is:

Total EOs = 1 + 3(−1 + 2^n) + (−1 + 2^n)n + 7(−1 + 2^n)p + 6(−1 + 2^n)np

By grouping some terms, we have the following:

Total EOs = −2 + 3(2^n) + 7(−1 + 2^n)p + (−1 + 2^n)n(1 + 6p)

If we factor some terms, we have the following:

Total EOs = −2 + 3(2^n) − n + (2^n)n − 7p + 7(2^n)p − 6np + 3(2^{n+1})np

Finally, the equation for the total number of Elemental Operations can be written as

Total EOs = −2 − 7p + (−1 + 2^n)n(1 + 6p) + (2^n)(3 + 7p)

The growth of time and space complexity with increasing input size n is a suitable measure of the efficiency of the algorithm. To obtain an estimate of the complexity of the algorithm when it is applied to a known test set, we took the dataset with the largest number of features, which is the Hepatitis disease dataset [49]. As shown in Table 3, each of its 155 instances has 19 features and a class label. The number of fundamental input patterns is therefore p = 155. The growth of functions is usually described using big-O notation [50].

Definition 3.9. Let f and g be functions from the integers or the real numbers to the real numbers. We say that f(n) is O(g(n)) if there are constants C and k such that f(n) ≤ C g(n) whenever n > k.

With p = 155, the total number of Elemental Operations can be written as

f(n) = Total EOs = 1 + 1088(−1 + 2^n) + 931(−1 + 2^n)n

A function g(n) and constants C and k must be found such that the inequality holds. Dropping the negative terms,

f(n) = 1 − 1088 − 931n + 1088(2^n) + 931n(2^n) ≤ 1(2^n) + 1088(2^n) + 931n(2^n)

Then, taking g(n) = 2^n, C = 20000 and k = 1, we have that

f(n) ≤ 20000 g(n), whenever n > 1

(note that for the feature dimensions considered in this work, n ≤ 19, so this constant suffices). Therefore, f(n) is O(2^n).

3.4.2. Space complexity

The space complexity of a program (for a given input) is the number of elementary objects that the program needs to store during its execution. This number is computed with respect to the size of the input data. Let m be the dimension of an output pattern y^μ, let n be the dimension of an input pattern x^μ, and let μ be an index such that μ ∈ {1, 2, ..., p}, with p as the cardinality of the set. In order to store the fundamental set of patterns, a matrix is needed; this matrix has dimensions p × (n + m). The mean vector x̄, as well as the r-th learning reinforcement vector, has dimensions 1 × n. Similarly, another matrix is needed to store the set of displaced patterns; this matrix also has dimensions p × (n + m). The resulting associative memory M of the learning phase has dimensions m × n. The number of elementary objects that the proposed algorithm needs to store during its execution is therefore:

TotalObj = p × (n + m) + (1 × n) + (1 × n) + p × (n + m) + (m × n)

The number of bytes required to store a single-precision floating point value can be determined by NumOfBytes = sizeof(float). Similarly, the number of bytes required to store an integer value can be determined by NumOfBytes = sizeof(int). It is noteworthy that in either case NumOfBytes = 4. Since each of the components x_j^μ ∈ R of an input pattern x^μ ∈ R^n can be represented by a single-precision floating point value, the number of bytes required to store an input pattern x^μ ∈ R^n is n × NumOfBytes. Similarly, each of the components y_i^μ ∈ A, A = {0, 1}, of an output pattern y^μ ∈ A^m can be represented by an integer value; consequently, the number of bytes required to store an output pattern y^μ ∈ A^m is m × NumOfBytes. The total number of bytes required to implement the proposed algorithm is:

TotalBytes = NumOfBytes · (p(n + m) + n + n + p(n + m) + mn)
TotalBytes = 8p(n + m) + 8n + 4mn
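As an illustration of the storage estimate, the helper below evaluates TotalBytes for the Hepatitis dataset dimensions used in the time complexity estimate (p = 155, n = 19, and m = 2 classes); the function name is ours, not the paper's.

```python
def ambc_total_bytes(p, n, m, num_of_bytes=4):
    # TotalBytes = NumOfBytes * (2*p*(n + m) + 2*n + m*n)
    return num_of_bytes * (2 * p * (n + m) + 2 * n + m * n)

print(ambc_total_bytes(p=155, n=19, m=2))   # -> 26344 bytes
```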
4. Datasets
This section provides a brief description of the most important characteristics of the datasets that were used during the experimental phase. All of these were taken from the University of California at Irvine machine learning repository [49]. Characteristics of datasets used in the experimental phase are shown in Table 3.
4.1. Haberman survival dataset

This database contains cases from a study that was conducted at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The purpose of the dataset is to identify the survival status of patients who had undergone surgery for breast cancer. The Haberman survival dataset consists of 306 instances belonging to two different classes (225 "the patient survived 5 years or longer" cases and 81 "the patient died within 5 years" cases). Each instance consists of 4 attributes, including the class attribute.
4.2. Liver disorders dataset
This database was created by BUPA Medical Research Ltd and was donated by Richard S. Forsyth. This dataset contains cases from a study that was conducted on liver disorders that might arise from excessive alcohol consumption. Liver disorders dataset consists of 345 instances belonging to two different classes. Each instance consists of 7 attributes, including the class attribute.
4.3. Acute inflammations dataset
This database contains cases from a study that was conducted on the diagnosis of diseases of the urinary system. The dataset consists of 120 instances. Each instance consists of 6 attributes and two decision labels. The main idea of this dataset is to perform the diagnosis of two diseases of the urinary system.
4.4. Pima Indians diabetes dataset
This database was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases, U.S. The dataset contains cases from a study that was conducted on female patients of Pima Indian heritage who were at least 21 years old. The dataset consists of 768 instances belonging to two different classes (268 "the patient tested positive for diabetes" cases and 500 "the patient tested negative for diabetes" cases). Each instance consists of 9 attributes, including the class attribute.
4.5. Breast cancer dataset
This database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg, and was donated by Olvi Mangasarian. The dataset contains periodically collected samples of clinical cases. The breast cancer dataset consists of 699 instances belonging to two different classes (458 "benign" cases and 241 "malignant" cases). Each instance consists of 10 attributes, including the class attribute.
4.6. Heart disease dataset
This database comes from the Cleveland Clinic Foundation and was supplied by Robert Detrano, M.D., Ph.D. of the V.A. Medical Center, Long Beach, CA. The purpose of the dataset is to predict the presence or absence of heart disease given the results of various medical tests carried out on a patient. This dataset consists of 270 instances belonging to two different classes: presence and absence (of heart-disease). Each instance consists of 14 attributes, including the class attribute.
4.7. Hepatitis disease dataset
This dataset was donated by the Jozef Stefan Institute, former Yugoslavia, now Slovenia. The purpose of the dataset is to predict the presence or absence of hepatitis disease in a patient. Hepatitis disease dataset consists of 155 instances belonging to two different classes (32 “die” cases, 123 “live” cases). Each instance consists of 20 attributes, 13 binary, 6 attributes with discrete values and a class label.
5. Machine Learning algorithms
This section provides a brief description of each of the algorithms that were used during the experimental phase. It has to be mentioned that, although WEKA 3: Data Mining Software in Java [51] has more than seventy well known algorithms implemented, only the twenty best-performing algorithms were considered for comparison purposes. Further details on the implementation of these algorithms can be found in the following references [52,53].
5.1. AdaBoostM1
The AdaBoost.M1 algorithm, proposed by Yoav Freund and Robert E. Schapire [54], obtains a single composite classifier constructed through the combination of various classifiers produced by repeatedly running a given "weak" learning algorithm on various distributions over the training data. The "weak" learning algorithm is executed for T rounds in order to obtain T "weak" hypotheses; finally, the booster combines the T "weak" hypotheses into a single final hypothesis.
5.2. Bagging
Bagging predictors method [55] works by generating various versions of a predictor and using these to obtain an amalgamated predictor. In order to predict a class label, each predictor casts a vote and a plurality voting scheme is applied. Similarly, when predicting a numeric class value, the multiple versions of a predictor are averaged.
5.3. BayesNet
Bayesian networks are alternative ways of representing a conditional probability distribution by means of directed acyclic graphs (DAGs). In this graphical model, each node represents a random variable and the arrow connecting a parent node with a child node indicates that there is a relationship between them [56]. This relationship is calculated in terms of conditional probability among variables of interest.
5.4. Dagging
This meta classifier, proposed by Ting and Witten [57], creates a number of disjoint, stratified folds out of the data and feeds each chunk of data to a copy of the supplied base classifier. Predictions are made via majority vote, since all the generated base classifiers are put into the Vote meta classifier.
5.5. DecisionTable
This algorithm builds a simple classifier based on a decision table with a default rule mapping to the majority class. This representation called Decision Table Majority (DTM) [58] has two components: a schema which is a set of features that are included in the table and a body consisting of labeled instances from the space defined by the features in the schema.
5.6. DTNB
DTNB, proposed by Hall and Frank [59], builds a decision table/naive Bayes hybrid classifier. The method is based on a simple Bayesian network in which the decision table (DT) represents a conditional probability table. At each point in the search, the algorithm evaluates the merit of dividing the attributes into two disjoint subsets: one for the decision table, the other for naive Bayes.
5.7. FT

This algorithm focuses on the construction of Functional Trees, which are classification trees that can have logistic regression functions at the inner nodes and/or leaves [60]. The effects of using combinations of attributes at decision nodes, leaf nodes, or both were studied by Gama [61].

5.8. LMT

Logistic Model Trees (LMTs) are based on two basic approaches: tree induction and logistic regression [60]. LMTs are classification trees with logistic regression functions at the leaves, which can deal with binary and multi-class target variables, numeric and nominal attributes, and missing values [62].

5.9. Logistic

This algorithm focuses on building and using a multinomial logistic regression model with a ridge estimator. le Cessie and van Houwelingen [63] showed how ridge estimators can be used in logistic regression to improve the parameter estimates and to diminish the error made by further predictions.

5.10. MultiClassClassifier

MultiClassClassifier, proposed by Eibe Frank, Len Trigg and Richard Kirkby, builds a metaclassifier for handling multi-class datasets with 2-class classifiers. This classifier is also capable of applying error correcting output codes for increased accuracy.

5.11. NaiveBayes

This algorithm is based on two important simplifying assumptions. NaiveBayes assumes that the predictive attributes are conditionally independent given the class, and it posits that no hidden or latent attributes influence the prediction process [64]. Numeric estimator precision values are chosen based on an analysis of the training data.

5.12. NaiveBayesSimple

Class for building and using a simple Naive Bayes classifier; numeric attributes are modeled by a normal distribution. For more information, see [65].

5.13. NaiveBayesUpdateable

Class for a Naive Bayes classifier using estimator classes. This is the updateable version of NaiveBayes. For more information on Naive Bayes classifiers, see [64].

5.14. RandomCommittee

Class for building an ensemble of randomizable base classifiers. Each base classifier is built using a different random number seed (but based on the same data). The final prediction is a straight average of the predictions generated by the individual base classifiers.

5.15. RandomForest

A random forest is a classifier consisting of a collection of tree-structured classifiers such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. For more information on Random Forest classifiers, see [66].

5.16. RandomSubSpace

This method constructs a decision tree based classifier that maintains the highest accuracy on training data and improves generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. For more information on the Random Subspace Method for constructing decision forests, see [67].

5.17. RBFNetwork

Class that implements a normalized Gaussian radial basis function network. It uses the k-means clustering algorithm to provide the basis functions and learns either a logistic regression (discrete class problems) or a linear regression (numeric class problems). Symmetric multivariate Gaussians are fit to the data from each cluster. If the class is nominal it uses the given number of clusters per class. It standardizes all numeric attributes to zero mean and unit variance. For more information on Radial Basis Functions, see [68].

5.18. RotationForest

Rotation Forest, proposed by Rodríguez et al. [69], is a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier.

5.19. SimpleLogistic

Classifier for building linear logistic regression models. LogitBoost with simple regression functions as base learners is used for fitting the logistic models. The optimal number of LogitBoost iterations to perform is cross-validated, which leads to automatic attribute selection. For more information, see [62,60].

5.20. SMO

Class that implements Platt's Sequential Minimal Optimization algorithm for training a Support Vector Machine. For more information, see [70–72].
6. Algorithm comparisons
One of the main objectives of this study is to make a consistent comparison between the classification performance achieved by our proposal and the classification performance achieved by some well known algorithms on different pattern classification problems in the medical field. Two fundamental questions naturally arise. The first is: which test is appropriate for comparing the differences between algorithms? The second is: how should classification performance (error rate) be compared? In order to answer the first question, Mitchell [73] presented an approach to determine the level of significance with which one algorithm outperforms another. The classification accuracies and standard deviations are considered to differ from one another significantly if the result of a t-test is less than or equal to 0.05. Following this approach, good outcomes should have high accuracies and low standard deviations. In order to answer the second question, there are several approaches to making such comparisons. Clark [74] compared the accuracies and standard deviations of each pair of algorithms, averaged over all the experimental datasets. The classification accuracies to be averaged are the average accuracies of each algorithm over five runs. This approach can be strongly criticized, since its effect is to ignore the underlying distribution of each dataset [75]. Murthy et al. [76] compared the number of datasets on which an algorithm achieves higher classification accuracy averaged over five runs. An algorithm is considered better than its paired algorithm if it achieves higher classification accuracy on a greater number of datasets. Kohavi [77] reviewed accuracy estimation methods and compared cross-validation and bootstrap. Experimental results showed that bootstrap has low variance but extremely large bias on some problems; as a consequence, stratified 10-fold cross-validation is recommended for model selection. Kohavi and John [78] pointed out that when comparing a pair of algorithms, it is critical to understand that when 10-fold cross-validation is used for classification accuracy evaluation, this cross-validation is an independent outer loop. They also pointed out that some researchers have reported accuracy results from the inner cross-validation loop; such results are optimistically biased and are a subtle means of training on the test set. In order to make a consistent comparison between the classification performance achieved by our proposal and the classification performance achieved by some well known algorithms on different pattern classification problems in the medical field, we followed the approaches of Kohavi [77] and Kohavi and John [78].
7. Experimental phase
Throughout the experimental phase, seven datasets were used as test sets to estimate the classification performance of each of the compared algorithms. These databases were taken from the UCI machine learning repository [49], from which full documentation for all datasets can be obtained. The main characteristics of these datasets were expounded in Section 4. AMBC performance was compared against the performance achieved by the twenty best-performing algorithms of the seventy-six available in WEKA 3: Data Mining Software in Java [51]. WEKA is open source software issued under the GNU General Public License, freely available on the Web [52]. Further information on each of the algorithms that were used during the experimental phase can be found in [53]. All experiments were conducted using a personal computer with an Intel Core 2 Duo Processor E6700 (4M cache, 2.66 GHz, 1066 MHz FSB) running the Windows XP Professional operating system with 2048 MB of RAM. In order to carry out such a comparison, we applied the same conditions and validation schemes in each experiment. The classification accuracy of each of the compared algorithms was calculated using a 50-50 training-test split, a 70-30 training-test split, 10-fold cross-validation and leave-one-out cross-validation.
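For reference, the evaluation protocol (expressions (11) and (12) combined with k-fold cross-validation) can be sketched as follows. This is a generic, plain (non-stratified) NumPy illustration, not the authors' original experimental code, and the function names are ours.

```python
import numpy as np

def kfold_accuracy(X, labels, train_and_classify, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation (expressions (11)-(12)).

    train_and_classify(X_train, y_train, X_test) must return predicted labels
    as a NumPy array.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    hits, total = 0, 0
    for f in range(k):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        pred = train_and_classify(X[train], labels[train], X[test])
        hits += int(np.sum(pred == labels[test]))   # assess(x^omega), expression (12)
        total += len(test)
    return hits / total                             # accuracy(T), expression (11)
```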
8. Results and discussion
In this section we analyze the classification accuracy results achieved by each one of the compared algorithms in seven different pattern classification problems in the medical field. Although WEKA 3: Data Mining Software in Java [51] has more than seventy well known algorithms implemented, only the twenty best-performing algorithms were considered for comparison purposes. According to the type of learning scheme, each of these can be grouped in one of the following types of classifiers: Bayesian classifiers, Functions based classifiers, Meta classifiers, Rules based classifiers and Decision Trees classifiers. The twenty best-performing algorithms are as follows: • Four algorithms based on the Bayesian approach (BayesNet [56], NaiveBayes [64], NaiveBayesSimple [65] and NaiveBayesUpdateable [64]). • Four functions based classifiers (Logistic [63], RBFNetwork [68], SimpleLogistic [62] and SMO [70]). • Seven meta classifiers (AdaBoostM1 [54], Bagging [55], Dagging [57], MultiClassClassifier [52,53], RandomCommittee [52,53], RandomSubSpace [67], RotationForest [69]).
• Two rules based classifiers (DecisionTable [58] and DTNB [59]).
• Three decision trees classifiers (FT [60], LMT [60], RandomForest [66]).
Tables 4–7 show the classification accuracy achieved by each of the compared algorithms in seven different pattern classification problems in the medical field, using a 50-50 training-test split, a 70-30 training-test split, 10-fold cross-validation and leave-one-out cross-validation, respectively. For each compared algorithm, the classification accuracy averaged over all datasets is given at the end of each row. For each dataset, the highest classification accuracy is highlighted with boldface. As shown in Tables 4–7, there is no particular method that surpasses all other algorithms in all sorts of problems. This should not be surprising, since Wolpert and Macready [79] demonstrated that what an algorithm gains in performance on one class of problems is necessarily offset by its performance on the remaining problems.

Table 4 shows the classification accuracy achieved by each of the compared algorithms using a 50-50 training-test split. Classification results are as follows: two of the four functions based classifiers (SimpleLogistic [62] and SMO [70]) achieved the best performance in two of the seven pattern classification problems. Similarly, one of the three decision trees classifiers (RandomForest [66]) achieved the best performance in two of the seven datasets. It is worth noting that our proposal achieved the best performance in three of the seven pattern classification problems in the medical field, using a 50-50 training-test split.

Table 5 shows the classification accuracy achieved by each of the compared algorithms using a 70-30 training-test split. Classification results are as follows: one of the seven meta classifiers (Bagging [55]) achieved the best performance in two of the seven pattern classification problems. Two of the four functions based classifiers (SimpleLogistic [62] and SMO [70]) achieved the best performance in two of the seven pattern classification problems. Similarly, one of the three decision trees classifiers (LMT [60]) achieved the best performance in two of the seven datasets. It is worth noting that our proposal achieved the best performance in three of the seven pattern classification problems in the medical field, using a 70-30 training-test split.

Table 6 shows the classification accuracy achieved by each of the compared algorithms using 10-fold cross-validation. Classification results are as follows: two of the seven meta classifiers (Bagging [55] and RotationForest [69]) achieved the best performance in two of the seven pattern classification problems. One of the three decision trees classifiers (LMT [60]) achieved the best performance in two of the seven datasets. Similarly, two of the four functions based classifiers (RBFNetwork [68] and SimpleLogistic [62]) achieved the best performance in two of the seven datasets. It is worth noting that our proposal achieved the best performance in four of the seven pattern classification problems in the medical field, using 10-fold cross-validation.

Table 7 shows the classification accuracy achieved by each of the compared algorithms using leave-one-out cross-validation. Classification results are as follows: two of the seven meta classifiers (AdaBoostM1 [54] and MultiClassClassifier [52,53]) achieved the best performance in two of the seven pattern classification problems. One of the four algorithms based on the Bayesian approach (BayesNet [56]) achieved the best performance in two of the seven datasets. Similarly, two of the three decision trees classifiers (FT [60] and LMT [60]) achieved the best performance in two of the seven pattern classification problems. Two of the four functions based classifiers (Logistic [63] and SimpleLogistic [62]) achieved the best performance in two of the seven datasets. It is worth noting that our proposal achieved the best performance in three of the seven pattern classification problems in the medical field, using leave-one-out cross-validation.

In summary, throughout the experimental phase we used different validation techniques, namely a 50-50 training-test split, a 70-30 training-test split, 10-fold cross-validation and leave-one-out cross-validation, to show how accurately the proposed model can be expected to perform in practice. After carrying out the experiments, and as a consequence of the analysis of the results shown in Tables 4–7, we can say that the proposed algorithm performs competitively against the twenty best-performing algorithms of the seventy-six available in WEKA 3: Data Mining Software in Java [51].
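For clarity, the following sketch (Python, illustrative only) shows how the two summary quantities discussed above are obtained from a results matrix: the accuracy of each algorithm averaged over all datasets, and the best-performing algorithm per dataset. The values used are a small subset of Table 4, so the averages printed here differ from the table's Average column, which is computed over all seven datasets.

```python
# Illustrative sketch: summarizing a results table such as Tables 4-7.
# Rows are algorithms, columns are datasets; values are taken from a subset
# of Table 4 (50-50 training-test split).
import numpy as np

algorithms = ["AdaBoostM1", "SMO", "AMBC"]
datasets = ["Haberman", "Liver", "Hepatitis"]
accuracy = np.array([[72.87, 66.95, 62.58],    # AdaBoostM1
                     [73.52, 58.55, 67.09],    # SMO
                     [74.83, 65.40, 83.76]])   # AMBC (our proposal)

# Accuracy averaged over the datasets considered (only three here)
row_means = accuracy.mean(axis=1)
for name, mean in zip(algorithms, row_means):
    print(f"{name:12s} average = {mean:.2f}%")

# Best algorithm per dataset (the boldfaced value in the typeset tables)
best = accuracy.argmax(axis=0)
for j, ds in enumerate(datasets):
    print(f"{ds}: best is {algorithms[best[j]]} ({accuracy[best[j], j]:.2f}%)")
```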
9.
Summary
In this paper, a novel approach to perform pattern classification tasks is presented. This model is called Associative Memory based Classifier (AMBC). Throughout the experimental phase, the proposed algorithm is applied to help diagnose diseases; in particular, it is applied to seven different diagnosis problems in the medical field. The performance of the proposed model is validated by comparing the classification accuracy of AMBC against the performance achieved by the twenty best-performing algorithms of the seventy-six available in WEKA 3: Data Mining Software in Java [51].

An important point to note is that, even when it seems that the calculation of repetitive matrices could be an impediment to addressing larger problems, the proper use of tools developed for matrix operations and structured data, such as MATLAB, allows arrays of considerable size to be manipulated. For instance, when MATLAB is used on a computer running a 64-bit operating system, the total workspace size and the largest matrix size are each below 8 TB, and the number of elements in the largest real double array, as well as in the largest int8 array, is 2^48 − 1 (about 2.8e14) [80].

It is also necessary to note that, even though the most demanding phase of this algorithm is the learning reinforcement phase, Figs. 4–6 show that once the learning phase is completed, the learning reinforcement phase can be carried out fully in parallel. This means that the computation of the r-th learning reinforcement vector of size n can be divided among as many cores or nodes as are available.
Table 4 – Classification accuracy using 50-50 training-test split. The first twenty methods are included in WEKA 3: Data Mining Software in Java [51].

Algorithm                  Haberman   Liver  Inflammation  Diabetes  Breast   Heart  Hepatitis  Average
 1. AdaBoostM1                72.87   66.95        100.00     73.95   95.02   83.33      62.58    79.24
 2. Bagging                   72.22   68.11         95.00     75.00   96.77   82.96      64.51    79.22
 3. BayesNet                  73.20   57.97        100.00     75.00   97.65   82.22      67.09    79.02
 4. Dagging                   73.52   57.68        100.00     71.35   96.48   83.33      66.45    78.40
 5. DecisionTable             73.20   57.97        100.00     74.73   94.72   81.48      61.93    77.72
 6. DTNB                      73.20   57.97        100.00     74.73   97.51   84.44      63.87    78.81
 7. FT                        71.56   68.40        100.00     75.91   97.36   82.96      62.58    79.82
 8. LMT                       73.52   66.37        100.00     76.56   96.04   81.48      61.29    79.32
 9. Logistic                  73.20   64.92        100.00     77.08   96.77   82.59      57.41    78.85
10. MultiClassClassifier      73.20   64.92        100.00     77.08   96.77   82.59      57.41    78.85
11. NaiveBayes                74.50   52.17         96.66     75.91   96.19   84.81      69.67    78.56
12. NaiveBayesSimple          74.18   53.91         96.66     75.78   96.04   83.70      69.67    78.56
13. NaiveBayesUpdateable      74.50   52.17         96.66     75.91   96.19   84.81      69.67    78.56
14. RandomCommittee           62.74   65.50        100.00     72.91   96.63   80.00      65.80    77.65
15. RandomForest              67.32   70.14        100.00     74.34   96.92   79.25      63.87    78.83
16. RandomSubSpace            73.52   66.95         98.33     72.52   95.46   77.03      60.00    77.69
17. RBFNetwork                72.22   61.73        100.00     75.52   96.48   81.48      65.80    79.03
18. RotationForest            73.85   68.98        100.00     76.82   97.21   80.74      62.58    80.02
19. SimpleLogistic            74.18   66.37        100.00     77.86   96.63   81.48      61.29    79.68
20. SMO                       73.52   58.55        100.00     77.86   96.92   83.33      67.09    79.61
    AMBC (our proposal)       74.83   65.40         91.66     70.57   97.80   83.33      83.76    81.05
Experimental results have shown that AMBC achieved the best performance in three of the seven pattern classification problems in the medical field using a 50-50 training-test split, a 70-30 training-test split and leave-one-out cross-validation, as shown in Tables 4, 5 and 7. Likewise, AMBC achieved the best performance in four of the seven pattern classification problems in the medical field using 10-fold cross-validation, as shown in Table 6.
It should be noted that our proposal achieved the best classification accuracy averaged over all datasets. The proposed approach has proven to be an effective alternative for performing pattern recognition tasks in the medical field. The results presented in this paper demonstrate the potential of associative memories for medical decision support systems.
Table 5 – Classification accuracy using 70-30 training-test split. The first twenty methods are included in WEKA 3: Data Mining Software in Java [51].

Algorithm                  Haberman   Liver  Inflammation  Diabetes  Breast   Heart  Hepatitis  Average
 1. AdaBoostM1                73.52   68.98        100.00     74.08   95.16   82.96      62.58    79.61
 2. Bagging                   73.20   72.17        100.00     75.13   95.75   81.11      64.51    80.26
 3. BayesNet                  72.22   58.55        100.00     74.73   97.36   82.96      69.03    79.26
 4. Dagging                   73.85   58.26        100.00     73.95   96.63   83.33      65.16    78.74
 5. DecisionTable             72.22   58.55        100.00     74.86   93.99   80.37      69.67    78.52
 6. DTNB                      72.22   58.55        100.00     75.52   97.07   82.59      65.80    78.82
 7. FT                        73.20   69.85        100.00     77.21   96.92   82.22      57.41    79.54
 8. LMT                       73.85   71.59        100.00     76.69   96.63   84.44      63.87    81.01
 9. Logistic                  74.18   66.37        100.00     76.69   96.48   83.33      61.29    79.76
10. MultiClassClassifier      74.18   66.37        100.00     76.69   96.48   83.33      61.29    79.76
11. NaiveBayes                75.49   56.23         95.83     75.78   96.33   83.33      72.25    79.32
12. NaiveBayesSimple          75.49   55.07         95.83     75.78   96.19   84.07      70.32    78.96
13. NaiveBayesUpdateable      75.49   56.23         95.83     75.78   96.33   83.33      72.25    79.32
14. RandomCommittee           64.70   68.40        100.00     73.30   95.60   80.37      62.58    77.85
15. RandomForest              67.64   66.37        100.00     73.56   96.33   78.88      60.64    77.63
16. RandomSubSpace            74.50   67.53        100.00     73.43   95.90   78.14      61.93    78.78
17. RBFNetwork                73.52   61.73        100.00     74.34   96.33   82.59      71.61    80.02
18. RotationForest            74.18   70.72        100.00     75.39   97.36   83.70      63.22    80.65
19. SimpleLogistic            73.85   67.82        100.00     76.69   96.63   84.44      65.16    80.65
20. SMO                       73.52   57.68        100.00     77.47   96.48   84.07      64.51    79.10
    AMBC (our proposal)       77.30   59.593        94.16     70.18   97.64   83.33      84.86    81.01
Table 6 – Classification accuracy using 10-fold cross-validation. The first twenty methods are included in WEKA 3: Data Mining Software in Java [51].

Algorithm                  Haberman   Liver  Inflammation  Diabetes  Breast   Heart  Hepatitis  Average
 1. AdaBoostM1                73.20   66.66        100.00     74.34   95.60   82.22      67.09    79.87
 2. Bagging                   73.20   73.04        100.00     74.60   96.19   83.70      69.67    81.49
 3. BayesNet                  72.54   56.81        100.00     74.34   97.21   82.22      69.03    78.88
 4. Dagging                   73.52   57.97        100.00     74.08   96.77   82.22      66.45    78.72
 5. DecisionTable             72.54   57.97        100.00     71.22   95.75   83.33      72.25    79.01
 6. DTNB                      72.54   57.97        100.00     73.82   97.51   82.59      68.38    78.97
 7. FT                        72.87   70.43        100.00     77.34   96.92   82.22      69.03    81.26
 8. LMT                       73.85   69.85        100.00     77.47   96.48   82.22      67.09    80.99
 9. Logistic                  74.50   68.69        100.00     77.21   96.63   83.70      68.38    81.30
10. MultiClassClassifier      74.50   68.69        100.00     77.21   96.63   83.70      68.38    81.30
11. NaiveBayes                74.50   54.20         95.83     76.30   96.19   83.33      71.61    78.85
12. NaiveBayesSimple          73.85   55.07         95.83     76.30   96.33   82.96      70.96    78.76
13. NaiveBayesUpdateable      74.50   54.20         95.83     76.30   96.19   83.33      71.61    78.85
14. RandomCommittee           64.37   68.11        100.00     75.26   96.48   82.22      63.22    78.52
15. RandomForest              67.97   70.72        100.00     72.39   97.07   83.70      65.16    79.57
16. RandomSubSpace            72.22   64.05        100.00     75.26   95.54   82.22      67.74    79.60
17. RBFNetwork                72.87   66.08        100.00     75.39   95.90   84.07      69.67    80.57
18. RotationForest            73.20   73.04        100.00     76.82   97.21   82.59      66.45    81.33
19. SimpleLogistic            73.85   71.01        100.00     77.47   96.63   82.22      66.45    81.09
20. SMO                       73.52   57.97        100.00     77.34   96.92   83.33      72.25    80.19
    AMBC (our proposal)       76.33   65.50        100.00     70.39   97.80   83.70      85.16    82.70
Table 7 – Classification accuracy using leave-one-out cross-validation. The first twenty methods are included in WEKA 3: Data Mining Software in Java [51].

Algorithm                  Haberman   Liver  Inflammation  Diabetes  Breast   Heart  Hepatitis  Average
 1. AdaBoostM1                75.49   63.76        100.00     75.65   95.60   81.85      60.00    78.90
 2. Bagging                   72.87   70.43        100.00     74.73   95.90   81.11      68.38    80.49
 3. BayesNet                  74.18   63.18        100.00     75.78   97.36   83.70      70.96    80.74
 4. Dagging                   73.52   57.68        100.00     74.21   96.92   82.96      63.87    78.45
 5. DecisionTable             68.95   63.18        100.00     74.86   95.75   82.96      73.54    79.89
 6. DTNB                      68.95   63.18        100.00     74.86   97.07   80.74      69.67    79.21
 7. FT                        73.20   71.59        100.00     76.69   97.21   80.37      69.03    81.15
 8. LMT                       72.87   69.85        100.00     77.08   96.19   83.70      66.45    80.88
 9. Logistic                  74.18   68.40        100.00     77.73   96.77   82.96      69.67    81.39
10. MultiClassClassifier      74.18   68.40        100.00     77.73   96.77   82.96      69.67    81.39
11. NaiveBayes                75.49   55.94         95.83     75.65   96.19   82.96      71.61    79.09
12. NaiveBayesSimple          74.50   55.36         95.83     75.26   96.33   83.70      70.96    78.85
13. NaiveBayesUpdateable      75.49   55.94         95.83     75.65   96.19   82.96      71.61    79.09
14. RandomCommittee           64.70   68.11        100.00     75.39   96.63   82.22      62.58    78.52
15. RandomForest              66.66   67.82        100.00     74.60   96.48   82.59      60.64    78.40
16. RandomSubSpace            71.56   68.69         98.33     73.30   95.90   81.11      67.09    79.43
17. RBFNetwork                74.50   64.63        100.00     73.69   96.19   81.85      71.61    80.35
18. RotationForest            72.87   70.72        100.00     76.56   97.51   79.62      64.51    80.25
19. SimpleLogistic            73.52   69.27        100.00     77.34   96.48   83.70      65.16    80.78
20. SMO                       73.20   57.97        100.00     76.82   97.07   82.96      69.67    79.67
    AMBC (our proposal)       74.18   60.57        100.00     70.70   97.80   83.33      85.16    81.68
9.1.
Concluding remark
Here are some relevant points that are useful to highlight the differences between the current proposal, named Associative Memory based Classifier, and some previous models proposed by the Alfa-Beta research group.

First, previous associative models work with column vectors with binary components, while the current proposal works with vectors with real components, which means significant savings in the encoding of information.

Second, in order to be robust to noise, previous associative models need to encode the information using the Johnson-Möbius code [81], whereas the current proposal does not require any special coding.

Third, previous associative models do not have the ability to identify the relevant features that allow classification performance to be increased; on the contrary, the current proposal increases classification performance by means of the learning reinforcement phase.

But perhaps the crucial point is that the current proposal is completely parallelizable and can therefore take advantage of technological advances such as multi-core computers and parallel computing, among others.
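As a minimal illustration of this point (not the authors' implementation: the actual AMBC reinforcement rule is defined earlier in the paper, and a hypothetical placeholder is used here in its place), the sketch below divides the computation of one reinforcement vector of size n among the available cores with Python's multiprocessing module; each component is computed independently from one column of the learning matrix.

```python
# Illustrative sketch only: `reinforce_component` is a hypothetical placeholder
# standing in for the AMBC componentwise reinforcement rule, used to show how the
# n components of the r-th reinforcement vector can be computed on several cores.
import numpy as np
from multiprocessing import Pool

def reinforce_component(args):
    j, column = args
    # Placeholder componentwise rule; the real rule comes from the AMBC model.
    return j, float(np.max(column) - np.min(column))

def reinforcement_vector(M):
    """Compute one reinforcement vector of size n = M.shape[1] in parallel."""
    n = M.shape[1]
    with Pool() as pool:                       # one worker process per available core
        results = pool.map(reinforce_component, ((j, M[:, j]) for j in range(n)))
    v = np.empty(n)
    for j, value in results:                   # reassemble the vector in order
        v[j] = value
    return v

if __name__ == "__main__":
    M = np.random.rand(100, 16)                # toy learning matrix
    print(reinforcement_vector(M))
```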
Acknowledgments

The authors of the present paper would like to thank the following institutions for their financial support of this work: Science and Technology National Council of Mexico (CONACyT Grant No. 174952), SNI, National Polytechnic Institute of Mexico (COFAA, SIP, ESCOM, and CIC) and ICyTDF (Grant No. PIUTE10-77 and PICSO10-85).
References
[1] E.A. Feigenbaum, B.G. Buchanan, J. Lederberg, On generality and problem solving: a case study using the dendral program, Tech. Rep. CS-TR-70-176, Stanford University, Department of Computer Science, Stanford, CA, USA (1970).
[2] B.G. Buchanan, E.A. Feigenbaum, J. Lederberg, A heuristic programming study of theory formation in science, in: IJCAI, 1971, pp. 40–50.
[3] E.A. Feigenbaum, The art of artificial intelligence: themes and case studies of knowledge engineering, in: IJCAI, 1977, pp. 1014–1029.
[4] K. Steinbuch, Die lernmatrix, Kybernetik 1 (1) (1961) 36–45.
[5] K. Steinbuch, H. Frank, Nichtdigitale lernmatrizen als perzeptoren, Kybernetik 1 (3) (1961) 117–124.
[6] K. Steinbuch, Adaptive networks using learning matrices, Kybernetik 2 (4) (1964) 148–152.
[7] H. Kazmierczak, K. Steinbuch, Adaptive systems in pattern recognition, IEEE Transactions on Electronic Computers EC-12 (6) (1963) 822–835.
[8] K. Steinbuch, B. Widrow, A critical comparison of two kinds of adaptive classification networks, IEEE Transactions on Electronic Computers EC-14 (5) (1965) 737–740.
[9] T. Kohonen, Correlation matrix memories, IEEE Transactions on Computers C-21 (4) (1972) 353–359.
[10] K. Steinbuch, U.A.W. Piske, Learning matrices and their applications, IEEE Transactions on Electronic Computers EC-12 (6) (1963) 846–862.
[11] J.A. Anderson, A memory storage model utilizing spatial correlation functions, Kybernetik 5 (3) (1968) 113–119.
[12] J.A. Anderson, A simple neural network generating an interactive memory, Mathematical Biosciences 14 (1972) 197–220.
[13] D.J. Willshaw, O.P. Buneman, H.C. Longuet-Higgins, Non-holographic associative memory, Nature 222 (5197) (1969) 960–962.
[14] K. Nakano, Associatron—a model of associative memory, IEEE Transactions on Systems, Man, and Cybernetics SMC-2 (3) (1972) 380–388.
[15] S.-I. Amari, Pattern learning by self-organizing nets of threshold elements, System and Computing Controls 3 (4) (1972) 15–22.
[16] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences 79 (1982) 2554–2558.
[17] J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proceedings of the National Academy of Sciences of the United States of America 81 (1984) 3088–3092.
[18] I.G.L. Personnaz, G. Dreyfus, Information storage and retrieval in spin glass like neural networks, Journal of Physical Letters 46 (1985) L359–L365.
[19] A. Liwanag, S. Becker, Improving associative memory capacity: one-shot learning in multilayer Hopfield networks, in: Proceedings of the 19th Annual Conference of the Cognitive Science Society, 1997, pp. 442–447.
[20] F.T. Sommer, G. Palm, Improved bidirectional retrieval of sparse patterns stored by Hebbian learning, Neural Networks 12 (1999) 281–297.
[21] R.R.M.A.H. Muhamad Amin, A. Khan, Analysis of pattern recognition algorithms using associative memory approach: a comparative study between the Hopfield network and Distributed Hierarchical Graph Neuron (DHGN), in: IEEE 8th International Conference on Computer and Information Technology Workshops, 2008, pp. 153–158.
[22] A.I. Khan, A.H.M. Amin, One shot associative memory method for distorted pattern recognition, in: AI 2007: Advances in Artificial Intelligence, vol. 4830 of Lecture Notes in Computer Science, 2007, pp. 705–709.
[23] A.H.M. Amin, A.I. Khan, Parallel pattern recognition using a single-cycle learning approach within wireless sensor networks, in: Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, 2008, pp. 305–308.
[24] A.H.M.A. Amir, H. Basirat, A.I. Khan, Under the cloud: a novel content addressable data framework for cloud parallelization to create and virtualize new breeds of cloud applications, in: Ninth IEEE International Symposium on Network Computing and Applications, 2010.
[25] A.H.M. Amin, A.I. Khan, A divide-and-distribute approach to single-cycle learning HGN network for pattern recognition, in: 11th International Conference on Control, Automation, Robotics and Vision, 2010.
[26] A.H.M. Amin, A.I. Khan, Distributed multi-feature recognition scheme for greyscale images, Neural Processing Letters 33 (2011) 45–59.
[27] G. Ritter, P. Sussner, J. Diaz-de Leon, Morphological associative memories, IEEE Transactions on Neural Networks 9 (2) (1998) 281–293.
[28] S. Peter, A fuzzy autoassociative morphological memory, in: Proceedings of the International Joint Conference on Neural Networks, 2003, pp. 326–331.
[29] S.T. Wang, H.J. Lu, On new fuzzy morphological associative memories, IEEE Transactions on Fuzzy Systems 12 (3) (2004) 316–323.
[30] P. Sussner, New results on binary auto- and heteroassociative morphological memories, in: Proceedings of International Joint Conference on Neural Networks, 2005, pp. 1199–1204.
[31] G. Urcid, G.X. Ritter, Noise masking for pattern recall using a single lattice matrix associative memory, Ch. 5, pp. 81–100, in: Studies in Computational Intelligence, no. 67, Springer-Verlag, Berlin Heidelberg, 2007.
[32] P. Sussner, M.E. Valle, Morphological and certain fuzzy morphological associative memories for classification and prediction, Ch. 8, pp. 149–171, in: Studies in Computational Intelligence, vol. 67, Springer-Verlag, Berlin Heidelberg, 2007.
[33] M. Wang, R. Chu, Economizing enhanced fuzzy morphological associative memory, in: Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, 2008, pp. 495–500.
[34] T. Saeki, T. Miki, Effectiveness of scale free network to the performance improvement of a morphological associative memory without a kernel image, in: Neural Information Processing, vol. 4984 of Lecture Notes in Computer Science, 2008, pp. 358–364.
[35] M.E. Valle, P. Sussner, A general framework for fuzzy morphological associative memories, Fuzzy Sets and Systems 159 (2008) 747–768.
[36] M.E. Valle, Permutation-based finite implicative fuzzy associative memories, Information Sciences 180 (2010) 4136–4152.
[37] Y.S. Boutalis, A new method for constructing kernel vectors in morphological associative memories of binary patterns, Computer Science and Information Systems 8 (2011) 141–166.
[38] M.E. Valle, P. Sussner, Storage and recall capabilities of fuzzy morphological associative memories with adjunction-based learning, Neural Networks 24 (2011) 75–90.
[39] J. Serra, Image Analysis and Mathematical Morphology, vol. 2, Academic Press, London, 1992.
[40] H. Simon, Neural Networks—A Comprehensive Foundation, Prentice Hall International, Inc., 1999.
[41] B. Kosko, Bidirectional associative memories, IEEE Transactions on Systems, Man, and Cybernetics 18 (1980) 49–60.
[42] R. Bogacz, Knowledge database implemented as a neural networks, in: Proceedings of 2nd Conference on Neural Networks and their Application, 1996, pp. 66–71.
[43] G. Mathai, B. Upadhyaya, Performance analysis and application of the bidirectional associative memory to industrial spectral signatures, in: International Joint Conference on Neural Networks, 1989.
[44] E. Guzmán, O.B. Pogrebnyak, C. Yáñez, J.A. Moreno, Image compression algorithm based on morphological associative memories, in: CIARP, 2006, pp. 519–528.
[45] M. Aldape-Pérez, I. Román-Godínez, O. Camacho-Nieto, Thresholded learning matrix for efficient pattern recalling, pp. 445–452, in: CIARP'08: Proceedings of the 13th Ibero-American Congress on Pattern Recognition, Springer-Verlag, Berlin, Heidelberg, 2008.
[46] S. Chartier, R. Lepage, Learning and extracting edges from images by a modified Hopfield neural network, in: Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02), 2002.
[47] M.E. Acevedo-Mosqueda, Alpha–beta bidirectional associative memories (in Spanish), Ph.D. thesis, Center for Computing Research, México (2006).
[48] M.E. Acevedo-Mosqueda, C. Yáñez-Márquez, I. López-Yáñez, Alpha-beta bidirectional associative memories: theory and applications, Neural Processing Letters 26 (1) (2007) 1–40.
[49] A. Asuncion, D. Newman, UCI machine learning repository (2007). URL http://archive.ics.uci.edu/ml/.
[50] K.H. Rosen, Discrete Mathematics and Its Applications, 6th ed., McGraw-Hill, 2007.
[51] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explorations 11 (1) (2009) 10–18.
[52] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, WEKA 3: Data mining software in Java (2010). URL http://www.cs.waikato.ac.nz/ml/weka/.
[53] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, in: Morgan Kaufmann Series in Data Management Systems, 2nd ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[54] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Thirteenth International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 148–156.
[55] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.
[56] N. Christofides, Graph Theory: An Algorithmic Approach (Computer Science and Applied Mathematics), Academic Press, Inc., Orlando, FL, USA, 1975.
[57] K.M. Ting, I.H. Witten, Stacking bagged and dagged models, in: D.H. Fisher (Ed.), Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, 1997, pp. 367–375.
[58] R. Kohavi, The power of decision tables, in: 8th European Conference on Machine Learning, Springer, 1995, pp. 174–189.
[59] M. Hall, E. Frank, Combining Naive Bayes and decision tables, in: Proceedings of the 21st Florida Artificial Intelligence Society Conference (FLAIRS), AAAI Press, 2008, pp. 318–319.
[60] N. Landwehr, M. Hall, E. Frank, Logistic model trees, Machine Learning 59 (1–2) (2005) 161–205.
[61] J. Gama, Functional trees, Machine Learning 55 (3) (2004) 219–250.
[62] M. Sumner, E. Frank, M. Hall, Speeding up logistic model tree induction, in: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Springer, 2005, pp. 675–683.
[63] S. le Cessie, J. van Houwelingen, Ridge estimators in logistic regression, Applied Statistics 41 (1) (1992) 191–201.
[64] G.H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers, in: Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo, Morgan Kaufmann, 1995, pp. 338–345.
[65] R. Duda, P. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[66] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[67] T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8) (1998) 832–844.
[68] M.D. Buhmann, Radial Basis Functions: Theory and Implementations (Cambridge Monographs on Applied and Computational Mathematics), Cambridge University Press, 2003.
[69] J.J. Rodriguez, L.I. Kuncheva, C.J. Alonso, Rotation forest: a new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (10) (2006) 1619–1630.
[70] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schoelkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, 1998.
[71] T. Hastie, R. Tibshirani, Classification by pairwise coupling, in: M.I. Jordan, M.J. Kearns, S.A. Solla (Eds.), Advances in Neural Information Processing Systems, vol. 10, MIT Press, 1998.
[72] S. Keerthi, S. Shevade, C. Bhattacharyya, K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, Neural Computation 13 (3) (2001) 637–649.
[73] T.M. Mitchell, Machine Learning, 1st ed., McGraw-Hill, 1997.
[74] P. Clark, R. Boswell, Rule induction with CN2: some recent improvements, pp. 151–163, in: Y. Kodratoff (Ed.), Machine Learning—Proceedings of the Fifth European Conference (EWSL-91), 1991.
[75] P.W. Eklund, A. Hoang, A performance survey of public domain Supervised Machine Learning algorithms, Tech. Rep., Griffith University, School of Information Technology, Parklands Drive, Southport, Queensland 9726, Australia (2002).
[76] S.K. Murthy, S. Kasif, S. Salzberg, A system for induction of oblique decision trees, Journal of Artificial Intelligence Research 2 (1994) 1–32.
[77] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI 95), 1995, pp. 1137–1145.
[78] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1) (1997) 273–324.
[79] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1 (1) (1997) 67–82.
[80] MathWorks, Maximum matrix size by platform (2010). URL http://www.mathworks.com/support/technotes/1100/1110.html.
[81] C. Yáñez, E.M.F. Riverón, I. López-Yáñez, R. Flores-Carapia, A novel approach to automatic color matching, in: CIARP, 2006, pp. 529–538.