Self-Organizing Hidden Markov Model Map (SOHMMM)


Neural Networks 48 (2013) 133–147


Self-Organizing Hidden Markov Model Map (SOHMMM)

Christos Ferles*, Andreas Stafylopatis

Intelligent Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, 15780 Zografou, Athens, Greece

Article history: Received 12 June 2011; Received in revised form 8 June 2013; Accepted 31 July 2013

Keywords: Self-Organizing Map (SOM); Hidden Markov Model (HMM); Unsupervised learning; Clustering; Spatiotemporal; DNA/RNA/protein sequences

abstract A hybrid approach combining the Self-Organizing Map (SOM) and the Hidden Markov Model (HMM) is presented. The Self-Organizing Hidden Markov Model Map (SOHMMM) establishes a cross-section between the theoretic foundations and algorithmic realizations of its constituents. The respective architectures and learning methodologies are fused in an attempt to meet the increasing requirements imposed by the properties of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein chain molecules. The fusion and synergy of the SOM unsupervised training and the HMM dynamic programming algorithms bring forth a novel on-line gradient descent unsupervised learning algorithm, which is fully integrated into the SOHMMM. Since the SOHMMM carries out probabilistic sequence analysis with little or no prior knowledge, it can have a variety of applications in clustering, dimensionality reduction and visualization of large-scale sequence spaces, and also, in sequence discrimination, search and classification. Two series of experiments based on artificial sequence data and splice junction gene sequences demonstrate the SOHMMM’s characteristics and capabilities. © 2013 Elsevier Ltd. All rights reserved.

1. Introduction

All life on this planet depends on deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein sequences, the three primary types of biological molecules. DNA, RNA and proteins are examples of sequences written in either the four-letter nucleotide alphabet of DNA and RNA or the twenty-letter amino acid alphabet of proteins. The advent of efficient experimental technologies has led to an exponential growth of linear descriptions of protein, DNA and RNA chain molecules requiring automated analysis. Therefore, the need for computational/statistical/machine learning algorithms and techniques for the qualitative and quantitative description of biological molecules is today stronger than ever (Mount, 2004). Analysis of sequence patterns must take into account that biological sequences are inherently complex and noisy, the variability resulting in part from random events amplified by evolution and in part from our lack of a comprehensive theory of life's organization and function at the molecular level. Since nucleotide or amino acid sequences with a given function or structure will, in general, differ and be uncertain, sequence models must be probabilistic (Baldi & Brunak, 2001). Neural Networks (NNs), Hidden Markov Models (HMMs), Bayesian networks and similar machine learning approaches are ideally suited for domains characterized by the presence of large amounts of data, variable sequence patterns, complex structures, and the absence of general theories.



Corresponding author. Tel.: +30 2107722504. E-mail address: [email protected] (C. Ferles).

0893-6080/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.neunet.2013.07.011

Exploratory data analysis actually represents a primordial exploration with little or no prior knowledge. Its objective is to produce simplified descriptions and summaries of large data sets. Clustering (Du, 2010; Xu & Wunsch, 2005) is one standard method that can create higher abstractions (symbolisms) from raw data automatically; another alternative method is to project high-dimensional data as points on a low-dimensional display. Since the Self-Organizing Map (SOM) is a combination of these two methods, it has great capabilities as a clustering, dimensionality reduction and data visualization algorithm (Brugger, Bogdan, & Rosenstiel, 2008; Kohonen, 2001; Kraaijveld, Mao, & Jain, 1995; Seiffert & Jain, 2002; Tasdemir, 2010; Tasdemir & Merenyi, 2009; Ultsch, 2003). Various models have been proposed that extend the SOM, by recurrent dynamics and/or recursive connections, for processing sequential data. The sequential activation retention and decay network (James & Miikkulainen, 1994) is a feature map architecture which creates distributed response patterns (representations) for different sequences of vectors. The extended Kohonen feature map (Hoekstra & Drossaers, 1993) equips the SOM with internal connections from each neuron to all other neurons. The temporal Kohonen map (Chappell & Taylor, 1993) transforms the neurons in such a way that they act as leaky integrators. The recurrent SOM (Koskela, Varsta, Heikkonen, & Kaski, 1998; Varsta, Heikkonen, & Millan, 1997) uses a similar dynamic recurrent structure, with the difference that the leaky integrators are moved from the output units to the inputs. The recursive SOM (Voegtlin, 2002) is a modified SOM where time representation, which is implicit and self-referent, is implemented by homogeneous feed-forward and feed-backward connections that describe activities or learning rules. The


merge SOM context (Strickert & Hammer, 2004) refers to a fusion of arbitrary lattice topologies with a learning architecture which implements a compact back-reference to the previous winner. The SOM for structured data (Hagenbuchner, Sperduti, & Tsoi, 2003; Sperduti, 2001) suggests a recursive mechanism capable of exploiting information conveyed in the input Directed Acyclic Graphs (DAGs) and information encoded in the DAG topology. The approach in Hagenbuchner, Sperduti, and Tsoi (2005) is effective in diversifying the mapping of vertices and sub-structures according to the context in which they occur inside a tree. An empirical comparison of these recursive models is presented in Vanco and Farkas (2010). In the general framework of Hammer, Micheli, Strickert, and Sperduti (2004) a uniform formulation is obtained which allows the investigation of possible learning algorithms and theoretical properties of several approaches combining the SOM with recurrent techniques. The self-organizing mixture model (Verbeek, Vlassis, & Krose, 2005) introduces an Expectation–Maximization (E–M) algorithm that yields topology preserving maps based on probabilistic mixture models. The approach in Lebbah, Rogovschi, and Bennani (2007) is a special case of the previous model, since it proposes a Bernoulli probabilistic SOM able to handle binary data. Soft topographic vector quantization (Graepel, Burger, & Obermayer, 1998) employs deterministic annealing on different cost/error functions for the generation of topographic mappings. In Heskes (2001) the relation between SOMs and mixture modeling is investigated. The goal of the generalized taxonomy (Barreto, Araujo, & Kremer, 2003) is to provide a nonlinear generative framework for describing unsupervised spatiotemporal networks making it easier to compare and contrast their representational and operational characteristics. All SOM variants are in position to process numerical sequences (viz. 
sequences which consist of continuous/real or discrete/binary values). Processing of symbolic/non-numerical sequences is carried out after an intermediate preprocessing stage (i.e. after transforming symbolic sequences to numerical data spaces). Some approaches use orthogonal encoding (Lebbah et al., 2007; Somervuo, 2004; Strickert & Hammer, 2004). This sparse encoding scheme has the advantage of not introducing any algebraic correlations between the symbols, but suffers from the disadvantage of increasing substantially the length of biological sequences (e.g. by a factor of 20 in the case of proteins). Consequently, existing SOM variants (employing the orthogonal encoding scheme) are usually applied to small-scale problems, since, in practice, biological sequences prove demanding in terms of modeling/computational requirements. On the contrary, the Self-Organizing Hidden Markov Model Map (SOHMMM) incorporates a mechanism for processing and analyzing discrete symbol sequential data (written in alphabets of arbitrary cardinalities) directly, without requiring any type of numerical transformation or preprocessing. The techniques of Kohonen and Somervuo (1998, 2002) combine the concept of the generalized median with the batch computation of the SOM. Their computation relies on data described by predetermined (dis)similarities (for instance weighted Levenshtein or local feature distances). A drawback of the Dissimilarity SOM (DSOM) is that the training phase's computational complexity grows quadratically with the number of input data. In Conan-Guez, Rossi, and Gollib (2006) several modifications of the basic DSOM algorithm are proposed, which allow a much faster implementation. Another drawback of the DSOM, and of median clustering in general, is that it restricts prototype locations to the data space, such that a severe loss of accuracy can be expected.
This problem has been identified in Hammer, Hasenfuss, Rossi, and Strickert (2007), and also, in Hammer and Hasenfuss (2010) where a different approach is followed. HMMs come with a solid statistical background and with theoretically verified learning algorithms (Baldi & Brunak, 2001; Koski,

2001). HMMs’ architecture and probabilistic foundations prove useful for capturing and deciphering the latent information hidden in the spatial structure and linear aspects of chain molecules, and also, for analyzing the sequential pattern of monomers in sequences. HMMs, when applied properly, work very well in practice, hence, their wide range of applications (Durbin, Eddy, Krogh, & Mitchison, 1998; Mount, 2004) includes gene parsing/recognition, (periodical) pattern discovery, motifs, multiple alignments and functional sites. In the literature, a number of attempts have been made for combining HMMs and NNs to form hybrid models that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs. The approach in Baldi and Chauvin (1996) suggests a class of hybrid architectures where the HMM and NN components are trained inseparably. In these architectures the NN component is used to reparameterize and tune the HMM component. There are also certain techniques based on scenarios involving SOM and HMM modules (see for instance Kang, Feng, Shao, & Hu, 2007; Kurimo, 1997; Kurimo & Somervuo, 1996; Kurimo & Torkkola, 1992; Minamino, Aoyama, & Shimomura, 2005; Recanati, Rogovschi, & Bennani, 2007; Rogovschi, Lebbah, & Bennani, 2010; Somervuo, 2000; Tsuruta, Iuchi, Sagheer, & Tobely, 2003). The training procedures of the two subunits are nearly always disjoint and are carried out independently. These approaches can be categorized in two classes. The first class contains the methods where a SOM is used as a front-end processor (e.g. vector quantization, preprocessing, feature extraction) for the inputs of a HMM, whereas the second class consists of the models that place the SOM on top of the HMM. A type of composite HMM (Krogh, Brown, Mian, Sjolander, & Haussler, 1994) is implemented by training several HMM components in parallel. 
The Baum–Welch learning algorithm (Baum, 1972) is utilized for partitioning the sequences of a single protein family into clusters (subfamilies) of similar sequences. The SOHMMM is an integration of the SOM and the HMM principles. The SOHMMM's core consists of a novel unified/hybrid SOM–HMM algorithm. Both components' corresponding architectures are fused. The model is coupled with a raw sequence data training method that merges the SOM unsupervised learning and the HMM dynamic programming algorithms. The current self-contained treatment analyzes the SOHMMM in depth and in detail (from the most primary concepts to the rather elaborate ones). A determinant new component (in contrast to earlier versions of the model) is the novel stochastic unsupervised learning algorithm, which has a solid mathematical foundation and analytical proof, and is based on an energy-cost function. The devised learning methodology is essentially different from our previously published reports (Ferles & Stafylopatis, 2008a, 2008b; Ferles, Siolas, & Stafylopatis, 2011), which followed an adaptation schema in analogy to the original SOM learning rule (a type of Hebbian learning). Also, in the present study, issues regarding the realization of the SOHMMM algorithm are addressed systematically/methodically. Last, assiduous and extensive experiments have been conducted; thus, all the reported qualitative and quantitative results reveal the characteristics, performance and capabilities of the improved, modified SOHMMM algorithm. The organization and structure of the paper is as follows. The theory of discrete hidden Markov chains and HMMs is briefly reviewed in Section 2. Moreover, Section 2 investigates topics relating to Bayesian inference and sets the basis of on-line gradient descent for reparameterized HMMs.
An abstract and general overview of the SOHMMM framework is given in Section 3, followed by a complete and thorough description of the SOHMMM prototype from an architectural and algorithmic viewpoint. In Section 4, after discussing certain implementation issues (such as HMM architecture adjustment), the SOHMMM’s on-line unsupervised learning algorithm is presented in detail. Experimental


(comparative) results and practical application issues are discussed in Section 5. Finally, Section 6 concludes the paper, identifies several key problem areas and proposes possible improvements. In addition, hints for future work and suggestions for potential expansions are given.

2. Preliminaries

2.1. Hidden Markov model

In essence, a HMM is a stochastic process generated by two interwoven probabilistic mechanisms, an underlying one (i.e. hidden) and an observable one (which produces the symbolic sequences). Hereinafter, we may denote each individual HMM as λ = (A, B, π), where A = {a_ij}, B = {b_j(k)} and π = {π_j} are the transition, emission and initial state probability stochastic matrices, respectively. Let {q_t}, t = 1, 2, ..., be a homogeneous Markov chain, where each random variable q_t assumes a value in the state space S = {s_1, s_2, ..., s_N}. The conditional stationary probabilities are denoted as a_ij = P(q_t = s_j | q_{t-1} = s_i), t > 1, 1 ≤ i ≤ N, 1 ≤ j ≤ N. The initial state probability distribution is defined as π_j = P(q_1 = s_j), 1 ≤ j ≤ N. Let {Y_t}, t = 1, 2, ..., be a random process defined over the finite space V = {v_1, v_2, ..., v_M} where, in general, M ≠ N. The processes {q_t} and {Y_t} are related by the conditional probability distributions b_j(k) = P(Y_t = v_k | q_t = s_j), t ≥ 1, 1 ≤ j ≤ N, 1 ≤ k ≤ M. Suppose O = o_1 o_2 ... o_T is an observation sequence where each observation o_t assumes a value from V, and T is the number of observations in the sequence. In certain cases the emission probabilities' indexes are denoted as o_t rather than k. This interchange is made whenever the exact observation symbols are insignificant for the formulation under consideration, whereas these values are considered to be given and specific. Consider the forward variable α_t(i) = P(o_1 o_2 ... o_t, q_t = s_i | λ) and the backward variable β_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = s_i, λ).
The forward–backward algorithm provides a computationally efficient solution to the scoring problem for a HMM (Koski, 2001; Rabiner, 1989). It allows the evaluation of the probability of O conditioned on model λ (likelihood), so that the computational requirement is linear in the observation sequence length:

$P(O|\lambda) = \sum_{j=1}^{N} \alpha_t(j)\,\beta_t(j).$    (1)
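As an illustrative sketch (our own minimal NumPy implementation, not the authors' code), the forward and backward recursions and the likelihood evaluation of (1) can be written as follows; all variable names are ours:

```python
import numpy as np

def forward_backward_likelihood(A, B, pi, O, t):
    """Evaluate P(O | lambda) via Eq. (1) at an arbitrary time index t.

    A  : (N, N) transition matrix, A[i, j] = a_ij
    B  : (N, M) emission matrix,  B[j, k] = b_j(k)
    pi : (N,)   initial state distribution
    O  : sequence of integer observation symbols (indices into V)
    """
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    # Forward pass: alpha_l(i) = P(o_1 ... o_l, q_l = s_i | lambda)
    alpha[0] = pi * B[:, O[0]]
    for l in range(1, T):
        alpha[l] = (alpha[l - 1] @ A) * B[:, O[l]]
    # Backward pass: beta_l(i) = P(o_{l+1} ... o_T | q_l = s_i, lambda)
    beta[T - 1] = 1.0
    for l in range(T - 2, -1, -1):
        beta[l] = A @ (B[:, O[l + 1]] * beta[l + 1])
    # Eq. (1): summing alpha_t(j) * beta_t(j) over j yields the same
    # likelihood for every choice of t.
    return np.sum(alpha[t] * beta[t])
```

In practice one works with scaled or log-domain quantities, as noted below, since the raw probabilities underflow for long sequences.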

It should be noted that 1 ≤ t ≤ T; consequently, we have T ways of computing P(O|λ). For technical reasons, probabilities can be very small. The solution is to work with the corresponding logarithms.

2.2. Estimating model parameters

A challenging type of inference is that of deriving a parameterized HMM λ from a corpus of data O; further details can be found in Baldi and Brunak (2001). From Bayes' theorem we immediately have

$P(\lambda|O) = P(O|\lambda)\,P(\lambda)/P(O).$    (2)

The prior P (λ) represents our estimate of the probability that HMM λ is correct before we have obtained any data. The posterior P (λ|O) represents our updated belief in the probability that HMM λ is correct given that we have observed the data set O. The term P (O|λ) is referred to as the likelihood. The most common objective is to find or approximate the best HMM within a class, that is, to find the set of parameters that maximize the posterior. This is called maximum a posteriori (MAP) estimation. To apply (2) to any class of HMMs, we will need to specify


the prior P(λ) and the data likelihood. Once the prior and the data likelihood terms are made explicit, the initial modeling effort is complete. All that is left is cranking the engine of probability theory. From an optimization standpoint, the prior plays the role of a regularizer, that is, of an additional penalty term that can be used to enforce additional constraints. Note that the term P(O) in (2) plays the role of a normalizing constant that does not depend on the HMM's parameters, and is therefore irrelevant for the optimization. If the prior P(λ) is uniform over all HMMs considered, then the problem reduces to finding the maximum of P(O|λ), which is maximum likelihood (ML) estimation. The estimation problem for HMMs is typically resolved using the Baum–Welch algorithm (Baum, 1972). At each iteration of the Baum–Welch algorithm, the expressions for resetting the transition and emission probabilities are

$a_{ij}^{(next)} = a_{ij} \sum_{l=1}^{T-1} \alpha_l(i)\, b_j(o_{l+1})\, \beta_{l+1}(j) \Big/ \sum_{l=1}^{T-1} \alpha_l(i)\, \beta_l(i), \quad 1 \le i \le N,\ 1 \le j \le N$    (3)

$b_j^{(next)}(t) = \sum_{l=1}^{T} I\{o_l = t|\lambda\}\, \alpha_l(j)\, \beta_l(j) \Big/ \sum_{l=1}^{T} \alpha_l(j)\, \beta_l(j), \quad 1 \le j \le N,\ 1 \le t \le M$    (4)

where the indicator function is defined as

$I\{o_l = t|\lambda\} = \begin{cases} 1, & o_l = v_t \\ 0, & \text{otherwise.} \end{cases}$    (5)
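A single Baum–Welch reestimation step, implementing (3)–(5) for one training sequence, might be sketched as follows (an unscaled, illustrative version for clarity; helper names are ours, and a practical implementation would use scaling):

```python
import numpy as np

def baum_welch_step(A, B, pi, O):
    """One reestimation of the transition and emission matrices, Eqs. (3)-(5)."""
    N, M, T = A.shape[0], B.shape[1], len(O)
    # Forward and backward variables (unscaled, for exposition only).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for l in range(1, T):
        alpha[l] = (alpha[l - 1] @ A) * B[:, O[l]]
    beta[T - 1] = 1.0
    for l in range(T - 2, -1, -1):
        beta[l] = A @ (B[:, O[l + 1]] * beta[l + 1])
    # Eq. (3): numerator sum_l alpha_l(i) b_j(o_{l+1}) beta_{l+1}(j),
    # multiplied by a_ij, over denominator sum_l alpha_l(i) beta_l(i).
    num = np.zeros((N, N))
    for l in range(T - 1):
        num += np.outer(alpha[l], B[:, O[l + 1]] * beta[l + 1])
    den = (alpha[:-1] * beta[:-1]).sum(axis=0)
    A_next = A * num / den[:, None]
    # Eqs. (4)-(5): emission counts restricted to positions where o_l = v_t.
    gamma = alpha * beta                       # gamma[l, j] = alpha_l(j) beta_l(j)
    B_next = np.zeros((N, M))
    for t_sym in range(M):
        mask = np.array([o == t_sym for o in O])
        B_next[:, t_sym] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return A_next, B_next
```

Note that the reestimated rows of both matrices remain properly normalized, which is a convenient correctness check.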

The Baum–Welch learning equations are proven to be identical to the E–M steps for this particular problem (Rabiner, 1989). In certain cases the Baum–Welch algorithm turns out to be problematic (Baldi & Chauvin, 1994; Koski, 2001). Especially in the case of a single training sequence, at each iteration the algorithm resets a transition or emission probability to its expected frequency. As a result the Baum–Welch algorithm can lead to abrupt jumps in parameter space, and consequently, the procedure is not suitable for on-line learning, i.e. for application after each training example. Another drawback of the Baum–Welch algorithm is that zero probabilities are absorbing: once a transition or emission probability is set to 0, it is never used again and thereafter remains equal to 0. In most interesting models, the function being optimized is complex and its modes cannot be solved analytically. Thus, one must resort to iterative and possibly stochastic methods such as gradient descent (Baldi, 1995; Jang, Sun, & Mizutani, 1997) or simulated annealing, and also settle for approximate or suboptimal solutions. Such methods have been shown to yield solutions comparable to those of the standard reestimation procedures. While the general gradient descent principle is simple, in complex parameterized models it can give rise to different implementations, depending on how the gradient is actually computed. This often requires the propagation of information ''backwards'', as in the case of the forward–backward algorithm for HMMs. The gradient descent equations on the likelihood can be derived directly by exploiting a useful reparameterization. We reparameterize the HMM using normalized exponentials, in the form

$a_{ij} = e^{w_{ij}} \Big/ \sum_{l=1}^{N} e^{w_{il}}, \qquad b_j(t) = e^{r_{jt}} \Big/ \sum_{l=1}^{M} e^{r_{jl}}, \qquad \pi_j = e^{u_j} \Big/ \sum_{l=1}^{N} e^{u_l}$    (6)

with w_ij, r_jt, u_j as the new variables (Baldi & Chauvin, 1994). Based on this reparameterization we infer the partial derivatives of the likelihood with respect to the HMM's parameters (these formulas


can be found in the Appendix). The reparameterization has three main advantages: first, modification of the new variables automatically preserves the standard stochastic constraints on the initial state, transition and emission probability distributions; second, initial state, transition and emission probabilities can never reach the absorbing value 0; third, in the case of a single training sequence, or on-line learning, abrupt jumps in parameter space are avoided. Due to these advantages the above reparameterization overcomes certain drawbacks of the Baum–Welch algorithm and of similar E–M methods. It is obvious that the new variables w_ij, r_jt, u_j can also be arranged in a series of matrices, namely W = {w_ij}, R = {r_jt} and U = {u_j}. Correspondingly, a complete specification of a HMM requires specification of the cardinalities of the two state spaces (namely N and M), of the observation symbols, and of the matrices W, R and U. Henceforth, we may use the compact notation λ = (W, R, U).

Fig. 1. Paradigm of a rectangular SOHMMM lattice. Each four-vertex graph represents a reference HMM neuron. The shaded area is an indicative winner neuron's (λc) closest topological neighborhood (NBc).

3. The self-organizing hidden Markov model map

3.1. Generic framework

Studies conducted for many years by a large number of researchers have convincingly shown that the best self-organizing results are obtained if the following two partial processes are implemented in their purest forms (Kohonen, 2001): (1) decoding of the neuron that has the best match with the input data pattern (the so-called ''winner''); (2) adaptive improvement of the match in the neighborhood of neurons centered around the ''winner''. The SOHMMM may be described formally as a nonlinear, ordered, smooth mapping of observation sequence data onto the elements of a regular, low-dimensional array. The mapping is implemented in the following way, which resembles the two aforementioned processes. Assume first that O is an observation sequence. With each element e in the SOHMMM array we associate a HMM λe.
Considering the probabilities of O given each λe (likelihoods), the image of an input observation sequence O on the SOHMMM array is defined as the array element with optimum weighted likelihood, where weighting refers to the neighborhood function defined over the topology of the map. Our task is to define each individual HMM λe in such a way that the mapping is ordered, descriptive and representative of the distribution of O. Consider Fig. 1, where a two-dimensional ordered array of nodes, each one having a HMM λe associated with it, is shown. Moreover, consider the neighborhood set NBc around the model λc , which matches best with O (viz. the model returning the optimum weighted likelihood). Here NBc consists of all neurons up to a certain radius on the grid from neuron c. The next task is to adjust the parameters of all HMMs within NBc to

optimize the corresponding energy-cost function, that is, to gain some knowledge from the given input O. Actually, we attempt to optimize the parameters of every λe ∈ NBc so as to best describe how a given observation comes about. This is achieved by employing the on-line learning algorithm detailed next. The SOHMMM, in its basic form, produces a probability distribution graph of input data that summarizes the HMMs' distributions on observation sequence space. It converts the nonlinear statistical relationships between sequence data into simple geometric relationships of their image points on a low-dimensional display, usually a regular two-dimensional grid of nodes. As the SOHMMM thereby compresses information while preserving the most important topological and statistical relationships of the primary data elements on the display, it may also be thought to produce some kind of abstractions. These characteristics (abstraction, dimensionality reduction and visualization in synergy with clustering) can be utilized in a number of sequence data analysis tasks.

3.2. Analysis of the SOHMMM

Let O = o_1 o_2 ... o_T be an observation sequence where each observation o_t assumes a value from the alphabet F = {f_1, f_2, ..., f_G}, and T is the number of observations in the sequence. In addition, let Λ be a class of HMMs such that the corresponding specified observation symbols (V = {v_1, v_2, ..., v_M}) are a superset of the alphabet F (F ⊆ V). A further assumption is that two distinct HMMs λ_ε, λ_e ∈ Λ, in general, have different cardinalities of the corresponding state spaces (namely N^(ε) ≠ N^(e) and M^(ε) ≠ M^(e)), different observation symbols (V^(ε) ≠ V^(e)), and different matrices W, R and U (difference with respect to a matrix refers to nonidentical dimensions and/or nonidentical values of corresponding elements). Consider Fig.
2, where the SOHMMM defines a mapping from the input observation sequence space onto a two-dimensional array of neurons. In total there are E neurons. With every neuron e, a HMM λe ∈ Λ, also called reference HMM, is associated. The lattice type of the array can be defined to be rectangular, hexagonal or even irregular. In the simplest case, an input observation sequence O is connected to all HMMs in parallel. In an abstract scheme it may be imagined that the input O, by means of some parallel computing mechanisms, is compared with all the λe and the location of the best match is defined as the location of the response. Actually, the exact magnitude of the response need not be determined; the input is simply mapped onto this location, like in a set of decoders. We may then claim that the SOHMMM is a nonlinear projection of the input observation sequence onto the two-dimensional display. During learning, or the process in which the nonlinear projection is formed, those HMMs that are topologically close in the array up to a certain geometric distance will activate each other to learn something from the given input O. This will result in a local relaxation or smoothing effect on the parameters of HMMs in this neighborhood, which in continued learning leads to global ordering. The SOHMMM is based on an energy-cost function, therefore, its unsupervised learning algorithm corresponds to some kind of stochastic gradient descent on this function, hence having an arguably solid mathematical foundation. The choice we make for the SOHMMM energy function is

$\mathcal{E} = \left\langle \min_d \sum_{e=1}^{E} \{-h_{de}\, P(O|\lambda_e)\} \right\rangle$    (7)

where ⟨···⟩ denotes averaging over the distribution of input sequences, h_de describes the neighborhood function of the lattice, and P(O|λe) denotes the probability of O given λe (likelihood). In essence, this energy function represents the minimum weighted negative likelihoods (of the neurons), averaged over the distribution of inputs. Consequently, it measures (to a certain extent) the SOHMMM's adaptability and capability to describe a corpus of observation sequences. Also, such energy functions are continuous for finite sets of sequences. An alternative definition of the energy function (7) is

$\mathcal{E} = \left\langle \sum_{d=1}^{E} K(O, \lambda_d) \sum_{e=1}^{E} \{-h_{de}\, P(O|\lambda_e)\} \right\rangle$    (8)

where

$K(O, \lambda_c) = \begin{cases} 1, & c = \arg\max_d \sum_{e=1}^{E} h_{de}\, P(O|\lambda_e) \\ 0, & \text{otherwise.} \end{cases}$    (9)

Both energy formulas are in accordance with the variation proposed in Heskes (1996, 1999), which also uses a slightly different determination of the winner node. Performing stochastic gradient descent on the proposed energy function is thereupon a straightforward procedure. Given an observation sequence O, the energy function induces the sample function

$\mathcal{E}(O) = \sum_{d=1}^{E} K(O, \lambda_d) \sum_{e=1}^{E} \{-h_{de}\, P(O|\lambda_e)\}.$    (10)

Then each on-line learning step corresponds to a local gradient descent on such a sample. A stochastic gradient descent with respect to the parameters x has the form

$x^{(next)} = x^{(now)} - \eta\, \partial \mathcal{E}(O)/\partial x \big|_{x=x^{(now)}}$    (11)

where η is the learning rate, which can be fixed or adjusted during the learning process. By performing the differentiation we obtain

$\partial \mathcal{E}(O)/\partial x = \sum_{d=1}^{E} \left[\partial K(O, \lambda_d)/\partial x\right] \sum_{e=1}^{E} \{-h_{de}\, P(O|\lambda_e)\} + \sum_{d=1}^{E} K(O, \lambda_d) \sum_{e=1}^{E} \{-h_{de}\, \partial P(O|\lambda_e)/\partial x\}.$    (12)

The first term on the right hand side vanishes (see Hammer et al. (2004) for a detailed evaluation/justification); the second term on the right hand side yields the desired learning rule

$x^{(next)} = x^{(now)} + \eta \sum_{d=1}^{E} K(O, \lambda_d) \sum_{e=1}^{E} h_{de}\, \partial P(O|\lambda_e)/\partial x \Big|_{x=x^{(now)}}.$    (13)

Fig. 2. Hexagonal SOHMMM lattice where each individual HMM is depicted as a fully-connected four-vertex graph. The circular gray areas are examples of topological neighborhoods (NBc) at different points in time (y1 < y2 ≪ y3). At each characteristic instance (a)–(c) of the SOHMMM unsupervised training procedure the corresponding best matching neuron, with respect to the given input sequence, is denoted as λc. In (a) and (b) the neighborhoods' widths are equal since the learning step y2 immediately follows y1, whereas in (c) the neighborhood's radius has decreased since the learning step y3 is many iterations after the first two (i.e. y1 and y2).
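The winner indicator (9) and the sample energy (10) reduce to a few lines of code; a minimal sketch (our own, with the per-neuron likelihoods and the neighborhood matrix assumed given):

```python
import numpy as np

def sample_energy(likelihoods, H):
    """Sample energy of Eq. (10) for one observation sequence O.

    likelihoods : (E,) vector with P(O | lambda_e) for every neuron e
    H           : (E, E) matrix of neighborhood values h_de
    Returns the winner index c of Eq. (9) and the sample energy of Eq. (10).
    """
    weighted = H @ likelihoods      # sum_e h_de P(O | lambda_e), for each d
    c = int(np.argmax(weighted))    # Eq. (9): K(O, lambda_d) = 1 iff d = c
    energy = -weighted[c]           # Eq. (10): K selects the winner's row
    return c, energy
```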

An additional theoretical benefit of using stochastic gradient descent on an energy function is conceptualizing (a priori) what the learning rule in fact minimizes. This translates to the practical advantage of having a global measure of learning performance. An algorithm for adjusting the HMMs' parameters, in order to minimize the corresponding energy function, can be derived by using the three propositions in the Appendix and (13). The resulting SOHMMM unsupervised learning algorithm is an iterative procedure that uses a modified definition of the best matching neuron, and provides rules for adjusting the parameters of interest, namely W^(e), R^(e) and U^(e). Thus, the definition of the best matching HMM (indicated by the subscript c), i.e. the HMM corresponding to the maximum weighted likelihood, is given by

$c = \arg\max_d \sum_{e=1}^{E} h_{de}(y)\, P(O|\lambda_e) \Big|_{y}$    (14)

where y = 1, 2, 3, ..., is an integer, the discrete time coordinate. With the use of (13) and Proposition 1 in the Appendix, the on-line gradient descent equation with respect to w_ij is

$w_{ij}^{(e)}(y+1) = w_{ij}^{(e)}(y) + \eta(y)\, h_{ce}(y) \cdot a_{ij} \left\{ \sum_{l=1}^{T-1} \left[ \alpha_l(i)\, b_j(o_{l+1})\, \beta_{l+1}(j) - \alpha_l(i)\, \beta_l(i) \right] \right\} \Bigg|_{\lambda_e,\, y}, \quad 1 \le i \le N,\ 1 \le j \le N.$    (15)


Similarly, with the use of (13) and Proposition 2 in the Appendix, the stochastic learning rule with respect to r_jt is

$r_{jt}^{(e)}(y+1) = r_{jt}^{(e)}(y) + \eta(y)\, h_{ce}(y) \cdot \sum_{l=1}^{T} \left[ I\{o_l = t|\lambda\}\, \alpha_l(j)\, \beta_l(j) - b_j(t)\, \alpha_l(j)\, \beta_l(j) \right] \Bigg|_{\lambda_e,\, y}, \quad 1 \le j \le N,\ 1 \le t \le M.$    (16)

Finally, by using (13) and Proposition 3 in the Appendix, the stochastic gradient descent equation with respect to u_j is

$u_j^{(e)}(y+1) = u_j^{(e)}(y) + \eta(y)\, h_{ce}(y) \cdot \left[ \pi_j\, b_j(o_1)\, \beta_1(j) - \pi_j\, P(O|\lambda) \right] \Big|_{\lambda_e,\, y}, \quad 1 \le j \le N.$    (17)

The function η(y) plays the role of a scalar learning rate factor (0 < η(y) < 1), and usually is decreasing monotonically in time (at least during the ordering process). In the relaxation process, the function h_ce(y) has a very central role: it acts as the neighborhood function, a smoothing kernel defined over the lattice points. For convergence, it is necessary that h_ce(y) → 0 when y → ∞. Usually h_ce(y) = h(‖δ_c − δ_e‖, y), where δ_c, δ_e ∈ ℜ² are the location vectors of HMMs λ_c and λ_e on the array. With increasing ‖δ_c − δ_e‖ the function h_ce → 0. The width and form of h_ce define the stiffness of the elastic surface to be fitted to the input data. In the literature, two simple choices for h_ce(y) occur frequently. The simpler of them refers to a neighborhood set of array points around HMM λ_c (Fig. 2). Let their set index be denoted as NB_c, whereby h_ce(y) = 1 if e ∈ NB_c and h_ce(y) = 0 if e ∉ NB_c. It is common practice that the radius of NB_c(y) decreases monotonically in time (at least during the ordering process). Another widely applied, smoother neighborhood kernel can be written in terms of the Gaussian function

$h_{ce}(y) = \exp\left( - \|\delta_c - \delta_e\|^2 / 2\sigma^2(y) \right)$    (18)

where the parameter σ(y) defines the width of the kernel and corresponds to the radius of NB_c(y) above (σ(y), too, is some monotonically decreasing function of time).

At this point it should be noted that the SOHMMM's learning algorithm is clearly on-line, and as such, it is closer to the Bayesian spirit of updating one's belief as data become available. More importantly, perhaps, learning after the presentation of each sample introduces a degree of stochasticity that may be useful in exploring the space of solutions and averting the phenomenon of trapping in local optima (Bishop, 1995; Jang et al., 1997).

The SOHMMM on-line gradient descent unsupervised learning algorithm, detailed in this section, is only representative of many alternative forms and can obviously give rise to a number of variants and different implementations (e.g. a learning algorithm with a momentum term for regulating the influence of the previous gradient descent direction). Even devising a batch counterpart is quite a straightforward procedure. Such a batch version of the SOHMMM unsupervised learning algorithm would require performing gradient descent on the energy function (7) instead of the sample function. The remaining actions necessary for completing the task would be similar to the steps followed for deriving the on-line version. The outcome of this procedure would be learning rules (incorporating information from all available sequences) consisting of summations over all on-line learning equations (each of which corresponds to an individual sequence).
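For concreteness, the Gaussian kernel of Eq. (18) and the two monotonically decreasing schedules can be sketched in a few lines of Python. The exponential learning-rate form and the linear width form mirror the choices reported in the experiments of Section 5, but the function names and endpoint values here are illustrative, not part of the paper's formulation:

```python
import numpy as np

def h_ce(delta_c, delta_e, sigma_y):
    # Gaussian neighborhood kernel of Eq. (18)
    d2 = float(np.sum((np.asarray(delta_c, float) - np.asarray(delta_e, float)) ** 2))
    return np.exp(-d2 / (2.0 * sigma_y ** 2))

def eta(y, steps, eta0=1.0, eta1=0.1):
    # learning rate decreasing exponentially from eta0 to eta1 over `steps`
    return eta0 * (eta1 / eta0) ** (y / float(steps))

def sigma(y, steps, sigma0, sigma1=1.0):
    # kernel width decreasing linearly from sigma0 down to one map unit
    return sigma0 + (sigma1 - sigma0) * y / float(steps)
```

The kernel equals 1 for the winner itself and decays with lattice distance, which is exactly the "stiffness" control discussed above.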

4. Algorithmic realization of the SOHMMM

In the previous section, the SOHMMM has been presented in its most generic formulation. It is reasonable that such an analysis could lead to several possible variations of the SOHMMM's algorithm, structure, and architecture. A general, yet thorough and complete, description of the SOHMMM framework from an algorithmic perspective is given next. Several practical implementation issues are also investigated and analyzed.

4.1. HMM architecture

If the states of the Markov chain in the HMMs have no physical or biological interpretation, a basic problem seems to exist: different topologies and architectures might be used to model the same corpus of sequences. A reasonable question is whether the HMM architecture can be adjusted from the data (Baldi & Brunak, 2001). Similar and relevant questions do not seem to have been widely studied (Koski, 2001). Nevertheless, certain algorithms for adjusting HMM architectures have been developed (Fujiwara, Asogawa, & Konagaya, 1994; Krogh et al., 1994; Stolcke & Omohundro, 1993). The basic idea in Stolcke and Omohundro (1993) is to start with a very complex model, essentially one state per sequence letter, and then iteratively merge states. In Fujiwara et al. (1994), on the other hand, the starting point is a small, fully connected HMM; the algorithm in this case proceeds iteratively by deleting transitions with very low probability and duplicating the most connected states. Last, in Krogh et al. (1994) an algorithm is presented for the dynamic adjustment of the HMM length during learning. The idea is to add or remove states wherever needed along the architecture, while respecting the overall connectivity pattern.
The impact and practical usage of these techniques in the HMM research field have been tenuous due to their drawbacks, in particular: (1) all methods are applicable only to a single HMM that models exactly one class of sequence data; extensions for ensembles of HMMs, or for modeling two or more families of sequences, are not suggested. (2) All approaches report good results on small test cases associated with HMMs having few states; on the contrary, in the general case, where the number of all possible states is very large, these methods demonstrate high computational costs, thus proving slow and impractical. (3) The choices of states to merge, remove, duplicate, or add, as well as the choice of stopping criteria, are guided by assumptions and thresholds depending widely on prior knowledge. (4) The technique proposed in Krogh et al. (1994) is applicable only to a certain type of architecture, namely left-to-right, and moreover, it has not been shown to always converge to a stable HMM size.

In the SOHMMM approach such preprocessing techniques have been avoided for four main reasons. First, preprocessing introduces additional computational cost, and thus is time consuming. In particular, in the case of the SOHMMM, if instead of estimating a single parameter (the number of states each HMM employs) we used a preprocessing stage like those of Stolcke and Omohundro (1993) or Fujiwara et al. (1994), the imposed computational cost would be many times higher than the computational cost of the learning algorithm itself. Second, as has been discussed previously, there are a number of ways in which prior knowledge can be incorporated in the design of the HMMs. All such methods have been shown to be domain dependent (often with questionable results), relying on ad hoc assumptions and estimations. Following such a method would deviate considerably from the unsupervised learning approach of the SOHMMM. For instance, incorporating a methodology for trimming the HMMs' architectures with respect to prior information (i.e. a choice noticeably biased towards supervised learning strategies) would lead to the oxymoron of an unsupervised learning algorithm integrating a semi-supervised preprocessing module. Third, loss of information in a preprocessing
stage (e.g. feature extraction, numerical representation, sliding time window techniques) is often inevitable (Baldi & Brunak, 2001; Hammer et al., 2004; Xu & Wunsch, 2005). Fourth, we select random values for all parameters (with respect to the standard stochastic constraints) to demonstrate that, starting from an arbitrary initial state, the reference HMMs will attain two-dimensionally ordered values in the long run. In other words, initially unordered HMMs will finally be ordered; this is the basic effect of self-organization.

As shown in Stolcke and Omohundro (1993), Fujiwara et al. (1994) and Krogh et al. (1994), there is no general purpose, computationally efficient, theoretically correct way of predefining the HMMs' cardinalities. Also, the use of substantial prior knowledge or additional information is mandatory, which makes all such techniques problem dependent. Consequently, the decision on the number of states must be made depending on the input data being modeled (Rabiner, 1989). Thus, in the general case, grid based searches are the only means for trimming the SOHMMM architecture, i.e. for selecting the cardinality of each HMM.

4.2. Scaling

The probabilities P(o_1 o_2 ⋯ o_t | λ) are typically very small, since they are equal to the product of many transition and emission probabilities, each less than 1. As t starts to get big (e.g. 10 or more), each term of α_t(i) starts to head exponentially to zero. For sufficiently large t (e.g. 100 or more) the dynamic range of the α_t(i) computation will exceed the precision range of essentially any machine, even in double precision. Thus, the computation of α_t(i) proves to be beyond machine precision as t increases. A similar observation can be made for the backward variables β_t(i) as t decreases. In general, during the implementation of the SOHMMM algorithm, and in particular of the forward and backward procedures, one has to deal with precision issues.
Hence, we need a scaling procedure in which the forward and backward variables are scaled (by a suitable coefficient that depends only on t) during the propagation, in order to avoid underflow (Rabiner, 1989). The scaling of the forward and backward variables is defined in a complementary way, so that the learning equations remain essentially invariant under scaling. It is also important in the SOHMMM to limit some of the HMMs' parameter estimates in order to prevent them from becoming too small. For example, the constraint that b_j(k) must be greater than or equal to some minimum value is necessary to ensure that, even if the kth symbol never occurred in some state j in the training set, there is always a finite probability of its occurrence when evaluating an unknown observation sequence.
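A minimal sketch of the scaled forward recursion, with each scale factor c_t chosen so that the scaled forward variables sum to one at every step (the backward variables would be scaled by the same coefficients, which is what keeps the learning equations invariant). Variable names are ours:

```python
import numpy as np

def scaled_forward(pi, A, B, obs):
    """Forward recursion with per-step scaling to avoid underflow.

    pi: (N,) initial probabilities, A: (N, N) transitions,
    B: (N, M) emissions, obs: sequence of observation indices.
    Returns scaled alphas, scale factors c_t, and log P(O|lambda).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = 1.0 / alpha[0].sum()          # scale so that alpha[0] sums to 1
    alpha[0] *= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = 1.0 / alpha[t].sum()
        alpha[t] *= c[t]
    return alpha, c, -np.log(c).sum()    # log P(O|lambda) = -sum_t log c_t
```

Since P(O|λ) itself underflows for long sequences, only its logarithm is returned.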

4.3. The SOHMMM learning algorithm

We denote the set of D observation sequences as OS = {O^(1), O^(2), …, O^(D)}, where O^(d) = o_1 o_2 ⋯ o_{T_d} is the dth observation sequence, each observation assumes a value from the alphabet F = {f_1, f_2, …, f_G}, and T_d is the number of observations in sequence O^(d). In the present approach, Λ is a class of HMMs with the following attributes: the corresponding specified observation symbols (V = {v_1, v_2, …, v_M}) constitute a set identical to the alphabet F (F ≡ V); two individual HMMs λ_ε, λ_e ∈ Λ have equal cardinalities of the corresponding state spaces (namely N^(ε) = N^(e)) and have, in general, different element values of the matrices W, R and U. The SOHMMM consists of E HMMs, all of which are instances of class Λ (λ_e ∈ Λ, 1 ≤ e ≤ E). These HMMs are assigned to the nodes of a two-dimensional hexagonal lattice. For visual inspection, the hexagonal lattice is to be preferred, because it does not favor horizontal and vertical directions as much as the rectangular array. There are E_1 HMMs per lattice row and E_2 HMMs per lattice column. E_1 and E_2 are chosen appropriately so as to meet two specifications. First, E_1 · E_2 = E (for obvious reasons), and second, the E_1 and E_2 values must impose an oblong form on the map (in other words, the lattice's edges should become rectangular, because the elastic network formed of the HMMs λ_e must be oriented and stabilized along these two dimensions).

During the ordering phase of the SOHMMM's on-line unsupervised learning algorithm, a Gaussian neighborhood function is utilized. The function's standard deviation starts with a value of the same magnitude as half the largest dimension of the lattice and decreases to one map unit at the end of the phase. The learning rate factor decreases monotonically, starting with reasonably high values (close to unity) and at the end attaining values smaller by one (or more) orders of magnitude. Afterwards, during the fine adjustment/tuning phase, the width of the neighborhood remains fixed and NB_c contains the nearest neighbors of HMM λ_c on the lattice. The learning rate is determined by a function inversely proportional to y.

To summarize, the SOHMMM on-line gradient descent unsupervised training algorithm can be formulated as follows:

/*training phase*/
for y = 1 to trainingSteps
    O ← randomCyclicSelection(OS);
    for e = 1 to E
        α_1^(e)(j) = π_j^(e) b_j^(e)(o_1),  1 ≤ j ≤ N;
        α_{t+1}^(e)(j) = [Σ_{i=1..N} α_t^(e)(i) a_ij^(e)] b_j^(e)(o_{t+1}),  1 ≤ t ≤ T − 1, 1 ≤ j ≤ N;
        β_T^(e)(j) = 1,  1 ≤ j ≤ N;
        β_t^(e)(j) = Σ_{i=1..N} a_ji^(e) b_i^(e)(o_{t+1}) β_{t+1}^(e)(i),  t = T − 1, …, 1, 1 ≤ j ≤ N;
    end for
    c ← arg max_e Σ_{ε=1..E} exp(−∥δ_e − δ_ε∥² / 2σ²(y)) P(O|λ_ε);
    for e = 1 to E
        w_ij^(e) ← w_ij^(e) + η(y) exp(−∥δ_c − δ_e∥² / 2σ²(y)) · [Σ_{l=1..T−1} a_ij α_l(i) (b_j(o_{l+1}) β_{l+1}(j) − β_l(i))]|_{λ_e},  1 ≤ i ≤ N, 1 ≤ j ≤ N;
        r_jt^(e) ← r_jt^(e) + η(y) exp(−∥δ_c − δ_e∥² / 2σ²(y)) · [Σ_{l=1..T} α_l(j) β_l(j) (I{o_l = t|λ} − b_j(t))]|_{λ_e},  1 ≤ j ≤ N, 1 ≤ t ≤ M;
        u_j^(e) ← u_j^(e) + η(y) exp(−∥δ_c − δ_e∥² / 2σ²(y)) · [π_j (b_j(o_1) β_1(j) − P(O|λ))]|_{λ_e},  1 ≤ j ≤ N;
        a_ij^(e) ← e^{w_ij^(e)} / Σ_{l=1..N} e^{w_il^(e)},  1 ≤ i ≤ N, 1 ≤ j ≤ N;
        b_j^(e)(t) ← e^{r_jt^(e)} / Σ_{l=1..M} e^{r_jl^(e)},  1 ≤ j ≤ N, 1 ≤ t ≤ M;
        π_j^(e) ← e^{u_j^(e)} / Σ_{l=1..N} e^{u_l^(e)},  1 ≤ j ≤ N;
    end for
end for
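The last three assignments of the training loop map the unconstrained parameters w, r, u back onto valid stochastic matrices after every gradient step; they are exactly row-wise softmax normalizations. A small sketch (the max-subtraction is a standard numerical-stability detail of our own, not part of the paper's formulation):

```python
import numpy as np

def softmax_rows(x):
    # row-wise softmax: exp(x_ij) / sum_l exp(x_il), stabilized by
    # subtracting the row maximum before exponentiating
    z = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))   # unconstrained transition parameters
R = rng.normal(size=(3, 4))   # unconstrained emission parameters
U = rng.normal(size=3)        # unconstrained initial-state parameters
A = softmax_rows(W)           # transition matrix, rows sum to 1
Bm = softmax_rows(R)          # emission matrix, rows sum to 1
pi = softmax_rows(U)          # initial state distribution, sums to 1
```

Because the softmax output is strictly positive, this parameterization never produces exactly zero probabilities, although in practice a floor on b_j(k), as discussed in Section 4.2, may still be imposed.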


4.4. Computational complexity

Finally, an analysis of the SOHMMM on-line unsupervised learning algorithm reveals that each single training step requires on the order of E(N + M)NT_d calculations, where T_d ≫ N and T_d ≫ M (exceptions to these conditions can be found only in degenerate cases). Since the computational cost of each training step is constant and independent of the overall number of employed training sequences, the total execution time of the training phase scales linearly with this number.

5. Experimental study

The main objective of the artificial sequence data experimental setup is to study and analyze the SOHMMM's capabilities as a unified framework for unsupervised clustering and nonlinear projection (on a low-dimensional display). The splice junction gene sequence recognition problem is used for an additional purpose: the primary goal of the latter experimental framework is a comparative study of the SOHMMM against a wide variety of machine learning algorithms on a well-established biological sequence classification problem. Both experiments, based on artificial sequence data and on splice sites, should be considered exemplary paradigms for solving various real-case problems.

5.1. Artificial sequence data

Initially, consider the thirty-letter alphabet F = {f_1, f_2, …, f_30}. Let SA be a group of sequence data (observation sequences) where the symbols of each sequence are selected equiprobably from the universe of discourse {f_1, f_2, …, f_30}. Similarly, let SB be a group of sequence data where the symbols of each sequence are selected equiprobably from the universe of discourse {f_16, f_17, …, f_30}. Let SC be a group of sequence data where the procedure for obtaining the symbols of each sequence is as follows. Consider five disjoint six-letter universes of discourse, DN_1 = {f_1, f_2, …, f_6}, DN_2 = {f_7, f_8, …, f_12}, DN_3 = {f_13, f_14, …, f_18}, DN_4 = {f_19, f_20, …, f_24}, DN_5 = {f_25, f_26, …, f_30}, with the restriction that within a universe each letter is chosen with equal probability. Also, assume that a temporal cyclic permutation of these five universes has been defined, specifically DN_1 → DN_3 → DN_5 → DN_2 → DN_4 → DN_1. An initial universe is randomly chosen among the five. From this universe an observation is selected equiprobably. The next universe is chosen according to the predefined temporal cyclic permutation, and the observation selection procedure is repeated. Continuation of the above process yields observation sequences of desired lengths. Moreover, it must be stressed that all the following experiments are conducted with observation sequences of variable lengths randomly chosen from the interval [1000, 10 000]. As a result, the corresponding length ranges rest an order of magnitude apart (at maximum).

The training set for the first experimental scheme contains 3000 sequence data from group SA, 3000 observation sequences from group SB, and 3000 sequences from group SC; in total, the input observation sequence data are 9000. The SOHMMM array consists of 28 HMMs arranged in a 4 × 7 lattice. The learning rate was an exponentially decreasing function between 1.0 and 0.1, whereas the neighborhood kernels' standard deviations started with a value equal to half the largest dimension of the grid and decreased linearly to one map unit. The training phase's duration was set to 5 epochs. This common standard choice of functions (which is usual among SOM networks) was sufficient. The results, before and after the completion of the ordering and tuning phases of the SOHMMM on-line unsupervised training algorithm, are illustrated in Fig. 3. The SOHMMM, along a straightforward unsupervised

Fig. 3. The SOHMMM lattice where each hexagon represents a HMM neuron. A randomly initialized (viz. untrained) SOHMMM is depicted in (a). The same SOHMMM after the completion of the ordering and tuning phases is shown in (b). In (a) and (b) HMMs corresponding to sequences from each one of the three data groups are differentiated by distinct gray scale shades. In (c) HMM nodes that represent sequences belonging to different data groups are distinguished by vertical, diagonal, and horizontal lineation.

learning process, produces an informative description of the input sequence space. The SOHMMM suggests a segregation of the training set into three clusters, where each cluster is associated with exactly one of the three sequence groups comprising the input data set. These clusters are separate, without any overlapping regions (i.e. high degree of external separation), and moreover, the SOHMMM correctly assigns all 9000 observation sequences to their respective clusters (i.e. increased internal homogeneity). Apart from the accurate mapping of all three groups, each group is represented by a set of topologically neighboring HMMs that form a higher-level systematic probabilistic descriptor.

The second experimental scheme employs the trained SOHMMM of the first experiment and requires the definition of four more sequence data groups. SD is a group of sequence data where the symbols of each sequence are selected equiprobably from the universe of discourse {f_9, f_10, …, f_30}. Likewise, SE is a group of observation sequences where the symbols of each sequence are chosen equiprobably from the universe of discourse {f_24, f_25, …, f_30}. Consider five disjoint three-letter universes of discourse, {f_1, f_2, f_3}, {f_8, f_10, f_12}, {f_13, f_15, f_18}, {f_22, f_23, f_24}, {f_25, f_27, f_29}, with the restriction that within a universe each letter is chosen with equal probability. Also, assume that a temporal cyclic permutation is defined over them, specifically {f_1, f_2, f_3} → {f_13, f_15, f_18} → {f_25, f_27, f_29} → {f_8, f_10, f_12} → {f_22, f_23, f_24} → {f_1, f_2, f_3}. Then, SF is a group of sequence data with a construction process similar to the one that generates observation sequences for group SC above.
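The construction shared by groups SC and SF can be written down as a short generator. The sketch below uses the six-letter universes and cycle order of SC as defined in Section 5.1; the function and variable names are ours:

```python
import random

ALPHABET = [f"f{i}" for i in range(1, 31)]                 # f1 ... f30
UNIVERSES = [ALPHABET[i:i + 6] for i in range(0, 30, 6)]   # DN1 ... DN5
CYCLE = [0, 2, 4, 1, 3]  # DN1 -> DN3 -> DN5 -> DN2 -> DN4 -> DN1

def sc_sequence(length, rng=random):
    """Emit a group-SC observation sequence of the given length."""
    pos = rng.randrange(5)                   # random initial universe
    seq = []
    for _ in range(length):
        seq.append(rng.choice(UNIVERSES[CYCLE[pos]]))  # equiprobable pick
        pos = (pos + 1) % 5                  # follow the cyclic permutation
    return seq
```

Sequences for SA and SB are simpler still (uniform draws from {f_1, …, f_30} and {f_16, …, f_30}, respectively), and the SG generator described next would only differ in choosing the step direction at random.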
Finally, assume that a temporal bidirectional permutation has been defined over the previous disjoint six-letter universes of discourse, specifically DN_1 ↔ DN_3 ↔ DN_5 ↔ DN_2 ↔ DN_4 ↔ DN_1, where, at each step, the direction to follow is randomly chosen with equal probability. SG is a group of observation sequences whose generation process is identical to the procedure followed for the members of group SC, but according to the bidirectional permutation. The best matching HMMs, i.e. the HMMs yielding the highest likelihoods, for each group's sequences are displayed in Fig. 3(c). It is important to note that the sequence data contained in group SE (SF) are assigned to nodes lying within the cluster of group SB (SC). This is predictable since, according to its definition, SE (SF) is a special case, or equivalently a subgroup, of group SB (SC). As can be seen in Fig. 3(c), this remark is directly reflected on the SOHMMM lattice. On the contrary, member sequences of group SD (SG) are associated with HMMs located around the boundaries of the clusters representing the groups SA and SB (SA and SC). Mappings of this kind should


be considered justifiable, given that sequences of group SD (SG) are constructed as intermediate cases of the sequences contained in groups SA and SB (SA and SC).

This initial series of experiments allows us to draw some conclusions on the properties of the SOHMMM. The SOHMMM incorporates a mechanism for processing discrete symbol sequential data written in alphabets of arbitrary cardinalities, such as the thirty-letter alphabet employed here. Also, the SOHMMM efficiently handles sequences of variable lengths which range an order of magnitude apart, and it exploits latent information hidden in the spatial/temporal correlations and sequential nature of chain molecules, as in the case of groups SC, SF and SG. The SOHMMM has managed to access the significant information that lies in the position/time dependencies of the corresponding groups' sequences. Thus, by exploiting information retrieved solely from the input sequences, it has succeeded in creating higher abstractions (that describe each cluster), written in the probabilistic/statistical language of the HMMs. One of the key features of the SOHMMM is its capability to learn the underlying theory through a pure process of learning from examples, without requiring any kind of prior or posterior knowledge (for example, class/category information). Consequently, it has managed to segregate the training set into three clusters representing the groups contained in the overall data set. As has been shown, the SOHMMM displays clusters alongside their relative positions on a low-dimensional array. Each one of the three clusters represents a compact description of the properties and features of each respective sequence data group. The HMMs of a given cluster develop into decoders of their respective sequence domains; thus, subsets of sequences are projected as probabilistic models in a process resembling dimensionality reduction.
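The best matching HMM computations behind Fig. 3(c) reduce to a likelihood argmax over the lattice, and pairing each node with a group label turns the same argmax into a classifier. A hedged sketch, where `loglik(sequence, model)` stands for any routine evaluating log P(O|λ), such as a scaled forward pass, and all names are illustrative:

```python
import numpy as np

def best_matching_node(sequence, models, loglik):
    # index of the lattice HMM yielding the highest likelihood
    return int(np.argmax([loglik(sequence, m) for m in models]))

def classify(sequence, models, labels, loglik):
    # assign the unknown sequence the label of its best matching node
    return labels[best_matching_node(sequence, models, loglik)]
```

With calibrated labels in place, the same two functions support the discrimination tests, database searches and classification problems discussed in the text.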
As has been demonstrated in the cases of the SD and SG groups, the SOHMMM reached a generalization level that exceeds the description of intra-cluster attributes and extends to inter-cluster attributes. Partially, this is due to the fact that the SOHMMM covers a larger set of distributions, and thus is in a position to express correlations inaccessible to single unrelated/unordered HMMs. Finally, an issue of significant importance is that, once a SOHMMM has been successfully derived from a corpus of sequences, all HMM nodes can be labeled according to the assigned sequences' categories or group information (in a process resembling calibration), as in Fig. 3(b). Since the best matching neuron of any given unknown/unlabeled sequence can be computed, this sequence can be classified as belonging to the group/cluster represented by the labeled HMM node. Processes based on this strategy can be used in discrimination tests, database searches and classification problems. If well-established class-specific training sets are available, SOHMMMs can be derived (based on them), and subsequently, be employed for similarity searches across databases.

5.2. Splice junction gene sequences

Splice junctions mark transitions from expressed to unexpressed gene regions and vice versa; they are thus important for the assembly of structural and regulatory proteins that constitute and control biochemical metabolisms. More specifically, splice sites are locations in the DNA at the boundaries of exons (the parts of the DNA sequence retained after splicing, which code for proteins) and introns (the spliced out parts, which do not code for any proteins). Splice site recognition is therefore a topic of interest for the understanding of genotype/phenotype relationships. The UCI machine learning repository (Frank & Asuncion, 2010) contains a data set for primate (eukaryotic) splice junction determination.
The problem posed in this data set is to recognize the exon–intron boundaries (frequently called donor splice sites), the intron–exon boundaries (acceptor splice sites), and the positions


Table 1
Average error rate percentages of the SOHMMM and various other machine learning algorithms on the splice junction determination problem.

Model            Donor          Acceptor       Neither        Overall
LI SVM           0.8 ± 0.2      0.9 ± 0.3      2.0 ± 0.3      3.7
FK SVM           1.6 ± 0.5      1.6 ± 0.4      2.1 ± 0.4      5.3
TOP SVM          1.5 ± 0.4      1.7 ± 0.3      2.2 ± 0.4      5.4
PCL              –              –              –              5.6
C4.5             –              –              –              5.7
SHMM-DDA         1.0 ± 0.4      2.4 ± 0.7      2.6 ± 0.5      6.0
SOHMMM           2.90 ± 0.35    3.25 ± 0.38    4.05 ± 0.46    10.20 ± 0.57
C2 RIBD          –              –              –              12.0
BRAIN            5.0            4.0            4.0            13.0
KBANN            7.56           8.47           4.62           20.65
Back-Prop.       5.74           10.75          5.29           21.78
PEBLS            8.18           7.55           6.86           22.59
ID3              10.58          13.99          8.84           33.41
COBWEB           15.04          9.46           11.80          36.30
Perceptron       16.32          17.41          3.99           37.72
C2 RIB           –              –              –              46.9
Near. Neighbor   11.65          9.09           31.11          51.85
SVM SMM          12.31          12.32          –              –
SVM MC           17.23                         –              –
SVM IMM                                        –              –

not known to include a splicing site. The data set consists of 3190 DNA sequences, of which 767 are categorized as donors, 768 are categorized as acceptors, and the remaining 1655 are neither. All sequences are defined over the four-letter alphabet of nucleotides {A, T, G, C}. One should bear in mind that this benchmark problem is specially tailored for supervised learning algorithms able to deal with sequences of fixed lengths, where the splice junction is positioned exactly at the middle of each sequence (conditions evidently preferential for most fixed-input supervised models). Obviously, the efficiency of certain supervised learning algorithms would be inferior if the junctions were arbitrarily placed in various sequence positions. Moreover, if the sequences were high-dimensional and of variable lengths, many of the reported supervised learning techniques would be inapplicable. Nevertheless, despite the fact that it is not commonplace (for obvious reasons) to test a purely unsupervised methodology against supervised approaches (exploiting class information, prior knowledge and/or domain theory), we believe that this experimental practice serves as an additional proof of the SOHMMM's performance and capabilities. Also, the present experimental setup provides the opportunity of comparing the SOHMMM against a sufficient number of alternative approaches, namely 19, on a well-established biological sequence classification problem (even though all 19 are supervised).

This series of experiments was conducted using a SOHMMM array of 88 HMM neurons arranged in an 11 × 8 lattice. The learning rate was an exponentially decreasing function between 1.0 and 0.1, whereas the neighborhood kernels' standard deviations started with a value equal to half the largest dimension of the lattice and decreased linearly to one map unit. The duration of both training phases (ordering and tuning) was 10 epochs. The performance of the SOHMMM was evaluated according to the ten-fold cross-validation methodology. Table 1 shows the average error rates achieved by the SOHMMM, and also summarizes the average error rates reported by various machine learning algorithms (Deshpande & Karypis, 2002; Hoche & Wrobel, 2002; Li & Wong, 2003; Rampone, 1998; Sonneburg, Ratsch, Jagota, & Muller, 2002; Towell & Shavlik, 1994). The majority of the reported results were obtained by following a ten-fold cross-validation schema on 1000 randomly selected sequences. The performances of the first six models, as well as the error rates produced by the two Constrained Confidence-Rated ILP-Boosting (C2 RIB) variants, are estimated by a ten-fold cross-validation methodology on all 3190 sequences. In the latter experimental setup, involving all 3190 sequences, the SOHMMM's


(a) Error-epoch trajectory for the first 10 epochs.

(b) Error-epoch trajectory for all 100 epochs.

Fig. 4. Error-epoch trajectories of training the SOHMMM on splice junction sequence data. Plot (a) shows a zoomed portion from point 1 to 10 of the epoch series.

Fig. 5. Average error rate percentages for various sizes of the SOHMMM lattice.

performance improves slightly, therefore, the reported average error rates reduce to 9.31 ± 0.55% (overall), 2.73 ± 0.31% (donor), 3.11 ± 0.36% (acceptor), and 3.47 ± 0.38% (neither). Such an outcome should be expected since the SOHMMM exploits the additional information contained in the complete data set, thus, improving its accuracy. Finally, the error rates of the last three models correspond to a ten-fold cross-validation setup on 1535 sequences. More specifically the results reported for the linear Support Vector Machine (SVM) based classifiers (that operate in the feature spaces used by higher-order Markov chains, Interpolated Markov Models (IMMs) or Selective Markov Models (SMMs)) correspond to a two class problem that involves only the donor and acceptor sequences. Two key observations can be made by examining the results in this table. First, probabilistic models (especially those incorporating HMMs like the Fisher Kernel (FK) SVM, Tangent vectors Of Posterior log-odds (TOP) SVM and SHMM classifier) greatly outperform many other documented methods. This finding further supports the argument that biological sequence analysis will be accelerated by going directly to a systematic probabilistic approach (Baldi & Brunak, 2001). Second, all the reported results are produced by supervised learning models that incorporate prior knowledge and/or domain theory. These models span a wide range of established machine learning techniques such as (weighted) nearest neighbors, decision trees, (multi-layer) perceptrons, hybrid (sub)symbolic classifiers, rule-based classifiers combined with discriminant analysis, (constrained) boosting, SVMs operating in the feature spaces of higher-order Markov models, SHMM classifiers, and SVMs coupled with Locality Improved (LI) or probabilistic kernels. 
On the contrary, the SOHMMM’s performance is superior in comparison to several supervised techniques, despite the fact that it is a pure unsupervised learning algorithm, and as such, makes no use of class information, and also, does not utilize any kind of additional information (in the form of prior knowledge or

domain theory). The only methods that outperform the SOHMMM in terms of accuracy are the rather advanced rule-based classifiers, SHMMs and SVMs. This should be expected not only because these approaches use class information and domain knowledge for adjusting the models' parameters, but also because they are equipped with HMM modules especially designed for modeling the splice junction recognition problem. For instance, the Supervised HMM Domain Dependent Architecture (SHMM-DDA) classifier, which also achieves better results than the SOHMMM, has a predetermined architecture which is biologically motivated; the selection of its hyper-parameters has been carried out independently before the learning phase, and supervised driven HMM modules are dedicated to describing each one of the three classes. The fact that the SOHMMM outdistances all the other supervised learning approaches in the splice site determination problem is justified by its key characteristics, as these have been analyzed previously.

Experiments were conducted to study and analyze in which way and to what extent the SOHMMM's behavior and performance are affected by the duration of the training phase, the size of the lattice and the width of the neighborhood kernel. The learning rate and neighborhood kernels (in the first two experimental setups) were tuned according to the adjustment schemes followed in all previous experiments (since such an approach always works quite well in practice). A representative scenario of a training phase (corresponding to an 11 × 8 SOHMMM lattice) lasting ten times more epochs than previously is illustrated in Fig. 4. Based on this, certain remarks can be made, starting from the evident observation that most learning is completed within the first few epochs, after which the SOHMMM goes through a fine-tuning stage. These findings are in analogy to the ordering and tuning phases that are integrated into the model's unsupervised learning algorithm.
Moreover, one can notice that the average error rate actually decreases, rapidly at the beginning and gradually at the end, despite the fact that the training, which is based on optimizing the weighted likelihood, does not incorporate any kind of class information or domain knowledge. Thus, we can defensibly claim that, as long as category assignments are in accordance with the underlying attributes and structure of the data at hand, the SOHMMM produces rather accurate results. In addition, the fact that at the end of the first 10 epochs the SOHMMM has already reached good levels of performance (which is only slightly improved during the remaining epochs) further supports the statement that even short training periods prove sufficient. Partially, this is due to the proven learning algorithm that generates probabilistic modules (i.e. the SOHMMM neurons) which exploit all provided information in a more efficient manner, at least when compared to other (usually heuristic) deterministic vector-based methodologies. Also, after certain oscillations at the very beginning of the training phase, the SOHMMM converges to an optimum, demonstrating stability along the way, whereas at the end, fluctuations are practically diminished at least for the problem under consideration. The graph in Fig. 5 depicts the results of a stratified tenfold cross-validation methodology when applied to a variety of

Fig. 6. Representative training scenarios with various monotonically decreasing widths of the neighborhood functions. (a) Error-epoch trajectory for an exponentially decreasing standard deviation function σ(y). (b) Error-epoch trajectory for a standard deviation function σ(y) inversely proportional to y.

Fig. 7. Average error rate percentages corresponding to SOHMMMs adjusted by training schemas employing neighborhood kernels with constant widths.

SOHMMM array dimensionalities. A first key observation is that a broad spectrum of lattice sizes yields satisfactory results. Consequently, reasonable variations in the array size have a small impact on the SOHMMM's performance. This relative robustness with respect to certain free parameters (i.e. neurons per row/column) is a desirable characteristic for an unsupervised method, since accurate values cannot be obtained when class information is unknown or missing. The only pitfall that must be avoided is the use of degenerate SOHMMM lattices (e.g. 1 × 1). Obviously, a limited number of HMMs is not in a position to model all data or clusters; as a result, certain attributes and relations are misrepresented, erroneous states prevail, and performance deterioration is inevitable. Conversely, increasing the size of SOHMMM arrays that already work well is not a wise practice, since one can hardly trace a gain in performance, whereas computational efficiency is compromised.

A comprehensive series of experiments was also conducted to study the effect of the neighborhood kernel's standard deviation σ(y) on the SOHMMM's performance and training behavior. The results (corresponding to an 11 × 8 SOHMMM lattice) are presented in Fig. 6. These are partly comparable to the error-epoch training trajectory in Fig. 4(a), where the neighborhood kernels' standard deviations are linear monotonically decreasing functions. When comparing these three graphs, a first direct observation is that both the exponentially decreasing and the inversely proportional standard deviation functions result in smaller fluctuations than the linearly decreasing one. In particular, the neighborhood kernel inversely proportional to y generates a nearly monotonic error-epoch trajectory during the ordering phase and a balanced trajectory during the tuning phase (the latter is a straightforward consequence of the neighborhood width remaining nearly constant for a considerable part of the training). The serious drawback of the two refined monotonically decreasing functions in Fig. 6 is their asymptotic error rate, which is 2%–4% higher than that of the linearly decreasing neighborhood kernel. This should be expected, since the rapidly shrinking standard deviations do not devote as much training time to medium-sized neighborhood widths as the linearly decreasing function; the space of probable solutions is therefore not explored as extensively and thoroughly, which results in lower performance in the long run.

The investigation of the impact of the neighborhood kernel's width (i.e. standard deviation) is complemented by the graph in Fig. 7. The illustrated error rates correspond to a stratified ten-fold cross-validation methodology applied to an 11 × 8 SOHMMM array. For each separate execution of the SOHMMM algorithm the standard deviation remained constant throughout the training phase, taking real values from the range (0, 3]. It is evident that small values of the standard deviation (viz. degenerate neighborhoods that practically contain only one reference HMM) compromise performance considerably. This should be expected, since narrow neighborhood kernels essentially eliminate the cooperation phase of the self-organizing procedure, resulting in SOHMMM neurons that do not interact or cooperate. Conversely, high standard deviation values (viz. extremely wide neighborhoods that involve nearly every neuron of the SOHMMM lattice) cause instabilities and degrade the SOHMMM's performance; this is due to the fact that at each iteration of the training algorithm every neuron is adjusted almost equally by the learning equations. The reference HMMs' parameters (asymptotically) coincide, and at every replication of the SOHMMM learning algorithm they tend to produce identical representations and descriptions of the input sequence. In the middle section of the standard deviation range, on the other hand, one can find constant values for the neighborhood width that yield satisfactory results. In these cases, reasonably sized neighborhoods are formed; frequently these neighborhoods consist of 7–19 neurons (see for instance Fig. 2), and the competition and cooperation procedures of the SOHMMM adaptation/training schema work quite well. Nevertheless, even in these cases performance appears inferior (by approximately 2%) to that of the linear monotonically decreasing neighborhood function. Last, it is important to note that the series of optimal standard deviation values under consideration is the outcome of evaluating average error rate percentages produced with the use of posterior information (i.e. each sequence's categorization). Consequently, these values should be treated more as a rule of thumb, since under a purely unsupervised learning setup (viz. a learning procedure where there are no category assignments, or equivalently, these are considered unknown) the construction of similar graphs and the subsequent estimation of an optimal standard deviation range are theoretically and practically impossible.

6. Discussion and further research

In spite of its capabilities, the SOHMMM can suffer from two particular weaknesses, which are inherited from its constituent HMMs. First, HMMs for chain molecules often have a large number of unstructured parameters; thus, a high amount of meaningful and descriptive (of the sequence space under consideration) biological sequences is mandatory. This problem is aggravated when only a few


sequences are available in a family, a situation not uncommon in the early stages of genome projects or for certain gene and protein groups. Second, first-order HMMs are limited by their first-order Markov property: they cannot express dependencies between hidden states; in other words, they rest on the (arbitrary) assumption that the conditional probability of the current event given the past history depends only upon the immediate past and not upon the remote past. Proteins fold into complex three-dimensional shapes that determine their function. Subtle, long-range correlations may exist in their chains; these are accessible to higher-order HMMs but are usually missed by any first-order HMM. Therefore, the SOHMMM fails to capture such correlations. Nevertheless, we must not overlook the fact that a major difficulty with higher-order HMMs is the exponential growth in the number of free parameters and in the corresponding computational cost. In general, the SOHMMM incorporates first-order HMMs of the highest complexity, viz. fully-connected HMMs where the transition probabilities between all states are non-zero (i.e. the corresponding transition matrices are dense). Further, its unsupervised learning algorithm is in a position to automatically tune each parameter of the reference HMMs appropriately. In real-case applications we need to consider ad hoc SOHMMM architectures, employing HMMs with many more states and typically sparser connectivity. The design or selection of such architectures is highly problem dependent. In biological sequences, the linear aspects of the chains are captured by the so-called left–right architectures, the most basic and widely used of which is the standard linear architecture. These architectures impose certain restrictions on the transition matrix of the underlying hidden Markov chain. Hence, a potential expansion could involve the integration of left-to-right HMMs into the SOHMMM.
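As a rough illustration of the kind of restriction such architectures place on the transition matrix, the sketch below builds the upper-bidiagonal transition matrix of a standard linear left-to-right HMM; the state count and the self-transition probability are arbitrary assumptions, not values taken from the paper.

```python
import numpy as np

def linear_left_right_transitions(n_states, p_stay=0.6):
    # Standard linear left-to-right architecture: state i may either
    # persist (a_ii) or advance to state i+1 (a_i,i+1); every other
    # transition probability is constrained to zero.
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay
        A[i, i + 1] = 1.0 - p_stay
    A[-1, -1] = 1.0  # the final state is absorbing
    return A

A = linear_left_right_transitions(4)
```

By contrast, the fully-connected HMMs the SOHMMM incorporates by default have dense transition matrices with all entries non-zero.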
Apart from that, the proposed SOHMMM can be improved in certain respects. As has been stressed, the SOHMMM's unsupervised learning algorithm is purely on-line. When several training sequences are available, they can be treated as independent. In such a case, an extension to a batch-mode version is quite straightforward, thus motivating a comparative study of these two unsupervised training methodologies (Nakama, 2009). As has been shown, the SOHMMM integrates clustering, dimensionality reduction, and nonlinear projection on a low-dimensional lattice. From the produced mapping, the detection of boundaries between eventual clusters, as well as their corresponding positions and distances, may prove a complex task (Brugger et al., 2008; Tasdemir, 2010; Tasdemir & Merenyi, 2009). Visualization techniques and graphic displays can be developed to illustrate the SOHMMM's exploratory data analysis results, thus making its capabilities useful to a wider range of interdisciplinary tasks.
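As a sanity check on the gradient formulas recapitulated in the Appendix, the emission-parameter derivative of Lemma 2 can be verified numerically against a finite difference of the likelihood; the toy model dimensions and random parameters below are arbitrary assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 3, 4, 6                       # states, alphabet size, sequence length

def normalize(v):
    return v / v.sum(axis=-1, keepdims=True)

pi = normalize(rng.random(N))           # initial state distribution
A = normalize(rng.random((N, N)))       # transition probabilities a_ij
B = normalize(rng.random((N, M)))       # emission probabilities b_j(m)
O = rng.integers(0, M, size=T)          # observation sequence o_1 .. o_T

def likelihood(pi, A, B, O):
    # P(O|lambda) via the forward recursion (A.1)-(A.2).
    alpha = pi * B[:, O[0]]
    for t in range(1, len(O)):
        alpha = (alpha @ A) * B[:, O[t]]
    return alpha.sum()

def forward_backward(pi, A, B, O):
    # Forward variables (A.1)-(A.2) and backward variables (A.3)-(A.4).
    L = len(O)
    alpha = np.zeros((L, len(pi)))
    beta = np.ones((L, len(pi)))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    for t in range(L - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return alpha, beta

# Lemma 2: dP/db_x(y) = (1/b_x(y)) * sum_l I{o_l = y} alpha_l(x) beta_l(x)
x, y = 1, int(O[2])                     # an emission parameter that occurs in O
alpha, beta = forward_backward(pi, A, B, O)
analytic = sum(alpha[l, x] * beta[l, x] for l in range(T) if O[l] == y) / B[x, y]

eps = 1e-6                              # central finite difference on b_x(y)
Bp, Bm = B.copy(), B.copy()
Bp[x, y] += eps
Bm[x, y] -= eps
numeric = (likelihood(pi, A, Bp, O) - likelihood(pi, A, Bm, O)) / (2 * eps)
```

The same finite-difference check applies, mutatis mutandis, to the transition and initial-probability derivatives of Lemma 1 and Lemma 3.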

Acknowledgments

The authors would like to thank graphic artist and biologist Vanessa Ferle for her essential contribution in all graphic design stages. They would also like to acknowledge the assistance of Anastasis Tzimas, whose insightful remarks and comments proved more than inspiring.

Appendix

For convenience of reference we recapitulate here the recursions for the forward and backward probabilities:

$$\alpha_1(j) = \pi_j b_j(o_1), \quad 1 \le j \le N, \tag{A.1}$$

$$\alpha_{t+1}(j) = \left( \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right) b_j(o_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N, \tag{A.2}$$

$$\beta_T(j) = 1, \quad 1 \le j \le N, \tag{A.3}$$

$$\beta_t(j) = \sum_{i=1}^{N} a_{ji} b_i(o_{t+1}) \beta_{t+1}(i), \quad t = T-1, T-2, \ldots, 1, \; 1 \le j \le N. \tag{A.4}$$

By our analysis of the evaluation problem using the forward and backward variables, we have $T$ ways of computing $P(O|\lambda)$; thus

$$P(O|\lambda) = \sum_{j=1}^{N} \alpha_t(j) \beta_t(j), \quad t = 1, 2, \ldots, T. \tag{A.5}$$

For $t = T$ we use the evaluation formula in the form

$$P(O|\lambda) = \sum_{j=1}^{N} \alpha_T(j) \beta_T(j). \tag{A.6}$$

We next use the forward recursion (A.2) for $\alpha_T(j)$ to get

$$P(O|\lambda) = \sum_{j=1}^{N} \sum_{i=1}^{N} \alpha_{T-1}(i) a_{ij} b_j(o_T) \beta_T(j). \tag{A.7}$$

The propositions referenced in the main text are presented next. We have opted to omit those analytical proofs that demonstrate high similarity; the exact mathematical proofs of Propositions 1 and 3 (Lemmas 1 and 3) are analogous to that of Proposition 2 (Lemma 2), which is detailed below. It must be noted that a proof of Lemma 1 can be found in Koski (2001).

Lemma 1.
$$\frac{\partial P(O|\lambda)}{\partial a_{rs}} = \sum_{l=1}^{T-1} \alpha_l(r) b_s(o_{l+1}) \beta_{l+1}(s). \tag{A.8}$$

Proposition 1.
$$\frac{\partial P(O|\lambda)}{\partial w_{ij}} = a_{ij} \sum_{l=1}^{T-1} \left( \alpha_l(i) b_j(o_{l+1}) \beta_{l+1}(j) - \alpha_l(i) \beta_l(i) \right). \tag{A.9}$$

Lemma 2.
$$\frac{\partial P(O|\lambda)}{\partial b_x(y)} = \frac{1}{b_x(y)} \sum_{l=1}^{T} I\{o_l = y|\lambda\} \, \alpha_l(x) \beta_l(x). \tag{A.10}$$

Proof. We differentiate (A.7) partially with respect to $b_x(y)$, to obtain

$$\frac{\partial P(O|\lambda)}{\partial b_x(y)} = I\{o_T = y|\lambda\} \sum_{i=1}^{N} \alpha_{T-1}(i) a_{ix} \beta_T(x) + \sum_{j=1}^{N} \sum_{i=1}^{N} \frac{\partial \alpha_{T-1}(i)}{\partial b_x(y)} a_{ij} b_j(o_T) \beta_T(j). \tag{A.11}$$

Moreover, the use of the forward recursion (A.2) in the first term yields

$$\frac{\partial P(O|\lambda)}{\partial b_x(y)} = \frac{1}{b_x(o_T)} I\{o_T = y|\lambda\} \alpha_T(x) \beta_T(x) + \sum_{j=1}^{N} \sum_{i=1}^{N} \frac{\partial \alpha_{T-1}(i)}{\partial b_x(y)} a_{ij} b_j(o_T) \beta_T(j). \tag{A.12}$$

Hence, we only need to calculate the second term on the right hand side. The task is accomplished by using the forward recursion (A.2) once more. By performing the differentiation we obtain

$$\frac{\partial \alpha_{T-1}(x)}{\partial b_x(y)} = I\{o_{T-1} = y|\lambda\} \sum_{l=1}^{N} \alpha_{T-2}(l) a_{lx} + \left( \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{lx} \right) b_x(o_{T-1}), \tag{A.13}$$

$$\frac{\partial \alpha_{T-1}(i)}{\partial b_x(y)} = \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}), \quad i \ne x. \tag{A.14}$$

Insertion of these partial derivatives into the second term of (A.12) gives

$$\sum_{j=1}^{N} \sum_{i=1}^{N} \frac{\partial \alpha_{T-1}(i)}{\partial b_x(y)} a_{ij} b_j(o_T) \beta_T(j) = \sum_{j=1}^{N} I\{o_{T-1} = y|\lambda\} \sum_{l=1}^{N} \alpha_{T-2}(l) a_{lx} a_{xj} b_j(o_T) \beta_T(j) + \sum_{j=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{lx} b_x(o_{T-1}) a_{xj} b_j(o_T) \beta_T(j) + \sum_{j=1}^{N} \sum_{i=1, i \ne x}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}) a_{ij} b_j(o_T) \beta_T(j) \tag{A.15}$$

$$= \sum_{j=1}^{N} I\{o_{T-1} = y|\lambda\} \sum_{l=1}^{N} \alpha_{T-2}(l) a_{lx} a_{xj} b_j(o_T) \beta_T(j) + \sum_{j=1}^{N} \sum_{i=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}) a_{ij} b_j(o_T) \beta_T(j). \tag{A.16}$$

We now compute the values of each of the two terms on the right hand side of (A.16). First,

$$\sum_{j=1}^{N} I\{o_{T-1} = y|\lambda\} \sum_{l=1}^{N} \alpha_{T-2}(l) a_{lx} a_{xj} b_j(o_T) \beta_T(j) = I\{o_{T-1} = y|\lambda\} \sum_{l=1}^{N} \alpha_{T-2}(l) a_{lx} \sum_{j=1}^{N} a_{xj} b_j(o_T) \beta_T(j), \tag{A.17}$$

but by the backward recursion (A.4) we get

$$I\{o_{T-1} = y|\lambda\} \sum_{l=1}^{N} \alpha_{T-2}(l) a_{lx} \sum_{j=1}^{N} a_{xj} b_j(o_T) \beta_T(j) = \sum_{l=1}^{N} I\{o_{T-1} = y|\lambda\} \alpha_{T-2}(l) a_{lx} \beta_{T-1}(x), \tag{A.18}$$

and if we use the forward recursion (A.2), we get

$$\sum_{l=1}^{N} I\{o_{T-1} = y|\lambda\} \alpha_{T-2}(l) a_{lx} \beta_{T-1}(x) = \frac{1}{b_x(o_{T-1})} I\{o_{T-1} = y|\lambda\} \alpha_{T-1}(x) \beta_{T-1}(x). \tag{A.19}$$

Second, by interchanging the order of summations in the second term of (A.16) we get

$$\sum_{j=1}^{N} \sum_{i=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}) a_{ij} b_j(o_T) \beta_T(j) = \sum_{i=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}) \sum_{j=1}^{N} a_{ij} b_j(o_T) \beta_T(j). \tag{A.20}$$

By using the backward recursion (A.4), we obtain

$$\sum_{i=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}) \sum_{j=1}^{N} a_{ij} b_j(o_T) \beta_T(j) = \sum_{i=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}) \beta_{T-1}(i). \tag{A.21}$$

With the use of (A.16), (A.19) and (A.21) we have rewritten the right hand side of (A.12) as

$$\frac{\partial P(O|\lambda)}{\partial b_x(y)} = \frac{1}{b_x(o_T)} I\{o_T = y|\lambda\} \alpha_T(x) \beta_T(x) + \frac{1}{b_x(o_{T-1})} I\{o_{T-1} = y|\lambda\} \alpha_{T-1}(x) \beta_{T-1}(x) + \sum_{i=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_{T-2}(l)}{\partial b_x(y)} a_{li} b_i(o_{T-1}) \beta_{T-1}(i). \tag{A.22}$$

The double sum is seen to be exactly of the same form as that obtained in (A.12). Hence, we may continue the differentiation of $\partial \alpha_{T-2}(l) / \partial b_x(y)$ and reorganize the resulting expressions exactly in the same manner as above. Due to (A.1) we get

$$\sum_{i=1}^{N} \sum_{l=1}^{N} \frac{\partial \alpha_1(l)}{\partial b_x(y)} a_{li} b_i(o_2) \beta_2(i) = \sum_{i=1}^{N} \sum_{l=1}^{N} \pi_l \frac{\partial b_l(o_1)}{\partial b_x(y)} a_{li} b_i(o_2) \beta_2(i) = I\{o_1 = y|\lambda\} \pi_x \sum_{i=1}^{N} a_{xi} b_i(o_2) \beta_2(i), \tag{A.23}$$

but by the backward recursion (A.4) we get

$$I\{o_1 = y|\lambda\} \pi_x \sum_{i=1}^{N} a_{xi} b_i(o_2) \beta_2(i) = I\{o_1 = y|\lambda\} \pi_x \beta_1(x). \tag{A.24}$$

From (A.1) it follows that

$$I\{o_1 = y|\lambda\} \pi_x \beta_1(x) = \frac{1}{b_x(o_1)} I\{o_1 = y|\lambda\} \alpha_1(x) \beta_1(x). \tag{A.25}$$

Consequently, by (A.22) and (A.25) we see that, when going downwards, the differentiation stops at the point where $\partial \alpha_1(l) / \partial b_x(y)$ appears, since $\partial \alpha_1(l) / \partial b_x(y)$ is the last remaining expression that might contain a $b_x(y)$ term. As a result we have

$$\frac{\partial P(O|\lambda)}{\partial b_x(y)} = \sum_{l=1}^{T} \frac{1}{b_x(o_l)} I\{o_l = y|\lambda\} \alpha_l(x) \beta_l(x). \tag{A.26}$$

Due to the expression $I\{o_l = y|\lambda\}$, the only non-zero terms of the above summation are the ones for which the rule $o_l = v_y$ stands true. So, an equivalent expression to (A.26) is

$$\frac{\partial P(O|\lambda)}{\partial b_x(y)} = \sum_{l=1}^{T} \frac{1}{b_x(y)} I\{o_l = y|\lambda\} \alpha_l(x) \beta_l(x) = \frac{1}{b_x(y)} \sum_{l=1}^{T} I\{o_l = y|\lambda\} \alpha_l(x) \beta_l(x). \tag{A.27}$$

Thus, (A.10) has been proved. □

Proposition 2.
$$\frac{\partial P(O|\lambda)}{\partial r_{jt}} = \sum_{l=1}^{T} \left( I\{o_l = t|\lambda\} \alpha_l(j) \beta_l(j) - b_j(t) \alpha_l(j) \beta_l(j) \right). \tag{A.28}$$

Proof. Using the chain rule on the left hand side of (A.28), we get

$$\frac{\partial P(O|\lambda)}{\partial r_{jt}} = \sum_{z=1}^{M} \frac{\partial P(O|\lambda)}{\partial b_j(z)} \frac{\partial b_j(z)}{\partial r_{jt}} = \frac{\partial P(O|\lambda)}{\partial b_j(t)} \frac{\partial b_j(t)}{\partial r_{jt}} + \sum_{z=1, z \ne t}^{M} \frac{\partial P(O|\lambda)}{\partial b_j(z)} \frac{\partial b_j(z)}{\partial r_{jt}}. \tag{A.29}$$

From the normalized exponential reparameterizations (6) it follows that the partial derivatives of the emission parameters $b_j(t)$ are

$$\frac{\partial b_j(t)}{\partial r_{jt}} = b_j(t) \left( 1 - b_j(t) \right), \tag{A.30}$$

$$\frac{\partial b_j(t)}{\partial r_{jk}} = -b_j(t) b_j(k), \quad k \ne t. \tag{A.31}$$

By applying (A.30), (A.31) and (A.10), the right hand side of (A.29) yields

$$\frac{\partial P(O|\lambda)}{\partial r_{jt}} = b_j(t) \left( 1 - b_j(t) \right) \frac{1}{b_j(t)} \sum_{l=1}^{T} I\{o_l = t|\lambda\} \alpha_l(j) \beta_l(j) - \sum_{z=1, z \ne t}^{M} b_j(z) b_j(t) \frac{1}{b_j(z)} \sum_{l=1}^{T} I\{o_l = z|\lambda\} \alpha_l(j) \beta_l(j) = \sum_{l=1}^{T} I\{o_l = t|\lambda\} \alpha_l(j) \beta_l(j) - b_j(t) \sum_{z=1}^{M} \sum_{l=1}^{T} I\{o_l = z|\lambda\} \alpha_l(j) \beta_l(j). \tag{A.32}$$

By interchanging the order of summations in the second term of (A.32) we get

$$b_j(t) \sum_{z=1}^{M} \sum_{l=1}^{T} I\{o_l = z|\lambda\} \alpha_l(j) \beta_l(j) = b_j(t) \sum_{l=1}^{T} \alpha_l(j) \beta_l(j) \sum_{z=1}^{M} I\{o_l = z|\lambda\}. \tag{A.33}$$

Each symbol $o_l$ of an observation sequence $O$ assumes a single value from $V$ (one out of the $M$ possible values $v_m$); consequently, the summation over the indicator function yields

$$\sum_{z=1}^{M} I\{o_l = z|\lambda\} = 1. \tag{A.34}$$

From (A.33), by using (A.34), we get

$$b_j(t) \sum_{z=1}^{M} \sum_{l=1}^{T} I\{o_l = z|\lambda\} \alpha_l(j) \beta_l(j) = b_j(t) \sum_{l=1}^{T} \alpha_l(j) \beta_l(j). \tag{A.35}$$

Insertion of the above expression (A.35) into (A.32) yields

$$\frac{\partial P(O|\lambda)}{\partial r_{jt}} = \sum_{l=1}^{T} I\{o_l = t|\lambda\} \alpha_l(j) \beta_l(j) - b_j(t) \sum_{l=1}^{T} \alpha_l(j) \beta_l(j) = \sum_{l=1}^{T} \left( I\{o_l = t|\lambda\} \alpha_l(j) \beta_l(j) - b_j(t) \alpha_l(j) \beta_l(j) \right). \tag{A.36}$$

Thus, (A.28) has been proved. □

Lemma 3.
$$\frac{\partial P(O|\lambda)}{\partial \pi_r} = b_r(o_1) \beta_1(r). \tag{A.37}$$

Proposition 3.
$$\frac{\partial P(O|\lambda)}{\partial u_j} = \pi_j b_j(o_1) \beta_1(j) - \pi_j P(O|\lambda). \tag{A.38}$$

References

Baldi, P. (1995). Gradient descent learning algorithms overview: a general dynamical systems perspective. IEEE Transactions on Neural Networks, 6, 182–195.
Baldi, P., & Brunak, S. (2001). Bioinformatics: the machine learning approach (2nd ed.). Cambridge, Massachusetts: The MIT Press.
Baldi, P., & Chauvin, Y. (1994). Smooth on-line learning algorithms for hidden Markov models. Neural Computation, 6, 305–316.
Baldi, P., & Chauvin, Y. (1996). Hybrid modeling, HMM/NN architectures, and protein applications. Neural Computation, 8, 1541–1565.
Barreto, G. de A., Araujo, A., & Kremer, S. (2003). A taxonomy of spatiotemporal connectionist networks revisited: the unsupervised case. Neural Computation, 15, 1255–1320.
Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1–8.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Brugger, D., Bogdan, M., & Rosenstiel, W. (2008). Automatic cluster detection in Kohonen's SOM. IEEE Transactions on Neural Networks, 19, 442–459.
Chappell, G., & Taylor, J. (1993). The temporal Kohonen map. Neural Networks, 6, 441–445.
Conan-Guez, B., Rossi, F., & El Golli, A. (2006). Fast algorithm and implementation of dissimilarity self-organizing maps. Neural Networks, 19, 855–863.
Deshpande, M., & Karypis, G. (2002). Evaluation of techniques for classifying biological sequences. In Proceedings of the 6th Pacific–Asia conference on advances in knowledge discovery and data mining (pp. 417–431).
Du, K.-L. (2010). Clustering: a neural network approach. Neural Networks, 23, 89–107.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.
Ferles, C., & Stafylopatis, A. (2008a). Sequence clustering with the self-organizing hidden Markov model map. In 8th IEEE international conference on bioinformatics and bioengineering (pp. 1–7).
Ferles, C., & Stafylopatis, A. (2008b). A hybrid self-organizing model for sequence analysis. In 20th IEEE international conference on tools with artificial intelligence (pp. 105–112).
Ferles, C., Siolas, G., & Stafylopatis, A. (2011). Scaled on-line unsupervised learning algorithm for a SOM–HMM hybrid. In 26th international symposium on computer and information sciences (pp. 533–537).
Frank, A., & Asuncion, A. (2010). UCI machine learning repository. Irvine, California: University of California, School of Information and Computer Science. Available: http://archive.ics.uci.edu/ml.
Fujiwara, Y., Asogawa, M., & Konagaya, A. (1994). Stochastic motif extraction using hidden Markov models. In Proceedings of the 2nd international conference on intelligent systems and molecular biology (pp. 121–129).
Graepel, T., Burger, M., & Obermayer, K. (1998). Self-organizing maps: generalizations and new optimization techniques. Neurocomputing, 21, 173–190.
Hagenbuchner, M., Sperduti, A., & Tsoi, A. C. (2003). A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks, 14, 491–505.
Hagenbuchner, M., Sperduti, A., & Tsoi, A. C. (2005). Contextual processing of graphs using self-organizing maps. In Proceedings of the European symposium on artificial neural networks (pp. 399–404).
Hammer, B., & Hasenfuss, A. (2010). Topographic mapping of large dissimilarity data sets. Neural Computation, 22, 2229–2284.
Hammer, B., Hasenfuss, A., Rossi, F., & Strickert, M. (2007). Topographic processing of relational data. In Proceedings of the 6th international workshop on self-organizing maps.
Hammer, B., Micheli, A., Strickert, M., & Sperduti, A. (2004). A general framework for unsupervised processing of structured data. Neurocomputing, 57, 3–35.
Heskes, T. (1996). Transition times in self-organizing maps. Biological Cybernetics, 75, 49–57.
Heskes, T. (1999). Energy functions for self-organizing maps. In E. Oja, & S. Kaski (Eds.), Kohonen maps (pp. 303–315). Amsterdam, The Netherlands: Elsevier.
Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling. IEEE Transactions on Neural Networks, 12, 1299–1305.
Hoche, S., & Wrobel, S. (2002). Scaling boosting by margin-based inclusion of features and relations. In Proceedings of the 13th European conference on machine learning (pp. 148–160).
Hoekstra, A., & Drossaers, M. F. J. (1993). An extended Kohonen feature map for sentence recognition. In International conference on artificial neural networks (pp. 404–407).
James, D. L., & Miikkulainen, R. (1994). SARDNET: a self-organizing feature map for sequences. In Advances in neural information processing systems: vol. 7 (pp. 577–584).
Jang, J.-S. R., Sun, C.-T., & Mizutani, E. (1997). Neuro-fuzzy and soft computing: a computational approach to learning and machine intelligence. Upper Saddle River, New Jersey: Prentice-Hall.
Kang, J., Feng, C.-J., Shao, Q., & Hu, H.-Y. (2007). Prediction of chatter in machining process based on hybrid SOM-DHMM architecture. In Proceedings of the 3rd international conference on computing (pp. 1004–1013).
Kohonen, T. (2001). Self-organizing maps (3rd ed.). Berlin: Springer-Verlag.
Kohonen, T., & Somervuo, P. (1998). Self-organizing maps of symbol strings. Neurocomputing, 21, 19–30.
Kohonen, T., & Somervuo, P. (2002). How to make large self-organizing maps for nonvectorial data. Neural Networks, 15, 945–952.
Koskela, T., Varsta, M., Heikkonen, J., & Kaski, K. (1998). Time series prediction using recurrent SOM with local linear models. International Journal of Knowledge-Based and Intelligent Engineering Systems, 2, 60–68.
Koski, T. (2001). Hidden Markov models for bioinformatics. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Kraaijveld, M. A., Mao, J., & Jain, A. K. (1995). A nonlinear projection method based on Kohonen's topology preserving maps. IEEE Transactions on Neural Networks, 6, 645–678.
Krogh, A., Brown, M., Mian, I. S., Sjolander, K., & Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology, 235, 1501–1531.
Kurimo, M. (1997). Training mixture density HMMs with SOM and LVQ. Computer Speech and Language, 11, 321–343.
Kurimo, M., & Somervuo, P. (1996). Using the self-organizing map to speed up the probability density estimation for speech recognition with mixture density HMMs. In 4th international conference on spoken language (pp. 358–361).
Kurimo, M., & Torkkola, K. (1992). Training continuous density hidden Markov models in association with self-organizing maps and LVQ. In Proceedings of the IEEE workshop on neural networks for signal processing.
Lebbah, M., Rogovschi, N., & Bennani, Y. (2007). BeSOM: Bernoulli on self-organizing map. In International joint conference on neural networks (pp. 631–636).
Li, J., & Wong, L. (2003). Using rules to analyse bio-medical data: a comparison between C4.5 and PCL. In Proceedings of the 4th international conference on web-age information management (pp. 254–265).
Minamino, K., Aoyama, K., & Shimomura, H. (2005). Voice imitation based on self-organizing maps with HMMs. In Proceedings of the international workshop on intelligent dynamics humanoids.
Mount, D. W. (2004). Bioinformatics: sequence and genome analysis (2nd ed.). New York: Cold Spring Harbor Laboratory Press.
Nakama, T. (2009). Theoretical analysis of batch and on-line training for gradient descent learning in neural networks. Neurocomputing, 73, 151–159.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Rampone, S. (1998). Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics, 14, 676–684.
Recanati, C., Rogovschi, N., & Bennani, Y. (2007). The structure of verbal sequences analyzed with unsupervised learning techniques. In Proceedings of the 3rd language and technology conference.
Rogovschi, N., Lebbah, M., & Bennani, Y. (2010). Learning self-organizing mixture Markov models. Journal of Nonlinear Systems and Applications, 1, 63–71.
Seiffert, U., & Jain, L. C. (2002). Self-organizing neural networks: recent advances and applications. Heidelberg, New York: Physica-Verlag.
Somervuo, P. (2000). Competing hidden Markov models on the self-organizing map. In Proceedings of the international joint conference on neural networks (pp. 169–174).
Somervuo, P. (2004). Online algorithm for the self-organizing map of symbol strings. Neural Networks, 17, 1231–1239.
Sonneburg, S., Ratsch, G., Jagota, A. K., & Muller, K.-R. (2002). New methods for splice site recognition. In Proceedings of the international conference on artificial neural networks (pp. 329–336).
Sperduti, A. (2001). Neural networks for adaptive processing of structured data. In International conference on artificial neural networks (pp. 5–12).
Stolcke, A., & Omohundro, S. (1993). Hidden Markov model induction by Bayesian model merging. In Advances in neural information processing systems: vol. 5 (pp. 11–18).
Strickert, M., & Hammer, B. (2004). Self-organizing context learning. In Proceedings of the European symposium on artificial neural networks (pp. 39–44).
Tasdemir, K. (2010). Graph based representations of density distribution and distances for self-organizing maps. IEEE Transactions on Neural Networks, 21, 520–526.
Tasdemir, K., & Merenyi, E. (2009). Exploiting data topology in visualization and clustering of self-organizing maps. IEEE Transactions on Neural Networks, 20, 549–562.
Towell, G. G., & Shavlik, J. W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70, 119–165.
Tsuruta, N., Iuchi, H., Sagheer, A., & Tobely, T. (2003). Self-organizing feature maps for HMM based lip-reading. In Proceedings of the 7th international conference on knowledge-based intelligent information and engineering systems (pp. 162–168).
Ultsch, A. (2003). Maps for the visualization of high-dimensional data spaces. In Proceedings of the workshop on self-organizing maps (pp. 225–230).
Vanco, P., & Farkas, I. (2010). Experimental comparison of recursive self-organizing maps for processing tree-structured data. Neurocomputing, 73, 1362–1375.
Varsta, M., Heikkonen, J., & Millan, J. R. (1997). Context learning with the self-organizing map. In Proceedings of the workshop on self-organizing maps (pp. 197–202).
Verbeek, J. J., Vlassis, N., & Krose, B. J. (2005). Self-organizing mixture models. Neurocomputing, 63, 99–123.
Voegtlin, T. (2002). Recursive self-organizing maps. Neural Networks, 15, 979–992.
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, 645–678.