The tetratricopeptide repeats (TPR)-like superfamily of proteins in Leishmania spp., as revealed by multi-relational data mining


Pattern Recognition Letters 31 (2010) 2178–2189

Contents lists available at ScienceDirect

Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec

Michely C. Diniz, Ana Carolina L. Pacheco, Karen T. Girão, Fabiana F. Araujo, Cezar A. Walter, Diana M. Oliveira *

Núcleo Tarcisio Pimenta de Pesquisa Genômica e Bioinformática, NUGEN, Faculdade de Veterinária, Universidade Estadual do Ceara – UECE, Av. Paranjana, 1700, Campus do Itaperi, Fortaleza, CE 60740-000, Brazil

Article info

Article history: Available online 4 May 2010

Keywords: Multi-relational data mining; Hidden Markov models; Viterbi algorithm; Tetratricopeptide repeat motif; Leishmania proteins

Abstract

Protein sequence analysis tasks are multi-relational problems suitable for multi-relational data mining (MRDM). Proteins containing tetratricopeptide (TPR), pentatricopeptide (PPR) and half-a-TPR (HAT) repeats comprise the TPR-like superfamily, to which we have applied MRDM methods (relational association rule discovery and probabilistic relational models) together with hidden Markov models (HMMs) and the Viterbi algorithm (VA) in genome databases of the pathogenic protozoa Leishmania. This integrated MRDM/HMM/VA approach seeks to capture as much model information as possible in the pattern-matching heuristic without resorting to more standard motif discovery methods (Pfam, SMART, SUPERFAMILY). Like a more recently reported tool (TPRpred), it incorporates optimized profiles, score offsets and score distributions to compute probabilities, in order to take into account the tendency of repeats to occur in tandem and to be widely distributed along the sequences. Here we compare these currently available resources with our approach (MRDM/HMM/VA) to highlight that the latter performs best in TPR-like superfamily assignment. It might also be applied to other sequence analysis problems, where it can contribute to tight-fit motif discoveries and to a better estimate of the probability that a given target sequence is, indeed, a target motif-containing protein.

© 2010 Elsevier B.V. All rights reserved.

* Corresponding author. Tel./fax: +55 85 31019842. E-mail address: [email protected] (D.M. Oliveira). 0167-8655/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2010.04.008

1. Introduction

There are many data mining problems in genomics and post-genomics that involve representing and reasoning about biological sequences (DNA, RNA and protein). Among these problems are gene discovery (Burge and Karlin, 1997), protein motif discovery (Lawrence et al., 1993) and protein structure prediction (Schmidler et al., 2000). It has been argued that such sequence-analysis tasks are becoming multi-relational problems because other sources of evidence, besides the sequences themselves, are being leveraged in the learning and inference processes (Page and Craven, 2003), which are core parts of the bioinformatics scenario. In addition, many sequence elements of interest are composed of other sequence elements. In this paper, we present and discuss results regarding motif discovery methods, focusing in particular on our recent work in analyzing the tetratricopeptide repeat (TPR)-like superfamily of proteins in Leishmania spp. (Girão et al., 2008).

Early efforts in bioinformatics concentrated on finding the internal structure of individual genome-wide data sets and on solving problems raised by the explosion of the ‘omics’ technologies. Since comprehensive coverage of the multiple aspects of cellular/organellar physiology is also progressing rapidly, there is a need to integrate vast amounts of biological data in a systems-level approach. Such an approach requires integrating all of the known properties of a given class of components (e.g., protein abundance, localization, physical interactions, etc.) with computational methods able to combine large and heterogeneous sets of data (Ideker et al., 2007). One technique for generating unified mechanistic models of cellular/organellar processes (a major challenge for all who seek to discover the functions of the many still unknown genes) is the multi-relational data mining (MRDM) approach, which looks for patterns that involve multiple input tables (relations) from a relational database (DB) made of complex/structured objects whose normalized representation requires multiple tables (Getoor et al., 2001a). MRDM extends association rule mining to search for interesting patterns among data in multiple tables rather than in one input table (Dehaspe and De Raedt, 1997).

Sequence analysis applications, however, highlight two major challenges for the area of MRDM: computational complexity and ease of use. Post-genomics databases are growing at an exponential rate, and the task of determining whether a protein sequence contains a given substructure (or motif) is an optimization problem that falls into the NP-complete (Garey and Johnson, 1979) task of subgraph isomorphism (Page and Craven, 2003). Any biologist can run a decision-tree learner or a linear regression routine on a given dataset with relatively little training and preprocessing, but running an MRDM system typically requires a much larger knowledge engineering effort, which reinforces the need for biologist-friendlier MRDM systems.

Keeping this in mind, we have applied MRDM methods (relational association rule discovery, RARD, and probabilistic relational models, PRMs) combined with hidden Markov models (HMMs) (Eddy, 1998; Winters-Hilt, 2006; Madera, 2008) and the Viterbi algorithm (VA) (Forney, 1973) to mine the tetratricopeptide repeat (TPR) (Blatch and Lässle, 1999; D’Andrea and Regan, 2003; Das et al., 1998) and related motifs, namely the pentatricopeptide repeat (PPR) (Small and Peeters, 2000; Kotera et al., 2005) and half-a-TPR (HAT) (Preker and Keller, 1998), in the pathogenic protozoa Leishmania spp. (Girão et al., 2008). We have then built a panel of the TPR-like superfamily of proteins, whose members can be further assigned functional roles according to the motifs they contain.

TPR motifs were originally identified in yeast as protein–protein interaction modules (Blatch and Lässle, 1999), but they are now known to occur in a wide variety of proteins (over 12,000 as included in SMART nrdb) present in prokaryotic and eukaryotic organisms (D’Andrea and Regan, 2003), being involved in protein–protein and protein–lipid interactions in cell cycle regulation, chaperone function and post-translational modifications (Blatch and Lässle, 1999; D’Andrea and Regan, 2003; Das et al., 1998). TPRs exhibit a large degree of sequence diversity and structural conservation (two antiparallel alpha-helices separated by a turn) and might act as scaffolds for the assembly of different multiprotein complexes (Scheufler et al., 2000), including the peroxisomal import receptor and the NADPH oxidase (Koga et al., 1999).
Similar to TPR, PPR and HAT motifs also have repetitive patterns characterized by the presence of a tandem array of repeats, where the number of motifs seems to influence the affinity and specificity of the repeat-containing protein for RNA (Preker and Keller, 1998; Lurin et al., 2004; Rivals et al., 2006). PPR-containing proteins (PPRPs) occur predominantly in eukaryotes (Small and Peeters, 2000) and are particularly abundant in plants, where it has been suggested that each of the highly variable PPRPs is a gene-specific regulator of plant organellar RNA metabolism. HAT repeats are less abundant, and HAT-containing proteins (HATPs) appear to be components of macromolecular complexes that are required for RNA processing (Small and Peeters, 2000; Kotera et al., 2005; Preker and Keller, 1998; Lurin et al., 2004; Rivals et al., 2006). TPR-containing proteins (TPRPs) have recently attracted interest because of their versatility as scaffolds for the engineering of protein–protein interactions (Main et al., 2005; Karpenahalli et al., 2007). TPRPs are characterized by homologous, repeating structural units stacked together to form an open-ended superhelical structure. Such an arrangement is in contrast to the structure of most proteins, which fold into a compact shape (Groves and Barford, 1999). The curvature created by the superhelical nature of these proteins predetermines the target proteins that can bind to them (Kobe and Kajava, 2000). TPRs, PPRs and HATs (together referred to as TPR-like motifs) form a large superfamily, or clan, TPR-like (Blatch and Lässle, 1999; D’Andrea and Regan, 2003; Das et al., 1998; Scheufler et al., 2000; Preker and Keller, 1998; Lurin et al., 2004; Rivals et al., 2006).

Homologous structural repeat units are often highly divergent at the sequence level, a feature that makes their prediction challenging. Currently, several web-based resources are available for the detection of TPRs, including Pfam (Finn et al., 2008), SMART (Schultz et al., 1998; Letunic et al., 2009) and SUPERFAMILY (Gough et al., 2001; Madera et al., 2004). These resources use HMM profiles constructed from the repeats trusted to belong to the family. However, the profiles are constructed from closely homologous repeats; therefore, divergent repeat units often get a negative score and are not considered in computing the overall statistical significance, even though they are individually significant (Karpenahalli et al., 2007; Madera, 2008). For this reason, Pfam, SMART and SUPERFAMILY perform with limited accuracy in detecting remote homologs of known TPRPs and in delineating the individual repeats within a protein. A newer profile-based method, TPRpred (Karpenahalli et al., 2007), uses a P-value-dependent score offset to include divergent repeat units and to exploit the tendency of repeats to occur in tandem. Although TPRpred indeed performs significantly better in detecting divergent repeats in TPRPs, and finds more individual repeats than the aforementioned methods, we have noticed that it still fails to detect some members of the TPR-like superfamily, such as 3 TPRPs, 2–5 PPRPs and 2–3 HATPs in a total of 26 TPR-like proteins in the three examined species of Leishmania (Leishmania major, Leishmania infantum and Leishmania braziliensis), as shown in Tables 1 and 2. The characterization of the proteins of a given family often relies on the detection of regions of their sequences shared by all family members, while computing the consensus of such regions provides a motif that is used to recognize new members of the family.
Our integrated approach of HMMs/VA with MRDM was able to detect 104 TPRPs, 36 PPRPs and 8 HATPs in Leishmania spp. genomes, a greater number than currently available resources (Pfam, SMART, SUPERFAMILY) are able to yield and slightly higher than TPRpred. Our results are, therefore, given as an illustration of how powerful MRDM algorithms and techniques might be when conveniently applied to data-driven knowledge discovery problems in bioinformatics.

Table 1
Comparative results of TPR-like motif finding in Leishmania genes obtained with three standard HMM tools (SUPERFAMILY, GeneDB and Pfam), with TPRpred and with our method (MRDM/HMM/VA). Numbers are shown for TPR- and PPR-containing proteins in three species (L. major, L. infantum and L. braziliensis).

                  Superfamily   GeneDB        Pfam          TPRpred       MRDM
Species           TPR    PPR    TPR    PPR    TPR    PPR    TPR    PPR    TPR    PPR
L. major          95     –      62     12     55     12     101    31     104    36
L. infantum       98     –      37     11     51     11     101    31     104    34
L. braziliensis   99     –      53     8      42     8      100    31     103    33

Table 2
Comparative results of half-a-TPR (HAT) motif finding in Leishmania genes obtained with three standard HMM tools (SUPERFAMILY, GeneDB and Pfam), with TPRpred and with our method (MRDM/HMM/VA). Numbers are shown for HAT-containing proteins in three species (L. major, L. infantum and L. braziliensis).

Species           GeneDB   Pfam   TPRpred   MRDM
L. major          3        1      –         3
L. infantum       2        1      –         3
L. braziliensis   1        1      –         2

2. Methods

2.1. Data sources and bioinformatics tools

We have used publicly available datasets of individual or clustered gene/protein data on Leishmania spp., mainly L. major, L. braziliensis, L. infantum and related trypanosomatids (GeneDB (Hertz-Fowler et al., 2004), TriTrypDB (Aurrecoechea et al., 2007) and NCBI/Entrez – www.ncbi.nlm.nih.gov/sites/gquery). Variants of BLAST (Altschul et al., 1997), COBALT (Papadopoulos and Agarwala, 2007) and GlimmerHMM (Majoros et al., 2004) were widely used for sequence similarity searches, comparisons and gene predictions. External DB searches were performed against numerous collections of protein motifs and families. Gene ontology (GO) (The Gene Ontology Consortium, 2000) terms were assigned based on top matches to proteins with GO annotations from Swiss-Prot/TrEMBL (www.expasy.org/sprot) and from AmiGO accessed via GeneDB (www.genedb.org/amigo/perl). Functional assignment of genes and gene products was inferred using RPS-BLAST searches against the conserved domain DB (CDD) (Marchler-Bauer et al., 2007). For protein domain identification and analysis of protein domain architectures, the Simple Modular Architecture Research Tool (SMART) (Letunic et al., 2009), Pfam 23.0 (Finn et al., 2008), SUPERFAMILY (Madera et al., 2004; Wilson et al., 2009) and TPRpred (Karpenahalli et al., 2007) were used. For documentation entries describing protein domains, families and functional sites, as well as the associated patterns and profiles used to identify them, we have used Prosite (Sigrist et al., 2010). For multiple alignments we used MUSCLE (Edgar, 2004) and the constraint-based alignment tool COBALT (Papadopoulos and Agarwala, 2007). A detailed description of the Leishmania protein datasets is given elsewhere (TriTrypDB version 2.0 – http://TriTrypDB.org). Our original dataset consisted of 25,756 unique sequences (Dataset 1 in supplementary material). The negative sample base in all tests consisted of 100 randomly chosen sequences of 300–500 residues reported as not containing TPR motifs. Then, for each experiment, sequences matching the studied pattern (TPR-like motifs), if any, were removed from the negative set.
The TPR family cannot be covered by a single PROSITE pattern. Therefore, the positive dataset included two extended patterns: PDOC51375, the pentatricopeptide (PPR) repeat profile (PS51375), and PDOC50005, the TPR repeat profiles, the latter including at least two profiles (PS50005, TPR repeat profile; and PS50293, TPR_REGION, TPR repeat region circular profile), as provided by the Expasy Proteomics Server (www.expasy.org). The training set consisted of 5 different instances of the 31–35 residue long Prosite patterns (see Section 2.2 for details) and the 34-residue long pattern of SUPERFAMILY (Wilson et al., 2009). We have added 94 Leishmania sequences from UniProt matching PS50005 to the positive part of the testing set (Dataset 2 in supplementary material) (The UniProt Consortium, 2010). The training set consisted of 11 representative instances (chosen on the basis of low sequence similarity) of these patterns and sequences found in the PROSITE false negative set. Therefore, 286 complete sequences matching the patterns formed the positive testing set. The TPR-like consensus pattern had 286 true positive hits, while 34 sequences were missed and there were 76 false positives. According to our observations, some residues taking part in the TPR motif might not be covered by the 31–35 residue long pattern. Hence, we extended the length of the training sequences up to 40 amino acids, according to multiple sequence alignment (see below). Eventually, the training set consisted of 68 representative sequences of the pattern instance and its neighborhood (40 residues altogether). The positive test set consisted of 120 non-redundant sequences which belong to the TPR-like superfamily (true positive and false negative PROSITE pattern PS50005 matches from the UniProt database) (The UniProt Consortium, 2010). The patterns that our meta-pattern is based on are given below.

2.2. Definitions of a TPR-like regular expression and of a TPR-like protein

As detailed previously (Girão et al., 2008), we have used the consensus sequence for the TPR-like motif as a regular expression (1) to fully exploit the TPR motif finding problem in Leishmania spp. genomes. A regular expression for a protein sequence defines the most probable amino acid (aa) residue to appear at each position within the protein sequence. Briefly, a TPR motif sequence is a minimally conserved (degenerate and variable) region 34 residues long, with exceptions accepted down to 31 residues (Koga et al., 1999). A PPR motif is a degenerate 35-residue sequence, closely related to the 34-residue TPR motif. As reported in our earlier studies (Pacheco et al., 2007; Girão et al., 2008; Pacheco et al., 2009), we systematically resolved inconsistencies in the motif annotation by manual curation. Since motif occurrences are adjacent in sequences, we could define the motif sequence of a protein as the succession of motifs read from the N toward the C terminus. Therefore, we have defined a TPR-like protein as any protein sequence containing a TPR-like motif that fits our regular expression (1), which also, by reference, conforms to a set of known bona fide domains contained in the TPR-like superfamily [a.118.8] of SCOP (version 1.73) (Murzin et al., 1995; Karpenahalli et al., 2007), SMART v.6.0 (Letunic et al., 2009), TPRpred (Karpenahalli et al., 2007) and SUPERFAMILY v1.73 (Wilson et al., 2009). Classification criteria are supported by structural and sequence similarity, plus searches with remote homology prediction.

[WLF]-X(2)-[LIM]-[GAS]-X(2)-[YLF]-X(8)-[ASE]-X(3)-[FYL]-X(2)-[ASL]-X(4)-[PKE]    (1)
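As an illustration, regular expression (1) can be translated into a standard regular expression and scanned over a protein sequence. The sketch below is only a minimal example in Python and not the pipeline used in this work; the helper names (`prosite_to_regex`, `find_tpr_like`) and the toy sequence are hypothetical, and the pattern is assumed to follow the usual PROSITE conventions (`[...]` for a residue class, `X(n)` for n arbitrary residues).

```python
import re

# Regular expression (1) written in PROSITE-style notation.
TPR_LIKE_PATTERN = (
    "[WLF]-X(2)-[LIM]-[GAS]-X(2)-[YLF]-X(8)-[ASE]-X(3)-"
    "[FYL]-X(2)-[ASL]-X(4)-[PKE]"
)


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        wildcard = re.fullmatch(r"X\((\d+)\)", element)
        if wildcard:
            # X(n): any n residues.
            parts.append(".{%s}" % wildcard.group(1))
        else:
            # Residue class such as [WLF]; regex uses the same syntax.
            parts.append(element)
    return "".join(parts)


def find_tpr_like(sequence: str, pattern: str = TPR_LIKE_PATTERN):
    """Return (start, matched_segment) for every match, overlaps included."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1), m.group(1)) for m in regex.finditer(sequence.upper())]


# Toy sequence (hypothetical) built so that one segment fits expression (1).
toy = "MM" + "WQQLGQQY" + "Q" * 8 + "AQQQFQQAQQQQP" + "GG"
print(find_tpr_like(toy))
```

Expression (1) spans 29 positions, from the first conserved residue class to the last, so a single hit covers only the conserved core of a repeat; tandem repeats would appear as several hits along the same sequence.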

As mentioned, the lengths of those patterns vary from 30 to 40 amino acids. The numbers of true positive hits in different sequences in the UniProtKB/Swiss-Prot database also differ slightly between patterns (The UniProt Consortium, 2010). In order to produce a meta-pattern, a representative set of 11 instances of those patterns was chosen as a positive training set on the basis of 30% or lower similarity of the pattern. The positive test set of 68 sequences was picked randomly from the set of 94 sequences (Dataset 2). We must recall that some sequences contain instances of more than one pattern from the set. Although none of the PROSITE patterns involved in the TPR-like meta-pattern was found in the negative dataset, a few sequences described in either PDB or GeneDB as containing TPR-like motifs had to be manually excluded from the set.

2.3. Profile generation for TPR-like superfamily assignment

The performance of a sequence profile depends on whether it operates in a selective or a sensitive regime, which in turn depends on whether close or remote homologues, respectively, were used to build it (Karpenahalli et al., 2007). Here we have established a fixed threshold value to include a minimum number of remote homologues, thus avoiding too many false positives. Initial profiles were generated by iterative searches against non-redundant DBs (nrdbs) at NCBI and GeneDB, filtered to a maximum pairwise sequence identity of 60% (nr-60) by CD-HIT (Li et al., 2001, 2002). This procedure follows Karpenahalli et al. (2007) with a slight modification: we extracted sequences conservatively with PSI-BLAST through multiple iterations, using the TPR-like regular expression (1) as a query sequence. We then performed iterative searches to convergence on nr-60 minus TPRPs (detected by Pfam, SMART, SUPERFAMILY and TPRpred) with various threshold parameters, to test the resulting profiles on a positive (TPR-like) or negative (non-TPR-like) set.

The best profiles were selected based on their performance on a predicted family assignment. We must mention that a large protein can be split into smaller domains or motifs, which can occur by themselves or in combination with other domains. By definition, a superfamily groups together domains of different families that have a common evolutionary ancestor, based on structural, functional and evolutionary data. In other words, a superfamily contains all proteins for which there is structural evidence of a common evolutionary ancestor.


To provide structural (and hence implied functional) assignments to TPR-like proteins at the superfamily level, structured sequences from available Leishmania and trypanosomatid genomes were randomly selected and parsed into 25,756 unique Leishmania sequences (original dataset). Each sequence was a labeled input to a multi-class motif classifier. To pick the best method to represent one or more of the three target motifs, we compared the results of motif classifiers when the sequence was presented as (I) TPR-containing, (II) PPR-containing, (III) HAT-containing, (IV) containing a combination of any two or three motifs and (V) not containing target motifs. Performance was measured (as in Xu et al., 2006) by classification precision, recall and the F1 measure (a composite measure of classification precision and recall).

2.4. Hidden Markov models (HMMs) and the Viterbi algorithm (VA)

An HMM is a probabilistic network of nodes, called states. One state qi is connected to another state qj by a transition probability (indexed ij). Non-silent states are able to emit symbols from an alphabet (Eddy, 1998; Winters-Hilt, 2006). A special topology of HMMs, termed pHMM, is frequently used in homology detection of protein families (Karplus et al., 1998). Transition and emission probabilities are estimated by a maximum likelihood approach, combined with a standard dynamic programming algorithm for decoding HMMs, the Viterbi algorithm (VA) (Forney, 1973), to get site- and path-dependent probabilities for every hidden state in the posterior decoding. In a first validation step we used the ability of trained HMMs to emit domain-specific sequences according to their model parameters. Sequences were compared with generated state paths in the same way as described earlier (Girão et al., 2008; Pacheco et al., 2009). The generation process was repeated 10 times for every TPR-like motif. To fully exploit the sequential ordering of motifs in a set, we used pHMMs to label motif types.
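The Viterbi decoding step referred to above can be illustrated with a minimal sketch in Python. It is not the implementation used in this work: the two state labels, the "hit"/"miss" observation symbols and all probabilities are made-up values, chosen only to show how the most likely state path is recovered by dynamic programming in log space.

```python
import math


def _log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most likely path."""
    # best[t][s]: log probability of the best path ending in state s at time t.
    best = [{s: _log(start_p[s]) + _log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score.
            prev = max(states,
                       key=lambda r: best[t - 1][r] + _log(trans_p[r][s]))
            best[t][s] = (best[t - 1][prev] + _log(trans_p[prev][s])
                          + _log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical two-state model: all numbers below are illustrative only.
STATES = ("motif", "other")
START = {"motif": 0.5, "other": 0.5}
TRANS = {"motif": {"motif": 0.9, "other": 0.1},
         "other": {"motif": 0.1, "other": 0.9}}
EMIT = {"motif": {"hit": 0.8, "miss": 0.2},
        "other": {"hit": 0.2, "miss": 0.8}}

log_p, path = viterbi(["hit", "hit", "miss", "hit", "hit"],
                      STATES, START, TRANS, EMIT)
print(path, math.exp(log_p))
```

Working in log space avoids numerical underflow on long sequences, since products of many small probabilities become sums of logarithms.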
We have transformed the motif categorization problem into an HMM sequence alignment problem. The HMM states correspond to the motif types. Labeling motifs in a sequence is equivalent to aligning the sequences to HMM states. There are five states in our HMM model: (I) TPR-containing, (II) PPR-containing, (III) HAT-containing, (IV) containing a combination of any two or all motifs and (V) not containing target motifs. Transition probabilities between these states were estimated from the training data by dividing the number of times each transition occurs in the training set by the sum of all the transitions. The state emission probabilities were calculated from the score output reported by the multi-class classifiers. Given the HMM model (Karplus et al., 1998), the state emission probabilities and the state transition probabilities, the VA was used to compute the most likely sequence of states that emit (any of the target) motifs in sequences. Subsequently, the state associated with the motif was extracted from the most likely sequence of states (Friedrich et al., 2006).

2.5. Stochastic context free grammar (SCFG) framework

An important drawback of HMM profiles is that they are not human-readable, so these descriptors cannot provide biological insight by themselves. The expressive power of HMMs, in the field of protein sequence analysis and annotation, is similar to that of a stochastic regular grammar, and they therefore have limitations regarding the types of patterns they are able to encode (such as nested and crossing relationships, which are very common in proteins) (Dyrka and Nebel, 2009). The use of SCFGs (a formal definition and implementation were elegantly provided by Dyrka and Nebel, 2009) can overcome such HMM limitations (Madera, 2008). Since each SCFG deals with one amino acid property at a time, scores obtained by several grammars need to be combined to obtain optimal results. Since a protein can be fully defined by a string composed of 20 different characters, a protein grammar is expected to rely on a large set of terminals (20 aa or residues). Therefore, the space containing the possible rules needed to describe the protein language is enormous and cannot be searched in a finite time by current induction techniques. To deal with the size of the protein alphabet, we have incorporated quantitative properties of amino acids into an SCFG framework to reduce the number of symbols in a grammar. For each given property, we have defined all the terminal rules of the form A → a, associating them with proper probabilities. Three non-terminal symbols (Low, Medium and High) were created to represent low, medium and high levels of the property of interest, e.g., the presence of the residue valine in position 4 of the scanned motif. These non-terminals are also called ‘property non-terminals’ or pNT in a rule where each amino acid (ai) is associated with each of these non-terminals with a probability (Pr). For a given property, probabilities were calculated using the known quantitative values, pval, associated with the amino acids, ai, and then normalized so that the following Eq. (2) is true:

\[
\sum_{i} \mathrm{Pr}_{Low}(a_i) = \sum_{i} \mathrm{Pr}_{Medium}(a_i) = \sum_{i} \mathrm{Pr}_{High}(a_i) = 1. \tag{2}
\]
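Eq. (2) only requires that, for each property non-terminal, the terminal-rule probabilities over the 20 amino acids sum to one. The text does not state how pval is mapped onto Low, Medium and High, so the linear weighting in the following Python sketch is an assumption, as are the function name `terminal_rule_probabilities` and the choice of the Kyte–Doolittle hydropathy scale as the example property.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property;
# the choice of property is an assumption made for illustration.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def terminal_rule_probabilities(pval):
    """Build Pr(pNT -> a_i) for pNT in {Low, Medium, High}.

    Assumed weighting (not taken from the paper): High grows linearly
    with the property value, Low decreases linearly with it, and Medium
    peaks at the midpoint of the value range.
    """
    lo, hi = min(pval.values()), max(pval.values())
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    raw = {
        "Low": {a: hi - v for a, v in pval.items()},
        "Medium": {a: half - abs(v - mid) for a, v in pval.items()},
        "High": {a: v - lo for a, v in pval.items()},
    }
    probs = {}
    for non_terminal, weights in raw.items():
        total = sum(weights.values())
        # Normalization step: each distribution sums to 1, as in Eq. (2).
        probs[non_terminal] = {a: w / total for a, w in weights.items()}
    return probs


probs = terminal_rule_probabilities(HYDROPATHY)
for non_terminal, dist in probs.items():
    print(non_terminal, round(sum(dist.values()), 6))
```

With this weighting, residues at the extremes of the scale receive zero probability under Medium; a smoother mapping could be substituted without affecting the normalization in Eq. (2).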

As proposed by Dyrka and Nebel (2009), all terminal rules are fixed with given probabilities, so only the probabilities of the subset of rules of the form A → BC are subject to evolution. To avoid trivial solutions, non-terminals and Left-Hand Side (LHS) symbols, as well as ‘independent non-terminals’ or iNT, were used to further reduce the size of the possible rule space. Usually, the properties of each amino acid can be represented by a single feature in the induced grammar, but here we have generated one grammar per relevant physicochemical property. Protein sequences were parsed for each grammar and their parsing scores were combined to achieve more robust results: the final score is the arithmetic average of the scores obtained for each single-property grammar. The rates for grammars based on different properties and their combinations were compared with the scores obtained by PROSITE patterns (Sigrist et al., 2010). Overall, SCFGs were able to detect relevant sequences missed by the PROSITE patterns for TPR and TPR-like motifs. A modified version of the Cocke–Kasami–Younger (CKY) algorithm, called Stochastic CKY (implemented by Dyrka and Nebel (2009)), was used to parse the SCFGs. The outcome of this procedure is the probability that a given sequence was generated by a certain grammar G (i.e. the sequence belongs to the language L(G)), instead of a Boolean value as in the case of non-probabilistic CKY. This probability is defined as the product of the probabilities of all grammar rules involved in the construction of the corresponding parse tree. A general approach to scaling within a Viterbi-style CKY parsing scheme is to calculate normalized scores (Score’):

\[
Score' = Score \cdot A \cdot (RA)^{i}, \tag{3}
\]

where A compensates for the decrease in score caused by linking another terminal (A → a), R compensates for the decrease in score caused by invoking a rule (A → BC) and i is the parsing level. As it is difficult to calculate the values of A and R, an empirical scaling factor, F = RA, was introduced. It is based on the change in average raw (unscaled) scores with increasing parse tree spans, calculated over a large dataset for a given grammar. If a grammar is correctly induced, the most likely parse tree of a binding site sequence reflects its structural features (Sakakibara et al., 1994), whereas the basic version of the parser returns a score which is the log of the probability rather than the probability itself. Therefore, it is the average log of probabilities over all training samples that is optimized in the training step.

A genetic algorithm (GA) was used for grammar induction, in which a single individual represents a whole grammar. This strategy evolves the probabilities of grammar rules, and the implementation of the grammar induction algorithm is based on Matthew Wall’s GAlib library, which provides a set of C++ genetic algorithm objects (Dyrka, 2007). To normalize results regarding sequence length, the objective function is defined as the arithmetic average of the logs of probability returned by the parsing algorithm for each positive sample:

\[
f(X) = \frac{\sum_{i}^{|S|} \log \Pr(W_i \mid G)}{|S|}, \tag{4}
\]

where S is the positive sample, Wi is a sequence from S, G is a given grammar and Pr(Wi|G) is the probability that Wi belongs to L(G), as posed by Dyrka and Nebel (2009). The algorithm stops when there is no further significant improvement in the score.

2.6. Multi-relational data mining (MRDM) method fundamentals

Algorithms for relational association rule discovery (RARD) are well suited for exploratory data mining because they offer the flexibility required to experiment with examples more complex than feature vectors and patterns more complex than item sets (Friedman et al., 1999), as is the case with TPR-like motifs. An adequate machine learning approach (Getoor et al., 2001a) focuses on learning a complex web of relationships among a collection of diverse objects rather than on supervised learning from independent and identically distributed training examples (i.e., a classifier f that, given an object x, produces as output a classification label y = f(x)). This formalism, developed as probabilistic relational models (PRMs) (Getoor et al., 2001a,b), can represent these webs of relationships and support learning and reasoning with them (Craven et al., 2000). PRMs are a multi-relational form of Bayesian networks that describe a template for a probability distribution. This template, together with a set of motif objects, defines a distribution over the attributes of the objects (Getoor, 2001). Such a model can then be used for reasoning about an entity, using the entire rich structure of knowledge encoded by the relational representation (Getoor et al., 2001b; Segal et al., 2001; Getoor and Grant, 2006). For each PRM, we were interested in constructing a model that trades off fit to the data against the complexity of the TPR-like motif model. This tradeoff allows us to avoid fitting the training data too closely, which would reduce our ability to predict unseen data.

2.7. Computational complexity

To deal with the size of the protein alphabet, we have treated each physicochemical property separately and introduced 3 nonterminal rules which express their quantitative values. Nonetheless, the computational complexity of parsing a SCFG remains an issue, since it is cubic in time and quadratic in space with respect to the length of the input sentence (Dyrka and Nebel, 2009). For protein sequence scanning (the average length of a sequence is around 350 amino acids), time complexity is a critical issue, in particular during grammar induction. Typically, a population of hundreds of grammars needs to be parsed at each step of the iterative process. Moreover, since the structure of the final grammar is initially unknown, the initial population of grammars should have a set of rules large enough to be able to express the properties of the pattern of interest, in the present case the TPR motif. To better exploit capabilities and manageability, we have chosen to induce grammars using a constant population of 30 sequences, because, for such a population size, the total number of rules of each sequence should not exceed 30 to be able to generate a grammar within a reasonable amount of time (a few hours). This constraint was imposed by supplying each sequence with a complete set of rules associated with random probabilities in order not to introduce any bias in the evolution of grammars. Since the Chomsky Normal Form for rules was used, the number of possible rules is bounded by the cube of the number of non-terminals. Therefore, in this scheme no more than 3 non-terminals can be introduced to keep the number of rules below 30. For each grammar generation, we produced several grammars (usually 3) whose scanning results were combined by arithmetic averaging of scores. Using an advanced PC workstation (Intel® Xeon® Quad-Core with 6.4 GT/s Intel QuickPath, 2.8 GHz, 8 MB cache), the time needed to produce all grammars associated with a given TPR-like pattern was typically 6–8 h.
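The cost argument above can be illustrated with a minimal sketch of the inside algorithm for a grammar in Chomsky Normal Form. It is not the authors' implementation: the toy grammar, the two-letter alphabet ('h' for hydrophobic, 'p' for polar) and all function names are assumptions made for illustration only. The three nested position loops give the cubic time cost, the chart gives the quadratic space cost, and with 3 non-terminals there are at most 3^3 = 27 binary rules.

```python
import math
from collections import defaultdict

def max_binary_rules(n_nonterminals):
    """Upper bound on A -> B C rules in Chomsky Normal Form."""
    return n_nonterminals ** 3

def inside_log_prob(seq, unary, binary, start="S"):
    """Log-probability that the grammar generates seq (inside algorithm).

    unary  maps (lhs, terminal) -> probability of the rule lhs -> terminal
    binary maps (lhs, b, c)     -> probability of the rule lhs -> b c
    """
    n = len(seq)
    if n == 0:
        return float("-inf")  # a CNF grammar without empty rules cannot emit ""
    # chart[i][j][A] = probability that A derives seq[i..j]; O(n^2) cells
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(seq):
        for (lhs, terminal), p in unary.items():
            if terminal == symbol:
                chart[i][i][lhs] += p
    for span in range(2, n + 1):             # three nested position loops:
        for i in range(n - span + 1):        # O(n^3) time overall
            j = i + span - 1
            for k in range(i, j):
                left, right = chart[i][k], chart[k + 1][j]
                if not left or not right:
                    continue
                for (lhs, b, c), p in binary.items():
                    if b in left and c in right:
                        chart[i][j][lhs] += p * left[b] * right[c]
    total = chart[0][n - 1].get(start, 0.0)
    return math.log(total) if total > 0 else float("-inf")

def average_log_prob(sample, unary, binary, start="S"):
    """Arithmetic average of log-probabilities over a positive sample."""
    return sum(inside_log_prob(w, unary, binary, start) for w in sample) / len(sample)

# Hypothetical toy grammar: S -> A B; A and B emit 'h' or 'p'.
BINARY = {("S", "A", "B"): 1.0}
UNARY = {("A", "h"): 0.6, ("A", "p"): 0.4, ("B", "h"): 0.3, ("B", "p"): 0.7}
```

In an induction loop, average_log_prob would be evaluated for every grammar in the population at every generation, which is why the number of rules per grammar has to stay small.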

3. Results and discussion

While data mining and annotation of the genomes of three Leishmania species (at GeneDB, Sanger Center) have provided a complete inventory of predicted proteins and a fairly good estimate of the motifs they contain and their associated roles and functions, resources for integrating this information into effective knowledge are still greatly needed and are currently under way. Given an original dataset of 24,708 protein sequences (Dataset 1 in supplementary material), there are a number of prediction tasks to be addressed. These include predicting additional genes and detecting how genes and proteins are organized into functional substructures (motifs or domains) in cases where this is not known. Note that these tasks are clearly multi-relational in that the representation includes attributes of genes (e.g. sequence coordinates), relations among genes and proteins (e.g. containing the same motifs), and relations between genes and other entities, such as cellular localization and GO terms. Additionally, the records in the tables are linked to these entities through their sequence coordinates. Recently we have considered various approaches to recognizing functional elements in flagellar proteins from Leishmania genomes. In our recent works (Pacheco et al., 2007, 2009; Girão et al., 2008), we have employed probabilistic language models to simultaneously predict TPR, PPR and HAT motifs, whereas here we extend this research to a preliminary benchmarking against existing HMM methods. Our model is, for the most part, an HMM, although one component of it is actually a stochastic context free grammar (SCFG) (Dyrka and Nebel, 2009). The Viterbi path (Pacheco et al., 2007, 2009) is the most likely derivation (parse) of the sequence under the given SCFG; the grammar can also be used to compute the total probability of all derivations that are consistent with a given sequence.
This is equivalent to the probability of the SCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar (Rivas and Eddy, 2001). We argue that most of the relational structure in protein motif discovery within sequence analysis is sequential. Therefore, probabilistic language models are especially suited to the task. Since the expressive power of an HMM is similar to that of a stochastic regular grammar (Durbin et al., 1998), HMMs are limited in the types of patterns they are able to encode. Context free grammars (CFGs) have the potential to overcome some of the limitations of HMM-based schemes because they sit at the next level of expressiveness in Chomsky’s classification and produce human-readable descriptors. Their reduced complexity makes them more practical and allows grammar structure to be learned from examples (Dyrka and Nebel, 2009), as we do here for a group of TPR-like motifs/domains. Moreover, CFGs can be used to model dependencies between different parts of a protein, such as binding sites, motifs and domains, by using branching rules. Thus, the development of grammars able to model branched and nested relationships should improve the modeling of several motif-containing protein sequences.
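For the HMM part of such a model, the Viterbi path can be sketched as follows. This is a generic log-space Viterbi decoder, not the authors' code; the two states ("repeat" and "linker"), the reduced alphabet ('h', 'p') and all probabilities are hypothetical values chosen only to make the example runnable.

```python
import math

def to_log(table):
    """Convert a (possibly nested) table of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for an observation sequence."""
    if not obs:
        return 0.0, []
    # score[s] = best log-probability of any state path ending in s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + log_trans[r][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):  # trace the best path backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

# Hypothetical two-state model over a reduced alphabet.
STATES = ["repeat", "linker"]
LOG_START = to_log({"repeat": 0.5, "linker": 0.5})
LOG_TRANS = to_log({"repeat": {"repeat": 0.9, "linker": 0.1},
                    "linker": {"repeat": 0.1, "linker": 0.9}})
LOG_EMIT = to_log({"repeat": {"h": 0.8, "p": 0.2},
                   "linker": {"h": 0.2, "p": 0.8}})
```

Replacing the max over predecessor states with a sum would give the forward algorithm, i.e. the total probability of all paths rather than the single best one, which mirrors the distinction between the Viterbi parse and the total derivation probability for an SCFG.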

M.C. Diniz et al. / Pattern Recognition Letters 31 (2010) 2178–2189

3.1. TPR-like motif localization task

We have applied two MRDM methods (RARD and PRMs) after HMMs/VA to mine TPR, PPR and HAT repeats in protein sequences of Leishmania spp. Six variants of the data set were available for the TPR-like motif localization task (considering four TPRs, one PPR and one HAT in the TPR-like clan); the first version consisted of a single table with 25,756 attributes and the second consisted of two tables with 26 attributes in total. We used a normalized version of the data set with two tables. The names of the two original tables are motifs_relation and interactions_relation. The motifs_relation table contained 120 different motifs, but there could be more than one row in the table for each motif. The attribute motif_id identifies a motif uniquely. Since our current implementation of MRDM requires the target table to have a primary key, it was necessary to normalize the motifs_relation table before we could use it as the target table. This normalization was achieved by creating the tables named motif, interaction, and composition as follows. Attributes in the motifs_relation table that did not have unique values for each motif were placed in the composition table and the remaining attributes were placed in the motif table. The motif_id attribute is a primary key in the motif table and a foreign key in the composition table. The interaction table is identical to the original interactions_relation table. This represents one of several ways of normalizing the original table, and the normalization chosen affects the entity-relationship diagram of the normalized Motif Localization database. Thus, for the TPR-like motif localization task, the target table is motif and the target attribute is localization. From this point of view, the training set consists of 120 motifs and the test set of 68.
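The normalization step described above can be sketched with an in-memory SQLite database. Only the table names (motifs_relation, motif, composition) and the attributes motif_id and localization come from the text; the column partner_class and the sample rows are hypothetical placeholders, and the real tables contain many more attributes.

```python
import sqlite3

def normalize(rows):
    """Split motifs_relation into motif (one row per motif) and composition.

    rows: iterable of (motif_id, localization, partner_class) tuples, where
    partner_class stands in for any attribute that is not unique per motif.
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE motifs_relation (
            motif_id TEXT, localization TEXT, partner_class TEXT);
        -- target table: motif_id must be unique, so it can be a primary key
        CREATE TABLE motif (
            motif_id TEXT PRIMARY KEY, localization TEXT);
        -- multi-valued attributes, linked back by a (declared) foreign key
        CREATE TABLE composition (
            motif_id TEXT REFERENCES motif(motif_id), partner_class TEXT);
    """)
    con.executemany("INSERT INTO motifs_relation VALUES (?, ?, ?)", rows)
    # Raises sqlite3.IntegrityError if localization is not unique per motif,
    # i.e. if that attribute actually belongs in the composition table.
    con.execute("INSERT INTO motif "
                "SELECT DISTINCT motif_id, localization FROM motifs_relation")
    con.execute("INSERT INTO composition "
                "SELECT DISTINCT motif_id, partner_class FROM motifs_relation")
    con.commit()
    return con

# Made-up rows: motif m1 appears twice because it has two partner classes.
ROWS = [("m1", "loc_A", "class_x"),
        ("m1", "loc_A", "class_y"),
        ("m2", "loc_B", "class_x")]
```

After normalization, motif has exactly one row per motif_id and can serve as the target table, with localization as the target attribute.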
The experiments described here focused on building a classifier for predicting the localization of motif-containing proteins by assigning the corresponding instance to one of six possible localizations. For this motif localization task, we have chosen to construct a classifier using all training data and to test the resulting classifier on the test set provided by Leishmania sequences (original dataset). The design of an unbiased negative sample is particularly difficult in protein sequence analysis, which prompted us to use an alternative approach based on stochastic grammars (SCFGs) which, in principle, do not require a negative set for their inference, as recently revisited by Dyrka and Nebel (2009). In their SCFG-based framework, grammar selection relies on fit to the training sample of sequences, i.e. the selected grammar is the one that yields the maximal log probabilities over the whole set. In order to normalize results with respect to sequence length, the objective function was defined as the arithmetic average of the log probabilities returned by the parsing algorithm for each positive sample. This task presents significant challenges because many attribute values in the training instances corresponding to the 120 training motifs are missing. Initial experiments using a special value to encode a missing attribute value resulted in classifiers whose accuracy is around 40% on the test data. This prompted us to investigate other approaches to handling missing values. Replacing missing values by the most common value of the attribute for the class during training resulted in an accuracy of around 68%. This shows that providing reasonable guesses for missing values can significantly enhance the performance of MRDM on our data sets. However, in practice, since class labels for test data are unknown, it is not possible to replace a missing attribute value by the most frequent value for the class during testing.
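The class-conditional replacement of missing values described above can be sketched as below. The attribute name charge and the sample rows are hypothetical; only localization as the class label comes from the text. The function global_mode is one possible label-free fallback for test instances, shown as an assumption rather than as the method used in this study.

```python
from collections import Counter, defaultdict

def class_modes(rows, attr, label):
    """Most common observed value of attr within each class label."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[attr] is not None:
            counts[row[label]][row[attr]] += 1
    return {cls: c.most_common(1)[0][0] for cls, c in counts.items()}

def impute_training(rows, attr, label):
    """Fill missing attr values with the mode of the row's own class."""
    modes = class_modes(rows, attr, label)
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[attr] is None:
            new_row[attr] = modes.get(new_row[label])
        filled.append(new_row)
    return filled

def global_mode(rows, attr):
    """Label-free fallback: most common value of attr over all rows."""
    values = [row[attr] for row in rows if row[attr] is not None]
    return Counter(values).most_common(1)[0][0] if values else None

# Made-up training rows; None marks a missing value.
TRAIN = [{"localization": "A", "charge": "pos"},
         {"localization": "A", "charge": "pos"},
         {"localization": "A", "charge": None},
         {"localization": "B", "charge": "neg"},
         {"localization": "B", "charge": None}]
```

impute_training needs the class label of each row, which is exactly what is unavailable at test time; a fallback such as global_mode avoids the label but discards the class-specific signal.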
Hence, there is a need for better ways of handling missing values (e.g., predicting missing values from the values of other attributes).

3.2. Identification of the TPR-like superfamily in Leishmania spp. genomes

The percentage of repeat-containing proteins, such as TPR-like proteins, grows with the complexity of the organism, with repeat proteins


being particularly abundant in multicellular organisms (Bjorklund et al., 2006). Genomes of unicellular eukaryotes such as Leishmania usually possess a relatively high number of putative protein-encoding genes (e.g., around 8000 genes in L. major) (Hertz-Fowler et al., 2004). Analyzing such a large number of encoded proteins requires that a given protein family be characterized by detecting sequence regions shared by all family members. Computing the consensus of such regions provides a motif that is used to recognize new members of the family (Servant et al., 2002). With the completed sequencing of three Leishmania and several trypanosome genomes (Hertz-Fowler et al., 2004), we were able to search for all TPR-like genes in Leishmania using the defining characteristic of a TPR-like protein. Table 1 compares the numbers of members detected by different tools (GeneDB/SMART, Pfam, SUPERFAMILY and TPRpred) with those detected by our method (MRDM/HMM/VA), which assigns a significantly larger number of TPR-like motif-containing proteins in Leishmania: at most 104 TPRs and 36 PPRs, and 8 HATs in total (Table 2). These members are elements putatively involved in several key cellular processes, such as glycosome biogenesis (PEX5 and PEX14) and flagellar pathways (IFT subunits, cyclophilins, phosphatases), besides binding partners of either motor or cargo proteins (kidins220/ARMS and other members of the KAP family of P-loop NTPases) or those involved in the assembly or disassembly of protein complexes. The resulting descriptions of the families and their members, a good example of the relevant patterns and reasonable family assignments obtained with our approach, should provide a solid and unified platform on which future genetic and functional studies of Leishmania TPRPs can be based.

3.3. TPR-encoding genes in Leishmania spp. genomes

We first used the alignment of 275 sequences previously identified as putative strict TPR-containing motifs (obtained from SUPERFAMILY, SMART, Pfam and TPRpred) to obtain the consensus model. This TPR signature matrix was subsequently used to search for TPR motifs in the six reading frames of whole Leishmania genomes. Multiple alignment of Leishmania TPRP sequences revealed that most substitutions in the TPRs occur at nonconsensus positions; consensus residues are selectively conserved between orthologues (particularly in Trypanosoma spp.). Because TPR motifs are highly degenerate, a fairly large number of false positive hits were expected. However, because TPR motifs usually appear as tandem repeats, we could remove most random uninteresting matches by omitting all orphan TPR motifs that were found farther than 200 nucleotides from any other TPR motif. The 465 TPR motifs retained formed 132 clusters, each of which comprised a putative TPR gene. Each TPR motif cluster was then investigated in detail by manually comparing the positions and reading frames of the TPR motifs with (1) the open reading frame (ORF) models and (2) the predicted protein sequences within potential coding sequences available for Leishmania genomes. From this analysis, 104 putative TPR ORF models were constructed (i.e., 28 motif clusters were discarded or fused with other clusters). TPR genes are fairly evenly distributed throughout the 36 mini-chromosomes of L. major, with little in the way of obvious clusters. The densest grouping of TPR genes lies on chromosomes 30, 32 and 36, the last of which contains 13 genes, the maximum number found on any single chromosome of L. major (Table S1). The number of proteins containing TPR motifs and the distribution of individual repeats detected confidently by our MRDM/HMM/VA method are illustrated and tabulated in Figs. 1 and 2.
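The orphan-removal and clustering step can be sketched as follows. This is an illustrative reading of the rule, not the authors' script: hits are assumed to be (start, end) nucleotide coordinates on a single strand of a single sequence, and the distance between two motifs is assumed to be the gap between the end of one hit and the start of the next. The coordinates in the example are made up.

```python
def cluster_hits(hits, max_gap=200):
    """Group motif hits lying within max_gap nucleotides and drop orphans.

    hits: iterable of (start, end) coordinates, end inclusive.
    Returns a list of clusters, each a list of hits sorted by start.
    """
    clusters = []
    for start, end in sorted(hits):
        if clusters and start - clusters[-1]["end"] <= max_gap:
            # close enough to the current cluster: extend it
            clusters[-1]["hits"].append((start, end))
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
        else:
            clusters.append({"end": end, "hits": [(start, end)]})
    # an orphan is a cluster made of a single motif
    return [c["hits"] for c in clusters if len(c["hits"]) >= 2]

# Made-up coordinates: two tandem arrays and one isolated hit at 2000.
HITS = [(100, 201), (250, 351), (400, 501),
        (2000, 2101),
        (5000, 5101), (5150, 5251)]
```

Each surviving cluster is then a candidate gene to be checked against ORF models, which in the study was done manually.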
We were able to detect more TPR-containing proteins in Leishmania genomes than all compared resources (Pfam, SUPERFAMILY, SMART and TPRpred) and over 2-fold more individual


repeats than Pfam and SMART, but not TPRpred, which, as a profile-sequence comparison method, shows a marked improvement over existing methods based on regular HMMs (such as HMMER, which runs on the Pfam and SMART web servers, and SAM, which runs on the SUPERFAMILY web server), particularly in the detection of non-canonical, divergent repeats. We have benefited from all TPRpred improvements (Karpenahalli et al., 2007), extending its scope to include HAT motifs.

3.4. Functional features of TPR-like motifs

The preliminary functional predictions of a range of family members performed here, together with the sparse data on these proteins in Leishmania published so far, allow us to propose putative models in which TPR-like proteins might play the role of sequence-specific adaptors for a variety of other RNA-associated proteins. Structural models of Leishmania TPRPs were predicted (Fig. 3) in an attempt to aid such studies, although they still require further testable hypotheses. Nonetheless, one testable prediction can be put forward: that TPR-like proteins in Leishmania might be directly or indirectly associated with specific RNA sequences and with defined effector proteins, as previously suggested in Arabidopsis (Lurin et al., 2004). The rationale for using SCFGs to produce motif-finding descriptors is that not only do they have the power to express branched and nested-like dependencies, but their rules can also be analyzed to acquire biological knowledge about recognizing motifs of interest. In this study, we illustrate how analysis of sequence-based SCFGs provides additional insight into the identification of TPR-like motifs in Leishmania protein sequences (mostly annotated as hypothetical conserved). We have analyzed probabilistic grammars to extract biologically meaningful features, focusing on either parse trees or grammar rules. We started with an in-depth study of grammars produced to describe two extended patterns: PDOC51375 – Pentatricopeptide (PPR) repeat profile (PS51375) and PDOC50005 – TPR repeat profiles, the latter of which includes at least two profiles (PS50005, TPR repeat profile; and PS50293, TPR_REGION, TPR repeat region circular profile). These two profiles were developed for the module PDOC50005: the first picks up TPR repeat units, while the second is ‘circular’ and thus detects a region containing adjacent TPR repeats (Expasy Proteomics Server – www.expasy.org). Through this analysis, we have used all the sequence information of Leishmania genomes (coding sequences and predicted proteins), i.e. the original dataset (Supplementary Dataset 1), to visualize the general structure of the motif and its description as provided by the grammar parse trees. We have focused our attention on grammars based on residue features (small and large hydrophobic amino acids), propensity for turn positions both between the two helices of a single TPR and between two adjacent TPRs, and propensity for conservation of helix-breaking residues. These grammars are known to be the most informative for describing the PDOC50005 pattern. Future work needs to be directed toward the identification of these factors to elucidate the precise functions of one of the largest

Fig. 1. Schematic diagram of TPR-containing genes in genomes of Leishmania major, L. infantum and L. braziliensis, detected after a multi-relational data mining and hidden Markov model/Viterbi algorithm approach (MRDM/HMM/VA). Orthologues shared among the three species are illustrated as intersections of colored circles, and individual identifiers (GeneDB IDs) for putative TPRPs are given as lateral tables. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 2. Distribution of the number of TPR motifs in 311 TPR-containing proteins (TPRPs) occurring in three species of Leishmania (L. major, L. infantum and L. braziliensis). Around 66% of these TPRPs have 2–5 repeats.

Fig. 3. Three-dimensional modeling of TPR-containing proteins (TPRPs). (A) Template, superhelical TPR domain of the human O-linked N-acetylglucosaminyl (GlcNAc) transferase (PDB ID: 1W3B). (B) L. major hypothetical conserved protein (hcp) (GeneDB ID: LmjF30.3010) with 11 TPR domains. (C) Template, crystal structure of a synthetic 8-repeat consensus TPR superhelix (PDB ID: 2FO7). (D) L. major hcp (GeneDB ID: LmjF29.0150) with 4 TPR domains.

and least understood protein families in Leishmania, the TPR-like superfamily. For now, our MRDM approach may also be relevant for other families of proteins with repeated motifs, in a similar way to what was reported by Rivals et al. (2006). We must recall that the L. major genome contains 673 predicted proteins annotated with the term repeat in their descriptions (according to a simple full keyword search performed on the GeneDB main search page on June 30, 2009), including only 62 of the 104 TPRPs and 12 of the 36 PPRPs that we have identified here. For instance, there are 18 proteins containing repeats of the Kelch motif, often associated with an F-box domain, 169 WD40 repeat-containing proteins and 121 proteins with Leu-rich repeats, frequently associated with a protein kinase domain. Other cases are armadillo (61) and ankyrin (45) repeat-containing proteins. In some of these protein families, the region containing the repeats makes up a large part of the protein and can, and should, be a valuable target for applying MRDM methods.

3.5. PPR- and HAT-encoding genes in Leishmania spp. genomes

The name PPR was coined based on its similarity to the better-known TPR motif (Small and Peeters, 2000). PPRPs make up a large share of the proteins of unknown function in many organisms, but only few


of these PPRPs have functional roles ascribed, although a putative RNA-binding function is widely accepted (reviewed in Lurin et al., 2004) and one PPRP is involved in RNA editing (Kotera et al., 2005). The existence of a large family of PPRPs only became apparent with the Arabidopsis Genome Initiative, which revealed 446 PPR coding genes – 6% of its entire genome (Lurin et al., 2004). The PPRP family has been divided, according to motif content and organization, into two subfamilies: the PPRP-P and the plant-exclusive combinatorial and modular proteins (PCMPs). PPR motifs have been found in all eukaryotes analyzed to date, but with an extraordinary discrepancy in numbers between plant and nonplant organisms (the human genome encodes only six putative PPRPs). Trypanosomatids and other flagellated organisms are expected to have an intermediate number (around one hundred PPR genes), a number still far from the 36 PPR-encoding genes we have found here (12 of them annotated at GeneDB as conserved hypothetical proteins of L. major) (Fig. 4). Recent reports (Mingler et al., 2006; Pusnik et al., 2007) mention fewer than thirty (respectively 23 and 28) PPRPs identified in Trypanosoma brucei (with at least 25 orthologues found in L. major), with predictions indicating that most of these proteins are targeted to mitochondria. The HMMs used here were identical to those used for detection of PPR motifs in plants (Lurin et al., 2004; O’Toole et al., 2008). As multiple models based on PPR variants were used, multiple overlapping hits were usually obtained. In these cases, the highest scoring chain of nonoverlapping hits was retained and the alternative overlapping hits were discarded. As an initial screen, genomic nucleotide sequence data from Leishmania were translated in all six frames. The search was applied to the translated sequence data to identify clusters of PPR motifs separated by fewer than 200 base pairs. These clusters correspond to putative PPR genes.
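Selecting the highest scoring chain of nonoverlapping hits is an instance of weighted interval scheduling, which can be solved exactly by dynamic programming. The sketch below is one standard way to do it and is not taken from the authors' scripts; hits are assumed to be (start, end, score) tuples with integer, inclusive coordinates, and the example values are made up.

```python
from bisect import bisect_right

def best_chain(hits):
    """Return (total_score, chain) for the best set of nonoverlapping hits.

    hits: list of (start, end, score) with integer coordinates, end inclusive.
    Two hits overlap if they share at least one position.
    """
    hits = sorted(hits, key=lambda h: h[1])   # order by end coordinate
    ends = [h[1] for h in hits]
    # best[i] = best (score, chain) using only the first i hits
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(hits):
        # j = number of earlier hits that end before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

# Made-up hits from several models: the second overlaps both neighbors.
HITS = [(1, 35, 10.0), (30, 64, 12.0), (36, 70, 9.0), (71, 105, 8.0)]
```

Here the single best hit (score 12.0) is discarded because the chain of three compatible hits scores 27.0 in total, which is the behavior a greedy "keep the top hit" rule would miss.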
As of Release 2.1 of GeneDB (www.genedb.org), with curated annotations of Leishmania genes, 13 of the GeneDB ORFs are annotated as conserved hypothetical proteins that contain PPR motifs based on matches with the PFAM profile PF01535 or SMART profile IPR002885. None of the Leishmania GeneDB models are annotated as homologues of known PPRPs. Of the two sets of ORF models (ours and GeneDB’s), 12 are identical (i.e., our analysis agreed with the GeneDB model). The 13th GeneDB model does not have an equivalent in our set because we did not consider it to be a PPRP by our criteria (it lacks tandem motifs matching our HMM profiles). Twenty-one of our models have no GeneDB equivalent and correspond to genes apparently overlooked during annotation or considered to be pseudogenes. In all, 22 of our 36 models differ in at least some respects from the corresponding GeneDB model, but correspond quite well to the 28 PPRPs identified in T. brucei (Pusnik et al., 2007), which reinforces how well conserved PPR genes seem to be in trypanosomatids. It should be noted that in very few of these cases are molecular data available that can be used to decide between discordant models. Our choice has generally been made by comparison with other genes in the family and general expertise with these proteins. A noticeable characteristic of PPR genes is that they rarely contain introns within coding sequences, even in higher eukaryotes (more than 80% of known PPR genes of plants unexpectedly do not contain introns), which is also true for Leishmania ORF models (an obvious extension for trypanosomatid genes, which usually do not contain introns anyway). This feature might explain why PPR genes are relatively short (on average <2 kb) although PPRPs are comparatively large proteins (680 aa on average). HAT repeats have three aromatic residues with a conserved spacing and are structurally and sequentially similar to TPRs/PPRs, although they lack the highly conserved alanine and glycine residues found in TPRs. The number of HAT repeats found in different proteins varies between 9 and 12. HATPs appear to be components of macromolecular complexes that are required for RNA processing, and the HAT motif has striking structural similarities to HEAT repeats (IPR000357), being of a similar length and consisting of two short helices connected by a loop domain, as in HEAT repeats

Fig. 4. Schematic diagram of PPR-containing genes in genomes of Leishmania major, L. infantum and L. braziliensis, detected after a multi-relational data mining and hidden Markov model/Viterbi algorithm approach (MRDM/HMM/VA). Orthologues shared among the three species are illustrated as intersections of colored circles, and individual identifiers (GeneDB IDs) for putative PPRPs are given as lateral tables. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 5. Multiple sequence alignment of two HAT (half-a-TPR) motifs present in predicted proteins of Leishmania major (LmjF28.1020 and XP_001684383.1), Leishmania infantum (XP_001470154.1), Leishmania braziliensis (XP_001566140.1) and Trypanosoma cruzi strain CL Brener (XP_809989.1). HAT motif residues are shown with dark and light gray shaded boxes for identical or similar residues, respectively. Well conserved aromatic residues are commonly observed at positions 11 and 25 of HAT motifs. A schematic consensus for HAT (Champion et al., 2008) is aligned with TPR consensus to illustrate Leishmania predicted HAT motif-containing proteins (HATPs). Boxes surround residues in HATPs that are predicted to form alpha helices based on the TPR structure. Note that in TPRs, the conserved W is at position 4, and the conserved P is at position 32, resulting in a repeat that encodes helix A followed by helix B. In HATs, helix B is followed by helix A and the alignment reflects the HAT consensus.

(Preker and Keller, 1998; Small and Peeters, 2000; Kotera et al., 2005; Lurin et al., 2004; Rivals et al., 2006). Our survey identified a total of 8 putative HATPs (Table 2) in the three species of Leishmania analyzed, but the lack of general information on HATPs does not allow any further inference about their significance in the protozoan genome. As the first in silico identification of HAT-containing genes in Leishmania, we provide a multiple alignment in Fig. 5 to illustrate two HAT repeats in full trypanosomatid HATP homologues. The detection of such a small, but significant, presence of HATPs in Leishmania (Table 2) is certainly an issue for future investigation. Several proteins have been found to interact with HATPs, but no specific ligand for a HAT domain has been identified so far (Champion et al., 2008).

4. Conclusions

We have performed bioinformatics analyses of Leishmania TPR, PPR and HAT genes/proteins (the TPR-like superfamily) with an integrated MRDM/HMM/VA approach that, in contrast to other currently available resources (more general HMMER-based resources such as Pfam, SMART and SUPERFAMILY, or specific tools such as TPRpred), seeks to capture as much model information as possible in the pattern matching heuristic, without resorting to more standard motif discovery methods. TPR genes are ubiquitous, whereas PPRs and HATs are mostly found in eukaryotes; what they have in common is that they remain largely unexplored in Leishmania parasites. The spread of new developments and applications of MRDM algorithms and techniques to data-driven knowledge discovery problems in bioinformatics and computational biology is a future direction towards more powerful biological inference from sequence and structural analyses. The advantages of our MRDM method in detecting protein motifs of TPR-like superfamily members could be explained by the fact that we have incorporated optimized profiles, score offsets and distribution to compute probability, as deployed by TPRpred, together with a Viterbi algorithm-improved HMM method which, in turn, has been used in order to take into account the tendency of repeats to occur in tandem and to be widely distributed along the sequences. This integrated MRDM/HMM/VA approach might be applied to other sequence analysis problems in such a way that it contributes to motif discoveries with a better probability that a target sequence is, indeed, a target motif-containing protein.

5. Authors’ contributions

MCD, ACLP and KTG were involved in the generation, analysis and interpretation of the data. FFA implemented the PERL scripts, programmed the HMM improvement with Viterbi algorithm and the integration of MRDM/SCFGs. ACLP, CAW and DMO drafted the manuscript. DMO planned and supervised the overall work. MCD, CAW and DMO critically revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

Financial support for this study was provided by the Brazilian funding agencies CNPq (National Council for Scientific and Technological Development, Grant# 486266/2006-0) and FUNCAP (The State of Ceará Research Foundation) through graduate fellowships to MCD and ACLP; as well as from Universidade Estadual do Ceará (UECE) internal funds. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.patrec.2010.04.008.


References Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. Aurrecoechea, C., Heiges, M., Wang, H., Wang, Z., Fischer, S., Rhodes, P., Miller, J., Kraemer, E., Stoeckert Jr., C.J., Roos, D.S., Kissinger, J.C., 2007. ApiDB: Integrated resources for the apicomplexan bioinformatics resource center. Nucleic Acids Res. 35, D427–D430. Blatch, G.L., Lässle, M., 1999. The tetratricopeptide repeat: A structural motif mediating protein–protein interactions. BioEssays 21, 932–939. Bjorklund, A.K., Ekman, D., Elofsson, A., 2006. Expansion of protein domain repeats. PLoS Comput. Biol. 2, e114. Burge, C., Karlin, S., 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94. Champion, E.A., Lane, B.H., Jackrel, M.E., Regan, L., Baserga, S.J., 2008. A direct interaction between the Utp6 half-a-tetratricopeptide repeat domain and a specific peptide in Utp21 is essential for efficient pre-rRNA processing. Mol. Cell. Biol. 28, 6547–6556. Craven, M., Page, D., Shavlik, J., Bockhorst, J., Glasner, J., 2000. A probabilistic learning approach to whole-genome operon prediction. In: Proc. Eighth Internat. Conf. on Intelligent Systems for Molecular Biology, AAAI Press, La Jolla, CA, pp. 116–127. D’Andrea, L.D., Regan, L., 2003. TPR proteins: The versatile helix. Trends Biochem. Sci. 28, 655–662. Das, A.K., Cohen, P.W., Barford, D., 1998. The structure of the tetratricopeptide repeats of protein phosphatase 5: Implications for TPR-mediated protein– protein interactions. EMBO J. 17, 1192–1199. Dehaspe, L., De Raedt, L., 1997. Mining association rules in multiple relations. In: Proc. Sevneth Internat. Workshop on Inductive Logic Programming, vol. 1297, Springer-Verlag, LNAI, Heidelberg. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.J., 1998. 
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, first ed. Cambridge University Press, Cambridge, UK. Dyrka, W., 2007. Probabilistic Context-Free Grammar for Pattern Detection in Protein Sequences. In: MSc Thesis. Kingston University, London. Dyrka, W., Nebel, J.-C., 2009. A stochastic context free grammar based framework for analysis of protein sequences. BMC Bioinf. 10, 323. Eddy, S.R., 1998. Profile hidden Markov models. Bioinformatics 14, 755–763. Edgar, R.C., 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797. Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, J.S., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L., Bateman, A., 2008. The Pfam protein families database. Nucleic Acids Res. 36 (Database Issue), D281–D288. Forney, G.D. Jr., 1973. The Viterbi algorithm. In: Proc. IEEE, vol. 61. p. 268. Friedman, N., Getoor, L., Koller, D., Pfeffer, A., 1999. Learning probabilistic relational models. In: Proc. Internat. Joint Conf. on Artificial Intelligence, Morgan Kaufman, Stockholm, Sweden, pp. 1300–1307. Friedrich, T., Pils, B., Dandekar, T., Schultz, J., Müller, T., 2006. Modelling interaction sites in protein domains with interaction profile hidden Markov models. Bioinformatics 22, 2851–2857. Garey, M.R., Johnson, D.S., 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, p. 202. ISBN:0-7167-1045-5. A1.4: GT 48. Getoor, L., 2001. Multi-relational data mining using probabilistic relational models: Research summary. In: Knobbe, A.J., van der Wallen, D.M.G. (Eds.), Proc. First Workshop in Multi-relational Data Mining, KDD, 2001. Getoor, L., Friedman, N., Koller, D., Pfeffer, A., 2001a. Learning probabilistic relational models. In: Dzeroski, S., Lavrac, N. (Eds.), Relational Data Mining. Kluwer, pp. 307–335. Getoor, L., Grant, J., 2006. PRL: A probabilistic relational language. Mach. Learn. J. 62, 7–31. 
Getoor, L., Taskar, B., Koller, D., 2001. Using probabilistic models for selectivity estimation. In: Proc. ACM SIGMOD Internat. Conf. on Management of Data, ACM Press, 2001, pp. 461–472. Girão, K.T., Oliveira, F.C.E., Farias, K.M., Maia, I.M.C., Silva, S.C., Gadelha, C.R.F., Carneiro, L.D.G., Pacheco, A.C.L., Kamimura, M.T., Diniz, M.C., Silva, M.C., Oliveira, D.M., 2008. Multi-relational Data Mining for Tetratricopeptide Repeats (TPR)-Like Superfamily Members in Leishmania spp.: Acting-byConnecting Proteins. Lecture Notes in Computer Science, vol. 5265, Pattern Recognition in Bioinformatics, 2008, pp. 359–372. doi:10.1007/978-3-54088436-1. Gough, J., Karplus, K., Hughey, R., Chothia, C., 2001. Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure. J. Mol. Biol. 313, 903–919. Groves, M.R., Barford, D., 1999. Topological characteristics of helical repeat proteins. Curr. Opin. Struct. Biol. 9, 383–389. Hertz-Fowler, C., Peacock, C.S., Wood, C., Aslett, M., Kerhornou, A., Mooney, P., Tivey, A., Berriman, M., Hall, N., Rutherford, K., Parkhill, J., Ivens, A.C., Rajandream, M.A., Barrel, B., 2004. GeneDB: A resource for prokaryotic and eukaryotic organisms. Nucleic Acids Res. 32, D339–D343. Ideker, T., Bafna, V., Lemberger, T., 2007. Integrating scientific cultures. Mol. Syst. Biol. 3, 105–112. Karpenahalli, M.R., Lupas, A.N., Söding, J., 2007. TPRpred: A tool for prediction of TPR-, PPR- and SEL1-like repeats from protein sequences. BMC Bioinf. 8, 2.

Karplus, K., Barrett, C., Hughey, R., 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846–856.
Kobe, B., Kajava, A.V., 2000. When protein folding is simplified to protein coiling: The continuum of solenoid protein structures. Trends Biochem. Sci. 25, 509–515.
Koga, H., Terasawa, H., Nunoi, H., Takeshige, K., Inagaki, F., Sumimoto, H., 1999. Tetratricopeptide repeat (TPR) motifs of p67phox participate in interaction with the small GTPase Rac and activation of the phagocyte NADPH oxidase. J. Biol. Chem. 274, 25051–25060.
Kotera, E., Tasaka, M., Shikanai, T., 2005. A pentatricopeptide repeat protein is essential for RNA editing in chloroplasts. Nature 433, 326–330.
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., Wootton, J., 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.
Letunic, I., Doerks, T., Bork, P., 2009. SMART 6: Recent updates and new developments. Nucleic Acids Res. 37 (Database issue), D229–D232.
Li, W., Jaroszewski, L., Godzik, A., 2001. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 282–283.
Li, W., Jaroszewski, L., Godzik, A., 2002. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18, 77–82.
Lurin, C., Andrés, C., Aubourg, S., et al., 2004. Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis. The Plant Cell 16, 2089–2103.
Madera, M., 2008. Profile Comparer (PRC): A program for scoring and aligning profile hidden Markov models. Bioinformatics 24, 2630–2631.
Madera, M., Vogel, C., Kummerfeld, S.K., Chothia, C., Gough, J., 2004. The SUPERFAMILY database in 2004: Additions and improvements. Nucleic Acids Res. 32, D235–D239.
Main, E.R.G., Lowe, A.R., Mochrie, S.G.J., Jackson, S.E., Regan, L., 2005. A recurring theme in protein engineering: The design, stability and folding of repeat proteins. Curr. Opin. Struct. Biol. 15, 464–471.
Majoros, W.H., Pertea, M., Salzberg, S.L., 2004. TigrScan and GlimmerHMM: Two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879.
Marchler-Bauer, A., Anderson, J.B., Derbyshire, M.K., DeWeese-Scott, C., Gonzales, N.R., Gwadz, M., et al., 2007. CDD: A conserved domain database for interactive domain family analysis. Nucleic Acids Res. 35, D237–D240.
Mingler, M.K., Hingst, A.M., Clement, S.L., Yu, L.E., Reifur, L., Koslowsky, D.J., 2006. Identification of pentatricopeptide repeat proteins in Trypanosoma brucei. Mol. Biochem. Parasitol. 150, 37–45.
Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
O’Toole, N., Hattori, M., Andres, C., Iida, K., Lurin, C., Schmitz-Linneweber, C., Sugita, M., Small, I., 2008. On the expansion of the pentatricopeptide repeat gene family in plants. Mol. Biol. Evol. 25, 1120–1128.
Pacheco, A.C.L., Araujo, F.F., Kamimura, M.T., Medeiros, S.R., Viana, D.A., Oliveira, F.C.E., Araújo-Filho, R., Costa, M.P., Oliveira, D.M., 2007. Following the Viterbi Path to deduce flagellar actin-interacting proteins of Leishmania spp.: Report on cofilins and twinfilins. In: Pham, T. (Ed.), Computer Models for Life Sciences, CMLS’07, AIP Proc., vol. 952. American Institute of Physics, Australia, 2007, pp. 315–324.
Pacheco, A.C.L., Araujo, F.F., Kamimura, M.T., Silva, S.C., Diniz, M.C., Oliveira, F.C.E., Araujo Filho, R., Costa, M.P., Oliveira, D.M., 2009. Hidden Markov models and the Viterbi algorithm applied to integrated bioinformatics analyses of putative flagellar actin-interacting proteins in Leishmania spp. Internat. J. Comput. Aided Eng. Technol. (IJCAET) 1, 420–436.
Page, D., Craven, M., 2003. Biological applications of multi-relational data mining. SIGKDD Explorations 5, 69–79.
Papadopoulos, J.S., Agarwala, R., 2007. COBALT: Constraint-based alignment tool for multiple protein sequences. Bioinformatics 23, 1073–1079.
Preker, P.J., Keller, W., 1998. The HAT helix, a repetitive motif implicated in RNA processing. Trends Biochem. Sci. 23, 15–16.
Pusnik, M., Small, I., Read, L.K., Fabbro, T., Schneider, A., 2007. Pentatricopeptide repeat proteins in Trypanosoma brucei function in mitochondrial ribosomes. Mol. Cell. Biol. 27, 6876–6888.
Rivas, E., Eddy, S.R., 2001. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinf. 2 (1), 8.
Rivals, E., Bruyère, C., Toffano-Nioche, C., Lecharny, A., 2006. Formation of the Arabidopsis pentatricopeptide repeat family. Plant Physiol. 141, 825–839.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R., Haussler, D., 1994. Stochastic context-free grammars for tRNA. Nucleic Acids Res. 22, 5112–5120.
Scheufler, C., Brinker, A., Bourenkov, G., Pegoraro, S., Moroder, L., Bartunik, H., Hartl, F.U., Moarefi, I., 2000. Structure of TPR domain-peptide complexes: Critical elements in the assembly of the Hsp70-Hsp90 multichaperone machine. Cell 101, 199–210.
Schmidler, S., Liu, J., Brutlag, D., 2000. Bayesian segmentation of protein secondary structure. J. Comput. Biol. 7, 233–248.
Schultz, J., Milpetz, F., Bork, P., Ponting, C.P., 1998. SMART, a simple modular architecture research tool: Identification of signaling domains. Proc. Natl. Acad. Sci. USA 95, 5857–5864.
Segal, E., Taskar, B., Gasch, A., Friedman, N., Koller, D., 2001. Rich probabilistic models for gene expression. Bioinformatics 1, 1–10.

Servant, F., Bru, C., Carrère, S., Courcelle, E., Gouzy, J., Peyruc, D., Kahn, D., 2002. ProDom: Automated clustering of homologous domains. Brief. Bioinf. 3, 246–251.
Sigrist, C.J., Cerutti, L., de Castro, E., Langendijk-Genevaux, P.S., Bulliard, V., Bairoch, A., Hulo, N., 2010. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 38 (Database issue), D161–D166.
Small, I.D., Peeters, N., 2000. The PPR motif – a TPR-related motif prevalent in plant organellar proteins. Trends Biochem. Sci. 25, 46–47.
The Gene Ontology Consortium, 2000. Gene ontology: Tool for the unification of biology. Nature Genet. 25, 25–29.
The UniProt Consortium, 2010. The universal protein resource (UniProt) in 2010. Nucleic Acids Res. 38 (Database issue), D142–D148.
Xu, R., Supekar, K., Huang, Y., Das, A., Garber, A., 2006. Combining text classification and Hidden Markov modeling techniques for structuring randomized clinical trial abstracts. In: AMIA Annu. Symp. Proc. 2006, pp. 824–828.
Wilson, D., Pethica, R., Zhou, Y., Talbot, C., Vogel, C., Madera, M., Chothia, C., Gough, J., 2009. SUPERFAMILY – comparative genomics, datamining and sophisticated visualisation. Nucleic Acids Res. 37 (Database issue), D380–D386.


Winters-Hilt, S., 2006. Hidden Markov model variants and their application. BMC Bioinf. 7, S14.

Web servers/resources
NCBI (National Center for Biotechnology Information)/Entrez/Cn3D (All Databases).
The Pathogen Sequencing Unit – Wellcome Trust Sanger Institute – GeneDB.
TriTrypDB version 2.0.
The UniProt Consortium.
Swiss-Prot/TrEMBL.
AMIGO after GeneDB access.
SMART.
SUPERFAMILY.
TPRpred.
Arabidopsis Genome Initiative (AGI, 2000).
Pfam.
Gene Ontology.