ELSEVIER
Published by Elsevier Science on behalf of IFAC
IFAC PUBLICATIONS www.elsevier.com/locale/ifac
EFFECTS OF SEARCH PATTERN VARIATIONS IN MOTIF DISCOVERY ALGORITHM: MOTIFFINDER
William H. P. Leung, Wai Chi Tam, Bill C. H. Chang and Saman K. Halgamuge
Mechatronics Research Group, Department of Mechanical and Manufacturing Engineering, University of Melbourne, Vic 3010, Australia
Abstract: In this study, a recently proposed motif discovery algorithm is implemented with improved computational efficiency. The results show that the modified implementation saves at least 50% of the running time of the original implementation. The effects of search pattern variations in the algorithm are also evaluated using three different artificially generated data sets. The characteristics obtained can then be used to further improve the capability and accuracy of the protein motif discovery algorithm.

Keywords: Protein motif discovery, data mining, bioinformatics, sequence pattern, motif extraction.
1. INTRODUCTION

One of the many important fields in which data mining is applied is protein sequence motif recognition. Motifs are short sequences contained within the sequences of the same protein family. Identifying protein sequence motifs helps in classifying unknown sequences into their computationally predicted protein families. Every protein is composed of a unique sequence of amino acids, and although scientists now know the amino acid sequences of more than 100,000 proteins (Campbell et al., 1999), identifying protein motifs from the sequences remains a challenging task.

Various algorithms have been proposed to find protein sequence motifs (Chang and Halgamuge, 2003; Hart et al., 2000; Rigoutsos and Floratos, 1998; Smith and Smith, 1990). A protein motif extraction algorithm (Chang and Halgamuge, 2002) uses a combination of statistical and neuro-fuzzy methods (Halgamuge, 1997) to find a motif. It first identifies the most frequent short patterns that exist in all sequences. Those short patterns are then used for motif generation. Subsequently, the generated motif is optimized using a neuro-fuzzy network in order to improve the classification rate.

The protein motif extraction algorithm (MotifFinder) has previously been implemented in Java. The algorithm has been investigated with the aim of improving its speed and its performance in identifying new protein motifs.

This paper describes the optimisation of the time and space efficiency of the existing MotifFinder program. It also reports an investigation into the effect of search pattern variations in the motif generation algorithm on the accuracy of the generated motifs. The paper is organised as follows: the motif discovery algorithm is briefly presented in Section 2. Section 3 summarises the implementation of this algorithm in Java. Experimental methods and results on artificially generated data are presented in Section 4 and Section 5. Section 6 illustrates the computational improvement achieved with the new implementation. Finally, Section 7 discusses some possible future developments.
[Figure 1: flowchart of the algorithm. Stages shown: A Family of Sequences, Important Features, Motif Generation, Motif Candidates, Motif Selection, Preliminary Motif, Neuro-Fuzzy, Protein Motif.]

[Figure 2: screenshot of the MotifFinder interface, listing generated motifs with their accuracy and percentage values.]

Fig. 2. The main display of the interface.
2. ALGORITHM

The algorithm's objective is to discover motifs, or consensus patterns, within sequences belonging to the same protein family. The algorithm consists of four main steps: Sequence Preprocessing, Motif Generation, Motif Selection and Motif Optimisation (Chang and Halgamuge, 2002). An overview of the algorithm is shown in Figure 1.

Fig. 1. Overview of the protein motif discovery algorithm

2.1 Sequence Preprocessing

The frequent short patterns (important features) within a single family of sequences are selected in this step. In a general form, sequence patterns can be represented as a series of Events and Intervals (Chang and Halgamuge, 2002):

$E_1 - I_{12} - E_2 - I_{23} - \cdots - E_n$

where $E_1$ is the first event and $I_{12}$ is the interval gap between the first and second events. For example, the pattern ACDEF can be represented by the motif A-3-F.

The short patterns are included for motif generation if they satisfy a predefined threshold value. For example, if a threshold of 0.95 is selected and the short pattern A-1-E exists in at least 95% of the sequences, then the short pattern will be included in motif generation. This threshold value is called the short pattern threshold.

2.2 Motif Generation

Motif candidates are generated by "connecting" together the frequent short patterns found in the previous step. For example, if A-1-E-2-E occurs in 95% of input sequences, then the three short patterns A-1-E, E-2-E, and A-4-E must also exist in 95% or more of the input sequences. So the motif A-1-E-2-E is generated only when all three short patterns exist frequently enough, that is, above a certain threshold. Longer motifs are generated in the same way, by connecting together more short patterns.

2.3 Motif Selection

A preliminary motif (the longest motif candidate that satisfies a threshold value of classification accuracy) is selected from the group of motif candidates generated in the previous step. The threshold value is chosen to suit each application.

2.4 Motif Optimization

The preliminary motif is fuzzified and optimized using a neuro-fuzzy network in order to improve the classification rate.

[Figure 3: plot of the total number of motifs generated against the short pattern threshold; x-axis: Threshold (0.7 to 0.8).]

Fig. 3. Total number of motifs generated v short pattern threshold

[Figure 4: plot titled "Number of motifs per level vs threshold"; x-axis: Threshold.]

Fig. 4. Number of motifs per level v short pattern threshold

3. IMPLEMENTATION

The interface (see Figure 2) is implemented using Java Swing, which allows for easy integration with the existing MotifFinder source code, which is implemented in Java. The interface was developed using Java SDK 1.4.0 (www.java.sun.com). Several key components of the original program were identified as possible areas for improvement. The following modifications were made to the existing program to improve its running time and efficiency:

• Changing the way in which the data are stored within the program, which decreases the time taken in the motif generation step.
• Changing the process of checking the accuracy of the generated motifs, which reduces the time taken to check each motif.
• Changing a variable type used during the formatting of the results, which dramatically decreases the time required to display and/or save the results.
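The first of these modifications, detailed in Section 6.1, can be illustrated with a small sketch. It is written in Python rather than the Java of the original program, and it assumes only what Section 6.1 states: that the short patterns were originally held in one sequential list and are now grouped by starting letter. The function names and the sample patterns are hypothetical and are not taken from the MotifFinder source.

```python
from collections import defaultdict
from string import ascii_uppercase


def sequential_search(patterns, target):
    """Scan a flat list; return how many comparisons were made."""
    for comparisons, pattern in enumerate(patterns, start=1):
        if pattern == target:
            return comparisons
    return len(patterns)  # not found: every entry was compared


def build_index(patterns):
    """Group patterns by their first event (the starting letter)."""
    index = defaultdict(list)
    for pattern in patterns:
        index[pattern[0]].append(pattern)
    return index


def indexed_search(index, target):
    """Go to the bucket for the starting letter, then scan only that bucket."""
    return sequential_search(index.get(target[0], []), target)


# Hypothetical data: M = 10 starting letters, N = 10 short patterns per letter.
M, N = 10, 10
patterns = [f"{first}-{gap}-A" for first in ascii_uppercase[:M] for gap in range(N)]
index = build_index(patterns)

# Average number of comparisons when every stored pattern is looked up once.
avg_sequential = sum(sequential_search(patterns, p) for p in patterns) / len(patterns)
avg_indexed = sum(indexed_search(index, p) for p in patterns) / len(patterns)
print(avg_sequential, avg_indexed)
```

For M = 10 and N = 10 the two averages come out at 50.5 and 5.5 comparisons, in line with the MN/2 and N/2 entries of Table 3.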
Three new features were added to the interface to improve the performance of the program:

• The "Generated pattern threshold" command lets the user filter out all motifs with an accuracy below a user-specified minimum before the results are displayed on screen. This allows the user to quickly identify which of the generated motifs are at or above a certain level of accuracy.
• The "Max Events" command lets the user specify the maximum number of events that the generated motifs can have. This can significantly reduce both the time spent and the memory used in the motif generation step and the subsequent accuracy-checking step, as motifs with more events require more time to check.
• The "Amino Acids" command lets the user specify exactly which combination of amino acids to include for motif generation. Again, this can significantly reduce both the time spent and the memory used.

4. EXPERIMENTAL METHOD

The characteristics of the motif generation algorithm were found by running the program on a wide variety of test cases, and then analysing the results to discover the relationships between the operational variables.

The motif generation algorithm was investigated on three groups of artificially generated test data sets:

• Random data sets
• Data sets with a repeated pattern
• Data sets with mutated patterns

4.1 Random data sets

A random sequence generator was implemented and used to generate 36 artificial data sets. These are used as a control group when evaluating results from the other two types of generated data.

4.2 Data sets with single pattern

Data sets containing an identical pattern on each and every line, with no random data, were also tested. Instead of producing test cases with 100 lines of an identical pattern, one-line test cases were used for analysis. Four data sets of this type were generated.

4.3 Data sets with mutated patterns

In this experiment, the same single pattern is first randomly mutated (by deletion, insertion or substitution) and then randomly inserted into 100 sequences. Each sequence is approximately 1000 characters long and is generated by the same random sequence generator as described in Section 4.1.

5. EXPERIMENTAL RESULTS

5.1 Random data sets

The investigation first focused on random data sets (random sequences of characters), as patterns are often located within streams of random background data. From the tests on random data sets, it was found that as the short pattern threshold decreases:

• the minimum accuracy of motifs decreases
• the average accuracy of motifs decreases
• the maximum accuracy of motifs increases

As the threshold decreases, a greater number of short patterns will be considered significant; hence, a greater number of motifs will be generated (see Figure 3). However, due to the randomness of the data, many of the generated motifs are found to have an accuracy of zero since they do not exist within the input sequences. An increasingly high proportion of generated motifs will have an accuracy of zero as the threshold decreases; therefore, the average accuracy decreases as the threshold decreases. As seen in Figure 3, as the threshold decreases, the number of generated motifs increases sharply, indicating that there is an exponential relationship between the two variables.

The number of motifs per level also increases exponentially as the short pattern threshold decreases, as shown in Figure 4. A level 2 motif is defined by two events and one interval (e.g. A-1-E), and a level 3 motif is defined by three events and two intervals (e.g. A-1-E-2-E). Not only are there more short patterns as the threshold decreases, but there are also more possible ways to "connect" them together.

5.2 Data sets with single pattern

Data sets containing an identical pattern on each and every line, with no random data, were also tested. Instead of producing test cases with 100 lines of an identical pattern, one-line test cases were used for analysis. Since the same short patterns would be found on each and every line, it does not matter how many lines are in the test case; the results would be the same.
From Figure 5, it can be seen that there is an exponential relationship between the number of events in the pattern and the number of generated motifs. When the pattern is long, there are more short patterns; hence a larger number of motifs will be generated (and, consequently, the saved file is larger).
Table 1: Results of pattern by itself

Length of pattern | Number of motifs generated | Size of saved text file (unit = KB) | Can the results be displayed in interface
2 | 1 | < 1 (37 bytes) | Yes
3 | 4 | < 1 (152 bytes) | Yes
4 | 11 | < 1 (431 bytes) | Yes
5 | 26 | 1.02 | Yes
6 | 57 | 2.34 | Yes
7 | 120 | 5.12 | Yes
8 | 247 | 10.9 | Yes
9 | 502 | 23.1 | Yes
10 | 1013 | 48.6 | Yes
11 | 2036 | 101 | Yes
12 | 4083 | 211 | Yes
13 | 8178 | 439 | Yes
14 | 16369 | 911 | Yes
15 | 32752 | 1888 | Yes
16 | 65519 | 3904 | Out of memory error - unable to display results
17 | 131054 | Out of memory error - too many formatted results in memory | Out of memory error - too many formatted results in memory
18 | 262125 | Out of memory error - too many formatted results in memory | Out of memory error - too many formatted results in memory
19 | Out of memory error
20 | Out of memory error

Table 1 shows the number of events of the pattern (the length of the pattern), the corresponding total number of motifs generated, the size of the saved results, and whether the results can be displayed in the display area of the interface. Figure 5 shows the relationship between the length of the pattern and the number of motifs generated.
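One way to read the counts in Table 1 is sketched below in Python (the original program is in Java). The sketch assumes that, for a single pattern, one motif is obtained for every choice of at least two of its events, with each interval equal to the number of characters skipped between consecutive chosen events. That reading is an assumption rather than something stated in the source, and `motifs_of` is a hypothetical helper, not part of MotifFinder.

```python
from itertools import combinations


def motifs_of(pattern):
    """List every event-interval motif with at least two events in `pattern`.

    Each motif keeps a subset of the characters (events) and records the
    number of skipped characters between consecutive kept events.
    Assumes the characters of `pattern` are distinct, so no motif repeats.
    """
    motifs = []
    for size in range(2, len(pattern) + 1):
        for positions in combinations(range(len(pattern)), size):
            parts = [pattern[positions[0]]]
            for prev, cur in zip(positions, positions[1:]):
                parts.append(str(cur - prev - 1))  # characters skipped in between
                parts.append(pattern[cur])
            motifs.append("-".join(parts))
    return motifs


print(motifs_of("ACD"))
for length in range(2, 8):
    print(length, len(motifs_of("ACDEFGH"[:length])))
```

Here `motifs_of("ACD")` returns the four motifs A-0-C, A-1-D, C-0-D and A-0-C-0-D, and the counts for lengths 2 to 7 are 1, 4, 11, 26, 57 and 120, the same as the corresponding rows of Table 1.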
[Figure 5: plot titled "Number of motifs generated vs Length of pattern"; x-axis: Length of pattern (0 to 20); y-axis: number of motifs generated (0 to 350000).]

Fig. 5. Length of pattern vs. Number of motifs generated

As seen in Table 1, as the length of the pattern increases by one, the number of generated motifs approximately doubles. Let M = number of events in the pattern (length of the pattern). The relationship between the two variables is:

The number of generated motifs = $2^M$    (1)
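The tabulated counts are also consistent with a slightly sharper expression. As a hedged observation that is not stated in the source, suppose every choice of at least two of the M events of the pattern yields exactly one motif. The count is then:

```latex
\sum_{k=2}^{M} \binom{M}{k}
  = 2^{M} - \binom{M}{1} - \binom{M}{0}
  = 2^{M} - M - 1,
\qquad\text{e.g.}\quad 2^{5} - 5 - 1 = 26,
\quad 2^{16} - 16 - 1 = 65519 .
```

Both values agree with Table 1, and equation (1) keeps only the dominant term of this expression.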
For example, if M = 16, then the number of generated motifs = 65536, which is very close to the actual value of 65519.

From Table 1, it can also be seen that there are out of memory errors. There are three different reasons for these errors:

1. Out of memory error due to the interface display. The interface cannot display much more than 30,000 lines of results; hence, there is a limit on the amount of text that the Java Swing display can show.
2. Out of memory error due to the large number of results to be formatted; not much more than 65,000 lines of formatted results can be stored in memory.
3. Out of memory error due to the large number of motifs generated; not much more than 300,000 generated motifs can be stored in memory.

Note: All three out of memory errors are due to particular aspects of Java itself, and not in any way due to poor design or coding of the graphical user interface and/or the MotifFinder algorithm.

5.3 Data sets with mutated patterns

In the real world, proteins (like most biological sequences) will most likely experience some form of mutation. Hence, tests were done to determine whether the motif generation algorithm can still 'detect' the original pattern after the patterns have experienced various degrees (and different types) of mutation.
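The construction of such test data, as described in Section 4.3, might be sketched as follows. This is an illustrative Python sketch and not the authors' Java generator. The uniform sampling over the 20 standard amino acid letters, the fixed seed, and all function names are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid letters


def mutate(pattern, kind, count, rng):
    """Apply `count` random mutations of one kind to `pattern`."""
    chars = list(pattern)
    for _ in range(count):
        if kind == "substitution":
            pos = rng.randrange(len(chars))
            # Pick a different letter so the substitution always changes the pattern.
            chars[pos] = rng.choice([a for a in AMINO_ACIDS if a != chars[pos]])
        elif kind == "insertion":
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(AMINO_ACIDS))
        elif kind == "deletion":
            if not chars:
                break  # nothing left to delete
            del chars[rng.randrange(len(chars))]
        else:
            raise ValueError(f"unknown mutation type: {kind}")
    return "".join(chars)


def make_data_set(pattern, kind, count, lines=100, length=1000, seed=0):
    """Embed one freshly mutated copy of `pattern` in each random sequence."""
    rng = random.Random(seed)
    data = []
    for _ in range(lines):
        background = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        pos = rng.randrange(length + 1)
        data.append(background[:pos] + mutate(pattern, kind, count, rng) + background[pos:])
    return data


def frequency(short_pattern, data):
    """Fraction of sequences containing the short pattern (first, gap, second)."""
    first, gap, second = short_pattern
    hits = sum(
        any(seq[i] == first and seq[i + gap + 1] == second
            for i in range(len(seq) - gap - 1))
        for seq in data
    )
    return hits / len(data)


data = make_data_set("CKDLEMPVQW", "substitution", count=1)
print(frequency(("C", 0, "K"), data))  # compare against the short pattern threshold
```

A short pattern such as C-0-K would be passed on to motif generation only if the printed frequency reaches the short pattern threshold of Section 2.1.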
Three types of mutation were tested: substitutions, insertions, and deletions. Each type of mutation exhibited the same three trends:

• There is a higher chance of finding an exact motif for lower thresholds.
• As the number of mutated patterns (lines of mutated patterns) increases, the chance of finding an exact motif decreases.
• Increasing the number of character mutations within the pattern decreases the chance of finding an exact motif.

As discussed previously, a lower threshold means a greater number of short patterns will be included for motif generation, which increases the probability of generating and 'detecting' an exact motif.

A comparison of the number of successful detections for each type of mutation (using the same set of thresholds and the same number of mutated patterns) is shown in Table 2 and Figure 6. A successful detection indicates that an exact motif, e.g. "C-0-K-0-D-0-L-0-E-0-M-0-P-0-V-0-Q-0-W", is generated for the non-mutated pattern "CKDLEMPVQW".

Table 2: Comparing the number of successful detections for the three types of mutation

Type of mutation | No. of character mutations: 1 | 5 | 3
Substitution | 110 | 91 | 64
Insertion | 64 | 57 | 57
Deletion | 64 | 56 | 57

[Figure 6: plot titled "Comparison between the three types of mutation"; x-axis: No. of character mutations (0 to 6); series: substitutions, insertions, deletions.]

Fig. 6. Comparing the three types of mutation

It can be seen that:

• When the number of substitutions is small, the original pattern can still be detected (an exact motif is found).
• When the number of mutations is small, substitutions perform better than insertions and deletions.
• When the number of mutations is large, all three types of mutation perform the same.

The most important factor in the motif generation algorithm is the frequency of occurrence of the short patterns. Motifs are generated from the short patterns, so if the short patterns do not occur frequently enough, then none of the generated motifs will match the original pattern.

6. COMPUTATION SPEED IMPROVEMENT

The improvement in computation speed arises from the following: data structure modification, motif accuracy checking and result formatting.

6.1 Modification of the data structure for basemotifs

The basemotifs vector contains all the significant short patterns. During the motif generation step, in order to generate motifs, it is necessary to repeatedly search the basemotifs vector to determine whether the generated motif can exist. Although the best case number of searches is the same for both data structures, the average and worst case numbers differ considerably.

Let

N = the number of motifs starting with the same letter
M = the number of different starting letters

Table 3 shows the general case for the number of searches required to find a particular pattern. Notice that if M = N, then the average and worst case number of searches for the original sequential storage would be $N^2/2$ and $N^2$ respectively, which compares poorly with the modified version.

Table 3: Number of searches required for two types of data structures

Case | Original (sequential storage) | Modified (indexed storage)
Best case | 1 | 1
Average | MN/2 | N/2
Worst case | MN | N

For example, let N = 10 and M = 10. Although the best cases are the same, the average for the modified storage (5 searches) is much better than the average for the original (50 searches).

Hence, it is clear that the modified indexed version offers a significant improvement in average searching time, thus increasing the efficiency of the MotifFinder program.

6.2 Modification of the process for checking motif accuracy
To calculate the accuracy, every generated motif is checked to see whether it exists in each line of input. An extra condition was added to this check: when it determines that the motif being checked cannot be in the particular line, execution exits the inner loop and immediately proceeds to check the next motif, thus saving considerable time. With this extra condition, checking the accuracy of all the generated motifs was on average 43% faster than before.

6.3 Modification of variable type used in results formatting process

To display or save the results, the results are formatted such that each generated motif is displayed on a new line, one after the other. Each line (generated motif) is formatted one at a time and appended to the end of a temporary variable that stores the previously formatted lines. Since the temporary variable is constantly being updated with newly formatted motifs, the variable type StringBuffer is more efficient: a StringBuffer is mutable and is intended for dynamic data, whereas a String is immutable and is intended for static data. All test cases, upon successful completion of the run command, were able to display the results on screen or save the results to file in under 5 seconds. Before the modification, the time taken to display the results increased exponentially with respect to a linear increase in the number of generated motifs (e.g. for 32752 motifs, it took 20 minutes to display the results before the modification and only 3 seconds after it, a reduction in time of over 99%).

7. CONCLUSION

In this investigation, a graphical user interface was successfully constructed for the existing MotifFinder program, making it more convenient and easier to find new motifs. Improvements were made to the existing MotifFinder program, which resulted in significant and noticeable time savings in the generation of motifs. The experimental analysis of the motif generation algorithm revealed several major characteristics of the algorithm, and with this knowledge, additional functions for improving the performance of the algorithm were incorporated into the interface. For future research, the existing MotifFinder program, along with the graphical user interface, can be extended to include the final step of neuro-fuzzy optimisation.

REFERENCES

Chang, B.C.H. and Halgamuge, S. (2002) Protein Motif Extraction with Neuro-Fuzzy Optimisation. Bioinformatics, 18 (8): 1084-1090.
Chang, B.C.H. and Halgamuge, S. (2003) Approximate Symbolic Pattern Matching for Protein Sequence Data. International Journal of Approximate Reasoning, Vol 32, Feb 2003.
Campbell, N., Reece, J. and Mitchell, L. (1999) Biology, pp. 68-76.
Halgamuge, S. K. (1997) Self-evolving neural networks for rule based data processing. IEEE Transactions on Signal Processing, 45 (11): 2766-2773.
Hart, R., Royyuru, A., Stolovitzky, S. and Califano, A. (2000) Systematic and automated discovery of patterns in PROSITE families. In RECOMB 2000, p. 147-154, Tokyo, Japan.
Rigoutsos, I. and Floratos, A. (1998) Motif discovery without alignment or enumeration. In RECOMB 98, p. 221-227, New York, USA.
Smith, R. and Smith, T. (1990) Automatic Generation of Primary Sequence Patterns from Sets of Related Protein Sequences. Nucleic Acid Research, 118-122.