An artificial intelligence approach to motif discovery in protein sequences: Application to steroid dehydrogenases


J. Steroid Biochem. Molec. Biol. Vol. 62, No. 1, pp. 29-44, 1997. © 1997 Elsevier Science Ltd. All rights reserved. Printed in Great Britain. PII: S0960-0760(97)00013-7 0960-0760/97 $17.00 + 0.00

Pergamon

Timothy L. Bailey,1 Michael E. Baker2 and Charles P. Elkan1*

1Department of Computer Science and Engineering, 0114 University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, U.S.A. and 2Department of Medicine, 062313 University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, U.S.A.

MEME (Multiple Expectation-maximization for Motif Elicitation) is a unique new software tool that uses artificial intelligence techniques to discover motifs shared by a set of protein sequences in a fully automated manner. This paper is the first detailed study of the use of MEME to analyse a large, biologically relevant set of sequences, and to evaluate the sensitivity and accuracy of MEME in identifying structurally important motifs. For this purpose, we chose the short-chain alcohol dehydrogenase superfamily because it is large and phylogenetically diverse, providing a test of how well MEME can work on sequences with low amino acid similarity. Moreover, this dataset contains enzymes of biological importance, and because several enzymes have known X-ray crystallographic structures, we can test the usefulness of MEME for structural analysis. The first six motifs from MEME map onto structurally important α-helices and β-strands on Streptomyces hydrogenans 20β-hydroxysteroid dehydrogenase. We also describe MAST (Motif Alignment Search Tool), which conveniently uses output from MEME for searching databases such as SWISS-PROT and Genpept. MAST provides statistical measures that permit a rigorous evaluation of the significance of database searches with individual motifs or groups of motifs. A database search of Genpept90 by MAST with the log-odds matrix of the first six motifs obtained from MEME yields a bimodal output, demonstrating the selectivity of MAST. We show for the first time, using primary sequence analysis, that bacterial sugar epimerases are homologs of short-chain dehydrogenases. MEME and MAST will be increasingly useful as genome sequencing provides large datasets of phylogenetically divergent sequences of biomedical interest. © 1997 Elsevier Science Ltd.


INTRODUCTION

The past few years have seen an explosive increase in the number of DNA and protein sequences in the available databases, with sequences accumulating at a pace that places a strain on the present methods of extracting significant biological information from the sequences. The future promises even more of an information overload, as the human genome project accelerates and other genome sequencing projects come on-line. A consequence of this explosion in the sequence databases is that there is much interest and effort in developing tools that will efficiently and automatically extract the relevant biological information in the sequences and make it available for use in biology and medicine. In this paper, we describe one such method that we have developed which is based on algorithms from artificial intelligence research. We call this software tool MEME (Multiple Expectation-maximization for Motif Elicitation) [1]. It has the attractive property that it is a discovery tool that can objectively identify motifs, such as regulatory sites on DNA and functional domains in proteins, from large or small groups of unaligned sequences. As will be seen, the motifs are a rich source of information about the dataset. MEME gives a scoring matrix that represents each motif and which can be used to search databases for homologs, identify protein subfamilies that contain one or more motifs, determine if an unknown protein belongs to a family, and provide information for mutagenesis studies to elucidate structure and function in the protein family as well as its evolution.

*Correspondence to C. P. Elkan. e-mail: [email protected]. Received 15 Jul. 1996; accepted 17 Jan. 1997.

MEME, an unsupervised learning tool

Learning tools are used to extract biological information, such as motifs, from DNA and protein sequences. Learning tools can be divided into supervised and unsupervised learning tools. A supervised learning tool takes as input a set of sequences and discovers a pattern that all the sequences share. Supervised learning is often done by humans rather than by software, because it is an open-ended problem that is harder than database searching. Examples are creating profiles [2] and PROSITE signatures [3]. Although very useful, both require multiple alignment and extensive screening of the sequences by humans, which introduces subjectivity in determining motif boundaries, and in determining which sequences actually contain which motifs. Moreover, multiple alignments of large and diverse datasets require gaps and insertions in the alignment that obscure some patterns. These methods also do not give quantitative measures of the importance of different motifs for a protein superfamily, which can be useful in structural analyses and in mutagenesis studies.

In contrast, MEME, an unsupervised learning tool, conveniently provides an unbiased quantitative analysis of motifs in large datasets, without the difficulties of assigning gaps and insertions. Moreover, MEME output is easily ported into MAST (Motif Alignment Search Tool) for a sensitive search of databases, such as SWISS-PROT and Genpept, that provides a statistical analysis of the output for identifying homologs. By reducing the human processing of data, more data can be analysed in less time, an important advantage for genome projects. Moreover, motifs of biological importance can be identified that may escape analysis by humans using traditional approaches.

SHORT-CHAIN ALCOHOL DEHYDROGENASES

This paper is the first detailed study of the use of MEME and MAST to analyse a large biologically relevant example and to identify structurally important motifs. We performed these tests using a large dataset of oxidoreductases that belong to the short-chain alcohol dehydrogenase family [4-8], which includes mammalian 11β- and 17β-hydroxysteroid dehydrogenases. These enzymes are also called sec-alcohol dehydrogenases [9], a functional definition based on their preference for substrates with secondary alcohols.

We chose this enzyme family to test the practical value of MEME and MAST for several reasons. First, this is a large enzyme family, containing over 250 sequences in Genbank. Second, this is a functionally and phylogenetically diverse family, with examples in bacteria, plants and animals. As expected from such diversity, many pairwise sequence comparisons reveal less than 22% identity after adding gaps to the alignment [4, 5]. We thus have a large set of distantly related enzymes that can test MEME's efficiency and accuracy in identifying motifs characteristic of all or a subset of the enzyme family, and the sensitivity of MAST for searching the database with these motifs to identify more distantly related homologs, an important application of motif analysis. Third, because the three-dimensional structure of four members of the family (Streptomyces hydrogenans 20β-hydroxysteroid dehydrogenase [10], rat dihydropteridine reductase [11], human 17β-hydroxysteroid dehydrogenase-type 1 [12] and plant enoyl acyl carrier protein reductase [13]) has been determined, we can learn if the discovered motifs correlate with secondary and tertiary structure.

The short-chain alcohol dehydrogenases were also chosen because they perform a variety of functions that are of widespread biological and medical interest. Some enzymes metabolize sugars for use as an energy and a carbon source; others are important in antibiotic synthesis by soil bacteria [14], fatty acid synthesis and degradation in animals, plants and bacteria, chlorophyll synthesis in plants [9], resistance of the protozoan parasite Leishmania to methotrexate [15, 16], the control of Mycobacterium tuberculosis by isoniazid and ethionamide [17], and the regulation of concentrations of glucocorticoids, estrogens, androgens, and prostaglandins E2 and F2α in humans [4]. The differential conservation of the motifs among different short-chain alcohol dehydrogenases may thus yield useful biological information, such as the structural basis in the 11β- and 17β-hydroxysteroid dehydrogenases for their preference for catalysing oxidation vs reduction of steroids, as well as insights into their evolution.

METHODS

The MEME [1] software tool takes a collection of biosequences and produces a set of motifs describing the collection. The sequences can be members of a protein family (such as the families in the PROSITE database [3]) or they can be DNA sequences known or suspected to contain common binding sites or other patterns. A motif can be thought of as a generalized consensus subsequence without gaps. MEME is written in the C programming language and is available for free public use through http://www.sdsc.edu/meme. The software is also available by FTP.

MEME discovers motifs using a statistical algorithm called expectation-maximization (EM) [18] to fit a statistical model to its input sequences. For each motif, MEME maximizes a likelihood function [19] that balances width, crispness and coverage, i.e. the number of sequences in the dataset that contain matches to the motif. For a given width, a motif which closely matches positions in each of the sequences is more statistically significant than one which matches fewer sequences equally well. By maximizing likelihood, MEME decides if the motif occurs in all or only a subset of the sequences. If the sequences are believed to contain repeats of a single pattern, MEME can use this information by changing its statistical model under the control of the user. The default is to assume that each pattern occurs no more than once in each sequence in the dataset.

To compare motifs of different widths, MEME maximizes a heuristic function based on a standard likelihood ratio test [20]. For each width that it tries, MEME repeatedly executes the EM algorithm from different starting points, and chooses the final motif of the given width that maximizes the likelihood function. In this way, MEME ensures that the motifs found are likely to be the most statistically significant ones present in the group of sequences.

After one motif has been found, MEME continues examining the dataset for more motifs up to the number requested by the user. MEME avoids finding overlapping motifs by including the motifs it has found so far in its statistical model for the next motif it finds. Because MEME finds multiple motifs sequentially, patterns which contain gaps can still be discovered; MEME splits these into separate, ungapped motifs.

For each motif, MEME reports:
• the probability matrix for each residue at each position in the motif;
• the most likely location of the motif in each sequence in the dataset;
• a plot of the information content at each position of the motif;
• a consensus sequence summary of the motif; and
• a position-dependent scoring matrix.

The probability matrix gives the expected frequency of each amino acid at each position of the motif, whereas the scoring matrix is a log-odds matrix that also takes into account the background probability with which each amino acid appears outside the motif. The scoring matrix is used in searching for matches to the motif.
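The fitting step and the two matrices described above can be illustrated with a minimal Python sketch. It is not MEME's implementation: it assumes exactly one motif occurrence per sequence, a single user-supplied starting point (`seed`), a fixed width, and a background estimated from the whole dataset. The function names, the pseudocount and the base-2 logarithm are all choices made for this illustration, and the four toy sequences are invented.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def background(seqs):
    """Residue frequencies over the whole dataset (add-one smoothed)."""
    counts = {a: 1.0 for a in ALPHABET}
    for s in seqs:
        for c in s:
            counts[c] += 1.0
    total = sum(counts.values())
    return {a: counts[a] / total for a in ALPHABET}


def em_motif(seqs, seed, n_iter=30, pseudo=0.1):
    """Fit one ungapped motif of width len(seed) by EM.

    Assumption of this sketch: every sequence contains exactly one
    occurrence and is at least as long as the motif.
    """
    width, bg = len(seed), background(seqs)
    # Starting point: the seed residue gets half the probability mass.
    theta = [{a: 0.5 if a == seed[j] else 0.5 / (len(ALPHABET) - 1)
              for a in ALPHABET} for j in range(width)]
    for _ in range(n_iter):
        counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
        for s in seqs:
            starts = range(len(s) - width + 1)
            # E-step: weight of each start position, motif vs background.
            ratio = [math.prod(theta[j][s[i + j]] / bg[s[i + j]]
                               for j in range(width)) for i in starts]
            norm = sum(ratio)
            for i in starts:
                for j in range(width):
                    counts[j][s[i + j]] += ratio[i] / norm
        # M-step: re-estimate the probability matrix from expected counts.
        theta = [{a: col[a] / sum(col.values()) for a in ALPHABET}
                 for col in counts]
    return theta, bg


def log_odds(theta, bg):
    """Scoring matrix: log2(motif probability / background probability)."""
    return [{a: math.log2(col[a] / bg[a]) for a in ALPHABET} for col in theta]


# Toy data (invented): each sequence carries one copy of "GAASG".
seqs = ["MKTLLVGAASGQW", "GAASGPPYDHEKR", "QNDEGAASGTVIL", "HHWYPGAASGCMF"]
theta, bg = em_motif(seqs, seed="GAASG")
consensus = "".join(max(col, key=col.get) for col in theta)
scores = log_odds(theta, bg)
print(consensus)
```

In this sketch `theta` plays the role of the probability matrix and `scores` the role of the position-dependent scoring matrix; trying several widths and starting points, as the text describes, would wrap `em_motif` in an outer loop.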

The MAST tool for searching databases with motifs

The MAST software takes as input position-dependent scoring matrices (log-odds matrices) and searches a sequence database (such as SWISS-PROT) for sequences which match one or more of the motifs. For each motif, the subsequence beginning at each position in every sequence in the database is scored by summing one value from each column of the scoring matrix. Which values are chosen is determined by the letters in the subsequence being scored. The score of a position in a sequence gives a measure of the degree of match of the motif to that position. MAST reports the maximum score for each motif for each sequence in the database being searched. If more than one motif is found in the same sequence, MAST reports the sum of the single best score for each motif for that sequence. This score is called the "MAXSUM" score. Because it combines measures of similarity between a sequence and several motifs, it tends to detect distant similarities more selectively than individual motif scores.

Because a longer random sequence has more chance of achieving a high MAXSUM score than a short random sequence, it is necessary to normalize the MAXSUM scores with respect to the length of the sequence under investigation. MAST does this by fitting a curve which plots sequence length vs the average MAXSUM score for random sequences of that length. MAST also fits a curve plotting sequence length vs the standard deviation of MAXSUM scores for random sequences. These functions are then used to convert the MAXSUM score into a z-score according to the equation

z-score = (MAXSUM - MEAN(L)) / SD(L)

where L is the length of the sequence, MAXSUM is its raw score and MEAN(L), SD(L) are the predicted mean and standard deviation of MAXSUM scores for sequences of length L. The curve fitting is done twice using the actual sequences in the database being searched. The second time it is done, sequences whose z-scores exceed five the first time are removed from the calculations so that only non-family-member sequences are used in estimating the score mean and standard deviation curves for "random" sequences. This normalization process is described in more detail in Bailey and Gribskov (manuscript in preparation).
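The scoring and normalization steps above can be sketched in a few lines of Python. This is an illustration, not MAST's code: the scoring matrices are toy values over a two-letter alphabet, and `mean_of_len` and `sd_of_len` are hypothetical stand-ins for the fitted curves MEAN(L) and SD(L), whose functional form the text does not give.

```python
def best_score(seq, matrix):
    """Maximum motif score over all start positions in seq.

    matrix is a list of columns, each a dict mapping residue -> log-odds
    value. Assumes seq is at least as long as the motif.
    """
    width = len(matrix)
    return max(sum(matrix[j][seq[i + j]] for j in range(width))
               for i in range(len(seq) - width + 1))


def maxsum(seq, matrices):
    """Sum of the single best score for each motif (the MAXSUM score)."""
    return sum(best_score(seq, m) for m in matrices)


def z_score(seq, matrices, mean_of_len, sd_of_len):
    """z-score = (MAXSUM - MEAN(L)) / SD(L) with L = len(seq)."""
    length = len(seq)
    return (maxsum(seq, matrices) - mean_of_len(length)) / sd_of_len(length)


# Toy scoring matrices (invented values) over the letters A and G.
motif_ga = [{"A": -1, "G": 2}, {"A": 2, "G": -1}]   # favours "GA"
motif_aa = [{"A": 2, "G": -1}, {"A": 2, "G": -1}]   # favours "AA"

seq = "GGAAG"
raw = maxsum(seq, [motif_ga, motif_aa])
# Hypothetical linear curves standing in for the fitted MEAN(L) and SD(L).
z = z_score(seq, [motif_ga, motif_aa],
            mean_of_len=lambda n: 0.5 * n,
            sd_of_len=lambda n: 1.0 + 0.1 * n)
print(raw, round(z, 3))
```

The two-pass curve fitting (dropping sequences with z-scores above five before refitting) is not shown; it would supply the two callables used here.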
MAST provides several forms of informative output.
• First, it generates a histogram of z-scores for the entire database searched. This gives a graphical indication of the sensitivity and selectivity of the motif(s) used in the search.
• Second, the sequences are sorted by z-score and the names, database descriptions and scores of high-scoring sequences are displayed.
• Third, MAST shows each high-scoring sequence in its entirety with the positions and scores of matches indicated above the actual sequence. (Matches are defined as subsequences whose score is greater than a MEME- or user-determined threshold.)
• Last, but not least, each high-scoring sequence is abstracted into a diagram showing just the motif matches and their spacing. Because the order and spacing of motifs is important, especially for distant homologs, these diagrams make it easier for the user to discriminate real homologs from chance matches.
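As a rough illustration of the last two output forms, the sketch below finds matches (windows scoring above a threshold) and abstracts a sequence into spacer lengths and motif numbers. The bracketed text format, the function names and the toy matrices are assumptions of this example and are not taken from MAST's actual output; overlapping matches are not handled.

```python
def find_matches(seq, matrices, threshold):
    """Return sorted (start, motif_number, width) for windows scoring
    above threshold. Motifs are numbered from 1 in the order given."""
    hits = []
    for number, matrix in enumerate(matrices, start=1):
        width = len(matrix)
        for i in range(len(seq) - width + 1):
            score = sum(matrix[j][seq[i + j]] for j in range(width))
            if score > threshold:
                hits.append((i, number, width))
    return sorted(hits)


def diagram(seq_len, hits):
    """Abstract a sequence into spacer lengths and bracketed motif numbers.

    Assumes the hits do not overlap (a simplification of this sketch).
    """
    parts, cursor = [], 0
    for start, number, width in hits:
        parts.append(str(start - cursor))   # residues before this match
        parts.append(f"[{number}]")
        cursor = start + width
    parts.append(str(seq_len - cursor))     # residues after the last match
    return "-".join(parts)


# Toy scoring matrices (invented values) over the letters A, C and G.
MOTIF_1 = [{"A": -1, "C": -1, "G": 2}, {"A": 2, "C": -1, "G": -1}]  # "GA"
MOTIF_2 = [{"A": -1, "C": 2, "G": -1}, {"A": -1, "C": 2, "G": -1}]  # "CC"

seq = "AGAACCAAGA"
hits = find_matches(seq, [MOTIF_1, MOTIF_2], threshold=3)
print(diagram(len(seq), hits))
```

Reading such a string left to right gives the order and spacing of the motifs, which is the information the text singles out as useful for judging distant homologs.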

RESULTS

Analysis of short-chain alcohol dehydrogenases

We constructed a dataset for analysis by MEME by using the FASTA program to search version 28 of the SWISS-PROT database with Escherichia coli sorbitol-6-phosphate dehydrogenase, Klebsiella ribitol dehydrogenase, and Candida albicans multifunctional enzyme [21]. These three divergent sequences identified 27 non-redundant sequences that in most cases have less than 30% identity with other sequences in the set, after adjustment with gaps and insertions; many alignments have 20% or fewer identical residues [4, 5, 22]. We included C. albicans multifunctional enzyme because it contains two copies of the dehydrogenase gene, in tandem. We were interested in how MEME would handle this input, and also if there were differences in motifs between the duplicated genes. The SWISS-PROT identifiers and the title lines for the sequences in the training dataset are shown in Table 1.

Figure 1 shows the first six motifs as identified by MEME in the dehydrogenase dataset. The consensus sequence summary shows the sites where an individual residue appears with at least 20% probability in an instance of a motif, with the highest probability amino acid first and lower probability amino acids below. The information content (IC) on the ordinate is a measure of the specificity of the amino acid(s) in the consensus sequence.
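The consensus summary and the information content described above can both be computed from a probability matrix, as in the following sketch. The 20% cutoff comes from the text; defining information content as the relative entropy of a column against the background (in bits) is an assumption of this example, as are the function names, the "x" placeholder for positions with no residue above the cutoff, and the toy columns.

```python
import math


def consensus_summary(theta, cutoff=0.2):
    """For each motif position, the residues with probability >= cutoff,
    most probable first; 'x' marks a position with no such residue."""
    rows = []
    for col in theta:
        keep = sorted((a for a in col if col[a] >= cutoff),
                      key=lambda a: -col[a])
        rows.append("".join(keep) or "x")
    return rows


def information_content(col, bg):
    """Relative entropy (bits) of one motif column against the background."""
    return sum(p * math.log2(p / bg[a]) for a, p in col.items() if p > 0)


# Toy three-position motif over seven residues (invented probabilities).
theta = [
    {"G": 0.9, "A": 0.1},
    {"A": 0.5, "S": 0.3, "G": 0.2},
    {"A": 0.15, "S": 0.15, "G": 0.15, "T": 0.15,
     "V": 0.15, "L": 0.15, "K": 0.1},
]
bg = {a: 1.0 / 7.0 for a in "AGSTVLK"}   # uniform toy background

print(consensus_summary(theta))
print([round(information_content(col, bg), 2) for col in theta])
```

Under this definition a position that is nearly fixed scores high and a position that resembles the background scores close to zero, which matches the reading of IC as a measure of specificity.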

MEME motifs map onto structural units

The three-dimensional structure of S. hydrogenans 20β-hydroxysteroid dehydrogenase, a member of the dataset, has been determined by X-ray crystallography [10]. Figure 2 shows the mapping of the six MEME motifs onto the primary sequence and secondary structure of S. hydrogenans 20β-hydroxysteroid dehydrogenase. The assignments of the β-strands and α-helices on this enzyme are similar to those in human 17β-hydroxysteroid dehydrogenase-type 1 [12], although the two proteins have less than 23% sequence identity. Moreover, the secondary and tertiary structures of these two proteins are similar to those of rat dihydropteridine reductase [11] and plant enoyl acyl carrier reductase [13], which have less than 20% sequence identity to each other and the other two proteins. This reflects the better conservation of tertiary structure compared to primary structure [23]. It indicates that the motifs for other members of the dataset that do not have solved three-dimensional structures will map onto β-strands and α-helices in the same way as shown in Fig. 2. We discuss the output of the first two motifs in detail as an example of how MEME can be used to uncover functional information from a diverse dataset.

Table 1. Training set for MEME analysis of motifs in short-chain alcohol dehydrogenases

SWISS-PROT code   Protein description
2BHD_STREX        20-β-Hydroxysteroid dehydrogenase (EC 1.1.1.53)
3BHD_COMTE        3-β-Hydroxysteroid dehydrogenase (EC 1.1.1.51)
ACT3_STRCO        Putative ketoacyl reductase (EC 1.3.1.-)
ADH_DROME         Alcohol dehydrogenase 2 (EC 1.1.1.1)
AP27_MOUSE        Adipocyte P27 protein (AP27)
BA72_EUBSP        7-α-Hydroxysteroid dehydrogenase (EC 1.1.1.159)
BDH_HUMAN         D-β-Hydroxybutyrate dehydrogenase precursor (EC 1.1.1.30)
BEND_ACICA        Cis-1,2-dihydroxy-3,4-cyclohexadiene-1-carboxylate dehydrogenase
BPHD_PSEPS        Biphenyl-cis-diol dehydrogenase (EC 1.3.1.-)
BUDL_KLETE        Acetoin(diacetyl) reductase (EC 1.1.1.5) (acetoin dehydrogenase)
DHES_HUMAN        Estradiol 17-β-dehydrogenase (EC 1.1.1.62)
DHGB_BACME        Glucose 1-dehydrogenase B (EC 1.1.1.47)
DHI1_HUMAN        Corticosteroid 11-β-dehydrogenase (EC 1.1.1.146)
DHMA_FLAS1        N-acylmannosamine 1-dehydrogenase (EC 1.1.1.233) (NAM-DH)
ENTA_ECOLI        2,3-Dihydro-2,3-dihydroxybenzoate dehydrogenase (EC 1.3.1.28)
FABG_ECOLI        3-Oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)
FIXR_BRAJA        FIXR protein
GUTD_ECOLI        Sorbitol-6-phosphate 2-dehydrogenase (EC 1.1.1.140)
HDE_CANTR         Hydratase-dehydrogenase-epimerase (HDE)
HDHA_ECOLI        7-α-Hydroxysteroid dehydrogenase (EC 1.1.1.159) (HSDH)
LIGD_PSEPA        C α-dehydrogenase (EC -.-.-.-)
NODG_RHIME        Nodulation protein G (host-specificity of nodulation protein C)
PGDH_HUMAN        15-Hydroxyprostaglandin dehydrogenase (NAD(+)) (EC 1.1.1.141)
PHBB_ZOORA        Acetoacetyl-CoA reductase (EC 1.1.1.36)
RIDH_KLEAE        Ribitol 2-dehydrogenase (EC 1.1.1.56) (RDH)
YINL_LISMO        Hypothetical 26.8 kD protein in INLA 5'region (ORFA)
YRTP_BACSU        Hypothetical 25.3 kD protein in RTP 5'region (ORF238)

CSGA_MYXXA        C-factor
DHB2_HUMAN        Estradiol 17-β-dehydrogenase 2 (EC 1.1.1.62) (17-β-HSD 2)
DHB3_HUMAN        Estradiol 17-β-dehydrogenase 3 (EC 1.1.1.62) (17-β-HSD 3)
DHCA_HUMAN        Carbonyl reductase (NADPH) (EC 1.1.1.184)
FABI_ECOLI        Enoyl-[acyl-carrier-protein] reductase (NADH) (EC 1.3.1.9)
FVT1_HUMAN        Follicular variant translocation protein 1 precursor (FVT-1)
HMTR_LEIMA        H region methotrexate resistance protein (EC 1.-.-.-)
MAS1_AGRRA        Agropine synthesis reductase
PCR_PEA           Protochlorophyllide reductase precursor (EC 1.3.1.33)
RFBB_NEIGO        DTDP-glucose 4,6-dehydratase (EC 4.2.1.46)
YURA_MYXXA        Hypothetical protein in URAA 5'region (fragment)

The first column is the SWISS-PROT code; the second column is the protein description. The 11 sequences at the bottom are used in the bootstrapping analysis.

[Figure 1: for each of motifs 1-6, a plot of information content (ordinate 0.0 to 6.8), the consensus sequence summary, and the one-digit probability matrix over the amino acids. The plots and matrices are not recoverable from the extracted text. First lines of the consensus summaries: motif 1, KVALVTGAASGIG; motif 2, YSASKAA; motif 3, GRVDVLVNNAG; motif 4, GRIINISS; motif 5, IRVNAVAPGxIxTDM; motif 6, WDRVIEVNLTGVFNGTR.]

Fig. 1. Motifs from MEME analysis of short-chain alcohol dehydrogenases. The entropy plot is a measure of the information content at each position of the motif. The consensus sequence below the entropy plot shows sites where specific amino acids are present with a probability of at least 20%. Probabilities are rounded to one digit. A residue with a probability between 15% and 24% will thus be listed as 20% in the matrix. Probabilities over 95% are indicated by "a", meaning almost always. The complete output is available from the authors.
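The one-digit convention stated in the caption of Fig. 1 (probabilities rounded to one digit, with "a" for values over 95%) can be written out as a small Python sketch. The function names are hypothetical, and printing ":" for probabilities that round to zero is an assumption about the printed matrices rather than something the caption states.

```python
def encode_probability(p):
    """One-character code for a probability between 0 and 1.

    'a' for values of 0.95 and above ("almost always"), otherwise the
    probability rounded to the nearest tenth, so 0.15-0.24 gives '2'.
    ':' for values that round to zero is an assumption of this sketch.
    """
    if p >= 0.95:
        return "a"
    digit = int(p * 10 + 0.5)
    return str(digit) if digit > 0 else ":"


def encode_row(probabilities):
    """Encode one residue's probabilities across all motif positions."""
    return "".join(encode_probability(p) for p in probabilities)


# Invented probabilities for one residue across a five-position motif.
print(encode_row([0.01, 0.97, 0.22, 0.78, 0.16]))
```

The coarse rounding is what lets a full 20-row matrix fit in one column of print, at the cost of hiding differences smaller than about ten percentage points.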

[Figure 2: the primary sequence of S. hydrogenans 20β-hydroxysteroid dehydrogenase, annotated with its β-strands and α-helices and with the positions of MEME motifs 1-6 marked below the sequence. The residue-level layout is not recoverable from the extracted text.]

Fig. 2. Alignment of MEME motifs on Streptomyces hydrogenans 20β-hydroxysteroid dehydrogenase. Each motif as determined by MEME from the input sequences is shown below the sequence of S. hydrogenans 20β-hydroxysteroid dehydrogenase. An X denotes positions added to the motif by the bootstrapping analysis. The secondary structure is given as determined from X-ray analysis of crystals of S. hydrogenans 20β-hydroxysteroid dehydrogenase [10] and rat dihydropteridine reductase [11].

Motif 1

Motif 1 is 13 residues long, overlapping with β-strand A, the following turn and the beginning of α-helix B. This segment is part of a canonical βαβ fold that is the AMP binding domain in this and other oxidoreductase superfamilies [23-25]. Motif 1 contains the Gly-X-X-X-Gly-X-Gly signature motif, which is similar to but not identical to a signature for NAD(P)(H) binding proteins; this signature is widely used to locate the nucleotide binding domain on dehydrogenases. The three glycines are important for close association of the AMP moiety with the enzyme. All three glycines receive high scores and are marked with • in Fig. 2. One glycine is found in all sequences; the other two glycines are found in 90% of the dataset. As seen in Fig. 1, the MEME motif contains other residues with high scores. Indeed, the MEME motif consensus sequence summary for this motif is Lys-Val-[Ala/Val]-[Leu/Ile]-[Val/Ile]-Thr-Gly-[Ala/Gly]-[Ala/Ser]-[Ser/Arg]-Gly-[Ile/Leu]-Gly, which is a more informative description of this domain than the conventional Gly-X-X-X-Gly-X-Gly signature.

A MAST search of SWISS-PROT 31 with motif 1 identified 77 sequences in the output as homologs; many are open reading frames that are listed in SWISS-PROT as short-chain alcohol dehydrogenases. Interestingly, not all members of the training set have a z-score above 6.4. For example, several Drosophila alcohol dehydrogenase sequences (ADH) have a z-score of 5.4. Other ADHs have a z-score of 5.0. The inclusion of a protein in our training set is thus not sufficient in itself to guarantee a high score. This happens because a discovered motif is a global property of the training set.
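To make the comparison between the two descriptions of motif 1 concrete, the sketch below translates each into a regular expression and applies both to short test strings. This is only an illustration: an all-or-nothing pattern match is much cruder than scoring with a log-odds matrix, the helper name `summary_to_regex` is hypothetical, and the second test string is invented.

```python
import re

# Three-letter to one-letter codes for the residues that occur in motif 1.
THREE_TO_ONE = {"Ala": "A", "Arg": "R", "Gly": "G", "Ile": "I", "Leu": "L",
                "Lys": "K", "Ser": "S", "Thr": "T", "Val": "V"}


def summary_to_regex(summary):
    """Translate e.g. 'Lys-Val-[Ala/Val]' into 'KV[AV]'; 'X' matches any
    residue."""
    parts = []
    for token in summary.split("-"):
        token = token.strip()
        if token == "X":
            parts.append(".")
        elif token.startswith("["):
            options = token.strip("[]").split("/")
            parts.append("[" + "".join(THREE_TO_ONE[o] for o in options) + "]")
        else:
            parts.append(THREE_TO_ONE[token])
    return "".join(parts)


SIGNATURE = "Gly-X-X-X-Gly-X-Gly"
MEME_SUMMARY = ("Lys-Val-[Ala/Val]-[Leu/Ile]-[Val/Ile]-Thr-Gly-"
                "[Ala/Gly]-[Ala/Ser]-[Ser/Arg]-Gly-[Ile/Leu]-Gly")

signature_re = re.compile(summary_to_regex(SIGNATURE))
summary_re = re.compile(summary_to_regex(MEME_SUMMARY))

for fragment in ("KVALVTGAASGIG", "AAGAAAGAGAA"):
    print(fragment,
          bool(signature_re.search(fragment)),
          bool(summary_re.search(fragment)))
```

The first string is the top line of the motif 1 consensus and satisfies both patterns; the second satisfies only the three-glycine signature, which shows how much less specific the conventional signature is.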

Identification of homology between sugar epimerases and short-chain alcohol dehydrogenases

An impressive example of the sensitivity of MEME is identification of yeast UDP-glucose 4-epimerase, with a z-score of 7, as a member of the family. MEME found several other related epimerases with z-scores above 6.4. MEME's inclusion of these epimerases in the short-chain dehydrogenase family is in agreement with an analysis of the three-dimensional structures of UDP-glucose 4-epimerase, 20β-hydroxysteroid dehydrogenase and dihydropteridine reductase [26, 27]. Database searches with Blast or FASTA do not find this homology. This is the first time that a sequence analysis program has given statistical evidence that the UDP-epimerase protein family, which includes mammalian 3β-hydroxysteroid dehydrogenase, plant dihydroflavonol reductase and bacterial cholesterol dehydrogenase [28], is homologous to short-chain dehydrogenases.

Motif 2

Motif 2 is seven residues long and maps to α-helix F, which contains the conserved tyrosine and lysine that have been proposed to be at the catalytic site in these enzymes [29-34]. α-helix F plays an important role in the oligomeric structure of short-chain alcohol dehydrogenases, almost all of which are either dimers or tetramers. The dimer interface is a four α-helix bundle consisting of α-helices E and F from each subunit. MEME identifies several conserved residues in α-helix F that are near the tyrosine and lysine, providing a more complete description of this part of the short-chain alcohol dehydrogenases. This correlates with recent studies indicating that amino acids adjacent to the conserved tyrosine and lysine are important in stabilizing the dimer interface [35-37].

The first five residues of motif 2 are part of the PROSITE signature motif for the short-chain alcohol dehydrogenase family. Several proteins that belong to the short-chain alcohol dehydrogenase family [38] (M. tuberculosis InhA, E. coli enoyl-acyl-carrier protein reductase (EnvM), rat 2,4-dienoyl-CoA reductase and Saccharomyces cerevisiae sporulation-specific protein (SPX19)) do not contain the conserved tyrosine, which prevents identification of their ancestry with PROSITE. Others such as protochlorophyllide reductase [9] and Myxococcus xanthus C-factor [39] have residues that are not in the PROSITE signature. It is therefore not surprising that the segment corresponding to motif 2 in these proteins has z-scores below 3.5. However, MEME clearly finds that other parts of these proteins contain motifs that are representative of the short-chain alcohol dehydrogenases.

Interestingly, the two dehydrogenase domains in C. albicans multifunctional enzyme have very different z-scores for motif 2. The first dehydrogenase domain has a z-score of 2.6, whereas the second dehydrogenase domain has a z-score of 5.3. In contrast, motif 1 has scores of approximately 8.7 for each dehydrogenase domain. There has thus been substantial divergence between the two dehydrogenase domains in motif 2. This fact may be of functional importance.

MEME analysis of 11β- and 17β-hydroxysteroid dehydrogenases

As a special case study for evaluating the usefulness of MEME, we decided to analyse a group of mammalian 11β- and 17β-hydroxysteroid dehydrogenases, which are members of the short-chain alcohol dehydrogenase family that play important roles in regulating steroid hormone action [22, 40-43]. 11β-Hydroxysteroid dehydrogenases catalyse the interconversion of the inactive steroid cortisone and cortisol, the biologically active glucocorticoid. 17β-Hydroxysteroid dehydrogenases catalyse the interconversion of estradiol and estrone, as well as of testosterone and androstenedione. Although these hydroxysteroid dehydrogenases are homologs, their sequences have diverged substantially. For example, the four 17β-hydroxysteroid dehydrogenases have less than 22% sequence identity [22, 43, 44]. In fact, the sequence divergence of two 11β- and four 17β-hydroxysteroid dehydrogenase sequences was an obstacle to the cloning of the type 2 enzyme genes with cDNA probes based on the sequences of the type 1 enzymes. These hydroxysteroid dehydrogenases are of special interest because in vivo each has specificity for either oxidation or reduction of its substrates, which, combined with selective expression of these enzymes either in specific cells or at specific times, is part of a unidirectional mechanism for regulating steroid hormone action [22, 40-43]. For example, in the liver, 11β-hydroxysteroid dehydrogenase type 1 converts cortisone to cortisol; in the distal tubule of the kidney, 11β-hydroxysteroid dehydrogenase type 2 converts

Table 2. MEME motif z-scores for mammalian 11β- and 17β-hydroxysteroid dehydrogenases

SWISS-PROT code  Enzyme                                   Training set  Motif 1  Motif 2  Motif 3  Motif 4  Motif 5  Motif 6
DHB1_Human       17β-hydroxysteroid dehydrogenase type 1  original      9.2      6.6      9.5      3.2      5.5      6.6
                                                          bootstrapped  8.7      6.3      9.2      2.8      5.6      9.9
DHB2_Human       17β-hydroxysteroid dehydrogenase type 2  original      6.8      4.9      4.5      4.6      6.1      5.2
                                                          bootstrapped  7.8      4.8      5.9      5.5      7.6      9.5
DHB3_Human       17β-hydroxysteroid dehydrogenase type 3  original      6.5      4.0      4.1      4.4      2.5      5.6
                                                          bootstrapped  7.5      5.8      5.4      4.8      5.0      9.8
Human            17β-hydroxysteroid dehydrogenase type 4  original      8.1      3.7      8.8      4.5      1.9      4.8
                                                          bootstrapped  8.2      4.0      9.3      4.4      1.2      7.5
DHI1_Human       11β-hydroxysteroid dehydrogenase type 1  original      8.9      6.4      5.5      5.0      5.2      5.6
                                                          bootstrapped  8.6      6.4      5.5      5.0      4.9      8.2
DHI2_Human       11β-hydroxysteroid dehydrogenase type 2  original      7.2      4.4      6.2      4.9      4.8      5.7
                                                          bootstrapped  7.0      4.3      6.7      4.9      6.7      8.6

Human 17β-hydroxysteroid dehydrogenase type 4 is not yet in SWISS-PROT [47]. For each enzyme, the motif z-scores on the top line are for the original training set. The bootstrapped z-scores are shown in the second line. Bootstrapping increases the scores for motif 6, reflecting the additional six residues in the motif. The z-score for motif 5 also increases for the type 2 and type 3 17β-hydroxysteroid dehydrogenases and for type 2 11β-hydroxysteroid dehydrogenase.

cortisol to cortisone. Similar differences in specificity are found for 17β-hydroxysteroid dehydrogenase types 1, 2, 3 and 4 [22, 43, 44]. The MEME z-scores for the different 11β- and 17β-hydroxysteroid dehydrogenases are shown in Table 2. Only the type 1 enzymes are in the original training set. Examination of Table 2 reveals that although the dehydrogenase sequences are divergent, the differences are not distributed equally throughout the proteins. Sometimes a motif is well conserved in all enzymes; for example, motif 1. In other cases, there are noticeable differences among the enzymes. For example, motif 3 in 17β-hydroxysteroid dehydrogenase types 2 and 3 has z-scores of 4.6 and 4.4, respectively. In contrast, motif 3 in the type 1 and 4 enzymes has z-scores of 9.5 and 8.8, respectively. Another striking result is the low z-scores of motif 5 in 17β-hydroxysteroid dehydrogenase types 3 and 4, which are 2.5 and 1.9, respectively, which distinguishes them from the other steroid dehydrogenases. This kind of motif information provides insights for the more effective analysis of three-dimensional structures and construction of mutant hydroxysteroid dehydrogenases to understand their preferences for substrates and for catalysing either oxidation or reduction. It uses MEME's flexibility in identifying differences as well as similarities in a motif among a group of proteins; both kinds of information are important in understanding protein structure and function.

Identifying distant homologs with MEME motifs

One application of the motifs generated by MEME from a set of related proteins is to search databases and identify distant unrecognized homologs that would not be found in a routine BLAST or FASTA search. Another application is to examine a newly sequenced open reading frame (ORF) for high-scoring MEME-generated motifs, as is done now with PROSITE signatures, in order to identify the ORF's ancestry and functional domains. In comparison to PROSITE signatures, MEME offers the flexibility to search an ORF against several alternative motifs for a single protein family. In the short-chain alcohol dehydrogenase family, the PROSITE signature uses the

Table 3. Meta-motif analysis of short-chain alcohol dehydrogenases in the training dataset

SWISS-PROT code  z-score
ACT3_STRCO       18.5
PHBB_ZOORA       18.2
PGDH_HUMAN       17.9
2BHD_STREX       17.7
FABG_ECOLI       17.6
NODG_RHIME       17.1
BUDC_KLETE       16.8
DHGB_BACME       16.7
YOXD_BACSU       16.6
BA72_EUBSP       16.3
HDE_CANTR(b)     16.1
HDE_CANTR(a)     12.0
FIXR_BRAJA       16.1
HDHA_ECOLI       16.0
AP27_MOUSE       15.9
3BHD_COMTE       15.9
BEND_ACICA       15.3
ENTA_ECOLI       15.1
RIDH_KLEAE       14.6
DHB1_HUMAN       14.5
BPHB_PSEPS       14.3
BDH_HUMAN        14.1
YINL_LISMO       14.1
GUTQ_ECOLI       13.6
DHMA_FLAS1       13.4
DHI1_HUMAN       13.1
ADH_DROME        12.1
LIGD_PSEPA       11.7

Motifs 1-6 and their spacing (6) (2) (5) (6) (5) (6) (2) (7) (6) (6) (8) (36) (11) (7) (6) (9) (5) (14) (2) (5) (55) (5) (2) (14) (34) (6)

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

(62) (58) (64) (59) (59) (59) (62) (64) (62) (64) (58) (68) (60) (62) (54) (59) (61) (52) (59) (66) (58) (67) (62) (64) (83) (88) (83) (88)

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

(14) (14) (6) (14) (14) (14) (14) (14) (14) (14) (14) (14) (20) (13) (14) (14) (15) (14) (14) (14) (19) (14) (14) (14)

3

(6)

6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

(13) (11) (14) (11) (11) (11) (12) (12) (11) (11) (11) (11) (10) (11) (12) (10) (11) (11) (11) (31) (9) (I0) (11) (12) (31) (10) (14) (6)

4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

(12) (12) (12) (12) (12) (12) (12) (13) (12) (12) (12) (12) (13) (12) (12) (12) (10) (12) (12)

4 4 4 4

(12) (12) (12) (12)

4 4 4

(12) (12) (12)

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

(16) (16) (18) (16) (16) (16) (16) (16) (16) (16) (16)

5 5 5 5 5 5 5 5 5 5 5

(16) (16) (16) (18) (16) (16) (16)

5 5 5 5 5 5 5

(15) (16) (18)

5 5 5

(16)

5

Motif order is 1-3-6-4-2-5. Spacing between each motif is shown in parentheses. Only motifs with a z-score of at least 5 are shown. DHES_HUMAN, DHMA_FLAS1, DHI1_HUMAN, ADH_DROME, HDE_CANTR and LIGD_PSEPA have one or more motifs below this score. However, in these enzymes, the low-scoring motifs can be found at positions similar to those of the other enzymes. For example, motif 3 in DHI1_HUMAN (11β-hydroxysteroid dehydrogenase) is 63 residues from motif 1 and 14 residues from motif 6. Motif 5 is also 16 residues from motif 2. Note the conservation of spacing between motifs, indicating that motif output contains information about conservation of tertiary structure.


first five residues of motif 2 because these positions contain the highly conserved tyrosine and lysine. As we showed earlier, this motif is not the most characteristic motif for short-chain alcohol dehydrogenases. It can match some false positives. Moreover, some members of the short-chain alcohol dehydrogenase family do not conserve either the tyrosine or lysine [38]. Others, such as protochlorophyllide reductase [9] and Myxococcus xanthus C-factor [39], have residues adjacent to the tyrosine that are not in the PROSITE signature.

Using multiple motifs to improve homology analysis

An important property of the MEME motifs is that their order and spacing are conserved in the short-chain alcohol dehydrogenase superfamily. Table 3 shows this information for the original training set. Conservation of motif order and spacing is biologically reasonable because the motifs correspond to functional structures. For example, consider the spatial relationship between motif 1, where the glycine-rich turn is at the amino terminus, and motif 2, which contains the catalytically important residues. A protein with these motifs in the reverse order would not be homologous in its three-dimensional structure to the short-chain alcohol dehydrogenases. The order of the motifs is therefore a meta-motif or fingerprint for a given protein superfamily. In this way, even a motif with a low z-score that is present in the canonical order provides information that improves confidence in assigning the unknown protein to a superfamily.

Conservation of the distances between the motifs in most of the proteins in the dehydrogenase dataset, as shown in Table 3, has been noted in past multiple alignments of the first 190 residues of these enzymes [4-8], in which there are few gaps or insertions for the majority of the enzymes. As seen in Table 3, the distance between motifs is conserved even in proteins that lack a high score for a particular motif. For example, motif 4 in human 17β-hydroxysteroid dehydrogenase type 1 has a z-score of 3.2, which is below the cutoff, and thus this motif does not show up in the meta-motif diagram. Nevertheless, motif 4 is 11 residues from motif 6 and 12 residues from motif 2, distances that are similar to those in the rest of the dataset.

Analysis of distances between motifs can identify an unusual insertion or deletion in a short-chain alcohol dehydrogenase that otherwise conserves canonical motifs. Two examples are protochlorophyllide reductase and carbonyl reductase, which have an extra segment of 35 and 41 amino acids, respectively, between motifs 4 and 2. The motif diagram for protochlorophyllide reductase is < F X I > which shows its homology to other short-chain alcohol dehydrogenases (Table 3), while identifying a unique property of this enzyme that is likely to be important in its functioning.

How the MAST tool uses multiple motifs to identify homologs is shown in Table 3 in the column that lists the z-score for the search with all six motifs. A protein may have a low-scoring motif, such as motif 1 in Drosophila ADH or motif 4 in human 17β-hydroxysteroid dehydrogenase type 1, and still show its ancestry. The criterion of selection on the basis of motif order and spacing is a powerful tool for identifying true homologs. These distant homologs can then be added to the training set in a bootstrapping procedure, as described in the next section, for further searching of the database and for analysis of the structurally important features in the proteins.
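The use of inter-motif distances to spot an unusual insertion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequence names, the gap values and the tolerance of 5 residues are hypothetical and are not taken from Table 3 or from MAST.

```python
from statistics import median

# Gap (in residues) between motif 4 and motif 2 for a few hypothetical
# sequences. Names and values are illustrative, not data from Table 3.
gaps_4_to_2 = {
    "seq_a": 12,
    "seq_b": 12,
    "seq_c": 13,
    "seq_d": 12,
    "seq_with_insert": 47,
}

def flag_unusual_gaps(gaps, tolerance=5):
    """Return names whose gap differs from the median gap by more than `tolerance`."""
    typical = median(gaps.values())
    return sorted(name for name, gap in gaps.items() if abs(gap - typical) > tolerance)

unusual = flag_unusual_gaps(gaps_4_to_2)
print(unusual)
```

Using the median as the reference keeps a single long insertion from shifting the notion of a typical gap; other robust statistics would serve equally well.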

Bootstrapping

Bootstrapping means adding distantly related homologs from a search to the training set to improve the sensitivity of the motifs. This process is convenient with MEME because it is not necessary to redo separately the multiple alignment, which could be complicated as a result of the gaps and insertions needed for distantly related sequences. Moreover, if the added sequences are not members of the protein superfamily or lack one or more motifs, this does not degrade the output, because MEME ignores extraneous sequences in identifying a motif.

We were interested in how MEME would respond to the information in distantly related sequences. We therefore examined the output of the MAST tool from the training set and added 11 distantly related sequences (Table 1) to the training set. We then recomputed the first six motifs. The first four motifs changed slightly in their consensus sequence, and MEME determined that the new strongest motif was the previous motif 3. Two important changes with the additional information were the addition of a glycine residue to motif 5 and of 6 residues to motif 6 (Fig. 2). Use of the MAST program to search SWISS-PROT and sort the sequences yielded the same homologs, but their z-scores were higher because of the increased selectivity of motifs 5 and 6. This is clearly seen in the comparison of the output for the 11β- and 17β-hydroxysteroid dehydrogenases with that for the earlier training set, shown in Table 2. The score for motif 6 for all enzymes is substantially increased, which is because of the additional six residues in the motif. Motif 5 shows an increased score for most enzymes; only the type 4 enzyme is unchanged. In the first four motifs, motifs 1 and 3 of 17β-hydroxysteroid dehydrogenase type 2 and motifs 1, 2 and 3 of the type 3 enzyme have significantly increased scores.
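The shape of one bootstrapping round (discover, search, enlarge the training set, rediscover) can be sketched as follows. The two helper functions are deliberately crude stand-ins and are assumptions of this sketch: `discover_motif` just picks the k-mer shared by the most sequences and `best_mismatches` counts mismatches, so neither reproduces what MEME or MAST actually compute. The sequences are invented.

```python
from collections import Counter

def discover_motif(seqs, k=4):
    """Toy stand-in for motif discovery: the k-mer present in the most sequences."""
    counts = Counter()
    for s in seqs:
        # Count each k-mer at most once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    # Break ties alphabetically so the result is deterministic.
    return min(counts, key=lambda kmer: (-counts[kmer], kmer))

def best_mismatches(motif, seq):
    """Fewest mismatches between the motif and any window of seq."""
    k = len(motif)
    return min(
        sum(a != b for a, b in zip(motif, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )

def bootstrap(training, database, k=4, max_mismatches=1):
    """One round: find near-matches in the database, add them, rediscover."""
    motif = discover_motif(training, k)
    hits = [s for s in database
            if s not in training and best_mismatches(motif, s) <= max_mismatches]
    enlarged = training + hits
    return discover_motif(enlarged, k), enlarged

training = ["MKGASGLG", "TTGASGIG"]                  # hypothetical sequences
database = ["PPGATGQQ", "WWWWWWWW", "MKGASGLG"]      # hypothetical database
motif, enlarged = bootstrap(training, database)
print(motif, len(enlarged))
```

In this toy run the unrelated sequence is never added and the rediscovered motif is unaffected by the extra distant hit, which mirrors the qualitative point made above about extraneous sequences.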

Database searching with multiple motifs

To show the utility of using meta-motifs, we used MAST to analyse Genpept90 with the first six motifs after bootstrapping. The histogram of the output shown in Fig. 3 is bimodal, demonstrating the selectivity of MAST. All sequences at or above a z-score of 5 are homologs. Indeed, the first sequence that does not belong to the short-chain alcohol dehydrogenase family is Salmonella typhimurium flagellin, which has a z-score of 4.7. For this protein, the motif order is 4-5-2; motif 5 is 246 residues from motif 4 and 104 residues from motif 2. Therefore, both the

min      cumul   count
-9.8    130302       2
-9.2    130300       1
-8.8    130299       0
-8.2    130299       5
-7.8    130294       4
-7.2    130290       5
-6.8    130285       4
-6.2    130281       8
-5.8    130273      22
-5.2    130251      33
-4.8    130218      78
-4.2    130140     109
-3.8    130031     200
-3.2    129831     540
-2.8    129291    1249
-2.2    128042    3372
-1.8    124670    7883
-1.2    116787   15605
-0.8    101182   22631
-0.2     78551   25869
 0.2     52682   22990
 0.8     29692   15759
 1.2     13933    8243
 1.8      5690    3506
 2.2      2184    1145
 2.8      1039     347
 3.2       692     138
 3.8       554      57
 4.2       497      17
 4.8       480      11
 5.2       469       7
 5.8       462       6
 6.2       456       5
 6.8       451       1
 7.2       450      16
 7.8       434       3
 8.2       431      13
 8.8       418       7
 9.2       411      12
 9.8       399       3
10.2       396       9
10.8       387       9
11.2       378      13
11.8       365      45
12.2       320     178
12.8       142      19
13.2       123      14
13.8       109      27
14.2        82      17
14.8        65       8
15.2        57      10
15.8        47      10
16.2        37      22
16.8        15       5
17.2        10       6
17.8         4       4

motif order and spacing indicate that flagellin is not a short-chain alcohol dehydrogenase. Interestingly, human and rat dihydropteridine reductase have z-scores of 4.9 and 4.5, respectively, placing these enzymes among other distantly related short-chain dehydrogenases. Others have noted the limited sequence homology (less than 18% identities) to short-chain dehydrogenases, despite the fact that the solved X-ray crystal structure of the human and



SEQUENCE

DESCRIPTION

ZSCORE

gl 1531421gpIM195361STMA

S.coelicolor actIII gene, compl.. beta ketoacyl reductase [Strept.. acetoacetyl CoA reductase [Rhiz.. Streptomyces violaceoruber poly.. gl 3471791gplM982581SERD polyketide reductase [Saccharop.. g~ 386861gplX647721ABORF ORF3 gene product [Azospirillum.. gl 4887741gplX778651SGKS ketoreductase [Streptomyces gri.. gl 162051gplX644641ATBOR 3-oxoacyl-[acyl-carrier protein.. g~1463O41gplYoo6o41~oD Rhizobium meliloti nodulation g.. gll145o161gpl~o11121c~P 3-hydroxybutyric acid [Chromati.. gl12162701gpID900441BACG B. megaterium glucose dehydroge . g~[5767831gplL377611ACCP acetoacetyl-CoA reductase [Acin. glI1491731gplL045071KPNB acetoin(diacetyl)reductase [Kle. glI1485281gplM362921EUBB Eubacterium sp. bile acid-induc. g~I5818331gplX798631XSDN orf3 gene product [Xanthobacter. gII386831gplX529131ABNOD Azospirillum brasilense nodG ge. gl15322431gplL355601STMD daunorubicin-doxorubicin polyke. gl15371081gplU140031ECOU Escherichia coli K-12 chromosom. gl17973341gplU196201ATU1 MocC [Agrobacterium tumefaciens. gi19124371gplD104971ECOH 7alpha-hydroxysteroid dehydroge. gi16996081gplD261231MUSC carbonyl reductase [Mus musculu. giI26701gplX578541CTHDEG hydratase-dehydrogenase-epimera. gi19041991gplD45242 BACE hypothetical protein [Bacillus . gil1417501gplM76990 ACCB benD gene product [Acinetobacte. gi]1810371gplJ04056 HUMC carbonyl reductase [Homo sapien. gil1642941gplM80709 PIG2 20-beta-hydroxysteroid dehydrog. gi15221831~IM24143 ECOE entA gene product [Escherichia . gi15615331gplX78811 RNI7 estradiol 17 beta-dehydrogenase. gil1485101gpIM58473 EUB 7 7-alpha-hydroxysteroid dehydrog. gi1887607 IgplZ49939 SC99 unknown [Saccharomyces cerevisi. gi1666004 IgplD45911 BACN hypothetical protein [Bacillus . gx1349207 IgplL22883 ANAK ketoacyl reductase [Anabaena sp. gzI149673 IgplM67471 LISI Listeria monocytogenes internal. g~I151096 Igp]M83673 PSEB dioxygenase [Pseudomonas pseudo. gi1882735 Igp]uZ9581 ECU2 Esc~erichia coli K-12 genome; a. 
gi1763381 Igp]z47o47 SCCH unknown [Saccharomyces cerevisi. g~1563366 Igplxsoo19 GOGN gluconate oxidoreductase [Gluco. giI177198 ]gplM93107 HUM3 (R)-3-hydroxybutyrate dehydroge. gi1486419 ]gplZ28234 SCYK FOX2 gene product [Saccharomyce. [K. gi143939 I~IX660591KPSOR D-glucitol-6-P-Dehydrogenase gi1424160 Igplmo473 DURT tropinone reductase-I [Datura s. gi1296186 IgplX63657 HSFV FVTI gene product [Homo sapiens. gi1393184 IgplL20621 MZET alcohol dehydrogenase [Zea mays. gi1458714 IgplU07051 OCU0 NADPH-dependent carbonyl reduct. gi1471144 IgplD23722 PSEL 2,5-dichloro-2,5-cyclohexadiene. gi1886434 IgplZ49776 ATP4 ARP protein [Arabidopsis thalia. gz1312919 I~Ix63379 PTI7 3(or 17)beta-hydroxysteroid deh. gI1450261 IgplL27825 EMEV verA gene product [Emericella n. gi1286173 IgplD14595 PSEL 2,5-dichloro-2,5-cyolohexadiene. gi1581439 Igplx68594 PAxc orfX gene product [Pseudomonas . gi1520952 Igplx79863 XSDN orf5 gene product [Xanthobacter. gI1763222 Igplz47o47 sccH unknown [Saccharomyces cerevisi. gI1391836 IgplD17319 PSEB dihydrodiol dehydrogenase [Pseu. gi1407084 I~Ix51338 ~ P L reductase [Agrobacterium rhizog. gi1397883 Igplx66122 PSPB Biphenyl-2,3-dihydro-2,3-diol d. gz1425150 IgplL20485 HYST tropinone reductase-II [Hyoscya. gl 468031gplZII5111SCACT gl 7905521gpIUI72261RMUI gl 479921gplXI63001SVPKS

Fig. 3 (cont'd)--caption

on p. 42.

18.1 18.1 17.8 17.7 17.5 17.1 17.1 16.8 16.7 16.6 16.6 16.5 16.4 16.3 16.3 16.3 16.2 16.1 15.9 15.9 15.8 15.8 15.5 15.5 15.3 15.0 14.8 14.8 14.8 " 14.7 14.6 14.6 14.6 14.6 14.5 14.4 14.3 14.2 14 2 14 2 14 2 14 2 14 1 14 1 14 1 14 1 14 1 14 0 14 0 13 9 13 8 13 8 13 8 13 7 13 7 13 7

LENGTH 261 261 241 272 261 246 261 319 245 246 261 248 241 249 25O 254 261 254 248 255 244 906 262 261 277 289 248 344 266 267 257 287 248 277 253 263 256 343 90O 267 273 332 336 277 25O 629 254 264 25O 255 249 297 276 430 277 260


gm14311031gplL223091MGNP gx12167051gPlDg03161FVBN

polyhydroxynaphthalene reductas. Flavobacterium sp. nam gene for. gm18825971gplU295791ECU2 glucitol-6-phosphate dehydrogen. gmlas25321gvlu283771mcu2 Escherichia coli K-12 genome; a. gxlI794751gplM766651HUMB ll-beta-hydroxysteroid dehydrog. gII5311621gplU056591HSU0 17beta-hydroxysteroid dehydroge. g~I390871~Ix004931ATAc~ Agrobacterium tumefaciens Ti pl. giI51086?IgplX80052 NCFO multifunctional beta-oxidation . gmI763380[gplZ47047 SCCH unknown [Saccharomyces cerevisi. gll1517221gplM64747 PWWX 1,2-dihydroxycyclohexa-3,4-dien. gx12o335olgplJO51O7 RATC Rat corticosteroid ll-beta-dehy. gx13918461gplD16629 PSEP dehydrogenase [Pseudomonas puti. gl13046581gplM97637 DROA alcohol dehydrogenase [Drosophi. gl15599641gpIJ05282 PSED P.cepacia 2,2-dialkylglycine de. gm16878341gpIU21319 CELC C30G12.2 gene product [Caenorha. gll 6631711gplX82262 BTII ll-cis retinol dehydrogenase [B. 149 Drosophila ADH sequences deleted gl 1533911gplM965511STMO oxido-reductase [Streptomyces a... gl 8 8 6 2 [ ~ [ X I 5 5 8 5 1 D N A D H I alcohol dehydrogenase [Drosophi... gm 1516osl~lJo49961PSmT P.putida toluene dioxygenase (i... gm 9142761gplS7587515758 retinol dehydrogenase [Rattus s... gm 8861031gplU273571MTU2 Mycobacterium tuberculosis cycl... gm 20S301~IX630601PSPCR protochlorophyllide reductase [... gm 2982801SplS564561S564 Ke 6 gene product [Mus sp.] gl 156S4~.IgvlM555451DROA alchohol dehydrogenase [Drosoph... gm 5322411gplL355601STMD daunorubicin-doxorubicin polyke... gm 55707~LlgplU152981PSUI chlorobenzene glycol dehydrogen... gl 1569081~IMB73001DROA alcohol dehydrogenase [Drosophi... gl 4993401gplX782011SSI7 17beta-estradiol dehydrogenase ... gm 1568121gplM372761DROA D.mojavensis alcohol dehydrogen... gm 7 4 2 6 1 ~ I X 5 8 6 9 4 1 D H A D H I alcohol dehydrogenase [Drosophi... gm 2168881gplDII4731PSEL C alpha-dehydrogenase [Pseudomo... gm 6095481~IL370871NOSH oxidoreductase [Nostoc sp.] gl 5817061gplZII5191SLTN Probably an NADP-dependent oxid... 
gl 4o933~1~156382415638 NADPH-protochlorophyllide-oxido... gm 1568881gplM633921DROA alcohol dehydrogenase [Drosophi... gl 39278~;Ig-Plu006751CTAR D-afabinitol dehydrogenase [Can... gl 5312691gplD299761TOBT TFHP-I protein [Nicotiana tabac... gl 5102901gplD321421RERB 2,3-dihydroxy-l-phenylcyclohexa... gl 29556~IlgplL16227[YS;~ D-arabinitol dehydrogenase [Can... gl 7464451gplU234551CELF F55EI0.6 gene product [Caenorha... gm 8612801gplU287391CELC C17GI0.8 gene product [Caenorha... gl 5531431gplM903511YSCS SPXI9 [Saccharomyces cerevisiae... gm 3380211gplM762311HUMS sepiapterin reductase [Homo sap... gl 6ooo6~,lg-plx7s8981scxI N1362 gene product [Saccharomyc... gl 6339861gplD285331KPNM moaE gene product [Klebsiella a... gl 72641611gplU231771CELC C5662.6 gene product [Caenorhab... gl 15265019-plM953001SAUC csgA gene product [Stigmatella ... gl 5208961gplZB27421NGRF dTDP-D-glucose 4,6-dehydratase ... gl 471234~IgPlX785591HISB H.influenzae DNA for serotype b... gz 388932,1gplL091881NGOC TDP-glucose-dehydratase [Neisse... gm 6669711gpID503251PSE3 3-alpha-hydroxysteroid dehydrog... gm 5475111gplZ375161HIAC CDP-ribitol pyrophosphorylase [... gm 2068961gplM364101RATS Rat sepiapterin reductase mRNA .... gl 2361671gptS576931S576 Fbp2 gene product [Drosophila m... gm 3046621gplM976381DROA alcohol dehydrogenase [Drosophi... gl 4861571gplZ280711SCYK S.cerevisiae chromosome Xl read... gm 4607331gplX757801SCXI C256 [Saccharomyces cerevisiae]

Fig. 3 (cont'd)--caption overleaf.


13.6 13.6 13.4 13.4 13.4 13.3 13.3 13.2 13.1 13.1 13.0 12.9 12.8 12.7 12.6 12.5

283 271 259 294 292 310 430 894 254 269 287 259 254 249 28O 318

12.3 12.3 12.3 12.3 12.3 12.2 12.2 12.2 12.1 12.1 12.1 12.1 12.0 11.9 • 11.9 11.6 11.6 11.2 ii.i ii.0 10.9 i0 6 i0 6 i0 4 10 2 9 7 9 7 9 6 9 5 9 5 9 4 9 3 9 1 9 1 9 0 8 9 8 9 8 8 8 8 8 7 8 7

298 254 275 317 384 399 260 254 251 275 254 737 254 254 3O5 278 297 40O 254 282 234 263 281 305 418 263 261 295 257 333 173 346 • 474 360" 255 474 259 256 273 256 256


Timothy L. Bailey et al. gl 18871[gpIY006021DPADHG URF gene product [Drosophiia ps... gl 1714519PIX548131DAADHL unknown [Drosophila ambigua] gx 14837191991X78384 DMADAdh-dup gene product [Drosophil... gl 15808821gP[X73124 BSGE ipa-86r gene product [Bacillus . gl [6690271gp[U20864 CELF F32A5.1 gene product [Caenorhab. gi1397372 I~Ix60112 D ~ alcohol dehydrogenase [Drosophi. giI861340 Igp[U28943 CELE E04F6.7 gene product [Caenorhab. gi[220732 19-plD00569 RATD Rat mRNA for 2,4-dienoyl-CoA re. gi1799232 19plu23775 ECU2 dTDP-glucose dehydratase [Esche. gI1304659 IgPlM97637 DROA alcohol dehydrogenase [Drosophi. gz1536492 I~1z36o28 SCYB S.cerevisiae chromosome II read. gi[148191 10-plM87049 ECOU 0355 gene product [Escherichia . gi[587106 19plx78733 ECEN enoyl-ACP reductase [Escherichi. gxl14585119PlM97219 ECOE envM gene product [Escherichia . gx1886467[gplZ49969 CEW0 W01C9.4 [Caenorhabditis elegans. gl 602703[gplL26050 HUM2 2,4-dienoyl-CoA reductase [Homo. gl 1539551~IM318061s~YE S.typhimurium envM protein gene. gl 294896[gp[LI48421SHFR dTDP-D-glucose 4,6-dehydratase . gl 581659[gp[X719701SFRF rfbB gene product [Shigella fle. gl 4139961gp[X731241BSGE ipa-72d gene product [Bacillus . gx 61845619PlL278011ASm~ norsolornic acid [Aspergillus p. gx 142010]gp[LI00361ANAB Anabaena sp. sequence-specific . gl 6418191gPlD903501CORH Corynebacterium sp. hheB gene f. gx 468681gpIZlI9291SCMET ORF3 protein [Streptomyces coel. gx 473600[gpIU082231SFU0 dTDP-glucose dehydratase [Strep. gl 577921gPlX539491RSGAL Rat galE mRNA for UDP-galactose. gl 5678741gp[L373541SERO thymidine diphosphoglucose 4,6-. gx 39812019PlL239411X~R TDP-glucose oxireductase [Xanth. gx 4060951gplL204951NGOU UDP-glucose 4-epimerase [Neisse. gx 6336981gpIZ47767[YETR TrsG [Yersinia enterocolitica] gl 3889321gplL091881NGOC UPD-glucose-4-epimerase [Neisse. gl 42907910PIX761721EAGA E.amylovora (Ea7/74) galE gene. 
gl 308191gp[X048821HSDHP Human mRNA for dihydropteridine gl 40933519PlS63825 S638 NADPH-protochlorophyllide-oxido gx 78065819plU23040 RLU2 Rhizobium l e g u m i n o s a r u m p u t a t i v gx 6429491gp[U19895 NMUI UDP-glucose 4-epimerase [Neisse ol 1539791oDIMl1332 STYF Salmonella tvDhimurium H-l-i ae gl 5816771gplx62567 SGST dTDP-glucose dehydratase [Strep gx 4990181~IX759641VWF dihydroflavonol reductase [Viti gl 478951gPIX567931SERFB CDP-glucose 4,6-dehydratase [Sa gl [203979 gplJ03481 RATD Rat dihydropteridine reductase gx [861175 gp[M29682 SERF CDP-tyvelose epimerase [Salmone.. gl 1641817 gplD90349 CORH Corynebacterium sp. hheA gene f. gl 1633700 gplZ47767 YETR Uridine Diphosphate Glucose Epi.. gl [410490 gplZ18277 LEDI dihydroflavonol 4-reductase [Ly. gl 155497 gpIL01777 YERF CDP-glucose 4,6-dehydratase [Ye. gl 312777 gpIZ17221 GHDF dihydroflavonol-4-reductase [Ge. gl 633875 gp[$72887 $728 CDP-D-glucose 4,6-dehydratase [. gx 169977 gp[L01628 SOYG malate dehydrogenase [Glycine m. gl 20544[gp[XI55371PHDFR Petunia hybrida dfrA mRNA for d. gz 5362221gplZ358881SCYB GALl0 gene product [Saccharomyc. gl 4394781gplD260951ANAZ zeta-carotene desaturase [Anaba.

8.5 8.5 8.5 8.4 8.3 8.3 7.9 7 7 7 7 7 6 7 6 7 6 7 5 7 5 7 4 7 4 7 4 7 3 6 8 6 7 6 6 6 5 6 4 6 1 5 8 5.7 5.5 5.4 5.3 5.1 4.9 " 4.9 4.9 4.9 4.8 4.8 4.7 4.7 4.7 4.6 4.6 4.4 4.4 4.4 4.4 4.4 4.4 4.3 4.3 4.3 4.3 4.2

278 281 272 259 925 279 329 335 361 269 347 355 262 262 309 335 262 361 361 315 271 264 235 251 333 347 329 351 339 638 339 164 244" 199 229 339 490 328 337 359 241" 338 244 336 379 357 366 357 35O 373 699 499

Fig. 3. (cont'd). MAST analysis of Genpept90 using all six motifs. The output histogram indicates the selectivity of MAST when a battery of six motifs is used to search the database. Our analysis indicates that there are at least 300 different gene products in Genpept90 that belong to the short-chain alcohol dehydrogenase family. For clarity, we have edited the output, removing over 140 Drosophila alcohol dehydrogenase sequences with z-scores between 12.2 and 12.5 as well as other sequences with high scores. Two sugar epimerases with z-scores greater than 9 are marked with o. Also shown are human and rat dihydropteridine reductases at z-scores of 4.9 and 4.6, respectively. The highest scoring protein that does not belong to the short-chain alcohol dehydrogenase family is Salmonella typhimurium flagellin with a z-score of 4.7. Motif order and spacing of flagellin differ from that for short-chain alcohol dehydrogenases.

rat dihydropteridine reductase clearly shows that these enzymes are homologs of S. hydrogenans 20β-hydroxysteroid dehydrogenase. MAST shows its sensitivity by identifying the ancestry of dihydropteridine reductase. The motif order and spacing of dihydropteridine reductase is 1-(54)-3-(19)-6-(31)-2-(18)-5, which is representative of the short-chain dehydrogenase family (Table 3). This example shows how MAST with multiple motifs can identify distant homologs from a large database.
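A motif diagram written in this notation can be checked against the canonical order 1-3-6-4-2-5 with a short Python sketch. The parser and the function names are assumptions made for illustration; they are not part of MAST, and the check deliberately allows missing motifs, as the text does for low-scoring ones.

```python
import re

CANONICAL_ORDER = [1, 3, 6, 4, 2, 5]  # motif order reported for the superfamily

def parse_diagram(diagram):
    """Split a diagram such as '1-(54)-3-(19)-6' into motif numbers and gaps."""
    tokens = re.findall(r"\((\d+)\)|(\d+)", diagram)
    motifs = [int(m) for gap, m in tokens if m]
    gaps = [int(gap) for gap, m in tokens if gap]
    return motifs, gaps

def follows_canonical_order(motifs, canonical=CANONICAL_ORDER):
    """True if the motifs appear in canonical relative order (missing motifs allowed)."""
    remaining = iter(canonical)
    # `m in remaining` consumes the iterator up to and including m,
    # so every later motif must occur further along the canonical order.
    return all(m in remaining for m in motifs)

motifs, gaps = parse_diagram("1-(54)-3-(19)-6-(31)-2-(18)-5")
print(motifs, gaps, follows_canonical_order(motifs))
print(follows_canonical_order([4, 5, 2]))  # the flagellin order quoted in the text
```

The dihydropteridine reductase diagram passes the order check with motif 4 absent, whereas the order 4-5-2 reported for flagellin does not.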

DISCUSSION

We have used a new motif discovery algorithm, MEME, to analyse a diverse group of proteins. By identifying relatively short motifs, MEME facilitates working with large numbers of divergent sequences that would require numerous gaps and insertions in a global multiple alignment. Such an alignment could obscure some functional features as well as adding potentially misleading human bias in assigning boundaries for motifs. Moreover, because MEME ignores sequences that do not contain a motif, MEME output does not degrade when non-homologous sequences are added. This is useful in bootstrapping, where distantly related sequences are added to increase the information for each motif. If an added sequence is not a true homolog, then the analysis will not be harmed; if it is a distant homolog, then the sensitivity of the output is increased.

An important advantage of extracting several motifs for a protein superfamily is that even if a sequence lacks one motif, its ancestry can still be identified from an analysis of the other motifs. The log-odds matrices representing MEME motifs are conveniently used by the MAST tool to identify homologous proteins that do not all contain the same motifs. This can be especially useful for distantly related proteins that lack individual motifs, such as M. tuberculosis InhA, a target for drugs such as isoniazid and ethionamide for controlling tuberculosis [17, 38].

As shown in Fig. 3, the MAST tool provides a rigorous evaluation of the significance of a database search. The sensitivity of this search is seen by z-scores greater than 9 for sugar epimerases, clearly demonstrating that they are homologs of short-chain alcohol dehydrogenases. By examining the output from a database search for the order and spacing of all the motifs jointly, one can identify distantly related homologs among the noise of proteins that by chance have a positive score for a single motif. The motifs found by MEME correspond to matching three-dimensional features in those dehydrogenases whose three-dimensional structure is known, so the organization of the motifs is a mapping of three-dimensional information onto the 1D sequence information. This may explain the sensitivity of the multiple motif analysis.

An important global feature of the MEME multiple motif output is that it informs about the similarities and differences in individual positions and motifs for a protein family. In this way, MEME elucidates conserved domains that are likely to have similar functions in a protein family as well as idiosyncratic domains that are likely to be important in the unique properties of an enzyme. This information can be used for mutagenesis studies, and in an analysis of their tertiary structure to begin to elucidate the mechanism of action of these proteins.
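To make the idea of a log-odds matrix concrete, the following Python sketch builds one from a few aligned motif occurrences and slides it along a sequence. It rests on assumptions that are not from the paper: a uniform background of 1/20 per amino acid, a pseudocount of 1, and invented example sequences. It also stops at the raw log-odds score and does not reproduce the z-score normalization used by MAST.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def log_odds_matrix(occurrences, pseudocount=1.0):
    """One dict per motif position: log2(observed frequency / background)."""
    width = len(occurrences[0])
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    matrix = []
    for pos in range(width):
        column = [occ[pos] for occ in occurrences]
        total = len(column) + pseudocount * len(ALPHABET)
        matrix.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return matrix

def best_window(matrix, seq):
    """Return (score, start) of the highest-scoring window of seq."""
    width = len(matrix)
    scores = [
        (sum(matrix[j][seq[i + j]] for j in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scores)

occurrences = ["GASG", "GASG", "GTSG"]   # hypothetical aligned motif sites
matrix = log_odds_matrix(occurrences)
score, start = best_window(matrix, "MKGASGLG")
print(round(score, 2), start)
```

A sequence can be scored against several such matrices independently, which is what allows a search to tolerate a sequence that matches some motifs well and others poorly.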


Acknowledgements--We thank Dr Michael Gribskov for discussion and help with score normalization; the NIH Genome Analysis Pre-Doctoral Training Grant No. HG00005 for supporting Dr Bailey; and a Hellman Faculty Fellowship from UCSD for supporting Dr Elkan.

REFERENCES

1. Bailey T. L. and Elkan C. P., Fitting a mixture model by expectation-maximization to discover motifs in biopolymers. Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, 1994, pp. 28-36.
2. Gribskov M., MacLachlan A. D. and Eisenberg D., Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84 (1987) 4355-4358.
3. Bairoch A., PROSITE: a dictionary of sites and patterns in proteins, its current status. Nucl. Acids Res. 20 (1992) 2013-2018.
4. Baker M. E., Sequence analysis of steroid- and prostaglandin-metabolizing enzymes: application to understanding catalysis. Steroids 59 (1994) 248-258.
5. Persson B., Krook M. and Jornvall H., Characteristics of short-chain alcohol dehydrogenases and related enzymes. Eur. J. Biochem. 200 (1991) 537-543.
6. Tannin G. M., Agarwal A. K., Monder C., New M. I. and White P. C., The human gene for 11β-hydroxysteroid dehydrogenase. J. Biol. Chem. 266 (1991) 16653-16658.
7. Krozowski Z., 11β-hydroxysteroid dehydrogenase and the short chain alcohol dehydrogenase (SCAD) superfamily. Molec. Cell. Endocr. 84 (1992) C25-C31.
8. Jornvall H., Persson B., Krook M., Atrian S., Gonzalez-Duarte R., Jeffrey J. and Ghosh D., Short-chain dehydrogenases/reductases (SDR). Biochemistry 34 (1995) 6003-6013.
9. Baker M. E., Protochlorophyllide reductase is homologous to human carbonyl reductase and pig 20β-hydroxysteroid dehydrogenase. Biochem. J. 300 (1994) 605-607.
10. Ghosh D., Wawrzak Z., Weeks C. M., Duax W. L. and Erman M., The refined three-dimensional structure of 3α,20β-hydroxysteroid dehydrogenase and possible roles of the residues conserved in short-chain dehydrogenases. Structure 2 (1994) 629-640.
11. Varughese K. I., Xuong N. H., Kiefer P. M., Matthews D. A. and Whiteley J. M., Structural and mechanistic characteristics of dihydropteridine reductase: a member of the Tyr-(Xaa)3-Lys-containing family of reductases and dehydrogenases. Proc. Natl. Acad. Sci. U.S.A. 91 (1994) 5582-5586.
12. Breton R., Housset D., Mazza C. and Fontecilla-Camps J. C., The structure of a complex of human 17β-hydroxysteroid dehydrogenase with estradiol and NADP+ identifies two principal targets for the design of inhibitors. Structure 4 (1996) 905-915.
13. Rafferty J. B., Simon J. W., Baldock C., Artymiuk P. J., Baker P. J., Stuitje A. R., Slabas A. R. and Rice D. W., Common themes in redox chemistry emerge from the X-ray structure of oilseed rape (Brassica napus) enoyl acyl carrier protein reductase. Structure 3 (1995) 927-938.
14. Hopwood D. A. and Sherman D. H., Molecular genetics of polyketides and its comparison to fatty acid biosynthesis. Annu. Rev. Genet. 24 (1990) 37-66.
15. Papadopoulou B., Roy G. and Ouellette M., A novel antifolate resistance gene on the amplified H circle of Leishmania. EMBO J. 11 (1992) 3601-3608.
16. Bello A. R., Nare B., Freedman D., Hardy L. and Beverley S. M., PTR1: a reductase mediating salvage of oxidized pteridines and methotrexate resistance in the protozoan parasite Leishmania major. Proc. Natl. Acad. Sci. U.S.A. 91 (1994) 11442-11446.
17. Banerjee A., Dubnau E., Quemard A., Balasubramanian V., Um K. S., Wilson T., Collins D., de Lisle G. and Jacobs W. R., inhA, a gene encoding a target for isoniazid and ethionamide in Mycobacterium tuberculosis. Science 263 (1994) 227-230.
18. Dempster A. P., Laird N. M. and Rubin D. B., Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. 39 (1977) 1-38.
19. Bailey T. L. and Elkan C. P., The value of prior knowledge in discovering motifs with MEME. Proceedings of the 3rd International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, 1995, pp. 21-29.
20. Kendall M., Stuart A. and Ord J. K., The Advanced Theory of Statistics. Charles Griffin and Co. Ltd, 1983.
21. Bairoch A., The SWISS-PROT protein sequence data bank: current status. Nucl. Acids Res. 22 (1994) 3578-3580.
22. Baker M. E., Unusual evolution of 11β- and 17β-hydroxysteroid and retinol dehydrogenases. BioEssays 18 (1996) 63-70.
23. Branden C. and Tooze J., Introduction to Protein Structure. Garland Publishing, New York, 1991.
24. Wierenga R. K., De Maeyer M. C. and Hol W. G. J., Interaction of pyrophosphate moieties with α-helixes in dinucleotide binding proteins. Biochemistry 24 (1985) 1346-1357.
25. Wierenga R. K., Terpstra P. P. and Hol W. G. J., Prediction of the occurrence of the ADP-binding βαβ-fold in proteins using an amino acid sequence fingerprint. J. Molec. Biol. 187 (1986) 101-107.
26. Holm L., Sander C. and Murzin A., Three sisters, different names. Nature Struct. Biol. 1 (1994) 146-147.
27. Labesse G., Vidal-Cros A., Chomilier J., Gaudry M. and Mornon J.-P., Structural comparisons lead to the definition of a new superfamily of NAD(P)(H)-accepting oxidoreductases: the single-domain reductases/epimerases/dehydrogenases (the "RED" family). Biochem. J. 304 (1994) 95-99.
28. Baker M. E. and Blasco R., Expansion of the mammalian 3β-hydroxysteroid dehydrogenase/plant dihydroflavonol reductase superfamily to include a bacterial cholesterol dehydrogenase, a bacterial UDP-galactose-4-epimerase, and open reading frames in vaccinia virus and fish lymphocystis disease virus. FEBS Lett. 301 (1992) 89-93.
29. Ensor C. M. and Tai H.-H., Site directed mutagenesis of the conserved tyrosine-151 of human placental NAD+-dependent 15-hydroxyprostaglandin dehydrogenase yields a catalytically inactive enzyme. Biochem. Biophys. Res. Commun. 176 (1991) 840-845.
30. Obeid J. and White P. C., Tyr-179 and lys-183 are essential for enzymatic activity of 11β-hydroxysteroid dehydrogenase. Biochem. Biophys. Res. Commun. 188 (1992) 222-227.
31. Albalat R., Gonzalez-Duarte R. and Atrian S., Protein engineering of Drosophila alcohol dehydrogenase. The hydroxy group of Tyr-152 is involved in the active site of the enzyme. FEBS Lett. 308 (1992) 235-239.
32. Chen Z., Jiang J. C., Lin Z. G., Lee W. R., Baker M. E. and Chang S. H., Site-specific mutagenesis of Drosophila alcohol dehydrogenase: evidence for involvement of tyrosine-152 and lysine-156 in catalysis. Biochemistry 32 (1993) 3342-3346.
33. Puranen T. J., Poutanen M. H., Peltoketo H. E., Vihko P. T. and Vihko R. K., Site-directed mutagenesis of the putative active site of human 17β-hydroxysteroid dehydrogenase type 1. Biochem. J. 304 (1994) 289-293.
34. Wilks H. M. and Timko M. P., A light-dependent complementation system for analysis of NADPH:protochlorophyllide oxidoreductase: identification and mutagenesis of two conserved residues that are essential for enzyme activity. Proc. Natl. Acad. Sci. U.S.A. 92 (1995) 724-728.
35. Chenevert S. W., Fossett N. G., Chang S. H., Tsigelny I., Baker M. E. and Lee W. R., Amino acids important in enzyme activity and dimer stability for Drosophila alcohol dehydrogenase. Biochem. J. 308 (1995) 419-423.
36. Tsigelny I. and Baker M. E., Structures stabilizing the dimer interface on human 11β-hydroxysteroid dehydrogenase-types 1 and 2 and human 15-hydroxyprostaglandin dehydrogenase and their homologs. Biochem. Biophys. Res. Commun. 217 (1995) 859-868.
37. Tsigelny I. and Baker M. E., Structures important in mammalian 11β- and 17β-hydroxysteroid dehydrogenases. J. Steroid Biochem. Molec. Biol. 55 (1995) 589-600.
38. Baker M. E., Enoyl-acyl-carrier-protein reductase and Mycobacterium tuberculosis InhA do not conserve the Tyr-Xaa-Xaa-Xaa-Lys motif in mammalian 11β- and 17β-hydroxysteroid dehydrogenases and Drosophila alcohol dehydrogenase. Biochem. J. 309 (1995) 1029-1030.
39. Baker M. E., Myxococcus xanthus C-factor, a morphogenetic paracrine signal, is similar to Escherichia coli 3-oxoacyl-[acyl-carrier-protein] reductase and human 17β-hydroxysteroid dehydrogenase. Biochem. J. 301 (1994) 311-312.
40. Monder C., Corticosteroids, receptors, and the organ-specific functions of 11β-hydroxysteroid dehydrogenase. FASEB J. 5 (1991) 3047-3054.
41. Funder J. W., Pearce P. T., Smith R. and Smith A. I., Mineralocorticoid action: target tissue specificity is enzyme, not receptor, mediated. Science 242 (1988) 583-586.
42. Edwards C. R. W., Stewart P. M., Burt D., Brett L., McIntyre M. A., Sutanto W. S., De Kloet E. R. and Monder C., Localization of 11β-hydroxysteroid dehydrogenase tissue-specific protector for the mineralocorticoid receptor. Lancet 2 (1988) 986-989.
43. Andersson S., 17β-hydroxysteroid dehydrogenase: isozymes and mutations. J. Endocr. 146 (1995) 197-200.
44. Adamski J., Norman T., Leenders E., Monte D., Begue A., Stehelin D., Jungblut P. W. and de Launoit Y., Molecular cloning of a novel widely expressed human 80 kDa 17β-hydroxysteroid dehydrogenase IV. Biochem. J. 311 (1995) 437-443.