RNA-binding residues in sequence space: Conservation and interaction patterns


Computational Biology and Chemistry 33 (2009) 397–403

Contents lists available at ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Brief communication

Ruth V. Spriggs, Susan Jones ∗

Department of Chemistry and Biochemistry, School of Life Sciences, John Maynard-Smith Building, University of Sussex, Falmer, Brighton, BN1 9QG, UK

Article info

Article history: Received 12 March 2009; received in revised form 14 July 2009; accepted 18 July 2009

Keywords: Evolutionary conservation; Interaction pattern; Function prediction; Regular expression; Genome annotation; RNA-binding

Abstract

RNA-binding proteins (RBPs) perform fundamental and diverse functions within the cell. Approximately 15% of protein sequences are annotated as RNA-binding, but because a significant number of proteins lack functional annotation, many RBPs are yet to be identified. A percentage of uncharacterised proteins can be annotated by transferring functional information from proteins sharing significant sequence homology. However, genomes contain a significant number of orphan open reading frames (ORFs) that do not share significant sequence similarity with other ORFs, but correspond to functional proteins. Hence methods for protein function annotation that go beyond sequence homology are essential. One method of annotation is the identification of ligands that bind to proteins, through the characterisation of binding site residues. In the current work RNA-binding residues (RBRs) are characterised in terms of their evolutionary conservation and the patterns they form in sequence space. The potential for such characteristics to be used to identify RBPs from sequence is then evaluated. The conservation of residues in 261 RBPs is compared for (a) RBRs vs. non-RBR surface residues, and (b) specific and non-specific RBRs. The analysis shows that RBRs are more conserved than other surface residues, and that RBRs hydrogen-bonded to the RNA backbone are more conserved than those making hydrogen bonds to RNA bases. This observed conservation of RBRs was then used to inform the construction of RBR sequence patterns from known protein–RNA structures. A series of RBR patterns was generated for a case study protein, aspartyl-tRNA synthetase bound to tRNA, and used to differentiate between RNA-binding and non-RNA-binding protein sequences. Six sequence patterns performed with high precision values of >80% and recall values 7 times that of a homology search. When the method was expanded to the complete dataset of 261 proteins, many patterns were of poor predictive value, as they had not been manipulated on a family-specific basis. However, two patterns with precision values ≥85% were used to make function predictions for a set of hypothetical proteins. This revealed a number of potential RBPs that require experimental verification.

© 2009 Elsevier Ltd. All rights reserved.

∗ Corresponding author. Tel.: +44 01273 877553; fax: +44 01273 678297. E-mail address: [email protected] (S. Jones).

1476-9271/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2009.07.012

1. Introduction

1.1. RNA-binding proteins and their binding sites

RNA-binding proteins (RBPs) perform fundamental and diverse functions within the cell. They form an integral part of both the ribosome and the spliceosome, and their interactions with microRNAs are essential in gene expression regulation. RBPs achieve their diverse functions through a modular structure, featuring different combinations of a limited number of RNA-binding domains, including the RRM, KH, dsRBD and PAZ domains (Lunde et al., 2007). The RNA-binding sites within these domains are not as well characterised as those of DNA-binding domains, but studies have attempted to characterise their common features (Bahadur et al., 2008; Ellis et al., 2007; Jeong et al., 2003; Jones et al., 2001; Treger and Westhof, 2001). Such studies show that arginine and lysine have high propensities for the RNA-binding site, and that amino acid side-chain contacts to RNA backbone atoms predominate. However, these studies reveal wide variations between favoured amino acid-base pairings, hydrogen bonding patterns and other physical and chemical features (Ellis et al., 2007). The lack of common features results from the diverse structures of the bound RNA molecules, which include A-form double helices, and single strand motifs such as hairpin loops and bulges. Whilst information on RBPs is increasing, their functional importance is not reflected in our understanding of the specificity of the interactions. For this to be achieved, a greater number of RBPs need to be identified in diverse proteomes. Approximately 15% of proteins in UniProtKB/SwissProt (Bairoch et al., 2007) are annotated as RNA-binding, but because a significant number of proteins lack functional annotation, many novel RBPs are yet to be identified.

1.2. Protein function annotation

All proteomes feature a significant percentage of proteins with no functional annotations (Friedberg et al., 2006), and these include those that bind to RNA. In order for these proteins to contribute to a wider knowledge of biological systems, functional information is essential. A percentage of uncharacterised proteins can be annotated by transferring functional information from proteins sharing significant sequence homology. However, predicting function from homology is a qualitative process, as the link between the two is complex (Ponting, 2001). More importantly, the majority of sequenced genomes include orphan open reading frames (ORFans) that do not share significant sequence similarity with other open reading frames (Siew and Fischer, 2004). As many ORFans correspond to functional proteins (Siew and Fischer, 2004), methods for functional annotation that go beyond sequence homology are essential. One method of functional annotation is the identification of the ligands that bind to proteins, through the characterisation of binding site residues.

1.3. Conservation of binding site residues in proteins

Binding site residues, including those in protein–protein (Espadaler et al., 2005; Valdar and Thornton, 2001) and protein–DNA (Chang et al., 2008; Luscombe and Thornton, 2002) complexes, are considered to be conserved through evolution. However, in reality, the picture of binding site conservation is far more complex. Residues in protein–protein binding sites are not significantly more conserved than clusters of other protein surface residues (Caffrey et al., 2004), and residues interacting with DNA bases show variable levels of conservation dependent upon the specific protein family (Luscombe and Thornton, 2002). Studies showing that DNA-binding residues are conserved have led to the assumption that RNA-binding residues (RBRs) are also conserved, but this has never been statistically evaluated.

1.4. Protein binding site patterns

The conservation of functionally significant residues in proteins has led to sequence patterns becoming a well-documented feature of proteins. More than 1300 sequence patterns are currently stored as regular expressions in the PROSITE database (Hulo et al., 2007). The concept of a generic interaction pattern is exemplified in the database by the PXXP binding motif. This motif is present in proteins that bind to SH3 domains (Chandra et al., 2004) and it has been linked to a number of disease processes including Alzheimer's disease (Chandra et al., 2004). However, no comparable generic binding sites have yet been identified in protein–nucleic acid complexes. In PROSITE, DNA- and RNA-binding proteins are represented by profiles that represent whole domains. Such profiles enable proteins of unknown function to be matched against the known RNA-binding domains. However, there are no generic RBR patterns that can be used to annotate RNA-binding proteins that do not feature known domains.

1.5. RNA interaction patterns

The functional significance of the generic PXXP interaction pattern (Chandra et al., 2004), and the recent identification of DNA-binding amino acid triplets and quintets (Ahmad et al., 2004), led to the current work on RBRs. In this paper the conservation of RNA-binding amino acids is quantified for the first time, and interaction patterns for RNA-binding proteins are generated. The aim is to (a) statistically analyse the conservation of RBRs and (b) use this information to inform the construction of RBR sequence patterns from known protein–RNA structures. The RBR patterns are tested for their potential to distinguish between known RNA-binding and non-RNA-binding protein sequences. The method is initially tested on aspartyl-tRNA synthetase bound to tRNA, and then expanded to a representative dataset of 261 RNA-binding proteins. The performance of the binding site patterns is discussed, and a number of function predictions for hypothetical protein sequences are highlighted.

2. Materials and methods

2.1. Datasets

2.1.1. Dataset of protein–RNA structures

To analyse RNA-binding residue (RBR) conservation and generate the RBR patterns, a non-redundant dataset of RNA-binding proteins (RBPs) was extracted from the PDB (Berman et al., 2000). All RNA–protein complexes were extracted if both the protein and RNA chain contained at least five residues, and the structure had a resolution ≤3.0 Å (if solved by X-ray crystallography). A set of RBRs for each protein was defined by calculating all potential intermolecular hydrogen bonds and van der Waals interactions using HBPLUS (McDonald and Thornton, 1994). A maximum donor–acceptor distance of 3.35 Å and a maximum hydrogen-acceptor distance of 2.7 Å were used to define a hydrogen bond (Ellis et al., 2007). Atoms were considered to form van der Waals contacts if the distance between them was ≤3.9 Å, and the contact was not defined as a hydrogen bond. An RNA-binding residue was defined as any residue forming an intermolecular hydrogen bond or van der Waals interaction. The full dataset contained 1944 interacting protein–RNA pairs, involving 1670 protein chains. A non-redundant set was selected by clustering at 35% sequence identity and 90% overlap, using BLASTClust (Altschul et al., 1990). A representative was selected from each cluster, resulting in a set of 261 protein–RNA pairs (denoted as RBP-STR).

2.1.2. Datasets of RNA-binding and non-RNA-binding protein sequences

To analyse the prediction potential of the RBR patterns, positive and negative datasets of RBP sequences were selected. The positive RBP dataset included all proteins in UniProtKB/Swiss-Prot (Bairoch et al., 2007) with RNA-binding-related keyword entries (see Supplementary data: Table S1) (denoted RBP-SEQ-POS). The remaining proteins in UniProtKB/Swiss-Prot were assigned to the negative (non-RNA-binding) dataset (denoted RBP-SEQ-NEG).

2.2. Calculation of sequence and structural properties

2.2.1. Residue conservation from sequence

The conservation of each residue in the protein chains in the representative set (RBP-STR) was calculated using Scorecons (Valdar, 2002), with the valdar01 scoring method and a multiple sequence alignment of homologues created using PSI-BLAST (Altschul et al., 1997) on UniProtKB/Swiss-Prot to 20 iterations. Scores calculated by Scorecons range from 0 (not conserved) to 1 (highly conserved). For comparison, these scores (x) were binned into five levels: x ≤ 0.2 in level 1, 0.2 < x ≤ 0.4 in level 2, 0.4 < x ≤ 0.6 in level 3, 0.6 < x ≤ 0.8 in level 4, and 0.8 < x ≤ 1 in level 5.

2.2.2. Residue accessibility from structure

For the comparison of RNA-binding residue conservation, residues on the surface of the protein–RNA structures were differentiated from those in the protein interior using the concept of accessible surface area (ASA). The relative ASA of each residue (compared to that residue type in an extended Ala-x-Ala tripeptide) in the RNA–protein complex set (RBP-STR) was calculated using NACCESS (www.bioinf.manchester.ac.uk/naccess/). A residue with a relative ASA of ≥5% in the unbound form was defined as present on the surface of the protein. χ² tests were used to evaluate patterns of conservation between the RNA-binding residues and the remainder of the residues on the protein surface, and between RBRs making base and backbone contacts (Table 1).
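The comparison described in Sections 2.2.1 and 2.2.2 can be sketched as follows: conservation scores are binned into the five levels, the level counts of two residue groups form a 2 × 5 contingency table, and a Pearson χ² statistic with per-cell residuals is computed. This is a minimal illustration under stated assumptions, not the authors' analysis code: all function names are hypothetical, the residuals are assumed to be Pearson residuals (the text does not say which type was used), the closed-form p-value only covers even degrees of freedom, and the example scores are invented.

```python
import math
from collections import Counter

# Upper bounds of conservation levels 1-5 as given in Section 2.2.1.
LEVEL_UPPER_BOUNDS = (0.2, 0.4, 0.6, 0.8, 1.0)


def conservation_level(score):
    """Return the level (1-5) for a conservation score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    for level, upper in enumerate(LEVEL_UPPER_BOUNDS, start=1):
        if score <= upper:
            return level


def level_counts(scores):
    """Count how many scores fall into each of the five levels."""
    counts = Counter(conservation_level(s) for s in scores)
    return [counts.get(level, 0) for level in range(1, 6)]


def chi_square(table):
    """Pearson chi-square for a contingency table (list of rows of counts).

    Returns (statistic, degrees of freedom, Pearson residuals).
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            residual_row.append((observed - expected) / math.sqrt(expected))
            statistic += (observed - expected) ** 2 / expected
        residuals.append(residual_row)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof, residuals


def chi_square_p_value_even_dof(statistic, dof):
    """Upper-tail p-value; closed form valid only for even, positive dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form implemented for even dof only")
    half = statistic / 2.0
    term = total = 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    # Invented scores, far too few for a meaningful test; for illustration only.
    rbr_scores = [0.91, 0.85, 0.72, 0.66, 0.95, 0.41, 0.88, 0.79]
    other_scores = [0.12, 0.35, 0.55, 0.18, 0.62, 0.27, 0.44, 0.09]
    table = [level_counts(rbr_scores), level_counts(other_scores)]
    stat, dof, residuals = chi_square(table)
    print(stat, dof, chi_square_p_value_even_dof(stat, dof))
```

A 2 × 5 table gives four degrees of freedom, so the even-dof shortcut is enough for this layout; a statistics library would be the more general choice.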


Table 1 Statistical comparison of the conservation scores of residues in a dataset of 261 representative protein–RNA complexes (RBP-STR).

| | Conservation comparison | χ² | p-Value | Conclusion |
|---|---|---|---|---|
| i | All RBRs on surface vs. non-RNA-binding surface residues | 1077.5 | 2.2E−16 | Distributions are significantly different at p < 0.01. RBRs are significantly more conserved than non-RNA-binding surface residues (Fig. 2a). |
| ii | H-bonded RBRs on surface vs. surface residues not H-bonded to RNA | 499.5 | 2.2E−16 | Distributions are significantly different at p < 0.01. Hydrogen-bonded RBRs are more conserved than surface residues not involved in hydrogen bonding to RNA. |
| iii | Van der Waals RBRs on surface vs. surface residues not in van der Waals contact with RNA | 485.2 | 2.2E−16 | Distributions are significantly different at p < 0.01. Van der Waals RBRs are more conserved than surface residues not involved in van der Waals contacts with RNA. |
| iv | All base-contacting RBRs vs. all backbone-contacting RBRs | 11.6 | 0.04 | Distributions are not significantly different at p < 0.01. |
| v | H-bonded base-contacting RBRs vs. H-bonded backbone-contacting RBRs | 41.4 | 7.81E−8 | Distributions are significantly different at p < 0.01. The hydrogen-bonded base-contacting RBRs are biased towards the lower levels of conservation. The hydrogen-bonded backbone-contacting RBRs show conservation bias to higher levels of conservation (Fig. 2b). |

2.3. Creating regular expression patterns from RNA-binding residues

2.3.1. Case study protein: aspartyl-tRNA synthetase

To investigate the potential for RBR patterns to represent and identify RBPs, a case study was conducted on one complex from the RBP-STR dataset: aspartyl-tRNA synthetase bound to tRNA [PDB:1ASY] (Fig. 1) (Ruff et al., 1991). A series of RBR patterns was created for aspartyl-tRNA synthetase using homologous sequences. Homologues were retrieved using PSI-BLAST with the protein sequence from the PDB file as the query, against the UniProtKB/Swiss-Prot database. The sequences for these homologues were clustered using BLASTClust (90% overlap on both sequences and 60% identity) and one protein chain from each cluster was taken to create a representative set of homologues.

This set of homologues was then used to create a series of RBR patterns in two ways: automatically, using the PRATT (Jonassen et al., 1995) software, and "manually", using multiple sequence alignments (MSAs) and Perl scripts. PRATT was used with default parameters, and in each case the regular expression with the highest fitness score was used. Manual patterns were created to control the inclusion or exclusion of sequence segments and the specificity of each position. MSAs for the creation of the manual patterns were constructed using ClustalW (Chenna et al., 2003). The combination of automatic and manual methods resulted in a total of five classes of pattern defined in Table 2. In the current work (see Section 3.2), it has been shown that RNA-binding residues are significantly more conserved than the remainder of the residues on the protein surface. This information was used to refine the pattern based on the longest contiguous segment of RBRs (1ASY LSRBR) to create two further patterns, 1ASY LSRBR C0.3 and 1ASY LSRBR C0.5, which only included RBRs with a conservation score over a specific threshold.

Fig. 1. Case study protein aspartyl-tRNA synthetase (PDB code 1asy chain A). (a) Sequence from PDB file. (b) Structure of 1asy bound to RNA. The protein is depicted in space filling mode and the RNA molecule bound is shown in blue in cartoon mode. (c) Conservation score calculated using Scorecons (Valdar, 2002) plotted against residue number. In (a–c) residues hydrogen-bonded to RNA are indicated in blue and residues making van der Waals contacts to RNA are indicated in green. All other non-RNA-binding residues are indicated in yellow. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of the article.)

Table 2 RNA-binding residue (RBR) patterns created for the case study protein aspartyl-tRNA synthetase.

| No. | Pattern ID | Brief description | Regular expression | Pattern start:end in input PDB |
|---|---|---|---|---|
| 1 | 1ASY PT | Best PRATT (31) pattern from homologues truncated to align to the smallest section of the PDB sequence that contains all the RBRs (automatic) | G-x(3,5)-G-x(1,2)-R | 476–485 and 524–531 |
| 2 | 1ASY CLSR1 | Cluster of RBRs identified by manual inspection. Cluster can include ≤2 non-RBRs and highly variant positions in the MSA were simplified in the final pattern (manual) | E-[EG]-[TNPK]-[KEVI]-[ASN] | 117–121 |
| 3 | 1ASY CLSR2 | Cluster of RBRs identified by manual inspection. Cluster can include ≤2 non-RBRs and highly variant positions in the MSA were simplified in the final pattern (manual) | [SATKEVC]-x-x-[TEVPD]-x-[TPD]-x-x-[VLI] | 202–210 |
| 4 | 1ASY CLSR3 | Cluster of RBRs identified by manual inspection. Cluster can include ≤2 non-RBRs and highly variant positions in the MSA were simplified in the final pattern (manual) | R-x-[ED]-[NKEDG]-[SIL]-[SNHDRG]-[SATGL]-[STNPHDR]-[TKR]-[HQ]-[NL] | 325–335 |
| 5 | 1ASY CLSR4 | Cluster of RBRs identified by manual inspection. Cluster can include ≤2 non-RBRs and highly variant positions in the MSA were simplified in the final pattern (manual) | E-[VIL]-[SAVCIG]-[SNG] | 478–481 |
| 6 | 1ASY LSRBR | Longest contiguous segment of RBRs allowing 1 non-RBR within segments (manual) | R-x-E-[DEGKN]-[LS]-[DNR]-T-[DHNPRST]-R-H-[AHLMNP] | 325–335 |
| 7 | 1ASY LSRBR C0.3 | Longest contiguous segment of RBRs allowing 1 non-RBR within segments and with residues with a Scorecons conservation score <0.3 set as wildcards (manual) | R-x-E-[DEGKN]-[LS]-[DNR]-x-[DHNPRST]-R-H-x | 325–335 |
| 8 | 1ASY LSRBR C0.5 | Longest contiguous segment of RBRs allowing 1 non-RBR within segments and with residues with a Scorecons conservation score <0.5 set as wildcards (manual) | R-x-E-[DEGKN]-x-x-x-x-R-H-x | 325–335 |

2.3.2. Representative dataset

The precision and recall values observed for the patterns generated for aspartyl-tRNA synthetase (see Section 3.3.1) led to the generation of a revised series of regular expressions for all proteins in the non-redundant dataset (RBP-STR). Regular expressions were created automatically from sets of homologous sequences: (i) using the PRATT (Jonassen et al., 1995) software (see (a) below) and (ii) using MSAs guided by the performance of the manually created patterns for the test case protein (see (b–d) below). Homologues of each of the proteins in the RBP-STR dataset were identified as described for aspartyl-tRNA synthetase (see Section 2.3.1). This resulted in a total of four classes of pattern for each protein.

(a) PRATT (Jonassen et al., 1995) patterns from homologues truncated to align to the smallest section of the PDB sequence that contains all the RBRs (denoted REP PT).
(b) Longest contiguous segment of RBRs, allowing ≤2 non-RBRs within segments and excluding very low frequency residues (denoted REP LSRBR).
(c) Longest contiguous segment of RBRs, allowing ≤2 non-RBRs within segments, excluding very low frequency residues and with residues with a Scorecons conservation score <0.4 set as wildcards (denoted REP LSRBR C0.4).
(d) Longest contiguous segment of residues that have a Scorecons conservation score >0.4 (denoted REP LSCR).

Again, the observation that RNA-binding residues are significantly more conserved than the remainder of the protein surface is used to refine the patterns in (b) to create those used in (c). A conservation threshold of 0.4 was selected, for (c) and (d), from observations of conservation frequency distributions for all RBRs in the representative dataset (data not shown). The motifs created in (d) tested whether conservation of residues alone (without information on which residues are in the RNA-binding site) was sufficient to identify RNA-binding motifs. Using the longest run of conserved residues in REP LSCR makes use of the observation that methods for predicting functionally important residues improve when the conservation of neighbours in the sequence is taken into account (Capra and Singh, 2007).

2.4. Using RNA-binding regular expression patterns for function prediction

2.4.1. Case study protein: aspartyl-tRNA synthetase

Fuzzpro from EMBOSS (Rice et al., 2000) was used to search the positive (RBP-SEQ-POS) and negative (RBP-SEQ-NEG) datasets of RBPs with the five classes of regular expression generated for aspartyl-tRNA synthetase (Table 2). Zero mismatches were allowed and all sequence hits were predicted to bind RNA. Numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) were calculated. Precision and recall were calculated for the results of each search (Table 3). Precision is a measure of how many of the positive predictions made are correct and is defined as TP/(TP + FP). Recall is a measure of what proportion of all the possible positive predictions have been made and is defined as TP/(TP + FN). The F-measure (Hripcsak and Rothschild, 2005) combines precision and recall into their harmonic mean, and is defined as in Eq. (1), where R = recall, P = precision, and β is a weight parameter.

$$F_{\beta} = \frac{(1 + \beta^{2}) \times R \times P}{(\beta^{2} \times P) + R} \qquad (1)$$
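The definitions of precision, recall and Eq. (1) can be written out as a short sketch. The function names are illustrative only. The example counts are an assumption: they are back-calculated from values reported elsewhere in this paper (66 hits at 89.39% precision for pattern 1ASY CLSR5, and 35,863 sequences in RBP-SEQ-POS), not taken from the authors' raw results.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when no positive predictions were made."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r, beta=1.0):
    """F-measure of Eq. (1); beta = 1 weights precision and recall equally."""
    denominator = beta ** 2 * p + r
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * r * p / denominator


if __name__ == "__main__":
    # Back-calculated illustration: 66 hits, of which 59 are assumed true positives,
    # searched against 35,863 annotated RNA-binding sequences.
    tp, fp = 59, 7
    fn = 35863 - tp
    p, r = precision(tp, fp), recall(tp, fn)
    print(f"precision={p:.4f} recall={r:.6f} F1={f_measure(p, r):.6f}")
```

With so few hits against a large positive set, recall dominates the harmonic mean, which is why a high-precision pattern can still have a very small F-measure.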

In the current work neither precision nor recall is favoured and hence a value of β = 1 is used. The F-measure shows whether the trade-off between recall and precision has improved the overall performance. For comparison, precision, recall and F-measure were also calculated for an equivalent homology search across the two datasets (RBP-SEQ-POS and RBP-SEQ-NEG) using the aspartyl-tRNA synthetase sequence with a threshold of ≥35% sequence identity (Table 3).

Table 3 Results of searches using 11 RNA-binding residue (RBR) patterns from the case study protein aspartyl-tRNA synthetase against the positive (RBP-SEQ-POS) and negative (RBP-SEQ-NEG) datasets. Each expression has a pattern ID(s) described in Table 2. Expressions 6–8 are created by combining two patterns in a single search.

| | Pattern ID | No. hits | Precision (%) | Recall (%) |
|---|---|---|---|---|
| 1 | 1ASY PT | 118,573 | 12.04 | 39.79 |
| 2 | 1ASY CLSR1 | 233,720 | 10.98 | 71.57 |
| 3 | 1ASY CLSR3 | 219,853 | 11.13 | 55.92 |
| 4 | 1ASY CLSR5 | 66 | 89.39 | 0.16 |
| 5 | 1ASY CLSR6 | 90,452 | 10.35 | 26.11 |
| 6 | 1ASY CLSR1 CLSR5 | 59 | 88.14 | 0.15 |
| 7 | 1ASY CLSR3 CLSR5 | 57 | 89.47 | 0.14 |
| 8 | 1ASY CLSR5 CLSR6 | 55 | 96.36 | 0.15 |
| 9 | 1ASY LSRBR | 69 | 86.96 | 0.17 |
| 10 | 1ASY LSRBR C0.3 | 76 | 80.26 | 0.17 |
| 11 | 1ASY LSRBR C0.5 | 348 | 43.97 | 0.43 |
| | 1ASY HOMOLOGY | 8 | 100.00 | 0.02 |

2.4.2. Representative dataset

Fuzzpro from EMBOSS (Rice et al., 2000) was used to search the positive (RBP-SEQ-POS) and negative (RBP-SEQ-NEG) datasets of RBPs with the four classes of regular expression generated for each protein in the representative dataset (RBP-STR). The precision, recall and F-measure scores (see Section 2.4.1) were calculated (Table 4) and compared against the results for an equivalent set of homology searches at ≥35% sequence identity using a two-sample Kolmogorov–Smirnov test (Supplementary data: Table S2).
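The pattern handling described in Sections 2.3 and 2.4 can be sketched with Python's standard `re` module. This is a stand-in for illustration, not Fuzzpro or the authors' Perl scripts. The helper names are hypothetical, only a subset of the PROSITE pattern syntax is handled, the wildcarding step assumes one pattern element per residue position, and both the conservation scores and the sequences are invented. The scores were chosen so that the 1ASY LSRBR pattern from Table 2 yields the C0.3 and C0.5 variants.

```python
import re

# One PROSITE-style element: x, a residue, [allowed] or {forbidden},
# optionally followed by a repeat count such as (3) or (3,5).
_ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.strip().split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def wildcard_low_conservation(pattern, scores, threshold):
    """Replace elements whose conservation score is below threshold with x.

    Assumes one element per residue position (no repeat counts).
    """
    elements = pattern.split("-")
    if len(elements) != len(scores):
        raise ValueError("need exactly one score per pattern element")
    return "-".join("x" if score < threshold else element
                    for element, score in zip(elements, scores))


def matching_ids(pattern, sequences):
    """Return IDs of sequences containing an exact (zero-mismatch) match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [seq_id for seq_id, seq in sequences.items() if regex.search(seq)]


if __name__ == "__main__":
    lsrbr = "R-x-E-[DEGKN]-[LS]-[DNR]-T-[DHNPRST]-R-H-[AHLMNP]"
    # Invented per-position scores, not the paper's Scorecons output.
    scores = [0.9, 0.2, 0.8, 0.6, 0.4, 0.45, 0.25, 0.35, 0.9, 0.85, 0.1]
    c05 = wildcard_low_conservation(lsrbr, scores, 0.5)
    toy_sequences = {"seqA": "MKRAEDLLLLRHAK", "seqB": "MKKKKK"}  # invented
    print(c05, prosite_to_regex(c05), matching_ids(c05, toy_sequences))
```

Every sequence returned by `matching_ids` would count as a predicted RNA-binding protein, so the hit lists from the positive and negative datasets supply the TP and FP counts used for precision and recall.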


3. Results and discussion

3.1. Structure and sequence datasets

A representative dataset (RBP-STR) of 261 structures of protein–RNA complexes was extracted from the PDB. A total of 35,863 RNA-binding protein sequences were extracted from UniProtKB/Swiss-Prot (2008) using keyword annotations, and defined as the positive dataset (RBP-SEQ-POS). The remaining 208,093 proteins in UniProtKB/Swiss-Prot were defined as the negative dataset (RBP-SEQ-NEG).

3.2. Conservation of RNA-binding residues

The comparison of conservation scores shows that RNA-binding residues (RBRs) are significantly more conserved than non-binding residues on the protein surface (Table 1(i–iii)) (Fig. 2a). The residuals for this comparison show that the frequencies of RBRs and non-RBRs at all conservation levels contribute to the significant χ² test (all absolute residuals are ≥2.0). This puts RBRs in the same class of functionally conserved residues as protein–protein and protein–DNA-binding residues. However, those residues conferring specificity to the interaction show a more complex picture of conservation (Table 1(iv–v)) (Fig. 2b). At p < 0.01 there is no significant difference in the distribution of conservation scores of RBRs making hydrogen bond and van der Waals contacts to the RNA bases compared to those making such contacts to the backbone. However, when only the hydrogen bond contacts are considered, the distribution of conservation is significantly different (Table 1(v), Fig. 2b). The hydrogen-bonded RBRs making base contacts show a bias towards the lower levels of conservation, with the frequencies at level 2 having absolute residuals of ≥2.0, indicating that this level makes a major contribution to the significant χ² result. The hydrogen-bonded RBRs making backbone contacts show a bias towards the higher levels of conservation, with the comparison at level 4 having one absolute residual of ≥2.0, indicating that level 4 makes a major contribution to the significant χ² result. These distributions result from individual proteins in the dataset showing variable levels of conservation.

Fig. 2. Percentage frequency distributions of residue conservation scores. (a) All surface RBRs (black bars) vs. non-RNA-binding surface residues (white bars). (b) Hydrogen-bonded base-contacting RBRs (black bars) vs. hydrogen-bonded backbone-contacting RBRs (white bars).

3.3. Creating regular expression patterns from RNA-binding residues and evaluating their prediction potential

3.3.1. Case study: aspartyl-tRNA synthetase

Eight regular expression patterns were created for aspartyl-tRNA synthetase bound to tRNA [PDB:1ASY] (Ruff et al., 1991) (Table 2), and each pattern (or combination of patterns) was evaluated for its predictive potential by searching against the positive (RBP-SEQ-POS) and negative (RBP-SEQ-NEG) datasets (Table 3). The results show that six patterns give levels of precision >80%. The recall values are low for all the motifs, as expected, because the positive and negative datasets are searched with RBR patterns from a single protein–RNA complex. The aim of the work on the case study protein was to examine the precision of the patterns, and evaluate their generic nature. The six patterns with high precision values (>80%) have recall values at least 7 times higher than that achieved using full sequence homology, highlighting their ability to retrieve RBPs beyond close sequence homologues. The generic nature of the patterns is highlighted further by their ability to match RBPs bound to RNAs other than tRNA. For example, pattern 1ASY LSRBR C0.3 retrieves a pre-mRNA-splicing factor [Swiss-Prot:Q4IRI9] and two rRNA-binding proteins: pseudouridine synthase E [Swiss-Prot:Q87QY8] and 50S ribosomal protein L19 [Swiss-Prot:Q8DTP5].

3.3.2. Representative dataset of 261 RNA-binding proteins

The ability of six patterns created for the case study protein, aspartyl-tRNA synthetase, to identify RBPs beyond close homologues, with high levels of precision, led to the expansion of the method to the complete representative dataset (RBP-STR). The four classes of RBR pattern created for each protein in the dataset were evaluated for their ability to differentiate between the positive (RBP-SEQ-POS) and negative (RBP-SEQ-NEG) datasets of protein sequences (Table 4). In addition, the patterns were compared to an equivalent homology search using the full length protein sequence at ≥35% sequence identity (Table 4). The mean F-measures are low for the four pattern types generated for the entire representative dataset (Table 4). When a pattern class achieved a high mean precision score (e.g. REP LSCR 88.9%) the mean recall was low (e.g. REP LSCR 0.4%). However, all patterns that include information on the RBRs (REP PT, REP LSRBR, and REP LSRBR C0.4) showed significantly higher F-measures than the homology search (Supplementary data: Table S2). Using conserved residues alone (REP LSCR), with no information on which residues are in the RNA-binding site, creates motifs that perform less well than homology (Supplementary data: Table S2), indicating that conservation alone is not sufficient to identify RBRs. Many proteins are likely to have a number of functionally important sites, for example, a protein–protein interaction site as well as a protein–RNA interaction site. The protein will therefore have a number of sites of high conservation, making it impossible to distinguish between them using conservation alone.

Table 4 Results of searches using four classes of RBR pattern, from each of the 261 proteins in the representative dataset (RBP-STR), against the positive (RBP-SEQ-POS) and negative (RBP-SEQ-NEG) datasets. See Section 2.3.2 for detail of patterns.

| | Pattern ID | Precision (%) mean (max) | Recall (%) mean (max) | F-measure mean (max) |
|---|---|---|---|---|
| 1 | REP PT | 32.1 (100.0) | 38.4 (93.2) | 15.0 (28.7) |
| 2 | REP LSRBR | 63.7 (100.0) | 3.2 (51.2) | 2.9 (24.7) |
| 3 | REP LSRBR C0.4 | 64.2 (100.0) | 4.0 (82.8) | 2.8 (28.9) |
| 4 | REP LSCR | 88.9 (100.0) | 0.4 (30.7) | 0.3 (20.4) |
| 5 | REP HOMOLOGY | 91.6 (100.0) | 0.4 (1.4) | 0.8 (2.8) |

The need to automatically create RBR patterns for all proteins using the same method means that many pattern types for the 261 proteins have poor predictive value, as they cannot be manipulated on a family-specific basis. Hence, the automated method cannot be used to construct a complete set of patterns capable of identifying RBPs beyond sequence homologues with high levels of precision. However, the maximum precision and recall scores indicate that some RBR patterns in all four pattern types do have the potential to predict RBPs beyond homologues, as was observed for the test case protein aspartyl-tRNA synthetase. These high precision (≥85%) RBR patterns generated from the representative dataset were used to assign function to proteins annotated in UniProtKB/Swiss-Prot with the keyword "hypothetical protein" (those for which no function has yet been confidently assigned). This process revealed a number of high precision RBR patterns matching interesting hypothetical sequences. The REP PT pattern created from human splicing factor 1 [PDB:1K1G] (Liu et al., 2001) matched hypothetical protein C30D11.14c [Swiss-Prot:Q09911] with a precision of 87%. This protein has since been annotated as having RNA-binding activity from an inferred GO classification and a non-traceable author statement in UniProt (Bairoch et al., 2007). The REP LSRBR pattern created from 30S ribosomal protein S2 [PDB:1FJG] (Carter et al., 2000) matched hypothetical protein AF 0655 [Swiss-Prot:O29602] with a precision of 93.7%. This protein shares sequence similarity with the U62 family of peptidases, which are reported to have rRNA-binding function in a STRING functional protein association network (Jensen et al., 2009). Hence, both hypothetical proteins are worthy of further targeted experimental work to confirm these RNA-binding function predictions.

4. Conclusion

This paper represents the first statistical analysis of the conservation of RNA-binding residues (RBRs), and extends this to an evaluation of the patterns observed when RBRs are mapped onto the protein sequence. The potential for these RBR patterns to be used in the prediction of RNA-binding proteins (RBPs) has been evaluated for a case study protein, aspartyl-tRNA synthetase, and extended to a complete non-redundant dataset of RBPs. This paper shows statistically that RBRs are more conserved than other protein surface residues, and that this conservation extends to residues making both hydrogen bonds and van der Waals contacts to the RNA molecule. However, the picture of conservation is more complex when RBRs making backbone and base contacts are differentiated. When both hydrogen bonds and van der Waals contacts were considered, no difference in conservation was observed between residues making base contacts and residues making backbone contacts. When only hydrogen-bonded RBRs were considered, those making backbone contacts were shown to be more conserved than those making base contacts. This complex picture is analogous to that observed in DNA-binding proteins, in which residues interacting with DNA bases show variable levels of conservation dependent upon the specific protein family (Luscombe and Thornton, 2002). A recent study in Saccharomyces cerevisiae has shown that RBPs bind to multiple mRNAs, with some binding more than 500 mRNAs with diverse sequences (Hogan et al., 2008).

In RBPs with such broad specificity, backbone contacts will make an important contribution to the association, which might explain why many of the proteins in the current paper show conserved backbone contacts. A fuller understanding of the conservation of RBRs will only be possible when the structures of many more diverse protein–RNA complexes are available. The knowledge that RBRs are significantly more conserved than other surface residues was used to inform the generation of RBR patterns. These patterns were created by mapping known RBRs onto protein sequences. The case study protein aspartyl-tRNA synthetase showed that it is possible to create RBR patterns that predict RNA-binding function with high levels of precision, and far greater recall values than a simple homology search. With this method it proved possible to create patterns that identified RBPs binding different types of RNA. However, when patterns were generated automatically for a representative dataset it did not prove possible to find a complete set of RBR patterns capable of identifying RBPs beyond sequence homologues with high levels of precision and recall. To achieve this, RBR patterns need to be generated on a family-specific basis, and features in addition to conservation need to be identified that can be used to inform the creation of generic RBR patterns. Nevertheless, the function predictions made using two of the high-precision RBR patterns generated in this work suggest that such patterns have the potential to be useful in a functional screening process. For a complete set of generic RBR patterns to be generated, a far larger dataset of protein–RNA structures is required; this will become available as the structural genomics projects near completion.

Acknowledgements

RVS was supported by MRC grant ID 70760 (New Investigator Award Scheme). The authors would like to acknowledge Jonathan J. Ellis’s work in the creation of the database of protein–RNA complexes.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.compbiolchem.2009.07.012.

References

Ahmad, S., Gromiha, M.M., Sarai, A., 2004. Analysis and prediction of DNA-binding proteins and their binding site residues based on composition, sequence and structural information. Bioinformatics 20, 477–486.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., et al., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.H., et al., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
Bahadur, R., Zacharias, M., Janin, J., 2008. Dissecting protein–RNA recognition sites. Nucleic Acids Res. 36, 2705–2716.
Bairoch, A., Bougueleret, L., Altairac, S., Amendolia, V., et al., 2007. The universal protein resource (UniProt). Nucleic Acids Res. 35, D193–D197.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., et al., 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
Caffrey, D.R., Somaroo, S., Hughes, J.D., Mintseris, J., et al., 2004. Are protein–protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci. 13, 190–202.
Capra, J.A., Singh, M., 2007. Predicting functionally important residues from sequence conservation. Bioinformatics 23, 1875–1882.
Carter, A.P., Clemons, W.M., Brodersen, D.E., Morgan-Warren, R.J., et al., 2000. Functional insights from the structure of the 30S ribosomal subunit and its interactions with antibiotics. Nature 407, 340–348.
Chandra, B., Gowthaman, R., Raj, R., Gupta, D., et al., 2004. Distribution of proline-rich (PxxP) motifs in distinct proteomes: functional and therapeutic implications for malaria and tuberculosis. Protein Eng. Des. Select. 17, 175–182.
Chang, Y.H.T., Kao, C., Chen, Y., et al., 2008. Evolutionary conservation of DNA-contact residues in DNA-binding residues. BMC Bioinform. 9, S3.
Chenna, R., Sugawara, H., Koike, T., Lopez, R., et al., 2003. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 31, 3497–3500.
Ellis, J.J., Broom, M., Jones, S., 2007. Protein–RNA interactions: structural analysis and functional classes. Protein Struct. Funct. Bioinform. 66, 903–911.
Espadaler, J., Romero-Isart, O., Jackson, R., Oliva, B., 2005. Prediction of protein–protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics 16, 3360–3368.
Friedberg, I., Jambon, M., Godzik, A., 2006. New avenues in protein function prediction. Protein Sci. 15, 1527–1529.
Hogan, D.J., Riordan, D.P., Gerber, A.P., Herschlag, D., et al., 2008. Diverse RNA-binding proteins interact with functionally related sets of RNAs suggesting an extensive regulatory system. PLoS Biol. 6, 2297–2313.
Hripcsak, G., Rothschild, A.S., 2005. Agreement, the F-measure, and reliability in information retrieval. J. Am. Med. Inform. Assoc. 12, 296–298.
Hulo, N., Bairoch, A., Bulliard, V., 2007. 20 years of Prosite. Nucleic Acids Res. 36, D245–249.
Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., et al., 2009. STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 37, D412–D416.
Jeong, E., Kim, H., Lee, S., Han, K., 2003. Discovering the interaction propensities of amino acids and nucleotides from protein–RNA complexes. Mol. Cell 16, 161–167.
Jonassen, I., Collins, J.F., Higgins, D.G., 1995. Finding flexible patterns in unaligned protein sequences. Protein Sci. 4, 1587–1595.
Jones, S., Daley, D.T.A., Luscombe, N.M., Berman, H.M., et al., 2001. Protein–RNA interactions: a structural analysis. Nucleic Acids Res. 29, 943–954.
Liu, Z.H., Luyten, I., Bottomley, M.J., Messias, A.C., et al., 2001. Structural basis for recognition of the intron branch site RNA by splicing factor 1. Science 294, 1098–1102.
Lunde, B., Moore, C., Varani, G., 2007. RNA-binding proteins: modular design for efficient function. Nat. Rev. Mol. Cell. Biol. 8, 479–490.
Luscombe, N.M., Thornton, J.M., 2002. Protein–DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J. Mol. Biol. 320, 991–1009.
McDonald, I.K., Thornton, J.M., 1994. Satisfying hydrogen-bonding potential in proteins. J. Mol. Biol. 238, 777–793.
Ponting, C., 2001. Issues in predicting protein function from sequence. Brief. Bioinform. 2, 19–29.
Rice, P., Longden, I., Bleasby, A., 2000. EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276–277.
Ruff, M., Krishnaswamy, S., Boeglin, M., Poterszman, A., et al., 1991. Class II aminoacyl transfer RNA synthetases: crystal structure of yeast aspartyl-tRNA synthetase complexed with tRNA(Asp). Science 252, 1682–1689.
Siew, N., Fischer, D., 2004. Structural biology sheds light on the puzzle of genomic ORFans. J. Mol. Biol. 342, 369–373.
Treger, M., Westhof, E., 2001. Statistical analysis of atomic contacts at RNA–protein interfaces. J. Mol. Recognit. 14, 199–214.
Valdar, W., Thornton, J., 2001. Protein–protein interfaces: analysis of amino acid conservation in homodimers. Protein Struct. Funct. Genet. 42, 108–124.
Valdar, W.S.J., 2002. Scoring residue conservation. Protein Struct. Funct. Genet. 48, 227–241.