ARTICLE IN PRESS
J. Parallel Distrib. Comput. 63 (2003) 728–737
A grid-aware approach to protein structure comparison
Carlo Ferrari (a), Concettina Guerra (a), and Giuseppe Zanotti (b)
(a) Department of Information Engineering, University of Padova, via Gradenigo 6a, 35131 Padova, Italy
(b) Dip. Chimica Organica e Centro Studi Biopolimeri, Univ. of Padova, via Marzolo 1, Padova, Italy
Received 4 December 2002; revised 15 April 2003
Abstract

This paper concentrates on the grid implementation of software tools for the comparison of protein structures. We have developed comparison algorithms based on indexing techniques that store transformation-invariant properties of the 3D protein structures into tables. The method has large memory requirements and is computationally intensive. Furthermore, the dataset needs frequent updates as new proteins are added to the Protein Data Bank. Thus a significant advantage is obtained from a computational framework such as a grid. We report on a distributed implementation of the matching procedures on a grid using Globus MPI-CH, focusing on the data partition strategy to achieve good load balancing and to minimize the number of secondary memory accesses of the out-of-core computation. © 2003 Elsevier Inc. All rights reserved.
0743-7315/$ - see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0743-7315(03)00081-9

1. Introduction

Current research in protein analysis makes increasing use of systems for the storage and fast retrieval of structural similarity information. These systems complement the more conventional databases of textual and numerical biological data. The analysis and comparison of proteins is an important problem in modern proteomics because of its implications for medicine, genetic engineering, protein classification and evolutionary studies. The structural comparison of proteins is a fundamental step toward understanding the folding process in biological organisms. Since the function of a protein is related to its 3D structure, analysis and comparison can provide insight into the way proteins behave and perform their activity.

The Protein Data Bank (PDB) currently contains more than 21,000 protein structures and is growing at a fast pace (see Fig. 1, from the PDB). The available structures can be grouped into a relatively small number of folds, according to their 3D shape. This number is not growing at the same pace, as seen in Fig. 2 (figure available at http://www.rcsb.org/pdb/holdings.html). A well-known classification (SCOP) [28] identifies approximately 600 distinct folds in the PDB. Comparing new structures with those present in the PDB helps to determine their fold classes or to discover new folds.

Fig. 1. The PDB growth.

Fig. 2. The statistics on the number of new folds.

Many systems have been built in the last few years for 3D protein matching and database access [1–3,5,6,13,21,29,36]. Surveys of structural comparison methods are found in [7,20,23,30]. Systems based on indexing are particularly useful for searching large databases because they do not match a query protein against every entry of the database separately, but restrict the search to a subset of the stored data. Indexing techniques have been employed for pairwise protein structure comparison, for fast retrieval of motifs from the database, for protein classification, and for comparison of proteins allowing hinge bending [8,9,16,17,22,33,35]. Geometric properties of the data are used as indexes to a hash table, where the data are stored in a redundant way. The table, built off-line, is used as a look-up table to retrieve hypotheses of similarity for a query protein. While most indexing schemes are based on the atomic description of proteins, the method we propose is based on angular properties of secondary structures, namely α-helices and β-strands. The secondary structure description allows hypotheses of similarity to be retrieved from a large protein database in a fast and efficient way. These hypotheses may then be verified by taking into consideration the atomic representation of a small subset of candidate proteins.

Recently, the grid computational framework has received a lot of attention in the scientific community. Within a grid environment, geographically distributed resources are coupled together to form a virtual computing power "generator". The end user is offered consistent and inexpensive access to it, irrespective of the actual physical location or access point of the machines. The transparent integration of many distributed facilities, connected in complex ways and usually managed by different and autonomous organizations, requires the development of new technologies that enable software applications to share instruments, displays, computational and information resources through large-scale, high-performance national and international networks.

Several characteristics of the protein comparison and classification problem make it well suited to a grid: large and heterogeneous databases, frequent updates of the databases, geographic distribution of the data, high computational requirements, and remote visualization. A reliable comparison often requires integrating the results of different matching procedures operating at different levels of protein representation (sequence, atomic, secondary structures), each involving a different dataset. These databases may be at distant locations and need frequent updates as the PDB
experiences continuous growth. The results of the matching procedures are visualized in terms of both volumetric and vectorial representations. A user-friendly graphical interface should visualize the structural superposition of the matched proteins.

In this paper we focus on the distributed design and implementation of the matching procedure in a grid environment. A very important issue in distributed computation is load balancing. The amount of data associated with the proteins of the PDB is so large that it cannot be fully loaded into the primary memory of the computers, which requires out-of-core computation. The approach we present partitions the large dataset of geometric invariants so as to achieve both load balancing in a distributed grid environment and minimal usage of secondary memory in out-of-core computation.

This paper is organized as follows. Section 2 presents the indexing method to compare protein structures. Section 3 gives a short description of the grid computational framework, while Section 4 describes the distributed algorithm for protein comparison. Section 5 discusses the results of the algorithm and analyzes its performance. Finally, Section 6 presents conclusions and future work.
2. Structure comparison

A protein is a sequence of amino acids linked by peptide bonds. An amino acid consists of a carbon atom (Cα) to which are attached a hydrogen atom, an amino group and a carbonyl group, and a carbon (called Cβ) or another hydrogen. The 20 amino acids (also called residues) differ in the side chain attached to the Cα atom. The sequence of amino acids is generally referred to as the primary structure of a protein. Its length varies from a few tens to a few thousand amino acids. A different level of protein representation, known as secondary structure, describes a protein in terms of recurrent regular substructures, such as α-helices and β-strands. The tertiary structure is the packing of the structural elements into the 3D shape. The protein may contain several chains forming its quaternary structure. For a survey of protein architecture see [4,24,25].

Approaches to protein comparison use different structural descriptions. A complete structural description is given by the 3D coordinates (x, y, z) of the individual atoms of a protein. Often only the Cα atoms of the amino acids, which form the so-called chain trace of a protein, are considered for comparison. A more compact description can be given in terms of the linear vectors associated with structural elements, α-helices and β-strands. Several programs have been developed to yield the vectorial representation of a protein [11,27]. We used a singular-value decomposition (SVD) [12] to find the axes of α-helices and the best-fit segments for the β-strands. Fig. 3 shows the vectorial representation of protein kinase CK2 (1DAY), where each segment is displayed as a cylinder of fixed radius.

Fig. 3. The vectorial representation of protein kinase CK2 (1DAY).

2.1. Indexing

Indexing techniques for database access have received much attention in the last few years. They are based on the use of transformation-invariant properties of the 3D structures to index a table that stores the data in a redundant way. The invariant properties generate tuples of numbers from which indexes to specific locations of the table are derived. The table is a set of buckets or cells, each consisting of a number of records corresponding to proteins that index into that bucket. Once constructed, the table is used as a look-up table to retrieve hypotheses of similarity for a query protein.

Indexing techniques, initially proposed within the area of computer vision by Wolfson et al. [19], are used in different contexts and differ in the type of invariant properties (either local or global), in the transformation class (rigid body or affine transformations), and in the method used to formulate and verify hypotheses of association for the query object. In the field of bioinformatics, indexing schemes have been applied to a variety of problems, ranging from sequence-based pattern retrieval, to substructure matching in 3D molecular matching, to the docking problem. The method presented in [8] compares protein structures at the atomic level. The invariants used are the affine coordinates in a reference frame formed by quadruples of points (typically, Cα atoms).

It is important to stress that the choice of invariant properties is in many respects the most crucial one, since it affects both the storage requirements and the performance of the method. The hashing function determines the size of the hash table and therefore the storage requirements, since the tuple of numbers corresponding to a given invariant gives the dimensionality of the table. The distribution of the invariants also determines the length of the buckets of the table and consequently the time needed to process a query.

In our approach we use high-level properties of triplets of segments associated with secondary structures to index the hash table where the proteins are stored. More precisely, we associate with each triplet of secondary structures of a protein the best-fit linear segments and use as invariant properties the cosines of the three dihedral angles associated with the three segments. The matching procedure attempts to find associations between triplets of secondary structures of the target protein and triplets of secondary structures of stored proteins. In the following, we describe the table construction phase and then the voting process used to formulate hypotheses of association for a target structure.

2.1.1. Building the table

Let P be a protein and $(P_{s_i}, P_{s_j}, P_{s_k})$ be a triplet of segments of P, where each segment is associated with either an α-helix or a β-strand. Furthermore, let $\alpha_{sr}$ be the dihedral angle formed by two segments $P_s$ and $P_r$. The dihedral angle between two segments, i.e. the angle formed by the two planes perpendicular to the straight lines containing the segments, is defined in the range [0°, 180°]. The three cosines $(\cos\alpha_{ij}, \cos\alpha_{jk}, \cos\alpha_{ki})$, quantized into uniform intervals of size cell_size, are used as the first three indexes for the hash table. A fourth index, triplet_type, is introduced to distinguish the different combinations of triplets of segments, that is, whether all three segments are associated with α-helices (triplet_type = 0), or one segment is associated with an α-helix and the other two with β-strands (triplet_type = 1), and so forth.
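The index computation just described can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the representation of a segment as a pair of end points, and the encoding of triplet_type for the cases the text leaves as "and so forth" (here 2 for two helices and one strand, 3 for three strands) are assumptions.

```python
import math

CELL_SIZE = 0.1  # quantization step reported in the paper's experiments

def direction(segment):
    """Unit vector of a segment given as (start_xyz, end_xyz)."""
    start, end = segment
    d = [e - s for s, e in zip(start, end)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]

def cos_angle(seg_a, seg_b):
    """Cosine of the angle in [0, 180] degrees between two segments."""
    c = sum(a * b for a, b in zip(direction(seg_a), direction(seg_b)))
    return max(-1.0, min(1.0, c))  # guard against rounding outside [-1, 1]

def quantize(c, cell_size=CELL_SIZE):
    """Bin number of a cosine in [-1, 1]; cos = 1 falls in the last bin."""
    n_bins = round(2.0 / cell_size)
    return min(int((c + 1.0) / cell_size + 1e-9), n_bins - 1)

# Assumed encoding, keyed by the number of helices in the triplet.
# Only the values 0 and 1 are stated in the text.
TRIPLET_TYPE = {3: 0, 1: 1, 2: 2, 0: 3}

def triplet_index(elements):
    """elements: three (kind, segment) pairs, kind 'H' (helix) or 'E' (strand)."""
    (ki, si), (kj, sj), (kk, sk) = elements
    cosines = (cos_angle(si, sj), cos_angle(sj, sk), cos_angle(sk, si))
    n_helices = sum(kind == "H" for kind in (ki, kj, kk))
    return tuple(quantize(c) for c in cosines) + (TRIPLET_TYPE[n_helices],)
```

With a cell size of 0.1 each cosine falls into one of 20 bins, so three mutually perpendicular helices would map to bin 10 on every angular axis.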
To minimize the size of the four-dimensional table, the first three quantized indexes are sorted so that the first is less than or equal to the second and the second is less than or equal to the third. Thus only the upper portion of the table is used. Each cell of the table stores information about all triplets that hash into it. More precisely, the cell with indexes (a, b, c, t) contains the list of records of triplets of segments with angular values between (a, b, c) and (a + cell_size, b + cell_size, c + cell_size). The segments in a cell are associated with secondary structures with triplet_type = t. Each record contains the following information:

* the name of the protein containing the three secondary structures associated with the three segments;
* the starting and ending residue number of each of the three secondary structure elements;
* the distances between each pair of segments in the triplet.
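A minimal sketch of the table construction under these rules is given below. The dictionary-based element layout (`kind`, `dir`, `mid`, `start`, `end`), the function names and the triplet_type encoding beyond the values 0 and 1 are assumptions made for illustration; distances are taken between segment midpoints and carried along when the indexes are sorted.

```python
import math
from collections import defaultdict
from itertools import combinations

CELL_SIZE = 0.1
N_BINS = 20
TRIPLET_TYPE = {3: 0, 1: 1, 2: 2, 0: 3}  # assumed, keyed by helix count

def cos_bin(u, v):
    """Quantized cosine of the angle between two unit vectors."""
    c = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return min(int((c + 1.0) / CELL_SIZE + 1e-9), N_BINS - 1)

def insert_protein(table, name, elements):
    """Insert every triplet of secondary structure elements of one protein.

    elements: dicts with 'kind' ('H' or 'E'), 'dir' (unit vector),
    'mid' (midpoint), 'start' and 'end' (residue numbers).
    """
    for trio in combinations(elements, 3):  # O(n^3) triplets
        i, j, k = trio
        # Pair each quantized cosine with its midpoint distance, then sort,
        # so the distances follow the order of the sorted indexes.
        feats = sorted(
            (cos_bin(p["dir"], q["dir"]), math.dist(p["mid"], q["mid"]))
            for p, q in ((i, j), (j, k), (k, i))
        )
        n_helices = sum(e["kind"] == "H" for e in trio)
        key = tuple(b for b, _ in feats) + (TRIPLET_TYPE[n_helices],)
        record = (
            name,
            tuple((e["start"], e["end"]) for e in trio),
            tuple(d for _, d in feats),
        )
        table[key].append(record)  # same-protein records stay consecutive

table = defaultdict(list)
```

Because records are appended protein by protein, all records of one protein in a cell end up adjacent, which is the property the search phase relies on.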
The distance between two segments is measured as the distance between the midpoints of the segments. The distances in a record are sorted according to the order of the three cosines (indexes). These distances are used to filter out incorrect hypotheses of association in the matching process, because false results could be obtained from angular information alone. When new proteins are inserted into the table, the records are always added at the same end of the list in a cell. Thus, the records in a cell corresponding to the same protein are consecutive. The construction of the table is computationally intensive; the insertion of a single protein structure into the table requires $O(n^3)$ time, where n is the number of secondary structures of the protein.

2.1.2. Searching the table

The second phase retrieves similarity information from the table. Given a query protein Q, we extract from the table the proteins that are structurally similar to Q by the following procedure:

Step 1: Initially, all records in the table are marked unused for the target protein Q.

Step 2: All triplets of secondary structures of Q are examined, and for each such triplet $(s_i, s_j, s_k)$ the following steps are executed:
i. Compute the cosines $(\cos\alpha, \cos\beta, \cos\gamma)$ and the three distances between the associated pairs of segments.
ii. Access the cell of the table at the address given by the 4-tuple (a, b, c, triplet_type), where each of the values a, b, c is the largest multiple of cell_size that is smaller than $\cos\alpha$, $\cos\beta$, $\cos\gamma$, respectively. All proteins associated with records in that cell are examined. Let C be one such protein. The records of C (stored in consecutive cell elements) are tested against the following two conditions:
Condition 1 (compatibility constraint): the record is marked unused.
Condition 2 (distance constraint): the distances between pairs of segments in the two participating triplets are within a given threshold; in other words, each of the three differences between a distance value of the triplet of Q and the corresponding distance value of C is below a threshold.
If a record satisfying conditions 1 and 2 is found, then:
a. a vote is cast for protein C;
b. the record is marked used;
c. the remaining records corresponding to protein C are ignored.

Step 3: Formulate and rank hypotheses of matching by determining the proteins with the highest number of votes.

In the candidate matching, we seek a one-to-one mapping between triplets of C and Q. Condition 1 above ensures that a triplet of C contributes at most one vote to Q, even if it matches several triplets of Q. Conversely, a triplet of Q may cast only one vote for protein C. This is guaranteed by the fact that the procedure stops scanning the records of C as soon as an unused triplet is found that matches the given triplet of Q. Obviously, protein Q may cast multiple votes for C if several of its triplets match distinct triplets of C.

The information extracted takes the form of a list of proteins ranked according to the votes they obtained. Some of the retrieved proteins may have low similarity with the target protein and need to be filtered out. The approach we have presented provides a good basis for further analysis. The hypotheses of similarity can be verified by a procedure operating at a finer level of representation of the proteins (i.e. at the atomic level).
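The voting procedure of Steps 1 to 3 can be sketched as below. The sketch assumes a simplified table in which each cell is a list of (protein, distances) records with same-protein records adjacent, and a query already reduced to (key, distances) pairs; the function name `vote` and the parameter `d_max` are illustrative.

```python
from collections import Counter
from itertools import groupby

def vote(table, query_triplets, d_max=10.0):
    """Rank stored proteins by the number of votes received from the query.

    table: dict mapping a 4-tuple key to a list of (protein, distances).
    query_triplets: iterable of (key, distances) for the query protein.
    """
    used = set()        # Step 1: no record has been used yet
    votes = Counter()
    for key, q_dist in query_triplets:          # Step 2: every triplet of Q
        cell = table.get(key, [])
        numbered = enumerate(cell)              # (position, (protein, distances))
        for protein, records in groupby(numbered, key=lambda item: item[1][0]):
            for pos, (_, c_dist) in records:
                unused = (key, pos) not in used                   # condition 1
                close = all(abs(q - c) < d_max
                            for q, c in zip(q_dist, c_dist))      # condition 2
                if unused and close:
                    votes[protein] += 1
                    used.add((key, pos))
                    break       # skip the remaining records of this protein
    return votes.most_common()  # Step 3: proteins by decreasing votes
```

In this toy setting, a protein with two matching records in a cell can collect at most two votes from that cell, however many query triplets hash into it, which is the one-to-one behaviour that Condition 1 is meant to enforce.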
3. Designing grid-aware algorithms

Grid computing couples geographically distributed resources and offers consistent and inexpensive access to them irrespective of their physical location or access point [10]. The integration of these geographically distributed facilities, which are connected in complex ways and usually managed by different and autonomous organizations, requires the development of new technologies that allow software applications to share instruments, displays, computational and information resources through large-scale, high-performance national and international networks.

At a lower level, a grid infrastructure relies on advanced optical networks and fibers, faster microprocessors and innovative parallel architectures. Moreover, robust communication protocols and a distributed software structure for the general management of the grid are needed. Finally, sophisticated security mechanisms are needed to ensure the correct use of the grid by a heterogeneous community.

Computational grids must meet stringent and dynamically changing requirements: they must be dependable, consistent, pervasive and cheap despite the lack of centralized control. The main advantages of grids are extremely high computing power, better use of idle resources to increase aggregate throughput, as in the Condor environment [26], shared remote access to special-purpose resources or data sources, and support for collaborative work via a virtual shared space [32].

End users work on a grid in a completely transparent manner, submitting their tasks without worrying about when and where they are executed. The grid infrastructure usually offers different programming paradigms, such as message passing, RPC, multithreading, shared memory and data parallelism, possibly in an object-oriented environment. The grid architecture is highly scalable, involving end systems, clusters, intranet and internet arrangements, and results in a heterogeneous system. Grid software components should be designed to manage heterogeneity in both hardware and software resources, the time-limited availability of the computational facilities, and the geographical distribution of the machines, which possibly belong to different institutions.

The grid system software is organized according to the "bag-of-service" model and has specialized mechanisms together with general mechanisms that include resource discovery and allocation modules, resource brokers, authentication agents and grid information services. An important issue is security, which involves secure remote access and also requires a guarantee that remote (virtual) users belong to the same virtual and certified community.

Within this scenario, a correct design for the matching algorithm starts from a balanced partitioning of the dataset over the available computational resources, so that all the remote nodes can work effectively. Other important issues are the update of datasets and remote visualization.
4. Distributed algorithm

In this section we present a distributed implementation of the matching algorithms, focusing on techniques to achieve good load balancing and to minimize secondary memory usage.

The distribution of geometric invariants in the table affects the data partition. A simple block-partition scheme of the 4-dimensional table, in which distinct blocks of adjacent cells are assigned to distinct nodes, may not be convenient because of the possible bias in the distribution of geometric invariants [34]. For many indexing schemes proposed in the literature, a nonuniform distribution of the invariants has been observed even for randomly chosen input datasets. The nonuniformity has important negative effects on the performance of the method in terms of computation time and discriminating power, since the worst-case execution time needed to process a query depends on the size of the longest cell of the table.

The distribution of triplets of angles of secondary structures of a selected set of 600 proteins from the PDB was analyzed in [14]. A strong preference for entries of the table corresponding to boundary cells is observed. More precisely, cells corresponding to triplets of angles $\alpha, \beta, \gamma$ with $\gamma = \alpha + \beta$ are strongly preferred over others. In [31] the distribution of the cosines of angles between triplets of segments associated with secondary structures of proteins is compared with a theoretically obtained distribution for triplets of random, uniformly distributed unit vectors. It is shown that the distribution of the cosines of all three angles of proteins deviates significantly from the (non-uniform) distribution of randomly distributed vectors.

4.1. Data partition

From the above considerations it is clear that block partitioning would result in a highly unbalanced scheme; thus we have considered an alternative data partitioning strategy based on a partition of the proteins into subsets. Each subtable of the original table is built from a subset of the proteins.

To achieve good load balancing, one has to overcome the problem of the large difference in protein size. The number of triplets of secondary structures generated by different proteins ranges from only one to hundreds of thousands. A static analysis could be performed to optimally assign proteins to nodes; however, in our application this is not feasible because the frequent updates of the PDB would call for a reassignment of proteins to subtables. It is important to realize that the updating of the geometric hash table should be activated when new proteins are inserted in the PDB.

Based on all the above considerations, we have chosen a simple procedure that creates and updates the database in a greedy way. Assume that a set G of s proteins is to be inserted in a table that is partitioned into t subtables, assigned to distinct computational nodes. Typically, $s \gg t$. G is divided into subsets containing k proteins. Each node repeatedly gets a subset of G of size k until all proteins are assigned.
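The greedy assignment can be simulated with a short sketch. It assumes, for illustration only, that the next subset of k proteins goes to whichever node becomes idle first and that the insertion cost of a protein with n secondary structures is proportional to its number of triplets, n choose 3; the function name and the input layout are hypothetical.

```python
import heapq
from math import comb

def greedy_partition(protein_sizes, n_nodes, k):
    """Simulate handing out subsets of k proteins to the first idle node.

    protein_sizes: list of (name, number_of_secondary_structures).
    Returns (assignment, load): proteins per node and triplets per node.
    """
    chunks = [protein_sizes[i:i + k] for i in range(0, len(protein_sizes), k)]
    idle_at = [(0, node) for node in range(n_nodes)]  # (time node is idle, id)
    heapq.heapify(idle_at)
    assignment = {node: [] for node in range(n_nodes)}
    load = [0] * n_nodes
    for chunk in chunks:
        t, node = heapq.heappop(idle_at)              # first node to finish
        work = sum(comb(n, 3) for _, n in chunk)      # triplets to insert
        assignment[node].extend(name for name, _ in chunk)
        load[node] += work
        heapq.heappush(idle_at, (t + work, node))
    return assignment, load
```

In this simulation a single very large protein keeps its node busy while the remaining subsets flow to the other nodes, which is how a pull-based scheme can absorb large differences in protein size without a static analysis.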
To reduce disk access time, each node keeps two out-of-core subtables: a temporary subtable and a permanent subtable. Proteins are always inserted into the temporary subtable. When the temporary subtable has reached a given size, it is merged at once with the out-of-core permanent subtable and then emptied. Thus the temporary subtable contains the triplets of the most recently inserted proteins. Since merging the two tables is a costly operation involving disk accesses, it should not be performed often. We have experimentally determined the size of the temporary subtable that reduces the insertion time.

The simplest architectural schema for the matching can be seen as a two-layered schema, where one grid node (the master) belongs to the first layer and the other nodes (the slaves) form the second layer. During the table construction phase, the master distributes the input proteins to the various slaves, and each slave node builds and maintains its portion of the table in secondary memory. As soon as a slave has completed the insertion of the assigned list of k proteins, it signals its availability to the master, which then sends another sublist of k proteins. This process is repeated until all proteins have been examined.

This greedy approach, which randomly partitions the proteins and assigns them to the first available slave, achieves a good balance in practice. In fact, using such a random scheme for the input PDB proteins we have obtained a subtable occupancy very close to the optimal: the difference in subtable size across all nodes is less than 2% of the size of the subtables. It should be noted that the greedy distribution of the table in no way guarantees load balancing in the search, since an equal partition of proteins does not imply a uniform distribution across the cells of each subtable.

4.2. Searching the table

The search for similarity for a query protein follows the same simple master-slave paradigm as above. According to this architecture, the distributed algorithm for matching can be described as follows.

The master node is in charge of broadcasting the information about the target protein to all the slaves, sending them its features along with the specification of the voting criteria to be applied for the local ranking. The master then stands idle, waiting for the slave rankings. When all the slaves have sent back their results, the master resumes its activity, merging the local rankings to select the final rank.

A slave computes the hash information for the target protein and then searches for similarities in its local permanent table, according to the voting mode. It returns the computed rank to the master.

The amount of information to be exchanged is limited. For the insertion of a new protein, the master broadcasts the linear segments associated with all secondary structures. After the voting process has taken place separately within each subtable, the partial votes are sent back to the master.
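The master's merging step can be sketched as follows, with the message passing itself left out. The sketch assumes that each slave returns a list of (protein, votes) pairs sorted by decreasing votes and that, because the data are partitioned by protein, every protein appears in exactly one local ranking; the function name `merge_rankings` and the optional `top` cut-off are illustrative.

```python
import heapq

def merge_rankings(local_rankings, top=None):
    """Merge per-slave rankings into one global ranking.

    local_rankings: list of lists of (protein, votes), each sorted by
    decreasing votes. Returns the merged list, truncated to `top` if given.
    """
    merged = list(
        heapq.merge(*local_rankings, key=lambda pv: pv[1], reverse=True)
    )
    return merged if top is None else merged[:top]
```

Since only short (protein, votes) lists travel back to the master, the exchanged data stay small compared with the subtables themselves, in line with the remark above.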
5. Experimental results

In our experiments we used the secondary structures of proteins as extracted from the PDB. We did not include proteins without secondary structure assignments in the PDB. Furthermore, we discarded many proteins because of inaccurate data. There are in fact sometimes discrepancies between PDB secondary structure entries, the atomic coordinates and the offset numbers of secondary structures. Thus, of the about 21,000 proteins currently present in the PDB, approximately 19,500 were used in our experiments.

An SVD alignment routine [11] was used to associate a segment with each α-helix and with each strand of the β-sheets. Only the Cα atoms were considered in our tests. Secondary structures with fewer than 4 peptide bases were discarded because of the possible inaccuracy in the determination of the best-fit segment for small α-helices and strands. The average number of secondary structures of the considered proteins is 14.5. A total of more than 58,000,000 triplets of secondary structures were inserted in the table, yielding a total size of more than 5 GB.

We now give the numerical values of some thresholds used in the experiments. These values were determined empirically and can be changed according to the application. We quantized the indexes $\cos\alpha$, $\cos\beta$, $\cos\gamma$ into intervals of size cell_size equal to 0.1. Thus we built a 4-dimensional table of size 4 × 20 × 20 × 20. Because of the relatively large approximation introduced by the simplified segment representation, a smaller value of cell_size would not be adequate. In the matching procedure, we impose that the distances between pairs of segments in a triplet should be below a given threshold. We chose the value $d_{max} = 10$ Å for this threshold. Several values of this threshold were considered, and the one chosen was a compromise between the number of votes generated and the robustness to approximation errors.

A post-processing phase was included in our program to select subsets of the voted proteins according to the following simple criterion. To be selected, a candidate protein C should contain more than the lp percentage and less than the hp percentage of the number of secondary structures of the target protein Q. The post-processing phase does not need to resort to the data in the table; a small file containing information about the size of all proteins is sufficient.
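The post-processing criterion can be sketched as a simple filter over the ranked list. The function and parameter names are illustrative, and the bounds are treated as inclusive here, which is an assumption: the text above says "more than" and "less than".

```python
def size_filter(ranked, sizes, target_size, lp=50.0, hp=150.0):
    """Keep candidates whose size is within [lp%, hp%] of the target size.

    ranked: list of (protein, votes) by decreasing votes.
    sizes: dict mapping protein name to its number of secondary structures.
    target_size: number of secondary structures of the target protein.
    """
    low = target_size * lp / 100.0
    high = target_size * hp / 100.0
    return [(p, v) for p, v in ranked if low <= sizes[p] <= high]
```

For a target with 8 secondary structures, lp = 50 and hp = 150 give the range 4 to 12, so only a table of protein sizes is needed and the hash table is not consulted again.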
We conducted experiments on several proteins and report here on the results obtained with two proteins differing significantly in size and shape. A systematic comparison of our approach with SCOP and FSSP [18] will be presented in a forthcoming paper. One of the two proteins is sperm whale myoglobin (110m) consisting of 8 helices. The protein 110m is displayed with Rasmol in Fig. 4. When using myoglobin (110m) as target protein, the matching procedure gave as output a list of 4226 proteins that had at least one vote. Of these, 125 proteins have more than 5 votes. Table 1 lists the top 25 hits obtained when comparing the protein 110m against the 21,000 proteins in the database. For each element in the list, the table contains the protein name, the protein chain, and the obtained number of votes.
Fig. 4. Protein sperm whale myoglobin (110m).
Table 1. The 25 most voted proteins with target protein 110m

Protein  Chain  Vote  Protein name
112m     A      34    myoglobin
109m     A      29    myoglobin
1mcy     A      26    myoglobin
1jdo     A      26    myoglobin
111m     A      26    myoglobin
103m     A      24    myoglobin
1mno     B      23    myoglobin
108m     A      23    myoglobin
106m     A      22    myoglobin
1mwc     B      21    myoglobin
1ch7     A      21    myoglobin
102m     A      21    myoglobin
101m     A      21    myoglobin
1ch2     A      20    myoglobin
1m6m     B      19    myoglobin
1mwd     A      18    myoglobin
1mdn     B      18    myoglobin
1mdn     A      18    myoglobin
1m6c     A      18    myoglobin
1ebc     A      18    myoglobin
107m     A      17    myoglobin
1mwd     B      16    myoglobin
1cpw     A      16    myoglobin
1cp5     A      16    myoglobin
1cp0     A      16    myoglobin

Most protein names also carry a qualifier ("Mutant", "(carbonmonoxy)" or "biological unit") whose row assignment is not recoverable here.
Table 2 lists the results of the search using the same protein 110m but restricting the list to contain the proteins in the database with a number of secondary structures greater than or equal to 4 and smaller than or equal to 12. These two numbers correspond to 50% and 150% of the number of secondary structures of 110m.
Table 2. The 25 most voted proteins for target protein 110m

Protein  Chain  Vote  Protein name
112m     A      34    myoglobin
109m     A      29    myoglobin
1mcy     A      26    myoglobin
1jdo     A      26    myoglobin
111m     A      26    myoglobin
103m     A      24    myoglobin
108m     A      23    myoglobin
106m     A      22    myoglobin
1ch7     A      21    myoglobin
102m     A      21    myoglobin
101m     A      21    myoglobin
1ch2     A      20    myoglobin
1ebc     A      18    myoglobin
107m     A      17    myoglobin
1cpw     A      16    myoglobin
1cp5     A      16    myoglobin
1cp0     A      16    myoglobin
1cik     A      16    myoglobin
1ch5     A      16    myoglobin
1ch3     A      16    myoglobin
1ch1     A      16    myoglobin
1co9     A      15    myoglobin
1cio     A      15    myoglobin
2mbw     A      14    myoglobin
1ltw     A      14    myoglobin

Most protein names also carry a qualifier ("Mutant", "(carbonmonoxy)" or "biological unit") whose row assignment is not recoverable here.
The listed proteins have a number of secondary structures between 4 and 12.
For all the examined proteins, we compared the results of our alignment procedure with those of DALI [17] and, in general, found strong similarities. We are able to select a list of proteins that contains most of those reported in the FSSP [18] database. For instance, for the target protein myoglobin 110m, the 125 proteins that obtained more than 5 votes are all included in the list of proteins reported as structurally similar to myoglobin by FSSP. Furthermore, for some other target proteins we found a few more proteins (with low structural similarity) that were not included in the list of proteins in the FSSP database.

The second protein is Triose Phosphate Isomerase, TIM barrel (1TIM), which is displayed using Rasmol in Fig. 5. It consists of two chains, each containing 12 helices and 8 strands. The strands form a beta-barrel. The 25 top hits are reported in Tables 3 and 4. Again, we are able to find most of the proteins included in the list produced by FSSP.

Our experiments ran over 4 nodes of a computational grid, each node being a standard 300 MHz PC. The software is written in C using the MPI-CH library running on Globus [15]. The total time required for comparing 110m against the entire database is 18 s, of which only 2.1 s are spent searching, while the remaining seconds are used for the
Fig. 5. Triose Phosphate Isomerase, TIM barrel (1TIM).
Table 3 The 25 most voted proteins for target protein 1tim Protein
Chain
Vote
Protein name
1ypi 1ypi 1ea0
B A B
334 329 321
1e15 2ypi
B A
301 300
1c7t
A
296
1iig 1e15 1e6z 1e6n 1e6z 1ea0
A A A A B A
291 290 290 287 284 282
1aw1 1e6p 1e6r 1ci1 1e6r 1f3w 1f3w 1f3w 1f3w 1f3w 1f3w 1f3w 1f3w
B A A A B A B C D E F G H
280 280 279 278 277 277 277 277 277 277 277 277 277
Triose Phosphate Isomerase Triose Phosphate Isomerase Glutamate Synthase [Nadph] Large Chain Chitinase B Triose Phosphate Isomerase (TIM) (E.C. 5.3.1.1) Complex beta-NAcetylhexosaminidase Triosephosphate Isomerase Chitinase B Chitinase B Chitinase B Chitinase B Glutamate Synthase [Nadph] Large Chain Triosephosphate Isomerase Chitinase B Chitinase B Triosephosphate Isomerase Chitinase B Pyruvate Kinase Pyruvate Kinase Pyruvate Kinase Pyruvate Kinase Pyruvate Kinase Pyruvate Kinase Pyruvate Kinase Pyruvate Kinase
grid initialization and authentication phases. For the larger protein 1tim the total time goes up to 119 s: Even though we have not done a detailed time analysis in comparison with other existing systems, we
ARTICLE IN PRESS C. Ferrari et al. / J. Parallel Distrib. Comput. 63 (2003) 728–737
736
Table 4 The 25 most voted proteins for target protein 1tim Protein
Chain
Vote
Protein name
1qds 1k6a 1gom 1gok 1goo 1goq 1gor 1btc 1ezw 1if2 1gg0 1qnq 1ttj 1h7n 1eb3 1qns 1thf 1amk 1eom 1fh7 1fh9 1qnr 1qno 1qnp
A A A A A A A A A A A A A A A A D A A A A A A A
240 218 218 214 214 214 213 206 204 196 194 191 191 190 184 183 181 176 176 176 176 176 173 172
Triosephosphate Isomerase Xylanase I Endo-1,4-beta-Xylanase Endo-1,4-beta-Xylanase Endo-1,4-beta-Xylanase Endo-1,4-beta-Xylanase Endo-1,4-beta-Xylanase beta-Amylase (E.C. 3.2.1.2) Complex Coenzyme F420-Dependent N5 Triosephosphate Isomerase Kdop Synthase Endo-1,4-B-D-Mannanase Triosephosphate Isomerase 5-Aminolaevulinic Acid Dehydratase 5-Aminolaevulinic Acid Dehydratase Endo-1,4-B-D-Mannanase Hisp Protein Triose Phosphate Isomerase Endo-N-Acetylglucosaminidase F3 beta-1,4-Xylanase beta-1,4-Xylanase Endo-1,4-B-D-Mannanase Endo-1,4-B-D-Mannanase Endo-1,4-B-D-Mannanase
ranked proteins. The techniques that we intend to explore are based on the explicit determination of a transformation (rotation plus translation) that optimally aligns the target protein with each of the highly ranked proteins. Finally, we plan to provide a friendly user interface, to visualize the results of our matching using different protein representations (segments, secondary structures, atoms).
Acknowledgments Support for Guerra and Ferrari was provided in part by the Italian Ministry of University and Research under the FIRB Project ‘‘Enabling Platforms for High Performance Computational Grids in Scalable Virtual Organizations’’, and by the European DataGrid Project. Support for Guerra and Zanotti was also provided by the Italian Ministry of University and Research under the FIRB Project ‘‘Bioinformatics for Genomics and Proteomics’’. The collaboration of Volfango Canetti for software development is acknowledged.
The listed proteins have a number of secondary structures between 10 and 30.
References can claim that our approach compares favorably in terms of computing time. In fact most of the systems report execution times of similar order but for significantly smaller databases, consisting of few hundreds proteins. More details and results will be presented in a forthcoming paper, where the integration of matching procedures operating at different levels of protein representations is described.
6. Conclusions and future work We have presented a method for building, maintaining and searching a database of proteins, that is based on efficient indexing techniques. We reported on a distributed design and implementation on a grid environment using Globus MPI-CH. Parallel local searches enhanced the time performance, while secondary memory access was reduced using proper caching techniques involving temporary tables and file indexes. The system has been tested on several target proteins, and its results have been compared with other existing systems, showing a similar ranking for the proteins. More testing is planned using a larger grid, within the European Datagrid project. Further developments will involve the integration of the matching methods with other techniques operating at the atomic level and applied to subsets of the top
[1] R.A. Abagyan, V.N. Maiorov, An automatic search for similar spatial arrangements of α-helices and β-strands in globular proteins, J. Biomol. Struct. Dyn. 6 (5) (1989) 1045–1060.
[2] A. Alesker, R. Nussinov, H.J. Wolfson, Detection of nontopological motifs in protein structures, Protein Eng. 9 (5) (1996) 1103–1119.
[3] T. Akutsu, Protein structure alignment using dynamic programming and iterative improvement, IEICE Trans. Inform. Systems E78-D (0) (1996) 1–8.
[4] C. Branden, J. Tooze, Introduction to Protein Structure, 2nd Edition, Garland, New York, 1999.
[5] N.P. Brown, C.A. Orengo, W.R. Taylor, A protein structure comparison methodology, Comp. Chem. 27 (1996) 359–380.
[6] V. Escalier, J. Pothier, H. Soldano, A. Viari, Pairwise and multiple identification of three-dimensional common substructures in proteins, J. Comput. Biol. 5 (1998) 41–56.
[7] C. Ferrari, C. Guerra, Geometric methods for protein structure comparison, in: C. Guerra, S. Istrail (Eds.), Protein Structure Analysis and Design, Lecture Notes in Bioinformatics, Springer, Berlin, 2003.
[8] D. Fischer, O. Bachar, R. Nussinov, H. Wolfson, An efficient automated computer vision based technique for detection of three dimensional structural motifs in proteins, J. Biomol. Struct. Dyn. 9 (1992) 769–789.
[9] D. Fischer, C.J. Tsai, R. Nussinov, H. Wolfson, A 3D sequence-independent representation of the protein data bank, Protein Eng. 8 (1995) 981–997.
[10] I. Foster, C. Kesselman (Eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publisher, Los Altos, CA, 1998.
[11] M. Gerstein, A resolution-sensitive procedure for comparing protein surfaces and its application to the comparison of antigen-combining sites, Acta Crystallogr. A48 (1992) 271–276.
[12] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD, 1996.
[13] H.M. Grindley, P.J. Artymiuk, D.W. Rice, P. Willett, Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm, J. Mol. Biol. 229 (1993) 707–721.
[14] C. Guerra, S. Lonardi, G. Zanotti, Analysis of proteins secondary structures using indexing techniques, IEEE Proceedings of the First International Symposium on 3D Data Processing Visualization and Transmission, Padova, Italy, 2002, pp. 812–821.
[15] www.globus.org.
[16] L. Holm, C. Sander, 3D-Lookup: fast protein structure database searches at 90% reliability, Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Menlo Park, 1995, pp. 179–187.
[17] L. Holm, C. Sander, Mapping the protein universe, Science 273 (1996) 595–602.
[18] L. Holm, C. Sander, The FSSP database: fold classification based on structure–structure alignment of proteins, Nucleic Acids Res. 24 (1) (1996) 206–209.
[19] Y. Lamdan, J.T. Schwartz, H.J. Wolfson, Affine invariant model-based object recognition, IEEE Trans. Robot. Automat. 6 (5) (1990) 578–589.
[20] G. Lancia, S. Istrail, Protein structure comparison: algorithms and applications, in: C. Guerra, S. Istrail (Eds.), Protein Structure Analysis and Design, Lecture Notes in Bioinformatics, Springer, Berlin, 2003.
[21] G. Lancia, R. Carr, B. Walenz, S. Istrail, Optimal PDB structure alignments: a branch-and-cut algorithm for the maximum contact map overlap problem, Proceedings of the 5th ACM REsearch in COMputational Biology, Montreal, Quebec, Canada, 2001, pp. 193–202.
[22] N. Leibowitz, Z.Y. Fligelman, R. Nussinov, H.J. Wolfson, Multiple structural alignment and core detection for geometric hashing, Proceedings of ISMB99, Heidelberg, Germany, 1999, pp. 169–177.
[23] C. Lemmen, T. Lengauer, Computational methods for the structural alignment of molecules, J. Computer-Aided Mol. Design 14 (2000) 215–232.
[24] A.M. Lesk, Protein Architecture: A Practical Approach, Oxford University Press, Oxford, 1991.
[25] A. Lesk, Computational Molecular Biology, in: Encyclopedia of Computer Science and Technology, Vol. 31, Marcel Dekker, Inc., New York, 1994.
[26] M. Litzkow, M. Livny, M.W. Mutka, Condor - a hunter of idle workstations, Proceedings of the Eighth International Conference of Distributed Computing Systems, San Jose, CA, 1988, pp. 104–111.
[27] E.M. Mitchell, P.J. Artymiuk, D.W. Rice, P. Willett, Use of techniques derived from graph theory to compare secondary structure motifs in proteins, J. Mol. Biol. 212 (1989) 151–166.
[28] A.G. Murzin, S.E. Brenner, T. Hubbard, C. Chothia, SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol. 247 (1995) 536–540.
[29] J.P. Overington, Z.Y. Zhu, A. Sali, M.S. Johnson, R. Sowdhamini, G.V. Louie, T.L. Blundell, Molecular recognition in protein families: a database of aligned three-dimensional structures of related proteins, Biochem. Soc. Trans. 21 (3) (1993) 597–604.
[30] X. Pennec, N. Ayache, A geometric algorithm to find small but highly similar 3D substructures in proteins, Bioinformatics 14 (6) (1998) 516–522.
[31] D.E. Platt, C. Guerra, I. Rigoutsos, G. Zanotti, Global secondary structure packing angle bias in proteins, Proteins Structure Function Genet., in press.
[32] www.sdsc.edu.
[33] A.P. Singh, D.L. Brutlag, Hierarchical protein structure superposition using both secondary structures and atomic representations, Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, Menlo Park, 1997, pp. 284–293.
[34] H. Solomon, Geometric Probability, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1978.
[35] G. Verbitsky, R. Nussinov, H.J. Wolfson, Structural comparisons allowing hinge bendings, swiveling motions, Proteins 34 (1998) 232–254.
[36] P. Willett, Searching for pharmacophoric patterns in databases of three-dimensional chemical structures, J. Mol. Recognition 8 (1995) 290–303.