Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
Structural Protein Interactions: From Months to Minutes

P. Dafas(a), J. Gomoluch(a), A. Kozlenkov(a), and M. Schroeder(b)

(a) Dept. of Computing, City University, London, United Kingdom
(b) Biotec/Fakultät für Informatik, TU Dresden, Germany

Protein interactions are important for understanding the function of proteins. PSIMAP is a protein interaction map derived from the protein structures in the Protein Databank (PDB) and the Structural Classification of Proteins (SCOP). In this paper we review how we reduced the computation of PSIMAP from months to minutes, first by designing a new, effective algorithm and second by distributing the computation over a Linux PC farm using a simple scheduling mechanism. From this experience we derive some general conclusions: besides computational complexity, most problems in bioinformatics require information integration, and this integration is semantically complex. We sketch our relevant work to tackle computational and semantic complexity. Regarding the former, we classify problems and infrastructure and review how market-based load balancing can be applied. Regarding the latter, we outline a Java-based rule engine, which allows users to specify workflows declaratively, separate from implementation details.

*This project was supported by the European Commission (BioGrid project, IST 2002-2004).

1. INTRODUCTION

While DNA sequencing projects generate large amounts of data, and although there are some 1,000,000 annotated and documented proteins, comparatively little is known about their interactions. Interactions are, however, the vital context in which to interpret protein function. To complement experimental approaches to protein interaction, PSIMAP, the protein structure interaction map [10], derives domain-domain interactions from known structures in the Protein Databank PDB [1]. The domain definitions that PSIMAP uses are provided by SCOP [7].

The original PSIMAP algorithm computes all pairwise atom/residue distances for each domain pair of each multi-domain PDB entry. At the residue level this takes days; at the atomic level, even months. We approached this computational challenge from two directions: first, we developed an effective new algorithm, which substantially prunes the search space; second, we distributed the computation over a farm of 80 Linux PCs. The combined effort reduced the computation time from months to minutes.

The basic idea of our novel algorithm is to prune the search space by applying a bounding shape to the domains. Interacting atoms of two domains can only be found in the intersection of the bounding shapes of the two domains; all atoms outside the intersection can be discarded. The quality of the pruning depends on how well the bounding shape approximates the domain. The proposed algorithm uses convex hulls as bounding shapes to best approximate the structure of the domains.
Figure 1. Checking interactions of a coiled-coil domain pair of C-type lectins from PDB entry 1kwx, an X-ray structure of rat mannose protein A complexed with β-Me-Fuc, shown on the left with RasMol. The old PSIMAP algorithm requires an expensive pairwise comparison of all atoms/residues of the two domains (third figure). Using the proposed algorithm (fourth figure), it is only necessary to pairwise compare the atoms/residues in the intersection of the two convex hulls.

The new PSIMAP algorithm. Given: two domains with residue/atom coordinates.
1. A convex hull for each of the two domains is computed.
2. Both convex hulls are swelled by the required contact distance threshold.
3. The intersection of the two transformed convex hulls is computed.
4. All residue/atom pairs outside the intersection are discarded, and for the remaining residues/atoms the number of residue/atom pairs within the distance threshold is computed. If this number exceeds the number threshold, the two domains are said to interact.

Figure 2. The backbone of the new algorithm.
The new algorithm requires 60 times fewer residue-pair comparisons than the old PSIMAP algorithm for the PDB and works extremely well at the atomic level, as most atoms outside the interaction interface are immediately discarded. Overall, the new algorithm computes all domain-domain interactions at the atomic level within 20 minutes when distributed over 80 Linux machines.

2. EFFECTIVE CALCULATION OF PROTEIN INTERACTIONS

The original PSIMAP algorithm computes all pairwise atom/residue distances for each domain pair of each multi-domain PDB entry. Formally: let $d$ be a distance threshold and $t$ a number threshold, and let $D_1 = \{x_1, \ldots, x_n\}$ and $D_2 = \{y_1, \ldots, y_m\}$, where $x_i, y_j \in \mathbb{R}^3$, be the two sets of (atom/residue) coordinates of the two domains. The set of interacting residue/atom pairs is defined by

$$I(D_1, D_2) = \Big\{ (x_i, y_j) \;\Big|\; \sqrt{\textstyle\sum_{k=1}^{3} (x_{ik} - y_{jk})^2} \le d \Big\}.$$

The domains $D_1$ and $D_2$ are said to interact if and only if $|I(D_1, D_2)| \ge t$. The interaction parameters that PSIMAP uses are a distance threshold of 5 Å and a number threshold of 5 atom/residue pairs (at the atom level, pairs of atoms belonging to different residues), respectively.
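The exhaustive test is a direct transcription of this definition. Below is a minimal NumPy sketch for reference; the function name and the n×3 coordinate-array layout are our own assumptions, and the additional atom-level requirement that counted pairs stem from different residues is omitted for brevity:

    import numpy as np

    def interacts_bruteforce(dom_a, dom_b, d=5.0, t=5):
        # Old PSIMAP test: compute all pairwise distances between the two
        # coordinate sets and count the pairs within the distance threshold d.
        diff = dom_a[:, None, :] - dom_b[None, :, :]   # shape (n, m, 3)
        dist = np.sqrt((diff ** 2).sum(axis=2))        # shape (n, m)
        return (dist <= d).sum() >= t                  # number threshold t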
The proposed algorithm uses convex hulls as bounding shapes to best approximate the structure of the domains. The flow of the algorithm for interaction detection at the residue/atom level is given in Fig. 2. A convex hull is represented as a list of triangles, each being a list of three vertices. The complexity of computing a convex hull in three dimensions using the divide-and-conquer technique [11] is O(n log(n)), where n is the number of atoms/residues, so step 1 in Fig. 2 can be carried out efficiently. Steps 2 and 3 can both be done in linear time.

In step 2 we expand each convex hull constructed in step 1 by the distance threshold d of the PSIMAP algorithm. This is done by shifting each face polygon of the convex hull perpendicularly away from the hull by d. Essentially, for each face polygon p we compute a vector v perpendicular to p whose direction points out of the convex hull, normalise v to unit length, multiply it by the distance threshold d, and add the result to each of the vertices of p, thus shifting the face away from the hull and swelling it.

Figure 3. The case in 2D: after the two convex hulls are expanded by the distance threshold, the atoms/residues of the first domain that lie in the intersection are visible from all faces of the expanded convex hull of the second domain.

In step 3 we need to compute which atoms/residues lie in both transformed convex hulls. To determine whether an atom/residue u is inside a convex hull, we use the notion of signed volumes of polyhedra as defined in [9]. Fundamentally, a point lies inside a convex hull if and only if each of the signed volumes of the tetrahedra defined by the point and each of the triangles of the surface is positive [9]. This can be computed in linear time. Using this property we check for every atom/residue of domain D1 whether it is inside the transformed convex hull of domain D2, in which case it belongs to the intersection (Fig. 3), and symmetrically for the atoms/residues of D2. In the final step, step 4 in Fig. 2, the algorithm checks in detail how many atom/residue pairs in the hulls' intersection are within the distance threshold.
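The swelling and membership tests can also be phrased through the halfspace representation of a convex hull: shifting every face plane outward by d along its outward unit normal is the same as relaxing each face inequality by d, and a point satisfying all relaxed inequalities lies in the swelled hull. This is the same criterion the signed-volume test of [9] decides face by face. The sketch below illustrates steps 1-3 under the assumption that NumPy and SciPy are available and that domains are n×3 NumPy arrays; scipy.spatial.ConvexHull stores one row [n, b] per face in its equations attribute, with outward unit normal n and n·x + b <= 0 for interior points. The function names are ours, not PSIMAP's.

    import numpy as np
    from scipy.spatial import ConvexHull

    def in_swelled_hull(points, hull, d):
        # A point x is inside the hull swelled by d iff n.x + b <= d holds
        # for every face (n, b); with unit normals this shifts each face
        # plane outward by exactly d (step 2), and testing all faces per
        # point is the linear-time membership check (step 3).
        normals, offsets = hull.equations[:, :3], hull.equations[:, 3]
        return np.all(points @ normals.T + offsets <= d, axis=1)

    def prune_to_intersection(dom_a, dom_b, d=5.0):
        # Step 1: convex hull of each domain; steps 2+3: keep only the
        # atoms/residues inside the swelled hull of the other domain.
        hull_a, hull_b = ConvexHull(dom_a), ConvexHull(dom_b)
        return (dom_a[in_swelled_hull(dom_a, hull_b, d)],
                dom_b[in_swelled_hull(dom_b, hull_a, d)])

Only the two reduced coordinate sets are then passed to the exhaustive pair count of step 4. Note that offsetting face planes yields a slightly larger region than the exact Minkowski expansion (edges and corners are not rounded off), so the pruning is conservative and cannot discard a truly interacting pair.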
3. DISTRIBUTED COMPUTATION

Before we evaluate the pruning capabilities of the above algorithm, we present the distributed computation of PSIMAP. Two problems arise: can the computation be distributed at all, and if so, how can the load be balanced to achieve the best overall performance?

The calculation of PSIMAP with both the old and the new algorithm is a so-called "embarrassingly parallel" problem, i.e. the computation can easily be partitioned into independent subproblems that do not require any communication and can hence be distributed over a loosely coupled network of computers. Each process is assigned the task of executing the PSIMAP algorithm for a given set of PDB entries; it thus calculates the part of PSIMAP that corresponds to the domain interactions identified by the PDB files assigned to it. After all processes finish, the results are collected and merged to produce the overall PSIMAP. The distributed implementation of PSIMAP uses the new effective algorithm described in the previous section.

We have tested the PSIMAP calculation in a distributed environment of Linux workstations with clock speeds between 600 and 800 MHz. Each workstation is assigned a task that corresponds to the calculation of a subset of PSIMAP. Since the processing power of the workstations is roughly the same, the assigned tasks should be equal in terms of computation size. We estimate the computation size of one PDB file as a function of the number of protein domains and the size of the protein in terms of the number of residues. To simplify the analysis we further assume that the size of the protein is proportional to the size s of the PDB entry. Furthermore, given p, the number of domains of a PDB entry, there are p(p-1)/2 potentially interacting domain pairs. Overall, we estimate the processing time for one PDB entry of size s with p domains as f(s, p) = s · p(p-1)/2.

Given a number c of computers/processors, we split the whole PDB into c tasks that are almost equal in terms of estimated execution time, using a greedy strategy: we sort the single-entry tasks in descending order of their processing-time estimates and then repeatedly assign the next task from the sorted list to the computer/processor that currently has the least cumulative processing-time estimate, updating that estimate after each assignment, as sketched below. Each computer/processor then executes its assigned task, and finally all results are collected and inserted into a database. Using the above load-balancing strategy, we ran the convex hull algorithm for PSIMAP on a farm of 80 Linux PCs.
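A minimal sketch of this greedy (longest-processing-time-first) partitioning, assuming each PDB entry is described by an identifier, its file size s, and its domain count p. The dictionary layout, function names, and the example entry ids are illustrative, and a heap stands in for re-scanning the cumulative estimates:

    import heapq

    def estimate(size, ndomains):
        # Estimated processing time of one PDB entry: f(s, p) = s * p(p-1)/2.
        return size * ndomains * (ndomains - 1) / 2

    def partition(entries, n_cpus):
        # Greedy LPT schedule: sort tasks by descending cost estimate, then
        # always hand the next task to the least-loaded processor.
        tasks = sorted(entries, key=lambda e: estimate(e["size"], e["domains"]),
                       reverse=True)
        heap = [(0.0, cpu, []) for cpu in range(n_cpus)]   # (load, cpu id, tasks)
        heapq.heapify(heap)
        for task in tasks:
            load, cpu, assigned = heapq.heappop(heap)      # least cumulative load
            assigned.append(task["id"])
            heapq.heappush(heap, (load + estimate(task["size"], task["domains"]),
                                  cpu, assigned))
        return {cpu: assigned for _, cpu, assigned in heap}

    # Example: three entries split over two processors.
    print(partition([{"id": "1kwx", "size": 900, "domains": 4},
                     {"id": "2abc", "size": 500, "domains": 2},
                     {"id": "3xyz", "size": 300, "domains": 3}], 2))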
4. ADDRESSING COMPUTATIONAL COMPLEXITY: MARKET-BASED LOAD-BALANCING
The PSIMAP application requires the computation of a large number of similar tasks, which all belong to the same user and which are all known a priori. The allocation of computational resources is therefore relatively straightforward and can be carried out in the way described above. This is usually not the case in computational Grids, which are open systems with multiple users who may belong to different organisations. One important problem in such environments is the efficient allocation of computational resources. Over the past few years, economic approaches to resource allocation have been developed [12], and the question arises whether they can be applied to resource allocation on the Grid. They satisfy some basic requirements of a Grid setting, as they are naturally decentralised and as decisions about whether to consume or provide resources are taken locally by the clients or service providers. The use of currency offers incentives for service providers to contribute resources, while clients have to act responsibly due to their limited budgets. Indeed, a number of systems [2, 8, 13] have been built using a market mechanism to allocate computational resources.

However, the performance of market-based resource allocation protocols has not been sufficiently studied. It depends on many factors which characterise a computational environment: typically, parameters such as the number of resources, resource heterogeneity, and
communication delays are low in a cluster but high in a Grid. In our work, we use simulations to compare the performance of several resource allocation protocols for various computational environments, loads, and task types. The aim is to determine which protocol is most suitable for a given situation. We study several scenarios in which the clients have different requirements concerning the execution of their tasks: e.g., clients may expect their tasks to be executed as fast as possible, or their tasks may have completion deadlines. Furthermore, tasks belonging to different users may have different values and therefore need to be weighted accordingly. In the examined scenarios, we use performance metrics which reflect these requirements.

In our preliminary experiments [4, 5], we compared the performance of the Continuous Double Auction protocol (CDA), the Proportional Share Protocol (PSP), and the Round-Robin protocol (RR). We investigated how these protocols perform when the objective is to minimise task completion times. Our results show that, in a cluster with homogeneous resources and negligible communication delays, CDA performs best. However, if the load is low, the differences between the three protocols are small and PSP performs as well as CDA. In a scenario with heterogeneous resources, Round-Robin performs worse than the two market-based protocols. CDA is best in most cases, but if resource heterogeneity is very high it may be outperformed by PSP. For a high number of resources, PSP performs as well as CDA. If communication delays are high, PSP performs best, whereas the other protocols degrade strongly.

These results are limited to the allocation of tasks without deadlines and consider only three protocols. In our current work, we investigate a wider range of scenarios and protocols. In particular, we examine the benefits of preemption and of opportunistic pricing by the servers. We also carry out experiments in a real computational environment in order to verify certain parameters of the simulation model, including communication delays and processing delays.
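As a toy illustration of the proportional-share idea (a deliberate simplification, not the protocol as implemented in our simulations or in [13]; the single-server setting and all names are ours): a server divides its capacity among concurrent tasks in proportion to what each client pays, so a task's speed depends on its own bid relative to the competing bids.

    def proportional_shares(bids, capacity=1.0):
        # Split a server's capacity among tasks in proportion to their bids.
        total = sum(bids.values())
        return {task: capacity * bid / total for task, bid in bids.items()}

    # Two clients competing on one server: the client paying three times as
    # much receives three quarters of the CPU.
    print(proportional_shares({"task_a": 3.0, "task_b": 1.0}))
    # -> {'task_a': 0.75, 'task_b': 0.25}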
5. ADDRESSING SEMANTIC COMPLEXITY: RULE-BASED INFORMATION INTEGRATION

Frequent changes and extensions to analysis workflows, as well as the multiplicity and rapid rate of change of the relevant data sources, contribute to the complexity of the data and computation integration tasks that arise when analysing protein interactions. Although, in theory, teams of programmers could continue updating and maintaining the code for the analysis workflows in traditional procedural languages such as Java or Perl, the difficulty of maintaining such code and its lack of transparency for non-programmers, or even for programmers who are not the original authors, make this impractical.

We have designed and implemented a rule-based Java scripting language, Prova [6], to address these issues. Prova extends the open-source Mandarax [3] inference engine implemented in Java. The language offers a clean separation between data integration logic and computation procedures. The use of rules allows one to declaratively specify the integration requirements at a high level, without any implementation details. The transparent integration of Java provides easy access to databases, web services, and arbitrary Java computation tasks. In this way Prova combines the advantages of rule-based programming and object-oriented programming in Java. To demonstrate the power of Prova, we show below how the main workflow of the PSIMAP computation is implemented as rules that call Java methods to perform the actual computation. A Prova program consists of
Prova rules:    L :- L1, ..., Ln.
Prova facts:    L.
Prova queries:  solve(L).
Here L, L1, ... are literals, ":-" is read as "if", and commas as "and". Literals can be constants, terms, or arbitrary Java objects.

    scop_dom2dom(DB, PDB_ID, PXA, PXB) :-       % Rule A
        access_data(pdb, PDB_ID, Prot),         % Access the PDB data
        scop_dom_atoms(DB, Prot, PXA, DomA),    % Retrieve domain definitions
        scop_dom_atoms(DB, Prot, PXB, DomB),
        DomA.interacts(DomB).                   % Call Java to test interaction

    scop_dom_atoms(DB, Prot, PX, Dom) :-        % Rule B
        findall(                                % Collect the domain definition
            psimap.SubChain(PX, C, B, E),
            sql_select(DB, subchain, [px, PX], [chain_id, C],
                       [begin, B], [end, E]),
            DomDef),
        Dom = Prot.getDomain(DomDef).           % Create a Java object with the domain
The listing above shows Rule A, which creates a virtual table representing interacting domains PXA and PXB by accessing the PDB data in a Java object Prot corresponding to PDB_ID, invoking Rule B twice to retrieve the domain definitions DomA and DomB, and finally calling the Java method interacts() to test whether the domains interact. Rule B, used by Rule A, assembles a domain definition by accessing the database table subchain in the SCOP database.

The declarative style of rules provides a transparent and maintainable design. For example, access to the PDB data via the access_data predicate above automatically scans alternative mirror-location records for the PDB data. If a location is available under the current local network and disk configuration, that location is cached and subsequent accesses to the PDB are directed there; otherwise, alternative mirrors are non-deterministically retried until the PDB data can be located. The declarative style and intrinsic non-determinism of rules offer better transparency and maintainability than the usual for and if constructs of procedural languages.

6. EXPERIMENTAL RESULTS

The new effective PSIMAP algorithm dramatically reduces the number of distance comparisons required at both the residue and the atom level. The algorithm, using convex hulls to approximate domain structures, often picks out exactly the interacting atoms/residues, so that no further checks are needed. Overall, the new algorithm leads to a 60-fold improvement at the residue level. Importantly, the reduction in candidate comparisons does not introduce any false negatives.

To reduce the computation time beyond the improvements achieved through the pruning of the new algorithm, we distribute the computation of the convex hull algorithm over 80 Linux PCs as described above. As summarised in Fig. 5, the algorithm computes the interactions for the whole of the PDB in 15 minutes at the residue level and in 20 minutes at the atom level. This compares to several days' computation with the old PSIMAP algorithm at the residue level and an estimated several months at the atom level.

The results in Fig. 4, documenting the pruning, and in Fig. 5, documenting the joint reduction by pruning and distribution, show that our approach successfully overcomes two problems of the original PSIMAP computation.
Figure 4. Residue comparisons needed for the new algorithm: the number of residue-pair comparisons (y-axis, logarithmic scale) for different distance thresholds (x-axis, in Ångström). OLD represents the exhaustive search, NEW the convex hull method, and AA the actual number of interacting residue pairs.

PSIMAP           Residue   Atom
OLD on 1 PC      days      months
NEW on 1 PC      4.5 h     20 h
NEW on 80 PCs    15 min    20 min
Figure 5. The table shows how the new algorithm reduces the computation times. While the OLD algorithm was estimated to take months at the detailed atomic level and hence was not feasible, the NEW algorithm distributed over 80 Linux machines reduces the computation time to 20 minutes. The new algorithm takes only marginally longer at the atomic level (20 minutes) than at the residue level (15 minutes) because the larger problem size leads to better overall load balancing and CPU usage.

Due to the pruning, the computation becomes feasible at the more detailed and accurate atom level; due to the combined effect of pruning and distribution, the computation of domain-domain interactions scales and will be able to handle the superlinear growth of the PDB.

7. CONCLUSION

In this paper we demonstrated how to reduce the computation of PSIMAP from months to minutes: first by designing a new effective algorithm using computational geometry, in particular convex hulls, and second by distributing the computation over a Linux PC farm using a simple scheduling mechanism. From this experience we also derived some more general conclusions. Most problems in bioinformatics have a dual nature: the first aspect is computational complexity, the second is semantic complexity, since almost all of them require intelligent information integration. Currently, systems are often poorly engineered, with little attention to reusability, scalability, and maintainability. We have sketched our relevant work on both aspects.
Regarding computational complexity, before parallelising, the problem should be specified, with all its constraints, as precisely as possible; it is quite common for problems to be formulated too generally, with no clear specification. When an algorithmic solution to the problem exists, its complexity should be evaluated and reduced as far as possible; at that stage, extensive refactoring or even a redesign of the algorithm may be necessary. Regarding semantic complexity, most problems in bioinformatics involve different, often distributed, data sources. Workflows that integrate such data should be specified at a high level of abstraction, independent of any implementation details. The rule-based approach outlined in this work is one way to realise such executable specifications.

REFERENCES
[1] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne. The Protein Data Bank. Nucl. Acids Res., 28:235-41, 2000.
[2] R. Buyya, J. Giddy, and D. Abramson. An evaluation of economy-based resource trading and scheduling on computational power Grids for parameter sweep applications. In Proc. of the 2nd Workshop on Active Middleware Services, Pittsburgh, USA, 2000. Kluwer.
[3] J. Dietrich. Mandarax. http://www.mandarax.org.
[4] J. Gomoluch and M. Schroeder. Market-based resource allocation for Grid computing: a model and simulation. In Proc. of the 1st Intl. Workshop on Middleware for Grid Computing (MGC2003), Rio de Janeiro, Brazil, June 2003.
[5] J. Gomoluch and M. Schroeder. Performance evaluation of market-based resource allocation for Grid computing. To appear in Concurrency and Computation: Practice and Experience, 2004.
[6] A. Kozlenkov and M. Schroeder. Prova language. http://comas.soi.city.ac.uk/prova.
[7] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-40, 1995.
[8] N. Nisan, S. London, O. Regev, and N. Camiel. Globally distributed computation over the Internet: the POPCORN project. In Proc. of the 18th Intl. Conf. on Distributed Computing Systems, Amsterdam, Netherlands, 1998. IEEE.
[9] J. O'Rourke. Computational Geometry in C. Cambridge University Press, second edition, 1998.
[10] J. Park, M. Lappe, and S.A. Teichmann. Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J. Mol. Biol., 307:929-38, 2001.
[11] F.P. Preparata and S.J. Hong. Convex hulls of finite sets of points in two and three dimensions. Communications of the ACM, 20:87-92, 1977.
[12] T. Sandholm. Distributed rational decision making. In G. Weiss, editor, Multi-agent Systems. MIT Press, 2000.
[13] C.A. Waldspurger, T. Hogg, B.A. Huberman, J.O. Kephart, and W.S. Stornetta. Spawn: a distributed computational economy. IEEE Trans. Software Engineering, 18(2):103-117, 1992.