Gene 209 (1998) GC39–GC43
JADE: An approach for interconnecting bioinformatics databases
Lincoln D. Stein a, Sam Cartinhour b, Danielle Thierry-Mieg c, Jean Thierry-Mieg c,*
a MIT, WI/MIT Center for Genome, Cambridge, MA, USA
b Texas A&M University, College Station, TX, USA
c CNRS, CRBM and Physique Mathematique, Montpellier, France
Abstract
To achieve the integration of biological data available on the World Wide Web and maintained in diverse sources such as GDB, Genbank or Acedb, we have developed a software system called Jade. Jade allows programmers to create analytic tools and graphical user interfaces for one or more existing bioinformatics data sources. These tools can then be interchanged, compared and reused without modifying the data sources themselves. The system is implemented in the Java programming language and will run equally well on Macintosh, Windows or Unix workstations. Jade is free and can be used immediately by all interested parties. © 1998 Elsevier Science B.V.
Keywords: JADE; Java; Database; Integration
* Corresponding author. E-mail: [email protected]
0378-1119/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0378-1119(97)00672-0
1. Introduction
Despite the revolution in accessibility that the World Wide Web and network browsers such as Mosaic and Netscape have brought to on-line biological databases such as GDB, Genbank, and PDB, the differences in access protocols, object names, and data organization among these databases remain a source of fundamental semantic chaos for anyone trying to understand and integrate data derived from these sources. The consequence of this chaos is that visualization and analysis tools are not reusable. Tools developed for use with one data source are not functional when applied to another (Williams, 1997).
Of course, a first-order integration is provided by the careful design, by Web authors, of hypertext links among some data sources. The interested user can browse and jump from DNA to protein to structural information to genetics. However, Web browsers only provide static text and canned graphic displays, each adapted to the single data source for which it was created. This remains true for the latest developments, which use Web-based, platform-independent interfaces such as Java (Anuff, 1997) or the Common Gateway Interface (CGI) (Fawcett and Jepson, 1996). What is missing today is an extra layer of abstraction that will allow Web-based data
analysis and display tools to operate on data sources that speak different languages and use different underlying data models.
To fulfill this role, we have developed a software system called Jade, which acts as an adapter between data source servers and application programs for analysis and data visualization. No modification of the data sources is required, and existing analysis programs written in Java can be adapted to implement the Jade interface.
In intranet applications, one typically controls the schema of the data source. In this case, there is no semantic chaos and Jade reduces to a portable protocol for transferring data between servers and generic browsing tools. This may accelerate the development of in-house data visualization programs, and it provides an abstraction level that simplifies reuse if the database changes.
However, when the server schema is not under the control of the application developer, or when data sources with multiple incompatible schemas need to be combined, one faces the deep problem of unifying their data models. A central design principle of Jade is to defer this problem of semantic integration and allow it to be solved piecewise for each application. Databases are coupled a posteriori by giving them a common API for connection, query, and retrieval. Application developers acquire data in the form of named relational tables. The
implementation of the table view is accomplished by a thin server-specific layer that sits between the data server and the application code. The databases may be starkly different, but the data tables they serve are homogeneous from the application’s point of view. Thus, the developer is freed from the burden of global semantic integration and needs only to provide an ad hoc integration for the application at hand.
On the client side, Jade also supports an object-oriented view in addition to its table-oriented view. This may be more natural for representing biological information. However, we have only implemented a Jade-to-ACEDB adapter and cannot estimate the difficulty of providing full object-oriented views for other types of object-oriented servers.
The Jade system is written in the Java programming language, which provides a cross-platform graphical interface that runs equally well on Macintosh, Windows or Unix systems.
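The idea of a common API for connection, query, and retrieval of named relational tables can be illustrated with a minimal Java sketch. It is not taken from the Jade distribution: the names `Table`, `TableSource`, `InMemorySource` and the table name `MappedLoci` are assumptions made for illustration, and the "server" is simply an in-memory map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these names come from the Jade distribution. */
class NamedTableApiSketch {

    /** A relational table: fixed column names plus rows of text cells. */
    static final class Table {
        final String name;
        final List<String> columns;
        final List<List<String>> rows = new ArrayList<>();

        Table(String name, String... columns) {
            this.name = name;
            this.columns = Arrays.asList(columns);
        }

        void addRow(String... cells) {
            if (cells.length != columns.size()) {
                throw new IllegalArgumentException("row width does not match columns");
            }
            rows.add(Arrays.asList(cells));
        }
    }

    /** Hypothetical common API: connect, then retrieve a named table with parameters. */
    interface TableSource {
        void connect();
        Table fetch(String tableName, Map<String, String> parameters);
    }

    /** A thin server-specific layer; here the "server" is an in-memory map. */
    static final class InMemorySource implements TableSource {
        private final Map<String, Table> tables = new HashMap<>();
        private boolean connected;

        void publish(Table table) {
            tables.put(table.name, table);
        }

        @Override
        public void connect() {
            connected = true;
        }

        @Override
        public Table fetch(String tableName, Map<String, String> parameters) {
            if (!connected) {
                throw new IllegalStateException("connect() must be called first");
            }
            Table table = tables.get(tableName);
            if (table == null) {
                throw new IllegalArgumentException("unknown table: " + tableName);
            }
            return table; // parameters are ignored by this toy source
        }
    }

    /** Application code depends only on the TableSource interface. */
    static int countRows(TableSource source, String tableName) {
        source.connect();
        return source.fetch(tableName, Collections.<String, String>emptyMap()).rows.size();
    }

    public static void main(String[] args) {
        InMemorySource source = new InMemorySource();
        Table loci = new Table("MappedLoci", "locus", "position");
        loci.addRow("locusA", "0.0"); // made-up test data
        loci.addRow("locusB", "2.5");
        source.publish(loci);
        System.out.println(countRows(source, "MappedLoci") + " rows");
    }
}
```

Because `countRows` sees only the interface, the in-memory source could be swapped for a relational or object-oriented adapter without touching the application code, which is the property the text attributes to the table view.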
2. The semantic chaos
The origin of the problem is the great heterogeneity of the various biological database servers currently in wide use. The first difficulty is technical. A client application such as a data analysis or visualization tool cannot communicate with a database server unless it speaks the same language. However, the various database systems (flat file systems such as Access, relational managers such as Sybase, or object-oriented systems such as ObjectStore or Acedb) use different query languages and different underlying communication protocols. This technical difficulty could be solved by something like a universal query language. However, even in the (unlikely) event of such an advance, the integration of diverse biological databases would still founder, because the language problem only hides a much deeper semantic division.
The simplest, but in many ways the most profound, differences are conflicts in the way concepts are named. For excellent reasons, the term ‘locus’ may signify, in a particular database, a complementation group, a gene, a DNA sequence encoding a transcript, or a cytogenetic position. In each case, the term is well-defined and consistent within the database, but it is meaningless to attempt to compare locus objects that have been retrieved from two or more databases.
Enormous efforts are needed to create and maintain a biological server, so displays and analysis tools tend to be written to access this single source. The details of the semantics of the server are then used in the application code, and the exact nature of the data required by the display is not documented. This problem is nicknamed the ‘magic tag syndrome’ in the Acedb community, where a series of poorly documented fields in the schema were used to select how to display a
particular object. This led to an unwanted dependency between the data model and application displays, and hindered the reuse of applications over different data sets.
The overly intimate relationship between the servers and their client applications is also detrimental to the databases themselves. Since they have to remain compatible with their legacy clients, their evolution eventually reaches a standstill.
To resolve these difficulties, several approaches are possible:
(1) One may decide to adopt a universal semantic model. This is a utopian vision that is highly unlikely to achieve consensus among the community, and would have the undesirable characteristic of imposing uniformity on what is currently a creative and dynamic dialogue.
(2) Without modifying the existing data sources, one may download all the data into a single database system. All client applications then extract the data from this central place. This is technically feasible, because the central database can be virtual, but it is difficult to maintain. This is the approach advocated by the integrated genome database (IGD) project (Ritter, 1994) or the SRS database (Etzold and Argos, 1993; Thierry-Mieg and Durbin, 1991).
(3) All servers may adopt a universal description language. Intelligent agents, able to understand the subtleties of each database, automatically translate any query into subqueries adapted to the server and reformat the replies in a universal way. This solution makes a lot of sense in the long term, but first requires the modification of the servers and the development of these agents. Although prototype systems have been demonstrated, no production system currently exists.
All these methods have in common a large overhead and will only work if a concerted effort is made by the entire bioinformatics community.
3. The Jade solution
We believe that the semantic chaos can be tamed by adopting a much lower profile, in a way which does not impose a standard schema for biological concepts, advocate a universal query language, or require the modification of existing databases. Instead, we propose that the servers and the application programs should communicate by the simplest possible pipeline: relational tables. Adaptation at both ends of the pipeline would then be straightforward and could be done immediately.
We have created the Jade system to be this universal adapter. Jade has several layers, described in Appendix A and in Fig. 1. The important point is the existence of a small piece of code, the Jade adapter, running on the database side of the connection, which provides the appropriate methods to transform client requests into the data source’s idiosyncratic language. The Jade adapter is under the control of the data curator, the party best suited to understand the semantics of the data source and to translate between the data source’s native structure and the abstract tabular view required by Jade clients.

Fig. 1. The Jade architecture.
User interface: The user runs a Java browser (Applet viewer, Netscape…) on a Macintosh, PC or X11 terminal. Connection to the Web server automatically downloads the Jade byte code. Then Jade opens, in parallel threads, several connections to the Jade2xxx adapter services declared on the Web server.
Web server: The Web server should run on a Unix platform. It holds the master copy of the Jade byte code and the HTML documents specifying the port numbers of its connectors, and it runs the connectors as Unix services.
Database servers: Each server can run on a different machine. The server must know how to queue the requests and how to export a table. Preferably, the server should be able to slice its replies to long queries to avoid flooding the presumably very slow user platform and network connection.
Web server to data server connection: Fast connection according to the protocol understood by the database server; for example, the aceserver uses RPC.
Web server to user connection: Presumably slow connection using the socket protocol, which is available even on the Macintosh.
Jade2xxx: Small executables running as services on the Web server. They translate the Jade named table requests into requests understood by the data server (for example, SQL), reformat the answers if needed, and communicate with the user and the data server using the correct protocols (for example, socket on one side and RPC on the other).
Schema adapters: The schema adapter, which specifies the semantic translation between a given table and a given database, can either be hooked to the Jade2xxx service, in which case the person installing it does not need write access on the data server, or be hooked directly into the data server if the latter is able to preregister named queries with parameters.

For the client application programmer, the rule is to exchange all data in the form of named, parameterized, relational tables. This paradigm enforces an important discipline: the programmer must describe explicitly the kind of data expected in each table and cannot use tricks and shortcuts. Benefits follow: the code is easier to maintain; there is no need to know the precise structure of the database servers; and, without loss of generality, the program can be prototyped with test data tables given in flat files and later pointed to a full-fledged relational or object-oriented server. Finally, by simply changing the implementation of the table on the server side, the same display becomes reusable on different data sets. For example, a generic tree display with hot
spots can be used to visualize a phylogeny, a pedigree, a cell lineage or the organizational chart of an administration.
From the data curator’s perspective, the paradigm allows a single Jade adapter to serve the data in parallel to clients running on any kind of platform. This approach imposes no constraint on the database protocol, since any database program can export a table. Nor does it constrain the database schema: the same table request will be implemented differently on servers with different schemas. There is no need to solve the formidable problem of establishing a complete correspondence between two schemas, or to modify the schema of a database to support a new client. Only the portion of the data which is to be exported to the data analysis or display client need undergo translation.
To give a concrete example, consider a client program that draws a genetic map. It must obtain from the server a two-column table. In the first column are the names of genetically mapped loci. In the second column is the
distance of each gene from the origin of the linkage group, which may be the top, the centromere or an arbitrarily chosen point depending on the species nomenclature. To provide access to such a client using Jade, the curator of a relational database would devise the SQL query adapted to that database’s schema. The curator of an object-oriented database would write a query following object pointers to the correct attributes to generate the two-column table.
From the biologist’s point of view, the net result is that the map display becomes both reusable and one choice among possibly many. The biologist will be able to select a preferred view and apply it to many different data servers or data types. Conversely, provided the database curator is cooperative, users will be able to choose among many different views of the same data server. This will allow direct comparisons between views and will stimulate the dialogue between biologists and developers.
Jade can also fully handle the simpler case, in which a group of users want to visualize their data stored in an object-oriented server for which they chose the schema. Then, there is no problem of semantic adaptation. The programmer can use the schema directly in the display code. As explained above, this is convenient, but may be counter-productive in the long run. In any event, Jade can transfer full objects, handle the communication technicalities, and provide the basic browsing functionalities. This configuration has been tested with the Ace database server. Taken together, the Jade/Aceserver system constitutes a complete freeware platform for an intranet application.
In our demos (http://alpha.crbm.cnrs-mop.fr/jade.html), the displays are still embryonic and await external contributions, but they prove the concept. The same map display code is used to show a genetic map and the advancement of the C. elegans sequencing project, with data imported from flat files, from the GDB and from two Acedb databases, worm and dace, with unrelated schemas. We reused several displays, which were written independently by colleagues working on different organisms. Finally, in the grass database demo of Bigwood et al. (http://jagtest.gig.usda.gov:9000), the user sees a single display which imports in parallel, via Jade, data from several servers.
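The genetic map example can be sketched in Java as follows. This is an illustration under stated assumptions, not code from Jade: the interface `GeneticMapSource`, the two adapter classes, the tab-separated file layout and the locus names are all hypothetical. A flat-file adapter and an object-style adapter each produce the same two-column table (locus name, distance from the origin), and one display routine renders either.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: all class, table and locus names here are hypothetical. */
class GeneticMapSketch {

    /** One row of the two-column table: locus name, distance from the origin. */
    static final class MapRow {
        final String locus;
        final double position;

        MapRow(String locus, double position) {
            this.locus = locus;
            this.position = position;
        }
    }

    /** What the map display asks of any data source. */
    interface GeneticMapSource {
        List<MapRow> geneticMap(String linkageGroup);
    }

    /** Adapter over flat-file text with lines "group<TAB>locus<TAB>position". */
    static final class FlatFileAdapter implements GeneticMapSource {
        private final String contents;

        FlatFileAdapter(String contents) {
            this.contents = contents;
        }

        @Override
        public List<MapRow> geneticMap(String linkageGroup) {
            List<MapRow> rows = new ArrayList<>();
            for (String line : contents.split("\n")) {
                String[] fields = line.trim().split("\t");
                if (fields.length == 3 && fields[0].equals(linkageGroup)) {
                    rows.add(new MapRow(fields[1], Double.parseDouble(fields[2])));
                }
            }
            return rows;
        }
    }

    /** Adapter over an object-style store: group -> (locus -> position). */
    static final class ObjectStoreAdapter implements GeneticMapSource {
        private final Map<String, Map<String, Double>> groups;

        ObjectStoreAdapter(Map<String, Map<String, Double>> groups) {
            this.groups = groups;
        }

        @Override
        public List<MapRow> geneticMap(String linkageGroup) {
            List<MapRow> rows = new ArrayList<>();
            Map<String, Double> loci = groups.get(linkageGroup);
            if (loci != null) {
                for (Map.Entry<String, Double> entry : loci.entrySet()) {
                    rows.add(new MapRow(entry.getKey(), entry.getValue()));
                }
            }
            return rows;
        }
    }

    /** The display code: identical whichever adapter supplied the rows. */
    static String render(List<MapRow> rows) {
        StringBuilder out = new StringBuilder();
        for (MapRow row : rows) {
            out.append(row.locus).append(" @ ").append(row.position).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up test data, not taken from any real database.
        GeneticMapSource flat =
            new FlatFileAdapter("I\tlocusA\t0.0\nI\tlocusB\t2.5\nII\tlocusC\t1.0");

        Map<String, Double> groupI = new LinkedHashMap<>();
        groupI.put("locusA", 0.0);
        groupI.put("locusB", 2.5);
        Map<String, Map<String, Double>> store = new LinkedHashMap<>();
        store.put("I", groupI);
        GeneticMapSource objects = new ObjectStoreAdapter(store);

        System.out.print(render(flat.geneticMap("I")));
        System.out.print(render(objects.geneticMap("I")));
    }
}
```

The two adapters know about their own storage layout, while `render` knows only the table. Only the adapter has to change when the schema does.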
4. Discussion
Jade connects two complex realms, the database servers and the application programs, by a very simple protocol: named tables with parameters. It is the responsibility of the application to say what sort of data is expected in each table, and it is the responsibility of the database to be able to export a table corresponding to the definition. In this way, we ensure reusability both ways: a given display can be reused on different data
sets, and reciprocally a data set can be reused by different displays.
The key point is that interoperability does not require global integration. It is possible to develop software that can operate on multiple databases without solving the much harder problem of integrating those databases. For example, one can develop a generic map display that can display any kind of locus-oriented map, without agreeing on the meaning of the term ‘locus’.
We are aware that the imposition of a tabular, relational view between the client program and the server runs counter to the current trend, which promotes object orientation. We are generally in favor of object orientation to manage data, to simplify its collection, and to support the required complexity and versatility of the schema needed in rapidly evolving scientific domains such as molecular biology (Searls, 1995). However, it is rare for a single application program to require more than a very restricted view of the complete schema. In this common case, relational tables offer an economical and general solution when data must be exchanged between heterogeneous sources and reusable analysis software. Because tables are easily understood, the data model does not get in the application programmer’s way. Tables are easily specified and easily implemented.
Jade’s approach is not unique. David Searls’ bioTk project (Searls, 1995) uses weak semantics to couple biological objects to graphical representations, but makes no provision for data storage and retrieval. Microsoft’s ODBC (for C language programs) and Sun’s JDBC (for Java programs) both provide a uniform API that allows developers to access the contents of relational databases without regard to their underlying semantics (Kohane et al., 1996). However, these solutions are limited to SQL servers on specific platforms, whereas Jade is a cross-platform tool that accommodates flat files and object-oriented databases as well.
Jade is also conceptually similar to the W3-EMRS system for distributed medical information (van Wingerde et al., 1996), but that system is too specialized to be applied to other domains.
Another related solution uses the Object Management Group’s CORBA (Ben-Natan, 1995; Stein et al.), an industry-standard protocol for sharing database objects among multiple heterogeneous servers. CORBA is a rich system, with several layers. CORBA’s data definition language makes it possible to share complex objects without regard to each database’s idiosyncratic representation of the information. Of course, the protocol does not address the deeper semantic problem of interrelating different schemas, but it does provide a discipline for finding and describing a common semantic subset of two schemas. CORBA could be used in place of the simple socket protocol to transfer our relational tables between the server and the Jade code, and would then offer a natural niche for the semantic adapter. Or, each
display could be associated with an object view that was populated by the server, or standard classes could be exported from the servers all the way to the displays. However, CORBA is not available on all machines and requires installation by an advanced programmer. For data objects constructed from a few relational tables or imported from local files, CORBA is overkill. If the CORBA interface definition language is used to create a universal schema, we again face the shortcomings discussed in the preceding sections. Notice finally that CORBA is needed on both the server and the client side. Although Web browser vendors have announced that future products will incorporate a CORBA client, this is not yet a working possibility.
In contrast, Jade is immediately and freely available. It essentially acts as a generator of specialized database views. Its lowest-common-denominator approach to data representation does not require extensive schema integration. It exists as a separate software component that can be layered onto an existing database management system without source code modification. By virtue of its intimate connection to Java, it is cross-platform and implementation-independent. The source code and the documentation of the whole system are available without restriction at http://alpha.crbm.cnrs-mop.fr/jade.html.
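The notion of a component layered onto an existing database, which turns named tables with parameters into server-specific requests, can be sketched as a small registry of query templates. The following Java fragment is an assumption-laden illustration, not Jade’s actual adapter code: the class `SchemaAdapter`, the `{group}` placeholder syntax, the table name `GeneticMap` and both SQL schemas are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: table names, schemas and SQL here are hypothetical. */
class SchemaAdapterSketch {

    /** Maps named tables to query templates for ONE data server. */
    static final class SchemaAdapter {
        private final Map<String, String> templates = new HashMap<>();

        /** The curator registers how a named table is computed on this server. */
        void register(String tableName, String template) {
            templates.put(tableName, template);
        }

        /** Expands {param} placeholders; rejects unknown tables and missing parameters. */
        String translate(String tableName, Map<String, String> parameters) {
            String template = templates.get(tableName);
            if (template == null) {
                throw new IllegalArgumentException("table not exported: " + tableName);
            }
            String query = template;
            for (Map.Entry<String, String> p : parameters.entrySet()) {
                query = query.replace("{" + p.getKey() + "}", p.getValue());
            }
            if (query.contains("{")) {
                throw new IllegalArgumentException("missing parameter in: " + query);
            }
            return query;
        }
    }

    public static void main(String[] args) {
        // Two servers with unrelated (invented) schemas export the same named table.
        SchemaAdapter serverA = new SchemaAdapter();
        serverA.register("GeneticMap",
            "SELECT name, cm FROM locus WHERE chrom = '{group}'");

        SchemaAdapter serverB = new SchemaAdapter();
        serverB.register("GeneticMap",
            "SELECT m.symbol, p.map_offset FROM marker m "
            + "JOIN map_pos p ON p.marker_id = m.id WHERE p.lg = '{group}'");

        Map<String, String> params = new HashMap<>();
        params.put("group", "I");

        // The client request is identical; only the generated query differs.
        System.out.println(serverA.translate("GeneticMap", params));
        System.out.println(serverB.translate("GeneticMap", params));
    }
}
```

Plain string substitution is used here only to keep the sketch short. An adapter in front of a real SQL server would more safely bind parameters, for example through JDBC’s `PreparedStatement`.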
Acknowledgement
Our demo includes displays adapted from Greg Helt, Lisa Pyoneer and Mazda Hewitt. We are grateful to Michel Potdevin and John Barnett for their contribution to the Jade code and to Doug Bigwood and Curt Jamison for productive feedback on their adaptation of Jade to the USDA-sponsored plant databases.
Appendix 6.1. Materials and methods
The design of Jade is depicted in Fig. 1. The user connects to the Web server and retrieves the Jade HTML document, which specifies several port numbers. This automatically downloads the Jade byte code, which is executed by the Java browser. Jade then opens a collection of sockets to those ports on the same Web server. The ports correspond to connecting programs, one per underlying database server. They link the Jade clients to the data servers, fulfilling three functions:
(a) They are allowed to connect to data servers elsewhere on the net, whereas this would represent a Java security violation if done directly by the Jade byte code.
(b) Running on a Unix machine, they can use server-dependent communication protocols, like RPC, not available on the Macintosh.
(c) They reformulate the queries in the language adapted to the server, for example SQL for the GDB, and reformulate the answer in Jade.
The main properties of this design are that the communications are fully parallelized, the available servers are chosen by the Web curators, and, most importantly, the Jade byte code is insulated from the idiosyncrasies of the data server, since both protocol and language are handled by the Jade2xxx adapters.
Jade itself is object-oriented. Data are organized in objects, cached on the user machine and imported transparently in a lazy way from the appropriate servers. When the server is also object-oriented, Jade can import an object as a whole and display it as is. However, as explained in the main text, to ensure schema independence, we recommend driving the graphic displays by relational tables.
The Jade system and its documentation are available from ftp://ftp.crbm.cnrs-mop.fr/unix/pub/jade. Following our documentation, the installation of a Jade server on a UNIX machine already functioning as a Web server takes about 1 h, but requires some familiarity with computers. On the other hand, the biologist using Jade just needs to connect via Netscape 3.0, or some other Java browser, to a Jade server, for example to our demo at http://alpha.crbm.cnrs-mop.fr/jade.html. The interface is intuitive and has on-line help. The displays look heterogeneous because they were contributed by different programmers; this demonstrates the ease with which code can be reused in this system. An interesting feature is that one of the demos, GrassDB, obtains its data transparently from several servers. The present demo is rudimentary in terms of biological functionalities, but the hope is that it will stimulate other developers to share their work and connect their programs to the system.
References
Anuff, E., 1997. The Java 1.1 source book. John Wiley, New York.
Ben-Natan, R., 1995. CORBA. McGraw Hill, New York.
Etzold, T., Argos, P., 1993. Comput. Appl. Biosci. 9, 49 and 59.
Fawcett, J.W., Jepson, R.W., 1996. Using the common gateway interface. UNIX Review 14, 39–46.
Kohane, I.S., van Wingerde, F.J., Fackler, J.C., Cimino, C., Kilbridge, P., Murphy, S., Chueh, H., Rind, D., Safran, C., Barnett, O., Szolovits, P., 1996. Sharing electronic medical records across multiple heterogeneous and competing institutions. Proc. AMIA Annu. Fall Symp., 608–612.
Ritter, O., 1994. In: Suhai, S. (Ed.), Computational Methods in Genome Research. Plenum Press, New York, pp. 57–73.
Searls, D.B., 1995. bioTk: Componentry for genome informatics graphical user interfaces. Gene 163, GC1–GC16.
Stein et al. Jade: a universal adapter for bioinformatics databases.
Thierry-Mieg, J., Durbin, R., 1991. Acedb: a C. elegans database. Available at ftp://ncbi.nlm.nih.gov/repository/acedb
Williams, N., 1997. Science 275, 301.
van Wingerde, F.J., Schindler, J., Kilbridge, P., Szolovits, P., Safran, C., Rind, D., Murphy, S., Barnett, G.O., Kohane, I.S., 1996. Using HL7 and the World Wide Web for unifying patient data from remote databases. Proc. AMIA Annu. Fall Symp., 643–647.