BIOTOPICS
XML: a lingua franca for science?
Emmanuel Barillot and Frédéric Achard
XML is a new language designed to solve one of the biggest problems of the World Wide Web: its main language, HTML, is not extensible. In this article, the authors discuss the current successes and limitations of the World Wide Web, briefly explain the basics of XML and present the benefits of using XML as a data-exchange language. Finally, they discuss real-life applications that have been developed using XML, with a focus on biology.
The Internet is probably the most important factor at the foundation of the information era. It has significantly reshaped the Western economy and has even altered some lifelong habits. This revolution is largely owing to the huge success of the World Wide Web (WWW) technology, whose friendly user-interface is easily understood and assimilated by almost anybody. Perhaps the only technological shift that is comparable with the WWW interface breakthrough is the first WIMP [Window, Icon, Menu, Pointer (e.g. mouse)] interface developed by Xerox (Palo Alto, CA, USA) and implemented in the Apple Macintosh in the early 1980s.
A wealth of information from the WWW
Today, the vast majority of the general public and probably all scientists have access to the Internet. Not only does it provide direct and easy access to information, but it also gives its users the power to publish information themselves. Hypertext markup language (HTML) is the language used to publish documents over the Internet, so that they can be read and rendered by a WWW browser (e.g. Netscape or Microsoft Internet Explorer). The tag-oriented syntax of HTML is so simple that it usually only takes the reading of a couple of WWW pages, together with their HTML source code, to start writing an HTML document. In fact, anyone can easily create and publish a WWW site, typically free of charge. There are now three million public WWW sites in the world, which include more than 800 million WWW pages¹ (intranet sites and dynamic WWW pages are not included in these figures). Under these conditions, it seems doubtful that there is any room left for an emerging technology in the Web world of today, where HTML is so ubiquitous. Nevertheless, today’s users (e.g. a biologist in a laboratory) face a new challenge: how to find the information that interests them, that is, how to select the relevant documents from this vast amount of data. Several specialized portals, such as Yahoo, offer hierarchically organized access to information¹. However, the results are often frustrating because the classification is, by nature, subjective, and there are few possibilities for fine-tuning requests.
E. Barillot ([email protected]) and F. Achard (Frederic.[email protected]) are at the G.I.S. Infobiogen, 7 rue Guy Môquet, BP 8, 94801 Villejuif, France. E. Barillot is also at Généthon, 1 bis rue de l’Internationale, 91000 Evry, France, and F. Achard is presently at Cereon Genomics, 45 Sidney Street, Cambridge, MA 02139, USA.
TIBTECH AUGUST 2000 (Vol. 18)
Search engines such as Altavista and Infoseek locate information by using a set of more-or-less complex queries based on keywords; however, keywords (even a combination of them) cannot capture the exact meaning of the natural language embedded in a WWW page. As a result, search engines frequently produce numerous false-positives (i.e. an irrelevant document is returned) and false-negatives (a relevant document is not returned). For example, a biologist looking for information on the gene rap is probably not interested in web pages about rap music.
XML: a ‘do it yourself’ markup language
The need to extract more meaning from Web pages is at the origin of a new technology that is becoming a new standard and is highly likely to transform the Web as it is known today. Promoted by the World Wide Web Consortium (W3C), the consortium that is also behind HTML, the extensible markup language (XML) was introduced in 1996 and is being adopted in different domains as a standard for data publishing and data exchange, particularly in the scientific world. Like HTML, XML is a language based on tags. The rationale of XML is simple: to give users the power to define their own set of tags. In HTML, the set of tags is predefined and is used to describe the rendering of a document, for example, font name and size, lists of items or tables and so on. In XML, the tags are defined by the user, and they can be used not only to describe the rendering but also to assign meaning to the document. The set of user-defined tags and their structure (i.e. how to use and nest the tagged elements) are generally described in a separate file, called a document-type definition (DTD), which is reusable and exchangeable. More precisely, the DTD can reside in a separate file or be embedded in the XML document; in addition, an XML document can exist without a DTD. The idea is that the same DTD is defined for a series of documents of the same type.
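As an illustration of user-defined tags and an embedded DTD, the following is a minimal sketch in Python. The element names (gene_entry, gene, organism, note) and the helper describe are hypothetical and invented for this example; they do not come from any published DTD. The standard-library parser used here only checks that the document is well formed; it reads past the embedded DTD without validating against it, so a validating parser would be needed to enforce the declared structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are invented for illustration.
# The DTD is embedded in the document (internal subset) rather than
# stored in a separate file.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE gene_entry [
  <!ELEMENT gene_entry (gene, organism, note)>
  <!ELEMENT gene (#PCDATA)>
  <!ELEMENT organism (#PCDATA)>
  <!ELEMENT note (#PCDATA)>
]>
<gene_entry>
  <gene>rap</gene>
  <organism>example organism</organism>
  <note>free-text annotation goes here</note>
</gene_entry>
"""


def describe(xml_text):
    """Return a mapping from each child tag name to its text content."""
    root = ET.fromstring(xml_text)  # well-formedness check only, no DTD validation
    return {child.tag: child.text for child in root}


fields = describe(DOCUMENT)
print(fields["gene"])
```

Because the tag names carry the meaning, a program can ask for the gene field directly instead of guessing from the position or formatting of the text.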
XML-aware browsers are able to interpret the meta-information contained in the DTD, to analyse the XML document, to validate its syntax and to present the data in a way that depends on who the end-user is (using another file containing the rendering specifications: the stylesheet). Of course, tags can then be analysed to perform intelligent queries against XML documents. In our previous example regarding the gene rap, one could tag every occurrence of the word used in the genomic context to differentiate it from the word rap in the context of music, for example, <gene>rap</gene>. This type of information can be used by the query engine. What was a string of characters in HTML is now a concept in XML. The advantage of XML over HTML is clear from this example: it adds parseable semantics (i.e. information that a computer program can extract) to a document. The role of XML is to lighten the task of the computer program, which cannot yet model natural language, by providing some hints for comprehension.
0167-7799/00/$ – see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0167-7799(00)01465-7
Box 1. Examples of resources of interest to new users of the extensible markup language
Biological efforts in extensible markup language (XML):
• http://www.visualgenomics.com/products
  The bioinformatics sequence markup language (BSML): for representing nucleotide and protein sequences with some graphic properties
• http://www.proteometrics.com/BIOML
  The biopolymer markup language (BioML): for representing an integrated view of nucleotide and protein sequences (Fig. 1)
• http://www.bioxml.com/productspage.html
  BlastXML: for modeling sequence comparison blast output
Technical documentation:
• http://www.w3.org/TR/REC-xml
  The official document specifying the XML syntax
• http://www.oasis-open.org/cover/xml.html
  The most exhaustive web site on XML, maintained by Robin Cover (Oasis, Billerica, MA, USA)
Programming resources:
• http://www.perlxml.com/modules/perl-xml-modules.html
  Perl modules for facilitating parsing and publishing of XML data
• http://java.sun.com/xml
  Java Project X is available as a java class for parsing and publishing XML data
• http://www.alphaworks.ibm.com
  IBM has produced a number of tools and libraries to help in developing XML applications
• http://www.jclark.com/xml/expat.html
  An XML parser written in the C programming language
Figure 1 An extensible markup language document, viewed with the BioML browser and conforming to the BioML document-type definition. The screenshot shows the rendering of the human gene and protein prothrombin, with the hierarchical menu on the left, and a reference to the three-dimensional structure selected on the right (http://www.proteometrics.com/BIOML).
XML for scientific data exchange
It did not take long for the scientific community to realize the benefits it could obtain from using XML. There are many ways to exchange scientific data. At one extreme, non-standardized formats are rich and flexible, but can only be interpreted by human experts. At the other extreme, the parties agree on a format beforehand and develop ad hoc programs for data exchange, but this often leads to a very rigid situation with proprietary software and data formats. Scientists were able to use XML to develop an intermediate solution to many of the problems they were facing when exchanging data, thus capitalizing on the many advantages of XML: (1) the ability to define a specific data structure that can be read by a standard Web browser or interpreted by one of the many XML parsers available; (2) the simplicity of the XML language, written in ASCII characters, which allows the user to read the source code and to write it without a specialized editor; (3) the flexibility of having the DTD in a separate file, which facilitates the modification of the data structure if needed; (4) the fact that, like HTML, XML is system- and language-independent because it was, from the beginning, specified as an open standard (e.g. an XML document is a text file, readable on any computer); and (5) the availability of XML-aware browsers on every computer, such as Netscape 5.0 and Microsoft Internet Explorer 5.0. It is clear from the above that XML does more than just add semantics to the Web; it is also a powerful language for data interconnection. The flexibility of XML and its independence from computer operating systems make it a universal hub between databases. In fact, several database-management systems already offer XML interfaces and all of them are integrating XML into their development plans.
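The difference between a keyword query and a tag-aware query, using the gene rap example discussed earlier, can be sketched as follows. This is an illustrative sketch only: the two pages, the element names (page, title, body, gene) and the function names are hypothetical, and the standard-library parser stands in for any of the available XML parsers.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages: one uses "rap" as a gene name (and tags it),
# the other uses the same word in the context of music.
PAGES = {
    "genomics": "<page><title>Signalling</title>"
                "<body>The <gene>rap</gene> locus was mapped.</body></page>",
    "music": "<page><title>Charts</title>"
             "<body>This week in rap and hip-hop.</body></page>",
}


def keyword_search(pages, word):
    """Return the names of pages whose plain text contains the word."""
    hits = []
    for name, text in pages.items():
        words = " ".join(ET.fromstring(text).itertext()).split()
        if word in words:
            hits.append(name)
    return sorted(hits)


def gene_search(pages, symbol):
    """Return the names of pages where the symbol is tagged as a gene."""
    hits = []
    for name, text in pages.items():
        root = ET.fromstring(text)
        if any(gene.text == symbol for gene in root.iter("gene")):
            hits.append(name)
    return sorted(hits)


print(keyword_search(PAGES, "rap"))  # both pages match the bare keyword
print(gene_search(PAGES, "rap"))     # only the genomics page matches
```

The keyword query returns the music page as a false-positive, whereas the tag-aware query relies on the markup to keep only the page in which rap is a gene.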
Other solutions, such as the common object request broker architecture (CORBA; http://www.corba.org), are already being used for data integration, but programming CORBA servers and applications requires the skills of highly trained software engineers. With its simplicity of deployment, XML is aimed more at users, whereas CORBA is a complex solution deployed by and for computer scientists (although it is powerful and useful in many other cases too). However, before exchanging data in XML, one needs to reach an agreement on common semantics, that is, on a common DTD. In fact, technically, XML provides only a data syntax, and the semantics springs from this community agreement on a DTD. The definition of a DTD is a long and difficult process that requires a scientific community to specify the vocabulary and the grammar used in their field; this is probably the main factor that slows down the adoption of XML. To date, some communities have already achieved significant results: CML (chemical markup language; http://www.xml-cml.org) has been used for several years in chemistry, where a dedicated CML browser, Jumbo (http://www.venus.co.uk/omf/cml), allows the visualization of chemical compounds, secondary or three-dimensional structures, literature data and so on. In mathematics, MathML (http://www.w3.org/Math; http://www.webeq.com) consists of a DTD of approximately 100 elements to describe mathematical equations. In biology, the definition of common DTDs has recently speeded up (Box 1); the most publicized achievements to date include the following.
• The bioinformatic sequence markup language (BSML; Visual Genomics, Columbus, OH, USA), whose DTD aims at representing DNA, RNA, protein sequences and their graphic properties.
• The biopolymer markup language² (BioML; Proteometrics, NY, USA), which consists of a DTD that describes the annotation information for protein and nucleotide sequences and mimics the hierarchical structure of a living organism (Fig. 1).
• XML is in use for decoding the outputs of computer analysis programs, a very important source of knowledge in biology. For example, BlastXML facilitates the post-processing of the output of the widely used sequence comparison program BLAST. It specifies a common format, so that programmers can use XML parsers to decode a BLAST output, instead of having to program their own data parser.
Conclusion
Ten years ago, nobody anticipated the revolution of the WWW and HTML. Will we witness such a success story with XML? It is probably too early to be sure. However, it is already clear that XML will be a standard feature of the Web, especially in science and e-commerce, in which structured information has to be exchanged on the Internet. In many cases, HTML will still be used because the document to be presented is aimed at a human reader only, but, whenever computer processing is involved, XML is more likely to be present. In biology, where the problems of data integration are recognized as a major bottleneck in further scientific advances³, XML could soon play the role of the lingua franca that scientists have been praying for since the beginning of the genome projects.
References
1 Lawrence, S. and Giles, C.L. (1999) Accessibility of information on the web. Nature 400, 107–109
2 Fenyö, D. (1999) The biopolymer markup language. Bioinformatics 15, 339–340
3 Reichhardt, T. (1999) It’s sink or swim as a tidal wave of data approaches. Nature 399, 517–520