Standards Update
Mark H. Needleman, Column Editor
XML
Mark H. Needleman
This column discusses XML, the Extensible Markup Language currently being developed by the World Wide Web Consortium (W3C). XML is designed to be much more flexible than HTML (Hypertext Markup Language) and to provide more sophisticated capabilities than HTML currently offers. The column describes XML and some of the uses that are being made of it. While there is insufficient space here to go into many of the technical details of XML, pointers to those details are provided for readers who are interested in exploring the subject.
Needleman is a Product Development Specialist-Standards, Data Research Associates, Inc., 1276 North Warson Road, PO Box 8495, St. Louis, MO 6312-1806, (314) 432-1100, Fax (314) 993-8927.
WHAT IS XML?
Since the beginning of the Web, the language used to define Web pages has been HTML. HTML has become more capable and flexible as new versions have been developed, and it is now possible to do some rather sophisticated things with it, but it is basically a static description language. HTML tags are predefined and essentially hardcoded into the language, and all applications are limited to the single set of tags defined in the HTML specification (with the exception, of course, of the browser-specific extensions that various vendors have implemented to get around that restriction). XML, on the other hand, is not a single markup language like HTML, but rather a metalanguage that allows the description of multiple markup languages.
Serials Review, Vol. 25, No. 1, 1999
XML defines a syntax for markup languages; it does not define the actual tags that make up those languages. It spells out the rules for using notation to specify a markup language, but the tag names and what those tag names mean are left to the markup language itself. XML is based on SGML (Standard Generalized Markup Language), the international standard for describing the structure and content of different types of electronic content, which has been standardized as ISO 8879. XML is an abbreviated version of SGML that omits some of its more complex and less used parts. XML is nevertheless still valid SGML and can be parsed and validated the same as any other SGML file.
XML also provides a more sophisticated linking model than exists within HTML. It allows authors to specify different types of document relationships, allows for bidirectional and multiway links, and allows for links to a specific span of text within the same document or in other documents. This provides more functionality than can be achieved with HTML's existing HREF-based anchors.
The following extremely simple example of XML, taken from the World Wide Web Consortium's XML activity page, might help give a more concrete feel both for what it looks like and for the flexibility it will provide. The example is for information describing a customer:
    <customer-details>
        <name>Acme Pharmaceuticals Co.</name>
        <street>7301 Smokey Boulevard</street>
        <city>Smallville</city>
        <state>CA</state>
        <zip-code>12345</zip-code>
    </customer-details>
What this simple example shows is a markup language that has defined tags (and how those tags can be used) for a specific application domain. With XML it is thus possible to define whatever tags are needed to express and support the requirements of the application. As can be seen, XML tags have a lot in common with HTML tags in their structure, but they provide functionality that can be customized to individual application needs rather than the single one-size-fits-all tag space provided by HTML.
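To show how an application might consume such a record, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. Python is chosen only for illustration, as the column names no programming language. The tag names follow the customer example, with name and state being assumptions of this sketch, and the helper customer_fields is hypothetical. Note that the program knows nothing about these tags in advance; it simply reads whatever the document defines.

```python
import xml.etree.ElementTree as ET

# Customer record modeled on the example in the text. The <name> and
# <state> tag names are assumptions made for this sketch.
CUSTOMER_XML = """
<customer-details>
  <name>Acme Pharmaceuticals Co.</name>
  <street>7301 Smokey Boulevard</street>
  <city>Smallville</city>
  <state>CA</state>
  <zip-code>12345</zip-code>
</customer-details>
"""


def customer_fields(xml_text):
    """Return a dict mapping each child tag name to its text content."""
    root = ET.fromstring(xml_text.strip())
    # Iterating over an element yields its direct children in document order.
    return {child.tag: (child.text or "").strip() for child in root}


if __name__ == "__main__":
    for tag, value in customer_fields(CUSTOMER_XML).items():
        print(f"{tag}: {value}")
```

Because the tags carry meaning (city, zip-code) rather than layout, the same function would work unchanged for any record that uses this vocabulary.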
In SGML (and thus in XML, which is a legal subset of SGML) there is the concept of DTDs (Document Type Definitions). DTDs provide a formal definition for particular types of documents. The DTD defines what tags are valid for documents that conform to that type, what their names are, where they may occur, and how they are related to each other. In essence, a DTD gives an application notice of what names and structures are valid for that particular document type and allows validation of whether or not the document is coded properly for the type of document it claims to be. The DTD also specifies which elements (and which of their attributes) are required and which are optional.
A DTD is not required for an XML document; a set of application-specific tags would suffice. But a DTD is useful for the reasons mentioned above, and it allows multiple independent applications to parse and validate documents without having knowledge of the document structure built into each application. If the application can parse DTDs, it can validate and parse the documents. There are thousands of SGML DTDs already in existence, and it is expected that (since XML is a subset of SGML) XML versions of many of them will become available. HTML is itself an SGML DTD, and XML versions of the HTML DTD are in preparation.
FORMATTING AND DISPLAY ISSUES
Unlike HTML, with its fixed set of tags that mostly define the physical structure of a document, XML can be used to mark up documents for a wide array of purposes and applications. As seen in the brief example above, and as is the thrust of many of the XML applications discussed below, XML tagging can be used to define the logical content of a document rather than its physical structure. Different applications can thus choose to format the same document in different ways. Style sheets are a mechanism by which content creators and Web designers can specify or give guidance on how they want their documents displayed. They can be used to associate formatting instructions with tags that in themselves may only define logical structure. Multiple style sheets could exist for the same document: one, for instance, giving instructions on how a document should be formatted for online display, and another giving different formatting instructions for the same document when it is to be printed out as hard copy. There are two major style sheet activities going on in the W3C:
• CSS (Cascading Style Sheets) is a simple declarative language that allows authors and users to apply stylistic information such as font, spacing, color and the like to structured documents written in HTML or XML. CSS is good for documents where the elements in the document are typically rendered in the same order they appear in the source document.
• XSL (Extensible Style Language), which is written in XML and whose syntax is still under development, allows more complex transformations and dynamic operations than are possible with CSS. XSL consists of rules, each of which has two parts. The first part specifies a pattern that matches an element in the document parse tree. The second part specifies a mapping to objects that can render the document.
NAMESPACES IN XML
Because XML allows such flexibility in tag names, collisions can occur between DTDs that have defined the same tag names with different uses or meanings. There is work going on in the W3C to define mechanisms for namespaces that will allow for qualification and scoping to avoid the potential duplication problem. This namespace mechanism is also designed to allow documents to use tags coming from more than one namespace and to switch back and forth among tags from multiple namespaces within a single document without confusion.
WORKING GROUPS
There are several working groups within the W3C dealing with different aspects of XML development. Among them are:
• The XML Coordination Group acts as the general coordinator for XML activity and as a forum for coordination between XML activity and other portions of the W3C and between XML activity and other organizations.
• The XML Schema Group addresses means for defining the structure, content and semantics of XML documents.
• The XML Linking Group is responsible for designing hypertext links for XML. The work being done on linking mentioned above is the responsibility of this group.
• The XML Information Set Group is responsible for looking at more abstract descriptions of XML documents in terms of document tree structures, elements, attribute lists and the like, to help ensure interoperability among various XML-based specifications and among XML software tools in general.
• The XML Fragment Group is working to define a way to send fragments of an XML document without having to send all or part of the parent document as well.
• The XML Syntax Group is concerned with several areas of XML, including style sheet linking, defining an XML profile, canonicalizing XML (which involves finding a single, canonical form of the same document), tracking internationalization issues such as character sets, and tracking errata reported against the XML specification.
XML APPLICATIONS AND RELATED ACTIVITIES
There is a lot of work going on both with XML itself and with activities and applications that make use of XML. The intent here is to give a sense of the wide range of applications and content domains in which XML is beginning to be used. Space constraints make it impossible to cover all of them (a great deal of information on both the applications mentioned and others can be found through the references given below), but some of the most significant and interesting activities include:
• RDF, the Resource Description Framework, is a specification currently under development within the W3C. It is designed to provide an infrastructure to support metadata across many Web-based activities. RDF allows different application communities to define the metadata that best serves their individual needs and provides an interoperable mechanism to exchange metadata between communities and across the Web. RDF uses XML as its transfer syntax.
• TEI, the Text Encoding Initiative, has developed guidelines for the preparation and interchange of electronic texts for scholarly research and to satisfy a broad range of uses by the language industries more generally. The TEI guidelines have been published in SGML format, and the TEI Workgroup on architectural issues has as one of its charges the development of an XML version of the full TEI DTD.
• IMS, the EDUCOM Instructional Management Systems Project, has released technical specifications for how learning materials will flow over the Internet and for how organizations and learners will manage the learning process. IMS specifications will use XML as the serialization syntax for data and objects.
• Chemical Markup Language (CML) is an application of XML for use with chemical applications and data.
• ANSI X12, CommerceNet and the XML/EDI Group have entered into a joint project on how to translate ASC X12 EDI data elements, segments and transactions into XML.
• vCard is a standard format for electronic business card data. An XML DTD has been developed that corresponds to the vCard electronic business card format.
• The W3C has a draft document, the Platform for Privacy Preferences (P3P) Syntax Specification, that will enable sites to automatically declare their privacy practices in a way that is understandable to users' browsers. P3P uses RDF/XML.
• WEBDAV is an Internet Engineering Task Force (IETF) Working Group seeking to extend HTTP to support distributed authoring and versioning. WEBDAV names and values are expressed using XML.
• SMIL, the Synchronized Multimedia Integration Language, is a W3C recommendation for integrating a set of independent multimedia objects into a synchronized multimedia presentation. It uses XML syntax, and SMIL documents are required to conform to the XML 1.0 specification.
• The Vector Markup Language (VML) defines a format for the encoding of vector information together with additional markup to describe how that information may be displayed and edited. VML is written using the syntax of XML.
• The Mathematical Markup Language (MathML) is an XML-compliant markup language for describing mathematical notation and capturing both the content and the presentation of mathematical expressions.
• Channel Definition Format (CDF) is an application of XML that is designed for push technology.
• The Weather Observation Markup Format (OMF) is an application of XML used to encode weather observation reports. The goal of the OMF system is to annotate and augment standard weather reports with derived computed quantities and to recast the essential information in a markup format that is easy to interpret, yet completely accurate.
• The Open Software Description Format (OSD) provides an XML-based vocabulary for describing software packages and their interdependencies. XML is used to provide a general method of representing structured data in the form of lexical trees.
• CASE Data Interchange Format (CDIF) is a family of standards that lays out a single architecture for exchanging information between modeling tools and between repositories, and defines the interfaces of the components to implement that architecture. The CDIF effort has created an XML DTD for terminology based on ISO 12620 (Data Categories for Terminology).
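Because vocabularies such as those listed above may end up side by side in a single document, the namespace mechanism discussed earlier matters in practice. The following is a rough sketch in Python, using the standard library's xml.etree.ElementTree module and the prefix-and-URI syntax of the W3C Namespaces in XML Recommendation. The two namespace URIs, the tag names, and the helper titles_by_namespace are all hypothetical, invented only for illustration. It shows two elements that share the local name title yet remain distinct because each is qualified by its own namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this sketch.
BOOK_NS = "http://example.org/ns/book"
PERSON_NS = "http://example.org/ns/person"

# One document that mixes tags from two vocabularies. Both vocabularies
# define a "title" element, but with different meanings.
DOC = f"""
<record xmlns:bk="{BOOK_NS}" xmlns:p="{PERSON_NS}">
  <bk:title>Serials Cataloging</bk:title>
  <p:title>Dr.</p:title>
</record>
"""


def titles_by_namespace(xml_text):
    """Return (book title, personal title) from a mixed-namespace record."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree spells a qualified name as "{namespace-uri}local-name".
    book_title = root.find(f"{{{BOOK_NS}}}title")
    person_title = root.find(f"{{{PERSON_NS}}}title")
    return book_title.text, person_title.text


if __name__ == "__main__":
    print(titles_by_namespace(DOC))
```

The prefixes bk and p are only local shorthand; what distinguishes the two title elements is the URI each prefix is bound to.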
BROWSER SUPPORT
The XML 1.0 Specification was released by the W3C in February of 1998 and, as can be seen from the number of applications and development activities listed above (which is by no means a complete list), has generated a lot of activity and support. Several experimental browsers, parsers and other XML software tools have been developed (many can be found through the references listed below), and more are in development. Both Netscape and Microsoft (both of which are W3C members, as are many of the major companies doing Web development) have announced support for XML. Microsoft has announced that XML support will be in Internet Explorer 5 when it is released. However, it is important to note that, like a lot of Internet activities, XML is still developing and will continue to evolve over time as experience is gained with it and new functionality is built into future versions. As with many things in this area, it is impossible to predict which versions particular vendors will support and when, or how many of the features and functions of XML and its related activities will be built into any particular product.
SUMMARY
XML offers some significant enhancements and functionality over what can be achieved with HTML. Among them:
• XML is a language for defining markup languages. It is not restricted in scope to the fixed tags defined in HTML. Document types can be created that are customized to the application, type of data, and user community for which they are intended.
• Information content can be richer and easier to use, both because tags are more flexible and because the hypertext linking abilities of XML are greater than those offered in HTML.
• Information should be more usable and accessible. Because of the nature of XML, communities will be able to create applications customized to their needs, rather than being restricted to software provided by major vendors, as has become the case with HTML.
• XML files are valid SGML. They can be used in SGML environments outside of the Web. At the same time, XML has removed many of the complexities inherent in SGML, providing a simpler model and thus making it easier to develop XML applications than would be the case for SGML.
HELP SITES
An extremely good pointer to lots of information about XML, including publications, conferences, software, XML specifications and other documentation, discussion lists and working groups, research and other XML-related activities, is http://www.w3c.org/XML
Pointers to XML parsers and test cases can be located at http://www.jclark.com/xml/
The SGML/XML home page has extensive information about XML and XML activities and applications. It is located at http://www.oasis-open.org/cover
Another site with extensive information on XML and related activities, including new developments and events, is http://www.xml.com
An FAQ on XML can be found at http://www.ucc.ie/xml/
Much of the material for this column was gathered from the sites listed above. More information about specific XML-related activities discussed above can be found at:
• For RDF see http://www.w3c.org/RDF
• For information about W3C style sheet activities go to http://www.w3c.org/Style