Standard Generalized Markup Language and related standards
Joan M Smith discusses work underway on developing standards for text description and processing languages
Projects developed by the International Organization for Standardization/International Electrotechnical Commission Joint Technical Committee 1/Subcommittee 18/Working Group 8 are described. The working group concentrates on formulating standards for text description and processing languages within the broader domain of text and office systems. Central to the work of WG 8 is ISO 8879 Standard Generalized Markup Language (SGML), which describes the information content of documents. Other standards and technical reports produced by the group support SGML in some way, either directly or indirectly. Their role in office publishing is described, and some information is given about office applications and the products that are available in the marketplace.
Keywords: office information systems, Standard Generalized Markup Language, electronic/office/database publishing, management information systems, document interchange and standards
President, SGML Users' Group, 17 Tanza Road, Hampstead, London NW3 2UA, UK
Joint Technical Committee (JTC) 1 of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) has responsibility for information technology. Its Subcommittee (SC) 18 looks after text and office systems, within which Working Group (WG) 8 brings out standards for text description and processing languages. The work of WG 8 was not always in the remit of SC 18, having originally been allocated to SC 5, which is responsible for producing standards for programming languages. When there was a reorganization of work within what is now JTC 1, it seemed logical that the Standard Generalized Markup Language (SGML) and related projects be brought within the aegis of SC 18, to join work on text structures for which WGs 3 and 5 have responsibility.
In many ways, the work of WG 8 is complementary to that of WGs 3 and 5, which deal with the various aspects of the Office Document Architecture (ODA) and Office Document Interchange Format (ODIF) standards. However, it is approached from a different viewpoint, that of publishing rather than the automation of office systems. Publishing is to be taken in its broadest definition here, 'ranging from single medium conventional publishing to multi-media database publishing', to quote from the introduction to ISO 8879¹. Also, 'SGML can also be used in office document processing when the benefits of human readability and interchange with publishing systems are required'.
SGML
SGML is primarily intended for the re-use of data: organizations increasingly find that information is the most valuable asset they have, worth far more than their investment in hardware or software. SGML is therefore device-independent, so that an SGML system can transcend the changes of hardware (including output devices) over a period of many years. To achieve this, the information is stored in a neutral format. It can be output in many different ways: on a high or low resolution screen, on a dot matrix or desktop laser printer, or it may be typeset. A document can have many different styles associated with it, depending on what is required. The documents can form part of a
0140-3664/89/020080-05 $03.00 © 1989 Butterworth & Co (Publishers) Ltd 80
computer communications
management information system where the elements are identified uniquely to enable their retrieval at will (in a non-sequential manner). In the simplest case, an input document (e.g. a letter) is output in its entirety. Alternatively, an input document (e.g. a report) could be edited; on output the sections would be given specific numbers, as would cross-references and figures, and a table of contents and an index would be compiled to reflect the changes made. There are more complex cases: information can be extracted from forms to provide statistical data (e.g. for inclusion in reports), information for training manuals can be taken from different sources, etc.
Every department within an organization can have its own applications, from the very small to the very large, where needs can be met by SGML. In fact, it is the standard approach whenever information is to be re-used in some way. Some call it a standard for corporate database publishing, where publishing is every company's second business. Publishing in this sense starts in an office and often ends in an office.
Documents may consist of text (including foreign languages, mathematics and other special symbols), graphics (from simple business charts to complex drawings), and scanned images, whether or not these other data types are coded in accordance with other standards. Other notations may be included, e.g. computed data, spreadsheets, audio content. There are many possibilities, and an information base may be built up gradually over the years. If required, existing data may be converted to SGML format, so old documents, even those that exist in paper form only, can be brought into the information system.
SGML is rigorous in that types of documents are defined in a formal way by means of what is known as a Document Type Definition (DTD).
It is here that the rules are laid down as to what elements a document may contain, the contexts in which they may occur, and the attributes they may have to give additional information about the element, e.g. an identifier. Also associated with a document is an SGML declaration, which tells the system what SGML features are used (maybe several to permit markup minimization), what notations are included (perhaps IGES or Consultative Committee on International Telephone and Telegraph (CCITT) Group 4 facsimile), character set information, etc.
Within the DTD, the element names (or generic identifiers, to give them their correct appellation) are (usually) mnemonics. So, 'p' could stand for a paragraph, etc. The text itself is said to be tagged in that <p> could precede a paragraph (in fact, this is not necessarily so, as markup minimization techniques could be employed so that the start-tag could be omitted, as could the end-tag </p> at the end of a paragraph). The point made here, however, is that markup is added to a document (whether user-supplied or system-supplied and user-supplemented), the semantics of which are defined externally (in the DTD).
The DTD governing the output document from the information base could well be different from those which are applied to load documents that could now form constituent elements of the output information. This SGML declaration would include use of the link features, giving details of the various ways in which the information could be presented. Because all is governed according to rules, there are no surprises on output for a formatter having to deal with some undefined element.
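The interplay between a DTD and markup minimization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DTD` table, the `memo` element and the `normalize` function are hypothetical, the table is a drastic simplification of a real Document Type Definition (no attributes, no ordering, no start-tag omission), and a real SGML parser does far more.

```python
import re

# Hypothetical, heavily simplified stand-in for a DTD: for each element
# (generic identifier), the child elements it may contain and whether
# its end-tag may be omitted.
DTD = {
    "memo": {"children": {"p"}, "omit_end": False},
    "p": {"children": set(), "omit_end": True},
}

# A start-tag or end-tag with a lower-case name, or a run of character data.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<]+)")


def normalize(text, dtd=DTD):
    """Return text with every omitted end-tag made explicit.

    Raises ValueError when an element is undeclared or occurs in a
    context that the rules do not allow.
    """
    out, stack = [], []
    for match in TOKEN.finditer(text):
        closing, name, data = match.groups()
        if data is not None:
            out.append(data)
            continue
        if name not in dtd:
            raise ValueError(f"undeclared element: {name}")
        if closing:
            # Close any open elements whose end-tag was omitted.
            while stack and stack[-1] != name:
                if not dtd[stack[-1]]["omit_end"]:
                    raise ValueError(f"missing end-tag for {stack[-1]}")
                out.append(f"</{stack.pop()}>")
            if not stack:
                raise ValueError(f"unexpected end-tag: {name}")
            out.append(f"</{stack.pop()}>")
        else:
            # A start-tag that is not allowed inside the open element
            # implies the end of that element, if its end-tag is omissible.
            while stack and name not in dtd[stack[-1]]["children"]:
                if not dtd[stack[-1]]["omit_end"]:
                    raise ValueError(f"{name} not allowed in {stack[-1]}")
                out.append(f"</{stack.pop()}>")
            stack.append(name)
            out.append(f"<{name}>")
    return "".join(out)
```

Given `<memo><p>First<p>Second</memo>`, the sketch returns `<memo><p>First</p><p>Second</p></memo>`: the paragraph end-tags are inferred from the rules rather than typed by the user, and an element that the table does not declare is rejected instead of reaching the formatter.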
vol 12 no 2 april 1989
This does not imply that a user would know about such things as DTDs and SGML declarations. That would all be left to the application designer, and that is how it should be. In addition, there is more to an SGML system than the ISO 8879 standard.
WG 8 PROJECTS
Computer languages for processing text (ISO Project 18.15)
Standard Generalized Markup Language (ISO Project 18.15.1)
ISO 8879 SGML was published on 15 October 1986, since when some 2500 copies have been sold. It was inevitable that with a standard of this nature there should be some errors of omission and commission, plus ambiguities of meaning. Amendment 1, published on 1 July 1988, dealt with such cases. While this addressed text in the standard itself, matters pertaining to an implementation of SGML and amendments to some of the entity names listed in an annex were also proposed (the entity names show how symbols such as a pilcrow could be entered using a normal ASCII (or other) keyboard). A second amendment is therefore in the pipeline. (A reference book to ISO 8879, containing various indexes and a list of all the character entities with their graphic representations, has now been published².)
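To illustrate how entity names let a symbol such as a pilcrow be keyed on an ordinary keyboard, the following Python sketch expands entity references of the form `&name;`. The three names are taken from the public entity sets of ISO 8879; the mapping to Unicode characters, the `ENTITIES` table and the `expand_entities` function are assumptions made for this example only.

```python
import re

# A few character entity names with the symbols they stand for.
# Mapping them to Unicode code points is an assumption of this sketch.
ENTITIES = {
    "para": "\u00b6",  # pilcrow (paragraph sign)
    "sect": "\u00a7",  # section sign
    "copy": "\u00a9",  # copyright sign
}

# An entity reference: ampersand, name, semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, entities=ENTITIES):
    """Replace known entity references; leave unknown ones untouched."""
    def replace(match):
        return entities.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)
```

With this table, `expand_entities("See &para; 3")` yields the text with a pilcrow in place of `&para;`, while a reference that is not in the table is passed through unchanged for a later stage to resolve.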
Operational Model for Text Description and Processing Languages (ISO Project 18.15.2)
A working draft of this proposed technical report has been produced which shows how all WG 8 projects fit together. Recent work on the inter-relationships among fonts and the two text composition standards will be incorporated. This model will also be input to WG 1, the group responsible for the development of an overall model taking into account the entire work of SC 18.
Computer-Assisted Publishing - Vocabulary (ISO Project 18.15.3)
ISO/Technical Report 9544³, a type 2 technical report, was published on 15 July 1988. The terms used in this area remain subject to change, hence the decision not to standardize them at this stage. Definitions go into rather more detail than usual, even giving the derivation of some of the terms from the days of hot metal. It is from this that both the art and the science of publishing can be appreciated. Comments are welcomed, and these will be taken into account when the text is reviewed.
SGML and Text-Entry Systems (ISO Project 18.15.5)
Guidelines for SGML Syntax-Directed Editing Systems (ISO Project 18.15.5.1)
Type 3 draft technical report ISO/DTR 10037⁴ is currently going the ballot rounds within SC 18. Such guidelines could never be standardized: conformance would be an impossibility owing to the differing nature of the smart editors that are already appearing in the marketplace. The additional facilities which they will variously offer should not be stifled; rather, the report gives advice on what may be achieved when a user enters what is to become an SGML document, perhaps without even knowing that SGML is being used. Since documents are based on a hierarchical structure, useful feedback can be given as the text is entered. Such editors are used in the office, typically on PCs and other desktop publishing devices.
Retroactive Conversion (ISO Project 18.15.5.2)
Another technical report⁵ is in the course of preparation to give guidelines on the conversion of the many thousands of existing documents that are destined to enter an SGML information base. Some may have to be scanned in from paper; others already exist in machine-readable form (perhaps prepared using word processing software); while even more will be in structured databases. All this is possible, as is already being proved by various consortia of suppliers who are both demonstrating their products and carrying out the conversion on real data.
Text Composition (ISO Project 18.15.6)
Document Style Semantics and Specification Language (DSSSL) (ISO Project 18.15.6.1)
Since SGML documents are stored in a neutral format, there is a need to specify the way in which the information is to be presented. DSSSL⁶ will enable details concerning stylistic requirements to be passed to a formatter to result in the desired appearance. Equally, it could be used for the loading of a database. There is a need for great flexibility, since it must be able to take the most complex structures into account (including how to produce a milk carton) as well as the most simple, e.g. a memorandum. This standard could also be used to define the style in which the logical structure of ODA documents is to be imaged⁷.
Standard Page Description Language (SPDL) (ISO Project 18.15.6.2)
At the end of the text composition chain for SGML documents is SPDL, the standard that will describe a document at the page level on output. ODA documents will also require the SPDL standard, and since SPDL is closely allied with DSSSL, the desire is to take these standards through the ballot process together. It is interesting to note that both Adobe, the developer of PostScript, and Xerox are providing a major input to this work⁸.
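The separation between a neutrally stored document and the style applied to it on output can be sketched in Python as follows. The sketch is not DSSSL syntax, which was still in draft; the document content, the two style tables and the `format_document` function are hypothetical and serve only to show one logical structure being presented in two ways, with section numbers assigned at output time.

```python
# Neutral representation: logical elements only, no presentation details.
DOCUMENT = [
    ("title", "Quarterly Report"),
    ("section", "Sales"),
    ("p", "Sales rose in the north."),
    ("section", "Costs"),
    ("p", "Costs were flat."),
]

# Two hypothetical style specifications for the same element names.
SCREEN_STYLE = {
    "title": "{text}",
    "section": "{n}. {text}",
    "p": "  {text}",
}
PRINT_STYLE = {
    "title": "*** {text} ***",
    "section": "Section {n}: {text}",
    "p": "{text}",
}


def format_document(document, style):
    """Apply a style to a document, numbering sections on output."""
    lines = []
    section_number = 0
    for name, text in document:
        if name == "section":
            section_number += 1  # numbers are not stored in the document
        lines.append(style[name].format(n=section_number, text=text))
    return lines
```

Calling `format_document(DOCUMENT, SCREEN_STYLE)` and `format_document(DOCUMENT, PRINT_STYLE)` gives two different renderings of the same stored information; adding a section to `DOCUMENT` renumbers the later sections in both without any change to the text itself.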
SGML Support Facilities (ISO Project 18.15.7)
SGML Document Interchange Format (SDIF) (ISO Project 18.15.7.1)
ISO 9069 SDIF⁹ was published on 15 September 1988. It is SDIF that enables SGML documents to be transferred in an open environment, using the OSI standards. SDIF also applies to other forms of communication, and the exchange of magnetic media. Because SGML documents may contain any data type, SDIF can therefore be used to interchange any data type. Thus, for example, IGES product data may be identified in an SGML document and interchanged by means of SDIF.
Techniques for using SGML (ISO Project 18.15.7.2)
The publication date for ISO/TR 9573¹⁰ was 15 December 1988. It was realized some time ago that ISO 8879 was not going to provide a great deal of help for document type designers (that was not its purpose), so this type 3 technical report was produced to present tutorial matter to assist the development of applications. The process of design of a DTD is described, and examples are given. A DTD for a general document type is explained, where this could well be used for a company report. Other examples of DTDs are for a memorandum, a letter and a spreadsheet, all of these documents having their use in the office. There is a clause on mathematics, another on tables, and others on different languages, from those based on Latin alphabets via documents containing mixed languages written from right to left and from left to right, to other scripts (notably different ways of writing in Japanese, including English text and the use of ruby showing pronunciation). A proposal was made to produce a second technical report, this being for advanced techniques for using SGML; use of the link features would be included in this.
Registration Procedures for Public Text Owner Identifiers (ISO Project 18.15.7.3)
Needless to say, provision has been made so that DTDs and other such SGML constructs need not accompany each SGML document which is interchanged. They may already be known at the receiving end, and referred to in a unique manner. And, while they could be shared and made publicly available, some companies may not wish to publish design details. A further consideration when debating registration matters was that it was not necessarily desirable to register every different document type, plus all the other constructs that could be classified in some way. The real need was to be able to identify the owner of any public text, hence the development of ISO 9070¹¹. ANSI has agreed to be the registration authority, and give out the necessary numeric identifiers. In addition, an amendment to this standard has been proposed to enable a company's ISBN (International Standard Book Number) to precede the unique identifier for the public text, since many companies already have such a standard number. They will then have no need to apply to ANSI.
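How public text is referred to by owner rather than shipped with every document can be sketched by taking apart a formal public identifier. The Python below assumes the ISO 8879 form of owner identifier, text class with description, and language, separated by `//`, with a leading `+` marking a registered owner and `-` an unregistered one. The `PublicIdentifier` type and the `parse_public_identifier` function are hypothetical, and optional parts of the syntax (the unavailable text indicator and the display version) are deliberately not handled.

```python
from typing import NamedTuple


class PublicIdentifier(NamedTuple):
    owner_kind: str   # "ISO", "registered" or "unregistered"
    owner: str        # who is responsible for the public text
    text_class: str   # e.g. DTD or ENTITIES
    description: str  # free-text description of the public text
    language: str     # language of the public text, e.g. EN


def parse_public_identifier(identifier):
    """Split a formal public identifier into its main components."""
    parts = identifier.split("//")
    if parts[0] == "+":
        kind, rest = "registered", parts[1:]
    elif parts[0] == "-":
        kind, rest = "unregistered", parts[1:]
    else:
        # No prefix: assumed here to be an ISO owner identifier.
        kind, rest = "ISO", parts
    if len(rest) != 3:
        raise ValueError(f"unsupported public identifier: {identifier!r}")
    owner, text_identifier, language = rest
    text_class, _, description = text_identifier.partition(" ")
    return PublicIdentifier(kind, owner, text_class, description, language)
```

For example, `parse_public_identifier("-//Example Corp//DTD Memo//EN")`, where Example Corp is an invented owner, reports an unregistered owner with text class `DTD`. A receiving system needs only this identifier to locate its own copy of the DTD, which is why identifying the owner uniquely matters more than registering every document type.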
Description and Identification of Glyph Fonts (ISO Project 18.27)
All parts of ISO/DIS 9541¹², the fonts standard, failed the ballot to elevate it from a Draft International Standard (DIS) for a variety of reasons. One was that members of SC 2 (Character Sets and Information Coding) felt that it could impinge on their work area. Some restructuring was carried out, first of all to remove sections of text that dealt with registration to a separate standard, and second to reorganize the text in a better manner. The term 'glyph' is now used to distinguish a symbol shape from a coded character (the province of SC 2)¹³. Part 1 will be on architecture, and part 2 will be on interchange format. These parts will be submitted for a second DIS ballot at the same time as ISO/DIS 10036¹⁴. In the future, there will either be a part 3 to ISO/DIS 9541 for Glyph-Shape Representations¹², or the text could be progressed as a separate standard. Font information is required by the text composition standards, and also by part 6 of ODA/ODIF Character Content Architectures and future graphics standards (see Reference 15).
SGML APPLICATIONS
As far as ODA is concerned, the most notable application of SGML is the Office Document Language (ODL). This gives the ability for a document structured in accordance with ODA to be represented in accordance with SGML (ODL) and interchanged by means of SDIF. So, an ODA document may enter an SGML-governed information base. It is also expected that this alternative means of interchange for ODA documents will be used when additional functionality is required that can only be provided by a publishing system (ODL is described in an annex to part 5 of the ODA standard)¹⁶.
There are, of course, many conventional publishing applications of SGML, and it is to be applauded that publishers have joined together to develop these, not only in the US but also in France and Holland, in particular. It is worth recounting that among the first was the Office for Official Publications of the European Communities. Government publishers are also using SGML. Several government departments have SGML applications, e.g. the Internal Revenue Service in the US, and the European Patent Office for all patents documentation. Others developing applications are the Departments of Energy and Commerce in the US, and NASA.
The largest application is for the US Department of Defense (DoD) Computer-aided Acquisition and Logistic Support (CALS) strategy. This affects not only the DoD but also the many contractors and subcontractors who do this work in offices. In the course of the two phases of implementation of CALS, moving from the current paper-based situation to a paperless one within the next decade, SGML is being used as a cornerstone. Among the many who will be displaying these SGML documents are military personnel deployed around the world. There has been universal recognition of the benefits that can accrue from CALS, and other countries are now taking up the same standards for much the same reasons. What is good for CALS is seen as being good for large organizations with many different types of business.
Then there are financial houses that need to produce documents very quickly as the market prices change, and other commercial organizations that wish to capitalize on a new way of regarding information. In these cases, SGML is used for everything from memoranda and trade data to reports. Members of the SGML Users' Group come from many different walks of life.
PRODUCTS
The number of SGML products available in the marketplace has increased significantly during the last year. Part of the reason for this is that many suppliers are taking the short cut of licensing an SGML parser from one of the two software houses that make such a product available to OEMs. ArborText and Docupro have the (Canadian) Software Exoterica parser, while Compugraphic, Context, Interleaf, Scribe and Xyvision have that from Sobemap, a Belgian company. IBM has developed its own parser, as has Datalogics.
A parser needs to be at the heart of an SGML system, but there are many other useful components to carry out all the other desirable functions. Scanning software is necessary to take in paper documents, and here the Palantir scanner is one of those used, with the Avalanche IMSYS software adding the SGML markup in accordance with a specific DTD. SoftQuad has an interactive editor, as has Datalogics, both companies also having formatters (and other products). Taunton Engineering has hypertext/hypermedia software, as have OWL and Scribe, and Officesmiths supply database software. What You See Is What You Get (WYSIWYG) solutions are provided by companies such as Interleaf and Compugraphic: markup may be seen if desired, or hidden; graphics and scanned images are incorporated in documents, the text edited, pages reformatted, and everything done that might be expected. Many products run on PCs and workstations, and some need mainframes.
It is being proved that markup does not have to be added by a user, and that WYSIWYG has its place in an SGML system. Suppliers are enhancing their products continually, with more vendors coming into the market. This is in addition to all the systems that have been developed in-house, and are not available commercially.
CONCLUDING REMARKS
It has only been possible, in the course of a short paper, to give an overview of SGML and the standards associated with it. For more information on the SGML approach to information, whether for the office or other needs, contact the SGML Users' Group, c/o Peter V Howgate, Secretary, Pindar Infotek, 11 Melrose Yard, Walmgate, York YO1 2XF, UK (Tel: (44) 904 622644).
REFERENCES*
1 Information Processing Systems - Text and Office Systems - Standard Generalized Markup Language (SGML) (ISO 8879) (1986)
2 Smith, J M and Stutely, R SGML: The User's Guide to ISO 8879 Ellis Horwood, UK (1988)
3 Computer-Assisted Publishing - Vocabulary (ISO/TR 9544) (1988)
4 Guidelines for SGML Syntax-Directed Editing Systems (ISO/DTR 10037) (1988)
5 Retroactive Conversion (ISO Project 18.15.5.2) (1989)
6 Document Style Semantics and Specification Language (DSSSL) (ISO Project 18.15.6.1) (1988)
7 Hunter, R, Kaijser, P and Nielsen, F 'ODA: a document architecture for open systems' Comput. Commun. Vol 12 No 2 (April 1989) pp 69-79
8 Robinson, P J and Strasen, S M 'Standard Page Description Language' Comput. Commun. Vol 12 No 2 (April 1989) pp 85-92
9 SGML Document Interchange Format (SDIF) (ISO 9069) (1988)
10 Techniques for Using Standard Generalized Markup Language (SGML) (ISO/TR 9573) (1988)
11 Registration Procedures for Public Text Owner Identifiers (ISO 9070) (1989)
12 Information Processing - Font and Character Information Interchange (ISO/DIS 9541) (1989)
13 Smura, E, Beeton, B, Savage, K and Griffee, A 'Font information interchange standard ISO/IEC 9541' Comput. Commun. Vol 12 No 2 (April 1989) pp 93-96
14 Procedure for Registration of Glyph and Glyph Collection Identifiers (ISO/DIS 10036) (1989)
15 Comput. Aided Des. special issue on future graphics standards, Vol 19 No 8 (October 1987)
16 Office Document Architecture (ODA) and Interchange Formats - Office Document Interchange Format (ODIF) (ISO 8613-4) (1989)
*Draft International Standards (DIS) and Technical Reports (TR) are available from the International Organization for Standardization Central Secretariat, 1 rue de Varembé, Case Postale 56, CH-1211 Geneva 20, Switzerland.
Joan Smith is President of the SGML Users' Group. Formerly leader of the UK's input into SGML and related standards, she is now the liaison representative for the Users' Group to ISO/IEC JTC 1/SC 18 and its Working Group 8. She is a Fellow of the British Computer Society and the winner of a Tekkie award.