... </P> </OL> </BODY>
0169-7552/96/$15.00 © 1996 Published by Elsevier Science B.V. All rights reserved. PII S0169-7552(96)00042-6
S. Bonhomme, C. Roisin / Computer Networks and ISDN Systems 28 (1996) 1075-1084
Fig. 1. An HTML document and its instance tree.
The next section discusses the issue of editing Web documents, Section 3 identifies different restructuring needs and Section 4 describes some approaches that have been taken in different areas where restructuring is necessary. The rest of the paper is devoted to the solution implemented in Tamaya for interactive structural transformations of HTML documents.
2. Editing correct HTML documents

Most existing documents on the Web do not conform to the HTML DTD, but browsers are flexible enough to display invalid documents. However, incorrectly tagged documents can lead to different interpretations and prevent further document processing, such as automatic extraction and filtering of information or alternative presentations of a document (using Cascading Style Sheets, for instance [15]).

There are three main ways to produce Web documents:
• inserting tags manually in raw text,
• using a filter that generates an HTML-tagged document from another document format,
• using an HTML authoring tool.

It is very difficult to write correct HTML documents by hand, because of the complexity of the HTML syntax. Filters usually produce valid HTML documents, but users often need to modify their output to obtain a final version. Therefore, among these three ways of producing documents, only authoring tools that provide structure and syntax checking can ensure the correctness of documents.
In Tamaya, checking is performed while the document is being edited. The user does not need to know the HTML language, as menus and dialogue boxes help to create new elements; hypertext links can be added or modified very simply, by activating a command and clicking the target document. Editing is performed directly on formatted documents.

The structural constraints imposed by the HTML DTD are a limitation when editing HTML documents interactively. For example, the familiar command that allows a user to copy or cut a part of a document (the source) and to insert (or paste) it into another part (the target) may be refused because it would lead to an incorrect structure. To handle this command, an editor must take into account both the structure of the source elements and the structure allowed in the target. To allow a cut-and-paste operation to work when the structures are different, the source elements must be transformed to become consistent with the target structure. Usually, the user wants this transformation to be automatic when editing a document: in most cases, limitations on cut-and-paste due to structure differences are a nuisance. Sometimes, however, the user may want to indicate preferences, either when several transformations are possible or for complex transformations.
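The check that decides whether a paste is accepted can be illustrated with a minimal sketch. The content models below are a deliberately simplified, hypothetical subset and not the real HTML DTD, and `can_paste` is an illustrative name, not a function of the editor.

```python
# Hypothetical, heavily simplified content models (NOT the actual HTML DTD):
# each parent tag maps to the set of tags it may directly contain.
ALLOWED_CHILDREN = {
    "BODY": {"H3", "H4", "P", "UL", "OL", "DL"},
    "UL": {"LI"},
    "OL": {"LI"},
    "DL": {"DT", "DD"},
    "LI": {"P", "UL", "OL"},
}


def can_paste(source_tags, target_parent):
    """Return True if every source element is allowed as a child of the target."""
    allowed = ALLOWED_CHILDREN.get(target_parent, set())
    return all(tag in allowed for tag in source_tags)


if __name__ == "__main__":
    print(can_paste(["LI", "LI"], "OL"))  # list items pasted into another list
    print(can_paste(["LI", "LI"], "DL"))  # refused unless the items are transformed
```

When the check fails, as for list items pasted into a definition list, the source elements have to be transformed before the paste can succeed.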
3. Restructuring requirements

Two main classes of HTML document transformations may be identified:
1. structuring a flat text into an HTML document,
2. restructuring an existing HTML document.

The first class corresponds typically to the situation where the user wants to create an HTML document from raw text. Usually, the user starts from a flat document containing a sequence of paragraphs, and changes the tag of each paragraph one after the other. This operation could be sped up if the user could select a set of paragraphs and apply a more complex transformation to them. For example, a sequence of paragraphs can be transformed into a list in which each item contains a paragraph of the source structure; a DL (Definition List) structure can be filled with source paragraphs placed alternately in DT (Definition Term) and DD (Definition Data) elements.
The second class corresponds to the situation where the user decides to move some part of a document to another place (the common cut-and-paste operation) or to change the structure of a selected set of elements without moving them. These transformations are provided by the editor by means of two sets of commands:
• Common commands of word processors: copy, cut and paste for moving elements.
• Common commands of structured document editors such as SGML editors: change, extract, surround, split and merge for local transformations.

Change converts an element into another element allowed at the same place. The four other commands are shortcuts for specific transformations: extract suppresses one hierarchical level (lists of lists become lists), surround performs the opposite operation (adds a level), split divides an element into two sibling elements, and merge performs the reverse operation.

These commands achieve basic structure transformations, but are not sufficient for handling the most complex cases. The methods for implementing restructuring operations depend on the differences between the source and the destination structures, not on the command itself. In most cases, several proposals can be presented to the user, depending on the source type. But the user may also want to express a specific transformation, and therefore a transformation language is needed.
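The five local commands can be sketched on a bare instance tree. This is only an illustration under simplifying assumptions: the `Node` class and the function signatures are hypothetical, text content is ignored, and no check against the DTD is made.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal instance-tree node: a tag and its children (content is ignored)."""
    tag: str
    children: List["Node"] = field(default_factory=list)

    def show(self) -> str:
        # Parenthesized rendering, e.g. UL(LI,LI)
        if not self.children:
            return self.tag
        return f"{self.tag}({','.join(c.show() for c in self.children)})"


def change(node: Node, new_tag: str) -> Node:
    """Change: keep the children, replace the tag."""
    return Node(new_tag, list(node.children))


def surround(node: Node, wrapper_tag: str) -> Node:
    """Surround: add one hierarchical level above the node."""
    return Node(wrapper_tag, [node])


def extract(parent: Node, index: int) -> Node:
    """Extract: suppress one level by replacing child `index` with its own children."""
    child = parent.children[index]
    kids = parent.children[:index] + child.children + parent.children[index + 1:]
    return Node(parent.tag, kids)


def split(node: Node, at: int) -> List[Node]:
    """Split: divide an element into two sibling elements of the same type."""
    return [Node(node.tag, node.children[:at]), Node(node.tag, node.children[at:])]


def merge(left: Node, right: Node) -> Node:
    """Merge: reverse of split; both elements must have the same tag."""
    if left.tag != right.tag:
        raise ValueError("merge expects two elements of the same type")
    return Node(left.tag, left.children + right.children)


if __name__ == "__main__":
    ul = Node("UL", [Node("LI"), Node("LI"), Node("LI")])
    first, second = split(ul, 1)
    print(first.show(), second.show(), merge(first, second).show())
```

Each function returns a new tree rather than modifying its argument, which keeps the sketch easy to test; an interactive editor would more likely update the tree in place.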
4. Structure transformation methods

Few studies have addressed the specific problem of document structure transformations. Some analyses of the problem can be found in [1,2,6,7,14], but structured editing systems are not the only systems faced with structure conversion problems; similar problems arise in other areas such as programming languages, structure-oriented programming environments (Gandalf [8], Centaur [5], Synthesizer Generator [17]) and object-oriented databases [19]. In most of these areas, type conversions are based on explicit specifications of the desired transformations [11]. Synthesizer Generator [17] uses "transformation declarations", SIMON [7] is based on an extension of attribute grammars, and Centaur is based on another extension of attribute grammars, natural semantics.

When synthesizing these structure transformation techniques [12] and applying them to HTML documents, three main methods can be identified. They can be applied alone or combined, depending on the structural differences between the source and target structure definitions:
• Direct transformation: when the source and target tags differ but their structures are the same, only the tag of the element needs to be changed. A trivial example is the transformation of an Hi (Heading) into an Hj, or of a UL (Unnumbered List) into an OL (Ordered List).
• Automatic transformation: when the source and target structures differ, an automatic transformation can be performed by comparing these structures and the element attributes. For instance, a UL can be transformed into a DL if the system can automatically transform the UL tag into DL, and each LI (List Item) into either DT or DD. As this method is automatic, it is not possible to specify the type into which each descendant must be transformed.
  UL(LI, LI) → DL(DT, DT) or DL(DD, DD)
• Explicit transformation: when the user wants to specify how some specific (or all) elements of the source must be transformed, explicit transformation rules are necessary. As an example, a user may wish to transform a UL element into a DL, the first LI into a DT, and the second LI into a DD.
  UL(LI, LI) → DL(DT, DD)
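The difference between the three methods can be sketched on the UL example. The tuple representation and the function names are assumptions made for illustration only. In particular, the "automatic" variant simply applies one target type to every item, which is one possible reading of the text and not the system's actual algorithm.

```python
from typing import Callable, List, Tuple

# A tree is (tag, children); content is ignored in this sketch.
Tree = Tuple[str, List["Tree"]]


def direct(tree: Tree, new_tag: str) -> Tree:
    """Direct transformation: same structure, only the root tag changes."""
    _, children = tree
    return (new_tag, children)


def retag(tree: Tree, new_root: str, choose: Callable[[int, str], str]) -> Tree:
    """Change the root tag and pick a new tag for each child via `choose(index, old_tag)`."""
    _, children = tree
    return (new_root, [(choose(i, tag), kids) for i, (tag, kids) in enumerate(children)])


def automatic(tree: Tree) -> Tree:
    """Automatic UL -> DL: the system picks one target type for all items (here DD)."""
    return retag(tree, "DL", lambda i, tag: "DD")


def explicit(tree: Tree) -> Tree:
    """Explicit UL -> DL: the user states that items alternate between DT and DD."""
    return retag(tree, "DL", lambda i, tag: "DT" if i % 2 == 0 else "DD")


if __name__ == "__main__":
    ul: Tree = ("UL", [("LI", []), ("LI", [])])
    print(direct(ul, "OL"))
    print(automatic(ul))
    print(explicit(ul))
```

Only the explicit variant lets the caller decide the type of each descendant, which is the reason a rule language is needed.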
The aim of HTML is to represent a large diversity of documents. As a consequence, types defined in its DTD are general and can be used in various contexts. Thus, HTML documents usually have a flat structure (all headings are at the same level), but recursive types can produce deeper parts in documents (nested lists and directories). Due to this specific structure, a general tree transformation method is not relevant for HTML document transformations.
Furthermore, when the user wants to give a hierarchical structure to a flat document, the editor cannot do it automatically: hierarchies cannot be deduced from the sequence of tags of sibling elements. Explicit rules are required in this case.
5. Type conversion in Tamaya

This section describes the language-based conversion mechanisms used in Tamaya. The first part explains the motivations for defining a new language, then the expression of transformations is presented. Finally, the transformation process is described.

5.1. Motivations

The goal of the transformation language is to express transformations that cannot be achieved automatically. When composing an HTML document, the starting point is frequently a plain text document and/or several HTML fragments. Like many languages for document structure transformation (DSSSL [11], Scrimshaw [2]), the one presented here is based on tree pattern matching [9] to identify the elements to be transformed, and on replacement rules to specify the target types. The problem with existing languages is that the expression of patterns and transformations is complex and not easily understood by a human reader. For instance, a language like DSSSL applies to any SGML document and, for that reason, is too complex and not well suited to HTML documents. A specific language is more appropriate and much simpler to implement.

The language described below has two original features:
• SGML-like expression of patterns: although parenthesized expressions are convenient for pattern matching (this formalism is used to describe patterns in many languages based on pattern matching, like Scrimshaw [2] or RIGAL [3,4]), an SGML-like formalism has been chosen to express patterns because it is more concise and readable.
• Principle of minimal rule expression: in most transformation languages, transformation rules for pattern elements have to be specified exhaustively, and all elements generated in the output are the consequence of a transformation expression. Here the approach is different. Only the pertinent transformation rules have to be expressed; implicit transformations are deduced by combining the context of the source element, the previous transformations made and an automatic type conversion algorithm.

Furthermore, an incremental definition of transformations is provided: the transformation system works on a set of predefined transformations. This set can be read from a file at initialization time and it can be dynamically completed by the user at any time. When a new transformation is introduced in the system, its conformance with the HTML DTD is checked and the pattern is parsed in order to produce an internal parenthesized string representation.

5.2. Specifying transformations

The language allows the specification of transformations to be applied to sets of elements that match a pattern. Each transformation is defined by:
• a pattern, identifying a particular organization of elements in the source. These elements constitute either a single tree or a forest of trees whose roots are immediate siblings. The common parent node of these trees is called the source root.
• a list of transformation rules, which specifies how to generate the target instance tree in accordance with the source elements. The root of this target tree, called the target root, depends on the command performed by the user:
  • for a local transformation command, it is the same root as that of the source instance elements, i.e. the source root,
  • for a paste command, it is the node identifying the place where the user wants the paste operation to be performed (the current selection).

The specification of a transformation is written as a source pattern between square brackets followed by a list of transformation rules between curly brackets:

  Transformation ::= '[' Pattern ']' '{' RuleList '}'
5.2.1. Source pattern
Examples of patterns are: a sequence of headings, a numbered list of paragraphs, a heading followed by one or more paragraphs. These examples, informally given in English, are formally represented by pattern expressions. A pattern expression is built with tags and operators. Four operators are available:
• Alternative: P1|P2 matches either pattern P1 or pattern P2,
• Immediate sibling: P1,P2 matches two consecutive element sets, the first matching P1 and the second matching P2,
• Child: T1.P2 matches a tree whose root is of type T1 and whose descendants match pattern P2,
• Sequence: P1+ matches a succession of element sets, each of them matching pattern P1.

The same tag may occur several times in a pattern expression. Tags should therefore be renamed to avoid ambiguity when different rules are to be applied to different occurrences of the same tag. Renaming associates a local name with one occurrence of a tag in the pattern (see example (2) below). The local name and the tag name are separated by a colon. Parentheses are used for grouping the nodes to which the same operator must be applied (for instance, the operator +). The definition of a pattern expression is:

  Pattern ::= Forest | Forest '|' Pattern
  Forest  ::= Tree | Tree ',' Forest
  Tree    ::= Branch | Branch '.' Tree
  Branch  ::= Node | Node '+'
  Node    ::= TagName | LocalName ':' TagName | '(' Pattern ')'
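The grammar is small enough to be handled by a recursive-descent parser with one method per non-terminal. The sketch below is an illustration only: the nested-tuple output, the node labels ("alt", "seq", "child", "plus", "tag") and the restriction of names to letters and digits are assumptions, and it does not reproduce the internal parenthesized string used by the system.

```python
import re

# A token is either a name (letters and digits) or one of the operator characters.
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9]*|[|,.+:()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def pattern(self):   # Pattern ::= Forest | Forest '|' Pattern
        left = self.forest()
        if self.peek() == "|":
            self.take("|")
            return ("alt", left, self.pattern())
        return left

    def forest(self):    # Forest ::= Tree | Tree ',' Forest
        left = self.tree()
        if self.peek() == ",":
            self.take(",")
            return ("seq", left, self.forest())
        return left

    def tree(self):      # Tree ::= Branch | Branch '.' Tree
        left = self.branch()
        if self.peek() == ".":
            self.take(".")
            return ("child", left, self.tree())
        return left

    def branch(self):    # Branch ::= Node | Node '+'
        node = self.node()
        if self.peek() == "+":
            self.take("+")
            return ("plus", node)
        return node

    def node(self):      # Node ::= TagName | LocalName ':' TagName | '(' Pattern ')'
        if self.peek() == "(":
            self.take("(")
            inner = self.pattern()
            self.take(")")
            return inner
        name = self.take()
        if not name.isalnum():
            raise SyntaxError(f"expected a tag name, got {name!r}")
        if self.peek() == ":":
            self.take(":")
            return ("tag", self.take(), name)   # (tag name, local name)
        return ("tag", name, None)


def parse(text):
    parser = Parser(text)
    ast = parser.pattern()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return ast


if __name__ == "__main__":
    print(parse("(OL|UL).LI+"))
```

Because each method mirrors one grammar rule, the operator precedence of the language ('+' binds tightest, then '.', then ',', then '|') follows directly from the call order.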
Here are some examples of pattern expressions and the structure they represent (see Figs. 3 and 4):

(1) (OL|UL).LI+

This pattern identifies a sequence of Items of an Ordered or Unnumbered List.

(2) H3,(SectParag:P|UL)+,(H4,SubSectParag:P+)+

This pattern identifies an H3 heading followed by a sequence of paragraphs and Unnumbered Lists, itself followed by a repetition of H4 followed by a sequence of Paragraphs. As the tag P appears twice in the pattern, it is renamed with two different names (SectParag and SubSectParag). This pattern is typically a part of the structure of a document organized in sections and subsections.

5.2.2. Transformation rules
Once the pattern to be matched is defined, the transformation rules to be applied to the elements matching the pattern are given between curly brackets. A rule has two parts:
• The source tag, which is a component of the pattern.
• The target tag list, which specifies where and how the source tag has to be changed.

Both parts are separated by the symbol '->' and a rule is terminated by a semicolon. The target tag list is a list of tags separated by dots. It gives the position where new elements have to be inserted. This position is given relative to the rightmost branch of the target instance tree being generated. The first node in the list that does not yet exist triggers the creation of a branch of new elements. A special separator ':' may appear once in the list to specify where a new branch must be generated in the target tree. Different positions of this separator in the target tag list can produce different results, as shown in Fig. 2.

  transformation 1: [ P+ ] { P -> UL:LI.P ; }
  transformation 2: [ P+ ] { P -> UL.LI:P ; }
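The effect of the separator can be sketched as follows, under one plausible reading of the rule description: the tags before ':' are followed along the rightmost branch of the target tree (and created when missing), while the tags after ':' always create new elements. The `Node` class, the `place` and `transform` functions and the labels a, b, c are hypothetical, and the sketch only covers target tag lists that contain the separator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    label: str = ""                       # identifies the source element, e.g. "a"
    children: List["Node"] = field(default_factory=list)

    def show(self) -> str:
        name = self.tag + self.label
        if not self.children:
            return name
        return name + "(" + ",".join(c.show() for c in self.children) + ")"


def place(target_root: Node, target_list: str, label: str) -> None:
    """Insert one source element under target_root following a list such as 'UL:LI.P'."""
    if target_list.count(":") != 1:
        raise ValueError("this sketch only handles lists with exactly one ':'")
    reuse_part, fresh_part = target_list.split(":")
    node = target_root
    # Tags before ':' are looked up on the rightmost branch; the first missing
    # one starts a branch of new elements (assumed reading of the rule text).
    for tag in [t for t in reuse_part.split(".") if t]:
        last = node.children[-1] if node.children else None
        if last is None or last.tag != tag:
            last = Node(tag)
            node.children.append(last)
        node = last
    # Tags after ':' always generate new elements.
    for tag in [t for t in fresh_part.split(".") if t]:
        new = Node(tag)
        node.children.append(new)
        node = new
    node.label = label


def transform(labels, target_list: str) -> str:
    root = Node("Root")
    for label in labels:
        place(root, target_list, label)
    return root.show()


if __name__ == "__main__":
    print(transform(["a", "b", "c"], "UL:LI.P"))   # one list item per paragraph
    print(transform(["a", "b", "c"], "UL.LI:P"))   # a single item holding all paragraphs
```

Under this reading, transformation 1 wraps each paragraph in its own list item, whereas transformation 2 gathers all paragraphs in one item.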
If the separator ':' is placed at the head of the target tag list, a new instance tree is created as the last child of the target root. This feature is useful to express transformations that suppress hierarchical levels, such as the one shown in Fig. 3, which is based on pattern (1):

  [ (OL|UL).LI+ ]
  { LI -> :P ; }

Here is a set of rules for pattern (2) which adds new hierarchical levels (Fig. 4):

  [ H3,(SectParag:P|UL)+,(H4,SubSectParag:P+)+ ]
  { H3 -> :DL.DT ;
    SectParag -> DL.DD:P ;
    H4 -> DL.DD:DL.DT ;
    SubSectParag -> DL.DD.DL.DD:P ; }

Fig. 2. Two transformations of the same source.
Fig. 3. A transformation that suppresses levels.
Fig. 4. A transformation that adds levels.
5.3. Transformation process

When a structuring command is invoked by a user (cut-and-paste or change type, for example), patterns are checked against the current selection. A source tag string representing the structure of the selected region is first constructed. In the source tag string, the content of the selected elements is ignored; only tags and their relative positions are represented. As an example, here is a part of an HTML document: