Journal of Computer and System Sciences 66 (2003) 451–472 http://www.elsevier.com/locate/jcss
Translation of binary regular expressions into nondeterministic $\varepsilon$-free automata with $O(n \log n)$ transitions
Viliam Geffert
Department of Computer Science, P.J. Šafárik University, Jesenná 5, 04154 Košice, Slovakia
Received 2 July 2001; revised 5 March 2002
Abstract
We show that every regular expression of size $n$ over a fixed alphabet of $s$ symbols can be converted into a nondeterministic $\varepsilon$-free finite-state automaton with $O(sn \log n)$ transitions (edges). In the case of binary regular languages, this improves the previously known conversion from $O(n(\log n)^2)$ transitions to $O(n \log n)$. For the general case, with no bound on the cardinality of the input alphabet, our conversion yields a better constant factor in the $O(n(\log n)^2)$ term. The number of states is bounded by $O(n)$.
© 2003 Elsevier Science (USA). All rights reserved.
Keywords: Formal languages; Descriptional complexity; Regular languages; Finite-state automata; Regular expressions
1. Introduction
The class of regular languages is probably one of the simplest and most extensively studied classes in formal language theory. Despite their simplicity, some important problems concerning regular languages are still open. In particular, not all relations among the descriptional complexities of the different formalisms representing regular languages are known. In this paper we consider the relation between two standard formalisms representing regular languages, namely, regular expressions versus nondeterministic finite automata without $\varepsilon$-transitions. The size of an automaton is measured by the number of its transitions (edges), while the size of a regular expression is defined as the number of occurrences of alphabet symbols
This work was supported by the Slovak Grant Agency for Science (VEGA) under contract “Combinatorial Structures and Complexity of Algorithms.”
E-mail address: [email protected].
0022-0000/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved. doi:10.1016/S0022-0000(03)00036-9
V. Geffert / Journal of Computer and System Sciences 66 (2003) 451–472
in it. (As we shall see later, this measure is linearly related to the length of a string describing the regular expression.)
On the one hand, the conversion of a nondeterministic automaton into an equivalent regular expression may increase the size of the representation exponentially [2]. On the other hand, the standard textbook conversion (see e.g. [5]) of a regular expression into an $\varepsilon$-free automaton produces automata whose size is quadratic in the size of the input. It was even conjectured that the $O(n^2)$ conversion is the best one we can achieve [9, p. 77 and Exercise 3.7 on p. 106]. This conjecture was disproved in [6,7], where an $O(n(\log n)^2)$ conversion was described. The authors also presented an $\Omega(n \log n)$ lower bound; thus, no linear-size conversion is possible.
In this paper, we show that, for regular expressions over a fixed alphabet consisting of $s$ symbols, each regular expression of size $n$ can be converted into a nondeterministic $\varepsilon$-free finite-state automaton with $O(sn \log n)$ transitions and $O(n)$ states. Thus, for binary regular languages (or regular languages over any fixed alphabet), we get an equivalent automaton with $O(n \log n)$ transitions. We also derive a second upper bound for this conversion procedure. This improves the known results for the general case, with no bound on the cardinality of the input alphabet, giving a better constant factor in the $O(n(\log n)^2)$ term.
2. Preliminaries
We first recall some basic definitions that are used throughout. A one-way nondeterministic finite automaton is a quintuple $M = (Q, \Sigma, \Delta, q_s, F)$, where $Q$ is a finite set of states, $\Sigma$ is a finite input alphabet, $\Delta \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times Q$ is a set of transitions, $q_s \in Q$ is an initial state, and $F \subseteq Q$ is a subset of final states. Here $\varepsilon$ denotes the empty string. A finite automaton $M$ can also be viewed as a graph with states corresponding to nodes and transitions corresponding to oriented labeled edges. The language accepted by $M$ is then the set $L(M)$ of all strings $a_1 \ldots a_k \in \Sigma^*$ for which there exists a path of edges connecting $q_s$ with some state in $F$, labeled by $a_1 \ldots a_k$. An automaton is $\varepsilon$-free if no transition is labeled by $\varepsilon$, i.e., $\Delta \subseteq Q \times \Sigma \times Q$.
A regular expression over an alphabet $\Sigma$ is defined as a finite expression $\alpha$ built from the symbols in $\Sigma$ and the special symbols “$\emptyset$” (empty set) and “$\varepsilon$” (empty string), using the binary operators “$+$” (union) and “$\cdot$” (concatenation), and the unary operator “$*$” (iteration). For technical reasons, we found it useful to introduce also a unary operator “$?$” (option), which is a shorthand notation, defined by $\alpha^{?} \stackrel{\mathrm{df}}{=} \alpha + \varepsilon$.
Binary operators are written in infix notation and unary operators in postfix notation; “$\cdot$” is often omitted. We use the standard precedence of operators: the unary operators have the highest priority and “$+$” the lowest; parentheses are used to indicate grouping. To resolve syntactic ambiguities arising from iterated unions, $\alpha_1 + \cdots + \alpha_n$ is understood as a shorthand for $(\cdots((\alpha_1 + \alpha_2) + \alpha_3)\cdots) + \alpha_n$; the same holds for concatenation.
The language represented by a regular expression $\alpha$, denoted by $L(\alpha)$, is defined in the standard way, by structural induction on $\alpha$ (see e.g. [5]). The size of a regular expression $\alpha$, denoted by $\mathrm{size}(\alpha)$, is, by definition, the number of occurrences of alphabet symbols in it, i.e., by structural induction,
$\mathrm{size}(\emptyset) = \mathrm{size}(\varepsilon) = 0$,
$\mathrm{size}(a) = 1$,
$\mathrm{size}(\alpha_1^*) = \mathrm{size}(\alpha_1^{?}) = \mathrm{size}(\alpha_1)$,
$\mathrm{size}(\alpha_1 \cdot \alpha_2) = \mathrm{size}(\alpha_1 + \alpha_2) = \mathrm{size}(\alpha_1) + \mathrm{size}(\alpha_2)$,
where $a$ denotes a symbol in $\Sigma$, and $\alpha_1, \alpha_2$ are subexpressions of $\alpha$. Lemma 3.1 below shows that this measure of size is linearly related to the length of a string describing the regular expression.
We shall also use a representation of a regular expression by a binary tree. In this tree, each binary operator is represented by an inner node with two sons corresponding to its two subexpressions, each unary operator by an inner node with one son for its only subexpression, and each occurrence of an alphabet symbol or of a special symbol “$\emptyset$” or “$\varepsilon$” by a leaf. By a subtree below a node $x$ we mean the tree consisting of the node $x$ itself and all its descendants in the given tree.
The following technical lemma plays an important role in obtaining an efficient conversion of regular expressions into $\varepsilon$-free automata. It was already used, in a slightly different form, in [6,7] to derive an $O(n(\log n)^2)$ upper bound for this conversion. Historically, it was used for recognition of context-free languages in $O((\log n)^2)$ space [3]. Compared with the original version, we slightly improve the initial threshold assumption (see also [4, Lemma 11.1 and Theorem 11.3]). The lemma shows how a binary tree can be decomposed into two “balanced” parts, the first one forming a subtree below some node $x$, the second one consisting of all remaining nodes.
Each of these two parts contains at least $\frac{1}{3}$ of the leaves that were marked, initially, as “attended.” Note also that the argument gives us an algorithmic procedure that finds the separating inner node $x$.
Lemma 2.1. Let $R$ be a finite binary tree, with each leaf marked either as “attended” or as “ignored.” The total number of leaves marked as “attended” is $k \ge 2$. Then there exists a node $x$ such that the number of attended leaves in the subtree below the node $x$ is at most $\frac{2}{3}k$, but more than $\frac{1}{3}k$.
Proof. We first need to prove the following simple claim. If $x$ is a node of $R$ with more than $\frac{2}{3}k$ attended leaves in its subtree, then $x$ has a son $y$ such that (a) either the subtree below $y$ has more than $\frac{2}{3}k$ attended leaves, (b) or the number of attended leaves in the subtree below $y$ is at most $\frac{2}{3}k$, but more than $\frac{1}{3}k$.
Finding $y$ is quite straightforward. First, for contradiction, suppose that $x$ is a leaf. Then the entire subtree below $x$ has only one leaf. Since $1 < \frac{2}{3}k$ for $k \ge 2$, this contradicts the assumption
that the subtree has more than $\frac{2}{3}k$ leaves marked as attended. Thus, $x$ must be an inner node of $R$, with some sons $y_1$ and $y_2$. Consider first its left son $y_1$. If $y_1$ satisfies (a) or (b), we are done. Otherwise, $y_1$ has at most $\frac{1}{3}k$ attended leaves in its subtree. This implies that the subtree below the right son $y_2$ must have more than $\frac{1}{3}k$ attended leaves, since the total number of leaves marked as attended in the two subtrees of $x$ is above $\frac{2}{3}k$. Therefore, $y_2$ must satisfy (a) or (b).
We are now ready to complete the argument. We start from the root $x_0$, which clearly has $k > \frac{2}{3}k$ leaves marked as attended. But then, by the claim above, $x_0$ has a son $x_1$ satisfying either (a) or (b). If $x_1$ satisfies (b), we are done. Otherwise, by the claim, $x_1$ has a son $x_2$ satisfying (a) or (b), and so on. By repeating this procedure, we get a path $x_0, x_1, x_2, \ldots$ from the root down to the leaves. If none of the nodes along this path satisfies (b), we shall reach a leaf satisfying (a). But this is a contradiction, as shown above. $\Box$
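The walk from the root used in this proof can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, the `attended` flag, and the function names are hypothetical and not part of the paper, and attended leaves are recounted at every step, which keeps the code short at the cost of quadratic time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Node of a finite binary tree; a node without sons is a leaf (hypothetical type)."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    attended: bool = False  # meaningful for leaves only


def count_attended(x: Node) -> int:
    """Number of attended leaves in the subtree below x (x itself included)."""
    sons = [s for s in (x.left, x.right) if s is not None]
    if not sons:
        return 1 if x.attended else 0
    return sum(count_attended(s) for s in sons)


def find_separator(root: Node) -> Node:
    """Return a node whose subtree has more than k/3 but at most 2k/3 attended leaves."""
    k = count_attended(root)
    if k < 2:
        raise ValueError("Lemma 2.1 assumes k >= 2 attended leaves")
    x = root
    while True:
        # Invariant: the subtree below x has more than (2/3)k attended leaves.
        for y in (x.left, x.right):
            if y is None:
                continue
            c = count_attended(y)
            if 3 * c > k:           # y satisfies (a) or (b) of the claim
                if 3 * c <= 2 * k:  # case (b): y is the separating node
                    return y
                x = y               # case (a): continue the walk below y
                break
        else:
            # Excluded by the claim in the proof: some son always qualifies.
            raise AssertionError("no son satisfies (a) or (b)")
```

The comparisons use integers (`3 * c > k`, `3 * c <= 2 * k`) instead of fractions, so no rounding is involved.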
3. Conversion into small automata
In this section, we begin with a preprocessing phase, which puts a regular expression of size $n$ into a normal form with at most $n - 1$ binary operators. Such a regular expression is then converted into a nondeterministic automaton with at most $2n$ states. We allow $\varepsilon$-transitions in this intermediate machine, but the number of $\varepsilon$-free transitions does not exceed $n$.
Lemma 3.1. Each regular expression $\alpha$ of size $n$ can be replaced by an equivalent expression $\alpha'$ of size at most $n$, such that $\alpha'$ is either a single symbol $\emptyset$ or $\varepsilon$, or, in the tree representation of $\alpha'$, (i) each leaf corresponds to an alphabet symbol (i.e., a special symbol $\emptyset$ or $\varepsilon$ is not allowed), and (ii) a son of a unary inner node corresponds either to a single alphabet symbol or to a concatenation (i.e., the son is not another unary node, nor can it be a binary node for union).
Proof. Simplify the given regular expression $\alpha$ by the systematic string replacements shown in Table 1. Note that all these transformations preserve $L(\alpha)$, the language represented by $\alpha$, and that none of them increases the size of the expression. Further, the reader may easily verify that, when none of the rewriting rules (a)–(d) is applicable, the resulting expression $\alpha'$ has either
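Table 1
Rewriting rules for putting a regular expression into the normal form. Here $\alpha_1, \alpha_2$ represent arbitrary subexpressions in $\alpha$.
(a) $\emptyset^* \Rightarrow \varepsilon$; $\emptyset^{?} \Rightarrow \varepsilon$; $\emptyset \cdot \alpha_1 \Rightarrow \emptyset$; $\alpha_1 \cdot \emptyset \Rightarrow \emptyset$; $\emptyset + \alpha_1 \Rightarrow \alpha_1$; $\alpha_1 + \emptyset \Rightarrow \alpha_1$
(b) $\varepsilon^* \Rightarrow \varepsilon$; $\varepsilon^{?} \Rightarrow \varepsilon$; $\varepsilon \cdot \alpha_1 \Rightarrow \alpha_1$; $\alpha_1 \cdot \varepsilon \Rightarrow \alpha_1$; $\varepsilon + \alpha_1 \Rightarrow \alpha_1^{?}$; $\alpha_1 + \varepsilon \Rightarrow \alpha_1^{?}$
(c) $\alpha_1^{**} \Rightarrow \alpha_1^*$; $\alpha_1^{*?} \Rightarrow \alpha_1^*$; $\alpha_1^{?*} \Rightarrow \alpha_1^*$; $\alpha_1^{??} \Rightarrow \alpha_1^{?}$
(d) $(\alpha_1 + \alpha_2)^* \Rightarrow (\alpha_1^{?} \alpha_2^{?})^*$; $(\alpha_1 + \alpha_2)^{?} \Rightarrow \alpha_1 + \alpha_2^{?}$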
degenerated into a single symbol $\emptyset$ or $\varepsilon$, or it satisfies the condition (i), by (a) and (b), as well as (ii), by (c) and (d). $\Box$
If a regular expression reduces to $\emptyset$ or $\varepsilon$, it can clearly be converted into an automaton with a single state and no transitions at all. We shall now concentrate on the nontrivial case. As a consequence of the lemma above, the expression in normal form does not contain more than $n - 1$ binary operators, since, by (i), the corresponding tree has at most $n$ leaves. By (ii), neither does it contain more than $2n - 1$ unary operators, since a son of a unary operator is either a leaf or a binary node.
We can now eliminate the unary nodes from the tree by “unifying” them with their sons. This gives a binary tree with the following types of nodes. Inner nodes represent binary operators of (i) a union $\alpha_1 + \alpha_2$, (ii) a simple concatenation $\alpha_1 \alpha_2$, (iii) an optional concatenation $(\alpha_1 \alpha_2)^{?}$, or (iv) an iterated concatenation $(\alpha_1 \alpha_2)^*$. A leaf represents either (v) a simple alphabet symbol $a \in \Sigma$, (vi) an optional symbol $a^{?}$, or (vii) an iterated symbol $a^*$.
We are now ready to convert regular expressions into automata.
Theorem 3.2.
Each regular expression $\alpha$ of size $n \ge 1$ can be replaced by an equivalent nondeterministic automaton $M$ with at most $2n$ states and $n$ $\varepsilon$-free transitions, such that:
(a) For each subexpression $\beta$ in $\alpha$, corresponding to a subtree below some node in the tree representation of $\alpha$, there exists a subautomaton $M_\beta$ in $M$, which is a subgraph in the graph representation of $M$.
(b) For each $\beta'$, a subexpression of $\beta$ corresponding to some subtree with the top node located in the subtree for $\beta$, $M_{\beta'}$ is a subautomaton of $M_\beta$, i.e., a subgraph nested in the subgraph for $M_\beta$.
(c) $M_\beta$ has a single entry point, a state $\mathit{en}_\beta \in Q$, and a single exit point, a state $\mathit{ex}_\beta \in Q$, with $\mathit{en}_\beta \ne \mathit{ex}_\beta$, such that a string $a_1 \ldots a_k \in \Sigma^*$ is in $L(\beta)$ if and only if there exists a path of edges connecting, within the subgraph for $M_\beta$, the state $\mathit{en}_\beta$ with $\mathit{ex}_\beta$ and labeled by $a_1 \ldots a_k$.
(d) Any path going into the subgraph for $M_\beta$ from the surrounding environment must pass through the state $\mathit{en}_\beta$. Further, $M_\beta$ has no edges ending in $\mathit{en}_\beta$.
(e) Any path leaving the subgraph $M_\beta$ to the surrounding environment must pass through the state $\mathit{ex}_\beta$. Further, there are no edges from the surrounding environment ending in $\mathit{ex}_\beta$.
Proof. It is known that a regular expression can be replaced by a nondeterministic automaton with $2n$ states and $n$ $\varepsilon$-free transitions, if $\varepsilon$-transitions are allowed. Similarly, it is easy to construct an automaton satisfying any single one of the structural properties (a)–(e). However, we must be a little more careful if these requirements are to be satisfied simultaneously.
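The normal form of Lemma 3.1, which the construction relies on, can be sketched in Python by applying the rules (a)–(d) of Table 1 bottom-up. The tuple encoding of expressions and all function names below are assumptions made for this illustration only; they do not come from the paper.

```python
# Hypothetical encoding: ("empty",), ("eps",), ("sym", a),
# ("cat", l, r), ("alt", l, r), ("star", x), ("opt", x).
EMPTY, EPS = ("empty",), ("eps",)


def size(e):
    """Number of occurrences of alphabet symbols in e."""
    if e[0] == "sym":
        return 1
    if e[0] in ("empty", "eps"):
        return 0
    return sum(size(s) for s in e[1:])


def mk_cat(l, r):
    if EMPTY in (l, r):                 # rule (a)
        return EMPTY
    if l == EPS:                        # rule (b)
        return r
    if r == EPS:
        return l
    return ("cat", l, r)


def mk_opt(x):
    if x in (EMPTY, EPS):               # rules (a), (b)
        return EPS
    if x[0] in ("star", "opt"):         # rule (c): x*? => x*, x?? => x?
        return x
    if x[0] == "alt":                   # rule (d): (a1 + a2)? => a1 + a2?
        return ("alt", x[1], mk_opt(x[2]))
    return ("opt", x)


def mk_alt(l, r):
    if l == EMPTY:                      # rule (a)
        return r
    if r == EMPTY:
        return l
    if l == EPS:                        # rule (b): eps + a1 => a1?
        return mk_opt(r)
    if r == EPS:
        return mk_opt(l)
    return ("alt", l, r)


def mk_star(x):
    if x in (EMPTY, EPS):               # rules (a), (b)
        return EPS
    if x[0] == "star":                  # rule (c): x** => x*
        return x
    if x[0] == "opt":                   # rule (c): x?* => x*
        return ("star", x[1])
    if x[0] == "alt":                   # rule (d): (a1 + a2)* => (a1? a2?)*
        return ("star", ("cat", mk_opt(x[1]), mk_opt(x[2])))
    return ("star", x)


def normalize(e):
    """Normalize the sons first, then apply the rules at the current node."""
    if e[0] in ("empty", "eps", "sym"):
        return e
    if e[0] == "cat":
        return mk_cat(normalize(e[1]), normalize(e[2]))
    if e[0] == "alt":
        return mk_alt(normalize(e[1]), normalize(e[2]))
    if e[0] == "star":
        return mk_star(normalize(e[1]))
    return mk_opt(normalize(e[1]))
```

For instance, under this encoding `normalize` maps $(a + (\varepsilon + b\,\emptyset))^*$ to $a^*$, and $(a + b)^*$ to $(a^{?} b^{?})^*$.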
After putting the expression $\alpha$ into the normal form and eliminating unary nodes in the tree representation by “unifying,” as discussed above, we allocate $q_s$ and $q_F$, the initial and final states for $M$, which are also, respectively, the entry and exit points for $\beta = \alpha$, the root in the tree representation of $\alpha$. That is, $q_s = \mathit{en}_\alpha$ and $q_F = \mathit{ex}_\alpha$. Then we connect $q_s$ with $q_F$ by a “temporary pseudo-transition” labeled by $\alpha$. Starting from the root in the tree representing $\alpha$, we proceed downward in the tree and replace temporary pseudo-transitions in the graph, using the corresponding graph rewriting rules
presented in Table 2, as the structure of the tree representing $\alpha$ demands. This allocates new states as well as edges for subsequent nodes (sons representing subexpressions), until we get a final graph without any temporary pseudo-transitions.
This construction is almost standard, but some details require additional explanation. To keep the number of states small, we do allow a path, inside $M_\beta$, visiting $\mathit{ex}_\beta$ arbitrarily many times (see e.g. the rule for iterated concatenation in Table 2). This does no harm, for the following reasons.
(i) For each subexpression $\beta$, $M_\beta$ has no edges ending in $\mathit{en}_\beta$. This can be verified easily, by a bottom-up structural induction, using Table 2.
(ii) Each subautomaton has an exclusive exit point, i.e., for any two different subautomata $M_{\beta'}$ and $M_{\beta''}$, the states $\mathit{ex}_{\beta'}$ and $\mathit{ex}_{\beta''}$ do not coincide. This can be seen from Table 2 directly; each rewriting rule allocates a separate exit state for each of its nested subautomata.
(iii) Using (i), (ii), and Table 2, it is now easy to see that, for any $M_\beta$, there are no edges from the surrounding environment ending in $\mathit{ex}_\beta$.

Table 2
Graph rewriting rules for producing a nondeterministic automaton. Each rule replaces the temporary pseudo-transition for $\beta$ by a subgraph (diagrams not reproduced):
Union: $\beta = \beta_1 + \beta_2$
Simple concat.: $\beta = \beta_1 \beta_2$
Opt. concat.: $\beta = (\beta_1 \beta_2)^{?}$
Iter. concat.: $\beta = (\beta_1 \beta_2)^*$
Simple symbol: $\beta = a \in \Sigma$
Opt. symbol: $\beta = a^{?}$
Iter. symbol: $\beta = a^*$
Edges without labels represent $\varepsilon$-transitions; dotted edges are “temporary pseudo-transitions,” to be replaced by subgraphs corresponding to the given subexpressions. Filled bullets represent allocated “new” states.
(iv) By (i) and (iii), we now get that each path visiting the subgraph for $M_\beta$ must enter $M_\beta$ via $\mathit{en}_\beta$ and leave it via $\mathit{ex}_\beta$.
Then, by structural induction, we can easily verify that each subautomaton $M_\beta$ satisfies $L(M_\beta) = L(\beta)$, and hence also $L(M) = L(M_\alpha) = L(\alpha)$. This also completes the proof of (a)–(e).
Now we concentrate on the number of states. At the very beginning, we allocated two states, namely, the initial and final states for $M$. Further, by Table 2, the rules for leaves do not allocate any new states and each rule for a binary inner node allocates exactly two new states. Recall that there are at most $n - 1$ binary nodes in the tree representation of $\alpha$, after putting it into the normal form. Summing up, the number of states is bounded by $2 + 2(n - 1) = 2n$.
It is also easy to see that $M$ has at most $n$ $\varepsilon$-free transitions, since, by Table 2, the rewriting rules for inner nodes of the tree representation of $\alpha$ do not produce any, and a rule for a leaf produces only a single $\varepsilon$-free transition. $\Box$
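The top-down expansion described in this proof can be sketched in Python as follows. The concrete edge layout is an assumption: it is one layout consistent with the properties stated in the proof (two new states per binary node, none per leaf, one $\varepsilon$-free edge per leaf, no edges ending in an entry point, a separate exit state for each nested subautomaton), and it may differ from the graphs of Table 2. The tuple encoding and the names `build` and `accepts` are likewise hypothetical.

```python
def build(expr):
    """Build an automaton with eps-edges from a normal-form expression.

    expr is a nested tuple: ("sym", a), ("cat", l, r), ("alt", l, r),
    ("star", x) or ("opt", x).  Returns (n_states, edges, q_s, q_f),
    where edges are (src, label, dst) and label None marks an eps-edge.
    """
    edges = []
    n_states = 2                      # state 0 is q_s, state 1 is q_F

    def new_state():
        nonlocal n_states
        n_states += 1
        return n_states - 1

    def expand(e, en, ex):
        # "Unify" a unary node with its son: remember the operator as mode.
        mode = None
        if e[0] in ("star", "opt"):
            mode, e = e[0], e[1]
        if e[0] == "sym":
            if mode == "star":        # iterated symbol: loop on the exit
                edges.append((en, None, ex))
                edges.append((ex, e[1], ex))
            else:
                edges.append((en, e[1], ex))
                if mode == "opt":     # optional symbol: eps bypass
                    edges.append((en, None, ex))
        elif e[0] == "alt" and mode is None:
            p1, p2 = new_state(), new_state()   # exits of the two sons
            expand(e[1], en, p1)
            expand(e[2], en, p2)
            edges.append((p1, None, ex))
            edges.append((p2, None, ex))
        elif e[0] == "cat":
            p, q = new_state(), new_state()     # exits of the two sons
            first = en
            if mode is not None:      # optional or iterated: eps bypass
                edges.append((en, None, ex))
            if mode == "star":        # iterate by re-entering from the exit
                first = ex
            expand(e[1], first, p)
            expand(e[2], p, q)
            edges.append((q, None, ex))
        else:
            raise ValueError("expression is not in normal form")

    expand(expr, 0, 1)
    return n_states, edges, 0, 1


def accepts(edges, start, final, word):
    """Simulate the automaton on word, following eps-edges freely."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for (p, a, q) in edges:
                if p == s and a is None and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    cur = closure({start})
    for ch in word:
        cur = closure({q for (p, a, q) in edges if p in cur and a == ch})
    return final in cur
```

For the expression $(a^{?} b^{?})^* c$ of size $n = 3$, this sketch allocates $2 + 2 \cdot 2 = 6 = 2n$ states and exactly three $\varepsilon$-free edges, matching the counts in the proof.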
4. Removing $\varepsilon$-transitions
Before proceeding, we need to analyze closely the structure of the automaton produced by the rules of Table 2, as described in the construction of Theorem 3.2. We also introduce some important notions concerning this automaton. Then we shall remove $\varepsilon$-transitions.
Recall that the automaton $M$ constructed in Theorem 3.2 reflects the structure of the original regular expression $\alpha$, as well as the structure of its tree representation. Namely, each subexpression $\beta$, corresponding to a subtree below some node in the tree, has its subautomaton $M_\beta$, a subgraph placed between the states $\mathit{en}_\beta$ and $\mathit{ex}_\beta$ in the graph representing $M$. This divides the graph into two parts: the inside part of $M_\beta$, consisting of all states and edges that were created by the graph rewriting rules of Table 2 when producing $M_\beta$ (including nested production for all descendants in the subtree for $\beta$), and the outside part, consisting of all remaining states and edges in $M$. The states $\mathit{en}_\beta, \mathit{ex}_\beta$ form a boundary, $\mathit{en}_\beta$ being an entry point and $\mathit{ex}_\beta$ an exit point for $M_\beta$. By definition, $\mathit{en}_\beta$ belongs to the outside part of $M_\beta$, while $\mathit{ex}_\beta$ belongs to its inside. Now we generalize the above terminology to “regions.”
Definition 4.1. Let $\beta$ and $\beta_1, \ldots, \beta_c$ be some subexpressions of $\alpha$, such that $\beta_1, \ldots, \beta_c$ are subexpressions of $\beta$, but $\beta_i$ is not a subexpression of $\beta_j$, for $i \ne j$. That is, in the tree representation, the root nodes for $\beta_1, \ldots, \beta_c$ are all in the subtree for $\beta$, but the root for $\beta_j$ does not lie along the path connecting the root of $\beta$ with $\beta_i$. (The list $\beta_1, \ldots, \beta_c$ may also be empty.) Then a $\beta \setminus \beta_1, \ldots, \beta_c$ region is a subgraph $R$ in the graph representing $M$, consisting of the inside part of $M_\beta$ after removing the inside parts of $M_{\beta_1}, \ldots, M_{\beta_c}$ (see Fig. 1 for an example). We also call this subgraph an inside part of the region, while all remaining states and edges in $M$ form an outside part.
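A boundary of a region consists of the boundary states for $\beta$ and $\beta_1, \ldots, \beta_c$. The states $\mathit{en}_\beta$ and $\mathit{ex}_{\beta_1}, \ldots, \mathit{ex}_{\beta_c}$ are entry points, while $\mathit{ex}_\beta$ and $\mathit{en}_{\beta_1}, \ldots, \mathit{en}_{\beta_c}$ are exit points of the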
Fig. 1. Splitting a region $\beta \setminus \beta_1, \beta_2, \beta_3$ into two subregions: $\beta' \setminus \beta_2, \beta_3$ and $\beta \setminus \beta', \beta_1$.
region. (This does not exclude the possibility that a state is, at the same time, an entry point and also an exit point. This may happen if $\mathit{en}_\beta = \mathit{en}_{\beta_i}$ or $\mathit{ex}_{\beta_j} = \mathit{en}_{\beta_i}$, for some $i \ne j$.)
According to a strict graph-theoretic definition, a region is not a subgraph, since a region can contain an edge whose source or target node does not belong to the same region. This will not cause any confusion, however.
An important property of regions is that they can be split into subregions with a balanced number of $\varepsilon$-free edges.
Lemma 4.2. Let $M$ be the automaton constructed in Theorem 3.2, let $R = \beta \setminus \beta_1, \ldots, \beta_c$ be a region in $M$, for some subexpressions $\beta$ and $\beta_1, \ldots, \beta_c$, and let the total number of $\varepsilon$-free transitions inside the region $R$ be $k \ge 2$. Then $R$ can be decomposed into two disjoint regions $R_1$ and $R_2$ such that the number of $\varepsilon$-free transitions inside each of them is at least $\frac{1}{3}k$, but at most $\frac{2}{3}k$. More precisely, the list $\beta_1, \ldots, \beta_c$ can be partitioned into two disjoint lists $\beta'_1, \ldots, \beta'_{c'}$ and $\beta''_1, \ldots, \beta''_{c''}$ such that, for some subexpression $\beta'$ in $\beta$, $R_1 = \beta' \setminus \beta'_1, \ldots, \beta'_{c'}$ and $R_2 = \beta \setminus \beta', \beta''_1, \ldots, \beta''_{c''}$.
Proof. Consider the tree representation of $\beta$. The subexpressions $\beta_1, \ldots, \beta_c$ correspond here to some subtrees. Now mark all leaves in the tree that are not contained in the subtrees for $\beta_1, \ldots, \beta_c$ as “attended,” and those contained in the subtrees as “ignored.” It should be obvious, by Table 2, that there is a one-to-one mapping between the leaves of a subtree and the $\varepsilon$-free transitions inside the corresponding subautomaton. Therefore, we have marked as “attended” exactly the $k$ leaves corresponding to the $k$ $\varepsilon$-free edges located inside the region $R = \beta \setminus \beta_1, \ldots, \beta_c$. By Lemma 2.1, we can find a node such that the number of attended leaves in the subtree below this node is between $\frac{1}{3}k$ and $\frac{2}{3}k$. This subtree corresponds to some subexpression $\beta'$ in $\beta$, i.e., to a subautomaton $M_{\beta'}$ of $M_\beta$. (Fig. 1 illustrates this idea.)
Now split the list $\beta_1, \ldots, \beta_c$ into two disjoint lists $\beta'_1, \ldots, \beta'_{c'}$ and $\beta''_1, \ldots, \beta''_{c''}$; the first list consists of the subexpressions that are in $\beta'$, i.e., the roots of which are contained in the subtree for $\beta'$, the second one of those that are not contained in it. In other words, $M_{\beta'_1}, \ldots, M_{\beta'_{c'}}$ are subautomata of $M_{\beta'}$, while $M_{\beta''_1}, \ldots, M_{\beta''_{c''}}$ are not. Then we can split the region $R$ into two subregions, separated by the boundary of $M_{\beta'}$: the region $R_1 = \beta' \setminus \beta'_1, \ldots, \beta'_{c'}$, placed inside $M_{\beta'}$, and $R_2 = \beta \setminus \beta', \beta''_1, \ldots, \beta''_{c''}$, placed outside $M_{\beta'}$ in $R$.
Clearly, the number of $\varepsilon$-free edges in $R_1$ is between $\frac{1}{3}k$ and $\frac{2}{3}k$, since the subtree for $\beta'$ has this many attended leaves and no attended leaf is contained in any of the subtrees for $\beta_1, \ldots, \beta_c$, and hence in none of those for $\beta'_1, \ldots, \beta'_{c'}$ either. But then the number of $\varepsilon$-free edges in $R_2$ is also between $\frac{1}{3}k$ and $\frac{2}{3}k$, since the total number of such edges in $R_1$ and $R_2$ is exactly equal to $k$. $\Box$
It should be clear that the algorithm derived from Lemma 2.1 (searching for a separating inner node in the tree) can be modified so that it works directly with the structure of regions in $M$, and finds the separating boundary between $R_1$ and $R_2$, i.e., the boundary states of $M_{\beta'}$.
The next definition, based on Lemma 4.2, is the heart of our conversion procedure. It allows us to fix some essential checkpoint states in $M$ and then keep track of reachability among these checkpoints only.
Definition 4.3. Let $x, y \in \Delta$ be two $\varepsilon$-free transitions of the automaton constructed in Theorem 3.2. A separating point between $x$ and $y$ is a finite control state $\mathit{sep}_{x,y} \in Q$, obtained by the following algorithm.
(a) If the edges $x, y \in \Delta$ are not connected by a path $x \to y$ in $M$, the value of $\mathit{sep}_{x,y}$ is not defined. (We shall never compute $\mathit{sep}_{x,y}$ for such a pair of edges.) Here “$x \to y$” represents a path beginning with the edge $x$, ending with $y$, and with an arbitrary number of $\varepsilon$-transitions in between, including zero.
(b) So assume that there exists a path $x \to y$ in $M$. Then we run the following iteration, keeping track of a “current region” $R$. Initially, $R$ is the entire graph for $M$, i.e., $R := \beta \setminus \beta_1, \ldots, \beta_c$, where $\beta = \alpha$, the original regular expression, and $\beta_1, \ldots, \beta_c = \emptyset$, the empty list.
(c) Let $R = \beta \setminus \beta_1, \ldots, \beta_c$ be the current region. Using the procedure derived from Lemma 4.2 based on Lemma 2.1, find $\beta'$ splitting $R$ into two subregions $R_1 = \beta' \setminus \beta'_1, \ldots, \beta'_{c'}$ and $R_2 = \beta \setminus \beta', \beta''_1, \ldots, \beta''_{c''}$ with a balanced number of $\varepsilon$-free transitions.
That is, the number of $\varepsilon$-free transitions in each subregion is between $\frac{1}{3}k$ and $\frac{2}{3}k$, if the number of such transitions inside $R$ is $k \ge 2$.
(d) Now we use the reachability of $y$ from $x$ as a criterion for branching. That is, we check where the paths $x \to y$ will go.
If there exists a complete path $x \to y$ located in the inside of $R_1$, i.e., inside $M_{\beta'}$, then $R_1$ becomes the new current region. That is, we let $R := R_1$ (among other things, this implies that both $x$ and $y$ fall in the inside of $R_1$).
If there exists a complete path $x \to y$ located in the inside of $R_2$, i.e., outside $M_{\beta'}$, then $R_2$ becomes the new current region, that is, $R := R_2$. (In this case, both $x$ and $y$ fall in $R_2$.)
(e) Steps (c) and (d) are repeated until the procedure finds out that neither $R_1$ nor $R_2$ contains a complete path connecting $x$ with $y$. Recall that there does exist a path $x \to y$ in the inside of $R$. This is possible only if this path crosses the boundary between $R_1$ and $R_2$. These two subregions were separated by the boundary of some subautomaton $M_{\beta'}$; the region $R_1$ is placed inside $M_{\beta'}$, while $R_2$ is placed outside $M_{\beta'}$.
If $x$ falls in the inside of $M_{\beta'}$, return $\mathit{ex}_{\beta'}$ as the value of $\mathit{sep}_{x,y}$.
Fig. 2. A moment of separation for two transitions and their separating point $q = \mathit{sep}_{x,y}$.
Conversely, if $x$ is in the outside of $M_{\beta'}$, return $\mathit{en}_{\beta'}$ as $\mathit{sep}_{x,y}$.
In either case, $\mathit{sep}_{x,y}$ is located in the inside of $R$, since the entire path $x \to y$ was located in $R$. It should also be clear that, for $x \ne y$, the split up scenario tries to reach a configuration in which $x, y$ fall in different subregions, since the number of $\varepsilon$-free edges in the current region goes abruptly down to $k = 1$. Hence, sooner or later, the procedure must return a value of $\mathit{sep}_{x,y}$, if $x \ne y$. This does not imply, however, that the two edges must lie across the boundary of $M_{\beta'}$ in different subregions when the moment of separation comes. Fig. 2 shows all possible relative positions.
(f) For $x = y$, the procedure may end up with a current region containing the single $\varepsilon$-free edge $x$, together with a complete path $x \to x$. If this happens, the returned value of $\mathit{sep}_{x,x}$ will be, by definition, the node that the edge $x$ points to.
Informally, starting from the region equal to the entire graph for $M$, we keep on dichotomizing the current region into smaller subregions, and maintain the invariant that the inside of the current region must contain a complete path connecting $x$ with $y$. When neither subregion contains a complete path, the procedure returns, as the value of $\mathit{sep}_{x,y}$, the point of the separating boundary between the subregions that is visited by the path from $x$ to $y$, as shown in Fig. 2. Since a path $x \to y$ is, in general, different from $y \to x$, the values of $\mathit{sep}_{x,y}$ and $\mathit{sep}_{y,x}$ are usually different.
In order to clarify the idea, we consider the example displayed in Fig. 1. Suppose that the edges $x, y$ displayed in the figure are connected by a path $x \to y$ located inside the region $\beta \setminus \beta_1, \beta_2, \beta_3$, and that, in the first five iterations, the following split up boundaries are found in Step (c): $\beta_1, \beta_2, \beta, \beta_3, \beta'$, in that order. Then the sequence of current regions is $\alpha \setminus \emptyset$, $\alpha \setminus \beta_1$, $\alpha \setminus \beta_1, \beta_2$, $\beta \setminus \beta_1, \beta_2$, $\beta \setminus \beta_1, \beta_2, \beta_3$. (Here $\alpha$ denotes the original expression and “$\emptyset$” the empty list for $\beta_1, \ldots, \beta_c$.)
In the fifth iteration, the current region $R = \beta \setminus \beta_1, \beta_2, \beta_3$ is split into $R_1 = \beta' \setminus \beta_2, \beta_3$ containing the edge $x$, and $R_2 = \beta \setminus \beta', \beta_1$ containing $y$. Here the procedure stops; the edges are separated by the boundary of $M_{\beta'}$, and each path connecting $x$ with $y$ must pass through the state $\mathit{sep}_{x,y} = \mathit{ex}_{\beta'}$. However, if each path $x \to y$ had to pass through $\beta_1$, the procedure would return $\mathit{sep}_{x,y} = \mathit{en}_{\beta_1}$ after the first iteration.
We are now ready to remove $\varepsilon$-transitions. We shall use the standard technique, that is, we replace a path containing a single $\varepsilon$-free edge enclosed between an arbitrary number of $\varepsilon$-edges by
a single $\varepsilon$-free edge connecting the source and target states of the path. However, we shall carefully pick out the states that are “essential,” to keep the number of edges small.
Definition 4.4. Let $M$ be the automaton constructed in Theorem 3.2. An $\varepsilon$-free variant of $M$ is an automaton $M'$ having the same set of finite control states, constructed as follows.
(a) The initial state $q_s$ of $M$ becomes also the initial state of $M'$. For $\varepsilon \notin L(M)$, we have $F' = \{q_F\}$, where $q_F$ is the final state of $M$. However, if $\varepsilon \in L(M)$, the initial state will also become a final state, that is, $F' = \{q_s, q_F\}$.
(b) If, for some $\varepsilon$-free edges $x, y, z \in \Delta$, there exists a path $x \to y \to z$ in $M$, include a transition $\mathit{sep}_{x,y} \xrightarrow{y} \mathit{sep}_{y,z}$ in $\Delta'$. Here “$q_1 \xrightarrow{y} q_2$” denotes an $\varepsilon$-free edge from the state $q_1$ to $q_2$ labeled by the same input alphabet symbol as the edge $y$.
(c) If, for some $\varepsilon$-free $y, z$, there exists a path $q_s \to y \to z$ in $M$, include a transition $q_s \xrightarrow{y} \mathit{sep}_{y,z}$.
(d) If, for some $\varepsilon$-free $x, y$, there exists a path $x \to y \to q_F$ in $M$, include a transition $\mathit{sep}_{x,y} \xrightarrow{y} q_F$.
(e) If, for some $\varepsilon$-free $y$, there exists $q_s \to y \to q_F$ in $M$, include $q_s \xrightarrow{y} q_F$.
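For comparison, the unrestricted form of the standard technique mentioned before Definition 4.4 can be sketched in Python. This is not the construction of Definition 4.4: it adds an edge for every admissible pair of source and target states, whereas Definition 4.4 keeps only the edges between separating points (and $q_s$, $q_F$). The edge encoding `(src, label, dst)` with `None` for $\varepsilon$, and the function names, are assumptions of this sketch.

```python
def eps_closure(edges, state):
    """States reachable from state by eps-edges only (state included)."""
    seen, stack = {state}, [state]
    while stack:
        s = stack.pop()
        for (p, a, q) in edges:
            if p == s and a is None and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def remove_eps_naive(n_states, edges, start, final):
    """Replace every path eps* a eps* by a single edge labeled a.

    States are 0 .. n_states-1.  Returns (new_edges, finals); the state
    set and the initial state stay the same.
    """
    fwd = {q: eps_closure(edges, q) for q in range(n_states)}
    new_edges = set()                  # a set, so no duplicate edges are kept
    for p in range(n_states):
        for (u, a, v) in edges:
            if a is not None and u in fwd[p]:
                for q in fwd[v]:
                    new_edges.add((p, a, q))
    finals = {final}
    if final in fwd[start]:            # eps in L(M): q_s becomes final too
        finals.add(start)
    return new_edges, finals


def accepts_eps_free(edges, start, finals, word):
    """Simulate an eps-free automaton on word."""
    cur = {start}
    for ch in word:
        cur = {q for (p, a, q) in edges if p in cur and a == ch}
    return bool(cur & finals)
```

The treatment of final states mirrors rule (a); the edge set, however, can grow quadratically in the number of states, which is exactly what the checkpoint-based rules (b)–(e) are designed to avoid.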
Clearly, $M'$ has no $\varepsilon$-transitions at all. It should also be pointed out that we do not keep multiple copies of an edge connecting the same pair of states and labeled by the same symbol of the input alphabet. Now we prove the correctness of the above construction.
Theorem 4.5. The automata $M$ and $M'$ constructed in Theorem 3.2 and Definition 4.4, respectively, are equivalent.
Proof. By (a) in Definition 4.4, $\varepsilon \in L(M')$ if and only if $\varepsilon \in L(M)$. We now consider an input $w \ne \varepsilon$.
First, recall that, for each pair of $\varepsilon$-free edges $x, y$ such that there exists a path $x \to y$ in $M$, the state $\mathit{sep}_{x,y}$ is chosen so that we have at least one path connecting $x$ with $y$ and passing through $\mathit{sep}_{x,y}$ (see (e) and (f) in Definition 4.3). Therefore, for each $\varepsilon$-free pair $x, y$, a path $x \to y$ in $M$ can be replaced by a path of the form $x \to \mathit{sep}_{x,y} \to y$.
Among other things, this implies that the initial state $q_s$ does not coincide with $\mathit{sep}_{x,y} \in Q$ for any $\varepsilon$-free $x, y \in \Delta$ with a path $x \to y$, since there are no edges in $M$ ending in $q_s$, by (d) in Theorem 3.2, using $q_s = \mathit{en}_\alpha$. But then, by (b)–(e) in Definition 4.4, $M'$ does not contain any edge ending in $q_s$ either. For $w \ne \varepsilon$, this excludes the possibility of an accepting path terminating in $q_s$, even if $q_s \in F'$. Thus, for $w \ne \varepsilon$, both $M$ and $M'$ accept $w = a_1 \ldots a_k$ if and only if there is some path connecting $q_s$ with $q_F$ and labeled by $a_1 \ldots a_k$.
Now, suppose that $w = a_1 \ldots a_k \in L(M)$, for $k \ge 2$. Then we can find a path
$q_s \to e_{a_1} \to e_{a_2} \to e_{a_3} \to \cdots \to e_{a_{k-1}} \to e_{a_k} \to q_F$
in $M$, where $e_{a_i}$ denotes an edge labeled by the symbol $a_i$, for $i = 1, \ldots, k$. But then, by (b)–(d) in Definition 4.4, we have a corresponding path for $M'$, namely
$q_s \xrightarrow{e_{a_1}} \mathit{sep}_{e_{a_1}, e_{a_2}} \xrightarrow{e_{a_2}} \mathit{sep}_{e_{a_2}, e_{a_3}} \xrightarrow{e_{a_3}} \cdots \xrightarrow{e_{a_{k-1}}} \mathit{sep}_{e_{a_{k-1}}, e_{a_k}} \xrightarrow{e_{a_k}} q_F$.
Thus, $w = a_1 \ldots a_k \in L(M')$.
For $k = 1$, we have a path $q_s \to e_{a_1} \to q_F$ in $M$, which can be replaced by $q_s \xrightarrow{e_{a_1}} q_F$ in $M'$, using (e) in Definition 4.4. Summing up, we have shown that $L(M) \subseteq L(M')$.
On the other hand, it is not too hard to see that, if an edge $q_1 \xrightarrow{y} q_2$ has been included in $M'$, there must exist a path $q_1 \to y \to q_2$ in $M$. For example, an edge $\mathit{sep}_{x,y} \xrightarrow{y} \mathit{sep}_{y,z}$ included by the rule (b) of Definition 4.4 implies that there must exist a path $x \to y \to z$ in $M$. But this path can be replaced by $x \to \mathit{sep}_{x,y} \to y \to \mathit{sep}_{y,z} \to z$, as shown above. Thus, we have $\mathit{sep}_{x,y} \to y \to \mathit{sep}_{y,z}$. A similar argument holds also for edges included by the rules (c)–(e). Therefore, if there exists a path
$q_s \xrightarrow{e_{a_1}} q_1 \xrightarrow{e_{a_2}} q_2 \xrightarrow{e_{a_3}} \cdots \xrightarrow{e_{a_{k-1}}} q_{k-1} \xrightarrow{e_{a_k}} q_F$
in $M'$, for some states $q_1, q_2, \ldots, q_{k-1}$, there must also exist a path
$q_s \to e_{a_1} \to q_1 \to e_{a_2} \to q_2 \to e_{a_3} \to \cdots \to e_{a_{k-1}} \to q_{k-1} \to e_{a_k} \to q_F$
in $M$. This implies that $L(M') \subseteq L(M)$, which completes the proof. $\Box$
5. Two upper bounds
Here we derive some upper bounds on the number of transitions used by the automaton constructed in Definition 4.4. We show that, for each $n \ge 2$ and each $s \ge 1$, the number of transitions is less than $\frac{6}{\log(3/2)}\, sn \log n$, where $n$ denotes the size of the original regular expression and $s$ the cardinality of the input alphabet. (All logarithms here are to the base 2.) Second, even if the cardinality of the input alphabet is not fixed, the number of transitions is below $\frac{2}{\log(3/2)}\, n \log n \left(\frac{1}{\log(3/2)} \log n + 1\right)$. This gives an alternative proof for the $O(n(\log n)^2)$ upper bound derived in [6,7]. However, we reduce the number of transitions to almost one half, keeping the same number of states, bounded by $2n$.
To obtain the above upper bounds, we need to fix some essential checkpoints in $M$.
Definition 5.1. A split up scenario, for a state $q \in Q$ of the automaton constructed in Theorem 3.2, consists of two sets $\mathit{In}(q) \subseteq Q$ and $\mathit{Out}(q) \subseteq Q$, produced by the following variant of the procedure described in Definition 4.3.
(a) Initially, the sets $\mathit{In}(q)$ and $\mathit{Out}(q)$ are empty. We again keep track of a current region $R$, initially equal to the entire graph for $M$. (cf. Step (b) in Definition 4.3).
(b) Using the procedure derived from Lemma 4.2 based on Lemma 2.1, find $\beta'$ splitting the current region $R$ into two subregions $R_1$ and $R_2$ with a balanced number of ε-free transitions (cf. Step (c) in Definition 4.3).

(c) Instead of a path $x \leadsto y$ as a criterion for branching (cf. Step (d) in Definition 4.3), we use the given state $q \in Q$:
(c1) If $q$ is located inside $R_1$, i.e., inside $M_{\beta'}$, then $R_1$ becomes the new current region, i.e., $R := R_1$. Then insert the state $en_{\beta'}$ to $In(q)$ and $ex_{\beta'}$ to $Out(q)$.
(c2) If $q$ is located inside $R_2$, i.e., outside $M_{\beta'}$, then $R := R_2$. Insert the state $ex_{\beta'}$ to $In(q)$ and $en_{\beta'}$ to $Out(q)$.

(d) Repeat the Steps (b) and (c) until the number of ε-free transitions in the current region $R$ reduces to $k = 1$. The current values of $In(q)$, $Out(q)$ at this moment become the final ones.

The main difference between the procedure computing $sep_{x,y}$ and the procedure computing $In(q)$, $Out(q)$ is that the former stops as soon as all paths connecting $x$ with $y$ are broken down, but the latter keeps on splitting while possible. We end up with a region containing a single ε-free edge. Note that the procedure does not care about the total number of transitions in regions, just transitions that are ε-free. Second, here we maintain the invariant that the inside of the current region must contain the given state $q$, rather than a path $x \leadsto y$. Along the way, we build $In(q)$ and $Out(q)$, two lists of boundary checkpoints from/to regions, respectively, that were separated from the state $q$.

Recall also that we have already resolved a potential ambiguity arising in Step (c), for the case in which $q$ happens to be a point of the separating boundary between $R_1$ and $R_2$. By the convention we have adopted earlier, $en_{\beta'}$ lies outside, while $ex_{\beta'}$ lies inside $M_{\beta'}$. Finally, we point out that $In(q)$ and $Out(q)$ are sets of states, that is, we do not keep multiple copies of the same state in the same set.

To illustrate the above idea, we shall return to the example shown in Fig. 1.
Recall that we assumed that the edges $x, y$, displayed in the figure, were connected by a path $x \leadsto y$ located in the inside of the region $\beta - \beta_1, \beta_2, \beta_3$, and that the procedure computing $sep_{x,y}$ found the following split up boundaries in the first five iterations: $\beta_1, \beta_2, \beta, \beta_3, \beta'$, in that order. Now consider the state $q = ex_\beta$. Then, in the first four iterations, the procedure computing $In(q)$, $Out(q)$ follows the same trajectory of current regions, since both the state $ex_\beta$ and the path $x \leadsto y$ fall in the same subregions after splitting up:

Iter.   Current R                               Add to In(q)      Add to Out(q)
0       $\alpha - \emptyset$
1       $\alpha - \beta_1$                      $ex_{\beta_1}$    $en_{\beta_1}$
2       $\alpha - \beta_1, \beta_2$             $ex_{\beta_2}$    $en_{\beta_2}$
3       $\beta - \beta_1, \beta_2$              $en_{\beta}$      $ex_{\beta}$
4       $\beta - \beta_1, \beta_2, \beta_3$     $ex_{\beta_3}$    $en_{\beta_3}$
5       $\beta - \beta', \beta_1$               $ex_{\beta'}$     $en_{\beta'}$
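The bookkeeping in the table can be replayed mechanically. The following is a minimal sketch, assuming the sequence of split up boundaries and the inside/outside position of $q$ are already known (finding them, via Lemma 4.2, is not modelled). The function name `replay_split_up` and the ASCII labels such as `b1` and `b'` (standing for $\beta_1$ and $\beta'$) are illustrative only and do not come from the paper.

```python
from typing import List, Tuple


def replay_split_up(boundaries: List[Tuple[str, bool]]) -> Tuple[List[str], List[str]]:
    """Replay Step (c) of Definition 5.1 for a given boundary sequence.

    Each item is (name, q_is_inside): the name of the separating
    subautomaton and whether the state q lies inside it. Both are
    supplied by the caller (an assumption of this sketch).
    """
    in_q: List[str] = []
    out_q: List[str] = []
    for name, q_is_inside in boundaries:
        if q_is_inside:
            # Rule (c1): q inside, so en goes to In(q), ex to Out(q).
            to_in, to_out = f"en_{name}", f"ex_{name}"
        else:
            # Rule (c2): q outside, so ex goes to In(q), en to Out(q).
            to_in, to_out = f"ex_{name}", f"en_{name}"
        # In(q) and Out(q) are sets: do not keep duplicate states.
        if to_in not in in_q:
            in_q.append(to_in)
        if to_out not in out_q:
            out_q.append(to_out)
    return in_q, out_q


# q = ex_b: outside M_b1 and M_b2, inside M_b, outside M_b3 and M_b'.
trace_ex_b = [("b1", False), ("b2", False), ("b", True), ("b3", False), ("b'", False)]
# q = en_b2: same first four iterations, but inside M_b' in the fifth.
trace_en_b2 = trace_ex_b[:4] + [("b'", True)]

in_ex_b, out_ex_b = replay_split_up(trace_ex_b)
in_en_b2, out_en_b2 = replay_split_up(trace_en_b2)
print("In(ex_b)  =", in_ex_b)
print("Out(ex_b) =", out_ex_b)
print("In(en_b2) =", in_en_b2)
print("Out(en_b2)=", out_en_b2)
```

The two traces share their first four entries and differ only in the fifth, where the boundary states of the last subautomaton land in opposite sets.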
In subsequent iterations, the process keeps on splitting the region $\beta - \beta', \beta_1$ until it gets a region with a single ε-free edge, containing $q = ex_\beta$. (Compare with the procedure computing $sep_{x,y}$, that stops after the first four iterations, since the edges $x, y$ are separated by the boundary of $M_{\beta'}$.)

Consider also the case of $q = en_{\beta_2}$. The first four iterations for $en_{\beta_2}$ and $ex_\beta$ are the same, since they both fall into the same subregions after splitting up. In the fifth iteration for $en_{\beta_2}$, however, we get the current region $R = \beta' - \beta_2, \beta_3$, and the states $en_{\beta'}$, $ex_{\beta'}$ are inserted to different sets. From this point forward, the computations for $q = en_{\beta_2}$ and $q = ex_\beta$ run in a different way, examining regions across the opposite sides of the boundary for $M_{\beta'}$.

Now we can begin with a careful analysis of the structure of $In(q)$, $Out(q)$, including an evaluation of $O(1)$ terms in upper bounds on their cardinality. This is required to obtain a negative constant factor for the $o(sn \log n)$ term in the upper bound on the number of transitions. As a consequence, such term can be discarded, which simplifies the resulting formula in the main theorem of the paper. To make life easier, we shall content ourselves to display precisely only the constants that will survive in the final upper bounds. All "temporary" constants are calculated, and then immediately rounded at the third decimal digit, up or down, so that we do not ruin the inequality under consideration.

Lemma 5.2. Let $M$ be the automaton constructed in Theorem 3.2. Then, for each $n \ge 6$ and each $q \in Q$, neither of the sets $In(q)$, $Out(q)$ contains more than $\frac{1}{\log(3/2)} \log n - 1.128$ states.

Proof. Let us recall how the split up scenario of Definition 5.1 builds the sets $In(q)$, $Out(q)$. Starting from the current region being equal to the entire graph for $M$, containing $k = n$ ε-free edges, it splits repeatedly the current region into two smaller subregions, neither of them containing more than $\frac{2}{3} k$ ε-free edges.
It maintains also the invariant that the inside of the current region contains the state $q$. In each iteration, it adds exactly one state to the set $In(q)$, and exactly one state to $Out(q)$. Thus, after $i$ iterations, both $In(q)$ and $Out(q)$ contain exactly $i$ states, and the number of ε-free edges in the current region is at most $k \le n (2/3)^i$. Since the procedure terminates as soon as the current region contains a single ε-free edge only, we get immediately that neither of the sets $In(q)$, $Out(q)$ contains more than $\frac{1}{\log(3/2)} \log n + O(1)$ states.

To obtain a negative $O(1)$ term, we utilize the cumulative effect of truncation in the last few iterations. Note that if $i$, the number of executed iterations, satisfies

$$ n \left(\frac{2}{3}\right)^i < 8, $$

the number of ε-free edges in the current region is below 8, i.e., at most 7. This must hold for each

$$ i > \frac{1}{\log(3/2)} \log n - 5.128 > \frac{\log n - \log 8}{\log(3/2)}, \qquad (1) $$
since $\frac{\log 8}{\log(3/2)} > 5.128$. Taking the first integer $i$ satisfying (1) for the number of iterations (sufficient to reduce the number of ε-free edges below 8), we get

$$ i \le \frac{1}{\log(3/2)} \log n - 4.128. \qquad (2) $$
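As a numerical sanity check on the iteration counts behind Lemma 5.2, the following sketch computes the largest possible number of split iterations and compares it with the claimed bound $\frac{1}{\log(3/2)} \log n - 1.128$. It assumes only what the proof uses: each split leaves at most $\lfloor 2k/3 \rfloor$ ε-free edges in the current region, and the region never becomes empty. The function names are illustrative and not part of the paper.

```python
import math
from functools import lru_cache

LOG_3_2 = math.log2(3 / 2)


@lru_cache(maxsize=None)
def max_iterations(k: int) -> int:
    """Largest number of split iterations starting from k >= 1 edges,
    assuming each split leaves between 1 and floor(2k/3) edges."""
    if k <= 1:
        return 0
    return 1 + max(max_iterations(j) for j in range(1, (2 * k) // 3 + 1))


def lemma_bound(n: int) -> float:
    """Claimed bound on the size of In(q) and Out(q); base-2 logarithms."""
    return math.log2(n) / LOG_3_2 - 1.128


for n in range(2, 13):
    ok = max_iterations(n) <= lemma_bound(n)
    print(f"n={n:2d}  iterations={max_iterations(n)}  bound={lemma_bound(n):.3f}  holds={ok}")
```

Under these assumptions the comparison fails for $n = 5$ and succeeds from $n = 6$ on, which is the threshold stated in the lemma.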
Now, let us calculate how many iterations can be executed when the number of ε-free edges in the current region is at most 7. First, the current region is split into two subregions so that neither of them contains more ε-free edges than $7 \cdot 2/3 \le 4.667$. But the number of edges must be an integer, i.e., we have at most 4 ε-free edges in the current region after this iteration. For the next iteration, we get $4 \cdot 2/3 \le 2.667$, i.e., at most 2 edges. Finally, in one more iteration, we get a region with a single ε-free edge only. Thus, when the number of ε-free edges in the current region drops below 8, the procedure must terminate in the next 3 iterations. Combining this with (2), we get that the number of states in $In(q)$, $Out(q)$ is bounded by

$$ \|In(q)\| = \|Out(q)\| \le \frac{1}{\log(3/2)} \log n - 1.128, \qquad (3) $$

for each $n > 8$. The method we have used for the cabalistic sequence going down from the number 7 can also be used to compute the exact upper bounds on the number of iterations for values $n = 8, 7, \ldots$ and to compare the results with $\frac{1}{\log(3/2)} \log n - 1.128$. This reveals that (3) holds actually from $n \ge 6$. □

The following two lemmas play a fundamental role in obtaining the upper bounds.

Lemma 5.3. Let $M$ be the automaton constructed in Theorem 3.2. Then each triple of ε-free transitions $x, y, z \in \Delta$, connected by a path $x \leadsto y \leadsto z$, must satisfy at least one of the following conditions:
(a) $x = y = z$,
(b) $sep_{y,z} \in Out(sep_{x,y})$,
(c) $sep_{x,y} \in In(sep_{y,z})$,
(d) $sep_{x,y} \in Out(sep_{y,z})$.
Proof. If $x = y = z$, we are done. So assume that at least two edges do not coincide. Now consider, in parallel, sequences of regions examined by two instances of the procedure presented in Definition 4.3, searching for the separating states $sep_{x,y}$ and $sep_{y,z}$. At the same time, consider also sequences of regions for two instances of the procedure of Definition 5.1, computing the sets $In(q_1)$, $Out(q_1)$ and $In(q_2)$, $Out(q_2)$, for the case of $q_1 = sep_{x,y}$ and $q_2 = sep_{y,z}$. All four processes start from the current region $R$ being equal to the entire graph and keep on splitting the current region into smaller parts $R_1$ and $R_2$. It should be clear that while there exists a complete path $x \leadsto y \leadsto z$ falling in the inside of the same subregion, the two processes searching for $sep_{x,y}$ and $sep_{y,z}$ follow the same trajectory of current regions (see (d) in Definition 4.3). The same
trajectory is followed by the two processes building the sets $In(q_1)$, $Out(q_1)$ and $In(q_2)$, $Out(q_2)$, since the states $q_1$ and $q_2$ coincide with $sep_{x,y}$ and $sep_{y,z}$, and hence they must be located along the "last unbroken" paths $x \leadsto y$ and $y \leadsto z$, respectively (compare (c) in Definition 5.1 with (d) and (e) in Definition 4.3). This holds until the last unbroken paths connecting $x \leadsto y$ or $y \leadsto z$ start to break. The only difference is that the processes computing $In(q_1)$, $Out(q_1)$ and $In(q_2)$, $Out(q_2)$ insert some states to the corresponding sets, while those searching for $sep_{x,y}$ and $sep_{y,z}$ ignore such information.

The trajectory of current processes starts branching at the moment when, after splitting $R$ into two subregions $R_1$ and $R_2$, neither $R_1$ nor $R_2$ contains a complete path $x \leadsto y \leadsto z$. Recall that, so far, we did have such a path in the inside of $R$. Thus, the existing path $x \leadsto y \leadsto z$ must cross the boundary between $R_1$ and $R_2$. This moment of break down must come, since we assume that at least two edges among $x, y, z$ do not coincide and the number of ε-free edges in the current region goes down to $k = 1$. For the moment of break down, there are now several cases to consider.

(i) The procedure computing the value of $sep_{x,y}$ does not terminate later than the procedure computing $sep_{y,z}$. That is, the path $x \leadsto y$ will be broken not later than $y \leadsto z$. This covers also the case of some path $y \leadsto z$ not being broken at all, which may happen for $y = z$ (see (f) in Definition 4.3), as well as the situation in which the paths $x \leadsto y$ and $y \leadsto z$ are broken at the same time, i.e., in the same iteration. At this moment, we have some path $x \leadsto y$ in the inside of $R$, but neither $R_1$ nor $R_2$ contains any complete path connecting $x$ with $y$. Thus, the path $x \leadsto y$ must cross the boundary of the separating subautomaton $M_{\beta'}$. This implies that the procedure computing the value of $sep_{x,y}$ stops in this iteration. There are the following subcases.
(i-1) The edge $x$ is located inside $R_1$, i.e., inside $M_{\beta'}$. Then the procedure will return the state $q_1 = ex_{\beta'}$ as the value of $sep_{x,y}$ (see cases (a1) and (a2) in Fig. 2, as well as (e) in Definition 4.3). Thus,

$$ sep_{x,y} = q_1 = ex_{\beta'}. \qquad (4) $$

For the state $q_2 = sep_{y,z}$, we have the following possibilities:

• The state $q_2$ is located in the inside of $M_{\beta'}$ (this includes also the case of $q_2 = ex_{\beta'}$). Because the procedure computing the sets $In(q_2)$, $Out(q_2)$ has been following, so far, the same trajectory of current regions, it will insert the state $ex_{\beta'}$ in the set $Out(q_2)$ in this iteration (see (c1) in Definition 5.1). We do not know how this procedure will run from this moment forward, since the trajectory of current regions forks here. Nevertheless, we have got $ex_{\beta'} \in Out(q_2)$ and hence, using (4), we get $sep_{x,y} = q_1 = ex_{\beta'} \in Out(q_2) = Out(sep_{y,z})$. Therefore, we have obtained the condition (d) in the statement of the lemma, that is, $sep_{x,y} \in Out(sep_{y,z})$.

• The state $q_2$ is in the outside of $M_{\beta'}$ (including also the case of $q_2 = en_{\beta'}$). Because of the shared trajectory of current regions, the procedure computing $In(q_2)$, $Out(q_2)$ will insert the state $ex_{\beta'}$ in the set $In(q_2)$ in this iteration (see (c2) in Definition 5.1). Thus, by (4), $sep_{x,y} = q_1 = ex_{\beta'} \in In(q_2) = In(sep_{y,z})$, which gives $sep_{x,y} \in In(sep_{y,z})$.
(i-2) The edge $x$ is located inside $R_2$, i.e., outside $M_{\beta'}$. Here the procedure computing the value of $sep_{x,y}$ will return the state $q_1 = en_{\beta'}$ (see cases (b1) and (b2) in Fig. 2, and also (e) in
Definition 4.3). Thus,

$$ sep_{x,y} = q_1 = en_{\beta'}. \qquad (5) $$

Now we have the following possibilities for the state $q_2 = sep_{y,z}$:

• The state $q_2$ is inside $M_{\beta'}$ (including $q_2 = ex_{\beta'}$). Then the procedure computing $In(q_2)$, $Out(q_2)$ will insert the state $en_{\beta'}$ in the set $In(q_2)$, by (c1) in Definition 5.1. Using (5), we thus get $sep_{x,y} = q_1 = en_{\beta'} \in In(q_2) = In(sep_{y,z})$, and hence also $sep_{x,y} \in In(sep_{y,z})$.

• The state $q_2$ is outside $M_{\beta'}$ (including $q_2 = en_{\beta'}$). Here we shall get $en_{\beta'}$ in the set $Out(q_2)$, using (c2) in Definition 5.1. By (5), this gives $sep_{x,y} = q_1 = en_{\beta'} \in Out(q_2) = Out(sep_{y,z})$, that is, $sep_{x,y} \in Out(sep_{y,z})$.
(ii) The procedure computing the value of $sep_{y,z}$ terminates strictly earlier than the one computing $sep_{x,y}$. That is, the path $y \leadsto z$ will be broken strictly earlier than $x \leadsto y$. This covers also the case of some path $x \leadsto y$ not being broken at all, which may happen for $x = y$ (see (f) in Definition 4.3). At this moment, we have some path $y \leadsto z$ in the inside of $R$, but neither $R_1$ nor $R_2$ contains any complete path connecting $y$ with $z$, since such paths must cross the boundary of the separating $M_{\beta'}$. Therefore, the procedure computing the value of $sep_{y,z}$ stops here. On the other hand, we still have a complete path $x \leadsto y$ located within the same subregion, either $R_1$ or $R_2$, depending on where the edge $y$ will go, since we assume that the procedure computing the value of $sep_{y,z}$ terminates strictly earlier than the one computing $sep_{x,y}$. Therefore, we consider the following subcases.

(ii-1) The edge $y$, together with a complete path $x \leadsto y$, are located inside $R_1$, i.e., inside $M_{\beta'}$. Then the procedure computing the value of $sep_{y,z}$ will return the state $q_2 = ex_{\beta'}$ (see cases (a1) and (a2) in Fig. 2, keeping in mind substitution $x' := y$ and $y' := z$, as well as (e) in Definition 4.3). Thus,

$$ sep_{y,z} = q_2 = ex_{\beta'}. \qquad (6) $$
Because we still have a complete path $x \leadsto y$ located within the same subregion $R_1$, i.e., inside $M_{\beta'}$, the procedure computing the value of $q_1 = sep_{x,y}$ will locate this state along some path $x \leadsto y$ in the inside of $M_{\beta'}$, in some of subsequent iterations (see (d) in Definition 4.3). Therefore, $q_1$ is inside $M_{\beta'}$. But then the procedure computing the sets $In(q_1)$, $Out(q_1)$ for this state, following the shared trajectory of current regions, must insert the state $ex_{\beta'}$ in the set $Out(q_1)$, by (c1) in Definition 5.1. This gives, by (6), that $sep_{y,z} = q_2 = ex_{\beta'} \in Out(q_1) = Out(sep_{x,y})$. Therefore, we get that $sep_{y,z} \in Out(sep_{x,y})$.

(ii-2) The edge $y$, together with a complete path $x \leadsto y$, are located inside $R_2$, i.e., outside $M_{\beta'}$. Here the computation of $sep_{y,z}$ will return the state $q_2 = en_{\beta'}$, by (e) in Definition 4.3. Thus,

$$ sep_{y,z} = q_2 = en_{\beta'}. \qquad (7) $$
Because we still have a complete path $x \leadsto y$ located within $R_2$, i.e., outside $M_{\beta'}$, the procedure computing the value of $q_1 = sep_{x,y}$ will locate this state along some path $x \leadsto y$ in
the outside of $M_{\beta'}$, in some of subsequent iterations (see (d) in Definition 4.3). Therefore, $q_1$ is outside $M_{\beta'}$. But then the procedure computing $In(q_1)$, $Out(q_1)$ must insert the state $en_{\beta'}$ in the set $Out(q_1)$, by (c2) in Definition 5.1. Thus, using (7), we get that $sep_{y,z} = q_2 = en_{\beta'} \in Out(q_1) = Out(sep_{x,y})$, and therefore $sep_{y,z} \in Out(sep_{x,y})$.

This has exhausted all cases and thus completed the proof. □

As an easier variant, we obtain:

Lemma 5.4. Let $M$ be the automaton constructed in Theorem 3.2. Then, for each pair of ε-free transitions $x \ne y$, connected by a path $x \leadsto y$,
(a) $sep_{x,y} \in Out(t_x)$, and
(b) $sep_{x,y} \in In(s_y) \cup Out(s_y)$,
where $s_u$, $t_u$ denote the source and target states for a transition $u \in \Delta$, respectively.

Proof. The argument is a simplified version of the proof presented in Lemma 5.3. We again consider, in parallel, sequences of regions examined by several procedures, this time for the procedure computing the value of $sep_{x,y}$, together with two instances of the procedure computing the sets $In(q_1)$, $Out(q_1)$ and $In(q_2)$, $Out(q_2)$, for the case of $q_1 = t_x$ and $q_2 = s_y$, the state where the edge $x$ points to and the state from which the edge $y$ springs, respectively. Again, all these processes follow the same trajectory of current regions. This holds while there exists a complete path $x \leadsto y$ falling in the inside of the same subregion, since the states $t_x$ and $s_y$ must lie along each path $x \leadsto y$.

The trajectory can fork only at the moment when, after splitting $R$ into two subregions $R_1$ and $R_2$, separated by the boundary of some subautomaton $M_{\beta'}$, neither $R_1$ nor $R_2$ contains a complete path $x \leadsto y$. Such moment must come, since $x \ne y$. Here the procedure computing $sep_{x,y}$ stops, returning some value. There are now four subcases, depending on whether $x$ falls inside or outside $M_{\beta'}$, and whether $s_y$ is inside or outside $M_{\beta'}$. These cases can be seen in Fig. 2.
First, to prove (a), note that $t_x$, the target node for the edge $x$, is always located within the same subregion with the edge $x$. This follows from the fact that $R$ contains the complete path $x \leadsto y$ including both $x$ and $t_x$, that $en_{\beta'}$ is located outside $M_{\beta'}$ while $ex_{\beta'}$ is inside $M_{\beta'}$, and that $M_{\beta'}$ has neither edges ending in $en_{\beta'}$ nor edges from the surrounding environment ending in $ex_{\beta'}$ (see (d) and (e) in Theorem 3.2). Thus, by examining cases (a1), (a2), (b1), and (b2) in Fig. 2, the reader may easily see that we have always $sep_{x,y} \in Out(t_x)$ (see also (c) in Definition 5.1 and (e) in Definition 4.3).

Now we shall prove (b). Note that here, unlike in (a), the state $s_y$ and the edge $y$ may fall in different subregions $R_1$, $R_2$ after splitting up, if $s_y$ is a boundary state for $M_{\beta'}$. Nevertheless, $sep_{x,y}$ must be a boundary state, either equal to $en_{\beta'}$ or to $ex_{\beta'}$, by (e) in Definition 4.3, depending on whether $x$ falls outside or inside $M_{\beta'}$. In any case, because of the shared trajectory of
current regions, a boundary state of $M_{\beta'}$ must be inserted either to $In(s_y)$ or to $Out(s_y)$, depending also on whether $s_y$ falls outside or inside $M_{\beta'}$ (see (c) in Definition 5.1). Thus, $sep_{x,y} \in In(s_y) \cup Out(s_y)$. □

We can now state the main result of the paper.

Theorem 5.5. For each regular expression of size $n \ge 2$ over an alphabet consisting of $s$ symbols, $s \ge 1$, there exists an equivalent nondeterministic ε-free automaton with at most $2n$ states and less than $\frac{6}{\log(3/2)}\, s n \log n$ ε-free transitions. (For $n = 1$, the resulting automaton has at most 2 states and a single ε-free transition, for $n = 0$, a single state with no transitions.)

Proof. The automaton $M'$ presented in Definition 4.4, derived from $M$ constructed in Theorem 3.2, is clearly equivalent to the original regular expression, by Theorem 4.5. Moreover, $M'$ uses the same number of finite control states as $M$, bounded by $2n$. It only remains to bound the number of transitions in $M'$.

The rule (b) of Definition 4.4 introduces the edges of type $sep_{x,y} \xrightarrow{y} sep_{y,z}$, for each path $x \leadsto y \leadsto z$ in $M$. By Lemma 5.3, such edge must fall in one of the following sets:

$$ E_1 = \{ sep_{y,y} \xrightarrow{y} sep_{y,y} ;\ y \in \Delta \}, $$
$$ E_2 = \{ q \xrightarrow{a} q' ;\ q \in Q,\ q' \in Out(q),\ a \in \Sigma \}, $$
$$ E_3 = \{ q' \xrightarrow{a} q ;\ q \in Q,\ q' \in In(q),\ a \in \Sigma \}, $$
$$ E_4 = \{ q' \xrightarrow{a} q ;\ q \in Q,\ q' \in Out(q),\ a \in \Sigma \}, $$

since we do not keep multiple copies of an edge connecting the same pair of states and labeled by the same symbol of input alphabet. The number of edges in $E_1$ is bounded by $n$, since the automaton $M$ of Theorem 3.2 has at most $n$ ε-free edges. By Lemma 5.2, the number of edges in $E_2$ does not exceed $2n \left(\frac{1}{\log(3/2)} \log n - 1.128\right) \cdot s$, for each $n \ge 6$ and each $s \ge 1$. The same holds for $E_3$ and $E_4$.

The rules (c), (d), and (e) introduce the edges of the following types, respectively:

$$ E_5 = \{ q_s \xrightarrow{a} q ;\ q \in Q,\ a \in \Sigma \}, $$
$$ E_6 = \{ q \xrightarrow{a} q_F ;\ q \in Q,\ a \in \Sigma \}, $$
$$ E_7 = \{ q_s \xrightarrow{a} q_F ;\ a \in \Sigma \}. $$

The number of edges both in $E_5$ and in $E_6$ is bounded by $2ns$, and that in $E_7$ by $s$.
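The per-set bounds just listed can be tallied mechanically. The sketch below, assuming $n \ge 6$ so that Lemma 5.2 applies, adds up the stated bounds on $E_1, \ldots, E_7$ and compares the total with $\frac{6}{\log(3/2)}\, s n \log n$. The function names are hypothetical, chosen only for this illustration, and the sample values of $n$ and $s$ are arbitrary.

```python
import math

LOG_3_2 = math.log2(3 / 2)


def checkpoint_bound(n: int) -> float:
    """Lemma 5.2 bound on |In(q)| and |Out(q)|, stated for n >= 6."""
    return math.log2(n) / LOG_3_2 - 1.128


def edge_set_bounds(n: int, s: int) -> dict:
    """Upper bounds on |E1|..|E7| as given in the proof (n >= 6, s >= 1)."""
    # At most 2n states q, at most checkpoint_bound(n) partners q', s symbols.
    per_state_set = 2 * n * checkpoint_bound(n) * s
    return {
        "E1": n,
        "E2": per_state_set,
        "E3": per_state_set,
        "E4": per_state_set,
        "E5": 2 * n * s,
        "E6": 2 * n * s,
        "E7": s,
    }


def claimed_bound(n: int, s: int) -> float:
    """The bound (6 / log(3/2)) * s * n * log n, with base-2 logarithms."""
    return 6 / LOG_3_2 * s * n * math.log2(n)


for n, s in [(6, 1), (6, 2), (100, 2), (1000, 26)]:
    total = sum(edge_set_bounds(n, s).values())
    print(f"n={n} s={s} total={total:.1f} claimed={claimed_bound(n, s):.1f}")
```

In each sampled case the tallied total stays strictly below the claimed bound; the gap equals $-n(1-s) - s(1 - 1.768n)$, which is positive for all $n, s \ge 1$.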
Summing up, the total number of transitions in $M'$ is bounded by

$$ \|\Delta'\| \le n + 3 \cdot 2n \left(\frac{1}{\log(3/2)} \log n - 1.128\right) s + 2 \cdot 2ns + s = \frac{6}{\log(3/2)}\, s n \log n + n(1 - s) + s(1 - 1.768n). $$

Since $1 - s \le 0$ for $s \ge 1$ and $1 - 1.768n < 0$ for $n \ge 1$,

$$ \|\Delta'\| < \frac{6}{\log(3/2)}\, s n \log n, \qquad (8) $$

for each $n \ge 6$ and each $s \ge 1$.

For $n \in \{2, 3, 4, 5\}$, we can still bound the number of states by $2n$, by Theorem 3.2, and hence the number of transitions by $s(2n)^2$. It is trivial to verify that $s(2n)^2 < \frac{6}{\log(3/2)}\, s n \log n$ for $n \in \{2, 3, 4, 5\}$. Therefore, (8) holds for each $n \ge 2$ and each $s \ge 1$.

For $n \in \{0, 1\}$, we simply use the rewriting rules of Table 1 for putting a regular expression into the normal form, and thus reduce the given expression into one of the following types: $\emptyset$, $\varepsilon$, $a$, aB, or a , where $a \in \Sigma$. It is easy to construct an ε-free automaton not exceeding the claimed number of states or transitions, for each of these expressions. □

Finally, we derive an upper bound not using any assumption about the cardinality of input alphabet.
Theorem 5.6. For each regular expression of size $n \ge 2$, there exists an equivalent nondeterministic ε-free automaton with at most $2n$ states and less than $n \frac{2}{\log(3/2)} \log n \left(\frac{1}{\log(3/2)} \log n + 1\right)$ ε-free transitions.

Proof. We shall use the same construction, presented in Definition 4.4, but here we utilize Lemma 5.4 instead of Lemma 5.3. Recall that Definition 4.4 introduces the edges of type $q_1 \xrightarrow{y} q_2$, for some $y \in \Delta$ and some $q_1, q_2 \in Q$.

First, consider the state $q_1$. If the transition $q_1 \xrightarrow{y} q_2$ has been introduced by the rule (b) in Definition 4.4, then $q_1 = sep_{x,y}$, for some edge $x \in \Delta$ with a path $x \leadsto y$ in $M$. But then either $x = y$ or, by (b) in Lemma 5.4, $q_1 = sep_{x,y} \in In(s_y) \cup Out(s_y)$, where $s_y$ denotes the source state for the edge $y$. The same holds for transitions introduced by the rule (d). For the rules (c) and (e), we get that $q_1 = q_s$. Summing up, $q_1 \in In(s_y) \cup Out(s_y) \cup \{sep_{y,y}, q_s\}$.

By the same reasoning, using (a) in Lemma 5.4, we get that either $z = y$ or $q_2 = sep_{y,z} \in Out(t_y)$, for the rules (b) and (c). Here $t_y$ denotes the target state for the edge $y$. For the rules (d) and (e), we have $q_2 = q_F$. Thus, $q_2 \in Out(t_y) \cup \{sep_{y,y}, q_F\}$. Therefore, each transition of $M'$ must fall in the set

$$ E = \{ q_1 \xrightarrow{y} q_2 ;\ y \in \Delta,\ q_1 \in In(s_y) \cup Out(s_y) \cup \{sep_{y,y}, q_s\},\ q_2 \in Out(t_y) \cup \{sep_{y,y}, q_F\} \}. $$
Using Lemma 5.2, we thus get that the number of transitions in $M'$ is bounded by

$$ \|\Delta'\| \le n \left(2\left(\frac{1}{\log(3/2)} \log n - 1.128\right) + 2\right) \left(\frac{1}{\log(3/2)} \log n - 1.128 + 2\right) < n \frac{2}{\log(3/2)} \log n \left(\frac{1}{\log(3/2)} \log n + 1\right), $$

for each $n \ge 6$.

For $n \in \{2, 3, 4, 5\}$, we use a different construction, with at most $n^2 + n$ transitions and $n + 1$ states in the resulting ε-free automaton (see [6,7, Section 3] or [1] for such constructions). Again, it is trivial to verify that $n^2 + n < n \frac{2}{\log(3/2)} \log n \left(\frac{1}{\log(3/2)} \log n + 1\right)$ for each $n \in \{2, 3, 4, 5\}$, and therefore the above upper bound for $\|\Delta'\|$ is valid for each $n \ge 2$. □
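The small cases $n \in \{2, 3, 4, 5\}$ that the proofs of Theorems 5.5 and 5.6 call trivial to verify can be checked numerically, together with the product inequality used above for $n \ge 6$. This is only a floating-point sketch with base-2 logarithms; the function names are illustrative and not taken from the paper.

```python
import math

LOG_3_2 = math.log2(3 / 2)


def bound_fixed_alphabet(n: int, s: int) -> float:
    """Theorem 5.5 bound: (6 / log(3/2)) * s * n * log n."""
    return 6 / LOG_3_2 * s * n * math.log2(n)


def bound_any_alphabet(n: int) -> float:
    """Theorem 5.6 bound: n * (2/log(3/2)) log n * ((1/log(3/2)) log n + 1)."""
    x = math.log2(n) / LOG_3_2
    return n * 2 * x * (x + 1)


def product_from_lemma_5_2(n: int) -> float:
    """n * (2B + 2) * (B + 2), where B is the Lemma 5.2 bound (n >= 6)."""
    b = math.log2(n) / LOG_3_2 - 1.128
    return n * (2 * b + 2) * (b + 2)


# Small cases, shown here for s = 1 in the fixed-alphabet bound.
for n in (2, 3, 4, 5):
    print(
        f"n={n}: (2n)^2={4 * n * n} vs {bound_fixed_alphabet(n, 1):.2f}, "
        f"n^2+n={n * n + n} vs {bound_any_alphabet(n):.2f}"
    )

# Product inequality from the proof of Theorem 5.6, sampled for n >= 6.
for n in (6, 10, 100, 1000):
    print(f"n={n}: {product_from_lemma_5_2(n):.1f} < {bound_any_alphabet(n):.1f}")
```

Since both sides of the Theorem 5.5 comparison scale linearly in $s$, checking $s = 1$ already settles the small cases for every $s \ge 1$.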
6. Concluding remarks

We have presented a conversion of regular expressions into nondeterministic ε-free automata using at most $2n$ states and less than $\frac{6}{\log(3/2)}\, s n \log n$ transitions, where $n$ denotes the size of the original expression and $s$ the cardinality of the input alphabet. At the same time, the number of transitions is below $n \frac{2}{\log(3/2)} \log n \left(\frac{1}{\log(3/2)} \log n + 1\right)$, even if $s$ is not fixed.

The reader may have noticed some asymmetries in statements of Lemmas 5.3 and 5.4. Should we be able to remove, e.g., condition (d) from Lemma 5.3, we could get a better constant factor in Theorem 5.5. The same holds for simplification of Lemma 5.4 and Theorem 5.6. These asymmetries have their origin in Definition 4.3, introducing $sep_{x,y}$, where the roles of the edges $x$ and $y$ are not symmetrical, and in Theorem 3.2 (see also Table 2), which allocates a separate exit state for each subexpression, but allows shared entry states. Making the rules of Table 2 symmetrical does not help; this increases the number of states, which in turn leads to a larger constant factor for the number of edges in Theorem 5.5. Nevertheless, we assume that the constant factors in $O(sn \log n)$ or $O(n(\log n)^2)$ can be improved.

However, we conjecture that the upper bound $O(n \log n)$ for binary regular languages cannot be improved, though the matching lower bound is an open problem. The lower bound $\Omega(n \log n)$, presented in [6,7], cannot be used here, nor for any other fixed input alphabet; it uses the witness sequence of regular languages defined by $(a_1 + \varepsilon)(a_2 + \varepsilon) \cdots (a_n + \varepsilon)$, i.e., the used alphabet grows in $n$.

However, we are convinced that the upper bound can be improved in a special case of unary (tally) regular languages, defined over a single-letter input alphabet. For example, it is known that the optimal conversion of an $n$-state nondeterministic automaton into a deterministic one may require $2^n$ states. But, in case of unary regular languages, we have a simulation with $O(e^{\sqrt{n \ln n}})$ states by a deterministic one-way automaton, even if the original nondeterministic machine is a two-way device [8].

We also do not know whether the assumption about a fixed alphabet is really necessary to obtain the upper bound $O(n \log n)$, i.e., whether we cannot replace Theorems 5.5 and 5.6 by one, more powerful, theorem.
References

[1] R. Book, S. Even, S. Greibach, G. Ott, Ambiguity in graphs and expressions, IEEE Trans. Comput. C-20 (1971) 149–153.
[2] A. Ehrenfeucht, P. Zieger, Complexity measures for regular expressions, J. Comput. System Sci. 12 (1976) 134–146.
[3] J. Hartmanis, P.M. Lewis II, R.E. Stearns, Memory bounds for the recognition of context free and context sensitive languages, in: IEEE Conference Record on Switching Circuit Theory and Logical Design, New York, NY, 1965, pp. 191–202.
[4] J.E. Hopcroft, J.D. Ullman, Formal Languages and Their Relation to Automata, Addison-Wesley, Reading, MA, 1969.
[5] J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, MA, 1979.
[6] J. Hromkovič, S. Seibert, T. Wilke, Translating regular expressions into small ε-free nondeterministic automata, in: Proceedings of the Symposium on Theoretical Aspects of Computer Science, Lecture Notes in Computer Science, Vol. 1200, Springer, Berlin, 1997, pp. 55–66.
[7] J. Hromkovič, S. Seibert, T. Wilke, Translating regular expressions into small ε-free nondeterministic finite automata, J. Comput. System Sci. 62 (2001) 565–588.
[8] C. Mereghetti, G. Pighizzini, Optimal simulations between unary automata, SIAM J. Comput. 30 (2001) 1976–1992.
[9] S. Sippu, E. Soisalon-Soininen, Parsing Theory, Vol. I: Languages and Parsing, in: EATCS Monographs of Theoretical Computer Science, Vol. 15, Springer, Berlin, 1988.