Information Processing Letters 44 (1992) 165-170
North-Holland, 30 November 1992
A heuristic method for generating large random expressions

John Bainbridge
School of Computing, Information Systems and Mathematics, South Bank University, London SE1 0AA, United Kingdom

Communicated by R.S. Bird
Received 8 June 1992

Keywords: Combinatorial problems; labelled trees

Correspondence to: J. Bainbridge, Centre for Systems and Software Engineering, School of Computing, Information Systems and Mathematics, South Bank University, 103 Borough Road, London SE1 0AA, United Kingdom.
1. Introduction

This paper describes a method to produce random expressions consisting of unary and binary operators. Given a particular set of operators and operands to choose from, an expression of a particular length is produced at random from the set of all possible expressions of that length, where length is the total number of operators and operands in an expression. For example, if the choice of operators and operands were {+, -, ×, log, sin, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9} then

    4 + log(6 - 3)    (1)

is an expression with a length of 6.

The problem of selecting, with equal probability, an expression of a fixed length is best tackled by looking at the syntax trees of the expressions. A syntax tree is a graphical representation of a sentence showing the syntactic relationship between parts of the sentence. The syntax tree for expression (1) can be seen in Fig. 1. As can be seen from this example, the number of nodes in the syntax tree is equivalent to the length of the expression. Producing a random expression of length n is therefore equivalent to producing a random syntax tree with n nodes. In effect we wish to generate random trees from the class of labelled, ordered (and therefore rooted) trees [4, pp. 305-306] whose nodes have a maximum of two subtrees. The trees are ordered as an expression such as a + (a + a) is considered different to (a + a) + a.

Methods for generating random trees from other classes of trees have been considered by several authors. Wilf describes an algorithm for choosing a free (i.e. unlabelled and unrooted) tree of n nodes at random [8].
Fig. 1. Syntax tree for 4 + log(6 - 3).
In the important class of binary trees on n nodes (and therefore ordered trees with n - 1 nodes, with which there is a one-to-one correspondence) algorithms have been developed to ensure that computation remains practicable for large n [2]. The method we present for producing random trees from the class under consideration here also remains feasible for large n. It consists of selecting an unlabelled, ordered tree of the required size and then labelling the nodes with the operators and operands. Some unlabelled trees can be labelled in more ways than others, so to ensure all syntax trees are produced with equal likelihood the unlabelled trees are selected with a weighting to counterbalance this bias. The method consists of the following stages:

- Classify the unlabelled trees of a given size into classes according to the number of k-ary nodes (i.e. nodes with k children): two trees are in the same class precisely when they have the same number of k-ary nodes for every k.
- Calculate the total number of syntax trees each class of trees can give rise to and select a class with probability proportionate to this number.
- Select a tree at random from within this class.
- Label the nodes of this tree with operators and operands chosen at random.

Details of the first two stages are given in Section 2. In Section 3 a technique is described which ensures these first stages remain practicable to compute even for large expressions. Section 4 describes the second two stages. Section 5 gives the complexity of the algorithm.
2. Selecting a class

As mentioned above, the method consists of selecting an unlabelled tree then labelling the nodes. If an unlabelled tree of a certain size is chosen at random then each possible syntax tree (and therefore each expression) will not have an equal chance of being picked. This is because the number of different ways an unlabelled tree can be labelled depends on the structure of the tree even if the number of nodes is kept constant. The number of ways a tree can be labelled depends on the outdegrees of its nodes (where the outdegree of a node is defined as the number of subtrees of that node) and the number of different operators and operands. Syntax trees of expressions consisting of unary and binary operators have nodes with outdegrees 0, 1 or 2. Suppose there are a operands, b unary operators and c binary operators to choose from to build an expression. Then if a tree has n_0 nodes of outdegree 0, n_1 nodes of outdegree 1 and n_2 nodes of outdegree 2 there are

    L(n_0, n_1, n_2) = a^{n_0} b^{n_1} c^{n_2}

ways of labelling that tree. The unlabelled trees can thus be put into classes according to the values of n_0, n_1 and n_2, and all those trees in a particular class can be labelled in the same number of ways. As a consequence, once a class is selected an unlabelled tree can be chosen from within that class at random. To select a class we need to know how many unlabelled trees there are in each class. Let the number of ordered, unlabelled trees with n_0 nodes of outdegree 0, n_1 nodes of outdegree 1 and n_2 nodes of outdegree 2 be denoted by T(n_0, n_1, n_2); then from [6] we must have n_0 = n_2 + 1 and

    T(n_0, n_1, n_2) = (n_0 + n_1 + n_2 - 1)! / (n_0! n_1! n_2!).
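As a small illustration (this is not the program referred to in Section 5, and the names log_fact, log_L and log_T are ours), the two counts can be evaluated in C through logarithms of factorials, which keeps the arithmetic manageable for the sizes considered later:

/* illustrative sketch only; not the author's implementation */
#include <math.h>
#include <stdio.h>

/* log(x!) via the log-gamma function */
static double log_fact(int x) { return lgamma((double)x + 1.0); }

/* log of L(n0, n1, n2) = a^n0 b^n1 c^n2 */
static double log_L(int n0, int n1, int n2, int a, int b, int c) {
    return n0 * log((double)a) + n1 * log((double)b) + n2 * log((double)c);
}

/* log of T(n0, n1, n2) = (n0 + n1 + n2 - 1)! / (n0! n1! n2!) */
static double log_T(int n0, int n1, int n2) {
    return log_fact(n0 + n1 + n2 - 1) - log_fact(n0) - log_fact(n1) - log_fact(n2);
}

int main(void) {
    /* the class containing the tree of expression (1): three leaves, one unary
       node and two binary nodes, with a = 10 operands, b = 2 unary, c = 3 binary */
    printf("T = %.0f trees, L = %.0f labellings each\n",
           exp(log_T(3, 1, 2)), exp(log_L(3, 1, 2, 10, 2, 3)));
    return 0;
}

For this class the sketch prints T = 10 and L = 18000.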
Trees of a certain size, n say, are put into classes according to the values of n_0, n_1 and n_2. Since n = n_0 + n_1 + n_2 and n_0 = n_2 + 1 it is easily shown that 1 ≤ n_0 ≤ (n + 1) div 2. Let C_i denote the class of trees of size n with n_0 = i (thereby also fixing n_1 and n_2) and let the total number of syntax trees each class can give rise to be denoted by |C_i|. Then there are (n + 1) div 2 classes of trees and

    |C_i| = L(i, n - 2i + 1, i - 1) × T(i, n - 2i + 1, i - 1)
          = a^i b^{n-2i+1} c^{i-1} (n - 1)! / (i! (n - 2i + 1)! (i - 1)!).    (2)
Weighted probabilities are assigned to each of the classes in proportion to this value. Thus if there are r classes, C_1, ..., C_r, giving rise to
|C_1|, ..., |C_r| expressions, respectively, then the class C_i is given a weighted probability of

    |C_i| / (|C_1| + ... + |C_r|).
A class is chosen with these a priori probabilities. However, the values of |C_i| become inordinately large for trees not much larger than ten nodes, so a heuristic method is employed so that a class can be selected without having to enumerate this expression.
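For small expressions the selection can nevertheless be carried out exactly as just described, by evaluating |C_i| from equation (2) for every class and sampling in proportion to it. The following C sketch illustrates the idea (the function names are ours, rand() is left unseeded and its slight modulo bias is ignored):

/* illustrative sketch only; not the author's implementation */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* |C_i| from equation (2): a^i b^(n-2i+1) c^(i-1) (n-1)! / (i! (n-2i+1)! (i-1)!) */
static double class_size(int n, int i, int a, int b, int c) {
    return pow(a, i) * pow(b, n - 2 * i + 1) * pow(c, i - 1)
         * exp(lgamma(n) - lgamma(i + 1) - lgamma(n - 2 * i + 2) - lgamma(i));
}

/* choose i in 1..(n+1) div 2 with probability |C_i| / (|C_1| + ... + |C_r|);
   practicable only while the |C_i| fit comfortably in a double             */
static int select_class_direct(int n, int a, int b, int c) {
    int r = (n + 1) / 2, i;
    double total = 0.0, u;
    for (i = 1; i <= r; i++) total += class_size(n, i, a, b, c);
    u = ((double)rand() / RAND_MAX) * total;
    for (i = 1; i < r; i++) {
        u -= class_size(n, i, a, b, c);
        if (u <= 0.0) return i;
    }
    return r;
}

int main(void) {
    printf("chosen class for n = 9: i = %d\n", select_class_direct(9, 10, 2, 3));
    return 0;
}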
3. Selecting a class for large expressions
Consider a histogram in which the height of each bar represents the value of |C_i| for each class. The histogram is placed in a rectangle whose height corresponds to the tallest bar, |C_max| say, and whose length is the width of the histogram (see Fig. 2). Points are chosen at random within the rectangle until one falls within the histogram. The class of the bar in which that point falls is then taken as the chosen class. Such a procedure works because the area of a particular bar, and therefore the probability of picking a point in that bar, is proportional to the size of the class it represents. To choose a point at random within the rectangle the horizontal and vertical coordinates are chosen at random separately. The horizontal coordinate is chosen by picking a class at random, i.e. a number between 1 and (n + 1) div 2. Suppose this class is C_k. Now the height of the bar for that class is a fraction of the height of the rectangle, say α (≤ 1), so the probability of choosing a point within the bar chart is α. Thus the class is accepted with this probability, otherwise it is rejected and the procedure repeated.

Fig. 2. Histogram of tree classes in rectangle.

The fraction α can be calculated as follows. From equation (2) it is easily shown that

    |C_{i+1}| = |C_i| R_i    for 1 ≤ i ≤ (n + 1) div 2 - 1,

with

    |C_1| = a b^{n-1}

and

    R_i = ac(n + 1 - 2i)(n - 2i) / (b^2 i(i + 1)).
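As an illustrative check (again not code from the paper, and ratio is our name), the closed form for R_i reproduces the value 303000/811800 quoted at the end of this section for n = 1000, a = 10, b = 2, c = 3 and i = 450:

/* illustrative sketch only; not the author's implementation */
#include <stdio.h>

/* R_i = ac(n + 1 - 2i)(n - 2i) / (b^2 i(i + 1)), the ratio |C_{i+1}| / |C_i| */
static double ratio(int n, int i, int a, int b, int c) {
    return (double)a * c * (n + 1 - 2 * i) * (n - 2 * i)
         / ((double)b * b * i * (i + 1));
}

int main(void) {
    printf("R_450 = %.6f (= 303000/811800)\n", ratio(1000, 450, 10, 2, 3));
    return 0;
}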
Theorem 1. ∃m: R_i > 1 for 1 ≤ i ≤ m and R_i < 1 for m < i ≤ (n + 1) div 2 - 1.
Proof. It is first shown that R_i = 1 has at most one solution in the range 1 ≤ i ≤ (n + 1) div 2 - 1 when the domain of R_i is extended from the positive integers to the real numbers. Suppose

    R_i = ac(n + 1 - 2i)(n - 2i) / (b^2 i(i + 1)) = 1;
then

    ac(n + 1 - 2i)(n - 2i) = b^2 i(i + 1)
    ⇒ (4ac - b^2) i^2 - ((4n + 2)ac + b^2) i + ac(n^2 + n) = 0
    ⇒ i = ((4n + 2)ac + b^2 ± [((4n + 2)ac + b^2)^2 - 4(4ac - b^2) ac(n^2 + n)]^{1/2}) / (2(4ac - b^2))
        = ((4n + 2)ac + b^2 ± k) / (2(4ac - b^2)),

say. If 4ac < b^2 then the quadratic in i has at most one positive real solution. If 4ac > b^2 then, since

    ((4n + 2)ac + b^2) / (2(4ac - b^2)) > 4nac / (2(4ac - b^2)) = n / (2 - b^2/(2ac)) > n/2 > (n + 1) div 2 - 1,

there can be at most one real solution with i ≤ (n + 1) div 2 - 1. If 4ac = b^2 then the quadratic reduces to a linear equation with one positive solution. In all cases R_i = 1 for at most one value of i in the range 1 ≤ i ≤ (n + 1) div 2 - 1 where i is any real number. Since R_1 = ac(n - 1)(n - 2)/(2b^2) > 1 for large n and R_i is a continuous function in the range 1 ≤ i ≤ (n + 1) div 2 - 1, the characteristic set out in the theorem holds for integer values of i.  □

From this theorem it can be seen that the values of |C_i| reach a maximum at |C_{m+1}|, with R_i > 1 for 1 ≤ i ≤ m and R_i < 1 for m < i ≤ (n + 1) div 2 - 1; write max = m + 1 for the index of this largest class. If k > max then

    α = |C_k| / |C_{m+1}|
      = (|C_{m+2}| / |C_{m+1}|) (|C_{m+3}| / |C_{m+2}|) ... (|C_k| / |C_{k-1}|)
      = R_{m+1} R_{m+2} ... R_{k-1}.

So accepting C_k with probability α is equivalent to accepting the class only if it is repeatedly accepted by a series of trials whose chances of success are R_{m+1}, R_{m+2}, ..., R_{k-1}, respectively. Alternatively, if k < max then

    α = |C_k| / |C_{m+1}|
      = (|C_k| / |C_{k+1}|) (|C_{k+1}| / |C_{k+2}|) ... (|C_m| / |C_{m+1}|)
      = (1/R_k) (1/R_{k+1}) ... (1/R_m).

In this case accepting C_k with probability α is equivalent to accepting the class when it is repeatedly accepted by a series of trials whose chances of success are 1/R_k, 1/R_{k+1}, ..., 1/R_m, respectively. Lastly, if k = max then α = 1 and the class is accepted at once.

With this heuristic method the constraining factor on whether a class can be selected depends on the ability to simulate trials whose probabilities of success are R_i or 1/R_i. Consequently large random syntax trees of the order of 1000 nodes, say, can be produced on a typical workstation. A typical value for R_i with a syntax tree of this size is 303000/811800 (when n = 1000, a = 10, b = 2, c = 3, i = 450). Having selected a class the next step is to choose a tree from within that class.
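A compact C sketch of this class-selection heuristic is given below. It is our illustration of the procedure rather than the author's implementation: select_class, ratio and bernoulli are invented names, rand() is left unseeded, and its modulo bias is ignored.

/* illustrative sketch only; not the author's implementation */
#include <stdio.h>
#include <stdlib.h>

/* R_i = ac(n + 1 - 2i)(n - 2i) / (b^2 i(i + 1)) */
static double ratio(int n, int i, int a, int b, int c) {
    return (double)a * c * (n + 1 - 2 * i) * (n - 2 * i)
         / ((double)b * b * i * (i + 1));
}

static int bernoulli(double p) { return (double)rand() / RAND_MAX < p; }

/* Pick k uniformly from 1..(n+1) div 2, then accept with probability
   |C_k| / |C_max|, expressed as a chain of trials with success
   probabilities R_max .. R_{k-1} (k > max) or 1/R_k .. 1/R_{max-1} (k < max). */
static int select_class(int n, int a, int b, int c) {
    int r = (n + 1) / 2, max = 1, i, k;
    while (max < r && ratio(n, max, a, b, c) > 1.0) max++;  /* |C_max| is the tallest bar */
    for (;;) {
        k = 1 + rand() % r;                                 /* horizontal coordinate */
        if (k == max) return k;                             /* alpha = 1 */
        if (k > max) {
            for (i = max; i < k; i++)
                if (!bernoulli(ratio(n, i, a, b, c))) goto reject;
        } else {
            for (i = k; i < max; i++)
                if (!bernoulli(1.0 / ratio(n, i, a, b, c))) goto reject;
        }
        return k;                                           /* point fell inside the bar */
    reject:;                                                /* point fell above the bar: retry */
    }
}

int main(void) {
    printf("class chosen for n = 1000: %d\n", select_class(1000, 10, 2, 3));
    return 0;
}

The scan for max relies on Theorem 1: the ratios exceed 1 up to the largest class and fall below 1 afterwards, so a single left-to-right pass finds the tallest bar.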
4. Selecting a tree within a class
In [6] Raney uses the following properties in proving a more general form of equation (2):
An ordered tree is characterised by the sequence d_1 d_2 ... d_n of the outdegrees of its nodes in preorder. (Traversing a tree in preorder is defined by the routine: visit the root of the tree; traverse the first subtree (in preorder); traverse the remaining subtrees (in preorder) [4, p. 334].) Let d_1 d_2 ... d_n be any sequence with n_j appearances of nodes with outdegree j and satisfying the condition

    n_0 = 1 + Σ_{j=2}^{n-1} (j - 1) n_j.    (3)
Then there is precisely one cyclic rearrangement d_k d_{k+1} ... d_n d_1 ... d_{k-1} that corresponds to a tree. If e_1 e_2 ... e_n is a cyclic rearrangement then it corresponds to a tree if and only if Σ_{i=1}^{m} (e_i - 1) ≥ 0 for 1 ≤ m ≤ n - 1.
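These properties suggest one way of carrying out this stage in O(n) time, sketched in C below under our own assumptions: a sequence with i nodes of outdegree 0, n - 2i + 1 of outdegree 1 and i - 1 of outdegree 2 is built for class C_i, permuted uniformly, and rotated to the unique cyclic rearrangement whose proper prefix sums of d_j - 1 are nonnegative (the rotation is taken just after the first minimum of the prefix sums; [1] gives an efficient algorithm for this rearrangement step, and the function names here are ours).

/* illustrative sketch only; not the author's implementation */
#include <stdio.h>
#include <stdlib.h>

/* Fisher-Yates shuffle (rand() bias ignored for this sketch) */
static void shuffle(int *d, int n) {
    int i, j, t;
    for (i = n - 1; i > 0; i--) {
        j = rand() % (i + 1);
        t = d[i]; d[i] = d[j]; d[j] = t;
    }
}

/* Fill out[0..n-1] with the preorder outdegree sequence of a random tree
   from class C_i: i zeros, n - 2i + 1 ones, i - 1 twos, randomly permuted
   and rotated so that every proper prefix sum of d_j - 1 is nonnegative.  */
static void random_tree_degrees(int *out, int n, int i) {
    int *d = malloc(n * sizeof *d);
    int j, sum = 0, min = 1, pos = 0;  /* the minimum prefix sum is <= -1, so it is always found */
    if (!d) return;
    for (j = 0; j < n; j++) d[j] = j < i ? 0 : j < n - i + 1 ? 1 : 2;
    shuffle(d, n);
    for (j = 0; j < n; j++) {          /* first position of the minimum prefix sum */
        sum += d[j] - 1;
        if (sum < min) { min = sum; pos = j + 1; }
    }
    for (j = 0; j < n; j++) out[j] = d[(pos + j) % n];
    free(d);
}

int main(void) {
    int deg[9], j, n = 9, i = 3;       /* 3 leaves, 4 unary nodes, 2 binary nodes */
    random_tree_degrees(deg, n, i);
    for (j = 0; j < n; j++) printf("%d ", deg[j]);
    printf("\n");
    return 0;
}

The final stage then walks the resulting sequence in preorder and labels each outdegree-0 node with a random operand, each outdegree-1 node with a random unary operator and each outdegree-2 node with a random binary operator.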
5. Complexity of the algorithm

As stated above, the deterministic part of the algorithm described in Section 4 has time complexity O(n). The selection of a class described in Section 3 is non-deterministic, so it is not possible to give an absolute upper bound to the execution time of this part; however, the expected execution time is O(n). The proof for this is outlined below for the general case when there are n classes, C_1, ..., C_n, such that |C_{i+1}| = R_i |C_i| and 0 < R_i < 1 for 1 ≤ i ≤ n - 1. Each trial is either a two-way choice with probability R_i and 1 - R_i or an n-way choice each with probability 1/n. These trials are assumed to take constant time regardless of the size of n or the denominator of R_i.

The sequence of trials to be performed to choose a class can be described by a Markov chain. The start node represents the process of choosing one of the classes and has n edges leading from it, each with an associated probability of 1/n. If class i is chosen it is assumed that the average number of trials performed is t_i. These values are used to weight the nodes associated with choosing a particular class. If the trials result in a class being selected then the end node is reached (for class i ≥ 2 this happens with probability R_1 ... R_{i-1}; for class i = 1 the probability is 1), otherwise another class is selected, represented by an edge back to the start node. The average path length from start node to stop node, T_n say, represents the expected number of trials performed to choose a class. It can be found using analytical techniques described in [5]:

    T_n = (n + t_2 + t_3 + ... + t_n) / (1 + R_1 + R_1 R_2 + ... + R_1 ... R_{n-1}).    (4)

Markov chains can also be used to calculate the values of t_i:

    t_2 = 1,
    t_i = 1 + R_1 + R_1 R_2 + ... + R_1 ... R_{i-2}    for 2 < i ≤ n.

Substituting these values into (4) gives

    T_n = n / (1 + R_1 + R_1 R_2 + ... + R_1 ... R_{n-1})
        + 1 / (1 + R_1 + R_1 R_2 + ... + R_1 ... R_{n-1})
        + (1 + R_1) / (1 + R_1 + R_1 R_2 + ... + R_1 ... R_{n-1})
        + ...
        + (1 + R_1 + ... + R_1 ... R_{n-2}) / (1 + R_1 + R_1 R_2 + ... + R_1 ... R_{n-1}).

The first term is less than n and the other n - 1 terms are less than 1.
Therefore, T_n < n + (n - 1) = 2n - 1 and T_n = O(n).

The method has been implemented in "C" and, as an indication of its efficiency, it took a Sun 3/60 workstation approximately 30 seconds to produce 100 expressions of length 1000. The program is currently being used to generate random predicate expressions in order to gain a greater understanding of the use of predicate expressions in Z specifications [7]. Details on how predicates in Z can be measured so as to provide structural information about a specification can be found in [3].
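As a quick numerical check of the 2n - 1 bound, the following C fragment evaluates equation (4) for an arbitrary decreasing choice of ratios 0 < R_i < 1 (these are not values arising from the tree classes, and the code is ours):

/* illustrative sketch only; not the author's implementation */
#include <stdio.h>

int main(void) {
    enum { N = 20 };
    double R[N], prod, denom, numer, t;
    int i;

    for (i = 1; i < N; i++) R[i] = 1.0 / (i + 1);          /* arbitrary 0 < R_i < 1 */

    prod = 1.0; denom = 1.0;                               /* 1 + R_1 + R_1 R_2 + ... */
    for (i = 1; i < N; i++) { prod *= R[i]; denom += prod; }

    numer = N;                                             /* n + t_2 + ... + t_n */
    t = 1.0;                                               /* t_2 = 1 */
    prod = 1.0;
    for (i = 2; i <= N; i++) {
        numer += t;                                        /* add t_i */
        prod *= R[i - 1];                                  /* R_1 ... R_{i-1} */
        t += prod;                                         /* t_{i+1} = t_i + R_1 ... R_{i-1} */
    }

    printf("T_n = %.4f, bound 2n - 1 = %d\n", numer / denom, 2 * N - 1);
    return 0;
}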
Acknowledgment

I am grateful to Colin Cooper for initial observations which facilitated the heuristic method. Also to Michael Atkinson for providing a preprint of his paper containing an efficient algorithm for finding the cyclic rearrangement of a list corresponding to a tree. Thanks are also due to Robert Lockhart, David Singmaster and especially Robin Whitty for their valuable suggestions during the
writing of the paper. Finally an acknowledgement to the referees for their useful comments.
References

[1] M.D. Atkinson, Uniform generation of rooted ordered trees with prescribed degrees, Computer J., to appear.
[2] M.D. Atkinson and J.-R. Sack, Generating binary trees at random, Inform. Process. Lett. 41 (1) (1992) 21-23.
[3] J. Bainbridge, R.W. Whitty and J. Wordsworth, Obtaining structural metrics of Z specifications for systems development, in: J.E. Nicholls, ed., Z User Workshop, Oxford 1990 (Springer, Berlin, 1991) 267-281.
[4] D. Knuth, The Art of Computer Programming, Vol. 1 (Addison-Wesley, Reading, MA, 2nd ed., 1973).
[5] C.V. Ramamoorthy, Discrete Markov analysis of computer programs, in: Proc. ACM 20th National Conf. (1965) 386-392.
[6] G.N. Raney, Functional composition patterns and power series reversion, Trans. Amer. Math. Soc. 94 (1960) 441-451.
[7] J.M. Spivey, The Z Notation: A Reference Manual (Prentice-Hall, Englewood Cliffs, NJ, 1989).
[8] H.S. Wilf, Combinatorial Algorithms: An Update, Regional Conference Series in Applied Mathematics (Society for Industrial and Applied Mathematics, Philadelphia, 1989) 27-29.