Computing the relative entropy between regular tree languages

Jorge Calera-Rubio*, Rafael C. Carrasco
Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071 Alicante, Spain

Information Processing Letters 68 (1998) 283-289
Received 2 February 1998; received in revised form 19 October 1998
Communicated by L. Boasson

Work partially supported by the Spanish CICyT under grant TIC97-0941.
* Corresponding author.

1. Introduction

Tree grammars are a more powerful class of formal grammars than string grammars: indeed, strings of symbols can be regarded as labeled unary trees. Ordered labeled trees (i.e., trees whose nodes are labeled and in which the left-to-right order among siblings is significant) are well suited to represent grammatical parses [1], to describe images [10], and for a variety of tasks where the elementary components establish more complex relations than those described by strings [13]. Learning a formal grammar from examples provides one with a model that can later be used in classification tasks. Often, it is possible to assume that the data source is random, an assumption that avoids the use of counter-examples during the learning process (counter-examples are usually scarce and difficult to obtain). In such cases, the model naturally inferred is a stochastic grammar, i.e., a grammar with a probability attached to each production. Stochastic grammars (or equivalent formalisms such as hidden Markov models) have traditionally been used in tasks such as speech recognition [14] and natural language modeling [3]. Recently, a number of algorithms have been proposed to learn stochastic tree grammars, such as the one by Carrasco and Oncina [5], which identifies any stochastic tree automaton, or the one by Sakakibara et al. [16], which has been applied to predict the secondary structure of RNA.


The quality of an inference method can be checked in practical situations, or under controlled environments. For instance, samples can be generated with a given grammar and then the output grammars can be compared with the original one. A widely used measure of the difference between the inferred language and the correct one is the relative entropy or Kullback-Leibler distance (see, for instance, [6]). The relative entropy between the model and the correct language is often approximated by means of large test sets generated following the correct distribution. This method is in general infeasible in the case of tree languages, due to the huge number (enormous compared with the case of strings) of different trees that one has to generate. Therefore, an alternative method for such evaluation is of interest.

2. Deterministic tree automata

Ordered labeled trees can be represented using the functional notation: for instance, the functional notation for the tree shown in Fig. 1 is a(b(a(bc))c). Given a finite set of labels V, the set of all finite-size trees whose nodes are labeled with symbols in V will be denoted as V^T. Note that any symbol a in V is also the representation of a tree consisting of a single node labeled with a and, therefore, V ⊂ V^T.

[Fig. 1. A graphic representation of the ordered labeled tree a(b(a(bc))c).]

Deterministic tree automata (DTA) generalize deterministic finite-state automata (DFA, [9]), which work on strings. A DFA processes an input string from left to right and assigns a state in the automaton to every position in the string. This state depends both on the symbol at the current position and on the state assigned to the previous symbol. Once the last position is reached, the string is accepted if the current state is in the subset of accepting states. In the case of DTAs, the trees are processed bottom-up and a state in the automaton is assigned to every node in the tree. This state depends on the node label and on the states associated to the descendants of the node. The state assigned to the root of the tree has to be an accepting state for the tree to be accepted by the automaton.

Formally, the DTA is a 4-tuple A = (Q, V, δ, F), where
- Q is a finite set of states;
- V is a finite set of labels;
- F ⊂ Q is the subset of accepting states;
- δ = {δ_0, δ_1, ..., δ_n} is a set of transition functions of the form δ_k : V × Q^k → Q.

Compared to the case of string automata (DFA), there is no need to define explicitly the initial state. Indeed, for every leaf subtree a ∈ V the state δ_0(a) plays the role of an initial state. On the other hand, if t = f(t_1, t_2, ..., t_k) is a tree (or subtree) consisting of an internal node labeled f which expands k subtrees t_1, t_2, ..., t_k, the state δ(t) is δ_k(f, δ(t_1), ..., δ(t_k)). Therefore, following [15], δ(t) can be recursively defined as:

    \delta(t) = \begin{cases} \delta_k(f, \delta(t_1), \ldots, \delta(t_k)) & \text{if } t = f(t_1 \cdots t_k) \in V^T - V, \\ \delta_0(a) & \text{if } t = a \in V. \end{cases}    (1)
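As an illustration of Eq. (1), the following Python sketch evaluates δ(t) bottom-up. The nested-tuple encoding of trees and the dictionary-based transition table are our own conventions for this example, not part of the formalism above.

```python
# Minimal sketch of a DTA, assuming trees are encoded as nested tuples:
# a leaf is a label, an internal node is (label, child1, ..., childk).
# The transition table maps (label, state1, ..., statek) -> state;
# leaves use keys of the form (label,).

def delta(tree, trans):
    """Return the state assigned to `tree` by Eq. (1), or None for an
    undefined transition (the absorption state)."""
    if not isinstance(tree, tuple):                # t = a in V
        return trans.get((tree,))
    label, children = tree[0], tree[1:]            # t = f(t1, ..., tk)
    states = tuple(delta(c, trans) for c in children)
    if None in states:
        return None
    return trans.get((label,) + states)

def accepts(tree, trans, accepting):
    return delta(tree, trans) in accepting

# Example: a DTA over V = {a, b, c} accepting the tree a(b(a(bc))c) of Fig. 1.
trans = {
    ('b',): 0, ('c',): 1,                          # leaf transitions delta_0
    ('a', 0, 1): 2,                                # a(b-state, c-state) -> 2
    ('b', 2): 0,                                   # b(.) -> 0
}
t = ('a', ('b', ('a', 'b', 'c')), 'c')
print(accepts(t, trans, accepting={2}))            # True
```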

Every DTA defines a regular tree language consisting of all trees accepted by the automaton:

    L(A) = \{ t \in V^T : \delta(t) \in F \}.

By convention, undefined transitions lead to an absorption state, i.e., to non-acceptable trees. In particular, no tree of width larger than n can be accepted by the automaton A.

3. Stochastic tree automata

Stochastic tree automata incorporate a probability for every transition in the automaton, with the normalization that the probabilities of transitions leading to the same state q ∈ Q must add up to one.² In other words, there is a collection of functions p = {p_0, p_1, p_2, ..., p_n} of the type p_k : V × Q^k → [0, 1] such that they satisfy, for all q ∈ Q,

    \sum_{k=0}^{n} \sum_{f \in V} \sum_{q_1, \ldots, q_k \in Q:\ \delta_k(f, q_1, \ldots, q_k) = q} p_k(f, q_1, \ldots, q_k) = 1.    (2)

² This normalization makes the probabilities of all possible expansions of a tree node add up to one.
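As a small illustration, the normalization of Eq. (2) can be checked directly on a table of transition probabilities; the dictionary encoding used below is our own choice, not taken from the paper.

```python
# Assumed encoding of a stochastic DTA (our own convention):
#   trans[(f, q1, ..., qk)] = (q, prob)  combines delta_k and p_k;
#   leaf transitions use keys of the form (f,).

def check_normalization(trans, states, tol=1e-9):
    """Verify Eq. (2): for every state q, the probabilities of all
    transitions leading to q add up to one."""
    totals = {q: 0.0 for q in states}
    for _key, (q, p) in trans.items():
        totals[q] += p
    return all(abs(totals[q] - 1.0) < tol for q in states)
```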

In addition to this, every stochastic deterministic tree automaton A = (Q, V, δ, p, r) provides a function r : Q → [0, 1] which, for every q ∈ Q, gives the probability that a tree satisfies δ(t) = q and substitutes, in the definition of the automaton, the subset of accepting states. Then, the probability of a tree t in the language generated by A is given by the product of the probabilities of all the transitions used when t is processed by A, times r(δ(t)):

    p(t|A) = r(\delta(t)) \, \pi(t),    (3)

with π(t) recursively given by

    \pi(f(t_1, \ldots, t_k)) = p_k(f, \delta(t_1), \ldots, \delta(t_k)) \, \pi(t_1) \cdots \pi(t_k).    (4)

Of course, π(a) = p_0(a) for t = a ∈ V. Eqs. (3) and (4) define a probability distribution over V^T which is consistent if

    \sum_{t \in V^T} p(t|A) = 1.    (5)

The condition of consistency can be written in terms of matrix analysis. Indeed, let us define the expectation matrix A with elements

    A_{ij} = \sum_{k=1}^{n} \sum_{f \in V} \sum_{q_1, \ldots, q_k \in Q:\ \delta_k(f, q_1, \ldots, q_k) = j} p_k(f, q_1, q_2, \ldots, q_k) \, (\delta_{i q_1} + \delta_{i q_2} + \cdots + \delta_{i q_k}),    (6)

where δ_{ij} is Kronecker's delta.³ Consistency, in the sense of Eq. (5), is preserved if the spectral radius of matrix A is smaller than one [17].⁴

³ Kronecker's delta δ_{ij} is 1 if i = j and 0 otherwise.
⁴ Note that matrix A in [17] corresponds to A transpose here.
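Continuing the sketch above (again with assumed encodings rather than the paper's notation), the following computes p(t|A) via Eqs. (3)-(4), builds the expectation matrix A of Eq. (6), and tests the spectral-radius condition; numpy is used only for the eigenvalues.

```python
import numpy as np

# Encoding as before: trans[(f, q1, ..., qk)] = (q, prob); root[q] = r(q).
# Undefined transitions are not handled here (they would contribute zero).

def prob(tree, trans, root):
    """p(t|A) = r(delta(t)) * pi(t), Eqs. (3)-(4)."""
    q, pi = _pi(tree, trans)
    return root.get(q, 0.0) * pi

def _pi(tree, trans):
    if not isinstance(tree, tuple):                  # leaf, t = a
        return trans[(tree,)]
    label, children = tree[0], tree[1:]
    sub = [_pi(c, trans) for c in children]
    states = tuple(q for q, _ in sub)
    q, p = trans[(label,) + states]
    pi = p
    for _, pc in sub:
        pi *= pc
    return q, pi

def expectation_matrix(trans, states):
    """Matrix A of Eq. (6): entry (i, j) accumulates, over every transition
    leading to state j, its probability times the number of child slots that
    carry state i."""
    idx = {q: m for m, q in enumerate(states)}
    A = np.zeros((len(states), len(states)))
    for key, (q, p) in trans.items():
        for child_state in key[1:]:                  # the k child states
            A[idx[child_state], idx[q]] += p
    return A

def is_consistent(trans, states):
    A = expectation_matrix(trans, states)
    return max(abs(np.linalg.eigvals(A))) < 1.0      # spectral radius < 1
```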

4. Relative entropy between stochastic languages

The entropy of a probability distribution p(t|A) over V^T,

    H(A) = - \sum_{t \in V^T} p(t|A) \log_2 p(t|A),    (7)


bounds (within a deviation of one bit, see [6]) the average length of the string needed to code a tree in VT provided that an optimal coding scheme is used. Optimal coding implies an accurate knowledge of the source A. If only an approximate model A’ is available, the average length becomes:

    G(A, A') = - \sum_{t \in V^T} p(t|A) \log_2 p(t|A').    (8)

The difference H(A, A') = G(A, A') - H(A) is known as the relative entropy between A and A', or Kullback-Leibler distance, a magnitude which is never negative [6]: indeed, a suboptimal coding leads to larger average lengths. Note that H(A) = G(A, A) and, thus, H(A, A') = G(A, A') - G(A, A). In the next section, we will describe a procedure to compute G(A, A') and, therefore, to compute the entropy of a regular tree language or the relative entropy between two languages.
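Computationally, the decomposition above means that a single routine for the cross-entropy G suffices; cross_entropy below is a hypothetical name for such a routine (a sketch is given in the next section).

```python
# Hypothetical helper: cross_entropy(A, B) is assumed to return G(A, B) in bits.
def relative_entropy(A, A_prime, cross_entropy):
    # H(A, A') = G(A, A') - G(A, A), never negative.
    return cross_entropy(A, A_prime) - cross_entropy(A, A)
```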

5. Computing the relative entropy

Recall that the probability that the tree t is generated by the automaton A' = (Q', V, δ', p', r') is given by the product of two different factors: the probability r'(δ'(t)) that the root of t is of type δ'(t) ∈ Q', and the probability π'(t) that node δ'(t) expands as t. Therefore,

    \log_2 p(t|A') = \log_2 r'(\delta'(t)) + \log_2 \pi'(t).

On the other hand, the class of subsets L_{ij} = \{ t \in V^T : \delta(t) = i \wedge \delta'(t) = j \}, for i ∈ Q and j ∈ Q', defines a partition of V^T. This allows one to write the contribution to G(A, A') of the r-terms as

    G_r(A, A') = - \sum_{i \in Q} \sum_{j \in Q'} \sum_{t \in L_{ij}} p(t|A) \log_2 r'(j) = - \sum_{i \in Q} \sum_{j \in Q'} r(i) \, \eta_{ij} \log_2 r'(j),    (9)

where η_{ij}, defined as

    \eta_{ij} = \sum_{t \in L_{ij}} \pi(t),    (10)

represents the probability that a node of type i ∈ Q expands as a subtree t such that δ'(t) = j. According to Eq. (4) and expanding the sum above over all trees t = f(t_1, ..., t_k), one gets

    \eta_{ij} = \sum_{k=0}^{n} \sum_{f \in V} \sum_{t_1, \ldots, t_k \in V^T:\ \delta_k(f, \delta(t_1), \ldots, \delta(t_k)) = i,\ \delta'_k(f, \delta'(t_1), \ldots, \delta'(t_k)) = j} p_k(f, \delta(t_1), \ldots, \delta(t_k)) \, \pi(t_1) \cdots \pi(t_k).    (11)

Using again the fact that {L_{ij}} is a partition of V^T, one finds

    \eta_{ij} = \sum_{k=0}^{n} \sum_{f \in V} \sum_{i_1, \ldots, i_k \in Q:\ \delta_k(f, i_1, \ldots, i_k) = i} \; \sum_{j_1, \ldots, j_k \in Q':\ \delta'_k(f, j_1, \ldots, j_k) = j} \; \sum_{t_1 \in L_{i_1 j_1}, \ldots, t_k \in L_{i_k j_k}} p_k(f, i_1, \ldots, i_k) \, \pi(t_1) \cdots \pi(t_k)
             = \sum_{k=0}^{n} \sum_{f \in V} \sum_{i_1, \ldots, i_k \in Q:\ \delta_k(f, i_1, \ldots, i_k) = i} \; \sum_{j_1, \ldots, j_k \in Q':\ \delta'_k(f, j_1, \ldots, j_k) = j} p_k(f, i_1, i_2, \ldots, i_k) \, \eta_{i_1 j_1} \cdots \eta_{i_k j_k}.    (12)


Therefore, all η_{ij} can be easily obtained by means of an iterative procedure:

    \eta_{ij}^{[m+1]} = \sum_{k=0}^{n} \sum_{f \in V} \sum_{i_1, i_2, \ldots, i_k \in Q:\ \delta_k(f, i_1, i_2, \ldots, i_k) = i} \; \sum_{j_1, j_2, \ldots, j_k \in Q':\ \delta'_k(f, j_1, j_2, \ldots, j_k) = j} p_k(f, i_1, i_2, \ldots, i_k) \, \eta_{i_1 j_1}^{[m]} \cdots \eta_{i_k j_k}^{[m]},    (13)

with η_{ij}^{[0]} = 0. The iterative series monotonically converges to the correct values, as it can be proved straightforwardly by induction that 1 ≥ η_{ij}^{[m+1]} ≥ η_{ij}^{[m]}.
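A direct transcription of Eq. (13) as a fixed-point computation might look as follows. The dictionary encoding of the two automata is our own assumption (as in the earlier sketches): trans maps (f, i1, ..., ik) to (state, probability) for A, and trans2 maps (f, j1, ..., jk) to a state for A' (whose probabilities are not needed here).

```python
from itertools import product

def compute_eta(trans, trans2, Q, Q2, sweeps=100, tol=1e-12):
    """Fixed-point iteration of Eq. (13).  Returns eta[(i, j)], the
    probability that a node of type i in A expands as a subtree mapped to
    state j by A'."""
    eta = {(i, j): 0.0 for i in Q for j in Q2}
    for _ in range(sweeps):
        new = {(i, j): 0.0 for i in Q for j in Q2}
        for (f, *ichildren), (i, p) in trans.items():
            # enumerate all assignments j1..jk of A'-states to the k children;
            # for k = 0 the inner loop runs once with the empty tuple
            for jchildren in product(Q2, repeat=len(ichildren)):
                j = trans2.get((f, *jchildren))
                if j is None:
                    continue
                w = p
                for im, jm in zip(ichildren, jchildren):
                    w *= eta[(im, jm)]
                new[(i, j)] += w
        if max(abs(new[x] - eta[x]) for x in eta) < tol:
            return new
        eta = new
    return eta
```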

In order to evaluate the contribution to G(A, A') of the π-terms,

    G_\pi(A, A') = - \sum_{t \in V^T} p(t|A) \log_2 \pi'(t),    (14)

note that

    \log_2 \pi'(f(t_1 \cdots t_k)) = \log_2 p'_k(f, \delta'(t_1), \ldots, \delta'(t_k)) + \log_2 \pi'(t_1) + \cdots + \log_2 \pi'(t_k)    (15)

and, therefore,

    \log_2 \pi'(t) = \sum_{k=0}^{n} \sum_{f \in V} \sum_{j_1, j_2, \ldots, j_k \in Q'} \log_2 p'_k(f, j_1, j_2, \ldots, j_k) \, N'(t; f, j_1, j_2, \ldots, j_k),    (16)

where N’(t; f, ji, j2, . . . , jk) is the number of subtrees f(ti, t2, . . . , tk) in t such that s’(Q) = jl, . ..) B’(h) = jk. Then, the contribution of the Ir-terms becomes

S’(t2) = j2,

(17) k=O REV jl....,jkeQ’ where

n’(f,

A,

j2, . . . , jk) is the expectation

n’(f, jl, j2, . . . , jk>

=

c

value of N’(t; f, jl , j2, . . . , jk) according to the distribution

p(tlA)N’(t:

f, jl,

p(tlA):

.iz,. . . , jk>

tevT =c

C,p’(q-,f,jl,j2,...,jk)

(18)

with C, the expected number of nodes of type q in a tree and p’(q + f, jl,j2, . . . , jk) the probability that a node of type q E Q expands as any subtree f (ti , t2, . . . , tk) such that S’(t,) = i,,, for m = 1, . . . , k. A consideration parallel to Eq. (12) leads immediately toP1(4~f,jl,j2,...,jk)=

    p'(q \to f, j_1, j_2, \ldots, j_k) = \sum_{i_1, i_2, \ldots, i_k \in Q:\ \delta_k(f, i_1, i_2, \ldots, i_k) = q} p_k(f, i_1, i_2, \ldots, i_k) \, \eta_{i_1 j_1} \eta_{i_2 j_2} \cdots \eta_{i_k j_k}    (19)

and, then, Eq. (17) can be rewritten as

    G_\pi(A, A') = - \sum_{k=0}^{n} \sum_{f \in V} \sum_{i_1, i_2, \ldots, i_k \in Q} \; \sum_{j_1, j_2, \ldots, j_k \in Q'} C_{\delta_k(f, i_1, i_2, \ldots, i_k)} \, p_k(f, i_1, i_2, \ldots, i_k) \, \eta_{i_1 j_1} \cdots \eta_{i_k j_k} \log_2 p'_k(f, j_1, j_2, \ldots, j_k).    (20)


The vector C of expectation values C_i can be easily computed using the matrix A defined in Eq. (6) together with the vector r of probabilities r(i). As shown in [17],

    C = \Bigl( \sum_{m=0}^{\infty} A^m \Bigr) r

and, then, C = r + AC. This relationship allows a fast iterative computation:

    C_i^{[m+1]} = r(i) + \sum_j A_{ij} C_j^{[m]},    (21)

with C/O’ = 0. As in the case of Eq. (13), it is straightforward to show that the iterative procedure converges monotonically to the correct value. Finally, it is worth to comment on the difference between stochastic automata that process strings and trees. While a finite length string ata2 . . . al can be regarded as a unary tree a’ (a~(. . .))), it is the case that DFA (deterministic finite automata, [9]) process strings from left to right whereas a DTA would process the tree from right to left in this functional notation. Therefore, only reversible DFAs [2] are immediately equivalent to a DTA and thus, the procedure described above for evaluating the relative entropy can only be applied to stochastic DFA with a reversible structure. A procedure that can be applied to any (reversible or not) pair h4 and M’ of stochastic DFA is described in [4] :

Finally, it is worth commenting on the difference between stochastic automata that process strings and trees. While a finite-length string a_1 a_2 ... a_l can be regarded as a unary tree a_1(a_2(...)), it is the case that DFA (deterministic finite automata, [9]) process strings from left to right whereas a DTA would process the tree from right to left in this functional notation. Therefore, only reversible DFAs [2] are immediately equivalent to a DTA and, thus, the procedure described above for evaluating the relative entropy can only be applied to stochastic DFAs with a reversible structure. A procedure that can be applied to any (reversible or not) pair M and M' of stochastic DFAs is described in [4]; there, p(q, a) represents the transition probability from state q with symbol a and p(q, λ) the end-of-string probability at state q. The coefficients C_{ij} are evaluated through an iterative relation:

    C_{ij}^{[m+1]} = \delta_{iI} \delta_{jI'} + \sum_{a \in V} \sum_{k = 1, \ldots, |Q|:\ \delta(k, a) = i} \; \sum_{l = 1, \ldots, |Q'|:\ \delta'(l, a) = j} p_M(q_k, a) \, C_{kl}^{[m]},    (23)

where I and I' are the initial states of M and M', respectively, starting with C_{ij}^{[0]} = 0. This iterative procedure also converges monotonically. A more detailed description can be found in [4].
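For completeness, the iteration of Eq. (23) can be sketched in the same style; the dictionary encoding of the two stochastic DFAs (deltaM, deltaM2, pM) is our own assumption, not the notation of [4].

```python
def compute_C_strings(deltaM, deltaM2, pM, QM, QM2, V, I, I2,
                      sweeps=1000, tol=1e-12):
    """Fixed-point iteration of Eq. (23) for stochastic DFAs M and M'.
    deltaM[(q, a)] and deltaM2[(q', a)] give the next state, pM[(q, a)] the
    transition probability in M, and I, I2 are the initial states.
    Returns the coefficients C[(i, j)] of Eq. (23)."""
    C = {(i, j): 0.0 for i in QM for j in QM2}
    for _ in range(sweeps):
        new = {(i, j): (1.0 if (i == I and j == I2) else 0.0)
               for i in QM for j in QM2}
        for a in V:
            for k in QM:
                i = deltaM.get((k, a))
                if i is None:
                    continue
                for l in QM2:
                    j = deltaM2.get((l, a))
                    if j is None:
                        continue
                    new[(i, j)] += pM[(k, a)] * C[(k, l)]
        if max(abs(new[x] - C[x]) for x in C) < tol:
            return new
        C = new
    return C
```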

6. Conclusion

A procedure has been described which allows for the evaluation of the relative entropy between stochastic deterministic tree grammars. Previous related work includes evaluation methods for Markov models [11,18] and stochastic regular languages [4]. Our procedure overcomes the need to generate huge test sets which, in the case of trees, grow exponentially fast with the explored depth (indeed, exponentially faster than sets of strings, which can be regarded as the special case of unary trees). The final result is made of two different contributions, G(A, A') = G_r(A, A') + G_\pi(A, A'), the first one due to the root probabilities and the second one to the expansion probabilities. Both expressions can be readily computed using iterative procedures.

Acknowledgement

We thank M. Forcada for his comments.


References

[1] A.V. Aho, J.D. Ullman, The Theory of Parsing, Translation and Compiling, Vol. I: Parsing, Prentice-Hall, Englewood Cliffs, NJ, 1972.
[2] D. Angluin, Inference of reversible languages, J. ACM 29 (3) (1982) 741-765.
[3] P.F. Brown, V.J. Della Pietra, P.V. de Souza, J.C. Lai, R.L. Mercer, Class-based n-gram models of natural language, Comput. Linguistics 18 (1992) 467-479.
[4] R.C. Carrasco, Accurate computation of the relative entropy between stochastic regular grammars, Theoret. Inform. Appl. 31 (1997) 437-444.
[5] R.C. Carrasco, J. Oncina, Learning deterministic regular grammars from stochastic samples in polynomial time, Theoret. Inform. Appl. (1996).
[6] T.M. Cover, J.A. Thomas, Elements of Information Theory, John Wiley and Sons, New York, 1991.
[7] K.S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[8] F. Gécseg, M. Steinby, Tree Automata, Akadémiai Kiadó, Budapest, 1984.
[9] J. Hopcroft, J. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, Reading, MA, 1980.
[10] G.M. Hunter, Operations on images using quad trees, IEEE Trans. Pattern Analysis Machine Intell. 1 (1979) 145-153.
[11] G. Kesidis, J. Walrand, Relative entropy between Markov transition rate matrices, IEEE Trans. Inform. Theory 39 (1993) 1056-1057.
[12] T. Knuutila, M. Steinby, The inference of tree languages from finite samples: an algebraic approach, Theoret. Comput. Sci. 129 (1994) 337-367.
[13] M. Nivat, A. Podelski (Eds.), Tree Automata and Languages, Elsevier, Amsterdam, 1992, pp. 235-290.
[14] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (1989) 257-286.
[15] Y. Sakakibara, Learning context-free grammars from structural data in polynomial time, Theoret. Comput. Sci. 76 (1990) 223-242.
[16] Y. Sakakibara, M. Brown, R.C. Underwood, I.S. Mian, D. Haussler, Stochastic context-free grammars for modeling RNA, in: L. Hunter (Ed.), Proc. 27th Annual Hawaii International Conference on System Sciences, Vol. 5, IEEE Computer Society Press, Los Alamitos, CA, 1994, pp. 284-294.
[17] C.S. Wetherell, Probabilistic languages: A review and some open questions, ACM Comput. Surveys 12 (4) (1980) 361-379.
[18] J. Ziv, N. Merhav, A measure of relative entropy between individual sequences with application to universal classification, IEEE Trans. Inform. Theory 39 (1993) 1270-1279.