INFORMATION SCIENCES 56, 23-33 (1991)
An Almost Machine-Independent Theory of Program-Length Complexity, Sophistication, and Induction
Communicated by Azriel Rosenfeld
ABSTRACT. The purpose of this paper is to use a variant of program-length complexity to formally define the structure of a binary string, where the structure of an object is taken to mean the aggregate of its projectible properties.
1. INTRODUCTION
©Elsevier Science Publishing Co., Inc., 1991
655 Avenue of the Americas, New York, NY 10010
0020-0255/91/$03.50

The structure of an object does not necessarily constitute a complete description of the object. For example, a random binary string has no structure. A long, finite binary string, R_P, in which each 2nth bit is the same as the (2n-1)th bit but which is otherwise random, is partially characterized by its structure, namely the doubling of the bits. The string A = 01010101... is completely characterized by its structure, namely alternation. The structure of a finite binary string can be used to predict the likely continuation of the string. That is, it is reasonable to predict that the continuation of the string will preserve its structure. Observe that since a string's structure only partially characterizes it, there may be a large class of continuations that preserve the string's structure. The minimal description of a binary string S is the shortest input to a universal Turing machine (UTM) that would result in output S. We want to sort out from this minimal description the part that represents the structure of
MOSHE KOPPEL AND HENRI ATLAN
S. We do this by dividing the input to a universal Turing machine into two parts: program and data. (This division is inherent in the original definition of the UTM.) The part of the minimal description of S that can't be forced into the data constitutes the structure of S. Returning to our examples, the minimal description of a random string, R, would consist of a program that simply prints the data R. The minimal description of R_P would consist of a program that prints twice each of the bits of the data, namely, the odd bits of R_P. The minimal description of A is a program describing alternation, with no data. The division into program and data not only allows us to distinguish structure from description; it also allows us to distinguish sophistication from complexity. The complexity of a string is the length of its minimal description. To understand what we mean by sophistication, imagine that signals are being received from some unknown source in outer space. If these signals obeyed some simple rule, say alternation, we would not attribute intelligence to the source. If the signals were more complex, say the characteristic string of the primes, we might suspect the existence of an intelligent source. Nevertheless, if the signals were maximally complex, that is, random, we would certainly not attribute intelligence to the source. By sophistication we mean the quality of a string that leads us to attribute intelligence to its generator. Clearly, sophistication is orthogonal to complexity. Several attempts have been made to define sophistication [1, 3, 4]. We propose that the sophistication of a string is simply the length of the program part of the minimal description. In this paper we demonstrate the naturalness and utility of our definitions of structure and sophistication. Our main result is that the structure of a string is largely independent of the choice of the universal Turing machine in terms of which it is defined.
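The three examples can be sketched concretely. The following Python functions are our own illustrative stand-ins for the programs in the minimal descriptions, not the paper's formal machinery:

```python
# Three "programs" in the paper's sense: each maps a data string to an output.
# A random string R needs all of its bits in the data; R_P needs only half;
# A needs none at all.

def print_program(data: str) -> str:
    """Minimal description of a random string R: just print the data."""
    return data

def doubling_program(data: str) -> str:
    """Minimal description of R_P: print each data bit twice."""
    return "".join(bit + bit for bit in data)

def alternation_program(data: str, n: int = 8) -> str:
    """Minimal description of A: alternation needs no data at all."""
    return "01" * n

print(doubling_program("101"))   # -> 110011
print(alternation_program(""))   # -> 0101010101010101
```

The shorter the data a program needs, the more of the string's description has been absorbed into structure.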
In the second section of this paper we formally define UTM, program-length complexity, and sophistication of finite strings. In the third section we define the structure of infinite strings and in the final section we prove the main theorem. Earlier versions of the results of this paper have been reported in a previous survey paper on work in progress [7].
2. MINIMAL DESCRIPTIONS OF FINITE STRINGS
In this section we define minimal description and minimal program for finite strings. For related approaches to program-length complexity see [2, 4-6, 13-15].
For two strings S and S′ let S ⊑ S′ mean that S is an initial segment of S′. A function f: {0,1}* → {0,1}* ∪ {0,1}^∞ is called a process if S ⊑ S′ implies that f(S) ⊑ f(S′). Let U be a Turing machine on the alphabet {0,1} consisting of two left-to-right input tapes (called the program tape and data tape, respectively), a left-to-right nonerasable output tape, and a two-way work tape. A computation in U halts if and only if a blank is encountered on the data tape. We say that U(P, D) = S if either S is finite and the computation on inputs P, D halts with S on the output tape, or S is infinite and the computation on P, D continues printing bits of S forever. U is a universal Turing machine if for every partially computable process f, there exists some P such that for all D, U(P, D) = f(D).
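The prefix-monotonicity that makes f a process can be checked mechanically on finite samples. This sketch (all names ours) illustrates the definition:

```python
# A "process" is a prefix-monotone map: extending the input can only
# extend the output. This checker is our own illustration.

def is_prefix(s: str, t: str) -> bool:
    return t.startswith(s)

def check_process(f, inputs) -> bool:
    """Verify f(S) is a prefix of f(S') whenever S is a prefix of S'."""
    for s in inputs:
        for t in inputs:
            if is_prefix(s, t) and not is_prefix(f(s), f(t)):
                return False
    return True

double = lambda d: "".join(b + b for b in d)   # prefix-monotone
reverse = lambda d: d[::-1]                    # not prefix-monotone

strings = ["", "1", "10", "101"]
print(check_process(double, strings))    # True: doubling is a process
print(check_process(reverse, strings))   # False: reversal is not
```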
For the rest of the paper U will refer to some universal Turing machine. A program P is total if U(P, D) is infinite for all infinite D. Note that for finite D, U(P, D) can be finite or infinite. A program P is self-delimiting if for any D, during the course of the computation of U(P, D), the first bit of output is printed while the program scanner reads the last bit of P.

DEFINITION. (P, D) is a description of a finite or infinite binary string S if P is a total, self-delimiting program and U(P, D) ⊒ S.
DEFINITION. A description (P, D) of a finite string S is c-minimal if for any description (P′, D′) of S, we have |P| + |D| ≤ |P′| + |D′| + c. (We refer to a 0-minimal description as a minimal description.)

DEFINITION. The complexity of S, H(S), is the length of a minimal description of S. (When we wish to emphasize that the complexity is relative to the UTM U we call it H_U(S).)
DEFINITION. A program P is a c-minimal program for S if (1) for some D, (P, D) is a c-minimal description of S and (2) for any c-minimal description (P′, D′) of S, we have |P| ≤ |P′|.

DEFINITION. The c-sophistication of S, SOPH_c(S), is the length of a c-minimal program for S.
Thus the c-minimal program for S is that part of the description of S that exploits patterns in S in order to allow the compression of S to a noncompressible data string; it represents the structure of S. By matching this program with all possible data we can find the class of strings that have the same structure as S. By matching it with data that extend the data used in generating S we can find the class of likely continuations of S. This will be discussed in Section 4.
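The idea of matching a program with extended data to enumerate likely continuations can be illustrated on a finite example (the helper names are ours):

```python
# Matching a fixed program against all extensions of its data yields the
# class of continuations that preserve the string's structure.

def doubling_program(data: str) -> str:
    return "".join(bit + bit for bit in data)

def continuations(program, data: str, k: int):
    """All outputs obtained by extending the data by k more bits."""
    outs = set()
    for i in range(2 ** k):
        ext = format(i, f"0{k}b") if k else ""
        outs.add(program(data + ext))
    return sorted(outs)

# S = 1100 is generated by the doubling program from data 10.
# Every structure-preserving continuation keeps the doubling pattern:
print(continuations(doubling_program, "10", 1))   # ['110000', '110011']
```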
c is the number of bits by which a minimal description (P, D) need be shorter than any other description with a shorter program in order that P be called a c-minimal program for S. c can be regarded as the amount of confirmation of the description (P, D) that we require before regarding extra structure represented by the longer program P as inherent in S and not merely accidental.

2.1. RELATED COMPLEXITY MEASURES
Our definition of complexity is similar to those of Levin [8], Schnorr [11], and Solomonoff [14], in which the restriction to unidirectional Turing machines is made. The program-data distinction is used implicitly in a definition of structure given by Cover and Kolmogoroff [4], of which our definition can be regarded as a generalization. We wish to briefly compare our definition of complexity with variants of the classical definitions of Kolmogoroff [6, 14] on the one hand and Chaitin [2] and Levin and Gacs [5] on the other, which do not distinguish between program and data. Let K(S) be the length of the shortest program P such that S is an initial segment of U(P, 0). Let CL(S) be the length of the shortest self-delimiting program P such that S is an initial segment of U(P, 0). In both definitions the data are fixed as 0. Because the definition of K(S) makes no restriction on P, the role of the data is taken over by the program at no cost. In the case of CL(S), however, forcing the program to also play the role of data imposes the self-delimiting requirement on the whole description of S, not merely on the program. Let SL(S) be the smallest |P| + |D| such that P is self-delimiting and U(P, D) ⊒ S. This version of the complexity measures found in [8, 11, 14] simply eliminates the requirement of totality on P. Then we have the following theorem.

THEOREM 2.1. There exists c such that for all S, K(S) ≤ SL(S) + c and H(S) ≤ CL(S) + c.
Proof. Let (P, D) be the pair such that |P| + |D| is minimal over all pairs such that P is self-delimiting and U(P, D) ⊒ S. Then there exists F such that U(F·P·D, 0) = U(P, D). Simply let F be a program that searches the succeeding string (in this case P·D) until that unique initial segment is found which constitutes a self-delimiting program (in this case P) and then runs that program on the rest of the string (in this case D). But then, setting c′ = |F|, we have for all S, K(S) ≤ |F·P·D| = |P| + |D| + |F| = SL(S) + c′. Now let P be a self-delimiting program such that S is an initial segment of U(P, 0) and |P| = CL(S). If P is not total, let P′ be a self-delimiting program that is the same as P except that any data string D is treated as if D = 0 (that is, the data string is ignored, so that P′ will be total). Then P′ is a total, self-delimiting program and S is an initial segment of U(P′, 0). Thus for some c″, H(S) ≤ |P′| + 1 ≤ |P| + c″. Finally let c = max(c′, c″).

THEOREM 2.2. There exists c such that for all S, H(S) ≤ |S| + c.

Proof. Define the program PRINT such that PRINT is self-delimiting and for all D, U(PRINT, D) = D. Let |PRINT| = c. Then since U(PRINT, S) = S, H(S) ≤ |PRINT| + |S| = |S| + c.

Observe that this theorem does not hold for CL(S).

3. COMPRESSION DESCRIPTIONS OF INFINITE STRINGS
A variant of the notion of minimal description is applicable to infinite strings. Let α^n be the n-length initial segment of α. If U(P, D) = α, let D_n be the shortest initial segment of D such that U(P, D_n) ⊒ α^n. We call (P, D_n) a prefix-free description of α^n. Let α be an infinite string.

DEFINITION. A description (P, D) of α is called a compression description of α if there exists c such that for all n, |P| + |D_n| ≤ H(α^n) + c. P is called a compression program for α if for some D, (P, D) is a compression description of α. If α has a compression program we call α describable.

Not every string is describable. For example, if for all c, lim_{n→∞} SOPH_c(α^n) = ∞, then α is not describable. Such α, which we call transcendent, have infinite sophistication (see [12]). The compression program reflects the structure of the string. The range of a compression program for α constitutes the class of strings that share its structure. Observe, however, that any describable string has an infinite set of compression programs. Nevertheless, we will show that in some limiting sense these compression programs all reflect the same structure. We begin by showing that the class of compression programs for α is invariant over choice of universal Turing machine.
THEOREM 3.1. Let U and U′ be universal Turing machines. If (P, D) is a compression description of α in U, then there exists P′ such that for all X, U′(P′, X) = U(P, X) and (P′, D) is a compression description of α in U′.

Proof. Since U′ is universal, there exists a program P′ such that for all X, U′(P′, X) = U(P, X). Suppose that (P′, D) is not a compression description of α. Then there exists a sequence of descriptions {(P*(n), D*(n))} such that U′(P*(n), D*(n)) ⊒ α^n and lim_{n→∞} {(|P′| + |D_n|) - (|P*(n)| + |D*(n)|)} = ∞. But by the universality of U, there exist programs {P**(n)} such that for all n and all D, U(P**(n), D) = U′(P*(n), D), and there exists c such that for all n, |P**(n)| ≤ |P*(n)| + c. But then U(P**(n), D*(n)) ⊒ α^n and lim_{n→∞} {(|P| + |D_n|) - (|P**(n)| + |D*(n)|)} = ∞, contradicting that (P, D) is a compression description of α. Therefore (P′, D) is a compression description of α.
3.1. RANDOM STRINGS

Recall that PRINT is a program such that for all D, U(PRINT, D) = D.

DEFINITION. An infinite string α is random with respect to U if PRINT is a compression program for α in U.

Thus a string is random if it can be described no more concisely than by enumerating it. All infinite binary strings (the range of PRINT) share at least the (nil) structure of a random string.

THEOREM 3.1.1. If U and U′ are universal Turing machines and α is random with respect to U, then α is random with respect to U′.

Proof. This follows immediately from Theorem 3.1.

Thus, we can speak of strings being random without the qualifier "with respect to U."

THEOREM 3.1.2. α is random if and only if there exists c such that for all n, H(α^n) ≥ n - c.

This property is usually used as the definition of randomness [8-11]. Different complexity measures give rise to different classes of random strings.

THEOREM 3.1.3. If α is SL-random then α is random.

Proof. Suppose that α is not random. Then lim_{n→∞} {n - H(α^n)} = ∞. But since for all n, H(α^n) ≥ SL(α^n), it follows that lim_{n→∞} {n - SL(α^n)} ≥ lim_{n→∞} {n - H(α^n)} = ∞. Therefore α is not SL-random.
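As a loose empirical analogue of Theorem 3.1.2, one can compare the compressed lengths of structured and unstructured strings. Here zlib is only a stand-in for the complexity H, not the paper's measure:

```python
import random
import zlib

# Empirical analogue only: zlib's compressed length stands in (very loosely)
# for complexity; the more structure a string has, the shorter its
# compressed form tends to be.
random.seed(0)

n = 4096
rand_bits = "".join(random.choice("01") for _ in range(n))   # "random" string
doubled = "".join(b + b for b in rand_bits[: n // 2])        # R_P-like string
alt = "01" * (n // 2)                                        # the string A

for name, s in [("random", rand_bits), ("doubled", doubled), ("alternating", alt)]:
    print(name, len(zlib.compress(s.encode())))
# The alternating string compresses drastically; the random string the least.
```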
COROLLARY 3.1.3. The set of random strings is of measure 1.

Proof. This follows from Theorem 3.1.3 and the fact that the set of SL-random strings is of measure 1 [11].

THEOREM 3.1.4. If (P, D) is a compression description of α then D is random.

Proof. If D is not random, then there is a sequence of descriptions {(P*(n), D*(n))} such that U(P*(n), D*(n)) ⊒ D^n and lim_{n→∞} {n - (|P*(n)| + |D*(n)|)} = ∞. Now, there exists c such that for each n there is a program P(P*(n)) such that for all X, U(P(P*(n)), X) = P(U(P*(n), X)) and |P(P*(n))| ≤ |P*(n)| + c. But then for all n, (P(P*(n)), D*(n)) is a description of α^n and lim_{n→∞} {(|P| + |D_n|) - (|P(P*(n))| + |D*(n)|)} = ∞, contradicting that (P, D) is a compression description of α.

3.2. MULTIPLICITY OF COMPRESSION DESCRIPTIONS: AN EXAMPLE
Recall that a given describable string has an infinite number of compression programs. Consider, for example, the string R_P, consisting of an infinite random sequence of pairs, for example, 11 11 00 00 11 ... . If DB is a program such that for any binary string D = a_1 a_2 a_3 ..., U(DB, D) = a_1 a_1 a_2 a_2 a_3 a_3 ..., then DB is a compression program for R_P. This reflects the fact that the structure of R_P lies only in the doubling. But consider the program DBA such that

U(DBA, D) = U(DB, D) if D^1 is 1; D otherwise.

Then if R_P^1 = 1, DBA is also a compression program for R_P. If α = a_1 a_2 ..., let α^{m,n} = a_{m+1} a_{m+2} ... a_n. Now consider the program DBE such that

U(DBE, D) = U(DB, D) if for all i, at least one 1 appears in D^{i,2i};
U(DBE, D) = U(DB, D^{2n})·D if n is the smallest i such that no 1 appears in D^{i,2i}.

That is, DBE is the same as DB unless, starting from the nth bit, n consecutive 0's appear, in which case DB is aborted and the data string is simply printed. If R_P is such that there is no such n, then DBE is also a compression program for R_P. Observe that the particular condition under which DBE aborts was arbitrarily chosen, and the same point could be made for any abort condition, provided that that condition is not necessarily true of every random string.
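A finite-string sketch of DB and DBE (our own simplification of the paper's definitions; DBE's abort condition checks the windows D^{i,2i} on a finite prefix):

```python
# Finite sketches of the programs DB and DBE. The paper's DBE scans the
# windows D^{i,2i}; here we check those windows on a finite data string.

def DB(data: str) -> str:
    """Double every data bit."""
    return "".join(b + b for b in data)

def DBE(data: str) -> str:
    """Like DB, unless some window data[i:2i] is all zeros; then abort
    to DB on the first 2i bits followed by the raw data."""
    for i in range(1, len(data) // 2 + 1):
        window = data[i:2 * i]
        if window and "1" not in window:
            return DB(data[: 2 * i]) + data   # abort case: U(DB, D^{2i})·D
    return DB(data)

d = "110101"
print(DB(d) == DBE(d))   # True: no all-zero window, so the programs agree
```

On data with no long run of zeros, DBE is indistinguishable from DB, which is why both compress R_P equally well.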
We will exploit this limitation on possible abort conditions to demonstrate the sense in which all these compression programs reflect the same structure.

4. INVARIANCE OF STRUCTURE

4.1. PREDICTION
For a given description (P, D) of a finite string S the possible continuations of S are precisely those that are generated by the program P and all continuations of the data D. Let U_k(P, D) = {U(P, D·{0,1}^k)}. That is, U_k(P, D) is the set of strings generated by the program P and all continuations of D of length |D| + k. A pair of (finite or infinite) binary strings, X and Y, are called consistent if one is a prefix of the other. Two sets of binary strings S and T are called consistent (and we write S ≈ T) if for each string in one there is a string in the other that is consistent with it. Two programs for a string reflect the same structure to the extent that they yield consistent sets of possible continuations. Let A_S(P, P′) = max{k | ∃D, D′ such that (P, D) and (P′, D′) are prefix-free descriptions of S and U_k(P, D) ≈ U_k(P′, D′)}. The larger A_S(P, P′), the greater the agreement between P and P′ (for some D and D′, respectively). A program P is injective if U(P, D) ⊒ U(P, D′) implies D ⊒ D′. The following theorem expresses the sense in which all injective compression programs for a given infinite string converge.

THEOREM 4.1.1. If P and P′ are injective compression programs for α, then lim_{n→∞} A_{α^n}(P, P′) = ∞.

Proof. Let P and P′ be injective programs and let (P, D) and (P′, D′) be compression descriptions of α. Suppose that, contrary to the theorem, lim_{n→∞} A_{α^n}(P, P′) = k - 1. Then there exists an infinite sequence {n_1, n_2, ...} such that for all i, A_{α^{n_i}}(P, P′) = k - 1. Thus, for each i there is a string T_i with |T_i| = k such that the string U(P, D_{n_i}·T_i) is not consistent with any member of U_k(P′, D′_{n_i}). But then, since one member of U_k(P′, D′_{n_i}) is an initial segment of α, it follows that for each i, D_{n_i}·T_i is not an initial segment of D. Thus, there must be some smallest j < k such that for all sufficiently large i, if
D_{n_i}·T_i^j ⊑ D, then D_{n_i}·T_i^{j+1} ⋢ D. Let {m_1, m_2, ...} be the infinite subsequence such that for each i, D_{m_i}·T_i^j ⊑ D but D_{m_i}·T_i^{j+1} ⋢ D. We now construct a program P* such that (P*, D*) is a description of α, where D* is the same as D except that for each i, the (m_i + j + 1)th bit of D is omitted. Since {m_1, m_2, ...} is an infinite sequence, this contradicts that (P, D) is a compression description of α.

P* is a program that, given some infinite data string X, operates just like P except that, before doing so, it transforms each initial segment X^r of the data string X into some related string X̄^r. (Because X̄^{r+1} ⊒ X̄^r, we can speak of X̄ for the infinite string X.) In particular, D* is transformed to D and therefore U(P*, D*) = U(P, D) = α.

The transformation can be described inductively. For all r ≤ m_1 + j, X̄^r = X^r. Now suppose that for some r ≥ m_1 + j, X̄^r is given. X̄^{r+1} can be computed as follows. Compute the set U_{k-j}(P, X̄^r). Then (if it exists) find the shortest string X′^{r-j} such that U(P′, X′^{r-j}) ⊒ U(P, X̄^{r-j}) and compute the set U_k(P′, X′^{r-j}). If there is some string T of length k - j such that U(P, X̄^r·T) is inconsistent with every string in U_k(P′, X′^{r-j}), then define X̄^{r+1} = X̄^r·T^1·X(r+1) (where T^1 is the first bit of T and X(r+1) is the (r+1)th bit of X). Otherwise X̄^{r+1} = X̄^r·X(r+1).

Now if X = D*, then the condition for inserting T^1 will hold exactly when r - j ∈ {m_1, m_2, ...}, so that the insertion would replace the (m_i + j + 1)th bit of D, which had been eliminated in the construction of D*. Therefore U(P*, D*) = U(P, D) = α, completing the proof.
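The sets U_k(P, D) and the consistency relation behind the agreement measure A_S(P, P′) can be computed directly in a finite toy model (function names ours):

```python
# Finite toy model of U_k(P, D) and the set-consistency relation used to
# define the agreement A_S(P, P') in Section 4.1.
from itertools import product

def U_k(program, data: str, k: int):
    """Outputs of the program over all k-bit extensions of the data."""
    return {program(data + "".join(bits)) for bits in product("01", repeat=k)}

def consistent(x: str, y: str) -> bool:
    """Two strings are consistent if one is a prefix of the other."""
    return x.startswith(y) or y.startswith(x)

def sets_consistent(S, T) -> bool:
    """Each string in either set has a consistent partner in the other."""
    return all(any(consistent(s, t) for t in T) for s in S) and \
           all(any(consistent(s, t) for s in S) for t in T)

double = lambda d: "".join(b * 2 for b in d)   # prints each data bit twice
quad = lambda d: "".join(b * 4 for b in d)     # prints each data bit four times

# Both programs can describe the string 11110000...; their one-step
# continuation sets are consistent, so the descriptions agree to depth 1.
print(sets_consistent(U_k(double, "1100", 1), U_k(quad, "10", 1)))   # True
```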
4.2. PROBABILITY

For a (finite or infinite) binary string S and a program P, D is called a generator of S in P if (P, D) is a prefix-free description of S. The probability of S relative to P is PROB_P(S) = Σ_i (1/2)^{|D_i|}, where {D_i} is the set of generators of S. That is, the probability of S relative to P is simply the probability that a sequence of coin tosses entered to P as data would result in S (or an extension of S) as output. Now let S and T both be finite binary strings. The probability (relative to P) of T following S is PROB_P(T|S) = PROB_P(ST)/PROB_P(S).
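PROB_P(S) can be approximated for a toy program by enumerating short data strings as generators (an illustration under our own finite cutoff, not the paper's exact definition):

```python
# Toy computation of PROB_P(S) and PROB_P(T|S): sum (1/2)^|D| over minimal
# data strings D whose output extends S (short data strings only).

def doubling(data: str) -> str:
    return "".join(b + b for b in data)

def prob(program, S: str, max_len: int = 12) -> float:
    """Approximate PROB_P(S) by enumerating data strings up to max_len bits."""
    total, seen = 0.0, set()
    for n in range(max_len + 1):
        for i in range(2 ** n):
            D = format(i, f"0{n}b") if n else ""
            if any(D.startswith(p) for p in seen):
                continue                 # keep only minimal (prefix-free) generators
            if program(D).startswith(S):
                total += 0.5 ** n
                seen.add(D)
    return total

S = "1100"
print(prob(doubling, S))                              # 0.25: only data starting 10
print(prob(doubling, S + "11") / prob(doubling, S))   # 0.5: prob of "11" after S
```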
The following theorem shows that in the limit all injective compression programs yield the same predictions.
MAIN THEOREM. If P and P′ are both injective compression programs for the infinite string α, and T is some finite string, then for all sufficiently large i, PROB_P(T|α^i) = PROB_{P′}(T|α^i).

Proof. Since P is injective, for each α^i, there exists a unique D_i such that U(P, D_i·0) ⊒ α^i and U(P, D_i·1) ⊒ α^i but for any proper initial segment S ⊏ D_i, U(P, S) ⊏ α^i. Regardless of whether U(P, D_i) ⊒ α^i, PROB_P(α^i) = (1/2)^{|D_i|}. Likewise, since P′ is injective, there exists a unique such D′_i. Since P and P′ are injective, the generators of α^i·T relative to P and P′ are no longer than |D_i| + |T| and |D′_i| + |T|, respectively. Now, let GEN(P, α^i, y) be the number of x ∈ {0,1}^{|T|} such that U(P, D_i·x) ⊒ α^i·y, and let GEN(P′, α^i, y) be the number of x ∈ {0,1}^{|T|} such that U(P′, D′_i·x) ⊒ α^i·y. It is sufficient to prove that for all sufficiently large i, GEN(P′, α^i, T) = GEN(P, α^i, T), since then PROB_P(T|α^i) = GEN(P, α^i, T)/2^{|T|} = GEN(P′, α^i, T)/2^{|T|} = PROB_{P′}(T|α^i).

To prove this claim, first note that, by the previous theorem, there exists n such that for all i ≥ n, A_{α^i}(P, P′) ≥ |T|. Thus for all i ≥ n, U_{|T|}(P, D_i) ≈ U_{|T|}(P′, D′_i). We will show that for all i ≥ n and any string y of length ≤ |T|, GEN(P, α^i, y) = GEN(P′, α^i, y). We prove this by induction on the length of y. If |y| = 0, then the claim is immediate. Suppose that the claim is true for all strings of length m, and let y be some string of length m. Let GEN(P, α^i, y) = k. Then GEN(P, α^i, y·1) = 0 or k/2 or k, and GEN(P, α^i, y·0) = k or k/2 or 0, respectively. Now suppose that GEN(P, α^i, y·1) ≠ GEN(P′, α^i, y·1) (equivalently, GEN(P, α^i, y·0) ≠ GEN(P′, α^i, y·0)). Then one of the four quantities GEN(P, α^i, y·1), GEN(P′, α^i, y·1), GEN(P, α^i, y·0), GEN(P′, α^i, y·0) is equal to 0. Without loss of generality, assume that GEN(P, α^i, y·1) = 0. Then GEN(P′, α^i, y·1) ≠ 0. But then the string α^i·y·1, which appears in {U(P′, D′_i·x) | x ∈ {0,1}^{|T|}}, is inconsistent with every string in {U(P, D_i·x) | x ∈ {0,1}^{|T|}}, violating that U_{|T|}(P, D_i) ≈ U_{|T|}(P′, D′_i). Thus the supposition that GEN(P, α^i, y·1) ≠ GEN(P′, α^i, y·1) is false and the claim is proved.
This theorem should be compared with the many convergence results reported in [14]. While those results obtain for more general probability assignments, they do not imply equality but rather convergence to within a multiplicative constant.

REFERENCES

1. C. Bennett, Logical depth and physical complexity, in The Universal Turing Machine: A Half-Century Survey (R. Herken, Ed.), Oxford University Press, 1988, pp. 227-258.
2. G. J. Chaitin, A theory of program size formally identical to information theory, J. ACM 22:329-340 (1975).
3. G. J. Chaitin, Toward a mathematical definition of "life," in The Maximum Entropy Formalism (R. Levine and M. Tribus, Eds.), MIT Press, Cambridge, MA, 1979, pp. 477-498.
4. T. Cover, Kolmogorov complexity, data compression, and inference, in The Impact of Processing Techniques on Communications (J. Skwirzynski, Ed.), Martinus Nijhoff, Amsterdam, 1985.
5. P. Gacs, On the symmetry of algorithmic information, Soviet Math. Dokl. 15:1474-1480 (1974).
6. A. N. Kolmogoroff, Three approaches to the quantitative definition of information, Problems of Information Transmission 1:1-7 (1965).
7. M. Koppel, Structure, in The Universal Turing Machine: A Half-Century Survey (R. Herken, Ed.), Oxford University Press, 1988, pp. 435-457.
8. L. A. Levin, On the notion of a random sequence, Soviet Math. Dokl. 14(5):1413-1416 (1973).
9. L. A. Levin, Randomness conservation inequalities: Information and independence in mathematical theories, Information and Control 61(1):15-37 (1984).
10. P. Martin-Löf, Complexity of oscillations in infinite binary sequences, Z. Wahrscheinlichkeitstheorie verw. Geb. 19:225-230 (1971).
11. C. P. Schnorr, Process complexity and effective random tests, J. Comput. Syst. Sci. 7:376-388 (1973).
12. C. P. Schnorr and P. Fuchs, General random sequences and learnable sequences, J. Symb. Logic 42:329-340 (1977).
13. R. J. Solomonoff, A formal theory of inductive inference, Information and Control 7:1-22 (1964).
14. R. J. Solomonoff, Complexity-based induction systems, IEEE Trans. Inf. Theory 24:422-432 (1978).
15. A. K. Zvonkin and L. A. Levin, The complexity of finite objects, Russian Math. Surveys 25:83-124 (1970).