JOURNAL OF COMPLEXITY 18, 1024–1036 (2002) doi:10.1006/jcom.2002.0649
Complete on Average Boolean Satisfiability

Jie Wang^1

Department of Computer Science, University of Massachusetts, One University Avenue, Lowell, Massachusetts 01854
E-mail: wang@cs.uml.edu; URL: http://www.cs.uml.edu/~wang
Received September 5, 2001; revised April 18, 2002; accepted April 24, 2002; published online July 2, 2002
We present in this paper a dynamic binary coding scheme $\alpha$ on CNF formulas $\psi$, and show that under a uniform distribution $\mu_\alpha$ on binary strings $\alpha(\psi)$, SAT is complete on average, where $\mu_\alpha(\psi)$ is proportional to $|\alpha(\psi)|^{-2} 2^{-|\alpha(\psi)|}$. We then show that there is $k_0 > 2$ such that for all $k \ge k_0$, $k$SAT under $\mu_\alpha$ is complete on average. © 2002 Elsevier Science (USA)
Key Words: average NP-completeness; distributional tiling; distributional Boolean satisfiability; randomized reductions.
1. INTRODUCTION

Finding a reasonable distribution on CNF formulas for which SAT is complete on average is a major open problem [CM97]. This problem was motivated by the desire to understand how difficult it is to determine whether a given random CNF formula is satisfiable. Most of the early approaches were algorithmic in nature, designing specific algorithms to evaluate CNF formulas and analyzing their average performance under various distributions of CNF formulas. In particular, many of the previous papers were centered around the Davis–Putnam procedure (DPP) [DLL62, DP60] under two models of probability distributions: one on random clause lengths (RCL) and the other on fixed clause lengths (FCL).

DPP is a resolution procedure that employs certain heuristics to evaluate a CNF formula recursively by selecting literals one at a time and trying both of their possible truth assignments. Let $\psi$ be a CNF formula, and $\psi(u)$ the formula obtained from $\psi$ by setting $u$ true. Then DPP works as follows: If $\psi$ is empty, return yes; else if $\psi$ contains an empty clause, return no; else if $\psi$ contains a pure literal $u$ or a unit clause $\{u\}$, return DPP($\psi(u)$); else select a literal $v$ in $\psi$; return yes if DPP($\psi(v)$) = yes, and return DPP($\psi(\neg v)$) otherwise.
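The recursive description of DPP above can be written out as a short program. The following is a minimal sketch, not the implementation analyzed in the cited papers: the clause representation (frozensets of signed integers), the helper names `assign` and `find_forced_literal`, and the rule of branching on the smallest variable index are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Optional

# Assumed representation: literal i > 0 stands for v_i, literal -i for its negation.
Clause = FrozenSet[int]


def assign(clauses: List[Clause], lit: int) -> List[Clause]:
    """Return psi(lit): drop clauses containing lit, delete -lit from the rest."""
    return [c - {-lit} for c in clauses if lit not in c]


def find_forced_literal(clauses: List[Clause]) -> Optional[int]:
    """Return a literal from a unit clause, or a pure literal, if one exists."""
    for c in clauses:
        if len(c) == 1:
            return next(iter(c))
    literals = {lit for c in clauses for lit in c}
    for lit in sorted(literals, key=abs):
        if -lit not in literals:
            return lit
    return None


def dpp(clauses: List[Clause]) -> bool:
    """Decide satisfiability following the case analysis in the text."""
    if not clauses:                          # empty formula: yes
        return True
    if any(len(c) == 0 for c in clauses):    # empty clause: no
        return False
    forced = find_forced_literal(clauses)
    if forced is not None:                   # pure literal or unit clause
        return dpp(assign(clauses, forced))
    # Hypothetical selection rule: branch on the smallest variable index.
    v = min(abs(lit) for c in clauses for lit in c)
    return dpp(assign(clauses, v)) or dpp(assign(clauses, -v))
```

For instance, `dpp([frozenset({1, 2}), frozenset({-1, -2})])` explores the branch $v_1 = 1$ first and then resolves the remaining unit clause.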
^1 Supported in part by NSF under Grants CCR-9820611 and CCR-0296037.
0885-064X/02 $35.00 © 2002 Elsevier Science (USA). All rights reserved.
Both the RCL and FCL models have two parameters: the number $n$ of variables and the number $m$ of clauses from which a random CNF formula $\psi$ is drawn.

In the RCL model (also called "the fixed density model"), $\psi$ is constructed by including each of the $2n$ literals in each of the $m$ clauses with probability $p$. Assume that clause lengths are binomially distributed, so that the expected clause length is $2pn$. Thus the probability distribution of $\psi$, denoted by $\mu_{RCL(p)}(\psi)$, is proportional to $n^{-2} m^{-2} 2^{-2pnm}$. For any constant value of $p$, DPP solves random RCL instances in polynomial time $O(n^2 m)$ on average [Gol79]. This result is due to a favorable choice of distribution, for the probability that a random RCL instance is satisfiable tends to 1 as $n$ grows, and a witness can be found by a constant number of guesses of random truth assignments [FP83].

In the FCL model (also called "the random $k$SAT model"), $\psi$ is generated by selecting clauses uniformly at random from the set of all possible nontrivial clauses of a fixed length $k$. A clause is trivial if it contains both a variable and its negation. Since there are $2^k \binom{n}{k}$ nontrivial clauses, the probability distribution $\mu_{FCL}(\psi)$ of a random FCL instance is proportional to

$$n^{-2} m^{-2} 2^{-km} \binom{n}{k}^{-m}.$$

DPP on random FCL instances runs in exponential time on average to find all solutions [CS88, FP83].

We observe that random formulas can also be generated as follows: First, choose a binary encoding scheme for encoding formulas, then randomly generate binary encodings. We shall allow probability distributions $\mu$ on binary strings to sum up to less than 1 by assuming that $\mu(x) = 0$ if $x$ is not a legal encoding. Let $|\psi|$ denote the number of literals occurring in $\psi$. If $\psi$ is a $k$SAT instance over $n$ variables and with $m$ clauses, then $|\psi| = km$. Each such formula can be encoded using $2n$ codewords of equal length, each of which represents a literal.
Let $c(\psi)$ denote such a binary encoding of $\psi$, and $\ell_\psi$ the length of $c(\psi)$. Then $\ell_\psi = km(\lceil \log n \rceil + 1)$, and the uniform distribution of $c(\psi)$, denoted by $\mu_{UNI}(c(\psi))$, is proportional to $n^{-2} m^{-2} \ell_\psi^{-2} 2^{-\ell_\psi}$. Let $\mu_{FCL}(\psi)$ denote the probability distribution of $\psi$, which is proportional to

$$n^{-2} m^{-2} 2^{-km} \binom{n}{k}^{-m}.$$

Since $\binom{n}{k} \approx n^k$ when $k = o(\sqrt{n})$ [Bol85], we have

$$2^{-km} \binom{n}{k}^{-m} \approx 2^{-km(\log n + 1)}.$$
This implies that $\mu_{UNI}(c(\psi))$ and $\mu_{FCL}(\psi)$ are equivalent distributions from the average-complexity point of view, for they dominate each other. A distribution $\mu$ is dominated by a distribution $\nu$, written $\mu \preceq \nu$, if for all $x$, $\mu(x) \le q(|x|)\nu(x)$ for some fixed polynomial $q$. Thus, under $\mu_{UNI}(c(\psi))$, DPP on random $k$SAT instances also runs in exponential time on average.

Despite intensive effort, it remains open whether $k$SAT is complete on average under $\mu_{FCL}(\psi)$ or $\mu_{UNI}(c(\psi))$. Not even a concrete encoding scheme^2 on CNF formulas is known such that, under the uniform distribution of this encoding, SAT is complete on average. Such an encoding scheme $c'(\psi)$ must be polynomially equivalent to $c(\psi)$; namely, $c(\psi)$ can be obtained from $c'(\psi)$ in time polynomial in the size of the input, and vice versa. We construct such a concrete encoding scheme in this paper and settle the open question affirmatively.

We fix a finite alphabet $A$ for describing formulas, with each symbol encoded in binary. In addition to all the symbols available on a standard keyboard, $A$ also includes the symbols $\circ$, $\sum$, $\prod$, and $\neg$. We label variables as $v_1, v_2, \ldots$, and label literals as $l_i$, where $i = \pm 1, \pm 2, \ldots$, such that $l_i$ ($i > 0$) represents variable $v_i$ and $l_{-i}$ represents its negation $\neg v_i$. For each clause $C$, we list the literals in increasing order of indices. This sequence of indices is unique for $C$, and we call it the identifier of $C$. The length of $C$ is defined to be the length of its identifier. We list the clauses in $\psi$ in lexicographical order: Clauses with shorter identifiers are listed prior to those with longer identifiers; within the same length, clauses are listed in dictionary order. It is easy to see that any CNF formula $\psi$ is uniquely represented in this way. For convenience, we still use $\psi$ to denote this ordered expression.
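To make the canonical ordering of clauses concrete, here is a minimal sketch. It assumes clauses are given as collections of signed literal indices and that "dictionary order" means comparing identifiers as integer tuples; the names `identifier` and `canonical_form` are hypothetical.

```python
from typing import Iterable, List, Tuple


def identifier(clause: Iterable[int]) -> Tuple[int, ...]:
    """Literal indices of a clause in increasing order (i > 0: v_i, i < 0: its negation)."""
    return tuple(sorted(set(clause)))


def canonical_form(clauses: Iterable[Iterable[int]]) -> List[Tuple[int, ...]]:
    """List identifiers by length first, then in dictionary order (assumed: tuple order)."""
    identifiers = {identifier(c) for c in clauses}
    return sorted(identifiers, key=lambda ident: (len(ident), ident))
```

Two inputs that describe the same set of clauses, such as `[[2, -1], [3]]` and `[[3], [-1, 2]]`, yield the same ordered expression, which is the uniqueness property the text relies on.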
^2 If we do not care about concrete encodings, then one can easily construct a universal distribution, based on an enumeration of all polynomially samplable distributions, to make SAT (and all standard NP-complete problems) complete on average for the class of NP problems with polynomially samplable distributions [BCGL92]. A distribution $\mu$ is said to be samplable if there exists a randomized algorithm that outputs $x$ with probability $\mu(x)$ in time polynomial in $|x|$.

We use "polytime" to denote "polynomial-time" and "polynomial time." We are only interested in polynomially equivalent encodings $E_\psi$ of $\psi$. In worst-case complexity, choices of encoding schemes do not affect NP-completeness as long as they are polynomially equivalent. In average-case complexity, however, the choice of encoding scheme is crucial, for different encoding schemes may result in distributions that do not dominate each other. We consider a concrete encoding scheme $\alpha$ for $\psi$ in this paper. Since unit clauses must be set true in order to satisfy a CNF formula $\psi$, we first check whether $\psi$ contains unit clauses. If it does, we compress the unit clauses based on arithmetic progressions in the literal indices (if there are any). We then use a dynamic coding scheme to encode the rest of the clauses in $\psi$ (details of this encoding are given in Section 2). This encoding is one-to-one, polytime computable, and polytime invertible. For convenience, we call the uniform distribution on this encoding an $\alpha$-distribution.

Distributional Satisfiability (DistSAT)
Instance: A CNF formula $\psi$.
Question: Is $\psi$ satisfiable?
Distribution: $\mu_\alpha(\psi)$ is proportional to $|\alpha(\psi)|^{-2} 2^{-|\alpha(\psi)|}$.

When $\psi$ is restricted to be a $k$SAT instance in DistSAT, we denote the problem by Dist-$k$SAT. We show that DistSAT is complete on average for DistNP. Since $\mu_\alpha$ is flat, namely, $\mu_\alpha(\psi) \le 2^{-|\psi|^{\varepsilon}}$ for some fixed $\varepsilon > 0$, randomized reductions are necessary for establishing the completeness result (unless EXP = NEXP) [Gur91].

The rest of the paper is organized as follows. In Section 2, we describe the $\alpha$-distribution in detail. In Section 3, we first present some basics of the theory of average NP-completeness for readers who are not familiar with it. We then define a flat distributional tiling problem (FDT) and show that it is complete on average for DistNP under randomized reductions. We show that FDT is reducible to DistSAT, and hence DistSAT is complete on average for DistNP. We then show that there exists a $k_0 > 2$ such that for all $k > k_0$, Dist-$k$SAT is complete on average under randomized reductions.
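The phrase "proportional to" in the definition of DistSAT is meaningful because the weights $|\alpha(\psi)|^{-2} 2^{-|\alpha(\psi)|}$ have a finite sum. The following short calculation is a sketch of why, assuming only that $\alpha$ is one-to-one and that every $\alpha(\psi)$ is a nonempty binary string: summing over all nonempty binary strings $y$ gives a convergent series, and the sum over encodings is bounded by it.

```latex
\sum_{y \in \{0,1\}^{+}} |y|^{-2}\, 2^{-|y|}
  \;=\; \sum_{n \ge 1} 2^{n} \cdot n^{-2}\, 2^{-n}
  \;=\; \sum_{n \ge 1} \frac{1}{n^{2}}
  \;=\; \frac{\pi^{2}}{6},
\qquad\text{hence}\qquad
\sum_{\psi} |\alpha(\psi)|^{-2}\, 2^{-|\alpha(\psi)|} \;\le\; \frac{\pi^{2}}{6}.
```

The inequality in the last step is consistent with the convention, stated in Section 1, that a distribution may sum to less than 1 when some strings are not legal encodings.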
2. THE $\alpha$-DISTRIBUTION

Denote by $|x|$ the length of a binary string $x$, and by $|X|$ the cardinality of a set $X$. A probability distribution (or simply distribution) $\mu$ is a real-valued function from $\{0,1\}^*$ to $[0,1]$ such that $\sum_x \mu(x) \le 1$. In the previous literature, saying that $\mu$ is polytime computable means that either the distribution function $\mu^*(x) = \sum_{y \le x} \mu(y)$ is polytime computable, where $\le$ is the standard lexicographical order on $\{0,1\}^*$, or $\mu \preceq \nu$ for some $\nu$ with polytime computable $\nu^*$. This definition implies that $\mu(x) \preceq 2^{-|\zeta(x)|}$ for some function $\zeta$ with the property that $\zeta(x)$ is one-to-one, polytime computable, and polytime invertible in $|x|$, which is the only property required of $\mu$ for establishing average NP-completeness. Without loss of generality, we define $\mu$ to be polytime computable if there exists a one-to-one and polytime computable function $\zeta_\mu : \{0,1\}^* \to \{0,1\}^*$, with $\zeta_\mu^{-1}(y) = x$ being computable in time polynomial in $|x|$, such that for all $x$, $\mu(x) \preceq 2^{-|\zeta_\mu(x)|}$.
Given a binary string $x$, we can embed $x$ in a longer string and find it efficiently as follows. Let $s$ be $|x|$ written in binary. Set $e(x) = 0^{|s|} 1 s x$. Then given a string with $e(x)$ as a prefix, we can count the number of 0's before the initial 1, use this number to find the number $s$, and then use $s$ to find $x$. Notice that $|x| + \log |x| \le |e(x)| \le |x| + 2\log |x|$. For convenience, we call $e(x)$ a logarithmic embedding of $x$ (x-embedding, in short).

Let $x$ be a binary string starting with 1. Then $x$ is a unique concatenation of the base strings: 1, 10, 000, 100 [Wan99]. We encode each symbol in $A$ by a string in the regular set $R = 0100(00 + 11)^* 11$ as follows. Let $\ell$ be the least even integer such that $2^{(\ell - 6)/2} \ge |x| + |A|$. Let $S$ be the set of the first $|A|$ strings (in lexicographical order) in $R$ of length $\ell$ such that no string in $S$ is a substring of $x$. Such a set $S$ exists because the string $x$ has at most $|x|$ substrings of length $\ell$. It follows that none of the base strings is a prefix of any coded symbol, and that if a nonempty suffix $z$ of a coded symbol $u$ is a prefix of a coded symbol $v$, then $z = u = v$. We assign a distinct element of $S$ to each symbol in $A$ in a fixed order. The length of each coded symbol is $\ell = O(\log |x|)$. This dynamic encoding scheme was first used in [Gur91] to show that distributional Post correspondence under a uniform distribution is complete on average. For convenience, we call this coding scheme a logarithmic encoding scheme of $x$ (x-encoding, in short).

Let $\psi$ be a CNF formula. We construct $\alpha(\psi)$ as follows.

• Unit clause compression: A unit clause is a clause $\{l\}$ of length 1, in which the literal $l$ is also called a unit witness. Let $u_1 < \cdots < u_\ell$ be the indices of the literals in all unit clauses in $\psi$. Let $\langle u_{a+i} \rangle_{i=0}^{k}$ be a subsequence of $\langle u_j \rangle_{j=1}^{\ell}$. If there exist integers $c > 1$ and $d$ ($d$ could be negative) such that for all $0 \le i \le k$,

$$u_{a+i} = b_i + d + ic, \quad \text{where } b_i \in \{0, 1\},$$

then we call $\langle u_{a+i} \rangle_{i=0}^{k}$ a 0–1 arithmetic progression with base $(c, d, b_0 b_1 \cdots b_k)$. Let $\langle u_t, u_{t+1}, \ldots, u_{t+\eta-1} \rangle$ be the first (i.e., with $t$ being the smallest of all) longest 0–1 arithmetic progression. Assume that its base is $(\beta, \gamma, x_0 x_1 \cdots x_{\eta-1})$. Let $x = 1 x_0 x_1 \cdots x_{\eta-1}$. We replace $\{l_{u_t}\} \cdots \{l_{u_{t+\eta-1}}\}$ by $x@(u_t, \beta, \gamma)$, and place it in front of the other clauses.

For example, suppose $\psi$ contains the following unit clauses: $\{\neg v_3\}$, $\{\neg v_1\}$, $\{v_2\}$, and $\{v_5\}$. Then $u_1 = -3$, $u_2 = -1$, $u_3 = 2$, and $u_4 = 5$. Let $c = 2$, $d = -3$, and $b_0 b_1 b_2 = 001$; then $u_{1+i} = b_i + d + ic$ for $i = 0, 1, 2$, but $u_4 = u_{1+3} = 5 > 1 + d + 3c > 0 + d + 3c$. Thus, $\langle u_1, u_2, u_3 \rangle$ is the first longest 0–1 arithmetic progression. Its base is $(2, -3, 001)$. Hence, we replace $\{\neg v_3\}$, $\{\neg v_1\}$, and $\{v_2\}$ in $\psi$ by $001@(0, 2, -3)$.
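The defining equation $u_{a+i} = b_i + d + ic$ can be checked mechanically for a given pair $(c, d)$. The sketch below does only that: it tests whether a run of indices is a 0–1 arithmetic progression for the supplied $c$ and $d$ and returns the bit string $b_0 b_1 \cdots b_k$. It does not search for the first longest progression, and the function name `progression_bits` is hypothetical.

```python
from typing import List, Optional


def progression_bits(u: List[int], c: int, d: int) -> Optional[str]:
    """Return b_0 b_1 ... b_k if u[i] = b_i + d + i*c with every b_i in {0, 1}; else None."""
    if c <= 1:                      # the definition requires c > 1
        return None
    bits = []
    for i, value in enumerate(u):
        b = value - d - i * c       # the only candidate for b_i
        if b not in (0, 1):
            return None
        bits.append(str(b))
    return "".join(bits)
```

With the values used in the example, `progression_bits([-3, -1, 2], 2, -3)` yields the bit string 001, while appending the index 5 makes the check fail for this choice of $c$ and $d$, in line with the inequality $5 > 1 + d + 3c$.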
• Dynamic encoding: Let $\psi'$ be the expression of formula $\psi$ obtained after the unit clause compression. Let $x$ be the binary string obtained from the unit clause compression (note that $x$ begins with 1). We fix an x-encoding scheme for $A$, and encode every symbol in $A$ by the x-encoding scheme. Replace $x$ by $e(x)$. Encode every integer $z$ in $\psi'$ by $\#b(z)\#$, where $b(z)$ is $z$ written in binary, and then replace 1 with 10 and 0 with 01 in $b(z)$. Since x-encoding codes are strings of the form $0100(00 + 11)^* 11$, a binary number so coded is easily distinguished from any coded symbol in $A$. Finally, encode every other symbol in $\psi'$ by its x-encoding code. If the length of the resulting binary string is less than $|\psi|^{1/3}$, pad the string to make it at least this long. The final string is $\alpha(\psi)$, with prefix $e(x)$, which is a sequence of coded symbols in $A$ under x-encoding.
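Two ingredients of the encoding are simple enough to sketch directly: the logarithmic embedding $e(x) = 0^{|s|} 1 s x$ with its extraction procedure, and the bit doubling applied to $b(z)$. The following is an illustrative sketch on Python strings of '0' and '1'; the function names are hypothetical, `code_bits` assumes a nonnegative integer, and the x-encoding of alphabet symbols is not modeled.

```python
def embed(x: str) -> str:
    """Logarithmic embedding e(x) = 0^{|s|} 1 s x, where s is |x| written in binary."""
    s = format(len(x), "b")
    return "0" * len(s) + "1" + s + x


def extract(y: str) -> str:
    """Recover x from any string y that has e(x) as a prefix."""
    n_zeros = y.index("1")                  # number of 0's before the initial 1 equals |s|
    s = y[n_zeros + 1: 2 * n_zeros + 1]     # the next |s| bits spell |x| in binary
    start = 2 * n_zeros + 1
    return y[start: start + int(s, 2)]


def code_bits(z: int) -> str:
    """Write z >= 0 in binary, then replace 1 with 10 and 0 with 01."""
    return "".join("10" if bit == "1" else "01" for bit in format(z, "b"))
```

For a string of length 5 such as 10110, $s = 101$ and the embedding is 000 1 101 10110; extraction works regardless of what follows the embedded prefix, which is the property used when $e(x)$ is placed at the front of $\alpha(\psi)$.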
Theorem 1. The encoding $\alpha$ is one-to-one, polytime computable, and polytime invertible.

Proof. It is straightforward to see that $\alpha$ is one-to-one and polytime computable. To compute $\alpha^{-1}$ on input $y$, we first look for the prefix $e(x)$ of $y$ to extract $x$, from which we know how $A$ is encoded by the x-encoding scheme. If $y$ represents a set of clauses, let $\psi$ be the CNF formula uniquely determined by $y$, and output $\psi$. If $y$ does not have $e(x)$ as a prefix or $y$ does not represent a set of clauses, output "nil." □

Corollary 2. The distribution $\mu_\alpha$ is polytime computable.

Clearly, $\mu_\alpha$ is a flat distribution.
3. COMPLETE ON AVERAGE SAT

3.1. Basics of Average-Case NP-Completeness

Denote by $\mathbb{N}$ the set of nonnegative integers. Let $f : \Sigma^+ \to \mathbb{N}$ be a function with an input distribution $\mu$. If there exists an $\varepsilon > 0$ such that

$$\sum_{|x| \ne 0} f^{\varepsilon}(x)\, |x|^{-1} \mu(x) < \infty,$$

then we say that $f$ is polynomial on $\mu$-average [Lev86]. Denote by AP the class of distributional problems $(D, \mu)$, where $D$ is solvable in time polynomial on $\mu$-average. Let $(A, \mu)$ and $(B, \nu)$ be two distributional problems. If $A$ is reducible to $B$ via a one-to-one, polytime computable reduction $f$, and $\mu \preceq \nu \circ f$, then $(A, \mu)$ is polytime reducible to $(B, \nu)$. AP is closed under polytime reductions, and polytime reductions are transitive.
Let DistNP $= \{(D, \mu) : D \in \mathrm{NP}$ and $\mu$ is polytime computable$\}$. Since $\mu_\alpha$ is flat, we need to use randomized reductions to show that DistSAT is complete on average for DistNP.

We assume that a randomized algorithm $U$ flips a coin only when its computation requires a random bit, and the coin is unbiased. Randomized algorithms (to solve a problem) are allowed to make errors and produce incorrect outputs on some sequences of random bits. They can also run forever on some random (infinite) sequences. If $U$ on input $x$ halts with a correct output using random bits $r$, we call $(x, r)$ a good input for $U$. We note that deterministic algorithms are a special case with good inputs $(x, \lambda)$, where $\lambda$ represents the empty string. Let $G$ be a set of good inputs for $U$. Let $G(x) = \{r : (x, r) \in G\}$. Let $\mu$ be an input distribution. If $G(x) \ne \emptyset$ for all $x$ with $\mu(x) > 0$, we call $G$ a good-input domain of $U$ (with respect to $\mu$). It is easy to see that no string in $G(x)$ is a prefix of a different string in $G(x)$ (otherwise, the longer string cannot be in $G(x)$, for the algorithm halts before the string is generated). Let

$$U_G(x) = \frac{1}{\sum_{r \in G(x)} 2^{-|r|}},$$

which is called the rarity function of $G$. We say that $G$ is nonrare (with respect to $\mu$) if $U_G$ is polynomial on $\mu$-average. $U$ is almost total if $U_G(x) = 1$ for all $x$ with $\mu(x) > 0$. For all $(x, r) \in G$, define $\mu_G(x, r) = \mu(x) 2^{-|r|} U_G(x)$. Let $t(x, r)$ be the running time of $U$ on input $(x, r) \in G$. If $G$ is nonrare and there exists an $\varepsilon > 0$ such that

$$\sum_{(x, r) \in G} t^{\varepsilon}(x, r)\, |x|^{-1} \mu_G(x, r) < \infty,$$
then we say that $U$ runs in polytime on $\mu$-average. If $t(x, r)$ is bounded by a polynomial in $|x|$ for all $(x, r) \in G$, then we say that $U$ runs in polytime. One way to justify the correctness of the output is to show that its input belongs to the good domain. For this purpose, we consider certifiable domains. Domain $G$ is certifiable if $G$ is decidable in polytime on $\mu(x)|r|^{-2} 2^{-|r|}$-average. It can be shown that $U$ runs in polytime on $\mu$-average if and only if $U$ can be iterated in a certain manner to run in polytime on $\mu$-average with an almost total good-input domain [BG93].

Denote by RAP the class of all distributional problems $(D, \mu)$, where $D$ is solvable by a randomized algorithm in polytime on $\mu$-average with a certifiable, nonrare good-input domain. We say that $(A, \mu)$ is polytime
randomly reducible to $(B, \nu)$ if there is a one-to-one reduction $f$, computable by a randomized algorithm in polytime with a certifiable, nonrare good-input domain $G$, such that, for all $(x, r) \in G$, $x \in A$ if and only if $f(x, r) \in B$, and $\mu_G \preceq \nu \circ f$. RAP is closed under polytime randomized reductions, and polytime randomized reductions are transitive [BG93]. It is easy to see that AP is a subset of RAP, and that polytime reducibility is a special case of polytime random reducibility. For more information about randomized reductions, the reader is referred to [BG93, Gur91, VL88, Wan97]. We say that a distributional problem $(D, \mu) \in$ DistNP is complete on average for DistNP (under randomized reductions) if all distributional problems in DistNP are polytime (randomly) reducible to $(D, \mu)$.

3.2. Main Theorems

We first consider a variant of Levin's distributional tiling problem [Lev86]. A tile is a square with a symbol on each side that may not be rotated or turned over. Denote by $(a, b, c, d)$ a tile whose symbols are $a$, $b$, $c$, and $d$ clockwise starting from the top side. We assume that there is a sufficient supply of copies of each tile. A tiling of an $n \times n$ square is an arrangement of $n^2$ tiles covering the square in which the symbols on the common sides of adjacent tiles are the same. Let $T$ be a finite set of tiles. Let $H_T \subseteq T \times T$ be the collection of pairs of tiles that can be placed horizontally, and $V_T \subseteq T \times T$ be the collection of pairs of tiles that can be placed vertically. A tiling system of $T$ is a pair $(S, \sigma)$, where $S \subseteq T$ with $|S| = 2$ and $S \times S \subseteq H_T$, and $\sigma$ is a sequence of tiles $s_1 s_2 \cdots s_k$ such that $s_1 \in T$, $s_i \in S$ for $i = 2, \ldots, k$, and $(s_1, s_2) \in H_T$.

Flat Distributional Tiling (FDT)
Instance: A tiling system $(S, \sigma)$ of $T$, where $\sigma = s_1 s_2 \cdots s_k$.
Question: Can $\sigma$ be extended to tile a $k \times k$ square?
Distribution: $\mu_{FT}(S, \sigma)$ is proportional to $k^{-2} 2^{-k}$.

To show that FDT is complete on average for DistNP, we reduce the following flat distributional halting problem to FDT.
Let $\langle x, y \rangle$ denote the binary string $e(x)y$. Gurevich [Gur91] constructed a nondeterministic Turing machine (NTM) $M_G$ such that, on binary instances $x01^n$ with $|x| < n$, it is complete on average for DistNP to decide whether $M_G$ accepts $x$ within $n$ steps under distribution $\mu_H(x01^n)$, which is proportional to $n^{-3} 2^{-|x|}$. Let $K$ be the set of positive instances $x01^n$.

Flat Distributional Halting (FDH)
Instance: $\langle x, y \rangle$, where $x$ and $y$ are binary strings.
Question: Does $M_G$ accept $x$ within $|y|$ steps?
Distribution: $\mu_{FH}(\langle x, y \rangle)$ is proportional to $|x|^{-2} |y|^{-2} 2^{-|x|} 2^{-|y|}$.

Let $K_F$ denote the set of positive instances of FDH. We can then randomly reduce $(K, \mu_H)$ to $(K_F, \mu_{FH})$ as follows. On input $x01^n$, the reduction $f$ generates a random string $r$ with $|r| = n$ and outputs $\langle x, r \rangle$. Let $G = \{(x01^n, r) : |r| = n\}$; then $U_G(x01^n) = 1$, and $G$ is polytime computable and so is certifiable. We note that $f$ is one-to-one and polytime computable on $G$. Clearly, for all $(x01^n, r) \in G$: $x01^n \in K$ if and only if $\langle x, r \rangle \in K_F$, and

$$\mu_G(x01^n, r) = \mu_H(x01^n) 2^{-|r|} = O(n^{-3} 2^{-|x|} 2^{-|r|}) \le O(|x|^2)\, \mu_{FH}(\langle x, r \rangle).$$

This implies that FDH is complete on average for DistNP under randomized reductions.

Lemma 3. There is a set $T_K$ of tiles and a tiling system $(S, \sigma)$ of $T_K$ such that FDT is complete on average for DistNP under randomized reductions.

Proof. We randomly reduce FDH to FDT. Let $M_F$ be a one-tape NTM that accepts $K_F$ in polytime. We construct a one-tape NTM $M$ such that on input $z$, if $z = e(w)z'$ for some $w$, $M$ extracts $w$; otherwise, $M$ rejects. This can be carried out deterministically in polytime in $|w|$. $M$ then determines whether $w = \langle x, y \rangle$ for some $x$ and $y$ in deterministic polytime in $|w|$. If so, $M$ simulates $M_F$ on $w$; otherwise, $M$ rejects. Thus, there is a polynomial $p$ such that $M_F$ accepts $\langle x, y \rangle$ if and only if $M$ accepts $e(w)z'$ for any $z'$, and every computation path of $M$ has length strictly less than $p(|w|)$. Note that $M$ either accepts all inputs beginning with $e(w)$ or rejects all inputs beginning with $e(w)$, depending on whether or not $M_F$ accepts $\langle x, y \rangle$.

Let $Q$ be the set of states of $M$, with starting state $q_s$, accepting state $q_a$, and rejecting state $q_r$. Let $\Delta$ be the transition function of $M$, and $B$ the blank symbol. Let $T_K$ be the set of the following tiles, where $a, b, c \in \{0, 1, B\}$, $q \in Q - \{q_a, q_r\}$, $p \in Q$, and $*$, $\#$, $\$$ are symbols not in $Q \cup \{0, 1, B\}$.

• $(a, *, a, *)$, $((q_a, b), *, (q_a, b), *)$, $((q_s, b), \#, \$, \$)$, $(a, \#, \$, \#)$,
• $(b, p, (q, a), *)$ and $((p, c), *, c, p)$, if $\Delta(q, a) = (p, b, R)$,
• $(b, *, (q, a), p)$ and $((p, c), p, c, *)$, if $\Delta(q, a) = (p, b, L)$.

The sets $H_{T_K}$ and $V_{T_K}$ can be readily obtained. Let $S = \{(0, \#, \$, \#), (1, \#, \$, \#)\}$. Let $|e(w)| = k$ and write $e(w) = w_1 w_2 \cdots w_k$, where $w_i \in \{0, 1\}$. Let $r = r_1 r_2 \cdots r_\ell$ be a random string, where $\ell = p(|w|) - k$ and $r_i \in \{0, 1\}$. Let $\sigma = s_1 s_2 \cdots s_k s_{k+1} \cdots s_{k+\ell}$, where $s_1 = ((q_s, w_1), \#, \$, \$)$, $s_i = (w_i, \#, \$, \#)$ for $i = 2, \ldots, k$, and $s_j = (r_{j-k}, \#, \$, \#)$ for $j = k + 1, \ldots, k + \ell$. Since $M$ will reach the accepting or the rejecting state in strictly less than $p(|w|)$ time, and the
tiling can only duplicate the accepting state, $\sigma$ can extend to a tiling of a $p(|w|) \times p(|w|)$ square if and only if $\sigma$ occupies the bottom row of the square and $M$ accepts $\sigma$. Let $G = \{(\langle x, y \rangle, r) : |r| = p(|w|) - |e(w)|\}$. Then $U_G(\langle x, y \rangle) = 1$, and $G$ is polytime computable and so is certifiable. Thus, $f(\langle x, y \rangle, r) = (S, \sigma)$ is the desired randomized reduction from FDH to FDT, which is one-to-one and polytime computable on $G$. To verify domination of distributions, we note that

$$\mu_G(\langle x, y \rangle, r) = \mu_{FH}(\langle x, y \rangle) 2^{-|r|} = \Theta(|x|^{-2} |y|^{-2} 2^{-|x|} 2^{-|y|} 2^{-|r|})$$
$$\le O(|y|^{-2} 2^{-|\langle x, y \rangle|} 2^{-|r|}) \le O(|w|^2 |y|^{-2} 2^{-|e(w)|} 2^{-|r|})$$
$$\le O((|x| + |y|)^4 |\sigma|^{-2} 2^{-|\sigma|}).$$

Thus $\mu_{FH}(\langle x, y \rangle) \preceq \mu_{FT}(f(\langle x, y \rangle))$. This completes the proof. □
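The matching condition in the definition of a tiling (Section 3.2) is easy to state as a check on a completed square. The sketch below is illustrative only: it assumes a tile is a 4-tuple (top, right, bottom, left), following the clockwise convention $(a, b, c, d)$, and that the square is given as a list of rows with row 0 at the top. It verifies a given arrangement; it does not search for one.

```python
from typing import List, Tuple

Tile = Tuple[str, str, str, str]   # (top, right, bottom, left), clockwise from the top


def is_valid_tiling(grid: List[List[Tile]]) -> bool:
    """Check that symbols on the common sides of adjacent tiles agree in an n x n grid."""
    n = len(grid)
    if any(len(row) != n for row in grid):
        return False
    for i in range(n):
        for j in range(n):
            _, right, bottom, _ = grid[i][j]
            if j + 1 < n and right != grid[i][j + 1][3]:   # right side vs. neighbor's left
                return False
            if i + 1 < n and bottom != grid[i + 1][j][0]:  # bottom side vs. neighbor's top
                return False
    return True
```

In these terms, $H_T$ is the set of pairs that pass the first comparison and $V_T$ the set of pairs that pass the second.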
Theorem 4. DistSAT is complete on average for DistNP under randomized reductions.

Proof. We reduce FDT to DistSAT. Let $T_K$ be the set of tiles and $(S, \sigma)$ the tiling system of $T_K$ obtained from Lemma 3. Label the two tiles in $S$ as $t_0$ and $t_1$, and the tiles in $T_K - S$ as $t_2, \ldots, t_{s-1}$, where $s = |T_K|$ is a constant. Let $N = |\sigma|$. Create $n = N^2 s$ variables $v_0, v_1, \ldots, v_{n-1}$. For each variable $v_r$, where $r = (iN + j)s + k$, $0 \le i < N$, $0 \le j < N$, and $0 \le k < s$, we will later want $v_r$ to be 1 if the $(i, j)$th cell of the square is covered by the $k$th tile. We construct a formula $\psi$ as follows.

(1) Let $\sigma = s_0 s_1 \cdots s_{|\sigma|-1}$. Assume that $s_0 = t_p$, and $s_i = t_{x_i}$ for $i = 1, \ldots, |\sigma| - 1$, where $x_i \in \{0, 1\}$. $\psi$ includes the following unit clauses:

$$v_p \prod_{i=1}^{|\sigma|-1} v_{is + x_i}. \tag{1}$$

(2) For all $k, k'$ with $(t_k, t_{k'}) \notin H_{T_K}$, where $0 \le k, k' < s$, $\psi$ includes the following 2-clauses:

$$\prod_{I=0}^{N^2-1} (\neg v_{Is+k} + \neg v_{(I+1)s+k'}). \tag{2}$$

(3) For all $k, k'$ with $(t_k, t_{k'}) \notin V_{T_K}$, where $0 \le k, k' < s$, $\psi$ includes the following 2-clauses:

$$\prod_{I=0}^{N^2-1} (\neg v_{Is+k} + \neg v_{(I+N)s+k'}). \tag{3}$$
(4) Finally, $\psi$ includes the following $s$-clauses:

$$\prod_{I=0}^{N^2-1} \sum_{k=0}^{s-1} v_{Is+k}. \tag{4}$$
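The four clause families can be generated mechanically from the index map $r = (iN + j)s + k$. The following is a rough sketch under several assumptions that are not part of the construction as stated: literals are modeled as pairs (variable index, polarity), the excluded tile pairs are passed in as sets `not_h` and `not_v`, the bits $x_1, \ldots, x_{N-1}$ are passed as a list, and any 2-clause whose second variable index would fall outside $0, \ldots, n-1$ is skipped. The name `build_formula` is hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Literal = Tuple[int, bool]   # (variable index r, True for v_r, False for its negation)


def build_formula(N: int, s: int, p: int, x: Sequence[int],
                  not_h: Set[Tuple[int, int]],
                  not_v: Set[Tuple[int, int]]) -> List[List[Literal]]:
    """Generate clause families (1)-(4) for an N x N square and s tiles."""
    n = N * N * s                                      # variables v_0 .. v_{n-1}
    clauses: List[List[Literal]] = [[(p, True)]]       # family (1): s_0 = t_p
    for i in range(1, N):
        clauses.append([(i * s + x[i - 1], True)])     # family (1): s_i = t_{x_i}
    for offset, excluded in ((1, not_h), (N, not_v)):  # families (2) and (3)
        for k, k2 in sorted(excluded):
            for I in range(N * N):
                second = (I + offset) * s + k2
                if second < n:                         # assumption: skip out-of-range indices
                    clauses.append([(I * s + k, False), (second, False)])
    for I in range(N * N):                             # family (4): some tile on each cell
        clauses.append([(I * s + k, True) for k in range(s)])
    return clauses
```

For $N = 2$ and $s = 3$ the sketch produces $N$ unit clauses and $N^2$ clauses of length $s$, plus the 2-clauses contributed by each excluded pair.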
We note that $|\psi| = N + 4N^2 + N^2 s = \Theta(sN^2)$. Let $f(S, \sigma) = \psi$; then $f$ is one-to-one. If $(S, \sigma)$ is a positive instance of FDT, then for all $i$ and $j$ with $0 \le i < N$ and $0 \le j < N$, there must be a $k < s$ such that the square at location $(i, j)$ is tiled by $t_k$. Set $v_{(iN+j)s+k} = 1$. Since for all $I$ with $0 \le I < N^2$ there must be a pair of integers $i$ and $j$ with $i, j \in [0, N)$ such that $I = iN + j$, we conclude that every clause in (4) is satisfied. Clearly, every unit clause in (1) is satisfied. Now for each pair $(k, k')$ with $(t_k, t_{k'}) \notin H_{T_K}$, and for all $I$ with $0 \le I < N^2$, at least one of $v_{Is+k}$ and $v_{(I+1)s+k'}$ has not been set to 1. Set that variable to 0. Thus, every clause in (2) is satisfied. Similarly, every clause in (3) is also satisfied. Conversely, we can show that if $\psi$ is satisfiable then $(S, \sigma)$ is a positive instance of FDT.

Let $x = x_1 \cdots x_{|\sigma|-1}$. Notice that $s$ is a constant and $|\sigma| = N$. For every symbol $a \in A$, we use $\bar{a}$ to denote the x-encoding of $a$. Then

$$\alpha\Big(v_p \prod_{i=1}^{|\sigma|-1} v_{is+x_i}\Big) = x@(\#b(1)\#, \#b(s)\#, \#b(0)\#)\, v_p, \tag{5}$$

and so $|\alpha(v_p \prod_{i=1}^{|\sigma|-1} v_{is+x_i})| = |x| + \Theta(\log |x|)$. Under x-encoding we can see that $|\alpha(Y)| = \Theta(\log |x| + \log N)$ for $Y$ being a formula in either (2), (3), or (4). Thus,

$$|\alpha(\psi)| = |x| + \Theta(\log |x| + \log N) \tag{6}$$
$$= |\sigma| + \Theta(\log |\sigma|), \tag{7}$$

and so $|\alpha(\psi)| = \Theta(|\psi|^{1/2}) > |\psi|^{1/3}$. Equality (7) implies that $\mu_{FT}(S, \sigma) \preceq \mu_\alpha(f(S, \sigma))$. Hence $f$ reduces FDT to DistSAT. □

We can extend $\alpha$-encoding to compress certain special clauses in a $k$SAT formula $\psi$. Let $C$ be a nontrivial clause of length $k$ in $\psi$ with identifier $\langle j_1, \ldots, j_k \rangle$, where $k > 1$. We say that $C$ is a special clause of $\psi$ if $\psi$ contains $2^k - 1$ clauses of length $k$ (including $C$) $\{\ell_{q_1}, \ell_{q_2}, \ldots, \ell_{q_k}\}$, where $q_i = \pm j_i$, and the $q_i$ cannot all be negative. Thus, among these special clauses there is one that contains all variables. Let $\langle h_1, \ldots, h_k \rangle$ be the identifier of this clause. Then the product of all these $2^k - 1$ special clauses equals 1 if and only if $\ell_{h_1} = \ell_{h_2} = \cdots = \ell_{h_k} = 1$. We call $\langle h_1, \ldots, h_k \rangle$ the basis of these special clauses.
We look for the first special clause starting from the first clause in $\psi$, group together all the special clauses with the same basis $\langle h_1, \ldots, h_k \rangle$, and replace the product of these clauses by

$$[h_1, \ldots, h_k]. \tag{8}$$

We then encode (8) with $[\#b(h_1)\#, \ldots, \#b(h_k)\#]$ using the x-encoding in $\alpha$.

Theorem 5. For every $k \ge s$, where $s = |T_K|$, Dist-$k$SAT is complete on average for DistNP under randomized reductions.

Proof (Sketch). Let $\psi$ be the formula constructed in the proof of Theorem 4 with $n$ variables. Then each clause in $\psi$ has length 1, 2, or $s$. Create $s$ new variables $v_n, \ldots, v_{n+s-1}$, and $2^s - 1$ special clauses with basis $\{n, \ldots, (n + s - 1)\}$. Replace each unit clause $\{l\}$ in $\psi$ by $\{l, \neg v_n, \ldots, \neg v_{n+s-2}\}$, and replace each 2-clause $\{l_1, l_2\}$ in $\psi$ by $\{l_1, l_2, \neg v_n, \ldots, \neg v_{n+s-3}\}$. This produces a formula $\psi_s$ in $s$SAT, and $\psi$ is satisfiable if and only if $\psi_s$ is. Moreover, $|\alpha(\psi_s)| = |\sigma| + \Theta(\log |\sigma|)$. So Dist-$s$SAT is complete on average. Reducing Dist-$k$SAT to Dist-$(k+1)$SAT is straightforward. □

Finally, we would like to point out that it remains open whether for $3 \le k < s$, Dist-$k$SAT is complete on average for DistNP.
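The key property of special clauses, that their product equals 1 exactly when every basis variable is 1, can be confirmed by brute force for small $k$. The sketch below assumes the basis is a list of distinct positive variable indices and represents a clause as a tuple of signed indices; both function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def special_clauses(basis: Sequence[int]) -> List[Tuple[int, ...]]:
    """All 2^k - 1 sign patterns over the basis variables except the all-negative one."""
    clauses = []
    for signs in product((1, -1), repeat=len(basis)):
        if all(sign < 0 for sign in signs):
            continue
        clauses.append(tuple(sign * h for sign, h in zip(signs, basis)))
    return clauses


def satisfying_assignments(clauses: List[Tuple[int, ...]],
                           basis: Sequence[int]) -> List[Tuple[bool, ...]]:
    """Enumerate the truth assignments to the basis variables that satisfy every clause."""
    result = []
    for values in product((False, True), repeat=len(basis)):
        truth: Dict[int, bool] = dict(zip(basis, values))
        if all(any(truth[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            result.append(values)
    return result
```

The reason only the all-true assignment survives is that any other assignment falsifies the clause whose signs are the opposite of that assignment, and this clause is present unless it is the excluded all-negative one. This is what lets the padded literals $\neg v_n, \ldots$ in the proof of Theorem 5 be forced false.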
ACKNOWLEDGMENTS

I am grateful to Jay Belanger and Drue Coles for carefully reading early drafts of this paper, and to Steve Cook, Yuri Gurevich, and Leonid Levin for their comments. I thank Drue Coles for pointing out an improved statement of FDT.
REFERENCES

[BCGL92] S. Ben-David, B. Chor, O. Goldreich, and M. Luby, On the theory of average case complexity, J. Comput. System Sci. 44 (1992), 193–219. (Preliminary version first appeared in STOC'89.)
[BG93] A. Blass and Y. Gurevich, Randomizing reductions of search problems, SIAM J. Comput. 22 (1993), 949–975.
[Bol85] B. Bollobás, "Random Graphs," Academic Press, New York, 1985.
[CS88] V. Chvátal and E. Szemerédi, Many hard examples for Resolution, J. ACM 35 (1988), 759–768.
[CM97] S. A. Cook and D. G. Mitchell, Finding hard instances of the satisfiability problem: A survey, in "Satisfiability Problem: Theory and Applications," (D.-Z. Du, J. Gu, and P. Pardalos, Eds.), pp. 1–17, AMS Press, Providence, RI, 1997.
[DLL62] M. Davis, G. Logemann, and D. Loveland, A machine program for theorem-proving, Comm. ACM 5 (1962), 394–397.
[DP60] M. Davis and H. Putnam, A computing procedure for quantification theory, J. ACM 7 (1960), 201–215.
[FP83] J. Franco and M. Paull, Probabilistic analysis of the Davis–Putnam procedure for solving the satisfiability problem, Discrete Appl. Math. 22 (1988), 35–51.
[Gol79] A. Goldberg, "On the Complexity of the Satisfiability Problem," Courant Computer Science Report No. 16, New York University, 1979.
[Gur91] Y. Gurevich, Average case completeness, J. Comput. System Sci. 42 (1991), 346–398. (Preliminary version first appeared in FOCS'87.)
[Lev86] L. Levin, Average case complete problems, SIAM J. Comput. 15 (1986), 285–286. (Preliminary version first appeared in STOC'84.)
[VL88] R. Venkatesan and L. Levin, Random instances of a graph coloring problem are hard, in "Proceedings of the 20th Annual Symposium on Theory of Computing," pp. 217–222, ACM Press, Providence, RI, 1988.
[Wan97] J. Wang, Average-case computational complexity theory, in "Complexity Theory Retrospective II," (L. Hemaspaandra and A. Selman, Eds.), pp. 295–328, Springer-Verlag, Berlin, 1997.
[Wan99] J. Wang, Distributional word problems for groups, SIAM J. Comput. 28 (1999), 1264–1283. (Preliminary version first appeared in STOC'95.)