Discrete Applied Mathematics 186 (2015) 275–282
Note
A tight analysis of the Submodular–Supermodular Procedure

Kevin M. Byrnes ∗
Department of Applied Mathematics and Statistics, The Johns Hopkins University, Baltimore MD 21218, USA
Article history: Received 6 February 2009; received in revised form 7 January 2015; accepted 13 January 2015; available online 11 February 2015.
Keywords: Submodularity; Submodular–Supermodular Procedure; Discrete optimization
Abstract. Narasimhan and Bilmes introduced the Submodular–Supermodular Procedure (SSP) for finding a local minimizer of the function h = f − g, where f and g are both submodular functions. In their original analysis the authors left the worst case complexity of SSP as an open question. We provide a tight analysis of SSP by demonstrating a family of examples where SSP can require 2^{n−2} − 1 iterations before converging (although it reaches a global optimum). We also consider the related Supermodular–Submodular Procedure of Iyer and Bilmes and demonstrate an example that requires at least ⌊16 · 3^{(n−8)/3} − 3⌋ iterations to converge, and converges to a local, not global, optimum.
1. Introduction

Submodular functions may be the most important class of functions you have never heard of. These functions arise frequently in combinatorial optimization and have a variety of applications in machine learning such as: feature selection [12], sensor placement [13], influence maximization in a social network [11], and document summarization [14].¹ Submodular functions also have attractive extremization properties: they can be minimized in strongly polynomial time [7,17] and maximized to within a factor of (1/2)OPT in linear time [4] (also see [3]).

Narasimhan and Bilmes [15] and Iyer and Bilmes [8] observe that many of these applications have natural generalizations which involve minimizing the difference h = f − g between two submodular functions f and g. However, because any set function can be expressed as the difference of two submodular functions [15], any algorithm that finds a global optimizer of the difference of submodular functions [1,10] must have exponential worst case complexity, and therefore algorithms that return local minimizers are of interest. The Submodular–Supermodular Procedure (SSP) described in [15] is such a local minimization algorithm. In [15] the worst case complexity of this algorithm was left as an open question.

In this note we definitively resolve the worst case complexity question by constructing a family of examples which require at least 2^{n−2} − 1 iterations for SSP to converge² (although when it does converge it reaches a global optimizer). Our analysis is essentially tight in that SSP can never require more than 4 times as many iterations as are required by our ‘‘worst-case’’ example in order to converge. We also consider the related Supermodular–Submodular Procedure of [8]. We demonstrate an example, applying to a more general family of algorithms to which the Supermodular–Submodular Procedure belongs, that requires at least ⌊16 · 3^{(n−8)/3} − 3⌋ iterations to converge, and converges to a local, not global, optimum.
¹ www.submodularity.org has an excellent overview of submodularity as well as applications.
² Each iteration of SSP can be completed in polynomial time; therefore the unresolved question is how many iterations SSP requires to converge.
Although our main results are negative, the differing behavior of the Submodular–Supermodular Procedure and the Supermodular–Submodular Procedure on the two different families of pathological examples (discussed in the conclusions) may convey some further insight into these algorithms, and suggests the possibility of constructing a hybrid algorithm with sub-exponential worst case convergence. In addition, our worst case example for SSP has implications for the worst case convergence of the ‘‘ϵ-approximate’’ version of SSP introduced in [8] (also discussed in the conclusions).

2. Overview of submodularity and the Submodular–Supermodular Procedure

First we briefly review some preliminaries of submodular functions and the description of the SSP algorithm. Throughout this note we shall let V denote a finite ground set V = {v1, . . . , vn}.

Definition 2.1. A function f : 2^V → R is submodular if f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) for all A, B ∈ 2^V. Similarly, f is supermodular if f(A) + f(B) ≤ f(A ∪ B) + f(A ∩ B) for all A, B ∈ 2^V, and f is modular if f(A) + f(B) = f(A ∪ B) + f(A ∩ B) for all A, B ∈ 2^V.

Notice that f is submodular if and only if −f is supermodular. Thus minimizing the difference of two submodular functions is equivalent to maximizing the difference of two supermodular functions. We find our arguments to be more intuitive if we consider the underlying problem to be max h = f − g where f and g are supermodular, so we use this convention throughout the rest of this note. All results from the literature involving minimizing the difference between submodular functions have been appropriately ‘‘translated’’ here. To compare some of our results with those already in the literature may require a ‘‘translation’’ back.

Equivalently to Definition 2.1, a function is submodular, supermodular, or modular if it exhibits the ‘‘decreasing returns’’, ‘‘increasing returns’’, or ‘‘linear returns’’ property, respectively.

Proposition 2.2. A function f : 2^V → R is submodular if: f(A ∪ v) − f(A) ≥ f(A ∪ w ∪ v) − f(A ∪ w) for A ⊊ V and v, w ∈ V \ A. Similarly, f is supermodular if: f(A ∪ v) − f(A) ≤ f(A ∪ w ∪ v) − f(A ∪ w) for A ⊊ V and v, w ∈ V \ A, and f is modular if: f(A ∪ v) − f(A) = f(A ∪ w ∪ v) − f(A ∪ w) for A ⊊ V and v, w ∈ V \ A.

Finally, observe that any h : 2^V → R can be decomposed into the difference of two supermodular functions f − g, where g is an increasing and normalized (g(∅) = 0) function. Thus (unless stated otherwise) we shall assume, without loss of generality, that we are interested in maximizing f − g, where f and g are supermodular and g is increasing and normalized.

Proposition 2.3. Let h : 2^V → R; then h = f − g where f and g are supermodular functions, and g is increasing and normalized.

Proof. Let z = max_{A,B ∈ 2^V} h(A) + h(B) − h(A ∪ B) − h(A ∩ B). Now let g : 2^V → R be any strictly supermodular function, i.e. any function such that ϵ = min_{A,B incomparable} g(A ∪ B) + g(A ∩ B) − g(A) − g(B) > 0. (Recall A and B are incomparable whenever A ⊈ B and B ⊈ A.) Without loss of generality, g is scaled so that ϵ ≥ z. Then f = h + g is supermodular. Note that for a simple, undirected graph G = (V, E) the function g : 2^V → R given by g(A) = |EG(A)| = |{(vi, vj) ∈ E : vi, vj ∈ A}| is increasing, normalized, and is strictly supermodular for G = Kn.³ Thus without loss of generality h = f − g, where f and g are supermodular and g is increasing and normalized. □
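To make these definitions concrete, the following small Python check (our illustration, not part of the original note; the helper names are ours) verifies by brute force that g(A) = |EKn(A)| is normalized, increasing, and strictly supermodular, exactly the properties used in the proof of Proposition 2.3:

from itertools import combinations

def g(S):
    # g(S) = |E_Kn(S)|: number of K_n edges with both endpoints in S, i.e. |S|(|S|-1)/2.
    return len(S) * (len(S) - 1) // 2

n = 5
V = range(n)
subsets = [frozenset(c) for r in range(n + 1) for c in combinations(V, r)]

assert g(frozenset()) == 0                                    # normalized
assert all(g(S) <= g(S | {v}) for S in subsets for v in V)    # increasing
for A in subsets:                                             # supermodular (Definition 2.1)
    for B in subsets:
        slack = g(A | B) + g(A & B) - g(A) - g(B)
        assert slack >= 0
        if not (A <= B or B <= A):                            # strict on incomparable pairs
            assert slack > 0

The strict slack on incomparable pairs is the ϵ that the proof of Proposition 2.3 scales against z; for this particular g the slack works out to |A \ B| · |B \ A|.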
One of the best-known algorithms for maximizing the difference of two supermodular functions is the Submodular–Supermodular Procedure (SSP), which is a discrete analog of the DC algorithm [6]. The algorithm proceeds from the current solution Vk ∈ 2^V by first constructing a modular upper bound mk to g such that mk(Vk) = g(Vk). Then it computes an argmax V_{k+1} of the resulting function hk = f − mk, which is both supermodular (since mk is modular) and a lower bound to h (as mk ≥ g). The procedure iterates until arriving at a set VT that is also an argmax of hT. Since hk = f − mk is supermodular it can be maximized in strongly polynomial time [7,17]. And since hk ≤ h with hk(Vk) = h(Vk) for all k, SSP will output a sequence of sets V0, . . . , Vk, . . . , VT such that h(V0) ≤ h(Vk) ≤ h(VT). The algorithm is described in full in the following display.

Some care needs to be exercised in selecting an argmax of hk, which will usually serve as the starting point V_{k+1} of the next iteration, to avoid cycling.⁴ Choosing the argmax of hk to be the lexicographically smallest maximizer of hk will suffice. However, our worst case example does not depend upon cycling and is therefore insensitive to the choice of anti-cycling rule. In addition, to ensure that SSP converges to a set VT that is a local maximizer of h (i.e. h(VT) ≥ h(S) for all S with |S △ VT| ≤ 1), we explicitly check this as a requirement for termination.

A modular upper bound mk for g exists and may be found in polynomial time, as shown in the following proposition due to Edmonds. Because we refer to the explicit construction of mk later on, we reproduce the result of [2] detailing its construction. (Note that the mk correspond to the lower bound modular subgradients of [8].)

Proposition 2.4. Let g : 2^V → R be supermodular. Let π be a permutation of {1, . . . , n} and consider the modular function m : 2^V → R defined as⁵

m(vπ(i)) = g(W1) if i = 1, and m(vπ(i)) = g(Wi) − g(W_{i−1}) if i ≥ 2,

where Wi = {vπ(1), . . . , vπ(i)}. Then m(S) ≥ g(S) for all S ∈ 2^V, and m(Wi) = g(Wi) for i = 1, . . . , n.

Proof. See [2]. □
³ Kn refers to the complete graph on n vertices.
⁴ For example, if both f and g were constant functions, then every element of 2^V could be an argmax of the approximating function hk, and cycling could trivially occur.
⁵ Note that since m is modular, it suffices to define m on the elements of V, as for any S ∈ 2^V we have m(S) = Σ_{v∈S} m(v).
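The construction in Proposition 2.4 is short enough to state in code. Below is a minimal sketch (ours, with g = |EKn(·)| as a running example and illustrative names) that builds m from a permutation π and checks that m bounds g from above and is tight along the attendant chain:

from itertools import combinations

def g(S):                        # g(S) = |E_Kn(S)| = |S|(|S|-1)/2
    return len(S) * (len(S) - 1) // 2

def modular_upper_bound(g, perm):
    # Edmonds-style modular upper bound of Proposition 2.4, returned as the
    # singleton values m(v); m(S) = sum of m[v] over v in S (footnote 5).
    m, W_prev = {}, frozenset()
    for v in perm:
        W = W_prev | {v}
        m[v] = g(W) - g(W_prev)  # m(v_pi(i)) = g(W_i) - g(W_{i-1})
        W_prev = W
    return m

n = 5
V = list(range(n))
perm = [2, 0, 4, 1, 3]           # an arbitrary permutation pi
m = modular_upper_bound(g, perm)
chain = [frozenset(perm[:i]) for i in range(n + 1)]
for S in (frozenset(c) for r in range(n + 1) for c in combinations(V, r)):
    mS = sum(m[v] for v in S)
    assert mS >= g(S)            # m is an upper bound everywhere ...
    if S in chain:
        assert mS == g(S)        # ... and tight along the chain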
Algorithm 1 The Submodular–Supermodular Procedure (SSP)
Data: A finite ground set V, and a function h : 2^V → R expressed as h = f − g where f and g are supermodular functions.
Output: A local maximizer of h.
Initialize: Pick V0 ∈ 2^V arbitrarily.
At iteration k with current solution Vk:
  Choose a modular function mk so that: mk(S) ≥ g(S) for all S ∈ 2^V, and mk(Vk) = g(Vk).
  Define hk = f − mk.
  Select S ∈ argmax hk.
  if S = Vk and S is a local maximizer of h then stop; return S.
  else if S = Vk and S is not a local maximizer of h then find U with |U △ S| = 1 and h(U) > h(S); set V_{k+1} = U and iterate.
  else set V_{k+1} = S and iterate.
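For intuition, here is a brute-force sketch of Algorithm 1 (ours, suitable only for tiny ground sets; the anti-cycling rule is simplified to ‘‘first maximizer in a fixed enumeration’’, and the chain for mk is one passing through Vk, as described after the display):

from itertools import combinations

def ssp(f, g, V, V0):
    # Brute-force SSP sketch: maximize h = f - g over subsets of V.
    V = sorted(V)
    subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    h = lambda S: f(S) - g(S)
    Vk = frozenset(V0)
    while True:
        perm = sorted(Vk) + [v for v in V if v not in Vk]   # chain with W_|Vk| = Vk
        w, W = {}, frozenset()                              # Prop 2.4 upper bound m_k
        for v in perm:
            w[v] = g(W | {v}) - g(W)
            W = W | {v}
        hk = lambda S: f(S) - sum(w[v] for v in S)          # h_k = f - m_k
        S = max(subsets, key=hk)                            # first maximizer wins ties
        if S != Vk:
            Vk = S
            continue
        better = [Vk ^ {v} for v in V if h(Vk ^ {v}) > h(Vk)]
        if not better:
            return Vk                                       # Vk is a local maximizer of h
        Vk = better[0]

The sketch mirrors the display: when the argmax equals Vk but a strictly better neighbor exists, it moves to that neighbor rather than terminating.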
Proposition 2.4 details how a permutation π induces a chain of sets Γ = {∅ = W0, . . . , Wn = V} with Wi = {vπ(1), . . . , vπ(i)}. From the resulting chain of sets one may define a modular function m that is an upper bound of the supermodular function g and for which m(Wi) = g(Wi) for each Wi ∈ Γ. Therefore in iteration k of the SSP algorithm we can construct a modular upper bound mk by selecting a permutation πk so that W_|Vk| = Vk and defining mk as in the proposition. This is the same choice of modular upper bound as is given in [15,8]. Though at first it seems there is considerable discretion involved in constructing a modular upper bound, [15] explains why we only need to consider those modular functions constructed as in Proposition 2.4.

3. A pathological family of examples

3.1. An exponential worst case example for the Submodular–Supermodular Procedure

We now show that SSP can require exponentially many iterations to converge. Because our proof is somewhat technical, we first explain the intuition. The subsets of V correspond naturally to the vertices of the n-dimensional unit cube, and on the n-dimensional unit cube there is a Hamiltonian cycle that visits all vertices [18]. The idea, therefore, is to start with that cycle's attendant Hamiltonian path visiting all vertices of the unit n-cube, and translate that into an ordered sequence of subsets of 2^V, (T1, . . . , T_{2^n}). Next, we define a pair of supermodular functions f and g and show that at iteration k of SSP there is an admissible permutation πk whose attendant chain Γk contains Vk = Tk and T_{k+1}. Finally, we argue that T_{k+1} is the unique argmax of hk = f − mk, where mk is the modular approximation of g in iteration k induced by permutation πk, and that h(Vk = Tk) < h(T_{k+1}). Hence the algorithm will not terminate in iteration k, and will set V_{k+1} = T_{k+1}.

The first step is to develop several technical lemmas that establish a positive lower bound on the ‘‘error of approximation’’ mk − g.

Proposition 3.1. Let m be the modular approximation to g that is derived from a permutation π which has an attendant chain Γ. Then for any S ∉ Γ we have:

m(S) − g(S) = m(B) − g(S) + g(S \ B) + g(W) − g(W \ A) − m(A),

where W is any element of Γ, A = W \ S, and B = S \ W.

Proof. Observe that:

m(S) = m((S ∪ W) \ (W \ S)) = m((B ∪ W) \ A)
     = m(B) + m(W) − m(A)
     = m(B) + m(W) − m(A) + g(C) − g(C),

since B ∩ W = ∅ with m(∅) = g(∅) = 0 and A ⊂ B ∪ W; the last equality holds for any C ∈ 2^V. In particular, taking C = S \ B = W \ A we get m(S) = m(B) + m(W) − m(A) + g(S \ B) − g(W \ A). Finally, observe that since W ∈ Γ we have m(W) = g(W); then subtract g(S) from both sides of the equation and rearrange the right hand side to get m(S) − g(S) = m(B) − g(S) + g(S \ B) + g(W) − g(W \ A) − m(A), as desired. □

Next, observe a useful corollary of Proposition 2.4.

Proposition 3.2. Let g, m, Γ, S, A, B, and W be defined as in Proposition 3.1. Then g(W) − g(W \ A) − m(A) ≥ 0.

Proof. By Proposition 2.4, m(W \ A) − g(W \ A) ≥ 0; therefore (since A ⊆ W) we have m(W) − m(A) − g(W \ A) ≥ 0. Now W ∈ Γ and so m(W) = g(W), giving the desired result. □

We say that a function g is ‘‘highly non-modular’’ about some W ∈ 2^V if for any Γ containing W and m induced by Γ, we have m(S) − g(S) > 0 whenever S ∉ Γ. This means that m is tight only on those sets it must be tight on, and nowhere else.
In what follows, we demonstrate that g = |EKn(·)| is highly non-modular about any W ∈ 2^V with 0 < |W| < |V|, a crucial observation in the construction of our example.

Lemma 3.3. Let |V| > 2 and let g : 2^V → R be the supermodular function defined as g(S) = |EKn(S)| for all S ∈ 2^V, i.e. g(S) = |S|(|S| − 1)/2. Let W ∈ 2^V with 0 < |W| < |V| = n, and let Γ be a chain containing W. Finally, let m be the modular upper bound of g derived from Γ as in Proposition 2.4. Then m(S) − g(S) > 0 if S ∉ Γ.

Proof. Let W be as described in the statement of the proposition, and notice that m(vπ(i)) = i − 1 for i = 1, . . . , n (recall that π is the unique permutation of {1, . . . , n} corresponding to the chain Γ). Now for any S ∉ Γ we must have |S| = p ∈ {1, . . . , n − 1}, and hence S is incomparable to Wp, the pth element of the chain Γ. Let A = Wp \ S and let B = S \ Wp. Note that A ∩ B = ∅ and |A|, |B| ≥ 1. By Propositions 3.1 and 3.2 we have m(S) − g(S) ≥ m(B) − g(S) + g(S \ B). Denote B as {b1, . . . , bq}; then, to prove the claim, it suffices to show that m(B) − p(p − 1)/2 + (p − q)(p − q − 1)/2 > 0.

Without loss of generality, the bj are indexed so that bj precedes b_{j+1} in Γ for j = 1, . . . , q − 1. That is, if Wr is the least index element of Γ containing bj and Wt is the least index element of Γ containing b_{j+1}, then r < t. Since B ∩ Wp = ∅ and Wp = {vπ(1), . . . , vπ(p)}, this implies that the r satisfying vπ(r) = bj must be ≥ p + j for j = 1, . . . , q. Thus m(b1) ≥ p, m(b2) ≥ p + 1, and so m(B) ≥ p + (p + 1) + · · · + (p + q − 1) = pq + q(q − 1)/2. Hence m(B) − p(p − 1)/2 + (p − q)(p − q − 1)/2 ≥ q² > 0. □
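Lemma 3.3 can also be confirmed exhaustively for a small ground set; the check below (ours) iterates over every permutation, builds the Proposition 2.4 bound, and verifies that m − g vanishes exactly on the chain:

from itertools import combinations, permutations

def g(S):                                  # g(S) = |S|(|S|-1)/2; works on any sized container
    return len(S) * (len(S) - 1) // 2

n = 5
V = list(range(n))
subsets = [frozenset(c) for r in range(n + 1) for c in combinations(V, r)]
for perm in permutations(V):
    chain = {frozenset(perm[:i]) for i in range(n + 1)}
    w = {v: g(perm[:i + 1]) - g(perm[:i]) for i, v in enumerate(perm)}
    for S in subsets:
        gap = sum(w[v] for v in S) - g(S)  # m(S) - g(S)
        if S in chain:
            assert gap == 0                # tight on the chain (Prop 2.4)
        else:
            assert gap > 0                 # strictly positive off the chain (Lemma 3.3)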
Now we are ready for the first main result of this note, that SSP can require exponentially many iterations to converge. In particular, we construct an instance where SSP takes at least 2^{n−2} − 1 iterations to converge, thus establishing that SSP requires Ω(2^n) iterations to converge in the worst case.

Theorem 3.4. SSP can require 2^{n−2} − 1 iterations to converge.

Proof. Consider an n-element set V and let A = V \ {a, b} where a, b ∈ V; also denote {S ∪ a | S ∈ 2^A} as 2^A ∪ a. Each vertex of the unit (n − 2)-hypercube corresponds to an S ∈ 2^A, and thus to a T = S ∪ a ∈ 2^A ∪ a. Since there exists a Hamiltonian cycle on the (n − 2)-dimensional hypercube [18], there is an ordering (T1, . . . , T_{2^{n−2}}) of 2^A ∪ a so that the size of the symmetric difference between Tj and T_{j+1} is 1 for all j. This follows since we could order 2^A ∪ a according to a Hamiltonian path attendant to a Hamiltonian cycle on the unit (n − 2)-hypercube, where each jump from one vertex to the next adds or removes only one element. Now consider the set of partial chains Γ1, . . . , Γ_{2^{n−2}−1} given by Γj = (T′j, T′_{j+1}), where:

T′j = Tj if Tj ⊂ T_{j+1}, and T′j = T_{j+1} otherwise;
T′_{j+1} = Tj if Tj ⊃ T_{j+1}, and T′_{j+1} = T_{j+1} otherwise.

We may extend each Γj to a full ∅ − V chain Γ̂j by defining partial chains Γ^ℓ_j and Γ^u_j, where Γ^ℓ_j is an arbitrary chain from ∅ to T′j \ a, and Γ^u_j is an arbitrary chain from T′_{j+1} ∪ b to V. Then Γ̂j = {Γ^ℓ_j, Γj, Γ^u_j} is a ∅ − V chain. Note that for j < i, Γ̂j and Γ̂i overlap on elements of 2^A ∪ a if and only if i = j + 1. This is because the only members of Γ̂j in 2^A ∪ a are Tj and T_{j+1}, and the only members of Γ̂i in 2^A ∪ a are Ti and T_{i+1}. Furthermore, the only element of 2^A ∪ a that Γ̂j and Γ̂i overlap on is T_{j+1}.

Define g : 2^V → R as g(U) = |EKn(U)|. Notice that for any T ∈ 2^A ∪ a we have 0 < |T| < |V|. Thus by Lemma 3.3, if m is the modular approximation of g induced by a chain Γ that contains T, we must have m(U) − g(U) > 0 if U ∉ Γ. There are a finite number of sets T ∈ 2^A ∪ a and a finite number of chains Γ which pass through any of those T's; therefore there exists some ϵ > 0 such that for any T ∈ 2^A ∪ a, any U ∈ 2^V, and any chain Γ containing T, we have m(U) − g(U) > ϵ if U ∉ Γ (where m is defined with respect to Γ). Now define f : 2^V → R as f(U) = g(U) − δ(U), where

δ(U) = δj if U = Tj, and δ(U) = 2δ1 otherwise,

with ϵ/2 > δ1 > δ2 > · · · > δ_{2^{n−2}} > 0, and δ1, δ2, . . . , δ_{2^{n−2}} sufficiently small so that f is supermodular (this can be satisfied since g is strongly supermodular). In the remainder of this proof we show that, starting from an initial solution V1 = T1, SSP outputs a sequence of solutions V1, V2, . . . , V_{2^{n−2}}, with Vk = Tk. Thus SSP requires at least 2^{n−2} − 1 iterations to converge.

Take our starting solution V1 (instead of V0, to ease notation) = T1 and take Γ̂1 as the basis for our modular approximation of g, m1. At the beginning of iteration k (k < 2^{n−2}) we have current solution Vk = Tk and take Γ̂k as the chain used as the basis of our modular approximation mk of g about Vk. Defining hk = f − mk, we have

hk(U) = f(U) − g(U) = h(U) if U ∈ Γ̂k, and hk(U) ≤ f(U) − g(U) − ϵ ≤ −ϵ if U ∉ Γ̂k.

Since hk(Tk) = h(Tk) (as Tk ∈ Γ̂k) = −δk ≥ −δ1 > −ϵ/2 > −ϵ, we see that hk(Tk) > hk(U) for all U ∉ Γ̂k. Also hk(Tk) > −2δ1, so hk(Tk) > h(U) ≥ hk(U) for all U ∉ 2^A ∪ a. The only members of Γ̂k in 2^A ∪ a are Tk and T_{k+1}, and hk(T_{k+1}) = h(T_{k+1}) = −δ_{k+1} > −δk = hk(Tk), so hk is uniquely maximized at T_{k+1}. Thus V_{k+1} = T_{k+1}, and furthermore Γ̂_{k+1} is an admissible chain for approximating g about V_{k+1}. Hence SSP will produce at least 2^{n−2} solutions V1, V2, . . . , V_{2^{n−2}} with Vk = Tk (and h(Vk) < h(V_{k+1})), requiring at least 2^{n−2} − 1 iterations to converge. Notice that at the end of iteration 2^{n−2} − 1 SSP will arrive at V_{2^{n−2}} = T_{2^{n−2}} (which is a global maximizer of h) and must converge to this set in the next iteration. □
3.2. An exponential worst case example for the Supermodular–Submodular Procedure

If in our previous example we initialized with V1 = T1 and used a chain containing V1 and T_{2^{n−2}} (the unique global maximizer of h) as the basis for the modular approximation m1 of g (such a chain exists since (T1, . . . , T_{2^{n−2}}, T1) corresponds to a Hamiltonian cycle and therefore |T_{2^{n−2}} △ T1| = 1), then SSP would arrive at the optimal solution in one iteration (and terminate with that same solution in the following iteration), since T_{2^{n−2}} is also the unique argmax of h1 = f − m1. This motivates considering an algorithm where h is approximated about the current solution Vk by a function hk that is equal to h on all neighbors of Vk (those sets S ∈ 2^V with |S △ Vk| = 1). Iyer and Bilmes [8] took a similar approach in their Supermodular–Submodular Procedure: instead of approximating g with a modular upper bound, they approximate f with a modular lower bound.⁶ Their modular lower bounds are derived from the following (not necessarily modular) bounds of [16], which are tight if |S △ U| ≤ 1.

Proposition 3.5. Let f : 2^V → R be supermodular, let U, S ∈ 2^V and let v ∈ V. Define ρ^f_v : 2^V → R as ρ^f_v(U) = f(U ∪ v) − f(U). Then:

i. f(S) ≥ M^1_U(S) := f(U) − Σ_{vj ∈ U\S} ρ^f_{vj}(U \ vj) + Σ_{vj ∈ S\U} ρ^f_{vj}(U ∩ S);
ii. f(S) ≥ M^2_U(S) := f(U) − Σ_{vj ∈ U\S} ρ^f_{vj}((U ∪ S) \ vj) + Σ_{vj ∈ S\U} ρ^f_{vj}(U).

Proof. See [16]. □
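The two bounds are straightforward to compute. The sketch below (ours; f = |EKn(·)| is just a convenient supermodular test function) verifies both inequalities of Proposition 3.5 and their tightness when |S △ U| ≤ 1:

from itertools import combinations

def f(S):                                   # supermodular test function: |E_Kn(S)|
    return len(S) * (len(S) - 1) // 2

def rho(f, v, U):                           # rho_v^f(U) = f(U ∪ v) - f(U)
    return f(U | {v}) - f(U)

def M1(f, U, S):
    return (f(U) - sum(rho(f, v, U - {v}) for v in U - S)
                 + sum(rho(f, v, U & S) for v in S - U))

def M2(f, U, S):
    return (f(U) - sum(rho(f, v, (U | S) - {v}) for v in U - S)
                 + sum(rho(f, v, U) for v in S - U))

V = list(range(5))
subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
for U in subsets:
    for S in subsets:
        assert f(S) >= M1(f, U, S) and f(S) >= M2(f, U, S)   # lower bounds (Prop 3.5)
        if len(S ^ U) <= 1:
            assert M1(f, U, S) == f(S) == M2(f, U, S)        # tight near U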
We use M^i_k(S) to denote M^i_U(S) defined as above, with the current solution Vk in place of U. The Supermodular–Submodular Procedure proceeds from the current solution Vk by using the inequalities in Proposition 3.5 (with Vk in place of U) to define modular lower bounds m^1_k and m^2_k for f that are less than or equal to M^1_k and M^2_k, respectively. This results in two submodular functions h^1_k = m^1_k − g and h^2_k = m^2_k − g that are lower bounds for h. Both of these functions are approximately maximized (for example using the (1/2)OPT algorithm of [4]), yielding approximate solutions V^1_{k+1} and V^2_{k+1}. Then V_{k+1} is selected from {V^1_{k+1}, V^2_{k+1}} so as to maximize_{i=1,2} h(V^i_{k+1}).⁷

Rather than immediately constructing a worst case family of examples for the Supermodular–Submodular Procedure as it is stated in [8], we first show that the ‘‘variant’’ form of the Supermodular–Submodular Procedure, which we call SSP2 (stated below) and which uses the tighter lower bounds of Proposition 3.5, has exponential worst case convergence and converges to a local, not global, maximum. Then we show that this result also holds for a large family of lower bounds m^i_k that are weaker than the ones given in Proposition 3.5, including the particular choice of m^1_k and m^2_k used by the Supermodular–Submodular Procedure of [8].

Algorithm 2 A Variant of the Supermodular–Submodular Procedure (SSP2)
Data: A finite ground set V, and a function h : 2^V → R expressed as h = f − g where f and g are supermodular functions.
Output: A local maximizer of h.
Initialize: Pick V0 ∈ 2^V arbitrarily.
At iteration k with current solution Vk:
  Define m^i_k = M^i_k for i = 1, 2.
  Define h^i_k = m^i_k − g for i = 1, 2.
  Select V^i_{k+1} ∈ argmax h^i_k for i = 1, 2.
  Select V_{k+1} ∈ {V^1_{k+1}, V^2_{k+1}} to maximize_{i=1,2} h(V^i_{k+1}).
  if V_{k+1} = Vk then stop; return Vk. else iterate.

The lower bounds to f, the m^i_k, are not necessarily modular, but we stick with the ‘‘m’’ notation for consistency. As with SSP, an appropriate anti-cycling rule needs to be implemented when selecting V^i_{k+1} and also when selecting V_{k+1} from {V^1_{k+1}, V^2_{k+1}}. Applying a lexicographic rule to select an argmax of h^i_k, and selecting V_{k+1} as the lexicographically smaller of V^1_{k+1} and V^2_{k+1} in the case h(V^1_{k+1}) = h(V^2_{k+1}), will work. It can be shown that using both these anti-cycling rules obviates the need for an additional check to ensure that the set output by SSP2 is a local maximizer of h.

The key to our example is the following result on the maximum length of a circuit code of spread 3, given by Klee in [5] and based upon earlier results of Singleton.

Definition 3.6. A circuit code of spread 3 on the unit hypercube is a simple cycle (T1, T2, . . . , TK, T1) along the vertices of the hypercube such that if i − j = ±k (mod K) and |k| ≥ 3, then the Hamming distance between Ti and Tj is at least 3.

This definition implies that if Tk is an element of a circuit code of spread 3, then the only members Tj of that circuit with Hamming distance to Tk at most 2 are T_{k−2}, T_{k−1}, Tk, T_{k+1}, T_{k+2}. Also, since (T1, . . . , TK, T1) is a cycle, for any Tk in the circuit code, T_{k±1 (mod K)} have Hamming distance of 1 to Tk.

⁶ Corresponding to the modular upper bound supergradients of [8].
⁷ There are variations of the Supermodular–Submodular Procedure which alternately choose V_{k+1} as a maximizer of h^1_k or h^2_k. Our analysis assumes the procedure is run as described in SSP2, to achieve the greatest increase in h per iteration.
Proposition 3.7. Let (T1, T2, . . . , TK, T1) be a maximum length circuit code of spread 3 on the unit hypercube of dimension n (n ≥ 6). Then K ≥ 32 · 3^{(n−8)/3}.

Proof. See [5], Chapter 17. □

The idea behind our worst case example for SSP2 is similar to the idea behind the example for SSP. We show that for f(S) = |EKn(S)| both lower bounds for f given in Proposition 3.5 are strictly smaller than f by a positive constant for sets S with |S △ Vk| ≥ 3. We then define a function δ : 2^V → R+ that is strictly decreasing along a subpath of a circuit code of spread 3 of length at least 32 · 3^{(n−8)/3} − 5, and is equal to a relatively large constant everywhere else except for one optimal set (which the algorithm will never visit). Finally we maximize h = f − g where g(S) = f(S) + δ(S), and show that SSP2 must ‘‘visit’’ at least half of the members of the subpath before converging. Unlike Theorem 3.4, however, in this example we converge to a suboptimal solution.

In what follows we prove Propositions 3.8 and 3.9 assuming that m^i_k = M^i_k for i = 1, 2. Our main result (Theorem 3.10) uses only the facts that m^i_k ≤ f and that m^i_k satisfies Propositions 3.8 and 3.9 for i = 1, 2. This allows us to extend the result of Theorem 3.10 to any lower bound functions m^i_k satisfying both Propositions 3.8 and 3.9. In particular this includes the lower bounds used in the Supermodular–Submodular Procedure.

Proposition 3.8. For a current solution Vk, max_{i=1,2} m^i_k(S) = f(S) for all S with |S △ Vk| ≤ 1; therefore max_{i=1,2} h^i_k(S) (= m^i_k(S) − g(S)) = h(S) for all S with |S △ Vk| ≤ 1. Also m^i_k(Vk) = f(Vk) for i = 1, 2.

Proof. If |S △ Vk| ≤ 1 then S = Vk or S = Vk ∪ vj or S = Vk \ vj for some vj ∈ V. In all 3 cases it is trivially true that m^i_k(S) = f(S), and thus max_{i=1,2} m^i_k(S) = f(S), and m^i_k(Vk) = f(Vk) for i = 1, 2. □

Proposition 3.9. Let f : 2^V → R be defined as f(U) = |EKn(U)| and let m^i_k, i = 1, 2, be the lower bounds of f defined relative to Vk ∈ 2^V. Then for all S ∈ 2^V with |S △ Vk| ≥ 3 we have f(S) − m^i_k(S) ≥ 1 for i = 1, 2.

Proof. Let n1 = |Vk|, n2 = |S|, k1 = |Vk \ S|, and k2 = |S \ Vk|. Since f(U) = |U|(|U| − 1)/2 for all U ∈ 2^V, it is simple to compute f(S) and each of the lower bounds m^i_k(S) for i = 1, 2. After some elementary algebraic manipulation we observe that both lower bounds yield the same value, and we have f(S) − m^i_k(S) = ((k1² + k2²) − (k1 + k2))/2 for i = 1, 2. If |S △ Vk| ≥ 3 then at least one of k1 or k2 is ≥ 2, and so f(S) − m^i_k(S) ≥ 1 for i = 1, 2. □
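The closed form in the proof of Proposition 3.9 can be checked against the Proposition 3.5 bounds directly; this sketch (ours, repeating the helpers from the sketch after Proposition 3.5 so that it runs on its own) does so exhaustively for n = 5:

from itertools import combinations

def f(S):
    return len(S) * (len(S) - 1) // 2

def rho(f, v, U):
    return f(U | {v}) - f(U)

def M1(f, U, S):
    return (f(U) - sum(rho(f, v, U - {v}) for v in U - S)
                 + sum(rho(f, v, U & S) for v in S - U))

def M2(f, U, S):
    return (f(U) - sum(rho(f, v, (U | S) - {v}) for v in U - S)
                 + sum(rho(f, v, U) for v in S - U))

subsets = [frozenset(c) for r in range(6) for c in combinations(range(5), r)]
for U in subsets:
    for S in subsets:
        k1, k2 = len(U - S), len(S - U)
        gap = (k1 * k1 + k2 * k2 - k1 - k2) // 2             # closed form from the proof
        assert f(S) - M1(f, U, S) == gap == f(S) - M2(f, U, S)
        if len(S ^ U) >= 3:
            assert gap >= 1                                  # the bound used in Theorem 3.10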
Before proceeding we call attention to two items: first, we make a key assumption that each h^i_k is maximized exactly in each iteration; second, we abandon our previous assumption that g is normalized.

Theorem 3.10. For sufficiently large n (n ≥ 6 suffices) SSP2 can require ⌊16 · 3^{(n−8)/3} − 3⌋ iterations to converge. Furthermore, it can converge (after so many iterations) to a set that is a local, not global, optimum.

Proof. There is a natural correspondence between vertices of the n-dimensional unit cube and the elements of 2^V (with |V| = n), so we shall treat vertices and elements interchangeably.⁸ Let (T1, T2, . . . , TK, T1) be a maximum length circuit code of spread 3; then by Proposition 3.7, K ≥ 32 · 3^{(n−8)/3}. Define f : 2^V → R as f(S) = |EKn(S)| and define δ : 2^V → R as

δ(S) = δj if S = Tj for j = 1, . . . , K − 5; δ(S) = 0 if S = T_{K−2}; and δ(S) = 2δ1 otherwise,

where 1/2 > δ1 > δ2 > · · · > δ_{K−5} > 0, and δ1, . . . , δ_{K−5} are sufficiently small so that g = f + δ is supermodular (again, this is possible since f is strictly supermodular). Now apply SSP2 to find a local maximizer of h = f − g, starting from initial solution V1 = T1. Observe that at iteration k we approximate h with the functions h^i_k = m^i_k − g for i = 1, 2, where the m^i_k are defined as in the presentation of SSP2.

We now show that whenever Vk = Tj for j ∈ {1, . . . , K − 6}, we have V_{k+1} = T_{j+1} or T_{j+2}. Note that for i = 1, 2, h^i_k(Vk) = h(Tj) (by Proposition 3.8) = −δj > −2δ1 and > −δr (for r < j), so h^i_k(Vk) > h(S) ≥ h^i_k(S) for all S ∉ {Tj, . . . , T_{K−5}, T_{K−2}}. Thus any argmax of h^i_k must be in {Tj, . . . , T_{K−5}, T_{K−2}}. For S with |S △ Vk| ≥ 3, h^i_k(S) ≤ h(S) − 1 (by Proposition 3.9) ≤ −1 < −δj = h^i_k(Vk), and so only Tj, T_{j+1}, or T_{j+2} can be an argmax of h^i_k for i = 1, 2 (recall that (T1, . . . , TK, T1) formed a circuit code of spread 3, so Tj, T_{j+1}, T_{j+2} are the only elements S of {Tj, . . . , T_{K−5}, T_{K−2}} satisfying |S △ Vk| ≤ 2). Notice that if j = K − 6, then only Tj or T_{j+1} can be an argmax, by definition of δ. Because V_{k+1} = V^1_{k+1} or V^2_{k+1}, V_{k+1} is an argmax of h^1_k or h^2_k, and this implies V_{k+1} ∈ {Tj, T_{j+1}, T_{j+2}}. Finally, by Proposition 3.8, for some i* ∈ {1, 2} we have h^{i*}_k(T_{j+1}) = h(T_{j+1}) = −δ_{j+1} > −δj = max_{i=1,2} h^i_k(Vk) (since j ≤ K − 6). Therefore the argmax of h^{i*}_k is in {T_{j+1}, T_{j+2}}, and since h(V_{k+1}) ≥ h(V^{i*}_{k+1}) > h(Vk) we must have V_{k+1} ∈ {T_{j+1}, T_{j+2}}.

Because we initialize SSP2 with V1 = T1, it follows that for some index k, Vk = T_{K−7} or T_{K−6}. For that value of k, either V_{k+1} = T_{K−5} or T_{K−6}; in the latter case V_{k+2} = T_{K−5}, following from the preceding argument. Hence for some r, Vr = T_{K−5}.

⁸ Observe that the Hamming distance between two vertices of the n-dimensional unit cube equals the size of the symmetric difference between their attendant sets.
Now h^i_r(Vr) = −δ_{K−5} > h(S) ≥ h^i_r(S) for all S ∉ {T_{K−5}, T_{K−2}} and i = 1, 2. And since |T_{K−5} △ T_{K−2}| ≥ 3, we have by Proposition 3.9 that h^i_r(T_{K−2}) ≤ −1 for i = 1, 2. Thus T_{K−5} is the unique argmax of h^1_r and h^2_r, so V_{r+1} = T_{K−5}, and thus SSP2 converges to T_{K−5}. Hence SSP2 requires at least ⌊K/2 − 3⌋ ≥ ⌊16 · 3^{(n−8)/3} − 3⌋ iterations to converge (as each iteration before convergence ‘‘increases the index of Tj’’ by 1 or 2). In addition, it converges to a set that is a local, not global, maximizer. □

Corollary 3.11. For sufficiently large n (n ≥ 6 suffices), for any choice of lower bounds m^1_k and m^2_k ≤ f satisfying m^1_k ≤ M^1_k and m^2_k ≤ M^2_k, and for which Proposition 3.8 holds, SSP2 can require ⌊16 · 3^{(n−8)/3} − 3⌋ iterations to converge. Furthermore it can converge (after so many iterations) to a set that is not a global optimum.

Proof. The proof of Theorem 3.10 relied only upon m^1_k and m^2_k satisfying Propositions 3.8 and 3.9, and not the particular choices m^1_k = M^1_k and m^2_k = M^2_k. So to prove the claim it suffices to show that Proposition 3.9 holds, which is true by the assumption that m^1_k ≤ M^1_k and m^2_k ≤ M^2_k. The claim then follows from replicating the proof of Theorem 3.10. □

Consider the particular choice of lower bounds m^1_k and m^2_k proposed by Iyer and Bilmes in the Supermodular–Submodular Procedure:

m^1_k(S) = f(Vk) − Σ_{vj ∈ Vk\S} ρ^f_{vj}(Vk \ vj) + Σ_{vj ∈ S\Vk} ρ^f_{vj}(∅),
m^2_k(S) = f(Vk) − Σ_{vj ∈ Vk\S} ρ^f_{vj}(V \ vj) + Σ_{vj ∈ S\Vk} ρ^f_{vj}(Vk).

Clearly m^i_k(S) = f(S) for S = Vk, m^1_k(S) = f(S) for S = Vk \ vj, and m^2_k(S) = f(S) for S = Vk ∪ vj. Thus m^1_k and m^2_k satisfy Proposition 3.8. Furthermore, by Proposition 2.2, ρ^f_{vj}(U) ≤ ρ^f_{vj}(U′) if U ⊂ U′ and vj ∉ U, U′, so m^1_k(S) ≤ M^1_k(S) and m^2_k(S) ≤ M^2_k(S). Hence the assumptions of Corollary 3.11 hold, showing that the Supermodular–Submodular Procedure of [8] (when implemented as in the description of SSP2 and with the definition of m^1_k and m^2_k as above) can require ⌊16 · 3^{(n−8)/3} − 3⌋ iterations to converge, and in that case converges to a local, not global, maximizer.

4. Conclusions and relationship to other convergence results

When Narasimhan and Bilmes [15] first introduced SSP, its worst case complexity was left as an open problem. Although finding a local minimum of the difference of two submodular functions is a PLS-complete problem [8], given the promising empirical performance of SSP it seemed plausible that a very weakly superpolynomial or sub-exponential bound on the worst case complexity was still attainable. Furthermore, Iyer and Bilmes [8] established a polynomial bound on the number of iterations required for the ‘‘ϵ-approximate’’ version of SSP to converge. We restate that result here in terms of maximizing the difference between supermodular functions, for consistency.

Definition 4.1. An ‘‘ϵ-approximate’’ version of SSP for maximizing a negative (i.e. < 0 everywhere) function h expressed as the difference between two supermodular functions is an implementation of SSP where we require h(Vk) ≤ (1 + ϵ)h(V_{k+1}) in order to proceed to iteration k + 1, where ϵ is a fixed positive constant.

Proposition 4.2. Let h be a negative function expressed as the difference of two supermodular functions, let ϵ be a constant > 0, let V0 = ∅ and V1 be the first two solutions output by ϵ-approximate SSP, and let M < 0 be an upper bound of max_{S ∈ 2^V} h(S).
Then ϵ-approximate SSP converges in O(log(|h(V1)|/|M|)/ϵ) iterations.

Proof. See [8] for the proof and a discussion on computing M. □
Our findings add to the literature on SSP in two ways. First, we demonstrate a family of examples where SSP takes at least 2^{n−2} − 1 iterations to converge, which is tight (up to a factor of 4) when one assumes that SSP is implemented with an anti-cycling rule. Thus we show that, despite strong empirical performance, there is no sub-exponential bound on SSP's worst case complexity. Second, observe that in our worst case example for SSP (Theorem 3.4) the algorithm outputs a sequence of solutions V1, V2, . . . , V_{2^{n−2}} with h(Vk) = −δk. Since the δk's may be chosen to satisfy both the requirements of Theorem 3.4 and also −δk ≤ −(1 + ϵ)δ_{k+1} for any fixed ϵ > 0, this demonstrates that ϵ-approximate SSP can also require 2^{n−2} − 1 iterations to converge. (This does not violate Proposition 4.2, as 2^{n−2} − 1 would then be polynomial in the encoding length of h.)

We also consider the related Supermodular–Submodular Procedure of [8] and show that, for a more general class of algorithms to which it belongs, it is possible to require ⌊16 · 3^{(n−8)/3} − 3⌋ iterations to converge and still converge to a local, not global, optimizer. Unlike SSP, the fact that the Supermodular–Submodular Procedure has worst case convergence (as opposed to complexity) that is not bounded by a fixed polynomial is not implied by the PLS-completeness of minimizing the difference between two submodular functions. This is because each iteration of the Supermodular–Submodular Procedure requires finding a maximizer of a submodular function (in our framework of maximizing the difference between supermodular functions), which is NP-hard in general.
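For concreteness, a geometric choice of gaps satisfies the progress requirement −δk ≤ −(1 + ϵ)δ_{k+1} while respecting the ceiling δ1 < ϵ′/2 from Theorem 3.4 (we ignore the further ‘‘small enough for supermodularity’’ scaling, which only shrinks δ1; the numerical values below are illustrative, not from the note):

eps = 0.1                    # approximation parameter of Definition 4.1 (illustrative)
eps_gap = 0.05               # the chain-gap constant from the proof of Theorem 3.4 (illustrative)
n = 10
deltas = [(eps_gap / 4) / (1 + eps) ** k for k in range(2 ** (n - 2))]
assert eps_gap / 2 > deltas[0] > deltas[-1] > 0              # Theorem 3.4 ceiling and positivity
assert all(abs(d1 / d2 - (1 + eps)) < 1e-9                   # -delta_k <= -(1+eps)*delta_{k+1}
           for d1, d2 in zip(deltas, deltas[1:]))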
Even though our main result is negative, it does have some positive implications for the SSP algorithm. Observe that while in our example SSP requires exponentially many iterations to converge, when it converges it finds a global maximizer of f − g. And the empirical performance of SSP [15,8] has been much better than the worst case presented here. Thus it appears that examples such as the one we presented are not common in practice. The situation where an algorithm can require an exponential number of iterations in the worst case but often converges quickly is mirrored by the performance of the Simplex method. Therefore it is not unreasonable to think that SSP might also converge in a polynomial or sub-exponential number of iterations on average [19,9].

In addition, the difference in behavior of our worst case examples for SSP and the Supermodular–Submodular Procedure may convey further insight on how these algorithms differ. For example, using the worst case example constructed for SSP, the Supermodular–Submodular Procedure would have found a global maximizer in one iteration. However, the Supermodular–Submodular Procedure could be forced to take exponentially many iterations to converge by only specifying the functions f, g and the initial solution V1 = T1. To force SSP to take exponentially many iterations to converge we also needed to specify the choice of the chain that was used in each iteration for the approximation of h. These varied strengths suggest that it may be possible to design a hybrid algorithm that does not require exponentially many iterations to converge in the worst case.

Acknowledgments

The author thanks the late Prof. Alan J. Goldman for his detailed critique of an earlier version of this note, and two anonymous referees for their exceptionally helpful feedback.

References

[1] K. Byrnes, Maximizing general set-functions by submodular decomposition, 2009, arXiv preprint arXiv:0906.0120.
[2] J. Edmonds, Submodular functions, matroids, and certain polyhedra, in: M. Jünger, G. Reinelt, G. Rinaldi (Eds.), Combinatorial Optimization – Eureka, You Shrink!, Springer-Verlag, New York, NY, 2003, pp. 11–26.
[3] U. Feige, V. Mirrokni, J. Vondrák, Maximizing non-monotone submodular functions, SIAM J. Comput. 40 (2011) 1133–1153.
[4] M. Feldman, N. Buchbinder, J. Naor, R. Schwartz, A tight linear time (1/2)-approximation for unconstrained submodular maximization, in: FOCS, 2012, pp. 649–658.
[5] B. Grünbaum, Convex Polytopes, second ed., Springer-Verlag, New York, NY, 2003.
[6] R. Horst, N.V. Thoai, DC programming: an overview, J. Optim. Theory Appl. 103 (1999) 1–43.
[7] S. Iwata, L. Fleischer, S. Fujishige, A combinatorial strongly polynomial algorithm for minimizing submodular functions, J. ACM 48 (2001) 761–777.
[8] R. Iyer, J. Bilmes, Algorithms for approximate minimization of the difference between submodular functions, with applications, in: Uncertainty in Artificial Intelligence (UAI), 2012.
[9] G. Kalai, Linear programming, the simplex algorithm and simple polytopes, Math. Program. 79 (1997) 217–233.
[10] Y. Kawahara, T. Washio, Prismatic algorithm for discrete D.C. programming problem, in: NIPS, 2011, pp. 2106–2114.
[11] D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in: Proc. of the Ninth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2003.
[12] A. Krause, C. Guestrin, Near-optimal nonmyopic value of information in graphical models, in: Uncertainty in Artificial Intelligence (UAI), 2005.
[13] A. Krause, C. Guestrin, A. Gupta, J. Kleinberg, Near-optimal sensor placements: maximizing information while minimizing communication cost, in: Proc. of Information Processing in Sensor Networks (IPSN), 2006, pp. 2–10.
[14] H. Lin, J. Bilmes, Multi-document summarization via budgeted maximization of submodular functions, in: NAACL, 2010.
[15] M. Narasimhan, J. Bilmes, A submodular-supermodular procedure with applications to discriminative structure learning, in: Uncertainty in Artificial Intelligence (UAI), 2005.
[16] G.L. Nemhauser, L.A. Wolsey, M.L. Fisher, An analysis of approximations for maximizing submodular set functions–I, Math. Program. 14 (1978) 265–294.
[17] A. Schrijver, A combinatorial algorithm minimizing submodular functions in strongly polynomial time, J. Combin. Theory Ser. B 80 (2000) 346–355.
[18] S. Skiena, Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica, Addison-Wesley, 1990.
[19] M.J. Todd, Polynomial expected behavior of a pivoting algorithm for linear complementarity and linear programming problems, Math. Program. 35 (1986) 173–192.