ARTICLE IN PRESS Computers & Operations Research 37 (2010) 1348–1358
Contents lists available at ScienceDirect
Computers & Operations Research journal homepage: www.elsevier.com/locate/caor
Extended and discretized formulations for the maximum clique problem Pedro Martins ~ Operacional - FC/UL and ISCAC - Instituto Polite´cnico de Coimbra, Portugal CIO - Centro de Investigac- ao
a r t i c l e in f o
a b s t r a c t
Available online 29 October 2009
The maximum clique (MC) problem has long been concentrating the attention of many researchers within the combinatorial optimization community. Most formulations proposed in the literature for the MC problem were adapted from the maximum independent set (MIS) problem, which is known to be equivalent to the MC. In fact, independent sets can be easily modelled within the natural variables space. In the present paper we propose new formulations for the MC problem, using additional and extra indexed variables, leading to extended and discretized formulations. The number of variables and constraints of the new models depend on the range of variation of an interval containing the clique number (o) of the graph. Therefore, tight lower and upper bounds for o may strongly benefit the dimension of the new models. Computational results have been conducted on some DIMACS benchmark and biological instances. These results indicate that, among sparse graphs, the linear programming relaxation of the discretized formulations may produce stronger upper bounds than the known models, being faster to find an optimum clique when embedded into the branch-and-bound. Furthermore, the new models can be used to address other related problems, namely to find a maximum size k-plex, or a maximum size component involving two overlapping cliques. & 2009 Elsevier Ltd. All rights reserved.
Keywords: Maximum clique problem Extended formulations Discretized formulations
1. Introduction Let G ¼ ðV; EÞ be a simple undirected graph, where V ¼ f1; . . . ; ng is the set of vertices and E the set of edges. A clique of G is a subset C DV such that ði; jÞ A E for all i; j A C. In the maximum clique (MC) problem we want to find a maximum cardinality clique among all cliques in G. If we are simply interested in the size of the maximum clique, then we only want to know the clique number of G, denoted by oðGÞ. An independent set of G is a subset C D V such that ði; jÞ= 2E for all i; j A C. The maximum independent set (MIS) problem seeks for an independent set of maximum cardinality. It is well known that a clique of G is an independent set on the complementary graph G ¼ ðV; EÞ, where E ¼ fði; jÞ A V V : i aj and ði; jÞ= 2Eg. Then, there is an equivalence relationship among both MC and MIS problems, in the sense that solving one of them is equivalent to solve the other one in the complementary graph. Karp [24] showed in 1972 that the decision version of the MC problem is NPcomplete. Furthermore, there is no polynomialtime approximation algorithm for it, unless P ¼ NP [12]. In fact, the problem is not approximable within n1=4e , for any e 40 [4].
Tel.: + 351 239 802000.
E-mail address:
[email protected] 0305-0548/$ - see front matter & 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.cor.2009.10.010
There is quite a number of applications involving maximum cliques. In some of them, to find a maximum clique is just a subproblem of a wider system, as is most common in real applications. These arise in coding theory, fault detection and diagnosis, computer vision and pattern recognition, data mining, telecommunications and wireless networks, as suggested in [5,7,11,22,25]. Also in biological systems, like spot matching for two-dimensional gel electrophoresis images, protein structure alignment and protein side-chain packing, as suggested in [1,9,17,32]. Furthermore, the detection of functional modules in molecular networks were studied in [31], and the identification of strongly related groups of proteins have been seek within protein–protein interaction networks in [16]. In [8] the authors showed how clique detection procedures can help in other biochemistry and genomic problems. Since 1949, when Luce and Perry [26] have first introduced the concept of cliques, a lot of work has been done on it. In a broad sense and considering the attempts to find maximum cliques, we may divide these works into, bounding, exact and heuristic methodologies. Up to 1999, most of these contributions have been described in the extended survey of Bomze et al. [7]. After 1999, bounding properties and methods have been proposed in [13,29] on improvements of the Lova´sz–Schrijver bound for oðGÞ. Also in [27,10] proposing a quadratic characterization and a heuristic method to approximate the Lova´sz theta number, and in [18] proposing a sequential elimination algorithm that is guaranteed
ARTICLE IN PRESS P. Martins / Computers & Operations Research 37 (2010) 1348–1358
to improve upon a given upper bound. In [18], the authors also use their methodology to improve given lower bounds. Still after 1999, exact enumerative algorithms have been described in [28,15,30,33]. Heuristic methods have also been proposed in many works, some of them applying known heuristic and metaheuristic techniques adapted to the MC problem. Among those published after 1999, we detach the work of Battiti and Protasi [3] (also mentioned in [7]) that use memory to adjust the algorithm parameters. Outstanding heuristic results, both in terms of effectiveness and computing times, can also be found in [20,21]. Although following different metaheuristic ideas, all these three works are based or include local search techniques. Known formulations for the MC problem resort to modelling aspects of the MIS problem. As mentioned above, we can model cliques by modelling independent sets on the complementary graph. In the present paper we propose new formulations directly describing cliques on the original graph G. To do it, we resort to additional variables and extra indexes on the original vertex variables, conducting to extended and discretized formulations. The new models’ number of variables and constraints is influenced by the range of variation of an interval containing oðGÞ, claiming for tight bounds to those intervals. Furthermore, even if we could possibly know in advance the value of oðGÞ, conducting to the tightest interval of all, the problem would still be hard to solve, if our goal is to find an associated subset of vertices. For very sparse graphs, the new models revealed to be very competitive even without prior bounds for oðGÞ. This paper is organized as follows. Known and new formulations to the MC problem are presented in the next section. It also includes ideas for strengthening the LP relaxation of these models. Furthermore, we show how to adapt these formulations to address other clique related problems. Computational tests comparing the models and using DIMACS and biological instances are given in Section 3. Last section is devoted to concluding remarks.
2. Formulations Within this and the forthcoming sections, for a given integer programming formulation F, we denote by FL its linear programming (LP) relaxation. 2.1. Known formulations Most known formulations for the MC problem are natural models in the sense that they simply use one variable for each vertex, indicating whether it belongs to the solution or not, that is, the binary variables xi for all iA V, taking value 1 if vertex i is in the solution and 0 otherwise. Thus, in general terms, the MC problem can be formulated as X xi ð1Þ maximize iAV
subject to
x A fall cliques of Gg
ð2Þ
xi A f0; 1g;
ð3Þ
8iA V
for x ¼ fx1 ; x2 ; . . . ; xn g. We can find in the literature a number of ways to characterize constraints (2). Two long known characterizations are (see, e.g. [19]):
may belong to the clique, that is xi þ xj r1;
8ði; jÞ A E
ð4Þ
the Independent Set Formulation (here denoted by ISF) that imposes that any clique of G can contain no more than a single vertex from any maximal independent set of G, where the set of all maximal independent sets of G is represented by S, thus X xi r 1; 8s A S ð5Þ iAs
More recently, Della Croce and Tadei [14] proposed an aggregated version of formulation EdgeF, leading to the following characterization of set (2): X
xj r jNðiÞjð1 xi Þ;
8i A V
ð6Þ
j A N ðiÞ
for NðiÞ ¼ V\ðNðiÞ [ figÞ, the set of vertices not adjacent to vertex i. In fact, if we sum in j all constraints (4) we obtain constraints (6), for each i A V. This means that inequalities (4) imply inequalities (6). This formulation is here denoted by AEdgeF. Furthermore, it can also be shown that all constraints (4) are implied by inequalities in (5). 2.2. New formulations All the models previously described arise from independent set characterizations on the complementary graph, or use independent sets explicitly to characterize cliques. To our best knowledge, we have not found any formal characterization of cliques on G using only the natural set of variables xi and having polynomial dimension. In fact, we believe that such a formulation may not exist as it would possibly require an exponential number of constraints. However, if we resort to extra variables or extra index information on the original set of vertex variables, then we can obtain compact characterizations of cliques on G. The present section is devoted to such models. We start by defining the extra set of clique variables: ( 1 if the clique size is equal to q; 8qA Q wq ¼ 0 otherwise; with Q ¼ fqmin ; . . . ; qmax g. In the worst case we can have qmin ¼ 1 and qmax ¼ n, assuming G to be a simple graph of any kind, including the empty graph (with E ¼ |) and the complete graph (with E ¼ fði; jÞ A V V : i ajg). However, a preliminary study of the graph in hands can reduce the range of variation of Q, namely by using known methods in the literature on bounds for oðGÞ. From this point on, we assume that G is not an empty graph (jEjZ 1), which implies that qmin Z2. Considering the extra clique variables previously described, we propose the following Extended Formulation (EF) for the MC problem: X xi maximize iAV
subject to
X
xj Z ðq 1Þxi ðq 2Þð1 wq Þ; 8i A V; 8q A Q
ð7Þ
j A NðiÞ
the Edge Formulation (here denoted by EdgeF) that force the independent set property in the complementary graph G, imposing that for any pair ði; jÞ A E, only one of the vertices i or j
1349
ðEFÞ
X iAV
xi ¼
X qAQ
qwq
ð8Þ
ARTICLE IN PRESS 1350
P. Martins / Computers & Operations Research 37 (2010) 1348–1358
X
wq ¼ 1
ð9Þ
qAQ
xi A f0; 1g; 8iA V wq A f0; 1g; 8q A Q
ð10Þ
For the single clique size q determined, imposed by (9) and (10), inequalities (7), (8) and (3) force that each of the q vertices belonging to the clique must have at least ðq 1Þ neighbors in the clique. Thus, the vertices in the clique induce a complete subgraph of G. Furthermore, if the clique has size other than q (wq ¼ 0), then inequalities (7) are redundant, provided that G is connected. Inequalities (7) can be further strengthen if the clique size information is brought into the vertices variables, conducting to the following discretized variables: ( 1 if vertice i is in a clique of size q; xqi ¼ 8i A V; 8q A Q 0 otherwise; Both vertex variables are related by equality X q xi ; 8i A V xi ¼
ð11Þ
qAQ
Using these variables we can write an improved version of constraints (7), X q xj Z ðq 1Þxqi ; 8i A V; 8q A Q ð12Þ j A NðiÞ
and also to disaggregate Eq. (8), in the form X q xi ¼ qwq ; 8q A Q
ð13Þ
iAV
Thus, we propose the following Discretized Formulation (DF) for the MC problem: XX q xi ð14Þ maximize i A Vq A Q
ðDFÞ subject to ð12Þ; ð13Þ; ð9Þ; ð10Þ xqi A f0; 1g;
8i A V; 8qA Q
Model
Variables
Constraints
EdgeF
n
AEdgeF EF DF ADF
n n þ jQ j ðn þ 1ÞjQ j ðn þ 1ÞjQ j
jEj n njQ j þ 2 ðn þ 1ÞjQ jþ 1 n þ jQ jþ 1
of constraints if graph G is sparse. Furthermore, when G is sparse, inequalities (6) from AEdgeF are denser than inequalities (7) from EF and (12) from DF. On the contrary, when G is dense, constraints (7) and (12) are denser than (6). In fact, the size of set Q can significantly increase the dimension of the new discretized models. However, that size can be much reduced if we resort to known results on bounding methodologies for oðGÞ, in the literature. Actually, there is quite some work done on it, providing outstanding practical results that may considerably help our formulations (see, e.g. [18,10]). For some instances, those methods have been able to find the exact value of oðGÞ. In those cases, it closes the discussion for the identification of oðGÞ, but it does not provide a clique set of vertices S with jSj ¼ oðGÞ. Moreover, even if oðGÞ is known in advance, there is no polynomial time algorithm to solve problem MC, otherwise we would have a polynomial time algorithm to solve the general MC problem, something known not to be truth, unless P ¼ NP. In effect, when oðGÞ is given in advance, leading to jQ j ¼ 1, then all formulations here proposed can be reduced to the feasibility P q problem fxq A f0; 1gn : (12) and i A V xi ¼ qg, with q ¼ oðGÞ. This model can be projected onto the fxi g variables space, becoming ( ) X X n x A f0; 1g : xj Z ðq 1Þxi ; 8i A V and xi ¼ q ð17Þ j A NðiÞ
iAV
It has the same number of variables and almost the same number of constraints as AEdgeF, which is known to be the smallest formulation for the MC problem.
ð15Þ
As in model EF, any integer solution of the discretized formulation forces the selection of a single dimension for the clique size, guaranteed by (9) and (10). Furthermore, the vertices belonging to that clique must induce a complete subgraph of G (imposed by (12)). Equalities (13) link the two sets of variables. Both models have nearly the same number of constraints. However, DF has much more variables than EF, specially when jQ j is large. In fact, without any reductions, the LP relaxation of DF can be hard to handle for large values of jQ j. Thus, we suggest another version of DF that decreases its number of constraints by aggregating in q inequalities (12), leading to X X q X xj Z ðq 1Þxqi ; 8iA V ð16Þ q A Q j A NðiÞ
Table 1 Number of variables and constraints in all models under discussion.
qAQ
The new Aggregated Discretized Formulation (ADF) can be obtained from DF, substituting (12) by (16). The set of constraints in ADF still characterize all cliques of size qA Q in G, because constraints (8)–(10) guarantee that a single clique size q is going to be chosen. The number of variables and constraints in all polynomial sized models discussed in the present section is summarized in Table 1. When jQ j is large, the new discretized models DF and ADF have many more variables than the others. On the other hand, models EF and DF may also have many more constraints than the aggregated formulations AEdgeF and ADF, again, for large values of jQ j. However, model EdgeF may be the largest one on the number
2.3. Strengthening the models In order to strengthen the linear programming relaxation of the models, we propose two tests. The first is a vertex elimination test, while the second one involves the identification of pairs of vertices that can never belong to a size-q clique. This test induces the identification of additional constraints to strengthen the LP relaxation of all these models.
Test 1: For any iA V and q A Q , if jNðiÞj o q 1, then variable xqi
can be removed from models DF and ADF, as well as all constraints involving the pair ði; qÞ. Furthermore, for any i A V, if jNðiÞj o qmin 1, then variable xi can be removed from models EdgeF, AEdgeF and EF, and also all constraints involving vertex i. Test 2: For any pair i; j A V for which ði; jÞ A E and jNðiÞ\ Pqmax q NðjÞj o q 2, the following is a valid inequality q ¼ q0 xi þ P Pqmax q qmax q 0 q ¼ q0 xj r q ¼ q0 w , for q ¼ minfq A Q : jNðiÞ\ NðjÞj o q 2g. This constraint can strengthen models DF L and ADF L . Furthermore, for any pair i; j A V with ði; jÞ A E, if jNðiÞ \ NðjÞj o qmin 2, then constraint xi þxj r 1 should strengthen models EdgeF L , AEdgeF L and EF L .
To understand Test 2, we can state that for any pair i; j A V with ði; jÞ A E and q A Q , if jNðiÞ \ NðjÞj oq 2, then vertices i and j cannot belong to the same size-q clique, which can be guaranteed by the valid inequality xqi þ xqj r wq . As a single clique size is going
ARTICLE IN PRESS P. Martins / Computers & Operations Research 37 (2010) 1348–1358
to be chosen, then we consider an aggregated version of these constraints, obtained by summing in q all these inequalities for Pqmax Pqmax Pqmax q q q each pair fi; jg, leading to q ¼ q0 xi þ q ¼ q0 xj r q ¼ q0 w , for q0 ¼ minfqA Q : jNðiÞ \ NðjÞj o q 2g. Test 1 contributes to decrease the size of the models, while Test 2 generates a set of inequalities for strengthening their linear programming relaxation. These constraints take advantage of the clique size information provided in variables fxqi g, for q 42. On the other hand, models defined on the space of variables fxi g can only profit from both Tests 1 and 2 whenever qmin 42. Note that when qmin ¼ 2, we are assuming to have no information on lower bounds for oðGÞ, thus the maximum clique set can be as small as a set of two vertices. Furthermore, none of the constraints generated by Test 2 is included in the set (4) of model EdgeF. 2.4. Maximum clique related problems As previously mentioned, there are a number of applications involving the determination of a maximum size clique. However, in some cases, to find a maximum clique, or even maximal cliques, is just a subproblem of a larger system, while in other cases, more relaxed concepts of dense subgraphs rather than that of a clique can be more adequate. In this section we bring two such examples, to which the previously reported models can provide interesting answers. Maximum size component involving two overlapping cliques: Consider we want to find the biggest component involving two cliques sharing a number of vertices within the range L ¼ flmin ; . . . ; lmax g, that is, the two cliques must share at least lmin and at most lmax vertices, with lmin Z1 and lmax rn. Such biggest component should involve two maximal cliques with minimum number of common vertices. When thinking on protein interaction networks, this problem is important to help discovering proteins that share different functional modules (see [31,34]). These may constitute pleiotropic proteins that are involved in various cell functional activities. When lmin is small, searching for components with maximum total size should lead us into intermodule complexes. Considering formulation DF, we duplicate the set of variables ; w1;q g and fx2;q ; w2;q g with q A Q ¼ flmin ; . . . ; ng, fxqi ; wq g into fx1;q i i regarding the two cliques, respectively. The new variables have the same meaning as before, concerning the correspondent clique. We also bring the initial set of natural vertex variables xi for iA V. Then, the problem can be modelled as (denoted by OQF—Overlapping Cliques Formulation) X maximize xi
1351
Constraints (18), (19), (23) and (24) are extensions of constraints (12), (13), (15) and (10), respectively. Equality (20) extends (9), searching for a pair of cliques. Inequalities (21) relate the two sets of vertex variables, letting variable xi identify a vertex in the global component if and only if it belongs to one of the cliques. The bounds for the total number of shared vertices is imposed by inequalities (22). Note that the only reason for duplicating the set of q-indexed variables is because the two cliques may have the same size, otherwise we could have used the original variables from DF. The k-plex problem: The second problem involves the determination of a maximum size k-plex. For a given integer kZ 1, a subset of vertices S is a k-plex if the following condition holds: jNðiÞ \ Sj Z jSj k, for all iA S. That is, S is a k-plex if the degree of every vertex in the induced subgraph is at least jSj k. When k ¼ 1 this reduces to the clique problem. For k 41 the k-plex problem is a degree-based relaxation of the clique. An extensive number of applications of the k-plex and other clique relaxed problems are described in [2]. In many social and biological networks, to find a clique component can be much restrictive, particularly when we are dealing with ephemeral or missing links (edges) being involved in important cohesive components. In some of those practical problem, cliques have been criticized for their overly restrictive nature, which has motivated the emergence of more relaxed concepts of dense subgraphs, namely k-plexes, among others. The decision version of the k-plex problem has been shown to be NP-complete in [2]. The authors also propose a formulation and describe a branch-and-cut based algorithm, using some polyhedral properties also discussed in the same paper. The formulation proposed in [2] is based on the characterization of k-plexes by the following set of inequalities: X xj r ðk 1Þxi þ jNðiÞjð1 xi Þ; 8i A V ð25Þ j A N ðiÞ
leading to model ( ) X mk ðGÞ ¼ max xi : ð25Þ and xi A f0; 1g; 8i A V
ð26Þ
iAV
Formulation (26) can be seen as the k-plex version of model AEdgeF, previously described to the clique problem. Also based on other cliques characterizations here described, we can model k-plexes on the discretized and extended space of variables fxqi ; wq g, namely by substituting in model DF inequalities (12) by X q xj Z ðq kÞxqi ; 8iA V; 8q A Q ð27Þ j A NðiÞ
iAV
subject to
X
xh;q Z ðq j
1Þxh;q ; i
8i A V; 8q A Q ; h ¼ 1; 2
ð18Þ
j A NðiÞ
X
xh;q ¼ qwh;q ; i
8q A Q ; h ¼ 1; 2
ð19Þ
ðOQFÞ
i A Vq A Q
)
iAV
X
leading to model ( XX q mk ðGÞ ¼ max xi : ð27Þ; ð13Þ; ð9Þ; ð10Þ and xqi A f0; 1g;
ðw1;q þ w2;q Þ ¼ 2
ð20Þ
8i A V; 8q A Q
ð28Þ
qAQ
X
1=2ðx1;q þ x2;q Þ r xi r i i
qAQ
ðx1;q þ x2;q Þ; i i
8i A V
ð21Þ
qAQ
lmin r
X
ðqw1;q þ qw2;q Þ
qAQ
xi A f0; 1g; xh;q i A f0; 1g; h;q
w
X
A f0; 1g;
X
xi r lmax
As before, set Q ¼ fqmin ; . . . ; qmax g, with qmin Z k þ1 and qmax rn. This is a straightforward adaptation of model DF that can also be applied to the other discretized formulations.
ð22Þ
iAV
3. Computational results
8i A V 8i A V;
8qA Q ; h ¼ 1; 2
8qA Q ; h ¼ 1; 2
ð23Þ ð24Þ
From a practical point of view, it is important to compare the models here proposed and also compare them with those
ARTICLE IN PRESS 1352
P. Martins / Computers & Operations Research 37 (2010) 1348–1358
Table 2 Linear programming relaxation gaps and times. Instance
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5 p-hat300-1 p-hat500-1 hamming6-4 c-fat500-10 c-fat200-5 p-hat300-2 brock200-2 san400-0.5-1 sanr400-0.5 p-hat500-2 johnson8-2-4
n
500 500 200 200 500 300 500 64 500 200 300 200 400 400 500 28
d
0.036 0.073 0.077 0.163 0.186 0.244 0.253 0.349 0.374 0.426 0.489 0.496 0.500 0.501 0.505 0.556
EF L
AEdgeF L
EdgeF L
DF L
ADF L
Gap
Time
Gap
Time
Gap
Time
Gap
Time
Gap
Time
1685.7 861.5 733.3 316.7 290.6 1775.0 2677.8 700.0 98.4 72.4 500.0 733.3 1438.5 1438.5 594.4 250.0
1.22 1.20 0.16 0.13 1.05 0.31 0.95 0.01 0.78 0.09 0.19 0.08 0.36 0.36 0.59 0.00
1685.7 861.5 733.3 316.7 290.6 1775.0 2677.8 700.0 98.4 72.4 500.0 733.3 1438.5 1438.5 594.4 250.0
3.39 3.61 0.20 0.20 2.27 0.77 3.52 0.02 1.86 0.13 0.45 0.13 1.11 1.42 4.03 0.00
3407.9 1802.1 1528.8 723.4 677.9 3620.4 – 1477.0 – 243.6 1095.7 1561.6 – – – 589.0
104.47 73.66 3.14 4.08 5558.38 23.91 – 0.42 – 24.73 172.83 25.69 – – – 0.06
– – 41.7 37.5 – – – 475.0 – 46.6 – 733.3 – – – 300.0
– – 322.91 700.84 – – – 7.86 – 969.83 – 1366.14 – – – 0.45
49.7 49.8 49.3 41.5 47.6 926.4 – 475.0 – 48.0 545.1 735.8 – – – 300.0
24.91 48.70 2.16 4.36 3941.58 96.70 – 0.39 – 49.66 303.16 54.75 – – – 0.05
Q ¼ f2; . . . ; ng. Only instances with nr 500.
reported in the literature. As usual, we resort to the benchmark DIMACS instances1 that have long been used to test MC methods. An yeast protein interaction network2 reported in [34] has also been considered in our tests. In these experiments, we concentrate on the quality of the upper bounds obtained with the linear programming relaxations, the time to solve the LPs and the time to reach the optimums, when these models are embedded into the branch-and-bound. Among all DIMACS instances available, we only took those with density not higher than 0.6. Our option is because the new models are more efficient on sparse graphs, rather than dense graphs. We used the ILOG/CPLEX 9.0 package to solve both LPs and branch-and-bounds. These tests were ran on a Pentium IV with 3.2 GHz and 512 MBytes of RAM memory.
3.1. Maximum clique problem tests As mentioned in Section 2.2, the new models are strongly influenced by the range of variation of set Q. For that reason, tight lower and upper bounds for oðGÞ can reduce the models’ dimensions, benefiting their resolution. Thus, we resort to the recent work of Gendron et al. [18] and take the bounds reported in the papers (shown in the Appendix). In order to guarantee a good tradeoff between the quality of the upper bounds and the times reported to reach them we chose the values obtained through the Sequence elimination algorithm applied to upper bounds generated by the Linear coloring method. When referring to these bounds, we use the previously introduced notation qmin and qmax , as the best lower and upper bounds for oðGÞ, respectively. We have set an upper time limit of 10 800 s in all tests here described. Percent gaps are calculated as 100ððUB LBÞ=LBÞ, with LB the best known lower bound for oðGÞ or the optimum when available, and UB the upper bound produced by the LP relaxation of the models under discussion. All LB values used are given in the Appendix. Times are reported in seconds. Instances are sorted according to graph’s density (d). Missing results in the tables are due to time or memory limitations, being marked with ‘‘–’’. The primal simplex has been used in all linear programming relaxations involving model ADF L , because it has more variables 1 Available at ftp://dimacs.rutgers.edu/pub/challenge/graph/benchmarks/ clique. 2 Available at NAR online.
than constraints, whenever jQ j 4 1. For all other models, the dual simplex has been called to solve the associated LP relaxations. We start comparing the LP bounds produced by the models, without the prior knowledge of qmin nor qmax . This is the way models EdgeF and AEdgeF have first been posed. It also means that Q ¼ f2; . . . ; ng, which implies an enormous number of variables and/or constraints in the new models EF, DF and ADF. For this reason we only report results on instances with n r 500. The gaps reported in Table 2 indicate that for the sparser graphs, models DF L and ADF L produce upper bounds much closer to oðGÞ than the other models. However, when the density grows (d 40:45), the known models EdgeF L and AEdgeF L are the best to approximate oðGÞ. In addition, the extended formulation EF L presents very high gaps and execution times. The results in Table 2 also indicate that the two known models EdgeF L and AEdgeF L reach the same upper bound values. This observation only addresses these instances, as differences can occur among denser graphs. As expected, the new models require much more time to solve the LPs than the known ones. In fact, any attempt to solve the bigger instances (with n 4 500) would be inhibited from memory or execution time limitations. In addition, the aggregated AEdgeF L model requires more time to solve the LPs than model EdgeF L , although the latter has many more constraints. A reason for this can be associated with the density of constraints (6) from AEdgeF L , when compared with inequalities (4) of EdgeF L . Considering the two tests proposed in Section 2.3, we begin to show the number of variables and constraints identified by Tests 1 and 2, respectively. This is reported in Table 3 and only addresses the space of variables fxqi g that benefit from the additional information provided by the extra index q, as suggested in Section 2.3. Note that we are still assuming to have no information on approximate values for oðGÞ, thus Q ¼ f2; . . . ; ng. The column labelled with ‘Test 1 variables eliminated’ indicates the proportion of variables identified by Test 1 over the total number of vertex variables. Column labelled with ‘Test 2 constraints identified’ refer to the total number of constraints identified by Test 2. As expected, Test 1 is more effective on sparse graphs. A similar observation holds for Test 2, in the sense that smaller values for q0 should come up among sparse graphs, inducing stronger cuts generated by Test 2. Remember that q0 ¼ minfq A Q : jNðiÞ \ NðjÞj o q 2g. Note that the total number of constraints identified by Test 2 is equal to jEj, because there is
ARTICLE IN PRESS P. Martins / Computers & Operations Research 37 (2010) 1348–1358
always one such constraint for any pair fi; jg, as jNðiÞ \ NðjÞj can range from 0 to n 2, whenever Q ¼ f2; . . . ; ng. So the only condition holding is ði; jÞ A E. Table 4 reports the gaps and execution times to solve the LP relaxation of the discretized models with Tests 1 and 2. Among all constraints identified by Test 2, we took no more than n of them. The selected ones involved the n pairs of vertices fi; jg with higher common degree, where the common degree is given by jNðiÞ \ NðjÞj. This option is to provide the model with additional information involving the decision among pairs of vertices with higher chance to belong to the clique. The tighter values for qmin and qmax have still been ignored, thus keeping Q ¼ f2; . . . ; ng. As mentioned above, no experiments are made involving Tests 1 and 2 and addressing the models using variables fxi g, whenever qmin ¼ 2. The amount of variables eliminated by Test 1 contributes for faster solving the discretized models. Such reduction has also allowed to solve the bigger instances, within the given time limit. Test 2 additional constraints also contribute for further reduction of the previously reported gaps. However, contrarily to expected,
Table 3 Proportion of variables eliminated by Test 1 and total number of constraints identified by Test 2, within the space of variables fxqi g. Instance
n
d
Test 1 Variables eliminated
Test 2 Constraints identified
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5 p-hat300-1 p-hat500-1 hamming6-4 c-fat500-10 c-fat200-5 p-hat300-2 brock200-2 san400-0.5-1 sanr400-0.5 p-hat500-2 johnson8-2-4
500 500 200 200 500 300 500 64 500 200 300 200 400 400 500 28
0.036 0.073 0.077 0.163 0.186 0.244 0.253 0.349 0.374 0.426 0.489 0.496 0.500 0.501 0.505 0.556
0.96 0.93 0.92 0.84 0.81 0.76 0.75 0.65 0.63 0.57 0.51 0.50 0.50 0.50 0.50 0.44
4459 9139 1534 3235 23191 10933 31569 704 46627 8473 21928 9876 39900 39984 62946 210
Table 4 Linear programming relaxation gaps and times, including Tests 1 and 2. Instance
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5 p-hat300-1 p-hat500-1 hamming6-4 c-fat500-10 c-fat200-5 p-hat300-2 brock200-2 san400-0.5-1 sanr400-0.5 p-hat500-2 johnson8-2-4
n
500 500 200 200 500 300 500 64 500 200 300 200 400 400 500 28
d
0.036 0.073 0.077 0.163 0.186 0.244 0.253 0.349 0.374 0.426 0.489 0.496 0.500 0.501 0.505 0.556
ADF L
DF L Gap
Time
Gap
Time
42.9 46.2 41.7 37.5 45.3 725.0 – 475.0 47.6 44.8 400.0 661.3 1346.2 1338.0 – 277.8
0.67 2.73 0.13 0.59 17.88 753.83 – 0.09 502.53 5.53 95.44 168.17 1047.45 4974.16 – 0.03
48.9 49.0 48.1 39.0 46.5 805.4 1272.6 475.0 48.8 47.8 446.3 727.6 1435.6 1440.8 – 284.3
0.47 2.13 0.11 0.61 14.64 16.33 105.45 0.05 423.49 4.97 34.74 15.69 503.67 397.64 – 0.02
Q ¼ f2; . . . ; ng. Only instances with n r 500.
1353
the benefits are more meaningful on the denser graphs, where higher gaps still remain. For some instances, and among the discretized models, the number of constraints identified is much bigger than those selected for inclusion (only n of them). Some additional experiments revealed no significant decrease on the gaps, when more constraints are added to the models. Comparing with the gaps produced by the known models EdgeF L and AEdgeF L , reported in Table 2, the two discretized formulations generate tighter upper bounds, for all instances except ‘johnson8-2-4’. As before, the differences are more significant among the sparser graphs. However, if tests were conducted on other DIMACS instances with density higher than 0.6, then we would observe an opposite behavior on the gaps. This time the advantage would benefit the known formulations. Additionally, the times reported in Table 4 are still superior than those reported in Table 2 for solving models EdgeF L and AEdgeF L . The new formulations become significantly smaller when tight qmin and qmax values are made available. Results on the correspondent LP relaxations are reported in Table 5. In order to provide a fair comparison with the known models, we add to EdgeF L and AEdgeF L the following inequalities: X qmin r xi r qmax ð29Þ iAV
Once again, and for all models under discussion, we took no more than n constraints among all those generated by Test 2. The selected ones followed the same criterion as before. The total number of constraints identified by Test 2 can be found in Table 6. Table 5 only shows the times to solve the LPs because all optimums coincide with the correspondent qmax value. Smallest times for each instance are detached in bold. The cardinality of set Q has also been included in Tables 5 and 6. The inclusion of constraints (29) in EdgeF L induced a small increase on the time to reach the LP optimums. On the contrary, model AEdgeF L required less time than before. The same happened with the new formulations, revealing a significant decrease on the times due to the presence of the tight bounds for oðGÞ. Comparing with the known models, the new formulations are more effective among the sparser graphs and when jQ j ¼ 1.
Table 5 Times to solve the LP relaxations, with the tight values for qmin and qmax . Instance
n
d
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5 p-hat300-1 p-hat1000-1 p-hat700-1 p-hat500-1 p-hat1500-1 hamming6-4 c-fat500-10 c-fat200-5 p-hat300-2 p-hat1000-2 brock200-2 p-hat700-2 san400-0.5-1 C2000-5 C4000-5 sanr400-0.5 san1000 p-hat500-2 p-hat1500-2 johnson8-2-4
500 500 200 200 500 300 1000 700 500 1500 64 500 200 300 1000 200 700 400 2000 4000 400 1000 500 1500 28
0.036 1 6.98 0.16 0.073 1 6.47 0.14 0.077 1 0.45 0.03 0.163 1 0.36 0.03 0.186 1 4.38 0.20 0.244 4 1.05 0.03 0.245 15 3.98 0.34 0.249 9 1.83 0.16 0.253 8 0.90 0.08 0.253 23 269.73 1.12 0.349 2 0.02 0.00 0.374 1 2.63 0.31 0.426 1 0.11 0.03 0.489 10 0.59 0.03 0.490 44 2.55 0.34 0.496 10 0.08 0.06 0.498 28 0.87 0.16 0.500 7 0.19 0.05 0.500 106 262.00 2.41 0.500 200 – 16.73 0.501 21 0.31 0.05 0.502 14 1.42 0.32 0.505 19 0.51 0.08 0.506 69 110.91 0.94 0.556 2 0.00 0.01
jQ j
EdgeF L
AEdgeF L
EF L
DF L
ADF L
0.00 0.03 0.01 0.01 0.16 0.05 6.26 1.63 0.45 106.44 0.00 0.18 0.05 0.55 1939.23 0.21 47.69 1.20 – – 2.86 12.48 4.27 – 0.02
0.01 0.03 0.01 0.02 0.08 0.06 8.34 1.88 0.58 53.02 0.00 0.15 0.05 0.52 3899.31 0.32 15.47 1.61 – – 3.64 18.95 4.77 – 0.01
0.00 0.02 0.00 0.01 0.03 0.05 9.75 1.02 0.33 70.45 0.00 0.11 0.02 0.33 1761.39 0.17 20.41 0.68 – – 1.75 8.09 2.50 – 0.01
ARTICLE IN PRESS 1354
P. Martins / Computers & Operations Research 37 (2010) 1348–1358
Table 6 Total number of constraints identified by Test 2. Instance
n
d
jQ j
Constraints on fxi g
Constraints on fxqi g
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5 p-hat300-1 p-hat1000-1 p-hat700-1 p-hat500-1 p-hat1500-1 hamming6-4 c-fat500-10 c-fat200-5 p-hat300-2 p-hat1000-2 brock200-2 p-hat700-2 san400-0.5-1 C2000-5 C4000-5 sanr400-0.5 san1000 p-hat500-2 p-hat1500-2 johnson8-2-4
500 500 200 200 500 300 1000 700 500 1500 64 500 200 300 1000 200 700 400 2000 4000 400 1000 500 1500 28
0.036 0.073 0.077 0.163 0.186 0.244 0.245 0.249 0.253 0.253 0.349 0.374 0.426 0.489 0.490 0.496 0.498 0.500 0.500 0.500 0.501 0.502 0.505 0.506 0.556
1 1 1 1 1 4 15 9 8 23 2 1 1 10 44 10 28 7 106 200 21 14 19 69 2
2208 3048 585 2079 12555 83 0 1 5 0 224 19344 3192 3 0 0 0 0 0 0 0 0 0 0 0
2208 3048 585 2079 12555 358 243 237 343 67 224 19344 3192 155 74 0 180 0 0 0 0 0 143 38 0
This reveals the strong dependency of the new formulations from the given tight bounds, imposing much higher execution times when jQ j is large and when the density grows. In addition, it should be noted that the reported times do not include the effort to find the given bounds. However, for all instances to which the new models solve faster the LP problem, the times to compute the upper bounds are less than a second (0 s as reported in [18]), namely among all instances with density smaller than 0.2. For the same competitive instances, times for the lower bounds can also be found very fast (see, e.g. [18,20]). Table 6 shows the total number of constraints found by Test 2. Those identified within the space of variables fxi g are shown in the column labelled ‘constraints on fxi g’ , while those generated in the space of variables fxqi g have label ‘constraints on fxqi g’ . We present no results involving Test 1, because no vertex has been eliminated. When q takes values in fqmin ; . . . ; qmax g, the number of constraints generated by Test 2 decreases a lot. As expected, for jQ j 41, the test is more effective on the variables space fxqi g than in fxi g. However, it hardly finds constraints on the denser graphs. We also tested the effectiveness of the models to reach the optimums when embedded into the branch-and-bound. We concentrate on the very sparse graphs, with d r0:2, where the new models revealed to be more competitive. For these instances, the values taken from [18] for qmin and qmax are equal, which means we know the exact value of oðGÞ. In effect, while most qmax bounds being proposed do not have an associated feasible clique, the same does not happen with qmin . Actually, most lower bounding contributions in the literature are based on heuristic methods, bringing an associated feasible clique, which means that the problem is solved whenever qmin ¼ qmax . For this reason, we consider six different versions of set Q, with jQ j ¼ 1; 3; 5; 9; 15 and Q ¼ f2; . . . ; ng, that is, jQ j ¼ n 1. In all these sets, except the last one, oðGÞ is the central value. Hence, we are assuming an equal distance from oðGÞ to both qmin and qmax , which is not the most common situation among the known intervals reported in the literature. In fact, qmin is most frequently closer to oðGÞ than qmax . Anyway, even admitting that the qmin values are exaggeratedly far from oðGÞ, the branch-and-
bound as been observed to find very tight lower bounds in early stages of its execution, specially among the known models. In the jQ j ¼ 1 case we assume to know just the value of oðGÞ, having no feasible set for it. We used most default settings proposed by CPLEX, which involves an automatic procedure that uses the best rule for variable selection and the best-bound search strategy for node selection in the branch-and-bound tree. Other strategies available by CPLEX have also been experimented in some preliminary tests, revealing no benefits in general. We ran the tests with and without the automatic generation of mixed integer programming (MIP) global cuts, and chose to close the cuts generator, because it required much more time to solve the known models. Tables 7–11 show the times to run the branch-and-bound, using each of the models and considering the sequence of increasing cardinality of Q, mentioned above. Smallest times for each instance and jQ j value are detached in bold. The running time of the branch-and-bound tends to grow when jQ j increases. However, model EdgeF seems to be relatively insensitive to the size of Q, from jQ jZ 3, although taking much more time than the others, in general. Also, there is a significant jump on the times from the case jQ j ¼ 1 to 3, for all models, being more pronounced for EdgeF. Additionally, when Q ¼ f2; . . . ; ng, where we ignore any approximate bound for oðGÞ, the new models’ times tend to get worsen along with the graph’s density growth, although model DF is more resistent than the others. On the contrary, for the same Q ¼ f2; . . . ; ng case, the known models tend to be better choices, specially model AEdgeF and particularly among denser graphs. These results also indicate that the discretized models, DF and ADF, perform very well among the sparser graphs. A reason for this can be credited to the fast decrease of the best-bound value along with the branch-and-bound execution. In our opinion, this happens because the LP projected polyhedra of the two mentioned models should be closer to the MC convex hull than the other models’ LP polyhedra. Considering the option of closing the automatic cuts generator, we have observed that the times to run the branch-and-bound increased a lot in the version that includes the cuts, specially when the known models were involved. For some instances, it was more than 10 times slower than the times here reported, and in general the differences tend to be higher when jQ j increases. For the tests addressing the new models, those differences were not so high. It is important to remaind that CPLEX includes clique cuts among the global pool of MIP cuts available. These cuts have been most visible in the case with Q ¼ f2; . . . ; ng, and also when the known models EdgeF and AEdgeF were used. Still, we expect that the inclusion of the MIP global cuts can be helpful among higher dimensional and denser graphs. Sparse graphs are very common in real applications, namely those addressing biological network problems. For instance, considering some of the works mentioned in Section 1, the density of the CMO (maximum contact map overlap) graphs used in [32], to compare protein’s tertiary structure, ranged from 0.11 to 0.16, in their initial form. Other common applications involve the protein–protein interaction network of the yeast Saccharomyces cerevisiae, being reported to have density smaller than 0.01 (see, e.g. [34]). This interactome includes a giant component composed of 741 proteins (vertices) and 1752 interactions (edges), made available in Han et al. [35] and used in Valente and Cusick [34]. If we use formulations AEdgeF and DF to find a maximum size clique in this graph, it identifies 14 proteins, all belonging to the same functional module, the core transcription factor module. It has eight transcriptional control proteins, five transcriptional initiation proteins and a single chromatin conformation modification protein. In addition, there are eight vital and five lethal proteins, and
ARTICLE IN PRESS P. Martins / Computers & Operations Research 37 (2010) 1348–1358
1355
Table 7 Formulation EdgeF. Branch-and-bound execution times. Instance
n
d
oðGÞ
jQ j ¼ 1
jQ j ¼ 3
jQ j ¼ 5
jQ j ¼ 9
jQ j ¼ 15
Q ¼ f2; . . . ; ng
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5
500 500 200 200 500
0.036 0.073 0.077 0.163 0.186
14 26 12 24 64
9.86 12.36 3.39 9.69 16.08
1485.78 2211.64 43.83 68.53 4882.88
1650.42 2099.86 47.25 70.11 3831.48
1413.16 2178.19 48.78 68.83 3855.47
1561.41 2351.31 49.80 66.34 3989.70
1536.95 2379.08 49.42 69.03 4078.34
Table 8 Formulation AEdgeF. Branch-and-bound execution times. Instance
n
d
oðGÞ
jQ j ¼ 1
jQ j ¼ 3
jQ j ¼ 5
jQ j ¼ 9
jQ j ¼ 15
Q ¼ f2; . . . ; ng
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5
500 500 200 200 500
0.036 0.073 0.077 0.163 0.186
14 26 12 24 64
0.72 0.72 0.11 0.36 0.72
25.52 33.98 1.80 2.06 122.95
24.80 36.52 1.73 1.89 89.66
26.80 33.06 1.76 2.03 297.58
31.28 41.81 2.05 2.31 78.75
89.77 157.31 8.02 9.38 261.36
Table 9 Formulation EF. Branch-and-bound execution times. Instance
n
d
oðGÞ
jQ j ¼ 1
jQ j ¼ 3
jQ j ¼ 5
jQ j ¼ 9
jQ j ¼ 15
Q ¼ f2; . . . ; ng
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5
500 500 200 200 500
0.036 0.073 0.077 0.163 0.186
14 26 12 24 64
0.03 0.06 0.02 0.02 0.47
0.17 12.92 0.14 2.72 1728.92
45.09 126.67 3.00 6.70 370.23
94.28 178.30 4.70 17.73 906.19
115.75 366.41 7.55 39.58 1542.45
– – 1303.41 413.23 –
Table 10 Formulation DF. Branch-and-bound execution times. Instance
n
d
oðGÞ
jQ j ¼ 1
jQ j ¼ 3
jQ j ¼ 5
jQ j ¼ 9
jQ j ¼ 15
Q ¼ f2; . . . ; ng
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5
500 500 200 200 500
0.036 0.073 0.077 0.163 0.186
14 26 12 24 64
0.01 0.05 0.01 0.05 0.45
0.95 4.53 0.33 0.97 99.58
3.27 10.02 0.45 1.50 59.19
3.84 19.06 0.75 3.64 90.55
4.38 34.06 0.81 7.89 322.83
28.44 83.17 2.53 29.52 3247.78
Table 11 Formulation ADF. Branch-and-bound execution times. Instance
n
d
oðGÞ
jQ j ¼ 1
jQ j ¼ 3
jQ j ¼ 5
jQ j ¼ 9
jQ j ¼ 15
Q ¼ f2; . . . ; ng
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5
500 500 200 200 500
0.036 0.073 0.077 0.163 0.186
14 26 12 24 64
0.01 0.05 0.00 0.02 0.39
0.41 5.08 0.23 0.83 132.45
26.58 39.34 2.77 5.03 843.09
41.36 141.98 4.19 19.11 1037.67
64.86 347.25 4.50 39.36 1467.89
75.72 892.80 6.08 106.14 –
one unknown (protein TRA1). Considering the times to reach the optimum, model AEdgeF required 61.8 s, while model DF took just 0.09 s. In fact, model DF L immediately reached the P integer optimum in 0.02 s. Also, by adding inequality i A C xi r ðoðGÞ 1Þ, with C the mentioned clique set, to model (17), we would obtain an infeasible problem, showing that the 14-vertices clique obtained is the only one in that graph. This infeasible problem took 0.02 s to be solved. We set qmin ¼ 2 and qmax ¼ 33 in all these tests, involving models AEdgeF and DF. We chose a very simple upper bound, based on the maximum vertex degree (dmax) of the yeast interactome graph, where dmax ¼ 32.
We can find in the literature other implicit enumerative methods specially designed for the MC problem. Tomita and Kameda [33] provide a comparison among most of those methods, showing that any of the instances tested in Tables 7–11 could have been solved in less than 0.02 s, using the benchmark program dfmax proposed by Applegate and Johnson (see [23]) or the branch-and-bound algorithm proposed in [33]. These approaches can be much faster than using a general branch-andbound (or -cut) to solve a mathematical formulation of the problem. However, when moving into a different clique related problem, the mathematical modelling approach can play
ARTICLE IN PRESS 1356
P. Martins / Computers & Operations Research 37 (2010) 1348–1358
an important role. Two such cases have been described in Section 2.4, namely the overlapping cliques and k-plex problems. Next we provide a few tests with these problems.
3.2. Tests for the maximum size component involving two overlapping cliques Considering model OQF, we propose finding pairs of overlapping cliques using three different L ranges: L1 ¼ f1; . . . ; oðGÞ 1g, L2 ¼ f3; . . . ; oðGÞ 1g and L3 ¼ f5; . . . ; oðGÞ 1g. In these tests, we simply analyze changes on the minimum number of shared vertices. Table 12 shows the solutions and computational times to solve model OQF, applied to the 741 vertices S. cerevisiae interactome graph. There is a significant increase on the time to solve model OQF when few vertices are allowed to share the cliques. In fact, we have not been able to reach the optimum for the L1 range case. The best solution found is a 17-vertices component, involving the 14-vertices maximum clique and a 5-vertices clique. The shared proteins are TAF10 and TAF5, being both lethal. In fact, all proteins in the smaller clique are lethal. The two cliques belong to the core transcription factor module and all their proteins are in two different sub-modules (see [34]). Besides, if we remove the shared proteins, then all those from clique 1 go to one of the sub-modules while the other ones, from clique 2, fall in the other sub-module. The case with range L2 returned a 16-vertices component, composed of the same 14-vertices maximum clique and a 5vertices clique, also belonging to the core transcription factor module. The shared vertices correspond to proteins TAF10, TAF60, TAF5, being all lethal. Interestingly, the 5-vertices clique includes protein TAF2 with unknown lethality. Model OQF with range L3 also identified a 16-vertices component within the same functional module. It also includes the 14-vertices maximum clique and a 7-vertices clique, sharing five proteins. These shared proteins are TAF17, TAF10, TAF60, TAF61 and TAF5, all involved in the transcriptional initiation function. Curiously, these are all the lethal proteins in the 14vertices maximum clique. The additional proteins brought by the 7-vertices clique are TAF19 and TAF145, also involved in the transcriptional initiation function and being lethal as well. None of the solutions revealed a topological inter-module complex, basically because of the sparsity of the yeast interactome graph. If more interactions come up, then the OQF model may help discovering new functional complexes between the modules. Alternatively, more relaxed concepts of dense components rather than cliques should be considered instead. Table 12 Solutions and times (in seconds) to solve model OQF, for L1, L2 and L3. L1
L2
L3
Solution
Time
Solution
Time
Solution
Time
Z 17
10 800.00
16
4863.21
16
1625.64
3.3. Tests for the k-plex problem Tests with the two formulations (26) and (28), addressing the k-plex problem, have also been conducted, using the same S. cerevisiae interactome graph. Bounds for mk ðGÞ have been set to qmin ¼ kþ 1 and qmax ¼ minfk þ dmax; ng, with dmax the maximum vertex degree in G. For the present graph dmax ¼ 32. Solutions and times are reported in Table 13, for k ¼ 2; 3; 4; 5; 10. Model (28) is faster to solve all these problems. The 2-plex problem identified the same 14 proteins found in the maximum sized clique, while the other k-plex solutions, for k ¼ 3; 4; 5; 10, identified components in the large 60S ribosomal subunit module. In fact, the k-plex solutions seem to fall within known cell functional modules, which suggests that k-plexes can provide an adequate structure for identifying new functional modules or protein complexes in other cells’ interactome. Actually, the k-plex concept can deal with possible missing interactions, which is commonly a matter of concern in biological network analysis (see, e.g. [31,34]). We have also applied the two models (26) and (28), with Q ¼ fk þ1; . . . ; minfk þ dmax; ngg, to the sparser DIMACS instances previously considered: c-fat500-1, c-fat500-2, c-fat200-1, cfat200-1 and c-fat500-5. In this case, addressing problem 2-plex, the times to solve the DF based model (28) are 27.63, 248.72, 5.05, 42.44 and 10 269.08 s, respectively. On the other hand, the times required by the AEdgeF based model (26) are 235.91, 402.39, 16.08, 15.75 and 596.47 s, respectively. Considering the results reported in [2], their branch-and-cut method took 557.1, 885.2, 30.92, 31.67 and 1087.75 s, respectively, to solve the same instances, using a 2.2 GHz Pentium IV processor (remember that we are using a 3.2 GHz Pentium IV computer in our tests). Once again, the discretized formulations have been faster to find the optimums among the sparser graphs, including the yeast S. cerevisiae interactome graph that has long been concentrating the attention of biologists. However, as in the clique problem, the AEdgeF based formulation takes the lead when density gets growing. A similar conclusion holds for the branch-and-cut method described in [2].
4. Conclusions This paper proposes new formulations for the MC problem. These are extended and discretized models that use additional information on candidate values for the clique number of the graph. Computational tests conducted on some DIMACS instances showed that for sparse graphs, with d o 0:5, the LP relaxation of the new discretized models produce much stronger upper bounds for oðGÞ than the known formulations. This suggests that the associated LP projected polyhedra should be closer to the MC convex hull than the known models’ LP polyhedra. In addition, when tight bounds for oðGÞ are given in advance, the new formulations can be much reduced in their size, being able to reach the LP optimum much faster than before. For sparse graphs and when a short interval containing oðGÞ is known, then model DF is the best formulation within the branch-
Table 13 Solutions and times to solve models (26) and (28), for k ¼ 2; 3; 4; 5; 10. Model
AEdgeF based: (26) DF based: (28)
k¼2
k¼3
k¼4
k¼5
k ¼ 10
m2 ðGÞ
Times
m3 ðGÞ
Times
m4 ðGÞ
Times
m5 ðGÞ
Times
m10 ðGÞ
Times
14 14
75.64 0.19
15 15
74.88 0.36
17 17
62.41 0.11
18 18
57.73 0.14
23 23
71.09 0.08
ARTICLE IN PRESS P. Martins / Computers & Operations Research 37 (2010) 1348–1358
and-bound to find an optimum clique. On the other hand, when the graph becomes denser and the interval grows, then the best choice is formulation AEdgeF. Formulations EF and ADF are only competitive for very short intervals for oðGÞ. Sparse graphs are very common in biological interaction’s characterizations. Such an example, involving an yeast interactome, has been used, reinforcing the advantages of the new discretized formulations, even without tight bounds for oðGÞ. Sparse graphs are also common in other real-world applications, namely in social networks; internet and world-wide-web networks; metabolic, protein and genetic networks and in brain networks; as shown in [6]. Apart from the comparison among the new and the existing formulations described in the present paper, we can find in the literature implicit enumerative methods specially designed for the MC problem (see, e.g. [28,33]). These approaches can be much faster than using a general MIP solver like the branch-and-bound (or -cut) to answer the mentioned models. However, when moving into a different clique related problem, the mathematical modelling approach can play an important role. Two such cases have been considered, namely overlapping cliques and the k-plex problem, to which adequate reformulations have been proposed and tested.
Acknowledgments The author would like to thank Carlos Luz and Luis Cavique for kindly making available their most recent results on upper bounds for o. Thanks are also due to Andre´ X. C. N. Valente for making available the protein interactome of Saccharomyces cerevisiae. The author also thanks the anonymous referees for their valuable suggestions.
Appendix Table 14 indicates the optimum/best known values for oðGÞ, denoted by LB (from lower bound). It also shows values for qmin and qmax taken from [18]. We chose the upper bounds generated Table 14 Best known lower bounds for oðGÞ. Values for qmin and qmax , reported in [18]. Instance
n
d
LB
qmin
qmax
c-fat500-1 c-fat500-2 c-fat200-1 c-fat200-2 c-fat500-5 p-hat300-1 p-hat1000-1 p-hat700-1 p-hat500-1 p-hat1500-1 hamming6-4 c-fat500-10 c-fat200-5 p-hat300-2 p-hat1000-2 brock200-2 p-hat700-2 san400-0.5-1 C2000-5 C4000-5 sanr400-0.5 san1000 p-hat500-2 p-hat1500-2 johnson8-2-4
500 500 200 200 500 300 1000 700 500 1500 64 500 200 300 1000 200 700 400 2000 4000 400 1000 500 1500 28
0.036 0.073 0.077 0.163 0.186 0.244 0.245 0.249 0.253 0.253 0.349 0.374 0.426 0.489 0.490 0.496 0.498 0.500 0.500 0.500 0.501 0.502 0.505 0.506 0.556
14 26 12 24 64 8 10 11 9 12 4 126 58 25 46 12 44 13 16 18 13 15 36 65 4
14 26 12 24 64 8 10 11 9 11 4 126 58 25 46 10 44 8 15 16 13 8 36 65 4
14 26 12 24 64 11 24 19 16 33 5 126 58 34 89 19 71 14 120 215 33 21 54 133 5
1357
through the sequential elimination method applied to the chromatic number, obtained through the linear coloring heuristic. Among all instances proposed in the DIMACS library, we only considered those with density not higher than 0.6. Instances are sorted according to graph’s density (d). References [1] Bahadur KC, Akutsu T, Tomita E, Seki T, Fujiyama A. Protein matching under non-uniform distortions and protein side chain packing based on an efficient maximum clique algorithm. Genome Informatics 2002;13:143–52. [2] Balasundaram B, Butenko S, Hicks IV, Sachdeva S. Clique relaxations in social network analysis: the maximum k-plex problem. Manuscript /http://ie. tamu.edu/people/faculty/butenko/papers/kplex060127.pdfS; 2006. [3] Battiti R, Protasi M. Reactive local search for the maximum clique problem. Algorithmica 2001;29(4):610–37. [4] Bellare M, Goldreich O, Sudan M. Free bits, PCPs and non-approximability—toward tight results. In: Proceedings of the 36th annual IEEE symposium on foundations of computer science. New York: IEEE Computer Society; 1995. p. 422–31. [5] Berman P, Pelc A. Distributed fault diagnosis for multiprocessor systems. In: Proceedings of the 20th annual international symposium on fault-tolerant computing, Newcastle, UK, 1990. p. 340–6. [6] Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang D-U. Complex networks: structure and dynamics. Physics Reports 2006;424:175–308. [7] Bomze IM, Budinich M, Pardalos PM, Pelillo M. The maximum clique problem. In: Du DZ, Pardalos PM, editors. Handbook of combinatorial optimization, suppl. vol. A. Dordrecht: Kluwer; 1999. p. 1–74. [8] Butenko S, Wilhelm WE. Clique-detection models in computational biochemistry and genomics. European Journal of Operational Research 2006;173:1– 17. [9] Caprara A, Carr R, Istrail S, Lancia G, Walenz B. 1001 optimal PDB structure alignments: Integer programming methods for finding the maximum contact map overlap. Journal of Computational Biology 2004;11:27–52. [10] Cavique L, Luz CJ. A heuristic for the stability number of a graph based on convex quadratic programming and tabu search. Journal of Mathematical Sciences 2009:161(6):944–55. [11] Cook DJ, Holder LB. Graph-based data mining. IEEE Intelligent Systems 2000;15(2):32–41. [12] Crescenzi P, Fiorini C, Silvestri R. A note on the approximation of the MAX CLIQUE problem. Information Processing Letters 1991;40:1–5. [13] de Klerk E, Pasechnik DV. Approximation of the stability number of a graph via copositive programming. SIAM Journal on Optimization 2002;12(4):875–92. [14] Della Croce F, Tadei R. A multi-KP modeling for the maximum-clique problem. European Journal of Operational Research 1994;73:555–61. [15] Fahle T. Simple and fast: improving a branch-and-bound algorithm for maximum clique. In: Lecture notes in computer science, vol. 2461. Berlin: Springer; 2002. p. 485–98. [16] Gagneur J, Krause R, Bouwmeester T, Casari G. Modular decomposition of protein–protein interaction networks. Genome Biology 2004;5(8):R57.1–2. [17] Gardiner EJ, Artymiuk PJ, Willett P. Clique-detection algorithms for matching three-dimensional molecular structures. Journal of Molecular Graphics and Modelling 1997;15:245–53. [18] Gendron B, Hertz A, St-Louis P. A sequential elimination algorithm for computing bounds on the clique number of a graph. Discrete Optimization 2008;5(3):615–28. [19] Grotschel M, Lova´sz L, Scherijver A. Geometric algorithms and combinatorial optimization. Algorithms and Combinatorics, vol. 2. Berlin: Springer; 1988. [20] Grosso A, Locatelli M, Pullan W. Simple ingredients leading to very efficient heuristics for the maximum clique problem. Journal of Heuristics 2008;14(6):587–612. [21] Hansen P, Mladenovic´ N, Uroˇsevic´ D. Variable neighborhood search for the maximum clique. Discrete Applied Mathematics 2004;145:117–25. [22] Hotta K, Tomita E, Takahashi H. A view-invariant human face detection method based on maximum cliques. Transactions of IPSJ 2003;44(SIG14): 57–70. [23] Johnson DS, Trick MA, editors. Cliques, coloring and satisfiability. In: DIMACS series in Discrete mathematics and theoretical computer science, vol. 26, 1996. [24] Karp RM. Reducibility among combinatorial problems. In: Miller RE, Thatcher JW, editors. Complexity of computer computations. New York: Plenum Press; 1972. p. 85–103. [25] Krishna P, Vaidya N, Chatterjee M, Pradhan D. A cluster-based approach for routing in dynamic networks. ACM SIGCOMM Computer Communication Review 1997:49–65. [26] Luce RD, Perry AD. A method of matrix analysis of group structure. Psychometrika 1949;14:95–116. [27] Luz CJ, Schrijver A. A convex quadratic characterization of the Lova´sz theta number. SIAM Journal on Discrete Mathematics 2005;19(2):382–7. ¨ stergard ˚ [28] O PRJ. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics 2002;120:197–207. [29] Pena J, Vera J, Zuluaga L. Computing the stability number of a graph via linear and semidefinite programming. SIAM Journal on Optimization 2007;18(1):87–105.
ARTICLE IN PRESS 1358
P. Martins / Computers & Operations Research 37 (2010) 1348–1358
[30] Regin J-C. Solving the maximum clique problem with constraint programming. In: Lecture notes in computer science., vol. 2883. Berlin: Springer; 2003. p. 634–48. [31] Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Science 2003;100(21): 12123–8. [32] Strickland DM, Barnes E, Sokol JS. Optimal protein structure alignment using maximum cliques. Operations Research 2005;53(3):389–402.
[33] Tomita E, Kameda T. An efficient branch-and-bound algorithm for finding a maximum clique with computational experiments. Journal on Global Optimization 2007;37:95–111. [34] Valente AXCN, Cusick ME. Yeast protein interactome topology provides framework for coordinated-functionality. Nucleic Acids Research 2006;34(9): 2812–9. [35] Han JD, Bertin N, Hao T, et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 2004;430:88–93.