Refined pivot selection for maximal clique enumeration in graphs

Refined pivot selection for maximal clique enumeration in graphs

JID:TCS AID:10517 /FLA Doctopic: Algorithms, automata, complexity and games [m3G; v1.168; Prn:20/11/2015; 10:58] P.1 (1-10) Theoretical Computer Sc...

871KB Sizes 2 Downloads 97 Views

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.1 (1-10)

Theoretical Computer Science ••• (••••) •••–•••

Contents lists available at ScienceDirect

Theoretical Computer Science www.elsevier.com/locate/tcs

Refined pivot selection for maximal clique enumeration in graphs Kevin A. Naudé Computing Sciences, Nelson Mandela Metropolitan University, Port Elizabeth, South Africa

a r t i c l e

i n f o

Article history: Received 17 September 2013 Received in revised form 2 November 2015 Accepted 10 November 2015 Available online xxxx Communicated by G. Ausiello Keywords: Clique enumeration Maximal cliques Maximal independent sets

a b s t r a c t This article re-examines the pivoting Bron–Kerbosch algorithm for identifying all maximal cliques within a graph. A curious result is presented: there exist pivot candidates which may be selected with no penalty to the worst-case running time, even if these vertices do not satisfy the established conditions for the known complexity bound. Hence, such pivots may be selected eagerly. A refined pivot selection process is described, and the preservation of the established worst case bound is proved. Additionally, empirical evidence shows that the revised algorithm is notably faster than Bron–Kerbosch with Tomita-style pivot selection. © 2015 Elsevier B.V. All rights reserved.

1. Introduction The discovery of cliques within graphs is an abstract problem with many practical applications. A well known example is that maximal cliques in a compatibility graph describe the strength of association between the structures described by the two source graphs. For example, if the source graphs describe biological structures, maximal clique enumeration may be used to find related structures; or more specifically, to measure the structural similarity between domain artefacts. This idea has been used in many fields, including bioinformatics and cheminformatics [1–3]. The importance of clique enumeration is elevated by noting that the cliques of some graph G are independent sets of vertices in its complement G. Consequently, the problems of clique enumeration and independent set enumeration are closely related. Thus maximal clique enumeration may be also used for problems that call for independent sets. For example, register allocation in compilers, which requires finding large sets of independent variables in a conflict graph. Each of these application domains may immediately benefit from well established algorithms for maximal clique enumeration. This article reports improvements to a family of such algorithms: the pivoting Bron–Kerbosch (BK) maximal clique enumerators [4]. This is a class of algorithms because the base method may be parameterised by a pivot selection strategy. Many such pivoting strategies have been studied, including greedy strategies which are fast when problem cases do not appear [5], and strategies that are somewhat more costly, but prevent cases of degenerate behaviour [6–8]. Such algorithms shall be denoted here in the form BK–PivotStrategy, the prevailing variety of which is BK–Tomita, using the pivoting method of [6]. This article is organised into six remaining parts. Section 2 introduces all necessary notations, and Section 3 offers a brief overview of the BK enumerators. The limitations of these enumerators are further examined in Section 4, which subsequently introduces a new revised algorithm. The worst case performance of the new algorithm is examined theoretically

E-mail address: [email protected]. http://dx.doi.org/10.1016/j.tcs.2015.11.016 0304-3975/© 2015 Elsevier B.V. All rights reserved.

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.2 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

2

Fig. 1. The application of Algorithm 1 to a simple graph, discovering all maximal cliques (shown in four colours), and discarding sub-optimal search branches. (For interpretation of the colours in this figure, the reader is referred to the web version of this article.)

in Section 5, while Section 6 describes a set of numerical experiments designed to understand the practical performance offered by the algorithm. The results are discussed in Section 7. 2. Preliminaries Maximal clique enumerators are applied to simple undirected graphs. In such graphs, two vertices are considered to conflict if they cannot appear together in a clique, i.e. there is no edge association between them. The mathematical preliminaries presented here unify the problems of clique and independent set enumeration under common notation, which is used throughout discussion.

• G = ( V , E ) denotes a simple undirected graph, with complement G . • N ( v ) = {u | u = v , (u , v ) ∈ E } denotes the neighbours of v. • K ( v ) = V − N ( v ) denotes the vertices that conflict with v, together with v itself. K ( v ) describes a hub-and-spoke conflict star graph in G, and is associated with independent set enumeration. v ∈ W K ( v )− W denotes the conflict frontier surrounding W , which is the set of vertices which cannot be included with W due to at least one conflict. • G ( W ) denotes the subgraph of G induced by W , with W ⊆ V . • If S is clique in G, it is maximal if, and only if, S ∪ F ( S ) = V .

• F (W ) =

3. Pivoting clique enumeration The Bron–Kerbosch (BK) algorithm is a widely known branch-and-bound algorithm for enumerating all maximal cliques [4]. The ordinary form of the algorithm is simple but inefficient, as it ultimately discards many rediscovered and submaximal cliques. Nevertheless, it is briefly described here as it is the foundation of subsequent algorithms. The clique discovery process maintains three disjoint sets: S, P , and X . The set S describes the current clique that is being expanded, while P contains those vertices that may legally be combined with S to form larger cliques. The role of the set X is the most subtle. It simultaneously acts to prevent the rediscovery of cliques already seen, and also ensures that only maximal cliques are reported. The elementary (non-pivoting) clique enumerator proceeds in a recursive process. At each step a vertex v is selected from P for consideration; thus, creating a branch in the search space. The first branch enumerates (recursively) all the cliques that contain v. The second branch attempts to find cliques that cannot contain v. In other words, cliques produced on the second branch must prove that the exclusion of v was necessary. This is achieved by ensuring that each such clique includes at least one vertex which conflicts with v. Hence, the set X maintains the collection of vertices for which excluding conflicts are still being sought. Elements of X are discarded whenever suitable conflicts are included in the current clique, thereby proving their exclusion. This basic algorithm is substantially improved by noting that not all elements of P need to be considered as starting points for new cliques. In fact it is sufficient to extend cliques by searching only a subset of vertices that have a common conflict, such as P ∩ K (q) for some vertex q ∈ P ∪ X . This gives rise to the pivoting Bron–Kerbosch enumerators. The role of pivoting is two-fold. Primarily, it promotes dividing the search space into branches that are likely to conflict with one another, meaning that they yield different solutions. As a result, pivot selection can allow earlier pruning of non-productive branches from the search space. Secondly, the choice of pivot may also reduce the branching factor such that the search tree is narrow and deep, rather than wide and shallow. Keeping the branching factor small favours the possibility of pruning larger sections of the search space. This is the intention of the pivot selection strategy widely attributed to Tomita et al. [6]. The BK–Tomita algorithm, shown in Algorithm 1, effects pivot selection by finding a vertex which maximises | P ∩ N ( v )|. This strategy trades an O (n2 ) cost in pivot selection for some assurances that degenerate cases (which lead to fruitless and discarded work) are not triggered. The application of Algorithm 1 to a small graph with four maximal cliques is shown in Fig. 1. The output produced is a structured text listing of all maximal cliques, and is similar in spirit to the tree-like output of Tomita et al. [6]. The

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.3 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

3

Algorithm 1: The pivoting Bron–Kerbosch maximal clique enumerator with Tomita-style pivot selection strategy.

listed tree provides a unique leaf node for every maximal clique, and the clique vertices may be read as the paths from the root to the leaves. For visual emphasis, the program code that deals directly with printing the listing is rendered in blue text1 in Algorithm 1. This formulation explicitly identifies every maximal clique with a check mark (✓), while marking discarded non-maximal cliques with a cross (✗). Such output can be instructive, and satisfies the following formal grammar:

Cliques → Vertex∗ (Mark | Choice) Choice → “(” Cliques (“|” Cliques)∗ “)” Mark → “✓” | “✗” 4. Revised algorithm n

There are potentially O (3 /3 ) maximal cliques within a given graph [9], and pivot selection occupies a substantial portion of the computation time spent in BK enumerators. The purpose of this work is to explore early exit opportunities within pivot selection. Specifically, it is shown that certain vertices may be eagerly selected as pivots with no penalty to the worst case behaviour, even when they are not strictly ideal pivot candidates. A close examination of pivot candidates in the BK–Tomita enumerator reveals unexploited opportunities for improvement in the selection process. The following examples describe five specific cases that are of interest. It is expected that good implementations of BK–Tomita already optimise for case 1. However, cases 2–5 offer less obvious optimisation opportunities and have not received prior treatment. In particular, cases 2–5 can only be selected in BK–Tomita after it has been proved that no better candidate is available, leading to its O (n2 ) cost per selection. The cases identified are: Case 1:

| P ∩ K (q)| = 0 (note that q ∈ X , as q ∈ P cannot occur) This condition represents an opportunity to discontinue search. The reason is that the pivot q has no available conflicts in P . Hence, it is impossible to find evidence to support the exclusion of q from the current clique. In consequence, the current clique is destined to remain sub-maximal and should be discarded as non-productive.

1

For interpretation of the colours, the reader is referred to the web version of this article.

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.4 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

4

Algorithm 2: Revised pivot selection strategy. Cliques and Apply (not shown here) are replicated from Algorithm 1.

Case 2:

Case 3:

Case 4:

Case 5:

| P ∩ K (q)| = 1 and q ∈ X

In this case it is clear that the pivot q has only one remaining conflict in P , let it be called w ∈ P . To produce a clique with evidence for the exclusion of q, w must be selected for inclusion in the growing clique. The branch in which w is excluded is non-productive. | P ∩ K (q)| = 1 and q ∈ P The pivot q is isolated in P . Therefore, no conflicting vertices for q remain available, so q must be part of the current maximal clique. | P ∩ K (q)| = 2 and q ∈ P The pivot q induces a conflict set with two elements. This is the smallest possible conflict set which still creates a legitimate branching of the search space. | P ∩ K (q)| = 2 and q ∈ X The pivot q has similar properties to that of case 4. However, it is not itself part of the conflict set, since it is not in P . Consequently, it is a more difficult case that will require special treatment in subsequent sections.

It is now asserted that each of these cases may be selected as pivots eagerly and with impunity, even if such a pivot does not maximise | P ∩ N ( v )|, which is the condition required by Tomita et al. [6]. The new algorithm introduced here exploits all of these cases. It does so by (a) searching for pivots in X before P , and (b) terminating the search early if any of the aforementioned cases are discovered. Throughout the search, the algorithm records the best pivot q (minimising | P ∩ K (q)|), which is then used as the general fall-through case if none of the early-exit candidates are available. A complete reference listing of BK–New is set out in Algorithm 2.

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.5 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

5

A small number of the more subtle aspects of the revised algorithm are now discussed. Firstly, the algorithm seeks to minimise | P ∩ K (q)| rather than to maximise | P ∩ N (q)|. These conditions are mathematically equivalent, but not computationally equivalent. Consider a vertex v which is ultimately rejected as a non-suitable pivot because | P ∩ N ( v )| ≤ | P ∩ N (q)|, or alternatively because | P ∩ K ( v )| ≥ | P ∩ K (q)|. In the former case, there is no opportunity to abandon v until | P ∩ N ( v )| is accurately determined. However, in the latter case it is sufficient to show that | P ∩ K ( v )| is at least as large as | P ∩ K (q)|, in order to reject v. This is a very subtle point: since | P ∩ K (q)| is expected to be low, many vertices such as v can be rejected quickly before determining the actual value of | P ∩ K ( v )| precisely. These facts are expressed in lines 9 and 22 of Algorithm 2 (some implementation details are omitted for brevity). Another subtlety occurs for cases 2 and 3. In both of these instances, the pivot candidate does not actually produce a branch in the search space; or perhaps more correctly, these pivots divide the search space with branching factor 1. The effect of selecting such a vertex in existing Bron–Kerbosch enumerators is that the search space is diminished but not divided, and pivot search is immediately restarted. However, in a naïve restarting of the search, the useful information held in the previous best pivot candidate q is discarded, only to be rediscovered. The revised algorithm instead processes cases 2 and 3 in place without disrupting the search. The search is only forcibly restarted if the current best pivot candidate q happens to conflict with the newly discovered pivot (see lines 18 and 30). The eager selection of pivots is an interesting trade-off. When the revised algorithm divides the search space, it might not do so with the least branching factor. However, the good pivots may be selected without considering all candidates. It is now necessary to formally show that the pivot candidates identified in this section may be selected with no meaningful harm to the overall running time. 5. Theoretical analysis n

The BK–Tomita algorithm has been shown to exhibit worst case behaviour of O (3 /3 ). While this seems poor, it corresponds with the largest number of maximal cliques that may be present in the source graph. No better worst case bound can be expected for maximal clique enumeration [9]. The revised algorithm has one feature which could affect worst case behaviour: it eagerly adopts sub-optimal pivot choices. This section provides theoretical analysis to show that the worst case n behaviour is not compromised by the pivot selection, and retains the desirable O (3 /3 ) bound. The analysis includes a minor theorem, giving the worst case time complexity for the PivotConflict function, and the main theorem. The main theorem gives an appropriate upper bound to the total time spent in enumerating cliques by Algorithm 2. The proof is similar in form to a related proof offered by Tomita et al. [6], and Lemma 3 is borrowed from the same authors. Theorem 1. The worst case time complexity of PivotConflict in Algorithm 2 is O (n3 ). Proof. Suppose PivotConflict is invoked with | P ∪ X | = n. It should be reasonably clear that ProcessInPlace will be called at most O (n2 ) times, since at each occasion the search may be restarted with at least one fewer candidate. Furthermore, ProcessInPlace is clearly O (n). Hence, the worst case time complexity of PivotConflict is O (n3 ). 2 Lemma 1. Let n

R (n) = C · 3 /3 − Y (n) , Y (n) = y 3 n3 + y 2n2 + y 1 n + y 0 , f (n) =

3 y 3n + 2 y 2n + y 1

y i > 0 for i ∈ [0, 3] ,

2

n/3−1

ln 3 · 3

,

and





C 1 = max f (t ) | t ∈ Z+ .

If C is a constant such that C ≥ C 1 , then R (n) is monotonic (non-decreasing) for n ∈ Z+ . Proof. First note that C 1 is a well defined constant, because f (n) is defined over Z+ , and

lim

n→+∞

f (n) = lim

3 y 3 n2 + 2 y 2 n + y 1

n→+∞

ln 3 · 3n/3−1

= 0.

Let

h(n) = C − f (n).

(1)

∀n ∈ Z+ , because C

Definition (1) ensures that h(n) ≥ 0 Now, as C is a constant, dC = 0, meaning that dn

dR dn

  n = C ln 3 · 3 /3−1 − 3 y 3n2 + 2 y 2n + y 1 .

Substituting (1) into (2), gives

≥ C 1 ≥ f (n).

(2)

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.6 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

6

  n = ( f (n) + h(n)) ln 3 · 3 /3−1 − 3 y 3n2 + 2 y 2n + y 1 dn     3 y 3 n2 + 2 y 2 n + y 1 n/3−1 2 = + h ( n ) ln 3 · 3 − 3 y n + 2 y n + y 3 2 1 ln 3 · 3n/3−1     n = 3 y 3n2 + 2 y 2n + y 1 + ln 3 · 3 /3−1 h(n) − 3 y 3n2 + 2 y 2n + y 1

dR

= ln 3 · 3 /3−1 h(n) ≥ 0

∀n ∈ Z+

n

Hence, R (n) is a monotonic non-decreasing function for n ∈ Z+ .

2

Lemma 2. If q is a pivot selected by Algorithm 2, and Q = K (q) ∩ P with | Q | = k, then either Condition A: ∀ v ∈ Q , | P ∩ K ( v )| ≥ k; or Condition B: ∀ v ∈ Q , | P ∩ K ( v )| ≥ k − 1, with q ∈ X . Proof. A close inspection of Algorithm 2 reveals that only four distinct pivot selection cases result in returning a conflict set Q . These are cases 1, 4, 5, and the general fall-through case. Each is examined in turn. Since | P ∩ K (q)| = 0, condition A is trivially satisfied. Noting that | P ∩ K (q)| = 2, suppose that Q = P ∩ K (q) = {q, w }. Consider w, for which P ∩ K ( w ) ⊇ {q, w }. Consequently, it is also clear that | P ∩ K ( w )| ≥ 2 = k, satisfying condition A. Case 5 In this case, | P ∩ K (q)| = 2 with q ∈ X . Each element v in Q must conflict with itself at least, so ∀ v ∈ Q , | P ∩ K ( v )| ≥ 1 = k − 1, satisfying condition B. General case Since q minimises | P ∩ K (q)| for all candidates remaining in P ∪ X , by definition, ∀ v ∈ Q , | P ∩ K ( v )| ≥ | P ∩ K (q)| = k, satisfying condition A.

Case 1 Case 4

Hence, in all cases, at least one of either condition A or condition B is satisfied.

2

Lemma 3.

∀k ∈ Z+ , k = 3

−k −2 0 < k · 3 /3 ≤ 2 · 3 /3 < 1 −k

−2

Proof. The fact that k · 3 /3 ≤ 2 · 3 /3 is proved in Lemma A.3 of [6]. −k −2 Furthermore, k · 3 /3 and 2 · 3 /3 are each confined to the (0, 1) interval for k ∈ Z+ , k = 3, which is easily confirmed by inspection. 2 n

Theorem 2. The worst case time complexity of Algorithm 2 is O (3 /3 ). Proof. With respect to Algorithm 2: Let B (n) = bn3 with b > 0 be an upper bound for the running time of PivotConflict with | P ∪ X | = n, in accordance with Theorem 1. Let T (n) similarly express an upper bound for the running time of Apply. Furthermore, let T (n) and B (n) be chosen such that T (1) = B (1), which is possible as T (n) and B (n) are merely upper bounds. Now with n

R (n) = C · 3 /3 − Y (n),

C >0

and

Y (n) = y 3 n3 + y 2n2 + y 1 n + y 0 ,

y i > 0 for i ∈ [0, 3]

it is required to show that there exist values for y 0 , y 1 , y 2 , y 3 , and C such that R (n) is monotonic with

T (n) ≤ R (n)

∀n ∈ Z+ .

Let y 3 = 12 b, y 2 =

27 b, 4

y1 =



81 b, 2



C 1 = max f (t ) | t ∈ Z+ , C2 =

1281 8 · 31/3

and y 0 =

891 b, 8

with f (n) =

3 y 3 n2 + 2 y 2 n + y 1 ln 3 · 3n/3−1

;

b;





C 3 = max g (t ) | t ∈ Z+ ,

with g (n) =

3Y (n − 3) 3 (1 − 2 · 3−2/3 ) n/3

; and

C = max {C 1 , C 2 , C 3 } . Note that C 1 and Lemma 1 ensure that R (n) is non-decreasing for n ∈ Z+ .

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.7 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

7

Further note that Y (n) ≥ 0 ∀n ∈ Z+ , and that algebraically,

3Y (n − 3) − B (n) = Y (n).

(3)

The proof proceeds by mathematical induction upon n as follows: Base case:



1

1

R (1) = C · 3 /3 − Y (1) = C · 3 /3 −

1

+

27

+

81

2 4 2 1273 1 = C · 3 /3 − b 8 1273 1 ≥ C 2 · 3 /3 − b 8 = b = T (1)

+

891 8



b

Induction case: Assume T (m) ≤ R (m) ∀m < n. Let q be the chosen pivot, with Q = K (q) ∩ P , and | Q | = k. The proof follows in two steps. Step 1 shows that T (n) ≤ kT (n − k) + B (n), and Step 2 finally shows that T (n) ≤ R (n). Step 1 Lemma 2 indicates that either condition A or condition B must hold, so both cases are examined. For condition A: Note that Apply is invoked once for each element in Q . Furthermore, for each invocation, at least k items are removed from P ∪ X . Hence, T A (n) ≤ kT (n − k) + B (n). For condition B: Note again that Apply is invoked once for each element in Q . Furthermore, for each invocation, at least k − 1 items are removed from P ∪ X . However, q ∈ / Q is also removed from X . Consequently, at least k items are removed from P ∪ X , giving T B (n) ≤ kT (n − k) + B (n). Thus, for both conditions,

T (n) ≤ kT (n − k) + B (n).

(4)

Step 2 Note from the inductive assumption that n−k/3

T (n − k) ≤ R (n − k) = C · 3

− Y (n − k) ∀k > 0.

(5)

Combining (4) and (5) gives

T (n) ≤ kT (n − k) + B (n)

  n−k ≤ k C · 3 /3 − Y (n − k) + B (n) n−k/3

= kC · 3

− (kY (n − k) − B (n))

(6)

To conclude the theorem it is necessary to show that R (n) is an upper bound for (6). The cases of k = 3 and k = 3 are now explored separately. For k = 3, substitution gives

T (n) ≤ 3C · 3 /3−1 − (3Y (n − 3) − B (n)) n

n

= C · 3 /3 − Y (n)

using (3)

= R (n)

by definition

For k = 3, the objective claim T (n) ≤ R (n) is true if

kC · 3

(n−k)/3

−(kY (n − k) − B (n))



n

C · 3 /3 − (3Y (n − 3) − B (n))

which is itself true if it can be shown that

3Y (n − 3) − kY (n − k) 3n/3 (1 − k · 3−k/3 )

≤ C.

(7)

Examining the left-hand side of (7) gives

3Y (n − 3) − kY (n − k) 3n/3 (1 − k · 3−k/3 )



3Y (n − 3) 3n/3 (1 − k · 3−k/3 ) 3Y (n − 3)



3n/3 (1 − 2 · 3−2/3 )

≤ C3 ≤ C

since Y (n − k) ≥ 0 using Lemma 3 by definitions

Thus, in all cases, T (n) ≤ R (n), proving the theorem by induction.

2

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.8 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

8

Table 1 An excerpt of data comparing the enumeration time of BK–Tomita and BK–New for edge densities d = 0.4 and d = 0.8. Results for BK–New marked with an asterisk (*) are statistically significant improvements over BK–Tomita at the 95% confidence level, with p < 0.05 in paired one tail t-tests. Number of vertices

50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150

Average for 40% edge density

Average for 80% edge density

Number of cliques

BK–Tomita time (s)

BK–New time (s)

498 669 892 1167 1503 1910 2397 2962 3642 4447 5383 6467 7730 9189 10 858 12 708 14 891 17 297 20 060 23 154 26 650

0.0002 0.0003 0.0004 0.0005 0.0007 0.0010 0.0013 0.0016 0.0020 0.0025 0.0030 0.0037 0.0044 0.0053 0.0063 0.0074 0.0090 0.0107 0.0128 0.0153 0.0180

0.0002 0.0003 0.0004 0.0005 0.0007 0.0009 0.0012 0.0015 0.0019 0.0023 ∗ 0.0028 ∗ 0.0034 ∗ 0.0041 ∗ 0.0049 ∗ 0.0058 ∗ 0.0068 ∗ 0.0080 ∗ 0.0095 ∗ 0.0114 ∗ 0.0134 ∗ 0.0157

Number of cliques 17 530 33 741 66 188 122 172 222 823 383 355 672 190 1 113 363 1 887 401 3 048 047 5 041 379 7 806 841 12 238 001 19 115 654 28 815 574 43 379 965 63 869 088 93 560 342 138 174 339 203 061 280 287 503 776

BK–Tomita time (s) 0.007 0.013 0.026 0.056 0.112 0.207 0.372 0.637 1.095 1.788 2.988 4.668 7.331 11.552 17.711 26.816 43.043 65.993 100.941 151.140 218.522

BK–New time (s) ∗ 0.005 ∗ 0.010 ∗ 0.020 ∗ 0.040

∗ 0.079 ∗ 0.142 ∗ 0.255 ∗ 0.434 ∗ 0.744

∗ 1.213

∗ 2.023 ∗ 3.159 ∗ 4.952 ∗ 7.933

∗ 12.165 ∗ 18.382 ∗ 28.847 ∗ 43.452 ∗ 65.285 ∗ 97.342

∗ 139.961

Readers may find it instructive to compare Theorem 2 with the similar theorem offered by Tomita et al. [6] which bounds the running time of BK–Tomita. Although the two theorems follow the same structure, they employ different notation for some aspects. For instance, Tomita et al. forgo symbolic names for sets P , X , and P ∪ X , preferring to use the descriptive names CAND, FINI, and SUBG, respectively. The same authors prefer the use of ( v ) to denote the set of vertex neighbours, which is simply N ( v ) in the present research. Finally, Tomita et al. employ polynomials named P (n) and Q (n) which, although different functions, are analogous to B (n) and Y (n) in Theorem 2. 6. Numerical experiments The theoretical analysis provided is most valuable for understanding the extreme cases of behaviour, but it does not necessarily characterise the practical performance that will be observed. Such behaviour is governed by properties of the graphs under review. To examine the practical performance, the new algorithm was compared with the prevailing pivoting BK–Tomita algorithm across a wide selection of pseudo-random graphs. The graphs were generated with the aid of a robust (and correctly initialised) Mersenne Twister pseudo-random number generator [10]. Graphs with specific characteristics ˝ were obtained by the standard G n, M Erdos–Rényi model, in which n denotes the graph vertex count, and M the number of edges [11]. The quantity of edges was chosen to yield a range of fixed edge densities,2 with M = d n(n2−1) for edge density d. A diverse combination of graph sizes were examined. The edge densities ranged from 5% and 95%, in 5% increments. Vertex counts ranged from 40 to 150 vertices, in 5 vertex increments, for all edge densities below 85%. Edge densities greater than 85% were constrained to lower maximum number of vertices due to otherwise infeasible computation time. For every combination of density and number of vertices, a sample of pseudo-random graphs was generated and their cliques enumerated. The sample size for each case was dependent upon the edge density. Specifically, samples of 30 graphs were used for high edge densities (d ≥ 65%), 100 graphs for moderate densities (35% ≤ d ≤ 60%), and 500 graphs for the remaining low edge densities.3 The algorithms were implemented in C , and utilise a common underlying set data structure in the form of a bit-vector with 64-bit word size. The bit-vector was chosen for time efficiency in all relevant set operations. Bits within the bit-vector were efficiently enumerated by means of a De Bruijn sequence. It should be noted that the revised algorithm was tested against a highly optimised implementation of BK–Tomita to ensure that the new algorithm receives no artificial advantage. The running time was measured for both algorithms. An important detail is that the algorithms were applied to the same graphs, rather than independently generated graph samples. This ensured that paired t-tests could be used to test the significance of any measured improvement. The next section describes the results of the numerical experiments, an excerpt of which is given in Table 1.

2 3

˝ The difference with respect of the Erdos–Rényi G n, p =d model is that G n, M provides specific rather than probabilistic edge densities. Clique enumeration in low edge density graphs is extremely quick, so a large sample size was used to mitigate limits in timing precision.

JID:TCS AID:10517 /FLA

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.9 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

9

Fig. 2. The speed of BK–New relative to BK–Tomita as a function of the number of graph vertices, shown for various edge densities.

Fig. 3. The speed of BK–New relative to BK–Tomita as a function of the graph edge density, shown for various vertex counts.

7. Results Table 1 presents the running time performance of the two algorithms when applied to both moderate edge density (40%), and high edge density (80%). The data show that the growth in number of cliques is dramatically pronounced for higher edge densities compared to moderate densities. Such conditions are computationally more challenging and thus interesting. The revised algorithm was found to yield improved performance over BK–Tomita for graphs with high edge densities, regardless of the number of vertices. All paired t-tests for such data were significant, with p < 0.05. Improvements were also widely noted for moderate edge densities, but the improvements for small numbers of vertices were not significance. Cases with n ≥ 100 favoured BK–New, and were significant with p < 0.05. The relationship between the observed performance and the edge density is further explored in Fig. 2. The figure shows that the edge densities form distinct performances strata. Edge densities below 30% typically favoured the established BK– Tomita algorithm, although the difference could only be confirmed statistically for very low edge densities (5–10%). In all edge densities higher than 35%, the data showed that the revised algorithm improved performance, provided the source graphs were large enough. An interesting phenomenon was evident when the number of vertices is increased beyond 64 vertices, and again beyond 128 vertices. The revised algorithm showed marked performance improvements over the standard algorithm at these quanta. The reason is that the underlying set bit-vector increases in size at thresholds that are multiples of the word size in bits,4 in this case, 64. The question that remains is how these facts benefit the revised algorithm. One explanation is straightforward: the BK–Tomita algorithm typically iterates through the whole set data-structure, whereas the revised algorithm has greater

4 Sparse set implementations could offer less punctuated growth. However, such structures lose the ability to process multiple vertices at once during set operations, which are of greater importance to the algorithms under study.

JID:TCS AID:10517 /FLA

10

Doctopic: Algorithms, automata, complexity and games

[m3G; v1.168; Prn:20/11/2015; 10:58] P.10 (1-10)

K.A. Naudé / Theoretical Computer Science ••• (••••) •••–•••

opportunities to exit the iteration early. Since the size of the set data-structure increases in fixed quantities, the practical improvement for the tested implementations increased in punctuated intervals. The effects of the organisation of the set data-structure are not observed when fixed graph vertex counts are examined, such as in Fig. 3. It is again clear that the revised algorithm yields noticeable improvement for moderate to large edge densities, particularly so when the number of vertices is large. The best improvement of approximately 56% speed up was observed for the case of 150 vertices and 80% edge density. However, it was not feasible to explore graphs of similar size for higher edge densities, so the upper threshold of the practical benefit is not known. For the lowest edge densities, the data suggests that other algorithms are likely to be preferred, such as that of [12,13] which is specifically formulated for very sparse graphs. In all other cases, the revised algorithm is to be strongly recommended. 8. Conclusion The pivoting Bron–Kerbosch clique enumerator combined with Tomita style pivot selection is well established as one of the best available clique enumeration algorithms. The primary contribution of this work is the provision of a new algorithm that has notably improved practical performance, while retaining the formally proved worst-case running time. A secondary contribution is that it extends the range of graph sizes and densities for which it is actually feasible to enumerate cliques with an allotted amount of time. These advantages are now immediately available to all the diverse application domains in which clique enumeration features. Acknowledgements The author would like to thank anonymous referees for investing time and energy in performing a thorough review of this article. The feedback received was highly valued, and resulted in and improvements throughout the work. References [1] S. Rahman, M. Bashton, G. Holliday, R. Schrader, J. Thornton, Small molecule subgraph detector (SMSD) toolkit, J. Cheminform. 1 (2009) 12. [2] S. Butenko, W.E. Wilhelm, Clique-detection models in computational biochemistry and genomics, European J. Oper. Res. 173 (2006) 1–17. [3] E.J. Gardiner, P.J. Artymiuk, P. Willett, Clique-detection algorithms for matching three-dimensional molecular structures, J. Mol. Graph. Model. 15 (1997) 245–253. [4] C. Bron, J. Kerbosch, Algorithm 457: finding all cliques of an undirected graph, Commun. ACM 16 (1973) 575–577. [5] I. Koch, Enumerating all connected maximal common subgraphs in two graphs, Theoret. Comput. Sci. 250 (2001) 1–30. [6] E. Tomita, A. Tanaka, H. Takahashi, The worst-case time complexity for generating all maximal cliques and computational experiments, Theoret. Comput. Sci. 363 (2006) 28–42. [7] F. Cazals, C. Karande, Reporting maximal cliques: new insights into an old problem, Research report RR-5615, INRIA, 2006. [8] F. Cazals, C. Karande, A note on the problem of reporting maximal cliques, Theoret. Comput. Sci. 407 (2008) 564–568. [9] J. Moon, L. Moser, On cliques in graphs, Israel J. Math. 3 (1965) 23–28. [10] M. Matsumoto, T. Nishimura, Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Trans. Model. Comput. Simul. 8 (1998) 3–30. ˝ A. Rényi, On random graphs. I, Publ. Math. 6 (1959) 290–297. [11] P. Erdos, [12] D. Eppstein, M. Löffler, D. Strash, Listing all maximal cliques in sparse graphs in near-optimal time, in: Algorithms and Computation, in: Lecture Notes in Computer Science, vol. 6506, Springer Berlin Heidelberg, 2010, pp. 403–414. [13] D. Eppstein, D. Strash, Listing all maximal cliques in large sparse real-world graphs, in: Experimental Algorithms, in: Lecture Notes in Computer Science, vol. 6630, Springer Berlin Heidelberg, 2011, pp. 364–375.