Grid based clustering for satisfiability solving


Applied Soft Computing Journal 88 (2020) 106069


Celia Hireche, Habiba Drias, Hadjer Moulai (Laboratory of Research in Artificial Intelligence), University of Sciences and Technology Houari Boumediene, Algeria

Article info

Article history: Received 6 January 2019; Received in revised form 20 November 2019; Accepted 2 January 2020; Available online 7 January 2020.

Keywords: Satisfiability problem; Computational complexity; Problem solving; Data mining techniques; Data distribution; Grid clustering; BSO; DPLL

Abstract

The originality of this work resides in the exploitation of data mining techniques for problem solving. Two major phases define this work. The first is to determine the clustering technique that best suits each SAT instance, based on the distribution of the latter. The clustering technique is then applied to reduce the complexity of each instance by creating sub-instances that can be solved independently in the second phase. The latter consists of a resolution step, where the DPLL or BSO algorithm is executed depending on the number of variables to be assigned in each cluster. This two-phase strategy provides more efficient problem solving. The Boolean Satisfiability problem (SAT) is considered in this study because of its importance for the Artificial Intelligence (AI) community and the impact of its solving on other complex problems. Three different distributions of the problem were observed. The first defines a space where the variables are dispersed, forming regions of considerable density interspersed with regions of lower density or empty regions. The other two distributions do not show any significant shape, as the variables are randomly scattered; one of these two dispersions has the particularity that almost all its variables are of high occurrence. To each of the three distributions, a clustering technique is associated. Density-based clustering is the most appropriate type of clustering for the first distribution, while grid-based clustering and frequent patterns mining appear to be the most suitable techniques for the second and third distributions. Investigations are undertaken on these latter issues and contributions are presented in this paper. Experiments were conducted on public benchmarks, and the results showed the importance of the data mining pre-processing step for solving the SAT problem.

1. Introduction and motivation

Data mining techniques, and particularly clustering techniques, are among the most frequently used tools for extracting relationships and knowledge from data. These techniques are exploited in many domains, whether in theory or in practice. There are many clustering techniques in the literature, and each of them may provide different results for the same dataset. Getting to know the data therefore becomes a necessity in order to determine the most appropriate clustering technique to apply.

Given a problem to solve, the major contribution of this work is developed in two steps. The first step consists of exploring an instance of the problem by studying its statistical parameters, and in particular its distribution, in order to determine the most appropriate clustering technique to apply. After that, a resolution step is carried out where each of the resulting clusters is solved

independently, using either DPLL or BSO depending on the number of variables in each cluster. This process leads to an efficient resolution of hard problems such as the Boolean Satisfiability problem (SAT), which is considered as a case study in this paper. The choice of SAT is related to its importance in the AI community, being the first problem proven NP-complete [1].

The SAT problem is one of the most fundamental and studied NP-complete problems. Its instance is defined by a Conjunctive Normal Form (CNF) formula, which is a conjunction of clauses. Each clause is a disjunction of literals, a literal being defined as a variable x or its negation ¬x. The goal is to find a variable assignment that satisfies all the clauses and thus the formula. A clause is said to be satisfied if at least one of its literals is set to true. There exists a plethora of SAT solvers, despite the fact that its complexity makes it one of the hardest problems to solve. Complexity reduction therefore seems to be one of the best approaches when trying to achieve an effective and efficient resolution of the problem.

Throughout this work, we studied the statistical parameters, and in particular the distribution, of SAT instances in order to


determine the most appropriate clustering to apply. Three major distributions were found. In this paper, we focus on randomly scattered variables, for which we believe that grid-based clustering and frequent pattern mining algorithms are the most appropriate clustering techniques.

This article is organized as follows. The next section introduces some definitions and reviews work related to both clustering and the satisfiability problem. Section 3 discusses the discovered distributions and the proposed clustering techniques. The solving algorithms, DPLL and BSO, are presented in Section 4. Section 5 illustrates some case studies. A discussion of the obtained experimental results is presented in Section 6. Finally, in Section 7 we conclude this work and discuss some perspectives.

2. Definitions and related work

In this section, we introduce some basic definitions and work related to both clustering and the Boolean satisfiability problem SAT.

Clustering

Clustering [2] is the process of creating groups of data called clusters, with a high similarity between elements of the same cluster and a high dissimilarity between elements of distinct clusters. The numerous existing clustering techniques can be organized as follows.

Data driven algorithms (a) are totally dependent on the number and type of data objects. They include partitioning algorithms (a.1), which are based on the distance and similarity between objects and generally create a set of exclusive clusters, i.e., each object must belong to one and only one cluster. One of the most popular partitioning based clustering algorithms is the k-means algorithm [2,3]. Its principle is to divide the dataset into k distinct clusters around central elements called centroids, defined as the mean value of a cluster. However, k-means is sensitive to outliers because of its initial random initialization of centroids. Several variants were therefore proposed, such as GA-K-means [4], one of the most effective improvements of k-means, in which an optimal initial set of centroids is generated by a genetic algorithm. Another algorithm, more popular and effective but less efficient than k-means, is Partitioning Around Medoids (PAM) [5], which can be applied to any kind of data objects, contrarily to k-means, which can only be used on numerical data. The medoids, which replace centroids, are the central elements within each cluster. One of the most popular and effective PAM-based algorithms is Clustering Large Applications based on RANdomized Search (CLARANS) [6,7], designed as an iterative execution of PAM using a hill climbing strategy: the resulting clustering of each PAM run is considered a local optimum, and the global optimum defines the best final clustering.

The principle of hierarchical algorithms (a.2) [2] is either to successively merge the objects close to one another, in the agglomerative (bottom-up) approach, or to successively split the dissimilar objects of a group into smaller clusters, in the divisive (top-down) approach.

Density based clustering methods (a.3) consider the number of objects (density) within a defined region, rather than the distance or similarity between objects, to create clusters. The most popular density based clustering algorithm is Density Based Spatial Clustering of Applications with Noise (DBSCAN) [8,9]. Its principle is to determine the neighbourhood of a data point and define it as a cluster if its density is greater than a certain fixed threshold

MinPoints. Each data point that does not belong to any dense region is considered an outlier.

The second class of clustering techniques is space driven algorithms (b), represented by grid based clustering. In such techniques, the data are projected on a grid which is split into a number of cells according to a certain number of parameters; these cells represent the resulting clusters. This strategy has the advantage of being totally independent of the number of objects and dependent only on the number of cells, which permits a fast running time. Two popular grid based clustering algorithms are the STatistical INformation Grid (STING) [10], where the grid is successively divided, shaping a hierarchical structure of cells at different levels, and where the data are represented by statistical parameters such as the mean value, minimal and maximal values, and especially the data distribution; and CLustering In QUEst (CLIQUE) [11], which recursively joins cells whose density exceeds a certain threshold, following the Apriori [12] reasoning.

Finally, frequent patterns mining (c), which is considered the most important tool for the discovery and extraction of recurrent and interesting relationships and correlations between data, can be used for clustering too. The Apriori [12] algorithm is one of the best known and most used frequent patterns algorithms for capturing frequent items of different sizes.

SAT and its solvers

A SAT problem instance is defined by a set of clauses in conjunctive normal form (CNF), i.e. a conjunction of clauses, each clause being a disjunction of literals, where a literal is a variable x or its negation ¬x. The goal is to decide whether or not there is a satisfying assignment for the CNF formula. Formally [13], an instance of SAT is defined by the following pair:

• Instance: a set of n binary variables V = {x1, x2, . . . , xn} forming a set of m clauses C = {C1, C2, . . . , Cm}.
• Question: is the CNF formula φ = C1 ∧ C2 ∧ · · · ∧ Cm satisfiable? Does some variable assignment satisfy the formula?

Example 1. Let us consider the set of variables V = {x1, x2, x3, x4} and the set of clauses C = {C1, C2, C3, C4}. An example of a SAT instance would be defined as follows: C1 = (¬x1 ∨ x2), C2 = (x2 ∨ x3 ∨ x4), C3 = (x1 ∨ ¬x4), C4 = (x1 ∨ x2 ∨ x3). One possible variable assignment satisfying this SAT instance is {x1 = 0, x2 = 0, x3 = 1, x4 = 0}.

The Boolean satisfiability problem (SAT) [14] is one of the most studied NP-complete problems, and a tremendous amount of research has been undertaken over the last decades in order to solve it. The proposed solvers can be organized into different categories, from which we present the following.

Complete SAT solvers (a) [14]. The particularity of this type of solver is the coverage of the whole search space, which allows them to decide the satisfiability of the instance: they either provide a solution to the problem instance or prove that the instance cannot have a solution. The Davis–Putnam–Logemann–Loveland (DPLL) algorithm [15] is one of the most frequently used complete algorithms for SAT solving. It is a backtracking algorithm that recursively and arbitrarily assigns a truth value to a chosen variable and simplifies the formula to check whether it is satisfied; if the formula presents an inconsistency (an empty clause), the opposite truth value is assigned to the variable.
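To fix ideas, the following sketch gives a minimal recursive DPLL over a DIMACS-style encoding (signed integers for literals). This is our own illustrative encoding and sketch, not the authors' implementation; the simplification step, as well as the unit propagation and pure literal rules discussed next, appear as the helper function and the two loops.

```python
def simplify(clauses, lit):
    """Set literal `lit` to true: drop satisfied clauses, shrink the others."""
    out = []
    for c in clauses:
        if lit in c:
            continue                                # clause satisfied: remove it
        out.append([l for l in c if l != -lit])     # remove the falsified literal
    return out

def dpll(clauses):
    if not clauses:
        return True                                 # all clauses satisfied
    if any(len(c) == 0 for c in clauses):
        return False                                # empty clause: inconsistency
    for c in clauses:                               # unit propagation
        if len(c) == 1:
            return dpll(simplify(clauses, c[0]))
    lits = {l for c in clauses for l in c}
    for l in lits:                                  # pure literal elimination
        if -l not in lits:
            return dpll(simplify(clauses, l))
    x = abs(clauses[0][0])                          # splitting rule
    return dpll(simplify(clauses, x)) or dpll(simplify(clauses, -x))

# Example 1 in this encoding (a negative integer denotes a negated variable):
print(dpll([[-1, 2], [2, 3, 4], [1, -4], [1, 2, 3]]))   # True
```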


This simplification is performed by removing from the instance all satisfied clauses, and removing falsified literals from the yet unsatisfied clauses. Two particular rules enhance the DPLL algorithm: the unit propagation procedure, which treats unit clauses, and the pure literal procedure, which assigns the appropriate truth value to pure variables. A unit clause is any clause containing only one unassigned literal, which has to be assigned so as to satisfy the clause. For example, if at a step t we have the assignment x1 = 0, x2 = 0, then the clause C = (x1 ∨ x2 ∨ x3) is a unit clause and the variable x3 must be assigned 1 to satisfy it. A pure variable is a variable occurring in one and only one polarity, either positive or negative, in the whole problem instance. Consider the following sub-instance with x1 = 1: C1 = (x1 ∨ x2), C2 = (¬x2 ∨ x3 ∨ x4), C3 = (¬x2 ∨ x6). The variable x2 is a pure variable occurring in the negative form only, since the first clause is satisfied by the variable x1; this variable must be set to 0.

Hundreds of existing complete SAT solvers are based on DPLL. Some of these solvers focus on the variable decision selection by proposing various heuristics, among which we cite the following. The Generic seaRch Algorithm for the Satisfiability Problem (GRASP) [16] introduces the Dynamic Largest Individual Sum (DLIS), which selects as decision variable the variable appearing in the largest number of unsatisfied clauses. In the Chaff solver [17], the Variable State Independent Decaying Sum (VSIDS) heuristic is introduced; it takes into account the polarity of a variable (positive or negative) while selecting the decision variable. Finally, Jeroslow and Wang introduced the Jeroslow–Wang (JW) heuristic [18], which computes a weight for each variable and selects the variable with the greatest weight. Another category of DPLL based SAT solvers introduces Conflict-Driven Clause Learning (CDCL) [14]. In CDCL solvers, each time a conflict occurs a clause is learned, allowing a non-chronological backtrack to the origin of the conflict (contrarily to the original DPLL, which executes only chronological backtracks). There are dozens of CDCL SAT solvers, among which we cite ManySAT [19], a parallel portfolio SAT solver that runs four different configurations of CDCL algorithms communicating over a multi-core architecture.

The second category of SAT solvers is the incomplete solvers (b) [14]. These solvers aim to find a compromise between computing time and solution quality. The most commonly used category of incomplete algorithms for SAT resolution is metaheuristics [20], which can be organized as follows. Trajectory based algorithms (b.1) deal with one solution and attempt to improve it at each iteration until the best possible solution is found. The most popular trajectory based algorithm for SAT solving is Stochastic Local Search (SLS) [21]. Its principle is to randomly generate a solution and improve it at each iteration by selecting the best solution from its neighbourhood. Greedy SAT (GSAT) [22] and its extension WALKSAT [23] are the most used SLS algorithms for SAT solving. The particularity of these algorithms resides in the choice of the variable to flip (change assignment).
In fact, the GSAT algorithm flips, at each iteration, the variable that yields the greatest number of newly satisfied clauses, while in WALKSAT the variable to be flipped is chosen in the same manner as in GSAT with a probability p, and randomly with a probability 1 − p. In 2016, KhudaBukhsh et al. introduced the SATenstein solver [24], a combination and improvement of existing SLS solvers including GSAT, WALKSAT and others.
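The two flip-selection rules can be sketched as follows; this is a simplified illustration under our own encoding, scoring over all variables rather than using the exact published candidate sets of GSAT and WALKSAT.

```python
import random

def n_sat(clauses, assign):
    """Number of satisfied clauses; assign maps variable index to a boolean."""
    return sum(any((l > 0) == assign[abs(l)] for l in c) for c in clauses)

def pick_flip(clauses, assign, p):
    """WALKSAT-style choice: greedy GSAT move with probability p,
    a uniformly random flip with probability 1 - p."""
    variables = list(assign)
    if random.random() >= p:
        return random.choice(variables)             # random move
    best_v, best_score = None, -1
    for v in variables:                             # greedy GSAT move
        assign[v] = not assign[v]                   # tentative flip
        score = n_sat(clauses, assign)
        assign[v] = not assign[v]                   # undo
        if score > best_score:
            best_v, best_score = v, score
    return best_v

cnf = [[1, -2], [2, 3], [-1, -3]]
assign = {1: True, 2: False, 3: True}
v = pick_flip(cnf, assign, p=0.7)
assign[v] = not assign[v]                           # apply the chosen flip
```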


Population based algorithms (b.2), contrarily to the first category, deal with a set of solutions to improve at each iteration. Inspired by genetics, the Genetic Algorithm is one of the most popular and widely used population based algorithms for SAT solving. Finally, the most commonly used metaheuristic category for problem solving in recent years is bio-inspired algorithms (b.3). These algorithms attempt to reproduce certain biological and natural mechanisms, such as genetics in the Genetic Algorithm or the functioning of the animal kingdom in the Ant Colony algorithm and the Bees Swarm Optimization (BSO) algorithm. Indeed, some animals have a limited individual intelligence but are capable of exhibiting a certain degree of intelligence when in a group: collective intelligence. Inspired by the behaviour of real bees foraging for food sources, the BSO algorithm [25] works as follows. A bee BeeInit starts by randomly generating a first solution named Sref. A search area is determined from this solution by flipping (changing the truth value of) a certain number of variables. Each bee of the swarm is assigned a solution from this search area and runs a local search on it. The best solution found by each bee is stored in a table named Dance, in reference to the dance performed by real bees when finding a food source. The best solution from this table is taken, at each iteration, as the new reference solution, allowing the process to restart until stagnation or a predetermined stopping condition is met.

SAT and clustering

Although a considerable number of algorithms and solvers have been proposed to solve the very popular SAT problem, it remains one of the most complex and difficult NP-complete problems. Reducing its complexity is one of the best strategies to deal with this type of problem; to do so, a clustering pre-processing step is performed before the resolution task is undertaken. In [4] and [26], the authors used the Genetic Algorithm together with an intuitive clustering technique as pre-processing on SAT instances. This approach reduced the overall instance complexity by creating sub-instances that were solved independently. In [27], an improved version of the Apriori algorithm [28] was proposed to deal with the complexity of SAT instances; this approach also created sub-instances that were solved separately. Finally, two versions of the DBSCAN algorithm were proposed in [29,30]. These approaches considered regions where the density of variables and clauses is important in order to create sub-instances, which are then solved using the DPLL and BSO algorithms.

When comparing the existing categories of SAT solvers, we can easily notice that the first class of solvers (complete) offers the most effective solutions, with one hundred per cent (100%) of satisfied clauses; however, the time these solvers spend to reach the solution grows exponentially with the problem size. The second category provides more efficient solutions within a reasonable solving time, but the solutions are not fully effective. Our aim is to find the best compromise between effectiveness, efficiency and scalability by introducing a novel way of solving problems based on data mining. The proposed approach reduces the instance complexity and explores the most promising regions, with the expected result of providing the best possible solution quality in a reduced time.

3. Data, knowledge, preprocessing and variables distribution

The purpose of this work is to propose an efficient solving of SAT instances by reducing the complexity of the latter.
This complexity reduction is obtained by applying the most suitable clustering technique to each instance according to its distribution.


3.1. Data preprocessing and cleaning

The large size of data and its multiple sources make processing difficult because of possible inconsistencies and missing values. For these reasons, a preprocessing (cleaning) step is proposed in this section. Unlike other datasets, SAT instances cannot contain inconsistent, missing or noisy data, since these instances are transcribed from real-world problems in order to be solved. The preprocessing step we propose therefore consists of enforcing SAT conditions and further reducing complexity, as follows.

First, SAT conditions are enforced by removing tautologies and redundant clauses. Any clause containing both a variable and its negation is a tautology; such clauses, being always true whatever the variable assignment is, can safely be removed from the instance for complexity reduction. On the other hand, all clauses containing more than one occurrence of the same variable are considered for cleaning: only one occurrence of the same variable is kept per clause.

An additional preprocessing task is proposed, which reduces the complexity of an instance by removing inclusive clauses. A clause Ci is said to be included in a clause Cj if all its literals belong to Cj, i.e. the number of shared literals between Ci and Cj is equal to the length of Ci. In such a case, only the clause Ci is kept, because its satisfaction implies that of Cj.

Example 2. Let us consider the following SAT instance: C1 = (x1 ∨ x2 ∨ x2), C2 = (x1 ∨ ¬x1 ∨ x3), C3 = (x1 ∨ x2 ∨ x3). After preprocessing, the instance contains only the new clause C1 = (x1 ∨ x2). The double occurrence of x2 is removed from C1; C2 and C3 are removed from the instance, C2 being a tautology, which is always true, and C1 being included in C3.

Ralph Kimball said that ''Data cleaning is one of the three biggest problems in data warehousing''. We can complete this by saying that without quality data, there are no quality mining results.
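Under the same signed-integer encoding as before (our assumption), this cleaning step can be sketched as follows; which of two mutually redundant clauses is kept is an implementation detail the paper does not fix, and we keep the first one.

```python
def preprocess(clauses):
    cleaned = []
    for c in clauses:
        c = sorted(set(c))                    # drop duplicate literals in a clause
        if any(-l in c for l in c):           # tautology: contains both x and ¬x
            continue
        cleaned.append(c)
    # Inclusion rule: if all literals of Ci belong to Cj, keep Ci and drop Cj,
    # since satisfying Ci necessarily satisfies Cj.
    kept = []
    for i, ci in enumerate(cleaned):
        si = set(ci)
        dominated = any(set(cj) < si or (set(cj) == si and j < i)
                        for j, cj in enumerate(cleaned) if j != i)
        if not dominated:
            kept.append(ci)
    return kept

# Example 2: C1 = (x1 v x2 v x2), C2 = (x1 v ¬x1 v x3), C3 = (x1 v x2 v x3)
print(preprocess([[1, 2, 2], [1, -1, 3], [1, 2, 3]]))   # [[1, 2]]
```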

Fig. 1. Variables dispersion on clauses (Example 1).

3.2. Variables distribution and clustering techniques

In order to provide the most efficient solving, an optimal complexity reduction is essential, and a judicious choice of the clustering technique is the most important step towards it. This choice depends on the distribution of variables over clauses, which we introduce in this section.

The distribution of variables over clauses results from the projection of the clauses onto a bi-dimensional plane, where the abscissa axis represents the variables and the ordinate axis corresponds to the clauses. A clause Ci is then defined by n points, each point representing a variable included in the clause: p1(xi1, i), p2(xi2, i), . . . , pn(xin, i). Fig. 1 describes the dispersion of variables over the clauses of the first example. For instance, the clause C2 = (x2 ∨ x3 ∨ x4) is represented by the three points p1(2, 2), p2(3, 2), p3(4, 2).

After a study of several variable distributions, three main distribution models have been found, and to each of them a clustering technique is associated. Fig. 2 introduces the principal discovered and treated distribution models. The first distribution presents a space where the variables form regions of considerable density interspersed with sparser and empty regions; this type of dispersion is the best configuration for density-based clustering [29,30]. The second discovered distribution presents a space where variables are scattered randomly, without shaping any particular form or presenting any region of significant density. Finally, the third distribution follows the same schema as the second one, with the particularity that almost all variables have a high frequency. Grid based clustering seems to be the most suitable clustering technique for these last two distributions; frequent patterns mining algorithms can also be a good choice for the third distribution because of the high probability of interaction between variables. These clustering techniques are judged the most suitable for the determined distributions according to the instance distributions themselves and to the configuration and functioning of the algorithms.

Modelling STING for SAT instances

Proposed by Wang et al., STING (STatistical INformation Grid approach to spatial data mining) is a grid-based clustering algorithm that uses a hierarchical grid structure allowing an efficient analysis of data in order to answer a given request. This structure is composed of a root node containing all the objects, and each node (except the leaves) has four children, each corresponding to a quadrant of the mother cell. The splitting process stops when a certain granularity (number of objects per cell) is reached. The data is represented by attribute-dependent parameters (the mean value, minimum and maximum values, standard deviation and data distribution) and an attribute-independent parameter, the number of objects in a cell. Once the hierarchical structure is built, a top-down search is performed, seeking the relevant cells for a given request. Fig. 3 illustrates this hierarchical structure.

In our approach, the root node is represented by the bi-dimensional space on which the instance is projected. However, unlike in the STING algorithm, the whole structure is not built: we propose to determine and fix the size of the leaf cells directly, and three different cell-size definitions are proposed. In the first definition, namely the vertical approach, the cell size is determined by defining a range of variables; a clause belongs to a cluster if it contains a variable of the corresponding range, so a clause can belong to more than one cluster. In the second, horizontal approach, the cell size is defined by fixing the number of clauses per cell. Finally, the last, mixed approach defines a cell size by fixing both the number of variables and the number of clauses within the cluster. Fig. 4 summarizes the three definitions of cell sizes, shown respectively in red, blue and black on the distribution of variables over clauses of a uniform random 3-SAT benchmark [31]. Each of the resulting clusters is then solved independently.


Fig. 2. Variables distribution.

In the vertical and mixed approaches, only the variables included in the fixed interval of variables are assigned, while in the horizontal approach all the variables composing the clauses of the cell are assigned. The DPLL and BSO algorithms are used for solving these clusters: DPLL is used on clusters whose number of variables to assign does not exceed a certain threshold, and the BSO algorithm is executed otherwise.

To provide an effective global solution to the problem instance, an appropriate solving order is necessary. We propose to solve the clusters in descending order of their density, the density of a cluster being defined as the ratio of its number of clauses to its number of variables, DC = nbrClauses/nbrVariables. We believe that this ordering has a real impact on the effectiveness of the solution: priority is given to clusters with a small number of variables involved in a large number of clauses. Each time a cluster is solved, its solution is propagated to the yet unsolved clusters in order to remove satisfied clauses and falsified literals.

Fig. 3. STING hierarchical structure.

Fig. 4. STING modelling for SAT instance.

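As an illustration of the vertical cell definition and of the density-based solving order, consider the following sketch; the cell width, the encoding and the data structures are our assumptions.

```python
from collections import defaultdict

def vertical_cells(clauses, width):
    """Vertical approach: a clause joins every cell whose variable range it touches."""
    cells = defaultdict(list)
    for c in clauses:
        for cell in {(abs(l) - 1) // width for l in c}:
            cells[cell].append(c)           # a clause may belong to several cells
    return cells

def solving_order(cells):
    """Solve the densest clusters first: density = #clauses / #variables."""
    def density(item):
        _, cls = item
        variables = {abs(l) for c in cls for l in c}
        return len(cls) / len(variables)
    return sorted(cells.items(), key=density, reverse=True)

# Toy instance, width 2: variables 1-2 map to cell 0, variables 3-4 to cell 1.
cells = vertical_cells([[1, -2], [2, 3], [-3, 4], [3, -4]], width=2)
for cell, cls in solving_order(cells):
    print(cell, cls)
```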

Modelling DBSCAN for SAT instances

DBSCAN, or Density Based Spatial Clustering of Applications with Noise [8,9], is one of the most frequently used density-based clustering algorithms. At each iteration it randomly selects a data point p, from which a dense region is determined. This region includes all data points within a radius r, taking p as centre, and is said to be dense if and only if the number of data points within it exceeds a fixed minimum MinPts.

Two different modellings of DBSCAN were proposed for SAT instances. The first modelling [29] proposed a bi-dimensional view where variables were the only parameter taken into consideration when determining a region, with three region definitions. From a starting point xi, the first definition, which uses the Euclidean distance, places in the same region all the variables within the interval [xi − r; xi + r] that share at least one clause with xi, r being the defined radius. The second uses the Hamming distance, including in the same region all variables with which the central variable xi shares at least r clauses. The last bi-dimensional definition introduces a density measure relative to variables and how they are linked inside the clauses; a variable is considered the centre of a region if its density is greater than a fixed density threshold.

The second modelling [30], unlike the first one, considers the clauses in a multidimensional plane, the starting point of any region being a clause itself. Three sub-approaches are proposed for this modelling, according to region definitions and solving strategies.
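A sketch of the first region definition of the bi-dimensional modelling (the Euclidean variant) could look as follows; cluster growth and the MinPts density test are omitted for brevity, and the encoding is ours.

```python
def euclidean_region(center, clauses, r):
    """Variables whose index lies within [center - r, center + r] and that
    share at least one clause with the centre variable (a partial sketch)."""
    region = {center}
    for c in clauses:
        vars_in_c = {abs(l) for l in c}
        if center in vars_in_c:
            region |= {v for v in vars_in_c if abs(v - center) <= r}
    return region

print(euclidean_region(3, [[1, -2], [2, 3], [-3, 4], [5, 6]], r=1))  # {2, 3, 4}
```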


The first definition of a region comprises all the clauses around a central clause whose similarity with the centre exceeds a fixed threshold. This region definition is employed by the first two approaches: in the first approach, only the variables of the central clause are assigned while solving the cluster; on the contrary, the second approach assigns all the variables belonging to all the clauses of the cluster. The third approach considers a larger region than the first two: in addition to the clauses closest to the centre, the clauses nearest to each clause of the region are added to it, and the solving is similar to the second approach. The results of the conducted experiments showed the impact of using such a preprocessing step on problem instances.

Modelling of partitioning methods

Partitioning based clustering comprises the most intuitive clustering methods, which are generally distance and centroid based. Each cluster is represented by a central element according to which the rest of the elements are clustered. A particular feature of these methods is the improvement step performed at each iteration, which consists of changing the centroids and moving the data objects from one cluster to another, more appropriate one according to the new centroids. The quality of this improvement is measured by the sum of the squared errors between the elements and the centroid of each cluster, calculated as follows:

E = Σ_{i=1}^{k} Σ_{p∈Ci} dist(p, Ci)²    (1)

where E is the resulting clustering quality, p an object within a cluster, and Ci the centroid of the ith cluster.

K-means. The k-means algorithm [3] is the most commonly used partitioning based algorithm; its centroid is defined as the mean value of the elements within a cluster. The general principle of k-means can be summarized in the following steps:

• Choose randomly k objects to be the initial centroids.
• Assign each remaining object to the cluster whose centroid is the most similar.

• Recalculate centroids as mean value of objects within each cluster.

• Repeat the process until there is no further improvement to make.

The modelling of k-means for SAT instances follows exactly the same schema as the original k-means. The parameter k is determined according to the number of variables within the instance. However, because the lengths of clauses differ, the size of the new centroids has to be calculated at each iteration and for each cluster; we propose to calculate this size as the average size of the clauses within the cluster. New centroids are then calculated as follows:

• Sort all the variables forming the clauses within the cluster, repeating each variable as many times as its frequency.
• Divide these variables into R ranges, R being the determined size of the new centroid.
• Calculate, for each range, the average value, which becomes part of the new centroid.

Example 3 explains the functioning of the centroid computation.

Example 3. Let us consider the following cluster:

Centroid: C = {x1, x2}
Clauses: C1 = {x1, x2, x3, x4}, C2 = {x1, x2, x5}, C3 = {x2, x4, x5}, C4 = {x2, x4}.

The size of the new centroid is the average clause size:

SizeNewCentroid = (Σ_{Cx∈Clauses} SizeCx) / |Clauses| = (4 + 3 + 3 + 2)/4 = 3    (2)

The variables are then split into three (03) ranges as follows: R1 = {x1, x1, x2, x2}; R2 = {x2, x2, x3, x4}; R3 = {x4, x4, x5, x5}. The mean of each range becomes part of the new centroid: CNew = {x1, x2, x4}.
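A sketch of this centroid update, under our encoding; the paper does not state how the mean of a range is mapped back to a variable index, but taking the floor reproduces Example 3 exactly.

```python
def new_centroid(clauses):
    """Recompute a cluster centroid as the per-range mean variable (Example 3)."""
    size = round(sum(len(c) for c in clauses) / len(clauses))      # Eq. (2)
    occ = sorted(abs(l) for c in clauses for l in c)  # variables with repetition
    chunk = len(occ) / size
    centroid = []
    for i in range(size):                             # split into `size` ranges
        rng = occ[int(i * chunk):int((i + 1) * chunk)]
        centroid.append(int(sum(rng) / len(rng)))     # floor of the range mean
    return centroid

clauses = [[1, 2, 3, 4], [1, 2, 5], [2, 4, 5], [2, 4]]
print(new_centroid(clauses))   # [1, 2, 4], i.e. the new centroid {x1, x2, x4}
```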

CLARANS. CLARANS is a partitioning based algorithm that offers a more effective but less efficient solving than k-means. The main difference between the two algorithms is the use of medoids, the central elements of the clusters, rather than centroids. In the modelling of CLARANS for SAT instances, the clauses themselves are considered as the objects to cluster. The number of clusters is determined empirically according to the number of clusters obtained when executing k-means.

Modelling Apriori-k-means for SAT instances

The Apriori algorithm [12] is one of the best known and most frequently used patterns mining algorithms. It starts by extracting the most frequent items of size one (01); then, for sizes k ≥ 2, it joins the frequent items of size k − 1 together, determining new candidates of size k, and prunes this list of candidates, keeping only the most frequent ones. An item is said to be frequent if and only if its support, which represents the probability of its appearance, is greater than or equal to a predetermined Minimum Support (MinSup):

Support(A → B) = P(A ∪ B)    (3)

Remark. The most important property of Apriori is that ''all non-empty subsets of a frequent itemset must also be frequent''. For example, an itemset {x1, x2, x3} is frequent if and only if {x1, x2}, {x1, x3} and {x2, x3} are frequent in addition to {x1, x2, x3} itself.

In our Apriori-k-means modelling, we propose to cope with the dependency of the original k-means on the initial random initialization of centroids by applying the Apriori algorithm as a preprocessing step to k-means. The proposed approach is a two-step method that first extracts the variables most frequently repeated together, using the Apriori algorithm, in order to use them as the initial centroids of k-means; the second step consists of simply applying k-means. Fig. 5 illustrates the functioning of this approach.
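Concretely, the seeding step can be sketched as follows, restricted to frequent pairs for brevity (the full method mines itemsets of any size); the support threshold and the tie-breaking are our assumptions. The resulting centroids then initialize the k-means loop of the previous subsection.

```python
from itertools import combinations
from collections import Counter

def frequent_pair_centroids(clauses, k, min_sup):
    """Seed k-means with the k most frequent variable pairs (an Apriori-style pass)."""
    pairs = Counter()
    for c in clauses:
        vars_in_c = sorted({abs(l) for l in c})
        pairs.update(combinations(vars_in_c, 2))     # candidate itemsets of size 2
    frequent = [(p, n) for p, n in pairs.items() if n >= min_sup]  # prune by support
    frequent.sort(key=lambda t: -t[1])
    return [list(p) for p, _ in frequent[:k]]

clauses = [[1, 2, 3], [1, 2], [2, 3, 4], [1, 2, 4]]
print(frequent_pair_centroids(clauses, k=2, min_sup=2))   # [[1, 2], [2, 3]]
```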


Fig. 5. Apriori-K-means organigram.

4. Solving algorithms

We propose, in this paper, a two-step approach which first determines and applies the appropriate clustering technique according to the distribution of each problem instance, and then solves the resulting clusters using either DPLL or BSO according to the number of variables to be assigned. Algorithm 1 summarizes the functioning of the DPLL algorithm.

Algorithm 1 DPLL
Require: Lcl: list of clauses
BEGIN
if Lcl = ∅ then
  return True (satisfied)
else if Lcl contains an empty clause then
  return False
else
  for each unit clause do
    if its literal is xi then assign xi to 1, else (literal ¬xi) assign xi to 0
    propagate the assignment
  end for
  for each pure literal do
    if it is xi then assign xi to 1, else assign xi to 0
    propagate the assignment
  end for
end if
Choose a variable x
return DPLL(Lcl ∪ {x}) or DPLL(Lcl ∪ {¬x})

The complexity of DPLL in the worst case is θ(2^n), the case where the algorithm tests all possible combinations of assignments because no unit clauses or pure literals exist. DPLL being a complete algorithm whose computing time grows exponentially with the number of variables to assign, a threshold on the number of variables is fixed in this work: the DPLL algorithm is used for solving clusters whose number of variables to instantiate is less than the fixed threshold, and the BSO algorithm is executed otherwise.

The BSO algorithm was designed for discrete data and combinatorial problems [25]. In the modelling of BSO for SAT, a solution is represented by a vector of n elements, where n is the number of variables within the instance; the ith element of the vector contains the valuation (truth value) of the ith variable xi. The local search performed by each bee consists of changing these values from 0 to 1 and vice versa (flips) to generate the solution neighbourhood.
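Complementing the pseudocode of Algorithm 2 below, a compact sketch of this BSO modelling in Python; the swarm size, flip count and the acceptance rule of the local search are simplified assumptions of ours, and the taboo list is omitted.

```python
import random

def n_sat(clauses, sol):
    """Clauses satisfied by sol, a list of booleans indexed by variable - 1."""
    return sum(any((l > 0) == sol[abs(l) - 1] for l in c) for c in clauses)

def bso(clauses, n_vars, n_bees=5, flip=3, max_iter=20, sls_iter=30):
    sref = [random.random() < 0.5 for _ in range(n_vars)]   # BeeInit
    best = sref[:]
    for _ in range(max_iter):                               # outer BSO loop
        dance = []                                          # Dance table
        for _ in range(n_bees):
            sol = sref[:]                                   # search area:
            for v in random.sample(range(n_vars), flip):    # flip some bits
                sol[v] = not sol[v]
            for _ in range(sls_iter):                       # stochastic local search
                v = random.randrange(n_vars)
                cand = sol[:]
                cand[v] = not cand[v]
                if n_sat(clauses, cand) >= n_sat(clauses, sol):
                    sol = cand                              # keep non-worsening flips
            dance.append(sol)
        sref = max(dance, key=lambda s: n_sat(clauses, s))  # new reference solution
        if n_sat(clauses, sref) > n_sat(clauses, best):
            best = sref[:]
    return best

cnf = [[-1, 2], [2, 3, 4], [1, -4], [1, 2, 3]]
sol = bso(cnf, n_vars=4)
print(n_sat(cnf, sol), "of", len(cnf), "clauses satisfied")
```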

Algorithm 2 summarizes the functioning of BSO.

Algorithm 2 BSO for SAT
Require: TL: taboo list; Dance: dance table; Sref: reference solution; Bees: list of bees
BEGIN
BeeInit ← Sref
while stopping condition not reached do
  TL ← Sref
  Sref → SearchArea (determine the search area from Sref)
  Bees ← SearchArea
  for each bee do
    SLS (stochastic local search)
    Dance ← best solution found
  end for
  Sref ← best solution in Dance
end while
return best solution from TL

The BSO algorithm is structured in two loops: the inner loop, where each bee executes a stochastic local search until the maximal number of iterations MaxIterSLS is reached, and the outer loop, where at each iteration a search area is determined and a solution is assigned to each bee; this loop is repeated at most MaxIterBSO times. The complexity is then θ(MaxIterBSO × MaxIterSLS × NbrBees), NbrBees being the number of bees used.

5. Case study

Different clustering techniques may provide different clustering results for the same dataset, whereas using the appropriate clustering technique according to the data distribution may offer the best possible clustering. To this end, we have studied numerous SAT instances with the aim of determining the distribution of each of them. Three main distribution models have emerged (see Section 3.2), among which we focus, in this paper, on the last two, the first distribution having already been discussed and treated in [29,30]. We present, in this section, two case studies corresponding to these two distribution models.


Table 1. Statistical description of the benchmark FLA-500.

Benchmark's name: FLA-500
Number of clauses: 2205
Number of variables: 500
Extent: X1 − X500
Mean variable: X248 (+X244) (−X252)
Mode: X487 (occurrence = 16)
Quartiles (Q1 − Q2 − Q3): X123, X250, X373
Mean occurrence: 13
Symmetrical occurrence variables: #45

Fig. 7. FLA500 Histogram of frequencies.

Fig. 6. FLA500 Boxplot.

Remark. The presented case studies report the statistical parameters after preprocessing. These parameters and graphics are obtained from a C# program that we implemented.

5.1. Case 1: FLA-500

We study, in this first case, a SAT instance whose distribution corresponds to the second defined model (see Section 3.2). Table 1 and Fig. 6 exhibit some important statistical parameters of the SAT instance FLA-500 [32].

Remark. Symmetrical occurrence variables refers to variables appearing equally in both polarities (positive and negative).

We notice from both the table and the figure a symmetry of the variables distribution. In fact, the quartiles, which are among the best indicators of dispersion, show a symmetrical dispersion. The median, which divides the frequencies into two equal parts, is equivalent to the mean value. The IQR (Q3 − Q1) is about 250 (373 − 123): half of the variable occurrences fall within half of the range of variables, indicating an almost perfect symmetry in the dispersion of variables. Moreover, the mean variables of both polarities are equivalent.

Fig. 7 presents the histogram of frequencies of the FLA-500 instance. From this graphic, twenty-three different frequencies can be extracted, and an important number of variables share the same frequencies.
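The statistics of Table 1 were produced by the authors' C# tool; for illustration, our reading of them (each literal occurrence taken as a sample of its variable index) can be recomputed with a few lines, and on FLA-500 this reading is consistent with the orders of magnitude reported in the table.

```python
from statistics import mean, quantiles
from collections import Counter

def instance_stats(clauses, n_vars):
    occ = [abs(l) for c in clauses for l in c]   # one sample per literal occurrence
    q1, q2, q3 = quantiles(occ, n=4)             # quartile variables
    mode_var, mode_count = Counter(occ).most_common(1)[0]
    return {
        "mean variable": round(mean(occ)),
        "quartiles (Q1, Q2, Q3)": (round(q1), round(q2), round(q3)),
        "mean occurrence": round(len(occ) / n_vars),
        "mode": (mode_var, mode_count),          # most frequent variable
    }
```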

Fig. 8. FLA500 Distribution of variables on clauses.

Finally, the distribution of the instance is presented in Fig. 8. The figure illustrates a random dispersion of variables over the whole plane, without shaping any particular dense area. Grid based clustering seems to be the most suitable clustering for this kind of distribution.

5.2. Case 2: AIM-200

This second case studies an instance of SAT that corresponds to the last introduced model (see Section 3.2). Table 2 and Fig. 9 summarize the most important statistical parameters of the AIM-200 benchmark [33]. As in the previously studied case, the dispersion of variables presents an almost perfect symmetry. The particularity of this instance resides in the fact that almost all the variables have the same frequency; these frequencies are presented, with their occurrence counts, in Table 3. This instance is multi-modal, with 149 modes, which represents more than two-thirds (2/3) of the variables. On the other hand, the number of variables having the same frequency in both their positive and negative forms is about 155. This kind of distribution is called a balanced instance in some of the SAT literature.

The distribution of this instance is exposed in Fig. 10 and is identical to the previous one (FLA-500): variables are scattered randomly in space without shaping any particular dense region. Furthermore, the particularity relative to variable frequencies has led us to also consider frequent patterns mining as a clustering technique for this distribution.


Table 2. Statistical description of the benchmark AIM-200.

Benchmark's name: AIM-200
Number of clauses: 1182
Number of variables: 200
Extent: X1 − X200
Mean variable: X100 (+X100) (−X100)
Mode: #149 (modes), occurrence = 18
Quartiles (Q1 − Q2 − Q3): X51, X101, X151
Mean occurrence: 17
Symmetrical variables: #155

Table 3 Variable’s frequencies — AIM-200. Frequency

15

16

17

18

Number of variables

01

08

42

149

Fig. 10. AIM200 Variables distribution on clauses.

6. Experimental validation

To show the effectiveness of the proposed approaches, experiments were conducted on some well-known benchmarks. These experiments were run on a machine with an i7 2.40 GHz processor and 4 GB of RAM. The implementation was done in Microsoft Visual Studio C# 2013, .NET Framework 4.5.1. We start, in this section, by describing our benchmarks and checking the importance of the preprocessing (cleaning) step. We then determine the different cell sizes of the STING (grid-based) approaches and finally discuss the resulting experiments.

Benchmarks description and preprocessing

Table 4 summarizes the attributes (number of variables and clauses) of the used benchmarks and the number of clauses removed in the preprocessing step. The instances IBM1, IBM2, IBM7 and IBM13 [34,35] present the first defined distribution, with high density regions, while FLA-500 and Unif-k3 [32] present the second distribution, and Unif-k5 [32] and AIM-200 [33] the third. The preprocessing step allowed a complexity reduction through the removal of certain clauses; however, the integrity of the instances has to be preserved to validate this preprocessing.

Fig. 9. AIM-200 Boxplot.


Preprocessing validation. To determine whether the integrity of the problem instances is preserved after preprocessing, we first studied the distribution of the instances before and after preprocessing. In a second step, a random solution was generated in order to validate the preprocessing by comparing the quality of this solution on the instance before and after preprocessing. To conclude on the integrity of an instance after preprocessing, the two conditions of identical distributions and similar satisfiability rates before and after preprocessing must both be met.

Fig. 11 introduces the boxplots of one of the benchmarks (BMC-IBM1) before and after preprocessing. Both instances present the same dispersion parameters, demonstrating the integrity of the instance after preprocessing despite the removal of 1188 clauses. Table 5 exhibits the results of a randomly generated solution tested on both instances before and after preprocessing. We notice a similarity between the satisfiability rates on both instances, demonstrating that the integrity of the instance is preserved. The dispersion of the instance before and after preprocessing and the resulting satisfiability rates lead us to conclude that the preprocessing step respects the integrity of instances.

Remarks. The same verification has been undertaken for the rest of the benchmarks and led to the same results, i.e. the integrity of the instances before and after preprocessing. All the results presented below are obtained on the instances after preprocessing.
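The random-solution test amounts to a few lines; a sketch under our encoding, averaged over several random assignments for stability (a slight generalization, as the paper uses a single random solution):

```python
import random

def sat_rate(clauses, assign):
    sat = sum(any((l > 0) == assign[abs(l)] for l in c) for c in clauses)
    return 100.0 * sat / len(clauses)

def integrity_gap(before, after, n_vars, trials=100):
    """Mean gap between satisfiability rates of random assignments on the
    instance before and after preprocessing; values near 0 support integrity,
    cf. the Difference column of Table 5."""
    gaps = []
    for _ in range(trials):
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        gaps.append(sat_rate(before, assign) - sat_rate(after, assign))
    return sum(gaps) / trials
```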


Fig. 11. BMC-IBM1 Boxplots before and after preprocessing.

Table 4. Benchmark description and preprocessing.

Benchmark name | # variables | # clauses | # tautologies | # inclusive clauses | Distribution
BMC-IBM1 | 9685 | 55 870 | 15 | 1173 | First dist.
BMC-IBM2 | 2810 | 11 683 | 16 | 1106 | First dist.
BMC-IBM7 | 8710 | 39 774 | 28 | 2358 | First dist.
BMC-IBM13 | 13 215 | 65 728 | 0 | 2605 | First dist.
AIM-200 | 200 | 1200 | 3 | 15 | Third dist.
FLA-500 | 500 | 2205 | 0 | 0 | Second dist.
Unif-k3 | 5000 | 21 335 | 0 | 0 | Second dist.
Unif-k5 | 230 | 4857 | 0 | 0 | Third dist.

Table 5. Data preprocessing effectiveness (satisfiability rate of a random solution on initial and treated benchmark).

Benchmark | Initial # clauses | Initial % SAT | Preprocessed # clauses | Preprocessed % SAT | Difference
BMC-IBM1 | 55 870 | 77.03 | 54 682 | 76.97 | 0.05
BMC-IBM2 | 11 683 | 78.87 | 10 561 | 78.82 | 0.05
BMC-IBM7 | 39 774 | 77.30 | 37 388 | 77.22 | 0.08
BMC-IBM13 | 65 728 | 78.28 | 63 123 | 78.26 | 0.02
AIM-200 | 1200 | 86.58 | 1182 | 86.63 | −0.05
FLA-500 | 2205 | 87.89 | 2205 | 87.89 | 0.00

Grid-based clustering (STING)

In our approach, unlike the traditional STING algorithm that divides the entire grid into four parts recursively, the size of the grid cells is fixed beforehand. To compare the three (03) proposed approaches, we defined different cell sizes according to the number of variables and clauses. Table 6 describes the cell sizes and the resulting numbers of clusters for the instance BMC-IBM1. In the vertical approach, an interval of variables determines the cell size: three sizes were proposed, the small size fixing the number of variables within a cell to 25, and intervals of 50 and 100 variables being fixed for the medium and large sizes respectively. Four sizes are defined for the horizontal approach by fixing the number of clauses within a cluster: the large size is determined by dividing the number of clauses by 2 (N/2), and the medium, small and smallest sizes by dividing it by 8, 16 and 32 respectively. Finally, the mixed approach is a

combination of the two previous approaches, the cell size being defined by fixing both the number of clauses and the number of variables, as shown in the table.

Remark. The numbers of clusters presented in the previous table are theoretical. In practice, this number depends on the DPLL procedures (pure literal and unit propagation) applied before clustering, which can reduce the number of clauses.

Fig. 12 introduces the satisfiability rates and execution times of the vertical modelling approach of STING for SAT instances. We notice, from these figures, that when increasing the number of variables per cluster, the solving becomes less effective. These results are explained by the threshold set during the solving step: clusters whose number of variables to instantiate does not exceed the threshold are solved using DPLL, while the remaining ones are solved using BSO, and DPLL, being a complete algorithm, provides more effective solving. The execution time, however, decreases when increasing the number of variables: we can infer that the DPLL algorithm is used for solving the clusters with 25 variables and, being a complete algorithm covering all possible solutions, its execution time is greater than that of BSO even when the number of variables per cluster is about 50. The execution time increases again when assigning 100 variables.

The results of the horizontal modelling of STING for SAT instances are exhibited in Fig. 13.

Remark. Unif-k3 is not represented in these figures because of its excessive execution time.

In this second modelling, the cluster size is fixed by the number of clauses, whereas the solving quality depends on the number of variables per cluster. However, we can say that beyond a certain threshold, the


Table 6. Grid based clustering — radius and cell size for the benchmark BMC-IBM1.

Cell's size | Vertical approach (#variables/cluster) | Horizontal approach (#clauses/cluster) | #clusters (Vertical) | #clusters (Horizontal) | #clusters (Mixed)
Smallest | – | Size/32 (1708) | – | 21 | –
Small | 25 | Size/16 (3417) | 388 | 11 | 867
Medium | 50 | Size/8 (6835) | 194 | 6 | 377
Large | 100 | Size/2 (27341) | 97 | 2 | 173

Fig. 12. Grid based clustering (Vertical approach).

Fig. 13. Grid based clustering (Horizontal approach).

more important the number of clauses is, the more important the number of variables is. We notice, through the satisfaction rates presented in Fig. 13, that the clusters with the fewest variables provide the best solutions in terms of satisfiability rate. The execution time, on the contrary, increases with the number of clauses per cluster.

Finally, Fig. 14 presents the results of the mixed modelling of STING for SAT instances. As for the vertical approach, the satisfiability rate decreases when the number of variables and clauses within a cluster increases: the greater the number of variables to instantiate, the greater the probability of using the BSO algorithm rather than DPLL, BSO being less effective than DPLL. The execution time increases with the cluster size; however, we notice that for the largest size the execution time decreases compared with the medium size. Indeed, after a cluster is solved, its solution is propagated to the yet unsolved clusters, and because the number of instantiated variables is large, the propagation allows a substantial reduction of the clauses left to satisfy.


Fig. 14. Grid based clustering (Mixed approach).

Table 7. Benchmark description and preprocessing.

Benchmark name | # variables | # clauses | Preprocessing (removed clauses) | Distribution
Unif_k3_5600 | 5600 | 23 895 | 0 | Second dist.
Unif_k3_5800 | 5800 | 24 749 | 0 | Second dist.
Unif_k3_6600 | 6600 | 28 162 | 0 | Second dist.


Grid-based clustering approaches validation

In order to demonstrate the interest and impact of the proposed approaches, we perform in this part additional experiments on instances that are difficult to solve, especially within a reasonable time; indeed, various powerful solvers could not solve these instances efficiently, and some not even effectively. Our goal is, first, to demonstrate that our approach is able to reach quite high satisfaction rates in a reasonable execution time, taking into account the complexity of the instances, and second, to suggest that the integration of these approaches into a powerful solver would allow a 100% resolution of these instances efficiently. To do so, we compared our work with some results of the best SAT solvers. Table 7 introduces the description of the used instances. Fig. 15 shows the best resolution times of the best SAT solvers for the instance Unif_k3_66, available at [36]; results for the remaining benchmarks are available at [37] and [38]. Tables 8 and 9 exhibit the resulting satisfiability rates, numbers of unsatisfied clauses and execution times for the three proposed modellings of STING for SAT instances.

The first thing we notice when analysing these results (Table 8) is that the horizontal approach, which determines clusters according to the number of clauses, does not return results because it exceeds the time limit, and that the mixed approach does not provide results as satisfying as those of the vertical approach. This is due, first, to the fact that the vertical approach is the best suited to solving SAT instances, the central element of a SAT instance being the variables to instantiate; secondly, the distribution of these instances is too compact, which means that the variables fill the whole dispersion space, as shown in Fig. 16. When variables fill the whole dispersion space, solving a cluster generated by the horizontal STING approach amounts to solving the entire instance, since almost all variables are present. We conclude on the effectiveness of the

Table 8. STING approaches – Validation – Effectiveness.

Satisfaction rate (%):
Approach | Radius | Unif_k3_56 | Unif_k3_58 | Unif_k3_66
Vertical | R = 25 | 96.60 | 96.50 | 96.55
Vertical | R = 50 | 96.56 | 96.50 | 96.56
Vertical | R = 100 | 95.09 | 95.20 | 95.05
Horizontal | R = XS | 90.79 | 90.81 | **
Horizontal | R = S-M-L | ** | ** | **
Mixed | R = S | 90.75 | 90.51 | 90.87
Mixed | R = M | 90.01 | 90.06 | 90.41
Mixed | R = L | 89.20 | 92.84 | 89.49

Unsatisfied clauses (#):
Approach | Radius | Unif_k3_56 | Unif_k3_58 | Unif_k3_66
Vertical | R = 25 | 813 | 866 | 971
Vertical | R = 50 | 821 | 866 | 968
Vertical | R = 100 | 1173 | 1189 | 1395
Horizontal | R = XS | 2200 | 2273 | **
Horizontal | R = S-M-L | ** | ** | **
Mixed | R = S | 2210 | 2347 | 2569
Mixed | R = M | 2387 | 2460 | 2701
Mixed | R = L | 2580 | 2771 | 2960

Table 9. STING approaches – Validation – Efficiency.

Execution time (s):
Approach | Radius | Unif_k3_56 | Unif_k3_58 | Unif_k3_66
Vertical | R = 25 | 151.47 | 322.09 | 218.38
Vertical | R = 50 | 283.80 | 618.09 | 383.97
Vertical | R = 100 | 817.73 | 998.92 | 1267.96
Horizontal | R = XS | 1378.74 | 1686.61 | Exceeded
Horizontal | R = S-M-L | Exceeded | Exceeded | Exceeded
Mixed | R = S | 9.06 | 10.23 | 26.94
Mixed | R = M | 61.85 | 75.69 | 158.69
Mixed | R = L | 38.16 | 45.32 | 61.36

vertical approach for this distribution (the effectiveness of the three approaches having been shown in the previous experiments). We can notice, from Table 9, that the obtained results are competitive with those of powerful solvers: compared with some solvers that exceeded the time limit without finding a solution, or with solvers that found a solution but are greedy, our approach offers a satisfying satisfiability rate within a reasonable solving time, which represents our primary aim and shows the impact of using the appropriate clustering as preprocessing before solving complex problems.


Fig. 15. Unif_k3_66 best resolution time (best solvers).


Fig. 16. Unif_k3_58 Variables distribution on clauses.

Table 10. Partitioning based clustering — Execution time (s).

Benchmark name | KM | CLARANS | Apriori-KM
BMC-IBM1 | 189.78 | 1244.57 | 277.91
BMC-IBM2 | 3.05 | 367.76 | 7.21
BMC-IBM7 | 57.29 | 2537.5 | 2421
BMC-IBM13 | 722.0 | 5294.22 | 179.51
AIM200 | 0.45 | 15.7 | 0.3
FLA500 | 0.23 | 47.98 | 0.28
unif-k3 | 11.86 | 664.94 | 23.02
unif-k5 | 0.44 | 5.39 | 46.53


Partitioning based clustering and frequent patterns mining

In this section, we begin by presenting the resulting solution quality of some popular partitioning techniques, namely k-means and CLARANS, and then introduce our proposed Apriori-k-means combination. Fig. 17 and Table 10 present the results of these approaches. We notice that CLARANS provides better results than k-means in terms of satisfiability rate in almost all cases; however, it is less efficient than k-means. The proposed Apriori-k-means shows better results than k-means in terms of satisfied clauses, but it is also less efficient, with higher execution times due to the initial Apriori step, which depends on the number of variables and their connections.

Comparison of the proposed clusterings for SAT instances

Fig. 18 and Table 11 summarize the results of the approaches presented in this paper, in addition to a comparison with a randomly generated solution and a solution generated by the BSO algorithm on the whole instance without any preprocessing. When comparing the presented results, we first observe a considerable improvement of solution quality compared with BSO solving on the whole instance, attesting the importance of reducing the instance complexity before solving it. According to these results, the vertical modelling of STING for SAT instances provides the best solutions in terms of satisfiability rate for almost all instances; indeed, this approach defines the cell size as a number of variables within a cluster, which is the principal parameter when solving a SAT instance. The horizontal and mixed modellings of STING provide satisfying results. The proposed modelling of k-means yields good results; CLARANS provides better results than k-means in terms of satisfiability rate, although it is less efficient. The introduction of the Apriori algorithm as a preprocessing step to k-means showed its importance by improving the satisfiability rate of the latter, making it more effective.

We have treated, in the presented work, the instances whose distribution is random and does not present any region of higher density than the others, the first distribution having been treated in [29,30]. Table 12 presents a comparison between the best resulting rates of the DBSCAN and STING modellings for SAT instances, and shows the impact of choosing the appropriate clustering for each instance distribution.

7. Conclusion

Throughout this paper, a novel and innovative solving approach is proposed for SAT instances. This approach explores the intelligence and technology offered by data mining and is organized in two steps. The first step is to study the distribution of variables over clauses in order to deduce and determine the most suitable clustering technique for each distribution; this clustering technique is applied to the corresponding instance with the aim of reducing the


Table 11
Execution times (s) — Comparison.

Method       IBM1      IBM2     IBM7     IBM13     FLA500   AIM200   unif-k3   unif-k5
STING 1st    742.03    35.71    438.61   2293.77   29.82    5.87     146.66    482.19
STING 2nd    383.11    33.07    180.68   593.47    3.58     4.76     1135.59   11.71
STING 3rd    115.85    5.14     65.36    244.38    0.11     0.04     7.23      0.65
K-means      189.78    3.05     57.29    722.02    0.23     0.45     11.86     0.44
CLARANS      1244.57   367.76   2537.5   5294.22   47.98    15.7     664.94    5.39
Apriori-KM   277.91    7.21     2421     179.51    0.28     0.3      23.02     46.53
BSO          609.33    35.51    383.68   2496.56   3.49     0.76     674.01    3.38

Fig. 17. Partitioning based clustering — Satisfiability rate.

Fig. 18. Satisfiability rates — Comparison.

The second phase solves the clusters resulting from the first step, using either the DPLL or the BSO algorithm depending on the number of variables to be instantiated, as sketched below.

The study of instance distributions revealed three main distribution models. The first model describes a space where variables form regions of considerable density alongside regions of lower density and empty regions. The two following models define a random dispersion of variables within the bi-dimensional space; no particularly dense regions appear in these two dispersion models.
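The following Python sketch illustrates this per-cluster dispatch. It is an illustration under stated assumptions, not the implementation evaluated in the experiments: the threshold value and the solver interfaces are ours, the DPLL is a compact textbook version, and BSO is only stubbed.

```python
# Sketch of the second phase: each cluster is solved independently, with
# complete DPLL search for small clusters and the BSO metaheuristic
# (stubbed here) for large ones.
VAR_THRESHOLD = 40  # illustrative cut-off, not the experimental value

def dpll(clauses):
    """Classic DPLL on CNF clauses given as iterables of signed ints;
    returns a (possibly partial) satisfying assignment or None."""
    def simplify(cls, lit):
        out = []
        for c in cls:
            if lit in c:
                continue                # clause satisfied, drop it
            if -lit in c:
                c = c - {-lit}          # literal falsified, remove it
                if not c:
                    return None         # empty clause: conflict
            out.append(c)
        return out

    def search(cls, assignment):
        while True:                     # unit propagation
            unit = next((next(iter(c)) for c in cls if len(c) == 1), None)
            if unit is None:
                break
            assignment = {**assignment, abs(unit): unit > 0}
            cls = simplify(cls, unit)
            if cls is None:
                return None
        if not cls:
            return assignment           # all clauses satisfied
        lit = next(iter(cls[0]))        # branch on a remaining literal
        for choice in (lit, -lit):
            reduced = simplify(cls, choice)
            if reduced is not None:
                res = search(reduced, {**assignment, abs(choice): choice > 0})
                if res is not None:
                    return res
        return None

    return search([set(c) for c in clauses], {})

def bso(clauses):
    """Placeholder for the Bees Swarm Optimization metaheuristic [25];
    a real implementation would return a best-effort assignment."""
    raise NotImplementedError

def solve_cluster(clauses):
    """Dispatch a cluster to DPLL (small) or BSO (large)."""
    n_vars = len({abs(lit) for c in clauses for lit in c})
    return dpll(clauses) if n_vars <= VAR_THRESHOLD else bso(clauses)
```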

Table 12
Variables distribution and appropriate clustering.

Benchmark   DBSCAN                        STING
            SAT Rate (%)   Exe Time (s)   SAT Rate (%)   Exe Time (s)
IBM1        99.55          687.42         99.28          742.03
IBM2        99.55          6.23           99.36          35.71
IBM7        99.52          69.7           99.46          438.61
IBM13       98.35          213.09         99.27          2293.77
AIM200      96.96          0.53           94.33          5.87
FLA500      92.87          0.18           96.68          29.82
Unif-k3     91.38          5.26           96.46          146.66
Unif-k5     97.71          0.26           99.03          482.19

The third model presents the particularity that almost all of its variables appear with high frequency.

The first distribution is associated with density-based clustering; two DBSCAN approaches were proposed in [29,30], where their efficiency was demonstrated. In this article, we focused on the second and third distribution models, for which we consider grid-based clustering to be the most suitable technique; the Apriori algorithm also proves appropriate for the last distribution.

Three modellings of the STING algorithm were proposed in this paper. The first fixes the cell size through an interval of variables, and the clauses containing these variables are assigned to the corresponding clusters. The second, in contrast, fixes the number of clauses within each cluster. Finally, the third combines both by fixing the number of variables and the number of clauses within each cluster. A sketch of the first two grid variants is given below.

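The following Python sketch illustrates the grid construction behind these modellings (a simplification under our own assumptions, not the evaluated implementation; in particular, anchoring a clause at its smallest variable when it straddles two intervals is our choice). The third modelling simply applies both limits at once.

```python
# Simplified sketch of the grid cells behind the STING modellings.
def sting_vertical(clauses, n_vars, cell_width):
    """First modelling: cells are fixed-width intervals of variables;
    each clause goes to the cell of its smallest variable."""
    n_cells = -(-n_vars // cell_width)  # ceiling division
    cells = [[] for _ in range(n_cells)]
    for clause in clauses:
        anchor = min(abs(lit) for lit in clause)
        cells[(anchor - 1) // cell_width].append(clause)
    return cells

def sting_horizontal(clauses, clauses_per_cell):
    """Second modelling: a fixed number of clauses per cell."""
    return [clauses[i:i + clauses_per_cell]
            for i in range(0, len(clauses), clauses_per_cell)]
```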
Modellings of some popular partitioning-based clustering techniques, namely k-means and CLARANS, were also proposed, and the Apriori algorithm was introduced as a preprocessing step for k-means to deal with the random initialization of its centroids.

To validate the presented approaches, we selected well-known benchmarks and applied to them a preprocessing (cleaning) step that reduces complexity while preserving the integrity of the instances. The achieved results showed the importance and impact of such preprocessing prior to problem solving. Indeed, some of the benchmarks used here can be solved by complete SAT solvers, but those algorithms can be very resource-hungry; our approach reduces the complexity of the problem instance and allows a faster resolution with a very satisfying satisfiability rate.

To push the satisfiability rate towards 100%, we plan, as a first perspective, to integrate and adapt our approach into one of the well-known SAT solvers. As a second perspective, we plan to use a GPU to deepen the results of our investigations.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2020.106069.

CRediT authorship contribution statement

Hadjer Moulai: Writing - review & editing.

References

[1] S.A. Cook, The complexity of theorem-proving procedures, in: Proceedings of the Third Annual ACM Symposium on Theory of Computing, ACM, 1971, pp. 151–158.
[2] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2011.
[3] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137.
[4] H. Drias, A. Douib, C. Hireche, Swarm intelligence with clustering for solving SAT, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2013, pp. 585–593.
[5] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 344, John Wiley & Sons, 2009.
[6] R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, in: Proceedings of VLDB, 1994, pp. 144–155.
[7] K. Bindra, A. Mishra, et al., Effective data clustering algorithms, in: Soft Computing: Theories and Applications, Springer, 2019, pp. 419–432.
[8] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, Vol. 96, 1996, pp. 226–231.
[9] D. Birant, A. Kut, ST-DBSCAN: An algorithm for clustering spatial–temporal data, Data Knowl. Eng. 60 (2007) 208–221.
[10] W. Wang, J. Yang, R. Muntz, STING: A statistical information grid approach to spatial data mining, in: Proceedings of the 23rd Conference on VLDB, Athens, 1997, pp. 186–195.


[11] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Vol. 27, ACM, 1998.
[12] R. Agrawal, R. Srikant, et al., Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Vol. 1215, 1994, pp. 487–499.
[13] M.R. Garey, D.S. Johnson, Computers and Intractability, Vol. 29, W.H. Freeman, New York, 2002.
[14] A. Biere, M. Heule, H. van Maaren, Handbook of Satisfiability, Vol. 185, IOS Press, 2009.
[15] M. Davis, G. Logemann, D. Loveland, A machine program for theorem-proving, Commun. ACM 5 (7) (1962) 394–397.
[16] G.L. Cravo, A.R.S. Amaral, A GRASP algorithm for solving large-scale single row facility layout problems, Comput. Oper. Res. 106 (2019) 49–61, Elsevier.
[17] M.W. Moskewicz, C.F. Madigan, Y. Zhao, L. Zhang, S. Malik, Chaff: Engineering an efficient SAT solver, in: Proceedings of the 38th Annual Design Automation Conference, ACM, 2001, pp. 530–535.
[18] R.G. Jeroslow, J. Wang, Solving propositional satisfiability problems, Ann. Math. Artif. Intell. 1 (1–4) (1990) 167–187.
[19] Y. Hamadi, S. Jabbour, L. Sais, ManySAT: a parallel SAT solver, J. Satisf. Boolean Model. Comput. 6 (4) (2009) 245–262.
[20] F.W. Glover, G.A. Kochenberger, Handbook of Metaheuristics, Vol. 57, Springer Science & Business Media, 2006.
[21] H.H. Hoos, T. Stützle, Stochastic Local Search: Foundations and Applications, Elsevier, 2004.
[22] B. Selman, H.J. Levesque, D.G. Mitchell, et al., A new method for solving hard satisfiability problems, in: AAAI, Vol. 92, Citeseer, 1992, pp. 440–446.
[23] B. Selman, H.A. Kautz, B. Cohen, et al., Local search strategies for satisfiability testing, in: Cliques, Coloring, and Satisfiability, Vol. 26, 1993, pp. 521–532.
[24] A.R. KhudaBukhsh, L. Xu, H.H. Hoos, K. Leyton-Brown, SATenstein: Automatically building local search SAT solvers from components, Artif. Intell. J. 232 (2016) 20–42.
[25] H. Drias, S. Sadeg, S. Yahi, Cooperative bees swarm for solving the maximum weighted satisfiability problem, in: International Work-Conference on Artificial Neural Networks, Springer, 2005.
[26] H. Drias, C. Hireche, A. Douib, Datamining techniques and swarm intelligence for problem solving: application to SAT, in: Nature and Biologically Inspired Computing (NaBIC), 2013 World Congress on, IEEE, 2013, pp. 200–206.
[27] C. Hireche, H. Drias, N.C. Benhamouda, Frequent patterns mining for the satisfiability problem, Polibits - Res. J. Comput. Sci. Comput. Eng. Appl. 55 (2017) 59–63.
[28] N.C. Benhamouda, H. Drias, C. Hireche, Meta-Apriori: A new algorithm for frequent pattern detection, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2016, pp. 277–285.
[29] C. Hireche, H. Drias, Density based clustering for satisfiability solving, in: World Conference on Information Systems and Technologies, Springer, 2018, pp. 899–908.
[30] C. Hireche, H. Drias, Multidimensional appropriate clustering and DBSCAN for SAT solving, Data Technol. Appl. J. (2019), Emerald Publishing Limited.
[31] Uniform Random SAT, http://www.satcompetition.org/2013/downloads.shtml.
[32] Random SAT, https://baldur.iti.kit.edu/sat-competition-2016/index.php?cat=benchmark.
[33] Artificially Generated Random, https://baldur.iti.kit.edu/sat-competition-2016/index.php?cat=benchmarks.
[34] A. Biere, A. Cimatti, E. Clarke, Y. Zhu, Symbolic model checking without BDDs, in: International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Springer, 1999, pp. 193–207.
[35] SAT encoded BMC, http://www.satcompetition.org/2013/downloads.shtml.
[36] http://www.satcompetition.org/edacc/SATCompetition2013/experiment/25/results-by-instance?instance=8509.
[37] http://www.satcompetition.org/edacc/sc14/experiment/24/results-by-instance?instance=5730.
[38] http://www.satcompetition.org/edacc/sc14/experiment/24/results-by-instance?instance=5729.