CARs-Lands: An associative classifier for large-scale datasets


Mehrdad Almasi, Mohammad Saniee Abadeh
Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran

Article history: Received 25 March 2019; Revised 5 September 2019; Accepted 24 November 2019; Available online 25 November 2019
Keywords: Classification association rules (CARs); Associative classifier; Big data; Large-scale datasets; Evolutionary algorithms

Abstract

Associative classifiers are among the most efficient classifiers for large datasets. However, they are unsuitable for direct use on large-scale data problems. Associative classifiers discover frequent rules, rare rules, or both in order to produce an efficient classifier. Rule discovery requires exploring a large solution space in a well-organized manner; hence, because of memory and time-complexity constraints, the learning procedures of associative classification methods designed for large datasets do not scale to large-scale datasets. The proposed method, CARs-Lands, presents an efficient distributed associative classifier. In CARs-Lands, a modified dataset is generated first. This new dataset consists of sub-datasets that are well suited to producing classification association rules (CARs) in parallel. The dataset produced by CARs-Lands contains two types of instances: main instances and neighbor instances. Main instances can be either real instances of the training dataset or meta-instances, which are not in the training dataset; each main instance has several neighbor instances from the training dataset, which together form a sub-dataset. These sub-datasets are used for parallel local association rule mining. In CARs-Lands, local association rules lead to more accurate prediction, because each test instance is classified by the association rules of its nearest neighbors in the training dataset. The proposed approach is evaluated in terms of accuracy on six real-world large-scale datasets against five recent and well-known methods. Experimental results show that the proposed classification method has high prediction accuracy and is highly competitive when compared to other classification methods.

1. Introduction and literature review

Classification has been one of the most popular research fields in machine learning and pattern recognition for more than two decades [1]. It also has many applications in daily life, such as text recognition [2], sentiment analysis [3], protein function prediction [4], and document classification [5]. Currently, there are numerous classification methods of different types, including ensemble classifiers [6,7], naïve Bayesian classifiers [8,9], support vector machines [10,11], neural networks [12,13], and rule-based classifiers [14,15]. Among these different types of classifiers, rule-based classifiers have a specific advantage: they are interpretable for humans while remaining efficient and effective classification methods. Interpretability allows the end user to understand the classifier [16]. At the end of the 1990s, association rule mining and classification, two significant parts of the data mining field, were integrated into a new family of methods called associative classifiers. Association rule mining, which was first proposed by Agrawal et al.


[17], is the main component of associative classifiers. The approach of Agrawal et al. discovers correlations among co-occurring items within a dataset. An association rule is represented as A → C, where A and C are called the antecedent and the consequent, respectively. In an associative classifier, the consequent part of the association rule is the target class label. Because associative classifiers discover high-confidence correlations among different attributes simultaneously, they can obtain better accuracy than other rule-based classifiers such as decision trees or rule induction classifiers [18]. Apriori [17] and FP-growth [19] are frequently used methods in association rule mining. Methods for classification based on association rules have been designed for different aims, depending on the application or the available technology. They can be placed in one of the following three categories: efficient exploration of the data space, selection of the best rule sets, and suitability for large-scale datasets. In the first category, the main challenges are discovering rules by determining suitable threshold values for the support and confidence of rules, or investigating the potentially significant areas of association rules. In the second category, researchers focus on finding high-performance rules, omitting redundant rules, and proposing a compact rule set to build more accurate classifiers. At the beginning of this century,


the methods for building associative classifiers had reached a semi-optimal point; however, large-scale (massive) datasets appeared, which require new frameworks for association rule mining. In the following, we study some representatives of the three aforementioned groups. The first combination of association rule mining (as a tool of knowledge discovery) and classification (a significant learning problem) was presented at the end of the 20th century by Ma and Liu. That research established a new concept called the associative classifier. Classification based on association (CBA) was the first associative classifier [20] that classified instances based on association rules. It contains two steps: association rule mining and classification model building. In CBA, the classifier is built from classification association rules (CARs) whose support and confidence are larger than predefined thresholds. CBA can generate rules from datasets with continuous attributes via discretization. In CBA [20], the rule generation method was an adaptation of the method by Agrawal and Srikant [21]. Many associative classifiers have been proposed based on CBA [20], each of which tries to improve CBA [20] from one aspect. Liu et al. [22] proposed an improved version of CBA called CBA (2). They improved CBA by considering multiple minimum class supports and by integrating CBA with decision trees and naïve Bayesian methods. CMAR [23] and L3 [24] improved the candidate rule generation of CBA by using FP-growth. Nguyen et al. [25] and Rak et al. [26] proposed data-structural approaches to improve CBA. They presented a lattice of class rules for efficient discovery of CARs. Each node of their lattice contained some attribute values. Rak et al. [26] used a tree-projection structure to store datasets. Each branch of their tree contained some items instead of the entire itemset. In this way, they significantly reduced the number of candidate itemsets. Alwidian et al. [27] proposed a new weighted CBA method for cancer disease. They took advantage of association rules to enhance their model's accuracy. In [27], weighted CBA was developed to solve the problem of estimating proper support and confidence values. Selecting the best set of association rules in order to build a compact and accurate associative classifier is another significant challenge because of redundancy between rules. Ashrafi et al. [28] proposed a method to remove redundant rules, i.e., rules that contain identical knowledge. Costa et al. [29] omitted useless rules by combining rules and by probabilistic smoothing; their rule-based classifier was developed by analyzing the coverage of the rules. Zhang et al. [30] proposed a method, called GEAR, to generate a compact rule set. GEAR uses information theory to find the best attributes for classifying each class label. Because GEAR keeps only a few attributes (the collection of the best attributes), the generated rules are compact. Nowadays, real-time datasets are a pervasive form of data. In these datasets, records change continuously, so the rules of associative classifiers should be updated; Nguyen et al. [31] propose an efficient method for updating and compacting CARs when records are deleted. In [31], an MECR-tree is used to store the original dataset, and the concept of pre-large itemsets is used to avoid re-scanning the original dataset.
Nowadays, large-scale or massive datasets have become the common form of data used in learning problems such as pattern recognition [32] and learning [33]. Although proposing new parallel classifiers based on software frameworks such as Apache Spark is a usual approach, sampling approaches are also used for large-scale datasets [34,35]. Sampling methods can reduce the size of datasets and make it possible to use classical approaches; however, they also omit relevant knowledge. Hence, sampling is not a suitable solution for building associative classifiers. In recent years, some scalable and parallel implementations of

associative classifiers have been proposed. Bechini et al. [36] proposed a distributed associative classifier. In their method, CARs are extracted by using a distributed version of the FP-growth approach. Moreover, rule pruning is performed in a parallel manner. The strengths of [36] were the reduction in message passing and fault-tolerant recovery, as well as accuracy and computation-time efficiency. López et al. [37] presented a new fuzzy associative classifier, which performs well on imbalanced datasets. Their method uses the MapReduce programming model to compute the fuzzy model operations in parallel, and cost-sensitive techniques are used to handle imbalanced datasets efficiently. In [38], a new learning method is proposed that repeatedly transforms data between line and item spaces to discover frequent rule items. Thabtah et al. [38] implemented their method with a parallel MapReduce approach. Many of the approaches in the associative classification literature have been proposed for small or moderately large datasets. They only consider accuracy as a challenge, without attention to time and space requirements. In fact, time and space are not their challenges, because of the small number of dataset instances and powerful computers [39]; hence, they are practically inapplicable to large-scale datasets [40,41]. However, the high accuracy of classic associative classifiers cannot be ignored. Providing an efficient solution to utilize classical associative classifiers for large-scale datasets in a parallel manner is the main goal of this study. Note that using sampling methods would make it possible to use classical associative classifiers; nevertheless, it leads to a reduction in the accuracy of the generated models. We believe that there are areas in large-scale data where CARs are not generated, or where CARs have no chance to participate in the final CAR set, if we directly use the current format of datasets or use sampling methods. We call these areas lost lands, and proposing a new storage format to cover these areas is the main novelty of this study. This new format (the modified training dataset) contains some sub-datasets and is fully compatible with classical associative classifiers. In addition, this storage format enables us to efficiently generate CARs in a distributed manner. Note that because sampling methods severely reduce the accuracy of associative classifiers, they are unsuitable for building associative classifiers. In CARs-Lands, eliminating the need to remove the redundancy between all association rules is a subgoal. It should be noted that since assigning a label to a test instance does not require collaboration among the CARs of different sub-datasets, there is no need to omit the global redundancies between rules; it is only necessary to remove redundancy within the rules of a sub-dataset. We have used 6 real-world large-scale datasets with different numbers of attributes and instances (up to 54 attributes and 11 million instances) to evaluate our proposed method. Moreover, focusing on accuracy, the results of CARs-Lands are compared to five other methods by a statistical test. The results demonstrate the efficiency of CARs-Lands. The paper is organized as follows. Section 2 explains the proposed method, with details of the functions used. Section 3 describes the configuration of the experimental study and discusses the results of the statistical analysis in terms of accuracy, computation time, time complexity, and scalability.
Finally, we summarize this study with some concluding remarks in Section 4.

2. CARs-Lands: a parallel associative classification

As pointed out in Section 1, classical association rule mining methods are unsuitable to be directly used for large-scale datasets. They require some changes. In this section, we describe our proposed method, CARs-Lands, which uses an efficient classical association rule mining method to build an associative classifier. In Fig. 1, the parts of CARs-Lands are described. Moreover, it shows


Fig. 1. The CARs-Lands classifier. Part (a) shows the two steps of CARs-Lands, part (b) is the structure of a sub-dataset, and part (c) shows how CARs-Lands predicts the class. In part (c), the values d1, d2, d3, …, dn are the distances of the main instances from the unlabeled instance, assuming that d3 is the nearest main instance to the unlabeled instance.

how CARs-Lands determines the label of an unlabeled instance. CARs-Lands consists of the following two steps:

1. Creating a modified training dataset. In this step, the training dataset is converted into a modified training dataset. This dataset contains several sub-datasets that are built in a parallel manner. These sub-datasets contain useful information, such as the most effective attributes in a sub-dataset and the overlapping values between instances from different classes.

2. CAR mining. This step is performed based on an adapted version of Rare-PEARs [42]. In this version, the rules are extracted based on support, confidence, and CF, and the consequent part of each rule is a class.

In CARs-Lands, assigning a label to a test instance is performed by using the CARs of one sub-dataset. Hence, there is no need to omit the global redundancy between the rules generated from different sub-

datasets. Note that in each sub-dataset, it is necessary to eliminate redundant rules from the ultimate non-dominated rule set. This can be done quickly because the number of non-dominated rules is small. This characteristic of CARs-Lands is a significant advantage and helps CARs-Lands attain good performance. Note that the CARs of the sub-datasets constitute the ultimate CARs. In each sub-dataset, there is a small number of rules that are not redundant. Therefore, the proposed approach is not prone to overfitting. In the following sections, the two steps of CARs-Lands are discussed in detail. Note that in this paper, the step of creating a modified training dataset is more important than the CAR mining step. Moreover, the modified datasets make it possible to fully utilize the performance of classical association rule mining methods in order to generate an efficient classifier.
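The classification step sketched in part (c) of Fig. 1 can be summarized in a few lines of code. The following is a minimal illustration only, not the authors' implementation: the Euclidean distance, the rule layout (antecedent dictionary, class label, confidence), and the helper name predict are assumptions made here for clarity.

```python
import math

def predict(test_instance, main_instances, cars_per_subdataset, default_class=None):
    """Assign a label to a test instance using the CARs of its nearest main instance."""
    # 1. Find the nearest main instance (d3 in part (c) of Fig. 1); Euclidean distance assumed.
    distances = [math.dist(test_instance, m) for m in main_instances]
    nearest = distances.index(min(distances))

    # 2. Keep only the CARs of that sub-dataset whose antecedent matches the test instance.
    matching = [(label, conf)
                for antecedent, label, conf in cars_per_subdataset[nearest]
                if all(test_instance[a] == v for a, v in antecedent.items())]

    # 3. Return the class of the most confident matching rule; otherwise fall back to a default.
    if not matching:
        return default_class
    return max(matching, key=lambda r: r[1])[0]
```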


Fig. 2. Sample of islands and lands in a sample dataset with two attributes and three classes. Islands and lands are marked by red and blue curves respectively.

Fig. 3. Presentation of some island and land sets in a sample dataset (the ovals in the figure). The main instances are shown with a plus sign if they are meta-instances. Note that lands and islands are determined by the number of neighbors; hence, the sizes of the ovals differ.

2.1. Creating a modified training dataset

There is an important point when associative classifiers are used. Generally, rules reflect the overall structure of datasets and do not consider instances of a specific class that are far from the other instances of that class. The challenge is that, in classification models, some instances are not considered because they are few and surrounded by instances of other classes. CARs-Lands tries to consider these instances when modeling the data. To this end, CARs-Lands generates some sub-datasets. These sub-datasets constitute a new training dataset (a representative instance and some nearest neighbors constitute a sub-dataset; see part (b) of Fig. 1). Note that in the modified training dataset, there are sub-datasets for all classes. In order to achieve such a training dataset, we use the following two concepts: lands and islands. The instances of each class label constitute lands (blue curves in Fig. 2), and the instances of a specific rare or frequent class in a section of a land constitute islands (red curves in Fig. 2). Note that in islands, instances of a specific rare or frequent class are surrounded by instances of other classes (Fig. 2). In Fig. 2, the island and land concepts are illustrated on a sample dataset. In CARs-Lands, we use a distance measure to determine the sub-datasets. These sub-datasets are a part or the whole of an island or land. In Fig. 3, a sample of islands and lands is shown for CARs-Lands. In the implementation, we use a predefined number as the number of instances in a sub-dataset. This number is used to produce a sub-dataset; in other words, we use a predefined value instead of a constant distance to generate sub-datasets. In this study, we need a representative from each island and land. We call these representatives main instances. These main instances are either instances of the training dataset or meta-instances, which are not in the training dataset. In Fig. 3, the meta-instances are shown with a plus sign. In the second step of CARs-Lands, these representatives are utilized to produce a set of rules and predict the labels of test instances. In this study, we assume that the islands constitute the lost lands of associative classifiers. In a large-scale dataset, the rules extracted from islands have fewer chances to participate in the final CARs if there is no policy to protect them. In this study, CARs-Lands produces sufficient representatives from both lands and islands. Before describing how the modified training dataset is built, it is necessary to explain the difficulty of building modified training datasets in more detail. In CARs-Lands, each instance of the dataset along with some of its neighbors constructs

a sub-dataset (potentially an island or land). Knowing all possible sub-datasets in order to discover potential islands and lands is important because it facilitates building an efficient associative classifier. However, generating all possible sub-datasets is an impractical and time-consuming task, even in distributed environments. Consider a dataset with 1 million instances: in order to find all possible sub-datasets, we would need to compare each instance to all 999,999 other instances, which is an overwhelming task. Thus, CARs-Lands uses an evolutionary approach to find the most effective representatives of islands and lands. In the following paragraphs, we describe the chromosome representation and the measures used in the first part of CARs-Lands. In Fig. 4, the chromosome representation of the first step of CARs-Lands is shown along with an example chromosome. Each chromosome is a list that contains 4 members. In this list (Fig. 4), the first member (row) is the main instance or meta-instance. The second member (row) is an adjacency matrix that contains the frequencies of the attribute values that are present in the main instance. Note that in the adjacency matrix, the frequency for each element is calculated among the neighbors whose class is the same as that of the main instance. The third member (row) is called the overlap matrix. Each element of this matrix shows the number of neighbors that belong to a different class than the main instance and also have the same value as the corresponding value of the main instance (first member). In the fourth member (row), the degree of separation is calculated for the chromosome. Note that for real-valued attributes, we calculate a vicinity area for each attribute (30 percent of the distance between the attribute value of the main instance and the largest value of the corresponding attribute among the neighbor instances with the same class), and values within this vicinity are used to calculate the adjacency matrix and the overlap matrix. As mentioned previously, we need to find some effective islands and lands. Hence, there is a need for measures to compare potential islands and lands. We use the following two measures.

• The first measure is called the degree of separation and reflects the power of the attribute values of the main instance to split the instances of different classes. Note that each element of the degree of separation matrix (Fig. 4) is a degree of separation measure.


Fig. 4. Chromosome representation and an example chromosome.



• In CARs-Lands, support is another measure that is calculated for main instances. This measure is used in the replacement process. The support is calculated by dividing the number of instances with the same class as the main instance by the number of instances in the sub-dataset. As an example, in Fig. 4, the support of the chromosome is 3/6 = 0.5 (the value 6 is the number of instances in the sub-dataset and the value 3 is the number of neighbors whose class is the same as that of the main instance). The value of the support indicates the amount of homogeneity in a sub-dataset: values near 0 show complete diversity and values near 1 indicate purity.
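To make the adjacency matrix, overlap matrix, and support concrete, the sketch below computes them for one main instance and its neighbor list. It is an illustrative reading of Fig. 4 with hypothetical names (chromosome_measures, neighbors given as (attributes, class) pairs); the simple separation score at the end is an assumption of this sketch, since the paper only states that separation reflects the splitting power of the attribute values.

```python
def chromosome_measures(main_instance, main_class, neighbors):
    """neighbors: list of (attribute_vector, class_label) pairs forming the sub-dataset."""
    n_attrs = len(main_instance)
    adjacency = [0] * n_attrs   # value re-appears in a neighbor of the same class
    overlap = [0] * n_attrs     # value re-appears in a neighbor of a different class

    for attrs, label in neighbors:
        for i, value in enumerate(attrs):
            if value == main_instance[i]:
                if label == main_class:
                    adjacency[i] += 1
                else:
                    overlap[i] += 1

    # Support: share of the sub-dataset with the same class as the main instance
    # (3/6 = 0.5 in the example of Fig. 4).
    same_class = sum(1 for _, label in neighbors if label == main_class)
    support = same_class / len(neighbors) if neighbors else 0.0

    # Assumed per-attribute separation score: same-class agreement minus
    # different-class agreement on the main instance's value.
    separation = [adjacency[i] - overlap[i] for i in range(n_attrs)]
    return adjacency, overlap, separation, support
```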

Here, we illustrate the idea of producing a modified training dataset in CARs-Lands (Fig. 5). Suppose that we have 100 chromosomes similar to the chromosome presented in Fig. 4. These chromosomes are our initial population. Step 1 of CARs-Lands tries to produce a better set of main instances by means of evolutionary operators. Thus, there is a need to efficiently explore the search space. In large-scale datasets, efficient exploration is impossible in a sequential manner; hence, we propose a parallel approach. The fitness calculation is a significant part of the first step of CARs-Lands. It is performed as follows (Fig. 6): the large-scale dataset is divided into several parts (assume that the number of these parts

is equal to the number of nearest neighbors). Then, in each iteration of our evolutionary approach, all chromosomes are shared between the cores and each core acts as follows: for each main instance, the core finds the nearest neighbor within its part. These nearest neighbors constitute the neighbors of each main instance. After determining the nearest neighbors, the measures of each main instance are calculated (as in Fig. 4) and a replacement process is performed. The pseudocode for producing a modified training dataset is presented in the procedure BuildSub-dataset (Algorithm 1). In procedure BuildSub-dataset (Algorithm 1), CARs-Lands finds the nearest neighbors of each main instance during the evolutionary process with a parallel approach (procedure Distance, Algorithm 2). Procedure Distance will be described after the evolutionary operators. Figs. 7 and 8 show two examples of the evolutionary operators of CARs-Lands. Fig. 7 shows how the mutation operator of CARs-Lands is performed. This figure shows an example chromosome along with its nearest neighbors (assuming that the number of nearest neighbors is 7). In the mutation operator, the attribute of the main instance that has the lowest separation (degree of separation) value is replaced with the most frequent value among its neighbors. Note that the neighbors should have the same class as the main instance. In the mutation in Fig. 7, the 5th attribute of


Fig. 5. The idea of step 1 of CARs-Lands (α is a predefined number of iterations).

the main instance has the lowest separation value; thus, the value 8 is replaced with 7. It should be noted that for real-valued attributes, the value is replaced with the mean of the values of its same-class neighbors. In the crossover operator (Fig. 8), a uniform crossover is used; however, we produce only one child. In this child, the genes are selected based on the degree of separation matrix: the genes with larger separation values are selected from the parents. Note that the genes are selected randomly if two elements have the same separation values. In our example, parent 1 is better than parent 2 in the 2nd and 3rd attributes. Similarly, parent 2 is better than parent 1 in the 5th attribute. Moreover, both parents have the same discriminative power in the 1st and 4th attributes. Therefore, the child definitely has two genes from parent 1, one gene from parent 2, and two genes from either of the parents. Note that in the 1st and 4th attributes, the discriminative power is the same for both parents; thus, these two genes are randomly selected. One point cannot be neglected: the number of instances of a particular class may be extremely small. In CARs-Lands, the instances of such classes are considered as a sub-dataset by themselves and rules are extracted from them. However, such sub-datasets are different from the rest of the sub-datasets: all of their members are main instances. In line 3 of the Distance procedure, several processes are initiated. The number of processes is equal to the number of available cores (these processes run simultaneously). The processes execute the Nearest_Neighbors procedure (Algorithm 3), which has 2 input parameters. The first parameter determines the file that each process works on (the file names are 1.txt, 2.txt, etc.) and the second parameter is the set of main instances. Note that before executing the Distance procedure (Algorithm 2), the large-scale

data is partitioned into N sub-datasets. The value of N is equal to the number of neighbors if that number is larger than the number of cores (otherwise, N is the number of cores). Fig. 9 describes an example of running the Distance procedure (Algorithm 2). We assume that the original data have 16 instances and that the number of nearest neighbors and the number of cores are 1 and 2, respectively. Thus, the dataset is partitioned into 2 sub-datasets (each sub-dataset has 8 instances). In Fig. 9, the pool contains 3 main instances, and the lower part of Fig. 9 shows the results of running the Distance procedure (Algorithm 2). In the replacement process (Fig. 10), the main instances that have a larger number of large values in the degree of separation matrix are selected to form the new population. In the case of a tie, the selection is performed based on the adjacency and overlap matrices, respectively. Finally, the supports of the main instances are considered if the adjacency and overlap matrices are the same. Note that replacement is performed randomly if the supports are also the same.

2.2. CAR mining

Since the rules of CARs-Lands are produced based on Rare-PEARs [42], we briefly review it at the beginning of this section. Rare-PEARs [42] is an evolutionary association rule mining method that produces rare and reliable rules. In addition, it does not need discretization of continuous attributes. In summary, the aim of Rare-PEARs [42] is to maximize the interestingness, accuracy, and reliability of rules while providing vast coverage of the dataset. In CARs-Lands, an adapted version of Rare-PEARs [42] is used to generate CARs on the sub-datasets. Note that because the dataset used in


Fig. 6. The idea of step 1 of CARs-Lands (in this diagram, we assume that the number of nearest neighbors and cores are equal).

the CAR mining step of CARs-Lands is very small, we have used the sequential version of Rare-PEARs. In the second step of CARs-Lands (CAR mining), association rule mining is carried out on the sub-datasets produced by the first step of CARs-Lands. The generated rules of each sub-dataset constitute the ultimate CARs. In CARs-Lands, support, confidence, and CF are the measures that are used to generate the CARs. In the following, we explain how they are calculated.

1. Support: support is the most common measure in the association rule mining process. The support of a rule (A → C) is calculated as SUP(AC) / |D|. In this equation, SUP(AC) is the number of instances covered by both the antecedent and the consequent parts simultaneously, and |D| is the number of instances of the dataset.

2. Confidence: the confidence of a rule (A → C) is calculated as SUP(AC) / SUP(A). The value of the confidence indicates how often C appears in the instances that are covered by A. Values closer to 1 indicate a higher reliability of the association rule.

3. Certainty Factor: CF is a significant measure that is applied to calculate the accuracy and importance of a rule [43]. The domain of CF is [−1, 1]. Rules in which CF is equal to 1 are perfect rules. The values CF < 0, CF = 0, and CF > 0 imply negative dependence, independence, and positive dependence, respectively. Rules with positive dependence are desirable; however, association rules with CF < 0 are also desirable if their absolute values are high enough.

CF(A → C) = (confidence(A → C) − support(C)) / (1 − support(C))   if confidence(A → C) > support(C)
CF(A → C) = (confidence(A → C) − support(C)) / support(C)         if confidence(A → C) < support(C)
CF(A → C) = 0                                                     otherwise          (1)
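The three measures can be computed directly from the coverage counts of a rule. The helper below is a small, self-contained transcription of these ratios and of Eq. (1); it is a sketch, not the Rare-PEARs implementation, and the argument names are assumptions.

```python
def rule_measures(sup_a, sup_c, sup_ac, n_instances):
    """sup_a, sup_c, sup_ac: numbers of instances covered by the antecedent,
    by the consequent, and by both; n_instances: size of the (sub-)dataset."""
    support = sup_ac / n_instances
    confidence = sup_ac / sup_a if sup_a else 0.0
    support_c = sup_c / n_instances

    # Certainty Factor, Eq. (1); its domain is [-1, 1].
    if confidence > support_c:
        cf = (confidence - support_c) / (1.0 - support_c)
    elif confidence < support_c:
        cf = (confidence - support_c) / support_c
    else:
        cf = 0.0
    return support, confidence, cf

# Example: the antecedent covers 40 of 100 instances, 30 of them with the
# consequent class, which appears 50 times overall.
print(rule_measures(sup_a=40, sup_c=50, sup_ac=30, n_instances=100))  # (0.3, 0.75, 0.5)
```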

Procedure CARgenerate (Algorithm 4) uses the sub-datasets produced by the first step to generate CARs in a parallel manner. In this procedure, all cores have access to all sub-datasets; however, each core only processes the sub-datasets that are assigned to it. Note that the files of the sub-datasets are very small and are named 1.txt, 2.txt, etc. Fig. 11 shows the flowchart for determining the class of an unlabeled instance. For further explanation, an example of assigning a label to unlabeled instances is presented in Fig. 12.

3. Experimental study

In this section, an experimental study is carried out to compare the performance of CARs-Lands to other classifiers. In Section 3.1, the characteristics of the datasets used in our experiments are introduced. Furthermore, the compared methods as well as the selected parameters are explained. In Section 3.2, the performance of CARs-Lands is presented. Moreover, a statistical test (the Wilcoxon signed-rank test [44]) is used to analyze the superiority of CARs-Lands.


Fig. 7. Example of mutation in step 1 of CARs-Lands.

Fig. 8. Example of the crossover operator in the first step of CARs-Lands.


Fig. 9. Example of running the Distance procedure (Algorithm 2).

Finally, we perform an analysis to evaluate the running time of CARs-Lands.

3.1. Dataset, compared methods, and parameter setting

In this study, the performance of CARs-Lands is analyzed using 6 well-known datasets. These datasets are available from the UCI dataset repository [45]. Table 1 provides the characteristics of the datasets used in the experiments. These datasets have different numbers of instances (from 581,012 to 11,000,000), classes (from 2 to 23), and attributes (from 10 to 54). For each dataset, the numbers of numeric (N) and categorical (C) attributes are reported.

All of the experiments are run using a parallel programming package. In our implementation, we use the joblib package to achieve better performance when working with long-running jobs [46]. In the experiments, a cloud computing infrastructure is used. We create a virtual machine on the cloud. This virtual machine has two Intel Xeon E5-2695 v3 processors. Each processor has 14 cores (2.30 GHz), and 64 GB of RAM are available. In the running time analysis, the number of working processors (jobs) is set to different values (16, 20, 24, and 28). We perform these changes because modifying the number of jobs or the number of sub-datasets changes the number of processing tasks. Note that the number of processing tasks has no effect on the accuracy of CARs-Lands because the tasks are only used for measure calculations.
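As an illustration of how such a joblib-based setup can vary the number of worker processes, the snippet below dispatches independent per-partition tasks with the n_jobs values used in the runtime analysis. The task body is a placeholder, not part of CARs-Lands.

```python
from joblib import Parallel, delayed

def process_partition(partition_id):
    # Placeholder for a per-partition task, e.g. a nearest-neighbor scan or a
    # measure calculation over one block of the large-scale dataset.
    return partition_id

for n_jobs in (16, 20, 24, 28):  # the job counts explored in the runtime analysis
    results = Parallel(n_jobs=n_jobs)(
        delayed(process_partition)(p) for p in range(50)  # e.g. one task per data block
    )
```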


Algorithm 1 BuildSub-Dataset.
Input parameters: Number of individuals (number of main instances), Number of classes, Number of generations
Output parameters: Modified dataset
High-level abstraction: In this algorithm, the primary dataset is converted into some sub-datasets (equal in number to the main instances). To this end, an evolutionary algorithm is used, with three steps: first, the initial population is generated; then the evolutionary operators (crossover and mutation) are performed; and finally a replacement process is executed. In this algorithm, the fitness calculation is the main challenge. This calculation is based on the distances to the main instances and is performed according to Algorithm 2.
1. generate_initial_population(Number of individuals) // some instances are randomly selected from each class to form the population. Note that all classes have the same percentage of the population; however, some classes contribute more individuals if other classes have fewer instances than the required number.
2. for i in range(0, Number of generations):
3.   for j in range(0, Number of classes): // do in parallel (at thread level)
4.     for k in range(0, rate_of_crossover × Number of individuals):
5.       select two parents // roulette wheel selection
6.       perform uniform crossover // explained in Fig. 8
7.       accumulate new children into Cross-Pool
8.     end for k
9.   end for j
10.  for j in range(0, Number of classes): // do in parallel (at thread level)
11.    for k in range(0, rate_of_mutation × Number of individuals):
12.      select an individual // roulette wheel selection
13.      perform mutation // explained in Fig. 7
14.      accumulate new children into Mutation-Pool
15.    end for k
16.  end for j
17.  compute the fitness of the individuals of Cross-Pool and Mutation-Pool // see procedure Distance (Algorithm 2)
18.  replacement process // in each class, existing individuals are replaced by the children if the children are better
19. end for i
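The mutation step called in line 13 of Algorithm 1 can be read as the sketch below: the attribute of the main instance with the weakest separation is overwritten with the most frequent value among same-class neighbors (categorical case) or with their mean (real-valued case). The function name, tie handling, and the real_valued flag are assumptions of this sketch.

```python
from collections import Counter
from statistics import mean

def mutate_main_instance(main_instance, main_class, neighbors, separation, real_valued):
    """neighbors: list of (attribute_vector, class_label) pairs; separation: per-attribute
    separation scores; real_valued: per-attribute flags marking continuous attributes."""
    target = separation.index(min(separation))  # attribute with the lowest separation
    same_class_values = [attrs[target] for attrs, label in neighbors if label == main_class]
    if not same_class_values:
        return main_instance

    mutated = list(main_instance)
    if real_valued[target]:
        mutated[target] = mean(same_class_values)  # mean of same-class neighbor values
    else:
        mutated[target] = Counter(same_class_values).most_common(1)[0][0]  # most frequent value
    return mutated
```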

Algorithm 3 Nearest_Neighbors.
Input parameters: which file // the file name is formed as 1.txt, 2.txt, etc.
main instances
High-level abstraction: In this algorithm, the file allocated to each process is opened first; then the distances of each line of the file to every main instance are calculated and the list of nearest neighbors is updated accordingly. Finally, the results are received by Algorithm 2.
1. address_file = str(which file) + ".txt"
2. Dict = dict() // create a dictionary data structure; the keys are main instances and the values are nearest neighbors together with their distances
3. Line_number = 0
4. for line in (address_file):
5.   line_value = line.split(',')
6.   if Line_number == 0:
7.     for each_instance in main instances: // do in parallel (at thread level)
8.       Dict[each_instance] = the distance of each_instance from line_value
9.   else:
10.    for each_instance in main instances: // do in parallel (at thread level)
11.      dis = the distance of each_instance from line_value
12.      sublist = []
13.      sublist.append(line_value)
14.      sublist.append(dis)
15.      if number of elements in Dict[each_instance] < number of nearest neighbors:
16.        Dict[each_instance].append(sublist)
17.      else:
18.        Dict[each_instance].append(sublist)
19.        TempList = Dict[each_instance]
20.        TempList = TempList sorted by distance
21.        Dict[each_instance] = TempList[:number_of_nearest_neighbors]
22.        // select the first number_of_nearest_neighbors elements
23.  Line_number = Line_number + 1
24. return Dict
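Algorithms 2 and 3 together amount to a partitioned k-nearest-neighbor scan whose partial results are merged per main instance. The sketch below keeps only that logic, with in-memory partitions instead of the numbered .txt files, a plain Euclidean distance, and joblib for the process-level parallelism; it approximates the two procedures and is not the authors' code.

```python
import heapq
import math
from joblib import Parallel, delayed

def nearest_in_partition(partition, main_instances, k):
    """Return, for each main instance, its k nearest rows within one partition."""
    result = {}
    for idx, m in enumerate(main_instances):
        dists = [(math.dist(m, row), row) for row in partition]
        result[idx] = heapq.nsmallest(k, dists, key=lambda t: t[0])
    return result

def nearest_neighbors(partitions, main_instances, k, n_jobs):
    # One task per partition, run in parallel (Algorithm 3 executed by each process).
    partials = Parallel(n_jobs=n_jobs)(
        delayed(nearest_in_partition)(p, main_instances, k) for p in partitions
    )
    # Merge the per-partition candidates and keep the global k nearest (Algorithm 2).
    merged = {}
    for idx in range(len(main_instances)):
        candidates = [c for part in partials for c in part[idx]]
        merged[idx] = heapq.nsmallest(k, candidates, key=lambda t: t[0])
    return merged
```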

Algorithm 2 Distance.
Input parameters: pool of main instances // the main instances generated by the evolutionary operators constitute the pool
number of cores, predefined number of neighbors
High-level abstraction: In this algorithm, some processes are generated and executed in parallel. The number of these processes equals the "number of nearest neighbors" parameter. Each process uses Algorithm 3 to obtain a list of nearest neighbors within one sub-dataset (for each main instance, a list of nearest neighbors is calculated by each process). Finally, these lists are collected from the processes and the final nearest neighbors of each main instance are calculated using a sort function. Note that the sort function is performed on a limited number of elements (fewer than 100 individuals).
1. for i in range(1, predefined number of neighbors + 1): // do in parallel (at processor level)
     results = combination of Nearest_Neighbors(i, pool of main instances) // Algorithm 3 is used
2. List_of_neighbors = dict() // create a dictionary structure
3. for each_key in results[0]: // results[0] is a dictionary and contains the nearest neighbor of each main instance in the first sub-dataset
4.   List_of_neighbors[each_key] = [] // note that the keys of results[0] are main instances
   // in the following, the algorithm merges all nearest neighbors
5. for each_key in results[0]: // do in parallel (at processor level)
6.   for each_dict in results: // results contain the outputs of running the Nearest_Neighbors procedure
7.     List_of_neighbors[each_key].append(each_dict[each_key])
8.   TempList = sort(List_of_neighbors[each_key])
9.   List_of_neighbors[each_key] = TempList[:number_of_nearest_neighbors] // select a predefined number
10. return List_of_neighbors

Algorithm 4 CARgenerate.
Input parameters: number of Main Instances
High-level abstraction: In this algorithm, some processes are generated to discover the rule sets of the main instances. The rule discovery algorithm is Rare-PEARs, which is described in Algorithm 5.
1. for i in range(1, number of Main Instances + 1): // do in parallel (at processor level)
2.   results = combination of RarePEARs(i) // the procedure RarePEARs is presented in Algorithm 5

Table 1
Used datasets in the experiments.

Data set                  #Instances   #Attributes      #Classes
HIGGS                     11,000,000   28 (28N)         2
KDDCup 1999 5 Classes     4,898,431    41 (26N, 15C)    5
KDDCup 1999 23 Classes    4,898,431    41 (26N, 15C)    23
Cover Type                581,012      54 (10N, 44C)    7
Poker-Hand                1,025,010    10 (10C)         10
Susy                      5,000,000    18 (18N)         2
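The dispatch loop of Algorithm 4 above can be written in a few joblib lines. In the sketch below, mine_cars is a hypothetical placeholder for loading one sub-dataset (e.g. its numbered text file) and running the sequential Rare-PEARs adaptation on it; the snippet only illustrates the fan-out, not the rule miner itself.

```python
from joblib import Parallel, delayed

def mine_cars(subdataset_id):
    # Hypothetical placeholder: load sub-dataset "<subdataset_id>.txt" and run the
    # sequential Rare-PEARs adaptation on it, returning its local CAR set.
    return []

def car_generate(number_of_main_instances, n_jobs):
    # One rule-mining task per sub-dataset, spread across the available cores.
    all_cars = Parallel(n_jobs=n_jobs)(
        delayed(mine_cars)(i) for i in range(1, number_of_main_instances + 1)
    )
    return all_cars  # the union of the local CAR sets forms the final classifier
```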

We have used these changes to review the scalability aspect of CARs-Lands. In this research, our approach is compared to five other approaches, described as follows:

• MLlib version of Decision Tree (DT): DT is a popular method for machine learning tasks due to its interpretability and its applicability to multiclass data. DT is a greedy algorithm that performs a recursive binary partition-


Fig. 10. Replacement process in the first step of CARs-Lands.

Fig. 11. Flowchart of determining the class of unlabeled instances.


Fig. 12. Example of determining the class when several matching rules exist.





ing of the feature space. In MLlib [47], DT first produces a set of split candidates for each feature. For continuous features, these candidates are determined by quantile calculation over a sampled fraction of the data. Similarly, for categorical features, split candidates are produced from the possible values of a specific feature. Then a recursive tree construction procedure is performed until one of the stopping conditions is met. The stopping conditions are as follows: the maximum depth of the tree, the minimum number of instances in a node, and the minimum value of the information gain for a split candidate.
• Mahout version of Random Forest (RF): RF is an ensemble of DTs and is one of the most successful classifiers. RF combines DTs and builds a classifier that reduces the risk of overfitting. Similar to DT, RF can handle datasets with categorical features, continuous features, or both. In Mahout [48], RF builds a predefined number of DTs. These DTs are generated separately, so they can be built in parallel. Each tree is built on a set of randomly selected features. RF is built with the MapReduce programming model in Hadoop. In Hadoop, the dataset is split into several partitions. The number of these partitions is determined by the maximum split size argument, given as an input parameter. The classification results are better when the number of partitions is small; however, this requires a lot of memory [36].
• MRAC [36]: MRAC is a distributed association rule-based classifier. It uses the distributed version of the FP-growth algorithm as its solution. MRAC generates the CARs and then performs distributed rule pruning. The set of surviving CARs is



used to classify unlabeled patterns. Bechini et al. [36] proposed two different versions of MRAC in the prediction process. MRAC uses weighted χ2 inference for prediction, whereas MRAC+ uses best-rule inference. We compare our approach to both of them.
• OPG [49]: OPG proposes a set of fuzzy rule-based classifiers characterized by different optimal trade-offs between accuracy and interpretability. It uses a recently proposed distributed fuzzy decision tree learning approach for generating an initial rule base that serves as input to the evolutionary process. In OPG, the evolutionary learning scheme is integrated with an ad hoc strategy for learning the granularity of the fuzzy partitions.

For the comparison, we have used the results of DT and RF that are available in [36]. In addition, the parameters of DT, RF, MRAC, and OPG are obtained from [36] and [49]. In Table 2, the parameter values for each method are summarized. In DT, the maxBins parameter is the number of split candidates for discretizing continuous features [50]. In the experimental results, a five-fold cross-validation is used for each dataset, similar to that of the competing methods. All the reported results are the average of 10 runs of CARs-Lands. The parameter values of CARs-Lands fall into the following 3 groups: parameters for generating the modified datasets, parameters for generating association rules, and parameters for the size of the sub-datasets. The parameters of the first two groups are evolutionary parameters (such as population size and mutation or crossover rates).


Table 2
Parameter values for each method.

Method          Parameters
CARs-Lands      NGDS (number of generations to build the modified training dataset) = 5000; NP (number of main instances) = 70,000; NS (number of sub-datasets) = NN (number of nearest neighbors) = 50; DScrossover_rate (crossover rate in dataset building) = 0.8; DSMutation_rate (mutation rate in dataset building) = 0.1; MNI (minimum number of instances of a class) = 30; MNS (minimum support of main instances) = 0.1; NA = number of dataset attributes in Rare-PEARs [42]; P (number of individuals in each sub-process of Rare-PEARs [42]) = 100; nGen (number of iterations in Rare-PEARs [42]) = 1000; RC (rate of crossover in Rare-PEARs [42]) = 1; RM (rate of mutation in Rare-PEARs [42]) = 0.01; α (minimum acceptable number of productions for a rule over generations in Rare-PEARs [42]) = 5; MinConf (minimum confidence for the CAR mining step) = 0.9 × maximum confidence of rules in a sub-dataset; MinCF (minimum CF for the CAR mining step) = 0.9 × maximum CF of rules in a sub-dataset
Decision tree   MaxDepth = 5; maxBins = 32; Impurity = GINI
Random forest   NumTrees = 100; Number_of_used_features = log2(F); maxSplitSize = 64 MB
OPG             Nval (total number of fitness evaluations) = 50,000; AS ((2+2)M-PAES archive size) = 64; Mmax (maximum number of rules in a virtual RB) = 100; Tf (number of fuzzy sets for each continuous attribute Xf) = 7; PCR (probability of applying the crossover operator to CR) = 0.6; PCT (probability of applying the crossover operator to CT) = 0.5; PCG (probability of applying the crossover operator to CG) = 0.5; PMRB1 (probability of applying the first mutation operator to CR) = 0.1; PMRB2 (probability of applying the second mutation operator to CR) = 0.7; PMT (probability of applying the mutation operator to CT) = 0.6; PMG (probability of applying the mutation operator to CG) = 0.2; Tmax (maximum number of fuzzy sets for each linguistic variable) = 7; Tmin (minimum number of fuzzy sets for each linguistic variable) = 3
MRAC            HDFS block size = 0.1%; MinSupp = 0.01%; MinConf = 50%; min χ2 = 20%

Table 3
Average accuracy and standard deviation. For the OPG method, the two-class version of KDDCup was used; although this makes it easier for Barsacchi et al. [56] to build a more efficient classifier, we accept it and report their values under KDDCup 1999 5 Classes.

Dataset                  Split   Decision tree    Random forest    MRAC             MRAC+            OPG              CARs-Lands
Cover Type               train   74.148 ± 0.199   70.198 ± 0.868   74.46 ± 0.100    78.329 ± 0.091   67.649 ± 0.012   91.102 ± 0.088
Cover Type               test    74.140 ± 0.173   70.068 ± 0.837   74.261 ± 0.156   78.092 ± 0.157   67.618 ± 0.012   89.021 ± 0.113
HIGGS                    train   66.376 ± 0.074   73.006 ± 0.016   65.079 ± 0.054   65.942 ± 0.058   65.040 ± 0.003   74.743 ± 0.016
HIGGS                    test    65.375 ± 0.058   72.542 ± 0.025   65.050 ± 0.061   65.904 ± 0.091   65.035 ± 0.004   73.201 ± 0.088
Poker-Hand               train   55.165 ± 0.213   91.277 ± 0.127   94.480 ± 0.000   94.480 ± 0.000   61.778 ± 0.011   83.425 ± 0.101
Poker-Hand               test    55.191 ± 0.203   89.591 ± 0.287   94.480 ± 0.000   94.480 ± 0.000   61.806 ± 0.001   80.567 ± 0.070
Susy                     train   77.119 ± 0.040   80.671 ± 0.009   76.245 ± 0.055   78.247 ± 0.013   78.628 ± 0.004   81.962 ± 0.021
Susy                     test    77.118 ± 0.046   80.064 ± 0.033   76.232 ± 0.068   78.220 ± 0.035   78.608 ± 0.004   81.161 ± 0.023
KDDCup 1999 5 Classes    train   99.776 ± 0.064   99.986 ± 0.002   99.898 ± 0.034   99.863 ± 0.046   99.886 ± 0.008   99.989 ± 0.013
KDDCup 1999 5 Classes    test    99.775 ± 0.063   99.982 ± 0.002   99.898 ± 0.035   99.858 ± 0.047   99.886 ± 0.010   99.980 ± 0.019
KDDCup 1999 23 Classes   train   99.781 ± 0.059   99.968 ± 0.006   99.640 ± 0.024   99.582 ± 0.020   Not reported     99.970 ± 0.028
KDDCup 1999 23 Classes   test    99.782 ± 0.057   99.966 ± 0.006   99.639 ± 0.024   99.579 ± 0.020   Not reported     99.691 ± 0.034

These have been set by trial and error, which is a common approach for evolutionary algorithms. Among the sub-dataset parameters, the number of main instances and the number of nearest neighbors determine the number of representatives and the vicinity areas, respectively. Their values have been set by a grid search. The range 10,000–100,000 was searched for the number of main instances, with an increment step of 10,000. In a similar way, the number of nearest neighbors was studied in the range 10–100, with an increment step of 10. Note that the values reported in Table 2 were the best for most of the studied datasets. Furthermore, although CARs-Lands has many parameters, these parameters make it possible to use efficient classical associative classifiers, which is a great advantage.

3.2. Performance of CARs-Lands and runtime analysis

In this sub-section, we analyze the performance of CARs-Lands in terms of accuracy, time complexity, and compu-

tation time. In Table 3, the accuracy of CARs-Lands is compared to five other methods on six large-scale datasets. In this table, for each dataset and for each method, the average value ± standard deviation of the accuracy is presented on both the training and test datasets. For each dataset, the highest values are shown in bold font. In order to study whether there are significant differences between the results of CARs-Lands and the other methods, an in-depth analysis is performed using the Wilcoxon signed-rank test [44] and complexity measures [50]. In Table 3, we can observe 3 imbalanced and 3 balanced datasets. The accuracy of CARs-Lands is analyzed on both types (balanced and imbalanced) separately. In Tables 4 and 5, the descriptive statistics of the different methods are presented for the balanced and imbalanced datasets, respectively (in the analysis, for each dataset, both training and test results are considered, so the number of datasets is 6). These descriptive tables consist of statistical information about the accuracy (the mean, standard deviation, and median).
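Descriptive statistics of the kind reported in Tables 4 and 5 can be reproduced with a few lines of pandas; the accuracy values below are placeholders, not the paper's numbers.

```python
import pandas as pd

# Six accuracy values per method (train and test on three datasets); placeholder numbers.
accuracies = pd.DataFrame({
    "CARs-Lands": [91.1, 89.0, 74.7, 73.2, 82.0, 81.2],
    "Decision tree": [74.1, 74.1, 66.4, 65.4, 77.1, 77.1],
})

# Mean, standard deviation, min, max and the 25th/50th/75th percentiles per method.
print(accuracies.describe(percentiles=[0.25, 0.5, 0.75]).T)
```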


Table 4
Descriptive statistics for the balanced datasets (Susy, Higgs, and Cover Type datasets).

Method          N   Mean        Std. Deviation   Minimum   Maximum   25th Pctl.   Median (50th)   75th Pctl.
Decision tree   6   72.379333   5.220119         65.3750   77.1190   66.125750    74.144000       77.118250
Random forest   6   74.424833   4.758593         70.0680   80.6710   70.165500    72.774000       80.215750
MRAC            6   71.887833   5.352024         65.0500   76.2450   65.071750    74.360500       76.235250
MRAC+           6   74.122333   6.351643         65.9040   78.3290   65.932500    78.156000       78.267500
OPG             6   70.429667   6.448043         65.0350   78.6280   65.038750    67.633500       78.613000
CARs-Lands      6   81.865000   7.250149         73.2010   91.1020   74.357500    81.561500       89.541250

Table 5
Descriptive statistics for the imbalanced datasets (KDD Cup 1999 23 classes, KDD Cup 1999 5 classes, and Poker-hand datasets).

Method          N   Mean        Std. Deviation   Minimum   Maximum   25th Pctl.   Median (50th)   75th Pctl.
Decision tree   6   84.911667   23.031600        55.1650   99.7820   55.184500    99.775500       99.781250
Random forest   6   96.795000   4.955977         89.5910   99.9860   90.855500    99.967000       99.983000
MRAC            6   98.005833   2.733544         94.4800   99.8980   94.480000    99.639500       99.898000
MRAC+           6   97.973667   2.709078         94.4800   99.8630   94.480000    99.580500       99.859250
OPG             4   90.990250   10.451001        80.5670   99.9890   81.281500    91.702500       99.986750
CARs-Lands      6   93.937000   9.294183         80.5670   99.9890   82.710500    99.830500       99.982250

Table 6
Wilcoxon signed-rank test ranks for the balanced datasets (Susy, Higgs, and Cover Type datasets). Negative ranks: CARs-Lands < compared method; positive ranks: CARs-Lands > compared method; ties: CARs-Lands = compared method.

Comparison                    Negative Ranks   Positive Ranks   Ties   Total   Mean Rank (neg / pos)   Sum of Ranks (neg / pos)
CARs-Lands – Decision tree    0                6                0      6       0.00 / 3.50             0.00 / 21.00
CARs-Lands – Random forest    0                6                0      6       0.00 / 3.50             0.00 / 21.00
CARs-Lands – MRAC             0                6                0      6       0.00 / 3.50             0.00 / 21.00
CARs-Lands – MRAC+            0                6                0      6       0.00 / 3.50             0.00 / 21.00
CARs-Lands – OPG              0                6                0      6       0.00 / 3.50             0.00 / 21.00

The Wilcoxon signed-rank test [51] is a nonparametric test that is used for comparing the results of different approaches. When comparing two methods, it reports a value (the p-value), called Asymp.Sig, that indicates whether there is a significant difference between the results obtained by the two methods. In the Wilcoxon signed-rank test, the value of Asymp.Sig is reported using the Z statistic. Z is calculated using Eq. (2) [51]. In this equation, W is the sum of the signed ranks, and the values −0.5 and +0.5 are used when the value of W

is larger or smaller than μw , respectively. In this study, we have used the SPSS software package to produce these results.

Z = (W − μw ± 0.5) / σw    (2)
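The same test can be run with SciPy instead of SPSS. The snippet below compares two paired accuracy lists; the values are placeholders, not the reported results.

```python
from scipy.stats import wilcoxon

# Paired accuracies of two methods on the same six dataset splits (placeholder values).
cars_lands = [89.0, 73.2, 80.6, 81.2, 99.98, 99.69]
decision_tree = [74.1, 65.4, 55.2, 77.1, 99.78, 99.78]

statistic, p_value = wilcoxon(cars_lands, decision_tree)
print(f"W = {statistic}, Asymp. Sig. (p-value) = {p_value:.3f}")
```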

In Tables 6 and 7, all pairs of algorithms are compared using the Wilcoxon signed-rank test. Each row in these tables indicates for how many datasets a method is weaker than, better than, or


Table 7
Wilcoxon signed-rank test ranks for the imbalanced datasets (KDD Cup 1999 23 classes, KDD Cup 1999 5 classes, and Poker-hand datasets). Negative ranks: CARs-Lands < compared method; positive ranks: CARs-Lands > compared method; ties: CARs-Lands = compared method.

Comparison                    Negative Ranks   Positive Ranks   Ties   Total   Mean Rank (neg / pos)   Sum of Ranks (neg / pos)
CARs-Lands – Decision tree    1                5                0      6       1.00 / 4.00             1.00 / 20.00
CARs-Lands – Random forest    4                2                0      6       4.13 / 2.25             16.50 / 4.50
CARs-Lands – MRAC             2                4                0      6       5.50 / 2.50             11.00 / 10.00
CARs-Lands – MRAC+            2                4                0      6       5.50 / 2.50             11.00 / 10.00
CARs-Lands – OPG              0                4                0      4       0.00 / 2.50             0.00 / 10.00

Table 8
Test statistics (Wilcoxon signed-rank test) for the balanced datasets (Susy, Higgs, and Cover Type datasets). All Z statistics are based on negative ranks.

Comparison                    Z        Asymp. Sig. (2-tailed)
CARs-Lands – Decision tree    −2.201   0.028
CARs-Lands – Random forest    −2.201   0.028
CARs-Lands – MRAC             −2.201   0.028
CARs-Lands – MRAC+            −2.201   0.028
CARs-Lands – OPG              −2.201   0.028

Table 9
Test statistics (Wilcoxon signed-rank test) for the imbalanced datasets (KDD Cup 1999 23 classes, KDD Cup 1999 5 classes, and Poker-hand datasets).

Comparison                    Z                                   Asymp. Sig. (2-tailed)
CARs-Lands – Decision tree    −1.992 (based on negative ranks)    0.046
CARs-Lands – Random forest    −1.261 (based on positive ranks)    0.207
CARs-Lands – MRAC             −0.105 (based on positive ranks)    0.917
CARs-Lands – MRAC+            −0.105 (based on positive ranks)    0.917
CARs-Lands – OPG              −1.826                              0.068

equivalent to the other method. For example, consider Table 6. In this table, the first row (CARs-Lands – Decision tree) reports four rank counts. The first count shows that CARs-Lands is not weaker than DT on any dataset (Negative Ranks is 0). Similarly, the second and third counts indicate that CARs-Lands is better than DT on six datasets (Positive Ranks is 6) and is not equivalent on any dataset (Ties is 0), respectively. Negative Ranks, Positive Ranks, and Ties are the terms used by the SPSS software package to indicate weakness, superiority, and equivalence, respectively.

In Tables 8 and 9, we compare the methods in pairs to determine which one is better in terms of accuracy. In other words, we want to determine whether the superiority of a method is reliable or not. To this end, we use the values of Asymp.Sig to show the significant differences between the methods. In Table 8, all values of Asymp.Sig (significant difference) are smaller than 0.05. This demonstrates that there is a statistically significant difference between the results of the methods. In the previous statement, the term results refers to the median values in


Table 10
Average AUC for CARs-Lands on KDDCup 1999 23 classes; Ci shows the ith class label and nan is used when AUC does not provide any value for a given class label (this occurs when the model cannot compute the probability for some instances of the training dataset).

Class   C1    C2    C3    C4    C5   C6   C7   C8    C9   C10   C11   C12   C13   C14
AUC     1     0.99  0.99  nan   1    1    1    0.99  1    1     1     1     0.99  1

Class   C15   C16   C17   C18   C19   C20   C21   C22   C23
AUC     0.99  0.99  0.5   1     0.48  0.83  0.99  0.5   0.49

Table 11
Average AUC for CARs-Lands on KDDCup 1999 5 classes (Ci shows the ith class label).

Class   C1        C2        C3   C4        C5
AUC     0.999989  0.870224  1    0.990364  0.999987

Table 12
Average AUC for CARs-Lands on Poker-hand (Ci shows the ith class label).

Class   C1    C2        C3        C4        C5        C6        C7        C8        C9        C10
AUC     nan   0.699346  0.804166  0.919529  0.982687  0.976242  0.904015  0.999985  0.840215  0.785849

Table 4, where the median values of DT, RF, MRAC, MRAC+, and CARs-Lands are 74.144000, 72.774000, 74.360500, 78.156000, and 81.561500, respectively. In Table 8, the Wilcoxon signed-rank test shows that the superiority of CARs-Lands is reliable and will not be affected by increasing the number of datasets. However, for the imbalanced datasets (Table 9), from the point of view of statistically significant differences, CARs-Lands is only better than DT in terms of accuracy and exhibits performance equivalent to the other methods. In other words, on the imbalanced datasets, there is no significant difference between CARs-Lands, RF, MRAC [36], and OPG [49]. According to the aforementioned analysis, the median values of RF, MRAC, MRAC+, CARs-Lands, and OPG [49] will remain similar as the number of datasets increases. This demonstrates that none of these methods is superior to the others on the imbalanced datasets. In the following, we examine the accuracy values on the KDD Cup and Poker-hand datasets in more detail. On both versions of the KDD Cup dataset, all classifiers reach a considerable accuracy. This is because the number of instances of some classes is extremely small, and hence the accuracy does not change significantly (note that the number of instances of some classes is smaller than a few dozen). For further explanation, the average area under the curve (AUC) values for each class of KDD Cup are presented in Tables 10 and 11. The Poker-hand dataset is in a similar situation; in Table 12, the average AUC values of this dataset are reported. The values in Tables 10–12 demonstrate that CARs-Lands suffers from an accuracy reduction on classes with a small number of instances. In the following, we examine the nature of the datasets to find more evidence for the different results on the balanced and imbalanced datasets. To this end, we use complexity measures [50]. In this study, the complexity measures are used to show the power of attributes for predicting an unlabeled instance, as well as the level of overlap between the attribute values of different classes. We use the following three complexity measures, calculated as follows:

• F1gen: this measure is based on the maximum Fisher's discriminant ratio (F1) and computes the maximum Fisher's discriminant for a set of attributes. F1gen is a generalized version of F1 that works on multiclass datasets. It is calculated using Eq. (3) [50]:

F1gen = [ Σ_{i=1..C} n_i · dist(m, m_i) ] / [ Σ_{i=1..C} Σ_{j=1..n_i} dist(x_ij, m_i) ]    (3)

In Eq. (3), C is the number of classes; n_i is the number of instances from class i; m_i and m are, respectively, the mean of the instances of class i and the mean of the whole dataset; and finally, x_ij is the jth instance from class i. The values of F1gen reflect the performance of a classification model that considers all of the attributes; larger values of F1gen indicate that attribute collaboration is beneficial in the classification process.

• F2gen: this measure is based on the F2 measure, which captures the volume of the overlap region. F2 is defined for two-class datasets and is the product of the overlapping ranges of the attributes. It is calculated using the following equation [50]:

F2 = Π_{i=1..|A|} [ min(M(i,1), M(i,2)) − max(m(i,1), m(i,2)) ] / [ max(M(i,1), M(i,2)) − min(m(i,1), m(i,2)) ]    (4)

In Eq. (4), M(i, j) is the maximum value of attribute i in class j, and m(i, j) is the minimum value of attribute i in class j; |A| denotes the number of attributes. F2gen is a generalized version of F2 that can be used on multiclass datasets and is calculated as follows:

F2gen = Σ_{1 ≤ i ≤ j ≤ C} F2(i, j)    (5)

In Eq. (5), C is the number of classes.

• F3gen: this is another well-known complexity measure in meta-learning and is based on the F3 measure. The value of F3 indicates the proportion of instances that can be classified using a given attribute. F3gen is the generalized version of F3 and considers the instances that lie in at least one overlapping region for some pair of classes. (A minimal sketch of how F1gen and the pairwise F2 measure can be computed is given after this list.)
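The following is a minimal sketch of F1gen (Eq. (3)) and the pairwise F2 measure (Eq. (4)) for a numeric dataset; it assumes Euclidean distance for dist and follows the equations as written above rather than the reference implementation of [50], and constant attributes (zero spread) would need special handling.

import numpy as np

def f1gen(X, y):
    # F1gen as written in Eq. (3), assuming Euclidean distance for dist.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m = X.mean(axis=0)                                  # mean of the whole dataset
    num = den = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                            # mean of class c
        num += len(Xc) * np.linalg.norm(m - mc)         # n_i * dist(m, m_i)
        den += np.linalg.norm(Xc - mc, axis=1).sum()    # sum_j dist(x_ij, m_i)
    return num / den

def f2_pair(X, y, c1, c2):
    # Pairwise F2 of Eq. (4): product over attributes of the overlap ratio
    # between classes c1 and c2. F2gen (Eq. (5)) aggregates this quantity
    # over all pairs of classes.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X1, X2 = X[y == c1], X[y == c2]
    overlap = np.minimum(X1.max(axis=0), X2.max(axis=0)) - np.maximum(X1.min(axis=0), X2.min(axis=0))
    spread = np.maximum(X1.max(axis=0), X2.max(axis=0)) - np.minimum(X1.min(axis=0), X2.min(axis=0))
    return float(np.prod(overlap / spread))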

In Table 13, F2gen shows the level of overlap between the attributes of different classes. In the Cover Type, KDD Cup 1999 5 classes, and KDD Cup 1999 23 classes datasets, the level of overlap is 0; therefore, in these datasets, we expect high accuracy. In both versions of the KDD Cup dataset, this expectation is satisfied (see Table 3). The situation is similar for the Cover Type dataset, on which CARs-Lands achieves good accuracy. This is because the value of F1gen is large enough (0.24), which indicates that collaboration between attributes has a strong effect on building the model; note that the F1gen values were even larger in our sub-datasets (nearly 0.3). The values of F2gen are similar and small for both the Susy and Higgs datasets.


Table 13
Average of F1gen and F2gen.

C-measures   Cover type   HIGGS   Poker-Hand   Susy   KDDCup 1999 5 Classes   KDDCup 1999 23 Classes
F1gen        0.24         0.29    0.07         0.73   0.75                    1.37
F2gen        0.00         2.01    38.01        2.00   0                       0

Fig. 13. Average of F3gen for CARs-Lands in the Cover Type, Higgs, Susy, and Poker-hand datasets.

Simultaneously, their F1gen values are large and there are some attributes with large F3gen values (Fig. 13); therefore, a high accuracy is expected on both datasets, which the values in Table 3 confirm. In the Poker-hand dataset, the F3gen values of all attributes are 0, which means that none of the attributes can determine the class label alone. Moreover, the values of F1gen and F2gen are 0.07

and 38.01, respectively. These values show a high level of overlap between the regions of different classes; thus, in the Poker-hand dataset, collaboration between attributes is the only way to build an appropriate classifier. As observed in Table 3, MRAC [36] is the best method for the Poker-hand dataset because MRAC is a method based on collaboration between attributes (MRAC is an associative classifier).


Fig. 14. Scalability of CARs-Lands.

In the Poker-hand dataset, CARs-Lands is based on finding the CARs in several sub-datasets, and in the discovered sub-datasets the values of F1gen and F3gen were not larger than in the whole dataset. Hence, on the Poker-hand dataset, CARs-Lands has a lower accuracy than MRAC.

Algorithm 5 RarePEARs.
Input parameters: whichFile // the file name of each main instance
High-level abstract: for each main instance, a file (containing the main instance and its nearest neighbors) is opened; then rules are generated and stored in a predefined file.
1. FADDR = str(whichFile) + ".txt" // producing the file name
2. open FADDR file:
3.    perform Rare-PEARs [40] on the file // CARs are generated
4.    rules_list = accumulate the non-dominated rules in a list of lists // each rule is an element of a list
5.    remove redundant rules from rules_list
6.    write rules_list to a predefined file
7. return 1

In this paragraph, the theoretical analysis of CARs-Lands is presented. Time complexity and speedup (Eq. (6)) are the components of this analysis. CARs-Lands has four main algorithms (Algorithms 4 and 5 are considered together as one algorithm), and their theoretical analysis in terms of time complexity and speedup is as follows. Note that, in this part of the analysis, we assume an unlimited number of cores is available.







• Algorithm 1: according to the pseudocode of this algorithm, the time complexities of the sequential and parallel versions are O(n × c × m) + O(n × O(Algorithm 2)) and O(n × m) + O(n × O(Algorithm 2)), respectively, where n, m, and c are the number of generations, the number of classes, and the number of newly generated individuals. Hence, the speedup depends on the time complexity of the sequential and parallel versions of Algorithm 2, which is discussed in the next item.
• Algorithm 2: the time complexities of the sequential and parallel versions are O(n × m × c log c) and O(m × c log c), respectively, where n, m, and c are the number of main instances, the number of sub-datasets, and the number of nearest neighbors (a small constant value); c log c is the complexity of the sort function. The speedup is therefore O(n).
• Algorithm 3: the time complexities of the sequential and parallel versions are O(n × m × c log c) and O(n × c log c), respectively (again, c log c is the complexity of the sort function), where n, m, and c are the number of lines of the sub-datasets, the number of main instances, and the number of nearest neighbors. Since c is a small constant, the sequential and parallel time complexities of Algorithm 3 simplify to O(n × m) and O(n), respectively; as a result, the speedup is O(m).
• Algorithm 4: the time complexities of the sequential and parallel versions are O(n × O(Rare-PEARs)) and O(Rare-PEARs), respectively, where n is the number of main instances. As a result, the speedup is n.
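One possible way to express the task structure of the parallel CAR-mining step (Algorithms 4 and 5) in Python is with joblib [46]; the sketch below is not the authors' code, and mine_rules_for_file is a hypothetical stand-in that only illustrates the file naming and one-task-per-main-instance structure.

from joblib import Parallel, delayed   # joblib [46] provides simple process-level parallelism

def mine_rules_for_file(which_file):
    # Hypothetical stand-in for Algorithm 5: in CARs-Lands, Rare-PEARs is run on
    # the sub-dataset file of one main instance and its non-dominated,
    # non-redundant CARs are written to a predefined file. Here only the file
    # naming and task structure are illustrated; the rule list stays empty.
    faddr = str(which_file) + ".txt"
    rules = []   # a real implementation would mine CARs from the file at faddr
    return faddr, rules

# One independent task per main instance (56 sub-datasets are used in the
# experiments below); with at least as many cores as sub-datasets, all of them
# are mined concurrently, which is the ideal case assumed by the analysis above.
results = Parallel(n_jobs=-1)(delayed(mine_rules_for_file)(i) for i in range(56))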

speedup = t(s) / t(p)    (6)

where t(s) is the time of the sequential implementation and t(p) is the time of the parallel implementation.

Note that the aforementioned analysis is theoretical and does not include some practical aspects of speedup, such as communication overhead, so the theoretical speedups above cannot be reached in practice. Hence, the measured speedup of CARs-Lands is reported later in this section. Nevertheless, we first need to know the overall time complexity of CARs-Lands for both the training and testing phases, which is discussed in more detail in the following. In CARs-Lands, the running time of the learning process consists of two parts: building a modified training dataset and CAR mining. In building a modified training dataset (Algorithm 1), the main cost of the running time depends on the fitness calculation. Let us assume that |LMI| is the number of main instances, |LNI| the number of nearest neighbors, |LSN| the number of instances of a sub-dataset, and |cdti| the cost of computing the distance between two instances. Because the values of |LMI|, |LNI|, and |LSN| are small, building a modified training dataset can be performed efficiently, with a time complexity of

if |LNI| = number of cores (processors):  O(k · |LSN| · |LMI| · |cdti|)
else:  O((|LNI| / number of cores) · k · |LSN| · |LMI| · |cdti|)    (7)

In Eq. (7), k is the number of iterations (a constant number). Furthermore, |cdti| depends on the number of attributes (|LAN|). Thus, the time complexity of building a modified training dataset is O((|LNI| / number of cores) · |LSN| · |LMI| · |LAN|). By considering |LMI| < |LSN| = n and replacing |LAN|, |LNI|, and the number of cores by constant numbers, the order of building a modified training dataset is O(n²). In CAR mining, the time complexity is equal to O((|LMI| / number of cores) × O(Rare-PEARs)). In the CAR mining step, there are several small files, and the number of these files is equal to the number of main instances. Because each file contains only a few instances and the time complexity of Rare-PEARs [40] is polynomial, the time complexity of CAR mining is polynomial as well (see procedure CARgenerate (Algorithm 4) in Section 2.2). In the prediction phase, CARs-Lands first finds the nearest main instance to an unlabeled instance. This task is performed with a time complexity of O(|LMI| · |LAN|); both |LMI| and |LAN| are constant numbers, so this cost is constant. CARs-Lands then finds the matching rules and determines the class of the unlabeled instance; since the number of matching rules is small, the required time is also constant. Consequently, for the whole test dataset, the time complexity of the prediction phase is O(|Ntest|), where |Ntest| is the number of instances in the test dataset. Furthermore, when determining the class of an unlabeled instance, the required time is considerably reduced by using parallel programming and increasing the number of cores.

In Tables 14 and 15, the running times of the learning process and the test process are presented (in seconds). In Table 14, the reported time consists of building a modified training dataset (step 1) and the CAR mining step (step 2). We used different numbers of cores to perform the scalability analysis. Note that the number of sub-datasets has no effect on the time of the CAR mining step; it only changes the running time of the first step of CARs-Lands. The number of cores is set to different values (16, 20, 24, and 28), whereas in the first step of CARs-Lands the number of sub-datasets is equal to 56. Note that the number of waiting tasks is reduced by increasing the number of processors. Because our method is executed with a different number of cores than the methods in [36] and [49], we use speedup to show the scalability of CARs-Lands. In Table 14, in the first step of CARs-Lands, adding more cores considerably reduces the running time, because more cores are available to run all processes of the fitness calculation simultaneously in a parallel manner. In the implementation, Intel Xeon E5-2695 v3 processors are employed; each of these processors has 14 cores, and we have used two processors, so 28 cores are available. The last row in Table 14 provides the running time of the sequential version of steps 1 and 2. Similarly, Table 15 provides the running time of the prediction process of CARs-Lands, and its last row indicates the running time of the sequential version. In this study, the scalability of CARs-Lands is calculated based on the speedup definition in Eq. (6). In Fig. 14 and Table 16, the values of speedup are shown; Fig. 14 presents the scalability of the three steps of CARs-Lands. The values demonstrate the efficiency of CARs-Lands in terms of running time.
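As a worked example of Eq. (6), consider step 1 on the Cover Type dataset: Table 14 reports a sequential running time of 6157 s and a parallel running time of 731 s with 28 cores, so the speedup is 6157 / 731 ≈ 8.42, which is the value reported in Table 16.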
Note that changing the number of sub-datasets has no effect on the results (accuracy) of CARs-Lands, because the results of both steps of the learning process of CARs-Lands (dataset creation and CAR mining) are independent of the data partitioning.


Table 14
Average running time in the learning steps (in seconds).

                         Cover type        HIGGS             Poker-Hand        Susy              KDDCup 1999       KDDCup 1999
                                                                                                 5 Classes         23 Classes
                         Step1    Step2    Step1    Step2    Step1    Step2    Step1    Step2    Step1    Step2    Step1    Step2
Number of Cores = 16     1661     1821     12,131   2311     1991     1610     3220     1801     3141     1771     3301     1892
Number of Cores = 20     1148     1208     9641     1345     1393     980      2394     1298     2276     1158     2683     1103
Number of Cores = 24     929      770      7301     841      1031     741      1694     801      1601     773      1760     785
Number of Cores = 28     731      595      5060     671      833      561      1302     620      1194     589      1235     611
Sequential versions      6157     4271     31,906   4451     5910     4106     10,872   4315     10,310   4321     10,168   4381

Table 15
Average running time in the prediction phase (in seconds).

                          Cover type   HIGGS     Poker-Hand   Susy    KDDCup 1999 5 Classes   KDDCup 1999 23 Classes
Parallel with 28 cores    331          1691      292          731     822                     891
Sequential version        1421         10,681    1031         4741    5231                    5721

Table 16
Values of speedup.

Steps of CARs-Lands                                Cover type   HIGGS      Poker-Hand   Susy       KDDCup 1999 5 Classes   KDDCup 1999 23 Classes
Speedup of building a modified training dataset    8.422709     6.305534   7.094838     8.35023    8.634841                8.233198
Speedup of CAR mining                               7.178151     6.633383   7.319073     6.959677   7.336163                7.170213
Speedup of prediction phase                         4.293051     6.316380   3.530821     6.485636   6.363746                6.420875

4. Conclusion

In this study, we have proposed CARs-Lands, an efficient parallel associative classifier based on parallel programming and an evolutionary algorithm. CARs-Lands intelligently generates sub-datasets that form a modified training dataset based on vicinity (the term vicinity refers to the distance from some representative instances). The modified training dataset is used to extract CARs, which are mined by an evolutionary association rule mining method [42] in order to generate a reliable classifier. We compared our approach to five algorithms (associative and non-associative classifiers) on six real-world large-scale datasets. CARs-Lands shows an efficiency higher than or equal to that of the other algorithms. Moreover, the results of CARs-Lands have been analyzed with the Wilcoxon signed-rank test to determine whether there is a significant difference between the results of CARs-Lands and those of the other methods; the test results show that the superiority of CARs-Lands is significant. We also analyzed the running time and scalability of CARs-Lands by changing the number of cores, and the results show the scalability of CARs-Lands. The ability to use classical efficient associative classifiers is the main contribution of CARs-Lands: it generates a new dataset that is completely appropriate for use by classical associative classifiers. Since there are many efficient classical associative classifiers, the ability to use them is a great advantage. In other words, we have converted large-scale datasets into a set of small datasets that are suitable for generating rules in parallel. In the future, we will propose large-scale versions of some of the most prestigious classical associative classifiers. Moreover, we will attempt to develop a new version of CARs-Lands to handle streaming datasets; the idea of generating sub-datasets for a time period and adding or replacing new neighbors in the sub-datasets is feasible for streaming data.

References

[1] J. Xu, W. An, L. Zhang, D. Zhang, Sparse, collaborative, or nonnegative representation: which helps pattern classification? Pattern Recognit. 88 (2019) 679–688.
[2] Y. Zheng, B.K. Iwana, S. Uchida, Mining the displacement of max-pooling for text recognition, Pattern Recognit. (2019).
[3] K. Kim, J. Lee, Sentiment visualization and classification via semi-supervised nonlinear dimensionality reduction, Pattern Recognit. 47 (2) (2014) 758–768.
[4] I. Triguero, S. del Río, V. López, J. Bacardit, J.M. Benítez, F. Herrera, ROSEFW-RF: the winner algorithm for the ECBDL'14 big data competition: an extremely imbalanced big data bioinformatics problem, Knowl.-Based Syst. 87 (2015) 69–79.
[5] T. Van Phan, M. Nakagawa, Combination of global and local contexts for text/non-text classification in heterogeneous online handwritten documents, Pattern Recognit. 51 (2016) 112–124.
[6] Y. Zhang, G. Cao, B. Wang, X. Li, A novel ensemble method for k-nearest neighbor, Pattern Recognit. 85 (2019) 13–25.
[7] X. Xiao, M. Hui, Z. Liu, iAFP-Ense: an ensemble classifier for identifying antifreeze protein by incorporating grey model and PSSM into PseAAC, J. Membr. Biol. 249 (6) (2016) 845–854.
[8] L. Dong, J. Wesseloo, Y. Potvin, X. Li, Discrimination of mine seismic events and blasts using the fisher classifier, naive bayesian classifier and logistic regression, Rock Mech. Rock Eng. 49 (1) (2016) 183–211.
[9] T.T. Wong, C.R. Liu, An efficient parameter estimation method for generalized Dirichlet priors in naïve Bayesian classifiers with multinomial models, Pattern Recognit. 60 (2016) 62–71.
[10] L.C. Padierna, M. Carpio, A. Rojas-Domínguez, H. Puga, H. Fraire, A novel formulation of orthogonal polynomial kernel functions for SVM classifiers: the Gegenbauer family, Pattern Recognit. 84 (2018) 211–225.
[11] J. Luo, D. Liu, J. Wu, J. Yan, H. Zhao, Q. Wang, A novel differentiation sectionalized strengthen planning method for transmission line based on support vector regression, Neural Comput. Appl. (2018) 1–11.
[12] T. Björklund, A. Fiandrotti, M. Annarumma, G. Francini, E. Magli, Robust license plate recognition using neural networks trained on synthetic images, Pattern Recognit. 93 (2019) 134–146.
[13] C.S. Dash, A. Saran, P. Sahoo, S. Dehuri, S.B. Cho, Design of self-adaptive and equilibrium differential evolution optimized radial basis function neural network classifier for imputed database, Pattern Recognit. Lett. 80 (2016) 76–83.
[14] A. Cano, B. Krawczyk, Evolving rule-based classifiers with genetic programming on gpus for drifting data streams, Pattern Recognit. 87 (2019) 248–268.
[15] R.J. Kuo, M. Gosumolo, F.E. Zulvia, Multi-objective particle swarm optimization algorithm using adaptive archive grid for numerical association rule mining, Neural Comput. Appl. (2017) 1–14.
[16] Y. Sun, Y. Wang, A.K. Wong, Boosting an associative classifier, IEEE Trans. Knowl. Data Eng. 18 (7) (2006) 988–992.
[17] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, in: ACM SIGMOD Record, 22, ACM, 1993, pp. 207–216.

[18] F. Chen, Y. Wang, M. Li, H. Wu, J. Tian, Principal association mining: an efficient classification approach, Knowl.-Based Syst. 67 (2014) 16–25.
[19] J. Han, J. Pei, Y. Yin, R. Mao, Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Mining Knowl. Discov. 8 (1) (2004) 53–87.
[20] B.L. Ma, B. Liu, Integrating classification and association rule mining, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[21] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases, 1215, VLDB, 1994, pp. 487–499.
[22] B. Liu, Y. Ma, C.K. Wong, Improving an association rule based classifier, in: European Conference on Principles of Data Mining and Knowledge Discovery, Springer, Berlin, Heidelberg, 2000, pp. 504–509.
[23] W. Li, J. Han, J. Pei, CMAR: accurate and efficient classification based on multiple class-association rules, in: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, IEEE, 2001, pp. 369–376.
[24] E. Baralis, S. Chiusano, P. Garza, A lazy approach to associative classification, IEEE Trans. Knowl. Data Eng. 20 (2) (2008) 156–171.
[25] L.T. Nguyen, B. Vo, T.P. Hong, H.C. Thanh, Classification based on association rules: a lattice-based approach, Expert Syst. Appl. 39 (13) (2012) 11357–11366.
[26] R. Rak, L. Kurgan, M. Reformat, A tree-projection-based algorithm for multi-label recurrent-item associative-classification rule generation, Data Knowl. Eng. 64 (1) (2008) 171–197.
[27] J. Alwidian, B.H. Hammo, N. Obeid, WCBA: weighted classification based on association rules algorithm for breast cancer disease, Appl. Soft Comput. 62 (2018) 536–549.
[28] M.Z. Ashrafi, D. Taniar, K. Smith, Redundant association rules reduction techniques, Int. J. Bus. Intell. Data Min. 2 (1) (2007) 29–63.
[29] G. Costa, G. Manco, R. Ortale, E. Ritacco, From global to local and viceversa: uses of associative rule learning for classification in imprecise environments, Knowl. Inf. Syst. 33 (1) (2012) 137–169.
[30] X. Zhang, G. Chen, Q. Wei, Building a highly-compact and accurate associative classifier, Appl. Intell. 34 (1) (2011) 74–86.
[31] L.T. Nguyen, N.T. Nguyen, B. Vo, H.S. Nguyen, Efficient method for updating class association rules in dynamic datasets with record deletion, Appl. Intell. 48 (6) (2018) 1491–1505.
[32] S.K. Ng, R. Tawiah, G.J. McLachlan, Unsupervised pattern recognition of mixed data structures with numerical and categorical features using a mixture regression modelling framework, Pattern Recognit. 88 (2019) 261–271.
[33] Z. Jiang, Z. Lin, H. Ling, F. Porikli, L. Shao, P. Turaga, Discriminative feature learning from big data for visual recognition, Pattern Recognit. 48 (10) (2015) 2961–2963.
[34] Q. He, H. Wang, F. Zhuang, T. Shang, Z. Shi, Parallel sampling from big data with uncertainty distribution, Fuzzy Sets Syst. 258 (2015) 117–133.
[35] M. Vojnovic, F. Xu, J. Zhou, Sampling Based Range Partition Methods for Big Data Analytics, Technical Report, Microsoft Research, 2012.
[36] A. Bechini, F. Marcelloni, A. Segatori, A MapReduce solution for associative classification of big data, Inf. Sci. 332 (2016) 33–55.
[37] V. López, S. Del Río, J. Benítez, F. Herrera, Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data, Fuzzy Sets Syst. 258 (2015) 5–38.


[38] F. Thabtah, S. Hammoud, H. Abdel-Jaber, Parallel associative classification data mining frameworks based MapReduce, Parallel Process. Lett. 25 (2) (2015) 1550002.
[39] L.T. Nguyen, B. Vo, T.P. Hong, H.C. Thanh, CAR-Miner: an efficient algorithm for mining class-association rules, Expert Syst. Appl. 40 (6) (2013) 2305–2311.
[40] H. Li, Y. Wang, D. Zhang, M. Zhang, E.Y. Chang, Pfp: parallel fp-growth for query recommendation, in: Proceedings of the 2008 ACM Conference on Recommender Systems, ACM, 2008, pp. 107–114.
[41] W. Fan, A. Bifet, Mining big data: current status, and forecast to the future, ACM SIGKDD Explor. Newsl. 14 (2) (2013) 1–5.
[42] M. Almasi, M.S. Abadeh, Rare-PEARs: a new multi objective evolutionary algorithm to mine rare and non-redundant quantitative association rules, Knowl.-Based Syst. 89 (2015) 366–384.
[43] F. Berzal, I. Blanco, D. Sanchez, M. Vila, Measuring the accuracy and interest of association rules: a new framework, Intell. Data Anal. 6 (3) (2002) 221–235.
[44] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, CRC Press, 2003.
[45] UCI machine learning repository, http://archive.ics.uci.edu/ml.
[46] https://pythonhosted.org/joblib/
[47] https://spark.apache.org/mllib
[48] http://mahout.apache.org
[49] M. Barsacchi, A. Bechini, P. Ducange, F. Marcelloni, Optimizing partition granularity, membership function parameters, and rule bases of fuzzy classifiers for big data by a multi-objective evolutionary approach, Cognit. Comput. (2019) 1–21.
[50] E. Leyva, A. González, R. Perez, A set of complexity measures designed for applying meta-learning to instance selection, IEEE Trans. Knowl. Data Eng. 27 (2) (2015) 354–367.
[51] http://vassarstats.net/textbook/ch12a.html and https://statistics.laerd.com/spss-tutorials/wilcoxon-signed-rank-test-using-spss-statistics.php.

Mehrdad Almasi received his B.S. degree in Computer Engineering from Kharazmi University, Tehran, Iran, in 2008, the M.S. degree in Artificial Intelligence from Isfahan University of Technology, Isfahan, Iran, in 2011, and his Ph.D. degree in Software Systems Engineering at Tarbiat Modares University, Tehran, Iran, in July 2018. His research has mainly focused on developing algorithms for big data mining and correlated feature discovery. His interests include structural/nonstructural big data mining, bioinformatics, pattern discovery from traffic data, and text mining.

Mohammad Saniee Abadeh received his B.S. degree in Computer Engineering from Isfahan University of Technology, Isfahan, Iran, in 2001, the M.S. degree in Artificial Intelligence from Iran University of Science and Technology, Tehran, Iran, in 2003, and his Ph.D. degree in Artificial Intelligence at the Department of Computer Engineering in Sharif University of Technology, Tehran, Iran, in February 2008. Dr. Saniee Abadeh is currently an Associate Professor at the Faculty of Electrical and Computer Engineering in Tarbiat Modares University. His research has mainly focused on developing advanced meta-heuristic algorithms for big data mining and knowledge discovery purposes. His interests include biomedical and bioinformatics data mining, evolutionary algorithms and swarm intelligence for knowledge discovery, deep learning in medical image analysis, explainable artificial intelligence, and text mining.