MGFS: A multi-label graph-based feature selection algorithm via PageRank centrality


Journal Pre-proof

Amin Hashemi (Conceptualization, Methodology, Software, Original draft preparation), Mohammad Bagher Dowlatshahi (Supervision, Data curation, Writing, Investigation, Reviewing and Editing), Hossein Nezamabadi-pour (Supervision, Validation, Reviewing and Editing)

PII: S0957-4174(19)30741-9
DOI: https://doi.org/10.1016/j.eswa.2019.113024
Reference: ESWA 113024

To appear in: Expert Systems With Applications

Received date: 9 July 2019
Revised date: 13 October 2019
Accepted date: 13 October 2019

Please cite this article as: Amin Hashemi, Mohammad Bagher Dowlatshahi, Hossein Nezamabadi-pour, MGFS: A multi-label graph-based feature selection algorithm via PageRank centrality, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.113024

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Highlights

• We propose a fast algorithm for feature selection on multi-label data.
• Features that discriminate classes are linked to form an undirected weighted graph.
• Feature relationships are defined based on correlation distance with the labels.
• The PageRank algorithm ranks the features according to their importance in the weighted graph.
• The proposed multi-label graph-based method outperforms competitive methods.

Amin Hashemi a, Mohammad Bagher Dowlatshahi b,*, Hossein Nezamabadi-pour c

a Department of Computer Engineering, Faculty of Engineering, Lorestan University, Khorramabad, Iran. [email protected]
b Department of Computer Engineering, Faculty of Engineering, Lorestan University, Khorramabad, Iran. [email protected]
c Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran. [email protected]

* Corresponding author

Abstract

In multi-label data, each instance is associated with a set of labels rather than a single label: in the data set, an instance is assigned 1 in the column of every label it belongs to and 0 in the column of every label it does not belong to. Such data are usually high-dimensional, so many methods use machine learning algorithms to choose the best subset of features, reduce the dimensionality of the data, and then build an acceptable classification model. In this paper, we design a fast algorithm for feature selection on multi-label data using the PageRank algorithm, an effective method for calculating the importance of web pages on the Internet. This algorithm, called multi-label graph-based feature selection (MGFS), first constructs an M×L matrix, called the Correlation Distance Matrix (CDM), where M is the number of features and L is the number of class labels. Then, MGFS creates a complete weighted graph, called the Feature-Label Graph (FLG), in which each feature is a vertex and the weight of the edge between two vertices (features) is their Euclidean distance in the CDM. Finally, the importance of each vertex (feature) is estimated via the PageRank algorithm. In the proposed method, the number of selected features can be determined by the user. To assess the performance of the proposed algorithm, we compare it with several multi-label feature selection methods on several multi-label datasets of different dimensions. The results show the superiority of the proposed method in terms of classification criteria and run-time.

Keywords: Multi-label feature selection, Correlation Distance Matrix, Feature-Label Graph, PageRank centrality.

1 Introduction

The growth of modern technologies, new computers, and different applications has generated large amounts of data at remarkable speed. These data are mostly video, image, text, and audio, and are often high-dimensional. Such high-dimensional data present many challenges for data analysis, decision making, classification, and prediction (Cai, Luo, Wang, & Yang, 2018).

On the other hand, such data usually contain many redundant and irrelevant features, which can cause problems. To address these problems, a subset of relevant features is chosen; this procedure is called feature selection. Feature selection has many advantages for learning algorithms, including reducing the cost of measurement and storage requirements, shortening the training time, avoiding the curse of dimensionality, and reducing overfitting (Bermingham et al., 2015; Hastie et al., 2017; Sun et al., 2012).

Feature selection methods are generally categorized along two axes: label information and search strategy (R. Zhang, Nie, Li, & Wei, 2019). Concerning label information, feature selection methods fall into three classes: supervised, unsupervised, and semi-supervised (Li et al., 2014). Feature selection methods can be applied to: 1) single-label (SL) data, where each sample corresponds to only one class label, and 2) multi-label (ML) data, in which each sample belongs to several class labels and the feature relevance is calculated based on the correlation between features and labels (Kashef, Nezamabadi-pour, & Nikpour, 2018b; J. Li et al., 2017). Zare et al. (2019) proposed a supervised feature selection algorithm using matrix factorization and Singular Value Decomposition (SVD) for microarray datasets. Unsupervised methods do not require labels and usually use the relationships between features. These methods are commonly used in clustering. Tang et al. (2018) proposed an unsupervised feature selection framework via feature self-representation and robust graph regularization to reduce the sensitivity to outliers. Semi-supervised methods combine the two previous approaches and apply when only some of the training instances have a label (Kashef et al., 2018b). Coelho et al. (2019) presented a new relevance index based on mutual information, computed from both labeled and unlabeled data, for semi-supervised feature selection.

Feature selection methods are also divided into three categories based on search strategy: filter, wrapper, and embedded (Pereira et al., 2016). Filter methods are applied before classification and are independent of the learning algorithm. In these methods, the features are ranked based on specific measures such as ReliefF (Reyes et al., 2015), Mutual Information (MI) (Lee & Kim, 2015), Information Interest (Li et al., 2017), and Symmetric Uncertainty (SU) (Kashef et al., 2018). Wang, Cang, & Yu (2019) presented a filter-based feature selection method named mRMJR-KCCA, which selects the feature with the highest relevance to the target class labels and simultaneously minimizes the redundancy between the candidate feature and the already selected features.

Wrapper methods, on the other hand, use the learning algorithm in the feature selection process (Dowlatshahi et al., 2017). These methods search for a subset of features and then evaluate the selected features until some stopping criterion is satisfied. Although they usually give better results than filter methods, they have high computational complexity. Wrapper methods are classified into sequential selection algorithms and heuristic search algorithms (Chandrashekar and Sahin, 2014; Rafsanjani and Dowlatshahi, 2012; Rafsanjani et al., 2015). Sequential selection algorithms start with an empty set (or the full set) and add features (or remove features) until the maximum of the objective function is obtained. Heuristic search algorithms evaluate different subsets to optimize the objective function. Different subsets are generated either by searching around in a search space or by generating solutions to the optimization problem. The class of heuristic search algorithms includes, but is not restricted to, Ant Colony Optimization (ACO) (Dowlatshahi and Derhami, 2017) and the Gravitational Search Algorithm (GSA) (Dowlatshahi et al., 2014; Dowlatshahi and Nezamabadi-Pour, 2014; Dowlatshahi and Rezaeian, 2016).

Finally, embedded methods use the strengths of the two previous approaches. They embed the feature selection process in learning, but they are more efficient than wrapper methods since they do not evaluate the feature sets iteratively (Cai et al., 2018). Dowlatshahi et al. (2018) proposed a hybrid filter/wrapper feature selection algorithm for increasing cancer classification accuracy. They first ranked the miRNAs with multiple filter algorithms and then used Competitive Swarm Optimization (CSO) to find an optimal subset. Dowlatshahi et al. (2017) proposed a hybrid filter-wrapper algorithm called Ensemble of Filter-based Rankers to guide an Epsilon-greedy Swarm Optimizer (EFR-ESO). In EFR-ESO, the knowledge about feature importance obtained from the ensemble of filter-based rankers is used to weight the feature probabilities in the ESO. J. Zhang et al. (2019) developed a hybrid filter/wrapper algorithm for feature selection that used a distance-based evaluation function for the filtering part and a weighted bootstrapping search strategy to find a set of candidate feature subsets as the wrapper part. Zhu & Yang (2018) proposed an embedded unsupervised feature selection method that maximized the distances between samples from different clusters and ranked the features by a sparse learning strategy.

Moradi and Rostami (2015) proposed a graph-theoretic approach for unsupervised feature selection. In this work, the entire feature set is considered as a weighted graph. Then, the features are divided into several clusters using a community detection algorithm, and a novel iterative search strategy based on node centrality is employed to select the final subset of features. Lv et al. (2019) proposed a new centrality measure for ranking the nodes and time layers of temporal networks simultaneously, referred to as the f-PageRank centrality. Herrmann et al. (2017) suggested a method for a local optima network that used PageRank centrality to predict the success rate and average fitness achieved by local search-based metaheuristics.

Henni et al. (2018) presented an unsupervised graph-based feature selection (UGFS) method via subspace learning and PageRank centrality. UGFS investigates the importance of features in a graph using PageRank. The graph nodes represent the features, and subspace preference clusters are used to define the edges linking features. For each data point, the algorithm scans the whole dataset searching the neighborhood and computes the variance among these sets. According to the computed variance and a predefined threshold, the algorithm selects subspace preference clusters for each data point. The features belonging to the same subspace preference clusters, associated with the neighborhood of the point, are linked. Thus, an incomplete graph of the feature space is composed, in which the edges have no weights. Finally, the feature score vector is calculated by the PageRank algorithm.

In this paper, we generalize the idea of the UGFS method, which applies PageRank to the feature space as a graph, to multi-label data. In our proposed method, unlike UGFS, the feature space is considered as a complete graph. In the UGFS method, links between nodes are unweighted, and the PageRank algorithm assigns a score to each node based on the number of links connected to it. In a fully connected graph every node has the same number of links, so the UGFS way of defining the graph edges and ranking the features with unweighted PageRank cannot be used in our approach. Instead, we calculate the correlation distance between each feature and the labels, and we take the Euclidean distance between features, computed from these correlation distances, as the weight of the edge between them.

In this article, we have presented a fast method for multi-label feature selection which uses a filter strategy. In this method, we model the entire feature set in a complete weighted graph, with the well-known weighted PageRank algorithm evaluating this graph.

A general overview of the proposed method is as follows:

• The number of selected features can be determined by the user.
• The correlation distance between features and labels is considered to select the associated features and eliminate unrelated features.
• The algorithm is directly applied to the multi-label data, and there is no problem transformation in it.
• The results on different datasets, in comparison with other methods, indicate the superiority of the proposed method in the classification criteria and run-time.

The structure of this paper is organized as follows. Section 2 outlines the background of multi-label feature selection works, and section 3 briefly summarizes the multi-label learning and PageRank algorithm. The proposed method is presented in Section 4 and the experimental results are given in Section 5. Finally, Section 6 concludes the paper.

2 Background

Multi-label feature selection methods follow one of two classification strategies. The first is called binary transformation, where the multi-label data are transformed into multiple single-label datasets and single-label classifiers are applied to each of them. The second is called algorithm adaptation, where the algorithm is applied directly to the multi-label data (Kashef, Nezamabadi-pour, & Nikpour, 2018a; Kashef et al., 2018b; Pereira, Plastino, Zadrozny, & Merschmann, 2018).

Recently, many efforts have been made to select a subset of features for multi-label data. Zhu et al. (2018) provided an embedded feature selection method for categorizing multi-label data with missing labels. Missing labels are restored through linear regression, and discriminant features are selected by the effective (0 [...] Doquire and Verleysen (2011) introduced a pruned problem transformation (PPT) method which uses mutual information (Read et al., 2008), called PPT-MI, converting multi-label data into single-label data. Zhang and Duan (2019) presented a feature selection method for multi-label text based on feature importance. In this method, multi-label texts are transformed into single-label texts using label assignment, and category contribution (CC) is used to calculate the importance of each feature. A multi-label feature selection method called PMU, based on mutual information, was presented by Lee and Kim (2013). In this method, the best feature is selected by an incremental selection strategy that maximizes the mutual information between labels and features.

Huang et al. (2018) proposed a feature selection method for multi-label data called manifold-based constraint Laplacian score (MCLS), which uses manifold learning to transform the logical label space into a Euclidean label space, where the corresponding numerical labels constrain the similarity between instances. Sun et al. (2019) proposed a multi-label feature selection method based on mutual information and label correlation. Elsewhere, Jia Zhang et al. (2019) suggested a multi-label feature selection method with manifold regularization (MDFS). They captured the correlation of the features with labels locally and used an objective function involving [...]-norm regularization. Y. Li et al. (2018) presented a feature selection approach for multi-label learning based on kernelized fuzzy rough sets. They combined the kernelized information from the feature space and the label space linearly to achieve the lower approximation and construct a robust multi-label kernelized fuzzy rough set model. P. Zhang et al. (2019) proposed a new multi-label feature selection method based on label redundancy, called LRFS, which categorizes labels into independent and dependent labels and analyzes the differences between the two groups. Kashef & Nezamabadi-pour (2019) proposed a new feature selection method for multi-label data based on the Pareto-dominance concept. They used non-dominated sorting to find the front number of each feature and then used a clustering approach to consider the distribution of features.

3 Fundamental concepts

3.1 Multi-label classification

In multi-label data, each instance has a feature vector $x_i = (x_{i1}, \dots, x_{iM})$ with M features and a binary label vector $y_i = (y_{i1}, \dots, y_{iL})$ with L labels. The purpose of multi-label learning is to build, from the N training samples of a dataset, a model that can predict labels for new instances. Fig. 1 displays the structure of a multi-label dataset.

           X                        Y
                            Y1    Y2    ...   YL
X11   X12   ...   X1M       0     1     ...   0
X21   X22   ...   X2M       1     0     ...   0
...   ...   ...   ...       ...   ...   ...   ...
XN1   XN2   ...   XNM       0     1     ...   1

Fig. 1. The Structure of Multi-Label Data

3.2 PageRank algorithm

With the growing number of web pages, presenting high-quality pages that match the user's requirements is very difficult. Given that the web can be viewed as a large graph, page ranking algorithms have been presented as graph-based algorithms. PageRank is an algorithm that assigns scores to web pages and ranks websites and webpages (Wills, 2006). The ranking is based on the links between websites: the higher the number of links pointing to a site, the higher the PageRank of that site. Indeed, receiving links from other websites is important for increasing the PageRank of a website. The model behind this algorithm is a web surfer who follows the links between web pages and, after a couple of moves, gets bored and jumps to a random page. The PageRank of a page is associated with the probability that this random surfer visits it (Massucci & Docampo, 2019). Each of the links to a website is included in the PageRank calculation. Note that the number of links alone is not what matters; rather, these links should be valuable, valid, and of high quality to achieve a better score. The PageRank algorithm considers each webpage as a node of the graph, where the incoming and outgoing links between the pages are the edges between the nodes. The PageRank algorithm (Sen & Chaudhary, 2017) is as follows:



$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v} \tag{1}$$

where $B(u)$ is the set of pages that link to page $u$, $PR(u)$ represents the score that the PageRank algorithm assigns to page $u$, $N_v$ denotes the number of outlinks of page $v$, and $d$ is the damping factor, which is usually equal to 0.85. To implement this algorithm, all nodes initially start with a basic weight, usually $1/n$, where n is the total number of nodes (Sen & Chaudhary, 2017).
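The iteration described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `pagerank`, the toy link structure `toy_web`, the stopping tolerance, and the choice of the constant term `(1 - d)` (rather than a normalized variant such as `(1 - d) / n`) are our assumptions, not details taken from the paper.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / N_v) over pages v linking to u.

    out_links maps every page to the list of pages it links to.
    Assumes every page has at least one outlink (no dangling pages).
    """
    nodes = list(out_links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}  # basic starting weight 1/n

    # Invert the link structure once: in_links[u] lists pages pointing to u.
    in_links = {u: [] for u in nodes}
    for v, targets in out_links.items():
        for u in targets:
            in_links[u].append(v)

    for _ in range(max_iter):
        new = {
            u: (1 - d) + d * sum(pr[v] / len(out_links[v]) for v in in_links[u])
            for u in nodes
        }
        delta = max(abs(new[u] - pr[u]) for u in nodes)
        pr = new
        if delta < tol:  # scores stopped changing
            break
    return pr


# Hypothetical four-page web: C receives the most inlinks, D receives none.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
```

In this toy web, page D has no inlinks and therefore keeps only the constant term, while page C, which collects links from three pages, ends up with the highest score.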

3.3 Weighted PageRank algorithm

We now turn to an extended version of the PageRank algorithm, called Weighted PageRank, which was introduced in 2004. Unlike the original PageRank method, which evaluates each node based on the outlinks of each page, the weighted version also lets the inlinks affect the ranking (Sen & Chaudhary, 2017; Xing & Ghorbani, 2004). The difference between this algorithm and the original version is that the weights of the edges are also considered. The weighted PageRank formula is defined as (Luo, Gong, Hu, Duan, & Ma, 2016):

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)\, W(v, u)}{\sum_{k} W(v, k)} \tag{2}$$

where $PR(u)$ and $PR(v)$ represent the PageRank of vertices u and v, respectively, $B(u)$ is the set of nodes having edges to u, $W(v, u)$ denotes the weight of the edge between u and v, and $\sum_{k} W(v, k)$ is the sum of the weights on all edges from v. With this extended version of PageRank, it is possible to evaluate the nodes of a fully connected graph.
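As a hedged illustration of how the weighted variant behaves on a fully connected graph, the sketch below iterates the weighted update on a symmetric weight matrix. The function name `weighted_pagerank`, the toy matrix `W`, the tolerance, and the constant term `(1 - d)` are assumptions made for this example.

```python
def weighted_pagerank(W, d=0.85, tol=1e-10, max_iter=500):
    """Weighted PageRank on a graph given as a square weight matrix W.

    W[v][u] is the weight of the edge from v to u (0 on the diagonal).
    Each node v passes its score to its neighbours in proportion to
    W[v][u] divided by the total weight of the edges leaving v.
    """
    n = len(W)
    out_weight = [sum(row) for row in W]  # sum of weights on all edges from v
    pr = [1.0 / n] * n                    # basic starting weight 1/n
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(
                pr[v] * W[v][u] / out_weight[v]
                for v in range(n)
                if v != u and out_weight[v] > 0
            )
            for u in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if delta < tol:
            break
    return pr


# Hypothetical complete graph on three nodes; nodes 0 and 2 share a heavy edge.
W = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
pr = weighted_pagerank(W)
```

Although every node here has the same number of neighbours, the scores differ because the edge weights differ: nodes 0 and 2 score equally and above node 1. This is why a weighted ranking is needed once the graph is complete.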

Fig. 2. Graphical abstract of the proposed method. Flowchart: Training Set → Calculate Correlation Distance Between Features And Labels → CDM Matrix → Calculate Euclidean distance between features → EDM Matrix → Build A Complete Graph with EDM Matrix → Use weighted PR to assign features score → Scores → Sort the features according to scores in descending order → Feature Subset; Training Set and Test Set → ML-KNN → Final Evaluation → Classification Accuracy.

4 Proposed Method

In this section, we introduce the proposed algorithm for multi-label feature selection; Fig. 2 presents a graphical abstract of the method. As mentioned earlier, in multi-label data each instance corresponds to a set of labels. The usual way of selecting features in this type of data is to convert the data into multiple single-label datasets, perform single-label feature selection on each of them, and then combine the results into the final result. We take a different point of view: the algorithm is designed for multi-label data and is run directly on it. The proposed method is based on graph theory, and to rank the features we use the Weighted PageRank algorithm.

4.1. Motivation

As mentioned earlier, a typical multi-label feature selection problem is divided into several single-label problems, but this is time-consuming, and every class label needs a classifier. Our goal in this article is therefore to run the algorithm directly on multi-label data. Moreover, in multi-label data, the labels of each instance are correlated and are not independent of each other. Each feature has a similarity with the class labels, so if we consider the dataset as a graph, the features represent the nodes. We use the distances between features, derived from their correlation distances to the labels, as the edges of the graph. Hence, we need a way to evaluate the nodes of this graph. Accordingly, we designed a ranking system to determine the best features using the PageRank algorithm. In this work, a graph node is analogous to a web page, and the distance between two nodes of the graph is analogous to the weight on two-way links. The PageRank algorithm is used to find the PageRank (PR) of each feature in the dataset. Applying the PageRank method to calculate feature importance has several advantages, including simple implementation as well as fast convergence both theoretically and practically.

4.2. Proposed algorithm

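Multi-label data are available as the following matrices:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1L} \\ y_{21} & y_{22} & \cdots & y_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{NL} \end{bmatrix}$$

Fig. 3. Multi-label Data Structure

In the features matrix X, the rows are instances and the columns are features; in the labels matrix Y, the rows are instances and the columns are labels.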

In the case of multi-label feature selection, the best features are those that have the lowest correlation distance with the labels. To calculate the correlation distance between each feature and label, we use the following equation:

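$$CD(x, y) = 1 - \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\, \operatorname{var}(y)}} \tag{3}$$

where x and y denote a feature and a label, $\operatorname{cov}(x, y)$ is the covariance between feature x and label y, $\operatorname{var}(x)$ is the variance of feature x, and $\operatorname{var}(y)$ is the variance of label y. Using this equation, we calculate the correlation distance between all the features and labels and create a Correlation Distance Matrix (CDM) whose rows correspond to the features and whose columns correspond to the labels.

$$CDM = \begin{bmatrix} cd_{11} & cd_{12} & \cdots & cd_{1L} \\ cd_{21} & cd_{22} & \cdots & cd_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ cd_{M1} & cd_{M2} & \cdots & cd_{ML} \end{bmatrix}$$

Fig. 4. Correlation Matrix

where $cd_{ij}$ is the correlation distance between feature i and label j.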
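The construction of the CDM can be sketched as follows, assuming the correlation distance is one minus the Pearson correlation between a feature column and a label column. The helper names, the toy data, and the convention of returning 1.0 for a constant column (where the correlation is undefined) are assumptions for illustration only.

```python
import math


def correlation_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The common 1/n factors of covariance and variances cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # assumption: treat a constant column as uncorrelated
    return 1.0 - cov / math.sqrt(var_x * var_y)


def correlation_distance_matrix(X, Y):
    """Build the M x L matrix: rows are features, columns are labels."""
    features = list(zip(*X))  # columns of the N x M feature matrix
    labels = list(zip(*Y))    # columns of the N x L label matrix
    return [[correlation_distance(f, l) for l in labels] for f in features]


# Hypothetical data: 3 instances, 3 features, 1 label.
X = [[0.0, 1.0, 2.0],
     [0.0, 1.0, 2.0],
     [1.0, 0.0, 2.0]]
Y = [[0], [0], [1]]
cdm = correlation_distance_matrix(X, Y)
```

Here the first feature is identical to the label (distance close to 0), the second is its mirror image (distance close to 2), and the third is constant (distance 1.0 under the stated convention).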

Now, based on this matrix, our goal is to compare features, whereby we need a distance metric. In our proposed algorithm, this metric is the Euclidean distance between features based on the CDM.

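$$ED(p, q) = \sqrt{\sum_{l=1}^{L} \left(cd_{pl} - cd_{ql}\right)^2} \tag{4}$$

where $ED(p, q)$ represents the Euclidean distance between features p and q, $cd_{pl}$ is the entry of the CDM for feature p and label l, and L is the number of columns of the correlation distance matrix, i.e. the number of labels. According to this relation, the Euclidean distance between every pair of features is computed to obtain a Euclidean Distance Matrix (EDM). The rows and columns of this matrix represent the features.

$$EDM = \begin{bmatrix} ed_{11} & ed_{12} & \cdots & ed_{1M} \\ ed_{21} & ed_{22} & \cdots & ed_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ ed_{M1} & ed_{M2} & \cdots & ed_{MM} \end{bmatrix}$$

Fig. 5. Euclidean Distance Matrix

For example, $ed_{ij}$ is equal to the Euclidean distance between row i and row j of the CDM.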

We use this matrix to build a weighted complete graph called Feature-Label Graph (FLG). The nodes of this graph are the features and the weights of edges are the components of EDM.
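A minimal sketch of this step is given below: it turns a CDM into the EDM whose entries serve as edge weights of the complete graph. The function name and the toy CDM values are hypothetical and chosen only so that the distances are easy to check by hand.

```python
import math


def euclidean_distance_matrix(cdm):
    """Pairwise Euclidean distances between the rows (features) of the CDM."""
    m = len(cdm)
    return [
        [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(cdm[p], cdm[q])))
            for q in range(m)
        ]
        for p in range(m)
    ]


# Hypothetical CDM for 3 features and 2 labels.
toy_cdm = [
    [0.0, 1.0],
    [3.0, 5.0],
    [0.0, 1.0],
]
edm = euclidean_distance_matrix(toy_cdm)
# edm[p][q] is the weight of the edge between features p and q.
```

The resulting matrix is symmetric with a zero diagonal, so the graph is undirected. Two features with identical CDM rows (features 0 and 2 in the toy data) are joined by an edge of weight zero.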

To better understand the proposed algorithm, we use an example to illustrate its steps. To this aim, we consider a data set with 5 features, 3 labels, and 3 instances.

Fig. 6. A Sample Dataset

Now we calculate the CDM matrix as follows:

Fig. 7. Sample Correlation Distance Matrix

As mentioned above, the rows of this matrix correspond to the features, while the columns correspond to the labels. This means that CDM(2,1) is the correlation distance between feature 2 and label 1, calculated with Eq. (3).

Now we calculate the EDM matrix:

Fig. 8. Sample Euclidean Distance Matrix

For example, EDM(1,2) is the Euclidean distance between features 1 and 2, calculated with Eq. (4) from rows 1 and 2 of the CDM. We then take this matrix as the edge weights of the graph and the features as its nodes, and draw the corresponding graph.

Fig. 9. Sample FLG Graph (complete graph on the five features; each edge is weighted by the corresponding EDM entry)

Having built the FLG, we now turn to the main purpose, feature selection, for which we need a ranking system for the graph nodes. In our proposed algorithm, we have chosen the Weighted PageRank algorithm for this task. This algorithm assigns a score to each node based on the edges of that node. A higher score indicates that the node is more important, i.e. that it is a better feature.

Now we use the weighted PageRank algorithm to rank the features in the above example. To illustrate how the algorithm works, we calculate the PR of one feature in the first iteration with Eq. (2). The PR of this feature is 0.2868 in the first iteration. This procedure continues for a number of iterations until it converges. The resulting vector of the PageRank algorithm gives the score of each feature:

Fig. 10. Sample FLG

Based on the score of each feature, we sort the scores in descending order to obtain the ranking of the features. This yields a ranked vector of features, from which as many features as the user requests can be chosen. According to the FLG, the ranking of the features for the above example is as follows:

[ … 5 3 4]

Algorithm: MGFS - The proposed multi-label graph-based feature selection algorithm via PageRank centrality
Input: Multi-label dataset D with M features, N samples, and L labels
Output: Selected feature subset F
1. Calculate the Correlation Distance Matrix (CDM);
2. Calculate the Euclidean Distance Matrix (EDM) to use as the graph edge weights;
3. Build a graph with the features as nodes and the EDM entries as edge weights;
4. Use the weighted PageRank algorithm (WPR) to assign a score to every feature;
5. Sort the scores in descending order;
6. F = the top-ranked features, as many as the user requests;
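The six steps of the algorithm can be put together in a compact, self-contained sketch. It is not the authors' implementation: the function names `mgfs_scores` and `mgfs_select`, the toy dataset, the handling of constant columns, the stopping tolerance, and the constant term `(1 - d)` in the weighted PageRank update are all assumptions made for illustration.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation; 1.0 for constant columns (our assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(var_x * var_y)


def mgfs_scores(X, Y, d=0.85, tol=1e-10, max_iter=500):
    """Return one weighted-PageRank score per feature (column of X)."""
    features, labels = list(zip(*X)), list(zip(*Y))
    m = len(features)
    # Step 1: Correlation Distance Matrix (M x L).
    cdm = [[correlation_distance(f, l) for l in labels] for f in features]
    # Steps 2-3: Euclidean Distance Matrix (M x M) = weights of the complete graph.
    edm = [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(cdm[p], cdm[q]))) for q in range(m)]
        for p in range(m)
    ]
    # Step 4: weighted PageRank on the complete graph.
    out_weight = [sum(row) for row in edm]
    pr = [1.0 / m] * m
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(
                pr[v] * edm[v][u] / out_weight[v]
                for v in range(m)
                if v != u and out_weight[v] > 0
            )
            for u in range(m)
        ]
        delta = max(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if delta < tol:
            break
    return pr


def mgfs_select(X, Y, k):
    """Steps 5-6: sort features by score (descending) and keep the top k."""
    scores = mgfs_scores(X, Y)
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranking[:k]


# Hypothetical dataset: 4 instances, 3 features, 2 labels.
X = [[1, 0, 5], [2, 1, 3], [3, 0, 4], [4, 1, 1]]
Y = [[0, 1], [0, 0], [1, 0], [1, 1]]
selected = mgfs_select(X, Y, k=2)  # indices of the two top-ranked features
```

The function returns zero-based column indices, so the selected subset of the data can be obtained by keeping only those columns of X. Because the whole procedure is a filter, no classifier is trained at any point.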

5 Experimental studies

In this section, we evaluate the proposed method on 7 datasets. To assess the performance of the proposed method, we compare it with 6 multi-label feature selection methods: PMU (Lee & Kim, 2013), LRFS (P. Zhang et al., 2019), MDFS (Jia Zhang et al., 2019), PPT-MI (Doquire & Verleysen, 2011), MCLS (Huang et al., 2018), and Pareto-Cluster (Kashef & Nezamabadi-pour, 2019).

5.1 Datasets

In the experiments, 7 real multi-label datasets from various application domains have been used, obtained from the Mulan¹ and Meka² repositories. Table 1 summarizes the specifications of these datasets, including dataset name (Dataset), number of samples (Samples), number of features (Features), number of labels (Labels), feature type (Type), label cardinality (LC), which is the average number of labels associated with each instance as defined by Eq. 5, label density (LD), which is the cardinality normalized by |L| as defined by Eq. 6, and dataset domain (Domain).

Table 1. Description of the datasets used in the experiments

Dataset      Samples  Features  Labels  Type     LC     LD     Domain
Yeast        2417     130       14      Numeric  4.237  0.303  Biology
Corel5k      5000     499       374     Nominal  3.522  0.009  Image
LanguageLog  1460     1004      75      Nominal  1.18   0.208  Text
Enron        1702     1001      53      Nominal  3.378  0.064  Text
Image        2000     294       5       Numeric  1.236  0.247  Image
Scene        2407     294       6       Numeric  1.074  0.179  Image
Bibtex       7395     1836      159     Nominal  2.402  0.015  Text

$$LC(D) = \frac{1}{N} \sum_{i=1}^{N} |Y_i| \tag{5}$$

$$LD(D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i|}{|L|} \tag{6}$$
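As a small worked illustration of these two statistics, the sketch below computes label cardinality and label density from a binary label matrix. The function names and the three-instance toy matrix are hypothetical.

```python
def label_cardinality(Y):
    """Average number of labels per instance (rows of a 0/1 label matrix)."""
    return sum(sum(row) for row in Y) / len(Y)


def label_density(Y):
    """Label cardinality normalized by the number of labels."""
    return label_cardinality(Y) / len(Y[0])


# Hypothetical label matrix: 3 instances, 3 labels.
toy_Y = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
lc = label_cardinality(toy_Y)  # 4 labels over 3 instances
ld = label_density(toy_Y)      # cardinality divided by 3 labels
```

For this toy matrix the cardinality is 4/3 and the density is 4/9, which shows that density is simply cardinality rescaled to the interval [0, 1].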

5.2 Performance evaluation criteria

¹ http://mulan.sourceforge.net/datasets.html
² http://waikato.github.io/meka/datasets/

To evaluate the performance of the proposed method, we used two multi-label evaluation criteria, Hamming loss and Accuracy, together with the run-time of the algorithms. Let $T = \{(x_i, Y_i) \mid 1 \le i \le N\}$ be a test set, $Y_i \subseteq L$ be the actual label subset, and $Z_i$ be the predicted label set corresponding to $x_i$. The evaluation criteria are as follows (Zhang et al., 2014):

Hamming Loss: For every sample, Hamming loss computes the differences (∆) between the predicted labels and the actual labels and then averages the obtained differences over all samples of the dataset. A difference arises, for example, when a label is incorrectly assigned to an instance or when a true label is not predicted (Cherman et al., 2014).

$$\text{Hamming Loss} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \,\Delta\, Z_i|}{|L|} \tag{7}$$

where $\Delta$ is the symmetric difference between two sets.

Accuracy: This measure calculates the proportion of labels that have been correctly predicted among all true and predicted labels.

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \tag{8}$$
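Both criteria are straightforward to compute when the actual and predicted labels are stored as sets. The sketch below is illustrative: the function names, the toy label sets, and the convention of scoring an instance as 1.0 when both its actual and predicted sets are empty are our assumptions.

```python
def hamming_loss(true_sets, pred_sets, num_labels):
    """Average size of the symmetric difference, normalized by |L|."""
    total = sum(len(y ^ z) for y, z in zip(true_sets, pred_sets))
    return total / (len(true_sets) * num_labels)


def accuracy(true_sets, pred_sets):
    """Average ratio of correctly predicted labels to all true or predicted labels."""
    total = 0.0
    for y, z in zip(true_sets, pred_sets):
        union = y | z
        # Assumption: an instance with no true and no predicted labels counts as 1.0.
        total += len(y & z) / len(union) if union else 1.0
    return total / len(true_sets)


# Hypothetical test set with 2 instances and 4 possible labels (0..3).
actual = [{0, 1}, {2}]
predicted = [{0}, {2, 3}]
hl = hamming_loss(actual, predicted, num_labels=4)
acc = accuracy(actual, predicted)
```

In this toy case each instance has exactly one label in the symmetric difference, so the Hamming loss is 2 / (2 × 4) = 0.25, and each instance has one correct label out of two in the union, so the accuracy is 0.5. Lower Hamming loss and higher accuracy indicate better predictions.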

5.3 Results

The proposed method is compared with six filter-based multi-label feature selection methods: PMU (Lee & Kim, 2013), LRFS (P. Zhang et al., 2019), MDFS (Jia Zhang et al., 2019), PPT-MI (Doquire & Verleysen, 2011), MCLS (Huang et al., 2018), and Pareto-Cluster (Kashef & Nezamabadi-pour, 2019). For all methods, we set the parameter values according to the recommendations of the corresponding research.

The classification performance of the compared algorithms is evaluated with ML-KNN (Zhang & Zhou, 2007), a multi-label version of the well-known KNN algorithm, with the number of neighbors set to 10. For each test, 60% of the samples are chosen randomly for training, while the remaining 40% are used for testing. For each number of selected features in the interval [10:100], the results are averaged over 10 independent runs of every algorithm, so 100 independent runs of each algorithm are executed on each dataset. In the proposed method, the number of selected features is defined by the user. Figs. 11-31 show the comparison of the proposed algorithm with the other algorithms in terms of accuracy, Hamming loss, and run-time, respectively. Since some methods, such as PMU, have a long run-time, a logarithmic scale has been used for the Y-axis of the run-time plots. According to the obtained results, the proposed method has a significant advantage over the other methods in terms of the two classification criteria and has a significantly faster run-time.

Fig. 11. Accuracy (Yeast)
Fig. 12. Accuracy (LanguageLog)
Fig. 13. Accuracy (Scene)
Fig. 14. Accuracy (Enron)
Fig. 15. Accuracy (Image)
Fig. 16. Accuracy (Corel5k)
Fig. 17. Accuracy (Bibtex)
Fig. 18. Hamming Loss (Yeast)
Fig. 19. Hamming Loss (Scene)
Fig. 20. Hamming Loss (LanguageLog)
Fig. 21. Hamming Loss (Enron)
Fig. 22. Hamming Loss (Image)
Fig. 23. Hamming Loss (Corel5k)
Fig. 24. Hamming Loss (Bibtex)
Fig. 25. Run-time (Yeast)
Fig. 26. Run-time (LanguageLog)
Fig. 27. Run-time (Scene)
Fig. 28. Run-time (Enron)
Fig. 29. Run-time (Image)
Fig. 30. Run-time (Corel5k)
Fig. 31. Run-time (Bibtex)

We now compare the results of the different methods statistically. For this purpose, we applied the Friedman test (Kafadar & Sheskin, 1997) to the results of the methods, with the significance level set to 0.05. If the p-value obtained by the Friedman test is less than the significance level, we perform a post-hoc test for pairwise comparison of the methods according to Conover (2000). We first applied the test to the results of MGFS against the other methods in terms of accuracy and then in terms of Hamming loss. The results are shown in Tables 2-15; the last row of each of these tables gives the win/tie/loss results of MGFS against the other methods according to the Friedman test. The p-values obtained by the Friedman test are presented in Tables 16-18, and the overall comparison, in terms of win/tie/loss counts, is given in Table 19.
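As an illustration of the statistic behind this comparison, the following sketch computes the Friedman chi-square value and the mean rank of each method from a small table of scores. The scores are hypothetical toy values, not results from this paper, the function names are assumptions, and the formula is the standard one without a tie correction; the post-hoc pairwise procedure is not shown.

```python
def average_ranks(values):
    """Rank one block of scores: highest score gets rank 1, ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(blocks):
    """Return (chi-square statistic, mean ranks) for n blocks of k method scores."""
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for j, rank in enumerate(average_ranks(block)):
            rank_sums[j] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Hypothetical accuracies of three methods (columns) for four feature-subset sizes (rows).
toy_scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.90, 0.70, 0.75],
    [0.80, 0.70, 0.60],
]
chi2, mean_ranks = friedman_statistic(toy_scores)
```

For these toy values the statistic is 6.5 with mean ranks 1.0, 2.25, and 2.75. With three methods the statistic is compared against a chi-square distribution with two degrees of freedom, whose 0.05 critical value is about 5.99, so the null hypothesis of equal performance would be rejected here; for so few blocks the chi-square approximation is only rough.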

Table 2: Accuracy for Yeast dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.4586 | 0.4267 | 0.4399 | 0.4182 | 0.4461 | 0.4318 | 0.4398 |
| 20 | 0.4793 | 0.4597 | 0.4580 | 0.4389 | 0.4827 | 0.4640 | 0.4669 |
| 30 | 0.4903 | 0.4745 | 0.4740 | 0.4479 | 0.4840 | 0.4787 | 0.4898 |
| 40 | 0.4932 | 0.4856 | 0.4737 | 0.4655 | 0.4941 | 0.4934 | 0.4899 |
| 50 | 0.5036 | 0.4948 | 0.4806 | 0.4780 | 0.5019 | 0.4882 | 0.4981 |
| 60 | 0.5042 | 0.4932 | 0.4716 | 0.4812 | 0.5016 | 0.5003 | 0.5035 |
| 70 | 0.5053 | 0.4999 | 0.4736 | 0.4865 | 0.5010 | 0.5017 | 0.5080 |
| 80 | 0.5058 | 0.4948 | 0.4678 | 0.4962 | 0.5021 | 0.5017 | 0.5039 |
| 90 | 0.5086 | 0.4966 | 0.4702 | 0.4947 | 0.4977 | 0.5044 | 0.4951 |
| 100 | 0.5072 | 0.5057 | 0.4665 | 0.5082 | 0.5044 | 0.5026 | 0.4999 |
| win/tie/loss | 6/0/0 | + | + | + | + | + | + |

Table 3: Accuracy for Scene dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1424 | 0.1515 | 0.3069 | 0.0977 | 0.1172 | 0.0763 | 0.2994 |
| 20 | 0.2626 | 0.2552 | 0.3284 | 0.1314 | 0.2740 | 0.1396 | 0.3631 |
| 30 | 0.4259 | 0.3529 | 0.3432 | 0.2546 | 0.2973 | 0.2545 | 0.3611 |
| 40 | 0.4569 | 0.3884 | 0.3694 | 0.3157 | 0.3437 | 0.2845 | 0.3771 |
| 50 | 0.4722 | 0.4108 | 0.3575 | 0.3488 | 0.3752 | 0.3830 | 0.3883 |
| 60 | 0.4858 | 0.4175 | 0.3733 | 0.3786 | 0.3773 | 0.4227 | 0.4028 |
| 70 | 0.5089 | 0.4648 | 0.3875 | 0.3890 | 0.3996 | 0.3928 | 0.4306 |
| 80 | 0.5537 | 0.4907 | 0.4147 | 0.4539 | 0.4174 | 0.4877 | 0.4682 |
| 90 | 0.5709 | 0.5034 | 0.4576 | 0.5144 | 0.4477 | 0.5384 | 0.4648 |
| 100 | 0.5874 | 0.5110 | 0.4612 | 0.5329 | 0.4845 | 0.5414 | 0.5172 |
| win/tie/loss | 6/0/0 | + | + | + | + | + | + |

Table 4: Accuracy for LanguageLog dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.3547 | 0.3459 | 0.3524 | 0.3497 | 0.2004 | 0.2182 | 0.3798 |
| 20 | 0.3851 | 0.3456 | 0.3575 | 0.3454 | 0.2254 | 0.3093 | 0.3772 |
| 30 | 0.3773 | 0.3490 | 0.3487 | 0.3419 | 0.2242 | 0.3630 | 0.3783 |
| 40 | 0.3801 | 0.3694 | 0.3510 | 0.3443 | 0.2461 | 0.3722 | 0.3822 |
| 50 | 0.3790 | 0.3563 | 0.3466 | 0.3478 | 0.2787 | 0.3723 | 0.3797 |
| 60 | 0.3714 | 0.3740 | 0.3502 | 0.3529 | 0.3046 | 0.3716 | 0.3720 |
| 70 | 0.3868 | 0.3715 | 0.3541 | 0.3468 | 0.3455 | 0.3800 | 0.3816 |
| 80 | 0.3847 | 0.3871 | 0.3573 | 0.3465 | 0.3535 | 0.3789 | 0.3763 |
| 90 | 0.3792 | 0.3741 | 0.3465 | 0.3457 | 0.3522 | 0.3728 | 0.3798 |
| 100 | 0.3795 | 0.3713 | 0.3605 | 0.3458 | 0.3517 | 0.3738 | 0.3724 |
| win/tie/loss | 5/1/0 | + | + | + | + | + | = |

Table 5: Accuracy for Enron dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.2771 | 0.2511 | 0.2067 | 0.1929 | 0.2307 | 0.1882 | 0.3518 |
| 20 | 0.3588 | 0.3040 | 0.1911 | 0.1991 | 0.2472 | 0.2253 | 0.3767 |
| 30 | 0.3579 | 0.3274 | 0.2101 | 0.2080 | 0.2691 | 0.2522 | 0.3726 |
| 40 | 0.3689 | 0.3353 | 0.2154 | 0.2064 | 0.3105 | 0.2616 | 0.3705 |
| 50 | 0.3673 | 0.3374 | 0.2114 | 0.1693 | 0.2914 | 0.2822 | 0.3683 |
| 60 | 0.3701 | 0.3352 | 0.2248 | 0.1958 | 0.3014 | 0.2798 | 0.3761 |
| 70 | 0.3644 | 0.3514 | 0.2393 | 0.1732 | 0.3134 | 0.3168 | 0.3801 |
| 80 | 0.3735 | 0.3491 | 0.2499 | 0.1704 | 0.2886 | 0.3063 | 0.3719 |
| 90 | 0.3632 | 0.3545 | 0.2363 | 0.1886 | 0.3271 | 0.3000 | 0.3855 |
| 100 | 0.3774 | 0.3531 | 0.2550 | 0.2179 | 0.3316 | 0.3092 | 0.3515 |
| win/tie/loss | 5/1/0 | + | + | + | + | + | = |

Table 6: Accuracy for Corel5k dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0047 | 0.0001 | 0.0016 | 0.0006 | 0.0030 | 0.0020 | 0.0079 |
| 20 | 0.0077 | 0.0003 | 0.0011 | 0.0025 | 0.0054 | 0.0041 | 0.0109 |
| 30 | 0.0082 | 0.0003 | 0.0016 | 0.0038 | 0.0078 | 0.0043 | 0.0100 |
| 40 | 0.0067 | 0.0004 | 0.0023 | 0.0033 | 0.0057 | 0.0050 | 0.0124 |
| 50 | 0.0082 | 0.0003 | 0.0039 | 0.0030 | 0.0043 | 0.0060 | 0.0118 |
| 60 | 0.0074 | 0.0003 | 0.0058 | 0.0037 | 0.0067 | 0.0066 | 0.0102 |
| 70 | 0.0086 | 0.0002 | 0.0039 | 0.0035 | 0.0062 | 0.0061 | 0.0101 |
| 80 | 0.0071 | 0.0004 | 0.0027 | 0.0047 | 0.0056 | 0.0064 | 0.0114 |
| 90 | 0.0062 | 0.0008 | 0.0043 | 0.0041 | 0.0049 | 0.0074 | 0.0105 |
| 100 | 0.0064 | 0.0012 | 0.0060 | 0.0048 | 0.0055 | 0.0063 | 0.0104 |
| win/tie/loss | 5/0/1 | + | + | + | + | + | - |

Table 7: Accuracy for Bibtex dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0823 | 0.0048 | 0.0487 | 0.0671 | 0.0918 | 0.0031 | 0.1151 |
| 20 | 0.1072 | 0.0026 | 0.0627 | 0.0871 | 0.1487 | 0.0164 | 0.1422 |
| 30 | 0.1136 | 0.0045 | 0.0613 | 0.0950 | 0.1576 | 0.0128 | 0.1679 |
| 40 | 0.1530 | 0.0091 | 0.0632 | 0.0950 | 0.1607 | 0.0131 | 0.1667 |
| 50 | 0.1787 | 0.0153 | 0.0865 | 0.1020 | 0.1683 | 0.0171 | 0.1889 |
| 60 | 0.1903 | 0.0123 | 0.0666 | 0.1096 | 0.1809 | 0.0178 | 0.2006 |
| 70 | 0.2015 | 0.0178 | 0.0778 | 0.1143 | 0.1864 | 0.0292 | 0.2138 |
| 80 | 0.2117 | 0.0242 | 0.0635 | 0.1250 | 0.1891 | 0.0302 | 0.2166 |
| 90 | 0.2283 | 0.0257 | 0.0709 | 0.1295 | 0.1867 | 0.0355 | 0.2147 |
| 100 | 0.2293 | 0.0278 | 0.1118 | 0.1351 | 0.1884 | 0.0311 | 0.2200 |
| win/tie/loss | 4/2/0 | + | + | + | = | + | = |

Table 8: Accuracy for Image dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1574 | 0.2355 | 0.2539 | 0.1247 | 0.1551 | 0.0681 | 0.1912 |
| 20 | 0.1896 | 0.2931 | 0.3387 | 0.1796 | 0.1876 | 0.1385 | 0.2975 |
| 30 | 0.2538 | 0.3220 | 0.3555 | 0.3165 | 0.2982 | 0.1698 | 0.3143 |
| 40 | 0.3086 | 0.3540 | 0.3710 | 0.3361 | 0.2826 | 0.1992 | 0.3469 |
| 50 | 0.3806 | 0.3681 | 0.3715 | 0.3112 | 0.3576 | 0.2368 | 0.3847 |
| 60 | 0.4011 | 0.4032 | 0.3773 | 0.3612 | 0.3664 | 0.2517 | 0.4127 |
| 70 | 0.4252 | 0.3921 | 0.3391 | 0.3936 | 0.3904 | 0.2783 | 0.4234 |
| 80 | 0.4465 | 0.4076 | 0.4166 | 0.3753 | 0.3952 | 0.3117 | 0.4257 |
| 90 | 0.4317 | 0.4129 | 0.4317 | 0.3927 | 0.4080 | 0.3233 | 0.4410 |
| 100 | 0.4324 | 0.4232 | 0.4333 | 0.3934 | 0.3836 | 0.3306 | 0.4266 |
| win/tie/loss | 2/4/0 | = | = | = | + | + | = |

Table 9: Hamming loss for Yeast dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.2074 | 0.2184 | 0.2148 | 0.2195 | 0.2143 | 0.2183 | 0.2151 |
| 20 | 0.2041 | 0.2091 | 0.2108 | 0.2146 | 0.2045 | 0.2080 | 0.2065 |
| 30 | 0.2002 | 0.2075 | 0.2067 | 0.2099 | 0.2024 | 0.2060 | 0.2010 |
| 40 | 0.1991 | 0.2016 | 0.2067 | 0.2077 | 0.1985 | 0.2008 | 0.2002 |
| 50 | 0.1975 | 0.2008 | 0.2052 | 0.2037 | 0.1975 | 0.1998 | 0.1981 |
| 60 | 0.1970 | 0.2010 | 0.2044 | 0.2020 | 0.1989 | 0.1997 | 0.1972 |
| 70 | 0.1971 | 0.1978 | 0.2058 | 0.2022 | 0.1968 | 0.1967 | 0.1971 |
| 80 | 0.1962 | 0.2007 | 0.2076 | 0.1990 | 0.1980 | 0.1968 | 0.1987 |
| 90 | 0.1955 | 0.1983 | 0.2067 | 0.2004 | 0.1992 | 0.1974 | 0.1994 |
| 100 | 0.1964 | 0.1966 | 0.2078 | 0.1968 | 0.1964 | 0.1966 | 0.1981 |
| win/tie/loss | 5/1/0 | + | + | + | = | + | + |

Table 10: Hamming loss for Scene dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1674 | 0.1652 | 0.1479 | 0.1762 | 0.1677 | 0.1750 | 0.1485 |
| 20 | 0.1520 | 0.1532 | 0.1458 | 0.1741 | 0.1564 | 0.1694 | 0.1427 |
| 30 | 0.1361 | 0.1425 | 0.1423 | 0.1591 | 0.1488 | 0.1547 | 0.1409 |
| 40 | 0.1326 | 0.1387 | 0.1415 | 0.1481 | 0.1435 | 0.1540 | 0.1399 |
| 50 | 0.1291 | 0.1325 | 0.1420 | 0.1430 | 0.1417 | 0.1398 | 0.1383 |
| 60 | 0.1231 | 0.1300 | 0.1398 | 0.1413 | 0.1389 | 0.1329 | 0.1327 |
| 70 | 0.1190 | 0.1226 | 0.1379 | 0.1371 | 0.1367 | 0.1369 | 0.1297 |
| 80 | 0.1131 | 0.1196 | 0.1553 | 0.1285 | 0.1344 | 0.1217 | 0.1253 |
| 90 | 0.1111 | 0.1177 | 0.1308 | 0.1170 | 0.1281 | 0.1143 | 0.1247 |
| 100 | 0.1080 | 0.1168 | 0.1287 | 0.1149 | 0.1222 | 0.1149 | 0.1193 |
| win/tie/loss | 6/0/0 | + | + | + | + | + | + |

Table 11: Hamming loss for LanguageLog dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1797 | 0.1792 | 0.1802 | 0.1820 | 0.1952 | 0.1909 | 0.1693 |
| 20 | 0.1657 | 0.1740 | 0.1731 | 0.1832 | 0.1942 | 0.1789 | 0.1704 |
| 30 | 0.1664 | 0.1761 | 0.1756 | 0.1839 | 0.1941 | 0.1703 | 0.1679 |
| 40 | 0.1662 | 0.1717 | 0.1768 | 0.1835 | 0.1921 | 0.1680 | 0.1625 |
| 50 | 0.1660 | 0.1673 | 0.1785 | 0.1826 | 0.1904 | 0.1652 | 0.1641 |
| 60 | 0.1660 | 0.1682 | 0.1751 | 0.1809 | 0.1886 | 0.1647 | 0.1635 |
| 70 | 0.1637 | 0.1632 | 0.1748 | 0.1843 | 0.1847 | 0.1627 | 0.1619 |
| 80 | 0.1645 | 0.1585 | 0.1746 | 0.1840 | 0.1791 | 0.1611 | 0.1620 |
| 90 | 0.1656 | 0.1626 | 0.1770 | 0.1829 | 0.1793 | 0.1629 | 0.1599 |
| 100 | 0.1655 | 0.1593 | 0.1690 | 0.1842 | 0.1780 | 0.1625 | 0.1611 |
| win/tie/loss | 3/2/1 | = | + | + | + | = | - |

Table 12: Hamming loss for Enron dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0553 | 0.0542 | 0.0592 | 0.0598 | 0.0552 | 0.0592 | 0.0535 |
| 20 | 0.0529 | 0.0539 | 0.0584 | 0.0599 | 0.0550 | 0.0571 | 0.0522 |
| 30 | 0.0525 | 0.0528 | 0.0578 | 0.0596 | 0.0539 | 0.0560 | 0.0524 |
| 40 | 0.0515 | 0.0525 | 0.0573 | 0.0592 | 0.0538 | 0.0548 | 0.0527 |
| 50 | 0.0524 | 0.0520 | 0.0564 | 0.0594 | 0.0546 | 0.0546 | 0.0522 |
| 60 | 0.0514 | 0.0521 | 0.0561 | 0.0592 | 0.0531 | 0.0539 | 0.0516 |
| 70 | 0.0516 | 0.0516 | 0.0586 | 0.0584 | 0.0530 | 0.0531 | 0.0515 |
| 80 | 0.0509 | 0.0522 | 0.0557 | 0.0586 | 0.0532 | 0.0530 | 0.0515 |
| 90 | 0.0516 | 0.0521 | 0.0555 | 0.0588 | 0.0529 | 0.0526 | 0.0514 |
| 100 | 0.0510 | 0.0517 | 0.0557 | 0.0584 | 0.0521 | 0.0523 | 0.0516 |
| win/tie/loss | 4/2/0 | = | + | + | + | + | = |

Table 13: Hamming loss for Corel5k dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 |
| 20 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 30 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0095 | 0.0094 | 0.0094 |
| 40 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 50 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 60 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 70 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 80 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0095 | 0.0094 |
| 90 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 100 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| win/tie/loss | 2/4/0 | + | = | = | = | = | + |

Table 14: Hamming loss for Bibtex dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0140 | 0.0152 | 0.0143 | 0.0141 | 0.0141 | 0.0151 | 0.0137 |
| 20 | 0.0136 | 0.0150 | 0.0140 | 0.0141 | 0.0136 | 0.0151 | 0.0137 |
| 30 | 0.0135 | 0.0152 | 0.0143 | 0.0141 | 0.0135 | 0.0152 | 0.0136 |
| 40 | 0.0135 | 0.0149 | 0.0139 | 0.0140 | 0.0135 | 0.0150 | 0.0134 |
| 50 | 0.0135 | 0.0152 | 0.0141 | 0.0139 | 0.0134 | 0.0149 | 0.0132 |
| 60 | 0.0134 | 0.0150 | 0.0142 | 0.0137 | 0.0133 | 0.0150 | 0.0132 |
| 70 | 0.0133 | 0.0149 | 0.0144 | 0.0139 | 0.0133 | 0.0150 | 0.0131 |
| 80 | 0.0131 | 0.0152 | 0.0143 | 0.0137 | 0.0132 | 0.0150 | 0.0131 |
| 90 | 0.0129 | 0.0150 | 0.0140 | 0.0137 | 0.0133 | 0.0149 | 0.0130 |
| 100 | 0.0129 | 0.0149 | 0.0137 | 0.0137 | 0.0132 | 0.0150 | 0.0130 |
| win/tie/loss | 4/2/0 | + | + | + | = | + | = |

Table 15: Hamming loss for Image dataset

| Features | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.2310 | 0.2239 | 0.2222 | 0.2344 | 0.2291 | 0.2442 | 0.2258 |
| 20 | 0.2296 | 0.2081 | 0.2056 | 0.2326 | 0.2245 | 0.2395 | 0.2114 |
| 30 | 0.2232 | 0.2044 | 0.1972 | 0.2143 | 0.2093 | 0.2329 | 0.2057 |
| 40 | 0.2143 | 0.1986 | 0.1977 | 0.2074 | 0.2139 | 0.2308 | 0.1998 |
| 50 | 0.2037 | 0.1975 | 0.1968 | 0.2070 | 0.1969 | 0.2283 | 0.1951 |
| 60 | 0.1979 | 0.1928 | 0.1976 | 0.1999 | 0.1955 | 0.2219 | 0.1974 |
| 70 | 0.1910 | 0.1920 | 0.1930 | 0.1935 | 0.1960 | 0.2170 | 0.1846 |
| 80 | 0.1861 | 0.1901 | 0.1916 | 0.1937 | 0.1934 | 0.2145 | 0.1866 |
| 90 | 0.1871 | 0.1865 | 0.1878 | 0.1920 | 0.1918 | 0.2103 | 0.1849 |
| 100 | 0.1861 | 0.1857 | 0.1855 | 0.1910 | 0.1909 | 0.2118 | 0.1810 |
| win/tie/loss | 1/3/2 | - | = | = | = | + | - |

Table 16. The p-values obtained by the Friedman test in terms of accuracy

| Dataset | MGFS vs. PMU | MGFS vs. PPT-MI | MGFS vs. MCLS | MGFS vs. MDFS | MGFS vs. LRFS | MGFS vs. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Yeast | 0.0018 | 0.0377 | 0.0153 | 0.0018 | 0.0005 | 0.0056 |
| Scene | 0.0056 | 0.0056 | 0.0005 | 0.0005 | 0.0377 | 0.0377 |
| LanguageLog | 0.0153 | 0.0005 | 0.0056 | 0.0005 | 0.0018 | 1 |
| Enron | 0.0018 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0833 |
| Image | 1 | 0.0153 | 0.0005 | 0.0833 | 0.2987 | 0.1659 |
| Corel5k | 0.0005 | 0.0056 | 0.0153 | 0.0005 | 0.0005 | 0.0005 |
| Bibtex | 0.0005 | 0.4884 | 0.0005 | 0.0018 | 0.0005 | 0.0833 |

Table 17. The p-values obtained by the Friedman test in terms of Hamming loss

| Dataset | MGFS vs. PMU | MGFS vs. PPT-MI | MGFS vs. MCLS | MGFS vs. MDFS | MGFS vs. LRFS | MGFS vs. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Yeast | 0.0005 | 0.0833 | 0.0056 | 0.0005 | 0.0005 | 0.0377 |
| Scene | 0.0056 | 0.0018 | 0.0005 | 0.0005 | 0.0377 | 0.0377 |
| LanguageLog | 1 | 0.0005 | 0.2987 | 0.0005 | 0.0018 | 0.0377 |
| Enron | 0.0833 | 0.0056 | 0.0005 | 0.0005 | 0.0005 | 0.729 |
| Image | 0.0377 | 0.4884 | 0.0005 | 0.0833 | 0.0833 | 0.0018 |
| Corel5k | 0.0003 | 0.1659 | 0.4884 | 0.4884 | 0.729 | 0.0056 |
| Bibtex | 0.0005 | 0.4884 | 0.0005 | 0.0005 | 0.0005 | 1 |

Table 18. The p-values obtained by the Friedman test in terms of run-time

| Dataset | MGFS vs. PMU | MGFS vs. PPT-MI | MGFS vs. MCLS | MGFS vs. MDFS | MGFS vs. LRFS | MGFS vs. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Yeast | 0.0005 | 0.0056 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Scene | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| LanguageLog | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Enron | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Image | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Corel5k | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Bibtex | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |

Table 19. The win/tie/loss results of MGFS against the other methods in terms of accuracy, Hamming loss, and run-time based on the Friedman test

| Evaluation metric | MGFS vs. PMU | MGFS vs. PPT-MI | MGFS vs. MCLS | MGFS vs. MDFS | MGFS vs. LRFS | MGFS vs. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Accuracy | 6/1/0 | 6/1/0 | 7/0/0 | 6/1/0 | 6/1/0 | 2/4/1 |
| Hamming loss | 4/2/1 | 3/4/0 | 5/2/0 | 5/2/0 | 5/2/0 | 3/2/2 |
| Run-time | 7/0/0 | 5/0/2 | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 |
| Total | 17/3/1 | 14/5/2 | 19/2/0 | 18/3/0 | 18/3/0 | 12/6/3 |
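The Total row of Table 19 is the element-wise sum of the accuracy, Hamming loss, and run-time rows over the seven datasets. The short sketch below recomputes it from the per-metric counts reported in the table; the variable and function names are illustrative only.

```python
# Win/tie/loss counts of MGFS against each method, copied from Table 19.
accuracy = {"PMU": (6, 1, 0), "PPT-MI": (6, 1, 0), "MCLS": (7, 0, 0),
            "MDFS": (6, 1, 0), "LRFS": (6, 1, 0), "Pareto-Cluster": (2, 4, 1)}
hamming_loss = {"PMU": (4, 2, 1), "PPT-MI": (3, 4, 0), "MCLS": (5, 2, 0),
                "MDFS": (5, 2, 0), "LRFS": (5, 2, 0), "Pareto-Cluster": (3, 2, 2)}
run_time = {"PMU": (7, 0, 0), "PPT-MI": (5, 0, 2), "MCLS": (7, 0, 0),
            "MDFS": (7, 0, 0), "LRFS": (7, 0, 0), "Pareto-Cluster": (7, 0, 0)}


def add_counts(*triples):
    """Element-wise sum of (win, tie, loss) triples."""
    return tuple(sum(parts) for parts in zip(*triples))


totals = {method: add_counts(accuracy[method], hamming_loss[method], run_time[method])
          for method in accuracy}
```

Each total sums to 21 comparisons (seven datasets times three criteria), which gives a quick consistency check on the table.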

5 Discussion

According to the results, the proposed method outperforms the compared methods in most cases. Based on Figs. 25-31, the MGFS algorithm has a short run-time: on five datasets (Yeast, Scene, Corel5k, LanguageLog, and Bibtex) it was faster than all the other methods, while on the other two datasets (Enron and Image) the difference from the PPT-MI method was very minor. The method is also very fast on high-dimensional data, where the difference in speed from the other methods is substantial. Speed on high-dimensional data is an important challenge today, and we paid special attention to it in the proposed method. The short run-time is consistent with the low computational complexity of the algorithm. Furthermore, in addition to its speed, the proposed method achieves better classification accuracy and lower error than most of the compared methods. To show this, we report the results for different numbers of selected features in Figs. 11-24, and we used the Friedman test to compare the results of the different methods statistically, with the results presented in Tables 2 to 15.

In the proposed algorithm, we use the correlation distance, and for each feature we calculate the Euclidean distance based on this correlation distance. A feature is assigned a higher priority if it is farther from the other features; that is, we give priority to unique features. The logic behind this is that features that are close to other features tend to have much in common with them and are very similar, whereas we are looking for features that carry unique information compared with the other features. As can be seen in Fig. 8, which shows an EDM matrix, feature 5 has the maximum Euclidean distance from the other features, so the PageRank algorithm assigns the highest rank to this feature, while the remaining features are ranked by the same procedure. In the end, the user can select as many of the top-ranked features as desired.
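The ranking logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it takes a precomputed feature-by-label correlation distance matrix (called `cdm` here), builds the Euclidean distance matrix (EDM) between feature rows, treats the EDM as the weights of a complete undirected graph, and runs a weighted PageRank by power iteration. The damping factor of 0.85, the fixed iteration count, and all function names are assumptions.

```python
import math


def euclidean_distance_matrix(cdm):
    """Pairwise Euclidean distances between the rows (features) of cdm."""
    n = len(cdm)
    edm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(cdm[i], cdm[j])))
            edm[i][j] = edm[j][i] = dist
    return edm


def weighted_pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a graph with symmetric non-negative edge weights."""
    n = len(weights)
    strength = [sum(row) for row in weights]  # total edge weight of each node
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Node j passes its score to i in proportion to the weight of edge (j, i).
            # Nodes with zero strength pass nothing on (a simplification of this sketch).
            incoming = sum(weights[j][i] / strength[j] * scores[j]
                           for j in range(n) if strength[j] > 0)
            updated.append((1.0 - damping) / n + damping * incoming)
        scores = updated
    return scores


def rank_features(cdm):
    """Feature indices ordered from highest to lowest PageRank score."""
    scores = weighted_pagerank(euclidean_distance_matrix(cdm))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

For a toy matrix such as `[[0.1, 0.2], [0.1, 0.3], [0.9, 0.9]]`, the third feature is far from the other two and is ranked first, which mirrors the behaviour described for feature 5 in Fig. 8. Selecting the top-ranked features then amounts to taking the first entries of the list returned by `rank_features`.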

6 Conclusions

In this work, we proposed the MGFS method, which uses a multi-label graph-based approach and the Google PageRank algorithm to select the best feature subset. Unlike single-label methods, this method is designed for multi-label data. In this method, we computed the correlation distance between features and labels as a matrix and used this matrix to compare the features. Next, the PageRank algorithm ranked the features to find the best subset of features, where the user can choose the size of the feature subset. We compared this method with six similar methods on seven datasets and observed that it is superior to them in terms of accuracy and classification error in most cases. Further, in terms of execution time, it was faster than the other methods on most datasets, as it has low computational complexity. In future work, we intend to continue using graph-based models in the feature selection process, with our main focus on multi-label data. We also seek to use other centrality measures and evaluate them against PageRank in the proposed model.

References

Bermingham, M., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., … Haley, C. (2015). Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Scientific Reports, 5, 10312. https://doi.org/10.1038/srep10312

Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: A new perspective. Neurocomputing, 300, 70–79. https://doi.org/10.1016/j.neucom.2017.11.077

Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.

Coakley, C. W., & Conover, W. J. (2000). Practical Nonparametric Statistics. Journal of the American Statistical Association, 95(449), 332. https://doi.org/10.2307/2669565

Coelho, F., Castro, C., Braga, A. P., & Verleysen, M. (2019). Semi-supervised relevance index for feature selection. Neural Computing and Applications, 31, 989–997. https://doi.org/10.1007/s00521-017-3062-0

Dowlatshahi, M. B., & Derhami, V. (2017). Winner Determination in Combinatorial Auctions using Hybrid Ant Colony Optimization and Multi-Neighborhood Local Search. Journal of AI and Data Mining, 5(2), 169-181.

Dowlatshahi, M., Derhami, V., & Nezamabadi-pour, H. (2017). Ensemble of filter-based rankers to guide an epsilon-greedy swarm optimizer for high-dimensional feature subset selection. Information, 8(4), 152.

Dowlatshahi, M. B., Derhami, V., & Nezamabadi-pour, H. (2018). A novel three-stage filter-wrapper framework for miRNA subset selection in cancer classification. Informatics, 5(1). https://doi.org/10.3390/informatics5010013

Dowlatshahi, M. B., Nezamabadi-Pour, H., & Mashinchi, M. (2014). A discrete gravitational search algorithm for solving combinatorial optimization problems. Information Sciences, 258, 94-107.

Dowlatshahi, M. B., & Nezamabadi-Pour, H. (2014). GGSA: a grouping gravitational search algorithm for data clustering. Engineering Applications of Artificial Intelligence, 36, 114-121.

Dowlatshahi, M. B., & Rezaeian, M. (2016, March). Training spiking neurons with gravitational search algorithm for data classification. In 2016 1st Conference on Swarm Intelligence and Evolutionary Computation (CSIEC) (pp. 53-58). IEEE.

Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Math. Intell. https://doi.org/10.1007/BF02985802

Kafadar, K., & Sheskin, D. J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. The American Statistician, 51(4), 374. https://doi.org/10.2307/2685909

Kashef, S., & Nezamabadi-pour, H. (2019). A label-specific multi-label feature selection algorithm based on the Pareto dominance concept. Pattern Recognition, 88, 654–667. https://doi.org/10.1016/j.patcog.2018.12.020

Kashef, S., Nezamabadi-pour, H., & Nikpour, B. (2018a). FCBF3Rules: A feature selection method for multi-label datasets. 1–5. https://doi.org/10.1109/CSIEC.2018.8405419

Kashef, S., Nezamabadi-pour, H., & Nikpour, B. (2018b). Multilabel feature selection: A comprehensive review and guiding experiments. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8, e1240. https://doi.org/10.1002/widm.1240

Li, J., Cheng, K., Wang, S., Morstatter, F., P. Trevino, R., Tang, J., & Liu, H. (2017). Feature Selection: A Data Perspective. ACM Computing Surveys, 50. https://doi.org/10.1145/3136625

Li, Y., Lin, Y., Liu, J., Weng, W., Shi, Z., & Wu, S. (2018). Feature selection for multi-label learning based on kernelized fuzzy rough sets. Neurocomputing, 318, 271–286. https://doi.org/10.1016/j.neucom.2018.08.065

Luo, D., Gong, C., Hu, R., Duan, L., & Ma, S. (2016). Ensemble Enabled Weighted PageRank. eprint arXiv:1604.05462.

Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. C. (2018). Categorizing feature selection methods for multi-label classification. Artificial Intelligence Review, 49(1), 57–78. https://doi.org/10.1007/s10462-016-9516-4

Rafsanjani, M. K., & Dowlatshahi, M. B. (2012). Using gravitational search algorithm for finding near-optimal base station location in two-tiered WSNs. International Journal of Machine Learning and Computing, 2(4), 377.

Rafsanjani, M. K., Dowlatshahi, M. B., & Nezamabadi-Pour, H. (2015). Gravitational Search Algorithm to Solve the K-of-N Lifetime Problem in Two-Tiered WSNs. Iranian Journal of Mathematical Sciences and Informatics, 10(1), 81-93.

Sen, T., & Kumar Chaudhary, D. (2017). Contrastive study of Simple PageRank, HITS and Weighted PageRank algorithms: Review. 721–727. https://doi.org/10.1109/CONFLUENCE.2017.7943245

Sun, X., Liu, Y., Li, J., Zhu, J., Liu, X., & Chen, H. (2012). Using cooperative game theory to optimize the feature selection problem. Neurocomputing, 97, 86–93. https://doi.org/10.1016/j.neucom.2012.05.001

Tang, C., Zhu, X., Chen, J., Wang, P., Liu, X., & Tian, J. (2018). Robust graph regularized unsupervised feature selection. Expert Systems with Applications, 96, 64–76. https://doi.org/10.1016/j.eswa.2017.11.053

Wang, Y., Cang, S., & Yu, H. (2019). Mutual Information Inspired Feature Selection Using Kernel Canonical Correlation Analysis. Expert Systems with Applications: X, 100014. https://doi.org/10.1016/j.eswax.2019.100014

Xing, W., & Ghorbani, A. (2004). Weighted PageRank Algorithm. 305–314. https://doi.org/10.1109/DNSR.2004.1344743

Zare, M., Eftekhari, M., & Aghamollaei, G. (2019). Supervised feature selection via matrix factorization based on singular value decomposition. Chemometrics and Intelligent Laboratory Systems, 185, 105–113. https://doi.org/10.1016/j.chemolab.2019.01.003

Zhang, J., Luo, Z., Li, C., Zhou, C., & Li, S. (2019). Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognition, 95, 136–150. https://doi.org/10.1016/j.patcog.2019.06.003

Zhang, J., Xiong, Y., & Min, S. (2019). A new hybrid filter/wrapper algorithm for feature selection in classification. Analytica Chimica Acta. https://doi.org/10.1016/j.aca.2019.06.054

Zhang, P., Liu, G., & Gao, W. (2019). Distinguishing two types of labels for multi-label feature selection. Pattern Recognition, 95, 72–82. https://doi.org/10.1016/j.patcog.2019.06.004

Zhang, R., Nie, F., Li, X., & Wei, X. (2019). Feature selection with multi-view data: A survey. Information Fusion, 50, 158–167. https://doi.org/10.1016/j.inffus.2018.11.019

Zhu, Q. H., & Yang, Y. Bin. (2018). Discriminative embedded unsupervised feature selection. Pattern Recognition Letters, 112, 219–225. https://doi.org/10.1016/j.patrec.2018.07.018

CRediT author statement Amin Hashemi: Conceptualization, Methodology, Software, Original draft preparation. Mohammad Bagher Dowlatshahi: Supervision, Data curation, Writing, Investigation, Reviewing and Editing. Hossein Nezamabadi-pour: Supervision, Validation, Reviewing and Editing.

Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
