Journal Pre-proof
MGFS: A multi-label graph-based feature selection algorithm via PageRank centrality
Amin Hashemi (Conceptualization, Methodology, Software, Original draft preparation), Mohammad Bagher Dowlatshahi (Supervision, Data curation, Writing, Investigation, Reviewing and Editing), Hossein Nezamabadi-pour (Supervision, Validation, Reviewing and Editing)
PII: S0957-4174(19)30741-9
DOI: https://doi.org/10.1016/j.eswa.2019.113024
Reference: ESWA 113024
To appear in: Expert Systems With Applications
Received date: 9 July 2019
Revised date: 13 October 2019
Accepted date: 13 October 2019
Please cite this article as: Amin Hashemi, Mohammad Bagher Dowlatshahi, Hossein Nezamabadi-pour, MGFS: A multi-label graph-based feature selection algorithm via PageRank centrality, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.113024
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.
Highlights
We propose a fast algorithm for feature selection on multi-label data
Features that discriminate classes are linked to form an undirected weighted graph
Feature relationships are defined by their correlation distance with the labels
The PageRank algorithm ranks the features according to their importance in the weighted graph
The proposed multi-label graph-based method outperforms competing methods
MGFS: A multi-label graph-based feature selection algorithm via PageRank centrality
Amin Hashemi a, Mohammad Bagher Dowlatshahi b,*, Hossein Nezamabadi-pour c
a Department of Computer Engineering, Faculty of Engineering, Lorestan University, Khorramabad, Iran. [email protected]
b Department of Computer Engineering, Faculty of Engineering, Lorestan University, Khorramabad, Iran. [email protected]
c Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran. [email protected]
* Corresponding author
Abstract
In multi-label data, each instance is associated with a set of labels instead of a single label: in the data set, an instance is assigned 1 in the column of every label it belongs to and 0 in the column of every label it does not belong to. This type of data is usually high-dimensional, so many machine learning methods seek the best subset of features in order to reduce the dimensionality of the data and then build an acceptable classification model. In this paper, we design a fast algorithm for feature selection on multi-label data using the PageRank algorithm, an effective method for calculating the importance of web pages on the Internet. This algorithm, called multi-label graph-based feature selection (MGFS), first constructs an M×L matrix, called the Correlation Distance Matrix (CDM), where M is the number of features and L is the number of class labels. Then, MGFS creates a complete weighted graph, called the Feature-Label Graph (FLG), where each feature is a vertex and the weight of the edge between two vertices (features) is their Euclidean distance in the CDM. Finally, the importance of each graph vertex (feature) is estimated via the PageRank algorithm. In the proposed method, the number of selected features can be determined by the user. To assess the performance of the proposed algorithm, we compare it with several multi-label feature selection methods on several multi-label datasets with different dimensions. The results show the superiority of the proposed method in terms of the classification criteria and run-time.
Keywords: Multi-label feature selection, Correlation Distance Matrix, Feature-Label Graph, PageRank centrality.
1 Introduction
The growth of modern technologies, new computers, and different applications has generated large amounts of data at an outstanding speed. These data mostly include video, image, text, and audio, and often have high dimensions. Such high-dimensional data present many challenges for data analysis, decision making, classification, and prediction (Cai, Luo, Wang, & Yang, 2018).
On the other hand, such data usually have many redundant and irrelevant features, which can cause problems. To solve these problems, a subset of relevant and optimal features is chosen; this procedure is called feature selection. Feature selection has many advantages for learning algorithms, including reducing the cost of measurement and storage requirements, shortening the training time, avoiding the curse of dimensionality, and reducing overfitting (Bermingham et al., 2015; Hastie et al., 2017; Sun et al., 2012).
Feature selection methods are generally categorized in two ways: by label information and by search strategy (R. Zhang, Nie, Li, & Wei, 2019). Concerning the label information, feature selection methods can be categorized into three classes: supervised, unsupervised, and semi-supervised (Li et al., 2014). Feature selection methods can be applied to: 1) single-label (SL) data, where each sample corresponds to only one class label, and 2) multi-label (ML) data, in which each sample belongs to several class labels and the feature relevance is calculated based on the correlation between features and labels (Kashef, Nezamabadi-pour, & Nikpour, 2018b; J. Li et al., 2017). Zare et al. (2019) proposed a supervised feature selection algorithm using matrix factorization and Singular Value Decomposition (SVD) for microarray datasets. Unsupervised methods do not require labels and usually use the relationships between features. These methods are commonly used in clustering. Tang et al. (2018) proposed an unsupervised feature selection framework via feature self-representation and robust graph regularization, to reduce the sensitivity to outliers. Semi-supervised methods combine the two previous approaches, where only some of the training instances have a label (Kashef et al., 2018b). Coelho et al. (2019) presented a new relevance index based on mutual information, computed from labeled and unlabeled data, for semi-supervised feature selection.
Meanwhile, feature selection methods are also divided into three categories based on search strategies: filter, wrapper, and embedded (Pereira et al., 2016). Filter methods are applied before classification and are independent of the learning algorithm. In these methods, the features are ranked based on specific measures such as ReliefF (Reyes et al., 2015), Mutual Information (MI) (Lee & Kim, 2015), Information Interest (Li et al., 2017), and Symmetric Uncertainty (SU) (Kashef et al., 2018). Wang, Cang, & Yu (2019) presented a filter-based feature selection method named mRMJR-KCCA, which selects the feature with the highest relevance to the target class labels and simultaneously minimizes the redundancy between the feature candidate and the already selected features. On the other hand, wrapper methods use the learning algorithm in the feature selection process (Dowlatshahi et al., 2017). These methods search for a subset of features and then evaluate the selected features until some stopping criteria are satisfied. Although they usually achieve better results than filter methods, they have high computational complexity. Wrapper methods are classified into sequential selection algorithms and heuristic search algorithms (Chandrashekar and Sahin, 2014; Rafsanjani and Dowlatshahi, 2012; Rafsanjani et al., 2015). The sequential selection algorithms start with an empty set (full set) and add features (remove features) until the maximum of the objective function is obtained. The heuristic search algorithms evaluate different subsets to optimize the objective function. Different subsets are generated either by searching around in a search space or by generating solutions to the optimization problem. The class of heuristic search algorithms includes, but is not restricted to, Ant Colony Optimization (ACO) (Dowlatshahi and Derhami, 2017) and the Gravitational Search Algorithm (GSA) (Dowlatshahi et al., 2014; Dowlatshahi and Nezamabadi-Pour, 2014; Dowlatshahi and Rezaeian, 2016).
Finally, embedded methods use the strengths of the two previous approaches. They embed the feature selection process in learning, but they are more efficient than the wrapper methods since they do not evaluate the feature sets iteratively (Cai et al., 2018). Dowlatshahi et al. (2018) proposed a hybrid filter/wrapper feature selection algorithm for increasing cancer classification accuracy. They first rank the miRNAs with multiple filter algorithms and then use Competitive Swarm Optimization (CSO) to find an optimal subset. Dowlatshahi et al. (2017) proposed a hybrid filter-wrapper algorithm, called Ensemble of Filter-based Rankers to guide an Epsilon-greedy Swarm Optimizer (EFR-ESO). In the proposed EFR-ESO, the knowledge about feature importance obtained from the ensemble of filter-based rankers is used to weight the feature probabilities in the ESO. J. Zhang et al. (2019) developed a hybrid filter/wrapper algorithm for feature selection which uses a distance-based evaluation function for the filtering part and a weighted bootstrapping search strategy to find a set of candidate feature subsets as the wrapper part. Zhu & Yang (2018) proposed an embedded unsupervised feature selection method that maximizes the distances between samples from different clusters and ranks the features by a sparse learning strategy.
Moradi and Rostami (2015) proposed a graph-theoretic approach for unsupervised feature selection. In this work, the entire feature set is considered as a weighted graph. Then, the features are divided into several clusters using a community detection algorithm, and a novel iterative search strategy based on node centrality is employed to select the final subset of features. Lv et al. (2019) proposed a new centrality measure for ranking the nodes and time layers of temporal networks simultaneously, referred to as the f-PageRank centrality. Herrmann et al. (2017) suggested a method for a local optima network that used PageRank centrality to predict the success rate and average fitness achieved by local search-based metaheuristics.
Henni et al. (2018) presented an unsupervised graph-based feature selection (UGFS) method via subspace learning and PageRank centrality. UGFS investigates the importance of features in a graph using PageRank. The graph nodes represent the features, and subspace preference clusters are used to define the edges linking the features. For each data point, the algorithm scans the whole dataset searching for its neighborhood and computes the variance among these sets. According to the computed variance and a predefined threshold, the algorithm selects subspace preference clusters for each data point. The features belonging to the same subspace preference clusters, associated with the neighborhood of the point, are linked. So, an incomplete graph of the feature space is composed, where the edges have no weights. Finally, the feature score vector is calculated by the PageRank algorithm. In this paper, we generalize the idea of the UGFS method, which applies PageRank to the feature space as a graph, to multi-label data. In our proposed method, unlike UGFS, the feature space is considered as a complete graph. In the UGFS method, links between nodes are unweighted, and the PageRank algorithm assigns a score to each node based on the number of links connected to it. In our approach, since the graph is fully connected, the UGFS way of defining graph edges and ranking the features with unweighted PageRank cannot be used. Instead, we calculate the correlation distance between each feature and the labels, and then take the Euclidean distance between features, computed from these correlation distances, as the weight of the edge between them.
In this article, we have presented a fast method for multi-label feature selection which uses a filter strategy. In this method, we model the entire feature set in a complete weighted graph, with the well-known weighted PageRank algorithm evaluating this graph.
A general overview of the proposed method is as follows:
The number of selected features can be determined by the user.
The correlation distance between features and labels is considered to select the associated features and eliminate unrelated features.
The algorithm is directly applied to the multi-label data, and there is no problem transformation in it.
The results on different datasets, in comparison with other methods, indicate the superiority of the proposed method in terms of the classification criteria and run-time.
This paper is organized as follows. Section 2 outlines the background of multi-label feature selection works, and Section 3 briefly summarizes multi-label learning and the PageRank algorithm. The proposed method is presented in Section 4, and the experimental results are given in Section 5. Finally, Section 6 concludes the paper.
2 Background
Multi-label feature selection methods follow two classification strategies. The first is called binary transformation, where multi-label data are transformed into multiple single-label datasets and a single-label classifier is applied to each of them. The second is called algorithm adaptation, where the algorithm is applied directly to the multi-label data (Kashef, Nezamabadi-pour, & Nikpour, 2018a; Kashef et al., 2018b; Pereira, Plastino, Zadrozny, & Merschmann, 2018).
Recently, many efforts have been made to select a subset of features for multi-label data. Zhu et al. (2018) provided an embedded feature selection method for categorizing multi-label data with missing labels. Missing labels are restored through linear regression, and discriminant features are selected by the effective $\ell_{2,p}$-norm ($0 < p \le 1$) regularization. Doquire and Verleysen (2011) introduced a pruned problem transformation (PPT) method which uses mutual information (Read et al., 2008), called PPT-MI, converting multi-label data into single-label data. Zhang and Duan (2019) presented a feature selection method for multi-label text based on feature importance. In this method, multi-label texts are transformed into single-label texts using label assignment, and category contribution (CC) is used for calculating the importance of each feature. A multi-label feature selection method called PMU, based on mutual information, was presented by Lee and Kim (2013). In this method, the best feature is selected by an incremental selection strategy which maximizes the mutual information between labels and features.
Huang et al. (2018) proposed a feature selection method for multi-label data called manifold-based constraint Laplacian score (MCLS), which uses manifold learning to transform the logical label space into a Euclidean label space, where the corresponding numerical labels constrain the similarity between instances. Sun et al. (2019) proposed a multi-label feature selection method based on mutual information and label correlation. Elsewhere, Jia Zhang et al. (2019) suggested a multi-label feature selection method with manifold regularization (MDFS). They captured the correlation of the features with labels locally and used an objective function involving $\ell_{2,1}$-norm regularization. Y. Li et al. (2018) presented a feature selection approach for multi-label learning based on kernelized fuzzy rough sets. They combined the kernelized information from the feature space and label space linearly to achieve the lower approximation and construct a robust multi-label kernelized fuzzy rough set model. P. Zhang et al. (2019) proposed a new multi-label feature selection method based on label redundancy, called LRFS, which categorizes labels into independent labels and dependent labels and analyzes the differences between them. Kashef & Nezamabadi-pour (2019) proposed a new feature selection method for multi-label data based on the Pareto-dominance concept. They used non-dominated sorting to find the front number of each feature and then used a clustering approach to consider the distribution of features.
3 Fundamental concepts
3.1 Multi-label classification
In multi-label data, each instance has a feature vector $x = (x_1, \dots, x_M)$ with M features and a binary label vector $y = (y_1, \dots, y_L)$ with L labels. The purpose of multi-label learning is to develop a model from N training samples of a dataset which can predict labels for new instances. Fig. 1 displays the structure of a multi-label dataset.
            X                          Y
                               Y1    Y2    ...   YL
X11   X12   ...   X1M          0     1     ...   0
X21   X22   ...   X2M          1     0     ...   0
...   ...   ...   ...          ...   ...   ...   ...
XN1   XN2   ...   XNM          0     1     ...   1
Fig. 1. The Structure of Multi-Label Data
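As a small illustration of the structure in Fig. 1, the following Python sketch stores a multi-label dataset as a feature matrix X (N rows, M columns) and a binary label matrix Y (N rows, L columns). The numeric feature values and the helper name `label_sets` are hypothetical and only serve the example; the label rows follow the 0/1 pattern shown in the figure.

```python
# Toy multi-label dataset: N = 3 instances, M = 2 features, L = 3 labels.
# The feature values are made up for illustration.
X = [
    [0.5, 1.2],   # feature vector of instance 1
    [0.1, 0.7],   # feature vector of instance 2
    [0.9, 0.3],   # feature vector of instance 3
]
Y = [
    [0, 1, 0],    # instance 1 carries label 2 only
    [1, 0, 0],    # instance 2 carries label 1 only
    [0, 1, 1],    # instance 3 carries labels 2 and 3
]


def label_sets(Y):
    """Convert binary label rows into sets of 1-based label indices."""
    return [{j + 1 for j, bit in enumerate(row) if bit == 1} for row in Y]


sets = label_sets(Y)
print(sets)
```

The set form is convenient for the set-based evaluation criteria, while the matrix form is what the correlation computations operate on.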
3.2 PageRank algorithm
With the growing number of web pages, presenting high-quality pages that match the user requirements is very difficult. Since the web can be viewed as a large graph, page ranking algorithms have been presented as graph-based algorithms. PageRank is an algorithm which assigns scores to web pages and ranks websites and webpages (Wills, 2006). The ranking is based on the links that point to a website: the higher the number of links to a site, the higher the PageRank of that website. Indeed, receiving links from other websites is important to increase the PageRank of a website. The model behind this algorithm is a web surfer who follows the links between web pages and, after a couple of moves, gets bored and jumps to a random page. The PageRank of a page is associated with the probability that such a random surfer would choose it (Massucci & Docampo, 2019). Each of the links to a website is included in the PageRank calculation. Note that the quantity of links alone is not what matters; rather, these links should be valuable, valid, and of high quality to achieve a better score. The PageRank algorithm considers each webpage as a node of the graph, where the incoming and outgoing links between the pages are the edges between the nodes. The PageRank algorithm (Sen & Chaudhary, 2017) is as follows:
$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v} \qquad (1)$$
The sum runs over the set $B(u)$ of pages that link to page $u$, $PR(u)$ represents the score that the PageRank algorithm assigns to page $u$, $N_v$ denotes the number of outlinks of page $v$, and $d$ is the damping factor, which is usually equal to 0.85. To implement this algorithm, all nodes initially start with a basic weight, usually $1/n$, where n is the total number of nodes (Sen & Chaudhary, 2017).
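The iteration behind PageRank can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes the common form PR(u) = (1 - d) + d * sum of PR(v)/N_v over the pages v that link to u, a damping factor d = 0.85, and an initial weight of 1/n. The function name `pagerank` and the three-page toy graph are hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate the PageRank update on a directed graph.

    out_links maps every page to the list of pages it links to;
    every linked page must itself be a key of out_links.
    """
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # basic starting weight 1/n

    # B(u): the pages that link to u
    in_links = {p: [] for p in pages}
    for v, targets in out_links.items():
        for u in targets:
            in_links[u].append(v)

    for _ in range(iterations):
        pr = {
            u: (1 - d) + d * sum(pr[v] / len(out_links[v]) for v in in_links[u])
            for u in pages
        }
    return pr


# Hypothetical toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_web)
print(scores)
```

In this toy graph, page C receives links from both A and B and ends up with the highest score, while B, which is linked only from A, ends up with the lowest.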
3.3 Weighted PageRank algorithm
Now we elaborate on an expanded version of the PageRank algorithm called Weighted PageRank introduced in 2004; unlike the PageRank method, which evaluates each node based on the outlinks of each page, in its weighted version, its inlinks also affect the ranking system (Sen & Kumar Chaudhary, 2017; Xing & Ghorbani, 2004). The difference between this algorithm and the original version is that in this method the weight of the edges is also considered. The weighted PageRank formula is defined as (Luo, Gong, Hu, Duan, & Ma, 2016):
$$PR(u) = (1 - d) + d \sum_{v \in B(u)} PR(v)\,\frac{w(v,u)}{\sum_{k} w(v,k)} \qquad (2)$$
where $PR(u)$ and $PR(v)$ represent the PageRank of vertices u and v, respectively, $B(u)$ is the set of nodes having edges to u, $w(v,u)$ denotes the weight of the edge between u and v, and $\sum_{k} w(v,k)$ is the sum of the weights on all edges from v. With this extended version of PageRank, it is possible to evaluate the nodes of a fully connected graph.
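A weighted variant of the iteration can be sketched as follows. The code is an illustrative assumption rather than the authors' implementation: each node v passes its score to a neighbor u in proportion to the weight of the edge (v, u) relative to the total weight on all edges from v, with the same (1 - d) leading term and 1/n initialization as in unweighted PageRank. The function name `weighted_pagerank`, the tolerance, and the toy weight matrix are hypothetical.

```python
def weighted_pagerank(W, d=0.85, tol=1e-10, max_iter=1000):
    """Weighted PageRank on a graph given by a square weight matrix.

    W[v][u] is the weight of the edge from v to u (0 means no edge).
    For an undirected graph, W is symmetric.
    """
    n = len(W)
    out_weight = [sum(row) for row in W]     # total weight on edges from each node
    pr = [1.0 / n] * n                       # basic starting weight 1/n
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(
                pr[v] * W[v][u] / out_weight[v]
                for v in range(n) if W[v][u] > 0
            )
            for u in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, pr)) < tol:
            return new
        pr = new
    return pr


# Hypothetical complete graph on three nodes with symmetric edge weights.
W = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
wpr = weighted_pagerank(W)
print(wpr)
```

Because every node of a complete graph has the same number of neighbors, the edge weights are what differentiate the scores here: node 2, which carries the largest total edge weight, receives the highest score.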
Fig. 2. Graphical abstract of the proposed method. Flowchart boxes: Training Set; Calculate Correlation Distance Between Features And Labels; CDM Matrix; Calculate Euclidean distance between features; EDM Matrix; Build A Complete Graph with EDM Matrix; Use weighted PR to assign features score; Scores; Sort the features according to scores in descending order; ML-KNN; Final Evaluation; Classification Accuracy.
4 Proposed Method
In this section, we introduce the proposed algorithm for multi-label feature selection; Fig. 2 presents a graphical abstract of this method. As mentioned earlier, in multi-label data each instance corresponds to a set of labels. The usual way of selecting features in this type of data is to convert the data into multiple single-label datasets, perform single-label feature selection on each of them, and then combine the results into the final result. We take another point of view: the algorithm is designed for multi-label data and runs directly on it. The proposed method is based on graph theory, and we use the Weighted PageRank algorithm to rank the features.
4.1. Motivation
As mentioned earlier, typical multi-label feature selection divides the problem into several single-label problems, but this is time-consuming, and every class label needs a classifier. So, our goal in this article is an algorithm that runs directly on multi-label data. On the other hand, in labeled data, the labels of each instance are correlated and are not independent of each other. Each feature has a similarity with the class labels; so if we consider the dataset as a graph, the features represent the nodes. We use a distance between features, derived from their correlation distances with the labels, as the edge weights of the graph. Hence, we need a way to evaluate the nodes of this graph. Accordingly, we designed a ranking system to determine the best features using the PageRank algorithm. In this work, a graph node is analogous to a web page, and the distance between two nodes of the graph is analogous to the weight on a two-way link. The PageRank algorithm is used to find the PageRank (PR) of each feature in the dataset. Applying PageRank to calculate feature importance has some advantages, including simple implementation as well as fast convergence, both theoretically and practically.
4.2. Proposed algorithm
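Multi-label data are available as the following matrices:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1L} \\ y_{21} & y_{22} & \cdots & y_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{NL} \end{bmatrix}$$
Fig. 3. Multi-label Data Structure
In the features matrix X, the rows are instances and the columns are features; in the labels matrix Y, the rows are instances and the columns are labels.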
In the case of multi-label feature selection, the best features are those that have the lowest correlation distance with the labels. To calculate the correlation distance between each feature and label, we use the following equation:
$$d(x, y) = 1 - \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} \qquad (3)$$
where x and y are a feature and a label, $\mathrm{cov}(x, y)$ is the covariance between feature x and label y, $\mathrm{var}(x)$ represents the variance of feature x, and $\mathrm{var}(y)$ denotes the variance of label y. Using this, we calculate the correlation distance between all the features and labels and create a Correlation Distance Matrix (CDM) whose rows correspond to the features and whose columns correspond to the labels.
$$CDM = \begin{bmatrix} CDM(1,1) & CDM(1,2) & \cdots & CDM(1,L) \\ CDM(2,1) & CDM(2,2) & \cdots & CDM(2,L) \\ \vdots & \vdots & \ddots & \vdots \\ CDM(M,1) & CDM(M,2) & \cdots & CDM(M,L) \end{bmatrix}$$
Fig. 4. Correlation Matrix
where $CDM(i,j)$ is equal to the correlation distance between feature i and label j.
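The construction of the CDM can be sketched in plain Python. The sketch assumes that the correlation distance is one minus the Pearson correlation (covariance divided by the square root of the product of the variances); the function names, the handling of constant columns, and the toy data are hypothetical choices made for the example.

```python
import math


def correlation_distance(x, y):
    """Correlation distance, taken here as 1 - Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 1.0   # assumption: a constant column is treated as uncorrelated
    return 1.0 - cov / math.sqrt(var_x * var_y)


def correlation_distance_matrix(X, Y):
    """Return the M x L matrix with CDM[i][j] = distance(feature i, label j)."""
    features = list(zip(*X))   # columns of X
    labels = list(zip(*Y))     # columns of Y
    return [[correlation_distance(f, l) for l in labels] for f in features]


# Hypothetical toy data: 3 instances, 2 features, 2 labels.
X = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
Y = [[0, 1], [0, 0], [1, 0]]
cdm = correlation_distance_matrix(X, Y)
print(cdm)
```

Under this definition the distance lies between 0 (perfect positive correlation) and 2 (perfect negative correlation), so a feature that rises with a label gets a small entry in that label's column.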
Now, based on this matrix, our goal is to compare features, for which we need a distance metric. In our proposed algorithm, this metric is the Euclidean distance between features based on the CDM:
$$EDM(p, q) = \sqrt{\sum_{l=1}^{L} \big(CDM(p,l) - CDM(q,l)\big)^2} \qquad (4)$$
where $EDM(p,q)$ represents the Euclidean distance between features p and q, and L is the number of columns of the correlation distance matrix, i.e., the number of labels. According to this relation, the Euclidean distances between all pairs of features are computed to obtain a Euclidean Distance Matrix (EDM). The rows and columns of this matrix represent the features.
$$EDM = \begin{bmatrix} EDM(1,1) & EDM(1,2) & \cdots & EDM(1,M) \\ EDM(2,1) & EDM(2,2) & \cdots & EDM(2,M) \\ \vdots & \vdots & \ddots & \vdots \\ EDM(M,1) & EDM(M,2) & \cdots & EDM(M,M) \end{bmatrix}$$
Fig. 5. Euclidean Distance Matrix
For example, $EDM(i,j)$ is equal to the Euclidean distance between row i and row j of the CDM.
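Computing the EDM from a given CDM takes one line per pair of rows. The sketch below uses `math.dist` from the Python standard library (available from Python 3.8); the function name `euclidean_distance_matrix` and the toy CDM values are hypothetical.

```python
import math


def euclidean_distance_matrix(cdm):
    """Return the M x M matrix of Euclidean distances between CDM rows."""
    return [[math.dist(row_p, row_q) for row_q in cdm] for row_p in cdm]


# Hypothetical CDM with M = 3 features and L = 2 labels.
toy_cdm = [
    [0.0, 1.0],
    [0.0, 0.0],
    [3.0, 4.0],
]
edm = euclidean_distance_matrix(toy_cdm)
print(edm)
```

The result is symmetric with a zero diagonal, so it can be read directly as the weight matrix of a complete undirected graph without self-loops.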
We use this matrix to build a weighted complete graph called Feature-Label Graph (FLG). The nodes of this graph are the features and the weights of edges are the components of EDM.
To better understand the proposed algorithm, we use an example to illustrate its steps. To this aim, we consider a data set with 5 features, 3 labels, and 3 instances.
Fig. 6. A Sample Dataset
Now we calculate the CDM matrix as follows:
Fig. 7. Sample Correlation Distance Matrix
As mentioned above, the rows of this matrix correspond to the features, while the columns correspond to the labels. It means that CDM(2,1) is the correlation distance between feature 2 and label 1, calculated with Eq. 3.
Now we calculate the EDM matrix:
Fig. 8. Sample Euclidean Distance Matrix
For example, EDM(1,2) is the Euclidean distance between features 1 and 2, calculated with Eq. 4. We then take this matrix as the edge weights of the graph and the features as its nodes, and draw the corresponding graph.
Fig. 9. Sample FLG Graph (the nodes are the five features; each edge is weighted by the corresponding EDM entry)
Having developed the FLG, we now turn to the main purpose, feature selection, for which we need a ranking system for the graph nodes. In our proposed algorithm, we have chosen the Weighted PageRank algorithm for this task. This algorithm assigns a score to each node based on the edges of that node. A higher score indicates that the node is more important, i.e., that it is a better feature.
Now we use the weighted PageRank algorithm to rank the features in the above example. To illustrate how the algorithm works, we calculate the PR of one feature in iteration one using Eq. 2. The PR of this feature is 0.2868 in the first iteration. This procedure continues for some iterations until it converges. The resulting vector of the PageRank algorithm gives the score of each feature.
Fig. 10. Sample FLG
Based on the score of each feature, we sort the scores in descending order to obtain the ranking of the features. We then have a ranked vector of features, from which we can choose as many as the user requests. According to the FLG, the ranking of features for the above example is as follows:
[… 5 3 4]
Algorithm: MGFS - The proposed multi-label graph-based feature selection algorithm via PageRank centrality
Input: Multi-label dataset D with M features, N samples, and L labels
Output: Selected feature subset F
1. Calculate the Correlation Distance Matrix (CDM);
2. Calculate the Euclidean Distance Matrix (EDM), to be used as the graph edge weights;
3. Build a graph with the features as nodes and the EDM entries as edge weights;
4. Use the weighted PageRank algorithm (WPR) to assign a score to every feature;
5. Sort the scores in descending order;
6. F = the top-ranked features, as many as the user requests;
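The six steps above can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the correlation distance is taken as one minus the Pearson correlation, the weighted PageRank update uses a (1 - d) leading term with d = 0.85 and a 1/M initialization, and the function name `mgfs`, its parameters, and the toy data are hypothetical.

```python
import math


def mgfs(X, Y, k, d=0.85, tol=1e-10, max_iter=1000):
    """Rank the features of (X, Y) and return the top k (0-based) with all scores."""
    features = list(zip(*X))          # M feature columns
    labels = list(zip(*Y))            # L label columns

    def corr_dist(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
        var_x = sum((a - mx) ** 2 for a in x) / n
        var_y = sum((b - my) ** 2 for b in y) / n
        if var_x == 0 or var_y == 0:
            return 1.0                # assumption: constant column = uncorrelated
        return 1.0 - cov / math.sqrt(var_x * var_y)

    # Step 1: M x L Correlation Distance Matrix
    cdm = [[corr_dist(f, l) for l in labels] for f in features]
    # Steps 2-3: M x M Euclidean Distance Matrix, used as the edge weights
    # of a complete graph whose nodes are the features
    edm = [[math.dist(p, q) for q in cdm] for p in cdm]
    # Step 4: weighted PageRank on that graph
    m = len(edm)
    out_weight = [sum(row) for row in edm]
    pr = [1.0 / m] * m
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(
                pr[v] * edm[v][u] / out_weight[v]
                for v in range(m) if out_weight[v] > 0
            )
            for u in range(m)
        ]
        converged = max(abs(a - b) for a, b in zip(new, pr)) < tol
        pr = new
        if converged:
            break
    # Steps 5-6: sort the scores in descending order and keep the first k features
    ranking = sorted(range(m), key=lambda i: pr[i], reverse=True)
    return ranking[:k], pr


# Hypothetical toy data: 4 instances, 3 features, 2 labels.
X = [[1.0, 0.2, 5.0], [2.0, 0.4, 3.0], [3.0, 0.1, 4.0], [4.0, 0.3, 1.0]]
Y = [[0, 1], [0, 1], [1, 0], [1, 0]]
selected, scores = mgfs(X, Y, k=2)
print(selected, scores)
```

The sketch works on nested lists for clarity; with large M, the M x M distance matrix dominates the cost, so a vectorized array library would be the natural choice in practice.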
5 Experimental studies
In this section, we evaluate the proposed method on 7 datasets. To assess the performance of the proposed method, we compare it with 6 multi-label feature selection methods: PMU (Lee & Kim, 2013), LRFS (P. Zhang et al., 2019), MDFS (Jia Zhang et al., 2019), PPT-MI (Doquire & Verleysen, 2011), MCLS (Huang et al., 2018), and Pareto-Cluster (Kashef & Nezamabadi-pour, 2019).
5.1 Datasets
In the experiments, 7 real multi-label datasets with various applications have been used, obtained from the Mulan¹ and Meka² repositories. Table 1 summarizes the specifications of these datasets, including dataset name (Dataset), number of samples (Samples), number of features (Features), number of labels (Labels), feature type (Type), label cardinality (LC), which is the average number of labels associated with each instance as defined by Eq. 5, label density (LD), which is the cardinality normalized by |L| as defined by Eq. 6, and dataset domain (Domain).
Table 1. Description of the datasets used in the experiments
Dataset       Samples  Features  Labels  Type     LC     LD     Domain
Yeast         2417     130       14      Numeric  4.237  0.303  Biology
Corel5k       5000     499       374     Nominal  3.522  0.009  Image
LanguageLog   1460     1004      75      Nominal  1.18   0.208  Text
Enron         1702     1001      53      Nominal  3.378  0.064  Text
Image         2000     294       5       Numeric  1.236  0.247  Image
Scene         2407     294       6       Numeric  1.074  0.179  Image
Bibtex        7395     1836      159     Nominal  2.402  0.015  Text
$$LC(D) = \frac{1}{N} \sum_{i=1}^{N} |Y_i| \qquad (5)$$
$$LD(D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i|}{|L|} \qquad (6)$$
where N is the number of samples, $Y_i$ is the label set of instance i, and |L| is the number of labels.
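Label cardinality and label density are straightforward to compute from a binary label matrix. The sketch below is a minimal illustration; the function names and the toy label matrix are hypothetical.

```python
def label_cardinality(Y):
    """Average number of labels per instance (LC)."""
    return sum(sum(row) for row in Y) / len(Y)


def label_density(Y):
    """Label cardinality normalized by the number of labels (LD)."""
    return label_cardinality(Y) / len(Y[0])


# Hypothetical label matrix: 3 instances, 3 labels.
Y = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
lc = label_cardinality(Y)
ld = label_density(Y)
print(lc, ld)
```

For a dataset with a cardinality of 4.237 over 14 labels, the same normalization gives a density of about 0.303.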
5.2 Performance evaluation criteria
¹ http://mulan.sourceforge.net/datasets.html
² http://waikato.github.io/meka/datasets/
To evaluate the performance of the proposed method, we used two multi-label evaluation criteria, Hamming loss and Accuracy, together with the run-time of the algorithms. Let $T = \{(x_i, Y_i) : i = 1, \dots, n\}$ be a test set, $Y_i$ be the actual label subset, and $Z_i$ be the predicted label set corresponding to $x_i$. The evaluation criteria are as follows (Zhang et al., 2014):
Hamming Loss: For every sample, Hamming loss computes the difference (∆) between the predicted labels and the actual labels and then averages the obtained differences over all samples of the dataset. A difference arises, for example, when a label is incorrectly assigned to an instance or when a true label is not predicted (Cherman et al., 2014).
$$\text{Hamming Loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \,\Delta\, Z_i|}{|L|} \qquad (7)$$
where $\Delta$ is the symmetric difference between two sets.
Accuracy: This measure calculates the proportion of correctly predicted labels among all true and predicted labels.
$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \qquad (8)$$
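Both criteria map directly onto Python set operations. The following sketch assumes the standard example-based definitions: Hamming loss averages the size of the symmetric difference between true and predicted label sets, normalized by the number of labels, and accuracy averages the ratio of intersection to union. The function names, the convention for two empty label sets, and the toy predictions are hypothetical.

```python
def hamming_loss(true_sets, pred_sets, num_labels):
    """Average size of the symmetric difference, normalized by the label count."""
    total = sum(len(y ^ z) for y, z in zip(true_sets, pred_sets))
    return total / (len(true_sets) * num_labels)


def accuracy(true_sets, pred_sets):
    """Average ratio of correctly predicted labels to all true or predicted labels."""
    total = 0.0
    for y, z in zip(true_sets, pred_sets):
        union = y | z
        # assumption: two empty label sets count as a perfect match
        total += len(y & z) / len(union) if union else 1.0
    return total / len(true_sets)


# Hypothetical test set with 2 instances and 3 possible labels.
true_sets = [{1, 2}, {3}]
pred_sets = [{1}, {3}]
hl = hamming_loss(true_sets, pred_sets, num_labels=3)
acc = accuracy(true_sets, pred_sets)
print(hl, acc)
```

Lower Hamming loss and higher accuracy both indicate better predictions, which is why the two criteria are reported side by side.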
5.3 Results
The proposed method is compared with six multi-label feature selection methods. These methods include PMU (Lee & Kim, 2013), LRFS (P. Zhang et al., 2019), MDFS (Jia Zhang et al., 2019), PPT-MI (Doquire & Verleysen, 2011), MCLS (Huang et al., 2018), and Pareto-Cluster (Kashef & Nezamabadi-pour, 2019).
For all methods, we set the parameter values according to the recommendations of the corresponding research.
The classification performance of the compared algorithms is evaluated with ML-KNN (Zhang & Zhou, 2007), a multi-label version of the well-known KNN algorithm, with the number of neighbors set to 10. For each test, 60% of the samples are chosen randomly for training, while the remaining 40% are used for testing. The number of selected features ranges over the interval [10:100] in steps of 10. For each number of selected features, the results are averaged over 10 independent runs of every algorithm, so 100 independent runs of each algorithm are executed on each dataset. In the proposed method, the number of selected features is defined by the user. Figs. 11-31 show the comparison of the proposed algorithm with the other algorithms in terms of accuracy, Hamming loss, and run-time, respectively. Since some methods, such as PMU, have a long run-time, a logarithmic scale is used for the Y-axis. According to the obtained results, the proposed method has a significant advantage over the other methods in terms of the two classification criteria and has a significantly shorter run-time.
Fig. 11. Accuracy (Yeast)
Fig. 12. Accuracy (LanguageLog)
Fig. 13. Accuracy (Scene)
Fig. 14. Accuracy (Enron)
Fig. 15. Accuracy (Image)
Fig. 16. Accuracy (Corel5k)
Fig. 17. Accuracy (Bibtex)
Fig. 18. Hamming Loss (Yeast)
Fig. 19. Hamming Loss (Scene)
Fig. 20. Hamming Loss (LanguageLog)
Fig. 21. Hamming Loss (Enron)
Fig. 22. Hamming Loss (Image)
Fig. 23. Hamming Loss (Corel5k)
Fig. 24. Hamming Loss (Bibtex)
Fig. 25. Run-time (Yeast)
Fig. 26. Run-time (LanguageLog)
Fig. 27. Run-time (Scene)
Fig. 28. Run-time (Enron)
Fig. 29. Run-time (Image)
Fig. 30. Run-time (Corel5k)
Fig. 31. Run-time (Bibtex)
Now we compare the results of the different methods statistically. For this purpose, we applied the Friedman test (Kafadar & Sheskin, 1997) to the results of the methods. We set the significance level for the post-hoc test to 0.05. If the p-value obtained by the Friedman test is less than the significance level, we perform a pairwise comparison test according to Conover (2000). The last row of each table gives the win/tie/loss results of the MGFS method against the other methods. We first applied the test to the results of MGFS against the other methods in terms of accuracy and then in terms of Hamming loss. The results are shown in Tables 2-15. The p-values obtained by the Friedman test are presented in Tables 16-18, and the overall performance comparison, in terms of the average Friedman test, is given in Table 19.
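The ranking step behind the Friedman test can be sketched in plain Python. The sketch computes the average rank of each method over a set of blocks and the Friedman chi-square statistic without tie correction; it does not compute the p-value or the post-hoc comparison. The function names are hypothetical, the choice of blocks (one block per number of selected features) is an assumption made for the example, and the three rows are the Scene accuracy values of MGFS, PMU, and LRFS for 80, 90, and 100 selected features.

```python
def average_ranks(results, higher_is_better=True):
    """results[b][m] is the score of method m on block b; rank 1 is the best."""
    n_blocks, n_methods = len(results), len(results[0])
    totals = [0.0] * n_methods
    for row in results:
        order = sorted(range(n_methods), key=lambda m: row[m],
                       reverse=higher_is_better)
        ranks = [0.0] * n_methods
        i = 0
        while i < n_methods:
            j = i
            # extend j over a run of tied scores
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            tied_rank = (i + j) / 2 + 1      # mean of the 1-based positions i+1..j+1
            for t in range(i, j + 1):
                ranks[order[t]] = tied_rank
            i = j + 1
        totals = [t + r for t, r in zip(totals, ranks)]
    return [t / n_blocks for t in totals]


def friedman_statistic(results, higher_is_better=True):
    """Friedman chi-square statistic computed from the average ranks."""
    n, k = len(results), len(results[0])
    avg = average_ranks(results, higher_is_better)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)


# Columns: MGFS, PMU, LRFS; rows: 80, 90, 100 selected features (Scene accuracy).
scene_accuracy = [
    [0.5537, 0.4907, 0.4147],
    [0.5709, 0.5034, 0.4576],
    [0.5874, 0.5110, 0.4612],
]
ranks = average_ranks(scene_accuracy)
statistic = friedman_statistic(scene_accuracy)
print(ranks, statistic)
```

For Hamming loss, where lower values are better, the same functions can be called with `higher_is_better=False`.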
Table 2: accuracy for Yeast dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.4586 | 0.4267 | 0.4399 | 0.4182 | 0.4461 | 0.4318 | 0.4398 |
| 20 | 0.4793 | 0.4597 | 0.4580 | 0.4389 | 0.4827 | 0.4640 | 0.4669 |
| 30 | 0.4903 | 0.4745 | 0.4740 | 0.4479 | 0.4840 | 0.4787 | 0.4898 |
| 40 | 0.4932 | 0.4856 | 0.4737 | 0.4655 | 0.4941 | 0.4934 | 0.4899 |
| 50 | 0.5036 | 0.4948 | 0.4806 | 0.4780 | 0.5019 | 0.4882 | 0.4981 |
| 60 | 0.5042 | 0.4932 | 0.4716 | 0.4812 | 0.5016 | 0.5003 | 0.5035 |
| 70 | 0.5053 | 0.4999 | 0.4736 | 0.4865 | 0.5010 | 0.5017 | 0.5080 |
| 80 | 0.5058 | 0.4948 | 0.4678 | 0.4962 | 0.5021 | 0.5017 | 0.5039 |
| 90 | 0.5086 | 0.4966 | 0.4702 | 0.4947 | 0.4977 | 0.5044 | 0.4951 |
| 100 | 0.5072 | 0.5057 | 0.4665 | 0.5082 | 0.5044 | 0.5026 | 0.4999 |
| win/tie/loss | 6/0/0 | + | + | + | + | + | + |

Table 3: accuracy for Scene dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1424 | 0.1515 | 0.3069 | 0.0977 | 0.1172 | 0.0763 | 0.2994 |
| 20 | 0.2626 | 0.2552 | 0.3284 | 0.1314 | 0.2740 | 0.1396 | 0.3631 |
| 30 | 0.4259 | 0.3529 | 0.3432 | 0.2546 | 0.2973 | 0.2545 | 0.3611 |
| 40 | 0.4569 | 0.3884 | 0.3694 | 0.3157 | 0.3437 | 0.2845 | 0.3771 |
| 50 | 0.4722 | 0.4108 | 0.3575 | 0.3488 | 0.3752 | 0.3830 | 0.3883 |
| 60 | 0.4858 | 0.4175 | 0.3733 | 0.3786 | 0.3773 | 0.4227 | 0.4028 |
| 70 | 0.5089 | 0.4648 | 0.3875 | 0.3890 | 0.3996 | 0.3928 | 0.4306 |
| 80 | 0.5537 | 0.4907 | 0.4147 | 0.4539 | 0.4174 | 0.4877 | 0.4682 |
| 90 | 0.5709 | 0.5034 | 0.4576 | 0.5144 | 0.4477 | 0.5384 | 0.4648 |
| 100 | 0.5874 | 0.5110 | 0.4612 | 0.5329 | 0.4845 | 0.5414 | 0.5172 |
| win/tie/loss | 6/0/0 | + | + | + | + | + | + |
Table 4: accuracy for LanguageLog dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.3547 | 0.3459 | 0.3524 | 0.3497 | 0.2004 | 0.2182 | 0.3798 |
| 20 | 0.3851 | 0.3456 | 0.3575 | 0.3454 | 0.2254 | 0.3093 | 0.3772 |
| 30 | 0.3773 | 0.3490 | 0.3487 | 0.3419 | 0.2242 | 0.3630 | 0.3783 |
| 40 | 0.3801 | 0.3694 | 0.3510 | 0.3443 | 0.2461 | 0.3722 | 0.3822 |
| 50 | 0.3790 | 0.3563 | 0.3466 | 0.3478 | 0.2787 | 0.3723 | 0.3797 |
| 60 | 0.3714 | 0.3740 | 0.3502 | 0.3529 | 0.3046 | 0.3716 | 0.3720 |
| 70 | 0.3868 | 0.3715 | 0.3541 | 0.3468 | 0.3455 | 0.3800 | 0.3816 |
| 80 | 0.3847 | 0.3871 | 0.3573 | 0.3465 | 0.3535 | 0.3789 | 0.3763 |
| 90 | 0.3792 | 0.3741 | 0.3465 | 0.3457 | 0.3522 | 0.3728 | 0.3798 |
| 100 | 0.3795 | 0.3713 | 0.3605 | 0.3458 | 0.3517 | 0.3738 | 0.3724 |
| win/tie/loss | 5/1/0 | + | + | + | + | + | = |

Table 5: accuracy for Enron dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.2771 | 0.2511 | 0.2067 | 0.1929 | 0.2307 | 0.1882 | 0.3518 |
| 20 | 0.3588 | 0.3040 | 0.1911 | 0.1991 | 0.2472 | 0.2253 | 0.3767 |
| 30 | 0.3579 | 0.3274 | 0.2101 | 0.2080 | 0.2691 | 0.2522 | 0.3726 |
| 40 | 0.3689 | 0.3353 | 0.2154 | 0.2064 | 0.3105 | 0.2616 | 0.3705 |
| 50 | 0.3673 | 0.3374 | 0.2114 | 0.1693 | 0.2914 | 0.2822 | 0.3683 |
| 60 | 0.3701 | 0.3352 | 0.2248 | 0.1958 | 0.3014 | 0.2798 | 0.3761 |
| 70 | 0.3644 | 0.3514 | 0.2393 | 0.1732 | 0.3134 | 0.3168 | 0.3801 |
| 80 | 0.3735 | 0.3491 | 0.2499 | 0.1704 | 0.2886 | 0.3063 | 0.3719 |
| 90 | 0.3632 | 0.3545 | 0.2363 | 0.1886 | 0.3271 | 0.3000 | 0.3855 |
| 100 | 0.3774 | 0.3531 | 0.2550 | 0.2179 | 0.3316 | 0.3092 | 0.3515 |
| win/tie/loss | 5/1/0 | + | + | + | + | + | = |
Table 6: accuracy for Corel5k dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0047 | 0.0001 | 0.0016 | 0.0006 | 0.0030 | 0.0020 | 0.0079 |
| 20 | 0.0077 | 0.0003 | 0.0011 | 0.0025 | 0.0054 | 0.0041 | 0.0109 |
| 30 | 0.0082 | 0.0003 | 0.0016 | 0.0038 | 0.0078 | 0.0043 | 0.0100 |
| 40 | 0.0067 | 0.0004 | 0.0023 | 0.0033 | 0.0057 | 0.0050 | 0.0124 |
| 50 | 0.0082 | 0.0003 | 0.0039 | 0.0030 | 0.0043 | 0.0060 | 0.0118 |
| 60 | 0.0074 | 0.0003 | 0.0058 | 0.0037 | 0.0067 | 0.0066 | 0.0102 |
| 70 | 0.0086 | 0.0002 | 0.0039 | 0.0035 | 0.0062 | 0.0061 | 0.0101 |
| 80 | 0.0071 | 0.0004 | 0.0027 | 0.0047 | 0.0056 | 0.0064 | 0.0114 |
| 90 | 0.0062 | 0.0008 | 0.0043 | 0.0041 | 0.0049 | 0.0074 | 0.0105 |
| 100 | 0.0064 | 0.0012 | 0.0060 | 0.0048 | 0.0055 | 0.0063 | 0.0104 |
| win/tie/loss | 5/0/1 | + | + | + | + | + | - |

Table 7: accuracy for Bibtex dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0823 | 0.0048 | 0.0487 | 0.0671 | 0.0918 | 0.0031 | 0.1151 |
| 20 | 0.1072 | 0.0026 | 0.0627 | 0.0871 | 0.1487 | 0.0164 | 0.1422 |
| 30 | 0.1136 | 0.0045 | 0.0613 | 0.0950 | 0.1576 | 0.0128 | 0.1679 |
| 40 | 0.1530 | 0.0091 | 0.0632 | 0.0950 | 0.1607 | 0.0131 | 0.1667 |
| 50 | 0.1787 | 0.0153 | 0.0865 | 0.1020 | 0.1683 | 0.0171 | 0.1889 |
| 60 | 0.1903 | 0.0123 | 0.0666 | 0.1096 | 0.1809 | 0.0178 | 0.2006 |
| 70 | 0.2015 | 0.0178 | 0.0778 | 0.1143 | 0.1864 | 0.0292 | 0.2138 |
| 80 | 0.2117 | 0.0242 | 0.0635 | 0.1250 | 0.1891 | 0.0302 | 0.2166 |
| 90 | 0.2283 | 0.0257 | 0.0709 | 0.1295 | 0.1867 | 0.0355 | 0.2147 |
| 100 | 0.2293 | 0.0278 | 0.1118 | 0.1351 | 0.1884 | 0.0311 | 0.2200 |
| win/tie/loss | 4/2/0 | + | + | + | = | + | = |
Table 8: accuracy for Image dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1574 | 0.2355 | 0.2539 | 0.1247 | 0.1551 | 0.0681 | 0.1912 |
| 20 | 0.1896 | 0.2931 | 0.3387 | 0.1796 | 0.1876 | 0.1385 | 0.2975 |
| 30 | 0.2538 | 0.3220 | 0.3555 | 0.3165 | 0.2982 | 0.1698 | 0.3143 |
| 40 | 0.3086 | 0.3540 | 0.3710 | 0.3361 | 0.2826 | 0.1992 | 0.3469 |
| 50 | 0.3806 | 0.3681 | 0.3715 | 0.3112 | 0.3576 | 0.2368 | 0.3847 |
| 60 | 0.4011 | 0.4032 | 0.3773 | 0.3612 | 0.3664 | 0.2517 | 0.4127 |
| 70 | 0.4252 | 0.3921 | 0.3391 | 0.3936 | 0.3904 | 0.2783 | 0.4234 |
| 80 | 0.4465 | 0.4076 | 0.4166 | 0.3753 | 0.3952 | 0.3117 | 0.4257 |
| 90 | 0.4317 | 0.4129 | 0.4317 | 0.3927 | 0.4080 | 0.3233 | 0.4410 |
| 100 | 0.4324 | 0.4232 | 0.4333 | 0.3934 | 0.3836 | 0.3306 | 0.4266 |
| win/tie/loss | 2/4/0 | = | = | = | + | + | = |

Table 9: hamming loss for Yeast dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.2074 | 0.2184 | 0.2148 | 0.2195 | 0.2143 | 0.2183 | 0.2151 |
| 20 | 0.2041 | 0.2091 | 0.2108 | 0.2146 | 0.2045 | 0.2080 | 0.2065 |
| 30 | 0.2002 | 0.2075 | 0.2067 | 0.2099 | 0.2024 | 0.2060 | 0.2010 |
| 40 | 0.1991 | 0.2016 | 0.2067 | 0.2077 | 0.1985 | 0.2008 | 0.2002 |
| 50 | 0.1975 | 0.2008 | 0.2052 | 0.2037 | 0.1975 | 0.1998 | 0.1981 |
| 60 | 0.1970 | 0.2010 | 0.2044 | 0.2020 | 0.1989 | 0.1997 | 0.1972 |
| 70 | 0.1971 | 0.1978 | 0.2058 | 0.2022 | 0.1968 | 0.1967 | 0.1971 |
| 80 | 0.1962 | 0.2007 | 0.2076 | 0.1990 | 0.1980 | 0.1968 | 0.1987 |
| 90 | 0.1955 | 0.1983 | 0.2067 | 0.2004 | 0.1992 | 0.1974 | 0.1994 |
| 100 | 0.1964 | 0.1966 | 0.2078 | 0.1968 | 0.1964 | 0.1966 | 0.1981 |
| win/tie/loss | 5/1/0 | + | + | + | = | + | + |
Table 10: hamming loss for Scene dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1674 | 0.1652 | 0.1479 | 0.1762 | 0.1677 | 0.1750 | 0.1485 |
| 20 | 0.1520 | 0.1532 | 0.1458 | 0.1741 | 0.1564 | 0.1694 | 0.1427 |
| 30 | 0.1361 | 0.1425 | 0.1423 | 0.1591 | 0.1488 | 0.1547 | 0.1409 |
| 40 | 0.1326 | 0.1387 | 0.1415 | 0.1481 | 0.1435 | 0.1540 | 0.1399 |
| 50 | 0.1291 | 0.1325 | 0.1420 | 0.1430 | 0.1417 | 0.1398 | 0.1383 |
| 60 | 0.1231 | 0.1300 | 0.1398 | 0.1413 | 0.1389 | 0.1329 | 0.1327 |
| 70 | 0.1190 | 0.1226 | 0.1379 | 0.1371 | 0.1367 | 0.1369 | 0.1297 |
| 80 | 0.1131 | 0.1196 | 0.1553 | 0.1285 | 0.1344 | 0.1217 | 0.1253 |
| 90 | 0.1111 | 0.1177 | 0.1308 | 0.1170 | 0.1281 | 0.1143 | 0.1247 |
| 100 | 0.1080 | 0.1168 | 0.1287 | 0.1149 | 0.1222 | 0.1149 | 0.1193 |
| win/tie/loss | 6/0/0 | + | + | + | + | + | + |

Table 11: hamming loss for LanguageLog dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.1797 | 0.1792 | 0.1802 | 0.1820 | 0.1952 | 0.1909 | 0.1693 |
| 20 | 0.1657 | 0.1740 | 0.1731 | 0.1832 | 0.1942 | 0.1789 | 0.1704 |
| 30 | 0.1664 | 0.1761 | 0.1756 | 0.1839 | 0.1941 | 0.1703 | 0.1679 |
| 40 | 0.1662 | 0.1717 | 0.1768 | 0.1835 | 0.1921 | 0.1680 | 0.1625 |
| 50 | 0.1660 | 0.1673 | 0.1785 | 0.1826 | 0.1904 | 0.1652 | 0.1641 |
| 60 | 0.1660 | 0.1682 | 0.1751 | 0.1809 | 0.1886 | 0.1647 | 0.1635 |
| 70 | 0.1637 | 0.1632 | 0.1748 | 0.1843 | 0.1847 | 0.1627 | 0.1619 |
| 80 | 0.1645 | 0.1585 | 0.1746 | 0.1840 | 0.1791 | 0.1611 | 0.1620 |
| 90 | 0.1656 | 0.1626 | 0.1770 | 0.1829 | 0.1793 | 0.1629 | 0.1599 |
| 100 | 0.1655 | 0.1593 | 0.1690 | 0.1842 | 0.1780 | 0.1625 | 0.1611 |
| win/tie/loss | 3/2/1 | = | + | + | + | = | - |

Table 12: hamming loss for Enron dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0553 | 0.0542 | 0.0592 | 0.0598 | 0.0552 | 0.0592 | 0.0535 |
| 20 | 0.0529 | 0.0539 | 0.0584 | 0.0599 | 0.0550 | 0.0571 | 0.0522 |
| 30 | 0.0525 | 0.0528 | 0.0578 | 0.0596 | 0.0539 | 0.0560 | 0.0524 |
| 40 | 0.0515 | 0.0525 | 0.0573 | 0.0592 | 0.0538 | 0.0548 | 0.0527 |
| 50 | 0.0524 | 0.0520 | 0.0564 | 0.0594 | 0.0546 | 0.0546 | 0.0522 |
| 60 | 0.0514 | 0.0521 | 0.0561 | 0.0592 | 0.0531 | 0.0539 | 0.0516 |
| 70 | 0.0516 | 0.0516 | 0.0586 | 0.0584 | 0.0530 | 0.0531 | 0.0515 |
| 80 | 0.0509 | 0.0522 | 0.0557 | 0.0586 | 0.0532 | 0.0530 | 0.0515 |
| 90 | 0.0516 | 0.0521 | 0.0555 | 0.0588 | 0.0529 | 0.0526 | 0.0514 |
| 100 | 0.0510 | 0.0517 | 0.0557 | 0.0584 | 0.0521 | 0.0523 | 0.0516 |
| win/tie/loss | 4/2/0 | = | + | + | + | + | = |

Table 13: hamming loss for Corel5k dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 |
| 20 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 30 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0095 | 0.0094 | 0.0094 |
| 40 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 50 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 60 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 70 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 80 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0095 | 0.0094 |
| 90 | 0.0094 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| 100 | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0094 | 0.0094 | 0.0095 |
| win/tie/loss | 2/4/0 | + | = | = | = | = | + |
Table 14: hamming loss for Bibtex dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.0140 | 0.0152 | 0.0143 | 0.0141 | 0.0141 | 0.0151 | 0.0137 |
| 20 | 0.0136 | 0.0150 | 0.0140 | 0.0141 | 0.0136 | 0.0151 | 0.0137 |
| 30 | 0.0135 | 0.0152 | 0.0143 | 0.0141 | 0.0135 | 0.0152 | 0.0136 |
| 40 | 0.0135 | 0.0149 | 0.0139 | 0.0140 | 0.0135 | 0.0150 | 0.0134 |
| 50 | 0.0135 | 0.0152 | 0.0141 | 0.0139 | 0.0134 | 0.0149 | 0.0132 |
| 60 | 0.0134 | 0.0150 | 0.0142 | 0.0137 | 0.0133 | 0.0150 | 0.0132 |
| 70 | 0.0133 | 0.0149 | 0.0144 | 0.0139 | 0.0133 | 0.0150 | 0.0131 |
| 80 | 0.0131 | 0.0152 | 0.0143 | 0.0137 | 0.0132 | 0.0150 | 0.0131 |
| 90 | 0.0129 | 0.0150 | 0.0140 | 0.0137 | 0.0133 | 0.0149 | 0.0130 |
| 100 | 0.0129 | 0.0149 | 0.0137 | 0.0137 | 0.0132 | 0.0150 | 0.0130 |
| win/tie/loss | 4/2/0 | + | + | + | = | + | = |

Table 15: hamming loss for Image dataset

| M/feat | MGFS | PMU | LRFS | MDFS | PPT-MI | MCLS | Pareto-Cluster |
|---|---|---|---|---|---|---|---|
| 10 | 0.2310 | 0.2239 | 0.2222 | 0.2344 | 0.2291 | 0.2442 | 0.2258 |
| 20 | 0.2296 | 0.2081 | 0.2056 | 0.2326 | 0.2245 | 0.2395 | 0.2114 |
| 30 | 0.2232 | 0.2044 | 0.1972 | 0.2143 | 0.2093 | 0.2329 | 0.2057 |
| 40 | 0.2143 | 0.1986 | 0.1977 | 0.2074 | 0.2139 | 0.2308 | 0.1998 |
| 50 | 0.2037 | 0.1975 | 0.1968 | 0.2070 | 0.1969 | 0.2283 | 0.1951 |
| 60 | 0.1979 | 0.1928 | 0.1976 | 0.1999 | 0.1955 | 0.2219 | 0.1974 |
| 70 | 0.1910 | 0.1920 | 0.1930 | 0.1935 | 0.1960 | 0.2170 | 0.1846 |
| 80 | 0.1861 | 0.1901 | 0.1916 | 0.1937 | 0.1934 | 0.2145 | 0.1866 |
| 90 | 0.1871 | 0.1865 | 0.1878 | 0.1920 | 0.1918 | 0.2103 | 0.1849 |
| 100 | 0.1861 | 0.1857 | 0.1855 | 0.1910 | 0.1909 | 0.2118 | 0.1810 |
| win/tie/loss | 1/3/2 | - | = | = | = | + | - |
Table 16. The obtained p-values by the Friedman test in terms of Accuracy

| Dataset | MGFS VS. PMU | MGFS VS. PPT-MI | MGFS VS. MCLS | MGFS VS. MDFS | MGFS VS. LRFS | MGFS VS. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Yeast | 0.0018 | 0.0377 | 0.0153 | 0.0018 | 0.0005 | 0.0056 |
| Scene | 0.0056 | 0.0056 | 0.0005 | 0.0005 | 0.0377 | 0.0377 |
| LanguageLog | 0.0153 | 0.0005 | 0.0056 | 0.0005 | 0.0018 | 1 |
| Enron | 0.0018 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0833 |
| Image | 1 | 0.0153 | 0.0005 | 0.0833 | 0.2987 | 0.1659 |
| Corel5k | 0.0005 | 0.0056 | 0.0153 | 0.0005 | 0.0005 | 0.0005 |
| Bibtex | 0.0005 | 0.4884 | 0.0005 | 0.0018 | 0.0005 | 0.0833 |

Table 17. The obtained p-values by the Friedman test in terms of Hamming loss

| Dataset | MGFS VS. PMU | MGFS VS. PPT-MI | MGFS VS. MCLS | MGFS VS. MDFS | MGFS VS. LRFS | MGFS VS. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Yeast | 0.0005 | 0.0833 | 0.0056 | 0.0005 | 0.0005 | 0.0377 |
| Scene | 0.0056 | 0.0018 | 0.0005 | 0.0005 | 0.0377 | 0.0377 |
| LanguageLog | 1 | 0.0005 | 0.2987 | 0.0005 | 0.0018 | 0.0377 |
| Enron | 0.0833 | 0.0056 | 0.0005 | 0.0005 | 0.0005 | 0.729 |
| Image | 0.0377 | 0.4884 | 0.0005 | 0.0833 | 0.0833 | 0.0018 |
| Corel5k | 0.0003 | 0.1659 | 0.4884 | 0.4884 | 0.729 | 0.0056 |
| Bibtex | 0.0005 | 0.4884 | 0.0005 | 0.0005 | 0.0005 | 1 |

Table 18. The obtained p-values by the Friedman test in terms of Run-time

| Dataset | MGFS VS. PMU | MGFS VS. PPT-MI | MGFS VS. MCLS | MGFS VS. MDFS | MGFS VS. LRFS | MGFS VS. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Yeast | 0.0005 | 0.0056 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Scene | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| LanguageLog | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Enron | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Image | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Corel5k | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Bibtex | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0005 |

Table 19. The win/tie/loss results of MGFS against the other methods in terms of Accuracy, Hamming loss, and Run-time based on the Friedman test

| Evaluation metric | MGFS VS. PMU | MGFS VS. PPT-MI | MGFS VS. MCLS | MGFS VS. MDFS | MGFS VS. LRFS | MGFS VS. Pareto-Cluster |
|---|---|---|---|---|---|---|
| Accuracy | 6/1/0 | 6/1/0 | 7/0/0 | 6/1/0 | 6/1/0 | 2/4/1 |
| Hamming loss | 4/2/1 | 3/4/0 | 5/2/0 | 5/2/0 | 5/2/0 | 3/2/2 |
| Run-time | 7/0/0 | 5/0/2 | 7/0/0 | 7/0/0 | 7/0/0 | 7/0/0 |
| Total | 17/3/1 | 14/5/2 | 19/2/0 | 18/3/0 | 18/3/0 | 12/6/3 |
5 Discussion
According to the results, the proposed method outperforms the compared methods overall. Based on Figs. 25-31, the MGFS algorithm has a short run-time: on five datasets (Yeast, Scene, Corel5k, LanguageLog, and Bibtex) it was faster than all the other methods, while on the other two datasets the difference with the PPT-MI method was very minor. The method is also very fast on high-dimensional data, where its speed differs significantly from that of the other methods. Speed on high-dimensional data is an important challenge today, and we have paid special attention to it in the proposed method. The short run-time suggests that the algorithm has low computational complexity. Furthermore, in addition to its high speed, the proposed method has better classification accuracy and lower error than the other methods. To show this, we report the results for different numbers of selected features in Figs. 11-24, and we have used the Friedman test to compare the results of the different methods statistically, with the results presented in Tables 2 to 15. In the proposed algorithm, we use the correlation distance between features and labels, and for each pair of features we calculate the Euclidean distance based on this correlation distance. A feature is assigned a higher priority if it is farther from the other features; that is, we give priority to unique features. The logic behind this is that features that are close to other features tend to have much in common with them and are very similar, whereas we are looking for features that provide unique information compared to the other features. As can be seen in Fig. 8, which shows the Euclidean distance matrix (EDM), feature 5 has the maximum Euclidean distance from the other features, so the PageRank algorithm assigns the highest rank to this feature, while the rest of the features are ranked by the same procedure. In the end, the user can select as many features as desired from the ranking produced by the MGFS algorithm.
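The ranking procedure summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the correlation distance is taken as one minus the Pearson correlation between a feature and a label, the graph is complete with the Euclidean distances as edge weights, and the weighted PageRank uses a damping factor of 0.85 with a fixed number of power iterations. The function names, these parameter values, and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def rank_features(X, Y, damping=0.85, iters=100):
    """Return (feature indices sorted by decreasing score, PageRank scores)."""
    n_feat, n_lab = len(X[0]), len(Y[0])
    feats = [[row[i] for row in X] for i in range(n_feat)]
    labels = [[row[j] for row in Y] for j in range(n_lab)]
    # Correlation distance of every feature to every label (assumed: 1 - Pearson).
    cd = [[1.0 - pearson(f, l) for l in labels] for f in feats]
    # Edge weights: Euclidean distance between the correlation-distance rows.
    w = [[math.dist(cd[i], cd[j]) for j in range(n_feat)] for i in range(n_feat)]
    out = [sum(r) for r in w]
    pr = [1.0 / n_feat] * n_feat
    for _ in range(iters):
        # Weighted PageRank update: node j passes its score along edges
        # in proportion to their weights.
        pr = [
            (1.0 - damping) / n_feat
            + damping * sum(pr[j] * w[j][i] / out[j] for j in range(n_feat) if out[j] > 0)
            for i in range(n_feat)
        ]
    order = sorted(range(n_feat), key=lambda i: pr[i], reverse=True)
    return order, pr


# Toy data (illustrative only): features 0 and 1 are identical, feature 2 differs.
X = [[1, 1, 0], [2, 2, 1], [3, 3, 0], [4, 4, 1]]
Y = [[1, 0], [1, 1], [0, 0], [0, 1]]
order, scores = rank_features(X, Y)
print(order, scores)
```

In this toy case the two identical features have zero distance to each other, so the third feature, which is far from both, receives the highest score. This mirrors the preference for unique features described above; selecting the first entries of `order` then corresponds to the user choosing the subset size.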
6 Conclusions
In this work, we proposed MGFS, a multi-label graph-based feature selection method that employs the Google PageRank algorithm to select the best feature subset. Unlike single-label methods, it is designed specifically for multi-label data. In this method, we store the correlation distances between features and labels in a matrix and use this matrix to compare the features. The PageRank algorithm then ranks the features to find the best subset, and the user chooses the size of the feature subset. We compared this method with 6 similar methods on 7 datasets and observed that it is superior in terms of accuracy and classification error. Further, in terms of execution time, it was faster than the other methods, as it has low computational complexity. In the future, we intend to continue using graph-based models in the feature selection process, with our main focus on multi-label data. We also seek to use other centrality measures and to evaluate them against PageRank in the proposed model.
References
Bermingham, M., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., … Haley, C. (2015). Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Scientific Reports, 5, 10312. https://doi.org/10.1038/srep10312
Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: A new perspective. Neurocomputing, 300, 70–79. https://doi.org/10.1016/j.neucom.2017.11.077
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
Coakley, C. W., & Conover, W. J. (2000). Practical Nonparametric Statistics. Journal of the American Statistical Association, 95(449), 332. https://doi.org/10.2307/2669565
Coelho, F., Castro, C., Braga, A. P., & Verleysen, M. (2019). Semi-supervised relevance index for feature selection. Neural Computing and Applications, 31, 989–997. https://doi.org/10.1007/s00521-017-3062-0
Dowlatshahi, M. B., & Derhami, V. (2017). Winner Determination in Combinatorial Auctions using Hybrid Ant Colony Optimization and Multi-Neighborhood Local Search. Journal of AI and Data Mining, 5(2), 169-181.
Dowlatshahi, M., Derhami, V., & Nezamabadi-pour, H. (2017). Ensemble of filter-based rankers to guide an epsilon-greedy swarm optimizer for high-dimensional feature subset selection. Information, 8(4), 152.
Dowlatshahi, M. B., Derhami, V., & Nezamabadi-pour, H. (2018). A novel three-stage filter-wrapper framework for miRNA subset selection in cancer classification. Informatics, 5(1). https://doi.org/10.3390/informatics5010013
Dowlatshahi, M. B., Nezamabadi-Pour, H., & Mashinchi, M. (2014). A discrete gravitational search algorithm for solving combinatorial optimization problems. Information Sciences, 258, 94-107.
Dowlatshahi, M. B., & Nezamabadi-Pour, H. (2014). GGSA: a grouping gravitational search algorithm for data clustering. Engineering Applications of Artificial Intelligence, 36, 114-121.
Dowlatshahi, M. B., & Rezaeian, M. (2016, March). Training spiking neurons with gravitational search algorithm for data classification. In 2016 1st Conference on Swarm Intelligence and Evolutionary Computation (CSIEC) (pp. 53-58). IEEE.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Math. Intell. https://doi.org/10.1007/BF02985802
Kafadar, K., & Sheskin, D. J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. The American Statistician, 51(4), 374. https://doi.org/10.2307/2685909
Kashef, S., & Nezamabadi-pour, H. (2019). A label-specific multi-label feature selection algorithm based on the Pareto dominance concept. Pattern Recognition, 88, 654–667. https://doi.org/10.1016/j.patcog.2018.12.020
Kashef, S., Nezamabadi-pour, H., & Nikpour, B. (2018a). FCBF3Rules: A feature selection method for multi-label datasets. 1–5. https://doi.org/10.1109/CSIEC.2018.8405419
Kashef, S., Nezamabadi-pour, H., & Nikpour, B. (2018b). Multilabel feature selection: A comprehensive review and guiding experiments. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8, e1240. https://doi.org/10.1002/widm.1240
Li, J., Cheng, K., Wang, S., Morstatter, F., P. Trevino, R., Tang, J., & Liu, H. (2017). Feature Selection: A Data Perspective. ACM Computing Surveys, 50. https://doi.org/10.1145/3136625
Li, Y., Lin, Y., Liu, J., Weng, W., Shi, Z., & Wu, S. (2018). Feature selection for multi-label learning based on kernelized fuzzy rough sets. Neurocomputing, 318, 271–286. https://doi.org/10.1016/j.neucom.2018.08.065
Luo, D., Gong, C., Hu, R., Duan, L., & Ma, S. (2016). Ensemble Enabled Weighted PageRank. eprint arXiv:1604.05462.
Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. C. (2018). Categorizing feature selection methods for multi-label classification. Artificial Intelligence Review, 49(1), 57–78. https://doi.org/10.1007/s10462-016-9516-4
Rafsanjani, M. K., & Dowlatshahi, M. B. (2012). Using gravitational search algorithm for finding near-optimal base station location in two-tiered WSNs. International Journal of Machine Learning and Computing, 2(4), 377.
Rafsanjani, M. K., Dowlatshahi, M. B., & Nezamabadi-Pour, H. (2015). Gravitational Search Algorithm to Solve the K-of-N Lifetime Problem in Two-Tiered WSNs. Iranian Journal of Mathematical Sciences and Informatics, 10(1), 81-93.
Sen, T., & Kumar Chaudhary, D. (2017). Contrastive study of Simple PageRank, HITS and Weighted PageRank algorithms: Review. 721–727. https://doi.org/10.1109/CONFLUENCE.2017.7943245
Sun, X., Liu, Y., Li, J., Zhu, J., Liu, X., & Chen, H. (2012). Using cooperative game theory to optimize the feature selection problem. Neurocomputing, 97, 86–93. https://doi.org/10.1016/j.neucom.2012.05.001
Tang, C., Zhu, X., Chen, J., Wang, P., Liu, X., & Tian, J. (2018). Robust graph regularized unsupervised feature selection. Expert Systems with Applications, 96, 64–76. https://doi.org/10.1016/j.eswa.2017.11.053
Wang, Y., Cang, S., & Yu, H. (2019). Mutual Information Inspired Feature Selection Using Kernel Canonical Correlation Analysis. Expert Systems with Applications: X, 100014. https://doi.org/10.1016/j.eswax.2019.100014
Xing, W., & Ghorbani, A. (2004). Weighted PageRank Algorithm. 305–314. https://doi.org/10.1109/DNSR.2004.1344743
Zare, M., Eftekhari, M., & Aghamollaei, G. (2019). Supervised feature selection via matrix factorization based on singular value decomposition. Chemometrics and Intelligent Laboratory Systems, 185, 105–113. https://doi.org/10.1016/j.chemolab.2019.01.003
Zhang, J., Luo, Z., Li, C., Zhou, C., & Li, S. (2019). Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognition, 95, 136–150. https://doi.org/10.1016/j.patcog.2019.06.003
Zhang, J., Xiong, Y., & Min, S. (2019). A new hybrid filter/wrapper algorithm for feature selection in classification. Analytica Chimica Acta. https://doi.org/10.1016/j.aca.2019.06.054
Zhang, P., Liu, G., & Gao, W. (2019). Distinguishing two types of labels for multi-label feature selection. Pattern Recognition, 95, 72–82. https://doi.org/10.1016/j.patcog.2019.06.004
Zhang, R., Nie, F., Li, X., & Wei, X. (2019). Feature selection with multi-view data: A survey. Information Fusion, 50, 158–167. https://doi.org/10.1016/j.inffus.2018.11.019
Zhu, Q. H., & Yang, Y. Bin. (2018). Discriminative embedded unsupervised feature selection. Pattern Recognition Letters, 112, 219–225. https://doi.org/10.1016/j.patrec.2018.07.018
CRediT author statement
Amin Hashemi: Conceptualization, Methodology, Software, Original draft preparation. Mohammad Bagher Dowlatshahi: Supervision, Data curation, Writing, Investigation, Reviewing and Editing. Hossein Nezamabadi-pour: Supervision, Validation, Reviewing and Editing.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: