Pattern Recognition 84 (2018) 273–287
Online multi-label streaming feature selection based on neighborhood rough set

Jinghua Liu a, Yaojin Lin b,c,∗, Yuwen Li a, Wei Weng a, Shunxiang Wu a,∗

a Department of Automation, Xiamen University, Xiamen 361000, PR China
b School of Computer Science, Minnan Normal University, Zhangzhou 363000, PR China
c Key Laboratory of Data Science and Intelligence Application, Fujian Province University, PR China

∗ Corresponding authors. E-mail addresses: [email protected] (Y. Lin), [email protected] (S. Wu).

https://doi.org/10.1016/j.patcog.2018.07.021
0031-3203/© 2018 Elsevier Ltd. All rights reserved.
Article info

Article history: Received 15 November 2017; Revised 18 June 2018; Accepted 16 July 2018; Available online 18 July 2018

Keywords: Online feature selection; Multi-label learning; Neighborhood rough set; Granularity
Abstract

Multi-label feature selection has attracted intensive attention in many big data applications. However, traditional multi-label feature selection methods generally ignore a real-world scenario in which features constantly flow into the model one by one over time. To address this problem, we develop a novel online multi-label streaming feature selection method based on the neighborhood rough set, which selects a feature subset containing strongly relevant and non-redundant features. The main motivation is that data mining based on the neighborhood rough set does not require any a priori knowledge of the feature space structure. Moreover, the neighborhood rough set deals with mixed data without breaking the neighborhood and order structure of the data. In this paper, we first introduce the maximum-nearest-neighbor of an instance to granulate all instances, which solves the problem of granularity selection in the neighborhood rough set, and then generalize the single-label neighborhood rough set to fit multi-label learning. Meanwhile, an online multi-label streaming feature selection framework is presented, which includes online importance selection and online redundancy update. Under this framework, we propose a criterion to select features that are important relative to the currently selected features, and design a bound on pairwise correlations between features under the label set to filter out redundant features. An empirical study using a series of benchmark datasets demonstrates that the proposed method outperforms other state-of-the-art multi-label feature selection methods.
1. Introduction

Multi-label objects are ubiquitous in many domains, such as text categorization [18,34], gene function classification [4], and music emotion recognition [38]. For instance, an elderly patient may suffer from several diseases simultaneously, including diabetes, hypertension, and coronary heart disease. Like traditional single-label data, multi-label data also possesses a mass of features, and these features may be redundant and/or irrelevant to the learning task. Generally speaking, redundant and/or irrelevant features can degrade classification performance and increase computational and memory requirements.

Multi-label dimensionality reduction, one of the most popular data preprocessing techniques, aims to remove irrelevant and redundant features. Roughly speaking, it can be primarily classified into multi-label feature extraction and multi-label feature selection.
Multi-label feature extraction reduces the dimensionality of the feature space via space mapping or space transformation; however, this destroys the structural information of the original features and blurs the physical meaning of the transformed features. Examples of multi-label feature extraction algorithms include LDA [9], MDDM [50], and MLSI [45]. On the other hand, multi-label feature selection selects a subset of discriminative features from the original features without any transformation. Hence, multi-label feature selection retains the physical meaning of the original features and offers advantages in readability and interpretability.

Existing feature selection methods can be broadly classified into wrapper, embedded, and filter methods. Wrapper methods rely on a predetermined classifier to select the feature subset directly; they need to run the classifier many times to assess the quality of the selected features, and they are typically computationally expensive. Embedded methods seek a feature subset by jointly minimizing the empirical error and a penalty, which can be approximated as a continuous optimization problem [31]. To remove irrelevant and noisy features, sparse regularization is usually imposed on the feature selection matrices.
Representative methods of this kind include group Lasso [17] and fused Lasso [39]. Filter methods evaluate features by certain criteria and select features by ranking their evaluation values [26]. Correlation criteria designed for feature selection include mutual information [12,25,26,28], maximum margin [35], and dependency [51]. Filter methods tend to be more efficient than embedded models, and we focus on the filter model in this work. At present, many filter methods have been developed in the literature; representative algorithms include RF-ML [35], MDMR [25], MFNMI [26], SCLS [12], and MUCO [28].

The above-mentioned methods are capable of improving learning performance, but they suffer from a common limitation: a complete set of features must be collected before feature selection starts. However, in many real-world scenarios, the arrival of features is dynamic. For example, hot topics change continuously on the social network platform Twitter. When a popular topic appears, it is always accompanied by a set of fresh keywords, and these fresh keywords may act as key features to distinguish the popular topic. This phenomenon suggests that it is infeasible to collect all features before learning begins. Online multi-label streaming feature selection assumes that features arrive one by one dynamically, and it maintains an optimal feature subset from the so-far-seen features by processing each feature upon its arrival [20,28].

Theoretically, three critical conditions must be satisfied for online streaming feature selection. Firstly, no prior knowledge of the feature space should be required beforehand. Secondly, efficient incremental update of the selected features should be supported. Finally, the method should be able to make an accurate prediction at any time using the features seen so far. Inspired by these critical conditions, in this paper we propose online multi-label streaming feature selection based on the neighborhood rough set (NRS). The main motivation is that NRS can deal with mixed-type data without breaking the neighborhood and order structure of the data. Moreover, NRS-based feature selection does not require any a priori knowledge of the feature space structure, which makes it an ideal tool for online streaming feature selection.

Recently, several NRS-based feature selection algorithms have been presented in the literature [10,11,44]. However, none of these algorithms can handle streaming feature selection and multi-label learning directly, and they suffer from the issue of granularity selection before feature selection takes place. It is an open problem to select a proper granularity of neighborhood for a specific task [54]. In addition, different from traditional supervised learning, where one instance belongs to a single class label, in multi-label learning one instance is assigned multiple labels simultaneously, which makes it even more complicated to specify a proper granularity for a given multi-label dataset. In this work, we define a new neighborhood relation which solves the problem of granularity selection by employing the label information and the distribution of surrounding instances. Based on the above discussion, we redefine the concepts and properties of the neighborhood rough set to fit multi-label learning.
More specifically, we introduce the maximum-nearest-neighbor of an instance to granulate all instances under different labels, and generalize the single-label neighborhood rough set to fit multi-label learning. We then present a new measure to calculate the positive region, which can be used to evaluate the characterizing power of a feature. Motivated by these observations, a new algorithm named OM-NRS (Online Multi-label streaming feature selection based on Neighborhood Rough Set) is proposed. OM-NRS conducts online streaming feature selection for multi-label learning in two intuitive steps. Firstly, when a new feature arrives, we use the significance of the feature to determine whether the newly arriving feature is important relative to the currently selected features and the label set; this process is called online importance selection.
If the new feature is identified as superfluous in the first stage, we reevaluate it. Secondly, we develop a criterion to determine whether the feature is important relative to the label set, and then design a bound on pairwise correlations between features under the label set to filter out redundant features. This process can be accomplished with the dependency function, and we refer to this phase as online redundancy update. The contributions of this paper are summarized as follows:

• A new neighborhood relation is proposed to effectively solve the problem of granularity selection in the neighborhood rough set.
• We generalize the classical neighborhood rough set model to fit multi-label learning and present a novel measure to compute the positive region.
• We propose a new feature selection framework which solves online streaming feature selection and multi-label feature selection simultaneously.
• Experiments on ten benchmark datasets with different application scenarios show the competitive performance of the proposed method against state-of-the-art multi-label feature selection algorithms.
The remainder of this paper is organized as follows. After discussing related work in Section 2, we introduce the concepts of multi-label learning and the neighborhood rough set in Section 3. In Section 4, we present the neighborhood granularity design for the multi-label neighborhood rough set. In Section 5, we propose the framework of online multi-label streaming feature selection and give our algorithm. Section 6 reports experimental results, and Section 7 concludes this paper and discusses future work.

2. Related work

Feature selection, as an effective means of data preprocessing, has received considerable attention in both statistics and machine learning, and it has found success in different real-world applications such as chemometrics [41] and text recognition [18,34]. Existing feature selection methods are designed in various ways. According to whether the global feature space is available beforehand, feature selection methods can be divided into two types, namely the batch manner and the online manner.

A batch method assumes that the global feature space of the training data is available beforehand and examines all features in detail at each round to select the best feature. Batch feature selection can further be classified into single-label and multi-label feature selection in supervised learning. For single-label feature selection, a large collection of algorithms has been presented and proved to be efficient and effective in improving prediction performance [22,24,29,32,33,55,56]. Different from single-label feature selection, multi-label feature selection faces various challenges that derive from the label space. To date, numerous multi-label feature selection algorithms have been developed from different views of the label space, such as streaming labels [27], label correlation [8,21,23,36,43], label selection [15], missing labels [53], problem transformation [2,37], and dealing with the label space directly [12,14,16,35]. These methods share the common assumption that the global feature space is available beforehand, and they lack scalability for large-scale data applications.

In contrast to batch methods, online streaming feature selection assumes that the features are no longer required in advance but arrive one at a time, and each fresh feature is processed upon its arrival. In single-label feature selection, several methods in this direction have been proposed, such as Grafting [30], Alpha-investing [52], OS-NRRSAR-SA [6], OSFS [42], Fast-OSFS [42], SAOLA [46], FRSA-IFS-HIS [47], and OGFS [40]. All these algorithms select features in an online manner based on the same assumption that
each instance belongs to one single label. In many real-world application scenarios, however, one instance is usually associated with several labels simultaneously; therefore, online streaming feature selection for multi-label learning should be performed [20,28]. In this research, we propose an online multi-label streaming feature selection framework which includes online importance selection and online redundancy update. Under this framework, a novel online multi-label streaming feature selection algorithm based on the neighborhood rough set is presented. By comparison, our proposed method makes an extra effort to manage the problem of real-world feature selection in an online manner.

3. Preliminaries
3.1. Multi-label learning

Let X = R^{n×d} be the input space, and Y = {−1, +1}^M be the label space consisting of M labels. Given a multi-label training set D = {(x_i, Y_i) | 1 ≤ i ≤ n}, x_i ∈ X denotes a d-dimensional feature vector (x_{i1}, x_{i2}, ..., x_{id}), and Y_i ∈ Y denotes the relevant label vector (Y_{i1}, Y_{i2}, ..., Y_{iM}) of data point x_i. The task of multi-label learning is to learn a function f : X → Y.

Compared with traditional supervised learning, performance evaluation in multi-label learning is somewhat complex, as each instance belongs to a set of labels simultaneously [48]. Several evaluation metrics have been designed for multi-label learning; here, we select Hamming Loss, Average Precision, Coverage, Ranking Loss, and One-error [48,49]. Given a test set T = {(x_i, Y_i) | 1 ≤ i ≤ m}, where Y_i ⊆ L is the correct label subset and y_i is the binary predicted label vector, the definitions of these metrics are given below.

(1) Hamming Loss evaluates the proportion of misclassified instance-label pairs:

$$HL = \frac{1}{m}\sum_{i=1}^{m}\frac{|Y_i \,\Delta\, y_i|}{M}, \tag{1}$$

where Δ denotes the XOR operation.

(2) Average Precision evaluates the average proportion of relevant labels ranked higher than a particular label y ∈ Y_i:

$$AP = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{|Y_i|}\sum_{y \in Y_i}\frac{|\{y' \in Y_i : rank(x_i, y') \le rank(x_i, y)\}|}{rank(x_i, y)}. \tag{2}$$

(3) Coverage evaluates the number of steps needed to go down the label ranking list so that all ground-truth labels of a test instance are covered:

$$CV = \frac{1}{m}\sum_{i=1}^{m}\max_{y \in Y_i} rank(x_i, y) - 1, \tag{3}$$

where rank(x_i, y) represents the rank of y according to its likelihood; for instance, if f(x_i, y_1) > f(x_i, y_2), then rank(x_i, y_1) < rank(x_i, y_2).

(4) Ranking Loss evaluates the average proportion of reversely ordered label pairs, i.e., cases where the output value of an irrelevant label is larger than that of a relevant label:

$$RL = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{|Y_i||\bar{Y}_i|}\left|\{(y', y'') \mid f(x_i, y') \le f(x_i, y''),\ (y', y'') \in Y_i \times \bar{Y}_i\}\right|, \tag{4}$$

where $\bar{Y}_i$ denotes the complementary set of Y_i.

(5) One-error evaluates the proportion of test instances whose top-ranked label is not a relevant label:

$$OE = \frac{1}{m}\sum_{i=1}^{m}\left[\!\left[\left[\arg\max_{y \in L} f(x_i, y)\right] \notin Y_i\right]\!\right]. \tag{5}$$

Among these metrics, Hamming Loss focuses on evaluating the label set prediction performance for each instance, while the other four are more concerned with the quality of the label ranking. In addition, for Average Precision, a bigger value indicates better generalization performance, while for Hamming Loss, Coverage, Ranking Loss, and One-error, a smaller value indicates better performance.
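For concreteness, the following minimal sketch (ours, not the authors' code) shows how two of these metrics can be computed with numpy, using a {0,1} encoding of the labels for simplicity; Y is the true label matrix and F a real-valued score matrix produced by a classifier.

```python
import numpy as np

def hamming_loss(Y, Y_pred):
    # proportion of misclassified instance-label pairs, Eq. (1)
    return np.mean(Y != Y_pred)

def one_error(Y, F):
    # fraction of instances whose top-ranked label is irrelevant, Eq. (5)
    top = np.argmax(F, axis=1)
    return np.mean(Y[np.arange(len(Y)), top] == 0)

Y = np.array([[1, 0, 1], [0, 1, 0]])               # true labels (m x M)
F = np.array([[0.9, 0.2, 0.4], [0.1, 0.3, 0.8]])   # classifier scores
print(hamming_loss(Y, (F > 0.5).astype(int)))      # 0.5
print(one_error(Y, F))                             # 0.5
```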
3.2. Neighborhood rough set model

Given a decision system NDT = ⟨U, C, D⟩, U = {x_1, x_2, ..., x_n} is a non-empty set of instances, C is a set of features describing the instances, and D is the decision feature set. A given threshold θ together with C generates a θ-neighborhood relation N, and we then call the system a neighborhood decision system, denoted NDS = ⟨U, C ∪ D, θ⟩.

Definition 1. Given ∀x_i ∈ U and a feature subset A ⊆ C, the neighborhood θ_A(x_i) of instance x_i in A is defined as [10]:

$$\theta_A(x_i) = \{x_j \mid x_j \in U,\ \Delta_A(x_i, x_j) \le \theta\}. \tag{6}$$

θ_A(x_i) is the θ-neighborhood set of instance x_i, and the size of the neighborhood depends on the threshold θ. Δ denotes a metric function satisfying Δ(x_i, x_j) ≥ 0; normally, the metric function is implemented with a p-norm.

Definition 2. In a neighborhood decision system NDS = ⟨U, C ∪ D, θ⟩, let X_1, X_2, ..., X_N be the instance subsets with decisions 1 to N, and let θ_A(x_i) be the neighborhood granule including x_i and generated by feature subset A ⊆ C. The lower and upper approximations of the decision D with respect to A are defined as [10]:

$$\underline{N_A}D = \bigcup_{i=1}^{N} \underline{N_A}X_i, \tag{7}$$

$$\overline{N_A}D = \bigcup_{i=1}^{N} \overline{N_A}X_i, \tag{8}$$

where

$$\underline{N_A}X = \{x_i \mid \theta_A(x_i) \subseteq X,\ x_i \in U\}, \tag{9}$$

$$\overline{N_A}X = \{x_i \mid \theta_A(x_i) \cap X \ne \emptyset,\ x_i \in U\}. \tag{10}$$

$\underline{N_A}D$ is also named the positive region of the decision, denoted POS_A(D). Specifically, POS_A(D) is the subset of instances whose neighborhoods are consistently assigned to one of the decision classes. The greater the positive region, the stronger the characterizing power of the feature subset.

Definition 3. In a neighborhood decision system NDS = ⟨U, C ∪ D, θ⟩, A ⊆ C, the dependency degree of D on A is defined as [10]:

$$\gamma_A(D) = \frac{|POS_A(D)|}{|U|}. \tag{11}$$

Obviously, γ_A(D) ∈ [0, 1]. We say D is completely dependent on A if γ_A(D) = 1; otherwise, D depends on A in the degree γ_A(D).
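As an illustration of Definitions 1–3, here is a minimal Python sketch (ours, not from the paper) that computes γ_A(D) by counting the instances whose θ-neighborhood is label-consistent.

```python
import numpy as np

def dependency(X, y, A, theta):
    # Dependency degree gamma_A(D), Eq. (11).
    # X: n x d data matrix, y: decision labels, A: list of feature
    # indices, theta: neighborhood radius.
    XA = X[:, A]
    pos = 0
    for i in range(len(X)):
        dist = np.linalg.norm(XA - XA[i], axis=1)   # p-norm metric (p = 2)
        neighbors = np.where(dist <= theta)[0]      # theta_A(x_i), Eq. (6)
        if np.all(y[neighbors] == y[i]):            # consistent neighborhood
            pos += 1                                # x_i is in POS_A(D)
    return pos / len(X)
```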
276
J. Liu et al. / Pattern Recognition 84 (2018) 273–287
Theorem 1. Given NDS = ⟨U, C ∪ D, θ⟩ and A_1, A_2 ⊆ C with A_1 ⊆ A_2, under the same distance function and threshold θ, we have:

(1) ∀X ⊆ U, $\underline{N_{A_1}}X \subseteq \underline{N_{A_2}}X$;
(2) $POS_{A_1}(D) \subseteq POS_{A_2}(D)$, $\gamma_{A_1}(D) \le \gamma_{A_2}(D)$.

The proof of Theorem 1 can be found in [10]. It means that adding a new feature to the feature subset never decreases the dependency. This property is of great importance for constructing a greedy forward or backward search policy in a feature selection algorithm. The definition of a reduct can then be presented.

Table 1. Example of multi-label data.

Instance   Feature   L1   L2
x1         0.20      +1   +1
x2         0.10      -1   +1
x3         0.25      +1   -1
x4         0.13      +1   +1
x5         0.12      -1   -1

Definition 4. In a neighborhood decision system NDS = ⟨U, C ∪ D, θ⟩, A ⊆ C, feature subset A is a relative reduct if [10]:

(1) γ_A(D) = γ_C(D);
(2) ∀a ∈ A, γ_{A−a}(D) < γ_A(D).

The first condition guarantees POS_A(D) ⊆ POS_C(D), and the second condition states that there are no superfluous features in subset A. Therefore, an optimal feature subset should possess the same or similar approximating power as the entire feature set. Inspired by the above definitions, we can present the definition of significance as follows.

Definition 5. In a neighborhood decision system NDS = ⟨U, C ∪ D, θ⟩, A ⊆ C, the significance of a feature a ∈ C − A is defined as [10]:

$$Sig(a, A, D) = \gamma_{A \cup a}(D) - \gamma_A(D). \tag{12}$$
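Theorem 1 guarantees that the significance of Eq. (12) is nonnegative, which justifies a greedy forward search. The following sketch (ours, not the authors' code) illustrates such a search; it reuses the dependency function from the previous sketch, and the stopping tolerance eps is our assumption.

```python
def forward_reduct(X, y, theta, eps=1e-6):
    # Greedy forward selection driven by the significance Sig(a, A, D),
    # Eq. (12); stops when no candidate raises gamma (Theorem 1 ensures
    # gains are >= 0).
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        base = dependency(X, y, selected, theta) if selected else 0.0
        gains = {a: dependency(X, y, selected + [a], theta) - base
                 for a in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= eps:          # no candidate improves gamma
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```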
4. Neighborhood granulation design for multi-label neighborhood rough set

4.1. Granulation of instances

The neighborhood model manages mixed features by setting different thresholds for different types of features. Although the neighborhood model has been successfully applied to feature selection, it still suffers from the problem of granularity selection. To granulate all instances while avoiding this problem, we introduce the maximum-nearest-neighbor of an instance [19] to set the size of the neighborhood. Given an instance x, the maximum-nearest-neighbor of x is defined as

$$d(x) = \max(\Delta(x, NS(x)),\ \Delta(x, NT(x))), \tag{13}$$

where NS(x) denotes the nearest instance whose label is different from that of x, called the nearest miss (NS), and NT(x) denotes the nearest instance with the same label as x, called the nearest hit (NT). We then set θ(x) = {y | y ∈ U, Δ(x, y) ≤ d(x)} as the neighborhood granule of x.

4.2. Discussion of neighborhood granularity based on the multi-label neighborhood rough set

In multi-label learning, the problem of granularity selection is somewhat complex, as each instance belongs to multiple labels simultaneously. Thus, it is important to set an appropriate granularity for each instance under different labels.

Definition 6. Given an instance x and a label L_k ∈ L, the maximum-nearest-neighbor of x with respect to L_k is defined as

$$d^{L_k}(x) = \max(\Delta(x, NS^{L_k}(x)),\ \Delta(x, NT^{L_k}(x))). \tag{14}$$

Here, d^{L_k}(x) ≥ 0, and d^{L_k}(x) = 0 indicates that there exist instances which have different labels but the same feature values as x. To show the maximum-nearest-neighbor of an instance under different labels more specifically and clearly, we depict the process in Example 1.
Fig. 1. Maximum-nearest-neighbor of instance x in labels L1 and L2 .
Example 1. Table 1 gives a multi-label data set with two labels, L1 and L2. The positive instances of L1 are {x1, x3, x4}, and the negative instances of L1 are {x2, x5}. The positive and negative instances of L2 are {x1, x2, x4} and {x3, x5}, respectively. Assuming the 2-norm is used to compute Δ(x, NS^{L_k}(x)) and Δ(x, NT^{L_k}(x)), the maximum-nearest-neighbor of instance x1 under labels L1 and L2 is shown in Fig. 1. From Fig. 1, the maximum-nearest-neighbor of x1 with respect to L1 is Δ(x1, x5), while the maximum-nearest-neighbor of x1 with respect to L2 is Δ(x1, x4).

From Example 1, each instance obtains a different granularity by using the label information. We can then redefine the neighborhood of an instance x under different labels as follows.

Definition 7. Given an instance x and a label L_k ∈ L, the neighborhood granule of x under L_k is defined as:
$$\theta^{L_k}(x) = \{y \mid \Delta(x, y) \le d^{L_k}(x),\ y \in U\}. \tag{15}$$
Here, Δ(x, y) = d^{L_k}(x) is allowed if and only if d^{L_k}(x) = 0. The neighborhood granule θ^{L_k}(x) degenerates to an equivalence class if d^{L_k}(x) = 0; this case applies to discrete data. Otherwise, θ^{L_k}(x) can be viewed as a ball (not including the sphere) centered at x with radius d^{L_k}(x).

Example 2. Consider the multi-label decision system in Table 1 and take label L1 as an example. Assuming the 2-norm is used for the metric function Δ(x_i, x_j), we have: Δ(x1, x2) = 0.10, Δ(x1, x3) = 0.05, Δ(x1, x4) = 0.07, Δ(x1, x5) = 0.08, Δ(x2, x3) = 0.15, Δ(x2, x4) = 0.03, Δ(x2, x5) = 0.02, Δ(x3, x4) = 0.12, Δ(x3, x5) = 0.13, Δ(x4, x5) = 0.01. Accordingly, the maximum-nearest-neighbors of the instances are: d^{L1}(x1) = 0.08, d^{L1}(x2) = 0.03, d^{L1}(x3) = 0.13, d^{L1}(x4) = 0.07, d^{L1}(x5) = 0.02. Then, the neighborhood sets of the instances are: θ^{L1}(x1) = {x1, x3, x4}, θ^{L1}(x2) = {x2, x5}, θ^{L1}(x3) = {x1, x3, x4}, θ^{L1}(x4) = {x2, x4, x5}, θ^{L1}(x5) = {x4, x5}.

Based on the above definitions, we obtain the following characteristics: (1) θ^{L_k}(x_i) ≠ ∅, since x_i ∈ θ^{L_k}(x_i); (2) ∪_{i=1}^{n} θ^{L_k}(x_i) = U.
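The following sketch (ours) reproduces Example 2 for label L1 on the data of Table 1; since the data has a single feature, the 2-norm reduces to an absolute difference.

```python
import numpy as np

f  = np.array([0.20, 0.10, 0.25, 0.13, 0.12])   # feature values of x1..x5
L1 = np.array([+1, -1, +1, +1, -1])             # labels under L1

def granule(i):
    d = np.abs(f - f[i])
    hit  = np.min(d[(L1 == L1[i]) & (np.arange(5) != i)])  # Delta(x, NT(x))
    miss = np.min(d[L1 != L1[i]])                          # Delta(x, NS(x))
    dmax = max(hit, miss)                                  # Eq. (14)
    # open ball when dmax > 0: the granule excludes the sphere itself
    return {j + 1 for j in np.where(d < dmax)[0]} | {i + 1}

print(granule(0))   # {1, 3, 4}: matches theta^{L1}(x1) in Example 2
print(granule(1))   # {2, 5}:    matches theta^{L1}(x2) in Example 2
```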
Fig. 2. Rough set in discrete feature space.

Fig. 3. Neighborhood of instance x under label Lk.
The new neighborhood relation is a specific neighborhood relation which satisfies reflexivity but not symmetry. This is primarily because the neighborhood size of each instance differs under different labels of the same multi-label data set.

Definition 8. Given a set of instances U, let L = {L1, L2, ..., LM} be the label set, N_k a neighborhood relation on U under label L_k ∈ L, {θ^{L_k}(x) | x ∈ U} the family of neighborhood granules with respect to L_k, and θ^L(x) = {θ^{L1}(x), θ^{L2}(x), ..., θ^{LM}(x)} the set of neighborhood granules with respect to the label set L. We then call MDS = ⟨U, F ∪ L, θ^L⟩ a multi-label neighborhood decision system.

Definition 9. Given a multi-label neighborhood decision system MDS = ⟨U, F ∪ L, θ^L⟩, for ∀L_k ∈ L, let X_1, X_2, ..., X_{N_k} be the instance subsets with decisions 1 to N_k under label L_k, and let θ_A^{L_k}(x) be the neighborhood granule generated by feature subset A ⊆ F. The lower and upper approximations of label set L with respect to A are defined as

$$\underline{N_A}L = \bigcup_{k=1}^{M} \underline{N_A}L_k, \tag{16}$$

$$\overline{N_A}L = \bigcup_{k=1}^{M} \overline{N_A}L_k, \tag{17}$$

where

$$\underline{N_A}L_k = \bigcup_{j=1}^{N_k} \underline{N_A}X_j = \bigcup_{j=1}^{N_k} \{x_i \mid \theta_A^{L_k}(x_i) \subseteq X_j,\ x_i \in U\}, \tag{18}$$

$$\overline{N_A}L_k = \bigcup_{j=1}^{N_k} \overline{N_A}X_j = \bigcup_{j=1}^{N_k} \{x_i \mid \theta_A^{L_k}(x_i) \cap X_j \ne \emptyset,\ x_i \in U\}. \tag{19}$$

The boundary region of label set L with respect to feature subset A is defined as

$$BN_A(L) = \bigcup_{k=1}^{M} BN_A(L_k) = \overline{N_A}L - \underline{N_A}L. \tag{20}$$
BNA (L) is the subset of instances whose neighborhood instances come from multiple decision classes with respect to different labels. Example 3. Approximations are demonstrated as shown in Figs. 2 and 3. In Fig. 2, instances are granulated into some mutually exclusive equivalence information granules. In this case, the neighborhood granule θ Lk (x ) = {y|(x, y ) = 0, y ∈ U }, i.e., dLk (x ) = 0. In Fig. 3, we take instance x1 , x2 , and x3 as examples, and assign spherical neighborhoods to these instances, where Lk1 is labeled with red “◦” and Lk2 is labeled with blue “”. We can find θ Lk (x1 ) ⊆ Lk1 , while θ Lk (x3 ) ∩ Lk1 = ∅ and θ Lk (x3 ) ∩ Lk2 = ∅.
Fig. 4. Two special situations of class-imbalance.
According to the above definitions, x1 ∈ $\underline{N}$L_k and x3 ∈ BN(L_k). As a whole, if the maximum-nearest-neighbor of a given instance is its nearest miss, the instance belongs to the positive region; on the contrary, if the maximum-nearest-neighbor of a given instance is its nearest hit, the instance belongs to the boundary region. In addition, one particular scenario occurs in Fig. 3(b): the nearest miss and the nearest hit of instance x2 fall on the same sphere, i.e., the neighborhood granule of x2 contains only itself, θ^{L_k}(x2) = {x2}. In this case it is difficult to classify x2 determinately, so we assign x2 to the boundary region of the decision.

Motivated by Example 3, the new definition of neighborhood granule is an ideal tool for computing the positive region. To establish the relationship between the positive region and the neighborhood granule of an instance, we introduce the margin of an instance [26]. Given an instance x and a label L_k ∈ L, the margin of x with respect to L_k is defined as margin^{L_k}(x) = Δ(x, NS^{L_k}(x)) − Δ(x, NT^{L_k}(x)). Generally speaking, the margin can be positive, negative, or zero. It is positive when the maximum-nearest-neighbor of x is its nearest miss, which indicates that the instances in the neighborhood of x have the same label, i.e., x ∈ $\underline{N}$L_k; for instance, for x1 in Table 1 under L1, margin^{L1}(x1) = 0.08 − 0.05 = 0.03 > 0. It is negative when the maximum-nearest-neighbor of x is its nearest hit, which implies that instances in the neighborhood granule of x come from different labels, i.e., x ∈ BN(L_k). Correspondingly, it is zero when the nearest miss and the nearest hit of x are equidistant from x, and then x cannot be determinately classified, i.e., x ∈ BN(L_k). In addition, since multi-label learning tasks usually encounter the problem of class imbalance, two special situations may exist in the label distribution. One is that the nearest hit of x under L_k does not exist, as shown in Fig. 4(a); in this case, we treat x as a noisy instance and set margin^{L_k}(x) = 0. The other is that the nearest miss of x under L_k cannot be found, as shown in Fig. 4(b); in this situation, x can be regarded as determinately classified, and we set margin^{L_k}(x) = 1. Based on the above observations, the following theorem can be presented.

Theorem 2. Given MDS = ⟨U, F ∪ L, θ^L⟩, where U is described by features F and L is a set of labels, for ∀x ∈ U, A ⊆ F, and L_k ∈ L, where X_1, X_2, ..., X_{N_k} are the instance subsets with decisions 1 to N_k under label L_k, we have
$$\underline{N_A}L_k = \{x \mid margin_A^{L_k}(x) > 0\}. \tag{21}$$
Proof. Given an instance x ∈ U, let x ∈ $\underline{N_A}$L_k. Then ∃ j ≤ N_k such that θ_A^{L_k}(x) ⊆ X_j, and we have Δ(x, NS_A^{L_k}(x)) > Δ(x, NT_A^{L_k}(x)); according to the definition of margin, margin_A^{L_k}(x) > 0. Conversely, if margin_A^{L_k}(x) > 0, then Δ(x, NS_A^{L_k}(x)) > Δ(x, NT_A^{L_k}(x)), and we have θ_A^{L_k}(x) = {y | Δ(x, y) < Δ(x, NS_A^{L_k}(x))}, which yields θ_A^{L_k}(x) ⊆ X_j. Therefore, x ∈ $\underline{N_A}$L_k. □
4.3. Positive region computation algorithm

Based on Theorem 2, we can present a new positive region computation algorithm (N-POS, Algorithm 1).

Algorithm 1 Positive region computation algorithm (N-POS).
Input: U: instance set, and x ∈ U; A: feature set; L: label set, and L_k ∈ L.
Output: margin_A^{L_k}(x): the margin of instance x.
1: Compute the Euclidean distance D between x and the other instances in U with respect to feature set A; sort D in ascending order and record the corresponding labels as L_k(x).
2: FH ← 0; FM ← 0; a ← 2.
3: while FH ≠ 1 or FM ≠ 1 do
4:   if L_k(x)(a) = L_k(x) and FH = 0 then
5:     Find out NT(x), and FH ← 1;
6:   end if
7:   if L_k(x)(a) ≠ L_k(x) and FM = 0 then
8:     Find out NS(x), and FM ← 1;
9:   end if
10:  a ← a + 1.
11:  if a = |U| + 1 then
12:    if FH = 0 then
13:      FH ← 1; margin_A^{L_k}(x) = 0, and go to Step 21;
14:    end if
15:    if FM = 0 then
16:      FM ← 1; margin_A^{L_k}(x) = 1, and go to Step 21;
17:    end if
18:  end if
19: end while
20: margin_A^{L_k}(x) = D(NS(x)) − D(NT(x)).
21: return margin_A^{L_k}(x);

In the N-POS algorithm, the inputs are U, A, and a label L_k of the current decision system, and the output is the margin margin_A^{L_k}(x) of instance x. If margin_A^{L_k}(x) is greater than 0, then x is an element of POS_A(L_k); otherwise, x is not an element of POS_A(L_k). In Algorithm 1, Steps 4–9 find NS(x) and NT(x), which are used to calculate the margin of x. As special cases, if the nearest miss or the nearest hit cannot be found after going through all instances, Steps 11–18 are executed: if the nearest hit cannot be found, margin_A^{L_k}(x) is set to 0 in Step 13; conversely, if the nearest miss cannot be found, margin_A^{L_k}(x) is set to 1 in Step 16. Assume that |U| is the number of instances in U; the time complexity of the sorting technique is O(|U| · log|U|). In our experiments, the time complexity of finding the nearest miss and the nearest hit is O(1). Therefore, the time complexity of computing the positive region is O(|U| · log|U|).
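A compact Python rendering of the margin computed by N-POS follows (ours; the paper's experiments use MATLAB). For clarity it takes minima directly instead of sorting, and it includes the two class-imbalance special cases of Fig. 4.

```python
import numpy as np

def n_pos_margin(X, Lk, A, i):
    # X: n x d data, Lk: +1/-1 labels under one label, A: feature indices.
    d = np.linalg.norm(X[:, A] - X[i, A], axis=1)
    same = (Lk == Lk[i]) & (np.arange(len(X)) != i)
    diff = Lk != Lk[i]
    if not same.any():
        return 0.0   # no nearest hit exists: treat x as a noisy instance
    if not diff.any():
        return 1.0   # no nearest miss exists: x is classified determinately
    # margin = Delta(x, NS(x)) - Delta(x, NT(x)); positive => x in POS_A(Lk)
    return d[diff].min() - d[same].min()
```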
5. Online multi-label streaming feature selection via approximating neighborhood rough set

In this section, we propose a framework for online multi-label streaming feature selection which consists of two phases: online importance selection and online redundancy update. In the following subsections, we give the details of the proposed method.

5.1. Online importance selection

To measure the significance of a feature relative to the selected features and the label set, the positive region can be employed, as it reflects the discernibility of a feature: the greater the positive region, the stronger the recognition power of the feature. The dependency of a feature in a multi-label decision system can then be defined as follows.

Definition 10. Given a multi-label neighborhood decision system MDS = ⟨U, F ∪ L, θ^L⟩, where U is described by a feature space F and L = {L1, L2, ..., LM} is the label set, let POS_A(L_k) denote the positive region of label L_k with respect to feature subset A ⊆ F, L_k ∈ L. The dependency degree of L on A is defined as

$$\gamma_A(L) = \sum_{k=1}^{M} \frac{|POS_A(L_k)|}{M \times |U|}. \tag{22}$$

γ_A(L) reflects the ability of A to approximate the label set L. As POS_A(L_k) ⊆ U, we have γ_A(L) ∈ [0, 1]. We say that the label set L is completely dependent on A if γ_A(L) = 1; otherwise, L depends on A in the degree γ_A(L).

Definition 11. Given MDS = ⟨U, F ∪ L, θ^L⟩, A ⊆ F, feature subset A is a relative reduct if: (1) sufficient condition: γ_A(L) = γ_F(L); (2) necessary condition: ∀a ∈ A, γ_{A−a}(L) < γ_A(L).

The aim of online feature selection is to evaluate features in real time and maintain an optimal feature subset from the features seen so far. Benefiting from the dependency function, we can evaluate a newly arrived feature by Definition 12.

Definition 12. Given label set L, let S_{t_{i−1}} be the selected feature subset at time t_{i−1} and F_i a new feature arriving at time t_i. With the inclusion of a "good" feature, the dependency function should be enlarged when F_i is added to S_{t_{i−1}}. The significance of feature F_i is defined as:

$$Sig(F_i, S_{t_{i-1}}, L) = \gamma_{S_{t_{i-1}} \cup F_i}(L) - \gamma_{S_{t_{i-1}}}(L). \tag{23}$$

As γ_{S_{t_{i−1}}}(L) ∈ [0, 1] and γ_{S_{t_{i−1}} ∪ F_i}(L) ≥ γ_{S_{t_{i−1}}}(L), we have Sig(F_i, S_{t_{i−1}}, L) ∈ [0, 1]. The bigger Sig(F_i, S_{t_{i−1}}, L), the stronger the characterizing power of F_i. We say that F_i is superfluous relative to the currently selected features if Sig(F_i, S_{t_{i−1}}, L) = 0; otherwise, F_i has a positive impact on the currently selected features S_{t_{i−1}}.

According to Eq. (23), we can evaluate the significance of a feature and decide to either abandon or add it. However, this process tends toward a local optimum: the arrival order of the features strongly influences whether a new feature is selected. More specifically, if the previously arrived features have a high level of discriminative capacity, it is difficult for the following features to satisfy this condition. In addition, even if F_i is superfluous relative to the currently selected features, it is not necessarily useless, because F_i may be more important than its corresponding redundant features. Furthermore, F_i may interact positively with features that have not yet arrived. Therefore, we further execute online redundancy update.
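Under the same assumptions as the earlier sketches, Eqs. (22)–(23) can be written as follows, building on the n_pos_margin function above; Y is an n × M matrix of ±1 labels, and defining the dependency of an empty subset as 0 is our convention.

```python
def gamma(X, Y, A):
    # Multi-label dependency gamma_A(L), Eq. (22): the average fraction
    # of positive-margin instances over all M labels.
    n, M = Y.shape
    pos = sum(n_pos_margin(X, Y[:, k], A, i) > 0
              for k in range(M) for i in range(n))
    return pos / (M * n)

def sig(X, Y, S, f):
    # Significance of a newly arrived feature f given the selected
    # feature set S, Eq. (23).
    return gamma(X, Y, S + [f]) - (gamma(X, Y, S) if S else 0.0)
```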
5.2. Online redundancy update

In this part, we introduce online redundancy update, which obtains an optimal subset by reevaluating a newly arrived feature F_i that was identified as superfluous in the online importance selection phase. Reevaluating a feature can be divided
into three steps: (1) designing a criterion to identify whether the feature is important (importance analysis); (2) checking which feature is redundant with the new feature (redundancy recognition); and (3) deciding which features to retain (redundancy update). To filter out redundant features concretely, we use pairwise comparisons to calculate online the correlations between features under all labels.

Definition 13 (Importance analysis). Assume S_{t_{i−1}} is the selected feature subset at time t_{i−1}, F_i is a new feature at time t_i, and an importance threshold δ (0 ≤ δ ≤ 1) is given. If γ_{F_i}(L) > δ, F_i is identified as an important feature with respect to the label set; otherwise, F_i is abandoned as a nonsignificant feature and will no longer be considered.

Definition 14 (Redundancy recognition). Assume S_{t_{i−1}} is the selected feature subset at time t_{i−1}. If ∃F_k ∈ S_{t_{i−1}} such that Sig(F_i, F_k, L) = 0, it testifies that adding F_i alone to F_k does not enhance the predictive capability of F_k; that is, F_k is redundant with F_i.

Lemma 1. With the currently selected feature subset S_{t_{i−1}} and a new feature F_i at time t_i, if ∃F_k ∈ S_{t_{i−1}} such that Sig(F_i, F_k, L) = 0, then γ_{{F_i, F_k}}(L) = γ_{F_k}(L).

Proof. According to the definition of the significance of a feature, we have Sig(F_i, F_k, L) = γ_{{F_i, F_k}}(L) − γ_{F_k}(L). If Sig(F_i, F_k, L) = 0 holds, then γ_{{F_i, F_k}}(L) − γ_{F_k}(L) = 0. Therefore, we obtain the bound between F_i and F_k, which yields γ_{{F_i, F_k}}(L) = γ_{F_k}(L). □

Lemma 1 presents a correlation bound between features to test whether a feature within the currently selected subset is redundant with the new feature F_i. We then further analyze whether F_i should be added into S_{t_{i−1}} or F_k should be preserved; the criterion is defined as follows.

Definition 15 (Redundancy update). Assume S_{t_{i−1}} is the selected feature subset at time t_{i−1}, F_k ∈ S_{t_{i−1}}, and F_i is a new feature at time t_i. If Sig(F_i, F_k, L) = 0 holds, then F_i should be added into S_{t_{i−1}} if γ_{F_i}(L) > γ_{F_k}(L); otherwise, F_k should be preserved if γ_{F_i}(L) ≤ γ_{F_k}(L).

Theorem 3. With the currently selected feature subset S_{t_{i−1}}, ∃F_k ∈ S_{t_{i−1}}, and a new feature F_i at time t_i, if γ_{F_i}(L) > δ holds, then the new feature F_i will be selected if the following criterion is satisfied:

$$\gamma_{\{F_i, F_k\}}(L) = \gamma_{F_k}(L) \ \text{ and } \ \gamma_{F_i}(L) > \gamma_{F_k}(L). \tag{24}$$

Proof. With Definition 15 and Lemma 1, Theorem 3 is proved. □

Similar to Theorem 3, if the feature within S_{t_{i−1}} is preserved, the following theorem is obtained.

Theorem 4. With the currently selected feature subset S_{t_{i−1}}, ∃F_k ∈ S_{t_{i−1}}, and a new feature F_i at time t_i, if γ_{F_i}(L) > δ holds, then F_k should be preserved if the following criterion is satisfied:

$$\gamma_{\{F_i, F_k\}}(L) = \gamma_{F_k}(L) \ \text{ and } \ \gamma_{F_i}(L) \le \gamma_{F_k}(L). \tag{25}$$

Proof. With Definition 15 and Lemma 1, Theorem 4 is proved. □

After online redundancy update, we can remove redundant features and mitigate the drawback that the arrival order of features has a decisive influence on the result. Combining online importance selection and online redundancy update, the algorithm of online multi-label streaming feature selection based on neighborhood rough set can be constructed.

5.3. Online multi-label streaming feature selection based on neighborhood rough set (OM-NRS)

Based on online importance selection and online redundancy update, a diagram of the streaming feature selection framework is given in Fig. 5. In this framework, streaming features are simulated using a training set with a known feature size; the candidate feature set is constructed from the whole feature set of the training set, and each streaming feature is generated from the candidate feature set. Under the framework described in Fig. 5, we present the OM-NRS algorithm in detail in Algorithm 2.

Fig. 5. A diagram of streaming feature selection framework.

Algorithm 2 Online multi-label streaming feature selection based on neighborhood rough set (OM-NRS).
Input: F_i: predictive features; L: label set; δ: a relevance threshold (0 ≤ δ ≤ 1); S_{t_{i−1}}: the selected feature set at time t_{i−1}.
Output: S_{t_i}: the selected feature set at time t_i.
1: repeat
2:   /* online importance selection */
3:   Get a new feature F_i at time t_i;
4:   Compute Sig(F_i, S_{t_{i−1}}, L) by combining Algorithm 1 and Eq. (23).
5:   if S_{t_{i−1}} = ∅ then
6:     if γ_{F_i}(L) < δ then
7:       Discard F_i, and go to Step 30;
8:     else
9:       S_{t_i} = S_{t_{i−1}} ∪ F_i;
10:    end if
11:  else
12:    if Sig(F_i, S_{t_{i−1}}, L) > 0 then
13:      S_{t_i} = S_{t_{i−1}} ∪ F_i, and go to Step 30;
14:    else
15:      /* online redundancy update */
16:      if γ_{F_i}(L) < δ then
17:        Discard F_i, and go to Step 30;
18:      else
19:        while ∃F_k ∈ S_{t_{i−1}} do
20:          if γ_{{F_i,F_k}}(L) = γ_{F_k}(L) and γ_{F_i}(L) ≤ γ_{F_k}(L) then
21:            Discard F_i, and go to Step 30;
22:          end if
23:          if γ_{{F_i,F_k}}(L) = γ_{F_k}(L) and γ_{F_i}(L) > γ_{F_k}(L) then
24:            S_{t_{i−1}} = S_{t_{i−1}} − F_k, and S_{t_i} = S_{t_{i−1}} ∪ F_i;
25:          end if
26:        end while
27:      end if
28:    end if
29:  end if
30: until no features are available;
31: Output S_{t_i};
In the online importance selection phase, when S_{t_{i−1}} = ∅ holds at Step 5 and a new feature F_i arrives at time t_i, if γ_{F_i}(L) < δ holds at Step 6, then F_i is abandoned as an irrelevant feature and OM-NRS waits for the next coming feature.
Table 2. Characteristics of multi-label datasets.

Dataset      Instances   Features   Labels   Training   Test   Card     Density
Arts         5000        462        26       2000       3000   1.636    0.063
Business     5000        438        30       2000       3000   1.470    0.074
Computer     5000        681        33       2000       3000   1.508    0.046
Education    5000        550        33       2000       3000   1.461    0.044
Emotions     593         72         6        391        202    1.869    0.311
Enron        1702        1001       53       1123       579    3.3784   0.064
Recreation   5000        606        22       2000       3000   1.423    0.065
Reference    5000        793        33       2000       3000   1.169    0.035
Science      5000        743        40       2000       3000   1.451    0.036
Yeast        2417        103        14       1499       918    4.238    0.303
If not, F_i will be added to S_{t_{i−1}}. On the other hand, when S_{t_{i−1}} is not empty, OM-NRS evaluates at Step 12 whether F_i should be added to the current feature set S_{t_{i−1}}: if Sig(F_i, S_{t_{i−1}}, L) > 0 holds, F_i is added to S_{t_{i−1}}; otherwise, there exists a feature F_k in S_{t_{i−1}} that is redundant with F_i, and it is necessary to further evaluate the redundancy between F_i and F_k.

In the online redundancy update phase, S_{t_{i−1}} is checked for features that are redundant with F_i. At time t_i, if γ_{F_i}(L) < δ holds at Step 16, then F_i is abandoned as a weakly relevant feature; if not, OM-NRS evaluates at Steps 19–21 whether F_i should be kept. If ∃F_k ∈ S_{t_{i−1}} such that Eq. (25) holds, F_i is abandoned and no longer considered. Otherwise, S_{t_{i−1}} is checked for features that should be removed due to the inclusion of the new feature F_i: at Step 23, if ∃F_k ∈ S_{t_{i−1}} such that Eq. (24) holds, F_k is removed.

The major computation in OM-NRS is the computation of the dependency between features. At time t_i, let |S_{t_{i−1}}| be the size of the currently selected feature set and |L| the number of labels. In the best case, OM-NRS achieves a good subset after running only online importance selection, whose time complexity is O(|L| · |U| · log|U|). However, OM-NRS is not always so simple in practice, and online redundancy update must be executed. The time complexity of online redundancy update relies on computing the dependency between features; in the worst case, all the selected features must be examined when processing F_i, and the time complexity is O(|S_{t_{i−1}}| · |L| · |U| · log|U|).
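To make the control flow concrete, here is a minimal Python sketch (ours, not the authors' MATLAB implementation) of Algorithm 2, built on the gamma and sig sketches above; exact equality stands in for the dependency bound of Lemma 1, where a floating-point tolerance would be used in practice.

```python
def om_nrs(stream, X, Y, delta=0.1):
    # stream yields feature indices one at a time; S is the selected set.
    S = []
    for f in stream:                        # a new feature F_i at time t_i
        if not S:                           # online importance selection
            if gamma(X, Y, [f]) >= delta:
                S.append(f)
            continue
        if sig(X, Y, S, f) > 0:             # F_i raises the dependency
            S.append(f)
            continue
        if gamma(X, Y, [f]) < delta:        # weakly relevant: discard
            continue
        keep, removed = True, []            # online redundancy update
        for k in S:
            if gamma(X, Y, [f, k]) == gamma(X, Y, [k]):
                if gamma(X, Y, [f]) > gamma(X, Y, [k]):
                    removed.append(k)       # Eq. (24): F_k is superseded
                else:
                    keep = False            # Eq. (25): keep F_k, drop F_i
                    break
        if keep and removed:
            S = [s for s in S if s not in removed] + [f]
    return S
```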
6. Experiments

6.1. Datasets

To validate the performance of the proposed OM-NRS, we use ten benchmark datasets from diverse application domains as testbeds. Arts, Business, Computer, Education, Recreation, Reference, and Science are from Yahoo and are widely used for Web page text categorization. Enron is a text dataset containing 1702 email documents, each belonging to at least one of 53 labels. Emotions is a music dataset comprising 593 instances, each associated with some of 6 labels. Yeast is used to predict gene functional classes and includes 2417 instances, where each instance denotes a gene, with 14 possible labels. Table 2 displays the standard statistics of the ten multi-label datasets: the numbers of instances, features, labels, training instances, and test instances, as well as the label cardinality and label density. Label cardinality (Card) denotes the average number of labels per instance, and label density (Density) normalizes the label cardinality by the number of labels. In our experiments, we study the performance of streaming feature selection methods under the situation where the whole feature set is unknown beforehand; we therefore use these ten benchmark datasets in Table 2 as testbeds, letting the features of the training data arrive one at a time to simulate streaming features.

6.2. Experimental configuration

In this subsection, we compare the proposed OM-NRS with several state-of-the-art multi-label feature selection methods, including MLNB [49], MDDM [50], MFNMI [26], RF-ML [35], PMU [13], and OMGFS [20]. These baselines reflect the effectiveness of feature selection from different perspectives and were all presented in recent years. The proposed method assumes that features flow in one by one and maintains an optimal feature subset from the features seen so far. Unlike OM-NRS, the first five comparison algorithms assume that all features are obtained before feature selection takes place, an assumption usually violated in real-world applications. The last one, OMGFS, assumes that features flow in by groups to simulate streaming features and processes each group individually. The detailed information of the six comparison algorithms is listed below.

• MLNB [49]: obtains an optimal feature subset by combining feature extraction based on principal component analysis with feature subset selection based on genetic algorithms. The threshold parameters smooth and ratio are set to 1 and 0.3, as suggested in the literature.
• MDDM [50]: identifies a lower-dimensional feature space by maximizing the dependence between the original feature description and the class labels. With different projection strategies, MDDM divides into MDDMproj and MDDMspc. The regularization parameter μ is set to the default value 0.5.
• MFNMI [26]: generalizes neighborhood information entropy to fit multi-label learning and employs an optimization objective function to evaluate the quality of features. Here, the neighborhood is set as the pessimistic neighborhood.
• RF-ML [35]: extends the single-label feature selection algorithm ReliefF to deal with multi-label data directly, without any data transformation.
• PMU [13]: considers the mutual information between the selected features and the label set. Here, continuous features are discretized into two bins using an equal-width strategy, and categorical features remain unchanged.
• OMGFS [20]: manages streaming features by online group analysis. Here, the relevance threshold and group dimension are set to 0.005 and 3, respectively, and the buffer pool is set to hold three groups of features.
In the proposed OM-NRS, the significance level δ is set to 0.1. Furthermore, for fair comparison, five evaluation criteria, including Average Precision, Ranking Loss, Coverage, One-error, and Hamming Loss, are selected to evaluate the performance of all algorithms. Note that the five criteria are designed from different evaluation perspectives, and few algorithms can obtain the best performance on all of them simultaneously. Finally, the performance of all multi-label feature selection algorithms is evaluated using the MLKNN (K = 10) classifier [50].
Fig. 6. Number of selected features (the labels of the x-axis from 1 to 10 denote the datasets: 1. Arts; 2. Business; 3. Computer; 4. Education; 5. Emotions; 6. Enron; 7. Recreation; 8. Reference; 9. Science; 10. Yeast).

Table 3. Comparative evaluation among 7 feature selection methods in terms of Average Precision (↑).

Data sets    MLNB     MDDMspc   MDDMproj   MFNMI    PMU      RF-ML    OM-NRS
Arts         0.4991   0.5090    0.4954     0.5086   0.4975   0.4823   0.5217
Business     0.8713   0.8685    0.8678     0.8652   0.8768   0.8725   0.8744
Computer     0.6391   0.6299    0.6258     0.6241   0.6319   0.6283   0.6418
Education    0.5478   0.5307    0.5192     0.5119   0.5489   0.5328   0.5547
Emotions     0.7529   0.7855    0.7791     0.7435   0.7483   0.7626   0.7785
Enron        0.6242   0.6335    0.6179     0.6311   0.6344   0.6362   0.6402
Recreation   0.4790   0.4582    0.4637     0.4994   0.4417   0.4631   0.4991
Reference    0.6234   0.6126    0.6106     0.6187   0.6169   0.6151   0.6336
Science      0.4613   0.4475    0.4370     0.4458   0.4535   0.4675   0.4766
Yeast        0.7355   0.7370    0.7416     0.7401   0.7488   0.7504   0.7545
Average      0.6234   0.6212    0.6158     0.6188   0.6199   0.6214   0.6375

Table 4. Comparative evaluation among 7 feature selection methods in terms of Ranking Loss (↓).

Data sets    MLNB     MDDMspc   MDDMproj   MFNMI    PMU      RF-ML    OM-NRS
Arts         0.1542   0.1515    0.1559     0.1525   0.1532   0.1537   0.1440
Business     0.0419   0.0438    0.0439     0.0452   0.0401   0.0421   0.0411
Computer     0.0910   0.0927    0.0939     0.0968   0.0942   0.0932   0.0891
Education    0.0922   0.0934    0.0972     0.0988   0.0924   0.0940   0.0912
Emotions     0.2055   0.1869    0.1885     0.2231   0.2072   0.1950   0.1765
Enron        0.0937   0.0969    0.0976     0.0946   0.0942   0.0928   0.0925
Recreation   0.1879   0.1910    0.1888     0.1798   0.1942   0.1868   0.1788
Reference    0.0889   0.0888    0.0889     0.0874   0.0868   0.0856   0.0854
Science      0.1364   0.1401    0.1429     0.1429   0.1372   0.1387   0.1349
Yeast        0.1871   0.1857    0.1843     0.1865   0.1804   0.1781   0.1732
Average      0.1279   0.1271    0.1282     0.1308   0.1280   0.1260   0.1207
6.3. Experimental results

6.3.1. Comparison with state-of-the-art multi-label feature selection methods

(1) Evaluation of predictive performance: To demonstrate the effectiveness of OM-NRS, we compare our algorithm with MLNB, MDDMproj, MDDMspc, MFNMI, RF-ML, and PMU in terms of predictive classification performance. Fig. 6 shows the numbers of features selected by MLNB and OM-NRS. As MDDMproj, MDDMspc, MFNMI, RF-ML, and PMU produce a ranking over all features, we select the same number of top-ranked features as determined by OM-NRS as the feature subset. Tables 3–7 report the detailed experimental results of all algorithms on the protein, music, and text categorization datasets. For the evaluation criteria, the symbol "↓" indicates "the smaller the better", while the symbol "↑" denotes "the larger the better". In addition, the best predictive performance for each dataset and the average performance of each algorithm are shown in bold and italics, respectively.

Based on the experimental results shown in Tables 3–7 and Fig. 6, the following observations can be made: (1) OM-NRS obtains better compactness than MLNB on all multi-label datasets except Emotions and Yeast. (2) For Coverage and Ranking Loss, OM-NRS obtains the best performance on at least nine of the multi-label datasets, and its predictive performance is extremely close to the best on the remaining dataset. (3) For Average Precision, Hamming Loss, and One-error, the predictive performance of OM-NRS is much better than that of all comparison algorithms on at least seven multi-label datasets; on the other datasets, the results of OM-NRS remain fairly strong and rank 2nd in 66.67% of the cases. (4) Furthermore, the average classification performance of OM-NRS significantly outperforms all comparison algorithms on the different evaluation metrics. These experimental results demonstrate that OM-NRS tends to work better than the other baselines.
Table 5. Comparative evaluation among 7 feature selection methods in terms of Coverage (↓).

Data sets    MLNB     MDDMspc   MDDMproj   MFNMI     PMU      RF-ML     OM-NRS
Arts         5.5040   5.4470    5.5737     5.4983    5.5147   5.4903    5.2543
Business     2.3483   2.4183    2.4053     2.4420    2.3097   2.3417    2.2953
Computer     4.3740   4.4173    4.4693     4.5897    4.5130   4.4613    4.3140
Education    3.9183   3.9687    4.0867     4.1410    3.9300   4.0020    3.8903
Emotions     2.0743   2.0644    2.0347     2.2327    2.0990   2.0347    1.9208
Enron        13.1831  13.5561   13.5147    13.2746   13.2470  13.0466   13.1002
Recreation   4.9953   5.0613    5.0090     4.8823    5.1433   4.9650    4.8090
Reference    3.4313   3.4390    3.4460     3.3878    3.3660   3.3270    3.3173
Science      6.8367   6.9897    7.1267     7.1413    6.8943   6.9453    6.8120
Yeast        6.6928   6.5904    6.5839     6.6961    6.5076   6.4847    6.4020
Average      5.3358   5.3952    5.4250     5.4286    5.3525   5.3099    5.2115

Table 6. Comparative evaluation among 7 feature selection methods in terms of One-error (↓).

Data sets    MLNB     MDDMspc   MDDMproj   MFNMI    PMU      RF-ML    OM-NRS
Arts         0.6433   0.6287    0.6470     0.6270   0.6457   0.6737   0.6153
Business     0.1317   0.1327    0.1333     0.1347   0.1250   0.1283   0.1247
Computer     0.4320   0.4480    0.4537     0.4507   0.4393   0.4443   0.4297
Education    0.5827   0.6203    0.6313     0.6423   0.5873   0.6130   0.5810
Emotions     0.3762   0.2822    0.2970     0.3713   0.3614   0.3416   0.3416
Enron        0.3161   0.2832    0.3178     0.2939   0.2798   0.2936   0.2729
Recreation   0.6643   0.7000    0.6930     0.6343   0.7220   0.6917   0.6413
Reference    0.4703   0.4843    0.4887     0.4802   0.4867   0.4930   0.4583
Science      0.6713   0.6937    0.7003     0.6923   0.6833   0.6513   0.6543
Yeast        0.2560   0.2614    0.2505     0.2451   0.2407   0.2505   0.2386
Average      0.4544   0.4535    0.4613     0.4572   0.4571   0.4581   0.4358

Table 7. Comparative evaluation among 7 feature selection methods in terms of Hamming Loss (↓).

Data sets    MLNB     MDDMspc   MDDMproj   MFNMI    PMU      RF-ML    OM-NRS
Arts         0.0612   0.0605    0.0611     0.0600   0.0613   0.0624   0.0606
Business     0.0283   0.0281    0.0284     0.0286   0.0275   0.0280   0.0274
Computer     0.0401   0.0408    0.0411     0.0404   0.0402   0.0415   0.0397
Education    0.0405   0.0426    0.0429     0.0434   0.0408   0.0424   0.0408
Emotions     0.2450   0.2112    0.2327     0.2574   0.2360   0.2401   0.2104
Enron        0.0525   0.0522    0.0527     0.0521   0.0518   0.0524   0.0512
Recreation   0.0611   0.0626    0.0620     0.0601   0.0637   0.0636   0.0595
Reference    0.0296   0.0322    0.0311     0.0313   0.0306   0.0345   0.0296
Science      0.0346   0.0350    0.0355     0.0346   0.0351   0.0344   0.0338
Yeast        0.2080   0.2084    0.2106     0.2110   0.2027   0.2058   0.2021
Average      0.0801   0.0774    0.0798     0.0819   0.0790   0.0805   0.0755
(2) Statistical test: To explore the statistical significance among the comparison algorithms systematically, we perform a nonparametric Friedman test [5], which is widely accepted for statistically comparing multiple algorithms over many datasets. Given k comparison algorithms and N multi-label datasets, let $R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j$ be the average rank of the jth algorithm over all datasets, where $r_i^j$ is the rank of algorithm j on the ith dataset. Under the null hypothesis (i.e., the ranks of all algorithms are equal), the Friedman statistic can be defined as

$$F_F = \frac{(N-1)\chi_F^2}{N(k-1)-\chi_F^2}, \quad \text{where } \chi_F^2 = \frac{12N}{k(k+1)}\left(\sum_{i=1}^{k} R_i^2 - \frac{k(k+1)^2}{4}\right), \tag{26}$$

and F_F is distributed according to the F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom. Table 8 summarizes the Friedman statistic F_F for each evaluation metric and the corresponding critical value. As shown in Table 8, the null hypothesis is clearly rejected on all evaluation metrics at significance level α = 0.10. A post-hoc test can then be employed to further determine which algorithms perform statistically differently. As we are particularly interested in comparing the proposed OM-NRS method against the other comparison algorithms, the Bonferroni–Dunn test [3] is used with OM-NRS as the control algorithm. The performance of OM-NRS and a comparison algorithm is identified as significantly different if the distance between their average ranks exceeds the following critical difference (CD):

$$CD_\alpha = q_\alpha \sqrt{\frac{k(k+1)}{6N}}. \tag{27}$$

For the Bonferroni–Dunn test, we have q_α = 2.394 at significance level α = 0.10, and thus CD = 2.3128 (k = 7, N = 10).

Table 8. Summary of the Friedman statistics F_F (k = 7, N = 10) for each evaluation metric and the critical value.

Evaluation metric    F_F       Critical value (α = 0.10)
Average Precision    5.3508    1.87
Ranking Loss         8.2780    1.87
Coverage             10.3772   1.87
One-error            4.8576    1.87
Hamming Loss         4.9419    1.87
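Eqs. (26)–(27) are easy to verify numerically; the following sketch (ours) reproduces the critical difference quoted in the text.

```python
from math import sqrt

def friedman_ff(ranks, k, N):
    # ranks: average ranks R_j of the k algorithms over N datasets
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in ranks)
                                     - k * (k + 1) ** 2 / 4)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)      # Eq. (26)

def critical_difference(q_alpha, k, N):
    return q_alpha * sqrt(k * (k + 1) / (6 * N))      # Eq. (27)

print(critical_difference(2.394, 7, 10))   # 2.3128..., as in the text
```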
Fig. 7. Comparison of OM-NRS against other comparison algorithms with the Bonferroni–Dunn test.
Fig. 8. Spider web diagram showing the stability index values obtained on ten benchmark multi-label datasets with different evaluation metrics.
To visually show the relative performance of OM-NRS and the other six comparison algorithms, Fig. 7 displays the CD diagrams in terms of each evaluation metric, where the average rank of each comparison algorithm is plotted along the axis. The lowest rank is placed to the far right, since we treat the algorithm on the rightmost side as the best. In these subfigures, if the average rank of a comparison algorithm is within one CD of that of OM-NRS, OM-NRS obtains comparable performance to that algorithm; otherwise, if a comparison algorithm is not connected with OM-NRS by a thick line, the performance difference between them is significant. As shown in Fig. 7, we can conclude that: (1) OM-NRS significantly outperforms MDDMproj, MDDMspc, PMU, and MFNMI in terms of each evaluation metric. (2) OM-NRS achieves statistically superior or at least comparable performance against RF-ML on all evaluation metrics; note that RF-ML needs to collect the complete set of features before feature selection starts, while OM-NRS processes each feature upon its arrival. (3) OM-NRS obtains comparable performance against MLNB in terms of Hamming Loss; however, the compactness of OM-NRS is much better than that of MLNB, as shown in Fig. 6. To summarize, OM-NRS achieves highly competitive performance compared with several state-of-the-art multi-label feature selection algorithms.

(3) Stability analysis: To verify the stability of the different methods, we draw spider-web diagrams showing the stability index for each evaluation metric. As the predictive performance on different datasets under different evaluation metrics varies widely, we set a universal standard: the predictive performance values are normalized into [0.1, 0.5], and the stability index is represented by the normalized values. Fig. 8 shows the stability index for Average Precision, Hamming Loss, Ranking Loss, Coverage, and One-error, respectively, where the red line denotes the stability values of the proposed algorithm. From Fig. 8, we observe that: (1) For Ranking Loss and Coverage, the shape of OM-NRS is very close to a regular decagon, which means OM-NRS obtains a more stable solution. (2) For Average Precision and Hamming Loss, OM-NRS achieves a stable solution on seven datasets, and the variation of its stability values stays within 0.1, far less than for the other comparison algorithms. (3) For One-error, OM-NRS comes closer to the regular decagon than the other five algorithms, except on the Emotions dataset. The results in Fig. 8 demonstrate that OM-NRS tends to work more stably.

(4) Runtime analysis: To show the efficiency of OM-NRS, we conduct experiments comparing the execution times of the algorithms. Since RF-ML is implemented in Java and our algorithm is written in MATLAB, a direct time comparison between RF-ML and OM-NRS is inappropriate. On the other hand, as MDDM and
(4) Runtime Analysis: To show the efficiency of OM-NRS, we conduct experiments comparing the execution times of the algorithms. Since RF-ML is implemented in Java while our algorithm is written in MATLAB, a direct time comparison between RF-ML and OM-NRS is inappropriate. Likewise, as MDDM and MLNB are multi-label feature extraction algorithms, it would be unfair to compare them with OM-NRS directly. Thus, we give a time comparison among MFNMI, PMU, and OM-NRS.

Table 9 Runtime analysis (in seconds) among MFNMI, PMU, and OM-NRS.

Data sets    MFNMI      PMU     OM-NRS
Arts         176,425    2088    138,630
Business     158,737    2193    154,281
Computer     433,806    5129    213,562
Education    254,150    6146    197,024
Emotions     186        151     76
Enron        188,461    2330    49,410
Recreation   227,649    2743    53,673
Reference    396,101    6954    235,160
Science      456,330    8212    257,712
Yeast        2395       50      2178
MFNMI selects features by designing an optimization objective function; its time complexity is O(|U|·|L| + |U|·|U| + |S|·|F|), where |S| is the number of selected features and |F| is the number of all features. PMU maximizes the mutual information between the selected features and the label set; its time complexity is O(|S|·|L| + |L|·|L|). All experiments were conducted on a PC equipped with Windows 7, a 3.4 GHz CPU, and 32 GB of memory. The results in Table 9 show that: (1) OM-NRS is faster than MFNMI on all datasets. (2) PMU is faster than OM-NRS when the number of training instances is large, because the runtime of OM-NRS is dominated by the computation of the positive region and the online selection strategy, while PMU only considers the relation between features and labels. (3) The runtime of OM-NRS is significantly influenced by the number of training instances: since the calculation of the positive region depends on the number of training instances, the more training instances there are, the longer the system runs. For example, comparing the Enron and Arts datasets, OM-NRS is much more time-consuming on Arts because Arts has far more training instances, even though the numbers of features and labels of Arts are less than half those of Enron. In summary, when a dataset contains many samples, OM-NRS is time-consuming; however, when the set of streaming features is huge and the number of training instances is moderate, OM-NRS is time-efficient.

6.3.2. Comparison with the OMGFS algorithm

To further verify the effectiveness of OM-NRS, we compare its performance against the OMGFS algorithm, a recently proposed online multi-label streaming feature selection method. Due to space limitations, we select Average Precision, Ranking Loss, and Coverage to evaluate the performance of OM-NRS.
Table 10 Predictive performance of OMGFS and OM-NRS.

             Average Precision    Ranking Loss         Coverage
Data sets    OMGFS     OM-NRS     OMGFS     OM-NRS     OMGFS      OM-NRS
Arts         0.5278    0.5217     0.1456    0.1440     5.3037     5.2543
Business     0.8760    0.8744     0.0413    0.0411     2.3260     2.2953
Computer     0.6414    0.6418     0.0901    0.0891     4.3640     4.3140
Education    0.5398    0.5547     0.0966    0.0912     4.0627     3.8903
Emotions     0.7608    0.7785     0.2047    0.1765     2.0941     1.9208
Enron        0.6336    0.6402     0.0937    0.0925     13.3092    13.1002
Recreation   0.5295    0.4991     0.1732    0.1788     4.7193     4.8090
Reference    0.6287    0.6336     0.0879    0.0854     3.4203     3.3173
Science      0.4799    0.4766     0.1333    0.1349     6.7860     6.8120
Yeast        0.7545    0.7545     0.1737    0.1732     6.4346     6.4020
Table 11 Characteristics of single-label data sets.

Data sets   Instances   Features   Classes
ICU         200         20         3
Sonar       208         60         2
Diabe       768         8          2
Mfeat       2000        649        10
Waveform    5000        21         2
Letter      20,000      16         26
TOX_171     171         5748       4
YALE        165         1024       15
Table 10 summarizes the prediction performance of OM-NRS and OMGFS. From Table 10, we can see that: (1) For Ranking Loss and Coverage, OM-NRS outperforms OMGFS on eight out of the ten datasets. (2) For Average Precision, OM-NRS achieves superior or at least comparable performance against OMGFS on six datasets and is inferior to OMGFS on the other four. The reason is that OM-NRS evaluates streaming features individually and thus ignores the inherent group structure of features present in some multi-label datasets. This implies that OM-NRS may select more features than OMGFS on such datasets, possibly including redundant features, which lowers the quality of the predicted labels. On the other hand, Average Precision evaluates the ranking quality of the labels for each instance; hence, for Average Precision, the prediction performance of OM-NRS is slightly lower than that of OMGFS on some multi-label datasets. Furthermore, as different criteria assess performance from different aspects, it is extremely difficult to achieve the best result consistently on all criteria. To summarize, OM-NRS provides highly competitive performance against OMGFS when facing streaming features.

6.3.3. Analysis of the properties of OM-NRS

(1) Analysis of neighborhood granularity: To verify the effectiveness of the maximum-nearest-neighbor strategy, we further compare the proposed OM-NRS with the NRS algorithm, one of the traditional granularity selection methods. Since NRS can only deal with single-label datasets, we restrict OM-NRS to a single label for an impartial comparison. Table 11 lists eight single-label datasets from different domains: the first six come from the UCI Repository of machine learning databases [1], TOX_171 is a microarray dataset [7], and YALE is a face image dataset [7]. After feature selection, we use the KNN (K = 10) classifier to evaluate the performance of the selected features; the classification accuracy is computed with 10-fold cross validation, and the average accuracy is taken as the final result.
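As an illustration of this evaluation protocol, the following sketch (using scikit-learn, which is our choice; `sel`, the index list of selected features, is a hypothetical input) computes the reported average accuracy:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_selection(X: np.ndarray, y: np.ndarray, sel) -> float:
    """10-fold cross-validated accuracy of KNN (K = 10) on the selected columns."""
    clf = KNeighborsClassifier(n_neighbors=10)
    scores = cross_val_score(clf, X[:, sel], y, cv=10)
    return scores.mean()  # the average accuracy is taken as the final result
```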
Table 12 reports the prediction accuracies of NRS and OM-NRS. As shown in Table 12, we can conclude that: (1) OM-NRS is insensitive to the neighborhood size, while NRS is sensitive to it; for example, on the first three datasets, the optimal neighborhood size for NRS is 0.1, 0.3, and 0.05, respectively. (2) OM-NRS is very competitive with NRS, and its performance matches or improves upon that of the raw datasets. (3) OM-NRS achieves better performance than NRS on the datasets with large numbers of samples. More specifically, on the Waveform and Letter datasets, NRS achieves performance comparable to the raw datasets in the interval [0.1, 0.15], but it produces no result in the interval [0.2, 0.3] because it selects an empty feature set; OM-NRS avoids this phenomenon and shows preferable performance on these two datasets. These observations justify the effectiveness of the neighborhood setting of OM-NRS.

(2) The influence of the relevance threshold: As shown in Algorithm 2, the only parameter that needs to be specified for OM-NRS is δ, which determines whether a feature is relevant. To show the influence of the relevance threshold δ, we run OM-NRS with different values of δ on the Arts, Enron, Emotions, and Yeast datasets, because these four datasets come from different application domains. Fig. 9 displays how the performance of OM-NRS changes with increasing δ in terms of each evaluation metric. From Fig. 9, we can observe that: (1) the relevance threshold does not impose a significant impact on OM-NRS as long as δ does not exceed 0.6; (2) OM-NRS performs slightly worse on Emotions and Yeast when δ is larger than 0.6. These observations justify the rationality of the relevance threshold setting in Section 6.2.

(3) The effect of incremental update: To further evaluate the effect of the incremental update, we conduct an empirical study in which Steps 15 through 27 of Algorithm 2 are removed; that is, only online importance selection is performed. We call this variant of OM-NRS the OM-NRS-D algorithm. Owing to space constraints, we select three datasets from different application domains to compare the classification performance of OM-NRS and its variant. Table 13 reports the detailed experimental results of OM-NRS and OM-NRS-D, which confirm that performing the incremental update via online redundancy update yields better or at least comparable performance than omitting it, in terms of each evaluation metric. In summary, the incremental update makes a definite contribution to our proposed method.

6.3.4. Comparison between OM-NRS and its offline version

To gain more insight into the proposed online solution, in this section we conduct experiments comparing OM-NRS with its offline version. The offline version, called the FM-NRS algorithm, is instantiated by maximizing the increment of dependency. More specifically, FM-NRS assumes that all features are obtained beforehand; it begins with an empty feature set and, in each round, adds the feature that maximizes the increment of dependency. The algorithm does not stop until the increment of dependency drops below zero.
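A schematic sketch of this greedy offline procedure is given below; `dependency`, which maps a feature subset to its multi-label neighborhood dependency degree, is treated as a black box (its definition follows the neighborhood rough set model introduced earlier and is not reproduced here):

```python
def fm_nrs(features, dependency):
    """Sketch of FM-NRS as described in the text: start from the empty set,
    in each round add the feature whose inclusion maximizes the increment
    of dependency, and stop once that increment drops below zero."""
    selected = []
    remaining = list(features)
    base = dependency(selected)
    while remaining:
        # FM-NRS evaluates every remaining feature in each round
        best = max(remaining, key=lambda f: dependency(selected + [f]))
        gain = dependency(selected + [best]) - base
        if gain < 0:  # increment of dependency below zero: stop
            break
        selected.append(best)
        remaining.remove(best)
        base += gain
    return selected
```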
Table 12 Prediction accuracies of NRS and OM-NRS.

                     NRS (neighborhood size)
Data sets   Raw      0.05     0.1      0.15     0.2      0.25     0.3      OM-NRS
ICU         0.8997   0.8471   0.8997   0.8787   0.8734   0.8682   0.8892   0.9261
Sonar       0.7164   0.7445   0.7555   0.7364   0.7124   0.7555   0.7748   0.7938
Diabe       0.7370   0.7513   0.7370   0.7370   0.7370   0.7370   0.6263   0.7631
Mfeat       0.9775   0.9390   0.9695   0.9550   0.9560   0.9575   0.9535   0.9780
Waveform    0.9387   0.9043   0.9064   0.9040   –        –        –        0.9408
Letter      0.9493   0.9263   0.9495   0.9495   –        –        –        0.9505
TOX_171     0.6186   0.4412   0.5092   0.5255   0.5504   0.4442   0.5626   0.6834
YALE        0.6100   0.4533   0.4467   0.3267   0.4167   0.4900   0.3100   0.5833
Fig. 9. Performance of OM-NRS in terms of each evaluation metric as the relevance threshold δ increases, on four benchmark multi-label data sets.

Table 13 Predictive performance of OM-NRS-D and OM-NRS.

                     Emotions               Yeast                  Science
Evaluation metrics   OM-NRS-D   OM-NRS      OM-NRS-D   OM-NRS      OM-NRS-D   OM-NRS
Average Precision    0.7570     0.7785      0.7449     0.7545      0.4765     0.4766
Ranking Loss         0.1961     0.1765      0.2063     0.2040      0.0338     0.0338
Coverage             2.0495     1.9208      6.5131     6.4020      6.8137     6.8120
Hamming Loss         0.2393     0.2104      0.1809     0.1732      0.1350     0.1348
One-error            0.3614     0.3416      0.2462     0.2386      0.6543     0.6543
Table 14 Running time (in seconds) and predictive performance of FM-NRS and OM-NRS.

            Running time           Average Precision    Ranking Loss         Coverage
Data sets   FM-NRS      OM-NRS     FM-NRS    OM-NRS     FM-NRS    OM-NRS     FM-NRS     OM-NRS
Arts        768,745     138,630    0.5191    0.5217     0.1476    0.1440     5.3597     5.2543
Business    875,010     154,281    0.8748    0.8744     0.0424    0.0411     2.3387     2.2953
Education   1,129,420   197,024    0.5505    0.5547     0.0934    0.0912     3.9770     3.8903
Emotions    105         76         0.7219    0.7785     0.2394    0.1765     2.2228     1.9208
Enron       1,479,669   49,410     0.6611    0.6402     0.0896    0.0925     12.9275    13.1002
Yeast       6815        2178       0.7447    0.7545     0.1790    0.1732     6.5076     6.4020
In addition, experiments are conducted on six datasets from different fields: Emotions, Yeast, Enron, Arts, Business, and Education. Table 14 compares the running time and the prediction performance in terms of Average Precision, Ranking Loss, and Coverage. In terms of running time, as shown in Table 14, OM-NRS is faster than FM-NRS. The reason is that FM-NRS selects the optimal feature by evaluating all features in each round, which significantly increases the running time, especially for high-dimensional feature vectors. In terms of prediction performance, the experimental results show that OM-NRS achieves higher prediction performance than FM-NRS on at least four of the six datasets with respect to the given evaluation metrics. These results validate that, even without requiring a complete set of features in advance, OM-NRS is very competitive compared with its offline version FM-NRS.
7. Conclusions

In this paper, we presented a multi-label streaming feature selection method based on neighborhood rough set. We first employed the maximum-nearest-neighbor to granulate all instances and generalized neighborhood rough set from single-label to multi-label learning. Then, we divided the proposed online multi-label streaming feature selection into two phases, i.e., online importance selection and online redundancy update. Our experimental study has shown that OM-NRS achieves highly competitive performance against other state-of-the-art competitors. In future work, it would be interesting to design other streaming feature selection strategies that consider the inherent structure of objects, and to investigate how to handle class imbalance and relative labeling-importance simultaneously when managing streaming features.
Acknowledgments

We are very grateful to the anonymous reviewers for their valuable comments and suggestions. This work is supported by grants from the National Natural Science Foundation of China (Nos. 61672272, 61303131, and 61673327) and the Natural Science Foundation of Fujian Province (2018J01548 and 2018J01572).
References

[1] K. Bache, M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2013. http://archive.ics.uci.edu/ml.
[2] G. Doquire, M. Verleysen, Feature selection for multi-label classification problems, Adv. Comput. Intell. (2011) 9–16.
[3] O. Dunn, Multiple comparisons among means, J. Am. Stat. Assoc. 56 (293) (1961) 52–64.
[4] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Advances in Neural Information Processing Systems, Cambridge, MA, 2001, pp. 681–687.
[5] M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat. 11 (1) (1940) 86–92.
[6] S. Eskandari, M. Javidi, Online streaming feature selection using rough sets, Int. J. Approx. Reason. 69 (2016) 35–57.
[7] Feature selection datasets, http://featureselection.asu.edu/datasets.php.
[8] Q. Gu, Z. Li, J. Han, Correlated multi-label feature selection, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ACM, 2011, pp. 1087–1096.
[9] H. Hotelling, Relations between two sets of variates, Biometrika 28 (3/4) (1936) 321–377.
[10] Q. Hu, D. Yu, J. Liu, C. Wu, Neighborhood rough set based heterogeneous feature subset selection, Inf. Sci. 178 (18) (2008) 3577–3594.
[11] Q. Hu, D. Yu, Z. Xie, Numerical attribute reduction based on neighborhood granulation and rough approximation, J. Softw. 19 (3) (2008) 640–649.
[12] J. Lee, D. Kim, SCLS: multi-label feature selection based on scalable criterion for large label set, Pattern Recognit. 66 (2017) 342–352.
[13] J. Lee, D. Kim, Feature selection for multi-label classification using multivariate mutual information, Pattern Recognit. Lett. 34 (3) (2013) 349–357.
[14] J. Lee, D. Kim, Memetic feature selection algorithm for multi-label classification, Inf. Sci. 293 (2015) 80–96.
[15] J. Lee, D. Kim, Efficient multi-label feature selection using entropy-based label selection, Entropy 18 (11) (2016) 3–26.
[16] J. Lee, D. Kim, Fast multi-label feature selection based on information-theoretic feature ranking, Pattern Recognit. 48 (9) (2015) 2761–2771.
[17] Y. Lei, J. Liu, J. Ye, Efficient methods for overlapping group Lasso, in: Advances in Neural Information Processing Systems, 2011, pp. 352–360.
[18] D. Lewis, Y. Yang, T. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (2004) 361–397.
[19] J. Liu, Y. Lin, M. Lin, S. Wu, J. Zhang, Feature selection based on quality of information, Neurocomputing 255 (2017) 11–22.
[20] J. Liu, Y. Lin, S. Wu, C. Wang, Online multi-label group feature selection, Knowl. Based Syst. 143 (2018) 42–57.
[21] F. Li, D. Miao, W. Pedrycz, Granular multi-label feature selection based on mutual information, Pattern Recognit. 67 (2017) 410–423.
[22] Y. Li, S. Wu, Y. Lin, J. Liu, Different classes ratio fuzzy rough set based robust feature selection, Knowl. Based Syst. 120 (2017) 74–86.
[23] L. Li, H. Liu, Z. Ma, Y. Mo, Z. Duan, J. Zhou, J. Zhao, Multi-label feature selection via information gain, in: International Conference on Advanced Data Mining and Applications, Springer International Publishing, 2014, pp. 345–355.
[24] Y. Lin, J. Li, P. Lin, G. Lin, J. Chen, Feature selection via neighborhood multi-granulation fusion, Knowl. Based Syst. 67 (2014) 162–168.
[25] Y. Lin, Q. Hu, J. Liu, J. Duan, Multi-label feature selection based on max-dependency and min-redundancy, Neurocomputing 168 (2015) 92–103.
[26] Y. Lin, Q. Hu, J. Liu, J. Chen, J. Duan, Multi-label feature selection based on neighborhood mutual information, Appl. Soft Comput. 38 (2016) 244–256.
[27] Y. Lin, Q. Hu, J. Zhang, X. Wu, Multi-label feature selection with streaming labels, Inf. Sci. 372 (2016) 256–275.
[28] Y. Lin, Q. Hu, J. Liu, J. Li, X. Wu, Streaming feature selection for multi-label learning based on fuzzy mutual information, IEEE Trans. Fuzzy Syst. 25 (6) (2017) 1491–1507.
[29] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1226–1238.
[30] S. Perkins, J. Theiler, Online feature selection using grafting, in: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington, DC, 2003, pp. 592–599.
[31] L. Qiao, B. Zhang, J. Su, X. Lu, A systematic review of structured sparse learning, Front. Inf. Technol. Electron. Eng. 18 (4) (2017) 445–463.
[32] Y. Qian, J. Liang, W. Pedrycz, C. Dang, Positive approximation: an accelerator for attribute reduction in rough set theory, Artif. Intell. 174 (9–10) (2010) 597–618.
[33] Y. Qian, Q. Wang, H. Cheng, J. Liang, C. Dang, Fuzzy-rough feature selection accelerator, Fuzzy Sets Syst. 258 (2015) 61–78.
[34] R. Schapire, Y. Singer, BoosTexter: a boosting-based system for text categorization, Mach. Learn. 39 (2–3) (2000) 135–168.
[35] N. Spolaôr, E. Cherman, M. Monard, H. Lee, ReliefF for multi-label feature selection, in: 2013 Brazilian Conference on Intelligent Systems (BRACIS), IEEE, 2013, pp. 6–11.
[36] N. Spolaôr, M. Monard, G. Tsoumakas, H. Lee, Label construction for multi-label feature selection, in: 2014 Brazilian Conference on Intelligent Systems (BRACIS), IEEE, 2014, pp. 247–252.
[37] N. Spolaôr, E. Cherman, M. Monard, H. Lee, A comparison of multi-label feature selection methods using the problem transformation approach, Electron. Notes Theor. Comput. Sci. 292 (2013) 135–151.
[38] K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas, Multi-label classification of music into emotions, in: Proceedings of the 9th International Society for Music Information Retrieval Conference, Philadelphia, USA, 2008, pp. 325–330.
[39] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight, Sparsity and smoothness via the fused lasso, J. R. Stat. Soc. 67 (1) (2005) 91–108.
[40] J. Wang, M. Wang, P. Li, L. Liu, Z. Zhao, X. Hu, X. Wu, Online feature selection with group structure analysis, IEEE Trans. Knowl. Data Eng. 27 (11) (2015) 3029–3041.
[41] L. Wei, P. Xing, G. Shi, Z. Ji, Q. Zou, Fast prediction of methylation sites using sequence-based feature selection technique, IEEE/ACM Trans. Comput. Biol. Bioinform. (2017). https://doi.org/10.1109/TCBB.2017.2670558.
[42] X. Wu, K. Yu, W. Ding, H. Wang, X. Zhu, Online feature selection with streaming features, IEEE Trans. Pattern Anal. Mach. Intell. 35 (5) (2013) 1178–1192.
[43] P. Yan, Y. Li, Graph-margin based multi-label feature selection, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer International Publishing, 2016, pp. 540–555.
[44] L. Yong, H. Wenliang, J. Yunliang, Z. Zeng, Quick attribute reduct algorithm for neighborhood rough set model, Inf. Sci. 271 (2014) 65–81.
[45] K. Yu, S. Yu, V. Tresp, Multi-label informed latent semantic indexing, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 258–265.
[46] K. Yu, X. Wu, W. Ding, J. Pei, Towards scalable and accurate online feature selection for big data, in: Proceedings of the IEEE International Conference on Data Mining, IEEE, 2014, pp. 660–669.
[47] A. Zeng, T. Li, D. Liu, J. Zhang, H. Chen, A fuzzy rough set approach for incremental feature selection on hybrid information systems, Fuzzy Sets Syst. 258 (2015) 39–60.
[48] M. Zhang, Z. Zhou, A review on multi-label learning algorithms, IEEE Trans. Knowl. Data Eng. 26 (2014) 1819–1837.
[49] M. Zhang, J.M. Peña, V. Robles, Feature selection for multi-label naive Bayes classification, Inf. Sci. 179 (2009) 3218–3229.
[50] Y. Zhang, Z. Zhou, Multilabel dimensionality reduction via dependence maximization, ACM Trans. Knowl. Discov. Data 4 (2010) 1–21.
[51] L. Zhang, Q. Hu, J. Duan, X. Wang, Multi-label feature selection with fuzzy-rough sets, in: Rough Sets and Knowledge Technology, Springer International Publishing, 2014, pp. 121–128.
[52] J. Zhou, D. Foster, R. Stine, L. Ungar, Streaming feature selection using alpha-investing, in: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, 2005, pp. 384–393.
[53] P. Zhu, Q. Xu, Q. Hu, H. Zhao, Multi-label feature selection with missing labels, Pattern Recognit. 74 (2018) 488–502.
[54] P. Zhu, Q. Hu, Adaptive neighborhood granularity selection and combination based on margin distribution optimization, Inf. Sci. 249 (2013) 1–12.
[55] P. Zhu, W. Zhu, Q. Hu, C. Zhang, W. Zuo, Subspace clustering guided unsupervised feature selection, Pattern Recognit. 66 (2017) 364–374.
[56] P. Zhu, W. Zuo, L. Zhang, Q. Hu, S. Shiu, Unsupervised feature selection by regularized self-representation, Pattern Recognit. 48 (2) (2015) 438–446.
Jinghua Liu is currently a Ph.D. student in the Department of Automation, School of Aerospace Engineering, Xiamen University. Her research interests are focused on data mining and granular computing.
Yaojin Lin received the Ph.D. degree from the School of Computer and Information, Hefei University of Technology. He is currently an associate professor with Minnan Normal University and a postdoctoral fellow with Tianjin University. His research interests include data mining and granular computing. He has published more than 30 papers in journals such as IEEE Transactions on Fuzzy Systems, Decision Support Systems, Information Sciences, and Neurocomputing.
Yuwen Li is currently a Ph.D. student in the Department of Automation, School of Aerospace Engineering, Xiamen University. Her research interests include data mining and granular computing.
Wei Weng is currently an associate professor with Xiamen University of Technology and a Ph.D. student in the Department of Automation, School of Aerospace Engineering, Xiamen University. His research interests include data mining, community mining in complex networks, and recommendation techniques.
Shunxiang Wu was born in Shaoyang, Hunan Province, China, in 1967. He received the M.S. degree from the Department of Computer Science and Engineering, Xi'an Jiaotong University, in 1991, and the Ph.D. degree from the School of Economics and Management, Nanjing University of Aeronautics & Astronautics, in 2007. He is currently a professor in the Department of Automation, School of Aerospace Engineering, Xiamen University. His research interests include intelligent computing, data mining and knowledge discovery, and systems engineering theory and application.