Feature selection for hierarchical classification via joint semantic and structural information of labels


Hai Huang a,b,∗, Huan Liu b

a School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
b School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, AZ, USA

Article history: Received 16 August 2019; Received in revised form 9 February 2020; Accepted 11 February 2020.

Keywords: Feature selection; Hierarchical classification; Label semantic similarity; Label hierarchical structure

Abstract

Hierarchical classification is widely used in many real-world applications, where the label space is exhibited as a tree or a Directed Acyclic Graph (DAG) and each label has rich semantic descriptions. Feature selection, as a type of dimension reduction technique, has proven to be effective in improving the performance of machine learning algorithms. However, many existing feature selection methods cannot be directly applied to hierarchical classification problems since they ignore the hierarchical relations and take no advantage of the semantic information in the label space. In this paper, we propose a novel feature selection framework based on semantic and structural information of labels. First, we transform the label description into a mathematical representation and calculate the similarity score between labels as the semantic regularization. Second, we investigate the hierarchical relations in a tree structure of the label space as the structural regularization. Finally, we impose the two regularization terms on a sparse learning based model for feature selection. Additionally, we adapt the proposed model to the DAG case, which makes our method more general and robust in many real-world tasks. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework for hierarchical classification domains.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

The big data era brings huge challenges for high-dimensional data processing in data mining and machine learning, where a critical issue is known as the curse of dimensionality [1]. When data are high-dimensional, they not only dramatically increase the storage and computing costs for data analysis, but also degrade the efficiency of many learning algorithms. Dimensionality reduction is a powerful tool to improve the performance of machine learning algorithms, and it can be categorized into two main branches: feature extraction and feature selection [2,3]. Feature extraction constructs a new low-dimensional feature space using a linear or nonlinear mapping to combine the original features [4]. On the contrary, feature selection aims to select a subset of relevant features directly from the original feature space to construct more compact models. In some cases where the raw input data include no features understandable to a learning model, feature extraction is preferred. Nevertheless, the new feature space created by feature extraction has no physical meaning for interpretation. In contrast, feature selection maintains the original representation of features and provides models with better readability and interpretability. Therefore, it has been widely used in many applications such as text classification and gene prediction. Additionally, feature extraction may break down when processing large-scale data sets or data streams due to its high computational complexity [5]. Thus, we focus on feature selection algorithms in this paper.

We can categorize feature selection into supervised [6] and unsupervised methods [7] based on the availability of label information. Supervised feature selection is specifically designed for classification or regression tasks where the label information is known a priori. It chooses features that can discriminate data samples from different classes by assessing feature relevance via its correlation with the class labels. Nowadays, many real-world classification applications have complicated label spaces, where all labels are organized in a hierarchical structure like a tree or a Directed Acyclic Graph (DAG). These applications are typically called hierarchical classification problems [8,9], which naturally appear in text classification [10], image annotation [11], and gene function prediction [12]. Fig. 1 illustrates two examples of hierarchical classification, namely the Pascal Visual Object Classes (VOC) [13]


that is a standard dataset of images and annotations used for object detection, and a small sub-hierarchy of the Gene Ontology (GO) [14] used in bioinformatics tasks for predicting gene function. Note that the label space of VOC is a tree structure while the label space of GO is a DAG. Additionally, each label has rich semantic descriptions. Therefore, how to exploit the structural and semantic information of labels to guide feature selection is worthy of further discussion. In this paper, our motivation is to take full advantage of the two types of label information by recasting them as effective representations to improve the performance of feature selection.

Traditional supervised feature selection algorithms are not suitable for hierarchical classification, since they treat each class label as a simple symbol without correlations and assume features are independent of each other. These algorithms include similarity based [15,16], information theoretical based [17–19], sparse learning based [6,20,21] and statistical based methods [22–25]. For complex feature spaces, some feature selection methods incorporate structure information of features to improve classification tasks. According to the different types of structures of the feature space, these algorithms are categorized into feature selection with group structure [26], feature selection with tree structure [27], and feature selection with graph structure [28]. However, these methods only focus on relations in the feature space and rarely consider the connections among labels. In recent years, many feature selection methods have been proposed for multi-label classification. Their goal is to select more relevant features by exploring label correlations. According to [29], they can be divided into data transformation based and direct algorithm adaptation methods. The former decomposes the multi-label problem into a set of single-label ones and applies a traditional feature selection technique to them [30]. The latter directly extends conventional algorithms to deal with multi-label data [31]. Although these algorithms make use of multi-label dependencies, few of them exploit the hierarchical relationships of label spaces.

Recently, some methods have been proposed to leverage the structure information of labels to improve the performance of feature selection. Slavkov et al. [32] presented a feature ranking method by extending the RReliefF algorithm, which can incorporate the hierarchical structure into the calculation of feature relevance. Cerri et al. [33] proposed a feature selection technique for hierarchical multi-label protein function prediction based on a decision tree induction algorithm, Clus-HMC. Zhao et al. [34] proposed a hierarchical feature selection method for a tree structure of the label space, where they used the parent–child and the sibling relationships for hierarchical regularizations. Tuo et al. [35] provided a method called hierarchical feature selection with subtree based graph regularization, which is aimed at exploring two-way dependences among different classes in the label space. Nevertheless, these algorithms still cannot fully exploit the dependencies and interconnections of label spaces, especially the semantic descriptions of the labels, which consist of words, phrases, or even sentences.

In this paper, we propose a novel Feature Selection method based on Semantic and Structural information of labels (FSSS).
Its basic idea is to transform the label descriptions and structures into valid regularizations which can be imposed on the learning model to enhance the performance of feature selection. Fig. 2 depicts the overall architecture of the proposed approach, which consists of two regularizations and one learning model. First, we represent the label descriptions as the semantic regularization in a mathematical form. Specifically, we transform each label description into a vector of real numbers using sentence embedding techniques. Afterwards, we propose a similarity score based on the attention mechanism to calculate the relevance between pairwise label vectors. By this means, we can explore the semantic similarities of labels and use them to guide the feature selection. Second, we investigate the hierarchical relations in label spaces as the structural regularization. There exist two relationships in hierarchically structured labels, namely parent–child and sibling relations, which are decided only by the structure of the label space and are independent of the semantics of labels. We model the two types of information and expect that the parent–child relation characterizes affinities between superclass and subclass labels, while the sibling relation indicates the dissimilarities between different subclass labels belonging to the same superclass. Finally, we build a supervised learning model and impose the semantic and structural regularization terms on it. Through these regularizations, we can add certain prior distributions to the parameters of the learning model. Besides, we adopt the ℓ2,1-norm in the learning model as a sparse regularization to guarantee the sparsity of the feature coefficients, which is widely used in different scenarios such as multi-task feature selection [20], multi-view feature selection [21], and hierarchical feature selection [34,35]. As a result, the performance of feature selection methods would be better. Furthermore, we adapt the proposed model to a DAG case by extending the parent–child and sibling relations in a tree, which makes our method more general and robust in real-world hierarchical classification tasks. The major contributions of this paper are as follows:

• We make the first attempt to explore a principled way to jointly take advantage of the semantic description and the hierarchical structure of class labels in supervised feature selection.
• We propose a robust feature selection framework for supervised hierarchical classification tasks.
• We design an efficient algorithm to optimize the proposed frameworks.
• We conduct extensive experiments on seven real-world hierarchical datasets to evaluate the efficacy of our approach.
• We extend our method to a DAG structure label space and achieve a superior performance.

The rest of this paper is organized as follows. In Section 2, the problem statement is introduced. In Section 3, the FSSS framework is proposed with the corresponding optimization method and its convergence analysis. In Section 4, experimental evaluations on seven real-world datasets are presented with discussions. In Section 5, related work is reviewed. In Section 6, we present the conclusions and future work.

2. Problem statement

Before discussing the feature selection framework, we first summarize some notations used in this paper. Following commonly used symbols, we use normal lowercase characters for scalars (e.g. a), calligraphic fonts for sets (e.g. S), bold lowercase characters for vectors (e.g. v), and bold uppercase characters for matrices (e.g. M). In the matrix setting, we represent the ith row of the matrix M as M(i, :), the jth column as M(:, j), the (i, j)th entry as M(i, j), the transpose of M as M^T, and the trace of M as Tr(M) if it is a square matrix. For any matrix M ∈ R^{n×d}, its Frobenius norm is defined as ∥M∥_F = √(∑_{i=1}^n ∑_{j=1}^d M(i, j)²), and its ℓ2,1-norm is defined as ∥M∥_{2,1} = ∑_{i=1}^n √(∑_{j=1}^d M(i, j)²).
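As a quick notational aid (not part of the proposed method), the two matrix norms used throughout the paper can be computed as follows, assuming NumPy:

```python
import numpy as np

def frobenius_norm(M):
    # ||M||_F = square root of the sum of all squared entries
    return np.sqrt(np.sum(M ** 2))

def l21_norm(M):
    # ||M||_{2,1} = sum over rows of the Euclidean norm of each row
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

M = np.array([[3.0, 4.0], [0.0, 0.0]])
print(frobenius_norm(M), l21_norm(M))   # both equal 5.0 for this example
```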

Let X = {x_1, x_2, . . . , x_n} denote a set of n data instances and F = {f_1, f_2, . . . , f_d} denote a set of d features. We use X = [x_1; x_2; . . . ; x_n] ∈ R^{n×d} to represent the data matrix where each instance in X consists of the d features in F. Suppose C = {c_1, c_2, . . . , c_k} denotes the label space with k possible class labels, and L = {l_1, l_2, . . . , l_k} denotes the corresponding set of k label descriptions for C.


Fig. 1. Two examples of hierarchical classification: (a) The label space of VOC with a tree structure; (b) The label space of a small GO sub-hierarchy with a DAG structure. Both the two label spaces have complex hierarchies and detailed text descriptions, which can be recast as effective structural and semantic regularizations to guide feature selection.

Fig. 2. The overall architecture of the proposed framework with two regularizations and one learning model. The workflow of it consists of three steps. First, the label descriptions are transformed into vectors based on which the similarity score between labels is computed as the semantic regularization. Second, the label hierarchy including parent–child and sibling relations is modeled as the structural regularization. Finally, the two regularizations are imposed on a sparse based learning model to enhance feature selection.

In a hierarchical classification problem, each instance x_i is associated with a subset of labels in C, termed a multi-label classification task, and we use a binary vector y_i = [y_i^1, y_i^2, . . . , y_i^k] ∈ {0, 1}^k to represent this subset of labels, where y_i^j = 1 (j = 1, 2, . . . , k) if x_i is associated with the label c_j and y_i^j = 0 otherwise. For the data matrix X, the associated label matrix is denoted by Y = [y_1; y_2; . . . ; y_n] ∈ {0, 1}^{n×k}. As mentioned above, the label space C is complicated in the hierarchical classification domain. Let H = (C, PCR, SR) denote its hierarchical tree structure, where PCR and SR represent the ‘‘Parent–Child’’ and ‘‘Sibling’’ relations in the label space C, respectively. In a hierarchical tree structure, PCR(c_i, c_j) means c_i is the only direct parent of c_j, while SR(c_i, c_j) means c_i and c_j are direct children of the same parent. The two types of relations are predefined since the structure of the label space is known a priori. According to [36], the PCR relation is asymmetric, anti-reflexive and intransitive while the SR relation is symmetric, anti-reflexive and transitive. With the above notations, we can define the problem statement as follows.

Problem 1. Supervised feature selection for hierarchical classification.

Given: the feature set F, the data matrix X, the label matrix Y for all data instances, the label description set L, the label space C and its hierarchical tree structure H (recorded in a file that lists the PCR and SR relations for each label).

Select: a subset of the most relevant features S ⊆ F by learning the correlation between X and Y, and by exploiting both the label descriptions L and the hierarchical tree structure H of the label space C.

3. Proposed feature selection framework

In this section, we elaborate the proposed feature selection framework for hierarchical classification problems. We first introduce the basic sparse learning method which embeds feature selection into a classification algorithm. Then, we discuss how to model the label description and the hierarchical structure in label spaces, and impose them as two regularization terms on the basic feature selection model. Finally, we design a novel optimization algorithm for the proposed framework with its convergence analysis.


3.1. Basic sparse learning based feature selection model

In supervised feature selection, sparse learning has proven to be an effective technique due to its good performance and interpretability [2]. This type of method aims to minimize the fitting errors and some sparse regularization terms. Typically, the sparse learning based model for multi-label classification is formulated as follows,

min_W L(W; X, Y) + α R(W),   (1)

where L(·) is a loss function and some popular choices include the least squares loss, hinge loss, and logistic loss. For most sparse learning based approaches, the linear model with least squares loss is widely used [21,34,35], so the loss function is defined as

L(W; X, Y) = ∥XW − Y∥²_F,   (2)

where W ∈ R^{d×k} is the feature weight matrix which transforms the data matrix X into the corresponding label matrix Y. Each row vector W(i, :) measures the importance of the ith feature in X, and each column vector W(:, j) represents the correlation between each instance in X and the jth label in Y. R(W) is a sparse regularization term on W. In sparse learning methods, the ℓ2,1-norm regularization is preferred for multi-label classification problems since it adds structured sparsity to the feature selection. In other words, it tends to reduce many feature weights to very small values or exactly zero, so that the corresponding features are simply eliminated. Moreover, the ℓ2,1-norm regularization is convex and can be globally optimized [20]. The parameter α is a positive constant used to adjust the sparsity of the model. By combining Eq. (2) and the ℓ2,1-norm sparse regularizer, Eq. (1) can be reformulated as

min_W ∥XW − Y∥²_F + α ∥W∥_{2,1}.   (3)

By solving the optimization problem in Eq. (3), traditional methods can sort the features according to the value of ∥W(i, :)∥₂ in descending order and then select the top ranked ones. However, better performance is expected to be achieved by imposing the semantic and hierarchical regularization terms on W.

3.2. Label description model

In the hierarchical classification domain, each label has rich semantic descriptions which consist of words, phrases, and sentences. We expect to encode these texts into mathematical representations and use them to calculate relations between labels. In recent years, sentence embedding [37,38] techniques have developed rapidly in the Natural Language Processing (NLP) domain, which map each sentence to a vector in a continuous space. Specifically, the label vector representation for its text description can be formulated as

v_i = e(l_i) (l_i ∈ L),   (4)

where e(·) is the embedding function. In our model, we use the Universal Sentence Encoder [38] to transform the label text. This encoder uses a transformer network trained on a variety of data sources and tasks. By inputting a variable length label text, we obtain a 512-dimensional output vector. In the next step, we calculate the pairwise relation between label vectors and estimate the importance of this relation in the whole label space. As is well known, the attention mechanism [39] can decide the weight assignment among different related input information, which is widely applied in the sequence to sequence models of NLP. Motivated by [40], we can figure out the pairwise similarity weight over all label pairs based on attention models. Assume v_i and v_j denote the label vector representations of l_i and l_j, respectively. Then the pairwise similarity score for labels is defined as

B(i, j) = ⟨v_i, v_j⟩ / (∑_{i=1}^k ∑_{j=1}^k ⟨v_i, v_j⟩).   (5)

Here, ⟨·⟩ returns the inner product of two vectors and k is the number of all class labels. B(i, j) is actually a normalized score that estimates the similarity significance of l_i and l_j in the label space. The larger the normalized score is, the more similar the two labels are. As mentioned in Section 3.1, each column vector W(:, j) correlates with the jth label in Y. Following Eq. (5), we also introduce the pairwise similarity score for weights as

S(i, j) = ⟨W(:, i), W(:, j)⟩ / (∑_{i=1}^k ∑_{j=1}^k ⟨W(:, i), W(:, j)⟩).   (6)

It is generally expected that labels which are close to each other in semantic descriptions tend to share similar weights. For example, ‘‘cinema’’ is related to ‘‘entertainment’’, and therefore their transformation weights are supposed to be similar, which means that a data instance with the label ‘‘cinema’’ has a high probability of being tagged with ‘‘entertainment’’. We can achieve this by imposing the semantic regularization on weights as

∥S − B∥²_F,   (7)

where S = {S(i, j)} ∈ R^{k×k} is the pairwise similarity matrix for weights, and B = {B(i, j)} ∈ R^{k×k} is the pairwise similarity matrix for labels. By adding Eq. (7) to Eq. (3), the new objective function with semantic regularization can be formulated as

min_{W,S} ∥XW − Y∥²_F + α ∥W∥_{2,1} + β ∥S − B∥²_F,   (8)

where β is a parameter to balance the label description modeling and the basic feature selection procedure.
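To make the semantic regularization concrete, the following sketch computes the label similarity matrix B of Eq. (5) from pre-computed label embeddings. It assumes a generic sentence encoder is available; the paper uses the Universal Sentence Encoder, which is shown only as a commented, hypothetical usage below.

```python
import numpy as np

def label_similarity_matrix(label_vectors):
    """Normalized pairwise inner-product similarity B, as in Eq. (5).

    label_vectors: array of shape (k, m), one embedding per label description.
    Returns a (k, k) matrix whose entries sum to 1.
    """
    V = np.asarray(label_vectors, dtype=float)
    G = V @ V.T          # inner products <v_i, v_j> for all label pairs
    return G / G.sum()   # attention-style normalization over all pairs

# Hypothetical usage with any sentence encoder returning fixed-length vectors,
# e.g. the Universal Sentence Encoder via TensorFlow Hub:
# embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
# V = embed(["household", "bottle", "tv/monitor"]).numpy()   # shape (k, 512)
# B = label_similarity_matrix(V)
```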

3.3. Hierarchical structure model

There are two types of relationship in a hierarchical tree structure H = (C, PCR, SR). PCR denotes the ‘‘Parent–Child’’ relation which represents a strong affinity between a superclass and its direct subclass. We expect that labels with the ‘‘Parent–Child’’ relation should share similar transformation weights. More formally, given transformation weight vectors W(:, i) and W(:, p_i) for label c_i and its parent label c_{p_i}, i.e., (c_{p_i}, c_i) ∈ PCR, we can define the parent–child regularization on weights as

∑_{i=1}^k ∥W(:, i) − W(:, p_i)∥²₂ = ∥W − W_p∥²_F.   (9)

Here,

W_p = [W(:, p_1), W(:, p_2), . . . , W(:, p_k)] ∈ R^{d×k}   (10)

represents the parent matrix of the weight matrix W. In a tree structure, each W(:, i) has only one parent W(:, p_i), except the root; we assign W(:, r) to W(:, p_r) if c_r is the root label. The other relationship in a hierarchical tree structure is the ‘‘Sibling’’ relation. It represents the affinity between two subclasses of the same superclass, which is generally believed to be a weak dependence since it indicates two different branches. For example, in Fig. 1(a), the labels ‘‘Bottle’’ and ‘‘TV/monitor’’ describe two different categories even if they are from the same superclass ‘‘Household’’. Therefore, transformation weights corresponding to labels with the ‘‘Sibling’’ relation tend to be dissimilar with each other. Specifically, let S_i = {s_i^1, s_i^2, . . . , s_i^{|S_i|}} denote the index set of all siblings of label c_i, i.e., (c_i, c_l) ∈ SR (l ∈ S_i), then


W(:, l) (l ∈ S_i) denotes a sibling vector of W(:, i). The sibling regularization on weights can be formulated as

∑_{i=1}^k ∑_{l∈S_i} ⟨W(:, i), W(:, l)⟩ = Tr(W^T W_S H),   (11)

where the inner product ⟨W(:, i), W(:, l)⟩ reveals the similarity between a weight vector and its sibling vector. A smaller inner product indicates a lower similarity. The matrix

W_S = [W_{S_1}, W_{S_2}, . . . , W_{S_k}]   (12)

denotes the sibling matrix of the weight matrix W, and each matrix

W_{S_i} = [W(:, s_i^1), W(:, s_i^2), . . . , W(:, s_i^{|S_i|})] ∈ R^{d×|S_i|}   (13)

represents the sibling matrix of the weight vector W(:, i). H = {H(p, q)} ∈ {0, 1}^{h×k} (h = ∑_{i=1}^k |S_i|) is a binary constant matrix, where H(p, q) is defined as

H(p, q) = 1 if ∑_{i=0}^{q−1} |S_i| < p ≤ ∑_{i=0}^{q−1} |S_i| + |S_q|, and 0 otherwise (1 ≤ p ≤ h, 1 ≤ q ≤ k, |S_0| = 0).   (14)

By adding Eqs. (9) and (11) to Eq. (8), the final objective function with the semantic and hierarchical structure regularizations can be reformulated as

J = min_{W,S,W_p,W_S} ∥XW − Y∥²_F + α ∥W∥_{2,1} + β ∥S − B∥²_F + θ ∥W − W_p∥²_F + γ Tr(W^T W_S H),   (15)

where the parameters θ and γ are used to balance the parent–child and sibling regularizations.
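For illustration, the sketch below builds the parent mapping and the sibling index sets S_i from a label tree and assembles the binary selector matrix H of Eq. (14); the parent mapping also corresponds to the matrix P used later in Eqs. (17) and (18). The tree encoding (a dict mapping each label index to its parent, with the root mapped to itself) is an assumption made for this example, not the file format used by the authors.

```python
import numpy as np

def build_structure(parent):
    """parent: dict {label_index: parent_index}; the root maps to itself.

    Returns (P, siblings, H):
      P        - {0,1} parent mapping: P[i, j] = 1 iff label i is the direct
                 parent of label j (the root is treated as its own parent),
      siblings - index sets S_i of labels sharing the parent of i (excluding i),
      H        - {0,1} selector matrix of Eq. (14), shape (sum_i |S_i|, k).
    """
    k = len(parent)
    P = np.zeros((k, k), dtype=int)
    for j, pj in parent.items():
        P[pj, j] = 1                      # column j of W P selects W(:, p_j)
    siblings = [[j for j in range(k)
                 if j != i and parent[j] == parent[i] and parent[i] != i]
                for i in range(k)]
    H = np.zeros((sum(len(s) for s in siblings), k), dtype=int)
    row = 0
    for q, s in enumerate(siblings):      # rows for S_q form a contiguous block
        H[row:row + len(s), q] = 1
        row += len(s)
    return P, siblings, H

# Toy tree: 0 is the root, 1 and 2 are its children, 3 and 4 are children of 1.
P, siblings, H = build_structure({0: 0, 1: 0, 2: 0, 3: 1, 4: 1})
```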

3.4. Optimization algorithm

It can be observed from Eq. (15) that there exist four different variables in the proposed feature selection framework. However, according to Eqs. (6), (10), (12) and (13), we notice that the variables S, W_p, and W_S are all dependent on W. Therefore, we first convert these variables to functions of W and then optimize the objective function only with respect to the variable W. Based on S = {S(i, j)} ∈ R^{k×k} and Eq. (6), we can derive the function of S w.r.t. W as

S = W^T W / (∑_{i=1}^k ∑_{j=1}^k ⟨W(:, i), W(:, j)⟩) = W^T W / Tr(W^T W 1),   (16)

where 1 ∈ R^{k×k} is a matrix of ones in which each element is equal to one. From Eq. (10) and the definition of W(:, p_i), the function of W_p w.r.t. W is formulated as

W_p = [W(:, p_1), W(:, p_2), . . . , W(:, p_k)] = WP,   (17)

where P = {P(i, j)} ∈ {0, 1}^{k×k} is the mapping matrix of parents, which maps the weight vector W(:, i) in W to its parent vector W(:, p_i) in W_p. In this way, P(i, j) is described as

P(i, j) = 1 if W(:, i) is the parent of W(:, j), and 0 otherwise.   (18)

Similarly, the function of W_S w.r.t. W can be obtained from Eqs. (12) and (13), which is as follows,

W_S = [W_{S_1}, W_{S_2}, . . . , W_{S_k}] = WQ,   (19)

where Q = [Q_1, Q_2, . . . , Q_k] denotes the mapping matrix of siblings for the matrix W, and each Q_m = {Q_m(i, j)} ∈ {0, 1}^{k×|S_m|} (m = 1, 2, . . . , k) represents the mapping matrix of siblings for the vector W(:, m), which maps the weight vectors in W to the corresponding sibling vector set W_{S_m} in W_S. Q_m(i, j) is defined as

Q_m(i, j) = 1 if W(:, i) is the jth sibling of W(:, m), and 0 otherwise.   (20)

Substituting Eqs. (17) and (19) into Eq. (15), we arrive at

J(W) = min_W ∥XW − Y∥²_F + α ∥W∥_{2,1} + β ∥S − B∥²_F + θ ∥W − WP∥²_F + γ Tr(W^T WQH).   (21)

It can be observed from Eqs. (21) and (16) that the converted objective function depends only on the variable W. Therefore, we can calculate the derivative of the objective function J with respect to W as

∂J(W)/∂W = 2X^T(XW − Y) + 2α DW + 4β W(S − B + Tr((B − S)S)1) / Tr(W^T W 1) + 2θ W(I − P)(I − P)^T + γ W(QH + (QH)^T),   (22)

where I ∈ R^{k×k} is an identity matrix, and D ∈ R^{d×d} is a diagonal matrix with the ith diagonal element D(i, i) = 1 / (2√(W(i, :)W(i, :)^T)). In Eq. (22), the derivative of ∥W∥_{2,1} with respect to W is 2DW, which may cause the objective function J(W) to be non-smooth when W(i, :) = 0. To guarantee the convergence of J(W) in its feasible region, we redefine

D(i, i) = 1 / (2√(W(i, :)W(i, :)^T + ε)),   (23)

where ε is a very small positive constant. Since the derivative is complicated and the variable D is still dependent on W, it is difficult to obtain a closed-form solution by setting the derivative to zero. Motivated by [6], we propose an iterative optimization algorithm with gradient descent to solve this problem. The detailed algorithm is illustrated in Algorithm 1. In each iteration, we first calculate the gradient of the objective function by Eq. (22). Then we update W_{t+1} by the gradient descent rule as follows,

W_{t+1} = W_t − λ_t ∂J(W_t)/∂W_t,   (24)

where λ_t > 0 is the step size for the tth iteration. To accelerate the convergence, we adopt the Armijo rule [41] to search for a suitable step size. After obtaining the value of W_{t+1}, we update D_{t+1} because it is dependent on W_{t+1}. The iteration process is repeated until the convergence condition is satisfied. Finally, the optimal weight matrix W* is returned. We sort the features according to the value of ∥W*(i, :)∥₂ in descending order and then select the top ranked ones. As shown in Algorithm 1, the computation cost in each iteration mainly depends on the updates of W and D. According to Eqs. (22) and (24), the operations to update W are O(ndk + k²d + k³ + dk²), while the cost of computing D is O(k²) based on Eq. (23). Here, n, d, and k are the numbers of data instances, features, and labels, respectively. In a feature selection problem, the number of labels k is usually much less than n and d. Hence, the total computation cost of the FSSS algorithm is approximately O(Tndk), where T is the number of iterations, which depends on the convergence rate. Since we employ the Armijo rule to guide the step size selection, the convergence speed is improved [42].
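As a minimal sketch of the iterative procedure described above (Algorithm 1), the code below applies the update of Eqs. (22)–(24) with a simple backtracking implementation of the Armijo rule. It assumes the matrices B, P, and QH have been precomputed; the parameter values are illustrative, and this is not the authors' released implementation.

```python
import numpy as np

def fsss(X, Y, B, P, QH, alpha=1.0, beta=0.5, theta=1.0, gamma=0.1,
         eps=1e-8, max_iter=200, tol=1e-4):
    """Gradient-descent sketch of the FSSS objective in Eq. (21).

    X: (n, d) data, Y: (n, k) binary labels, B: (k, k) label similarity,
    P: (k, k) parent mapping, QH: (k, k) product of the sibling mapping Q and H
    (equivalently, QH[i, j] = 1 if label i is a sibling of label j).
    Returns W (d, k); features are ranked by the l2 norms of the rows of W.
    """
    d, k = X.shape[1], Y.shape[1]
    W = np.zeros((d, k))
    I, ones = np.eye(k), np.ones((k, k))

    def objective(W):
        S = W.T @ W / max(np.trace(W.T @ W @ ones), eps)
        return (np.linalg.norm(X @ W - Y, "fro") ** 2
                + alpha * np.sum(np.sqrt(np.sum(W ** 2, axis=1) + eps))
                + beta * np.linalg.norm(S - B, "fro") ** 2
                + theta * np.linalg.norm(W - W @ P, "fro") ** 2
                + gamma * np.trace(W.T @ W @ QH))

    prev = objective(W)
    for _ in range(max_iter):
        D = np.diag(1.0 / (2.0 * np.sqrt(np.sum(W ** 2, axis=1) + eps)))
        S = W.T @ W / max(np.trace(W.T @ W @ ones), eps)
        grad = (2 * X.T @ (X @ W - Y)
                + 2 * alpha * D @ W
                + 4 * beta * W @ (S - B + np.trace((B - S) @ S) * ones)
                / max(np.trace(W.T @ W @ ones), eps)
                + 2 * theta * W @ (I - P) @ (I - P).T
                + gamma * W @ (QH + QH.T))
        lam = 1.0
        # Armijo backtracking: shrink the step until a sufficient decrease holds
        while objective(W - lam * grad) > prev - 1e-4 * lam * np.sum(grad ** 2):
            lam *= 0.5
            if lam < 1e-12:
                break
        W = W - lam * grad
        cur = objective(W)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return W

# Ranking: scores = np.linalg.norm(W, axis=1); top features = np.argsort(-scores)
```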


3.5. Convergence analysis

In this subsection, we show that the proposed algorithm monotonically decreases the objective function value until it converges. As aforementioned, the objective function depends only on W and the gradient descent method is utilized to decrease its value. In each iteration of Algorithm 1, the step size λ_t is decided by the Armijo rule, which tests whether a selection of λ_t achieves an adequate corresponding decrease in the objective function f. Specifically, the Armijo rule requires that λ_t fulfill the following two conditions,

f(x_t + λ_t d_t) ≤ f(x_t) + λ_t ρ ∇f(x_t)^T d_t,   (25)

f(x_t + λ_t d_t) ≥ f(x_t) + λ_t (1 − ρ) ∇f(x_t)^T d_t,   (26)

where ρ ∈ (0, 0.5) is a selected control parameter, and d_t is the descent direction, which equals the negative gradient, i.e., −∇f(x_t). Based on Eq. (25), the following inequality holds,

J(W_t − λ_t ∇J(W_t)) − J(W_t) ≤ −λ_t ρ ∥∇J(W_t)∥²₂ ≤ 0.   (27)

Integrating Eq. (24) with Eq. (27), we have

J(W_{t+1}) ≤ J(W_t),   (28)


which proves that the objective function is convergent. In addition, according to Eq. (26), we have

J(W_{t+1}) − J(W_t) ≥ −λ_t (1 − ρ) ∥∇J(W_t)∥²₂,   (29)

which guarantees that λ_t is not too small, and thus improves the convergence rate. To verify this analysis, we conduct the convergence experiment on two different datasets, ImageCLEF and VOC. In Fig. 3, the empirical evaluation results show that our proposed algorithm decreases monotonically and converges efficiently.

4. Experiments

In this section, we first introduce the real-world datasets and experimental settings. Next, we conduct extensive experiments to verify the effectiveness of the proposed approach. Then, we present the way to extend our framework to a hierarchical DAG structure and compare the performance of our feature selection using different semantic similarity computing methods. Finally, we perform some tests to validate the robustness of the proposed method.

4.1. Datasets

Seven public benchmark datasets with a hierarchical tree structure label space are used in our experiments, which cover three categories: text, image, and gene. The basic description of these datasets is listed in Table 1.

• Enron: The Enron corpus [43] is a large email dataset containing approximately 500,000 messages from about 150 users, which was made public by the Federal Energy Regulatory Commission during its legal investigation of Enron's collapse. In our study, we choose 55 labels in a hierarchical tree structure with a height of 4. The Average Word Number Per Label (AWNPL) is 3.07.
• RCV1: Reuters Corpus Volume 1 (RCV1) [44] is a benchmark collection for text categorization research, with an archive of over 80,000 newswire stories made available by Reuters, Ltd. We select 101 labels in a hierarchical tree structure with a height of 5. The average word number per label is 2.02.
• ImageCLEF: ImageCLEF [45] aims to provide an evaluation forum for the cross-language annotation and retrieval of images. We select the dataset from the ImageCLEF 2007 competition for annotation of medical X-ray images, which includes 30 labels in a 4-height tree. The average word number per label is 1.5.
• VOC: VOC [13] is a benchmark in visual object category recognition and detection, providing a standard dataset of images and annotations. We choose the VOC2007 dataset in our study, which contains 30 labels in a hierarchical tree structure with a height of 5. The average word number per label is 1.1.
• ImageNet: ImageNet [46] is a large-scale hierarchical image database, in which each node of the hierarchy is depicted by hundreds and thousands of images. We select a fine-grained classification subset of birds, which contains 112 labels in a 6-height tree structure. The average word number per label is 2.59.
• Seq: Seq [47] is a gene dataset annotated by the Gene Ontology [14] resource, which provides a computational representation of the current scientific knowledge about the functions of genes. We choose 661 labels to construct a tree structure with a height of 13. The average word number per label is 3.61.
• Yeast: Yeast [48] is another gene dataset annotated by the Gene Ontology resource. We pick 133 labels to construct a 7-height tree structure. The average word number per label is 3.31.


4.2. Experimental settings

We evaluate the proposed feature selection framework based on the hierarchical classification performance. Since the hierarchical classification problem belongs to multi-label learning, we follow the standard ways to assess it. Three common multi-label classification metrics are used. The first is the example-based Jaccard index [49], which measures the accuracy of the multi-label classification and is defined as

Jaccard = (1/n) ∑_{i=1}^n |T_i ∩ P_i| / |T_i ∪ P_i|,   (30)

where T_i ⊆ C is the true label set associated with the instance x_i, and P_i ⊆ C is the predicted label set for each instance. n is the total number of all data examples. The other two metrics are the label-based Micro-F1 and Macro-F1 [9], which use different methods to calculate the average F1 scores. Let TP_j, FP_j, TN_j and FN_j denote the number of true positive, false positive, true negative and false negative test data examples with respect to the label c_j ∈ C, respectively. Then, the Micro-F1 is defined as

Pr_Mi = ∑_{j=1}^k TP_j / ∑_{j=1}^k (TP_j + FP_j), Re_Mi = ∑_{j=1}^k TP_j / ∑_{j=1}^k (TP_j + FN_j),
Micro-F1 = 2 × Pr_Mi × Re_Mi / (Pr_Mi + Re_Mi),   (31)

where Pr_Mi and Re_Mi are the weighted averages of the precision and the recall of all labels, respectively. The Macro-F1 is defined as

Pr_Ma = (1/k) ∑_{j=1}^k TP_j / (TP_j + FP_j), Re_Ma = (1/k) ∑_{j=1}^k TP_j / (TP_j + FN_j),
Macro-F1 = 2 × Pr_Ma × Re_Ma / (Pr_Ma + Re_Ma),   (32)

where Pr_Ma and Re_Ma are the arithmetic averages of the precision and the recall of all labels, respectively. k is the total number of all labels. In addition to these classical measures, we also use the extended metrics of precision, recall and F1 score, which are specifically designed for the hierarchical classification domain [9]. They are defined as

hPr = (1/n) ∑_{i=1}^n |T̂_i ∩ P̂_i| / |P̂_i|, hRe = (1/n) ∑_{i=1}^n |T̂_i ∩ P̂_i| / |T̂_i|,
Hier-F1 = 2 × hPr × hRe / (hPr + hRe),   (33)

where T̂_i = T_i ∪ Ancestors(T_i) represents the union of the true labels for the ith test example and all their ancestor labels, and P̂_i = P_i ∪ Ancestors(P_i) denotes the union of the predicted labels for the ith test example and all their ancestor labels. Based on the above four metrics, Jaccard, Micro-F1, Macro-F1 and Hier-F1, we evaluate the proposed FSSS framework against the following different types of representative feature selection methods:

• FScore: Fisher Score selects features such that the feature values are similar within the same class while they are dissimilar across different classes [1].
• aMTFL: aMTFL implements feature selection by imposing a least squares loss function and ℓ2,1-norm minimization on the regularization [20], which belongs to the sparse learning based methods.
• MIFS: Multilabel Informed Feature Selection decomposes the multi-label information into a low-dimensional label space to guide the feature selection via a joint sparse regression framework [42].
• SFUS: Sub-Feature Uncovering with Sparsity is a joint sparse feature selection method, which can uncover the shared subspace of original features [50].
• HiFSRR: Hierarchical Feature Selection with Recursive Regularization considers the hierarchical relations of the class structure, which provide significant information for feature selection in classification learning [34].
• HFSGR: Hierarchical Feature Selection with Graph Regularization is aimed at exploring two-way dependences among different classes in a hierarchical structure, and then utilizes them to guide feature selection via a sparse learning approach [35].

Fig. 3. Convergence curves of the objective function value on two different datasets: (a) ImageCLEF; (b) VOC.

Table 1
Detailed information of datasets.

Dataset       Type   #Validation  #Training  #Test  #Features  #Labels  Height  AWNPL
Enron(a)      Text   500          988        660    1000       55       4       3.07
RCV1(a)       Text   2000         3000       3000   2938       101      5       2.02
ImageCLEF(a)  Image  2000         10,000     1006   80         30       4       1.5
VOC(b)        Image  2000         7178       5105   1000       30       5       1.1
ImageNet(c)   Image  2000         6000       2950   2048       112      6       2.59
Seq(d)        Gene   1000         1692       1332   478        661      13      3.61
Yeast(e)      Gene   1000         2310       1155   5930       133      7       3.31

(a) https://sites.google.com/site/hrsvmproject/datasets-hier
(b) https://github.com/fhqxa/KNOSYS-D-18-00724/tree/master/HFSGR/dataset
(c) http://www.image-net.org/
(d) https://dtai.cs.kuleuven.be/clus/hmcdatasets/
(e) http://kt.ijs.si/DragiKocev/PhD/resources/doku.php?id=hmc_classification

Among the above baseline methods, FScore and aMTFL are utilized in single-label classification, so we adapt them for the multi-label domain. SFUS and MIFS are designed for multi-label classification without considering the hierarchical label structure. HiFSRR and HFSGR are specialized for hierarchical classification problems. Some parameters need to be set in advance. In MIFS, we set the local geometry structure parameters σ and p to 1 and 5, respectively, as suggested by the original paper [42]. For a fair comparison, we tune the regularization parameters for all methods over the range {0.01, 0.1, 0.5, 1, 5, 10, 100}. Since there are four parameters in our model, for efficiency, we vary one parameter with the other parameters fixed instead of using the ‘‘grid search’’ strategy, which is also adopted for the other baseline methods. In this work, we conduct all experiments on a computer with a 3.4 GHz Intel Core i7 processor and 16 GB memory. We first obtain the best model parameters for all methods by tuning them on the validation set. Then we apply the feature selection methods to the training set to choose features. Finally, we evaluate the performance of different methods on the test set by employing the hierarchical classification algorithm based on the selected features. In our experiments, we use HR-SVM [9] to learn the hierarchical classifier, which is a hierarchical multi-label classification system for tree and DAG structures based on LIBSVM [51] and HEMKit [36]. Additionally, we have published a package of our FSSS method on GitHub (https://github.com/bupthh/FSSS).
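As a concrete reference for the evaluation protocol above, the sketch below computes Jaccard, Micro-F1, Macro-F1 and Hier-F1 (Eqs. (30)–(33)) from true and predicted label sets; the ancestor map used for the hierarchical metric is an assumed input derived from the label hierarchy file, and the zero-division guards are additions of this sketch.

```python
import numpy as np

def evaluate(T, P, ancestors, k):
    """T, P: lists of sets of true / predicted label indices per test example.
    ancestors: dict mapping a label index to the set of its ancestor labels.
    k: total number of class labels.
    Returns (Jaccard, Micro-F1, Macro-F1, Hier-F1) following Eqs. (30)-(33)."""
    jac = np.mean([len(t & p) / len(t | p) if t | p else 1.0 for t, p in zip(T, P)])

    tp, fp, fn = np.zeros(k), np.zeros(k), np.zeros(k)
    for t, p in zip(T, P):
        for j in p & t:
            tp[j] += 1
        for j in p - t:
            fp[j] += 1
        for j in t - p:
            fn[j] += 1
    pr_mi = tp.sum() / max(tp.sum() + fp.sum(), 1e-12)
    re_mi = tp.sum() / max(tp.sum() + fn.sum(), 1e-12)
    micro = 2 * pr_mi * re_mi / max(pr_mi + re_mi, 1e-12)
    pr_ma = np.mean(tp / np.maximum(tp + fp, 1e-12))
    re_ma = np.mean(tp / np.maximum(tp + fn, 1e-12))
    macro = 2 * pr_ma * re_ma / max(pr_ma + re_ma, 1e-12)

    def expand(s):  # augment a label set with all of its ancestor labels
        return s.union(*(ancestors.get(j, set()) for j in s)) if s else set()

    Th, Ph = [expand(t) for t in T], [expand(p) for p in P]
    h_pr = np.mean([len(t & p) / max(len(p), 1) for t, p in zip(Th, Ph)])
    h_re = np.mean([len(t & p) / max(len(t), 1) for t, p in zip(Th, Ph)])
    hier = 2 * h_pr * h_re / max(h_pr + h_re, 1e-12)
    return jac, micro, macro, hier
```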

4.3. Effectiveness of semantic and hierarchical regularizations

In this subsection, we verify the effectiveness of the semantic and hierarchical regularizations in our framework. According to Eq. (21), the parameter β controls the semantic regularization term and the parameters θ and γ jointly determine the hierarchical regularization terms. We compare the effectiveness under three different settings:

(1) β = 0, θ = 0, γ = 0: This means FSSS is a basic sparse learning based model without semantic and hierarchical structure regularizations.
(2) β = 0.5, θ = 0, γ = 0: This means FSSS is a basic sparse learning based model with the semantic regularization but without the hierarchical structure regularization.


(3) β = 0.5, θ = 1, γ = 0.1: This means FSSS is a sparse learning based model with both semantic and hierarchical structure regularizations.

We set the other parameter α to 1. The ratio of selected features varies in {4%, 8%, 12%, 16%, 20%}. For each selected feature set, we perform a 5-fold cross-validation and report the average Jaccard, Micro-F1, Macro-F1 and Hier-F1 values on two datasets. The comparison results are shown in Fig. 4. It can be observed that the performance with setting (3) consistently outperforms that with settings (1) and (2) across different ratios of selected features. The advantage becomes more obvious when a small ratio of features, such as 4% or 8%, is chosen, which is very practical in real-world applications. The comparison results are similar on the two datasets, ImageCLEF and Enron, which have a small and a large number of labels, respectively. In summary, our model can take advantage of the semantic and structural information to improve the feature selection of hierarchical classification problems. It is worth mentioning that setting (2) is a non-hierarchical classification situation, where the label space has no structure but only the semantic information. The results also indicate that our method can be effectively utilized in a non-hierarchical classification task.

4.4. Quality comparison of selected features on real-world datasets

We compare the quality of the features selected by FSSS with other baseline algorithms on the seven mentioned datasets. In this experiment, the ratio of selected features is among {2%, 4%, . . . , 20%}. We perform a 5-fold cross-validation for each selected feature set and present the average performance in terms of Jaccard, Micro-F1, Macro-F1 and Hier-F1 over all feature sets. For all four metrics, a higher value indicates a better performance. The comparison results are shown in Table 2. We make the following observations:

• FSSS outperforms the other baseline methods in almost all cases, which proves that the proposed model is effective for different types of datasets including text, image, and gene. The major reason is that FSSS fully exploits the semantic and structural information of the label space to select better features.
• FSSS performs better than other approaches on the subset of ImageNet, which is a well-known hierarchical classification example. It indicates that FSSS is still effective in finding discriminative features in the fine-grained classification task.
• FSSS works well on the simple datasets with small numbers of features and labels, like ImageCLEF, and still achieves good performance on the complex datasets with more features and labels, like Yeast, by using less than 20% of the total features. In particular, the performance advantage on ImageCLEF is less obvious than that on other complex datasets, which shows that FSSS prefers to explore more helpful information from complicated hierarchical label spaces to find latent features.
• Compared with the two feature selection methods of the same type for hierarchical classification problems, FSSS performs better than HiFSRR and HFSGR in all cases on different types of datasets. The reason is that FSSS considers not only the hierarchical structure of the label space but also the rich semantic description of labels.

To further verify the above observations, we perform statistical tests to compare the different methods over multiple datasets. First, we implement the Friedman test [52] to measure whether there is a significant difference among all feature selection methods. More formally, given k methods and N datasets, let r_i^j be the rank of the jth method on the ith dataset in terms of the classification metric. The average rank of the jth method is defined as R_j = (1/N) ∑_{i=1}^N r_i^j. Under the null hypothesis that all methods are equivalent, i.e., each R_j is equivalent, we calculate the Friedman statistic as

F_F = (N − 1) χ_F² / (N(k − 1) − χ_F²), where χ_F² = 12N / (k(k + 1)) × [∑_{j=1}^k R_j² − k(k + 1)²/4],   (34)

which is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Based on Table 2, the values of F_F for the four classification metrics (Jaccard, Micro-F1, Macro-F1 and Hier-F1) are 5.86, 7.45, 4.61, and 6.04, respectively. With 7 methods and 7 datasets, i.e., k = 7 and N = 7, the critical value of F(6, 36) at the α = 0.05 significance level is 2.36, so we reject the null hypothesis with respect to all four classification metrics. Then we proceed with the Bonferroni–Dunn post-hoc test [53] to further analyze the difference between FSSS and the other baseline methods. Specifically, we calculate the difference between two average ranks and then compare it with the critical difference, which is defined as

CD = q_α √(k(k + 1) / (6N)),   (35)

where the critical value q_α is based on the Studentized range statistic divided by √2 [52]. According to [52], the critical value q_0.05 for our experiment is 2.638, so the corresponding CD is 3.05. The performance of two methods is significantly different if the difference between their average ranks exceeds the value of CD. We present the CD diagrams for the four classification metrics in Fig. 5, where each feature selection method is marked along the rank axis from right to left in ascending order of average rank (a lower rank is better). It can be observed that our FSSS method ranks first on all four metrics. To further compare FSSS with the other methods, we draw a thick line starting from the point of FSSS to the left along the rank axis, whose length is one CD interval. Any compared baseline method not connected with the line is significantly different from FSSS. From Fig. 5, we can see that in all 24 comparisons, i.e., 6 compared methods times 4 metrics, FSSS performs significantly better than the other methods 11 times (46% of cases) and achieves comparable performance 13 times (54% of cases). In summary, the experimental and statistical results clearly show that our proposed model outperforms the other state-of-the-art feature selection methods.

4.5. Method extension

In this subsection, we extend our method to a hierarchical DAG structure. As aforesaid, there are two types of relationship in a hierarchical tree structure H_tree = (C, PCR, SR), where PCR and SR represent the ‘‘Parent–Child’’ and ‘‘Sibling’’ relations, respectively. For a DAG, we should update the two relations since each non-root node has one or more parents. Specifically, let H_dag = (C, PCR_dag, SR_dag) denote the DAG structure for the label space C. For the parent–child relation, PCR_dag(c_i, c_j) means c_i is one of the direct parents of c_j. For the sibling relation, SR_dag(c_i, c_j) means c_i and c_j are direct children of the same parent but are not parents of each other, because the sibling relation is in conflict with the parent–child one. For example, in Fig. 1(b), GO:0048513 and GO:0007399 are siblings, while GO:0048731 is not a sibling of GO:0048513 although they


Fig. 4. Comparison results of three settings with respect to Jaccard, Micro-F1, Macro-F1, Hier-F1 on two different datasets: (a) ImageCLEF; (b) Enron.

Table 2
Classification results on seven datasets (%).

Dataset    Metric    FScore  aMTFL  MIFS   SFUS   HiFSRR  HFSGR  FSSS
Enron      Jaccard   49.13   48.65  50.85  47.32  50.65   50.62  56.46
           Micro-F1  63.29   63.06  64.59  61.94  64.15   64.37  69.57
           Macro-F1  19.29   20.05  22.35  17.76  21.99   21.90  25.11
           Hier-F1   63.70   63.42  65.21  62.29  64.76   64.87  70.21
RCV1       Jaccard   65.61   41.53  35.18  33.75  54.92   54.70  66.72
           Micro-F1  73.36   50.75  44.58  43.08  63.82   63.71  74.66
           Macro-F1  35.78   17.08  13.81  13.30  29.55   29.13  37.42
           Hier-F1   75.68   54.17  47.65  46.24  65.98   65.86  76.61
ImageCLEF  Jaccard   56.75   53.40  55.59  54.59  53.51   54.15  56.16
           Micro-F1  66.84   64.91  66.28  65.79  64.92   65.20  67.02
           Macro-F1  15.60   14.48  14.52  14.10  14.21   14.15  15.59
           Hier-F1   66.96   64.81  66.52  65.61  64.80   65.17  67.00
VOC        Jaccard   47.52   47.62  47.57  47.05  49.41   49.12  49.43
           Micro-F1  57.15   57.44  57.45  56.57  59.03   58.94  59.24
           Macro-F1  31.44   32.90  33.29  32.18  34.30   34.28  34.37
           Hier-F1   59.42   59.49  59.33  58.94  61.04   60.90  61.18
ImageNet   Jaccard   81.96   81.42  81.94  81.31  83.85   83.81  83.98
           Micro-F1  87.08   86.71  87.19  86.64  88.61   88.56  88.65
           Macro-F1  81.04   80.68  81.41  80.56  83.25   83.20  83.11
           Hier-F1   86.29   85.92  86.35  85.84  87.78   87.76  87.85
Seq        Jaccard   21.44   14.37  22.33  19.17  19.91   18.84  22.83
           Micro-F1  31.99   22.95  33.26  29.21  30.21   28.73  33.81
           Macro-F1  2.83    1.51   2.55   2.36   2.62    2.34   2.75
           Hier-F1   33.61   23.96  35.03  30.64  31.64   30.27  35.56
Yeast      Jaccard   35.18   38.60  38.04  37.46  33.29   33.16  41.05
           Micro-F1  40.43   40.58  39.60  38.85  38.42   38.20  42.98
           Macro-F1  9.41    13.37  12.38  10.46  7.64    7.57   15.05
           Hier-F1   46.82   50.18  49.63  49.10  44.80   44.67  52.36

have the same parent, because the former is the latter's parent. According to the new rule, we can update the hierarchical regularization term in Eq. (9). Let P_i = {p_i^1, p_i^2, . . . , p_i^{|P_i|}} denote the index set of all parents of label c_i, i.e., (c_t, c_i) ∈ PCR_dag (t ∈ P_i), and then W(:, t) (t ∈ P_i) denotes a parent vector of W(:, i). The parent–child regularization on weights can be reformulated as

∑_{i=1}^k ∑_{t∈P_i} ∥W(:, i) − W(:, t)∥²₂ = ∥WM − WF∥²_F,   (36)

where F = [F_1, F_2, . . . , F_k] is used to obtain all the parent vectors of all the weight vectors W(:, i) (i = 1, 2, . . . , k) in W. It has a similar definition to Q = [Q_1, Q_2, . . . , Q_k] in Eqs. (19) and (20). M = [M_1, M_2, . . . , M_k] is the self-replication matrix of columns for the matrix W, and each M_v = {M_v(i, j)} ∈ {0, 1}^{k×|P_v|} (v = 1, 2, . . . , k) is used to duplicate W(:, v) |P_v| times. M_v(i, j) is defined as

M_v(v, j) = 1 for 1 ≤ j ≤ |P_v|, and M_v(i, j) = 0 otherwise.   (37)
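As an illustration of how the DAG case changes the bookkeeping, the sketch below builds the duplication matrix M and the parent-selection matrix F of Eq. (36) from a multi-parent map. The dict-based encoding of the DAG is an assumption made for this example.

```python
import numpy as np

def dag_parent_matrices(parents):
    """parents: dict {label_index: list of parent indices} (root: empty list).

    Returns (M, F): M duplicates each column W(:, v) once per parent of v,
    and F selects the corresponding parent columns, so that the DAG
    parent-child penalty is ||W M - W F||_F^2 as in Eq. (36)."""
    k = len(parents)
    cols = sum(len(p) for p in parents.values())
    M = np.zeros((k, cols), dtype=int)
    F = np.zeros((k, cols), dtype=int)
    c = 0
    for v in range(k):
        for t in parents[v]:
            M[v, c] = 1          # copy of W(:, v)
            F[t, c] = 1          # the corresponding parent W(:, t)
            c += 1
    return M, F

# Toy DAG: label 3 has two parents (1 and 2), both of which are children of root 0.
M, F = dag_parent_matrices({0: [], 1: [0], 2: [0], 3: [1, 2]})
```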

For the sibling regularization, we do not need to update Eq. (11) except to obtain the siblings based on the new sibling relation SR_dag. Adding Eq. (36) to Eq. (21), we obtain the objective function for a DAG as

J_dag(W) = min_W ∥XW − Y∥²_F + α ∥W∥_{2,1} + β ∥S − B∥²_F + θ ∥WM − WF∥²_F + γ Tr(W^T WQH).   (38)

Fig. 5. CD diagrams of comparison results of FSSS against other baseline methods on different metrics: (a) Jaccard; (b) Micro-F1; (c) Macro-F1; (d) Hier-F1. All other methods not connected to the thick line are significantly different (α = 0.05) from FSSS.

Accordingly, we calculate the derivative of the objective function J_dag with respect to W as

∂J_dag(W)/∂W = 2X^T(XW − Y) + 2α DW + 4β W(S − B + Tr((B − S)S)1) / Tr(W^T W 1) + 2θ W(M − F)(M − F)^T + γ W(QH + (QH)^T).   (39)

The optimization method for Eq. (38) is similar to Algorithm 1. To verify the effectiveness of our proposed extension method, we conduct the experiment on two gene subsets with a DAG structure label space, which are extracted from the datasets Seq and Yeast. The details of the two DAG data subsets are listed in Table 3. We compare our method with HFSGR, which is the same type of feature selection algorithm for hierarchical classification problems with DAG structures. As in the experimental settings in Section 4.4, we vary the ratio of selected features in {2%, 4%, . . . , 20%} and record the four metrics on the two gene datasets. The comparison results are shown in Table 4. We can observe that FSSS achieves better classification performance than HFSGR at almost all ratios of selected features. In particular, the performance advantage is more obvious at small ratios. The major reason is that HFSGR only considers the parent–child relation in the DAG structure while FSSS handles both the parent–child and sibling relations. Moreover, FSSS can exploit the semantic information of labels to find better features.

4.6. Comparison on different computing methods of semantic similarity

One of the major contributions of our work is to exploit the label semantic similarity based on label descriptions by using sentence embedding and attention technologies from NLP. There exist some other methods to estimate the semantic similarity. For example, in the bioinformatics domain, the disease semantic similarity can be mined by measuring the similarity of their DAG structures [54]. To further analyze the effectiveness of our proposed framework, we conduct some comparison experiments based on different computing methods of label semantic similarity. Specifically, we keep the structural regularization term of the proposed FSSS model unchanged, and replace the original computational method of label semantic similarity with the miRNA disease similarity approach. According to [54,55], the DAG structure of the label space can be used as the disease DAG to calculate the label similarity. As mentioned in Section 4.5, we vary the ratio of selected features in {2%, 4%, . . . , 20%} and record the classification performance on the two DAG gene datasets. Table 5 shows the comparison results. We can observe that the performance with the NLP method of sentence embedding is superior to that with the DAG method of miRNA disease similarity in most cases, which verifies the effectiveness of the original FSSS framework. It indicates that text semantics analysis is relatively better for constructing the label semantic regularization term than structural semantics estimation in the feature selection model.

4.7. Robustness analysis

From Section 4.5, we can see that the FSSS framework is robustly designed. It adapts not only to tree structure label spaces but also to DAG ones. To further validate the robustness of the proposed method, we conduct a parameter sensitivity test and an incomplete and noisy data test.

4.7.1. Parameter sensitivity test

Our FSSS model has four parameters α, β, θ, and γ. The α adjusts the sparsity of the proposed model and β controls the semantic regularization term. The other two parameters jointly determine the hierarchical regularization terms: θ balances the parent–child relation while γ manages the sibling regularization. To investigate the effect of the four parameters on feature selection performance, we conduct a similar experiment to the one mentioned above by varying one parameter with the other parameters fixed. The ratio of selected features is among {4%, 8%, 12%, 16%, 20%} and each parameter varies in {0.01, 0.1, 0.5, 1, 5, 10, 100}. We only give the results of FSSS on the Enron dataset in terms of Hier-F1, since we have similar observations with respect to other metrics on other datasets. First, we fix {β = 0.5, θ = 1, γ = 0.1} and vary the parameter α in {0.01, 0.1, 0.5, 1, 5, 10, 100}. As shown in Fig. 6(a), with the increase of α, the Hier-F1 metric increases and reaches


Table 3
The description of two DAG data subsets.

Dataset     Type   #Validation   #Training   #Test   #Features   #Labels   Height   AWNPL
Seq_DAG     Gene   1000          1386        1332    478         379       13       3.81
Yeast_DAG   Gene   1000          1930        1155    5930        62        6        3.15

Table 4
Classification results on two DAG data subsets (%).

(a) Seq_DAG
                      Ratio of selected features (%)
Metric     Method    2      4      6      8      10     12     14     16     18     20
Jaccard    HFSGR     12.81  20.12  21.62  21.55  20.78  20.09  20.15  20.15  20.11  20.31
Jaccard    FSSS      16.91  23.57  23.35  21.49  20.69  20.80  21.20  21.12  21.33  20.97
Micro-F1   HFSGR     22.01  30.98  32.31  31.52  31.44  30.29  30.36  30.40  30.27  30.60
Micro-F1   FSSS      27.60  35.59  34.78  32.52  31.38  31.31  31.84  31.74  32.01  31.75
Macro-F1   HFSGR     2.09   2.25   2.49   2.75   2.84   2.96   3.14   3.21   3.29   3.45
Macro-F1   FSSS      2.60   3.00   3.11   3.20   3.33   3.44   3.57   3.65   3.66   3.67
Hier-F1    HFSGR     22.25  32.36  34.11  33.80  32.88  31.79  31.87  31.82  31.68  31.98
Hier-F1    FSSS      28.18  36.73  36.15  33.68  32.61  32.69  33.28  33.10  33.38  32.95

(b) Yeast_DAG
                      Ratio of selected features (%)
Metric     Method    2      4      6      8      10     12     14     16     18     20
Jaccard    HFSGR     19.27  20.31  26.84  40.85  42.08  42.31  37.59  33.34  33.90  35.27
Jaccard    FSSS      37.51  43.19  43.30  43.05  41.36  41.11  42.37  46.98  47.52  47.44
Micro-F1   HFSGR     26.14  28.65  34.77  43.96  42.92  42.02  41.13  40.05  40.74  41.60
Micro-F1   FSSS      40.18  41.36  41.49  42.49  42.17  42.20  42.34  50.34  51.23  51.48
Macro-F1   HFSGR     7.89   8.98   7.59   6.24   3.22   2.15   5.86   11.40  11.39  11.29
Macro-F1   FSSS      6.13   10.81  13.78  15.18  16.15  16.49  16.48  17.76  19.47  19.78
Hier-F1    HFSGR     29.86  31.42  38.65  52.17  53.27  53.51  49.25  45.34  45.94  47.24
Hier-F1    FSSS      49.45  53.94  53.91  53.67  52.35  52.03  52.91  57.61  58.08  58.10

Table 5
Classification results based on two computing methods of label semantic similarity (%).

(a) Seq_DAG
                      Ratio of selected features (%)
Metric     Method    2      4      6      8      10     12     14     16     18     20
Jaccard    DAG       16.94  23.64  23.39  21.75  20.72  20.84  20.60  20.94  20.95  21.22
Jaccard    NLP       17.04  23.48  23.34  21.71  20.76  20.85  21.15  21.26  21.25  21.10
Micro-F1   DAG       27.59  35.27  34.77  32.70  31.28  31.41  31.06  31.51  31.57  31.96
Micro-F1   NLP       27.69  35.42  34.74  32.56  31.31  31.42  31.78  31.81  31.89  31.90
Macro-F1   DAG       2.64   3.01   3.13   3.11   3.33   3.49   3.47   3.62   3.64   3.75
Macro-F1   NLP       2.63   3.04   3.14   3.18   3.34   3.50   3.58   3.64   3.69   3.67
Hier-F1    DAG       28.23  36.69  36.17  34.03  32.63  32.76  32.47  32.89  32.87  33.26
Hier-F1    NLP       28.36  36.59  36.11  33.89  32.67  32.78  33.20  33.23  33.24  33.14

(b) Yeast_DAG
                      Ratio of selected features (%)
Metric     Method    2      4      6      8      10     12     14     16     18     20
Jaccard    DAG       30.21  38.59  40.11  39.99  40.10  39.85  38.86  44.11  44.63  44.59
Jaccard    NLP       31.20  37.22  40.50  39.54  40.44  39.98  40.55  44.25  44.25  44.46
Micro-F1   DAG       37.53  39.07  38.48  38.15  39.14  38.81  37.53  47.62  48.67  48.70
Micro-F1   NLP       37.09  38.83  40.01  38.42  40.42  41.01  42.50  48.59  48.69  49.18
Macro-F1   DAG       4.55   8.98   10.51  12.11  14.23  14.33  14.23  15.40  17.16  17.55
Macro-F1   NLP       4.93   9.19   12.81  14.24  15.46  15.95  16.78  16.13  16.66  17.06
Hier-F1    DAG       44.06  50.34  51.37  51.22  51.30  50.98  49.76  55.35  55.91  55.83
Hier-F1    NLP       44.68  49.37  51.71  50.61  51.47  51.17  51.75  55.50  55.46  55.66

4.7. Robustness analysis

Section 4.5 shows that the FSSS framework is designed to be robust: it adapts not only to tree-structured label spaces but also to DAG-structured ones. To further validate the robustness of the proposed method, we conduct a parameter sensitivity test and an incomplete and noisy data test.

4.7.1. Parameter sensitivity test

Our FSSS model has four parameters, α, β, θ, and γ. α adjusts the sparsity of the proposed model and β controls the semantic regularization term. The other two parameters jointly determine the hierarchical regularization terms: θ balances the parent–child relation while γ controls the sibling regularization. To investigate the effect of the four parameters on feature selection performance, we run experiments similar to those above, varying one parameter while keeping the others fixed. The ratio of selected features is taken from {4%, 8%, 12%, 16%, 20%} and each parameter varies in {0.01, 0.1, 0.5, 1, 5, 10, 100}. We report only the results of FSSS on the Enron dataset in terms of Hier-F1, since similar observations hold for the other metrics and datasets. First, we fix {β = 0.5, θ = 1, γ = 0.1} and vary α in {0.01, 0.1, 0.5, 1, 5, 10, 100}. As shown in Fig. 6(a), as α increases, Hier-F1 increases and then remains stable between 1 and 100, which means that too little sparsity in the model degrades feature selection performance. Second, we fix {α = 1, θ = 1, γ = 0.1} and tune β over the same range. Fig. 6(b) presents the result: the classification performance is similar for all values of β and is slightly better when β is 10, which suggests that the performance is not sensitive to β. Third, we investigate the impact of θ by fixing {α = 1, β = 0.5, γ = 0.1}. From Fig. 6(c), the performance increases and then remains stable as θ varies from 1 to 100. Finally, we fix {α = 1, β = 0.5, θ = 100} and vary γ. The result is depicted in Fig. 6(d): the performance increases as γ grows and drops when γ reaches 100, which indicates that an overly large hierarchical sibling regularization may degrade classification performance. In summary, the proposed method is not very sensitive to any of the four parameters. However, the classification performance is relatively more sensitive to the number of selected features, which remains an open problem for further study. The whole sweep can be reproduced with a simple one-at-a-time loop, sketched below.
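A minimal sketch of that one-parameter-at-a-time protocol, assuming a hypothetical helper fit_and_evaluate(params, ratio) that trains FSSS with the given parameters, keeps the given ratio of features, and returns Hier-F1 on the test set:

```python
# One-at-a-time parameter sweep; fit_and_evaluate is a hypothetical helper that
# trains FSSS with `params`, keeps `ratio` of the features, and returns Hier-F1.
grid = [0.01, 0.1, 0.5, 1, 5, 10, 100]
ratios = [0.04, 0.08, 0.12, 0.16, 0.20]
fixed = {                                   # fixed settings used for each sweep (see text)
    "alpha": {"beta": 0.5, "theta": 1, "gamma": 0.1},
    "beta":  {"alpha": 1, "theta": 1, "gamma": 0.1},
    "theta": {"alpha": 1, "beta": 0.5, "gamma": 0.1},
    "gamma": {"alpha": 1, "beta": 0.5, "theta": 100},
}

results = {}
for name, others in fixed.items():
    for value in grid:
        params = dict(others, **{name: value})
        results[(name, value)] = [fit_and_evaluate(params, r) for r in ratios]
```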

4.7.2. Incomplete and noisy data test

In this subsection, we simulate incomplete and noisy label spaces to test the robustness of the proposed model. Specifically, we construct a synthetic subset with incomplete label spaces from the two real-world datasets Enron and VOC by randomly deleting some label entries in the label matrix Y. Similarly, we derive a subset with noisy label spaces by randomly changing the values of some entries in the label matrix.
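The corruption itself is straightforward; the sketch below shows one way to generate both variants (the exact sampling protocol, e.g. whether the percentage is taken over positive entries or over all entries, is an assumption for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_label_matrix(Y, rate, mode="delete"):
    """Return a corrupted copy of the binary label matrix Y.

    mode="delete": simulate incomplete labels by zeroing a fraction of the positive entries.
    mode="flip":   simulate noisy labels by flipping a fraction of all entries.
    """
    Y = Y.copy()
    if mode == "delete":
        rows, cols = np.nonzero(Y)
        k = int(rate * rows.size)
        idx = rng.choice(rows.size, size=k, replace=False)
        Y[rows[idx], cols[idx]] = 0
    else:
        mask = rng.random(Y.shape) < rate
        Y[mask] = 1 - Y[mask]
    return Y
```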


Fig. 6. The parameter effects of FSSS on the Enron dataset in terms of Hier-F1: (a) Effect of α; (b) Effect of β; (c) Effect of θ; (d) Effect of γ.

The percentage of deleted and changed labels varies in {0%, 5%, 10%, 15%, 20%} and the ratio of selected features is taken from {4%, 8%, 12%, 16%, 20%}. We report the performance of FSSS on the Enron and VOC datasets in terms of Hier-F1; similar results are observed for the other metrics and datasets. Fig. 7 shows the Hier-F1 values on the two datasets. We can observe that the classification performance is very similar across the incomplete-label cases. Fig. 8 depicts the feature selection performance when noise is added to the label space; the classification performance stays almost the same in all cases. These results indicate that FSSS is noise resilient and can select discriminative features even when labels are missing or corrupted. In summary, the proposed FSSS framework is indeed robust: it adapts to both tree and DAG structures of the label space, is not sensitive to its model parameters, and is resilient to label noise.

5. Related work

We review the literature on traditional feature selection, multi-label feature selection, feature selection for hierarchical classification, and recently developed feature selection methods.

Traditional feature selection methods are designed for flat data with fixed samples and simple features. From a data perspective [2], they can be grouped into similarity based, information theoretic based, sparse learning based, and statistical based methods.

The similarity based algorithms evaluate feature relevance by their ability to preserve data similarity [15,16]. The information theoretic algorithms assess feature importance via criteria such as information gain and mutual information [17]. One of them is the well-known mRMR algorithm [18], which uses mutual information to select maximally relevant features with minimum redundancy. A related method is MRMD [19], which ranks features according to both their relevance to the target classes and the distances among the features themselves. The sparse learning based algorithms aim to train a classification model by simultaneously minimizing the fitting error and a sparse regularization term on the feature coefficients [6,20,21]. The statistical based methods rely on predefined statistical measures such as the T-score and the CFS index to select features [22,23]. Recently, a statistical method based on analysis of variance (ANOVA) has been used effectively to identify mitochondrial proteins of the malaria parasite [24]. The ANOVA approach ranks features by the ratio between their between-group and within-group variances [25]; a compact version of this ranking is sketched below.
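For concreteness, a flat, single-label version of that ANOVA ranking can be written directly in NumPy; this is a generic illustration, not the implementation used in [24,25].

```python
import numpy as np

def anova_f_scores(X, y):
    """Per-feature ANOVA F-ratio: between-group variance over within-group variance.

    X: (n, d) feature matrix; y: length-n array of class labels.
    """
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    ms_between = ss_between / (len(classes) - 1)
    ms_within = ss_within / (X.shape[0] - len(classes))
    return ms_between / (ms_within + 1e-12)

# Keep the k features with the largest F-ratios:
# selected = np.argsort(-anova_f_scores(X, y))[:k]
```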


Fig. 7. The classification results of incomplete data in terms of Hier-F1 on two different datasets: (a) Enron; (b) VOC.

Fig. 8. The classification results of noisy data in terms of Hier-F1 on two different datasets: (a) Enron; (b) VOC.

However, traditional feature selection methods cannot be applied directly to hierarchical classification, since they assume that labels are independent of each other and that the feature space is simple and unstructured.

For complicated feature spaces, some researchers have proposed algorithms that improve classification performance by exploiting the structural information of features to guide feature selection. Depending on the type of feature space structure, these algorithms are categorized into feature selection with group structure [26], feature selection with tree structure [27], and feature selection with graph structure [28]. However, these methods only investigate relations in the feature space and do not analyze the structural information of labels.

In the multi-label classification domain, feature selection algorithms explore label correlations to search for discriminative features. From a selection strategy perspective, they can be categorized into filter, wrapper, and embedded approaches [56]. The filter methods [57] evaluate feature subsets independently of any learning model and rely only on characteristics of the data. The wrapper methods [58] use the performance of a predefined learning algorithm to evaluate candidate features and find the optimal subset, which incurs a large computational cost. The embedded methods [59] incorporate feature selection into model fitting [60,61]; here, the model is usually a multi-label classification learner with a linear scoring function. Owing to their good performance and interpretability, the most widely used embedded methods are sparse learning based, among which the ℓ2,1-norm regularization approaches are preferred in many scenarios [20,21,34,35].

From a data model perspective [29], there are two types of multi-label feature selection algorithms, analogous to the categories used in multi-class classification [62]. One is the transformation based method, which decomposes a multi-label dataset into single-label ones and then applies a traditional feature selection technique [30]. The other is the direct method, which adapts existing feature selection algorithms to multi-label data [31]. However, multi-label feature selection methods focus on label dependencies and rarely consider hierarchical label relations.

Recently, a variety of methods have been developed for new applications. For multi-source feature selection, Yu et al. [63] proposed the MCFS algorithm, which formulates causal feature selection with multiple sources as a search for an invariant set across the datasets. For multi-view feature selection, Wang et al. [64] presented a joint learning framework that combines subspace learning for different modalities with the ℓ2,1-norm for coupled feature selection. Similarly, in [3], a supervised sparse multi-view feature selection model was proposed based on a separable weighted loss term and discriminative regularization terms, which considers both the complementarity of multiple views and the specificity of each view. For distributed multi-label feature selection, Gonzalez-Lopez et al. [65] proposed a distributed model that computes, on Apache Spark, a score measuring the quality of each feature with respect to multiple labels. For local feature selection, Armanfard et al. [66] presented a localized approach in which each region of the sample space is associated with its own optimized feature set, allowing the selected features to adapt to local variations in the sample space.


Nevertheless, these newer methods still do not address the complexity of the label space, especially the structural and semantic information of labels.

In recent years, some studies have applied feature selection techniques to hierarchical classification problems, aiming to exploit the structural information of labels to improve efficiency. For example, Slavkov et al. [32] proposed a feature ranking method named HMC-ReliefF, which extends the RReliefF algorithm to hierarchical multi-label classification tasks; by using a weighted Euclidean distance, it incorporates the structural information of labels into the assessment of feature relevance. Cerri et al. [33] leveraged the decision tree induction algorithm Clus-HMC to select features for hierarchical multi-label protein function prediction, where the label hierarchy helps construct the feature nodes of the decision tree. Zhao et al. [34] investigated node interconnections in a tree-structured label space and extracted parent–child and sibling relationships as regularizations to improve the feature selection process. Tuo et al. [35] proposed a hierarchical feature selection method named HFSGR for classification problems with tree-structured label spaces, which explores two-way parent–child relations between classes and can be extended to DAG label structures. These works differ from our proposed FSSS framework: FSSS exploits the label space more completely by leveraging both the structural information and the semantic descriptions of labels, and it imposes the two regularizations jointly in one learning model.

It is also worth mentioning that hybrid feature selection techniques have made good progress in specific domains. The two-step strategy is one of them; it applies two different methods in two independent steps [25,67]. For example, in identifying origins of replication [67], the F-score method is first used to obtain a candidate feature subset, and the mRMR algorithm is then applied to find the subset that yields the maximum accuracy. We believe the two-step strategy also suits hierarchical classification, since it is a general hybrid technique that can be combined with our method: in the first step, the proposed FSSS framework reduces a large number of features by exploiting the semantic and structural information of the label space; in the second step, a more expensive method, such as wrapper-based feature selection, focuses on the smaller feature subset to obtain even more discriminative features. A sketch of such a pipeline is given below.
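A minimal sketch of that two-step pipeline, where fsss_scores stands in for the feature ranking produced by FSSS (e.g. the row norms of W) and wrapper_eval is a hypothetical helper that trains a classifier on a candidate feature subset and returns a validation score:

```python
import numpy as np

def two_step_select(fsss_scores, wrapper_eval, k1, k2):
    """Step 1: keep the k1 features ranked highest by the cheap (FSSS-style) scores.
    Step 2: greedy forward (wrapper) search for k2 features among the survivors."""
    survivors = list(np.argsort(-fsss_scores)[:k1])
    selected = []
    for _ in range(k2):
        best_f, best_score = None, -np.inf
        for f in survivors:
            if f in selected:
                continue
            score = wrapper_eval(selected + [f])
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
    return selected
```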
6. Conclusion and future work

In this paper, we propose a novel feature selection framework, FSSS, for hierarchical classification tasks. The method simultaneously makes use of the semantic descriptions and the structural information of labels during feature selection. First, we transform the label descriptions into mathematical vectors via sentence embedding techniques, and then use a similarity score to measure the relevance between pairs of label vectors as the semantic regularization. Second, we exploit the hierarchical information of the label space and extract the parent–child and sibling relations from the tree as the structural regularization. Finally, we construct a sparse learning based model and impose the joint semantic and structural regularizations on it to improve feature selection performance. Methodologically, we propose an iterative algorithm based on gradient descent to optimize the proposed model.


Experimental results on seven real-world datasets show that the semantic and structural regularizations lead to better effectiveness and that the proposed framework outperforms state-of-the-art feature selection methods in the hierarchical classification domain. We also extend the proposed algorithm to DAG label spaces and obtain performance superior to another feature selection baseline for DAGs. Future work will focus on feature selection for large-scale online streaming data, where the features and the label space vary over time, and on feature selection for imbalanced data classification [68,69].

CRediT authorship contribution statement

Hai Huang: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration. Huan Liu: Resources, Writing - review & editing, Supervision, Project administration.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61802029 and the China Scholarship Council. The authors wish to thank the anonymous reviewers and the editors for their valuable comments and suggestions.

References

[1] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2012.
[2] J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, H. Liu, Feature selection: A data perspective, ACM Comput. Surv. 50 (6) (2018) 94.
[3] J. Zhong, N. Wang, Q. Lin, P. Zhong, Weighted feature selection via discriminative sparse multi-view learning, Knowl.-Based Syst. 178 (2019) 132–148.
[4] T. Deng, D. Ye, R. Ma, H. Fujita, L. Xiong, Low-rank local tangent space embedding for subspace clustering, Inform. Sci. 508 (2020) 1–21.
[5] J. Yan, B. Zhang, N. Liu, et al., Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing, IEEE Trans. Knowl. Data Eng. 18 (3) (2006) 320–333.
[6] F. Nie, H. Huang, X. Cai, C.H. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, in: Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.
[7] Z. Li, Y. Yang, J. Liu, X. Zhou, H. Lu, Unsupervised feature selection using nonnegative spectral analysis, in: Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012, pp. 1026–1032.
[8] C.N. Silla, A.A. Freitas, A survey of hierarchical classification across different application domains, Data Min. Knowl. Discov. 22 (1–2) (2011) 31–72.
[9] P. Vateekul, M. Kubat, K. Sarinnapakorn, Hierarchical multi-label classification with SVMs: A case study in gene function prediction, Intell. Data Anal. 18 (4) (2014) 717–738.
[10] J. Rousu, C. Saunders, S. Szedmak, J. Shawe-Taylor, Kernel-based learning of hierarchical multilabel classification models, J. Mach. Learn. Res. 7 (Jul) (2006) 1601–1626.
[11] I. Dimitrovski, D. Kocev, S. Loskovska, S. Džeroski, Hierarchical annotation of medical images, Pattern Recognit. 44 (10–11) (2011) 2436–2449.
[12] R. Cerri, R.C. Barros, A.C. de Carvalho, Hierarchical classification of gene ontology-based protein functions with neural networks, in: 2015 International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.
[13] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[14] M. Ashburner, C.A. Ball, J.A. Blake, et al., Gene ontology: tool for the unification of biology, Nature Genet. 25 (1) (2000) 25–29.
[15] Z. Zhao, H. Liu, Spectral feature selection for supervised and unsupervised learning, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 1151–1157.
[16] M. Robnik-Šikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Mach. Learn. 53 (1–2) (2003) 23–69.
[17] B. Guo, M.S. Nixon, Gait feature subset selection by mutual information, IEEE Trans. Syst. Man Cybern. A 39 (1) (2009) 36–46.
[18] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1226–1238.
[19] Q. Zou, J. Zeng, L. Cao, R. Ji, A novel features ranking metric with application to scalable visual and bioinformatics data classification, Neurocomputing 173 (2016) 346–354.


[20] J. Liu, S. Ji, J. Ye, Multi-task feature learning via efficient ℓ2,1-norm minimization, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 339–348.
[21] Q. Lin, Y. Xue, J. Wen, P. Zhong, A sharing multi-view feature selection method via Alternating Direction Method of Multipliers, Neurocomputing 333 (2019) 124–134.
[22] J.C. Davis, R.J. Sampson, Statistics and Data Analysis in Geology (Vol. 646), Wiley, New York, 1986.
[23] M.A. Hall, L.A. Smith, Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper, in: FLAIRS Conference, 1999, pp. 235–239.
[24] H. Ding, D. Li, Identification of mitochondrial proteins of malaria parasite using analysis of variance, Amino Acids 47 (2) (2015) 329–333.
[25] J.X. Tan, F.Y. Dao, H. Lv, P.M. Feng, H. Ding, Identifying phage virion proteins by using two-step feature selection methods, Molecules 23 (8) (2018) 2000.
[26] L. Jacob, G. Obozinski, J.P. Vert, Group lasso with overlap and graph lasso, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 433–440.
[27] J. Liu, J. Ye, Moreau-Yosida regularization for grouped tree structure learning, in: Advances in Neural Information Processing Systems, 2010, pp. 1459–1467.
[28] S. Yang, L. Yuan, Y.C. Lai, X. Shen, P. Wonka, J. Ye, Feature grouping and selection over an undirected graph, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 922–930.
[29] R.B. Pereira, A. Plastino, B. Zadrozny, L.H. Merschmann, Categorizing feature selection methods for multi-label classification, Artif. Intell. Rev. 49 (1) (2018) 57–78.
[30] W. Chen, J. Yan, B. Zhang, Z. Chen, Q. Yang, Document transformation for multi-label feature selection in text categorization, in: Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007, pp. 451–456.
[31] Y. Zhang, Z.H. Zhou, Multilabel dimensionality reduction via dependence maximization, ACM Trans. Knowl. Discov. Data (TKDD) 4 (3) (2010) 1411–1421.
[32] I. Slavkov, J. Karcheska, D. Kocev, S. Dzeroski, HMC-ReliefF: Feature ranking for hierarchical multi-label classification, Comput. Sci. Inf. Syst. 15 (1) (2018) 187–209.
[33] R. Cerri, R.G. Mantovani, M.P. Basgalupp, A.C. de Carvalho, Multi-label feature selection techniques for hierarchical multi-label protein function prediction, in: 2018 International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1–7.
[34] H. Zhao, P.F. Zhu, P. Wang, Q.H. Hu, Hierarchical feature selection with recursive regularization, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017, pp. 3483–3489.
[35] Q. Tuo, H. Zhao, Q. Hu, Hierarchical feature selection with subtree based graph regularization, Knowl.-Based Syst. 163 (2019) 996–1008.
[36] A. Kosmopoulos, I. Partalas, E. Gaussier, G. Paliouras, I. Androutsopoulos, Evaluation measures for hierarchical classification: a unified view and novel approaches, Data Min. Knowl. Discov. 29 (3) (2015) 820–865.
[37] S. Subramanian, A. Trischler, Y. Bengio, C.J. Pal, Learning general purpose distributed sentence representations via large scale multi-task learning, 2018, arXiv preprint arXiv:1804.00079.
[38] D. Cer, Y. Yang, S.Y. Kong, et al., Universal sentence encoder, 2018, arXiv preprint arXiv:1803.11175.
[39] A. Vaswani, N. Shazeer, N. Parmar, et al., Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[40] P. Wang, Z. Yang, S. Niu, Y. Zhang, L. Zhang, S. Niu, Modeling dynamic pairwise attention for crime classification over legal articles, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 485–494.
[41] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, 1999.
[42] L. Jian, J. Li, H. Liu, Exploiting multilabel information for noise-resilient feature selection, ACM Trans. Intell. Syst. Technol. (TIST) 9 (5) (2018) 52.
[43] B. Klimt, Y. Yang, The Enron corpus: A new dataset for email classification research, in: European Conference on Machine Learning, 2004, pp. 217–226.
[44] D.D. Lewis, Y. Yang, T.G. Rose, F. Li, RCV1: A new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (Apr) (2004) 361–397.

[45] I. Dimitrovski, D. Kocev, S. Loskovska, S. Džeroski, Hierarchical annotation of medical images, Pattern Recognit. 44 (10–11) (2011) 2436–2449.
[46] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[47] A. Clare, Machine Learning and Data Mining for Yeast Functional Genomics (Doctor of Philosophy), The University of Wales, Aberystwyth, 2003.
[48] Z. Barutcuoglu, R.E. Schapire, O.G. Troyanskaya, Hierarchical multi-label prediction of gene function, Bioinformatics 22 (7) (2006) 830–836.
[49] O.O. Koyejo, N. Natarajan, P.K. Ravikumar, I.S. Dhillon, Consistent multilabel classification, in: Advances in Neural Information Processing Systems, 2015, pp. 3321–3329.
[50] Z. Ma, F. Nie, Y. Yang, J.R. Uijlings, N. Sebe, Web image annotation via subspace-sparsity collaborated feature selection, IEEE Trans. Multimed. 14 (4) (2012) 1021–1030.
[51] C.C. Chang, C.J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) 2 (3) (2011) 27.
[52] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (1) (2006) 1–30.
[53] O.J. Dunn, Multiple comparisons among means, J. Amer. Statist. Assoc. 56 (293) (1961) 52–64.
[54] Q. Xiao, J. Dai, J. Luo, H. Fujita, Multi-view manifold regularized learning-based method for prioritizing candidate disease miRNAs, Knowl.-Based Syst. 175 (2019) 118–129.
[55] D. Wang, J.A. Wang, M. Lu, F. Song, Q.H. Cui, Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases, Bioinformatics 26 (2010) 1644–1650.
[56] M. Rong, D. Gong, X. Gao, Feature selection and its use in big data: Challenges, methods, and trends, IEEE Access 7 (2019) 19709–19725.
[57] G. Ditzler, R. Polikar, G. Rosen, A sequential learning approach for scaling up filter-based feature subset selection, IEEE Trans. Neural Netw. Learn. Syst. 29 (6) (2018) 2530–2544.
[58] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1–2) (1997) 273–324.
[59] Q. Gu, Z. Li, J. Han, Correlated multi-label feature selection, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2011, pp. 1087–1096.
[60] T. Lai, R. Chen, C. Yang, Q. Li, H. Fujita, A. Sadri, H. Wang, Efficient robust model fitting for multistructure data using global greedy search, IEEE Trans. Cybern. (2019), http://dx.doi.org/10.1109/TCYB.2019.2900096.
[61] T. Lai, H. Fujita, C. Yang, Q. Li, R. Chen, Robust model fitting based on greedy search and specified inlier threshold, IEEE Trans. Ind. Electron. 66 (10) (2018) 7956–7966.
[62] L. Zhou, H. Fujita, Posterior probability based ensemble strategy using optimizing decision directed acyclic graph for multi-class classification, Inform. Sci. 400 (2017) 142–156.
[63] K. Yu, L. Liu, J. Li, W. Ding, T.D. Le, Multi-source causal feature selection, IEEE Trans. Pattern Anal. Mach. Intell. (2019), http://dx.doi.org/10.1109/TPAMI.2019.2908373.
[64] K. Wang, R. He, L. Wang, W. Wang, T. Tan, Joint feature selection and subspace learning for cross-modal retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 38 (10) (2016) 2010–2023.
[65] J. Gonzalez-Lopez, S. Ventura, A. Cano, Distributed multi-label feature selection using individual mutual information measures, Knowl.-Based Syst. 188 (2020) 105052, http://dx.doi.org/10.1016/j.knosys.2019.105052.
[66] N. Armanfard, J.P. Reilly, M. Komeili, Local feature selection for data classification, IEEE Trans. Pattern Anal. Mach. Intell. 38 (6) (2016) 1217–1227.
[67] F.Y. Dao, H. Lv, F. Wang, et al., Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique, Bioinformatics 35 (12) (2018) 2075–2083.
[68] J. Sun, H. Li, H. Fujita, B. Fu, W. Ai, Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting, Inf. Fusion 54 (2020) 128–144.
[69] C. Zhang, J. Bi, S. Xu, E. Ramentol, G. Fan, B. Qiao, H. Fujita, Multi-imbalance: An open-source software for multi-class imbalance learning, Knowl.-Based Syst. 174 (2019) 137–143.
