A novel approach for learning label correlation with application to feature selection of multi-label data

Highlights

Generally, every label has its own indispensable features, and it is reasonable to assume that higher repetitiveness of indispensable features represents higher correlations among labels. Inspired by the discernibility matrix and granular computing, this paper draws the following conclusions:
• Essential elements of each label are defined and characterized to reflect the internal connection between features and label, and a process for calculating the essential elements of a single label is provided.
• By considering the overlap of the collections of essential elements determined by different labels, the relevancy of labels and the corresponding relevance judgement matrix of the label set are described. Several labels with strong relationships are therefore assigned to a relevant label group, and local and global label correlations can be computed.
• By applying local label correlation to feature selection of multi-label data, we present a novel multi-label learning algorithm named CLSF to select a compact subset of indispensable features for each relevant label group. Comprehensive experiments on a total of 11 benchmark data sets clearly illustrate the effectiveness and efficiency of CLSF against five other multi-label learning algorithms.


A novel approach for learning label correlation with application to feature selection of multi-label data

Xiaoya Che1, Degang Chen2∗, Jusheng Mi3

1 School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China
2 School of Mathematics and Physics, North China Electric Power University, Beijing, 102206, China
3 College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, 050016, China

Abstract: Each example of multi-label data is represented as an object with its feature vector (i.e., an instance) and is related to multiple labels. Learning label correlation can effectively reduce the number of labels that need to be predicted and optimize classification performance. For this reason, label correlation plays a crucial role in multi-label learning and has been explored by many existing algorithms. Generally, every label has its own indispensable features, and it is reasonable to assume that a higher repetitiveness of indispensable features represents higher correlations among labels. Inspired by this fact, the essential elements of each label, which are composed of indispensable features, are constructed in this paper. A method is proposed for learning label correlation and applying it to feature selection of multi-label data based on the overlap of the different families of essential elements related to the labels. Firstly, the essential elements of each label are defined and characterized to reflect the internal connection between features and label, and a process for calculating the essential elements of a single label is provided. Secondly, by considering the overlap of the essential element collections determined by different labels, the relevancy of labels and the corresponding relevance judgement matrix of the label set are described. Several labels with strong relationships are therefore assigned to a relevant group of labels, and local and global label correlations can be computed. A novel multi-label learning algorithm, called CLSF, is then presented to select a compact subset of indispensable features for each relevant group of labels by applying local label correlation to feature selection of multi-label data. Finally, comprehensive experiments on eleven benchmark data sets clearly illustrate the effectiveness and efficiency of CLSF against five other multi-label learning algorithms.

Key words: Multi-label learning; Global label correlation; Local label correlation; Feature selection.

1 Introduction

In traditional supervised learning, it is always assumed that each sample only has one kind of semantic information, i.e., only has one label. However, in real-world problems, examples may be complicated and have multiple class labels at the same time. For instance, a document may cover several topics, such as government and reform [22]; a scenic image may be attached with a set of semantic classes, such as urban and beach [17]; and

Corresponding author. E-mail: [email protected] (Xiaoya Che), [email protected] (Degang Chen), [email protected] (Jusheng Mi).


a gene may be annotated with multiple functional classes, such as energy, transcription and metabolism [33]. In all of these cases, the examples of multi-label data in a training set are related to multiple labels. Multi-label learning tasks deal with examples that have several labels simultaneously, and they widely exist in various applications, such as image annotation [31], music emotion classification [7] and social networks [2]. The ultimate aim of multi-label classification is to construct a model that can accurately predict the possible label set of test instances (i.e., objects with their feature vectors). The challenge of multi-label classification is that the classifier's output space is exponential in the number of possible class labels (i.e., 2^s, where s is the number of possible class labels). The huge size of the output space leads to difficulties in the multi-label classification of big data [21]. Meanwhile, the dimensions of objects and features in multi-label data sets are also increasing rapidly [25], and different concepts are related to various labels. There is therefore a great need for approaches that offer effective and efficient multi-label classification.

The landmark difference between multi-label learning and traditional supervised learning is that there are various relationships among the different labels in multi-label learning, and the information about these relationships is usually useful for learning other related labels [20]. Therefore, global and local label correlations effectively reduce the labels to be predicted and improve the performance of multi-label classification. To improve the accuracy and speed of multi-label learning, many algorithms that exploit the label correlation of multi-label data investigate global [14, 13] and local label correlations [4, 24, 32]. They are roughly grouped into three major categories [20]: first-order, second-order and high-order approaches. Second-order methods deal with the multi-label learning problem by exploring the pairwise relationships between labels, so that the multi-label learning problem is decomposed into C_s^2 pairwise comparison problems [20]. Two examples are the local positive and negative pairwise label correlation (LPLC) method [9] and the multi-label learning approach using local correlation (ML-LOC) provided by Huang [23]. However, in the real world, label correlations can be rather complex and more than second-order. High-order approaches are needed to model interactions among all the class labels, that is, to consider that each label is affected by all the other labels. Examples of high-order approaches include the work of Yu [30], in which two novel multi-label classification algorithms based on variable precision neighborhood rough sets are designed: multi-label classification using rough sets (MLRS) and MLRS-LC, which uses local correlation. High-order approaches may have higher computational complexity and less adaptability. The various correlations in the label set are essentially determined by repeats of the indispensable features of the corresponding labels, because each label has its own indispensable features, which are closely related and absolutely necessary to the label. However, most of the existing methods do not consider which vital features contribute to the relevance of labels, and they do not naturally split the label set into several disjoint subsets of relevant labels in which the labels have strong correlations.
Feature selection is an excellent way to increase the efficiency of models in multi-label learning: the irrelevant or unnecessary features for each label are eliminated from the original feature set of the training data. Label-specific features constitute an outstanding feature selection idea first defined by Zhang [21] in the algorithm LIFT, where each label has its own specific characteristics; however, LIFT does not discover the discriminant information hidden between positive and negative instances. Zhang [15] then proposed a novel algorithm, called ML-DFL, in which a matrix is produced whose elements represent the similarity based on the distance between the positive and negative instances.

Recently, entropy-based methods have also attracted significant attention in multi-label learning, and several related multi-label feature selection methods have been put forward. Lee and Kim [10] selected an effective compact set of features by maximizing the dependency between the selected features and the label set, using second-order approximation functions of multivariate mutual information. Furthermore, Lee and Kim [11] conducted a theoretical analysis to investigate why a score function that considers lower-degree interaction information can effectively select a compact feature set, and proposed a new score function called D2F with excellent performance in multi-label classification. They [12] also applied mutual information to improve a genetic algorithm (GA) based multi-label feature selection algorithm and presented a novel memetic algorithm to refine the population of feature subsets generated by the GA by adding (and removing) relevant (and redundant) features to (from) multiple labels. A new multi-label feature selection method, SCLS, was provided by Lee and Kim [13] to further improve on previous studies; it employs an effective approximation for the dependency calculations that had not yet been considered in the multi-label feature selection problem. Lin and Hu also presented several entropy-based multi-label feature selection methods, such as MDMR [27], which combines mutual information with a max-dependency and min-redundancy algorithm to select a superior feature subset for multi-label learning, and multi-label feature selection with label correlation (MUCO) [26], which is based on fuzzy mutual information. Lin et al. [28] provided a multi-label feature selection algorithm with streaming labels. Meanwhile, Chen [16] explored feature selection for multi-label data with the help of kernels and mutual information. To explore feature selection in multi-label learning, Li [7] used two functions to reflect the certainty and uncertainty between the labels and equivalence classes; the uncertainties conveyed by the labels were then analyzed and a new type of feature selection, called complementary decision reduct (CDR), was proposed. However, existing methods for feature selection of multi-label data focus only on feature selection and usually overlook the impact of label correlation.

Table 1.1: A multi-label data set with five objects, three features and two labels.
Table 1.2: Discernibility with A and l1. Table 1.3: Discernibility with A and l2.
Table 1.4: Discernibility with A and the relevant label group Y1.
Tables 1.5-1.10: Discernibility with each feature subset and Y1.

To address the above-stated problems, we consider the essential elements of each label to explore the relevance between features and labels, where an essential element is composed of indispensable features for the label. A novel process for calculating the local label correlation, with application to feature selection of multi-label data, is then proposed. A simple example is used as a supplement to explain the basic concepts, increase readability and facilitate understanding of the theoretical part of the proposed method. Table 1.1 lists a simple multi-label data set with five objects, three features and two labels, where the five objects are divided into a quotient set with four equivalence classes based on the original feature set A. Tables 1.2 and 1.3 are single-label data sets with label {l1} or {l2}, in which the discernibility information is determined by the original feature set A together with {l1} or {l2}. Taking Table 1.2 as an example, because the equivalence classes {x4} and {x5} have the same value of label {l1}, they do not need to be discriminated from each other; only the equivalence class {x2, x3} should be discriminated from the other equivalence classes. By keeping the positive and negative consistency (equivalent to discernibility) conveyed by the features and label unchanged, the essential elements of the label are defined and described through the discernibility matrix, where an essential element is composed of indispensable features for the label. The discernibility matrix of label {l1} or {l2} defined in our paper implies the entire certainty information of Table 1.2 or 1.3. The relevancy between labels is computed based on all the essential elements in the discernibility matrices to evaluate the corresponding relevance judgement matrix of the label set, and the relativity of any given label with the other labels can then be judged. Therefore, the whole label set is automatically divided into several disjoint relevant label groups, and labels with strong relationships are assigned to the same group of relevant labels. In Table 1.4, the two labels in the label set Y have a strong relationship, that is, there is only one relevant label group Y1 = {l1, l2} in Y; the corresponding global and local label correlations can thus be counted. The equivalence classes induced by A are reassigned to the two regions in Table 1.4 by considering the weight of each equivalence class and the label correlation. Tables 1.5 to 1.10 characterize the different quotient sets determined by all the feature subsets. The equivalence classes in the quotient sets induced by {a2} and {a1, a3} are appropriate for retaining the discernibility implied in Table 1.4, whereas the quotient sets computed from {a1, a2} and {a2, a3} are too fine and those computed from {a1} and {a3} are too coarse for Table 1.4. The aim of our multi-label feature selection method, CLSF, is to find the minimal feature subset that keeps the discernibility information between the original feature set and the label group {Y1} unchanged; therefore, {a2} is the most suitable choice. This algorithm simplifies the process and effectively improves the speed of predicting the possible labels of unseen instances in multi-label classification.

The remainder of this paper is organized as follows. Section 2 reviews some basic concepts of multi-label classification and five performance evaluation metrics. Section 3 provides the definition and computation process of essential elements; the concept of label relevancy and the corresponding relevance judgement matrix are then provided, the label set is split into relevant label groups, and global and local label correlations are explored. In Section 4, the labels in a relevant label group are integrated to reconstruct a binary relation by considering local label correlation and classification error rate, and a novel algorithm, called correlation-labels-specific features (CLSF), is proposed. In Section 5, eleven benchmark data sets are selected to test CLSF and six other multi-label classification algorithms in terms of five performance evaluation metrics. In Section 6, we conclude this paper with a summary and an outlook for further research.

2 Preliminaries

To further clarify the content of our discussion, we present several formal notations and five evaluation metrics of multi-label classification used in this paper. Multi-label classification [6] is a supervised learning problem in which every instance may be related to multiple labels. In the multi-label learning process, X = {(x, A(x))|x ∈ U} ⊆ R^m is the domain of m-dimensional input instances, where A = {a1, a2, . . . , am} is a set of features. Any object x ∈ U is described by a vector of m feature values A(x) = [a1(x), a2(x), . . . , am(x)]. The set Y = {l1, l2, . . . , ls} is a finite set of s possible labels. For each object x ∈ U, an s-dimensional label value vector Y(x) = [l1(x), l2(x), . . . , ls(x)] represents its possible labels: if label lj belongs to object x ∈ U then lj(x) = 1, otherwise lj(x) = −1. The set of possible labels is denoted by K(x) = {lj | lj ∈ Y, lj(x) = 1} ⊆ Y. The set D = {(A(xi), Y(xi)) | xi ∈ U, i = 1, 2, . . . , n} denotes a training data set with n instances and their related labels. Here, subscripts are used to avoid ambiguity with the label dimensions, so lj(xi) corresponds to the binary relevance of the jth label in Y associated with the ith object in U. In this paper, the features of multi-label data are discretized with the discretization algorithm FIMUS [18].

For any feature subset B ⊆ A, there is an associated indiscernibility relation [34] RB: RB = {(x, y) ∈ U × U | ∀a ∈ B, a(x) = a(y)}. Obviously, RB forms a quotient set (i.e., a partition into equivalence classes) of U, denoted by U/RB = {[x]B | x ∈ U}, where [x]B is the equivalence class determined by x ∈ U with respect to the indiscernibility relation RB, that is, [x]B = {y ∈ U | (x, y) ∈ RB}. Objects with or without a given label are considered positive or negative objects, respectively. For a given class label lj ∈ Y, the sets of positive and negative training objects are denoted as [21]: Posj = {xi | (A(xi), Y(xi)) ∈ D, lj ∈ K(xi)}, Negj = {xi | (A(xi), Y(xi)) ∈ D, lj ∉ K(xi)}.

H : X → R is a set of classifiers. The final purpose of the multi-label classification task is to output a real-valued classifier which optimizes a multi-label evaluation metric (such as Coverage) for h(x) = K(x). The performance evaluation metrics of multi-label classification algorithms differ from those of single-label classification algorithms.
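As a small illustration of these notations (not part of the original paper), the following Python sketch builds the quotient set U/R_B from a discretized feature table and the positive and negative object sets of a label; the variable names are hypothetical.

```python
from collections import defaultdict

def quotient_set(A, B):
    """Equivalence classes of R_B: group object indices that agree on every feature in B.

    A: dict mapping a feature name to the list of its discrete values, one per object.
    B: list of feature names (a subset of A's keys).
    """
    n = len(next(iter(A.values())))
    classes = defaultdict(list)
    for i in range(n):
        classes[tuple(A[a][i] for a in B)].append(i)
    return list(classes.values())

def pos_neg(label):
    """Pos_j and Neg_j for one label vector with values in {1, -1}."""
    pos = [i for i, v in enumerate(label) if v == 1]
    neg = [i for i, v in enumerate(label) if v == -1]
    return pos, neg
```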


Five multi-label evaluation metrics [20] are available to evaluate the performance of multi-label classification algorithms: Hamming loss, Average Precision, Coverage, One-error and Ranking loss.

(1) Hamming loss computes the ratio of object-label pairs that are misclassified between the predicted labels h(x) = K(x)' and the ground-truth labels, normalized over the total number of objects and labels. The smaller the value of Hamming loss, the better the performance of the algorithm.
$$\mathrm{Hamming\ loss} = 1-\frac{1}{ns}\sum_{i=1}^{n}\sum_{j=1}^{s}\big[\,l_j'(x_i)=l_j(x_i)\,\big].$$

(2) Average Precision evaluates the average fraction of relevant labels ranked higher than a particular label $l_j \in K(x_i)$. The bigger the value of this metric, the better the performance of the algorithm.
$$\mathrm{Average\ Precision} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|K(x_i)|}\sum_{l_j\in K(x_i)}\frac{|R(x_i,l_j)|}{\mathrm{rank}(x_i,l_j)},$$
where $R(x_i,l_j)=\{l_k \mid \mathrm{rank}(x_i,l_k)\le \mathrm{rank}(x_i,l_j),\ l_k\in K(x_i)\}$ and $|\cdot|$ denotes the cardinality of a set. Here $\mathrm{rank}(x_i,l_j)=\sum_{k=1}^{s}\delta(f_k(x_i)\ge f_j(x_i))$ indicates the rank of $l_j$ for $x_i$ when all class labels are sorted in descending order according to $\{f_1(x_i),f_2(x_i),\dots,f_s(x_i)\}$, and $\delta(\lambda)=1$ if $\lambda$ holds and $0$ otherwise.

(3) Coverage expresses how far, on average, we need to move down the ranked list of labels so as to cover all ground-truth labels of the object. The smaller the value of Coverage, the better the performance of the algorithm.
$$\mathrm{Coverage}=\frac{1}{n}\sum_{i=1}^{n}\Big(\max_{l_j\in K(x_i)}\mathrm{rank}(x_i,l_j)-1\Big).$$

(4) One-error evaluates how often the top-ranked predicted label is not in the relevant label set of the object. The smaller the value of One-error, the better the performance.
$$\mathrm{One\text{-}error}=\frac{1}{n}\sum_{i=1}^{n}\delta\Big(\arg\max_{l\in\mathcal{Y}} f_l(x_i)\notin K(x_i)\Big).$$

(5) Ranking loss calculates the average fraction of wrongly ordered label pairs, that is, how often an irrelevant label of an object is ranked higher than a relevant label.
$$\mathrm{Ranking\ loss}=\frac{1}{n}\sum_{i=1}^{n}\frac{\big|\{(l_k,l_j)\mid f_k(x_i)\le f_j(x_i),\ (l_k,l_j)\in K(x_i)\times \overline{K(x_i)}\}\big|}{|K(x_i)|\,|\overline{K(x_i)}|},$$
where $\overline{K(x_i)}$ is the complementary set of $K(x_i)$ with respect to $\mathcal{Y}$.
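The metrics above can be computed directly from their formulas. The sketch below (an illustration, not the evaluation code used in the paper) implements Hamming loss and Ranking loss for label matrices with entries in {1, -1} (1 meaning relevant) and real-valued score matrices.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """1 minus the fraction of object-label pairs that are predicted correctly."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return 1.0 - float(np.mean(Y_true == Y_pred))

def ranking_loss(Y_true, F):
    """Average fraction of (relevant, irrelevant) label pairs ordered wrongly by the scores F."""
    Y_true, F = np.asarray(Y_true), np.asarray(F)
    per_object = []
    for y, f in zip(Y_true, F):
        rel = np.where(y == 1)[0]
        irr = np.where(y == -1)[0]
        if len(rel) == 0 or len(irr) == 0:
            continue  # the denominator would be zero for this object
        wrong = sum(1 for k in rel for j in irr if f[k] <= f[j])
        per_object.append(wrong / (len(rel) * len(irr)))
    return float(np.mean(per_object)) if per_object else 0.0
```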

3 Label correlation in multi-label data based on essential elements of label

Multi-label training data D = {(A(x), Y(x))|x ∈ U}, where U is the domain of input objects, A is a set of features and Y = {l1, l2, . . . , ls} is the label set, can be transformed into multiple single-label data sets Dj = {(A(x), lj(x))|x ∈ U, 1 ≤ j ≤ s} with the same objects and their related features. This section defines and characterizes the essential elements of a label, where each essential element consists of indispensable features for the label; the calculation method for the essential elements of a label is then given, and finally local and global label correlations are computed.

3.1 Essential elements of label by considering discernibility matrix

In any single-label data Dj, incongruity exists between the features and the label lj because of the limited cognitive level and capacity of humans [3, 8]. This incongruity reflects the ability of the features to depict the label and characterizes the uncertain information contained in the data Dj. A quotient set, which is induced by the indiscernibility relation and composed of equivalence classes, is the most basic tool for exploring the incongruity conveyed by the features and the label. By considering the objects in each equivalence class with the features in A and the positive and negative label values, a definition of inconsistent and consistent equivalence classes in the quotient set U/RA is proposed to reflect the certain and uncertain information of the data Dj.

Definition 3.1.1. For each label lj ∈ Y, U/Rlj = {Posj, Negj} are the sets of positive and negative objects, and U/RA is the quotient set with respect to A. An equivalence class [x]A ∈ U/RA is called a positive consistent equivalence class with respect to lj, denoted lj([x]A) = 1, if [x]A ⊆ Posj; [x]A is a negative consistent equivalence class with respect to lj, denoted lj([x]A) = −1, if [x]A ⊆ Negj; otherwise [x]A is an inconsistent equivalence class with respect to lj, denoted lj([x]A) = 0.

All the inconsistent equivalence classes with lj([x]A) = 0 imply that there is an incongruity between the label lj and A, which reflects the uncertain information in the data Dj. The equivalence classes contained in Posj or Negj carry the certain information of Dj defined by lj and A. Consistent equivalence classes are constructed to partition Posj and Negj and can serve as appropriate prototypes for expressing the certain and uncertain information included in Dj. Therefore, U/RA is split into three regions: the positive consistent region PosA(lj), the negative consistent region NegA(lj), and the inconsistent boundary region BnA(lj) = U − PosA(lj) ∪ NegA(lj).

The number of equivalence classes in the quotient set U/RA induced by the feature set A that are consistent with the quotient set U/Rlj derived from the label lj determines the certain information in Dj; it can also be used to equivalently characterize the uncertain information in Dj. From this point of view, to keep the certain and uncertain information included in Dj constant, the positive and negative consistent regions confirmed by U/RA and U/Rlj must naturally be kept unchanged. To do so, the equivalence classes in different regions must be distinguished and identified. For example, any equivalence class [x]A ⊆ PosA(lj) cannot be merged with another equivalence class [y]A ⊄ PosA(lj). That is, there must be at least one feature in A with the ability to separate [x]A ⊆ PosA(lj) and [y]A ⊄ PosA(lj), i.e., ∃a ∈ A such that a([x]A) ≠ a([y]A). It should be noticed that if [x]A ⊆ PosA(lj) and [y]A ⊄ PosA(lj), then lj([x]A) ≠ lj([y]A) must also hold.
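Definition 3.1.1 translates directly into code. The sketch below (hypothetical names, reusing quotient_set from the earlier snippet) splits a quotient set into the three regions of one label.

```python
def region_split(classes, label):
    """Assign each equivalence class to Pos_A(l_j), Neg_A(l_j) or Bn_A(l_j)."""
    pos_region, neg_region, boundary = [], [], []
    for cls in classes:                      # cls is a list of object indices
        values = {label[i] for i in cls}
        if values == {1}:
            pos_region.append(cls)           # l_j([x]_A) = 1
        elif values == {-1}:
            neg_region.append(cls)           # l_j([x]_A) = -1
        else:
            boundary.append(cls)             # l_j([x]_A) = 0
    return pos_region, neg_region, boundary
```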


The concept of a distribution discernibility matrix is provided to store all the pairs of equivalence classes that need to be distinguished, together with all the features capable of identifying each such pair.

Definition 3.1.2. ∀x, y ∈ U, lj ∈ Y, denote the pairs of equivalence classes from different regions as Pj∗ = {([x]A, [y]A) | lj([x]A) ≠ lj([y]A)}. The distribution discernibility matrix with respect to the label lj is denoted as Pj = (Pj([x]A, [y]A))|U/RA|×|U/RA|, where the element Pj([x]A, [y]A) is defined by
$$P_j([x]_A,[y]_A)=\begin{cases}\{a\in A \mid a([x]_A)\neq a([y]_A)\}, & ([x]_A,[y]_A)\in P_j^{*};\\ \emptyset, & \text{otherwise}.\end{cases}$$

For any pair of equivalence classes ([x]A, [y]A), the two constraints given in Definition 3.1.2 are used to determine its corresponding element Pj([x]A, [y]A) in the distribution discernibility matrix. In other words, the non-empty discernibility feature subset Pj([x]A, [y]A) holds for ([x]A, [y]A) ∈ Pj∗: it is necessary to distinguish equivalence classes [x]A and [y]A belonging to different consistent regions in Definition 3.1.1, and each feature in Pj([x]A, [y]A) can discern [y]A and [x]A in the quotient set U/RA induced by the feature set A. Otherwise, Pj([x]A, [y]A) = ∅ means that the equivalence classes [y]A and [x]A belong to the same consistent region. All the non-empty elements of the distribution discernibility matrix Pj carry the entire uncertain information of the data Dj included between the label lj and the feature set A.

The discernibility matrix proposed in Definition 3.1.2 is symmetric, that is, Pj([x]A, [y]A) = Pj([y]A, [x]A) and Pj([x]A, [x]A) = ∅. Therefore, the pairs of equivalence classes ([x]A, [y]A) and ([y]A, [x]A) are treated as the same in the calculation process of the algorithms provided in this paper, and only the elements in the lower triangle of the discernibility matrix Pj need to be calculated, which reduces the computational complexity. The following example illustrates the calculation process of the distribution discernibility matrix of each label.

Example 3.1.1. A multi-label training data set D = {(A(x), Y(x))|x ∈ U} is presented in Table 1, where U = {x1, x2, . . . , x11} is the domain of objects, A = {a1, a2, . . . , a10} is the feature set and Y = {l1, l2, . . . , l6} is the label set.

U     a1    a2    a3    a4    a5   a6    a7    a8    a9   a10     l1  l2  l3  l4  l5  l6
x1    0.10  0.20  5.00  0.37  2.0  0.30  0.20  0.11  6.6  0.70     1   1   1   1  -1  -1
x2    0.17  0.30  6.00  0.48  3.0  0.22  0.14  0.29  6.5  0.94    -1  -1  -1   1  -1   1
x3    0.26  0.25  5.50  0.55  1.0  0.10  0.27  0.32  6.2  0.83    -1  -1  -1  -1  -1   1
x4    0.80  0.45  6.00  0.29  5.1  4.90  0.88  0.40  4.1  0.50    -1  -1   1   1   1   1
x5    0.77  0.56  5.00  0.31  4.6  5.70  0.76  0.39  5.1  0.52     1   1  -1   1   1  -1
x6    0.10  0.98  2.00  0.79  7.2  0.59  0.26  0.71  4.9  0.12     1   1   1  -1   1   1
x7    0.73  0.45  6.40  0.22  5.0  5.80  0.79  0.48  5.2  0.59     1  -1  -1  -1  -1  -1
x8    0.23  0.94  3.00  0.85  6.8  0.72  0.31  0.83  4.8  0.04     1  -1  -1   1  -1   1
x9    0.44  0.74  0.23  0.92  6.0  0.24  0.38  0.90  5.7  0.65     1   1   1   1   1   1
x10   0.94  1.00  7.00  0.97  8.0  6.00  0.89  0.99  8.3  1.89    -1  -1   1   1  -1  -1
x11   0.52  0.69  0.11  0.87  2.0  0.14  0.61  0.28  7.0  0.57     1   1  -1  -1  -1  -1

Table 1: A multi-label data set.

For convenience of constructing the quotient set, each feature in the domain of input instances X = {(x, A(x))|x ∈ U} ⊆ R^10 is first divided into three equal-width intervals based on the discretization algorithm FIMUS [18], as illustrated in Table 2.

U     a1  a2  a3  a4  a5  a6  a7  a8  a9  a10     l1  l2  l3  l4  l5  l6
x1     1   1   2   1   1   1   1   1   2   1       1   1   1   1  -1  -1
x2     1   1   2   1   1   1   1   1   2   1      -1  -1  -1   1  -1   1
x3     1   1   2   1   1   1   1   1   2   1      -1  -1  -1  -1  -1   1
x4     2   1   2   1   2   2   2   1   1   1      -1  -1   1   1   1   1
x5     2   1   2   1   2   2   2   1   1   1       1   1  -1   1   1  -1
x6     1   2   1   2   2   1   1   2   1   1       1   1   1  -1   1   1
x7     2   1   2   1   2   2   2   1   1   1       1  -1  -1  -1  -1  -1
x8     1   2   1   2   2   1   1   2   1   1       1  -1  -1   1  -1   1
x9     1   2   1   2   2   1   1   2   1   1       1   1   1   1   1   1
x10    3   3   3   3   3   3   3   3   3   3      -1  -1   1   1  -1  -1
x11    2   2   1   2   1   1   2   1   2   1       1   1  -1  -1  -1  -1

Table 2: A discretized multi-label data set.

The quotient set induced by A is U/RA = {X1, X2, X3, X4, X5} = {{x1, x2, x3}, {x4, x5, x7}, {x6, x8, x9}, {x10}, {x11}}. For each label, the sets of positive and negative training objects are, respectively,
U/Rl1 = {Pos1, Neg1} = {{x1, x5, x6, x7, x8, x9, x11}, {x2, x3, x4, x10}},
U/Rl2 = {{x1, x5, x6, x9, x11}, {x2, x3, x4, x7, x8, x10}},
U/Rl3 = {{x1, x4, x6, x9, x10}, {x2, x3, x5, x7, x8, x11}},
U/Rl4 = {{x1, x2, x4, x5, x8, x9, x10}, {x3, x6, x7, x11}},
U/Rl5 = {{x4, x5, x6, x9}, {x1, x2, x3, x7, x8, x10, x11}},
U/Rl6 = {{x2, x3, x4, x6, x8, x9}, {x1, x5, x7, x10, x11}}.

The equivalence classes in the quotient set U/RA are then divided into three regions for every label according to Definition 3.1.1:
PosA(l1) = {x6, x8, x9, x11} = X3 ∪ X5, NegA(l1) = X4, BnA(l1) = X1 ∪ X2;
PosA(l2) = X5, NegA(l2) = X4, BnA(l2) = X1 ∪ X2 ∪ X3;
PosA(l3) = X4, NegA(l3) = {x11} = X5, BnA(l3) = X1 ∪ X2 ∪ X3;
PosA(l4) = X4, NegA(l4) = X5, BnA(l4) = X1 ∪ X2 ∪ X3;
PosA(l5) = ∅, NegA(l5) = X1 ∪ X4 ∪ X5, BnA(l5) = X2 ∪ X3;
PosA(l6) = X3, NegA(l6) = X4 ∪ X5, BnA(l6) = X1 ∪ X2.

For simplicity, the digit k is used here to represent the feature ak. Then, the distribution discernibility matrix of l1 (with rows and columns ordered X1, . . . , X5) is

P1 =
  [ ∅                 ∅                 {2,3,4,5,8,9}     A   {1,2,3,4,7}   ]
  [ ∅                 ∅                 {1,2,3,4,6,7,8}   A   {2,3,4,5,6,9} ]
  [ {2,3,4,5,8,9}     {1,2,3,4,6,7,8}   ∅                 A   ∅             ]
  [ A                 A                 A                 ∅   A             ]
  [ {1,2,3,4,7}       {2,3,4,5,6,9}     ∅                 A   ∅             ]
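For readers who prefer code, the following sketch (an illustration under the same notation; the function names are hypothetical) computes the distribution discernibility matrix of Definition 3.1.2 for one label; applied to the discretized data of Table 2 with label l1 and the class ordering X1, . . . , X5, it yields the entries of P1 shown above.

```python
def class_label(cls, label):
    """1, -1 or 0 for a positive-consistent, negative-consistent or inconsistent class."""
    values = {label[i] for i in cls}
    return values.pop() if len(values) == 1 else 0

def discernibility_matrix(A, classes, label):
    """P_j: for classes from different regions, the set of features that separates them."""
    feats = list(A.keys())
    m = len(classes)
    P = [[set() for _ in range(m)] for _ in range(m)]
    for s in range(m):
        for t in range(s):                   # the matrix is symmetric, so fill the lower triangle
            if class_label(classes[s], label) != class_label(classes[t], label):
                P[s][t] = {a for a in feats
                           if A[a][classes[s][0]] != A[a][classes[t][0]]}
            P[t][s] = P[s][t]
    return P
```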

Similarly, P2 = P3 = P4 = {∅, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}, A}; P5 = {∅, {1, 5, 6, 7, 9}, {1, 5, 7, 8, 9}, {2, 3, 4, 5, 8, 9}, {2, 3, 4, 5, 6, 9}, A}; P6 = {∅, {2, 3, 4, 5, 8, 9}, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}, {1, 2, 3, 4, 6, 7, 8}, A}.

Based on each feature, the feature values of the objects in Table 1 are divided into several intervals and form the corresponding granular structure containing some granules (i.e., equivalence classes); the granules in all granular structures are then intersected to obtain the needed discrete data. If the partition interval of each feature is too small, then too many features will result in discrete data that is too fine after the intersection. There are ten features in Example 3.1.1, far fewer than the hundreds of features of the data sets in Section 5, so three bins for each feature and the five equivalence classes of the binary relation in Table 2 are appropriate to vividly reflect the whole calculation process of the proposed method.

As can be seen, the elements of P1 in the aforementioned example have different dimensions. Consider P1(X1, X5) ⊂ P1(X2, X3): the inclusion relationship implies that any feature in P1(X1, X5) can not only separate the pair of equivalence classes (X1, X5), but also identify (X2, X3) at the same time. Does this phenomenon mean that the element P1(X2, X3) is redundant in reflecting the uncertain information in the discernibility matrix P1? A well-defined answer to this question can be given.

Definition 3.1.3. Pj([x]A, [y]A) ∈ Pj is referred to as an essential element in Pj, denoted Ej([x]A, [y]A), if there does not exist another element in Pj that is its proper subset. The collection of all essential elements of label lj is briefly denoted as Ej = {Ej | Ej ⊆ Pj}.

In any discernibility matrix, the essential elements are the most basic components and reflect the entire incongruity between the label and the features.

Algorithm 3.1.1. Essential elements for each label based on discernibility matrix
Input: Training data D = {(A(x), Y(x))|x ∈ U} (U = {x1, x2, . . . , xn}, A = {a1, a2, . . . , am}, Y = {l1, l2, . . . , ls})
Output: Collection of essential elements of each label Ej (1 ≤ j ≤ s)
Initialize: Ej = ∅
1.  calculate the quotient set U/RA with respect to A
2.  compute the discernibility matrix of label lj, Pj = (Pj([xs]A, [xt]A))|U/RA|×|U/RA|
3.  compute U/Rlj = {Posj, Negj} with respect to label lj
4.  while (Pj ≠ ∅)
5.    select Pj([xs0]A, [xt0]A) satisfying |Pj([xs0]A, [xt0]A)| = min{|Pj([xs]A, [xt]A)| : Pj([xs]A, [xt]A) ∈ Pj, Pj([xs]A, [xt]A) ≠ ∅}
6.    let Ej = [Ej : Pj([xs0]A, [xt0]A)]
7.    for each Pj([xs]A, [xt]A) ∈ Pj with Pj([xs]A, [xt]A) ≠ ∅
8.      if Pj([xs0]A, [xt0]A) ⊂ Pj([xs]A, [xt]A), let Pj([xs]A, [xt]A) = ∅
9.      if Pj([xs0]A, [xt0]A) = Pj([xs]A, [xt]A), let Pj([xs]A, [xt]A) = ∅
10.   endfor
11.   let Pj([xs0]A, [xt0]A) = ∅
12. endwhile
13. Return Ej
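A compact Python rendering of Algorithm 3.1.1 (a sketch consuming the matrix produced by the previous snippet; ties between minimum-size elements are broken arbitrarily):

```python
def essential_elements(P):
    """Greedily extract the essential elements of a discernibility matrix."""
    # Non-empty entries of the lower triangle; P is symmetric, so this is enough.
    entries = [P[s][t] for s in range(len(P)) for t in range(s) if P[s][t]]
    essentials = []
    while entries:
        smallest = min(entries, key=len)          # a minimum-size entry has no proper subset
        essentials.append(smallest)
        # Discard every remaining entry that contains the chosen essential element.
        entries = [e for e in entries if not smallest <= e]
    return essentials
```

Applied to P1 from Example 3.1.1, this returns exactly the three essential elements of E1.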

It is necessary to calculate and store every essential element of Pj in order to explore the uncertain information of the discernibility matrix portrayed by the label and A. The following proposition demonstrates this characterization.

Proposition 3.1.1. (1) For any pair of equivalence classes ([x]A, [y]A) ∈ Pj∗, there exists at least one essential element Ej ∈ Ej that can distinguish it. (2) For any Ej ∈ Ej, if this essential element is removed, then there exists at least one pair of equivalence classes in Pj∗ that cannot be identified.

Proof. (1) ∀([x]A, [y]A) ∈ Pj∗, there must exist an element Pj ∈ Pj such that Pj([x]A, [y]A) = {a ∈ A | (a([x]A) ≠ a([y]A)) ∧ (lj([x]A) ≠ lj([y]A))} by Definition 3.1.2. If Pj is an essential element, that is, there does not exist another element in Pj that is its proper subset, then Pj = Ej ∈ Ej. Otherwise, by Definition 3.1.3, there must be an essential element Ej ⊆ Pj that can distinguish ([x]A, [y]A), and possibly some other pairs of equivalence classes as well.

(2) For any pair of equivalence classes, there is a corresponding element in Pj that can distinguish it. Assume that an essential element Ej ∈ Ej is removed and that all pairs of equivalence classes in Pj∗ can still be identified by Ej − Ej. In other words, every pair of equivalence classes that is identified by Ej ⊆ Pj can also be distinguished by another essential element Ej′ ∈ Ej − Ej with Ej′ ⊆ Pj. Then there exists Pj′ satisfying Pj′ = Ej, and Pj′ also has another essential element Ej′ ∈ Ej such that Ej ≠ Ej′ ⊆ Pj′ = Ej. Therefore, Ej′ ⊂ Ej, which conflicts with the fact that Ej is an essential element in Ej. Thus, the assumption is invalid. □

Elements in the discernibility matrix Pj with the fewest features must be essential elements, and essential elements do not contain each other. When a given essential element is selected, the elements of Pj that contain it can be eliminated, while the other essential elements cannot be deleted; the elements with the fewest features among the remaining elements are still essential elements. This process is repeated until Pj = ∅, at which point all the essential elements of label lj have been collected. According to the analysis above, the algorithm for calculating the essential elements of any label is provided as Algorithm 3.1.1.

To calculate the essential elements for a label, the following major operations are needed: (1) the first step is to compute the discernibility matrix of the label, whose time complexity is O(|U/RA|2); (2) the second step is to collect the set of essential elements in the discernibility matrix, whose time complexity is O(|U/RA|2|A|). Therefore, the time complexity of Algorithm 3.1.1 is O(|U/RA|2|A|). The essential elements for each label are computed in the following example.

Example 3.1.2. (Continued from Example 3.1.1) The set of essential elements of the distribution discernibility matrix of every label is, respectively,
E1 = {{2, 3, 4, 5, 8, 9}, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}};
E2 = E3 = E4 = {{1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}};
E5 = {{1, 5, 6, 7, 9}, {1, 5, 7, 8, 9}, {2, 3, 4, 5, 8, 9}, {2, 3, 4, 5, 6, 9}};
E6 = {{2, 3, 4, 5, 8, 9}, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}}.

For distinguishing the corresponding pairs of equivalence classes, every indispensable feature in an essential element is equally capable. A compact subset of indispensable features can be selected from the essential elements in Ej one by one until all the pairs of equivalence classes in Pj∗ are recognized. The essential elements of a discernibility matrix contain each compound mode of the compact subset of indispensable features.

3.2 Local and global label correlations through using essential elements

By considering the overlap of the different families of essential elements of the discernibility matrices related to the labels, label relevancy and the corresponding relevance judgement matrix are proposed to estimate the relationships among multiple labels. A novel model for calculating local and global label correlations is therefore provided.

In a practical data set, the difference between some essential elements in Ej and some essential elements in Ek may be very small. For instance, suppose that there is another label l7 in Table 1, and that the families of essential elements related to l7 and l2 are, respectively, E7 = {E71, E72, E73} = {{1, 2, 3, 4, 7}, {2, 3, 4, 5, 8, 9}, {1, 5, 7, 8, 9}} and E2 = {E21, E22, E23} = {{1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}}, where E71 = E21, E73 = E23 and E72 Δ E22 = {6, 8}. If E7 and E2 are simply intersected, then E72 and E22 are not equal and cannot contribute to the relevancy between l7 and l2, although there is obviously little difference between E7 and E2. To address this phenomenon, which regularly appears in large amounts of multi-label data, the restriction on counting the overlap of different essential element sets is relaxed from inside and outside to propose the definition of the relevancy of labels.

Definition 3.2.1. For any lj, lk ∈ Y, the function γ : Y × Y → {0, 1} defined by
$$\gamma(j,k)=\begin{cases}1, & \dfrac{|\mathcal{E}_j \sqcap \mathcal{E}_k|}{|\mathcal{E}_j \cup \mathcal{E}_k|}\ge \alpha;\\[4pt] 0, & \text{otherwise},\end{cases}$$
is a label relevancy with respect to the sets of essential elements, where $\mathcal{E}_j \sqcap \mathcal{E}_k=\{E_j \mid \frac{|E_j\cap E_k|}{|E_j\cup E_k|}\ge\beta,\ E_j\in\mathcal{E}_j,\ E_k\in\mathcal{E}_k\}$. The relevance judgement matrix is then referred to as Rel = (γ(j, k))|Y|×|Y|, 1 ≤ j, k ≤ s.

Related algorithms for digging out label correlation only depict the possible relevance of a label pair or of the whole label set. Compared with them, our algorithm provides a relevance judgement matrix to indicate the relationships among the labels in Y based on the essential element collections determined by the labels and the feature set. The label set Y is then divided into several disjoint groups of relevant labels, denoted clu(Y) = {Y | Y ⊆ Y}. Every label in a group of relevant labels has a close relationship with the fixed label, which is selected randomly from the group of relevant labels. For a given fixed label, a group of relevant labels is obtained around that label with the help of Definition 3.2.1. In this paper, a group of relevant labels is found from the first column (or row); the labels that have already been attached to group one are then removed, and another group of relevant labels is found from the second column (or row), and so on, until the entire label set Y has been traversed.
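One direct reading of Definition 3.2.1 in code follows (a sketch, not the authors' implementation; it treats |Ej ∪ Ek| as the number of distinct essential elements in the two collections, which is consistent with Example 3.2.1 below).

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def label_relevancy(Ej, Ek, alpha, beta):
    """gamma(j, k): Ej and Ek are collections of essential elements given as Python sets."""
    # E_j ⊓ E_k: essential elements of l_j that nearly coincide with some essential element of l_k.
    overlap = [e for e in Ej if any(jaccard(e, f) >= beta for f in Ek)]
    union_size = len({frozenset(e) for e in Ej} | {frozenset(f) for f in Ek})
    return 1 if len(overlap) / union_size >= alpha else 0
```

The relevance judgement matrix is then obtained by evaluating label_relevancy for every ordered pair of labels.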


The first label in each group of relevant labels can be selected as the fixed label and the others are the related labels. Using the fixed label as a benchmark, several label pairs are constructed from the fixed label and the other related labels in a group of relevant labels. For any label pair in a group of relevant labels and any equivalence class, objects whose two label values agree (or are exclusive) provide positive (or negative) correlation support.

Definition 3.2.2. For each group of relevant labels Y, let lj be the fixed label and lk any related label. Then the local positive and negative correlations of the label pair (lj, lk) with respect to the equivalence class [x]A are formulated as
$$pos_{j,k}([x]_A)=\frac{|\{y \mid y\in[x]_A,\ l_j(y)\times l_k(y)=1\}|}{|[x]_A|},\qquad neg_{j,k}([x]_A)=\frac{|\{y \mid y\in[x]_A,\ l_j(y)\times l_k(y)=-1\}|}{|[x]_A|}.$$

Obviously, posj,k([x]A) + negj,k([x]A) = 1. Because the equivalence classes in the quotient set U/RA have different qualities, a local label correlation is proposed by considering the weight of each equivalence class; local correlation is then explored for the label pair on each equivalence class.

Definition 3.2.3. For each group of relevant labels Y, let lj be the fixed label and lk any related label. Then the local correlation of the label pair (lj, lk) related to the equivalence class [x]A is formulated as
$$loc_{j,k}([x]_A)=\begin{cases}\dfrac{|[x]_A|}{|U|}, & pos_{j,k}([x]_A)\ge\eta;\\[6pt] -\dfrac{|[x]_A|}{|U|}, & neg_{j,k}([x]_A)\ge\eta;\\[6pt] 0, & \text{otherwise}.\end{cases}$$

η is a parameter and usually needs to be larger than 0.5. In other words, when the local positive (or negative) label correlation enjoys a decided advantage, the whole equivalence class is considered positively (or negatively) related between lj and lk. If the local positive and negative label correlations are close to each other, then it is believed that the label pair (lj, lk) does not have an obvious relevance on this equivalence class. The global label correlation is obtained as follows by aggregating the local label correlations of the equivalence classes for the label pair.

Definition 3.2.4. For each group of relevant labels Y, let lj be the fixed label and lk any related label. Then the global correlation of the label pair (lj, lk) is formulated as
$$glo(j,k)=\sum_{[x]_A\in U/R_A} loc_{j,k}([x]_A).$$

The following Example 3.2.1 shows how to determine the groups of relevant labels and calculate the local and global label correlations for each label pair.

Example 3.2.1. (Continued from Example 3.1.1.) Let α = 75%, β = 85%, η = 60%. Then the relevance judgement matrix of the label set Y defined in Definition 3.2.1 is

        1 0 0 0 0 1
        0 1 1 1 0 1
Rel =   0 1 1 1 0 1
        0 1 1 1 0 1
        0 0 0 0 1 0
        1 1 1 1 0 1

The groups of relevant labels are clu(Y) = {Y1, Y2, Y3} = {{l1, l6}, {l2, l3, l4}, {l5}}, where l1, l2, l5 are respectively the fixed labels in Y1, Y2, Y3, and the corresponding sets of label pairs are L(Y1) = {(l1, l6)}, L(Y2) = {(l2, l3), (l2, l4)}, L(Y3) = ∅. Therefore, the local and global label correlations for each label pair are
loc1,6(X1) = −3/11, loc1,6(X2) = −3/11, loc1,6(X3) = 3/11, loc1,6(X4) = 1/11, loc1,6(X5) = −1/11;
loc2,3(X1) = 3/11, loc2,3(X2) = −3/11, loc2,3(X3) = 3/11, loc2,3(X4) = −1/11, loc2,3(X5) = −1/11;
loc2,4(X1) = 3/11, loc2,4(X2) = −3/11, loc2,4(X3) = 3/11, loc2,4(X4) = −1/11, loc2,4(X5) = −1/11;
glo(1, 6) = −3/11; glo(2, 3) = glo(2, 4) = 1/11.

The local label correlation can be computed by Algorithm 3.2.1 based on the essential element collections of the labels defined in Definition 3.1.3.

Algorithm 3.2.1. Local label correlation based on essential elements
Input: (1) Training data D = {(A(x), Y(x))|x ∈ U} (U = {x1, . . . , xn}, A = {a1, . . . , am}, Y = {l1, . . . , ls})
       (2) Quotient set U/RA = {[x]A | x ∈ U}
       (3) Family of essential element sets with respect to the labels E = {Ej | 1 ≤ j ≤ s}
       (4) α, β, η
Output: clu(Y) = {Y | Y ⊆ Y}, loc = {locj,k, 1 ≤ j, k ≤ s}
// Partitioning the groups of relevant labels
1. for 1 ≤ j, k ≤ s
2.   compute γ(j, k) from Ej and Ek
3. endfor
4. form the relevance judgement matrix Rel = (γ(j, k))|Y|×|Y|
5. calculate the groups of relevant labels clu(Y)
// Computing the set of label pairs in each group of relevant labels
6. for each group of relevant labels Y ⊆ Y
7.   if |Y| = 1, then lj ∈ Y is the fixed label and the set of label pairs is L(Y) = ∅
8.   else select a fixed label lj and let L(Y) = {(lj, lk) | lk ≠ lj, lk ∈ Y}
9. endfor
// Computing the local label correlation
10. if |Y| > 1
11.   for each label pair (lj, lk) ∈ L(Y)
12.     for every equivalence class [x]A ∈ U/RA
13.       compute the positive and negative label correlations posj,k([x]A), negj,k([x]A)
14.       calculate the local label correlation locj,k([x]A)
15.     endfor
16.   endfor
17. Return clu(Y), loc

The time complexity for counting the relevance judgement matrix is O(|Y|2). The time complexity of calculating the local label correlations of the equivalence classes with respect to Y is O(|Y|2 · |U/RA|), assuming the extreme situation in which a group of relevant labels Y ∈ clu(Y) satisfies Y = Y. The time complexity of Algorithm 3.2.1 is therefore O(|Y|2 · |U/RA|).
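Definitions 3.2.2-3.2.4, which form the core of Algorithm 3.2.1, can be transcribed as follows. This is a sketch with hypothetical names; on Example 3.2.1 (n = 11, η = 0.6) it reproduces the loc and glo values listed above.

```python
def local_correlation(cls, lj, lk, n, eta=0.6):
    """loc_{j,k}([x]_A) for one equivalence class cls (a list of object indices)."""
    pos = sum(lj[i] * lk[i] == 1 for i in cls) / len(cls)
    neg = 1.0 - pos
    if pos >= eta:
        return len(cls) / n
    if neg >= eta:
        return -len(cls) / n
    return 0.0

def global_correlation(classes, lj, lk, n, eta=0.6):
    """glo(j, k): the sum of the local correlations over the quotient set."""
    return sum(local_correlation(cls, lj, lk, n, eta) for cls in classes)
```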

4 Correlation-labels-specific features of relevant label group

In multi-label classification, D = {(A(x), Y(x))|x ∈ U} is a training data set, where U is the domain of input objects, A is a set of features with quotient set U/RA = {[x]A | x ∈ U}, and Y = {l1, l2, . . . , ls} is the label set, divided into several disjoint groups of relevant labels clu(Y) = {Y | Y ⊆ Y}. In Section 3, several labels with strong relationships are assigned to a group of relevant labels. Naturally, whether the labels in a group of relevant labels can be integrated into a binary relation is a question worthy of in-depth deliberation. The equivalence classes are divided into several disjoint regions by considering their local label correlation and classification error rate, and the corresponding discernibility matrix can then be calculated to distinguish the equivalence classes belonging to different regions. Therefore, a compact subset of indispensable features for each group of relevant labels can be effectively captured while maintaining the recognition capability of the discernibility matrix. For each group of relevant labels, this compact subset of indispensable features is called the correlation-labels-specific features in this paper.

To aggregate the labels in a group of relevant labels, knowledge is taken from the ideology of boosting algorithms [22]. The purpose of boosting algorithms is to find a highly accurate classification rule by combining different base hypotheses that are moderately accurate (usually better than 50%). In this section, a separate procedure induced by the local label correlation of each label pair is assumed to be the base learner. The weights of these classifiers are then given according to their classification error rates, and the base hypotheses are combined into a single rule called the final or combined hypothesis.

Definition 4.1. For each group of relevant labels Y having more than one label, its corresponding set of label pairs is L(Y) = {(lj, lk) | lk ≠ lj, lk ∈ Y}, where lj is the fixed label and lk is a related label. ∀(lj, lk), the base learner Gj,k([x]A) : U/RA → {−1, +1} induced by the local label correlation defined on U/RA is formulated as
$$G_{j,k}([x]_A)=\begin{cases}1, & loc_{j,k}([x]_A)\ge 0;\\ -1, & loc_{j,k}([x]_A)<0.\end{cases}$$
The classification error rate of the label pair (lj, lk) is defined as
$$e_{(j,k)}([x]_A)=\begin{cases}\displaystyle\sum_{y\in[x]_A}\frac{I\big[(l_j(y),l_k'(y))\neq(l_j(y),l_k(y))\big]}{|[x]_A|}, & loc_{j,k}([x]_A)\neq 0;\\[8pt] 1/2, & loc_{j,k}([x]_A)=0,\end{cases}$$
where lk′(y) is the predicted value of the related label lk for object y according to the fixed label lj and the local label correlation: if locj,k([x]A) > 0, then lk′(y) = lj(y); if locj,k([x]A) < 0, then lk′(y) = −lj(y). The coefficient of G(j,k) on the equivalence class [x]A is referred to as
$$\alpha_{(j,k)}([x]_A)=\frac{1}{2}\ln\frac{1-e_{(j,k)}([x]_A)}{e_{(j,k)}([x]_A)}.$$
Hence, a linear combination of the base learners with respect to the group of relevant labels Y on [x]A can be constructed as
$$f_Y([x]_A)=\sum_{(l_j,l_k)\in L(Y)}\alpha_{(j,k)}([x]_A)\,G_{j,k}([x]_A).$$

Even when e(j,k)([x]A) = 0.01%, the absolute value of the coefficient |α(j,k)([x]A)| is only 9.2102, whereas if e(j,k)([x]A) = 0 then obviously |α(j,k)([x]A)| = Inf. For convenience of calculation, we therefore let |α(j,k)([x]A)| = 10 if e(j,k)([x]A) = 0. For a group of relevant labels, every equivalence class has a value of the linear combination of the base learners fY, and these equivalence classes can then be assigned to |Y| + 2 regions by considering the distance between the maximum and minimum values of fY([x]A). The step of the regions for the group of relevant labels Y is
$$s(Y)=\frac{\max_{[x]_A\in U/R_A} f_Y([x]_A)-\min_{[x]_A\in U/R_A} f_Y([x]_A)}{|Y|+2}.$$
The final learner GY([x]A) of Y on [x]A is defined as GY([x]A) = ⌈fY([x]A)/s(Y)⌉, where ⌈·⌉ denotes the smallest integer not less than a number.
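The aggregation of a label group in Definition 4.1 can be sketched as below. This is a literal transcription of the formulas (not the authors' implementation), using the paper's cap |α| = 10 for a zero error rate; the symmetric cap for an error rate of 1 is our own assumption.

```python
import math

def pair_vote(cls, lj, lk, loc):
    """alpha_{(j,k)}([x]_A) * G_{j,k}([x]_A) for one label pair on one equivalence class."""
    G = 1 if loc >= 0 else -1
    if loc == 0:
        err = 0.5
    else:
        predicted = [lj[i] if loc > 0 else -lj[i] for i in cls]   # l'_k(y) from the fixed label
        err = sum(p != lk[i] for p, i in zip(predicted, cls)) / len(cls)
    if err == 0.0:
        alpha = 10.0          # cap used by the paper when e_{(j,k)} = 0
    elif err == 1.0:
        alpha = -10.0         # assumed symmetric cap; this case is not spelled out in the text
    else:
        alpha = 0.5 * math.log((1 - err) / err)
    return alpha * G

def f_Y(cls, fixed, related, labels, loc_of):
    """f_Y([x]_A): sum of the weighted votes of all label pairs (l_fixed, l_k) in the group.

    loc_of maps a related label index k to loc_{fixed,k}([x]_A) for this equivalence class.
    """
    return sum(pair_vote(cls, labels[fixed], labels[k], loc_of[k]) for k in related)
```

The regions then follow from GY([x]A) = ⌈fY([x]A)/s(Y)⌉ as defined above.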


For example, suppose a group of relevant labels has two labels, that is, L(Y) = {(lj, lk)}, where lj is the fixed label and lk is the related label. In this paper, it is reasonable to set the number of regions with respect to (lj, lk) to four, that is, the equivalence classes with posj,k([x]A) = 1 and negj,k([x]A) = 0; with 0 < negj,k([x]A) ≤ posj,k([x]A) < 1; with 0 < posj,k([x]A) < negj,k([x]A) < 1; and with posj,k([x]A) = 0 and negj,k([x]A) = 1. If the label group has three labels, that is, L(Y) = {(lj, lk), (lj, lt)} with fixed label lj and related labels lk, lt, then the equivalence classes with 0 < negj,k([x]A) ≤ posj,k([x]A) < 1, 0 < posj,t([x]A) < negj,t([x]A) < 1 and those with 0 < negj,t([x]A) ≤ posj,t([x]A) < 1, 0 < posj,k([x]A) < negj,k([x]A) < 1 are considered the same and are assigned to one region. Therefore, the number of regions for Y is five, that is, |Y| + 2. The equivalence classes in the quotient set U/RA derived from the feature set A are thus distributed to |Y| + 2 regions. The distribution discernibility matrix, which can be used to compute the correlation-labels-specific features for labels having strong correlations, is briefly stated as follows.

Definition 4.2. Let Y be any group of relevant labels with more than one label. ∀x, y ∈ U, the pairs of equivalence classes from different regions are denoted as PY∗ = {([x]A, [y]A) | GY([x]A) ≠ GY([y]A)}. The distribution discernibility matrix for Y is described as PY = (PY([x]A, [y]A))|U/RA|×|U/RA|, where the element PY([x]A, [y]A) is defined by
$$P_Y([x]_A,[y]_A)=\begin{cases}\{a\in A \mid a([x]_A)\neq a([y]_A)\}, & ([x]_A,[y]_A)\in P_Y^{*};\\ \emptyset, & \text{otherwise}.\end{cases}$$

For a given group of relevant labels Y ⊆ Y, the set of essential elements of the distribution discernibility matrix in Definition 4.2 with respect to Y is denoted as EY = {EY | EY ⊆ PY}. The correlation-labels-specific features for a group of relevant labels with more than one label are then computed using these essential elements; the correlation-labels-specific features for a single label have already been treated in Section 3. The resulting method for computing the correlation-labels-specific features of each group of relevant labels is called CLSF, and CLSF is a feature selection algorithm for multi-label data. For each group of relevant labels, the possible labels of each test object can then be predicted by CLSF together with the multi-label classifier ML-KNN. The following example illustrates the computation process of the correlation-labels-specific features for each group of relevant labels.

Example 4.1. (Continued from Example 3.1.1.) In each group of relevant labels, the linear combination of the base learners on each equivalence class is fY1(X1) = fY1(X2) = fY1(X5) = −10, fY1(X3) = fY1(X4) = 10; fY2(X1) = 10.896, fY2(X2) = 0, fY2(X3) = 9.104, fY2(X4) = fY2(X5) = −20. The equivalence classes in the quotient set U/RA are assigned to four and five regions for Y1 and Y2, respectively. The final learner for each group of relevant labels is GY1(X1) = GY1(X2) = GY1(X5) = 1, GY1(X3) = GY1(X4) = 4; GY2(X4) = GY2(X5) = 1, GY2(X3) = GY2(X2) = 3, GY2(X1) = 4, i.e., {X1 ∪ X2 ∪ X5, X3 ∪ X4} for Y1 and {X1, X2 ∪ X3, X4 ∪ X5} for Y2.
Based on Definitions 4.1 and 4.2, the sets of all elements in the distribution discernibility matrices with respect to Y1, Y2 and Y3 are
PY1 = {∅, A, {a1, a2, a3, a4, a6, a7, a8}, {a1, a5, a7, a8, a9}, {a2, a3, a4, a5, a8, a9}};
PY2 = {∅, A, {a2, a3, a4, a5, a8, a9}, {a1, a2, a3, a4, a7}, {a1, a5, a7, a8, a9}, {a1, a5, a6, a7, a9}, {a2, a3, a4, a5, a8, a9}};
PY3 = {∅, A, {a2, a3, a4, a5, a6, a9}, {a1, a5, a6, a7, a9}, {a1, a5, a7, a8, a9}, {a2, a3, a4, a5, a8, a9}},
where |Y3| = 1. Obviously, the essential elements of each group of relevant labels are
EY1 = {{a1, a2, a3, a4, a6, a7, a8}, {a1, a5, a7, a8, a9}, {a2, a3, a4, a5, a8, a9}};
EY2 = {{a2, a3, a4, a5, a8, a9}, {a1, a2, a3, a4, a7}, {a1, a5, a7, a8, a9}, {a1, a5, a6, a7, a9}};
EY3 = {{a2, a3, a4, a5, a6, a9}, {a1, a5, a6, a7, a9}, {a1, a5, a7, a8, a9}, {a2, a3, a4, a5, a8, a9}}.
Hence, the correlation-labels-specific features for the three groups of relevant labels are confirmed to be {a8}, {a1, a5} and {a5}, respectively.

The correlation-labels-specific features can be computed by Algorithm 4.1 based on essential elements and the boosting idea described above.

Algorithm 4.1. Correlation-labels-specific features for a relevant label group
Input: (1) Training data D = {(A(x), Y(x))|x ∈ U} (U = {x1, . . . , xn}, A = {a1, . . . , am}, Y = {l1, . . . , ls})
       (2) Quotient set U/RA = {[x]A | x ∈ U}
       (3) All groups of relevant labels clu(Y) = {Y | Y ⊆ Y}
       (4) Label pairs of each group of relevant labels L(Y) = {(lj, lk) | lk ≠ lj, lk ∈ Y} and the local label correlations loc
Output: A compact set of features B ⊆ A with respect to each group of relevant labels
Initialize: B = ∅
1.  for each group of relevant labels Y ⊆ Y
2.    select the number of regions |Y| + 2
3.    determine the distance of each step s(Y)
// Aggregating the labels in each group of relevant labels
4.    for each equivalence class [x]A in U/RA
5.      for every label pair (lj, lk) in L(Y)
6.        if locj,k([x]A) > 0, then the base learner Gj,k([x]A) = 1
7.        elseif locj,k([x]A) < 0, then Gj,k([x]A) = −1
8.        else Gj,k([x]A) = 0
9.        calculate the classification error rate e(j,k)([x]A) and the coefficient α(j,k)([x]A)
10.     endfor
11.     construct the linear combination of base learners fY([x]A) and the final learner GY([x]A)
12.   endfor
// Compute the essential elements of every group of relevant labels
13.   set up the discernibility matrix PY
14.   similarly to Algorithm 3.1.1, calculate the set of essential elements EY
// Compute a compact set of features with respect to each group of relevant labels
15.   while (EY ≠ ∅)
16.     select the feature a ∈ EY′ ∈ EY that occurs most frequently
17.     let B = B ∪ {a} and EY = EY − {EY′ | a ∈ EY′ ∈ EY}
18.     let EY = {EY − EY′ | EY ∈ EY}
19.   endwhile
20. Return B

The time complexity of Algorithm 4.1 is summarized as follows: (1) assume that Y = Y, which is the most complex case; the time complexity of aggregating the labels in Y is O(|Y|2 · |U/RA|); (2) the time complexity of storing the essential elements in PY is O(|U/RA|2 · |A|); (3) in steps 15 to 19, the number of iterations of the while loop is |EY| ≤ |U/RA|2, and the time complexity of one iteration is O(|A|). The time complexity of Algorithm 4.1 is therefore O(|U/RA| · (|Y|2 + |U/RA| · |A|)).
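Steps 15-19 of Algorithm 4.1 amount to a greedy cover of the essential elements by single features. A minimal sketch follows (ties broken arbitrarily; not the authors' code); on the collections EY1, EY2 and EY3 of Example 4.1 it selects one, two and one features, matching the sizes of {a8}, {a1, a5} and {a5}.

```python
from collections import Counter

def clsf_features(essentials):
    """Greedily pick features until every essential element of the label group is covered."""
    remaining = [frozenset(e) for e in essentials]
    selected = set()
    while remaining:
        counts = Counter(a for e in remaining for a in e)
        best = counts.most_common(1)[0][0]       # feature hitting the most remaining elements
        selected.add(best)
        remaining = [e for e in remaining if best not in e]
    return selected
```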

5 Experimental results

The proposed multi-label classification algorithm in our paper, called CLSF-MK, is composed of the multi-label feature selection algorithm CLSF and the multi-label classifier ML-KNN. This section describes the characteristics of the data sets, the comparative methods and the evaluation metrics, and then empirically illustrates the superiority of CLSF-MK in multi-label feature selection and classification.

5.1 Data sets and experimental settings

Eleven benchmark data sets containing multi-label instances from different application domains are selected to test CLSF-MK. Among them, the first eight data sets come from the Mulan library [5]; Computer, Recreation and Reference are three independent subsets of the web page multi-label classification data set yahoo. Detailed information about these data sets is presented in Table 3.

Six existing multi-label classification methods are chosen to demonstrate the effectiveness of our proposed algorithm. The algorithms applied for comparison are ML-KNN (a lazy learning approach to multi-label learning) [19], LPLC (local positive and negative pairwise label correlations) [9], CDR (complementary decision reduct) [7], SCLS (scalable criterion for large label set) [13], MDMR (max-dependency and min-redundancy) [29], and MUCO (multi-label feature selection with label correlation) [27]. The multi-label evaluation metrics Average Precision, Ranking Loss, Coverage, One-error and Hamming Loss are selected to evaluate the performance. ML-KNN and LPLC are multi-label classification algorithms and do not consider feature selection; in LPLC, the parameter α, which controls the tradeoff between positive and negative correlation, is set to 0.7 in this section. CDR, SCLS, MDMR, MUCO and CLSF-MK are multi-label feature selection algorithms, among which only CDR does not consider label correlation. CDR and CLSF-MK screen out the same feature subset for each label in the label set (or relevant label group), with a fixed number of features. We chose ML-KNN as the multi-label classifier for these algorithms with k = 10, where the parameter k is the number of nearest neighbors of a given object according to its feature values. To ensure the fairness of the experimental results, the same number of selected features per data set is used for SCLS, MDMR, MUCO and CLSF-MK.

The relationships between the labels and features cannot be considered in advance, in order to eliminate causal confusion in the data discretization pretreatment; that is, an unsupervised discretization method is necessary for our method. With a fixed number of discrete intervals, the granular structure becomes finer and finer as the number of features increases. The discretization conditions for the eleven data sets applied in this section are appropriately relaxed to enhance the stability of our algorithm because of the large number of features. Therefore, the discretization algorithm FIMUS [18] is selected in CLSF-MK and CDR to divide the numerical features of the eleven data sets into two equal-width intervals, while SCLS uses a supervised discretization method [1] to discretize these numerical features. In CLSF-MK, the parameters measuring the relativity degree of the essential elements and the overlap of essential element sets related to different labels are set to α = 0.4, β = 0.6 and η = 0.6. If the classification error rate of an equivalence class for a label pair is 0, then the coefficient of the base learner on that equivalence class is 10. The number of regions for a group of relevant labels Y ∈ clu(Y) is |Y| + 2. All experiments were run on a server equipped with a 4.2 GHz Intel Core i7-7700K CPU and 80 GB of RAM.

5.2 Performance analysis on CLSF-MK

In this paper, for MDMR, SCLS and MUCO, every label of a data set shares the same number of features in its compact feature set, namely the weighted average number of features selected per label by CLSF, to predict the possible labels of test objects. ML-KNN and LPLC are multi-label classification algorithms and do not consider feature selection, so only the numbers of features selected for each label by our method and by CDR need to be provided. Table 3 presents the basic information of the data sets and the ability of CLSF and CDR to select features and to segment the label set, where each row lists the name of the data set, "|U|" (the number of instances), "|Y|" (the number of labels), "|clu(Y)|" (the number of relevant label groups), the group sizes (the number of labels in every group of relevant labels), "|A|" (the number of features), "B_Y" (the number of indispensable features for each group of relevant labels), "avg(B_Y)" (the weighted average number of selected features per label), and "CDR" (the average number of features selected by CDR). The weighted average means that every label is, on average, related to avg(B_Y) indispensable features in A, and it is calculated by
$\mathrm{avg}(B_Y) = \frac{1}{|\mathcal{Y}|} \sum_{Y \in clu(\mathcal{Y})} |B_Y| \cdot |Y|$.
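As a quick check of this weighted average, the following short Python sketch (ours, not from the paper) reproduces the value reported for Emotion in Table 3 below from its per-group feature counts and group sizes:

# avg(B_Y) = (1 / |Y_all|) * sum over relevant label groups of |B_Y| * |Y|
def weighted_avg_features(group_feature_counts, group_sizes):
    total_labels = sum(group_sizes)
    return sum(b * s for b, s in zip(group_feature_counts, group_sizes)) / total_labels

# Emotion in Table 3: two relevant label groups of 3 labels each, with 63 and 47 features.
print(weighted_avg_features([63, 47], [3, 3]))  # 55.0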

Emotion: |U| = 593, |Y| = 6, |clu(Y)| = 2, group sizes = 3;3, |A| = 72, B_Y = 63;47, avg(B_Y) = 55, CDR = 53
Birds: |U| = 645, |Y| = 20, |clu(Y)| = 2, group sizes = 5;15, |A| = 260, B_Y = 197;184, avg(B_Y) = 187, CDR = 119
Scene: |U| = 2407, |Y| = 6, |clu(Y)| = 2, group sizes = 5;1, |A| = 294, B_Y = 218;230, avg(B_Y) = 220, CDR = 178
Yeast: |U| = 2417, |Y| = 14, |clu(Y)| = 1, group sizes = 14, |A| = 103, B_Y = 90, avg(B_Y) = 90, CDR = 86
Reuters: |U| = 2000, |Y| = 7, |clu(Y)| = 3, group sizes = 3;3;1, |A| = 243, B_Y = 167;147;175, avg(B_Y) = 159, CDR = 132
Recreation: |U| = 5000, |Y| = 22, |clu(Y)| = 1, group sizes = 22, |A| = 606, B_Y = 546, avg(B_Y) = 546, CDR = 477
Computer: |U| = 5000, |Y| = 33, |clu(Y)| = 4, group sizes = 30;1;1;1, |A| = 681, B_Y = 556;438;387;403, avg(B_Y) = 394, CDR = 522
Reference: |U| = 5000, |Y| = 33, |clu(Y)| = 5, group sizes = 29;1;1;1;1, |A| = 793, B_Y = 605;617;615;634;662, avg(B_Y) = 608, CDR = 609
Medical: |U| = 978, |Y| = 45, |clu(Y)| = 12 (= 1 + 11), group sizes = 34;1 (×11), |A| = 1449, B_Y = 16;18;9;5;3;75;8;2;18;11;9;12, avg(B_Y) = 15, CDR = 51
Enron: |U| = 1702, |Y| = 53, |clu(Y)| = 32 (= 4 + 28), group sizes = 2;3;6;14;1 (×28), |A| = 1001, B_Y = 25;24;55;23;25;20;36;4;4;8;6;19;31;4;8;31;53;20;21;25;11;77;92;92;88;51;15;21;8;4;5;5, avg(B_Y) = 86, CDR = 81
Cal500: |U| = 502, |Y| = 174, |clu(Y)| = 117 (= 26 + 5 + 86), group sizes = 3 (×26);2 (×5);1 (×86), |A| = 68, B_Y = 24;23;24;24;23;25;24;24;25;24;23;25;26;26;26;24;26;25;25;26;25;24;25;26;26;19;25;25;26;26;25;22;19;18;21;20;20;18;18;18;11;10;19;11;9;9;10;15;14;17;13;13;11;20;14;15;17;21;20;16;15;20;19;15;17;18;13;13;14;12;16;18;20;17;13;10;19;15;19;18;12;15;16;14;15;15;16;17;16;11;13;9;9;12;16;14;17;18;17;12;10;11;12;15;14;15;13;10;15;11;12;17;13;11;15;12;8, avg(B_Y) = 19, CDR = 32

Table 3: The performance of multi-label feature selection of CLSF and CDR

Figure 1: The performance of CLSF for Cal500 and Enron (number of selected features of each relevant label group plotted against the index of each relevant label group).

Table 3 shows that CLSF and CDR can perceptibly remove unnecessary features with respect to the labels on all data sets. On Computer, Reference, Medical and Cal500, CLSF selects a smaller compact feature subset than CDR. The ability of CLSF and CDR to delete useless features is comparable on Emotion, Yeast and Enron, while CLSF reduces redundant features slightly less well than CDR on the other four data sets. Nevertheless, combining Tables 4 to 8, it is clear that CLSF achieves obvious performance advantages over CDR on the five multi-label evaluation metrics for every data set. As a supplement to Table 3, Figure 1 clearly depicts the number of features in the compact feature subset of each relevant label group of Cal500 and Enron. CLSF extracts features from Cal500 (or Enron) with high efficiency: each relevant label group retains 8 ∼ 26 (or 4 ∼ 92) indispensable features, and the weighted average is 19 (or 86) features per label. Table 3 shows that CLSF is also generally efficient on the other nine data sets. For instance, on Yeast and Recreation the labels possess strong correlations with the other labels, and about 90% of the features are indispensable. CLSF splits the label sets of Birds, Emotion and Scene into two relevant label groups and selects, on average, about 75% of the features as indispensable for each label. Reuters has 7 labels and 3 relevant label groups, which are related to 167, 147 and 175 features respectively, and to 159 indispensable features per label on average. Among the remaining data sets, Computer, Reference and Medical contain many labels and involve many outliers in their relevance judgement matrices. Reference includes 605 ∼ 662 features per relevant label group and 608 indispensable features per label on average. In Computer, 556, 438, 387 and 403 indispensable features are associated with the relevant label groups, and 394 with each label on average. Medical is divided into 12 relevant label groups by CLSF, where 34 labels with strong connections form one group and the other 11 groups contain only one label; the numbers of features selected per relevant label group for Medical are mostly concentrated between 5 ∼ 18, with one group retaining 75 features. The classification performance of CLSF-MK, which consists of CLSF and ML-KNN, is thereby effectively promoted. To demonstrate the classification performance of each algorithm more clearly and specifically, the experimental results of each multi-label algorithm over all data sets are reported in Tables 4 to 8, where "↑" indicates "the bigger the better" and "↓" indicates "the smaller the better". For a fair comparison, CLSF-MK and the other six multi-label classification algorithms are repeatedly run five times on randomly partitioned training (80%) and testing (20%) data, and the experiments are tuned by five-fold internal cross validation. Based on these experimental results, the following observations can be made:

Datasets     ML-KNN        LPLC          CDR           SCLS          MDMR          MUCO          CLSF-MK
Birds        .6889±.0060   .6733±.0213   .6448±.0106   .7144±.0297   .6840±.0178   .7158±.0267   .7184±.0301
Yeast        .7541±.0028   .7526±.0174   .7028±.0077   .7602±.0082   .7535±.0070   .7535±.0063   .7610±.0097
Emotion      .7578±.0230   .6870±.0301   .5669±.0320   .7749±.0209   .7656±.0053   .7754±.0225   .7798±.0156
Cal500       .4793±.0018   .4675±.0153   .4878±.0105   .4865±.0112   .4880±.0093   .4899±.0085   .4867±.0081
Scene        .8514±.0002   .8239±.0099   .3670±.0419   .7969±.0435   .7986±.0272   .7979±.0320   .8381±.0469
Medical      .7700±.0043   .7115±.0222   .8057±.0387   .7871±.0200   .7273±.0248   .7014±.0137   .8066±.0169
Reuters      .8559±.0172   .8564±.0105   .6283±.0143   .8103±.0157   .8283±.0107   .8563±.0056   .8730±.0082
Enron        .5446±.0072   .5365±.0289   .6210±.0255   .6193±.0280   .5971±.0388   .6014±.0348   .6211±.0204
Computer     .6390±.0055   .4843±.1112   .6032±.0104   .6406±.0136   .6468±.0126   .6521±.0131   .6491±.0103
Recreation   .4699±.0144   .3107±.0173   .3770±.0090   .4864±.0107   .4816±.0139   .4775±.0100   .4871±.0084
Reference    .6175±.0019   .2657±.0355   .5640±.0135   .6274±.0112   .6328±.0092   .6375±.0149   .6445±.0040
Average      .6753         .5972         .5799         .6822         .6731         .6808         .6969
Ave. Rank    4.4           5.8           5.5           3.6           4.1           3.2           1.5

Table 4: Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Average Precision ↑

Datasets     ML-KNN        LPLC          CDR           SCLS          MDMR          MUCO          CLSF-MK
Birds        .1266±.0015   .1521±.0110   .1604±.0099   .1146±.0141   .1206±.0069   .1112±.0110   .1110±.0134
Yeast        .1748±.0030   .1886±.0155   .2114±.0067   .1713±.0052   .1728±.0046   .1753±.0047   .1707±.0076
Emotion      .1955±.0226   .2775±.0393   .4291±.0326   .1824±.0119   .2059±.0048   .1822±.0218   .1821±.0142
Cal500       .1907±.0013   .2213±.0075   .1816±.0041   .1859±.0047   .1883±.0058   .1851±.0042   .1850±.0037
Scene        .0876±.0055   .1156±.0069   .5736±.0734   .1227±.0213   .1231±.0166   .1211±.0170   .1048±.0228
Medical      .0614±.0054   .0573±.0067   .0559±.0112   .0550±.0061   .0685±.0071   .0817±.0068   .0548±.0106
Reuters      .0964±.0116   .0948±.0090   .2619±.0138   .1266±.0110   .1134±.0075   .0945±.0096   .0849±.0064
Enron        .1101±.0084   .1730±.0142   .1031±.0119   .1023±.0130   .1093±.0114   .1046±.0158   .0968±.0095
Computer     .0891±.0031   .2555±.0739   .1021±.0048   .0890±.0050   .0858±.0043   .0848±.0034   .0846±.0036
Recreation   .1879±.0049   .3916±.0316   .2216±.0050   .1790±.0047   .1817±.0073   .1817±.0048   .1811±.0036
Reference    .0916±.0003   .2656±.0198   .1057±.0054   .0881±.0041   .0848±.0036   .0854±.0040   .0813±.0040
Average      .1283         .1993         .2187         .1288         .1322         .1280         .1216
Ave. Rank    4.6           5.7           5.5           3.3           4.4           3.4           1.3

Table 5: Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Ranking Loss ↓

Datasets     ML-KNN        LPLC          CDR           SCLS          MDMR          MUCO          CLSF-MK
Birds        .3675±.0114   .3984±.0316   .4388±.0178   .3442±.0447   .4217±.0288   .3444±.0320   .3442±.0440
Yeast        .2418±.0024   .2294±.0188   .2487±.0056   .2289±.0109   .2297±.0084   .2458±.0078   .2288±.0093
Emotion      .3428±.0459   .4403±.0437   .5651±.0491   .3152±.0465   .3221±.0141   .3015±.0260   .2969±.0281
Cal500       .1215±.0063   .1960±.0265   .1155±.0248   .1154±.0148   .1156±.0168   .1156±.0168   .1165±.0159
Scene        .2435±.0072   .2661±.0125   .8549±.0737   .3356±.0817   .3319±.0472   .3361±.0553   .2699±.0265
Medical      .2740±.0174   .3888±.0333   .2465±.0431   .2638±.0219   .3209±.0368   .4384±.0499   .2372±.0145
Reuters      .2135±.0215   .2075±.0111   .5770±.0278   .2810±.0233   .2655±.0136   .2250±.0061   .2236±.0327
Enron        .3555±.0411   .3300±.0445   .2962±.0250   .3407±.0829   .3731±.0984   .3354±.0377   .3167±.0243
Computer     .4138±.0132   .5616±.1489   .4762±.0126   .4338±.0175   .4312±.0157   .4220±.0159   .4260±.0127
Recreation   .6345±.0125   .8056±.0113   .8040±.0136   .6324±.0106   .6636±.0159   .6770±.0141   .6606±.0088
Reference    .4525±.0055   .9486±.0326   .5340±.0139   .4666±.0120   .4666±.0098   .4536±.0214   .4523±.0101
Average      .3328         .4338         .4688         .3416         .3584         .3545         .3248
Ave. Rank    3.6           4.8           5.9           3.5           4.6           4.1           2.2

Table 6: Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of One Error ↓

From Table 4, the Average Precision values of CLSF-MK are the largest among the five feature selection methods on the nine data sets other than Cal500 and Computer. On all data sets except Scene, the Average Precision values of CLSF-MK are better than those of ML-KNN and LPLC. The Ranking Loss values of CLSF-MK on the nine data sets excluding Cal500 and Recreation are also better than those of the other four feature selection methods CDR, SCLS, MDMR and MUCO. Except on Scene, CLSF-MK performs better than ML-KNN in terms of Ranking Loss, and its Ranking Loss values are better than those of LPLC on every data set. Based on Table 6, the One Error values of CLSF-MK are better than those of the other four feature selection methods on Birds, Yeast, Emotion, Scene, Medical, Reuters and Reference. On Scene and Computer, the One Error values of CLSF-MK are close to those of ML-KNN, and on the other nine data sets CLSF-MK is better than ML-KNN; the One Error values of CLSF-MK are better than those of LPLC on every data set.
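For concreteness, a minimal sketch of the evaluation protocol behind the mean ± std. values in Tables 4 to 8 is given below. It is an illustration rather than the exact code used in our experiments; it assumes the scikit-multilearn implementation of ML-KNN (MLkNN), scikit-learn's train_test_split, the multilabel_metrics function sketched in Section 5.1, and that feature selection has already been applied to X:

import numpy as np
from sklearn.model_selection import train_test_split
from skmultilearn.adapt import MLkNN  # assumed third-party ML-KNN implementation

def evaluate(X, Y, runs=5, test_size=0.2, k=10):
    """Repeat a random 80/20 split five times and average the metric values."""
    results = []
    for seed in range(runs):
        X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=test_size, random_state=seed)
        clf = MLkNN(k=k)
        clf.fit(X_tr, Y_tr)
        scores = clf.predict_proba(X_te).toarray()        # posterior score per label
        results.append(multilabel_metrics(scores, Y_te))  # function from the Section 5.1 sketch
    # report mean and standard deviation of each metric over the runs
    return {m: (np.mean([r[m] for r in results]), np.std([r[m] for r in results]))
            for m in results[0]}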

Datasets     ML-KNN        LPLC          CDR           SCLS          MDMR          MUCO          CLSF-MK
Birds        .0567±.0030   .2529±.0065   .0585±.0072   .0554±.0056   .0638±.0019   .0550±.0042   .0549±.0066
Yeast        .2000±.0022   .2167±.0117   .2320±.0044   .1963±.0022   .1973±.0016   .1999±.0009   .1960±.0054
Emotion      .2301±.0124   .4504±.0107   .3141±.0165   .2153±.0159   .2257±.0080   .2139±.0106   .2130±.0096
Cal500       .1398±.0012   .1608±.0032   .1372±.0040   .1370±.0063   .1408±.0064   .1372±.0060   .1401±.0034
Scene        .1021±.0009   .3015±.0091   .1713±.0088   .1184±.0220   .1174±.0173   .1208±.0190   .1026±.0218
Medical      .0172±.0006   .0718±.0013   .0142±.0017   .0143±.0010   .0164±.0008   .0206±.0018   .0142±.0010
Reuters      .0743±.0043   .2101±.0339   .1646±.0015   .0904±.0040   .0869±.0040   .0733±.0036   .0552±.0034
Enron        .0565±.0039   .1768±.0162   .0571±.0058   .0543±.0056   .0555±.0049   .0554±.0066   .0542±.0042
Computer     .0380±.0013   .2177±.0368   .0443±.0016   .0396±.0014   .0391±.0014   .0383±.0015   .0388±.0013
Recreation   .0595±.0008   .3094±.0608   .0645±.0009   .0594±.0014   .0601±.0014   .0602±.0013   .0593±.0011
Reference    .0291±.0002   .1276±.0325   .0355±.0006   .0293±.0010   .0302±.0007   .0308±.0013   .0301±.0008
Average      .0912         .2269         .1176         .0918         .0939         .0914         .0881
Ave. Rank    3.4           6.9           5.3           2.9           4.2           3.5           1.8

Table 7: Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Hamming Loss ↓

Datasets     ML-KNN        LPLC          CDR           SCLS          MDMR          MUCO          CLSF-MK
Birds        3.512±.1124   3.741±.3147   4.234±.3571   3.132±.2802   3.318±.3112   3.112±.2864   3.107±.1444
Yeast        6.449±.0366   6.711±.2063   6.784±.1016   6.340±.0467   6.388±.0488   6.405±.0697   6.315±.1223
Emotion      1.944±.0427   2.232±.1930   3.150±.0519   1.860±.0794   1.973±.1222   1.880±.1181   1.857±.1027
Cal500       131.5±.4841   148.5±1.734   129.9±2.033   131.6±1.758   132.1±1.960   131.2±2.073   131.0±2.149
Scene        .5246±.0440   .6050±.0391   2.969±.3985   .6987±.1067   .7034±.0875   .6921±.0879   .6329±.0973
Medical      3.634±.2290   3.241±.2560   2.855±.4781   3.188±.4650   4.110±.5382   4.685±.4101   3.463±.5349
Reuters      .7545±.1045   .6980±.0688   1.705±.0908   .9410±.0674   .8650±.0391   .7390±.0406   .6930±.0445
Enron        14.75±.9089   20.53±1.201   14.37±1.576   14.31±1.899   14.94±1.910   14.49±2.052   13.64±1.269
Computer     4.226±.1893   9.794±2.067   4.742±.2710   4.259±.2874   4.156±.2448   4.089±.2274   4.078±.2105
Recreation   4.988±.1603   8.874±.5220   5.715±.1374   4.800±.1366   4.854±.2115   4.861±.1414   4.838±.1278
Reference    3.498±.0438   9.419±.6589   3.963±.2006   3.385±.1480   3.281±.1400   3.300±.1407   3.157±.0692
Average      15.98         19.49         16.40         15.86         16.06         15.95         15.71
Ave. Rank    4.3           5.5           5.3           3.4           4.5           3.5           1.6

Table 8: Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Coverage ↓
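The "Average" and "Ave. Rank" rows in Tables 4 to 8 summarize each algorithm column over the eleven data sets. A plausible way to compute them (our sketch; it assumes the algorithms are ranked per data set, with rank 1 for the best value and ties sharing the mean rank) is:

import numpy as np
from scipy.stats import rankdata

def summary_rows(table, higher_is_better):
    """table: (n_datasets, n_algorithms) metric values; returns per-column mean and average rank."""
    averages = table.mean(axis=0)
    signed = -table if higher_is_better else table         # so that rank 1 marks the best algorithm
    ranks = np.vstack([rankdata(row) for row in signed])    # ties receive the mean rank
    return averages, ranks.mean(axis=0)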

In terms of Hamming Loss, CLSF-MK is close to ML-KNN on Scene and Computer, and its values are smaller than those of ML-KNN on the other nine data sets; CLSF-MK is also smaller than LPLC on all data sets, and it shows better Hamming Loss performance than the other four feature selection algorithms on the nine data sets other than Cal500 and Computer. According to Table 8, the Coverage values of CLSF-MK are better than those of CDR, SCLS, MDMR and MUCO on the eight data sets other than Cal500, Medical and Recreation. On the ten data sets other than Scene, the Coverage values of CLSF-MK are better than those of ML-KNN, and CLSF-MK also obtains better results than LPLC on all data sets. From Tables 4 to 8, we observe that CLSF-MK outperforms the other six algorithms, as it attains the best average values and average ranks on all five evaluation metrics. By considering label correlation, CLSF-MK performs better than the other three feature selection algorithms that consider label correlation (SCLS, MDMR and MUCO) on ten data sets for Hamming Loss, Ranking Loss and Coverage, and on nine data sets for Average Precision and One Error.

The values of the five evaluation metrics show that CLSF-MK achieves better classification ability than ML-KNN and LPLC on all but one to three data sets, which proves that CLSF-MK is superior to ML-KNN and LPLC using all features. The performance of CLSF-MK is also better than that of CDR on the majority of data sets, with only one or two exceptions. This demonstrates that the feature selection method that considers label correlation, CLSF-MK, indeed improves the performance, and that the proposed method does not lose important features needed by ML-KNN. In general, each multi-label feature selection algorithm has some limitations, no single algorithm is suitable for all situations, and hardly any algorithm attains the best values on all evaluation metrics. Nevertheless, we can conclude that CLSF-MK performs better overall than ML-KNN, LPLC, CDR, SCLS, MDMR and MUCO, which indicates the effectiveness and practicability of the proposed method.

6 Conclusions

This paper focused on improving the efficiency and performance of multi-label classification by exploring feature selection and label correlation. We presented a novel feature selection method for multi-label data, named CLSF, which is based on label correlation and has several advantages over existing methods. First, the essential elements of each label are defined and calculated by considering the uncertain and certain information determined by the label and the feature set. Then, the relevancy between labels and the corresponding relevance judgement matrix are obtained from the overlap of the families of essential elements related to different labels, a step that is neglected in existing algorithms. This matrix reflects the correlation among labels, so the label set is divided into several relevant label groups and both local and global label correlations are considered. CLSF thus takes feature selection and local label correlation into account simultaneously, and the labels in a group of relevant labels are integrated to reconstruct a binary relation. To simplify the process of multi-label classification, the indispensable features with respect to several closely related labels are selected for each group of relevant labels. In summary, the proposed method promotes the effectiveness and accuracy of prediction. There are several feasible directions for subsequent research. First, the discretization method applied by Lee and Kim [13] suggests investigating how to choose a more reasonable method for discretizing multi-label numerical data before computing the relationships among labels. Second, inspired by Lee and Kim [13, 14] and Lin and Hu [29, 27], future research could investigate how to define the dependency of a feature relative to a relevant label group and the exclusion among features in the compact feature subset of that group, and then select the minimum number of features that keeps the deterministic information between the feature set and the given label group unchanged. In addition, other methods of synthesizing the labels in a relevant label group to further improve classification performance deserve to be pursued.

Acknowledgements

This paper is supported by grants from the National Natural Science Foundation of China (61573127, 71471060) and the fund of North China Electric Power University.


Appendix 1. Abbreviations

X = {(x, A(x)) | x ∈ U}: X is the set of input instances, A is the set of features, U is the domain of input objects
Y = {l1, l2, . . . , ls}: finite set of s possible labels
D = {(A(x), Y(x)) | x ∈ U}: training data set with instances and their related labels
K(x): set of possible labels of object x
lj(xi): the ith object xi associated with the jth label lj
RB: indiscernibility relation with respect to the subset of features B
U/RB = {[x]B | x ∈ U}: quotient set with respect to B, where [x]B is the equivalence class with respect to B
U/lj = {Posj, Negj}: quotient set of positive (Posj) and negative (Negj) objects with respect to label lj
PosA(lj) (NegA(lj)): positive (negative) consistent region with respect to lj
Pj*: set of pairs of equivalence classes to be discerned with respect to lj
Pj: distribution discernibility matrix with respect to lj
Pj([x]A, [y]A): discernibility feature set discerning [x]A and [y]A with respect to lj
Ej = {Ej | Ej ⊆ Pj}: group of essential elements with respect to lj
γ(j, k): label relevancy with respect to labels lj and lk
α: threshold on the coincidence degree between groups of essential elements
β: threshold on the coincidence degree between essential elements
Rel = (γ(j, k))|Y|×|Y|: relevancy judgement matrix with respect to the set of labels Y
clu(Y): disjoint relevant label groups
posj,k([x]A) (negj,k([x]A)): local positive (negative) correlation of label pair (lj, lk) with respect to [x]A
locj,k([x]A): local correlation of label pair (lj, lk) with respect to [x]A
η: local correlation parameter
glo(j, k): global correlation with respect to label pair (lj, lk)
L(Y): collection of label pairs with respect to the group of relevant labels Y
Gj,k([x]A): base learner of label pair (lj, lk) with respect to [x]A
ej,k([x]A): classification error rate of label pair (lj, lk) with respect to [x]A
αj,k([x]A): coefficient of Gj,k with respect to [x]A
fY([x]A): linear combination of base learners with respect to Y
s(Y): each step of region with respect to Y
GY([x]A): final learner with respect to Y
PY*: set of pairs of equivalence classes to be discerned with respect to Y
PY: distribution discernibility matrix with respect to Y
PY([x]A, [y]A): discernibility feature set discerning [x]A and [y]A with respect to Y
EY = {EY | EY ⊆ PY}: group of essential elements with respect to Y
| · |: cardinality of a set
⌈ · ⌉: smallest integer not less than a number

References

[1] A. Cano, J. Luna, E. Gibaja, S. Ventura, LAIM discretization for multi-label data, Information Sciences, 2016, 330: 370-384.
[2] A. Ghazikhani, R. Monsefi, H.S. Yazdi, Online neural network model for non-stationary and imbalanced data stream classification, International Journal of Machine Learning & Cybernetics, 2014, 5(1): 51-62.
[3] D. Chen, Y. Yang, Z. Dong, An incremental algorithm for attribute reduction with variable precision rough sets, Applied Soft Computing, 2016, 45: 129-149.
[4] G. Nan, Q. Li, R. Dou, et al., Local positive and negative correlation-based k-labelsets for multi-label classification, Neurocomputing, 2018, 318: 90-101.
[5] G. Tsoumakas, E. Spyromitros-Xioufis, V. Vilcek, et al., MULAN: A Java library for multi-label learning, Journal of Machine Learning Research, 2011, 12(7): 2411-2414.
[6] G. Tsoumakas, I. Katakis, Multi label classification: An overview, International Journal of Data Warehousing and Mining, 2007, 3(3): 1-13.
[7] H. Li, D. Li, Y. Zhai, et al., A novel attribute reduction approach for multi-label data based on rough set theory, Information Sciences, 2016, 367-368: 827-847.
[8] J. Dai, H. Hu, W. Wu, et al., Maximal-discernibility-pair-based approach to attribute reduction in fuzzy rough sets, IEEE Transactions on Fuzzy Systems, 2018, 26(4): 2174-2187.
[9] J. Huang, G. Li, S. Wang, et al., Multi-label classification by exploiting local positive and negative pairwise label correlation, Neurocomputing, 2017, 257: 164-174.
[10] J. Lee, D. Kim, Feature selection for multi-label classification using multivariate mutual information, Pattern Recognition Letters, 2013, 34: 349-357.
[11] J. Lee, D. Kim, Mutual information-based multi-label feature selection using interaction information, Expert Systems with Applications, 2015, 42: 2013-2025.
[12] J. Lee, D. Kim, Memetic feature selection algorithm for multi-label classification, Information Sciences, 2015, 293: 80-96.
[13] J. Lee, D. Kim, SCLS: Multi-label feature selection based on scalable criterion for large label set, Pattern Recognition, 2017, 66: 342-352.
[14] J. Lee, H. Kim, N. Kim, et al., An approach for multi-label classification by directed acyclic graph with label correlation maximization, Information Sciences, 2016, 351: 101-114.
[15] J. Zhang, M. Fang, X. Li, Multi-label learning with discriminative features for each label, Neurocomputing, 2015, 154: 305-316.
[16] L. Chen, D. Chen, H. Wang, Alignment based kernel selection for multi-label learning, Neural Processing Letters, 2018, https://doi.org/10.1007/s11063-018-9863-z.
[17] M. Boutell, J. Luo, X. Shen, C.M. Brown, Learning multi-label scene classification, Pattern Recognition, 2004, 37: 1757-1771.
[18] M. Rahman, M. Islam, FIMUS: A framework for imputing missing values using co-appearance, correlation and similarity analysis, Knowledge-Based Systems, 2014, 56(3): 311-327.
[19] M. Zhang, Z. Zhou, ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognition, 2007, 40(7): 2038-2048.
[20] M. Zhang, Z. Zhou, A review on multi-label learning algorithms, IEEE Transactions on Knowledge & Data Engineering, 2014, 26(8): 1819-1837.
[21] M. Zhang, L. Wu, LIFT: Multi-label learning with label-specific features, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(1): 1609-1614.
[22] R. Schapire, Y. Singer, BoosTexter: A boosting-based system for text categorization, Machine Learning, 2000, 39(2-3): 135-168.
[23] S. Huang, Z. Zhou, Multi-label learning by exploiting label correlations locally, Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012, 949-955.
[24] W. Weng, Y. Lin, S. Wu, et al., Multi-label learning based on label-specific features and local pairwise label correlation, Neurocomputing, 2018, 273: 385-394.
[25] Y. Lin, J. Li, P. Lin, G. Lin, J. Chen, Feature selection via neighborhood multi-granulation fusion, Knowledge-Based Systems, 2014, 67: 162-168.
[26] Y. Lin, Q. Hu, J. Liu, J. Chen, J. Duan, Multi-label feature selection based on neighborhood mutual information, Applied Soft Computing, 2016, 38: 244-256.
[27] Y. Lin, Q. Hu, J. Liu, et al., Streaming feature selection for multi-label learning based on fuzzy mutual information, IEEE Transactions on Fuzzy Systems, 2017, 25(6): 1491-1507.
[28] Y. Lin, Q. Hu, J. Zhang, X. Wu, Multi-label feature selection with streaming labels, Information Sciences, 2016, 372: 256-275.
[29] Y. Lin, Q. Hu, J. Liu, et al., Multi-label feature selection based on max-dependency and min-redundancy, Neurocomputing, 2015, 168: 92-103.
[30] Y. Yu, W. Pedrycz, D. Miao, Multi-label classification by exploiting label correlations, Expert Systems with Applications, 2014, 41(6): 2989-3004.
[31] Y. Yu, W. Pedrycz, D. Miao, Neighborhood rough sets based multi-label classification for automatic image annotation, International Journal of Approximate Reasoning, 2013, 54(9): 1373-1387.
[32] Y. Zhu, J. Kwok, Z. Zhou, Multi-label learning with global and local label correlation, IEEE Transactions on Knowledge and Data Engineering, 2017, 30: 1081-1094.
[33] Z. Barutcuoglu, R.E. Schapire, O.G. Troyanskaya, Hierarchical multi-label prediction of gene function, Bioinformatics, 2006, 22(7): 830-836.
[34] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences, 1982, 11(5): 341-356.


Declaration of Interest Statement

Dr. Xiaoya Che
School of Control and Computer Engineering
North China Electric Power University
Beijing 102206, P. R. China
[email protected] [email protected] [email protected]
April 22, 2019

Prof. Witold Pedrycz
Editor-in-Chief, Information Sciences

Dear Prof. Witold Pedrycz,

Regarding the manuscript entitled 'A novel approach for learning label correlation with application to feature selection of multi-label data', the authors declare that they have no conflicts of interest related to this work, and that they do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work. Thank you very much for your attention and consideration.

Yours sincerely,
Xiaoya Che, Degang Chen, Jusheng Mi.