Generalized Mean based Back-propagation of Errors for Ambiguity Resolution

Shounak Datta, Sankha Subhra Mullick, Swagatam Das

PII: S0167-8655(17)30135-6
DOI: 10.1016/j.patrec.2017.04.019
Reference: PATREC 6803

To appear in: Pattern Recognition Letters

Received date: 21 July 2016
Revised date: 9 March 2017
Accepted date: 23 April 2017

Please cite this article as: Shounak Datta, Sankha Subhra Mullick, Swagatam Das, Generalized Mean based Back-propagation of Errors for Ambiguity Resolution, Pattern Recognition Letters (2017), doi: 10.1016/j.patrec.2017.04.019

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Research Highlights

This section presents the research highlights for the article "Generalized Mean based Back-propagation of Errors for Ambiguity Resolution". The research highlights are as follows:

• The Ambiguity Resolution problem is formally defined.

• A new multi-layer ambiguity resolving perceptron is proposed.

• A continuous and differentiable generalized mean based error function is introduced.

• A back-propagation algorithm for the proposed error function is formulated.


• The new method is compared with 4 alternatives to show its usefulness.


Pattern Recognition Letters journal homepage: www.elsevier.com

Generalized Mean based Back-propagation of Errors for Ambiguity Resolution

Shounak Datta^a, Sankha Subhra Mullick^a, Swagatam Das^a,**

^a Electronics and Communication Sciences Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata-700 108, India.

** Corresponding author. E-mail: [email protected] (Shounak Datta), [email protected] (Sankha Subhra Mullick), [email protected] (Swagatam Das). A supplementary document is available for online publication only from the journal website.

ABSTRACT

Ambiguity in a dataset, characterized by data points having multiple target labels, may occur in many supervised learning applications. Such ambiguity originates naturally or from misinterpretation, faulty encoding, and/or incompleteness of data. However, most applications demand that a data point be assigned a single label. In such cases, the supervised learner must resolve the ambiguity. To effectively perform ambiguity resolution, we propose a new variant of the popular Multi-Layer Perceptron model, called the Generalized Mean Multi-Layer Perceptron (GMMLP). In GMMLP, a novel differentiable error function guides the back-propagation algorithm towards the minimum-distance target for each data point. We evaluate the performance of the proposed algorithm against three alternative ambiguity resolvers on 20 new artificial datasets containing ambiguous data points. To further test for scalability and comparison with multi-label classifiers, 18 real datasets are also used to evaluate the new approach.

© 2017 Elsevier Ltd. All rights reserved.

Keywords: Ambiguity Resolution, Generalized Mean, Multiple Labels, Back-propagation, Multi-Layer Perceptron

1. Introduction

Given a training dataset $S = \{(\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in P \subset \mathbb{R}^d;\ c_i \in \mathcal{C} = \{1, 2, \cdots, C\}\}$, consisting of data points $\mathbf{x}_i$ in the training set $P$ and their corresponding labels $c_i$, the traditional supervised learning problem is to identify the mapping $f: \mathbb{R}^d \to \mathcal{C}$ so that $f(\mathbf{x}_i) = c_i\ \forall \mathbf{x}_i \in P$. Then, the label for a new data point $\mathbf{y} \in Q \subset \mathbb{R}^d$ ($Q$ being the test set) can be predicted to be $f(\mathbf{y})$. However, many practical applications are characterized by ambiguous training data points, i.e., data points with multiple corresponding labels. Formally, the notion of ambiguity in the context of supervised learning can be defined as follows.

Definition 1. For a supervised learning problem with dataset $S = \{(\mathbf{x}_i, C_i) \mid \mathbf{x}_i \in P \subset \mathbb{R}^d;\ C_i \subseteq \mathcal{C}\}$, a data point $\mathbf{x}_i \in P$ is said to be ambiguous if $|C_i| \geq 2$.

Ambiguity in a supervised learning problem may stem from a variety of reasons, such as label noise, lack of sufficient information to be able to distinguish between classes or concepts, a faulty label encoding scheme resulting in distinct labels being assigned to a single concept, information overlap between classes, etc. Such datasets can be subjected to two distinct forms of supervised learning, viz. ambiguity resolution and multi-label learning (Madjarov et al., 2012), which are defined as follows.

Definition 2. For a dataset $S$ containing ambiguous data points, the problem of ambiguity resolution is to identify a suitable mapping $f_1: \mathbb{R}^d \to \mathcal{C}$ such that $f_1(\mathbf{x}_i) \in C_i\ \forall \mathbf{x}_i \in P$.

Definition 3. The problem of multi-label learning, on the other hand, is to identify a suitable mapping $f_2: \mathbb{R}^d \to \mathcal{P}(\mathcal{C}) \setminus \{\emptyset\}$ ($\mathcal{P}(\mathcal{C})$ being the power set of $\mathcal{C}$) so that $f_2(\mathbf{x}_i) \equiv C_i\ \forall \mathbf{x}_i \in P$.
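As a concrete illustration (our own, not part of the original paper), the following Python snippet contrasts the outputs that Definitions 2 and 3 admit for a single ambiguous point:

```python
# Toy illustration of Definitions 2 and 3 for one ambiguous point with
# candidate label set C_i = {1, 3} in a 4-class problem.
C_i = {1, 3}

f1_output = 3             # an ambiguity resolver returns ONE label from C_i
assert f1_output in C_i   # Definition 2: f1(x_i) must lie in C_i

f2_output = {1, 3}        # a multi-label learner returns the whole label set
assert f2_output == C_i and len(f2_output) > 0  # Definition 3: f2(x_i) == C_i
```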

Therefore, ambiguity resolution also differs from multi-label learning in the treatment of a test point $\mathbf{y}$, in that $f_1(\mathbf{y}) \in \mathcal{C}$ is a single predicted label while $f_2(\mathbf{y}) \subseteq \mathcal{C}$ is a set of possible labels. Multi-label learning predicts potentially multiple labels for a given data point. Consequently, it is not suitable for applications where a single label (out of multiple ambiguous labels) should be selected for each data point. Let us look at a few scenarios where the need for such ambiguity resolution arises.

1. The training dataset is multi-labeled but the user insists that a single label be assigned to a query point (Bullinaria, 1995). A befitting example is the task of identifying individual personalities using a facial recognition classifier which is trained on multi-labeled news-feed images, where no correspondence between the labels and the personalities in an image is specified (Chen et al., 2015).

2. If multiple experts are used to label a dataset, their personal opinions, feelings, knowledge, and biases can cause some of the data points to be assigned multiple ambiguous labels, only one of which is the true label. Examples include detecting emotions from speech (EMA dataset (Lee et al., 2005)), recognising faces in a picture (LOST dataset (Cour et al., 2009)), predicting medical conditions from clinical reports (Wang and Bi, 2016), etc. Such problems have been previously dealt with in Zhang and Obradovic (2011), Yan et al. (2014), etc.

3. A similar but more challenging problem arises when the labels are crowd-sourced, resulting in almost every data point being labeled with a potentially large set of ambiguous labels, only one of which is correct. Detailed discussions on this problem can be found in Hernández-González et al. (2015), Raykar et al. (2010), etc.

4. There can also be cases where the labels of a dataset become noisy or corrupted due to faulty transmission, storage, etc. A description and simulation strategy for such problems can be found in the literature on partial label learning (Zhang, 2014; Zhang and Yu, 2015).

Hence, ambiguity resolution, which is the general problem encompassing all the above-mentioned scenarios, can be significantly important and useful in many real-life applications. Surprisingly, the present literature on learning with ambiguous data points abounds with paradigms of multi-label learning (Madjarov et al., 2012), while the equally (if not more) important problem of ambiguity resolution has received little attention. Some learners designed for handling multi-label problems are those of Zhang and Zhou (2007), Zhang and Zhou (2006), Rai et al. (2015), Tenenboim et al. (2009), Zhang et al. (2014), etc. Another approach to deal with ambiguity is preference learning (more specifically, label ranking), where each data point has a preference ranking corresponding to each label (Hüllermeier et al., 2008; Zabkar et al., 2010). While it may help in resolving ambiguity, by assigning an ambiguous point the label having maximum preferability (if there exists such a unique label), it usually requires prior preference information (Fürnkranz and Hüllermeier, 2010), which is often unavailable or costly.

The major work in ambiguity resolution is that of Bullinaria (1995), in which a common Multi-Layer Perceptron (MLP) is proposed to handle ambiguities in the $(g+1)$-th epoch by drawing each ambiguous data point $\mathbf{x}_i$ towards the label $c_i^{(g)} \in C_i$ which generates the minimum error for $\mathbf{x}_i$ in the $g$-th epoch. The assumption behind this approach is that the ambiguity gets resolved automatically as the network gets trained on the non-ambiguous data points. However, the use of the discrete minimum function prevented the application of the back-propagation algorithm directly to the non-differentiable error function. Moreover, such an approach also completely ignores the affinity that a data point may have towards other labels. The reliance on a large number of hidden nodes, as demonstrated by experiments in Bullinaria (1995), is possibly a side-effect of the unusual learning method adopted. In this article, we propose an elegant improvement over Bullinaria's early milestone by using the concept of the generalized mean (Hardy et al., 1988).

Definition 4. The generalized mean $\mu_\rho$ of a set of real numbers $A \subset \mathbb{R}$ is defined as

$$\mu_\rho(A) = \left( \frac{1}{|A|} \sum_{a \in A} a^{\rho} \right)^{\frac{1}{\rho}}. \tag{1}$$

It is well known that the generalized mean of a set of values tends towards the minimum value for a sufficiently small choice of the exponent, i.e., $\mu_\rho(A) \to \min(A)$ as $\rho \to -\infty$. The generalized mean function, being both continuous and differentiable, unlike the minimum function, can be directly subjected to back-propagation based learning. Furthermore, the affinity to the minimum value can be controlled by varying the exponent $\rho$. Because of these desirable characteristics, we are motivated to utilize the generalized mean for ambiguity resolution using MLPs.
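To see the limiting behaviour concretely, here is a small numpy sketch (ours; the error values are hypothetical) of Eq. (1) applied to a set of per-target errors:

```python
import numpy as np

def generalized_mean(values, rho):
    """Eq. (1): mu_rho(A) = (mean of a^rho over A)^(1/rho), for rho != 0."""
    a = np.asarray(values, dtype=float)  # assumes strictly positive values
    return np.mean(a ** rho) ** (1.0 / rho)

errors = [0.9, 0.5, 0.1]  # hypothetical errors of one point w.r.t. 3 targets
for rho in (1, -1, -5, -20):
    print(rho, round(generalized_mean(errors, rho), 4))
# rho =   1 -> 0.5000 (arithmetic mean)
# rho =  -1 -> 0.2288 (harmonic mean)
# rho =  -5 -> 0.1246
# rho = -20 -> 0.1056, approaching min(errors) = 0.1 as rho -> -infinity
```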

The major contributions of the current study are summarized below:

1. We put forth a novel error function for back-propagation based learning of MLPs, which is able to handle non-ambiguous and ambiguous data points alike. We minimize the generalized mean of the errors of each data point w.r.t. each of its target labels. Notice that the generalized mean of errors boils down to the traditional error function for an unambiguous data point.

2. We prepare a set of 20 artificial datasets having ambiguously labeled data points. The datasets, which are diverse in terms of structure, dimensions, and extent of ambiguity, can be found at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FO4RIRM.

3. The proposed method is tested on the 20 artificial datasets created by us and on 10 other real-life ambiguous datasets (without ground truth information) from various fields like bioinformatics, video annotation, etc. We compare its performance with those of three alternative ambiguity resolution strategies and a neural network based multi-label classifier called BP-MLL (Zhang and Zhou, 2006).

4. We establish the better performance of our proposed ambiguity resolver compared to the multi-label classifier BP-MLL on datasets where ground truth is available. To simulate noisy labeled datasets, we use 6 real-life datasets from the UCI repository (Lichman, 2013), following Zhang (2014). To illustrate our algorithm's improved immunity against inexperienced, misguided and/or biased experts, we also conduct experiments on the LOST and EMA datasets. We conduct Wilcoxon signed rank (Derrac et al., 2011) and Mann-Whitney U tests (Gibbons and Chakraborti, 2011) to establish the superiority of the proposed learner in a statistically significant way.

The organization of this paper is as follows. We derive the expressions for the proposed back-propagation method in Section 2. Subsequently, in Section 3, we describe the datasets used and the experimental procedure. Next, in the same section, we present the experimental results and analyse them. We finally conclude the article with a brief summary and remarks in Section 4.

2. Generalized Mean Multi-Layer Perceptron

The MLP (Haykin, 2009) is a popular non-parametric supervised learner having a network architecture. Hence, it does not require any prior assumptions about the class distributions of the datasets to be learned and can effectively generate non-linear separation boundaries to distinguish between structurally complex classes. The structure of an MLP is simple and highly parallel in nature, making it suitable for high-dimensional data processing. Back-propagation of errors (Rumelhart et al., 1986), the scaled conjugate gradient method (Møller, 1993), and many other learning algorithms have been designed to train MLPs. All these factors have influenced us to use the MLP as the underlying supervised learner for ambiguity resolution. We refer to the proposed MLP based ambiguity resolver as the Generalized Mean Multi-Layer Perceptron (GMMLP), the details of which are presented in the rest of this section.

2.1. Generalized Mean based Error Function

Let us consider an MLP consisting of an input layer of $(d+1)$ nodes, a single hidden layer having $\alpha$ nodes, and an output layer having $C$ nodes (as many nodes as the number of possible classes). Let $u_{kb}$ denote the weight of the connection from the $k$-th input node to the $b$-th hidden node, and let $u_{0b}$ denote the bias term of the $b$-th hidden node. Similarly, let $v_{br}$ denote the weight of the connection from the $b$-th hidden node to the $r$-th output node, and let $v_{0r}$ denote the bias term of the $r$-th output node. Moreover, let $U = [u_{kb}]_{d \times \alpha}$, $\mathbf{u}_0 = [u_{0b}]_{\alpha \times 1}$, $V = [v_{br}]_{\alpha \times C}$, and $\mathbf{v}_0 = [v_{0r}]_{C \times 1}$ denote the matrices and vectors of the weights and the bias terms. Let the activation function be the sigmoid function

$$\psi(x) = \frac{1}{1 + e^{-\gamma x}}, \tag{2}$$

where we set the skewness parameter $\gamma = 1$, in keeping with general conventions. Using the desired target outputs and the currently obtained output, an error $e_i$ can be calculated for each pattern $\mathbf{x}_i$. There are various ways to encode the target outputs for training an MLP. One way is to encode them as binary strings in which 1 occurs in the position pertaining to the desired label and all other positions are 0. Therefore, the target output vector for the input $\mathbf{x}_i$ is defined as $\Omega(\mathbf{x}_i) = [\omega_r(\mathbf{x}_i)]_{C \times 1}$, where

$$\omega_r(\mathbf{x}_i) = \begin{cases} 1 & \text{if } r \in C_i, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

Since each target label must be treated separately, this form of target encoding must be modified for ambiguity resolution to the target output matrix $\Omega_a(\mathbf{x}_i) = [\omega_{r,\tau}(\mathbf{x}_i)]_{C \times |C_i|}$, where

$$\omega_{r,\tau}(\mathbf{x}_i) = \begin{cases} 1 & \text{if } r = c_{i,\tau}, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

and $c_{i,\tau}$ denotes the $\tau$-th element of $C_i$. Thus, the distance between the $\tau$-th target and the output of the network, for the input $\mathbf{x}_i$, is given by

$$e_{i,\tau} = \frac{1}{2} \sum_{r=1}^{C} (o_r(\mathbf{x}_i) - \omega_{r,\tau}(\mathbf{x}_i))^2. \tag{5}$$

Therefore, in order to prioritize the minimization of the error corresponding to the closest target (see Section 2.2), while also not completely ignoring the distance from all other targets, we propose to combine the errors $e_{i,\tau}$ ($\tau = 1, 2, \cdots, |C_i|$) by using the differentiable error function

$$e_i = \mu_\rho(\{e_{i,1}, \cdots, e_{i,|C_i|}\}) = \left( \frac{1}{|C_i|} \sum_{\tau=1}^{|C_i|} \left( \frac{1}{2} \sum_{r=1}^{C} (o_r(\mathbf{x}_i) - \omega_{r,\tau}(\mathbf{x}_i))^2 \right)^{\rho} \right)^{\frac{1}{\rho}}, \tag{6}$$

where the exponent $\rho$ is chosen to be sufficiently small.
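The following numpy sketch (our own illustration; the outputs and labels are hypothetical) shows how Eqs. (4)-(6) combine into a single per-pattern error:

```python
import numpy as np

def gmmlp_error(o, C_i, rho=-20.0):
    """GMMLP error of Eq. (6) for one pattern.

    o   : length-C array of network outputs o_r(x_i)
    C_i : candidate labels of x_i (1-based, as in the paper)
    rho : generalized-mean exponent; large negative values emphasise the
          closest target without entirely ignoring the others
    """
    per_target = []
    for c in C_i:                        # one column of Eq. (4) per candidate
        omega = np.zeros(len(o))
        omega[c - 1] = 1.0
        per_target.append(0.5 * np.sum((o - omega) ** 2))   # Eq. (5)
    e = np.asarray(per_target)
    return np.mean(e ** rho) ** (1.0 / rho)                 # Eq. (6)

# The network is already closest to label 3, so the combined error (~0.052)
# stays near the label-3 error (0.05) rather than the label-1 error (0.65).
o = np.array([0.2, 0.1, 0.8, 0.1])
print(gmmlp_error(o, [1, 3]))
```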

2.2. Back-propagation of Errors

The error presented in (6) can then be propagated from the succeeding layers to the preceding layers through the network to train the MLP, according to the following theorem.

Theorem 1. To diminish the error $e_i$ for an input pattern $\mathbf{x}_i$, the GMMLP network weights $v_{br}$ ($b = 0, 1, \cdots, \alpha$; $r = 1, \cdots, C$) and $u_{kb}$ ($k = 0, 1, \cdots, d$; $b = 1, \cdots, \alpha$) must be modified according to the following expressions:

$$v_{br} = v_{br} - \frac{\eta}{|C_i|} \left( \frac{1}{|C_i|} \sum_{\tau=1}^{|C_i|} e_{i,\tau}^{\rho} \right)^{\frac{1}{\rho}-1} \sum_{\tau=1}^{|C_i|} \left( e_{i,\tau}^{\rho-1} \, (o_r(\mathbf{x}_i) - \omega_{r,\tau}(\mathbf{x}_i)) \, o_r(\mathbf{x}_i)(1 - o_r(\mathbf{x}_i)) \, h_b(\mathbf{x}_i) \right), \tag{7}$$

$$u_{kb} = u_{kb} - \frac{\eta}{|C_i|} \left( \frac{1}{|C_i|} \sum_{\tau=1}^{|C_i|} e_{i,\tau}^{\rho} \right)^{\frac{1}{\rho}-1} \sum_{\tau=1}^{|C_i|} \left( e_{i,\tau}^{\rho-1} \sum_{r=1}^{C} \Big( (o_r(\mathbf{x}_i) - \omega_{r,\tau}(\mathbf{x}_i)) \, o_r(\mathbf{x}_i)(1 - o_r(\mathbf{x}_i)) \, v_{br} \Big) \, h_b(\mathbf{x}_i)(1 - h_b(\mathbf{x}_i)) \, x_{i,k} \right), \tag{8}$$

where $\eta \in (0, 1]$ is a scaling parameter and $e_{i,\tau}$ is as defined in (5).

Proof. To decrease the error $e_i$ using the gradient descent method, the network weights $v_{br}$ ($b = 0, 1, \cdots, \alpha$; $r = 1, \cdots, C$) and $u_{kb}$ ($k = 0, 1, \cdots, d$; $b = 1, \cdots, \alpha$) must be adapted in the following manner: $v_{br} = v_{br} - \eta \Delta v_{br}(\mathbf{x}_i)$ and $u_{kb} = u_{kb} - \eta \Delta u_{kb}(\mathbf{x}_i)$, where $\eta \in (0, 1]$, $\Delta v_{br}(\mathbf{x}_i) = \frac{\partial e_i}{\partial v_{br}(\mathbf{x}_i)}$ and $\Delta u_{kb}(\mathbf{x}_i) = \frac{\partial e_i}{\partial u_{kb}(\mathbf{x}_i)}$. Now, by the chain rule of differentiation,

$$\frac{\partial e_i}{\partial u_{kb}(\mathbf{x}_i)} = \frac{\partial e_i}{\partial e_{i,\tau}} \frac{\partial e_{i,\tau}}{\partial o_r(\mathbf{x}_i)} \frac{\partial o_r(\mathbf{x}_i)}{\partial h_b(\mathbf{x}_i)} \frac{\partial h_b(\mathbf{x}_i)}{\partial u_{kb}(\mathbf{x}_i)} = \frac{1}{|C_i|} \left( \frac{1}{|C_i|} \sum_{\tau=1}^{|C_i|} e_{i,\tau}^{\rho} \right)^{\frac{1}{\rho}-1} \sum_{\tau=1}^{|C_i|} \left( e_{i,\tau}^{\rho-1} \sum_{r=1}^{C} \Big( (o_r(\mathbf{x}_i) - \omega_{r,\tau}(\mathbf{x}_i)) \, o_r(\mathbf{x}_i)(1 - o_r(\mathbf{x}_i)) \, v_{br} \Big) \, h_b(\mathbf{x}_i)(1 - h_b(\mathbf{x}_i)) \, x_{i,k} \right), \tag{9}$$

$$\frac{\partial e_i}{\partial v_{br}(\mathbf{x}_i)} = \frac{\partial e_i}{\partial e_{i,\tau}} \frac{\partial e_{i,\tau}}{\partial o_r(\mathbf{x}_i)} \frac{\partial o_r(\mathbf{x}_i)}{\partial v_{br}(\mathbf{x}_i)} = \frac{1}{|C_i|} \left( \frac{1}{|C_i|} \sum_{\tau=1}^{|C_i|} e_{i,\tau}^{\rho} \right)^{\frac{1}{\rho}-1} \sum_{\tau=1}^{|C_i|} \left( e_{i,\tau}^{\rho-1} \, (o_r(\mathbf{x}_i) - \omega_{r,\tau}(\mathbf{x}_i)) \, o_r(\mathbf{x}_i)(1 - o_r(\mathbf{x}_i)) \, h_b(\mathbf{x}_i) \right). \tag{10}$$

Scaling these derivatives by the learning rate $\eta$ yields (7) and (8). This completes the proof.
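As a sanity check on Theorem 1, the sketch below (ours; toy sizes and random weights, not the paper's code) implements the analytic gradient of Eq. (6) with respect to the hidden-to-output weights and compares it against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, C = 3, 4, 4                       # toy input, hidden and output sizes
U, u0 = rng.normal(size=(d, alpha)), rng.normal(size=alpha)
V, v0 = rng.normal(size=(alpha, C)), rng.normal(size=C)
x = rng.normal(size=d)
targets = np.zeros((C, 2))                  # Omega_a(x_i) of Eq. (4), C_i = {1, 3}
targets[0, 0] = targets[2, 1] = 1.0
rho = -5.0

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def error(V):
    h = sigmoid(u0 + x @ U)
    o = sigmoid(v0 + h @ V)
    e_tau = 0.5 * np.sum((o[:, None] - targets) ** 2, axis=0)   # Eq. (5)
    return np.mean(e_tau ** rho) ** (1.0 / rho), h, o           # Eq. (6)

# Analytic dE/dV following the chain rule of Eqs. (9)-(10).
_, h, o = error(V)
e_tau = 0.5 * np.sum((o[:, None] - targets) ** 2, axis=0)
scale = np.mean(e_tau ** rho) ** (1.0 / rho - 1.0) / targets.shape[1]
delta = scale * ((e_tau ** (rho - 1.0)) * (o[:, None] - targets)).sum(axis=1) \
        * o * (1.0 - o)
grad_V = np.outer(h, delta)

# Central finite differences should match to high precision.
num, eps = np.zeros_like(V), 1e-6
for idx in np.ndindex(*V.shape):
    Vp, Vm = V.copy(), V.copy()
    Vp[idx] += eps; Vm[idx] -= eps
    num[idx] = (error(Vp)[0] - error(Vm)[0]) / (2.0 * eps)
print(np.max(np.abs(num - grad_V)))         # tiny, on the order of 1e-9
```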

In practice, there are two ways to update the weights. In the first method, known as the on-line method, the weight changes for each pattern $\mathbf{x}_i$ are imposed immediately; in the second method, known as the batch method, the weight changes are calculated for all $\mathbf{x}_i \in P$, and the average of these changes is imposed only at the end of each epoch. The batch method, while being somewhat slower than the on-line method, is chosen for our tests because of its greater reliability and ease of implementation over the latter, which is prone to overfitting. Therefore, the set of weights in the $(g+1)$-th epoch is updated from those of the $g$-th epoch using the following rules:

$$U^{(g+1)} = U^{(g)} - \eta \Delta U, \tag{11}$$
$$V^{(g+1)} = V^{(g)} - \eta \Delta V, \tag{12}$$
$$\mathbf{u}_0^{(g+1)} = \mathbf{u}_0^{(g)} - \eta \Delta \mathbf{u}_0, \tag{13}$$
$$\mathbf{v}_0^{(g+1)} = \mathbf{v}_0^{(g)} - \eta \Delta \mathbf{v}_0, \tag{14}$$

where the parameter $\eta$, known as the learning rate, is a positive scaling factor, typically chosen in the range $(0, 1]$; and the matrices and vectors are defined as $\Delta U(\mathbf{x}_i) = [\Delta u_{kb}(\mathbf{x}_i)]_{d \times \alpha}$, $\Delta V(\mathbf{x}_i) = [\Delta v_{br}(\mathbf{x}_i)]_{\alpha \times C}$, $\Delta \mathbf{u}_0(\mathbf{x}_i) = [\Delta u_{0b}(\mathbf{x}_i)]_{\alpha \times 1}$, $\Delta \mathbf{v}_0(\mathbf{x}_i) = [\Delta v_{0r}(\mathbf{x}_i)]_{C \times 1}$; $\Delta U = \frac{1}{|P|} \sum_{i=1}^{|P|} \Delta U(\mathbf{x}_i)$, $\Delta V = \frac{1}{|P|} \sum_{i=1}^{|P|} \Delta V(\mathbf{x}_i)$, $\Delta \mathbf{u}_0 = \frac{1}{|P|} \sum_{i=1}^{|P|} \Delta \mathbf{u}_0(\mathbf{x}_i)$, and $\Delta \mathbf{v}_0 = \frac{1}{|P|} \sum_{i=1}^{|P|} \Delta \mathbf{v}_0(\mathbf{x}_i)$.
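Schematically, one batch epoch then looks as follows (a sketch under our own naming; `per_pattern_deltas` is a hypothetical helper returning the derivatives of Theorem 1 for one pattern, not a function from the paper):

```python
import numpy as np

def batch_epoch(P, U, u0, V, v0, eta=0.5):
    """One epoch of batch-mode training per Eqs. (11)-(14): accumulate the
    per-pattern weight changes over all of P, then apply their average once."""
    n = len(P)
    dU, du0 = np.zeros_like(U), np.zeros_like(u0)
    dV, dv0 = np.zeros_like(V), np.zeros_like(v0)
    for x_i, targets_i in P:                     # P: list of (pattern, targets)
        gU, gu0, gV, gv0 = per_pattern_deltas(x_i, targets_i, U, u0, V, v0)
        dU += gU; du0 += gu0; dV += gV; dv0 += gv0
    U -= eta * dU / n; u0 -= eta * du0 / n       # Eqs. (11) and (13)
    V -= eta * dV / n; v0 -= eta * dv0 / n       # Eqs. (12) and (14)
    return U, u0, V, v0
```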

3. Experiments and Results

In this section, we describe the artificial and real datasets used for evaluating the performance of GMMLP. We also briefly define the indices used to measure the quality of the predictions made by a learner over a test set. Moreover, we detail alternative approaches for ambiguity resolution, which are then used for comparison. We then elaborate on the experimental settings used for the study. Subsequently, we document and discuss the performances of all the contending ambiguity resolving learners (including the proposed method) and BP-MLL.

3.1. Description of the Datasets

3.1.1. Artificial Datasets
To compensate for the dearth of benchmark datasets with ambiguous data points, we prepare a benchmark test suite containing 20 new synthetic datasets. These datasets are designed to present a significant variety of challenges to an ambiguity resolving classifier (please see the supplementary document for a detailed discussion). These datasets are diverse in terms of structure as well as the nature of ambiguity. Data points within the regions of overlap among multiple classes are considered to be ambiguous, potentially belonging to all of the classes overlapping at the point in question. The benchmark suite consists of 10 two-dimensional (2D) datasets and 10 corresponding higher dimensional analogues. As samples, the scatter plots of four of the 2D datasets are presented in Figure 2, with the colour scheme described in Figure 1.

Fig. 1. Colours used to represent the set of class labels for a data point in the datasets shown in Figure 2.

Fig. 2. Scatter plots of some of the artificial datasets: (a) Dataset 1, (b) Dataset 6, (c) Dataset 8, (d) Dataset 9.

3.1.2. Real Datasets
We use 10 real multi-labeled datasets in our experiments, brief descriptions of which can be found in Table 1, where n is the number of data points, d is the number of dimensions, and |C| is the total number of classes. All these datasets give us an opportunity to directly investigate the performance of GMMLP in classification problems originating from real-life events of considerable complexity. Furthermore, many of these datasets contain a large number of high-dimensional data points, which helps us establish the scalability of the proposed technique.

Table 1. Description of the real multi-labeled datasets

Dataset Name | n     | d    | |C| | Field of Study
YeastML      | 2417  | 103  | 14  | Microarray gene expression and phylogenetic profile data of Saccharomyces cerevisiae (Elisseeff and Weston, 2002; Zhang and Zhou, 2007).
Image        | 2000  | 294  | 5   | Natural scenes classification (Zhang and Zhou, 2007).
Bibtex       | 7395  | 1836 | 159 | Metadata from bibliographic items (Katakis et al., 2008).
Delicious    | 16090 | 500  | 983 | Information from a social website (Tsoumakas et al., 2008).
Enron        | 1702  | 1001 | 53  | Emails sent by 150 senior officials of Enron (Klimt and Yang, 2004).
TMC500       | 28596 | 500  | 22  | Analysis of problem reports in aviation security; feature subset of TMC2007 (Srivastava and Zane-Ulman, 2005).
Medical      | 978   | 1449 | 45  | Syndrome-to-disease map (Read et al., 2011).
Corel5k      | 5000  | 499  | 374 | Corel images (Duygulu et al., 2002).
Emotions     | 593   | 72   | 6   | Multimedia processing (Trohidis et al., 2008).
Mediamill    | 42117 | 120  | 101 | Multimedia processing (Snoek et al., 2006).

3.2. Real Datasets with Noisy Labels

We consider six datasets from the UCI repository, as described in Table 2. We begin with a singleton label set for each data point (in a particular dataset) containing the true class label. Then, for each data point, we add the other possible class labels to its label set, each with a probability of 0.1, forming the final dataset with noisy labels.

Table 2. Description of the UCI Datasets

Dataset Name | n    | d  | C
Abalone      | 4168 | 8  | 21
Glass        | 214  | 10 | 6
Lymphography | 148  | 18 | 4
Yeast        | 1484 | 8  | 10
Shuttle      | 2175 | 9  | 5
Page Blocks  | 548  | 10 | 5
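The label-noise simulation just described can be sketched as follows (our own code; `y` holds the true labels, and the probability of 0.1 matches the protocol above):

```python
import numpy as np

def add_label_noise(y, num_classes, p=0.1, seed=0):
    """Turn a single-label dataset into an ambiguous one: each point keeps
    its true label and gains every other class independently with prob. p."""
    rng = np.random.default_rng(seed)
    label_sets = []
    for true_label in y:
        s = {int(true_label)}
        s.update(c for c in range(num_classes)
                 if c != true_label and rng.random() < p)
        label_sets.append(s)
    return label_sets

# For Glass (6 classes), every point keeps its true label and gains, on
# average, 0.5 spurious labels.
```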

3.3. LOST and EMA datasets

The LOST dataset contains facial images of characters from the LOST TV series. There are a total of 1122 images (RGB, of resolution 90 × 90), each belonging to one of the 16 individuals captured. Each image is labeled with a ground-truth as well as a label set containing all the annotations suggested by multiple experts. We first transform the images into gray scale and then use local binary pattern feature extraction (Ahonen et al., 2006) to obtain a 256 dimensional numerical feature vector for each image.

The EMA dataset is a collection of utterances by 3 individuals reflecting different emotions, namely happiness, anger, sadness, and neutrality. A total of 564 recordings (sample rate 16000 Hz) are available, each annotated by 4 experts who classified each recording (based on their perceived emotion) into one of the four said emotions or into a fifth class 'others' pertaining to none of the said emotions. A multi-label annotation results if we combine all the expert labellings to form the label set. However, the original intention of the speaker is available as the ground-truth. For representing each audio recording as a data vector, we first use Mel frequency cepstral coefficient extraction (Logan, 2000), which gives a 39 × w sized matrix, where w is the number of windows (all parameters set as per Voicebox (Brookes, 1997)). We then undertake a principal component analysis to convert each of the 39 × w data matrices to a 39 dimensional data point.
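A rough sketch of this preprocessing pipeline is given below, assuming scikit-image, librosa and scikit-learn; the exact parameter choices, and the reading of the PCA step, are our assumptions rather than the paper's code:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern
import librosa
from sklearn.decomposition import PCA

def lost_features(rgb_image):
    """90x90 RGB face image -> 256-bin LBP histogram (one reading of Sec. 3.3)."""
    gray = rgb2gray(rgb_image)
    lbp = local_binary_pattern(gray, P=8, R=1)      # 8-neighbour codes in 0..255
    hist, _ = np.histogram(lbp, bins=256, range=(0, 256))
    return hist / hist.sum()

def ema_features(signal, sr=16000):
    """Utterance -> 39 MFCCs per window (39 x w) -> 39-dim point via PCA."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=39)   # shape (39, w)
    return PCA(n_components=1).fit_transform(mfcc).ravel()    # shape (39,)
```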

3.4. Performance Indices

Ambiguity resolution predicts a single label $c_j$ for a test point $\mathbf{y}_j$, while the point may have a set of target labels $C_j$. Hence, in the absence of ground-truth, we consider a data point to be rightly classified if $c_j \in C_j$. Therefore, a performance index similar to the classification accuracy, called acc, can be defined as

$$acc = \frac{1}{|Q|} \sum_{\mathbf{y}_j \in Q} \pi(c_j \in C_j), \tag{15}$$

$\pi(.)$ being an indicator function of the form

$$\pi(\text{statement}) = \begin{cases} 1 & \text{if the statement is true,} \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

Clearly $0 \leq acc \leq 1$, and the better performer will have a higher value of acc. To better evaluate the performance on ambiguous test points, we define another measure called accmulti. Let $Q_0 = \{\mathbf{y}_j \mid \mathbf{y}_j \in Q, |C_j| > 1\}$ be the set of ambiguous test points. Then, accmulti denotes the fraction of ambiguous test points for which a single correct $c_j \in C_j$ is found:

$$acc_{multi} = \frac{1}{|Q_0|} \sum_{\mathbf{y}_j \in Q_0} \pi(c_j \in C_j). \tag{17}$$

Where the ground-truth is available, we compare the single label output of the algorithm $c_j$ to the known true label $\hat{c}_j$ and calculate the classification accuracy following the regular definition as

$$accuracy = \frac{1}{|Q|} \sum_{\mathbf{y}_j \in Q} \pi(c_j = \hat{c}_j). \tag{18}$$
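These three indices are straightforward to compute; a minimal Python rendering (ours) follows:

```python
def acc(pred, label_sets):
    """Eq. (15): fraction of test points whose predicted label is in C_j."""
    return sum(c in C_j for c, C_j in zip(pred, label_sets)) / len(pred)

def acc_multi(pred, label_sets):
    """Eq. (17): the same fraction, restricted to ambiguous points (|C_j| > 1)."""
    amb = [(c, C_j) for c, C_j in zip(pred, label_sets) if len(C_j) > 1]
    return sum(c in C_j for c, C_j in amb) / len(amb)

def accuracy(pred, true_labels):
    """Eq. (18): regular accuracy against the ground-truth labels."""
    return sum(c == t for c, t in zip(pred, true_labels)) / len(pred)
```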

3.5. Contending Algorithms

In this section, we describe the characteristics of all the algorithms used in the comparative experiments, codes for all of which can be found at https://github.com/SankhaSubhra/ambiguity-resolving-ANN.git. Due to the lack of sufficient published research on the problem of ambiguity resolution, we compare GMMLP with three alternative approaches and a multi-label classifier called BP-MLL. Among the compared methods, two (Alternative Approaches 1 and 2) are proposed by us, while the third is an implementation of Bullinaria (1995), the only notable work on the topic.

3.5.1. Alternative Approach 1
The target vector for a training point $\mathbf{x}_i$ is defined as per (3), so that ambiguous points have target strings of length $C$ with multiple ones corresponding to the elements of $C_i$. The regular back-propagation algorithm is used to learn these targets. The ambiguity for a test point is resolved by choosing the label corresponding to the output node with the maximum value.

3.5.2. Alternative Approach 2
The target vector is a binary string of length $C$ having a single 1 corresponding to a randomly chosen $c_i \in C_i$. For example, in a 4-class classification problem, if $C_i = \{1, 3\}$ for a training point $\mathbf{x}_i$, then a randomly selected $c_i$ can be 3 (as $3 \in C_i$) and the target vector will be $\Omega(\mathbf{x}_i) = [0\ 0\ 1\ 0]$. Learning is based on regular back-propagation.

3.5.3. Alternative Approach 3
A target matrix is obtained according to (4). Therefore, a set of errors $E = \{e_{i,1}, e_{i,2}, \cdots, e_{i,|C_i|}\}$ is obtained. The minimum error $e_{min} = \min E$ is selected and back-propagated through the network. This is identical to the approach of Bullinaria (1995).

3.5.4. BP-MLL
BP-MLL is a multi-label classifier based on the multi-layer perceptron model of artificial neural networks, which is trained by a back-propagation algorithm. The idea is to tune the weights of the network connections such that, for an $\mathbf{x}_i \in P$, the outputs corresponding to the elements of $C_i$ are maximized while those corresponding to the elements of $\mathcal{C} \setminus C_i$ are minimized. To use BP-MLL as an ambiguity resolver, one may assign a test point to the class corresponding to the output node generating the maximum response.

3.5.5. GMMLP
A target matrix is obtained as per the formula in (4). The error $e_i$ for a data point $\mathbf{x}_i$ is calculated according to (6). The modified back-propagation of Section 2.2 is used as the learning algorithm. The target encodings used by these approaches are contrasted in the sketch below.
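The following snippet (our own illustration) builds the target encodings for an ambiguous point with $C_i = \{1, 3\}$ in a 4-class problem:

```python
import numpy as np

def target_vector(C_i, C):
    """Approach 1, Eq. (3): a single multi-hot target string of length C."""
    t = np.zeros(C)
    t[[c - 1 for c in C_i]] = 1.0
    return t

def random_single_target(C_i, C, rng):
    """Approach 2: a one-hot target for one randomly chosen candidate label."""
    t = np.zeros(C)
    t[rng.choice(sorted(C_i)) - 1] = 1.0
    return t

def target_matrix(C_i, C):
    """Approach 3 and GMMLP, Eq. (4): one one-hot column per candidate label."""
    M = np.zeros((C, len(C_i)))
    for tau, c in enumerate(sorted(C_i)):
        M[c - 1, tau] = 1.0
    return M

rng = np.random.default_rng(0)
print(target_vector({1, 3}, 4))              # [1. 0. 1. 0.]
print(random_single_target({1, 3}, 4, rng))  # e.g. [0. 0. 1. 0.]
print(target_matrix({1, 3}, 4))              # two one-hot columns
```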

3.6. Experimental Procedure

All five contending algorithms (the proposed GMMLP, the three alternative approaches, and BP-MLL) are 10-fold cross-validated on the 20 artificial and 10 real datasets. The mean acc and accmulti are tabulated in Table 3 for each case. Furthermore, GMMLP and BP-MLL are run with 10-fold cross-validation on the 6 UCI datasets with noisy labels, the LOST dataset, and the EMA dataset. For this, the mean acc and accuracy values are listed in Table 4. The best performances are boldfaced. To check if the performance of GMMLP is significantly different from those of the other algorithms, a non-parametric statistical test known as the Wilcoxon signed rank test with a significance level of 0.05 (Derrac et al., 2011) is conducted over the multiple datasets. A pairwise comparison between GMMLP and the others on each of the datasets is also performed by using the Mann-Whitney U test (also known as the rank-sum test) with a significance level of 0.05 (reported in terms of Wins (W), Ties (T) and Losses (L)).

3.7. Parameter Settings

The value of α, the number of hidden nodes, is set to 10 for all of the competing ambiguity resolvers (as it is observed during experiments that increasing the number of hidden nodes does not significantly affect the accuracy of the classifier). The value of γ is set to 1, following convention (Haykin, 2009). The tunable parameter ρ is empirically chosen to be -20. This value seems to attain a good trade-off between the closest target and the other candidate targets, as implied by the observed results. The value of η is set to 0.5, which is empirically observed to produce good results on average. A classifier is trained for a maximum of 20000 epochs, with premature termination if the average error converges below 0.05. For BP-MLL, the value of η is set to 0.05, a single hidden layer is used (with 15 nodes), and the maximum number of epochs is taken as 1000, as advised in Zhang and Zhou (2006). These settings are collected in the configuration sketch below.
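```python
# The parameter settings of Section 3.7 as a configuration sketch
# (our own structure; the key names are illustrative, the values are
# those stated above).
GMMLP_CONFIG = {
    "hidden_nodes": 10,       # alpha, for all ambiguity resolvers
    "gamma": 1.0,             # sigmoid skewness of Eq. (2)
    "rho": -20.0,             # generalized-mean exponent
    "eta": 0.5,               # learning rate
    "max_epochs": 20000,
    "error_threshold": 0.05,  # stop early once the average error falls below
}

BPMLL_CONFIG = {              # as advised in Zhang and Zhou (2006)
    "hidden_nodes": 15,
    "eta": 0.05,
    "max_epochs": 1000,
}
```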

3.8. Results and Comparison

In this section, we first compare the performance of GMMLP with the four alternative ambiguity resolvers on a total of 30 multi-labeled datasets (20 artificial and 10 real-world) to establish the improvement achieved by the proposed method. Subsequently, we compare GMMLP with BP-MLL on 6 real-world datasets with simulated noisy labels and two datasets annotated by a conflicting set of multiple experts, to establish the shortcomings of a multi-label classifier in handling the problem of ambiguity resolution.

3.8.1. Comparison of GMMLP and Alternative Ambiguity Resolving Classifiers

We summarize the results (in terms of acc and accmulti) produced by GMMLP and the four competing methods on the 20 artificial datasets in the first 20 rows of Table 3. A thorough inspection reveals that GMMLP performs best in terms of acc on 18 datasets (9 of them are 2D, 5 are ten-dimensional and the rest are five-dimensional). Among the remaining datasets, Dataset 8 is classified better by Approach 1. BP-MLL achieves the best acc on DataExt 1-5, though the difference in performance from that of GMMLP is statistically insignificant. The poor performance shown by GMMLP on Dataset 8 (a symmetric dataset with a high degree of ambiguity) can be explained by the complex structure of the dataset (see Figure 2c). This also explains the low acc achieved by all the classifiers on DataExt 8-10 (the extended higher dimensional version). In the case of DataExt 1-5, which is a noisy and rotated dataset, the performance decline shown by GMMLP may be fortuitous, as this behavior is not repeated on any other noisy and rotated dataset.

In terms of accmulti, GMMLP performs best on 19 datasets (except in the case of Dataset 8, which is better handled by BP-MLL). Among these, GMMLP achieves the maximum possible value of accmulti (i.e., 1) in 15 cases. This demonstrates the greater capability of GMMLP to resolve ambiguity for the multi-labeled instances. On 13 datasets, the performance of GMMLP is identical to at least one other contender. However, on 11 of these, GMMLP performs better in terms of acc. This indicates that GMMLP is capable of better handling ambiguity while also achieving higher classification performance on non-ambiguous data points. This observation is further supported by the superior performance of GMMLP on Dataset 3, DataExt 3-10, Dataset 5 and DataExt 5-10, all of which contain non-overlapping, well-separated classes. From the results on the artificial benchmarks, one can conclude that GMMLP attains a good performance on ambiguous as well as non-ambiguous data points for datasets having diverse characteristics (such as symmetry (Dataset 9), asymmetry (Dataset 5), anti-symmetry (Dataset 4), noise (DataExt 4-5), and high (Dataset 8), moderate (Dataset 2) or low (Dataset 5) degrees of ambiguity).

The performance of GMMLP and its contenders on the real datasets is tabulated in the final 10 rows of Table 3. The GMMLP classifier performs better than the others in terms of acc on 5 datasets (Image, Delicious, Enron, Medical, and TMC500), and in terms of accmulti in 5 cases (YeastML, Image, Bibtex, Delicious, and Enron). Approach 1 performs better than the others in terms of both acc and accmulti only on 3 occasions (Corel5k, Emotions and Mediamill).

Table 3. Comparison of GMMLP with the three alternative approaches and BP-MLL (acc / accmulti)

Dataset          | Approach 1          | Approach 2         | Approach 3*        | BP-MLL             | GMMLP
Dataset 1        | 0.9581† / 1≈        | 0.7793† / 1≈       | 0.9633† / 1≈       | 0.7220† / 0.7815†  | 0.9735 / 1
Dataset 2        | 0.9520≈ / 1≈        | 0.8901† / 1≈       | 0.9062† / 1≈       | 0.7680† / 0.8846†  | 0.9592 / 1
Dataset 3        | 0.7258≈ / 0.6563†   | 0.6493† / 0.5906†  | 0.6896† / 0.5905†  | 0.5293† / 0.6727†  | 0.7271 / 0.7211
Dataset 4        | 0.7683† / 0.9095†   | 0.5320† / 0.7680†  | 0.7437† / 0.8231†  | 0.6200† / 0.8966†  | 0.8196 / 0.9372
Dataset 5        | 0.9603≈ / 1≈        | 0.9392† / 1≈       | 0.9589≈ / 1≈       | 0.6533† / 1≈       | 0.9696 / 1
Dataset 6        | 0.7916† / 1≈        | 0.7775† / 1≈       | 0.7638† / 1≈       | 0.8307† / 1≈       | 0.8727 / 1
Dataset 7        | 0.8777† / 1≈        | 0.7819† / 1≈       | 0.7892† / 1≈       | 0.8720† / 1≈       | 0.9008 / 1
Dataset 8        | 0.6417† / 0.8928†   | 0.4410† / 0.8394≈  | 0.6067≈ / 0.8875≈  | 0.5893≈ / 0.9667†  | 0.5880 / 0.8601
Dataset 9        | 0.9905≈ / 1≈        | 0.6903† / 0.9986†  | 0.9747† / 1≈       | 0.8760† / 1≈       | 0.9912 / 1
Dataset 10       | 0.8533† / 0.6342†   | 0.7819† / 0.4687†  | 0.8305† / 0.5795†  | 0.6773† / 0.7269†  | 0.8611 / 0.7472
DataExt 1-5      | 0.9580≈ / 1≈        | 0.7730† / 0.9972†  | 0.9460≈ / 1≈       | 0.9670≈ / 1≈       | 0.9540 / 1
DataExt 2-10     | 0.8640† / 0.9000†   | 0.8373† / 0.9200†  | 0.9387≈ / 0.9600†  | 0.4467† / 0.7000†  | 0.9400 / 1
DataExt 3-10     | 0.8093† / 1≈        | 0.8067† / 0.8667†  | 0.8520≈ / 0.8000†  | 0.7227† / 0.6667†  | 0.8573 / 1
DataExt 4-5      | 0.8600† / 0.9198†   | 0.8120† / 0.9768†  | 0.8707≈ / 1≈       | 0.6667† / 0.9000†  | 0.8720 / 1
DataExt 5-10     | 0.7973† / 0.8000†   | 0.6360† / 0.7333†  | 0.7733† / 0.8000†  | 0.5293† / 0.8000†  | 0.9120 / 1
DataExt 6-5      | 0.8720≈ / 1≈        | 0.7573† / 1≈       | 0.8680≈ / 1≈       | 0.8533† / 1≈       | 0.8747 / 1
DataExt 7-5      | 0.9067≈ / 1≈        | 0.7973† / 0.9896†  | 0.9067≈ / 1≈       | 0.7493† / 0.9308†  | 0.9093 / 1
DataExt 8-10     | 0.3387† / 1≈        | 0.3266† / 1≈       | 0.3548† / 1≈       | 0.3548† / 1≈       | 0.3737 / 1
DataExt 9-10     | 0.8365≈ / 1≈        | 0.8297† / 1≈       | 0.8351≈ / 0.9333†  | 0.8405≈ / 0.9000†  | 0.8432 / 1
DataExt 10-5     | 0.8453≈ / 0.6205†   | 0.7920† / 0.5612†  | 0.8413† / 0.5974†  | 0.7120† / 0.6264†  | 0.8547 / 0.6648
YeastML          | 0.7633≈ / 0.7659≈   | 0.1815† / 0.6394†  | 0.7497≈ / 0.7502†  | 0.7653≈ / 0.7582†  | 0.7505 / 0.7738
Image            | 0.6979≈ / 0.8281≈   | 0.6009† / 0.7773†  | 0.6984≈ / 0.8244≈  | 0.6730† / 0.7790†  | 0.7007 / 0.8320
Bibtex           | 0.4720≈ / 0.3270†   | 0.2143† / 0.2466†  | 0.5403† / 0.3495†  | 0.1014† / 0.0615†  | 0.4692 / 0.3556
Corel5k          | 0.3117† / 0.3118†   | 0.2441† / 0.0700†  | 0.2976≈ / 0.2963≈  | 0.1250† / 0.1032†  | 0.2830 / 0.2850
Delicious        | 0.9017≈ / 0.8706†   | 0.1616† / 0.0562†  | 0.7797† / 0.8000†  | 0.1625† / 0.1530†  | 0.9051 / 0.8825
Emotions         | 0.5405† / 0.4584†   | 0.3116† / 0.2649†  | 0.4093≈ / 0.3827†  | 0.3759† / 0.4384†  | 0.3971 / 0.3632
Enron            | 0.7480≈ / 0.7356≈   | 0.2752† / 0.5740†  | 0.6905† / 0.6918†  | 0.4390† / 0.4853†  | 0.7490 / 0.7455
Mediamill        | 0.9075† / 0.8615†   | 0.2877† / 0.6193†  | 0.8550† / 0.7450†  | 0.2503† / 0.2703†  | 0.8882 / 0.8275
Medical          | 0.9104≈ / 0.7840≈   | 0.8186† / 0.6443≈  | 0.9007≈ / 0.7659†  | 0.6268† / 0.9163†  | 0.9110 / 0.7850
TMC500           | 0.7837≈ / 0.6894≈   | 0.7111† / 0.3284†  | 0.7796† / 0.6844†  | 0.6522† / 0.7856†  | 0.7901 / 0.6922
Signed Rank Test | H1 / H0             | H1 / H1            | H1 / H1            | H1 / H1            | - / -
Rank-sum (W-T-L) | 10-16-4 / 10-17-3   | 30-0-0 / 20-10-0   | 15-14-1 / 16-14-0  | 26-4-0 / 20-7-3    | - / -

†: Significantly different by rank-sum test. ≈: No significant difference by rank-sum test.
H1: Alternative hypothesis, significantly different by signed rank test. H0: Null hypothesis, no significant difference by signed rank test.
* Bullinaria (1995)

BP-MLL performs better than GMMLP (though the difference is not statistically significant) in terms of acc only on YeastML, and in terms of accmulti in two cases. Approach 3 performs better than GMMLP in terms of acc on Bibtex. However, most of these datasets are of a complex nature for various reasons (for example, the TMC500 and Mediamill datasets have a peculiarly high number of overlapping classes), suggesting that the slight difference in performance is not necessarily due to a better modeling capability of the competing methods. This is further supported by the cases of Corel5k and Emotions, where none of the algorithms showed a satisfactory performance. Hence, GMMLP can be considered to be superior to its competitors as well as scalable to large, high-dimensional datasets.

The Wilcoxon signed rank and rank-sum statistical tests further evaluate the performance improvement shown by GMMLP compared to the competing algorithms over the 30 datasets (20 artificial and 10 real). The results of the tests suggest that, in terms of accmulti, GMMLP is significantly better than Approach 2, Approach 3 and BP-MLL. The performances of GMMLP and Approach 1 in terms of accmulti are found to be statistically comparable by the signed rank test, while GMMLP registers more wins than losses in the rank-sum test. In terms of acc, GMMLP performs significantly better than all four of its contenders in both the tests. These findings further highlight GMMLP as a better ambiguity resolver, which can also achieve statistically better classification accuracy on non-ambiguous data instances.

3.8.2. Comparison of GMMLP and BP-MLL

Table 4. Comparison of GMMLP and BP-MLL (acc / accuracy)

Dataset          | GMMLP            | BP-MLL*
Abalone          | 0.3503 / 0.1612  | 0.2010† / 0.1466†
Glass            | 0.3799 / 0.3565  | 0.2650† / 0.1977†
Lymphography     | 0.8044 / 0.7300  | 0.6406† / 0.6236†
Yeast            | 0.5918 / 0.2834  | 0.4455† / 0.2591†
Shuttle          | 0.9605 / 0.7216  | 0.8907† / 0.6999†
Page Blocks      | 0.9500 / 0.9022  | 0.9246† / 0.8998†
LOST             | 0.4505 / 0.2252  | 0.4095† / 0.1989†
EMA              | 0.4011 / 0.2627  | 0.3982† / 0.2585≈
Signed Rank Test | - / -            | H1 / H1
Rank-sum (W-T-L) | - / -            | 8-0-0 / 7-1-0

†: Significantly different by rank-sum test. ≈: No significant difference by rank-sum test.
H1: Alternative hypothesis, significantly different by signed rank test. H0: Null hypothesis, no significant difference by signed rank test.
* Zhang and Zhou (2006).

We summarize the performance of GMMLP and BP-MLL, in terms of acc and accuracy, on a total of eight datasets in Table 4. A review of the results shows that GMMLP performs better than BP-MLL in terms of acc on all of the 8 datasets. Similar behavior is observed when the two algorithms are compared in terms of accuracy. Interestingly, GMMLP greatly improved the values of acc for the Abalone, Yeast and Shuttle datasets compared to the corresponding increments in accuracy. This indicates that GMMLP can find labels which are members of the ambiguous label set even when it fails to match the ground-truth. This occurrence is unavoidable in the presence of ambiguity characterized by multiple classes having similar properties, and is more desirable than assigning a label outside the possible label set. The results of the Wilcoxon signed rank and rank-sum tests also highlight the capability of GMMLP to better resolve ambiguity compared to a multi-label classifier such as BP-MLL, in terms of both acc and accuracy.

4. Conclusion

We formally define the notion of ambiguity in the context of supervised learning and propose a new algorithm called GMMLP for ambiguity resolution. GMMLP is a variant of the MLP network architecture popularly used for supervised learning. We design a novel generalized mean based error function which can resolve ambiguity by guiding the back-propagation algorithm to assign an ambiguous data point the label most suitable according to its dominant characteristics. The proposed error function is differentiable, and the extent of affinity to the label with minimum error can be controlled by varying the value of the exponent. We also create 20 new artificial datasets which can be used for ambiguity resolution as well as multi-label learning. We compare GMMLP with three alternative ambiguity resolvers as well as with BP-MLL on 20 artificial and 10 real datasets of diverse characteristics. Furthermore, we compare the performance of GMMLP and BP-MLL on 8 real datasets containing noisy label sets or conflicting annotations. Our experiments, validated with appropriate non-parametric statistical tests, indicate that GMMLP can serve as an attractive alternative for ambiguity resolution in real-world applications. In the future, this work can be extended to learning algorithms other than back-propagation that are available for training MLPs.

References

Ahonen, T., Hadid, A., Pietikäinen, M., 2006. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 2037–2041.
Brookes, M., 1997. Voicebox: Speech processing toolbox for MATLAB. URL: www.ee.ic.ac.uk/hp/dmb/voicebox/voicebox.html.
Bullinaria, J.A., 1995. Neural network learning from ambiguous training data. Connection Science 7, 99–122.
Chen, C.H., Patel, V.M., Chellappa, R., 2015. Matrix completion for resolving label ambiguity, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4110–4118.
Cour, T., Sapp, B., Jordan, C., Taskar, B., 2009. Learning from ambiguously labeled images, in: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).
Derrac, J., García, S., Molina, D., Herrera, F., 2011. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation 1, 3–18.
Duygulu, P., Barnard, K., de Freitas, J., Forsyth, D., 2002. Object recognition as machine translation: learning a lexicon for a fixed image vocabulary, in: 7th European Conference on Computer Vision, pp. 349–354.
Elisseeff, A., Weston, J., 2002. A kernel method for multi-labelled classification, in: Advances in Neural Information Processing Systems 14, MIT Press, pp. 681–687.
Fürnkranz, J., Hüllermeier, E., 2010. Preference learning: An introduction, in: Preference Learning. Springer, pp. 1–17.
Gibbons, J.D., Chakraborti, S., 2011. Nonparametric Statistical Inference. CRC Press.
Hardy, G.H., Littlewood, J.E., Pólya, G., 1988. Inequalities. 2nd ed., Cambridge University Press.
Haykin, S.S., 2009. Neural Networks and Learning Machines. 3rd ed., Prentice Hall.
Hernández-González, J., Inza, I., Lozano, J.A., 2015. Multidimensional learning from crowds: Usefulness and application of expertise detection. International Journal of Intelligent Systems 30, 326–354.
Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K., 2008. Label ranking by learning pairwise preferences. Artificial Intelligence 172, 1897–1916.
Katakis, I., Tsoumakas, G., Vlahavas, I., 2008. Multilabel text classification for automated tag suggestion, in: ECML/PKDD Discovery Challenge.
Klimt, B., Yang, Y., 2004. The Enron corpus: a new dataset for email classification research, in: European Conference on Machine Learning, pp. 217–226.
Lee, S., Yildirim, S., Kazemzadeh, A., Narayanan, S., 2005. An articulatory study of emotional speech production, in: InterSpeech, pp. 497–500.
Lichman, M., 2013. UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
Logan, B., 2000. Mel frequency cepstral coefficients for music modeling, in: International Symposium on Music Information Retrieval.
Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S., 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45, 3084–3104.
Møller, M.F., 1993. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6, 525–533.
Rai, P., Hu, C., Henao, R., Carin, L., 2015. Large-scale Bayesian multi-label learning via topic-based label embeddings, in: Advances in Neural Information Processing Systems, pp. 3204–3212.
Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., Moy, L., 2010. Learning from crowds. Journal of Machine Learning Research 11, 1297–1322.
Read, J., Pfahringer, B., Holmes, G., Frank, E., 2011. Classifier chains for multi-label classification. Machine Learning 85, 333–359.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.
Snoek, C.G.M., Worring, M., van Gemert, J.C., Geusebroek, J.M., Smeulders, A.W.M., 2006. The challenge problem for automated detection of 101 semantic concepts in multimedia, in: 14th Annual ACM International Conference on Multimedia, pp. 421–430.
Srivastava, A., Zane-Ulman, B., 2005. Discovering recurring anomalies in text reports regarding complex space systems, in: IEEE Aerospace Conference, pp. 55–63.
Tenenboim, L., Rokach, L., Shapira, B., 2009. Multi-label classification by analyzing labels dependencies, in: Proceedings of the 1st International Workshop on Learning from Multi-Label Data, Bled, Slovenia, pp. 117–132.
Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I., 2008. Multi-label classification of music into emotions, in: 9th International Conference on Music Information Retrieval, pp. 320–330.
Tsoumakas, G., Katakis, I., Vlahavas, I., 2008. in: ECML/PKDD Workshop on Mining Multidimensional Data, pp. 30–44.
Wang, X., Bi, J., 2016. Bi-convex optimization to learn classifiers from multiple biomedical annotations. IEEE/ACM Transactions on Computational Biology and Bioinformatics PP. doi:10.1109/TCBB.2016.2576457.
Yan, Y., Rosales, R., Fung, G., Subramanian, R., Dy, J., 2014. Learning from multiple annotators with varying expertise. Machine Learning 95, 291–327.
Zabkar, J., Mozina, M., Janez, T., Bratko, I., Demsar, J., 2010. Preference learning from qualitative partial derivatives, in: 3rd ECML/PKDD Workshop on Preference Learning (PL-10).
Zhang, H., Hu, B., Feng, X., 2014. A multi-label learning approach based on mapping from instance to label, in: Lecture Notes in Computer Science, pp. 743–752.
Zhang, M.L., 2014. Disambiguation-free partial label learning, in: 2014 SIAM International Conference on Data Mining, pp. 37–45.
Zhang, M.L., Yu, F., 2015. Solving the partial label learning problem: an instance-based approach, in: 24th International Conference on Artificial Intelligence, pp. 4048–4054.
Zhang, M.L., Zhou, Z.H., 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18, 1338–1351.
Zhang, M.L., Zhou, Z.H., 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038–2048.
Zhang, P., Obradovic, Z., 2011. Learning from inconsistent and unreliable annotators by a Gaussian mixture model and Bayesian information criterion, in: ECML PKDD, pp. 553–568.