Learning unseen visual prototypes for zero-shot classification


Accepted manuscript

Xiao Li, Min Fang, Dazheng Feng, Haikun Li, Jinqiao Wu

PII: S0950-7051(18)30326-5
DOI: 10.1016/j.knosys.2018.06.034
Reference: KNOSYS 4414
To appear in: Knowledge-Based Systems
Received date: 2 November 2017
Revised date: 15 June 2018
Accepted date: 19 June 2018

Please cite this article as: Xiao Li, Min Fang, Dazheng Feng, Haikun Li, Jinqiao Wu, Learning unseen visual prototypes for zero-shot classification, Knowledge-Based Systems (2018), doi: 10.1016/j.knosys.2018.06.034


Highlights

• A novel zero-shot learning method is developed to rectify the hubness and domain shift problems.
• The proposed method exploits class-level visual samples to learn the projection function.
• The unseen visual prototypes are modified using the label correlations and their k nearest neighbors.
• The proposed method outperforms existing methods on five zero-shot learning datasets.


Xiao Li a, Min Fang a,∗, Dazheng Feng b, Haikun Li a, Jinqiao Wu a

a School of Computer Science and Technology, Xidian University, 710071, China
b School of Electronic Engineering, Xidian University, 710071, China

Abstract


The number of object classes is increasing rapidly, which makes the recognition of new classes difficult. Zero-shot learning aims to predict the labels of new class samples by using the seen class samples and their semantic representations. In this paper, we propose a simple method that learns the unseen visual prototypes (LUVP) by learning the projection function from the semantic space to the visual feature space, which reduces the hubness problem. We exploit class-level samples rather than instance-level samples, which alleviates the expensive computational cost. Because the seen and unseen classes are disjoint, directly applying the projection function to unseen samples causes a domain shift problem. Thus, we preserve the semantic correlations between the unseen labels and then adjust the unseen visual prototypes to minimize the domain shift problem. We demonstrate through extensive experiments that the proposed method (1) alleviates the hubness problem, (2) overcomes the domain shift problem, and (3) significantly outperforms existing methods for zero-shot classification on five benchmark datasets.


Keywords: Zero-shot classification, Unseen visual prototypes, Semantic correlation, Hubness, Domain shift


1. Introduction

There is a huge number of object classes in reality; some objects are rare, such as wild animals, and others come from new categories, such as a new type of car. It is

∗ Corresponding author. Email address: [email protected] (Min Fang)



impossible to annotate all the classes, and conventional recognition models are unable to classify new categories for which no training samples exist. Therefore, it is desirable to develop new methods that can recognize new categories and reduce human labor. Zero-shot learning, which aims to recognize new classes by transferring knowledge from seen classes, has attracted great interest from many researchers and has been applied to many computer vision tasks, such as image classification [1, 2], action recognition [3, 4, 5], and event detection [6]. The new classes are unseen classes that are not present in the training stage. Seen classes have labelled samples and are disjoint from the unseen classes. The two sets of classes are related in a semantic space: either an attribute space, where each label is an attribute representation, or a word space, where each label is a high-dimensional word vector. The semantic space is thus built from attributes [2, 7, 8] or word vectors [9, 10, 11]. Both seen and unseen classes are represented in the semantic space so that different classes can be compared. The semantic vector representing a class is called its class prototype; seen class prototypes and unseen class prototypes are the semantic vectors of the seen and unseen classes, respectively. A survey on zero-shot learning is given in [12]. Existing zero-shot learning methods project images into the semantic space and then apply nearest neighbor search in that space to match the projected images with an unseen class prototype. The key point is to learn a projection function from the visual feature space to the semantic space; the function is then used to project unseen samples into the semantic space, where nearest neighbor search matches the projected samples with an unseen class prototype. However, nearest neighbor search suffers from the hubness problem, that is, a few objects in the dataset may be the nearest neighbor of many objects [13, 14]. Hubs reduce the effectiveness of nearest neighbor search because the nearest neighbor lists of many objects contain the same hubs. In order to overcome the hubness problem in zero-shot learning, Dinu et al. [14] proposed a globally corrected neighbor retrieval method. Shigeto et al. [13] pointed out that ridge regression from the visual feature space to the semantic space makes the hubness problem worse and performed reverse regression to project the class prototypes into the visual feature space. It has been shown that projecting the semantic representations into the visual feature space reduces hubness. However, because the seen and unseen classes are disjoint, a projection function learnt only on the seen classes cannot perform well on the unseen samples.


In this paper, we propose a novel method, learning unseen visual prototypes (LUVP), for zero-shot classification. Seen visual prototypes are the visual centers of the seen classes, obtained by averaging the feature vectors of every seen class. Unseen visual prototypes are the visual centers of the unseen classes, which need to be learned. We project the semantic representations into the visual feature space. Different from [13], the projection function is learnt from the seen class prototypes and seen visual prototypes rather than from all the seen images. Directly projecting the unseen class prototypes into the visual feature space with this projection function leads to a domain shift problem: the resulting unseen visual prototypes are biased and do not match the unseen samples well. Thus, we alleviate the domain difference by preserving the label semantic correlations when learning the unseen visual prototypes and by modifying the unseen visual prototypes to make them more consistent with the unseen images.
The main contributions of this paper are summarized as follows:
(1) The visual feature space is utilized as the embedding space to rectify the hubness problem. We learn the projection function by exploiting class-level visual samples rather than instance-level visual samples in order to reduce the computational cost.
(2) In order to alleviate the domain shift problem, we preserve the correlations between unseen class prototypes when learning the unseen visual prototypes with the learnt projection function. The correlations between different classes in the semantic space are thereby maintained in the visual feature space.
(3) To counter the domain shift problem further, we search for the k nearest neighbors of the unseen visual prototypes among the unseen samples. We use the means of the k nearest neighbors of the unseen visual prototypes together with the prototypes themselves as the modified unseen visual prototypes, making them more consistent with the unseen images.
(4) To verify our method, we evaluate it on standard zero-shot learning datasets. Our results show that the proposed method achieves state-of-the-art zero-shot learning accuracies of 86.83%, 53.06%, 54.82% and 94.00% on the AwA, aPY, CUB and SUN datasets, respectively. The experimental analyses show that the proposed method can alleviate the hubness problem, overcome the domain shift problem and significantly outperform existing methods for zero-shot classification on five benchmark datasets.
This paper is organized as follows. Section 2 provides a brief review of related work. In Section 3, we develop a novel zero-shot classification method


by learning unseen visual prototypes (LUVP). Section 4 provides extensive experiments. In Section 5, the conclusions and future work are presented.


2. Related work


Zero-shot classification has been an active research topic in the recent past and has achieved very good performance [2, 7, 12]. A semantic space is used in which image features and labels can be embedded and compared. Existing zero-shot learning methods can be divided into the following four categories.
Semantic space methods: Most zero-shot learning methods are semantic space methods. These methods [2, 8, 9, 15, 16, 1, 17, 18, 19] first learn a projection function from seen samples and their corresponding attribute representations. The unseen samples are then projected into the semantic space by the projection function. Classification can be done in the semantic space by nearest neighbor search [8, 9], maximum a posteriori probability (MAP) [2], label propagation [20] or a compatibility function [1, 16]. The DAP method in [2] predicts the attributes of the unseen images with a model trained on seen images and then predicts their labels in the semantic space. Frome et al. [9] proposed a deep visual-semantic embedding model that maps images into a semantic embedding space and then looks for the nearest labels in that space. In [21], Palatucci et al. proposed a semantic output code classifier. ESZSL [22] unified the features, attributes and classes in one objective function to learn the projection function. Li et al. [23] proposed a semi-supervised zero-shot learning method to learn the projection function. Kodirov et al. [24] proposed a semantic autoencoder that uses a linear projection function to project images into the semantic space while minimizing the reconstruction error. However, directly exploiting the projection function leads to a domain shift problem. UDAZS [25] exploited sparse coding [26] to learn the projection function and handle the domain shift problem, and then performed nearest neighbor search in the sparse coding space to find the closest unseen class prototype for a given unseen image. However, the nearest neighbor search in [9, 25, 21] is affected by the hubness problem.
Visual feature space methods: Some researchers [13, 14, 27] have studied the hubness problem in zero-shot learning and proposed methods to tackle it. Dinu et al. [14] proposed a globally corrected neighbor retrieval method to reduce hubness in zero-shot learning. Shigeto et al. [13] pointed out that ridge regression from the visual feature space


to the semantic space makes the hubness problem worse and performed reverse regression to project the class prototypes into the visual feature space. Using the visual feature space as the embedding space leads to less hubness. However, because the seen and unseen classes are disjoint, the projection function learnt on the seen classes cannot perform well on the unseen samples. Our proposed method also utilizes the visual feature space as the embedding space, but improves the unseen visual prototypes by minimizing the domain shift problem, which yields far superior performance on zero-shot learning tasks.
Latent embedding space methods: Some other methods focus on learning a latent embedding space [28, 29, 30, 31]. They learn a discriminative and predictive latent embedding space into which both the visual features and the semantic representations can be projected. The TME method proposed in [32] projects the images and multiple semantic representations into a latent embedding space to tackle the projection domain shift problem. Gan et al. [33] proposed a method to learn the kernel projection function in the latent space as a multi-source domain generalization problem. Zhang et al. [34] proposed a joint latent space in which the latent similarity of seen and unseen data can be measured. These methods have high computational complexity, whereas ours is simple and efficient because it is based on class-level visual samples with a small computational cost.
Semantic similarity methods: Some zero-shot learning methods are based on the similarity of seen and unseen classes [11, 35, 36, 37, 38]. They recognize the unseen class samples by using classifiers trained on seen samples and the relationships between classes. IAP in [2] learns a classifier on seen images, uses the classifier to predict seen labels for the unseen images, and then assigns unseen classes based on their correlations with the seen classes. Norouzi et al. [38] assume that the unseen classes can be represented by convex combinations of the seen classes. Li et al. [39] learn an unseen class classifier using seen class classifiers and their semantic correlations. Mensink et al. [35] employ co-occurrence statistics of image classes and adopt the co-occurrences to design a new classifier. Zhang et al. [40] proposed a semantic similarity embedding method in which the similarity of seen and unseen data can be measured. The work in [41] introduced phantom classes and proposed to train phantom classifiers which can be used as linear combination bases for the unseen classes. These methods depend on the extrapolated similarity of seen and unseen classes, so the quality of the similarity matrix has a direct impact on performance. Nevertheless, our method does not need to calculate the similarity matrix.


3. Learning the unseen visual prototypes

Table 1: Notations and descriptions.

Notation    Description
Xs, Xt      Seen and unseen images
Zs, Zt      Seen and unseen image labels
Ps, Pt      Seen and unseen class prototypes
ns, nt      Number of seen and unseen samples


Let Ds = {Xs, Zs} be the seen data with seen class labels {1, ..., c}, and let Dt = {Xt, Zt} be the unseen data. The labels of the unseen classes are {c + 1, ..., c + u} and are unknown; Xt is also unknown in the training stage. The seen and unseen class prototypes are Ps ∈ R^{c×a} and Pt ∈ R^{u×a}, respectively, where P^i ∈ R^{1×a} is the semantic representation of the i-th class and a is the dimension of the semantic representation. Xs ∈ R^{ns×d} and Xt ∈ R^{nt×d} are the seen and unseen images, where d is the dimension of the image features and ns and nt are the numbers of seen and unseen samples. Zs ∈ R^{ns×1} and Zt ∈ R^{nt×1} are the seen and unseen labels; there is no overlap between Zs and Zt. Table 1 summarizes the key notations used in this paper. The transpose of a matrix is denoted by the superscript '. I_k is the k × k identity matrix. For a matrix B ∈ R^{n×k}, the i-th row is denoted B_i, the j-th column B^j, and B_{ij} is the element in the i-th row and j-th column. ||B||_F denotes the Frobenius norm of B, $\|B\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{k} B_{ij}^2}$.
Existing zero-shot learning methods aim to learn the projection function from the visual feature space to the semantic space, project the unseen samples into the semantic space, and apply nearest neighbor search in the semantic space to match the projected images with an unseen class prototype. This leads to the hubness problem, that is, some unseen class prototypes are the nearest neighbors of many samples. Projecting the class prototypes from the semantic space to the visual feature space reduces the hubness problem, as shown by Shigeto et al. [13]. The mapped images in the semantic space are more compact than their class prototypes, which means that the mapped images lie near the origin; class prototypes that are near the origin may become hubs, being the nearest neighbors of many mapped images. If we make the opposite projection, projecting the class prototypes into the visual feature space, the projected class prototypes lie near the origin and are only slightly affected by the hubness problem.


Figure 1: The overview of the proposed zero-shot classification method.


Therefore, we focus on projecting the unseen class prototypes into the visual feature space; the projected prototypes are called the unseen visual prototypes. The main steps are as follows. First, we describe how we learn a projection function from the semantic space to the visual feature space. Second, given the unseen class prototypes, we describe how to apply this function to learn the unseen visual prototypes. Third, we describe how to adjust the learnt unseen visual prototypes so that they are more suitable for the unseen samples. Last, we utilize nearest neighbor search to predict the labels of the unseen images. The proposed method is illustrated in Figure 1.


3.1. Learning the projection function M

We project the label semantic vectors into the visual feature space to overcome the hubness problem. Different from the SVNN method in [13], we exploit class-level samples rather than instance-level samples, which alleviates the expensive computational cost. Because the deep visual features form natural clusters for every class, we calculate the mean of the images of each class, which can be used to represent that class. Rs ∈ R^{c×d} are the class-level seen samples obtained by averaging the feature representations of every class, named the seen visual prototypes. For the i-th seen class, the image visual representations are written as X_{s_i}. The seen visual prototype of the i-th seen class is

R_s^i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{s_i}^j    (1)

where n_i is the number of samples of the i-th seen class, i ∈ {1, ..., c}. R_s^i ∈ R^{1×d} is the seen visual prototype of the i-th seen class, and all the seen visual prototypes are Rs = {R_s^1; ...; R_s^c}. We learn the projection function M from the seen class prototypes Ps and the seen visual prototypes Rs:

\min_{M} \|R_s - P_s M\|_F^2 + \alpha \|M\|_F^2    (2)

where M is the projection matrix. The first term is the reconstruction error term, while the second term is the regularization term to avoid overfitting; the regularization parameter α balances the two terms. The loss function (2) is convex with respect to M. Taking the derivative of Equation (2) with respect to M and setting the gradient to zero leads to the following closed-form solution:

M = (\alpha I_a + P_s' P_s)^{-1} P_s' R_s    (3)


Thus, the projection matrix M can be obtained by Equation (3), and we can utilize M to embed the unseen prototypes into the visual feature space. Calculating Equation (3) costs O(a^3 + a^2 c + cad), whereas it costs O(a^3 + a^2 n + nad) if all the seen samples are used. It is clear that our method has lower complexity, especially when the number of training samples is large. However, because the seen and unseen classes are disjoint, directly applying the projection matrix M learnt on the seen classes to the unseen classes will lead to a domain shift problem.
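To make the class-level computation concrete, the following is a minimal NumPy sketch of Equations (1)-(3). The function names, and the assumption that the seen labels Zs are a one-dimensional integer array with values 0, ..., c-1, are ours and only illustrative, not part of the original method description.

```python
import numpy as np

def seen_visual_prototypes(X_s, Z_s, c):
    """Eq. (1): class-level means of the seen image features."""
    R_s = np.zeros((c, X_s.shape[1]))
    for i in range(c):
        R_s[i] = X_s[Z_s == i].mean(axis=0)   # mean feature of class i
    return R_s

def learn_projection(P_s, R_s, alpha=1.0):
    """Eq. (3): closed-form ridge solution M = (alpha*I + Ps'Ps)^-1 Ps'Rs."""
    a = P_s.shape[1]
    return np.linalg.solve(alpha * np.eye(a) + P_s.T @ P_s, P_s.T @ R_s)
```

Because only the c × a matrix Ps and the c × d matrix Rs enter the solve, the cost of this step is independent of the number of training images, which is the complexity advantage discussed above.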

3.2. Learning the unseen visual prototypes Rt

We learn the unseen visual prototypes by mapping the unseen class prototypes into the visual feature space. Rt ∈ R^{u×d} are the required unseen visual prototypes. In other words, the unseen visual prototypes Rt are learnt using


the projection function M and the unseen class prototypes Pt. However, the projection function M, though well learnt for the seen classes, may not generalize well to the unseen categories, because the seen and unseen classes are disjoint. M is learnt from seen class samples, so directly utilizing M will lead to a domain shift problem. Fortunately, preserving the similarity between different unseen classes can reduce the adverse effect caused by the domain difference. Therefore, we put forward a new idea to preserve the label semantic correlations, which are measured by the similarity between two labels in the semantic space. This naturally leads us to consider the geometric structure of the unseen class prototypes based on the manifold assumption [42]. The manifold assumption, which has been widely used [43, 44, 45], states that two samples that are close in the original space should remain close in the projected low-dimensional space. This gives the objective function for learning the unseen visual prototypes Rt:

\min_{R_t} \|R_t - P_t M\|_F^2 + \gamma L_{sc}    (4)


where γ is the regularization parameter. The first term projects the class prototypes into the visual feature space using the projection function. The second term L_{sc} is the label semantic correlation term.
Label semantic correlation: this term measures the similarity between two labels in the semantic space. We try to preserve the label semantic correlations by preserving the geometric properties of the unseen class prototypes. Based on the manifold assumption, if the unseen class prototypes P_t^i and P_t^j are close, the unseen visual prototypes R_t^i and R_t^j should also be close. Therefore, we incorporate a graph Laplacian term to preserve the geometric properties of the unseen class prototypes in the visual feature space. For the u target classes, a fully connected graph G with u vertices can be constructed, where the vertices represent all the target classes and A is the weight matrix of the graph:

A_{ij} = e^{-\frac{\|P_t^i - P_t^j\|^2}{\delta^2}}    (5)

where δ is the bandwidth parameter, set as the mean of the distances between all the target images. The graph Laplacian matrix is

L_t = D - A    (6)


in which D is a diagonal matrix, D = diag(D_1, ..., D_u), where D_i is the sum of the i-th row of A. Thus, the label semantic correlation term is

L_{sc} = \sum_{i,j=1}^{u} A_{ij} \|R_t^i - R_t^j\|_F^2 = \mathrm{tr}(R_t' L_t R_t)    (7)

That is, if two classes are similar, their similarity A_{ij} is large; we then restrict the distance between their visual prototypes to be small, making the visual prototypes similar. By substituting Equation (7) into Equation (4), we obtain:


\min_{R_t} \|R_t - P_t M\|_F^2 + \gamma \, \mathrm{tr}(R_t' L_t R_t)    (8)


The first term projects the class prototypes into the visual feature space. The second term ensures that the relationships in the semantic space are kept in the visual feature space. Therefore, we can predict the unseen visual prototype in the visual feature space for every unseen class prototype. By setting the derivative of Equation (8) with respect to Rt to zero, we obtain the optimal solution:

R_t = (I_u + \gamma L_t)^{-1} P_t M    (9)


Thus, we can obtain the unseen visual prototypes Rt by Equation (9). We learn Rt not only through the projection matrix M, but also by preserving the label semantic correlations between the unseen classes. Hence, the learnt Rt can deal with the domain shift problem to some extent.
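Continuing the sketch above, the correlation-preserving projection of Equations (5)-(9) can be written in a few lines. The fixed scalar δ passed as an argument is a simplifying assumption of ours (the paper sets it to the mean pairwise distance), and the helper reuses the NumPy import from the previous sketch.

```python
def learn_unseen_prototypes(P_t, M, gamma=0.1, delta=1.0):
    """Eqs. (5)-(9): project unseen class prototypes while preserving
    the label semantic correlations via a graph Laplacian."""
    u = P_t.shape[0]
    # pairwise squared distances between unseen class prototypes (Eq. 5)
    sq = np.sum((P_t[:, None, :] - P_t[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq / delta ** 2)              # similarity weights
    L_t = np.diag(A.sum(axis=1)) - A          # graph Laplacian (Eq. 6)
    # closed form R_t = (I_u + gamma * L_t)^-1 P_t M (Eq. 9)
    return np.linalg.solve(np.eye(u) + gamma * L_t, P_t @ M)
```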


3.3. Modifying the unseen visual prototypes Rtm

To address the domain shift problem further, we modify the unseen visual prototypes Rt to be more similar to the unseen samples. We search for the k nearest unseen-sample neighbors of every unseen visual prototype to find the center of every unseen class; the result is denoted as the modified unseen visual prototypes Rtm. NN(Rt) denotes the k nearest neighbors of Rt:

R_{tm} = \frac{1}{k+1}\Big(\sum_{X_t \in NN(R_t)} X_t + R_t\Big)    (10)


Then, we obtain the modified unseen visual prototypes Rtm by Equation (10): the k nearest neighbors of Rt are used to adjust the unseen visual prototypes. We then compare Rtm with all the unseen samples in order to classify them. Because Rtm are modified by the unseen visual samples, they are more semantically consistent with the unseen samples. The modified unseen visual prototypes Rtm are better than Rt because they are closer to the centers of the samples of the corresponding classes.
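The modification step of Equation (10) amounts to averaging each prototype with its k nearest unseen samples. A brute-force sketch is shown below; the default k = 20 is our choice, taken from the range reported to work well in the parameter analysis (Section 4.4.4), and an approximate nearest-neighbor routine could replace the exhaustive search for large datasets.

```python
def modify_prototypes(R_t, X_t, k=20):
    """Eq. (10): pull each unseen visual prototype towards the mean of
    its k nearest unseen samples."""
    R_tm = np.empty_like(R_t)
    for i, r in enumerate(R_t):
        d = np.linalg.norm(X_t - r, axis=1)   # distances to all unseen samples
        nn = X_t[np.argsort(d)[:k]]           # k nearest unseen samples
        R_tm[i] = (nn.sum(axis=0) + r) / (k + 1)
    return R_tm
```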


3.4. Predicting the labels of unseen samples

To classify a testing instance x, we calculate the distance of the testing instance to the modified unseen visual prototypes Rtm. The label is given by the nearest modified unseen visual prototype:

\phi(x) = \arg\min_{j} \, Dist(x, R_{tm}^j)    (11)


where Dist is a distance function, R_{tm}^j is the modified unseen visual prototype of the j-th unseen class, and φ(x) is the predicted label of the instance, i.e., the class with the minimum distance. We can now predict the labels of the unseen samples by Equation (11). The algorithm of the proposed method is summarized in Algorithm 1. Moreover, LUVP can handle generalized zero-shot learning (GZSL) [46], in which a testing instance may come from a seen or an unseen class. For GZSL, we only need to modify Equation (11) to measure the distance to both the seen and unseen visual prototypes:


\phi(x) = \arg\min_{j} \, Dist(x, R_a^j)    (12)


where R_a = [Rs; Rtm] ∈ R^{(c+u)×d} and R_a^j is the visual prototype of the j-th class. Whether the test instance is from a seen or an unseen class, we can predict its label by Equation (12).

4. Experiments

In this section, we perform experiments to evaluate our method on five benchmark datasets (AwA [2], aPY [7], CUB [47], SUN [48] and ImageNet2 [49]). Table 2 summarizes the main information of the five datasets used in our experiments.


Algorithm 1. Learning the unseen visual prototypes for zero-shot classification.
Input: Seen images Xs and labels Zs; unseen images Xt; seen class prototypes Ps; unseen class prototypes Pt; parameters α, γ.
1: Compute the projection function M by Equation (3).
2: Learn the unseen visual prototypes Rt by Equation (9).
3: Learn the modified unseen visual prototypes Rtm by Equation (10).
4: Predict the labels of the unseen samples by Equation (11).
Output: Unseen image labels Zt.
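For completeness, a minimal end-to-end sketch of Algorithm 1 is given below. It chains the hypothetical helper functions from the earlier sketches and uses the Euclidean distance as Dist in Equation (11); neither the helpers nor that distance choice are prescribed by the paper.

```python
def luvp_predict(X_s, Z_s, P_s, P_t, X_t, alpha=1.0, gamma=0.1, delta=1.0, k=20):
    """Steps 1-4 of Algorithm 1: return predicted unseen-class indices for X_t."""
    R_s = seen_visual_prototypes(X_s, Z_s, P_s.shape[0])   # Eq. (1)
    M = learn_projection(P_s, R_s, alpha)                  # Eq. (3)
    R_t = learn_unseen_prototypes(P_t, M, gamma, delta)    # Eq. (9)
    R_tm = modify_prototypes(R_t, X_t, k)                  # Eq. (10)
    # Eq. (11): nearest modified prototype under Euclidean distance
    d = np.linalg.norm(X_t[:, None, :] - R_tm[None, :, :], axis=-1)
    return d.argmin(axis=1)
```

For generalized zero-shot learning (Equation (12)), the same argmin is simply taken over the stacked prototype matrix [Rs; Rtm].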


First, we describe the datasets in Section 4.1. Second, the compared algorithms are presented in Section 4.2. Third, the results of all the algorithms are compared and analyzed in Section 4.3. Finally, we analyze the proposed method from various aspects in Section 4.4.


4.1. Data sets

In this subsection, we give some details of the datasets used for the zero-shot classification problems.
Animals with Attributes dataset (AwA) [2]: The AwA dataset contains 30475 animal images from 50 classes such as antelope, grizzly+bear, killer+whale, beaver and dalmatian. 40 classes with 24295 images are used as seen classes and 10 classes with 6180 images are unseen classes. The dataset provides 85 binary attributes for every class, characterizing properties of the animals such as color, size and behavior. Every class is thus represented by an 85-dimensional vector, called a class prototype.
aPascal-aYahoo dataset (aPY) [7]: aPY contains images from two datasets, aPascal and aYahoo. The aPascal dataset is the source domain with 12695 images from 20 classes. The aYahoo dataset is the target domain with 2644 images from 12 classes. The source and target domain classes are disjoint but share some similarity. Every image has a 64-dimensional binary attribute representation describing properties such as shape and texture. We obtain the class prototypes by averaging the attribute representations of every class.
Caltech-UCSD Birds dataset (CUB) [47]: In total, the CUB dataset has 11788 images of 200 bird classes, of which 8855 images from 150 classes are seen images and 2931 images from 50 classes are unseen images. CUB is a challenging fine-grained recognition dataset. Every image has 312


Table 2: The descriptions of the benchmark image datasets.

Dataset     # seen images (classes)    # unseen images (classes)    # attributes
AwA         24295 (40)                 6180 (10)                    85
aPY         12695 (20)                 2644 (12)                    64
CUB         8855 (150)                 2931 (50)                    312
SUN         14140 (707)                200 (10)                     102
ImageNet2   200000 (1000)              54000 (360)                  1000


binary attributes such as color and part pattern. We also average the attribute representations of images from the same class to generate the class prototypes.
SUN scene attributes dataset (SUN) [48]: The SUN dataset has 14340 images, of which 14140 images from 707 classes form the source domain and 200 images from 10 classes form the target domain, following the setting in [15]. Every image has a 102-dimensional attribute representation. Averaging the attribute representations of the same class yields the class prototypes of the seen and unseen classes.
ImageNet2 dataset (ImageNet2) [49]: For the ImageNet2 dataset, the seen classes are the 1000 classes from ILSVRC2012 and the unseen classes are 360 classes from ILSVRC2010 that do not overlap with the 1000 ILSVRC2012 classes. 1000-dimensional word vector representations are used as the class semantic representations.
For all datasets, we utilize deep learning features, namely the 1024-dimensional GoogLeNet [50] features provided by [51].



4.2. Experimental Setup

We compare our LUVP method with five state-of-the-art methods on the zero-shot classification tasks.
1. Direct Attribute Prediction (DAP) [2]. The method learns attribute classifiers from the seen images and their attributes. It then predicts the attributes of the unseen images and compares them with the unseen class prototypes to predict the labels.
2. Unsupervised Domain Adaptation for Zero-shot Learning (UDAZS) [25]. UDAZS learns the projection function from the visual feature space to the semantic space using sparse coding theory [26]. It uses the source dictionary and the semantic representations of the target images to constrain the target dictionary and learn the sparse codes of the target images.


3. Transductive Multi-view Embedding for Zero-shot Recognition (TME) [20]. TME projects the attributes, word vectors and image features into a common latent space by Canonical Correlation Analysis [52].
4. Visual to Semantic space and Nearest Neighbor search (VSNN). VSNN applies linear ridge regression [53] to map the visual feature representations into the semantic space, as in existing zero-shot learning methods [14, 21]. The method projects the visual features into the semantic space and performs nearest neighbor search in that space to match the projected images with the unseen class prototypes.
5. Semantic to Visual feature space and Nearest Neighbor search (SVNN) [13]. SVNN first uses linear ridge regression [53] to map the semantic representations into the visual feature space to overcome the hubness problem. The method then finds the closest unseen visual prototype for a given unseen image as measured by the Euclidean distance.
Parameter settings: There are several parameters in the proposed method: the regularization parameters α and γ, the kernel bandwidth δ and the number of nearest neighbors k. The parameters are set by 5-fold cross validation on the seen data. For each dataset, we randomly select 20% of the images of the seen classes to form the validation set and use the remaining images for training. The parameters α, γ and k are set to the values that give the best performance on the validation set, while the kernel bandwidth δ is empirically set as the mean of the distances between all the target images. For all the compared methods, we also use 5-fold cross validation to search for the best parameters based on the performance on the validation set. The settings of the compared algorithms are as follows. For the DAP method, we use the code provided in [2]. For UDAZS, λ, λ1, λ2 and λ3 are set to 0.1 and a Label Propagation (LP) classifier is applied. For TME, we use attributes and word vectors as the two semantic representations.
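As an illustration of the tuning protocol, the sketch below performs a simple grid search on a held-out 20% split of the seen data. The grids, the single split and the evaluate callback are our own simplifying assumptions rather than the authors' exact 5-fold procedure; the α and γ grids follow the ranges explored in the parameter analysis (Section 4.4.4).

```python
import numpy as np
from itertools import product

def select_hyperparameters(Z_s, evaluate,
                           alphas=(1e-3, 1e-2, 1e-1, 1, 10),
                           gammas=(1e-3, 1e-2, 1e-1, 1, 10),
                           ks=(20, 25, 30), seed=0):
    """Pick (alpha, gamma, k) that maximizes accuracy on a validation split.

    `evaluate(train_idx, val_idx, alpha, gamma, k)` is assumed to train the
    model on the training indices and return accuracy on the validation ones.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z_s))
    val, train = idx[: len(idx) // 5], idx[len(idx) // 5:]   # 20% validation
    best, best_acc = None, -1.0
    for alpha, gamma, k in product(alphas, gammas, ks):
        acc = evaluate(train, val, alpha, gamma, k)
        if acc > best_acc:
            best, best_acc = (alpha, gamma, k), acc
    return best
```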


4.3. Experimental results

The average multiclass classification accuracies and standard deviations over 10 runs of the proposed LUVP method and the five compared algorithms on four benchmark datasets are summarized in Table 3. It can be seen that LUVP achieves much better performance than all the compared approaches. The accuracies of LUVP on the four datasets are 86.83%, 53.06%, 54.82% and 94.00%, respectively, which are better than those of the other compared methods.


Table 3: Comparisons of classification accuracies (%) and standard deviations (%) on all datasets.

Method   AwA             aPY             CUB             SUN
DAP      58.50 ± 1.32    18.12 ± 0.70    36.45 ± 1.41    72.45 ± 1.78
UDAZS    73.55 ± 0.64    25.75 ± 0.94    39.75 ± 0.88    75.75 ± 0.57
TME      79.89 ± 1.05    27.50 ± 1.55    45.40 ± 0.95    78.50 ± 0.65
VSNN     57.20 ± 0.70    18.07 ± 0.95    33.37 ± 0.87    75.00 ± 0.67
SVNN     81.31 ± 0.99    43.41 ± 0.67    53.05 ± 0.88    91.50 ± 0.74
LUVP     86.83 ± 0.55    53.06 ± 0.39    54.82 ± 0.47    94.00 ± 0.63



Figure 2: Confusion matrices of the test results on unseen classes for the proposed method on the AwA dataset. Diagonal numbers indicate the correct prediction accuracy. Rows correspond to the ground truth and columns to the predictions.


4.3.1. Results on AwA dataset

The zero-shot classification results of the proposed approach on the AwA dataset are shown as confusion matrices in Figure 2, where the rows correspond to the ground truth and the columns to the predictions. In Table 3, we can see that a very impressive result of 86.83% is obtained, which is 5.52% higher than the best result of SVNN. The reason may be that LUVP additionally narrows the domain difference by preserving the label semantic correlations and modifying the unseen visual prototypes to fit the target images. Moreover, LUVP uses the class-level visual samples instead of all the samples to learn the projection function, which can reduce the error. Both LUVP and SVNN utilize the visual feature space as the embedding space. SVNN is the simplest approach of directly mapping class


Figure 3: Confusion matrices of the test results on unseen classes for the proposed method on the aPY dataset. Diagonal numbers indicate the correct prediction accuracy. Rows correspond to the ground truth and columns to the predictions.


prototypes to the visual feature space. Even SVNN works better than the other baselines, suggesting that using the visual feature space is effective for alleviating the hubness problem. The results of both DAP and VSNN are poor, which shows that projecting the samples into the semantic space fails; moreover, they ignore the domain difference across seen and unseen classes. The performances of UDAZS and TME are higher than those of DAP and VSNN. UDAZS considers the domain shift problem but suffers from the hubness problem. TME also considers the domain shift problem by projecting visual features, attributes and word vectors into a common space. However, both of them significantly underperform the proposed LUVP method, which indicates the advantage of LUVP.


4.3.2. Results on aPY dataset

The confusion matrices on the aPY dataset are shown in Figure 3, from which we can see that LUVP has higher precision on this dataset. From Table 3, we can conclude that our result on the aPY dataset is further improved to 53.06%, which is 9.65% higher than the best result of SVNN.
There are four main observations. (1) Table 3 shows that our method clearly beats DAP, UDAZS, TME and VSNN.


The confusion matrices in Figure 3 show the excellent performance of the LUVP method. This is attributed to mitigating the hubness problem by projecting into the visual feature space rather than the semantic space. The bad performance of VSNN shows that using linear regression to learn a projection function from the visual feature space to the semantic space is negatively influenced by the hubness problem. (2) LUVP and SVNN are better than the other methods, which do not consider the hubness problem. These results clearly demonstrate the contribution of projecting into the visual feature space to the overall performance of zero-shot learning tasks. (3) SVNN yields worse performance than our method, which proves that it is useful to mitigate the domain shift when learning the unseen visual prototypes. (4) There is a big domain difference between aPascal and aYahoo. Therefore, UDAZS and TME, which consider the domain shift, are better than DAP and VSNN, which do not. We can conclude that considering the domain shift is significant for zero-shot learning tasks. Our method considers not only the domain shift but also the hubness problem.


4.3.3. Results on CUB dataset

The confusion matrices of the proposed approach on the CUB dataset are shown in Figure 4. From Table 3, we can see that the result on the fine-grained CUB dataset is further improved by the LUVP method. This verifies that considering the label semantic correlations and the domain shift problem is critical for the zero-shot classification tasks. Moreover, the experimental results in Table 3 show that the SVNN and LUVP methods outperform the other methods, which validates that using the visual feature space makes the nearest neighbor search less influenced by the hubness problem. It is important to note that the results of LUVP and SVNN are 54.82% and 53.05%; the gap is only 1.77%. The major reason is that there is little domain difference between the seen and unseen samples, so SVNN, which does not consider the domain difference, already performs well on this task. The proposed LUVP method preserves the label semantic correlations when learning the unseen visual prototypes, which is one reason why it outperforms the other methods. Another major reason is that the modified unseen visual prototypes are better for the classification tasks. These observations explain the improvements of our method over the compared methods.


Figure 4: Confusion matrices of the test results on unseen classes for the proposed method on the CUB dataset. Diagonal numbers indicate the correct prediction accuracy. Rows correspond to the ground truth and columns to the predictions.


Figure 5: Confusion matrices of the test results on unseen classes for the proposed method on the SUN dataset. Diagonal numbers indicate the correct prediction accuracy. Rows correspond to the ground truth and columns to the predictions.


4.3.4. Results on SUN dataset

We present our zero-shot classification results on the SUN dataset as confusion matrices in Figure 5. It can also be seen from Table 3 that LUVP achieves the best results on the SUN dataset. LUVP obtains significant improvements on the zero-shot classification tasks, especially compared with the methods that do not consider the hubness problem. The results of LUVP and SVNN are 94.00% and 91.50%; the gap is only 2.50%. It is worth pointing out that this result is only slightly better than SVNN, mainly because there is little domain difference between the seen and unseen samples, so SVNN, which does not consider the domain difference, already performs well. LUVP and SVNN are much better than the other methods DAP, UDAZS, TME and VSNN, because these methods do not take the hubness problem into account and cannot restrain the hubs in the subsequent nearest neighbor search. Among the methods that do not consider hubness, TME performs best overall, because TME is a transductive method and uses multiple views such as deep features, low-level features, attributes and word vectors for embedding. By contrast, our method uses only


Table 4: Comparisons of T-5 (top 5) classification accuracies (%) on ImageNet2 dataset.

Method   ImageNet2
ConSE    15.50
AMP      13.10
SAE      26.30
LUVP     27.14



deep features and attributes. Nevertheless, we achieve the best performance on the challenging AwA, aPY, CUB and SUN datasets.


4.3.5. Results on the large scale ImageNet2 dataset

To test the performance of the proposed method on a large scale dataset with high-dimensional semantic representations, we conduct experiments on the challenging ImageNet2 dataset, where the seen classes are from ILSVRC2012 and the unseen classes are from ILSVRC2010. We compare with the ConSE [38], AMP [17] and SAE [24] methods, because many existing zero-shot learning methods cannot handle such a large scale dataset. We report the T-5 accuracy, which means that the correct label is among the top 5 predicted labels. The proposed LUVP method learns the projection matrix using the seen class prototypes and seen visual prototypes and is therefore tractable on this large scale dataset. The experimental results in Table 4 indicate that the proposed LUVP method outperforms the compared methods, especially ConSE and AMP, and achieves performance comparable to SAE. The proposed method learns an unseen visual prototype for every unseen class while preserving the label semantic correlations, and the modified unseen visual prototypes are better for the classification tasks on the ImageNet2 dataset. This indicates that LUVP can be extended to a realistic large scale setting.
In conclusion, LUVP has four important merits compared with other algorithms. (1) LUVP reverses the projection direction by projecting the unseen class prototypes into the visual feature space to overcome the hubness problem. (2) When learning the projection function, we utilize class-level samples rather than instance-level samples, which is simple and helps reduce error. (3) The label semantic correlations are preserved when we learn the unseen visual prototypes, which mitigates the domain shift problem. (4) In order to further reduce the domain difference, the unseen


Table 5: Comparisons of classification accuracies (%) and standard deviations (%) of LUVP and LUSP.

Method   AwA             aPY             CUB             SUN
LUSP     73.73 ± 0.28    26.51 ± 0.91    45.14 ± 0.63    89.00 ± 0.81
LUVP     86.83 ± 0.42    53.06 ± 0.65    54.82 ± 0.71    94.00 ± 0.43


visual prototypes are modified using the unseen samples, by searching for the k nearest neighbors of the unseen visual prototypes among the unseen samples. The results demonstrate the effectiveness of LUVP for the zero-shot classification tasks.


4.4. Algorithm Analysis

In this subsection, we analyze the proposed method from different aspects, which explains its superior performance.


4.4.1. Importance of projecting to the visual feature space

We argue that the key to alleviating the hubness problem is the use of the visual feature space rather than the semantic space as the embedding space. We therefore perform the projection from the visual feature space to the semantic space while keeping all other settings the same, and call this variant learning unseen semantic prototypes (LUSP). Specifically, we learn the projection matrix from the seen visual prototypes Rs and seen class prototypes Ps and project the unseen samples into the semantic space. Then we search for the k nearest neighbors of the unseen class prototypes among the learnt semantic representations and average them to act as the class centers. Table 5 shows that by mapping the visual features to the semantic embedding space, the performance on AwA, aPY, CUB and SUN drops by 13.10%, 26.55%, 9.68% and 5.00%, respectively, highlighting the importance of selecting the right embedding space.
In order to demonstrate the advantage of our method intuitively, we project the unseen images into the semantic space. Taking aPY as an example, Figure 6 shows the t-SNE [54] visualization of the 12 unseen class images in the semantic space and in the visual feature space. We can see that the different classes are less separated in the semantic space, whereas the separability of the 12 classes in the visual feature space is surprisingly good. Comparing the two plots in Figure 6, the images are clearly more separable in the visual feature space. We conclude that projecting to the visual feature space is better than projecting to the semantic space, because


(a) Semantic space

(b) Visual feature space

Figure 6: Visualization of the distribution of the 12 unseen class images in the semantic space and the visual feature space on aPY using t-SNE [54]. Different classes are shown in different colors.


Table 6: Comparisons of classification accuracies (%) and standard deviations (%) with and without the label semantic correlation term.

Method       AwA             aPY             CUB             SUN
LUVP         86.83 ± 0.42    53.06 ± 0.65    54.82 ± 0.71    94.00 ± 0.43
LUVP w/o L   81.61 ± 1.22    43.98 ± 0.75    54.21 ± 0.57    93.00 ± 0.33


projecting to the semantic space causes the hubness problem while projecting to the visual feature space reduces its adverse effect. On the AwA, CUB and SUN datasets, we draw the same conclusion: the separability of the images is better in the visual feature space.


4.4.2. Importance of preserving the label semantic correlations

In our approach, we preserve the label semantic correlations when learning the unseen visual prototypes in order to reduce the domain shift. To study the impact of this term, we conduct experiments in which the label semantic correlation term is dropped by setting γ = 0. We test both cases, LUVP and LUVP w/o L, and report the zero-shot classification results in Table 6. As can be seen, preserving the label semantic correlations has a critical effect on the performance: the accuracy improves when the correlations are leveraged. This confirms our motivation that using both the projection function and the label semantic correlations is better than using the projection function alone. Because there is a domain shift between the seen and unseen classes, directly utilizing the projection function M leads to the domain shift problem, and preserving the correlations between different unseen classes reduces the adverse effect caused by the domain difference.


4.4.3. Impact of modifying the unseen visual prototypes

Figure 7 visualizes the 10 unseen class images and the corresponding visual prototypes Rt and Rtm. Comparing the two plots in Figure 7, we can see that the modified visual prototypes Rtm are closer to the centers of the samples of the corresponding classes than the learnt Rt, i.e., Rtm lie near the center of every class. Thus, Rtm are better visual prototypes than Rt. It can be concluded that the modified visual prototypes Rtm are more suitable for the zero-shot classification tasks.


(a) Rt

(b) Rtm

Figure 7: Visualization of the distribution of the 10 unseen class images in the visual feature space with the visual prototypes Rt and Rtm on SUN using t-SNE. Different classes as well as their corresponding visual prototypes are shown in different colors.


Figure 8: Comparisons of classification accuracies with the unseen visual prototype Rt and with the modified unseen visual prototype Rtm .


To further study the impact of modifying the unseen visual prototypes, we compare the two cases, LUVP with Rt and LUVP with Rtm. The zero-shot classification results are reported in Figure 8, from which we can see that LUVP with Rtm outperforms LUVP with Rt. This is because the unseen visual prototypes Rt are a little further from the class centers than Rtm, which reduces the nearest neighbor accuracy, whereas Rtm are closer to the class centers. However, there is no clear winner between using Rt and Rtm on the SUN and CUB datasets, because the domain difference there is more subtle and the seen and unseen image distributions may already be similar. Therefore, modifying the unseen visual prototypes has little or even no effect on these two datasets.


4.4.4. Parameter analysis

We study the impact of the regularization parameters α and γ, the kernel bandwidth δ and the number of nearest neighbors k on the zero-shot classification performance. We first analyze the impact of α and γ by measuring the zero-shot classification accuracies on the AwA and aPY datasets, with α ∈ {10^-3, 10^-2, 10^-1, 1, 10} and γ ∈ {0, 10^-3, 10^-2, 10^-1, 1, 10}. The parameter α is used to avoid overfitting. If α is too large, the penalty is too strong, leading to poor results; if it is


(a) AwA

(b) aPY


Figure 9: Classification accuracies of AwA and aPY datasets with different α and γ.

(a) the kernel bandwidth δ

(b) the number of nearest neighbors k

Figure 10: Classification accuracies of AwA and aPY datasets with different values of the kernel bandwidth δ and the number of nearest neighbors k.


Table 7: Comparisons of classification accuracies (%) of GZSL on AwA and CUB datasets.

Method   AwA acc_u   AwA acc_s   CUB acc_u   CUB acc_s
DAP      2.40        77.90       4.00        55.1
ConSE    9.50        75.90       1.80        69.9
LUVP     47.00       80.83       32.04       58.94


too small, the penalty is too weak, leading to overfitting. Therefore, we should select a proper regularization parameter α to obtain good performance. The results in Figure 9 show a slight preference for α = 1. The parameter γ controls the trade-off between preserving the label semantic correlations in the mapping to the visual feature space and exactly fitting the projection function. From Figure 9, we observe that our model is insensitive to γ for any non-zero value below 1. However, when γ is 0, the performance is poor because the label semantic correlations are not exploited. Preserving the label semantic correlations is important for making the unseen visual prototypes more suitable for the unseen samples. We also study the influence of different choices of δ. From Figure 10, we observe that the performance of LUVP is stable over a large range of δ on the AwA and aPY datasets. In addition, the value of k used for modifying the unseen visual prototypes has a larger effect on our method; Figure 10 illustrates that k ∈ [20, 30] works best for the two datasets.


4.4.5. Results on Generalized Zero-shot classification

Generalized zero-shot classification considers testing instances from both seen and unseen classes. In this experiment, we follow the experimental setting in [46], randomly selecting 20% of the seen images and adding them to the unseen images to form the test set. The evaluation criteria are acc_s and acc_u: the accuracy of classifying testing instances from seen images and from unseen images, respectively, into the whole label list of seen and unseen classes. We compare our method with the DAP [2] and ConSE [38] methods in Table 7. Table 7 shows that our method performs significantly better than the other methods in most cases. However, all the methods perform worse than on the conventional zero-shot classification task. The reason is that the generalized


zero-shot classification task is more complex when the test set contains both seen and unseen images: the unseen images are easily misclassified into the seen classes. Both the DAP and ConSE methods perform poorly on the unseen classes of the AwA and CUB datasets. On the AwA dataset, the results of LUVP are better than the compared methods on both acc_s and acc_u. On the CUB dataset, the results of LUVP are not better than the compared methods on acc_s but are better on acc_u, which shows that LUVP can better classify the unseen images into the unseen classes. Our method can thus deal with the more complex setting of the generalized zero-shot classification task.


5. Conclusion


In this paper, we tackle the zero-shot classification problem by learning unseen visual prototypes. We project the class prototypes into the visual feature space to overcome the hubness problem, rather than projecting images into the semantic space. One advantage is that we learn the projection function from class-level samples rather than instance-level samples, which substantially reduces the computational complexity. Another important advantage of our method is its robustness to the domain shift between the seen and unseen images, obtained by preserving the correlations between the unseen class prototypes in the semantic space. In addition, we adjust the learnt unseen visual prototypes to suit the unseen images by searching for the k nearest neighbors of every unseen visual prototype among the unseen samples, further overcoming the domain shift problem. Our approach has been compared with other state-of-the-art zero-shot learning approaches, yielding a substantial improvement. The experimental results of LUVP on the AwA, aPY, CUB and SUN datasets reveal that the modified unseen visual prototypes are surprisingly effective for improving the zero-shot classification performance. In the future, we plan to exploit multiple sources of knowledge, such as word vectors and sentence descriptions, to compensate for the sparsity of the prototypes. We also plan to learn a nonlinear projection function and to apply deep learning models to the zero-shot learning problem.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 61472305), the Aeronautical Science Foundation of China (Grant


No. 20151981009), the Science Research Program, Xi'an, China (Grant No. 2017073CG/RC036(XDKD003)), the China Postdoctoral Science Foundation (Grant No. 2018M631125) and the Fundamental Research Funds for the Central Universities (Grant No. XJS18037).

References


[1] Z. Akata, F. Perronnin, Z. Harchaoui, C. Schmid, Label-embedding for image classification, IEEE transactions on pattern analysis and machine intelligence 38 (7) (2016) 1425–1438.

[2] C. H. Lampert, H. Nickisch, S. Harmeling, Attribute-based classification for zero-shot visual object categorization, Pattern Analysis and Machine Intelligence, IEEE Transactions on 36 (3) (2014) 453–465.

M

[3] M. L. Chuang Gan, Y. Yang, Y. Zhuang, A. G. Hauptmann, Exploring semantic interclass relationships (sir) for zero-shot action recognition, AAAI, 2015.

ED

[4] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, Y. Wang, Zero-shot action recognition with error-correcting output codes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2833–2842.

PT

[5] X. Xu, T. Hospedales, S. Gong, Transductive zero-shot action recognition by word-vector embedding, International Journal of Computer Vision (2017) 1–25.

AC

CE

[6] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, P. Natarajan, Zero-shot event detection using multi-modal fusion of weakly supervised concepts, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2665–2672. [7] A. Farhadi, I. Endres, D. Hoiem, D. Forsyth, Describing objects by their attributes, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 1778–1785.

30

ACCEPTED MANUSCRIPT

[8] J. Liu, B. Kuipers, S. Savarese, Recognizing human actions by attributes, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 3337–3344.

CR IP T

[9] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al., Devise: A deep visual-semantic embedding model, in: Advances in neural information processing systems, 2013, pp. 2121–2129.

[10] R. Socher, M. Ganjoo, C. D. Manning, A. Ng, Zero-shot learning through cross-modal transfer, in: Advances in neural information processing systems, 2013, pp. 935–943.

AN US

[11] M. Elhoseiny, B. Saleh, A. Elgammal, Write a classifier: Zero-shot learning using purely textual descriptions, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2584–2591. [12] Y. Xian, C. H. Lampert, B. Schiele, Z. Akata, Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly, arXiv preprint arXiv:1707.00600.

ED

M

[13] Y. Shigeto, I. Suzuki, K. Hara, M. Shimbo, Y. Matsumoto, Ridge regression, hubness, and zero-shot learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2015, pp. 135–151.

PT

[14] G. Dinu, A. Lazaridou, M. Baroni, Improving zero-shot learning by mitigating the hubness problem, in: Workshop at ICLR, 2015.

CE

[15] D. Jayaraman, K. Grauman, Zero-shot recognition with unreliable attributes, in: Advances in Neural Information Processing Systems, 2014, pp. 3464–3472.

AC

[16] Z. Akata, S. Reed, D. Walter, H. Lee, B. Schiele, Evaluation of output embeddings for fine-grained image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2927–2936. [17] Z. Fu, T. Xiang, E. Kodirov, S. Gong, Zero-shot object recognition by semantic manifold distance, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2635–2644. 31

ACCEPTED MANUSCRIPT

[18] Z. Fu, T. Xiang, E. Kodirov, S. Gong, Zero-shot learning on semantic class prototype graph, IEEE Transactions on Pattern Analysis and Machine Intelligence.

CR IP T

[19] Z. Ding, M. Shao, Y. Fu, Low-rank embedded ensemble semantic dictionary for zero-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2050–2058. [20] Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, S. Gong, Transductive multiview embedding for zero-shot recognition and annotation, in: European Conference on Computer Vision, Springer, 2014, pp. 584–599.

AN US

[21] M. Palatucci, D. Pomerleau, G. E. Hinton, T. M. Mitchell, Zero-shot learning with semantic output codes, in: Advances in neural information processing systems, 2009, pp. 1410–1418.

[22] B. Romera-Paredes, P. Torr, An embarrassingly simple approach to zeroshot learning, in: Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 2152–2161.

ED

M

[23] X. Li, Y. Guo, D. Schuurmans, Semi-supervised zero-shot classification with label representation learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4211–4219. [24] E. Kodirov, T. Xiang, S. Gong, Semantic autoencoder for zero-shot learning, arXiv preprint arXiv:1704.08345.

CE

PT

[25] E. Kodirov, T. Xiang, Z. Fu, S. Gong, Unsupervised domain adaptation for zero-shot learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2452–2460.

AC

[26] B. A. Olshausen, D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by v1?, Vision research 37 (23) (1997) 3311– 3325. [27] A. G. MarcoBaroni, Hubness and pollution: Delving into cross-space mapping for zero-shot learning. [28] J. Qin, Y. Wang, L. Liu, J. Chen, L. Shao, Beyond semantic attributes: Discrete latent attributes learning for zero-shot recognition, IEEE Signal Processing Letters 23 (11) (2016) 1667–1671. 32

ACCEPTED MANUSCRIPT

[29] Y. Wang, Y. Gong, Q. Liu, Robust attribute-based visual recognition using discriminative latent representation, in: International Conference on Multimedia Modeling, Springer, 2015, pp. 191–202.

CR IP T

[30] Y. Guo, G. Ding, X. Jin, J. Wang, Learning predictable and discriminative attributes for visual recognition., in: AAAI, 2015, pp. 3783–3789.

[31] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, Learning multimodal latent attributes, IEEE transactions on pattern analysis and machine intelligence 36 (2) (2014) 303–316.

AN US

[32] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, Transductive multi-view zero-shot learning, IEEE transactions on pattern analysis and machine intelligence 37 (11) (2015) 2332–2345. [33] C. Gan, T. Yang, B. Gong, Learning attributes equals multi-source domain generalization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 87–97.

M

[34] Z. Zhang, V. Saligrama, Zero-shot learning via joint latent similarity embedding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 6034–6042.

ED

[35] T. Mensink, E. Gavves, C. G. Snoek, Costa: Co-occurrence statistics for zero-shot classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2441–2448.

CE

PT

[36] M. Rohrbach, M. Stark, B. Schiele, Evaluating knowledge transfer and zero-shot learning in a large-scale setting, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1641–1648.

AC

[37] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, B. Schiele, What helps where–and why? semantic relatedness for knowledge transfer, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 910–917. [38] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, J. Dean, Zero-shot learning by convex combination of semantic embeddings, in: International Conference on Learning Representations, 2014. 33

ACCEPTED MANUSCRIPT

[39] X. Li, M. Fang, J. Wu, Zero-shot classification by transferring knowledge and preserving data structure, Neurocomputing 238 (2017) 76–83.

CR IP T

[40] Z. Zhang, V. Saligrama, Zero-shot learning via semantic similarity embedding, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4166–4174. [41] S. Changpinyo, W.-L. Chao, B. Gong, F. Sha, Synthesized classifiers for zero-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5327–5336.

AN US

[42] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of machine learning research 7 (Nov) (2006) 2399–2434. [43] J. J.-Y. Wang, H. Bensmail, N. Yao, X. Gao, Discriminative sparse coding on multi-manifolds, Knowledge-Based Systems 54 (2013) 199– 206.

M

[44] R. Shang, W. Wang, R. Stolkin, L. Jiao, Subspace learning-based graph regularized feature selection, Knowledge-Based Systems 112 (2016) 152– 165.

ED

[45] A. B. Hamza, Graph regularized sparse coding for 3d shape clustering, Knowledge-Based Systems 92 (2016) 92–103.

CE

PT

[46] W.-L. Chao, S. Changpinyo, B. Gong, F. Sha, An empirical study and analysis of generalized zero-shot learning for object recognition in the wild, in: European Conference on Computer Vision, Springer, 2016, pp. 52–68. [47] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The caltechucsd birds-200-2011 dataset.

AC

[48] G. Patterson, J. Hays, Sun attribute database: Discovering, annotating, and recognizing scene attributes, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2751– 2758. [49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large 34

ACCEPTED MANUSCRIPT

scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.

CR IP T

[50] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions (2014) 1–9. [51] S. Changpinyo, W.-L. Chao, B. Gong, F. Sha, Synthesized classifiers for zero-shot learning, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

AN US

[52] D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural computation 16 (12) (2004) 2639–2664. [53] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, Liblinear: A library for large linear classification, The Journal of Machine Learning Research 9 (2008) 1871–1874.

AC

CE

PT

ED

M

[54] L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9 (Nov) (2008) 2579–2605.

35