Zero-Shot Learning by Mutual Information Estimation and Maximization

Chenwei Tang∗, Xue Yang∗, Jiancheng Lv∗∗, Zhenan He∗∗

College of Computer Science, Sichuan University, Chengdu 610065, P. R. China

Abstract

The key to zero-shot learning is to use the visual-semantic embedding to transfer knowledge from seen classes to unseen classes. In this paper, we propose to build the visual-semantic embedding by maximizing the mutual information between visual features and their corresponding attributes. The mutual information between visual and semantic features can then be utilized to guide the knowledge transfer from the seen domain to the unseen domain. Since we are primarily interested in maximizing mutual information, we introduce noise-contrastive estimation to calculate a lower bound on the mutual information. Through noise-contrastive estimation, we reformulate zero-shot learning as a binary classification problem, i.e., classifying the matching visual-semantic pairs (positive samples) and the mismatching visual-semantic pairs (negative/noise samples). Experiments conducted on five datasets demonstrate that the proposed mutual information estimators outperform current state-of-the-art methods in both the conventional and the generalized zero-shot learning settings.

Keywords: Zero-shot learning, mutual information, noise-contrastive estimation, visual-semantic embedding

∗ Both authors contributed equally to this research.
∗∗ Co-corresponding authors. Email addresses: [email protected] (Jiancheng Lv), [email protected] (Zhenan He)


1. Introduction

Zero-Shot Learning (ZSL) [1, 2, 3, 4, 5, 6, 7] aims to recognize instances of new categories that have never been seen before, i.e., the categories in the training set and the test set are disjoint, by learning an embedding space between the image and semantic features. The key to the success of ZSL is to build a visual-semantic embedding, so that the cross-modality representation between the visual features and the semantic features can be learned [8]. After that, by finding the intermediate semantic representation, the knowledge learned from seen classes can be transferred to unseen ones.

Many effective methods have been proposed to build the visual-semantic embedding. These methods usually project either visual features or semantic features from one space to the other, or project both features into a shared embedding space. Early works of ZSL learn a bilinear compatibility function between the visual and semantic spaces using a ranking loss [9, 10]. Other approaches are based on non-linear multi-modal embeddings [11, 8]. Another direction is max-margin learning [12, 13], which employs a ranking function to measure the matching scores between the visual and semantic feature vectors. Recently, methods that learn the cross-modal mapping with discriminative losses [14, 15, 16] have become more and more popular.

After the embedding step, most previous ZSL work assumes that the test categories belong only to the unseen classes in the performance evaluation. In practice, however, image classification applications are required to recognize images belonging to both seen and unseen classes. Thus, we advocate evaluating ZSL approaches in the setting of Generalized Zero-Shot Learning (GZSL) [17, 18], where both the seen source and the unseen target classes are classified during testing. Because most existing ZSL methods map visual features to several fixed anchor points in the visual-semantic embedding during training, the images of unseen categories tend to be recognized as seen source classes in the joint labeling space. This is the fundamental domain adaptation problem [18, 2] in GZSL.

To address the domain shift problem, many methods follow the way of transductive learning [2, 5, 19], where both the labeled source and the unlabeled target images are available for training. By combining the unlabeled images of unseen categories with the seen source data, transductive ZSL methods can learn a more general visual-semantic embedding. In this paper, we propose a novel ZSL method based on mutual information estimation and maximization [20, 21] to alleviate the domain shift problem. More specifically, unlike the transductive ZSL methods mentioned above, the proposed method can significantly reduce the domain adaptation problem even though it follows inductive learning [22], i.e., only data of the labeled source classes are available during the training phase.

However, in continuous high-dimensional space, the calculation of mutual information is notoriously difficult. Moreover, improving ZSL requires that the visual-semantic embedding be less specialized towards solving a single dataset [17]. Thus, we leverage a Mutual Information Estimator (MIE) based on Noise-Contrastive Estimation (NCE) [23] to train a binary classifier that distinguishes the matching visual-semantic pairs from the negative pairs. In this way, the high-dimensional visual features are compressed into the much more compact attribute embedding space, in which cross-modality learning is easier to model. The experiments on five datasets show that the proposed method, called NCE-based MIE, can be a good stepping stone towards robust and generalizable ZSL.

The main contributions of this paper are summarized as follows:

• Most existing methods project either visual features or semantic features from one space to the other, or project both features into a shared embedding space. In this paper, we propose a novel method that utilizes the mutual information between visual and semantic features to guide the learning of the visual-semantic embedding. Without mapping the visual features to fixed anchor points, the well-known domain shift problem in GZSL is alleviated.

• By introducing NCE, we reformulate zero-shot learning as a binary classification problem, i.e., distinguishing the matching image-attribute pairs from the noise samples.

• We are the first to develop a ZSL framework incorporating mutual information maximization and NCE. Experiments show that our framework outperforms state-of-the-art ZSL methods in both the conventional ZSL and the GZSL settings.

2. Related Work

2.1. Zero-Shot Learning

ZSL, a highlighted research topic closely related to zero-shot hashing [24], has been widely applied in real-world applications. Typically, it is achieved by utilizing the visual-semantic embedding to transfer knowledge from seen classes to unseen classes. Here, we briefly introduce several mainstream methods and compare our method with them in the experimental section.

Direct Attribute Prediction (DAP) [25], as one of the pioneering studies, learns probabilistic attribute classifiers and then recognizes an unseen instance by estimating the posterior of each attribute. On the other hand, Structured Joint Embedding (SJE) [12], ESZSL [26] and the Semantic AutoEncoder (SAE) [10] all directly map the image features to the semantic space. Specifically, SJE learns a bilinear compatibility by optimizing a structural Support Vector Machine (SVM) loss. ESZSL learns the bilinear compatibility between visual features, attributes and class labels with a square loss. SAE, following the auto-encoder, reconstructs the visual features in the semantic space. Further, LatEm [11] extends the bilinear compatibility model of SJE. Recent advances, such as Semantic Similarity Embedding (SSE) [27] and Synthesized Classifiers (SYNC) [28], embed both the visual and semantic features into another common space.

Recently, a new branch of methods targets ZSL with generative models [29, 30]. There are two main themes among them: the Variational Auto-Encoder (VAE) [31] and the Generative Adversarial Network (GAN) [16, 32, 14]. These methods synthesize pseudo instances for unseen classes with the help of seen class prototypes [33]. Then, with the generated unseen samples, ZSL can be transformed into a standard classification problem for object recognition.

ZSL in the generalized setting provides a more practical point of view than the conventional setting. Because the projection learned from the seen classes suffers from the domain shift problem when it is directly applied to the disjoint unseen classes, which are different from and potentially unrelated to the source data, the results in GZSL on the joint labeling space are significantly lower than the conventional ZSL results.

To alleviate the domain shift problem, many methods based on transductive learning [2, 34] have been proposed. [5] alleviates the domain shift problem by iteratively adding the unlabeled unseen instances according to their reliability during the training phase. Quasi-Fully Supervised Learning (QFSL) [19] also follows the way of transductive learning, where the labeled source images and the unlabeled target images are forced to be mapped to different specified points. Inductive ZSL methods [12, 28, 35] such as DAP, SSE and GFZSL mentioned above generally perform well on the seen source classes during the testing phase, while their performance on the unseen test classes is very poor.

2.2. Mutual Information Maximization

Mutual information measures the dependence between two random variables. For two discrete variables X and Y whose joint probability distribution is p(x, y), the mutual information between them, denoted I(X; Y), is given by

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)},    (1)

where p(x, y) is the joint probability distribution, while p(x) and p(y) are the marginal probabilities. Furthermore, mutual information, as a Shannon entropy-based quantity, can be written as follows:

I(X; Y) = H(X) − H(X|Y),    (2)

where H is the Shannon entropy and H(X|Y) is the conditional entropy, i.e., the average uncertainty about X after observing a second random variable Y.
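To make Equations (1) and (2) concrete, the following small numerical check computes both forms of the mutual information for a toy 2×2 joint distribution; the probability values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy joint distribution p(x, y) over two binary variables (hypothetical values).
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)

# Equation (1): I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ).
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# Equation (2): I(X;Y) = H(X) - H(X|Y), with p(x|y) = p(x,y) / p(y).
h_x = -np.sum(p_x * np.log(p_x))
h_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))
print(mi, h_x - h_x_given_y)   # the two values agree
```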


Methods based on mutual information have been widely used in unsupervised learning. Approaches built on the info-max optimization principle [36, 37, 20] estimate and maximize the mutual information between the input data and the output of a deep neural network for unsupervised representation learning. The Mutual Information Neural Estimator (MINE) [21] shows that the mutual information between high-dimensional continuous random variables can be estimated by gradient descent over neural networks, and its experiments demonstrate that MINE can also be applied to supervised classification. Some recent works maximize the mutual information between images for unsupervised clustering and segmentation [38, 39]. In general, most existing methods based on mutual information are devoted to measuring the dependence between two random variables of the same modality, and they are mainly applied to unsupervised representation learning. Mutual information estimation has not yet been used in applications, such as ZSL, that involve measuring the dependence between cross-modal data.

3. Proposed Approach

Here we first introduce some notation and the problem definition. Let S = {(x_i^s, a_i^s, y_i^s), i = 1, ..., N_s} and U = {(x_j^u, a_j^u, y_j^u), j = 1, ..., N_u} denote the training data from the seen source classes and the unseen data, where x_i^s and x_j^u denote the visual features, a_i^s and a_j^u denote the attributes of seen and unseen classes in the semantic spaces A^s and A^u, y_i^s and y_j^u are the corresponding labels of x_i^s and x_j^u, and N_s and N_u denote the numbers of images of seen and unseen classes, respectively. The seen source data S and the unseen target data U are disjoint in categories, i.e., S ∩ U = ∅. Given the visual feature x_j^u of an unseen class and all the attributes in the A^u space, the goal of ZSL is to predict the class label y_j^u. As for GZSL, both x_i^s and x_j^u are given, together with the attribute space A = A^s ∪ A^u, and the class label in the joint labeling space Y = Y^s ∪ Y^u should be recognized.
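As a purely illustrative sketch of this setup (all array sizes and contents below are hypothetical placeholders, not the actual benchmark data), the seen/unseen splits and the joint attribute space used in GZSL can be organized as follows.

```python
import numpy as np

# Hypothetical sizes: 2048-d visual features, 85-d attributes, 40 seen / 10 unseen classes.
d_x, d_a, n_seen, n_unseen = 2048, 85, 40, 10

A_seen = np.random.rand(n_seen, d_a)         # class attributes A^s (one row per seen class)
A_unseen = np.random.rand(n_unseen, d_a)     # class attributes A^u
X_seen = np.random.rand(1000, d_x)           # visual features x_i^s of seen-class images
y_seen = np.random.randint(0, n_seen, 1000)  # labels y_i^s, indices into A_seen

# The seen and unseen label sets are disjoint; conventional ZSL predicts over A^u only,
# while GZSL predicts over the joint attribute space A = A^s U A^u.
A_joint = np.vstack([A_seen, A_unseen])
```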

The key technology in ZSL is the visual-semantic embedding, which can transfer the knowledge from the seen classes to the unseen data. We encode the high-dimensional visual features into the much more compact attribute space, and then build the visual-semantic embedding by maximally preserving the mutual information between the visual features x and the corresponding attributes a:

I(x; a) = \sum_{x, a} p(x, a) \log \frac{p(x|a)}{p(x)}.    (3)

However, computing mutual information is very difficult, especially in continuous high-dimensional space. Thus, we model a density ratio as follows:

F(x_i, a_i) \propto \frac{p(x_i | a_i)}{p(x_i)},    (4)

where A \propto B denotes that A is proportional to B. By using the density ratio F(x_i, a_i), we can avoid modeling the high-dimensional distributions. Thus, a compatibility function F: X × A → R with adjustable parameters W can be used:

F(x_i, a_i; W) = \exp(x_i^T W a_i),    (5)

where W is the parameter matrix, and the value of F denotes the matching score between the input visual feature x_i and the attribute a_i. The larger the value of F, the larger the probability that x_i and a_i belong to the same class y_i. Therefore, the classification of an unseen instance x_j can be achieved by maximizing F:

f(x_j; W) = \arg\max_{a_j \in A^u} F(x_j, a_j; W),    (6)

where f(x_j; W) is the classifier of x_j, and the predicted class label is the label corresponding to the attribute with the largest compatibility score. In terms of the end goal, mutual information maximization means finding a set of parameters W such that the mutual information I(x_i; a_i) is maximized.
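The following minimal sketch illustrates Equations (5) and (6): a bilinear score exp(x^T W a) against a set of candidate attributes and an arg-max over those scores. The tensor sizes and random initialization are hypothetical; for GZSL the candidate set would simply be the joint attribute space instead of A^u.

```python
import torch

def compatibility(x, A, W):
    """Eq. (5): matching scores F(x, a_i; W) = exp(x^T W a_i) of one visual
    feature x against every candidate attribute vector a_i (the rows of A)."""
    return torch.exp(A @ (x @ W))       # shape: (num_candidate_classes,)

def classify(x, A_candidates, W):
    """Eq. (6): predict the class whose attribute gives the largest score."""
    return torch.argmax(compatibility(x, A_candidates, W)).item()

# Hypothetical sizes: 2048-d ResNet feature, 85-d attributes, 10 unseen classes.
d_x, d_a, n_unseen = 2048, 85, 10
W = 0.01 * torch.randn(d_x, d_a)        # parameter matrix W
A_unseen = torch.rand(n_unseen, d_a)    # unseen-class attributes A^u
x = torch.rand(d_x)                     # visual feature of a test image
print(classify(x, A_unseen, W))         # index of the predicted unseen class
```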

primarily concerned with maximizing the mutual information between the im-

150

age feature and the corresponding attribute, rather not the precise value, the 7

Journal Pre-proof

pro of

ak

visual feature xk

positive

1

ResNet image

0

visual embedding zk

Visual feature extractor

negative

scores

Binary Classifier based on NCE

𝐴s

re-

Visual-semantic embedding

Figure 1: The architecture of the proposed NCE-based MIE. xk , zk , and ak denote visual feature, visual embedding, and the corresponding attribute, respectively. attributes of seen classes.

As denotes all

lP

non-KL divergences may offer favorable trade-offs. Thus, we propose to use a lower-bound to mutual information based on the NCE following the Deep InfoMax (DIM) [20], which simultaneously estimates and maximizes the mutual information by training a classifier to distinguish between samples from the 155

positive and negative.

urn a

Let M denotes the number of all classes, and the Ms and Mu denote the number of seen classes and unseen classes, respectively. Given a visual feature xk from the seen classes and the corresponding attribute ak as the positive pair, the rest attributes in As except for ak are treated as noise samples, i.e., there are Ms − 1 negative samples for per visual feature. Finally, the NCE-based MIE

Jo

can be formulate a bound on mutual information as follow: F (xk , ak ; W ) (N CE) Ibw (xk ; ak ) := log P . ai ∈As F (xk , ai ; W )

(7)

The loss in Equation 7 is the categorial cross-entropy of classifying the pos-

itive sample from the negative noise samples, and the

PF

As

F

denotes the pre-

diction of the model. Then, the NCE-based MIE can be derived as follows:

8

Journal Pre-proof

pro of

exp(xTk W ak ) (N CE) , Ibw (xk ; ak ) = log P T ai ∈As exp(xk W ai ) X = xTk W ak − log exp(xTk W ai ).

(8)

ai ∈As

Figure 1 shows the overall architecture of the proposed NCE-based MIE. First, the input of visual feature xk extracted by a pre-trained Deep Convolutional Neural Network (DCNN), such as the ResNet [40], is encoded to a visual embedding zk = xTk W in the attribute space. Next, the compatibility scores s {F (zk , ai )}M i=1 between the visual embedding zk and all Ms attributes in As

re-

space can be got. Finally, by utilizing the binary classifier based on NCE, we can distinguish the positive samples from negative/noise samples. In general, we can maximize the mutual information between visual features xk and attributes ak by training the NCE-based MIE as follows:

lP

(N CE) I(xk ; ak ) ≥ Ibw (xk ; ak ).

(9)
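A minimal training-step sketch of the global NCE-based objective as we read Equation (8): the bound is the log-softmax score of the matching attribute, so maximizing it amounts to minimizing a standard cross-entropy over the M_s compatibility scores. Batch size, dimensions and initialization below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def nce_mie_loss(X, labels, A_seen, W):
    """Negative of the NCE bound in Eq. (8) for a minibatch: the bound is the
    log-softmax of the score x_k^T W a_k over all M_s seen-class attributes
    (a_k is the positive; the other M_s - 1 attributes act as noise samples),
    which is exactly a cross-entropy classification loss."""
    Z = X @ W                    # visual embeddings z_k = x_k^T W, shape (B, d_a)
    scores = Z @ A_seen.t()      # compatibility with every seen attribute, shape (B, M_s)
    return F.cross_entropy(scores, labels)

# Hypothetical batch: 64 visual features, 85-d attributes, 40 seen classes.
B, d_x, d_a, M_s = 64, 2048, 85, 40
W = torch.zeros(d_x, d_a, requires_grad=True)
A_seen = torch.rand(M_s, d_a)
X = torch.rand(B, d_x)
labels = torch.randint(0, M_s, (B,))    # index of each image's own class/attribute

loss = nce_mie_loss(X, labels, A_seen, W)
loss.backward()                         # an optimizer step on W would follow
```

The sketch omits the optimizer; Section 4.1 reports that the paper trains with Adam (base learning rate 0.05, minibatch size 64).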

In addition to NCE being used to formulate the bound on the mutual information, an MIE based on the Jensen-Shannon Divergence (JSD) may also offer favourable trade-offs. The JSD mutual information estimator follows the formulation of [41]:

\hat{I}_w^{(JSD)}(x_k; a_k) = −\sigma(−x_k^T W a_k) − \sigma(x_k^T W a_r),    (10)

where \sigma(\cdot) = \log(1 + \exp(\cdot)) is the softplus function, i.e., \sigma(−x_k^T W a_k) = \log(1 + \exp(−x_k^T W a_k)), a_k is the attribute corresponding to the visual feature x_k, and a_r is a negative attribute sampled from A_s \ {a_k}. We compare the NCE-based MIE and the JSD-based MIE in our experiments and show that the NCE-based MIE performs better than the JSD-based MIE. However, because the number of negative samples required by the JSD-based MIE is far smaller than that of the NCE-based MIE, the JSD-based MIE is superior to the NCE-based MIE in training speed.

based MIE is superior to NCE-based MIE in training speed. What’s more, [20] shows a structure matters that incorporating knowledge

about locality of mutual information can greatly influence the model’s suitability, we also define a local NCE-based MIE. We directly slice the visual feature

9

Journal Pre-proof

vectors xk into T equal length parts, then the average mutual information can

pro of

be maximized:

T F (xtk , ak ; W ) 1 X (LN CE) Ibw log P (xk ; ak ) = . t T t=1 ai ∈As F (xk , ai ; W )

(11)

In our experiments, we investigate the global NCE-based MIE, local NCEbased MIE, and the combination of the two NCE-based MIE: F (xk , ak ; w1) (L+G) Ibw (xk ; ak ) =α log P ai ∈As F (xk , ai ; w1)

(12)

re-

T β X F (xtk , ak ; w2) + log P , t T t=1 ai ∈As F (xk , ai ; w2)

where α and β are the hyper parameters. Depending on the value of these

165

(L+G) (X; A) can learn a standard global NCE-based hyper parameters, the Ibw

MIE (α = 1, β = 0), or a standard local NCE-based MIE (α = 0, β = 1), or the

lP

global & local NCE-based MIE. w1 and w2 are the parameter matrix for the global and local objectives, respectively. The error functions mentioned above will be compared and analyzed in the experimental section.
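A sketch of the combined local and global objective of Equation (12), assuming the feature vector is split into T equal chunks and that the same seen-class attributes serve as noise samples for every chunk; hyper-parameter values and tensor contents are placeholders.

```python
import torch
import torch.nn.functional as F

def local_global_loss(X, labels, A_seen, w1, w2, T=4, alpha=0.5, beta=0.5):
    """Negative of the combined objective in Eq. (12): a global NCE term with
    parameters w1 plus the average of T local NCE terms (one per feature slice)
    with parameters w2. alpha=1, beta=0 gives the global-only estimator;
    alpha=0, beta=1 the local-only one."""
    global_term = F.cross_entropy((X @ w1) @ A_seen.t(), labels)
    local_term = 0.0
    for x_t in X.chunk(T, dim=1):        # slice x_k into T equal-length parts
        local_term = local_term + F.cross_entropy((x_t @ w2) @ A_seen.t(), labels)
    return alpha * global_term + beta * local_term / T

B, d_x, d_a, M_s, T = 64, 2048, 85, 40, 4   # hypothetical sizes (d_x divisible by T)
w1 = torch.zeros(d_x, d_a, requires_grad=True)
w2 = torch.zeros(d_x // T, d_a, requires_grad=True)
X, labels = torch.rand(B, d_x), torch.randint(0, M_s, (B,))
A_seen = torch.rand(M_s, d_a)
local_global_loss(X, labels, A_seen, w1, w2).backward()
```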


4. Experiments

To demonstrate the effectiveness and efficiency of the proposed method, we conduct extensive experiments on the five most widely used benchmark datasets for ZSL. First, we introduce the experimental settings, mainly the datasets and the evaluation metrics. Then we compare the proposed method with existing state-of-the-art ZSL methods in both the conventional and the generalized settings. Finally, we compare and analyze the several proposed mutual information estimators.

4.1. Experimental Settings

Datasets. Five datasets are considered: the SUN Attribute Database (SUN) [42], Caltech-UCSD Birds 200-2011 (CUB) [43], Animals with Attributes 1 (AWA1) [25], Animals with Attributes 2 (AWA2) [17], and Attribute Pascal and Yahoo (aPY) [44].

Table 1: The details of the five benchmark datasets. A and Y denote the dimension of the attributes and the number of classes. S and U are the numbers of seen and unseen classes. N denotes the number of images. Ns and Nu are the numbers of images of seen and unseen classes, respectively. Note that Ns→ts denotes the number of images of seen classes during test in the GZSL setting.

Dataset    | Att A | Classes Y | S + U  | Images N | SS: Ns | SS: Nu | PS: Ns | PS: Nu | PS: Ns→ts
SUN [42]   | 102   | 717       | 645+72 | 14340    | 12900  | 1440   | 10320  | 1440   | 2580
CUB [43]   | 312   | 200       | 150+50 | 11788    | 8855   | 2933   | 7057   | 2967   | 1764
AWA1 [25]  | 85    | 50        | 40+10  | 30475    | 24295  | 6180   | 19832  | 5685   | 4958
AWA2 [17]  | 85    | 50        | 40+10  | 37322    | 30337  | 6985   | 23527  | 7913   | 5882
aPY [44]   | 64    | 32        | 20+12  | 15339    | 12695  | 2644   | 5932   | 7924   | 1483

SUN is a fine-grained dataset that contains 14,340 images from 717 types of scenes, of which 645 types are used for training and 72 for testing. CUB is also a fine-grained dataset, containing 11,788 images of 200 bird categories; we use the split with 150 classes for training and 50 for testing. AWA1 and AWA2 are both coarse-grained datasets sharing the same 50 animal classes. In total, AWA2 has 37,322 images compared to the 30,475 images of AWA1. For both AWA1 and AWA2, we use 40 classes for training and the remaining 10 classes for testing. For aPY, which contains 15,339 images from 32 classes, we use 20 classes for training and the remaining 12 classes for testing. In our experiments, we adopt both the Standard train/test Splits (SS) and the Proposed Splits (PS) in [17]; the details of SS and PS on the five datasets are given in Table 1. For the visual features, we extract the 2,048-dimensional top-layer pooling units of the 101-layered ResNet, following the pre-trained model used in [17]. We use the Adam optimizer with base learning rate 0.05 and minibatch size 64.
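A sketch of how such 2,048-dimensional pooled ResNet-101 features could be extracted with torchvision; the paper uses the pre-extracted features released with [17], so this is only an approximation of that pipeline (input preprocessing is omitted).

```python
import torch
import torchvision

# Build ResNet-101 and drop its classification head so the forward pass
# returns the 2,048-d average-pooled features.
resnet = torchvision.models.resnet101(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

with torch.no_grad():
    images = torch.rand(8, 3, 224, 224)   # dummy batch of already-preprocessed images
    feats = resnet(images)                # shape: (8, 2048)
```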

Evaluation Metrics. To compare performance with the existing methods, we average the correct predictions independently for each class before dividing by the number of classes, i.e., we use the Mean Class Accuracy (MCA) as the evaluation metric:

MCA = \frac{1}{\|Y\|} \sum_{y \in Y} acc_y,    (13)

where acc_y denotes the top-1 accuracy of class y. In the GZSL setting, both the unseen classes y^u and the seen classes y^s are considered. Therefore, we adopt MCA_S (the MCA on the seen test classes), MCA_U (the MCA on the unseen test classes), and their harmonic mean (H) as the evaluation metrics in the GZSL setting:

H = \frac{2 * MCA_S * MCA_U}{MCA_S + MCA_U}.    (14)
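The two metrics can be computed directly from per-class predictions; a small self-contained sketch (with a tiny hypothetical label array) follows.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """Eq. (13): average the per-class top-1 accuracies, so that every class
    counts equally regardless of how many test images it has."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def harmonic_mean(mca_seen, mca_unseen):
    """Eq. (14): H = 2 * MCA_S * MCA_U / (MCA_S + MCA_U), used in GZSL."""
    if mca_seen + mca_unseen == 0:
        return 0.0
    return 2.0 * mca_seen * mca_unseen / (mca_seen + mca_unseen)

# Tiny hypothetical example with two classes of different sizes.
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 1, 0, 1])
print(mean_class_accuracy(y_true, y_pred))   # (0.75 + 1.0) / 2 = 0.875
```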

Table 2: Comparisons in the generalized setting on the Proposed Split (PS) in [17], measuring top-1 accuracy in %. U = MCA_U and S = MCA_S denote the top-1 accuracy on unseen and seen test classes, respectively.

Method         | SUN: U / S / H     | CUB: U / S / H     | AWA1: U / S / H    | AWA2: U / S / H    | aPY: U / S / H
DAP [25]       | 4.2 / 25.1 / 7.2   | 1.7 / 67.9 / 3.3   | 0.0 / 88.7 / 0.0   | 0.0 / 84.7 / 0.0   | 4.8 / 78.3 / 9.0
SSE [27]       | 2.1 / 36.4 / 4.0   | 8.5 / 46.9 / 14.4  | 7.0 / 80.5 / 12.9  | 8.1 / 82.5 / 14.8  | 0.2 / 78.9 / 0.4
LATEM [11]     | 14.7 / 28.8 / 19.5 | 15.2 / 57.3 / 24.0 | 7.3 / 71.7 / 13.3  | 11.5 / 77.3 / 20.0 | 0.1 / 73.0 / 0.2
SJE [12]       | 14.7 / 30.5 / 19.8 | 23.5 / 59.2 / 33.6 | 11.3 / 74.6 / 19.6 | 8.0 / 73.9 / 14.4  | 3.7 / 55.7 / 6.9
ESZSL [26]     | 11.0 / 27.9 / 15.8 | 12.6 / 63.8 / 21.0 | 6.0 / 75.6 / 12.1  | 5.9 / 77.8 / 11.0  | 2.4 / 70.1 / 4.6
SYNC [28]      | 7.9 / 43.3 / 13.4  | 11.5 / 70.9 / 19.8 | 8.9 / 87.3 / 16.2  | 10.0 / 90.5 / 18.0 | 7.4 / 66.3 / 13.3
SAE [10]       | 8.8 / 18.0 / 11.8  | 7.8 / 54.0 / 13.6  | 1.8 / 77.1 / 3.5   | 1.1 / 82.2 / 2.2   | 0.4 / 80.9 / 0.9
GFZSL [14]     | 0.0 / 39.6 / 0.0   | 0.0 / 45.7 / 0.0   | 1.8 / 80.3 / 3.5   | 2.5 / 80.1 / 4.8   | 0.0 / 83.3 / 0.0
NCE-based MIE  | 25.9 / 40.2 / 31.5 | 26.7 / 69.3 / 38.5 | 22.6 / 90.6 / 36.2 | 17.9 / 91.9 / 29.9 | 17.3 / 79.5 / 28.4

urn a

195

SUN

re-

seen test classes, respectively.

In order to prove that the proposed method can alleviate the domain shift problem, we first compare our method with several state-of-the-art ZSL methods mentioned in the related work in the generalized setting. Experimental results are given in Table 2. An interesting observation is that the methods, e.g., 200

DAP and GFZSL, perform well on seen test classes (M CAS ), but they work poorly on the unseen test classes (M CAU ). However, our method improves the

Jo

overall performance by an obvious margin for all the M CAU , M CAS and the harmonic mean (H). Especially for the M CAU , the proposed method performs

much better than all the state-of-the-art methods. For instance, the M CAS

205

ranks SYNC as the best performing method on SUN and CUB, and GFZSL as the best method on aPY, respectively. Whereas, for the M CAU , they perform

very poorly on all datasets. Especially, the GFZSL can not correctly recognize 12

Journal Pre-proof

Figure 2: Comparisons in the generalized setting on the Proposed Split (PS) in [17] of the harmonic mean. (The histogram shows, for each of SUN, CUB, AWA1, AWA2 and aPY, the harmonic mean of DAP, SSE, LATEM, SJE, ESZSL, SYNC, SAE, GFZSL and the NCE-based MIE.)

the unseen classes on SUN, CUB and aPY, where its MCA_U is as low as 0.0. Moreover, the proposed method achieves harmonic mean (H) scores 4.9 ∼ 16.6% (where 4.9 = 38.5 − 33.6 and 16.6 = 36.2 − 19.6) better than the previous highest results on the five datasets.

urn a

The striking improvement is achieved by inter-class separation during training process. Most existing approaches project either visual features or semantic features from one space to another, or project both features to an shared em215

bedding space. However, these methods fail to recognize unseen classes because they map the visual or semantic features to the fixed anchor points in the embedding space. Our binary classifier on seen classes possesses good separation property by maximizing the mutual information between positive samples pairs,

Jo

and minimizing that of noise pairs. When the binary classifier is used on unseen

220

classes, it can be high discriminative to recognize the new visual features which do not belong to the seen classes. Then the domain shift is alleviated. In order to show intuitively that our method achieves much better result in

generalized setting, we present the harmonic mean obtained by previous state-

13

Journal Pre-proof

Table 3: Comparisons in the conventional setting using SS = Standard Split and PS = Proposed Split.

Method         | SUN: SS / PS | CUB: SS / PS | AWA1: SS / PS | AWA2: SS / PS | aPY: SS / PS
DAP [25]       | 38.9 / 39.9  | 37.5 / 40.0  | 57.1 / 44.1   | 58.7 / 46.1   | 35.2 / 33.8
SSE [27]       | 54.5 / 51.5  | 43.7 / 43.9  | 68.8 / 60.1   | 67.5 / 61.0   | 31.1 / 34.0
LATEM [11]     | 56.9 / 55.3  | 49.4 / 49.3  | 74.8 / 55.1   | 68.7 / 55.8   | 34.5 / 35.2
SJE [12]       | 57.1 / 53.7  | 55.3 / 53.9  | 76.7 / 65.6   | 69.5 / 61.9   | 32.0 / 32.9
ESZSL [26]     | 57.3 / 54.5  | 55.1 / 53.9  | 74.7 / 58.2   | 75.6 / 58.6   | 34.4 / 38.3
SYNC [28]      | 59.1 / 56.3  | 54.1 / 55.6  | 72.2 / 54.0   | 71.2 / 46.6   | 39.7 / 23.9
SAE [10]       | 42.4 / 40.3  | 33.4 / 33.3  | 80.6 / 53.0   | 80.7 / 54.1   | 8.3 / 8.3
GFZSL [14]     | 62.9 / 60.6  | 53.0 / 49.3  | 80.5 / 68.3   | 79.3 / 63.8   | 51.3 / 38.4
CSSD [33]      | -    / -     | 52.5 / -     | 81.2 / -      | -    / -      | 54.1 / -
NCE-based MIE  | 64.4 / 64.2  | 57.4 / 58.1  | 83.1 / 69.1   | 81.9 / 66.7   | 51.0 / 39.1

225

lP

of-the-art methods and our method on these five datasets in Figure 2. In the histogram, the nine colors denote the accuracy scores in harmonic mean (H) of eight existing ZSL methods and the proposed method (the red one) on each dataset. It can be notice that our method outperforms the previous state-ofthe-art ZSL methods consistently on all five datasets. Extensive experiments

230

urn a

validate that the proposed method can address the domain shift problem without transductive learning by mutual information estimation and maximization, instead of mapping the visual features to the fixed anchor points in the visualsemantic embedding space.

4.3. Comparisons in Conventional Setting Robustness and generalization are ignored by most existing ZSL methods. In our method, the improved ZSL method where the visual-semantic embedding

Jo

235

is less specialized toward solving a single dataset is required. We compare our method with existing state-of-the-art methods mentioned in the related work. We use both the Standard Split (SS) and the Proposed Split (PS) in [17] to conduct experiments on SUN, CUB, AWA1, AWA2 and aPY. Table 3

240

shows the experimental results in conventional ZSL setting. It can be observed 14

Journal Pre-proof

Table 4: Comparisons of different MIE in both generalized and conventional settings. M =

pro of

M CA denotes the top-1 accuracy on the test data in the conventional settings. U = M CAU and S = M CAS denote Top-1 accuracy on unseen classes and seen test classes, respectively. Model

SUN

CUB

AWA1

M

U

S

H

M

U

S

H

M

U

G

64.2

25.9

40.2

31.5

58.1

26.7

69.3

38.5

69.1

AWA2

aPY

S

H

M

U

S

H

M

U

S

H

22.6

90.6

36.2

66.7

17.9

91.9

29.9

39.1

17.3

79.5

28.4

L

61.9

25.8

41.4

31.7

57.4

26.8

69.5

38.7

69.3

19.3

90.8

31.8

65.8

16.7

91.6

28.3

36.0

14.5

79.8

24.5

L+G

63.8

25.6

41.1

31.6

57.3

27.2

69.6

39.1

68.9

26.3

90.6

40.8

67.5

22.4

91.8

36.0

37.3

14.2

79.3

24.1

JSD-Based

54.2

21.9

30.4

25.5

49.8

28.2

61.5

38.7

68.8

19.7

88.3

32.2

65.8

22.3

89.9

35.8

39.0

17.2

79.2

28.3

that the proposed method outperforms the existing approaches almost on all datasets. Notably, on SUN, CUB, AWA1 and AWA2, our method outperforms

re-

the previous best methods, which achieve the best results among the existing methods toward a single dataset , by a large margin of 1.2 ∼ 2.5% (where 245

1.2 = 81.9 − 80.7 and 2.5 = 83.1 − 80.6) and 0.8 ∼ 3.6% (where 0.8 = 69.1 − 68.3 and 3.6 = 64.2 − 60.6) using SS and PS, respectively. The experimental results prove that our method effectively utilizes mutual information estimation and

lP

maximization between the high-dimensional visual features and corresponding attributes to facilitate the visual-semantic embedding. 250

4.4. Comparisons of Different Estimators

urn a

In the following experiments, G and L refer the global-only (α = 1, β = 0) and local-only (α = 0, β = 1, T = 4) NCE-based MIE, respectively. L + G denotes the combination of local and global NCE-based MIE (α = 0.5, β = 0.5, T = 4 is present here as L + G). JSD-Based is the MIE based on Jensen255

Shannon Divergence. The ZSL and GZSL results got by these MIEs using the proposed split in [17] are presented in Table 4. In general, the mutual information estimators based on NCE and JSD all perform well than all the

Jo

existing state-of-the art methods on all datasets. In detail, all NCE-based MIEs contain the better and more stable performance than JSD-based MIE in most

260

experiments. With high capability of visual classification, the NCE-based MIEs cannot maintain the high training speed. The time consuming per iteration of different MIEs are reported in Table

5. For each model, we keep the batch size as 64. As shown in Table 5, we

15

Journal Pre-proof

can see that the training speed of JSD-based MIE is faster than that of the MIEs based on NCE, especially for the local-only NCE-based MIE (L), as well

pro of

265

as the local and global NCE-based MIE (L + G). But it’s worth mentioning that the computational time consuming per iteration of all the MIE models are very short.

Table 5: Comparisons of time consuming per iteration.

Model

Time consuming per iteration (10−3 s) CUB

AWA1

AWA2

aPY

G

1.2

1.3

1.6

1.6

1.3

L

2.5

2.7

2.9

2.9

2.7

L+G

2.9

3.0

3.3

3.3

3.1

JSD-Based

1.0

1.0

1.1

1.1

1.1

re-

SUN

270

lP

As shown in Figure 3, although the overall results of JSD-based MIE are not as well as the NCE-based MIE, the M CAU ranks JSD-based MIE as the best performing method on CUB. On the other hand, the performances of localonly NCE-based MIE and the combination of local and global NCE-based MIE

urn a

are not always better than the global-only NCE-based MIE. But in summary, the NCE-based MIEs are superior to the JSD-based MIE. Due to the negative 275

samples required by NCE-based MIE are much more than the JSD-based MIE, the NCE-based MIE are more robust and generalizable.

5. Conclusion

In this paper, we address the fundamental projection domain shift problem

Jo

in generalized settings, and seek an improved ZSL method where the visual-

280

semantic embedding is robust and has generalization ability for all datasets. We propose to build the visual-semantic embedding by maximizing the mutual information between visual features and corresponding attributes. Without mapping the visual features to the fixed anchor points, the well-known domain shift problem in GZSL is alleviated. We further leverage NCE-based MIE for training a 16

Journal Pre-proof

re-

pro of

JSD-Based

Figure 3: Comparisons of MCA (M) in conventional ZSL setting and the harmonic mean (H) in GZSL setting among the proposed several MIEs.

binary classifier to distinguish the matching image-attribute pairs from the noise

lP

285

samples. Extensive experiments show the proposed NCE-based MIE achieves robustness and generalization than most state-of-the-art ZSL methods. What’s more, we compare and analyze several mutual information estimators both in the conventional ZSL and the more realistic GZSL setting. The NCE-based MIE is simpler and produces better results. In future, the transductive ZSL by

urn a

290

mutual information estimation and maximization in the GZSL setting will be in our research scope. Meanwhile, more other cross-modal applications will be attempted by maximizing mutual information.

6. Acknowledgments

This paper is supported by the National Key R&D Program of China under

Jo

295

contract No. 2017YFB1002201, the National Natural Science Fund for Distinguished Young Scholar under Grant No. 61625204, the Key Program of National Science Foundation of China under Grant No. 61836006 and 61432014, and the National Natural Science Foundation of China under Grant No. 61602328.

17

Journal Pre-proof

300

References

pro of

[1] C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida, USA, 2009, pp. 951–958. doi:10.1109/CVPRW.2009.5206594. 305

[2] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, Transductive multi-view zero-shot learning, IEEE Trans. Pattern Anal. Mach. Intell. 37 (11) (2015) 2332–2345. doi:10.1109/TPAMI.2015.2408354.

re-

[3] Y. Fu, L. Sigal, Semi-supervised vocabulary-informed learning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Ve310

gas, NV, USA, 2016, pp. 5337–5346. doi:10.1109/CVPR.2016.576.

lP

[4] Z. Ji, Y. Yu, Y. Pang, J. Guo, Z. Zhang, Manifold regularized cross-modal embedding for zero-shot learning, Inf. Sci. 378 (2017) 48–58. doi:10.1016/ j.ins.2016.10.025.

[5] Y. Yu, Z. Ji, J. Guo, Y. Pang, Transductive zero-shot learning with adaptive 315

structural embedding, IEEE Trans. Neural Netw. Learning Syst. 29 (9)

urn a

(2018) 4116–4127. doi:10.1109/TNNLS.2017.2753852.

[6] Y. Guo, G. Ding, J. Han, Y. Gao, Zero-shot learning with transferred samples, IEEE Trans Image Process 26 (7) (2017) 3277–3290.

[7] G. Ding, Y. Guo, K. Chen, C. Chu, J. Han, Q. Dai, Decode: Deep confi320

dence network for robust image classification, IEEE Transactions on Image

Jo

Processing 28 (8) 3752–3765.

[8] R. Socher, M. Ganjoo, C. D. Manning, A. Y. Ng, Zero-shot learning through cross-modal transfer, in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Sys-

325

tems 2013. Lake Tahoe, Nevada, USA., 2013, pp. 935–943.

18

Journal Pre-proof

[9] Z. Akata, F. Perronnin, Z. Harchaoui, C. Schmid, Label-embedding for

pro of

image classification, IEEE Trans. Pattern Anal. Mach. Intell. 38 (7) (2016) 1425–1438. doi:10.1109/TPAMI.2015.2487986.

[10] E. Kodirov, T. Xiang, S. Gong, Semantic autoencoder for zero-shot learn330

ing, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 2017, pp. 4447–4456. doi:10.1109/CVPR. 2017.473.

[11] Y. Xian, Z. Akata, G. Sharma, Q. N. Nguyen, M. Hein, B. Schiele, Latent

335

re-

embeddings for zero-shot classification, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 2016, pp. 69–77. doi:10.1109/CVPR.2016.15.

[12] Z. Akata, S. E. Reed, D. Walter, H. Lee, B. Schiele, Evaluation of out-

lP

put embeddings for fine-grained image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA, 340

2015, pp. 2927–2936. doi:10.1109/CVPR.2015.7298911. [13] X. Li, Y. Guo, Max-margin zero-shot learning for multi-class classification,

urn a

in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS, San Diego, California, USA, 2015.

[14] V. K. Verma, P. Rai, A simple exponential family framework for zero345

shot learning, in: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part II, 2017, pp. 792–808. doi:10.1007/

Jo

978-3-319-71246-8\_48.

[15] Y. Li, D. Wang, Zero-shot learning with generative latent prototype model,

350

CoRR abs/1705.09474.

[16] T. Long, X. Xu, F. Shen, L. Liu, N. Xie, Y. Yang, Zero-shot learning via discriminative representation extraction, Pattern Recognition Letters 109 (2018) 27–34. 19

Journal Pre-proof

[17] Y. Xian, C. H. Lampert, B. Schiele, Z. Akata, Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly, IEEE transactions

pro of

355

on pattern analysis and machine intelligence.

[18] W. Chao, S. Changpinyo, B. Gong, F. Sha, An empirical study and analysis of generalized zero-shot learning for object recognition in the wild, in: Computer Vision - ECCV 2016 - 14th European Conference, Amster360

dam, The Netherlands, October 11-14, 2016, Proceedings, Part II, 2016, pp. 52–68. doi:10.1007/978-3-319-46475-6\_4.

re-

[19] J. Song, C. Shen, Y. Yang, Y. Liu, M. Song, Transductive unbiased embedding for zero-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 2018, pp. 1024– 365

1033. doi:10.1109/CVPR.2018.00113.

lP

[20] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, CoRR abs/1808.06670. [21] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, A. C. Courville, MINE: mutual information neural estimation, CoRR abs/1801.04062.

urn a

370

[22] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, J. Dean, Zero-shot learning by convex combination of semantic embeddings, in: ICLR, 2014.

[23] M. Gutmann, A. Hyv¨arinen, Noise-contrastive estimation of unnormalized 375

statistical models, with applications to natural image statistics, Journal of

Jo

Machine Learning Research 13 (2012) 307–361.

[24] Z. Ji, Y. Sun, Y. Yu, Y. Pang, J. Han, Attribute-guided network for crossmodal zero-shot hashing, IEEE transactions on neural networks and learning system.

20

Journal Pre-proof

380

[25] C. H. Lampert, H. Nickisch, S. Harmeling, Attribute-based classification

pro of

for zero-shot visual object categorization, IEEE Trans. Pattern Anal. Mach. Intell. 36 (3) (2014) 453–465. doi:10.1109/TPAMI.2013.140.

[26] B. Romera-Paredes, P. H. S. Torr, An embarrassingly simple approach to zero-shot learning, in: Proceedings of the 32nd International Conference 385

on Machine Learning, ICML, Lille, France, 2015, pp. 2152–2161.

[27] Z. Zhang, V. Saligrama, Zero-shot learning via semantic similarity embedding, in: 2015 IEEE International Conference on Computer Vision,

re-

ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 4166–4174. doi:10.1109/ICCV.2015.474. 390

[28] S. Changpinyo, W. Chao, B. Gong, F. Sha, Synthesized classifiers for zeroshot learning, in: 2016 IEEE Conference on Computer Vision and Pattern

lP

Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 5327–5336. doi:10.1109/CVPR.2016.575. [29] L. Liu, S. Wang, B. Hu, Q. Qiong, D. S. Rosenblum, Learning structures 395

of interval-based bayesian networks in probabilistic generative model for

urn a

human complex activity recognition, Pattern Recognition 81 (2018) 545– 561.

[30] H. Zhang, P. Koniusz, Zero-shot kernel learning, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City,

400

UT, USA, June 18-22, 2018, 2018, pp. 7670–7679.

[31] A. Mishra, S. K. Reddy, A. Mittal, H. A. Murthy, A generative mod-

Jo

el for zero shot learning using conditional variational autoencoders, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 2188–2196.

405

[32] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, A. Elgammal, A generative adversarial approach for zero-shot learning from noisy texts (2018) 1004–1013.

21

Journal Pre-proof

[33] Z. Ji, J. Wang, Y. Yu, Y. Pang, J. Han, Class-specific synthesized dictionary

pro of

model for zero-shot learning, Neurocomputing 329 (2019) 339–347.

[34] Y. Guo, G. Ding, X. Jin, J. Wang, Transductive zero-shot recognition via 410

shared model space learning, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., 2016, pp. 3434–3500.

[35] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, Devise: A deep visual-semantic embedding model, in: Advances in Neural Information Processing Systems 26: 27th Annual Confer-

re-

415

ence on Neural Information Processing Systems 2013. Lake Tahoe, Nevada, USA., 2013, pp. 2121–2129.

[36] R. Linsker, Self-organization in a perceptual network, IEEE Computer

420

lP

21 (3) (1988) 105–117. doi:10.1109/2.36.

[37] A. J. Bell, T. J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation 7 (6) (1995) 1129– 1159. doi:10.1162/neco.1995.7.6.1129.

urn a

[38] X. Ji, J. F. Henriques, A. Vedaldi, Invariant information distillation for unsupervised image segmentation and clustering, CoRR abs/1807.06653.

425

[39] J. Rigau, M. Feixas, M. Sbert, A. Bardera, I. Boada, Medical image segmentation based on mutual information maximization, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2004, 7th International Conference Saint-Malo, France, September 26-29, 2004, Pro-

Jo

ceedings, Part I, 2004, pp. 135–142.

430

[40] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.

22

Journal Pre-proof

[41] S. Nowozin, B. Cseke, R. Tomioka, f-gan: Training generative neural samplers using variational divergence minimization, in: Advances in Neural

pro of

435

Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 271–279.

[42] G. Patterson, J. Hays, SUN attribute database: Discovering, annotating, 440

and recognizing scene attributes, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012,

re-

2012, pp. 2751–2758.

[43] T. M. C. W. F. S. S. B. P. Welinder, S. Branson, P. Perona, Caltech-ucsd birds 200, in: Caltech, Tech. Rep. CNS-TR-2010-001, 2010. 445

[44] A. Farhadi, I. Endres, D. Hoiem, D. A. Forsyth, Describing objects by

lP

their attributes, in: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami,

Jo

urn a

Florida, USA, 2009, pp. 1778–1785.


Conflict of Interest Declaration

We confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from [email protected].

Signed by all authors as follows: Chenwei Tang, Xue Yang, Jiancheng Lv, Zhenan He

Author Contributions

Chenwei Tang: Conceptualization, Methodology, Validation, Investigation, Writing - Original Draft, Writing - Review & Editing. Xue Yang: Methodology, Software, Validation, Formal analysis. Jiancheng Lv: Resources, Supervision, Project administration, Funding acquisition. Zhenan He: Writing - Original Draft, Writing - Review & Editing, Visualization.