Zero-Shot Learning by Mutual Information Estimation and Maximization

Chenwei Tang*, Xue Yang*, Jiancheng Lv**, Zhenan He**

College of Computer Science, Sichuan University, Chengdu 610065, P. R. China
Abstract
The key to zero-shot learning is to use the visual-semantic embedding to transfer knowledge from seen classes to unseen classes. In this paper, we propose to build the visual-semantic embedding by maximizing the mutual information between visual features and the corresponding attributes. The mutual information between visual and semantic features can then be utilized to guide the knowledge transfer from the seen domain to the unseen domain. Since we are primarily interested in maximizing mutual information, we introduce noise-contrastive estimation to compute a lower bound on the mutual information. Through noise-contrastive estimation, we reformulate zero-shot learning as a binary classification problem, i.e., classifying the matching visual-semantic pairs (positive samples) and the mismatching visual-semantic pairs (negative/noise samples). Experiments conducted on five datasets demonstrate that the proposed mutual information estimators outperform current state-of-the-art methods in both the conventional and generalized zero-shot learning settings.

Keywords: Zero-shot learning, mutual information, noise-contrastive estimation, visual-semantic embedding
* Both authors contributed equally to this research.
** Co-Corresponding author.
Email addresses: [email protected] (Jiancheng Lv), [email protected] (Zhenan He)
1. Introduction
Zero-Shot Learning (ZSL) [1, 2, 3, 4, 5, 6, 7] aims to recognize instances of new categories that have never been seen before, i.e., the categories in the training set and the test set are disjoint, by learning an embedding space between image and semantic features. The key to the success of ZSL is to build a visual-semantic embedding, so that the cross-modality representation between visual and semantic features can be learned [8]. After that, by finding the intermediate semantic representation, the knowledge learned from seen classes can be transferred to unseen ones.
Many effective methods have been proposed to build the visual-semantic embedding. These methods usually project either visual features or semantic features from one space to the other, or project both features to a shared embedding space. Early works on ZSL learn a bilinear compatibility function between the visual and semantic spaces using a ranking loss [9, 10]. Other approaches are based on non-linear multi-modal embeddings [11, 8]. Another popular direction is max-margin based methods [12, 13], which employ a ranking function to measure the matching scores between visual and semantic feature vectors. Recently, methods that learn the cross-modal mapping with discriminative losses [14, 15, 16] have become more and more popular.
After the embedding step, most previous ZSL works assume that the test categories belong only to the unseen classes during performance evaluation. However, in practice, image classification applications are required to recognize images belonging to either seen or unseen classes. Thus, we advocate evaluating ZSL approaches in the setting of Generalized Zero-Shot Learning (GZSL) [17, 18], where both the seen source and unseen target classes are classified during testing. Because most existing ZSL methods map visual features to several fixed anchor points in the visual-semantic embedding during training, the images of unseen categories tend to be recognized as seen source classes in the joint labeling space. This is the fundamental domain adaptation problem [18, 2] in GZSL.
To address the domain shift problem, many methods follow the way of transductive learning [2, 5, 19], where both the labeled source and unlabeled target images are available for training. By combining the unlabeled images of unseen categories with the seen source data, transductive ZSL methods can learn a more general visual-semantic embedding. In this paper, we propose a novel ZSL method based on mutual information estimation and maximization [20, 21] to alleviate the domain shift problem. More specifically, unlike the transductive ZSL methods mentioned above, the proposed method can significantly alleviate the domain adaptation problem even though it follows inductive learning [22], i.e., only data of the labeled source classes are available during the training phase. However, in continuous high-dimensional space, the calculation of mutual information is notoriously difficult. Moreover, improving ZSL requires that the visual-semantic embedding be less specialized towards solving a single dataset [17]. Thus, we leverage a Mutual Information Estimator (MIE) based on Noise-Contrastive Estimation (NCE) [23] to train a binary classifier that distinguishes the matching visual-semantic pairs from the negative pairs. In this way, the high-dimensional visual features are compressed into the much more compact attribute embedding space, in which cross-modality learning is easier to model. The experiments on five datasets show that the proposed method, called NCE-based MIE, can be a good stepping stone towards a robust and generalizable ZSL method.
Our main contributions are summarized as follows:

• Most existing methods project either visual features or semantic features from one space to the other, or project both features to a shared embedding space. In this paper, we propose a novel method that utilizes the mutual information between visual and semantic features to guide the learning of the visual-semantic embedding. Without mapping the visual features to fixed anchor points, the well-known domain shift problem in GZSL is alleviated.

• By introducing NCE, we reformulate zero-shot learning as a binary classification problem, i.e., distinguishing the matching image-attribute pairs from the noise samples.

• We are the first to develop a ZSL framework incorporating mutual information maximization and NCE. Experiments show that our framework outperforms state-of-the-art ZSL methods in both the conventional ZSL and GZSL settings.
2. Related Work

2.1. Zero-Shot Learning
ZSL, a highlighted research topic closely related to zero-shot hashing [24], has been widely applied in real-world applications. Typically, it is achieved by utilizing the visual-semantic embedding to transfer knowledge from seen classes to unseen classes. Here, we briefly introduce several mainstream methods and compare our method with them in the experimental section.

Direct Attribute Prediction (DAP) [25], as one of the pioneering studies, learns probabilistic attribute classifiers and then recognizes an unseen instance by estimating the posterior of each attribute. On the other hand, Structured Joint Embedding (SJE) [12], ESZSL [26] and the Semantic AutoEncoder (SAE) [10] all directly map the image features to the semantic space. Specifically, SJE learns a bilinear compatibility by optimizing a structural Support Vector Machine (SVM) loss. ESZSL learns the bilinear compatibility between visual features, attributes and class labels with the square loss. SAE, following the auto-encoder, reconstructs the visual features in the semantic space. Further, LatEm [11] extends the bilinear compatibility model of SJE. Recent advances, such as Semantic Similarity Embedding (SSE) [27] and Synthesized Classifiers (SYNC) [28], embed both the visual and semantic features into another common space.

Recently, a new branch of methods targets ZSL with generative models [29, 30]. There are two main themes among them: the Variational Auto-encoder (VAE) [31] and the Generative Adversarial Network (GAN) [16, 32, 14]. These methods synthesize pseudo instances for unseen classes with the aid of seen class prototypes [33]. Then, with the generated unseen samples, ZSL can be transformed into a standard classification problem for object recognition.
ZSL in the generalized setting provides a more practical point of view than the conventional setting. Because the projection learned from the seen classes suffers from the domain shift problem when directly applied, in the joint labeling space, to the disjoint unseen classes, which are different from and potentially unrelated to the source data, the results in GZSL are significantly lower than the conventional ZSL results.

To alleviate the domain shift problem, many methods based on transductive learning [2, 34] have been proposed. [5] alleviates the domain shift problem by iteratively adding the unlabeled unseen instances according to their reliability during the training phase. Quasi-Fully Supervised Learning (QFSL) [19] also follows the way of transductive learning, where the labeled source images and the unlabeled target images are forced to be mapped to different specified points. Inductive ZSL methods [12, 28, 35], such as the DAP, SSE and GFZSL methods mentioned above, generally perform well on the seen source classes during the testing phase, while their performance on the unseen test classes is very poor.
2.2. Mutual Information Maximization
Mutual information measures the dependence between two random variables. For two discrete variables X and Y whose joint probability distribution is p(x, y), the mutual information between them, denoted I(X; Y), is given by

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)},    (1)

where p(x, y) is the joint probability distribution, while p(x) and p(y) are the marginal probabilities. Furthermore, mutual information, as a Shannon entropy-based quantity, can be written as follows:

I(X; Y) = H(X) - H(X|Y),    (2)

where H is the Shannon entropy and H(X|Y) is the conditional entropy, i.e., the average uncertainty about X after observing a second random variable Y.
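As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch computes both forms of the mutual information for a toy 2x2 joint distribution (the numbers are illustrative assumptions, not data from this paper) and checks that they agree:

```python
import numpy as np

# Toy joint distribution p(x, y); the values are an illustrative assumption.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)

# Eq. (1): I(X;Y) = sum_{x,y} p(x,y) log[p(x,y) / (p(x) p(y))]
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# Eq. (2): I(X;Y) = H(X) - H(X|Y)
h_x = -np.sum(p_x * np.log(p_x))                   # Shannon entropy H(X)
h_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))   # conditional entropy H(X|Y)

assert np.isclose(mi, h_x - h_x_given_y)           # the two expressions agree
```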
Methods based on mutual information have been widely used in unsupervised learning. The methods based on the info-max optimization principle [36, 37, 20] estimate and maximize the mutual information between the input data and the output of a deep neural network for unsupervised representation learning. The Mutual Information Neural Estimator (MINE) [21] shows that the estimation of mutual information between high-dimensional continuous random variables can be achieved by gradient descent over neural networks, and its experiments demonstrate that MINE can also be applied to supervised classification. Some recent works maximize the mutual information between images for unsupervised clustering and segmentation [38, 39]. In general, most existing methods based on mutual information are devoted to measuring the dependence between two random variables of the same modality, and they are mainly applied to unsupervised representation learning. Mutual information estimation has not yet been applied to tasks, such as ZSL, that involve measuring the dependence between cross-modal data.
3. Proposed Approach
Here we first introduce some notation and the problem definition. Let S = {(x_i^s, a_i^s, y_i^s), i = 1, ..., N_s} and U = {(x_j^u, a_j^u, y_j^u), j = 1, ..., N_u} denote the training data from the seen source classes and the unseen data, where x_i^s and x_j^u denote the visual features, a_i^s and a_j^u denote the attributes of seen and unseen classes in the semantic spaces A^s and A^u, y_i^s and y_j^u are the corresponding labels for x_i^s and x_j^u, and N_s and N_u denote the numbers of images of seen and unseen classes, respectively. The seen source data S and the unseen target data U are disjoint in categories, i.e., S ∩ U = ∅. Given the visual feature x_j^u of an unseen class and all the attributes in the A^u space, the goal of ZSL is to predict the class label y_j^u. As for GZSL, both x_i^s and x_j^u are given with the attribute space A = A^s + A^u, and the class label in the joint labeling space Y = Y^s + Y^u should be recognized.
The key technology in ZSL is the visual-semantic embedding, which can transfer knowledge from the seen classes to the unseen data. We encode the high-dimensional visual features into the much more compact attribute space and build the visual-semantic embedding by maximally preserving the mutual information between the visual features x and the corresponding attributes a:

I(x; a) = \sum_{x, a} p(x, a) \log \frac{p(x|a)}{p(x)}.    (3)

However, computing mutual information is very difficult, especially in continuous high-dimensional space. Thus, we model a density ratio as follows:

F(x_i, a_i) \propto \frac{p(x_i | a_i)}{p(x_i)},    (4)

where A \propto B denotes that A is proportional to B. By using the density ratio F(x_i, a_i), we can avoid modeling the high-dimensional distribution. Thus, a compatibility function F: X × A → R with adjustable parameters W can be used:

F(x_i, a_i; W) = \exp(x_i^T W a_i),    (5)

where W is the parameter matrix, and the value of F denotes the matching score between the input visual feature x_i and the attribute a_i. The larger the value of F, the larger the probability that x_i and a_i belong to the same class y_i. Therefore, the classification of an unseen instance x_j can be achieved by maximizing F:

f(x_j; W) = \arg\max_{a_j \in A^u} F(x_j, a_j; W),    (6)

where f(x_j; W) is the classifier of x_j, and the class label is the label corresponding to the attribute with the largest compatibility score. The mutual information maximization then amounts to finding a set of parameters W such that the mutual information I(x_i; a_i) is maximized.
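The following minimal NumPy sketch illustrates the compatibility function of Eq. (5) and the classification rule of Eq. (6); the dimensions and the random parameters are illustrative assumptions, not the trained model of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_att, n_unseen = 2048, 85, 10              # e.g., ResNet features with AWA-style attributes

W = rng.normal(scale=0.01, size=(d_vis, d_att))    # adjustable parameter matrix W
A_u = rng.normal(size=(n_unseen, d_att))           # attribute vectors of the unseen classes
x_j = rng.normal(size=d_vis)                       # visual feature of an unseen-class image

def compatibility(x, a, W):
    """Eq. (5): F(x, a; W) = exp(x^T W a)."""
    return np.exp(x @ W @ a)

# Eq. (6): assign the class whose attribute gives the largest compatibility score.
scores = np.array([compatibility(x_j, a, W) for a in A_u])
predicted_class = int(np.argmax(scores))
```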
Figure 1: The architecture of the proposed NCE-based MIE. x_k, z_k, and a_k denote the visual feature, the visual embedding, and the corresponding attribute, respectively. A^s denotes all attributes of seen classes. (The figure shows a ResNet visual feature extractor, the visual-semantic embedding producing z_k, and the binary classifier based on NCE that scores positive and negative visual-semantic pairs.)
Note that we cannot evaluate p(x|a) and p(x) directly; moreover, since we are primarily concerned with maximizing the mutual information between the image feature and the corresponding attribute rather than with its precise value, non-KL divergences may offer favorable trade-offs. Thus, we propose to use a lower bound on the mutual information based on NCE, following Deep InfoMax (DIM) [20], which simultaneously estimates and maximizes the mutual information by training a classifier to distinguish between positive and negative samples.
Let M denote the number of all classes, and let M_s and M_u denote the numbers of seen and unseen classes, respectively. Given a visual feature x_k from the seen classes and the corresponding attribute a_k as the positive pair, the remaining attributes in A^s other than a_k are treated as noise samples, i.e., there are M_s - 1 negative samples per visual feature. Finally, the NCE-based MIE can be formulated as a bound on the mutual information as follows:

\hat{I}_w^{(NCE)}(x_k; a_k) := \log \frac{F(x_k, a_k; W)}{\sum_{a_i \in A^s} F(x_k, a_i; W)}.    (7)

The loss corresponding to Equation 7 is the categorical cross-entropy of classifying the positive sample against the negative noise samples, where F / \sum_{A^s} F denotes the prediction of the model. The NCE-based MIE can then be derived as follows:

\hat{I}_w^{(NCE)}(x_k; a_k) = \log \frac{\exp(x_k^T W a_k)}{\sum_{a_i \in A^s} \exp(x_k^T W a_i)} = x_k^T W a_k - \log \sum_{a_i \in A^s} \exp(x_k^T W a_i).    (8)
Figure 1 shows the overall architecture of the proposed NCE-based MIE. First, the input visual feature x_k, extracted by a pre-trained Deep Convolutional Neural Network (DCNN) such as ResNet [40], is encoded into a visual embedding z_k = x_k^T W in the attribute space. Next, the compatibility scores {F(z_k, a_i)}_{i=1}^{M_s} between the visual embedding z_k and all M_s attributes in the A^s space are obtained. Finally, by utilizing the binary classifier based on NCE, we distinguish the positive samples from the negative/noise samples. In general, we can maximize the mutual information between the visual features x_k and the attributes a_k by training the NCE-based MIE, since

I(x_k; a_k) \geq \hat{I}_w^{(NCE)}(x_k; a_k).    (9)
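A minimal PyTorch sketch of this estimator is given below: the bilinear compatibility scores over all seen-class attributes are fed to a categorical cross-entropy, whose minimization is equivalent to maximizing the bound in Eq. (8). The module name and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCEBasedMIE(nn.Module):
    """Bilinear visual-semantic compatibility scored against all seen-class attributes."""
    def __init__(self, d_vis=2048, d_att=312):
        super().__init__()
        # W maps visual features into the attribute space: z_k = x_k^T W
        self.W = nn.Parameter(0.01 * torch.randn(d_vis, d_att))

    def forward(self, x, A_s):
        # x: (B, d_vis) visual features; A_s: (M_s, d_att) seen-class attributes.
        z = x @ self.W          # visual embeddings in the attribute space, (B, d_att)
        return z @ A_s.t()      # compatibility logits x^T W a_i, (B, M_s)

def nce_loss(logits, labels):
    # Cross-entropy of picking the matching attribute among all seen-class attributes;
    # its negative is exactly the bound of Eq. (8), so minimizing it maximizes the bound.
    return F.cross_entropy(logits, labels)
```

Minimizing this loss over mini-batches of seen-class images therefore pushes up the lower bound of Eq. (9).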
In addition to NCE, an MIE based on the Jensen-Shannon Divergence (JSD) may also offer favourable trade-offs. The JSD mutual information estimator follows the formulation of [41]:

\hat{I}_w^{(JSD)}(x_k; a_k) = -\sigma(-x_k^T W a_k) - \sigma(x_k^T W a_r),    (10)

where \sigma(\cdot) = \log(1 + \exp(\cdot)) is the softplus function, i.e., \sigma(-x_k^T W a_k) = \log(1 + \exp(-x_k^T W a_k)); a_k is the attribute corresponding to the visual feature x_k, and a_r is a negative attribute sampled from A^s \setminus \{a_k\}. We compare the NCE-based and JSD-based MIEs in our experiments and show that the NCE-based MIE performs better than the JSD-based MIE. However, because the JSD-based MIE requires far fewer negative samples than the NCE-based MIE, it is superior to the NCE-based MIE in training speed.
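For comparison, a hedged PyTorch sketch of the JSD-based estimator of Eq. (10) is shown below; it assumes the same bilinear scores x^T W a as above and that a single negative attribute a_r is sampled per image:

```python
import torch.nn.functional as F

def jsd_mie(score_pos, score_neg):
    """score_pos = x_k^T W a_k and score_neg = x_k^T W a_r, both of shape (B,).
    Returns the estimator of Eq. (10), averaged over the batch;
    its negative is used as the loss to minimize."""
    return (-F.softplus(-score_pos) - F.softplus(score_neg)).mean()
```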
about locality of mutual information can greatly influence the model’s suitability, we also define a local NCE-based MIE. We directly slice the visual feature
9
Journal Pre-proof
vectors xk into T equal length parts, then the average mutual information can
pro of
be maximized:
T F (xtk , ak ; W ) 1 X (LN CE) Ibw log P (xk ; ak ) = . t T t=1 ai ∈As F (xk , ai ; W )
(11)
In our experiments, we investigate the global NCE-based MIE, local NCEbased MIE, and the combination of the two NCE-based MIE: F (xk , ak ; w1) (L+G) Ibw (xk ; ak ) =α log P ai ∈As F (xk , ai ; w1)
(12)
re-
T β X F (xtk , ak ; w2) + log P , t T t=1 ai ∈As F (xk , ai ; w2)
where α and β are the hyper parameters. Depending on the value of these
165
(L+G) (X; A) can learn a standard global NCE-based hyper parameters, the Ibw
MIE (α = 1, β = 0), or a standard local NCE-based MIE (α = 0, β = 1), or the
lP
global & local NCE-based MIE. w1 and w2 are the parameter matrix for the global and local objectives, respectively. The error functions mentioned above will be compared and analyzed in the experimental section.
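The combined objective of Eq. (12) can be sketched in PyTorch as below. For the local term, the sketch assumes a single parameter matrix w2 of shape (d_vis / T, d_att) shared across the T slices, which is one possible reading of Eq. (11); other weight-sharing schemes are equally plausible.

```python
import torch
import torch.nn.functional as F

def local_global_nce(x, A_s, labels, w1, w2, alpha=0.5, beta=0.5, T=4):
    """x: (B, d_vis) visual features; A_s: (M_s, d_att) seen-class attributes;
    labels: (B,) indices of the matching attributes;
    w1: (d_vis, d_att); w2: (d_vis // T, d_att), assumed shared across slices.
    Returns the bound of Eq. (12); its negative is minimized during training."""
    # Global term, Eq. (7): negated cross-entropy over all seen-class attributes.
    global_term = -F.cross_entropy(x @ w1 @ A_s.t(), labels)

    # Local term, Eq. (11): average the NCE bound over T equal-length slices of x.
    local_term = sum(-F.cross_entropy(xs @ w2 @ A_s.t(), labels)
                     for xs in x.chunk(T, dim=1)) / T

    return alpha * global_term + beta * local_term
```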
4. Experiments
To demonstrate the effectiveness and efficiency of the proposed method, we conduct extensive experiments on the five most widely used benchmark datasets for ZSL. First, we introduce the experimental settings, mainly the datasets and evaluation metrics. Then we compare the proposed method with existing state-of-the-art ZSL methods in both the conventional and the generalized settings. Finally, we compare and analyze the several proposed mutual information estimators.
4.1. Experimental Settings

Datasets. Five datasets are considered: the SUN Attribute Database (SUN) [42], Caltech-UCSD Birds 200-2011 (CUB) [43], Animals with Attributes 1 (AWA1) [25], Animals with Attributes 2 (AWA2) [17], and Attribute Pascal and Yahoo (aPY) [44].
Table 1: The details of the five benchmark datasets. A and Y denote the dimensionality of the attributes and the number of classes. S and U are the numbers of seen and unseen classes. N is the number of images. N_s and N_u are the numbers of images of seen and unseen classes, respectively. N_s->ts denotes the number of images of seen classes during test in the GZSL setting. SS = Standard Split, PS = Proposed Split.

Dataset    | A   | Y (S + U)      | N     | SS: N_s / N_u | PS: N_s / N_u / N_s->ts
SUN [42]   | 102 | 717 (645 + 72) | 14340 | 12900 / 1440  | 10320 / 1440 / 2580
CUB [43]   | 312 | 200 (150 + 50) | 11788 |  8855 / 2933  |  7057 / 2967 / 1764
AWA1 [25]  |  85 |  50 (40 + 10)  | 30475 | 24295 / 6180  | 19832 / 5685 / 4958
AWA2 [17]  |  85 |  50 (40 + 10)  | 37322 | 30337 / 6985  | 23527 / 7913 / 5882
aPY [44]   |  64 |  32 (20 + 12)  | 15339 | 12695 / 2644  |  5932 / 7924 / 1483
SUN is a fine-grained dataset that contains 14,340 images from 717 types of scenes, of which 645 types are used for training and 72 for testing. CUB is also a fine-grained dataset, containing 11,788 images of 200 bird categories; we use the split with 150 classes for training and 50 for testing. AWA1 and AWA2 are both coarse-grained datasets sharing the same 50 animal classes; in total, AWA2 has 37,322 images compared to the 30,475 images of AWA1. For both AWA1 and AWA2, we use 40 classes for training and the remaining 10 classes for testing. For aPY, which contains 15,339 images from 32 classes, we use 20 classes for training and the remaining 12 classes for testing. In our experiments, we adopt both the standard train/test splits (SS) and the proposed splits (PS) of [17]; the details of SS and PS on the five datasets are given in Table 1. For the visual features, we extract the 2,048-dimensional top-layer pooling units of the 101-layer ResNet following the pre-training procedure in [17]. We use the Adam optimizer with a base learning rate of 0.05 and a minibatch size of 64.
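A possible training step under these settings is sketched below (Adam with a base learning rate of 0.05 and mini-batches of 64). The random features, attributes and labels are illustrative stand-ins; in practice they would be the pre-extracted ResNet-101 features, the seen-class attribute matrix, and the seen-class label indices.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for one mini-batch (not data released with the paper).
M_s, d_vis, d_att = 150, 2048, 312
A_s = torch.randn(M_s, d_att)                        # seen-class attribute matrix
features = torch.randn(64, d_vis)                    # 64 pre-extracted visual features
labels = torch.randint(0, M_s, (64,))                # matching seen-class indices

W = nn.Parameter(0.01 * torch.randn(d_vis, d_att))   # bilinear parameters of Eq. (5)
optimizer = torch.optim.Adam([W], lr=0.05)           # Adam, base learning rate 0.05

logits = features @ W @ A_s.t()                      # x^T W a_i for every seen class
loss = nn.functional.cross_entropy(logits, labels)   # negative of the bound in Eq. (8)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                     # one step; iterate over mini-batches in practice
```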
Evaluation Metrics. To compare performance with the existing methods, we average the correct predictions independently for each class before dividing by the number of classes, i.e., we use the Mean Class Accuracy (MCA) as the evaluation metric:

MCA = \frac{1}{\|Y\|} \sum_{y \in Y} acc_y,    (13)

where acc_y denotes the top-1 accuracy of class y. In the GZSL setting, both unseen classes y^u and seen classes y^s are considered. Therefore, we adopt MCA_S (the MCA on the seen test classes), MCA_U (the MCA on the unseen test classes), and their harmonic mean (H) as the evaluation metrics in the GZSL setting:

H = \frac{2 * MCA_S * MCA_U}{MCA_S + MCA_U}.    (14)
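Both metrics are straightforward to compute; the short NumPy sketch below mirrors Eqs. (13) and (14), with y_true and y_pred standing for ground-truth and predicted class indices (illustrative names, not from the paper):

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """Eq. (13): per-class top-1 accuracy averaged over the classes present in y_true."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def harmonic_mean(mca_seen, mca_unseen):
    """Eq. (14): harmonic mean of the seen and unseen mean class accuracies."""
    return 2.0 * mca_seen * mca_unseen / (mca_seen + mca_unseen)
```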
Table 2: Comparisons in the generalized setting on the Proposed Split (PS) of [17], measuring top-1 accuracy in %. U = MCA_U and S = MCA_S denote the top-1 accuracy on unseen and seen test classes, respectively; H is their harmonic mean.

Method         |  SUN (U / S / H)    |  CUB (U / S / H)    |  AWA1 (U / S / H)   |  AWA2 (U / S / H)   |  aPY (U / S / H)
DAP [25]       |  4.2 / 25.1 /  7.2  |  1.7 / 67.9 /  3.3  |  0.0 / 88.7 /  0.0  |  0.0 / 84.7 /  0.0  |  4.8 / 78.3 /  9.0
SSE [27]       |  2.1 / 36.4 /  4.0  |  8.5 / 46.9 / 14.4  |  7.0 / 80.5 / 12.9  |  8.1 / 82.5 / 14.8  |  0.2 / 78.9 /  0.4
LATEM [11]     | 14.7 / 28.8 / 19.5  | 15.2 / 57.3 / 24.0  |  7.3 / 71.7 / 13.3  | 11.5 / 77.3 / 20.0  |  0.1 / 73.0 /  0.2
SJE [12]       | 14.7 / 30.5 / 19.8  | 23.5 / 59.2 / 33.6  | 11.3 / 74.6 / 19.6  |  8.0 / 73.9 / 14.4  |  3.7 / 55.7 /  6.9
ESZSL [26]     | 11.0 / 27.9 / 15.8  | 12.6 / 63.8 / 21.0  |  6.0 / 75.6 / 12.1  |  5.9 / 77.8 / 11.0  |  2.4 / 70.1 /  4.6
SYNC [28]      |  7.9 / 43.3 / 13.4  | 11.5 / 70.9 / 19.8  |  8.9 / 87.3 / 16.2  | 10.0 / 90.5 / 18.0  |  7.4 / 66.3 / 13.3
SAE [10]       |  8.8 / 18.0 / 11.8  |  7.8 / 54.0 / 13.6  |  1.8 / 77.1 /  3.5  |  1.1 / 82.2 /  2.2  |  0.4 / 80.9 /  0.9
GFZSL [14]     |  0.0 / 39.6 /  0.0  |  0.0 / 45.7 /  0.0  |  1.8 / 80.3 /  3.5  |  2.5 / 80.1 /  4.8  |  0.0 / 83.3 /  0.0
NCE-based MIE  | 25.9 / 40.2 / 31.5  | 26.7 / 69.3 / 38.5  | 22.6 / 90.6 / 36.2  | 17.9 / 91.9 / 29.9  | 17.3 / 79.5 / 28.4
4.2. Comparisons in Generalized Setting
To show that the proposed method can alleviate the domain shift problem, we first compare it with several state-of-the-art ZSL methods mentioned in the related work in the generalized setting. The experimental results are given in Table 2. An interesting observation is that some methods, e.g., DAP and GFZSL, perform well on the seen test classes (MCA_S) but work poorly on the unseen test classes (MCA_U). In contrast, our method improves the overall performance by an obvious margin for MCA_U, MCA_S and the harmonic mean (H). Especially for MCA_U, the proposed method performs much better than all the state-of-the-art methods. For instance, MCA_S ranks SYNC as the best performing method on SUN and CUB, and GFZSL as the best method on aPY; whereas, for MCA_U, they perform very poorly on all datasets.
Figure 2: Comparisons in the generalized setting on the Proposed Split (PS) of [17] in terms of the harmonic mean. (The bar chart plots the harmonic mean of DAP, SSE, LATEM, SJE, ESZSL, SYNC, SAE, GFZSL, and the NCE-based MIE on SUN, CUB, AWA1, AWA2, and aPY.)
In particular, GFZSL cannot correctly recognize the unseen classes on the SUN, CUB, and aPY datasets at all, and its MCA_U is as low as 0.0. What is more, the proposed method achieves harmonic mean (H) scores 4.9 ~ 16.6% higher (where 4.9 = 38.5 - 33.6 and 16.6 = 36.2 - 19.6) than the previous best results on the five datasets.
urn a
The striking improvement is achieved by inter-class separation during training process. Most existing approaches project either visual features or semantic features from one space to another, or project both features to an shared em215
bedding space. However, these methods fail to recognize unseen classes because they map the visual or semantic features to the fixed anchor points in the embedding space. Our binary classifier on seen classes possesses good separation property by maximizing the mutual information between positive samples pairs,
Jo
and minimizing that of noise pairs. When the binary classifier is used on unseen
220
classes, it can be high discriminative to recognize the new visual features which do not belong to the seen classes. Then the domain shift is alleviated. In order to show intuitively that our method achieves much better result in
generalized setting, we present the harmonic mean obtained by previous state-
13
Journal Pre-proof
Table 3: Comparisons in the conventional setting, using SS = Standard Split and PS = Proposed Split.

Method         |  SUN (SS / PS)  |  CUB (SS / PS)  |  AWA1 (SS / PS) |  AWA2 (SS / PS) |  aPY (SS / PS)
DAP [25]       |  38.9 / 39.9    |  37.5 / 40.0    |  57.1 / 44.1    |  58.7 / 46.1    |  35.2 / 33.8
SSE [27]       |  54.5 / 51.5    |  43.7 / 43.9    |  68.8 / 60.1    |  67.5 / 61.0    |  31.1 / 34.0
LATEM [11]     |  56.9 / 55.3    |  49.4 / 49.3    |  74.8 / 55.1    |  68.7 / 55.8    |  34.5 / 35.2
SJE [12]       |  57.1 / 53.7    |  55.3 / 53.9    |  76.7 / 65.6    |  69.5 / 61.9    |  32.0 / 32.9
ESZSL [26]     |  57.3 / 54.5    |  55.1 / 53.9    |  74.7 / 58.2    |  75.6 / 58.6    |  34.4 / 38.3
SYNC [28]      |  59.1 / 56.3    |  54.1 / 55.6    |  72.2 / 54.0    |  71.2 / 46.6    |  39.7 / 23.9
SAE [10]       |  42.4 / 40.3    |  33.4 / 33.3    |  80.6 / 53.0    |  80.7 / 54.1    |   8.3 /  8.3
GFZSL [14]     |  62.9 / 60.6    |  53.0 / 49.3    |  80.5 / 68.3    |  79.3 / 63.8    |  51.3 / 38.4
CSSD [33]      |    -  /   -     |  52.5 /   -     |  81.2 /   -     |    -  /   -     |  54.1 /   -
NCE-based MIE  |  64.4 / 64.2    |  57.4 / 58.1    |  83.1 / 69.1    |  81.9 / 66.7    |  51.0 / 39.1
To show intuitively that our method achieves much better results in the generalized setting, Figure 2 presents the harmonic mean obtained by the previous state-of-the-art methods and our method on the five datasets. In the histogram, the nine colors denote the harmonic mean (H) scores of the eight existing ZSL methods and the proposed method (the red one) on each dataset. It can be noticed that our method consistently outperforms the previous state-of-the-art ZSL methods on all five datasets. These extensive experiments validate that the proposed method can address the domain shift problem without transductive learning, by mutual information estimation and maximization, instead of mapping the visual features to fixed anchor points in the visual-semantic embedding space.
4.3. Comparisons in Conventional Setting Robustness and generalization are ignored by most existing ZSL methods. In our method, the improved ZSL method where the visual-semantic embedding
Jo
235
is less specialized toward solving a single dataset is required. We compare our method with existing state-of-the-art methods mentioned in the related work. We use both the Standard Split (SS) and the Proposed Split (PS) in [17] to conduct experiments on SUN, CUB, AWA1, AWA2 and aPY. Table 3
240
shows the experimental results in conventional ZSL setting. It can be observed 14
Journal Pre-proof
Table 4: Comparisons of different MIE in both generalized and conventional settings. M =
pro of
M CA denotes the top-1 accuracy on the test data in the conventional settings. U = M CAU and S = M CAS denote Top-1 accuracy on unseen classes and seen test classes, respectively. Model
SUN
CUB
AWA1
M
U
S
H
M
U
S
H
M
U
G
64.2
25.9
40.2
31.5
58.1
26.7
69.3
38.5
69.1
AWA2
aPY
S
H
M
U
S
H
M
U
S
H
22.6
90.6
36.2
66.7
17.9
91.9
29.9
39.1
17.3
79.5
28.4
L
61.9
25.8
41.4
31.7
57.4
26.8
69.5
38.7
69.3
19.3
90.8
31.8
65.8
16.7
91.6
28.3
36.0
14.5
79.8
24.5
L+G
63.8
25.6
41.1
31.6
57.3
27.2
69.6
39.1
68.9
26.3
90.6
40.8
67.5
22.4
91.8
36.0
37.3
14.2
79.3
24.1
JSD-Based
54.2
21.9
30.4
25.5
49.8
28.2
61.5
38.7
68.8
19.7
88.3
32.2
65.8
22.3
89.9
35.8
39.0
17.2
79.2
28.3
It can be observed that the proposed method outperforms the existing approaches on almost all datasets. Notably, on SUN, CUB, AWA1 and AWA2, our method outperforms the previous best methods, which achieve the best results among the existing methods on a single dataset, by a large margin of 1.2 ~ 2.5% (where 1.2 = 81.9 - 80.7 and 2.5 = 83.1 - 80.6) and 0.8 ~ 3.6% (where 0.8 = 69.1 - 68.3 and 3.6 = 64.2 - 60.6) using SS and PS, respectively. The experimental results show that our method effectively utilizes mutual information estimation and maximization between the high-dimensional visual features and the corresponding attributes to facilitate the visual-semantic embedding.
urn a
In the following experiments, G and L refer to the global-only (α = 1, β = 0) and local-only (α = 0, β = 1, T = 4) NCE-based MIE, respectively, and L+G denotes the combination of the local and global NCE-based MIE (α = 0.5, β = 0.5, T = 4). JSD-Based is the MIE based on the Jensen-Shannon Divergence. The ZSL and GZSL results obtained by these MIEs using the proposed split of [17] are presented in Table 4. In general, the mutual information estimators based on NCE and JSD all perform better than the existing state-of-the-art methods on all datasets. In detail, the NCE-based MIEs achieve better and more stable performance than the JSD-based MIE in most experiments; however, while the NCE-based MIEs have a high capability for visual classification, they cannot maintain the same high training speed.

The time consumption per iteration of the different MIEs is reported in Table 5. For each model, we keep the batch size at 64. As shown in Table 5, the training speed of the JSD-based MIE is faster than that of the MIEs based on NCE, especially the local-only NCE-based MIE (L) and the combined local and global NCE-based MIE (L+G). It is worth mentioning, however, that the computational time per iteration of all the MIE models is very short.
Table 5: Comparison of the time consumption per iteration.

Model      | Time per iteration (10^-3 s): SUN / CUB / AWA1 / AWA2 / aPY
G          | 1.2 / 1.3 / 1.6 / 1.6 / 1.3
L          | 2.5 / 2.7 / 2.9 / 2.9 / 2.7
L+G        | 2.9 / 3.0 / 3.3 / 3.3 / 3.1
JSD-Based  | 1.0 / 1.0 / 1.1 / 1.1 / 1.1
As shown in Figure 3, although the overall results of the JSD-based MIE are not as good as those of the NCE-based MIEs, MCA_U ranks the JSD-based MIE as the best performing method on CUB. On the other hand, the local-only NCE-based MIE and the combined local and global NCE-based MIE do not always outperform the global-only NCE-based MIE. In summary, however, the NCE-based MIEs are superior to the JSD-based MIE: because the NCE-based MIE uses many more negative samples than the JSD-based MIE, it is more robust and generalizable.
5. Conclusion
In this paper, we address the fundamental projection domain shift problem in the generalized setting and seek an improved ZSL method whose visual-semantic embedding is robust and generalizes across datasets. We propose to build the visual-semantic embedding by maximizing the mutual information between visual features and the corresponding attributes. Without mapping the visual features to fixed anchor points, the well-known domain shift problem in GZSL is alleviated.
Figure 3: Comparisons of the MCA (M) in the conventional ZSL setting and the harmonic mean (H) in the GZSL setting among the proposed MIEs (G, L, L+G, and JSD-Based).
We further leverage the NCE-based MIE to train a binary classifier that distinguishes the matching image-attribute pairs from the noise samples. Extensive experiments show that the proposed NCE-based MIE achieves better robustness and generalization than most state-of-the-art ZSL methods. Furthermore, we compare and analyze several mutual information estimators in both the conventional ZSL setting and the more realistic GZSL setting; the NCE-based MIE is simpler and produces better results. In the future, transductive ZSL by mutual information estimation and maximization in the GZSL setting will be within our research scope. Meanwhile, other cross-modal applications will be explored by maximizing mutual information.
6. Acknowledgments
This paper is supported by the National Key R&D Program of China under contract No. 2017YFB1002201, the National Natural Science Fund for Distinguished Young Scholar under Grant No. 61625204, the Key Program of the National Science Foundation of China under Grant Nos. 61836006 and 61432014, and the National Natural Science Foundation of China under Grant No. 61602328.
References
[1] C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida, USA, 2009, pp. 951–958. doi:10.1109/CVPRW.2009.5206594.
[2] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, Transductive multi-view zero-shot learning, IEEE Trans. Pattern Anal. Mach. Intell. 37 (11) (2015) 2332–2345. doi:10.1109/TPAMI.2015.2408354.
[3] Y. Fu, L. Sigal, Semi-supervised vocabulary-informed learning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 2016, pp. 5337–5346. doi:10.1109/CVPR.2016.576.
[4] Z. Ji, Y. Yu, Y. Pang, J. Guo, Z. Zhang, Manifold regularized cross-modal embedding for zero-shot learning, Inf. Sci. 378 (2017) 48–58. doi:10.1016/j.ins.2016.10.025.
[5] Y. Yu, Z. Ji, J. Guo, Y. Pang, Transductive zero-shot learning with adaptive structural embedding, IEEE Trans. Neural Netw. Learning Syst. 29 (9) (2018) 4116–4127. doi:10.1109/TNNLS.2017.2753852.
[6] Y. Guo, G. Ding, J. Han, Y. Gao, Zero-shot learning with transferred samples, IEEE Trans. Image Process. 26 (7) (2017) 3277–3290.
[7] G. Ding, Y. Guo, K. Chen, C. Chu, J. Han, Q. Dai, DECODE: Deep confidence network for robust image classification, IEEE Transactions on Image Processing 28 (8) 3752–3765.
[8] R. Socher, M. Ganjoo, C. D. Manning, A. Y. Ng, Zero-shot learning through cross-modal transfer, in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, USA, 2013, pp. 935–943.
[9] Z. Akata, F. Perronnin, Z. Harchaoui, C. Schmid, Label-embedding for image classification, IEEE Trans. Pattern Anal. Mach. Intell. 38 (7) (2016) 1425–1438. doi:10.1109/TPAMI.2015.2487986.
[10] E. Kodirov, T. Xiang, S. Gong, Semantic autoencoder for zero-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 2017, pp. 4447–4456. doi:10.1109/CVPR.2017.473.
[11] Y. Xian, Z. Akata, G. Sharma, Q. N. Nguyen, M. Hein, B. Schiele, Latent embeddings for zero-shot classification, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 2016, pp. 69–77. doi:10.1109/CVPR.2016.15.
[12] Z. Akata, S. E. Reed, D. Walter, H. Lee, B. Schiele, Evaluation of output embeddings for fine-grained image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA, 2015, pp. 2927–2936. doi:10.1109/CVPR.2015.7298911.
[13] X. Li, Y. Guo, Max-margin zero-shot learning for multi-class classification, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS, San Diego, California, USA, 2015.
[14] V. K. Verma, P. Rai, A simple exponential family framework for zero-shot learning, in: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part II, 2017, pp. 792–808. doi:10.1007/978-3-319-71246-8_48.
[15] Y. Li, D. Wang, Zero-shot learning with generative latent prototype model, CoRR abs/1705.09474.
[16] T. Long, X. Xu, F. Shen, L. Liu, N. Xie, Y. Yang, Zero-shot learning via discriminative representation extraction, Pattern Recognition Letters 109 (2018) 27–34.
[17] Y. Xian, C. H. Lampert, B. Schiele, Z. Akata, Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[18] W. Chao, S. Changpinyo, B. Gong, F. Sha, An empirical study and analysis of generalized zero-shot learning for object recognition in the wild, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, 2016, pp. 52–68. doi:10.1007/978-3-319-46475-6_4.
[19] J. Song, C. Shen, Y. Yang, Y. Liu, M. Song, Transductive unbiased embedding for zero-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 2018, pp. 1024–1033. doi:10.1109/CVPR.2018.00113.
[20] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, CoRR abs/1808.06670.
[21] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, A. C. Courville, MINE: mutual information neural estimation, CoRR abs/1801.04062.
[22] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, J. Dean, Zero-shot learning by convex combination of semantic embeddings, in: ICLR, 2014.
[23] M. Gutmann, A. Hyvärinen, Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics, Journal of Machine Learning Research 13 (2012) 307–361.
[24] Z. Ji, Y. Sun, Y. Yu, Y. Pang, J. Han, Attribute-guided network for cross-modal zero-shot hashing, IEEE Transactions on Neural Networks and Learning Systems.
[25] C. H. Lampert, H. Nickisch, S. Harmeling, Attribute-based classification for zero-shot visual object categorization, IEEE Trans. Pattern Anal. Mach. Intell. 36 (3) (2014) 453–465. doi:10.1109/TPAMI.2013.140.
[26] B. Romera-Paredes, P. H. S. Torr, An embarrassingly simple approach to zero-shot learning, in: Proceedings of the 32nd International Conference on Machine Learning, ICML, Lille, France, 2015, pp. 2152–2161.
[27] Z. Zhang, V. Saligrama, Zero-shot learning via semantic similarity embedding, in: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 4166–4174. doi:10.1109/ICCV.2015.474.
[28] S. Changpinyo, W. Chao, B. Gong, F. Sha, Synthesized classifiers for zero-shot learning, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 5327–5336. doi:10.1109/CVPR.2016.575.
[29] L. Liu, S. Wang, B. Hu, Q. Qiong, D. S. Rosenblum, Learning structures of interval-based Bayesian networks in probabilistic generative model for human complex activity recognition, Pattern Recognition 81 (2018) 545–561.
[30] H. Zhang, P. Koniusz, Zero-shot kernel learning, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 7670–7679.
[31] A. Mishra, S. K. Reddy, A. Mittal, H. A. Murthy, A generative model for zero shot learning using conditional variational autoencoders, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 2188–2196.
[32] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, A. Elgammal, A generative adversarial approach for zero-shot learning from noisy texts (2018) 1004–1013.
[33] Z. Ji, J. Wang, Y. Yu, Y. Pang, J. Han, Class-specific synthesized dictionary model for zero-shot learning, Neurocomputing 329 (2019) 339–347.
[34] Y. Guo, G. Ding, X. Jin, J. Wang, Transductive zero-shot recognition via shared model space learning, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 2016, pp. 3434–3500.
[35] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A deep visual-semantic embedding model, in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, USA, 2013, pp. 2121–2129.
[36] R. Linsker, Self-organization in a perceptual network, IEEE Computer 21 (3) (1988) 105–117. doi:10.1109/2.36.
[37] A. J. Bell, T. J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation 7 (6) (1995) 1129–1159. doi:10.1162/neco.1995.7.6.1129.
[38] X. Ji, J. F. Henriques, A. Vedaldi, Invariant information distillation for unsupervised image segmentation and clustering, CoRR abs/1807.06653.
[39] J. Rigau, M. Feixas, M. Sbert, A. Bardera, I. Boada, Medical image segmentation based on mutual information maximization, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2004, 7th International Conference, Saint-Malo, France, September 26-29, 2004, Proceedings, Part I, 2004, pp. 135–142.
[40] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[41] S. Nowozin, B. Cseke, R. Tomioka, f-GAN: Training generative neural samplers using variational divergence minimization, in: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 271–279.
[42] G. Patterson, J. Hays, SUN attribute database: Discovering, annotating, and recognizing scene attributes, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, 2012, pp. 2751–2758.
[43] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, Caltech-UCSD Birds 200, Tech. Rep. CNS-TR-2010-001, Caltech, 2010.
[44] A. Farhadi, I. Endres, D. Hoiem, D. A. Forsyth, Describing objects by their attributes, in: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, 2009, pp. 1778–1785.
Conflict of Interest Declaration

We confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from [email protected].

Signed by all authors as follows: Chenwei Tang, Xue Yang, Jiancheng Lv, Zhenan He
Author Contributions Section

Chenwei Tang: Conceptualization, Methodology, Validation, Investigation, Writing - Original Draft, Writing - Review & Editing.
Xue Yang: Methodology, Software, Validation, Formal analysis.
Jiancheng Lv: Resources, Supervision, Project administration, Funding acquisition.
Zhenan He: Writing - Original Draft, Writing - Review & Editing, Visualization.