Zero-shot Learning for Action Recognition using Synthesized Features

Ashish Mishra, Anubha Pandey and Hema A. Murthy
Indian Institute of Technology Madras
Abstract

The major disadvantage of supervised methods for action recognition is the need for a large amount of annotated data, where each example is accurately matched to its label. Zero-Shot Learning (ZSL) addresses this issue; here, it primarily relies on synthesized data to compensate for the lack of training examples. In this paper, two approaches are proposed for synthesizing artificial examples of novel classes: an inverse autoregressive flow (IAF) based generative model and a bi-directional adversarial GAN (Bi-dir GAN). The proposed approaches are further extended to a transductive setting using a semi-supervised variational autoencoder, where unlabelled data from unseen classes are used to train the model. This enables the generation of novel-class examples from textual descriptions. The proposed models perform well in the following settings: i) the standard setting (ZSL), where the test data come only from unseen classes, and ii) the generalized setting (GZSL), where the test data come from both seen and unseen classes. In the generalized setting, examples with pseudo labels are generated for unseen classes. Experiments are performed on three benchmark datasets, UCF101, HMDB51, and Olympic. In comparison with state-of-the-art approaches, both proposed models, the IAF based generative model and the Bi-dir GAN model, outperform prior work on the UCF101 and Olympic datasets in all settings and achieve comparable results on HMDB51.

Keywords: Generalized Zero-Shot Learning, Inverse Autoregressive Flow, Bi-directional Generative Adversarial Network, Transductive ZSL setting, Inductive ZSL setting.
Figure 1: Complete illustration of ZSL with feedback mechanism. Encoder with auto-regressive flow. Our proposed decoder with skip connection. Dotted red line shows the gradient flow direction.
1. Introduction

Action recognition in video is an important and challenging problem in computer vision. The problem is challenging due to high intra-class and low inter-class variance, complex backgrounds, and noise. Training examples in a supervised setting need to be exhaustive to cover such wide intra-class variance. Zero-shot learning is an emerging technique, which has attracted researchers' attention, that can automatically recognize actions from novel/unseen classes [1, 2, 3]. In conventional supervised techniques for action recognition, the trained model classifies the test data into the seen action classes that appeared during training. Zero-shot learning (ZSL) for action recognition circumvents this problem by classifying novel/unseen classes during the testing phase based on some extra information about the classes, such as attributes or Word2vec [4]/Glove [5] representations. Since class attributes are manually annotated by humans, their content is rich, while Word2vec/Glove representations learn word embeddings from huge text corpora. Generating word embeddings takes less effort, but they are noisy compared with manually labeled attributes. Zero-shot learning is defined in two types of settings:
1. Standard/disjoint setting: In this setting [6, 7, 8, 9, 10], train/seen classes and test/unseen classes are mutually disjoint. During the test phase, test examples come only from the test/unseen classes.
2. Generalized setting: In this setting [11, 12, 13], test data may come from both seen/train and unseen/test classes. This setting is more challenging because models are biased towards the seen/train classes.

In contrast to supervised learning, ZSL uses semantic representations of classes, such as attributes or word2vec embeddings. Existing approaches focus on image classification, while the objective of this paper is to extend ZSL to action recognition [14, 6, 15, 16]. Zero-shot action classification is more challenging than image classification for the following reasons: 1. Image representation is restricted to spatial features, while video representation also includes temporal features. 2. Videos are noisier: the dynamic nature of video affects the background and the appearance of the objects of interest.

Most existing ZSL approaches learn projections among different representation spaces based on the labeled data of seen classes [16, 13, 12, 11, 17, 18, 19]. These projection-based approaches have several limitations, in that they exploit the correlation between seen and unseen classes based on semantic information, while ignoring information derived from the visual representation. When the data distributions of the seen and unseen classes differ, projection-based methods fail for unseen classes. Projection-based approaches are also not suitable for Generalized Zero-Shot Learning (GZSL), where test data may come from both seen and unseen classes, while the model understands seen classes better [7, 18, 13, 19]. These methods perform a nearest-neighbour (NN) search in the learned projection space. In high-dimensional space, some of the seen classes may become hub nodes during testing, leading to the hubness problem [20]. The utility of nearest-neighbour search is significantly reduced if the same objects appear consistently as the search result, irrespective of the query; such objects, termed hubs, indeed occur in high-dimensional space [21]. To address these problems, we propose two generative approaches, one based on a variational autoencoder (VAE) [22] and the other on a generative adversarial network (GAN) [23]. In both approaches we generate the visual representations of seen/unseen classes from class attributes/descriptions; these representations are referred to as pseudo class examples.
Using the synthesized visual features for the unseen classes, we can turn the zero-shot learning problem into a typical supervised action classification problem. In the GZSL approach, the classifier is trained on data from both seen classes and unseen classes (synthesized features). Unlike most existing approaches, both methods work for ZSL as well as GZSL. Most previous methods rely on an NN search in the semantic space, which is not an effective classifier. On all three standard datasets our approach outperforms the existing approaches in both the standard and the generalized setting. In this work, we present two generative models which can be leveraged from the standard/disjoint zero-shot setting to the generalized zero-shot setting for action recognition. Our main contributions in this paper can be summarized as follows:

1. Estimation of posteriors is more accurate owing to the VAE and the use of inverse autoregressive flow (IAF) [24].
2. A decoder with a skip connection is proposed, which leads to more stable training and additionally prevents overfitting [25].
3. A feedback mechanism is used to ensure that the distributions of the input space and the reconstructed space are identical.
4. A bidirectional GAN based generative model is proposed for zero-shot action recognition.
5. Both generative models are extended to the transductive setting using a semi-supervised VAE [26].

2. Related Work

Zero-shot learning (ZSL) was initially proposed for image classification and is now used in different applications such as action classification and image retrieval. In this section, we review related research on image classification and action classification in the zero-shot setting.

ZSL for image classification: Zero-shot learning [1, 2, 3, 27] has drawn the attention of researchers owing to its capability to classify novel classes at test time. In ZSL, the main task is to learn a projection between two spaces using labeled seen-class data. Existing methods can be categorized into two types: projection-based models and generative models. Projection-based models learn a projection between the visual and semantic space, and vice versa, directly [16, 13, 12, 11, 17, 18, 19, 8]. The learned projection is then used for mapping unseen/test classes.
The second type of model corresponds to generative models, which have recently been proposed for zero-shot image classification. Under certain constraints, the distribution of the data is learned. In [18, 19, 28, 29] the learned data distribution is a uni-modal Gaussian, whereas the proposed model is not limited to any specific distribution, since it can simulate any visual space based on the class attribute. While VAE [30] and GAN [23] based approaches have been used for image classification, we extend this idea to action recognition. Two types of settings have been proposed for ZSL, namely the standard setting [6, 7, 8, 9, 10, 31, 32] and the generalized setting [11, 12, 13, 33, 34]. In the standard setting, test examples come only from unseen classes, whereas in the generalized setting, test examples can come from both seen and unseen classes. Projection-based approaches work well for the standard setting, whereas in the generalized setting their performance degrades due to the bias towards train/seen class examples. Generative models show good performance in both the standard and the generalized setting because they can generate examples using class attributes [33, 34, 35, 30]. [36] proposed an approach for zero-shot image classification using semantic embeddings of the classes and a knowledge graph of the classes, modeled with a Graph Convolutional Network, whereas our proposed method is a generative model that synthesizes class features using only the semantic class embedding.

ZSL for action classification: Earlier work on ZSL considered only image classification, whereas several models have recently been proposed for zero-shot action classification [8, 19, 37, 38, 39, 40, 41]. [8] proposed a stage-wise bidirectional latent embedding framework with two subsequent learning stages for zero-shot action recognition. [19] proposed a generative model that estimates the distribution of action class features using the action class attributes. [39] developed Error-Correcting Output Codes (ECOC) derived from a text corpus with well-defined class hierarchy relationships and used them as class attributes for zero-shot action recognition. [38] proposed an evaluation procedure that enables fair use of external data for zero-shot action recognition. [37] proposed to utilize a large-scale training source to learn a Universal Representation (UR) that automatically generalizes to a more realistic Cross-Dataset UAR (CD-UAR) scenario; unseen actions from new datasets can be directly recognized via the UR without further training or fine-tuning on the target dataset. A very recent work [40] introduced an out-of-distribution detector that determines whether the video features belong to a seen or unseen action category.
A zero-shot model can be trained in two different settings: the inductive setting and the transductive setting. In the inductive setting, only labeled examples of seen classes are used to train the model [7, 33, 34, 30, 35, 42], whereas in the transductive setting, unlabelled examples from the unseen classes are also used to train the model [19, 18, 15, 8]. In the proposed work, the IAF based generative model and the bi-directional GAN based generative model are extended to the transductive setting using a concept similar to the semi-supervised VAE [26].

3. Methodology

Suppose the total number of seen/train classes is S and the number of unseen/test classes is U. We are given labeled training examples for the seen classes, whereas for the unseen classes labeled data is unavailable. In the standard zero-shot setting, the test examples belong only to unseen classes, whereas in the generalized zero-shot setting test examples may belong to both the seen and unseen classes. In other words, in the standard setting the test labels come only from the unseen classes, whereas in the generalized setting they may come from the union of seen and unseen classes. Although we do not have labeled examples of unseen classes, the attributes of all classes (seen/unseen) c = 1, ..., S, S+1, ..., S+U are given. The zero-shot model can leverage information from seen classes for unseen classes through knowledge transfer on the basis of attributes. The class attribute vectors are denoted by {a_c} for classes c = 1, ..., S+U, where the attribute vector of class c is a_c ∈ R^L. Note that the labeled training examples of each seen class {x_n, y_n} can be equivalently denoted as {x_n, a_yn}, i.e., a feature vector and class-attribute vector pair for each example. Therefore, assuming a total of N_S examples from the seen classes, the training data from seen classes can be collectively denoted by D_S = {x_n, a_yn}, n = 1, ..., N_S. The goal of ZSL is to learn a classification model using D_S and then use the learned model to predict the labels of unseen test class examples.

Our first proposed model uses an Inverse Autoregressive Flow (IAF) based encoder [24] to transform an input visual feature x into a latent code vector z, which gives a better predictive posterior, close to the prior distribution. The second proposed model consists of two GAN networks, GAN1 and GAN2, arranged sequentially, in which GAN1 generates visual class features conditioned on class attributes, whereas GAN2 generates class attributes conditioned on the generated visual class features. Since this model generates features in both directions, that is, from attributes to visual features and from visual features to attributes, it is referred to as the Bi-directional GAN based generative model. In the next sub-sections, we define some important terminology related to Inverse Autoregressive Flow (IAF).
3.1. Variational Inference and Learning

Suppose x is a set of observed variables and z a set of latent variables, and let p(x, z) be a parametric model of their joint distribution, also called the generative model over the latent variables z. For a given dataset X = {x_1, ..., x_N}, maximum likelihood learning of its parameters maximizes the marginal log-likelihood

\log p(X) = \sum_{i=1}^{N} \log p(x_i)    (1)

Direct optimization of this quantity is not easy because the marginal likelihood is intractable. To address this problem, a parametric inference model q(z|x) over the latent variables is introduced, and the model is trained by optimizing the variational lower bound on the marginal log-likelihood:

\log p(x) \geq E_{q(z|x)}[\log p(x, z) - \log q(z|x)] = \mathcal{L}(x; \theta)    (2)

where the p and q models are parametrized by θ. Since Kullback-Leibler divergences D_KL(·||·) are non-negative, L(x; θ) is a lower bound on log p(x) and can be written as:

\mathcal{L}(x; \theta) = \log p(x) - D_{KL}(q(z|x) \,\|\, p(z|x))    (3)

The lower bound L(x; θ) can be optimized in various ways, for example via re-parameterization of q(z|x) for continuous z. From Equation 3, to maximize L(x; θ) with respect to θ, log p(x) should be maximized and D_KL(q(z|x)||p(z|x)) should be minimized. The minimum value of D_KL(q(z|x)||p(z|x)) is close to 0, so the maximum value of L(x; θ) is close to log p(x). Since q(z|x) is the inference model and p(z|x) is the true posterior, minimizing D_KL(q(z|x)||p(z|x)) means making the inference distribution as close as possible to the true posterior. If the true posterior is very complex, then a single latent variable z is not sufficient to make the inference distribution close to the true distribution. In order to capture complex structure, we need an inference model with multiple latent variables, which can be represented as a factorization of partial inference models, e.g., q(z_a, z_b|x) = q(z_a|x) q(z_b|z_a, x). In this paper, we use an inverse autoregressive flow (IAF) based variational autoencoder in which the latent code vector z is computed iteratively in several steps in an autoregressive way. After each step we get a better refined latent code vector for the input visual feature vector x; after t steps it is defined as z_t = f_t(z_{t-1}, x), where f_t is some transformation function.
3.2. Normalizing Flow

The inference network needs to be highly flexible so that it can predict a posterior distribution that is close to the true posterior. Normalizing flow is a popular approach in the literature for variational prediction of the posterior over a latent space. A normalizing flow [43] relies on a sequence of invertible mappings that transform an initial probability density. Let z_0 be the initial random variable with a simple probability density function q(z_0|x) and let z_T be the final output of a sequence of invertible transformations f_t applied to z_0, computed as z_t = f_t(z_{t-1}, x) for t = 1, ..., T. If the Jacobian determinant of each f_t can be computed, the final probability density function is

\log q(z_T|x) = \log q(z_0|x) - \sum_{t=1}^{T} \log \left| \det \frac{\partial z_t}{\partial z_{t-1}} \right|
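For clarity, the accumulation of log-densities over such a chain of transforms can be written as a short helper. The sketch below is illustrative only (not the authors' implementation) and assumes each transform returns the new sample together with the log-determinant of its Jacobian.

```python
import torch

def flow_log_density(z0, log_q0, transforms, x=None):
    """Apply a chain of invertible transforms and track log q(z_T | x).

    Each element of `transforms` is assumed to be a callable returning
    (z_t, log_det_jacobian) for the step z_t = f_t(z_{t-1}, x).
    """
    z, log_q = z0, log_q0
    for f in transforms:
        z, log_det = f(z, x)
        # Change of variables: log q(z_t) = log q(z_{t-1}) - log |det dz_t/dz_{t-1}|
        log_q = log_q - log_det
    return z, log_q

# Tiny usage example with a single affine transform z1 = mu + sigma * z0:
mu, log_sigma = torch.zeros(4), torch.zeros(4)
affine = lambda z, x: (mu + log_sigma.exp() * z, log_sigma.sum())
z0 = torch.randn(4)
log_q0 = torch.distributions.Normal(0.0, 1.0).log_prob(z0).sum()
zT, log_qT = flow_log_density(z0, log_q0, [affine])
```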
3.3. Inverse Autoregressive Transformations

Let y be a variable modeled by a Gaussian version of an autoregressive model, and let [μ(y), σ(y)] denote the function mapping the vector y to the means μ and standard deviations σ. Due to the autoregressive structure, the Jacobian of this function is lower triangular with zeros on the diagonal. The mean and standard deviation of the i-th element are computed from y_{1:i-1}, i.e., the previous elements of y. Sampling from such a model is a sequence of transformations from a noise vector ε ∼ N(0, I) to the corresponding vector y: y_0 = μ_0 + σ_0 ε_0 and, for i > 0, y_i = μ_i(y_{1:i-1}) + σ_i(y_{1:i-1}) · ε_i. Because variational inference requires sampling from the posterior, such models are not interesting to use directly. However, the inverse transformation is interesting for normalizing flows: for all i > 0 the transformation y_i = μ_i(y_{1:i-1}) + σ_i(y_{1:i-1}) · ε_i is one-to-one and can be inverted as

\epsilon_i = \frac{y_i - \mu_i(y_{1:i-1})}{\sigma_i(y_{1:i-1})}

Two important observations about IAF are given below:

• Since the samples ε_i are uncorrelated, these variables can be computed in parallel, so the inverse transformation ε = (y - μ(y))/σ(y) (with element-wise subtraction and division) can be parallelized.

• The inverse autoregressive operation has a simple Jacobian determinant: the Jacobian is lower triangular, so the log-determinant of the transformation is simple to compute:

\log \left| \det \frac{\partial \epsilon}{\partial y} \right| = \sum_{i=1}^{D} -\log \sigma_i(y)
3.4. IAF step

This is shown in Figure 1. The output of the initial encoder network is μ_0, σ_0 and an extra output h, which is provided as feedback (context) to each step in the flow. In other words, the latent code is refined iteratively based on μ_0, σ_0 and h from the initial encoder. The sampled vector from the latent space of the initial encoder is defined as

z_0 = \mu_0 + \sigma_0 \odot \epsilon    (4)

where ε ∼ N(0, I). After t steps, the iterative refinement of the sample z_0 is defined as

z_t = \mu_t + \sigma_t \odot z_{t-1}    (5)
In this sequential procedure, the predicted posterior fits the true posterior more closely. Finding an appropriate latent space for sampling is a crucial part of generative models. Conventionally, VAE and GAN based methods compute the latent space in one step, which is insufficient to capture complex distributions. An IAF based VAE, on the other hand, transforms the predicted posterior towards the true posterior using a sequence of simple transformations; composed together, these simple transformations can represent a complex distribution. This reduces the difference between the estimated and true posterior.

4. IAF based Proposed Model

The components of the IAF based generative model are discussed below.

4.1. Inverse Autoregressive Flow (IAF) based Encoder

This is an improved version of a variational autoencoder, which uses inverse autoregressive flow (IAF) to approximate the posterior over a latent vector. IAF consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network. These invertible transformations transform the input space in a way that scales well to high-dimensional latent spaces. The difference between an IAF based variational autoencoder and a simple variational autoencoder lies only in the encoder part: the IAF based encoder consists of a simple encoder coupled with an IAF. Figure 1 shows the pipeline of the IAF based variational autoencoder model for the zero-shot action classification problem. In the IAF based variational autoencoder, the model uses multiple latent variables, whereas a simple variational autoencoder uses a single latent variable.
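The sketch below illustrates how such an IAF based encoder could be written. It is a simplification rather than the authors' code: the plain linear layer stands in for the MADE-style autoregressive network mentioned in Section 7.7, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class IAFStep(nn.Module):
    """One refinement z_t = mu_t + sigma_t * z_{t-1}, with context h."""
    def __init__(self, z_dim, h_dim):
        super().__init__()
        # Stand-in for an autoregressive (MADE-style) network.
        self.net = nn.Linear(z_dim + h_dim, 2 * z_dim)

    def forward(self, z, h):
        mu, s = self.net(torch.cat([z, h], dim=-1)).chunk(2, dim=-1)
        sigma = torch.sigmoid(s + 1.5)            # keep sigma in (0, 1)
        z = mu + sigma * z                        # Eq. (5)
        log_det = torch.log(sigma).sum(dim=-1)    # lower-triangular Jacobian
        return z, log_det

class IAFEncoder(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim, num_steps=5):
        super().__init__()
        self.z_dim = z_dim
        self.base = nn.Linear(x_dim, 2 * z_dim + h_dim)   # mu_0, log_sigma_0, h
        self.steps = nn.ModuleList([IAFStep(z_dim, h_dim) for _ in range(num_steps)])

    def forward(self, x):
        out = self.base(x)
        h_dim = out.size(-1) - 2 * self.z_dim
        mu0, log_sigma0, h = out.split([self.z_dim, self.z_dim, h_dim], dim=-1)
        eps = torch.randn_like(mu0)
        z = mu0 + log_sigma0.exp() * eps          # Eq. (4): z_0 = mu_0 + sigma_0 * eps
        for step in self.steps:
            z, _ = step(z, h)                     # iterative refinement of the latent code
        return z
```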
4.2. VAE with feedback mechanism

In our model, the IAF based encoder consists of a standard encoder coupled with the IAF module; its output is a refinement of the latent code initialized by the standard encoder, denoted as p_E(z_t|x) with parameters θ_E. The discriminator/regressor output distribution is p_R(a|x), and the decoder/generator is denoted by p_G(x|z, a) with parameters θ_G. The VAE loss function is then given by (assuming the regressor to be fixed):

L_{VAE}(\theta_E, \theta_G) = -E_{p_E(z_t|x),\, p(a|x)}[\log p_G(x|z_t, a)] + KL(p_E(z_t|x) \,\|\, p(z_t))    (6)
where the first term on the R.H.S. is the generator's reconstruction error and the second term promotes the VAE posterior (the encoder) to be close to the prior.

4.3. Discriminator/Regressor

In our proposed model, the discriminator/regressor is defined by a probabilistic model p_R(a|x) with parameters θ_R. This is a feedforward neural network that learns to project an example x ∈ R^D to its corresponding class-attribute vector a ∈ R^L. The regressor is learned using two sources of data:

• Labeled examples {x_n, a_yn}, n = 1, ..., N_S, from the seen classes, on which we can define a supervised loss:

L_{Sup}(\theta_R) = -E_{\{x_n, a_{y_n}\}}[p_R(a_{y_n}|x_n)]    (7)

• Synthesized examples x̂ from the generator, on which we can define an unsupervised loss:

L_{Unsup}(\theta_R) = -E_{p_{\theta_G}(\hat{x}|z_t, a)\, p(z_t)\, p(a)}[p_R(a|\hat{x})]    (8)

The weighted combination of the supervised and unsupervised losses is the overall training objective of the discriminator/regressor:

\min_{\theta_R} L_R = L_{Sup} + \lambda_R \cdot L_{Unsup}    (9)
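A minimal sketch of the regressor objective in Equation 9 is given below. The names and signatures are illustrative, and p_R(a|x) is treated here as an isotropic Gaussian, so the likelihood terms reduce to mean-squared errors on the attribute vectors.

```python
import torch
import torch.nn.functional as F

def regressor_loss(regressor, x_real, a_real, x_fake, a_fake, lambda_r=0.1):
    """Weighted supervised + unsupervised regressor loss (cf. Eqs. 7-9).

    With an isotropic Gaussian p_R(a|x), the negative likelihood reduces to an
    MSE between predicted and target attribute vectors.
    """
    loss_sup = F.mse_loss(regressor(x_real), a_real)      # seen-class examples
    loss_unsup = F.mse_loss(regressor(x_fake), a_fake)    # synthesized examples
    return loss_sup + lambda_r * loss_unsup
```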
Discriminator-Driven Learning

The discriminator/regressor guides the generator to improve the generated examples by back-propagating its error, which encourages the decoder/generator to generate examples x̂ that correspond to the class-attribute vector a. This is achieved using the loss functions described below.

For the first loss function, suppose the generator produces poor examples, so that the discriminator can classify them as fake/generated very easily; in this case the discriminator/regressor will not regress to the correct attribute value. The first loss therefore encourages the generator to generate examples for which the attribute vector recovered by the discriminator is correct:

L_c(\theta_G) = -E_{p_G(\hat{x}|z_t, a)\, p(z_t)\, p(a)}[\log p_R(a|\hat{x})]    (10)

Another loss function acts as a regularizer that encourages the generator to produce a good class-specific sample even from a random z_t drawn from the prior p(z_t) and combined with a class attribute a drawn from p(a):

L_{Reg}(\theta_G) = -E_{p(z_t)\, p(a)}[\log p_G(\hat{x}|z_t, a)]    (11)

These two loss functions increase the coherence of x̂ ∼ p_G(x̂|z, a) with the class attribute a. A third loss function ensures that the sampling distribution p(z_t) and the distribution obtained from the generated examples p_E(z_t|x̂) follow the same distribution:

L_E(\theta_G) = -E_{\hat{x} \sim p_G(\hat{x}|z_t, a)} KL(p_E(z_t|\hat{x}) \,\|\, q(z_t))    (12)

Hence the complete learning objective for the generator and encoder is given by:

\min_{\theta_G, \theta_E} L_{VAE} + \lambda_c \cdot L_c + \lambda_{reg} \cdot L_{Reg} + \lambda_E \cdot L_E    (13)

Here λ_c, λ_reg and λ_E are hyper-parameters, decided by cross-validation.
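As a rough illustration of Equations 6 and 10-13, the combined generator/encoder update could be sketched as below. This is not the authors' implementation: the encoder/decoder/regressor signatures are assumptions, and the individual terms are simple surrogates (Gaussian likelihoods and a standard-normal prior) for the expressions above.

```python
import torch
import torch.nn.functional as F

def generator_objective(x, a, encoder, decoder, regressor,
                        lambda_c=0.1, lambda_reg=0.1, lambda_e=0.1):
    """Sketch of Eq. 13: L_VAE + lambda_c*L_c + lambda_reg*L_Reg + lambda_E*L_E."""
    # L_VAE: reconstruction + KL between the encoder posterior and a N(0, I) prior.
    z, mu, log_var = encoder(x)                   # assumed: returns a sample and its stats
    x_rec = decoder(z, a)
    l_vae = F.mse_loss(x_rec, x) - 0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())

    # L_c: the regressor should recover the attribute from a generated sample.
    z_prior = torch.randn_like(z)
    x_gen = decoder(z_prior, a)
    l_c = F.mse_loss(regressor(x_gen), a)

    # L_Reg (surrogate): samples from prior noise should still resemble real class features.
    l_reg = F.mse_loss(x_gen, x)

    # L_E (surrogate): encoder statistics of generated samples should match the prior.
    _, mu_gen, log_var_gen = encoder(x_gen)
    l_e = -0.5 * torch.mean(1 + log_var_gen - mu_gen.pow(2) - log_var_gen.exp())

    return l_vae + lambda_c * l_c + lambda_reg * l_reg + lambda_e * l_e
```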
4.4. Proposed Decoder

The proposed decoder is a combination of a deep and a shallow network. The deep network is responsible for a better reconstruction of the visual space, while the shallow network reduces over-fitting. This architecture is inspired by ResNet [25], where the network has skip connections. These skip connections provide more paths for information propagation through the network: some paths are deep, and some are shallow. The advantage of such a network is that, although the deep part is vulnerable to vanishing/exploding gradient problems, gradients can still flow backward through the shallow parts of the network.
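A minimal sketch of such a decoder, combining a deep path with a shallow skip path, is shown below; the layer sizes follow Section 7.7 but the class is otherwise illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class SkipDecoder(nn.Module):
    """Decoder combining a deep path and a shallow (skip) path, as in Section 4.4."""
    def __init__(self, z_dim, a_dim, x_dim, hidden=2048):
        super().__init__()
        in_dim = z_dim + a_dim
        self.deep = nn.Sequential(                 # deep path: better reconstruction
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )
        self.shallow = nn.Linear(in_dim, x_dim)    # skip path: short route for gradients

    def forward(self, z, a):
        h = torch.cat([z, a], dim=-1)
        # Summing the two paths gives gradients a short route back to the input,
        # which stabilises training and reduces over-fitting.
        return self.deep(h) + self.shallow(h)

# Usage sketch (dimensions illustrative): x_hat = SkipDecoder(28, 115, 4096)(z, a)
```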
The Complete Model Architecture

In this section, we describe the structure of our proposed model in detail; the block diagram is shown in Figure 1. The model is based on an IAF based variational autoencoder. It consists of an encoder coupled with an IAF module, which provides the latent code for an input example as p(z_t|x), where z_t is the latent code after t steps and t is a hyperparameter. The idea is to transform this latent space iteratively such that, if we sample a latent code vector z_t from this space and pass it to the decoder module, we can generate features that discriminate well across different classes. The decoder/generator module is represented as p(x|z_t, a). Unlike a standard VAE, the decoder's input consists of the latent code vector z_t and the class-attribute vector a, similar to the architecture of a conditional variational autoencoder (CVAE) [30]. Compared with a traditional CVAE, our model has a few key differences, motivated by the goal of generating class features based on class attributes:

• It includes a discriminator, similar to a multi-output regressor, that learns a projection from a real example x of seen classes, or a generated example x̂ of seen/unseen classes, to the corresponding class attributes. Backpropagation of the regressor's loss guides the decoder/generator to generate representative features for each class.

• The discriminator allows our model to be used in a semi-supervised setting, where some training examples do not have class-attribute vectors; the output of the discriminator, p(a|x), acts as the class attributes for these examples.

• A feedback link from the decoder/generator to the encoder ensures that the synthesized examples x̂ are very similar to actual examples x, that is, the distribution p(z_t|x̂) is approximately identical to the distribution p(z_t|x).

Our proposed model architecture draws inspiration from recent work on controllable text generation [44] and on improved variational inference with inverse autoregressive flow (IAF) [24], where the aim is to generate a sentence having certain desired characteristics, such as positive/negative sentiment, by specifying a binary attribute. The IAF based encoder progressively finds a latent code vector z_t for an input feature such that, using this latent code z_t and the corresponding class attribute a, the decoder/generator can reconstruct class features that discriminate well across classes. The focus of [44] was text generation, whereas the goal of our model is to exploit this framework for zero-shot action classification in both the generalized and standard settings.
A Support Vector Machine (SVM) is used as the classifier. The SVM is trained on pseudo-labeled examples generated by the model. For the standard ZSL setting, the SVM classifier is trained on pseudo-labeled examples of unseen/test classes, whereas for the GZSL setting it is trained on pseudo examples of both the seen/train and unseen/test classes. The SVM trained on pseudo examples is used for classification during the testing phase. Figure 1 shows the key components of the proposed architecture.

5. Bi-directional generative adversarial based proposed model (Bi-dir GAN)

In this section, we propose a GAN [23] based generative model which learns in both directions using adversarial learning. It consists of two GANs: one that generates visual features conditioned on class attributes, and another that takes the generated features and generates class attributes. An additional classifier is used to predict the class labels of the generated features. The overall architecture of our proposed model is shown in Figure 2. G1 and D1 are the components of GAN-1, and G2 and D2 are the components of GAN-2. C is the additional classifier component. The parameters of D1, G1, D2, G2 and C are denoted by θ_d1, θ_g1, θ_d2, θ_g2 and θ_c respectively.

The generator G1 : A × Z → X̂ takes random noise z ∼ N(0, 1), concatenated with the class-attribute vector of a class, and produces a sample x̂ that is similar to a real sample from that class. The discriminator D1 : X → [0, 1] tries to distinguish such generated samples from actual samples x of the real data distribution. Similarly, the generator G2 : X̂ × Z → Â takes random noise z ∼ N(0, 1), concatenated with the feature generated by G1, and produces a class attribute â that is similar to a real attribute of that class. The discriminator D2 : A → [0, 1] tries to distinguish such generated class attributes â from the actual class attribute a. The goal of the classifier network C : X̂ → Y is to take the generated sample from G1 and classify it into the original class; the presence of the classifier module C ensures that the generated sample has the same characteristics as its class. Let L_D1, L_D2, L_G1, L_G2 and L_C denote the objective functions of the discriminator D1, discriminator D2, generator G1, generator G2 and classifier C respectively. The objective function L_D1 for the discriminator D1 is defined as:

L_{D_1}(\theta_{d_1}) = E_{x \sim q(x)} D_1(x|\theta_{d_1}) - E_{\hat{x} \sim P_{\theta_{g_1}}} D_1(\hat{x}|\theta_{d_1})    (14)
The objective function L_D2 for the discriminator D2 is defined as:

L_{D_2}(\theta_{d_2}) = E_{a \sim q(a)} D_2(a|\theta_{d_2}) - E_{\hat{a} \sim P_{\theta_{g_2}}} D_2(\hat{a}|\theta_{d_2})    (15)

The objective function L_G2 for the generator G2 is defined as:

L_{G_2}(\theta_{g_2}) = -E_{\hat{x} \sim P_{\theta_{g_1}},\, z \sim N(0,1)} D_2(G_2(\hat{x}, z|\theta_{g_2})|\theta_{d_2})    (16)

The objective function L_C for the classifier C is defined as:

L_C(\theta_c) = E_{\hat{x} \sim P_{\theta_{g_1}}} C(\hat{x}|\theta_c)    (17)

The objective function L_G1 for the generator G1 is defined as:

L_{G_1}(\theta_{g_1}) = -E_{a,\, z \sim N(0,1)} D_1(G_1(a, z|\theta_{g_1})|\theta_{d_1}) + \|\hat{x} - x\|_2 + L_C(\theta_c) + L_{G_2}(\theta_{g_2})    (18)
The loss function L_G1(θ_g1) depends on L_D1(θ_d1), L_D2(θ_d2), L_G2(θ_g2) and L_C(θ_c), which helps the generator G1 generate discriminative features for unseen classes given their class attributes. These generated features are used to train an SVM classifier to predict the class labels of unseen classes.

Figure 2: Complete illustration of the Bi-directional adversarial GAN.
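To make Equations 14-18 concrete, a sketch of the G1 update is given below. It is illustrative only: the module names and signatures are assumptions, the discriminators are assumed to return unnormalized scores, and the classifier term uses cross-entropy as a stand-in for Equation 17.

```python
import torch
import torch.nn.functional as F

def g1_loss(G1, G2, D1, D2, C, x, a, y, z_dim):
    """Sketch of Eq. 18: adversarial term + reconstruction + classifier + G2 terms."""
    z1 = torch.randn(x.size(0), z_dim)
    x_hat = G1(torch.cat([a, z1], dim=-1))          # synthesize visual features from attributes

    adv = -D1(x_hat).mean()                         # fool D1 (first term of Eq. 18)
    rec = (x_hat - x).pow(2).sum(dim=-1).mean()     # ||x_hat - x||^2 term
    cls = F.cross_entropy(C(x_hat), y)              # classifier term (surrogate for Eq. 17)

    z2 = torch.randn(x.size(0), z_dim)
    a_hat = G2(torch.cat([x_hat, z2], dim=-1))      # regenerate attributes from x_hat
    g2_term = -D2(a_hat).mean()                     # L_G2 term (Eq. 16)

    return adv + rec + cls + g2_term
```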
6. Transductive setting using a Semi-Supervised Variational Autoencoder

In the inductive setting of zero-shot learning, the unlabeled data of unseen classes are not used to train the model, whereas in the transductive setting the unlabeled data of unseen classes are also used for training. The concept of the semi-supervised variational autoencoder [26] can be used to extend the proposed approaches, namely IAF and Bi-dir GAN, to the transductive setting. Using the proposed models we can generate labeled examples for unseen classes based on their attributes; these generated examples are referred to as pseudo-labeled examples. Merging these pseudo-labeled examples with the unlabeled examples of unseen classes creates a semi-supervised setup for the ZSL problem. Suppose D_l = {x_l, y_l}, l = 1, ..., N, is the dataset of pseudo-labeled examples of unseen classes and D_u = {x_u}, u = 1, ..., M, is the dataset of unlabeled examples of unseen classes; D = {D_l, D_u} is the set of all pseudo-labeled and unlabeled examples of unseen classes. We denote the encoder, decoder and classifier networks by E_φ, D_θ and C_ω and their parameters by φ, θ and ω respectively.

In this model, we have two cases. In the first case, the class label of a data point is observed; these are the pseudo-labeled examples. For these, the variational bound and the loss function L(x, y) are defined as:

\log p_\theta(x, y) \geq E_{q_\phi(z|x,y)}[\log p_\theta(x|y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z|x, y)] = -L(x, y)    (19)

In the second case, the label of the data point is missing. The class label is then treated as a latent variable over which we perform posterior inference. The bound for a data point with an unobserved label y and the loss function U(x) are defined as:

\log p_\theta(x) \geq E_{q_\phi(y,z|x)}[\log p_\theta(x|y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(y, z|x)]
= \sum_{y} q_\phi(y|x)\,(-L(x, y)) + H(q_\phi(y|x)) = -U(x)    (20)

The bound on the marginal likelihood of the entire dataset (labeled and unlabeled) is defined as:

F = \sum_{(x,y) \sim p_l} L(x, y) + \sum_{x \sim p_u} U(x)    (21)
Encoder E(x) : X −→ Z, encodes image features to a latent representation. Decoder D(z, y) : Z × Y −→ X, reconstructs the input x from the latent z conditioned on y, where y is one-hot representation of the class label. Classifier C(x) : X −→ Y , predicts the class label of image features.
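A compact sketch of how the two bounds combine into the objective of Equation 21 is given below. The encoder, decoder and classifier signatures are assumptions, and the reconstruction term is taken to be a Gaussian (mean-squared error) likelihood.

```python
import torch
import torch.nn.functional as F

def labelled_bound(x, y_onehot, encoder, decoder):
    """L(x, y) of Eq. 19 (up to the constant log p(y) term)."""
    mu, log_var = encoder(x, y_onehot)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
    x_rec = decoder(z, y_onehot)
    rec = F.mse_loss(x_rec, x, reduction='none').sum(-1)             # -log p(x|y, z)
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)    # KL to N(0, I)
    return rec + kl

def unlabelled_bound(x, classifier, encoder, decoder, num_classes):
    """U(x) of Eq. 20: expectation of L(x, y) under q(y|x) minus its entropy."""
    probs = classifier(x).softmax(dim=-1)                            # q(y|x)
    loss = 0.0
    for c in range(num_classes):
        y = F.one_hot(torch.full((x.size(0),), c, dtype=torch.long), num_classes).float()
        loss = loss + probs[:, c] * labelled_bound(x, y, encoder, decoder)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    return loss - entropy

# Overall objective (Eq. 21): sum of L(x, y) over pseudo-labelled data
# plus sum of U(x) over unlabelled data.
```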
Figure 3: Transductive setting using a semi-supervised VAE
Dataset | #videos | #classes | seen/unseen | Attribute dim
UCF101 | 13320 | 101 | 51/50 | 115
HMDB51 | 6766 | 51 | 26/25 | N/A
Olympic | 783 | 16 | 8/8 | 40

Table 1: Dataset details and seen/unseen splits for the three datasets used in our experiments.
7. Experimental Details

7.1. Datasets and Settings

The proposed models are evaluated on three challenging action classification datasets: UCF101 [45], HMDB51 [46] and Olympic [11]. HMDB51 was specifically created for human action recognition; it contains a total of 6766 videos with 51 actions, collected from various sources. UCF101 is the most challenging and realistic action recognition dataset; it contains 101 classes collected from YouTube, with a total of 13320 videos. The Olympic dataset, also collected from YouTube, corresponds to sports; it contains 783 videos with 16 different action classes. For the UCF101 and Olympic datasets, human-annotated attribute information is available, whereas HMDB51 does not provide attribute information; therefore, only word2vec information can be used for the HMDB51 dataset.
A summary of the datasets used in the study is given in Table 1.

7.2. Zero-Shot settings

We evaluate our models in two types of ZSL setting: the standard setting (ZSL) and the generalized setting (GZSL). In the standard setting, the train/seen classes and test/unseen classes are mutually exclusive, that is, test examples belong only to unseen/test classes. In the generalized setting, test examples may belong to both seen/train and unseen/test classes.

7.3. Inductive and Transductive learning settings

We evaluate the proposed models in the inductive and transductive settings for zero-shot learning. In contrast to conventional inductive approaches [7, 12, 13, 19] that only use labeled data from seen classes, transductive zero-shot learning methods additionally leverage unlabeled data from unseen classes. We extend our models to the transductive setting using a semi-supervised VAE [26], in which synthesized features for unseen classes act as labeled examples for the test data. We observe that adding unlabeled features from the test data to training improves the model accuracy for unseen classes significantly.

7.4. Visual features

We use the C3D convolutional visual features provided in [47], as these features give state-of-the-art results on zero-shot action classification. For a fair comparison, we use the same splits and visual features provided by [8].

7.5. Class attributes (Semantic representation)

Class attributes/semantic representations describe a class semantically. Two types of semantic representation are used in zero-shot learning: (1) class attributes, which are manually annotated semantic representations, and (2) word vector representations, which are learned automatically by a skip-gram model [4] trained on the Google News text corpus and represent each word by a 300-dimensional semantic vector. Class-attribute vectors are publicly available for the UCF101 and Olympic datasets, whereas there are no class-attribute vectors for HMDB51. For UCF101 and Olympic data, we experiment with both attribute and word vector representations, whereas for HMDB51 we use the word vector representation.
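For illustration, class embeddings of this kind can be built from the released Google News vectors with gensim; the file path and the word-averaging strategy below are assumptions about the setup, not a description of the authors' exact pipeline.

```python
import numpy as np
from gensim.models import KeyedVectors

# Path to the publicly released 300-d Google News vectors (assumed local file name).
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def class_embedding(class_name):
    """300-d semantic vector for a class name, averaging the vectors of its words.

    Multi-word action names (e.g. 'ride horse') are averaged word-by-word;
    words missing from the vocabulary are skipped.
    """
    words = [w for w in class_name.lower().split() if w in w2v]
    return np.mean([w2v[w] for w in words], axis=0) if words else np.zeros(300)
```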
7.6. Evaluation Metric

The models are evaluated using the 30 different splits into seen and unseen classes provided by [8] for the UCF101 (51/50) and HMDB51 (26/25) datasets. For the Olympic dataset, ten random splits into seen and unseen classes (8/8) are generated. The results are reported as averages over 5 trials of the 30 different splits and are presented in terms of mean and standard deviation. In the generalized setting, 20% of the features from each seen class are added to the unseen-class test data. The final evaluation metric is the harmonic mean of the average accuracies of the seen and unseen classes, as used in [12, 11, 13].

7.7. Implementation Details and Model Architecture

Implementation Details: IAF based generative model. In all experiments, for all datasets, we use a two-layer architecture for the encoder. Each layer has 1024 dimensions, while the latent space parameters μ, σ have size attribute-dim/4. In the IAF we use a five-layer architecture, and the MADE [43] implementation is used for the autoregressive architecture. For the decoder we use a combination of deep and shallow networks with four layers; each layer has 2048 dimensions, and this remains the same for all datasets. The optimized hyper-parameter values for λ_c, λ_reg and λ_E are set to 0.1.

Implementation Details: Bi-dir GAN. The Bi-dir GAN consists of the generators G1 and G2, the discriminators D1 and D2, and the classifier C, as shown in Figure 2. All modules consist of a series of two fully-connected (FC) layers, each followed by a ReLU layer. A d_z-dimensional noise vector z (the dimension varies with the dataset), concatenated with a d_a-dimensional class-attribute vector a, is fed into the generator G1. G1 passes the input through two FC layers having 1024 and d_x neurons respectively (where d_x is the image feature dimension, which varies with the dataset) and outputs a d_x-dimensional feature vector x̂ corresponding to a real image feature vector. The discriminator module D1 tries to distinguish between the real image features x and the features x̂ generated by G1: it takes a d_x-dimensional feature vector, passes it through two FC layers having 1024 and 128 neurons respectively, and outputs the probability of the features being real. The generator network G2 takes features generated by G1 and tries to regenerate the class-attribute vector a; it passes the input through two FC layers having 1024 and d_a neurons respectively, and its output is the d_a-dimensional vector â. The output of G2 is further fed to the discriminator module D2, whose architecture is similar to that of D1. D2 tries to distinguish between the real class-attribute features a
and the class-attribute features â generated by G2. The output of the generator G1 is also fed to the classifier module C, which predicts the class label of the generated features x̂. We train the network parameters using the Adam optimizer on the loss functions given in Equations 14, 15, 16 and 18, with learning rate 0.001 and batch size 50. The input feature dimension d_x is 4096, 4096 and 512 for the HMDB51, UCF101 and Olympic datasets respectively. The class-attribute dimension d_a is 300, 115 and 40 for HMDB51, UCF101 and Olympic respectively. The noise dimension d_z used for training is 100, 50 and 20 for HMDB51, UCF101 and Olympic respectively.

Implementation Details: Semi-supervised VAE. The semi-supervised VAE consists of the encoder E, decoder D, and classifier C modules, as shown in Figure 3. The input to the network, X = {x_l, x_u}, consists of labeled and unlabeled features from the unseen classes of the dataset. The labeled features x_l of the unseen classes are synthesized features generated by the trained Bi-dir GAN and IAF based generative models described in Sections 4 and 5. We pass the input feature vector X to the encoder E and classifier C modules. The encoder E has two fully connected layers with 500 and 100 neurons respectively, each with a softplus activation function; E encodes the input feature vector X into a 100-dimensional latent space with mean μ and variance σ. We pass the unlabeled features x_u through the classifier module C, which predicts the labels Ŷ for the unlabeled features. The classifier C has two fully connected layers with 500 and n_l neurons respectively, where n_l is the number of unseen labels and varies with the dataset; n_l is 50, 25 and 8 for the UCF101, HMDB51 and Olympic datasets respectively. Next, we sample a noise vector Z from the latent space of the encoder E and concatenate it with the class labels (Y for x_l and Ŷ for x_u) of the input feature vector X. We feed the sampled noise vector Z to the decoder module D. The decoder uses two FC layers with softplus activation functions to generate a d_x-dimensional feature vector X̂ similar to the input feature vector X; its FC layers have 500 and d_x neurons respectively. The dimension d_x of the input feature vector X is 4096, 4096 and 512 for HMDB51, UCF101 and Olympic respectively. We train this network using the Adam optimizer on the loss F defined in Equation 21, with learning rate 0.0001 and batch size 100.
8. Results Analysis and Comparison

In this section we present the performance of the proposed models for inductive ZSL, transductive ZSL, and generalized ZSL in Tables 2, 4 and 3 respectively. For comparison with existing ZSL techniques we choose ST [6], SJE [13], ConSE [12], ESZSL [10], Bi-Dir [8], RSAE [19], RR [16], HAA [11], DAP [14], IAP [14], UDA [9], MTE [15], SAE [48], SEF-ZSL [18], CVAE [30], ZSECOC [39], f-CLSWGAN [49], cycle-UWGAN [50], GF-ZSL [51], UAR [37], TFE-ZSAR [38] and TARN [41] as baseline models.

Standard Zero-Shot Learning in the Inductive and Transductive settings

We evaluate both proposed models on the three action datasets, UCF101, HMDB51, and Olympic, in the inductive and the transductive setting. In the inductive setting, the unlabeled test data are not used and the model is trained on labeled seen-class data only, whereas in the transductive setting the unlabelled test data are also used to train the model. Unlabelled examples of the test classes help the model learn a better class distribution. Since the synthesized examples for unseen classes act as labelled examples, we can adopt a semi-supervised learning approach for the transductive setting; here we use a semi-supervised VAE [26] for transductive ZSL and observe a significant improvement over the inductive setting. In the standard ZSL setting, the test examples come only from unseen classes. We use the class-attribute vectors as semantic information for UCF101 and Olympic, whereas for HMDB51 word2vec is used as the semantic information. Our proposed IAF+D model achieves state-of-the-art results on the UCF101 and Olympic datasets in both the inductive and transductive settings, whereas on the HMDB51 dataset its performance is close to the previous state-of-the-art. The relative improvement over the state-of-the-art on the UCF101 and Olympic datasets is 8.6% and 2.1% respectively in the inductive setting, and 15.4% and 4.1% respectively in the transductive setting. The proposed bi-directional GAN based model also performs well on all three datasets in the inductive and transductive settings: it outperforms the state-of-the-art on UCF101 and Olympic by 3.1% and 2% respectively in the inductive setting, and by 9.2% and 1.1% in the transductive setting. We observe that most methods are biased towards particular datasets, performing well on one dataset and poorly on others; for example, UAR performs well on HMDB51, whereas its performance on UCF101 is not good. In contrast, our proposed models perform well on all three datasets, Olympic, HMDB51, and UCF101. Tables 2 and 4 compare the proposed approaches with state-of-the-art approaches for zero-shot learning (ZSL) in the inductive and transductive settings respectively.
Method | Embed | Olympic | UCF101 | HMDB51
ST [6]-ICIP 2015 | W | N/A | 13.0±2.7 | 10.9±1.5
SJE [13]-CVPR 2015 | W | 28.6±4.9 | 9.9±1.4 | 13.3±2.4
ESZSL [10]-ICML 2015 | W | 39.6±9.6 | 15.0±1.3 | 18.5±2
Bi-Dir [8]-IJCV 2017 | W | N/A | 18.9±.4 | 18.6±.7
RR [16]-IJCV 2017 | W | 35.7±8.8 | 11.7±1.7 | 14.5±2.7
ZSECOC [39]-CVPR 2017 | W | 21.6±0.9 | 3.2±0.7 | 16.5±3.9
SEF-ZSL [18]-ECML 2017 | W | 33.6±10.4 | 19.2±2.3 | 16.3±2.4
SAE [48]-CVPR 2017 | W | 35.1±10.9 | 17.4±2.3 | 16.3±2.1
RSAE [19]-WACV 2018 | W | 34.12±10.1 | 17.33±1.1 | 19.28±2.1
UAR [37]-CVPR 2018 | W | N/A | 17.5±1.6 | 24.4±1.6
TFE-ZSAR [38]-ECCV 2018 | W | N/A | N/A | 19.9±3.3
CVAE [30]-CVPR 2018 | W | 31.9±9.5 | 17.1±2.4 | 15.3±3.3
TARN [41]-BMVC 2019 | W | N/A | 19.0±2.3 | 19.5±4.2
f-CLSWGAN [49]-CVPR 2018 | W | 37.7±10.2 | 18.2±2.5 | 18.4±3.3
cycle-UWGAN [50]-ECCV 2018 | W | 36.2±9.9 | 17.7±2.7 | 15.9±3.8
GF-ZSL [51]-CVPR 2018 | W | 32.8±10.1 | 19.2±3.5 | 16.6±3.4
Ours (IAF+D) | W | 34.3±10.5 | 20.8±3.1 | 17.88±3.71
Ours (Bi-dir GAN) | W | 36.6±11.1 | 19.7±2.5 | 18.73±3.73
HAA [11]-CVPR 2011 | A | 46.1±12.4 | 14.9±.8 | N/A
DAP [14]-TPAMI 2014 | A | 45.4±12.8 | 14.3±1.3 | N/A
IAP [14]-TPAMI 2014 | A | 42.3±12.5 | 12.8±2 | N/A
SJE [13]-CVPR 2015 | A | 47.0±14.8 | 12.0±1.2 | N/A
UDA [9]-ICCV 2015 | A | N/A | 13.2±1.9 | N/A
MTE [15]-ECCV 2016 | A | 44.5±8.2 | 15.8±1.3 | N/A
Bi-Dir [8]-IJCV 2017 | A | N/A | 20.5±.5 | N/A
RR [16]-IJCV 2017 | A | 51.7±11.3 | 12.6±1.8 | N/A
SAE [48]-CVPR 2017 | A | 47.8±9.6 | 20.2±2.1 | N/A
ZSECOC [39]-CVPR 2017 | A | 27.7±4.6 | 13.7±0.5 | N/A
SEF-ZSL [18]-ECML 2017 | A | 45.4±11.1 | 20.8±2.4 | N/A
RSAE [19]-WACV 2018 | A | 50.41±11.2 | 22.74±1.2 | N/A
CVAE [30]-CVPR 2018 | A | 45.7±10.1 | 20.1±2.1 | N/A
TARN [41]-BMVC 2019 | A | N/A | 23.2±2.9 | N/A
f-CLSWGAN [49]-CVPR 2018 | A | 46.5±9.3 | 23.1±2.3 | N/A
cycle-UWGAN [50]-ECCV 2018 | A | 46.3±10.1 | 20.3±3.1 | N/A
GF-ZSL [51]-CVPR 2018 | A | 44.8±10.5 | 22.7±2.6 | N/A
Ours (IAF) | A | 41.46±10.96 | 23.45±2.59 | 17.58±3.83 (W)
Ours (D) | A | 50.82±11.62 | 23.62±3.00 | 16.90±6.93 (W)
Ours (IAF+D) | A | 52.06±11.72 | 25.23±2.95 | 17.88±3.71 (W)
Ours (Bi-dir GAN) | A | 49.80±10.72 | 23.86±2.95 | 18.73±3.73 (W)

Table 2: Results for the inductive setting in the standard zero-shot learning setting (disjoint setting) for action recognition. Here A denotes the human-annotated attribute vectors and W denotes the word2vec embedding. UCF101 and Olympic use A; for HMDB51, A is not available and only W is used.
Method | Embed | Olympic | UCF101 | HMDB51
HAA [11]-CVPR 2011 | A | 48.3±10.2 | 14.3±2.0 | N/A
SJE [13]-CVPR 2015 | W | 32.5±6.7 | 9.3±1.7 | 12.0±2.6
ConSE [12]-ICLR 2014 | W | 36.6±9.0 | 11.6±2.1 | 15.0±2.7
SAE [48]-CVPR 2017 | A | 45.20±9.3 | 18.6±2.2 | N/A
SAE [48]-CVPR 2017 | W | 34.5±11.0 | 14.6±2.2 | 15.2±2.3
SEF-ZSL [18]-ECML 2017 | A | 43.25±11.0 | 18.6±2.4 | N/A
SEF-ZSL [18]-ECML 2017 | W | 30.1±10.2 | 17.3±3.1 | 14.90±2.3
CVAE [30]-CVPR 2018 | A | 42.60±10.0 | 18.45±2.1 | N/A
CVAE [30]-CVPR 2018 | W | 29.10±8.5 | 15.40±2.1 | 14.30±2.8
f-CLSWGAN [49]-CVPR 2018 | A | 44.3±9.8 | 20.9±2.9 | N/A
f-CLSWGAN [49]-CVPR 2018 | W | 33.4±10.1 | 16.6±2.9 | 18.2±2.9
cycle-UWGAN [50]-ECCV 2018 | A | 43.8±10.4 | 19.2±2.9 | 15.6±3.2 (W)
GF-ZSL [51]-CVPR 2018 | A | 46.1±10.0 | 21.2±2.9 | N/A
GF-ZSL [51]-CVPR 2018 | W | 28.7±9.8 | 19.6±2.9 | 15.4±3.1
Ours (IAF) | A | 48.46±7.0 | 25.98±2.56 | N/A
Ours (IAF) | W | 30.19±11.15 | 20.20±2.61 | 15.63±2.17
Ours (Bi-dir GAN) | A | 44.2±11.25 | 22.70±2.52 | N/A
Ours (Bi-dir GAN) | W | 32.20±10.45 | 17.20±2.31 | 17.53±2.37

Table 3: Results for the generalized zero-shot learning setting for action recognition. Here A denotes the human-annotated attribute vectors and W denotes the word2vec embedding.
Generalized Zero-Shot setting

In the generalized setting, the test data may come from both seen and unseen classes, so this setting is more challenging than the standard setting. Experiments were performed with both proposed models in the generalized setting, using both class attributes and word2vec representations; the results are tabulated in Table 3. From Tables 3 and 2 it is evident that the proposed models are robust in both zero-shot settings, i.e., standard as well as generalized. The proposed models perform better than all existing state-of-the-art models in the generalized zero-shot setting for action recognition. We observe that existing zero-shot learning approaches do not perform well in the generalized setting due to their bias towards the train classes. In the IAF based model, we observe 22.5% and 2.0% relative improvement over the state-of-the-art models on the UCF101 and Olympic datasets respectively, with competitive performance on the HMDB51 dataset. In the bi-directional GAN model, we observe a 7.1% relative improvement over the state-of-the-art models on UCF101 and competitive performance on HMDB51 and Olympic. Table 3 compares both proposed models with existing state-of-the-art methods for generalized zero-shot learning (GZSL) in the inductive setting. To reduce the bias towards the seen classes, performance in the generalized setting is reported as the harmonic mean of the seen and unseen class accuracies:

H = \frac{2 \cdot Acc_{Y^{tr}} \cdot Acc_{Y^{ts}}}{Acc_{Y^{tr}} + Acc_{Y^{ts}}}

where Acc_Ytr and Acc_Yts are the average class accuracies for the train and test class data respectively.
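For reference, this metric can be computed as in the small sketch below (illustrative).

```python
def harmonic_mean_accuracy(acc_seen, acc_unseen):
    """GZSL metric: H = 2 * Acc_tr * Acc_ts / (Acc_tr + Acc_ts)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# e.g. harmonic_mean_accuracy(0.45, 0.26) ≈ 0.33
```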
Method | Embed | Olympic | UCF101 | HMDB51
ConSE [12]-ICLR 2014 | W | 37.0±9.9 | 12.7±2.1 | 15.4±2.8
ST [6]-ICIP 2015 | W | N/A | 15.8±2.3 | 15.0±3.0
SJE [13]-CVPR 2015 | W | 28.6±4.9 | 9.9±1.4 | 13.3±2.4
RR [16]-IJCV 2017 | W | 37.3±9.1 | 15.9±2.3 | 19.1±3.8
SEF-ZSL [18]-ECML 2017 | W | 40.1±11.8 | 19.8±2.3 | 18.9±1.9
UAR [37]-CVPR 2018 | W | N/A | 20.1±1.4 | 28.9±1.2
Ours (IAF+D) | W | 39.8±11.6 | 22.20±2.75 | 19.20±3.71
Ours (Bi-dir GAN) | W | 40.2±10.62 | 21.80±3.65 | 21.32±3.21
HAA [11]-CVPR 2011 | A | 49.4±10.8 | 18.7±2.4 | N/A
ConSE [12]-ICLR 2014 | A | 45.6±9.2 | 14.8±2.2 | N/A
SJE [13]-CVPR 2015 | A | 47.0±14.8 | 12.0±1.2 | N/A
UDA [9]-ICCV 2015 | A | N/A | 13.2±1.9 | N/A
RR [16]-IJCV 2017 | A | 52.7±11.3 | 20.2±2.2 | N/A
SEF-ZSL [18]-ECML 2017 | A | 50.3±9.8 | 22.6±2.5 | N/A
Ours (IAF+D) | A | 54.90±11.72 | 26.10±2.95 | 19.20±3.71 (W)
Ours (Bi-dir GAN) | A | 53.20±10.52 | 24.70±3.75 | 21.32±3.21 (W)

Table 4: Results for the transductive setting in the standard zero-shot learning setting (disjoint setting) for action recognition. Here A denotes the human-annotated attribute vectors and W denotes the word2vec embedding. UCF101 and Olympic use A; for HMDB51, A is not available and only W is used.
9. Ablation Studies

9.1. Ablation Studies for the IAF Model

Extensive ablations are performed to justify the additional components in the base VAE architecture. Our ablation shows that using only the IAF component with a plain decoder gives significantly lower performance on the Olympic and UCF101 datasets. Using the proposed decoder (D) without IAF gives a significant improvement on the Olympic and UCF101 datasets. In another experiment, the IAF is combined with the proposed decoder; this combination gives state-of-the-art results on the UCF101 and Olympic datasets, while remaining competitive on the HMDB51 dataset. In Figure 4, we take 4 random splits out of the 30 splits of the Olympic dataset and plot test accuracy against the training epoch for the decoder+IAF and without-decoder+IAF setups. It is clear from Figure 4 that IAF+Decoder gives more stable training and superior performance; its accuracy increases monotonically with the epoch, and it is therefore less prone to overfitting.
Figure 4: Accuracy with decoder+IAF and without decoder+IAF on four random splits out of the 30 splits of the Olympic dataset.
Figure 5 (left) shows the t-SNE plot of the original visual features and of the features synthesized by the proposed model for eight different classes of the Olympic dataset. From the t-SNE plot, we can see that each pseudo class has a distribution similar to that of the original class. These pseudo samples are generated by conditioning on the class attribute. By using the pseudo samples for training, we can convert the ZSL problem into a supervised learning problem, and any supervised classifier can be used for classifying unseen classes.
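As noted above, once pseudo examples have been synthesized the problem reduces to ordinary supervised classification. The sketch below illustrates this with scikit-learn; the generator call and the attribute dictionary are placeholders, not the released code.

```python
import numpy as np
from sklearn.svm import SVC

def train_zsl_classifier(generate_features, unseen_attributes, per_class=100):
    """Train an SVM on features synthesized from unseen-class attributes.

    `generate_features(attribute, n)` is a placeholder for the trained
    IAF/Bi-dir GAN generator returning an (n, feature_dim) array;
    `unseen_attributes` maps class labels to attribute vectors.
    """
    X, y = [], []
    for label, attr in unseen_attributes.items():
        feats = generate_features(attr, per_class)   # pseudo-labelled examples
        X.append(feats)
        y.extend([label] * per_class)
    clf = SVC(kernel='linear')
    clf.fit(np.vstack(X), y)
    return clf                                       # apply clf.predict() to real test features
```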
Figure 5: t-SNE plot. Left: original samples; right: synthesized samples.
Figure 6: Accuracy vs. λc and Accuracy vs. λreg respectively. Note: X axis is scaled on log(λc ) and log(λreg )
9.2. Effects of the Hyper-parameters λc, λreg and λE

In Equation 13, λc, λreg and λE are hyper-parameters which decide the contribution of the losses Lc, LReg and LE respectively. We find the optimum hyper-parameters using cross-validation. We conduct experiments on the HMDB51 dataset for different values of λc, λreg and λE, varying each hyper-parameter from 0.001 to 0.5. In Figures 6 and 7 the scale of the X-axis is log(λ). We tune the hyper-parameters one at a time. First, we fix λreg = 0.05 and λE = 0.05 and conduct experiments for different values of λc; we observe that test accuracy initially increases and that the model shows the best performance at λc = 0.1. Next, we fix λc = 0.1 and λE = 0.05 and run experiments for different values of λreg, observing the best test accuracy at λreg = 0.1. Finally, we fix λc = 0.1 and λreg = 0.1 and perform the experiments for different values of λE; the model shows the best performance at λE = 0.1. We observe that λc, λreg and λE contribute equally and that the optimum value for each is 0.1.

Figure 7: Accuracy vs. λE. Note: the X-axis is scaled as log(λE).

9.3. Ablation Studies for the Bi-directional GAN Model

Inductive vs Transductive. The proposed model can also be applied in the transductive setting, where we assume that unlabelled unseen-class examples are available at training time and can be used to better estimate the class distributions of unseen classes. In contrast, in the inductive setting our model uses only seen-class examples for training. In Figure 5 (right) the inductive and transductive results are compared; the transductive setting shows a significant improvement over the inductive setting. Using unlabelled examples of unseen classes (transductive setting) increases the performance of the proposed model by 3.5%, 16.1% and 6.9% compared with the inductive setting for the UCF101, HMDB51, and Olympic datasets respectively.

Few-Shot Learning. The efficacy of the proposed approach is also evaluated by performing few-shot learning. In this case, we fine-tune the model pre-trained on seen-class data using a few examples from the unseen classes; we experiment with 2, 3, 5 and 10 samples per unseen class for all three datasets. Figures 8 and 9 (left) clearly show that with a few (ten) examples per class the proposed model is competitive with existing supervised classifiers [52, 53, 54]. A comparison with RSAE [19] shows that the proposed model performs significantly better.

Figure 8: Performance comparison between our Bi-dir GAN model and the RSAE [19] model in the few-shot setting for the HMDB51 and UCF101 datasets respectively.

Figure 9:

Effect of the number of labelled examples in the transductive setting. Here we study the extension of both proposed models to the transductive setting using the semi-supervised VAE [26], where the examples generated for unseen classes are used as pseudo-labelled examples. We perform experiments with different numbers of pseudo-labelled examples per unseen class and observe that the performance of the models increases continuously until the number of examples reaches 100, and saturates after that. Figure 9 (right) and Figure 10 show the effect of the number of pseudo-labelled examples on the performance of the IAF based and Bi-dir GAN based models.
Figure 10:
Bi-directional GAN vs Simple GAN. In the bi-directional GAN model, we generate features in both directions (from attributes to visual features and from visual features to attributes). GAN2 assists GAN1 in learning better, more discriminative visual features conditioned on attributes, by exploiting its ability to regenerate attributes from the generated visual features. To support this claim, we performed an experiment with a simple conditional GAN and observed a significant improvement with the bi-directional GAN: 24%, 13.5% and 9.5% relative improvement over the simple GAN based model on the UCF101, HMDB51, and Olympic datasets respectively in the inductive zero-shot setting.
Relation between the IAF based model and the Bi-dir GAN model
The two generative models most commonly used for synthesizing unseen-class features from class attributes are the variational auto-encoder (VAE) and the generative adversarial network (GAN). Our IAF based model builds on the VAE and the Bi-dir GAN model on the GAN, and the two proposed models relate to each other much as VAE and GAN do: the VAE decoder plays the same role as the GAN generator. In the Bi-dir GAN model we use two generators, one synthesizing action-class features from the class attribute and the other reconstructing the class attribute from the generated features. Likewise, in the IAF based model the decoder generates action-class features from the class attributes and the regressor module reconstructs the class attributes from the generated features.
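This parallel structure can be made concrete with a small sketch. The layer sizes and dimensions (`z_dim`, `attr_dim`, `feat_dim`, `hidden`) are illustrative assumptions, not the exact architecture: the decoder plays the role of the attribute-to-feature generator and the regressor the role of the feature-to-attribute generator in the Bi-dir GAN model.

```python
# Schematic decoder + regressor pair for the IAF based model (dimensions and
# layer sizes are assumptions made for illustration only).
import torch
from torch import nn

class DecoderRegressor(nn.Module):
    def __init__(self, z_dim=64, attr_dim=300, feat_dim=4096, hidden=1024):
        super().__init__()
        self.decoder = nn.Sequential(            # (latent code, attribute) -> feature
            nn.Linear(z_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim))
        self.regressor = nn.Sequential(          # generated feature -> attribute
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, attr_dim))

    def forward(self, z, attr):
        feat = self.decoder(torch.cat([z, attr], dim=1))
        return feat, self.regressor(feat)
```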
In summary, both the IAF based and the Bi-dir GAN based models synthesize features in both directions, from class attributes to action-class features and vice versa; the main difference between them is that one builds on a VAE and the other on a GAN.

10. Conclusion

We have proposed two generative methods for zero-shot action recognition: an IAF based VAE and a bi-directional GAN based approach. The proposed approaches apply equally to the ZSL and GZSL settings, and their performance is comparable to, or better than, the state of the art. We have extended both models to the transductive zero-shot setting using a semi-supervised VAE and observed a significant improvement over the inductive setting on all three datasets. We have found that IAF based variational inference, together with the combination of deep and shallow networks, gives stable training and is less prone to over-fitting; our results and ablation studies on the standard datasets support this claim. The proposed approaches are not constrained to specific distributions: both models can be trained for any data distribution conditioned on the attribute. We also observed that manually annotated attributes give better performance than unsupervised embeddings such as Word2vec/Glove.

Author Contribution Statement

Ashish Mishra: Conceptualization, Methodology, Writing - Original draft preparation. Anubha Pandey: Software, Reviewing and Editing. Hema A Murthy: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References [1] M. Palatucci, D. Pomerleau, G. E. Hinton, T. M. Mitchell, Zero-shot learning with semantic output codes, in: Advances in neural information processing systems, 2009, pp. 1410–1418. [2] C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 951– 958. [3] Y. Xian, B. Schiele, Z. Akata, Zero-shot learning-the good, the bad and the ugly, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4582–4591. [4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119. [5] J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543. [6] X. Xu, T. Hospedales, S. Gong, Semantic embedding space for zero-shot action recognition, in: Image Processing (ICIP), 2015 IEEE International Conference on, IEEE, 2015, pp. 63–67. [7] E. Kodirov, T. Xiang, S. Gong, Semantic autoencoder for zero-shot learning, arXiv preprint arXiv:1704.08345 (2017). [8] Q. Wang, K. Chen, Zero-shot visual recognition via bidirectional latent embedding, International Journal of Computer Vision 124 (3) (2017) 356–383. [9] E. Kodirov, T. Xiang, Z. Fu, S. Gong, Unsupervised domain adaptation for zero-shot learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2452–2460. [10] B. Romera-Paredes, P. Torr, An embarrassingly simple approach to zeroshot learning, in: International Conference on Machine Learning, 2015, pp. 2152–2161.
[11] J. Liu, B. Kuipers, S. Savarese, Recognizing human actions by attributes, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 3337–3344. [12] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, J. Dean, Zero-shot learning by convex combination of semantic embeddings, arXiv preprint arXiv:1312.5650 (2013). [13] Z. Akata, S. Reed, D. Walter, H. Lee, B. Schiele, Evaluation of output embeddings for fine-grained image classification, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 2927– 2936. [14] C. H. Lampert, H. Nickisch, S. Harmeling, Attribute-based classification for zero-shot visual object categorization, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3) (2014) 453–465. [15] X. Xu, T. M. Hospedales, S. Gong, Multi-task zero-shot action recognition with prioritised data augmentation, in: European Conference on Computer Vision, Springer, 2016, pp. 343–359. [16] X. Xu, T. Hospedales, S. Gong, Transductive zero-shot action recognition by word-vector embedding, International Journal of Computer Vision (2017) 1– 25. [17] W. Wang, Y. Pu, V. K. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, L. Carin, Zero-shot learning via class-conditioned deep generative models, in: AAAI Conference on Artificial Intelligence (AAAI-18), Louisiana, USA., 2018. [18] V. K. Verma, P. Rai, A simple exponential family framework for zero-shot learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2017, pp. 792–808. [19] A. Mishra, V. K. Verma, M. Reddy, P. Rai, A. Mittal, et al., A generative approach to zero-shot and few-shot action recognition, arXiv preprint arXiv:1801.09086 (2018). [20] G. Dinu, A. Lazaridou, M. Baroni, Improving zero-shot learning by mitigating the hubness problem, arXiv preprint arXiv:1412.6568 (2014).
[21] M. Radovanović, A. Nanopoulos, M. Ivanović, Hubs in space: Popular nearest neighbors in high-dimensional data, Journal of Machine Learning Research 11 (Sep) (2010) 2487–2531. [22] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013). [23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680. [24] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, M. Welling, Improved variational inference with inverse autoregressive flow, in: Advances in Neural Information Processing Systems, 2016, pp. 4743–4751. [25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. [26] D. P. Kingma, S. Mohamed, D. J. Rezende, M. Welling, Semi-supervised learning with deep generative models, in: Advances in neural information processing systems, 2014, pp. 3581–3589. [27] A. Gaure, A. Gupta, V. K. Verma, P. Rai, A probabilistic framework for zero-shot multi-label learning, in: The Conference on Uncertainty in Artificial Intelligence (UAI), 2017, pp. 3–12. [28] Y. Guo, G. Ding, J. Han, Y. Gao, Synthesizing samples for zero-shot learning, IJCAI, 2017. [29] S. Changpinyo, W.-L. Chao, B. Gong, F. Sha, Synthesized classifiers for zero-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5327–5336. [30] A. Mishra, S. Krishna Reddy, A. Mittal, H. A. Murthy, A generative model for zero shot learning using conditional variational autoencoders, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2188–2196.
[31] Z. Zhang, V. Saligrama, Zero-shot learning via joint latent similarity embedding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 6034–6042. [32] Z. Zhang, V. Saligrama, Zero-shot learning via semantic similarity embedding, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4166–4174. [33] Y. Xian, T. Lorenz, B. Schiele, Z. Akata, Feature generating networks for zero-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5542–5551. [34] R. Felix, V. B. Kumar, I. Reid, G. Carneiro, Multi-modal cycle-consistent generalized zero-shot learning, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 21–37. [35] G. Arora, V. K. Verma, A. Mishra, P. Rai, Generalized zero-shot learning via synthesized examples, CVPR (2018). [36] X. Wang, Y. Ye, A. Gupta, Zero-shot recognition via semantic embeddings and knowledge graphs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6857–6866. [37] Y. Zhu, Y. Long, Y. Guan, S. Newsam, L. Shao, Towards universal representation for unseen action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9436–9445. [38] A. Roitberg, M. Martinez, M. Haurilet, R. Stiefelhagen, Towards a fair evaluation of zero-shot action recognition using external data, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018. [39] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, Y. Wang, Zero-shot action recognition with error-correcting output codes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2833–2842. [40] D. Mandal, S. Narayan, S. K. Dwivedi, V. Gupta, S. Ahmed, F. S. Khan, L. Shao, Out-of-distribution detection for generalized zero-shot action recognition, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[41] M. Bishay, G. Zoumpourlis, I. Patras, TARN: Temporal attentive relation network for few-shot and zero-shot action recognition, arXiv preprint arXiv:1907.09021 (2019). URL http://arxiv.org/abs/1907.09021. [42] M. Bucher, S. Herbin, F. Jurie, Generating visual representations for zero-shot classification, in: International Conference on Computer Vision (ICCV) Workshops: TASK-CV: Transferring and Adapting Source Knowledge in Computer Vision, 2017. [43] M. Germain, K. Gregor, I. Murray, H. Larochelle, MADE: Masked autoencoder for distribution estimation, in: International Conference on Machine Learning, 2015, pp. 881–889. [44] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, E. P. Xing, Toward controlled generation of text, in: International Conference on Machine Learning, 2017, pp. 1587–1596. [45] K. Soomro, A. R. Zamir, M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, arXiv preprint arXiv:1212.0402 (2012). [46] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: A large video database for human motion recognition, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 2556–2563. [47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497. [48] E. Kodirov, T. Xiang, S. Gong, Semantic autoencoder for zero-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3174–3183. [49] Y. Xian, T. Lorenz, B. Schiele, Z. Akata, Feature generating networks for zero-shot learning, in: CVPR, 2018, pp. 5542–5551. [50] R. Felix, V. B. Kumar, I. Reid, G. Carneiro, Multi-modal cycle-consistent generalized zero-shot learning, in: ECCV, 2018, pp. 21–37. [51] V. K. Verma, G. Arora, A. Mishra, P. Rai, Generalized zero-shot learning via synthesized examples, CVPR (2018).
[52] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in neural information processing systems, 2014, pp. 568–576. [53] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, in: CVPR 2011-IEEE Conference on Computer Vision & Pattern Recognition, IEEE, 2011, pp. 3169–3176. [54] H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3551–3558.
Ashish Mishra: He received his Bachelor's degree in Mathematics from the University of Allahabad, Allahabad, India, in 2008, and his Master's degree in Computer Science from the University of Hyderabad, Hyderabad, India, in 2013. He is currently a Ph.D. scholar in the Department of Computer Science and Engineering, IIT Madras, India. His research focuses on applications of zero-shot learning in domains such as image classification, action classification, and image retrieval.
Anubha Pandey: She received her Bachelor's degree in Computer Science and Engineering from the Medicaps Institute of Technology and Management, Indore, India, in 2016. She is currently an MS scholar in the Department of Computer Science and Engineering, IIT Madras, India. She works on image inpainting, image deblurring, and sketch-based image retrieval in the zero-shot setting.
Hema A Murthy: She received her Bachelor's degree from Osmania University, Hyderabad, India, in 1980, her Master's degree from McMaster University, Hamilton, Canada, in 1986, and her Ph.D. degree from the Indian Institute of Technology (IIT) Madras, Chennai, India, in 1992. She is currently a Professor in the Department of Computer Science and Engineering, IIT Madras. She has extensive research experience in speech processing, computer networks, music information retrieval, computational brain research, and other areas of machine learning and signal processing.