Multimodal Multiclass Boosting and its Application to Cross-modal Retrieval


Neurocomputing 357 (2019) 11–23


Shixun Wang a,∗, Zhi Dou a, Deng Chen b, Hairong Yu c, Yuan Li a, Peng Pan d

a School of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
b Hubei Provincial Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China
c School of Information Science and Engineering, Qufu Normal University, Qufu 276800, China
d School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
∗ Corresponding author. E-mail address: [email protected] (S. Wang).

Article info
Article history: Received 9 May 2018; Revised 20 February 2019; Accepted 13 May 2019; Available online 16 May 2019. Communicated by Dr. Xinmei Tian.
Keywords: Multimodal multiclass Boosting; Intra-modal loss; Inter-modal loss; Semantic correlation; Cross-modal retrieval

Abstract
Although the Boosting approach has proved to be a very successful ensemble learning technique, conventional methods are limited to two classes or a single modality. In this paper, to deal with the multiclass setting and heterogeneous modalities, we propose a multimodal multiclass Boosting framework called MMBoost, in which the intra-modal semantic information and the inter-modal semantic correlation are captured at the same time. By utilizing the multiclass exponential and logistic loss functions, we further derive two versions of MMBoost, namely MMBoost_exp and MMBoost_log. The empirical risk, which simultaneously considers the intra-modal and inter-modal losses, is designed and then minimized by gradient descent in multidimensional functional spaces. More concretely, the optimization problem is solved in turn for each modality. The posterior probability of each semantic category is naturally obtained by applying the sigmoid function to the multiclass margin. A series of experiments on the Wiki and NUS-WIDE datasets demonstrates that our proposed method significantly outperforms existing Boosting approaches for cross-modal retrieval.
© 2019 Elsevier B.V. All rights reserved.

1. Introduction

In the past few years, more and more real-world classification problems have appeared in the machine learning and computer vision communities, such as object detection [1], action recognition [2], and medical diagnosis [3]. One reliable technique for solving these problems is Boosting, a powerful and effective approach that aggregates several weak learners to build a stronger learner. Initially, the first weak learner is trained on the original training set with a uniform weight distribution. Then a larger weight is assigned to each misclassified instance, so that it has a greater chance of being selected as a training instance for the next weak learner [4]. In other words, the training set of the next weak learner is reweighted according to the performance of the previous learner. In this sequential manner, the Boosting approach tends to focus more attention on learning the misclassified instances.
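To make the reweighting step concrete, here is a minimal Python sketch of a binary AdaBoost-style loop (an illustration of the classical scheme in [5], not the framework proposed in this paper); `fit_stump` is a hypothetical helper that trains a weak learner on weighted data.

```python
import numpy as np

def boost(X, y, fit_stump, T=50):
    """Toy binary AdaBoost loop: y in {-1, +1}; fit_stump(X, y, w) returns a
    callable weak learner h with h(X) -> predictions in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # uniform weights on the first round
    learners, alphas = [], []
    for _ in range(T):
        h = fit_stump(X, y, w)              # weak learner trained on current weights
        pred = h(X)
        err = max(np.sum(w[pred != y]), 1e-12)   # weighted training error
        if err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        learners.append(h)
        alphas.append(alpha)
        # misclassified instances receive larger weights for the next round
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, learners)))
```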




In the literature, the existing Boosting methods applied to unimodal data come in two flavors, namely binary [5–7] and multiclass [8,9]. The binary methods have produced rich theoretical analyses of Bayes consistency and margin maximization to guarantee effective and efficient performance. However, the multiclass case is more difficult than its binary counterpart. Broadly speaking, there are two strategies for designing multiclass Boosting, depending on whether binary weak learners are employed. The first strategy transforms the original multiclass learning task into several binary sub-tasks solved with binary weak learners, where one-versus-all and one-versus-one are the popular schemes. Despite its successes in some cases, this strategy has several drawbacks, such as the imbalanced data distributions generated during binarization, a complexity that grows with the number of classes, and ambiguous decisions leading to a suboptimal learner [10]. Alternatively, the second strategy represents class membership with a set of codewords and boosts multiclass weak learners directly with an appropriate loss function. Nevertheless, some methods of this type have a high risk of over-fitting or lack a margin guarantee [11]. On the other hand, a large number of multimodal documents make use of heterogeneous data to better express the same semantic information. In practice, a user may expect to retrieve text or audio when given a query image. To deal with this search paradigm, one can resort to cross-modal retrieval, in which the modality of a query is different from that of the retrieved results.


Fig. 1. The mapping difference between unimodal (dashed arrow) and multimodal (solid arrow) Boosting approaches; the semantic distance in the bottom right is smaller than that in the bottom left.

Due to the diversity of low-level features, the key problem in cross-modal retrieval is how to measure the relationship between heterogeneous modalities [12,13]. Most existing methods map the original features of each modality into an isomorphic space, such as a correlative subspace [14,15], a hash space [16–18] or a semantic space [19,20], so that the relationship can be measured directly. In the semantic space, a vector stores the posterior probabilities with respect to each of the given semantic concepts. The semantic vectors not only have a discriminative interpretation to some extent, but also provide heterogeneous data with a semantic similarity that plays an important role in cross-modal retrieval. If an approach can assign the correct semantic category to an unseen unimodal example, then it has a better capability to learn the intra-modal semantic information. When different modalities share the same semantic category, they have an inter-modal semantic correlation. In general, unimodal Boosting approaches can individually learn strong learners for the different modalities, and then use the corresponding semantic vectors to carry out cross-modal retrieval. However, these unimodal methods fail to make full use of the semantic correlation between different modalities, which may yield unsatisfactory results. The mapping difference between unimodal and multimodal Boosting approaches is illustrated in Fig. 1, where the red double arrow denotes that the image and text have an inter-modal semantic correlation. Concretely, the dimension of semantic concept 1 means Biology, and the yellow circle and green square respectively denote the toy semantic vectors of the image and the text. Both the purple dashed and red solid lines represent distances between semantic vectors, but the latter indicates a smaller distance. If the original image feature has poor quality, its semantic feature mapped by unimodal Boosting may deviate from the Biology concept, which results in a bigger semantic distance. Due to the separated learning processes of the different modalities, the corresponding text cannot help to improve the semantic quality of the image, even if its own semantic feature is close to the correct semantic concept. Viewed in this light, we expect

that multimodal Boosting can utilize the semantic correlation to ameliorate the poorer semantic feature, so that the semantic distance may be reduced to some degree.
In this paper, we propose a multiclass Boosting framework for analyzing multimodal data, termed MMBoost. As seen from Fig. 1, MMBoost holds that all modalities of one instance should have stronger correlations and smaller distances in the semantic space because they share the same semantic concept. To this end, MMBoost designs a unified objective function over the different modalities of the instances, which takes both intra-modal and inter-modal losses into account, so that the learning process is intrinsically different from those of unimodal Boosting approaches. Based on gradient descent, the unified objective function is alternately minimized for each modality in the multidimensional functional spaces. After applying the sigmoid function to the multiclass margins of the stronger learners, we obtain the posterior category probabilities of the different modalities and then perform cross-modal retrieval in the isomorphic semantic space. In summary, the major contributions of this paper are as follows:
• To analyze multimodal data, we propose the novel MMBoost framework, which can jointly learn intra-modal semantic information and inter-modal semantic correlation. These two types of semantics complement each other, and their combined use may lead to a satisfactory gain. MMBoost can effectively connect heterogeneous modalities, and the final stronger learner of each modality can generate semantic vectors, which results in better cross-modal retrieval performance.
• With the multiclass exponential and logistic loss functions, we derive two effective algorithms, namely MMBoost_exp and MMBoost_log. Specifically, the optimization problem is minimized in turn for each modality. Moreover, we provide the corresponding theoretical analyses of these algorithms in detail.
• We conduct extensive experiments on two public datasets to compare our MMBoost algorithms with several related

methods, and the results show that MMBoost has significant superiority for cross-modal retrieval.
The rest of this paper is organized as follows. Section 2 gives a brief overview of related work. In Section 3, our multimodal multiclass Boosting approach is presented, and the corresponding optimization algorithms and theoretical analyses are also provided in detail. Comprehensive experiments are reported in Section 4, and the conclusion is summarized in Section 5.

2. Related work

The first notable Boosting procedure is the AdaBoost algorithm [5], which utilizes gradient descent to minimize the exponential risk. From the perspectives of additive modeling and maximum likelihood, LogitBoost [6] applies Newton's method to minimize the logistic risk. Based on Taylor series expansion, TaylorBoost [7] can employ any loss function to generate a series of first- or second-order algorithms. By utilizing a network of logical gates to combine several weak learners, GatedBoost [21] can deal with intra-class variation. Given the sum and product operations, the final learner can be grown by selecting the more appropriate operation [22]. In DeepBoost [23], the weak learners may come from a hypothesis set containing deep decision trees or other similarly complex families. As a semi-supervised Boosting method, RegBoost [24] utilizes a predefined similarity function to measure the pairwise similarities between labeled and unlabeled data, so that the resulting predictor can assign more reliable pseudo-labels to unlabeled examples. However, these methods aim at handling the binary problem.
The ECOC method [8] associates each class with a codeword row of the coding matrix, whose elements are taken from {-1, 0, +1}. In AdaBoostM2, the weight distribution and class weighting function are maintained simultaneously, and the task of the weak learner is to minimize a pseudo-loss [5]. TotalBoost [25] utilizes quadratic programming to maximize the minimal margin on the training set, and RUSBoost [26], a variant of AdaBoostM2, introduces random undersampling into the AdaBoost algorithm. MDeepBoost [27] generalizes the result of DeepBoost to the multiclass setting and gives data-dependent learning bounds for convex ensembles. To model the uncertainty caused by each weak learner, PIBoost [11] exploits different vector encodings to represent the labels and predictor responses. REBEL [28] presents a novel family of weak learners called similarity stumps, which can yield much better generalization than decision stumps. In these methods, binary weak learners are used to separate groups of classes. By utilizing a multiclass exponential loss function, SAMME [9] directly extends the AdaBoost algorithm to the multiclass case without reducing it to multiple binary problems. In [29], the optimality requirements of the trained weak learners are identified, and a general framework for multiclass Boosting is introduced. Rob_MulAda [4] formally designs a noise-detection based multiclass loss function and presents a new weight updating scheme to mitigate the harmful effect of noisy examples. The BVDT algorithm [30] uses a vector decision tree as the weak learner and directly maps the feature space to the decision space in the multiclass setting. MCBoost [31] introduces an optimal set of codewords, which are the vertices of a multidimensional regular simplex centered at the origin, and updates the predictor based on gradient descent.
In [32], a guess-averse property is advocated, in the sense that the loss function should encourage correct classifications more than arbitrary guesses. In [10], the best weak learner is aggregated by selecting the sum or Hadamard product operation at each iteration, leading to more sophisticated combinations of weak learners. Based on the coding scheme of [9], MSAB [33] not only minimizes the empirical loss on labeled data, but also uses the


manifold and cluster assumptions to enforce consistency over labeled and unlabeled data. By utilizing the regular simplex vertices of [31], MSSBoost [34] designs a loss function that combines the multiclass margin cost on labeled data with a regularization term on unlabeled data, and learns the optimal similarity function for the given data. These approaches employ multiclass weak learners.
As a popular baseline for cross-modal retrieval, canonical correlation analysis (CCA) [14] learns a subspace that maximizes the pairwise correlation between two sets of heterogeneous data. Multi-label CCA [15], an extension of CCA, utilizes multi-label information to learn a discriminative subspace that is more suitable for cross-modal tasks. In [19], logistic regression is applied to obtain the semantic vectors of images and texts. Nevertheless, logistic regression does not work effectively when the feature is high-dimensional and sparse. MCBoost is explored to learn the semantic abstraction in [20], where the learning procedure of images is separated from that of texts. Specifically, CCA is used to obtain a common subspace of image and text, and their representations in that subspace are then respectively mapped into a semantic space by MCBoost. In contrast, the viewpoint of our MMBoost is to learn a mapping scheme that jointly mines intra-modal semantic information and inter-modal semantic correlation from multimodal data.

3. The proposed MMBoost method

Although there exist many multiclass Boosting methods for a single modality, we choose MCBoost [31] as the foundation of our algorithm. The reason is that MCBoost uses the optimal codewords to handle the multiclass setting directly, without decomposing it into several binary problems. In this section, we first present a multimodal multiclass Boosting framework for analyzing multimodal data, and then propose the optimization algorithms with different multiclass loss functions in detail. To distinguish them from scalar variables, matrices and vectors are denoted by boldface letters.

3.1. Construction of framework

Suppose N, M and K are the numbers of examples, modalities and classes, respectively. The multimodal dataset is (D, S) = {(D_1, s_1), ..., (D_N, s_N)}, where D_i = {x_i^1, ..., x_i^M} denotes the i-th example, and s_i ∈ {1, ..., K} is the corresponding semantic category. Concretely, x_i^m ∈ R^{d_m} denotes the feature representation of the m-th modality of the i-th example, and d_m is the corresponding feature dimension. For the m-th and j-th modalities, if m ≠ j then they generally have different feature dimensions, namely d_m ≠ d_j. Each example in the training set is associated with a semantic category, while the examples in the testing set are assumed to be unlabeled.
It is noted that the class label plays an important role in Boosting methods. For the binary problem, ±1 are the class labels. For the one-versus-all scheme, the class label is 1_k ∈ R^K, where the k-th element is one and the others are zeros. In the light of [31], the codeword matrix C = [c_1, ..., c_K] is a set of K distinct unit vectors, which are the vertices of a (K−1)-dimensional regular simplex centered at the origin. Therefore, each semantic category k may be represented by a codeword vector c_k ∈ R^{K−1} which identifies the category. Given a query in R^{d_m} from the testing set, cross-modal retrieval searches for the closest match in R^{d_j} of the retrieved set.
If f_m(x^m) ∈ R^{K−1} is a stronger predictor of the m-th modality, its multiclass margin with respect to category k can be defined as ⟨f_m(x^m), c_k⟩, where ⟨·, ·⟩ denotes the standard inner product. Based on the multiclass margins and a nonlinear transformation, the original low-level features of the different modalities can be mapped into an isomorphic semantic space, where cross-modal retrieval is easily carried out.
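To make the codeword machinery concrete, the sketch below builds one possible set of simplex codewords and evaluates the multiclass margins; centering the standard basis and projecting it onto an orthonormal basis of its span is our own construction (one common recipe consistent with the description of [31]), not code from the paper.

```python
import numpy as np

def simplex_codewords(K):
    """K unit vectors in R^(K-1) forming a regular simplex centered at the origin,
    used as the class codewords c_1, ..., c_K. Returns C of shape (K-1, K)."""
    E = np.eye(K) - np.full((K, K), 1.0 / K)         # centered standard basis, rank K-1
    Q, _ = np.linalg.qr(E)                           # orthonormal basis of its span
    V = E @ Q[:, :K - 1]                             # rows: simplex vertices in R^(K-1)
    V /= np.linalg.norm(V, axis=1, keepdims=True)    # scale every vertex to unit length
    return V.T                                       # columns are the codewords c_k

def multiclass_margins(f_x, C):
    """Margins <f(x), c_k> for every class k; f_x has shape (K-1,)."""
    return C.T @ f_x                                 # shape (K,)
```

For K = 2 this reduces to the familiar ±1 labels, which matches the binary special case mentioned above.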


Fig. 2. Framework of MMBoost illustrated with two modalities and three iterations.

The purpose of our approach is to build the stronger predictors, which are learnt by combining intra-modal and inter-modal information in the aggregating process.
With two modalities and three iterations, the framework of our MMBoost is illustrated in Fig. 2. From left to right, the three parts of the MMBoost framework are the characterization of the multimodal dataset, the aggregating procedure of the learners, and the semantic mapping of given examples. In the middle part, the semantic information of the first and second modalities is assembled along the purple solid and blue dashed arrows, respectively, and the inter-modal semantic correlations are gathered along the red solid arrows. Note that the red solid arrows are absent from this part when unimodal Boosting methods are applied to the multimodal dataset. That is to say, the stronger predictors of the different modalities are built separately in unimodal Boosting methods, which leads to the absence of semantic correlation.
Generally, a unimodal Boosting method such as MCBoost [31] first learns a stronger predictor by minimizing the empirical risk on a set of unimodal training examples, and then utilizes the learnt predictor to assign a new example the semantic category with the maximum probability of being correct. Here, we extend the formulation of MCBoost to the multimodal setting, in order to obtain accurate semantic information for each modality. Consequently, the intra-modal risk over all modalities can be given as follows

$$R_{intra} = \sum_{m=1}^{M} \alpha_m R_m[f_m(x^m)] = \sum_{m=1}^{M} \alpha_m \sum_{i=1}^{N} L_m[s_i, f_m(x_i^m)] \tag{1}$$

where α_m, R_m[·] and L_m[·,·] denote the weight, empirical risk and multiclass loss function of the m-th modality, respectively. In the next two subsections, we will introduce two multiclass loss functions in detail, both of which are non-negative functions of the multiclass margin.
On the other hand, the different modalities of an example share the identical semantic category, so their representations in the semantic space should have a smaller distance. Furthermore, the semantic distance depends on the semantic vectors, which are calculated from the multiclass margins and a nonlinear transformation. To project any real number into the interval from 0 to 1, the sigmoid function is taken as the nonlinear transformation in this paper. Given x_i^m and its predictor f_m, C^T f_m(x_i^m) denotes a vector whose k-th element is the multiclass margin with respect to semantic category k. Due to the monotonicity of the sigmoid function, we may simply consider the differences of the multiclass margin vectors in the inter-modal risk. Concretely, if the semantic category of an example is shared by different modalities, then there should be smaller differences among the corresponding multiclass margin vectors. Inspired by this, the following inter-modal risk is naturally defined to preserve the semantic consistency among heterogeneous modalities,

$$R_{inter} = \sum_{m=1}^{M} \sum_{j=1}^{M} \frac{\beta}{2} R_{m,j}[f_m(x^m), f_j(x^j)] = \sum_{m=1}^{M} \sum_{j=1}^{M} \sum_{i=1}^{N} \frac{\beta}{2} \left\| C^T [f_m(x_i^m) - f_j(x_i^j)] \right\|_2^2 \tag{2}$$

where β is a parameter balancing the trade-off between the intra-modal and inter-modal risks, and R_{m,j}[·,·] denotes the empirical risk between the m-th and j-th modalities. To search for the final predictors of the different modalities, the overall empirical risk can be represented as follows

$$R[f_1(x^1), \ldots, f_M(x^M)] = R_{intra} + R_{inter} \tag{3}$$

From Eqs. (1)–(3), the whole loss over all modalities is given by

$$L[s, f_1(x^1), \ldots, f_M(x^M)] = \sum_{m=1}^{M} \alpha_m L_m[s, f_m(x^m)] + \sum_{m=1}^{M} \sum_{j=1}^{M} \frac{\beta}{2} \left\| C^T [f_m(x^m) - f_j(x^j)] \right\|_2^2 \tag{4}$$

where the first and second parts are the intra-modal and inter-modal losses, respectively. In general, the empirical risk in Eq. (3) can be minimized by solving the following optimization problem

$$\min_{f_1, \ldots, f_M} R[f_1(x^1), \ldots, f_M(x^M)] \quad \text{s.t.} \quad f_m(x^m) \in span(H_m), \ \forall m = 1, \ldots, M \tag{5}$$

where H_m = {g_m(x^m) | R^{d_m} → R^{K−1}} is a set of multiclass weak learners, and span(·) denotes the functional space of linear combinations of weak learners. Accordingly, both the intra-modal semantic information and the inter-modal semantic correlation can be learnt by minimizing this unified objective function, which bridges the semantic gap between heterogeneous modalities. To reduce the complexity of Eq. (5), we propose two efficient optimization algorithms with different loss functions.

3.2. Optimization algorithm with multiclass exponential loss

To encourage a large multiclass margin, the loss function of the m-th modality is defined as the following non-negative function

$$L_m[s, f_m(x^m)] = \sum_{k=1}^{K} \exp(\langle f_m(x^m), c_k - c_s \rangle) \tag{6}$$
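As a concrete reading of Eqs. (4) and (6), the sketch below computes the combined loss for a single example with M modalities; it is an illustrative NumPy translation under our own naming, not the authors' implementation.

```python
import numpy as np

def exp_loss(f_x, C, s):
    """Multiclass exponential loss of Eq. (6): sum_k exp(<f(x), c_k - c_s>).
    f_x: predictor output in R^(K-1); C: (K-1, K) codeword matrix; s: class index."""
    margins = C.T @ f_x                      # <f(x), c_k> for every k
    return float(np.sum(np.exp(margins - margins[s])))

def example_loss(preds, C, s, alpha, beta):
    """Whole loss of Eq. (4) for one example: preds is a list of f_m(x^m) in R^(K-1),
    one entry per modality; alpha is the list of modality weights alpha_m."""
    intra = sum(a * exp_loss(f, C, s) for a, f in zip(alpha, preds))
    inter = 0.0
    for fm in preds:                         # pairwise inter-modal penalty of Eq. (4)
        for fj in preds:
            diff = C.T @ (fm - fj)           # C^T [f_m(x^m) - f_j(x^j)]
            inter += 0.5 * beta * float(diff @ diff)
    return intra + inter
```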

It is complicated to optimize all the predictors together, so an alternating scheme is used to solve the optimization problem in Eq. (5). After t iterations, the predictor of the m-th modality is


denoted by f_m^t. First, we utilize gradient descent to implement the boosting procedure of f_1^t while fixing the other predictors f_m^t (m = 2, ..., M). Along the direction of a multiclass weak learner g_1(x^1), the first-order functional derivative of R[f_1(x^1), f_2^t(x^2), ..., f_M^t(x^M)] around the point f_1^t(x^1) is computed as

$$\delta R[f_1^t; g_1] = \left. \frac{\partial R[f_1^t + \xi_1 g_1, f_2^t, \ldots, f_M^t]}{\partial \xi_1} \right|_{\xi_1 = 0} = - \sum_{i=1}^{N} \langle g_1(x_i^1), P_i^1 \rangle \tag{7}$$

where P_i^1 ∈ R^{K−1} and its expression is

$$P_i^1 = \alpha_1 \sum_{k=1}^{K} (c_{s_i} - c_k) \exp(\langle f_1^t, c_k - c_{s_i} \rangle) - 2\beta \sum_{m=1}^{M} C C^T (f_1^t - f_m^t) \tag{8}$$

At iteration t + 1, the direction that decreases the risk the most is

$$g_1^*(x^1) = \arg\min_{g_1 \in H_1} \delta R[f_1^t; g_1] = \arg\max_{g_1 \in H_1} \sum_{i=1}^{N} \langle g_1(x_i^1), P_i^1 \rangle \tag{9}$$

and the learning step along this direction is optimized by

$$\lambda_1^* = \arg\min_{\lambda_1 \in \mathbb{R}} R[f_1^t + \lambda_1 g_1^*, f_2^t, \ldots, f_M^t] \tag{10}$$

Hence, the predictor of the first modality can be updated as

$$f_1^{t+1}(x^1) = f_1^t(x^1) + \lambda_1^* g_1^*(x^1) \tag{11}$$

Suppose that the predictors of the first m − 1 (m ≥ 2) modalities have been boosted; we similarly execute the adjustment of f_m^t with the other predictors fixed. At the point f_m^t(x^m), the first-order functional derivative of the objective function R[f_1^{t+1}(x^1), ..., f_{m−1}^{t+1}(x^{m−1}), f_m(x^m), f_{m+1}^t(x^{m+1}), ..., f_M^t(x^M)], along the direction of a multiclass weak learner g_m(x^m), is calculated as

$$\delta R[f_m^t; g_m] = \left. \frac{\partial R[f_1^{t+1}, \ldots, f_{m-1}^{t+1}, f_m^t + \xi_m g_m, f_{m+1}^t, \ldots, f_M^t]}{\partial \xi_m} \right|_{\xi_m = 0} = - \sum_{i=1}^{N} \langle g_m(x_i^m), P_i^m \rangle \tag{12}$$

where P_i^m ∈ R^{K−1} and its expression is

$$P_i^m = \alpha_m \sum_{k=1}^{K} (c_{s_i} - c_k) \exp(\langle f_m^t, c_k - c_{s_i} \rangle) - 2\beta \left[ \sum_{j=1}^{m-1} C C^T (f_m^t - f_j^{t+1}) + \sum_{j=m}^{M} C C^T (f_m^t - f_j^t) \right] \tag{13}$$

Based on gradient descent, the optimal multiclass weak learner of the m-th modality at iteration t + 1 is

$$g_m^*(x^m) = \arg\min_{g_m \in H_m} \delta R[f_m^t; g_m] = \arg\max_{g_m \in H_m} \sum_{i=1}^{N} \langle g_m(x_i^m), P_i^m \rangle \tag{14}$$

and the optimal step along the direction g_m^* is

$$\lambda_m^* = \arg\min_{\lambda_m \in \mathbb{R}} R[f_1^{t+1}, \ldots, f_{m-1}^{t+1}, f_m^t + \lambda_m g_m^*, f_{m+1}^t, \ldots, f_M^t] \tag{15}$$

Therefore, the updating rule for the predictor of the m-th modality is

$$f_m^{t+1}(x^m) = f_m^t(x^m) + \lambda_m^* g_m^*(x^m) \tag{16}$$

With the multiclass exponential loss, the main steps of our multimodal multiclass Boosting algorithm, called MMBoost_exp, are summarized in Algorithm 1. As can be seen, if O(φ_m) denotes the time cost of learning the optimal predictor of the m-th modality in each iteration, then the overall time complexity of our proposed algorithm is O(μ Σ_{m=1}^M φ_m). Accordingly, our algorithm also has a linear characteristic when the time cost of the multiclass weak learner is linear in the dataset size. In addition, the convexity of the optimization problem is proved in Proposition 1. Hence, the globally optimal solutions of Eq. (5) can generally be achieved after a few iterations, and the convergence of the objective function will be explored in the experiments.

Algorithm 1 MMBoost_exp.
Input: Dataset (D, S), codeword matrix C, the number of iterations μ, parameters α_m and β.
1: Set t = 0, and f_m^t(x^m) = 0 ∈ R^{K−1} for m = 1, ..., M.
2: while t < μ do
3:   Compute P_i^1 with Eq. (8).
4:   Find g_1^*(x^1) and λ_1^* by using Eq. (9) and Eq. (10), respectively.
5:   Update f_1 with Eq. (11).
6:   for m = 2 to M do
7:     Compute P_i^m with Eq. (13).
8:     Find g_m^*(x^m) and λ_m^* by using Eq. (14) and Eq. (15), respectively.
9:     Update f_m with Eq. (16).
10:  end for
11:  t = t + 1.
12: end while
Output: The stronger predictor f_m^μ(x^m) for each modality.

After acquiring the stronger predictor f_m^μ of each modality, the posterior category probability of a given example x_i^m can be computed as

$$P(s = k \mid x_i^m) = \sigma(\langle f_m^\mu(x_i^m), c_k \rangle) \Big/ \sum_{k'=1}^{K} \sigma(\langle f_m^\mu(x_i^m), c_{k'} \rangle) \tag{17}$$

where σ(·) denotes the sigmoid function. Given any query from one modality and the retrieved objects from the other modality, their semantic vectors are obtained by the mapping of Eq. (17). Therefore, a universal distance can be used to accomplish the cross-modal retrieval.

Proposition 1. If the multiclass exponential loss function is given in Eq. (6), then the optimization problem in Eq. (5) is convex with respect to any predictor f_m (m = 1, ..., M) when the others are fixed.

Proof. Given an example D, if ρ_s = P_{S|D}(s | x^1, ..., x^M) denotes the probability that it belongs to the s-th semantic category, then we can use Eq. (4) to compute the expected risk

$$R[f_1, \ldots, f_M \mid D] = E_{S|D}\{ L[s, f_1(x^1), \ldots, f_M(x^M)] \mid D \} = \sum_{s=1}^{K} \rho_s L[s, f_1(x^1), \ldots, f_M(x^M)] \tag{18}$$

With respect to f_m (m = 1, ..., M), the functional derivatives of first and second order are given below

$$\frac{\partial R[f_1, \ldots, f_M]}{\partial f_m} = \alpha_m \sum_{s=1}^{K} \sum_{k=1}^{K} \rho_s \theta_{k,s} \exp(\langle f_m, \theta_{k,s} \rangle) + 2\beta \sum_{j=1}^{M} C C^T (f_m - f_j) \tag{19}$$

$$\frac{\partial^2 R[f_1, \ldots, f_M]}{\partial (f_m)^2} = \alpha_m \sum_{s=1}^{K} \sum_{k=1}^{K} \rho_s [\theta_{k,s} \theta_{k,s}^T] \exp(\langle f_m, \theta_{k,s} \rangle) + 2\beta (M - 1) C C^T \tag{20}$$

where θ_{k,s} = c_k − c_s. Owing to the differences among the codewords, the outer products θ_{k,s} θ_{k,s}^T are positive semi-definite and C C^T is positive definite. Moreover, the coefficients are nonnegative, so the matrix in Eq. (20) is strictly positive definite. On the other hand, span(H_m) is a convex functional space. Therefore, the proposition holds, which completes the proof. □
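Under the (assumed) convention that each predictor is represented by its outputs on the training set, the following is a minimal Python sketch of one boosting round of Algorithm 1; `fit_weak` and `line_search` are hypothetical helpers standing in for Eqs. (14) and (15), and the whole block is an illustration rather than the authors' implementation.

```python
import numpy as np

def grad_vectors_exp(F, m, y, C, alpha_m, beta):
    """P_i^m of Eqs. (8)/(13) for every training example; F[j] is the N x (K-1)
    array of predictor outputs, already holding iteration t+1 values for j < m."""
    N, Km1 = F[m].shape
    CCt = C @ C.T                                     # (K-1, K-1)
    P = np.zeros((N, Km1))
    for i in range(N):
        s = y[i]
        margins = C.T @ F[m][i]                       # <f_m(x_i^m), c_k>
        w = np.exp(margins - margins[s])              # exp(<f_m, c_k - c_s>)
        P[i] = alpha_m * (C[:, s][:, None] - C) @ w   # sum_k (c_s - c_k) * w_k
        for Fj in F:                                  # inter-modal part of Eq. (13)
            P[i] -= 2.0 * beta * CCt @ (F[m][i] - Fj[i])
    return P

def mmboost_exp_round(F, X, y, C, alpha, beta, fit_weak, line_search):
    """One round of Algorithm 1 (a sketch): the modalities are updated in turn.
    fit_weak and line_search are hypothetical helpers for Eqs. (14) and (15)."""
    for m in range(len(F)):
        P = grad_vectors_exp(F, m, y, C, alpha[m], beta)
        g = fit_weak(X[m], P)                         # weak learner maximizing sum_i <g(x_i), P_i>
        lam = line_search(F, m, g(X[m]), y, C, alpha, beta)
        F[m] = F[m] + lam * g(X[m])                   # Eq. (16), stored in place
    return F
```

Because the outputs F[j] of already-updated modalities are overwritten in place, summing the inter-modal term over all j reproduces the two bracketed sums of Eq. (13).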

3.3. Optimization algorithm with multiclass logistic loss

Table 1. The summary of Wiki and NUS-WIDE datasets.

Dataset     Size     Training set   Testing set   Classes
Wiki        2866     2173           693           10
NUS-WIDE    25,000   5000           1250          10

Algorithm 2 MMBoost_log.
Input: Dataset (D, S), codeword matrix C, the number of iterations μ, parameters α_m and β.
1: Set t = 0, and f_m^t(x^m) = 0 ∈ R^{K−1} for m = 1, ..., M.
2: while t < μ do
3:   for m = 1 to M do
4:     Compute Q_i^m with Eq. (23).
5:     Find g_m^*(x^m) by using Eq. (24).
6:     Find λ_m^* by using Eq. (10) or Eq. (15).
7:     Update f_m with Eq. (11) or Eq. (16).
8:   end for
9:   t = t + 1.
10: end while
Output: The stronger predictor f_m^μ(x^m) for each modality.
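Algorithm 2 differs from Algorithm 1 only in how the examples are reweighted: the exponential vectors P_i^m are replaced by the logistic vectors Q_i^m of Eq. (23) derived below. A minimal Python sketch under the same representation assumptions as the earlier sketch (predictor outputs stored in F); again an illustration, not the authors' code.

```python
import numpy as np

def grad_vectors_log(F, m, y, C, alpha_m, beta):
    """Q_i^m of Eq. (23): like P_i^m but with a logistic weighting of each class margin."""
    N, Km1 = F[m].shape
    CCt = C @ C.T
    Q = np.zeros((N, Km1))
    for i in range(N):
        s = y[i]
        margins = C.T @ F[m][i]
        e = np.exp(margins - margins[s])
        w = e / (1.0 + e)                             # exp(.) / (1 + exp(.)) term of Eq. (23)
        Q[i] = alpha_m * (C[:, s][:, None] - C) @ w
        for Fj in F:                                  # inter-modal part, as in Eq. (13)
            Q[i] -= 2.0 * beta * CCt @ (F[m][i] - Fj[i])
    return Q
```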

Alternatively, the multiclass logistic loss function of the m-th modality can be defined as

$$L_m[s, f_m(x^m)] = \sum_{k=1}^{K} \log[1 + \exp(\langle f_m(x^m), c_k - c_s \rangle)] \tag{21}$$

As before, if the first m − 1 (m ≥ 2) predictors have been updated, we update f_m^t by fixing the predictors f_j^{t+1} (j = 1, ..., m − 1) and f_j^t (j = m + 1, ..., M). Along the direction of a weak learner g_m(x^m), the first-order functional derivative of the corresponding risk around the point f_m^t(x^m) is

$$\delta R[f_m^t; g_m] = - \sum_{i=1}^{N} \langle g_m(x_i^m), Q_i^m \rangle \tag{22}$$

where Q_i^m ∈ R^{K−1} and its expression is

$$Q_i^m = \alpha_m \sum_{k=1}^{K} (c_{s_i} - c_k) \frac{\exp(\langle f_m^t, c_k - c_{s_i} \rangle)}{1 + \exp(\langle f_m^t, c_k - c_{s_i} \rangle)} - 2\beta \left[ \sum_{j=1}^{m-1} C C^T (f_m^t - f_j^{t+1}) + \sum_{j=m}^{M} C C^T (f_m^t - f_j^t) \right] \tag{23}$$

Specifically, the second term in Eq. (23) turns into −2β Σ_{j=1}^M C C^T (f_1^t − f_j^t) if m = 1. At iteration t + 1, the direction that decreases the risk the most can be calculated analogously by gradient descent

$$g_m^*(x^m) = \arg\max_{g_m \in H_m} \sum_{i=1}^{N} \langle g_m(x_i^m), Q_i^m \rangle \tag{24}$$

Therefore, the optimal step size along this direction is acquired by Eq. (15), and the predictor of the m-th modality can be updated by Eq. (16). Clearly, we have updating rules similar to Eqs. (10) and (11) when m = 1. The multimodal multiclass Boosting algorithm with logistic loss is formally depicted in Algorithm 2, and the theoretical guarantee of the optimization problem is given in Proposition 2. For simplicity, we omit the analysis of MMBoost_log, which is analogous to that of MMBoost_exp. Essentially, the two proposed algorithms utilize different quantities, namely P_i^m and Q_i^m, to recompute the weight distributions of the examples.

Proposition 2. If the multiclass logistic loss function is given in Eq. (21), then the optimization problem in Eq. (5) is convex for any predictor f_m (m = 1, ..., M) when the others are fixed.

Proof. Succinctly, the first and second order derivatives of the expected risk with respect to f_m (m = 1, ..., M) are

$$\frac{\partial R[f_1, \ldots, f_M]}{\partial f_m} = \alpha_m \sum_{s=1}^{K} \sum_{k=1}^{K} \rho_s \theta_{k,s} \frac{\exp(\langle f_m, \theta_{k,s} \rangle)}{1 + \exp(\langle f_m, \theta_{k,s} \rangle)} + 2\beta \sum_{j=1}^{M} C C^T (f_m - f_j) \tag{25}$$

$$\frac{\partial^2 R[f_1, \ldots, f_M]}{\partial (f_m)^2} = \alpha_m \sum_{s=1}^{K} \sum_{k=1}^{K} \rho_s [\theta_{k,s} \theta_{k,s}^T] \frac{\exp(\langle f_m, \theta_{k,s} \rangle)}{[1 + \exp(\langle f_m, \theta_{k,s} \rangle)]^2} + 2\beta (M - 1) C C^T \tag{26}$$

Similarly, the matrix in Eq. (26) is strictly positive definite, which easily completes the proof. □

4. Experiment

In this section, we conduct extensive experiments on two real-world datasets, both of which include an image modality and a text modality, and compare the proposed MMBoost framework with several existing unimodal Boosting methods for cross-modal retrieval, where the query and the retrieved objects come from different modalities.

4.1. Experiment settings

The benchmark datasets are Wiki [19] and NUS-WIDE [35], whose statistics are given in Table 1. The notable Wiki dataset consists of 2866 multimodal documents, namely image-text pairs, which are assembled from Wikipedia articles. For each document, the image modality is represented by a 128-dimensional bag-of-visual-words vector based on the SIFT feature [36], and the text modality is described by a 10-dimensional topic vector generated by latent Dirichlet allocation [37]. This dataset is originally divided into a training set of 2173 documents and a testing set of 693 documents, and each document is manually annotated with one label from ten semantic categories.
As a large-scale dataset collected from Flickr, NUS-WIDE originally has 269,648 multimodal examples. Every example contains an image and its corresponding textual tags, and is associated with one or more labels from 81 semantic categories.


Fig. 3. Parameter sensitivity analysis for text query on the left and image query on the right.

Similar to previous works [17,18], we randomly extract 25,000 multimodal examples such that each example belongs to exactly one of the ten most common semantic concepts, such as animal, buildings, clouds and so on. For each example, a 500-dimensional SIFT feature vector and a 1000-dimensional tag codebook vector are provided to represent the image and text modalities, respectively. Since it is unrealistic to obtain all the supervised information of a large dataset, we select only 1250 examples as the testing set, and 5000 examples from the rest to form the training set.
Generally, mean Average Precision (MAP) is a standard numeric metric for evaluating the performance of cross-modal retrieval. The larger the MAP, the better the performance. Given a query and a set of w retrieved examples, the Average Precision is

$$AP = \frac{1}{E} \sum_{i=1}^{w} Precision(i)\, \eta(i) \tag{27}$$

where Precision(i) is the precision at position i, and E is the number of relevant examples in the retrieved set. η(i) denotes an indicator function whose value is one if the i-th retrieved example is relevant to the query and zero otherwise. In the experiments, the value of w is set to 50 or to the number of all retrieved examples, namely w = 50 or w = all. From Eq. (27), MAP is obtained by averaging the AP values of all queries. In addition, the 11-point Precision-Recall (PR) curves and precision curves are also reported on the two datasets. Note that, in our experiments, an object is ground-truth relevant to a query if and only if they share the same semantic label.
To compare with the proposed MMBoost, we run unsupervised CCA [14], the unimodal Boosting methods RUSBoost [26], TotalBoost [25], AdaBoostM2 [5] and MCBoost [31], and CCA+MCBoost [20]. Specifically, CCA utilizes the low-level features to learn the maximal correlation between different modalities, but neglects both the intra-modal and inter-modal semantic information. Moreover, the unimodal Boosting methods only consider the intra-modal semantic information rather than the inter-modal semantic correlation. The original settings of the above methods are carefully tuned for a fair comparison, and their best performances are reported. For unimodal Boosting, the posterior class probabilities are recovered by applying the sigmoid function to the margins. Following [31,32], the multiclass weak learner in our MMBoost is a decision tree of depth two. Unless otherwise specified, the weight of the m-th modality α_m and the number of iterations μ are set to 1 and 200, respectively. We learn the stronger predictor on the training set, then obtain the semantic vectors of the examples in the testing set with the learnt predictor, and finally utilize centered normalized correlation to rank the retrieved objects. It should be noted that both the query and retrieved sets are the testing set. Due to the randomness of the data split, all the approaches evaluated on the NUS-WIDE dataset are run ten times, and the average results are reported. Moreover, all the experiments are implemented in MATLAB 2014b on a computer with an Intel(R) Core(TM) i7-5500U CPU and 8 GB RAM.
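To make the evaluation pipeline concrete, the sketch below maps margins to posterior semantic vectors as in Eq. (17), ranks the retrieved set by centered (mean-subtracted) normalized correlation, and computes the AP of Eq. (27), interpreting E as the number of relevant items within the top-w list. It is an illustrative Python reading under our assumptions, not the authors' MATLAB code.

```python
import numpy as np

def semantic_vector(f_x, C):
    """Posterior category probabilities of Eq. (17): sigmoid of each margin, normalized."""
    p = 1.0 / (1.0 + np.exp(-(C.T @ f_x)))
    return p / p.sum()

def centered_corr(a, b):
    """Centered normalized correlation between two semantic vectors."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def average_precision(query_vec, query_label, gallery_vecs, gallery_labels, w=None):
    """AP of Eq. (27) for one query against the retrieved (gallery) set."""
    scores = np.array([centered_corr(query_vec, g) for g in gallery_vecs])
    order = np.argsort(-scores)[: (w or len(scores))]        # top-w ranking
    rel = (np.asarray(gallery_labels)[order] == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    precision_at = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at * rel).sum() / rel.sum())
```

MAP is then simply the mean of the AP values over all queries, computed separately for text queries and image queries.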

4.2. Preliminary experiment

In our MMBoost framework, there is an important parameter

β, which controls the influence of the different risks on the objective function. To explore whether the two proposed algorithms are sensitive to this parameter, the training set is randomly split into five folds, and each fold is validated with a model trained on the remaining four. The average MAP scores for the different parameter values are obtained on the validation sets of the two datasets and are presented in Fig. 3. Here the results with w = all are plotted; the results for w = 50 have similar curves. As can be seen, the same version of MMBoost performs differently on the different datasets, which may be attributed to the inherent characteristics of the datasets. On the other hand, the different versions of MMBoost have similar performances on the same dataset, which will also be observed in the coming sections. In the learning procedures, both versions utilize the same optimization strategy, namely alternating gradient descent over the modalities, to learn the intra-modal and inter-modal semantic information, which may explain this phenomenon. If β is too large, the optimization algorithms prefer to reduce the inter-modal


Fig. 4. Precision-recall curves (top row) and precision curves (bottom row) on Wiki dataset, for text query on the left and image query on the right.

risk, thereby ignoring the semantic information of each modality. Conversely, if β is too small, the optimization algorithms focus on reducing the intra-modal risk, which makes it harder to capture the semantic consistency among the different modalities. Besides, the inter-modal risk is discarded entirely when β = 0, and the results in this case are consistently worse than those with the best settings. Specifically, the best average MAP scores of MMBoost_exp and MMBoost_log are achieved on the Wiki dataset when their parameters are 100 and 10, respectively, while satisfactory average performances of the two methods are obtained on the NUS-WIDE dataset when their parameters are both 10. These parameter settings are used in the remaining experiments.
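A minimal sketch of the five-fold validation loop described above; `train_mmboost` and `mean_average_precision` are hypothetical stand-ins for the training and evaluation routines, and the candidate grid is only an example.

```python
import numpy as np

def select_beta(data, labels, betas=(0, 1, 10, 100, 1000), folds=5, seed=0):
    """Pick beta by five-fold validation: each fold is scored with a model
    trained on the remaining four folds, and the average MAP decides."""
    idx = np.random.RandomState(seed).permutation(len(labels))
    splits = np.array_split(idx, folds)
    scores = {}
    for beta in betas:
        maps = []
        for f in range(folds):
            val = splits[f]
            trn = np.hstack([splits[j] for j in range(folds) if j != f])
            model = train_mmboost(data, labels, trn, beta=beta)               # hypothetical trainer
            maps.append(mean_average_precision(model, data, labels, val))     # hypothetical evaluator
        scores[beta] = float(np.mean(maps))
    return max(scores, key=scores.get), scores
```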

4.3. Results on Wiki dataset

The cross-modal retrieval performances of our proposed MMBoost and the compared methods on the Wiki dataset are reported in Table 2, including the MAP values of retrieving images with text (text query), those of retrieving text with images (image query), and their averages. It can be clearly seen from this table that our proposed approaches consistently outperform the compared ones on both retrieval tasks. For instance, MMBoost_exp achieves average MAP values of 0.338 with w = 50 and 0.262 with w = all, which improve by about 11.6% and 19.1%, respectively, over the second best approach. The intra-modal semantic information aims at


Fig. 5. Precision-recall curves (top row) and precision curves (bottom row) on NUS-WIDE dataset, for text query on the left and image query on the right.

Table 2. The performance comparison (MAP values) on Wiki dataset.

Experiment          Text query        Image query       Average
                    w=50    w=all     w=50    w=all     w=50    w=all
CCA [14]            0.345   0.198     0.261   0.242     0.303   0.220
RUSBoost [26]       0.169   0.126     0.171   0.190     0.170   0.158
TotalBoost [25]     0.189   0.143     0.226   0.203     0.207   0.173
AdaBoostM2 [5]      0.215   0.149     0.219   0.199     0.217   0.174
MCBoost [31]        0.254   0.167     0.219   0.200     0.236   0.183
CCA+MCBoost [20]    0.287   0.174     0.237   0.218     0.262   0.196
MMBoost_log         0.361   0.225     0.291   0.291     0.326   0.258
MMBoost_exp         0.382   0.231     0.294   0.292     0.338   0.262

reflecting the discriminative abstraction of each modality, and the inter-modal semantic correlation focuses on preserving the consistency between different modalities, so their combination may be beneficial for improving the retrieval performance. On the other hand, the intra-modal semantic vectors mapped from low quality objects can be enhanced by complementing the corresponding inter-modal semantic correlations. These may be the reasons behind the gain of our MMBoost framework. In addition, the performance of MMBoost_exp is quite similar to that of MMBoost_log, meaning that the proposed MMBoost is somewhat insensitive to the multiclass loss function.


Fig. 6. Effect of training set size on MAP values, for text query on the left and image query on the right.

Table 3. The performance comparison (MAP values) on NUS-WIDE dataset.

Experiment          Text query        Image query       Average
                    w=50    w=all     w=50    w=all     w=50    w=all
CCA [14]            0.282   0.207     0.278   0.208     0.280   0.207
RUSBoost [26]       0.277   0.205     0.235   0.258     0.256   0.232
TotalBoost [25]     0.262   0.209     0.216   0.239     0.239   0.224
AdaBoostM2 [5]      0.304   0.217     0.272   0.275     0.288   0.246
MCBoost [31]        0.332   0.253     0.314   0.274     0.323   0.264
CCA+MCBoost [20]    0.295   0.229     0.292   0.232     0.293   0.231
MMBoost_log         0.410   0.293     0.363   0.336     0.386   0.314
MMBoost_exp         0.422   0.300     0.367   0.343     0.395   0.322

To give a more detailed analysis, the corresponding PR curves and precision curves are shown in Fig. 4. From the PR curves, we can see that the proposed methods again outperform their counterparts on the two tasks, which is consistent with the results in Table 2. Concretely, the improvements are substantial and occur at most levels of recall, implying better accuracy and generalization. The precision curve reflects the change of precision with respect to the number of retrieved objects. Clearly, the precision curves of MMBoost are always above those of the compared methods, which means that more relevant objects can be returned for the same number of retrieved objects.

4.4. Results on NUS-WIDE dataset

To cope with out-of-sample data, we randomly select a small portion of examples as the training set. For text and image queries, the MAP scores of the different methods are shown in Table 3, and the corresponding curves are plotted in Fig. 5. As can be seen, our MMBoost algorithms again perform better than the compared methods by a large margin, which is consistent with the results on the Wiki dataset. Specifically, MMBoost_exp improves by about 22.0% over MCBoost, attaining an average MAP score of 0.322 when w = all. Moreover, we can observe a phenomenon by combining these results with those of the previous subsection. On the Wiki dataset, CCA has better performance

than CCA+MCBoost, and CCA+MCBoost has higher performance than MCBoost. However, from high to low, the performances on the NUS-WIDE dataset are generated by MCBoost, CCA+MCBoost and CCA, respectively. The reason may be that the image and text features are high-dimensional and sparse in the NUS-WIDE dataset. From these features, CCA with its orthogonality constraint can only seek imprecise canonical components, which further degrades the performance of CCA+MCBoost. In contrast, on the Wiki dataset CCA may avoid the meaningless components and learn the subspace correlation, resulting in the performance improvement of CCA+MCBoost. In a word, the above experiments demonstrate that the MMBoost framework has a significant advantage because of the combination of intra-modal semantic information and inter-modal semantic correlation.

4.5. Effect of training set size

To investigate the effect of the training set size, we conduct two experiments on the NUS-WIDE dataset, where the size of the training set is varied from 1000 to 8000. The first experiment is designed to analyze the retrieval performance of MMBoost, and the average results with different w on the testing set are shown in Fig. 6. As the size of the training set increases, we can observe from this figure that the performance of MMBoost first increases and then tends to converge. In particular, there is no obvious influence on the MAP values when the size is larger than 5000. The reason for this phenomenon may lie in the following two aspects. When the training set is small, more information can be supplied by increasing its size, leading to an improvement in performance. Conversely, redundant information may be generated, which does not help to mine the characteristics of the dataset. Furthermore, this experiment also demonstrates that our MMBoost is able to learn effective stronger predictors with a reasonably small training set.
Next, we explore the time complexities of MMBoost and MCBoost for learning the stronger predictors with different training set sizes, and plot the average seconds divided by the number of iterations in Fig. 7. As can be seen, the time costs of MMBoost are


Fig. 7. Effect of training set size on time cost.

Fig. 8. Convergence study of MMBoost, for Wiki on the top and NUS-WIDE on the bottom.

significantly lower than those of MCBoost, and the cost difference grows as the size of the training set increases. The reason may be that the stronger predictors of the different modalities are learned independently in the MCBoost method, but simultaneously through a joint objective function in the MMBoost framework. Specifically, MCBoost may first be executed on the image modality for 200 iterations to learn the image predictor, and then re-executed on the text modality for another 200 iterations to learn the text predictor. Therefore, there are two independent periods in this learning process, and their sum divided by 200 is the time cost of MCBoost. In contrast, our MMBoost is jointly implemented on the two modalities for 200 iterations to accomplish the same goal. Our two MMBoost algorithms have essentially the same learning procedure except for

the manner of recomputing the weight distributions of the examples, which is why their time costs are similar. Moreover, decision trees are taken as the multiclass weak learners in both the MMBoost and MCBoost approaches, so their time costs are all linear in the size of the training set, which indicates the ability to deal with a large-scale dataset.

4.6. Convergence study

In a sense, our MMBoost can be regarded as an extended variant of MCBoost, which has been shown to converge to the minimum in [31]. To analyze the convergence property of the MMBoost framework with respect to the number of iterations, we finally


design an experiment consisting of two parts on the Wiki and NUS-WIDE datasets. Specifically, the first part is conducted for 1000 iterations on the training sets of each dataset, and the average objective function values divided by the training size are recorded at every iteration. Meanwhile, the second part is performed on the corresponding testing sets, and the average MAP values with w = all are computed at every 20 iterations. Due to limited space, these experimental results are illustrated together in Fig. 8. As can be seen from this figure, the objective function values show a convergent tendency on the NUS-WIDE dataset, whereas the same phenomenon is less clear on the Wiki dataset. The image features of the Wiki dataset are relatively weak, so they may not describe the corresponding examples well, which may explain why the objective function values decrease more slowly. As the number of iterations increases, the average MAP values quickly increase and then gradually converge on both datasets, which implies that the steady performance of MMBoost can be attained after a few iterations. Furthermore, the experiment on the Wiki dataset is performed with only one run, which results in the slightly rougher MAP curves.

5. Conclusion

In this paper, we have focused on designing a multimodal multiclass Boosting framework for cross-modal retrieval, in which a query from one modality is answered with objects from another. Different from unimodal Boosting, we simultaneously consider the intra-modal and inter-modal losses in the objective function to learn the stronger predictor of each modality. Furthermore, we have proposed two effective and efficient algorithms for mining the intra-modal semantic information and inter-modal semantic correlation, and have also given the corresponding theoretical analysis. Next, we have obtained the posterior category probabilities by utilizing the sigmoid function, and executed cross-modal retrieval in a semantic space. Experimental results on two benchmark datasets have demonstrated the effectiveness of our proposed framework.
In the future, we would like to work on the following three aspects. First, the framework used here is a supervised approach, so how to combine an unsupervised approach or exploit a semi-supervised method will be studied to improve the performance. Second, an interesting benefit may be obtained by integrating other multiclass weak learners, such as convolutional neural networks. Last, we will research how to extend our scheme to online learning, multi-label learning and mislabeled noisy data.

Conflict of interest

None.

Acknowledgment

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was partly supported by the National Natural Science Foundation of China (Nos. 11702087, 61602158 and 61772176), the Natural Science Foundation of Henan Province (No. 162300410177), the Natural Science Foundation of Shandong Province (No. ZR2018PF007), the Key Program of Higher Education Institutions of Henan Province (No. 17A520040), and the Ph.D. Research Foundation of Henan Normal University (No. qd15134).

References

[1] P. Wang, C. Shen, N. Barnes, H. Zheng, Fast and robust object detection using asymmetric totally corrective boosting, IEEE Trans. Neural Netw. Learn. Syst. 23 (1) (2012) 33–46.

[2] L. Liu, L. Shao, P. Rockett, Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition, Pattern Recognit. 46 (7) (2013) 1810–1818.
[3] X. Yuan, L. Xie, M. Abouelenien, A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data, Pattern Recognit. 77 (2018) 160–172.
[4] B. Sun, S. Chen, J. Wang, H. Chen, A robust multi-class AdaBoost algorithm for mislabeled noisy data, Knowl. Based Syst. 102 (2016) 87–102.
[5] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[6] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Statist. 28 (2) (2000) 337–407.
[7] M.J. Saberian, H. Masnadi-Shirazi, N. Vasconcelos, TaylorBoost: first and second-order boosting algorithms with explicit margin control, IEEE Conf. Comput. Vis. Pattern Recognit. (2011) 2929–2934.
[8] E.L. Allwein, R.E. Schapire, Y. Singer, Reducing multiclass to binary: a unifying approach for margin classifiers, J. Mach. Learn. Res. 1 (Dec) (2000) 113–141.
[9] J. Zhu, H. Zou, S. Rosset, T. Hastie, Multi-class adaboost, Statist. Interface 2 (3) (2009) 349–360.
[10] S. Wang, P. Pan, Y. Lu, An adaptive multiclass boosting algorithm for classification, Int. Joint Conf. Neural Netw. (2014) 1159–1166.
[11] A. Fernández-Baldera, L. Baumela, Multi-class boosting with asymmetric binary weak-learners, Pattern Recognit. 47 (5) (2014) 2080–2090.
[12] Y. Peng, X. Huang, Y. Zhao, An overview of cross-media retrieval: concepts, methodologies, benchmarks and challenges, IEEE Trans. Circuits Syst. Video Technol. 28 (9) (2018) 2372–2385.
[13] S. Wang, P. Pan, Y. Lu, L. Xie, Improving cross-modal and multi-modal retrieval combining content and semantics similarities with probabilistic model, Multimed. Tools Appl. 74 (6) (2015) 2009–2032.
[14] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (12) (2004) 2639–2664.
[15] V. Ranjan, N. Rasiwasia, C. Jawahar, Multi-label cross-modal retrieval, Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4094–4102.
[16] T. Yao, X. Kong, H. Fu, Q. Tian, Semantic consistency hashing for cross-modal retrieval, Neurocomputing 193 (2016) 250–259.
[17] G. Ding, Y. Guo, J. Zhou, Y. Gao, Large-scale cross-modality search via collective matrix factorization hashing, IEEE Trans. Image Process. 25 (11) (2016) 5427–5440.
[18] Z. Lin, G. Ding, J. Han, J. Wang, Cross-view retrieval via probability-based semantics-preserving hashing, IEEE Trans. Cybern. 47 (12) (2017) 4342–4355.
[19] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G.R. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia retrieval, ACM Int. Conf. Multimed. (2010) 251–260.
[20] J.C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G.R. Lanckriet, R. Levy, N. Vasconcelos, On the role of correlation and abstraction in cross-modal multimedia retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 36 (3) (2014) 521–535.
[21] O. Danielsson, B. Rasolzadeh, S. Carlsson, Gated classifiers: boosting under high intra-class variation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2673–2680.
[22] M.J. Saberian, N. Vasconcelos, Boosting algorithms for simultaneous feature extraction and selection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2448–2455.
[23] C. Cortes, M. Mohri, U. Syed, Deep boosting, Proceedings of the International Conference on Machine Learning, 2014, pp. 1179–1187.
[24] K. Chen, S. Wang, Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 129–143.
[25] M.K. Warmuth, J. Liao, G. Rätsch, Totally corrective boosting algorithms that maximize the margin, Proceedings of the International Conference on Machine Learning, 2006, pp. 1001–1008.
[26] C. Seiffert, T.M. Khoshgoftaar, J. Van Hulse, A. Napolitano, RUSBoost: improving classification performance when training data is skewed, Proceedings of the International Conference on Pattern Recognition, 2008, pp. 1–4.
[27] V. Kuznetsov, M. Mohri, U. Syed, Multi-class deep boosting, Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2501–2509.
[28] R. Appel, P. Perona, A simple multi-class boosting framework with theoretical guarantees and empirical proficiency, Proceedings of the International Conference on Machine Learning, 2017, pp. 186–194.
[29] I. Mukherjee, R.E. Schapire, A theory of multiclass boosting, J. Mach. Learn. Res. 14 (Feb) (2013) 437–497.
[30] K. Wu, Z. Zheng, S. Tang, BVDT: a boosted vector decision tree algorithm for multi-class classification problems, Int. J. Pattern Recognit. Artif. Intell. 31 (05) (2017) 1750016.
[31] M.J. Saberian, N. Vasconcelos, Multiclass boosting: theory and algorithms, Proceedings of the Advances in Neural Information Processing Systems, 2011, pp. 2124–2132.
[32] O. Beijbom, M. Saberian, D. Kriegman, N. Vasconcelos, Guess-averse loss functions for cost-sensitive multiclass boosting, Proceedings of the International Conference on Machine Learning, 2014, pp. 586–594.
[33] J. Tanha, M. van Someren, H. Afsarmanesh, Boosting for multiclass semi-supervised learning, Pattern Recognit. Lett. 37 (2014) 63–77.
[34] J. Tanha, MSSBoost: a new multiclass boosting to semi-supervised learning, Neurocomputing 314 (2018) 251–266.
S. Wang, Z. Dou and D. Chen et al. / Neurocomputing 357 (2019) 11–23 [35] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: a real-world web image database from national university of Singapore, Proceedings of the ACM International Conference on Image and Video Retrieval, 2009, p. 48. [36] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110. [37] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (Jan) (2003) 993–1022. Shixun Wang received the Ph.D. degree in computer science from Huazhong University of Science and Technology, Wuhan, China, in 2015. He is currently a Lecturer with the School of Computer and Information Engineering, Henan Normal University. His research interests include machine learning and cross-modal analysis.

23

Hairong Yu earned her Ph.D. degree in computer science from Huazhong University of Science and Technology, Wuhan, China, in 2016. She is currently a Lecture in Qufu Normal University, and her research interests include machine learning and software engineering.

Yuan Li received her Ph.D. degree in mechanical engineering from the Hunan University, Changsha, China, in 2015. She is currently a Lecturer with Henan Normal University, and her research interests include machine learning and numerical method.

Zhi Dou received his Ph.D. in communication engineering from Nanjing University of Science and Technology, Nanjing, China, in 2016. He is currently a Lecturer in Henan Normal University, and his research focuses on digital image processing.

Deng Chen received the Ph.D. degree in computer science from Huazhong University of Science and Technology, Wuhan, China, in 2014. He is currently a Lecturer with Wuhan Institute of Technology, and his research focuses on multimedia data processing and software engineering.

Peng Pan received the B.S., M.S. and Ph.D. degrees in computer science from Huazhong University of Science and Technology in 1998, 2001 and 2007, respectively. He is currently an associate Professor with the School of Computer Science & Technology, Huazhong University of Science and Technology. His research interests are in the areas of multimedia information retrieval, computer vision and machine learning.