Novel framework for image attribute annotation with gene selection XGBoost algorithm and relative attribute model


PII: S1568-4946(19)30135-8
DOI: https://doi.org/10.1016/j.asoc.2019.03.017
Reference: ASOC 5387

To appear in: Applied Soft Computing Journal

Received date: 13 August 2018
Revised date: 5 March 2019
Accepted date: 6 March 2019

Please cite this article as: H. Zhang, D. Qiu, R. Wu et al., Novel framework for image attribute annotation with gene selection XGBoost algorithm and relative attribute model, Applied Soft Computing Journal (2019), https://doi.org/10.1016/j.asoc.2019.03.017

Highlights

1. A novel feature mid-fusion algorithm called GS-XGBoost, based on the eXtreme gradient boosting (XGBoost) algorithm and a modified effective range-based gene selection (ERGS) algorithm, is proposed.
2. This work is the first to conduct relative attribute annotation on the deep-level semantics of various material attributes.
3. A new hierarchical material attribute representation mechanism is constructed on the basis of the correctly classified material attributes and their deep-level semantics.

Graphical Abstract

Our image attribute annotation framework is composed of several key components: "image feature learning," "feature mid-fusion," and "attribute annotation." Figure 1 illustrates the working flow of the annotation framework in detail (we use the "MattrSet" and "Mattr_RA" datasets to describe the proposed framework).

Figure 1. Working flow of the proposed image attribute annotation framework.



Novel Framework for Image Attribute Annotation with Gene Selection XGBoost Algorithm and Relative Attribute Model

Hongbin Zhang 1,*, Diedie Qiu 1, Renzhong Wu 1, Yixiong Deng 2, Donghong Ji 3, Tao Li 4

1 East China Jiaotong University, School of Software, Nanchang, China
2 East China Jiaotong University, School of Materials Science and Engineering, Nanchang, China
3 Wuhan University, School of Cyber Science and Engineering, Wuhan, China
4 School of Computing and Information Science, Florida International University, Miami, USA

* Correspondence: [email protected]; Tel.: +86-13767005548

Abstract: Recognizing material attributes from their visual appearance is an important problem in computer vision. However, few works model the hierarchical relationship between material attributes and their deep-level semantics that occur in the same image. Meanwhile, a single image feature is insufficient for high-quality material attribute classification. In this paper, we present methods for generating a new hierarchical material attribute representation mechanism using a newly designed feature mid-fusion algorithm and the state-of-the-art relative attribute (RA) model. The novel feature mid-fusion algorithm improves the performance of material attribute classification, and the RA model mines the deep-level semantics of material attributes, which provide useful and detailed knowledge about those attributes. We call the novel feature mid-fusion algorithm gene selection eXtreme gradient boosting (GS-XGBoost). This algorithm combines the state-of-the-art boosting idea (eXtreme gradient boosting) with the popular multi-feature fusion idea (effective range-based gene selection). To describe material attributes comprehensively, we also measure the relative degree of their deep-level semantics. A new hierarchical material attribute representation mechanism is constructed on the basis of the correctly classified material attributes and their deep-level semantics. The mechanism has two forms: a binary attribute representation mechanism and a relative attribute representation mechanism. We demonstrate the effectiveness of the proposed GS-XGBoost algorithm on two different datasets. The proposed GS-XGBoost algorithm is not an end-to-end framework, but it is efficient and practical for fine- and coarse-grained material attribute classification problems and can be applied in scenarios such as large-scale product image retrieval, robotics, and industrial inspection. The novel hierarchical material attribute representation mechanism will help humans or robots accurately recognize diverse materials and their deep-level semantics. Our research contributes not only to computer science but also to material science and engineering.


Keywords: image attribute annotation; material attribute classification; effective range-based gene selection; eXtreme gradient boosting; relative attribute

1. Introduction

Image annotation characterizes objects or scenes in images with isolated tags or textual annotations. The traditional image annotation method helps humans gain a comprehensive understanding of the implicit semantic information of images [1-2]. Given a group of accurate tags or textual information, a robust and discriminative multimedia information retrieval (MIR) system (or vertical search engine) can be constructed easily. The MIR system can provide humans with a better interactive retrieval experience than before. Nouns, namely, "names," "places," and "brands," are traditionally utilized to depict objects or scenes in images. However, a large amount of ambiguity and noise appears in the traditional image annotation method [1-2]. To overcome this problem to some extent, several researchers now strive to annotate images with sentences (or N-gram phrases). This new research field is called image captioning [3-6]. Such annotation models can obtain a large amount of combined semantic information (very long phrases or complete sentences) but are becoming increasingly complex. A complicated natural language generation (NLG) model using the state-of-the-art encoder–decoder framework [5-6] is required, and the NLG model must be tuned carefully. Noisy textual or syntactic information in sentences also affects the final performance and reduces the practical value of the annotation model.

Unlike the above-mentioned image annotation methods [1-6], a new annotation method that recognizes the mid-layer semantic information of images is effective. This new method is called image attribute annotation, and it has attracted considerable attention in visual recognition research. Attributes provide useful and detailed knowledge about objects or scenes in images and serve as a bridge between low-level features and high-level categories. Various applications [7-10], such as multimedia retrieval, face recognition, cloth detection, and multimedia content analysis, can benefit from attributes. Binary attributes (BAs) (i.e., Boolean values describing the presence of an attribute) and relative attributes (RAs) (i.e., real values describing the strength of an attribute) are predicted and used for different applications.

Material is also a key image attribute. Material attribute recognition (annotation) from visual appearance is an important problem in computer vision with numerous applications spanning from scene analysis to robotics and industrial inspection [11]. Knowledge of the material attributes of an object or a surface can provide considerable valuable information about its properties. Some recent studies focus on material attribute recognition and obtain improved results [12-18]. Material attribute annotation can also be regarded as a texture classification problem [12-14]. However, material attribute prediction is still a challenging task because the appearance of a surface depends on various factors, such as shape, reflectance properties, illumination, viewing direction, and micro-geometry. Moreover, humans do not rely only on vision to recognize materials but also use a combustion method to distinguish different materials. For example, Polyester produces black smoke after combustion, Nylon produces white smoke, Poly Urethane (Pu) produces black smoke and no evident smell, and Canvas produces white smoke. These simple physical changes are insufficient to accurately distinguish different material attributes, and many confusing results may be obtained after combustion. The recognition problem becomes even more difficult if the materials are coated in advance. Apart from combustion, the sense of touch can also be useful to discriminate materials with subtle differences [11]. For example, when humans want to determine the material of a garment, they rub it to assess its roughness. In addition to the aforementioned methods, computer models can be used to complete material attribute recognition.

Surprisingly, distinguishing different materials is difficult even when a popular classification model is combined with a state-of-the-art image feature. The t-SNE [19] algorithm for dimensionality reduction enables the embedding of high-dimensional data into a 2D space and can provide intuitive cognition of the challenge of material attribute classification. The high-dimensional data for visualization are various image features, including hand-crafted low-level features (scale-invariant feature transform (SIFT) [20], Gist [21], and local binary pattern (LBP) [22]) and a convolutional neural network (CNN) feature (i.e., the last fully-connected layer of the state-of-the-art VGG16 [23] model). We extract the above-mentioned features of the images from our dataset "MattrSet" (detailed information on the dataset is presented in Section 4.1) and then use the t-SNE algorithm to illustrate the image features, as shown in Figures 1(a), (b), (c), and (d). The Pu material category, shown in red, presents a distinct localization in the t-SNE space. However, the Polyester and Nylon materials have confusing features, as indicated by the perceived overlap, which accounts for lower sensitivity and specificity results than in the Pu material category. Therefore, except for Pu, the other materials, including Polyester, Nylon, and Canvas, are difficult to identify. The results indicate that material attribute prediction is a challenging computer vision task. The confusing results also imply that a single image feature is insufficient to achieve high-quality material attribute classification.

Material is difficult to recognize but is an interesting attribute and can be interpreted by other attributes. Each material attribute contains a group of deep-level semantics (or attributes). For example, the Pu material has the deep-level semantic of "Waterproofness," the Canvas material has the deep-level semantic of "Breathability," the Polyester material has the deep-level semantic of "Softness," and the Nylon material has the deep-level semantic of "Washability." The deep-level semantics of material attributes are close to humans' empirical cognition. They can provide humans or robots with considerable valuable semantic information. On the basis of the correctly classified materials and their deep-level semantics, new hierarchical material representation mechanisms (cognition systems) similar to WordNet [24] and HowNet [25] can be built easily to help humans or robots accurately recognize different materials from diverse perspectives. However, to the best of our knowledge, very few works model the hierarchical relationship between material attributes and their deep-level semantics that occur in the same image. On the basis of the aforementioned analysis, we strive to create a novel hierarchical material representation mechanism with two key modules. One achieves material attribute classification by designing a novel feature mid-fusion algorithm, and the other mines the deep-level semantics of material attributes by using the state-of-the-art RA model [26].


Figure 1. Sample visualization results based on t-SNE. (a) SIFT; (b) Gist; (c) LBP; (d) VGG16.
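For readers who wish to reproduce a visualization of this kind, the following minimal sketch projects one pre-computed feature matrix into 2D with scikit-learn's t-SNE implementation. The feature matrix, labels, and perplexity value below are placeholders, not the exact settings used to produce Figure 1.

```python
# Minimal sketch: 2D t-SNE projection of one image feature (e.g., VGG16 activations).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4096))   # placeholder features; replace with SIFT/Gist/LBP/VGG16 matrices
y = rng.integers(0, 4, size=400)   # placeholder material labels (0=Pu, 1=Canvas, 2=Polyester, 3=Nylon)

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

for label, name in enumerate(["Pu", "Canvas", "Polyester", "Nylon"]):
    pts = emb[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
plt.legend()
plt.title("t-SNE embedding of one image feature")
plt.show()
```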

We propose a novel hierarchical material representation mechanism that combines a newly designed feature mid-fusion algorithm and the state-of-the-art RA model [26]. We demonstrate the effectiveness of the proposed feature mid-fusion algorithm on two different datasets. One is a fine-grained material dataset called "Fabric" [11], and the other is a coarse-grained material dataset called "MattrSet." We measure the relative degree of the deep-level semantics of material attributes to describe these attributes clearly. Our research contributes not only to computer science but also to material science and engineering. The main contributions of this study are summarized as follows:

1) We propose a new material attribute dataset called "MattrSet." The dataset contains heterogeneous products. It can help humans gain a comprehensive understanding of various popular materials, such as "Pu," "Canvas," "Nylon," and "Polyester." With the help of material attributes, we further mine a group of valuable deep-level attributes (e.g., the "Pu" material has the deep-level semantic of "Waterproofness," whereas the "Polyester" material has the deep-level semantic of "Softness"). The deep-level semantics provide considerable valuable and detailed knowledge about various materials.

2) We design a novel feature mid-fusion algorithm called gene selection eXtreme gradient boosting (GS-XGBoost) based on the state-of-the-art eXtreme gradient boosting (XGBoost) [27] algorithm and a modified feature fusion algorithm called effective range-based gene selection (ERGS) [28]. The proposed GS-XGBoost algorithm dynamically assigns the corresponding ERGS weight to the estimated probability of each image feature. Experiments on two different datasets demonstrate the effectiveness of the proposed GS-XGBoost algorithm.

3) We annotate the deep-level semantics of images by the state-of-the-art RA [26] model. To the best of our knowledge, this work is the first to conduct RA annotation on the deep-level semantics ("Waterproofness," "Breathability," "Softness," "Washability," and "Wearability") of various popular material attributes. The discovered deep-level semantics based on relative measurement contribute to building a new human–machine interface, which can provide humans with a better interactive retrieval experience than before.

4) We propose a new material attribute representation mechanism. A new hierarchical material attribute representation mechanism is constructed on the basis of the correctly classified material attributes and their deep-level semantics. The novel representation mechanism will help humans or robots accurately recognize diverse materials and their deep-level semantics.

5) We extend several evaluation metrics based on Average Precision (AP) and Mean Average Precision (MAP). APfeature, APmaterial, APattribute, MAPfeature, and MAPmaterial are designed to evaluate the presented annotation model from different perspectives. They are useful supplements to the evaluation of material attribute classification.

The rest of the paper is organized as follows. In Section 2, related works are discussed from two aspects and our motivations are presented. The proposed method is described in Section 3. Experiments on two different datasets and the results are presented in Section 4. Useful conclusions and future works are elaborated in Section 5.

2. Related Work

2.1 Attribute Learning

Using attributes to describe objects or scenes in images is a novel technology that has attracted considerable attention in computer vision research, because attributes can provide considerable valuable and detailed knowledge about objects (or scenes). An attribute is a visual property that appears or disappears in an image. If this property can be expressed in human language, then we call it a semantic property [29]. Different properties may describe different image characteristics, such as colors, patterns, and shapes [30]. Several recent studies concentrate on attribute learning, which serves as a bridge between low-level features and high-level categories. Farhadi [30] trains a group of binary SVM models to accomplish image attribute annotation. Kumar [31-32] also uses several binary classifiers, including nose, eye, and mouth classifiers, to achieve face recognition and retrieval. Similar to Kumar, Liu [33] learns binary face attributes by using a localization component and an identity classifier followed by linear SVMs for BA prediction. Recently, deep CNN-based models have been proposed to address attribute predictions, such as person re-identification [34], fashion detection [35], cross-modal facial attribute recognition [36], image segmentation [37], image captioning [38], fashion search [39], and clothes recognition [40].

Despite the success of end-to-end deep learning, many researchers use neural networks for feature extraction and utilize these features in traditional machine learning frameworks for attribute prediction. Liang [41] learns a feature space by using additional information from object categories, and Gan [42] creates category-invariant features that are helpful for attribute predictions. Unseen objects or scenes can also be recognized by textual descriptions [43-44]. Berg [45] mines a group of product attributes from noisy online texts. Kovashka [46] learns object models expediently by providing information on multiple object classes with each attribute label. To find considerable valuable semantic knowledge, state-of-the-art technologies, such as dictionary learning, attention mechanisms, and multi-task learning, are applied in attribute predictions. Liu [47] proposes a weakly-supervised dictionary learning method to exploit attribute correlations, and this method is beneficial for improving image classification performance. Yuan [48] presents a part-based middle-level representation based on the response maps of local part filters for improved attribute learning. Tang [49] also focuses on extracting a middle-level representation based on the patterns that are most shared among different classes for scene attribute classification. Recently, Liu [50] proposes an attribute-guided attention localization scheme in which the local region localizers are learned under the guidance of part attribute descriptions for fine-grained attribute recognition. An attention-guided transfer architecture that learns the weighting of the available source attribute classifiers is proposed for non-semantic transfer from attributes that may be in different domains [51-52].

Furthermore, multiple attributes can be recognized within a single framework. Shao [53] uses a multi-task learning framework to learn attributes for crowd scene understanding. Fouhey [54] recognizes 3D shape attributes by using a multi-label and an embedding loss; binary semantic attributes are then learned through a multi-task CNN model.

Abdulnabi [29] proposes a multi-task learning framework that allows CNN models to simultaneously share visual knowledge among different attribute categories. Considering that face attribute prediction has important applications in video surveillance, face retrieval, and social media, many multi-task learning-based models are used in face attribute prediction, such as joint prediction of heterogeneous face attributes [55], facial attribute classification [56], and face identification [57][58].

The aforementioned studies [11-18][29-58] use BAs to describe whether an image holds an attribute (true) or not (false). In most cases, the BA definition is useful for attribute learning (attribute label prediction). However, in some cases, we aim to measure the strength of the attribute presence. Thus, RAs are learned through a ranking method (an RA has a real value describing the strength of an attribute). Grauman [26] first proposes to model RAs by learning a ranking function per attribute. A novel interactive search method [59-60] (the WhittleSearch system) with RA feedback is proposed; this method is computationally feasible for the machine and the user because it requires less user interaction. Yu [61] attempts to predict noticeable differences in visual RAs by a Bayesian local learning strategy. Qiao [62] acquires enhanced image classification performance by using RA measurement of a shared feature space. Law [63] proposes a general Mahalanobis-based distance metric learning framework that exploits distance constraints over up to four different images for RA recognition. Cheng [64] introduces a novel zero-shot image classifier called random forest (RF) based on RAs. Ergul [65] puts forward an RA-based incremental learning method for image recognition. Recently, the state-of-the-art multi-task learning framework has been used for RA prediction. Multi-task RA predictions are completed by incorporating local context and global style information features [66]. Singh [67] proposes an end-to-end multi-task deep convolutional network to simultaneously localize and rank relative visual attributes with only weakly-supervised pairwise image comparisons. Apart from the above-mentioned works [26][59-67], Dubey [68] also develops a novel algorithm to model image virality on online media by using a pairwise neural network. The network generates category supervision and provides a positive signal for RA prediction. Souri [69] introduces a deep neural network architecture for RA prediction. Yu [70] trains attribute ranking models based on generative adversarial networks to predict the relative strength of an attribute via synthetically generated images.

In summary, attributes provide considerable valuable and detailed knowledge about objects or scenes in images and serve as a bridge between low-level features and high-level categories. Thus, BAs and RAs have played important roles in computer vision. We also strive to conduct attribute predictions on material datasets and create a new hierarchical material representation mechanism for humans or robots.

2.2 Feature Fusion

Four types of feature fusion methods are usually used in machine learning research.

1) Boosting methods: In these methods, a group of weak classifiers are trained and integrated into a strong classifier. As expected, the final classification performance is enhanced to some extent by the strong classifier. Several state-of-the-art boosting-based methods, such as the gradient boosting decision tree (GBDT) [71], RankBoost [72], Adaboost [73], and XGBoost [27], have been proposed to solve various classification problems (e.g., sentiment analysis in natural language processing or image classification in computer vision).

2) Multi-feature fusion methods: In these methods, several individual features, such as shape features (e.g., SIFT [20] and LBP [22]), texture features (e.g., Textons [12] and Gist [21]), and color features (e.g., RGB and HSV), are linearly weighted to create a "new feature" with more discriminant ability than each individual feature. Several state-of-the-art multi-feature fusion methods, such as multiple-kernel learning (MKL) [74], multiple-kernel boost (MKBoost) [75], and ERGS [28], have been proposed to solve various classification problems.

3) Graph methods: In these methods, numerous sub-graphs (or heterogeneous graphs) corresponding to different image features are constructed first to evaluate the visual similarity or context-sensitive similarity between images. Then, the sub-graphs (or heterogeneous graphs) are fused (or integrated) into a complete large graph. As expected, image retrieval performance is improved to some extent by the complete graph. Several state-of-the-art graph methods, such as multiple affinity graphs [76], heterogeneous graph propagation [77], rank-aware graph fusion [78], and ImageGraph [79], have been proposed to solve large-scale image retrieval or search problems.

4) Search methods in wrapper techniques: Finding a minimal feature set has been described as an NP-hard problem [80]. Searching for the best combination of features is a challenging problem, especially in wrapper-based methods. Thus, an intelligent optimization method is required to reduce the number of evaluations [81]. In these methods, modern search strategies or intelligent optimization models are designed to efficiently complete feature selection. Several state-of-the-art search methods, such as the binary grasshopper optimization algorithm with evolutionary population dynamics [81], the binary salp swarm algorithm with crossover [82], binary dragonfly optimization with a time-varying transfer function [83], the asynchronous accelerating multi-leader salp chain algorithm [84], the gray wolf optimizer [85], the whale optimization approach [86], and the chaotic water cycle algorithm [87], have been proposed to solve various classification problems.

Boosting methods [27][71-73] improve classification performance by using a group of weak classifiers to create a strong classifier. However, they cannot fully utilize the mutual complementarity among various image features. For example, shape features (e.g., SIFT [20]) and texture features (e.g., Gist [21]) are usually used together to depict the key visual characteristics of images. The implicit mutual complementarity between them is beneficial for improving the final classification performance, but boosting methods cannot fully utilize this advantage. The multi-feature fusion methods [28][74-75] improve classification performance by assigning a weight to each feature and creating a "new feature." The "new feature" represents the implicit mutual complementarity between features and thus can effectively describe the key visual content of images. However, state-of-the-art multi-feature fusion methods, such as MKL and MKBoost, must calculate a complete kernel matrix. As the number of samples increases, the kernel matrix gradually enlarges, which affects real-time efficiency. Although several kernel tricks are applied in the multi-feature fusion methods, a single classifier rather than a group of weak classifiers is used to complete classification, so much room for improvement remains in classification performance. Graph methods [76-79] are often used for large-scale ranking-based image retrieval tasks. Image retrieval performance is improved by using a complete large graph. However, as the number of samples increases, the graph methods need a large amount of storage space and computing resources to create the large graph, which also affects real-time efficiency. Search methods [81-87] usually focus on feature selection rather than feature fusion. They aim to use few features, improved fitness measures, short running time, and small running space to describe the samples in datasets. They mainly deal with benchmark datasets in the feature selection field rather than large-scale unstructured data obtained from real applications (e.g., images from Flickr.com or real shops).

2.3 Our Motivations

Our research motivations are twofold. The first is derived from the real application perspective. We aim to build a new material attribute representation mechanism for humans or robots. The representation mechanism is constructed hierarchically from different material attributes and the corresponding deep-level semantics (Table 1). The new representation mechanism has two forms: a BA representation mechanism (BARM) and an RA representation mechanism (RARM). The material attribute representation mechanism will help humans or robots accurately recognize diverse materials. With its help, material attribute classification can be implemented automatically and efficiently, which is beneficial for many real scenarios in robotics, industrial inspection, and large-scale product image retrieval (for online e-commerce websites). The second motivation is derived from the machine learning perspective. We aim to create a novel feature mid-fusion algorithm to solve various classification problems. The novel feature mid-fusion algorithm considers the state-of-the-art boosting idea (assembling a group of weak classifiers) and the popular multi-feature fusion idea (assigning a weight to each image feature) to build a robust and discriminant classifier. With the help of this classifier, the material attribute classification accuracy can be greatly improved.

3. GS-XGBoost Algorithm and RA Model

3.1 Annotation Framework

Our image attribute annotation framework is composed of several key components: "image feature learning," "feature mid-fusion," and "attribute annotation." Figure 2 illustrates the working flow of the annotation framework in detail (we use the "MattrSet" and "Mattr_RA" datasets to describe the proposed framework).

Figure 2. Working flow of the proposed image attribute annotation framework.

First, numerous images of various material attributes are downloaded from a famous B2B e-commerce website in China using a spider program. With the instruction of a material expert, a new "wild" dataset called "MattrSet" is created, which comprises four popular material attributes: Pu, Canvas, Polyester, and Nylon. The dataset contains heterogeneous products. Several data cleaning strategies are needed to clean the "MattrSet" dataset: 1) Images of too small a size (<100×100) are removed because they contain little useful semantic information. 2) Duplicated images (very low sum of squared differences) are removed because they add no new information. 3) Noisy or irrelevant images are removed because they would heavily affect material classification performance. 4) Material categories with too few images (<1000) are discarded because a robust classification model cannot be trained on them. These data cleaning strategies are all carried out automatically by our programs. With the help of these strategies, we obtain a clean "MattrSet" dataset (approximately 3.2 GB). Section 4.1 describes the "MattrSet" dataset in detail.
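These rules can be scripted. The sketch below shows one possible implementation of the size and near-duplicate filters with Pillow and NumPy; the directory layout, thresholds, and helper names are illustrative assumptions rather than the exact scripts used to build "MattrSet."

```python
# Illustrative data-cleaning pass: drop tiny images and near-duplicates.
import os
import numpy as np
from PIL import Image

MIN_SIDE = 100          # rule 1: remove images smaller than 100x100
SSD_THRESHOLD = 1e-3    # rule 2: near-zero sum of squared differences => duplicate

def load_small(path, size=(64, 64)):
    """Load a downscaled grayscale copy used only for duplicate comparison."""
    with Image.open(path) as im:
        return np.asarray(im.convert("L").resize(size), dtype=np.float32) / 255.0

def clean_folder(folder):
    kept, thumbs = [], []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        path = os.path.join(folder, name)
        with Image.open(path) as im:
            w, h = im.size
        if min(w, h) < MIN_SIDE:          # rule 1: too small
            continue
        t = load_small(path)
        if any(np.mean((t - u) ** 2) < SSD_THRESHOLD for u in thumbs):
            continue                      # rule 2: near-duplicate
        thumbs.append(t)
        kept.append(path)
    return kept   # rules 3 and 4 (noisy images, class size) are handled separately
```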


Second, four image features (SIFT [20], Gist [21], LBP [22], and VGG16 [23]) are extracted to characterize images from diverse visual perspectives. "Image feature learning" is a fundamental task in computer vision research. Various image features, including SIFT [20], Gist [21], LBP [22], RGB, and HSV, are usually utilized to characterize the visual content of images. These traditional image features have played important roles in most computer vision systems for nearly two decades. However, with the rapid development of feature learning engineering, several state-of-the-art feature learning models, such as efficient match kernels (EMK) [88], kernel descriptors (KDES) [89], sparse coding (SC) [90], restricted Boltzmann machines (RBM) [91], deep belief nets (DBN) [92], and CNNs [93], have been utilized to effectively process images. As illustrated in Figure 2, "image feature learning" is an important premise of the proposed annotation framework. A robust "image feature learning" method is needed to effectively extract image features for material attribute annotation. To achieve this goal, we should consider the discriminant ability of each image feature and the complementarity among different image features. Thus, SIFT [20] (a traditional shape feature with several invariant characteristics), Gist [21] (a traditional texture feature that effectively describes global textural characteristics), LBP [22] (a traditional shape feature that effectively describes local image patches), and VGG16 [23] (a state-of-the-art deep learning feature that is an excellent complement to the above-mentioned traditional features) are chosen to complete our "image feature learning" procedure. These features characterize images from different visual aspects and establish an important premise for constructing our feature mid-fusion algorithm.

Third, a novel feature mid-fusion algorithm called GS-XGBoost is proposed to deal with the material attribute classification problem. The hybrid algorithm is a fundamental component of the image attribute annotation framework. Details on the GS-XGBoost algorithm are shown in Figure 3. Feature mid-fusion differs from feature early fusion, which produces very high-dimensional features; high-dimensional features imply the "curse of dimensionality," which is not good for material attribute classification. Feature mid-fusion also differs from feature late fusion, which needs final bagging or boosting strategies to complete feature fusion. Thus, we strive to create a novel hybrid feature mid-fusion algorithm. The novel feature mid-fusion algorithm considers the state-of-the-art boosting idea (assembling a group of weak classifiers) and the popular multi-feature fusion idea (assigning a weight to each image feature) to create a robust and discriminant classifier. Accordingly, the classifier can fully utilize the advantages of the popular boosting and multi-feature fusion algorithms. With the help of the classifier, material attribute classification performance will be greatly improved.

Finally, four popular material attributes are annotated by the proposed GS-XGBoost algorithm. On the basis of the material attributes, we can further acquire a group of valuable deep-level semantics of the corresponding material attributes. These deep-level semantics are very close to humans' empirical cognition, thereby allowing humans or robots to comprehensively understand the popular material attributes from diverse perspectives. We use the state-of-the-art RA model [26] to measure the strength of each deep-level semantic to obtain a large amount of useful semantic information about material attributes. We thereby obtain a new material attribute representation mechanism, which is hierarchically constructed from the material attributes and their deep-level semantics. This mechanism has two forms. One is BARM, which is designed for binary material attribute annotation. The other is RARM, which is designed for relative material attribute annotation. The material attribute representation mechanism will help humans or robots accurately recognize diverse materials. With its help, material attribute classification can be implemented automatically and efficiently, which is beneficial for many real scenarios in robotics, industrial inspection, and large-scale product image retrieval (for online e-commerce websites). The proposed GS-XGBoost algorithm is a fundamental component of the image attribute annotation framework. We use Figure 3 to describe the innovative feature mid-fusion algorithm in detail.

Figure 3. Proposed GS-XGBoost feature mid-fusion procedure.

As illustrated in Figure 3, four image features (SIFT [20], Gist [21], LBP [22], and VGG16 [23]) are extracted to characterize images from diverse visual perspectives. Then, the state-of-the-art XGBoost algorithm [27] is used to calculate the estimated probability of each image feature. The XGBoost algorithm builds a strong classifier by assembling a group of weak classifiers. Thereafter, the traditional ERGS algorithm [28] is modified to dynamically compute the corresponding ERGS weight of each image feature. The traditional ERGS algorithm [28] is designed only for feature selection rather than feature fusion. However, a feature fusion procedure must be implemented to depict the visual content of images comprehensively and objectively. Accordingly, the estimated probability of each image feature is weighted by its ERGS weight. The novel feature mid-fusion algorithm considers the state-of-the-art boosting idea (XGBoost) and the popular multi-feature fusion idea (ERGS) to create a robust and discriminant classifier. On the one hand, this algorithm fully utilizes the mutual complementarity among different image features. On the other hand, it fully utilizes all weak classifiers and creates a strong classifier.

3.2 XGBoost Algorithm

The XGBoost algorithm does not apply the stochastic gradient descent method to carry out the corresponding optimization procedure. Instead, it uses an additive learning strategy, which adds the best tree model into the current classification model at the m-th prediction. Therefore, the m-th prediction is

\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + f_m(x_i).   (1)

\hat{y}_i^{(m-1)} is the current classification model, f_m(x_i) is the best tree model at the m-th prediction, and \hat{y}_i^{(m)} is the new classification model for the next prediction. Therefore, the corresponding objective function of the XGBoost algorithm is

Obj^{(m)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(m)}\big) + \sum_{k=1}^{m} \Omega(f_k).   (2)

We use l(\cdot,\cdot) to compute the loss value of the XGBoost algorithm. \Omega(f_k) is a regularization term that helps prevent overfitting. On the basis of Equations (1) and (2), we obtain a new form of the objective function of the XGBoost algorithm:

Obj^{(m)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(m-1)} + f_m(x_i)\big) + \Omega(f_m) + constant.   (3)

In Equation (3), \Omega(f_m) is a regularization term that measures the complexity of the tree f_m, and constant is a constant term. The second-order Taylor expansion is

f(x + \Delta x) \approx f(x) + f'(x)\Delta x + \tfrac{1}{2} f''(x)\Delta x^{2}.   (4)

Accordingly, we take the Taylor expansion of the objective as

Obj^{(m)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(m-1)}\big) + g_i f_m(x_i) + \tfrac{1}{2} h_i f_m^{2}(x_i) \Big] + \Omega(f_m) + constant.   (5)

The XGBoost algorithm retains the second-order term and can therefore work with any loss function that is twice differentiable. It uses the two functions g_i = \partial_{\hat{y}^{(m-1)}} l\big(y_i, \hat{y}_i^{(m-1)}\big) and h_i = \partial^{2}_{\hat{y}^{(m-1)}} l\big(y_i, \hat{y}_i^{(m-1)}\big) to complete the scoring process. If we remove the constant terms, then we obtain a new form of the objective function of the XGBoost algorithm:

Obj^{(m)} = \sum_{i=1}^{n} \Big[ g_i f_m(x_i) + \tfrac{1}{2} h_i f_m^{2}(x_i) \Big] + \Omega(f_m).   (6)

Each tree is redefined as f_m(x) = w_{q(x)} (q : \mathbb{R}^{d} \rightarrow \{1, \dots, T\}, w \in \mathbb{R}^{T}). q(x) indicates the leaf node that a sample x falls on, w_{q(x)} is the score of that leaf node and is also the prediction value of the current model, and T is the number of leaf nodes. \Omega(f_m) is redefined as

\Omega(f_m) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}.   (7)

The first term of Equation (7) penalizes the number of leaf nodes. The second term of Equation (7) is the L2 normalization of the score of each leaf node, which contributes to preventing overfitting. The instance set of leaf node j is defined as I_j = \{ i \mid q(x_i) = j \}. Thus, we obtain a new form of the objective of the XGBoost algorithm:

Obj^{(m)} = \sum_{j=1}^{T} \Big[ \big(\textstyle\sum_{i \in I_j} g_i\big) w_j + \tfrac{1}{2} \big(\textstyle\sum_{i \in I_j} h_i + \lambda\big) w_j^{2} \Big] + \gamma T.   (8)

The XGBoost algorithm computes the optimal leaf score w_j^{*} as

w_j^{*} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.   (9)

On the basis of Equation (9), we further obtain a new form of the objective of XGBoost:

Obj^{*} = - \tfrac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T.   (10)

We set G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i, and the objective is rewritten as

Obj^{*} = - \tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T.   (11)

We set I_L and I_R as the instance sets of the left and right child nodes of a candidate split (with G_L, H_L, G_R, and H_R the corresponding sums), and the objective reduction (gain) of the split is rewritten as

Gain = \tfrac{1}{2} \Big[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \Big] - \gamma.   (12)

An exact greedy algorithm for split finding in the XGBoost algorithm is designed to find the best split point [27]. To prevent overfitting, a shrinkage parameter \eta should be tuned carefully. This parameter shrinks the contribution of each new tree, which leaves "space" for the following trees and therefore helps prevent overfitting. The final prediction model is

\hat{y}_i = \sum_{m=1}^{M} \eta f_m(x_i).   (13)

The state-of-the-art XGBoost algorithm is shown as follows:

Algorithm 1: XGBoost Algorithm
Input: the image feature F = {(x_1, y_1), ..., (x_n, y_n)}, the loss function l(y, \hat{y}), and the total number of sub-trees M
Output: the estimated probability of the image feature F
(1) Repeat
(2)   Initialize the m-th tree f_m
(3)   Compute the first-order gradient statistics g_i
(4)   Compute the second-order gradient statistics h_i
(5)   Use the statistics to greedily grow a new tree f_m
(6)   As shown in Equation (13), add the best tree f_m into the current model \hat{y}^{(m-1)}
(7) Until all M sub-trees are processed
(8) Obtain a strong regression tree based on all weak regression sub-trees
(9) Output the estimated probability based on the strong regression tree
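As a concrete illustration of how Algorithm 1 is used in our framework, the following minimal Python sketch estimates per-feature class probabilities with the open-source xgboost package; the feature matrices and hyper-parameter values are placeholders, not the exact settings used in our experiments.

```python
# Minimal sketch: estimate per-feature class probabilities with XGBoost.
# X_train/X_test are placeholder (n_samples x dim) matrices for one image feature,
# y_train holds material labels (e.g., 0=Pu, 1=Canvas, 2=Polyester, 3=Nylon).
from xgboost import XGBClassifier

def xgboost_probability(X_train, y_train, X_test):
    """Train one XGBoost model on a single image feature and return its
    estimated class-probability matrix for the test images."""
    clf = XGBClassifier(
        objective="multi:softprob",  # multi-class probability output
        n_estimators=200,            # number of weak regression sub-trees (M)
        learning_rate=0.1,           # shrinkage parameter eta
        max_depth=6,
    )
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)  # shape: (n_test, n_classes)

# One probability matrix per image feature (Algorithm 1 is run per feature), e.g.:
# probas = {name: xgboost_probability(Xtr, y_train, Xte)
#           for name, (Xtr, Xte) in features.items()}
```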

3.3 A Modified ERGS Algorithm

Four image features (LBP, SIFT, Gist, and VGG16) are extracted to characterize images from different visual perspectives. However, a feature mid-fusion model with discriminative ability is required to effectively classify different material attributes. Thus, the ERGS algorithm [28] is used to evaluate the extracted image features and rank the features on the basis of different ERGS weights. Compared with the MKL algorithm [74] or the MKBoost algorithm [75], the ERGS algorithm does not compute a complete kernel matrix. Compared with the state-of-the-art feature selection algorithms [80-87], the ERGS algorithm can handle large-scale real data. A high ERGS weight indicates high importance of the corresponding image feature. However, the traditional ERGS algorithm [28] only evaluates the effective range of each image feature on each class of samples through interval estimation. The ERGS weight of each image feature is computed depending on the overlap zones among heterogeneous samples. Thus, the traditional ERGS algorithm [28] is developed for feature selection: only the feature with the most powerful discriminative ability is kept to complete image classification. In contrast, we modify the traditional ERGS algorithm [28] to create a new feature fusion model rather than a feature selection model. We use all the proposed image features, namely, SIFT, LBP, Gist, and VGG16, to complete the multi-feature fusion procedure (Algorithm 3). The modified ERGS algorithm can fully utilize the mutual complementarity among different image features. We first provide some definitions to describe the modified ERGS algorithm in detail:

D = {(x_1, y_1), ..., (x_n, y_n)} is defined as the dataset used in this study. Section 4.1 provides detailed information on the datasets. y_i \in C is the material attribute label of the sample x_i, and n is the number of all data samples. C = {c_1, ..., c_l} is the set of all material attribute labels, and l is the number of attribute categories; it is equal to 4 (or 9) in this study. F = {F_1, ..., F_d} is the set of all proposed image features, and d is the number of all proposed image features. In particular, all image features, namely, SIFT, LBP, Gist, and VGG16, are used to complete feature mid-fusion.

The effective range R_{jk} of the image feature F_j on the samples that are labeled c_k is defined as

R_{jk} = \big[\, \mu_{jk} - \gamma (1 - p_k)\sigma_{jk},\; \mu_{jk} + \gamma (1 - p_k)\sigma_{jk} \,\big].   (14)

\mu_{jk} + \gamma (1 - p_k)\sigma_{jk} represents the upper boundary of the effective range of the image feature F_j on the samples labeled c_k, whereas \mu_{jk} - \gamma (1 - p_k)\sigma_{jk} represents the lower boundary. \mu_{jk} represents the mean value of the image feature F_j on the samples labeled c_k, and \sigma_{jk} represents the corresponding standard deviation. p_k is the prior probability of the samples labeled c_k; the factor (1 - p_k) reduces the influence of the standard deviation \sigma_{jk} on the upper and lower boundaries of the effective range. \gamma is computed by the following Chebyshev inequality:

P\big(|X - \mu| \le \gamma\sigma\big) \ge 1 - \frac{1}{\gamma^{2}}.   (15)

\gamma is a derived constant that is equal to 1.732 (i.e., \sqrt{3}), so the effective range covers at least two thirds of the samples of the corresponding class. Overall, the modified ERGS algorithm is shown as follows:

Algorithm 2: Modified Effective Range-Based Gene Selection Algorithm
Input: the image feature set F = {F_1, ..., F_d}
Output: the ERGS weight w_j of each image feature F_j
(1) As shown in Equation (14), compute the effective range R_{jk} of the feature F_j on the samples labeled c_k
(2) Compute the overlap area OA_j of the feature F_j, i.e., the total overlap between the effective ranges R_{jk} of different classes
(3) Compute the overlap area coefficient AC_j of the feature F_j by normalizing OA_j
(4) Compute the ERGS weight w_j of the feature F_j: the smaller the overlap area coefficient AC_j, the larger the weight w_j
(5) Output the ERGS weight w_j of each image feature F_j

3.4 GS-XGBoost Algorithm

As described in Section 3.1, the proposed GS-XGBoost algorithm is a fundamental component of the attribute annotation framework. Four image features, namely, LBP, SIFT, Gist, and VGG16, are extracted to characterize images from different visual perspectives. First, the state-of-the-art XGBoost algorithm [27] is utilized to obtain the estimated probability of each image feature (Algorithm 1). Second, we modify the traditional ERGS algorithm [28] to dynamically calculate the ERGS weight of each image feature (Algorithm 2). Therefore, all image features are considered in the material attribute classification procedure according to their ERGS weights. High ERGS weights indicate high importance of the corresponding image features (Table 9), and such features play more important roles in material attribute annotation. Finally, we present the GS-XGBoost algorithm on the basis of Algorithms 1 and 2. The estimated probabilities are weighted by the corresponding ERGS weights and summed to complete the final feature mid-fusion. If a material attribute has the maximum sum value, then that material attribute is annotated on the image. The proposed GS-XGBoost algorithm is shown as follows:

Algorithm 3: GS-XGBoost Algorithm
Input: the image dataset D = {(x_1, y_1), ..., (x_n, y_n)} and the material attribute label set C = {c_1, ..., c_l}, where l is the number of attribute categories (in this study, we set l = 4 or l = 9)
Output: image material attributes
(1) Extract four image features, namely, SIFT, LBP, Gist, and VGG16
(2) Obtain an image feature dataset F
(3) Create a feature combination D_F = {F_1, ..., F_d}, d \le n, on the basis of F
(4) Repeat
(5)   Compute the estimated probability P_j(c_k | x) of each image feature in D_F by the XGBoost algorithm introduced in Section 3.2
(6) Until all features in D_F are processed
(7) Repeat
(8)   Compute the ERGS weight w_j of each image feature in D_F by the modified ERGS algorithm introduced in Section 3.3
(9) Until all features in D_F are processed
(10) Weight each estimated probability P_j(c_k | x) by the corresponding ERGS weight w_j and compute the material attribute with the maximum weighted sum: c^{*} = \arg\max_{c_k \in C} \sum_{j=1}^{d} w_j P_j(c_k \mid x)
(11) Output the image material attribute based on c^{*}
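To make the fusion step concrete, the sketch below combines the per-feature XGBoost probabilities with ERGS-style weights. It is a simplified, self-contained illustration under our own assumptions (the overlap-to-weight mapping and all names are ours), not the exact implementation evaluated in the experiments.

```python
# Simplified sketch of the GS-XGBoost fusion step (Algorithm 3, steps 5-11).
import numpy as np

def ergs_overlap(X, y, gamma=1.732):
    """Mean pairwise overlap of per-class effective ranges over all dimensions.
    X: (n_samples, n_dims) feature matrix; y: integer class labels (NumPy array)."""
    classes, n = np.unique(y), len(y)
    total = 0.0
    for dim in range(X.shape[1]):
        ranges = []
        for c in classes:
            v = X[y == c, dim]
            p = len(v) / n                      # prior probability of class c
            half = gamma * (1.0 - p) * v.std()  # Equation (14): mu +/- gamma*(1-p)*sigma
            ranges.append((v.mean() - half, v.mean() + half))
        for i in range(len(ranges)):
            for j in range(i + 1, len(ranges)):
                lo = max(ranges[i][0], ranges[j][0])
                hi = min(ranges[i][1], ranges[j][1])
                total += max(0.0, hi - lo)
    return total / X.shape[1]

def gs_fuse(probas, overlaps):
    """Weight each feature's class-probability matrix and sum them (step 10).
    A small overlap means a more discriminative feature, hence a larger weight;
    the 1/(1+overlap) mapping is one simple choice, not the paper's exact formula."""
    raw = {k: 1.0 / (1.0 + overlaps[k]) for k in probas}
    s = sum(raw.values())
    weights = {k: v / s for k, v in raw.items()}
    fused = sum(weights[k] * probas[k] for k in probas)
    return fused.argmax(axis=1), weights        # predicted material per test image

# Usage with placeholder names: probas from xgboost_probability() per feature,
# overlaps = {name: ergs_overlap(X_train[name], y_train) for name in probas}
# y_pred, w = gs_fuse(probas, overlaps)
```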

3.5 RA Model

As described above, each material attribute contains various deep-level semantics, such as "Waterproofness," "Breathability," and "Softness." These deep-level semantics are very close to humans' objective cognition and may therefore help humans or robots comprehensively recognize material attributes. We use the state-of-the-art RA model [26] to measure the strength of each deep-level semantic to obtain considerable valuable semantic information about materials. Two rank models, namely, RankSVM [94] and RankNet [95], are usually used to design an RA model. We choose RankSVM rather than RankNet to complete RA annotation. The RankSVM model, which is widely used in industry and has been proven to be effective, transforms a ranking problem into a pairwise classification problem.

For each deep-level semantic (attribute) a_m, an ordered RA set O_m and a similarity set S_m should be constructed first. Given two samples x_i and x_j, we compare them depending on the strength of the deep-level semantic. If (i, j) \in O_m, then the deep-level semantic a_m of the sample x_i is stronger than that of the sample x_j. If (i, j) \in S_m, then the deep-level semantic a_m of the sample x_i is similar to that of the sample x_j. Thus, a rank function is needed for each deep-level semantic:

r_m(x_i) = w_m^{T} x_i,   (16)
\forall (i, j) \in O_m : \; w_m^{T} x_i > w_m^{T} x_j,   (17)
\forall (i, j) \in S_m : \; w_m^{T} x_i \approx w_m^{T} x_j.   (18)

In Equation (16), w_m is the weight vector that will be trained in our RA model, and x_i is an image feature, such as LBP, Gist, or VGG16. On the basis of w_m, we can calculate the real strength of the deep-level semantic a_m of each testing image. We train one weight vector per deep-level semantic and then calculate the real strength of each deep-level semantic of each testing image by using the corresponding weight. The following objective function is designed for training the weight:

\min_{w_m} \; \frac{1}{2}\|w_m\|_2^{2} + C \Big( \sum \xi_{ij}^{2} + \sum \gamma_{ij}^{2} \Big),   (19)
s.t. \;\; w_m^{T}(x_i - x_j) \ge 1 - \xi_{ij}, \; \forall (i, j) \in O_m; \quad |w_m^{T}(x_i - x_j)| \le \gamma_{ij}, \; \forall (i, j) \in S_m; \quad \xi_{ij} \ge 0, \; \gamma_{ij} \ge 0.   (20)

\xi_{ij} and \gamma_{ij} are two slack variables that are used to convert the above-mentioned objective function into an inequality optimization problem. \frac{1}{2}\|w_m\|_2^{2} is a regularization term that is utilized to prevent overfitting. The quadratic loss function and the similarity constraint are used together to resolve the aforementioned optimization problem, which is approximated by introducing the two slack variables. To achieve this goal, we should ensure 1) a significant change in deep-level semantic strength for pairs in O_m and 2) a slight change in deep-level semantic strength for pairs in S_m. The state-of-the-art RA model strives to learn a rank margin rather than the classifier margin of the traditional binary attribute model. Therefore, we can calculate the real strength of each deep-level semantic of each testing image by using the corresponding weight w_m, and we can also rank our images depending on the corresponding real value.

We aim to acquire a group of deep-level semantics of material attributes so that each material attribute can be interpreted by various deep-level semantics from diverse views. The proposed deep-level semantic set is composed of "Waterproofness," "Breathability," "Softness," "Washability," and "Wearability." These semantics are very close to humans' empirical cognition, thereby allowing humans or robots to comprehensively understand different material attributes from diverse perspectives. These attributes provide useful and detailed knowledge about images and serve as a bridge between low-level features and high-level categories. With the help of a material expert, we classify the material attributes depending on the strength of the corresponding deep-level semantics. Table 1 provides details on the "Mattr_RA" dataset used in our experiment. To complete deep-level semantic annotation, we use 200 images per class to construct a new dataset called "Mattr_RA" on the basis of the "MattrSet" dataset. The "Mattr_RA" dataset has 800 images, which are utilized to acquire the rank function of our RA model for each deep-level semantic. Table 1 shows the binary relationship and the relative orderings of the categories by the deep-level semantics.

Most BA models focus entirely on attributes as binary predicates that indicate the presence (or absence) of a certain deep-level semantic in an image. For example, the Pu material has the deep-level semantic of "Waterproofness," which is presented as "1" in the BA model.

However, for various deep-level semantics, this binary setting is an unnatural restriction. We see the limitation of BAs in distinguishing between some categories, whereas the same set of attributes, treated relatively, teases them apart. Thus, we use the state-of-the-art RA model [26][59-60] to predict the strength of a deep-level semantic of material attributes with respect to other images. For example, we use a value from "1" to "4" to represent the strength of a deep-level semantic, where "4" is the highest strength value and "1" is the lowest. Another important advantage of the RA model is that zero-shot learning becomes simple, so we need only a small number of training samples to build our RA model. We complete zero-shot learning [26] from relative comparisons and obtain considerably better accuracy than the traditional BA model [30-34].

Table 1. Relationship between material attributes and their deep-level semantics.

Deep-level Semantics | Binary Relationship (Pu / Canvas / Polyester / Nylon) | Relative Relationship (Pu / Canvas / Polyester / Nylon)
Waterproofness       | 1 / 0 / 1 / 0                                         | 4 / 1 / 3 / 2
Breathability        | 0 / 1 / 0 / 0                                         | 2 / 4 / 1 / 3
Softness             | 0 / 0 / 1 / 1                                         | 1 / 2 / 4 / 3
Washability          | 0 / 1 / 1 / 1                                         | 1 / 4 / 2 / 3
Wearability          | 0 / 1 / 0 / 0                                         | 1 / 4 / 3 / 2
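Given the relative orderings in Table 1, one rank function per deep-level semantic can be learned with the common pairwise-difference approximation of RankSVM. The sketch below uses scikit-learn's LinearSVC as the pairwise classifier; it is a simplified stand-in for the RA model of [26], and the feature matrix, strength labels, and hyper-parameters are illustrative placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_rank_function(X, strengths, C=1.0, max_pairs=20000, seed=0):
    """Learn w such that w^T x_i > w^T x_j whenever strength(i) > strength(j).
    X: (n, d) image features; strengths: per-image relative strength values
    (e.g., the 1-4 scores of one deep-level semantic from Table 1)."""
    rng = np.random.default_rng(seed)
    pairs = [(i, j) for i in range(len(X)) for j in range(len(X))
             if strengths[i] > strengths[j]]
    if len(pairs) > max_pairs:                    # subsample pairs to keep training cheap
        idx = rng.choice(len(pairs), size=max_pairs, replace=False)
        pairs = [pairs[k] for k in idx]
    diffs = np.array([X[i] - X[j] for i, j in pairs] +
                     [X[j] - X[i] for i, j in pairs])   # mirrored pairs for both signs
    signs = np.array([1] * len(pairs) + [-1] * len(pairs))
    clf = LinearSVC(C=C, fit_intercept=False, max_iter=10000)
    clf.fit(diffs, signs)
    return clf.coef_.ravel()                      # the weight vector w of Equation (16)

# w = train_rank_function(X_mattr_ra, waterproofness_strength)
# scores = X_test @ w   # a higher score means a stronger "Waterproofness"
```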

4. Experimental Results and Discussion

4.1 Datasets and Baselines

4.1.1 Dataset

The coarse-grained "MattrSet" dataset contains approximately 3.2 GB of data. Detailed information on the dataset can be found at https://drive.google.com/open?id=12xXX_MuwII8hghwXFLtT3sneEzgA4-SN. Most recent works focus on garment classification, and there is no dataset of material attributes across heterogeneous products. We therefore focus our research on material attribute annotation across different products. Many images of various popular materials are downloaded from a website (http://www.made-in-china.com/) by a spider program. Several data cleaning strategies are needed to clean the "MattrSet" dataset (please refer to Section 3.1). Finally, the dataset contains four popular material attributes: Pu, Canvas, Nylon, and Polyester. The image resolution ranges from 123×123 to 4300×2867. The dataset comprises two types of products (i.e., bags and shoes). Two materials, namely, Pu and Canvas, are shared by bags and shoes. The per-class distribution of the bag images is as follows: #Pu is 1982, #Canvas is 1948, #Nylon is 1764, and #Polyester is 1715. The distribution of the shoe images is as follows: #Pu is 1757 and #Canvas is 1855 (nylon and polyester shoes are rare). We will demonstrate the effectiveness of the proposed GS-XGBoost algorithm on this coarse-grained dataset.


The fine-grained "Fabric" dataset [11] contains nearly 1.76 GB of data. Detailed information on the dataset can be found at https://ibug.doc.ic.ac.uk/resources/fabrics/. The authors of [11] collected samples of the surfaces of over 2000 garments and fabrics by using photometric stereo sensors. They visited many shops with the sensor (its resolution is 640×480) and a laptop and captured all images, confirming the composition of the garments from the manufacturer labels. The dataset contains nine popular material attributes, namely, Cotton, Terrycloth, Denim, Fleece, Nylon, Polyester, Silk, Viscose, and Wool. The number of samples per class is as follows: #Cotton is 588, #Terrycloth is 30, #Denim is 162, #Fleece is 33, #Nylon is 50, #Polyester is 226, #Silk is 50, #Viscose is 37, and #Wool is 90. Each sample is represented as four patches. The authors [11] cropped all images to 400×400 pixels to avoid out-of-focus regions at the corners. The "Fabric" dataset reflects the distribution of fabrics in the real world and is not balanced: the majority of garments are made of cotton and polyester, whereas silk and linen are rare. We will demonstrate the effectiveness of the proposed GS-XGBoost algorithm on this fine-grained dataset.

4.1.2 Baselines

We focus our first work on material attribute annotation (classification) rather than product classification. Notably, we obtain much worse accuracies when we use a 50% versus 50% data partitioning for the deep learning models. Hence, owing to their data-driven characteristic, we randomly choose 70% of the samples for training and the remaining samples for testing when the following deep learning models are used. For all other models, we randomly choose 50% of the samples for training and use the remaining samples for testing. For the "MattrSet" dataset, the training and testing sample distribution of the materials is as follows: #Pu is 1869, #Canvas is 1901, #Polyester is 857, and #Nylon is 882. The training and testing sample distribution of the objects is as follows: #Shoes is 1810 and #Bags is 3699. We compare the proposed feature mid-fusion algorithm GS-XGBoost with about twenty baselines, which are as follows:

Traditional classification algorithms

or

models implemented in the

scikit-learn

package [101]: Logistic regression (LR) [96], RF [97], k-nearest neighbor (KNN) [98], decision tree (DT) [99], naïve Bayes (NB) [100], and GBDT [71] (we use 200 weak classifiers in GBDT). 2)

State-of-the-art feature fusion algorithms: XGBoost [27] and Adaboost (we use 500 weak

classifiers in Adaboost) [73]. 3)

Group of novel feature mid-fusion algorithms designed by the presented gene selection

(GS) idea: GS-LR, GS-RF, GS-KNN, GS-DT, GS-NB, GS-Adaboost, and GS-GBDT. 4)

Deep learning models: InceptionResNetV2 [102], Densenet169 [103], and MobileNets [104].

5)

Other models: Farhadi’s SVM models [30] and the proposed method in [11].
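For concreteness, the sketch below shows one way the non-deep baselines in 1) and 2) could be instantiated. It is a minimal illustration assuming the scikit-learn and xgboost Python packages; apart from the stated numbers of weak classifiers, all hyper-parameters are our assumptions rather than the exact settings used in the experiments.

# Minimal sketch of the non-deep baselines (assumed packages: scikit-learn, xgboost).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

baselines = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "GBDT": GradientBoostingClassifier(n_estimators=200),   # 200 weak classifiers, as stated above
    "Adaboost": AdaBoostClassifier(n_estimators=500),       # 500 weak classifiers, as stated above
    "XGBoost": XGBClassifier(n_estimators=200),             # number of trees is tuned later (Section 4.9.1)
}

# X_train, y_train, X_test are hypothetical feature matrices / labels prepared elsewhere:
# for name, clf in baselines.items():
#     clf.fit(X_train, y_train)
#     proba = clf.predict_proba(X_test)   # class-probability estimates, later reused for fusion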

Given that the “Fabric” dataset has a small number of samples, it cannot accurately present the relative degree of attributes. Thus, we conduct RA annotation on the “MattrSet” dataset. To compare with the traditional BA model [30-34] in deep-level semantic annotation, we use only 200 images per class of the “MattrSet” dataset to construct a new dataset called “Mattr_RA.” The “Mattr_RA” dataset consists of 800 images, which are used to acquire the rank function of our RA model for each deep-level semantic; beyond these images, our RA model does not need any training data because it performs well at zero-shot learning. By contrast, the “Mattr_RA” dataset (800 images) is used to test the traditional BA model [30-34], and the remaining data (approximately 10,000 images of the “MattrSet” dataset) are used to train it, because the traditional BA model requires a large number of training samples.

4.2 Evaluation Metrics

Accuracy is an important metric for evaluating material attribute annotation performance and is expressed as Equation (21). Notably, the “MattrSet” dataset is used to describe the following equations; similar equations are obtained for the “Fabric” dataset.

$\mathrm{Accuracy} = N_{\mathrm{right}} / N_{\mathrm{all}}$  (21)

$N_{\mathrm{right}} = N_{\mathrm{TP}} + N_{\mathrm{TN}}$  (22)

where $N_{\mathrm{right}}$ is the number of images classified correctly by the proposed annotation model, $N_{\mathrm{TP}}$ is the number of positive samples that are classified correctly, $N_{\mathrm{TN}}$ is the number of negative samples that are classified correctly, and $N_{\mathrm{all}}$ is the total number of images.

In addition to Accuracy, Precision and Recall are used to characterize attribute annotation performance intuitively. The Precision metric is shown as Equation (23), whereas the Recall metric is shown as Equation (24):

$\mathrm{Precision}_{c} = N_{c}^{\mathrm{right}} / N_{c}^{\mathrm{pred}}$  (23)

$\mathrm{Recall}_{c} = N_{c}^{\mathrm{right}} / N_{c}^{\mathrm{all}}$  (24)

where $c \in \{\mathrm{Pu}, \mathrm{Canvas}, \mathrm{Polyester}, \mathrm{Nylon}\}$, $N_{c}^{\mathrm{right}}$ is the number of images that are classified correctly as the label $c$, $N_{c}^{\mathrm{pred}}$ is the total number of images that are classified as the label $c$, and $N_{c}^{\mathrm{all}}$ is the total number of images of the label $c$ in our dataset. On the basis of Precision, two families of metrics, namely, AP and MAP, are computed naturally to analyze our model effectively. APfeature, APmaterial, MAPfeature, APattribute, and MAPmaterial are defined as Equations (25), (26), (27), (28), and (29), respectively:

$AP_{\mathrm{feature}}(k,c) = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Precision}_{c}(k,t)$  (25)

$AP_{\mathrm{material}}(k,t) = \frac{1}{M}\sum_{c=1}^{M}\mathrm{Precision}_{c}(k,t)$  (26)

$MAP_{\mathrm{feature}}(k) = \frac{1}{T}\sum_{t=1}^{T}AP_{\mathrm{material}}(k,t)$  (27)

$AP_{\mathrm{attribute}}(a) = \frac{1}{F}\sum_{f=1}^{F}\mathrm{Precision}_{a}(f)$  (28)

$MAP_{\mathrm{material}}(c) = \frac{1}{K}\sum_{k=1}^{K}AP_{\mathrm{feature}}(k,c)$  (29)

where $M$ is the total number of materials $c$, $F$ is the total number of single image features $f$, $K$ is the total number of feature combinations $k$ (detailed information is shown in Table 2), $T$ is the total number of annotation models $t$, which consists of all GS models, and $a$ indexes the deep-level semantic attributes. $M \times K$ is the number of combinations of all material attributes and all feature combinations, $T \times K$ is the number of combinations of all models and all feature combinations, and $M \times T$ is the number of combinations of all material attributes and all models. As defined above, the APfeature metric evaluates the averaged performance of each image feature (combination), and the APattribute metric evaluates the averaged performance of each attribute. The MAPfeature metric evaluates the mean averaged performance of each image feature, thereby enabling us to appropriately choose the corresponding image feature or feature combination, and the MAPmaterial metric evaluates the mean averaged performance of each material, thereby allowing us to determine which material is the most difficult to recognize. Owing to space limitation, we compute the AP and MAP values and draw Precision–Recall (P–R) curves of the “MattrSet” dataset only (details are provided in Sections 4.3 and 4.4). To confirm the robustness of the proposed GS-XGBoost algorithm, we compute Accuracy values on both datasets (details are provided in Sections 4.5 and 4.6). Then, the GS-XGBoost algorithm is fairly compared with the state-of-the-art method proposed in [11] (Section 4.7). Finally, we evaluate the RA performance on the “Mattr_RA” dataset (Section 4.8).
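As a computational illustration of Equations (21)–(29), the sketch below computes the basic metrics with scikit-learn and aggregates a per-class precision table with NumPy. The array layout and variable names are hypothetical choices of ours, not part of the proposed framework.

# Illustration of Equations (21)-(29); assumed packages: NumPy, scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

classes = ["Pu", "Canvas", "Polyester", "Nylon"]

def basic_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)                                   # Eq. (21)
    prec = precision_score(y_true, y_pred, labels=classes, average=None)   # Eq. (23), one value per class
    rec = recall_score(y_true, y_pred, labels=classes, average=None)       # Eq. (24)
    return acc, prec, rec

# precision[k, t, c]: Precision of feature combination k, model t, material c
precision = np.random.rand(15, 8, 4)      # placeholder values: 15 combinations, 8 GS models, 4 materials
AP_material = precision.mean(axis=2)      # Eq. (26): average over materials  -> Table 2 cells
AP_feature  = precision.mean(axis=1)      # Eq. (25): average over models     -> Table 3 cells
MAP_feature  = AP_material.mean(axis=1)   # Eq. (27): Table 2, last column
MAP_material = AP_feature.mean(axis=0)    # Eq. (29): Table 3, last row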

4.3 P–R Curves

We first use P–R curves to depict annotation performance. In Figure 4, “L”, “G”, “S”, and “V” represent different features, namely, LBP, Gist, SIFT, and VGG16, respectively. “S+G” indicates that SIFT is fused with Gist by the presented GS idea, and “S+G+L” indicates that the three features SIFT, Gist, and LBP are fused together by the presented GS idea; other feature combinations are denoted similarly. To draw the P–R curves, we choose the top feature combination of each kind of combination. We define four kinds of feature combination: single feature, bi-feature combination, tri-feature combination, and four-feature combination. For example, “S” is the top single feature, “S+G” is the top bi-feature combination, and “S+G+L” is the top tri-feature combination. Second, we choose the top two annotation models combined with the selected feature combinations to draw the corresponding P–R curves. Different colors represent different feature combinations: blue is used for the single feature “S”, purple for the combination “S+G,” red for “S+G+L,” and green for “S+G+L+V.” In addition to colors, different line styles represent different annotation models: “-.-” denotes the GS-GBDT algorithm, whereas “—” denotes the presented GS-XGBoost algorithm. For example, a purple “-.-” curve represents the GS-GBDT algorithm combined with the feature combination “S+G,” and a red “—” curve represents the presented GS-XGBoost algorithm combined with the feature combination “S+G+L.” Note that GS-GBDT-S and GS-XGBoost-S are special cases that do not use the GS idea, because a single feature involves no fusion.
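The per-material P–R curves in Figure 4 can be reproduced from the class-probability estimates of any of the above models. The sketch below is a minimal illustration assuming scikit-learn and matplotlib, with hypothetical variable names (y_test as a NumPy array of class names, proba as an n×4 probability matrix).

# Minimal sketch for drawing one-vs-rest P-R curves per material.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

classes = ["Pu", "Canvas", "Polyester", "Nylon"]

def plot_pr_curves(y_test, proba, label, style):
    """y_test: NumPy array of class names; proba: (n_samples, 4) probability matrix."""
    for i, c in enumerate(classes):
        precision, recall, _ = precision_recall_curve(y_test == c, proba[:, i])
        plt.figure(i)
        plt.plot(recall, precision, style, label=label)
        plt.xlabel("Recall"); plt.ylabel("Precision"); plt.title(f"Precision-Recall ({c})")

# e.g., plot_pr_curves(y_test, proba_gs_xgb_sgl, "GS-XGBoost (S+G+L)", "r-")
#       plot_pr_curves(y_test, proba_gs_gbdt_sg, "GS-GBDT (S+G)", "m-.")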

Figure 4. P–R curves of the top two annotation models. (a) P–R curves of Pu; (b) P–R curves of Canvas; (c) P–R curves of Polyester; (d) P–R curves of Nylon.

As shown in Figures 4(a) and (b), the proposed GS-XGBoost algorithm combined with the feature combination “S+G+L” obtains the best performance in Pu and Canvas material annotation. Compared with a single feature, multiple features help improve the annotation performance. For example, when the feature “V” is included, the proposed GS-XGBoost algorithm combined with the feature combination “S+G+L+V” also yields very competitive performance compared with the other algorithms, which demonstrates that the feature “V” plays an important role in material attribute annotation. As shown in Figures 4(c) and (d), the novel GS-GBDT algorithm combined with the feature combination “S+G+L” obtains the best performance in Polyester and Nylon material annotation; again, multiple features outperform a single feature, and the proposed GS-XGBoost algorithm remains very competitive. We conclude that the material attribute annotation difficulty in descending order is: Polyester > Nylon > Canvas > Pu. The proposed GS-XGBoost algorithm recognizes Pu material more easily than any other material because Pu has very consistent textural characteristics. By contrast, Polyester has rough and coarse texture characteristics, which confuses our annotation models and degrades the final performance. The conclusion is similar to that in Figure 1. As illustrated in Figure 4, the GS models are superior to the models combined with a single feature (e.g., GS-GBDT-S), which proves the effectiveness of the presented GS idea. Overall, the presented GS-XGBoost algorithm obtains the best annotation performance, as anticipated.

4.4 AP and MAP Values

We conduct evaluations from another perspective to choose the best feature combination and the best annotation model for material attribute annotation. We compute APmaterial (Equation (26)) and MAPfeature (Equation (27)) values to construct Table 2, and we compute APfeature (Equation (25)) and MAPmaterial (Equation (29)) values to construct Table 3. Owing to space limitation, we cannot present the AP (or MAP) values of all annotation models. As illustrated in Figure 4, the presented GS models are superior to the other baselines; thus, we report the results of all GS models combined with the different feature combinations. The last four rows of Table 2 are special cases in which a single image feature is used to complete material attribute classification; in these rows, the corresponding GS model reduces to the original annotation model.

Table 2 shows that the state-of-the-art XGBoost algorithm is superior to the other traditional models (e.g., GBDT, Adaboost, LR, and DT) when a single image feature (“L”, “S”, “G”, or “V”) is chosen (the best APmaterial of a single image feature is 56.41%). This finding further confirms the effectiveness of the XGBoost algorithm. Although the traditional Adaboost algorithm obtains competitive performance, it needs approximately three times longer training time than the state-of-the-art XGBoost algorithm. Second, the feature combination “L+G” is the best choice among all bi-feature combinations (MAPfeature = 51.86%) and is also superior to any single image feature (the best MAPfeature of a single image feature is 50.41%). These results confirm that the combination “L+G” helps improve annotation performance and that the feature “L” is an excellent complement to the feature “G” in material attribute annotation. Both features focus on depicting the fundamental textural characteristics of images; however, “L” is a local descriptor, whereas “G” is a global descriptor. This difference is beneficial for improving the final material attribute annotation performance when the two features are combined.

Third, when the feature combination “L+G” is used in the proposed GS-XGBoost algorithm, it obtains the second-best performance (58.60%) among all bi-feature combinations. Another important discovery is that, although the feature combination “S+V” is not the best choice among all bi-feature combinations, it is superior to the single features “S” and “V” in most models; thus, the feature “S” is an excellent complement to the feature “V” in material attribute annotation. Finally, we find that the annotation performance of the feature combination “S+G+L” is preferable to any other combination in Table 2. In particular, the proposed GS-XGBoost algorithm combined with the feature combination “S+G+L” obtains the second-best APmaterial in the entire table (59.99%). We conclude that the feature “S” promotes the recognition robustness of the proposed annotation model by decreasing the shape influence of different images, and that the modified ERGS algorithm plays an important role in improving annotation performance. Overall, the presented GS-XGBoost algorithm obtains the best overall annotation performance.

Table 2. APmaterial and MAPfeature values of different models combined with different feature combinations. The best APmaterial of each row is underlined (Unit: %).

Feature   GS-DT   GS-GBDT   GS-KNN   GS-LR   GS-NB   GS-RF   GS-Adaboost   GS-XGBoost   MAPfeature
all       37.49   53.51     50.31    60.30   43.69   53.56   55.52         55.11        51.19
S+G+L     41.01   58.64     56.69    53.08   41.44   54.56   59.90         59.99        53.16
S+G+V     37.58   51.59     46.60    25.68   46.38   51.67   53.01         53.07        45.70
S+L+V     37.54   51.19     46.37    37.46   43.37   51.36   52.52         52.85        46.58
L+G+V     37.59   52.79     49.28    53.48   39.72   52.81   55.29         53.30        49.28
S+G       39.95   55.91     53.06    46.29   46.57   52.26   56.34         57.37        50.97
S+V       37.75   48.15     42.16    23.53   34.55   49.04   48.61         50.28        41.76
S+L       32.72   57.98     52.14    53.70   41.78   51.13   58.30         57.92        50.71
L+G       40.39   57.32     55.07    51.43   37.79   55.10   59.16         58.60        51.86
L+V       37.48   49.88     44.64    44.07   29.17   50.17   52.61         51.03        44.88
G+V       37.47   50.39     45.09    29.55   22.88   50.90   52.33         51.29        42.49
L         43.54   54.72     51.08    50.05   37.79   53.34   56.34         56.41        50.41
S         29.80   48.46     38.66    36.99   45.75   40.60   46.64         47.95        41.85
G         40.45   53.51     51.22    34.22   44.48   52.54   55.46         55.29        48.40
V         37.50   46.39     41.64    17.89   8.63    47.65   47.96         48.32        37.00

(Columns GS-DT through GS-XGBoost give the APmaterial of each model; the last column gives the MAPfeature of each feature combination. “all” denotes the four-feature combination S+G+L+V.)

Table 3 shows that the feature combination “S+G+L” is the best choice for Pu and Polyester material annotation. For example, the combination “S+G+L” (67.71%) is considerably better than any single image feature (at most 62.42%) in Pu material annotation. This finding also demonstrates the effectiveness of the proposed GS idea. Moreover, the feature combination “L+G” is the best choice for Canvas material annotation. Surprisingly, the feature combination “S+G+L+V” (48.31%) is the best choice for Nylon material annotation and exceeds the second-best combination (“S+G+L”) by a large margin (48.31% − 43.17% = 5.14%); thus, the feature “V” plays an important role in Nylon material annotation. Overall, the feature combination “S+G+L” obtains the best overall performance. Based on the MAPmaterial values, the material attribute annotation difficulty in descending order is Polyester > Nylon > Canvas > Pu; Polyester and Nylon are more difficult for our model to recognize than the other materials. The conclusion is similar to the results in Figures 4 and 1.

Table 3. APfeature and MAPmaterial values of all models combined with different feature combinations. The best APfeature value of each column is underlined (Unit: %).

Feature       Pu      Canvas   Polyester   Nylon
all           62.17   57.26    37.01       48.31
S+G+L         67.71   60.37    41.40       43.17
S+G+V         60.76   54.96    32.21       34.85
S+L+V         60.13   55.77    36.74       33.68
L+G+V         59.59   57.02    38.09       42.43
S+G           67.41   58.29    37.57       40.60
S+V           58.11   52.46    29.55       26.91
S+L           65.01   57.49    38.32       42.01
L+G           65.96   61.43    38.91       41.13
L+V           57.04   55.51    25.72       41.26
G+V           57.89   54.17    29.60       28.28
L             62.11   60.26    40.49       38.77
S             56.40   47.39    27.53       36.10
G             62.42   57.71    39.91       33.55
V             47.38   49.82    24.50       26.30
MAPmaterial   61.98   56.79    35.01       38.42

(Each cell gives the APfeature of the corresponding feature combination and material; the last row gives the MAPmaterial of each material. “all” denotes S+G+L+V.)

4.5 Accuracy

We use the Accuracy metric (Equation (21)) to evaluate the proposed GS-XGBoost algorithm comprehensively. To confirm the robustness and effectiveness of the proposed GS-XGBoost algorithm, we compute the Accuracy values on two datasets: “MattrSet” and “Fabric.” All results are shown in Figures 5 and 6. Figure 5(a) illustrates the accuracies on the “MattrSet” dataset, and Figure 5(b) illustrates the accuracies on the “Fabric” dataset. We use the GS models, namely, GS-DT, GS-GBDT, GS-KNN, GS-LR, GS-NB, GS-RF, GS-Adaboost, and GS-XGBoost, combined with 15 feature combinations to compute classification accuracies. For a single image feature, the ERGS weight is equal to 1; for any other combination, the ERGS weight of each image feature is calculated dynamically by Algorithm 2 (detailed information on the ERGS weights is shown in Table 9). Figure 6(a) illustrates the comparisons between the traditional models (NO-GS) and the corresponding GS models (IS-GS) on the “MattrSet” dataset, and Figure 6(b) shows the same comparisons on the “Fabric” dataset.


Figure 5. Annotation accuracies. (a) Accuracies of GS models in the “MattrSet” dataset; (b) Accuracies of GS models in the “Fabric” dataset.

As shown in Figure 5(a), the proposed GS-XGBoost algorithm combined with the feature “L” obtains the best accuracy among all single features; the feature “L” can suppress noise generated by shape variations. When bi-feature combinations are chosen, the combinations “L+G” and “S+L” obtain better accuracies than the others, which again confirms the effectiveness of the proposed GS idea and of this feature mid-fusion method. The feature “V” plays a very “interesting” role in feature mid-fusion. On the one hand, except for the combination “S+V,” the feature “V” does not support material attribute annotation. We hypothesize that the feature “V” is transferred directly from the VGG16 model without any fine-tuning, so it is fitted to the ImageNet dataset, which is detrimental to our algorithm. On the other hand, the combination “S+V” obtains very competitive accuracies when the annotation algorithms GS-DT, GS-RF, GS-GBDT, GS-Adaboost, and GS-XGBoost are chosen. We hypothesize that the feature “V” focuses on describing the fundamental textural characteristics of material attributes and therefore has properties analogous to those of the features “L” and “G”; consequently, the feature “V” and the feature “L” (or “G”) cannot complement each other when they are combined. By contrast, the feature “S” concentrates on describing the fundamental shape characteristics of materials, which is dissimilar to the feature “V”. Therefore, the features “V” and “S” complement each other positively, the feature “S” apparently improves the shape robustness of the combination “S+V,” and the final annotation performance is improved. The proposed GS-XGBoost algorithm combined with the feature combination “L+G” also obtains very competitive accuracies. When tri-feature combinations are chosen, the proposed GS-XGBoost algorithm combined with the feature combination “S+G+L” obtains the best accuracy of all feature combinations. Evidently, the traditional features complement one another positively, the proposed GS idea plays a major role in material attribute annotation, and the final annotation performance is enhanced. However, when the four-feature combination is chosen, the accuracies are very poor. In Figure 5(b), the GS-XGBoost algorithm also obtains competitive accuracies, especially when the bi-feature combinations “S+G” and “S+L” are chosen. We find that the feature “S” contributes considerably to the feature mid-fusion procedure: in the fine-grained “Fabric” dataset, different materials are represented by regular shape variations, so the feature “S” captures the key shape variations and obtains excellent classification accuracies. The feature “S” is utilized to depict the key visual content of “Fabric” images [11]. As shown in Figure 6(a), the state-of-the-art XGBoost algorithm is superior to any other traditional model or algorithm on the “MattrSet” dataset, the proposed GS algorithms are superior to the corresponding traditional algorithms, and our GS-XGBoost algorithm is also superior to any other kind of GS model. Figure 6(b) shows that the state-of-the-art XGBoost algorithm is superior to most of the traditional models on the “Fabric” dataset, and the proposed GS-XGBoost algorithm obtains excellent classification accuracies on both the coarse- and fine-grained datasets. We infer that image features play a more important role in fine-grained material attribute classification than the classification model does. Overall, the above results not only prove the effectiveness of the presented GS idea but also demonstrate the effectiveness of the state-of-the-art XGBoost algorithm. The resulting algorithm is robust and acquires high classification accuracies on the coarse-grained “MattrSet” and fine-grained “Fabric” datasets. Thus, we conclude that a novel and effective feature mid-fusion algorithm for tackling the material attribute classification problem has been created. This feature mid-fusion algorithm considers the boosting idea (a strong classifier is created by integrating a group of weak classifiers) and the popular multi-feature fusion idea (an ERGS weight is dynamically assigned to each image feature), thereby building a robust and discriminant classifier.
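The following sketch illustrates this mid-fusion scheme at a high level: one XGBoost classifier per feature channel produces class-probability estimates, which are then combined with ERGS-style weights before the final decision. It is a simplified illustration with hypothetical names (feats_train, feats_test, ergs_weights), not the exact Algorithm 2 of Section 3; in particular, the ERGS weights may be scalars or per-class vectors such as those later reported in Table 9.

# Simplified sketch of ERGS-weighted probability fusion with per-channel XGBoost models.
import numpy as np
from xgboost import XGBClassifier

def gs_xgboost_predict(feats_train, y_train, feats_test, ergs_weights, n_estimators=200):
    """feats_train/feats_test: dicts mapping a channel name ('S', 'G', 'L', 'V') to its
    feature matrix; ergs_weights: dict mapping a channel name to a scalar weight or to a
    per-class weight vector."""
    fused = None
    for name, X_tr in feats_train.items():
        clf = XGBClassifier(n_estimators=n_estimators)
        clf.fit(X_tr, y_train)                               # one booster per feature channel
        proba = clf.predict_proba(feats_test[name])          # estimated class probabilities
        weighted = np.asarray(ergs_weights[name]) * proba    # ERGS weight applied to the probabilities
        fused = weighted if fused is None else fused + weighted
    return fused.argmax(axis=1)                              # fused material attribute prediction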


Figure 6. (a) Comparisons between the traditional models (NO-GS) and our models (IS-GS) in the “MattrSet” dataset; (b) Comparisons between the traditional models (NO-GS) and our models (IS-GS) in the “Fabric” dataset.

For a comprehensive comparison, we choose the best accuracy of each baseline (Section 4.1.2) and compare it with the best proposed models. For example, on the “MattrSet” dataset, we achieve the best classification accuracy (67.67%) when the state-of-the-art GS-XGBoost algorithm is combined with the feature combination “S+G+L.” For most baselines, we randomly choose 50% of the samples for training and use the remaining samples for testing. Notably, the deep learning models yield much worse accuracies with the 50%/50% data partition; hence, owing to their data-driven nature, we randomly choose 70% of the samples for training and the remaining 30% for testing when the state-of-the-art deep learning models, namely, InceptionResNetV2 [102], Densenet169 [103], and MobileNets [104], are used. Accuracies of the various models are presented in Table 4.
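For reference, a transfer-learning baseline such as InceptionResNetV2 [102] can be set up roughly as sketched below. The input size, classification head, optimizer, and training schedule are our assumptions (using TensorFlow/Keras), not necessarily the configuration behind the numbers in Table 4.

# Hedged sketch of a deep-learning baseline for the 4-class "MattrSet" task.
import tensorflow as tf

base = tf.keras.applications.InceptionResNetV2(weights="imagenet",
                                               include_top=False,
                                               input_shape=(299, 299, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),   # Pu, Canvas, Nylon, Polyester
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=test_ds, epochs=10)  # 70%/30% split as described above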

Table 4. Accuracy comparisons with all baselines (Unit: %).

Dataset    Baseline model             Accuracy   GS model (ours)    Accuracy
MattrSet   SVM-S [30]                 50.83      GS-DT-SGL          46.20
MattrSet   GBDT-L [71]                61.28      GS-RF-LG           61.75
MattrSet   Adaboost-L [73]            61.54      GS-KNN-SGL         62.10
MattrSet   XGBoost-L [27]             62.93      GS-LR-SL           59.45
MattrSet   VGG16 [23]                 33.98      GS-NB-SG           49.61
MattrSet   InceptionResNetV2 [102]    52.09      GS-GBDT-SGL        65.13
MattrSet   Densenet169 [103]          59.77      GS-Adaboost-SGL    66.11
MattrSet   MobileNets [104]           33.98      GS-XGBoost-SGL     67.67
Fabric     SVM-S [30]                 77.92      GS-DT-SGL          65.64
Fabric     GBDT-S [71]                79.66      GS-RF-SG           76.58
Fabric     Adaboost-S [73]            76.86      GS-KNN-LG          73.74
Fabric     XGBoost-S [27]             82.03      GS-LR-SL           78.28
Fabric     VGG16 [23]                 46.22      GS-NB-SG           57.42
Fabric     InceptionResNetV2 [102]    46.22      GS-GBDT-SG         79.98
Fabric     Densenet169 [103]          46.29      GS-Adaboost-SL     78.16
Fabric     MobileNets [104]           46.22      GS-XGBoost-SG      81.95

Table 4 shows that, on the “MattrSet” dataset, the state-of-the-art XGBoost algorithm combined with the feature “L” achieves the highest accuracy before feature mid-fusion among all algorithms, whereas on the “Fabric” dataset, XGBoost combined with the feature “S” obtains the highest accuracy before feature mid-fusion. These results confirm the effectiveness of the boosting idea (the XGBoost algorithm). Second, most of the GS models achieve better accuracies than the corresponding traditional models (Figures 6(a) and (b)), which confirms the effectiveness of the modified ERGS idea. With the help of these two ideas, the proposed GS-XGBoost algorithm obtains the highest accuracies in Table 4. On the “MattrSet” dataset, the features “S”, “G”, and “L” contribute considerably to the presented feature mid-fusion process, whereas on the “Fabric” dataset, the features “S” and “G” contribute considerably. We infer that, on the fine-grained “Fabric” dataset, different materials are represented by regular shape variations, which are captured by the feature “S”; image features thus play a more important role in fine-grained material attribute classification than the classification model does. Third, although a large number of samples are used for training, the classification accuracies of the state-of-the-art deep learning models are worse than those of the proposed GS models, and on the fine-grained “Fabric” dataset they are much worse, similar to the results in Table 6. Surprisingly, on the “Fabric” dataset only slight accuracy differences occur among these deep learning models; on this dataset, the deep learning models cannot capture effective deep-level features for material attribute classification. We also find that discriminating the material attributes in the “Fabric” dataset is easier than in the “MattrSet” dataset (except for the deep learning models), and large accuracy margins appear between the two datasets. The reason lies in the way the datasets were constructed. Our “MattrSet” dataset is crawled from the web and therefore contains considerable noise; for example, the background of the images is an important noise source, so “MattrSet” is a real “wild” dataset. By contrast, the “Fabric” dataset is collected from several real garment shops, and the authors of [11] confirm the composition of each garment from the manufacturer label, so “Fabric” is a real “pure” dataset. Apparently, the “wild” dataset is more difficult to recognize. Finally, we draw another important conclusion: because it is effective on both datasets, the proposed GS-XGBoost algorithm is not a purely data-driven method but a robust method for material attribute classification. The experimental results above indicate that the presented GS-XGBoost algorithm is an excellent choice for material attribute annotation and can be applied to fine-grained and coarse-grained material datasets.

4.6 Ablation Analysis

We further study the impact of various architectural decisions on the performance of the proposed algorithm. Table 5 reports an ablation analysis conducted on the “MattrSet” and “Fabric” datasets, in which we report the results of several model variations. We first describe the default setting from which the ablations are derived: the standard model with all components (the state-of-the-art XGBoost algorithm, the proposed GS idea, and the best feature combination), which is GS-XGBoost-SGL on the “MattrSet” dataset and GS-XGBoost-SG on the “Fabric” dataset (Table 4). We then report the results of four variations to show the importance of each component. In (1), we remove all components; the best resulting models are Adaboost-L on “MattrSet” and LR-S on “Fabric.” In (2), we remove the presented GS idea; the best resulting models are XGBoost-L and XGBoost-S, respectively. In (3), we remove the best feature combination; the best resulting models are GS-XGBoost-SG and GS-XGBoost-SL, respectively. In (4), we remove the state-of-the-art XGBoost algorithm; the best resulting models are GS-Adaboost-LG and GS-GBDT-SG, respectively. We compute the accuracy change $\Delta$ in accordance with the following equation:

$\Delta = \mathrm{Accuracy}_{\mathrm{ablation}} - \mathrm{Accuracy}_{\mathrm{Default}}$  (30)

The impact of removing components is evident and leads to considerable performance degradation. For the “MattrSet” dataset, removing all components (ablation (1)) leads to the largest degradation, whereas removing the best feature combination (ablation (3)) or the state-of-the-art XGBoost algorithm (ablation (4)) leads to marginal degradation. Thus, the importance of each component in descending order is: All > GS idea > best feature combination > XGBoost algorithm; for the “MattrSet” dataset, the presented GS idea is the most important component of our model. For the “Fabric” dataset, removing the state-of-the-art XGBoost algorithm (ablation (4)) leads to the largest degradation, whereas removing the best feature combination (ablation (3)) leads to marginal degradation. The importance of each component in descending order is therefore: XGBoost algorithm > All > best feature combination > GS idea. As found above, the key shape features play a more important role in fine-grained material attribute classification than the proposed GS idea; apparently, for the “Fabric” dataset, the state-of-the-art XGBoost algorithm is the most important component of our model.

Table 5. Ablation analysis (Unit: %).

Dataset    Measurement   Default   (1)      (2)      (3)      (4)
MattrSet   Accuracy      67.67     61.54    62.93    65.42    66.11
MattrSet   Δ             /         −6.13    −4.74    −2.25    −1.56
Fabric     Accuracy      81.95     80.85    82.03    81.71    79.98
Fabric     Δ             /         −1.10    0.08     −0.24    −1.97

4.7 Final Accuracy Comparisons

To validate the robustness of the presented GS-XGBoost algorithm, we conduct a fair comparison with the method in [11] on the “Fabric” dataset. We first choose the top three GS models, namely, GS-GBDT, GS-Adaboost, and GS-XGBoost, combined with their best feature combinations, namely, “S+G+L+V,” “S+G+L,” “S+G,” and “S+L.” As described in [11], we perform four-fold cross-validation (i.e., using 75% of the samples for training and 25% for testing in each fold); a brief sketch of this protocol is given after Table 6. Table 6 summarizes the results of our experiments on fabric material classification. In Table 6, “FC” denotes a fully-connected feature, which means that fabric material classification is performed with the state-of-the-art VGG model [105]; the authors of [11] use the VGG-M pre-trained neural network of eight layers, i.e., five convolutional layers plus three fully connected layers. “FV” (Fisher Vector) [106] is a feature encoding that accumulates first- and second-order statistics of local features; it uses a soft clustering assignment based on a Gaussian mixture model instead of the k-means algorithm, and fabric material classification is implemented by an SVM model after FV encoding. Moreover, “Albedo” represents the reflectance of the image surface, “Normals” represents the micro-geometry of the image surface, and “Albe+Norm” indicates that both the reflectance and the micro-geometry of the image surface are considered in [11]. As shown in Table 6, the FC feature yields poor accuracy (similar to the results in Table 4), whereas the FV feature obtains improved accuracy owing to the soft clustering assignment (the best accuracy is 76.10%). The combination of Albedo and Normals provides consistently higher accuracy (the best accuracy is 79.60%); therefore, using both geometry and texture is useful for fine-grained material classification. Furthermore, this finding indicates that the two characteristics, Albedo and Normals, depict images from diverse views and that valuable complementarity exists between the FC and FV features. Among the top three GS models, the proposed GS-XGBoost provides the best overall accuracy, and among all feature combinations, the bi-feature combinations “S+G” and “S+L” obtain the best accuracies. Overall, the proposed GS-XGBoost algorithm combined with the feature combination “S+G” is superior to the method in [11]: fine-grained material classification accuracy is improved by approximately 84.20% − 79.60% = 4.6%, which validates the robustness of the presented GS-XGBoost algorithm.

Table 6. Accuracy comparisons with the method in [11] (Unit: %). The best Accuracy of each column is underlined.

Method in [11] (VGG-M):
Modality     FV      FC      FC+FV
Albedo       67.90   47.60   71.20
Normals      73.90   41.90   74.10
Albe+Norm    76.10   50.50   79.60

Our GS models with their feature combinations:
Model          S+G+L+V   S+G+L   S+G     S+L
GS-Adaboost    75.73     78.06   78.83   79.52
GS-GBDT        79.23     81.40   81.13   82.80
GS-XGBoost     80.73     82.92   84.20   83.53
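The four-fold protocol used in this comparison can be approximated as in the sketch below, where X is a fused feature matrix and y contains integer-encoded fabric labels (both hypothetical names, assuming scikit-learn and xgboost).

# Rough sketch of the four-fold (75%/25%) cross-validation protocol.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def four_fold_accuracy(X, y, n_estimators=200):
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = XGBClassifier(n_estimators=n_estimators)
        clf.fit(X[train_idx], y[train_idx])
        scores.append((clf.predict(X[test_idx]) == y[test_idx]).mean())
    return np.mean(scores)   # accuracy averaged over the four folds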

4.8 RA Annotation Results

With the help of the material attribute annotation, we further mine a group of additional valuable deep-level semantics. As advised by a material expert, the deep-level semantics of material attributes comprise “Waterproofness,” “Breathability,” “Softness,” “Washability,” and “Wearability.” These semantics are very close to humans’ empirical cognition and are thus convenient for humans or robots to comprehensively understand different material attributes from diverse perspectives. We annotate the deep-level semantics on the “Mattr_RA” dataset with the state-of-the-art RA model [26]. We extract four image features, namely, Gist, SIFT, LBP, and VGG16, and then use RankSVM [94] as the rank function to create the RA model. Using Equation (16), each deep-level semantic is measured quantitatively by the RA model, and the zero-shot learning strategy [26] is applied in turn to annotate the corresponding deep-level semantic on images.
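The rank function behind the RA model can be sketched as follows: ordered image pairs for one deep-level semantic are converted into difference vectors, and a linear SVM is fit on them, in the spirit of RankSVM [94]. The pair construction, the regularization constant, and the variable names are assumptions for illustration, not the exact training procedure of Section 3.

# Sketch of a RankSVM-style rank function for one deep-level semantic (e.g., "Softness").
import numpy as np
from sklearn.svm import LinearSVC

def fit_rank_function(X, ordered_pairs, C=1.0):
    """X: feature matrix; ordered_pairs: list of (i, j) with image i ranked above image j
    for the semantic being learned."""
    diffs = np.array([X[i] - X[j] for i, j in ordered_pairs])
    clf = LinearSVC(C=C, fit_intercept=False)
    clf.fit(np.vstack([diffs, -diffs]),                       # both orderings of each pair
            np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))]))
    return clf.coef_.ravel()                                  # w: relative-attribute scoring direction

# score = X_test @ w gives a real-valued degree of the semantic for each image,
# which can then be ranked or thresholded for zero-shot annotation.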

Precisions of the different models are presented in Tables 7 and 8. For example, BAG refers to the BA model using the Gist feature, whereas RAG refers to the RA model using the Gist feature. Two improvements, namely, $\mathrm{Improve}_{\max}$ and $\mathrm{Improve}_{\mathrm{mean}}$, are calculated in Table 8. Their definitions are:

$\mathrm{Improve}_{\max}(a) = \max_{f} \mathrm{Precision}_{a}^{\mathrm{RA}}(f) - \max_{f} \mathrm{Precision}_{a}^{\mathrm{BA}}(f)$  (31)

$\mathrm{Improve}_{\mathrm{mean}}(a) = AP_{\mathrm{attribute}}^{\mathrm{RA}}(a) - AP_{\mathrm{attribute}}^{\mathrm{BA}}(a)$  (32)

where $a$ denotes a deep-level semantic and $f$ denotes an image feature. For example, in Tables 7 and 8, $\mathrm{Improve}_{\max}(\mathrm{Waterproofness}) = 56.20\% - 41.93\% = 14.27\%$.

Table 7. Precisions of the BA models. The best Precision of each column is underlined (Unit: %).

BA model            Waterproofness   Breathability   Softness   Washability   Wearability   APfeature
BAG                 35.79            33.29           28.91      18.15         33.54         29.94
BAS                 41.93            35.54           33.42      25.53         36.42         34.57
BAL                 30.66            36.05           30.16      14.27         32.04         28.64
BAV                 23.28            23.28           21.15      11.01         22.03         20.15
APattribute of BA   32.59            32.99           27.43      16.00         31.36         /

Table 7 shows that the SIFT feature obtains the best precisions among all BA models, especially for the “Waterproofness” and “Wearability” semantics. We infer that different deep-level semantics may manifest as different shape variations. Moreover, the Gist feature is another excellent choice for the BA models, so global textural variations also contribute to distinguishing different deep-level semantics. However, although a large number of samples are used for training, the annotation precisions of the traditional BA models are very low, leaving a large space for improvement. This poor performance is due to the lack of sufficient semantic information in traditional binary attribute annotation (“0” or “1”; detailed information is presented in Table 1). Accordingly, we introduce the state-of-the-art RA model to solve this problem.

Table 8. Precisions of the RA model. The best Precision of each column is underlined (Unit: %).

RA model            Waterproofness   Breathability   Softness   Washability   Wearability   APfeature
RAG                 56.20            51.94           51.56      61.95         59.82         56.30
RAS                 38.30            34.04           67.71      60.33         66.08         53.29
RAL                 52.57            21.53           80.73      57.57         66.21         55.72
RAV                 45.43            41.30           52.69      38.92         45.80         44.83
APattribute of RA   48.09            38.52           61.78      52.49         58.25         /
Improvemax          14.27            15.89           47.31      36.42         29.79         /
Improvemean         15.50            5.53            34.35      36.49         26.89         /

As shown in Table 8, the state-of-the-art RA model is superior to the traditional BA model when the proposed deep-level semantics are annotated, and a very large margin occurs between them, especially for the “Softness” and “Washability” semantics. For example, precision is improved by approximately 47.31% when the “Softness” semantic is annotated by the state-of-the-art RA model. We conclude that the relative relationships between different material attributes (Table 1) enhance the final annotation performance. From the APattribute values in Table 8, we conclude that the difficulty of annotating the different deep-level semantics, in descending order, is: Breathability > Waterproofness > Washability > Wearability > Softness. Thus, the “Softness” semantic is the easiest one for our RA model to annotate accurately, whereas the deep-level semantic “Breathability” is the hardest. Interestingly, the APattribute values in Table 7 indicate that the “Breathability” semantic is not very difficult for the traditional BA model to annotate accurately. Evidently, the different models (BA and RA) focus on different types of deep-level semantics, which implicitly shows that we could fuse the annotation results of both models to further enhance the final annotation performance. The APfeature values in Table 8 also imply that the Gist feature plays a very important role in the RA annotation procedure: this feature effectively describes the fundamental textural characteristics of images, and the global textural characteristics fit the relative relationships among different material attributes very well. We qualitatively compare our RA model and the traditional BA model in Figure 7.

Figure 7. Qualitative comparisons between our RA model and the traditional BA model.

As shown in Figure 7, the traditional BA model only describes whether an image holds a deep-level semantic (true) or not (false), whereas the state-of-the-art RA model shows that a deep-level semantic is not a binary value but a real value. The deep-level semantics (“Waterproofness,” “Breathability,” “Softness,” “Washability,” and “Wearability”) acquired by our RA model are very close to humans’ empirical cognition. On the one hand, the annotation results are convenient for humans or robots to comprehensively understand different material attributes from various perspectives. On the other hand, a novel retrieval interaction experience based on the real values of the deep-level semantics can be provided; for example, when a user requests material attributes with excellent “Waterproofness,” our system can list the results according to the ranking order of “Waterproofness,” similar to the first row of Figure 7.

4.9 Parameter Tuning Procedure

4.9.1 Weak Classifier Selection

The core idea of the proposed GS-XGBoost algorithm is to build a strong classifier by integrating weak classifiers and fusing multiple features. Thus, the number of weak classifiers combined with different feature combinations is a key factor that affects the final annotation performance. In this section, different numbers of weak classifiers are set first, and then the annotation performance of the traditional XGBoost model and the proposed GS-XGBoost algorithm is compared. Owing to space limitation, we only show the experimental results on the “MattrSet” dataset. In the traditional XGBoost model, the four single features “S”, “G”, “L”, and “V” are used for weak classifier selection. In the proposed GS-XGBoost algorithm, the four feature combinations “S+L,” “S+G,” “S+G+L,” and “S+G+L+V” are used for weak classifier selection. The experimental results are shown in Figure 8.
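The sweep reported in Figure 8 can be reproduced roughly as in the sketch below; the candidate grid of weak-classifier counts and the single train/test split are our assumptions (assuming the xgboost scikit-learn API).

# Sketch of the weak-classifier (tree-number) sweep behind Figure 8.
from xgboost import XGBClassifier

def sweep_n_estimators(X_train, y_train, X_test, y_test,
                       grid=(50, 100, 150, 200, 250, 300, 400, 500)):
    results = {}
    for n in grid:
        clf = XGBClassifier(n_estimators=n)
        clf.fit(X_train, y_train)
        results[n] = (clf.predict(X_test) == y_test).mean()
    return results   # accuracy for each number of weak classifiers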


Figure 8. Experimental results of weak classifier selection. (a) XGBoost; (b) GS-XGBoost

Figure 8(a) shows that, for the feature “S”, the best accuracy of material attribute annotation (53.31%) is reached with 200 weak classifiers; for the feature “L”, the best accuracy (63.31%) is reached with 250 weak classifiers; for the feature “G”, the best accuracy (62.68%) is reached with 500 weak classifiers; and for the feature “V”, the best accuracy (55.36%) is reached with 500 weak classifiers. The features “L” and “G” obtain the highest performance, which has also been confirmed in Table 2. However, all polylines are nearly flat as the number of weak classifiers increases, except the polyline of the feature “L”; in the traditional XGBoost algorithm, weak classifier selection only slightly improves the final annotation accuracy. Figure 8(b) shows that, for the feature combination “S+L”, the best accuracy (64.06%) is reached with 100 weak classifiers, which is superior to the best performance in Figure 8(a). For the combination “S+G”, the best accuracy (65.13%) is reached with 500 weak classifiers, which is also superior to any result in Figure 8(a). For the combination “S+G+L”, the best accuracy (67.67%) is reached with 200 weak classifiers; compared with the best result in Figure 8(a), annotation accuracy is improved by approximately 4.34%, a large margin. For the combination “S+G+L+V”, the best accuracy (61.30%) is reached with 100 weak classifiers; annotation performance degrades slightly owing to the characteristics of the feature “V”. Figure 8(b) demonstrates the effectiveness of the proposed GS-XGBoost algorithm: the annotation performance is greatly improved when a suitable feature combination is chosen and the number of weak classifiers is tuned. The state-of-the-art boosting idea and the popular multi-feature fusion idea are thus combined to build a robust and discriminant attribute classifier.

4.9.2 ERGS Weight Distribution

The ERGS weight of each feature is analyzed to quantitatively evaluate the role of different image features in the ERGS fusion process. Notably, each weight is calculated dynamically on the basis of Algorithm 2. In this section, we choose the best combination of each kind (bi-feature, tri-feature, and four-feature combination) and conduct feature mid-fusion by using the proposed GS-XGBoost algorithm. Three feature combinations, namely, “S+G,” “S+G+L,” and “S+G+L+V,” are chosen for analyzing the ERGS weight distribution of the proposed GS-XGBoost algorithm. Owing to space limitation, we only show the experimental results on the “MattrSet” dataset in Table 9. As shown in Table 9, the ERGS weight of the feature “G” is large in the bi-feature and tri-feature combinations, and the ERGS weight of the feature “L” is larger than that of the feature “S” in the tri-feature combination. Evidently, global texture (“G”) and local texture (“L”) contribute considerably to material attribute annotation. We also believe that the feature “S” provides the corresponding feature combinations with robustness to shape variation. With the help of the features “G” and “L”, the feature combination “S+G+L” acquires the best annotation performance on the “MattrSet” dataset (67.67%). However, the ERGS weight of the feature “V” is large in the four-feature combination. In conclusion, the proposed GS-XGBoost algorithm helps emphasize the distinguishing ability of each feature for different samples: it pays considerable attention to the relatively weak features and increases their weights to enhance the overall performance of material attribute annotation.

Table 9. ERGS weight distribution. The largest ERGS weight of each feature combination is bold.

Feature Combination   Feature   Pu       Canvas   Polyester   Nylon
S+G                   S         0.3824   0.4544   0.3400      0.3381
S+G                   G         0.6176   0.5456   0.6600      0.6619
S+G+L                 S         0.2762   0.3239   0.2469      0.2171
S+G+L                 G         0.4470   0.3689   0.4970      0.5182
S+G+L                 L         0.2768   0.3072   0.2561      0.2647
S+G+L+V               S         0.1211   0.1633   0.0974      0.0866
S+G+L+V               G         0.1960   0.1860   0.1961      0.2067
S+G+L+V               L         0.1214   0.1549   0.1010      0.1055
S+G+L+V               V         0.5615   0.4959   0.6055      0.6012

5. Conclusions and Future Works

We present a novel image attribute annotation framework based on a newly designed hybrid feature mid-fusion algorithm and the state-of-the-art RA model. The innovative hybrid algorithm, called GS-XGBoost, is designed for feature mid-fusion: it computes the estimated class probabilities of each image feature with the state-of-the-art XGBoost algorithm and then dynamically assigns the corresponding ERGS weight to the estimated probabilities of each image feature. We first use the GS-XGBoost algorithm to accomplish material attribute annotation and demonstrate its effectiveness on two different datasets; the algorithm is robust for both coarse- and fine-grained material classification. The state-of-the-art RA model is then used to acquire the deep-level semantics of material attributes. Together, these annotation results help build a new hierarchical material representation mechanism with two forms: BARM, which is based on the BA annotation, and RARM, which is based on the RA annotation. The mechanism helps humans or robots comprehensively understand different material attributes from diverse perspectives. The proposed annotation framework has excellent characteristics of generality (it can integrate different types of classification models, such as XGBoost and GBDT), effectiveness (it achieves the best accuracy of material attribute annotation), inclusiveness (it fuses various features, including SIFT, Gist, and VGG), and rich knowledge (it creates a hierarchical cognition system between the popular material attributes and the corresponding deep-level semantics). Our research contributes not only to computer science but also to material science and engineering.

The proposed model has two possible limitations. First, the number of weak classifiers must be tuned carefully because it may affect the efficiency of the proposed method. Second, the proposed model is not an end-to-end framework, so hand-crafted feature engineering must be conducted first. Although room for improvement remains, our model is an efficient and practical solution for fine- and coarse-grained material classification that can be applied in many different scenarios in online search engines, robotics, and industrial inspection. In the future, we intend to apply the proposed GS-XGBoost algorithm to other research fields, such as natural language processing, and we hope that it will bring exciting experimental results. We also intend to add state-of-the-art attention mechanisms [5] to our framework to focus on the most important local areas of images (i.e., the foreground), which may help suppress the noise produced by image backgrounds. We also plan to use state-of-the-art data augmentation methods, such as DCGAN [107] and ACGAN [108], to improve the performance of material attribute classification. To further improve the performance of RA annotation, we will use state-of-the-art ranking models, namely, LambdaMART [109] and IRGAN [110], instead of RankSVM [94].

Author Contributions: Conceptualization, Hongbin Zhang, Donghong Ji, and Tao Li; Methodology, Diedie Qiu; Software, Diedie Qiu and Renzhong Wu; Validation, Hongbin Zhang and Renzhong Wu; Formal Analysis, Tao Li; Investigation, Hongbin Zhang, Diedie Qiu, and Yixiong Deng; Resources, Renzhong Wu and Yixiong Deng; Data Curation, Hongbin Zhang; Writing-Original Draft Preparation, Hongbin Zhang and Diedie Qiu; Writing-Review & Editing, Hongbin Zhang and Donghong Ji; Visualization, Renzhong Wu; Supervision, Hongbin Zhang; Project Administration, Hongbin Zhang; Funding Acquisition, Hongbin Zhang.

Funding: This research was funded by the National Natural Science Foundation of China (grant numbers 61762038, 61741108, and 61861016), the Natural Science Foundation of Jiangxi Province (grant number 20171BAB202023), the Key Research and Development Plan of the Jiangxi Provincial Science and Technology Department (grant number 20171BBG70093), the Science and Technology Projects of the Jiangxi Provincial Department of Education (grant numbers GJJ160497, GJJ160509, and GJJ160531), the Humanity and Social Science Foundation of the Ministry of Education (grant numbers 17YJAZH117 and 16YJAZH029), and the Humanity and Social Science Foundation of Jiangxi Province (grant number 16TQ02).

Acknowledgments: We thank our first graduate student Yi Yin, who did much valuable groundwork; Prof. Guangli Li and Prof. Liyue Liu, who gave us much valuable advice on our method; and our new graduate students Ziliang Jiang, Jinpeng Wu, and Tian Yuan, who completed many valuable experiments.

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References [1] F Tian, X Shen. Image Annotation by Semantic Neighborhood Learning from Weakly Labeled Dataset. Journal of Computer Research and Development, 2014, 51(8): 1821-1832. [2] X Wang, L Zhang, W Ma. Duplicate Search-based Image Annotation Using Web-scale Data. Proceedings of the IEEE, 2012, 100(9):2705-2721. [3] N Srivastava, R Salakhutdinov. Multimodal Learning with Deep Boltzmann Machines. In Proceedings of IEEE International Conference on Machine Learning, 2014: 2222-2230. [4] H Zhang, D Ji, L Yin, Y Ren, Z Niu. Caption Generation from Product Image Based on Tag Refinement and Syntactic Tree. Journal of Computer Research and Development, 2016, 53(11): 2542-2555. [5] K Xu, J L Ba, R Kiros, K Cho, A Courville, R Salakhutdinov, R Zemel, Y Bengio. Show, attend and tell: Neural Image Caption Generation with Visual Attention. In Proceedings of IEEE International Conference on Machine Learning, 2015, 37: 2048-2057. [6] J Justin,K Andrej, F Li. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, 4565-4574. [7] V Ferrari, A Zisserman. Learning Visual Attributes. In Proceedings of Advances in Neural Information Processing Systems, 2007, 433-440. [8] G Qi, X Huang, Y Rui, J Tang, T Mei, H Zhang. Correlative Multi-label Video Annotation. In Proceedings of the ACM International Conference on Multimedia, 2007, 17-26. [9] B Siddiquie, R S Feris, L S Davis. Image Ranking and Retrieval Based on Multi-attribute Queries. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011:801-808. [10] D Jayaraman, K Grauman. Zero-shot Recognition with Unreliable Attributes. In Proceedings of Conference and Workshop on Neural Information Processing Systems, 2014:3464-3472. [11] C Kampouris, S Zafeiriou, A Ghosh, S Malassiotis. Fine-grained Material Classification Using Micro-geometry and Reflectance. In Proceedings of European Conference on Computer Vision, 2016, 778-792. [12] T Leung, J Malik. Representation and Recognizing the Visual Appearance of Materials Using Three-dimensional Textons. International Journal of Computer Vision, 2001, 43(1):29-44. [13] M Varma, A Zisserman. A Statistical Approach to Texture Classification from Single Image. International Journal of Computer Vision, 2005, 62(1):61-81. 40

[14] M Varma, A Zisserman. A Statistical Approach to Texture Classification Using Image Patch Exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(11):2032-2047. [15] M Cimpoi, S Maji, I Kokkinos, S Mohamed, A Velaldi. Describing Textures in the Wild. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014:3606-3613. [16] S Bell, P Upchurch, N Snavely, K Bala. Material Recognition in the Wild with the Materials in Context Database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015:3479-3487. [17] M Cimpoi, S Maji, A Velaldi. Deep Filter Banks for Texture Recognition and Segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015:3828-3836. [18] S Bell, P Upchurch, N Snavely, K Bala. Opensurface: A Richly Annotated Catalog of Surface Appearance. ACM Transactions on Graphics, 2013, 32(4), Article 111:1-11. [19] L Maaten, G Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, 9(2605):2579-2605. [20] D Lowe. Distinctive Image Features from Scale-invariant Key Points. International Journal of Computer Vision. 2004, 60(2):91-110. [21] A Oliva, A Torralba. Building the Gist of a Scene: The Role of Global Image Features in Recognition. Progress in Brain Research: Visual Perception, 2006, 155:23-36. [22] M Pietikainen. Computer Vision Using Local Binary Patterns. [M] London Ltd, Springer Berlin Heidelberg. 2011. [23] K Simonyan, A Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of International Conference on Learning Representation, 2015, 1-15. [24] WordNet,

a Lexical Database for English [EB/OL]. https://wordnet.princeton.edu/,

access:2018-7-13 [25] HowNet, [EB/OL]. http://www.keenage.com/, access:2018-12-8 [26] D Parikh, K Grauman. Relative Attributes. In Proceedings of IEEE International Conference on Computer Vision, 2011:503-510. [27] T Chen, G Carlos. XGBoost: A Scalable Tree Boosting System. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2016, 785-794. [28] B. Chandra, M Gupta. An Efficient Statistical Feature Selection Approach for Classification of Gene Expression Data. Journal of Biomedical Informatics. 2011, (44): 529-535. [29] A Abdulnabi, G Wang, J Lu, K Jia. Multi-task CNN Model for Attribute Prediction. IEEE Transactions on Multimedia, 2015, 17(11): 1949-1959. [30] A Farhadi, I Endres, D Hoiem, Forsyth David. Describing Objects by Their Attributes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009:1778-1785.


[31] N Kumar, P Belhumeur, S.K Nayar. FaceTracer: A Search Engine for Large Collections of Images with Faces. In Proceedings of European Conference on Computer Vision, 2008:340-353. [32] N Kumar, A C Berg, P Belhumeur, S K Nayar. Attribute and Simile Classifiers for Face Verification. In Proceedings of IEEE International Conference on Computer Vision, 2009:365-372. [33] Z Liu, P Luo, X Wang, X Tang. Deep Learning Face Attributes in the Wild. In Proceedings of IEEE International Conference on Computer Vision, 2015, 3730-3738. [34] K Cheng, Y Zhan, M Qi. AL-DDCNN: A Distributed Crossing Semantic Gap Learning for Person Re-identification. Concurrency and Computation: Practice and Experience, 2017, 29(3). [35] G-L Sun, X Wu, H-H Chen, Q Peng. Clothing Style Recognition Using Fashion Attribute Detection. In Proceedings of International Conference on Mobile Multimedia Communications, 2015, 145-148. [36] C Bradley, T E Boult, J Ventura. Cross-Modal Facial Attribute Recognition with Geometric Features. In Proceedings of IEEE International Conference on Automatic Face & Gesture Recognition, 2017, 891-896. [37] D Jayaraman, F Sha, K Grauman. Decorrelating Semantic Visual Attributes by Resisting the Urge to Share. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014: 1629-1636. [38] T Yao, Y Pan, Y Li, Z Qiu, T Mei. Boosting Image Captioning with Attributes. In Proceedings of IEEE International Conference on Computer Vision, 2017:4894-4902. [39] B Zhao, J Feng, X Wu, S Yan. Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017:1520-1528. [40] Z Liu, P Luo, S Qiu, X Wang, X Tang. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016:1096-1104. [41] K Liang, H Chang, S Shan, X Chen. A Unified Multiplicative Framework for Attribute Learning. In Proceedings of IEEE International Conference on Computer Vision, 2015, 2506-2514. [42] C Gan, T Yang, B Gong. Learning Attributes Equals Multi-source Domain Generalization. In Proceedings of IEEE Computer Vision and Pattern Recognition, 2016, 87-97. [43] C H. Lampert, H Nickisch, S Harmeling. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, 951-958. [44] Z Akata, F Perronnin, Z Harchaoui, C Schmid. Label-embedding for Attribute-based Classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013, 819-826. 42

[45] T L Berg, A C Berg, J Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. In Proceedings of European Conference on Computer Vision, 2010:663-676. [46] A Kovashka, S Vijayanarasimhan, K Grauman. Actively Selecting Annotations among Objects and Attributes. In Proceedings of IEEE International Conference of Computer Vision, 2011, 1403-1410. [47] L Wu, Y Wang, S Pan. Exploiting Attribute Correlations: A Novel Trace Lasso-Based Weakly Supervised

Dictionary

Learning

Method.

IEEE

Transaction

on

Cybernetics,

2017,

47(12):4497-4508. [48] B Yuan, J Tu, R Zhao, Y Zheng, Y Jiang. Learning Part-based Mid-level Representation for Visual Recognition. Neurocomputing, 2018, 2126-2136. [49] P Tang, J Zhang, X Wang, B Feng, F Roli, W Liu. Learning Extremely Shared Middle-level Image Representation for Scene Classification. Knowledge Information System, 2017, 52:509-530. [50] X Liu, J Wang, S Wen, E Ding, Y Lin. Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition. In Proceedings of AAAI, 2017, 4190-4196. [51] N Murrugarra-Llerena, A Kovashka. Asking Friendly Strangers: Non-Semantic Attribute Transfer. In Proceedings of AAAI, 2018, 7268-7275. [52] C Su, S Zhang, J Xing, W Gao, Q Tian. Deep Attributes Driven Multi-Camera Person Re-identification. In Proceedings of European Conference on Computer Vision, 2016, 475-491. [53] J Shao, K Kang, C. C Loy, X Wang. Deeply Learned Attributes for Crowded Scene Understanding. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, 4657-4666. [54] D F Fouhey, A Gupta, A Zisserman. 3D Shape Attributes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, 1516-1524. [55] F Wang, H Han, S Shan, X Chen. Deep Multi-Task Learning for Joint Prediction of Heterogeneous Face Attributes. In Proceedings of IEEE International Conference on Automatic Face & Gesture Recognition, 2017, 173-179. [56] N Zhuang, Y Yan, S Chen, H Wang. Multi-task Learning of Cascaded CNN for Facial Attribute Classification. https://arxiv.org/abs/1805.01290, 2018. [57] HL Hsieh, W Hsu, Y Chen. Multi-task Learning for Face Identification and Attribute Estimation. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, 2981-2985. [58] Z Wang,K He,Y Fu, R Feng, Y Jiang, X Xue. Multi-task Deep Neural Network for Joint Face Recognition and Facial Attribute Prediction. In Proceedings of ACM on International Conference on Multimedia Retrieval, 2017, 365-374. [59] A Kovashka, K Grauman. Attribute Adaptation for Personalized Image Search. In Proceedings of IEEE International Conference on Computer Vision, 2013: 3432-3439. 43

[60] A Kovashka, D Parikh, K Grauman. WhittleSearch: Interactive Image Search with Relative Attribute Feedback. International Journal of Computer Vision, 2015, 115(2): 185-210. [61] A Yu, K Grauman. Just Noticeable Differences in Visual Attributes. In Proceedings of IEEE International Conference on Computer Vision, 2015, 2416-2424. [62] X Qiao, P Chen, D He, Y Zhang. Shared Features Based Relative Attributes for Zero-shot Image Classification. Journal of Electronics and Information Technology, 2017, 39(7): 1563-1570. [63] M T Law, N Thome, M Cord. Learning a Distance Metric from Relative Comparisons between Quadruplets of Images. International Journal of Computer Vision. 2017, 121:65-94. [64] Y Cheng, X Qiao, X Wang, Q Yu. Random Forest Classifier for Zero-Shot Learning Based on Relative Attribute. IEEE Transactions on Neural Networks and Learning, 2017, 29(5):1662-1674. [65] E Ergul. Relative Attribute Based Incremental Learning for Image Recognition. CAAI Transactions on Intelligence Technology, 2017, 2(1):1-11. [66] Y He, L Chen, J Chen. Multi-task Relative Attributes Prediction by Incorporating Local Context and Global Style Information Features, In Proceedings of BMVC, 2016, Article, 131:1-12. [67] KK Singh, Y Lee. End-to-End Localization and Ranking for Relative Attributes. In Proceedings of European Conference on Computer Vision, 2016, 753-769. [68] A Dubey, S Agarwal. Modeling Image Virality with Pairwise Spatial Transformer Networks. In Proceedings of ACM International Conference on Multimedia, 2017, 663-671. [69] Y Souri, E Noury, E Adeli. Deep Relative Attributes. In Proceedings of Asian Conference on Computer Vision, 2016, 118-133. [70] A Yu, K Grauman. Semantic Jitter: Dense Supervision for Visual Comparisons via Synthetic Images. In Proceedings of IEEE International Conference on Computer Vision, 2017, 5571-5580. [71] J H Friedman. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 2001, 29(5):1189-1232. [72] HT Vu, P Gallinari. Using RankBoost to Compare Retrieval Systems. In Proceedings of ACM International Conference on Information and Knowledge Management, 2005:309-310. [73] G. Rätsch, T. Onoda, K.-R. Müller. Soft Margins for AdaBoost. Machine Learning, 2001, 42(2):287-320. [74] A Vedaldi, V Gulshan, M Varma, A Zisserman. Multiple Kernels for Object Detection. In Proceedings of IEEE International Conference on Computer Vision, 2009, 606 - 613. [75] H Xia, SCH Hoi. MKBoost: A Framework of Multiple Kernel Boosting. IEEE Transactions on Knowledge & Data Engineer, 2013, 25 (7):1574-1586. [76] S Bai, S Sun, X Bai, Z Zhang, Q Tian. Smooth Neighborhood Structure Mining on Multiple Affinity Graphs with Applications to Context-sensitive Similarity. In Proceedings of European Conference on Computer Vision, 2016, 592-608. [77] L Xie, Q Tian, W Zhou, B Zhang. Heterogeneous Graph Propagation for Large-Scale Web Image Search. IEEE Transactions on Image Processing, 2015, 24(11):4287-4298. 44

[78] X Xie, W Zhou, H Li, Q Tian. Rank-aware Graph Fusion with Contextual Dissimilarity Measurement for Image Retrieval. In Proceedings of IEEE International Conference on Image Processing, 2015, 4082-4086. [79] Z Liu, S Wang, L Zhang, Q Tian. Robust ImageGraph: Rank-level Feature Fusion for Image Search. IEEE Transactions on Image Processing, 2017, 26(7):3128-3141. [80] SK. Pal, A Skowron. Rough-Fuzzy Hybridization: A New Trend in Decision Making, Springer-Verlag Berlin, Heidelberg, 1999. [81] M Mafarja, I Aljarah, AA Heidari, IH Abdelaziz, H Faris, AZ Ala’M, S Mirjalili. Evolutionary Population Dynamics and Grasshopper Optimization Approaches for Feature Selection Problems. Knowledge-Based Systems, 2018(145):25-45. [82] H Faris, MM Mafarja, AA Heidari, I Aljarah, AZ Ala’M, S Mirjalili, H Fujita. An Efficient Binary Salp

Swarm

Algorithm

with

Crossover

Scheme

for

Feature

Selection

Problems.

Knowledge-Based System, 2018(154):43-67. [83] MM Mafarja, I Aljarah, AA Heidari, H Faris, P Fournier-Viger, X Li, S Mirjalili. Binary Dragonfly Optimization for Feature Selection Using Time-varying Transfer Functions. Knowledge-Based System, 2018(161):185-204. [84] I Aljarah, M Mafarja, AA Heidari, H Faris, Y Zhang, S Mirjalili. Asynchronous Accelerating Multi-leader Salp Chains for Feature Selection. Applied Soft Computing, 2018, 71: 964-979. [85] E Emary, H M Zawbaa, C Grosan, A E Hassenian. Feature Subset Selection Approach by Gray-Wolf Optimization. In Proceedings of Afro-European Conference for Industrial Advancement, 2014, 1-13. [86] MM Mafarja, S Mirjalili. Whale Optimization Approaches for Wrapper Feature Selection. Applied Soft Computing, 2018, 62:441-453. [87] S M A Pahnehkolaei, A Alfi, A Sadollah, J Kim. Gradient-Based Water Cycle Algorithm with Evaporation Rate Applied to Chaos Suppression. Applied Soft Computing, 2017, 53: 420-440. [88] L Bo, C Sminchisescu. Efficient Match Kernels Between Sets of Features for Visual Recognition. In Proceedings of Advances in Neural Information Processing Systems, 2009:135-143. [89] L Bo, X Ren, D Fox. Kernel Descriptors for Visual Recognition. In Proceedings of Advances in Neural Information Processing Systems, 2010:1734-1742. [90] J Yang, K Yu, Y Gong, T Huang. Linear Spatial Pyramid Matching using Sparse Coding for Image Classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009:1794-1801. [91] N Srivastava, R Salakhutdinov. Multimodal Learning with Deep Boltzmann Machines. Journal of Machine Learning Research, 2014, 15(8):1967-2006. [92] G Hinton, S Osindero, Y Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 2006, 18:1527-1554. [93] A Krizhevsky, I Sutskever, G Hinton. ImageNet Classification with Deep Convolutional Neural 45

Networks. In Proceedings of Conference on Neural Information Processing Systems, 2012, 1106-1114. [94] T Joachims. Optimizing Search Engines using Clickthrough Data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, 2002: 133-142. [95] C Burges, T Shaked, E Renshaw, A Lazier, M Deeds, N Hamilton, G Hullender. Learning to Rank using Gradient Descent. In Proceedings of the IEEE International Conference on Machine learning, 2005: 89-96. [96] D R Cox. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society. Series B (Methodological). 1958, 20 (2): 215-242. [97] T Ho. Random Decision Forests. In Proceedings of the International Conference on Document Analysis and Recognition, 1995: 278-282. [98] N S Altman. An Introduction to Kernel and Nearest-neighbor Nonparametric Regression. The American Statistician. 1992, 46 (3): 175-185. [99] J R Quinlan. Decision Trees and Multi-valued Attributes. In J.E. Hayes & D. Michie (Eds.). 1985: 305-318. [100] I Kononenko. ID3, Sequential Bayes, Naive Bayes and Bayesian Neural Networks. In Proceedings of European Working Session on Learning, 1989:91-98. [101] R Garreta, G Moncecchi. Learning Scikit-learn: Machine Learning in Python. Packt Publishing, Birmingham: England, 2013. [102] C Szegedy, S Ioffe, V Vanhoucke, A Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of AAAI, 2017, 4278-4284. [103] G Huang, Z Liu, L V D Maaten, K Q Weinberger. Densely Connected Convolutional Networks. In Proceedings of

IEEE

Computer Vision and Pattern Recognition,

2016:4700-4708. [104] Howard, Andrew G, M Zhu, B Chen, D Kalenichenko, W Wang, T Weyand, M Andreetto, H Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv: 1704.04861, 2017. [105] K Chatfield, K Simonyan, A Vedaldi, A Zisserman. Return of The Devil in The Details: Delving Deep Into Convolutional Nets. In Proceedings of British Machine Vision Conference, 2014:1-12. [106] F Perronnin, C Dance. Fisher Kernels on Visual Vocabularies for Image Categorization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007:1-8. [107] A Radford, L Metz, S Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434, 2015. [108] A Odena, C Olah, J Shlens. Conditional Image Synthesis with Auxiliary Classifier GANs. arXiv preprint arXiv:1610.09585, 2016. [109] C JC Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Learning, 2010. 46

[110] J Wang, L Yu, W Zhang, Y Gong, Y Xu, B Wang, P Zhang, D Zhang. IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models. In Proceedings of ACM SIGIR, 2017, 515-524.

47