ClothesCounter: A framework for star-oriented clothes mining from videos

Haijun Zhang a,∗, Han Guo a, Xinghao Wang a, Yuzhu Ji a, Q. M. Jonathan Wu b
a Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
b Department of Electrical and Computer Engineering, University of Windsor, Ontario, Canada
∗ Corresponding author. E-mail address: [email protected] (H. Zhang).
Article info

Article history: Received 29 October 2018; Revised 13 July 2019; Accepted 16 September 2019; Available online xxx. Communicated by Dr. Shenglan Liu.

Keywords: Clothes clustering; Clothes detection; Video advertising; Fashion data mining
Abstract

This paper presents a novel framework, ClothesCounter, which aims to automatically identify the clothes worn by particular stars in videos. First, several deep convolutional neural network (CNN) models were utilized to preprocess the video data in order to detect clothing images from the original video frames, covering human body detection, human posture selection, human pose estimation, face verification, and clothing detection. We then propose a method for extracting features of clothing images based on a triplet loss, which maps clothing images into a compact feature space. In the learned feature space, we present a two-stage clustering algorithm that does not require the number of clusters to be specified in advance. Our framework was examined on a large-scale video dataset. Experimental results demonstrate the feasibility and effectiveness of the proposed method.
1. Introduction

With the rapid development of the Internet economy, online video traffic has grown dramatically in recent years. The clothing worn by stars in videos often leads fashion trends and attracts the attention of a large number of fans. When watching idol dramas or television shows in which the protagonists wear fashionable clothes, viewers, especially female viewers, are easily attracted to these clothes and stimulated to purchase items identical to those shown in the video. Clothing worn by stars has always been popular with audiences, and it increases purchase demand from audiences who want to keep pace with their idols. If stars' clothing could be detected automatically, great benefits could be obtained for video advertising and cross-scenario clothing image retrieval. Consequently, determining how to quickly and accurately detect clothing worn by actors in videos has become a common concern for video platforms that aim to combine video websites and e-commerce to convert traffic into sales. Our objective is not only to detect clothes, but also to discover how many sets of clothes a given star wears in a video. Such an intelligent system involves several research fields in computer vision and machine learning, including object detection, human pose estimation, face verification, clothing segmentation, and clothing image clustering. The diverse appearance of clothes, cluttered
backgrounds, distortions, varying lighting conditions, and motion blur in videos make automated clothing identification from videos a challenging task. At present, there are several lines of work on clothing parsing, clothing retrieval and recommendation, and video advertising based on clothes recognition. For example, Yamaguchi et al. [1] demonstrated an effective method for parsing clothing in fashion photographs, and Kiapour et al. [2] used a retrieval-based approach to solve the clothing parsing problem. Clothing retrieval has immense applicability in the commercial industry. Extensive efforts have been devoted to similar-clothing retrieval [3] and exactly-the-same-clothing retrieval [4]. Bell et al. [3] proposed a convolutional neural network (CNN) using a contrastive loss to learn visual similarity between products. Huang et al. [5] presented a dual attribute-aware ranking network (DARN) based on a Siamese network to retrieve similar clothes. Clothing detection and segmentation techniques [6] have also been utilized for clothing retrieval. Deep learning-based object detection methods, such as the region-based convolutional neural network (R-CNN) [7], Fast R-CNN [8], Faster R-CNN [9], and YOLO [10], can be straightforwardly utilized for clothing detection. Moreover, from an application perspective, video advertising based on clothes recognition offers huge revenue potential in the online video market. Zhang et al. [11] first introduced an optimization framework for object-level video advertising. Subsequently, Cheng et al. [12] explored a new cross-domain task of online clothing shopping, matching clothes appearing in videos to identical items in online stores.
Despite recent advances in image-to-store retrieval, only a few studies have focused specifically on linking star-oriented clothes in videos to online stores. Recently, Zhang et al. [13] proposed a sitcom-star-oriented clothing retrieval system based on videos by using several state-of-the-art deep learning methods. However, this system only learned clothing features based on categories and employed the density peaks clustering algorithm (DPCA) [14] for clothing clustering. Clustering performance cannot be guaranteed with such a rough representation for measuring clothes similarity. In this paper, we present a framework for automated star-oriented clothes identification from videos, called ClothesCounter, which aims to automatically identify the clothing worn by certain stars in a given video, while determining the clothing categories and the number of pieces of clothing appearing in the video.

Due to the continuity of video frames over a certain period of time, a large number of identical clothes are detected by the aforementioned deep learning-based methods, making it challenging to discern the number of pieces of clothing. In addition, if we regard the raw detected clothing images as queries for retrieval, they produce a large number of similar retrieval results. Therefore, the main task becomes finding an efficient method for redundant query removal, i.e., accurately identifying the number of pieces of clothing that a given star wears in a video. In this paper, we propose a two-stage clustering algorithm to remove redundant continuous data from the original clothing detection results, leaving only the cluster centers as representatives that can be viewed as retrieval queries in various clothing retrieval applications.

Moreover, good clustering performance relies largely on good representations of the extracted clothing features. Desirable features for clothing images are expected to satisfy the criterion that the maximal intra-class distance is smaller than the minimal inter-class distance under a certain metric. However, learning features under this criterion is generally difficult because of major intra-class variation in luminance, body pose, and clothing distortion, and the high inter-class similarity exhibited by clothing images. Pioneering CNN-based works that use a traditional softmax loss learn image features indirectly, and these features are not sufficiently discriminative. To overcome these challenges, we adopted a method for extracting features of clothing images based on triplet loss that can map clothing images into a compact feature space, enabling us to measure clothes similarity effectively. The proposed two-stage clustering algorithm is then performed in the learned feature space.

To evaluate the proposed framework, we conducted extensive experiments on a famous American sitcom, The Big Bang Theory. The dataset contains 103 episodes. After pre-processing, including human body detection, pose selection, and face verification, a total of 18,274 clothing images were detected. Among them, 6031 clothing images of six protagonists were used for the clustering experiment. Experimental results demonstrate the feasibility and efficacy of our framework.

The motivation of this paper is query-redundancy removal in a video advertising system, achieved by automatically counting the clothing items worn by the leading roles.
In practice, a clustering algorithm can be applied to filter out redundant query clothing images. However, the performance of a clustering algorithm is largely influenced by the generalization capability of the feature representations. Thus, to learn more discriminative features, we propose to train an efficient feature extractor by leveraging a triplet loss function under a clothing re-identification framework. Moreover, we also developed a two-stage clustering algorithm to efficiently count the number of clothing items worn by sitcom stars according to the number of clusters.

The key contributions of this paper are three-fold: (1) A novel framework, ClothesCounter, is proposed by utilizing state-of-the-art deep CNN models and clustering algorithms. The framework is
able to automatically identify the categories of clothes appearing in a video and obtain the number of pieces of clothing worn by a given star; (2) A deep model based on triplet loss is designed for feature extraction from clothing images. The developed triplet loss-based model can capture the intrinsic similarity of clothing images belonging to the same piece of clothing; (3) A two-stage clustering algorithm is proposed for star-oriented clothing clustering. The first stage performs multiple rounds of density clustering with adaptively set neighborhood radii; the second stage merges clusters based on the similarities between them.

The remainder of this paper is organized as follows. In Section 2, our ClothesCounter framework is briefly presented. In Section 3, we describe the implementation details of clothing detection from a video, including human body detection, pose selection, protagonist face verification, and clothing detection, utilizing different deep CNN models. Sections 4 and 5 describe the feature learning of clothing images based on triplet loss and the two-stage clustering algorithm, respectively. Experimental results based on real video datasets are presented in Section 6. In Section 7, we conclude the paper and suggest directions for future work.

2. Overview of our framework

The whole framework is developed to automatically parse a given video and obtain the images, categories, and number of pieces of clothing worn by each protagonist. A potential application of our framework is to build a clothes album for each star in a video, identify the total number of pieces of clothing, and label their categories. These results can be easily utilized for clothing recommendations. The ClothesCounter framework comprises several modules, including human body detection, pose selection, face detection and verification, clothing detection, and clothing image clustering. The entire pipeline is illustrated in Fig. 1. First, human bodies are detected from video frames. After human body detection, a pose selection module, including a binary classifier and a body key points detection module, is used to determine whether or not a human body pose is good. A good pose is determined by whether the form of the clothing attached to the human body is good and whether it is suitable for image retrieval. After the face verification module identifies the main characters, the clothes detection module determines the bounding box and category of the clothes, and segments clothing image patches from the whole protagonist body image. Finally, clothing features are extracted based on triplet loss, and clustering is performed in the learned feature space. Overall, the framework includes three key components: (1) clothing image segmentation from a video for a given star; (2) feature extraction from clothing images; and (3) clothing image clustering. The whole working flow of ClothesCounter is presented in Fig. 2. The detailed implementations of these components are described in the following sections.

3. Star-oriented clothes detection from videos

3.1. Human body detection

Given a video, one frame is extracted at a fixed interval, and human detection is performed over these video frames. Human body detection constitutes a sub-problem of object detection.
In this research, we chose Faster R-CNN as the human body detection model, as it shows state-of-the-art object detection accuracy [13]. The training dataset was constructed from the public PASCAL VOC2012 dataset, which covers 20 categories, including people, animals, and vehicles; categories and bounding-box locations are manually labeled. A total of 8,174 images contain human bodies with annotated bounding boxes. We trained the human body detection network on this human body dataset.
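The paper trains its own Faster R-CNN on PASCAL VOC2012; purely as a hedged illustration of this step, the sketch below uses torchvision's off-the-shelf COCO-pretrained Faster R-CNN and keeps only 'person' boxes. The score threshold is our assumption, not a value from the paper.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Off-the-shelf COCO-pretrained detector, used here only for illustration;
# the paper's detector is trained on PASCAL VOC2012 human-body annotations.
_detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
_detector.eval()

def detect_bodies(frame_rgb, score_thresh=0.8):
    # Return person bounding boxes [x1, y1, x2, y2]; label 1 is 'person'
    # in the COCO label map. The score threshold is an assumption.
    with torch.no_grad():
        out = _detector([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)
    return out["boxes"][keep].tolist()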
Fig. 1. Overview of the ClothesCounter framework.
Fig. 2. Working flow of the ClothesCounter framework.
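The per-frame control flow of Fig. 2 can be summarized by the following sketch. Every helper function here is hypothetical and stands in for a module described in Sections 3–5; none of these names come from the paper.

def clothes_counter(video_path, protagonists):
    # Per-frame decision cascade of Fig. 2; all helpers are hypothetical
    # stand-ins for the modules of Sections 3-5.
    detections = []
    for frame in extract_frames(video_path):              # one frame per second
        for body in detect_bodies(frame):                 # Section 3.1
            if not is_good_pose(body):                    # Section 3.2
                continue
            face = detect_face(body)
            if face is None:
                continue
            star = identify_protagonist(face, protagonists)  # Section 3.3
            if star is None:
                continue
            for clothing_patch in detect_clothing(body):     # Section 3.4, YOLO v2
                detections.append((star, clothing_patch))
    # Sections 4-5: embed patches with the triplet-loss CNN, then cluster per star
    return cluster_per_star(detections)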
We used the same parameter settings and network structure as [13]; the implementation details can also be found in [13]. Using the human body detection model, we cropped human body regions from the raw video frames.

3.2. Pose selection

After human body detection, the pose selection module is utilized to determine whether or not the human body is in a good pose. Given a video, human body poses usually change frequently. Selecting good poses of stars constitutes a crucial step, because pose directly affects the performance of clothing detection [13]. The pose selection module includes a binary classifier and key body joints detection.

For the binary classifier for pose selection, a good pose is defined by good lighting conditions, a good shape of clothing without distortion, a frontal view of the human body, etc. We obtained a total of 23,167 human body images, in which 11,097 images were annotated as positive pose samples and 12,070 as negative ones. Of these, 13,190 images were collected from the videos used in this research, and 9,977 images were cropped from the Street-shop dataset [36]. An AlexNet model modified with a binary classifier was trained on this collected dataset; the implementation details can be found in [13]. Although the trained binary classifier can filter out most human body images in bad poses, some images that are obscured by other objects or contain only half-body regions may remain. In order to further filter out these images, we developed a key human joints detection module. In this research, we employed the method proposed by Cao et al. [16]. This method adopts a bottom-up algorithm and uses global contextual cues to detect body parts and their associations. The network structure is divided into two branches: (1) key points detection; and (2) prediction of the connections between key points. Each branch has an iteratively refined prediction structure. The detailed implementation of this method can be found in [16]. Some key joints detection results are shown in Fig. 3. From left to right, the detection results are based on human body images with good poses, profile images, and upper-body images. If the detected key joints are incomplete, the human body image may be inappropriate for clothing detection. Specifically, a human body image containing a complete top should allow detection of the shoulders, elbows, and wrists, and one containing a complete bottom should allow detection of the hips, knees, and ankles. These rules are utilized to choose human body images with good poses, as sketched below.
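A minimal sketch of these completeness rules follows; the joint names, the joints mapping (name to confidence), and the confidence threshold are hypothetical conventions of ours, not the output format of [16].

TOP_JOINTS = {"l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist"}
BOTTOM_JOINTS = {"l_hip", "r_hip", "l_knee", "r_knee", "l_ankle", "r_ankle"}

def joint_completeness(joints, conf_thresh=0.3):
    # joints: hypothetical dict mapping joint name -> detection confidence.
    detected = {name for name, conf in joints.items() if conf >= conf_thresh}
    complete_top = TOP_JOINTS <= detected        # shoulders, elbows, wrists
    complete_bottom = BOTTOM_JOINTS <= detected  # hips, knees, ankles
    return complete_top, complete_bottom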
3.3. Protagonist face verification

The key idea of the proposed ClothesCounter framework centers on the clothing of famous stars, because stars are usually at the forefront of fashion. Thus, it is first necessary to verify whether or not a detected person plays one of the leading roles in a given video. In general, face recognition can be categorized into face identification and face verification [17]. The former classifies a face into a specific identity, while the latter determines whether a pair of faces belongs to the same identity. In our protagonist face verification module, we used an open-source face recognition method with deep representations, named VIPLFaceNet [18], which achieves 98.60% mean accuracy on the LFW (Labelled Faces in the Wild) dataset. We manually selected seven face images as standard faces for each protagonist.
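The matching rule, stated in full after Table 1, can be sketched as follows; the face-embedding extraction (VIPLFaceNet in the paper) is assumed to be available as a function returning one vector per face, and the data layout is ours.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def identify_protagonist(face_emb, standard_embs, thresh=0.66):
    # standard_embs: hypothetical dict mapping protagonist name -> list of
    # embeddings of the seven manually selected standard faces. Returns the
    # best-matching name when its average similarity exceeds the threshold.
    best_name, best_avg = None, -1.0
    for name, embs in standard_embs.items():
        avg = float(np.mean([cosine(face_emb, e) for e in embs]))
        if avg > best_avg:
            best_name, best_avg = name, avg
    return best_name if best_avg > thresh else None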
Fig. 3. Results of body key joints detection based on human images with: (a) good poses; (b) sideways images; and (c) upper-body images.

Table 1. Constructed clothing detection dataset (w denotes woman, m denotes man).

Categories    Amazon   Taobao   Train/val   Test   Total
Dress-w       1002     888      945         945    1890
Gallus-w      1001     393      697         697    1394
Shirt-m       499      432      466         465    931
Shirt-w       905      210      558         557    1115
Suit-m        1005     901      953         953    1906
Sweater-w     532      1270     901         901    1802
Tshirt-m      843      500      672         671    1343
Tshirt-w      600      921      761         760    1521
Skirts-w      1002     532      767         767    1534
Hoodie-m      1100     276      688         688    1376
Jeans-m       568      432      800         200    1000
Jeans-w       601      399      800         200    1000
Leggings-w    502      498      800         200    1000
Pants-m       621      379      800         200    1000
Pants-w       599      401      800         200    1000
Shorts-m      600      400      800         200    1000
Shorts-w      587      413      800         200    1000
Each detected face is compared with every standard face using cosine similarity; the detected face is identified as a specific protagonist if the greatest average similarity exceeds a predetermined verification threshold (we set this to 0.66 according to [13]). The detailed verification scheme can be found in [13].

3.4. Clothing detection

After protagonist body images with good poses are obtained, the clothing detection module locates clothing regions. Clothing detection is similar to object detection, which predicts the location and the category of an object simultaneously. In this research, we used a state-of-the-art object detection method, YOLO v2 [15], as the model for clothing detection, and utilized the same clothing dataset introduced in [13]. The constructed clothing dataset contains a total of 21,812 images, categorized into 17 classes (see Table 1). These data were collected from two popular online shopping websites, i.e., Amazon.com and Taobao.com. Each image in this dataset was manually labeled with a category and a bounding box. For dataset partition, we randomly selected 80% of the images in each category as a training/validation set, and the remaining 20% were used for testing (see Table 1). The mAP results for each category are shown in Fig. 4. YOLO v2 achieved 98.16% mAP over all categories. Such a high mAP largely satisfies the basic requirement of our applications. Therefore, we selected this well-trained YOLO v2 model to detect clothes in our video dataset.

4. Feature learning based on triplet loss

Our ultimate goal lies in automatically identifying clothes worn by certain stars in a given video, while recognizing the clothing categories and number of pieces of clothing that appear in the
video. After clothing detection, a large number of similar clothes are detected and cut out from the video, due to the continuity of video frames within a certain period of time. These images, even when belonging to the same piece of clothing, may exhibit major differences in angle, size, shape, etc. It is necessary to determine whether a set of clothing images belongs to the same piece of clothing, and then to remove redundant queries for potential clothing retrieval applications. Since a clothing set compiled from a video usually has a large number of clothing categories and few samples in each category, it is difficult for traditional category-based loss functions under the deep learning framework to learn discriminative features for clothing image representations. In this research, we propose to utilize a triplet loss function for extracting features of clothing images. A deep CNN is trained with a triplet loss that serves to pull instances of the same clothes closer together, while pushing instances belonging to different clothes farther apart in the learned feature space. Our method differs from previous works [13] that regard the outputs of the last two fully connected layers as feature representations of clothing images. A metric learning method supervised by a triplet loss is used to project the original features into a low-dimensional space while preserving their discriminative information. Metric learning with a triplet loss aims to separate positive pairs from negative pairs by a distance margin; it was initially used in face verification [19]. In our application, it minimizes the squared L2 distance between images of the same clothes and enforces a margin between the distances of different clothes. In this paper, we explored the Inception-ResNet v2 [20] network architecture, which combines Inception architectures with residual connections and achieved state-of-the-art performance in the ILSVRC2015 challenge. Based on the Inception-ResNet v2 structure, we added the triplet loss function, removed the softmax classification layer, and directly took the feature map from the final convolutional layer. After L2 normalization, an embedding space is established for the dimensionally reduced image features. Triplets were selected from our constructed clothing dataset (see Section 6.2), and triplet losses were calculated on the feature representations in the embedding space. Given the model structure shown in Fig. 5, the key to our approach is end-to-end learning of the embedding space. Specifically, we learn an embedding from clothing images to a feature space in which the distances between images of the same clothes are small, whereas the distances between different clothes are large. For an image $x_a$ (anchor), we define a triplet of $x_a$ as $(x_a, x_p, x_n)$, where $x_p$ (positive) is a sample belonging to the same clothes as $x_a$, and $x_n$ (negative) is a sample belonging to different clothes. It is necessary to ensure that an image $x_a$ is closer to all other images $x_p$ of the same clothes than to any image $x_n$ of different clothes. The triplet loss function requires the distance between $x_a$ and $x_p$ to be smaller than the distance between $x_a$
Fig. 4. Detection results on our established clothing dataset.
Fig. 5. The model structure.
and $x_n$ by at least a predefined margin. Thus, we require:
$\|x_a - x_p\|_2^2 + \alpha < \|x_a - x_n\|_2^2, \quad \forall (x_a, x_p, x_n) \in T,$  (1)
where $\alpha$ is the margin, empirically set to 0.2, and $T$ is the set of triplets in the training set, with cardinality $N$. All selected triplets need to satisfy the constraint in Eq. (1). We denote by $f(x_a)$, $f(x_p)$, and $f(x_n)$ the embeddings produced by the shared-parameter CNN model for $x_a$, $x_p$, and $x_n$, respectively. The loss function, summed over the $N$ training triplets, is given by:
$L = \sum_{i=1}^{N} \left[ \|f(x_a^{(i)}) - f(x_p^{(i)})\|_2^2 - \|f(x_a^{(i)}) - f(x_n^{(i)})\|_2^2 + \alpha \right]_+ .$  (2)
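As an illustration of Eq. (2) and of the verification rule used later, here is a minimal NumPy sketch; the batch layout and helper names are our own, and gamma = 0.81 is the cross-validated threshold reported in Section 6.2.

import numpy as np

def l2_normalize(x, eps=1e-10):
    # Row-wise projection onto the unit hypersphere, applied to the
    # final convolutional feature map as described above.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # Eq. (2) on batches of embeddings f(x_a), f(x_p), f(x_n);
    # alpha = 0.2 as in the paper.
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)
    d_an = np.sum((f_a - f_n) ** 2, axis=1)
    return float(np.sum(np.maximum(d_ap - d_an + alpha, 0.0)))

def same_piece(f_x, f_y, gamma=0.81):
    # Clothing verification rule: same piece of clothing iff the squared
    # L2 distance is below the cross-validated threshold (Section 6.2).
    return float(np.sum((f_x - f_y) ** 2)) < gamma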
The whole training process is described in Section 6.2. To verify the representational capability of the learned features, we performed clothing verification (see Section 6.2). Given a pair of clothing images, clothing verification aims to determine whether they belong to the same piece of clothing. The similarity between a pair of clothing images is measured by the squared L2 distance between their learned features. A distance threshold is obtained by cross-validation: if the distance is greater than the threshold, the pair is judged not to be the same piece of clothing, and vice versa.

5. A two-stage method for clothes clustering

Clustering is an important technique in data mining. It divides a large amount of data into several clusters according to the similarity between samples. If the clothes identified by the clothing detection module for a given star can be clustered, the number of clothes worn by the star will be the number of clusters. We can use the cluster centers as representatives of the clothes for retrieval applications. At the same time, a larger cluster suggests that the clothes in that cluster occur more frequently in the video. At present, extant clustering methods are mainly divided into five categories: (1) partitioning methods, such as the K-MEANS [21] and K-MEDOIDS algorithms, which
need to predefine the number of clusters; (2) hierarchical methods, which hierarchically decompose a given dataset until a certain condition is satisfied; representative algorithms are CURE [22] and BIRCH [23]; (3) density-based methods, in which a point is added to a cluster as long as the density in its region is greater than a certain threshold; representative algorithms are DBSCAN [24,25] and OPTICS [26]; (4) grid-based methods, which first divide the data space into a grid structure with a finite number of cells and perform all processing on the grid units; representative algorithms include STING [27] and CLIQUE [28]; and (5) model-based methods, which are mainly based on statistical models and neural network models [29,30].

The first stage of our work mainly follows the scheme of density-based clustering methods, because they do not need a predefined number of clusters. The main idea of density-based clustering lies in finding high-density regions separated by low-density regions. A cluster is the largest set of density-connected points, and can be uniquely determined by any one of its core points. Density-based clustering does not need to fix the number of clusters in advance, and can identify clusters of arbitrary shapes in datasets containing noise; therefore, it is suitable for clustering unknown data. The DBSCAN algorithm [24,25] is a representative density-based clustering algorithm. However, since the parameters Eps (the neighborhood radius) and MinPts (the minimum number of points within the neighborhood radius) are fixed during the clustering process, it is difficult for the classical DBSCAN algorithm to adapt to datasets with uneven densities.

Our task is to cluster clothing images detected from video frames. Considering the uneven density distribution of the clothing data, we propose a clustering method that adaptively selects multiple Eps values. In the image feature extraction module, we evaluated our method on the clothing verification task (see Section 6.2 for detailed experimental results). Given a pair of clothing images, a threshold γ on the squared L2 distance between the two images was utilized to determine
whether this pair of images belongs to the same piece of clothing. We used cross-validation to determine this threshold, which was then treated as the upper limit of the clustering parameter Eps. Below this upper limit, we generated multiple Eps values with a fixed step size, e.g., 0.1, in the range (0, γ).

At the first stage, the neighborhood radii Eps are arranged in ascending order, and MinPts is set to 2. We then select the minimum Eps and, with MinPts, perform density clustering on the data. Here, a 128-dimensional data point $p_i$ represents an image. We examine each point sequentially over the dataset. If a point $p_i$ has not been processed (i.e., neither assigned to a cluster nor marked as noise), we mark $p_i$ as "processed" and find its neighborhood $N_{Eps}(p_i)$, defined by $N_{Eps}(p_i) = \{p_j \in D \mid dist(p_i, p_j) \leq Eps\}$, where $D$ denotes the whole dataset and the distance $dist(p_i, p_j)$ is the squared Euclidean distance. If $N_{Eps}(p_i)$ contains fewer points than MinPts, i.e., $|N_{Eps}(p_i)| <$ MinPts, we mark $p_i$ as noise; if $|N_{Eps}(p_i)| \geq$ MinPts, we group all points in $N_{Eps}(p_i)$ into a new cluster $c$. For each point $p_k$ in $N_{Eps}(p_i)$ that has not yet been processed, we then check its neighborhood $N_{Eps}(p_k)$; if $|N_{Eps}(p_k)| \geq$ MinPts, the points in $N_{Eps}(p_k)$ that are not yet grouped into any other cluster are added to $c$. The above steps are repeated until all data points are classified into a cluster or marked as noise. Subsequently, we take the next Eps value and perform density clustering again on the data labeled as noise in the previous round. This process continues until all Eps values are used up. It is worth noting that the neighborhood radii Eps must be used in ascending order in the multiple clustering rounds of the first stage: smaller Eps values handle only denser points, so points in sparse clusters are left unprocessed without affecting low-density data.

At the second stage, we use the squared L2 distance between the centers of the clusters formed at the first stage to measure their similarity. If the distance between clusters $c_i$ and $c_j$ is less than the threshold γ, i.e., $dist(c_i, c_j) < \gamma$, we merge the two clusters. The final clustering result is obtained by continuously merging the most similar clusters. Algorithm 1 describes the proposed two-stage clustering algorithm, where $D$ is the input dataset; $N_{Eps}(p, D)$ is the subset of $D$ inside a hypersphere of radius Eps centered at $p$ ($p \in D$); and $card(N_{Eps}(p, D))$ is the cardinality of that set. The first stage marks each point of $D$ with a cluster identifier (c_id) that gives the cluster to which the point belongs, or marks the point as noise. In the second stage, the clusters obtained in the first stage are merged whenever their distance is less than the merging threshold. The time-consuming step of the first stage lies in finding $N_{Eps}(p, D)$, which takes $O(n)$ time, where $p \in D$ and $|D| = n$; hence, the time complexity of the first stage is $O(n^2)$. At the second stage, assuming that the number of clusters obtained in the first stage is $n_c$, the time complexity of the merging phase is $O(n_c^2)$. The total time complexity of our method is $O(n^2 + n_c^2)$.
6. Experiment

6.1. Preprocessing of videos

In order to establish our video dataset, we downloaded 103 episodes of The Big Bang Theory from the Internet (used for research purposes only). For each video, we removed the header part, a fixed length of 1000 frames, from the beginning. One frame per second was then extracted to form the dataset of raw frames.
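A minimal sketch of this sampling step with OpenCV is shown below; the 1000-frame header length matches the text, while the fps fallback value is our assumption.

import cv2

def extract_frames(video_path, seconds_between=1.0, skip_header=1000):
    # Drop the first `skip_header` frames (the episode header), then
    # keep one frame per `seconds_between` seconds.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # assumption: fall back to 25 fps
    step = max(int(round(fps * seconds_between)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx >= skip_header and (idx - skip_header) % step == 0:
            yield frame   # BGR ndarray
        idx += 1
    cap.release()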
Algorithm 1. Two-stage clustering.

Input: D (the dataset), Eps_list (candidate Eps values, ascending), MinPts, γ (the threshold for merging clusters)
Output: Labels (the cluster label of each point)
Initialize: Cur_D = D, c_id = 0
for Eps in Eps_list do
    for each point p in Cur_D do
        if p is marked "visited" then
            continue
        end if
        Mark p as "visited"
        Find N_Eps(p, Cur_D)
        if card(N_Eps(p, Cur_D)) < MinPts then
            Mark p as noise
        else
            c_id = c_id + 1
            Mark each point of N_Eps(p, Cur_D) with c_id
            for each point y in N_Eps(p, Cur_D) not marked "visited" do
                Mark y as "visited"
                Find N_Eps(y, Cur_D)
                if card(N_Eps(y, Cur_D)) ≥ MinPts then
                    Mark each point of N_Eps(y, Cur_D) with c_id
                    if any point of N_Eps(y, Cur_D) is marked as noise then
                        Remove that mark
                    end if
                end if
            end for
        end if
    end for
    Cur_D = points currently marked as noise
end for
for each pair of distinct clusters (c1, c2) do
    if ||c1 − c2||₂² < γ then
        Merge(c1, c2)
    end if
end for
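For concreteness, a compact Python sketch of Algorithm 1 follows. It is an assumption-laden illustration, not the authors' implementation: it reuses scikit-learn's DBSCAN for the inner density-expansion step (converting the paper's squared-L2 radii to Euclidean ones) and merges the closest pair of cluster centers until no pair lies within γ.

import numpy as np
from sklearn.cluster import DBSCAN

def two_stage_clustering(X, eps_list, gamma=0.81, min_pts=2):
    # X: (n, 128) array of triplet-loss embeddings; eps_list: ascending
    # squared-L2 radii in (0, gamma); gamma: merging threshold (Section 6.2).
    labels = np.full(len(X), -1, dtype=int)   # -1 = noise / unprocessed
    next_id = 0
    # Stage 1: density clustering with increasing radii; each round only
    # re-examines the points still marked as noise (Cur_D in Algorithm 1).
    for eps in sorted(eps_list):
        rest = np.flatnonzero(labels == -1)
        if rest.size == 0:
            break
        # sklearn's DBSCAN takes Euclidean radii, hence the square root
        sub = DBSCAN(eps=float(np.sqrt(eps)), min_samples=min_pts).fit(X[rest])
        for k in np.unique(sub.labels_):
            if k == -1:
                continue
            labels[rest[sub.labels_ == k]] = next_id
            next_id += 1
    # Stage 2: repeatedly merge the pair of clusters whose centers are
    # closest, as long as that squared distance is below gamma.
    while True:
        ids = np.unique(labels[labels >= 0])
        if ids.size < 2:
            break
        centers = np.stack([X[labels == k].mean(axis=0) for k in ids])
        d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        if d2[i, j] >= gamma:
            break
        labels[labels == ids[j]] = ids[i]   # merge cluster j into cluster i
    return labels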
Table 2. Dataset processing in each module.

Stage                  Before                       After
Video segmentation     103 videos                   115,905 (raw frames)
Human body detection   115,905 (raw frames)         208,519 (detected bodies)
Pose selection         208,519 (detected bodies)    84,415 (positive poses)
Face detection         84,415 (positive poses)      60,305 (detected faces)
Face verification      60,305 (detected faces)      22,779 (verified faces)
Clothes detection      22,779 (verified faces)      18,274 (detected clothes)
Finally, we compiled a video dataset with a total of 115,905 raw frames. After human detection, we detected and cropped 208,519 human body images from the original frames. Among these, 84,415 were determined to have good poses by our trained pose selection model and key joints detection model. Face recognition was then performed on these positive human body images, in which 60,305 faces were detected and aligned; 22,779 faces were verified as protagonists' faces. Clothing detection was then performed on the human images associated with the 22,779 protagonists' faces. Finally, 18,274 well-formed and complete clothing images were detected and cropped. For clarity, Table 2 presents the statistics of the dataset at each processing stage.
Table 3. Clothing verification model dataset.

                Total                 Train/Val             Test
                Pieces   Instances    Pieces   Instances    Pieces   Instances
LookBook        5155     35,021       3043     25,585       2112     9436
Video Frames    941      8852         403      5175         1344     3677
Total           6902     43,873       3446     30,760       3456     13,113

Table 4. Performance with different embedding dimensionality.

Embedding dimensionality    128     256     512     1024
Accuracy                    0.98    0.979   0.983   0.981
AUC                         0.998   0.998   0.999   0.999

Table 5. The main characters in The Big Bang Theory.

Dataset related to a star    Instances   Clusters
Jim Parsons                  1671        159
Kaley Cuoco                  620         97
Kunal Nayyar                 953         123
Johnny Galeck                2515        296
Melissa Rauch                60          20
Simon Helberg                212         45

6.2. Clothing verification based on triplet loss

In order to train a deep model based on triplet loss for extracting image features, we constructed a clothing dataset containing a total of 43,873 clothing images collected from videos and the LookBook dataset. These images belong to 6,902 pieces of different clothing. We manually selected clothing images detected in various genres of movies, including comedy, romance, and action, and chose 941 pieces of different clothing with a total of 8852 images. The remaining 35,021 clothing images were collected from LookBook.nu, a well-known street-fashion website [31] on which fashion lovers upload their own street photos and share wardrobes with other users. From the LookBook street dataset, we obtained 5155 pieces of different clothing, comprising a total of 35,021 clothes images. Images from videos were collected in order to adapt the model to the application environment of our program, i.e., video data. In practice, clothing items can be automatically collected by a crawler from such online fashion websites or stores, which usually contain multi-view clothing images belonging to the same items. Establishing such a clothing dataset is necessary to improve the generalization capability of a feature extractor for distinguishing whether or not two clothing images contain the same clothing item, and the generalization performance of our triplet-loss-based feature extraction model should improve as the dataset scale increases. In particular, if a dataset contains more similar clothing items, the generalization capability of a feature extractor can be further improved, because similar items can be utilized as hard negative samples when constructing triplets, producing more robust and discriminative feature representations. Collecting similar clothing images thus improves the proposed feature extractor by avoiding overly easy negative samples during model training. Table 3 shows the statistics of our constructed dataset.

We trained the CNN using RMSprop with standard backpropagation, starting with a learning rate of 0.01. The model parameters were randomly initialized, and training was performed on a GTX 1080. The margin α in the triplet loss was set to 0.2. The network structure is based on Inception-ResNet v2: the final softmax classification layer was removed, the feature map of the final convolutional layer was taken directly, and L2 normalization was applied, yielding the clothing features. In order to evaluate the representational capability of the learned features, we examined our model on clothing verification. Given a pair of clothing images, clothing verification determines whether or not they belong to the same clothing by calculating the squared L2 distance between their learned embedding vectors. For dataset partition, pieces of clothing with fewer than six instances were used as the test set, and the remaining images were used for training/validation (see Table 3). We used two widely utilized measures [32] to evaluate performance: accuracy and area under the curve (AUC). For each setting of the embedding dimension, we used 10-fold cross-validation. Table 4 lists the results with respect
to different embedding dimensions. It can be seen that embedding dimensions of 128, 256, 512, and 1024 deliver quite similar, good performance. To reduce the computational burden, we used an embedding dimension of 128. For the 128-dimensional embedding with 10-fold cross-validation, we obtained an average similarity threshold γ of 0.81, which was used for clothing clustering in the following section.

6.3. Clustering on real video datasets

After processing the videos, we obtained clothing image datasets with respect to the six main characters of the American sitcom The Big Bang Theory: Jim Parsons, Kaley Cuoco, Kunal Nayyar, Johnny Galeck, Melissa Rauch, and Simon Helberg. We organized and manually classified six clothing image datasets with respect to these six stars. The statistics of the six datasets are shown in Table 5. Fig. 6 presents a few examples of the experimental results of our two-stage clustering algorithm, in which images with red boxes represent examples of clustering errors and images with green boxes are the center samples of each cluster. In order to demonstrate the effectiveness of our two-stage clustering algorithm, we compared the proposed method with the DPCA [14], DBSCAN [24,25], AP [33], Mean Shift [34], and Birch [23] algorithms on our constructed datasets. These algorithms are widely used in the clustering field and do not necessitate a predefined number of clusters. We used six common measures for evaluation: normalized mutual information (NMI), Rand index (RI), Fowlkes-Mallows index (FMI), homogeneity, completeness, and V-measure [35]. We also list the number of clusters estimated by each clustering method. Quantitative results are summarized in Table 6. It can be seen that our method delivers superior performance in terms of most measures in comparison to the other methods. In particular, the number of clusters estimated by our method is usually near the correct number. Moreover, our method always outperforms the other methods in terms of NMI and V-measure. DBSCAN produces superior completeness over the other methods on the datasets related to Jim Parsons, Kaley Cuoco, Kunal Nayyar, and Johnny Galeck. AP achieves the best RI and FMI on the Johnny Galeck dataset, and outperforms the other methods in terms of completeness on the Melissa Rauch dataset. Finally, we produced a star-oriented photo gallery of the clothing of every protagonist in The Big Bang Theory, as shown in Fig. 7.
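For reference, the six external measures can be computed with scikit-learn as sketched below, assuming its standard definitions match those used in the paper.

from sklearn.metrics import (fowlkes_mallows_score,
                             homogeneity_completeness_v_measure,
                             normalized_mutual_info_score, rand_score)

def external_measures(y_true, y_pred):
    # The six measures reported in Tables 6 and 8, under scikit-learn's
    # standard definitions (assumed to match the paper's).
    h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
    return {"NMI": normalized_mutual_info_score(y_true, y_pred),
            "RI": rand_score(y_true, y_pred),
            "FMI": fowlkes_mallows_score(y_true, y_pred),
            "Homogeneity": h, "Completeness": c, "V-measure": v}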
Fig. 6. Examples of clothes clustering.
Table 6. Performance comparison of the algorithms (TNC represents the true number of clusters; in the original typesetting, the best results are shown in bold).

Jim Parsons (TNC = 159)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       138     81       64      86           61      155
  NMI            0.882   0.775    0.89    0.843        0.835   0.913
  RI             0.452   0.144    0.596   0.3          0.328   0.642
  FMI            0.505   0.293    0.631   0.421        0.435   0.65
  Homogeneity    0.824   0.611    0.817   0.73         0.717   0.895
  Completeness   0.944   0.984    0.97    0.974        0.972   0.931
  V-measure      0.88    0.754    0.887   0.834        0.825   0.913

Kaley Cuoco (TNC = 97)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       46      61       27      63           36      96
  NMI            0.835   0.863    0.83    0.888        0.835   0.909
  RI             0.414   0.351    0.478   0.475        0.405   0.614
  FMI            0.511   0.472    0.564   0.562        0.509   0.636
  Homogeneity    0.721   0.754    0.71    0.805        0.712   0.887
  Completeness   0.967   0.987    0.971   0.979        0.979   0.932
  V-measure      0.826   0.855    0.82    0.884        0.825   0.909

Kunal Nayyar (TNC = 123)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       206     110      48      111          60      145
  NMI            0.916   0.927    0.893   0.931        0.895   0.944
  RI             0.684   0.558    0.658   0.652        0.572   0.783
  FMI            0.689   0.625    0.701   0.688        0.631   0.79
  Homogeneity    0.944   0.872    0.815   0.892        0.815   0.964
  Completeness   0.89    0.985    0.978   0.972        0.982   0.926
  V-measure      0.916   0.925    0.889   0.93         0.891   0.944

Johnny Galeck (TNC = 296)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       1105    241      97      246          177     324
  NMI            0.88    0.901    0.887   0.925        0.883   0.925
  RI             0.349   0.474    0.682   0.671        0.599   0.673
  FMI            0.41    0.56     0.719   0.71         0.638   0.677
  Homogeneity    0.977   0.825    0.809   0.874        0.822   0.926
  Completeness   0.793   0.985    0.973   0.98         0.949   0.923
  V-measure      0.875   0.898    0.884   0.924        0.881   0.925

Melissa Rauch (TNC = 20)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       10      25       7       25           13      20
  NMI            0.822   0.937    0.797   0.937        0.895   0.964
  RI             0.432   0.738    0.411   0.738        0.601   0.833
  FMI            0.54    0.752    0.536   0.752        0.668   0.842
  Homogeneity    0.699   0.946    0.642   0.946        0.811   0.959
  Completeness   0.967   0.929    0.988   0.929        0.987   0.97
  V-measure      0.812   0.937    0.778   0.937        0.89    0.964

Simon Helberg (TNC = 45)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       6       35       15      36           23      42
  NMI            0.652   0.912    0.819   0.915        0.865   0.925
  RI             0.163   0.595    0.462   0.608        0.523   0.647
  FMI            0.341   0.662    0.564   0.671        0.607   0.664
  Homogeneity    0.426   0.839    0.683   0.847        0.76    0.904
  Completeness   1       0.991    0.983   0.987        0.984   0.946
  V-measure      0.597   0.909    0.806   0.912        0.858   0.925
Fig. 7. An example of the star-oriented photo gallery.
These results could be easily utilized in various real-world applications, such as cross-scenario clothing retrieval and video advertising.
6.4. Discussion and extension

(1) Sensitivity analysis on different videos: To further evaluate the sensitivity of our method to different videos, we performed experiments on another show, How I Met Your Mother. The constructed testing dataset contains 112 episodes of the show, from season I to season V. For each video, we selected one frame per second, yielding a video dataset with a total of 118,328 raw frames. After human body detection, a total of 140,582 human bodies were detected and cropped. Among these human body images, 39,554 samples were selected as positive poses by leveraging our trained pose selection model and key joints detection model. After face detection and verification, 13,216 images with protagonists' positive body poses were obtained. By performing clothing detection, 3534 well-formed clothing images were detected and cropped. To evaluate the performance of the proposed clustering algorithm, the samples in each cluster were manually annotated as ground truth. We further manually cleaned the data by removing irrelevant images. As a result, a total of 2792 images were obtained and classified into five datasets with respect to the five leading characters: Alyson Hannigan, Cobie Smulders, Jason Segel, Josh Radnor, and Neil Patrick Harris. The statistics of these five datasets are listed in Table 7.
Table 7. The main characters in How I Met Your Mother.

Dataset related to a star    Instances   Clusters
Alyson Hannigan              117         32
Cobie Smulders               307         76
Jason Segel                  711         146
Josh Radnor                  965         225
Neil Patrick Harris          692         153
Similarly, we compared our proposed two-stage algorithm with the five widely used clustering algorithms DPCA, DBSCAN, AP, Mean Shift, and Birch. Quantitative results with respect to the six evaluation metrics, i.e., NMI, RI, FMI, homogeneity, completeness, and V-measure, are summarized in Table 8. For clarity, we also list the number of clusters estimated by each clustering algorithm. Our method delivers competitive performance in terms of most measures in comparison to the other methods on the Cobie Smulders and Jason Segel datasets, but slightly lower performance on the datasets related to the leading roles Alyson Hannigan, Josh Radnor, and Neil Patrick Harris. Specifically, the Mean Shift algorithm achieves the best performance on the Alyson Hannigan dataset in terms of most metrics. For the Josh Radnor and Neil Patrick Harris datasets, the DPCA algorithm achieves the best
Table 8. Clustering performance comparison of the algorithms on the How I Met Your Mother dataset (TNC represents the true number of clusters; in the original typesetting, the best results are shown in bold).

Alyson Hannigan (TNC = 32)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       39      37       8       34           21      36
  NMI            0.883   0.932    0.777   0.936        0.896   0.93
  RI             0.565   0.715    0.431   0.771        0.679   0.664
  FMI            0.582   0.726    0.546   0.785        0.717   0.688
  Homogeneity    0.893   0.943    0.615   0.924        0.827   0.958
  Completeness   0.874   0.922    0.983   0.947        0.971   0.903
  V-measure      0.883   0.932    0.756   0.935        0.893   0.93

Cobie Smulders (TNC = 76)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       87      75       12      71           31      80
  NMI            0.883   0.805    0.687   0.834        0.809   0.915
  RI             0.558   0.194    0.262   0.345        0.444   0.633
  FMI            0.568   0.331    0.399   0.439        0.532   0.641
  Homogeneity    0.875   0.706    0.98    0.77         0.697   0.923
  Completeness   0.891   0.918    0.915   0.904        0.94    0.907
  V-measure      0.883   0.798    0.659   0.831        0.8     0.915

Jason Segel (TNC = 146)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       302     185      34      174          82      175
  NMI            0.9     0.901    0.786   0.911        0.866   0.91
  RI             0.528   0.441    0.397   0.622        0.553   0.627
  FMI            0.548   0.505    0.479   0.635        0.593   0.638
  Homogeneity    0.954   0.871    0.669   0.9          0.801   0.933
  Completeness   0.849   0.932    0.924   0.923        0.937   0.888
  V-measure      0.899   0.901    0.776   0.911        0.864   0.91

Josh Radnor (TNC = 225)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       552     354      34      326          138     226
  NMI            0.887   0.878    0.701   0.878        0.828   0.841
  RI             0.526   0.503    0.286   0.551        0.492   0.393
  FMI            0.588   0.512    0.362   0.557        0.508   0.399
  Homogeneity    0.988   0.909    0.59    0.907        0.799   0.864
  Completeness   0.797   0.848    0.834   0.85         0.858   0.819
  V-measure      0.882   0.878    0.691   0.878        0.827   0.841

Neil Patrick Harris (TNC = 153)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       333     199      19      197          69      136
  NMI            0.848   0.631    0.568   0.731        0.673   0.736
  RI             0.462   0.026    0.149   0.129        0.196   0.156
  FMI            0.481   0.129    0.234   0.191        0.259   0.197
  Homogeneity    0.92    0.507    0.442   0.706        0.591   0.713
  Completeness   0.782   0.785    0.729   0.757        0.767   0.759
  V-measure      0.846   0.616    0.551   0.731        0.667   0.736
performance in terms of most of the evaluated metrics, but our method estimates a number of clusters closer to the true number. Although the DPCA algorithm achieves the best scores, it produces a relatively large number of clusters. This suggests that DPCA may not perform well in counting the number of clothing items worn by the leading roles, since it may yield a large number of outliers or redundant clusters.

(2) Application scenarios of our proposed framework: In practice, our proposed ClothesCounter aims at automatically counting the clothing items worn by the leading roles by leveraging multiple pre-processing steps, including human body detection, human pose selection, key joints detection, face detection, leading-role verification, and clothing detection. The implemented program is best suited to offline execution for processing lengthy videos featuring movie stars. In particular, the model trained with the triplet loss can be directly applied to clothing feature extraction for very short videos such as ads.

7. Conclusion

This paper presented a learning-based framework for automated star-oriented clothing identification from videos, ClothesCounter, which aims at recognizing clothes worn by certain stars in videos while determining their categories and numbers of pieces. For clothing clustering, we trained a clothing image verification network based on triplet loss to extract the features of clothing images, and proposed a two-stage clustering algorithm evaluated on our clothing dataset constructed from videos. Extensive
experimental results demonstrated the effectiveness of the proposed method. In clothing detection tasks, the identification of clothing details may further improve detection accuracy. In follow-up studies, we plan to utilize specific attributes of clothes, such as collars, cuffs, and pockets.

Declaration of Competing Interest

No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication. I would like to declare, on behalf of my co-authors, that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All the listed authors have approved the enclosed manuscript.

Acknowledgment

This work was supported in part by the National Key R&D Program of China under Grants no. 2018YFB1003800 and no. 2018YFB1003805, the Natural Science Foundation of China under Grants no. 61972112 and no. 61832004, and the Shenzhen Science and Technology Program under Grants no. JCYJ20170413105929681 and no. JCYJ20170811161545863.

References

[1] K. Yamaguchi, M.H. Kiapour, L.E. Ortiz, T.L. Berg, Parsing clothing in fashion photographs, in: Proceedings of the CVPR, 2012.
[2] K. Yamaguchi, M.H. Kiapour, T.L. Berg, Paper doll parsing: retrieving similar styles to parse clothing items, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3519–3526.
[3] S. Bell, K. Bala, Learning visual similarity for product design with convolutional neural networks, ACM TOG 34 (4) (2015) 98:1–98:10.
[4] X. Wang, Z. Sun, W. Zhang, Y. Zhou, Y. Jiang, Matching user photos to online products with robust deep features, in: Proceedings of the ICMR, 2016.
[5] J. Huang, R.S. Feris, Q. Chen, S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, in: Proceedings of the ICCV, 2015, pp. 1062–1070.
[6] X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, S. Yan, Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval, IEEE TMM 18 (6) (2016) 1175–1186.
[7] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the CVPR, 2014, pp. 580–587.
[8] R. Girshick, Fast R-CNN, in: Proceedings of the ICCV, 2015, pp. 1440–1448.
[9] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Proceedings of the NIPS, 2015, pp. 91–99.
[10] J. Redmon, S. Divvala, R. Girshick, et al., You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[11] H. Zhang, X. Cao, J. Ho, T. Chow, Object-level video advertising: an optimization framework, IEEE Trans. Ind. Inf. 13 (2) (2017) 520–531.
[12] Z.Q. Cheng, X. Wu, Y. Liu, X.S. Hua, Video2Shop: exact matching clothes in videos to online shopping images, in: Proceedings of the CVPR, 2017, pp. 4169–4177.
[13] H. Zhang, Y. Ji, W. Huang, L. Liu, Sitcom-star-based clothing retrieval for video advertising: a deep learning framework, Neural Comput. Appl. (2018), doi:10.1007/s00521-018-3579-x.
[14] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[15] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the CVPR, 2017, pp. 1–9.
[16] Z. Cao, T. Simon, S.E. Wei, et al., Realtime multi-person 2D pose estimation using part affinity fields, in: Proceedings of the CVPR, 2017.
[17] I. Kemelmacher-Shlizerman, S.M. Seitz, D. Miller, E. Brossard, The MegaFace benchmark: 1 million faces for recognition at scale, in: Proceedings of the CVPR, 2016.
[18] X. Liu, M. Kan, W. Wu, et al., VIPLFaceNet: an open source deep face recognition SDK, Front. Comput. Sci. 11 (2) (2017) 208–218.
[19] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in: Proceedings of the CVPR, 2015, pp. 815–823.
[20] C. Szegedy, S. Ioffe, V. Vanhoucke, et al., Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Proceedings of the AAAI, 2017.
[21] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 2009.
[22] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, ACM Sigmod Record 27 (2) (1998) 73–84.
[23] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, ACM Sigmod Rec. 25 (2) (1996) 103–114.
[24] A. Zhou, S. Zhou, J. Cao, et al., Approaches for scaling DBSCAN algorithm to large spatial databases, J. Comput. Sci. Technol. 15 (6) (2000) 509–526.
[25] D. Birant, A. Kut, ST-DBSCAN: an algorithm for clustering spatial-temporal data, Data Knowl. Eng. 60 (1) (2007) 208–221.
[26] M. Ankerst, M.M. Breunig, H.P. Kriegel, et al., OPTICS: ordering points to identify the clustering structure, ACM Sigmod Rec. 28 (2) (1999) 49–60.
[27] W. Wang, J. Yang, R. Muntz, STING: a statistical information grid approach to spatial data mining, in: Proceedings of the VLDB, volume 97, 1997, pp. 186–195.
[28] D. Duan, Y. Li, R. Li, et al., Incremental k-clique clustering in dynamic social networks, Artif. Intell. Rev. 38 (2) (2012) 129–147.
[29] S.A. Mulder, Million city traveling salesman problem solution by divide and conquer clustering with adaptive resonance neural networks, Neural Netw. 16 (5–6) (2003) 827–832.
[30] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Netw. 11 (3) (2000) 586–600.
[31] Y. Lin, H. Xu, Y. Zhou, et al., Styles in the fashion social network: an analysis on Lookbook, in: Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, Springer, Cham, 2015, pp. 356–361.
[32] A.D. Sokolova, A.S. Kharchevnikova, A.V. Savchenko, Organizing multimedia data in video surveillance systems based on face verification with convolutional neural networks, in: Proceedings of the International Conference on Analysis of Images, Social Networks and Texts, Springer, Cham, 2017, pp. 223–230.
[33] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.
[34] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[35] X. Huang, Y. Ye, H. Zhang, Extensions of kmeans-type algorithms: a new clustering framework by integrating intracluster compactness and intercluster separation, IEEE Trans. Neural Netw. Learn. Syst. 25 (8) (2014) 1433–1446.
[36] M. Hadi Kiapour, X. Han, S. Lazebnik, A.C. Berg, T.L. Berg, Where to buy it: matching street clothing photos in online shops, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3343–3351.

Haijun Zhang (M'13) received the B.Eng. and Master's degrees from Northeastern University, Shenyang, China, and the Ph.D. degree from the Department of Electronic Engineering, City University of Hong Kong, Hong Kong, in 2004, 2007, and 2010, respectively. He was a Post-Doctoral Research Fellow with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada, from 2010 to 2011. Since 2012, he has been with the Shenzhen Graduate School, Harbin Institute of Technology, China, where he is currently a Professor of Computer Science. His current research interests include multimedia data mining, machine learning, computational advertising, and service computing. Prof. Zhang is currently an Associate Editor of Neurocomputing, Neural Computing and Applications, and Pattern Analysis and Applications.
Han Guo received the B.S. degree in software engineering from Yunnan Normal University, Kunming, China, in 2016, and the M.S. degree in computer science from the Harbin Institute of Technology, Shenzhen, China, in 2019. She was a Master's candidate in Computer Engineering at the Harbin Institute of Technology Shenzhen Graduate School when this research was performed. She is currently working on recommendation algorithms at Tencent China Co., Ltd. Her research interests include data mining, computer vision, and deep learning.
Xinghao Wang received the B.S. degree in software engineering from Heilongjiang University, Heilongjiang, China, in 2017. He is a Master's candidate in Computer Engineering at the Harbin Institute of Technology Shenzhen Graduate School, where this research was performed. His research interests include data mining, computer vision, and deep learning.
Yuzhu Ji received the B.S. degree in computer science from PLA Information Engineering University, Zhengzhou, China, in 2012, and the M.S. degree in computer engineering from the Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, in 2015, where he is currently pursuing the Ph.D. degree in computer science. His research interests include data mining, computer vision, image processing, and deep learning.
Q. M. Jonathan Wu received the Ph.D. degree in electrical engineering from the University of Wales, Swansea, U.K., in 1990. He was with the National Research Council of Canada for ten years from 1995, where he became a Senior Research Officer and a Group Leader. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada. He has published more than 250 peer-reviewed papers in computer vision, image processing, intelligent systems, robotics, and integrated microsystems. His current research interests include 3-D computer vision, active video object tracking and extraction, interactive multimedia, sensor analysis and fusion, and visual sensor networks. Dr. Wu holds the Tier 1 Canada Research Chair in Automotive Sensors and Information Systems. He was an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, and CYBERNETICS PART A, and the International Journal of Robotics and Automation. He has served on technical program committees and international advisory committees for many prestigious conferences.