ClothesCounter: A framework for star-oriented clothes mining from videos

Haijun Zhang a,∗, Han Guo a, Xinghao Wang a, Yuzhu Ji a, Q. M. Jonathan Wu b
a Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
b Department of Electrical and Computer Engineering, University of Windsor, Ontario, Canada
∗ Corresponding author. E-mail address: [email protected] (H. Zhang).
Article info

Article history: Received 29 October 2018; Revised 13 July 2019; Accepted 16 September 2019; Available online xxx. Communicated by Dr. Shenglan Liu.

Keywords: Clothes clustering; Clothes detection; Video advertising; Fashion data mining
Abstract

This paper presents a novel framework, ClothesCounter, which aims to automatically identify the clothes worn by particular stars in videos. First, several deep convolutional neural network (CNN) models were utilized to preprocess the video data in order to detect clothing images from the original video frames, covering human body detection, human posture selection, human pose estimation, face verification, and clothing detection. We then propose a method for extracting features of clothing images based on a triplet loss, which maps clothing images into a compact feature space. In the learned feature space, we present a two-stage clustering algorithm that does not require the number of clusters to be specified in advance. Our framework was examined on a large-scale video dataset. Experimental results demonstrate the feasibility and effectiveness of the proposed method.
1. Introduction

With the rapid development of the Internet economy, online video traffic has grown dramatically in recent years. The clothing worn by stars in videos often leads fashion trends and attracts the attention of a large number of fans. When watching idol dramas or television shows in which the protagonists wear fashionable clothes, viewers, especially female viewers, are easily attracted to these clothes and stimulated to purchase items identical to those shown in the video. Clothing worn by stars has always been popular with audiences, and it increases purchase demand from audiences who want to keep pace with their idols. If stars' clothing could be detected automatically, great benefits could be obtained for video advertising and cross-scenario clothing image retrieval. Consequently, determining how to quickly and accurately detect clothing worn by actors in videos has become a common concern for video platforms that aim to combine video websites and e-commerce to convert traffic into sales. Our objective is not only to detect clothes, but also to discover how many sets of clothes a given star wears in a video. Such an intelligent system involves several research fields in computer vision and machine learning, including object detection, human pose estimation, face verification, clothing segmentation, and clothing image clustering. The diverse appearance of clothes, cluttered
backgrounds, distortions, varying lighting conditions, and motion blur in videos make automated clothing identification from videos a challenging task. At present, there are several lines of work on clothing parsing, clothing retrieval and recommendation, and video advertising based on clothes recognition. For example, Yamaguchi et al. [1] demonstrated an effective method for parsing clothing in fashion photographs, and Kiapour et al. [2] used a retrieval-based approach to solve the clothing parsing problem. Clothing retrieval has immense applicability in the commercial industry. Extensive efforts have been devoted to similar-clothing retrieval [3] and exactly-the-same-clothing retrieval [4]. Bell et al. [3] proposed a convolutional neural network (CNN) using a contrastive loss to learn visual similarity between products. Huang et al. [5] presented a dual attribute-aware ranking network (DARN) based on a Siamese network to retrieve similar clothes. Clothing detection and segmentation techniques [6] have also been utilized for clothing retrieval. Deep learning-based object detection methods, such as the region-based convolutional neural network (R-CNN) [7], Fast R-CNN [8], Faster R-CNN [9], and YOLO [10], can be straightforwardly utilized for clothing detection. Moreover, from an application perspective, video advertising based on clothes recognition offers huge revenue potential in the online video market. Zhang et al. [11] first introduced an optimization framework for object-level video advertising. Subsequently, Cheng et al. [12] explored a new cross-domain task of online clothing shopping, matching clothes appearing in videos to identical items in online stores.
Despite recent advances in image-to-store retrieval, only a few studies have focused specifically on linking star-oriented clothes in videos to online stores. Recently, Zhang et al. [13] proposed a sitcom-star-oriented clothing retrieval system based on videos by using several state-of-the-art deep learning methods. However, this system only learned clothing features based on categories and employed the density peaks clustering algorithm (DPCA) [14] for clothing clustering. Clustering performance cannot be guaranteed with such a rough representation for measuring clothes similarity. In this paper, we present a framework for automated star-oriented clothes identification from videos, called ClothesCounter, which aims to automatically identify the clothing worn by certain stars in a given video, while determining the clothing categories and the number of pieces of clothing appearing in the video.

Due to the continuity of video frames over a certain period of time, a large number of identical clothes are detected by the aforementioned deep learning-based methods, making it challenging to discern the number of pieces of clothing. In addition, if we regard the raw detected clothing images as queries for retrieval, they produce a large number of similar retrieval results. Therefore, the main task becomes finding an efficient method for redundant query removal, i.e., accurately identifying the number of pieces of clothing that a given star wears in a video. In this paper, we propose a two-stage clustering algorithm to remove redundant continuous data from the original clothing detection results, leaving only the cluster centers as representatives that can be viewed as retrieval queries in various clothing retrieval applications.

Moreover, good clustering performance relies largely on good representations of the extracted clothing features. Desirable features for clothing images are expected to satisfy the criterion that the maximal intra-class distance is smaller than the minimal inter-class distance under a certain metric. However, learning features under this criterion is generally difficult because of major intra-class variation in luminance, body pose, and clothing distortion, and the high inter-class similarity exhibited by clothing images. Pioneering CNN-based works that use a traditional softmax loss learn image features indirectly, and these features are not sufficiently discriminative. To overcome these challenges, we adopted a method for extracting features of clothing images based on triplet loss that can map clothing images into a compact feature space, enabling us to measure clothes similarity effectively. The proposed two-stage clustering algorithm is then performed in the learned feature space.

To evaluate the proposed framework, we conducted extensive experiments on a famous American sitcom, The Big Bang Theory. The dataset contains 103 episodes. After pre-processing, including human body detection, pose selection, and face verification, a total of 18,274 clothing images were detected. Among them, 6031 clothing images of six protagonists were used for the clustering experiment. Experimental results demonstrate the feasibility and efficacy of our framework.

The motivation of this paper is query-redundancy removal in a video advertising system, achieved by automatically counting the clothing items worn by the leading roles.
In practice, a clustering algorithm can be applied to filter out redundant query clothing images. However, the performance of a clustering algorithm is largely influenced by the generalization capability of the feature representations. Thus, to learn more discriminative features, we propose to train an efficient feature extractor by leveraging a triplet loss function under a clothing re-identification framework. Moreover, we also developed a two-stage clustering algorithm to efficiently count the number of clothing items worn by sitcom stars according to the number of clusters.

The key contributions of this paper are three-fold: (1) A novel framework, ClothesCounter, is proposed by utilizing state-of-the-art deep CNN models and clustering algorithms. The framework is
able to automatically identify the categories of clothes appearing in a video and obtain the number of pieces of clothing worn by a given star; (2) A deep model based on triplet loss is designed for feature extraction from clothing images. The developed triplet loss-based model can capture the intrinsic similarity of clothing images belonging to the same piece of clothing; (3) A two-stage clustering algorithm is proposed for star-oriented clothing clustering. The first stage performs multiple rounds of density clustering with adaptively set neighborhood radii; the second stage merges clusters based on the similarities between them.

The remainder of this paper is organized as follows. In Section 2, our ClothesCounter framework is briefly presented. In Section 3, we describe the implementation details of clothing detection from a video, including human body detection, pose selection, protagonist face verification, and clothing detection, utilizing different deep CNN models. Sections 4 and 5 describe the feature learning of clothing images based on triplet loss and the two-stage clustering algorithm, respectively. Experimental results based on real video datasets are presented in Section 6. In Section 7, we conclude the paper and suggest directions for future work.

2. Overview of our framework

The whole framework is developed to automatically parse a given video and obtain the images, categories, and number of pieces of clothing worn by each protagonist. A potential application of our framework is to build a clothes album for each star in a video, identify the total number of pieces of clothing, and label their categories. These results can be easily utilized for clothing recommendations. The ClothesCounter framework comprises several modules, including human body detection, pose selection, face detection and verification, clothing detection, and clothing image clustering. The entire pipeline is illustrated in Fig. 1. First, human bodies are detected from video frames. After human body detection, a pose selection module, including a binary classifier and a body key points detection module, is used to determine whether or not a human body pose is good. A good pose is determined by whether the form of the clothing attached to the human body is good and whether it is suitable for image retrieval. After the face verification module identifies the main characters, the clothes detection module determines the bounding box and category of the clothes, and segments clothing image patches from the whole protagonist body image. Finally, clothing features are extracted based on triplet loss, and clustering is performed in the learned feature space. Overall, the framework includes three key components: (1) clothing image segmentation from a video for a given star; (2) feature extraction from clothing images; and (3) clothing image clustering. The whole working flow of ClothesCounter is presented in Fig. 2. The detailed implementations of these components are described in the following sections.

3. Star-oriented clothes detection from videos

3.1. Human body detection

Given a video, one frame is extracted at a fixed interval, and human detection is performed over these video frames. Human body detection constitutes a sub-problem of object detection.
In this research, we chose Faster R-CNN as the human body detection model, as it shows state-of-the-art object detection accuracy [13]. The training dataset was constructed from the public PASCAL VOC2012 dataset, which covers 20 categories, including people, animals, and vehicles; categories and bounding-box locations are manually labeled. A total of 8,174 images contain human bodies with annotated bounding boxes. We trained the human body detection network on this human body dataset.
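The paper trains its own Faster R-CNN on PASCAL VOC2012; purely as a hedged illustration of this step, the sketch below uses torchvision's off-the-shelf COCO-pretrained Faster R-CNN and keeps only 'person' boxes. The score threshold is our assumption, not a value from the paper.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Off-the-shelf COCO-pretrained detector, used here only for illustration;
# the paper's detector is trained on PASCAL VOC2012 human-body annotations.
_detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
_detector.eval()

def detect_bodies(frame_rgb, score_thresh=0.8):
    # Return person bounding boxes [x1, y1, x2, y2]; label 1 is 'person'
    # in the COCO label map. The score threshold is an assumption.
    with torch.no_grad():
        out = _detector([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)
    return out["boxes"][keep].tolist()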
Fig. 1. Overview of the ClothesCounter framework.
Fig. 2. Working flow of the ClothesCounter framework.
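The per-frame control flow of Fig. 2 can be summarized by the following sketch. Every helper function here is hypothetical and stands in for a module described in Sections 3–5; none of these names come from the paper.

def clothes_counter(video_path, protagonists):
    # Per-frame decision cascade of Fig. 2; all helpers are hypothetical
    # stand-ins for the modules of Sections 3-5.
    detections = []
    for frame in extract_frames(video_path):              # one frame per second
        for body in detect_bodies(frame):                 # Section 3.1
            if not is_good_pose(body):                    # Section 3.2
                continue
            face = detect_face(body)
            if face is None:
                continue
            star = identify_protagonist(face, protagonists)  # Section 3.3
            if star is None:
                continue
            for clothing_patch in detect_clothing(body):     # Section 3.4, YOLO v2
                detections.append((star, clothing_patch))
    # Sections 4-5: embed patches with the triplet-loss CNN, then cluster per star
    return cluster_per_star(detections)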
We used the same parameter settings and network structure as [13]; the implementation details can also be found in [13]. Using the human body detection model, we cropped human body regions from the raw video frames.

3.2. Pose selection

After human body detection, the pose selection module is utilized to determine whether or not the human body is in a good pose. Given a video, human body poses usually change frequently. Selecting good poses of stars constitutes a crucial step, because pose directly affects the performance of clothing detection [13]. The pose selection module includes a binary classifier and key body joints detection.

For the binary classifier for pose selection, a good pose is defined by good lighting conditions, a good shape of clothing without distortion, a frontal view of the human body, etc. We obtained a total of 23,167 human body images, in which 11,097 images were annotated as positive pose samples and 12,070 as negative ones. Of these, 13,190 images were collected from the videos used in this research, and 9,977 images were cropped from the Street-shop dataset [36]. An AlexNet model modified with a binary classifier was trained on this collected dataset; the implementation details can be found in [13]. Although the trained binary classifier can filter out most human body images in bad poses, some images that are obscured by other objects or contain only half-body regions may remain. In order to further filter out these images, we developed a key human joints detection module. In this research, we employed the method proposed by Cao et al. [16]. This method adopts a bottom-up algorithm and uses global contextual cues to detect body parts and their associations. The network structure is divided into two branches: (1) key points detection; and (2) prediction of the connections between key points. Each branch has an iteratively refined prediction structure. The detailed implementation of this method can be found in [16]. Some key joints detection results are shown in Fig. 3. From left to right, the detection results are based on human body images with good poses, profile images, and upper-body images. If the detected key joints are incomplete, the human body image may be inappropriate for clothing detection. Specifically, a human body image containing a complete top should allow detection of the shoulders, elbows, and wrists, and one containing a complete bottom should allow detection of the hips, knees, and ankles. These rules are utilized to choose human body images with good poses, as sketched below.
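A minimal sketch of these completeness rules follows; the joint names, the joints mapping (name to confidence), and the confidence threshold are hypothetical conventions of ours, not the output format of [16].

TOP_JOINTS = {"l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist"}
BOTTOM_JOINTS = {"l_hip", "r_hip", "l_knee", "r_knee", "l_ankle", "r_ankle"}

def joint_completeness(joints, conf_thresh=0.3):
    # joints: hypothetical dict mapping joint name -> detection confidence.
    detected = {name for name, conf in joints.items() if conf >= conf_thresh}
    complete_top = TOP_JOINTS <= detected        # shoulders, elbows, wrists
    complete_bottom = BOTTOM_JOINTS <= detected  # hips, knees, ankles
    return complete_top, complete_bottom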
3.3. Protagonist face verification

The key idea of the proposed ClothesCounter framework centers on the clothing of famous stars, because stars are usually at the forefront of fashion. Thus, it is first necessary to verify whether or not a detected person plays one of the leading roles in a given video. In general, face recognition can be categorized into face identification and face verification [17]. The former classifies a face into a specific identity, while the latter determines whether a pair of faces belongs to the same identity. In our protagonist face verification module, we used an open-source face recognition method with deep representations, named VIPLFaceNet [18], which achieves 98.60% mean accuracy on the LFW (Labelled Faces in the Wild) dataset. We manually selected seven face images as standard faces for each protagonist.
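The matching rule, stated in full after Table 1, can be sketched as follows; the face-embedding extraction (VIPLFaceNet in the paper) is assumed to be available as a function returning one vector per face, and the data layout is ours.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def identify_protagonist(face_emb, standard_embs, thresh=0.66):
    # standard_embs: hypothetical dict mapping protagonist name -> list of
    # embeddings of the seven manually selected standard faces. Returns the
    # best-matching name when its average similarity exceeds the threshold.
    best_name, best_avg = None, -1.0
    for name, embs in standard_embs.items():
        avg = float(np.mean([cosine(face_emb, e) for e in embs]))
        if avg > best_avg:
            best_name, best_avg = name, avg
    return best_name if best_avg > thresh else None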
Fig. 3. Results of body key joints detection based on human images with: (a) good poses; (b) sideways images; and (c) upper-body images.

Table 1. Constructed clothing detection dataset (w denotes woman, m denotes man).

Categories    Amazon   Taobao   Train/val   Test   Total
Dress-w       1002     888      945         945    1890
Gallus-w      1001     393      697         697    1394
Shirt-m       499      432      466         465    931
Shirt-w       905      210      558         557    1115
Suit-m        1005     901      953         953    1906
Sweater-w     532      1270     901         901    1802
Tshirt-m      843      500      672         671    1343
Tshirt-w      600      921      761         760    1521
Skirts-w      1002     532      767         767    1534
Hoodie-m      1100     276      688         688    1376
Jeans-m       568      432      800         200    1000
Jeans-w       601      399      800         200    1000
Leggings-w    502      498      800         200    1000
Pants-m       621      379      800         200    1000
Pants-w       599      401      800         200    1000
Shorts-m      600      400      800         200    1000
Shorts-w      587      413      800         200    1000
Each detected face is compared with every standard face using cosine similarity; the detected face is identified as a specific protagonist if the greatest average similarity exceeds a predetermined verification threshold (we set this to 0.66 according to [13]). The detailed verification scheme can be found in [13].

3.4. Clothing detection

After protagonist body images with good poses are obtained, the clothing detection module locates clothing regions. Clothing detection is similar to object detection, which predicts the location and the category of an object simultaneously. In this research, we used a state-of-the-art object detection method, YOLO v2 [15], as the model for clothing detection, and utilized the same clothing dataset introduced in [13]. The constructed clothing dataset contains a total of 21,812 images, categorized into 17 classes (see Table 1). These data were collected from two popular online shopping websites, i.e., Amazon.com and Taobao.com. Each image in this dataset was manually labeled with a category and a bounding box. For dataset partition, we randomly selected 80% of the images in each category as a training/validation set, and the remaining 20% were used for testing (see Table 1). The mAP results for each category are shown in Fig. 4. YOLO v2 achieved 98.16% mAP over all categories. Such a high mAP largely satisfies the basic requirement of our applications. Therefore, we selected this well-trained YOLO v2 model to detect clothes in our video dataset.

4. Feature learning based on triplet loss

Our ultimate goal lies in automatically identifying clothes worn by certain stars in a given video, while recognizing the clothing categories and number of pieces of clothing that appear in the
video. After clothing detection, a large number of similar clothes are detected and cut out from the video, due to the continuity of video frames within a certain period of time. These images, even when belonging to the same piece of clothing, may exhibit major differences in angle, size, shape, etc. It is necessary to determine whether a set of clothing images belongs to the same piece of clothing, and then to remove redundant queries for potential clothing retrieval applications. Since a clothing set compiled from a video usually has a large number of clothing categories and few samples in each category, it is difficult for traditional category-based loss functions under the deep learning framework to learn discriminative features for clothing image representations. In this research, we propose to utilize a triplet loss function for extracting features of clothing images. A deep CNN is trained with a triplet loss that serves to pull instances of the same clothes closer together, while pushing instances belonging to different clothes farther apart in the learned feature space. Our method differs from previous works [13] that regard the outputs of the last two fully connected layers as feature representations of clothing images. A metric learning method supervised by a triplet loss is used to project the original features into a low-dimensional space while preserving their discriminative information. Metric learning with a triplet loss aims to separate positive pairs from negative pairs by a distance margin; it was initially used in face verification [19]. In our application, it minimizes the squared L2 distance between images of the same clothes and enforces a margin between the distances of different clothes. In this paper, we explored the Inception-ResNet v2 [20] network architecture, which combines Inception architectures with residual connections and achieved state-of-the-art performance in the ILSVRC2015 challenge. Based on the Inception-ResNet v2 structure, we added the triplet loss function, removed the softmax classification layer, and directly took the feature map from the final convolutional layer. After L2 normalization, an embedding space is established for the dimensionally reduced image features. Triplets were selected from our constructed clothing dataset (see Section 6.2), and triplet losses were calculated on the feature representations in the embedding space. Given the model structure shown in Fig. 5, the key to our approach is end-to-end learning of the embedding space. Specifically, we learn an embedding from clothing images to a feature space in which the distances between images of the same clothes are small, whereas the distances between different clothes are large. For an image $x_a$ (anchor), we define a triplet of $x_a$ as $(x_a, x_p, x_n)$, where $x_p$ (positive) is a sample belonging to the same clothes as $x_a$, and $x_n$ (negative) is a sample belonging to different clothes. It is necessary to ensure that an image $x_a$ is closer to all other images $x_p$ of the same clothes than to any image $x_n$ of different clothes. The triplet loss function requires the distance between $x_a$ and $x_p$ to be smaller than the distance between $x_a$
Fig. 4. Detection results on our established clothing dataset.
Fig. 5. The model structure.
and $x_n$ by at least a predefined margin. Thus, we require:
$\|x_a - x_p\|_2^2 + \alpha < \|x_a - x_n\|_2^2, \quad \forall (x_a, x_p, x_n) \in T,$  (1)
where $\alpha$ is the margin, empirically set to 0.2, and $T$ is the set of triplets in the training set, with cardinality $N$. All selected triplets need to satisfy the constraint in Eq. (1). We denote by $f(x_a)$, $f(x_p)$, and $f(x_n)$ the embeddings produced by the shared-parameter CNN model for $x_a$, $x_p$, and $x_n$, respectively. The loss function, summed over the $N$ training triplets, is given by:
$L = \sum_{i=1}^{N} \left[ \|f(x_a^{(i)}) - f(x_p^{(i)})\|_2^2 - \|f(x_a^{(i)}) - f(x_n^{(i)})\|_2^2 + \alpha \right]_+ .$  (2)
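As an illustration of Eq. (2) and of the verification rule used later, here is a minimal NumPy sketch; the batch layout and helper names are our own, and gamma = 0.81 is the cross-validated threshold reported in Section 6.2.

import numpy as np

def l2_normalize(x, eps=1e-10):
    # Row-wise projection onto the unit hypersphere, applied to the
    # final convolutional feature map as described above.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # Eq. (2) on batches of embeddings f(x_a), f(x_p), f(x_n);
    # alpha = 0.2 as in the paper.
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)
    d_an = np.sum((f_a - f_n) ** 2, axis=1)
    return float(np.sum(np.maximum(d_ap - d_an + alpha, 0.0)))

def same_piece(f_x, f_y, gamma=0.81):
    # Clothing verification rule: same piece of clothing iff the squared
    # L2 distance is below the cross-validated threshold (Section 6.2).
    return float(np.sum((f_x - f_y) ** 2)) < gamma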
The whole training process is described in Section 6.2. To verify the representational capability of the learned features, we performed clothing verification (see Section 6.2). Given a pair of clothing images, clothing verification aims to determine whether they belong to the same piece of clothing. The similarity between a pair of clothing images is measured by the squared L2 distance between their learned features. A distance threshold is obtained by cross-validation: if the distance is greater than the threshold, the pair is judged not to be the same piece of clothing, and vice versa.

5. A two-stage method for clothes clustering

Clustering is an important technique in data mining. It divides a large amount of data into several clusters according to the similarity between samples. If the clothes identified by the clothing detection module for a given star can be clustered, the number of clothes worn by the star will be the number of clusters. We can use the cluster centers as representatives of the clothes for retrieval applications. At the same time, a larger cluster suggests that the clothes in that cluster occur more frequently in the video. At present, extant clustering methods are mainly divided into five categories: (1) partitioning methods, such as the K-MEANS [21] and K-MEDOIDS algorithms, which
need to predefine the number of clusters; (2) hierarchical methods, which hierarchically decompose a given dataset until a certain condition is satisfied; representative algorithms are CURE [22] and BIRCH [23]; (3) density-based methods, in which a point is added to a cluster as long as the density in its region is greater than a certain threshold; representative algorithms are DBSCAN [24,25] and OPTICS [26]; (4) grid-based methods, which first divide the data space into a grid structure with a finite number of cells and perform all processing on the grid units; representative algorithms include STING [27] and CLIQUE [28]; and (5) model-based methods, which are mainly based on statistical models and neural network models [29,30].

The first stage of our work mainly follows the scheme of density-based clustering methods, because they do not need a predefined number of clusters. The main idea of density-based clustering lies in finding high-density regions separated by low-density regions. A cluster is the largest set of density-connected points, and can be uniquely determined by any one of its core points. Density-based clustering does not need to fix the number of clusters in advance, and can identify clusters of arbitrary shapes in datasets containing noise; therefore, it is suitable for clustering unknown data. The DBSCAN algorithm [24,25] is a representative density-based clustering algorithm. However, since the parameters Eps (the neighborhood radius) and MinPts (the minimum number of points within the neighborhood radius) are fixed during the clustering process, it is difficult for the classical DBSCAN algorithm to adapt to datasets with uneven densities.

Our task is to cluster clothing images detected from video frames. Considering the uneven density distribution of the clothing data, we propose a clustering method that adaptively selects multiple Eps values. In the image feature extraction module, we evaluated our method on the clothing verification task (see Section 6.2 for detailed experimental results). Given a pair of clothing images, a threshold γ on the squared L2 distance between the two images was utilized to determine
whether this pair of images belongs to the same piece of clothing. We used cross-validation to determine this threshold, which was then treated as the upper limit of the clustering parameter Eps. Below this upper limit, we generated multiple Eps values with a fixed step size, e.g., 0.1, in the range (0, γ).

At the first stage, the neighborhood radii Eps are arranged in ascending order, and MinPts is set to 2. We then select the minimum Eps and, with MinPts, perform density clustering on the data. Here, a 128-dimensional data point $p_i$ represents an image. We examine each point sequentially over the dataset. If a point $p_i$ has not been processed (i.e., neither assigned to a cluster nor marked as noise), we mark $p_i$ as "processed" and find its neighborhood $N_{Eps}(p_i)$, defined by $N_{Eps}(p_i) = \{p_j \in D \mid dist(p_i, p_j) \leq Eps\}$, where $D$ denotes the whole dataset and the distance $dist(p_i, p_j)$ is the squared Euclidean distance. If $N_{Eps}(p_i)$ contains fewer points than MinPts, i.e., $|N_{Eps}(p_i)| <$ MinPts, we mark $p_i$ as noise; if $|N_{Eps}(p_i)| \geq$ MinPts, we group all points in $N_{Eps}(p_i)$ into a new cluster $c$. For each point $p_k$ in $N_{Eps}(p_i)$ that has not yet been processed, we then check its neighborhood $N_{Eps}(p_k)$; if $|N_{Eps}(p_k)| \geq$ MinPts, the points in $N_{Eps}(p_k)$ that are not yet grouped into any other cluster are added to $c$. The above steps are repeated until all data points are classified into a cluster or marked as noise. Subsequently, we take the next Eps value and perform density clustering again on the data labeled as noise in the previous round. This process continues until all Eps values are used up. It is worth noting that the neighborhood radii Eps must be used in ascending order in the multiple clustering rounds of the first stage: smaller Eps values handle only denser points, so points in sparse clusters are left unprocessed without affecting low-density data.

At the second stage, we use the squared L2 distance between the centers of the clusters formed at the first stage to measure their similarity. If the distance between clusters $c_i$ and $c_j$ is less than the threshold γ, i.e., $dist(c_i, c_j) < \gamma$, we merge the two clusters. The final clustering result is obtained by continuously merging the most similar clusters. Algorithm 1 describes the proposed two-stage clustering algorithm, where $D$ is the input dataset; $N_{Eps}(p, D)$ is the subset of $D$ inside a hypersphere of radius Eps centered at $p$ ($p \in D$); and $card(N_{Eps}(p, D))$ is the cardinality of that set. The first stage marks each point of $D$ with a cluster identifier (c_id) that gives the cluster to which the point belongs, or marks the point as noise. In the second stage, the clusters obtained in the first stage are merged whenever their distance is less than the merging threshold. The time-consuming step of the first stage lies in finding $N_{Eps}(p, D)$, which takes $O(n)$ time, where $p \in D$ and $|D| = n$; hence, the time complexity of the first stage is $O(n^2)$. At the second stage, assuming that the number of clusters obtained in the first stage is $n_c$, the time complexity of the merging phase is $O(n_c^2)$. The total time complexity of our method is $O(n^2 + n_c^2)$.
6. Experiment

6.1. Preprocessing of videos

In order to establish our video dataset, we downloaded 103 episodes of The Big Bang Theory from the Internet (used for research purposes only). For each video, we removed the header part, a fixed length of 1000 frames, from the beginning. One frame per second was then extracted to form the dataset of raw frames.
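A minimal sketch of this sampling step with OpenCV is shown below; the 1000-frame header length matches the text, while the fps fallback value is our assumption.

import cv2

def extract_frames(video_path, seconds_between=1.0, skip_header=1000):
    # Drop the first `skip_header` frames (the episode header), then
    # keep one frame per `seconds_between` seconds.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # assumption: fall back to 25 fps
    step = max(int(round(fps * seconds_between)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx >= skip_header and (idx - skip_header) % step == 0:
            yield frame   # BGR ndarray
        idx += 1
    cap.release()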
Algorithm 1. Two-stage clustering.

Input: D (the dataset), Eps_list (candidate Eps values, ascending), MinPts, γ (the threshold for merging clusters)
Output: Labels (the cluster label of each point)
Initialize: Cur_D = D, c_id = 0
for Eps in Eps_list do
    for each point p in Cur_D do
        if p is marked "visited" then
            continue
        end if
        Mark p as "visited"
        Find N_Eps(p, Cur_D)
        if card(N_Eps(p, Cur_D)) < MinPts then
            Mark p as noise
        else
            c_id = c_id + 1
            Mark each point of N_Eps(p, Cur_D) with c_id
            for each point y in N_Eps(p, Cur_D) not marked "visited" do
                Mark y as "visited"
                Find N_Eps(y, Cur_D)
                if card(N_Eps(y, Cur_D)) ≥ MinPts then
                    Mark each point of N_Eps(y, Cur_D) with c_id
                    if any point of N_Eps(y, Cur_D) is marked as noise then
                        Remove that mark
                    end if
                end if
            end for
        end if
    end for
    Cur_D = points currently marked as noise
end for
for each pair of distinct clusters (c1, c2) do
    if ||c1 − c2||₂² < γ then
        Merge(c1, c2)
    end if
end for
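For concreteness, a compact Python sketch of Algorithm 1 follows. It is an assumption-laden illustration, not the authors' implementation: it reuses scikit-learn's DBSCAN for the inner density-expansion step (converting the paper's squared-L2 radii to Euclidean ones) and merges the closest pair of cluster centers until no pair lies within γ.

import numpy as np
from sklearn.cluster import DBSCAN

def two_stage_clustering(X, eps_list, gamma=0.81, min_pts=2):
    # X: (n, 128) array of triplet-loss embeddings; eps_list: ascending
    # squared-L2 radii in (0, gamma); gamma: merging threshold (Section 6.2).
    labels = np.full(len(X), -1, dtype=int)   # -1 = noise / unprocessed
    next_id = 0
    # Stage 1: density clustering with increasing radii; each round only
    # re-examines the points still marked as noise (Cur_D in Algorithm 1).
    for eps in sorted(eps_list):
        rest = np.flatnonzero(labels == -1)
        if rest.size == 0:
            break
        # sklearn's DBSCAN takes Euclidean radii, hence the square root
        sub = DBSCAN(eps=float(np.sqrt(eps)), min_samples=min_pts).fit(X[rest])
        for k in np.unique(sub.labels_):
            if k == -1:
                continue
            labels[rest[sub.labels_ == k]] = next_id
            next_id += 1
    # Stage 2: repeatedly merge the pair of clusters whose centers are
    # closest, as long as that squared distance is below gamma.
    while True:
        ids = np.unique(labels[labels >= 0])
        if ids.size < 2:
            break
        centers = np.stack([X[labels == k].mean(axis=0) for k in ids])
        d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        if d2[i, j] >= gamma:
            break
        labels[labels == ids[j]] = ids[i]   # merge cluster j into cluster i
    return labels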
Table 2. Dataset processing in each module.

Stage                  Before                       After
Video segmentation     103 videos                   115,905 (raw frames)
Human body detection   115,905 (raw frames)         208,519 (detected bodies)
Pose selection         208,519 (detected bodies)    84,415 (positive poses)
Face detection         84,415 (positive poses)      60,305 (detected faces)
Face verification      60,305 (detected faces)      22,779 (verified faces)
Clothes detection      22,779 (verified faces)      18,274 (detected clothes)
Finally, we compiled a video dataset with a total of 115,905 raw frames. After human detection, we detected and cropped 208,519 human body images from the original frames. Among these, 84,415 were determined to have good poses by our trained pose selection model and key joints detection model. Face recognition was then performed on these positive human body images, in which 60,305 faces were detected and aligned; 22,779 faces were verified as protagonists' faces. Clothing detection was then performed on the human images associated with the 22,779 protagonists' faces. Finally, 18,274 well-formed and complete clothing images were detected and cropped. For clarity, Table 2 presents the statistics of the dataset at each processing stage.
Table 3. Clothing verification model dataset.

                Total                 Train/Val             Test
                Pieces   Instances    Pieces   Instances    Pieces   Instances
LookBook        5155     35,021       3043     25,585       2112     9436
Video Frames    941      8852         403      5175         1344     3677
Total           6902     43,873       3446     30,760       3456     13,113

Table 4. Performance with different embedding dimensionality.

Embedding dimensionality    128     256     512     1024
Accuracy                    0.98    0.979   0.983   0.981
AUC                         0.998   0.998   0.999   0.999

Table 5. The main characters in The Big Bang Theory.

Dataset related to a star    Instances   Clusters
Jim Parsons                  1671        159
Kaley Cuoco                  620         97
Kunal Nayyar                 953         123
Johnny Galeck                2515        296
Melissa Rauch                60          20
Simon Helberg                212         45

6.2. Clothing verification based on triplet loss

In order to train a deep model based on triplet loss for extracting image features, we constructed a clothing dataset containing a total of 43,873 clothing images collected from videos and the LookBook dataset. These images belong to 6,902 pieces of different clothing. We manually selected clothing images detected in various genres of movies, including comedy, romance, and action, and chose 941 pieces of different clothing with a total of 8852 images. The remaining 35,021 clothing images were collected from LookBook.nu, a well-known street-fashion website [31] on which fashion lovers upload their own street photos and share wardrobes with other users. From the LookBook street dataset, we obtained 5155 pieces of different clothing, comprising a total of 35,021 clothes images. Images from videos were collected in order to adapt the model to the application environment of our program, i.e., video data. In practice, clothing items can be automatically collected by a crawler from such online fashion websites or stores, which usually contain multi-view clothing images belonging to the same items. Establishing such a clothing dataset is necessary to improve the generalization capability of a feature extractor for distinguishing whether or not two clothing images contain the same clothing item, and the generalization performance of our triplet-loss-based feature extraction model should improve as the dataset scale increases. In particular, if a dataset contains more similar clothing items, the generalization capability of a feature extractor can be further improved, because similar items can be utilized as hard negative samples when constructing triplets, producing more robust and discriminative feature representations. Collecting similar clothing images thus improves the proposed feature extractor by avoiding overly easy negative samples during model training. Table 3 shows the statistics of our constructed dataset.

We trained the CNN using RMSprop with standard backpropagation, starting with a learning rate of 0.01. The model parameters were randomly initialized, and training was performed on a GTX 1080. The margin α in the triplet loss was set to 0.2. The network structure is based on Inception-ResNet v2: the final softmax classification layer was removed, the feature map of the final convolutional layer was taken directly, and L2 normalization was applied, yielding the clothing features. In order to evaluate the representational capability of the learned features, we examined our model on clothing verification. Given a pair of clothing images, clothing verification determines whether or not they belong to the same clothing by calculating the squared L2 distance between their learned embedding vectors. For dataset partition, pieces of clothing with fewer than six instances were used as the test set, and the remaining images were used for training/validation (see Table 3). We used two widely utilized measures [32] to evaluate performance: accuracy and area under the curve (AUC). For each setting of the embedding dimension, we used 10-fold cross-validation. Table 4 lists the results with respect
to different embedding dimensions. It can be seen that embedding dimensions of 128, 256, 512, and 1024 deliver quite similar, good performance. To reduce the computational burden, we used an embedding dimension of 128. For the 128-dimensional embedding with 10-fold cross-validation, we obtained an average similarity threshold γ of 0.81, which was used for clothing clustering in the following section.

6.3. Clustering on real video datasets

After processing the videos, we obtained clothing image datasets with respect to the six main characters of the American sitcom The Big Bang Theory: Jim Parsons, Kaley Cuoco, Kunal Nayyar, Johnny Galeck, Melissa Rauch, and Simon Helberg. We organized and manually classified six clothing image datasets with respect to these six stars. The statistics of the six datasets are shown in Table 5. Fig. 6 presents a few examples of the experimental results of our two-stage clustering algorithm, in which images with red boxes represent examples of clustering errors and images with green boxes are the center samples of each cluster. In order to demonstrate the effectiveness of our two-stage clustering algorithm, we compared the proposed method with the DPCA [14], DBSCAN [24,25], AP [33], Mean Shift [34], and Birch [23] algorithms on our constructed datasets. These algorithms are widely used in the clustering field and do not necessitate a predefined number of clusters. We used six common measures for evaluation: normalized mutual information (NMI), Rand index (RI), Fowlkes-Mallows index (FMI), homogeneity, completeness, and V-measure [35]. We also list the number of clusters estimated by each clustering method. Quantitative results are summarized in Table 6. It can be seen that our method delivers superior performance in terms of most measures in comparison to the other methods. In particular, the number of clusters estimated by our method is usually near the correct number. Moreover, our method always outperforms the other methods in terms of NMI and V-measure. DBSCAN produces superior completeness over the other methods on the datasets related to Jim Parsons, Kaley Cuoco, Kunal Nayyar, and Johnny Galeck. AP achieves the best RI and FMI on the Johnny Galeck dataset, and outperforms the other methods in terms of completeness on the Melissa Rauch dataset. Finally, we produced a star-oriented photo gallery of the clothing of every protagonist in The Big Bang Theory, as shown in Fig. 7.
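For reference, the six external measures can be computed with scikit-learn as sketched below, assuming its standard definitions match those used in the paper.

from sklearn.metrics import (fowlkes_mallows_score,
                             homogeneity_completeness_v_measure,
                             normalized_mutual_info_score, rand_score)

def external_measures(y_true, y_pred):
    # The six measures reported in Tables 6 and 8, under scikit-learn's
    # standard definitions (assumed to match the paper's).
    h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
    return {"NMI": normalized_mutual_info_score(y_true, y_pred),
            "RI": rand_score(y_true, y_pred),
            "FMI": fowlkes_mallows_score(y_true, y_pred),
            "Homogeneity": h, "Completeness": c, "V-measure": v}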
Fig. 6. Examples of clothes clustering.
Table 6. Performance comparison of the algorithms (TNC represents the true number of clusters; in the original typesetting, the best results are shown in bold).

Jim Parsons (TNC = 159)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       138     81       64      86           61      155
  NMI            0.882   0.775    0.89    0.843        0.835   0.913
  RI             0.452   0.144    0.596   0.3          0.328   0.642
  FMI            0.505   0.293    0.631   0.421        0.435   0.65
  Homogeneity    0.824   0.611    0.817   0.73         0.717   0.895
  Completeness   0.944   0.984    0.97    0.974        0.972   0.931
  V-measure      0.88    0.754    0.887   0.834        0.825   0.913

Kaley Cuoco (TNC = 97)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       46      61       27      63           36      96
  NMI            0.835   0.863    0.83    0.888        0.835   0.909
  RI             0.414   0.351    0.478   0.475        0.405   0.614
  FMI            0.511   0.472    0.564   0.562        0.509   0.636
  Homogeneity    0.721   0.754    0.71    0.805        0.712   0.887
  Completeness   0.967   0.987    0.971   0.979        0.979   0.932
  V-measure      0.826   0.855    0.82    0.884        0.825   0.909

Kunal Nayyar (TNC = 123)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       206     110      48      111          60      145
  NMI            0.916   0.927    0.893   0.931        0.895   0.944
  RI             0.684   0.558    0.658   0.652        0.572   0.783
  FMI            0.689   0.625    0.701   0.688        0.631   0.79
  Homogeneity    0.944   0.872    0.815   0.892        0.815   0.964
  Completeness   0.89    0.985    0.978   0.972        0.982   0.926
  V-measure      0.916   0.925    0.889   0.93         0.891   0.944

Johnny Galeck (TNC = 296)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       1105    241      97      246          177     324
  NMI            0.88    0.901    0.887   0.925        0.883   0.925
  RI             0.349   0.474    0.682   0.671        0.599   0.673
  FMI            0.41    0.56     0.719   0.71         0.638   0.677
  Homogeneity    0.977   0.825    0.809   0.874        0.822   0.926
  Completeness   0.793   0.985    0.973   0.98         0.949   0.923
  V-measure      0.875   0.898    0.884   0.924        0.881   0.925

Melissa Rauch (TNC = 20)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       10      25       7       25           13      20
  NMI            0.822   0.937    0.797   0.937        0.895   0.964
  RI             0.432   0.738    0.411   0.738        0.601   0.833
  FMI            0.54    0.752    0.536   0.752        0.668   0.842
  Homogeneity    0.699   0.946    0.642   0.946        0.811   0.959
  Completeness   0.967   0.929    0.988   0.929        0.987   0.97
  V-measure      0.812   0.937    0.778   0.937        0.89    0.964

Simon Helberg (TNC = 45)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       6       35       15      36           23      42
  NMI            0.652   0.912    0.819   0.915        0.865   0.925
  RI             0.163   0.595    0.462   0.608        0.523   0.647
  FMI            0.341   0.662    0.564   0.671        0.607   0.664
  Homogeneity    0.426   0.839    0.683   0.847        0.76    0.904
  Completeness   1       0.991    0.983   0.987        0.984   0.946
  V-measure      0.597   0.909    0.806   0.912        0.858   0.925
Fig. 7. An example of the star-oriented photo gallery.
These results could be easily utilized in various real-world applications, such as cross-scenario clothing retrieval and video advertising.
6.4. Discussion and extension

(1) Sensitivity analysis on different videos: To further evaluate the sensitivity of our method to different videos, we performed experiments on another show, How I Met Your Mother. The constructed testing dataset contains 112 episodes of the show, from season I to season V. For each video, we selected one frame per second, yielding a video dataset with a total of 118,328 raw frames. After human body detection, a total of 140,582 human bodies were detected and cropped. Among these human body images, 39,554 samples were selected as positive poses by leveraging our trained pose selection model and key joints detection model. After face detection and verification, 13,216 images with protagonists' positive body poses were obtained. By performing clothing detection, 3534 well-formed clothing images were detected and cropped. To evaluate the performance of the proposed clustering algorithm, the samples in each cluster were manually annotated as ground truth. We further manually cleaned the data by removing irrelevant images. As a result, a total of 2792 images were obtained and classified into five datasets with respect to the five leading characters: Alyson Hannigan, Cobie Smulders, Jason Segel, Josh Radnor, and Neil Patrick Harris. The statistics of these five datasets are listed in Table 7.
Table 7. The main characters in How I Met Your Mother.

Dataset related to a star    Instances   Clusters
Alyson Hannigan              117         32
Cobie Smulders               307         76
Jason Segel                  711         146
Josh Radnor                  965         225
Neil Patrick Harris          692         153
Similarly, we compared our proposed two-stage algorithm with the five widely used clustering algorithms DPCA, DBSCAN, AP, Mean Shift, and Birch. Quantitative results with respect to the six evaluation metrics, i.e., NMI, RI, FMI, homogeneity, completeness, and V-measure, are summarized in Table 8. For clarity, we also list the number of clusters estimated by each clustering algorithm. Our method delivers competitive performance in terms of most measures in comparison to the other methods on the Cobie Smulders and Jason Segel datasets, but slightly lower performance on the datasets related to the leading roles Alyson Hannigan, Josh Radnor, and Neil Patrick Harris. Specifically, the Mean Shift algorithm achieves the best performance on the Alyson Hannigan dataset in terms of most metrics. For the Josh Radnor and Neil Patrick Harris datasets, the DPCA algorithm achieves the best
Table 8. Clustering performance comparison of the algorithms on the How I Met Your Mother dataset (TNC represents the true number of clusters; in the original typesetting, the best results are shown in bold).

Alyson Hannigan (TNC = 32)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       39      37       8       34           21      36
  NMI            0.883   0.932    0.777   0.936        0.896   0.93
  RI             0.565   0.715    0.431   0.771        0.679   0.664
  FMI            0.582   0.726    0.546   0.785        0.717   0.688
  Homogeneity    0.893   0.943    0.615   0.924        0.827   0.958
  Completeness   0.874   0.922    0.983   0.947        0.971   0.903
  V-measure      0.883   0.932    0.756   0.935        0.893   0.93

Cobie Smulders (TNC = 76)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       87      75       12      71           31      80
  NMI            0.883   0.805    0.687   0.834        0.809   0.915
  RI             0.558   0.194    0.262   0.345        0.444   0.633
  FMI            0.568   0.331    0.399   0.439        0.532   0.641
  Homogeneity    0.875   0.706    0.98    0.77         0.697   0.923
  Completeness   0.891   0.918    0.915   0.904        0.94    0.907
  V-measure      0.883   0.798    0.659   0.831        0.8     0.915

Jason Segel (TNC = 146)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       302     185      34      174          82      175
  NMI            0.9     0.901    0.786   0.911        0.866   0.91
  RI             0.528   0.441    0.397   0.622        0.553   0.627
  FMI            0.548   0.505    0.479   0.635        0.593   0.638
  Homogeneity    0.954   0.871    0.669   0.9          0.801   0.933
  Completeness   0.849   0.932    0.924   0.923        0.937   0.888
  V-measure      0.899   0.901    0.776   0.911        0.864   0.91

Josh Radnor (TNC = 225)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       552     354      34      326          138     226
  NMI            0.887   0.878    0.701   0.878        0.828   0.841
  RI             0.526   0.503    0.286   0.551        0.492   0.393
  FMI            0.588   0.512    0.362   0.557        0.508   0.399
  Homogeneity    0.988   0.909    0.59    0.907        0.799   0.864
  Completeness   0.797   0.848    0.834   0.85         0.858   0.819
  V-measure      0.882   0.878    0.691   0.878        0.827   0.841

Neil Patrick Harris (TNC = 153)
  Measure        DPCA    DBSCAN   AP      Mean Shift   Birch   Ours
  Clusters       333     199      19      197          69      136
  NMI            0.848   0.631    0.568   0.731        0.673   0.736
  RI             0.462   0.026    0.149   0.129        0.196   0.156
  FMI            0.481   0.129    0.234   0.191        0.259   0.197
  Homogeneity    0.92    0.507    0.442   0.706        0.591   0.713
  Completeness   0.782   0.785    0.729   0.757        0.767   0.759
  V-measure      0.846   0.616    0.551   0.731        0.667   0.736
performance in terms of most of the evaluated metrics, but our method estimates a number of clusters closer to the true number. Although the DPCA algorithm achieves the best scores, it produces a relatively large number of clusters. This suggests that DPCA may not perform well in counting the number of clothing items worn by the leading roles, since it may yield a large number of outliers or redundant clusters.

(2) Application scenarios of our proposed framework: In practice, our proposed ClothesCounter aims at automatically counting the clothing items worn by the leading roles by leveraging multiple pre-processing steps, including human body detection, human pose selection, key joints detection, face detection, leading-role verification, and clothing detection. The implemented program is best suited to offline execution for processing lengthy videos featuring movie stars. In particular, the model trained with the triplet loss can be directly applied to clothing feature extraction for very short videos such as ads.

7. Conclusion

This paper presented a learning-based framework for automated star-oriented clothing identification from videos, ClothesCounter, which aims at recognizing clothes worn by certain stars in videos while determining their categories and numbers of pieces. For clothing clustering, we trained a clothing image verification network based on triplet loss to extract the features of clothing images, and proposed a two-stage clustering algorithm evaluated on our clothing dataset constructed from videos. Extensive
experimental results demonstrated the effectiveness of the proposed method. In clothing detection tasks, the identification of clothing details may further improve detection accuracy. In follow-up studies, we plan to utilize specific attributes of clothes, such as collars, cuffs, and pockets.

Declaration of Competing Interest

No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication. I would like to declare, on behalf of my co-authors, that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All the listed authors have approved the enclosed manuscript.

Acknowledgment

This work was supported in part by the National Key R&D Program of China under Grants no. 2018YFB1003800 and no. 2018YFB1003805, the Natural Science Foundation of China under Grants no. 61972112 and no. 61832004, and the Shenzhen Science and Technology Program under Grants no. JCYJ20170413105929681 and no. JCYJ20170811161545863.

References

[1] K. Yamaguchi, M.H. Kiapour, L.E. Ortiz, T.L. Berg, Parsing clothing in fashion photographs, in: Proceedings of the CVPR, 2012.
[2] K. Yamaguchi, M.H. Kiapour, T.L. Berg, Paper doll parsing: retrieving similar styles to parse clothing items, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3519–3526.
[3] S. Bell, K. Bala, Learning visual similarity for product design with convolutional neural networks, ACM TOG 34 (4) (2015) 98:1–98:10.
[4] X. Wang, Z. Sun, W. Zhang, Y. Zhou, Y. Jiang, Matching user photos to online products with robust deep features, in: Proceedings of the ICMR, 2016.
[5] J. Huang, R.S. Feris, Q. Chen, S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, in: Proceedings of the ICCV, 2015, pp. 1062–1070.
[6] X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, S. Yan, Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval, IEEE TMM 18 (6) (2016) 1175–1186.
[7] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the CVPR, 2014, pp. 580–587.
[8] R. Girshick, Fast R-CNN, in: Proceedings of the ICCV, 2015, pp. 1440–1448.
[9] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Proceedings of the NIPS, 2015, pp. 91–99.
[10] J. Redmon, S. Divvala, R. Girshick, et al., You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[11] H. Zhang, X. Cao, J. Ho, T. Chow, Object-level video advertising: an optimization framework, IEEE Trans. Ind. Inf. 13 (2) (2017) 520–531.
[12] Z.Q. Cheng, X. Wu, Y. Liu, X.S. Hua, Video2Shop: exact matching clothes in videos to online shopping images, in: Proceedings of the CVPR, 2017, pp. 4169–4177.
[13] H. Zhang, Y. Ji, W. Huang, L. Liu, Sitcom-star-based clothing retrieval for video advertising: a deep learning framework, Neural Comput. Appl. (2018), doi:10.1007/s00521-018-3579-x.
[14] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[15] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the CVPR, 2017, pp. 1–9.
[16] Z. Cao, T. Simon, S.E. Wei, et al., Realtime multi-person 2D pose estimation using part affinity fields, in: Proceedings of the CVPR, 2017.
[17] I. Kemelmacher-Shlizerman, S.M. Seitz, D. Miller, E. Brossard, The MegaFace benchmark: 1 million faces for recognition at scale, in: Proceedings of the CVPR, 2016.
[18] X. Liu, M. Kan, W. Wu, et al., VIPLFaceNet: an open source deep face recognition SDK, Front. Comput. Sci. 11 (2) (2017) 208–218.
[19] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in: Proceedings of the CVPR, 2015, pp. 815–823.
[20] C. Szegedy, S. Ioffe, V. Vanhoucke, et al., Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Proceedings of the AAAI, 2017.
[21] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 2009.
[22] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, ACM Sigmod Record 27 (2) (1998) 73–84.
[23] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, ACM Sigmod Rec. 25 (2) (1996) 103–114.
[24] A. Zhou, S. Zhou, J. Cao, et al., Approaches for scaling DBSCAN algorithm to large spatial databases, J. Comput. Sci. Technol. 15 (6) (2000) 509–526.
[25] D. Birant, A. Kut, ST-DBSCAN: an algorithm for clustering spatial-temporal data, Data Knowl. Eng. 60 (1) (2007) 208–221.
[26] M. Ankerst, M.M. Breunig, H.P. Kriegel, et al., OPTICS: ordering points to identify the clustering structure, ACM Sigmod Rec. 28 (2) (1999) 49–60.
[27] W. Wang, J. Yang, R. Muntz, STING: a statistical information grid approach to spatial data mining, in: Proceedings of the VLDB, volume 97, 1997, pp. 186–195.
[28] D. Duan, Y. Li, R. Li, et al., Incremental k-clique clustering in dynamic social networks, Artif. Intell. Rev. 38 (2) (2012) 129–147.
[29] S.A. Mulder, Million city traveling salesman problem solution by divide and conquer clustering with adaptive resonance neural networks, Neural Netw. 16 (5–6) (2003) 827–832.
[30] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Netw. 11 (3) (2000) 586–600.
[31] Y. Lin, H. Xu, Y. Zhou, et al., Styles in the fashion social network: an analysis on Lookbook, in: Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, Springer, Cham, 2015, pp. 356–361.
[32] A.D. Sokolova, A.S. Kharchevnikova, A.V. Savchenko, Organizing multimedia data in video surveillance systems based on face verification with convolutional neural networks, in: Proceedings of the International Conference on Analysis of Images, Social Networks and Texts, Springer, Cham, 2017, pp. 223–230.
[33] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.
[34] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[35] X. Huang, Y. Ye, H. Zhang, Extensions of kmeans-type algorithms: a new clustering framework by integrating intracluster compactness and intercluster separation, IEEE Trans. Neural Netw. Learn. Syst. 25 (8) (2014) 1433–1446.
[36] M. Hadi Kiapour, X. Han, S. Lazebnik, A.C. Berg, T.L. Berg, Where to buy it: matching street clothing photos in online shops, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3343–3351.

Haijun Zhang (M'13) received the B.Eng. and Master's degrees from Northeastern University, Shenyang, China, and the Ph.D. degree from the Department of Electronic Engineering, City University of Hong Kong, Hong Kong, in 2004, 2007, and 2010, respectively. He was a Post-Doctoral Research Fellow with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada, from 2010 to 2011. Since 2012, he has been with the Shenzhen Graduate School, Harbin Institute of Technology, China, where he is currently a Professor of Computer Science. His current research interests include multimedia data mining, machine learning, computational advertising, and service computing. Prof. Zhang is currently an Associate Editor of Neurocomputing, Neural Computing and Applications, and Pattern Analysis and Applications.
Han Guo received the B.S. degree in software engineering from Yunnan Normal University, Kunming, China, in 2016, and the M.S. degree in computer science from the Harbin Institute of Technology, Shenzhen, China, in 2019. She was a Master's candidate in Computer Engineering at the Harbin Institute of Technology Shenzhen Graduate School when this research was performed. She is currently working on recommendation algorithms at Tencent China Co., Ltd. Her research interests include data mining, computer vision, and deep learning.
Xinghao Wang received the B.S. degree in software engineering from Heilongjiang University, Heilongjiang, China, in 2017. He is a Master's candidate in Computer Engineering at the Harbin Institute of Technology Shenzhen Graduate School, where this research was performed. His research interests include data mining, computer vision, and deep learning.
Yuzhu Ji received the B.S. degree in computer science from PLA Information Engineering University, Zhengzhou, China, in 2012, and the M.S. degree in computer engineering from the Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, in 2015, where he is currently pursuing the Ph.D. degree in computer science. His research interests include data mining, computer vision, image processing, and deep learning.
Q. M. Jonathan Wu received the Ph.D. degree in electrical engineering from the University of Wales, Swansea, U.K., in 1990. He was with the National Research Council of Canada for ten years from 1995, where he became a Senior Research Officer and a Group Leader. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada. He has published more than 250 peer-reviewed papers in computer vision, image processing, intelligent systems, robotics, and integrated microsystems. His current research interests include 3-D computer vision, active video object tracking and extraction, interactive multimedia, sensor analysis and fusion, and visual sensor networks. Dr. Wu holds the Tier 1 Canada Research Chair in Automotive Sensors and Information Systems. He was an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, and CYBERNETICS PART A, and the International Journal of Robotics and Automation. He has served on technical program committees and international advisory committees for many prestigious conferences.