Accepted Manuscript

Image-guided 3D model labeling via multiview alignment
Kan Guo, Xiaowu Chen, Bin Zhou, Qinping Zhao

PII: S1524-0703(18)30004-3
DOI: 10.1016/j.gmod.2018.02.001
Reference: YGMOD 992

To appear in: Graphical Models

Received date: 20 October 2017
Revised date: 4 January 2018
Accepted date: 3 February 2018

Please cite this article as: Kan Guo, Xiaowu Chen, Bin Zhou, Qinping Zhao, Image-guided 3D model labeling via multiview alignment, Graphical Models (2018), doi: 10.1016/j.gmod.2018.02.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Image-guided 3D model labeling via multiview alignment

Kan Guo, Xiaowu Chen, Bin Zhou†, Qinping Zhao
State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University
† Corresponding author: [email protected]
Abstract

This paper presents a new method for 3D model labeling guided by weakly tagged 2D color images. Many previous methods for 3D model labeling achieve impressive performance using large training data sets; however, it is difficult and time-consuming to build such carefully annotated data sets. To solve this problem, we take advantage of the large number of weakly tagged color images available on the Web to label 3D models. In our approach, we first collect web color images and tag them with semantic annotations. We then project the input 3D model into multiview projections. Through multiview alignment, we transfer the semantic labels onto the model projections via a color-weighting process. Combining pre-segmentation information, we back-project the labels and obtain the final labeling results. Experiments on two benchmarks show that our approach achieves labeling accuracy comparable to two state-of-the-art methods without their expensive training cost.

Keywords: 3D model labeling, image-guided, multiview alignment

CCS Concepts: • Theory of computation → Computational geometry
1. Introduction
3D model labeling is a basic technology for high-level 3D understanding and other 3D operations. In 3D space, giving each triangle face a semantic label is challenging using the geometry features of the 3D model alone. In practice, a 3D model is ultimately presented to human eyes as 2D images, and 2D images contain much useful information that the 3D model does not have, such as colors and textures. With the rapid growth of 2D image collections, how to use 2D images to help understand 3D models becomes very valuable.
In early studies, most researchers focused on developing effective and robust features of the 3D model itself to perform the labeling task. Among these approaches, one representative way is to directly analyse the 3D model by extracting geometry features of each triangle face [SOCG10, SSS∗10, AZC∗12]. Another way is to co-analyse multiple 3D models of the same category [HKG11, HFL12, SQX∗16]. Generally, these methods utilize geometry features from multiple models and improve the final results by enforcing correspondences between them. All of the above methods achieve state-of-the-art performance in many cases. However, due to the limited feature space and variety, such hand-crafted features often lack the ability to label various types of 3D models. To address this problem, learning based methods for 3D model labeling have been proposed [KHS10, GZC15]. With a large amount of training data, these methods first extract basic geometry features and then learn an effective and robust model representation, which can subsequently be used to label 3D models effectively. However, the training cost is very expensive, and building the training dataset costs a lot of human labour.
[email protected]
Moreover, for 3D models with complex structures, the basic geometry features may vary greatly, so learning directly in the 3D model space may also fail. To analyse and understand 3D models from another point of view, projection based 3D model labeling methods have been proposed [XXS∗15, KAMC17]. These methods first project the 3D model into multiview projections and utilize deep neural networks to learn powerful features in the 2D image space; they then back-project and optimize the labeling result on the 3D model surface. Both approaches combine image-based convolutional networks for view-based feature learning and outperform methods based on hand-crafted features. However, they also need a carefully labeled training dataset and have a high training cost.

Based on the above observations, we propose a 3D model semantic labeling approach guided by weakly tagged 2D color images. We aim to use the color-semantic relationships contained in web color images and take advantage of their large number. Our main idea is to transfer color-semantic maps onto the 3D model surface via multiview alignment. Even though projecting the 3D model into 2D space loses one dimension of information, there are several benefits to be gained. Firstly, a model with complex topology and a non-connected surface can be represented by a uniform array, whereas in 3D space it is hard to obtain robust geometry features. Secondly, we can easily connect web images to the 3D model through multiview alignment. The web images are available in large quantities: for a given shape category, we can search for matching objects in the web images that cover almost all shape varieties, which is hard to achieve with 3D models alone. Moreover, the web images contain various colors that are often linked to semantic labels.
Figure 1: A typical bicycle labeling result obtained via web color images.
In our method, we first collect web images of different object categories. In order to use the color images, we build the color-semantic mapping in a simple way. Unlike early projection analysis based methods [WGW∗13, XXS∗15, KAMC17], we do not need to carefully label each pixel of the images as training data. A person only needs to select the primary color of each part that appears in the image by a few simple clicks, which we call the weak tagging process. Each image requires only several color links and can be reused. Even though some images are tagged wrongly or ambiguously, the many correctly tagged ones can correct the final results. We then pre-segment the 3D models and project them into 2D images, which are used to search for similar weakly tagged 2D images. Here we utilize simple image descriptors, Fourier and Zernike moments, to construct a matching energy function. According to the matched tagged images, we employ the dense SIFT flow method to warp the color images onto the projections, and meanwhile obtain the semantic labels of the projections based on the color-semantic mappings. We consider that a color which maps to fewer semantic labels should get a larger confidence; thus we propose a color weighting operation and calculate a transfer confidence. In this way, the projection images serve as a bridge that links the weakly tagged web images and the model triangles. Eventually, through back-projection and optimization, we obtain the final 3D model labeling results.
The main contribution of this paper is a 3D model part semantic labeling framework guided by weakly tagged web color images. Without carefully labeled training data or hand-crafted geometry features, our method achieves good labeling results at low cost.

2. Related Work

As a fundamental problem, model segmentation and labeling have attracted much attention in the last decade. Traditional methods focus on finding effective geometry descriptors [BCG08, HWAG09, ZZWC12], such as the Heat Kernel Signature [SOCG10] and the Shape Diameter Function [SSS∗10]. These methods achieve good segmentation results on a limited set of categories, but since different models vary greatly in shape and topology, they fail easily. Another way to segment and label models is to find the boundary of each part. For example, Au et al. [AZC∗12] present a simple algorithm that exploits shape concavity information to extract model segments; when a boundary region does not contain sufficient concavity, their method may fail. Taking a single model as input and considering only hand-crafted features, the methods above are difficult to extend to various 3D models.

To improve robustness, co-analysis based approaches have been proposed [HKG11, SQX∗16]. Representatively, given a set of models from a common family, Hu et al. [HFL12] extract multiple patch geometry features and cluster them into subspaces. Even though their method is also based on geometry features, with a consistency penalty it jointly extracts consistent parts. Similarly, van Kaick et al. [vKXZ∗13] introduce a co-hierarchical analysis of a set of models, aimed at discovering their hierarchical part structures and revealing relations. These methods are unsupervised and focus on maintaining consistency between models of the same category; however, without labeled data, it is hard for them to decide which result is correct. In contrast, data-driven approaches can learn an effective and robust model representation [vKTS∗11, HSG13, KLM∗13, YSGG17]. Kalogerakis et al. [KHS10] first propose a combination of a JointBoost classifier and conditional random field learning and obtain state-of-the-art results on all categories of the Princeton Segmentation Benchmark [CGF09]. Guo et al. [GZC15] first introduce a deep learning framework based on convolutional neural networks. Both methods rely on geometry features and carefully labeled training data. On the one hand, some geometry features depend on the model topology and may be sensitive to geometric changes. On the other hand, carefully labeled training data are hard to acquire and process: even though the amount of 3D data grows fast, it is difficult to cover the variety of object shapes in the real world, and accurately labeling a 3D model requires a lot of effort.
Figure 2: Framework of our method. Given an input 3D model and related web object images, we first simply remove the image backgrounds and annotate the images with semantic tags. Correspondingly, we pre-process the 3D model and project it into multi-view masks. According to the correspondence between the masks, we align the 2D masks and the 3D projected masks. Then we transfer the semantic annotations onto the 3D projected masks with a color weighting process. Finally, we back-project the semantic labels onto the 3D model surface.
In recent years, Su et al. [SFG17, QYSG17] propose several learning-based methods that focus on point cloud input. Using novel deep neural networks that directly consume point clouds, their methods are efficient and effective. However, these methods also need a lot of training data, which is costly.
To understand 3D models in a new way, Wang et al. [WGW∗13] introduce projective analysis for the semantic segmentation and labeling of 3D models. They introduce a novel bi-class Hausdorff distance to match the projections with labeled images. The labeled dataset in their method is gathered from the Web; however, they need to manually segment and label each object part, which requires a lot of effort. In comparison, our method only needs weak links between colors and semantics, at a much lower cost. Motivated by the success of deep neural networks, Xie et al. [XXS∗15] and Kalogerakis et al. [KAMC17] propose projective convolutional networks for 3D model labeling. Representatively, Kalogerakis et al. [KAMC17] use multi-view projected shaded and depth images to learn per-label confidence maps, and then obtain the final labeling results through a surface-based CRF layer. Their method makes no geometric or topological assumptions and exploits no hand-crafted geometric descriptors. However, their training data all come from human-labeled 3D models, which are difficult to obtain in sufficient quantity, and the training cost is expensive.

As discussed above, our key idea is to exploit the color-semantic maps in the existing large number of color images. Through a low-cost weak tagging process, without expensive training or training data construction, we aim to easily transfer the knowledge in color images for 3D model labeling.
3. Our Method
3.1. Web image collection and tagging

Our goal is to use web color images to guide 3D model labeling, so the first step of our method is to build image datasets. For convenience, we download color images from Google image search. It is worth mentioning that we use different key words to make our dataset diverse. For example, when we build the truck image dataset, we use truck and oil tank truck as key words to search for images, while for bicycles we use bicycle and tandem bicycle. In this way, through this semantic bridge, we obtain images of various object styles for the subsequent searching and matching processes.
After collecting the color images, we apply a few processing steps to make them usable in the next stages. Fig. 2 shows the major steps of our algorithm. Firstly, we need to extract the foreground object in each image. Many mature algorithms can solve this problem; here we employ the Minimum Barrier Salient Object Detection method of [ZSL∗15]. Through this method, we initially produce robust saliency maps, and then use an adaptive threshold to obtain binarized saliency maps as image masks. In order to measure the images at a uniform scale, the masks are cut and resized to 400 by 400.
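The mask preparation step can be sketched as follows. This is a minimal illustration assuming a saliency map has already been computed by some minimum-barrier detector implementation (not reproduced here); Otsu's method stands in for the paper's adaptive threshold, and the function name is ours.

```python
import cv2
import numpy as np

def prepare_mask(saliency_map, size=400):
    """Binarize a saliency map, crop to the foreground, and resize.

    `saliency_map` is assumed to be a float array in [0, 1] produced by a
    minimum-barrier salient object detector.
    """
    gray = (saliency_map * 255).astype(np.uint8)
    # Otsu thresholding as a stand-in for the adaptive threshold.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return cv2.resize(mask, (size, size), interpolation=cv2.INTER_NEAREST)
    # Cut the mask to the foreground bounding box, then normalize its scale.
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(crop, (size, size), interpolation=cv2.INTER_NEAREST)
```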
[email protected]
ACCEPTED MANUSCRIPT 5
K. Guo et al. / Image-guided 3D model labeling via multiview alignment
Figure 5: Voxel-based pre-segmentation and optimization. The left shows the initial voxelization result and the right shows the optimized one.
Figure 3: The color tag tool we designed to make a part-level color-semantic mapping of each normalized image.
Figure 4: Examples of camera sampling.
To make tagging the images easy, we first define a color plate that includes basic colors and visually distinguishable ones. As shown in Fig. 3, we develop a color tag tool to make a part-level color-semantic mapping of each normalized image. Because of the ambiguity and singularity between colors, we first cluster the image colors into our defined color plate, and one user can specify multiple colors for one part label with our tool. Let $C = \{c_1, c_2, \dots, c_m\}$ denote the color plate and $L = \{l_1, l_2, \dots, l_n\}$ the semantic part labels; if $l_1$ has colors $c_2$ and $c_3$, the color-semantic map can be expressed as $(l_1, V_{c_1})$ with $V_{c_1} = \{c_2, c_3\}$. In this way, a user can distinguish the color of each part much more easily. Note that a color with few pixels is set invalid. Moreover, a user only needs to click the primary colors of each part, so our tags are much simpler and less costly than pixel-level labels. Even though several users may make wrong tags, we assume that most images are tagged correctly and that the incorrect ones are put right in the subsequent merging process.
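A minimal sketch of the color plate and the weak tags is given below; the plate entries, part names, and helper name are illustrative assumptions, not the paper's exact data.

```python
import numpy as np

# A small illustrative color plate; the paper's actual plate is larger.
COLOR_PLATE = {
    "black": (0, 0, 0), "white": (255, 255, 255), "red": (200, 30, 30),
    "green": (30, 160, 60), "blue": (40, 70, 200), "yellow": (230, 200, 40),
}

def cluster_to_plate(image):
    """Snap every pixel of an RGB image to its nearest plate color name."""
    pixels = image.reshape(-1, 3).astype(np.float32)
    plate = np.array(list(COLOR_PLATE.values()), dtype=np.float32)
    nearest = np.argmin(np.linalg.norm(pixels[:, None] - plate[None], axis=2), axis=1)
    names = np.array(list(COLOR_PLATE.keys()), dtype=object)
    return names[nearest].reshape(image.shape[:2])

# A weak tag is just the map (l_i, V_ci): part label -> primary plate colors.
# Hypothetical tags a user might click for one bicycle image:
tags = {"frame": {"red"}, "wheel": {"black"}, "seat": {"black", "white"}}
```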
3.2. 3D model pre-segmentation and projection
After building the weakly tagged image datasets, we aim to establish correspondences between 2D images and 3D models for label transfer. A natural way to build correspondences is to retrieve the most similar rendered model projections and use their known camera poses. Given an input 3D model, as shown in Fig. 4, we initially set the camera positions uniformly around the upper hemisphere at fixed intervals. However, through experiments and observation, we find that objects in web images are usually photographed from frontal or lateral views, following human picture-taking habits. So in our experiments, in order to decrease the amount of computation and ambiguity, we sometimes reduce the sampled camera poses to side views depending on the object style, e.g., for bicycles and trucks. Given the 3D models, we first move them to the coordinate origin and then scale them to the normalized coordinate space [−1, 1].
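A sketch of the normalization and hemisphere camera sampling is shown below; the interval counts and camera radius are assumed values, not taken from the paper.

```python
import numpy as np

def normalize_vertices(V):
    """Center the mesh vertices at the origin and scale them into [-1, 1]^3."""
    V = V - V.mean(axis=0)
    return V / np.abs(V).max()

def hemisphere_cameras(n_azimuth=8, n_elevation=3, radius=2.5):
    """Sample camera positions uniformly on the upper hemisphere.

    Each returned position is assumed to look towards the origin.
    """
    cams = []
    for el in np.linspace(np.pi / 12, np.pi / 2.5, n_elevation):
        for az in np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False):
            cams.append(radius * np.array([np.cos(el) * np.cos(az),
                                           np.cos(el) * np.sin(az),
                                           np.sin(el)]))
    return np.stack(cams)
```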
[email protected]
Since our semantic knowledge comes from original web images, the subsequent transfer step inevitably contains many noisy and erroneous pixels. We address this problem in two ways. On the one hand, we assume that the number of correctly transferred images is larger than that of wrong ones. On the other hand, we propose a voxel-based merge and pre-segmentation algorithm. Given a normalized 3D model, we first interpolate the vertices according to a triangle face area threshold, set to $10^{-10}$ in our experiments; through this process, we make the model vertices dense and uniform. Then we voxelize the processed model at a predefined resolution, set to 1000 along each axis in our experiments. At the first stage, we combine voxels according to the group information of the model. For each group, we calculate its valid voxel count and bounding box. We then choose the groups whose valid voxel count is below ν = 50 and merge them into larger ones according to overlapping bounding boxes and an intersecting voxel proportion ρ = 1. In our experiments, we gradually increase ν and decrease ρ, repeating the merge step until most small groups are merged. One merge example is shown in Fig. 5. Note that not all models have group information; for models without it, we simply apply an over-segmentation process and pre-segment the surface triangles into larger patches.

3.3. Multiview alignment and color tags transfer

Using the predefined camera poses, we apply orthogonal projections and obtain multi-view projection masks of the 3D model. To be consistent with the 2D image masks, we also cut and normalize the projections to 400 by 400. In the next stage of our pipeline, we compute features of both the 2D image masks and the 3D projection masks. Inspired by [ZL02], we extract Fourier and Zernike moments descriptors of the masks as matching energies. Moreover, we calculate an IOU distance from the overlapping and non-overlapping areas between the masks as an additional term. Let FD denote the Fourier descriptor, ZMD the Zernike moments descriptor, and IOUD the IOU distance descriptor. The distance between image mask $i$ and model mask $j$ is defined as:

$$E_{ij} = \delta_i \left( w_{FD} E_{FD_{ij}} + w_{ZMD} E_{ZMD_{ij}} + w_{IOU} E_{IOU_{ij}} \right), \quad \delta_i = \begin{cases} 1, & \text{if } NC_i > 0 \\ \infty, & \text{otherwise} \end{cases} \qquad (1)$$

where $w_{FD}$, $w_{ZMD}$ and $w_{IOU}$ are weights that balance the three energy terms; in our experiments we set $w_{FD} = 0.7$, $w_{ZMD} = 0.3$, $w_{IOU} = 1$. $NC_i$ denotes the number of annotated colors contained in image $i$, and each energy term $E$ is calculated as a Euclidean distance. In particular, our main idea takes advantage of the color richness of the images, so an image is useless if all object parts are tagged with the same color; $\delta_i$ therefore acts as a gate that determines whether image $i$ is valid. Using the energy defined above, we filter and obtain matching pairs of 2D and 3D masks with a predefined threshold.
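A sketch of how the matching energy of Eq. (1) might be computed is given below. The Fourier descriptor here is a simple centroid-distance variant and the Zernike moments are assumed to be supplied by an external implementation, so this is an illustration under our own assumptions rather than the authors' exact code.

```python
import cv2
import numpy as np

def fourier_descriptor(mask, n_coeffs=32):
    """Centroid-distance Fourier descriptor of the largest mask contour."""
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    pts = max(contours, key=len).reshape(-1, 2).astype(np.float32)
    # Resample the contour to a fixed length so descriptors are comparable.
    idx = np.linspace(0, len(pts) - 1, 256).astype(int)
    sig = np.linalg.norm(pts[idx] - pts.mean(axis=0), axis=1)
    spectrum = np.abs(np.fft.fft(sig))
    return spectrum[1:n_coeffs + 1] / (spectrum[0] + 1e-8)  # scale-normalized

def iou_distance(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return 1.0 - inter / max(union, 1)

def matching_energy(img_mask, model_mask, zmd_img, zmd_model, n_colors,
                    w_fd=0.7, w_zmd=0.3, w_iou=1.0):
    """Eq. (1): weighted sum of descriptor distances, gated by tag validity."""
    if n_colors == 0:
        return np.inf  # delta_i: images without valid annotated colors are discarded
    e_fd = np.linalg.norm(fourier_descriptor(img_mask) - fourier_descriptor(model_mask))
    e_zmd = np.linalg.norm(zmd_img - zmd_model)  # Zernike moments from any library
    return w_fd * e_fd + w_zmd * e_zmd + w_iou * iou_distance(img_mask, model_mask)
```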
Figure 6: Color transfer result of an example bicycle.

Given a matching pair, denote M2im the image mask, C2im the clustered color image, and M3im the projected model mask. Our task is now to transfer C2im onto M3im and obtain C3im, according to the correspondence between M2im and M3im. At this stage we exploit the dense SIFT flow method [LYT11]. Even though the matched masks have different shapes, the method effectively finds the right correspondence between them, and through a simple warp process we obtain good transfer results, as shown in Fig. 6.
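As a rough sketch, once a dense flow field is available (the SIFT flow estimation itself is not reproduced here), the warp can be applied by simple remapping; the array layout and function name are assumptions.

```python
import cv2
import numpy as np

def warp_by_flow(color_image, flow):
    """Warp the clustered color image C2im onto the model mask domain.

    `flow` is assumed to be a dense (H, W, 2) field, e.g. estimated by a
    SIFT flow implementation, giving for every pixel of the model
    projection the displacement towards its correspondence in the web image.
    """
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Nearest-neighbour sampling keeps plate colors exact instead of blending them.
    return cv2.remap(color_image, map_x, map_y, interpolation=cv2.INTER_NEAREST)
```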
3.4. Color tags weighting and back-projection
Based on the color-semantic maps and the transferred color masks described above, the next stage is to infer the semantics of the 3D projection masks. For each color appearing in a C3im, we first search for the semantic labels it belongs to, simply traversing each label inside the color-semantic map. In our approach, we consider that a color appearing in fewer part labels should get a higher confidence; in other words, we want to give distinctive colors higher weights. For a plate color $c_k$, let $W(c_k)$ denote its confidence weight, calculated as:

$$W(c_k) = \exp\left(1 - \sum_{i=1}^{NL} \mathrm{ismember}(c_k, V_{c_i}) / NL\right), \quad \mathrm{ismember}(c_k, V_{c_i}) = \begin{cases} 1, & \text{if } c_k \in V_{c_i} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $V_{c_i}$ indicates the colors of part label $l_i$ as described in Sect. 3.1, and $NL$ denotes the number of labels. One weighted color map example is shown in Fig. 7.

Figure 7: Visualization of the color weights of each color in one tagged web image. A color has a higher weight if it corresponds to fewer semantic parts.

After the pre-segmentation process described in Sect. 3.2, we obtain many groups or patches that compose the 3D model. For one C3im, through the projection correspondence, we can easily get the group id of each pixel; conversely, for each group $g_k$, we can get the corresponding pixels in C3im. We then simply count the pixel colors, denoted $gc_k$, and search the color-semantic map for matching labels. The number of matching labels may be more than one, and we assume that the greater this number, the lower the discriminability of the group. Thus we define $W(g_k)$ as follows:

$$W(g_k) = \beta - \sum_{i=1}^{NL} \mathrm{ismatched}(gc_k, V_{c_i}) / NL, \quad \mathrm{ismatched}(gc_k, V_{c_i}) = \begin{cases} 1, & \text{if } gc_k \cap V_{c_i} \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $\beta$ is a predefined parameter that controls the weight; in our experiments we set it to 1.5. Subsequently, let $P(l_i \mid g_k)$ denote the probability of assigning label $l_i$ to group $g_k$. Using the statistics of the projected pixels and the weights defined above, we calculate $P(l_i \mid g_k)$ as follows:

$$P(l_i \mid g_k) = \sum_{c_j \in gc_k \cap V_{c_i}} W(c_j) \times W(g_k) \times P(c_j \mid g_k), \quad P(c_j \mid g_k) = \frac{NP(c_j \mid g_k)}{NP(g_k)} \qquad (4)$$

where $NP(g_k)$ denotes the number of pixels belonging to $g_k$ in C3im, and $NP(c_j \mid g_k)$ denotes the number of pixels of color $c_j$ in $g_k$. Finally, we simply back-project the labels with maximum probability onto the 3D model. Due to the complexity of the 3D model and occlusions, some groups may have no corresponding labels; for these groups, we simply assign the label of their nearest neighbouring group.
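A compact sketch of the weighting and label probability computation of Eqs. (2)-(4) could look like the following; the data structures are assumptions that mirror the notation above.

```python
import numpy as np
from collections import Counter

def color_weight(c, color_semantic_map):
    """Eq. (2): W(c) is larger when color c maps to fewer part labels."""
    nl = len(color_semantic_map)
    hits = sum(1 for colors in color_semantic_map.values() if c in colors)
    return np.exp(1.0 - hits / nl)

def group_weight(group_colors, color_semantic_map, beta=1.5):
    """Eq. (3): groups whose colors match many labels are less discriminative."""
    nl = len(color_semantic_map)
    matched = sum(1 for colors in color_semantic_map.values()
                  if group_colors & colors)
    return beta - matched / nl

def label_probabilities(group_pixels, color_semantic_map):
    """Eq. (4): probability of each label l_i for one group g_k.

    `group_pixels` lists the plate colors of the pixels projected into the
    group; `color_semantic_map` maps each label l_i to its color set V_ci.
    """
    counts = Counter(group_pixels)
    total = sum(counts.values())
    gc = set(counts)
    wg = group_weight(gc, color_semantic_map)
    probs = {}
    for label, vc in color_semantic_map.items():
        probs[label] = sum(color_weight(c, color_semantic_map) * wg *
                           counts[c] / total for c in gc & vc)
    return probs  # the label with maximum probability is back-projected
```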
4. Experiments

In this section, we present various experiments to analyse our method.
[email protected]
Table 1: A dataset consisting of weakly tagged web images. The number of semantic parts and the size of each category are shown.

Category  #Parts  #Images     Category  #Parts  #Images
Bicycle   4       799         Stroller  4       605
Truck     3       874         Lamp      3       400
Bag       2       118         Cap       2       399
Guitar    3       398         Laptop    2       300
Mug       2       400         Pistol    3       413
Table     2       400         Tree      2       400

Table 2: Comparison between our method and Kalogerakis et al. [KAMC17] over seven object categories.

Category  #Shapes  ShapeBoost  ShapePFCN  Ours
Bag       38       93.1%       94.6%      89.5%
Cap       28       85.9%       94.5%      84.7%
Guitar    250      89.0%       91.8%      93.4%
Laptop    223      86.1%       95.3%      92.6%
Mug       92       94.9%       96.0%      96.7%
Pistol    138      88.2%       91.5%      81.3%
Table     250      74.5%       84.8%      85.0%
Dataset. The 2D color images we use are downloaded via Google image search. Specifically, for each object class we may use different description words to search various sub-categories. For example, when we search for truck images, we use truck and oil tank truck as key words, while for bicycles we use bicycle and tandem bicycle. In this way, through this semantic bridge, we build relatively robust image databases covering various object styles. For the 3D models, we select the corresponding classes from the ShapeNetCore dataset [YKC∗16] and the Projective Shape Analysis (PSA) dataset [WGW∗13]. Each model in a class is highly representative, and different models differ greatly in shape topology and appearance. Even though the number of models in some PSA categories is small, it is easy to transfer the results to other models of similar shape. The data details are shown in Table 1.
Evaluation and comparison. To verify the effectiveness of our method, we evaluate it on both the ShapeNetCore dataset and the Projective Shape Analysis (PSA) dataset. For each model category in the ShapeNetCore dataset, we select the same test shape split as [KAMC17]. Since the models in ShapeNetCore have no group information, we use a simple k-means method, clustering coordinates and normals, to obtain over-segmented patches instead of the voxel-based pre-segmentation of Sect. 3.2. Moreover, in the last step, we densely sample the triangle vertices according to the area of each triangle and then apply a point-level graph cut optimization similar to [vKFK∗14]. The labeling accuracy is shown in Table 2. With their finely annotated training sets, the compared methods obtain finer results than ours in some cases. Note, however, that we do not need any labeled training 3D shapes, which cost a lot of labour; with only weakly tagged web images, our method achieves comparable or even better results.
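The over-segmentation used for the ShapeNetCore models (k-means on coordinates and normals) can be sketched as follows; the patch count and feature weighting are assumed values, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def oversegment(face_centers, face_normals, n_patches=50, normal_weight=0.5):
    """Cluster mesh faces into patches using centroid coordinates and normals.

    A stand-in for the paper's over-segmentation on models without group
    information; returns one patch id per face.
    """
    feats = np.hstack([face_centers, normal_weight * face_normals])
    return KMeans(n_clusters=n_patches, n_init=10).fit_predict(feats)
```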
[email protected]
Figure 8: Visual comparison of typical bicycle and truck. The left results are produced by [WGW∗ 13] and the right ones are ours.
Figure 9: Labeling result on a natural object. The left is the input tree model and the right is the output labeling result.
Additionally, since our method is directly driven by web color images, the most similar recent approach is [WGW∗13]. Even though their approach relies on pixel-level labeled training images, our method uses only color-semantic mappings and achieves comparable or even better results at a lower cost. As shown in Fig. 8, since their training set is much smaller than the pictures available on the Internet, their method fails when it cannot find a good match in the dataset, such as for the tandem bicycle. More labeling results on various imperfect meshes are shown in Fig. 11.

Interestingly, we also test our method on natural objects. Natural objects mostly have uniform color-semantic mappings; for example, a tree always has green leaves and a dark red or dark yellow trunk. Therefore, we can simply assign the same tag links to the whole image set; the labeling result is shown in Fig. 9.

Image quantity analysis. In order to show the importance of rich image information, we analyse the image quantity from two aspects. On the one hand, web images contain various object styles, and using different keywords we can collect enough pictures to transfer. As shown in Fig. 12, the left image shows the labeling result using the image set retrieved with bicycle only, and the right one shows the result using the image set retrieved with bicycle and tandem bicycle. With our method, it is easy to control the image set through different keywords. On the other hand, to test the impact of the number of images on the results, we gradually increase the number of transferred images. As shown in Fig. 10, when there are few valid images, the labeling result is unstable and wrong; once the number of images increases to a certain extent, the result becomes correct and stable.

Labeling 3D point cloud models. To test the robustness of our method on different inputs, we perform a typical 3D point cloud labeling experiment.
Figure 10: Labeling results with gradually increasing input images. The color depth represents the probability that each component belongs to each semantic label.
Figure 11: More results of our method. The left are the input 3D models, the middle are sample matched web images, and the right are the output labeled 3D models.
Figure 13: Labeling 3D point cloud model. The left is the input point cloud, the middle is the labeling result produced by [vKFK∗ 14] and the right is the labeling result by our method.
Figure 15: Labeling results of sampled pistol and knife. Our method may fail to guarantee the labeling border of small parts and could not deal well with symmetrical parts.
For a point cloud input, we have no group information and parts may appear with holes. We therefore calculate the semantic probability of each projection pixel and replace the group-based transfer with pixel-level transfer. The labeling result shown in Fig. 13 proves that our method can be adapted to different inputs. Additionally, we also compare with [vKFK∗14], an unsupervised method based on approximate convexity. As Fig. 13 shows, with geometry constraints only, their method can only produce a set of pieces without semantic information.
Figure 14: Labeling 2D image. The left is the input image and the right is the labeling result by our method.
Figure 12: Labeling results with different keywords. The left shows the labeling result using bicycle as keyword, and the right one shows the result using bicycle and tandem bicycle as keywords.
Labeling 2D images. Our method builds a semantic bridge between 2D images and 3D models; one can go from one end to the other and vice versa. After obtaining 3D model labeling results, we can also use them to help 2D understanding. Given an input image, we search for matching 3D labeling results by projection matching and then transfer the semantic labels, as shown in Fig. 14. Since the input color image is mainly black, it is not easy to label the bicycle parts with traditional 2D methods; benefiting from the labeled 3D models, the semantic tags can be transferred in reverse.
Time cost. Our implementation runs on a single thread of a Xeon E5-2670 2.60 GHz CPU. In the image preparation stage, depending on the complexity of the image and the number of semantic labels, the time taken to process an image varies slightly. For example, processing a bicycle image takes about 3 seconds on average, including foreground extraction, image cutting, image normalization and manual interactive tagging, and it takes about 20-30 minutes to process a collection of 400 images; in the same time, only a few triangle-face-level 3D models could be labeled manually. Moreover, we do not need a time-consuming training process. The test stage includes three main procedures. The multiview alignment procedure takes dozens of seconds. The times of the color tag transfer and color tag weighting procedures vary with the number of matching images: about 2 seconds per matching image for transfer and about 1 second for weighting. In general, testing takes a few minutes.
[email protected]
Failure cases. Fig. 15 shows labeling results for a sampled pistol and knife. Our method can transfer the image color-semantics onto the 3D model surface well, but may fail to ensure accurate labeling borders because of alignment errors, such as at the trigger of the pistol in the figure. On the other hand, some 3D model parts are symmetrical, as with the knife shown in the figure; the alignment process cannot decide which direction is correct, which leads to a fuzzy labeling result.

5. Conclusion and discussion

We present an easy way to transfer image color-semantics onto a 3D model surface for 3D model labeling. Several recent learning based methods achieve very good results; however, constructing their training data is difficult and time-consuming. Leveraging the large number of web color images, we weakly tag them and build color-semantic mappings with simple clicks. Then, through multiview alignment and transfer, we weight each color-semantic map and obtain projection labeling results. Finally, we optimize the back-projected labels and obtain the final labeling results.

With the rapid growth of image datasets, how to apply the knowledge contained in images is very valuable. Our main idea is to give people an easy way to understand complex 3D models through web color images. Our method needs neither pixel-level labeling nor any training process, and through projection matching and transfer it accepts various input forms, such as manifold models, non-manifold models and even point cloud models. As a weak point, our method can only capture surface semantic labels; interior parts cannot be estimated. Another weakness is that, even though a large number of images can fix most transfer errors, without supervised learning there may still be wrongly transferred labels on some parts. Moreover, to save matching and transfer computation, we assume that the input models are given in an upright orientation. We argue that this is not a big obstacle compared to our goals.
On another hand, even though applying the original images directly is simple and convenient for the user, there are also some drawbacks. Firstly, most of the images we collect are relatively clean; a picture with a complex background is discarded during the tagging and matching process. Even though in our tests the number of matched images is sufficient, the method would be more robust if more images were involved. Secondly, not all real-world objects have obvious color-semantic correspondences or color boundaries. For example, 3D vases are usually segmented and labeled into four parts in traditional methods: handle, cup, top and base. In the real world, however, the cup, top and base of a vase usually appear in the same color, and our method may fail in such situations.

Last but not least, there are many interesting directions for future extensions. Currently our approach focuses on single-object part labeling; labeling multiple objects or even a whole scene would be more interesting. Another possibility for future work is to investigate the relationships between semantics and high-level image features, such as image textures and materials.

Acknowledgements. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61532003, 61421003 and 61502023).

References
[AZC∗12] Au O. K.-C., Zheng Y., Chen M., Xu P., Tai C.-L.: Mesh segmentation with concavity-aware fields. IEEE TVCG 18, 7 (2012), 1125–1134.
[BCG08] Ben-Chen M., Gotsman C.: Characterizing shape using conformal factors. In Proc. Eurographics 3DOR (2008), pp. 1–8.
[CGF09] Chen X., Golovinskiy A., Funkhouser T.: A benchmark for 3D mesh segmentation. ACM Trans. Graph. 28, 3 (2009), 73:1–73:12.
[GZC15] Guo K., Zou D., Chen X.: 3D mesh labeling via deep convolutional neural networks. ACM Trans. Graph. 35, 1 (2015), 3:1–3:12.
[HFL12] Hu R., Fan L., Liu L.: Co-segmentation of 3D shapes via subspace clustering. CGF 31, 5 (2012), 1703–1713.
[HKG11] Huang Q., Koltun V., Guibas L.: Joint shape segmentation with linear programming. ACM Trans. Graph. 30, 6 (2011), 125.
[HSG13] Huang Q.-X., Su H., Guibas L.: Fine-grained semi-supervised labeling of large shape collections. ACM Trans. Graph. 32, 6 (2013), 190:1–190:10.
[HWAG09] Huang Q.-X., Wicke M., Adams B., Guibas L.: Shape decomposition using modal analysis. CGF 28, 2 (2009), 407–416.
[KAMC17] Kalogerakis E., Averkiou M., Maji S., Chaudhuri S.: 3D shape segmentation with projective convolutional networks. In Proc. CVPR (2017).
[KHS10] Kalogerakis E., Hertzmann A., Singh K.: Learning 3D mesh segmentation and labeling. ACM Trans. Graph. 29, 4 (2010), 102:1–102:12.
[KLM∗13] Kim V. G., Li W., Mitra N. J., Chaudhuri S., DiVerdi S., Funkhouser T.: Learning part-based templates from large collections of 3D shapes. ACM Trans. Graph. 32, 4 (2013), 70.
[LYT11] Liu C., Yuen J., Torralba A.: SIFT Flow: Dense correspondence across scenes and its applications. IEEE TPAMI 33, 5 (2011), 978–994.
[QYSG17] Qi C. R., Yi L., Su H., Guibas L. J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. NIPS (2017).
[SFG17] Su H., Fan H., Guibas L.: PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. CVPR (2017).
[SOCG10] Skraba P., Ovsjanikov M., Chazal F., Guibas L.: Persistence-based segmentation of deformable shapes. In Proc. CVPRW (2010), pp. 45–52.
[SQX∗16] Shu Z., Qi C., Xin S., Hu C., Wang L., Zhang Y., Liu L.: Unsupervised 3D shape segmentation and co-segmentation via deep learning. CAGD 43 (2016), 39–52.
[SSS∗10] Shapira L., Shalom S., Shamir A., Cohen-Or D., Zhang H.: Contextual part analogies in 3D objects. IJCV 89, 2-3 (2010), 309–326.
[vKFK∗14] van Kaick O., Fish N., Kleiman Y., Asafi S., Cohen-Or D.: Shape segmentation by approximate convexity analysis. ACM Trans. Graph. 34, 1 (2014), 4.
[vKTS∗11] van Kaick O., Tagliasacchi A., Sidi O., Zhang H., Cohen-Or D., Wolf L., Hamarneh G.: Prior knowledge for part correspondence. CGF 30, 2 (2011), 553–562.
[vKXZ∗13] van Kaick O., Xu K., Zhang H., Wang Y., Sun S., Shamir A., Cohen-Or D.: Co-hierarchical analysis of shape structures. ACM Trans. Graph. 32, 4 (2013), 69:1–69:10.
[WGW∗13] Wang Y., Gong M., Wang T., Cohen-Or D., Zhang H., Chen B.: Projective analysis for 3D shape segmentation. ACM Trans. Graph. 32, 6 (2013), 192:1–192:12.
[XXS∗15] Xie Z., Xu K., Shan W., Liu L., Xiong Y., Huang H.: Projective feature learning for 3D shapes with multi-view depth images. CGF 34, 7 (2015), 1–11.
[YKC∗16] Yi L., Kim V. G., Ceylan D., Shen I., Yan M., Su H., Lu A., Huang Q., Sheffer A., Guibas L., et al.: A scalable active framework for region annotation in 3D shape collections. ACM Trans. Graph. 35, 6 (2016), 210.
[YSGG17] Yi L., Su H., Guo X., Guibas L.: SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In Proc. CVPR (2017).
[ZL02] Zhang D., Lu G.: An integrated approach to shape based image retrieval. In Proc. ACCV (2002).
[ZSL∗15] Zhang J., Sclaroff S., Lin Z., Shen X., Price B., Mech R.: Minimum barrier salient object detection at 80 fps. In Proc. ICCV (2015), pp. 1404–1412.
[ZZWC12] Zhang J., Zheng J., Wu C., Cai J.: Variational mesh decomposition. ACM Trans. Graph. 31, 3 (2012), 21:1–21:14.