Retrieving indoor objects: 2D-3D alignment using single image and interactive ROI-based refinement


Special Issue on CAD/Graphics 2017

Fuchang Liu a, Shuangjian Wang a, Dandan Ding a, Qingshu Yuan a, Zhengwei Yao a,∗, Zhigeng Pan a, Haisheng Li b

a Hangzhou Normal University, China
b Beijing Technology and Business University, China

Article info

Article history: Received 13 June 2017; Revised 18 July 2017; Accepted 18 July 2017; Available online xxx.

Keywords: 2D-3D alignment; 3D model retrieval; ROI-based refinement; 3D indoor scenes

Abstract

Given a single indoor image, this paper proposes an automatic retrieval system that estimates the best-matching 3D models with consistent style and pose. To support this system, we combine a deep-CNN-based object detection approach with a deformable-part-based alignment model. The key idea is to cast the 2D-3D alignment problem as part-based cross-domain matching. We also provide an interactive refinement interface that allows users to browse models based on similarities and differences between shapes in user-specified regions of interest (ROIs). We demonstrate the ability of our system on numerous examples.


1. Introduction


Virtual interior design has recently received a great deal of attention and has been widely applied in VR/AR, gaming, and robotics. In general, an interior design pipeline consists of several components, such as object selection [1,2], layout estimation [3,4], and scene optimization [5,6], all of which are non-trivial. Recent works on indoor scenes have focused on object selection using 3D model searches based on global or local similarities; encouragingly, some works [7] enable browsing 3D models based on ROI-based similarities between shapes. Compared with the ease with which a cell phone can take a photo or search for web images, it is inconvenient to acquire 3D models of the same objects that populate our daily lives. To address this issue, object alignment is applied to object selection. However, object alignment is one of the greatest challenges in computer vision. Object alignment matches an image to a 3D model, and its quality significantly affects the follow-up components of scene modelling. Reliable matching between real photographs and CAD models is difficult because real photographs can differ greatly from synthesized views of the same 3D model, e.g., in texture, materials, colour, illumination, or geometry. We pose our problem as virtual indoor object retrieval from a single image.




∗ Corresponding author. E-mail address: [email protected] (Z. Yao).

© 2017 Published by Elsevier Ltd.

Given an image, we first detect all the objects in the image and align them to their respective 3D CAD models. In addition, we allow users to refine the alignment results using an ROI-based similarity search. Our main contribution is an interactive system that produces CAD models (for certain categories of indoor objects) from a single photo. Inspired by [8], we phrase the retrieval problem as establishing correspondences between 2D photographs and computer-generated views of 3D models. We extend the original work of [8] to multiple classes and integrate the benefits of ROI-based similarity searches. Our approach can be considered a marriage between part-based 2D-3D alignment and fuzzy correspondences across 3D models. Like part-based alignment, we represent objects using the star model to enforce spatial constraints. We use the YOLO [9] deep network for multiple-object detection as the preprocessing stage of the part-based alignment; YOLO also speeds up the alignment because it can quickly locate the root filter of the star model. Furthermore, we present an interactive exploration tool that allows users to refine the alignment results based on similarities between shapes in user-specified ROIs. To support this refinement, we use fuzzy correspondences between points on the 3D shapes for the ROI-based similarity searches. We provide retrieval results for various scene images to demonstrate the effectiveness of our system.

The outline of this paper is as follows. After discussing related works in Section 2, we give a system overview in Section 3. The part-based retrieval algorithm and the ROI-based refinement are described in Sections 4 and 5, respectively. Finally, we evaluate the effectiveness of our system in Section 6 and conclude in Section 7.




2. Related work


Many techniques have been developed for image-based shape retrieval. The first was by Lawrence Roberts [10], who presented a system that inferred a 3D scene from a single photo. Owing to the limited computing capacity of the time, Roberts could only demonstrate his method on simple synthetic scenes; many researchers have since tried to extend his approach to realistic images and scenes. A number of works have cast object retrieval as matching the input photo against a dataset of 2D views of objects from different viewpoints [11–13]. In this line of approaches, the 2D views are typically represented using low-level features such as SIFT [14] and HOG [15]. These approaches perform well when identifying the category of a query object, but fall short when identifying its most similar style and most likely viewpoint. For indoor scenes in particular, most items have huge intra-class variation; for instance, there are thousands of different types of tables (round or square, with short or long legs). Malisiewicz et al. [16] proposed the Exemplar-SVM approach, which trains a separate classifier for each exemplar instead of a single complex category classifier. Although Exemplar-SVM can distinguish objects of different styles within the same category, it requires training a large number of exemplar classifiers to represent categories with high intra-class variation. Furthermore, it models an object with a single global template, so it cannot handle small deformations between a query object and the training template. To address this issue, Lim et al. [17] trained an LDA classifier for each local patch and identified discriminative patches for 2D-3D alignment. Aubry et al. [8] developed an exemplar-based 3D category representation with part-based alignment; however, their method only applies to a single category rather than multiple-category scenarios. Inspired by [8], Izadinia et al. [18] used deep features instead of HOG features to perform CAD model alignment. Other recent works [19,20] focused on object viewpoint estimation from 2D images using manual annotation or render-based image synthesis, which can assist in aligning 3D models to 2D images. Liu et al. [21] reconstructed an indoor scene from a single indoor image using normal inference and edge features inferred from the scene geometry. Different from these works, this study focuses on cross-domain matching based on mid-level discriminative visual features learned from a dataset of 3D models with a great number of synthesized views.

Another closely related line of research is 3D model-based retrieval and sketch-based retrieval. There are several approaches to 3D model-based retrieval, such as keyword-based retrieval [22], example-based retrieval [1,23], and context-aware retrieval [24]. A keyword-based search alone is not promising because most datasets are insufficiently annotated. Example-based retrieval typically identifies the most similar shapes according to global or local features; a number of techniques [7,25,26] improve on traditional approaches, e.g., by analysing shape collections and producing consistent correspondences, with which users can browse collections based on similarities and differences between shapes in ROIs. However, example-based retrieval requires a good example, which is often not available. Context-aware retrieval addresses this by inferring objects from the scene context, but its success heavily depends on the "context", which is often ambiguous. Recent data-driven approaches [27,28] define context based on pairwise and spatial relationships between objects in the dataset, but they require a large dataset to extract fine-grained geometric and spatial features.


While 3D models yield a more intuitive and interactive display, images are much easier to acquire and provide direct information on object appearance.

Sketch-based retrieval is often used in conjunction with example-based retrieval. One of the earliest sketch-based retrieval systems was presented by [29], which lets users refine an initial keyword-based search using a sketch of their desired view. Funkhouser et al. [1] subsequently proposed an image-based approach, and Chen et al. [23] developed a system for example-based retrieval that also supports sketch-based queries. Daras and Axenopoulos [30] developed a unified framework that supports both sketch-based and example-based retrieval. However, their sketch-based retrieval was limited by the quality of the drawings: most people have limited drawing skills, and contours may deviate significantly from the original shape. More recently, Eitz et al. [31] developed a way to rectify large local and global deviations using new descriptors based on the bag-of-features approach. Xu et al. [32,33] successfully performed model design and scene reconstruction by jointly processing sketched objects.

Some recent works [34,35] have considered object reconstruction from a single-view image. The idea behind these approaches is to jointly analyse the images along with a collection of existing 3D models; joint analysis leads to reasonable reproduction of object appearance based on reliable image-shape correspondences. In this paper, we investigate a complementary problem and focus on retrieving 3D models using images.

3. System overview

Our approach to retrieving 3D models of indoor objects from an image (see Fig. 1) is based on detecting objects with convolutional neural networks (CNNs), matching objects from 2D views to 3D models by part-based alignment, and optimizing the retrieval results by ROI-based refinement. The approach involves the following steps. We first detect multiple indoor objects in a given image, such as beds, cabinets, chairs, shelves, sofas, and tables, using the YOLO network, a state-of-the-art object detection method. With the category and location provided by YOLO, we align each detected 2D object with the most similar 3D CAD model by comparing its appearance with hundreds of 3D models rendered from many different angles, using mid-level discriminative features learned from the rendered views. Finally, we optimize the retrieval results through user interaction based on ROI-based refinement: when the user paints an ROI, our system returns the most similar shapes from the database. In the following sections, we describe the two technical components in detail: part-based retrieval and ROI-based refinement.
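To make the pipeline concrete, the following Python sketch outlines the three stages. It is illustrative only: the `detector`, `model_db`, and `align` interfaces are hypothetical stand-ins for the components described above, injected as callables, not part of any released implementation.

```python
# Illustrative end-to-end sketch of the retrieval pipeline.
# `detector`, `model_db`, and `align` are hypothetical stand-ins for
# YOLO, the rendered-view database, and part-based 2D-3D alignment.

def retrieve_models(image, detector, model_db, align, top_k=5):
    """Detect objects, align each to rendered 3D views, keep the top-k models."""
    results = []
    for det in detector(image):              # det: category, box, probability
        views = model_db[det.category]       # pre-rendered views of 3D models
        scored = sorted(align(image, det.box, views),
                        key=lambda m: m.score, reverse=True)
        results.append((det, scored[:top_k]))  # top-k candidates per object
    return results
```

The ROI-based refinement stage then reranks each object's candidate list interactively, as sketched in Section 5.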


4. Part-based retrieval


4.1. Preprocessing stage


This step of our retrieval system detects the locations of objects of interest in the image. While detectors for any object category could be trained, we focus on common furniture items: beds, cabinets, chairs, shelves, sofas, and tables. Many detection approaches re-purpose classifiers to perform detection by sliding a window over the entire image at various locations and scales, which is very time consuming; deformable part models (DPMs) [36] are a typical example. More recent approaches such as R-CNN and Faster R-CNN [37,38] combine region proposals and object classification with shared convolutional layers. We employ the YOLO system to predict what objects are present and where they are. YOLO is extremely fast because it avoids sliding a window over the image.


Fig. 1. Overview of our retrieval framework. Given an input image, we automatically detect objects in the image and return 3D models matching the style and viewpoint of these detected objects. We then use an ROI-based search to improve the retrieval results.


YOLO frames detection as a regression problem, mapping image pixels straight to bounding-box coordinates and class probabilities. In our system, YOLO detects multiple objects and outputs their locations. Rather than using YOLO's output directly, we integrate it tightly with the star models: in the 2D-3D alignment stage, we place root filters at the locations detected by YOLO, which improves both detection precision and speed. Additionally, the class probabilities allow us to perform category-specific 2D-3D alignment. For instance, if we know the object is a chair, we align the image region containing the chair with 3D chair models only. Otherwise, we would have to try all object categories and calibrate matching scores across categories, as is done with Exemplar-SVM [39], because matching scores are computed from discriminative features learned in independent training procedures. To obtain high detection precision, we trained YOLO on thousands of images selected from the SUN database [40] and the ImageNet dataset [41]. More details are given in Section 6.
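As an illustration of how the detections seed the alignment stage, the sketch below dispatches each detection to a category-specific bank of part-based detectors and places the star model's root filter at the detected location. The data layout (boxes as (x, y, w, h) tuples, per-class probability dictionaries) is an assumption for exposition.

```python
# Sketch (assumed interfaces): route each YOLO detection to the detectors
# trained on that category's rendered views, seeding the root filter there.

def seed_root_filters(detections, model_banks, prob_threshold=0.5):
    seeds = []
    for box, class_probs in detections:          # box = (x, y, w, h)
        category = max(class_probs, key=class_probs.get)
        if class_probs[category] < prob_threshold:
            continue                             # skip uncertain detections
        bank = model_banks[category]             # category-specific detectors
        root_center = (box[0] + box[2] / 2.0, box[1] + box[3] / 2.0)
        seeds.append((category, root_center, box, bank))
    return seeds
```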


4.2. 2D-3D alignment


In this section, we describe our approach to matching objects in the image with 3D CAD models, considering both appearance and pose. We downloaded 3D models from ShapeNet [42], an on-line repository of publicly available models, and rendered each model with colour and texture on a white background from 62 viewpoints sampled over the upper half of the viewing sphere centred on the model.
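We fix the number of views at 62 but do not prescribe the exact sampling pattern here; the following sketch shows one plausible choice, a regular azimuth-elevation grid over the upper hemisphere (three rings of 20 azimuths plus two near-top views).

```python
import numpy as np

# A plausible sketch of sampling 62 camera positions on the upper viewing
# hemisphere; the exact pattern (ring elevations, azimuth counts) is an
# assumption, chosen so the counts sum to 62 (20 + 20 + 20 + 2).

def upper_hemisphere_viewpoints():
    views = []
    for elev_deg, n_azim in [(10, 20), (30, 20), (50, 20), (75, 2)]:
        elev = np.radians(elev_deg)
        for k in range(n_azim):
            azim = 2 * np.pi * k / n_azim
            # Camera position on the unit sphere, looking at the origin.
            views.append((np.cos(elev) * np.cos(azim),
                          np.cos(elev) * np.sin(azim),
                          np.sin(elev)))
    return np.array(views)                       # shape (62, 3)
```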

There is an overlap between 2D-3D alignment and object detection: both pursue the best match between a test image and object models of various categories. Take DPMs [36] for example, a classical object detection model that computes an overall score for its star model:

$$\mathrm{score}(p_0) = F_0 \cdot \phi(H, p_0) + \sum_{i=1}^{n} \max_{p_i}\, \beta \cdot \psi(H, p_i) \quad (1)$$

The star model for an object is defined by (n + 1) components, where F_0 is a root filter, H is a feature pyramid, and p_i = (x_i, y_i, l) specifies a position (x_i, y_i) in the l-th level of the pyramid. φ(H, p_0) denotes the feature vector obtained from H inside the w × h subwindow centred at p_0, so F_0 · φ(H, p_0) is the score of placing the root filter at p_0. Thus score(p_0) is an overall score for each root location under the best possible placements of the parts.
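The following numpy sketch makes Eq. (1) concrete for a single pyramid level. The quadratic deformation feature (dx, dy, dx², dy²) is the standard DPM choice and is assumed here, with the deformation weights treated as a per-part cost vector (β · ψ in Eq. (1) then corresponds to the filter response minus this cost).

```python
import numpy as np

# Minimal sketch of the star-model score of Eq. (1), one pyramid level.
# Filter responses are assumed precomputed as 2D maps (filter correlated
# with the HOG features); the (dx, dy, dx^2, dy^2) deformation feature
# follows standard DPM and is an assumption of this sketch.

def star_score(root_response, part_responses, anchors, defo_weights, p0):
    """Score of placing the root at p0 = (x0, y0), parts at best placements."""
    x0, y0 = p0
    score = root_response[y0, x0]                # F0 . phi(H, p0)
    for resp, (ax, ay), d in zip(part_responses, anchors, defo_weights):
        best = -np.inf
        h, w = resp.shape
        for y in range(h):                       # exhaustive part search
            for x in range(w):
                dx, dy = x - (x0 + ax), y - (y0 + ay)
                cost = d @ np.array([dx, dy, dx * dx, dy * dy])
                best = max(best, resp[y, x] - cost)  # response minus cost
        score += best                            # max over placements p_i
    return score
```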

Table 1
Statistics of images used in training.

Category    # SUN database    # ImageNet
Bed         1665              346
Cabinet     4491              692
Chair       8363              469
Shelf       376               580
Sofa        376               506
Table       4577              1250

Table 2
Detection performance (per-class average precision, measured at IoU ≥ 0.5 overlap).

Category    # test images    Test dataset           Precision
Bed         2539             MSCOCO                 80.83
Cabinet     404              ImageNet               81.39
Chair       2443             VOC 2012               66.88
Shelf       114              SUN + VOC 2012         64.04
Sofa        3700             VOC 2012 + MSCOCO      72.07
Table       8378             MSCOCO                 60.88

Additionally, ψ(H, p_i) = (φ(H, p_i), φ_d(dx_i, dy_i)), where φ_d(dx_i, dy_i) is the deformation cost, and β is a vector of part filter and deformation parameters; β · ψ(H, p_i) is the score of the i-th part filter at p_i. 2D-3D alignment can also be understood through Eq. (1), but it becomes a cross-domain matching problem: β is learned from synthesized views of 3D models, while ψ(H, p_i) is constructed from real photographs. Synthesized training images differ greatly from real photographs, especially in texture, lighting, background, and occlusion, which makes cross-domain matching challenging. To alleviate the differences in texture and lighting, we extract mid-level discriminative features by performing LDA over HOG features to represent both synthesized images and real photographs. Furthermore, global matching over the entire object is unreliable owing to background differences and occlusion; to improve reliability, we perform part-based alignment using the star model.

Building part-based detectors. As mentioned above, β is learned from synthesized training data. There are several ways to learn β. DPM learns β using a latent SVM, but optimizing a latent SVM is slow and requires searching for the latent values of positive examples. Aubry et al. [8] improved the learning process by using the LDA version of Exemplar-SVM, called exemplar LDA.


Fig. 2. Example models used in this paper.


We rewrite the 2D-3D alignment in the form of exemplar LDA, as in Eq. (2). ω_{q_i} can be learned by training an exemplar classifier for the i-th component (i.e., the root or each part) using a single positive patch q with label y_q = 1 and a large number of negative patches s_i with labels y_i = −1.

$$\mathrm{score}_E(p_0) = \sum_{i=0}^{n} \max_{x}\, \omega_{q_i} \cdot \phi(x) \quad (2)$$

$$\omega_{q_i} = \Sigma^{-1}\left(\phi(q) - \mu_n\right) \quad (3)$$

Similarly, score_E(p_0) gives an overall score for placing the root filter at p_0 with the other parts located at their best possible placements. Exemplar LDA can reach object detection accuracy similar to that of expensive iterative SVMs [43,44]. Moreover, computing exemplar LDA via Eq. (3) is quite simple: μ_n = (1/N) Σ_{i=1}^{N} φ(s_i) and Σ = (1/N) Σ_{i=1}^{N} (φ(s_i) − μ_n)(φ(s_i) − μ_n)^T, where φ(·) denotes the HOG descriptor. ω_{q_i} · φ(x) reflects the similarity between the positive patch q for the i-th part filter and a test patch x. We represent both real photographs and synthesized views by HOG descriptors at multiple spatial scales, thereby capturing object boundaries reliably; exemplar LDA further enhances salient boundaries by re-weighting the HOG descriptor in the corresponding regions. However, exemplar LDA relaxes the deformation-cost constraint of the star model, which lowers average precision: it produces a high rate of false positives by confusing objects with object-like structures in complex backgrounds. To address this issue, we locate the root filter of the star model by CNN-based detection and constrain all the part filters around the root filter.
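A minimal numpy sketch of the exemplar-LDA training of Eqs. (2) and (3): the background statistics μ_n and Σ are estimated once from many negative HOG patches, after which each detector is a single whitened difference.

```python
import numpy as np

# Sketch of exemplar-LDA training (Eq. (3)): shared background statistics,
# then one linear solve per exemplar detector.

def background_stats(neg_feats):
    """neg_feats: (N, d) array of HOG descriptors from negative patches."""
    mu = neg_feats.mean(axis=0)
    centered = neg_feats - mu
    sigma = centered.T @ centered / len(neg_feats)
    sigma += 1e-3 * np.eye(sigma.shape[0])       # regularize for invertibility
    return mu, sigma

def exemplar_lda_weights(phi_q, mu, sigma):
    """One positive HOG descriptor phi_q -> detector weights, Eq. (3)."""
    return np.linalg.solve(sigma, phi_q - mu)    # Sigma^{-1} (phi(q) - mu_n)

def score_patch(w, phi_x):
    """Similarity between the exemplar and a test patch x (one term of Eq. (2))."""
    return float(w @ phi_x)
```

Because μ_n and Σ are shared across all exemplars, training a new detector costs only one linear solve, which is what makes per-view, per-part detectors affordable.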

We express our model in Eq. (4):

$$\mathrm{score}_E(p_0) = \Pr(p_0) + \sum_{i=0}^{n} \left(\max_{x}\, \omega_{q_i} \cdot \phi(x)\right) \cdot I_i \quad (4)$$

Pr(p_0) is the probability computed by YOLO, which gives the location of the root filter centred at p_0. We define a confidence score I_i ∈ [0, 1] as the area of the intersection between the i-th filter's box and the object's bounding box, divided by the area of the filter's box. The term I_i penalizes part filters that have a large displacement from the root filter.

Part-based matching. At test time, we locate the root of the star model and perform a local search around the root patch. YOLO provides the location of the root filter, and all the filters of the star model are applied to the test image in parallel under spatial constraints. The star model allows small spatial deformations, which enables handling objects with large intra-class variation. In DPM [36] and exemplar LDA [8], the star model is evaluated in a sliding-window fashion over the entire image; with the help of YOLO, we only slide windows inside the bounding box of the detected object, because YOLO directly predicts object bounding boxes by regression. Our matching process is therefore faster than DPM and exemplar LDA. Finally, the 2D-3D alignment is achieved by searching for the patches with the maximum matching score around the root patch using Eq. (4). Note that the matching score depends on the patch q, so calibrating the matching scores across different filters is important; calibration can be done with a linear affine transformation [8] or logistic fitting [39]. For each rendered 3D view, we choose the patch q as the single positive example for each filter from the patches having the highest response after non-maximum suppression, and measure the similarity for the i-th filter between the positive patch q and a test patch x using ω_{q_i} · φ(x). We found that our matching model works well, preserving the viewpoint and maintaining consistent style.
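The sketch below illustrates Eq. (4) with the overlap term I_i as defined above (intersection of the filter's box with the object's bounding box, divided by the filter box's own area). The interfaces are illustrative; part scores are assumed precomputed as the per-filter maxima of ω_{q_i} · φ(x).

```python
import numpy as np

# Sketch of the constrained score of Eq. (4): parts drifting outside the
# detected object's bounding box are down-weighted by the overlap ratio.

def overlap_ratio(filter_box, object_box):
    """Intersection(filter, object) / area(filter); boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(filter_box[0], object_box[0]), max(filter_box[1], object_box[1])
    ix2, iy2 = min(filter_box[2], object_box[2]), min(filter_box[3], object_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (filter_box[2] - filter_box[0]) * (filter_box[3] - filter_box[1])
    return inter / area if area > 0 else 0.0

def constrained_score(yolo_prob, part_scores, part_boxes, object_box):
    """Eq. (4): YOLO root probability plus overlap-weighted part similarities."""
    score = yolo_prob                            # Pr(p0) from the detector
    for s, box in zip(part_scores, part_boxes):  # s = max_x w_qi . phi(x)
        score += s * overlap_ratio(box, object_box)
    return score
```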


Fig. 3. Our detections and top five alignments.


Fig. 4. Detections and alignments of Exemplar LDA.


Fig. 5. Result of user study evaluating the quality of the alignment and style.

Table 3
Timing comparison between exemplar LDA and the proposed method.

Method          Timing (s)
Exemplar LDA    69.82
Ours            28.47


5. ROI-based refinement


With part-based matching, our system outputs n candidate models for each detected object. However, users may not be satisfied with these results, because some focus on specific regions of interest. We therefore introduce an interactive ROI-based refinement to improve the retrieval results. To support this functionality, we must consider both similarity and alignment between shapes for arbitrary user-specified regions, and we use fuzzy correspondences to establish these geometric correspondences. Fuzzy correspondences were first proposed by Kim et al. [7], who used fuzzy correspondence values in ROI regions as weights to compute the similarity between an aligned target shape and an example shape; users can then interactively refine the results by reranking the retrieved models based on inter-model similarity. More specifically, in our system, we sample points uniformly on the models' surfaces and map them to an embedded space in which corresponding points lie close to each other; the fuzzy correspondence function f is this embedding. To compute the fuzzy correspondences, we first create an initial alignment graph G_0 in which the models are vertices and all pairwise alignments are edges. We compute f_0 using the eigenvectors of the correspondence matrix C_0 derived from G_0, define a consistency score for each edge using f as weights, and then update the alignment graph by pruning noisy edges whose scores fall below a threshold. We repeat updating f_i and G_i several times; the optimized alignment graph G finally yields the fuzzy correspondences f, which provide consistent point-to-point correspondences between all models. More details can be found in the original paper [7]. Our system allows users to select an arbitrary ROI on an example shape and explore the rest of the models with similar


corresponding regions at interactive rates. Users can then refine the retrieval results by reranking the candidate models according to the ROIs.
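The following sketch illustrates such ROI-based reranking under the assumption that every model carries uniformly sampled surface points with embedded coordinates f(p) from the fuzzy correspondence computation [7]; the chamfer-style distance used here is one reasonable choice of region similarity, not the only one.

```python
import numpy as np

# Sketch of ROI-based reranking in the fuzzy-correspondence embedding.
# Assumes precomputed embedded coordinates per sampled surface point, so
# corresponding points across models land near each other in the embedding.

def roi_similarity(query_pts_emb, candidate_pts_emb):
    """Chamfer-style distance between ROI points in the embedded space."""
    d = np.linalg.norm(query_pts_emb[:, None, :] -
                       candidate_pts_emb[None, :, :], axis=2)
    return d.min(axis=1).mean()                  # lower = more similar

def rerank_by_roi(example_emb, roi_mask, candidates_emb):
    """Rerank candidate models by similarity to the user-painted ROI."""
    roi_pts = example_emb[roi_mask]              # embedded points inside ROI
    dists = [roi_similarity(roi_pts, c) for c in candidates_emb]
    return np.argsort(dists)                     # best-matching models first
```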


6. Results


In this section, we evaluate the retrieval results of our system against exemplar LDA.

We first evaluated the detection accuracy of our YOLO-based root filter detection. The training dataset was selected from the SUN database and the ImageNet dataset; statistics of the training images are given in Table 1. We tested YOLO on PASCAL VOC 2012 [45] and the MSCOCO dataset [46]. As Table 2 shows, our precision is close to that of state-of-the-art approaches.

For the 3D models, we manually selected 100 publicly available models per category from ShapeNet [42]; the selected models are shown in Fig. 2. We chose representative, high-quality 3D models to explicitly cover the shape variation of each object category. All experiments were performed on a four-core Intel Core i7 3.40 GHz machine with 12 GB of memory and a GeForce GTX 1060 GPU with 6 GB of memory.

Comparison with exemplar LDA. Here, we compare our method with exemplar LDA to demonstrate its effectiveness and efficiency. Our method achieves better accuracy in both detection and alignment: exemplar LDA easily confuses objects with object-like structures in the background, and as shown in Figs. 3 and 4, it produces many false positives while our method predicts bounding boxes correctly. Note that we filtered out patches smaller than 100 × 100 pixels to reduce small spurious noisy patches.

We performed a user study to evaluate the quality of the alignment and the returned object style. We randomly picked ten input indoor images per category from the test datasets in Table 2 and generated alignment and style results with both our method and exemplar LDA. Ten computer science graduate students were asked to label each alignment as "Good" (the predicted pose is very similar to the ground truth) or "Bad" (the alignment is incorrect), and each style as "Acceptable" (the predicted style is an exact or partial match) or "Unacceptable" (no style match). The results, reported in Fig. 5, make it clear that our method significantly outperformed the baseline exemplar LDA, owing to the high detection precision of YOLO and the deformation constraint in the star model.


Fig. 6. ROI-based refinement. Our system outputs the top five aligned models; users can then paint ROIs and explore models according to the geometric similarity between the selected region of an example shape and the other models.

Fig. 7. Failures of our algorithm caused by occlusion.


Computational cost. To compare the speed of our method against exemplar LDA, we selected images containing one to three objects from VOC 2012. Table 3 reports the average computation times for the detection and alignment steps. At an average image size of 500 × 300 pixels, our approach required 28.47 s versus 69.82 s for exemplar LDA, i.e., it was roughly 2.5 times faster: rapidly locating root filters with YOLO reduces the sliding-window area and speeds up detection. Note that the time cost of YOLO itself is negligible, because it is extremely fast (< 1 s).

ROI-based refinement. In this experiment, we verify the effectiveness of ROI-based refinement. For a given query image, our part-based retrieval outputs the top n aligned 3D models. In principle n could be large, but presenting many choices is cumbersome for users, so we show only the best candidate or a small group, such as the top five. ROI-based refinement then helps users interactively rerank the top five aligned models according to ROIs. In Fig. 6, for example, our system output the top five aligned models; these were imperfect owing to the partial occlusion of the chair by the table.


In this case, we provide interactive ROI-based refinement: users paint an ROI and identify the model they want within the shape collection. In this experiment, we built the shape collection from the top 20 aligned models retrieved by our part-based matching; computing the fuzzy correspondences took less than 100 s for 20 models.

Finally, we visualized our results: most of the furniture was detected and represented by well-matched CAD models with proper positioning. Note that some results do not exactly match the input photograph because the specific style is absent from our dataset; we used a relatively smaller model dataset than [8]. While our results are not perfect, accuracy could be improved by adding more models, although alignment would take more time on a larger dataset.

Failure cases. Our failures were mainly caused by strong occlusion, as shown in Fig. 7, where our system achieved poor results on table detection and alignment. Object detection and alignment under strong occlusion is an interesting direction for future work.


7. Conclusion


This paper presented an automatic system that aligns 3D models with 2D objects detected from a single image by a deep CNN (YOLO). We framed 2D-3D alignment as deformable part-based matching and integrated YOLO tightly with deformable part-based models, avoiding an exhaustive sliding-window search. To alleviate the differences between synthesized training images and real photographs, we represented both photographs and synthesized views by multi-scale HOG descriptors and enhanced the salient boundary regions using exemplar LDA. Our approach handles objects with huge intra-class shape variation while preserving consistent viewpoint and style. In addition, we provided an interactive ROI-based retrieval refinement function, which offers users a robust and efficient exploration tool for large model collections. We demonstrated efficient detection and alignment on various datasets. The output of our system could benefit other applications, such as 3D scene understanding and room modelling. Future efforts may improve the discriminative features via deep CNNs instead of HOG and speed up part-based alignment by parallelizing on GPUs. Improving detection accuracy by considering context information, especially for strongly occluded scenes, is another promising research direction.


Acknowledgements


This work was co-supported by NSFC under Grant No. 61502133 and Zhejiang Provincial NSFC under Grant No. LY16F020029. In addition, this work was partially supported by Zhejiang Provincial NSFC under Grant Nos. LQ15F010001 and LY13F020050, and by the fund of the Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology & Business University (BTBU). Zhigeng Pan and Zhengwei Yao are the co-corresponding authors of this paper.


Supplementary material


Supplementary material associated with this article can be found in the online version at doi:10.1016/j.cag.2017.07.029.


References


[1] Funkhouser T, Min P, Kazhdan M, Chen J, Halderman A, Dobkin D, et al. A search engine for 3D models. ACM Trans Gr 2003;22(1):83–105.


[2] Tangelder JW, Veltkamp RC. A survey of content based 3D shape retrieval methods. Multimedia Tools Appl 2008;39(3):441–71.
[3] Merrell P, Schkufza E, Koltun V. Computer-generated residential building layouts. ACM Trans Gr 2010;29(6):181.
[4] Merrell P, Schkufza E, Li Z, Agrawala M, Koltun V. Interactive furniture layout using interior design guidelines. ACM Trans Gr 2011;30(4):87:1–87:10.
[5] Yu L-F, Yeung S-K, Tang C-K, Terzopoulos D, Chan TF, Osher SJ. Make it home: automatic optimization of furniture arrangement. ACM Trans Gr 2011;30(4):86.
[6] Chen X, Li J, Li Q, Gao B, Zhou D, Zhao Q. Image2scene: transforming style of 3D room. In: Proceedings of the 23rd annual ACM conference on multimedia. ACM; 2015. p. 321–30.
[7] Kim VG, Li W, Mitra NJ, DiVerdi S, Funkhouser T. Exploring collections of 3D models using fuzzy correspondences. ACM Trans Gr 2012;31(4):54.
[8] Aubry M, Maturana D, Efros AA, Russell BC, Sivic J. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In: Proceedings of the CVPR; 2014. p. 3762–9.
[9] Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection; 2015. arXiv:1506.02640 [cs.CV].
[10] Roberts LG. Machine perception of three-dimensional solids. Massachusetts Institute of Technology; 1963. Ph.D. thesis.
[11] Glasner D, Galun M, Alpert S, Basri R, Shakhnarovich G. Viewpoint-aware object detection and pose estimation. In: Proceedings of the ICCV; 2011.
[12] Xiao J, Russell B, Torralba A. Localizing 3D cuboids in single-view images. In: Proceedings of the NIPS; 2012.
[13] Satkin S, Lin J, Hebert M. Data-driven scene understanding from 3D models. In: Proceedings of the BMVC; 2012.
[14] Lowe D. Distinctive image features from scale-invariant keypoints. Int J Comput Vis 2004;60(2):91–110.
[15] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the CVPR; 2005.
[16] Malisiewicz T, Gupta A, Efros AA. Ensemble of exemplar-SVMs for object detection and beyond. In: Proceedings of the ICCV; 2011.
[17] Lim JJ, Pirsiavash H, Torralba A. Parsing IKEA objects: fine pose estimation. In: Proceedings of the ICCV; 2013.
[18] Izadinia H, Shan Q, Seitz SM. IM2CAD; 2016. arXiv:1608.05137 [cs.CV].
[19] Xiang Y, Mottaghi R, Savarese S. Beyond PASCAL: a benchmark for 3D object detection in the wild. In: Proceedings of the IEEE winter conference on applications of computer vision (WACV); 2014.
[20] Su H, Qi CR, Li Y, Guibas LJ. Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views. In: Proceedings of the IEEE international conference on computer vision (ICCV); 2015.
[21] Liu M, Guo Y, Wang J. Indoor scene modeling from a single image using normal inference and edge features. Vis Comput 2017:1–14.
[22] Min P, Kazhdan M, Funkhouser T. A comparison of text and shape matching for retrieval of online 3D models. In: Lecture notes in computer science, vol. 3232; 2004.
[23] Chen D-Y, Tian X-P, Shen Y-T, Ouhyoung M. On visual similarity based 3D model retrieval. Comput Gr Forum 2003;22(3):223–32.
[24] Fisher M, Hanrahan P. Context-based search for 3D models. ACM Trans Gr 2010;29(6):182.
[25] Huang Q, Wang F, Guibas L. Functional map networks for analyzing and exploring large shape collections. ACM Trans Gr 2014;33(4):36.
[26] Yi L, Kim VG, Ceylan D, Shen I-C, Yan M, Su H, et al. A scalable active framework for region annotation in 3D shape collections. SIGGRAPH Asia; 2016.
[27] Chaudhuri S, Koltun V. Data-driven suggestions for creativity support in 3D modeling. ACM Trans Gr 2010;29(6):183.
[28] Fisher M, Savva M, Hanrahan P. Characterizing structural relationships in scenes using graph kernels. ACM Trans Gr 2011;30(4):34.
[29] Loffler J. Content-based retrieval of 3D models in distributed web databases by visual shape information. In: Proceedings of the IEEE international conference on information visualization; 2000. p. 82–7.
[30] Daras P, Axenopoulos A. A 3D shape retrieval framework supporting multimodal queries. Int J Comput Vis 2010;89(2):229–47.
[31] Eitz M, Richter R, Boubekeur T, Hildebrand K, Alexa M. Sketch-based shape retrieval. ACM Trans Gr 2012;31(4):31.
[32] Xie X, Xu K, Mitra NJ, Cohen-Or D, Su Q, Gong W, et al. Sketch-to-design: context-based part assembly. Comput Gr Forum 2013;32(8):233–45.
[33] Xu K, Chen K, Fu H, Sun W-L, Hu S-M. Sketch2Scene: sketch-based co-retrieval and co-placement of 3D models. ACM Trans Gr 2013;32(4):123:1–123:12.
[34] Su H, Huang Q, Mitra NJ, Li Y, Guibas L. Estimating image depth using shape collections. ACM Trans Gr (special issue of SIGGRAPH) 2014;33(4).
[35] Huang Q, Wang H, Koltun V. Single-view reconstruction via joint analysis of image and shape collections. ACM Trans Gr 2015;34(4).
[36] Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 2010;32(9):1627–45.
[37] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the CVPR; 2014. p. 580–7.
[38] Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of the NIPS; 2015.
[39] Malisiewicz T, Gupta A, Efros AA. Ensemble of exemplar-SVMs for object detection and beyond. In: Proceedings of the ICCV; 2011.
[40] Xiao J, Hays J, Ehinger K, Oliva A, Torralba A. SUN database: large-scale scene recognition from abbey to zoo. In: Proceedings of the CVPR; 2010.
[41] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115(3):211–52.

Please cite this article as: F. Liu et al., Retrieving indoor objects: 2D-3D alignment using single image and interactive ROI-based refinement, Computers & Graphics (2017), http://dx.doi.org/10.1016/j.cag.2017.07.029

452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469Q3 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494Q4 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523Q5 524 525 526 527 528 529 530 531 532 533 534 535 536 537

JID: CAG 10

538 539 540 541 542 543

ARTICLE IN PRESS

[m5G;August 25, 2017;13:22]

F. Liu et al. / Computers & Graphics xxx (2017) xxx–xxx

[42] Chang AX, Funkhouser TA, Guibas LJ, Hanrahan P, Huang QX, Li Z, et al. ShapeNet: an information-rich 3D model repository. CoRR 2015;abs/1512.03012.
[43] Gharbi M, Malisiewicz T, Paris S, Durand F. A Gaussian approximation of feature space for fast image similarity. Tech. Rep., MIT CSAIL; 2012.
[44] Hariharan B, Malik J, Ramanan D. Discriminative decorrelation for clustering and classification. In: Proceedings of the ECCV; 2012.

[45] Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A. The PASCAL visual object classes challenge: a retrospective. Int J Comput Vis 2015;111(1):98–136.
[46] COCO: common objects in context; 2016. http://mscoco.org/dataset/#detections-leaderboard.
