Journal Pre-proof

Adversarial autoencoders for compact representations of 3D point clouds

Maciej Zamorski, Maciej Zięba, Piotr Klukowski, Rafał Nowak, Karol Kurach, Wojciech Stokowiec, Tomasz Trzciński

PII: S1077-3142(20)30014-X
DOI: https://doi.org/10.1016/j.cviu.2020.102921
Reference: YCVIU 102921
To appear in: Computer Vision and Image Understanding
Received date: 4 August 2019
Revised date: 28 January 2020
Accepted date: 29 January 2020

Please cite this article as: M. Zamorski, M. Zięba, P. Klukowski et al., Adversarial autoencoders for compact representations of 3D point clouds. Computer Vision and Image Understanding (2020), doi: https://doi.org/10.1016/j.cviu.2020.102921.

© 2020 Published by Elsevier Inc.
*1. Cover Letter

Title Page

Title of the manuscript: Adversarial Autoencoders for Compact Representations of 3D Point Clouds

Authors
● Maciej Zamorski ([email protected]), Wrocław University of Science and Technology, Tooploox
● Maciej Zięba, Wrocław University of Science and Technology, Tooploox
● Piotr Klukowski, Wrocław University of Science and Technology, Tooploox
● Rafał Nowak, University of Wrocław, Tooploox
● Wojciech Stokowiec, Tooploox
● Tomasz Trzciński, Warsaw University of Technology, Tooploox

Institutions
● Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
● University of Wrocław, plac Uniwersytecki 1, 50-137 Wrocław, Poland
● Warsaw University of Technology, plac Politechniki 1, 00-661 Warsaw, Poland
● Tooploox, Tęczowa 7, 53-601 Wrocław, Poland
*2. Manuscript

Authorship Confirmation

Computer Vision and Image Understanding

As corresponding author I, Maciej Zamorski, hereby confirm on behalf of all authors that:

1. This manuscript, or a large part of it, has not been published, was not, and is not being submitted to any other journal.
2. If presented at or submitted to or published at a conference(s), the conference(s) is (are) identified and substantial justification for re-publication is presented below. A copy of conference paper(s) is (are) uploaded with the manuscript.
3. If the manuscript appears as a preprint anywhere on the web, e.g. arXiv, etc., it is identified below. The preprint should include a statement that the paper is under consideration at Computer Vision and Image Understanding.
4. All text and graphics, except for those marked with sources, are original works of the authors, and all necessary permissions for publication were secured prior to submission of the manuscript.
5. All authors each made a significant contribution to the research reported and have read and approved the submitted manuscript.

List any pre-prints:
Adversarial Autoencoders for Compact Representations of 3D Point Clouds, https://arxiv.org/abs/1811.07605

Relevant Conference publication(s) (submitted, accepted, or published):

Justification for re-publication:
Research Highlights

• Comprehensive studies of VAEs and AAEs in the context of 3D point cloud generation
• Selecting a proper cost function for EMD achieves a proper ELBO for variational calculus
• We show that AAE can learn 3D points generation, retrieval, and clustering
• Extending the AAE framework to obtain compact representations of 3D point clouds
Computer Vision and Image Understanding

journal homepage: www.elsevier.com

Adversarial autoencoders for compact representations of 3D point clouds

Maciej Zamorski^{a,e,∗∗}, Maciej Zięba^{a,e,∗∗}, Piotr Klukowski^{a,e}, Rafał Nowak^{b,e}, Karol Kurach^{f}, Wojciech Stokowiec^{c,e}, Tomasz Trzciński^{d,e}

a Wrocław University of Science and Technology, Wrocław, Poland
b Institute of Computer Science, University of Wrocław, Wrocław, Poland
c Polish-Japanese Academy of Information Technology, Warsaw, Poland
d Warsaw University of Technology, Warsaw, Poland
e Tooploox, Wrocław, Poland
f Google Brain, Zurich, Switzerland

∗∗ Corresponding authors. E-mail: [email protected] (Maciej Zamorski), [email protected] (Maciej Zięba).

ABSTRACT

Deep generative architectures provide a way to model not only images but also complex, three-dimensional objects, such as point clouds. In this work, we present a novel method to obtain meaningful representations of 3D shapes that can be used for challenging tasks, including 3D points generation, reconstruction, compression, and clustering. Contrary to existing methods for 3D point cloud generation that train separate, decoupled models for representation learning and generation, our approach is the first end-to-end solution that allows us to simultaneously learn a latent space of representations and to generate 3D shapes from it. Moreover, our model is capable of learning meaningful compact binary descriptors with adversarial training conducted on the latent space. To achieve this goal, we extend the deep Adversarial Autoencoder model (AAE) to accept 3D input and create 3D output. Thanks to our end-to-end training regime, the resulting method, called 3D Adversarial Autoencoder (3dAAE), obtains either a binary or continuous latent space that covers a much broader portion of the training data distribution. Finally, our quantitative evaluation shows that 3dAAE provides state-of-the-art results for 3D points clustering and 3D object retrieval.

© 2020 Elsevier Ltd. All rights reserved.

Fig. 1. Synthetic point cloud samples generated by our AAE models trained with the Earth Mover Distance as a reconstruction error.

1. Introduction

As more and more sensors offer capturing depth along with other visual cues, three-dimensional data points start to play a pivotal role in many real-life applications, including simultaneous localization and mapping or 3D object detection. A proliferation of devices such as RGB-D cameras and LIDARs leads to an increased amount of 3D data that is being captured and analyzed, e.g., by autonomous cars and robots. Since storing and processing 3D data points in their raw form quickly becomes
a bottleneck of a processing system, designing compact and efficient representations is of the utmost importance for researchers in the field (Gumhold et al. (2001); Pauly et al. (2003); Daniels et al. (2007); Qi et al. (2017a); Zhou and Tuzel (2018); Yu et al. (2018); Ben-Shabat et al. (2018)).

Despite the growing popularity of point cloud representations, labeled data in this domain is still very limited, bringing the necessity of extracting features using unsupervised (Shoef et al. (2019)) or semi-supervised (Suger et al. (2015)) methods. As learned features are later used for downstream tasks, such as reconstruction, generation, retrieval, and clustering, the latent space should match a specified prior probability distribution. While such representations have been extensively investigated for 2D images (Bengio et al. (2013); Kingma and Welling (2013); Radford et al. (2015); Tomczak and Welling (2018); Xia et al. (2014); Zieba et al. (2018); Zieba and Wang (2017)), there has been much less focus on applying these techniques to 3D shapes and structures (Simon et al. (2014)). In particular, no other work obtains a binary representation from point cloud data for generation, retrieval, and interpolation tasks.

To address this problem, we introduce a novel single-stage, end-to-end 3D point cloud probabilistic model called 3D Adversarial Autoencoder (3dAAE). This generative model is capable of representing 3D point clouds with compact continuous or binary embeddings. Additionally, those representations follow an arbitrary prior distribution, such as a Gaussian Mixture Model (continuous embeddings), Bernoulli, or Beta (binary representations).

In the proposed approach, we utilize an adversarial autoencoder, wherein the PointNet model (Qi et al. (2017a)) is used as an encoder. As the reconstruction loss, we use the Earth-Mover distance (Rubner et al. (2000); Fan et al. (2017)), which enables end-to-end training of the model with permutation invariance for the input data. Finally, to guarantee the desired distribution on the latent space, we utilize adversarial training with the Wasserstein criterion (Gulrajani et al. (2017)).

In contrast to existing approaches, our model captures the point cloud data in a way which allows for simultaneous data generation, feature extraction, clustering, and object interpolation. Interestingly, these goals are attainable by using only unsupervised training with a reconstruction loss and prior distribution regularization.

To summarize, in this work, we present the following contributions: (1) We present comprehensive studies of variational autoencoders (VAE) and adversarial autoencoders (AAE) in the context of 3D point cloud generation. (2) Although AAE outperforms VAE in our studies, we have shown that by selecting a proper cost function for the EMD distance, we achieve a proper ELBO for variational calculus. (3) We show that AAE can solve challenging tasks related to 3D points generation, retrieval, and clustering. In particular, our AAE model achieves state-of-the-art results in 3D point generation. (4) We show that our AAE model can represent 3D point clouds in a compact binary space (up to 100 bits) and achieves results competitive with models that use wide and continuous representations.

2. Related Work

Point clouds. The complex task of creating rich feature representations of 3D point clouds is an active field of investigation. In (Wu et al. (2015)), the authors propose a voxelized representation of an input point cloud. Other approaches use techniques such as multi-view 2D images (Su et al. (2015)) or occupancy grids (Ji et al. (2013); Maturana and Scherer (2015)). The PointNet architecture (Qi et al. (2017a)) allows handling unordered sets of real-valued points by introducing a permutation-invariant feature-aggregating function. This approach has constituted a basis for numerous extensions, involving training hierarchical features (Qi et al. (2017b)), point cloud retrieval (Angelina Uy and Hee Lee (2018)), or point cloud generation (Achlioptas et al. (2018)).

Representation learning. The studies on the 3D-GAN model (Wu et al. (2016)) have resulted in an extension of the original GAN (Goodfellow et al. (2014)), which enabled generating realistic 3D shapes sampled from a latent variable space. Another work in the field (Schönberger et al. (2017)) uses a version of the Variational Autoencoder (VAE) (Kingma and Welling (2013); Rezende et al. (2014)) to build a semantic representation of a 3D scene that is used for relocalization in SLAM.
However, models that incorporate VAE into the framework usually regularize the feature space with a normal distribution. It has been shown that there are better prior distributions to optimize for (Tomczak and Welling (2018)). Moreover, due to the necessity of defining a tractable and differentiable KL-divergence, VAE allows only for a narrow set of prior distributions. As a way to apply an arbitrary prior regularization to the autoencoder, the Adversarial AutoEncoder framework has been proposed (Makhzani et al. (2016)). It regularizes the latent distribution via a discriminator output rather than the KL-divergence. This approach has been used on 2D image data but has not been validated in the context of 3D point cloud data.

The only reference method known to the authors for calculating binary 3D feature descriptors is B-SHOT (Prakhya et al. (2015)). The approach was designed for the binarization of the initial SHOT descriptor (Salti et al. (2014)), which was designed for a keypoint matching task.

To sum up, numerous existing solutions are able to tackle the generation or representation of 3D point clouds. However, we are lacking end-to-end solutions that can capture both of them and allow for learning compact binary descriptors or clustering.

Fig. 2. 3dAAE model architecture that extends AE with an additional discriminator D. The role of the discriminator is to distinguish between true samples generated from p(z) and fakes delivered by the encoder E. The encoder tries to generate artificial samples that are difficult for the discriminator to distinguish. (In the diagram, the encoder E maps the input cloud x to z ∼ q(z|x), the generator G reconstructs x from z, and D judges z ∼ q(z|x) against z ∼ p(z), i.e., real p(z) or fake q(z).)

3. Methods

3.1. Variational Autoencoders

Variational Autoencoders (VAE) are generative models that are capable of learning an approximated data distribution by applying variational inference (Kingma and Welling (2013); Rezende et al. (2014)). We consider the latent stochastic space z and optimize the upper bound on the negative log-likelihood of x:

E_{x∼p_d(x)}[−log p(x)] ≤ E_{x∼p_d(x)}[ E_{z∼q(z|x)}[−log p(x|z)] + KL(q(z|x) ‖ p(z)) ] = reconstruction + regularization,

where KL(· ‖ ·) is the Kullback–Leibler divergence (Kullback and Leibler (1951)), p_d(x) is the empirical distribution, q(z|x) is the variational posterior (the encoder E), p(x|z) is the generative model (the generator G), and p(z) is the prior. In practical applications, p(x|z) and q(z|x) are parametrized with neural networks, and sampling from q(z|x) is performed by the so-called reparametrization trick. The total loss used to train a VAE can be represented by two factors: a reconstruction term, which is the ℓ2 norm of the difference between the sampled and reconstructed object if p(x|z) is assumed to be a normal distribution, and a regularization term that forces z generated from the q(z|x) network to follow the prior distribution p(z).
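To make the two-term objective concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' implementation) of the reparametrization trick and the reconstruction-plus-regularization loss, assuming a Gaussian posterior q(z|x) = N(µ, σ²), a unit-variance Gaussian p(x|z), and a standard normal prior; the tensor shapes and reductions are illustrative assumptions.

```python
import torch

def reparametrize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction term: squared l2 error, valid when p(x|z) is Gaussian.
    reconstruction = ((x - x_rec) ** 2).sum(dim=-1).mean()
    # Regularization term: analytic KL(N(mu, sigma^2) || N(0, I)).
    regularization = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return reconstruction + regularization
```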
3.2. Adversarial Autoencoders

The main limitation of VAE models is that the regularization term requires a particular prior distribution to make the KL divergence tractable. In order to deal with that limitation, the authors of (Makhzani et al. (2016)) introduced Adversarial Autoencoders (AAE), which utilize adversarial training to force a particular distribution on the z space. The model assumes an additional neural network, the discriminator D, which is responsible for distinguishing between fake and real data instances, where the real instances are sampled from the assumed prior distribution p(z) and the fake instances are generated via the encoding network q(z|x). The adversarial part of the training can be expressed in the following way:

min_E max_D V(E, D) = E_{z∼p(z)}[log D(z)] + E_{x∼p_d(x)}[log(1 − D(E(x)))].   (1)

The training procedure is characteristic of GAN models and is performed by alternating updates of the parameters of the encoder E and the discriminator D. The parameters of the discriminator D are updated by minimizing L_D = −V(E, D), and the parameters of the encoder E and the generator G are optimized by minimizing the reconstruction error together with V(E, D): L_EG = reconstruction + V(E, D). In practical applications, the stated criterion can be substituted with the so-called Wasserstein criterion (Gulrajani et al. (2017)). Other approaches found in the literature that apply the Wasserstein criterion are Wasserstein Autoencoders (Tolstikhin et al. (2017)) (of which Adversarial Autoencoders are a special case) and Wasserstein VAE (Ambrogioni et al. (2018)).

Learning the prior on the z latent space using adversarial training has several advantages over standard VAE approaches (Makhzani et al. (2016)). First of all, the data examples coded with the encoder exhibit sharp transitions, indicating that the coding space is filled; this is beneficial for interpolating on the latent space, as it reassures us that the model has learned meaningful semantic features encoded in the hidden vector. During linear interpolation between two encodings, the model is able to interpret those features and generate samples in data (point cloud) space, indicating that the model does not overfit to samples from the training set but instead generalizes to unseen, possible shapes. Secondly, there is no limitation on the distribution that is adjusted to the z space.
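The alternating updates can be summarized in a short training-step sketch, assuming PyTorch modules `E`, `G`, and `D` (with `D` outputting probabilities), a `sample_prior` drawing from p(z), and a `recon_loss` such as EMD; all names are illustrative, and the non-saturating surrogate used for the encoder's adversarial term is a common substitution rather than the paper's exact recipe.

```python
import torch

def aae_step(E, G, D, x, sample_prior, recon_loss, opt_eg, opt_d, eps=1e-7):
    # Discriminator update: minimize L_D = -V(E, D).
    z_fake = E(x).detach()                  # codes from the encoder ("fake")
    z_real = sample_prior(z_fake.shape)     # samples from the prior p(z) ("real")
    loss_d = -(torch.log(D(z_real) + eps).mean()
               + torch.log(1.0 - D(z_fake) + eps).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Encoder/generator update: minimize L_EG = reconstruction + V(E, D).
    # The non-saturating surrogate -log D(E(x)) replaces the V(E, D) part.
    z = E(x)
    loss_eg = recon_loss(G(z), x) - torch.log(D(z) + eps).mean()
    opt_eg.zero_grad(); loss_eg.backward(); opt_eg.step()
    return loss_d.item(), loss_eg.item()
```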
4. Our Approach

VAE and AAE are widely applied for the analysis of images, but there is limited work on their application to the generation and representation of point clouds. In our study, we have adjusted these models to address multiple challenging tasks arising in point cloud analysis, such as generation, representation, learning binary descriptors, and clustering.

A point cloud can be represented as a set of points in 3D Euclidean space, denoted by S = {x_n}_{n=1}^{N}, where x_n ∈ R^3 is a single point in 3D space. For this particular data representation, the crucial step is to define a proper reconstruction loss that can be further used in the autoencoding framework. In the literature, two common distance measures are successively applied for reconstruction purposes: the Earth Mover's (Wasserstein) Distance (Rubner et al. (2000)) and the Chamfer pseudo-distance.

Earth Mover's Distance (EMD): introduced in (Rubner et al. (2000)), EMD is a metric between two distributions based on the minimal cost that must be paid to transform one distribution into the other. For two equally sized subsets S_1 ⊆ R^3, S_2 ⊆ R^3, their EMD is defined as:

EMD(S_1, S_2) = min_{φ: S_1→S_2} Σ_{x∈S_1} c(x, φ(x)),   (2)

where φ is a bijection and c(x, φ(x)) is a cost function, which can be defined as c(x, φ(x)) = (1/2) ‖x − φ(x)‖_2^2.

Chamfer pseudo-distance (CD): measures the squared distance between each point in one set and its nearest neighbor in the other set:

CD(S_1, S_2) = Σ_{x∈S_1} min_{y∈S_2} ‖x − y‖_2^2 + Σ_{y∈S_2} min_{x∈S_1} ‖x − y‖_2^2.   (3)

In contrast to EMD, which is only differentiable almost everywhere, CD is fully differentiable. Additionally, CD is computationally less demanding. However, as it calculates minimum distances, it tends to strongly favor regions with high point density and undervalue the other parts, which are still important to the overall reconstruction. On the other hand, EMD compares the distribution of points between the original and reconstructed point clouds. This prevents obtaining low loss values for degenerate cases that have points collected in fewer but denser groups, as happens with CD. See (Achlioptas et al. (2018)) for a quantitative and qualitative comparison of reconstructions using CD and EMD. Additionally, the attention to finer details of point cloud samples provided by EMD leads to better generative capabilities of the model in terms of minimum matching distance and distribution coverage (Achlioptas et al. (2018)).

Defining a good distance measure in data (pixel) space can be difficult. The superiority of GANs over autoencoders in generative power (Ledig et al. (2017)) comes from using a discriminator instead of calculating a reconstruction error, which leads to generating higher-quality artificial data. Thanks to distance measures like Chamfer or Earth-Mover, this limitation is no longer observed for 3D point clouds. Therefore, we introduce VAE and AAE models for 3D point clouds and call them 3dVAE and 3dAAE, respectively.
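For reference, the following is a minimal NumPy/SciPy sketch of both measures for equally sized point sets; the exact EMD assignment is solved here with the Hungarian algorithm, which is practical only for small sets, while training typically relies on approximate, differentiable solvers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def chamfer(s1, s2):
    # Eq. (3): sum of squared distances to the nearest neighbor, both ways.
    d = cdist(s1, s2, metric="sqeuclidean")
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def emd(s1, s2):
    # Eq. (2): optimal bijection phi minimizing the total transport cost
    # with c(x, phi(x)) = 0.5 * ||x - phi(x)||_2^2.
    cost = 0.5 * cdist(s1, s2, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```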
Following the framework for VAE introduced in Section 3, we parametrize q(z|x) with the encoding network E(·) and expect a normal distribution with a diagonal covariance matrix: q(z|x) = N(E_µ(x), E_σ(x)), where x stores the points from S, assuming some random ordering. Because we operate on a set of points, we utilize the PointNet model, which is invariant to permutations, as E(·). Therefore, we receive the same distribution for all possible orderings of points from S. On the other hand, we make use of the G model to parametrize separate distributions for each of the cloud points, x_1, ..., x_N, in the following manner. We model each point distribution p(x_j|z) with a normal distribution whose mean values are returned by the network G(·): p(x_j|z) = N(G_i(z), 1), where p(x_1|z), ..., p(x_N|z) are conditionally independent, φ(x_j) = G_i(z), and φ(·) is a bijection. Because we are not able to propose a permutation-invariant mapping G(·) (as it is a simple MLP model), we utilize the additional function φ(·), which provides a one-to-one mapping for the points stored in S. Under this assumption, the problem of training the VAE model for point clouds (3dVAE) can be solved by minimizing the following criterion:

Q = KL(q(z|x) ‖ p(z)) + min_φ Σ_{i=1}^{N} (1/2) ‖x_i − φ(x_i)‖_2^2.   (4)

The training criterion for the model is composed of the reconstruction error, defined with the Earth-Mover distance between original and reconstructed samples, and the KL distance between samples generated by the encoder and samples generated from the prior distribution. Samples from the encoder are obtained by applying the reparametrization trick on the last layer. To balance the gap caused by the orders of magnitude between the components of the loss, we scale the Earth-Mover component by multiplying it by the scaling parameter λ. To be consistent with the reference methods in the experimental part, we also use a slightly modified, unsquared cost function: c(x, φ(x)) = ‖x − φ(x)‖_2.

Due to the limitations of VAE listed in the previous section, namely the narrow spectrum of possible priors and the worse distribution adjustment, we propose an adversarial approach adjusted to 3D point clouds (3dAAE). The scheme of the adversarial autoencoder for 3D point clouds is presented in Figure 2. The model is composed of an encoder E, represented by a PointNet-like (Qi et al. (2017a)) architecture, which transforms 3D points into the latent space z. The latent coding z is further used by a generator G to reconstruct or generate 3D point clouds. To impose the assumed prior distribution p(z), we utilize a discriminator D that is involved in the process of distinguishing between the real samples generated from the prior distribution p(z) and the fake samples obtained from the encoder E. If z follows either a normal distribution or a mixture of Gaussians, we apply the reparametrization trick to the encoder E to obtain the samples of z.

Fig. 3. The t-SNE plot for the latent space obtained from the AE and 3dAAE models (chair category). One can notice the interpolation gap between two chairs for AE. For encodings obtained from the 3dAAE model, this phenomenon is not observed, and the latent variable space is much denser, which allows for a smooth transition within the space. This confirms that the model learns to embed meaningful semantic features into the hidden vector. Furthermore, it shows generalization capability beyond the samples from the training dataset, which can be linked to the improved Coverage and Minimum Matching Distance (see Section 5.1) of 3dAAE.

Fig. 4. Impact of the adversarial training with Beta(α = 0.01, β = 0.01) on the distribution of point cloud embeddings. To balance the reconstruction and adversarial losses, we have introduced the λ hyperparameter. (Panels compare AE with 3dAAE for λ = 0.01, 0.1, and 2 at epochs 5, 250, and 400; histograms of embedding values on [0, 1], with counts on a log scale from 10^1 to 10^6.)
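As a concrete reading of Eq. (4) and the λ-balancing above, the following sketch (ours; `emd_fn` stands for any, ideally differentiable, implementation of the Earth-Mover reconstruction term, such as the sketch after Eq. (3)) combines the scaled reconstruction component with the analytic Gaussian KL term.

```python
import torch

def dvae_criterion(x, x_rec, mu, logvar, emd_fn, lam=1.0):
    # Q = lam * EMD(x, x_rec) + KL(q(z|x) || N(0, I)); the EMD term plays the
    # role of min_phi sum_i c(x_i, phi(x_i)) in Eq. (4), and lam compensates
    # for the different orders of magnitude of the two loss components.
    reconstruction = emd_fn(x, x_rec)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return lam * reconstruction + kl
```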
The model is trained in the adversarial training scheme described in detail in Section 3.2. As the reconstruction loss, we take the Earth-Mover loss defined by Eq. (2). For the GAN part of the training process, we utilize the Wasserstein criterion.

In Figure 3, we present a 2D visualization of the coding space for the AE and 3dAAE methods. For the 3dAAE model, we can observe that the encodings are clustered consistently with the prior distribution. For the AE model, we can find gaps in some areas that may lead to poor interpolation results.

Subsequently, we have used the proposed 3dAAE model to learn compact binary descriptors (100 bits) of 3D point clouds. The adversarial training with Beta(0.01, 0.01) has allowed us to alter the distribution of the AE embeddings so that it accumulates its probability mass around 0 or 1 (Figure 4). The binary embeddings presented in the experimental section are obtained by thresholding z at 0.5.

Contrary to the stacked models presented in (Achlioptas et al. (2018)), our model is trained in an end-to-end framework, and the latent coding space is used for both representation and sampling purposes. Thanks to the application of adversarial training, we obtain data codes that are consistent with the assumed prior distribution p(z).

Fig. 5. Interpolations between test set objects obtained by our single-class AAE models. The leftmost and rightmost samples present the ground truth objects. The images in between are the result of generating images from a linear interpolation between our latent space encodings of the side images.

5. Evaluation

In this section, we describe the experimental results of the proposed generative models in various tasks, including 3D points reconstruction, generation, binary representation, and clustering.

5.1. Metrics

Following the methodology for evaluating the generative fidelity and diversification of samples provided in (Achlioptas et al. (2018)), we utilize the following criteria for evaluation: Jensen-Shannon Divergence, Coverage, and Minimum Matching Distance.

Jensen-Shannon Divergence (JSD): a measure of distance between two empirical distributions P and Q, defined as:

JSD(P ‖ Q) = ( KL(P ‖ M) + KL(Q ‖ M) ) / 2,  where M = (P + Q)/2.   (5)

Coverage (COV): a measure of the generative capabilities in terms of the richness of samples generated from the model. For two sets of point clouds S_1, S_2 ⊂ R^3, coverage is defined as the fraction of point clouds in S_2 that are the nearest neighbor to some point cloud in S_1 in the given metric.

Minimum Matching Distance (MMD): since COV only takes the closest point clouds into account and does not depend on the distance between the matchings, an additional metric was introduced. For point cloud sets S_1 and S_2, MMD is a measure of the similarity of the point clouds in S_1 to those in S_2.

Both COV and MMD can be calculated using the Chamfer (COV-CD, MMD-CD) and Earth-Mover (COV-EMD, MMD-EMD) distances, respectively. For completeness, we report all possible combinations.
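Under our own simplifying assumptions, the three criteria can be computed as follows: JSD from normalized voxel-occupancy histograms (cf. the 28³ grid in Section 5.3), and MMD/COV from a matrix `d[i, j]` of CD or EMD distances between generated cloud i and reference cloud j; the helper names are illustrative.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    # Eq. (5) on normalized occupancy histograms P and Q (e.g., counts of
    # generated vs. reference points falling into each cell of a 28^3 grid).
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mmd_cov(d):
    # d[i, j]: CD or EMD between generated cloud i and reference cloud j.
    # MMD: average distance of each reference cloud to its closest generated one.
    mmd = d.min(axis=0).mean()
    # COV: fraction of reference clouds that are the nearest neighbor
    # of at least one generated cloud.
    cov = np.unique(d.argmin(axis=1)).size / d.shape[1]
    return mmd, cov
```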
5.2. Network architecture

In our experiments, we use the following network architectures. The encoder (E) is a PointNet-like network composed of five conv1d layers, one fully-connected layer, and two separate fully-connected layers for the reparametrization trick. ReLU activations are used for all layers except the last ones used for the reparametrization. The generator (G) is a fully-connected network with 5 layers and ReLU activations except for the last layer. The discriminator (D) is a fully-connected network with 5 layers and ReLU activations except for the last layer. The above architecture is trained using Adam (Kingma and Ba (2014)) with parameters α = 10^{−4}, β_1 = 0.5 and β_2 = 0.999.
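A minimal PyTorch rendition of this architecture is sketched below; the hidden widths and the latent size are our assumptions, since the text fixes only the layer counts, the activations, and the optimizer settings.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # PointNet-like: five conv1d layers shared across points, permutation-
    # invariant max-pooling, one FC layer, then two FC heads (mu, logvar)
    # for the reparametrization trick (no activation on the heads).
    def __init__(self, z_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 512, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)

    def forward(self, x):                      # x: (batch, 3, num_points)
        h = self.conv(x).max(dim=2).values     # aggregate over the point axis
        h = self.fc(h)
        return self.mu(h), self.logvar(h)

def mlp(sizes):
    # 5-layer fully-connected stack, ReLU everywhere except the last layer;
    # used for both the generator G and the discriminator D.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

G = mlp([128, 256, 512, 1024, 2048, 3 * 2048])   # reshape output to (2048, 3)
D = mlp([128, 512, 512, 128, 64, 1])
opt = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
```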
5.3. Experimental setup

To perform the experiments reported in this work, we have used either the ShapeNet (Chang et al. (2015)) or the ModelNet40 (Wu et al. (2015)) dataset, transformed to the 3 × 2048 point cloud representation following the methodology provided in (Achlioptas et al. (2018)). Unless otherwise stated, we train the models with point clouds from a single object class and work with train/validation/test sets of an 85%–5%–10% split. When reporting JSD measurements, we use a 28³ regular voxel grid to compute the statistics.

5.4. Models used for evaluation

Following the previous studies (Achlioptas et al. (2018)), we have included the following models in our work:

Autoencoder (AE). A simple autoencoder architecture that converts the input point cloud to the bottleneck representation z with an encoder. The model is used for representing 3D points in the latent space, without any additional mechanism that could be used to sample artificial 3D objects. Two approaches to training the AE are considered: 1) with the Chamfer (AE-CD) or 2) the Earth-Mover (AE-EMD) distance as the way to calculate the reconstruction error.

Raw point cloud GAN (r-GAN). A basic GAN architecture that learns to generate point clouds directly from a sampled latent vector.

Latent-space (Wasserstein) GAN (l-(W)GAN). An extended version of GAN trained in stacked mode on the latent space z, with and without an application of the Wasserstein criterion (Arjovsky et al. (2017)).

Gaussian mixture model (GMM). A Gaussian mixture model fitted on the latent space of an encoder in the stacked mode.

Finally, we also evaluate the models proposed in this work as well as their variations:

3D Variational Autoencoder (3dVAE). An autoencoder with the EMD reconstruction error and the KL(· ‖ ·) regularizer.

3D Adversarial Autoencoder (3dAAE). The adversarial autoencoder introduced in Section 4 that makes use of a prior p(z) to learn the distribution directly on the latent space z. Various types of priors are considered in our experiments: normal distribution (3dAAE), mixture of Gaussians (3dAAE-G), and Beta (3dAAE-B).

3D Categorical Adversarial Autoencoder (3dAAE-C). An AAE model with an additional category output returned by the encoder for clustering purposes (see Section 5.9).

5.5. Reconstruction capabilities

The over-regularization problem is the issue of sacrificing reconstruction quality, and the capture of data factors in the feature vector, in favor of fitness to the prior distribution. When the prior and posterior distributions are indistinguishable, it means that the encoder has failed to embed the features of a data instance into the hidden vector. Solving this problem is essential for constructing meaningful feature vectors while preserving the ability of the decoder to reconstruct encoded data instances and to create new ones by sampling new features from the defined prior. We control for over-regularization by monitoring the statistics of the posterior distributions during training and balance the reconstruction and regularization losses with an added hyperparameter (Higgins et al.). Additionally, we examine two cases of data generation: from a feature vector sampled from the prior distribution p(z) (see Section 5.6) and from one obtained by encoding a data sample.

In this experiment, we evaluate the reconstruction capabilities of the proposed autoencoders using unseen test examples. We confront the reconstruction results obtained by the AE model with our approaches to examine the influence of prior regularization on reconstruction quality. In Table 1, we report the MMD-CD and MMD-EMD between the reconstructed point clouds and their corresponding ground truth in the test dataset of the chair object class. It can be observed that the 3dVAE model does not suffer from the over-regularization problem (Achlioptas et al. (2018)). Both reconstruction measures are at a comparable level, or even slightly lower for 3dAAE-G.

5.6. Generative capabilities

We present the evaluation of our models, which are involved in the sampling procedure directly on the latent space z, and compare them with the generative models trained in stacked mode based on the latent representation of the AE.
Fig. 6. Compact binary representations (100 bits) of 3D point clouds. For each of the 3D shapes, we provide the corresponding binary codes. (Panels: source object, interpolated objects, target object, and their compact binary embeddings.)

Fig. 7. Modifying point clouds by performing additive algebra on our latent space encodings with our single-class AAE models. Top-left sequence: adding rockers to a chair. Bottom-left: adding armrests to a chair. Top-right: changing table legs from one in the center to four in the corners. Bottom-right: changing the table top from rectangle- to circle-shaped.

Table 1. Reconstruction capabilities of the models captured by MMD. Measurements for reconstructions are evaluated on the test split for the considered models trained with EMD loss on training data of the chair class.

Method                          MMD-CD   MMD-EMD
AE (Achlioptas et al. (2018))   0.0013   0.052
3dVAE                           0.0010   0.052
3dAAE                           0.0009   0.052
3dAAE-G                         0.0008   0.051

For evaluation purposes, we use the chair category and the following five measures: JSD, MMD-CD, MMD-EMD, COV-CD, and COV-EMD. For each of the considered models, we select the best one according to the JSD measurement on the validation set. To reduce the sampling bias of these measurements, each generator produces a set of synthetic samples that is 3× the population of the comparative set (test or validation). We report the averages after repeating the process three times.

In Table 2, we report the generative results for a memorization baseline model, the five generative models introduced in (Achlioptas et al. (2018)), and our three approaches: 3dVAE, 3dAAE, and 3dAAE-G. The sampling-based memorization baseline uses no model that could produce artificial data; instead, it takes a random subset of the training set of the same size as the other generated sets. This subset is then used to compute the Coverage and MMD statistics. We can expect our models to achieve similar or worse results than this baseline when the training and test sets follow the same distribution. In this case, the baseline is a benchmark that the generative models can try to beat. However, by using a trained generative model, we aim not only to be able to generate the whole data distribution but also to produce low-dimensional embeddings. This allows us to learn the latent features of the data instead of just memorizing the samples. We can use these representations to create new data or for other downstream tasks, such as classification or retrieval problems.
Journal Pre-proof 9 of the data instead of just memorizing the samples. We can use
Fidelity
Method
JSD
A
For the 3dAAE-G model, we fix the number of Gaussian
B
equal 32 with different mean values and fixed diagonal covari-
C
0.048
0.0020
ance matrices.
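One possible way to draw the "real" latent samples for the 3dAAE-G discriminator from such a fixed mixture is sketched below; drawing the 32 component means once at random is our illustrative assumption, as the paper only states that the means differ and that the diagonal covariances are fixed.

```python
import torch

def sample_gmm_prior(n, z_dim=128, k=32, sigma=0.2, seed=0):
    # Fixed mixture: k distinct component means drawn once, shared
    # diagonal covariance sigma^2 * I, uniform component weights.
    g = torch.Generator().manual_seed(seed)
    means = 2.0 * torch.randn(k, z_dim, generator=g)
    comp = torch.randint(k, (n,), generator=g)       # component per sample
    return means[comp] + sigma * torch.randn(n, z_dim, generator=g)
```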
Table 2. Evaluating generative capabilities on the test split of the chair dataset on epochs/models selected via minimal JSD on the validation split. We report A) a sampling-based memorization baseline and the following reference methods from (Achlioptas et al. (2018)): B) r-GAN, C) l-GAN (AE-CD), D) l-GAN (AE-EMD), E) l-WGAN (AE-EMD), F) GMM (AE-EMD). The last three rows refer to our approaches: 3dVAE, 3dAAE, and 3dAAE-G.

Method    JSD     Fidelity            Coverage
                  MMD-CD   MMD-EMD    COV-CD   COV-EMD
A         0.017   0.0018   0.0630     79.4     78.6
B         0.176   0.0020   0.1230     52.3     19.0
C         0.048   0.0020   0.0790     59.4     32.2
D         0.030   0.0023   0.0690     52.3     19.0
E         0.022   0.0019   0.0660     67.6     66.9
F         0.020   0.0018   0.0650     68.9     67.4
3dVAE     0.018   0.0017   0.0639     65.9     65.5
3dAAE     0.014   0.0017   0.0622     67.3     67.0
3dAAE-G   0.014   0.0017   0.0643     69.6     68.7

It can be observed that all of our approaches achieved the best results in terms of the JSD measure. Practically, it means that our models are capable of learning better global statistics for generated point locations than the reference solutions. We also noticed a slight improvement in the MMD criteria for both of the considered distances. In terms of the coverage criteria, the results are comparable to those obtained by the best GAN model and the GMM approach.

In Table 3, we present additional results on four categories: car, rifle, sofa, and table. For further evaluation, we use the MMD-EMD and COV-EMD metrics. The 3dAAE-G model achieved the best results on the MMD-EMD criterion for each of the datasets and the highest coverage value for three of them. Practically, it means that the assumed Gaussian model with a diagonal covariance matrix is sufficient to represent the data in the latent space, and training a GMM in stacked mode is unnecessary when adversarial training with a fixed prior is performed. To present a qualitative result, we provide synthetic samples generated by the model in Figure 1.

Table 3. Fidelity (MMD-EMD) and Coverage (COV-EMD) metrics on the test split of the car, rifle, sofa, and table datasets on epochs/models selected via minimal JSD on the validation split. We report the results for our 3dAAE (denoted as A) and 3dAAE-G (denoted as AG) models compared to the reference GMM model (denoted as PA) reported in (Achlioptas et al. (2018)).

        Fidelity                Coverage
Class   PA      A       AG      PA     A      AG
car     0.041   0.040   0.039   65.3   66.2   67.6
rifle   0.045   0.045   0.043   74.8   72.4   75.4
sofa    0.055   0.056   0.053   66.6   63.5   65.7
table   0.061   0.062   0.061   71.1   71.3   73.0

5.7. Latent space arithmetic

One of the most important characteristics of a well-trained latent representation is its ability to generate good-looking samples based on embeddings created by performing interpolation or simple linear algebra. It shows that the model is able to learn the distribution of the data without excessive under- or overfitting (Tasse and Dodgson (2016)).

In Figure 5, we show that the interpolation technique in the latent space produces smooth transitions between two distinct types of same-class objects. It is worth mentioning that our model is able to change multiple characteristics of objects at once, e.g., the shape and the legs of a table. Similar studies have been performed on binary embeddings obtained in adversarial training with the Beta distribution (Figure 6).

Figure 7 presents the model's ability to learn meaningful encodings that allow performing addition and subtraction in the latent space in order to modify existing point clouds. In these examples, we were able to focus on the specific feature that we want to add to our initial point cloud while leaving other characteristics unchanged. Moreover, the operations were done using only one sample for each part of the transformation.

5.8. Learning binary embeddings with AAE models

In the presented work, we have employed adversarial training to learn informative and diverse binary features. The proposed routine uses the Beta(0.01, 0.01) distribution to impose binarization of the embeddings z in the training phase, and Bernoulli samples to perform point cloud generation.

In our studies, we have trained two 3dAAE models and compared their performance with AE to assess the impact of Beta regularization and adversarial training on model quality. The key issue in this experiment is the appropriate calibration of the λ coefficient, which keeps the balance between the reconstruction and adversarial losses.
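The binarization route described above can be sketched as follows (our illustration): Beta(0.01, 0.01) samples serve as the discriminator's "real" codes during training, trained embeddings are thresholded at 0.5 at test time, and Bernoulli samples drive generation.

```python
import torch

def sample_beta_prior(n, z_dim=100, alpha=0.01):
    # Beta(0.01, 0.01) puts almost all mass near 0 and 1, so matching it
    # adversarially pushes the embeddings towards binary values.
    return torch.distributions.Beta(alpha, alpha).sample((n, z_dim))

def binarize(z, threshold=0.5):
    # Compact binary descriptor (e.g., 100 bits) used at test time.
    return (z > threshold).to(torch.uint8)

def sample_bernoulli(n, z_dim=100, p=0.5):
    # Latent codes for generating new clouds once training is finished.
    return torch.bernoulli(torch.full((n, z_dim), p))
```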
Here we present the two top-performing 3dAAE models. The first one (3dAAE1) has been trained with a constant λ = 2.0, which is the best value found in the ablation studies. The latter model (3dAAE2) uses an exponential decay to reduce λ as the training proceeds.

We examine the proposed models in retrieval and embedding classification tasks on ModelNet40. For each experimental setting, we report the corresponding quality metric for continuous and binary embeddings. The binarization is performed simply by taking a threshold value equal to 0 for AE and 0.5 for 3dAAE-Beta.

Table 4. Results of point cloud retrieval and embedding classification with a Linear SVM on ModelNet40. 3dAAE1 has been trained with a constant value of λ = 2.0. In 3dAAE2 we use exponential decay to reduce λ as the training proceeds.

                            Continuous   Binary
Method                      Accuracy     Accuracy   mAP
Ours 3dAAE1                 84.68        79.82      42.94
Ours 3dAAE2                 84.35        79.78      44.09
AE                          84.85        78.12      41.76
Achlioptas et al. (2018)    84.50        -          -
Wu et al. (2015)            83.3         -          -
Sharma et al. (2016)        75.5         -          -
Girdhar et al. (2016)       74.4         -          -
Chen et al. (2003)          75.5         -          -
Kazhdan et al. (2003)       68.2         -          -

The results presented in Table 4 provide evidence that the proposed 3dAAE models classify continuous embeddings with an accuracy (Linear SVM) on par with state-of-the-art approaches. Thus, imposing an additional Beta(0.01, 0.01) prior in adversarial training has no negative impact in this setting. The main benefit arising from 3dAAE training becomes noticeable once the embeddings are binarized, which leads to superior performance in both retrieval and classification tasks.
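For illustration, retrieval with such binary codes reduces to ranking by Hamming distance; the following sketch of a per-query average precision is our rendering of a standard evaluation protocol, not the authors' released code.

```python
import numpy as np

def average_precision(query_bits, db_bits, db_labels, query_label):
    # Rank database items by Hamming distance to the query code.
    dist = (query_bits[None, :] != db_bits).sum(axis=1)
    order = np.argsort(dist, kind="stable")
    rel = (db_labels[order] == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    # Precision at each rank, averaged over the relevant positions.
    precision = np.cumsum(rel) / (np.arange(rel.size) + 1)
    return float((precision * rel).sum() / rel.sum())
```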
5.9. Clustering 3D point clouds

In this subsection, we introduce an extension of the adversarial model for 3D point clouds (3dAAE-C), inspired by (Makhzani et al. (2016)), that incorporates an additional one-hot coding unit y. This unit can be interpreted as an indicator of an abstract subclass for the 3D objects delivered at the input of the encoder.

The architecture of this model extends the existing model (see Figure 2) with an additional discriminator Dc. The role of this discriminator is to distinguish between noise samples generated from a categorical distribution and the one-hot codes y provided by the encoder. During the training procedure, the encoder takes on an additional task in which it tries to fool the discriminator Dc into treating its codes as samples from the categorical distribution. As a consequence, the encoder sets a value of 1 for the specific subcategories of the objects in the coding space z, as sketched below.
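A sketch of this extension under our assumptions: the encoder emits an additional softmax head y, the discriminator Dc receives either one-hot draws from a uniform categorical prior or the encoder's codes, and cluster assignments are read off with argmax; the softmax relaxation is our choice, as the paper does not spell out how the one-hot output is produced.

```python
import torch
import torch.nn.functional as F

def sample_categorical_prior(n, k=32):
    # "Real" samples for Dc: one-hot vectors from a uniform categorical prior.
    return F.one_hot(torch.randint(k, (n,)), num_classes=k).float()

def cluster_assignments(y_soft):
    # y_soft: (batch, k) softmax output of the encoder's categorical head.
    # After adversarial training pushes y_soft towards one-hot codes,
    # argmax gives the discovered subclass of each input cloud.
    return y_soft.argmax(dim=1)
```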
In Figure 8, we present the qualitative clustering results for the chair test data. We trained the categorical adversarial autoencoder on the chair dataset and set the number of potential clusters to be extracted by our model to 32. We selected the 4 most dominant clusters and present 6 randomly selected representatives for each of them. It can be seen that among the detected clusters, we can observe subgroups of chairs with characteristic features and shapes.

Fig. 8. Selected examples from the test set (Clusters 1–4) clustered with the adversarial autoencoder with additional categorical units.

6. Conclusions

In this work, we proposed an adversarial autoencoder for generating 3D point clouds. Contrary to the previous approaches (Achlioptas et al. (2018)), our model is an end-to-end 3D point cloud generative model that is capable of representing and sampling artificial data from the same latent space. The possibility of using arbitrary priors in an adversarial framework makes the model useful not only for representation and generative purposes but also for learning binary encodings and discovering hidden categories.

To show the many capabilities of our model, we provide various experiments on 3D points reconstruction, generation, retrieval, and clustering. The results of the provided experiments confirm the good quality of the model in the mentioned tasks, competitive with the existing state-of-the-art solutions.

Acknowledgments

The authors thank Google for supporting this work by donating credits for the Google Cloud Platform.

References

Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L., 2018. Learning representations and generative models for 3d point clouds, in: International Conference on Machine Learning, pp. 40–49.
Ambrogioni, L., Güçlü, U., Güçlütürk, Y., Hinne, M., van Gerven, M.A., Maris, E., 2018. Wasserstein variational inference, in: Advances in Neural Information Processing Systems, pp. 2478–2487.
Angelina Uy, M., Hee Lee, G., 2018. Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4470–4479.
Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning, International Convention Centre, Sydney, Australia, pp. 214–223.
Ben-Shabat, Y., Avraham, T., Lindenbaum, M., Fischer, A., 2018. Graph based over-segmentation methods for 3d point clouds. Computer Vision and Image Understanding 174, 12–23.
Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828.
Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al., 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012.
Chen, D.Y., Tian, X.P., Shen, Y.T., Ouhyoung, M., 2003. On visual similarity based 3d model retrieval, in: Computer Graphics Forum, Wiley Online Library, pp. 223–232.
Daniels, J.I., Ha, L.K., Ochotta, T., Silva, C.T., 2007. Robust smooth feature extraction from point clouds, in: IEEE International Conference on Shape Modeling and Applications 2007 (SMI'07), IEEE, pp. 123–136.
Fan, H., Su, H., Guibas, L.J., 2017. A point set generation network for 3d object reconstruction from a single image, in: CVPR, p. 6.
Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A., 2016. Learning a predictable and generative vector representation for objects, in: European Conference on Computer Vision, Springer, pp. 484–499.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C., 2017. Improved training of wasserstein gans, in: Advances in Neural Information Processing Systems, pp. 5767–5777.
Gumhold, S., Wang, X., MacLeod, R.S., 2001. Feature extraction from point clouds, in: Proceedings of the 10th International Meshing Roundtable, Newport Beach, California, pp. 293–305.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework.
Ji, S., Xu, W., Yang, M., Yu, K., 2013. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 221–231.
Kazhdan, M., Funkhouser, T., Rusinkiewicz, S., 2003. Rotation invariant spherical harmonic representation of 3d shape descriptors, in: Symposium on Geometry Processing, pp. 156–164.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kullback, S., Leibler, R.A., 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 79–86.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A.P., Tejani, A., Totz, J., Wang, Z., et al., 2017. Photo-realistic single image super-resolution using a generative adversarial network, in: CVPR, p. 4.
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., 2016. Adversarial autoencoders, in: International Conference on Learning Representations. URL: http://arxiv.org/abs/1511.05644.
Maturana, D., Scherer, S., 2015. Voxnet: A 3d convolutional neural network for real-time object recognition, in: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 922–928.
Pauly, M., Keiser, R., Gross, M., 2003. Multi-scale feature extraction on point-sampled surfaces, in: Computer Graphics Forum, Wiley Online Library, pp. 281–289.
Prakhya, S.M., Liu, B., Lin, W., 2015. B-shot: A binary feature descriptor for fast and efficient keypoint matching on 3d point clouds, in: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 1929–1934.
Qi, C.R., Su, H., Mo, K., Guibas, L.J., 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation, in: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE.
Qi, C.R., Yi, L., Su, H., Guibas, L.J., 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space, in: Advances in Neural Information Processing Systems, pp. 5099–5108.
Radford, A., Metz, L., Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Rezende, D.J., Mohamed, S., Wierstra, D., 2014. Stochastic backpropagation and approximate inference in deep generative models, in: International Conference on Machine Learning, pp. 1278–1286.
Rubner, Y., Tomasi, C., Guibas, L.J., 2000. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40, 99–121.
Salti, S., Tombari, F., Di Stefano, L., 2014. Shot: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding 125, 251–264.
Schönberger, J.L., Pollefeys, M., Geiger, A., Sattler, T., 2017. Semantic visual localization. CoRR abs/1712.05773.
Sharma, A., Grau, O., Fritz, M., 2016. Vconv-dae: Deep volumetric shape learning without object labels, in: European Conference on Computer Vision, Springer, pp. 236–250.
Shoef, M., Fogel, S., Cohen-Or, D., 2019. Pointwise: An unsupervised point-wise feature learning network. arXiv preprint arXiv:1901.04544.
Simon, T., Valmadre, J., Matthews, I., Sheikh, Y., 2014. Separable spatiotemporal priors for convex reconstruction of time-varying 3d point clouds, in: European Conference on Computer Vision, Springer, pp. 204–219.
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E., 2015. Multi-view convolutional neural networks for 3d shape recognition, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953.
Suger, B., Steder, B., Burgard, W., 2015. Traversability analysis for mobile robots in outdoor environments: A semi-supervised learning approach based on 3d-lidar data, in: 2015 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 3941–3946.
Tasse, F.P., Dodgson, N., 2016. Shape2vec: Semantic-based descriptors for 3d shapes, sketches and images. ACM Transactions on Graphics (TOG) 35, 208.
Tolstikhin, I., Bousquet, O., Gelly, S., Schoelkopf, B., 2017. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558.
Tomczak, J., Welling, M., 2018. Vae with a vampprior, in: International Conference on Artificial Intelligence and Statistics, pp. 1214–1223.
Wu, J., Zhang, C., Xue, T., Freeman, W.T., Tenenbaum, J.B., 2016. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling, in: Advances in Neural Information Processing Systems, pp. 82–90.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J., 2015. 3d shapenets: A deep representation for volumetric shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.
Xia, R., Pan, Y., Lai, H., Liu, C., Yan, S., 2014. Supervised hashing for image retrieval via image representation learning, in: AAAI, p. 2.
Yu, L., Li, X., Fu, C.W., Cohen-Or, D., Heng, P.A., 2018. Pu-net: Point cloud upsampling network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2799.
Zhou, Y., Tuzel, O., 2018. Voxelnet: End-to-end learning for point cloud based 3d object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.
Zieba, M., Semberecki, P., El-Gaaly, T., Trzcinski, T., 2018. Bingan: Learning compact binary descriptors with a regularized gan, in: Advances in Neural Information Processing Systems, pp. 3612–3622.
Zieba, M., Wang, L., 2017. Training triplet networks with gan. arXiv preprint arXiv:1704.02227.
*Credit Author Statement

Maciej Zamorski: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - Original Draft, Writing - Review & Editing, Visualization. Maciej Zięba: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - Original Draft, Visualization, Supervision. Piotr Klukowski: Conceptualization, Visualization, Investigation. Rafał Nowak: Software. Karol Kurach: Conceptualization. Wojciech Stokowiec: Software. Tomasz Trzciński: Writing - Original Draft, Writing - Review & Editing.

*Declaration of Interest Statement

☑ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.