Robotics and Autonomous Systems 62 (2014) 497–505
Occlusion-aware multi-view reconstruction of articulated objects for manipulation

Xiaoxia Huang a, Ian Walker a, Stan Birchfield a,b,∗

a Electrical and Computer Engineering Department, Clemson University, Clemson, SC 29634, United States
b Microsoft Research, Redmond, WA 98052, United States
Highlights
• Use of the Procrustes-Lo-RANSAC (PLR) algorithm to perform 3D reconstruction of articulated objects.
• A purely geometric approach recovers the axes’ location, orientation, and type (revolute or prismatic).
• The resulting 3D model is occlusion-aware, meaning that parts not visible in the current view are included.
Article info

Article history:
Received 13 June 2012
Received in revised form 10 July 2013
Accepted 10 December 2013
Available online 22 December 2013

Keywords:
Articulated reconstruction
3D reconstruction
Procrustes analysis
Locally optimized RANSAC
Abstract

We present an algorithm called Procrustes-Lo-RANSAC (PLR) to recover complete 3D models of articulated objects. Structure-from-motion techniques are used to capture 3D point cloud models of an object in two different configurations. Procrustes analysis, combined with a locally optimized RANSAC sampling strategy, facilitates a straightforward geometric approach to recovering the joint axes, as well as classifying them automatically as either revolute or prismatic. With the resulting articulated model, a robotic system is then able to manipulate the object along its joint axes at a specified grasp point in order to exercise its degrees of freedom. Because the models capture all sides of the object, they are occlusion-aware, meaning that the robot has knowledge of parts of the object that are not visible in the current view. Our algorithm does not require prior knowledge of the object, nor does it make any assumptions about the planarity of the object or scene. Experiments with a PUMA 500 robotic arm demonstrate the effectiveness of the approach on a variety of real-world objects containing both revolute and prismatic joints. © 2013 Elsevier B.V. All rights reserved.
1. Introduction The transition of robots from constrained industrial environments to unstructured, dynamic environments is crucial for emerging application areas such as assistive and service robotics. For such applications, one important capability is that of reconstructing articulated objects [1–4], which are sets of rigid links connected by one or more (revolute or prismatic) joints. A surprisingly large number of important objects encountered every day can be modeled in this fashion, such as refrigerators, microwave ovens, drawers, doors, laptop computers, scissors, staplers, and so forth. While considerable effort has been spent on recovering the joint parameters of a human subject from video or motion capture [5–9], reconstruction of non-human articulated objects has only recently
∗ Corresponding author at: Electrical and Computer Engineering Department, Clemson University, Clemson, SC 29634, United States.
E-mail addresses: [email protected] (X. Huang), [email protected] (I. Walker), [email protected] (S. Birchfield).
http://dx.doi.org/10.1016/j.robot.2013.12.006
begun to gain significant attention from researchers. Early work in this area focused on estimating joint axes by clustering tracked feature points, from which camera projection matrices were then recovered by assuming that the scene consists of planar surfaces rotating about vertical axes [10], or by imposing motion constraints on the projection equations [11].
In the computer vision community, factorization methods for affine reconstruction [12,13] are perhaps the most popular approach to articulated reconstruction. For example, by adding articulation constraints to the traditional formulation, Tresadern and Reid [14] detect the articulated objects, determine their degrees of freedom, and locate the joints. Similarly, Paladini et al. [15] recover 3D shape and motion of non-rigid and articulated objects in the case of missing data using an iterative factorization approach. Yan and Pollefeys [16] also investigate the subspace properties of articulated motion in a factorization framework by segmenting feature trajectories by local sampling and spectral clustering, then building the kinematic chain as a minimum spanning tree of a graph constructed from the segmented motion subspaces. More recent work by Fayad et al. [17] uses a hill-climbing approach that minimizes a single energy functional based
on image reprojection error, with alternating steps utilizing graph cuts to assign points to links, then factorization to reconstruct 3D models of the links. Note that unlike these factorization-based approaches, which are limited to affine projection, our system allows for perspective projection.
In robotics, several methods have been proposed to reconstruct articulated objects with unknown skeletal parameters. Sturm et al. [4] recover kinematic models of 1-DOF articulated objects such as a microwave by tracking the poses and orientations of rigid parts captured by a motion capture system and using a mixture of parameterized and parameter-free (Gaussian process) representations to best explain the given observation. In follow-up work, the same researchers [3] learn articulation models of objects without using artificial markers by applying generative models to depth images obtained from a self-developed active stereo system. In contrast to their work, our approach is not restricted to planar objects. Similar work by Katz et al. [1] reconstructs 3D kinematic structures of rigid articulated bodies in a single view using feature tracking, motion segmentation, and structure-from-motion techniques.
One limitation of current approaches to articulated object reconstruction is their restriction to processing data from a single view. As a result, such approaches do not yield any information about the occluded portion of the object (that is, the back side that is not visible in the current view), thus preventing manipulation of these non-visible portions. We introduce the term occlusion-aware to refer to a model’s ability to capture knowledge of parts of the object that are not visible in the current view. We argue that in a robotics context it is important to be able to reason about non-visible portions of the object in order to manipulate them. For example, a robot might approach a stapler from a different direction during manipulation than it did during model construction. Such full 3D knowledge has always been assumed in the context of grasping research based on 3D CAD models [18–20].
Motivated by recent developments in automated reconstruction of complete 3D models from a collection of images [21–24], in this paper we present an occlusion-aware system for reconstructing articulated objects from images taken by a camera from different viewpoints. Our method, called Procrustes-Lo-RANSAC, or PLR, first builds two complete 3D point cloud models by applying structure-from-motion algorithms to images captured of the object in two different configurations. Then the method uses Procrustes analysis combined with a locally optimized RANSAC sampling strategy to automatically segment the points into the individual links. After aligning the links, the articulated structure of the object is estimated using a geometric approach. With hand–eye calibration, the robot can then align its coordinate system with that of the recovered articulated model and manipulate the object by exercising the degrees of freedom captured by the model. Unlike some previous techniques, our approach uses perspective (rather than affine) projection, and it does not make any planar assumptions about the scene. This paper is based on our earlier work in [25], extending it to the manipulation of articulated objects based on their models. We show the results of the system on a variety of everyday objects, demonstrating the effectiveness of the approach.
2. Learning articulated objects

Assume we have an object composed of a set of rigid links connected by revolute or prismatic joints, so that a configuration of the object refers to a specific set of values for the joint angles or displacements, as appropriate. Let p̄ = (p^(1), ..., p^(m)) and q̄ = (q^(1), ..., q^(m)) be two point clouds of the object in two different configurations, where p^(i) ↔ q^(i), i = 1, ..., m, are corresponding points, and p^(i), q^(i) ∈ R^3. The heart of our PLR algorithm is to assign a label λ^(i) ∈ Z^+, i = 1, ..., m, to each point indicating to which link it belongs, as well as to compute the joint parameters {u_ab, ω_ab} for each adjacent pair of links a and b, where u_ab ∈ R^3 is the free vector specifying the direction of the joint axis, and ω_ab ∈ R^3 is a point on the joint axis (specifically, the average projection of all points on both links onto the axis). The point clouds define the 3D model of the object obtained by structure-from-motion, and correspondence is established between them using descriptors associated with the points. An overview of our system is illustrated in Fig. 1.
Fig. 1. Overview of our system.
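To make the recovered representation concrete, the following minimal sketch (our own illustration rather than code from the paper; the class and field names are hypothetical, and NumPy is assumed) shows one way the output of PLR could be stored: a per-point link label, plus the joint type, axis direction u, and axis point ω for each adjacent pair of links.

from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Joint:
    link_a: int
    link_b: int
    joint_type: str       # "revolute" or "prismatic"
    u: np.ndarray         # (3,) unit vector giving the direction of the joint axis
    omega: np.ndarray     # (3,) point on the joint axis

@dataclass
class ArticulatedModel:
    points: np.ndarray    # (m, 3) oriented point cloud of the first configuration
    labels: np.ndarray    # (m,) link label per point; -1 marks points on no link
    joints: List[Joint] = field(default_factory=list)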
We assume that sufficient exploratory interaction can be performed with the object to change its configuration. In this way, the approach bears some resemblance to interactive perception [26,1,27–29], except that we allow either a human or robot to perform the interaction. Automatically planning the end effector motion path for interactive perception in such situations remains an unsolved problem, because some sort of preliminary model (at least) is needed in order to interact with the object, but the interaction is necessary to estimate the model. Therefore, having the user perform the interaction enables us to escape this difficult chicken-and-egg problem. As progress is made toward developing such autonomous exploratory behavior, the reconstruction method described in this paper still applies.

2.1. Building initial 3D models

The first step of the approach is to build 3D models of the object in two different configurations. (Note that only two configurations are needed, no matter how many links or joints.) With the object in each configuration, feature points are detected and matched in a set of images taken from different viewpoints. We use the Bundler algorithm [30,21], which calibrates the camera and computes the camera locations in 3D. With this information, patch-based multi-view stereo (PMVS) [31,22] is used to reconstruct dense 3D oriented points, where each point has an associated 3D location, surface normal, and a set of visible images. (See Fig. 2.)

2.2. Rigid link segmentation

Once the models have been constructed, the oriented 3D points of the models are segmented into the constituent rigid components of the object. We use the affine SIFT (ASIFT) feature detector [32], which is an affine invariant extension of the popular SIFT feature detector [33], to find features in every image of the two sets. For every feature point in an image of the first configuration, the matching feature point in the second configuration is found as the one that minimizes the sum-of-squared differences (SSD) between gray-level patches surrounding the two features. The same matching algorithm is then run in the reverse order by swapping the roles of the images, and matches are retained if they agree in both directions.
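As an illustration of the bidirectional check described above, the sketch below (our own, assuming NumPy; extracting the ASIFT features and their surrounding patches is outside its scope) retains only matches that agree in both directions of the SSD comparison:

import numpy as np

def mutual_ssd_matches(patches1, patches2):
    """Symmetric SSD matching between two sets of gray-level patches.

    patches1, patches2: arrays of shape (n1, k, k) and (n2, k, k), one patch
    per feature. Returns (i, j) index pairs that agree in both directions.
    """
    a = patches1.reshape(len(patches1), -1).astype(np.float64)
    b = patches2.reshape(len(patches2), -1).astype(np.float64)
    ssd = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)  # pairwise SSD
    fwd = ssd.argmin(axis=1)  # best match in set 2 for each patch in set 1
    bwd = ssd.argmin(axis=0)  # best match in set 1 for each patch in set 2
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]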
Fig. 2. Top: Four images (out of 121 captured) of a toy truck, and the 3D reconstruction obtained. Bottom: Four images (out of 145 captured) of the truck in a different configuration, along with the 3D reconstruction.
Correspondence between oriented 3D points in the two models is established by the matching of the closest ASIFT features, found by projecting onto the image plane according to the determined camera parameters.
Given such correspondences, the Procrustes-Lo-RANSAC (PLR) algorithm shown in Algorithm 1 is applied. The first part of this algorithm is motion segmentation. Procrustes analysis [34], which is a classic method for aligning two point sets, is run iteratively in combination with a locally optimized RANSAC (Lo-RANSAC) sampling strategy [35] to find similarity transformations of rigid parts. Randomly selected triplets of correspondences are repeatedly used to compute the transformation between them until the number of correspondences that fit the transformation (i.e., the number of inliers) exceeds a threshold. Once a link has been found, these inliers are removed, and the entire process is repeated until no transformation can be found, i.e., there are no more rigid parts. In the code, the labels λ̄ = (λ^(1), ..., λ^(m)) are such that λ^(i) indicates the link for p^(i) and q^(i), or None if p^(i) and q^(i) are not on any link. An alternate approach to segmentation, in the case that the number of segments is known, would be to apply expectation–maximization (EM) [36], which tends to be computationally expensive.

2.3. Classifying joints

We assume that two adjacent rigid links are connected by either a revolute or prismatic joint. The type of joint is automatically determined by examining the similarity transformation R, t, and σ between the links determined by Procrustes alignment, where R is the rotation matrix, t is the translation vector, and σ is the relative scaling between the two models. Although one might be inclined to use the translation vector t to distinguish between the two types of joints, it is important to note that t will not in general be zero for a revolute joint. This is because the axis of the coordinate system attached to the link does not necessarily (and usually will not) align with the axis of rotation. In other words, although we are interested in rotation about the axis, Procrustes computes the rotation about the origin of the coordinate system, which is somewhat arbitrarily determined by structure-from-motion. While these rotations themselves are identical, a non-zero translation t is needed to compensate for the misalignment. As a result, we instead determine the type of joint automatically by examining the rotation matrix R: If R is close to the identity matrix, then the joint is determined to be a prismatic joint; otherwise it is a revolute joint. This procedure is repeated for each pair of adjacent links.
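The Procrustes alignment and the joint-type test can be sketched as follows (a minimal illustration assuming NumPy; the function names and the 5° threshold are our own choices, and the similarity estimate follows the standard SVD-based solution surveyed in [34]):

import numpy as np

def procrustes(P, Q):
    """Similarity alignment of corresponding 3D point sets (rows of P and Q):
    find R, t, s minimizing sum_i || Q[i] - (s * R @ P[i] + t) ||^2."""
    muP, muQ = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - muP, Q - muQ
    U, S, Vt = np.linalg.svd(Qc.T @ Pc)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))      # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (Pc ** 2).sum()
    t = muQ - s * R @ muP
    return R, t, s

def classify_joint(R, angle_thresh_deg=5.0):
    """Prismatic if the rotation between the two link alignments is close to
    the identity, otherwise revolute."""
    theta = np.degrees(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))
    return "prismatic" if theta < angle_thresh_deg else "revolute"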
2.4. Finding joint axes

Once the point clouds have been segmented according to the links, the joint parameters could be computed in a probabilistic manner, as in [3]. Instead, we adopt a straightforward geometric approach as outlined in the second part of Algorithm 1. For both revolute and prismatic joints, an axis is a ray in 3D space about or along which the movement occurs. Locating the axis of a prismatic joint is straightforward: The unit vector t/‖t‖ yields the direction of motion along the prismatic joint, while the mean of the points on the second link is used as a point on the axis.
For revolute joints, we use Procrustes analysis combined with Lo-RANSAC to find the 4 × 4 homogeneous Euclidean transformation ^A_B T_a aligning Link a in the second configuration to the same link in the first configuration, where {A} and {B} are the coordinate frames of the two point clouds, respectively. Then the technique is applied again to find the transformation ^A_A T_b to align Link b in the second configuration, where the rotation matrix can be written as
{}^A_A R_b = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}.  (1)
This rotation matrix can be parameterized using the axis–angle representation as a unit vector u ∈ R^3 indicating the direction of a free vector parallel to the axis of rotation, and an angle θ describing the magnitude of the rotation about the axis in the right-hand sense. Similarly, the translation ^A_A t_b is parameterized by computing an arbitrary point ω ∈ R^3 on the axis. The rotation angle 0 ≤ θ ≤ π can be computed from the matrix by the simple formula
\theta = \arccos \frac{r_{11} + r_{22} + r_{33} - 1}{2},  (2)
where the constraint 0 ≤ θ ≤ π arises from the ambiguity that a rotation of θ about u is equivalent to a rotation of −θ about −u. To find the axis u, we note that the eigenvalues of a 3 × 3 rotation matrix are 1 and cos θ ± i sin θ, where i = √−1. Since any vector u parallel to the rotation axis must remain unchanged by the rotation, the vector must satisfy ^A_A R_b u = u. Therefore, from the definition of eigenvalues and eigenvectors, the axis is the eigenvector corresponding to the eigenvalue λ = 1. One way to estimate u, then, is to compute the eigenvalues and eigenvectors of ^A_A R_b and to retain the eigenvector associated with the eigenvalue of 1. If there is no rotation, i.e., ^A_A R_b = I_3, then all three eigenvalues are 1, and the rotation axis is undefined. An alternate, simpler formula for computing the axis of rotation is the following:
u = \frac{\hat{u}}{\|\hat{u}\|}, \quad \text{where} \quad \hat{u} = \begin{bmatrix} r_{32} - r_{23} \\ r_{13} - r_{31} \\ r_{21} - r_{12} \end{bmatrix},  (3)
and ‖·‖ is the L2 norm. Note that this formula not only does not work when θ = 0 (in which case the axis is undefined) but also when θ = π (in which case the formula yields an unhelpful \hat{u} = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}^T).
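For illustration, Eqs. (2) and (3) can be implemented as below (a sketch assuming NumPy, with the eigenvector fallback for the degenerate cases just discussed):

import numpy as np

def rotation_axis_angle(R, eps=1e-6):
    """Recover the rotation angle theta (Eq. (2)) and axis u (Eq. (3)) from a
    3x3 rotation matrix. Returns (u, theta); u is None when theta = 0."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    u_hat = np.array([R[2, 1] - R[1, 2],
                      R[0, 2] - R[2, 0],
                      R[1, 0] - R[0, 1]])
    n = np.linalg.norm(u_hat)
    if n < eps:
        if theta < eps:
            return None, 0.0                      # no rotation: axis undefined
        # theta near pi: fall back to the eigenvector of R for eigenvalue 1
        w, V = np.linalg.eig(R)
        u = np.real(V[:, np.argmin(np.abs(w - 1.0))])
        return u / np.linalg.norm(u), theta
    return u_hat / n, theta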
Algorithm 1 Procrustes-Lo-RANSAC (PLR) algorithm.

ProcrustesLoRansac(p̄, q̄)
Input:  points p̄ = (p^(1), ..., p^(m)) in the first configuration,
        points q̄ = (q^(1), ..., q^(m)) in the second configuration,
        where p^(i) ↔ q^(i), i = 1, ..., m, are corresponding points
Output: link labels λ̄ = (λ^(1), ..., λ^(m)) for each point,
        joint parameters {u_ab, ω_ab} for adjacent links a and b

// motion segmentation
for i ← 1 to m do
    λ^(i) ← None
link ← 0
iter ← 0
while iter < max-iter do
    i, j, k ← Rand3WithoutReplacement(1, m)
    R, t, σ ← Procrustes(p̄, q̄, {i, j, k})
    num-inliers ← 0
    for i ← 1 to m do
        q^(i)′ ← Transform(p^(i), R, t, σ)
        if ‖q^(i)′ − q^(i)‖ < τ and λ^(i) == None then
            λ^(i) ← Temp
            num-inliers ← num-inliers + 1
    if num-inliers > min-num-inliers then
        new-label ← link
        link ← link + 1
        iter ← 0
    else
        new-label ← None
        iter ← iter + 1
    for i ← 1 to m do
        if λ^(i) == Temp then
            λ^(i) ← new-label

// find joint parameters
for each link a do
    b ← FindClosestLink(a)
    L_a ← {i : λ^(i) = a}
    L_b ← {i : λ^(i) = b}
    ^A_B R_a, ^A_B t_a, ^A_B σ_a ← Procrustes(p̄, q̄, L_a)
    q̄′ ← Transform(q̄, ^A_B R_a, ^A_B t_a, ^A_B σ_a)
    ^A_A R_b, ^A_A t_b, ^A_A σ_b ← Procrustes(p̄, q̄′, L_b)
    Given ^A_A R_b in (1), compute u_ab and θ from (2) and (3)
    Compute ^{Aπ}_A R from u_ab using (9)
    Given θ, construct R̈ as in (15)
    Compute t̄ = ^{Aπ}_A R ^A_A t_b
    Extract ẗ and t_z from t̄, as in (17)
    Compute ω̈ from R̈ and ẗ using (19)
    Compute ω_ab from ^{Aπ}_A R, ω̈, and t_z using (20)

Fig. 3. The axis of rotation u is parameterized by angles α and β, and the plane Π_u is perpendicular to u.

Once the axis of rotation u has been found, the 3D rotation about this axis can be thought of as a 2D rotation in the plane Π_u perpendicular to u. Fig. 3 shows the vector u in the current coordinate frame {A}. Let us define a new coordinate frame {Aπ} such that the z^π axis is aligned with u. Therefore, the plane Π_u is the same as the x^π y^π plane in {Aπ}. To transform from {A} to {Aπ}, we rotate about the y axis by α, then about the original x axis by β. By geometry (see the figure), these angles are given by
\cos\alpha = u_z/\eta,  (4)
\sin\alpha = u_x/\eta,  (5)
\cos\beta = \eta,  (6)
\sin\beta = u_y,  (7)
where η = √(u_x² + u_z²) and u = [u_x  u_y  u_z]^T. Since the rotation axes are fixed, the matrices are composed in right-to-left order, yielding an overall transformation of
{}^{A_\pi}_A R = \begin{bmatrix} 1 & 0 & 0 \\ 0 & c_\beta & -s_\beta \\ 0 & s_\beta & c_\beta \end{bmatrix} \begin{bmatrix} c_\alpha & 0 & s_\alpha \\ 0 & 1 & 0 \\ -s_\alpha & 0 & c_\alpha \end{bmatrix}  (8)
= \begin{bmatrix} u_z/\eta & 0 & u_x/\eta \\ u_x u_y/\eta & \eta & -u_y u_z/\eta \\ -u_x & u_y & u_z \end{bmatrix},  (9)
where c_α = cos α, s_α = sin α, and similarly for c_β and s_β.
Now that we have found ^{Aπ}_A R, we can apply this rotation matrix to our data to align the x and y axes with the x^π y^π plane, then rotate about the z axis by θ, then rotate back:
{}^A_A R_b = {}^{A_\pi}_A R^T \underbrace{\begin{bmatrix} c_\theta & -s_\theta & 0 \\ s_\theta & c_\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{R_z^\theta} {}^{A_\pi}_A R.  (10)
Including translation, we have
{}^A q_b = {}^A_A R_b \, {}^A q'_b + {}^A_A t_b = {}^{A_\pi}_A R^T R_z^\theta \, {}^{A_\pi}_A R \, {}^A q'_b + {}^A_A t_b,
where ^A q'_b are the coordinates of a point from Link b in the second configuration, expressed in Frame {A} of the first configuration, according to the transformation obtained by Link a. It is a simple matter to show that any Euclidean transformation (rotation plus translation) can be rewritten as a rotation applied to translated points:
{}^A q_b = {}^A_A R_b \, {}^A q'_b + {}^A_A t_b  (11)
        = {}^A_A R_b \, ({}^A q'_b - \omega) + \omega,  (12)
where
\omega = (I_d - {}^A_A R_b)^{-1} \, {}^A_A t_b,  (13)
and I_d is the d × d identity matrix. Instead of rotating the point, then translating, this alternate formulation involves shifting the origin of the coordinate system, applying the rotation, then shifting the origin back. The point ω ∈ R^d is the temporary origin about which the rotation is applied. As a result, we can rewrite the application of rotation followed by translation in (11) as a shift of the origin, followed by a rotation, followed by a shift back:
{}^A q_b = {}^{A_\pi}_A R^T R_z^\theta \, {}^{A_\pi}_A R \, ({}^A q'_b - \omega) + \omega,
where
\omega = (I_d - {}^{A_\pi}_A R^T R_z^\theta \, {}^{A_\pi}_A R)^{-1} \, {}^A_A t_b,  (14)
where I_d is the d × d identity matrix. As before, if ^{Aπ}_A R is the identity matrix (no rotation), then the point ω about which we are rotating is undefined because I_d − R_z^θ is singular. The point ω is the unique point in 3D where the appropriate 3D rotation about it aligns ^A q'_b with ^A q_b.
Although (14) should work, we have found better results are obtained by an alternate approach in which we use a 3D rotation to align the x–y plane with Π_u (or equivalently, the z axis with u), then apply a 2D version of (13) to compute ω, then rotate back. To see this, let us define R̈ as the upper 2 × 2 part of R_z^θ:
\ddot{R} = \begin{bmatrix} c_\theta & -s_\theta \\ s_\theta & c_\theta \end{bmatrix}.  (15)
Define q̈ as the first two elements of ^{Aπ}q'_b = ^{Aπ}_A R ^A q'_b, and q_z as the third element, so that
{}^{A_\pi} q'_b = \begin{bmatrix} \ddot{q} \\ q_z \end{bmatrix}.  (16)
Define ẗ as the first two elements of t̄ = ^{Aπ}_A R ^A_A t_b, and t_z as the third element, so that
\bar{t} = \begin{bmatrix} \ddot{t} \\ t_z \end{bmatrix}.  (17)
Now we have
{}^A q_b = {}^{A_\pi}_A R^T \left( \begin{bmatrix} \ddot{R}(\ddot{q} - \ddot{\omega}) + \ddot{\omega} \\ q_z \end{bmatrix} + \begin{bmatrix} 0_2 \\ t_z \end{bmatrix} \right),  (18)
where 0_2 is a 2 × 1 vector of all zeros and
\ddot{\omega} = (I_2 - \ddot{R})^{-1} \ddot{t},  (19)
where I_2 is the 2 × 2 identity matrix. Once ω̈ is found, the final 3D point ω is given by
\omega = {}^{A_\pi}_A R^T \begin{bmatrix} \ddot{\omega} \\ t_z \end{bmatrix}.  (20)
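The 2D formulation of Eqs. (15)–(20) can be sketched as follows (our own code, assuming NumPy; in place of the specific matrix of Eqs. (8)–(9) it builds a generic rotation taking u to the z axis, which serves the same purpose, and it assumes θ is not close to zero so that I_2 − R̈ is invertible):

import numpy as np

def rotation_aligning_with_z(u):
    """Any rotation Q with Q @ u = [0, 0, 1] works for the 2D reduction; here
    Q is assembled from u and two perpendicular unit vectors."""
    u = u / np.linalg.norm(u)
    helper = np.array([1.0, 0.0, 0.0]) if abs(u[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    x = np.cross(helper, u)
    x /= np.linalg.norm(x)
    y = np.cross(u, x)
    return np.stack([x, y, u])                    # rows are the new x, y, z axes

def point_on_revolute_axis(R, t, u):
    """Point omega on the joint axis from the link-b alignment (R, t) and the
    axis direction u."""
    Q = rotation_aligning_with_z(u)
    R2 = (Q @ R @ Q.T)[:2, :2]                    # in-plane 2x2 rotation, cf. Eq. (15)
    t_bar = Q @ t                                 # cf. Eq. (17)
    omega2 = np.linalg.solve(np.eye(2) - R2, t_bar[:2])      # cf. Eq. (19)
    return Q.T @ np.array([omega2[0], omega2[1], t_bar[2]])  # cf. Eq. (20)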
Fig. 4. Locating an object with respect to the robot world base {R} requires camera calibration and hand–eye calibration, along with robot calibration (which we assume to be given by the robot kinematics).

3. Manipulating articulated objects

The occlusion-aware 3D articulated model, once obtained, can be used to enable a robot to manipulate the object. That is, given a desired grasp point on the object, the robot can move its end effector to that position, even if the point is not visible in the current view. This is one of the main advantages of the occlusion-aware approach, namely, that the robot is not limited to reasoning about only the side of the object that is currently visible, but rather that a full 3D model is available. Having grasped the object at a point, the robot can then use its knowledge of the articulation axis to move in such a way as to exercise the intended degree of freedom.
We assume that a camera is rigidly mounted on the robot hand, and that the object is placed in the camera’s field of view. There are four coordinate frames relevant to our problem: the object frame {O} centered at the object center, the camera frame {C} centered at the camera lens center, the robot hand frame {H} centered at the robot end-effector, and the robot base frame {R} centered at the robot base. As shown in Fig. 4, locating an object with respect to the robot base requires three calibrations [37]: camera calibration, which describes the relative position and orientation between the object and the camera; hand–eye calibration, which describes the relative position and orientation between the camera and the robot hand; and robot calibration, which describes the relative position and orientation between the robot hand and the robot base. Given a point ^O p ∈ R^4 on the object (in homogeneous coordinates), the point described in the robot frame is given by ^R p = ^R_O T ^O p, where
{}^R_O T = \sigma \, {}^C_O T \, {}^H_C T \, {}^R_H T  (21)
is a 4 × 4 homogeneous Euclidean transformation between the object and robot coordinate frames, and σ is the unknown scale factor of the camera. We assume that the robot kinematic system provides ^R_H T.
The scale σ of the object can be estimated in one of several ways. If the camera is attached to the robot’s hand while capturing the images used to construct the model, then the known positions of the end effector (via the robot’s kinematics) can be used to determine the overall scale of the scene. Alternatively, by placing a calibration target on the table, the image coordinates of the projection can be compared with the known height of the table (again, via the robot’s kinematics) to yield the scale. A third alternative is to simply use a known length on the object. We use the latter.
To perform hand–eye calibration, namely, to estimate ^H_C T, we place a checkerboard calibration target on the table in front of the robot. A classic approach to this problem is for the robotic arm to make a series of motions while the camera captures images of the calibration target at a number of locations [37]. Given the pose of the camera with respect to the world coordinate system for each image frame (estimated using camera extrinsic calibration techniques), along with the pose of the robot hand with respect to the robot base system (provided by the robot kinematics), the transformation ^H_C T can be solved using any of several methods [38,39,37,40–42]. Tsai and Lenz [37] first decouple the rotational part from the translation using the axis–angle representation and solve the equation using least squares with a closed-form solution. The method yields a simple numerical solution, but its performance is limited by errors of the linear system, the parameters of the robot kinematic system, and the estimates of the camera rotation and translation [40]. Park and Martin [42] solve the problem using methods of Lie theory in a similar way to [37], but also deal with the noise present in the measurements. Horaud and Dornaika [40] use a unit quaternion to represent rotation and also find a closed-form solution for rotation and translation simultaneously by a non-linear technique. In [41], dual quaternions and the singular value decomposition (SVD) are used to quickly find a solution. Based on the Camera Calibration Toolbox for Matlab [43] and the Hand–Eye Calibration Toolbox [44], we estimate the pose of the camera relative to the robot hand using the approach of [37].
To estimate the pose of the object with respect to the camera mounted on the robot hand, that is, ^C_O T, we place the object in the camera field of view and take an image of the object at some viewpoint. The relative 3D position and orientation of the object can then be estimated using either the perspective n-point (PnP) algorithm [45] or a variant of POSIT [46]. We have found that the best results were obtained for our objects using the SoftPOSIT algorithm [47].
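As a sketch of how Eq. (21) is used (our own illustration with hypothetical function and argument names; the direction each transform maps is stated explicitly, since superscript/subscript conventions vary), a grasp point expressed in the object model frame can be chained through the three calibrations into the robot base frame:

import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation R and translation t into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def grasp_point_in_robot_frame(p_obj, T_obj_to_cam, T_cam_to_hand, T_hand_to_base, scale):
    """Map a grasp point given in the (unscaled) object model frame into the
    robot base frame by chaining the three calibrations of Eq. (21)."""
    p = np.append(scale * np.asarray(p_obj, dtype=float), 1.0)  # metric, homogeneous
    return (T_hand_to_base @ T_cam_to_hand @ T_obj_to_cam @ p)[:3]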
Fig. 5. Axis estimation of the truck. The top two rows show ten images (out of 121) of the first configuration, and ten images (out of 145) of the second configuration. The last row shows the estimated axis (red line) overlaid on an image and 3D model from each configuration.
Either of these approaches requires 2D–3D correspondences between the current image and the 3D model as an input argument. To produce these correspondences, we use the affine SIFT (ASIFT) feature detector [32] to find features in the current image and in the image sets used for reconstructing the 3D models. For every feature point in the current image, the matching feature point in the sets is found as the one that minimizes the sum-of-squared differences (SSD) between gray-level patches surrounding the two features. These matched points are potential point correspondences. The same matching algorithm is run in the reverse order by swapping the roles of the images, and matches are retained if they agree in both directions. The image in the image sets with the largest number of matches is then used to produce 2D–3D correspondences between the current image and the 3D model. For each oriented 3D point, its closest ASIFT feature in the selected image is found by projecting the point onto the image plane according to the determined camera parameters. Correspondence between the current image and the 3D model is thus established using the matching of these closest ASIFT features.

4. Experimental results

The performance of our PLR algorithm was evaluated on a variety of different real-world objects, such as those that might be found in a home, office, or kitchen. For our experiments, we used a Logitech Quickcam Pro 5000 for collecting images, mounted on a PUMA 500 robotic arm for manipulation.
We first demonstrate the PLR algorithm on the truck encountered earlier. Fig. 5 shows some of the images used to reconstruct the two 3D models, along with the estimated axis overlaid on two of those images and on the 3D models. A total of 121 and 145 images, respectively, were needed to reconstruct the two models. Visually, the axis appears to be quite close to the true axis, indicating that the algorithm was able to accurately segment the links and estimate the position and orientation of the axis.
We also tested the approach on more practical items. Fig. 6 shows the results on five different objects, including a full-sized door, a cabinet, a microwave, a refrigerator, and a drawer. All objects contain a single revolute joint except the last, which contains a single prismatic joint. Again, the method is able to reconstruct fairly dense 3D models, segment the points into the individual links, and accurately estimate the axis of rotation or translation. The resulting models therefore include dense 3D point clouds representing the surfaces of the objects, along with the joint parameters.
To quantify the accuracy of the estimated axes, we computed the angle between the axis and the normal of a plane in the scene, where the plane was obtained by fitting plane parameters to points from the cloud corresponding to the real plane in the scene. For example, in the cases of the door, cabinet, and refrigerator, the floor plane was obtained, and the error was deemed to be the angle between the estimated axis and the normal to the floor.
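The evaluation just described can be sketched as follows (our own code, assuming NumPy): a plane is fit to scene points by SVD, and the reported error is the angle between the estimated axis and the plane normal, or its complement with respect to 90° for axes that should lie parallel to the plane (truck, drawer).

import numpy as np

def fit_plane_normal(points):
    """Least-squares plane normal of an (n, 3) point set via SVD."""
    centered = points - points.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[-1]                                 # direction of smallest variance

def axis_error_deg(axis, plane_points, parallel_to_plane=False):
    """Angular error (degrees) of an estimated joint axis with respect to the
    normal of a plane fit to scene points."""
    n = fit_plane_normal(plane_points)
    a = axis / np.linalg.norm(axis)
    ang = np.degrees(np.arccos(np.clip(abs(a @ n), 0.0, 1.0)))
    return abs(90.0 - ang) if parallel_to_plane else ang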
Table 1
The average and standard deviation of the angular error of the estimated axis using the proposed algorithm (PLR), along with two others, namely Procrustes-RANSAC (PR) and Procrustes-Lo-RANSAC-ICP (PLRI).

Object       | PR Angle (°) | PR Std | PLR Angle (°) | PLR Std | PLRI Angle (°) | PLRI Std
Truck        | 1.0          | 0.4    | 0.9           | 0.0     | 0.8            | 0.0
Door         | 2.8          | 1.7    | 1.1           | 0.2     | 1.1            | 0.2
Cabinet      | 1.0          | 0.4    | 0.8           | 0.2     | 1.0            | 0.2
Microwave    | 1.8          | 0.4    | 1.1           | 0.2     | 1.2            | 0.2
Refrigerator | 4.6          | 0.7    | 4.9           | 0.3     | 4.9            | 0.3
Drawer       | 2.3          | 1.5    | 0.2           | 0.2     | 0.7            | 0.5
Synthetic    | 1.0          | 0.4    | 0.9           | 0.0     | 0.8            | 0.0
For the microwave, truck, and drawer, the same procedure was followed, except that a plane was fit to the table instead of the floor, and the error was subtracted from 90° in the latter two cases, since the truck and drawer axes are parallel to the table. The results of this quantitative assessment are shown in Table 1. The last row of the table shows the results on a synthetic refrigerator created by a 3D drawing program (not shown for space considerations). This table also shows, for comparison, two other versions of the system, one which uses RANSAC instead of Lo-RANSAC (called PR), and one which augments the proposed technique with iterative closest point (called PLRI). For each object, each algorithm was executed five times, and the average and standard deviation of the error are shown. Overall, the PLR algorithm outperforms the other two, both in terms of a lower average error and a lower standard deviation.
Fig. 7 shows examples of objects with multiple joints. The PLR algorithm was able to reconstruct 3D models and estimate the multiple axes. For the dump truck, the angle between the estimated axes was 2.5°, while the angle between the axes for the scraper truck was 7.6°. However, it is difficult from these numbers to assess the accuracy of the system, since the cheap plastic construction of both trucks allows for considerable motion between the parts in all directions, so that the axes are not perfectly parallel even in the real objects. For the LCD arm, the angle between the two vertical axes is 0.1°, while the angle between the vertical axes and the normal of the table is 14°, and the angle between the horizontal axis and the normal of the table is 86.1°. As might be expected, the error accumulates with each subsequent axis in the chain.
Once the occlusion-aware 3D articulated model has been obtained, the robot uses the model to manipulate the object, given a desired grasp point. Fig. 8 shows two example images captured for manipulation, showing the variety of poses encountered. The vast difference between such views and the images used for creating the models makes it difficult for the SIFT algorithm to find matches. We have found that ASIFT produces significantly more matches for this problem.
door    cabinet    microwave    refrigerator    drawer
Fig. 6. Five examples of articulated reconstruction. Each column shows two images of the object in two configurations and two 3D reconstructions with the estimated axis overlaid (red line). The images are arbitrarily selected from two sets of images used for 3D reconstructions. From left to right, the number of images used are 22/19 (door), 17/20 (cabinet), 99/94 (microwave), 24/25 (refrigerator), and 13/18 (drawer).
Fig. 7. Articulated reconstruction of multiple axes for a dump truck, scraper truck, and LCD arm. The display is similar to the previous figure. The number of images used for the two configurations are 125/119 (dump truck), 111/134 (scraper truck) and 50/52 (LCD arm).
Fig. 8. Top: ten images taken at arbitrary viewpoints for pose estimation of the truck. Bottom: the corresponding images that, among all the images used for 3D reconstruction, have the highest number of matched features points for each image.
Fig. 9. Comparison of SIFT (top) and ASIFT (bottom) feature matching. The red lines indicate matches between features detected in an image used for pose estimation (left) and an image used to create the 3D model (right). In this case SIFT finds 4 matches, while ASIFT finds 72.
On four example images, the ratio of the number of matches found by ASIFT to SIFT is 58:2, 70:3, 37:1, and 72:4. The latter example, in which ASIFT found 72 matches but SIFT found only 4, is shown in Fig. 9. Fig. 10 shows the robot interacting with the articulated object.

Fig. 10. The PUMA 500 robotic arm manipulates a truck using the truck’s kinematic model obtained by the occlusion-aware articulated reconstruction procedure.

5. Conclusion

In this paper we have proposed an algorithm called Procrustes-Lo-RANSAC (PLR) to extract the 3D surface and kinematic structure of articulated objects. Multiple images are taken of an object by a single uncalibrated camera in two different configurations, and full ‘‘occlusion-aware’’ 3D models are reconstructed using structure-from-motion techniques. From these models, the rigid links of the object are segmented and aligned, allowing the joint axes to be estimated using a geometric approach. The system supports both revolute and prismatic joints, which are automatically determined. The learned kinematic structure can then be used to perform purposeful manipulation of such objects by exercising their degrees of freedom. The proposed approach does not require prior knowledge of the object, nor does it make any assumptions of planarity. Experimental results show the effectiveness of the algorithm on a range of environmental conditions with various types of objects useful in domestic robotics.
There is plenty of room for further improvements to the approach presented. The current method requires texture on the object to facilitate image correspondence, and it will fail if such texture is not available. For real-world applications, it will be important to remove the need for artificial markers attached to objects, whose sole purpose is to provide this texture. One possible solution to this issue is to use RGBD sensors, which yield a depth value for each pixel directly rather than requiring correspondence to establish depth. The proposed approach should be adaptable to such sensors with only minimal changes in the processing pipeline, leading to much denser 3D models as well as reducing the computation time required. Additional improvements could be obtained by extending the algorithm to handle other types of joints present in articulated objects, such as ball joints, or by improving the robustness of our approach to finding the link structure of multiple axes, which is limited to simple sequential chains of links. Other limitations of the approach presented here are the sparsity of the point clouds, which makes it difficult to estimate a grasp, and the non-real-time nature of the computation. Finally, future work should be aimed at building on our occlusion-aware models by using such knowledge in path planning.

Acknowledgment

The authors gratefully acknowledge the support of NSF grant IIS-1017007.

References
[1] D. Katz, A. Orthey, O. Brock, Interactive perception of articulated objects, in: 12th International Symposium on Experimental Robotics, ISER, Dec. 2010. [2] J. Sturm, A. Jain, C. Stachniss, C. Kemp, W. Burgard, Operating articulated objects based on experience, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2010. [3] J. Sturm, K. Konolige, C. Stachniss, W. Burgard, Vision-based detection for learning articulation models of cabinet doors and drawers in household environments, in: Proceedings of the International Conference on Robotics and Automation, ICRA, 2010. [4] J. Sturm, C. Stachniss, V. Pradeep, C. Plagemann, K. Konolige, W. Burgard, Learning kinematic models for articulated objects, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2009. [5] G.K. Cheung, S. Baker, T. Kanade, Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, June 2003. [6] D.A. Forsyth, O. Arikan, L. Ikemoto, J. O’Brien, D. Ramanan, Computational studies of human motion: part 1, tracking and motion synthesis, Found. Trends Comput. Graph. Vis. 1 (2/3) (2006). [7] A.O. Bălan, L. Sigal, M.J. Black, J.E. Davis, H.W. Haussecker, Detailed human shape and pose from images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2007. [8] P. Guan, A. Weiss, A.O. Bălan, M.J. Black, Estimating human shape and pose from a single image, in: Proceedings of the International Conference on Computer Vision, 2009. [9] H.S. Park, Y. Sheikh, 3D reconstruction of a smooth articulated trajectory from a monocular image sequence, in: Proceedings of the International Conference on Computer Vision, Nov. 2011, pp. 201–208. [10] D. Sinclair, L. Paletta, A. Pinz, Euclidean structure recovery through articulated motion, in: Proc. 10th Scandinavian Conference on Image Analysis, 1997. [11] X. Zhang, Y. Liu, T.S. Huang, Motion analysis of articulated objects from monocular images, IEEE Trans. Pattern Anal. Mach. Intell. 28 (4) (2006) 625–636. [12] C. Tomasi, T. Kanade, Shape and motion from image streams under orthography: a factorization method, Int. J. Comput. Vis. 9 (2) (1992) 137–154. [13] J. Costeira, T. Kanade, A multi-body factorization method for motion analysis, in: Proceedings of the International Conference on Computer Vision, 1995, pp. 1071–1076,.
X. Huang et al. / Robotics and Autonomous Systems 62 (2014) 497–505 [14] P. Tresadern, I. Reid, Articulated structure from motion by factorization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 2, 2005, pp. 1110–1115. [15] M. Paladini, A.D. Bue, M. Stošić, M. Dodig, J. Xavier, L. Agapito, Factorization for non-rigid and articulated structure using metric projections, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, June 2009, pp. 2898–2905. [16] J. Yan, M. Pollefeys, A factorization-based approach for articulated nonrigid shape, motion, and kinematic chain recovery from video, IEEE Trans. Pattern Anal. Mach. Intell. 30 (5) (2008) 865–877. [17] J. Fayad, C. Russell, L. Agapito, Automated articulated structure and 3D shape recovery from point correspondences, in: Proceedings of the International Conference on Computer Vision, Nov. 2011, pp. 431–438. [18] G.M. Bone, A. Lambert, M. Edwards, Automated modeling and robotic grasping of unknown three-dimensional objects, in: Proceedings of the International Conference on Robotics and Automation, ICRA, May 2008. [19] U. Klank, D. Pangercic, R.B. Rusu, M. Beetz, Real-time CAD model matching for mobile manipulation and grasping, in: IEEE-RAS International Conference on Humanoid Robots, Dec. 2009, pp. 290–296. [20] C. Papazov, S. Haddadin, S. Parusel, K. Krieger, D. Burschka, Rigid 3D geometry matching for grasping of known objects in cluttered scenes, Int. J. Robot. Res. (2012) forthcoming. [21] N. Snavely, S.M. Seitz, R. Szeliski, Photo tourism: exploring image collections in 3D, ACM Trans. Graph. (SIGGRAPH) 25 (3) (2006) 835–846. [22] Y. Furukawa, J. Ponce, Accurate, dense, and robust multiview stereopsis, IEEE Trans. Pattern Anal. Mach. Intell. 32 (8) (2010) 1362–1376. [23] Y. Furukawa, B. Curless, S.M. Seitz, R. Szeliski, Reconstructing building interiors from images, in: Proceedings of the International Conference on Computer Vision, Sept. 2009. [24] C.A. Vanegas, D.G. Aliaga, B. Beneš, Building reconstruction using Manhattanworld grammars, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2010. [25] X. Huang, I. Walker, S. Birchfield, Occlusion-aware reconstruction and manipulation of 3D articulated objects, in: Proceedings of the International Conference on Robotics and Automation, ICRA, May 2012. [26] D. Katz, O. Brock, Manipulating articulated objects with interactive perception, in: Proceedings of the International Conference on Robotics and Automation, ICRA, May 2008, pp. 272–277. [27] B. Willimon, S. Birchfield, I. Walker, Rigid and non-rigid classification using interactive perception, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2010, pp. 1728–1733. [28] B. Willimon, S. Birchfield, I. Walker, Classification of clothing using interactive perception, in: International Conf. on Robotics and Automation, ICRA, 2011. pp. 1862–1868. [29] B. Willimon, S. Birchfield, I. Walker, Model for unfolding laundry using interactive perception, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2011. [30] N. Snavely, Bundler: SfM for unordered image collections, 2006. http:// phototour.cs.washington.edu/bundler. [31] Y. Furukawa, J. Ponce, PMVS, 2007. http://www.cs.washington.edu/homes/ furukawa/research/pmvs. [32] J.M. Morel, G. Yu, ASIFT: a new framework for fully affine invariant image comparison, SIAM J. Imaging Sci. 
2 (2) (2009) 438–469. [33] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110. [34] D. Eggert, A. Lorusso, R.B. Fisher, Estimating 3-D rigid body transformations: a comparison of four major algorithms, Mach. Vis. Appl. 9 (5–6) (1997) 272–290. [35] O. Chum, J. Matas, J. Kittler, Locally optimized RANSAC, in: DAGM-Symposium, 2003, pp. 236–243. [36] J. Franco, E. Boyer, Learning temporally consistent rigidities, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, June 2011, pp. 1241–1248. [37] R.Y. Tsai, R.K. Lenz, A new technique for fully autonomous and efficient 3D robotics hand/eye calibration, IEEE Trans. Robot. Automat. 5 (3) (1989) 345–358. [38] N. Andreff, R. Horaud, B. Espiau, Robot hand-eye calibration using structure from motion, Int. J. Robot. Res. 20 (3) (2001) 228–248.
[39] K.H. Strobl, G. Hirzinger, Optimal hand-eye calibration, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2006, pp. 4647–4653. [40] R. Horaud, F. Dornaika, Hand-eye calibration, Int. J. Robot. Res. 14 (3) (1995) 195–210. [41] K. Daniilidis, Hand-eye calibration using dual quaternions, Int. J. Robot. Res. 18 (3) (1999) 286–298. [42] F.C. Park, B.J. Martin, Robot sensor calibration: solving AX = XB on the Euclidean group, IEEE Trans. Robot. Automat. 10 (5) (1994) 717–721. [43] J.-Y. Bouguet, 2010. Complete camera calibration toolbox for Matlab, http:// www.vision.caltech.edu/bouguetj/calib_doc. [44] C. Wengert, Camera calibration and hand to eye calibration toolbox, http:// www.vision.ee.ethz.ch/software/calibration_toolbox//calibration_toolbox. php. [45] F. Moreno-Noguer, V. Lepetit, P. Fua, Accurate non-iterative O(n) solution to the PNP problem, in: Proceedings of the International Conference on Computer Vision, Oct. 2007. [46] D. Dementhon, L.S. Davis, Model-based object pose in 25 lines of code, Int. J. Comput. Vis. 15 (1–2) (1995) 123–141. [47] P. David, D. DeMenthon, R. Duraiswami, H. Samet, SoftPOSIT: simultaneous pose and correspondence determination, Int. J. Comput. Vis. 59 (3) (2004) 259–284.
Xiaoxia Huang received a Ph.D. degree in Electrical and Computer Engineering from Clemson University in 2013. She received her B.S. and M.S. degrees in Electrical Engineering from Xi’an University of Technology, Xi’an, China, in 2000 and 2003, respectively. She worked at Xi’an Technological University, Xi’an, China, as a research assistant professor on software/hardware development and now works in industry. Her research interests include computer vision, robotics, pattern recognition, and automation.
Ian Walker is a Fellow of the IEEE and a Senior Member of the AIAA. He has served as Vice President for Financial Activities for the IEEE Robotics and Automation Society, and as Chair of the AIAA Technical Committee on Space Automation and Robotics. He has also served on the Editorial Boards of the IEEE Transactions on Robotics, the IEEE Transactions on Robotics and Automation, the International Journal of Robotics and Automation, the IEEE Robotics and Automation Magazine, and the International Journal of Environmentally Conscious Design and Manufacturing. His research has been funded by DARPA, the National Science Foundation, NASA, NASA/EPSCoR, NSF/EPSCoR, the Office of Naval Research, the US Department of Energy, South Carolina Commission of Higher Education, Sandia National Laboratories, and Westinghouse Hanford Company. Stan Birchfield is an Associate Professor of Electrical and Computer Engineering at Clemson University on leave of absence with Microsoft Research in Redmond, Washington. He received the B.S. degree from Clemson University, Clemson, SC, in 1993, and the M.S. and Ph.D. degrees from Stanford University, Stanford, CA, in 1996 and 1999, respectively. While at Stanford, his research was supported by a National Science Foundation Graduate Research Fellowship, and he was part of the winning team of the AAAI Mobile Robotics Competition of 1994. He spent several years as a research engineer with a startup company in Palo Alto, CA and has consulted with various companies over the years. His research interests include visual correspondence, tracking, and segmentation, particularly applied to real-time robotic systems.