Successively alternate least square for low-rank matrix factorization with bounded missing data

Successively alternate least square for low-rank matrix factorization with bounded missing data

Computer Vision and Image Understanding 114 (2010) 1084–1096 Contents lists available at ScienceDirect Computer Vision and Image Understanding journ...

874KB Sizes 0 Downloads 39 Views

Computer Vision and Image Understanding 114 (2010) 1084–1096

Contents lists available at ScienceDirect

Computer Vision and Image Understanding journal homepage: www.elsevier.com/locate/cviu

Successively alternate least square for low-rank matrix factorization with bounded missing data Keke Zhao, Zhenyue Zhang * Department of Mathematics, Zhejiang University, Yu Quan Campus, Hangzhou 310027, China

a r t i c l e

i n f o

Article history: Received 14 October 2009 Accepted 13 July 2010 Available online 21 July 2010 Keywords: Matrix complement Matrix factorization Missing data Low-rank matrix Computer vision 3D reconstruction

a b s t r a c t The problem of low-rank matrix factorization with missing data has attracted many significant attention in the fields related to computer vision. The previous model mainly minimizes the total errors of the recovered low-rank matrix on observed entries. It may produce an optimal solution with less physical meaning. This paper gives a theoretical analysis of the sensitivities of the original model and proposes a modified constrained model and iterative methods for solving the constrained problem. We show that solutions of original model can be arbitrarily far from each others. Two kinds of sufficient conditions of this catastrophic phenomenon are given. In general case, we also give a low bound of error between an -optimal solution that is practically obtained in computation and a theoretically optimal solution. A constrained model on missing entries is considered for this missing data problem. We propose a two-step projection method for solving the constrained problem. We also modify the method by a successive alternate technique. The proposed algorithm, named as SALS, is easy to implement, as well as converges very fast even for a large matrix. Numerical experiments on simulation data and real examples are given to illuminate the algorithm behaviors of SALS. Ó 2010 Elsevier Inc. All rights reserved.

1. Introduction Matrix factorization is an essential tool for solving several computer vision problems including structure from motion (SFM) [21,11], illumination based reconstruction [8], non-rigid model tracking [2,5], plane-based pose estimation [20], and motion segmentation [22]. These problems have received a lot of attentions in the past decade [7]. In these applications, data sets are generated from feature points in a sequence of images. In the ideal case when the data collection is complete, these problems can be solved easily by algorithms based on the classical low-rank approximation approaches. However, the available data are generally incomplete. For example, feature points may be beyond the frame domain or blinded by an object in the SFM problem. So these feature points cannot be captured by cameras, and hence their data are unknown in the data collection. In the illumination based reconstruction problem, feature points may be shadowed by other objects or pixels of feature points may be saturated in the imaging process. These pixel values do not arise from the true light reflection and do not satisfy the well-known Lambertian assumption. Obviously, the erroneous pixels cannot been used in the corresponding models. Practically, if missing data occurs, it is difficult to solve these problems efficiently. These reconstruction problems with missing data are uniformly modeled in literature as a low-rank approximation problem that * Corresponding author. E-mail addresses: [email protected] (K. Zhao), [email protected] (Z. Zhang). 1077-3142/$ - see front matter Ó 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.cviu.2010.07.003

minimizes the total reconstruction errors corresponding to known data. This problem can be mathematically stated as follows. Let K be the set of index-pairs of known entries in an m  n matrix W = (wij), i.e., only partial entries fwij ; ði; jÞ 2 Kg of W are available. We want to find out a matrix of rank r, say X, such that its entries xij with ði; jÞ 2 K can fit the known wij of W as much as possible. That is, the optimal rank-r matrix X solves the minimization problem

X

min

rankðXÞ6r

ðxij  wij Þ2 ;

ð1Þ

ði;jÞ2K

Writing X = RS with m  r factor R and r  n factor S, the above problem is equivalent to that in matrix form

min

R2Rmr ;S2Rrn

kðRS  WÞ  Mk2F ;

ð2Þ

where  denotes the Hadamard (or componentwise) product of matrices and M = (mij) the character matrix of K, i.e.,

 mij ¼

1; ði; jÞ 2 K; 0; ði; jÞ R K:

The above nonlinear (and also non-convex) model for addressing the missing data problem was first formulated in [19] and then attracted a lot of research interests. Variant algorithms were proposed in literature for solving the problem numerically; see [3,24,16,22,4] or the survey paper [3].

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

In general, the matrix W is not low-rank exactly, due to the existence of noise, as well as the model errors under some simplification assumptions in application. The existing algorithms for solving the problem (1) may loss their efficiency as the number of missing entries increases [4]. A mathematically optimal solution of (1) may be meaningless in real application if the observed entries are not well distributed. This phenomenon may occur if the percentage of missing part is large. To reduce the possibility of getting a meaningless solution, a regularized model is considered in [3] by adding a penalty term that restricts the norms of factors R and S into the original target function. Recently, the rigid restrict of low-rank on a solution X is relaxed and implicitly imposed by the nuclear norm of X as a regularization term. See [18,10,14] for example. However, this kind of regularization cannot address the missing data problem efficiently since such a solution may not be strictly low rank. Hence a truncation technique is required to obtain an approximation solution with given rank [18] but it also rises the approximation error. What we are interested is the uncertainty of solutions to the model problem (1). The key observation is that an optimal solution to (1) may be very unrealistic compared with the ‘‘true” one that we want to estimate in application. In this paper, we give a theoretical analysis to show the uncertainty of the problem. It may be known that the model (1) may have multiple solutions. We further prove that multiple solutions of (1) can be arbitrarily large or arbitrarily far away from each others under some conditions. Two kinds of sufficient conditions are given on the factors R and S of an optimal solution: One is theoretical and it is presented in terms of an intersection of the space of missing pattern and a union of some spaces spanned by partial columns of R and rows of S. The other one is algebraically checkable and it is related with the singularity of row-blocks of R and column-blocks of S. We also analysis the uncertainty of (1) by considering -optimal solutions, i.e., approximate solutions that have values of k(RS  W)  MkF slightly larger than the minimum within . A low bound of the error between an -optimal solution and an optimal solution is given in terms of the smallest singular values of row-blocks of R and column-blocks of S. We show that if one of these submatrices is not well-conditioned, an -optimal solution computed iteratively may be suspicious as an meaningful solution. This is important in real applications. In order to reduce the above risks, we will further consider a modified model for solving the missing data problem, assuming that the entries of W are restricted in known intervals. That is, our framework takes account of the restriction on the missing entries of W, leading to a constrained low-rank approximation problem. Practically, the constraint intervals can be estimated by the known entries for those missing data shielded by other observed objects. To our best knowledge, this model has not been considered before in both fields of the numerical algebra and the computer vision. This constrained model does not completely address the problem of multiple solutions. However, multiple solutions, if exist, cannot be arbitrarily large or far away from each others due to the entry boundary. Similarly, an -optimal solution is also bounded. 
We will propose an iterative method to solve the constrained problem, based upon two orthogonal projections: One is the projection onto the space of low-rank matrices and the other one is the projection onto the space of so-called template matrices whose entries are constrained (the detailed definition will be given in Section 4). This algorithm will be speeded up by a successive iterative method to pursuit a meaningful solution. In this algorithm, a sequence of variant template matrices are used to channel the iterative solutions towards the required one as the iteration goes on. We believe that this approach will be more reliable if integrating more prior knowledge of a real solution into the model. The rest of this paper is organized as follows. In Section 2, we briefly review some existing algorithms for solving the problem

1085

(1). The theoretical analysis on the uncertainty or sensitivity of (1) is given in Section 3. We propose the modified constrained model in Section 4. An effective method of solving the constraint problem is presented in Section 5. It successively modifies the template matrix in each iteration. In the last Section 6, we compare the proposed algorithm with others numerically by some toy data sets and two real-world data. Numerical behaviors of the algorithm depending on the parameters are also illustrated. 2. Overview of related work It is known that any submatrix of W has rank at most r, provided W is of rank r exactly. This rank restriction helps to determine some entries from others directly. For example, consider a submatrix A of W in which only one entry is unknown. Deleting the row or column of A containing the unknown entry, if the remaining matrix B is also of full rank, then the unknown entry can uniquely determined easily. The retrieved entries are also useful for successively recovering other missing entries. This proposition is used in the earlier direct methods such as imputation [21] and Jacobs’ method [9]. A similar work can be found in [1] in which missing entries are computed one-by-one by solving this model problem successively: the unique unknown in a submatrix A is determined by minimizing the (r + 1)st singular value of A by taking the unknown as a variable. Obviously, the efficiency of the direct methods highly depends on the pattern of known entries and submatrices like A. For example, if the remainder submatrix B of A is approximately singular in this procedure, the later recovered entries may have larger errors for noisy data. On the other hand, there are no guarantees to recover all missing entries if a high percentage of entries are unknown. So iterative methods are prior for solving the missing data problem. In the remaining part of this section, we focus on two classical iteration methods: Newton-like methods and alternative methods. Writing a rank-r candidate X in the form X = RS with independent matrix R of order m  r and matrix S of order r  n, the error

EðR; SÞ ¼ kðX  WÞ  Mk2F ¼ kðRS  WÞ  Mk2F : is a function of variables of R and S. Furthermore, representing all the variables of R and S as a column vector x of length (m + n)r, E(R, S) is a function of x,

f ðxÞ ¼ EðR; SÞ: A Newton-like algorithm for minimizing f(x) needs to compute the gradient vector rf and the Hessian matrix H of f(x). The classical Newton step is to solve the linear system Hkd = rfk for updating the current estimation xk as xk+1 = xk + d. Because Hk may be approximately singular, a regularization technique is required. For example, the following regularization version is used in literature:

ðHk þ kIÞd ¼ rfk ; with a preassigned positive parameter k. For a sufficiently large k, the solution d has the direction rfk approximately and f(xk+1) < f(xk). Starting at a small initial k, the damped Newton algorithm [3] increases k in this way k 10k until f(xk+1) < f(xk) holds. Generally speaking, Newton-like methods converge fast if the initial guess is close to a local minimum due to the quadratic convergence. However, considering the size (m + n)r  (m + n)r of of H, the computation cost of solving the linear system Hd = b is expensive since H is large in scale if m or n is not small. Beside of the expensive computational cost, a large memory is also required for the Hessian matrix. Clearly, E(R, S) is a quadratic function of R and S. Minimizing E(R, S) with respect to R or S gives a least squared (LS) problem. Practically, it can be divided into several independent LS problems

1086

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

in much smaller scales. For example, let S = [s1, . . . , sn], then minSE(R, S) is equivalent to the n independent LS problems

min kRðIj ; :Þsj  WðIj ; jÞk;

j ¼ 1; . . . ; n;

sj

ð3Þ

Let ‘ be the order of i in Ij such that eTi R ¼ eT‘ RðIj ; :Þ. Taking partial derivatives with respect to variable rik with i 2 Ij on the two sides of @ðQ w Þ @s the equalities yields @rj j ¼ e‘ skj þ RðIj ; :Þ @r j and ek eT‘ wj ¼ ek eT‘ R ik ik @sj T T ðIj ; :Þsj þ RðIj ; : Þ e‘ skj þ RðIj ; : Þ RðIj ; :Þ @r . It follows that ik

where Ij denotes the index set of i’s such that ði; jÞ 2 K; RðIj ; :Þ is the submatrix of R consisting of those rows ri with i 2 Ij, and W(Ij,j) is the patch of the jth column of W corresponding to Ij. Thus, the alternate Least Squares (ALS) algorithm follows,

8 < Rkþ1 ¼ arg min EðR; Sk Þ; R

ð4Þ

: Skþ1 ¼ arg min EðRkþ1 ; SÞ: S

ALS and its variants summarized in [3] work well under mild conditions. The value of objective function decreases quickly at early iterations but quite slowly as the iteration goes on [3]. Mathematically, if all submatrices R(Ij, :) are of full column rank, minSE(R, S) has the unique solution S = S(R). The missing data problem (1) is then equivalent to

 T   @sj ^ j  RðIj ; : ÞT e‘ skj RðIj ; :Þ ¼ RðIj ; : Þy ek eT‘ w @rik  T ^ j  Q j e‘ skj ; ¼ RðIj ; : Þy ek eT‘ w ^ j ¼ wj  RðIj ; :Þsj ¼ ðI  Q j Þwj . Hence, where w ^ j  Q j e‘ skj , i.e., ek eT‘ w

@ðQ j wj Þ ¼ e‘ skj @rik

 T þ RðIj ; : Þy

 T @ðQ j wj Þ ^ j: ¼ ðI  Q j Þe‘ sTj þ RðIj ; : Þy eT‘ w T @ðei RÞ b  R. Since sT DT ei ¼ sT DðIi ; : ÞT e‘ ¼ eT DðIi ; :Þsj and Let D ¼ R ‘ j j DTei = D(Ij,:)Te‘, the vector in (6) has the simple expression  T X @Q j wj T ^ j  ðI  Q j ÞDðIi ; :Þsj  RðIj ; : Þy ðI  Q j Þwj  D ei ¼ w T @ðei RÞ i2I j

min EðR; SðRÞÞ:

ð5Þ

R

The framework of this methodology is given in [17] which considered a general minimization problem with multiple variables. Since S(R) has columns

 1 sj ¼ RðIj ; : Þy WðIj ; jÞ ¼ RðIj ; : ÞT RðIj ; :Þ RðIj ; : ÞT WðIj ; jÞ; where R(Ij, :)  = (R(Ij, :)TR(Ij, :))1R(Ij, :)T is the Moore-Penrose generalized inverse of R(Ij, :), the cost function in (5) has the simple expression n  X  ðI  Q j Þwj 2 ; EðR; SðRÞÞ ¼

wj ¼ WðIj ; jÞ;

j¼1

where Qj = R(Ij, :)R(Ij, :)  is the orthogonal projection onto the column space of R(Ij, :). Clearly, E(R, S(R)) is a rational function with respect to the entries of R. One may solve the problem (5) by Newton-like methods iteratively. The Gauss–Newton method is used in [23], leading to the Wiberg algorithm. In [16], the Wiberg algorithm is derived in this way: The cost function E(R, S(R)) is represented in terms of Kronecker tensor products of matrices at first, and then the orthogonal projection QF onto the column space of   F ¼ SM SðRÞT  Im is used to represent the resulting LS problem, where SM is the matrix selecting the rows corresponding to nonzero entries in M.1 The Wiberg algorithm can be derived in a simple way. For completeness, we show the derivation as follows. Practically, the Gauss–Newton method is an iterative algorithm P 2 2 for minimizing a squared nonlinear function i fi ðxÞ ¼ kf ðxÞk , where f is the vector function of fi’s. At the current point x, Gauss–Newton updates x to ^ x by minimizing the squared norm of the first order approximation f ðxÞ þ rf ðxÞð^ x  xÞ to f,

^x ¼ arg min kf ðxÞ þ rf ðxÞð^x  xÞk2 : ^x

Hence, the iteration step of the Gauss–Newton method from R b is to R

2   n  X @ðQ j wj Þ X   b ¼ arg min b  RÞei  : ðI  Q j Þwj  R ð R   T @ðe RÞ bR j¼1   i i2Ij

ð6Þ

Whath we should pay i attentions on is a simple formula of @ðQ w Þ @ðQ w Þ ¼ @rj j ;    ; @rj j . We can derive it from the two equalities

@ðQ j wj Þ @ðeTi RÞ

i1

Q j wj ¼ RðIj ; :Þsj ; 1

b i ; :Þsj ^ j  ðI  Q j Þ RðI ^j ¼ w  DðIj ; : ÞT w  T b j ; : ÞT w ^ j:  RðIj ; : Þy RðI b minimizes It follows that R

( 2 ) 2  T X     ^ y T b b  ^ wj  ðI  Q j Þ RðIj ; :Þsj  þ  RðIj ; : Þ RðIj ; : Þ wj   j

¼

2 X  2 X ^ ^T b y b j wj  ðI  Q j ÞC j Rs  þ wj C j RRðIj ; : Þ  ; j

j

where Cj is the selection matrix consisting of the rows eTi of I with    2  A a   i 2 Ij. It is not difficult to represent the error as   B x  0  with  T b a¼ w ^ T T , and ^ ;...;w x the vector representation of R; 1

2 6 A¼6 4

3

sT1  ðI  Q 1 ÞC 1 7 .. 7; 5 . sTn  ðI  Q n ÞC n

2 6 B¼6 4

n

3 ^ T1 C 1 Þ ðRðI1 ; : Þy ÞT  ðw 7 .. 7: 5 . y T ^ Tn C n Þ ðRðIn ; : Þ Þ  ðw

b can be obtained by solving the normal equation Therefore, R Gx = b with G = ATA + BTB and b = ATa. We point out that it is not required to compute the orthogonal projection QF = I  F(FTF)1FT according to our derivation. Note that the size of F ¼ SM   SðRÞT  Im is K  (mr) with K the number of observed entries in W. If W has many known entries, QF is a matrix in large scale. Regularization is required for solving the linear system Gx = b from the computational point of view. The well-known Levenberg–Marquardt (LM) method solves the regularized normal equations (G + kI)x = b or (G + kdiag(G))x = b with the diagonal matrix diag(G) of G (see [15]). This is the Tikhonov regularization similar with the damping technique in the damped Newton. Applying the damped Newton method on (5) results in the algorithm LM_S [4]. Different from the LM algorithm, LM_S requires computing the Hessian matrix of E(R, S(R)); see the construction details given in [4]. Obviously, computing Hessian matrices continually is very complicated. We point out that the above analysis for deriving the gradient of Qjwj can also be used for constructing the gradient vector and the Hessian matrix of E(R, S(R)). 3. Uncertainty of original model for missing data problems

ik

T

T

RðIj ; : Þ wj ¼ RðIj ; : Þ RðIj ; :Þsj :

The representation of F in Eq. (7) of [16] should be F = V  I.

Let us consider the linear mapping from a matrix to its restriction on the observation set K at first:

/ : Z # Z  M:

1087

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

The null space of /, denoted by Nð/Þ, plays a key role in our analysis on the uncertainty of model (1). Let RS be an optimal solution of (1) or (2), where R has r columns and S has r rows. We will give sufficient conditions related with the factors R and S, under which two optimal solutions can have arbitrarily large distance to each other. Assume there is a nonzero Z in Nð/Þ such that span(Z) is a subspace of span(R), where span() denotes the column space of a matrix. Then Z can be factorized as Z = RT with a matrix T 2 Rrn. Let us consider a shift of the optimal RS along Z:

has an equivalent representation that takes the structure of M into account. More importantly, this condition is easily checkable. To show it, as defined before, let Ij be the set of i’s such that ði; jÞ 2 K, or equivalently the index set of nonzero entries in jth column of M. Similarly, let Ji be the set of j’s such that ði; jÞ 2 K. It is also the index set of nonzero entries in the ith row of M. Lemma 2. Let RS be an optimal solution of problem (1). Then ðUR [ VS Þ \ Nð/Þ – f0g, if and only if there is a column-rank deficient submatrix R(Ij,:) or row-rank deficient submatrix S(:, Ji).

X ¼ RS þ aZ ¼ RðS þ aTÞ; with a scalar parameter a. Obviously, X has rank at most r and it is also an optimal solution of (1) since

kðW  XÞ  MkF ¼ kðW  RSÞ  MkF : However, the error kRS  XkF = jajkZkF of these two optimal solutions can be arbitrarily large since a is free. A similar conclusion related with S is also true if the row space of Z is contained by the row space of S. This analysis can be extended to a more general case. To show the general conditions, we give some notations at first. Let K be a subset of index set {1, 2, . . . , r}; K can be the whole set or empty. Kc is the complement of K. Let URð:; KÞ and VSðK c ;:Þ be the range spaces of the orthogonal projections, n o

c URð:;KÞ ¼ Rð:; KÞF j F 2 RjKjn ; VSðK c ;:Þ ¼ HSðK c ; :Þ j H 2 RmjK j : ð7Þ

We will consider motions of RS along a direction chosen from the space of the sum of URð:; KÞ and VSðK c ;:Þ ,

n o c URð:;KÞ þ VSðK c ;:Þ ¼ Rð:; KÞF þ HSðK c ; :Þ j F 2 RjKjn ; H 2 RmjK j :

Theorem 1. Let RS be an optimal solution of problem (1) and let UR and VS be given in (7). If

Nð/Þ \

[

URð:;KÞ þ VSðK c ;:Þ – f0g;

ð8Þ

K

then there is an optimal solution X of (1) such that kRS  Xk can be arbitrarily large.

Proof. Let Z 2 [K ðURð:; KÞ þ VSðK c ;:Þ Þ \ Nð/Þ be nonzero. We write Z = R(:, K)F + HS(Kc,:) with a certain index set K. Then for an arbitrary a, c

c

c

RS þ aZ ¼ Rð:; KÞSðK; :Þ þ Rð:; K ÞSðK ; :Þ þ aðRð:; KÞF þ HSðK ; :ÞÞ c

c

¼ Rð:; KÞðSðK; :Þ þ aFÞ þ ðRð:; K Þ þ aHÞSðK ; :Þ ¼ ½Rð:; KÞ; ðRð:; K c Þ þ aHÞSðK; :Þ þ aFSðK c ; :Þ: Hence RS + aZ has rank not larger than r. Since (RS + Z)  M = (RS)  M, RS + aZ is also an optimal solution. h Theorem 1 shows the risk that a computed optimal solution of (1) may be arbitrarily far from a ‘‘meaningful” solution in a real application. We point out that there are other sufficient conditions for this arbitrariness of optimal solutions. For example, assume that there is a nonzero matrix Z 2 Nð/Þ with rank less than or equal to 2r. Then for any splitting of Z into a sum of two rank-r c , i.e., Z ¼ W  W c , then both W and W c are optimatrices W and W mal solutions to the problem (1). The condition (8) in Theorem 1 may be a bit difficult for checking. In the special case when K is the whole set {1, 2, . . . , r} or K is empty, URð:;KÞ þ VSðK c ;:Þ is equal to UR or VS , respectively. We will show that in that case, the sufficient condition

ðUR [ VS Þ \ Nð/Þ – f0g;

Proof. Assume that R(Ij,:) is rank deficient. Let z – 0 be a vector satisfying R(Ij,:)z = 0. Then Z ¼ Rz eTj 2 ðUR [ VS Þ \ Nð/Þ, where ej is the jth column of the identity matrix of order n. Inversely, suppose there is a nonzero Z 2 ðUR [ VS Þ \ Nð/Þ. Without loss of generality we assume that Z 2 UR \ Nð/Þ. We can rewrite Z = RC with a nonzero matrix C since Z 2 UR . Assume that the jth column Cej of C is nonzero. The condition Z 2 Nð/Þ implies (Z  M)ej = 0, i.e., Z(Ij,j) = 0. Hence, R(Ij,:)Cej = Z(Ij,j) = 0. It follows that R(Ij,:) must be column-rank deficient since Cej – 0. h Therefore, we have the following corollary. Corollary 3. Let RS be an optimal solution of problem (1). If there is a column-rank deficient submatrix R(Ij,:) or row-rank deficient submatrix S(:, Ji), then there is an optimal solution X of (1) such that kRS  Xk can be arbitrarily large. A geometric illustration for multiple solutions was given in [13] for the SFM problem: One frame of the image stream has all the observed feature points that are coplanar. This geometric condition is equivalent to the algebra condition that S(:, Ji) is rank deficient, where Ji is the index set of the observed points in this frame. Practically, in the orthographic camera model of SFM in which r = 4, a point pj observed by camera i can be represented as

pj ¼ ci þ ½ui ; v i ; wi ½xij ; yij ; zij T ; where ui and vi span the camera plane centered at ci, and the coordinates of pj in the frame are

"

xij yij

#

"

¼ ½ui ; v i T ðpj  ci Þ ¼

uTi

ai

v

bi

T i

#

pj 1

 ;

with ai ¼ uTi ci and b ¼ v Ti ci . It is known that

2

.. 6 . 6 T 6 ui RðIj ; :Þ ¼ 6 6 vT 6 i 4 .. .

3 .. . 7 7 ai 7 7 ; bi 7 7 5 .. . i2I

 Sð:; J i Þ ¼

   pj



1





 ; j2J i

j

where Ij is the index set of cameras capturing the point pj and as shown before, Ji is the index set of points observed by camera i. Because S(:, Ji) has only four rows, S(:, Ji) is of full row-rank is equivalent to the geometric condition that all the points pj, j 2 Ji captured by camera i are not coplanar. However, even if all the points observed by each frame are not coplanar, i.e., all S(:, Ji) are of full row-rank, R(Ij,:) may have rank less than its column number, and hence, by Corollary 3, multiple solutions may also exist. Here are two examples for rank deficient R(Ij,:). one is that all the camera planes capturing point pj are parallel, resulting in R(Ij,:) with rank at most 3. Multiple physical solutions exist obviously since moving this point along the direction orthogonal to these planes gives same two-dimensional projection onto these frames. The other example shown below yields multiple solutions mathematically. Assume that the point pj is observed by only two cameras whose camera planes can be not parallel. Then R(Ij,:) can be

1088

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

2

uT1 6 vT 6 RðIj ; :Þ ¼ 6 T1 4 u2

a1

v T2

b2

3

b1 7 7 7: a2 5

Without loss of generality, we can assume that {u1,v1} and {u2,v2} share a common vector, say, v2 = v1. If c2 = c1 + u1, then b2 ¼ v T2 c2 ¼ b1 . Obviously, R(Ij,:) is of at most rank 3, and by Corollary 3, multiple solutions exist.2 More generally, if the intersections of two camera planes capturing pj are parallel and the origins of these camera plans belong to a plane orthogonal to these intersection lines, then R(Ij,:) must be column-rank deficient. One can prove it similarly. For real-word data sets, Rj = R(Ij,:) or Si = S(:, Ji) may be not exactly rank deficient because of noise in W. However, Rj or Si may be approximately rank deficient since their condition number j(Rj) or j(Si) can be very large.3 For example, for the kuls data that will be tested in Section 6, the computed optimal solution RS has an almost singular submatrix Si, since

max jðSi Þ ¼ 1:6041  103 : i

In that case, an -optimal solution may also far from the optimal one even if the optimal solution is unique. The following theorem gives a low bound for errors of -optimal solutions to the optimal one. Theorem 4. Let RS be an optimal solution of problem (1). Assume that all of R(Ij,:) or S(:, Ji) are of full rank. Then for any small  > 0, bb there exists a matrix R S of rank r such that

bb kðW  R SÞ  MkF 6  þ kðW  RSÞ  MkF ; bb kR S  RSk P  max fqðRÞ; qðSÞg; F

where qðRÞ ¼ min rrrrðRÞ ; qðSÞ ¼ min rrrr ðSÞ and rr() denotes the rth ðRðIj ;:ÞÞ ðSð:; J i ÞÞ j i singular value of a matrix in decreasing order. Proof. Let lj = rr(R(Ij,:)) denote the rth singular value of R(Ij,:). Let zj and uj be the normalized left and right singular vectors of R(Ij,:) corresponding to the smallest singular value lj, respectively, i.e.,

RðIj ; :Þzj ¼ lj uj ;

uTj RðIj ; :Þ ¼ lj zTj :

Given a small  > 0, we consider a modification on the jth colb ¼ R, umn of S and set R

b S ¼ S þ aj zj eTj ;

 aj ¼ : lj

bb Note that the resulting matrix R S is also of rank at most r. It is easy to verify that this modification yields a small change on the observed entries,

bb kð R S  RSÞ  MkF ¼ kaj RðIj ; :Þzj k2 ¼ aj lj ¼ : bb Thus, R S is an -optimal solution since

bb kðW  R SÞ  MkF 6  þ kðW  RSÞ  MkF : On the other hand,

bb kR S  RSkF ¼ aj kRzj k2 P aj rr ðRÞ  qj ðRÞ: The proof for S is similar and hence we omit it here. h 2 If these two camera planes are not parallel, pj can be uniquely determined from the multiple solutions by imposing the orthogonal conditions on {u1, v1} and {u2, v2} and the all-one condition on the last row of S as shown in [13]. 3 The condition number of R or S is defined by the ratio of the largest singular value and the rth largest singular value.

Table 1 The sensitivity of solutions. Data

Algo.

Residuum

Deviation

q(R)

q(S)

Dinosaur

SALS DNH

79.06 78.98

2720.40

20.40 22.24

14.91 14.95

Kuls

SALS DNH

4.17 4.12

132.62

4.67 7.37

316.96 491.61

The lower bound in Theorem 4 clearly shows the risk that an optimal solution of (1) may be meaningless in real application if an optimal solution RS has a nearly rank-deficient block R(Ij,:) or S(:, Ji). Practically, an optimal solution of (1) may be over-estimated to the intrinsic meaningful solution. Two approximate optimal solutions may be far from each other. This phenomenon occurs in real-world data sets. In Table 1, we list the residua k(W  RS)  bb MkF for two computed solutions and their deviation k R S  RSkF for the real-world data sets dinosaur and kuls, respectively. Where, SALS is our algorithm that will be proposed in the next section, and DNH is the damped Newton hybrid algorithm in [3]. For each of the data sets, the difference between the residual values of the two computed solutions is ignorable, but their deviation is very large relatively. It also implies that at least one of the two solutions is over-estimated. 4. Constrained models Due to the numerical sensitivities of minimizing the residual function k(W  RS)  MkF and the risk in retrieving the best lowrank approximation of W, it requires to modify the original model (1) with some prior information. In [3], a modified version that penalizes the scales of the two factors R and S was discussed, resulting in the penalized problem

min kðW  RSÞ  Mk2F þ kkRk2F þ lkSk2F : R;S

Recall that in practical application, a meaningful solution RS to the corresponding missing data problem always has some natural constraints on entries. For example, the factor R should satisfy the constraint of rigid motion in the SFM problem. In a general framework, a more reasonable model for solving the low-rank approximation problem with missing data can be

min

kðW  RSÞ  MkF ;

s:t: ðR; SÞ 2 D;

ð9Þ

where D denotes the feasible domain of practically meaningful approximations. Usually, constructing the constraint domain D depends on individual problems. For example, orthogonal constraints are imposed on each pair of rows for the SFM problem in [13]. The resulting optimization problem turns to

min

kðW  RSÞ  MkF ;

s:t: Ri RTi ¼ ai I;

i ¼ 1; . . . ; m=2:

where Ri = R(2i  1:2i,:). The constraints enforced here are problemdependent. However, if the matrix scale is a bit large, it is also hard to solve this problem by existent techniques efficiently, though imposing such a constraint may result in a better approximate solution to the SFM problem. We are interested in uniform constraints on the solution RS, rather than on the individuals R and S. The constraints are not problem-dependent generally. The feasible domain D we considered is a convex set B of matrices whose entries are bounded with given boundary values. For example,

B ¼ fRSjðRSÞij 2 ½a; bg: That is, the problem we are interested is the following constrained one

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

min kðW  RSÞ  MkF :

ð10Þ

RS2B

This constraint makes sense in applications. In the problem of SFM, if the whole object is captured into each frame by cameras, wij should be restricted into the image domain. In the illumination based reconstruction problem, it is natural to impose the constraints wij 2 [0, 256] for the missing entries of W, since W is extracted from gray images. In general, we assume that the feasible restricting domain can be determined beforehand according to the real applications. If there is no information for the unknowns, we may estimate them form the observed data points. For example, one can set

a ¼ min wij ; ði;jÞ2K

b ¼ max wij : ði;jÞ2K

ð11Þ

For SFM problem, one may consider two restriction intervals for the odd and even rows, corresponding to the x-coordinates and ycoordinates in frames, respectively.4 The constrained low-rank approximation problem (10) is not a convex problem and hence it is hard to solve generally. However, (10) can be equivalently transformed to a classical best low-rank approximation problem with a so-called template matrix. To show it, let TB denote the set of template matrices,

TB ¼ X ¼ ðxij Þ 2 B : xij ¼ wij ; 8ði; jÞ 2 K ; and let PTB ðÞ denote the orthogonal projection onto the convex set TB , which is given by

PTB ðAÞ ¼ arg min kA  XkF : X2TB

This projection can be easily constructed. Let A = (aij). Then X ¼ ðxij Þ ¼ P TB ðAÞ is a shrinking of A to [a,b],



xij ¼ max a; minðb; aij Þ ;

8ði; jÞ 2 Kc ;

while xij = wij for ði; jÞ 2 K. Thus, (10) is equivalent to

min kP TB ðRSÞ  RSkF : RS2B

ð12Þ

However, the optimal problem (12) is not easy to solve. We consider a slight modification of the model by releasing the explicit restriction of on RS, RS

ð13Þ

We point out that the restriction still implicitly exists in the projection PTB . This slight modification makes the problem easily solvable by an alternative projection method, because besides the explicit orthogonal projection PTB onto the convex templet set TB , (13) also employs the projection bestr onto the non-convex set of matrices with rank no larger than r. For a given matrix A, bestr(A) denotes the best rank-r approximation to A,5

kA  bestr ðAÞkF ¼ min kA  XkF ; rankðXÞ6r

and it can be obtained by the singular value decomposition (SVD) of A. One can verify that (13) is equivalent to the classical problem of seeking for the best low-rank approximation to A in TB ,

min kX  bestr ðXÞkF :

X2TB

Therefore, these two projections can be utilized for solving the nonlinear system (13) and (14) as follows: Fix an estimation RS and find its projection PTB ðRSÞ of RS onto TB . Or, fix a template matrix X ¼ P TB ðRSÞ and find its best rank-r projection bestr(X). It results in the following iterative scheme

X k ¼ PTB ðRk Sk Þ;

Rkþ1 Skþ1 ¼ bestr ðX k Þ:

ð15Þ

Obviously, defining the projection error functions

f ðXÞ ¼ kX  bestr ðXÞkF ;

gðRSÞ ¼ kP TB ðRSÞ  RSkF ;

we see that

f ðX kþ1 Þ 6 gðRkþ1 Skþ1 Þ 6 f ðX k Þ 6 gðRk Sk Þ: This monotonicity indicates convergence of the algorithm. We remark that since (13) is not a convex problem, a good initial value is very helpful as other algorithms always do. 5.1. Modification of the projections To improve the iteration or accelerate convergence, we further consider two modifications on the projections. The first one is a slightly modification on PTB . Since X ¼ P TB ðRSÞ may lie on the boundary of B, a better way is to project RS by absorbing the entries out of B into the inner part of B. For example,

8 > < ðRSÞij ; if xij ¼ a þ d; if > : b  d; if

ðRSÞij 2 ½a; b; ðRSÞij < a; ði; jÞ 2 Kc : ðRSÞij > b:

ð16Þ

Here, d is a positive constant less than (b  a)/2. The second modification is on the orthogonal projection operator bestr. RS = bestr(X) minimizes the error function

kX  RSk2F ¼ kðW  RSÞ  Mk2F þ kðX  RSÞ  Mc k2F ;

5. Successive alternate iteration with orthogonal projections

min kP TB ðRSÞ  RSkF :

1089

ð14Þ

4 We assume that the frames are not pretreated to contain all feature points as tight as possible. 5 We also call bestr as an orthogonal projection though the set of low-rank matrices is not convex. On the other hand, if the rth and (r + 1)th singular values of A are equal but zero, then bestr(A) is not unique. That is, minrank(X)6rkA  XkF has multiple optimal solutions.

where Mc is the character matrix corresponding to the complement Kc of K. If the unknown entries of W are not well-estimated by X, bestr(X) may be over-approximated and lose the necessary adjacency with the true W. To decrease the effect of the second term above, we introduce a relaxation parameter k 2 (0, 1) and consider a modified projection of bestr(X), a weighted rank-r approximation to X,

n o min kðW  RSÞ  Mk2F þ k2 kðX  RSÞ  Mc k2F : R;S

ð17Þ

There are at least two improvements from the modification (17) in the case when the unknown entries of W are not well-estimated by X. On one hand, (17) is a soft version of the best low-rank approximation since k < 1. The estimation to missing data in RS are not attracted by X as strongly as in the best low-rank approximation problem. On the other hand, (17) is a relaxed version of the original problem (10). The error k(W  RS)  Mk is much smaller than k(X  RS)  Mck because of the small k, if RS minimizes the error function. It implies that the known entries of W can be well approximated. So one can expect that an optimal solution RS of (17) will close to the known entries of W, while the entries of unknown part satisfy the constraint B as approximate as possible. Therefore, the two-step projection algorithm (15) can be slightly modified for suiting to (17) as follows: replace Rk+1Sk+1 by an approximation solution of (17) with X = Xk and k = kk, where kk can be a constant as we used in our experiments. Note that it is not required to solve (17) at a high accuracy since Xk will be updated as the iteration goes on. This strategy results in an inexact inner–outer iteration for solving (13): exact outer iteration for computing the projection P TB and inexact inner iteration for the modified projection (17). In the next subsection, we will discuss

1090

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

a simple alternate iteration approach for solving (17), by updating each factor R and S, alternatively. 5.2. Alternate LS iteration There is no closed form to the model problem (17). However, (17) is of the obvious property: For a given matrix X and a parameter k, the objective function

Ek;X ðR; SÞ ¼ kðW  RSÞ  Mk2F þ k2 kðX  RSÞ  M c k2F ;

ð18Þ

is a quadratical form with respect to the variables R and S, respectively. It is a least squares (LS) problem if we only vary one factor R or S: Minimize E with respect to S with a fixed R, or minimize E with respect to R with a fixed S. Thus, similar to the algorithm ALS for (1), an alternate iteration method follows,

8
arg min Ek;X ðR; SÞ;

:R

arg min Ek;X ðR; SÞ:

S

ð19Þ

R

The initial guess can be set by the factors in bestr(X). It has been reported in literature that the alternate iteration algorithm ALS for (1) could not obtain the required results in some applications. This can be concluded essentially from the ill-conditioned closed forms of the LS problems. However, it is not the case for our model (17), since the LS problems in (19) are well-conditioned if k is not very small. We emphasize that k is a regularization constant which helps to strengthen the computational stability. For perceiving it, we show the computational details below. Practically, Ek,X(R, S) can be represented as X n 





 o diagðMej Þ Wej  RSej 2 þ diag kM c ej Xej  RSej 2 : Ek;X ðR; SÞ ¼ j

Since diag(Mej)W = diag(Mej)X, diag(Mej)diag(Mcej) = 0, we have that

Ek;X ðR; SÞ ¼

X 

diag ðM þ kM c Þej ðX  RSÞej 2 : j

Obviously, the partial optimization problem minSEk,X(R, S) consists of n independent LS problems with respect to n columns of S. So the solution S has a unique closed form with



Sej ¼ Rj ðkÞy diag ðM þ kM c Þej Xej ;

Rj ðkÞ ¼ diag ðM þ kMc Þej R; Similarly, minREk,X(R, S) has the solution R with

Algorithm SALS 1. Set an initial template matrix X0, parameter k, iteration number Kinner or accuracy inner for alternative iteration (19), and accuracy  for out iteration. Let ‘ = 0. 2. Inexactly solve the penalized problem (17) within Kinner iterations of the alternative LS (19) with X = X‘, or until the given accuracy inner is satisfied. The initial guess R (or S) is obtained by computing the r leading singular values of X‘ and the corresponding left (or right) singular vectors. Set the resulting approximation as R‘+1S‘+1. 3. Project R‘+1S‘+1 by the updating scheme (16) to obtain X‘+1. 4. Check convergence. If

kX ‘  X ‘þ1 kF < kX ‘ kF ;

ð20Þ

terminate the iteration and set RS = R‘S‘. Otherwise, increase ‘ and go to Step 2. SALS is an inner–outer iteration method: Steps 2 and 3 perform the whole inexact inner iterations within given iteration number or accuracy, while Step 4 is an exact outer iteration. The initial templet X0 can be set randomly or by an approximation solution of (1) computed by other methods, for example, Jacobs’ method in [9] or imputation method in [21] for SFM problem. Note that in the second step of SALS, only the leading eigenvectors need to be estimated. It is not necessary to compute the whole SVD of X‘ in Step 2. To evaluate this subproblem, one can use the truncated SVD; its computation cost linearly depends on the matrix dimensions. The truncated SVD can be computed by a state-of-the-art code like PROPACK, see [12]. For large matrices, however, the computational cost of PROPACK is also expensive. One may use the Monte Carlo algorithm LinearTimeSVD proposed in [6] instead for the best rank-r approximation. LinearTimeSVD is much faster than PROPACK, though PROPACK is more accurate when X‘ is not of low-rank exactly. 6. Numerical experiments In this section, we report the performance of the proposed algorithm SALS by three kinds of data sets with different purposes: synthetic data sets that simulate a sequence of images capturing a ball, two real-world data sets dinosaur and kuls,6 and randomly generated matrices in large scale. The computational accuracy is measured by the Root Mean Square (RMS) errors

pffiffiffiffiffiffiffiffi



¼ eTi Xdiag eTi ðM þ kM c Þ Si ðkÞy ; T

Si ðkÞ ¼ Sdiag ei ðM þ kMc Þ ; i ¼ 1; . . . ; m:

r ¼ kðW  RSÞ  MkF = jKj;

Many computational redundancies in solving the LS subproblems can be reduced, and R and S can be updated in a cheap way. The total computational cost of updating both S and R in each alternative iteration is no more than Oðr3 ðm þ nÞ þ r 2 min ðjKj; jKc jÞÞ. See appendix for the details.

with respect to the observed entries and/or the missing data (for the simulated or randomly generated data), respectively. Only use r may be incomprehensive since as we will show later, a computed solution may have small RMS error on the observed entries but its RMS error corresponding to missing data may be hugely large. At first, four groups of matrices are simulated with different matrix scales and different percentages of missing data. We will show that the SALS can retrieve solutions with small RMS errors of both the observed or missing data for these simulation data. The real-world data sets dinosaur and kuls are from the application of structure from motion and illumination based reconstruction. It is reported in literature that a meaningful solution is indiscoverable for these two data sets [3]. We will show that SALS can find a meaningful solution for these real data. The third kind of testing data are used

eTi R

5.3. The algorithm of successive alternate LS iteration Now we are ready to propose our alternate projection method for solving the constrained problem (10). This algorithm uses two kinds of techniques: exactly orthogonal projection on the templet space to general a variable templet sequence and inexactly projection of a fixed templet onto the low-rank space. We call it Successively Alternate Least Square (SALS) algorithm since the templet matrix X is successively updated as the outer iteration goes on and the inner problem is solved by the alternate LS method.

qffiffiffiffiffiffiffiffiffiffi

rc ¼ kðW  RSÞ  Mc kF = jKc j

ð21Þ

6 The real data and their best solutions are downloaded from http://www.robots.ox.ac.uk/abm/. We will compare SALS solutions with these previously best results in this section.

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

to show the fast convergence and high accuracy of SALS for large scaled data sets. In the final experiment, we will show the behaviors of SALS depending on the constraint interval [a, b] and the parameter d by the three kinds of testing data sets. We will also compare SALS with four existing algorithms: alternate least squares (ALS) proposed in [19], damped Newton (DN) in [3], LM_S in [4], and FPCAr in [18]. ALS is a very simple algorithm for implementation and it has been used for getting a good initial guess in other hybrid iterative algorithms such as Damped Newton Hybrid (DNH). For the two real data sets, DNH is the algorithm that first gives solutions with the smallest residue k(W  RS)  MkF, compared with other algorithms [3]. LM_S can also achieve the smallest residue as DNH within less computational times. We will show that SALS can give much more meaningful solutions than DNH or LM_S, although the values of k(W  RS)  MkF slightly increase for the two real data sets. Newton-like methods and alternative methods fail generally if data scale and the percentage of missing data are larger. To show the quick convergence of SALS for matrices in large scale, we will compare SALS with FPCAr since FPCAr converges fast for randomly generated low-rank matrices without noise. All the algorithms are implemented in C++ and called through Matlab using ‘‘mex” interface and the Microsoft Visual C++ compiler to speed up computing on a PC with 2.67 GHz, Intel Core 2 Quad processor.

1091

0.5 in this experiment. The character matrix M corresponds to those feature points on the opaque ball that can be observed from the cameras. Some observed feature points nearby the boundary may be reset as unknowns if we need more unknown points for comparison. W can be generated with given scale and percentages of missing entries. We will compare SALS with ALS and LM_S applied on this kind of matrices. In SALS, the constraint interval [a, b] is given in Eq. (11). We use uniform k = 0.01, d = 0, and Kinner = 30. The initial template matrix of SALS is set as X0 = W  M. The SVD of X0 is used as an initial guess in both of ALS and LM_S. These algorithms terminate if an iteration solution RS satisfies the artificial criteria

kW  RSkF < skW  bestr ðWÞkF

ð22Þ

200

with a given constant s > 1. To compare the efficiency of these three algorithms depending on the matrix size and amount of unknowns, we consider two scales (140  200 and 240  400) and two percentages (50.0% and 61.7%) of unknowns in W, and set the parameter s as s = 1.5 if half of entries are unknown or g = 3 otherwise. Table 2 lists the percentages of convergent tests of the three algorithms among 100 tests for each scale and percentage of unknowns, and the average values of r, rc, and the CPU times among the successful tests. We also list the percentage of failures and their average r, rc. ALS fails in a high percentage whatever the matrix size or the rate of unknowns is relative large nor small, while LM_S is always successful if only half of data are unknown. However, the possibility of convergence will decrease rapidly in both of ALS and LM_S if more entries of W are unknown. if converges, LM_S can achieve the highest accuracy in the cost of much of CPU times for computing Hessian matrices. For example, LM_S has the average CPU time 110 times more than SALS if matrix size is 240  400 and half of data are unknown. SALS performs quite good in this experiment: (1) it always converges in all the 400 tests, (2) its average CPU time is the lowest, and (3) it can retrieve solutions in a high accuracy measured by the RMS errors r and rc for both of the observed data or missing data. Moreover, the accuracy of SALS can be improved if we take more iterations. For example, if we decrease the parameter s in the stopping condition (22) from 1.5 to 1.2, then in the third case (280  400, 50%) the average accuracy of SALS can be reduced to r = 0.198 and rc = 0.266, while the CPU time is slightly increased from 3.04 to 3.46 in second. We point out that the RMS error rc of a solution obtained by ALS or LM_S corresponding to the missing data may be hugely large, though it may have small value of r corresponding to the observed entries. It implies the incredibility of relatively small r in checking a meaningful solution in application.

150

6.2. Dinosaur data

6.1. Simulation data The synthetic data sets were generated to simulate the Structure-from-Motion problem, based upon the orthography model as follow (also see [21]). We randomly select three-dimensional feature points from the surface of a fixed ball. Fig. 1 illustrates the three-dimensional data. Cameras are simulatively located in a bigger circle around the opaque ball with equally spaced camera positions. Then we project the feature points onto each camera plan and shift the local coordination origin to the left bottom corner of the frame. These resulting two-dimensional points form the true measurement matrix W*. Considering that each frame has integer values of pixels, W* is rounded to W with integer entries. That is, W has errors on entries with the maximal abstract error

100 50 0 −50 −100 −150 −200 200 200

100 100

0

0

−100

−100 −200

−200

Fig. 1. Simulated ball. No features points are distributed around two polar points of this ball for convenience.

The dinosaur data set is collected from a rotating toy dinosaur (see Fig. 2).7 The observation matrix W = (wij) is generated from 2D projections of 319 feature points on to 36 frames. So the matrix size of W is 72  319. The sequence (w2i1,j, w2i,j),i = 1, . . . , 36, of W form the 2D track of feature point j [3]. However W has 76.92% missing data, but at least 19 feature points are captured by each frame and each feature point is tracked by at least seven frames. That is, each row of W has at least 19 known entries and each column has at least 14 known entries. The left panel of Fig. 3 plots point tracks of the observed entries. Since the 3D track of each feature point is a circle if the toy body is rigid, the projection of each track onto a 2D plane should be an ellipse. So the ideal recovered trajectories should be closed ellipses. However, the collected data have errors on some feature points of the tail, due to slight slant of the tail.

7

The data set can be download from http://www.robots.ox.ac.uk/vgg/data1.html.

1092

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

Table 2 Comparison of SLAS with ALS and LM_S for the synthetic ball data. Data

Algor.

Size

Unknowns (%)

140  200

50.0

61.7

280  400

50.0

61.7

Convergent test

Failed test

%

r

r c

ALS LM_S SALS ALS LM_S SALS

65 100 100 4 60 100

0.19 0.19 0.19 0.19 0.19 0.19

0.37 0.29 0.34 1.57 0.74 0.82

ALS LM_S SALS ALS LM_S SALS

68 100 100 23 72 100

0.20 0.20 0.20 0.20 0.20 0.20

0.38 0.28 0.34 1.35 0.47 0.71

r

r c

1.8 52.0 0.7 9.8 64.6 4.2

7.19 – – 2.89 1.91 –

749 – – 87,1480 143.1 –

4.6 334.0 3.0 25.8 402.0 10.3

7.63 – – 2.84 2.35 –

4381 – – 91,8035 191.7 –

CPU(s)

the most of points on the body of the dinosaur. However, the improvement is significant. It implies in a way that our method is not very sensitive to the choice of bounds. 6.3. Kuls data

Fig. 2. A frame from the dinosaur sequence from http://www.robots.ox.ac.uk/vgg/ data1.html.

The dinosaur data seems to be ’hard’ because of the high percentage of missing data and the inaccuracy of some collected data points. The middle panel of Fig. 3 shows the best result computed by DNH or LM_S reported in [3,4]. Obviously, the best result is not satisfied yet, since the symmetric property of the 2D trajectories is lost and many trajectories are not closed and far from ellipses. We set the feasible domain B for the missing entries as follows:

x2i1;j 2 ½a1 ; b1 ;

x2i;j 2 ½a2 ; b2 ;

where [a1, b1] = [80, 650] and [a2, b2] = [0, 550] are estimated from the known entries of W. The modified projection X ¼ P TB ðRSÞ used in SALS is given by X = (xij) with xij = wij for ði; jÞ 2 K and xij = (RS)ij if ði; jÞ 2 Kc and (RS)ij satisfies the above constraints, otherwise,

8 > < a1 þ d1 ; ðRSÞ2i1;j < a1 ; x2i1;j ¼ b1  d1 ; ðRSÞ2i1;j > b1 ; > : 8 > < a2 þ d2 ; ðRSÞ2i;j < a2 ; x2i;j ¼ b2  d2 ; ðRSÞ2i;j > b2 : > : Here we set d‘ = (b‘  a‘)/5, ‘ = 1,2. k = 0.01, and K = 300. SALS performs quite better than the existing methods for this example. It achieves the accuracy  ¼ 12  104 defined by (20) within 10 outer iterations, starting at the best solution reported in [3,4]. In the right of Fig. 3 we plot the computed trajectories by SALS. Most of the recovered trajectories are closed ellipses with centers on a common axis and the coordinates of trajectories belong to the restricted region. We point out that the constraint bounds are not tight. The observed data are partial to the right (see the left of Fig. 3). They are excessive for the x-components and weak for the y-components of the tail points. Practically, they are too weak for

The kuls data comes from an application of illumination based 3D reconstruction. Its purpose is to recover the normal directions and the lighting directions at the considered feature points. The kuls data is collected from 20 gray face images with different illumination conditions at total 2944 feature points. So, W is a 2944  20 matrix. Pixel values at the shadowed feature positions cannot reflect the true illumination conditions as others and hence the true pixel values are missed. Fig. 4 shows two samples of the 20 face images and corresponding observed feature points. Totally, there are 41.7% entries are missed in this data collection. The observation matrix W is rescaled such that all the known entries belong to the interval [0, 1]. So, the constraint condition on the unknown entries is naturally set as xij 2 [0, 1] for ði; jÞ R K. We use the parameters k = 0.2, d = 0.3 and K = 40 in SALS. The initial template matrix X0 = (xij) is set as follows: xij = 0.45 for ði; jÞ 2 Kc and xij: = wij for ði; jÞ 2 K. SALS is terminated when the out accuracy  ¼ 12  104 is achieved. Although the RMS error r of the solution computed by SALS is slightly larger than the previously best result obtained by DNH, we believe that SALS performs much better than DNH, since the SLAS solution is more meaningful: almost all the recovered entries satisfy the constraint conditions; only 59 entries (0.24%) among the total 24,562 recovered entries are slightly out of the interval [0,1]. Correspondingly, the DNH solution has 16.84% entries out of [0, 1], and the largest distortion is larger than 10. In Table 3, we list the numbers of recovered entries less than 0, in the interval [0, 1], or larger than 1, respectively. We also list the percentages of these points belonging to the three intervals. The last column shows the computable RMS error r for DNH and SALS. It makes sense that SALS has a slightly larger r-value than DNH since SALS pays more attention on the accuracy of the recovered missing data. 6.4. Randomly generated data in large scale Now we show the other character of SALS by testing randomly generated matrices: SALS converges very fast even if the matrix W has a large scale (we have shown the evidence in Table 2). The testing matrices are set as follows, represented in MATLAB symbols.

W ¼ randðm; rÞ randðr; nÞ; M ¼ ðrandðm; nÞ < ratioÞ; Here, the command rand(m, r) generates an m  r random matrix with entries uniformly distributed on the interval [0, 1]. The second command generates a logical matrix of order m  n

1093

K. Zhao, Z. Zhang / Computer Vision and Image Understanding 114 (2010) 1084–1096

0

0

0

100

100

100

200

200

200

300

300

300

400

400

400

500

500

500

0

200

400

600

0

200

400

600

0

200

400

600

Fig. 3. Middle: observed feature trajectories in the dinosaur data. Left: the best recovering reported in [3,4]. Right: the recovered trajectories by algorithm SALS.

Table 4 Computation errors and CPU times of FPCAr and SALS (r = 4). Data scale

0

0

10

10

20

20

30

30

40

40

50

50

60

60 0

20

40

60

0

SALS

jKj (%)

g(RS)

g(RS)

CPU(s)

1000

5 15 25

8.92e09 8.64e09 8.89e09

48.8 10.8 6.7

4.04e08 4.47e10 2.32e11

1.8 1.7 2.0

2000

5 15 25

8.66e09 6.88e09 7.21e09

133.8 39.3 24.9

3.59e08 1.63e10 2.03e11

6.3 7.6 8.8

CPU(s)

gence of SALS, we compare it with algorithm FPCAr [18], rather than the Newton-like methods and alternative methods. FPCAr adopts a shrinking technique from minimizing kðW  XÞ  Mk2F þ lkXk with decreasing parameter l > 0 as iteration goes on, where kXk* defines the nuclear norm by the sum of all the singular values of X. For this kind of low-rank matrices shown above, FPCAr is effective and fast.9 The inequality (20) is also used as the stopping criterion for both FPCAr and SALS with  = 108. Table 4 shows the relative errors of the computed solutions by FPCAr and SALS, defined by 20


Fig. 4. Two samples of the kuls data (top) and their observed feature point positions (bottom; nz = 1409 and nz = 1426 observed points, respectively).


Table 3
Distribution of the entries of the computed X by DNH and SALS for the kuls data.

Algorithm        ŵ_ij < 0   ŵ_ij ∈ [0, 1]   ŵ_ij > 1   min     max     σ
DNH        #     2216       20426           1920       -9.55   11.48   0.0223
           %     9.02       83.16           7.82
SALS       #     3          24503           56         -0.03   1.09    0.0225
           %     0.01       99.76           0.23


6.5. Dependence of SALS on constraint intervals

Roughly speaking, a solution obtained by SALS depends on the constraint interval [a, b] used for bounding the unknown entries. Obviously, there is no guarantee of reliability for the solutions computed by SALS if the interval [a, b] is too large to provide any information about the missing data. A true and tighter constraint interval, together with a suitable δ in (16) for attracting the wanted entries, is effective for improving the solutions. In this subsection, we illustrate the effect of δ and the numerical behavior of SALS depending on the constraint interval [a, b] on a toy data set and on the two real data sets tested in the previous experiments.

The toy data set is generated as in Section 6.1. We simulate 70 cameras and 200 points. All the points are located in the box [-66.6, 466.2] × [-106.3, 427.9], but there are about 62% missing entries. To show the behavior of SALS depending on [a, b], we consider a sequence of choices for bounding the unknown entries as follows:

    [a, b] = [a* - k, b* + k],    k = -100:10:100,

where a* = -66.6 and b* = 466.2. For simplicity, we set δ = 0 in this experiment. Note that no square [a, b] × [a, b] is a tight box for all the (known or unknown) entries of W. Nevertheless, SALS always gives reasonable solutions for this example. The RMS error σ of each computed solution RS on the known entries is less than 0.0626, and over the 21 tests with varying constraint intervals, the average of σ_c on the missing entries is 1.02. Both σ and σ_c are very small with respect to the tightest range [-66.6, 466.2] × [-106.3, 427.9] of W. We point out that this excellent behavior of SALS is not preserved if the constraint interval [a, b] is too large or the pattern of missing entries is too bad. For example, if more entries are assumed unknown, SALS with some of the above tested intervals may fail to find a good solution. In Fig. 5 we plot the curves of σ and σ_c of the solutions computed by SALS as functions of the parameter k = -100:10:100 for this toy data set and for the same data with more entries assumed unknown, respectively. Clearly, for the latter data set, SALS retrieves an excellent solution when [a, b] varies from [-66.6 + 60, 466.2 - 60] to [-66.6 - 30, 466.2 + 30]. A solution of SALS with [a, b] larger than [-66.6 - 30, 466.2 + 30] has a relatively large σ_c. Moreover, SALS fails if [a, b] is smaller than [-66.6 + 60, 466.2 - 60].
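A minimal sketch of this sweep is given below. The toy-data generator, the SALS call (replaced here by a rank-r stand-in), and the exact definitions of σ and σ_c (root-mean-square residuals on the known and missing entries, respectively) are assumptions made for illustration only.

    % Sketch: sweeping the constraint interval [a, b] = [a* - k, b* + k] (toy data).
    % Assumptions: W is the full ground-truth matrix, M the logical mask of known
    % entries, r the rank; the SALS solve is replaced by a stand-in below.
    astar = -66.6; bstar = 466.2;
    ks = -100:10:100;
    sigma = zeros(size(ks)); sigma_c = zeros(size(ks));
    for t = 1:numel(ks)
        a = astar - ks(t);  b = bstar + ks(t);     % current constraint interval
        % X = sals(W.*M, M, r, a, b);              % hypothetical call; stand-in below:
        X0 = W .* M + ((a + b)/2) * (~M);          % template with mid-interval fill
        [U, D, V] = svds(X0, r);  X = U * D * V';  % rank-r stand-in for the SALS output
        sigma(t)   = sqrt(mean((X(M)  - W(M)).^2));    % RMS error on known entries
        sigma_c(t) = sqrt(mean((X(~M) - W(~M)).^2));   % RMS error on missing entries
    end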

Now we consider the effect of δ in SALS for the kuls data. The reasonable (though maybe not tight) constraint interval is [0, 1]. If we set a relatively large interval [a, b] = [-1.5, 2.5], the effect of δ is very weak: SALS with δ varying from 0 to 0.45 always gives the same solution. All the entries of that solution belong to the constraint interval [-1.5, 2.5], and the recovered missing entries lie in the smaller interval [-0.27, 1.49]; meanwhile, about 1.3% of the entries are out of [0, 1]. However, if we set [a, b] to [0, 1] or to an even smaller interval, a suitable δ can improve the solution obtained by SALS without δ (δ = 0). In Fig. 6, we plot the minimal and maximal entries of the solutions and the percentage of entries out of [0, 1] as functions of δ for a prescribed [a, b]. For the three choices of [a, b] that we tested, namely [0, 1], [0.1, 0.9], and [0.2, 0.8], using δ ∈ [0.15, 0.2] improves the retrieval of SALS noticeably, while the change in the RMS error σ is negligible.

In the experiment shown in Section 6.2, we used two constraint intervals for uniformly bounding the x-components and the y-components, respectively. This kind of constraint may be good for some missing entries (corresponding to the tail of the dinosaur) but too loose for others (on the head of the dinosaur). A more efficient but also more complicated strategy is to use various constraints for bounding the entries in different frames. For example, let

    a_i = min_{j:(i,j)∈K} w_ij,    b_i = max_{j:(i,j)∈K} w_ij,    (23)

and use the feasible domain

    B = { RS | (RS)_ij ∈ [a_i, b_i], ∀ j }.    (24)

Using this kind of nonconforming constraints in SALS, the good results shown in Section 6.2 can be further improved; a minimal sketch of computing these row-wise bounds and projecting onto B is given below.
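The following MATLAB sketch shows how the row-wise bounds (23) can be computed from the observed entries and how a candidate product RS is projected onto the feasible domain B of (24). Whether SALS applies this projection to the template or to the factors is not specified here, so the snippet only illustrates the constraint itself.

    % Sketch: row-wise bounds (23) and projection onto the feasible domain B of (24).
    % Assumptions: W holds the observations, M is the logical mask of known entries,
    % and X = R*S is a candidate low-rank reconstruction.
    Wmin = W;  Wmin(~M) =  Inf;                        % hide missing entries for the row minima
    Wmax = W;  Wmax(~M) = -Inf;                        % hide missing entries for the row maxima
    ai = min(Wmin, [], 2);                             % a_i = min over known entries of row i
    bi = max(Wmax, [], 2);                             % b_i = max over known entries of row i
    X  = R * S;                                        % candidate reconstruction
    Xproj = bsxfun(@min, bsxfun(@max, X, ai), bi);     % clip each row of X into [a_i, b_i]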


Fig. 5. The RMS errors σ and σ_c of SALS with varying constraint intervals [a, b] = [a* - k, b* + k] for a toy data set with two different patterns of missing entries (left: 61.96% missing; right: 63.71% missing).


Fig. 6. The minimal and maximal recovered entries (left), the percentage (%) of entries out of [0, 1] (middle), and the RMS error σ (right) of SALS as functions of δ for the given constraint intervals [a, b] = [0, 1], [0.1, 0.9], and [0.2, 0.8].



Fig. 7. SALS solutions of the dinosaur when using uniform bounds (left) and variable bounds (right).

Most of the trajectories on the head of the dinosaur and in the left-bottom part are obviously improved, while the others are preserved. In the right panel of Fig. 7, we plot the improved results of SALS using the variable constraints (24). For convenient comparison, we also plot the previous results of SALS in the left panel of Fig. 7. We emphasize again that the input data are noisy.

7. Conclusion

In this paper, we gave a theoretical analysis of the sensitivity of the previous model min_RS ‖(W - RS) ∘ M‖_F for the problem of low-rank matrix factorization with missing data. We showed that an optimal solution to this problem can be arbitrarily far from a solution that has physical meaning in the application. We also gave a lower bound on the error between an optimal solution and an ε-optimal solution. Different from the earlier work, we considered a constrained model with simple constraints on the missing entries. A two-step projection method was proposed for solving the constrained problem, and it was sped up by a successive alternation approach. The resulting algorithm is efficient in our experiments on simulation data and two real-world data sets. The experiments show that our SALS algorithm outperforms other methods: it is faster and more stable, and the computed solutions are more reliable in physical meaning. We believe that Newton-like algorithms may also work well for the constrained model proposed in this paper. However, some issues still need to be investigated further, in at least two directions: (1) adaptive setting of the tuning parameter λ in the relaxed model (17) and of the shrinking constant δ in (16) for template projection, and (2) sensitivity analysis of the constrained model (12). It is also not clear whether the constrained model has multiple solutions. We will continue to work on these topics.

Acknowledgment

The work of this author was supported in part by NSFC Project 10771194 and the National Basic Research Program of China (973 Program) 2009CB320804.

Appendix A. Computational details for solving the LS problems in (19)

Given a full column-rank matrix R, the closed-form S requires computing the n matrices R_j(λ) and their generalized inverses R_j(λ)^†, j = 1, ..., n. Due to the nonsingularity of diag((M + λM^c)e_j) and the full column rank of R, it is easy to compute R_j(λ)^† in the form

    R_j(λ)^† = (R_j(λ)^T R_j(λ))^{-1} R_j(λ)^T.

A direct computation requires O(rm) operations for constructing each R_j(λ) and O(r²m + r³) = O(r²m) for each R_j(λ)^†, i.e., O(r²mn) in total. The computational cost can be significantly reduced to O(r³(m + n)) by taking the structure of the R_j(λ)'s into account. We show the computational details below. For simplicity, we denote

    m_j = M e_j,    m_j^c = M^c e_j,    h_j = m_j + λ m_j^c.

Notice that h_j ∘ h_j = m_j + λ² m_j^c. It follows that

    R_j(λ)^T R_j(λ) = R^T diag(h_j ∘ h_j) R = R^T diag(m_j + λ² m_j^c) R.

The (a, b)-entry of R_j(λ)^T R_j(λ) is (R_j(λ)^T R_j(λ))_{a,b} = (m_j + λ² m_j^c)^T (Re_a ∘ Re_b). Since r is generally small, we can compute and store the r(r + 1)/2 m-dimensional columns

    α_{a,b} = Re_a ∘ Re_b,    1 ≤ a ≤ b ≤ r,

at a computational cost of O(r²m). Note that they do not depend on j. So

    (R_j(λ)^T R_j(λ))_{a,b} = (m_j + λ² m_j^c)^T α_{a,b} = Σ_{k∈I_j} α_{a,b}(k) + λ² Σ_{k∈I_j^c} α_{a,b}(k).

Since Σ_{k∈I_j} α_{a,b}(k) + Σ_{k∈I_j^c} α_{a,b}(k) = Σ_{k=1}^m α_{a,b}(k), the entry (R_j(λ)^T R_j(λ))_{a,b} can be computed within O(min(|I_j|, |I_j^c|)) operations for the sum over I_j or over I_j^c, whichever is smaller (this depends on j), plus an additional O(m) for the sum Σ_{k=1}^m α_{a,b}(k), which is shared by all j. So computing the n matrices R_j(λ)^T R_j(λ) needs

    O(r²(Σ_j min(|I_j|, |I_j^c|) + m)) ≤ O(r²(min(|K|, |K^c|) + m)).

Together with O(r³n) for the n inverses and the ignorable O(rn) for the n matrix–vector multiplications, the computational cost of S is about O(r³n + r²(min(|K|, |K^c|) + m)). The total computational cost of updating both S and R in each alternating iteration is O(r³(m + n) + r² min(|K|, |K^c|)).
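A small MATLAB sketch of this computation is given below; it forms the n Gram matrices R_j(λ)^T R_j(λ) with the α_{a,b} trick described above. The surrounding solve (the right-hand sides of the LS problems in (19)) is not shown, and the function name and interface are our own.

    function G = gram_matrices(R, M, lambda)
    % Sketch: form G(:,:,j) = R_j(lambda)' * R_j(lambda), j = 1, ..., n, using the
    % alpha_{a,b} vectors described above. M is the m-by-n logical mask of known
    % entries, R is m-by-r, and lambda weights the unknown entries.
    [m, n] = size(M);
    r = size(R, 2);
    p = r * (r + 1) / 2;
    A = zeros(m, p); pairs = zeros(p, 2); idx = 0;
    for a = 1:r
        for b = a:r
            idx = idx + 1;
            A(:, idx) = R(:, a) .* R(:, b);     % alpha_{a,b} = R e_a o R e_b
            pairs(idx, :) = [a, b];
        end
    end
    sA = sum(A, 1);                              % full sums, shared by all columns j
    G = zeros(r, r, n);
    for j = 1:n
        Ij = find(M(:, j)); Icj = find(~M(:, j));
        if numel(Ij) <= numel(Icj)               % sum over the smaller index set
            sK = sum(A(Ij, :), 1);  sC = sA - sK;
        else
            sC = sum(A(Icj, :), 1); sK = sA - sC;
        end
        g = sK + lambda^2 * sC;                  % (m_j + lambda^2 m_j^c)' * alpha_{a,b}
        for idx = 1:p
            a = pairs(idx, 1); b = pairs(idx, 2);
            G(a, b, j) = g(idx); G(b, a, j) = g(idx);
        end
    end
    end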

References

[1] P.M.Q. Aguiar, J. Xavier, M. Stosic, Spectrally optimal factorization of incomplete matrices, CVPR, June 2008, pp. 1–8.
[2] C. Bregler, A. Hertzmann, H. Biermann, Recovering non-rigid 3D shape from image streams, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, South Carolina, USA, June 2000, pp. 690–696.
[3] A.M. Buchanan, A.W. Fitzgibbon, Damped Newton algorithms for matrix factorization with missing data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 316–322.


[4] P. Chen, Optimization algorithms on subspaces: revisiting missing data problem in low-rank matrix, International Journal of Computer Vision 80 (1) (2008) 125–142.
[5] A. Del Bue, F. Smeraldi, L. Agapito, Non-rigid structure from motion using ranklet-based tracking and non-linear optimization, Image and Vision Computing 25 (3) (2007) 297–310.
[6] P. Drineas, R. Kannan, M.W. Mahoney, Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix, SIAM Journal on Computing 36 (1) (2006) 158–183.
[7] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, New York, NY, USA, 2003.
[8] H. Hayakawa, Photometric stereo under a light-source with arbitrary motion, Journal of the Optical Society of America A 11 (11) (1994) 3079–3089.
[9] D. Jacobs, Linear fitting with missing data: applications to structure-from-motion and to characterizing intensity images, in: CVPR '97: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, IEEE Computer Society, 1997, p. 206.
[10] J. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization 20 (4) (2010) 1956–1982.
[11] C. Julià, A.D. Sappa, F. Lumbreras, J. Serrat, A. López, An iterative multiresolution scheme for SFM with missing data, Journal of Mathematical Imaging and Vision 34 (3) (2009) 240–258.
[12] R.M. Larsen, PROPACK – Software for Large and Sparse SVD Calculations, 2005.
[13] M. Marques, J. Costeira, Estimating 3D shape from degenerate sequences with missing data, Computer Vision and Image Understanding 113 (2) (2009) 261–272.

[14] R. Mazumder, T. Hastie, R. Tibshirani, Regularization methods for learning incomplete matrices, 2009, arXiv:0906.2034v1.
[15] J. Nocedal, S.J. Wright, Numerical Optimization, Springer, Berlin, 1999.
[16] T. Okatani, K. Deguchi, On the Wiberg algorithm for matrix factorization in the presence of missing components, International Journal of Computer Vision 72 (3) (2007) 329–337.
[17] A. Ruhe, P.Å. Wedin, Algorithms for separable nonlinear least squares problems, SIAM Review 22 (1980) 318–337.
[18] S. Ma, D. Goldfarb, L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, Tech. Report, Department of IEOR, Columbia University, 2008.
[19] H. Shum, K. Ikeuchi, R. Reddy, Principal component analysis with missing data and its application to polyhedral object modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (9) (1995).
[20] P. Sturm, Algorithms for plane-based pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, South Carolina, USA, June 2000, pp. 1010–1017.
[21] C. Tomasi, T. Kanade, Shape and motion from image streams under orthography: a factorization method, IJCV 9 (1992) 137–154.
[22] R. Vidal, R. Tron, R. Hartley, Multiframe motion segmentation with missing data using Power Factorization and GPCA, IJCV 79 (1) (2008) 85–105.
[23] T. Wiberg, Computation of principal components when data are missing, in: 2nd Symposium on Computational Statistics, 1976.
[24] A. Zaharescu, R. Horaud, R. Ronfard, L. Lefort, Multiple camera calibration using robust perspective factorization, in: 3DPVT '06: Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, 2006, pp. 504–511.