Incremental subspace learning via non-negative matrix factorization

Pattern Recognition 42 (2009) 788–797

Serhat S. Bucak, Bilge Gunsel ∗

Multimedia Signal Processing and Pattern Recognition Laboratory, Electrical-Electronics Engineering Faculty, Department of Electronics and Communications Engineering, Istanbul Technical University, Maslak, 34469 Istanbul, Turkey

Article history: Received 21 August 2007; received in revised form 30 July 2008; accepted 2 September 2008.

Keywords: Incremental subspace learning; Non-negative matrix factorization; Statistical background modeling; Video content representation; Clustering

Abstract

In this paper we introduce an incremental non-negative matrix factorization (INMF) scheme in order to overcome the difficulties that conventional NMF has in online processing of large data sets. The proposed scheme incrementally updates its factors by appropriately reflecting the influence of each observation on the factorization. This is achieved via a weighted cost function which also allows controlling the memorylessness of the factorization. Unlike conventional NMF, with its incremental nature and weighted cost function the INMF scheme adapts to dynamic data content changes with a lower computational complexity. Test results reported for two video applications, namely background modeling in video surveillance and clustering, demonstrate that INMF is capable of representing data content online while reducing the dimension significantly. © 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Eliminating redundancies and extracting vital components of multidimensional data are among the major steps of content analysis tasks. Several decomposition methods, such as principal component analysis (PCA) [1], aim to achieve this goal, as they decrease the dimension of the data significantly and reveal patterns of interest. Non-negative matrix factorization (NMF) [2–9], which extracts key features of data through an iterative matrix factorization, is one of the recent decomposition tools for multivariate data. It has been shown that NMF offers dimension reduction and produces useful representations by converting a data matrix into the product of two smaller matrices. These factor matrices are expected to contain the latent and vital components of the original data. Unlike similar methods, NMF imposes a non-negativity constraint, which enables it to form intuitive, parts-based representations of the data in its factors. Although there are earlier approaches to solving NMF [2,3], it was in 1999 that NMF started to draw the attention of researchers, after Lee and Seung proposed their multiplicative update rules for NMF and showed its success as a machine learning tool [4,5]. Afterwards, new cost functions were proposed to increase the sparseness of NMF representations in order to address the requirements of specific applications.



∗ Corresponding author. Tel.: +90 212 285 3606; fax: +90 212 285 3679. E-mail address: [email protected] (B. Gunsel). URL: http://www.mspr.itu.edu.tr/.

0031-3203/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2008.09.002

In general these cost functions include additional penalty components which add new constraints to the optimization problem and force sparseness on the factors [6–10]. To speed up convergence of the optimization process, projected gradient descent methods have been proposed for NMF [11]. Alternatively, methods that employ Newton-type numerical approaches have been presented to derive exact solutions [12]. As a tool that offers dimension reduction and forms intuitive representations, NMF has been used in several research areas, including face recognition [13], biomedical applications [7], music transcription [14], bioinformatics [15], blind source separation [8], and image hashing [16]. One of the reasons that make NMF suitable for these applications is the nature of the data they use: mostly, the data consist of non-negative elements. Sparseness, which provides parts-based representations and localization, is another utility of NMF that has been reported and proven useful. Moreover, it has been shown in Ref. [17] that NMF is equivalent to fuzzy k-means clustering and produces successful results in clustering tasks.

In this paper we propose an incremental non-negative matrix factorization (INMF) method in order to overcome the difficulties that conventional NMF confronts in online processing of large-scale data. Generally, an online analysis requires updating former representations according to new samples. Thus, an effective online factorization scheme is expected to update its factors without much computational effort. However, conventional NMF's computational complexity is proportional to the number of samples; therefore, it is not practical to perform batch NMF processing whenever a new sample arrives.


Additionally, in some applications the contribution of old observations to the factorization should be excluded, whereas the contributions of new and significant samples should be emphasized. This is important in modeling dynamic content changes and can be achieved by imposing memorylessness. Moreover, since adding a new sample into the existing data would require recomputation of the whole basis set, batch algorithms are generally not suitable for online applications; and because such a recomputation requires all of the previous samples, a huge memory requirement appears. The introduced INMF scheme first extracts a subspace representation for each new sample and then updates the former representation in an efficient way. It is capable of adaptively controlling the contribution of each sample to the representation, which is achieved by assigning a different weighting coefficient to each sample in the cost function. INMF also reduces the significance of optimal rank selection through its weighted cost function. The proposed INMF scheme is inspired by the incremental principal component analysis (IPCA) technique introduced in Ref. [18]. However, unlike the incremental PCA scheme presented in Ref. [18], INMF brings a new approach to incremental matrix factorization by including the non-negativity constraint. Preliminary versions of this work were published in Refs. [19,20].

This paper is organized as follows: After giving the mathematical background of conventional NMF, its limitations for large data sets are discussed in Section 2. Section 3 presents the introduced INMF scheme. Prior to the final remarks given in Section 5, test results are reported in Section 4.


2. Conventional NMF

2.1. Mathematical definitions

Assume that an observation is represented by an n-dimensional vector and the number of observations is m. Then the data matrix V is an n×m matrix, where V_ij refers to an entry of V (i = 1, ..., n, j = 1, ..., m). As shown by Eq. (1), conventional NMF approximately factorizes the data matrix (V ∈ R^{n×m}) into two matrices: W ∈ R^{n×r}, the mixing matrix, and H ∈ R^{r×m}, called the encoding matrix [4]. r is a predefined parameter, named the rank of the factorization, which determines the level of dimension reduction. Smaller values of r yield higher reconstruction errors and hence a greater loss in the representation of the original data.

$$V \approx WH \tag{1}$$

V consists of column vectors, each of which corresponds to a different sample. Each column vector of V has a corresponding representation vector in the encoding matrix H. For the matrices V = [v_1 v_2 v_3 ... v_m] and H = [h_1 h_2 h_3 ... h_m], v_c and h_c indicate the column vectors of V and H, respectively, where c = 1, ..., m. The mixing matrix is defined as W = [w_1 w_2 ... w_r]. Eq. (2) shows that the elements in a column of H determine how the columns of W are combined to construct the corresponding column of V.

$$v_c = W h_c, \quad c = 1, 2, \ldots, m \tag{2}$$

Conventionally, the NMF scheme minimizes a cost function described in terms of the reconstruction error defined by Eq. (3). Although different cost functions have been defined in the literature [5,7], the mean squared error given by Eq. (3) is generally preferred by many researchers [4–7] due to its simplicity and efficiency.

$$F = \frac{1}{2}\,\|V - WH\|^{2} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\bigl(V_{ij} - (WH)_{ij}\bigr)^{2} \tag{3}$$

It has been shown that the error function F defined by Eq. (3) is a convex function of W and of H separately. Therefore, multiplicative and alternating update rules for H and W are derived via gradient descent optimization [5]. The update rule for the elements of matrix H is given by Eq. (4), where j = 1, ..., m and a = 1, ..., r.

$$H_{aj} \leftarrow H_{aj}\,\frac{(W^{T}V)_{aj}}{(W^{T}WH)_{aj}} \tag{4}$$

The update rule derived for the elements of the mixing matrix W is given by Eq. (5), where i = 1, ..., n and a = 1, ..., r.

$$W_{ia} \leftarrow W_{ia}\,\frac{(VH^{T})_{ia}}{(WHH^{T})_{ia}} \tag{5}$$

The update rules given by Eqs. (4) and (5) are executed alternately: at each iteration, the elements of W are updated after the update of the encoding matrix is completed. Step size is a crucial parameter of a gradient descent algorithm; a small step size may lead to slow convergence, whereas a large one can cause divergence. Lee and Seung [5] proved that convergence is guaranteed when the step size is chosen as η_aj for the update of entry H_aj and as η_ia for W_ia, where η_aj and η_ia are given by Eq. (6) for i = 1, ..., n and a = 1, ..., r.

$$\eta_{aj} = \frac{H_{aj}}{(W^{T}WH)_{aj}}, \qquad \eta_{ia} = \frac{W_{ia}}{(WHH^{T})_{ia}} \tag{6}$$

Note that the multiplicative update formulas given by Eqs. (4) and (5) do not require a predefined fixed learning step size; instead, the step size is updated automatically.
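To make the alternating scheme concrete, the following sketch implements the multiplicative updates of Eqs. (4) and (5) in NumPy. It is our illustration rather than code from the paper; the small constant eps added to the denominators is a standard safeguard against division by zero, and it, the iteration count and the random initialization are our assumptions.

```python
import numpy as np

def nmf(V, r, n_iter=250, eps=1e-9, seed=0):
    """Conventional NMF via the multiplicative updates of Eqs. (4)-(5).

    V : (n, m) non-negative data matrix, one sample per column.
    r : rank of the factorization.
    Returns W (n, r) and H (r, m) such that V is approximately W @ H.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))   # random non-negative initialization
    H = rng.random((r, m))
    for _ in range(n_iter):
        # Eq. (4): H_aj <- H_aj (W^T V)_aj / (W^T W H)_aj
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Eq. (5): W_ia <- W_ia (V H^T)_ia / (W H H^T)_ia
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Note that H is updated before W in each pass, matching the alternating order described above.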

2.2. Drawbacks of conventional NMF in online processing

In an online process, once a factorization is obtained for a group of samples, that representation should be appropriately updated with respect to subsequent observations at a small computational cost. Because of its batch nature, conventional NMF suffers in this respect. Since each column of the data matrix V corresponds to a different sample, the arrival of each new sample increases the dimension of V. NMF's computational complexity is O(mnr) per iteration, which implies that the computational load increases linearly with the dimension of the data. Therefore, re-running the whole batch NMF process whenever a new sample is received is impractical. Moreover, an increase in data dimension necessitates a higher rank to preserve the same representation quality, which further increases the computational load.

NMF's significant storage demand is also a problem for online data processing. As Eqs. (4) and (5) indicate, the multiplicative update formulas use the matrices W, H and V at each iteration and thus require storing all previous samples. Therefore, a huge amount of memory must be assigned to the matrices V and H, which makes the factorization impractical in an online application.

Another drawback of conventional NMF is encountered in integrating the contribution of the latest samples into the representation. Conventionally, each sample contributes equally to the factors of NMF. However, some degree of memorylessness is desired for most applications, in which systems need to exclude the contributions of old samples from their representations and to increase the influence of the latest samples. For example, a video surveillance application requires a background model that can reflect content changes: if a mobile foreground object becomes stationary, it has to be added to the background model immediately; conversely, a background object becomes a foreground object when it starts to move and therefore has to be excluded from the background representation.


In order to achieve this, the influences of new samples on the representation should be higher, while the effects of old ones need to be diminished. Because of its batch processing nature, conventional NMF lacks the ability to adaptively control the contributions of individual samples to its representations and fails to impose memorylessness. The next section presents INMF, which is introduced to overcome these problems.

3. Incremental non-negative matrix factorization

The aim of INMF is to update the factor matrices W and H by adding the effects of subsequent samples without requiring much computational effort. A new column should be added to both V and H, and the mixing matrix W needs to be updated for each new sample. This yields an online updating scheme. The following subsections describe the introduced incremental form of the cost function and the updating rules derived for INMF.

3.1. Formulating the cost function of INMF

Let W_k and H_k denote the optimized factor matrices of the initial k samples, where k ≥ 2r. In Eq. (7), F_k refers to the cost function corresponding to the NMF representation of the first k samples.

$$F_k = \frac{1}{2}\,\|V - W_k H_k\|^{2} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl(V_{ij} - (W_k H_k)_{ij}\bigr)^{2} \tag{7}$$

When the (k+1)th sample, v_{k+1}, arrives, the reconstruction error is formulated by Eq. (8). Note that this error function is expressed in terms of W_{k+1} and H_{k+1}.

$$F_{k+1} = \frac{1}{2}\,\|V - W_{k+1} H_{k+1}\|^{2} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{k+1}\bigl(V_{ij} - (W_{k+1} H_{k+1})_{ij}\bigr)^{2} \tag{8}$$

The conventional NMF updating rules given by Eqs. (4) and (5) could be used to obtain the optimal W_{k+1} and H_{k+1} matrices; however, this is not plausible in an online scheme because of the high computational load. The columns of W can be thought of as the building blocks of the data. Each entry of vector h_c, where the subscript indicates the sample number, determines how these building blocks participate in the corresponding observation vector v_c [6]. This is why the mixing matrix W is important in representing the whole data. Moreover, as the number of observed samples increases, the effect of a new sample on the representation decreases. Hence, INMF does not need to update the encoding vectors of the old samples, because a new sample cannot significantly affect W's optimality for the previous samples. By assuming that the first k columns of H_{k+1} are approximately equal to H_k, it is adequate to update only the last column of H_{k+1} in addition to the mixing matrix W. Thus, when the number of observed samples is k+1, the reconstruction error of the first k samples (F_k) can be formulated as follows:

$$F_k = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl(V_{ij} - (W_{k+1} H_{k+1})_{ij}\bigr)^{2} \approx \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl(V_{ij} - (W_{k+1} H_k)_{ij}\bigr)^{2} \tag{9}$$

Consequently, the reconstruction error of all samples can approximately be expressed as the sum of F_k and f_{k+1}. In Eq. (10), f_{k+1} refers to the residual error corresponding to the (k+1)th sample, and h_{k+1} is the encoding vector corresponding to the last sample.

$$F_{k+1} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{k+1}\bigl(V_{ij} - (W_{k+1} H_{k+1})_{ij}\bigr)^{2} \approx F_k + \frac{1}{2}\sum_{i=1}^{n}\bigl((v_{k+1})_i - (W_{k+1} h_{k+1})_i\bigr)^{2} = F_k + f_{k+1} \tag{10}$$

It can be seen from Eq. (10) that the contribution of each sample to the cost function is equal. Theoretically this is reasonable when samples are independent. However, in some applications, e.g. video processing, a number of successive samples with similar content may become dominant in the factorization. This may prevent the factorization from reflecting the content changes brought by a new frame, especially when the number of existing samples is high. Therefore, one might want to control the contribution of each sample to the representation; in fact, it is usually desirable to give the latest samples more effect on the representation. For example, in background modeling, as indicated in Section 2.2, if the latest samples have more influence on the factorization, the adaptability of the factorization to dynamic content changes increases [19]. In order to control the contribution of each sample to the representation and to add a degree of memorylessness to the INMF model, the weighting coefficients S_o(α) and S_f(α), two functions of the constant α, are included in the cost function F_{k+1} as formulated by Eq. (11). The function S_o(α) determines the contribution of the older samples, whereas S_f(α) is used to control that of the last one.

$$F_{k+1} = S_o(\alpha)\,F_k + \frac{1}{2}\,S_f(\alpha)\sum_{i=1}^{n}\bigl((v_{k+1})_i - (W_{k+1} h_{k+1})_i\bigr)^{2} \tag{11}$$

According to Eq. (11), the error terms corresponding to previous samples are repeatedly multiplied by S_o(α), and the error component of a new sample is weighted by S_f(α). This forms a power series of S_o(α) in the weighting coefficients. Consequently, the cost function given by Eq. (11) can be rewritten as Eq. (12), where S_j(α) indicates the weight of the jth sample.

$$F_{k+1} = \frac{1}{2}\sum_{j=1}^{k+1} S_j(\alpha) \sum_{i=1}^{n}\bigl(V_{ij} - (W_{k+1} H_{k+1})_{ij}\bigr)^{2} \tag{12}$$

where

$$S_j(\alpha) = \begin{cases} S_o^{\,k+1-2r}(\alpha), & j \le 2r \\ S_o^{\,k+1-j}(\alpha)\, S_f(\alpha), & 2r < j \le k+1 \end{cases}$$

Under the constraint that S_o(α) is smaller than 1, the contributions of older frames, weighted by S_o^{k+1−2r}(α) or S_o^{k+1−j}(α)S_f(α), always decrease and converge to zero as k goes to infinity. This means that memorylessness is achieved when such a weighting is used. Note that S_o(α) controls how fast the contributions of early samples wane; in other words, S_o(α) supervises the subspace learning scheme.

3.2. Derivation of multiplicative update rules of INMF

After constructing the cost function given by Eq. (12), the gradient descent optimization that yields INMF is performed. Whenever a new sample is acquired, the mixing matrix W and the corresponding new column of H are updated.

The update rule of h_{k+1} can be formulated within the framework of the gradient descent algorithm as shown by Eq. (13), where a = 1, ..., r.

$$(h_{k+1})_a \leftarrow (h_{k+1})_a - \eta_a\,\frac{\partial F_{k+1}}{\partial (h_{k+1})_a} \tag{13}$$

The partial derivative in Eq. (13) can be calculated as follows:

$$\begin{aligned}
\frac{\partial F_{k+1}}{\partial (h_{k+1})_a}
&= \frac{\partial}{\partial (h_{k+1})_a}\left[\frac{1}{2}\sum_{j=1}^{k+1} S_j(\alpha)\sum_{i=1}^{n}\bigl(V_{ij}-(W_{k+1}H_{k+1})_{ij}\bigr)^{2}\right] \\
&= \frac{1}{2}\sum_{j=1}^{k+1} S_j(\alpha)\,\frac{\partial}{\partial (h_{k+1})_a}\left[\sum_{i=1}^{n}\Bigl((v_j)_i-\sum_{l=1}^{r}(W_{k+1})_{il}(h_j)_l\Bigr)^{2}\right] \\
&= S_{k+1}(\alpha)\sum_{i=1}^{n}\bigl((v_{k+1})_i-(W_{k+1}h_{k+1})_i\bigr)\bigl(-(W_{k+1})_{ia}\bigr) \\
&= S_{k+1}(\alpha)\bigl[-(W_{k+1}^{T}v_{k+1})_a+(W_{k+1}^{T}W_{k+1}h_{k+1})_a\bigr]
\end{aligned} \tag{14}$$

When the expression given by Eq. (15) is selected as the step size, the update rule given by Eq. (16) is derived.

$$\eta_a = \frac{(h_{k+1})_a}{S_{k+1}(\alpha)\,(W_{k+1}^{T}W_{k+1}h_{k+1})_a} \tag{15}$$

$$(h_{k+1})_a \leftarrow (h_{k+1})_a\,\frac{(W_{k+1}^{T}v_{k+1})_a}{(W_{k+1}^{T}W_{k+1}h_{k+1})_a} \tag{16}$$

The update equation for the (i,a)th component of the mixing matrix W_{k+1} is given by Eq. (17), where i = 1, 2, ..., n and a = 1, 2, ..., r. W_k is used as the initial condition for the factorization performed when the (k+1)th sample is received.

$$(W_{k+1})_{ia} \leftarrow (W_{k+1})_{ia} - \eta_{ia}\,\frac{\partial F_{k+1}}{\partial (W_{k+1})_{ia}} \tag{17}$$

The partial derivative shown in Eq. (17) can be expressed by Eq. (18).

$$\begin{aligned}
\frac{\partial F_{k+1}}{\partial (W_{k+1})_{ia}}
&= \frac{\partial}{\partial (W_{k+1})_{ia}}\left[\frac{1}{2}\sum_{j=1}^{k+1} S_j(\alpha)\sum_{i=1}^{n}\bigl(V_{ij}-(W_{k+1}H_{k+1})_{ij}\bigr)^{2}\right] \\
&= \sum_{j=1}^{k+1} S_j(\alpha)\bigl(V_{ij}-(W_{k+1}H_{k+1})_{ij}\bigr)\bigl(-(H_{k+1})_{aj}\bigr) \\
&= \sum_{j=1}^{k+1} S_j(\alpha)\bigl(-V_{ij}(H_{k+1})_{aj}+(H_{k+1})_{aj}(W_{k+1}H_{k+1})_{ij}\bigr)
\end{aligned} \tag{18}$$

η_{ia}, given by Eq. (19), is chosen as the step size. By defining a matrix with elements T_{ij} = S_j(α)H_{ij} and using it in Eqs. (12) and (19), it can be seen that this formulation is basically equivalent to the case of conventional NMF; as a consequence, the proof of convergence given in Ref. [5] is also valid here.

$$\eta_{ia} = \frac{(W_{k+1})_{ia}}{\sum_{j=1}^{k+1} S_j(\alpha)\,(W_{k+1}H_{k+1})_{ij}\,(H_{k+1}^{T})_{ja}} \tag{19}$$

By using Eqs. (17)–(19), the multiplicative update rule for the elements of W_{k+1} can be shown to be equal to the one given by Eq. (20), where h_j denotes the jth column of H_{k+1}, v_j denotes the jth column of V, and i = 1, ..., n, a = 1, ..., r.

$$(W_{k+1})_{ia} \leftarrow (W_{k+1})_{ia}\,\frac{\sum_{j=1}^{k+1} S_j(\alpha)\,(v_j h_j^{T})_{ia}}{\sum_{j=1}^{k+1} S_j(\alpha)\,(W_{k+1}h_j h_j^{T})_{ia}} \tag{20}$$

Since the first k columns of H_{k+1} are not updated when the (k+1)th sample arrives, the update rule given by Eq. (20) can be rewritten in the form of Eq. (21).

$$(W_{k+1})_{ia} \leftarrow (W_{k+1})_{ia}\,\frac{\bigl(S_o(\alpha)\,V_k H_k^{T}+S_f(\alpha)\,v_{k+1}h_{k+1}^{T}\bigr)_{ia}}{\bigl(S_o(\alpha)\,W_{k+1}H_k H_k^{T}+S_f(\alpha)\,W_{k+1}h_{k+1}h_{k+1}^{T}\bigr)_{ia}} \tag{21}$$

Although the new form given by Eq. (21) is equivalent to Eq. (20), the expression in Eq. (21) is helpful in reducing the complexity. Both V_k and H_k do not change throughout the process; thus, instead of storing V_k and H_k, whose dimensions increase as new samples arrive, the products V_k H_k^T and H_k H_k^T can be stored. Two benefits appear here. The first concerns the size of the required storage memory: since the dimensions of these product matrices remain constant, the required storage is the same regardless of the sizes of V_k and H_k. Secondly, the number of matrix multiplications, which is the main source of the computational complexity of conventional NMF, is reduced. When using these products in the update rule given by Eq. (21), instead of recomputing them repeatedly we can make use of the updated versions given by Eqs. (22) and (23).

$$V_{k+1}H_{k+1}^{T} = S_o(\alpha)\,V_k H_k^{T} + S_f(\alpha)\,v_{k+1}h_{k+1}^{T} \tag{22}$$

$$H_{k+1}H_{k+1}^{T} = S_o(\alpha)\,H_k H_k^{T} + S_f(\alpha)\,h_{k+1}h_{k+1}^{T} \tag{23}$$

V_{k+1}H_{k+1}^T can be expressed as the sum of the products of the corresponding columns of V_{k+1} and H_{k+1}. Since only the last column of H_{k+1} is updated, there is no need to recalculate V_k H_k^T, which would require multiplying each of the k column pairs of V_k and H_k and then adding the results (Σ_{i=1}^{k} v_i h_i^T). In addition, multiplying the right-hand-side components by S_o(α) and S_f(α), respectively, whenever a new sample is processed produces the weighting coefficients given by Eq. (12). Hence, the computational complexity of executing these multiplications is O(nr) instead of O(nmr). Considering this, the overall computational complexity of INMF becomes O(nr²) per iteration; note that it is independent of the number of samples, m. A comparison between the complexities of INMF and NMF can be made via Fig. 1: the time required to run the NMF algorithm increases linearly with the number of samples, whereas INMF's computational complexity is independent of it.

3.3. Selection of weighting coefficients/functions

This subsection discusses the weighting scheme of INMF's cost function and its effect on controlling the memorylessness of the factorization. As described in the previous subsection, S_o(α) and S_f(α), introduced in Eq. (11), are independent of the elements of V, W and H. These functions are included in the cost function in order to control the contributions of subsequent observations to INMF representations, and there are no restrictions on them except non-negativity, which is necessary to preserve the additive nature of INMF.


Fig. 1. Time required by one iteration versus number of samples.

Fig. 2. Distribution of residual errors versus frame number.

A wise approach is to choose S_o(α) and S_f(α) in a way that the prior samples' weights decay while the effects of the latest samples are emphasized. With α taking a value between 0 and 1, a suitable function pair for S_o(α) and S_f(α) is (1−α) and α, respectively. A similar approach is also used in the incremental PCA scheme presented in Ref. [18] for the statistical background modeling problem. As indicated in Section 3.1, consecutive execution of this weighting process forms a power series of S_o(α) = 1−α in the weighting coefficients. Let f_{2r+k} in Eq. (24) denote the residual error caused by the (2r+k)th sample. As can be seen in Eq. (24), since S_o(α) is chosen as a positive number smaller than 1, the effects of earlier samples on the cost function are smaller than those of new ones.

$$F_{2r+k} = (1-\alpha)^{k} F_{2r} + \alpha(1-\alpha)^{k-1} f_{2r+1} + \alpha(1-\alpha)^{k-2} f_{2r+2} + \cdots + \alpha f_{2r+k} \tag{24}$$

One consequence of selecting the weighting functions as in Eq. (24) is an increase in the speed of adaptation to dynamic data changes. Although this was the main reason for choosing the weighting functions as (1−α) and α, it is not the only significant advantage that Eq. (24) provides. For conventional NMF there is a tradeoff between rank and factorization accuracy. In contrast, the introduced weighted cost function reduces the importance of optimal rank selection, because the weighting scheme has a windowing effect on the factorization process: the effects of the earlier samples continuously wane. Thus, even if the number of samples increases significantly, the actual dimension of the problem does not change, owing to the memorylessness property. Note that α should be selected experimentally depending on the requirements of the application.
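To summarize the derivation, the sketch below processes a single new sample: it iterates Eq. (16) for the new encoding vector and Eq. (21) for the mixing matrix, then folds the sample into the running products of Eqs. (22) and (23) using S_o(α) = 1−α and S_f(α) = α. This is a minimal reading of Section 3, not the authors' released implementation; the initialization of h, the inner iteration count and the eps guard are our assumptions.

```python
import numpy as np

def inmf_update(W, A, B, v, alpha=0.2, n_iter=13, eps=1e-9):
    """Incremental NMF step for one new sample v (length n).

    W : (n, r) current mixing matrix.
    A : (n, r) running product V_k H_k^T, maintained via Eq. (22).
    B : (r, r) running product H_k H_k^T, maintained via Eq. (23).
    alpha : weighting constant, so S_o = 1 - alpha and S_f = alpha.
    Returns the updated (W, A, B) and the encoding vector h of v.
    """
    v = np.asarray(v, dtype=float)
    so, sf = 1.0 - alpha, alpha
    r = W.shape[1]
    h = np.full(r, v.mean() / r + eps)   # non-negative initialization (our choice)
    for _ in range(n_iter):
        # Eq. (16): h_a <- h_a (W^T v)_a / (W^T W h)_a
        h *= (W.T @ v) / (W.T @ W @ h + eps)
        # Eq. (21): W_ia <- W_ia (S_o A + S_f v h^T)_ia
        #                        / (S_o W B + S_f W h h^T)_ia
        num = so * A + sf * np.outer(v, h)
        den = so * (W @ B) + sf * np.outer(W @ h, h)
        W *= num / (den + eps)
    # Eqs. (22)-(23): fold the new sample into the stored products.
    A = so * A + sf * np.outer(v, h)
    B = so * B + sf * np.outer(h, h)
    return W, A, B, h
```

A short batch NMF over the first 2r samples (Section 3.1 assumes k ≥ 2r) would supply the initial W, together with A = V_{2r} H_{2r}^T and B = H_{2r} H_{2r}^T. Since W, A and B have fixed sizes, the memory footprint stays constant as k grows, in line with the complexity discussion of Section 3.2.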

4. Experimental results

In order to evaluate the performance of the introduced INMF method in large-scale online data processing, several tests are performed on PETS [21] and TRECVID [24] benchmark video sequences. To demonstrate the use of INMF, results of experiments on two applications, background modeling and clustering, are reported in this section.

4.1. Background representation capability of INMF

The automatic visual tracking module included in most video surveillance systems requires a proper background model that needs to be adapted to dynamic content changes. We have performed background modeling tests on video samples taken from the PETS 2000 and 2001 surveillance video data sets [21]. Outdoor scene videos in the PETS database are captured under severe illumination changes and include several foreground objects moving on a busy background. Thus, the PETS database is a good test case to evaluate the memorylessness and subspace learning capability of our incremental factorization scheme.

Since INMF aims to minimize the mean squared error formulated by Eq. (10), it is convenient to evaluate the factorization performance by examining the residual (reconstruction) error encountered for each video frame. Therefore, in this paper factorization performance is reported in terms of the residual error obtained for each new frame. Note that in Eq. (10) the residual error for frame (k+1) is defined as

$$f_{k+1} = \frac{1}{2}\sum_{i=1}^{n}\bigl((v_{k+1})_i - (W_{k+1}h_{k+1})_i\bigr)^{2}$$

Here the subscript k denotes the number of video frames processed up to the last observation. Initially, the first 150 frames of the `dataset1_cam1' surveillance video sequence from the PETS 2001 database are used to constitute the data matrix V. A (144×192) dc image [22] of each video frame is considered as a sample observation and stored as a column of V; thus V is a ((144×192)×150) matrix. Since all of these 150 frames are background frames, it is assumed that the content of the data samples remains stationary throughout the observations. Note that conventional NMF applies batch processing to all of these 150 frames. Therefore, the residual error corresponding to the (k+1)th frame is equal to

$$f_{k+1} = \frac{1}{2}\sum_{i=1}^{n}\bigl(V_{i,k+1} - (WH)_{i,k+1}\bigr)^{2}$$

for NMF. Fig. 2 illustrates the distribution of reconstruction error versus frame number for four different tests where the rank is chosen as r = 2. Since the observed 150 background frames are very similar to each other, the reconstruction errors obtained both by unweighted INMF and by conventional NMF are very small (blue and red lines plotted in Fig. 2). In fact, an error value of about 0.5 is negligible compared to the upper bound of the error, which is 255²/2. This indicates that both INMF and NMF are capable of forming an efficient initial background representation.


Fig. 3. (a) Original frame 614. (b) Original frame 837. Residue image showing the deviations from the background model (c) at frame 614 and (d) at frame 837 (pixel intensities are increased for illustration purpose).

In this test, the number of iterations for convergence of conventional NMF is t = 250, whereas it is t = 13 per frame for INMF. Fig. 2 also illustrates the effect of weighting on reconstruction performance. The residual error is expected to be smaller when the contribution of each observation is weighted by INMF. It was concluded in Section 3.3 that for a dynamic background modeling task a suitable choice for the weighting coefficients S_o(α) and S_f(α) of Eq. (11) is (1−α) and α, respectively. Although both results shown in Fig. 2 are acceptable, the reconstruction error of weighted INMF is smaller for α = 0.3 than for α = 0.2 (green and cyan line plots in Fig. 2), because higher α values increase the contribution of the last observation to the factorization. Note that unlike conventional NMF, INMF allows controlling the contribution of each sample to the representation, which results in a better optimization.

4.2. Adaptability to dynamic content changes

In video surveillance applications, generally a background model is constructed first. If there is a moving foreground object in the scene, for instance if an object enters the scene or an initially stable object starts moving, deviations from the background model appear. Conversely, a mobile object may exit the scene, or may stop and turn into a background object. A powerful online and incremental factorization scheme should be capable of adapting the existing background representation according to these changes, which are referred to as dynamic content changes. These changes can be considered local variations from the current model, because the observed samples (new video frames) do not contain a global change; only a local region of the data experiences variations. So, it is common to evaluate the dynamic content representation capability of a factorization by first projecting the samples (frames) onto the background model and then measuring the deviations from it; the numerical value corresponding to these deviations is referred to as the residual error within the text.

Fig. 3(a) and (b) illustrate two frames from a PETS 2000 test sequence which are used in examining the online representation capability of INMF. In the first frame (Fig. 3(a)) there are two moving objects: a car is parking while a man is walking.


Fig. 4. Distribution of the residual error versus frame number for three methods: the introduced INMF with So = 0.8 and Sf = 0.2, the conventional NMF, and IPCA.

In the second frame (Fig. 3(b)) the car becomes stationary after parking, while two new moving objects, a man getting out of that car and a new car, enter the scene. Figs. 3(c) and (d) illustrate the residue images obtained by taking the difference between the reconstructed background and the observed frame. The proposed INMF scheme is shown to be capable of efficiently representing the background; thus all foreground objects are observable in the residue images. Furthermore, the moving car is successfully included into the background model after parking. For this experiment the INMF algorithm is executed with S_o(α) = 0.8 and S_f(α) = 0.2, and adaptation of the background model is observed to take 7 frames after parking.

The performances of different methods can be examined numerically by evaluating the variations of the residual error. For this test a PETS 2001 video sequence is used, and Fig. 4 plots the distribution of the residual error versus frame number for 700 frames. Three different methods are compared: INMF, NMF and the incremental PCA algorithm presented in Ref. [18]. As can be seen from Fig. 4, since there is no foreground object in the scene from frame 800 to 940, the residual errors of all methods are small. Note that the error of conventional NMF is slightly higher than that of the other two methods, because of NMF's batch processing scheme, which assigns equal weights to each frame. It can be concluded from Fig. 4 that a car enters the scene at frame 980 and moves until frame 1120; the residual error increases, since the existence of a foreground object results in high deviations from the background model. The important point here is that the residual error of INMF drops at frame 1120, just after the car stops and becomes a background object. This demonstrates that deviations from the background model diminish once the car is successfully included into the background model. Notice that, unlike INMF, NMF fails to achieve this adaptability (Fig. 4, black line). The residual error increases again when a new walking man enters the scene and one of the parked cars starts moving after frame 1200. From Fig. 4 it can also be concluded that, unlike conventional NMF, INMF has the ability to adapt its factors to content changes and is capable of imposing memorylessness on its representations. Moreover, the background modeling and adaptability performances of INMF and the incremental PCA method of Ref. [18] are very close. In this experiment S_o(α) and S_f(α) are selected as 0.8 and 0.2, respectively.


As discussed in Section 3.3, choosing S_o(α) = 0.8 enforces memorylessness by lowering the contributions of older frames. In order to prevent dominance of the last frame on the representation, it is appropriate to choose S_f(α) = 0.2. It should be noted that the choice of weighting functions is problem specific; taking both of them equal to 1 yields an unweighted INMF scheme.
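As a usage illustration with the hypothetical inmf_update sketch from Section 3.2, the settings discussed above would differ only in the weighting arguments; the unweighted case would require generalizing the function to take S_o and S_f as independent constants rather than the (1−α, α) pair.

```python
# Weighted INMF as in this experiment: S_o = 0.8, S_f = 0.2 (alpha = 0.2).
W, A, B, h = inmf_update(W, A, B, v_new, alpha=0.2)

# Unweighted INMF (S_o = S_f = 1) would need the generalized signature, e.g.
# W, A, B, h = inmf_update(W, A, B, v_new, so=1.0, sf=1.0)
```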

4.3. Robustness to illumination changes

We have also evaluated INMF's performance under severe illumination changes that shift the intensity values of the majority of pixels and thus significantly change the mean intensity. Intense illumination changes are difficult to handle because of the significant deviations they cause in the original data. For this test, samples are taken from another PETS 2001 sequence in which severe illumination changes are experienced. A comparison between INMF and the incremental PCA of Ref. [18] has been performed for the weighting functions (1−α) and α, where α = 0.05. Fig. 5 plots the distribution of the residual error versus frame number through frames taken from a stationary scene in which there are no foreground objects or motion but only illumination changes. Fig. 5 leads to the conclusion that INMF is more robust to global data changes than the incremental PCA described in Ref. [18]. This can also be observed from Fig. 6(c) and (d), which illustrate the residue images (variations from the background model) corresponding to the original frames shown in Fig. 6(a) and (b), respectively. Since there are no foreground objects in the scene, the residue images are expected to be blank. However, because the illumination variations change the pixel intensities significantly, pixels belonging to shiny regions of the scene are slightly visible in the residue images. Although the residue images obtained by both methods are distorted, the level of distortion is very small for INMF (see Fig. 6(d)), which reflects the error distribution shown in Fig. 5. This is because, unlike PCA, INMF does not theoretically impose a Gaussian distribution assumption on the transformed data. Therefore, INMF representations are more robust to illumination variations, which significantly change the mean value of the observations.

4.4. INMF as a clustering tool

In this subsection we examine INMF as a clustering tool to demonstrate another application of the introduced incremental factorization scheme. We use the clustering capability of INMF for online video scene change detection, which constitutes the first step in content-based video abstraction [23]. Recently the relationship between NMF and fuzzy k-means clustering has been evaluated, and it has been shown that both methods optimize the same objective function [17]. Inspired by this approach, it can be shown that INMF can also be considered a clustering tool, because, as derived in Section 3.1, the cost function that INMF optimizes (Eq. (10)) has the same form as that of conventional NMF (Eq. (8)), except that INMF performs the optimization incrementally. To demonstrate the use of INMF as a clustering tool, we show its performance in video scene change detection; a video sequence taken from the TRECVID data set [24] which contains two different scenes is processed. A video scene is a sequence of consecutive frames taken within the same camera shot and is generally considered the main structural element of video that contains temporal knowledge about the content.

Fig. 5. Distribution of the residual error versus frame number for the INMF and IPCA.

Fig. 7. (a) Original frame 75; (b) original frame 175; (c) frame 75 reconstructed by INMF; and (d) frame 175 reconstructed by INMF.

Fig. 6. (a) Original frame 2635. (b) Original frame 2874. Variations from the background model for frame 2874 obtained (c) by IPCA and (d) by INMF.


Fig. 8. Incrementally evolved basis images. The first column of: (a) W12, (b) W63, (c) W84, (d) W170, and the second column of: (e) W12, (f) W63, (g) W84, (h) W170, where the subscripts show the number of frames processed incrementally up to that time instant.

Fig. 10. Cluster indicators which are obtained by the INMF.

Fig. 9. Distribution of the reconstruction error of INMF versus frame number.

If we define scene change detection as a pattern classification problem, these two scenes correspond to two different clusters, and the motivation behind solving this problem can simply be defined as finding an appropriate clustering scheme that provides the cluster centroids as well as their members. Fig. 7(a) and (b) illustrate two video frames (samples) taken from each of these two scenes (classes). For simplicity, test results obtained on a short video sequence of 180 frames are reported, and the numbers of samples taken from the two classes are very close. Since the number of clusters in a scene change detection problem is fixed at 2, it is appropriate to choose the factorization rank as r = 2. The introduced INMF scheme is performed on the 180 frames, and the analogy between factorization and clustering has been clearly observed by examining the basis images (columns of the mixing matrix W) and the reconstruction error for each new sample (frame). Fig. 8 illustrates how the first and second basis images (the first and second columns of W) evolve during the factorization. Note that rather than a batch process, an incremental and online clustering scheme is performed by INMF. Since the first 80 observations (frames) come from the first cluster, the basis images obtained after factorizing these samples are visually similar to the frames of the first class (Fig. 8(a), (b), (e) and (f)). This is not surprising, since INMF's objective function, which minimizes the reconstruction error, forces the mixing matrix W to contain as much information about the data as possible.

After the 80th frame, samples of the second class are processed, and the content of the basis images starts to change. Throughout the incremental factorization process the second basis image incrementally converges to the centroid of the second cluster (Fig. 8, second row), while the first basis image converges to the centroid of the first cluster (Fig. 8, first row). Note that each of the basis images also contains a little information about the other cluster, due to the fuzzy clustering nature of INMF.

In an online video scene change detection task, the aim is to detect scene boundaries in an online manner. This requires online determination of cluster boundaries, i.e., the location where the old scene ends and the new one begins. We take advantage of the factorization residual error to achieve this: whenever a new frame (sample) is processed, we also determine whether that frame belongs to the current cluster by examining its residual error. As before, the residual error is obtained by calculating

$$f_{k+1} = \frac{1}{2}\sum_{i=1}^{n}\bigl(V_{i,k+1} - (WH)_{i,k+1}\bigr)^{2}$$

If it is higher than a prespecified threshold, the new frame does not belong to the current cluster. So, the residual error can be used as a decision criterion to locate scene changes (cluster boundaries).
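A sketch of this decision rule follows (our illustration; inmf_update is the hypothetical routine from Section 3.2, and the threshold value is application dependent and assumed here):

```python
import numpy as np

def scan_for_scene_changes(W, A, B, frames, alpha=0.2, threshold=1500.0):
    """Flag frames whose residual error exceeds a prespecified threshold.

    frames : iterable of length-n non-negative frame vectors.
    Yields (frame_index, residual, is_scene_change) for each frame.
    """
    for k, v in enumerate(frames):
        W, A, B, h = inmf_update(W, A, B, v, alpha=alpha)
        residual = 0.5 * np.sum((v - W @ h) ** 2)   # f_{k+1} as defined above
        yield k, residual, bool(residual > threshold)
```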


Fig. 11. Nine different video frames taken from three scenes. Each row contains three frames from one of the three scenes.

Fig. 9 plots the residual error versus frame number. Up to frame 80 all frames are from the first scene, thus the basis images contain information only about the first scene (Fig. 8(a), (b), (e), (f)). When a frame from the new scene arrives, a rise in the error is experienced, because it takes some time until the basis images retrieve enough information about the new scene to secure a successful reconstruction performance. As new frames from the second scene continue to arrive, their content is gradually included into the basis images (Fig. 8(c), (d), (g), (h)), which results in a better reconstruction performance and a drop in the residual error. Fig. 7(c) and (d) give the reconstructed images of frames 75 and 175, respectively; both reconstructions are successful. A peak in the residual error observed at frame 80 indicates that a scene change occurs. Considering this, we can conclude that the first 80 frames belong to the first cluster (scene 1) while the rest belong to the second cluster (scene 2).

Regarding the clustering issue, since each column of the encoding matrix H determines how the individual basis images contribute to the reconstruction of the corresponding sample, these columns can be considered cluster indicators. For instance, let the first column of H be [0.8 0.2]^T. This indicates that the contribution of the first mixing vector (cluster centroid) is higher than that of the second one; consequently, it can be concluded that the corresponding sample belongs to the first cluster. Hence, the cluster indicator of each sample constitutes a 2D feature vector showing to which cluster that sample belongs, and samples that belong to the same cluster have similar cluster indicators. Fig. 10 plots the columns of the encoding matrix H obtained in the test described above. It can be seen from Fig. 10 that the cluster indicators of frames coming from different scenes are linearly separable.

The use of NMF's clustering ability for detecting scene changes can be further expanded, and scene-based video frame classification can be achieved by INMF representations. In this case, in contrast to the scene change detection application, we do not seek to detect whether the latest sample belongs to a new pattern, but aim to determine to which of the classes (clusters) it belongs. Additionally, the number of classes is not restricted to two.
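The worked [0.8 0.2]^T example above amounts to taking the largest entry of each encoding column; a minimal sketch of this assignment rule (our illustration, not code from the paper):

```python
import numpy as np

def cluster_of(h):
    """Assign a sample to the cluster whose centroid (basis image) has the
    largest weight in the sample's cluster indicator h (a column of H)."""
    return int(np.argmax(np.asarray(h, dtype=float)))

cluster_of([0.8, 0.2])   # -> 0: the first basis image dominates
```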

Fig. 12. Cluster indicators obtained by INMF for three clusters.

As an example, 250 video frames from three scenes are taken from the TRECVID database. Two of these scenes have frames that contain partial information about the other scene, as shown in the last column of the first and second rows of Fig. 11. As a result of a tilt-type camera motion (vertical motion of the camera), the content variation of the third scene, shown in the last row of Fig. 11, is higher than that of the other two. Three samples from each of these three scenes are given in Fig. 11. When INMF (r = 3) is executed on these 250 frames in random order, three different clusters, each plotted in a different color in Fig. 12, are obtained. Note that this process is unsupervised, and Fig. 12 plots the 3D cluster indicators acquired from the encoding matrix. The use of INMF in this application is quite straightforward: NMF has recently been used in this kind of classification/clustering problem [17], and INMF achieves the same goal in an incremental manner. For comparison purposes, Fig. 13 is obtained when NMF (r = 3) is executed on the same data. Observe that the classification performances of both factorizations are high, but the shapes of the clusters are quite different. This difference is due to the way the factorization is performed: conventional NMF's batch nature treats each individual sample equally, while INMF tries to fit the incoming samples to the factors created by the prior samples.

Fig. 13. Cluster indicators obtained by NMF for three clusters.

5. Conclusions

In this paper an incremental non-negative matrix factorization (INMF) scheme is introduced for online processing of large-scale non-stationary data. It is shown that, with its incremental nature, the introduced factorization forms subspace representations that reflect dynamic data content with low computational complexity. Our algorithm, implemented with an application developed in Visual Studio on an Intel Core 2 Quad CPU at 2.40 GHz, works in real time on the test video sequences. Unlike conventional NMF, INMF has a memorylessness property, which is crucial in modeling online data. Furthermore, INMF's weighted cost function, which enables adaptive control of memorylessness, reduces the significance of optimum rank selection, an important problem in factorization. In order to demonstrate INMF's efficiency, experiments on two applications, background modeling and clustering, are performed. Background modeling in video surveillance illustrates our algorithm's robustness and success in tracking content changes, while the other group of experiments shows how INMF's clustering ability can be used in video scene change detection and scene-based video frame classification problems. We are currently investigating new weighting functions to improve the subspace learning capability of incremental NMF; appropriate weighting functions could also improve the adaptation rate of our algorithm. We are also exploring the clustering capability of incremental NMF and how NMF's clustering performance could be improved by imposing sparseness.

Acknowledgment

This work is partially supported by the Scientific and Technological Research Council of Turkey.

References

[1] I.T. Jolliffe, Principal Component Analysis, second ed., Springer, New York, 2002. [2] J. Shen, G.W. Israel, A receptor model using a specific non-negative transformation technique for ambient aerosol, Atmos. Environ. 23 (10) (1989) 2289–2298. [3] P. Paatero, U. Tapper, Positive matrix factorization—a nonnegative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (1994) 111–126. [4] D.D. Lee, H.S. Seung, Learning the parts of objects by nonnegative matrix factorization, Nature 401 (1999) 788–791. [5] D.D. Lee, H.S. Seung, Algorithms for nonnegative matrix factorization, in: Proceedings of Neural Information Systems, 2000, pp. 942–948. [6] P.O. Hoyer, Non-negative matrix factorization with sparseness constraints, J. Mach. Learn. Res. 5 (2004) 1457–1469. [7] A. Pascual-Montano, J.M. Carazo, K. Kochi, D. Lehmann, R.D. Pascual-Marqui, Nonsmooth nonnegative matrix factorization, IEEE Trans. Pattern Anal. Mach. Intell. 42 (2006) 403–415. [8] A. Cichocki, R. Zdunek, S. Amari, New algorithms for non-negative matrix factorization in applications to blind source separation, in: IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 2006, pp. 621–624. [9] P.O. Hoyer, Nonnegative sparse coding, in: IEEE Workshop Neural Networks for Signal Processing, Martigny, Switzerland, 2002, pp. 557–565. [10] S.Z. Li, X. Hou, H. Zhang, Q. Cheng, Learning spatially localized partsbased representations, in: IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, USA, 2001, pp. 207–212. [11] C.-J. Lin, Projected gradient methods for non-negative matrix factorization, Neural Comput. 19 (2007) 2756–2779. [12] S. Kim, I.S. Dhillon, Fast Newton-type methods for the least squares nonnegative matrix approximation problems, in: Sixth SIAM Conference on Data Mining, Minnesota, USA, 2007, pp. 343–354. [13] D. Guillamet, J. Vitria, Nonnegative matrix factorization for face recognition, in: Fifth Catalonian Conference on Artificial Intelligence, 2002, pp. 336–344. [14] P. Smaragdis, J.C. Brown, Nonnegative matrix factorization for polyphonic music transcription, in: IEEE Workshop Applications of Signal Processing to Audio and Acoustics, New York, USA, 2003, pp. 177–180. [15] P. Fogel, S.S. Young, D.M. Hawkins, N. Ledirac, Inferential robust non-negative matrix factorization analysis of microarray data, Bioinformatics 23 (2007) 44–49. [16] V. Monga, M.K. Mihcak, Robust image hashing via non-negative matrix factorizations, in: IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 2006, pp. 225–228. [17] T. Li, C. Ding, The relationships among various nonnegative matrix factorization methods for clustering, in: Sixth IEEE International Conference on Data Mining, Hong Kong, 2006, pp. 362–371. [18] Y. Li, On incremental and robust subspace learning, Pattern Recognition 37 (2004) 1509–1518. [19] S.S. Bucak, B. Gunsel, Video content representation by incremental non-negative matrix factorization, in: IEEE International Conference on Image Processing, San Antonio, USA, 2007, pp. 113–116. [20] S.S. Bucak, B. Gunsel, O. Gursoy, Incremental non-negative matrix factorization for dynamic background modeling, in: ICEIS International Workshop on Pattern Recognition in Information Systems, Funchal, Portugal, 2007, pp. 107–116. [21] PETS Video Database http://ftp.pets.rdg.ac.uk/. [22] B. Yeo, B. Liu, Rapid scene analysis on compressed videos, IEEE Trans. Circuits Systems Video Technol. 5 (1995) 533–544. [23] J. 
Oostveen, T. Kalker, J. Haitsma, Feature extraction and a database strategy for video fingerprinting, in: International Conference on Recent Advances in Visual Information Systems, Taiwan, 2002, pp. 117–128. [24] TRECVID Video Retrieval Evaluation http://www.nlpir.nist.gov/projects/ trecvid/.

About the Author—SERHAT SELCUK BUCAK received his B.Sc. in Telecommunication Engineering from Istanbul Technical University, Turkey in 2006. His research interests include decomposition methods and statistical modeling. He is currently a graduate student and Research Assistant at Istanbul Technical University. About the Author—BILGE GUNSEL received M.S. and Ph.D. degrees in Electronics and Communication Engineering from Istanbul Technical University, Istanbul, Turkey in 1988 and 1993, respectively. She is currently working as a Professor in the Electrical and Electronics Engineering Department at Istanbul Technical University. She has been a Senior Researcher at the Scientific and Technical Research Council of Turkey, Marmara Research Center, Information Technologies Research Institute (1999–2002), Senior Researcher at LG Research Center, NJ, USA (1998), Research Associate in the Electrical Engineering Department at University of Rochester, Rochester, USA (1995–1997), Assistant Professor in the Electrical and Electronics Engineering Department at Istanbul Technical University (1994). Her research interests include audio watermarking, video/image content analysis and retrieval, stochastic image models and multimedia information systems. She has been serving as an Editorial Board member for Journal of Pattern Recognition published by Elsevier Science. She holds one US patent on content based access to video.