Neurocomputing 80 (2012) 54–63
Archetypal analysis for machine learning and data mining

Morten Mørup*, Lars Kai Hansen
Section for Cognitive Systems, Technical University of Denmark, Richard Petersens Plads, bld 321, 2800 Lyngby, Denmark
*Corresponding author. Tel.: +45 45 25 39 00. E-mail: [email protected] (M. Mørup), [email protected] (L.K. Hansen).
Article history: Available online 15 November 2011

Abstract
Archetypal analysis (AA) proposed by Cutler and Breiman (1994) [7] estimates the principal convex hull (PCH) of a data set. As such AA favors features that constitute representative 'corners' of the data, i.e., distinct aspects or archetypes. We show that AA enjoys the interpretability of clustering, without being limited to hard assignment, and the uniqueness of SVD, without being limited to orthogonal representations. In order to do large-scale AA, we derive an efficient algorithm based on projected gradient as well as an initialization procedure we denote FURTHESTSUM that is inspired by the FURTHESTFIRST approach widely used for K-means (Hochbaum and Shmoys, 1985 [14]). We generalize the AA procedure to KERNEL-AA in order to extract the principal convex hull in potentially infinite Hilbert spaces and derive a relaxation of AA for the case where the archetypes cannot be represented as convex combinations of the observed data. We further demonstrate that the AA model is relevant for feature extraction and dimensionality reduction for a large variety of machine learning problems taken from computer vision, neuroimaging, chemistry, text mining and collaborative filtering, leading to highly interpretable representations of the dynamics in the data. Matlab code for the derived algorithms is available for download from www.mortenmorup.dk.
© 2011 Elsevier B.V. All rights reserved.
Keywords: Archetypal analysis; Principal convex hull; Clustering; Non-negative matrix factorization; FurthestFirst; FurthestSum; Kernel methods
1. Introduction

Decomposition approaches have become a key tool for the analysis of a wide variety of massive data, from modeling the Internet, such as term-document matrices of word occurrences, bio-informatics data such as micro-array data of gene expressions, and neuroimaging data such as neural activity measured over space and time, to collaborative filtering such as the celebrated Netflix problem, to mention but a few. The conventional approaches range from low rank approximations such as singular value decomposition (SVD), principal component analysis (PCA) [4], independent component analysis (ICA) [6], sparse coding (SC) [24] also denoted dictionary learning [8], non-negative matrix factorization (NMF) [18] and convex NMF [10], over soft clustering approaches such as fuzzy K-means [3] and the EM-algorithm for clustering [4], to hard assignment clustering methods such as K-means and K-medoids. Common to these approaches is that they can be understood as a linear mixture or factor analysis type representation of data with various constraints. Thus data x_{m,n}, where m = 1,...,M is the feature index and n = 1,...,N is the sample index, is written in terms of hidden variables s_{d,n} and projections a_{m,d} with d = 1,...,D, where D denotes the number of factors, typically with a Gaussian noise
model, i.e.,

x_{m,n} = Σ_d a_{m,d} s_{d,n} + e_{m,n},   e_{m,n} ~ N(0, σ²).   (1.1)

SVD/PCA requires A and S to be orthogonal, in ICA statistical independence is assumed for S, and in SC a penalty term is introduced to promote sparsity of S, while in NMF all variables are constrained non-negative. In soft clustering S is also non-negative but the columns are constrained to sum to one, whereas in hard clustering by K-means S is constrained to be a binary assignment matrix such that A = XS^T(SS^T)^{-1} represents the Euclidean centers of each cluster, while for K-medoids a_d = x_n for some n, i.e., the cluster centers are actual data points. Despite the similarities of the above approaches their internal representations of the data differ greatly and thus the nature of the data analyses they offer. In SVD/PCA the features constitute the directions of maximal variation, i.e., so-called eigenmaps, for NMF the features are constituent parts, while K-means and K-medoids find the most representative prototype objects. A benefit of clustering approaches is that features are similar to measured data, making the results easier to interpret; however, the binary assignments reduce flexibility. Also, clustering typically involves complex combinatorial optimization leading to a plethora of heuristics. On the other hand, low rank approximations based on SVD/PCA/NMF have a great degree of flexibility but the features can be harder to interpret. Invariance to rotation of the extracted features can lead to lack of uniqueness, i.e., X ≈ AS = AQQ^{-1}S = ÃS̃.
In addition, SVD/PCA/ICA/SC are prone to cancellation effects in which two components both lose meaning because they locally become highly correlated, taking positive and negative near-canceling values (while still being globally orthogonal). In conclusion, clustering approaches give easily interpretable features but pay a price in terms of modeling flexibility due to the binary assignment of data objects. Approaches such as SVD/PCA/ICA/NMF/SC have added model flexibility and as such can be more efficient in capturing, e.g., variance; however, this efficiency can lead to complex representations from which we learn relatively little. Archetypal analysis (AA) proposed by Cutler and Breiman [7] directly combines the virtues of clustering and the flexibility of matrix factorization. In the original paper on AA [7] the method was demonstrated useful in the analysis of air pollution and head shape and later also for tracking spatio-temporal dynamics [27]. Recently, archetypal analysis has found use in benchmarking and market research identifying typically extreme practices, rather than just good practices [26], as well as in the analysis of astronomy spectra [5] and as an approach to the end-member extraction problem [25]. In this paper we demonstrate the following important theoretical properties of AA:
- The archetypal analysis representation is unique in the sense that a solution to AA in general does not suffer from rotational ambiguity.
- Archetypal analysis can be initialized efficiently through the proposed FURTHESTSUM method.
- Archetypal analysis can be efficiently computed using a simple projected gradient method.
- Archetypal analysis can be generalized to KERNEL-AA in order to extract the principal convex hull in a Hilbert space.
We further demonstrate that AA is useful for a wide variety of important machine learning and data mining problem domains, resulting in easily interpretable features that account well for the inherent dynamics in data. This paper is an extended version of Mørup and Hansen [23]. Apart from elaborating more on the
aspects of AA given above, we have included an additional application within chemistry as well as a relaxation of AA that can address the issue when "pure" archetypes cannot be represented by convex combinations of the observed data. We further provide Matlab code for the derived algorithms, available for download from www.mortenmorup.dk.
2. Archetypal analysis and the principal convex hull

The convex hull or envelope of a data matrix X is the minimal convex set containing X. Informally it can be described as a rubber band wrapped around the data points, see also Fig. 1. While the problem of finding the convex hull is solvable in linear time (i.e., O(N)) [21], the size of the convex set increases dramatically with the dimensionality of the data. The expected size of the convex set for N points in general position in M-dimensional space grows exponentially with dimension as O(log(N)^{M-1}) [11]. As a result, in high dimensional spaces the 'minimal' convex set forming the convex hull does not provide a compact data representation, see also Fig. 1. Archetypal analysis (AA) [7] considers the principal convex hull (PCH), i.e., the (D−1)-dimensional convex hull that best accounts for the data according to some measure of distortion D(·|·). This can formally be stated as finding the optimal C and S for the problem

arg min_{C,S} D(X | XCS)   s.t.   |c_d|_1 = 1,   |s_n|_1 = 1,   C ≥ 0,   S ≥ 0.
The constraint |c_d|_1 = 1 together with C ≥ 0 enforces the feature matrix A = XC as given in Eq. (1.1) to be a weighted average (i.e., convex combination) of the data observations, while the constraint |s_n|_1 = 1, S ≥ 0 requires the nth data point to be approximated by a weighted average (i.e., convex combination) of the feature vectors XC ∈ R^{M×D}. For brevity we consider D(X | XCS) = ||X − XCS||_F^2. As in principal component analysis, the optimal C, S will generate the principal (i.e., dominant) convex hull for the data X, see Fig. 1.
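For concreteness, the following short numpy sketch (ours, not part of the original paper) evaluates the PCH reconstruction and the least squares distortion D(X | XCS) for given simplex-constrained C and S; the names X, C and S mirror the notation above.

```python
import numpy as np

def pch_reconstruction(X, C, S):
    """Reconstruct X as X C S, where the archetypes A = X C are convex combinations of observations.

    X: (M, N) data, C: (N, D) with columns on the simplex, S: (D, N) with columns on the simplex.
    """
    A = X @ C          # archetypes, |c_d|_1 = 1, C >= 0
    return A @ S       # each observation approximated as a convex combination of archetypes

def pch_distortion(X, C, S):
    """Least squares distortion D(X | X C S) = ||X - X C S||_F^2 used throughout the paper."""
    return float(np.linalg.norm(X - pch_reconstruction(X, C, S), 'fro') ** 2)
```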
Fig. 1. Illustration of the AA representation of data in 2D (left panel) and 3D (middle and right panel). Blue dots and lines indicate the convex sets and hulls respectively whereas green lines and dark shaded regions indicate the extracted principal convex hulls. Notice how the four component principal convex hull (left and middle panels) and five component convex hull (right panel) account for most of the dynamics in the data while the complete convex hull even in 2D is given by a rather large convex set exhibiting an exponential growth in size with the dimensionality of the data. The end points of the principal convex hull are convex combinations of the data points. These end points constitute the distinct regions of the clusters rather than the central regions as would be extracted by K-means. As such, the fifth cluster given in cyan color (right panel) is modeled as a combination of existing aspects rather than given a feature of its own. Thus, contrary to clustering approaches archetypal analysis models the distinct dynamics in the data rather than focus on the most prototypical aspects. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 1
Relation between the AA/PCH model and unsupervised methods such as SVD/PCA, NMF, soft K-means (i.e., fuzzy K-means or the EM-algorithm for clustering), K-means and K-medoids.

Method         Constraint on C                          Constraint on S
SVD/PCA        C ∈ R                                    S ∈ R
NMF            XC ≥ 0                                   S ≥ 0
Convex NMF     C ≥ 0                                    S ≥ 0
AA/PCH         |c_d|_1 = 1, C ≥ 0                       |s_n|_1 = 1, S ≥ 0
Soft K-means   c_{n,d} = s_{d,n} / Σ_{n'} s_{d,n'}      |s_n|_1 = 1, S ≥ 0
K-means        |c_d|_1 = 1, C ≥ 0                       |s_n|_1 = 1, S ∈ B
K-medoids      |c_d|_1 = 1, C ∈ B                       |s_n|_1 = 1, S ∈ B

(B denotes the set of binary assignment matrices.)
Archetypal analysis favors features that constitute representative 'corners' of the data, i.e., distinct aspects or archetypes. Furthermore, the AA model can naturally be considered a model between low-rank factor type approximation and clustering approaches, see Table 1. While the principal convex hull is formed by the optimal solution to the AA problem we will not distinguish strongly between the two and interchangeably denote the AA model as the PCH or jointly as AA/PCH.

2.1. Properties of AA/PCH
Contrary to SVD/PCA/ICA/NMF/SC, the K-means and K-medoids objectives are invariant to affine transformation and scaling of the data. This also holds for the AA/PCH model, i.e.,

Lemma 1. C and S are invariant to affine transformation and scaling of the data.

Proof. Consider the affine transformation by the vector v and scaling a of the data X to X̃ = aX + v1^T. For the PCH objective we find

||X̃ − X̃CS||_F^2 = ||(aX + v1^T) − (aX + v1^T)CS||_F^2 = ||aX + v1^T − (aX)CS − v1^T||_F^2 = a^2 ||X − XCS||_F^2,

where the second equality follows since (v1^T)CS = v1^T when the columns of C and S sum to one. □
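As a quick numerical illustration of Lemma 1 (our own sketch, not from the paper), one can verify that translating and scaling the data only rescales the PCH objective, leaving the optimal C and S unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))               # M = 5 features, N = 100 samples
C = rng.dirichlet(np.ones(100), size=3).T       # (N, D): columns on the simplex
S = rng.dirichlet(np.ones(3), size=100).T       # (D, N): columns on the simplex

cost = lambda Y: np.linalg.norm(Y - Y @ C @ S, 'fro') ** 2

a, v = 2.5, rng.standard_normal((5, 1))
X_t = a * X + v                                 # affine transformation aX + v 1^T
# The objective simply rescales by a^2, so the minimizing C and S are unchanged.
assert np.isclose(cost(X_t), a ** 2 * cost(X))
```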
Lack of uniqueness in matrix decomposition is a main motivation behind rotational criteria such as Varimax in factor analysis [15] as well as imposing statistical independence in ICA [6] and sparsity [24]. Here we prove the important property that AA/PCH is in general unique up to permutation of the components.

Theorem 1. Assume

∀d ∃n : c_{n,d} > 0 ∧ c_{n,d'} = 0 for all d' ≠ d,   (C.1)

∀d ∃n : s_{d,n} > 0 ∧ s_{d',n} = 0 for all d' ≠ d,   (C.2)
then the AA/PCH model does not suffer from rotational ambiguity, i.e., if X ≈ XCS = XCQ^{-1}QS = XC̃S̃ such that both C, S and C̃, S̃ are equivalent solutions to AA/PCH, then Q is a permutation matrix.

Proof. Since S̃ = QS and S ≥ 0 and S̃ ≥ 0 both are solutions to AA/PCH we have |s_n|_1 = 1 and |Qs_n|_1 = 1 for all n. For this to be true under condition (C.2), Q has to be a Markov matrix, i.e., Σ_d q_{d,d'} = 1, Q ≥ 0. Since C ≥ 0 and CQ^{-1} ≥ 0 and given (C.1), Q^{-1} has to be non-negative. Since both Q and Q^{-1} are non-negative it follows from Minc [22, Lemma 1.1], see also Laurberg et al. [16], that Q can only be a scale and permutation matrix, and as the column sums of Q have to be one it follows that Q can only be a permutation matrix. □

Condition (C.1) states that at least one of the observations used to define each archetype cannot be used to define any of the other archetypes, while condition (C.2) states that for each archetype there exists at least one observation that only belongs to this archetype, forming two easily testable conditions. The conditions will in general hold for the AA/PCH model as the distinct aspects in general position will not be a convex combination of the same data points while some of the data points with high probability will pertain to a specific archetype. Although these conditions are sufficient we suspect tighter conditions can be derived using measures of coherence in compressive sensing as discussed in Lim and Comon [19] as well as the notion of boundary closeness in Laurberg et al. [16]. In particular, the AA/PCH decomposition does not suffer from scale ambiguity, thus the model is in general able to recover the correct units of measure. We note that although AA/PCH is unique there is no guarantee the optimal solution will be identified due to the occurrence of local minima. Thus in Fig. 2 conditions (C.1) and (C.2) are satisfied for both the solutions given in red and black obtained for a three component AA/PCH model.
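Since conditions (C.1) and (C.2) are described as easily testable, a small sketch of such a check on estimated C and S could look as follows (our own illustration; the tolerance tol is an assumed threshold for treating entries as zero).

```python
import numpy as np

def satisfies_uniqueness_conditions(C, S, tol=1e-9):
    """Check (C.1): for each archetype d there is an observation n with c_{n,d} > 0 and c_{n,d'} = 0 for d' != d,
    and (C.2): for each archetype d there is an observation n with s_{d,n} > 0 and s_{d',n} = 0 for d' != d."""
    D = C.shape[1]
    # (C.1): some row of C is supported only on column d.
    c1 = all(np.any((C[:, d] > tol) & (np.sum(C > tol, axis=1) == 1)) for d in range(D))
    # (C.2): some column of S is supported only on row d.
    c2 = all(np.any((S[d, :] > tol) & (np.sum(S > tol, axis=0) == 1)) for d in range(D))
    return c1 and c2
```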
Fig. 2. The AA/PCH model is in general unique, i.e., it does not suffer from rotational ambiguity. As such, although the two minima given in red and black for a three component AA/PCH model can potentially account more or less equally well for the data, all continuous paths from one solution to the other have intermediate objective values with higher costs. Notice also that despite the existence of a unique solution of the AA/PCH, the non-convexity of the AA/PCH problem does not guarantee that the optimal (unique) solution will be identified. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
2.2. Projected gradient for AA/PCH

As noted in Cutler and Breiman [7] there is no closed form solution to the AA/PCH problem, but from the following lemma it can be seen that each 'subproblem' of estimating S for fixed C and vice versa forms a convex optimization problem for distortion measured by least squares, i.e., D(X | XCS) = ||X − XCS||_F^2.

Lemma 2. For least squares optimization the alternating optimization of C for fixed S and of S for fixed C forms convex optimization problems.

Proof. Each alternating subproblem can trivially be written as a quadratic programming problem with linear and non-negativity constraints forming a convex problem. As such, the Hessian when optimizing for vec(S) is given by H_S = I ⊗ [(XC)^T XC] whereas the Hessian when optimizing for vec(C) is given by H_C = diag(SS^T) ⊗ [X^T X]. As a result, both alternating problems' Hessians are positive semi-definite. □

We currently propose a simple projected gradient procedure that can be adapted to any proximity measure. In the present analysis we will however without loss of generality consider proximity measured by least squares. In the paper of Cutler and Breiman [7] the model was estimated by non-negative least squares such that the linear constraints were enforced by introducing quadratic penalty terms in the alternating updates of S and C, i.e., minimizing ||X − XCS||_F^2 + Λ Σ_d (|c_d|_1 − 1)^2 + Λ Σ_n (|s_n|_1 − 1)^2, where Λ is some large number. Alternatively, standard non-negative quadratic programming solvers with linear constraints can be invoked [1] for each alternating subproblem, solving S for fixed C and vice versa. We found, however, that the following projected gradient method inspired by the projected gradient method for NMF [20], using the normalization invariance approach of Eggert and Körner [12], worked efficiently in practice. We recast the AA/PCH problem in the l1-normalization invariant variables s̃_{d,n} = s_{d,n} / Σ_{d'} s_{d',n} and c̃_{n,d} = c_{n,d} / Σ_{n'} c_{n',d} such that the equality constraints are explicitly satisfied. Noticing that

∂s̃_{d',n} / ∂s_{d,n} = δ_{d',d} / Σ_k s_{k,n} − s_{d',n} / (Σ_k s_{k,n})^2

and differentiating by parts, we find the following updates for the AA parameters recast in the above normalization invariant
variables:

s_{d,n} ← max{ s̃_{d,n} − μ_S̃ (g^S̃_{d,n} − Σ_{d'} g^S̃_{d',n} s̃_{d',n}), 0 },   s̃_{k,n} = s_{k,n} / Σ_{k'} s_{k',n},
G^S̃ = C̃^T X^T X C̃ S̃ − C̃^T X^T X,

c_{n,d} ← max{ c̃_{n,d} − μ_C̃ (g^C̃_{n,d} − Σ_{n'} g^C̃_{n',d} c̃_{n',d}), 0 },   c̃_{n,d} = c_{n,d} / Σ_{n'} c_{n',d},
G^C̃ = X^T X C̃ S̃ S̃^T − X^T X S̃^T.

Each alternating update is performed on all elements simultaneously, where μ_S̃ and μ_C̃ are step-size parameters that we tuned by line-search. In our implementation of the projected gradient method we carried out 10 line-search updates for each alternating update of S and C, exploiting, as also noted in Lin [20], that precalculated terms can be re-used within each alternating update. In particular we find that our projected gradient approach for AA/PCH has the same computational complexity as NMF: In the update of S the terms C̃^T X^T X and C̃^T X^T X C̃ can be precomputed at a cost of O(DMN), where D is the number of archetypes, M the dimensionality of the feature space and N the number of observations. The evaluation of the least squares objective using the inner-product notation ⟨A, B⟩ = Σ_{ij} a_{ij} b_{ij} is given by

||X − XCS||_F^2 = const. − 2⟨C̃^T X^T X, S̃⟩ + ⟨C̃^T X^T X C̃, S̃ S̃^T⟩,

which, as for the computation of the gradient, has a computational cost of O(D^2 N). In the update of C the terms X^T X S̃^T and S̃ S̃^T can be precomputed at a computational cost of O(DMN) and O(D^2 N) respectively, while calculating the gradient as well as evaluating the least squares objective, given by

||X − XCS||_F^2 = const. − 2⟨X^T X S̃^T, C̃⟩ + ⟨C̃^T X^T X C̃, S̃ S̃^T⟩,

also has a computational complexity of O(DMN). Thereby the overall computational complexity is the same for AA as for NMF [20]. When only considering T candidate points used to define the archetypes, as proposed in Bauckhage and Thurau [1], this latter complexity can be further reduced to O(DMT). In Bauckhage and Thurau [1] it was suggested to identify these candidate points as outlying data points found through projections onto the eigenvectors of the covariance matrix of X. We found that the following FURTHESTSUM algorithm not only forms an efficient initialization procedure but also provides an alternative approach to the identification of these T candidate points.
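To make the S-update concrete, here is a short numpy sketch (ours, under the reconstruction above) of one projected gradient step in the normalization-invariant variables; the step size mu is assumed given, whereas the paper tunes it by line search.

```python
import numpy as np

def update_S(XtX, C_tilde, S_tilde, mu):
    """One projected gradient step for S~ with column-wise l1 normalization.

    XtX:     (N, N) precomputed X^T X (or a kernel matrix K for KERNEL-AA).
    C_tilde: (N, D) column-normalized archetype coefficients.
    S_tilde: (D, N) column-normalized mixing weights.
    mu:      step size (tuned by line search in the paper, assumed given here).
    """
    CtXtX = C_tilde.T @ XtX                               # (D, N); reusable within the alternating update
    G = CtXtX @ C_tilde @ S_tilde - CtXtX                 # gradient G^S~
    # Normalization-invariant correction: subtract sum_{d'} g_{d',n} s~_{d',n} in each column.
    corr = np.sum(G * S_tilde, axis=0, keepdims=True)
    S_new = np.maximum(S_tilde - mu * (G - corr), 0.0)    # projected (non-negative) step
    # Renormalize each column to the simplex: s~_{d,n} = s_{d,n} / sum_k s_{k,n}.
    return S_new / np.maximum(S_new.sum(axis=0, keepdims=True), np.finfo(float).tiny)
```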
2.3. Efficient initialization of AA by FURTHESTSUM
Cutler and Breiman [7] point out that careful initialization improves the speed of convergence and lowers the risk of finding insignificant archetypes; in particular, one should choose mixtures that are not too close together. For K-means a popular initialization procedure is based on the FURTHESTFIRST method described in [14]. The method proceeds by randomly selecting a data point as centroid for a cluster and selecting subsequent data points furthest away from the already selected centers x_j. As such, a new data point j_new is selected according to

j_new = arg max_i min_{j ∈ C} ||x_i − x_j||,   (2.1)

where ||·|| is a given norm and C indexes the currently selected data points. For initializing AA we propose the following modification, forming our FURTHESTSUM procedure:

j_new = arg max_i Σ_{j ∈ C} ||x_i − x_j||.   (2.2)
To improve the identified set C, the first point, which was selected at random, is removed and an additional point is selected in its place, see also Algorithm 1. For the proposed FURTHESTSUM method we have the following important property.

Theorem 2. The points generated by the FURTHESTSUM algorithm are guaranteed to lie in the minimal convex set of the unselected data points.

Proof. We will prove the theorem by contradiction. Assume that there is a point t not in the minimal convex set, i.e., x_t = Xc such that |c|_1 = 1, c_m ≥ 0, c_t = 0, while t = arg max_i Σ_{j ∈ C} ||x_i − x_j||. We then have

Σ_{j ∈ C} ||x_t − x_j|| = Σ_{j ∈ C} ||Xc − x_j|| < Σ_{j ∈ C} Σ_m c_m ||x_m − x_j|| ≤ max_{m: c_m > 0} Σ_{j ∈ C} ||x_m − x_j||,

where the first inequality follows from the triangle inequality. Hence a better solution is given by a point different from t, contradicting that t was the optimal point. □

A comparison between the FURTHESTFIRST and FURTHESTSUM initialization procedures can be found in Fig. 3, based on the 2-norm, i.e., ||x_i − x_j||_2. From the figure it can be seen that the proposed FURTHESTSUM extracts observations from the convex set of the unselected data points, whereas FURTHESTFIRST does not specifically identify the distinct aspects and also selects points in the interior of the data cloud.

Fig. 3. Illustration of the extracted initial candidate observations C by FURTHESTFIRST [14] (green circles) and the proposed FURTHESTSUM initialization procedure (red circles) for 4 (left), 25 (middle) and 100 (right) candidate observations. Clearly, FURTHESTSUM extracts points belonging to the convex set of the unselected data (useful for AA) whereas FURTHESTFIRST distributes the extracted candidate observations evenly over the data region (useful for K-means). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
The primary computational cost of the FURTHESTSUM procedure for the identification of T candidate points is the evaluation of the distances between all data points and the selected candidate points, which has an overall computational complexity of O(MNT), while these distances have to be sorted T times, giving a total cost for sorting the distances of O(TN log N). Thus, the FURTHESTSUM procedure is efficient in the sense that its computational cost is similar to that of a single iteration of the projected gradient algorithm.

Algorithm 1. FURTHESTSUM initialization procedure.
1: Initialization: Choose an observation j at random as initial candidate archetype and set C = {j}.
2: ∀i: q_i = ||x_i − x_j||_F
3: q̂ = q
4: for t = 1 to D do
5:   if t = D then q = q − q̂ end if
6:   j = arg max_{i ∉ C} q_i
7:   C = C ∪ {j}
8:   ∀i: q_i = q_i + ||x_i − x_j||_F
9: end for
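A compact Python sketch of Algorithm 1 follows (ours; the handling of the initial random point reflects our reading of the algorithm, where the randomly chosen seed is discarded at the end). X holds observations as columns and D is the number of candidates returned.

```python
import numpy as np

def furthest_sum(X, D, rng=None):
    """FURTHESTSUM initialization (Algorithm 1): greedily select D candidate archetypes.

    X: (M, N) data matrix with observations as columns; returns a list of D column indices.
    """
    rng = np.random.default_rng(rng)
    N = X.shape[1]
    j = int(rng.integers(N))                            # random initial candidate
    selected = [j]
    q = np.linalg.norm(X - X[:, [j]], axis=0)           # distances to the initial candidate
    q_hat = q.copy()
    for t in range(D):
        if t == D - 1:
            q = q - q_hat                               # last step: discount the random initial point
        q_masked = q.copy()
        q_masked[selected] = -np.inf                    # never reselect already chosen points
        j = int(np.argmax(q_masked))
        selected.append(j)
        q = q + np.linalg.norm(X - X[:, [j]], axis=0)   # accumulate sum of distances to selected points
    return selected[1:]                                 # drop the initial random point, keep D candidates
```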
Fig. 4. Illustration of the relaxation AA/PCH-δ when pure archetypes are not available as convex combinations of the observations. The corners of each colored triangle indicate the locations of the estimated archetypes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
2.4. Choosing the number of archetypes D

An open problem is to determine the number of archetypes D. This problem is no different from the problem of choosing the number of components in other matrix decomposition approaches such as SVD/PCA/NMF/SC/K-means; thus, methods based on approximating the model evidence or generalization error can be invoked. In particular, information criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) can be straightforwardly applied. We found that a corner of the curve of the variation explained by the model versus the number of components seemed to define a reasonable criterion on which to base the selection of the number of components, see also Figs. 7 and 8. Finally, entries can be treated as missing and predicted by the AA model such that the number of archetypes D can be determined using a cross-validation approach, see also Section 3.5.

2.5. Extensions of AA/PCH

A benefit of AA/PCH is that the archetypes have to be convex combinations of the actual data, keeping the representation close to the data observations; however, the "pure" forms or "true" archetypes may not always be possible to represent as convex combinations of the observed data, see also Fig. 4. We here give a relaxation of the AA/PCH model to address the situation when no such "pure" forms can be represented well by the model. Furthermore, data may be represented by some measure of pairwise similarity and we therefore discuss how AA/PCH can be extended to data defined by pairwise relations (i.e., kernels). These extensions will be briefly discussed but we leave a full evaluation of these frameworks for future work.

2.5.1. Relaxation of the AA/PCH model in the case of no pure forms

To alleviate the problem that archetypes may not always be possible to represent as convex combinations of the observed data, we relax the AA/PCH model to accommodate the potential configuration that archetypes reside outside the convex hull of the data, forming what we denote the AA/PCH-δ model:

arg min_{C,S} D(X | XCS)   s.t.   1 − δ ≤ |c_d|_1 ≤ 1 + δ,   C ≥ 0,   |s_n|_1 = 1,   S ≥ 0.   (2.3)

The linear constraint on C is thereby relaxed to two inequality constraints and as such the estimation of C for fixed S is still convex. In our implementation we introduce a scaling variable α_d such that |c_d|_1 = 1 whereas 1 − δ ≤ α_d ≤ 1 + δ, hence the objective is written as

arg min_{α,C,S} D(X | XC diag(α) S)   s.t.   1 − δ ≤ α_d ≤ 1 + δ,   |c_d|_1 = 1,   |s_n|_1 = 1,   C ≥ 0,   S ≥ 0,

such that the alternating updates of α, C and S become

α ← α − μ_α g^α,   α_d = P_α(α_d),   g^α_d = Σ_{n'} [X^T X C̃ diag(α) S̃ S̃^T − X^T X S̃^T]_{n',d} c̃_{n',d},

s_{d,n} ← max{ s̃_{d,n} − μ_S̃ (g^S̃_{d,n} − Σ_{d'} g^S̃_{d',n} s̃_{d',n}), 0 },   s̃_{k,n} = s_{k,n} / Σ_{k'} s_{k',n},
G^S̃ = diag(α)(C̃^T X^T X C̃ diag(α) S̃ − C̃^T X^T X),

c_{n,d} ← max{ c̃_{n,d} − μ_C̃ (g^C̃_{n,d} − Σ_{n'} g^C̃_{n',d} c̃_{n',d}), 0 },   c̃_{n,d} = c_{n,d} / Σ_{n'} c_{n',d},
G^C̃ = (X^T X C̃ diag(α) S̃ S̃^T − X^T X S̃^T) diag(α),   (2.4)
where

P_α(α_k) = 1 − δ if α_k < 1 − δ,   1 + δ if α_k > 1 + δ,   and α_k otherwise.

In the limit δ → ∞ we recover a model similar to the conic coding of Lee and Seung [17]. However, as δ becomes large the features can no longer be interpreted as archetypes, while the uniqueness criterion in Theorem 1 no longer holds. Thus, care has to be taken using this relaxation in practice. How the relaxation admits the extraction of the underlying pure forms even when these are not present as convex combinations of the observed data is demonstrated in Fig. 4.

2.5.2. Kernel-AA/PCH

From the updates of the parameters of the AA/PCH model it can be seen that the estimation only depends on the pairwise relations, i.e., the kernel matrix of inner products K = X^T X. As such, the AA model trivially generalizes to kernel representations based on pairwise relations between the data points (KERNEL-AA) and can here be interpreted as extracting the principal convex hull in a potentially infinite Hilbert space (i.e., similar to the corresponding interpretation of kernel K-means and kernel PCA). This is illustrated in Fig. 5. The algorithm generalizing AA to kernel representations while relaxing the objective to the case of no pure forms, denoted KERNEL-AA/PCH-δ, is outlined in Algorithm 2, where regular AA/PCH is obtained by setting K = X^T X and δ = 0.

Fig. 5. Illustration of the extracted archetypes for a 10 component AA/PCH model based on regular inner product similarity (i.e., K_ij = x_i^T x_j), forming the standard least squares objective, and archetypes based on KERNEL-AA/PCH using a radial basis function kernel (i.e., K_ij = exp(−(1/2σ²)||x_i − x_j||²)). Notice that the data points in the central region of the 2D inner product space do not form distinct aspects; the distinct aspects are instead found at the exterior regions of the data points constituting the outer circle (i.e., the black marked archetypes). For the KERNEL-AA/PCH the data points reside in a much higher dimensional Hilbert space, and here the central regions of the outer circular band as well as the inner circle constitute the distinct aspects and hence are extracted by the KERNEL-AA/PCH (i.e., the red marked archetypes). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Algorithm 2. KERNEL-AA/PCH-δ.
1: Set δ and the number of archetypes D.
2: Initialization: Initialize C by FURTHESTSUM, set α = 1, initialize S by random and estimate S by iterating over the S-update below.
3: repeat
4:   α-update: Calculate the gradient g^α_d = Σ_{n'} [K C̃ diag(α) S̃ S̃^T − K S̃^T]_{n',d} c̃_{n',d}. Update α in the gradient direction: α ← α − μ_α g^α. Project α to the feasible region: α_d = P_α(α_d). Tune μ_α by line-search.
5:   S-update: Calculate the gradient G^S̃ = diag(α)(C̃^T K C̃ diag(α) S̃ − C̃^T K). Update S in the gradient direction: s_{d,n} ← max{ s̃_{d,n} − μ_S̃ (g^S̃_{d,n} − Σ_{d'} g^S̃_{d',n} s̃_{d',n}), 0 }. Project S to the simplex: s̃_{d,n} = s_{d,n} / Σ_{d'} s_{d',n}. Tune μ_S̃ by line-search.
6:   C-update: Calculate the gradient G^C̃ = (K C̃ diag(α) S̃ S̃^T − K S̃^T) diag(α). Update C in the gradient direction: c_{n,d} ← max{ c̃_{n,d} − μ_C̃ (g^C̃_{n,d} − Σ_{n'} g^C̃_{n',d} c̃_{n',d}), 0 }. Project C to the simplex: c̃_{n,d} = c_{n,d} / Σ_{n'} c_{n',d}. Tune μ_C̃ by line-search.
7: until convergence
3. Data mining and machine learning applications

We demonstrate the usefulness of the AA model on five data sets taken from a variety of important machine learning problem domains.

3.1. Computer vision

To illustrate the properties of the AA/PCH model we compared the extracted model representation to the representations obtained by SVD/PCA, NMF and K-means on the CBCL face database of M = 361 pixels and N = 2429 images used in Lee and Seung [18]. Here the AA/PCH model extracts archetypal faces given by the columns of A = XC as well as how each face in the database is represented as a convex combination of archetypal faces given by S. The results of the analysis are illustrated in Fig. 6 where we show the extracted face features A for a D = 49 component model. There is no particular reason for this choice of the number of components but we note that similar quantitative results were found for other choices of D. From Fig. 6 it can be seen that SVD/PCA extracts features that have low to high spatial frequency while NMF gives a part based representation as reported in Lee and Seung [18]. K-means extracts cluster centers of anonymous 'typical' faces while the features extracted by AA represent more distinct (archetypal) face prototypes exposing variability and face diversity. At the top of the extracted face features for each method we give the percentage of variation explained. It can be seen that AA accounts for more variation than K-means but less than SVD/PCA and NMF, as is expected from the models' relative flexibilities indicated in Table 1.

3.2. NeuroImaging

We analyzed a positron emission tomography data set with M = 40 time points and N = 157,244 voxels based on [18F]Altanserin as radioligand in order to measure serotonin-2A neuroreceptors.
Fig. 6. Left panel: Example of faces present in the CBCL face database. Right panel: Properties of features extracted analyzing the face database using SVD/PCA, NMF, AA and K-means. Whereas SVD/PCA extracts low to high spatial frequency eigenimages and NMF decomposes the data into an atomic mixture of constituting parts, the features of the AA model closely resemble actual faces that constitute notably more distinct aspects than the prototypical faces extracted by K-means. The percentage of explained variation is given on top of the extracted features of each method along with their standard deviation across 10 random initializations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Each recorded voxel is a mixture of vascular regions, non-binding regions and high-binding regions. The AA model should ideally extract the temporal profiles of these types of binding regions in A = XC as well as how each voxel is a convex mixture of these regions in S. This holds provided a convex combination of the observations is able to generate the pure tissue profiles, i.e., can extract the distinct profiles. For comparison we also included an analysis by independent component analysis (ICA) of the data based on imposing either temporal or spatial independence. We used the infomax approach proposed in Bell and Sejnowski [2] as implemented in the ICAML routine available from http://cogsys.imm.dtu.dk/toolbox/. From Fig. 7 the AA/PCH model has indeed extracted three components that correspond well to non-binding, high-binding and vascular regions respectively. No such profiles are clearly extracted by SVD/PCA, ICA, NMF or K-means, which all extract components that are mixtures of the three different region types. In particular, the identified time profiles do not have the expected unimodal structure found by AA/PCH, and we attribute ICA's inability to identify the components to the fact that the underlying components are both temporally and spatially dependent.

3.3. Chemistry

In Fig. 8 we give an analysis of a nuclear magnetic resonance (NMR) spectroscopy data set described in Winning et al. [13] of M = 14,000 spectral frequency bins and N = 231 samples, where each sample contains a known convex mixture of propanol, butanol and pentanol. We assume the concentrations are unknown and investigate whether AA/PCH is able to recover the true concentration fractions from the estimated S. Indeed, the AA/PCH model is able to successfully extract the true underlying concentration fractions with
an average correlation between the true and estimated components of 0.9982 ± 0.0002 as well as the corresponding component spectra, while accounting for the dynamics of the data as well as SVD/PCA and NMF.

3.4. Text mining

Latent semantic analysis [9,18] has become an important tool for extracting word and document associations in text corpora. In Fig. 9 we demonstrate the utility of the AA model for this task by analyzing the NIPS bag of words corpus consisting of M = 12,419 words and N = 1500 documents with approximately 6,400,000 word occurrences (see also http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). Each word was normalized by the inverse document frequency (IDF). Clearly, the 10 component model has extracted distinct categories of the NIPS papers. For the AA model, C indicates which documents constitute the extracted distinct aspects A = XC given in the figure, while S (not shown) gives the fraction by which each document resembles these aspects. Indeed the extracted archetypes correspond well to distinct types of papers in the NIPS corpus. Notice in particular how neither NMF nor AA/PCH has extracted a cluster similar to the second cluster of K-means including the words {bayesian, mixture, gaussian, kernel, likelihood, posterior, density, distribution, cluster, regression}, as these words are prototypical rather than archetypal and therefore do not form a distinct aspect of the NIPS text corpus, in line with the properties of the central cyan cluster illustrated in Fig. 1. Furthermore, AA has 42.2 ± 0.1% of the elements of S being non-zero whereas NMF has 49.5 ± 0.3% (standard deviation given for 10 randomly initialized models), resulting in the documents belonging to significantly fewer of the extracted topic features for AA.
Fig. 7. Analysis of an Altanserin tracer positron emission tomography data set. Top panel: The extracted temporal profiles for three component SVD/PCA, ICA, NMF, AA/PCH and K-means models. Only the AA/PCH model has extracted the expected region specific temporal profiles, which from the spatial maps in the bottom right panel correspond well to non-binding, high-binding and vascular regions respectively. As there is no scaling ambiguity, the tissue curves estimated by AA are in the correct units. Bottom left panel: The explained variation as a function of the number of components D for the various models.
Fig. 8. Analysis of 1H NMR spectra of mixtures of propanol, butanol and pentanol. Top panel: To the left is given the variance explained by the different models as a function of the number of components D, as well as the estimated concentration fractions S of the three component AA model (middle) and the true concentrations (right). Bottom panel: The estimated spectra A = XC obtained for each of the three extracted AA/PCH components (the x-axis denotes the chemical shift in parts per million).
3.5. Collaborative filtering

Collaborative filtering has become an important machine learning problem of which the celebrated Netflix prize is widely known. Given the preferences of users, the aim is to infer preferences for products a user has not yet rated based on the ratings given by other users. We analyzed the medium size and large size Movielens data given by 1,000,209 ratings of M = 3952 movies by N = 6040 users and 10,000,054 ratings of M = 10,677 movies by N = 71,567 users, with ratings from {1,2,3,4,5} (http://www.grouplens.org/). The AA/PCH model extracts idealized users (extreme user behaviors) while at the same time relating the users to these archetypal preference types.
Fig. 9. Analysis of the NIPS bag of words corpus. The 10 most prominent words for each feature component of NMF, AA and K-means respectively of the extracted 10 component models. The average cosine angles (averaged over all 10(10−1)/2 feature comparisons) are 84.16°, 80.31° and 73.33° respectively; thus, the AA model has extracted more distinct aspects than K-means. Furthermore, the percentages of non-zero entries in S for NMF, AA/PCH and K-means are 49.5 ± 0.3%, 42.2 ± 0.1% and 10% respectively (standard deviation across 10 random initializations), resulting in a more sparse encoding (i.e., a representation that is easier to interpret) of the documents in terms of the extracted features for AA/PCH than for NMF. Clearly each of the components corresponds well to aspects of the documents in the NIPS corpus. The components are ordered according to importance and the gray scale gives the relative strength of each of the 10 top words for each of the extracted term groups.
Fig. 10. Left and middle panels: Root mean square error (rmse) for the various models as a function of the number of components when analyzing the medium size Movielens data (left) given by 1,000,209 ratings of M = 3952 movies by N = 6040 users and the large sized (middle) Movielens data formed by 10,000,054 ratings of M = 10,677 movies by N = 71,567 users. Training error is given by solid lines and test error performance by dotted lines, based on removing 10% of the data for testing. Clearly, the AA/PCH model has added flexibility over K-means in accounting for the data. Right panel: Illustration of the extracted archetypal user types, A = XC, and their distinct preferences for the 25 component AA model with lowest validation error based on the large sized problem. A benefit of AA/PCH is that the extracted features can be directly interpreted as distinct user profiles, forming easily interpretable features while having similar test performance to SVD/PCA and NMF.
Movies that are left unrated by the users were treated as missing values based on the following extension of the AA/PCH objective to accommodate the missing information, i.e.,

min_{S,C} Σ_{n,m: q_{n,m}=1} ( x_{n,m} − Σ_d [ Σ_{m'} x_{n,m'} c_{m',d} / (Σ_{m'} q_{n,m'} c_{m',d} + ε) ] s_{d,m} )^2,

where Q is an indicator matrix such that q_{n,m} = 1 if the nth movie was rated by the mth user and zero otherwise, and ε = 10^{-3} is a small constant used for numeric stability and to avoid potential division by zero. Since the SVD and NMF models turned out to be more prone to local minima (due to missing data) than the AA/PCH
model, these methods were initialized by the obtained AA/PCH solution. From Fig. 10 it can be seen that even though the AA/PCH model is more restricted than NMF and SVD, it has a very similar test error performance and extracts features that are much more useful for predicting the ratings than K-means, while the extracted archetypes can be directly interpreted in terms of distinct user profiles. From Table 2 it can be seen that AA identifies a sparser representation of each user in terms of the archetypes than NMF. In particular, as more components are included in the AA model the percentage of non-zero elements in S decreases drastically relative to NMF.
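As an illustration of how the masked objective above can be evaluated, here is a small numpy sketch (ours, not the authors' implementation); X is assumed to contain zeros at unrated entries, Q is the binary indicator matrix, and eps plays the role of ε.

```python
import numpy as np

def masked_aa_objective(X, Q, C, S, eps=1e-3):
    """Sum of squared errors over observed entries only (Q[n, m] = 1 where a rating exists).

    X: (movies, users) ratings with zeros at missing entries, Q: same shape 0/1 mask,
    C: (users, D), S: (D, users). Archetypes are renormalized by the observed mass per entry.
    """
    numer = X @ C            # (movies, D): sum_{m'} x_{n,m'} c_{m',d}
    denom = Q @ C + eps      # (movies, D): sum_{m'} q_{n,m'} c_{m',d} + eps
    A = numer / denom        # renormalized archetypal rating profiles
    R = (X - A @ S) * Q      # residuals only where ratings are observed
    return float(np.sum(R ** 2))
```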
Table 2
Percentage of non-zero entries in S for NMF and AA for the large sized Movielens problem.

Method    D = 5    D = 10    D = 25    D = 50    D = 100
NMF (%)   87       81        78        75        74
AA (%)    71       55        37        24        15
4. Discussion

We demonstrated how the archetypal analysis model of Cutler and Breiman [7] is useful for a broad variety of machine learning problems. A simple algorithm for fitting the AA/PCH model was derived as well as the FURTHESTSUM initialization procedure to extract end-members for initial archetypes. The utility of AA/PCH over clustering methods is that it focuses more on distinct or discriminative aspects yet has additional modeling flexibility by using soft assignment. We saw examples of improved interpretability of the AA/PCH representations over existing matrix factorization and clustering methods. AA/PCH is a promising unsupervised learning tool for many machine learning problems, and as the representation is in general unique we believe the method holds particularly great promise for data mining applications. Matlab implementations of the described methods are available for download from www.mortenmorup.dk.
Acknowledgement

We thank Gitte Moos Knudsen and Claus Svarer for providing the PET data set. This work is supported in part by the Danish Lundbeck Foundation through CIMBI, Center for Integrated Molecular Brain Imaging.

References

[1] C. Bauckhage, C. Thurau, Making archetypal analysis practical, in: Proceedings of the 31st DAGM Symposium on Pattern Recognition, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 272–281.
[2] A. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Comput. 7 (1995) 1129–1159.
[3] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, 1981.
[4] C.M. Bishop, Pattern Recognition and Machine Learning, 1st ed., Springer, 2006 (Corr. 2nd printing, 2007).
[5] B.H.P. Chan, D.A. Mitchell, L.E. Cram, Archetypal analysis of galaxy spectra, Mon. Not. Roy. Astron. Soc. 338 (2003) 790. doi:10.1046/j.1365-8711.2003.06099.x.
[6] P. Comon, Independent component analysis, a new concept? Signal Process. 36 (1994) 287–314.
[7] A. Cutler, L. Breiman, Archetypal analysis, Technometrics 36 (4) (1994) 338–347.
[8] K.K. Delgado, J.F. Murray, B.D. Rao, K. Engan, T.W. Lee, T.J. Sejnowski, Dictionary learning algorithms for sparse representation, Neural Comput. 15 (2) (2003) 349–396. doi:10.1162/089976603762552951.
[9] S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, R.A. Harshman, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (6) (1990) 391–407.
[10] C. Ding, T. Li, M.I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Trans. Pattern Anal. Mach. Intell. 32 (1) (2010) 45–55.
[11] R.A. Dwyer, On the convex hull of random points in a polytope, J. Appl. Probab. 25 (4) (1988) 688–699.
[12] J. Eggert, E. Körner, Sparse coding and NMF, in: Neural Networks, vol. 4, 2004, pp. 2529–2533.
[13] H. Winning, F.H. Larsen, R. Bro, S. Engelsen, Quantitative analysis of NMR spectra with chemometrics, J. Magn. Reson. 190 (2008) 26–32.
[14] D.S. Hochbaum, D.B. Shmoys, A best possible heuristic for the k-center problem, Math. Oper. Res. 10 (2) (1985) 180–184.
[15] H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis, Psychometrika 23 (1958) 187–200.
[16] H. Laurberg, M.G. Christensen, M.D. Plumbley, L.K. Hansen, S.H. Jensen, Theorems on positive data: on the uniqueness of NMF, Comput. Intell. Neurosci., doi:10.1155/2008/764206, in press.
[17] D. Lee, H.S. Seung, Unsupervised learning by convex and conic coding, Adv. Neural Inf. Process. Syst. 9 (1999) 515–521.
[18] D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (6755) (1999) 788–791.
[19] L.-H. Lim, P. Comon, Multiarray signal processing: tensor decomposition meets compressed sensing, C. R. Mecanique 338 (2010) 311–320.
[20] C.-J. Lin, Projected gradient methods for non-negative matrix factorization, Neural Comput. 19 (2007) 2756–2779.
[21] D. McCallum, D. Avis, A linear algorithm for finding the convex hull of a simple polygon, Inf. Process. Lett. 9 (1979) 201–206.
[22] H. Minc, Nonnegative Matrices, John Wiley and Sons, NY, USA, 1988.
[23] M. Mørup, L.K. Hansen, Archetypal analysis for machine learning, in: IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2010. URL: http://www2.imm.dtu.dk/pubdb/p.php?5912.
[24] B.A. Olshausen, D.J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (1996) 607–609.
[25] A. Plaza, P. Martinez, R. Perez, J. Plaza, A quantitative and comparative analysis of endmember extraction algorithms from hyperspectral data, IEEE Trans. Geosci. Remote Sensing 42 (3) (2004) 650–663.
[26] G.C. Porzio, G. Ragozini, D. Vistocco, On the use of archetypes as benchmarks, Appl. Stoch. Model. Bus. Ind. 24 (5) (2008) 419–437.
[27] E. Stone, A. Cutler, Introduction to archetypal analysis of spatio-temporal dynamics, Phys. D 96 (1–4) (1996) 110–131.
Morten Mørup received his Ph.D. degree in Applied Mathematics at the Technical University of Denmark in 2008. He is currently an Assistant Professor at the Section for Cognitive Systems at the Technical University of Denmark and a current member of the Technical Committee for the IEEE International Workshops on Machine Learning for Signal Processing (MLSP). His research interest is machine learning and biomedical data analysis, and a major focus of his research has been dedicated to unsupervised learning with application to the analysis of neuroimaging data such as EEG and fMRI.
Lars Kai Hansen is a Professor of Digital Signal Processing at the Technical University of Denmark, Lyngby, where he also heads the Cognitive Systems Section. His research concerns adaptive signal processing and machine learning with applications in biomedicine and digital media. He has published more than 225 contributions on these subjects in journals, conferences, and books.