
Two-mode multi-partitioning

Roberto Rocci (University "Tor Vergata", Rome, Italy) and Maurizio Vichi (Department of Statistics, Probability and Applied Statistics, University "La Sapienza", P.le A. Moro 5, I-00185 Roma, Italy)

Computational Statistics & Data Analysis 52 (2008) 1984–2003

Received 11 September 2004; received in revised form 25 June 2007; accepted 26 June 2007. Available online 10 July 2007.

Abstract. New methodologies for two-mode (objects and variables) multi-partitioning of two-way data are presented. In particular, by reanalyzing the double k-means, which identifies a unique partition for each mode of the data, a relevant extension is discussed that allows the specification of several partitions of one mode, conditionally on the partition of the other. The performance of this generalized double k-means has been tested by both a simulation study and an application to gene microarray data. © 2007 Elsevier B.V. All rights reserved.

Keywords: Two-way data; Two-mode partitioning; Least-squares partitioning

1. Introduction

The analysis of proximity relationships between pairs of objects allows the identification of disjoint classes, in which objects are perceived as similar, by means of cluster analysis methodologies (Hartigan, 1975; Gordon, 1999). Nevertheless, cluster analysis is frequently used to partition variables instead of objects. This procedure is rather popular among researchers and is sometimes preferred to factor analysis, because the factorial solution, even after rotation, produces a covering of the observed variables in which each factor may not be fully representative of only one cluster of variables. Less popular, but still very useful, is the methodology for clustering both objects and variables. For example, sellers are interested in knowing how the market can be segmented into homogeneous classes of customers according to their preferences for products; at the same time, sellers may wish to know how products are clustered according to the preferences of customers. This case will be referred to as two-mode partitioning. In the simplest case, the basic idea is to identify blocks, i.e., sub-matrices of the observed data matrix, where the objects and variables forming each block specify an object cluster and a variable cluster, respectively, thus defining a partition of the elements of each of the two modes. Of course, two-mode partitioning requires the variables to be expressed on the same scale, so that entries are comparable among both rows and columns. If not, the data need to be rescaled by an appropriate standardization method. Two-mode partitioning is a relevant topic within the field of methods that include a dimensionality reduction. For a detailed discussion of new advances in statistical learning methods including dimensionality reduction, the reader can refer to the Special Issue of CSDA (Bock and Vichi, 2007). Procedures for clustering variables and objects are given,

doi:10.1016/j.csda.2007.06.025


for example, in Hartigan (1972). He proposes an algorithm where, at each stage, the original data matrix is divided into clusters, each corresponding to a sub-matrix that identifies an object cluster and a variable cluster. Then, the cluster with maximum variance is split by rows (columns) so as to reduce as much as possible the average variance of rows (columns). Hartigan (1975, Chapter 15) also presents an agglomerative two-mode clustering technique that produces a tree for the objects and a tree for the variables. Two-mode clustering has recently become very popular also in the analysis of gene microarray data, where it makes good sense to classify both genes and samples of tissue. Generally, samples and genes are clustered independently (see, for example: Eisen et al., 1998; Alon et al., 1999). Getz et al. (2000) propose a heuristic algorithm that, by using the iterative procedure proposed by Hartigan (1972), applies to each sub-matrix a clustering algorithm determining a partition of genes or samples. This step is then iterated further, using as sub-matrices those formed by pairs of all previously found clusters of genes and samples. The authors use the superparamagnetic algorithm SPC given by Blatt et al. (1996). However, the two-mode clustering methodologies described above are heuristic procedures that detect partitions with some remarkable properties, but they do not use a two-way clustering model together with a clear statistical estimation method optimizing a single loss function measuring the discrepancy between the given model and the observed data (as is done in the clustering literature based on least-squares (LS) or maximum likelihood (ML) techniques). Following this approach, Bock (1979) proposes a technique based on the concept of iteration between row clusters and column clusters, in which LS criteria are optimized. In the special case of contingency tables, Bock (2003) proposes a two-mode partitioning method based on a maximum-tangent-plane algorithm. DeSarbo (1982) gives an alternating least-squares algorithm for two-mode partitioning of a proximity matrix. For binary data, a two-mode clustering methodology optimizing an L1-norm criterion has been proposed by Govaert (1995). Furthermore, Govaert and Nadif (2003) specify the block mixture model, which hypothesizes two multinomial distributions, one for the partition of objects and the other for the partition of variables, and, conditionally on each block, a parametric distribution for each entry. The likelihood of the block mixture model is maximized by the Block CEM algorithm. Pollard and van der Laan (2002) formalize the simultaneous clustering problem as a parametric model to be estimated by using bootstrap techniques. Two conditions have to be respected to detect a two-mode partitioning: (i) the data matrix is partitioned into disjoint blocks; (ii) the blocks specify an object and a variable partition. The interested reader can find a very complete structured overview of two-mode clustering methods in Van Mechelen et al. (2004). The two-mode partitioning model adopted here is defined as a collection of disjoint object clusters and variable clusters, where objects and variables within each cluster differ only by a random residual. In this model, named double k-means (Vichi, 2000), dk in short, the data matrix is supposed to be split into disjoint blocks such that the object clusters and the variable clusters form a partition of each mode.
Such a two-mode partitioning model, when it degenerates into the clustering of objects only (this happens when the number of clusters of variables is set equal to the number of variables), identifies the well-known k-means algorithm in the least-squares context (MacQueen, 1967), or the CEM algorithm (Celeux and Govaert, 1992) in the maximum likelihood context. This is the most parsimonious two-mode partitioning model, since it has the important property of specifying a unique partition both for objects and for variables. This also implies that for each class of objects the same partition of variables is identified and, vice versa, for each class of variables the same partition of objects is found. In some situations, especially for large data sets, this property may be seen as too restrictive. For example, customers (objects) belonging to very dissimilar clusters may very likely specify different partitions of products (variables) due to differences in their preferences. When clustering microarray data, samples of tissues (objects) may group differently with respect to different subtypes (clusters) of genes (variables). Therefore, in this paper we introduce a new model, more flexible than dk, because it allows a different partition of variables to be specified for each class of the partition of objects. This case will be referred to as two-mode multi-partitioning or two-mode class-conditional partitioning. The parameters of the model will be estimated in a model-free LS context. The new model has the major property of including the double k-means (dk) as a special case, when a unique partition of the variables is required. For this reason the new model will be named generalized double k-means (gdk). Furthermore, it will be shown that the unique partition in dk represents a "consensus partition" of the variable partitions of the gdk, i.e. the closest partition in an LS sense. In some situations it could be useful to consider the dual (transposed) problem, identifying a unique partition for variables and different partitions of objects within each variable cluster. For the sake of brevity, this case will not be discussed


here; the interested reader can apply our model by simply transposing the data matrix.

The overview of the paper is as follows. In Section 2 the notation common to all sections is listed. Section 3 is devoted to the discussion of two-mode class-conditional partitioning. Firstly, in Section 3.1, simple procedures, named sequential procedures, for obtaining a two-mode multi-partitioning by applying ordinary clustering techniques in repeated steps are discussed. Secondly, in Section 3.2, the simultaneous approach is presented, which generalizes the double k-means. Finally, in Section 3.3, the sequential procedures are compared with the gdk on a theoretical basis. A general alternating least-squares algorithm for gdk is introduced in Section 4, along with some modifications to increase its speed or reduce the chances of finding a local optimum. Section 5 is dedicated to the problem of the optimal selection of the number of clusters. The performance of the proposals is evaluated in Section 6 by a simulation study (Section 6.1) and an application to gene microarray data (Section 6.2). Some conclusions follow in Section 7.

2. Notation

For the convenience of the reader, the notation and terminology common to all sections is listed here.

I, J, P, Q: number of units, variables, clusters of objects and clusters of variables, respectively.
X = [x_ij]: (I × J) two-way data matrix, where x_ij is the value of the jth variable on the ith object.
U = [u_1, ..., u_P] = [u_ip]: (I × P) binary matrix defining a partition of the objects into P clusters, where u_ip = 1 if the ith object belongs to cluster p and u_ip = 0 otherwise. Matrix U has only one non-zero element per row.
M = [m_1, ..., m_P]': (P × J) matrix of centroids for objects, where m_p = [m_p1, ..., m_pJ]' is the centroid of the pth cluster.
V = [v_jq]: (J × Q) binary matrix defining a partition of the variables into Q clusters, where v_jq = 1 if the jth variable belongs to cluster q and v_jq = 0 otherwise. Matrix V has only one non-zero element per row.
C: (P × Q) matrix of centroids.
Q_p: number of clusters of the variable partition corresponding to the pth object cluster.
V_p = [v_jqp]: (J × Q_p) binary matrix specifying the pth partition of the variables into Q_p clusters, where v_jqp = 1 if the jth variable belongs to cluster q and v_jqp = 0 otherwise. Matrix V_p has only one non-zero element per row.
N: (I × Q) matrix of centroids for variables.
N_p: (I × Q_p) matrix of centroids of the variables for the pth class of the objects.
I_p = diag(u_p): (I × I) diagonal matrix having the elements of the pth column of U on the main diagonal.
c_p = [c_p1, ..., c_pQ_p]': (Q_p × 1) centroid vector of the pth object class.
C = diag(c_1', ..., c_P'): (P × Q), with Q_1 + ... + Q_P = Q, block diagonal matrix of centroids for the generalized double k-means.
V = [V_1, ..., V_P]: (J × Q) binary block matrix specifying a set of P partitions of the variables into Q_1, ..., Q_P clusters, with Q_1 + ... + Q_P = Q. Matrix V specifies a special covering where each row sums to P.
E, E_a, E_b: matrices of residual terms.
H_A: projection operator onto the column space of A, i.e., H_A = A(A'A)^{-1}A'.
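Although the paper contains no code, a small numerical illustration of the projection operator H_A may help fix ideas before it is used below. The sketch (ours, assuming NumPy and a toy membership matrix) shows that, for a binary membership matrix U, premultiplying X by H_U replaces each row of X by the centroid of the cluster to which that row belongs.

```python
import numpy as np

# Toy data: 5 objects, 3 variables, 2 object clusters (hypothetical example).
X = np.array([[1.0, 2.0, 0.0],
              [1.2, 1.8, 0.1],
              [5.0, 5.5, 4.9],
              [4.8, 5.1, 5.2],
              [5.1, 4.9, 5.0]])
U = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1],
              [0, 1]])                               # binary membership matrix (one 1 per row)

def proj(A):
    """Projection operator H_A = A (A'A)^{-1} A' onto the column space of A."""
    return A @ np.linalg.inv(A.T @ A) @ A.T

HU = proj(U)
centroids = np.linalg.inv(U.T @ U) @ U.T @ X         # M = (U'U)^{-1} U'X, the cluster means
print(np.allclose(HU @ X, U @ centroids))            # True: H_U X replaces rows by cluster means
```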

3. Two-mode multi-partitioning model of a two-way data set

Simple procedures for multi-partitioning an (I × J) data matrix X, called sequential procedures or simply s-procedures, are briefly defined first; they apply ordinary clustering methodologies in two different ways and can be carried out with standard statistical packages. Such s-procedures are very useful when a researcher needs to find a two-mode multi-partitioning with standard software. However, the reader is advised that s-procedures generally identify sub-optimal solutions when compared to gdk, as will be shown in Section 3.3. Nevertheless, s-procedures preserve their general usefulness as good starting points for the gdk algorithm.


Fig. 1. Image matrix plot of a two-way data matrix with 1000 objects (rows) and 400 variables (columns): (a) two-mode double k-means partitioning; (b) two-mode object class-conditional partitioning. (a) The data matrix is divided into blocks which specify a unique partition for objects and a unique partition for variables. (b) The data matrix is divided into blocks where, conditionally on each class of objects, a different partition of the variables is allowed.

3.1. Sequential procedures

First two-mode partitioning s-procedure:
a. Partition the I objects into P clusters via k-means applied to X; the result is the binary object membership matrix U and the object centroid matrix M = [m_1, ..., m_p, ..., m_P]'.
b. Partition the J variables into Q clusters by running the ordinary k-means algorithm on matrix M weighted by U, i.e., weighting each row of M by its corresponding cluster size. The result is the binary variable membership matrix V and the variable centroid matrix C.

Formally, it can be written as
a. Partitioning of objects via k-means applied to X by minimizing

f_a(U, M) = \| X - U M \|^2.    (1)

b. Partitioning of variables via k-means applied to M by minimizing

f_b(C, V) = \| U M - U C V' \|^2.    (2)

Thus, an object partition and a variable partition are obtained so that each class of objects is associated with the same partition of variables and, vice versa, each class of variables is associated with the same partition of objects (Fig. 1a). If we relax this symmetry, then, conditionally on each class of the object partition, a different partition of variables is allowed (Fig. 1b). In this case a two-mode multi-partitioning or two-mode class-conditional partitioning problem is defined. Thus, step b of the first two-mode multi-partitioning s-procedure is modified as follows.
b'. Partition the J variables into Q_p clusters via k-means applied to m_p, weighted by the number of objects in the pth cluster, for p = 1, ..., P. The result is the binary variable membership matrix V_p and the variable centroid vector c_p, for p = 1, ..., P. Formally,
b'. Partitioning of variables via k-means applied to m_p by minimizing

f_{b'}(c_1, ..., c_P, V_1, ..., V_P) = \sum_{p=1}^{P} \| u_p m_p' - u_p c_p' V_p' \|^2.    (3)

It is important to note that (3) can be minimized by applying a k-means separately to each centroid vector m_p; thus, the weights can all be set equal to 1 without affecting the results.
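The first s-procedure is easy to prototype with standard software, as noted above. A minimal sketch (ours, assuming NumPy and scikit-learn; the function name and its arguments are illustrative, not from the paper) of steps a and b':

```python
import numpy as np
from sklearn.cluster import KMeans

def first_s_procedure(X, P, Q_list, random_state=0):
    """First sequential procedure (steps a and b'): k-means on the objects, then a
    separate k-means on each object-cluster centroid to partition the variables.
    Q_list[p] is the number of variable clusters for object cluster p."""
    # Step a: partition the I objects into P clusters.
    km_obj = KMeans(n_clusters=P, n_init=10, random_state=random_state).fit(X)
    obj_labels = km_obj.labels_                       # object partition (vector form of U)
    M = km_obj.cluster_centers_                       # (P x J) object centroids

    # Step b': for each object cluster, partition the J variables by clustering
    # the entries of the corresponding centroid vector m_p.
    var_labels, var_centroids = [], []
    for p in range(P):
        km_var = KMeans(n_clusters=Q_list[p], n_init=10,
                        random_state=random_state).fit(M[p].reshape(-1, 1))
        var_labels.append(km_var.labels_)             # pth variable partition (vector form of V_p)
        var_centroids.append(km_var.cluster_centers_.ravel())   # c_p
    return obj_labels, M, var_labels, var_centroids
```

As remarked after (3), the cluster-size weights can be dropped, so each variable partition is obtained by a plain univariate k-means on the centroid vector m_p.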


Second two-mode partitioning s-procedure: perform two independent k-means on the rows and columns of X. Formally,
c. Partitioning of objects via k-means applied to X by minimizing (1).
d. Partitioning of variables via k-means applied to X by minimizing

f_d(V, N) = \| X - N V' \|^2.    (4)

Thus, in the second multi-partitioning s-procedure, the variables must be partitioned into Q_p clusters by using only the rows belonging to the pth object cluster. The modified step is
d'. Partitioning of variables by minimizing

f_{d'}(N_1, ..., N_P, V_1, ..., V_P) = \sum_{p=1}^{P} \| I_p X - N_p V_p' \|^2,    (5)

where I_p = diag(u_p) and I_p X is the matrix whose ith row equals the ith row of X if the ith object belongs to the pth cluster and equals 0 otherwise. Also in this case the function can be minimized by minimizing each term separately via k-means. It should be noted that this procedure is properly sequential only in the multi-partitioning case: the order in which steps (c) and (d) are executed does not affect the results in the two-mode partitioning case.

3.2. Simultaneous approach

The model for two-mode multi-partitioning can be formally written as

X = U C V' + E,    (6)

where E is a residual term and C is a block diagonal matrix of the form

C = \begin{bmatrix} c_1' & 0 & \cdots & 0 \\ 0 & c_2' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & c_P' \end{bmatrix} = diag(c_1', ..., c_P'),    (7)

where c_p = [c_p1, ..., c_pQ_p]' is the (Q_p × 1) centroid vector of the pth object cluster, and Q_p is the number of clusters of the variable partition corresponding to the pth object cluster. Matrix C has dimension (P × Q), where Q = Q_1 + Q_2 + ... + Q_P. Note that matrix V has the form V = [V_1, ..., V_p, ..., V_P], where V_p = [v_jqp] is the (J × Q_p) binary matrix defining the variable partition corresponding to the pth object class. Matrix V specifies a particular covering of the set of variables. Model (6) can also be written as

X = \sum_{p=1}^{P} u_p c_p' V_p' + E.    (8)
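To make the block structure of (6)-(8) concrete, the following sketch (ours; it assumes NumPy and SciPy, and the cluster sizes are arbitrary toy values) assembles U, the block diagonal C of (7) and the block matrix V, and verifies numerically that UCV' coincides with the sum in (8).

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
I, J, P = 12, 8, 3
Q = [2, 3, 2]                                           # Q_p: variable clusters per object cluster

# Random object partition U (I x P) and variable partitions V_p (J x Q_p).
U = np.eye(P)[rng.integers(0, P, size=I)]
V_list = [np.eye(Q[p])[rng.integers(0, Q[p], size=J)] for p in range(P)]
c_list = [rng.normal(size=Q[p]) for p in range(P)]      # centroid vectors c_p

C = block_diag(*[c.reshape(1, -1) for c in c_list])     # (P x Q_1+...+Q_P) block diagonal, eq. (7)
V = np.hstack(V_list)                                   # (J x Q_1+...+Q_P) block matrix

model_6 = U @ C @ V.T                                   # UCV', eq. (6) without the residual
model_8 = sum(np.outer(U[:, p], V_list[p] @ c_list[p]) for p in range(P))   # eq. (8)
print(np.allclose(model_6, model_8))                    # True
```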

As mentioned above, the generalized double k-means model includes the double k-means as a special case. In fact, when V_1 = V_2 = ... = V_P = V, model (6) can be written as

X = U \begin{bmatrix} c_1' & 0 & \cdots & 0 \\ 0 & c_2' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & c_P' \end{bmatrix} [I, I, ..., I]' V' + E = U \tilde{C} V' + E,    (9)

where \tilde{C} = [c_1, ..., c_P]' is the (P × Q) centroid matrix of the double k-means. Thus, the double k-means can be seen as the model identifying the consensus partition of the variable partitions; the consensus provides a central classification which gives an estimate of the true common classification of the variables.


The least-squares estimation of the matrices U, V and C of the two-mode multi-partitioning can be formalized by considering a loss function of the form

SSR_gdk(U, V, C) = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{p=1}^{P} \sum_{q=1}^{Q_p} (x_{ij} - c_{pq})^2 u_{ip} v_{jqp}.    (10)

This loss requires the blocks to consist of entries that are as similar as possible in a least-squares sense. In compact matrix notation, loss (10) can equivalently be written as

SSR_gdk(U, V, C) = \| X - U C V' \|^2    (11)

or

SSR_gdk(U, V, C) = \| X - \sum_{p=1}^{P} u_p c_p' V_p' \|^2 = \| \sum_{p=1}^{P} (I_p X - u_p c_p' V_p') \|^2 = \sum_{p=1}^{P} \| I_p X - u_p c_p' V_p' \|^2,    (12)

because X = \sum_p I_p X and the terms on the right-hand side are orthogonal. The algorithm minimizing (10), (11) and (12) can be considered the natural extension of the k-means algorithm to two-mode partitioning. In fact, we have already shown that gdk includes dk. Furthermore, when the number of clusters of variables is equal to the number of variables itself, each matrix V_p is equal to the identity matrix and the generalized double k-means reduces to the k-means algorithm for clustering objects according to the variables. When U is equal to the identity matrix, a conditional version of the k-means is obtained, where a partition of the variables is fitted for each object. The optimal C = diag(c_1', ..., c_P') in (12) is the LS solution of P multivariate regression problems, thus

c_p' = (u_p' u_p)^{-1} u_p' X V_p (V_p' V_p)^{-1}.    (13)

Hence, upon substitution of (13) into the first expression of (12), we obtain the sum of squared residuals

SSR_gdk(U, V) = \| X - \sum_{p=1}^{P} H_{u_p} X H_{V_p} \|^2,    (14)

where H_A is the projection operator onto the column space of A, i.e., H_A = A(A'A)^{-1}A'. It is interesting to note that the total sum of squares of X (which coincides with the total deviance when the variables are centred) can be decomposed as

\| X \|^2 = \| X - \sum_{p=1}^{P} H_{u_p} X H_{V_p} \|^2 + \| \sum_{p=1}^{P} H_{u_p} X H_{V_p} \|^2 = \| X - \sum_{p=1}^{P} H_{u_p} X H_{V_p} \|^2 + \sum_{p=1}^{P} \| H_{u_p} X H_{V_p} \|^2,    (15)

where the first term is SSR_gdk(U, V), i.e. the within-block deviance, while the pth addendum of the second term is the between-block deviance of the pth partition of the variables. Therefore, for any partition of the objects into P clusters and P partitions of the variables into Q = Q_1 + Q_2 + ... + Q_P clusters, matrix X is split into Q blocks (sub-matrices), where the within-block deviance plus the between-block deviance is constant and equal to the total deviance. It follows that the minimization of SSR_gdk(U, V) is equivalent to the maximization of the sum of the P dependent double k-means terms \| H_{u_p} X H_{V_p} \|^2.
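The identity (15) is easy to verify numerically for any pair of partitions. The following sketch (ours, assuming NumPy; the helper names are illustrative) evaluates SSR_gdk through the projectors in (14) and checks that the within-block plus the between-block deviances reproduce the total sum of squares.

```python
import numpy as np

def proj(A):
    """Projection operator H_A = A (A'A)^{-1} A' onto the column space of A."""
    return A @ np.linalg.inv(A.T @ A) @ A.T

def ssr_gdk(X, U, V_list):
    """Residual sum of squares (14) for a given object partition U (I x P, binary)
    and variable partitions V_list[p] (J x Q_p, binary)."""
    fit = np.zeros_like(X, dtype=float)
    for p in range(U.shape[1]):
        u_p = U[:, [p]]                                  # (I x 1)
        fit += proj(u_p) @ X @ proj(V_list[p])           # H_{u_p} X H_{V_p}
    return np.sum((X - fit) ** 2), fit

# Toy check of decomposition (15): balanced partitions, so every cluster is non-empty.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 8))
U = np.repeat(np.eye(3), 4, axis=0)                      # 3 object clusters of 4 objects each
V_list = [np.eye(q)[np.arange(8) % q] for q in (2, 3, 2)]
ssr, fit = ssr_gdk(X, U, V_list)
between = sum(np.sum((proj(U[:, [p]]) @ X @ proj(V_list[p])) ** 2) for p in range(3))
print(np.allclose(np.sum(X ** 2), ssr + between))        # True, as in (15)
```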


3.3. Comparison between gdk and s-procedures

Firstly, it is interesting to note that the generalized double k-means is the simultaneous version of the first sequential procedure described in Section 3.1, from which a unique model is specified. In fact, the first s-procedure can be seen as the sequential LS estimation of the hierarchical model

X = U M + E_a,    (16)

\sum_{p=1}^{P} u_p m_p' = \sum_{p=1}^{P} u_p c_p' V_p' + E_b  ⇔  U M = U C V' + E_b,    (17)

where E_a and E_b are residual terms. By substituting (17) into (16) and setting E = E_a + E_b, we obtain

X = U C V' + E,    (18)

which is the model underlying the generalized double k-means. Secondly, it is important to note that the LS estimation of model (18) produces a residual sum of squares smaller than or equal to that of the first s-procedure. In fact, since the centroid vectors in the first s-procedure are computed as

m_p' = (u_p' u_p)^{-1} u_p' X,    c_p' = m_p' V_p (V_p' V_p)^{-1} = (u_p' u_p)^{-1} u_p' X V_p (V_p' V_p)^{-1},    (19)

substituting (19) into (1) and (3), the residual sum of squares of the first s-procedure is

SSR_1(U, V) = \| X - H_U X \|^2 + \sum_{p=1}^{P} \| H_{u_p} X - H_{u_p} X H_{V_p} \|^2 = \| X - H_U X \|^2 + \| \sum_{p=1}^{P} (H_{u_p} X - H_{u_p} X H_{V_p}) \|^2,    (20)

because H_{u_p} H_{u_q} = 0 for p ≠ q. Since the two terms in (20) are orthogonal, and H_U = \sum_{p=1}^{P} H_{u_p}, the sum (20) can also be written as

SSR_1(U, V) = \| X - H_U X + H_U X - \sum_{p=1}^{P} H_{u_p} X H_{V_p} \|^2 = \| X - \sum_{p=1}^{P} H_{u_p} X H_{V_p} \|^2.    (21)

Comparing (21) with (14), we deduce that the residual sum of squares of the first s-procedure is greater than or equal to that of the generalized double k-means, because the latter identifies U and V by directly minimizing (21). The results above are also interesting because they show how the sum of squared residuals (SSR) of gdk can be decomposed as the SSR due to the object classification on the original data plus the SSR due to the classification of the variables on the object centroids.

Finally, it is worthwhile to show that gdk can be seen as a penalized version of the second s-procedure. In fact, the centroid matrices are M = (U'U)^{-1} U'X and N_p = I_p X V_p (V_p' V_p)^{-1}; thus, the residual sum of squares is

SSR_2(U, V) = \| X - H_U X \|^2 + \sum_{p=1}^{P} \| I_p X - I_p X H_{V_p} \|^2.    (22)

In order to compare (22) with SSR_gdk, we rewrite (14) as

SSR_gdk(U, V) = \sum_{p=1}^{P} \| I_p X - H_{u_p} X H_{V_p} \|^2,    (23)

by substituting (13) into the last expression of (12). Then, each term of (23) is such that

\| I_p X - H_{u_p} X H_{V_p} \|^2 = \| I_p X - I_p X H_{V_p} + I_p X H_{V_p} - H_{u_p} X H_{V_p} \|^2 = \| I_p X - I_p X H_{V_p} \|^2 + \| I_p X H_{V_p} - H_{u_p} X H_{V_p} \|^2,    (24)

because I_p H_{u_p} = H_{u_p} and

tr{(I_p X - I_p X H_{V_p})' (I_p X H_{V_p} - H_{u_p} X H_{V_p})} = tr{X' I_p X H_{V_p} - X' H_{u_p} X H_{V_p} - H_{V_p} X' I_p X H_{V_p} + H_{V_p} X' H_{u_p} X H_{V_p}} = tr{X' I_p X H_{V_p} - X' H_{u_p} X H_{V_p} - X' I_p X H_{V_p} + X' H_{u_p} X H_{V_p}} = 0.    (25)

It follows that (23) can be written as

SSR_gdk(U, V) = \sum_{p=1}^{P} \| I_p X - I_p X H_{V_p} \|^2 + \sum_{p=1}^{P} \| I_p X H_{V_p} - H_{u_p} X H_{V_p} \|^2.    (26)

By adding (26) to (20) (which has been shown in (21) to be equal to SSR_gdk), we derive that

2 SSR_gdk(U, V) = SSR_2(U, V) + \sum_{p=1}^{P} \| H_{u_p} X - H_{u_p} X H_{V_p} \|^2 + \sum_{p=1}^{P} \| I_p X H_{V_p} - H_{u_p} X H_{V_p} \|^2.    (27)

Eq. (27) shows that gdk can be seen as a penalized version of the second s-procedure, where solutions giving very different object and variable centroids (i.e., the rows of H_{u_p} X and the columns of I_p X H_{V_p}) are penalized. In fact, by doing some algebra, we obtain

\| H_{u_p} X - I_p X H_{V_p} \|^2 = \| (H_{u_p} X - H_{u_p} X H_{V_p}) + (H_{u_p} X H_{V_p} - I_p X H_{V_p}) \|^2 = \| H_{u_p} X - H_{u_p} X H_{V_p} \|^2 + \| I_p X H_{V_p} - H_{u_p} X H_{V_p} \|^2,    (28)

because the two terms in round brackets are orthogonal.

4. Generalized double k-means algorithm

An alternating least-squares algorithm for the generalized double k-means is described in the following steps.
Initialization. Initial values for U and V must be chosen. For example, each partition can be chosen by random sampling from a multinomial distribution with equal probabilities. A rational start based on one of the s-procedures can also be considered.
Step 1: Update C. Given the current estimates of U and V, update C row by row as in (13).
Step 2: Update U. The problem of minimizing SSR_gdk over U, given the current estimates of V and C, is simply implemented as the assignment step of an ordinary k-means algorithm where X is the data matrix and CV' is the matrix of centroids.
Step 3: Update V. The problem of minimizing SSR_gdk over each V_p, given the current estimates of U and c_p, is simply implemented as the assignment step of an ordinary k-means algorithm where X'I_p is the data matrix and c_p u_p' are the centroids (see (12)).
Stopping rule. The function SSR_gdk(U, V, C) is evaluated for the current values of U, V and C. If the updates make the function decrease considerably (by more than an arbitrarily small tolerance value), then U, V and C are updated once more according to Steps 1, 2 and 3. Otherwise, the process is considered to have converged.
The algorithm monotonically decreases the loss function and, since SSR_gdk(U, V, C) is bounded from below, it converges to a stationary point which usually turns out to be at least a local minimum. To increase the chance of finding the global minimum, the algorithm should be run several times with different initial values for U and V.
For double k-means, the block centroid matrix becomes C = (U'U)^{-1} U'XV(V'V)^{-1}; instead of Step 3, the update of the membership matrix for the variables reduces to the update of V, obtained by simply implementing the assignment step of an ordinary k-means algorithm where C'U' is the matrix of centroids.
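As a concrete illustration of the steps above, here is a compact sketch (ours, in NumPy; it is not the authors' implementation and it omits the cluster splitting, dimensional reduction and permutation refinements discussed next), working with label vectors instead of the binary matrices U and V_p.

```python
import numpy as np

def gdk(X, P, Q_list, n_iter=100, tol=1e-8, rng=None):
    """Sketch of the alternating LS algorithm for generalized double k-means."""
    rng = np.random.default_rng(rng)
    I, J = X.shape
    obj = rng.integers(0, P, size=I)                               # object labels (U in vector form)
    var = [rng.integers(0, Q_list[p], size=J) for p in range(P)]   # variable labels per object cluster
    ssr_old = np.inf
    for _ in range(n_iter):
        # Step 1: update the block centroids c_pq as the means of the current blocks.
        C = [np.array([X[np.ix_(obj == p, var[p] == q)].mean()
                       if np.any(obj == p) and np.any(var[p] == q) else 0.0
                       for q in range(Q_list[p])]) for p in range(P)]
        # Step 2: update U; each object goes to the cluster whose profile c_p'V_p' is closest.
        profiles = np.array([C[p][var[p]] for p in range(P)])      # (P x J) rows of CV'
        obj = np.argmin(((X[:, None, :] - profiles[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Step 3: update each V_p; each variable goes to the nearest centroid level c_pq,
        # using only the rows of X belonging to object cluster p.
        for p in range(P):
            rows = X[obj == p]                                     # (n_p x J)
            if rows.shape[0] == 0:
                continue                                           # empty cluster: see Section 4.1
            dist = ((rows[:, :, None] - C[p][None, None, :]) ** 2).sum(axis=0)   # (J x Q_p)
            var[p] = np.argmin(dist, axis=1)
        # Stopping rule: monitor the loss SSR_gdk of (10).
        fit = np.array([C[p][var[p]] for p in range(P)])[obj]      # fitted value for every entry
        ssr = np.sum((X - fit) ** 2)
        if ssr_old - ssr < tol:
            break
        ssr_old = ssr
    return obj, var, C, ssr
```

As recommended above, such a function should be run from several different random starts and the solution with the smallest SSR retained.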


Three modifications of the previous algorithm can be introduced to reduce the chances of finding local optima or to increase the speed. They are listed below.

4.1. Cluster splitting

During the iterations, it is possible that some columns of U sum to zero, i.e., there are one or more empty clusters. In this case, it is obvious that the algorithm is converging to a non-optimal solution. To force the algorithm away from such a solution, the object chosen to be moved to the empty cluster is the one that minimizes the within deviance once the centroids are recomputed. To preserve the monotonicity of the algorithm, the block centroid matrix must be updated when cluster splitting is used. The same procedure is used in Step 3 if a column of V_p sums to zero.

4.2. Dimensional reduction

The algorithm can be accelerated by reducing the dimension of the data in the assignment Step 3. This is particularly useful when the number of objects is very large. To show how to implement such a modification, first note that

SSR_gdk(U, V, C) = \| X - U C V' \|^2 = \| (X - H_U X) + (H_U X - U C V') \|^2 = \| X - H_U X \|^2 + \| H_U X - U C V' \|^2,    (29)

because the two terms in round brackets are orthogonal. We also note that

\| H_U X - U C V' \|^2 = \| \sum_{p=1}^{P} (u_p m_p' - u_p c_p' V_p') \|^2 = \sum_{p=1}^{P} \| u_p m_p' - u_p c_p' V_p' \|^2 = \sum_{p=1}^{P} u_p' u_p \| m_p' - c_p' V_p' \|^2.    (30)

Therefore, in Step 3 of the previous algorithm, it is convenient to minimize \| m_p' - c_p' V_p' \|^2 instead of SSR_gdk. The two minimizations are equivalent, but the latter is computationally faster, since it works on matrices having one column instead of u_p'u_p columns. When the algorithm is used to fit dk, from (29) and (30) we have

SSR_dk(U, V, C) = \| X - H_U X \|^2 + \sum_{p=1}^{P} u_p' u_p \| m_p' - c_p' V' \|^2 = \| X - H_U X \|^2 + \sum_{p=1}^{P} \| \sqrt{u_p' u_p} m_p' - \sqrt{u_p' u_p} c_p' V' \|^2 = \| X - H_U X \|^2 + \| M' D - V C' D \|^2,    (31)

where D = diag(\sqrt{u_1' u_1}, ..., \sqrt{u_P' u_P}). Thus, in Step 3 some computational time can be saved by updating V as the minimizer of the second term in (31). Following the same procedure, it is also possible to achieve a dimensional reduction in Step 1 (dk only).

4.3. Permutations

In our simulation studies, we noted that the algorithm sometimes stops at a local optimum because it is not able to assign the right number of clusters to the partition of the variables corresponding to an object cluster. In order to improve this ability, Step 3 can be modified by fitting, for each object cluster, several partitions of the variables with different numbers of clusters. The update is found by assigning to each object cluster the one variable partition that minimizes the loss. For example, if P = 3 and Q_1 = 2, Q_2 = 3 and Q_3 = 4, then for each object class, say a, b and c, three


partitions of the variables, with 2, 3 or 4 clusters, are considered. The fit of each feasible solution, (a = 1, b = 2, c = 3), (a = 1, b = 3, c = 2), (a = 3, b = 1, c = 2), (a = 3, b = 2, c = 1), (a = 2, b = 3, c = 1), (a = 2, b = 1, c = 3), is computed and the solution corresponding to the best fit is chosen. If the best solution is, for example, (a = 3, b = 1, c = 2), then object cluster a is labelled as 3 and its variable partition has 4 clusters, object cluster b is labelled as 1 and its variable partition has 2 clusters and, finally, object cluster c is labelled as 2 and its variables are partitioned into 3 clusters. Formally, the problem described above is a linear assignment problem, which can be solved optimally by using the polynomial Hungarian algorithm with O(P^3) time complexity. A detailed survey of algorithms for solving the assignment problem is given in Ahuja et al. (1993).

5. Cluster validity indices to assess the quality of the partition

One problem that needs to be addressed in double k-means is the specification of appropriate values for P and Q. For standard clustering algorithms several authors have suggested indices of cluster validity. Among them, the criterion proposed by Calinski and Harabasz (1974) was found to have the best performance in a comparative study of 30 stopping rules carried out by Milligan and Cooper (1985); therefore, a similar index for dk, based on a pseudo-F statistic, has been proposed:

pF_dk = \frac{\| H_U X H_V - (1/IJ) 1_I 1_I' X 1_J 1_J' \|^2 / (PQ - 1)}{\| X - H_U X H_V \|^2 / (IJ - PQ)},    (32)

where the numerator and the denominator are the between-block and within-block deviances, respectively, each divided by the corresponding degrees of freedom. The values of P and Q of the double k-means solution corresponding to a local maximum of pF_dk are selected as the optimal numbers of classes for the object and variable partitions, respectively. In our experience, this index gives good results when there is a clear-cut cluster structure in the data. However, it is advisable not to rely on only one index to assess the quality of the partition. Our suggestion is to base the choice also on a three-dimensional scree plot of the deviances, as well as on the interpretability of the solution and the robustness of the results. The formulation of a pseudo-F index for the generalized double k-means is an open question and will be discussed in future work.

6. Performance of the generalized double k-means

6.1. Simulation study

The performance of the generalized double k-means has been evaluated by a simulation study. An experiment was conducted on four different algorithms to test their ability to recover the true partitions and their sensitivity to the local minima problem. The simulation set-up is summarized in Table 1. Data sets have been generated according to model (6), with P randomly chosen between 2 and 3 and Q_p = p + 1. Entries within blocks have been drawn from independent normal distributions. The mean of the normal distribution for the first block is zero, while those of the successive blocks are obtained by increasing the mean of the previous block by 6. Then, blocks are randomly placed in the data matrix (Fig. 2).

Table 1. Simulation set-up
Number of generated data sets: 100
Number of objects: I = 50, 1000
Number of variables: J = 10, 400
Number of object clusters: random, P = 2, 3
Number of variable clusters: Q_p = p + 1
Block means: 0, 6, 12, ...
Variance: low error (2%), medium error (20%), high error (40%)
Number of random starts: 1, 5, 10, 20, 30
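The data-generating scheme of Table 1 can be reproduced along the following lines (our sketch, assuming NumPy; the noise standard deviation is a free stand-in for the three error levels, which the paper expresses as percentages of deviance).

```python
import numpy as np

def generate_gdk_data(I=50, J=10, P=3, noise_sd=1.0, seed=0):
    """Block-structured data according to model (6): P object clusters,
    Q_p = p + 1 variable clusters within object cluster p, block means 0, 6, 12, ..."""
    rng = np.random.default_rng(seed)
    obj = rng.integers(0, P, size=I)                          # object partition
    var = [rng.integers(0, p + 2, size=J) for p in range(P)]  # Q_p = p + 1 (p is 0-based here)
    means, mean = {}, 0.0
    for p in range(P):                                        # increasing means across the blocks
        for q in range(p + 2):
            means[(p, q)] = mean
            mean += 6.0
    X = np.empty((I, J))
    for i in range(I):
        for j in range(J):
            X[i, j] = means[(obj[i], var[obj[i]][j])] + rng.normal(scale=noise_sd)
    # Random permutation of rows and columns hides the block structure (cf. Fig. 2).
    return X[rng.permutation(I)][:, rng.permutation(J)]

X = generate_gdk_data(noise_sd=0.5)    # small noise: well-separated blocks
```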


Fig. 2. (50 × 10) data matrices with P = 3 object clusters and Q1 = 2, Q2 = 3, Q3 = 4 variable clusters. Three levels of error, (a), (b) and (c), are considered. Rows and columns of the data matrices are randomly permuted, and the generalized double k-means has been applied to evaluate the recovery of the two-mode clustering structure.


Three levels of deviance have been chosen, 2%, 20% and 40%, to consider different levels of homogeneity within blocks. As can be observed in Fig. 2, in the case of a low error level the blocks are well distinct; on the other hand, when the error level is high, the blocks are hardly recognizable. In Fig. 2 (second column) the rows and columns of each data matrix have been randomly permuted to visualize that the clustering structure is no longer evident. The generalized double k-means algorithm has then been applied to recover the blocks and, specifically, the original partitions of objects and variables. Data matrices have been generated of size 50 × 10 and 1000 × 400, respectively. Combining the three error levels and the two sizes, we obtained six different experimental situations. For each situation 100 data sets have been generated, and on each data matrix the four different algorithms described below have been tested, all implementing the cluster splitting modification described in Section 4:

ALG1. Standard algorithm, started from the generated true partition.
ALG2. Standard algorithm with a random start for both the object and the variable partitions.
ALG3. Algorithm with dimensional reduction and permutations. The univariate k-means fitting the variable partitions starts from the quantiles at each iteration. In detail, to partition the variables into Q_p clusters by using the object centroid m_p, the initial value of the centroid of the qth cluster is set equal to 0.5 f(q - 1, m_p) + 0.5 f(q, m_p), where f(q, m_p) is the qth quantile of m_p. Of course, f(0, m_p) and f(Q_p, m_p) are equal to the minimum and the maximum value of m_p. The initial value of U is randomly chosen.
ALG4. Algorithm with dimensional reduction and permutations. The univariate k-means starts randomly at the first iteration; afterwards it is initialized by using the partition found in the previous iteration. The initial value of U is randomly chosen.

The performance of every algorithm has been evaluated by calculating the following measures for each situation:
1. for each data set, the Mrand (modified Rand index, Hubert and Arabie, 1985) between the true U and the corresponding fitted matrix is computed; the Mrand values are then averaged over the 100 data sets;
2. for each data set, the average of the Mrand indices between the true matrices V_1, ..., V_P and the corresponding fitted matrices is computed; those values are then averaged over the 100 data sets;
3. the number of times the fitted partition of objects is equal to the true partition;
4. the number of times the fitted partitions of variables are equal to the true partitions;
5. the number of times the algorithm gives a value of the loss greater than the minimum among the four algorithms ("sure" local optima);
6. the percentage of the sum of squared residuals over the total sum of squares;
7. the average number of iterations;
8. the average cpu time.

The first four measures are computed to evaluate the ability of the algorithms to recover the true clustering structure. The fifth measure is introduced to study the local minima problem, because it represents the number of times the algorithm is trapped in a local minimum; of course, this is only a lower bound of the true number. The sixth measure gives some information on how good the algorithm is at recovering the true fit. The last two measures concern the computational complexity. All the measures depend on the number of times the algorithm is run.
To study the sensitivity of the performance of the algorithms to the number of random starts, we considered 1, 5, 10, 20 and 30 random starts for algorithms 2, 3 and 4. These are nested, in the sense that we ran the algorithm 30 times with different random starts and retained the first solution, the best of the first 5, the best of the first 10, and so on. In this way, the comparability among the results corresponding to algorithms with different numbers of random starts is guaranteed and some computational time is saved. Of course, the same starting points have been used for all the algorithms.

The simulation results for matrices of size (50 × 10) are reported in Table 2. In Table 2 (a, low error) it can be observed that ALG3 always performed better than the other algorithms in terms of fit and recovery of the true partition. Passing from the low to the medium and high error levels (Table 2, (b) and (c)), ALG4 sometimes performs better than ALG3. ALG2 is always the fastest, but also the worst in terms of fit and recovery; we can see that even with 30 random starts ALG2 performs worse than ALG3 with only 5 random starts.


Table 2. Simulation results. Averaged values over 100 data sets of dimension (50 × 10) (best performance in bold).

Algorithm | # Random starts | Mean Mrand object partit. | Mean Mrand variable partit. | % Object partit. = true partit. | % Var. partit. = true partit. | % "Sure" local optima | % SSR | Mean # iter. | Cpu time

(a) Low error level
ALG1 | – | 1.00 | 1.00 | 100 | 100 | 0 | 1.71 | – | –
ALG2 | 1 | 0.96 | 0.58 | 93 | 17 | 83 | 4.79 | 3.57 | 0.006
ALG3 | 1 | 0.99 | 1.00 | 99 | 99 | 1 | 1.77 | 3.73 | 0.035
ALG4 | 1 | 0.99 | 0.95 | 97 | 82 | 18 | 2.42 | 3.95 | 0.049
ALG2 | 5 | 0.99 | 0.81 | 99 | 52 | 48 | 2.21 | 3.16 | 0.122
ALG3 | 5 | 1.00 | 1.00 | 100 | 100 | 0 | 1.71 | 3.73 | 0.172
ALG4 | 5 | 1.00 | 0.99 | 100 | 97 | 3 | 1.72 | 3.92 | 0.250
ALG2 | 10 | 1.00 | 0.89 | 100 | 71 | 29 | 1.90 | 3.23 | 0.239
ALG3 | 10 | 1.00 | 1.00 | 100 | 100 | 0 | 1.71 | 3.73 | 0.340
ALG4 | 10 | 1.00 | 0.99 | 100 | 98 | 2 | 1.72 | 3.93 | 0.473
ALG2 | 20 | 1.00 | 0.96 | 100 | 91 | 9 | 1.74 | 3.21 | 0.497
ALG3 | 20 | 1.00 | 1.00 | 100 | 100 | 0 | 1.71 | 3.73 | 0.686
ALG4 | 20 | 1.00 | 0.99 | 100 | 98 | 2 | 1.72 | 3.93 | 0.957
ALG2 | 30 | 1.00 | 0.98 | 100 | 95 | 5 | 1.73 | 3.18 | 0.765
ALG3 | 30 | 1.00 | 1.00 | 100 | 100 | 0 | 1.71 | 3.73 | 1.033
ALG4 | 30 | 1.00 | 0.99 | 100 | 98 | 2 | 1.72 | 3.93 | 1.433

(b) Medium error level
ALG1 | – | 0.99 | 0.94 | 89 | 60 | 38 | 17.42 | – | –
ALG2 | 1 | 0.95 | 0.52 | 81 | 1 | 92 | 19.82 | 4.33 | 0.007
ALG3 | 1 | 0.96 | 0.84 | 85 | 37 | 26 | 18.17 | 4.72 | 0.042
ALG4 | 1 | 0.94 | 0.82 | 82 | 32 | 41 | 18.87 | 4.73 | 0.052
ALG2 | 5 | 0.98 | 0.73 | 87 | 24 | 64 | 17.76 | 4.24 | 0.102
ALG3 | 5 | 0.99 | 0.86 | 89 | 41 | 7 | 17.41 | 4.70 | 0.202
ALG4 | 5 | 0.99 | 0.86 | 89 | 39 | 15 | 17.42 | 4.70 | 0.246
ALG2 | 10 | 0.99 | 0.82 | 89 | 34 | 48 | 17.50 | 4.34 | 0.173
ALG3 | 10 | 0.99 | 0.87 | 89 | 44 | 4 | 17.40 | 4.64 | 0.408
ALG4 | 10 | 0.99 | 0.87 | 89 | 43 | 6 | 17.41 | 4.72 | 0.429
ALG2 | 20 | 0.99 | 0.85 | 89 | 37 | 35 | 17.45 | 4.30 | 0.410
ALG3 | 20 | 0.99 | 0.87 | 89 | 43 | 2 | 17.40 | 4.68 | 0.818
ALG4 | 20 | 0.99 | 0.87 | 89 | 43 | 2 | 17.41 | 4.70 | 0.911
ALG2 | 30 | 0.99 | 0.86 | 89 | 38 | 28 | 17.43 | 4.27 | 0.826
ALG3 | 30 | 0.99 | 0.87 | 89 | 44 | 1 | 17.40 | 4.68 | 1.237
ALG4 | 30 | 0.99 | 0.87 | 89 | 44 | 0 | 17.40 | 4.72 | 1.571

(c) High error level
ALG1 | – | 0.93 | 0.78 | 46 | 19 | 62 | 39.37 | – | –
ALG2 | 1 | 0.85 | 0.42 | 38 | 1 | 98 | 40.87 | 5.30 | 0.014
ALG3 | 1 | 0.85 | 0.55 | 35 | 5 | 38 | 39.91 | 5.86 | 0.056
ALG4 | 1 | 0.85 | 0.55 | 35 | 4 | 67 | 40.15 | 5.96 | 0.066
ALG2 | 5 | 0.90 | 0.55 | 38 | 5 | 89 | 39.48 | 5.67 | 0.127
ALG3 | 5 | 0.90 | 0.59 | 40 | 7 | 13 | 39.19 | 6.02 | 0.245
ALG4 | 5 | 0.90 | 0.59 | 41 | 6 | 36 | 39.23 | 6.35 | 0.294
ALG2 | 10 | 0.90 | 0.57 | 38 | 6 | 75 | 39.33 | 5.55 | 0.272
ALG3 | 10 | 0.90 | 0.59 | 40 | 6 | 10 | 39.18 | 6.15 | 0.497
ALG4 | 10 | 0.91 | 0.60 | 42 | 9 | 26 | 39.18 | 6.44 | 0.576
ALG2 | 20 | 0.90 | 0.60 | 36 | 6 | 66 | 39.25 | 5.60 | 0.572
ALG3 | 20 | 0.91 | 0.60 | 41 | 7 | 3 | 39.16 | 6.23 | 0.989
ALG4 | 20 | 0.90 | 0.60 | 41 | 9 | 21 | 39.17 | 6.26 | 1.153
ALG2 | 30 | 0.90 | 0.61 | 39 | 6 | 54 | 39.21 | 5.54 | 0.759
ALG3 | 30 | 0.91 | 0.60 | 41 | 7 | 3 | 39.15 | 6.10 | 1.500
ALG4 | 30 | 0.91 | 0.60 | 41 | 9 | 20 | 39.16 | 6.20 | 1.624

Table 3. Simulation results. Averaged values over 100 data sets of dimension (1000 × 400) (best performance in bold).

Algorithm | # Random starts | Mean Mrand object partit. | Mean Mrand variable partit. | % Object partit. = true partit. | % Var. partit. = true partit. | % "Sure" local optima | % SSR | Mean # iter. | Cpu time

(a) Low error level
ALG1 | – | 1.00 | 1.00 | 100 | 100 | 0 | 1.66 | – | –
ALG2 | 1 | 0.99 | 0.57 | 99 | 6 | 83 | 5.08 | 3.31 | 1.690
ALG3 | 1 | 0.99 | 0.99 | 97 | 97 | 1 | 2.12 | 3.51 | 1.131
ALG4 | 1 | 1.00 | 0.97 | 99 | 89 | 18 | 1.88 | 3.94 | 1.214
ALG2 | 5 | 1.00 | 0.76 | 100 | 23 | 48 | 2.66 | 3.39 | 8.419
ALG3 | 5 | 1.00 | 1.00 | 99 | 99 | 0 | 1.77 | 3.52 | 5.740
ALG4 | 5 | 1.00 | 0.99 | 99 | 97 | 3 | 1.78 | 3.93 | 6.000
ALG2 | 10 | 1.00 | 0.81 | 100 | 33 | 29 | 2.43 | 3.43 | 16.774
ALG3 | 10 | 1.00 | 1.00 | 99 | 99 | 0 | 1.77 | 3.52 | 11.781
ALG4 | 10 | 1.00 | 0.99 | 99 | 97 | 2 | 1.78 | 3.93 | 12.131
ALG2 | 20 | 1.00 | 0.83 | 100 | 42 | 9 | 2.27 | 3.61 | 33.493
ALG3 | 20 | 1.00 | 1.00 | 99 | 99 | 0 | 1.77 | 3.52 | 23.422
ALG4 | 20 | 1.00 | 0.99 | 99 | 97 | 2 | 1.78 | 3.93 | 24.243
ALG2 | 30 | 1.00 | 0.86 | 100 | 54 | 5 | 2.09 | 3.64 | 50.700
ALG3 | 30 | 1.00 | 1.00 | 99 | 99 | 0 | 1.77 | 3.52 | 35.513
ALG4 | 30 | 1.00 | 1.00 | 99 | 98 | 2 | 1.77 | 3.92 | 36.652

(b) Medium error level
ALG1 | – | 1.00 | 1.00 | 100 | 100 | 38 | 17.06 | – | –
ALG2 | 1 | 0.98 | 0.57 | 96 | 8 | 92 | 19.91 | 3.71 | 1.880
ALG3 | 1 | 0.98 | 0.99 | 96 | 96 | 26 | 17.49 | 3.74 | 1.197
ALG4 | 1 | 0.98 | 0.97 | 95 | 92 | 41 | 17.66 | 4.05 | 1.231
ALG2 | 5 | 1.00 | 0.78 | 100 | 33 | 64 | 17.82 | 3.84 | 9.281
ALG3 | 5 | 1.00 | 1.00 | 99 | 99 | 7 | 17.15 | 3.74 | 6.213
ALG4 | 5 | 0.99 | 1.00 | 98 | 98 | 15 | 17.24 | 3.96 | 6.187
ALG2 | 10 | 1.00 | 0.86 | 100 | 54 | 48 | 17.53 | 3.83 | 18.503
ALG3 | 10 | 1.00 | 1.00 | 99 | 99 | 4 | 17.15 | 3.74 | 12.658
ALG4 | 10 | 0.99 | 1.00 | 98 | 98 | 6 | 17.24 | 3.96 | 12.606
ALG2 | 20 | 1.00 | 0.89 | 100 | 68 | 35 | 17.34 | 3.95 | 37.179
ALG3 | 20 | 1.00 | 1.00 | 99 | 99 | 2 | 17.15 | 3.74 | 25.097
ALG4 | 20 | 1.00 | 1.00 | 99 | 99 | 2 | 17.15 | 3.95 | 24.948
ALG2 | 30 | 1.00 | 0.90 | 100 | 72 | 28 | 17.29 | 3.98 | 55.497
ALG3 | 30 | 1.00 | 1.00 | 99 | 99 | 1 | 17.15 | 3.74 | 37.754
ALG4 | 30 | 1.00 | 1.00 | 99 | 99 | 0 | 17.15 | 3.95 | 37.562

(c) High error level
ALG1 | – | 1.00 | 1.00 | 100 | 93 | 85 | 39.48 | – | –
ALG2 | 1 | 0.92 | 0.53 | 81 | 5 | 98 | 43.13 | 4.03 | 2.014
ALG3 | 1 | 0.93 | 0.95 | 85 | 78 | 38 | 40.93 | 4.04 | 1.288
ALG4 | 1 | 0.93 | 0.93 | 85 | 76 | 67 | 41.11 | 4.26 | 1.283
ALG2 | 5 | 0.99 | 0.76 | 98 | 31 | 89 | 40.22 | 4.19 | 10.149
ALG3 | 5 | 1.00 | 1.00 | 100 | 92 | 13 | 39.48 | 3.96 | 6.447
ALG4 | 5 | 0.99 | 0.99 | 98 | 89 | 36 | 39.62 | 4.15 | 6.231
ALG2 | 10 | 1.00 | 0.83 | 99 | 45 | 75 | 39.93 | 4.25 | 20.173
ALG3 | 10 | 1.00 | 1.00 | 100 | 92 | 10 | 39.48 | 3.96 | 13.136
ALG4 | 10 | 1.00 | 1.00 | 100 | 91 | 26 | 39.51 | 4.11 | 12.779
ALG2 | 20 | 1.00 | 0.87 | 100 | 59 | 66 | 39.70 | 4.35 | 40.446
ALG3 | 20 | 1.00 | 1.00 | 100 | 92 | 3 | 39.48 | 3.96 | 26.149
ALG4 | 20 | 1.00 | 1.00 | 100 | 91 | 21 | 39.51 | 4.10 | 25.436
ALG2 | 30 | 1.00 | 0.91 | 100 | 68 | 54 | 39.65 | 4.38 | 60.849
ALG3 | 30 | 1.00 | 1.00 | 100 | 92 | 3 | 39.48 | 3.96 | 39.271
ALG4 | 30 | 1.00 | 1.00 | 100 | 91 | 20 | 39.51 | 4.10 | 38.424

Finally, it is important to note that the differences between ALG3 and ALG4 tend to disappear when the number of random starts increases, as expected.

The results for the large data matrices (1000 × 400) (see Table 3 and Fig. 3) are similar to those described above, even if the differences among the algorithms are less evident here. Quite often ALG2 encounters a local minimum and is not able to recover the true cluster structure of the variables; moreover, both ALG3 and ALG4 are faster than ALG2. The latter difference is due to the dimensional reduction, which decreases the cpu time and compensates for the increment induced by the permutation modification. In the first simulation the size of the data is small and ALG2 is the fastest, because the dimensional reduction does not have a great effect.

6.2. Application to gene expression data

To illustrate the use of the generalized double k-means, we include the analysis of data from a study of cutaneous melanoma gene expression (Bittner et al., 2000). We do not present a careful re-analysis of this data set, but rather use it to highlight the main features of the generalized double k-means. The authors of this study aimed to determine whether or not molecular profiles generated by cDNA microarrays could be used to identify distinct subtypes of cutaneous melanoma, a malignant neoplasm of the skin. The data consist of 31 samples of cutaneous melanomas. The mRNA was extracted and Cy5-labelled cDNA was created for the 31 cutaneous melanoma samples. A single reference probe, labelled Cy3, was used for all samples. The Cy5- and Cy3-labelled cDNA was mixed for each sample and hybridized to a separate melanoma microarray. The hybridized array was scanned using both red and green lasers, and the resulting image was analysed. The cDNAs identified as well measured were 3613 out of the 8150 observed. Expression ratios of Cy5/Cy3 were calculated for the well-measured genes. Ratios greater than 50 and less than 0.02 were truncated to 50 and 0.02, respectively.


Fig. 3. (1000 × 400) data matrices with P = 3 object clusters and Q1 = 2, Q2 = 3, Q3 = 4 variable clusters. Three levels of error, (a), (b) and (c), are considered. Rows and columns of the data matrices are randomly permuted, and the generalized double k-means has been applied to evaluate the recovery of the two-mode clustering structure.


Table 4. Descriptive statistics for the gene (row) clusters

Row cluster | Card. | Mean column cluster 1 | Mean column cluster 2 | Mean F | Std F | Max F | Min F
1 | 1078 | −0.358 | −0.523 | 22.295 | 23.133 | 183.011 | 0.699
2 | 1944 | 0.135 | 0.153 | 7.724 | 9.208 | 63.069 | 0.010
3 | 98 | −1.783 | −2.442 | 118.318 | 99.319 | 478.366 | 18.989
4 | 493 | 0.605 | 1.027 | 54.817 | 48.182 | 379.255 | 3.751


Fig. 4. Plot of tissue samples on the third and fourth row centroids.

The resulting ratios were transformed to a logarithmic scale (base 2). The log-ratios were normalized by subtracting the median log-ratio within an experiment from all log-ratios of that experiment, so that the median log-ratio within an experiment was zero. No normalization was performed across experiments, since a single reference probe was used for all experiments. Average linkage hierarchical clustering was carried out on the 31 cutaneous melanoma samples by using one minus the Pearson correlation coefficient of the log-ratios as the dissimilarity measure between two experiments. In this way Bittner et al. obtained two clusters of 12 and 19 samples, which were validated by using multidimensional scaling (MDS) and the non-hierarchical clustering algorithm CAST (Ben-Dor et al., 1999). Both the MDS and CAST procedures identified the same large cluster of 19 samples as that found by average linkage hierarchical clustering. To the same data set, composed of 3613 genes (rows) × 31 samples (columns), we applied the double k-means. To be coherent with the analysis of Bittner et al., the columns have been centred and rescaled to unit variance; in this way the distance between two columns is proportional to one minus the Pearson correlation coefficient. The technique has been applied for different combinations of the number of clusters for the rows (from 2 to 25) and the columns (from 2 to 6). For each combination, we ran the algorithm several times to avoid local minima and retained the best solution. On the basis of the pF_dk index (32), the combination corresponding to the highest value of the pseudo-F was chosen, i.e., four clusters for the genes (rows) and two for the samples (columns). Comparing the results with Bittner et al., we see that



Fig. 5. Data map of the contingency table: frequency of each pair of tissue samples belonging to the same cluster, when the number of clusters of genes varies from 2 to 25.

only two samples are classified in different clusters. In particular, in our analysis samples 4 and 7 belong to the main group of 19 samples obtained by Bittner et al. To analyse the ability of the genes to discriminate between the two sample clusters, we computed the Fisher F statistic for each gene. Some descriptive statistics of the distributions of the F statistics within the clusters are reported in Table 4. From Table 4 it seems clear that row cluster 3 contains the genes that most discriminate between the two sample clusters, while the second contains the genes with the lowest discriminant power. The configuration of the samples is shown in Fig. 4, using as dimensions the centroids of the third and fourth clusters, i.e., the ones that most discriminate between the two sample clusters. In Fig. 4, the samples of cluster 1 (circles) and cluster 2 (diamonds) are displayed. The circled samples are those that in Bittner et al. belong to a different group (samples 4 and 7). The figure shows that cluster 2 is more compact than the other one, confirming the analysis of Bittner et al. Furthermore, to validate the two column clusters, the partitions of the columns obtained by using a different number of groups for the rows were considered. Then, for each couple of samples, the number of partitions in which the two samples belong to the same cluster was computed. Such a sensitivity analysis aims to evaluate how stable the cluster memberships are when an increasing number of clusters for the genes is used. The resulting contingency table, counting for each pair of the 31 samples the number of times they belong to the same cluster, is displayed in Fig. 5. It shows that the samples in cluster 2 (from 13 to 31) always belong to the same cluster (except for sample 15, which a few times was classified in cluster 1). Samples 1, 4, 7, 8 and 15 show a less stable membership. These samples appear in Fig. 4 on the border of the cluster they belong to, but close to the other cluster; this confirms that their membership of one cluster may be considered more uncertain. However, samples 4 and 7 belong more frequently to cluster 2, which justifies the difference from the classification by Bittner et al. Furthermore, from Fig. 5 it can be observed that the first group is less stable, confirming the configuration displayed in Fig. 4, where the samples of cluster 1 appear less homogeneous. The results of our technique also agree with those obtained by Goldstein et al. (2002) on the same data set by applying different hierarchical clustering methods. However, the same authors obtained very different results by using partitioning algorithms like k-means or PAM (Kaufman and Rousseeuw, 1990); this is probably due to the fact that they rescaled the columns in a different way.
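The claim that, after centring and rescaling the columns to unit variance, squared Euclidean distances between columns are proportional to one minus the Pearson correlation can be checked with a few lines (ours, assuming NumPy; the proportionality constant depends on whether the standard deviation uses I or I − 1 in the denominator).

```python
import numpy as np

rng = np.random.default_rng(0)
I = 31
a, b = rng.normal(size=I), rng.normal(size=I)

# Centre and rescale to unit variance (population standard deviation, ddof = 0).
za = (a - a.mean()) / a.std()
zb = (b - b.mean()) / b.std()

r = np.corrcoef(a, b)[0, 1]                  # Pearson correlation
d2 = np.sum((za - zb) ** 2)                  # squared Euclidean distance of standardized columns
print(np.isclose(d2, 2 * I * (1 - r)))       # True: d^2 = 2I(1 - r)
```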


Fig. 6. Gene expression data (Bittner et al., 2000). Partition into 4 groups for the genes and 3 different cluster sizes for the samples. In each situation, the cluster cardinality and the pseudo-F index are given.

The generalized double k-means was then applied by fixing P = 4 as above but allowing different partitions of the samples. We chose the combination giving the best interpretability of the results, with Q_p equal to either 2 or 3. The solution was Q_1 = Q_2 = Q_3 = Q_4 = 3, as shown in Fig. 6. The partition of the genes determined by gdk generally coincides with that of dk (the cardinalities are 98, 1106, 1944, 465). The last cluster of 465 genes (Fig. 6d) has the highest gene expressions and specifies the partition of samples with the largest pF = 185.08. One of the clusters of this partition (Fig. 6d) includes the 19 samples found also in the analysis of Bittner et al., plus sample 4.

7. Discussion

In this paper, new methodologies for two-mode partitioning and multi-partitioning of a two-way data set are proposed. They are extensions of the well-known, and frequently used, k-means algorithm and are discussed and evaluated in a least-squares context. The proposed models are compared from a theoretical point of view with sequential procedures performed by repeatedly applying ordinary clustering techniques to obtain a two-mode single or multi-partitioning. Several alternating least-squares algorithms to estimate our models are proposed and tested by simulation studies.


Finally, an example of use is given by an application to microarray data. The results obtained in the simulation study and in the empirical analysis are encouraging, showing the usefulness of the proposals and the efficiency of the algorithms. As a final consideration, as suggested by one of the referees, gdk is related to regression trees in the case of just two row clusters and two column clusters (within each row cluster): gdk is then equivalent to a regression tree with a first split on the rows and a second split (per row cluster) on the columns.

References

Ahuja, R.K., Magnanti, T.L., Orlin, J.B., 1993. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ.
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745–6750.
Ben-Dor, A., Shamir, R., Yakhini, Z., 1999. Clustering gene expression patterns. J. Comput. Biol. 6, 281–297.
Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Sampas, N., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Carpten, J., Gillanders, E., Leja, D., Dietrich, K., Beaudry, C., Berens, M., Alberts, D., Sondak, V., Hayward, N., Trent, J., 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406, 536–540.
Blatt, M., Wiseman, S., Domany, E., 1996. Superparamagnetic clustering of data. Phys. Rev. Lett. 76 (18), 3251–3254.
Bock, H.H., 1979. Simultaneous clustering of objects and variables. In: Tomassone, R. (Ed.), Analyse des données et informatique. INRIA, Le Chesnay, France.
Bock, H.H., 2003. Two-way clustering for contingency tables maximizing a dependence measure. In: Schader, M., Gaul, W., Vichi, M. (Eds.), Between Data Science and Applied Data Analysis. Springer, Heidelberg, pp. 143–155.
Bock, H.H., Vichi, M., 2007. Statistical learning methods including dimensionality reduction. Comput. Statist. Data Anal. 52 (1), 370–613.
Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Commun. Statist. 3, 1–27.
Celeux, G., Govaert, G., 1992. A classification EM algorithm for clustering and two stochastic versions. Comput. Statist. Data Anal. 14, 315–332.
DeSarbo, W.S., 1982. GENNCLUS: new models for general non-hierarchical clustering analysis. Psychometrika 47, 449–475.
Eisen, M., Spellman, P., Brown, P., Botstein, D., 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868.
Getz, G., Levine, E., Domany, E., 2000. Coupled two-way clustering of gene microarray data. Proc. Natl. Acad. Sci. USA 97 (22), 12079–12084.
Goldstein, D.R., Ghosh, D., Conlon, E.M., 2002. Statistical issues in the clustering of gene expression data. Statist. Sinica 12, 219–240.
Gordon, A.D., 1999. Classification, second ed. Chapman and Hall, London.
Govaert, G., 1995. Simultaneous clustering of rows and columns. Control Cybernet. 24, 437–458.
Govaert, G., Nadif, M., 2003. Clustering with block mixture models. Pattern Recognition 36, 463–473.
Hartigan, J.A., 1972. Direct clustering of a data matrix. J. Amer. Statist. Assoc. 67, 123–129.
Hartigan, J.A., 1975. Clustering Algorithms. Wiley, New York.
Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classification 2, 193–218.
Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Le Cam, L.M., Neyman, J. (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1: Statistics. University of California Press, Berkeley, pp. 281–297.
Milligan, G.W., Cooper, M., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179.
Pollard, K.S., van der Laan, M.J., 2002. Statistical inference for simultaneous clustering of gene expression data. Math. Biosci. 176, 99–121.
Van Mechelen, I., Bock, H.H., De Boeck, P., 2004. Two-mode clustering methods: a structured overview. Statist. Methods Medical Res. 13 (5), 363–394.
Vichi, M., 2000. Double k-means clustering for simultaneous classification of objects and variables. In: Borra, S., et al. (Eds.), Advances in Classification and Data Analysis. Springer, Berlin.