Journal of Theoretical Biology 317 (2013) 200–211
A graph spectrum based geometric biclustering algorithm

Doris Z. Wang*, Hong Yan
Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
HIGHLIGHTS

• We perform graph spectrum based geometric biclustering (GSGBC).
• A modified Hough transform is used to find sub-biclusters in column-pair spaces.
• A graph spectrum based algorithm is used to combine the sub-biclusters.
• GSGBC achieves better bicluster detection accuracy.
• GSGBC reduces the time cost of the sub-bicluster combination step of geometric biclustering.
Article history: Received 27 April 2012; received in revised form 4 October 2012; accepted 6 October 2012; available online 16 October 2012.

Abstract
Biclustering is capable of performing simultaneous clustering on two dimensions of a data matrix and has many applications in pattern classification. For example, in microarray experiments, a subset of genes is co-expressed in a subset of conditions, and biclustering algorithms can be used to detect the coherent patterns in the data for further analysis of function. In this paper, we present a graph spectrum based geometric biclustering (GSGBC) algorithm. In the geometrical view, biclusters can be seen as different linear geometrical patterns in high dimensional spaces. Based on this, the modified Hough transform is used to find the Hough vector (HV) corresponding to sub-bicluster patterns in 2D spaces. A graph can be built regarding each HV as a node. The graph spectrum is utilized to identify the eigengroups in which the sub-biclusters are grouped naturally to produce larger biclusters. Through a comparative study, we find that the GSGBC achieves as good a result as GBC and outperforms other kinds of biclustering algorithms. Also, compared with the original geometrical biclustering algorithm, it reduces the computing time complexity significantly. We also show that biologically meaningful biclusters can be identified by our method from real microarray gene expression data.
© 2012 Elsevier Ltd. All rights reserved.
Keywords: Geometric biclustering; Hough transform; Graph spectra; Microarray data analysis
1. Introduction

Biclustering is an unsupervised approach that performs simultaneous clustering on both the rows and the columns of a matrix (Liu and Wang, 2007). The resulting biclusters are sub-matrices of the original full data matrix that contain special coherent structures. In (Madeira and Oliveira, 2004), five types of bicluster patterns are characterized: (a) constant patterns, (b) constant row patterns, (c) constant column patterns, (d) additive coherent patterns and (e) multiplicative coherent patterns. These patterns can be corrupted by noise and can overlap in the data matrix. Biclustering has a wide range of applications in different areas such as bioinformatics, image retrieval and text classification (Dhillon et al., 2003; Amaratunga and Cabrera, 2004; Sun et al.,
* Corresponding author. Tel.: +852 69924393. E-mail addresses: [email protected] (D.Z. Wang), [email protected] (H. Yan).
http://dx.doi.org/10.1016/j.jtbi.2012.10.012
2010; Maulik et al., 2009). Gene expression data, which can be obtained from microarray experiments performed under different conditions or at different time points, can be used to study many biological problems and to unravel the mechanisms characterizing cellular responses (Rueda et al., 2008). Clustering techniques have been widely used to investigate the underlying structure of gene expression data sets. However, these algorithms can only be applied along either the gene or the condition direction (Desper et al., 2004; Liu et al., 2006); the clusters produced reflect the global patterns of the expression data and thus miss cases in which only a subset of genes is co-expressed under a subset of conditions. Biclustering works well for detecting such local patterns and can help uncover many genetic pathways that are not apparent otherwise. A number of biclustering algorithms have been proposed recently. Based on the bicluster model and search strategy, we can classify these algorithms into five categories: distance based methods (Wang et al., 2002), matrix factorization based methods (Pascual-Montano et al., 2006), probabilistic models (Lazzeroni
and Owen, 2002), coherent evolution methods (Ben-Dor et al., 2003) and geometric models (Gan et al., 2008). In geometry based biclustering, which was first proposed in 2005 (Gan et al., 2008), the different bicluster patterns in a matrix can be seen as linear objects with special structures in high dimensional spaces. For a data matrix D_{m×n}, each row can be considered as a point (or a vector) and the columns as the coordinates v_1, v_2, v_3, ..., v_n of that point. We can thus write D_{m×n} = [p_1, p_2, p_3, ..., p_m]^T, where each point is p_i(v_1, v_2, v_3, ..., v_n). A constant pattern can be represented as p_i = p_j, corresponding to the same point with equal coordinates (v_1 = v_2 = ... = v_n). A constant column pattern can also be represented as p_i = p_j, i.e., the same point, but its coordinates v_1, v_2, ..., v_n can differ from one another. A constant row pattern, with p_i ≠ p_j, can be seen as different points on the same hyperplane with equal coordinates (v_1 = v_2 = ... = v_n). The additive pattern, represented by v_i = v_j + b_{ij}, and the multiplicative pattern, represented by v_i = k_{ij} v_j, can also be seen as points on hyperplanes with special parameters. We can therefore search for linear objects in a multidimensional space to find the bicluster patterns in a matrix. The Hough transform (HT) can be utilized to identify linear structures in a high dimensional space (Illingworth and Kittler, 1988). However, the computational complexity of the HT increases at the rate of O(I^{n−2}), where I is the number of quantization intervals per parameter and n is the number of dimensions, so a large matrix leads to a low efficiency of the HT in detecting linear structures. To overcome this disadvantage of the HT applied to a multidimensional space, sub-dimension based methods have been introduced for biclustering. For example, the geometrical biclustering algorithm (GBC), which applies the HT in 2-dimensional spaces, is introduced in (Zhao et al., 2008): the matrix with bicluster patterns embedded in it is split into column pairs, the HT is applied to each column-pair sub-matrix in order to find sub-biclusters, and an expansion algorithm is then used to combine the sub-biclusters into the largest bicluster. In GBC, the expansion algorithm considers all the sub-biclusters throughout the whole matrix, leading to a complicated combination process. In this paper, we propose a graph spectra based geometrical biclustering algorithm (GSGBC). To improve the speed of the classical Hough transform, we use a modified Hough transform in 2-D spaces for sub-bicluster extraction. The main problem to be tackled is how to merge the sub-biclusters. The most important contribution of this paper is that we utilize graph spectra theory (Cvetkovic et al., 1980) to optimize the sub-bicluster combination process so that the computational speed is improved significantly. Using graphic approaches to study biological systems can provide an intuitive vision and useful insights that help analyze the complicated relations therein, as indicated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions (Zhou and Deng, 1984; Myers and Palmer, 1985; Chou, 1989; Andraos, 2008),
investigation into slow conformational changes with Chou's non-steady-state graphical rule (Lin and Neet, 1990), protein folding kinetics and folding rates (Chou, 1990), inhibition of HIV-1 reverse transcriptase (Althaus et al., 1993a, 1993b, 1993c), inhibition kinetics of processive nucleic acid polymerases and nucleases (Chou et al., 1994), drug metabolism systems (Chou, 2010), and, recently, the use of wenxiang graphs (Chou et al., 2011) to analyze protein–protein interactions (Zhou, 2011a, 2011b). In this paper, graph theory is applied to biclustering. After the modified Hough transform, we consider each sub-bicluster in a 2-D space as a node of a graph. Two nodes are connected by an edge, with the similarity between the two nodes defined as the edge weight. By computing the spectrum of the adjacency matrix, the nodes are classified naturally into eigengroups, and thus the entropy of the sub-bicluster matrix decreases. By combining the sub-biclusters in each eigengroup, the largest bicluster can be
Fig. 1. (a) An intensity representation of a simulated microarray data matrix with two hidden overlapped biclusters. (b) The biclustering result obtained through GSGBC.
obtained, and the time complexity of this merging step is reduced accordingly. The proposed GSGBC algorithm achieves results similar to those of GBC (although slightly worse), but it outperforms GBC considerably in terms of computing time. A simple example of an original matrix with embedded biclusters and the biclustering result of the GSGBC algorithm is shown in Fig. 1; through the GSGBC process, the overlapped biclusters in the matrix are detected. We organize the paper as follows. In Section 2, we introduce the main concepts of the GSGBC algorithm; specifically, the modified HT and the graph spectra based expansion process are presented. In Section 3, the matching score and entropy measures are introduced for validating the biclustering algorithms. Experimental data, both artificial and biological, are presented and comparative studies are carried out on those data in Sections 4 and 5, respectively. The last part of the paper contains the discussion and conclusions.
2. Graph spectrum based geometric biclustering algorithm

In the geometric view, bicluster patterns can be seen as linear structures in high dimensional spaces (Gan et al., 2008). The HT, which is widely used for extracting lines and curves in images, is employed to detect those bicluster patterns. The statistical properties of the HT, such as robustness, consistency and convergence, make it well suited to bicluster detection (Illingworth and Kittler, 1988). However, the computational complexity of the voting process in the parameter space becomes excessive when the data dimension increases. In order to overcome the low speed and large storage requirements of the classical HT, the GBC algorithm (Zhao et al., 2008) detects sub-biclusters in column-pair spaces through the 2-D HT and then merges them. In this method, the five bicluster patterns are split into column-pair sub-biclusters, and each type of sub-bicluster can be viewed as a set of points regressing to a line with special parameters. To further reduce the computational complexity of the 2-D HT, we modify the HT according to the geometric properties of the sub-biclusters. The Hough vector representing the sub-biclusters in each column pair can then be obtained.

2.1. Geometric structure properties of biclusters in column-pair spaces

In column-pair spaces, the five kinds of bicluster patterns and their corresponding geometrical structures are shown in Fig. 2
(Zhao et al., 2008). If each type of column-pair sub-bicluster is viewed as a set of points {p_i = (x_i, y_i) ∈ R², i = 1, 2, ...} in 2-D space, then the constant pattern (C) corresponds to identical points with identical coordinates, the constant row pattern (CR) corresponds to points on the diagonal line y = x, the constant column pattern (CC) corresponds to identical points whose x and y coordinates may differ, the additive pattern (A) corresponds to points on a line y = x + b, and the multiplicative pattern (M) corresponds to points on a line y = kx. Thus three conclusions can be drawn: (1) C ⊂ CC and C ⊂ CR; (2) CC ⊂ A and CC ⊂ M; (3) CR ⊂ A and CR ⊂ M. Therefore, in order to identify the sub-biclusters in a column-pair space, we only need to identify the additive coherent pattern and the multiplicative coherent pattern (see the illustrative sketch below).

Fig. 2. Geometric structure of the five bicluster types in column-pair space.
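As a quick illustration of these relations, the following Python sketch (our own illustration, not part of the original paper; all variable names and values are ours) generates one column pair of an additive and of a multiplicative sub-bicluster and verifies that the row-wise differences, respectively ratios, are constant, which is exactly the property the modified HT exploits.

```python
# Illustrative sketch: rows of an additive sub-bicluster, viewed in a
# column-pair space, fall on y = x + b, while multiplicative rows fall on y = k*x.
import numpy as np

rng = np.random.default_rng(0)
base = rng.uniform(-5, 5, size=10)          # shared "row effect"
col_x = base                                # first column of the pair
col_y_add = base + 2.5                      # additive pattern: y = x + b with b = 2.5
col_y_mul = base * 1.8                      # multiplicative pattern: y = k*x with k = 1.8

# Differences are constant for the additive pattern ...
print(np.allclose(col_y_add - col_x, 2.5))  # True
# ... and ratios are constant for the multiplicative pattern.
print(np.allclose(col_y_mul / col_x, 1.8))  # True
```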
2.2. The classical Hough transform

The classical HT is a popular technique that is widely used for detecting lines and curves in images through a voting process in a parameter space. Given a set of points {p_i = (x_i, y_i) ∈ R², i = 1, ..., n}, the objective of the HT is to infer the parameters (k, b) of the line y = kx + b that fits the data {p_i} best. The main idea of the HT is to view each point as generating a line comprising all pairs (k, b) that are consistent with this point. Thus, collinearity in the original set of points manifests itself as a common intersection of lines in the Hough domain. The HT algorithm is implemented numerically by identifying the cells, under the desired quantization of the Hough space, that gain the largest number of counts. In Fig. 3, the parameter space is quantized into 2n+1 steps for both k and b, so (2n+1)² cells are obtained. We denote the number of counts for the cell centered at e(k, b) as D(k, b). The HT aims to find the maximal value of D(k, b) in the parameter space, which represents the largest number of lines passing through cell e(k, b).

2.3. The modified Hough transform for sub-bicluster detection
From the discussion above, we can conclude that additive and multiplicative sub-bicluster patterns correspond to points on regression lines with k = 1 or b = 0, respectively. To detect these patterns using the HT, we only need to consider the cells located at e(1, b) and e(k, 0) in the Hough domain; the dark cells in Fig. 3 show the ones we are mainly concerned with in the parameter space. As these cells keep a count of the lines crossing them, D(1, b) as a function of b is the histogram of b for k = 1, and D(k, 0) as a function of k is the histogram of k for b = 0. Thus, instead of quantizing the parameter space into (2n+1)² cells, we only need to quantize along the two lines k = 1 and b = 0. As a result, we obtain 4n+1 cells, since the two lines share the cell at their intersection. For each column pair, the distribution of D(1, b) is equal to the difference histogram of the two columns, and the distribution of D(k, 0) is equal to the ratio histogram of the two columns.
Fig. 3. Quantization of HT in parameter space.
Thus, for sub-bicluster detection in column-pair spaces, we instead compute the column-pair difference histogram and ratio histogram, each with 2n+1 bins; the largest sub-bicluster then corresponds to the bin with the largest count.
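A minimal sketch of this modified HT step is given below (our own illustration, not the authors' code); it assumes the sub-bicluster of a column pair shows up as the most populated bin of either the difference or the ratio histogram, and it returns the corresponding binary Hough vector.

```python
import numpy as np

def modified_ht(col_x, col_y, n_bins, min_rows):
    """Sketch of the modified HT for one column pair: build the difference
    histogram D(1, b) and the ratio histogram D(k, 0), take the most
    populated bin, and report its rows as a binary Hough vector.
    Bin counts and thresholds are illustrative choices."""
    diff = col_y - col_x                                   # votes along k = 1
    ratio = col_y / np.where(col_x == 0, np.nan, col_x)    # votes along b = 0

    best_rows, best_count = None, 0
    for values in (diff, ratio):
        finite = np.isfinite(values)
        counts, edges = np.histogram(values[finite], bins=n_bins)
        top = np.argmax(counts)
        if counts[top] > best_count:
            in_bin = finite & (values >= edges[top]) & (values <= edges[top + 1])
            best_rows, best_count = np.flatnonzero(in_bin), counts[top]

    if best_count < min_rows:
        return None                             # no sub-bicluster in this pair
    hv = np.zeros(len(col_x), dtype=int)        # binary Hough vector
    hv[best_rows] = 1
    return hv
```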
2.4. The Hough vector

We define a binary Hough vector (HV) to represent the patterns of the sub-biclusters in the column-pair spaces of the data matrix. Given that we use the modified HT in 2-D spaces to detect the largest sub-biclusters, the detected sub-biclusters are mainly composed of a series of points that regress to a line with special parameters according to their pattern type. The binary HV h_i represents a sub-bicluster, with 1s marking the points on the line and 0s marking unrelated points. All HVs are obtained after sub-bicluster identification on every column pair with the modified HT, and they make up a Hough array (HA), HA = [h_1, h_2, ..., h_{n(n−1)/2}]. Two HVs can be combined if the sub-biclusters they represent share common columns. For example, if h_1 represents the sub-bicluster detected from columns 1 and 2 of the original matrix and h_2 represents the sub-bicluster detected from columns 1 and 3, the two vectors can be combined into h_{1,2}, whose set of 1 elements is the (smaller) intersection of their rows. Furthermore, the HVs to be combined should share as many common 1s as possible so that the combined sub-bicluster is of a larger size. To meet these two requirements, a graph spectra based combination algorithm is proposed in the following section.

2.5. The graph spectra based combination algorithm

Given that there are N = C_n^2 vectors for combination, grouping the vectors according to some measure, such as a distance, and combining the vectors within each group can effectively reduce the time complexity of the expansion step (Wang and Yan, 2011). Considering each HV as a node of a graph G(U, E), we can build the relation graph of the HVs. The vector grouping scheme is thus transformed into a graph-partitioning problem.

2.6. The graph adjacency matrix

Graph based methods are very useful in analyzing enzyme kinetics since they provide a visually intuitive relation between calculations and chemical reactions (Zhou and Deng, 1984; Chou, 1989, 1990). Inspired by this, we also apply graph theory to our proposed biclustering method. The relationship between the HVs can be effectively captured in the form of a graph whose nodes U represent the HVs and whose edges E denote the similarity between HVs. We use the adjacency matrix A to represent the relation graph; the spectrum of this adjacency matrix is the basis for the node grouping scheme. For the HV graph G(U, E), the adjacency matrix A_{N×N} = A(G) is defined by

a_{p,q} = \begin{cases} w_{p,q}, & (U_p, U_q) \in E,\ p \neq q \\ 0, & p = q \end{cases}   (1)

where w_{p,q} is the weight of the edge between nodes U_p and U_q. If two HVs have common elements, the two nodes are connected by a weighted edge. In our paper, the edges of the graph are weighted according to the similarity of the two nodes. For a general graph, node similarity is usually defined through common neighborhoods (Steinhaeuser and Chawla, 2008). For a node U_p, let G(U_p) denote the set of neighbors of U_p in U. Jaccard's coefficient (JC), a commonly used similarity metric in information retrieval, measures the probability that both x and y have a randomly selected feature f that either x or y has. With the nodes' neighbors considered as the features here, Jaccard's coefficient is defined by

JC(U_p, U_q) = \frac{|G(U_p) \cap G(U_q)|}{|G(U_p) \cup G(U_q)|}   (2)

where |·| denotes the number of non-zero elements in a set, and U_p and U_q are two nodes (sub-biclusters) in G(U, E). Jaccard's coefficient measures the similarity between two nodes based on the idea that two nodes U_p and U_q are more likely to form a link if their sets of neighbors G(U_p) and G(U_q) have a large overlap. Since we want to combine similar sub-biclusters sharing as many common elements as possible, HV nodes that have more common elements in their HVs should also show a larger probability of forming an edge. Thus, we modify this measure by multiplying it by a coefficient t = |U_p ∩ U_q| / L(U), where L(U) is the length of the node vector U. Finally, the HV node similarity score is defined as

score(U_p, U_q) = \frac{|U_p \cap U_q|}{L(U)} \cdot \frac{|G(U_p) \cap G(U_q)|}{|G(U_p) \cup G(U_q)|}   (3)

Thus, the adjacency matrix can be built with this similarity score as the edge weight, w_{p,q} = score(U_p, U_q).
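The following Python sketch (an assumption-laden illustration, not the paper's implementation) shows one way to turn a set of Hough vectors into the weighted adjacency matrix of Eqs. (1)–(3); the helper names hv_similarity_score and build_adjacency are ours.

```python
import numpy as np

def hv_similarity_score(hv_p, hv_q, neighbors_p, neighbors_q):
    """Sketch of the edge weight of Eq. (3): Jaccard's coefficient on the
    neighbour sets, scaled by the fraction of rows the two HVs share."""
    common_rows = np.sum((hv_p == 1) & (hv_q == 1))
    t = common_rows / len(hv_p)                    # t = |U_p ∩ U_q| / L(U)
    union = len(neighbors_p | neighbors_q)
    jc = len(neighbors_p & neighbors_q) / union if union else 0.0
    return t * jc

def build_adjacency(hvs):
    """Build the weighted HV graph: nodes are HVs, and an edge is created
    whenever two HVs share at least one common element (Eq. (1))."""
    n = len(hvs)
    # A node's neighbours: all HVs that share at least one common row with it.
    neighbors = [
        {q for q in range(n) if q != p and np.any((hvs[p] == 1) & (hvs[q] == 1))}
        for p in range(n)
    ]
    A = np.zeros((n, n))
    for p in range(n):
        for q in range(p + 1, n):
            if q in neighbors[p]:
                A[p, q] = A[q, p] = hv_similarity_score(
                    hvs[p], hvs[q], neighbors[p], neighbors[q])
    return A
```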
2.7. Edge pruning

A threshold can be set on the edge weights of the HV graph: if the value of an edge weight is smaller than the threshold, it is set to zero. In GSGBC, we prune the edge between two nodes if JC < th_jc. Besides, a threshold on t can be derived from the minimum row size (RB_min) of a bicluster, so the threshold th_t is given by th_t = RB_min / L(U). Fig. 4 shows an example arrangement of HVs. By computing the JC measure as well as the coefficient t between each pair of nodes represented by the HVs in the Hough array shown in Fig. 4(a), the original relation graph can be obtained. With RB_min = 2 and th_jc = 0.3, the final pruned relation graph is computed and shown in Fig. 4(b). The adjacency matrix shown in Fig. 4(c) can then be built. The HV graph and the adjacency matrix are constructed differently from the reaction graph described in (Zhou and Deng, 1984; Chou, 1989, 1990). For the weighted HV graph, the element a_{ij} of the adjacency matrix represents the edge weight between nodes U_i and U_j; the weight, between 0 and 1, is defined according to the node similarity, so the larger the weight, the more similar the two nodes. The reaction graph, in contrast, is unweighted, so its adjacency matrix contains only 1s and 0s, and the matrix deduced from it may contain negative values in the light of Chou's graphical rules. For our task, we build the HV graph in order to find node groups for sub-bicluster combination.

2.8. The graph spectrum

Consider the task of finding node groups in a weighted graph G of n nodes. We can utilize the spectrum of the graph, which provides a natural cutting of its nodes (Sarkar and Boyer, 1996). First, we represent a natural cutting of the graph G(U, E) by H node groups {V_1, V_2, ..., V_H}; any node group V_i can be represented by a column vector a whose kth entry captures the participation of node k in that group. If a node does not belong to the group, the corresponding entry is zero. We can write the vector a referring to node group V_i in the following form:
a_{pi} = \begin{cases} f, & f > 0,\ U_p \in V_i \\ 0, & U_p \notin V_i \end{cases}   (4)
Therefore, the group vector a is composed of non-negative values. A constraint a^T a = 1 is also imposed on the norm of the group vector. A measure of the cohesiveness of the nodes in a group can then be defined based on the edge weights of the graph. We define the cohesiveness measure of the nodes in group V_i as

CM = \sum_{p=1}^{m} \sum_{q=1}^{m} w_{p,q}\, a_{pi}\, a_{qi} = a^T A a   (5)

where w_{p,q} is the edge weight between two nodes U_p and U_q belonging to group V_i. Since the non-participating nodes of a group correspond to zero entries of the column vector a, only a sub-block of the matrix A contributes to this sum. The better connected the nodes in the same group are, the larger the value of the group cohesiveness measure. The group V_max with maximal cohesiveness can be found by maximizing Eq. (5). The Rayleigh–Ritz theorem (Horn and Johnson, 1990) states that

\lambda_{max} = \max_{a^T a = 1} a^T A a = \max_a \frac{a^T A a}{a^T a}   (6)

where A is a Hermitian matrix, λ_max is the maximum eigenvalue of A and a is the corresponding eigenvector. According to this theorem, the largest value of Eq. (5) is λ_max, the maximum eigenvalue of A, and the corresponding eigenvector is the optimal group vector a, in which the nodes are best connected with each other. From the definition of the vector a, we know that all its elements are non-negative. Considering the normalization of eigenvectors, we should consider not only the eigenvectors with all non-negative elements but also those with all non-positive elements. Thus, all the nodes are naturally partitioned into H groups corresponding to eigenvectors whose elements are all non-negative or all non-positive. A node group with a better cohesiveness value corresponds to an eigenvector with a larger eigenvalue; the eigenvalue decreases when the connections between nodes become sparser, so a larger eigenvalue means greater coherence of the corresponding node group. We will have N eigenvalues in non-decreasing order {λ_1, λ_2, ..., λ_N}, referred to as the spectrum of the graph. The Courant–Fischer theorem (Horn and Johnson, 1990), which generalizes the Rayleigh–Ritz theorem, gives an interpretation of all the eigenvalues and eigenvectors. Let {a_{λ_N}, a_{λ_{N−1}}, ..., a_{λ_{N−k+1}}} denote the set of eigenvectors corresponding to the k largest eigenvalues (λ_N, λ_{N−1}, ..., λ_{N−k+1}), respectively; then

\lambda_{N-k} = \max_{a \perp \{a_{\lambda_N}, \ldots, a_{\lambda_{N-k+1}}\},\ a^T a = 1} a^T A a   (7)

So the eigenvector of each eigenvalue is orthogonal to the eigenvectors of all larger eigenvalues. Thus, in practical terms, the orthogonality constraint can be used to define disjoint groups.

Fig. 4. An example of Hough vector (HV) graph construction. (a) The Hough array (HA). (b) The corresponding graph of the HA. (c) The adjacency matrix of the HV graph.

2.9. The eigengroup
The spectrum of a graph provides a natural way to group its nodes. From the physical interpretation of the graph spectrum, the non-negative/non-positive eigenvectors indicate the node groups. However, if the whole pruned graph is connected, there will be only one non-negative/non-positive eigenvector. As we are only concerned with the highly connected part of the graph, the influence of a node within a group can be measured through eigenvector centrality: the magnitudes of the elements of an eigenvector indicate the strength with which the nodes belong to the group. The most important node in a group is the one with the largest node strength, which we denote ST_max. For an eigenvector, nodes whose strength is less than 10% of ST_max are regarded as meaningless nodes of the group, and the corresponding entries of the eigenvector are set to zero. As a result, we obtain a set of required eigenvectors. If these eigenvectors are all non-negative/non-positive, we call them eigengroups. This definition is so strict that there can be no overlapping nodes in different eigengroups, which is ensured by the orthogonality constraint. A strict eigengroup will lose some information for sub-bicluster combination, especially when the original matrix contains overlapping biclusters. Thus, we re-define an eigengroup as an eigenvector in which no fewer than 95% of the components are non-negative (or non-positive). A large absolute value of a component of the eigenvector indicates a dominant node of the eigengroup, and the cohesiveness of the nodes in the eigengroup is determined by the corresponding eigenvalue. The graph spectrum thus provides a natural way of cutting the graph, and since the sub-bicluster matrix can be transformed into a graph with each sub-bicluster corresponding to a node, the graph spectrum can be used to group the sub-biclusters. The sub-biclusters in each eigengroup share more common elements, which decreases the entropy of the sub-bicluster matrix.
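A rough sketch of this eigengroup extraction is shown below (our own reading of the procedure; the function name, the application of the 95% rule to the surviving entries, and the parameter defaults are assumptions).

```python
import numpy as np

def find_eigengroups(A, t=20, nnr=0.95, strength_frac=0.10):
    """Sketch: keep the eigenvectors of the t largest eigenvalues whose
    entries are at least 95% non-negative (or non-positive), after zeroing
    entries below 10% of the maximum node strength."""
    eigvals, eigvecs = np.linalg.eigh(A)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:t]           # indices of the t largest
    groups = []
    for idx in order:
        v = eigvecs[:, idx].copy()
        v[np.abs(v) < strength_frac * np.abs(v).max()] = 0.0   # drop weak nodes
        active = v[v != 0]
        if active.size == 0:
            continue
        # non-negative (or non-positive) ratio over the surviving entries
        ratio = max(np.mean(active >= 0), np.mean(active <= 0))
        if ratio >= nnr:
            groups.append((eigvals[idx], np.flatnonzero(v)))   # (cohesiveness, node ids)
    return groups
```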
Combining the sub-biclusters in each eigengroup reduces the computational complexity. This process meets one of the two requirements, namely that HVs sharing more common 1s are grouped together. Within each eigengroup, we then use the following combination method. First, we check the column pairs represented by each HV and delete the HVs that share no common column with the other vectors. Then, we combine the remaining HVs into the largest bicluster. The whole combination algorithm is summarized below:

Input:
  HA: the Hough array (HA)
  CB_min: the minimum number of columns in a bicluster
  RB_min: the minimum number of rows in a bicluster
  t: the number of largest eigenvalues to compute
  th_jc: the minimum value of the JC measure
Output:
  BC(R, C): the output biclusters

1. Compute the similarity between every two columns of the HA to obtain the graph weights.
2. Transform the HA into an adjacency matrix, using th_jc and th_t = RB_min / L(U) to prune the graph edges.
3. Compute the first t eigenvalues and eigenvectors of the adjacency matrix.
4. Search the t eigenvectors of the corresponding eigenvalues to find the eigengroups with a non-negative (or non-positive) ratio NNR ≥ 95%.
5. In each eigengroup, delete the isolated HVs that share no column of the original matrix with the other HVs.
6. Combine the remaining HVs in the same eigengroup EG. For every HV(C_i, C_k) ∈ EG, the index set is indexH = {i, k}.
   Search all HV ∈ {EG − HV(C_i, C_k)} to find an HV(C_i, C_j) that has a common column index with an element of indexH.
   Set HV(C_i, C_j, C_k) = HV(C_i, C_j) ∩ HV(C_i, C_k).
   If |HV(C_i, C_j, C_k)| ≥ RB_min, then insert j into the index set indexH, set HV(C_i, C_j, C_k) as a new HV in EG, and delete HV(C_i, C_k) and HV(C_i, C_j).
   Continue this process until |HV_combine| < RB_min, and set
   BC_initial(R, C) = ∪_{ii ∈ indexH} (C_ii filtered with HV_combine)
7. Filter out the biclusters whose number of columns is less than CB_min. The output BCs are then obtained.
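The sketch below illustrates step 6 for a single eigengroup under our own data layout (parallel lists of row-index sets and column-pair sets); it is a simplified greedy rendering of the merging loop, not the authors' code.

```python
def combine_eigengroup(group_hvs, group_cols, rb_min):
    """Greedily merge the HVs of one eigengroup whose column pairs overlap,
    intersecting their row sets, and stop a merge once the common row set
    would drop below RB_min."""
    rows = [set(r) for r in group_hvs]          # row indices with 1s in each HV
    cols = [set(c) for c in group_cols]         # the column pair of each HV
    merged = True
    while merged and len(rows) > 1:
        merged = False
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                if cols[i] & cols[j]:                       # shared column index
                    common = rows[i] & rows[j]
                    if len(common) >= rb_min:               # still a valid bicluster
                        rows[i], cols[i] = common, cols[i] | cols[j]
                        del rows[j], cols[j]
                        merged = True
                        break
            if merged:
                break
    # each surviving (row set, column set) pair is an initial bicluster
    return list(zip(rows, cols))
```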
In summary, the proposed GSGBC algorithm is composed of two main steps: sub-bicluster detection with the modified HT in 2-D spaces, and the graph spectrum based combination process. The overall GSGBC biclustering algorithm is illustrated in the flow diagram in Fig. 5. GBC considers all the sub-biclusters, which can generate at most 2^N different combinations. In our proposed method, we use the graph spectra based method to group the sub-biclusters and then combine the sub-biclusters within the same eigengroup. The partitioning time complexity is dominated by the computation of the first t eigenvalues, which is about O(|E|th + |U|t² + t³h), where h is the number of iterations required for convergence (Smyth and White, 2005). The running complexity for combining the sub-biclusters in each eigengroup is at most O(|V_max|²), where V_max is the eigengroup with the largest number of nodes.
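For orientation, the two steps can be chained as in the following sketch, which reuses the hypothetical helpers introduced above (modified_ht, build_adjacency, find_eigengroups, combine_eigengroup); parameter handling, edge pruning and filtering are only indicated, not faithfully reproduced.

```python
def gsgbc(data, rb_min, cb_min, t=20, th_jc=0.5, n_bins=101):
    """End-to-end sketch of the GSGBC pipeline under the assumptions of the
    previous snippets; all helper names are ours, not the paper's."""
    import itertools
    import numpy as np

    n_rows, n_cols = data.shape
    hvs, col_pairs = [], []
    for i, j in itertools.combinations(range(n_cols), 2):   # column-pair spaces
        hv = modified_ht(data[:, i], data[:, j], n_bins, rb_min)
        if hv is not None:
            hvs.append(hv)
            col_pairs.append((i, j))

    A = build_adjacency(hvs)                                  # weighted HV graph
    # (edge pruning with th_jc and th_t = rb_min / n_rows would go here)
    biclusters = []
    for _, node_ids in find_eigengroups(A, t=t):              # spectral grouping
        group_hvs = [np.flatnonzero(hvs[k]) for k in node_ids]
        group_cols = [col_pairs[k] for k in node_ids]
        biclusters += combine_eigengroup(group_hvs, group_cols, rb_min)

    return [(r, c) for r, c in biclusters if len(c) >= cb_min]  # column filter
```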
3. The validation method

3.1. The matching score

To validate the results obtained from the different algorithms, a matching score method is proposed that considers both the true biclusters and the false ones. We represent a bicluster by B(X, Y), with X representing the row index set and Y the column index set. If B_1(X_1, Y_1) is the ground truth and B_2(X_2, Y_2) are the detected biclusters, then we can divide B_2(X_2, Y_2) into two parts: B_f(X_f, Y_f), sharing no common parts with the ground truth, and B_r(X_r, Y_r), sharing some common parts with the ground truth.
Fig. 5. Flow diagram of the GSGBC algorithm: original matrix → column-pair sub-matrices → modified HT on the sub-matrices → HA comprised of HVs representing sub-biclusters → building the adjacency matrix of the HVs in the HA → computing the spectrum of the adjacency matrix to obtain the eigengroups → sub-bicluster combination in each eigengroup → bicluster filtering → output biclusters.
The matching score is then given by

S_{XY}(B_1, B_2) = \frac{1}{|B_1|} \sum_{(X_1, Y_1) \in B_1} \max_{(X_r, Y_r) \in B_r} \frac{|X_1 \cap X_r| + |Y_1 \cap Y_r|}{|X_1 \cup X_r \cup X_f| + |Y_1 \cup Y_r \cup Y_f|}   (8)
With B_1 and B_2 representing the true and detected biclusters, we can obtain the matching score for each algorithm. However, this equation can only be used when the ground truth is given. For biological data we do not know the exact biclusters, so we validate the experimental results through a web tool for functional analysis: we use DAVID (Da Wei Huang and Lempicki, 2008; Huang et al., 2009) to analyze the genes in the detected biclusters.
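A direct transcription of Eq. (8) into Python is sketched below (our own reading; in particular, the way X_f and Y_f are pooled over all non-matching detected biclusters is an assumption).

```python
def matching_score(true_biclusters, detected_biclusters):
    """Sketch of the matching score of Eq. (8). Biclusters are (rows, cols)
    pairs of sets; B_f collects detected biclusters sharing nothing with any
    ground-truth bicluster, B_r the remaining ones."""
    b_r, x_f, y_f = [], set(), set()
    for rows, cols in detected_biclusters:
        overlaps = any(rows & tr or cols & tc for tr, tc in true_biclusters)
        if overlaps:
            b_r.append((rows, cols))
        else:
            x_f |= rows
            y_f |= cols

    total = 0.0
    for x1, y1 in true_biclusters:
        best = 0.0
        for xr, yr in b_r:
            num = len(x1 & xr) + len(y1 & yr)
            den = len(x1 | xr | x_f) + len(y1 | yr | y_f)
            best = max(best, num / den if den else 0.0)
        total += best
    return total / len(true_biclusters)
```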
3.2. The entropy measure

The entropy measure was previously used to validate partitions of a dynamical system (Wang and Yan, 2011). We now utilize entropy to measure the irregularity of the HVs in the whole HA as well as in the eigengroups. A reduction of the entropy measure means a higher consistency of the eigengroups. The entropy of the whole HA is given by

E = -\sum_{i=1}^{L} P(w_i) \log_2 P(w_i)   (9)

where w_1, w_2, w_3, ..., w_L represent the typical sub-bicluster patterns that we are finally able to identify. The total entropy of the HVs in the different eigengroups is given by

E_{cut} = -\sum_{j=1}^{K} P(g_j) \sum_{i=1}^{L} P(w_i \mid g_j) \log_2 P(w_i \mid g_j)   (10)
where g_1, g_2, ..., g_K are the eigengroups. By comparing the entropy of the HVs in the HA with the total entropy of the HVs in all the eigengroups, we can determine whether the HVs in each eigengroup have a higher coherence for combination.
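The two entropy measures can be computed as in the following sketch (our own code; the pattern labels w_i and group labels g_j are assumed to be given as integer arrays).

```python
import numpy as np

def pattern_entropy(pattern_labels):
    """Entropy of Eq. (9) over the sub-bicluster pattern labels w_1..w_L
    assigned to the HVs."""
    _, counts = np.unique(pattern_labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def grouped_entropy(pattern_labels, group_labels):
    """Conditional entropy of Eq. (10): the pattern entropy inside each
    eigengroup g_j, weighted by the group probability P(g_j)."""
    pattern_labels = np.asarray(pattern_labels)
    group_labels = np.asarray(group_labels)
    total = 0.0
    for g in np.unique(group_labels):
        mask = group_labels == g
        total += mask.mean() * pattern_entropy(pattern_labels[mask])
    return total
```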
4. Experiment data

To test the performance of GSGBC, we carried out different experiments on both simulated data and biological data, with matching scores and p-values used for validation. We compare our method with different types of existing algorithms and several recently developed ones. The compared algorithms are as follows: GBC (Zhao et al., 2008), belonging to geometric models; qubic (Li et al., 2009), belonging to probabilistic models; Fabia (Hochreiter et al., 2010), belonging to matrix factorization based methods; CC (Cheng and Church, 2000), belonging to distance based methods; and Xmotif (Murali and Kasif, 2002), OPSM (Ben-Dor et al., 2003), ISA (Bergmann et al., 2003) and Bimax (Prelić et al., 2006), all belonging to coherent evolution methods. The simulated data are generated in order to study the advantages of GSGBC in detecting noisy and overlapping bicluster patterns. To test the performance of the algorithms on noisy data, we embed 10 non-overlapping constant, additive and multiplicative patterns of 10 columns and 10 rows, with Gaussian noise of variance ranging from 0 to 1, into a 100 by 100 matrix. Similarly, to test the effectiveness of the algorithms in resolving overlapped patterns, we embed 10 overlapped additive patterns of 10 columns and 10 rows into a 110 by 110 matrix, with overlapping degrees from 2 to 10 representing the number of rows or columns that overlap. The background data of both datasets are sampled from a uniform distribution U(−5, 5).
Furthermore, another dataset is designed that is much closer to real data, with five additive biclusters and five multiplicative biclusters of different sizes embedded in a 1000 by 100 matrix. The biclusters are distributed randomly along the column and row directions, with the row size chosen randomly from 10 to 200 and the column size chosen randomly from 5 to 25. All the biclusters are corrupted by Gaussian noise of variance 0.1. The background data are sampled from a uniform distribution U(−50, 50). To further study the quality of different biclustering algorithms, we consider real biological data for our experiments: the 'multiple tissue type' dataset obtained from diverse tissues of human cancer samples, containing 102 samples and 5565 genes (Hoshida et al., 2007). Another dataset we use in the experiments is the gene expression data of yeast cells under different stress conditions. The original microarray gene expression data, with 2993 selected genes and 173 stress conditions, can be obtained from the web site http://www.tik.ee.ethz.ch/sop/bimax/. The dataset has been normalized using mean centering (Prelić et al., 2006).
5. Experiment results and analysis

5.1. Parameter selection

We use the entropy measure described in Section 3.2 to choose optimal parameters for the proposed GSGBC in the simulation study. We select the optimal parameters so that the entropy measure of the HVs is reduced most significantly by the sub-bicluster grouping scheme. With different noise levels and overlapping degrees, we compute the average entropy measure for the artificial data as the parameters change. For the constant, additive, multiplicative and overlapped datasets we set RB_min = 6, since there is a sharp decrease in the average entropy (Fig. 6) when RB_min changes from 5 to 6 and the entropy does not change much when RB_min becomes larger. A smaller entropy indicates a higher coherence of the sub-biclusters in each eigengroup, so the computational complexity of the sub-bicluster combination is reduced more significantly for smaller entropy measures. We set th_jc = 0.5 for all four datasets, because this threshold value gives the smallest average entropy (Fig. 7). From Fig. 8, we can see that the average entropy remains the same when t changes from 20 to 50; as the time complexity of the eigenvalue computation grows with t, a small t results in fast computation, so we choose t = 20. For the more general case in which the embedded biclusters are of different types and sizes and the rows and columns are distributed randomly, we set RB_min = 30, th_jc = 0.4 and t = 20, for which the entropy measure is minimized (Fig. 9).
5.2. Comparative study for the artificial data

The matching scores of different algorithms for detecting simulated noisy bicluster patterns are shown in Table 1. For the proposed GSGBC, the parameters are chosen according to the procedures in Section 5.1. In order to obtain the best biclustering results, CB_min = 9 is chosen for the constant, additive, multiplicative and overlapped datasets, and CB_min = 5 is chosen for the dataset designed to simulate real biological data. For the other biclustering algorithms, optimal parameters are set according to their corresponding publications. Theoretically, the matching score should decrease when the noise level increases. However, for constant patterns (Table 1(a)), ISA and GBC show the best results, with matching scores equal to 1 at all noise levels, which demonstrates the higher robustness to noise of these two algorithms
Fig. 6. Average entropy measure of HVs in eigengroups with different values of RBmin for graph spectra based partitioning process. (a) For constant bicluster patterns. (b) For additive bicluster patterns. (c) For multiplicative bicluster patterns. (d) For overlapped bicluster patterns.
Fig. 7. Average entropy measure of HVs in eigengroups with different values of thjc for graph spectra based partitioning process. (a) For constant bicluster patterns. (b) For additive bicluster patterns. (c) For multiplicative bicluster patterns. (d) For overlapped bicluster patterns.
Fig. 8. Average entropy measure of HVs in eigengroups with different values of t for graph spectra based partitioning process. (a) For constant bicluster patterns. (b) For additive bicluster patterns. (c) For multiplicative bicluster patterns. (d) For overlapped bicluster patterns.
for detecting constant biclusters. GSGBC and Fabia also produce good results, with matching scores close to 1. Bimax and qubic perform well on noise-free data; however, their matching scores decrease when noise is added, mainly because the noise blurs the differences between the background and the biclusters, and as a result spurious, small sub-biclusters emerge. Table 2 shows that the
Fig. 9. Average entropy measure of HVs in eigengroups through the graph spectra based partitioning process with different parameter settings for the dataset designed to simulate real biological data. From (a) to (c), the average entropy measure is shown for different values of th_jc, t, and RB_min respectively.
time cost of GSGBC is reduced by nearly 100 times compared with GBC, as the quantization size in the parameter space is reduced from (2n+1)² to 4n+1 cells and the sum of the entropies of the sub-bicluster nodes in the eigengroups is much smaller than the entropy of the entire sub-bicluster matrix. For additive bicluster patterns, the results in Table 1(b) suggest that Fabia does not produce a good result, as the algorithm is based on a multiplicative model. ISA and Bimax also produce poor results compared with their performance for constant bicluster detection, and these algorithms are also much more sensitive to noise. GSGBC and GBC, which outperform the other algorithms in detecting additive bicluster patterns, maintain similar biclustering results when the noise level increases, and GSGBC requires much less computing time than GBC. Overall, the comparison illustrated in Table 2 shows that GSGBC is computationally the most efficient algorithm for additive bicluster extraction. The performance of all algorithms for multiplicative bicluster patterns is shown in Table 1(c). Fabia shows a better result than for additive bicluster detection, while ISA shows a slightly worse result. Bimax shows a rather poor performance for multiplicative pattern detection compared with the constant and additive bicluster datasets. GBC still shows the best performance, while GSGBC shows a slightly worse but comparable result; however, the time cost comparison in Table 2 shows that GSGBC requires much less computing time. For overlapping bicluster pattern detection, the results in Table 1(d) indicate that GBC achieves the best performance, with matching scores equal to 1 at all overlapping degrees. However, the combination speed of GBC is much lower than that of GSGBC, which produces a comparably good result in detecting overlapping patterns. The other algorithms achieve relatively lower matching scores, especially when the overlapping degree increases. Overall, GSGBC is fast and shows good results in detecting different bicluster patterns. Using the modified Hough transform,
GSGBC has low storage requirements, high speed and high robustness for detecting noisy sub-bicluster patterns. The graph spectrum based algorithm optimizes the combination process of GBC and enhances the computational efficiency: GSGBC produces results similar to those of GBC but increases the computing speed by about a hundredfold. To better evaluate the overall performance of the biclustering algorithms, we generated 20 groups of data that are much closer to real ones (as described above), with Gaussian noise of variance 0.1. Table 3 reports the performance of the different biclustering algorithms over the 20 experiments: GBC shows the best result, with the highest mean value and a relatively small variance. GSGBC also obtains a good result, with a mean matching score close to 0.6, but its variance is larger. Fabia also obtains a good result. The other algorithms show smaller variances over the 20 experiments, but their mean matching scores for detecting the embedded biclusters are lower than those of GSGBC, GBC and Fabia. The time cost and entropy comparisons for the two geometrical biclustering algorithms are also presented in Table 2, which shows that GSGBC is much faster than GBC for bicluster detection.

5.3. Experiments with real microarray data

5.3.1. Multiple tissues dataset

To further investigate biclustering performance, we consider the real biological 'multiple tissues' dataset provided by the Broad Institute (Hoshida et al., 2007). In that study, the samples were clustered with the help of additional datasets and the clusters were confirmed by gene set enrichment. Our goal is to study how well different biclustering methods can identify those clusters without any extra datasets; the sample clusters can be used as the ground truth.
Table 1
Matching score result of different algorithms for detecting different types of bicluster patterns.

(a) Constant biclusters
Noise level   0        0.2      0.4      0.6      0.8      1
GSGBC         1.0000   0.9785   0.9440   0.9592   0.9382   0.9092
GBC           1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
Fabia         0.9565   0.9434   0.9174   0.9524   0.8850   0.9009
ISA           1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
Bimax         1.0000   1.0000   0.8414   0.8920   0.7000   0.5190
Qubic         1.0000   0.0951   0.0904   0.1274   0.1324   0.1192
Xmotif        0.0531   0.0619   0.0153   0.0066   0.0046   0.0046
OPSM          0.3310   0.3540   0.3000   0.1940   0.1500   0.2600
CC            0.2896   0.0580   0.0500   0.0730   0.0310   0.0150

(b) Additive biclusters
Noise level   0        0.2      0.4      0.6      0.8      1
GSGBC         1.0000   0.9290   0.8750   0.9170   0.9190   0.9640
GBC           1.0000   1.0000   1.0000   0.9810   0.9670   0.8583
Fabia         0.9000   0.5449   0.3967   0.3467   0.4978   0.2106
ISA           0.9000   0.6494   0.6343   0.5796   0.6993   0.6726
Bimax         0.7000   0.1900   0.1810   0.1800   0.1720   0.2000
Qubic         0.2321   0.1233   0.1258   0.1052   0.1108   0.1246
Xmotif        0.0231   0.0231   0.0229   0.0286   0.0272   0.0066
OPSM          0.0802   0.0701   0.0708   0.0663   0.0408   0.0773
CC            0.0429   0.0303   0.0285   0.0353   0.0189   0.0441

(c) Multiplicative biclusters
Noise level   0        0.2      0.4      0.6      0.8      1
GSGBC         1.0000   0.9700   0.9700   0.9960   0.9680   0.9600
GBC           1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
Fabia         0.7332   0.7504   0.7388   0.6220   0.7337   0.7413
ISA           0.5866   0.5866   0.5866   0.5866   0.5866   0.5866
Bimax         0.0733   0.0733   0.0733   0.0733   0.0733   0.0733
Qubic         0.3626   0.3666   0.3752   0.3570   0.3445   0.3105
Xmotif        0.0764   0.0528   0.0422   0.0537   0.0529   0.0419
OPSM          0.1613   0.1579   0.1413   0.1225   0.1269   0.1107
CC            0.0465   0.0515   0.0558   0.0355   0.0406   0.0434

(d) Overlapping biclusters
Overlapping degree   0        2        4        6        8        10
GSGBC                1.0000   0.9600   0.9847   0.9627   0.9810   0.9199
GBC                  1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
Fabia                0.6708   0.5964   0.6430   0.6125   0.6482   0.5708
ISA                  1.0000   0.5067   0.5092   0.5622   0.4841   0.5848
Bimax                0.7590   0.6017   0.4751   0.3665   0.3157   0.3037
Qubic                0.6542   0.5897   0.5621   0.5622   0.5154   0.5407
Xmotif               0.0484   0.0422   0.0374   0.0345   0.0307   0.0272
OPSM                 0.0622   0.0220   0.0856   0.0220   0.0554   0.0100
CC                   0.1540   0.0554   0.1183   0.0939   0.0359   0.0095
Table 2
Computing time and entropy comparison of the two geometrical biclustering algorithms.

(a) Time cost comparison. The computing time is measured in seconds.
Method   Constant   Additive   Multiplicative   Overlapping   Integrated   Multiple tissues   Yeast
GBC      139.106    255.781    244.736          299.588       303.289      1121.144           3504.026
GSGBC    0.824      1.617      0.659            1.637         7.418        390.465            1185.901

(b) Entropy comparison
Method   Constant   Additive   Multiplicative   Overlapping   Integrated   Multiple tissues   Yeast
GBC      0.612      0.637      0.622            1.668         0.432        0.826              0.462
GSGBC    0.000426   0.0683     0.000377         0.0123        0.155        0.336              0.0132

In Table 4, the sizes of the largest biclusters detected are shown, as well as the column sensitivity, specificity and accuracy, which measure how well the different algorithms can identify the four clusters in the dataset. For example, the largest bicluster detected by Fabia from the known Cluster 1 is 146×45. For the 45 columns, the sensitivity, representing the proportion of actual columns identified by Fabia, is 72%, while the specificity, representing the proportion of true negatives, is 65%. As a result, the accuracy of Fabia in identifying the column clusters is 67%. For Clusters 1, 3 and 4, GSGBC obtains the highest accuracy, while GBC provides the highest accuracy for Cluster 2. However, the bicluster size obtained from GBC is slightly larger than that obtained by GSGBC. Considering the overall performance, GSGBC shows the best result with the
highest accuracy and a relatively large bicluster size. Compared with GBC, again GSGBC is much faster.
5.3.2. Yeast cell microarray data

We also use the yeast cell microarray data described in Section 4 to study the performance of the different biclustering algorithms. For the comparative studies, we choose the first 100 biclusters detected by GSGBC from the yeast cell microarray data. We try to determine whether the set of genes in a bicluster shows significant enrichment with respect to a specific Gene Ontology (GO) annotation. DAVID is a web tool that offers different levels
of GO terms of gene annotations. We are concerned with the GO FAT terms of the biological process category, which filter out the broadest terms so that more specific terms are not overshadowed. We can validate the genes in each bicluster by the p-value (significance level). The p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed when the null hypothesis is true in a statistical significance test (Schervish, 1996). If the p-value is less than the significance level alpha, the null hypothesis is rejected, which means that the result is statistically significant. For each method, we compute the proportion of biclusters in which genes are highly enriched with one or several

Table 3
Matching score result for detecting the mixed bicluster patterns.

Algorithm   Matching score   Variance
GSGBC       0.6272           0.0422
GBC         0.6734           0.0210
Fabia       0.4568           0.0988
ISA         0.1582           0.000257
Bimax       0.0731           0.000122
Qubic       0.1002           0.00110
Xmotifs     0.0192           0.0000208
OPSM        0.2594           0.00231
CC          0.1960           0.00317
GO biological processes at different levels of significance (alpha = 5%, 1%, 0.5%, 0.1%, 0.001%). The histogram in Fig. 10 shows the GO enrichment result, with the x axis representing the biclustering algorithm and the y axis representing the ratio of biclusters in which genes are enriched with a GO biological process. The best results are obtained by GBC, for which all the biclusters detected at the different significance levels are highly enriched with GO biological processes. GSGBC and qubic show slightly worse results. When the significance threshold is least stringent, all the biclusters are functionally enriched; as the threshold becomes more stringent, the proportion of highly enriched biclusters decreases. At significance level alpha = 0.01%, more than 95% of the biclusters detected by GSGBC are functionally enriched, which is higher than for qubic at the same significance level. At significance level alpha = 0.001%, the proportions of highly enriched biclusters for GSGBC and qubic are nearly the same. OPSM, Bimax and ISA also provide a high proportion of functionally enriched biclusters, with a slight advantage of OPSM and Bimax (>90% at a significance level of 5%) over ISA (>80% at a significance level of 5%). Fabia gives a better result at less stringent thresholds; however, its ratio of functionally enriched biclusters drops sharply as the threshold becomes more stringent.
Table 4
The biclustering result for the 'Multiple tissue' data with different algorithms.

(a)
            Cluster 1 (column size: 25)                     Cluster 2 (column size: 26)
Method      Size      Sensitivity  Specificity  Accuracy    Size      Sensitivity  Specificity  Accuracy
GSGBC       61×24     0.96         1            0.99        63×25     0.96         1            0.99
GBC         71×23     0.92         1            0.98        73×26     1            1            1
Fabia       146×45    0.72         0.65         0.67        146×45    1            0.75         0.81
ISA         111×3     0.12         1            0.78        241×6     0.23         1            0.8
OPSM        869×5     0.08         0.96         0.75        15×26     0.42         0.8          0.71
CC          49×39     0.2          0.56         0.47        81×6      0.23         1            0.8
Bimax       55×5      0.2          1            0.8         24×22     0.85         1            0.96
xMotif      14×6      0.2          0.99         0.79        1075×6    0.19         0.99         0.78
qubic       60×55     0.8          0.55         0.61        50×67     0.92         0.43         0.56

(b)
            Cluster 3 (column size: 28)                     Cluster 4 (column size: 23)
Method      Size      Sensitivity  Specificity  Accuracy    Size      Sensitivity  Specificity  Accuracy
GSGBC       30×23     0.82         1            0.95        68×23     1            1            1
GBC         41×22     0.79         1            0.94        86×22     0.96         1            0.99
Fabia       140×38    1            0.86         0.91        82×36     1            0.84         0.87
ISA         275×6     0.21         1            0.78        241×2     0.09         1            0.79
OPSM        18×25     0.32         0.78         0.66        157×8     0.17         0.95         0.77
CC          38×23     0.14         0.74         0.58        38×23     0.26         0.78         0.67
Bimax       13×15     0.14         0.88         0.68        14×16     0.7          1            0.93
xMotif      133×6     0.18         0.99         0.76        237×6     0.22         0.99         0.81
qubic       64×50     0.25         0.42         0.37        84×44     0.83         0.68         0.72
Fig. 10. GO enrichment analysis for the ‘Saccharomyces cerevisiae’ biclusters obtained from different algorithms.
In contrast, CC and xMotifs give the worst results, with a rather low proportion of highly enriched biclusters; even at the least stringent significance level, alpha = 5%, CC and xMotifs detect less than 40% functionally enriched biclusters. Overall, from the experimental results on both the artificial data and the biological data, GBC shows the best result in most cases but at a relatively low speed. Our proposed GSGBC, which aims to optimize the biclustering process, shows a slightly worse but comparable result to GBC under different conditions. Besides, it greatly improves the sub-bicluster detection as well as the combination process, so that the time cost is reduced significantly.

6. Conclusion

In geometric biclustering, we can extract bicluster patterns as linear structures in high dimensional data spaces. The GSGBC algorithm proposed in this paper uses a modified HT to detect sub-biclusters in column-pair spaces. The Hough vector and Hough array are defined to represent the sub-biclusters. The graph spectrum based combination algorithm is carried out by transforming the Hough array into a graph: each Hough vector is a node of the graph, with the similarity between nodes as the edge weight. Graph spectrum theory is used to partition the nodes into different eigengroups. Each eigengroup is cohesive, meaning that the Hough vector nodes within it are similar to each other; thus, the total entropy of the eigengroups is much smaller than the entropy of the whole Hough array. Combining the Hough vectors within each eigengroup, instead of over the whole array, reduces the computational complexity. Furthermore, the experimental results for artificial data demonstrate that our GSGBC algorithm achieves results comparable to the original GBC algorithm and better results than other kinds of algorithms. GSGBC can also detect biologically meaningful gene biclusters from microarray data with good results.

Acknowledgment

This work is supported by the Hong Kong Research Grant Council (Project 123809) and City University of Hong Kong (Project 7008094).

References

Amaratunga, D., Cabrera, J., 2004. Exploration and analysis of DNA microarray and protein array data. LibreDigital.
Andraos, J., 2008. Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws—new methods based on directed graphs. Can. J. Chem. 86 (4), 342–357.
Althaus, I.W., Chou, J., Gonzales, A., Deibel, M., Chou, K., Kezdy, F., Romero, D., Aristoff, P., Tarpley, W., Reusser, F., 1993a. Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E. J. Biol. Chem. 268 (9), 6119–6124.
Althaus, I.W., Gonzales, A., Chou, J., Romero, D., Deibel, M., Chou, K., Kezdy, F., Resnick, L., Busso, M., So, A., 1993b. The quinoline U-78036 is a potent inhibitor of HIV-1 reverse transcriptase. J. Biol. Chem. 268 (20), 14875–14880.
Althaus, I.W., Chou, J.J., Gonzales, A.J., Deibel, M.R., Chou, K.C., Kezdy, F.J., Romero, D.L., Palmer, J.R., Thomas, R.C., 1993c. Kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-88204E. Biochemistry 32 (26), 6548–6554.
Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z., 2003. Discovering local structure in gene expression data: the order-preserving submatrix problem. J. Comput. Biol. 10 (3–4), 373–384.
Bergmann, S., Ihmels, J., Barkai, N., 2003. Iterative signature algorithm for the analysis of large-scale gene expression data. Phys. Rev. E 67 (3), 031902.
Cvetkovic, D.M., Doob, M., Sachs, H., 1980. Spectra of Graphs—Theory and Applications, vol. 1982. Deutscher Verlag der Wissenschaften, Berlin.
Chou, K.C., 1989. Graphic rules in steady and non-steady state enzyme kinetics. J. Biol. Chem. 264 (20), 12074–12079.
Chou, K.C., 1990. Applications of graph theory to enzyme kinetics and protein folding kinetics: steady and non-steady-state systems. Biophys. Chem. 35 (1), 1–24.
Chou, K., Kezdy, F., Reusser, F., 1994. Kinetics of processive nucleic acid polymerases and nucleases. Anal. Biochem. 221 (2), 217.
Chou, K.C., 2010. Graphic rule for drug metabolism systems. Curr. Drug Metab. 11 (4), 369–378.
Chou, K.C., Lin, W.Z., Xiao, X., 2011. Wenxiang: a web-server for drawing wenxiang diagrams. Nat. Sci. 3 (10), 862–886.
Cheng, Y., Church, G.M., 2000. Biclustering of expression data. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 93–103.
Dhillon, I.S., Mallela, S., Kumar, R., 2003. A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265–1287.
Desper, R., Khan, J., Schäffer, A.A., 2004. Tumor classification using phylogenetic methods on expression data. J. Theor. Biol. 228 (4), 477–496.
Da Wei Huang, B.T.S., Lempicki, R.A., 2008. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4 (1), 44–57.
Gan, X., Liew, A., Yan, H., 2008. Discovering biclusters in gene expression data based on high-dimensional linear geometries. BMC Bioinformatics 9 (1), 209.
Horn, R.A., Johnson, C.R., 1990. Matrix Analysis. Cambridge University Press.
Huang, D.W., Sherman, B.T., Lempicki, R.A., 2009. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37 (1), 1.
Hochreiter, S., Bodenhofer, U., Heusel, M., Mayr, A., Mitterecker, A., Kasim, A., Khamiakova, T., Van Sanden, S., Lin, D., Talloen, W., 2010. FABIA: factor analysis for bicluster acquisition. Bioinformatics 26 (12), 1520–1527.
Hoshida, Y., Brunet, J.P., Tamayo, P., Golub, T.R., Mesirov, J.P., 2007. Subclass mapping: identifying common subtypes in independent disease data sets. PLoS ONE 2 (11), e1195.
Illingworth, J., Kittler, J., 1988. A survey of the Hough transform. Comput. Vision Graphics Image Process. 44 (1), 87–116.
Liu, X., Wang, L., 2007. Computing the maximum similarity bi-clusters of gene expression data. Bioinformatics 23 (1), 50–56.
Liu, W., Li, R., Sun, J.Z., Wang, J., Tsai, J., Wen, W., Kohlmann, A., Mickey Williams, P., 2006. PQN and DQN: algorithms for expression microarrays. J. Theor. Biol. 243 (2), 273–278.
Lazzeroni, L., Owen, A., 2002. Plaid models for gene expression data. Stat. Sinica 12 (1), 61–86.
Lin, S.X., Neet, K.E., 1990. Demonstration of a slow conformational change in liver glucokinase by fluorescence spectroscopy. J. Biol. Chem. 265 (17), 9670–9675.
Li, G., Ma, Q., Tang, H., Paterson, A.H., Xu, Y., 2009. QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res. 37 (15), e101.
Madeira, S.C., Oliveira, A.L., 2004. Biclustering algorithms for biological data analysis: a survey. IEEE Trans. Comput. Biol. Bioinformatics, 24–45.
Maulik, U., Mukhopadhyay, A., Bandyopadhyay, S., 2009. Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes. BMC Bioinformatics 10 (1), 27.
Myers, D., Palmer, G., 1985. Microcomputer tools for steady-state enzyme kinetics. Comput. Appl. Biosci.: CABIOS 1 (2), 105–110.
Murali, T., Kasif, S., 2002. Extracting conserved gene expression motifs from gene expression data, pp. 77–88.
Pascual-Montano, A., Carmona-Saez, P., Chagoyen, M., Tirado, F., Carazo, J., Pascual-Marqui, R., 2006. bioNMF: a versatile tool for non-negative matrix factorization in biology. BMC Bioinformatics 7 (1), 366.
Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E., 2006. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22 (9), 1122–1129.
Rueda, L., Bari, A., Ngom, A., 2008. Clustering time-series gene expression data with unequal time intervals. Trans. Comput. Syst. Biol. X, 100–123.
Sun, X., Hou, Q., Ren, Z., Zhou, K., Guo, B., 2010. Radiance transfer biclustering for real-time all-frequency biscale rendering. IEEE Trans. Vis. Comput. Graph., 64–73.
Steinhaeuser, K., Chawla, N.V., 2008. Community detection in a large real-world social network. Soc. Comput., Behav. Modeling Prediction, 168–175.
Sarkar, S., Boyer, K.L., 1996. Quantitative measures of change based on feature organization: eigenvalues and eigenvectors. IEEE Computer Society, p. 478.
Smyth, S., White, S., 2005. A spectral clustering approach to finding communities in graphs. In: Proceedings of the Fifth SIAM International Conference on Data Mining, pp. 76–84.
Schervish, M.J., 1996. P values: what they are and what they are not. Am. Stat., 203–206.
Wang, H., Wang, W., Yang, J., Yu, P.S., 2002. Clustering by pattern similarity in large datasets. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 394–405.
Wang, D.Z., Yan, H., 2011. Integration of clustering and biclustering procedures for analysis of large DNA microarray datasets. Int. J. Comput. Biol. Drug Design 4 (2), 179–193.
Zhao, H., Liew, A.W.C., Xie, X., Yan, H., 2008. A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. J. Theor. Biol. 251 (2), 264–274.
Zhou, G., Deng, M., 1984. An extension of Chou's graphic rules for deriving enzyme kinetic equations to systems involving parallel reaction pathways. Biochem. J. 222 (1), 169.
Zhou, G.P., 2011a. The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein–protein interaction mechanism. J. Theor. Biol. 284 (1), 142–148.
Zhou, G.P., 2011b. The structural determinations of the leucine zipper coiled-coil domains of the cGMP-dependent protein kinase I and its interaction with the myosin binding subunit of the myosin light chains phosphatase. Protein Peptide Lett. 18 (10), 966–978.