Measuring the quality of linear patterns in biclusters

Measuring the quality of linear patterns in biclusters

Methods 83 (2015) 18–27 Contents lists available at ScienceDirect Methods journal homepage: www.elsevier.com/locate/ymeth Measuring the quality of ...

3MB Sizes 0 Downloads 30 Views

Methods 83 (2015) 18–27

Contents lists available at ScienceDirect

Methods journal homepage: www.elsevier.com/locate/ymeth

Measuring the quality of linear patterns in biclusters Shuhua Chen, Juan Liu ⇑, Tao Zeng School of Computer, Wuhan University, Wuhan, Hubei 430072, China

a r t i c l e

i n f o

Article history: Received 31 January 2015 Accepted 2 April 2015 Available online 15 April 2015 Keywords: Biclustering Gene expression Linear pattern Coherence measurement

a b s t r a c t In microarray analysis, biclustering is used to find the maximal subsets of rows and columns satisfying some coherence criteria. The found submatrices are usually called as biclusters. On one hand, different criteria would help to find different types of biclusters, thus the definition of coherence criterion is critical to the biclustering method. On the other hand, qualitative criteria result to qualitative biclustering methods that cannot evaluate the qualities of the biclusters, while quantitative criteria can numerically show how well the mined biclusters and are more useful in real applications. In bioinformatics communities, there are several quantitative coherence measurements for linear patterns proposed. However, they face the problem of weakness in finding all subtypes of linear patterns or sensitivity to the noise. In this work, we introduce a coherence measurement for the general linear patterns, the minimal mean squared error (MMSE), which is designed to handle the evaluation of biclusters with shifting, scaling and the general linear (the mixed form of shifting and scaling) correlations. The experiments on synthetic and real data sets show that the proposed methods is appropriate for identifying significant general linear biclusters. Ó 2015 Elsevier Inc. All rights reserved.

1. Introduction Nowadays, microarray techniques play an important role in biological research. The data are usually transformed into a numerical matrix to be analyzed, in which rows refer to genes and columns represents experimental conditions. genes are not necessarily coherently expressed on all conditions, instead they might be coexpressed only on a small subset of conditions such as cells [1] or samples from the same disease subtype [2]. In such case, biclusters are biologically and clinically interesting. Therefore, biclustering is crucial on finding gene subsets (rows) showing similar expression behaviors on subsets of conditions (columns) and many methods have been proposed in literature for gene expression data biclustering. To do biclustering, one first needs to determine a criterion to judge whether a submatrix can be regarded as a bicluster, then he should design an efficient searching algorithm to find out as many as possible maximal submatrices that satisfy the criterion. There are two kinds of assessment criteria: qualitative and quantitative, accordingly the biclutering methods can be classified as qualitative and quantitative ones. Since the qualitative biclustering methods cannot numerically describe how well the mined biclusters, the quantitative ones are more informative to real applications. Moreover, according to the found pattern types, ⇑ Corresponding author. E-mail address: [email protected] (J. Liu). http://dx.doi.org/10.1016/j.ymeth.2015.04.005 1046-2023/Ó 2015 Elsevier Inc. All rights reserved.

biclustering algorithms can also be categorized as nonlinear and linear coherent ones. Various techniques have been proposed to identify nonlinear coherent bicluster patterns from gene expression data: (1) two-way hierarchical clustering method (HCL) [3]; (2) the Bayesian-based biclustering (BBC) [4] that are based on a rigorous statical model; (3) the qualitative biclustering algorithm (QUBIC) that can search biclusters in a general form [5]; (4) the iterative signature algorithm (ISA) [6]; (5) the order preserving submatrix algorithm (OPSM) [7]; (6) the xMotif method [8]; (7) Samba method [9]; and (8) the gene shaving (GSH) method [10]. Though originally targeted by non-linear methods [9,7,6,8,11,12], linear measures are still informative in evaluating the quality of discovered biclusters [13]. Biclustering is first introduced by Cheng and Church [14]. They use the Mean Squared Residue (MSR) to measure the shifting patterns in biclusters. The MSR measure has been widely followed by other methods as a coherence measure to identify linear biclusters from gene expression data [14–20]. The innovative idea of MSR is based on the variability of the genes’ expression neighborhood with regard to their arithmetic mean—if all the elements in a submatrix are similar, then the mean squared residue is small. However, the MSR-based biclustering methods [14,21] suffer from a major limitation, pointed out by Aguilar-Ruiz [16], that it can only help to identify biclusters exhibiting shifting patterns whereas linear biclusters include not only shifting, but also scaling and even more general linear patterns. Since the scaling or linear patterns are also very important to the biological functions, and

19

S. Chen et al. / Methods 83 (2015) 18–27

have been convinced by many researches related to gene regulatory pathways [15,22], as well as different kinds of cancers and tumors [23,24] (especially the negative correlated linear patterns), developing biclustering methods for the linear patterns is of significance. In order to face the challenge of searching scaling and linear patterns, a few methods have been proposed. For example, the pcluster and d-cluster methods [25] search scaling patterns indirectly by using logarithm transformation to convert them into normal shifting patterns, they impose a strict assumption on the model thus cannot deal with general linear patterns. Anirban et al. proposed Scaling Mean Squared Residue (SMSR) to specially measure scaling patterns [26], however, this measurement cannot correctly evaluate the negative scaling patterns. Teng et al. used the Average Correlation Value (ACV) to evaluate the quality of biclusters [27], which is not a rigid measurement and sensitive to noise though it can correctly identify most approximate perfect scaling patterns. Pontes et al. suggested to use virtual errors (VE) by computing the difference between the standardized gene expression values and the expected pattern values, to identify some interesting biclusters which are ignored by MSR [28]. However, this measurement still has no ability to perfectly evaluate the negative correlated patterns. Ayadi et al. proposed Average Spearman’s rho (ASR) based on Spearman’s rank correlation [29] to overcome the weakness of ACV, but it fails to measure the negative correlated patterns. Flores et al. proposed Spearman’s biclustering measure(SBM) using absolute value to detect the negative correlated patterns [20]. However, it fails to distinguish a perfect linear pattern from a non-linear pattern with same ranking trends. Nepomuceno et al. have tried to minimize the mean squared residue by nonlinear optimization techniques [30]. Unfortunately, they failed to find the negative correlations due to the constraints on the parameters. Gu et al. proposed a statistical model to describe different kinds of gene expression patterns, but it puts constraints on the overlap among biclusters [4]. From above we can see that there are two common but critical problems to be solved in the above methods: (i) they lack a unified rigid ranking measurement to evaluate the scaling/linear patterns; (ii) they cannot detect a special kind of patterns where negative correlations are involved. Although some newly developed qualitative methods have the ability of detecting the scaling patterns [31,5], even with the negative correlations [5], how to quantitatively measure scaling/linear patterns is still an open question. With these regards, we try to define a generalized, unified measurement not only for shifting patterns, but also for positively and negatively scaling patterns, or even general linear patterns. In this work, we define minimal mean squared error (MMSE) as a coherence measurement to identify biclusters with shifting, scaling, and general linear patterns. By the validation experiments, we show the superiority of our proposed MMSE as compared to other measurements on the measurement of general linear patterns. The experimental results also show MMSE-based biclustering has competitive performance than other traditional clustering and non MSR-based biclustering methods because its found MMSE-biclusters include most numerical and biological significant linear patterns undetected by other methods. 2. Problem definition Through out the paper, we denote a gene expression profiling (or microarray) data set as a triplet M ¼ ðG; C; lÞ, where G ¼ fg 1 ; g 2 ; . . . ; g n g is a set of genes (rows), C ¼ fc1 ; c2 ; . . . ; cm g is a set of conditions (columns), and l : G  C ) R is the level function by which lðg i ; cj Þ represents the expression level of gene g i on

condition cj . Or simply, a gene microarray data set is represented as a matrix below if i stands for gene g i , column j for condition cj , where li;j is short for lðg i ; cj Þ,

0

1 l1;m l2;m C C C .. C . A

l1;1 B l2;1 B M¼B B .. @.

l1;2 l2;2 .. .

... ... .. .

ln;1

ln;2

. . . ln;m

Definition 1 (bicluster). Given a gene expression data set M ¼ ðG; C; lÞ, let B ¼ ðI; J; dÞ be a triplet, where I # G; J # C, and d:IJ )R is the level function of B satisfying dðg i ; cj Þ ¼ lðg i ; cj Þ; 8ðg i ; cj Þ 2 I  J. B is a bicluster iff genes in I exhibit a coherent expression behavior (correlated with each other following a specific pattern) across all the conditions in J. Now that we only consider how to measure the coherence of B instead of the whole gene expression data M, for simplicity, we abbreviate dðg i ; cj Þ as di;j ; g i 2 I as i 2 I, and cj 2 J as j 2 J without confusion hereafter (gene g i in ith row, and condition cj in jth column). Thus, the triplet B ¼ ðI; J; dÞ is equivalent to a submatrix of M:

d1;1

d1;2

. . . d1;jJj

1

B d2;1 B B¼B B .. @.

d2;2 .. .

... .. .

d2;jJj .. .

C C C C A

djIj;1

djIj;2

. . . djIj;jJj

0

Definition 2 (shifting pattern). A bicluster B ¼ ðI; J; dÞ is said to exhibit a shifting pattern if all of its elements di;j satisfy the condition:

di;j ¼ pj þ bi

ð1Þ

where pj is the base value of the jth column, and bi is the shifting factor for the ith row. Definition 3 (scaling pattern). A bicluster B ¼ ðI; J; dÞ is said to exhibit a scaling pattern if all of its elements di;j satisfy the condition:

di;j ¼ ai pj

ð2Þ

where pj is the base value of the jth column, and ai is the scaling factor for the ith row. To integrate the ideas behind both shifting and scaling patterns, we define linear patterns: Definition 4 (linear pattern). A bicluster B ¼ ðI; J; dÞ is said to exhibit a linear pattern if all of its elements di;j satisfy the condition:

di;j ¼ ai pj þ bi

ð3Þ

where pj is the base value of the jth column, ai and bi are the scaling and shifting factors for the ith row respectively. We also call p ¼ fp1 ; p2 ; . . . ; pjJj g the base vector, a ¼ fa1 ; a2 ; . . . ; ajIj g the scaling vector, and b ¼ fb1 ; b2 ; . . . ; bjIj g the shifting vector of the linear pattern. Shifting, scaling and linear patterns are all of biological interests. A shifting pattern of a subset of genes reveals that the genes express towards to the same trend, though the curves shift to each

20

S. Chen et al. / Methods 83 (2015) 18–27

other with some constants [14,32,33]. A subset of genes exhibiting a scaling or linear pattern are closely related to gene regulatory pathways [15,22] where genes show the same behavior with regard to the regulation, but not with the same energy, i.e. changes are more abrupt for one gene than for the others [13]. Computationally, identification of linear biclusters with different patterns is technically difficult, just as discussed briefly in [13,19,34].

3.1. Calculation of MMSE for multiple patterns Let B ¼ ðI; J; dÞ be a submatrix of a gene expression data set. Let P, denoted below, exhibit a perfect linear pattern with exactly the same size as B:

a1 p1 þ b1 B B a2 p1 þ b2

a1 p2 þ b1 a2 p2 þ b2 .. .

ajIj p1 þ bjIj ajIj p2 þ bjIj

. . . a1 pjJj þ b1

1

C . . . a2 pjJj þ b2 C C C .. .. C . . A . . . ajIj pjJj þ bjIj

We first introduce an error score for every entry of B as follows:

erri;j ¼ di;j  ðai pj þ bi Þ The mean of squared errors (MSE) over the entire submatrix is then defined as:

MSEðI; J; dÞ ¼

a;p;b

X

1 X 2 err jIjjJj i2I;j2J i;j

Thus, we can define the Minimal Mean Square Error (MMSE) between B and P as:

MMSEðBÞ ¼ MMSEðI; J; dÞ ¼ minðMSEðI; J; dÞÞ a;p;b ! X 1 2 ðai pj þ bi  di;j Þ ¼ min a;p;b jIjjJj i2I;j2J where p; a, and b are the base vector, scaling vector and shifting vector respectively. The objective of this definition is to use three vectors ða; p; bÞ as constraints to minimize the MSE between B and P such that the submatrix B can match to some perfect linear pattern P as much as possible. Thus, computing MMSE is in fact a problem of parametric optimization. Please note that we use MMSEðBÞ and MMSEðI; J; dÞ alternatively in the following sections.

Fig. 1. Workflow of biclustering tasks for microarray data analysis.

ðai pj þ bi  di;j Þ

2

ð4Þ

i2I;j2J

OLPM has two special cases: one for scaling patterns, and the other for shifting patterns: a;p

In microarray data analysis tasks, Fig. 1 shows a automated workflow for achieving good quality biclusters, which indicate that gene expression data reveal certain interesting patterns. Here we use MMSE as measurement of biclusters (submatrices). In this paper, we mainly focus on the method of the coherence measurement of the certain biclusters rather than the total process.

B P ¼ B. B .. @

MMSEðBÞlinear ¼ min

X

MMSEðBÞscaling ¼ min

3. Methods

0

As jIj and jJj are two constants for a specific matrix, we can omit them and redefine the essential part of the measure. We call the resulting model optimal linear pattern model (OLPM), denoted as:

MMSEðBÞshifting ¼ min p;b

ðai pj  di;j Þ

2

ð5Þ

i2I;j2J

X

2

ðpj þ bi  di;j Þ

ð6Þ

i2I;j2J

They are called optimal scaling pattern model(OSCPM) and optimal shifting pattern model (OSPM), respectively. We note that if ða ; p ; b Þ is an optimal solution to OLPM, then   a ðjja jj ; p jja jj; b Þ (with unit scaling vector), and ða jjp jj; jjpp jj ; b Þ (with unit base vector), are also optimal solutions. All of the optimal solutions guarantee the same MMSE scores for the submatrix. We call solutions with unit scaling or base vectors as identity factor solutions. Henceforth, we will only consider the case of identity factor solutions. 3.1.1. Calculating MMSE for a shifting pattern: equivalent to MSR calculation First, we explore the calculation of MMSEðBÞshifting . It is interesting to find that biclusters with shifting patterns under our MMSE measure are exactly biclusters under MSR measure according to Theorem 1. Theorem 1. Let B ¼ ðI; J; dÞ be a submatrix of a gene expression data set, then MMSEðBÞshifting ¼ jIjjJjMSRðI; J; dÞ. We omit the proof of Theorem 1 for the limit of the page (all the proofs will also be omitted hereinafter). This theorem indicates that MSR is only equivalent to MMSEshifting . 3.1.2. Calculating MMSE for a scaling pattern Directly computing MMSEðI; J; dÞscaling is a complicated process, as there are two sets of parameters to be optimized, i.e. the scaling and base vectors. Fortunately, we can simplify the computation by

(A)

(B)

(C)

(D)

(E)

(F)

Fig. 2. Cases used for coherence measurements comparison. (A) Perfect constant bicluster, (B) perfect shifting pattern, (C) perfect scaling pattern (positive), (D) perfect linear pattern (positive), (E) perfect scaling pattern (negative), (F) perfect linear pattern (negative).

S. Chen et al. / Methods 83 (2015) 18–27

using Eqs. (7) or (8) according to Lemma 1. The detailed procedures are presented in the Supplementary File. Lemma 1. Let B ¼ ðI; J; dÞ be a submatrix of a gene expression data, then

0

X

MMSEðBÞscaling ¼ min @ jjajj2 ¼1

2 di;j



i2I;j2J

!2 1 ai di;j A

ð7Þ

!2 1 pj di;j A

ð8Þ

X X j2J

i2I

and

0 MMSEðBÞscaling ¼ min @ jjpjj2 ¼1

X

i2I;j2J

2 di;j



X X i2I

j2J

where a or p is an unit vector with Euclidean distance of 1 to the origin. Since (7) and (8) are equivalent and both are with one set of parameters, we can choose one with fewer parameters for computational efficiency, i.e. if jIj < jJj, we choose (7), otherwise, (8). Consequently, we can derive out Theorem 2 to compute MMSEscaling without parameters. The proof is presented in the Supplementary File. Theorem 2. Let B ¼ ðI; J; dÞ be a submatrix of the gene expression data, then the calculation of MMSEðI; J; dÞscaling is finalized as either Eq. (9) or (10), corresponding to models 7 and 8 respectively.

MMSEðBÞscaling ¼

X

2

ð9Þ

2

ð10Þ

di;j  kmax ðBBT Þ

i2I;j2J

MMSEðBÞscaling ¼

X

di;j  kmax ðBT BÞ

i2I;j2J

where kmax ðBBT Þ and kmax ðBT BÞ are the largest absolute values of eigenvalues of matrices BBT and BT B respectively. 3.1.3. Calculating MMSE for a linear pattern In Section 3.1.2, we have shown how to compute MMSE score for a scaling pattern. That is to say, given a submatrix B, if MMSEðBÞscaling is low enough, then it is very likely to have a scaling pattern. However, B is not always expected to show a scaling pattern, but rather a shifting or even more general linear pattern. So we should config how to compute MMSEðBÞlinear as well. In this section, we address the problem of computing MMSEðBÞlinear . Theorem 3. Let B ¼ ðI; J; dÞ be a submatrix of the gene expression 0 0 data. Assume B0 ¼ ðI; J; d Þ, where di;j ¼ di;j  di;J , then

MMSEðBÞlinear ¼ MMSEðB0 Þscaling ¼ min a;p

X

2

ðai pj  ðdi;j  di;J ÞÞ

i2I;j2J

ð11Þ Theorem 3 indicates that the linear pattern of B ¼ ðI; J; dÞ can be 0 transformed into the scaling pattern of B0 ¼ ðI; J; d Þ despite of the shifting factors. Also, the calculation of MMSE for a linear pattern can be determined by the MMSE for a scaling pattern by replacing di;j with di;j  di;J in (7) or (8). 3.2. Algorithm for MMSE calculation 3.2.1. Algorithm description The pseudo code for computing MMSE score of a linear pattern is presented in Algorithm 1. MMSE is a general measurement for shifting, scaling and linear patterns, and the former two patterns are just special forms of the linear pattern. So in the identification of a bicluster (with a scaling, shifting or linear pattern), we can

21

simply transform the original submatrix by replacing di;j with di;j  di;J (lines 1–10), and then compute its MMSEscaling value by using Eq. (9) or (10), shorten as MMSE value hereafter. If the submatrix has the least MMSE value, it must contain a significant coherent pattern as a constant, shifting, scaling, or linear pattern. Algorithm 1. Calculating MMSE for a submatrix.

22

S. Chen et al. / Methods 83 (2015) 18–27

3.2.2. Time complexity analysis The time complexity of the algorithm is analyzed as follows: converting matrix B to B0 (lines 1–10) costs OðIJÞ, computing constant all (lines 12–16) costs OðIJÞ, deriving matrix S (lines 17–38) costs OðminðI; JÞIJÞ, and initializing vector X (line 39) needs OðminðI; JÞÞ. In the iterative process of computing the largest absolute eigenvalue of matrix S (lines 40–43), one run of such iteration (lines 41–43) costs OðminðI; JÞ2 Þ. Since the iteration number is only related to the initial value of X and the distribution of eigenvalues, especially the largest two eigenvalues [35], it can safely be regarded as a constant K in terms with the dimension of the matrix. In all, the time complexity of Algorithm 1 is OðminðI; JÞIJÞ. To the best of our knowledge, Algorithm 1 for computing MMSE scores actually has competitive time complexity compared with existed scaling/linear pattern evaluation methods, when J < I which is true at most cases of biclustering gene expression data. For examples, the Hough transform used in linear geometry method needs an OðI2 JÞ sequential time complexity [36] to judge whether a linear pattern exists in one bicluster [37]; and the evaluation of the Average Correlation Value (ACV) [27], Average Spearman’s rho (ASR) [29] or Spearman Biclustering Measure (SBM) [20] costs OðI2 JÞ. Although computing the Virtual Error (VE) [28] only costs OðIJÞ, it cannot help to identify scaling patterns with negative correlations. However, MMSE can find any type of scaling patterns, including the negative correlated ones. 4. Results and discussion In Section 3.1, we have proposed a general coherence measurement MMSE for simultaneously identifying shifting, scaling and linear patterns. To further validate the ability of MMSE on actual biclustering application, we design four kinds of experiments. Moreover, we implement a simple MMSE-based biclustering method by using the CC [14] searching framework. Although there have been many sophisticated searching algorithms proposed in the biclustering researches, which can help to find much more significant patterns by using advanced strategies, the coherence measurement is crucial for discovering different kinds of patterns. Therefore, we pay little attention to the searching method in our current biclustering implementation though it may be unfair to the MMSE-based biclustering method. The first experiment is designed to illustrate MMSE’s capability of measuring different kinds of patterns by comparing with the widely used MSR [14] and other literature measurements that are appropriate for both scaling and shifting patterns. The second experiment is to show how well the MMSE measurement helps the biclustering method to robustly detect different kinds of predefined patterns by comparing the MMSE-based biclustering method with other eight representative methods on detecting different types of patterns embedded in synthesized data sets. Unlike the matrix are just single patterns in first experiments, the random matrix involves the submatrix with a specific pattern. To evaluate the MMSE-based biclustering method on real applications, the third experiment is designed to compare the MMSEbased biclustering method with the above mentioned eight biclustering/clustering methods on real data sets. The last experiment is to investigate the biological and statistical significance of biclusters mined by the MMSE-based biclustering method by case studies. The purposes of all the experiments are to illustrate that MMSE can unbiasedly estimate shifting, scaling and general linear patterns, thus MMSE-based biclustering method can capture some significant biclusters undiscovered by other literature methods. Especially, the linear pattern is revealed to be an important phenomenon in real gene expression profiles, but kept undiscovered

by most of previous biclustering methods, including correlation based methods. However our proposed MMSE can help to find such patterns as easily as shifting or scaling patterns, which is our main contribution compared with other coherence measurements in the biclustering research field. 4.1. Description of experiment process  Input: the gene expression data or other numeric matrix  Output: biclusters (the submatrix which contains the approximate linear patterns, according to the threshold)  Databases: the detailed sources are listed in each experiment below  Software URL: http://biod.whu.edu.cn/newweb/download/ MMSE.zip 4.2. Comparative evaluation of MMSE measure on benchmark single pattern matrices To overcome the weakness of MSR, there have been several measures proposed, such as ACV [27], ASR [29], VE [28] and SMSR [26]. Therefore, we evaluate the abilities of the measures for different kinds linear patterns by computing their scores on different single pattern matrices shown in Fig. 2, respectively corresponding to (A) perfect constant (trivial), (B) shifting, (C) positive scaling, (D) negative scaling, (E) positive linear and (F) negative linear patterns. Among the matrices, (A)–(C) and (E) are also used in [27,29]. We list the scores of different measurements for each pattern in Table 1. Theoretically, the range and the best score of above measurements are: MSR(0 to 1; 0), ACV(0 to 1; 1), ASR(1 to 1; 1 or 1), VE(0 to 1; 0), SMSR(0 to 1; 0) and MMSE(0 to 1; 0). From this table we can see that, MSR is only suitable for constant and shifting patterns, ASR and VE fails to measure negative scaling patterns, SMSR can only perfectly measures constant and scaling patterns, while MMSE and ACV can get the desirable values on all kinds of perfect patterns. To further compare the robustness of ACV and MMSE to noises, we respectively introduce (0,r)-Gaussian noise with r ranging from 0 to 1 with step 0.1 to above matrices, for each value of r, we run 10 times and take the average measure scores over the 10 runs. The results are illustrated in Fig. 3, indicating that MMSE is more robust to noise than ACV. 4.3. Comparative evaluation of MMSE-based biclustering on multiple pattern synthesized data sets We have performed the comparative experiments on the synthesized data sets by choosing other eight representative methods from literature, which are (1) MSR-based biclustering method by Cheng and Church (CC) [14]; (2) the order preserving submatrix algorithm (OPSM) [7]; (3) the iterative signature algorithm (ISA) [6]; (4) the Bayesian biclustering (BBC) [4]; (5) the extensive maximum similarity biclustering method (MSBE) [19]; (6) the

Table 1 The comparison of different coherence measurements on different patterns. Measure

(A)

(B)

(C)

(D)

(E)

(F)

MSR ACV ASR VE SMSR MMSE

0.000 1.000 1.000 0.000 0.000 0.000

0.000 1.000 1.000 0.000 0.089 0.000

0.625 1.000 1.000 0.000 0.000 0.000

0.625 1.000 1.000 0.000 0.021 0.000

3.125 1.000 0.200 1.033 0.000 0.000

3.325 1.000 0.200 0.930 3.458 0.000

23

S. Chen et al. / Methods 83 (2015) 18–27

neighbor biclusters is k (that is, they are overlapped on k rows and k columns). The dimension of every bicluster in Bpopt is ð10þkÞð10þkÞ, thus the dimension of Mp;k is ð100þkÞð100þkÞ. In our experiments, overlapping degree k varies from zero to eight. Therefore, we have generated 36 matrices for one runtime in all. In order to reduce the affection of randomness in synthetic data, above data production and experiment evaluation process have ran ten times and the average performance of each method is reported. (A)

4.3.2. Evaluation methods Let Bopt and B be perfect bicluster set and mined bicluster set respectively. As introduced in [38], the gene match score SG and gene recovery score RG can be computed from Bopt and B:

SG ¼

T 1 X jG1 G2 j S max jBj ðG ;C Þ2BðG2 ;C 2 Þ2Bopt jG1 G2 j 1

RG ¼

X 1 jBopt j ðG ;C Þ2B 1

(B) Fig. 3. Comparison of MMSE and ACV on noises. (A) MMSE performances on noises, (B) ACV performances on noises.

qualitative biclustering algorithm (QUBIC) [5]; (7) a hierarchical clustering (HCL) method and (8) the gene shaving (GSH) method [10]. The software packages of these methods were obtained from BicAT [38], TM4:MeV [3], BBC [4], MSBE [19] and QUBIC [5]. The comparison experiment is about the rediscovery of perfect patterns. It was conducted on a series of controlled synthesized data sets. In each of such data set, several perfect biclusters with a specific pattern are embedded into the random matrix. We tried different methods on the data set independently, and checked their abilities of restore such predefined biclusters, by using the gene match/recovery scores. The generation of the synthesized data and the gene match/recovery scores strictly follow the methods used by [38].

4.3.1. Synthesized data sets Following the method in [38], we generated four kinds of matrices, respectively corresponding to constant (or trivial), shifting, scaling and linear patterns. In each matrix M p;k , where p stands for one of the ‘‘constant’’, ‘‘shifting’’, ‘‘scaling’’ and ‘‘linear’’ patterns and k stands for the overlap degree of two neighbor patterns, 10 pre-defined biclusters Bpopt ¼ fðGi ; C i Þg10 i¼1 (Gi and C i are gene set and condition set for the ith biclusters) with pattern p are distributed along the diagonal, the overlap degree between two

ð12Þ

1

1

T jG1 G2 j S ðG2 ;C 2 Þ2B jG1 G2 j max

opt

ð13Þ

SG measures the matching rate of biclusters in B to perfect biclusters in Bopt , while RG measures the recovery rate of perfect biclusters in Bopt . Both SG and RG vary from zero to one, the larger the scores, the closer the mined biclusters approach to the perfect ones. In our experiments, since we have known the perfect biclusters, we just used gene recovery score to evaluate the bicluster sets obtained from different algorithms. 4.3.3. Results analysis The comparison results among different methods are shown in Fig. 4, which show the ability of restoring predefined biclusters under different conditions. It strongly support our claim that MMSE is a better quantitative measurement than MSR for evaluating biclusters with scaling or linear patterns. Since the frameworks of our MMSE-based biclustering algorithm and MSR-based CC algorithm are similar and MMSE is motivated to overcome the weakness of MSR by extending its ability to detect scaling patterns and even general linear patterns, MMSE-based biclustering is expected to have similar performances with CC algorithm when searching for shifting patterns (Fig. 4(B)), and it is also expected to outperform CC when searching for scaling and linear patterns (Fig. 4(C) and (D)). Please note that this experiment is not for illustrating which method can get the highest gene recovery score on a particular type of patterns, but for comparing the robustness of the methods on all types of linear patterns in biclusters. The robustness indicates that the method can get the stable recovery score on all types of linear pattern in biclusters. In Fig. 4, OPSM, ISA and BBC also shown robustness on different patterns besides our method; GSH performs badly on all patterns (even the worst on constant/shifting

Fig. 4. The gene recovery scores of clustering/biclustering methods under different overlapping degrees.

24

S. Chen et al. / Methods 83 (2015) 18–27

Fig. 5. Comparison results of numbers of significant biclusters with different P-values mined by different methods from Gasch data set.

Fig. 6. Refined comparision results of biclusters with different P-values mined by different methods from Gasch data set. In (A), all the considered biclusters have no more than 50 genes, and in (B), all the considered biclusters have more than 50 genes. Fig. 7. Example of two overlapped biclusters mined by MMSE-based biclustering.

patterns); while the other four methods show strong bias on the patterns. In fact, both OPSM and ISA focus on order-preserving submatrices [7] and transcription modules [6] rather than specific patterns, thus can find different patterns (although performing not very well). The statistical model in BBC is expanded to contain linear patterns [4], so that it can mine different patterns equally. Among four biased methods, CC and MSBE were designed to mine only shifting patterns, they of course can perform better in recovering constant/shifting patterns. However, HCL and QUIBIC were originally proposed to find patterns disregarding specific types therefore they performed very differently on different patterns. We note that, there is a discretization process before the graphbased biclustering in QUBIC, and HCL is indeed a clustering method rather than the biclustering one, therefore their performances are

greatly influenced by the boundaries between the patterns and the background (the area without any patterns). Whereas in the synthesized data, the boundaries between constant/shifting patterns and the background (with random values from 1 to 100 + k) are less significant than those between scaling/linear patterns and the background. That is why HCL and QUBIC show poorer performances on constant/shifting patterns than on scaling/linear patterns. Similar to HCL, GSH is also a clustering method, it iteratively seeds a gene, does clustering by coefficient similarity and determines the cluster boundaries, so it should also be able to detect different patterns. However, it can only recover very small part of the embedded ones due to its higher time complexity.

S. Chen et al. / Methods 83 (2015) 18–27

Anyway, it performs a little better in recovering scaling/linear patterns than it does in constant/shifting patterns with the same reason as HCL. Among the five robust methods, ours consistently reaches high performances on all kinds of linear patterns though we do not consider more sophisticated searching method, which confirms that MMSE is a promising coherence measurement for general linear patterns. Moreover, only MMSE belongs to quantitative one among five robust methods. As it is well known, even the qualitative methods are naturally neither able to distinguish the mined gene clusters of different types of patterns, nor able to evaluate how good the clusters are. However our method can evaluate the bicluster by MMSE score.

25

4.4. Comparative evaluation of MMSE measure on real data sets In order to show the recognizing capabilities of the approach, we performed the comparisons among different biclustering methods on several public real data sets which are commonly used in biclustering studies. The experiments are conducted on three yeast data sets. One is from Gasch et al. [39], which contains 2993 genes and 173 conditions and is commonly used in previous biclustering studies [38,19], the other two are Yeast80 [40] and Ronen [41]. Different experiments on different data sets follow the same framework, so we just show the experiment results on Gasch data set here. Following the systematical comparison methodology of Prelic´ [38], we first ran every candidate method on such data to obtain

Fig. 8. Example of two-way hierarchical dendrogram of one biological significant bicluster mined by MMSE.

26

S. Chen et al. / Methods 83 (2015) 18–27

no more than 100 gene clusters (the top 100 ones). Then we filtered these overlapping gene clusters by guaranteeing the overlapping degree (both genes and conditions) between any two is at most 25%. On the statistical significance, P-value is used to measure the biological functional enrichment for each gene cluster, which means the probability that the gene cluster is said to be enriched for an attribute from random chance. In our experiments, we adopted the gene set from web tool FunSpec [42] to calculate the P-values according to the hypergeometric distribution [42]. This gene set contains many genes’ catalogues from different biological backgrounds (each category contains at least five genes to avoid over-estimation on small gene group). For a gene cluster, we will get several candidate P-values to evaluate the possibilities of such group of genes appearing in some specific categories of biological background occur by chance. And the minimum P-value is considered to be the biological significant measurement of that gene cluster. So it is possible that two gene clusters have similar P-values but they are functional rich in different biological backgrounds. The number of biological significant gene clusters for each method is demonstrated in Fig. 5. For MMSE, all its outcomes have P-values lower than 102 which is the default significant threshold in analysis from GO, MIPS, SGD, and so on. Specifically, the numbers of gene clusters with P-values lower than 103 ; 104 ; 105 , and 106 are about 82, 78, 60 and 40 respectively. From this figure we can see the clustering/biclustering methods are classified into two groups. The first group contains GSH, OPSM, ISA and QUBIC methods, where the absolute numbers of significant biclusters (with P-values less than 106 ) mined by them are small (usually no more than 20), though the percentages of those biclusters are close to 100%. The remains belong to the second group, and they can search more than 70 significant biclusters with P-value less than 103 . Even if we only consider the most significant biclusters with P-values less than 106 , the absolute numbers of such biclusters are still larger than those from the first group. It should be mentioned that, under the assumption of un-overlapping, the number of mined biclusters is related to the sizes of them, i.e., the larger the number of biclusters, the smaller the sizes of biclusters are. However, this is not always true if some extent of overlapping among biclusters is permitted. In fact, whether a biclustering method can find out the overlapped biclusters (with degree less than 25% in our work) or not is usually regarded as an important criterion of evaluating the method too in data mining research community, although we do not mainly address this problem in our work. That is why we can use the absolute number of the significant biclusters to compare different methods.

Moreover, since some subset of a bicluster can also be a bicluster, it is expected that the biclustering method should find the maximal ones. Therefore, the number of large significant biclusters can also be used to evaluate different methods. We show the further comparison results of number of significant biclusters with sizes less than and more than 50 genes in Fig. 6(A) and (B) respectively, from which we can see that MMSE is able to find the larger number of significant biclusters with over 50 genes than most of other methods. Taking Figs. 5 and 6 together, we can draw conclusions that MMSE based biclustering is superior to most of other methods in terms of the number of significant biclusters which is more interesting to the biological scientists, especially for the significant biclusters with large number of genes.

4.5. Case studies on linear biclusters measured by MMSE From the MMSE-biclusters mined from Gasch data set, we chose some to further investigate their biological significance. Fig. 7 uses heatmap to show the overlapped biclusters. In Fig. 7(A), both of two biclusters overlapping on genes (marked by yellow solid rectangles) are with P-values less than 105 thus should be biological significant. Furthermore, this kind of overlapping illustrates the important phenomenon in biological process that the same gene groups (the common genes in both biclusters) may participate in different function modules under different conditions. Fig. 7(B) shows other two significant biclusters (even with P-values less than 1014 ) overlapped on conditions (marked by yellow dashed rectangles). The genes in either of such biclusters are highly coexpressed with each other, while the coherence does not keep true under all conditions, which are often occurred in biological pathways. Although detecting overlapped biclusters is not the capability of MMSE measure itself but rather the searching algorithm, MMSE do can help to find biological significant overlapped biclusters. From Fig. 8 we can clearly see that the genes in the bicluster are grouped into two sets, the one containing 21 genes and the one containing 11 genes, denoted as Gb and Gr respectively. In order to show the correlations of genes within and between sets, we draw their expression profiles (Gb in blue colors and Gr in red colors) under conditions of the bicluster in Fig. 9. From this figure we can see that genes within Gb are significantly positive correlated, even with P-value less than 1014 ; and the genes in Gr are negative correlated expressed, with P-value about 105 . In another word, the coherence pattern of the genes in this bicluster is neither

Fig. 9. Expression profiles of genes in the bicluster in Fig. 8.

S. Chen et al. / Methods 83 (2015) 18–27

shifting nor scaling, instead it is general linear. Thus this kind of biclusters is difficult to be detected by other methods. 5. Conclusion Identifying all maximal subsets of genes showing significantly coherent expression behaviors, including scaling and general linear patterns, has much biological significance. In this work, we have proposed a new coherence measurement based on the idea of the minimal mean squared error (MMSE) to detect any above type biclusters from microarray data. We have theoretically deduced the ways for calculating MMSE scores for different types of patterns. We have conducted a series of experiments to validate that MMSE coherence measurement has great ability to find the biclusters with general linear patterns. Acknowledgments This work was supported by the National Science Foundation of China [61272274, 60970063]; the program for New Century Excellent Talents in Universities [NCET-10-0644]. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ymeth.2015.04. 005. References [1] H. Wang, W. Wang, J. Yang, P.S. Yu, in: SIGMOD Conference, ACM Press, New York, NY, USA, 2002, pp. 394–405. [2] D. Soh, D. Dong, Y. Guo, L. Wong, SIGKDD Explorations 9 (1) (2007) 3–13. [3] A.I. Saeed, V. Sharov, J. White, J. Li, W. Liang, N. Bhagabati, J. Braisted, M. Klapa, T. Currier, M. Thiagarajan, A. Sturn, M. Snuffin, A. Rezantsev, D. Popov, A. Ryltsov, E. Kostukovich, I. Borisovsky, Z. Liu, A. Vinsavich, V. Trush, J. Quackenbush, Biotechniques 34 (2) (2003) 374–378. [4] J. Gu, J. Liu, BMC Genomics 9 (Suppl 1) (2008). [5] G. Li, Q. Ma, H. Tang, A.H. Paterson, Y. Xu, Nucleic Acids Res. 9 (210) (2009). [6] S. Bergmann, J. Ihmels, N. Barkai, Phys. Rev. E 67 (3 Pt 1) (2003). [7] A. Ben-Dor, B. Chor, R. Karp, Z. Yakhini, in: RECOMB’02: Proceedings of the Sixth Annual International Conference on Computational Biology, ACM, New York, NY, USA, 2002, pp. 49–57. [8] T.M. Murali, S. Kasif, in: Proceedings of the 8th Pacific Symposium on Biocomputing, 2003, pp. 77–88. [9] A. Tanay, R. Sharan, R. Shamir, Bioinformatics 18 (suppl 1) (2002) S136–144. [10] T. Hastie, R. Tibshirani, M. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. Chan, D. Botstein, P. Brown, Genome Biol. 1 (2) (2000). research0003.1– research0003.21.

27

[11] P.A. DiMaggio Jr., S.R. McAllister, C.A. Floudas, X. Feng, J.D. Rabinowitz, H.A. Rabitz, BMC Bioinformatics 9 (458) (2008). [12] J. Li, S.K. Halgamuge, S. Tang, BMC Evolut. Biol. 8 (116) (2008). [13] J.S. Aguilar-Ruiz, Shifting and scaling patterns from gene expression data, Bioinformatics 21 (20) (2005) 3840–3845. [14] Y. Cheng, G.M. Church, in: ISMB 2000: Proceedings of the 8th International Conference on Itelligent Systems for Molecular Biology, IEEE Computer Society, La Jolla, CA, 2000, pp. 93–103. [15] R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman, D. Lockhart, R. Davis, Mol. Cell 2 (1) (1998) 65–73. [16] J.S. Aguilar-Ruiz, F. Divina, in: SAC ’05: Proceedings of the 2005 ACM symposium on Applied computing, ACM, New York, NY, USA, 2005, pp. 959– 960. [17] S.C. Madeira, A.L. Oliveira, IEEE/ACM Trans. Computat. Biol. Bioinform. 1 (1) (2004) 24–45. [18] S. Mitra, H. Banka, Pattern Recogn. 39 (12) (2006) 2464–2477. [19] X. Liu, L. Wang, Bioinformatics 23 (1) (2007) 50–56. [20] J.L. Flores, I. Inza, P. Larrañaga, B. Calvo, Comput. Methods Programs Biomed. 112 (3) (2013) 367–397. [21] S. Dharan, A.S. Nair, BMC Bioinformatics 10 (Suppl 1) (2009) S27. [22] M. Schmid, T.S. Davison, S.R. Henz, U.J. Pape, M. Demar, M. Vingron, B. Schölkopf, D. Weigel, J.U. Lohmann, Nat. Genet. 37 (5) (2005) 501–506. [23] V. Subbarayan, X.-C. Xu, J. Kim, P. Yang, A. Hoque, A.L. Sabichi, N. Llansa, G. Mendoza, C.J. Logothetis, R.A. Newman, et al., Neoplasia 7 (3) (2005) 280–293. [24] N.M. Kopelman, D. Lancet, I. Yanai, Nat. Genet. 37 (6) (2005) 588–589. [25] J. Yang, W. Wang, H. Wang, P.S. Yu, in: ICDE’02: Proceedings of the 18th International Conference on Data Engineering, 2002. [26] M. Anirban, U. Maulik, S. Randyopadhyay, J. Bioinform. Computat. Biol. 7 (5) (2009) 853–868. [27] L. Teng, L. Chan, J. Signal Process. Syst. 50 (3) (2008) 267–280. [28] B. Pontes, F. Divina, R. Giráldez, J.S. Aguilar-Ruiz, in: EvoBIO, Vol. 4447 of Lecture Notes in Computer Science, Springer, 2007, pp. 217–226. [29] W. Ayadi, M. Elloumi, J.K. Hao, BioData Mining 2 (9) (2009). [30] J.A. Nepomuceno, A.T. Lora, J.S. Aguilar-Ruiz, J. Garc´a-Gutíerrez, in: IDEAL 2007: Proceedings of 8th International Conference on Intelligent Data Engineering and Automated Learning, Springer, Birmingham, UK, 2007, pp. 840–849. [31] K. Cheng, N. Law, W. Siu, A.W. Liew, BMC Bioinformatics 9 (210) (2008). [32] P. Che, D.J. Gingerich, S. Lall, S.H. Howell, Plant Cell 14 (2002) 2771–2785. [33] B. Pontes, R. Giráldez, J.S. Aguilar-Ruiz, in: KES’06: 10th International Conference on Knowledge-Based, Intelligent Information and Engineering Systems, IOS Press, Netherlands, 2006, pp. 1264–1271. [34] X. Xu, Y. Lu, K. Anthony, W. Wang, in: ICDE’06: Proceedings of the 22nd International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, 2006, pp. 89–89. [35] J. Friedman, Linear Algebra Appl. 280 (2–3) (1998) 199–216. [36] L. Chuang, Henry Y.H. nad Chen, in: Proceedings of the 1994 International Conference on Parallel and Distributed Systems, IEEE Computer Society, La Jolla, CA, 1994, pp. 236–241. [37] X. Gan, A.W. Liew, H. Yan, BMC Bioinformatics 9 (2008) 209. [38] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, E. Zitzler, Bioinformatics 22 (9) (2006) 1122–1129. [39] A.P. Gasch, P.T. Spellman, C.M. Kao, O.C. Harel, M.B. Eisen, G. Storz, D. Botstein, P.O. Brown, Mol. Biol. Cell 11 (12) (2000) 4241–4257. [40] A. Gyenesei, U. Wagner, S. Barkow-Oesterreicher, E. Stolte, R. Schlapbach, Bioinformatics 23 (15) (2007) 1927–1935. [41] M. Ronen, D. Botstein, Proc. Natl. Acad. Sci. U. S. A. 103 (2) (2006) 389–394. [42] M.D. Robinson, J. Grigull, N. Mohammad, T.R. Hughes, BMC Bioinformatics 3 (1) (2002).