ValWorkBench: An open source Java library for cluster validation, with applications to microarray data analysis

Computer Methods and Programs in Biomedicine 118 (2015) 207–217

R. Giancarlo a,∗, D. Scaturro a,∗, F. Utro b,∗

a Dipartimento di Matematica ed Informatica, University of Palermo, Italy
b Computational Biology Center, IBM T.J. Watson Research, Yorktown Heights, NY 10598, USA

∗ Corresponding authors. E-mail addresses: [email protected] (R. Giancarlo), [email protected] (D. Scaturro), [email protected] (F. Utro).

Article history: Received 20 May 2014; Received in revised form 7 October 2014; Accepted 16 December 2014

Keywords: Microarray cluster analysis; Bioinformatics software; Pattern discovery in bioinformatics and biomedicine

Abstract

The prediction of the number of clusters in a dataset, in particular microarrays, is a fundamental task in biological data analysis, usually performed via validation measures. Unfortunately, it has received very little attention and in fact there is a growing need for software tools/libraries dedicated to it. Here we present ValWorkBench, a software library consisting of eleven well known validation measures, together with novel heuristic approximations for some of them. The main objective of this paper is to provide the interested researcher with the full software documentation of an open source cluster validation platform having the main features of being easily extendible in a homogeneous way and of offering software components that can be readily re-used. Consequently, the focus of the presentation is on the architecture of the library, since it provides an essential map that can be used to access the full software documentation, which is available at the supplementary material website [1]. The mentioned main features of ValWorkBench are also discussed and exemplified, with emphasis on software abstraction design and re-usability. A comparison with existing cluster validation software libraries, mainly in terms of the mentioned features, is also offered. It suggests that ValWorkBench is a much needed contribution to the microarray software development/algorithm engineering community. For completeness, it is important to mention that previous accurate algorithmic experimental analysis of the relative merits of each of the implemented measures [19,23,25], carried out specifically on microarray data, gives useful insights on the effectiveness of ValWorkBench for cluster validation to researchers in the microarray community interested in its use for the mentioned task.

© 2014 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

The advent of high throughput technologies for biological and biomedical research, in particular microarrays, has demanded fast progress in many areas of the information sciences, including the development of mathematical and statistical software environments able to "standardize" many of the data analysis pipelines for biological investigation. Well known examples are BioJava [44], BioPerl [50], SeqAn [13], MatLab [32] and R [46].


The first three are software libraries that privilege the "programmer and algorithm engineer point of view", in the sense that modules and procedures within the entire libraries can be used to develop new tools. The last two privilege the "user point of view", since the available tools are offered either via a GUI or as a set of implemented functions, easy to use as a black box. However, from the "programmer and algorithm engineer point of view", there is no direct access to the functions in order to use them as building blocks to implement new methods. That is due to copyright restrictions and/or to the complex structure and constraints of the existing packages.

In terms of methodologies, cluster analysis, with the long record of deep mathematical and statistical studies on which it is based [16,29,30,33,34] and its well documented success in many applied sciences, e.g., biomedicine [4], is a natural candidate for biological data analysis. As for microarrays, its analysis potential was shown almost immediately after their introduction in a pioneering paper by Eisen et al. [15]. Shortly thereafter, additional results, mostly related to cancer classification [2,3,14,27,41,42,48], helped in establishing cluster analysis as one of the essential techniques to discover "biologically significant groups" in microarray data [10].

Since then, most of the attention has been devoted to the development of new clustering algorithms, although cluster analysis is a process that goes beyond the mere production of a partition of the data. Indeed, essential to the process is the assessment of the quality of the partition obtained by the algorithm, with the use of specific indices, referred to as validation measures, as well explained in [33]. Such a practice was mostly ignored in microarray data analysis, as well argued by Handl et al. [28] in a study that certainly contributed to the adoption of cluster validation techniques as a common practice in microarray cluster analysis. Indeed, some of the validation methods that have been specifically designed for microarrays have become very popular and are used in many studies, e.g., the methods in [14,40]. Unfortunately, novel and effective validation measures are not easy to come by, in particular for very challenging high dimensional data such as microarrays, and in fact the entire area of cluster analysis for post-genomic studies is still the object of intense research [22,35].

Relevant for this contribution is the state of the art regarding cluster validation software libraries. In particular, there is no software library that offers a wide range of measures and that can be useful both for data analysis and for research in the development, prototyping and benchmarking of new validation measures. Indeed, with reference to the software environments mentioned above, clustering software is absent in BioJava, BioPerl and SeqAn. MatLab and R, being designed to cater to an audience wider than bioinformatics, computational biology and biomedicine, do offer cluster analysis tools. However, as discussed in detail in Section 3, they privilege mostly the "user point of view", the only notable exception being mosclust [52], which offers fully documented access to its modules for program development. Unfortunately, that library consists of two internal validation measures only. Moreover, most of the validation measures present in the mentioned software environments cannot be used in conjunction with a clustering algorithm that has been developed outside of their specific programming environment, that is, an algorithm that is external to the system, available as an executable and compatible with the input/output conventions of the validation measures. Such a level of "algorithm independence" would allow the fast validation of novel external clustering algorithms, as discussed in Section 3.

Therefore, given the state of the art depicted above, the main contribution of this paper is to fill an important gap in the literature by carrying out the non-trivial task of providing the full software documentation of ValWorkBench, an open source and portable Java library for cluster validation specifically tested on microarray data [19,23,25]. The objective is to grant full access to the wealth of software modules and classes present in the library, which can be used for the fast development, prototyping and testing of new internal validation measures as well as clustering algorithms, therefore privileging here the "programmer and algorithm engineer point of view". For completeness, we mention that ValWorkBench is the result of the accurate and robust comparative experimental analysis mentioned earlier (see [19,23,25] again). The primary intent of that line of research was to provide useful information for the choice of a measure in microarray applications, i.e., its precision in identifying the correct number of clusters and its time performance, in order to give the potential user useful insights on the effectiveness of the proposed library. However, as of now, the internal structure of the library is not readily accessible and its wealth of modules and classes cannot be used for algorithm development and experimentation without the documentation provided here.

ValWorkBench is freely and anonymously available at [1]. It is open source and distributed under the GNU license. Javadocs and instructions on how to install it are available at the website, as well as the clustering algorithm binary executables (e.g. K-means [33], Non-negative Matrix Factorization [38] and Hierarchical [33]). ValWorkBench is tested and runs on any platform that supports Java version 1.6 or higher, in particular Windows, Linux and Mac OS X.

The remainder of this paper is organized as follows. Section 2 offers an overview of the library software structure, as well as some background material on clustering, essential for the presentation of ValWorkBench. In particular, Section 2.1 provides the mentioned prerequisites. Section 2.2 presents, at a high level, the main software components of ValWorkBench. The next four sub-sections offer some details of particularly relevant classes. A fully commented account of code details is provided in the Additional Files available at the supplementary material website [1]. For completeness, Javadocs and user manuals are also provided there. Section 3 discusses the main progress, in terms of software design and algorithmic benchmarking, granted by the full documentation of the internals of ValWorkBench reported here. Moreover, such a contribution is also highlighted via a comparison of ValWorkBench with other libraries offering implementations of internal validation measures. The last section contains the conclusions and future steps.

2. Materials and methods

2.1. Basic notions and definitions on cluster analysis

Following Handl et al. [28], cluster analysis is seen here as a three step process. The first step, usually referred to as preprocessing, consists of data normalization, feature selection and the choice of a distance function. The state of the art is given in [45] for normalization, in [39] for feature selection and in [20,21,22,43] for the choice of similarity/distance functions. Regarding the other two steps, which consist of the choice of a clustering algorithm and of a validation technique, in what follows we highlight their essential aspects, with some emphasis on cluster validation, since it is central to this paper. To this end, we need to introduce some notation. Consider a set of n items Σ = {σ1, . . ., σn}, where σi, 1 ≤ i ≤ n, is defined by m numeric values, referred to as features or conditions, and let Ck = {c1, c2, . . ., ck} be a partition of Σ. Each subset ci, 1 ≤ i ≤ k, is referred to as a cluster, and Ck is referred to as a clustering solution.

2.1.1. Clustering algorithms

Usually, the partition of the items in Σ is accomplished by means of a clustering algorithm A. Recent surveys of classic as well as more innovative methods, specifically designed for microarray data, are given in [4,49], and a more in depth treatment can be found, for instance, in [16,28,29,33,34]. For the convenience of the reader, we recall that clustering algorithms are classified into two groups: partitional and hierarchical. The first type of clustering algorithm takes as input Σ and an integer k and gives as output a partition Ck of Σ, with |Ck| = k. It is worth pointing out that a partitional clustering algorithm can take as input a partition of the data and use it as an initial clustering solution that the algorithm refines, hopefully improving its quality. In this paper, we refer to this input option as external initialization. The second type of clustering algorithm produces a nested sequence of partitions, i.e. a tree. However, such algorithms can be easily adapted to generate a partition of a dataset into k clusters, e.g., by stopping the agglomeration when k clusters remain, as sketched below.
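For concreteness, the following is a minimal, self-contained sketch of a hierarchical (single-link agglomerative) algorithm adapted to output a flat partition into k clusters. It is illustrative only, and is not the library's HLink.java implementation, whose actual code is documented in the Additional Files [1].

```java
import java.util.*;

/** Illustrative sketch: single-link agglomerative clustering stopped when
 *  k clusters remain, i.e., a hierarchical algorithm adapted to output a
 *  flat partition (not ValWorkBench's HLink.java). */
public class HierarchicalCut {

    /** data: n x m matrix; returns a cluster label in [0, k) for each item. */
    public static int[] cluster(double[][] data, int k) {
        int n = data.length;
        List<Set<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) {          // start with singleton clusters
            Set<Integer> s = new HashSet<>();
            s.add(i);
            clusters.add(s);
        }
        while (clusters.size() > k) {          // merge until k clusters remain
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++)
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = singleLink(data, clusters.get(a), clusters.get(b));
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        int[] labels = new int[n];
        for (int c = 0; c < clusters.size(); c++)
            for (int i : clusters.get(c)) labels[i] = c;
        return labels;
    }

    // Single link: minimum Euclidean distance over all cross-cluster pairs.
    private static double singleLink(double[][] data, Set<Integer> a, Set<Integer> b) {
        double min = Double.MAX_VALUE;
        for (int i : a) for (int j : b) min = Math.min(min, euclid(data[i], data[j]));
        return min;
    }

    private static double euclid(double[] x, double[] y) {
        double s = 0;
        for (int f = 0; f < x.length; f++) s += (x[f] - y[f]) * (x[f] - y[f]);
        return Math.sqrt(s);
    }
}
```

Average and Complete Link merging, also supported by the library, differ only in replacing the minimum over cross-cluster pairs by the mean or the maximum, respectively.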

2.1.2. Validation measures

Let C̄j be a reference classification for Σ, consisting of j classes. That is, C̄j may either be a partition of Σ into j groups, usually referred to as the gold solution, or a division of the universe generating Σ into j categories, usually referred to as class labels. An external measure E is a function that takes as input two partitions Cj and Ck and returns a value assessing how close Ck is to Cj. It is external because the quality assessment of the partition is established via criteria external to the data. Notice that it is not required that j = k. ValWorkBench provides the three most prominent external measures known in the literature: the Adjusted Rand Index [31], the F-Index [47] and the Fowlkes and Mallows Index (FM-Index for short) [18] (see Section 2.4 and Additional File 3 at the supplementary material website [1]).

An important aspect of cluster analysis, referred to as model selection, is the determination of the number of clusters in a dataset. Technically, one is interested in the following:

• Given: (a) a sequence of clustering solutions C1, . . ., Cs, obtained for instance via the repeated application of a clustering algorithm A; (b) a function R, usually referred to as a relative measure, that estimates the relative merits of a set of clustering solutions. One is interested in identifying the partition Ck∗, among the ones given in (a), providing the best value of R.

In what follows, the optimal number of clusters according to R is referred to as k*. It is worth pointing out that, in the specialized literature, it is usual to refer to relative measures with the term internal. We follow that convention here. For the state of the art on internal measures, the reader is referred to [23,24,26,28]. Some of the most prominent internal measures are based on: (a) compactness; (b) hypothesis testing in statistics; (c) stability-based techniques; and (d) jackknife techniques. This also gives a natural division of the main measures provided by ValWorkBench:

(a) Within Clusters Sum of Squares (WCSS for short) [30] and the Krzanowski and Lai Index (KL for short) [37].
(b) Gap Statistics (Gap for short) [51].
(c) CLEST [14], Model Explorer (ME for short) [6], Consensus Clustering (Consensus for short) [40] and Fast Consensus (FC for short) [25].
(d) Figure of Merit (FOM for short) [56].

2.2. ValWorkBench architecture and main packages

Conceptually, ValWorkBench can be thought of as having two layers, referred to as task and service, respectively. A high level diagram of the library architecture is given in Fig. 1. The first layer consists of the measure package, further subdivided into two subpackages, offering all the methods to carry out various validation tasks. The second consists of five additional packages that offer "basic services" to methods within all packages, and in particular to the ones contained in the task layer. They range from ensuring a uniform handling of data input to graphic routines. A synoptic description of each of them is provided next, grouped by layer.

Fig. 1 – ValWorkBench architecture. The six packages composing the library, divided by a dashed line into the task (top) and service (bottom) layers. The two subpackages present in the measure package are also shown, since they correspond to methods performing internal and external validation, respectively. The solid arrows indicate the "subpackage relation". Methods in each package may use "objects", e.g. classes, contained in other packages. Such a fact is indicated by a dotted arrow (the arrow points to the package where the "objects" being used are defined).

• The service layer packages:
  – datatypes: it contains classes encapsulating methods and state information for storing data related to the application of validation measures. It is presented in Section 2.3 and detailed in Additional File 1 at the supplementary material website [1].
  – algorithms: it contains classes encapsulating methods and state information for computing clustering partitions of a specific dataset. In this library, only the class of Hierarchical Clustering algorithms is provided. Indeed, it consists of the class HLink.java only, supporting Average, Complete and Single Link cluster merging [33]. This choice is dictated by efficiency. Indeed, the availability of Hierarchical algorithms within the library makes it possible to design measures that interleave the execution of a Hierarchical Clustering algorithm with the computation of the measure itself. That is, the computation proceeds by level of the hierarchical tree being built, rather than by restarting the entire clustering process at each iteration of the measure being computed. The algorithms package is presented in Additional File 2 at the supplementary material website [1].
  – graphics: it contains classes encapsulating methods and state information for visualizing the results of clustering algorithms and validation measures. It is presented in Additional File 4 at the supplementary material website [1].
  – nullmodels: it contains classes encapsulating methods to generate datasets from null models, the latter being a formalization of the intuition of "no structure" or "randomness" in a dataset. Those "data generation" procedures are central to the computation of many internal validation measures (see for instance [23,25,26]), although their range of application goes beyond those measures [33]. Consequently, the null models most essential for data analysis are implemented within ValWorkBench [33]: Poisson, Principal Components and Permutational. The package is presented in Additional File 5 at the supplementary material website [1].
  – exceptions: it contains classes encapsulating methods and state information for handling exceptions related to the application of validation measures. It is presented in Additional File 6 at the supplementary material website [1].
• The task layer package(s):
  – measures: it contains the Measure class that defines abstract and public methods common to all measures. Moreover, it contains two subpackages encapsulating methods and state information that implement the external and internal validation measures listed in Section 2.1.2. The two subpackages are defined as follows.

    external: it contains the following main subpackages:
      * adjustedRand
      * findex
      * fmindex
      * nullmeasures
    The first three packages naturally correspond to the external measures mentioned in Section 2.1.2. Moreover, it is convenient, for uniformity of notation, to consider the partition of a dataset D into clusters as a task performed by a null measure, whose software is contained in the fourth package. Some level of detail about those packages is provided in Section 2.4. Full details are given in Additional File 3 at the supplementary material website [1].

    internal: it contains the following subpackages:
      * WCSS
      * KL
      * Gap
      * FOM
      * diffFOM
      * CLEST
      * ConsensusC
      * modelExplorer
    Each of those packages naturally corresponds to one of the internal measures mentioned in Section 2.1.2. Again, some level of detail about some of them is provided in Section 2.5. Full details are given in Additional File 3 at the supplementary material website [1].

2.3. The datatypes package

It contains the classes that define the basic data types of the library.

• Data Input Matrix. Given a dataset Σ, consisting of n elements, each being an m-dimensional vector, it can be represented in two different ways: (1) as a data matrix D, of size n × m, in which the rows represent the items and the columns represent the condition values; (2) as a similarity/dissimilarity matrix S, of size n × n, in which each entry Si,j, 1 ≤ i ≠ j ≤ n, quantifies the similarity/dissimilarity of the pair of items (i, j). Specifically, the value of Si,j can be computed using rows i and j of D. The DataMatrix.java and SimilarityMatrix.java classes store and handle the data matrix D and the similarity matrix S.

• Gold/Clustering Solution. Given a clustering solution C, the ClusterMatrix.java class stores and handles a clustering solution, as well as a gold solution, as a matrix, while a linked list representation is managed by the ClusterList.java class. Moreover, C can also be represented by an n × n connectivity matrix MC, in which each entry MC(i, j) is 1 if the items i and j are in the same cluster, and 0 otherwise. The ConnetivityMatrix.java class stores and handles MC.

• Indicator Matrix. Given a dataset represented as a data matrix D, one can define a sampling dataset D′ as a dataset obtained by taking n′ rows from D, with 0 < n′ < n. Let ID′ be an n × n indicator matrix in which each entry ID′(i, j) is 1 if the items i and j belong to D′, and 0 otherwise. The IndicatorMatrix.java class stores and handles ID′.

• Consensus Matrix. Let D1, D2, . . ., Dh be h sampling datasets of D and let C1, C2, . . ., Ch be the corresponding partitions into k clusters. A consensus matrix is defined as follows:

$$M^{(k)} = \frac{\sum_{i=1}^{h} M^{(i)}}{\sum_{i=1}^{h} I^{(i)}} \qquad (1)$$

where M(i) is the connectivity matrix of Ci and I(i) is the indicator matrix of Di, 1 ≤ i ≤ h, and the sums and the division are taken entry-wise. The ConsensusMatrix.java class stores and handles M(k) (a standalone sketch of this computation is given at the end of this subsection).

• Header Data. The HeaderData.java class stores and handles the book-keeping information of a computational experiment. Indeed, it maintains all the information about the experimental set-up, e.g., the list of the command-line arguments used for the analysis, as well as the running time of the experiment.

• Measure Vector. The MeasureVector.java class stores and handles the results of both external and internal measures.

• Input Measure. The InputMeasure.java class is an abstract class that encapsulates different state information fields, as well as some of the input data needed to compute an internal/external validation measure. Each measure has, as a parameter, a corresponding input class that is an extension of the InputMeasure.java class. Such an organization groups together the input that is common to all measures, while delegating specialization to a lower level of implementation. This point is exemplified in the next subsection.
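Before moving on, the following standalone sketch illustrates how the consensus matrix of Eq. (1) can be accumulated from the connectivity and indicator information of h subsampled clustering runs. It is a simplified illustration, not the library's ConsensusMatrix.java code; the convention that a label of -1 marks an item not sampled in a run is hypothetical.

```java
/** Illustrative sketch of Eq. (1): accumulate connectivity (M) and
 *  indicator (I) counts over h subsampled runs, then divide entry-wise.
 *  labels[r][i] = cluster of item i in run r, or -1 if i was not sampled. */
public class ConsensusSketch {

    public static double[][] consensus(int[][] labels, int n) {
        double[][] m = new double[n][n];    // sum of connectivity matrices
        double[][] iCnt = new double[n][n]; // sum of indicator matrices
        for (int[] run : labels)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    if (run[i] < 0 || run[j] < 0) continue; // pair not co-sampled
                    iCnt[i][j]++;
                    if (run[i] == run[j]) m[i][j]++;        // same cluster
                }
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                m[i][j] = iCnt[i][j] > 0 ? m[i][j] / iCnt[i][j] : 0.0;
        return m; // M(k): fraction of co-sampled runs placing i and j together
    }
}
```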

2.4. The external measures package

The structure and content of this package are depicted in Fig. 2. For brevity, we limit ourselves to discussing, with some level of detail, only the classes contained in the packages corresponding to the Adjusted Rand Index RA and to the null measure. As anticipated earlier, a full description of the entire package is given in Additional File 3 at the supplementary material website [1].

2.4.1. The Adjusted Rand Index: method and package

In what follows, for the definition of RA, we assume that one of the two partitions is the gold solution C̄r, while the other partition Ct is provided as output by a clustering algorithm, since the generalization of the definition to the case of two arbitrary partitions is straightforward. Let ni,j be the number of items common to both c̄i and cj, 1 ≤ i ≤ r and 1 ≤ j ≤ t. Moreover, let |c̄i| = ni. and |cj| = n.j. We have:

$$R_A = \frac{\displaystyle\sum_{i,j} \binom{n_{i,j}}{2} - \left[\sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2}\right] \Big/ \binom{n}{2}}{\displaystyle\frac{1}{2}\left[\sum_{i} \binom{n_{i.}}{2} + \sum_{j} \binom{n_{.j}}{2}\right] - \left[\sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2}\right] \Big/ \binom{n}{2}}$$

RA has a maximum value of one, indicating perfect agreement between the two partitions, while its expected value under chance agreement is zero. Moreover, RA can take values on a larger range than [0, 1] and, in particular, may be negative [17,55]. Therefore, for two partitions to be in significant agreement, RA must assume a non-negative value substantially away from zero. The adjustedRand package contains two main classes: InputARand.java and AdjustedRand.java. The first is an extension of the abstract class InputMeasure.java and provides the input fields needed to compute the Adjusted Rand Index with the second class (see Fig. 2 again). The AdjustedRand.java class computes the Adjusted Rand Index in two modes: single and iterative. In single mode, the agreement is computed between two given partitions of D, both stored in the ClusterMatrix data format. In iterative mode, the agreement is computed between a given partition (e.g. the gold solution) and a set of partitions of D produced by a clustering algorithm, for all k in [kmin, kmax], i.e., Ckmin, . . ., Ckmax.
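The computation behind the measure is simple enough to sketch in full. The following standalone method computes RA from two flat partitions given as label arrays, via the contingency table ni,j defined above; it is illustrative only and is not the library's AdjustedRand.java implementation.

```java
/** Illustrative sketch of the Adjusted Rand Index R_A [31], computed from
 *  two flat partitions u (r classes) and v (t classes) over n items. */
public class AdjustedRandSketch {

    static long choose2(long x) { return x * (x - 1) / 2; }

    public static double adjustedRand(int[] u, int[] v, int r, int t) {
        int n = u.length;
        long[][] nij = new long[r][t];          // contingency table n_{i,j}
        for (int s = 0; s < n; s++) nij[u[s]][v[s]]++;

        long sumIJ = 0, sumI = 0, sumJ = 0;
        long[] ni = new long[r], nj = new long[t];
        for (int i = 0; i < r; i++)
            for (int j = 0; j < t; j++) {
                sumIJ += choose2(nij[i][j]);    // sum over all cells
                ni[i] += nij[i][j];             // row sums n_{i.}
                nj[j] += nij[i][j];             // column sums n_{.j}
            }
        for (int i = 0; i < r; i++) sumI += choose2(ni[i]);
        for (int j = 0; j < t; j++) sumJ += choose2(nj[j]);

        // Numerator and denominator of the formula above; the denominator
        // vanishes only in degenerate cases (e.g., both partitions trivial).
        double expected = (double) sumI * sumJ / choose2(n);
        double max = 0.5 * (sumI + sumJ);
        return (sumIJ - expected) / (max - expected);
    }
}
```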

2.4.2. The nullmeasures package

For simplicity and uniformity, the computation of a set of partitions of D can be seen as the computation of a measure that returns nothing, i.e., a null measure. It takes as input D, a clustering algorithm A and two integers kmin, kmax (with 1 < kmin ≤ kmax), and it stores in main memory the sequence of partitions of D, i.e., Ckmin, . . ., Ckmax, obtained via the repeated application of A. The package implementing this idea contains two main classes: InputNullM.java and NullMeasure.java. The role of the first one is analogous to that of the InputARand.java class and, as in that case, it is an extension of the InputMeasure.java class. As for the second, it has the following three extensions:

• NullMeasureGeneric.java: it takes as one of its parameters the path name of a clustering algorithm binary executable, with input/output formats compatible with ValWorkBench. The clustering is then performed via that algorithm (the external-executable pattern is sketched below).

• NullMeasureHierarchical.java: the clustering is performed via Hierarchical Clustering, with an implementation internal to the library. Average, Complete and Single Link cluster merging are supported.

• NullMeasureHierarchicalInit.java: as pointed out in Section 2.1.1, the partition quality of some clustering algorithms may be improved by an external initialization. This class is analogous to the NullMeasureGeneric.java class in that it takes as a parameter a clustering algorithm external to the library. In addition, however, it provides a built-in Hierarchical Clustering initialization to that algorithm.


Fig. 2 – The structure of the external measure package, in terms of the classes composing it. Each box corresponds to a class: its name is in bold and below it there is a specification of the package it belongs to. For example, the classes AdjustedRand.java and InputARand.java compose the adjustedRand package within the external measure package. The first class is an extension of the Measure.java class, while the second is an extension of the InputMeasure.java class (not shown), specialized to the AdjustedRand.java class. The solid unlabelled arrows connecting classes encode such an extension relationship and such an encoding is used throughout this manuscript. The solid labelled arrows indicate that a given method, or attribute within a class, uses an object in the class pointed at. The symbol "-" stands for private, while the symbol "#" stands for protected (see [7]). The NullMeasure.java class has three extensions, detailed in Section 2.4.2.

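To make the "external executable" idea concrete, here is a minimal sketch of the pattern behind classes such as NullMeasureGeneric.java. The command-line convention shown (data file, k, output file) is hypothetical; the library's actual input/output conventions are documented in the Additional Files [1].

```java
import java.io.IOException;
import java.nio.file.*;

/** Illustrative sketch of the external-algorithm pattern: run a clustering
 *  binary once per value of k and collect the partition files it writes.
 *  The argument convention here is an assumption, not the library's. */
public class ExternalClusterer {

    /** Runs the binary on dataFile for a given k; returns the path of the
     *  partition file the binary is assumed to write. */
    public static Path run(Path binary, Path dataFile, int k)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("partition_k" + k + "_", ".txt");
        Process p = new ProcessBuilder(binary.toString(),
                dataFile.toString(), String.valueOf(k), out.toString())
                .inheritIO()   // forward the algorithm's console output
                .start();
        if (p.waitFor() != 0)
            throw new IOException("clustering algorithm failed for k = " + k);
        return out;
    }
}
```

A null measure would then simply invoke such a runner for each k in [kmin, kmax] and load the resulting partitions into ClusterMatrix-style structures.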

2.5. The internal measures package

The structure and content of this package are depicted in Fig. 3. With reference to the description given in that figure, it is worth pointing out that, in addition to WithinClustersSumSquares.java, the following classes also have analogous extensions: ConsensusClustering.java, GapStatistics.java, KrzanowskiLai.java and FigureOfMerit.java. Those extensions provide a standard implementation of a given measure, along with variants that have been designed to be computationally efficient, via heuristics [23]. The advantages of such an approach are discussed and exemplified in Section 3. For brevity, we limit ourselves to discussing, with some level of detail, only the classes contained in the WCSS package. As anticipated earlier, a full description of the entire internal measures package is given in Additional File 3 at the supplementary material website [1].

2.5.1. WCSS: method and package

WCSS measures the "goodness" of a cluster via its compactness, one of the most fundamental indicators of cluster quality. Indeed, for each k ∈ [2, kmax], the method computes the sum of the squared distances between each element in a cluster and the centroid of that cluster. The "correct" number of clusters k* is predicted according to the following rule of thumb. For values of k < k*, the value of WCSS should be substantially decreasing, as a function of the number of clusters k. On the other hand, for values of k ≥ k*, the compactness of the clusters will not increase as much, causing the value of WCSS not to decrease as much. This suggests the following heuristic approach [51]: plot the values of WCSS, computed on a given set of clustering solutions, in the range [1, kmax]; choose as k* the abscissa closest to the "knee" in the WCSS curve (a standalone sketch of the computation is given after the list below).

The WCSS package contains two main classes. Specifically, the InputWCSS.java class is an extension of the abstract class InputMeasure.java that provides the input fields needed to compute WCSS via the WithinClustersSumSquares.java class. The latter is an abstract class with the following extensions:

• WCSSGeneric.java: it takes as one of its parameters the path name of a clustering algorithm binary executable, with input/output formats compatible with ValWorkBench. The computation of WCSS is then performed via that algorithm.

• WCSSFast.java: it is the implementation of a fast heuristic for the computation of WCSS. The interested reader can find details in [23].

• WCSSHierarchical.java: the computation of WCSS is performed via Hierarchical Clustering, with an implementation internal to the library. Average, Complete and Single Link cluster merging are supported.

• WCSSHierarchicalInit.java: this class is analogous to the WCSSGeneric.java class in that it takes as a parameter a clustering algorithm external to the library. In addition, however, it provides a built-in Hierarchical Clustering initialization to that algorithm.
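As an illustration, the following standalone method computes the value of WCSS for a single partition, given as a label array. It is a sketch of the measure as defined above, not the library's WithinClustersSumSquares.java code.

```java
/** Illustrative sketch of WCSS for one partition: the sum of squared
 *  Euclidean distances of each item to the centroid of its cluster. */
public class WcssSketch {

    public static double wcss(double[][] data, int[] labels, int k) {
        int n = data.length, m = data[0].length;
        double[][] centroid = new double[k][m];
        int[] size = new int[k];
        for (int i = 0; i < n; i++) {              // accumulate centroids
            size[labels[i]]++;
            for (int f = 0; f < m; f++) centroid[labels[i]][f] += data[i][f];
        }
        for (int c = 0; c < k; c++)
            for (int f = 0; f < m; f++)
                if (size[c] > 0) centroid[c][f] /= size[c];

        double sum = 0;                            // squared distances to centroids
        for (int i = 0; i < n; i++)
            for (int f = 0; f < m; f++) {
                double d = data[i][f] - centroid[labels[i]][f];
                sum += d * d;
            }
        return sum; // plot over k in [1, kmax] and look for the "knee"
    }
}
```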

3. Results

ValWorkBench has been designed to provide a generic programming paradigm for the development and testing of novel validation measures, one that takes full advantage of the Java programming features, including platform independence.


Fig. 3 – The main structure of the internal measure package. The notation here is the same as in Fig. 2. For conciseness, for each specific measure package, only the generic class corresponding to the measure associated with it is reported, limiting the inclusion of extensions to the WithinClustersSumSquares.java class only. Moreover, the corresponding extension of the InputMeasure.java class included in each specific measure package is also omitted. Its relation with the generic class measure included in a package is analogous to the one depicted in Fig. 2 for the external measures.

In order to illustrate the novelty offered by this library in the cluster validation literature, it is convenient to first summarize the main characteristics of the library, as they can be extrapolated from the technical presentation given in Section 2, then discuss the advantages they seem to offer to programmers and algorithm engineers, and finally compare ValWorkBench with existing validation software packages. The "user point of view" is also briefly discussed, for completeness, since it is not central to this paper.

(1) An abstraction design common to all measures. All modules of ValWorkBench are as "generic" as possible and structured within a hierarchy of classes. That is, a particular internal validation measure is an instance of a particular class, summarizing measures analogous to it. In turn, that class is an instance of a generic validation measure class that summarizes all of the internal/external validation measures.

(2) Access/re-usability of the main building blocks. Given the architecture description provided in this paper, together with all of the Additional Files provided at the supplementary material website [1], all of the main modules and classes present in ValWorkBench are fully described for use by programmers interested in developing new measures based on those building blocks. Moreover, being an open source library, it is possible to extend it by simply adding new code or functionality, inheriting data types and methods from the common base classes.

(3) Clustering algorithm independence. Any clustering algorithm, provided as a binary executable and with input/output conventions compatible with the ones of ValWorkBench, can be executed within the library.

The synergic combination of points (1) and (2) above makes possible: (a) the rapid "combination" and specialization of measures already implemented in ValWorkBench, even by programmers not part of the "project", to obtain new ones; (b) the use of a basic set of methods that may lead to a substantial simplification of the development of entirely new measures. Point (a) is best illustrated via the following examples, which have already been implemented.

• Extensions of a measure class. Examples of the further specialization of a measure, via extensions of a measure class, have been given for the WithinClustersSumSquares.java and NullMeasure.java classes. With reference to Section 2.5.1, we discuss the advantages of such an approach using the former class; analogous considerations hold for all measures having extensions. The four extensions of the WithinClustersSumSquares.java class cover, in an orderly fashion, a spectrum of execution scenarios, providing a set of choices to the user and opportunities for further extensions to the programmer. WCSSGeneric.java covers the case in which a user wants to execute the measure with a clustering algorithm of her/his choice. WCSSHierarchicalInit.java covers the case in which an external hierarchical initialization is required by the clustering algorithm used to compute the measure.


WCSSFast.java covers the case in which, for time efficiency reasons, the computation of a fast approximation of WCSS is needed. Likewise, the WCSSHierarchical.java class relates to efficiency of execution, when WCSS has to be computed by a Hierarchical Clustering algorithm. Indeed, in that class, the computation of the measure is interleaved with the bottom-up construction of the tree corresponding to the Hierarchical Clustering. Such a simple implementation detail results in major gains in execution time with respect to the execution of WCSS with a Hierarchical Clustering algorithm external to the class. Such an execution can take place via WCSSGeneric.java; however, the user has a more time-efficient alternative.

• Combination of existing measure classes to create new ones. The DiffFOM measure, defined in [23], is implemented via the DiffFOMGeneric.java class. The aim of DiffFOM is to make FOM automatic in the prediction of the optimal number of clusters in a dataset. Indeed, as opposed to KL, the prediction using FOM is based on the visual analysis of the FOM curve (see [56,23] for details). Such a "visual inspection methodology", although common to a few other highly appraised measures, is subjective and represents a limitation for that type of measure. Mathematically, given the analogy between FOM and KL, it is very natural to define a variant of FOM in which the automatic prediction rule defined for KL is extended to FOM. Such a natural extension of FOM was very easily realized via an extension of the KrzanowskiLai.java class that uses the FigureOfMerit.java class (a sketch of a KL-style prediction rule is given after this list).

• Development of entirely new measures. We now discuss point (b). This point is somewhat more difficult to illustrate via examples. However, it is essential to remark that the service layer of ValWorkBench offers a good variety of methods that, in our view, simplify the development of a new measure, since the library is open source and all of its modules and classes are available for re-use. We limit ourselves to mention the following packages. The datatypes package offers several datatypes pertinent to cluster validation that may be of use for the implementation of a new measure; moreover, it can be extended if a new datatype is needed. The nullmodels package offers the fundamental null models of use in cluster validation. Therefore, if a new method needs the generation of "random datasets", usually done via the null models included in ValWorkBench, the software is already there. The graphics package offers various methods to display curves and diagrams. They can be of use for the development of a new measure, if the prediction of the optimal number of clusters in a dataset is done via the inspection of a curve.

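As an illustration of the KL-style automatic rule just mentioned, the following sketch applies a Krzanowski–Lai-type prediction rule [37] to a curve of measure values (e.g., a FOM curve), in the spirit of DiffFOM [23]. It is a sketch only; the exact formulas used by the library are given in the Additional Files [1].

```java
/** Illustrative sketch of a Krzanowski-Lai-style rule [37] applied to a
 *  generic measure curve, in the spirit of DiffFOM [23].
 *  values[k] holds the measure at k clusters, for k in [1, kmax];
 *  m is the number of features (conditions). */
public class KLRuleSketch {

    public static int predictK(double[] values, int m) {
        int kmax = values.length - 1;              // values[1..kmax] are used
        double[] diff = new double[kmax + 1];
        for (int k = 2; k <= kmax; k++)            // DIFF(k), as in [37]
            diff[k] = Math.pow(k - 1, 2.0 / m) * values[k - 1]
                    - Math.pow(k, 2.0 / m) * values[k];

        int bestK = 2;
        double best = -1;
        for (int k = 2; k < kmax; k++) {           // KL(k) = |DIFF(k)/DIFF(k+1)|
            double kl = Math.abs(diff[k] / diff[k + 1]); // may be Inf if flat
            if (kl > best) { best = kl; bestK = k; }
        }
        return bestK;                              // predicted k*
    }
}
```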

As for point (3), apart from the obvious advantages that this offers for data analysis purposes, such a feature also offers the possibility to rapidly benchmark new clustering algorithms that have been developed independently of ValWorkBench. For illustrative purposes, we assume that one has a new clustering algorithm that needs to be evaluated. A sound experimental protocol to carry out such a validation task is outlined in [5,11,24] and involves the use of external indices. We limit ourselves to describing the basic step, with the aid of methods available in ValWorkBench. One chooses datasets in which the gold solution is known, e.g., a collection of them is reported in [25]. One then chooses an external index in ValWorkBench in order to measure the level of agreement between a partition generated by the algorithm to be validated and the gold solution of a given dataset. The algorithm binary executable is given as an input parameter to the selected external index in order to obtain the result of the partition agreement. The interested reader can find additional details of the experimental protocol just outlined in [5,11,24]. Moreover, in those papers, there are also pointers to clustering algorithm binary executables, compatible with ValWorkBench, that can be used to establish how competitive the new algorithm is with respect to some basic existing ones.

As for a comparison with existing cluster validation software, a preliminary clarification concerning the programming environments in which those software packages have been developed, i.e., R and MatLab, is in order. In principle, both of those software environments could be used to design a library analogous to ValWorkBench, offering analogous advantages. Of course, it remains to be seen to which degree those programming environments would support the full fledged object-oriented, generic design that is naturally supported by Java.

Table 1 – Comparison of ValWorkBench with other existing cluster validation libraries, in terms of programming. The first two columns provide environments and software libraries dedicated to validation measures. The next three indicate accessibility of the various software modules for the development of additional measures, the possibility to execute algorithms external to the software environment in which the measure has been developed, and the level of abstraction design common to all measures. A dash indicates that the software documentation provided for a given library does not allow for the evaluation of a given point.

Environment | Library | Access/re-usability of the main building blocks | Clustering algorithm independence | Abstraction design
R | SAGx [8] | NO | Partially | NO
R | ConsensusClusterPlus [54] | NO | Partially | NO
R | clValid [9] | NO | Partially | NO
R | cclust [12] | NO | NO | NO
R | RSKC [36] | NO | NO | NO
R | mosclust [52] | YES | NO | NO
Matlab | CVAP [53] | NO | YES | -
Java | ValWorkBench | YES | YES | YES


Table 2 – Comparison of ValWorkBench with other existing cluster validation libraries, in terms of validation measures available. The first two columns provide environments and software libraries dedicated to validation measures. The next three indicate the number and type of validation measures available in each library, together with the ones in common with ValWorkBench.

Environment | Library | No. of internal measures | No. of external measures | Measures in common with ValWorkBench
R | SAGx [8] | 3 | 0 | 2 (Gap, FOM)
R | ConsensusClusterPlus [54] | 1 | 0 | 1 (Consensus)
R | clValid [9] | 3 | 0 | 1 (FOM)
R | RSKC [36] | 1 | 1 | 1 (CLEST)
R | mosclust [52] | 2 | 0 | 1 (ME)
Matlab | CVAP [53] | 14 | 4 | 3 (KL, Adjusted Rand Index, FM-Index)
Java | ValWorkBench | 8 | 3 | -

Table 1 summarizes the cluster validation software libraries available in the literature, grouping them by programming environment and mentioning how they compare with ValWorkBench with respect to its main features, summarized in (1)-(3) above. The fact that the software developed in R does not grant the same features as ValWorkBench is mostly due to the fact that no effort has been made to come up with a unique R programming framework in which to develop cluster validation measures. Indeed, those software packages have apparently been programmed with no higher level of coordination. As for Matlab, and in particular CVAP [53], since the methods in that library are not accessible and can be used only via a GUI, software re-usability is not granted and the question of whether there is an abstraction design common to all measures becomes meaningless. The absence of clustering algorithm independence in some of the software developed in R is, again, something that was not considered during the design of the mentioned software libraries, the exception being the ones that allow the execution of the measure they implement via an external algorithm present in the R environment.

In terms of "the user point of view", all of the measures implemented in ValWorkBench and in the other mentioned libraries have been proposed in the literature, where their success for microarray data analysis is well documented. For the convenience of the reader, Table 2 summarizes the measures that all of those libraries have in common with the library presented here. Since ValWorkBench is based on an earlier accurate benchmarking of the implemented measures [19,23,25], the potential user is given useful insights on the effectiveness of the proposed library.

4. Conclusions

We have presented a new software library with an efficient and generic design that addresses a wide range of problems in cluster analysis. Moreover, it is the very first software development platform for cluster validation analysis that has been specifically designed to provide full usability of all its building blocks. In fact, departing from the current state of the art, the major novelty of ValWorkBench is that it places the "developer point of view" on a par with the "user point of view". Those features make it a unique programming platform for cluster analysis, in particular for microarrays, in bioinformatics, computational biology and biomedicine.

ValWorkBench is under active development and we hope that it will become one of the standard platforms for algorithm engineering in cluster analysis. We envision extending the ValWorkBench framework to include additional statistical tests for measuring the quality of a partition (e.g. Bayesian analysis), as well as new building blocks (e.g. new data generation approaches and robustness analysis).

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgements

Part of this work was supported by the Italian Ministry of Scientific Research, FIRB Project "Bioinformatica per la Genomica e la Proteomica" (project no. RBNE01F5WT) and FIRB Project "Algoritmi per la Scoperta ed il Ritrovamento di Patterns in Strutture Discrete, con Applicazioni alla Bioinformatica" (project no. RBIN04BYZ7). Additional support to R.G. has been provided by Progetto di Ateneo dell'Università degli Studi di Palermo [2012ATE-0298] "Metodi Formali ed Algoritmici per la Bioinformatica su Scala Genomica". The authors are also very grateful to the referees who, through their comments and constructive criticisms, have greatly helped in improving the presentation of our contributions.

References

[1] ValWorkBench Web Page. http://www.math.unipa.it/raffaele/valworkbench/
[2] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson Jr., L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, L.M. Staudt, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403 (2000) 503–511.
[3] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. U. S. A. 96 (1999) 6745–6750.
[4] B. Andreopoulos, A. An, X. Wang, M. Schroeder, A roadmap of clustering algorithms: finding a match for a biomedical application, Brief. Bioinform. 10 (3) (2009) 297–314.
[5] A. Ben-Dor, R. Shamir, Z. Yakhini, Clustering of gene expression patterns, J. Comput. Biol. 6 (1999) 281–297.
[6] A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustering data, in: Seventh Pacific Symposium on Biocomputing, ISCB, 2002, pp. 6–17.
[7] G. Booch, Object-Oriented Analysis and Design with Applications, second ed., Benjamin-Cummings, Redwood City, CA, 1994.
[8] P. Broberg, SAGx: Statistical Analysis of the GeneChip, 2009. http://www.bioconductor.org/packages/2.4/bioc/html/SAGx.html
[9] G. Brock, V. Pihur, S. Datta, S. Datta, clValid: an R package for cluster validation, J. Stat. Softw. (2008) 1–28.
[10] P. D'haeseleer, How does gene expression cluster work? Nat. Biotechnol. 23 (2006) 1499–1501.
[11] V. Di Gesú, R. Giancarlo, G. Lo Bosco, A. Raimondi, D. Scaturro, GenClust: a genetic algorithm for clustering gene expression data, BMC Bioinform. 6 (2005) 289.
[12] E. Dimitriadou, cclust: Convex clustering methods and clustering indexes, 2001. http://www.cran.r-project.org/web/packages/cclust/
[13] A. Doring, D. Weese, T. Rausch, K. Reinert, SeqAn — an efficient, generic C++ library for sequence analysis, BMC Bioinform. 9 (1) (2008) 11.
[14] S. Dudoit, J. Fridlyand, A prediction-based resampling method for estimating the number of clusters in a dataset, Genome Biol. 3 (2002).
[15] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. U. S. A. 95 (1998) 14863–14868.
[16] B. Everitt, Cluster Analysis, Edward Arnold, London, 1993.
[17] D. Fisher, P. Hoffman, The adjusted rand statistic: a SAS macro, Psychometrika 53 (1988) 417–423.
[18] E.B. Fowlkes, C.L. Mallows, A method for comparing two hierarchical clusterings, J. Am. Stat. Assoc. 78 (1983) 553–584.
[19] R. Giancarlo, G. Lo Bosco, F. Utro, Bayesian versus data driven model selection for microarray data, Nat. Comput. (2014) 1–10.
[20] R. Giancarlo, G. Lo Bosco, L. Pinello, Distance functions, clustering algorithms and microarray data analysis, in: Learning and Intelligent Optimization, Lecture Notes in Computer Science, 2010, pp. 125–138.
[21] R. Giancarlo, G. Lo Bosco, L. Pinello, F. Utro, A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis, BMC Bioinform. 14 (2013) S6.
[22] R. Giancarlo, G. Lo Bosco, L. Pinello, F. Utro, The three steps of clustering in the post-genomic era: a synopsis, in: Computational Intelligence Methods for Bioinformatics and Biostatistics, Lecture Notes in Computer Science, 2011, pp. 13–30.
[23] R. Giancarlo, D. Scaturro, F. Utro, Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer, BMC Bioinform. 9 (2008) 462.
[24] R. Giancarlo, D. Scaturro, F. Utro, A tutorial on computational cluster analysis with applications to pattern discovery in microarray data, Math. Comp. Sci. 1 (2008) 655–672.
[25] R. Giancarlo, F. Utro, Speeding up the consensus clustering methodology for microarray data analysis, Algorithms Mol. Biol. 6 (2011) 1.

[26] R. Giancarlo, F. Utro, Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis, Theor. Comp. Sci. 428 (2012) 58–79.
[27] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeeck, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537.
[28] J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics 21 (15) (2005) 3201–3212.
[29] J.A. Hartigan, Clustering Algorithms, John Wiley and Sons, 1975.
[30] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2003.
[31] L. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1985) 193–218.
[32] The MathWorks Inc., MATLAB, The MathWorks Inc., Natick, Massachusetts, 2014.
[33] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, 1988.
[34] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[35] S. Klie, Z. Nikoloski, J. Selbig, Biological cluster evaluation for gene function prediction, J. Comput. Biol. 21 (2010).
[36] Y. Kondo, RSKC: Robust sparse K-means, 2013. http://cran.r-project.org/web/packages/RSKC/index.html
[37] W. Krzanowski, Y. Lai, A criterion for determining the number of groups in a dataset using sum of squares clustering, Biometrics 44 (1985) 23–34.
[38] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[39] H. Liu, H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[40] S. Monti, P. Tamayo, J. Mesirov, T. Golub, Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Mach. Learn. 52 (2003) 91–118.
[41] C.M. Perou, S.S. Jeffrey, M. van de Rijn, C.A. Rees, M.B. Eisen, D.T. Ross, A. Pergamenschikov, C.F. Williams, S.X. Zhu, J.C.F. Lee, D. Lashkari, D. Shalon, P.O. Brown, D. Botstein, Distinctive gene expression patterns in human mammary epithelial cells and breast cancers, Proc. Natl. Acad. Sci. U. S. A. 96 (1999) 9212–9217.
[42] J.R. Pollack, C.M. Perou, A.A. Alizadeh, M.B. Eisen, A. Pergamenschikov, C.F. Williams, S.S. Jeffrey, D. Botstein, P.O. Brown, Genome-wide analysis of DNA copy-number changes using cDNA microarrays, Nat. Genet. 23 (1999) 41–46.
[43] I. Priness, O. Maimon, I. Ben-Gal, Evaluation of gene-expression clustering via mutual information distance measure, BMC Bioinform. 8 (2007) 111.
[44] A. Prlic, A. Yates, S.E. Bliven, P.W. Rose, J. Jacobsen, P.V. Troshin, M. Chapman, J. Gao, C.H.K. Koh, S. Foisy, R. Holland, G. Rimsa, M.L. Heuer, H. Brandstätter-Müller, P.E. Bourne, S. Willis, BioJava: an open-source framework for bioinformatics, Bioinformatics 24 (2012) 2096–2097.
[45] J. Quackenbush, Microarray data normalization and transformation, Nat. Genet. 32 (2002) 496–501.
[46] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2006.


[47] C. Van Rijsbergen, Information Retrieval, second ed., Butterworths, London, 1979.
[48] D.T. Ross, U. Scherf, M.B. Eisen, C.M. Perou, P. Spellman, V. Iyer, S.S. Jeffrey, M. van de Rijn, M. Waltham, A. Pergamenschikov, J.C.F. Lee, D. Lashkari, D. Shalon, T.G. Myers, J.N. Weinstein, D. Botstein, P.O. Brown, Systematic variation in gene expression patterns in human cancer cell lines, Nat. Genet. 24 (2000) 227–235.
[49] R. Shamir, R. Sharan, Algorithmic approaches to clustering gene expression data, in: Current Topics in Computational Biology, MIT Press, 2003, pp. 120–161.
[50] J.E. Stajich, D. Block, K. Boulez, S.E. Brenner, S.A. Chervitz, C. Dagdigian, G. Fuellen, J.G. Gilbert, I. Korf, H. Lapp, H. Lehväslaiho, C. Matsalla, C.J. Mungall, B.I. Osborne, M.R. Pocock, P. Schattner, M. Senger, L.D. Stein, E. Stupka, M.D. Wilkinson, E. Birney, The Bioperl Toolkit: Perl modules for the life sciences, Genome Res. 12 (2002) 1611–1618.


[51] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a dataset via the gap statistics, J. R. Stat. Soc. B 63 (2) (2001) 411–423.
[52] G. Valentini, Mosclust: a software library for discovering significant structures in bio-molecular data, Bioinformatics 23 (2007) 387–389.
[53] K. Wang, B. Wang, L. Peng, CVAP: validation for cluster analyses, Data Sci. J. 8 (2009) 88–93.
[54] M.D. Wilkerson, D.N. Hayes, ConsensusClusterPlus, Bioinformatics 26 (2010) 1572–1573.
[55] K.Y. Yeung, Cluster analysis of gene expression data, University of Washington, 2001 (Ph.D. thesis).
[56] K.Y. Yeung, D.R. Haynor, W.L. Ruzzo, Validating clustering for gene expression data, Bioinformatics 17 (2001) 309–318.