Artificial Intelligence in Medicine 26 (2002) 281–304
Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles

Giorgio Valentini a,b,*

a Dipartimento di Informatica e Scienze dell'Informazione (DISI), Università di Genova, via Dodecaneso 35, 16146 Genova, Italy
b Istituto Nazionale di Fisica della Materia (INFM), via Dodecaneso 33, 16146 Genova, Italy

Received 23 October 2001; received in revised form 22 April 2002; accepted 16 May 2002
Abstract

The large amount of data generated by DNA microarrays was originally analysed using unsupervised methods, such as clustering or self-organizing maps. Recently supervised methods such as decision trees, dot-product support vector machines (SVM) and multi-layer perceptrons (MLP) have been applied in order to classify normal and tumoural tissues. We propose methods based on non-linear SVM with polynomial and Gaussian kernels, and output coding (OC) ensembles of learning machines, to separate normal from malignant tissues, to classify different types of lymphoma and to analyse the role of sets of coordinately expressed genes in carcinogenic processes of lymphoid tissues. Using gene expression data from "Lymphochip", a specialised DNA microarray developed at Stanford University School of Medicine, we show that SVM can correctly separate normal from tumoural tissues, and OC ensembles can be successfully used to classify different types of lymphoma. Moreover, we identify a group of coordinately expressed genes related to the separation of two distinct subgroups inside diffuse large B-cell lymphoma (DLBCL), validating a previous hypothesis of Alizadeh et al. about the existence of two distinct diseases inside DLBCL. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Gene expression data analysis; Output coding ensembles of learning machines; Support vector machines; DNA microarrays
1. Introduction

DNA hybridisation microarrays [14,33] supply information about gene expression through measurements of the mRNA levels of a large number of genes in a cell. This
* Tel.: +39-10-3536709; fax: +39-10-3536699.
E-mail address: [email protected] (G. Valentini).

0933-3657/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0933-3657(02)00077-5
information can be used to refine the traditional classification of human malignancies based on morphological and clinical parameters. In fact, the information obtained by DNA microarray technology gives a snapshot of the overall functional status of a cell, offering new insights into potentially different types of malignant tumours, based on functional and molecular discrimination. The large amount of data produced by this powerful analytic technique can be processed through machine learning methods, using both unsupervised and supervised approaches.

In a typical unsupervised approach, expression patterns of several hundreds or thousands of genes are obtained for different cells, or tissues, or for different functional states of the same cell. Then clustering algorithms are used to group together similar expression patterns, in order to group sets of genes, sets of different cells, or different functional states of the same cell [7,20,25,43,47]. Using this approach, we can discover functionally correlated genes [14,42,45] or we can separate expression patterns of normal from pathological tissues [2,6,40]. Also, new functional classes not detected using traditional tumor classification can be discovered, and these sometimes correspond to different diseases with different clinical courses [1].

Unsupervised methods cannot always correctly separate classes, because they use unlabeled data to indirectly identify classes through clusters of gene expression data. Supervised methods can overcome this problem, exploiting 'a priori' biological and medical knowledge on the problem domain and using labeled data to directly identify and separate classes. Recently several supervised methods have been applied to the analysis of cDNA microarrays and high density oligonucleotide chips. These methods include decision trees, Fisher linear discriminant, multi-layer perceptrons (MLP), nearest-neighbour classifiers, linear discriminant analysis, Parzen windows and others [10,17,23,30,39].
In particular, support vector machines (SVM) are well suited to manage and classify high dimensional data [50], as microarray data usually are, and have recently been applied to the classification of normal and malignant tissues using dot-product (linear) kernels [22]. When we are faced with more difficult tasks, we need more powerful kernels, and in this perspective we apply non-linear SVM with polynomial and Gaussian kernels in order to classify normal and tumoural tissues. These types of kernels have been successfully applied to the separation of functional classes of yeast genes using microarray expression data [10].

A major medical problem in the genomic diagnosis (and, in perspective, therapy) of tumors consists in selecting subsets of genes mostly related to carcinogenic processes. Previous studies used feature ranking methods to select genes that are most correlated or that individually best classify the training data [22,23,39]. These methods can offer useful information about single genes, but they assume that the expression patterns of each gene are independent, while usually mRNA levels are coordinately expressed by groups of dependent genes. We propose a simple heuristic approach that takes into account 'a priori' biological knowledge about groups of coordinately expressed genes, and information provided by clustering methods, in order to select "candidate" subgroups of genes that are successively tested for their potential discrimination power using supervised learning methods.

In recent years several ensemble methods have been proposed, as it has been shown that they enhance the accuracy of learning machines [15,31]. Ensemble methods combine
different classifiers in order to build a more accurate classifier. In a recent work, bagging and boosting [17], which are ensemble methods based on resampling techniques, and one-versus-all and all-pairs combinations of binary classifiers [52] have been applied to the analysis of DNA microarray data. In this paper we propose error correcting output coding decomposition methods [16], i.e. ensemble methods specialised for multi-class classification and based on a divide-and-conquer paradigm, in order to classify different types of lymphoma.

We tackled three problems related to the classification of human lymphoma using DNA microarray data. Data of a specialised DNA microarray, named "Lymphochip", developed at Stanford University School of Medicine and specifically designed to study genes related to lymphoid and malignant development, have been used. These data are very challenging from a machine learning standpoint, considering that they consist of a small number of 4026-dimensional samples. In our first task we distinguish malignant from normal tissues using the overall information available. This dichotomic problem is tackled using SVM, MLP and linear perceptrons (LP). In our second task we attempt to identify groups of genes specifically related to the expression of a tumor phenotype, exploiting 'a priori' biological knowledge about sets of genes and information provided by clusters of coordinately expressed genes, i.e. "expression signatures" [1]. Finally, we try to directly classify different types of lymphoma (a multi-class problem) using MLP and parallel non-linear dichotomisers (PND), i.e. ensembles of learning machines based on output coding decomposition of a multi-class problem [36].

The paper is structured as follows. In Section 2, a brief overview of DNA microarray technology is presented.
In Section 3, we outline the characteristics of SVM needed to understand their application to the analysis of gene expression data and the basics about output coding decomposition methods. In Section 4, we present our experimental approach to the recognition of human lymphoma using gene expression data. Then the results of the three classification problems described above are presented and discussed, comparing the estimated generalisation error and the receiver operating characteristic of the proposed classifiers. Conclusions and future developments of this work end the paper.
2. DNA microarray technology

In this section we present only a brief overview of DNA microarray gene expression technology. For a more detailed introduction see [19]. DNA microarrays are microscopic arrays of DNA sequences printed on glass slides. Using state-of-the-art technologies the whole human genome can be printed on a standard 1 in. × 3 in. microscope slide in about 1 day. The first step in preparing DNA microarrays involves the selection of suitable DNA targets. For organisms such as mice and humans, individual cDNA clones from cDNA libraries can be used as sources of gene-specific targets in the arrays. Then an arrayer robot prints the DNA samples through a cluster of specialised printing tips on a standard microscope slide:
each DNA sample is printed in a precise and known position on the slide. The next step consists in the preparation of the fluorescent cDNA probes that will be used to hybridise the arrays. The mRNA of the cell whose gene expression has to be studied is isolated and purified. The prepared mRNA is used as a template for the synthesis of fluorescent cDNA probes by means of reverse transcription. Usually fluorescently labeled deoxyribonucleotides are used for producing cDNA probes. The cDNA sequences obtained by reverse transcription are then hybridised with the DNA samples printed on the microarray.

A laser beam successively scans the slide and a raster image of the array is acquired. Measuring the fluorescent intensities of the image we can reconstruct the quantities of cDNA that hybridise with each individual sample on the printed microscope slide. Consequently we gain information about the quantities of mRNA produced by the cell, i.e. we have a quantitative image of the gene expression. Unfortunately the absolute representation of every RNA species in any cell or tissue sample cannot be obtained, as there is a complex relationship between the amount of input mRNA for a given gene and the intensity of the fluorescent cDNA probes, depending on a multitude of experimental conditions. Using a relative representation of RNA species in two or more samples we can bypass these problems; moreover, we are interested in differences in gene expression between samples, not in the absolute amounts of RNA. For these reasons the ratios between two differently labeled cDNA probes, one of them acting as reference, are usually considered.

Gene expression data of different cells or different experimental/functional conditions are collected in matrices for numerical processing: each row corresponds to the gene expression data of a specific cDNA clone across all the examples, and each column corresponds to the expression data of all the cDNA clones for a specific cell sample.
Typically thousands of genes are used and analyzed for each microarray sample.
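The per-gene ratios between the two fluorescent channels described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the log2 transform is a common convention for making up- and down-regulation symmetric, but the text itself only speaks of "ratios".

```python
import math

def expression_ratios(sample_intensity, reference_intensity):
    # per-gene ratio of the two fluorescent channels (sample vs. reference);
    # log2 is an assumption of this sketch, chosen so that a two-fold
    # increase and a two-fold decrease have symmetric magnitudes
    return [math.log2(s / r)
            for s, r in zip(sample_intensity, reference_intensity)]
```

For example, a gene twice as expressed in the sample as in the reference yields a value of 1.0, while an unchanged gene yields 0.0.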
3. Methods

High dimensional data, as gene expression data usually are, constitute a serious problem for several machine learning methods; dimensionality reduction (e.g. principal component analysis) can be used, but it often leads to information loss and to performance degradation. SVM can overcome this problem, because they can generalise well even with high dimensional data [50]. In this section we introduce only the SVM characteristics needed to understand their application to the analysis of DNA microarray data. For a more detailed description see [13]. Then we outline a simple two-stage approach to the selection of subsets of genes involved in carcinogenic processes. This heuristic exploits a priori knowledge to select groups of correlated genes, and successively uses supervised methods to identify the most significant ones.

Multi-class classification with small data set size and high dimensional input is another difficult task in machine learning. In recent years it has been shown that in these experimental conditions ensembles of classifiers can enhance the performance of learning machines [15,31]. Ensemble methods combine a set of base classifiers in order to obtain a more accurate classifier. In particular, in this section we focus on output coding (OC) decomposition ensembles, and on their application to the classification of tumors.
3.1. Support vector machines

SVMs are two-class classifiers theoretically founded on Vapnik's statistical learning theory [50]. They act as linear classifiers in a high dimensional feature space originated by a projection of the original input space: the resulting classifier is in general non-linear in the input space, and it achieves good generalisation performance by maximizing the margin between the classes. More precisely, SVMs perform the following mapping from the input space X to the feature space Z:

$f : X \subseteq \mathbb{R}^d \to Z \subseteq \mathbb{R}^n$

where n can often be much larger than d. A support vector machine can locate a separating hyperplane in the feature space and classify points in that space without representing the feature space explicitly: it can be shown that under some conditions (Mercer's theorem) the SVM can work in the feature space using only data in the input space, avoiding the computational burden of explicitly representing the feature vectors in a high dimensional feature space. The resulting SVM decision function is expressed as a linear combination of the input data transformed through an appropriate kernel function:

$F(x) = \mathrm{sign}\Big( \sum_i \alpha_i y_i K(x_i, x) - b \Big)$   (1)

where the $(x_i, y_i)$ ($x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$) are pairs of labeled data of the training set, K is a kernel function satisfying Mercer's conditions, the $\alpha_i \in \mathbb{R}$ are coefficients and b is the bias term, both learned by the SVM algorithm. Three commonly used types of kernels are:

polynomial: $(x^T x_i + 1)^p$

radial basis (Gaussian): $\exp\big( -\|x - x_i\|^2 / 2\sigma^2 \big)$

sigmoidal: $\tanh(\beta_0 x^T x_i + \beta_1)$
In the presence of noise or incorrectly labeled data we need to use a more sophisticated version of the SVM algorithm, which tolerates examples violating the maximum separation margin as well as training errors (see [13]). The SVM learning algorithm requires solving a quadratic programming problem. This can be accomplished through off-the-shelf quadratic programming packages [29], or using specific methods such as sequential minimal optimisation [41]. In practice, using open source software available on the web (for instance, SVMlight [27]), training an SVM requires specifying the type of kernel function and the regularisation parameter C. In addition, typically one or two kernel related parameters must be specified, such as the degree for polynomial kernels, or the σ parameter for Gaussian kernels. The SVM algorithm finds the $\alpha_i$ and b coefficients needed to compute the discriminant function (Eq. (1)).
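Once the $\alpha_i$ and b have been learned, evaluating the decision function of Eq. (1) with a Gaussian kernel is straightforward. The sketch below uses hand-picked toy coefficients purely for illustration; a real model would obtain them from an SVM trainer such as SVMlight.

```python
import math

def gaussian_kernel(x, xi, sigma=1.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / (2 * sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, labels, alphas, bias,
                 kernel=gaussian_kernel):
    # Eq. (1): F(x) = sign(sum_i alpha_i * y_i * K(x_i, x) - b)
    s = sum(a * y * kernel(xi, x)
            for a, y, xi in zip(alphas, labels, support_vectors))
    return 1 if s - bias >= 0 else -1

# toy "trained" model: two support vectors with hand-picked coefficients
sv = [[0.0, 0.0], [2.0, 2.0]]
y = [1, -1]
alphas = [1.0, 1.0]
b = 0.0
```

A point near the positive support vector is classified as +1, and a point near the negative one as -1; the non-linearity of the boundary in the input space comes entirely from the kernel.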
3.2. Heuristic expression signature selection

Expression signatures [1,34] are subsets of coordinately expressed genes, identified by the biological process in which their genes are known to function (e.g. proliferation), or by the cell type in which their component genes are expressed (e.g. germinal centre B cells). We propose a particular case of feature selection, that is, gene expression signature selection, which takes into account the biological significance of the groups of genes used to discriminate between normal and malignant tissues. The main purpose of this approach consists in identifying biologically characterised subsets of genes (expression signatures) correlated with a malignancy, providing insights into the biology of tumoural diseases. This two-stage approach can be summarised as follows:

1. Select candidate subgroups of genes.
2. Identify the subgroups of genes most related to the disease.

In the first stage we use "a priori" biological and medical knowledge about groups of genes with known or suspected roles in carcinogenic processes. For instance, known markers of cellular differentiation, genes altered by chromosomal translocations, genes that function as oncogenes in vitro, and in general genes with known roles in the pathophysiology of malignant tissues can be used to select candidate subgroups of genes. We can also identify groups of genes with similar gene expression profiles using clustering methods in an unsupervised fashion [20,25,47]. The selection of candidate subgroups of genes can be performed using either or both of these approaches. For instance, clustering methods can identify groups of genes with similar gene expression profiles, but their biological meaning must be explained through biological knowledge in order to interpret them as expression signatures. In the second stage we evaluate the groups of genes selected in the first stage using supervised methods, in order to identify those groups most related to tumoural transformations.
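The second stage can be sketched as a loop over candidate signatures, scoring each one by the accuracy a classifier achieves when restricted to its genes. The signature names and the `evaluate` callback below are hypothetical placeholders, not part of the paper's software.

```python
def rank_signatures(X, y, signatures, evaluate):
    # X: samples x genes matrix (list of lists); y: class labels;
    # signatures: dict mapping a signature name to a list of gene column
    # indices; evaluate: callback scoring a classifier on a gene subset,
    # e.g. 10-fold cross-validated accuracy of an SVM (hypothetical here)
    scores = {}
    for name, gene_idx in signatures.items():
        X_sub = [[row[i] for i in gene_idx] for row in X]
        scores[name] = evaluate(X_sub, y)
    # the highest-scoring signatures are those most related to the disease
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Because whole signatures are scored rather than single genes, a group of individually weak but complementary genes can still rank highly.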
This is accomplished by evaluating the performance of classifiers trained with each selected subset of genes. In particular, for these tasks we propose to apply SVM with polynomial and Gaussian kernels, but in principle any supervised learning machine can be used. The strength of the relationship between an expression signature and a particular malignancy can be estimated through the accuracy of the SVM trained with gene expression data of the expression signature itself. The proposed approach differs from feature ranking methods [22,23,39], as it takes into account that often the most significant information is given by sets of complementary genes rather than by single, independent genes; feature ranking can offer useful information about single genes, but fails in detecting the role of coordinately expressed genes in carcinogenic processes [24].

3.3. Output coding decomposition ensembles

Output coding (OC) methods [16,37] consist in decomposing a multi-class problem into a set of L two-class subproblems, training the resulting ensemble of dichotomisers and then combining the L outputs to predict the class label. The set of dichotomic problems
generated by the decomposition are each assigned to a different learning machine: from a general standpoint any dichotomic learning machine can be used (e.g. a multi-layer perceptron, a support vector machine or a naive Bayes classifier). OC decomposition methods code classes through binary strings. A coding process is a mapping $M : C \to S$ from the set of the classes to the set of binary strings $S = \{s_1, \ldots, s_k\}$, where the $s_i \in \{-1, 1\}^l$ are named codewords, i.e. binary strings of length l. Each string $s_i$ must univocally determine its corresponding class, i.e. for all i, j such that $i \neq j$, $1 \le i, j \le k$, we have $s_i \neq s_j$. The class coding implicitly generates a decomposition of the k-polychotomy into a set of l dichotomies $f_1, \ldots, f_l$, where l is the length of the codeword coding a class. Each dichotomy $f_i$ subdivides the input patterns into two complementary superclasses $C_i^+$ and $C_i^-$, each of them grouping one or more classes of the k-polychotomy. Given a decomposition matrix $D = [d_{ik}]$ of dimension $l \times k$ that represents the decomposition and connects the classes $C_1, \ldots, C_k$ to the superclasses $C_i^+$ and $C_i^-$, an element of D is defined as:

$d_{ik} = \begin{cases} +1 & \text{if } C_k \subseteq C_i^+ \\ -1 & \text{if } C_k \subseteq C_i^- \end{cases}$

In a decomposition matrix, rows correspond to dichotomiser tasks and columns to classes, and each class is univocally determined by its codeword.

3.3.1. Decomposition schemes

Different decomposition schemes have been proposed in the literature, such as the one per class (OPC) [3], the correcting classifiers (CC) [38], and the error correcting output coding (ECOC) [16] decomposition. According to the one per class decomposition scheme, each dichotomiser $f_i$ has to separate a single class from all the others. As a consequence, if we have k classes, we will use k dichotomisers. Classification based on error correcting output codes (ECOC) is a decomposition method borrowed from coding theory [8].
The notion of codeword used for class labeling suggests the idea of adding error recovering capabilities to decomposition methods in order to obtain classifiers less sensitive to noise [35]. An example of an ECOC decomposition matrix for a four-class polychotomy is shown in Table 1.
Table 1
ECOC decomposition matrix (rows: dichotomisers; columns: classes)

+1  +1  +1  -1
+1  +1  -1  +1
+1  +1  -1  -1
+1  -1  +1  +1
+1  -1  +1  -1
+1  -1  -1  +1
+1  -1  -1  -1
The error recovering capabilities of ECOC methods depend on column and row separation of the decomposition matrix. More precisely, the maximum number of errors $\mathrm{Max}_{err}$ that can be corrected in an ECOC based decomposition is [16]:

$\mathrm{Max}_{err} = \left\lfloor \tfrac{1}{2}(\Delta_D - 1) \right\rfloor$   (2)

where $\Delta_D$ is the minimal Hamming distance between each pair of columns of the decomposition matrix D. For instance, in Table 1, $\Delta_D = 4$ and $\mathrm{Max}_{err} = 1$. The effectiveness of ECOC decomposition methods depends mainly on the design of the learning machines implementing the decision units, on the similarity of the ECOC codewords, on the accuracy of the dichotomisers, on the complexity of the multi-class learning problem and on the dependence of the codeword bits [35].

3.3.2. Decomposition and reconstruction

OC is a two-stage classification method: after the decomposition stage, a reconstruction stage performs the decoding of the codeword computed during the decomposition stage in order to output the class label. As a consequence, learning machines constructed using OC methods are composed of two units: (a) a decomposition unit, that analyses the input patterns and calculates the codewords using an assigned decomposition scheme; (b) a decision unit, that decodes the computed codeword, mapping it to the associated class. The decoding function $G(\hat{s})$ specifies the class predicted by the ensemble, and it is usually implemented by maximisation of a suitable similarity measure between the computed codeword $\hat{s}$ and the effective codewords $s_i$, $1 \le i \le k$, associated to the classes:

$G(\hat{s}) = \arg\max_{1 \le i \le k} \mathrm{Sim}(\hat{s}, s_i)$   (3)

where $s_i$ is the codeword of class $C_i$, the vector $\hat{s}$ is the codeword computed by the set of dichotomisers, and $\mathrm{Sim}(x, y)$ is a general similarity measure between two vectors x and y. This similarity measure can be the Hamming distance for dichotomisers with discrete outputs, or an inner product or one of the L1 or L2 norm distances for dichotomisers with continuous outputs. In particular, in our experiments we use parallel non-linear dichotomisers (PND), i.e. OC decomposition ensembles where each dichotomiser is implemented through an MLP [36].
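The decomposition matrix of Table 1, Hamming-distance decoding (Eq. (3) with discrete outputs), and the error bound of Eq. (2) can be sketched together as follows; this is a minimal illustration, not the PND implementation used in the paper.

```python
D = [  # Table 1: rows are dichotomisers, columns are classes
    [+1, +1, +1, -1],
    [+1, +1, -1, +1],
    [+1, +1, -1, -1],
    [+1, -1, +1, +1],
    [+1, -1, +1, -1],
    [+1, -1, -1, +1],
    [+1, -1, -1, -1],
]

def ecoc_decode(outputs, D):
    # Eq. (3): pick the class whose codeword (a column of D) is nearest
    # in Hamming distance to the vector of dichotomiser outputs
    k = len(D[0])
    dists = [sum(o != row[c] for o, row in zip(outputs, D))
             for c in range(k)]
    return dists.index(min(dists))

def max_correctable_errors(D):
    # Eq. (2): Max_err = floor((Delta_D - 1) / 2), where Delta_D is the
    # minimal Hamming distance between pairs of class codewords
    k = len(D[0])
    delta = min(sum(row[i] != row[j] for row in D)
                for i in range(k) for j in range(i + 1, k))
    return (delta - 1) // 2
```

With this matrix $\Delta_D = 4$, so one wrong dichotomiser output is still decoded to the correct class, matching the $\mathrm{Max}_{err} = 1$ example in the text.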
4. Experimental setup We apply the proposed methods to three classification tasks related to the recognition of human lymphoma using DNA microarray gene expression data. 4.1. Data Data used in our experiments are taken from [1], and consist of 96 tissue samples from normal and malignant populations of human lymphocytes, considering for each sample 4026 different genes preferentially expressed in lymphoid cells or with known roles in processes important in immunology or malignant development.
Table 2
DNA microarray samples

Type of tissue           Number of samples
Normal lymphoid cells    24
DLBCL                    46
FL                        9
CLL                      11
TCL                       6
Gene expression data are expressed as fluorescence ratios, normalised by subtracting from each value the median of all the values. Missing gene expression data (about 6% of all the data) were replaced with zeros. We modified the format of the original data in order to use them with the machine learning software (SVMlight [27] and NEURObjects [49]) used in our experiments.1 We considered three main classes of lymphoma: diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL) and chronic lymphocytic leukemia (CLL), together with transformed cell lines (TCL) and normal lymphoid tissues [1] (Table 2). We combined the five normal lymphocyte types of Alizadeh et al.'s paper into one single group, as our main task consisted in separating normal from malignant lymphoid tissues (see Section 4.2). In this way the number of classes was reduced from nine to five. The total number of malignant samples was 72, while the number of different normal lymphoid samples amounted to 24.

4.2. Classification tasks

In our experiments we faced three main problems:

1. Separating normal from malignant lymphoid cells.
2. Identifying and separating two different subclasses of lymphoma inside diffuse large B-cell lymphoma using gene expression signatures.
3. Identifying different types of lymphoma on a functional basis.

Our first task consisted in distinguishing malignant from normal tissues using the overall information available, i.e. all the 4026 gene expression data. The second task tries to validate the hypothesis of Alizadeh et al. [1] about the existence of two distinct functional types of lymphoma inside DLBCL. They showed that two subgroups of DLBCL, which they named germinal centre B-like DLBCL (GCB-like) and activated B-like DLBCL (AB-like), can be separated using hierarchical clustering algorithms. To this purpose we applied supervised methods to separate GCB-like from AB-like cells. Lossos et al. [34] and Alizadeh et al.
[1] claimed that different subsets of genes could be responsible for the distinction of these two DLBCL subgroups: the expression signatures that they named Proliferation, T-cell, Lymphnode, and GCB. Each one is composed of fewer than 100 genes.
1 The original data are available at http://llmpp.nih.gov/lymphoma/data, and the modified data at http://ftp.disi.unige.it/person/ValentiniG/Data/lymphoma.
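The preprocessing described in Section 4.1 (median normalisation and zero-filling of missing values) can be sketched as below. Per-gene median-centring is one plausible reading of the text, and representing a missing measurement as `None` is an assumption of this sketch.

```python
from statistics import median

def preprocess(expression_rows):
    # each row holds the fluorescence ratios of one gene across all samples;
    # None marks a missing measurement (about 6% of the data in the paper)
    out = []
    for row in expression_rows:
        med = median(v for v in row if v is not None)
        # centre on the median, then replace missing values with zeros
        out.append([0.0 if v is None else v - med for v in row])
    return out
```

After this step every gene's observed values are centred on zero, so a zero-filled missing entry behaves like a "no change" measurement.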
These expression signatures have been identified by applying hierarchical clustering to gene expression data [1]. The Proliferation signature includes diverse cell-cycle control genes, DNA synthesis and replication genes, and the gene Ki67 used to assess the "proliferation index" of a tumour biopsy [40]. The T-cell signature includes genes preferentially expressed in T lymphocytes, while the Lymphnode signature is characterised by genes expressed by most DLBCLs and by samples of normal lymph node and tonsil, such as genes encoding markers of natural killer cells, monocytes and macrophages. The GCB expression signature is characterised by many known markers of germinal centre differentiation, such as the genes encoding the cell-surface proteins CD10 and CD38 [32] or BCL-6, the most frequently translocated gene in DLBCL [46], or genes that distinguish germinal centre B cells from other stages in B-cell ontogeny. In our experiments we attempted to identify whether expression signatures are specifically related to the expression of these two different tumor phenotypes, in order to indirectly gain information on sets of coordinately expressed genes involved in carcinogenic processes of lymphoid cells.

In our third task we tried to directly classify different types of lymphoma (a multi-class problem), using again all the available gene expression data. In all learning tasks we used NEURObjects [49], a set of C++ library classes for neural network development, and SVMlight [27], a set of C applications implementing dichotomic SVM for classification tasks.2

4.3. Evaluation of the results

For all the classification problems we used 10-fold cross validation techniques in order to evaluate the generalisation error of the learning machines; in addition, the Joachims ξα estimator [28] of the leave-one-out error has been applied to SVM.
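A 10-fold partition of the sample indices, as used for the cross validation estimates above, can be sketched as follows. The interleaved fold assignment is an assumption of this sketch; the paper does not specify how its folds were formed.

```python
def kfold_indices(n_samples, k=10):
    # partition indices 0..n_samples-1 into k disjoint folds; each fold is
    # used once as the test set while the remaining folds form the training set
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

With the 96 samples of this study and k = 10, each sample appears in exactly one test fold, so the averaged test error estimates the generalisation error.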
From a general standpoint it would be preferable to use a double-resampling procedure to properly evaluate the discrimination power of the proposed methods, dividing the available data into a training set, used for learning and model estimation through cross-validation techniques, and a prediction set, used only for estimating the prediction risk of the model [21]. Even if this double-resampling procedure provides an unbiased estimate of the generalisation error, it may be highly variable due to the variance of finite samples and the specific choice of the prediction sample, especially with small data sets [12]. Having a small number of samples (a typical case with gene expression data), it is safer to use cross validation estimates of the generalisation error, as this estimate is invariant to a particular partitioning of the examples (in the leave-one-out case), or only slightly overoptimistic with respect to an unbiased estimate [9]. As soon as more gene expression data become available, we may use separate unseen samples to properly estimate the prediction risk of the selected models.

In several classification tasks, accuracy is not a sufficient criterion to evaluate the performance of a classifier. In many problems a false alarm is often not as expensive as a missed correct alarm. For instance, considering the detection of seriously diseased
2 NEURObjects is available on line at http://www.disi.unige.it/person/ValentiniG/NEURObjects. SVMlight software is available at http://ais.gmd.de/thorsten/svm_light.
patients, it is preferable to avoid false negative results rather than false positive ones. In this and similar problems it is preferable to also use other measures of performance, explicitly considering different types of misclassification, such as false positives and false negatives. For instance, considering the medical problem of separating malignant from normal tissues, the sensitivity gives the percentage of correctly detected tumoural examples, the specificity the percentage of correctly detected normal tissues, and 1-specificity the percentage of incorrectly detected malignant tissues with respect to the total number of normal tissues. For classifiers obtained by thresholding, such as MLP or SVM, varying the decision threshold of the classifier we can compute a receiver operating characteristic (ROC) curve, describing the trade-off between specificity and sensitivity. Using ROC curves we can compare the performance of different learning systems: the best point in the ROC plane is (0, 1), i.e. 1-specificity = 0 and sensitivity = 1; the worst point is the opposite, (1, 0); ROC curves lying near the diagonal correspond to random guessing classifiers, and in general learning systems with ROC curves lying in the top and leftmost portion of the ROC plane are the better ones.
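The sensitivity/specificity trade-off and the ROC points described above can be computed by sweeping the decision threshold over the classifier's continuous outputs; the `scores` and `labels` names below are hypothetical, standing for any thresholded classifier's outputs and the true classes.

```python
def sensitivity_specificity(scores, labels, threshold):
    # labels: 1 = positive (malignant), 0 = negative (normal)
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(scores, labels):
    # one (1-specificity, sensitivity) point per distinct threshold,
    # from the strictest threshold to the loosest
    points = []
    for t in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, t)
        points.append((1.0 - spec, sens))
    return points
```

A classifier that ranks every positive sample above every negative one passes through (0, 1), which is the "almost ideal" behaviour reported for the linear SVM in Section 5.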
5. Results and discussion

We trained about 1600 SVM and 2000 MLP, considering globally all the classification tasks involved in this experimentation. We applied three different types of SVM, using linear, polynomial and radial basis kernel functions, and we considered MLP with one hidden layer for both the two-class and the multi-class problems. The dichotomic base learners of the decomposition unit of the PND were implemented by MLP with one hidden layer. Considering SVM, we performed model selection using different values of the regularisation parameter, varying it from 0.1 to 1000, and using different values for kernel parameters. For polynomial kernels we used degrees 2-5 in order to consider from pairwise to quinary correlations among gene expression measurements [39], and we varied the σ value of the radial basis SVM from 0.01 to 1000 in order to take into account several distribution spreads of the data. With MLPs we performed model selection varying the number of hidden neurons from 3 to 15, and the parameters of the backpropagation algorithm.

5.1. Classifying normal versus malignant tissues

In the first classification task we used all the available gene expression information (4026-dimensional input patterns) in order to separate malignant from normal lymphoid cells. The results are shown in Table 3. MLP with 10 hidden neurons shows an estimated generalisation error of about 2% (using 10-fold cross validation), while SVM-linear achieves the best results for a large range of values of the regularisation parameter (1 ≤ C ≤ 1000). This is not surprising because, as shown by the bias-variance analysis of the error in SVMs, C values do not always affect the generalisation performance of SVMs, and kernel parameters are usually the most important factors [48]. Interestingly, the estimation of the leave-one-out error for SVM is identical to
292
G. Valentini / Artificial Intelligence in Medicine 26 (2002) 281–304
Table 3
Classification of malignant and normal lymphoid cells: generalisation error, standard deviation of the error, precision and sensitivity percent estimation through 10-fold cross validation

Learning machine model   Generalisation error   S.D.    Precision   Sensitivity
SVM-linear                1.04                   3.16   98.63       100.0
SVM-poly                  4.17                   5.46   94.74       100.0
SVM-RBF                  25.00                   4.48   75.00       100.0
MLP                       2.08                   4.45   98.61        98.61
LP                        9.38                  10.24   95.65        91.66

SVM-poly, polynomial SVM; SVM-RBF, radial basis SVM; LP, linear perceptron.
the computed estimation of the generalisation error through 10-fold cross validation for all three types of SVM, and the standard deviation of the error is slightly lower. SVMs also show a very high estimated probability (100%) of detecting tumoural lymphoid cells (sensitivity), no matter what type of kernel function is used. Radial basis SVM shows a high misclassification rate, entirely due to the low precision of this type of SVM (in fact, the recall is 100%). Varying the values of the C regularisation factor between 0.1 and 1000, and the values of the sigma parameter of the Gaussian kernel between 0.1 and 100, leads in all cases to the same results, with one-fourth of the examples misclassified, both using 10-fold cross validation and leave-one-out estimation of the generalisation error. This type of SVM has a high estimated Vapnik-Chervonenkis (VC) dimension [50], confirmed also by the fact that systematically all the input patterns are support vectors. Moreover, the best results for polynomial SVM are achieved by the simplest kernel used: the second degree polynomial kernel achieved the best results, with an estimated generalisation error of 4.17%, a 100% sensitivity and a precision of 94.74%, while, for instance, the fifth degree kernel showed an estimated generalisation error of 17.71%. Conversely, SVM-linear shows a low estimated VC dimension and can correctly separate the two classes directly in the input space.

ROC analysis gives more insight into the behaviour of the different learning machines in this classification task. The ROC curve of the SVM-linear is almost ideal (Fig. 1b), in the sense that the area below the curve is 1, showing that the linear SVM achieves a perfect ranking of the examples, ordering the positive samples (malignant tissues) before the negative samples (normal tissues). The polynomial SVM also achieves a reasonably good ROC curve, lying just below the SVM-linear ROC curve.
The SVM-RBF (radial basis function SVM, that is, SVM with Gaussian kernel) registers the worst ROC curve, with values lying on the diagonal: the highest sensitivity is achieved only when it completely fails to correctly detect normal cells. This analysis confirms that in this task Gaussian kernels seem to overfit the data (in all cases the classification error on the training set is zero), considering that we have a small data set with a very high input dimensionality. However, it is also possible that a more careful model selection could yield better results. The ROC analysis of the polynomial kernel SVM (Fig. 1c) confirms the preferential use of simpler SVM for this classification task (the second degree kernel outperforms all higher degree polynomial kernels).
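The kind of systematic model selection alluded to above can be sketched as a grid search over kernel parameters under k-fold cross validation. The sketch below is stdlib-only and purely illustrative: a toy Gaussian-kernel nearest-centroid classifier stands in for the actual SVM (which would additionally require searching over C), and the data are invented.

```python
import math, random

def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]  # k disjoint test folds

def cv_error(train_fn, X, y, k=5):
    """Estimate the generalisation error by k-fold cross validation."""
    errors = []
    for fold in k_fold_indices(len(X), k):
        train_idx = [i for i in range(len(X)) if i not in set(fold)]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors.append(sum(model(X[i]) != y[i] for i in fold) / len(fold))
    return sum(errors) / len(errors)

def make_rbf_centroid_trainer(sigma):
    """Toy stand-in for an RBF-kernel classifier: compare mean Gaussian
    similarity to the positive and to the negative training examples."""
    def rbf(a, b):
        return math.exp(-sum((u - v) ** 2 for u, v in zip(a, b)) / (2 * sigma ** 2))
    def train(Xtr, ytr):
        pos = [x for x, t in zip(Xtr, ytr) if t == 1]
        neg = [x for x, t in zip(Xtr, ytr) if t == 0]
        def predict(x):
            sp = sum(rbf(x, p) for p in pos) / len(pos)
            sn = sum(rbf(x, q) for q in neg) / len(neg)
            return 1 if sp >= sn else 0
        return predict
    return train

# invented, well-separated 2-D data
rng = random.Random(1)
X = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(10)]
     + [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)])
y = [0] * 10 + [1] * 10

# grid over sigma, mirroring the range used in the text (0.01 to 1000)
grid = [0.01, 0.1, 1, 10, 100, 1000]
best_sigma = min(grid, key=lambda s: cv_error(make_rbf_centroid_trainer(s), X, y))
```

The hyperparameter minimising the cross-validated error is retained; with the paper's sample sizes, leave-one-out (k = n) is the limiting case of the same loop.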
Fig. 1. ROC curves for the classification problem of separating malignant from normal tissues: (a) multi-layer perceptron and linear perceptron; (b) SVM; (c) polynomial kernel SVM; (d) comparison of ROC curves between SVM, LP and MLP.
The ROC curve of the MLP (one hidden layer with 10 hidden units and fixed learning rate η = 0.1) is also nearly optimal (Fig. 1a), confirming the good accuracy results previously shown, while the linear perceptron shows a worse ROC curve, but with reasonable values lying in the highest and leftmost part of the ROC plane. However, for medical applications, the sensitivity of the LP is surely too low. Fig. 1d shows the ROC curves of all the learning machines used in this task. All the ROC curves lie below the SVM-linear curve. The MLP ROC curve is just below the optimal one of the SVM-linear, and the SVM with second degree polynomial kernel also shows reasonably good results. Comparing our results with those obtained in [1] using hierarchical clustering, we achieved, as expected, a significant improvement in classification accuracy. Hierarchical clustering does not perform an explicit classification, assigning a label to each example; in fact, it only groups together examples with similar expression patterns.
However, we can indirectly classify the examples considering the main clusters provided by the algorithm. Using this approach, four normal tissues are clustered in the same group as DLBCL tissues; another four normal tissues are grouped with CLL cells. Moreover, at least six normal tissues are grouped with TCL. Hence 14.6% of the examples are misclassified, against the 1.04% of the SVM, the 2.08% of the MLP and the 9.38% of the LP. Of course, the classification error is significantly reduced with SVM and MLP, as supervised methods exploit 'a priori' biological knowledge (i.e. labeled data), while clustering methods use only gene expression data to group together different tissues, without any labeled data. Globally, we can state that supervised machine learning methods correctly separate malignant from normal lymphoid tissues, but only linear SVM and MLP can be used to build classifiers with a high sensitivity and a low rate of false positives.

5.2. Identifying DLBCL subgroups

Using clustering methods, Alizadeh et al. [1] showed that two subgroups of DLBCL lymphoma can be separated. They identified two molecularly distinct subgroups of DLBCL: germinal centre B-like cells, characterised by genes normally expressed in germinal centre B cells, and activated B-like cells, characterised by expression of genes normally induced during in vitro activation of B cells. These two classes also correspond to patients with very different prognoses: those with activated B-like cells showed a significantly lower overall survival rate after treatment with comparable multi-agent chemotherapy regimens [51].

5.2.1. Using gene expression signatures to separate DLBCL subgroups

In our experiments we employed information provided by clusters of coordinately expressed genes (expression signatures), in order to verify whether we could identify an expression signature related to the DLBCL partition proposed by Alizadeh.
More precisely, we performed five classification tasks, using SVM and leave-one-out methods to estimate the generalisation error relative to the separation of germinal centre B cells and activated B-like cells. For each classification task we used each expression signature individually, and then all four signatures together. The results are shown in Fig. 2. Using the Proliferation and the Lymphnode signatures we could not separate these two subgroups. With the T-cell signature we could separate the two subclasses only with an estimated error of about 13%, a precision of about 90% and an estimated sensitivity of about 81%, using linear SVM. The best results were achieved with the GCB expression signature, with an estimated generalisation error of about 4% and an estimated precision of 100% (estimated sensitivity 91%), using SVM-RBF. Using all the signatures we obtained a high precision (100%) both with polynomial and radial basis SVM, but a generalisation error of about 10%. Hence the GCB expression signature seems specifically related to the separation of the GCB-like and AB-like subgroups of lymphoma inside the DLBCL group. In order to verify this hypothesis we applied SVM, MLP and LP to identify GCB-like and AB-like cells using only the GCB signature and the above four signatures all together. The results obtained through 10-fold cross validation are summarised in Table 4.
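The leave-one-out estimate used for these per-signature tasks can be sketched as follows; an illustrative stdlib-only sketch in which a toy 1-nearest-neighbour learner stands in for the SVM and the signature vectors are invented.

```python
def loo_error(train_fn, X, y):
    """Leave-one-out estimate of the generalisation error: train on all
    examples but one, test on the held-out example, average over examples."""
    mistakes = 0
    for i in range(len(X)):
        model = train_fn(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        mistakes += (model(X[i]) != y[i])
    return mistakes / len(X)

def nn_trainer(Xtr, ytr):
    """Toy 1-nearest-neighbour stand-in for the SVM base learner."""
    def predict(x):
        j = min(range(len(Xtr)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(Xtr[i], x)))
        return ytr[j]
    return predict

# invented expression-signature vectors for two subgroups
X = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1), (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
y = [0, 0, 0, 1, 1, 1]
print(loo_error(nn_trainer, X, y))
```

With n examples this trains n models, which is affordable at the sample sizes typical of microarray studies.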
Fig. 2. Estimated generalisation error for the classification of GCB-like and AB-like subgroups of DLBCL using four different gene expression signatures (and the four signatures all together). SVM-poly stands for polynomial SVM and SVM-RBF stands for radial basis SVM.
The estimated generalisation errors obtained with the GCB signature using 10-fold cross validation are substantially similar to the ones obtained through leave-one-out techniques. SVM-RBF achieves the best results, with an estimated error equal to that obtained through leave-one-out, while SVM-linear and SVM-poly errors are, respectively, slightly lower and slightly higher than the corresponding leave-one-out estimates. MLP and LP show results comparable with linear and polynomial SVM. The leave-one-out estimations of the

Table 4
Classification of GCB-like and AB-like DLBCL cells: estimated generalisation error, standard deviation of the error, precision and sensitivity through 10-fold cross validation

Learning machine model   Generalisation error   S.D.    Precision   Sensitivity

GCB signature
SVM-linear               10.50                  11.16    90.00       90.00
SVM-poly                  8.70                  14.54    96.67       88.33
SVM-RBF                   4.50                   9.55   100.0        90.00
MLP                       8.70                  10.50    90.90       90.90
LP                        8.70                  10.50    90.90       90.90

All signatures
SVM-linear               15.00                  11.16    85.00       85.00
SVM-poly                 14.00                  18.97    93.33       76.67
SVM-RBF                  10.00                  10.54   100.00       76.67
MLP                       8.70                  13.28    95.00       86.36
LP                       10.87                  14.28    86.96       90.90

SVM-poly stands for polynomial SVM, SVM-RBF stands for radial basis SVM and LP stands for linear perceptron.
Fig. 3. ROC curves for the classification problem of separating GCB-like and AB-like subclasses of human lymphoma inside DLBCL: (a) classification using the GCB signatures; (b) classification using all the four signatures.
generalisation error are confirmed considering the four signatures all together. In this task MLP slightly outperforms SVM (but the difference is not statistically significant). In all cases the standard deviation of the error estimated with leave-one-out procedures is slightly lower.

5.2.2. ROC analysis of DLBCL subgroups separation

The ROC analysis shows that SVM-RBF outperforms all other learning machine models using the GCB signature (Fig. 3a). Zooming in on the highest and leftmost part of the ROC plane (Fig. 4a), we can see that the ROC curve of SVM-RBF is above all the other curves (apart from a small interval of 1-specificity where polynomial SVM performs slightly better). Fig. 4c shows that SVM-RBF performs better than MLP and LP for this classification task. It is also evident that in this task, lowering the dimensionality of the input space, SVM-linear is not the best classifier (Figs. 3 and 4). On the other hand, considering the signatures all together, we note that it is very hard to distinguish the performance of SVM and MLP (Fig. 3). In particular, zooming in on the most significant part of the ROC plane (Fig. 4b and d), we can see that the ROC curves of SVM and MLP intersect at many points, and in several cases MLP outperforms SVM. Moreover, the ROC curve of the LP is also comparable with that of the SVM-RBF. It is worth noting that the ROC curves relative to the GCB signature classification task lie in general above and to the left of the ROC curves relative to the ''all signatures'' classification task. Summarising, the estimation of the generalisation error through both leave-one-out and 10-fold cross validation techniques, together with the analysis of the ROC curves of the classifiers, shows that the GCB-like and AB-like subgroups of the DLBCL group of human lymphoma can be separated using the GCB signature and supervised learning machine methods. Hence, these results support the hypothesis of Alizadeh about the existence of
Fig. 4. Zooming in to show the leftmost part of the ROC curves for the classification problem of separating GCB-like and AB-like subclasses of human lymphoma inside DLBCL: (a) classification using the GCB signatures; (c) the same, with the compared ROC curves only for SVM-RBF, MLP and LP; (b) classification using all the four signatures; (d) the same, with the compared ROC curves only for SVM-RBF, MLP and LP.
two distinct subgroups in DLBCL, and moreover they identify the GCB signature as a cluster of coordinately expressed genes related to the separation between the GCB-like and AB-like DLBCL subgroups.

5.2.3. Notes on misclassified examples

In the supplementary material (available at http://llmpp.nih.gov/lymphoma), Alizadeh et al. [1] applied prediction methods by weighted voting [23], using subsets of genes selected by feature ranking methods, in order to separate the GCB-like and AB-like DLBCL subgroups. Interestingly enough, this method achieved the same results obtained by our approach (two examples misclassified), using different subsets of genes and different classification methods. However, our approach identifies a cluster of coordinately
expressed genes that is biologically significant (the GCB signature). In contrast, the Golub method used by Alizadeh et al. identifies genes pertaining to different functional groups, individually selected on the basis of their discrimination power. Moreover, with this approach, complementary genes that individually do not separate the DLBCL subclasses well are missed. The magnitude of the output of the SVM is related to the confidence of the prediction: if the output is highly positive the learning machine strongly believes that the sample is positive, and conversely if it is highly negative it strongly believes that the sample is negative [13,50]. Similar considerations can also be extended to MLP. RBF-SVM misclassifies only two examples, classifying them as AB-like while they are GCB-like. One of these examples is classified with relatively high confidence as AB-like, and this prediction is confirmed with high confidence by polynomial and dot-product SVM as well as by MLP and LP. It is likely that this sample is incorrectly labeled, and a careful medical examination of the patient could confirm this hypothesis. For the other misclassification we obtain similar results, but with a lower confidence. In any case, the analysis of the causes of the classification errors of the SVM remains an open problem: we need more data and experiments to establish whether the errors are due to mislabeling of these two new functional DLBCL subclasses or to the small number of available examples.

5.2.4. Data characteristics and SVM kernels

In Section 5.1 we observed that simpler kernels give better results, while in this second task more complex kernels perform better. In the first task we deal with high-dimensional data, and the projection into the feature space leads to a very high dimensional space.
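The size of this implicit feature space is easy to quantify: for a polynomial kernel of degree d over n inputs it is the binomial coefficient C(n + d - 1, d). A short stdlib-only check, using the 4026-dimensional inputs of the first task:

```python
from math import comb

def poly_feature_dim(n, d):
    """Number of degree-d monomials over n inputs: C(n + d - 1, d)."""
    return comb(n + d - 1, d)

# with the 4026 gene expression values, even a second degree kernel
# implicitly works in a feature space of more than eight million dimensions
print(poly_feature_dim(4026, 2))
print(poly_feature_dim(3, 2))
```

This growth is what makes the curse-of-dimensionality considerations discussed next concrete.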
For instance, using a polynomial kernel of degree d over an n-dimensional input space, the dimension of the projected feature space grows to the binomial coefficient

C(n + d - 1, d) = (n + d - 1)! / (d! (n - 1)!)

where n is the dimension of the original input space and d the degree of the polynomial. With a Gaussian kernel the dimension of the feature space is infinite. Thus a dimensionality problem arises: SVMs are sensitive to the curse of dimensionality [5], even if they control the overfitting problem arising from large sets of features [44]. Moreover, some kernels could be better suited to particular distributions of the data, that is, their performance depends on the structure and characteristics of the data to be analysed. Indeed, statistical learning theory shows that the generalisation capabilities of SVMs depend critically on the maximisation of soft margins, a quantity that depends on both the kernel and the data [50]. In this general framework we can understand why in the first task simple kernels outperform more complex ones, while in the second the opposite occurs. Recall that in the first case we used high dimensional data, and so a projection into a very high dimensional feature space can be counter-productive (in fact, in RBF-SVM the training error is zero, the VC dimension is high, and the estimated generalisation error through 10-fold cross validation and leave-one-out techniques is very high), while in the second task we used data with a substantially lower dimensionality and with a different distribution. However, the problem of finding kernels well suited to discriminate
between normal and malignant lymphoid tissues is a particular case of the more general machine learning problem of discovering classes of discriminant functions well suited to particular, real and finite distributions of the data, one of the most important open problems in machine learning [26]. If we have no ''a priori'' information about the data, a straightforward approach consists in trying different kernels and selecting the best one for the specific problem.

5.3. Classifying different types of lymphoma

In our last task we directly classified the different types of lymphoma, considering all the classes listed in Table 1. For this task we used multi-class MLP, parallel linear dichotomisers (PLD), one-per-class parallel non-linear dichotomisers (OPC-PND) and error correcting output coding parallel non-linear dichotomisers (ECOC-PND) ensembles [35,36]. For ECOC-PND we used 15-bit ECOC codes generated by exhaustive algorithms [16]. Each MLP was independently trained to learn each individual bit of the codeword coding the classes. The decision unit was implemented using the L1 norm distance between the outputs of the decomposition unit (that is, the vector of the continuous outputs of the base dichotomic MLP learners) and the codewords of the classes. Hence the decision unit computes the following decoding function:

arg min_{i in C} L1(y, D_i) = arg min_{i in C} sum_j |y_j - D_ij|

where y = [y_1, y_2, ..., y_l], y in R^l, is the output of the decomposition unit composed of l dichotomisers, C is the set of classes, D is the decomposition matrix and D_i is its ith column, corresponding to the codeword associated with the ith class (see Section 3.3). Fig. 5 shows the results obtained varying the number of hidden units both for the multi-class MLP and for the base learners of the PND ensembles. OPC and ECOC PND achieved the best results, with an estimated generalisation error (through 10-fold cross validation) of about 5%, but simple MLP also achieved similar, although slightly worse, results; PLD failed on this task, achieving a high estimated error rate (about 23%), revealing that simple linear classifiers cannot be used for this task. Analysis of the confusion matrix for the PND ensembles (Table 5) shows that the errors are due to false positive DLBCL predictions (for cells that are actually normal lymphoid cells) and false positive TCL predictions (for cells that are actually DLBCL), systematically repeated in both the OPC and ECOC PND ensembles. Similar results (with additional errors) were achieved by the multi-class MLP. In this classification task it is difficult to evaluate whether there is a statistically significant difference between the proposed classifiers, as we would need a larger data set in order to perform hypothesis tests with a satisfactory confidence level. However, OPC and ECOC PND outperformed the multi-class MLP, also showing less sensitivity to model parameters: we obtained good results varying the number of hidden units and the parameters of the learning algorithms over a relatively large range of values. For instance, with ECOC PND we obtained the best results using base classifiers with 8 hidden units, but varying the number of hidden units from 3 to 15 the predicted error ranges from 5.2 to 6.2%, while with multi-class MLP, varying the number of hidden units, the error ranges from 6.2 to 9.4%.
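The L1 decoding step performed by the decision unit can be sketched as follows; an illustrative stdlib-only sketch with an invented three-class one-per-class codebook (the actual ensembles used 15-bit ECOC codewords).

```python
def l1_decode(y, codewords):
    """Return the class whose codeword has minimum L1 distance from the
    vector y of continuous dichotomiser outputs."""
    return min(codewords,
               key=lambda c: sum(abs(yj - dj) for yj, dj in zip(y, codewords[c])))

# invented one-per-class (OPC) codebook with targets in {-1, +1}
codewords = {
    'DLBCL':  [ 1, -1, -1],
    'CLL':    [-1,  1, -1],
    'Normal': [-1, -1,  1],
}
y = [0.9, -0.7, -0.2]           # continuous outputs of the three dichotomisers
print(l1_decode(y, codewords))  # DLBCL
```

Decoding on the continuous outputs, rather than on their signs, lets a confident dichotomiser outvote an uncertain one.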
Fig. 5. Multi-class classification of different types of lymphoma using MLP multi-class, PLD and PND ensembles of learning machines.
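As a consistency check, the overall error discussed in the text can be recomputed directly from the entries of Table 5 (transcribed into the dictionary below); stdlib-only.

```python
# rows: predicted class; columns: expected class (entries from Table 5)
confusion = {
    'DLBCL':  {'DLBCL': 44, 'CLL': 0,  'Normal': 3,  'FL': 0, 'TCL': 0},
    'CLL':    {'DLBCL': 0,  'CLL': 11, 'Normal': 0,  'FL': 0, 'TCL': 0},
    'Normal': {'DLBCL': 0,  'CLL': 0,  'Normal': 21, 'FL': 0, 'TCL': 0},
    'FL':     {'DLBCL': 1,  'CLL': 0,  'Normal': 0,  'FL': 9, 'TCL': 0},
    'TCL':    {'DLBCL': 1,  'CLL': 0,  'Normal': 0,  'FL': 0, 'TCL': 6},
}

total = sum(v for row in confusion.values() for v in row.values())
correct = sum(confusion[c][c] for c in confusion)
error_rate = 1 - correct / total
# 5 errors out of 96 examples, i.e. about 5%, as reported for the PND ensembles
print(total, correct, round(100 * error_rate, 1))
```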
Table 5
Confusion matrix for the classification of different types of lymphoma

                      Expected
Predicted    DLBCL   CLL   Normal   FL   TCL
DLBCL        44      0     3        0    0
CLL          0       11    0        0    0
Normal       0       0     21       0    0
FL           1       0     0        9    0
TCL          1       0     0        0    6

5.4. Some remarks on the analysis of gene expression data using SVMs

In order to use SVMs with gene expression data, researchers should train the SVMs performing model selection using, for instance, cross validation or bootstrapping techniques [18], as, for now, DNA microarray technology allows only a relatively limited number of samples to be produced. To apply the trained SVMs for predictions on unknown gene expression data we must therefore consider different problems, arising mainly from current DNA microarray technology. Even limiting ourselves to cDNA microarrays, a common standard protocol and a general standard format for DNA gene expression data do not currently exist. Moreover, gene expression data are often noisy and standardised
procedures to assess the reproducibility and reliability of the data are far from being established, considering also that the number of repetitions of the experiments is in general limited to three or at most five, because experiments remain costly or tedious to repeat, even with state of the art technology [4]. As a consequence, it is difficult to compare gene expression data produced by different laboratories, or to use training data produced by one laboratory to classify gene expression data produced by another. From a machine learning standpoint, it is difficult to evaluate the generalisation performance of a learning machine using small data sets, as gene expression microarray data usually are. Statistical learning theory provides upper bounds on the generalisation error considering the cardinality of the data set used, the VC dimension [50], the span of support vectors [11], or the margin between different classes, but these probabilistic bounds, very important from a theoretical standpoint, are in general too loose and impractical to be applied to real problems. In practice, resampling methods are used to evaluate the generalisation error. As the biotechnology problems related to cDNA microarrays are solved and more data become available, SVMs will become more reliable, as their characteristics are well suited to analysing cDNA microarray data, and only a few parameters, such as the C regularisation factor and some kernel parameters (e.g. the degree for polynomial kernels or the sigma value for Gaussian kernels), must be adjusted.
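Alongside cross validation, the bootstrap estimate mentioned above can be sketched as follows; stdlib-only and illustrative, with a toy 1-nearest-neighbour learner and invented data standing in for the SVMs.

```python
import random

def oob_bootstrap_error(train_fn, X, y, B=30, seed=0):
    """Out-of-bag bootstrap estimate of the generalisation error: resample
    the data with replacement B times, test each model on the examples
    left out of its bootstrap sample, and average the error."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(B):
        sample = [rng.randrange(n) for _ in range(n)]
        oob = [i for i in range(n) if i not in set(sample)]
        if not oob:
            continue  # rare: every example was drawn into the sample
        model = train_fn([X[i] for i in sample], [y[i] for i in sample])
        errors.append(sum(model(X[i]) != y[i] for i in oob) / len(oob))
    return sum(errors) / len(errors)

def nn_trainer(Xtr, ytr):
    """Toy 1-nearest-neighbour learner used only for the sketch."""
    def predict(x):
        j = min(range(len(Xtr)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(Xtr[i], x)))
        return ytr[j]
    return predict

# invented, well-separated data
X = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (1.0, 0.9), (0.9, 1.0), (1.1, 1.1)]
y = [0, 0, 0, 1, 1, 1]
print(oob_bootstrap_error(nn_trainer, X, y))
```

The out-of-bag estimate tends to be pessimistic (each model sees about 63% of the distinct examples), which is often an acceptable bias for the small sample sizes discussed here.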
6. Conclusion

We proposed supervised learning machine methods to classify tumours using gene expression data, and to identify subgroups of genes related to carcinogenic processes. State-of-the-art machine learning methods, such as SVM with dot-product, polynomial and Gaussian kernels, are well suited to dealing with high dimensional data: using these methods we could process the entire human genome (about 30,000 genes), as soon as DNA microarrays can collect such a large number of genes in microchip experiments. Our experiments also showed that in some genomic tasks MLP can achieve performance comparable with SVM. We performed three classification tasks to analyse gene expression data of human lymphoma, using the proposed methods. In the first and the third task we showed that SVM, MLP and PND can be successfully applied to the classification of malignant and normal lymphoid tissues and to the recognition of different types of lymphoma. In the second task we showed how to use 'a priori' biological and medical knowledge to separate two functional subclasses of DLBCL not detectable with traditional morphological classification of lymphoma, also identifying a set of coordinately expressed genes related to this separation. As expected, SVM and MLP largely outperform clustering methods in classifying normal and malignant tissues, using labeled data to directly identify and separate classes. Practical diagnosis of tumours based on genomic expression data requires reliable and precise multi-class classification methods. Output coding methods enhance the accuracy of multi-class classification tasks by using ensembles of dichotomic classifiers, and are well suited for classification tasks with a small number of examples, as DNA microarray data sets usually are.
The proposed methods, with the cautions and caveats outlined in Section 5.4, can perform diagnosis of lymphoma and, in general, of tumours and other polygenic diseases using gene expression data. In particular, by identifying subsets of genes highly correlated with a specific disease, we could design dedicated and inexpensive DNA microchips for diagnostic purposes, using supervised methods to discriminate between normal and diseased patients. One development of this work could consist in integrating 'a priori' biological knowledge, supervised machine learning methods and unsupervised clustering methods for discovering distinct subclasses of malignancies based on functional and molecular differences. Extending this approach, gene expression data analysis could stratify patients into molecularly relevant categories, enhancing the discrimination power and precision of clinical trials, and opening new perspectives on the development of new therapeutics based on a molecular understanding of the malignancy phenotype.
Acknowledgements

We thank the anonymous reviewers for their comments and suggestions. This work was partially funded by DISI, Dipartimento di Informatica, Università di Genova and INFM, Istituto Nazionale di Fisica della Materia di Genova.
References

[1] Alizadeh A, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000;403:503–11.
[2] Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expressions revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. National Academy of Sciences, Washington DC. PNAS 1999;96:6745–50.
[3] Anand R, Mehrotra G, Mohan CK, Ranka S. Efficient classification for multi-class problems using modular neural networks. IEEE Trans Neural Netw 1995;6:117–24.
[4] Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularised t-test and statistical inferences of gene changes. Bioinformatics 2001;17(6):509–19.
[5] Bellman R. Adaptive control processes: a guided tour. New Jersey: Princeton University Press; 1961.
[6] Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z. Tissue classification with gene expression profiles. In: Proceedings of the Fourth International Conference on Computational Molecular Biology. Tokyo: Universal Academic Press; 2000.
[7] Ben-Dor A, Shamir R, Yakhini Z. Clustering gene expression patterns. J Comput Biol 1999;6(3):281–97.
[8] Bose RC, Ray-Chauduri DK. On a class of error correcting binary group codes. Inform Control 1960;3:68–79.
[9] Breiman L, Spector P. Submodel selection and evaluation in regression: the x-random case. Int Rev Stat 1992;3:291–319.
[10] Brown M, Grundy W, Lin D, Cristianini N, Sugnet C, Furey TS, et al. Knowledge-base analysis of microarray gene expression data by using support vector machines. National Academy of Sciences, Washington, DC. PNAS 2000;97(1):262–7.
[11] Chapelle O, Vapnik V. Model selection for support vector machines. In: Solla SA, Leen TK, Muller KR, editors. Advances in neural information processing systems, vol. 12. Cambridge, MA: MIT Press; 2000.
[12] Cherkassky VN, Mulier F. Learning from data: concepts, theory and methods. New York: Wiley; 1998.
[13] Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press; 2000.
[14] De Risi J, Iyer V, Brown P. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997;278:680–6.
[15] Dietterich TG. Ensemble methods in machine learning. In: Kittler J, Roli F, editors. Multiple classifier systems. Lecture notes in computer science, vol. 1857. Proceedings of the First International Workshop, MCS 2000, Cagliari, Italy. Berlin-Heidelberg: Springer-Verlag; 2000. p. 1–15.
[16] Dietterich TG, Bakiri G. Solving multi-class learning problems via error-correcting output codes. J Artif Intell Res 1995;2:263–86.
[17] Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report 576, Department of Statistics, University of California, Berkeley, 2000.
[18] Efron B, Tibshirani R. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
[19] Eisen M, Brown P. DNA arrays for analysis of gene expression. Methods Enzymol 1999;303:179–205.
[20] Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. National Academy of Sciences, Washington, DC. PNAS 1998;95(25):14863–8.
[21] Friedman JH. An overview of predictive learning and function approximation. In: Cherkassky V, Friedman JH, Wechsler H, editors. From statistics to neural networks. NATO ASI Series. New York: Springer-Verlag; 1994.
[22] Furey TS, Cristianini N, Duffy N, Bednarski D, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000;16(10):906–14.
[23] Golub T, Slonim D, Tamayo P, Huard C, Gassenbeek M, Mesirov J, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–7.
[24] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learn 2002;46(1–3):389–422.
[25] Hastie T, Tibshirani R, Eisen M, Brown P, Ross D, Scherf U, et al. Gene shaving: a new class of clustering methods for expression arrays. Technical Report, University of Stanford, 2000.
[26] Ho TK. Data complexity analysis for classifiers combination. In: Kittler J, Roli F, editors. Multiple classifier systems. Lecture notes in computer science, vol. 2096. Proceedings of the Second International Workshop, MCS 2001, Cambridge, UK. Berlin-Heidelberg: Springer-Verlag; 2001. p. 53–67.
[27] Joachims T. Making large scale SVM learning practical. In: Smola A, Scholkopf B, Burges C, editors. Advances in kernel methods—support vector learning. Cambridge, MA: MIT Press; 1999. p. 169–84.
[28] Joachims T. Estimating the generalisation performance of a SVM efficiently. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000). San Francisco, CA: Morgan Kaufman; 2000.
[29] Kaufman L. Solving the quadratic programming problem arising in support vector classification. In: Smola A, Scholkopf B, Burges C, editors. Advances in kernel methods—support vector learning. Cambridge, MA: MIT Press; 1998.
[30] Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7(6):673–9.
[31] Kittler J, Hatef M, Duin RPW, Matas J. On combining classifiers. IEEE Trans Pattern Anal Machine Intell 1998;20(3):226–39.
[32] Liu Y, Bancherau J. In: Weir D, Blackwell C, Herzenberg L, editors. Handbook of experimental immunology. Oxford: Blackwell Scientific; 1996. p. 93.1–9.
[33] Lockhart DJ, Winzeler EA. Genomics, gene expression and DNA arrays. Nature 2000;405:827–36.
[34] Lossos I, Alizadeh A, Eisen M, Chan WC, Brown PO, Botstein D. Ongoing immunoglobulin somatic mutation in germinal center B-cell-like but not in activated B-cell-like diffuse large cell lymphomas. National Academy of Sciences, Washington, DC. PNAS 2000;97(18):10209–13.
[35] Masulli F, Valentini G. Effectiveness of error correcting output codes in multi-class learning problems. In: Lecture notes in computer science, vol. 1857. Berlin-Heidelberg: Springer-Verlag; 2000. p. 107–16.
[36] Masulli F, Valentini G. Parallel non-linear dichotomisers. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN2000), vol. 2. Como, Italy, 2000. p. 29–33.
[37] Mayoraz E, Moreira M. On the decomposition of polychotomies into dichotomies. In: Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, July 1997. p. 219–26.
[38] Moreira M, Mayoraz E. Improved pairwise coupling classifiers with correcting classifiers. In: Nedellec C, Rouveirol C, editors. Lecture notes in artificial intelligence, vol. 1398. Berlin, 1998. p. 160–71.
[39] Pavlidis P, Weston J, Cai J, Grundy WN. Gene functional classification from heterogeneous data. In: Proceedings of the Fifth International Conference on Computational Molecular Biology, ACM, Montreal, Canada, 2001.
[40] Perou CM, Jeffrey SS, van de Rijn M, Eisen MB, Ross DT, Pergamenschikov A. Distinctive gene expression patterns in human mammary epithelial cells and breast cancer. Proc Natl Acad Sci USA 1999;96:9212–7.
[41] Platt JC. Fast training of SVMs using sequential minimal optimisation. In: Scholkopf B, Burges C, Smola A, editors. Advances in kernel methods—support vector learning. Cambridge, MA: MIT Press; 1998.
[42] Roberts CJ, Nelson B, Marton MJ, Stoughton R, Meyer MR, Bennett HA. Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science 2000;287:873–80.
[43] Sharan R, Shamir R. CLICK: a clustering algorithm with applications to gene expression analysis. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB’00). Menlo Park, CA: AAAI Press; 2000. p. 307–16.
[44] Shawe-Taylor J, Cristianini N. Margin distribution and soft margins. In: Smola AJ, Bartlett P, Scholkopf B, Schuurmans C, editors. Advances in large margin classifiers. Cambridge, MA: MIT Press; 1999.
[45] Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridisation. Mol Biol Cell 1998;9:3273–97.
[46] Staudt LM, Dent AL, Shaffer A, Yu X. Regulation of lymphocyte cell fate decisions and lymphomagenesis by BCL-6. Int J Immunol 1999;18:381–403.
[47] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with self-organizing maps. Proc Natl Acad Sci USA 1999;96:2907–12.
[48] Valentini G, Dietterich TG. Bias–variance analysis and ensembles of SVM. In: Multiple classifier systems. Proceedings of the Third International Workshop, MCS 2002, Cagliari, Italy. Berlin-Heidelberg: Springer-Verlag; 2002.
[49] Valentini G, Masulli F. NEURObjects: an object-oriented library for neural network development. Neurocomputing 2002;48:623–46.
[50] Vapnik VN. Statistical learning theory. New York: Wiley; 1998.
[51] Vose JM. Current approaches to the management of non-Hodgkin’s lymphoma. Semin Oncol 1998;25:483–91.
[52] Yeang C, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin R, Angelo M. Molecular classification of multiple tumor types. In: Proceedings of the Ninth International Conference on Intelligent Systems for Molecular Biology (ISMB 2001), Copenhagen, Denmark. Oxford: Oxford University Press; 2001. p. 316–22.