Molecular profiling techniques and bioinformatics in cancer research


EJSO 33 (2007) 255–265

www.ejso.com

Review

A.T. Manning, J.T. Garvin, R.I. Shahbazi, N. Miller, R.E. McNeill, M.J. Kerin*
Department of Surgery, Clinical Science Institute, University College Hospital, Galway, Ireland
Accepted 6 September 2006
Available online 27 October 2006

Abstract

Aims: Our aim was to describe the commonly used molecular profiling techniques in cancer research, to examine their limitations and to discuss the challenges of bioinformatics.
Methods: A literature search was performed using the PubMed database to identify publications relevant to this review. Citations from these articles were also examined to yield further relevant publications.
Results: We describe the use of DNA microarrays, comparative genomic hybridisation, tissue microarrays and digital differential display. The limitations of these technologies, their contribution to cancer research and the challenges of bioinformatics are also discussed.
Conclusions: Although these high-throughput technologies each have their own limitations, they are rapidly developing and contributing significantly to our understanding of cancer genetics. They have also led to the emergence of bioinformatics as a rapidly developing and vital field.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Molecular profiling; DNA-microarrays; Comparative genomic hybridisation; Tissue microarrays; Digital differential display; Bioinformatics

Introduction

Several new technologies in functional genomics have emerged in recent years and are contributing enormously to our understanding of the molecular basis of cancer. Several genomics platforms are currently in use, such as gene expression microarrays, tissue microarrays and array comparative genomic hybridisation (array CGH). These technologies are leading to the identification of useful prognostic and predictive markers and helping to achieve the goal of individualised cancer treatment. As the use of these techniques becomes more widespread, our understanding of their limitations and sources of error increases. The large amount of data produced by such high-throughput systems has also necessitated the use of complex computational tools for data management and analysis, and bioinformatics has therefore emerged as a vital and rapidly developing field.

In this article we describe the above techniques, their applications in cancer research, their limitations and potential sources of error. We also discuss the challenges in bioinformatics that these new data-rich systems have created and describe the use of digital differential display, a useful bioinformatics data-mining tool.

* Corresponding author: Tel.: +353 91 524390; fax: +353 91 750509. E-mail address: [email protected] (M.J. Kerin).
0748-7983/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ejso.2006.09.002

DNA microarrays

Microarray technology was introduced in 1995 [1] and has become the most powerful research tool in genomics. It is capable of simultaneously measuring the expression levels of thousands of genes in both physiological and pathological conditions. Microarrays can be used to compare gene expression levels at the messenger RNA level within a sample or, more commonly, to look at differences in the expression of specific genes across samples. RNA isolated from tissue comprises a complex mixture of different RNA transcripts. The abundance of individual transcripts in the mixture reflects the expression levels of the corresponding genes. When a complementary DNA mixture copied from the RNA is labelled and hybridised to a microarray, the strength of the signal produced at each address shows the relative expression level of the corresponding gene.
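
The relationship between signal strength and relative expression can be illustrated with a small calculation. The following is a minimal sketch, assuming a two-sample comparison in which each spot has a foreground and a local background intensity for a test and a reference sample. The gene names, intensity values and the `floor` parameter are hypothetical and chosen only for illustration; real pipelines apply more elaborate background correction and normalisation.

```python
import math


def log2_ratio(test_signal, test_background, ref_signal, ref_background, floor=1.0):
    """Background-adjust two spot intensities and return their log2 ratio.

    `floor` (an assumption of this sketch) stops a background-adjusted
    intensity from reaching zero or below, which would break the logarithm.
    """
    test = max(test_signal - test_background, floor)
    ref = max(ref_signal - ref_background, floor)
    return math.log2(test / ref)


# Hypothetical spot readings:
# (test signal, test background, reference signal, reference background)
spots = {
    "GENE_A": (4100.0, 100.0, 1100.0, 100.0),
    "GENE_B": (600.0, 100.0, 2100.0, 100.0),
    "GENE_C": (900.0, 100.0, 900.0, 100.0),
}

ratios = {gene: log2_ratio(*values) for gene, values in spots.items()}

for gene, ratio in ratios.items():
    # Positive values: more abundant in the test sample; negative: less abundant.
    print(f"{gene}: log2 ratio = {ratio:+.2f}")
```

Working on the log2 scale makes a fourfold increase and a fourfold decrease symmetric about zero, which is one common reason intensities are transformed before analysis.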


What are microarrays?

Microarrays rely on the complementarity of the DNA duplex, i.e. two strands will always reassemble with base pairing A to T and C to G. This reaction occurs with high specificity. Microarrays are miniature devices containing thousands of DNA sequences, which act as gene-specific probes, immobilised on a solid support (nylon, glass, silicon) in a highly parallel format. The array is then exposed to cDNA targets labelled with a radioactive, fluorescent or chemiluminescent tag, so that the intensity of the signal generated at each probe indicates the relative abundance of that transcript in the sample. When hybridised to the array, abundant sequences will generate strong signals and rare sequences will generate weaker signals. The strength of the signal represents the relative level of gene expression in the original sample.

Modern microarrays can be categorised as either cDNA arrays, usually using probes constructed with PCR products of up to a few thousand base pairs, or oligonucleotide arrays, using either short (25–30 nucleotide) or long (60–70 nucleotide) probes. The slide itself typically contains thousands of spots of cDNA or oligonucleotides, each representing a particular gene, which should hybridise to homologous cDNA in the samples [2]. Probes can be contact spotted, ink-jet deposited or directly synthesised by photolithography onto the substrate, which is made of glass, nylon or silicon.

RNA quality and performing a DNA microarray study

A microarray study is a multistep process (Fig. 1). It begins with a well-defined biological question and the design of an experiment appropriate to that question. The next step is to acquire appropriate samples, be they cultured cells or patient tissues. Preferably, tissue samples should be snap frozen immediately in liquid nitrogen and stored at -80 °C in order to preserve the quality of the tissue RNA.
Formalin-fixed and paraffin-embedded tissues are generally unsuitable for microarray studies as the RNA in the sample becomes quite degraded during tissue processing. Due to the omnipresence of ribonucleases (enzymes which degrade RNA) and the inherent instability of RNA, it is essential to measure RNA integrity after extraction. RNA integrity, quantity and ratios of ribosomal RNA are measured using microfluidics technology such as that used in the Bioanalyzer system (Agilent Technologies Inc., Palo Alto, CA, USA) (Fig. 2). The RNA is then reverse transcribed into complementary DNA (cDNA), which is more stable, labelled, and hybridised to the microarrays. Images of the microarray slide are acquired, in which the relative brightness at a particular spot is a measure of the expression level of the corresponding gene. The intensity readings from the image are background adjusted and transformed; these data are then processed and analysed. The results are then interpreted according to previous biological knowledge. The accuracy of microarray measurements must subsequently be confirmed using a reliable independent technology such as quantitative real-time PCR. The outcome should also be validated on a further set of patients, independent of the original set of patients from whose gene expression the classifier was derived.

There are three main types of microarray study: class comparison, class prediction and class discovery [3]. In class comparison studies there is a predefined classification of the specimens and the objective is to determine which genes are differentially expressed among the classes, e.g. comparing expression profiles of different types of tissue [4–6]. Examples of class comparison studies include comparing expression profiles between BRCA1 mutated tumours and sporadic tumours [7], comparing profiles between patients that respond to chemotherapy and those that do not, or comparing the profiles of cells before and after experimental intervention. Class prediction studies also examine differential gene expression between groups; here, however, the aim is to develop a multivariate class predictor that can accurately classify a new specimen based on its expression profile. Examples of class prediction studies include comparing normal tissues to tumour tissues [8,9], comparing oestrogen receptor positive to oestrogen receptor negative breast cancer [10], and comparing good prognosis cancer to poor prognosis cancer [11–13]. Prognostic prediction studies are a type of class prediction study. In prognostic marker development studies the goal is to construct a multi-gene predictor of prognosis. Class discovery studies aim to discover new taxonomies, groupings or clusters within a collection of samples [14,15].

Current uses in cancer research

Several important findings have been reported using microarrays, and the technology is becoming more and more accessible. From a clinical point of view microarrays have many varied applications.
Microarrays have been successfully applied to identification of drug targets [16], drug development [17] and treatment validation [18]. Microarray-based gene expression profiling of human cancers has rapidly emerged as a powerful new screening technique, generating hundreds of novel diagnostic, prognostic and therapeutic targets [14,19–23]. Recently, breast cancer gene expression signatures have been identified that are associated with the oestrogen receptor and lymph node status of patients and can aid in classification of breast cancer patients into subgroups with different prognosis or clinical outcome after therapy [10–13,24–31]. Two large-scale prospective trials in node-negative breast cancer, the Program for the Assessment of Clinical Cancer Tests (PACCT) trial and the Microarray in Node Negative Disease May Avoid Chemotherapy (MINDACT) trial, have begun recruiting patients to evaluate the integration of genomic profiling into clinical decision making [32]. The 21-gene Oncotype DX assay [31] will be used in the PACCT study and the 70-gene Mammaprint assay [12] will be evaluated in the MINDACT trial.

Figure 1. The steps involved in a DNA microarray experiment: experimental question; experimental design; collection of samples and appropriate controls; extraction of total RNA; measurement of RNA quality and concentration; reverse transcription of mRNA into labelled cDNA; hybridisation of labelled cDNA to the microarray; scanning and analysis of the microarray; image analysis; data analysis.

Limitations

In statistical terms, microarray studies are often underpowered, meaning that their ability to reliably discern a difference between expression levels in two treatments is inadequate given the variance of that expression. Many factors contribute to variation in gene expression, for example when comparing samples from different anatomical areas, diseased tissue with normal tissue, or patients from different ethnic backgrounds. The smaller the difference between two groups, the greater the statistical power must be to detect differences. Power can be increased in two ways, either by decreasing variability between samples or by increasing the number of samples measured in each group. For small studies, there may be too much within-group variation to reliably detect differences between groups, and there may be too few subjects in each group to provide an appropriate sample of the population from which to estimate variance. However, there are relatively few publications regarding the sample size estimates required for class comparison studies or the development of prognostic markers.

Figure 2. RNA integrity may be tested using the Agilent Bioanalyzer system (Agilent Technologies Inc., Palo Alto, CA, USA). This shows the electropherograms from good quality RNA (right) and degraded RNA (left). The 28S and 18S ribosomal peaks are clearly visible on the right, and are also visible on the "virtual gel" to the right of the electropherogram.

It is also important that the array platform used should be reproducible with little variation between replicates. This can be achieved either with precise quality control steps built into the hybridisation process or by running technical replicates and averaging the signal intensities produced.

One of the most important limitations of microarray experiments is that they attempt to make phenotypic inferences based on the expression of messenger RNA. While changes in the level of messenger RNA give us an insight into genes which may be transcriptionally active, it is important to remember that changes at the messenger RNA level do not automatically translate into changes in protein expression or phenotype. Microarray data should be interpreted with some caution and results need to be validated. Several current prognostic markers and response predictors in breast cancer, such as oestrogen receptor, progesterone receptor and Her2/neu status, are measured at the protein level. It is interesting to note that in many of the microarray studies to date designed to produce gene expression prognosis prediction profiles, these functionally important genes are absent.

Hierarchical clustering is one of the most commonly used bioinformatics methods in the analysis of microarray experiments.
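
To make the idea of hierarchical clustering concrete, the following is a minimal sketch of agglomerative clustering with average linkage, using one minus the Pearson correlation as the distance between expression profiles. The sample names, expression values, and the choice of distance and linkage are assumptions made for illustration; they are not taken from any of the studies cited here, and published analyses typically use dedicated statistical software.

```python
from itertools import combinations
from statistics import mean


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles.

    Assumes neither profile is constant (otherwise the variance is zero).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / (var_x * var_y) ** 0.5


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    return mean(
        pearson_distance(profiles[a], profiles[b])
        for a in cluster_a
        for b in cluster_b
    )


def hierarchical_clusters(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [frozenset([name]) for name in profiles]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


# Hypothetical log-scale expression of four genes in four samples.
profiles = {
    "tumour_1": [9.0, 8.5, 2.0, 1.5],
    "tumour_2": [8.8, 8.9, 2.2, 1.9],
    "normal_1": [2.1, 1.8, 7.9, 8.2],
    "normal_2": [1.7, 2.3, 8.4, 8.0],
}

clusters = hierarchical_clusters(profiles, n_clusters=2)
for cluster in clusters:
    print(sorted(cluster))
```

Note that the procedure groups samples using only the numerical profiles: nothing in it carries information about what any gene does.
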
The problem with this technique, and with many of the other bioinformatics techniques used to analyse microarray experiments, is that it does not allow input of any prior functional information. For example, it is widely accepted that the oestrogen receptor is functionally important in breast cancer. However, the use of unsupervised methods of microarray analysis has led to the exclusion of this gene from the results of some experiments. This is counterintuitive; the fact that microarrays can give us new and unique insights does not mean we should abandon previously discovered associations. It is important to base the analysis of microarray data on previously described functional pathways and gene families so that the new data build upon our prior knowledge.

Several attempts have been made to predict survival of cancer patients in general [33–37] and of breast cancer patients in particular [11,13,38] on the basis of gene expression profiling. Several of these yielded gene sets whose expression profiles successfully predicted survival in breast cancer. However, the overlap between these gene sets is very low. For example, only 17 genes overlap between the 456 genes of the Norwegian study [30] and the 231 genes of the Dutch study [11], and there are several similar examples. This lack of agreement can be attributed to the use of different microarray platforms, different methods of sample preparation, mRNA extraction and data analysis, different patient groups with varying stages of disease, and genuine differences between patients. The lack of agreement between datasets calls into question the application of this technology to the entire breast cancer population. More recently, however, four different gene-expression based models, the intrinsic subtypes [24], wound response [38], Mammaprint [12] and Oncotype DX [31], were applied to a single data set of 295 samples and found to agree significantly in their outcome predictions for individual patients [39]. This suggests that although the named genes differ, the models are probably still tracking a common set of biological phenotypes. The difficulty with many of the large sets of prognostic marker genes produced by these studies is that they comprise a large amount of data, which would need to be reduced in order to be used clinically for repeated measurements with, for example, real-time quantitative PCR (RQ-PCR).

Another limitation is that microarrays differ in their type of probe (cDNA or oligonucleotide), manufacturer, composition of probes, deposition technology, and labelling and hybridisation protocol. They also differ in their coverage of the genome, the completeness and accuracy of their annotation and the specificity of their probes. All of this can lead to poor reproducibility between platforms, and therefore between experiments. In addition, the intended probe sequence can differ from the actual sequence synthesised or deposited on the microarray.

The Microarray Gene Expression Data (MGED) Society is an international organisation of biologists, computer scientists and data analysts that aims to facilitate the sharing of microarray data.
A major focus of this group has been the establishment of standards for microarray data annotation and exchange, facilitating the creation of microarray databases and related software packages. The MGED Society has published a set of requirements referred to as MIAME, the "minimum information about a microarray experiment", to allow unambiguous interpretation and reproduction [40]. Several major scientific journals have now adopted MIAME recommendations as a requirement for publication of a microarray experiment.

The statistical software R [41] and the associated Bioconductor Project [42] have become an important bioinformatics resource for analysis of genomic data. The Bioconductor Project commenced in 2001 and is based primarily at the Fred Hutchinson Cancer Research Center in Seattle. The project aims to provide access to a wide range of powerful statistical and graphical methods for analysis of genomic data, and to facilitate the integration of biological metadata in experimental data analysis. It also aims to facilitate rapid development of biostatistical software and training in computational and statistical methods, and to promote high quality documentation and reproducible research. Bioconductor is primarily based on the R programming language [41] and it integrates the most advanced analysis tools that are currently available. It is regularly updated, and all software components are available free. Available packages include those for pre-processing Affymetrix and cDNA array data, identifying differentially expressed genes and plotting genomic data. A broad range of R-based statistical and graphical techniques is available, including linear and non-linear modelling, cluster analysis, prediction, re-sampling, survival analysis and time-series analysis. These can be accessed through the Bioconductor website (http://www.bioconductor.org) and are also described in detail in the recent publication "Bioinformatics and Computational Biology Solutions Using R and Bioconductor" by Gentleman et al. [43]
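
The packages mentioned above are written in R. Purely as a language-neutral illustration of two of the ideas listed (identifying differentially expressed genes and re-sampling), the sketch below runs an exact two-sided permutation test on a single gene in Python. It is not Bioconductor code; the function name, group labels and expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means.

    Every way of relabelling the pooled values into groups of the original
    sizes is enumerated, so this is only practical for small groups.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding differences.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical log-scale expression of one gene in two classes of four samples.
responders = [8.1, 8.3, 7.9, 8.2]
non_responders = [6.0, 6.2, 5.9, 6.1]
p_up = exact_permutation_p(responders, non_responders)

# A gene with no real difference between the classes.
p_flat = exact_permutation_p([5.0, 5.2, 4.9, 5.1], [5.1, 4.9, 5.2, 5.0])

print(f"clearly different gene: p = {p_up:.4f}")
print(f"unchanged gene:         p = {p_flat:.4f}")
```

With four samples per class there are only 70 possible relabellings, so the smallest attainable two-sided p-value is 2/70. This is one concrete way to see why small studies have limited power. When thousands of genes are tested, a multiple-testing correction is also needed.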

Tissue microarrays

Tissue microarray (TMA) technology, originally described in 1998 by Kononen et al. [44], allows the simultaneous analysis of hundreds of tissue specimens in a single experiment (Fig. 3). A TMA is assembled by taking core biopsies of paraffin-embedded tissues and re-embedding them on a single arrayed "master block". More than 600 individual tissue cores may be represented on a single block and this may be further sectioned to produce 100–200 slides for analysis. TMAs are amenable to a variety of techniques such as immunohistochemistry for protein expression and fluorescence in situ hybridisation (FISH) to detect DNA alterations [44–49]. Unlike gene arrays, where thousands of genes may be examined in a single experiment, TMAs generally examine a single gene product per experiment but in a large number of samples [50].


Advantages and limitations of TMA technology

The increasing use of gene expression microarrays has led to the discovery of many potentially useful biomarkers for cancer diagnosis, classification, prognosis and individualised patient therapy. Individual genes with such potential uses need to be validated first by technologies such as RQ-PCR; the critical next step, however, is to ensure that these markers are biologically relevant at the protein level [50]. TMA technology has developed in response to this need, and has been used for validating gene expression studies for many cancers such as breast, colon, prostate and kidney [51–53].

Before the development of TMA technology an experiment to analyse the expression of a single biomarker in a large cohort of tissue samples may have taken several weeks to complete, involving the processing and staining of hundreds of individual slides. Using TMAs, multiple biomarkers can be assessed using serial sections of an array, possibly on the same day and at much less cost, as only a small amount of reagent is required to analyse an entire cohort. There is more uniformity compared to traditional experiments as all samples on an array are treated using identical reagent concentrations and incubation times, reducing the likelihood of experimental error [54]. Perhaps the most important advantage of TMA technology, however, relates to archival tissue management. TMA technology enables scientists to perform a vast number of arrays per tissue sample (50,000–75,000 compared to 75–100 using standard histological sections), therefore maximising the use of a limited and extremely important resource [55].

Figure 3. Building a tissue microarray (TMA).

The main concern regarding TMA technology is whether such a small sample of tissue can be regarded as truly representative, and several studies have addressed this issue. Camp et al. [54] compared the staining of 2–10 TMA disks with the whole tissue sections from which they were derived in 38 cases of invasive breast cancer for representation of hormone receptor and Her2/neu oncogene antigens. This experiment showed that analysis of only two 0.6 mm TMA spots resulted in more than 95% accuracy. TMAs have also been used successfully to identify markers in more heterogeneous tumour types such as Hodgkin's lymphoma [56] and prostate cancer [57], provided that sufficient redundancy is present, i.e. the selection of 2–10 cores per tumour sample to decrease sampling error and minimise the impact of tissue loss during processing [50]. A certain level of technical expertise is necessary, however, to reliably distinguish tumour tissue from surrounding stroma and normal epithelium, and so experience in microarray construction is an important factor. In addition there are technical aspects regarding the staining of the TMAs, as the techniques used differ from those used for the staining of larger tissue sections. There is also the subjective nature of analysing intensities of microarray spots, as many factors such as background staining and stromal staining may influence a pathologist's decision [50], and routine use of image analysis software will therefore be necessary to enable automated quantitative analysis [58].

Comparative genomic hybridisation

Comparative genomic hybridisation (CGH) is a recently developed technique in molecular genetics that enables genome-wide screening for DNA sequence copy number changes. This technique allows identification of genetic abnormalities where there is a gain (duplication, insertion or amplification) or loss (deletion) of material. Such chromosomal abnormalities are believed to play a key role in cancer progression as genes involved in cellular proliferation and differentiation may be affected.
In addition, gains in genetic material may result in over-expression of oncogenes (genes which stimulate cell growth), and loss of material may result in under-expression of tumour suppressor genes.

Principle of comparative genomic hybridisation

To perform CGH (Fig. 4) two genomic DNA samples are used; the test sample is extracted from a tumour and the reference sample is extracted from normal tissue. The DNA samples are labelled with different fluorochromes such as tetramethyl rhodamine isothiocyanate (TRITC), which fluoresces red, and fluorescein isothiocyanate (FITC), which fluoresces green. Equal amounts of test and reference DNA are mixed together in the presence of excess human Cot-1 DNA to prevent binding of repeat sequences present in both samples.

In conventional CGH the samples are then denatured and co-hybridised to metaphase chromosome spreads from a normal control. Competitive hybridisation then takes place whereby the labelled DNA samples compete with each other for hybridisation sites along the chromosomes. If the test sample has a gain of genetic material, more of it will bind to the corresponding chromosome region on the metaphase spread. Likewise, a loss of genetic material will allow increased binding of the reference sample (normal DNA) to the corresponding region on the metaphase spread. Following hybridisation, excess DNA is washed away and analysis is performed using a fluorescence microscope and digital image analysis. This allows the ratio of red to green fluorescence to be determined along each chromosome, indicating areas of gain or loss of genetic material.

Array comparative genomic hybridisation

One of the limitations of conventional CGH using metaphase chromosome spreads as targets for hybridisation is that only relatively large gains or losses in genomic material can be detected. An important advance in this technology has been the development of microarrays (array CGH) replacing conventional chromosomal CGH [59–61]. With this technique DNA sequences from evenly spaced loci along the entire genome are immobilised on glass slides and used as targets for hybridisation. This allows for higher resolution than that achieved by chromosomal CGH, with the detection of smaller amplifications and deletions. Several types of DNA clones can be used as targets for array CGH, such as bacterial artificial chromosomes (BACs) [59,60,62], cDNA fragments [61] and oligonucleotides [63,64]; however, the basic principle remains the same. As with chromosomal CGH, the fluorescence ratios for each BAC spot on an array can be calculated, and as the location of each BAC in the genome is known, the ratios can be converted into a genome-wide copy number profile. Genome-wide marker-based arrays such as BACs sample the genome at megabase intervals.
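
The conversion of per-clone fluorescence ratios into a genome-wide copy number profile, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the clone positions and intensities are invented, and the gain and loss thresholds of ±0.3 on the log2 scale are hypothetical values chosen for the example, not recommended cut-offs. Real analyses also normalise the ratios and smooth or segment them along each chromosome.

```python
import math


def copy_number_profile(clones, gain_threshold=0.3, loss_threshold=-0.3):
    """Return (chromosome, position_mb, log2 ratio, call) sorted by genomic position.

    `clones` holds (chromosome, position in Mb, test intensity, reference
    intensity) for each array element whose genomic location is known.
    """
    profile = []
    for chromosome, position_mb, test_intensity, reference_intensity in clones:
        log_ratio = math.log2(test_intensity / reference_intensity)
        if log_ratio >= gain_threshold:
            call = "gain"
        elif log_ratio <= loss_threshold:
            call = "loss"
        else:
            call = "no change"
        profile.append((chromosome, position_mb, round(log_ratio, 2), call))
    # Ordering by chromosome and position turns spot data into a profile.
    return sorted(profile, key=lambda row: (row[0], row[1]))


# Hypothetical array elements, listed in arbitrary (array layout) order.
clones = [
    (17, 35.1, 3200.0, 800.0),
    (16, 68.0, 500.0, 1000.0),
    (1, 150.0, 1000.0, 1000.0),
]

profile = copy_number_profile(clones)
for chromosome, position_mb, log_ratio, call in profile:
    print(f"chr{chromosome} at {position_mb} Mb: log2 ratio {log_ratio:+.2f} ({call})")
```
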
Snijders et al. [62] constructed a microarray consisting of approximately 2400 BAC clones across the human genome. Although such arrays using large insert clones (LICs) allow copy number changes to be assessed on a genome-wide scale, there are still large gaps between those clones where no genomic material is obtainable. More recently Ishkanian et al. [65] have used a tiling design of 32,433 overlapping BACs to produce the first submegabase resolution tiling set (SMRT) array covering the complete human genome.

The choice of array platform for CGH studies depends on a variety of factors such as DNA quality and quantity. DNA quality may be compromised in archival formalin-fixed paraffin-embedded specimens and quantity may be limited when analysing small tumour specimens. However, large insert clones such as BACs can capture signals from such DNA samples for genome-wide analysis, requiring 200–400 ng of DNA per experiment. Oligonucleotide and cDNA platforms typically require microgram amounts of DNA and are useful for more detailed analysis when quality and quantity are not limited [66].

Figure 4. Comparative genomic hybridisation. Genomic DNA is isolated from both the tumour sample and the normal reference sample. These are labelled with different fluorochromes and mixed in the presence of excess Cot-1 DNA to prevent binding of repetitive sequences. In conventional chromosomal CGH these are hybridised to normal metaphase chromosomes and the ratio of fluorescence intensities along each chromosome is analysed. As seen above, increased DNA copy number (amplification) in the tumour sample will be detected by increased red fluorescence, whereas decreased copy number in the tumour sample will allow more binding of the normal DNA and increased green fluorescence. On the right, a similar hybridisation to a cDNA array permits measurement of copy number at a higher resolution. The red and green spots on the fluorescence image represent increased and decreased copy number changes, respectively.

Uses of comparative genomic hybridisation in cancer research

CGH was first described in 1992 by Kallioniemi, who used it for the identification of 16 previously unknown regions of amplification in tumour cell lines and primary bladder tumours [67]. Since then it has been utilised in the characterisation of many different tumour types including ovarian [68,69] and colorectal [70,71].

CGH has provided a useful contribution to our understanding of the molecular genetics of breast cancer, particularly in relation to ductal carcinoma in situ (DCIS) [72]. We now know that DCIS is a genetically advanced lesion with widespread DNA copy number changes [73,74]. Characterisation of well, intermediate and poorly differentiated DCIS tumours has suggested that these different tumour types have genetically distinct patterns of DNA gain and loss [75,76]. For example, well differentiated DCIS is characterised most frequently by loss of 16q and gain of 1q, while poorly differentiated DCIS frequently displays localised amplifications such as 11q13 (CCND1) and 17q12 (ERBB2). CGH studies support the theory that DCIS is a precursor of invasive carcinoma, with common patterns of alteration between in situ and adjacent invasive tumours [74,77]. Close genetic similarities have also been found to exist between well, intermediate and poorly differentiated DCIS and distinct morphological types of invasive breast cancer [75]. We have previously used CGH to identify two regions of copy number change (gain on 5p and deletion on 16q) which correlate with lobular breast carcinoma [78].

In addition to its use in the characterisation of different tumour types, CGH has also been of use in identifying karyotypic aberrations associated with different outcomes in terms of development of recurrence in breast cancer patients [79]. In this CGH analysis of 40 primary breast cancers it was found that loss of 16q (E-cadherin) and gain of 16p correlated with increased disease-free survival, while patients who had loss of 17p13 (p53) and amplification of 17q12 (HER2) were more likely to have disease recurrence at a mean follow-up of 8.4 years.

Disease-specific arrays have been constructed for cancer diagnostics for some tumour types such as chronic lymphocytic leukaemia and certain types of lymphoma [80–82]. These arrays allow analysis of multiple known cancer gene loci to detect gains and losses of tumour suppressor genes and oncogenes. For array CGH to become more prominent in clinical diagnostics, factors such as cost and standardisation of protocols need to be addressed. Analysis of array CGH data is also quite complex. Data visualisation involves conversion of spot image data to locus-specific copy number ratios, and linking of the array elements to their genomic positions. Advances in array CGH software pertaining to ease of use, interpretation and visualisation will also be necessary [66].

GenBank, UniGene and Digital Differential Display

GenBank is a comprehensive database built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) in the United States. GenBank contains publicly available DNA sequences for more than 165,000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. It continues to grow at an exponential rate and as of August 2004 it contained over 41.8 billion nucleotide bases from 37.3 million individual sequences. ESTs (expressed sequence tags) continue to be the major source of new sequence records and gene sequences, comprising over 12 billion nucleotide bases [83]. GenBank sequences are automatically partitioned into gene-orientated clusters using the UniGene system. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed, and the map location.

Digital Differential Display (DDD) is a data-mining bioinformatics tool available to researchers through the NCBI website. With DDD, comparisons can be made between the EST-based expression profiles of two or more gene libraries represented in UniGene. This allows the identification of genes whose expression differs among libraries from different tissues, making it possible to identify which genes may be contributing to a cell's unique characteristics.
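
The kind of comparison DDD makes between two libraries can be sketched with Fisher's exact test, the test this section describes it as using. The following is a minimal stand-alone illustration, not the NCBI implementation: the function name, library sizes and EST counts are hypothetical, and the two-sided p-value is computed by summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Here a and c are the ESTs matching a gene in libraries 1 and 2, and
    b and d are the remaining ESTs in each library.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of x matching ESTs in library 1.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = table_probability(a)
    low = max(0, col1 - row2)
    high = min(row1, col1)
    p_value = sum(
        p
        for p in (table_probability(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(p_value, 1.0)


# Hypothetical counts: 30 of 2000 tumour-library ESTs and 5 of 2000
# normal-library ESTs belong to the gene of interest.
p_different = fisher_exact_two_sided(30, 1970, 5, 1995)

# The same gene frequency in both libraries.
p_same = fisher_exact_two_sided(10, 1990, 10, 1990)

print(f"30/2000 vs 5/2000:   p = {p_different:.2e}")
print(f"10/2000 vs 10/2000:  p = {p_same:.2f}")
```

Because the test works on raw counts, small libraries rarely produce significant results even when the proportions differ, which is consistent with DDD restricting comparisons to libraries with many sequences.
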
Similarly, DDD may be used to try to identify genes for which the expression levels differ between normal, pre-malignant and malignant tissues.84 DDD uses Fisher’s exact test to identify only those genes with statistically different expression levels between different tissues (p ≤ 0.05). Also, only libraries with over 1000 sequences in UniGene are used for DDD. This limits the capabilities of the analysis, but unless there are a large number of sequences in each library, differences in gene frequencies are generally not found to be statistically significant. This highlights the importance of making more libraries publicly available and of making comparisons using standardised controls. The Cancer Genome Anatomy Project (CGAP) of the National Cancer Institute (NCI) has been invaluable in making more libraries publicly available for comparison. Using DDD, ESTs from various tumour types can be analysed for differential expression and an electronic expression profile and chromosomal map position of these sequences can be generated from the UniGene database. Scheurle et al.85 have shown that genes known to be up-regulated in breast, pancreatic and prostate cancers were correctly identified by DDD, demonstrating the usefulness of this technique. Two hundred known genes were discovered to be differentially expressed in these tissues and results were validated for expression specificity using reverse transcriptase-PCR. The colon cancer related gene (CCRG) has been discovered using DDD and has been verified by immunohistochemistry in paraffin sections of colon tumour samples.86 Prostate epithelium-derived Ets transcription factor (PDEF) is implicated in the regulation of proliferation, differentiation and oncogenic development. By using two independent methods, including DDD, it was demonstrated that PDEF is overexpressed in 70% of human breast tumours and is barely expressed in most normal human tissue except prostate and trachea.87

The challenges of bioinformatics

Cancer research is a worldwide enterprise which results in the production of an enormous amount of data each year. As a result, the application of computational tools in cancer research has become an important and rapidly developing field. Bioinformatics has developed primarily to address analysis of data-rich systems in genomics and proteomics; however, large datasets are also produced in cell biology, physiology, pathology, therapeutics, clinical trials and epidemiology. To maximise the impact of results generated by researchers on patients’ diagnosis and treatment, collaboration between various research groups is essential. In the UK the National Cancer Research Institute (NCRI) Informatics Initiative was established in 2003 with the aim of bringing together data gained in all areas of cancer research into one fully integrated and accessible knowledge base.
The NCRI has a strong working relationship with key international organisations addressing the challenges of bioinformatics, including the National Cancer Institute Center for Bioinformatics (NCICB) in the United States and the European Bioinformatics Institute (EBI). These organisations are working towards a common goal: to allow individual researchers to have access to high-quality, well-described datasets. This needs to be done in a way that ensures the highest ethical and scientific standards, respecting the rights of the people whose personal data is involved and the creativity of those who have collected and created the datasets. Individual researchers will also require access to the bioinformatics tools necessary for the analysis of this data. Much progress has already been made in establishing links between leading international organisations, and the success of initiatives such as the Bioconductor Project described above is very encouraging. However, there are still many important obstacles that need to be overcome before such large-scale data sharing can become routine. These challenges include ethical considerations such as consent and confidentiality, concern over misuse of data, and strong personal ownership of datasets. It will be necessary to develop common vocabularies and data standards not only for use within a specific discipline such as genomics but across all areas of cancer research and also across international boundaries. An appropriate infrastructure needs to be provided, allowing individual laboratories to have access to bioinformatics expertise, as well as the development and provision of the necessary software. The provision of training for cancer researchers in the use of the databases and infrastructure, and the funding necessary to achieve this, will also be key.88

Conclusion

The technologies described above are providing unique insights into physiological and pathological processes at a cellular level and are of immense importance in ongoing cancer research. As with any new technology, it is nonetheless important to interpret their findings with a degree of caution, and to be aware of the limitations and potential sources of error in these experiments. These systems rely on newly developed computational tools to analyse and visualise such large volumes of data and are driving the development of the bioinformatics field. Digital differential display is just one example of the many bioinformatics data mining tools currently in use. The use of such tools in identifying cancer-related genes is a remarkable advance and such “in silico” experiments are likely to play an important role in the future.

References

1. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995;270(5235):467–70.
2. Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM. Expression profiling using cDNA microarrays. Nat Genet 1999;21(1 Suppl):10–4.
3. Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y. Design and analysis of DNA microarray investigations. New York: Springer; 2003.
4. Bittner M, Meltzer P, Chen Y, et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000;406(6795):536–40.
5. Tanaka TS, Jaradat SA, Lim MK, Kargul GJ, Wang X, Grahovac MJ. Genome-wide expression profiling of mid-gestation placenta and embryo using a 15,000 mouse developmental cDNA microarray. Proc Natl Acad Sci USA 2000;97(16):9127–32.
6. Dobbin K, Simon R. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 2005;6(1):27–38.
7. Berns EM, van Staveren IL, Verhoog L, et al. Molecular profiles of BRCA1-mutated and matched sporadic breast tumours: relation with clinico-pathological features. Br J Cancer 2001;85(4):538–45.
8. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z. Tissue classification with gene expression profiles. J Comput Biol 2000;7(3-4):559–83.
9. Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol 2002;9(3):505–11.
10. West M, Blanchette C, Dressman H, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 2001;98(20):11462–7.
11. van ’t Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415(6871):530–6.
12. van de Vijver MJ, He YD, van ’t Veer LJ, et al. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002;347(25):1999–2009.
13. Wang Y, Klijn JG, Zhang Y, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005;365(9460):671–9.
14. Alizadeh AA, Ross DT, Perou CM, van de Rijn M. Towards a novel classification of human malignancies based on gene expression patterns. J Pathol 2001;195(1):41–52.
15. McShane LM, Radmacher MD, Freidlin B, Yu R, Li MC, Simon R. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 2002;18(11):1462–9.
16. Kozian DH, Kirschbaum BJ. Comparative gene-expression analysis. Trends Biotechnol 1999;17(2):73–8.
17. Gray NS, Wodicka L, Thunnissen AM, et al. Exploiting chemical libraries, structure, and genomics in the search for kinase inhibitors. Science 1998;281(5376):533–8.
18. Marton MJ, DeRisi JL, Bennett HA, et al. Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat Med 1998;4(11):1293–301.
19. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531–7.
20. Alizadeh AA, Eisen MB, Davis RE, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000;403(6769):503–11.
21. Bhattacharjee A, Richards WG, Staunton J, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 2001;98(24):13790–5.
22. Yeoh EJ, Ross ME, Shurtleff SA, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 2002;1(2):133–43.
23. Dyrskjot L, Thykjaer T, Kruhoffer M, et al. Identifying distinct classes of bladder carcinoma using microarrays. Nat Genet 2003;33(1):90–6.
24. Perou CM, Sorlie T, Eisen MB, et al. Molecular portraits of human breast tumours. Nature 2000;406(6797):747–52.
25. Gruvberger S, Ringner M, Chen Y, et al. Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res 2001;61(16):5979–84.
26. Hedenfalk I, Duggan D, Chen Y, et al. Gene-expression profiles in hereditary breast cancer. N Engl J Med 2001;344(8):539–48.
27. Sorlie T, Perou CM, Tibshirani R, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 2001;98(19):10869–74.
28. Ahr A, Karn T, Solbach C, et al. Identification of high risk breast-cancer patients by gene expression profiling. Lancet 2002;359(9301):131–2.
29. Huang E, Cheng SH, Dressman H, et al. Gene expression predictors of breast cancer outcomes. Lancet 2003;361(9369):1590–6.
30. Sorlie T, Tibshirani R, Parker J, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 2003;100(14):8418–23.
31. Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004;351(27):2817–26.
32. Mauriac L, Debled M, MacGrogan G. When will more useful predictive factors be ready for use? Breast 2005;14(6):617–23.
33. Khan J, Wei JS, Ringner M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7(6):673–9.
34. Beer DG, Kardia SL, Huang CC, et al. Gene expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 2002;8(8):816–24.
35. Nguyen DV, Rocke DM. Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 2002;18(12):1625–32.
36. Rosenwald A, Wright G, Chan WC, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 2002;346(25):1937–47.
37. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol 2004;2(4):E108.
38. Ramaswamy S, Ross KN, Lander ES, Golub TR. A molecular signature of metastasis in primary solid tumors. Nat Genet 2003;33(1):49–54.
39. Fan C, Oh DS, Wessels L, et al. Concordance among gene-expression-based predictors for breast cancer. N Engl J Med 2006;355(6):560–9.
40. Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nat Genet 2001;29(4):365–71.
41. The R project for statistical computing. http://www.R-project.org.
42. Bioconductor. http://www.bioconductor.org.
43. Gentleman R, Carey VJ, Huber W, Irizarry RA, Dudoit S. Bioinformatics and computational biology solutions using R and Bioconductor. New York: Springer; 2005.
44. Kononen J, Bubendorf L, Kallioniemi A, et al. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med 1998;4(7):844–7.
45. Bubendorf L, Kolmer M, Kononen J, et al. Hormone therapy failure in human prostate cancer: analysis by complementary DNA and tissue microarrays. J Natl Cancer Inst 1999;91(20):1758–64.
46. Moch H, Schraml P, Bubendorf L, et al. High throughput tissue microarray analysis to evaluate genes uncovered by cDNA microarray screening in renal cell carcinoma. Am J Pathol 1999;154(4):981–6.
47. Mucci NR, Akdas G, Manely S, Rubin MA. Neuroendocrine expression in metastatic prostate cancer: Evaluation of high throughput tissue microarrays to detect heterogeneous protein expression. Hum Pathol 2000;31(4):406–14.
48. Perrone EE, Theoharis C, Mucci NR, et al. Tissue microarray assessment of prostate cancer tumor proliferation in African-American and white men. 2000;92(11):937–39.
49. Schraml P, Kononen J, Bubendorf L, et al. Tissue microarrays for gene amplification surveys in many different tumor types. Clin Cancer Res 1999;5(8):1966–75.
50. Giltnane JM, Rimm DL. Technology insight: identification of biomarkers with tissue microarray technology. Nat Clin Pract Oncol 2004;1(2):104–11.
51. van de Rijn M, Gilks CB. Applications of microarrays to histopathology. Histopathology 2004;44(2):97–108.
52. Bertucci F, Salas S, Eysteries S, et al. Gene expression profiling of colon cancer by DNA microarrays and correlation with histoclinical parameters. Oncogene 2004;23(7):1377–91.
53. Dhanasekaran SM, Barrette TR, Ghosh D, et al. Delineation of prognostic biomarkers in prostate cancer. Nature 2001;412:822–6.
54. Camp RL, Charette LA, Rimm DL. Validation of tissue microarray technology in breast carcinoma. Lab Invest 2000;80(12):1943–9.
55. Rimm DL, Camp RL, Charette LA, Costa J, Olsen DA, Reiss M. Tissue microarray: a new technology for amplification of tissue resources. Cancer J 2001;7(1):24–31.
56. Garcia JF, Camacho FI, Morente M, et al. Hodgkin and Reed-Sternberg cells harbor alterations in the major tumor suppressor pathways and cell-cycle checkpoints: analysis using tissue microarrays. Blood 2003;101(2):681–9.
57. Rubin MA, Dunn R, Strawderman M, Pienta KJ. Tissue microarray sampling strategy for prostate cancer biomarker analysis. Am J Surg Pathol 2002;26(3):312–9.
58. Thiellet C. Full speed ahead for tumour screening. Nat Med 1998;4(7):767–8.
59. Solinas-Toldo S, Lampel S, Stilgenbauer S, et al. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer 1997;20(4):399–407.
60. Pinkel D, Segraves R, Sudar D, et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998;20(2):207–11.
61. Pollack JR, Perou CM, Alizadeh AA, et al. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 1999;23(1):41–6.
62. Snijders AM, Nowak N, Segraves R, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet 2001;29(3):263–4.
63. Lucito R, West J, Reiner A, et al. Detecting gene copy number fluctuations in tumour cells by microarray analysis of genomic representations. Genome Res 2000;10(11):1726–36.
64. Lucito R, Healy J, Alexander J, et al. Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res 2003;13(10):2291–305.
65. Ishkanian AS, Malloff CA, Watson SK, et al. A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 2004;36(3):299–303.
66. Lockwood WW, Chari R, Chi B, Lam WL. Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet 2006;14(2):139–48.
67. Kallioniemi A, Kallioniemi OP, Sudar D, et al. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumours. Science 1992;258(5083):818–21.
68. Kiechle M, Jacobsen A, Schwarz-Boeger U, Hedderich J, Pfisterer J, Arnold N. Comparative genomic hybridization detects genetic imbalances in primary ovarian carcinomas as correlated with grade of differentiation. Cancer 2001;91(3):534–40.
69. Dent J, Hall GD, Wilkinson N, Perren TJ, et al. Cytogenetic alterations in ovarian clear cell carcinoma detected by comparative genomic hybridisation. Br J Cancer 2003;88(10):1578–83.
70. Aragane H, Sakakura C, Nakanishi M, et al. Chromosomal aberrations in colorectal cancers and liver metastases analyzed by comparative genomic hybridization. Int J Cancer 2001;94(5):623–9.
71. Nakao K, Shibusawa M, Ishihara A, et al. Genetic changes in colorectal carcinoma tumors with liver metastases analyzed by comparative genomic hybridization and DNA ploidy. Cancer 2001;91(4):721–6.
72. Reis-Filho JS, Simpson PT, Gale T, Lakhani SR. The molecular genetics of breast cancer: The contribution of comparative genomic hybridization. Pathol Res Pract 2005;201(11):713–25.
73. Moore E, Magee H, Coyne J, Gorey T, Dervan PA. Widespread chromosomal abnormalities in high-grade ductal carcinoma in situ of the breast. Comparative genomic hybridization study of pure high-grade DCIS. J Pathol 1999;187(4):403–9.
74. Buerger H, Otterbach F, Simon R, et al. Comparative genomic hybridization of ductal carcinoma in situ of the breast - evidence of multiple genetic pathways. J Pathol 1999;187(4):396–402.
75. Buerger H, Otterbach F, Simon R, et al. Different genetic pathways in the evolution of invasive breast cancer are associated with distinct morphological subtypes. J Pathol 1999;189(4):521–6.
76. Buerger H, Mommers EC, Littmann R, et al. Ductal invasive G2 and G3 carcinomas of the breast are the end stages of at least two different lines of genetic evolution. J Pathol 2001;194(2):165–70.
77. Aubele M, Mattis A, Zitzelsberger H, et al. Extensive ductal carcinoma in situ with small foci of invasive ductal carcinoma: evidence of genetic resemblance by CGH. Int J Cancer 2000;85(1):82–6.
78. Loveday RL, Greenman J, Simcox DL, et al. Genetic changes in breast cancer detected by comparative genomic hybridisation. Int J Cancer 2000;86(4):494–500.
79. Hislop RG, Pratt N, Stocks SC, et al. Karyotypic aberrations of chromosomes 16 and 17 are related to survival in patients with breast cancer. Br J Surg 2002;89(12):1581–6.
80. Greshock J, Naylor TL, Margolin A, et al. 1-Mb resolution array-based comparative genomic hybridization using a BAC clone set optimized for cancer gene analysis. Genome Res 2004;14(1):179–87.
81. Schwaenen C, Nessling M, Wessendorf S, et al. Automated array-based genomic profiling in chronic lymphatic leukaemia: development of a clinical tool and discovery of recurrent genomic alterations. Proc Natl Acad Sci USA 2004;101(4):1039–44.
82. Kohlhammer H, Schwaenen C, Wessendorf S, et al. Genomic DNA-chip hybridization in t(11;14)-positive mantle cell lymphomas shows a high frequency of aberrations and allows a refined characterization of consensus regions. Blood 2004;104(3):795–801.
83. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic Acids Res 2002;30(1):17–20.
84. Pontius JU, Wagner L, Schuler GD. UniGene: a unified view of the transcriptome. NCBI handbook. Bethesda, MD: NCBI; 2003.
85. Scheurle D, DeYoung MP, Binninger DM, Page H, Jahenzeb B, Narayanan R. Cancer gene discovery using digital differential display. Cancer Res 2000;60(15):4037–43.
86. De Young MP, Damania H, Scheurle D, Zylberberg C, Narayanan R. Bioinformatics-based discovery of a novel factor with apparent specificity to colon cancer. In Vivo 2002;16(4):239–48.
87. Ghadersohi A, Sood AK. Prostate epithelium derived Ets transcription factor mRNA is overexpressed in human breast tumours and is a candidate breast tumour marker and a breast tumour antigen. Clin Cancer Res 2001;7(9):2731–8.
88. National Cancer Research Institute (NCRI). Strategic framework for the development of cancer research informatics in the UK. NCRI; 2003.