Research Update
TRENDS in Genetics Vol.17 No.12 December 2001
Meeting Report
The morning after
Michele Clamp
The joint Cold Spring Harbor Laboratory/Wellcome Trust Conference on Genome Informatics was held at Hinxton Genome Campus, Cambridge, UK, from 8 to 12 August 2001.
Imagine the English countryside bathed in warm sunshine, with Pimm’s and strawberries on the lawn and the gentle murmur of international genome bioinformaticians as they stroll around the Hinxton lake. Well, that’s what we did. Imagine, that is. It might be owing to the American influence on this, the first joint CSHL/Wellcome Trust conference, but no one thought to worry about the English weather. Amidst the sessions on gene prediction and genome assembly, we were treated to a fine display of how many different weather conditions can occur in Cambridgeshire on one August afternoon.
Safely ensconced inside the auditorium, we found the range of talks no less impressive. Gone was the frenzy that surrounded last year’s draft human genome announcement, and 250 people were gathered together to face the morning after the night before, soberly assessing what is and is not possible with this huge, shifting mass of data.
A few days after the draft human genome was announced, my mother phoned and rather touchingly, yet somewhat infuriatingly, asked how we were enjoying all this spare time now everything was all wrapped up. The talk by Colin Semple (Dept Medical Sciences, University of Edinburgh, UK) comparing the assembly of the Celera sequence [1] with the two public assemblies from UCSC and NCBI (based on the International Human Genome Sequencing Consortium sequence [2]) emphasized how far we still have to go before we have anything approaching a stable human genome sequence. Although only a small fraction of the genome was compared (4 Mb or 0.1%), several interesting things emerged. First, the amount of sequence that differed between the Celera and public assemblies was very small: only 6% of the sequence was unique to Celera.
Bearing that in mind, the UCSC assembly managed to cover more than twice the amount of sequence that Celera did. Second, even though this is a very small region, all groups had misassemblies, with the NCBI assembly having the most (four mistakes in this region). Of course, one cannot extrapolate from these numbers to assess what the rest of the genome is like, but it does emphasize that the draft assembly is exactly that, and although it can reveal many interesting things, we should treat it very carefully.
A very important dataset to emerge from the sequencing of a whole genome is the set of expressed proteins, the proteome. Although the temptation to compare proteomes is almost irresistible, it was not forgotten that finding the 3% of the genome that codes for proteins is by no means a done deal. Interestingly, there were no talks on ab initio gene prediction. Instead, people are knuckling down and using all the available genomic sequence data to mop up as much annotation as possible.
The big new bioinformatics challenge in the genome world is the mouse whole genome shotgun data, which, the theory goes, should make annotating the human genome a walk in the park. As always, statements of this kind are made by people who do not have to do the work, and reality is showing that things are more complicated than some people have assumed. Paul Flicek (Washington University, St Louis, USA) presented a method that is able to use the draft human sequence and the mouse shotgun sequence to predict gene structures – a departure from many currently used annotation methods, which rely on proteins, cDNAs and expressed sequence tags.
The problem with using genomic sequence is that many of the conserved regions between human and mouse are in noncoding regions of the genome. In fact, Flicek reported that more than 50% of the matches between the two genomes are not conserved exons but other pieces of genome that, for whatever reason, have remained conserved for the ~90 million years since the human–mouse separation. Flicek’s method, Twinscan, uses a probabilistic model to differentiate the coding from noncoding matches and attempts to build gene structures from them [3]. It currently performs significantly better than Genscan [4], which is generally considered to be the best ab initio human gene prediction method around. It might not be a silver bullet for genome annotation, but this and similar methods show great promise for making sense of large genome comparisons.
With datasets of this size and many different groups around the world working on interpreting the data, communication and comparison of results become difficult. There were two presentations addressing different aspects of this problem – the Distributed Annotation System (DAS; http://biodas.org/) by Lincoln Stein (Cold Spring Harbor Laboratory, NY, USA) and the Gene Ontology project (GO) [5] by Midori Harris (European Bioinformatics Institute, Hinxton, UK). These projects have a common characteristic in that they do not claim to have all the answers; Midori Harris stated explicitly, ‘This is not a dictated standard.’ Both projects aim to be a framework for people to exchange data easily, with the capacity to evolve where necessary. The message is, ‘Hey, it may not be the perfect solution but we’re all working on the same imperfect solution.’ This should be a lesson to all those who come up with the bright idea, ‘Why don’t we make all the different databases talk to each other?’ It’s not the databases that need to talk to each other; make it easy for the people to do the talking, and the databases will follow suit.
0168-9525/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0168-9525(01)02546-X
And what happened to the Pimm’s on the lawn? Who cares? Any drink whose recipe includes cucumber deserves to be left out in the rain.
References
1 Venter, J.C. et al. (2001) The sequence of the human genome. Science 291, 1304–1351
2 International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921
3 Korf, I. et al. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17, S140–S148
4 Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94
5 Ashburner, M. et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29
Michele Clamp The Wellcome Trust, Sanger Institute, Hinxton, Cambridge, UK CB10 1SA. e-mail:
[email protected]
Chromosomes on the move
Steven Henikoff
The fourteenth John Innes Symposium on Chromosome Dynamics and Expression was held at the John Innes Institute in Norwich, UK, from 5 to 8 September 2001.
In his Keynote Address, David Botstein (Stanford University, CA, USA) made an apt comparison between the beginning of genetics as a discipline and the beginning of the so-called ‘post-genomic era’. William Bateson, the first Director of the John Innes Institute, was the first to fully grasp the importance of genetics as a primary tool for understanding biology. Bateson introduced language that has survived to the present day, including ‘homozygote’, ‘heterozygote’, ‘F1’, ‘F2’ and, of course, ‘genetics’. A hundred years later, we are witnessing a new paradigm in genetics made possible by the availability of the sequences of all the genes in a genome, and a common language describing the discipline is needed. As Botstein pointed out, the Gene Ontology language provides categories of biological processes, molecular function and cellular components to help in transforming information into knowledge [1]. This language can be used to facilitate understanding of biological processes, such as the cell cycle, when confronted with overwhelming amounts of data generated by microarrays.
Clearly, studies of gene expression have profited from sophisticated new genomics and bioinformatics tools. However, transcription is only one of several dynamic processes involving chromosomes, and comparably powerful tools are needed to help understand less tractable phenomena, such as condensation, nuclear movements and mitosis.
New tools for the job
Help is on the way, with cytological reagents based on protein fusions with green-fluorescent protein (GFP) recently becoming available. In a method introduced by Andrew Belmont (University of Illinois, Urbana, IL, USA), arrays of lac repressor-binding sites are visualized by expression of GFP–lac repressor fusion protein. This site-specific marking system reveals previously undetected details, such as chromatid axes (Fig. 1). Such fusion gene arrays permit the detection of rapid chromosomal movements, even over very short distances. This and other fusions of GFP to appropriate binding proteins have been used to analyze mutations affecting chromosome movements in Bacillus subtilis and in Caulobacter (Jeff Errington, Oxford University, UK; Jonathon Dworkin, Harvard University, MA, USA; Rasmus Jensen, Stanford University, CA, USA).
In Drosophila, John Sedat (University of California, San Francisco, CA, USA) has used GFP–lac repressor arrays to measure the slow and fast modes of random motion that chromosomes undergo [2]. Slow large-scale movements are confined to early G2 of the cell cycle in primary spermatocytes, suggesting a regulated process. Superimposed on these motions, split-second ‘jittering’ might allow for more of the nuclear volume to be explored, raising the possibility that interchromosomal regulatory interactions depend ultimately on random movements.
Among these interactions are those between the B-cell transcription factor, Ikaros, and pericentric heterochromatin. As shown by Amanda Fisher (Imperial College, London, UK), many Ikaros target genes are associated with heterochromatin when they are silenced. Such associations with heterochromatin could maintain gene silencing, but they do not establish it, as silencing appears to precede
recruitment to heterochromatin, and transient silencing is seen without recruitment. Chromosomal movements are confined to territories, which remarkably maintain a radial arrangement in mammalian cells, as shown by Thomas Cremer (Ludwig Maximilians University, Munich, Germany) using photobleaching of histone–GFP.
Preparation is key
Although much of the progress in understanding chromosome dynamics comes from better tools, including microscopes and cytological markers, sample preparation procedures have improved as well. For example, Peter Shaw (John Innes Institute, Norwich, UK) described how living plant tissue can be sliced neatly to give subcellular sections, thanks to its stiff cell walls. For electron microscopic analysis, slices that are only 1-nm thick permit the visualization of ribosomal transcription units by in situ hybridization. This analysis confirms in three dimensions the ‘Christmas tree’ appearance of dense active transcription units previously described for extended chromatin, and further reveals that only about 10% of the genes in a nucleolus are active. For other genes, Peter Cook (Oxford University,
Fig. 1. Detection of a chromatid axis by green-fluorescent protein (GFP)–lac repressor fusion protein bound to arrays of lac repressor-binding sites. A plasmid containing multiple lac repressor-binding sites was incorporated into a chromosome, followed by in vivo amplification and expression of GFP–lac repressor fusion protein. Examples of labeled chromosomes exhibit a surprising linear structure for one of the amplified clones in this study (adapted, with permission, from Ref. 8).
0168-9525/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0168-9525(01)02527-6