Computational gene discovery and human disease
Christopher J Rawlings* and David B Searls†

Bioinformatics is now an essential tool in many aspects of human molecular genetics research. Methods for the prediction of gene structure in genomic sequence are essential components for large-scale genome sequencing projects and provide the key to deriving protein sequence and locating intron/exon junctions. Sequence comparison and database searching are the pre-eminent approaches for predicting the likely biochemical function of new genes, although recent methods in the analysis of sequence profiles derived from families of aligned sequences have advantages for the detection of remote sequence relationships. The use of comparative sequence analysis of data from model organisms is emerging as the most important development in the application of bioinformatics for characterizing candidate disease genes.

Addresses
*SmithKline Beecham Pharmaceuticals, Department of Bioinformatics, New Frontiers Science Park, Third Avenue, Harlow, Essex CM19 5AW, UK; e-mail: Chris-Rawlings-1@sbphrd.com
†SmithKline Beecham Pharmaceuticals, Department of Bioinformatics, 709 Swedeland Road, PO Box 1539, King of Prussia, PA 19406-0939, USA; e-mail: [email protected]

Current Opinion in Genetics & Development 1997, 7:416-423
http://biomednet.com/elecref/0959437X00700416
© Current Biology Ltd ISSN 0959-437X

Abbreviations
AT ataxia telangiectasia
ATM mutated in ataxia telangiectasia
BLAST basic local alignment search tool
EGAD expressed gene anatomy database
EST expressed sequence tag
FISH fluorescent in situ hybridization
GRAIL gene recognition and analysis internet link
HMM hidden Markov model
HNPCC hereditary nonpolyposis colorectal cancer
hprt hypoxanthine-guanine phosphoribosyltransferase
IRE iron responsive element
NBCC nevoid basal cell carcinoma syndrome
PKD polycystic kidney disease
Ran Ras-related nuclear protein
RCC1 regulation of chromosome condensation protein
WWW World Wide Web
XP xeroderma pigmentosum

Figure 1
The proportion of papers indexed each year in Medline as containing sequence analysis results. The relative proportion of sequence analysis papers indexed as relating to disease or disease progression is shown to illustrate the increasing contribution of bioinformatics methods to the understanding of the molecular basis of human disease. Medline indexes ~400 000 papers per year.

Introduction
It is now well established that computational molecular biology (bioinformatics) can make a significant contribution to the search for genes implicated in human diseases. The increasing impact of bioinformatics in human disease genetics is illustrated by an analysis of the published scientific literature (Fig. 1). The number of papers in Medline (~400 000 per annum) indexed as containing results of sequence analysis has increased steadily to the present level of 2.8%. Of these, an increasing proportion (now ~1.1% or ~0.03% overall) are indexed as specifically contributing to research into human disease. These figures, however, are almost certainly a significant underestimate of the real importance of bioinformatics because most laboratories closely integrate computational methods with experimental approaches to gene identification. To compile this review, we have selected publications which best illustrate the way in which different bioinformatics methods can be used to link previously uncharacterized nucleotide sequences to human disease phenotypes or to elucidate the molecular basis of human disease.
This article begins with an overview of gene prediction methods and then reviews developments in sequence database searching methods and their function in comparative genomics and whole-genome analysis. The remainder of this review addresses methods for identifying increasingly remote sequence similarities and ends by providing some practical pointers to disease-specific databases and to further reading in bioinformatics research.
Computational gene discovery and human disease Rawlings and Searls
Gene prediction
The identification of putative genes in stretches of raw genomic DNA sequence is an increasingly important computational activity in support of disease gene analysis. As the cost of sequencing drops and large-scale genomic sequencing efforts increase, candidate genes are more and more frequently being identified in lengthy sequenced regions that are isolated positionally (e.g. [1]) and eventually will simply be sought and found in databases of genomic sequence compiled in an undirected fashion as a result of the overall human genome programme sequencing effort.

The venerable GRAIL (gene recognition and analysis internet link) software package remains one of the most-used systems for gene identification (e.g. [2••,3,4]) and it continues to be improved on a regular basis [5]. GRAIL is available both as a stand-alone program and via e-mail or the World Wide Web (WWW). Rival systems include GeneParser ([6]; http://beagle.colorado.edu/~eesnyder/GeneParser.html) and a number of others reviewed by Fickett [7].

In an elegant study of the gene structure surrounding the translocation breakpoint between chromosomes 2 and 22 in a patient suffering from DiGeorge syndrome, Budarf et al. [2••] have obtained sufficient genomic sequence to allow exon prediction using GRAIL. One of the open reading frames spans the breakpoint and the predicted protein sequence was found to be similar to mammalian androgen receptor sequences and contained a leucine zipper motif. Although the sequence relationship is weak, the similarity to members of a family of transcriptional regulators implies a role for the candidate gene product that would be consistent with the developmental phenotype of the DiGeorge syndrome.

The analysis of genomic sequence data has contributed to significant recent advances in the understanding of the molecular structure of the PKD1 gene for autosomal dominant polycystic kidney disease (PKD) [8•,9•] for which GRAIL identified 48 exons in the 53.5 kb sequence. From protein sequence data predicted from the modelled exons, a repeated leucine-rich motif was identified which has led to suggestions that the PKD1 gene may code for a membrane-bound glycoprotein that functions in cell-cell or cell-matrix interactions or a signal transduction protein. A further bioinformatics analysis of PKD1 sequence data [10•] has added further support to the PKD1 gene defect being associated with a protein involved in cell-cell and cell-matrix interactions and has indicated that this could be consistent with the plethora of connective tissue abnormalities seen in the PKD phenotype.

GRAIL was also used to characterize the structure of the RPGR gene for the RP3 form of X-linked retinitis pigmentosa [11]. With the exons identified, a protein sequence could be predicted and sequence database searching has shown that it has a conserved repeat structure similar to that found in sequences from the regulation of chromosome condensation protein (RCC1), which is a regulator of the GTPase Ran (a member of the Ras superfamily) with possible functions including cell cycle control, DNA synthesis and RNA processing. It is thought that Ran may work by coupling the GTPase cycle with cellular processes such as membrane transport or trafficking. As the retina is a tissue that exhibits exceptional levels of membrane turnover, the prediction that the RPGR gene is associated with a regulator of membrane function seems consistent with a disease phenotype involving progressive retinal degeneration.

Perhaps the most interesting current trend in gene prediction methodology is toward systems that attempt to detect genes not simply by first principles of gene structure and statistical characteristics of protein-coding DNA but in combination with known protein sequences. One such system simply screens predicted exons against a protein database after the fact ([12]; http://linkage.rockefeller.edu/wli/gene/list.html). Another approach, termed spliced alignment, tests a very large number of exons, essentially all blocks of sequence between invariant dinucleotide splice signals ([13]; http://www-hto.usc.edu/software/procrustes). Variations on this approach have now made their way into a number of systems, with demonstrable improvement in accuracy of prediction [14], though speed is likely to remain an issue in searching lengthy input against complete databases.

Techniques for gene prediction and sequence similarity searches are thus converging rapidly. More sensitive and/or more convenient methods for detecting distant homologies continue to be developed [15••,16] and an interesting research trend is toward the incorporation of increasingly detailed biological knowledge into the classic alignment algorithms that form the basis for sequence comparisons [17].

Sequence similarity searches
A statistically significant similarity identified between two sequences provides strong evidence that both code for proteins with similar three-dimensional structure and are thus likely to exhibit equivalent biochemical functions. The use of nucleotide and amino acid sequence comparison methods, in particular those such as BLAST [18] and FASTA [19] which are sufficiently fast to search entire nucleotide and protein sequence databases, has therefore contributed some of the most important insights into disease gene function. (For a concise and useful review of sequence database searching resources available on the Internet using the WWW, see [20].) The primary computational challenge now posed for sequence database searching is the detection of dis-
Genetics of disease
When a new human gene is sequenced, sequence database searching is essential for finding clues to its biochemical function. Sequence similarity has therefore been the basis of many of the discussions regarding the potential biological function of the familial breast and ovarian cancer genes BRCA1 and BRCA2. In [21], sequence similarity between BRCA1 and the granin protein family was detected and granin sequence motifs were also detected in BRCA2 sequences. A more rigorous analysis of the sequence features in BRCA1 [22••], however, has challenged the statistical significance of the granin relationship and proposed a new protein superfamily for the BRCA1 product which also included a large number of nuclear proteins mostly involved in cell cycle checkpoint functions which may associate with p53. A defect in cell cycle control could be an equally plausible explanation of the BRCA1 phenotype (see Note added in proof).

Obesity
Sequence similarity has also been used to help elucidate the genetic defect which results in the mouse tubby phenotype (see also the review by Naggert, Harris and North, this issue, pp 398-404). The tubby late-onset obesity phenotype with associated hearing loss and retinal degeneration provides an important model for the study of late-onset obesity in humans and may also be a model of other human obesity syndromes such as Bardet-Biedl and Alström. The cloning and sequencing of the mutated mouse gene [23,24] has shown that the defect is a single base change in the intron splice junction resulting in aberrant expression of intron sequence. The carboxyl terminus of the predicted protein from the wild type tubby gene was shown by sequence database searching to be similar to phosphodiesterase p4-6. It is not clear, however, whether the similarity to the phosphodiesterase is significant: whereas Kleyn et al. [23] argue the case that the tubby gene is a member of the new family, Noben-Trauth et al. [24] argue that the 62% amino acid similarity to p4-6 is consistent with phosphodiesterase being a causal factor in other examples of retinal degeneration. Furthermore, it is argued that there is a link between the accumulation of cGMP, as a result of insufficient phosphodiesterase activity, and apoptosis and that evidence of apoptotic cell death can be observed in tubby mice.

The predictions of biochemical function for both the familial breast cancer and tubby phenotypes have pivoted around the detection of sequence relationships in higher organisms where extensive biochemical analysis had already been undertaken. Finding a human disease gene that is related to an already fully characterized mammalian gene can therefore be largely a matter of luck, whereas lower organisms offer new opportunities for functional genomic analysis.

Functional genomics by sequence analysis of model organisms
The importance of sequence data from lower organisms where systematic and large-scale functional analysis (e.g. gene knockouts) is possible and more extensive genetic experiments can be undertaken quickly is now widely appreciated. The availability of increasing amounts of sequence data from the classical geneticists' model organisms (yeast, fruitfly and the nematode) is having a significant impact on the understanding of human disease as models of these defects can be rapidly identified and exploited in lower organisms.

Cancer-prone syndromes
The major human diseases considered to involve defects in DNA repair and metabolism and which exhibit sensitivity to DNA-damaging agents and increased susceptibility to cancer all have equivalents in model organisms. The genetic similarities have been detected through the use of sequence database searching. For example, following the early finding that the Drosophila gene haywire is equivalent to the human DNA excision repair gene ERCC3 defective in xeroderma pigmentosum [25], other excision repair genes (e.g. ERCC2) have been shown to be homologous in human, mouse and hamster genomes [26]. Patients with the autosomal recessive disorder ataxia telangiectasia (AT) exhibit increased sensitivity to ionizing radiation and are >100 times more likely to develop cancer than members of the general population. Sequence similarity analysis has revealed that the gene mutated in AT, ATM, was most similar to the yeast gene TEL1 which, in turn, was similar to a second gene, MEC1. Both TEL1 and MEC1 show primary structure similarity to a checkpoint gene in Schizosaccharomyces pombe and appear to act in redundant checkpoint pathways. It would therefore seem that the gene defect responsible for AT could well be a cell cycle control gene. Sequencing of cDNA from the candidate gene for the cancer-prone Bloom's syndrome, BLM, enabled a protein sequence to be predicted and compared with protein sequence databases. BLM was found to contain regions of strongly conserved local similarity to recQ helicases from human, yeast and Escherichia coli [27]. Furthermore, sequence comparison suggests that BLM has, in addition to potential helicase activity, DNA-dependent ATPase motifs and possibly other uncharacterized enzymatic activities.
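As a rough illustration of what "local similarity" between two sequences means computationally, the following is a minimal sketch of Smith-Waterman local alignment, the exact dynamic-programming method that fast heuristic database search programs approximate. The scoring values (match +2, mismatch -1, gap -2), the function name and the toy strings are assumptions chosen for readability; they are not taken from any of the studies cited here, and a real protein comparison would use an amino acid substitution matrix instead of a flat match/mismatch score.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring parameters are illustrative assumptions, not published values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which is what makes
    # the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy strings (hypothetical): a shared block embedded in unrelated flanks.
best_score, top, bottom = smith_waterman("XXHELICASEYY", "QQHELICASEZZ")
print(best_score, top, bottom)
```

The quadratic cost of filling the matrix is the reason whole-database searches rely on faster heuristics, with exact alignment reserved for the candidate hits.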
Sequence comparison has thus yielded potentially important insights into the possible functions of genes defective in these diseases and has enabled the investigators to understand the results from experiments using the model genetic defects in other organisms such that the biochemical phenotype could be further dissected.

Tumour suppressor genes
Sequence homology between Drosophila patched and sequence data derived from the nevoid basal cell carcinoma syndrome (NBCCS) or Gorlin syndrome [28-30] has provided further evidence for the role of tumour suppressor genes in cancer. Patched is the third Drosophila segment polarity gene which has a human homolog implicated in tumorigenesis. The other two are wingless and Cubitus interruptus and add to the growing evidence of the importance of developmental regulator genes in human cancer.

Comparative genome analysis
The success of small-scale comparative analysis of single genes and their orthologs in other species combined with the availability of increasing amounts of genome sequence data and access to prodigious amounts of computing power has made the automatic analysis and comparison of entire genomes a practical proposition. In some cases, the genomes of entire organisms such as Haemophilus influenzae [31], Mycoplasma genitalium [32], Methanococcus jannaschii [33], E. coli and Saccharomyces cerevisiae ([34-36]; http://www.ncbi.nlm.nih.gov/XREFdb/) are already available and rapid progress is being made towards completion of full genomic sequencing for higher organisms such as Caenorhabditis elegans [37]. The publication of results from all such large-scale sequencing projects is accompanied by extensive bioinformatics analysis and prediction of putative genes and their predicted functions. Koonin et al. [38,39] have shown how computational methods can be coordinated to analyze genome-scale volumes of sequence data and have illustrated how comparison of viral genomes can provide information about the genetic processes at work during evolution [40]. Ouzounis et al. ([41••]; http://www.sander.embl-heidleberg.de/genequiz/) have shown how the GeneQuiz program [42] can be used to compare bacterial genomes and provide predictions of gene function through extensive use of bioinformatics software and databases.

Although Miklos and Rubin [43] have sounded notes of caution regarding the limitations in the use of model organism sequence data as the basis for comparative functional genetics, real excitement has been generated by the productivity of large-scale comparative analysis of known human disease genes and mouse mRNA and encoded proteins [44••] and of Drosophila mutant genes with human EST sequence databases [45••]. In the latter study, FISH and radiation hybrid mapping were used to link EST sequences with human genetic loci to discover new functional genes responsible for human disease. Baxendale et al. [46••] argue, on the basis of the close conservation of genomic organization of the Huntington's disease genes in fish and man and the small size of its genome, that the pufferfish will provide an even more powerful model organism suitable for comparative genomics in the search for human disease genes.

Expressed sequence tag databases
The most direct window onto the expressed genome is that provided by ESTs, single-pass partial sequences from cDNA libraries such as those available in the dbEST database ([47]; http://www.ncbi.nlm.nih.gov/dbEST) which provides public access to collections of human ESTs, most notably that developed by Washington University [48••]. cDNA sequencing and EST databases are now widely used as part of both academic and commercial gene discovery projects [49,50••]. Large-scale analysis and review of EST sequence data, particularly with regard to the data quality, are now possible [48••] and attempts are being made to link EST data with other genomic sequence information in databases such as EGAD [51] and XREFdb ([34]; http://www.ncbi.nlm.nih.gov/XREFdb/). Example successes from database analysis of EST sequences came from the identification of homology between the yeast equivalent (MLH1) of the bacterial DNA mismatch repair gene mutL [52] and the gene defective in chromosome-3-linked hereditary nonpolyposis colorectal cancer, an observation made independently by Bronner et al. [53]. Levy-Lahad et al. [54] also used an EST database to identify and clone a candidate gene for chromosome 1 familial Alzheimer's disease (STM2). Sequence analysis suggested that the predicted protein contained seven transmembrane domains and was similar to the S182 gene, a strong candidate for the Alzheimer's locus AD3 mapped to chromosome 14.

Large-scale bioinformatic and experimental comparative genomics is complex and time consuming. The XREFdb project ([34]; http://www.ncbi.nlm.nih.gov/XREFdb/), which links together human disease EST data with homologous genes in model organisms, is being developed to bring the results of such studies together as an information resource that will be of particular value in gene identification.
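Because an EST is a single-pass read whose strand and reading frame are not known in advance, one common preliminary step when comparing it with protein sequences is to translate it in all six reading frames. The sketch below does only that step, using the standard genetic code. The function names and the toy nucleotide string are hypothetical, ambiguous codons are simply written as "X", and this is not presented as the procedure used in any of the studies cited above.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical 12-base read, used only to show the shape of the output.
for label, peptide in six_frame_translations("ATGGCCAAGTAA").items():
    print(label, peptide)
```

Each of the six conceptual peptides can then be compared against a protein database; sequencing errors that shift the frame part-way through a read are not handled by this simple sketch.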
Sequence profiles and hidden Markov models
Whereas sequence alignment provides the most powerful method for finding closely related sequences, the identification of remote sequence similarities or local sequence features that are indicative of a particular biological function requires alternative approaches. Profiles are probabilistic sequence patterns defining the likelihood of finding particular nucleotides or amino acids in a specific order. Profile construction and searching methods are now established bioinformatics techniques (see chapters 20-28 in [55]) used for identifying remotely related sequences. Profiles are derived from the statistical analysis of residue conservation in aligned families of sequences with common function or origin from a common ancestor. A number of different statistical approaches have been used for sequence profile construction and database searching but the method currently creating the most excitement for detecting remote sequence relationships uses hidden Markov models (HMMs) [56], which have been described recently for gene recognition [57] and promoter detection [58].

Sequence motifs and profiles are often used to help elucidate the potential function of proteins encoded by novel genes. Key databases providing collections of aligned sequences and derived profiles and motifs are BLOCKS [59], PROSITE [60] and PRINTS [61]. A new collection of protein family alignments, some of which are constructed using HMMs, can be found in the Pfam database ([62]; http://www.sanger.ac.uk/Pfam/). Each family has functional annotation and cross-references to protein families in other protein databases and the literature. Some examples of how protein motif searching can reveal information about the function of candidate disease genes have already been listed in this review. In addition, a particularly clear example of the application of motif searching is given in a study of the gene implicated in Smith-Magenis syndrome [63] which was characterized as coding for an extracellular matrix protein involved in cell adhesion or intercellular interactions on the basis of the presence of a fibrinogen-like domain and a ligand motif for the cell surface receptor integrin.

Heavy-metal-associated proteins in prokaryotes contain a specific sequence motif. The review by Bull and Cox [64] discusses the evidence that two human diseases in which disruption of copper transport is a key feature (Wilson and Menkes disease) are caused by mutations in genes which encode proteins possessing the same heavy-metal sequence motif as found in prokaryotes. They propose that these genes encode the first putative heavy-metal transport proteins to be identified in eukaryotes.

Another class of sequence motifs with relevance to human disease are transcription factors. A federation of databases (TRANSFAC) containing information on known transcription factor sequences and their DNA-binding sites is described in [65]; reference [66] is a review of the role of transcription factors in human disease.

Higher-order information
Sequence comparison and profile searches address the identification of function via primary sequence information only. It is generally accepted, however, that aspects of gene regulation and interaction of DNA-binding proteins are moderated via higher-order structural features in nucleic acids.
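Before turning to higher-order features, the primary-sequence idea of a profile described above can be made concrete with a minimal sketch: per-column residue counts from a gapless alignment are turned into log-odds scores, and the resulting matrix is slid along a query sequence. This is a plain position-specific scoring matrix, not a hidden Markov model. The uniform background frequency, the pseudocount of 1, the function names and the toy aligned family are all assumptions made for illustration, and residues outside the stated alphabet are not handled.

```python
import math


def build_profile(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds profile from equal-length, gapless aligned sequences.

    Assumes a uniform background distribution over `alphabet`.
    """
    background = 1.0 / len(alphabet)
    profile = []
    for col in range(len(aligned[0])):
        # The pseudocount keeps unseen residues from scoring minus infinity.
        counts = {symbol: pseudocount for symbol in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {s: math.log2((counts[s] / total) / background) for s in alphabet}
        )
    return profile


def best_match(profile, sequence):
    """Return (score, start, window) for the highest-scoring window."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[res] for column, res in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


# Hypothetical aligned family of short DNA sites.
family = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
profile = build_profile(family)
print(best_match(profile, "GGCGTATAATGCC"))
```

Unlike a single query sequence, the matrix tolerates substitutions at poorly conserved columns while penalising them at conserved ones, which is why profiles can pick up relationships that pairwise comparison misses.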
Methods for the prediction of nucleic acid (RNA) secondary structure have been the subject of extensive study by bioinformaticians [67] and interest in these techniques has been growing as the role of RNA in the regulation of gene expression becomes clearer. A useful review of methods of searching for RNA hairpins and their potential value in identifying regulatory RNA motifs can be found in reference [68]. A recent example of how a mutation in a structural control element is implicated in human disease is illustrated by the role of the iron responsive element (IRE) in inherited hyperferritinaemia. IRE is a regulatory RNA structural motif predicted to be a single hairpin found upstream of the coding sequence for iron regulatory protein. Beaumont et al. [69•] report the identification of a point mutation in the IRE motif of L-ferritin in members of a family suffering from dominantly inherited hyperferritinaemia and cataracts; the authors speculate that the mutated IRE affects the interaction with iron regulating protein, resulting in hyperferritinaemia, and that accumulation of ferritin leads to cataracts.
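To give a feel for what searching for RNA hairpins involves at its simplest, the sketch below scans a sequence for a stem of fixed length whose two arms are perfectly complementary and separated by a short loop. It is a naive pattern search under stated assumptions (Watson-Crick pairs only, no G-U wobble pairs, no bulges, no energy model); the function name, the default stem and loop sizes and the toy sequence are hypothetical, and the example sequence is not an iron responsive element.

```python
# Watson-Crick pairs only; G-U wobble pairs are deliberately left out.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def find_hairpins(rna, stem=5, min_loop=3, max_loop=8):
    """Return (start, stem_length, loop_length) for perfect stem-loops.

    `start` is the 0-based position of the first base of the 5' arm.
    """
    rna = rna.upper().replace("T", "U")
    hits = []
    for start in range(len(rna) - 2 * stem - min_loop + 1):
        left = rna[start:start + stem]
        for loop in range(min_loop, max_loop + 1):
            right_start = start + stem + loop
            right = rna[right_start:right_start + stem]
            if len(right) < stem:
                break  # the 3' arm would run off the end of the sequence
            # The 5' arm pairs with the 3' arm read in reverse.
            if all((l, r) in PAIRS for l, r in zip(left, reversed(right))):
                hits.append((start, stem, loop))
    return hits


# Toy sequence: flank + 5' arm GGCAC + loop UUCG + 3' arm GUGCC + flank.
print(find_hairpins("AAGGCACUUCGGUGCCAA"))  # [(2, 5, 4)]
```

A single base change in either arm breaks the perfect pairing this search looks for, which gives some intuition for how a point mutation in a structural element could disrupt a regulatory hairpin.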
Discussion and conclusions
Bioinformatics is clearly playing a key role in the discovery and characterization of the genes implicated in human disease and is an essential component in all contemporary molecular biology research. The most active areas of research and the most promising approaches for delivering new insights are those that exploit genome data from a variety of sources, such as in computational comparative genome analysis or in the integration of gene detection and sequence comparison methods. The future challenges for bioinformatics arise from the deluge of new data that is being deposited in publicly accessible and private computer databases. It is almost certain that, in the future, the most successful genomics research centres will be those that have methods in place to track and exploit the widest variety of data sources and integrate them with their own research activities. In many cases, this will not require just the development of more sophisticated algorithms but also the application of good practices in information engineering and database management.

Disease-specific databases
The databases available for bioinformatics analysis are as important as the programs and algorithms for searching them. In addition to the well-known international archives of genetic map information and DNA/protein sequence and structure information, a number of human gene and disease-specific databases have been established to serve the more specialist needs of research into human genetic disease. Short descriptions of many of these database and Internet resources are provided in a special issue of Nucleic Acids Research [70], including the diseases haemophilia A/B, Marfan syndrome, X-linked agammaglobulinemia, and gene loci for Factor VIII, hprt, p53, adenomatous polyposis coli, phenylalanine hydroxylase (PAH), cholinesterase, fibrillin, the androgen receptor, the low-density lipoprotein receptor, the MHC and immunoglobulin families and type I collagen. Most of these databases contain mutation data and some contain information and links to other bioinformatics resources.

Further reading in bioinformatics
The book by Bishop and Rawlings [71] is the most recent to be published providing practical advice on the use (and pitfalls) of bioinformatics techniques and applications. A recent comprehensive overview of bioinformatics methods and resources relating particularly to protein sequence analysis and structure prediction is available in the computer methods special issue of Methods in Enzymology [55]. Each year, Nucleic Acids Research publishes a special issue on databases (the most recent being [70]) and Trends in Genetics features regular bioinformatics technology updates in its ‘Genetwork’ section. For the latest information on bioinformatics research, the key journals are Computer Applications in the Biosciences and the Journal of Computational Biology as well as the traditional journals publishing molecular biology and genetics research. The annual international conference on Intelligent Systems for Molecular Biology is probably the most important international conference on bioinformatics and the refereed and published proceedings are now a key resource [72,73].
Note added in proof
Recent observations [74-76] have confirmed the role of the BRCA1 and BRCA2 genes in cell cycle control and, in particular, as potential caretaker genes through an association with the Rad51 gene which is known to be essential in both DNA repair and meiotic and mitotic recombination.
Acknowledgements Thanks to David Carpenter from SmithKline hlanagement for assistance with compiling kledlinc Bo$z for administrative assistance.
References
and recommended
Bcecham searches
Information and to Evelyn
. l
1.
2. ..
*
1995, 10:269-276. An excellent example of combining bioinformatics and experimental approaches in the search for a candidate human disease gene in the region surrounding the translocation breakpoint between chromosomes 2 and 22 in a patient suffering from DiGeorge syndrome. The key bioinformatics methods include the GRAIL gene prediction and sequence database searching. Protein sequence from one of the ORFs spanning the breakpoint predicted by GRAIL was found to be similar to mammalian androgen receptor sequences and contained a leucine zipper motif. Although the sequence relationship was weak, the similarity to members of a family of transcriptional regulators could suggest a role for the candidate gene product that would be consistent with the developmental phenotype of the DiGeorge syndrome. 3.
Gama RE, Du YL, Baumann J, McCormick PJ: Identification of exons in a novel embryonal carcinoma locus using the GRAIL program. Oncol Rep 1996, 3:371-374.
4.
Schluter G, Celik A, Obata R, Schlicker M, Hofferbert S, Schlung A, Adham IM, Engel W: Sequence analysis of the conserved protamine gene cluster shows that it contains a fourth expressed gene. MO/ Reprod Dev 1996,43:1-6.
5.
Uberbacher EC, Xu Y, Mural RJ: Discovering and understanding genes in human DNA sequence using GRAIL. In Methods in Enzymology. Edited by Doolittle RF. San Diego: Academic Press; 1996:259-281.
6.
Snyder EE, Storm0 GD: Identification of protein coding regions in genomic DNA. J MO/ Biol 1995, 246:1-l 6.
7.
Fickett JW: Finding genes by computer: the state of the art 7iends Genet 1996, 12:316-320.
6. .
Burn TC, Connors TD, Dackowski WR, Petty LR, Van Raay TJ, Millholland JM, Venet M, Miller G, Hakim RM, Landes GM et al: Analysis of the genomic sequence for the autosomal dominant polycystic kidney disease (PKDI) gene predicts the presence of a leucine-rich repeat The American PKDl Consortium (APKDl Consortium). Hum MO/ Genet 1995, 4:575-562. An excellent example of how many different bioinformatics techniques including gene prediction (GRAIL) and protein sequence data analysis and database searching were combined to assist in the characterization of a candidate human disease gene. 9. .
Gliicksmann-Kuis MA, Tayber 0, Woolf EA, Bougueleret L, Deng N, Alperin GD, Iris FI, Hawkins F, Munro C, Lakey N et al: Polycystic kidney disease: the complete structure of the PKDl gene and its protein. The International Polycystic Kidney Disease Consortium. Cell 1995, 61:289-296. A good example of a comprehensive bioinformatic analysis of the polycystic kidney disease gene sequence that illustrates the range of inferences that can be made from thorough use of sequence analysis methods and database searching. The authors found a signal peptide pattern and 5 leucine-rich repeats flanked by cysteine-rich regions in the translated protein sequence. Sequence comparison showed the sequence to be similar to C-type (calciumdependent) lectin proteins. In addition, 14 copies of a low-density lipoprotein domain were identified. Protein secondary structure prediction methods suggested that the protein domain is globular and contains an antiparallel p sheet. 10. .
Hughes J, Ward CJ, Peral B, Aspinwall R, Clark K, San Millan JL, Gamble V, Harris PC: The polycystic kidney disease 1 (PKDI) gene encodes a novel protein with multiple cell recognition domains. Nat Genet 1995, lo:151 -160. A thorough bioinformatics analysis of a human disease gene. Particular emphasis is placed on the use of multiple protein sequence alignment and analysis of conserver regions to derive evidence for the potential structure and function of the encoded protein. 11.
Meindl A, Dry K, Herrmann K, Manson F, Ciccodicola A, Edgar A, Carvalho MR, Achatz H, Hellebrand H, Lennon A et al.: A gene (RPGR) with homology to the RCCl guanine nucleotide exchange factor is mutated in X-linked retinitis pigmentosa (RP3). Nat Genet 1996, 13:35-42.
12.
Rogozin IB, Milanesi L, Kolchanov NA: Gene structure prediction using information on homologous protein sequence. Comput Appl Biosci 1996, 12:161-170.
13.
Gelfand MS, Mironov AA, Pevzner PA: Gene recognition via spliced sequence alignment. Proc Natl Acad Sci USA 1996, 93:9061-9066.
14.
Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34:353-367.
15. ••
Birney E, Thompson JD, Gibson TJ: PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res 1996, 24:2730-2739.
An important extension to existing methods for searching DNA sequence databases with protein sequence data that can accommodate frameshift errors. These programs integrate with the GCG sequence analysis package and can use either single or aligned proteins as the basis of the search. The programs have particular utility for searching error-prone EST databases.
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Chen EY, Zollo M, Mazzarella R, Ciccodicola A, Chen CN, Zuo L, Heiner C, Burough F, Repetto M, Schlessinger D, Durso M: Long-range sequence analysis in Xq28 - 13 known and 6 candidate genes in 219.4 kb of high GC DNA between RCP/GCP and G6PD loci. Hum Mol Genet 1996, 5:659-668.
Budarf ML, Collins J, Gong W, Roe B, Wang Z, Bailey LC, Sellinger B, Michaud D, Driscoll DA, Emanuel BS: Cloning a balanced translocation associated with DiGeorge syndrome and identification of a disrupted candidate gene. Nat Genet
16.
Guan X, Uberbacher EC: Alignments of DNA and protein sequences containing frameshift errors. Comput Appl Biosci 1996, 12:31-40.
17.
Searls DB: Sequence alignment through pictures. Trends Genet 1996, 12:35-37.
18.
Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching molecular sequence databases. Nat Genet 1994, 6:119-129.
19.
Pearson WR: Effective protein sequence comparison. In Methods in Enzymology. Edited by Doolittle RF. San Diego: Academic Press; 1996:227-256.
20.
Brenner SE: BLAST, Blitz, BLOCKS and BEAUTY: sequence comparison on the net. Trends Genet 1995, 11:330-331.
21.
Jensen RA, Thompson ME, Jetton TL, Szabo CI, Van der Meer R, Helou B, Tronick SR, Page DL, King MC, Holt JT: BRCA1 is secreted and exhibits properties of a granin. Nat Genet 1996, 12:303-306.
22. ••
Koonin EV, Altschul SF, Bork P: BRCA1 protein products ... Functional motifs... Nat Genet 1996, 13:266-266.
This paper illustrates the importance of applying bioinformatics methods with rigour when seeking to predict gene/protein function from remote sequence similarities identified from database searching. The authors showed that, by careful dissection of the BRCA1 protein into separate domains and selecting improved amino acid scoring matrices for sequence database search methods, the significance of more remote similarities with cell cycle control genes p53 and Rad9 could be established.
23.
Kleyn PW, Fan W, Kovats SG, Lee JJ, Pulido JC, Wu Y, Berkemeier LR, Misumi DJ, Holmgren L, Charlat O et al.: Identification and characterization of the mouse obesity gene tubby: a member of a novel gene family. Cell 1996, 85:261-290.
24.
Noben-Trauth K, Naggert JK, North MA, Nishina PM: A candidate gene for the mouse mutation tubby. Nature 1996, 380:534-536.
25.
Mounkes LC, Jones RS, Liang BC, Gelbart W, Fuller MT: A Drosophila model for xeroderma pigmentosum and Cockayne’s syndrome: haywire encodes the fly homolog of ERCC3, a human excision repair gene. Cell 1992, 71:925-937.
26.
Lamerdin JE, Stilwagen SA, Ramirez MH, Stubbs L, Carrano AV: Sequence analysis of the ERCC2 gene regions in human, mouse, and hamster reveals three linked genes. Genomics 1996, 34:399-409.
27.
Ellis NA, Groden J, Ye TZ, Straughen J, Lennon DJ, Ciocci S, Proytcheva M, German J: The Bloom’s syndrome gene product is homologous to RecQ helicases. Cell 1995, 83:655-666.
28.
Hahn H, Wicking C, Zaphiropoulos PG, Gailani MR, Shanley S, Chidambaram A, Vorechovsky I, Holmberg E, Unden AB, Gillies S et al.: Mutations of the human homolog of Drosophila patched in the nevoid basal cell carcinoma syndrome. Cell 1996, 85:841-851.
29.
Hahn H, Christiansen J, Wicking C, Zaphiropoulos PG, Chidambaram A, Gerrard B, Vorechovsky I, Bale AE, Toftgard R, Dean M, Wainwright B: A mammalian patched homolog is expressed in target tissues of sonic hedgehog and maps to a region associated with developmental abnormalities. J Biol Chem 1996, 271:12125-12126.
30.
Gailani MR, Stahle Backdahl M, Leffell DJ, Glynn M, Zaphiropoulos PG, Pressman C, Unden AB, Dean M, Brash DE, Bale AE, Toftgard R: The role of the human homologue of Drosophila patched in sporadic basal cell carcinomas. Nat Genet 1996, 14:78-81.
31.
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269:496-512.
32.
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM et al.: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270:397-403.
33.
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD et al.: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 1996, 273:1056-1073.
34.
Hieter P, Bassett DE Jr, Valle D: The yeast genome - a common currency. Nat Genet 1996, 13:253-255.
35.
Oliver SG: From DNA sequence to biological function. Nature 1996, 379:597-600.
36.
Walsh S, Barrell B: The Saccharomyces cerevisiae genome on the World Wide Web. Trends Genet 1996, 12:276-277.
37.
Hodgkin J, Plasterk RH, Waterston RH: The nematode Caenorhabditis elegans and its genome. Science 1995, 270:410-414.
38.
Koonin EV, Tatusov RL, Rudd KE: Protein sequence comparison at genome scale. In Methods in Enzymology. Edited by Doolittle RF. San Diego: Academic Press; 1996:295-322.
39.
Koonin EV, Tatusov RL, Rudd KE: Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications. Proc Natl Acad Sci USA 1995, 92:11921-11925.
40.
Hannenhalli S, Chappey C, Koonin EV, Pevzner PA: Genome sequence comparison and scenarios for gene rearrangements: a test case. Genomics 1995, 30:299-311.
41. ••
Ouzounis C, Casari G, Sander C, Tamames J, Valencia A: Computational comparisons of model genomes. Trends Biotechnol 1996, 14:260-265.
This paper is a valuable overview of the work of the GeneQuiz consortium. The paper describes how the application of large-scale genome sequence analysis and automated gene function prediction provides the basis for comparing whole genomes and reveals some of the evolutionary processes that have shaped the genomes of higher organisms.
42.
Scharf M, Schneider R, Casari G, Bork P, Valencia A, Ouzounis C, Sander C: GeneQuiz: a workbench for sequence analysis. ISMB 1994, 2:346-353.
43.
Miklos GL, Rubin GM: The role of the genome project in determining gene function: insights from model organisms. Cell 1996, 86:521-529.
44. ••
Makalowski W, Zhang J, Boguski MS: Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res 1996, 6:646-657.
This large-scale comparison of mouse and human nucleotide sequence data provides a valuable insight into some of the limits that might be expected from comparative genome analysis and the possible consequences for the cross-referencing of transcript maps. The authors compared the degree of amino acid and nucleotide sequence conservation between 1196 orthologous mouse and human genes. The data showed that, for some mouse and human genes, the nucleotide sequence is better conserved than the amino acid sequence. In some cases, functionally cloned equivalent genes showed remarkably low levels of sequence conservation by comparison with the average genes (65%): notably the breast cancer gene BRCA1 (57%) and the testis determining factor (SRY) (42%). These results help to benchmark the likely success of comparative genomics.
45. ••
Banfi S, Borsani G, Rossi E, Bernard L, Guffanti A, Rubboli F, Marchitiello A, Giglio S, Coluccia E, Zollo M et al.: Identification and mapping of human cDNAs homologous to Drosophila mutant genes through EST database searching. Nat Genet 1996, 13:167-174.
This landmark study illustrates the potential for exploiting Drosophila genetics and the wide variety of mutant phenotypes for identifying human disease genes from EST databases. 66 Drosophila gene sequences with known mutant phenotypes were cross-referenced to human genes in the dbEST database by sequence comparison. FISH and radiation hybrid mapping were used to determine if the sequence tags mapped to loci associated with human genetic diseases. Approximately half of the Drosophila genes were found to map with the Genebridge 4 panel. The possible links between these STS markers and human disease loci are being further investigated by the authors.
46. ••
Baxendale S, Abdulla S, Elgar G, Buck D, Berks M, Micklem G, Durbin R, Bates G, Brenner S, Beck S et al.: Comparative sequence analysis of the human and pufferfish Huntington’s disease genes. Nat Genet 1995, 10:67-76.
This paper is an important example of how detailed comparative sequence analysis of a human disease gene (for Huntington’s disease; HD) using the pufferfish HD gene can further the understanding of the function of a gene. This study also illustrates the power of the pufferfish genome as a model system for the analysis of human genes. The authors use the data from the pufferfish to considerably extend the evolutionary range over which sequence comparisons can be made in order to identify the regions of the HD gene that have been most conserved and should therefore hold the key to its function. Their analysis showed that the first coding exon, the site of the disease-causing triplet repeat, is highly conserved; however, in the pufferfish, the sequence consists of just four glutamine residues. The hypothesis advanced is that the polar zipper structure that might be formed by the more extensive human glutamine repeat region probably could not come about in the shorter pufferfish repeats. This supports the view that the polar zipper motif might play a part in the development of the disease.
47.
Boguski MS: The turning point in genome research. Trends Biochem Sci 1995, 20:295-296.
48. ••
Hillier L, Lennon G, Becker M, Bonaldo FM, Chiapelli B, Chissoe S, Dietrich N, DuBuque T, Favello A, Gish W et al.: Generation and analysis of 280,000 human expressed sequence tags. Genome Res 1996, 6:807-828.
paper presents a thorough of EST generated in the Washington University cDNA sequencing which majority of in of data quality of the tissue libraries and the EST are reviewed the of gene
Trends Biotechnol 1996, 14:294-298.
This paper shows how statistical and bioinformatics methods applied to EST sequence databases can be combined to provide insights into gene expression patterns in the tissues used to generate the cDNA libraries. The use of gene expression data generated in this way is presented as an important approach to the identification of potential therapeutic targets.
51.
Aaronson JS, Eckman B, Blevins RA, Borkowski JA, Myerson J, Imran S, Elliston KO: Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res 1996, 6:829-845.
52.
Papadopoulos N, Nicolaides NC, Wei YF, Ruben SM, Carter KC, Rosen CA, Haseltine WA, Fleischmann RD, Fraser CM, Adams MD et al.: Mutation of a mutL homolog in hereditary colon cancer. Science 1994, 263:1625-1629.
53.
Bronner CE, Baker SM, Morrison PT, Warren G, Smith LG, Lescoe MK, Kane M, Earabino C, Lipford J, Lindblom A et al.: Mutation in the DNA mismatch repair gene homologue hMLH1 is associated with hereditary non-polyposis colon cancer. Nature 1994, 368:258-261.
54.
Levy-Lahad E, Wasco W, Poorkaj P, Romano DM, Oshima J, Pettingell WH, Yu CE, Jondro PD, Schmidt SD, Wang K et al.: Candidate gene for the chromosome 1 familial Alzheimer’s disease locus. Science 1995, 269:973-977.
55.
Doolittle RF: Computer Methods for Macromolecular Sequence Analysis. New York: Academic Press; 1996.
56.
Eddy SR: Multiple alignment using hidden Markov models. ISMB 1995, 3:114-120.
57.
Kulp D, Haussler D, Reese M, Eeckman FH: A generalised hidden Markov model for the recognition of human genes in DNA. ISMB 1996, 4:134-142.
58.
Pedersen AG, Baldi P, Brunak S, Chauvin Y: Characterisation of prokaryotic and eukaryotic promoters using hidden Markov models. ISMB 1996, 4:182-191.
59.
Henikoff JG, Henikoff S: Blocks database and its applications. Methods Enzymol 1996, 266:88-105.
60.
Bairoch A, Bucher P, Hofmann K: The PROSITE database, its status in 1997. Nucleic Acids Res 1997, 25:217-221.
61.
Attwood TK, Beck ME, Bleasby AJ, Degtyarenko K, Michie AD, Parry-Smith DJ: Novel developments with the PRINTS protein fingerprint database. Nucleic Acids Res 1997, 25:212-216.
62.
Sonnhammer ELL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein families based on seed alignments. Proteins 1997, in press.
63.
Zhao Z, Lee CC, Jiralerspong S, Juyal RC, Lu F, Baldini A, Greenberg F, Caskey CT, Patel PI: The gene for a human microfibril-associated glycoprotein is commonly deleted in Smith-Magenis syndrome patients. Hum Mol Genet 1995, 4:589-597.
64.
Bull PC, Cox DW: Wilson disease and Menkes disease: new handles on heavy-metal transport. Trends Genet 1994, 10:246-252.
65.
Wingender E, Kel AE, Kel OV, Karas H, Heinemeyer T, Dietze P, Knuppel R, Romashenko AG, Kolchanov NA: TRANSFAC, TRRD and COMPEL: towards a federated database system on transcriptional regulation. Nucleic Acids Res 1996, 25:265-268.
66.
Engelkamp D, Van Heyningen V: Transcription factors in disease. Curr Opin Genet Dev 1996, 6:334-342.
67.
Westhof E, Auffinger P, Gaspin C: DNA and RNA structure prediction. In DNA and Protein Sequence Analysis. Edited by Bishop MJ, Rawlings CJ. Oxford: IRL Press; 1997:255-278.
68.
Dandekar T, Hentze MW: Finding the hairpin in the haystack: searching for RNA motifs. Trends Genet 1995, 11:45-50.
69. •
Beaumont C, Leneuve P, Devaux I, Scoazec JY, Berthier M, Loiseau MN, Grandchamp B, Bonneau D: Mutation in the iron responsive element of the L ferritin mRNA in a family with dominant hyperferritinaemia and cataract. Nat Genet 1995, 11:444-446.
This paper illustrates the potential importance of regulatory structural features in nucleic acids and shows that mutations in such regions are implicated in some human diseases.
70.
Various: Database Issue [abstracts]. Nucleic Acids Res 1997, 25:1-282.
71.
Bishop MJ, Rawlings CJ: DNA and Protein Sequence Analysis. Oxford: IRL Press; 1997.
72.
Rawlings CJ, Clark DA, Altman R, Hunter L, Lengauer T, Wodak S: Proceedings of Third International Conference on Intelligent Systems for Molecular Biology. Menlo Park, California: AAAI Press; 1995.
73.
States DJ, Agarwal P, Gaasterland T, Hunter L, Smith R: Proceedings of Fourth International Conference on Intelligent Systems for Molecular Biology. Menlo Park, California: AAAI Press; 1996.
74.
Scully R, Chen AP, Xiao Y, Weaver D, Feunteun J, Ashley T, Livingston DM: Association of BRCA1 with RAD51 in mitotic and meiotic cells. Cell 1997, 88:265-275.
75.
Sharan SK, Morimatsu M, Albrecht U, Lim DS, Regel E, Dinh C, Sands A, Eichele G, Hasty P, Bradley A: Embryonic lethality and radiation hypersensitivity mediated by Rad51 in mice lacking BRCA2. Nature 1997, 386:804-810.
76.
Milner J, Ponder B, Hughes-Davies L, Seltmann M, Kouzarides T: Transcriptional activation functions in BRCA2. Nature 1997, 386:772-773.