Computational gene discovery and human disease


Christopher J Rawlings* and David B Searls†

Bioinformatics is now an essential tool in many aspects of human molecular genetics research. Methods for the prediction of gene structure in genomic sequence are essential components of large-scale genome sequencing projects and provide the key to deriving protein sequences and locating intron/exon junctions. Sequence comparison and database searching are the pre-eminent approaches for predicting the likely biochemical function of new genes, although recent methods using profiles derived from the analysis of aligned families of sequences have advantages for the detection of remote sequence relationships. The use of comparative sequence analysis of data from model organisms is emerging as the most important development in the application of bioinformatics for characterizing candidate disease genes.

Addresses
*SmithKline Beecham Pharmaceuticals, Department of Bioinformatics, New Frontiers Science Park, Third Avenue, Harlow, Essex CM19 5AW, UK; e-mail: Chris-Rawlings-1@sbphrd.com
†SmithKline Beecham Pharmaceuticals, Department of Bioinformatics, 709 Swedeland Road, PO Box 1539, King of Prussia, PA 19406-0939, USA; e-mail: [email protected]

Current Opinion in Genetics & Development 1997, 7:416-423

http://biomednet.com/elecref/0959437X00700416

© Current Biology Ltd ISSN 0959-437X

Abbreviations
AT ataxia telangiectasia
ATM mutated in ataxia telangiectasia
BLAST basic local alignment search tool
EGAD expressed gene anatomy database
EST expressed sequence tag
FISH fluorescent in situ hybridization
GRAIL gene recognition and analysis internet link
HMM hidden Markov model
HNPCC hereditary nonpolyposis colorectal cancer
hprt hypoxanthine-guanine phosphoribosyltransferase
IRE iron responsive element
NBCCS nevoid basal cell carcinoma syndrome
PKD polycystic kidney disease
Ran Ras-related nuclear protein
RCC1 regulation of chromosome condensation protein
WWW World Wide Web
XP xeroderma pigmentosum

Figure 1
The proportion of papers indexed each year in Medline as containing sequence analysis results. The relative proportion of sequence analysis papers indexed as relating to disease or disease progression is shown to illustrate the increasing contribution of bioinformatics methods to the understanding of the molecular basis of human disease. Medline indexes ~400 000 papers per year.

Introduction

It is now well established that computational molecular biology (bioinformatics) can make a significant contribution to the search for genes implicated in human diseases. The increasing impact of bioinformatics in human disease genetics is illustrated by an analysis of the published scientific literature (Fig. 1). The number of papers in Medline (~400 000 per annum) indexed as containing results of sequence analysis has increased steadily to the present level of 2.8%. Of these, an increasing proportion (now ~1.1% or ~0.03% overall) are indexed as specifically contributing to research into human disease. These figures, however, are almost certainly a significant underestimate of the real importance of bioinformatics because most laboratories closely integrate computational methods with experimental approaches to gene identification. To compile this review, we have selected publications which best illustrate the way in which different bioinformatics methods can be used to link previously uncharacterized nucleotide sequences to human disease phenotypes or to elucidate the molecular basis of human disease.

This article begins with an overview of gene prediction methods and then reviews developments in sequence database searching methods and their function in comparative genomics and whole-genome analysis. The remainder of this review addresses methods for identifying increasingly remote sequence similarities and ends by providing some practical pointers to disease-specific databases and to further reading in bioinformatics research.

Gene prediction

The identification of putative genes in stretches of raw genomic DNA sequence is an increasingly important computational activity in support of disease gene analysis. As the cost of sequencing drops and large-scale genomic sequencing efforts increase, candidate genes are more and more frequently being identified in lengthy sequenced regions that are isolated positionally (e.g. [1]) and eventually will simply be sought and found in databases of genomic sequence compiled in an undirected fashion as a result of the overall human genome programme sequencing effort.

The venerable GRAIL (gene recognition and analysis internet link) software package remains one of the most-used systems for gene identification (e.g. [2••,3,4]) and it continues to be improved on a regular basis [5]. GRAIL is available both as a stand-alone program and via e-mail or the World Wide Web (WWW). Rival systems include GeneParser ([6]; http://beagle.colorado.edu/~eesnyder/GeneParser.html) and a number of others reviewed by Fickett [7].

In an elegant study of the gene structure surrounding the translocation breakpoint between chromosomes 2 and 22 in a patient suffering from DiGeorge syndrome, Budarf et al. [2••] have obtained sufficient genomic sequence to allow exon prediction using GRAIL. One of the open reading frames spans the breakpoint and the predicted protein sequence was found to be similar to mammalian androgen receptor sequences and contained a leucine zipper motif. Although the sequence relationship is weak, the similarity to members of a family of transcriptional regulators implies a role for the candidate gene product that would be consistent with the developmental phenotype of the DiGeorge syndrome.

The analysis of genomic sequence data has contributed to significant recent advances in the understanding of the molecular structure of the PKD1 gene for autosomal dominant polycystic kidney disease (PKD) [8•,9•] for which GRAIL identified 48 exons in the 53.5 kb sequence. From protein sequence data predicted from the modelled exons, a repeated leucine-rich motif was identified which has led to suggestions that the PKD1 gene may code for a membrane-bound glycoprotein that functions in cell-cell or cell-matrix interactions or a signal transduction protein. A further bioinformatics analysis of PKD1 sequence data [10•] has added further support to the PKD1 gene defect being associated with a protein involved in cell-cell and cell-matrix interactions and has indicated that this could be consistent with the plethora of connective tissue abnormalities seen in the PKD phenotype.

GRAIL was also used to characterize the structure of the RPGR gene for the RP3 form of X-linked retinitis pigmentosa [11]. With the exons identified, a protein sequence could be predicted and sequence database searching has shown that it has a conserved repeat structure similar to that found in sequences from the regulation of chromosome condensation protein (RCC1), a regulator of the GTPase Ran (a member of the Ras superfamily) with possible functions including cell cycle control, DNA synthesis and RNA processing. It is thought that Ran may work by coupling the GTPase cycle with cellular processes such as membrane transport or trafficking. As the retina is a tissue that exhibits exceptional levels of membrane turnover, the prediction that the RPGR gene is associated with a regulator of membrane function seems consistent with a disease phenotype involving progressive retinal degeneration.

Perhaps the most interesting current trend in gene prediction methodology is toward systems that attempt to detect genes not simply by first principles of gene structure and statistical characteristics of protein-coding DNA but in combination with known protein sequences. One such system simply screens predicted exons against a protein database after the fact ([12]; http://linkage.rockefeller.edu/wli/gene/list.html). Another approach, termed spliced alignment, tests a very large number of exons, essentially all blocks of sequence between invariant dinucleotide splice signals ([13]; http://wwwhto.usc.edu/software/procrustes). Variations on this approach have now made their way into a number of systems, with demonstrable improvement in accuracy of prediction [14], though speed is likely to remain an issue in searching lengthy input against complete databases.

Techniques for gene prediction and sequence similarity searches are thus converging rapidly. More sensitive and/or more convenient methods for detecting distant homologies continue to be developed [15••,16] and an interesting research trend is toward the incorporation of increasingly detailed biological knowledge into the classic alignment algorithms that form the basis for sequence comparisons [17].

Sequence similarity searches

A statistically significant similarity identified between two sequences provides strong evidence that both code for proteins with similar three-dimensional structure and are thus likely to exhibit equivalent biochemical functions. Nucleotide and amino acid sequence comparison methods, in particular those such as BLAST [18] and FASTA [19] which are sufficiently fast to search entire nucleotide and protein sequence databases, have therefore contributed some of the most important insights into disease gene function. (For a concise and useful review of sequence database searching resources available on the Internet using the WWW, see [20].) The primary computational challenge now posed for sequence database searching is the detection of dis-
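The notion of similarity that underlies database searching can be made concrete with a small local alignment routine. The sketch below implements the classic Smith-Waterman dynamic programming recurrence, which fast search programs such as BLAST and FASTA approximate heuristically. The match, mismatch and gap scores are illustrative assumptions chosen for readability and are not the scoring schemes used by those programs, and the function name is hypothetical.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scores are toy values (assumption); real searches use substitution
    matrices and affine gap penalties.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two made-up sequences sharing the block GATTACA.
print(smith_waterman("AAAGATTACAGGG", "CCGATTACATT"))
```

A raw score of this kind says nothing by itself about statistical significance; judging whether a hit is meaningful requires comparing it with the scores expected from unrelated sequences, which this sketch does not attempt.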

When a new human gene is sequenced, sequence database searching is essential for finding clues to its biochemical function. Sequence similarity has therefore been the basis of many of the discussions regarding the potential biological function of the familial breast and ovarian cancer genes BRCA1 and BRCA2. In [21], sequence similarity between BRCA1 and the granin protein family was detected and granin sequence motifs were also detected in BRCA2 sequences. A more rigorous analysis of the sequence features in BRCA1 [22••], however, has challenged the statistical significance of the granin relationship and proposed a new protein superfamily for the BRCA1 product which also included a large number of nuclear proteins mostly involved in cell cycle checkpoint functions which may associate with p53. A defect in cell cycle control could be an equally plausible explanation of the BRCA1 phenotype (see Note added in proof).

Obesity

Sequence similarity has also been used to help elucidate the genetic defect which results in the mouse tubby phenotype (see also the review by Naggert, Harris and North, this issue, pp 398-404). The tubby late-onset obesity phenotype with associated hearing loss and retinal degeneration provides an important model for the study of late-onset obesity in humans and may also be a model of other human obesity syndromes such as Bardet-Biedl and Alström. The cloning and sequencing of the mutated mouse gene [23,24] has shown that the defect is a single base change in the intron splice junction resulting in aberrant expression of intron sequence. The carboxyl terminus of the predicted protein from the wild type tubby gene was shown by sequence database searching to be similar to phosphodiesterase p4-6. It is not clear, however, whether the similarity to the phosphodiesterase is significant: whereas Kleyn et al. [23] argue the case that the tubby gene is a member of a new family, Noben-Trauth et al. [24] argue that the 62% amino acid similarity to p4-6 is consistent with phosphodiesterase being a causal factor in other examples of retinal degeneration. Furthermore, it is argued that there is a link between the accumulation of cGMP, as a result of insufficient phosphodiesterase activity, and apoptosis, and that evidence of apoptotic cell death can be observed in tubby mice.

The predictions of biochemical function for both the familial breast cancer and tubby phenotypes have pivoted around the detection of sequence relationships in higher organisms where extensive biochemical analysis had already been undertaken. Finding a human disease gene that is related to an already fully characterized mammalian gene can therefore be largely a matter of luck, whereas lower organisms offer new opportunities for functional genomic analysis.

Functional genomics by sequence analysis of model organisms

The importance of sequence data from lower organisms where systematic and large-scale functional analysis (e.g. gene knockouts) is possible and more extensive genetic experiments can be undertaken quickly is now widely appreciated. The availability of increasing amounts of sequence data from the classical geneticists' model organisms (yeast, fruitfly and the nematode) is having a significant impact on the understanding of human disease as models of these defects can be rapidly identified and exploited in lower organisms.

Cancer-prone syndromes

The major human diseases considered to involve defects in DNA repair and metabolism and which exhibit sensitivity to DNA-damaging agents and increased susceptibility to cancer all have equivalents in model organisms. The genetic similarities have been detected through the use of sequence database searching. For example, following the early finding that the Drosophila gene haywire is equivalent to the human DNA excision repair gene ERCC3 defective in xeroderma pigmentosum [25], other excision repair genes (e.g. ERCC2) have been shown to be homologous in human, mouse and hamster genomes [26].

Patients with the autosomal recessive disorder ataxia telangiectasia (AT) exhibit increased sensitivity to ionizing radiation and are >100 times more likely to develop cancer than members of the general population. Sequence similarity analysis has revealed that the gene mutated in AT, ATM, was most similar to the yeast gene TEL1 which, in turn, was similar to a second gene, MEC1. Both TEL1 and MEC1 show primary structure similarity to a checkpoint gene in Schizosaccharomyces pombe and appear to act in redundant checkpoint pathways. It would therefore seem that the gene defect responsible for AT could well be a cell cycle control gene.

Sequencing of cDNA from the candidate gene for the cancer-prone Bloom's syndrome, BLM, enabled a protein sequence to be predicted and compared with protein sequence databases. BLM was found to contain regions of strongly conserved local similarity to recQ helicases from human, yeast and Escherichia coli [27]. Furthermore, sequence comparison suggests that BLM has, in addition to potential helicase activity, DNA-dependent ATPase motifs and possibly other uncharacterized enzymatic activities.
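Short conserved motifs of the kind mentioned above, such as ATPase motifs or the leucine zipper, are commonly written as PROSITE-style patterns. The following sketch converts a small subset of that pattern syntax into a Python regular expression and scans a protein sequence for hits. The two patterns used in the example (a nucleotide-binding P-loop and a leucine zipper) are quoted from memory of PROSITE conventions, the test sequence is invented, and the function names are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern such as 'L-x(6)-L' to a Python regex.

    Supported subset (assumption): single residues, x, [ABC], {ABC},
    and repeat counts of the form (n) or (n,m).
    """
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = re.fullmatch(
            r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?", element
        )
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        pieces.append(core + repeat)
    return "".join(pieces)


def scan(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every hit, overlapping hits included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


# Invented sequence containing one P-loop-like match.
print(scan("MSTGPPGSGKTTLA", "[AG]-x(4)-G-K-[ST]"))
```

Patterns this short match by chance quite often in large databases, so a hit is a hint to be weighed against other evidence and not a functional assignment.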

Sequence comparison has thus yielded potentially important insights into the possible functions of genes defective in these diseases and has enabled the investigators to understand the results from experiments using the model genetic defects in other organisms such that the biochemical phenotype could be further dissected.

Tumour suppressor genes

Sequence homology between Drosophila patched and sequence data derived from the nevoid basal cell carcinoma syndrome (NBCCS) or Gorlin syndrome [28-30] has provided further evidence for the role of tumour suppressor genes in cancer. Patched is the third Drosophila segment polarity gene which has a human homolog implicated in tumorigenesis. The other two are wingless and Cubitus interruptus and add to the growing evidence of the importance of developmental regulator genes in human cancer.

Comparative genome analysis

The success of small-scale comparative analysis of single genes and their orthologs in other species combined with the availability of increasing amounts of genome sequence data and access to prodigious amounts of computing power has made the automatic analysis and comparison of entire genomes a practical proposition. In some cases, the genomes of entire organisms such as Haemophilus influenzae [31], Mycoplasma genitalium [32], Methanococcus jannaschii [33], E. coli and Saccharomyces cerevisiae ([34-36]; http://www.ncbi.nlm.nih.gov/XREFdb/) are already available and rapid progress is being made towards completion of full genomic sequencing for higher organisms such as Caenorhabditis elegans [37]. The publication of results from all such large-scale sequencing projects is accompanied by extensive bioinformatics analysis and prediction of putative genes and their predicted functions. Koonin et al. [38,39] have shown how computational methods can be coordinated to analyze genome-scale volumes of sequence data and have illustrated how comparison of viral genomes can provide information about the genetic processes at work during evolution [40]. Ouzounis et al. ([41••]; http://www.sander.embl-heidleberg.de/genequiz/) have shown how the GeneQuiz program [42] can be used to compare bacterial genomes and provide predictions of gene function through extensive use of bioinformatics software and databases.

Although Miklos and Rubin [43] have sounded notes of caution regarding the limitations in the use of model organism sequence data as the basis for comparative functional genetics, real excitement has been generated by the productivity of large-scale comparative analysis of known human disease genes and mouse mRNA and encoded proteins [44••] and of Drosophila mutant genes with human EST sequence databases [45••]. In the latter study, FISH and radiation hybrid mapping was used to link EST sequences with human genetic loci to discover new functional genes responsible for human disease.

Baxendale et al. [46••] argue, on the basis of the close conservation of genomic organization of the Huntington's disease genes in fish and man and the small size of its genome, that the pufferfish will provide an even more powerful model organism suitable for comparative genomics in the search for human disease genes.

Expressed sequence tag databases

The most direct window onto the expressed genome is that provided by ESTs, single-pass partial sequences from cDNA libraries such as those available in the dbEST database ([47]; http://www.ncbi.nlm.nih.gov/dbEST) which provides public access to collections of human ESTs, most notably that developed by Washington University [48••]. cDNA sequencing and EST databases are now widely used as part of both academic and commercial gene discovery projects [49,50••]. Large-scale analysis and review of EST sequence data, particularly with regard to the data quality, are now possible [48••] and attempts are being made to link EST data with other genomic sequence information in databases such as EGAD [51] and XREFdb ([34]; http://www.ncbi.nlm.nih.gov/XREFdb/). Example successes from database analysis of EST sequences came from the identification of homology between the yeast equivalent (MLH1) of the bacterial DNA mismatch repair gene mutL [52] and the gene defective in chromosome-3-linked hereditary nonpolyposis colorectal cancer, an observation made independently by Bronner et al. [53]. Levy-Lahad et al. [54] also used an EST database to identify and clone a candidate gene for chromosome 1 familial Alzheimer's disease (STM2). Sequence analysis suggested that the predicted protein contained seven transmembrane domains and was similar to the S182 gene, a strong candidate for the Alzheimer's locus AD3 mapped to chromosome 14.

Large-scale bioinformatic and experimental comparative genomics is complex and time consuming. The XREFdb project ([34]; http://www.ncbi.nlm.nih.gov/XREFdb/), which links together human EST data with homologous genes in model organisms, is being developed to bring the results of such studies together as an information resource that will be of particular value in disease gene identification.
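Comparing a protein from a model organism with single-pass EST sequences raises a practical point: the strand and reading frame of an EST are generally not known in advance, so a protein-level comparison usually considers all six conceptual translations. The sketch below shows one way to produce them. It assumes the standard genetic code, marks codons containing ambiguous bases as X, and uses hypothetical function names; it does not reproduce any particular database search program.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Tiny invented example: ATG GCC TAA reads as Met-Ala-stop in frame +1.
print(six_frames("ATGGCCTAA"))
```

Because ESTs are single-pass reads, sequencing errors can shift the reading frame part-way through a sequence, so a true match may be split across two of these frames.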

Sequence profiles and hidden Markov models

Whereas sequence alignment provides the most powerful method for finding closely related sequences, the identification of remote sequence similarities or local sequence features that are indicative of a particular biological function requires alternative approaches. Profiles are probabilistic sequence patterns defining the likelihood of finding particular nucleotides or amino acids in a specific order. Profile construction and searching methods are now established bioinformatics techniques (see chapters 20-28 in [55]) used for identifying remotely related sequences. Profiles are derived from the statistical analysis of residue conservation in aligned families of sequences with common function or origin from a common ancestor. A number of different statistical approaches have been used for sequence profile construction and database searching but the method currently creating the most excitement for detecting remote sequence relationships uses hidden Markov models (HMMs) [56], which have been described recently for gene recognition [57] and promoter detection [58].

Sequence motifs and profiles are often used to help elucidate the potential function of proteins encoded by novel genes. Key databases providing collections of aligned sequences and derived profiles and motifs are BLOCKS [59], PROSITE [60] and PRINTS [61]. A new collection of protein family alignments, some of which are constructed using HMMs, can be found in the Pfam database ([62]; http://www.sanger.ac.uk/Pfam/). Each family has functional annotation and cross-references to protein families in other protein databases and the literature. Some examples of how protein motif searching can reveal information about the function of candidate disease genes have already been listed in this review. In addition, a particularly clear example of the application of motif searching is given in a study of the gene implicated in Smith-Magenis syndrome [63] which was characterized as coding for an extracellular matrix protein involved in cell adhesion or intercellular interactions on the basis of the presence of a fibrinogen-like domain and a ligand motif for the cell surface receptor integrin.

Heavy-metal-associated proteins in prokaryotes contain a specific sequence motif. The review by Bull and Cox [64] discusses the evidence that two human diseases in which disruption of copper transport is a key feature (Wilson and Menkes disease) are caused by mutations in genes which encode proteins possessing the same heavy-metal sequence motif as found in prokaryotes. They propose that these genes encode the first putative heavy-metal transport proteins to be identified in eukaryotes.

Another class of sequence motifs with relevance to human disease is found in transcription factors. A federation of databases (TRANSFAC) containing information on known transcription factor sequences and their DNA-binding sites is described in [65]; reference [66] is a review of the role of transcription factors in human disease.

Higher-order information

Sequence comparison and profile searches address the identification of function via primary sequence information only. It is generally accepted, however, that aspects of gene regulation and interaction of DNA-binding proteins are moderated via higher-order structural features in nucleic acids.

Methods for the prediction of nucleic acid (RNA) secondary structure have been the subject of extensive study by bioinformaticians [67] and interest in these techniques has been growing as the role of RNA in the regulation of gene expression becomes clearer. A useful review of methods of searching for RNA hairpins and their potential value in identifying potential regulatory RNA motifs can be found in reference [68]. A recent example of how a mutation in a structural control element is implicated in human disease is illustrated by the role of the iron responsive element (IRE) in inherited hyperferritinaemia. IRE is a regulatory RNA structural motif predicted to be a single hairpin found upstream of the coding sequence for iron regulatory protein. Beaumont et al. [69•] report the identification of a point mutation in the IRE motif of L-ferritin in members of a family suffering from dominantly inherited hyperferritinaemia and cataracts; the authors speculate that the mutated IRE affects the interaction with iron regulating protein resulting in ferritinaemia, and that the accumulation of ferritin leads to cataracts.
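A search for simple RNA hairpins can be sketched as a scan for a stretch of bases followed, after a short loop, by a stretch that can pair with it in reverse order. The routine below is a naive illustration of that idea only: it allows Watson-Crick and G-U pairs, has no bulges or energy model, and its default stem and loop sizes are arbitrary assumptions. It is not a model of the IRE and the function name is hypothetical.

```python
# Base pairs allowed in a stem: Watson-Crick plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_hairpins(rna: str, min_stem: int = 4, min_loop: int = 3, max_loop: int = 8):
    """Return (start, end, stem_length, loop_length) tuples, 0-based and inclusive.

    For every possible loop position the stem is grown outwards for as long
    as the flanking bases can pair; hits with a short stem are discarded.
    """
    rna = rna.upper().replace("T", "U")
    n = len(rna)
    hits = []
    for loop_start in range(min_stem, n):
        for loop_len in range(min_loop, max_loop + 1):
            left = loop_start - 1          # innermost base of the 5' arm
            right = loop_start + loop_len  # innermost base of the 3' arm
            stem = 0
            while (left - stem >= 0 and right + stem < n
                   and (rna[left - stem], rna[right + stem]) in PAIRS):
                stem += 1
            if stem >= min_stem:
                hits.append((left - stem + 1, right + stem - 1, stem, loop_len))
    return hits


# Invented example: a 4 bp G-C stem closing a 4-base loop.
print(find_hairpins("GGGGAAAACCCC"))
```

In practice a scan like this reports many candidates in any long sequence, which is why structure prediction methods weigh candidate hairpins by thermodynamic stability or by conservation across related sequences.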

Discussion and conclusions

Bioinformatics is clearly playing a key role in the discovery and characterization of the genes implicated in human disease and is an essential component in all contemporary molecular biology research. The most active areas of research and the most promising approaches for delivering new insights are those that exploit genome data from a variety of sources, such as in computational comparative genome analysis or in the integration of gene detection and sequence comparison methods. The future challenges for bioinformatics arise from the deluge of new data that is being deposited in publicly accessible and private computer databases. It is almost certain that, in the future, the most successful genomics research centres will be those that have methods in place to track and exploit the widest variety of data sources and integrate them with their own research activities. In many cases, this will require not just the development of more sophisticated algorithms but also the application of good practices in information engineering and database management.

Disease-specific databases

The databases available for bioinformatics analysis are as important as the programs and algorithms for searching them. In addition to the well-known international archives of genetic map information and of DNA/protein sequence and structure information, a number of human gene and disease-specific databases have been established to serve the more specialist needs of research into human genetic disease. Short descriptions of many of these databases and Internet resources are provided in a special issue of Nucleic Acids Research [70], including the diseases haemophilia A/B, Marfan syndrome, X-linked agammaglobulinemia, and gene loci for Factor VIII, hprt, p53, adenomatous polyposis coli, phenylalanine hydroxylase (PAH), cholinesterase, fibrillin, the androgen receptor, the low-density lipoprotein receptor, the MHC and immunoglobulin families and type I collagen. Most of these databases contain mutation data and some contain information and links to other bioinformatics resources.

Further reading in bioinformatics

The book by Bishop and Rawlings [71] is the most recent to be published providing practical advice on the use (and pitfalls) of bioinformatics techniques and applications. A recent comprehensive overview of bioinformatics methods and resources relating particularly to protein sequence analysis and structure prediction is available in the computer methods special issue of Methods in Enzymology [55]. Each year, Nucleic Acids Research publishes a special issue on databases (the most recent being [70]) and Trends in Genetics features regular bioinformatics technology updates in its 'Genetwork' section. For the latest information on bioinformatics research, the key journals are Computer Applications in the Biosciences and the Journal of Computational Biology as well as the traditional journals publishing molecular biology and genetics research. The annual international conference on Intelligent Systems for Molecular Biology is probably the most important international conference on bioinformatics and the refereed and published proceedings are now a key resource [72,73].

Note added in proof

Recent observations [74-76] have confirmed the role of the BRCA1 and BRCA2 genes in cell cycle control and, in particular, as potential caretaker genes through an association with the Rad51 gene which is known to be essential in both DNA repair and meiotic and mitotic recombination.

Acknowledgements Thanks to David Carpenter from SmithKline hlanagement for assistance with compiling kledlinc Bo$z for administrative assistance.

References

and recommended

Bcecham searches

Information and to Evelyn

. l

1.

2. ..

*

1995, 10:269-276. An excellent example of combining bioinformatics and experimental approaches in the search for a candidate human disease gene in the region surrounding the translocation breakpoint between chromosomes 2 and 22 in a patient suffering from DiGeorge syndrome. The key bioinformatics methods include the GRAIL gene prediction and sequence database searching. Protein sequence from one of the ORFs spanning the breakpoint predicted by GRAIL was found to be similar to mammalian androgen receptor sequences and contained a leucine zipper motif. Although the sequence relationship was weak, the similarity to members of a family of transcriptional regulators could suggest a role for the candidate gene product that would be consistent with the developmental phenotype of the DiGeorge syndrome. 3.

Gama RE, Du YL, Baumann J, McCormick PJ: Identification of exons in a novel embryonal carcinoma locus using the GRAIL program. Oncol Rep 1996, 3:371-374.

4.

Schluter G, Celik A, Obata R, Schlicker M, Hofferbert S, Schlung A, Adham IM, Engel W: Sequence analysis of the conserved protamine gene cluster shows that it contains a fourth expressed gene. MO/ Reprod Dev 1996,43:1-6.

5.

Uberbacher EC, Xu Y, Mural RJ: Discovering and understanding genes in human DNA sequence using GRAIL. In Methods in Enzymology. Edited by Doolittle RF. San Diego: Academic Press; 1996:259-281.

6.

Snyder EE, Storm0 GD: Identification of protein coding regions in genomic DNA. J MO/ Biol 1995, 246:1-l 6.

7.

Fickett JW: Finding genes by computer: the state of the art 7iends Genet 1996, 12:316-320.

8. •

Burn TC, Connors TD, Dackowski WR, Petty LR, Van Raay TJ, Millholland JM, Venet M, Miller G, Hakim RM, Landes GM et al.: Analysis of the genomic sequence for the autosomal dominant polycystic kidney disease (PKD1) gene predicts the presence of a leucine-rich repeat. The American PKD1 Consortium (APKD1 Consortium). Hum Mol Genet 1995, 4:575-582. An excellent example of how many different bioinformatics techniques, including gene prediction (GRAIL), protein sequence analysis and database searching, were combined to assist in the characterization of a candidate human disease gene.

9. •

Glücksmann-Kuis MA, Tayber O, Woolf EA, Bougueleret L, Deng N, Alperin GD, Iris FI, Hawkins F, Munro C, Lakey N et al.: Polycystic kidney disease: the complete structure of the PKD1 gene and its protein. The International Polycystic Kidney Disease Consortium. Cell 1995, 81:289-298. A good example of a comprehensive bioinformatic analysis of the polycystic kidney disease gene sequence that illustrates the range of inferences that can be made from thorough use of sequence analysis methods and database searching. The authors found a signal peptide pattern and 5 leucine-rich repeats flanked by cysteine-rich regions in the translated protein sequence. Sequence comparison showed the sequence to be similar to C-type (calcium-dependent) lectin proteins. In addition, 14 copies of a low-density lipoprotein domain were identified. Protein secondary structure prediction methods suggested that the protein domain is globular and contains an antiparallel β sheet.

10. •

Hughes J, Ward CJ, Peral B, Aspinwall R, Clark K, San Millan JL, Gamble V, Harris PC: The polycystic kidney disease 1 (PKD1) gene encodes a novel protein with multiple cell recognition domains. Nat Genet 1995, 10:151-160. A thorough bioinformatics analysis of a human disease gene. Particular emphasis is placed on the use of multiple protein sequence alignment and analysis of conserved regions to derive evidence for the potential structure and function of the encoded protein.

11.

Meindl A, Dry K, Herrmann K, Manson F, Ciccodicola A, Edgar A, Carvalho MR, Achatz H, Hellebrand H, Lennon A et al.: A gene (RPGR) with homology to the RCC1 guanine nucleotide exchange factor is mutated in X-linked retinitis pigmentosa (RP3). Nat Genet 1996, 13:35-42.

12.

Rogozin IB, Milanesi L, Kolchanov NA: Gene structure prediction using information on homologous protein sequence. Comput Appl Biosci 1996, 12:161-170.

13.

Gelfand MS, Mironov AA, Pevzner PA: Gene recognition via spliced sequence alignment. Proc Natl Acad Sci USA 1996, 93:9061-9066.

14.

Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34:353-367.

15. ••

Birney E, Thompson JD, Gibson TJ: PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res 1996, 24:2730-2739. An important extension to existing methods for searching DNA sequence databases with protein sequence data that can accommodate frameshift errors. These programs integrate with the GCG sequence analysis package and can use either single or aligned proteins as the basis of the search. The programs have particular utility for searching error-prone EST databases.

Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest
•• of outstanding interest

Chen EY, Zollo M, Mazzarella R, Ciccodicola A, Chen CN, Zuo L, Heiner C, Burough F, Repetto M, Schlessinger D, D'Urso M: Long-range sequence analysis in Xq28: 13 known and 6 candidate genes in 219.4 kb of high GC DNA between RCP/GCP and G6PD loci. Hum Mol Genet 1996, 5:659-668.

Budarf ML, Collins J, Gong W, Roe B, Wang Z, Bailey LC, Sellinger B, Michaud D, Driscoll DA, Emanuel BS: Cloning a balanced translocation associated with DiGeorge syndrome and identification of a disrupted candidate gene. Nat Genet

16.

Guan X, Uberbacher EC: Alignments of DNA and protein sequences containing frameshift errors. Comput Appl Biosci 1996, 12:31-40.

17.

Searls DB: Sequence alignment through pictures. Trends Genet 1996, 12:35-37.

18.

Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching molecular sequence databases. Nat Genet 1994, 6:119-129.

19.

Pearson WR: Effective protein sequence comparison. In Methods in Enzymology. Edited by Doolittle RF. San Diego: Academic Press; 1996:227-258.

20.

Brenner SE: BLAST, Blitz, BLOCKS and BEAUTY: sequence comparison on the net. Trends Genet 1995, 11:330-331.

21.

Jensen RA, Thompson ME, Jetton TL, Szabo CI, Van der Meer R, Helou B, Tronick SR, Page DL, King MC, Holt JT: BRCA1 is secreted and exhibits properties of a granin. Nat Genet 1996, 12:303-308.

22. ••

Koonin EV, Altschul SF, Bork P: BRCA1 protein products ... Functional motifs... Nat Genet 1996, 13:266-268. This paper illustrates the importance of applying bioinformatics methods with rigour when seeking to predict gene/protein function from remote sequence similarities identified from database searching. The authors showed that, by careful dissection of the BRCA1 protein into separate domains and selecting improved amino acid scoring matrices for sequence database search methods, the significance of more remote similarities with cell cycle control genes p53 and Rad9.

23.

Kleyn PW, Fan W, Kovats SG, Lee JJ, Pulido JC, Wu Y, Berkemeier LR, Misumi DJ, Holmgren L, Charlat O et al.: Identification and characterization of the mouse obesity gene tubby: a member of a novel gene family. Cell 1996, 85:281-290.

24.

Noben-Trauth K, Naggert JK, North MA, Nishina PM: A candidate gene for the mouse mutation tubby. Nature 1996, 380:534-538.

25.

Mounkes LC, Jones RS, Liang BC, Gelbart W, Fuller MT: A Drosophila model for xeroderma pigmentosum and Cockayne's syndrome: haywire encodes the fly homolog of ERCC3, a human excision repair gene. Cell 1992, 71:925-937.

26.

Lamerdin JE, Stilwagen SA, Ramirez MH, Stubbs L, Carrano AV: Sequence analysis of the ERCC2 gene regions in human, mouse, and hamster reveals three linked genes. Genomics 1996, 34:399-409.

27.

Ellis NA, Groden J, Ye TZ, Straughen J, Lennon DJ, Ciocci S, Proytcheva M, German J: The Bloom's syndrome gene product is homologous to RecQ helicases. Cell 1995, 83:655-666.

28.

Hahn H, Wicking C, Zaphiropoulos PG, Gailani MR, Shanley S, Chidambaram A, Vorechovsky I, Holmberg E, Unden AB, Gillies S et al.: Mutations of the human homolog of Drosophila patched in the nevoid basal cell carcinoma syndrome. Cell 1996, 85:841-851.

29.

Hahn H, Christiansen J, Wicking C, Zaphiropoulos PG, Chidambaram A, Gerrard B, Vorechovsky I, Bale AE, Toftgard R, Dean M, Wainwright B: A mammalian patched homolog is expressed in target tissues of sonic hedgehog and maps to a region associated with developmental abnormalities. J Biol Chem 1996, 271:12125-12128.

30.

Gailani MR, Stahle Backdahl M, Leffell DJ, Glynn M, Zaphiropoulos PG, Pressman C, Unden AB, Dean M, Brash DE, Bale AE, Toftgard R: The role of the human homologue of Drosophila patched in sporadic basal cell carcinomas. Nat Genet 1996, 14:78-81.

31.

Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269:496-512.

32.

Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM et al.: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270:397-403.

33.

Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD et al.: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 1996, 273:1058-1073.

34.

Hieter P, Bassett DE Jr, Valle D: The yeast genome - a common currency. Nat Genet 1996, 13:253-255.

35.

Oliver SG: From DNA sequence to biological function. Nature 1996, 379:597-600.

36.

Walsh S, Barrell B: The Saccharomyces cerevisiae genome on the World Wide Web. Trends Genet 1996, 12:276-277.

37.

Hodgkin J, Plasterk RH, Waterston RH: The nematode Caenorhabditis elegans and its genome. Science 1995, 270:410-414.

38.

Koonin EV, Tatusov RL, Rudd KE: Protein sequence comparison at genome scale. In Methods in Enzymology. Edited by Doolittle RF. San Diego: Academic Press; 1996:295-322.

39.

Koonin EV, Tatusov RL, Rudd KE: Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications. Proc Natl Acad Sci USA 1995, 92:11921-11925.

40.

Hannenhalli S, Chappey C, Koonin EV, Pevzner PA: Genome sequence comparison and scenarios for gene rearrangements: a test case. Genomics 1995, 30:299-311.

41. ••

Ouzounis C, Casari G, Sander C, Tamames J, Valencia A: Computational comparisons of model genomes. Trends Biotechnol 1996, 14:280-285. This paper is a valuable overview of the work of the GeneQuiz consortium. The paper describes how the application of large-scale genome sequence analysis and automated gene function prediction provides the basis for comparing whole genomes and reveals some of the evolutionary processes that have shaped the genomes of higher organisms.

42.

Scharf M, Schneider R, Casari G, Bork P, Valencia A, Ouzounis C, Sander C: GeneQuiz: a workbench for sequence analysis. ISMB 1994, 2:348-353.

43.

Miklos GL, Rubin GM: The role of the genome project in determining gene function: insights from model organisms. Cell 1996, 86:521-529.

44. ••

Makalowski W, Zhang J, Boguski MS: Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res 1996, 6:846-857. This large-scale comparison of mouse and human nucleotide sequence data provides a valuable insight into some of the limits that might be expected from comparative genome analysis and the possible consequences for the cross-referencing of transcript maps. The authors compared the degree of amino acid and nucleotide sequence conservation between 1196 orthologous mouse and human genes. The data showed that, for some mouse and human genes, the nucleotide sequence is better conserved than the amino acid sequence. In some cases, functionally cloned equivalent genes showed remarkably low levels of sequence conservation by comparison with the average gene (85%): notably the breast cancer gene BRCA1 (57%) and the testis determining factor SRY (42%). These results help to benchmark the likely success of comparative genomics.

45. ••

Banfi S, Borsani G, Rossi E, Bernard L, Guffanti A, Rubboli F, Marchitiello A, Giglio S, Coluccia E, Zollo M et al.: Identification and mapping of human cDNAs homologous to Drosophila mutant genes through EST database searching. Nat Genet 1996, 13:167-174. This landmark study illustrates the potential for exploiting Drosophila genetics and the wide variety of mutant phenotypes for identifying human disease genes from EST databases. 66 Drosophila gene sequences with known mutant phenotypes were cross-referenced to human genes in the dbEST database by sequence comparison. FISH and radiation hybrid mapping were used to determine if the sequence tags mapped to loci associated with human genetic diseases. Approximately half of the Drosophila genes were found to map with the Genebridge 4 panel. The possible links between these STS markers and human disease loci are being further investigated by the authors.

46. ••

Baxendale S, Abdulla S, Elgar G, Buck D, Berks M, Micklem G, Durbin R, Bates G, Brenner S, Beck S et al.: Comparative sequence analysis of the human and pufferfish Huntington's disease genes. Nat Genet 1995, 10:67-76. This paper is an important example of how detailed comparative sequence analysis of a human disease gene (for Huntington's disease; HD) using the pufferfish HD gene can further the understanding of the function of a gene. This study also illustrates the power of the pufferfish genome as a model system for the analysis of human genes. The authors use the data from the pufferfish to considerably extend the evolutionary range over which sequence comparisons can be made in order to identify the regions of the HD gene that have been most conserved and should therefore hold the key to its function. Their analysis showed that the first coding exon, the site of the disease-causing triplet repeat, is highly conserved; however, in the pufferfish, the sequence consists of just four glutamine residues. The hypothesis advanced is that the polar zipper structure that might be formed by the more extensive human glutamine repeat region probably could not come about in the shorter pufferfish repeats. This supports the view that the polar zipper motif might play a part in the development of the disease.

47.

Boguski MS: The turning point in genome research. Trends Biochem Sci 1995, 20:295-296.

48. ••

Hillier L, Lennon G, Becker M, Bonaldo FM, Chiapelli B, Chissoe S, Dietrich N, DuBuque T, Favello A, Gish W et al.: Generation and analysis of 280,000 human expressed sequence tags. Genome Res 1996, 6:807-828. This paper presents a thorough [...] of EST [...] generated in the Washington University cDNA sequencing [...] which [...] majority of [...] in [...] of data quality of the tissue libraries and the EST [...] are reviewed [...] the [...] of gene [...]

[...] Trends Biotechnol 1996, 14:294-298. This paper shows how statistical and bioinformatics methods applied to EST sequence databases can be combined to provide insights into gene expression patterns in the tissues used to generate the cDNA libraries. The use of gene expression data generated in this way is presented as an important approach to the identification of potential therapeutic targets.

51.

Aaronson JS, Eckman B, Blevins RA, Borkowski JA, Myerson J, Imran S, Elliston KO: Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res 1996, 6:829-845.

52.

Papadopoulos N, Nicolaides NC, Wei YF, Ruben SM, Carter KC, Rosen CA, Haseltine WA, Fleischmann RD, Fraser CM, Adams MD et al.: Mutation of a mutL homolog in hereditary colon cancer. Science 1994, 263:1625-1629.

53.

Bronner CE, Baker SM, Morrison PT, Warren G, Smith LG, Lescoe MK, Kane M, Earabino C, Lipford J, Lindblom A et al.: Mutation in the DNA mismatch repair gene homologue hMLH1 is associated with hereditary non-polyposis colon cancer. Nature 1994, 368:258-261.

54.

Levy-Lahad E, Wasco W, Poorkaj P, Romano DM, Oshima J, Pettingell WH, Yu CE, Jondro PD, Schmidt SD, Wang K et al.: Candidate gene for the chromosome 1 familial Alzheimer's disease locus. Science 1995, 269:973-977.

55.

Doolittle RF: Computer Methods for Macromolecular Sequence Analysis. New York: Academic Press; 1996.

56.

Eddy SR: Multiple alignment using hidden Markov models. ISMB 1995, 3:114-120.

57.

Kulp D, Haussler D, Reese M, Eeckman FH: A generalised hidden Markov model for the recognition of human genes in DNA. ISMB 1996, 4:134-142.

58.

Pedersen AG, Baldi P, Brunak S, Chauvin Y: Characterisation of prokaryotic and eukaryotic promoters using hidden Markov models. ISMB 1996, 4:182-191.

59.

Henikoff JG, Henikoff S: Blocks database and its applications. Methods Enzymol 1996, 266:88-105.

60.

Bairoch A, Bucher P, Hofmann K: The PROSITE database, its status in 1997. Nucleic Acids Res 1997, 25:217-221.

61.

Attwood TK, Beck ME, Bleasby AJ, Degtyarenko K, Michie AD, Parry-Smith DJ: Novel developments with the PRINTS protein fingerprint database. Nucleic Acids Res 1997, 25:212-216.

62.

Sonnhammer ELL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein families based on seed alignments. Proteins 1997, in press.

63.

Zhao Z, Lee CC, Jiralerspong S, Juyal RC, Lu F, Baldini A, Greenberg F, Caskey CT, Patel PI: The gene for a human microfibril-associated glycoprotein is commonly deleted in Smith-Magenis syndrome patients. Hum Mol Genet 1995, 4:589-597.

64.

Bull PC, Cox DW: Wilson disease and Menkes disease: new handles on heavy-metal transport. Trends Genet 1994, 10:246-252.

65.

Wingender E, Kel AE, Kel OV, Karas H, Heinemeyer T, Dietze P, Knuppel R, Romashenko AG, Kolchanov NA: TRANSFAC, TRRD and COMPEL: towards a federated database system on transcriptional regulation. Nucleic Acids Res 1997, 25:265-268.

66.

Engelkamp D, Van Heyningen V: Transcription factors in disease. Curr Opin Genet Dev 1996, 6:334-342.

67.

Westhof E, Auffinger P, Gaspin C: DNA and RNA structure prediction. In DNA and Protein Sequence Analysis. Edited by Bishop MJ, Rawlings CJ. Oxford: IRL Press; 1997:255-278.

68.

Dandekar T, Hentze MW: Finding the hairpin in the haystack: searching for RNA motifs. Trends Genet 1995, 11:45-50.

69. •

Beaumont C, Leneuve P, Devaux I, Scoazec JY, Berthier M, Loiseau MN, Grandchamp B, Bonneau D: Mutation in the iron responsive element of the L ferritin mRNA in a family with dominant hyperferritinaemia and cataract. Nat Genet 1995, 11:444-446. This paper illustrates the potential importance of regulatory structural features in nucleic acids and shows that mutations in such regions are implicated in some human diseases.

70.

Various: Database Issue [abstracts]. Nucleic Acids Res 1997, 25:1-282.

71.

Bishop MJ, Rawlings CJ: DNA and Protein Sequence Analysis. Oxford: IRL Press; 1997.

72.

Rawlings CJ, Clark DA, Altman R, Hunter L, Lengauer T, Wodak S: Proceedings of Third International Conference on Intelligent Systems for Molecular Biology. Menlo Park, California: AAAI Press; 1995.

73.

States DJ, Agarwal P, Gaasterland T, Hunter L, Smith R: Proceedings of Fourth International Conference on Intelligent Systems for Molecular Biology. Menlo Park, California: AAAI Press; 1996.

74.

Scully R, Chen AP, Xiao Y, Weaver D, Feunteun J, Ashley T, Livingston DM: Association of BRCA1 with RAD51 in mitotic and meiotic cells. Cell 1997, 88:265-275.

75.

Sharan SK, Morimatsu M, Albrecht U, Lim DS, Regel E, Dinh C, Sands A, Eichele G, Hasty P, Bradley A: Embryonic lethality and radiation hypersensitivity mediated by Rad51 in mice lacking BRCA2. Nature 1997, 386:804-810.

76.

Milner J, Ponder B, Hughes-Davies L, Seltmann M, Kouzarides T: Transcriptional activation functions in BRCA2. Nature 1997, 386:772-773.