Regulatory elements and expression profiles


Philipp Bucher

There has been steady progress in the computational analysis of transcription control regions, but current methods of predicting the gene regulatory features of noncoding sequences are still not accurate enough to be useful in automatic genome annotation. Therefore, detailed information on the expression patterns of newly sequenced genes is more likely to come from microarray-based high-throughput mRNA quantitation technologies, which have made revolutionary progress over the past few years and are now ready for genome-wide application. Future solutions to the regulatory element prediction problem may be found by the combined analysis of genome sequence and expression data.

Addresses
Swiss Institute for Experimental Cancer Research and Swiss Institute of Bioinformatics, Ch. des Boveresses 155, 1066 Epalinges, Switzerland; e-mail: [email protected]

Current Opinion in Structural Biology 1999, 9:400-407

http://biomednet.com/elecref/0959440X00900400

© Elsevier Science Ltd ISSN 0959-440X

Abbreviations
bp    base pair
EST   expressed sequence tag
HCR   highly conserved region
SAGE  serial analysis of gene expression
UTR   untranslated region

Introduction
Gene control mechanisms have been intensely studied in diverse organisms for at least three decades. Despite these efforts, our understanding of how regulatory information is encoded by a DNA sequence is still very fragmentary. Confronted with a newly sequenced control region of a gene, we are, in most cases, unable to make reliable predictions about its tissue-specific and developmental-stage-specific expression pattern. Relating DNA sequence to gene regulatory function thus seems to be one of the hardest problems in current biology, and accurate prediction methods for gene regulatory elements are not expected to be ready when the complete human genome sequence is finished (in approximately four years from now). Fortunately, the steady, but slow, progress in this field is accompanied by revolutionary breakthroughs in high-throughput gene expression monitoring technology. Thus, we can be confident that we will soon know from wet laboratory studies where and when each human gene is transcribed and how its expression is controlled by [...]

Characterization of transcriptional control regions

The past two years have been a period of consolidation and maturation for this field. Critical evaluation studies of existing methods have led to a more realistic assessment of the performance of current algorithms in identifying and locating transcription control signals. A broadly noticed review on eukaryotic promoter prediction programs, presenting bench-marks obtained with an independent test set, came to the conclusion that these methods are not yet accurate enough to be useful in the automatic annotation of the human genome [1•]. Another test revealed that the available software tools for locating the binding sites for a particular transcription factor (CTF/NF1) often missed experimentally confirmed target sites and failed to make accurate binding strength predictions for others [2]. Although neither of these studies claims to be representative, they both make clear that current methods cannot be blindly trusted and further highlight the need for objective bench-marking procedures in order to monitor progress towards more accurate tools.

A positive development in the field is that there are now increasingly convergent views on the organization of gene regulatory regions and also an emerging paradigm for their computational characterization, which can be summarized as follows: the elementary units of transcription regulatory regions are transcription-factor-binding sites; control regions, such as promoters, enhancers, locus control regions and so on, are modular; the regulatory output of a control region depends on the specific combination of its elements, as well as on the order and orientation in which they occur; and genes are typically controlled by several control regions located upstream or downstream (and possibly far away) from the transcription initiation site.

From experimental studies, it is clear that the function of a regulatory region is mediated by multiprotein complexes comprising synergistically interacting transcription factors bound to clustered DNA sites. Similar principles may apply to noncoding regulatory RNA regions, which are not the focus of this review. It is now broadly recognized that the major obstacle to the computational characterization of gene regulatory regions relates to the fact that the sequence motifs corresponding to the elementary modules contain too little diagnostic information for them to be distinguished from chance occurrences [3]. [...] distinguish between functional and biologically irrelevant transcription-factor-binding sites matching a corresponding consensus sequence or weight matrix description. Much of the recent work in this field has been inspired by this basic idea.

In a previous review [4], we have focused on three topics: methods for characterizing individual control elements; promoter prediction algorithms; and phylogenetic footprinting, which is the name of an increasingly popular approach for localizing important regulatory regions within large genomes through cross-species comparisons. Before switching to new themes, a brief update on the most important developments in these areas is given now. A comprehensive, comparative evaluation of software tools for the characterization and identification of transcription control elements has appeared [5•] and several new algorithms for eukaryotic promoter prediction have been published [6-8]. A nice validation of the phylogenetic footprinting approach came from experimental work in which highly conserved regions (HCRs) in mRNA 3' untranslated region (UTR) sequences were subsequently functionally characterized as sensors of environmental stress [9•]. In a provocative paper, Lipman [10•] offered a novel explanation for the high sequence conservation observed in 3' UTRs, suggesting that they function via duplex formation with antisense RNA.
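As a hedged illustration of the weight matrix descriptions mentioned above, the following Python sketch scores every window of a sequence against a log-odds matrix. The count matrix, the uniform background and the threshold are invented for illustration (they do not describe any real transcription factor), and the function names are hypothetical.

```python
import math

# Hypothetical toy count matrix for a 4-bp motif (one dict per motif position).
# The counts are invented for illustration and describe no real factor.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
# Assumption: equal background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_matrix(counts, background, pseudocount=1.0):
    """Convert base counts into log2(frequency / background) scores."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous symbols
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS, BACKGROUND)
    for hit in scan("TTACGTGGACGTAA", pwm, threshold=4.0):
        print(hit)
```

With a motif this short, an exact match is expected by chance about once every 4^4 = 256 positions of random sequence with equal base frequencies, which is the low-information problem described above.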

Composite elements and control regions

As mentioned above, a major current trend is to analyze the contextual constraints governing the composition and positioning of genetic elements within a regulatory region. Two main approaches can be distinguished, which may be called 'bottom up' and 'top down', and they are not mutually exclusive. The bottom-up approach starts with a known element and attempts to identify contextual features that may help to distinguish functional from nonfunctional sites. A straightforward application of this idea is to search for significant element pairs occurring at conserved distances from each other. An early discovered example of such a bipartite control signal, occurring in embryo-specific sea urchin histone genes, consists of a GATCC motif followed, no more than 9 bp downstream, by a canonical TATA-box promoter element [11]. Another noteworthy example was found in the 5' flanking regions of ribosomal protein genes in Schizosaccharomyces pombe. These genes have a highly unusual promoter type, consisting of a site selector element that is functionally analogous to, but different sequence-wise from, a TATA box, and an upstream element that can occur in either orientation [12•]. Both element pairs are expected to have considerable diagnostic value for the corresponding regulatory class of genes, as chance matches to the combined motifs are expected to occur less than once in a million base pairs. Many more examples of composite elements can be found in the COMPEL database [13••,14].

In the top-down approach, one [...] genes and attempts to derive a specific model of the conserved sequence features (for a review of such techniques, see [15]). The models typically used in such studies consist of weight matrices or consensus sequences for obligatory and optional elements, and rules restricting the order and orientation in which the elements are allowed to occur. Control regions conferring muscle-specific expression to genes and the long terminal repeats of retroviruses were recently characterized by such techniques [16,17]. In both cases, the resulting models selectively identified new, plausible target sequences in the database that were not included in the training set used for model building. Other groups applied methods based on hidden Markov models [18], grammatical models [19] and fuzzy clustering [20] in order to analyze regulatory regions.

A number of groups have tried to integrate DNA structure prediction into their methodologies for characterizing transcriptional control regions (e.g. [21,22]). A recent application of such a technique to eukaryotic promoters has revealed an interesting 10 bp periodicity starting immediately downstream of the transcription initiation site [23•]. Although it is undeniable that such methods can capture physiologically relevant sequence features, it is less certain whether this is accomplished through correct DNA structure prediction. My personal experience is that different software tools for computing the intrinsic curvature and bendability of double-stranded DNA often make conflicting predictions when tested on the same sequence. The success of such approaches can therefore not be directly interpreted as evidence that a specific kind of intrinsic DNA structure plays an important role in gene regulation.
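The bottom-up search for element pairs at conserved distances, and the chance-match argument that goes with it, can be sketched as follows. This is a minimal illustration only: the motif strings, the spacing range and the function names are placeholders chosen for the example, exact string matching stands in for a weight matrix, and the chance estimate assumes one strand of random sequence with equal base frequencies.

```python
import re


def find_pairs(sequence, upstream_motif, downstream_motif, min_gap, max_gap):
    """Return (start_a, start_b, gap) for each upstream/downstream motif pair
    whose spacing (bases between the two motifs) lies within [min_gap, max_gap]."""
    # A zero-width lookahead reports overlapping occurrences as well.
    starts_a = [m.start() for m in re.finditer(f"(?={re.escape(upstream_motif)})", sequence)]
    starts_b = [m.start() for m in re.finditer(f"(?={re.escape(downstream_motif)})", sequence)]
    pairs = []
    for a in starts_a:
        for b in starts_b:
            gap = b - (a + len(upstream_motif))
            if min_gap <= gap <= max_gap:
                pairs.append((a, b, gap))
    return pairs


def expected_chance_pairs(len_a, len_b, min_gap, max_gap, sequence_length):
    """Expected number of exact chance pairs on one strand of random sequence,
    assuming equal base frequencies (a simplifying assumption)."""
    p_pair = 0.25 ** (len_a + len_b)
    allowed_gaps = max_gap - min_gap + 1
    return p_pair * allowed_gaps * sequence_length


if __name__ == "__main__":
    # Placeholder motifs and spacing, used only to exercise the functions.
    sequence = "CC" + "GATCC" + "A" * 8 + "TATAAA" + "GG"
    print(find_pairs(sequence, "GATCC", "TATAAA", 7, 9))
    print(expected_chance_pairs(5, 6, 7, 9, 1_000_000))
```

Under these assumptions, a 5-bp and a 6-bp element with three allowed spacings give roughly 0.7 expected chance pairs per million base pairs, whereas either element alone would match by chance hundreds of times. This is why the combined signal carries far more diagnostic value than its parts.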

Recent work on prokaryotic gene control elements

The availability of several complete bacterial genomes has both resuscitated interest in prokaryotic gene regulation and motivated a number of comparative computational studies of transcriptional control elements. Not surprisingly, most researchers have chosen to analyze the Escherichia coli genome because of the wealth of experimental data on transcriptional regulation in this organism. One group generated a library of weight matrices defining the binding specificity of 55 different transcription factors for which target sites had been experimentally identified [24•]. Another group searched the entire genome for matches to [...] sequenced bacterial species use RNA-hairpin-based termination mechanisms that are similar to the rho-independent pathway of E. coli [27••].

Expression profiles

If the prediction of gene regulatory features from sequence is so hard, do we really need it? In view of the rapid recent progress made in genome-wide expression monitoring, some researchers may be tempted to say no. Remember that the human genome project was partly justified by arguing that noncoding DNA sequences will tell us something about the regulation of the genes. With the advent of microarray technology it now seems probable that this information will first come from high-throughput mRNA quantitation experiments, rather than from sequence-based in silico predictions. It would be shortsighted, however, to play down the importance of the gene regulatory feature prediction problem using such arguments. In applied biomedical research, it may be sufficient to know when and where a gene is expressed; in order to understand life, one has definitely also to know why this happens. Large-scale expression profiling thus should be viewed as a welcome complement, rather than an alternative, to whole genome sequencing and should be integrated into research methodologies aimed at elucidating the sequence/function relationships of gene regulatory regions.

New technologies and new data structures

Among the high-throughput methods for gene expression profiling, two principally different strategies can be distinguished: the sequencing of a large number of cloned gene tags and the parallel hybridization of in vitro labeled mRNA populations (usually achieved by reverse transcription) to densely arrayed target probes. The two approaches have also been called 'digital' and 'analog', as they produce expression profiles consisting of integer and real numbers, respectively [28•]. Adams et al. [29] were the first to demonstrate that large-scale single-pass expressed sequence tag (EST) sequencing can be used to study gene expression. An accelerated version of this strategy, consisting of sequencing multiple concatenated short oligonucleotide tags in a single run, has been published under the acronym SAGE (serial analysis of gene expression) [30]. Although sequencing-based expression profiling techniques have the principal advantage of being capable of detecting new transcripts, microarray-based approaches are expected to prevail in the future, as they appear to be more cost effective and accurate, especially for weakly expressed genes. Moreover, the fact that the mRNA quantitation experiments occur in parallel under exactly the same hybridization conditions ensures a high degree of cross-standardization among individual measurements. Two types of microarray are currently in use: DNA arrays, carrying long cDNA molecules transferred by a robot to a nylon membrane or glass slide [31], and oligonucleotide arrays, carrying in situ synthesized oligonucleotides of about 10 bases [32]. Both methods have been shown to work in practice and have already made important contributions to our understanding of gene regulation, as exemplified by the pioneering case studies described further below. Additional information on all aspects of DNA microarrays can be found in a comprehensive series of reviews published in a recent supplement of Nature Genetics entitled "The Chipping Forecast" [33•,34,35]. The merits and drawbacks of the various sequencing and microarray-based techniques are further discussed in [36•].

The advent of microarray-based gene expression monitoring is a revolutionary development in biological research, not because many biologists are expected to apply this new and still very expensive technology in their own research, but because every biologist will soon have access to the large amounts of public data produced by this technique. The term transcriptome has been coined to refer to this new type of data structure, comprising the expression levels of all the genes of a genome in a given regulatory state of a cell. As a result of both the dynamic nature of gene expression and the complexity of regulatory processes in higher organisms, transcriptomes may soon exceed genome sequence data in sheer volume. The nature of this new type of biological information and how it can be exploited for answering biological questions will be the focus of the remaining part of this review.
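Because 'digital' profiles consist of integer tag counts, sampling noise matters when two libraries are compared. The Python sketch below assumes the conditional probability commonly attributed to the framework of Audic and Claverie [28•], p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)), where x and y are the counts of one gene in two libraries of N1 and N2 tags. The function names are hypothetical and the numbers in the usage lines are invented.

```python
import math


def audic_claverie_probability(x, y, n1, n2):
    """p(y | x): probability of seeing y tags of a gene in library 2 (n2 tags in
    total) given x tags in library 1 (n1 tags), assuming equal expression."""
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
        - (x + y + 1) * math.log(1.0 + ratio)
    )
    return math.exp(log_p)


def upper_tail(x, y, n1, n2):
    """P(Y >= y | x), computed as one minus the probability mass below y."""
    below = sum(audic_claverie_probability(x, k, n1, n2) for k in range(y))
    return 1.0 - below


if __name__ == "__main__":
    # Invented example: 2 tags out of 10,000 versus 20 tags out of 10,000.
    print(upper_tail(2, 20, 10_000, 10_000))
```

Working in log space with `math.lgamma` avoids overflow of the factorials for highly expressed genes; a small tail probability suggests that the difference in counts is unlikely to be a sampling effect alone.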

Examples of genome-wide expression monitoring studies

For obvious reasons, the budding yeast was the first model organism in which physiological processes were studied using genome-wide expression profiles. A complete characterization of its transcriptome was initially achieved by SAGE [37••] and was later confirmed and refined by oligonucleotide array technology [38••]. In this latter study, a temperature-sensitive RNA polymerase II mutant strain was used to measure the time-dependent decay of mRNA levels after blocking mRNA synthesis. Knowing the half-lives of all the mRNAs obtained in this way, it is now possible to convert steady-state mRNA levels into transcription rates, which is essential for understanding the kinetic aspects of transcription regulatory events. Moreover, the dependence of the transcriptome on various key components of the yeast polymerase II transcription machinery was assayed using a set of 'genetic reagents' (conditional mutants and gene knockout strains). Additional publicly available data sets for yeast are indicated in Table 1. Very recently, high-throughput expression profiling has been extended to mammalian systems. For instance, the IMAGE consortium has released quantitative hybridization data for 5058 human genes in six different tissues [39•]. Time-course analysis is a particularly powerful way of studying global gene regulation using expression profiles. In this type of analysis, snapshots of the entire transcriptome are taken at successive time points after inducing a change in the regulatory state of a cell culture or a tissue. The value of the data resulting from such experiments can be further
increased by additional expression profiles generated for mutant strains lacking or overexpressing transcription factors thought to be functionally involved in the response. The study of the diauxic shift in yeast (metabolic change from fermentation to respiration), accompanied by expression profiling of strains lacking the regulatory protein TUP1 or overexpressing the transcriptional activator YAP1, is one of the first examples of this approach [40••]. Not surprisingly, the yeast mitotic cell cycle, as well as the sporulation program [41•] in the same organism, were also first targets of time-course analysis. Four independent data sets are available, generated by two different laboratories [42••,43••] using two techniques (oligonucleotide and DNA microarrays) and four different cell culture synchronization protocols. In the first application to a mammalian system, the transcriptional response of human fibroblasts to serum stimulation was analyzed for 8600 genes [44••]. These studies represent pioneering examples of a new, global approach to gene regulation.

Gene expression profiling also plays an increasingly important role in medical and pharmacological research. Subtractive cloning and differential display techniques have been applied for some time in cancer research to identify transcripts that are significantly over-represented or under-represented in tumor cells [...]. Later, SAGE was used to quantitatively determine gene expression profiles in normal and tumor cells [46] and to characterize altered gene expression patterns [...]. Oligonucleotide arrays have recently been applied to identify genes that are differentially regulated by interferon and by cytomegalovirus infection [...]. In pharmacology, expression profiling technology is applied to detect the potential side effects of drugs, such as protein kinase inhibitors [48] or immunosuppressants. The data resulting from these studies, however, are not available on the Internet [...].

Table 1
Examples of expression profiles available over the Internet.

Columns: Species; Type of data; Number of states/time points; Number of genes; URL; References.

Yeast
Wild-type transcriptome
Transcription apparatus mutants: RNA polymerase II; SRB-mediator core complex; SRB10-CDK complex; SWI-SNF complex; general transcription factors; SAGA
Heat shock
Mating type alpha versus a
Galactose versus glucose medium
Diauxic shift time course; TUP1 deletion; YAP1 overexpression
Mitotic cell cycle: α-pheromone (G1 arrest); cdc-15 (late M arrest); cdc-28 (late G1 arrest); elutriation (start in early G1); induction of Cln3 (G1 cyclin); induction of Clb2 (B-type cyclin)
Sporulation program; with ectopically expressed Ndt80

Human
Transcriptome for six tissues
In fibroblasts: serum response; serum plus cycloheximide response

Rat
Cervical spinal cord development

(a) http://www.wi.mit.edu/young/transcriptome.html (b) http://arep.med.harvard.edu/mrnadata/expression.html (c) http://genome-www.stanford.edu/ (d) http://idefix.upr420.vjf.cnrs.fr (e) http://rsb.info.nih.gov/mol-physiol/homepage.html

Computational analysis of expression data
Bioinformatics is an essential component of high-throughput gene expression profiling methodology. An excellent review summarizes the various kinds of computer applications involved, from image processing to data management, data mining and
visualization [50]. Some of these applications are only relevant to the researchers performing the experiments. Here, the focus is on those methods that are of interest to everybody who wants to access and exploit publicly available data, either by using web-based exploration tools or by downloading the raw data for analysis on a local computer.

An important question that frequently arises in the interpretation of digital expression profiles is whether an observed difference in gene counts is statistically significant. Audic and Claverie [28•] have analyzed this problem and proposed a statistical framework for significance tests. The development of algorithms that exploit microarray-based gene expression data has only started very recently. Thanks to the regular structure of these data (usually a matrix of numbers corresponding to genes and regulatory states), many biologically relevant questions can already be addressed by the standard methods implemented in widely used statistical analysis packages. The few specific methods developed so far perform the kinds of operations known from molecular sequence analysis: computation of distance measures for pairs of expression profiles and the generation of tree-like clustering diagrams useful for defining groups of co-regulated genes [51•,52]. The gene-specific transcriptional response patterns observed in the budding yeast after a diauxic shift [40••] and the pattern observed in human fibroblasts after serum stimulation [44••] have both been classified in this way. More specialized Fourier transform methods were applied to the yeast mitotic cell cycle data in order to reveal genes with oscillating expression patterns [52].

A more ambitious goal in the analysis of expression profiles is the reconstruction of regulatory circuits from time-course data, also called reverse engineering of genetic pathways. Boolean networks [53] and linear models [54,55], in which the change in the mRNA abundance of a particular gene depends on the simultaneous mRNA abundance of all the other genes, have been proposed for this purpose. The latter kind of model was applied to a matrix of gene expression levels for 112 rat genes measured at nine different time points during central nervous system development [55,56•]. As this data set was mathematically too small to estimate all the parameters of the model, the study has a mainly exploratory value. Others have validated the same type of methodology using computer simulated data, however, showing that it could work in principle [54]. In another pioneering application to real data [57•], the network of regulatory interactions between the cis-acting elements of a 2300 bp control region of a sea urchin gene was deduced from a series of spatial and temporal expression profiles produced by mutant promoter constructs.
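The two operations described above for array data, a distance measure for pairs of expression profiles and a tree-like clustering built from it, can be sketched in a few lines of Python. This is an illustrative sketch rather than the method of any cited study: it assumes a correlation-based distance and average-linkage agglomeration, and the gene names and profile values are invented.

```python
import math


def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for mirror images."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # a flat profile is treated as uncorrelated
    return 1.0 - cov / math.sqrt(var_a * var_b)


def average_linkage(profiles):
    """Agglomerative clustering; returns the tree as nested tuples of gene names."""
    clusters = {name: [name] for name in profiles}  # tree node -> member genes
    while len(clusters) > 1:
        best = None
        nodes = list(clusters)
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                left, right = clusters[nodes[i]], clusters[nodes[j]]
                # average distance over all cross-cluster gene pairs
                dist = sum(
                    pearson_distance(profiles[g], profiles[h])
                    for g in left for h in right
                ) / (len(left) * len(right))
                if best is None or dist < best[0]:
                    best = (dist, nodes[i], nodes[j])
        _, node_a, node_b = best
        merged = clusters.pop(node_a) + clusters.pop(node_b)
        clusters[(node_a, node_b)] = merged
    return next(iter(clusters))


if __name__ == "__main__":
    # Invented time-course profiles (rows: genes, columns: time points).
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
        "geneD": [9.0, 7.0, 4.0, 2.0],
    }
    print(average_linkage(profiles))
```

A correlation-based distance groups genes by the shape of their response rather than by absolute expression level, which is usually what matters when looking for co-regulated genes; other distance measures would be equally legitimate choices.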

Combining sequence and expression data analysis

A promising current trend is to use genome-wide expression profile data to define training and test sequence sets for analyzing gene regulatory elements. A number of applications to yeast data have confirmed the general validity of this approach. For instance, time-course data on the mitotic cell cycle were used to extract upstream regions of genes specifically expressed during the G1, S, G2 or M phases, respectively [42••,43••]. Subsequent searches of these sequence sets for published consensus sequences for transcription-factor-binding sites identified several diagnostic patterns for genes transcribed at a specific stage of the cell cycle. In addition, a refined model of a composite element that is specific for genes induced during the transition from mitosis to the G1 phase was generated by this approach [43••]. Others have developed and applied ab initio methods to identify gene regulatory signals mediating a particular regulatory response [58,59,60•] from combined genome and transcriptome data.

The advantage of using whole genome expression profiles in the development of predictive methods for gene regulatory elements is that the matches found in the test sets can be objectively classified as true or false hits. An example may further illustrate this point. Suppose that a computer program had predicted a binding site of a cell-cycle-regulating transcription factor in the upstream region of a methionine biosynthesis gene a few years ago. At that time, the developer of the program would have probably discarded this result as a false hit and changed the algorithm of the program to suppress it. In the meantime, the time-course data on the yeast cell cycle have revealed the surprising fact that methionine biosynthesis genes are indeed cell-cycle-regulated, making it possible today to draw the right conclusions from such a test result.
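The idea of using an expression-defined gene set as a test set can be illustrated with a simple over-representation check: count how many upstream regions in a co-expressed cluster contain a consensus sequence and compare this with the whole genome. The sketch below is an assumption-laden example, not a procedure taken from the cited studies; it uses exact string matching and a hypergeometric tail probability, and the gene names, sequences and motif are invented placeholders.

```python
from math import comb


def hypergeom_upper_tail(k, motif_genes, cluster_size, total_genes):
    """P(X >= k) when cluster_size genes are drawn at random from total_genes,
    of which motif_genes carry the motif."""
    total = comb(total_genes, cluster_size)
    favourable = sum(
        comb(motif_genes, i) * comb(total_genes - motif_genes, cluster_size - i)
        for i in range(k, min(motif_genes, cluster_size) + 1)
    )
    return favourable / total


def motif_enrichment(upstream, cluster, motif):
    """upstream: gene name -> upstream sequence; cluster: set of co-expressed genes.
    Returns (hits in cluster, hits in genome, tail probability)."""
    with_motif = {gene for gene, seq in upstream.items() if motif in seq}
    hits = len(with_motif & cluster)
    p_value = hypergeom_upper_tail(hits, len(with_motif), len(cluster), len(upstream))
    return hits, len(with_motif), p_value


if __name__ == "__main__":
    # Invented upstream regions; "ACGCGT" is only a placeholder consensus.
    upstream = {
        "g1": "TTACGCGTAA", "g2": "ACGCGTGGGA", "g3": "CCCACGCGTT",
        "g4": "TTTTTTTTTT", "g5": "GGGGCCCCAA", "g6": "ATATATATAT",
    }
    cluster = {"g1", "g2", "g3"}
    print(motif_enrichment(upstream, cluster, "ACGCGT"))
```

Because the cluster is defined by expression data rather than by sequence, a low tail probability is evidence for the motif that does not depend on prior expectations about which genes ought to carry it, which is the point made by the methionine biosynthesis example above.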

Conclusions
Methods for predicting gene regulatory features have made steady progress over the past few years, but are still not reliable enough to be used in automatic genome annotation. A new paradigm is emerging in this field that views gene regulatory regions as modular structures and attempts to assess the function of individual elements in a context-dependent manner. First promising results have already been obtained using the contextual approach. Comprehensive information on the tissue-specific and developmental-stage-specific expression of all the genes of an organism is nevertheless more likely to come from experimental studies. High-throughput microarray-based parallel mRNA quantitation technology has recently become operational and now makes it possible to monitor the expression of an entire genome in a single experiment. The resulting data have been called a 'transcriptome'. In order to have a maximal impact on biological research, transcriptome data need to be shared by as many researchers as possible, as they inevitably contain information that is beyond the research interests of the team producing them. It is therefore imperative to put in place efficient data distribution channels for this new type of biological data, like those that already exist for nucleic acid and protein sequences. Leading journals could promote the dissemination of this information by requiring authors to deposit their data in a public archive. A major challenge to bioinformatics is the organization and representation of the transcriptome in such a way that it can readily be accessed, visualized and analyzed by powerful computer algorithms. Prototypes of integrated gene expression databases that possibly fulfill these requirements have been described by the researchers of the NCBI (National Center for Biotechnology Information) [61•] and the IMAGE consortium [39•]. The transcriptome will certainly be of great help in the computational characterization of transcriptional control regions and, hopefully, will accelerate progress towards reliable predictive tools. In the long run, these data may enable the computational biologist to reconstruct the gene regulatory networks that control the development of higher organisms, including ourselves.

References Papers of particular have been highllghted l

**of

and recommended Interest, as:

published

the annual

period

of review,

of special Interest outstanding Interest

1. Fickett JW, Hatzlgeorgiou AG: Eukaryotic promoter recognition. . Genome Res 1997,7:861-878. A provocatlve review presenting bench-marks for eukaryotlc promoter prediction tools. The conclusion IS that none 01 the currently available methods IS accurate enough to be useful in automatic genome annotation. 2.

3

4.

Roulet E, Flsch I, Junier T, Bucher P, Mermod N: Evaluation computer tools for the prediction of transcription binding genomic DNA. In Silica Bfol 1998, I :21-28.

of sites

Audtc S, Clavene JM: Visualizing the competitive recognition TATA-boxes in vertebrate promoters Trends Genet 1998, 14:10-l 1. Duret L, Bucher P: Searching noncoding sequences. Curr

on

Frech K, CIuandt K, Werner T: Software for the analysis of DNA sequence elements of transcription. Con~p~it Appi B0sci 1997, 13:89-97. A recent review on currently applied algoritnms and software in this field. Contains useful URLs for public programs arid Internet resources.

7

8.

Audit S, Claverle JM: Detection of eukaryotic using Markov transition matrices. C )mput 21~223.227.

promoters Chem 1997,

Chen QK. Hertz GZ. Storm0 GD: PromFD 1.0: a comouter that predicts eukaryotic pol II promoters using strings matrices. Cornput Appl Biosci 1997. 13:29-35. Zhang MQ: Identification of human Genome Res 1998, 8:31 S-326.

gene

core

promoters

13. *

14

Helnemeyer Meinhardt TRANSFAC molecular

T, Chen X, Karas H, Kel AE, Kel OV, Lleblch I, T, Reuter I, Schacherer F, Wingender E: Expanding database towards an expert system of regulatory mechanisms. N&e/c Acids Res 1999, 27:318-322.

the

15.

Werner T: Models for prediction and recognition promoters. Mamm Genome 1999, IO:1 68-175.

16

Wasserman WW, Flckett JW: which confer muscle-specific 278:167-181.

1 7.

Frech K, Danescu-Mayer J, Werner T: A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J MO/ 5/o/ 1997, 270:674-687.

18.

Crowley locating 268:8-l

EM, Roeder regulatory 4.

K, Bina regions

of eukaryotic

Identification of regulatory regions gene expression. J MO/ Biol 1998,

M: A statistical model for in genomic DNA. J MO/ B/o/

1997,

19.

Rosenblueth DA. Thieffry D, Huerta AM, Salgado H, Collado-Vides Syntactic recognition of regulatory regions in Escherichia coli. Compuf Appl B/osc/ 1996, 12:415-422.

20.

Pickert L, Reuter I, Klawonn F, WIngender region analysis using signal detection Bioinformatics 1998, 14:244-251.

21.

Benham CJ: Computation predictor of DNA regulatory 12:3?5-381.

22.

Karas H, Knuppel R, Schulz structural analysis of DNA of transcription regulatory 12:441-446.

J:

E: Transcription regulatory and fuzzy clustering.

of DNA structural variability-a regions. Compuf A,@ Bioso

new 1996,

W, Sklenar H, WIngender E: Combining with search routines for the detection elements. Comput Appl Biosci 1996,

23. .

anb

orooram iGD in silica.

9. .

Spicher A, Gutcherit OM, Duret L. AsIanIan A, Sanjnes EM, Denko NC, Giaccia AJ, Blau HM: Highly conserved RNA sequences that are sensors of environmental stress. Moi Ceil 5ioi 1998, 18:7371-7382. The experlmental work presented In this pap+, demonstrates the power of phylogenetic footprtntlng. The regulatory regions studled were first tdentlfled as bemg highly conserved regions (HCRs) by cross-genomic sequence comparison. 10. Lipman DJ: Making (anti)sense of non-coding sequence . conservation. Nucleic Acids Res 1997, 25:3580-3583. A provocatlve paper suggestmg that hrghly conserved regrons In 3’ untranslated regions evolve slowly because they form ‘juolexes with antisense RNA, The conservation IS explaIned by sel&on ag&nst mismatched doublestranded regions. 11.


Kolchanov NA, Ponomarenko MP, Kel AE, Kondrakhin YuV, Frolov AS, Kolpakov FA, Goryachkovsky TN, Kel OV, Ananko EA, Ignatieva EV et al.: GeneExpress: a computer system for description, analysis, and recognition of regulatory sequences in eukaryotic genome. ISMB 1998, 6:95-104.
An overview of the recent research and software developments by Nikolay Kolchanov's group in Novosibirsk and his international network of collaborators. The article does not provide much detail about individual achievements, but contains many references to significant, but not so well-known work by Russian scientists. The sections on gene networks and the sequence-based prediction of functional site activity are particularly interesting.

5. •

6.


Gross T, Kaufer NF: Cytoplasmic ribosomal protein genes of the fission yeast Schizosaccharomyces pombe display a unique promoter type: a suggestion for nomenclature of cytoplasmic ribosomal proteins in databases. Nucleic Acids Res 1998, 26:3319-3322.
This paper presents a remarkable example of an unusual eukaryotic promoter type, consisting of two closely spaced elements, one of which is apparently a functional TATA box analog.

for regulatory elements in human Opin Struct Biol 1997, 7:399-406.

12. •

Busslinger M, Portmann R, Irminger JC, Birnstiel ML: Ubiquitous and gene-specific regulatory 5' sequences in a sea urchin histone DNA clone coding for histone protein variants. Nucleic Acids Res 1980, 8:957-977.

23. • Pedersen AG, Baldi P, Chauvin Y, Brunak S: DNA structure in human RNA polymerase II promoters. J Mol Biol 1998, 281:663-673.
This paper exemplifies a structural approach to the problem of characterizing transcriptional control regions. A 10 bp periodic bendability pattern is reported to occur downstream of the transcription start site and its implications for chromatin structure are discussed.

24. • Robison K, McGuire AM, Church GM: A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J Mol Biol 1998, 284:241-254.
Fifty-five weight matrices for transcription regulatory DNA-binding proteins from E. coli were constructed from natural sites defined by footprinting experiments and from SELEX data. Significant differences between the weight matrices derived from either natural or in vitro selected sites are reported and discussed.

25. • Thieffry D, Salgado H, Huerta AM, Collado-Vides J: Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12. Bioinformatics 1998, 14:391-400.
Presumably the first genome-wide analysis of the putative transcriptional control elements of a prokaryote. The complete set of predictions is available on the Internet and, hopefully, will soon be tested by genome-wide expression studies.

26. Yada T, Totoki Y, Ishii T, Nakai K: Functional prediction of B. subtilis genes from their regulatory sequences. ISMB 1997, 5:354-357.


27. •• Washio T, Sasayama J, Tomita M: Analysis of complete genomes suggests that many prokaryotes do not rely on hairpin formation in transcription termination. Nucleic Acids Res 1998, 26:5456-5463.
A beautiful study on comparative genomics. The comprehensive analysis of gene 3' regions in 17 microbial genomes reveals an unexpected diversity of transcription termination signals and reminds us how much our views on prokaryotic gene regulation are biased by the classical work on E. coli.

28. • Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Res 1997, 7:986-995.
A useful paper providing an answer to a frequently asked question arising from EST- (expressed sequence tag) and SAGE- (serial analysis of gene expression) based expression profiles.

29. Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O et al.: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature Suppl 1995, 377:3-17.

30. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270:484-487.

31. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270:467-470.

32. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996, 14:1675-1680.

33. • Bowtell DD: Options available - from start to finish - for obtaining expression data by microarray. Nat Genet 1999, 21:25-32.
Perhaps the most useful review in the "Chipping Forecast" supplement of Nature Genetics. Contains many URLs of interesting web sites.

34. Brown PO, Botstein D: Exploring the new world of the genome with DNA microarrays. Nat Genet 1999, 21:33-37.

35. Debouck C, Goodfellow PN: DNA microarrays in drug discovery and development. Nat Genet 1999, 21:48-50.

36. • Carulli JP, Artinger M, Swain PM, Root CD, Chee L, Tulig C, Guerin J, Osborne M, Stein G, Lian J, Lomedico PT: High throughput analysis of differential gene expression. J Cell Biochem Suppl 1998, 30-31:286-296.
A comprehensive review and critical comparison of old and new techniques for studying differential gene expression.

37. • Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, Hieter P, Vogelstein B, Kinzler KW: Characterization of the yeast transcriptome. Cell 1997, 88:243-251.
A classical paper presenting the first description of a complete transcriptome. The mRNA abundances were determined by SAGE (serial analysis of gene expression).

38. • Holstege FC, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ, Green MR, Golub TR, Lander ES, Young RA: Dissecting the regulatory circuitry of a eukaryotic genome. Cell 1998, 95:717-728.
Another complete characterization of the yeast transcriptome. Half-lives for all the mRNAs were determined with the aid of a temperature-sensitive RNA polymerase II mutant. Additional strains with defects in the general transcription apparatus were profiled as well. The publicly available database is an important information resource for studies on yeast transcription.

39. • Pietu G, Mariage-Samson R, Fayein NA, Matingou C, Eveno E, Houlgatte R, Decraene C, Vandenbrouck Y, Tahi F, Devignes MD et al.: The Genexpress IMAGE knowledge base of the human brain transcriptome: a prototype integrated resource for functional and computational genomics. Genome Res 1999, 9:195-209.
The authors present a set of expression profiles for 5058 human genes in six different tissues and describe an integrated knowledge base combining this information with genome sequences an...

40. •• DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278:680-686.
See annotation to [41••].

41. •• Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282:699-705.
Two pioneering studies (see, also, [40••]) exemplifying the power of time-series data analysis. The temporal program of a transcriptional response transition was monitored by microarray technology.

42. •• Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998, 2:65-73.
Another important time-course analysis application; transcript levels of all the yeast genes were measured at 17 time points throughout the cell cycle.

43. •• Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9:3273-3297.
The authors present three additional time-course experiments on the yeast mitotic cell cycle. The cell cultures were synchronized by three different methods. Clustering of the expression profiles was used to define subsets of gene upstream regions, which were subsequently searched for regulatory elements.

44. • Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent JM, Staudt LM, Hudson J Jr, Boguski MS et al.: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283:83-87.
The first large-scale time-course gene expression monitoring experiment in a mammalian system (involving 8600 genes).

45. Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science 1997, 276:1268-1272.

46. Der SD, Zhou A, Williams BR, Silverman RH: Identification of genes differentially regulated by interferon alpha, beta, or gamma using oligonucleotide arrays. Proc Natl Acad Sci USA 1998, 95:15623-15628.

47. Zhu H, Cong JP, Mamtora G, Gingeras T, Shenk T: Cellular gene expression altered by human cytomegalovirus: global monitoring with oligonucleotide arrays. Proc Natl Acad Sci USA 1998, 95:14470-14475.

48. Gray NS, Wodicka L, Thunnissen AM, Norman TC, Kwon S, Espinoza FH, Morgan DO, Barnes G, LeClerc S, Meijer L et al.: Exploiting chemical libraries, structure, and genomics in the search for kinase inhibitors. Science 1998, 281:533-538.

49. • Marton MJ, DeRisi JL, Bennett HA, Iyer VR, Meyer MR, Roberts CJ, Stoughton R, Burchard J, Slade D, Dai H et al.: Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat Med 1998, 4:1293-1301.
A frequently cited paper describing a very useful application of high-throughput gene expression monitoring in drug design. In the case study presented, the authors discovered a potentially undesirable transcriptional response pattern as a secondary effect of a drug treatment - so far only in yeast cells.

50. • Bassett DE Jr, Eisen MB, Boguski MS: Gene expression informatics - it’s all in your mine. Nat Genet 1999, 21:51-55.
An excellent review on all aspects of the computational analysis of gene expression data - from image analysis to data mining.

51. • Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868.
Specific distance measures and clustering algorithms for expression profiles are described.

52. Michaels GS, Carr DB, Askenazi M, Fuhrman S, Wen X, Somogyi R: Cluster analysis and data visualization of large-scale gene expression data. Pac Symp Biocomput 1998, 3:42-53.

53. Thieffry D, Thomas R: Qualitative analysis of gene networks. Pac Symp Biocomput 1998, 3:77-88.

54. Weaver DC, Workman CT, Stormo GD: Modeling regulatory networks with weight matrices. Pac Symp Biocomput 1999, 4:112-123.

55. D’Haeseleer P, Wen X, Fuhrman S, Somogyi R: Linear modeling of mRNA expression levels during CNS development and injury. Pac Symp Biocomput 1999, 4:41-52.

56. • Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Somogyi R: Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci USA 1998, 95:334-339.
This paper describes a data set reflecting the temporal fluctuation of 112 genes during rat nervous system development, which has been used to study gene regulatory networks.


57. • Yuh CH, Bolouri H, Davidson EH: Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 1998, 279:1896-1902.
In this work, the network of regulatory interactions between elements of a control region was deduced from a series of spatial and temporal expression profiles generated by mutant promoter constructs.

58. Brazma A, Jonassen I, Vilo J, Ukkonen E: Predicting gene regulatory elements in silico on a genomic scale. Genome Res 1998, 8:1202-1215.

59. van Helden J, Andre B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998, 281:827-842.


60. • Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 1998, 16:939-945.
The yeast transcriptome was characterized under different regulatory conditions for the purposes of the computational analysis of gene regulatory elements. An ab initio method is presented for extracting sequence motifs from combined genome and transcriptome data.

61. • Ermolaeva O, Rastogi M, Pruitt KD, Schuler GD, Bittner ML, Chen Y, Simon R, Meltzer P, Trent JM, Boguski MS: Data management and analysis for gene expression arrays. Nat Genet 1998, 20:19-23.
The authors describe the current state and future directions of a data management and visualization system for gene expression data.