Regulatory elements and expression profiles
Philipp Bucher

There has been steady progress in the computational analysis of transcription control regions, but current methods of predicting the gene regulatory features of noncoding sequences are still not accurate enough to be useful in automatic genome annotation. Therefore, detailed information on the expression patterns of newly sequenced genes is more likely to come from microarray-based high-throughput mRNA quantitation technologies, which have made revolutionary progress over the past few years and are now ready for genome-wide application. Future solutions to the regulatory element prediction problem may be found by the combined analysis of genome sequence and expression data.
Addresses
Swiss Institute for Experimental Cancer Research and Swiss Institute of Bioinformatics, Ch. des Boveresses 155, 1066 Epalinges, Switzerland; e-mail: [email protected]

Current Opinion in Structural Biology 1999, 9:400-407
http://biomednet.com/elecref/0959440X00900400
© Elsevier Science Ltd ISSN 0959-440X
Abbreviations
bp    base pair
EST   expressed sequence tag
HCR   highly conserved region
SAGE  serial analysis of gene expression
UTR   untranslated region
Introduction
Transcriptional control mechanisms have been intensely studied in diverse organisms for at least three decades. Despite these efforts, our understanding of how regulatory information is encoded by a DNA sequence is still very fragmentary. Confronted with a newly sequenced control region of a gene, we are, in most cases, unable to make reliable predictions about its tissue-specific and developmental-stage-specific expression pattern. Relating DNA sequence to gene regulatory function thus seems to be one of the hardest problems in current biology, and accurate prediction methods for gene regulatory elements are not expected to be ready when the complete human genome sequence is finished (in approximately four years from now). Fortunately, the steady, but slow, progress in this field is accompanied by revolutionary breakthroughs in high-throughput gene expression monitoring technology. Thus, we can be confident that we will soon know from wet laboratory studies where and when each human gene is transcribed and how its expression is controlled by

Characterization of transcriptional control regions
The past two years have been a period of consolidation and maturation for this field. Critical evaluation studies of existing methods have led to a more realistic assessment of the performance of current algorithms in identifying and locating transcription control signals. A broadly noticed review on eukaryotic promoter prediction programs, presenting bench-marks obtained with an independent test set, came to the conclusion that these methods are not yet accurate enough to be useful in the automatic annotation of the human genome [1•]. Another test revealed that the available software tools for locating the binding sites for a particular transcription factor (CTF/NF1) often missed experimentally confirmed target sites and failed to make accurate binding strength predictions for others [2]. Although neither of these studies claims to be representative, they both make clear that current methods cannot be blindly trusted and further highlight the need for objective bench-marking procedures in order to monitor progress towards more accurate tools.

A positive development in the field is that there are now increasingly convergent views on the organization of gene regulatory regions and also an emerging paradigm for their computational characterization, which can be summarized as follows: the elementary units of transcription regulatory regions are transcription-factor-binding sites; control regions, such as promoters, enhancers, locus control regions and so on, are modular; the regulatory output of a control region depends on the specific combination of its elements, as well as on the order and orientation in which they occur; and genes are typically controlled by several control regions located upstream or downstream (and possibly far away) from the transcription initiation site.
From experimental studies, it is clear that the function of a regulatory region is mediated by multiprotein complexes comprising synergistically interacting transcription factors bound to clustered DNA sites. Similar principles may apply to noncoding regulatory RNA regions, which are not the focus of this review. It is now broadly recognized that the major obstacle to the computational characterization of gene regulatory regions relates to the fact that the sequence motifs corresponding to the elementary modules contain too little diagnostic information for them to be distinguished from chance occurrences [3].

distinguish between functional and biologically irrelevant transcription-factor-binding sites matching a corresponding consensus sequence or weight matrix description. Much of the recent work in this field has been inspired by this basic idea.

In a previous review [4], we focused on three topics: methods for characterizing individual control elements; promoter prediction algorithms; and phylogenetic footprinting, which is the name of an increasingly popular approach for localizing important regulatory regions within large genomes through cross-species comparisons. Before switching to new themes, a brief update on the most important developments in these areas is given here. A comprehensive, comparative evaluation of software tools for the characterization and identification of transcription control elements has appeared [5•] and several new algorithms for eukaryotic promoter prediction have been published [6-8]. A nice validation of the phylogenetic footprinting approach came from experimental work in which highly conserved regions (HCRs) in mRNA 3' untranslated region (UTR) sequences were subsequently functionally characterized as sensors of environmental stress [9•]. In a provocative paper, Lipman [10•] offered a novel explanation for the high sequence conservation observed in 3' UTRs, suggesting that they function via duplex formation with antisense RNA.
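The weight matrix descriptions mentioned above can be illustrated with a minimal sketch. It assumes a log-odds matrix derived from position-specific base counts with a pseudocount and a uniform background; the count matrix, the threshold and the function names are hypothetical and chosen only for illustration, and they describe no real transcription factor.

```python
import math

# Hypothetical base counts for a 4 bp binding site (one dict per position).
# The numbers are invented for illustration only.
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert base counts into log2 odds scores against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous symbols
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


MATRIX = log_odds_matrix(COUNTS)
hits = scan("TTACGTGGACGTAA", MATRIX, threshold=4.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

With such a short matrix, any threshold permissive enough to find real sites will also match many random positions in a genome, which is the lack of diagnostic information discussed in this section.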
Composite elements and control regions
As mentioned above, a major current trend is to analyze the contextual constraints governing the composition and positioning of genetic elements within a regulatory region. Two main approaches can be distinguished, which may be called ‘bottom up’ and ‘top down’, and they are not mutually exclusive. The bottom-up approach starts with a known element and attempts to identify contextual features that may help to distinguish functional from nonfunctional sites. A straightforward application of this idea is to search for significant element pairs occurring at conserved distances from each other. An early discovered example of such a bipartite control signal, occurring in embryo-specific sea urchin histone genes, consists of a GATTC motif followed 7-9 bp downstream by a canonical TATA-box promoter element [11]. Another noteworthy example was found in the 5' flanking regions of ribosomal protein genes in Schizosaccharomyces pombe. These genes have a highly unusual promoter type, consisting of a site selector element that is functionally analogous to, but different sequence-wise from, a TATA box, and an upstream element that can occur in either orientation [12•]. Both element pairs are expected to have considerable diagnostic value for the corresponding regulatory class of genes, as chance matches to the combined motifs are expected to occur less than once in a million base pairs. Many more examples of composite elements can be found in the COMPEL database [13••,14].

In the top-down approach, one starts with a set of control regions from functionally related genes and attempts to derive a specific model of the conserved sequence features (for a review of such techniques, see [15]). The models typically used in such studies consist of weight matrices or consensus sequences for obligatory and optional elements, and rules restricting the order and orientation in which the elements are allowed to occur. Control regions conferring muscle-specific expression to genes and the long terminal repeats of retroviruses were recently characterized by such techniques [16,17]. In both cases, the resulting models selectively identified new, plausible target sequences in the database that were not included in the training set used for model building. Other groups applied methods based on hidden Markov models [18], grammatical models [19] and fuzzy clustering [20] in order to analyze regulatory regions.

A number of groups have tried to integrate DNA structure prediction into their methodologies for characterizing transcriptional control regions (e.g. [21,22]). A recent application of such a technique to eukaryotic promoters has revealed an interesting 10 bp periodicity starting immediately downstream of the transcription initiation site [23•]. Although it is undeniable that such methods can capture physiologically relevant sequence features, it is less certain whether this is accomplished through correct DNA structure prediction. My personal experience is that different software tools for computing the intrinsic curvature and bendability of double-stranded DNA often make conflicting predictions when tested on the same sequence. The success of such approaches can therefore not be directly interpreted as evidence that a specific kind of intrinsic DNA structure plays an important role in gene regulation.
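The estimate that chance matches to a combined motif occur less than once in a million base pairs can be checked with a back-of-the-envelope calculation. The sketch below rests on simplifying assumptions that are not taken from the cited studies: uniform base composition, independent positions, fully specified bases, and no correction for sequence ends or overlapping matches. The motif lengths (5 bp and 6 bp) and the three allowed spacings are illustrative values.

```python
def expected_chance_matches(motif_lengths, n_spacings=1,
                            region_bp=1_000_000, both_strands=False):
    """Expected number of random matches to a (composite) motif.

    motif_lengths: lengths of the fully specified sub-motifs, in bp.
    n_spacings:    number of allowed distances between the sub-motifs.
    Assumes each base occurs with probability 0.25, independently.
    """
    per_position = n_spacings * 0.25 ** sum(motif_lengths)
    if both_strands:
        per_position *= 2  # rough doubling, valid for non-palindromic motifs
    return per_position * region_bp


# A single 5 bp motif matches by chance about a thousand times per megabase.
single = expected_chance_matches([5])

# A 5 bp motif plus a 6 bp motif at one of three spacings matches
# less than once per megabase on a given strand.
composite = expected_chance_matches([5, 6], n_spacings=3)

print(single, composite)
```

Under these assumptions, combining two short elements with a constrained spacing lowers the expected number of chance matches by about three orders of magnitude, which is why composite elements are more diagnostic than their individual parts.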
Recent work on prokaryotic gene control elements
The availability of several complete bacterial genomes has both resuscitated interest in prokaryotic gene regulation and motivated a number of comparative computational studies of transcriptional control elements. Not surprisingly, most researchers have chosen to analyze the Escherichia coli genome because of the wealth of experimental data on transcriptional regulation in this organism. One group generated a library of weight matrices defining the binding specificity of 55 different transcription factors for which target sites had been experimentally identified [24•]. Another group searched the entire genome for matches to
sequenced bacterial species use RNA-hairpin-based termination mechanisms that are similar to the rho-independent pathway of E. coli [27••].
Expression profiles
If the prediction of gene regulatory features from sequence is so hard, do we really need it? In view of the rapid recent progress made in genome-wide expression monitoring, some researchers may be tempted to say no. Remember that the human genome project was partly justified by arguing that noncoding DNA sequences will tell us something about the regulation of the genes. With the advent of microarray technology, it now seems probable that this information will first come from high-throughput mRNA quantitation experiments, rather than from sequence-based in silico predictions. It would be shortsighted, however, to play down the importance of the gene regulatory feature prediction problem using such arguments. In applied biomedical research, it may be sufficient to know when and where a gene is expressed; in order to understand life, one definitely also has to know why this happens. Large-scale expression profiling thus should be viewed as a welcome complement, rather than an alternative, to whole genome sequencing and should be integrated into research methodologies aimed at elucidating the sequence/function relationships of gene regulatory regions.
New technologies and new data structures
Among the high-throughput methods for gene expression profiling, two principally different strategies can be distinguished: the sequencing of a large number of cloned gene tags and the parallel hybridization of in vitro labeled mRNA populations (usually achieved by reverse transcription) to densely arrayed target probes. The two approaches have also been called ‘digital’ and ‘analog’, as they produce expression profiles consisting of integer and real numbers, respectively [28•]. Adams et al. [29] were the first to demonstrate that large-scale single-pass expressed sequence tag (EST) sequencing can be used to study gene expression. An accelerated version of this strategy, consisting of sequencing multiple concatenated short oligonucleotide tags in a single run, has been published under the acronym SAGE (serial analysis of gene expression) [30]. Although sequencing-based expression profiling techniques have the principal advantage of being capable of detecting new transcripts, microarray-based approaches are expected to prevail in the future, as they appear to be more cost effective and accurate, especially for weakly expressed genes. Moreover, the fact that the mRNA quantitation experiments occur in parallel under exactly the same hybridization conditions ensures a high degree of cross-standardization among individual measurements. Two types of microarray are currently in use: DNA arrays, carrying long cDNA molecules transferred by a robot to a nylon membrane or glass slide [31], and oligonucleotide arrays, carrying in situ synthesized oligonucleotides of about 10 bases [32]. Both methods have been shown to work in practice and have already made important contributions to our understanding of gene regulation, as exemplified by the pioneering case studies described further below. Additional information on all aspects of DNA microarrays can be found in a comprehensive series of reviews published in a recent supplement of Nature Genetics entitled "The Chipping Forecast" [33•,34,35]. The merits and drawbacks of the various sequencing and microarray-based techniques are further discussed in [36•].

The advent of microarray-based gene expression monitoring is a revolutionary development in biological research, not because many biologists are expected to apply this new and still very expensive technology in their own research, but because every biologist will soon have access to the large amounts of public data produced by this technique. The term transcriptome has been coined to refer to this new type of data structure, comprising the expression levels of all the genes of a genome in a given regulatory state of a cell. As a result of both the dynamic nature of gene expression and the complexity of regulatory processes in higher organisms, transcriptomes may soon exceed genome sequence data in sheer volume. The nature of this new type of biological information and how it can be exploited for answering biological questions will be the focus of the remaining part of this review.
Examples of genome-wide expression monitoring studies
For obvious reasons, the budding yeast was the first model organism in which physiological processes were studied using genome-wide expression profiles. A complete characterization of its transcriptome was initially achieved by SAGE [37••] and was later confirmed and refined by oligonucleotide array technology [38••]. In this latter study, a temperature-sensitive RNA polymerase II mutant strain was used to measure the time-dependent decay of mRNA levels after blocking mRNA synthesis. Knowing the half-lives of all the mRNAs obtained in this way, it is now possible to convert steady-state mRNA levels into transcription rates, which is essential for understanding the kinetic aspects of transcription regulatory events. Moreover, the dependence of the transcriptome on various key components of the yeast polymerase II transcription machinery was assayed using a set of ‘genetic reagents’ (conditional mutants and gene knockout strains). Additional publicly available data sets for yeast are indicated in Table 1.

Very recently, high-throughput expression profiling has been extended to mammalian systems. For instance, the IMAGE consortium has released quantitative hybridization data for 5058 human genes in six different tissues [39•]. Time-course analysis is a particularly powerful way of studying global gene regulation using expression profiles. In this type of analysis, snapshots of the entire transcriptome are taken at successive time points after inducing a change in the regulatory state of a cell culture or a tissue. The value of the data resulting from such experiments can be further
increased by additional expression profiles generated for mutant strains lacking or overexpressing transcription factors thought to be functionally involved in the transcriptional response. The study of the diauxic shift in yeast (metabolic change from fermentation to respiration), accompanied by expression profiling of strains lacking the regulatory protein TUP1 or overexpressing the transcriptional activator YAP1 [40••], as well as the analysis of the sporulation program in the same organism [41••], represent pioneering examples of this approach to gene regulation. Not surprisingly, the yeast mitotic cell cycle was also one of the first targets of global time-course analysis. Four independent data sets are available, generated by two different laboratories [42••,43••] using two different techniques (oligonucleotide and DNA arrays) and four different cell culture synchronization protocols. In the first application of this technology to a mammalian system, the transcriptional response of human fibroblast genes to serum stimulation was analyzed for 8600 genes [44••].

Table 1
Examples of expression profiles available over the Internet.
Columns: Species; Type of data; Number of states/time points; Number of genes; URL; References.

Yeast
  Wild-type transcriptome
  Transcription apparatus mutants: RNA polymerase II; SRB-mediator core complex; SRB10-CDK complex; SWI-SNF complex; general transcription factors; SAGA
  Heat shock
  Mating type alpha versus a
  Galactose versus glucose medium
  Mitotic cell cycle: α-pheromone (G1 arrest) [43••]; cdc-15 (late M arrest) [43••]; cdc-28 (late G1 arrest) [42••]; elutriation (start in early G1) [43••]; induction of Cln3 (G1 cyclin) [43••]; induction of Clb2 (B-type cyclin) [43••]
  Diauxic shift time course; TUP1 deletion; YAP1 overexpression [40••]
  Sporulation program; with ectopically expressed Ndt80 [41••]
Human
  Partial transcriptome for six tissues (5058 genes)
  In fibroblasts: serum response; serum plus cycloheximide [44••]
Rat
  Cervical spinal cord development (9 time points, 112 genes) [56•]

(a) http://www.wi.mit.edu/young/transcriptome.html
(b) http://arep.med.harvard.edu/mrnadata/expression.html
(c) http://genome-www.stanford.edu/
(d) http://idefix.upr420.vjf.cnrs.fr
(e) http://rsb.info.nih.gov/mol-physiol/homepage.html

Expression profiling also plays an increasingly important role in medical and pharmacological research. Subtractive cloning and differential display techniques (for a review, see [36•]) have been applied for some time in cancer research to identify transcripts that are significantly over-represented or under-represented in tumor cells. Later, SAGE was used to determine quantitative gene expression profiles of normal and tumor cells [45]. Oligonucleotide arrays have recently been used to identify genes differentially regulated by interferon [46] and to characterize the altered gene expression pattern induced by cytomegalovirus infection [47]. In pharmacology, gene expression profiling technology is applied to detect the potential side effects of drugs, such as protein kinase inhibitors [48] or immunosuppressants [49•]. However, the data sets produced by these studies are not available on the Internet, which limits their impact on biological research.

Computational analysis of gene expression data
Bioinformatics is an essential component of high-throughput gene expression profiling methodology. An excellent review summarizes the various kinds of computer applications involved, from image processing to data management, data mining and
visualization [50]. Some of these applications are only relevant to the researchers performing the experiments. Here, the focus is on those methods that are of interest to everybody who wants to access and exploit publicly available data, either by using web-based exploration tools or by downloading the raw data for analysis on a local computer.

An important question that frequently arises in the interpretation of digital expression profiles is whether an observed difference in the tag counts of a gene is statistically significant. Audic and Claverie [28•] have analyzed this problem and proposed a statistical framework for significance tests. The development of algorithms that exploit microarray-based gene expression data has only started very recently. Thanks to the regular structure of these data (usually a matrix of numbers corresponding to genes and regulatory states), many biologically relevant questions can already be addressed by the standard methods implemented in widely used statistical analysis packages. The few specific methods developed so far perform the kinds of operations known from molecular sequence analysis: computation of distance measures for pairs of expression profiles and the generation of tree-like clustering diagrams useful for defining groups of co-regulated genes [51•,52]. The gene-specific transcriptional response patterns observed in the budding yeast after a diauxic shift [40••] and the pattern observed in human fibroblasts after serum stimulation [44••] have both been classified in this way. More specialized Fourier transform methods were applied to the yeast mitotic cell cycle data in order to reveal genes with oscillating expression patterns [52].

A more ambitious goal in the analysis of expression profiles is the reconstruction of regulatory circuits from time-course data, also called reverse engineering of genetic pathways. Boolean networks [53] and linear models [54,55], in which the change in the mRNA abundance of a particular gene depends on the simultaneous mRNA abundance of all the other genes, have been proposed for this purpose. The latter kind of model was applied to a matrix of gene expression levels for 112 rat genes measured at nine different time points during central nervous system development [55,56•]. As this data set was mathematically too small to estimate all the parameters of the model, the study has a mainly exploratory value. Others have validated the same type of methodology using computer-simulated data, however, showing that it could work in principle [54]. In another pioneering application to real data [57•], the network of regulatory interactions between the cis-acting elements of a 2300 bp control region of a sea urchin gene was deduced from a series of spatial and temporal expression profiles produced by mutant promoter constructs.
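The computation of a distance measure for pairs of expression profiles, mentioned above as the first step towards clustering diagrams, can be sketched as follows. The choice of one minus the Pearson correlation coefficient is an assumption made for illustration and is not necessarily the measure used in the cited studies; the gene names and expression values are invented.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of two profiles.

    The result is 0 for perfectly co-varying profiles and 2 for
    perfectly anti-correlated ones.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return 1.0 - cov / math.sqrt(var_x * var_y)


# Hypothetical expression matrix: one row per gene, one column per time point.
PROFILES = {
    "geneA": [0.0, 1.0, 2.0, 3.0],
    "geneB": [1.0, 3.0, 5.0, 7.0],   # rises together with geneA
    "geneC": [3.0, 2.0, 1.0, 0.0],   # falls while geneA rises
}


def closest_pair(profiles):
    """Return the pair of gene names with the smallest profile distance."""
    return min(
        combinations(sorted(profiles), 2),
        key=lambda pair: pearson_distance(profiles[pair[0]], profiles[pair[1]]),
    )


print(closest_pair(PROFILES))
```

Repeatedly merging the closest pair and recomputing distances is the basic step of hierarchical clustering, which yields the tree-like diagrams referred to in the text. A correlation-based distance ignores the absolute expression level and groups genes by the shape of their response only.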
Combining sequence and expression data analysis
A promising current trend is to use genome-wide expression profile data to define training and test sequence sets for analyzing gene regulatory elements. A number of applications to yeast data have confirmed the general validity of this approach. For instance, time-course data on the mitotic cell cycle were used to extract upstream regions of genes specifically expressed during the G1, S, G2 or M phases, respectively [42••,43••]. Subsequent searches of these sequence sets for published consensus sequences for transcription-factor-binding sites identified several diagnostic patterns for genes transcribed at a specific stage of the cell cycle. In addition, a refined model of a composite element that is specific for genes induced during the transition from mitosis to the G1 phase was generated by this approach [43••]. Others have developed and applied ab initio methods to identify gene regulatory signals mediating a particular regulatory response [58,59,60•] from combined genome and transcriptome data.

The advantage of using whole genome expression profiles in the development of predictive methods for gene regulatory elements is that the matches found in the test sets can be objectively classified as true or false hits. An example may further illustrate this point. Suppose that a computer program had predicted a binding site of a cell-cycle-regulating transcription factor in the upstream region of a methionine biosynthesis gene a few years ago. At that time, the developer of the program would probably have discarded this result as a bona fide false hit and changed the algorithm of the program to suppress it. In the meantime, the time-course data on the yeast cell cycle have revealed the surprising fact that methionine biosynthesis genes are indeed cell-cycle-regulated, making it possible today to draw the right conclusions from such a test result.
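The idea of searching expression-defined sequence sets for a consensus sequence can be sketched in a few lines. The example compares the fraction of upstream regions containing a motif, on either strand, between a set of co-expressed genes and a background set. The motif, the sequences and the function names are all hypothetical placeholders and are not data from the cited studies.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def contains_motif(seq, motif):
    """True if the motif occurs on either strand of the sequence."""
    seq = seq.upper()
    return motif in seq or reverse_complement(motif) in seq


def fraction_with_motif(seqs, motif):
    """Fraction of sequences in a set that contain the motif."""
    if not seqs:
        raise ValueError("empty sequence set")
    return sum(contains_motif(s, motif) for s in seqs) / len(seqs)


MOTIF = "GATTACA"  # placeholder consensus, not a real binding site

# Upstream regions of genes sharing an expression pattern (invented).
COEXPRESSED = ["CCGATTACAGG", "AATGTAATCTT", "ACGTACGTACG"]
# Upstream regions of genes without that pattern (invented).
BACKGROUND = ["ACGTACGTACG", "TTTTTTTTTTT", "GGGATTACAGG", "CACACACACAC"]

frac_set = fraction_with_motif(COEXPRESSED, MOTIF)
frac_bg = fraction_with_motif(BACKGROUND, MOTIF)
print(frac_set, frac_bg)
```

A motif found far more often in the co-expressed set than in the background is a candidate diagnostic pattern. With real data, the difference would need to be assessed with a statistical test, and the expression profiles of the background genes would indicate whether an apparent false hit is truly false.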
Conclusions
Methods for predicting gene regulatory features have made steady progress over the past few years, but are still not reliable enough to be used in automatic genome annotation. A new paradigm is emerging in this field that views gene regulatory regions as modular structures and attempts to assess the function of individual elements in a context-dependent manner. First promising results have already been obtained using the contextual approach. Comprehensive information on the tissue-specific and developmental-stage-specific expression of all the genes of an organism is nevertheless more likely to come from experimental studies. High-throughput microarray-based parallel mRNA quantitation technology has recently become operational and now makes it possible to monitor the expression of an entire genome in a single experiment. The resulting data have been called a ‘transcriptome’. In order to have a maximal impact on biological research, transcriptome data need to be shared by as many researchers as possible, as they inevitably contain information that is beyond the research interests of the team producing them. It is therefore imperative to put in place efficient data distribution channels for this new type of biological data, like those that already exist for nucleic acid and protein sequences. Leading journals could promote the dissemination of this information by requiring authors to deposit their data in a public archive. A major challenge to bioinformatics is the organization and representation of the transcriptome in such a way that it can readily be accessed, visualized and analyzed by powerful computer algorithms. Prototypes of integrated gene expression databases that possibly fulfill these requirements have been described by the researchers of the NCBI (National Center for Biotechnology Information) [61•] and the IMAGE consortium [39•]. The transcriptome will certainly be of great help in the computational characterization of transcriptional control regions and, hopefully, will accelerate progress towards reliable predictive tools. In the long run, these data may enable the computational biologist to reconstruct the gene regulatory networks that control the development of higher organisms, including ourselves.
References Papers of particular have been highllghted l
**of
and recommended Interest, as:
published
the annual
period
of review,
of special Interest outstanding Interest
1. Fickett JW, Hatzlgeorgiou AG: Eukaryotic promoter recognition. . Genome Res 1997,7:861-878. A provocatlve review presenting bench-marks for eukaryotlc promoter prediction tools. The conclusion IS that none 01 the currently available methods IS accurate enough to be useful in automatic genome annotation. 2.
3
4.
Roulet E, Flsch I, Junier T, Bucher P, Mermod N: Evaluation computer tools for the prediction of transcription binding genomic DNA. In Silica Bfol 1998, I :21-28.
of sites
Audtc S, Clavene JM: Visualizing the competitive recognition TATA-boxes in vertebrate promoters Trends Genet 1998, 14:10-l 1. Duret L, Bucher P: Searching noncoding sequences. Curr
on
Frech K, CIuandt K, Werner T: Software for the analysis of DNA sequence elements of transcription. Con~p~it Appi B0sci 1997, 13:89-97. A recent review on currently applied algoritnms and software in this field. Contains useful URLs for public programs arid Internet resources.
7
8.
Audit S, Claverle JM: Detection of eukaryotic using Markov transition matrices. C )mput 21~223.227.
promoters Chem 1997,
Chen QK. Hertz GZ. Storm0 GD: PromFD 1.0: a comouter that predicts eukaryotic pol II promoters using strings matrices. Cornput Appl Biosci 1997. 13:29-35. Zhang MQ: Identification of human Genome Res 1998, 8:31 S-326.
gene
core
promoters
13. *
14
Helnemeyer Meinhardt TRANSFAC molecular
T, Chen X, Karas H, Kel AE, Kel OV, Lleblch I, T, Reuter I, Schacherer F, Wingender E: Expanding database towards an expert system of regulatory mechanisms. N&e/c Acids Res 1999, 27:318-322.
the
15.
Werner T: Models for prediction and recognition promoters. Mamm Genome 1999, IO:1 68-175.
16
Wasserman WW, Flckett JW: which confer muscle-specific 278:167-181.
1 7.
Frech K, Danescu-Mayer J, Werner T: A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J MO/ 5/o/ 1997, 270:674-687.
18.
Crowley locating 268:8-l
EM, Roeder regulatory 4.
K, Bina regions
of eukaryotic
Identification of regulatory regions gene expression. J MO/ Biol 1998,
M: A statistical model for in genomic DNA. J MO/ B/o/
1997,
19.
Rosenblueth DA. Thieffry D, Huerta AM, Salgado H, Collado-Vides Syntactic recognition of regulatory regions in Escherichia coli. Compuf Appl B/osc/ 1996, 12:415-422.
20.
Pickert L, Reuter I, Klawonn F, WIngender region analysis using signal detection Bioinformatics 1998, 14:244-251.
21.
Benham CJ: Computation predictor of DNA regulatory 12:3?5-381.
22.
Karas H, Knuppel R, Schulz structural analysis of DNA of transcription regulatory 12:441-446.
J:
E: Transcription regulatory and fuzzy clustering.
of DNA structural variability-a regions. Compuf A,@ Bioso
new 1996,
W, Sklenar H, WIngender E: Combining with search routines for the detection elements. Comput Appl Biosci 1996,
23. .
anb
orooram iGD in silica.
9. .
Spicher A, Gutcherit OM, Duret L. AsIanIan A, Sanjnes EM, Denko NC, Giaccia AJ, Blau HM: Highly conserved RNA sequences that are sensors of environmental stress. Moi Ceil 5ioi 1998, 18:7371-7382. The experlmental work presented In this pap+, demonstrates the power of phylogenetic footprtntlng. The regulatory regions studled were first tdentlfled as bemg highly conserved regions (HCRs) by cross-genomic sequence comparison. 10. Lipman DJ: Making (anti)sense of non-coding sequence . conservation. Nucleic Acids Res 1997, 25:3580-3583. A provocatlve paper suggestmg that hrghly conserved regrons In 3’ untranslated regions evolve slowly because they form ‘juolexes with antisense RNA, The conservation IS explaIned by sel&on ag&nst mismatched doublestranded regions. 11.
405
Bucher
profiles
Kolchanov NA, Ponomarenko MP, Kel AE, Kondrakhin YuV, Frolov AS, Kolpakov FA, Goryachkovsky TN, Kel OV, Ananko EA, lgnatleva EV et al.: GeneExpress: a computer system for description, analysis, and recognition of regulatory sequences in eukaryotic genome. ISMB 1998, 6:95-l 04. An overview of the recent research and software developments by Nikolay Kolchanov’s group in Novosibirsk and his internatlonal network of collaborators. The article does not provide much detail about mdivldual achievements, but contains many references to significant, but not so well-known work by Russian scientists. The sections on gene networks and the sequence-based prediction of functional site activity are particularly Interesting. l
5. .
6.
Gross T, Kaufer NF: Cytoplasmic ribosomal protein genes of the fission yeast Schizosaccharomyces pombe display a unique promoter type: a suggestion for nomenclature of cytoplasmic ribosomal proteins in databases. Nucleic Acids Res 1998, 26:3319-3322.
This paper presents a remarkable example of an unusual eukaryotic promoter type, consisting of two closely spaced elements, one of which is apparently a functional TATA box analog.
of
for regulatory elements in human Opin Struct Biol 1997, 7:399-406.
12 .
reading
within
Busslinger M, Portmann R, Irminger JC, Birnstiel ML: Ubiquitous and gene-specific regulatory 5' sequences in a sea urchin histone DNA clone coding for histone protein variants. Nucleic Acids Res 1980, 8:957-977.
Pedersen AG, Baldi P, Chauvin Y, Brunak S: DNA structure in human RNA polymerase II promoters. J Mol Biol 1998, 281:663-673.
This paper exemplifies a structural approach to the problem of characterizing transcriptional control regions. A 10 bp periodic bendability pattern is reported to occur downstream of the transcription start site and its implications for chromatin structure are discussed.
24. • Robison K, McGuire AM, Church GM: A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J Mol Biol 1998, 284:241-254.
Fifty-five weight matrices for transcription regulatory DNA-binding proteins from E. coli were constructed from natural sites defined by footprinting experiments and from SELEX data. Significant differences between the weight matrices derived from either natural or in vitro selected sites are reported and discussed.
25. • Thieffry D, Salgado H, Huerta AM, Collado-Vides J: Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12. Bioinformatics 1998, 14:391-400.
Presumably the first genome-wide analysis of the putative transcriptional control elements of a prokaryote. The complete set of predictions is available on the Internet and, hopefully, will soon be tested by genome-wide expression studies.
26. Yada T, Totoki Y, Ishii T, Nakai K: Functional prediction of B. subtilis genes from their regulatory sequences. ISMB 1997, 5:354-357.
27. •• Washio T, Sasayama J, Tomita M: Analysis of complete genomes suggests that many prokaryotes do not rely on hairpin formation in transcription termination. Nucleic Acids Res 1998, 26:5456-5463.
A beautiful study on comparative genomics. The comprehensive analysis of gene 3' regions in 17 microbial genomes reveals an unexpected diversity of transcription termination signals and reminds us how much our views on prokaryotic gene regulation are biased by the classical work on E. coli.
28. • Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Res 1997, 7:986-995.
A useful paper providing an answer to a frequently asked question arising from EST- (expressed sequence tag) and SAGE- (serial analysis of gene expression) based expression profiles.
29. Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O et al.: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature Suppl 1995, 377:3-17.
30. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270:484-487.
31. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270:467-470.
32. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996, 14:1675-1680.
33. • Bowtell DD: Options available-from start to finish-for obtaining expression data by microarray. Nat Genet 1999, 21:25-32.
Perhaps the most useful review in the "Chipping Forecast" supplement of Nature Genetics. Contains many URLs of interesting web sites.
34. Brown PO, Botstein D: Exploring the new world of the genome with DNA microarrays. Nat Genet 1999, 21:33-37.
35. Debouck C, Goodfellow PN: DNA microarrays in drug discovery and development. Nat Genet 1999, 21:48-50.
36. • Carulli JP, Artinger M, Swain PM, Root CD, Chee L, Tulig C, Guerin J, Osborne M, Stein G, Lian J, Lomedico PT: High throughput analysis of differential gene expression. J Cell Biochem Suppl 1998, 30-31:286-296.
A comprehensive review and critical comparison of old and new techniques for studying differential gene expression.
37. • Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, Hieter P, Vogelstein B, Kinzler KW: Characterization of the yeast transcriptome. Cell 1997, 88:243-251.
A classical paper presenting the first description of a complete transcriptome. The mRNA abundances were determined by SAGE (serial analysis of gene expression).
38. • Holstege FC, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ, Green MR, Golub TR, Lander ES, Young RA: Dissecting the regulatory circuitry of a eukaryotic genome. Cell 1998, 95:717-728.
Another complete characterization of the yeast transcriptome. Half-lives for all the mRNAs were determined with the aid of a temperature-sensitive RNA polymerase II mutant. Additional strains with defects in the general transcription apparatus were profiled as well. The publicly available database is an important information resource for studies on yeast transcription.
39. • Pietu G, Mariage-Samson R, Fayein NA, Matingou C, Eveno E, Houlgatte R, Decraene C, Vandenbrouck Y, Tahi F, Devignes MD et al.: The Genexpress IMAGE knowledge base of the human brain transcriptome: a prototype integrated resource for functional and computational genomics. Genome Res 1999, 9:195-209.
The authors present a set of expression profiles for 5058 human genes in six different tissues and describe an integrated knowledge base combining this information with genome sequences and maps.
40. •• DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278:680-686.
See annotation to [41••].
41. •• Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282:699-705.
Two pioneering studies (see, also, [40••]) exemplifying the power of time-series data analysis. The temporal program of a transcriptional response transition was monitored by microarray technology.
42. •• Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998, 2:65-73.
Another important time-course analysis application; transcript levels of all the yeast genes were measured at 17 time points throughout the cell cycle.
43. •• Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9:3273-3297.
The authors present three additional time-course experiments on the yeast mitotic cell cycle. The cell cultures were synchronized by three different methods. Clustering of the expression profiles was used to define subsets of gene upstream regions, which were subsequently searched for regulatory elements.
44. • Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent JM, Staudt LM, Hudson J Jr, Boguski MS et al.: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283:83-87.
The first large-scale time-course gene expression monitoring experiment in a mammalian system (involving 8600 genes).
45. Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science 1997, 276:1268-1272.
46. Der SD, Zhou A, Williams BR, Silverman RH: Identification of genes differentially regulated by interferon alpha, beta, or gamma using oligonucleotide arrays. Proc Natl Acad Sci USA 1998, 95:15623-15628.
47. Zhu H, Cong JP, Mamtora G, Gingeras T, Shenk T: Cellular gene expression altered by human cytomegalovirus: global monitoring with oligonucleotide arrays. Proc Natl Acad Sci USA 1998, 95:14470-14475.
48. Gray NS, Wodicka L, Thunnissen AM, Norman TC, Kwon S, Espinoza FH, Morgan DO, Barnes G, LeClerc S, Meijer L et al.: Exploiting chemical libraries, structure, and genomics in the search for kinase inhibitors. Science 1998, 281:533-538.
49. • Marton MJ, DeRisi JL, Bennett HA, Iyer VR, Meyer MR, Roberts CJ, Stoughton R, Burchard J, Slade D, Dai H et al.: Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat Med 1998, 4:1293-1301.
A frequently cited paper describing a very useful application of high-throughput gene expression monitoring in drug design. In the case study presented, the authors discovered a potentially undesirable transcriptional response pattern as a secondary effect of a drug treatment - so far only in yeast cells.
50. • Bassett DE Jr, Eisen MB, Boguski MS: Gene expression informatics-it's all in your mine. Nat Genet 1999, 21:51-55.
An excellent review on all aspects of the computational analysis of gene expression data - from image analysis to data mining.
51. • Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868.
Specific distance measures and clustering algorithms for expression profiles are described.
52. Michaels GS, Carr DB, Askenazi M, Fuhrman S, Wen X, Somogyi R: Cluster analysis and data visualization of large-scale gene expression data. Pac Symp Biocomput 1998, 3:42-53.
53. Thieffry D, Thomas R: Qualitative analysis of gene networks. Pac Symp Biocomput 1998, 3:77-88.
54. Weaver DC, Workman CT, Stormo GD: Modeling regulatory networks with weight matrices. Pac Symp Biocomput 1999, 4:112-123.
55. D'Haeseleer P, Wen X, Fuhrman S, Somogyi R: Linear modeling of mRNA expression levels during CNS development and injury. Pac Symp Biocomput 1999, 4:41-52.
56. • Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Somogyi R: Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci USA 1998, 95:334-339.
This paper describes a data set reflecting the temporal fluctuation of 112 genes during rat nervous system development, which has been used to study gene regulatory networks.
57. • Yuh CH, Bolouri H, Davidson EH: Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 1998, 279:1896-1902.
In this work, the network of regulatory interactions between elements of a control region was deduced from a series of spatial and temporal expression profiles generated by mutant promoter constructs.
58. Brazma A, Jonassen I, Vilo J, Ukkonen E: Predicting gene regulatory elements in silico on a genomic scale. Genome Res 1998, 8:1202-1215.
59. van Helden J, Andre B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998, 281:827-842.
60. • Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 1998, 16:939-945.
The yeast transcriptome was characterized under different regulatory conditions for the purposes of the computational analysis of gene regulatory elements. An ab initio method is presented for extracting sequence motifs from combined genome and transcriptome data.
61. • Ermolaeva O, Rastogi M, Pruitt KD, Schuler GD, Bittner ML, Chen Y, Simon R, Meltzer P, Trent JM, Boguski MS: Data management and analysis for gene expression arrays. Nat Genet 1998, 20:19-23.
The authors describe the current state and future directions of a data management and visualization system for gene expression data.