A statistical analysis of the TRANSFAC database


BioSystems 81 (2005) 137–154

Gary B. Fogel a, Dana G. Weekes a, Gabor Varga b, Ernst R. Dow b, Andrew M. Craven b, Harry B. Harlow b, Eric W. Su b, Jude E. Onyia b, Chen Su b,∗

a Natural Selection, Inc., 3333 N. Torrey Pines Ct., Suite 200, La Jolla, CA 92037, USA
b Lilly Research Laboratories, Eli Lilly and Company, 2001 W. Main Street, Greenfield, IN 46140, USA

Received 16 March 2005

Abstract

Transcription factors are key regulatory elements that control gene expression. The TRANSFAC® database represents the largest repository for experimentally derived transcription factor binding sites (TFBS). Understanding TFBS, which are typically conserved during evolution, helps us identify genomic regions related to human health and disease, and regions that might be predictive of patient outcomes. Here we present a statistical analysis of all TFBS in the TRANSFAC® database. Our analysis suggests that the current definition of TFBS core regions in TRANSFAC® should be re-examined so as to capture a more precise notion of “cores.” We offer insight into more appropriate definitions of TFBS consensus sequences and core regions. These revised definitions provide a better understanding of the nature of transcription factor-DNA binding and assist with developing algorithms for de novo TFBS discovery as well as finding novel variants of known TFBS.

© 2005 Elsevier Ireland Ltd. All rights reserved.

Keywords: Transcription factor binding site; TRANSFAC®; Gene regulation

∗ Corresponding author. Tel.: +1 317 277 9657. E-mail address: chen [email protected] (C. Su).

0303-2647/$ – see front matter © 2005 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2005.03.003

1. Introduction

Transcription factor binding sites (TFBS) are important features of gene regulation. Considerable progress has been made in identifying, experimentally verifying, and collecting databases of TFBS. These databases have been useful for the development of new bioinformatic tools for TFBS discovery, and for our overall understanding of TFBS sequence characteristics and locations. The TRANSFAC® (Wingender et al., 1996, 2000, 2001; Matys et al., 2003; http://www.biobase.de/) and COMPEL (Wingender et al., 1997; Heinemeyer et al., 1998; Kel-Margoulis et al., 2000) databases are the most commonly used repositories for TFBS. TRANSFAC® is available under both commercial and non-commercial licenses and provides information for single transcription factors, including information on co-factors where these are known. COMPEL provides composite regulatory elements containing two closely located TFBS, which are minimal functional units in combinatorial transcription regulation. The TRANSCompel database (Kel-Margoulis et al., 2002) combines these resources and COMPEL can be considered as a subset of TRANSFAC®. The TFBS information in these databases represents a collection of experimental data found in the literature, arranged by transcription factor type.

Analysis of these databases is typically performed with software packages such as MatInspector (Quandt et al., 1995; http://www.genomatix.de), wherein functionally similar TFBS motifs are grouped together and represented as a nucleotide distribution matrix. Statistics about each TFBS matrix are provided, with a TFBS consensus motif and the frequency of nucleotide occurrence at each position. Such information is valuable when searching unannotated sequence for putative new TFBS locations by similarity to previously known TFBS or to derive a degree of confidence in a putative TFBS relative to the distributions of experimentally verified binding sites. In addition, a profile of the degree of conservation at each position of the matrix (e.g., the Ci-vector) is provided, ranging from 0 (an equal distribution of all four nucleotides) to 100 (conservation of one nucleotide to the exclusion of all others). A “core” sequence of the matrix is also provided and defined as the (usually 4) highest conserved, consecutive positions of the matrix (e.g., MatInspector; Cavener, 1987).

TFBS database interpretation methods provide valuable insight into the nature of TFBS conservation and have greatly assisted in the development of novel computational approaches for TFBS discovery (Kel et al., 2003). However, this insight is provided on a family-by-family basis without examination of the overall characteristics of all known TFBS.
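The conservation profile and core described above can be illustrated with a short sketch. It assumes that the Ci value is an entropy-based score rescaled so that an equal distribution of the four nucleotides gives 0 and a single conserved nucleotide gives 100, which matches the range stated in the text; the exact MatInspector formula (Quandt et al., 1995) may differ, for example in how gaps are treated. The function names and the example matrix are hypothetical.

```python
import math

def ci_value(freqs):
    """Conservation of one matrix position on a 0-100 scale.

    freqs: relative frequencies of A, C, G, T at the position (summing to 1).
    Assumption: entropy-based score normalized over four symbols; the exact
    MatInspector formula may differ (for example, in its treatment of gaps).
    """
    entropy = -sum(p * math.log(p) for p in freqs if p > 0)
    return 100.0 * (1.0 - entropy / math.log(4))

def find_core(ci_vector, width=4):
    """Return (start, end) of the window of `width` consecutive positions
    with the highest summed Ci (0-based, end exclusive)."""
    best_start = max(
        range(len(ci_vector) - width + 1),
        key=lambda s: sum(ci_vector[s:s + width]),
    )
    return best_start, best_start + width

# Hypothetical 8-position matrix (rows: positions; columns: A, C, G, T).
matrix = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.20, 0.20, 0.20],
    [0.00, 0.00, 1.00, 0.00],
    [1.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.00],
    [0.90, 0.00, 0.10, 0.00],
    [0.30, 0.30, 0.20, 0.20],
    [0.25, 0.25, 0.25, 0.25],
]
ci_vector = [ci_value(row) for row in matrix]
core = find_core(ci_vector)  # window covering positions 2-5 (0-based)
```

Fixing the window width at 4 mirrors the "usually 4" convention quoted above; a fixed width cannot report a longer conserved block or a second block elsewhere in the motif.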
Statistical interpretation of all known TFBS offers a means to assist with the tuning of algorithms designed for TFBS discovery, which in practice will help us identify genomic regions related to human health and disease (Qiu et al., 2002; Kellis et al., 2003; Mrowka et al., 2003; Bulyk, 2003; Feingold et al., 2004). Here we present such an analysis of the TRANSFAC® database, using the information presented by MatInspector. We suggest that several commonly used TFBS definitions can be improved with an understanding of the global properties of all known TFBS.

2. Material and methods

2.1. TRANSFAC® database

Two hundred ninety-two experimentally verified vertebrate TFBS matrices were extracted from the June 2003 release of the TRANSFAC® database using MatInspector Version 6.2. These were examined using StatView for Windows Version 5.0 (SAS Institute, Inc.) in terms of TFBS motif length, core length, position of the core relative to the full TFBS, and nucleotide distributions of both the full motif and the core region. TRANSFAC® provides a symbol string consensus sequence using IUPAC base codes. Following the definitions in MatInspector, a single nucleotide is shown if its frequency is greater than 50% and at least twice as high as that of the second most frequent nucleotide. A double-degenerate code indicates that the corresponding two nucleotides occur in more than 75% of the underlying sequences but each of them is present in less than 50%. All other frequency distributions are represented by the letter ‘N’ (Quandt et al., 1995; Cavener, 1987).

2.2. Statistical analysis

All statistical analyses were performed using the software package StatView, except for the computation of the compositional complexity. The compositional complexity (K) of each TFBS core was calculated using the equation

$$K = \frac{1}{w} \log_N \left( \frac{w!}{\prod_{i} n_i!} \right) \qquad (1)$$

where w is the core length, n_i is the number of nucleotides of type i in the core, i is an element of {A, T, C, G}, and N = 4 in this case (Wootten and Federhen, 1996).
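The consensus-symbol rule quoted in Section 2.1 can be written out as a short sketch. This is a literal reading of the three conditions in the text, not MatInspector's actual code; the function name, the count dictionary, and the tie-handling are assumptions.

```python
# IUPAC two-nucleotide codes keyed by the (alphabetically sorted) pair.
DOUBLE_CODES = {
    "AG": "R", "CT": "Y", "GT": "K", "AC": "M", "CG": "S", "AT": "W",
}

def consensus_symbol(counts):
    """Map nucleotide counts at one matrix position to a consensus symbol.

    counts: dict with keys 'A', 'C', 'G', 'T' and non-negative counts
    (at least one count must be positive).
    """
    total = sum(counts.values())
    ranked = sorted(counts, key=counts.get, reverse=True)
    first, second = ranked[0], ranked[1]
    f1, f2 = counts[first] / total, counts[second] / total
    # Single nucleotide: more than 50% and at least twice the runner-up.
    if f1 > 0.5 and f1 >= 2 * f2:
        return first
    # Double-degenerate code: the pair exceeds 75%, each member below 50%.
    if f1 + f2 > 0.75 and f1 < 0.5 and f2 < 0.5:
        return DOUBLE_CODES["".join(sorted((first, second)))]
    # Every other distribution is reported as N.
    return "N"
```

Under this reading no branch can return a three-nucleotide code, and a position such as 55% A and 40% C falls through to N because it satisfies neither the single nor the double condition.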

3. Results

Fig. 1a provides a histogram of the length of the 292 vertebrate TFBS consensus motifs in TRANSFAC®. These consensus motifs have a mean length of 14.3 ± 4.7 nt with a minimum of 6 nt and maximum of 32 nt. Fig. 1b provides a histogram of the core lengths of the 292 TFBS. Cores have a mean length of 4.1 ± 0.4 nt with a minimum of 4 nt and maximum of 7 nt. Note that nearly 95% of the TRANSFAC® TFBS have a core of 4 nt. This overabundance is dramatic and the rationale for it is discussed below.

Fig. 1. (a and b) Lengths of (a) TFBS motifs and (b) TFBS cores in TRANSFAC®. The TFBS motifs have a mean length of 14.3 ± 4.7 nt with a minimum of 6 nt and maximum of 32 nt. Note that nearly 95% of TFBS cores have a length of 4 nt.

On average, 30.7 ± 39.7 motifs were used per TFBS matrix (Fig. 2). In some cases, as few as four motifs were used (e.g., ELK1 01, HNF4 02 B, DTYPEPA B), while in others as many as 389 motifs were used (TATA 01) (Fig. 2).

Fig. 2. Distribution of the number of TFBS motifs used in building the TFBS matrices in TRANSFAC®. The mean is 35 with a maximum of 389 motifs per matrix.

The ratio of positions identified as “core” by MatInspector to the total matrix length is provided in Fig. 3. On average, this ratio is 0.32 ± 0.11. The lowest ratio was 0.125 and the highest was 0.778. Thus, core sequences typically comprise 21–43% of the entire TFBS length.

Fig. 3. Ratio of the core length to full TFBS length for all TFBS in TRANSFAC®. On average, the ratio of core to motif lengths is 0.32 ± 0.11, suggesting that core positions compose roughly one third of the TFBS motifs in TRANSFAC®.

A total of 4172 nucleotide symbols compose the entire set of TRANSFAC® TFBS consensus motifs (Table 1). Of these symbols, 58.7% were {A, T, C, G} and 41.3% were {R, Y, K, M, S, W, N, “·”}, where “·” represented a position of insertion or deletion in the nucleotide distribution matrix. The symbol N composed a surprisingly high percentage (26.3%) of the 4172 symbols. These N’s could potentially mask information in the matrices if one only uses the consensus sequences. There were no instances of {B, D, H, V} in either the full-length TFBS motifs or TFBS cores due to the restrictive MatInspector definition described in the Methods section. In direct contrast, TFBS cores had a vastly different nucleotide symbol representation: 94.1% of the symbols were {A, T, C, G}, with N and R having nearly identical representation at only 1.3% and 1.2%, respectively. In addition, no “·” symbols were found in the TFBS cores.

Table 1
Number of occurrences of the 12 symbols found in the entire TRANSFAC® database and their percentage of occurrence in full motifs and within core regions only

Nucleotide | Occurrence in full matrix | Percentage of occurrence in full matrix | Occurrence in core | Percentage of occurrence in core
A | 762 | 18.2 | 393 | 32.9
C | 516 | 12.4 | 213 | 17.8
G | 609 | 14.6 | 272 | 22.8
T | 563 | 13.5 | 246 | 20.6
R | 144 | 3.5 | 14 | 1.2
Y | 122 | 2.9 | 13 | 1.1
M | 74 | 1.8 | 5 | 0.4
K | 72 | 1.7 | 5 | 0.4
W | 129 | 3.1 | 12 | 1.1
S | 91 | 2.2 | 6 | 0.4
N | 1078 | 26.0 | 16 | 1.3
· | 12 | 0.3 | 0 | 0
Total | 4172 | 100 | 1195 | 100

The symbol N was the most frequent symbol in full motifs, at roughly one quarter of the entire database. Adenine was the most common nucleotide in the core regions. Other IUPAC symbols (B, D, H and V) were not found in the database (see text for details).

The percentage of positions in full-length TFBS motifs that were conserved at 100% Ci varied from 0% to 89% with a mean of 22.48 ± 17.1% (Fig. 4). We then examined 100% conserved nucleotide positions in regions outside of cores (“non-core” regions). The vast majority of these non-core regions did not contain any such positions (Fig. 5a). This is to be expected if the definition of the core corresponds to the region of highest conservation. However, several full-length TFBS did contain nucleotide positions of 100% conservation adjacent to the core (Fig. 5a). Overall, 0.9 ± 1.4 positions per non-core region were 100% conserved. It is arguable that these positions should be assigned as part of the core if they are adjacent to the core. We then examined the core regions for positions with 100% conservation (Fig. 5b). As expected, the majority of cores had at least one position conserved at 100%. However, over 42 cores had no position conserved at 100%. Overall, an average of 2.4 ± 1.4 positions were 100% conserved within the core regions.

Fig. 4. Percentage of nucleotide positions in TRANSFAC® TFBS matrices that have 100% conservation (Ci).

Fig. 5. (a and b) Histograms of (a) the number of occurrences of positions of 100% conservation outside of regions identified to be core regions and (b) the number of 100% conserved nucleotide positions in core regions. Most non-core regions lack any 100% conserved positions. However, some motifs have as many as 12 completely conserved adjacent nucleotide positions in regions that are outside the core.

Reviewing the surprisingly high percentage of N’s in the full-length consensus motifs, we found that positions at the 5′ and 3′ ends of the TFBS matrices were frequently being called N due to the rules associated with Ci-vector calculation and to a general lack of sequence conservation at those positions. If these “terminal N’s” were not conserved during evolutionary history, they are unlikely to be important in defining the true TFBS. We therefore removed the terminal N’s from subsequent analysis. The modified TFBS motif lengths had a mean of 11.9 ± 4.6 nt with a minimum of 2 nt and a maximum of 32 nt (Fig. 6). This provided a slightly better fit to the normal distribution when compared to the original distribution in Fig. 1a.

Fig. 6. Motif lengths for all 292 TFBS in TRANSFAC® with terminal N’s removed. TFBS motifs with terminal N’s removed have a minimum length of 2 nt and a maximum length of 32 nt with mean 11.9 ± 4.6 nt.

We then asked whether the highly conserved nucleotide positions are evenly spread out or form clusters in the motifs. By evaluating the number of neighboring positions (i.e., a value of 1 is equivalent to one pair of neighboring positions) that were conserved at ≥85% Ci in core and non-core regions (Fig. 7a and b), we found that the number of conserved neighboring position pairs in core regions was much higher than in non-core regions, suggesting that more conserved positions in core regions form clusters. This has important implications for algorithm design, since conserved neighboring positions can be used to identify TFBS cores. However, some TFBS still exist with non-core regions of high continuous conservation (Fig. 7a). Fig. 8 provides the frequency of occurrence of conserved neighboring positions in core and non-core TFBS regions with Ci values varying between 90% and 100%. As the level of required conservation for neighboring positions is increased, the number of positions not meeting these requirements increases in both the non-core and core regions of TFBS motifs.

To understand the nature of TFBS cores, we also examined the compositional complexity of the nucleotide sequence, which is commonly used when one analyzes biological sequence information. Given that the data in TRANSFAC® was represented by the larger symbol set {A, T, C, G, R, Y, K, M, S, W, N}, for TFBS matrices with degenerate positions, we calculated all possible complexities represented by the degenerate positions using only the symbols {A, T, C, G}. This resulted in 356 complexity scores for the 292 TFBS in TRANSFAC®. The resulting graphs (Fig. 9a and b) demonstrate that complexity does not appear to be normally distributed and that there is no apparent correlation between core length and core complexity.
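The complexity calculation for degenerate cores can be sketched as follows. The sketch expands each degenerate symbol into the nucleotides it represents, scores every concrete sequence with Eq. (1), and keeps the distinct scores. How expansions were collapsed into the 356 reported scores is not stated in the text, so reporting distinct values per core is an assumption, as are the function names.

```python
import math
from collections import Counter
from itertools import product

# Nucleotides represented by each consensus symbol that occurs in the cores.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "N": "ACGT",
}

def complexity(seq, alphabet_size=4):
    """Compositional complexity K of Eq. (1) for a sequence over A, C, G, T."""
    w = len(seq)
    arrangements = math.factorial(w)
    for n in Counter(seq).values():
        arrangements //= math.factorial(n)  # exact: multinomial coefficient
    return math.log(arrangements, alphabet_size) / w

def expand(core):
    """All concrete A/C/G/T sequences matched by a degenerate core."""
    return ["".join(p) for p in product(*(IUPAC[s] for s in core))]

def core_complexities(core):
    """Distinct complexity scores (rounded) over all expansions of a core."""
    return sorted({round(complexity(s), 6) for s in expand(core)})

# The core AAWT expands to AAAT and AATT, which have different compositions
# and therefore yield two distinct scores.
scores = core_complexities("AAWT")
```

A core without degenerate symbols, such as GTCA, has a single expansion and a single score, while a homopolymer such as AAAA scores 0.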

4. Discussion

The statistical interpretation of promoter databases is a key process for increased understanding and appreciation of TFBS (Bajic et al., 2003; Sinha and Tompa, 2003; Marino-Ramirez et al., 2004). These statistics can be used to refine algorithms used to search for novel putative TFBS (Birnbaum et al., 2001; Moses et al., 2003; Werner, 2003; Fogel et al., 2004).

The core regions of TFBS have traditionally been defined as regions of high nucleotide conservation over a wide range of organisms. This information is valuable when one deciphers mechanisms of action and patterns of gene regulation at the sequence level. A typical method used in the interpretation of core regions is to assume a minimum length of 4 nt as a “core” element, typically located in the center of a larger TFBS motif (Quandt et al., 1995). This length restriction can be observed in Fig. 1b, where the vast majority of “cores” in TRANSFAC® are of length 4 nt with none over length 7 nt. Such an arbitrary definition raises questions about the true lengths of TFBS cores, the nature of what it means to have a “core”, and whether only one “core” exists per TFBS. For the purpose of illustration, consider the condition where two equally well-conserved blocks of ≥4 nt exist but are separated by ∼6 nt in a TFBS of total length 20 nt. If there is always one “core” associated with these TFBS, which one would be labeled as a core in the resulting TFBS database?

When applying a strict requirement of 100% conservation, a maximum of six positions can be found in regions currently identified as cores in TRANSFAC® (Fig. 5b). However, regions of 100% conservation as long as 12 nt can be found in areas outside of cores (Fig. 5a). Similarly, a maximum of six neighboring positions with 100% Ci can be found in core regions whereas continuous highly conserved regions as long as nine positions can be found in non-core areas (Fig. 8a and b). This is counterintuitive to the biologist’s notion of a “core” and suggests that any arbitrary limitation on “core” length or composition should be made with care, especially when there is significant evolutionary distance represented in the sequences being used to generate the TFBS matrices.

Fig. 7. (a and b) Histogram of the number of occurrences of neighboring position pairs (NPP) of ≥85% Ci conservation in (a) non-core regions and (b) core regions. Most non-core regions lack any ≥85% conserved neighboring positions. However, some motifs have as many as nine NPP in non-core regions. Note that the number of core regions without any ≥85% conserved positions represents roughly 8% of all core positions.

Fig. 8. (a–f) Frequency of occurrence of neighboring position pairs (NPP) in TFBS core and non-core regions with different Ci values.

Experimental verification of a TFBS is a prerequisite for deposition in a promoter database. Given that experimental detection is organism specific, and that the more important elements of a TFBS can only

be determined by comparative analysis to a family of similar TFBS, flanking regions are of little value when averaged over examples from an assortment of species with large evolutionary distance. When viewed as a family nucleotide frequency matrix, these flanking regions of little or no conservation are reported in a consensus sequence as N’s at either the 5′ or 3′ end of the TFBS (“terminal N’s”). Removal of these terminal N’s may be valuable when determining characteristics about the true nature of TFBS regions that are conserved over multiple species.

Table 2
TFBS with known “terminal N” artifacts in TRANSFAC®

TFBS family | Description
GATA1 02 | GATA-binding factor 1
GATA1 03 | GATA-binding factor 1
GATA1 04 | GATA-binding factor 1
GATA2 01 | GATA-binding factor 2
GC 01 | GC box elements
GEN INI3 B | General initiator seq. (viral + cellular)
GFI1 01 | Growth factor independence 1
HEN1 01 | E-box binding factor without transcript. activation
HEN1 02 | E-box binding factor without transcript. activation
HNF3B 01 | Hepatocyte Nuclear Factor 3beta
HNF4 01 | Hepatic nuclear factor 4
HOX13 01 | Hox-1.3
IK2 01 | Ikaros 2
INI B | General initiator
MEF2 02 | Myogenic MADS factor MEF-2
MEF2 03 | Myogenic MADS factor MEF-2
MEF2 04 | Myogenic MADS factor MEF-2
MINI20 B | Muscle initiator sequence-20
MUSCLE INI B | Muscle initiator
NF1 Q6 | Nuclear factor 1
NFAT Q6 | Nuclear factor of activated T-cells
NMYC 01 | N-Myc
OCT1 02 | Octamer factor 1
OCT1 03 | Octamer factor 1
OCT1 04 | Octamer factor 1
OCT1 Q6 | Octamer factor 1
P300 01 | p300
PAX3 B | Pax3 binding sites
PAX4 01 | Pax-4 binding sites
PAX4 03 | Pax-4 binding sites
PAX4 04 | Pax-4 binding sites
PAX6 01 | Pax-6
PU1 B | Pu.1 (Pu120) Ets-like transcription factor identified in lymphoid B-cells
S8 01 | S8
SOX9 B1 | SOX (SRY-related HMG box)
SRY 02 | Sex-determining region Y gene product
TATA 01 | Cellular and viral TATA box elements
TATA B | General TATA box
TCF11 01 | TCF11/KCR-F1/Nrf1 homodimers
TCF11MAFG 01 | TCF11/MafG heterodimers
TH1E47 01 | Thing1/E47 heterodimer
TST1 01 | POU-factor Tst-1/Oct-6
USF 02 | Upstream stimulating factor
VMAF 01 | v-Maf
XBP1 01 | X-box-binding protein 1
YY1 01 | Yin and Yang 1

One of the best examples of this phenomenon is PAX4 01, with a motif of NGNNGTCANGCGTNNNNNNNN but a core listed as GTCA.

Fig. 9. (a and b) (a) Histogram of the complexity scores for TFBS cores in TRANSFAC®. The core sequences have a mean complexity of 0.679 ± 0.202. Given the wide variance of core complexity there appears to be little benefit in using complexity for TFBS discovery. (b) There is no apparent correlation between core complexity and core length.

TFBS families in TRANSFAC® have different numbers of sequences that were used in generating the consensus motif (see Appendix A). This difference can have an effect on the number of positions that are considered to be 100% conserved. TFBS motifs that are derived from matrices that have few sequence representatives tend to have more positions with high percentage conservation than TFBS matrices with many representatives. Thus, Ci-vectors should be normalized relative to the number of sequences used in generating the matrix. This effect is not considered in the current version of MatInspector.

Following detailed inspection of all 292 vertebrate TFBS in TRANSFAC®, areas for improvement can be observed for 136 TFBS (46.6% of the data). These problems can be classified into two major categories: (A) terminal N’s at either the 5′ or 3′ end or both (Table 2); and (B) errors in the estimation of a core, wherein either too many or too few positions were assigned, or there exists more than one “core” of equal value elsewhere in the motif that was avoided for no apparent reason (Table 3). One motif (OCT1 07) has a calculation of a Ci score for one position that is believed to be incorrect.

Table 3
Errors of core estimation within TRANSFAC®

TFBS family | Description | Problem
SRF C | Serum responsive factor | Core should be ATGG not CCWT
FREAC4 01 | Fork head related activator-4 | Core should be TAAACA not AACA
MYT1 02 B | MyT1 zinc finger transcription factor involved in primary neurogenesis | Core should be AAGTT not AAGT
PPARA 01 | PPAR/RXR heterodimers | Core should be AAAGGT not AAAG
VMYB 02 | v-Myb | Core should be AACGG not AACG
MYT1 01 B | MyT1 zinc finger transcription factor involved in primary neurogenesis | Core should be AAGTTTACTT not AAGT
IRF2 01 | Interferon regulatory factor 2 | Core should be AAGYGAAA not GAAA
EVI1 06 | Ectopic viral integration site 1 encoded factor | Core should be ACAAGAT not AGAT
NRSF 01 | Neuron-restrictive silencer factor | Core should be AGCACC not AGCA
ETS1 B | c-Ets-1 binding site | Core should be AGGA not GGAA
ETS2 B | c-Ets-2 binding site | Core should be AGGA not GGAA
OCT C | Octamer binding site | Core should be ATGCAAA not GCAAA
OCT1 01 | Octamer factor 1 | Core should be ATGCAAA not TATG
OCT1 05 | Octamer factor 1 | Core should be ATGCAAAT not ATGC
OCT1 B | Octamer factor 1 | Core should be ATGCAAAT not ATGC
PIT1 B | Pit1, GHF-1 pituitary specific pou domain transcription factor | Core should be ATTCA not ATTC
HNF4 01 B | hepatic nuclear factor 4 | Core should be CAAAG not CAAA
SF1 B | SF1 steroidogenic factor 1 | Core should be CAAGG not AAGG
PBX1 02 | Homeo domain factor Pbx-1 | Core should be CAATC not CAAT
MAX 01 | EBOX (E-BOX binding factors) | Core should be CACGTG not CACG
MYCMAX 01 | c-Myc/Max heterodimer | Core should be CACGTG not CACG
USF 01 | Upstream stimulating factor | Core should be CACGTG not CACG
USF C | USF binding site | Core should be CACGTG not CACGT
TAL1ALPHAE47 01 | Tal-1alpha/E47 heterodimer | Core should be CAGATG not CAGA
TAL1BETAE47 01 | Tal-1alpha/E47 heterodimer | Core should be CAGATG not CAGA
TAL1BETAITF2 01 | Tal-1beta/ITF-2 heterodimer | Core should be CAGATG not CAGA
E47 02 | MYOblast Determining factor | Core should be CAGGTG not CAGG
MYOD 01 | Myoblast determination gene product | Core should be CAGGTG not CAGG
LMO2COM 01 | Complex of Lmo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 1 | Core should be CAGGTG not CAGG
E47 01 | MYOblast determining factor | Core should be CAGGTG not GCAG
MYOD Q6 | Myoblast determining factor | Core should be CANCTG not CANC
P53 01 | Tumor suppressor p53 | Core should be CATGCCCGGGCATG not CATG
NFY Q6 | Nuclear factor Y (Y-box binding factor) | Core should be CCAAT not CCAA
NFY C | NF-Y binding site | Core should be CCAATCA not RCCAA
SRF 01 | Serum response factor | Core should be CCATATATGG not TATA
E2F Q6 | E2F-myc activator/cell cycle regulator | Core should be CGCGA not CGCG
E2F 02 | E2F | Core should be CGCSAAA not SAAA
EGR3 01 | Early growth response gene 3 product | Core should be CGTGGG not GCGT
SEF1 C | SEF1 binding site | Core should be CTGTGGT not TCTGT
E2F 01 | E2F | Core should be GAAAA not GAAA
ISRE 01 | Interferon-stimulated response element | Core should be GAAAC not GAAA
GATA C | GATA binding site | Core should be GATA not GATAA
GATA1 06 | GATA-binding factor 1 | Core should be GATAA not GATA
ELK1 01 | Human and murine ETS1 Factors | Core should be GGAAG not GGAA
GABP B | GABP: GA binding protein | Core should be GGAAG not GGAA
NRF2 01 | Nuclear respiratory factor 2 | Core should be GGAAG not GGAA
RORA1 01 | RAR-related orphan receptor alpha1 | Core should be GGTCA not GGTC
RORA2 01 | RAR-related orphan receptor alpha2 | Core should be GGTCA not GGTC
T3R 01 | Viral homolog of thyroid hormone receptor alpha1 | Core should be GGTCA not GGTC
NGFIC 01 | Nerve growth factor-induced protein C | Core should be GGYG not GCGT (a completely different location)
FREAC3 01 | Fork head related activator-3 | Core should be GTAAA not GTAA
MIF1 01 | MIBP-1/RFX1 complex | Core should be GTAAC not GTAA and there might be another core of GTT upstream.
XFD3 01 | Xenopus fork head domain factor 3 | Core should be GTMAACA not AACA
EGR2 01 | Egr-2/Krox-20 early growth response gene product | Core should be GTRGG not GCGT
GRE C | Glucocorticoid response element | Core should be GTYCT not TGTY
RFX1 01 | X-box binding protein RFX1 | Core should be GYAAC not GYAA
RFX1 02 | X-box binding protein RFX1 | Core should be GYAAC not GYAA
HMEF2 Q6 | Myocyte enhancer factor | Core should be TAAAAATAAC not AAAT
MEF2 01 | Myogenic enhancer factor 2 | Core should be TAAAAATAAC not TAAA
XFD2 01 | Xenopus fork head domain factor 2 | Core should be TAAACA not TAAA
XFD1 01 | Xenopus fork head domain factor 1 | Core should be TAAAYA not TAAA
TAACC B | Lentiviral TATA upstream el. | Core should be TAACC not CTAACC
PDX1 B | Pdx1 (IDX1/IPF1) pancreatic and intestinal homeodomain TF | Core should be TAATKAC not TAAT
ATATA B | Avian C-type TATA box | Core should be TATWTAAG not TATWTA
SREBP1 02 | Sterol regulatory element binding protein 1 | Core should be TCACCCCAC not TCAC
MEF3 B | (MEF3 BINDING SITES) | Core should be TCAGGTT not TCAG
CLOX 01 | Clox | Core should be TCGA or CGAT or ATCGAT but they only list ATCG
TAXCREB 01 | Tax/CREB complex | Core should be TGACG not TGAC
CREBP1 Q2 | CRE-binding protein 1 | Core should be TGACG not TGAC
VJUN 01 | v-Jun | Core should be TGACGTCA not TGAC
TAXCREB 02 | Tax/CREB complex | Core should be TGCGTCA not TGCG
HFH8 01 | HNF-3/Fkh Homolog-8 | Core should be TGTTT not TGTT
HFH3 01 | HNF-3/Fkh Homolog 3 (=Freac-6) | Core should be TRTTT not TRTT
CREBP1 01 | cAMP-responsive element binding protein 1 | Core should be TTACGT not ACGT
STAT1 01 | Signal transducer and activator of transcription 1 | Core should be TTCCGGGAA not GGAA
HNF4 02 B | Hepatic nuclear factor 4 | Is the entire TFBS a core due to 100% Ci and few sequences?
DTYPEPA B | PolyA signal of D-type LTRs | Is the entire TFBS a core due to 100% Ci and few sequences?
PAX1 B | Pax1 binding sites | Is the entire TFBS a core due to 100% Ci and few sequences?
CAP 01 | Cap signal for transcription initiation | Is there any core here?
MSX1 01 | msh-like (muscle segment homeobox) homeobox protein 1 | Is there any core here?
PAX2 01 | PAX2 (PAX-2/PAX-8 binding sites) | Is there any core here?
EVI1 04 | Ectopic viral integration site 1 encoded factor | Is there any core here?
EVI1 01 | Ectopic viral integration site 1 encoded factor | Primary core should be GAYAAGA but there is a second core of AAGA 3’ downstream.
IRF1 01 | Interferon regulatory factor 1 | Two cores?
NFKAPPAB65 01 | NF-kappaB (p65) | Two cores?
RSRFC4 01 | Related to serum response factor, C4 | Two cores?
SREBp1 01 | Sterol regulatory element binding protein 1 | Two cores?
STAT3 01 | Signal transducer and activator of transcription 3 | Two cores?
VDR RXR2 B | VDR/RXR heterodimer site | Two cores?

The majority of these errors are of the type where the core definition used in TRANSFAC® was too restrictive and a larger, generally highly conserved element was missed. Of particular interest is the large number of TFBS families that share the motif CAVRTG (12 families in total).

When reviewing TFBS in the category of terminal N errors, the majority have N’s at both the 5′ and 3′ flanking regions rather than one or the other. For example, in the matrix PAX4 04, 24 binding motifs were used to generate the nucleotide distribution matrix. The core is listed as AAWT (positions 4–7). Positions 1, 2, 9–23, 28, and 30 are all given as N (19 positions out of 30). The motif GATA1 04 is another typical example with two N positions at the 5′ end and four N positions at the 3′ end, out of 13 total positions. In this case, only roughly half of the matrix is useful for determining the TFBS motif.

In the TFBS matrices with “core estimation” problems, a variety of patterns emerge. Many of the core estimation errors are the result of a misidentified, artificially short core.
For instance, the TFBS FREAC4 01 has a core of AACA (positions 9–12) but clearly this should be extended to TAAACA (positions 7–12) to include the conserved T at position 7 and the highly conserved A at position 8. This type of error is likely due to the overly restrictive definition of the core in MatInspector. In some cases it seems to be difficult to identify a core, given that only a few sequences were pooled to generate the matrix such that essentially all positions are 100% conserved (cf. DTYPEPA B, constructed using only four sequences). In other cases there was no apparent core. MSX1 01 is a good example, in which the core of 4 nt is given as “WNTG” for a TFBS of length 9 nt. The position with the highest conservation in that TFBS is position 8, a T of Ci 66.7, but on average, the conservation over the entire motif is very low. Matrix EVI1 01 is particularly interesting in that there appears to be a primary core of GAYAAGA (misidentified in MatInspector as only AAGA) but there is also a second highly conserved AAGA downstream of this element. Thus it may be that the true core should be represented as GAYAAGATAAGA or that two (or perhaps even more) regions of high conservation truly exist in this TFBS. Other TFBS motifs such as IRF1 01 have similar problems of potentially having two conserved areas shown in the matrix. The current definition of core used in TRANSFAC® will identify only one of these to the exclusion of the other.

In summary, we have identified problems with the TRANSFAC® TFBS matrices and proposed new solutions. Open questions regarding the true nature of core regions in TFBS still remain. For instance, it is still unclear whether species specificity should play a general role in defining cores, and whether the core truly reflects the sites of protein-DNA binding. As our understanding of the nature of TFBS increases through more detailed statistical and experimental analysis, better computational algorithms can be designed to facilitate bona fide promoter analysis to pinpoint regulatory mechanisms of gene expression.
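One way to explore the alternatives raised above (longer cores, or more than one conserved region per TFBS) is to report every maximal run of highly conserved positions instead of a single fixed-width window. The sketch below is only an illustration of that idea, not a definition proposed in this paper; the 85% threshold is borrowed from the neighboring-position analysis, the minimum length of 4 from the conventional core length, and the function names and example profile are hypothetical.

```python
def conserved_blocks(ci_vector, threshold=85.0, min_length=4):
    """Return (start, end) pairs (0-based, end exclusive) for every maximal
    run of consecutive positions with Ci >= threshold and at least
    min_length positions."""
    blocks, start = [], None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, ci in enumerate(list(ci_vector) + [float("-inf")]):
        if ci >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks

def neighboring_pairs(ci_vector, threshold=85.0):
    """Count neighboring position pairs in which both positions have
    Ci >= threshold."""
    return sum(
        1 for a, b in zip(ci_vector, ci_vector[1:])
        if a >= threshold and b >= threshold
    )

# Hypothetical Ci profile with two conserved regions separated by one
# weakly conserved position.
profile = [20, 100, 90, 100, 100, 95, 100, 100, 30, 100, 100, 100, 100, 10]
blocks = conserved_blocks(profile)   # two blocks: positions 1-7 and 9-12
pairs = neighboring_pairs(profile)   # 6 pairs in the first run, 3 in the second
```

A run of n conserved positions contributes n − 1 neighboring pairs, so the pair count and the block list summarize the same clustering from two angles.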

Appendix A. 292 transcription factor binding sites (TFBS) from the TRANSFAC® database for vertebrates For each TFBS, the description is provided along with the number of sequences used in generating the TFBS motif, the number of positions in the core of the TFBS with 100% conservation, the number of core positions, the number of positions with 100% conservation outside the core, the number of total positions in the full length motif, the ratio of core positions to the full length of the TFBS motif, the number of total positions conserved at 100% Ci , and the percentage of motif positions conserved at 100% Ci . For some TFBS, the number of sequences used in generating the TFBS motif was not provided by MatInspector (labeled as “–” in column 3).

| TFBS name | Description | # Sequences used in generating TFBS motif | # 100% Ci positions in core | # Core positions | # 100% Ci positions outside core | # Total positions in motif | # Core positions / # positions in motif | Total # motif positions at 100% Ci | % of motif positions at 100% Ci |
|---|---|---|---|---|---|---|---|---|---|
| ACAAT B | Avian C-type CCAAT box | 12 | 4 | 5 | 2 | 9 | 0.556 | 6 | 66.667 |
| AHR 01 | Aryl hydrocarbon/dioxin receptor | 9 | 4 | 4 | 2 | 18 | 0.222 | 6 | 33.333 |
| AHRARNT 01 | Aryl hydrocarbon/Arnt heterodimers | 24 | 0 | 4 | 0 | 16 | 0.250 | 0 | 0.000 |
| AHRARNT 02 | Aryl hydrocarbon/Arnt heterodimers, fixed core | 25 | 4 | 4 | 2 | 19 | 0.211 | 6 | 31.579 |
| AMEF2 Q6 | Myocyte enhancer factor | 8 | 3 | 4 | 0 | 18 | 0.222 | 3 | 16.667 |
| AML1 01 | Runt-factor AML-1 | 38 | 1 | 4 | 0 | 6 | 0.667 | 1 | 16.667 |
| AP1 C | Ap-1 binding site | – | 3 | 7 | 0 | 9 | 0.778 | 3 | 33.333 |
| AP1 Q2 | Activator protein 1 | 14 | 3 | 4 | 0 | 11 | 0.364 | 3 | 27.273 |
| AP1 Q4 | Activator protein 1 | 23 | 2 | 4 | 0 | 11 | 0.364 | 2 | 18.182 |
| AP1 Q6 | Activator protein 1 | 56 | 2 | 4 | 0 | 11 | 0.364 | 2 | 18.182 |
| AP1FJ Q2 | Activator protein 1 | 17 | 3 | 4 | 0 | 11 | 0.364 | 3 | 27.273 |
| AP2 Q6 | Activator protein 2 | 13 | 1 | 4 | 0 | 12 | 0.333 | 1 | 8.333 |
| AP4 Q1 | Activator protein 4 | 5 | 4 | 4 | 1 | 18 | 0.222 | 5 | 27.778 |
| AP4 Q5 | Activator protein 4 | 7 | 4 | 4 | 0 | 10 | 0.400 | 4 | 40.000 |
| AP4 Q6 | Activator protein 4 | 5 | 4 | 4 | 0 | 10 | 0.400 | 4 | 40.000 |
| APOLYA B | Avian C-type polyA signal | 15 | 5 | 6 | 3 | 15 | 0.400 | 8 | 53.333 |
| AREB6 01 | AREB6 | 12 | 4 | 4 | 1 | 13 | 0.308 | 5 | 38.462 |
| AREB6 02 | AREB6 | 17 | 4 | 4 | 1 | 12 | 0.333 | 5 | 41.667 |
| AREB6 03 | AREB6 | 12 | 2 | 4 | 2 | 12 | 0.333 | 4 | 33.333 |
| AREB6 04 | AREB6 | 12 | 4 | 4 | 0 | 9 | 0.444 | 4 | 44.444 |
| ARNT 01 | AhR nuclear translocator homodimers | 20 | 0 | 4 | 0 | 16 | 0.250 | 0 | 0.000 |
| ARP1 01 | Apolipoprotein AI regulatory protein 1 | 6 | 2 | 4 | 0 | 16 | 0.250 | 2 | 12.500 |
| ATATA B | Avian C-type TATA box | 13 | 4 | 6 | 2 | 10 | 0.600 | 6 | 60.000 |
| ATF 01 | Activating transcription factor | 25 | 4 | 4 | 1 | 14 | 0.286 | 5 | 35.714 |
| ATF B | ATF binding site | 17 | 4 | 4 | 1 | 12 | 0.333 | 5 | 41.667 |
| BARBIE 01 | Barbiturate-inducible element | 15 | 4 | 4 | 0 | 15 | 0.267 | 4 | 26.667 |
| BEL1 B | Bel-1 similar region | 19 | 3 | 5 | 1 | 28 | 0.179 | 4 | 14.286 |
| BRACH 01 | Brachyury | 40 | 4 | 4 | 3 | 24 | 0.167 | 7 | 29.167 |
| BRN2 01 | POU factor Brn-2 | 12 | 2 | 4 | 0 | 16 | 0.250 | 2 | 12.500 |
| CAAT 01 | Cellular and viral CCAAT box | 175 | 1 | 4 | 0 | 12 | 0.333 | 1 | 8.333 |
| CAAT C | Retroviral CCAAT box | – | 3 | 4 | 2 | 25 | 0.160 | 5 | 20.000 |
| CAP 01 | Cap signal for transcription initiation | 303 | 1 | 4 | 0 | 8 | 0.500 | 1 | 12.500 |
| CART1 01 | Cart-1 | 25 | 2 | 4 | 0 | 18 | 0.222 | 2 | 11.111 |
| CDP 01 | Cut-like homeodomain protein | 18 | 2 | 4 | 0 | 12 | 0.333 | 2 | 16.667 |
| CDP 02 | Transcriptional repressor CDP | 86 | 3 | 4 | 2 | 15 | 0.267 | 5 | 33.333 |
| CDPCR1 01 | Cut-like homeodomain protein | 32 | 0 | 4 | 0 | 10 | 0.400 | 0 | 0.000 |
| CDPCR3 01 | Cut-like homeodomain protein | 24 | 3 | 4 | 0 | 15 | 0.267 | 3 | 20.000 |
| CDPCR3HD 01 | Cut-like homeodomain protein | 33 | 2 | 4 | 0 | 10 | 0.400 | 2 | 20.000 |
| CDX2 B | Cdx-2 mammalian caudal related intestinal transcr. factor | 6 | 4 | 4 | 0 | 19 | 0.211 | 4 | 21.053 |
| CDXA 01 | CdxA | 19 | 0 | 4 | 0 | 7 | 0.571 | 0 | 0.000 |
| CDXA 02 | CdxA | 18 | 0 | 4 | 0 | 7 | 0.571 | 0 | 0.000 |
| CEBP 01 | CCAAT/enhancer binding protein | 22 | 1 | 4 | 0 | 13 | 0.308 | 1 | 7.692 |
| CEBP C | C/EBP binding site | – | 2 | 5 | 1 | 18 | 0.278 | 3 | 16.667 |
| CEBP Q2 | CCAAT/enhancer binding protein | 62 | 1 | 4 | 0 | 14 | 0.286 | 1 | 7.143 |
| CEBPA 01 | CCAAT/enhancer binding protein alpha | 43 | 1 | 4 | 0 | 14 | 0.286 | 1 | 7.143 |
| CEBPB 01 | CCAAT/enhancer binding protein beta | 21 | 2 | 4 | 0 | 14 | 0.286 | 2 | 14.286 |
| CEBPB 02 | CCAAT/enhancer binding protein beta | 17 | 1 | 4 | 0 | 14 | 0.286 | 1 | 7.143 |
| CETS1P54 01 | c-Ets-1(p54) | 15 | 3 | 4 | 0 | 10 | 0.400 | 3 | 30.000 |
| CETS1P54 02 | c-Ets-1(p54) | 40 | 3 | 4 | 0 | 13 | 0.308 | 3 | 23.077 |
| CHOP 01 | Heterodimers of CHOP and C/EBPalpha | 39 | 0 | 4 | 0 | 13 | 0.308 | 0 | 0.000 |
| CLOX 01 | Clox | 138 | 3 | 4 | 2 | 15 | 0.267 | 5 | 33.333 |
| CMYB 01 | c-Myb | 60 | 3 | 4 | 0 | 18 | 0.222 | 3 | 16.667 |
| COMP1 01 | COMP1 | 7 | 0 | 4 | 0 | 24 | 0.167 | 0 | 0.000 |
| COUP 01 | COUP/HNF-4 heterodimer | 13 | 1 | 4 | 0 | 14 | 0.286 | 1 | 7.143 |
| CP2 01 | CP2 | 6 | 3 | 4 | 0 | 11 | 0.364 | 3 | 27.273 |
| CREB 01 | cAMP-responsive element binding protein | 29 | 3 | 4 | 0 | 8 | 0.500 | 3 | 37.500 |
| CREB 02 | cAMP-responsive element binding protein | 16 | 2 | 4 | 0 | 12 | 0.333 | 2 | 16.667 |
| CREB Q2 | cAMP-responsive element binding protein | 15 | 4 | 4 | 0 | 12 | 0.333 | 4 | 33.333 |
| CREB Q4 | cAMP-responsive element binding protein | 20 | 4 | 4 | 0 | 12 | 0.333 | 4 | 33.333 |
| CREBP1 01 | cAMP-responsive element binding protein 1 | 47 | 3 | 4 | 1 | 8 | 0.500 | 4 | 50.000 |
| CREBP1 Q2 | CRE-binding protein 1 | 7 | 4 | 4 | 1 | 12 | 0.333 | 5 | 41.667 |
| CREBP1CJUN 01 | CRE-binding protein 1/c-Jun heterodimer | 43 | 3 | 4 | 0 | 8 | 0.500 | 3 | 37.500 |
| CREL 01 | c-Rel | 17 | 0 | 4 | 0 | 10 | 0.400 | 0 | 0.000 |
| DELTAEF1 01 | deltaEF1 | 41 | 2 | 4 | 0 | 11 | 0.364 | 2 | 18.182 |
| DTYPEPA B | PolyA signal of d-type LTRs | 4 | 4 | 4 | 4 | 10 | 0.400 | 8 | 80.000 |
| E2 01 | Papilloma virus regulator E2 | 17 | 2 | 4 | 2 | 16 | 0.250 | 4 | 25.000 |
| E2 Q6 | Papilloma virus regulator E2 | 19 | 3 | 4 | 0 | 16 | 0.250 | 3 | 18.750 |
| E2F 01 | E2F binding sites in 3 E1A-inducible promoters | 5 | 4 | 4 | 3 | 15 | 0.267 | 7 | 46.667 |
| E2F 02 | E2F | 12 | 3 | 4 | 2 | 8 | 0.500 | 5 | 62.500 |
| E2F Q6 | E2F-myc activator/cell cycle regulator | 8 | 3 | 4 | 1 | 13 | 0.308 | 4 | 30.769 |
| E47 01 | MYOblast Determining factor | 11 | 3 | 4 | 2 | 15 | 0.267 | 5 | 33.333 |
| E47 02 | MYOblast Determining factor | 38 | 2 | 4 | 2 | 16 | 0.250 | 4 | 25.000 |
| E4BP4 01 | CReb Binding Proteins | 23 | 3 | 4 | 1 | 12 | 0.333 | 4 | 33.333 |
| EGR1 01 | Egr-1/Krox-24/NGFI-A immediate-early gene product | 100 | 3 | 4 | 0 | 12 | 0.333 | 3 | 25.000 |
| EGR2 01 | Egr-2/Krox-20 early growth response gene product | 100 | 2 | 4 | 2 | 12 | 0.333 | 4 | 33.333 |
| EGR3 01 | Early growth response gene 3 product | 100 | 3 | 4 | 3 | 12 | 0.333 | 6 | 50.000 |
| ELK1 01 | Human and murine ETS1 Factors | 4 | 3 | 4 | 1 | 16 | 0.250 | 4 | 25.000 |
| ELK1 02 | Elk-1 | 31 | 3 | 4 | 0 | 14 | 0.286 | 3 | 21.429 |
| EN1 01 | Engrailed 1 | 10 | 0 | 4 | 0 | 7 | 0.571 | 0 | 0.000 |
| ER Q6 | Estrogen receptor | 20 | 1 | 4 | 0 | 19 | 0.211 | 1 | 5.263 |
| ETS1 B | c-Ets-1 binding site | 21 | 3 | 4 | 1 | 15 | 0.267 | 4 | 26.667 |
| ETS2 B | c-Ets-2 binding site | 9 | 3 | 4 | 1 | 14 | 0.286 | 4 | 28.571 |
| EVI1 01 | Ectopic viral integration site 1 encoded factor | 13 | 3 | 4 | 5 | 16 | 0.250 | 8 | 50.000 |
| EVI1 02 | Ectopic viral integration site 1 encoded factor | 13 | 4 | 4 | 1 | 11 | 0.364 | 5 | 45.455 |
| EVI1 03 | Ectopic viral integration site 1 encoded factor | 27 | 4 | 4 | 2 | 11 | 0.364 | 6 | 54.545 |
| EVI1 04 | Ectopic viral integration site 1 encoded factor | 8 | 0 | 4 | 0 | 15 | 0.267 | 0 | 0.000 |
| EVI1 05 | Ectopic viral integration site 1 encoded factor | 19 | 0 | 4 | 0 | 11 | 0.364 | 0 | 0.000 |
| EVI1 06 | Ectopic viral integration site 1 encoded factor | 16 | 4 | 4 | 2 | 9 | 0.444 | 6 | 66.667 |
| FREAC2 01 | Fork head related activator-2 | 27 | 4 | 4 | 1 | 16 | 0.250 | 5 | 31.250 |
| FREAC3 01 | Fork head related activator-3 | 16 | 4 | 4 | 2 | 16 | 0.250 | 6 | 37.500 |
| FREAC4 01 | Fork head related activator-4 | 20 | 4 | 4 | 1 | 16 | 0.250 | 5 | 31.250 |
| FREAC7 01 | Fork head related activator-7 | 23 | 2 | 4 | 0 | 16 | 0.250 | 2 | 12.500 |
| GABP B | GABP: GA binding protein | 12 | 4 | 4 | 1 | 12 | 0.333 | 5 | 41.667 |
| GATA C | GATA binding site | – | 3 | 5 | 0 | 11 | 0.455 | 3 | 27.273 |
| GATA1 01 | GATA-binding factor 1 | 53 | 1 | 4 | 0 | 10 | 0.400 | 1 | 10.000 |
| GATA1 02 | GATA-binding factor 1 | 20 | 4 | 4 | 0 | 14 | 0.286 | 4 | 28.571 |
| GATA1 03 | GATA-binding factor 1 | 12 | 4 | 4 | 0 | 14 | 0.286 | 4 | 28.571 |
| GATA1 04 | GATA-binding factor 1 | 48 | 4 | 4 | 0 | 13 | 0.308 | 4 | 30.769 |
| GATA1 05 | GATA-binding factor 1 | 10 | 4 | 4 | 1 | 10 | 0.400 | 5 | 50.000 |
| GATA1 06 | GATA-binding factor 1 | 15 | 3 | 4 | 1 | 10 | 0.400 | 4 | 40.000 |
| GATA2 01 | GATA-binding factor 2 | 53 | 0 | 4 | 0 | 10 | 0.400 | 0 | 0.000 |
| GATA2 02 | GATA-binding factor 2 | 31 | 3 | 4 | 0 | 10 | 0.400 | 3 | 30.000 |
| GATA2 03 | GATA-binding factor 2 | 18 | 3 | 4 | 1 | 10 | 0.400 | 4 | 40.000 |
| GATA3 01 | GATA-binding factor 3 | 63 | 1 | 4 | 0 | 9 | 0.444 | 1 | 11.111 |
| GATA3 02 | GATA-binding factor 3 | 41 | 3 | 4 | 0 | 10 | 0.400 | 3 | 30.000 |
| GATA3 03 | GATA-binding factor 3 | 26 | 3 | 4 | 0 | 10 | 0.400 | 3 | 30.000 |
| GC 01 | GC box elements | 274 | 1 | 4 | 0 | 14 | 0.286 | 1 | 7.143 |
| GEN INI B | General initiator seq. (viral + cellular) | 23 | 0 | 4 | 0 | 8 | 0.500 | 0 | 0.000 |
| GEN INI2 B | General initiator seq. (viral + cellular) | 21 | 0 | 4 | 0 | 8 | 0.500 | 0 | 0.000 |
| GEN INI3 B | General initiator seq. (viral + cellular) | 20 | 1 | 4 | 0 | 8 | 0.500 | 1 | 12.500 |
| GFI1 01 | Growth factor independence 1 | 54 | 4 | 4 | 0 | 24 | 0.167 | 4 | 16.667 |
| GKLF 01 | Gut-enriched Krueppel-like factor | 49 | 2 | 4 | 0 | 14 | 0.286 | 2 | 14.286 |
| GR Q6 | Glucocorticoid receptor | 38 | 2 | 4 | 0 | 19 | 0.211 | 2 | 10.526 |
| GRE C | Glucocorticoid response element | – | 2 | 4 | 2 | 16 | 0.250 | 4 | 25.000 |
| HEN1 01 | E-box binding factor without transcript. activation | 51 | 0 | 4 | 0 | 22 | 0.182 | 0 | 0.000 |
| HEN1 02 | E-box binding factor without transcript. activation | 54 | 0 | 4 | 0 | 22 | 0.182 | 0 | 0.000 |
| HFH1 01 | HNF-3/Fkh Homolog 1 | 14 | 4 | 4 | 0 | 12 | 0.333 | 4 | 33.333 |
| HFH2 01 | HNF-3/Fkh Homolog 2 | 32 | 1 | 4 | 0 | 12 | 0.333 | 1 | 8.333 |
| HFH3 01 | HNF-3/Fkh Homolog 3 (=Freac-6) | 31 | 3 | 4 | 1 | 13 | 0.308 | 4 | 30.769 |
| HFH8 01 | HNF-3/Fkh Homolog-8 | 48 | 3 | 4 | 1 | 13 | 0.308 | 4 | 30.769 |
| HLF 01 | Hepatic leukemia factor | 18 | 1 | 4 | 0 | 10 | 0.400 | 1 | 10.000 |
| HMEF2 Q6 | Myocyte enhancer factor | 11 | 3 | 4 | 4 | 16 | 0.250 | 7 | 43.750 |
| HNF1 01 | Hepatic nuclear factor 1 | 26 | 0 | 4 | 0 | 15 | 0.267 | 0 | 0.000 |
| HNF1 C | Hepatic nuclear factor 1 | – | 1 | 4 | 1 | 17 | 0.235 | 2 | 11.765 |
| HNF3B 01 | Hepatocyte nuclear factor 3beta | 24 | 2 | 4 | 0 | 15 | 0.267 | 2 | 13.333 |
| HNF4 01 | Hepatic nuclear factor 4 | 32 | 0 | 4 | 0 | 19 | 0.211 | 0 | 0.000 |
| HNF4 01 B | Hepatic nuclear factor 4 | 19 | 4 | 4 | 1 | 14 | 0.286 | 5 | 35.714 |
| HNF4 02 B | Hepatic nuclear factor 4 | 4 | 4 | 4 | 5 | 15 | 0.267 | 9 | 60.000 |
| HOGNESS B | Imperfect Hogness/Goldberg box | 22 | 2 | 4 | 2 | 32 | 0.125 | 4 | 12.500 |
| HOX13 01 | Hox-1.3 | 10 | 4 | 4 | 0 | 30 | 0.133 | 4 | 13.333 |
| HOXA3 01 | HOXA3 (homeobox cluster protein) | 14 | 0 | 4 | 0 | 9 | 0.444 | 0 | 0.000 |
| HSF1 01 | Heat shock factor 1 | 45 | 1 | 4 | 1 | 10 | 0.400 | 2 | 20.000 |
| HSF2 01 | Heat shock factor 2 | 27 | 1 | 4 | 1 | 10 | 0.400 | 2 | 20.000 |
| IK1 01 | Ikaros 1 | 24 | 3 | 4 | 0 | 13 | 0.308 | 3 | 23.077 |
| IK2 01 | Ikaros 2 | 36 | 3 | 4 | 0 | 12 | 0.333 | 3 | 25.000 |
| IK3 01 | Ikaros 3 | 25 | 4 | 4 | 0 | 13 | 0.308 | 4 | 30.769 |
| INI B | General initiator | 60 | 0 | 4 | 0 | 20 | 0.200 | 0 | 0.000 |
| IRF1 01 | Interferon regulatory factor 1 | 21 | 2 | 4 | 2 | 13 | 0.308 | 4 | 30.769 |
| IRF2 01 | Interferon regulatory factor 2 | 15 | 4 | 4 | 2 | 13 | 0.308 | 6 | 46.154 |
| ISRE 01 | Interferon-stimulated response element | 13 | 4 | 4 | 2 | 15 | 0.267 | 6 | 40.000 |
| LDSPOLYA B | Lentiviral Poly A downstream element | 16 | 2 | 4 | 0 | 16 | 0.250 | 2 | 12.500 |
| LMO2COM 01 | Complex of Lmo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 1 | 31 | 2 | 4 | 2 | 12 | 0.333 | 4 | 33.333 |
| LMO2COM 02 | Complex of Lmo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 2 | 31 | 4 | 4 | 0 | 9 | 0.444 | 4 | 44.444 |
| LPOLYA B | Lentiviral Poly A signal | 16 | 6 | 6 | 0 | 8 | 0.750 | 6 | 75.000 |
| LYF1 01 | LyF-1 | 11 | 0 | 4 | 0 | 9 | 0.444 | 0 | 0.000 |
| MAX 01 | EBOX (E-BOX binding factors) | 37 | 4 | 4 | 2 | 14 | 0.286 | 6 | 42.857 |
| MEF2 01 | Myogenic enhancer factor 2 | 5 | 4 | 4 | 6 | 16 | 0.250 | 10 | 62.500 |
| MEF2 02 | Myogenic MADS factor MEF-2 | 104 | 2 | 4 | 0 | 22 | 0.182 | 2 | 9.091 |
| MEF2 03 | Myogenic MADS factor MEF-2 | 90 | 1 | 4 | 1 | 22 | 0.182 | 2 | 9.091 |
| MEF2 04 | Myogenic MADS factor MEF-2 | 47 | 3 | 4 | 1 | 22 | 0.182 | 4 | 18.182 |
| MEF2 Q6 | MEF2-myocyte-specific enhancer-binding factor | 9 | 2 | 4 | 0 | 10 | 0.400 | 2 | 20.000 |
| MEF3 B | (MEF3 BINDING SITES) | 5 | 4 | 4 | 2 | 13 | 0.308 | 6 | 46.154 |
| MIF1 01 | MIBP-1/RFX1 complex | 6 | 2 | 4 | 3 | 18 | 0.222 | 5 | 27.778 |
| MINI19 B | Muscle initiator sequence-19 | 8 | 2 | 4 | 0 | 21 | 0.190 | 2 | 9.524 |
| MINI20 B | Muscle initiator sequence-20 | 7 | 2 | 4 | 0 | 21 | 0.190 | 2 | 9.524 |
| MMEF2 Q6 | Myocyte enhancer factor | 10 | 3 | 4 | 0 | 16 | 0.250 | 3 | 18.750 |
| MSX1 01 | msh-like (muscle segment homeobox) homeobox protein 1 | 13 | 0 | 4 | 0 | 9 | 0.444 | 0 | 0.000 |
| MTATA B | Muscle TATA box | 12 | 2 | 4 | 0 | 17 | 0.235 | 2 | 11.765 |
| MTBF B | Muscle-specific Mt binding site | 7 | 3 | 4 | 1 | 9 | 0.444 | 4 | 44.444 |
| MTF-1 B | Metal transcription factor 1, MRE | 22 | 3 | 4 | 0 | 15 | 0.267 | 3 | 20.000 |
| MUSCLE INI B | Muscle initiator | 16 | 2 | 4 | 0 | 21 | 0.190 | 2 | 9.524 |
| MYB Q6 | c-Myb | 18 | 2 | 4 | 0 | 10 | 0.400 | 2 | 20.000 |
| MYCMAX 01 | c-Myc/Max heterodimer | 34 | 4 | 4 | 2 | 14 | 0.286 | 6 | 42.857 |
| MYCMAX 02 | c-Myc/Max heterodimer | 29 | 2 | 4 | 0 | 12 | 0.333 | 2 | 16.667 |
| MYCMAX B | MYC-MAX binding sites | 18 | 1 | 4 | 1 | 10 | 0.400 | 2 | 20.000 |
| MYOD 01 | Myoblast determination gene product | 5 | 2 | 4 | 2 | 12 | 0.333 | 4 | 33.333 |
| MYOD Q6 | Myoblast determining factor | 14 | 2 | 4 | 2 | 10 | 0.400 | 4 | 40.000 |
| MYOGNF1 01 | Myogenin/nuclear factor 1 or related factors | 8 | 1 | 4 | 0 | 29 | 0.138 | 1 | 3.448 |
| MYT1 01 B | MyT1 zinc finger transcription factor involved in primary neurogenesis | 26 | 4 | 4 | 5 | 12 | 0.333 | 9 | 75.000 |
| MYT1 02 B | MyT1 zinc finger transcription factor involved in primary neurogenesis | 30 | 4 | 4 | 1 | 11 | 0.364 | 5 | 45.455 |
| MZF1 01 | MZF1 (Myeloid Zinc Finger 1 factors) | 20 | 2 | 4 | 0 | 8 | 0.500 | 2 | 25.000 |
| MZF1 02 | MZF1 (Myeloid Zinc Finger 1 factors) | 16 | 0 | 4 | 0 | 13 | 0.308 | 0 | 0.000 |
| NF1 Q6 | Nuclear factor 1 | 75 | 1 | 4 | 0 | 18 | 0.222 | 1 | 5.556 |
| NFAT Q6 | Nuclear factor of activated T-cells | 26 | 2 | 4 | 0 | 12 | 0.333 | 2 | 16.667 |
| NFE2 01 | NF-E2 p45 | 9 | 3 | 4 | 0 | 11 | 0.364 | 3 | 27.273 |
| NFKAPPAB 01 | NF-kappaB | 40 | 2 | 4 | 0 | 10 | 0.400 | 2 | 20.000 |
| NFKAPPB50 01 | NF-kappaB (p50) | 18 | 3 | 4 | 1 | 10 | 0.400 | 4 | 40.000 |
| NFKAPPAB65 01 | NF-kappaB (p65) | 18 | 3 | 4 | 2 | 10 | 0.400 | 5 | 50.000 |
| NFKB C | NF-kappa-B binding site | – | 4 | 5 | 0 | 12 | 0.417 | 4 | 33.333 |
| NFKB Q6 | NF-kappaB | 13 | 2 | 4 | 0 | 14 | 0.286 | 2 | 14.286 |
| NFY 01 | Nuclear factor Y (Y-box binding factor) | 164 | 2 | 4 | 0 | 16 | 0.250 | 2 | 12.500 |
| NFY C | NF-Y binding site | – | 4 | 5 | 3 | 14 | 0.357 | 7 | 50.000 |
| NFY Q6 | Nuclear factor Y (Y-box binding factor) | 20 | 3 | 4 | 1 | 11 | 0.364 | 4 | 36.364 |
| NGFIC 01 | Nerve growth factor-induced protein C | 100 | 1 | 4 | 3 | 12 | 0.333 | 4 | 33.333 |
| NKX25 01 | Homeo domain factor Nkx-2.5/Csx, tinman homolog | 6 | 4 | 4 | 0 | 7 | 0.571 | 4 | 57.143 |
| NKX25 02 | Homeo domain factor Nkx-2.5/Csx, tinman homolog | 5 | 4 | 4 | 0 | 8 | 0.500 | 4 | 50.000 |
| NMYC 01 | N-Myc | 40 | 1 | 4 | 0 | 12 | 0.333 | 1 | 8.333 |
| NRF2 01 | Nuclear respiratory factor 2 | 7 | 4 | 4 | 1 | 10 | 0.400 | 5 | 50.000 |
| NRSE B | Neural-restrictive-silencer-element | 6 | 3 | 4 | 3 | 21 | 0.190 | 6 | 28.571 |
| NRSF 01 | Neuron-restrictive silencer factor | 28 | 3 | 4 | 5 | 21 | 0.190 | 8 | 38.095 |
| OCT C | Octamer binding site | – | 3 | 5 | 1 | 13 | 0.385 | 4 | 30.769 |
| OCT1 01 | Octamer factor 1 | 56 | 3 | 4 | 1 | 19 | 0.211 | 4 | 21.053 |
| OCT1 02 | Octamer factor 1 | 44 | 1 | 4 | 0 | 15 | 0.267 | 1 | 6.667 |
| OCT1 03 | Octamer factor 1 | 51 | 0 | 4 | 0 | 13 | 0.308 | 0 | 0.000 |
| OCT1 04 | Octamer factor 1 | 47 | 0 | 4 | 0 | 23 | 0.174 | 0 | 0.000 |
| OCT1 05 | Octamer factor 1 | 7 | 3 | 4 | 2 | 14 | 0.286 | 5 | 35.714 |
| OCT1 06 | Octamer factor 1 | 6 | 2 | 4 | 1 | 14 | 0.286 | 3 | 21.429 |
| OCT1 07 | Octamer factor 1 | 18 | 4 | 4 | 2 | 12 | 0.333 | 6 | 50.000 |
| OCT1 B | Octamer factor 1 | 56 | 3 | 4 | 1 | 10 | 0.400 | 4 | 40.000 |
| OCT1 Q6 | Octamer factor 1 | 30 | 0 | 4 | 0 | 15 | 0.267 | 0 | 0.000 |
| OLF1 01 | Olfactory neuron-specific factor | 8 | 3 | 4 | 1 | 22 | 0.182 | 4 | 18.182 |
| P300 01 | p300 | 16 | 0 | 4 | 0 | 14 | 0.286 | 0 | 0.000 |
| P53 01 | Tumor suppressor p53 | 17 | 4 | 4 | 8 | 20 | 0.200 | 12 | 60.000 |
| P53 02 | Tumor suppressor p53 | 20 | 1 | 4 | 0 | 10 | 0.400 | 1 | 10.000 |
| PADS C | Retroviral Poly A downstream element | – | 4 | 5 | 0 | 9 | 0.556 | 4 | 44.444 |
| PAX1 B | Pax1 binding sites | 5 | 4 | 4 | 12 | 18 | 0.222 | 16 | 88.889 |
| PAX2 01 | PAX2 (PAX-2/PAX-8 binding sites) | 32 | 0 | 4 | 0 | 19 | 0.211 | 0 | 0.000 |
| PAX3 | Pax-3 binding sites | 26 | 3 | 4 | 0 | 13 | 0.308 | 3 | 23.077 |
| PAX3 B | Pax3 binding sites | 51 | 0 | 4 | 0 | 21 | 0.190 | 0 | 0.000 |
| PAX4 01 | Pax-4 binding sites | 43 | 0 | 4 | 0 | 21 | 0.190 | 0 | 0.000 |
| PAX4 02 | Pax-4 binding sites | 20 | 1 | 4 | 0 | 11 | 0.364 | 1 | 9.091 |
| PAX4 03 | Pax-4 binding sites | 17 | 0 | 4 | 0 | 12 | 0.333 | 0 | 0.000 |
| PAX4 04 | Pax-4 binding sites | 24 | 0 | 4 | 0 | 30 | 0.133 | 0 | 0.000 |
| PAX5 01 | B-cell-specific activating protein | 7 | 2 | 4 | 1 | 28 | 0.143 | 3 | 10.714 |
| PAX5 02 | B-cell-specific activating protein | 5 | 3 | 4 | 1 | 28 | 0.143 | 4 | 14.286 |
| PAX6 01 | Pax-6 | 47 | 0 | 4 | 0 | 21 | 0.190 | 0 | 0.000 |
| PAX8 B | PAX8 binding sites | 24 | 0 | 4 | 0 | 18 | 0.222 | 0 | 0.000 |
| PAX9 B | Zebrafish PAX9 binding sites | 5 | 1 | 4 | 2 | 24 | 0.167 | 3 | 12.500 |
| PBX1 01 | Pbx-1 | 16 | 0 | 4 | 0 | 9 | 0.444 | 0 | 0.000 |
| PBX1 02 | Homeo domain factor Pbx-1 | 40 | 3 | 4 | 1 | 15 | 0.267 | 4 | 26.667 |
| PDX1 B | Pdx1 (IDX1/IPF1) pancreatic and intestinal homeodomain TF | 6 | 3 | 4 | 2 | 19 | 0.211 | 5 | 26.316 |
| PIT1 B | Pit1, GHF-1 pituitary specific pou domain transcription factor | 11 | 2 | 4 | 1 | 10 | 0.400 | 3 | 30.000 |
| POLY C | Retroviral Poly A signal | – | 5 | 6 | 0 | 18 | 0.333 | 5 | 27.778 |
| PPARA 01 | PPAR/RXR heterodimers | 7 | 3 | 4 | 3 | 20 | 0.200 | 6 | 30.000 |
| PU1 B | Pu.1 (Pu120) Ets-like transcription factor identified in lymphoid B-cells | 10 | 4 | 4 | 0 | 16 | 0.250 | 4 | 25.000 |
| R 01 | Epstein-Barr virus transcription factor R | 35 | 1 | 4 | 1 | 21 | 0.190 | 2 | 9.524 |
| RAR B | Retinoic acid receptor, member of nuclear receptors | 14 | 2 | 4 | 1 | 10 | 0.400 | 3 | 30.000 |
| RFX1 01 | X-box binding protein RFX1 | 32 | 2 | 4 | 1 | 17 | 0.235 | 3 | 17.647 |
| RFX1 02 | X-box binding protein RFX1 | 32 | 2 | 4 | 1 | 18 | 0.222 | 3 | 16.667 |
| RORA1 01 | RAR-related orphan receptor alpha1 | 25 | 4 | 4 | 1 | 13 | 0.308 | 5 | 38.462 |
| RORA2 01 | RAR-related orphan receptor alpha2 | 36 | 4 | 4 | 2 | 13 | 0.308 | 6 | 46.154 |
| RREB1 01 | Ras-responsive element binding protein 1 | 11 | 3 | 4 | 0 | 14 | 0.286 | 3 | 21.429 |
| RSRFC4 01 | Related to serum response factor, C4 | 38 | 2 | 4 | 2 | 16 | 0.250 | 4 | 25.000 |
| RSRFC4 Q2 | RSRFC4 Q2 | 28 | 3 | 4 | 2 | 17 | 0.235 | 5 | 29.412 |
| S8 01 | S8 | 59 | 4 | 4 | 0 | 16 | 0.250 | 4 | 25.000 |
| SEF1 C | SEF1 binding site | – | 3 | 5 | 3 | 19 | 0.263 | 6 | 31.579 |
| SF1 B | SF1 steroidogenic factor 1 | 9 | 3 | 4 | 1 | 9 | 0.444 | 4 | 44.444 |
| SMAD3 B | Smad3 transcription factor involved in TGF-beta signaling | 12 | 4 | 4 | 0 | 8 | 0.500 | 4 | 50.000 |
| SMAD4 B | Smad4 transcription factor involved in TGF-beta signaling | 13 | 3 | 4 | 0 | 8 | 0.500 | 3 | 37.500 |
| SOX5 01 | Sox-5 | 23 | 2 | 4 | 0 | 10 | 0.400 | 2 | 20.000 |
| SOX9 B1 | SOX (SRY-related HMG box) | 73 | 3 | 4 | 0 | 14 | 0.286 | 3 | 21.429 |
| SP1 01 | Stimulating protein 1 | 11 | 2 | 4 | 0 | 10 | 0.400 | 2 | 20.000 |
| SP1 Q6 | Stimulating protein 1 | 108 | 1 | 4 | 0 | 13 | 0.308 | 1 | 7.692 |
| SRE B | Sterol regulatory element | 7 | 4 | 4 | 1 | 20 | 0.200 | 5 | 25.000 |
| SREBP1 01 | Sterol regulatory element binding protein 1 | 30 | 2 | 4 | 2 | 11 | 0.364 | 4 | 36.364 |
| SREBP1 02 | Sterol regulatory element binding protein 1 | 7 | 4 | 4 | 5 | 11 | 0.364 | 9 | 81.818 |
| SRF 01 | Serum response factor | 33 | 2 | 4 | 4 | 18 | 0.222 | 6 | 33.333 |
| SRF C | Serum responsive factor | – | 2 | 5 | 2 | 15 | 0.333 | 4 | 26.667 |
| SRF Q6 | Serum response factor | 21 | 2 | 4 | 1 | 14 | 0.286 | 3 | 21.429 |
| SRY 01 | Sex-determining region Y gene product | 23 | 0 | 4 | 0 | 7 | 0.571 | 0 | 0.000 |
| SRY 02 | Sex-determining region Y gene product | 29 | 1 | 4 | 0 | 12 | 0.333 | 1 | 8.333 |
| STAF 01 | Se-Cys tRNA gene transcription activating factor | 73 | 2 | 4 | 0 | 22 | 0.182 | 2 | 9.091 |
| STAF 02 | Se-Cys tRNA gene transcription activating factor | 10 | 2 | 4 | 1 | 21 | 0.190 | 3 | 14.286 |
| STAT 01 | Signal transducers and activators of transcription | 14 | 2 | 4 | 1 | 9 | 0.444 | 3 | 33.333 |
| STAT1 01 | Signal transducer and activator of transcription 1 | 55 | 3 | 4 | 3 | 21 | 0.190 | 6 | 28.571 |
| STAT3 01 | Signal transducer and activator of transcription 3 | 55 | 4 | 4 | 3 | 21 | 0.190 | 7 | 33.333 |
| T3R 01 | Viral homolog of thyroid hormone receptor alpha1 | 17 | 4 | 4 | 1 | 16 | 0.250 | 5 | 31.250 |
| TAACC B | Lentiviral TATA upstream el. | 17 | 2 | 6 | 1 | 23 | 0.261 | 3 | 13.043 |
| TAL1ALPHAE47 01 | Tal-1alpha/E47 heterodimer | 35 | 2 | 4 | 2 | 16 | 0.250 | 4 | 25.000 |
| TAL1BETAE47 01 | Tal-1alpha/E47 heterodimer | 44 | 3 | 4 | 2 | 16 | 0.250 | 5 | 31.250 |
| TAL1BETAITF2 01 | Tal-1beta/ITF-2 heterodimer | 59 | 2 | 4 | 2 | 16 | 0.250 | 4 | 25.000 |
| TANTIGEN B | Major T-antigen binding site | 21 | 0 | 4 | 0 | 19 | 0.211 | 0 | 0.000 |
| TATA 01 | Cellular and viral TATA box elements | 389 | 0 | 4 | 0 | 15 | 0.267 | 0 | 0.000 |
| TATA B | General TATA box | 60 | 0 | 4 | 0 | 22 | 0.182 | 0 | 0.000 |
| TATA C | Retroviral TATA box | – | 4 | 7 | 0 | 10 | 0.700 | 4 | 40.000 |
| TAXCREB 01 | Tax/CREB complex | 17 | 3 | 4 | 1 | 15 | 0.267 | 4 | 26.667 |
| TAXCREB 02 | Tax/CREB complex | 5 | 4 | 4 | 5 | 15 | 0.267 | 9 | 60.000 |
| TCF11 01 | TCF11/KCR-F1/Nrf1 homodimers | 31 | 3 | 4 | 0 | 13 | 0.308 | 3 | 23.077 |
| TCF11MAFG 01 | TCF11/MafG heterodimers | 14 | 3 | 4 | 0 | 22 | 0.182 | 3 | 13.636 |
| TH1E47 01 | Thing1/E47 heterodimer | 29 | 2 | 4 | 0 | 16 | 0.250 | 2 | 12.500 |
| TST1 01 | POU-factor Tst-1/Oct-6 | 6 | 1 | 4 | 0 | 15 | 0.267 | 1 | 6.667 |
| USF 01 | Upstream stimulating factor | 58 | 1 | 4 | 1 | 14 | 0.286 | 2 | 14.286 |
| USF 02 | Upstream stimulating factor | 81 | 0 | 4 | 0 | 14 | 0.286 | 0 | 0.000 |
| USF C | USF binding site | – | 2 | 5 | 1 | 8 | 0.625 | 3 | 37.500 |
| USF Q6 | Upstream stimulating factor | 12 | 3 | 4 | 0 | 10 | 0.400 | 3 | 30.000 |
| VBP 01 | PAR-type chicken vitellogenin promoter-binding protein | 18 | 2 | 4 | 0 | 10 | 0.400 | 2 | 20.000 |
| VDR RXR B | VDR/RXR heterodimer site | 8 | 1 | 4 | 1 | 15 | 0.267 | 2 | 13.333 |
| VDR RXR2 B | VDR/RXR heterodimer site | 5 | 2 | 4 | 3 | 15 | 0.267 | 5 | 33.333 |
| VJUN 01 | v-Jun | 24 | 4 | 4 | 4 | 16 | 0.250 | 8 | 50.000 |
| VMAF 01 | v-Maf | 39 | 0 | 4 | 0 | 19 | 0.211 | 0 | 0.000 |
| VMYB 01 | v-Myb | 21 | 3 | 4 | 0 | 10 | 0.400 | 3 | 30.000 |
| VMYB 02 | v-Myb | 29 | 3 | 4 | 1 | 9 | 0.444 | 4 | 44.444 |
| WHM B | Winged helix | 100 | 4 | 4 | 0 | 11 | 0.364 | 4 | 36.364 |
| WT1 B | Wilms Tumor Suppressor | 34 | 1 | 4 | 1 | 13 | 0.308 | 2 | 15.385 |
| XBP1 01 | X-box-binding protein 1 | 33 | 4 | 4 | 0 | 17 | 0.235 | 4 | 23.529 |
| XFD1 01 | Xenopus fork head domain factor 1 | 6 | 4 | 4 | 1 | 14 | 0.286 | 5 | 35.714 |
| XFD2 01 | Xenopus fork head domain factor 2 | 6 | 4 | 4 | 1 | 14 | 0.286 | 5 | 35.714 |
| XFD3 01 | Xenopus fork head domain factor 3 | 6 | 3 | 4 | 2 | 14 | 0.286 | 5 | 35.714 |
| YY1 01 | Yin and Yang 1 | 18 | 1 | 4 | 0 | 17 | 0.235 | 1 | 5.882 |
| YY1 02 | Yin and Yang 1 | 11 | 4 | 4 | 0 | 20 | 0.200 | 4 | 20.000 |
| ZF5 B | ZF5 binding sites | 17 | 0 | 4 | 0 | 13 | 0.308 | 0 | 0.000 |
| ZID 01 | Zinc finger with interaction domain | 35 | 2 | 4 | 0 | 13 | 0.308 | 2 | 15.385 |
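The last three columns of Appendix A are derived from the count columns. The relationships used below are inferred from the column definitions and from the tabulated values (for example, the ACAAT B and AHR 01 rows); they are not stated as formulas in the text. The following is therefore a minimal sketch under that assumption. The names `MotifCounts` and `derived_columns` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MotifCounts:
    """Count columns for one Appendix A row (field names are illustrative)."""
    name: str
    core_100: int     # positions at 100% Ci inside the core
    core_len: int     # number of core positions
    outside_100: int  # positions at 100% Ci outside the core
    motif_len: int    # total positions in the motif


def derived_columns(m: MotifCounts) -> Tuple[float, int, float]:
    """Return (core/motif length ratio, total positions at 100% Ci,
    percentage of motif positions at 100% Ci), rounded to three decimals
    as in the table."""
    total_100 = m.core_100 + m.outside_100
    core_ratio = round(m.core_len / m.motif_len, 3)
    pct_100 = round(100.0 * total_100 / m.motif_len, 3)
    return core_ratio, total_100, pct_100


# Count values copied from the ACAAT B and AHR 01 rows of Appendix A.
acaat = MotifCounts("ACAAT B", core_100=4, core_len=5, outside_100=2, motif_len=9)
ahr = MotifCounts("AHR 01", core_100=4, core_len=4, outside_100=2, motif_len=18)
print(derived_columns(acaat))  # (0.556, 6, 66.667)
print(derived_columns(ahr))    # (0.222, 6, 33.333)
```

A check of this kind is useful when re-keying the table, since a transposed count shows up as a mismatch in at least one derived column.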

References

Bajic, V.B., Choudhary, V., Hock, C.K., 2003. Content analysis of the core promoter region of human genes. In Silico Biol. 4, 0011.
Birnbaum, K., Benfey, P.N., Shasha, D.E., 2001. cis element/transcription factor analysis (cis/TF): a method for discovering transcription factor/cis element relationships. Genome Res. 11, 1567–1573.
Bulyk, M.L., 2003. Computational prediction of transcription-factor binding site locations. Genome Biol. 5, 201.
Cavener, D.R., 1987. Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res. 15, 1353–1361.
Feingold, E.A., Good, P.J., Guyer, M.S., Kamholz, S., Liefer, L., Wetterstrand, K., Collins, F.S., Gingeras, T.R., Kampa, D., Sekinger, E.A., et al., 2004. The ENCODE (ENCyclopedia of DNA Elements) project. Science 306, 636–640.
Fogel, G.B., Weekes, D.G., Varga, G., Dow, E.R., Harlow, H.B., Onyia, J.E., Su, C., 2004. Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucleic Acids Res. 32, 3826–3835.
Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A.E., Kel, O.V., Ignatieva, E.V., Ananko, E.A., Podkolodnaya, O.A., Kolpakov, F.A., Podkolodny, N.L., Kolchanov, N.A., 1998. Databases on transcriptional regulation: TRANSFAC, TRRD, and COMPEL. Nucleic Acids Res. 26, 362–367.
Kel, A.E., Gossling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O.V., Wingender, E., 2003. MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31, 3576–3579.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., Lander, E.S., 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241–254.
Kel-Margoulis, O.V., Romashchenko, A.G., Kolchanov, N.A., Wingender, E., Kel, A.E., 2000. COMPEL: a database on composite regulatory elements providing combinatorial transcriptional regulation. Nucleic Acids Res. 28, 311–315.
Kel-Margoulis, O.V., Kel, A.E., Reuter, I., Deineko, I.V., Wingender, E., 2002. TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res. 30, 332–334.
Marino-Ramirez, L., Spouge, J.L., Kanga, G.C., Landsman, D., 2004. Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res. 32, 949–958.
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., et al., 2003. TRANSFAC®: transcriptional regulation, from pattern to profiles. Nucleic Acids Res. 31, 374–378.
Moses, A.M., Chiang, D.Y., Kellis, M., Lander, E.S., Eisen, M.B., 2003. Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol. Biol. 3, 19.
Mrowka, R., Steinhage, K., Patzak, A., Persson, P.B., 2003. An evolutionary approach for identifying potential transcription factor binding sites: the renin gene as an example. Am. J. Physiol. Regul. Integr. Comp. Physiol. 284, 1147–1150.
Qiu, P., Ding, W., Jiang, Y., Greene, J.R., Wang, L., 2002. Computational analysis of composite regulatory elements. Mamm. Genome 13, 327–332.
Quandt, K., Frech, K., Karas, H., Wingender, E., Werner, T., 1995. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23, 4878–4884.
Sinha, S., Tompa, M., 2003. YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 31, 3586–3588.
Werner, T., 2003. The state of the art of mammalian promoter recognition. Brief. Bioinform. 4, 22–30.
Wingender, E., Dietze, P., Karas, H., Knuppel, R., 1996. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24, 238–241.
Wingender, E., Kel, A.E., Kel, O.V., Karas, H., Heinemeyer, T., Dietze, P., Knuppel, R., Romaschenko, A.G., Kolchanov, N.A., 1997. TRANSFAC, TRRD and COMPEL: towards a federated database system on transcriptional regulation. Nucleic Acids Res. 25, 265–268.
Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., Schacherer, F., 2000. TRANSFAC®: an integrated system for gene expression regulation. Nucleic Acids Res. 28, 316–319.
Wingender, E., Chen, X., Fricke, E., Geffers, R., Hehl, R., Liebich, I., Krull, M., Matys, V., Michael, H., Ohnhauser, R., et al., 2001. The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 29, 281–283.
Wootton, J.C., Federhen, S., 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266, 554–571.