Genomics 84 (2004) 524 – 535 www.elsevier.com/locate/ygeno
First comprehensive mapping of cartilage transcripts to the human genome T.D. Yagera,1, A.A. Dempseya,1, H. Tanga, D. Stamatioua, Samuel Chaoa, K.W. Marshalla, C.C. Liewa,b,* b
a ChondroGene, Inc., 800 Petrolia Road, No. 15, Toronto, Ontario, Canada M3J 3K4 Harvard Medical School, 77 Louis Pasteur Avenue, NRB Room 0630K, Boston, MA 02115, USAce:ftnauthor id="fn1">1These authors contributed equally to this article.
Received 21 January 2004; accepted 17 May 2004
Abstract We present the first comprehensive transcriptome-to-genome mapping for human cartilage. First, we determined that the cartilage transcriptome represents between 13,200 and 15,800 unique genes. Next, a subset of ~10,000 of the best characterized cartilage-expressed transcripts (CETs) was selected and mapped to the human genome. The distribution of CETs across the genome was found to be significantly different compared to the expected distribution. Furthermore, clusters of adjacent coordinately transcribed genes, as well as numerous bhot spotsQ and bcold spotsQ for transcription in cartilage, were identified. We propose that transcriptional control in cartilage can be exerted over genomic domains containing as few as four neighboring genes. Our findings, which are consistent with recent bchromatin domainQ models of transcription, are further supported by our identification of CETs that putatively encode components of the HDAC- and Swi/SNF-mediated chromatin remodeling pathways. Our study illustrates the value of comprehensive high-resolution scans to detect transcription patterns within the human genome. D 2004 Elsevier Inc. All rights reserved. Keywords: Osteoarthritis; Expressed sequence tags; Cartilage; Microarray; Genomics; Transcriptome; Gene clusters; Chromatin domain
Although a well-defined type of gene organization and transcriptional control exists in the prokaryotic genome in the form of operons, the general lack of operons in eukaryotes has led to the idea that genes (particularly active genes) are randomly placed throughout the eukaryotic genome. However, a growing body of evidence supports the existence of a higher order(s) of eukaryotic gene organization and transcriptional control. For example, the organization of genes into operons, once thought to be exclusive to prokaryotes, has also recently been observed in eukaryotes. Most current evidence for eukaryotic operons comes from Caenorhabditis elegans [1–4], and operons Abbreviations: CET, cartilage-expressed transcript; EST, expressed sequence tag; OA, osteoarthritis. * Corresponding author (at Harvard Medical School). E-mail address:
[email protected] (C.C. Liew). 1 These authors contributed equally to this article. 0888-7543/$ - see front matter D 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2004.05.006
have also been observed in tomato, Drosophila, mouse, rat, and human [5,6]. However, with the exception of C. elegans, eukaryotic operons are thought unlikely to play a major role in genomic organization or transcriptional control. Evidence for other (non-operon-based) levels of organization in the eukaryotic genome came initially from an observed correlation between GC content and gene density in humans and other vertebrates [7,8]. More recently, Cohen et al. [9] observed that yeast genes of similar function tend to lie adjacent in the genome. Similarly, using oligonucleotide microarrays, Spellman and Rubin [10] identified ~200 groups of adjacent genes in the Drosophila genome, with each group containing 10 to 30 genes that shared similar expression profiles. However, genes within individual groups did not seem to be related by function. In the human genome, it has been found that certain groups of highly expressed genes [11,12] and widely expressed (housekeeping) genes [13] are clustered together.
T.D. Yager et al. / Genomics 84 (2004) 524–535
525
adjacent genes corresponding to CETs and also a number of genomic transcriptional bhot spotsQ and bcold spotsQ in cartilage. Our observations are consistent with recent bchromatin domainQ models of tissue-specific gene activation and repression [24–27].
Genomic proximity or location may also influence the tissue specificity of expression of sets of genes. For example, overrepresentation of expressed genes on individual chromosomes has been reported for specific tissues in the mouse [14] and human [15,16]. More detailed analyses have identified specific regions within individual chromosomes that are over- or underrepresented in the transcriptomes of the human cardiovascular system [17,18] and other tissues [19–21]. For example, Roy et al. [22] investigated the distribution of 1364 muscle-expressed genes in the C. elegans genome and observed significant numbers of clusters containing 2 to 5 genes. Similar examination of 1661 testes-specific genes in Drosophila revealed that ~35% are found in clusters of 3 or more genes [23]. Thus genes with tissue-restricted expression patterns may sometimes be organized into clusters in the eukaryotic genome. In view of this evidence supporting higher orders of organization of transcriptionally active genes in eukaryotic genomes, we sought to investigate the genomic distribution of human cartilage-expressed transcripts (CETs). From an initial set of 13,000–16,000 CETs, a subset of ~10,000 of the best characterized CETs was selected and precisely mapped to the genome. We detected significant runs of
Results Chromosome distribution of cartilage-expressed transcripts in the human genome To map CETs in the most precise manner, we selected a subset containing the 10,379 best characterized CETs. Our null hypothesis states that these CETs should have the same chromosomal distribution as does the set of all 24,863 human genes (NCBI Build 34); however, the numbers of CETs on chromosomes 3, 14, 15, 19, X, and Y significantly differed from the null-hypothesis prediction ( p b 0.05) (Table 1). We repeated the analysis after subtracting: (1) a set of previously defined housekeeping genes [28,29] (N = 774) and (2) a set of fetal-specific CETs (N = 779). After removal of these potentially confounding genes, the CET distribution
Table 1 Distribution of CETs in the human genome Human genome
CETs
CETs less houskeeping genes (H3)
CETs less houskeeping genes (H3) and fetal-specific genes (F)
Chromosome #
Known genes
# CETs observed
# CETs expected
m2
# CETs Observed
# CETs Expected
m2
# CETs Observed
# CETs Expected
m2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y Total
2377 1671 1307 952 1102 1314 1264 866 1011 932 1501 1213 463 1103 670 1000 1345 368 1512 691 285 673 1095 148 24863
1,070 714 631 415 494 564 497 362 392 423 591 565 205 364 332 437 581 152 552 297 108 233 380 20 10379
992.27 697.56 545.6 397.41 460.03 548.53 527.65 361.51 422.04 389.07 626.59 506.36 193.28 460.44 279.69 417.45 561.47 153.62 631.18 288.46 118.98 280.95 457.1 61.79 10379
6.734 0.418 14.112** 0.811 2.626 0.464 1.875 0.000 2.231 3.075 2.150 7.138 0.723 21.138** 10.051* 0.958 0.719 0.020 10.573* 0.257 1.022 8.408 13.599** 28.429**
981 676 601 398 458 515 472 336 368 403 536 505 197 336 316 402 524 145 475 281 99 212 353 16 9605
918.28 645.54 504.92 367.77 425.72 507.62 488.30 334.55 390.57 360.05 579.86 468.60 178.86 426.11 258.83 386.32 519.60 142.16 584.11 266.95 110.10 259.99 423.02 57.17 9605
4.737 1.541 19.299** 2.583 2.561 0.113 0.574 0.007 1.359 5.324 3.5314 2.972 1.874 19.939** 12.976** 0.663 0.039 0.057 21.702** 0.761 1.132 9.105 12.123 29.83**
905 626 561 376 436 482 444 318 329 379 486 467 193 318 292 346 467 137 392 258 90 171 337 16 8826
843.80 593.18 463.97 337.95 391.19 466.45 448.70 307.42 358.89 330.85 532.83 430.60 164.36 391.55 237.84 354.99 477.46 130.63 536.74 245.29 101.17 238.91 388.71 52.54 826
4.908 1.947 21.42** 4.456 5.37 0.547 0.052 0.377 2.595 7.282 4.381 3.235 5.086 14.457** 12.675** 0.237 0.242 0.315 41.558** 0.677 1.248 19.838** 7.196 25.563**
Tail probability: chi-square calculator from http://www.stat.sc.edu/~west/applets/chisqdemo.html. * p b 0.05. ** p b 0.01.
526
T.D. Yager et al. / Genomics 84 (2004) 524–535
remained significantly different from the null-hypothesis prediction ( p b 0.05) for chromosomes 3, 14, 15, 19, and Y; however, chromosome 22 reached significance ( p b 0.01) and chromosome X was no longer significant (Table 1). The distribution of cartilage-expressed transcripts at the level of cytogenetic bands Since whole-chromosome analysis has an inherently low resolution, we also evaluated the genomic distribution of a subset of 8583 CETs with respect to the 862 Giemsa bands that define the high-resolution cytogenetic map [30]. Distribution across cytogenetic band classes We first examined the distribution of CETs between Giemsa-negative bands (N = 414), Giemsa-positive bands (N = 378), and centromeric bands (N = 70). Approximately 95% of the CETs were localized within individual cytogenetic bands, while the remaining 5% spanned the boundaries of adjacent bands. Using the Kolmogorov-Smirnov (KS) test (http://www.physics.csbsju.edu/stats/KS-test.n.plot_ form. html), we established that the percentage of genes/band in our 8583-member CET subset did not significantly differ
between Giemsa-positive and Giemsa-negative bands (KS test: D = 0.3182, p = 0.175). Genes in the centromeric bands also had significantly less fractional representation than genes in either the Giemsa-positive (KS test: D = 0.5909, p b 0.001) or the Giemsa-negative bands (KS test: D = 0.5455, p = 0.002). Thus, a relatively low proportion of centromeric genes appears to be transcribed in cartilage. Distribution across particular cytogenetic bands We next examined individual Giemsa bands for significantly high or low concentrations of CETs. In general, the distribution of CETs was consistent with a brandom occupancyQ model [31] as predicted by a null hypothesis (Fig. 1). Sequential placement of CETs along individual chromosomes These analyses demonstrated that the cartilage transcriptome is not strongly defined by an underlying cytogenetic band structure; it was therefore necessary to consider a finer level of resolution. Using the 10,379 precisely mapped CETs, we identified all runs of adjacent
Fig. 1. Fraction of genes in a cytogenetic band that are CETs, as a function of band size. Empirical distribution for a subset of 8583 CETs selected by Method 1. The gray plots show the predicted relationship between the fraction of genes that are CETs and the cytogenetic band size, under the null hypothesis of random placement of CETs within cytogenetic bands.
T.D. Yager et al. / Genomics 84 (2004) 524–535
genes that were expressed in cartilage on a chromosome-bychromosome basis. A one-sample Runs test (Eqs. (1) – (3)) was used to assess the significance of the CET runs. With the exception of chromosomes 10 and 20, the observed distribution of CET runs was improbable by chance ( p b 0.05) (Supplementary Table 1). Furthermore, CET runs of length 1 or 2 occurred less frequently than expected by chance, while CET runs of length 4 to 10+ occurred more frequently ( p b 0.001) (Table 2, Fig. 2). Housekeeping CETs Not surprisingly, our set of 10,379 CETs contained a large fraction (N90%) of known housekeeping genes [28,29]. These genes are known to form clusters in the human genome [13], which may have accounted for the CET runs we observed. After removing housekeeping genes from our set of CETs, we once again observed a statistically significant ( p b 0.05) number of CET runs on each chromosome, with the exception of chromosomes 10 and 20 (Supplementary Table 1). CET runs of length 1 or 2 occurred less frequently than expected by chance, while runs of length 4 to 10+ occurred more frequently ( p b 0.001) (Table 2 and Fig. 2). Fetal-specific CETs Additional complications may result from the inclusion of fetal cartilage data, thereby introducing developmentally regulated genes. Analysis revealed that 779 of the 10,379 CETs were derived exclusively from fetal cartilage. After subtracting both the housekeeping and the fetal-derived genes from the original CET set, a statistically significant ( p b 0.05) number of CET runs was still observed on each chromosome, with the exception of chromosomes 10, 18, 19 and 20 (Supplementary Table 1). Similar to the previous analyses, CET runs of length 1 or 2 were found to occur less frequently than expected by chance, while runs of 3 to 10+ were found to occur more frequently ( p b 0.001), with the exception of run length 9 (Table 2 and Fig. 2).
527
bSliding windowQ analysis to detect genomic transcriptional hot spots and cold spots in cartilage Genomic transcriptional hot spots and cold spots in cartilage were detected by a combination of two approaches. (1) A sliding window analysis was used to identify genomic segments displaying very high or low percentages of CETs; (2) cytogenetic bands that displayed the highest or lowest percentages of CETs, relative to other bands of similar size, were identified. Fig. 3a presents sliding-window calculations for chromosome 22, using a 25-gene-wide window as an illustrative example. A transcriptional cold spot (the immunoglobulin E gene cluster at 22q11.22) is clearly evident, as are several hot spots. We performed sliding-window calculations for all chromosomes (Supplementary Fig. 1 and Supplementary Table 3); results indicate that hot spots and cold spots in cartilage are often located in areas of relatively high gene density. Supplementary Fig. 2 details the distribution of physical lengths of all possible 25-gene windows. Our ability to identify hot spots and cold spots is relatively independent of bwindow width,Q as demonstrated by Fig. 3 and Supplementary Table 4.
Discussion The identification of cartilage-expressed transcripts To elucidate the transcriptional events underlying cartilage development and dysfunction (e.g., osteoarthritis) we characterized the complete set of genes expressed in human cartilage [32,33]. This led to our large-scale generation and sequencing of ESTs from normal fetal and adult cartilage and from adult osteoarthritic cartilage. In addition, we conducted a number of hybridizations of adult cartilage RNA samples to the Affymetrix HU133A GeneChip.
Table 2 Distribution of sequential runs of CETs in the human genome Run length
1 2 3 4 5 6 7 8 9 10+ *
p b 0.001.
CETs (n = 10,379)
CETs less houskeeping genes (H3) (n = 9,605)
CETs less houskeeping genes (H3) and fetal-specific genes (F) (n = 8,826)
Runs observed
Runs expected
m2
Runs observed
Runs expected
m2
Runs observed
Runs expected
m2
2670 1231 596 322 188 80 29 25 14 20
3440.5 1446.3 610.7 262.8 113.7 49.1 21.3 9.5 4.2 3.4
258.161* 88.861* 1.297 59.327* 256.687* 120.578* 20.023* 202.997* 206.732* 937.302*
2824 1218 577 283 136 59 25 17 6 8
3533.1 1374.7 541.5 215.8 86.9 35.2 14.4 5.9 2.5 1.8
225.086* 50.034* 8.408 91.98* 145.293* 98.728* 55.201* 167.893* 44.205* 220.108*
2915 1125 513 240 112 45 21 13 2 6
3582.2 1282.2 467.6 172.7 64.6 24.4 9.3 3.6 1.4 0.9
209.289* 54.381* 15.706* 113.783* 180.478* 106.098* 103.792* 196.99* 2.317 304.82*
528
T.D. Yager et al. / Genomics 84 (2004) 524–535
Fig. 2. Runs of adjacent CETs in the human genome. The total number of CET Runs in the human genome was computed by comparing NCBI Genome Build 34 to our subset of 10,379 CETs selected by Method 2. The expected result (under the null hypothesis of random CET distribution) was computed from 1000 iterations of a random bball and urnQ simulation. A Runs test was used to determine the significance ( p value). (a) Percentage deviation of observed from expected CET run frequency, as a function of CET run length. (b) Log-log plot of K R (number of within-run CET neighbors) as a function of N R (number of CET runs of length R). CET refers to the original set of 10,379 mapped CETs; CET-H3 refers to the original set of CETs less the list of 774 housekeeping genes; CET-H3&F refers to the original set of CETs less the list of 774 housekeeping and 779 fetal-specific CETs; rand refers to the random model.
These studies have allowed us to define empirically the btranscriptomeQ of human cartilage as representing 13,200 to 15,800 unique genes. These two independent estimates are strikingly similar, and their magnitudes also agree reason-
ably well with estimated numbers of genes in the transcriptomes of other human tissues and organs [34,35]. We note that the study of Velculescu et al. [34] employed SAGE technology to examine 88,875 chondrocyte transcripts, from
T.D. Yager et al. / Genomics 84 (2004) 524–535
529
Fig. 3. bSliding windowQ analysis of chromosome 22, to detect transcriptional hot spots and cold spots. This analysis employed a chromosome 22specific subset of the 10,379 CETs that had been selected by Method 2. (a) The percentage of transcriptional activity was computed using a 25-genewide sliding window. The horizontal lines are defined as follows: The dark black horizontal line in the center of the plot represents the mean percentage of transcriptional activity for a set of 100 bmock transcriptomesQ generated by iterations of a random ball and urn model; the three lines sequentially above and below the mean represent F1, F2, and F3 standard deviations away from this mean, respectively. (b) Comparison of results with blongQ sliding windows that are 25 genes, 1 Mb, or 3 Mb wide. (c) Comparison of results with bshortQ sliding windows that are 6 genes, 12 genes, or 0.3 Mb wide. The six vertical lines are provided as guides to aid aligning the horizontal axes of the three plots and are separated by 100 window positions.
which a subset of 11,628 unique transcripts was deduced. However, we believe this represents a lower bound estimate—primarily because cultured chondrocytes were examined, but also because of technical limitations associated with SAGE [36]. In contrast, we examined a set of four cDNA libraries derived from true biological samples (fetal, adult normal, adult mild OA, and adult severe OA cartilage).
The distribution of cartilage-expressed transcripts in the human genome We found that several chromosomes contained a significantly higher or lower number of CETs than predicted by a null hypothesis ( p b 0.05). In addition, we found that certain cytogenetic bands were either transcriptionally bhotQ
530
T.D. Yager et al. / Genomics 84 (2004) 524–535
or bcoldQ in cartilage (Table 3). However, since the overall genomic distribution of CETs may largely be explained by a brandom occupancyQ model (Fig. 1), it was necessary to proceed to a finer grain analysis that precisely mapped the CETs to the human genome sequence. We thereby uncovered significant nonrandom patterns in the genomic distribution of CETs at the level of adjacent genes. Our analyses suggest that the human genome is deficient in isolated CETs and CET pairs, but is enriched in longer CET runs from 4 to 10+ genes; we also identified numerous genomic transcriptional hot spots and cold spots in cartilage that are relatively independent of the width of the examination window used. Since it has been previously reported that housekeeping genes tend to cluster together in the human genome [11,13], the housekeeping genes identified in our CET set were removed and the data reanalyzed. The distribution of CETs across the genome was not significantly altered. Only chromosome X lost its statistical significance, suggesting that it may harbor a relatively large number of housekeeping gene clusters. A significant number of CET runs was still observed, suggesting that the runs we observed are not simply due to the clustering of housekeeping genes expressed in cartilage. A similar bias might have been introduced by including CETs from various stages of development and disease. Although we are confident in stating that the majority of changes in OA are the result of changes in gene expression levels, the same cannot be said for the fetal cartilage. Thus, we reanalyzed the results after removal of fetal-specific CETs. The distribution of CETs remained similar, with the Table 3 Haplotype-disease associations mapping to the MHC region of chromosome 6 (cytogenic bands 6p21.31–p21.33), comprising a transcriptional hotspot in cartilage Human haplotype-disease association or mouse knock-out phenotype
References*
MICB: highly polymorphic, but presently no known disease associations
OMIM 602436, PMID 11429322, 12671735 OMIM 600169, PMID 12878361, 11169253, 11737059, 12591055, 12022360, 11390038 OMIM 601022, PMID 12509789 OMIM 153440, PMID 12426569 OMIM 191160, PMID: 11904678, 11196702, 9818939, 12483736, 11393661
MICA: ulcerative colitis, peripheral arthropathy, celiac disease, psoriatic arthritis
NFKBIL1: rheumatoid arthritis TNFSF1 (TNFh, LTA): myocardial infarction susceptibility TNFSF2 (TNFa, cachectin): susceptibility to cerebral malaria, septic shock, ulcerative colitis, Crohn’s disease, osteoporosis, Guillain-Barre syndrome, juvenile oligoarthritis, rheumatoid arthritis TNFSF3 (TNFc; LTB): development of peripheral lymphoid organs [mouse knockout] *
OMIM 600978, PMID 12969314, 9256477, 9133428, 10405365
OMIM = Online Mendelian inheritance in Man; PMID = PubMed.
additional significance of chromosome 22. More importantly, a statistically significant number of CET runs was still observed. Thus, our analysis was not significantly affected by the inclusion of either the housekeeping or the fetal-specific CETs. Genomic transcriptional hot spots and cold spots in cartilage Genomic transcriptional hot spots in cartilage were found to be diverse in genetic content. Some contained polymorphic genes linked to human genetic diseases involving cartilage dysfunction. Table 3 presents one such example (the MHC region on chromosome 6, mapping to cytogenetic bands 6p21.31-p21.33) linked to rheumatoid arthritis, psoriatic arthritis, osteoporosis, juvenile oligoarthritis, and other diseases. As expected, many of the genomic transcriptional cold spots in cartilage contained gene families that biologically should not be expressed in this tissue: embryonic patterning genes, immunoglobulin genes, the T cell receptor B locus, olfactory and taste receptor genes, defensins, neural protocadherins, and neuronal receptor genes. Evolutionary aspects of the clustering of coexpressed genes Four hypotheses are consistent with the observation that coexpressed genes tend to cluster in eukaryotic genomes. Clustering could result from: Tandem gene duplication This could account for some of the transcriptionally active gene clusters and transcriptional cold spots that were observed. However, the majority of CET clusters do not contain paralogous genes. Also, many CET clusters contain both paralogs and nonparalogs. Thus, the hypothesis of tandem gene duplication is unable to explain all of the CET clustering that was observed. Recombination-deficient islands A recent report indicates that in yeast, essential genes have become clustered together within bislandsQ characterized by low recombination rates [37]. The relevance of this finding to other eukaryotes is not known at present. Coordinate chromatin-based regulation A widely held model is that clusters of coexpressed genes are contained within individual chromatin domains (the blocus control regionQ model; see Refs. [25,26]). While supported by studies of the globin gene cluster and other gene clusters, the generality of this model is debatable [24]. bBiased recruitmentQ hypothesis During evolution of cartilage as a tissue, the brecruitmentQ of new genes into the cartilage transcriptome
T.D. Yager et al. / Genomics 84 (2004) 524–535
might have been biased in favor of genes that were physically contiguous to preexisting CET runs. This is an example of a bpreferential growthQ model in network theory [38]. Our data appear consistent with this model. In Fig. 2b we plot the distributions of observed and expected CET runs on a log-log scale (excluding CET run lengths of 1, 2, 3). We may interpret Fig. 2b according to network theory by defining the bcartilage transcriptome networkQ to be equivalent to the set of all CET runs in the genome and by noting the relationship K R = (R 1) N R , where R is the CET run length, N R is the number of CET runs of length R in the network, and K R is the number of bconnectionsQ (within-run CET neighbors) in the network. In Fig. 2b, the downward-sloping curves for the random (ball and urn) model are consistent with a Poisson frequency distribution p(K R ) for connections between CETs in a random model. But clearly, the empirical frequency distribution p(K R ) for connections between actual CETs (solid points in Fig. 2b) deviates from the random model and is not Poisson or Gaussian. The empirical distribution p(K R ) behaves more in accordance with a scale-free power law (cf. the linear trend lines in the figure). The presence of elevated frequencies of long CET runs and also our finding that isolated CETs and CET pairs are underrepresented in the genome are both consistent with the biased recruitment hypothesis. Possible mechanisms of chromatin domain-specific transcriptional control We have found that genomic transcriptional hot spots and cold spots in cartilage are localized to specific genomic domains, typically 25–75 genes wide. This is consistent with recent bchromatin domainQ models of transcriptional activation and repression [24–27]. We have also obtained evidence that chromatin remodeling may be involved in control over transcription within cartilage. Evidence for general HDAC-mediated chromatin repression in cartilage Deacetylation of chromatin is often correlated with repression of chromatin domains [27]. The following 24 genes, which are implicated in chromatin-silencing pathways mediated by histone deacetylation [27], were represented in the CET dataset: HP1-BP74, HIWI2, SIRT1, SIRT2, SIRT3, SAS10, DXS399E, EPC2, NSPC1, EPC1, CBX1, CBX3, CBX5, MYST1, SATB1, CHC1, KIAA0159, HBOA, HAT1, HDAC11, HDAC2, HDAC3, HDAC5, HDAC9. A literature cocitation analysis with PubGene (Fig. 4a) suggests that 13 of the above genes may be organized into a functional network. The fact that mRNAs specifying this network were found in cartilage thus provides evidence for the existence of a chromatinrepression mechanism in this tissue.
531
Evidence for general Swi/SNF-mediated chromatin activation in cartilage A general mechanism for chromatin activation involves the operation of a Swi/SNF remodeling complex [39–44]. The following 11 genes, which are implicated in chromatinactivation pathways mediated by Swi/SNF, were represented in the CET set: ATRX, BTAF1, PCAF, SMARCA1, SMARCA2, SMARCA4, SMARCA5, SMARCC2, SMARCD2, SMARCE1, SMARCF1. Literature cocitation analysis with PubGene indicates that these genes may be organized into a functional network (Fig. 4b). The fact that mRNAs specifying this network were found in cartilage thus provides evidence for the existence of a Swi/SNF-mediated chromatin-activation mechanism in this tissue. Interestingly, PCAF is present in both network diagrams (Fig. 4). This gene might provide a link between a general HDAC-dependent chromatin-repression pathway and a general Swi/SNFdependent chromatin-activation pathway in cartilage. This study presents the first comprehensive mapping of human cartilage-expressed transcripts to the human genome. From the 13,000–16,000 genes that we find to be transcribed in human cartilage, the best characterized ~10,000 were mapped to the human genome. The human genome was found to be deficient in isolated CETs and CET pairs, but enriched in longer CET runs. Our set of reference data may prove useful in the understanding of both the biology of normal cartilage and the pathogenesis of complex cartilage diseases such as arthritis. The approach described herein can be used to understand better the events underlying cartilage function and dysfunction and may also be extended to other biological tissues and processes.
Materials and methods Identification of 13,000 to 16,000 unique transcripts expressed in human cartilage Sequencing of randomly selected ESTs from cartilage cDNA libraries To establish a CET database, we generated 117,892 ESTs from four unique human cartilage cDNA libraries (fetal— 28,036 ESTs; normal adult—29,553 ESTs; mild OA— 29,963 ESTs; severe OA—30,700 ESTs). Protocols for construction of cDNA libraries, generation of ESTs, and analysis of data have been described previously [32,33]. Unique sequences were determined through clustering via TIGR Assembler [45], as well as comparison to GenBank [32,33] and SOURCE [46] databases. After removal of extraneous sequences, the final CET database comprised 15,778 unique genes. Profiling of cartilage gene expression on the Affymetrix U133A GeneChip Total RNA was extracted from six human cartilage samples using TRIzol (Invitrogen, Carlsbad, CA, USA)
532
T.D. Yager et al. / Genomics 84 (2004) 524–535
SMARCD2
Fig. 4. Cocitation analysis of CETs that are related to chromatin remodeling. Literature cocitation analysis was conducted using PubGene v2.1. The genes used in these queries were all detected as ESTs in cartilage libraries and are specified under Discussion. Network depth setting is 1 in these diagrams. (a) Hypothesized pathway for HDAC-mediated chromatin repression in cartilage. (b) Hypothesized pathway for Swi/SNF-mediated chromatin activation in cartilage.
according to the manufacturer’s instructions. One hundred nanograms of total RNA was used for each hybridization on the Affymetrix U133A GeneChip following the manufacturer’s protocols (Affymetrix, Santa Clara, CA, USA). Further information can be found in Supplementary Table 5. A gene was defined as expressed in cartilage if flagged as being present in at least four of the six profiles. Of the 14,100 unique genes on the U133A GeneChip, 7485 (~53%) were observed expressed in our samples. Assuming the U133A GeneChip represents an unbiased sample of the entire genome, we expect ~53% of all genes (~13,177 genes) to be expressed in human cartilage. Selection of the two CET subsets used for genomic location analysis Selection Method 1 MatchMiner [47] was used to identify 7423 CETs having a unique cytogenetic band location on the UCSC sequence-
to-band translation, which maps the 862 Giemsa bands onto NCBI Genome Build 33. SOURCE [46], MatchMiner [47], and the NCBI Entrez genomes viewer [48,49] were used to assign unique chromosomal locations to an additional 2644 CETs, thus producing a combined list of 8583 CETs with unique cytogenetic band locations. Selection Method 2 EST annotations from our manually annotated 15K ChondroChip [32] were combined with the semiautomated annotations described above. Only ESTs with matches of at least 95% sequence identity over half the EST length were selected for this analysis. LocusLink IDs for each gene match were obtained from either the NCBI or the SOURCE database [46], resulting in the identification of 9199 unique genes. An additional 1703 genes were identified from the Affymetrix U133A cartilage profiles, for a final total of 10,902 genes with LocusLink IDs. When these IDs were queried against NCBI Genome Build
T.D. Yager et al. / Genomics 84 (2004) 524–535
34, a total of 10,379 unambiguous physical locations were returned. Removal of housekeeping and fetal CETs Hsiao et al. [28] defined a set of 451 housekeeping genes, while Eisenberg and Levanon [29] defined a set of 575 housekeeping genes. We selected 431 and 573 genes within the respective sets based on the genes being nonredundant, having LocusLink IDs, and mapping unambiguously to NCBI Genome Build 34. We define set bH3Q (830 genes) to be the (nonredundant) union of the above two housekeeping gene sets. Our set of 10,379 precisely mapped CETs (selected by Method 2) was found to contain 774/830 (93.2%) of the housekeeping genes in set bH3Q. After removal of the housekeeping genes, the subtracted set (CETH3) was composed of 9605 precisely mapped CETs. A similar approach was used to subtract fetal-specific CETs. We generated a list of 779 CETs that were recovered from our fetal cartilage cDNA library, but that were not identified in our adult cartilage cDNA libraries, by Affymetrix HU133A hybridization against adult cartilage RNA or in the lists of housekeeping genes defined above. The union of the sets of fetal-specific CETs and housekeeping CETs (set H3+F) was subtracted from our set of 10,379 precisely mapped CETs for a final subtracted set (CET-H3F) of 8826 precisely mapped CETs. Whole-chromosome analysis The genomic locations of 10,379 CETs (selection Method 2) were determined by reference to NCBI Genome Build 34. We used the m2 test to determine if significant differences existed between the number of CETs observed per chromosome and the number of CETs expected per chromosome based on the underlying gene distribution. For each individual chromosome the m2 statistic (df = 1) was calculated as follows: m2 = ((N obs N exp)2/N exp)chromosome + ((N obs N exp)2/N exp)rest of genome. A significance level of either p b 0.05 (=0.05/24 or 0.0002) or p b 0.01 (=0.01/24 or 0.0004) was used, according to the Bonferroni correction [15]. Cytogenetic band analysis CET Subset 1, defined above and consisting of 8583 CETs with unambiguous cytogenetic band locations, was used for this analysis. A random ball and urn model [31] was used for occupancy calculations.
533
sequence of b0Q and b1Q symbols was then analyzed to determine the number (U) of runs of either type of symbol. Statistical significance was assessed with the one-sample Runs test (Minitab v14; Minitab, State College, PA, USA), where Z represents a transformation of U with mean = 0 and standard deviation = 1: 2n0 n1 þ 1; U¯ ¼ n0 þ n1 r ¯2 ¼ U
Z¼
2n0 n1 ð2n0 n1 n0 n1 Þ ðn0 þ n1 Þ2 ðn0 þ n1 1Þ
U U¯ : rU¯
ð1Þ
;
ð2Þ
ð3Þ
Comparison to a random model The empirical distribution of CET runs (of length 1 to 10+ genes) on each chromosome was obtained from analyzing a set of 10,379 CETs that were selected by Method 2 and mapped to NCBI Genome Build 34. This was compared to a random distribution generated by a ball and urn simulation [31] under the null hypothesis that CETs are randomly dispersed throughout the genome. The m2 statistics were used to test for significant differences ( p b 0.001 after Bonferroni correction) between the empirical and the random distributions, using run lengths as bbinsQ in the analysis. Sliding-window analysis to identify transcriptional hot spots and cold spots in cartilage A sliding-window approach was used to compute the fraction of genes represented as CETs within each segment of the genome (NCBI Build 34). The analysis employed windows that were 6, 12, or 25 genes wide or 0.3, 1, or 3 Mb wide. Each chromosome was analyzed separately. The observed maxima and minima in the sliding-window plots were tested for significance by comparison to plots from a null hypothesis model in which CETs were randomly dispersed throughout the genome. Significance was assessed relative to the mean and standard deviation of the percentage of expressed genes per window in the random model. To calculate the correlations between the CET profiles for different window widths, Pearson’s correlation coefficient was calculated using Minitab v14 (Minitab). Cocitation analysis of CETs
Statistical test for significance of observed CET runs Based on NCBI Genome Build 34, each chromosome was represented as a linear sequence of b1Q and b0Q symbols with b1Q corresponding to a cartilage-expressed gene and b0Q corresponding to a gene not expressed in cartilage. The
PubGene v2.1 was used to conduct literature cocitation analyses on specific CET subsets, to identify potential functional gene networks. Additional Medline queries were performed manually. Literature cocitation analysis is based on the assumption that, if two genes are cocited in the
534
T.D. Yager et al. / Genomics 84 (2004) 524–535
abstract of a scientific publication, then these genes may be biologically related [50].
Acknowledgments We thank Guy Hulbert (ChondroGene) for a Perl script that generates sets of pseudo-random numbers. We also thank Daniel Fefer (ChondroGene) and Yong Yu (ChondroGene) for their editorial assistance and constructive reviews.
Appendix A. Supplementary data Supplementary data for this article may be found on ScienceDirect.
References [1] D.A. Zorio, N.N. Cheng, T. Blumenthal, J. Spieth, Operons as a common form of chromosomal organization in C. elegans, Nature 372 (1994) 270 – 272. [2] T. Blumenthal, et al., A global analysis of Caenorhabditis elegans operons, Nature 417 (2002) 851 – 854. [3] M.J. Lercher, T. Blumenthal, L.D. Hurst, Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes, Genome Res. 13 (2003) 238 – 243. [4] T. Blumenthal, K.S. Gleason, Caenorhabditis elegans operons: form and function, Nat. Rev. Genet. 4 (2003) 112 – 120. [5] T. Blumenthal, Gene clusters and polycistronic transcription in eukaryotes, Bioessays 20 (1998) 480 – 487. [6] J.G. Lawrence, Shared strategies in gene organization among prokaryotes and eukaryotes, Cell 110 (2002) 407 – 413. [7] G. Bernardi, et al., The mosaic genome of warm-blooded vertebrates, Science 228 (1985) 953 – 958. [8] D. Mouchiroud, et al., The distribution of genes in the human genome, Gene 100 (1991) 181 – 187. [9] B.A. Cohen, R.D. Mitra, J.D. Hughes, G.M. Church, A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression, Nat. Genet. 26 (2000) 183 – 186. [10] P.T. Spellman, G.M. Rubin, Evidence for large domains of similarly expressed genes in the Drosophila genome, J. Biol. 1 (2002) 5. [11] H. Caron, et al., The human transcriptome map: clustering of highly expressed genes in chromosomal domains, Science 291 (2001) 1289 – 1292. [12] R. Versteeg, et al., The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes, Genome Res. 13 (2003) 1998 – 2004. [13] M.J. Lercher, A.O. Urrutia, L.D. Hurst, Clustering of housekeeping genes provides a unified model of gene order in the human genome, Nat. Genet. 31 (2002) 180 – 183. [14] M.S. Ko, et al., Genome-wide mapping of unselected transcripts from extraembryonic tissue of 7.5-day mouse embryos reveals enrichment in the t-complex and under-representation on the X chromosome, Hum. Mol. Genet. 7 (1998) 1967 – 1978. [15] S. Bortoluzzi, et al., A comprehensive, high-resolution genomic transcript map of human skeletal muscle, Genome Res. 8 (1998) 817 – 825.
[16] B.L. Gabrielsson, B. Carlsson, L.M. Carlsson, Partial genome scale analysis of gene expression in human adipose tissue using DNA array, Obes. Res. 8 (2002) 374 – 384. [17] A.A. Dempsey, N. Pabalan, H.C. Tang, C.C. Liew, Organization of human cardiovascular-expressed genes on chromosomes 21 and 22, J. Mol. Cell. Cardiol. 33 (2001) 587 – 591. [18] J.D. Barrans, et al., Chromosomal distribution of the human cardiovascular transcriptome, Genomics 81 (2003) 519 – 524. [19] T. Fujii, et al., A preliminary transcriptome map of non-small cell lung cancer, Cancer Res. 62 (2002) 3340 – 3346. [20] P. Qiu, L. Benbow, S. Liu, J.R. Greene, L. Wang, Analysis of a human brain transcriptome map, BMC Genom. 3 (2002) 10. [21] K. Megy, S. Audic, J.M. Claverie, Positional clustering of differentially expressed genes on human chromosomes 20, 21 and 22, Genome Biol. 4 (2003) P1. [22] P.J. Roy, J.M. Stuart, J. Lund, S.K. Kim, Chromosomal clustering of muscle-expressed genes in Caenorhabditis elegans, Nature 418 (2002) 975 – 979. [23] A.M. Boutanaev, A.I. Kalmykova, Y.Y. Shevelyov, D.I. Nurminsky, Large clusters of co-expressed genes in the Drosophila genome, Nature 420 (2002) 666 – 669. [24] N. Dillon, P. Sabbattini, Functional gene expression domains: defining the functional unit of eukaryotic gene regulation, Bioessays 22 (2000) 657 – 665. [25] Q. Li, K.R. Peterson, X. Fang, G. Stamatoyannopoulos, Locus control regions, Blood 100 (2002) 3077 – 3086. [26] L. Xin, D.P. Liu, C.C. Ling, A hypothesis for chromatin domain opening, Bioessays 25 (2003) 507 – 514. [27] S.I. Grewal, D. Moazed, Heterochromatin and epigenetic control of gene expression, Science 301 (2003) 798 – 802. [28] L.-L. Hsiao, et al., A compendium of gene expression in normal tissues, Physiol. Genom. 7 (2001) 97 – 104. [29] E. Eisenberg, E.Y. Levanon, Human housekeeping genes are compact, Trends Genet. 19 (2003) 362 – 365. [30] T.S. Furey, D. Haussler, Integration of the cytogenetic map with the draft human genome sequence, Hum. Mol. Genet. 12 (2003) 1037 – 1044. [31] W. Feller, An Introduction to Probability Theory and Its Applications, 3rd Ed. Vol. 1, Wiley, New York, 1968. [32] H. Zhang, C.C. Liew, K.W. Marshall, Microarray analysis reveals the involvement of beta-2 microglobulin (B2M) in human osteoarthritis, Osteoarthritis Cartilage 10 (2002) 950 – 960. [33] H. Zhang, et al., Profiling genes expressed in human fetal cartilage using 13,155 expressed sequence tags, Osteoarthritis Cartilage 11 (2003) 309 – 319. [34] V.E. Velculescu, et al., Analysis of human transcriptomes, Nat. Genet. 23 (1999) 387 – 388. [35] A.A. Dempsey, V.J. Dzau, C.C. Liew, Cardiovascular genomics: estimating the total number of genes expressed in the human cardiovascular system, J. Mol. Cell. Cardiol. 33 (2001) 1879 – 1886. [36] E.D. Pleasance, M.A. Marra, S.J. Jones, Assessment of SAGE in transcript identification, Genome Res. 13 (2003) 1203 – 1215. [37] C. Pal, L.D. Hurst, Evidence for co-evolution of gene order and recombination rate, Nat. Genet. 33 (2003) 392 – 395. [38] L.A. Amaral, A. Scala, M. Barthelemy, H.E. Stanley, Classes of small-world networks, Proc. Natl. Acad. Sci. USA 97 (2000) 11149 – 11152. [39] C. Muchardt, M. Yaniv, ATP-dependent chromatin remodelling: SWI/ SNF and Co. are on the job, J. Mol. Biol. 293 (1999) 187 – 198. [40] C.L. Peterson, J.L. Workman, Promoter targeting and chromatin remodeling by the SWI/SNF complex, Curr. Opin. Genet. Dev. 10 (2000) 187 – 192. [41] P. Varga-Weisz, ATP-dependent chromatin remodeling factors: nucleosome shufflers with many missions, Oncogene 20 (2001) 3076 – 3085. [42] C.J. Fry, C.L. Peterson, Chromatin remodeling enzymes: who’s on first? Curr. Biol. 11 (2001) R185 – R197.
T.D. Yager et al. / Genomics 84 (2004) 524–535 [43] W. Wang, The SWI/SNF family of ATP-dependent chromatin remodelers: similar mechanisms for diverse functions, Curr. Top. Microbiol. Immunol. 274 (2003) 143 – 169. [44] K.E. Van Holde, T.D. Yager, Models for chromatin remodeling: a critical comparison, Biochem. Cell Biol. 81 (2003) 1 – 4. [45] G. Sutton, O. White, M. Adams, A. Kerlavage, TIGR Assembler: a new tool for assembling large shotgun sequencing projects, Genome Sci. Technol. 1 (1995) 9 – 19. [46] M. Diehn, et al., SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data, Nucleic Acids Res. 31 (2003) 219 – 223.
535
[47] K.J. Bussey, et al., MatchMiner: a tool for batch navigation among gene and gene product identifiers, Genome Biol. 4 (2003) R27. [48] T.A. Tatusova, I. Karsch-Mizrachi, J.A. Ostell, Complete genomes in WWW Entrez: data representation and analysis, Bioinformatics 15 (1999) 536 – 543. [49] D.L. Wheeler, et al., Database resources of the National Center for Biotechnology, Nucleic Acids Res. 31 (2003) 28 – 33. [50] T.K. Jenssen, A. Laegreid, J. Komorowski, E. Hovig, A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet. 28 (2001) 21 – 28.