Cell, Vol. 30, 599-606, September 1982. Copyright © 1982 by MIT
Comparisons of the Complete Sequences of Two Collagen Genes from Caenorhabditis elegans
James M. Kramer, George N. Cox and David Hirsh
Department of Molecular, Cellular and Developmental Biology
University of Colorado
Boulder, Colorado 80309
Summary
Several collagen genes have been isolated from the nematode Caenorhabditis elegans. The complete nucleotide sequences of two of these genes, col-1 and col-2, have been determined. These collagen genes differ from vertebrate collagen genes in that they contain only one or two introns, their triple-helical regions are interrupted by nonhelical amino acid sequences and they are smaller. A high degree of nucleotide and amino acid homology exists between col-1 and col-2. In particular, the regions around cysteines and lysines are most highly conserved. The C. elegans genome contains 50 or more collagen genes, the majority of which probably encode cuticle collagens; col-1 and col-2 apparently are members of this large family of cuticle collagen genes.
Introduction
Collagens are a family of structural proteins found in all metazoan phyla and distinguished by their triple-helical structure. This helical structure is formed by regions of the polypeptide that are composed of a repeating (Gly-X-Y)n amino acid triplet, in which the first amino acid is always glycine and the X and Y positions are frequently proline and hydroxyproline (Ramachandran, 1967; Fessler and Fessler, 1978). At least eight structurally distinct collagen polypeptides have been identified in vertebrates (Bornstein and Sage, 1980). Amino acid sequence analyses of vertebrate interstitial collagens have shown that their helical regions contain more than 300 consecutive Gly-X-Y triplets (Hulmes et al., 1973; Kang et al., 1975; Hofmann et al., 1978; Dixit et al., 1979). Amino acid composition data from invertebrate collagens indicate that, as expected for a (Gly-X-Y)n structure, they have a high glycine content (28%-33%), similar to that found in vertebrate collagens (30%-36%). The contents of other amino acids are quite variable among different invertebrate collagens, as are their similarities to the amino acid compositions of vertebrate collagens (Gross, 1963; Adams, 1978).
Without amino acid sequence data for invertebrate collagens, it is not possible to know precisely how their structures compare with those of the known vertebrate collagens.
The structure of collagen genes in vertebrates has been investigated. DNA sequence and heteroduplex studies on the pro-α2(I) collagen genes of chicken (Ohkubo et al., 1980; Yamada et al., 1980; Wozney et al., 1981) and sheep (Boyd et al., 1980; Schafer et al., 1980) have demonstrated that each of these genes contains numerous introns. The chicken gene, for which the most information exists, contains more than 50 introns, most of which are within the triple-helical coding region of the gene. Fourteen exons within the triple-helical coding region have been sequenced, and it is evident that introns always occur between Gly-X-Y repeats, never within the amino acid triplet repeat. The sizes of exons appear to be related to a basic 54 base pair unit, coding for six Gly-X-Y repeats. Half of the sequenced exons are 54 bp; the others are 45 bp (54 - 9), 99 bp (2 x 54 - 9) or 108 bp (2 x 54). These data led to the proposal that collagen genes evolved by duplications of an ancestral 54 bp sequence (Yamada et al., 1980).
We are studying genes involved in production of the cuticle of the nematode Caenorhabditis elegans because the cuticle is a complex structure whose synthesis is developmentally regulated (Cox et al., 1981a, 1981b, 1981c). Furthermore, C. elegans mutants with defective cuticles have been isolated (Higgins and Hirsh, 1977; Cox et al., 1980) and will be valuable for the genetic analysis of cuticle synthesis and assembly. Previous studies demonstrated that the C. elegans cuticle is composed largely of collagen and that it contains a large number of collagen species with different molecular weights (Cox et al., 1981~; Ouazana and Herbage, 1981). There are also differences in the collagens in the cuticles of worms at different stages of development. Because of the variety of posttranslational modifications that affect collagens, such as propeptide processing, glycosylation and covalent crosslinking, it is not known whether this diversity of cuticle collagens is due to the existence of a large number of unique collagens or is the result of posttranslational modifications of a small number of proteins.
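The exon-size arithmetic quoted above for the chicken gene can be checked with a few lines of Python. This is only an illustrative sketch: the function name `describe_exon` and the constants are introduced here for the example and do not come from the cited studies.

```python
# Exon sizes reported for the helical coding region of the chicken gene (bp).
UNIT_BP = 54     # proposed ancestral unit: six Gly-X-Y triplets
TRIPLET_BP = 9   # one Gly-X-Y triplet = three codons = 9 bp


def describe_exon(size_bp: int) -> str:
    """Express an exon size as k x 54 bp or k x 54 - 9 bp when possible."""
    if size_bp % TRIPLET_BP != 0:
        return f"{size_bp} bp: not a whole number of Gly-X-Y triplets"
    if size_bp % UNIT_BP == 0:
        return f"{size_bp} bp = {size_bp // UNIT_BP} x 54"
    if (size_bp + TRIPLET_BP) % UNIT_BP == 0:
        return f"{size_bp} bp = {(size_bp + TRIPLET_BP) // UNIT_BP} x 54 - 9"
    return f"{size_bp} bp: whole triplets, but not k x 54 or k x 54 - 9"


for size in (45, 54, 99, 108):
    print(describe_exon(size))
```

Every reported size is a whole number of 9 bp triplets, which is consistent with introns falling between Gly-X-Y repeats rather than within them.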
We have therefore begun to study the collagen gene family in C. elegans. We compare the nucleotide sequences of two C. elegans collagen genes that probably code for cuticle collagens. We also compare the gene and protein structures of these two C. elegans collagens with those of vertebrate collagens.
Results
Isolation and Sequence Analysis of C. elegans Collagen Genes
When Southern blots of C. elegans DNA, digested with various restriction enzymes, are hybridized to the chicken pro-α2 collagen cDNA clone pCg45, 20 or more hybridization bands can be seen (Figure 1). We determined which portions of pCg45 were homologous to C. elegans DNA by cutting pCg45 into three subfragments and hybridizing each subfragment to whole genome Southern blots. Two subfragments, an 800 bp Hind III-Hpa II fragment and a 720 bp Hpa II fragment, consist entirely of triple-helical (Gly-X-Y) coding region (see Lehrach et al., 1978), and their hybridization patterns are similar to that seen with the intact pCg45 plasmid. The third subfragment, a 620 bp Hpa II-Eco RI fragment, codes for the carboxy-terminal propeptide region and does not hybridize to C. elegans DNA under identical conditions (data not shown). These results demonstrate that the hybridization of pCg45 to C. elegans DNA is due to homology with the triple-helical portion of the chicken collagen cDNA and that there is little or no homology between C. elegans DNA and the propeptide region of the chicken collagen. The large number of hybridizing bands seen in whole genome blots indicates that C. elegans either has a large number of collagen genes or has a few genes containing numerous introns.
We isolated C. elegans collagen genes by screening C. elegans DNA libraries cloned in λ Charon 10 and λ1059 with 32P-labeled pCg45. A large number of hybridizing phages were purified. Two groups of
phages defining the two collagen genes designated col-1 and col-2 were chosen for sequence analysis. Several duplicate and overlapping phages containing these two collagen genes were isolated from the Eco RI partial digest library constructed in λ Charon 10. Restriction maps of these phages are presented in Figure 2. Single and double restriction enzyme digests of the phages were hybridized to pCg45 to localize the fragments homologous to the chicken collagen cDNA probe. In the phages containing col-1, the 3.6 kb Hind III fragment hybridized to pCg45, while 2.1 kb of DNA to the left and 11 kb of DNA to the right did not. In the phages containing col-2, hybridization was restricted to a 2.4 kb Eco RI fragment; the 8 kb of DNA to the left and the 10 kb of DNA to the right did not hybridize. More extensive restriction maps of these phages are available on request. Subfragments of the pCg45-hybridizing regions of the phages were subcloned into pBR322, pBR325 or
Figure 1. Hybridization of the Chicken Collagen cDNA Probe to C. elegans Genomic DNA
C. elegans genomic DNA was digested with the restriction enzymes Eco RI, Hind III or Bam HI, subjected to electrophoresis on a 0.7% agarose gel and transferred to a nitrocellulose filter. The chicken pro-α2 collagen cDNA clone pCg45 was labeled by nick translation and hybridized to the filter. The positions of size standards are indicated in kilobases to the left.
Figure 2. Restriction Maps and Sequencing Strategies of Phages Encoding the C. elegans Collagen Genes col-1 and col-2
Phages were isolated from the Eco RI partial digest library of C. elegans DNA constructed in λ Charon 10, by hybridization to the chicken collagen cDNA clone pCg45. Regions of the phages that hybridize to pCg45 are indicated by hatched boxes below the restriction maps. Subclones were produced from these regions for use in DNA sequence analysis (shown expanded below the phage map). The restriction sites and orientations used for sequencing are indicated. Restriction enzyme sites are symbolized as follows: B, Bam HI; D, Dde I; F, Hinf I; H, Hind III; M, Msp I; P, Pvu II; R, Rsa I; S, Sau 3A; T, Taq I. (A) Four phages were isolated that contain an identical 17 kb insert and are designated λCG1, 9, 10 and 11. These phages contain the gene designated col-1. (B) Five overlapping phages, designated λCG2, 3, 7, 8 and 12, contain the gene designated col-2. The locations and orientations of col-1 and col-2 within the sequenced regions are indicated by large arrows, which correspond to the nucleotide sequences presented in Figures 3 and 4.
pBR313. These subclones were used to determine the complete nucleotide sequences of the 3.6 kb Hind III fragment containing col-1 and of the 2.4 kb Eco RI fragment containing col-2. Maps of the restriction sites and orientations of the DNA fragments used for sequencing are presented in Figure 2. The coding regions of both col-1 and col-2 have been sequenced at least twice, with the exceptions of nucleotides 149-273 and 666-693 of col-1 and nucleotides 1-170 and 210-240 of col-2.
Structures of the C. elegans Collagen Genes
The nucleotide and amino acid sequences of col-1 and col-2, extending from the proposed initiator methionines to the termination codons, are presented in Figures 3 and 4. The initiator methionine for col-1 was assigned because there is an in-phase termination codon 57 bp upstream from the methionine and no consensus intron splice sequences are detectable within this 57 bp interval. Sequences resembling the transcription signal sequences TATA (Corden et al., 1980) and CAAT (Benoist et al., 1980) are approximately 60 and 100 nucleotides, respectively, upstream from the methionine. In col-2, no in-phase termination codon is found in the 40 bp of DNA sequenced to the 5’ side of the proposed initiator methionine. The extensive homology between col-1 and col-2, however, suggests that the col-2 initiator methionine would be in this position. This notion is supported by the fact that there is only 25% nucleotide homology between col-1 and col-2 upstream from this methionine, compared with an overall homology of 67% between the two genes in the remainder of the coding regions (see below). The 3’ end termination codons are TAA for both col-1 and col-2. A poly(A) addition signal (AATAAA) (Proudfoot and Brownlee, 1976) is located 220 bp downstream from the termination codon in col-1, and two possible poly(A) signals are present in col-2 at 26 bp and 340 bp downstream from the termination codon.
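The two sequence checks used above, an in-phase termination codon upstream of the proposed initiator ATG and an AATAAA signal downstream of the termination codon, can be sketched as simple string scans. The code below is a minimal illustration, not the procedure used for col-1 and col-2: the function names and the toy sequence are hypothetical, and the convention for measuring distances (from the ATG back to the start of the stop codon, and from the end of the stop codon to the start of AATAAA) is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
POLYA_SIGNAL = "AATAAA"


def upstream_inframe_stops(seq: str, atg_index: int) -> list[int]:
    """Distances (bp) from the ATG back to each in-frame stop codon upstream of it."""
    distances = []
    pos = atg_index - 3
    while pos >= 0:
        if seq[pos:pos + 3] in STOP_CODONS:
            distances.append(atg_index - pos)
        pos -= 3  # stay in the reading frame defined by the ATG
    return distances


def downstream_polya_signals(seq: str, stop_index: int) -> list[int]:
    """Distances (bp) from the end of the stop codon to each AATAAA downstream."""
    start = stop_index + 3
    hits = []
    i = seq.find(POLYA_SIGNAL, start)
    while i != -1:
        hits.append(i - start)
        i = seq.find(POLYA_SIGNAL, i + 1)
    return hits


# Toy sequence (invented for illustration, not taken from col-1 or col-2):
# in-frame TAG, two spacer codons, ATG, two codons, TAA, spacer, AATAAA.
toy = "TAG" + "GCTGCT" + "ATG" + "GGACCA" + "TAA" + "CCGG" + "AATAAA" + "TT"
print(upstream_inframe_stops(toy, toy.index("ATG")))
print(downstream_polya_signals(toy, 18))
```

A real analysis would also scan the interval for splice consensus sequences, as the text notes, since an intron between the stop codon and the ATG would invalidate the reading-frame argument.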
Two introns have been identified in col-1 and one intron in col-2. These introns are denoted in Figures 3 and 4 by lowercase letters. Introns have been identified as regions containing in-phase termination codons, bounded by sequences homologous to the consensus intron splicing signals (Sharp, 1981). The boundaries of these three introns have the consensus sequence 5’-AR/GTRA...TTYYAG/G-3’, where R is purine and Y is pyrimidine. All three introns are small (102 and 52 bp in col-1, 47 bp in col-2) and are located outside the (Gly-X-Y)n triple-helical coding regions. The average GC content of the introns is 19.9%, while the coding sequences have an average GC content of 58.8%.
The regions of col-1 and col-2 that encode a repeating Gly-X-Y triple-helical sequence are underlined in Figures 3 and 4 and boxed in Figure 5A. These regions contain glycine as every third amino acid and have a high proline content (36% in col-1, 31% in col-2). The frequency of proline as X is 13% in both col-1 and col-2; the frequency of proline as Y is 23% in col-1, 19% in col-2. Each gene contains several (Gly-X-Y)n triple-helical regions interrupted by short stretches (2-18 amino acids) that depart from the Gly-X-Y sequence. The total sizes of the triple-helical regions in col-1 and col-2 are identical, 150 amino acids. It is notable that the first triple-helical region in both genes is 30 amino acids long, the sum of the second and third regions in both genes is 54 amino acids and the fourth region of col-1 (66 amino acids) is equal to the sum of the fourth and fifth regions in col-2 (24 + 42 = 66).
Homologies between the C. elegans Collagen Genes col-1 and col-2
When col-1 and col-2 are aligned at the first glycine of the first triple-helical region of each gene, they are the same length in the 3’ direction but col-2 is 5 amino acids longer in the 5’ direction. The two genes can be aligned for maximum homology by looping out five amino acids (15 bp) from col-2 (nucleotides 172-186) and creating a 3 bp gap in col-2 between nucleotides 767 and 768 or by looping out the corresponding region of col-1. The 3 bp gap is necessary because, although both genes have the same number of triple-helical amino acids, col-2 has one less amino acid between the third and fourth helical regions and one more amino acid at the extreme 3’ end. After these adjustments are made, several striking homologies become evident (Figure 5). The nine cysteines in col-1 are in precisely the same positions as the nine cysteines in col-2. Ten of the 11 lysines in col-1 are in the same positions as ten of the 14 lysines in col-2. Two of the four other lysines in col-2 correspond to arginines in col-1, and therefore represent conservative amino acid changes. The overall nucleotide homology between col-1 and col-2 is 67%, and the overall amino acid homology is 59%.
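The triple-helical region sizes reported above can be tallied directly, and the rule that defines such a region (glycine at every third position) is easy to express as a scan over a protein sequence. The sketch below is illustrative only: `glyxy_runs`, its `min_triplets` threshold and the toy peptide are assumptions made for the example, and the greedy left-to-right scan is not necessarily how the regions in Figures 3 to 5 were delimited.

```python
def glyxy_runs(protein: str, min_triplets: int = 2) -> list[tuple[int, int]]:
    """Return (start, length) for runs of consecutive Gly-X-Y triplets."""
    runs = []
    i = 0
    while i < len(protein):
        n = 0
        # Count whole triplets whose first residue is glycine.
        while i + 3 * n + 3 <= len(protein) and protein[i + 3 * n] == "G":
            n += 1
        if n >= min_triplets:
            runs.append((i, 3 * n))
            i += 3 * n
        else:
            i += 1
    return runs


# Region sizes (amino acids) as stated in the text.
COL1_HELICAL = [30, 54, 66]        # first; second + third; fourth
COL2_HELICAL = [30, 54, 24 + 42]   # first; second + third; fourth + fifth

for name, regions in (("col-1", COL1_HELICAL), ("col-2", COL2_HELICAL)):
    total = sum(regions)
    print(f"{name}: {total} helical amino acids = {total // 3} Gly-X-Y triplets")

# Toy peptide (invented): two helical runs separated by a two-residue break.
print(glyxy_runs("MKAGPPGAPGKPCEGPAGPP"))
```

Both totals come to 150 amino acids, that is, 50 Gly-X-Y triplets per gene.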
A graph of amino acid homology derived by scanning with a sliding window of seven amino acids (three to each side of the indicated amino acid) is presented in Figure 5. Regions that have more than average (>59%) homology are shaded. The constraint of having glycine as every third amino acid in the triple-helical regions increases their overall level of homology. Extensive homology exists for regions in which both col-1 and col-2 contain cysteines or lysines or both. In addition, from the beginning of the first triple-helical region to the beginning of the second, and at the carboxyl ends of the molecules, are long regions with extensive homology. The last 24 amino acids at the carboxyl end of col-1 are identical with the last 25 amino acids at the carboxyl end of col-2, with the exceptions of a leucine-isoleucine difference and the presence of three arginines at the terminus of col-2 where there are only two arginines
Figure 3. Nucleotide and Amino Acid Sequence of the C. elegans Collagen Gene col-1
The complete nucleotide sequence of col-1 from the proposed initiator methionine to the termination codon is presented. Introns are indicated by lowercase letters. Amino acids indicated are derived from the nucleotide sequence. Potential triple-helical regions, which contain glycine as every third amino acid, are underlined.
Figure 4. Nucleotide and Amino Acid Sequence of the C. elegans Collagen Gene col-2
The complete nucleotide sequence of col-2 from the proposed initiator methionine to the termination codon is presented. Introns are indicated by lowercase letters. Amino acids indicated are derived from the nucleotide sequence. Potential triple-helical regions, which contain glycine as every third amino acid, are underlined.
in col-1. The region with the least homology is a stretch of about 35 amino acids preceding the first triple-helical region.
The codon usage in col-1 and col-2 exhibits a highly skewed pattern. The third positions of glycine and proline codons may be occupied by any of the four nucleotides, but both col-1 and col-2 show extreme preference for adenine in the third position. In col-1,
Figure 5. Schematic Structural Representations and Amino Acid Homologies of col-1 and col-2
(A) The amino acid sequences of col-1 and col-2, as derived from the nucleotide sequences, are schematically presented. The boxed, hatched areas represent triple-helical regions, where glycine is every third amino acid. Nonhelical regions are represented by the heavy horizontal line. The numerals indicate the number of amino acids in each region. Cysteine residues are indicated by dashed vertical lines, and lysine residues by solid vertical lines. To align col-1 and col-2 for maximum homology, five amino acids (indicated by the triangle) have been looped out of col-2 between the amino terminus and the first triple-helical region and a one amino acid gap was made in col-2 immediately in front of the fifth triple-helical region. (B) A graph of amino acid homology between col-1 and col-2 was constructed by examining windows of seven amino acids. This graph is in register with the schematic figure above. A comparison is made between col-1 and col-2 for each amino acid and the three amino acids to each side of it. Under the position of the central amino acid, the number of homologous amino acids is plotted (0-7). Regions of the graph with more than average homology (59%, or 4.1 of 7 amino acids) are shaded. The values for the windows centered on the terminal amino acids, which do not have three amino acids to each side, are not included in the graph.
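The seven-residue sliding-window comparison described in the legend of Figure 5B can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `window_identity`, the use of "-" for alignment gaps, the choice to score a gap as a mismatch, and the toy aligned peptides are all introduced here and are not taken from the original analysis.

```python
def window_identity(a: str, b: str, half_width: int = 3) -> list[int]:
    """Count identical residues in each full window centered on a position.

    `a` and `b` are pre-aligned sequences of equal length. Positions without
    `half_width` residues on each side are skipped, as in the figure legend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: a gap character never counts as a match.
    matches = [int(x == y and x != "-") for x, y in zip(a, b)]
    return [
        sum(matches[i - half_width:i + half_width + 1])
        for i in range(half_width, len(a) - half_width)
    ]


AVERAGE_OF_SEVEN = 0.59 * 7  # about 4.1 of 7, the shading threshold

# Toy aligned peptides (invented for illustration).
scores = window_identity("GPPGAPGKPC", "GPAGSPGKPC")
print(scores)
print([s > AVERAGE_OF_SEVEN for s in scores])
```

Because the counts are whole numbers, a window exceeds the 4.1-of-7 average only when at least five of its seven positions are identical.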
57 of 60 glycine codons end with adenine, as do 72 of 74 proline codons. In col-2, 71 of 78 glycine codons and 55 of 59 proline codons end with adenine. The combined third position usages for the glycine and proline codons for both col-1 and col-2 are 94.1% adenine, 1.8% uridine, 2.6% cytosine and 1.5% guanine. The codon usage for other amino acids does not show this bias. Rather, when the other amino acids that can utilize any third position nucleotide are analyzed, they appear to prefer uridine and cytosine in the third position: 11.3% adenine, 40.6% uridine, 40.0% cytosine and 8.1% guanine.
Discussion
Comparison of col-1 and col-2 with Vertebrate Collagens
The structures of the col-1 and col-2 collagen genes from C. elegans differ significantly from the structures of vertebrate collagen genes. Whereas the vertebrate collagen genes are interrupted by more than 50 introns, the C. elegans genes contain only one or two. The recently published nucleotide sequence of a portion of a Drosophila collagen gene also does not contain a large number of introns (Monson et al., 1982). Furthermore, the vertebrate collagen gene introns are located mostly within the triple-helical coding region of the gene, but in the C. elegans genes none of the introns is within the helical coding regions. The sizes of sequenced exons in the helical coding region of the chicken pro-α2(I) collagen gene are related to a 54 bp length. This finding prompted the proposal that collagen genes were formed by duplications of an ancestral 54 bp helical coding sequence (Yamada et al., 1980). We cannot detect any structure in the C. elegans col-1 or col-2 genes that appears related to such a 54 bp sequence.
One explanation proposed for the large number of introns in vertebrate collagen genes is that they interrupt the highly repetitive GC-rich nucleotide sequences of the triple-helical coding region, and therefore reduce the probability of homologous unequal recombination. Such unequal recombination would cause rapid alterations in the sizes of collagen genes. It is conceivable that the small interruptions of the C. elegans triple-helical regions by nonhelical amino acid regions achieve the same effect as introns, namely, interruption of the repetitive, homologous nucleotide sequences. However, these nonrepeating sequences occur less frequently in C. elegans collagen genes than do the introns in vertebrate collagen genes, and they are also much shorter than the vertebrate introns.
These C. elegans collagens and vertebrate collagens also differ significantly in protein structure. Vertebrate interstitial collagens consist of 300 or more contiguous Gly-X-Y repeats, flanked by nonhelical telopeptides. The helical regions of col-1 and col-2 are much shorter, each containing a total of 50 Gly-X-Y repeats. In contrast with the single, long, continuous Gly-X-Y region in vertebrate collagens, col-1 and col-2 contain several Gly-X-Y regions separated by
short stretches that do not have glycine every third amino acid. The continuous triple helix of vertebrate collagen results in a molecule that acts as a long, rigid rod. The interruptions in the C. elegans collagens may allow bending to occur in the molecule. Alternatively, the short interruptions may have little or no effect on the formation of a triple-helical structure.
Vertebrate collagens are initially synthesized as preprocollagens that contain a signal sequence and amino-terminal and carboxy-terminal propeptides that are cleaved off during formation of the mature collagen (Fessler and Fessler, 1978). We have no information as to whether C. elegans collagens are initially formed as preprocollagens. The amino-terminal amino acids of col-1 and col-2 are unlikely to function as signal sequences, since five of the first ten amino acids are charged. However, following these first ten amino acids are stretches of hydrophobic amino acids, in both col-1 and col-2, that could function as signal sequences. An internal signal sequence has been demonstrated in chicken ovalbumin (Lingappa et al., 1979; Davis and Tai, 1980).
Comparison of col-1 and col-2
The col-1 and col-2 genes are closely related by several criteria. The total triple-helical coding regions for both genes are identical in length, and the total lengths of the two proteins differ by only five amino acids. Overall amino acid homology is 59%, and nucleotide homology is 67%. The identical positions of all of the cysteines in col-1 and col-2 suggest that disulfide crosslinking is an important, highly conserved aspect of these proteins. Indeed, studies of the cuticle collagens of C. elegans (Cox et al., 1981b; Ouazana and Herbage, 1981) and other nematodes (McBride and Harrington, 1967a; Leushner et al., 1979) have demonstrated that they are highly disulfide crosslinked.
The identical positions of most of the lysines in these two genes also suggest their importance in interchain crosslinking of collagen polypeptides. Covalent crosslinks involving lysine residues are very important to collagen structure in vertebrates (Bornstein and Traub, 1979). The fact that cysteines and shared lysines are in regions of more than average amino acid homology in col-1 and col-2 is further evidence that they are functionally important in these molecules. Though we do not know whether lysine crosslinking occurs in C. elegans collagen, it may explain the discrepancy in size between the 30,000 dalton proteins that would be encoded by col-1 and col-2 and the in vivo cuticle collagens, which give a series of bands on one-dimensional SDS gels ranging from 50,000 to 200,000 daltons (Cox et al., 1981b, 1981c; Ouazana and Herbage, 1981).
Previous studies on the cuticle collagen of the parasitic nematode Ascaris suggested a triple-helical structure derived from a single polypeptide chain that folded back on itself twice (McBride and Harrington, 1967a, 1967b). It is conceivable that these C. elegans collagens could assume such a structure by bending within their nonhelical regions and aligning their (Gly-X-Y)n regions. If this occurs, the proteins would assume a very compact structure because of the relatively short runs of Gly-X-Y.
Data from whole genome Southern blots and library screenings indicate that the collagens are a large multigene family in C. elegans. We estimate that 50 or more collagen genes are present in the genome (unpublished results). We believe that col-1 and col-2 are representative of the collagens involved in cuticle formation. The cuticle contains many different collagen species and accounts for at least 5% of the worm’s mass (Cox et al., 1981b, 1981c; Ouazana and Herbage, 1981). Most cuticle components are synthesized during a 3-4 hr period preceding each molt, accounting for approximately 10% of total protein synthesis during these periods (Cox et al., 1981a). Thus the mRNAs for cuticle collagens are expected to be abundant. Both col-1 and col-2 specifically hybridize to the most abundant class of collagen mRNA, approximately 1200 nucleotides in length, supporting the contention that they are cuticle collagens. Experiments with a small gene-specific probe from col-1 and S1 mapping of col-1 demonstrate that it is transcribed into an RNA of this abundant size class (unpublished results). The major in vitro translation products of C. elegans poly(A) RNA that are collagenase-sensitive are 30,000 to 40,000 daltons, consistent with the size of the major collagen mRNA class (J. Politz, personal communication). The unusual properties of col-1 and col-2 as compared with vertebrate collagens may reflect the unique functional requirements of the nematode cuticle.
Experimental Procedures
C. elegans DNA Libraries
The construction of an Eco RI partial digest library in λ Charon 10 has been described elsewhere (J. Files, S. Carr and D. Hirsh, submitted manuscript). A Sau 3A partial digest library of C. elegans DNA in λ1059 was provided by J. Karn (Karn et al., 1980). The phage libraries were screened by the method of Benton and Davis (1977).
Nucleic Acid Hybridizations
The chicken pro-α2(I) collagen cDNA clone pCg45 was provided by H. Boedtker (Lehrach et al., 1978). It contains the carboxy-terminal propeptide region and approximately half of the triple-helical region of the chicken pro-α2 collagen message, cloned into pBR322. Hybridization probes were labeled with 32P by nick translation (Rigby et al., 1977). Methods for the preparation of C. elegans Ll DNA, restriction enzyme digests, Southern transfers and hybridization conditions were all as described by Emmons et al. (1979).
DNA Sequence Determination
Sequence analysis was performed essentially as described by Maxam and Gilbert (1980). Fragments were labeled at the 5’ end with T4 polynucleotide kinase or at the 3’ end with DNA polymerase I (Klenow fragment) and either strand-separated or cut with a second restriction enzyme. Reaction products were separated on polyacrylamide-urea gels of 40 cm (Maxam and Gilbert, 1980) or 60 cm (Smith and Calvo, 1980).
Acknowledgments
We are grateful to H. Boedtker for providing pCg45 and to J. White for initial discussions of this research. We thank J. Kop for advice on DNA sequencing techniques. This research was supported by a National Institutes of Health research grant to D. H. J. M. K. is a fellow of the Helen Hay Whitney Foundation. G. N. C. is a fellow of the Jane Coffin Childs Memorial Fund for Medical Research. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Received April 26, 1981; revised June 7, 1982

References
Adams, E. (1978). Invertebrate collagens. Marked differences from vertebrate collagens appear in only a few invertebrate groups. Science 202, 591-598.
Benoist, C., O'Hare, K., Breathnach, R. and Chambon, P. (1980). The ovalbumin gene-sequence of putative control regions. Nucl. Acids Res. 8, 127-142.
Benton, W. D. and Davis, R. W. (1977). Screening λgt recombinant clones by hybridization to single plaques in situ. Science 196, 180-182.
Bornstein, P. and Sage, H. (1980). Structurally distinct collagen types. Ann. Rev. Biochem. 49, 957-1003.
Bornstein, P. and Traub, W. (1979). The chemistry and biology of collagen. In The Proteins, 4, H. Neurath and R. L. Hill, eds. (New York: Academic Press), pp. 411-632.
Boyd, C. D., Tolstoshev, P., Schafer, M. P., Trapnell, B. C., Coon, H. C., Kretschmer, P. J., Nienhuis, A. W. and Crystal, R. G. (1980). Isolation and characterization of a 15-kilobase genomic sequence coding for a part of the pro α2 chain of sheep type I collagen. J. Biol. Chem. 255, 3212-3220.
Corden, J., Wasylyk, B., Buchwalder, A., Sassone-Corsi, P., Kedinger, C. and Chambon, P. (1980). Promoter sequences of eukaryotic protein-coding genes. Science 209, 1406-1414.
Cox, G. N., Laufer, J. S., Kusch, M. and Edgar, R. S. (1980). Genetic and phenotypic characterization of roller mutants of Caenorhabditis elegans. Genetics 95, 317-339.
Cox, G. N., Kusch, M., DeNevi, K. and Edgar, R. S. (1981a). Temporal regulation of cuticle synthesis during development of Caenorhabditis elegans. Dev. Biol. 84, 277-285.
Cox, G. N., Kusch, M. and Edgar, R. S. (1981b). Cuticle of Caenorhabditis elegans: its isolation and partial characterization. J. Cell Biol. 90, 7-17.
Cox, G. N., Staprans, S. and Edgar, R. S. (1981c). The cuticle of Caenorhabditis elegans. II. Stage-specific changes in ultrastructure and protein composition during postembryonic development. Dev. Biol. 86, 456-470.
Davis, B. D. and Tai, P.-C. (1980). The mechanism of protein secretion across membranes. Nature 283, 433-438.
Dixit, S. N., Mainardi, C. L., Seyer, J. M. and Kang, A. H. (1979). Covalent structure of collagen: amino acid sequence of α2-CB5 of chick skin collagen containing the animal collagenase cleavage site. Biochemistry 18, 5416-5422.
Emmons, S. W., Klass, M. R. and Hirsh, D. (1979). Analysis of the constancy of DNA sequences during development and evolution of the nematode Caenorhabditis elegans. Proc. Nat. Acad. Sci. USA 76, 1333-1337.
Fessler, J. H. and Fessler, L. I. (1978). Biosynthesis of procollagen. Ann. Rev. Biochem. 47, 129-162.
Gross, J. (1963). Comparative biochemistry of collagen. In Comparative Biochemistry, 5, M. Florkin and H. S. Mason, eds. (New York: Academic Press), pp. 307-346.
Higgins, B. J. and Hirsh, D. (1977). Roller mutants of the nematode Caenorhabditis elegans. Mol. Gen. Genet. 150, 63-72.
Hofmann, H., Fietzek, P. P. and Kuhn, K. (1978). The role of polar and hydrophobic interactions for the molecular packing of type I collagen: a three-dimensional evaluation of the amino acid sequence. J. Mol. Biol. 125, 137-165.
Hulmes, D. J. S., Miller, A., Parry, D. A. D., Piez, K. A. and Woodhead-Galloway, J. (1973). Analysis of the primary structure of collagen for the origins of molecular packing. J. Mol. Biol. 79, 137-148.
Kang, A. H., Dixit, S. N., Corbett, C. and Gross, J. (1975). The covalent structure of collagen. Amino acid sequence of α1-CB5 glycopeptide and α1-CB4 from chick skin collagen. J. Biol. Chem. 250, 7428-7434.
Karn, J., Brenner, S., Barnett, L. and Cesareni, G. (1980). Novel bacteriophage λ cloning vector. Proc. Nat. Acad. Sci. USA 77, 5172-5176.
Lehrach, H., Frischauf, A. M., Hanahan, D., Wozney, J., Fuller, F., Crkvenjakov, R., Boedtker, H. and Doty, P. (1978). Construction and characterization of a 2.5-kilobase procollagen clone. Proc. Nat. Acad. Sci. USA 75, 5417-5421.
Leushner, J. R. A., Semple, N. L. and Pasternak, J. (1979). Isolation and characterization of the cuticle from the free-living nematode Panagrellus silusiae. Biochem. Biophys. Acta 580, 168-174.
Lingappa, V. R., Lingappa, J. R. and Blobel, G. (1979). Chicken ovalbumin contains an internal signal sequence. Nature 281, 117-121.
Maxam, A. M. and Gilbert, W. (1980). Sequencing end-labeled DNA with base-specific chemical cleavages. Meth. Enzymol. 65, 499-560.
McBride, O. W. and Harrington, W. F. (1967a). Ascaris cuticle collagen: on the disulfide cross-linkages and the molecular properties of the subunits. Biochemistry 6, 1484-1498.
McBride, O. W. and Harrington, W. F. (1967b). Helix-coil transition in collagen. Evidence for a single-stranded triple helix. Biochemistry 6, 1499-1514.
Monson, J. M., Natzle, J., Friedman, J. and McCarthy, B. J. (1982). Expression and novel structure of a collagen gene in Drosophila. Proc. Nat. Acad. Sci. USA 79, 1761-1785.
Ohkubo, H., Vogeli, G., Mudryj, M., Avvedimento, V. E., Sullivan, M., Pastan, I. and de Crombrugghe, B. (1980). Isolation and characterization of overlapping genomic clones covering the chicken α2 (type I) collagen gene. Proc. Nat. Acad. Sci. USA 77, 7059-7063.
Ouazana, R. and Herbage, D. (1981). Biochemical characterization of the cuticle collagen of the nematode Caenorhabditis elegans. Biochem. Biophys. Acta 669, 236-243.
Proudfoot, N. J. and Brownlee, G. G. (1976). 3' Non-coding region sequences in eukaryotic messenger RNA. Nature 263, 211-214.
Ramachandran, G. N. (1967). Structure of collagen at the molecular level. In Treatise on Collagen, 1, G. N. Ramachandran, ed. (New York: Academic Press), pp. 103-183.
Rigby, P. W. J., Dieckmann, M., Rhodes, C. and Berg, P. (1977). Labeling deoxyribonucleic acid to high specific activity in vitro by nick translation with DNA polymerase I. J. Mol. Biol. 113, 237-251.
Schafer, M. P., Boyd, C. D., Tolstoshev, P. and Crystal, R. G. (1980). Structural organization of a 17KB segment of the α2 collagen gene: evaluation by R loop mapping. Nucl. Acids Res. 8, 2241-2253.
Sharp, P. A. (1981). Speculations on RNA splicing. Cell 23, 643-646.
Smith, D. R. and Calvo, J. M. (1980). Nucleotide sequence of the E. coli gene coding for dihydrofolate reductase. Nucl. Acids Res. 8, 2255-2274.
Wozney, J., Hanahan, D., Tate, V., Boedtker, H. and Doty, P. (1981). Structure of the pro α2(I) collagen gene. Nature 294, 129-135.
Yamada, Y., Avvedimento, V. E., Mudryj, M., Ohkubo, H., Vogeli, G., Irani, M., Pastan, I. and de Crombrugghe, B. (1980). The collagen gene: evidence for its evolutionary assembly by amplification of a DNA segment containing an exon of 54 bp. Cell 22, 887-892.