MOLECULAR PHYLOGENETICS AND EVOLUTION
Vol. 5, No. 1, February, pp. 220–231, 1996 ARTICLE NO. 0015
Private Polymorphisms: How Many? How Old? How Useful for Genetic Taxonomies?
E. A. THOMPSON* AND J. V. NEEL†
*Department of Statistics, University of Washington, Seattle, Washington 98195; and †Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109
Received June 14, 1995
The data on the distribution and frequencies of private polymorphisms in the tribal populations of Central and South America are used to assess the extent to which such data can inform questions of phylogenetic history. It is shown that, owing to the great increases in population number that accompanied agricultural development, most private polymorphisms have arisen since population settlement and tribal differentiation. Conversely, the absence of Amerindian variants of wide distribution confirms the small size of the hemispheric population until relatively recent times. Patterns of recent population decline and recovery that accompanied European contact since 1492 have also had a strong impact on the age distribution of extant variants, eliminating many that were relatively young in 1492. The majority of surviving variants that have achieved polymorphic frequencies in a tribe or group of tribes are from 100 to 400 generations old (2500 to 10,000 years). Such genetic variants thus characterize tribes, or groups of closely related tribes, but do not provide a greater time depth of phylogenetic history. © 1996 Academic Press, Inc.
INTRODUCTION
Genetic taxonomies/phylogenies have usually been based on variants widely dispersed over populations, with the genetic distances between populations based on differences in the frequencies of shared alleles. In our studies of tribal Amerindian populations, we have been impressed by the frequency of apparently unique variants restricted to one or several related or adjacent tribes. Where the frequency of the variant allele within the tribe is ≥ 0.01, we have referred to the variant as a private polymorphism (PP). These variants result from mutations occurring at some point in tribal history; in principle, the more remote in time the mutational event, the greater the number of tribes sharing the novel variant. This fact gives rise to the possibility of a taxonomy based on shared private polymorphisms, a qualitative approach in contrast to the quantitative approach of the customary taxonomies.
1055-7903/96 $18.00 Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
The validity of this expectation depends on the distribution of shared variants among populations, which in turn depends strongly on the number of available extant variants and on their age distribution. In this presentation, we will explore the expected distribution of variants, based not only on a survey of electrophoretic variants of proteins in Amerindian tribes but also on an analysis of Amerindian demography and presumed prehistory. We demonstrate that the age distribution of extant variants is such that the expectation is for a concentration of unique variants in single tribes, or a related group of tribes, but little expectation of a hierarchy beyond a tribal group. These findings have implications for expectations regarding the distribution of variant alleles in others of the world’s tribal populations and for the type of variant distribution to be expected as a consequence of the tribal amalgamations that have created modern populations.
THE DATA AVAILABLE FOR THIS TREATMENT
Data on Private Polymorphisms
Between 1962 and 1985, our group surveyed 20 different American Indian tribes of Central and South America with respect to the occurrence of electrophoretic variants in a battery of some 25 plasma and erythrocyte proteins. These variants presumably all reflect base pair substitutions in DNA, so that the considerations that follow are fully applicable to DNA-based studies. All the tribes were technically classified as Amerindians, i.e., the descendants of the first wave of immigration (as contrasted to the later-arriving Na-dene groups). Twelve of the tribes were located in South America and had been selected on the basis of minimal acculturation. Their locations are shown in Fig. 1. The findings with respect to private polymorphisms in eight of the tribes are given in Table 1 (Refs. in Neel, 1978a; Neel et al., 1980). The other four tribes (the Makiritare, Piaroa, Kraho, and Ayoreo) were extensively sampled, but no variants were found.
In Table 1, we follow the linguistic classification of Greenberg (1987). For these 12 tribes, admixture with non-Indians was generally negligible and in all instances <5%.
FIG. 1. Locations of the 12 South American tribes sampled for the presence of variant alleles, plus the Guaymi of Central America. GUA, Guaymi; PIA, Piaroa; MAK, Makiritare; MAC, Macushi; YAN, Yanomama; BAN, Baniwa; WAP, Wapishana; TIC, Ticuna; PAN, Pano; CAY, Cayapo; KRA, Kraho; XAV, Xavante; AYO, Ayoreo.
On the other hand, the eight tribes of Central America were all drawn from a single linguistic group (Chibcha) and represented a higher degree of acculturation and admixture than the South American tribes. Their location is shown in Fig. 2, and the findings regarding private polymorphisms are shown in Table 2 (Refs. in Thompson et al., 1992). In Fig. 2, we have graphically indicated the distributions of the four private polymorphisms encountered in the Central American tribes. There is some ambiguity as to the pre-Columbian location of the Teribe, in whom no variants were found. A failure to locate this tribe properly, or its recent history of severe decimation, could explain the anomaly that none of the four private polymorphisms were encountered in this apparently centrally located tribe.
Tables 1 and 2 include the variables that we will need for our analysis, which must take into account the recent demographic history of each tribe since 1492, the time of first contact (FC) of Amerindians with Europeans. To varying degrees, tribes thereafter underwent a severe population crash before recovering to their current sizes. Our analysis will be in terms of the one-generation tribal size (N); we place the number in the adult (reproducing) generation of the tribe at approximately 50% of the total tribal size. The sample sizes are the numbers of sampled alleles (twice the number of individuals sampled), and the sample frequency (p̂) of each variant in each tribe is also given. Together with N, this provides a crude estimate of the number of copies of each variant in the adult generation of each tribe. Only a single electrophoretic system was used in the examination of each protein for electrophoretic variants; since additional variants are often detected when multiple electrophoretic systems are employed, this estimate of private polymorphism frequency is a minimum.
The salient point is that whereas there are only two apparent instances of tribal sharing of the 8 PP encountered in the 78 paired comparisons among 12 South American tribes, there are 9 instances of sharing of the 4 PP encountered in 28 paired comparisons among the 8 Chibcha-speaking tribes. The first instance of a shared variant in the South American tribes involves a variant of ceruloplasmin shared by two Gé-speaking
TABLE 1
Estimated Number of Alleles for Each of the Electrophoretically Defined Private Polymorphisms Encountered in 12 Tribes Tested on Average for 25 Systems

Allele       Tribe          Linguistic grouping (a)               Estimated   Size       Present  N (b)   Sample     Allele     Estimated
                                                                  size at FC  following  tribal           size       frequency  No. copies
                                                                              crash      size             (alleles)  (p̂)        in N
ALB*YAN2     Yanomama       Chibcha–Paezan (Chibcha)              5,000       5,000      15,000   7,500   7,008      0.074      1,110
ALB*MAKU     Wapishana      Equatorial–Tucanoan (Equatorial)      2,000       1,000      2,000    1,000   1,246      0.019      38
ACP1*TIC1    Ticuna         Equatorial–Tucanoan (Macro–Tucanoan)  6,000       2,000      15,000   7,500   3,526      0.111      1,665
CA II*BAN1   Baniwa         Equatorial–Tucanoan (Equatorial)      1,500       750        1,500    750     754        0.054      81
CRPL*A-CAY1  Cayapo         Gé-Pano–Carib (Macro–Gé)              1,500       750        1,500    750     1,336      0.050      75
             Xavante        Gé-Pano–Carib (Macro–Gé)              1,700       850        1,700    850     914        0.008      14
             (both tribes)                                                                                                      89
ESA1*D-MAC1  Macushi        Gé-Pano–Carib (Macro–Carib)           4,000       2,000      4,000    2,000   996        0.049      196
             Wapishana      Equatorial–Tucanoan (Equatorial)      2,000       1,000      2,000    1,000   1,228      0.024      48
             (both tribes)                                                                                                      244
PEPA*2-WAP1  Wapishana      Equatorial–Tucanoan (Equatorial)      2,000       1,000      2,000    1,000   1,228      0.016      32
PEPB*PAN1    Central Pano   Gé-Pano–Carib (Pano)                  18,000      9,000      18,000   9,000   670        0.024      432

Note. References to detailed descriptions of each of these variants are to be found in Neel (1978).
(a) Following the classification of Greenberg (1987).
(b) Estimated number in adult generation, calculated as 50% of tribal size.
FIG. 2. Locations of the eight Chibcha-speaking tribes sampled and the tribal distributions of the four “private” electrophoretic polymorphisms that were identified.
tribes of the Brazilian Mato Grosso. The second, a variant of esterase A, involves the Macushi and Wapishana, adjacent tribes speaking different languages but between whom there is well-documented recent intermarriage (Neel et al., 1977); we feel that this is clearly an example of recent diffusion of a variant through intermarriage. The shared PPs in the Chibcha all involve tribes that in the pre-Columbian era were contiguous. (We recognize that a conclusion regarding the identity of electrophoretic variants based solely on electrophoretic mobility is generally weak, but in this situation it would be a remarkable coincidence if two electrophoretically identical variants were due to different mutations.)
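The copy numbers in the last column of Tables 1 and 2 appear to follow from the adult-generation size and the sample frequency: with two alleles per adult, the estimate is roughly 2 × N × p̂. The Python sketch below reproduces a few rows of Table 1 under that assumption; the function name `estimated_copies` and the choice of rows are illustrative and are not taken from the original analysis.

```python
# Rows from Table 1: (allele, tribe, adult-generation size N, sample frequency p_hat)
TABLE_1_ROWS = [
    ("ALB*YAN2", "Yanomama", 7500, 0.074),
    ("ALB*MAKU", "Wapishana", 1000, 0.019),
    ("ACP1*TIC1", "Ticuna", 7500, 0.111),
    ("PEPB*PAN1", "Central Pano", 9000, 0.024),
]


def estimated_copies(n_adults: int, p_hat: float) -> int:
    """Crude copy-number estimate: two alleles per adult times the frequency."""
    return round(2 * n_adults * p_hat)


for allele, tribe, n_adults, p_hat in TABLE_1_ROWS:
    copies = estimated_copies(n_adults, p_hat)
    print(f"{allele:10s} {tribe:13s} {copies}")
```

Because p̂ comes from a sample of a few hundred to a few thousand alleles, these figures are order-of-magnitude guides rather than census counts.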
Data on the Amerindian History
Our analysis of these data will draw on a number of population parameters developed in previous studies of these Amerindians. The ultimate origin and the time of arrival of the first wave of American Indians have been matters of active debate for years. The two favored dates of arrival of this first wave have been about 30,000 years before present (ybp) and about 15,000 ybp. Climatic conditions during the last Ice Age render an arrival and escape from the Alaskan end of the Bering Land Bridge into the North American plains in the years between these two dates very unlikely. Recently, on the basis of an mtDNA “clock” derived from a study of
TABLE 2
Estimated Number of Alleles for Each of the Electrophoretically Defined Private Polymorphisms Encountered in Chibcha-Speaking Tribes of Central America Tested for Systems

Allele     Tribe     Estimated   Size       Present  N (a)   Sample     Allele     Estimated
                     size at FC  following  tribal           size       frequency  No. copies
                                 crash      size             (alleles)  (p̂)        in N
TF*D-GUA   Boruca    30,000      1,000      1,000    500     124        0.056      56
           Bribri    15,000      750        5,000    2,500   526        0.059      245
           Cabecar   15,000      750        4,000    2,000   282        0.078      312
ACP*GUA    Kuna      5,000       2,500      35,000   17,500  228        0.048      1,680
           Guaymi    10,000      5,000      55,000   27,500  1,118      0.157      8,635
           Bokota    5,000       250        1,500    750     230        0.078      117
TPI*3      Bribri    15,000      750        5,000    2,500   530        0.032      160
           Cabecar   15,000      750        4,000    2,000   286        0.028      112
           Guatuso   1,500       75         300      150     166        0.024      7
LDHB*GUA   Guaymi    10,000      5,000      55,000   27,500  1,122      0.090      4,950

Note. References to detailed descriptions of each of these variants are to be found in Thompson et al. (1992).
(a) N is size of adult generation, calculated as 50% of total population.
the Chibcha tribes, we have suggested that the earlier arrival date is the more probable (Torroni et al., 1994). Furthermore, from a combination of seroepidemiological data on the distribution of the human T-cell lymphotropic virus type II in the Americas and Siberia, as well as data on the mtDNA of the tribes and ethnic groups of the same area, we have suggested that the most direct ancestors of the Amerindians were not the ancestors of the current Siberian ethnic groups, but were derived from the then tribal groups of the region of the present extreme southeast Siberia/northern Manchuria/Mongolia (Neel et al., 1994). Migration of these ancestral groups northward to the base of the Bering Land Bridge is estimated to have consumed some 10,000 years, so that the time depth for the Amerindian appropriate to this treatment becomes some 40,000 years. We take one generation to average 25 years throughout our analysis. Time is measured in years before present or generations before present (gbp).
Data on Mutation Rates
The number of variants to be expected clearly depends on the mutation rate. Given the foregoing data, we have undertaken on several occasions to calculate the rate of mutation to electrophoretic variants most consistent with the accumulated data. A variety of approaches have been employed, all indirect. Neel and Thompson (1978) suggested that the rate of mutation resulting in electrophoretic variants most consistent with the observed number of variants and their frequency in the 12 tribes of Table 1 was 0.7 × 10⁻⁵/locus/generation. Neel and Rothman (1978), applying three different methods (of Kimura-Ohta, of Nei, and of Rothman-Adams), obtained estimates of 1.43, 1.69, and 1.71 × 10⁻⁵, respectively. Finally, Chakraborty and Neel (1989) developed a method for the simultaneous estimation of effective population size and mutation rate based on the Amerindian data and estimated the average locus rate/generation to be 1.1 × 10⁻⁵.
In retrospect, these estimates are probably somewhat high because they did not make adequate allowance for the past population crashes experienced by some of the tribes. The rate with which spontaneous mutation results in electrophoretic variants of essentially the same proteins examined in the studies of Amerindians has been estimated by direct counting in another Mongoloid population (Japanese); the value obtained was 0.6 × 10⁻⁵/locus/generation (Neel et al., 1986). In our analysis, we will work with two mutation rates, 1 × 10⁻⁵/locus/generation and 0.5 × 10⁻⁵/locus/generation, which we believe provide upper and lower bounds on the likely true value.
A MODEL OF AMERINDIAN DEMOGRAPHIC HISTORY
The foregoing facts have been assembled into a general model of Amerindian demographic history (Fig. 3).
This model, based on a complex of observation and theory, will undoubtedly undergo modification in the light of future developments, but even in its present form it can serve as the basis for some calculations not heretofore pursued. The primary demographic parameters are mean rates of population increase, which, together with population numbers at some point in time, determine total numbers at all epochs and hence the base for the input of new variant alleles. The one-generation (adult) number is assumed to be one-half of the total population size. Initial calculations assume the mean rates of increase given below, which are also summarized in the two history scenarios given in Table 3. Figure 4 shows the population size scenarios on a logarithmic scale. Following these calculations, the effect on the results of certain reasonable departures from the model will be pursued. It will be shown that the predictions from the model are reasonably robust to a variety of perturbations in the model. As pictured, the model implies a continuous process of fissioning as the Amerindian groups expanded. While this was indeed the dominant pattern, there was undoubtedly fusion of groups as their fortunes varied; i.e., we do not exclude rhizotic events in the evolution of Amerindian groups (cf. Moore, 1994) but relegate to them a relatively minor role. The model recognizes four epochs in Amerindian history.
1. Period I, The Nomadic Epoch: 40,000–24,000 ybp
The number of individuals involved in the migration from southeastern Siberia/Manchuria/Mongolia across the Bering Land Bridge to the Great Plains south of the northern ice is utterly speculative. Given, however, that humans were apparently just beginning to extend into the area of origin 40,000 ybp (Fiedel, 1987), we will assume that the total number of individuals involved in the founding population was 750–1000.
Following Deevey’s (1960) estimate of the very slow rate of population increase during the Upper Paleolithic and Mesolithic Periods, we will estimate that this number had increased to 1200–3000 by the time the groups crossed the Bering Land Bridge, some 30,000 ybp, and to 2000–10,000 by the time the migrant groups reached the Great Plains perhaps 24,000 ybp. The mean rate of population increase during this period would be very low, rising to 0.5% per generation for the larger population size assumptions, but as low as 0.1% for the lower numbers (Table 3).
2. Periods II and III, The Colonization Epoch: 24,000–8000 ybp
After reaching the generally hospitable areas of the New World, with abundant game, the Amerindians would have spread out, and by 8000 ybp have reached all parts of North, Central, and South America. The assumption of a global Paleolithic preagricultural population density of 0.04 individuals per square kilometer (Deevey, 1960) results in an estimate of the total population of 1.5 million (750,000 adults) by 8000 ybp. Since much of the colonization must have been patchy, with many less habitable areas not occupied, this may be too high, and for our low-numbers scenario we assume only 350,000 individuals by 8000 ybp (Table 3). In either case the mean population rate of increase would have been somewhat higher than in the nomadic period, the value rising from 0.5% to just over 1% per generation. If any of the “early” carbon-14 dates from sites in both North and South America are correct (reviewed in Fiedel (1987)), then this population had to be thinly spread over much of the New World, which would explain the apparent rarity of human artifacts in the early years of Amerindian presence in the New World. We postulate the subdivision of this population into many small endogamous groups, a postulate reinforced by the great linguistic diversification among Amerindian tribes (Nichols, 1990).
FIG. 3. A scenario for the demographic history of the Amerindian. ybp, years before present; gbp, generations before present; Roman numerals, periods as defined in Tables 3 and 4. The lower portion of the figure is not drawn to scale.
3. Periods IV and V, The Precontact Agricultural Epoch: 8000–500 ybp
Although the advent of agriculture permitted much denser populations than could be supported by a hunting–gathering economy, population rates of increase may be little changed. By 8000 ybp, most of the large game mammals had disappeared, and the environment
TABLE 3
Two Postulated Demographic Hypothetical Histories of the Amerindian Population

Years BP   Generations BP  No. of adults               Rate of increase per generation  Where
                           History 1     History 2     History 1       History 2
40,000     1600            500           375                                            Mongolia/Siberia
                                                       1.003           1.0011
30,000     1200            1,500         580                                            Beringia
                                                       1.005           1.0014
24,000     960             5,000         820                                            Great Plains
                                                       up to 1.011     1.0085
8,000      320             750,000       187,000                                        Everywhere in North and South America
                                                       1.011           1.016
500        20              20 Million    20 Million
350        14              2.5 Million   2.5 Million
may indeed have been generally less hospitable than previously. The main effect of incipient agriculture will have been not on population increase, but on the more settled and structured nature of tribal groups. Estimates of the sizes to which the various Amerindian populations had grown by first European contact (500 ybp) have been highly controversial. Denevan (1992), attempting to reconcile all the estimates to date, has placed the total number at 53 million, with probable limits for the estimate of 43 to 65 million. This number has in part been derived from calculations of the carrying capacity of an area rather than from any actual census. Myers (1976) has pointed out that to the extent that there were uninhabited areas (buffer zones) between tribal distributions, this method leads to an overestimate. Denevan’s estimate (Denevan, 1992) also includes the Na-dene-speaking tribes, thought to be the descendants of later arrivals, but these are a small fraction of the total. We will work with a conservative estimate of some 40 million Amerindians at the time of FC. The corresponding overall average rate of population increase during this period is thus 1 to 1.5% (Table 3), and we use this value for the Chibcha-speaking tribes. However, for the tribes of the Amazon Basin and Upper Orinoco drainage listed in Table 1, we believe that the average rate of increase may have been substantially less than this, and we suggest the value of 0.5% assumed for the earlier hunter/gatherer epoch.
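The per-generation rates listed in Table 3 can be checked against the tabulated adult numbers. The Python sketch below assumes a constant geometric rate of increase within each interval, so that the mean rate is (N_end/N_start)^(1/generations); the names `HISTORY`, `mean_rate`, and `interval_rates` are illustrative only.

```python
# (generations BP, adults under History 1, adults under History 2), from Table 3
HISTORY = [
    (1600, 500, 375),
    (1200, 1_500, 580),
    (960, 5_000, 820),
    (320, 750_000, 187_000),
    (20, 20_000_000, 20_000_000),
]


def mean_rate(n_start: float, n_end: float, generations: int) -> float:
    """Constant per-generation multiplier that takes n_start to n_end."""
    return (n_end / n_start) ** (1.0 / generations)


def interval_rates(history_index: int) -> list[tuple[int, int, float]]:
    """Mean rates for History 1 (index 1) or History 2 (index 2), one per interval."""
    rates = []
    for earlier, later in zip(HISTORY, HISTORY[1:]):
        generations = earlier[0] - later[0]
        rate = mean_rate(earlier[history_index], later[history_index], generations)
        rates.append((earlier[0], later[0], rate))
    return rates


for label, index in (("History 1", 1), ("History 2", 2)):
    for start, end, rate in interval_rates(index):
        print(f"{label}: {start}-{end} gbp, mean rate {rate:.4f}")
```

Under History 1 the mean over the whole 960–320 gbp interval comes out below the "up to 1.011" entry of Table 3, which presumably reflects different rates within Periods II and III.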
The Population Crash and Thereafter
The contact with Europeans following 1492 resulted in a population crash which is estimated for all the Americas to have reduced the native population to about 10% of its original number (Denevan, 1992). In many places, decline predates direct Caucasian contact; smallpox entered the New World by 1518. However, in some parts the decline has been more recent, following direct contact. Both the rate of population growth during the period 8000–500 ybp and the decimation of tribes following contact with European diseases and European brutality must have varied greatly from region to region. The population expansion during the period 8000–500 ybp is dominated by events in central Mexico and the central Andean region, and these populations also were among the most decimated in the subsequent etiolation. Some tribes disappeared completely, their remnants being absorbed by neighboring tribes, whereas the tribes of the Upper Amazon and Orinoco basins may be presumed to have been subject to far less numerical reduction. With respect to the tribes of the Amazonian basin and the Upper Orinoco drainage among whom we have worked, we estimate that except for the Yanomama, the decimation was usually no more than 50% of an original size equal to the present size. For those tribes in which private polymorphisms were encountered, Table 1 contains a projection of “first contact”
FIG. 4. Population sizes (on a log scale) over history under the two scenarios of population history of Table 3.
tribal size from present numbers based on that assumption, as well as estimates of postcrash numbers. Similar estimates are available for the four tribes (Makiritare, Piaroa, Kraho, and Ayoreo) in which no private polymorphisms were found (data not shown). With respect to the Yanomama, we have argued that their remote position has spared them the decimation of most tribes (Neel, 1978b). Further, they acquired simple agricultural practices relatively late in their history and have thus only recently undergone the growth that accompanies incipient agriculture. We will estimate that their size at first European contact in 1492 was roughly one-third of their present size, i.e., about 5000.
Turning now to the Chibcha tribes, we encounter quite a different situation. Whereas the tribes of Table 1 are widely dispersed, all of the tribes of Table 2 are closely bunched in their distribution and speak a similar language. Extensive historical data are available concerning these tribes (Barrantes et al., 1990). We have previously published estimates of size at first contact and also present size (Thompson et al., 1992). We now present in Table 2 an estimate of size at the nadir in their numbers, to bring these numbers into congruence with the numbers given for the other set of tribes. It must be emphasized that both the estimate of size at “first contact” and that at the nadir following the crash are quite approximate. The demographic histories of the various tribes since the crash are quite disparate. Among the Chibcha speakers, some have recovered and even exceeded their estimated size at first contact (the Kuna and Guaymi), whereas others are still far below their estimated number at the time of first contact. Where there has been a recovery in number, most of it has occurred in the present century.
A PROBABILITY MODEL FOR VARIANT NUMBERS
The Branching Process Model
We now turn to analysis of numbers of variants to be expected to be extant in the total Amerindian population and in the subsets of Tables 1 and 2. The basic model for variant numbers in this treatment is the branching process of Fisher (1930). Thompson and Neel (1978) used this process, with varying population rates of increase and a zero-modified geometric offspring distribution, to consider survival and numbers of variant alleles in Amerindian populations. Genes are counted at the adult generation; each gene in an individual reaching adulthood is a mutation with probability equal to the mutation rate, and new variants arise at any given generation in history in proportion to the adult-generation population size. The offspring distribution of variant replicates again counts genes as they reach the next generation of adults. A new variant allele may have no replicates at the next generation because the individual has no offspring, because none of his offspring reach adulthood, or because all those that reach adulthood carry the homologous allele. The model is an infinite alleles model; each extant variant is assumed to be of monophyletic origin. The new variant alleles produced by mutation are assumed to be selectively neutral; their fate is that of any random haplotype from the population, and the mean rates of increase applied to each variant are those assumed for the population as a whole. Further, the descendant copies of a new variant allele are assumed to replicate according to a branching process model, independently of each other, independently of other variant alleles, and independently of any overall population constraint. Clearly this is a considerable oversimplification of the population processes that determine the genetic variation observed in a sample of individuals from a current population. However, for variants arising in an expanding population, the model provides a useful indication of the likely numbers and ages of variant alleles in a current population. The specific branching-process offspring distribution assumed, at calendar time t, is

$$p_t(0) = a_t, \qquad p_t(k) = (1 - a_t)(1 - d)\, d^{\,k-1}, \quad k = 1, 2, \ldots \tag{1}$$
Time, t, is measured in generations before present. “Now” = 0 gbp, and we extend our analysis back to 1600 gbp, or 40,000 ybp. The parameter a_t is the probability that a single allele leaves no replicate in the subsequent (adult) generation; conditional on there being replicates (probability (1 − a_t)), the distribution is assumed to be geometric, with geometric parameter d. The zero-modified geometric distribution (1) allows exact calculation of probabilities of multigeneration copy number or extinction. While it does not precisely mimic distributions of variant offspring number derived from family size distributions, it provides a reasonable approximation for suitably chosen values of the two parameters a and d. The mean of the variant offspring distribution is

$$m_t = (1 - a_t)/(1 - d) \tag{2}$$
and changing values of the population rate of increase m_t over time t are effected by changing the value of a_t. For simplicity, the geometric parameter d is assumed to remain constant in time. Following Thompson and Neel (1978) we take d = 0.4 in most of our analyses; however, the robustness of the conclusions to different values of d, with appropriate compensating changes in a_t, is also considered.
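As an illustration of the offspring distribution (1) and its mean (2), the Python sketch below evaluates the probabilities, recovers a_t from a target rate of increase m_t with d fixed, and composes the probability generating function of (1) over successive generations to obtain the probability that a single new variant has no surviving copies. The class and function names are hypothetical, and the convention that rates are listed from the generation of origin toward the present is an assumption of this sketch, not a description of the authors' program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZeroModifiedGeometric:
    """Offspring distribution (1): p(0) = a, p(k) = (1 - a)(1 - d) d**(k - 1)."""

    a: float  # probability that an allele leaves no replicate
    d: float  # geometric parameter

    @classmethod
    def from_mean(cls, m: float, d: float = 0.4) -> "ZeroModifiedGeometric":
        """Invert Eq. (2), m = (1 - a)/(1 - d), to obtain a for a given d."""
        a = 1.0 - m * (1.0 - d)
        if not 0.0 <= a <= 1.0:
            raise ValueError("m and d do not give a valid probability a")
        return cls(a=a, d=d)

    def pmf(self, k: int) -> float:
        if k == 0:
            return self.a
        return (1.0 - self.a) * (1.0 - self.d) * self.d ** (k - 1)

    def mean(self) -> float:
        return (1.0 - self.a) / (1.0 - self.d)

    def pgf(self, s: float) -> float:
        """Probability generating function, the sum over k of p(k) * s**k."""
        return self.a + (1.0 - self.a) * (1.0 - self.d) * s / (1.0 - self.d * s)


def extinction_probability(rates: list[float], d: float = 0.4) -> float:
    """Probability that one new allele has no copies left after len(rates) generations.

    rates[0] applies to the generation of origin and rates[-1] to the most
    recent generation (an ordering assumed for this sketch).
    """
    q = 0.0  # a copy present "now" is, by definition, not extinct
    for m in reversed(rates):  # compose the pgfs from the present backwards
        q = ZeroModifiedGeometric.from_mean(m, d).pgf(q)
    return q


stable = ZeroModifiedGeometric.from_mean(1.0)  # a equals d when m = 1
print(stable.pmf(0), stable.pmf(1), stable.mean())
print(extinction_probability([1.011] * 300))
```

Because the generating function of (1) is a ratio of linear functions of s, each composition is a single arithmetic step, which is why exact multigeneration probabilities are cheap to compute under this model.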
Effective Population Size
The parameters m_t determine overall adult population numbers, given the number at any given point in time. If the number at time t_0 is N_0, then the number N_1 at a later generation t_1 (t_1 < t_0, since time is measured in gbp) is

$$N_1 = N_0 \prod_{t = t_1 + 1}^{t_0} m_t. \tag{3}$$
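Equation (3) is a running product of the per-generation rates. As a rough check, the Python sketch below applies it to the History 2 rates of Table 3, treating each rate as constant within its interval; `project_adults` and `SEGMENTS` are illustrative names, and because the tabulated rates are rounded, the result only approximates the 187,000 adults listed for 320 gbp.

```python
import math


def project_adults(n0: float, rates: list[float]) -> float:
    """Eq. (3): multiply the starting adult number by every per-generation rate."""
    return n0 * math.prod(rates)


# History 2 of Table 3: (rate per generation, number of generations in the interval)
SEGMENTS = [(1.0011, 400), (1.0014, 240), (1.0085, 640)]

# Expand the piecewise-constant rates into one rate per generation, 1600 to 320 gbp.
rates = [rate for rate, generations in SEGMENTS for _ in range(generations)]
adults_at_320_gbp = project_adults(375, rates)  # 375 adults at 1600 gbp
print(round(adults_at_320_gbp))
```

The same function accepts any sequence of rates, so generation-by-generation fluctuations can be explored simply by changing the list passed in.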
Although an average m_t over both time and space is adequate for considering total population expansion, populations do not increase smoothly in time, but undergo bursts of increase. Periods of rapid increase and sudden decrease are likely characteristic of hunter–gatherer and subsistence–agriculture populations. To analyze the effect of such demographic patterns, we can vary m in short-term cycles (Thompson and Neel, 1978). One possibility is to increase m by a factor (1 + g) for (c − 1) generations and decrease it by a factor (1 + g)^{−c+1} on the final generation of the cycle. This has an impact on variance effective population size, the harmonic mean of population sizes over the cycle. As a first experiment we have taken g = 0.1 and c = 8, these values representing seven generations of plenty, followed by one in which the population declines by about 50%. While the model assumes that this pattern applies to all segments of the population, it is not necessary to assume that all subpopulations undergo synchronous cycles. While a major climatic change might affect the hemispheric population, short-term cycles will be local, resulting from local environment or the impact of neighbors.
The model already incorporates the effect of variable offspring number on effective population size, through the probability distribution for gene replicates. However, it is relevant to consider our model’s implications in this regard. Specifically, for given m_t, d determines the variance of offspring number. The ratio of actual (one-generation) population size to effective size is (d + a_t)/(1 − d). Recalling that m_t = (1 − a_t)/(1 − d), we can vary the value of d, adjusting the value of a_t to give us the same values of m_t. When m_t is approximately 1, a_t is approximately equal to d. In this case, setting d = 0.4 results in an N_e of 75% of the one-generation size, or 38% of the total population size. When d = 0.5, N_e is 50% of the adult generation size, or 25% of the total population size.
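The cycle and the size ratios described above can be reproduced numerically. The Python sketch below builds the multipliers for one cycle with g = 0.1 and c = 8, tracks relative population size over the cycle, and evaluates the ratio (d + a_t)/(1 − d) for d = 0.4 and d = 0.5 with a_t = d. All function names are hypothetical; the starting size of 1.0 and a baseline rate of m = 1 are assumptions made for illustration.

```python
def cycle_multipliers(g: float = 0.1, c: int = 8) -> list[float]:
    """(1 + g) for c - 1 generations, then (1 + g)**-(c - 1) in the final one."""
    return [1.0 + g] * (c - 1) + [(1.0 + g) ** -(c - 1)]


def sizes_over_cycle(multipliers: list[float], start: float = 1.0) -> list[float]:
    """Relative population size after each generation of the cycle."""
    sizes, current = [], start
    for factor in multipliers:
        current *= factor
        sizes.append(current)
    return sizes


def harmonic_mean(values: list[float]) -> float:
    return len(values) / sum(1.0 / v for v in values)


def actual_to_effective_ratio(a: float, d: float) -> float:
    """Ratio of one-generation size to effective size, (d + a)/(1 - d)."""
    return (d + a) / (1.0 - d)


multipliers = cycle_multipliers()
sizes = sizes_over_cycle(multipliers)
arithmetic_mean = sum(sizes) / len(sizes)
print(f"final-generation multiplier: {multipliers[-1]:.3f}")
print(f"harmonic / arithmetic mean size: {harmonic_mean(sizes) / arithmetic_mean:.3f}")
for d in (0.4, 0.5):
    print(f"d = {d}: Ne / N = {1.0 / actual_to_effective_ratio(d, d):.2f}")
```

The multipliers of one cycle have a product of 1, so the cycle leaves the long-run mean rate unchanged while lowering the harmonic mean of the sizes relative to their arithmetic mean.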
Within this range of d (0.4 to 0.5), varying d has virtually no impact on overall results; these are dominated by the pattern of m_t.

TABLE 4
Expected Numbers of Variants per Locus Currently Extant in Greater Than K Copies in the Total Amerindian Population, Using the Histories of Table 3 and Mutation Rates µ of 10⁻⁵ and 0.5 × 10⁻⁵

                                          History 1, µ = 10⁻⁵       History 2, µ = 0.5 × 10⁻⁵
Period                Date of origin      K = 20      K = 200       K = 20      K = 200
                      (generations BP)
I.   Amerindian       1600–960            0.166       0.166         0.02        0.02
II.  Early regional   960–640             1.17        1.15          0.176       0.175
III. Later regional   640–320             21.6        14.5          3.49        3.05
IV.  Early tribal     320–170             68.0        9.4           21.2        5.9
V.   Later tribal     170–20              138.0       0.32          67.4        0.4

The Probabilities Computed
Given the zero-modified geometric offspring distribution (1), many relevant probabilities can be computed exactly. The key ones are Q_t, the probability that a new
mutation arising t gbp is now extinct (and (1 − Q_t), the survival probability), and (1 − M_t), the current geometric parameter of copy number for a surviving variant which arose t gbp. Given these parameters, for each generation, we can compute the expected number of variants arising, the expected number now surviving, and the expected number surviving in more than K copies, where for illustration we choose K = 20, or 200, or 1000. Summing over the t values in any specified period, we can obtain the expected numbers of variants (per locus) from each history period. Relative values of these are unaffected by initial population size and mutation rate.
RESULTS
The New World Picture of Variant Numbers and Variant Ages
Some results from the demographic scenarios of Fig. 3 and Table 3 are given in Table 4. This table summarizes the numbers of variants per locus expected in the total Amerindian population, dating from each specified time period, and currently existing in at least the specified number of copies. Additionally, there will be alleles originating more than 40,000 ybp that have survived to the present. The time periods correspond to those previously discussed, with the colonization period (24,000–8000 ybp, 960–320 gbp) and the settled/agricultural period (8000–500 ybp, 320–20 gbp) each divided into two parts. The first hypothesized history clearly yields too many variants (Table 4), especially in light of the fact that the studies summarized in Tables 1 and 2 have yielded no widely dispersed uniquely Amerindian polymorphisms. Although subdivision and spatial heterogeneity will mean that only a very small proportion of more recent variants may be present in our sampled tribes, ancient variants, if present at all, should be in all major subpopulations, as are the ancient Mongoloid variants TF*D-Chi and Diᵃ. Like these Mongoloid variants, an older Amerindian variant might be absent from a single tribe that has undergone
severe isolation and/or decimation, but it would be widely present. The finding that there should be 0.17 extant variants/locus originating between 1600 and 960 gbp, and 1 per locus arising between 960 and 640 gbp does not seem consistent with observed data. The second history scenario, coupled with a lower mutation rate, gives more reasonable figures for the ancient variants, with 0.02 variants/locus from Period 1 and 0.17/locus from the next 8000 years (320 generations).

Figure 5 shows the survival probability of each gene by time of origin. We see the probabilities of survival rising from 0.5 to 1.5% at about 320 gbp (8000 ybp), and then declining. The decline is due to the recent population crash: younger variants, not yet present in large numbers, are less likely to survive this crash. Also shown are the probabilities if short-term cycles are superimposed; qualitatively the results are little changed. Two points are of interest. First, the preponderance of recent variants (Table 4) arises not from increased survival probability, but from the vastly greater number of new variants that arise in a larger population (Fig. 4). Second, the survival probability of ancient genes, about 0.5%, is consistent with previous results (Thompson et al., 1992) and provides some bounds on the founding population of the Americas. For example, if the original adult population were 1000 adults, 2000 genes, then copies of 10 of these should be extant, and in large numbers. This is broadly consistent with the survival of many worldwide human polymorphisms, and of Mongoloid variants such as Di^a and TF*D2Chi, but the loss of others, such as the A and B alleles of ABO.

Figure 6 shows the relative ages of variants expected to be currently extant in the global Amerindian population in greater than the specified number of copies. It
is seen that the overwhelming majority of variants are more recent than 480 gbp (12,000 ybp) and thus will likely be linguistically or “tribally” restricted in their distribution. While very recent variants will not have achieved large numbers of copies, even variants currently present in several thousand copies likely postdate linguistic differentiation and settled occupation; most of these most numerous variants owe their current numbers to recent population expansion.

Due to the great heterogeneity of patterns of population crash and recovery in post-Columbian times, we consider each pattern of population crash and recovery separately in assessing the expected numbers of variants in the tribal populations of Figs. 1 and 2. These are then combined to give total figures for the 12 South American tribes of Fig. 1 (not including the Guaymi) and for the eight Chibcha-speaking tribes of Fig. 2. The four South American tribes in which no private polymorphisms were found are included with the tribes of Table 1 for the South American analysis, while the Teribe are likewise included with the Central American tribes of Table 2 for the computations for the Chibcha-speaking tribes. The results are shown in Table 5.

The Chibcha-speaking tribes considered in this analysis comprise only about 0.0026 of the Amerindian population of 1492. They form a single linguistic group, and several variant alleles are shared by several tribes (Table 2). The populations of this area of Panama and Costa Rica have a history of settled occupation and agriculture dating back to 8000 ybp, and thus they may be typical of the scenarios of Table 3. If so, this suggests a not unreasonably small founding population of 1000–4000 total individuals at 8000 ybp. For the Chibcha
FIG. 5. Probabilities that a gene survives (has extant replicate descendants), from a given point in time to the present. Curves are given for the two history scenarios of Table 3. For history 1, the effect of imposing small-scale (eight generation) patterns of growth and crash on the underlying demographic pattern is also shown.
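As a worked illustration of the founding-population bound that the text draws from these survival probabilities, the example of 1000 founding adults (2000 gene copies) and a survival probability of about 0.5% for ancient genes can be written out as below. The symbols N_0 (founding adults) and p_surv (survival probability of an ancient gene) are labels introduced here for convenience and are not notation used in the paper.

```latex
\[
  \mathbb{E}[\text{founder genes with extant copies}]
  \;=\; 2 N_0 \, p_{\mathrm{surv}}
  \;\approx\; 2 \times 1000 \times 0.005
  \;=\; 10 .
\]
```

The expectation is linear in the assumed founding size, so any bound on the number of surviving founder genes translates directly into a bound on N_0 for a given survival probability.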
FIG. 6. Relative numbers of variants, by age of origin. The curves show numbers of variants of the given age expected to be extant in at least the number of copies specified (20 and 200, respectively, for the two curves). This figure assumes the first history of Table 3; results are qualitatively similar for the second history.
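The per-generation expectations that underlie curves of this kind (expected variants arising, now surviving, and surviving in more than K copies, summed over a period) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' program: it assumes the copy number of a surviving variant is geometric on 1, 2, ... with parameter (1 − M_t), so that the probability of more than K copies is M_t^K, and it takes the expected number of new variants per locus in each generation as a given input. The names `GenerationParams`, `expected_counts` and `period_totals` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GenerationParams:
    """Inputs for variants arising t generations before present (gbp)."""
    t_gbp: int           # time of origin, in generations before present
    new_variants: float  # expected new variants per locus arising at t (assumed input)
    q_extinct: float     # Q_t: probability that a variant arising at t is now extinct
    m_geom: float        # M_t, where (1 - M_t) is the geometric copy-number parameter


def expected_counts(g: GenerationParams, k: int) -> Tuple[float, float, float]:
    """Return expected (arising, now surviving, surviving in more than k copies)."""
    arising = g.new_variants
    surviving = arising * (1.0 - g.q_extinct)
    # Assumption: copies of a surviving variant are geometric on 1, 2, ...,
    # P(n) = (1 - M_t) * M_t**(n - 1), hence P(n > k) = M_t**k.
    above_k = surviving * g.m_geom ** k
    return arising, surviving, above_k


def period_totals(gens: Iterable[GenerationParams], older_gbp: int,
                  younger_gbp: int, k: int) -> Tuple[float, float, float]:
    """Sum the three expectations over younger_gbp < t <= older_gbp."""
    totals = [0.0, 0.0, 0.0]
    for g in gens:
        if younger_gbp < g.t_gbp <= older_gbp:
            for i, value in enumerate(expected_counts(g, k)):
                totals[i] += value
    return totals[0], totals[1], totals[2]


if __name__ == "__main__":
    # Purely illustrative parameter values, not values taken from the paper.
    demo = [GenerationParams(t, 0.01, 0.99, 0.995) for t in range(21, 171)]
    print(period_totals(demo, older_gbp=170, younger_gbp=20, k=20))
```

Because every expectation is proportional to `new_variants`, rescaling the mutation rate or the initial population size rescales all periods alike, which is why relative values across periods do not depend on those two quantities.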
Results Specific to the Populations of Tables 1 and 2
TABLE 5
Expected Numbers of Variants per Locus in the Tribes of Figs. 1 and 2, Using the History and Mutation Rates Given

                                    South America                  Chibcha
m (8000–500 ybp)              1.005          1.011          1.011          1.0158
In Table 3                    N/A            History 1      History 1      History 2
Mutation rate                 0.5 × 10⁻⁵     10⁻⁵           10⁻⁵           0.5 × 10⁻⁵

Copies    Origin
≥20       320–170 gbp         0.12           0.18           0.09           0.03
          170–20 gbp          0.22           0.37           0.32           0.18
          <20 gbp             0.03           0.06           0.10           0.05
          Total               0.37           0.61           0.51           0.26
          Observed                    0.32                          0.16

≥200      320–170 gbp         0.04           0.05           0.04           0.01
          170–20 gbp          0.04           0.06           0.09           0.05
          <20 gbp             0              0              0              0
          Total               0.08           0.11           0.13           0.06
          Observed                    0.16                          0.16

≥1000     320–170 gbp         0.01           0.01           0.01           0.01
          170–20 gbp          0              0              0.01           0.01
          <20 gbp             0              0              0              0
          Total               0.01           0.01           0.02           0.02
          Observed                    0.08                          0.08
populations we therefore use the two histories of Table 3 for the period 8000–500 ybp, with the corresponding mutation rates used in Table 4. Note that, given the population size at 500 ybp, a larger rate of increase corresponds to smaller past population sizes (Fig. 4). The different patterns of post-Columbian crash and recovery are analyzed separately, and the results combined to give the totals of Table 5. History scenario 2, with the lower mutation rate, gives the best fit to the observed data. For instance, for the largest group of variants considered (those with ≥20 extant copies in the adult generation) the expectation for variants per locus accumulating in the Chibcha since 320 gbp is 0.26, whereas the observed number is 0.16. Given the variance of the underlying process, and the effects of possible errors in our assumptions, this difference between observed and expected is not significant.

In the tribes of Table 1, the “private” polymorphism distribution is different. In total there are twice as many variants, but most are restricted to a single tribe; there is one instance of an allele shared by two tribes of a single linguistic group, and another shared by neighboring tribes between whom there is documented recent migration. These tribes probably had slower rates of increase in pre-Columbian times, and we therefore assume the lower increase rate of Table 3 (m = 1.011) and also a rate of only 0.5% per generation, more typical of the earlier hunter–gatherer period of Table 3. Again, results are computed separately for the different post-Columbian patterns of crash and recovery and are then combined to give the results of Table 5. The two history scenarios differ little in their predictions; the slower rate of increase gives more weight to older variants, but the totals differ little. The best fit, especially with the slower (m = 1.005) rate of increase, is for the smaller mutation rate of only 0.5 × 10⁻⁵/locus/generation, the expectation for variants with ≥20 copies being 0.37 and the observation 0.32. The fit is thus good for variants with at least 20 extant copies and reasonable for variants with at least 200 copies. The model predicts fewer variants with at least 1000 copies than are observed, but in each observed instance of this, the result is explicable in terms of a tribe’s recent expansion. Moreover, Table 5 considers only variants arising since 8000 ybp. From Table 4 and Fig. 6, we see that variants with extant numbers of copies greater than 200 are likely to be somewhat older than this, and thus expectations computed only on the basis of 8000 years of tribal differentiation provide an underestimate.

Sampling Effects

The probability of missing a variant polymorphic in a well-sampled tribe is negligible (Thompson et al., 1992). However, a variant may be missed if it is localized in some area of the tribe. The sampling problem is analogous to the crash problem: it is spatially heterogeneous, and so has a disproportionate effect on young variants. However, just as ancient variants (if surviving) are likely to be widespread among tribes, older tribal variants (if surviving) will be widespread within the tribe. Variants with the numbers, frequencies, and likely ages of those of Tables 1 and 2 will be little affected by sampling, given the very extensive sampling undertaken. It is worth noting, however, that a closer fit between observation and expectation can be achieved in Table 5 by allowing a somewhat higher mutation rate (up to 10⁻⁵) to achieve proportionately higher expectations in all categories.
A close fit is then obtained for variants present in ≥200 and ≥1000 copies in the current adult generations, while the values observed in the ≥20 category, which are then too low, can be explained as variants having been missed by sampling. The available data have little power to distinguish these alternative hypotheses, but given the extensive sampling of these tribes we have some confidence that the fits achieved in Table 5 for the ≥20 category with the lower mutation rates are valid. The previous explanation, that high-copy-number expectations will be underestimated by considering only 8000 years of history, thus explains well the observations in other categories.

DISCUSSION

We have explored the genetic implications of two reconstructions of Amerindian history, both with a 40,000-year time depth, and, based on population parameters developed in previous studies, have asked which of these scenarios is more consistent with the data on tribal private polymorphisms of electrophoretic variants of some 25 proteins studied in 20 different Amerindian tribes. These tribes consist of 12 tribes widely scattered throughout South America, with diverse linguistic affiliations, and 8 tribes of Central America, all Chibcha speakers. A salient feature of the data is that although the Amerindians exhibit such Mongoloid variants as TF*D2Chi and Di^a, no unique Amerindian variants which are widespread across various Amerindian linguistic groups have thus far been recognized. Our model predicts that relatively few variants should have arisen and persisted from the first 30,000 years of this history, the vast majority of the variants having arisen since the population expansion fueled by the development of agriculture beginning some 8000 ybp. However, the apparent absence of such early variants requires a very small early population, with a prolonged history at very small numbers. This requirement is to some extent alleviated by assuming a smaller mutation rate. Of the two rates assumed (0.5 × 10⁻⁵ and 1.0 × 10⁻⁵/locus/generation), the smaller provides a better fit to the observed data.

Because only a fraction of each tribe was sampled for variants, because variants might have been lost in the post-Columbian Amerindian population crash, and because our electrophoretic techniques would not detect all charge-change variants, the data estimate of tribal variants is biased downward, so that any comparison of observation with prediction can be only approximate. In undertaking any such comparison, it seems best to emphasize those allele number estimates based on the demonstration of ≥20 copies of the allele. Variants with fewer than this number of copies are likely to be recent, localized, and possibly missed by sampling.
In all our comparisons we observe somewhat fewer variants than expected in the ≥20 copies category, consistent with the failure to detect some low-frequency localized variants, and an excess in the ≥200 category, consistent with the fact that expectations of Table 5 consider only variants aged under 8000 years.

Our results are robust to several of our model assumptions. Varying the geometric parameter d within the plausible range d = 0.4 to d = 0.5 has virtually no impact on results, and a pattern of short-term cycles in population size has little effect. Under such a pattern of cycles, survival probabilities of old variants are somewhat increased, and hence also current numbers. However, the effect is not large. Of course, a regular pattern of increase and crash is unlikely, but variance in the spacings of these minicrashes will have little effect. Obviously also, the entire population of the Americas was not undergoing this pattern of increase and minicrash in phase! Although there may have been occasional climatic disasters, even these will not have affected the entire two continents. However, this does not affect conclusions on the current numbers of variants of given age. Provided that the same general pattern holds, it has no effect if some parts of the population have different generations of crash and increase.
To return, then, to the question raised in the Introduction, the presence of certain private polymorphisms may be useful in establishing tribal and linguistic affiliations, but only rarely will be helpful in a broader context.

The present model can be employed to estimate the average total number of variants per specific nucleotide, on the assumption of a nucleotide mutation rate (say of 1 × 10⁻⁸/generation). Additionally, with slight modification, it can be applied to the more rapidly evolving restriction site variants of mtDNA. Similar findings are expected; most extant variation postdates tribal differentiation. For instance, most of the mtDNA variation revealed by the use of 14 restriction enzymes in the 12 widely separated tribes of this study was tribe-specific (Torroni et al., 1993). As an extreme example, 15 of 16 Kuna exhibited a tribe-specific restriction site (Torroni et al., 1994). However, for sequence data or restriction site data, there is additional information in the inferred phylogeny of variant alleles. Thus such data may be found to have greater usefulness in a taxonomic context than the electromorph alleles of nuclear DNA.

Ancient polymorphisms, shared by Mongoloid populations (including Amerindians) (e.g., TF*D2Chi and Di^a) characterize variants arising between 70,000 and 40,000 years ago. Polymorphisms of worldwide distribution date from the long prehistory of the human species, or even earlier. Extrapolating from our Amerindian results, the small numbers then extant and the low survival probability of any new mutation, a very long period of history is required to generate the polymorphisms of global distribution. We note also that variant survival rates decay, but that those that survive many generations are likely to continue to survive, despite population crashes. Survival of old mutations is scarcely affected by recent Amerindian history. Again, this can be extrapolated to the ancient polymorphisms.
Those very few that became established in the human species, and did not become extinct between 5 million and 200,000 years ago, will still be around today.

Although the Amerindians may represent an extreme example of a founding population which then underwent an almost explosive expansion with the development of agriculture, we suggest that, to a greater or lesser extent, many extant populations have a similar history, with a similar expectation: accumulation of a characteristic set of private polymorphisms within linguistic groupings, but relatively few variants unique to the corresponding ethnic grouping, despite the long period that intervened between the beginnings of the ethnic groups and the appearance of agriculture and population expansion. This expectation would certainly be true of much of Europe if we accept the thesis of Cavalli-Sforza and colleagues (1994) that invading agriculturists replaced the original inhabitants of much of Europe beginning some 8000 ybp.

This scenario has two important implications for current human populations. First, because of the relative recency of origin of most of the variants encountered in human populations, marked linkage disequilibrium for variants as distant as 0.5 cM from markers is to be expected. This expectation of disequilibrium is enhanced where tribes have amalgamated to create contemporary, urban populations, since genetic variation within a tribe or linguistic group is less than in amalgamated populations, and as long as the variant was restricted to the tribe or linguistic group of origin, the number of “harnesses” within which it might be introduced remained limited. Second, most modern populations represent an amalgamation of tribal and linguistic groups that were noninterbreeding entities up until relatively recent times. With this amalgamation, what were private polymorphisms within tribes or linguistic groups decrease in relative frequency in the total population to subpolymorphic proportions. We have previously pointed out that failure to recognize this aspect of most modern populations can lead to false inferences concerning such diverse subjects as the past occurrence of population bottlenecks or the frequency of slightly deleterious mutations (Chakraborty et al., 1988; Thompson et al., 1992). The present demonstration, that most rare variants have arisen relatively recently, as human population numbers dramatically expanded, and were quite localized, underscores that earlier conclusion.

ACKNOWLEDGMENTS

This research was supported in part by Department of Energy Grant 87ER60533 (J.V.N.) and NSF Grant BIR-9305835 (E.A.T.). The major part of this work was undertaken while E.A.T. was visiting the University of Michigan, Fall 1994; the hospitality of the Department of Biostatistics is gratefully acknowledged. We are also grateful to Elizabeth Elle and Deborah Sheely (Rutgers University) for comments on an earlier draft.
REFERENCES

Barrantes, R., Smouse, P. E., Mohrenweiser, H. W., Gershowitz, H., Azofeifa, J., Arias, T. D., and Neel, J. V. (1990). Microevolution in lower Central America: Genetic characterization of the Chibcha-speaking groups of Costa Rica and Panama, and a consensus taxonomy based on genetic and linguistic affinity. Am. J. Hum. Genet. 46: 63–84.
Cavalli-Sforza, L. L., Menozzi, P., and Piazza, A. (1994). “The History and Geography of Human Genes,” Princeton Univ. Press, Princeton, NJ.
Chakraborty, R., and Neel, J. V. (1989). Description and validation of a method for simultaneous estimation of effective population size and mutation rate from human population data. Proc. Natl. Acad. Sci. USA 86: 9407–9411.
Chakraborty, R., Smouse, P. E., and Neel, J. V. (1988). Population amalgamation and genetic variation: Observations on artificially agglomerated tribal populations of Central and South America. Am. J. Hum. Genet. 43: 709–725.
Deevey, E. S. (1960). The human population. Sci. Am. 203: 195–204.
Denevan, W. M. (1992). “The Native Population of the Americas in 1492,” Univ. of Wisconsin, Madison, WI.
Fiedel, S. J. (1987). “Prehistory of the Americas,” Cambridge Univ. Press, New York.
Fisher, R. A. (1930). “The Genetical Theory of Natural Selection,” Clarendon, Oxford, UK.
Greenberg, J. H. (1987). “Language in the Americas,” Stanford Univ. Press, Stanford, CA.
Moore, J. H. (1994). Putting anthropology back together again: The ethnogenetic critique of cladistic theory. Am. Anthropol. 96: 925–948.
Myers, T. P. (1976). Defended territories and no-man’s lands. Am. Anthropol. 78: 354–355.
Neel, J. V. (1978a). Rare variants, private polymorphisms, and locus heterozygosity in Amerindian populations. Am. J. Hum. Genet. 30: 465–490.
Neel, J. V. (1978b). The population structure of an Amerindian tribe, the Yanomama. Annu. Rev. Genet. 12: 365–413.
Neel, J. V., Biggar, R. J., and Sukernik, R. I. (1994). Virologic and genetic studies relate Amerind origins to the indigenous people of the Mongolia/Manchuria/southeastern Siberia region. Proc. Natl. Acad. Sci. USA 91: 10737–10741.
Neel, J. V., Gershowitz, H., Mohrenweiser, H. W., Amos, B., Kostyu, D. D., Salzano, F. M., Mestriner, M. A., Lawrence, D., Simoes, A. L., Smouse, P. E., Oliver, W. J., Spielman, R. S., and Neel, J. V., Jr. (1980). Genetic studies on the Ticuna, an enigmatic tribe of Central Amazonas. Ann. Hum. Genet. 44: 37–54.
Neel, J. V., and Rothman, E. D. (1978). Indirect estimates of mutation rates in tribal Amerindians. Proc. Natl. Acad. Sci. USA 75: 5585–5588.
Neel, J. V., Satoh, C., Goriki, K., Fujita, M., Takahashi, N., Asakawa, J., and Hazama, R. (1986). The rate with which spontaneous mutation alters the electrophoretic mobility of polypeptides. Proc. Natl. Acad. Sci. USA 83: 389–393.
Neel, J. V., Tanis, R. J., Migliazza, E. C., Spielman, R. S., Salzano, F., Oliver, W. J., Morrow, M., and Bachofer, S. (1977). Genetic studies of the Macushi and Wapishana Indians. I. Rare genetic variants and a private polymorphism of esterase A. Hum. Genet. 36: 81–107.
Neel, J. V., and Thompson, E. A. (1978). Founder effect and number of private polymorphisms observed in Amerindian tribes. Proc. Natl. Acad. Sci. USA 75: 1904–1908.
Nichols, J. (1990). Linguistic diversity and the first settlement of the New World. Language 66: 475–521.
Thompson, E. A., and Neel, J. V. (1978). Probability of founder effect in a tribal population. Proc. Natl. Acad. Sci. USA 75: 1442–1445.
Thompson, E. A., Neel, J. V., Smouse, P. E., and Barrantes, R. (1992). Microevolution of the Chibcha-speaking peoples of lower Central America: Rare genes in an Amerindian complex. Am. J. Hum. Genet. 51: 609–626.
Torroni, A., Neel, J. V., Barrantes, R., Schurr, T. G., and Wallace, D. C. (1994). Mitochondrial DNA “clock” for the Amerinds and its implications for timing their entry into North America. Proc. Natl. Acad. Sci. USA 91: 1158–1162.
Torroni, A., Schurr, T. G., Cabell, M. F., Brown, M. D., Neel, J. V., Larsen, Smith, D. G., Vallo, C. M., and Wallace, D. C. (1993). Asian affinities and continental radiation of the four founding Native American mtDNAs. Am. J. Hum. Genet. 53: 563–590.