P. K. Sen and C. R. Rao, eds., Handbook of Statistics, Vol. 18 © 2000 Elsevier Science B.V. All rights reserved.
QA
~'-I"
Genomic Sequences and Quasi-Multivariate CATANOVA*
Hildete Prisco Pinheiro, FraGcoise Seillier-Moiseiwitsch, Pranab Kumar Sen and Joseph Eron Jr.
1. Introduction
Simpson (1949), apparently unaware of the work of Gini (1912), proposed a measure of biodiversity. Working with a similar measure of qualitative variation, Light and Margolin (1971, 1974) developed an analysis of variance for one-way tables suitable for categorical variables. In the context of genomic sequences, their framework can be used to compare the variability at a single position between and within groups. In many instances, and in particular for viral sequences, a single position yields little information. Hence there is a need to deal with variation at a typically large number of sites. We adopt a pseudo-multidimensional approach and study the components of variation. Based on the assumed independence among positions, the rationality of a test statistic for the null hypothesis of homogeneity among groups is investigated, and related statistical perspectives are considered. The motivation here is to present multivariate analysis of variance (MANOVA) models and analysis tools for high-dimensional categorical data that are qualitative and unordered. The scientific focus is the comparison of sets of genomic sequences from the human immunodeficiency virus (HIV). For example, one has sampled viral strains from different geographical areas to see whether the variability is similar in each region. Similarly, when we follow several individuals and obtain sequences from each individual at different time points, our interest lies in estimating the variability between and within individuals. These are typical molecular epidemiologic studies of genomic sequences that pertain to different epidemiologic strata, so that the between-group component may need further partitioning into several subcomponents. Incidentally, in all these cases, the response variable at each position is the amino-acid or nucleotide label. It thus * This research was funded in part by Coordenaggo de Aperfei~omento de Pessoal de Nivel Superior, the National Science Foundation (DMS-9305588), the American Foundation for AIDS Research (70428-15-RF) and the National Institutes of Health (R29-GM49804 and P30-HD37260). 713
714
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr.
exhibits predominantly qualitative variation. In this situation, the classical M A N O V A models and analysis schemes are not appropriate. High dimension (i.e., number of positions) adds more complexities to the modeling and analysis schemes. The Gini-Simpson index of biodiversity (GSI) (Gini, 1912; Simpson, 1949) is expressed solely in terms of frequencies for each category, and is quite insensitive to any quantitative information these categories might have. On the basis of a similar measure of diversity, Light and Margolin (1971, 1974) developed an analysis of variance, (CATANOVA) for one-way categorical data governed by a simple product-multinomial law. This framework can be used in the present context to compare the variability of the response variable at a single position between and within groups. In the analysis of genomic sequences, a single position bears little information. Consequently, we need to consider regions of the genome. For HIV-1, the regions of interest span anywhere from 35 to several hundreds amino acids. Generally, the relative importance of the positions may not be known nor is their stochastic interdependence pattern. Extracting quantitative information from genomic sequences requires some knowledge of molecular biology. We import the necessary background in Section 2. Motivated by the biological foundations, statistical model assumptions and basic motivations are considered in Section 3. Components of variation are derived from the analogy of variation in quantitative variables and diversity in qualitative variables. In either case, the goal is to decompose the total variation or divergence into identifiable components; these analogues of variance component analysis are presented in detail in subsequent sections. Genomic sequences are not considered on an individual basis but only as contributing to the overall variability in the distribution of a high-dimensional categorical response. The formulation of a measure of diversity for such a high-dimensional categorical model is itself a complex statistical task that has not yet met the light of full generality. To present the basic ideas in a simple manner, we partition the measures of diversity with respect to some factors (usually of an epidemiological nature) assuming independence among positions, and develop a test statistic for the null hypothesis of homogeneity among groups (Section 9); its power properties are also highlighted (Section 11). An alternative approach based on the Hamming distance is outlined in Section 10. Results of simulations, performed to evaluate the relevance of the asymptotic results when sample sizes are moderate, are reported in Section 12. A brief data analysis follows in Section 13. Some of the mathematical derivations are relegated to the Appendix (Section 14).
2. Biological background Nucleotides are the building blocks of genomes and each nucleotide consists of a sugar, a phosphate and a base. In a nucleic-acid macromolecule, all sugars are of
Genomic sequences and quasi-multivariate CA T A N O VA
715
the same kind, either ribose or deoxyribose. Nucleic acids thus come in two forms: Ribonucleic Acid (RNA) or Deoxyrinucleic Acid (DNA). D N A has four possible nucleotide bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T), where A pairs up with T and G with C. R N A also has four bases: A, C, G and Uracil (U) in place of T. A is now complementary to U. Unlike D N A , it is single stranded. Any three of these base pairs codes for an amino acid (Table 1). A protein is made up of a sequence of amino acids. To transform the D N A "words" into amino acids, some sophisticated molecular machinery is needed. Transcription is the process by which the two strands of D N A are teased apart and a molecule of R N A is built along one D N A strand by the enzyme RATA polymerase to begin protein synthesis. Each base of this messenger R N A ( m R N A ) is complementary to the corresponding base of D N A . The m R N A then carries this genetic information from the D N A to the protein factory, the ribosomes. A ribosome is made up of two balls, one that binds to the m R N A (starting at or near a AUG codon) and one that has two slots for transfer R N A molecules (tRNA). The t R N A translates the genetic code into amino acids. At one end of a t R N A molecule is attached an anticodon which binds to the complementary codon of m R N A , and the other end carries the associated amino acid. As the ribosome moves along the m R N A sequence, the amino acids are linked and separated from the t R N A . The now vacant slots are occupied by new t R N A molecules and the process continues until the ribosome reaches a stop codon (one for which there is no matching t R N A anticodon).
Table 1 Amino acids Amino acids
Abbreviations
Codons
Glycine Alanine Valine Leucine Isoleucine Serine Threonine Aspartic acid Glutamic acid Lysine Arginine Asparagine Glutamine Cysteine Methionine Phenylalanine Tyrosine Tryptophan Histidine Proline
GLY ALA VAL LEU ILE SER THR ASP GLU LYS ARG ASN GLN CYS MET PHE TYR TRP HIS PRO
GGT, GGC, GGA, GGG GCT, GCC, GCA, GCG GTT, GTC, GTA, GTG TTA, TTG, CTT, CTC, CTA, CTG ATT, ATC, ATA TCT, TCC, TCA, TCG, AGT, AGC ACT, ACA, ACG, ACC GAT, GAC GAA, GAG AAA, AAG CGT, CGC, CGA, CGG, AGA, AGG AAT, AAC TAA, TAG TGT, TGC ATG TTT, TTC TAT, TAC TGG TAU, CAC TTU, TTC, TTA, TTG
716
H. P. Pinheiro, F. Seillier-Moiseiwitsch,
P. K. Sen and J. Eron Jr.
A retrovirus, like HIV, has the ability to reverse the normal flow of genetic information from genomic D N A to m R N A (Varmus and Brown, 1989). Its genomic RNA encodes an enzyme (reverse transcriptase) that makes a D N A copy of its RNA. This D N A gets incorporated into the host genome. Reverse transcriptase is prone to making errors, which result in changes in the genetic code of the virus. Recombinations between R N A strands also causes alterations in the viral sequences. The genetic variability of HIV is relatively high compared to other retroviruses (Mansky and Temin, 1995). Error rates of purified HIV-1 (HIV-1 is the most common form of HIV in the world while HIV-2 circulates mostly in Africa) reverse transcriptase determined with a D N A template (of the lacZ~ peptide gene) range from 5 × 10 4 to 6.7 × 10 -4 per base pair (Roberts et al., 1988). To test the hypothesis that the mutation rate for HIV-1 reverse transcriptase is comparable to its purified version, Mansky and Temin (1995) developed a system to measure forward mutation rates with an HIV-1 vector containing the lacZ~ peptide gene as a reporter for mutations. They found that the forward mutation rate of HIV-1 in a single cycle of replication is 3.4 × 10 -5. The in-vivo mutation rate is therefore lower than the error rate of purified reverse transcriptase by a factor of 20 (Mansky and Temin, 1995). Explanations for this discrepancy are: the association of viral or nonviral accessory proteins during reverse transcription, the influence of cellular mismatch repair mechanisms, and differences between the reverse transcriptase produced in vivo and that assayed in vitro. Sequences can be compared at either the nucleotide or amino-acid level. Nucleotide substitutions can be evaluated for mutations that cause changes in amino acids (non-synonymous) vs. mutations that do not (silent or synonymous). Furthermore, substitutions between purines (A and G) only or pyrimidines (C and T) only are termed transitions, and those between a purine and a pyrimidine (A ~ C, A +-+ T, G +-~ C, G +-+ T) transversions. The HIV genome contains nine genes (gag, poI, vif, vpr, vpu, tat, rev, env and neJ). Most of the internal part of the genome is densely packed with proteincoding domains, so that some genes overlap. The D N A form of the genome is bounded by a repeated sequence, the LTR. The pattern of nucleotide variation is not constant over the whole genome. For instance, the genes encoding internal virion proteins, gag and poI, are more conserved than the gene coding for the envelope of the virus, env. Also, the nucleotide differences in env change the corresponding amino acids more frequently, and therefore, the amino-acid sequence exhibits more variation in env than in gag and pol (Coffin, 1986). The viral envelope is the only gene product in direct contact with the host environment. It is thus not surprising that this gene has the most variable sequence (Hahn et al., 1985; Coffin, 1986; Seillier-Moiseiwitsch et al., 1994). In most organisms, the vast majority of genes do not follow a clear pattern in the first two codon positions, but in HIV there is a preference for A at the expense of C. In the third codon position, the shift towards A is even greater, while in other organisms A is rare there. In HIV, purines predominate over pyrimidines. The overrepresentation of A is highest in pol and lowest in env. As A is not
G e n o m i c s e q u e n c e s a n d quasi-multivariate C A T A N O V A
717
concentrated in the hypervariable segments of env, there is so far no plausible explanation at present for this peculiar coding strategy (Kypr and Mrfizek, 1987). Genomic comparisons of virus isolates have shown that HIV-1 variants in Africa are both highly diverse and generally distinct from those in North America and Europe. Analysis based on g a g or env indicate that African isolates are more heterogeneous than those in North America and Europe. For instance, McCutchan et al. (1992) compared 22 HIV-1 isolates from Zambia and 16 from North America. Among the Zambian isolates, the mean pairwise nucleotide difference in g a g was 7.1% (with a range of 4.4% to 9.8 %). The mean nucleotide difference of the Zambian sequences to the most distant North American virus in the set (the so-called M N strain) was 13.6% (with a range of 14.4% to 17.1%).
3. Statistical motivation In genomic sequence analysis, we encounter data on a (generally large) number of positions for several groups. For each position, the response is categorical with anywhere between 4 and 20 categories. The nature of these categories (i.e., nucleotide or amino-acid label) is totally qualitative in nature. We thus deal with data sets such as that summarized in Table 2. The spatial relationships of the sites may not be known nor can they be taken to be statistically independent. In this Table 2 Contingency table (K positions) Groups
Positions Categories
Totals
1
2
..
C
1
1
nll 1
n211
nCll
n.n
N
1
2
nll 2
n212
ncl2
n.12
N
1
K
nllK
n21K
nCIK
n.lK = N
n11.
n21.
nc1.
n+ =NK
Totals 2
]
n121
n221
nc21
n.21 ~ N
2
2
n122
n222
nc22
n.22 ~ N
2
K
n 12K
n22K
nC2K
n.2K ~ N
nl2.
n22.
nc2.
n.2. = N K
Totals G
1
nlG1
n2GI
nCGI
n.G1 - - N
G
2
n 1G2
n 2G2
nCG2
n.G2 ~ N
G
K
nlGK
n2G K
nCGK
n.GK ~ N
Totals
n 1G.
n2G.
nCG,
n.G. = N K
Grand total
nb.
n2..
nC..
n.... NGK
718
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr.
high-dimensional setup, it may be difficult to incorporate standard multivariate analysis of variance (MANOVA) models and analysis tools for drawing statistical inferences. Alternative categorical analysis of variance (CATANOVA) models are often found to be more appropriate in such non-standard situations. Light and Margolin (1971, 1974) developed a CATANOVA model and analysis scheme for one-way contingency tables. They investigated the properties of the components of variation under a common multinomial model, and also studied the behavior of the proposed tests for small to moderate sample sizes. Anderson and Landis (1980, 1982) extended the C A T A N O V A procedure to contingency tables involving several factors. To understand their approach (presented in an abstract factorial framework), by analogy to the analysis of variance in experimental design, the groups, sequences and positions play the role of blocks, plots and split plots, respectively. Since there usually is a large number of sites (e.g., the V3 loop of env contains 35 amino acids, i.e. 105 nucleotides). Factors with a large number of levels create problems. Most importantly, the main interest is the difference among groups not among the positions. Consequently, there is a greater appeal for M A N O V A models which treat the positions as the coordinates of a multivariate response. However, due to the sheer number of positions relative to the number of sequences, the traditional M A N O V A may end up with little power. Hence, alternative modeling and analysis schemes are deemed more appropriate. Our primary interest is in assessing the homogeneity among groups. As there are K (> 2) sites, as mentioned before, we may be drawn to the M A N O V A formulation wherein the between-group dispersion is to be judged against the within-group component. Let pcgk stand for population probability of belonging to category c in group g at position k (c = 1 , . . . , C; k = 1 , . . . , K; g = 1 , . . . , G). The null hypothesis of homogeneity of the G groups may then be formulated as
Pcgk=Pek
Vg, g = I , . . . , G ,
c= 1,...,C,
k= 1,...,K .
If we denote by P0 the vector (Pcgk,c = 1 , . . . , C, k = 1 , . . . , K ) , for g = 1 , . . . , G, then the null hypothesis relates to the homogeneity of these Pg's (under the c additional restraints that ~c=aP~gk = 1, Vk, g, k = 1 , . . . , K and g = 1 , . . . , G ) . From this perspective, one can argue that the classical Pearson goodness-of-fit )~2_ test statistic can be applied here. This )~2-statistic for Table 2, assuming the positions to be independent, is G
c
K
g=l c=l k=l
nc'k
with K ( G - 1 ) ( C - 1) degrees of freedom (DF). The limiting x2-distribution is a close approximation only when the cell frequencies ncgk'S are all large (at least 5). Moreover, this test will not perform well when the K positions are not stochastically independent. In analyzing genomic sequences, we know that these conditions are not usually met. Indeed, sites often exhibit a few polymorphisms with very low frequencies, and, to obtain a functional protein, deleterious
Genomic sequences and quasi-multivariate CA TANO VA
719
mutations at one position will be compensated by substitutions at other sites. Just as for Fisher's exact test, the exact null distribution is difficult to implement for small values of N when G or K is not small. Moreover, if the number of D F is large but the non-centrality parameter is not proportionally so, the resulting test is likely to have less power than tests directed towards specific alternatives. Note that the number of D F here is usually large: for instance, comparing two groups of sequences 100 nucleotide long yields 300 DF. Thus, in order to use the Pearsonian goodness-of-fit test and approximate its critical level by 7~2 percentile point, we may need at least 1,500 observations (and generally far more, as the categories may not be equally probable); in addition, the large number of D F results in a large critical level, so that the test may not have great power. For these reasons, we need to use another (pseudo-multivariate) approach to assess homogeneity among groups, with less sensitivity to large values of K. The two approaches discussed above (viz., Anderson-Landis and Pearsonian )~2) all assume independence among groups, sequences and positions, which leads to a product (over K) multinomial model. Individuals in a population are not generally independent because of their shared ancestry. For HIV sequences, assuming that the sampled individuals are epidemiologically independent, this may not be such a strong assumption, because of the rapid evolution of the virus (Hahn et al., 1985, 1986; Coffin, 1986; Seillier-Moiseiwitsch et al., 1994; Mansky and Temin, 1995). However, with well-established relationships among positions, there are basic complications in statistical modeling: even with independence among groups and sequences we no longer have a product multinomial model. While positions are highly likely to be stochastically interrelated, we often do not know in what manner. Thus, the classical logistic model may not work out here without further information about the underlying spatial structure. Parametric models for interdependence among positions, such as Markov chains, have been shown not to fit genetic sequences well (e.g., Karlin et al. (1990)). Other dependence models take into account all possible associations of various orders. In the binary case, we may use the Bahadur (1961) representation model, or the model suggested by Liang et al. (1992) with only pairwise dependence (K + (~) parameters). These models contain far too many parameters to be reliably estimated. Ideally, the ratio between the numbers of sequences and parameters should be at least 5. Therefore, the number of sequences should be at least 5 times K(K+ 1)/2, which is difficult to achieve in practice. For polychotomous responses, the situation is even worse: for instance, the model proposed by Liang et al. (1992) has KC + C 2 (~) parameters for C categories. To reduce the number of parameters, we will need to assume, for instance, equal correlation between all positions, which is not a realistic assertion. These situations require further careful scrutiny, and will not be considered in this study. In passing, we refer to Karnoub et al. (1999) for tests of independence between substitutions at two specific sites. Extensions to the general case of sets of sites are yet to be formulated precisely.
720
H. P. Pinheiro, F. Seillier-Moiseiwitsch,
P. K. Sen and J. Eron Jr.
As a first step, our goal is to test the difference between and within groups, using a measure of diversity for categorical responses and assuming independence among positions. When K is large and the dependence is strong, the distribution of the test statistic degenerates to a lower dimensional one. Given some of the positions, others are conditionally redundant. As we do not know what subset can be discarded, we take an intermediate stand and assume that the degree of dependence becomes small as K grows. This also allows us to use an exchangeable model for large K with small intraclass dependence pattern. Our proposed measure averages over K. Therefore, a large K may have a smoothing effect. Hence, this assumption is not so divergent from independence.
4. Variation in categorical data and genomic sequences For categorical data with qualitative labels, the mean is an ill-defined measure of central tendency. Consequently, measures of variation, such as the variance and mean absolute deviation, which are suitable for continuous variables, no longer apply. It is thus imperative for such data to introduce suitable functions that relate to lack of concentration or diversity in a physically meaningful sense. The Gini index was proposed for just this purpose (Gini, 1912). To interpret it, we first briefly consider data with quantitative variation. F o r a random variable X, the variance, defined as E(X - EX) 2, m a y also be expressed as Eq~(XI ,X2), where q~(a, b) = 12 (a - b) 2 (Hoeffding, 1948). Similarly, the mean absolute deviation (about the mean) is defined as EIX - EXI. Consider a set of N independent experimental units with (quantitative) measurements X1,...,XN. Then the sample variance is defined as
-N-
1 N 1Z
i=1
- x:12 =
Qn2)-I i
,(xi,Xj) ,
where 2- = ~ ~N=I Xi. In a similar fashion, the sum of squares is defined as N
1 N
1
N
(4.1)
=
i=1
"= j=l
~/ 1~
where d~j = IXi - X j l denotes the Euclidean distance. In the context of genomic sequences, each Xi falls into one of C possible categories. The Euclidean distance is no longer appropriate here. Let us define d(Xi,Xj) = dij, a zero-one valued distance for the pair (X/,Xj) of observations, as
d/j =
1 0
if X / a n d Xj belong to different categories otherwise .
(4.2)
Genomic sequences and quasi-multivariate CA TANO VA
721
The variation for categorical responses X1,... ,AN is
1 N N 2 ~'=
du
1 ~-~.~1 ~j=l
=
where dij is defined as in (4.2) (Gini, 1912). As each response assumes one and only one of the C possible categories, the data are summarized by the vector = ( n l , . . . , nc) where ni is the number of responses in category i (i = 1 , . . . , C), so that ~c=1 ni = N . Then the variation in the responses is 1
N{
1-
DN =~ - ~ Z ninj = 2 ivkj
/.~1 (hi)e} ~ .
(4.4)
"=
In an ecological context, Simpson (1949), apparently unaware of Gini's work, proposed the following index of diversity. Let there be C species, and l e t p l , . . . , p c stand for the probabilities of these C species. Simpson's index of ecological diversity is c
Is(p) = 1 - p'p = 1 - Z p 2
(4.5)
i 1
and its sample counterpart c
*?s(P) = 1 - 0'P = 1 - Z / ? 2 ,
(4.6)
i=1
where/?i = n i / N (i = 1 , . . . , C) relates to the sample proportion. Call this the Gini-Simpson index of diversity (GS). Therefore,
N]
DN = ~- s(P)
Definitions (4.4) and (4.5) are motivated by two properties. 1. The variation of N categorical responses is minimized if and only if they all belong to the same category, i.e., pi = 1, Vi, i = 1 , . . . , C. 2. The variation of N responses is maximized when the responses are distributed among the available categories as evenly as possible, i.e., pi = l / C , Yi, i = 1,...,C. More complex measures have evolved from this simple formulation and incoporate more sophisticated distance functions. We refer to Rao (1982a-c, 1984), Nayak (1986a,b), Nayak and Gastwirth (1989), and Sen (1999), among others. While some of these measures have been utilized for genetic variation studies (Chakraborty and Rao, 1991), others even apply to economic inequality and poverty indexes (Sen, 1999).
722
H. P. Pinheiro, F. Seillier-Moiseiwitsch,
P. K. Sen and J. Eron Jr.
Define the entropy function as C
g(p) = - Z p c
log Pc ,
c=l
on the C-simplex Sc -- {p >_ 0 : p'p = 1}. It is known that g(p) can not be smaller than GS (Sen, 1999). As a result, it m a y have some inflationary tendency. In fact, Rao (1982a-c) appraised the role of entropy measures in genetic studies, and pointed out the limitations of g(p) as a measure of biological diversity. These criticisms pertain to the current context. To reduce the impact of these undesirable factors, Rao introduced the paired Shannon entropy measure C
5Qp) = - Z {Pc log Pc + (1 - Pc) log(1 - Pc) } • c=l
He also considered a variation of the y-degree entropy of R6nyi .~(p)=(1-7)
-lloge
p
,
for0<7
.
N a y a k (1986a,b) pursued the Rao quadratic entropy measure ~ ( p ) = p'Ap , with A = (dij) where the dij's stand for the distance between the categories i and j. He remarked correctly that the choice of A is not a statistical problem and is mainly governed by the nature of the question under investigation. In particular, if we let A = 111 - I then 3f(p) = 1 - pip = Is(p). There are difficulties in formulating a suitable A in the current application as the three-dimensional structure of the encoded protein is seldom known. Hence, to adopt the Rao quadratic entropy measure, it might be worthwhile exploring suitable utility scores. Sen (1999) considered the following measure C
Jucs=
Z
ucpc(1 - pc) ,
c--1
where the uc's denote utility scores. In the context of poverty indexes, Sen (1999) illustrated the use of appropriate utility scores. He also pointed out the value of such an utility-oriented Gini-Simpson type index in Quality of Life and other studies. For genomic sequences, depending on their type (amino acids or nucleotides), one can attach an appropriate score related to their importance in the diversity spectrum. In the case of nucleotide sequences, substitutions (from the consensus) within the purine group or pyrimidine group may not be as revealing as changes between groups. For amino acids, it may be their polarity that is the crucial characteristic (while other aspects, such as size, can be downplayed). These utility scores can be chosen as non-negative or belonging to the interval [0, 1]. Again, GS corresponds to the particular case where the u~'s are all 1.
Genomic sequences and quasi-multivariate CA T A N O VA
723
If we confine our interest to a single site where there are a number of possible (qualitatively different) outcomes then we may use some of these indexes in the quantification of within- and between-group variations. Indeed, viewed from this angle, the problem reduces to comparing a number of multinomial populations (lacking any ordering of the categories), and the work of Light and Margolin (1971, 1974) and Nayak (1986a,b) on C A T A N O V A is very pertinent. The detailed implementation appears in the next section. The problem, however, is that we do have a (rather large) number of positions, and hence, we encounter a multivariate analogue of such C A T A N O V A procedures. In the usual M A N O V A models (or their extensions to mixed models), the dependence patterns of the response variates play a fundamental role. In this respect, as the loci are not generally mapped in three-dimensional space, identifying their interactions is, at best, a difficult task. For this reason, in Section 6, we present a typical situation, discuss the basic statistical framework, and suggest suitable formulations which can be adopted in practice.
5. Partitioning the diversity measure: A single position The findings of this section are primarily for motivation, since we generally deal with multiple positions. Let us look back at Table 2 relating to G groups, where for each group, there are N sequences with K positions. The response at each position falls into one of C categories. Let X~ = (X~,... ,X~:)' be a random vector representing sequence i of group g, for i = 1,... ,N, k = 1 , . . . , K and g = 1 , . . . , G. Xi~ is thus a random variable assuming one of C (unordered) categories, and denotes what is present at position k along sequence i in group g. Specifically, for nucleotide sequences, x]k c {A, C, T, G} and therefore C is 4. First, assume each sequence consists of a single position. We summarize the data in Table 3. Let
dij =
1
ifX, g # X y
0
ifXT=X f
Table 3 S u m m a r y o f the d a t a (one p o s i t i o n ) Sequences
Groups 1
2
3
...
G
1 2 3
xI x~ x~
x2 x~ x32
x~ x3 x~
... ... ...
x] x] x3a
N
xI
x2
x3
...
x~
H. P. Pinheiro, F. Seillier-Moiseiwitsch,
724
P. K. Sen and J. Eron Jr.
The total number of responses is G
C
g=l
c=i
C
G
c=1 g = l
where no0 is the number of responses in category c for group g and c N = n.g = ~c=1 nco is the number of responses for group g, which is simply the number of sequences in each group. The Total Simpson Index (TSI) is TSI = 1
~_1\NG)
(5.1)
The dispersion within group g (i.e., among {x~,g x2,..., 0 XgN})is
The within-group Simpson Index (WSI) is found by averaging (5.2) over all g's: WSI
1 g~_G1{
~
1-
~(
nc~)
c=l \n'O']
2}
GC = 1 - G Z ~-~ (neOn2 O=1 ~=1 \NG'/
(5.3)
The between-group Simpson Index (BSI) is BSI=TSI-WSI=
(5.41
If we had any prior knowledge regarding a suitable matrix A then, following Nayak (1986a,b), we could derive the expressions for the between- and withingroup Rao quadratic entropy. In fact, we refer to Nayak (1986a,b) for an elegant discussion of the components of this measure. He also investigated the asymptotic distribution for the ratio statistic of the between- vs. within-group components. These results pertain to the Gini-Simpson type index, and hence, we shall not discuss them in detail. More general results for the multi-site case are presented in the next two sections.
6. Partitioning the diversity measure: Multiple sites To conform to the general pattern, we now refer back to Table 1, and assume there are K positions along each sequence. We have X/g = (X/Z,...,X/~c)' and . . , XN). Xg . .(X~, g ' For each k (= 1 , . . . , K), we may define TSI, BSI and WSI as in the previous section; we denote these measures by (TSIk, BSIk, WSIk) ~. Thus, we consider the 3 × K matrix with column vectors (TSIk, BSIk, WSIk) ~ (k -- 1 , . . . , K). Recall that, for each sequence, the responses at these K positions may not be generally regarded as statistically independent, and consequently
Genomic sequences and quasi-multivariate CA TANO VA
725
these K vectors are also not generally statistically independent. The problem is therefore in combining these measures into a single one that takes into account their interdependence and can be used for an overall assessment. In the classical MANOVA model with normally distributed error components, the entire dependence pattern among the different variates can be summarized by the product moment correlation matrix. This simple feature is no longer tenable for the non-normal case, let alone the general categorical-data models. In the current context, the random vectors X~k are correlated polychotomous variables. Thus, there will be interactions of different orders: for example, for the special case of C = 2, there are (§) first-order interactions and in general, (x) interactions of order r - 1, for r = 2 , . . . , K; thus, even when these interactions are formulated in a convenient manner (for example, as in Liang et al. (1992)), there will be a large number (2K - K) of dependence parameters. When C _> 2, this number will be exponentially larger. Therefore, to work with such models we need a huge N for a reasonable estimation of such a large number of nuisance parameters; in practice, this may run contrary to our manageable experimental setups. Note that for each of the K positions and G groups, we have N responses, so that the total number of responses is G C K NGK=Zn.o.=Znc..=Zn..k=-ZZ~ncgk g-1 c=l k=l
C
G
K .
(6.1)
c=l g-1 k=l
We could easily vary the number N from group to group, but that will add more complexity to the notation and expressions in the formulae. Hence, for simplicity of presentation, we assume the same number N across all groups. The variation within group g at position k is 1-
~(ncgk~
2= 1 -- ~ ( n c o k ~
c= I \ n.gk /
2
zc=l .~ \-~-/
'
since n.Ok = N . The variation within group g (ignoring the K positions) is C
1- Z
= 1
~_~kNK/
'
(6.2)
c=l \n.g. ,/
since n.0. = N K . The measures of dispersions are then WSI
=
1 u=~l{ ~ 1
~ -
@~)2
} 1-
c=l
c G W ( n~°" ,~2 c~-I \ N G K / =
C ( nc.- ~2
TSI=I-~kNGK/ c=l
BSI=TSI-WSI=
(6.3) (6.4)
'
G~-~-.~(
ncg ~2
~',,NGK/
( /~c.. ~2 - c=l k N G K / "
(6.5)
726
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr.
In the above formulation, we have treated the K positions as stochastically independent. In the next section, we append some discussion on this point, and describe a suitable probabilistic model.
7. The probabilistic model Let ncgk be the n u m b e r o f responses in category c at position k for g r o u p g, and let pcgk(>0) stand for the probability o f falling into category c at position k for g r o u p g (c = 1 , . . . , C; k = 1 , . . . ,K; g = 1 , . . . , G). Assuming that responses in different groups are independent, for g r o u p g and position k, the responses (nlgk, n2gk,..., ncgk) follow a multinomial distribution: N nlok...nCok
Pr{nlgk,n2gk,...,nc~k} =
(7.1)
H ( P c g k ) nc°k , c=l
where ~c=1P~0k c = 1, k = 1 , . . . , K and g = 1 , . . . , G. N o t e that E(ncok) = Npcok
(7.2)
Var(n~ok) = NP~ok(1 -- P~gk)
and Cov(nc, g~kl, n~202k2) = --fiNP
fi =
ifgl=g2andkl=k2 otherwise
If we assume that the positions are independent, the model is the conventional p r o d u c t multinomial one, given by
(
g=l k=l
)c
9=1 k=l
c=l
(7.3) Then V g _= ( n l g l . . . n c o l n D z . . . n c g z . . . n l g K . . . n c g g ) V = (V1V2... VG) ~ is a G C K x 1 vector with
~ is a C K x
1 vector and
E(V) = It = Xl*o = N ( p 1 1 1 . . . P C l l . . . P l a K . . . P c G x ) '
(7.4)
Let ® denote the direct-sum operation. Then Cov(V) = !2 - N12° = N(~]ll @ 1212 @ ' "
@ 121K @ ]~21 @ "'" @ ~2K @ ' ' " @ 12GK)
(7.5)
where £ok is a C x C matrix o f the form ~"ek = Dgk -- I-togkl.(ogk
(7.6)
with Dgk = Diag(plgk,... ,PCgk) and laoek = (Plak...PCek)' for g = 1 , . . . , G, k = 1 , . . . ,K.
Genomic sequences and quasi-multivariate CA TANO VA
727
Let us now try to interpret the above product-multinomial model from a somewhat different perspective. Recall that in our situation, typically K is large, while C is anywhere between 4 and 20. The assumption of independence of the G groups appears to be quite reasonable in many cases, and so is the independence of the N sequences within each group whenever an individual contributes a single sequence. The K positions, however, may not always be statistically independent. We allow here their marginal inhomogeneity by letting the pcgk'S to vary with k. Note further that n.gk = N, Vg, k. We have already commented on the nature of dependence that can be handled in a straightforward way: basically, for large values of K we need to allow only a small amount of stochastic dependence in order to have a simple statistical resolution. As a further step toward this end, we denote by )~cg~=Npcg~,
c= 1,...,C;
g=I,...,G;
k:
1,...,K
and assume that ncok, c = 1 , . . . , C ; 9 = 1 , . . . , G ; k = 1 , . . . , K Poisson variates with parameters 2cok. The joint probability law for the ncok'S is now given by
c o x
2~
are independent
(7.7)
c=l g=l k=l
and for the n4k'S by G
K
I-I II
g:l k:l
n.gk
(n4k)!
(7.8/
Therefore, the conditional law for the ncgk'S, given that n.0k = N, V g, k, is given by the product multinomial law in (7.3). In practice, it might be more reasonable to assume the independence of the ncg~'s in the above Poisson model, and then use (7.3) for our subsequent analysis. This approach is not at all uncommon in statistical modeling of count data (Bishop et al., 1975). Basically, if we consider the K-dimensional response for sequence i in group g, there are C~: possible realizations { ( c l , . . . , c K ) :ek = 1 , . . . , C , and k = 1 , . . . , K } . Therefore, if N is large while none of the CK cell probabilities is large, the Poisson model works out well. This is more likely to be realized when C and K are large, and no single combination has a high probability. Small dependence among the K positions may lead to such a situation. Of course, it is not difficult to construct pathological cases where the product multinomial model would not be a good fit to the joint probability law for the n~gk's. In the next section, we discuss further properties based on the multinomial model in (7.3).
8. Moments of diversity measures
We present here some moment computations that are helpful in the formulation of suitable statistical tests. Let ® denote the Kronecker product, and let
728
H. P. Pinheiro, F. Seillier-Moiseiwitsch,
1 -
P. K. Sen and J. Eron Jr.
1
-
-
-
T
T -- [)'NGK" 2 (UKa ® Ic) - [)'NGK" 2
°
(8.1)
where UKC is a K G x K G matrix of l's, Ic is the C x C identity matrix and T ° is a C K G × C K G matrix (having K G x K G partitions with each partition being the C x C identity matrix). Let M be a G x G diagonal matrix with diagonal elements Gn2g.(= G ( N K ) 2 here), i.e., and
M = G(NK)2Ic
M_ 1 _
1
G(NK)2 IG .
Define W ~ [(M_1 ®UK) ® i c l =
G
_
_
1
(NGK)2 [(I6 ® UK) ® Ic] --= G(NK)2
W
°
.
(8.2) Then TSI = 1 - V'TV
(8.3)
WSI = 1 - VtWV
(8.4)
Therefore, BSI = TSI
(8.5)
WSI = V ' ( - T + W)V = V'BV
-
where -1 B
=
- T + W
-
(NCX)2
G
- ()'NGK" 2 (UKc ® Ic) -~ [)'NGK'2 [(IG ® UK) ® Ic]
~G ® UK) - ~U~:G O Ic --= G(NX)2
Since E(V'TV) = trace(T~2) + la'T~t, E(TSI) = 1 - trace(TE) - g'Tlt
=
Epc2
-Np:
9=1 k=l
E(WSI) = 1 - trace(WE) - Ix'Wla =
1 - - -1
NK + ~
1
~--'~q-~[K_~1Pco~ 2 - Npcg. 2 ] c=l 0=1
(8.6)
729
Genomic sequences and quasi-multivariate C A T A N O VA
and E(BSI)
G- 1 1 Zp;2k_ - N G K ~ N ( G K ) 2 c=1 ko=l k=l
Z
NGK2 c=l 9=1
Np2..
1
Pc~k
Define the population variation within group g at position k as C
Is(pgk) = 1 - Z p Z g k .
(8.7)
c=I
The null hypothesis H0 :pcgk = lack for all 9 implies that
Is(plk) = :s(p2~) . . . . .
Is(PGk) = :s(Pk) ,
i.e., within-group variation at position k is the same across all groups, and that
IlPik II=ll pRk II. . . .
IIPGk II
(8.8)
where Pok = (PlokP20k...Pcgk) ~ is a C x 1 vector representing the probabilities of belonging to categories c = 1, . . . , C in group g and position k. If one is interested in the hypothesis stated in (8.8), the hypothesis of homogeneity among groups (P~ok = P~k) is not necessarily true. Here, we consider the more stringent null hypothesis/4o :P~ok =p~k, Vc, k. Under this H0 E0 (TSI) = 1 -
+
~
E0(WSI) = 1 - ~ - - K + ~ - K - ~ E0(BSI)
- ~G -
1
p;2k - NGp 2.
p2k - N p 2.
(8.9)
(8.10)
1 - 1~ p c ~2
(S.ll)
c=l k=l
Since V follows a multinomial distribution, from (7.4) and (7.5), asymptotically,
v
d N(v~o,~o)
(8.12)
12° = ~21 ® ~22 @ . . . @ }]G, 129 = 1281 @ ]~82 @ " "" @ is given by (7.6). Under H0, for any g = 1 , . . . , G,
where
Y~ak = N0k
and
"F~gK, g = 1, . . . , G a n d
No = !2o = I~Ol ® 12o2 ® ..- ® 120K
Nak
(8.13)
where l~0k is the C x C matrix ~Ok = D k -- Jl~kJl'dc
(8.14)
with Dk being a C × C diagonal matrix with elements Plk,--. ,PCk and ~ok = (Plk...Pck)'. Therefore, under/4o,
H. P. Pinheiro, F. Seillier-Moiseiwitsch,
730
P. K. Sen and J. Eron Jr.
(8.15)
= N "£<' = N£~ = N(IG ® lg*O) • Now,
Cov(BSI, WSI) = Cov(BSI, TSI) - Var(BSI) Cov(TSI, WSI) = Var(TSI) - Cov(TSI, BSI)
(8.16) (8.17)
We incorporate these findings in the formulation of the test statistic in the next section.
9. T h e test statistic
We define BSI and WSI as in (8.3)-(8.5) and consider an analogue of the usual variance ratio. Specifically, we write this test statistic as F1N =
BSI N WS~
(9.1)
In the normal ANOVA model, under a suitable null hypothesis, F1N has the variance-ratio distribution, which, for large values of the denominator DF, can be approximated by a rescaled )~2-distribution. The situation is quite different here. First, even under H0, the exact distribution of F1N is difficult to write in closed form. Secondly, a rescaled F1N may not generally have an asymptotically )~2_ distribution. For the specific case of K = 1, Nayak (1986a,b) derived the asymptotic distribution of F1N. We present a similar result for K > 1. We need to introduce some notation, and first consider some allied results. Note that BSI can be written as G
C
Bs, : v ' . v : E Z g=l c=l
ncg. NKv/G
nc..NKx/G] 2 (NGK) 2 ]
(9.2)
Let O~gk = ncgk - Np& = n~9~ - Eo(ncyk). Then G o.
= ~
K
Z
K
Ocok = nc.. - NG ~-~p& •
9=1 k=l
k=l
Also, under H0, 0 = (0111... 0
OCO1...
OCGK)' is asymptotically
~ N(0, £~)
(9.3)
where 123 is as in (8.15). So,
BSI=
O~g. NKv@
0~.. N K ~ (NGK) 2
1 20'B°O 2 = O'BO-- G(NK)
731
Genomic sequences and quasi-multivariate CA T A N O VA
Hence, under H0, asymptotically CGK
2 , ~ )Li()~1)i i=1
gsI
(9.4)
where (Z2)i's are independent Z2 r a n d o m variables with 1 degree of freedom and {2i, i = 1 , . . . , CGK} is the set of characteristic roots of 1
NBN~--NGK1 - :~ B°N~
--
NGK 21[((IG®Ux)--GUxG)®Ic] (IG®N~)
by (8.6) and (8.15). L o o k i n g at 120k, it is easy to see that its r a n k is at most (C - 1) because o f the restriction that ~c=1 p ~ = 1. In fact, the r a n k of each l£0k is (C - 1) since any o f its (C - 1) columns are linearly independent. Therefore, the rank of ~2; is K(C - 1). Further, (1/NGK2)B°lC,~ is a G x G partition matrix the elements of which are the 120~'s matrices premultiplied by some constants ( G - 1 or - 1 ) . In order to get the characteristic roots of (1/NGK2)B°N~ we need to solve the equation -
acK
=
(9.5)
0
Since {pc~,C = 1 , . . . , C} need to be estimated, so are {2ik, i = 1 , . . . , C - 1}. Determining the characteristic roots of a multinomial covariance matrix is not straightforward. R o y et al. (1960) studied this p r o b l e m without actually presenting the closed-form expression for the roots. The characteristic equation for each k is -
H ( p ~ k - 2) = 0
(9.6)
It is easy to see that 2 = 0 is a root, but identifying the other roots must proceed numerically.
Now, TSI=I-VITV
.
Since NT~2~ is not idempotent, the distribution o f V~TV is not g~ank(r),,'XU)" U n d e r H0, however,
V'TV -
(NGK)2 c=1 1
(NGK)2
Oc.. + NG _ Pck
C
(9.7)
= O'TO + ~ 5 ~-'p{.~ + A'O where A = ( A ' A * . . . A*) ~ is a form
CGK
× 1 vector and A* is a 1 x
CK
vector of the
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Erol~ Jr.
732
A*
2
--
NGK 2 (Pl .... Pc.Pv . "Pc...P>" . . . . .pc.)
The following result is important to our discussion. LEMMA 1. 0'T0 and A'0 are not independent. We observe that KCG
O'TO ~ E
Ai()~2)i '
A ' O ,-~
N(0,NA'£;A)
and
i=1
KCG
V'TV ~ Z
"~i()~2)i "~- N(0,NA'N~A)
+ 31
i=I
where {2i, i = 1,..., CGK} is NTlg; = (1/N( GK)2)T°£~ and l 51 =
the
set
of
characteristic
roots
of
c
~-~ E p 2. = lgT~ under Ho •
(9.8)
e=l
As for WSI = 1 - V'WV, under Ho 1
V'WV
G
C
G(NK)2 0__~1E~=Inc2g from (6.3) ]
a
c
2
g=l
c
l
c=l
G
c
g=l c~l
(9.9)
= O'WO + A'O + 51 from (9.8) Again,
V'WV
-
CGK
1
G(NK)2 V'WoV ~ Zi=l 2i(Z~)i+N(O'NA'F"~A) + 61
where {)~i,i= 1,...,CGK} is the set of characteristic roots of N W ~ = (1/NGK2)W~'E~. Let us denote by 01~Eo(BSI)
_
G-1
NGK
1---
1
K
02 =- Eo (TSI) = 1 - ~ 1 + ~ 1
=- eo(WS0
=
-
+
2
Pea ,
=Cc~.l[~=lP2 ck _NGp 2 ,
-
Genomic sequences and quasi-multivariate CA TANO VA
733
Also, we record here a result due to Searle (1971): when X is N(lt,£), the rth cumulant of X'AX is K~(X'AX) = Since ( V / ~ ) ~ totically
U
l(r --
1)![tr(A12) r + rltt'A(12A)"-l~] .
N(v/Ntto, 12°) and ( 0 / v ~ ) ~
N(0, £°), we obtain that asymp-
Var(BSI) = Var(V'BV) = Var(0'B0) = 2 trace(BN12°) 2
(9.10)
Var(TSI) = Var(V'TV) = 2 trace(TN£°) 2 + 4NIt'oTNN°TNlto
(9.11)
Var(WSI) = Var(V'WV) = 2 trace(WN12°) 2 + 4Ntt'oWN12°WNtto (9.12) and under H0, asymptotically, 2 2 trace(B°12~)2 Varo(BSI) = 2 trace ( N ~1- B ° 1 2 ; )2 - (NGK2)
(9.13)
Varo(TSl) -
(9.14)
Var0(WSI) =
2 trace(TON;)2 -~ 4 N 2(GK) 4 N(GK) 4 la'oT°12;T°la~ 2
(NG) 2K4
trace(W°12~) 2 + ~ I L t ' o W ° £ ; W ° l t o NG 2
(9.15)
Let
TN,l =- BSI - 01,
TN,2 =- TSI - 02,
TN,3~WSI-03
Note that KCG
(i)
o (N -1) i=1
since { 2 i : i = 1. . . . ,KC} is the set of characteristic roots of (1/N(GK) 2) (UK ® lc)12;, and E; = O(1).
(ii) A'O = Op(N -1/2) since A'O ~ N(0,NA ' ~0° A) and A = O(N 1) (iii) 6~ = ~5
2 = o(1). c=1 Pc.
Then,
TN2 = 1 -- VITV
= 1-
-
02
op(x-1)+Op(N-l/2)+ 1
= Op(N-l/2)
c
H. P. Pinheiro, F. Seillier-Moiseiwitseh, P. K. Sen and J. Eron Jr.
734
Similarly,
TN,3 = 1 - V'WV - 03 = Op(N -1/2) and BSI = V'BV = 0'B0 = Op(N -1) Thus
/
(.
BSI \
_BSI
.)
/BSI'~
[
nN=--Nk~-~) =N\TN,3+03j =Nk~7- ) 1+ 03J \ 03 J 4- Op(N-1/2) = N
-~
4- Op(N -1/2)
since TN,3 = Op(X-V2), (N(BSI)T:v,3)/(O~) = @(N-1/z), N(BSI) = Op(1) and
1 c 03 = 1 - ~ Z p
2. 4- O(N -1 ) = O~ 4- O(U -1)
e=i
By (9.4), we have the following central result: THEOREM 1. Asymptotically, under H0, BSI
FiN = N - - ~o
03
1 CGK * ~
~3
~ i=1
.~
2
(Z1)
(9.16)
i
where {2i: 1 , . . . , K G C } is the set of characteristic roots of (1/NGK2)B°~2~.
[]
Under H0, asymptotically we have E0(F1N )
- 1) 1 - K - NO1 __ ( G Gxo°3
=
Zk=lp c \
Var0(FiN ) = N 2 -Varo(BSI)] = 2 trace(B°£~) 2 (0~) 2 (GK20~) 2
J
We remark that the asymptotic distribution of FiN depends on the unknown pck's through the characteristic roots ,~i's. I f these characteristic roots are estimated in a consistent manner then (9.16) can be used to simulate this distribution by generating independent normal variables and repeating the simulation a large number of times. Alternatively, since the terms on the R.H.S. of (9.16) are i.i.d. Z12, s, under the Noether condition: max(,~i)
~0 ,
Genomic sequences and quasi-multivariate CA TANO VA
735
we can apply the central limit theorem for large CGK, and claim that K2(F1N-N~)
(9.17)
,',-, N ( 0 , ~r2)
where a 2 = 2 trace(B°£~)2 / ( GO~)2. Using a similar approach, when K = 1, Light and Margolin (1971) developed an equivalent CATANOVA test. Extending their CATANOVA approach to several positions, we note that the sum of squares are NGK WSS - ~
1 a c NGK WSI 2NK Z Z n2cg' : - - 2 '
(9.18)
g=l c=l
~ne..--NGKTsI, ---2 C
Tss_NGK 2
BSS - 2 ~
1 2NGK ~=1
1
G
2 _ ~ ncg"
n2..
and
NGK = ~BSI
(9.19)
.
(9.20)
c=l
In this setup, the test statistic is
BSS/(G-
1)
B S I / ( G - 1)
F{ = W S S / ( N G K - G) = W S I / ( N G K - 1)
(NGK - G) FiN N ( G - 1)
The test considered here is thus a natural extension of the Light-Margolin CATANOVA test to multiple sites, assuming the latter are stochastically independent. In the next section, we discuss another approach which does not rely on this assumption, albeit at the cost of more complexity in the underlying distribution theory.
10. A Hamming distance detour The Hamming distance is the proportion of positions at which two aligned sequences differ. We formulate an ANOVA model for these distances wherein we only assume independence among sequences (but not positions). Let X~ = (X/~,... ,X/R)' be a random vector representing sequence i of group g, where the X,~'s denote the category outcomes c ( k = 1 , . . . , K ; c = 1 , . . . , C ; g = 1 , . . . , G; i = 1,... ,N). For a pair X/g,Xi°,', the Hamming distance is K
D gii~g ' = K
1 Z / ( X / ~ ¢ X/~;) k--1
= K -1 {number of positions where Xy and X f differ}
(10.1)
736
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr.
Thus, for every pair of X/g,XJ, ~ii' FIgs' is a pseudo-metric. Taking this as a degree (1, 1), we define the following (generalized) U-statistics:
D°N=
(N) -1
Z
D~J°'
g:l'''''G
kernel of
;
(10.2)
g' = 1 , . . . ,G .
(10.3)
1
N
DgN°'=N-2 Z Z --v Dgf''
g¢
i=1 j=l
As a result, when we consider the G groups, we have
QNG/ I{(N/ G } Z DoN + N2 Z [)f' g=l l
~)N :
1 -
(NGal
+ (G -
1)b
N ,
(10.4)
where :
G
G g=l
(lO.5)
is the pooled within-group average Hamming distance, and DaN=
Z
D~°'
(10.6)
l<_g
is the average between-group Hamming distance. These statistics are expressible in terms of Hoeffding's (1948) generalized U-statistics for one- and two-sample models. Our goal is to compare statistically the two sets of within group and between group U-statistics in a conventional one-way ANOVA framework. We refer to Pinheiro et al. (1999) for a comprehensive account of these developments. While this approach circumvents the assumption of independence across positions, it deals with pairwise distances rather that diversity among sets of sequences (based on outcome frequencies).
11. Power of the test
We consider the test statistic formulated in Section 9, and discuss its power properties under suitable alternative hypotheses. A basic task in this respect is to check that the test is consistent against a (fixed) alternative that homogeneity of the p~k'S, across the G groups, does not hold. This can be easily verified. Looking at the expressions for WSI and TSI, we note that by virtue of the stochastic convergence of ncgk/N to Pcgk, the quadratic form in BSI will converge to a
Genomie sequences and quasi-multivariate CATANO VA
737
positive limit with probability approaching to l as N increases. Likewise, WSI stochastically converges to a positive limit (that cannot be larger than one). As a result, under any fixed alternative, as N increases,
N 1F1N- BSI WSI
÷ c*(>0) in probability
where c* depends on Pok" Therefore, FIN becomes stochastically large as N increases. On the other hand, the critical level of FaN, as discussed earlier, has a finite positive limit. Consequently, the power of the test converges to 1 as N increases. Given the consistency of the test, a natural question is: how does it behave for alternatives that become closer to the null hypothesis of homogeneity as N increases? In the statistical literature, this type of local alternatives is often referred to as P i t m a n alternatives, and they are chosen so that the limiting power of a test does not converge to 1 (as it does for any fixed alternative). Typically, such a Pitman alternative belongs to a O ( N -1/2) neighborhood of the null point in the parameter space. With that in mind, let us consider a sequence {HN} alternative hypotheses, 1 [IN : Peak = ~
"~cgk -~-Pck
c=l,...,C
k=l,...,K
g=l,...,G,
(11.1) where the 7cgk's are fixed quantities, not all zero, and they satisfy the constraints that apply to the pcgk's. Thus, 7cgk = 0 yields the null hypothesis H0 : Pcgk = Pck. The interest here is in the case where 7~gk ¢ 0. Then, 1
Ocg. = nee. - x/Nyco. - Npc.,
Oc.. = nc.. - v ~ T c . . - NGpc. .
Define c* = Nc (in the limit)
= N¼A'(4N2I - B ')A a(A1 - A~)'(4I - GKZ(B°) ~)(A~ - A~)
(11.2)
and 2i's are the diagonal elements of A ° = NA
(11.3)
with P being an orthogonal matrix (i.e., p i p = I) such that P F P 1 = A, where A is 1 a diagonal matrix and F = ( 1 / G N K )2( B ~ ) 1/2 1!2o [(Bo ) 1 / 2 ]. Also, let c~i = a2
(11.4)
738
H . P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr.
with ai being the ith row of the vector ½PB-1/2A : Kv~/(2v/N)P(B°) -1/2 (A~ - A ~ ) , which is a linear combination of the 7cgk's, since A1 = N-3/2A~ and A2 = N-3/2A~ are CGK × 1 vectors of the form
A1 m-
2 K2GN3/2
*
(All
...
A-k
1G)
with A~o = (710. . . . ?cg.710....
7cg. . . .
Ylg ....
YCg.) for
each g = 1 , . . . , G
2 A2 -- ()'KG'2N3/2 (A~... A~) where A~
=
(Yl . . . .
Yc..Yl . . . . .
Yc .....
];1 . . . . .
Y c..)
•
Then, we have the following:
THEOREM 2. Under the sequence of alternatives distribution of the test statistic is given by
{HN}
in (11.1), the limiting
P{F1N >>_ulHN}--+ P{ ai~=12i(Z2(6i)>_O°3u-c*} , where c*, 2i and c~i are defined as in (11.2), (11.3) and (11.4).
(11.5) []
The proof of this theorem is relegated to the Appendix.
12. Simulations
To assess the suitability of the asymptotic distribution for the test statistic, we generated N sequences, with K positions each, in each of G groups. The outcome falls into one of 4 categories, so that we simulate nucleotide sequences. We then computed the test statistic F{ and the standardized F~* (i.e, K(F{ - (al/a~))/cr,). We compare the observed value to the percentiles of the standard normal distribution. This procedure was repeated 500 times. The results are summarized in Tables 4 and 5. The data for Table 4 were generated using the same probability across all positions, i.e., pck = Pc, while for Table 5, different probabilities were used. The tables show the number below the 1st, 2.5th, 5th and 10th percentiles and above the 90th, 95th, 97.5th and 99th percentiles of the standard normal distribution. The number of ,'s reveals how far the observed counts are from the expected counts: * indicates that they are more than 2 standard deviations apart and ,~x more than 3 standard deviations apart.
Genomic sequences and quasi-multivariate CA TANO VA
739
Table 4 Results of simulations for diversity measures (Pck = Pc). N
K
20 20 20 50 50 50 I00 100 100 1O0 100 200 200
G
10 10 50 t0 10 50 t0 10 10 50 50 10 10
Percentiles of N(0, 1)
2 5 2 2 5 2 2 5 10 2 5 5 10
1
2.5
5
l0
90
95
97.5
99
0* 0* 0* 0* O* O* O* O* O* O* O* O* 1
0n 1"~ 0.* 0~ 0"* 0"* 0"* 1"* 4* 0.* 0"* 0.* 3*
0.* 2** 0.* 0.* 4'~ 0~ 0"* 5** 11" 0"* 3** 9** 14"
0~ 20 *~ 0*~ 0~ 33* 0~ 0"* 26** 40 0"* 23** 30* 40
52 48 59 67* 51 59 52 63 42 49 56 48 52
36* 30 49** 43 ~ 34 41"* 38* 38* 30 36* 32 32 31
27 *~ 21" 38 ~* 28 ~ 23 ~ 32** 30** 26** 14 21" 18 24** 17
16~ 16.* 27 ~ 19"~ 13"* 27 "~ 24** 11" 10" 16** 12"* 18"* 7
* indicates that the observed count is between 2SD and 3SD or - 3 S D and -2SD, and ** more than 3SD in either direction from its expected value Table 5 Results of simulations for diversity measures (Pck ~ Pc)N
G
20 20 20 50 50 50 100 100 100 100 100 200 200
K
l0 10 50 10 10 50 10 10 10 50 50 10 10
Percentiles of N(0, 1)
2 5 2 2 5 2 2 5 10 2 5 5 10
1
2.5
5
0* 0* 0* O* 0* O* O* 0* 1 O* O* O* 1
0"* 0n 0.* 0'~ 1~ 0~ 0"* 1"~ 3 0H 1n 1'~ 8
0~ 4** 0.* 0~ 6** 0'0~ 0"* 10"* 13" 0.* 9'~ 8~ 17
10
90
95
97.5
99
0.* 24 ~ 0"~ 0'~ 43 0~ 0~ 27 ~ 49 0~* 28 '~ 34* 40
59 54 63 60 51 53 53 41 53 46 57 54 54
36* 37* 39* 36* 33 37* 42** 28 30 28 35* 30 30
23 n 22* 29** 28** 17 24 ~ 31.* 19 18 22* 22* 18 17
16~ 13.* 20** 19~ 10 14~ 21 *~ 12.* 7 17"* 10" 12'~ 9
* indicates that the observed count is between 2SD and 3SD or - 3 S D and - 2 S D , and ** more than 3SD in either direction from its expected value
In both
cases, the asymptotic
N = 100, K = 10 a n d standard
normal
homogeneity Hence,
distribution
among
the more
distribution
G = 10. W h e n
one
is a g o o d
halves
the
approximation
number
when
of groups,
no longer applies. As the null hypothesis
the
assumes
t h e g r o u p s , o n e p o o l s all g r o u p s t o e s t i m a t e t h e p a r a m e t e r s .
groups,
the more
reliable the estimates
are. The
number
of
740
H. P. Pinheiro, F. Seillier-Moiseiwitsch,
P. K. Sen and J. Eron Jr.
positions appears to bear little influence on the validity of the asymptotics, as does the composition of the sites.
13. Data analysis When data sets are not large enough for the asymptotic results to apply, we need to call upon a resampling technique, such as the bootstrap. Here is a summary of the procedure: 1. Estimate pck from the data, tistic Fj. 2. Generate N sequences, with 3. Recompute the test statistic 4. Repeat steps 2 and 3 (1,000
i.e.,
Pck = (nclk + nc2k)/2N and compute the sta-
K positions each, in each of G groups, using/Sck. F1 from the bootstrap data and call it F{. repetitions).
The p-value is then #F{ >_Flobs/lO00. When sequences within a group are not independent, for instance when they are sampled from a single individual (i.e., the patients then define the groups), the reference distribution need to be altered to reflect the phylogenetic relationships. We then resort to reproducing the evolution of the sequences and generate this distribution on the basis of the simulated sequences. F r o m the sequence data, for each position, we estimate the frequency distribution of the nucleotides/amino acids. In view of the non-occurrence of some categories at specific positions (due to structural constraints), it is not reasonable to pool the estimates for different sites into overall estimates. We take the observed frequencies as substitution rates. For HIV, this is desirable as the observed frequencies reflect both structural constraints and immune selection pressures. The event of a mutation at a specific site is modeled as a two-step process. First, whether the position undergoes a change is governed by the overall rate for the genome under study. In the context of HIV sequences, this is the error rate for reverse transcriptase, i.e., 0.0005 per site per replication (Preston et al., 1988). Next, in case of mutation, the specific substitution follows the transition matrix described above. The simulation starts with the consensus sequence as seed. It is subjected to the mutation process a random number of times. We set this number between 100 and 2,400. Indeed, for HIV, mutations occur at the time of replication. Its replication rate is approximately 240 times per year, and this number represents the number of replications before transmission. This sequence thus gives rise to offspring sequences. In the present application, this branching process mimics HIV transmission: no offspring with probability 0.20 and 1 to 5 offspring with probabiIity 0.16 each (Blower and McLean, 1994). The tree is grown by repeating this process many times (with the output sequences from the previous generation as seeds). We obtain a total of 10,000 to 20,000 sequences. We sample without replacement as many sequences as there are in the original data set. F r o m these
Genomic sequences and quasi-multivariate CA TANO VA
741
we compute the test statistic. This sampling is performed a large number of times (here 1,000) to build up a reference distribution. We apply this procedure to viral sequences from 8 individuals who were treated with a protease inhibitor. Sequences were sampled from blood and semen at different time points. They are 1,041 nucleotides long and span the region coding for protease and reverse transcriptase. For each compartment, the data consist of the consensus sequence at each visit (Table 6). The object of the study is to ascertain whether the variability is comparable in both compartments. There are thus 2 groups. Each patient provides a set of sequences on which the null hypothesis of homogeneity is tested. For 4 of the 8 subjects, the null hypothesis is rejected.
14. Appendix We proceed to outline the proofs of Lemma 1 and Theorem 2. PROOF OF LEMMA 1. 0'T0 = (1/(NGK)2)O'T°O and A'0 are independent if and only if 1
A'NI2; ~
T° = 0
(Searle, 1971) .
1 N ( GK)2 A ' Z o T°
1
-k
-- N ( G K ) 2 ( A ' A * . . . A*)(I6 ® ~2o)(UKG ® Ic)
G
~
7r
:k
-- N ( G K ) 2 (A Zo(UK N Ic)A 12o(UK ® I c ) . . . A*I2~(Ux ® Ic))
Let a = (Pl . . . . pc.) I. Recall l~; = 12Ol ® - - - ® 1~0X 2 A*I2~(UK ® Ic) --NGK2 (a'~2ola'~2o2... atl~0K) (UK ® Ic) For each k, from (14)
at~0k
lk
1.
pc.Pcl, P'el,
c=l
/
• --
Pe.Pck
c=l
and the first element of the vector A*I2;(Ux ® Ic) is
/
742
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr. Table 6 Between-compartment comparison of variation Patients
Weeks
1
0 12 16 24
2
0 8
3
0 4 12 40 56
4
0 22 26
5
0 3 4 8
6
0
Blood
Semen
15.51 t
55.90 *2 13.29 t
51.78 *~ t 52.46*
0.01
8 13 25 29 49 58
Statistics
t t t
7
0 12 32
23.58**
8
0 12 16 24 41 48
12.06
t indicates the visits for which data are missing. Test statistics with ~ fall above the 99.5th percentile of the reference distribution and those with ~ above its m a x i m u m
NGK 2
Hence,
• - ZPc
PlkPek
0rT0 and At0 are not independent.
7~ 0
Genomic sequences and quasi-multivariate CA TANO VA
P R O O F OF T H E O R E M 2. N o t e
that
G
C
ncg.
BSI = V'BV = Z
nc..NK.v/~-
9=1 c=1 NG'v@
=~
Z
0c9 + Npo + NKv@
9=1 c = l
G
743
2
(NGK) 2
~7~9._ (Oc.. + NGpc. + ~7~..)NK\/~(NGK) 2
c F
= ~--~LNKv~d 9=1 c=l
NKGz
K,/NG
KGI2~j
2
= ~9=IZ< G ~(l
+2
9=1
(NGK) 2
]
> Oc'INK~)( I 'co} \NKv O (NCX) ]
'Cl } ~ )
2
= 0!B0 + ( a l - A2)'0 + N2(A1 - A2)t(A1 - A2) where B is as in (8.6), 0 of the f o r m
= ( 0 i l l . . . Ocgl . . .
!
CGK) , A1 and A2 are
CGK x 1 vectors
2
A1 _-_ K2GN3/2 (All~ ...
A ~
1G)
with A~o = (719. . . .
[email protected] . . . . 7@ . . . . 719. . . . 7@.) for each g = 1 , . . . , G 2 A2 _ ()'KG'2N3/2 ( A ~ . . . A~) A~ = (71 . . . . . ~C-.)J1 . . . . . 7C . . . . . ~21. . . . . ~/C..) Recall that ( 0 / x / N ) ~ N(0, ~:°) and 0'B0 ~ ~ c ~ K ;.i(Z2)i, with 2i's being the characteristic roots of ( 1 / N G K 2 ) B ° £ °, where with
Z ° = (I211 ®£12 ® - " ® £1x ®!221 ®.-- ®NGX) and 1~0~ is as in (6). N o w , let A~
= N3/2A 1
(14.1)
and A~ = N3/ZA2. Then
N(A1 - A2)'0 ~ N(0, (A~ - A~)'Z°(A~ - A~)) N o w , we call u p o n the following result: LEMMA 2. 0'B0 and A'10 are not independent. T h e p r o o f of L e m m a 2 is analogous to the p r o o f of L e m m a 1, and is left out.
744
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr.
Since 0'B0 and A'10 are not independent, 0'B0 and (A1 - A 2 ) ' 0 are also not independent. So, the distribution of V~BV is not the convolution of a linear combination of x2-random variables and a normal distribution. Let A = A1 - A2, then
VIBV -- fiB0 + AI0 + N2A/A : (BU20 + ½B-U2A)'(BU20 + ½B-U2A) - ¼A'B-1A + N2A'A ---- (BU20 + ½B-1/2A)'(BU20 + ½B-1/2A) + I A ' ( 4 N 2 I - B 1)A and let X = ( B 1 / 2 0 + ½ B - U 2 A ) , ½B-V2A. Note that,
v .B =
G()z<12
= ½ v~dK(B°)-~/2(A~
and N F =
F=(1/GNK2)(B*)U2Y2>[(B°)U2]'
and
~B=
(B°I-'/2N-3/21A - All -- A~) =
(1/GK2)(B~)I/2~2~[(B°)V2]' =
~
F °. Then,
v/NX ~ N(~ v@K(B°)-V2(A~ - A~" I 5 (B°)VzN°[(B~)V21' or
v ~ x - N(~;,r °) Let P be an orthogonal matrix (i.e., P ' P = I) such that P F P ' = A, where A is a diagonal matrix, and Y = PX ~ X = p i y Then,
and G
NX'X = N Y ' P P ' Y = NY'Y ~ Z
2i(Z2((~i))i
(14.2)
i=l
where ),i's are the diagonal elements of A* : NA, 6i = (a2i/2i), ai is the ith row of the vector ½ P B - 1 / z A = (Kv/-G)/(2v~)P(B°)-I/Z(A~-A~), which is a linear combination of the ycgk'S. Let c be the constant ¼A~(4N2I- B-1)A. Then, Pr(Fl _> u) -= Pr (N(X'X ~
+ c) _>u )
=
Pr(NX'X >_ O3u ° - c*)
(14.3)
Genomic sequences and quasi-multivariate C A T A N O VA
745
N o t e t h a t c* = N c = 4 (A~ - A ~ ) ' ( 4 I GK2(B°)-1)(A~ -A~) = O(1) since B -I = G ( N K ) 2 ( B ° ) -1 = O ( N ~) a n d A ' A = N - a ( A ~ - A ~ ) ' ( A ~ - A~) = O ( N - 3 ) . A s t h e n o n - c e n t r a l i t y p a r a m e t e r ~i i n c r e a s e s , t h e d i s t r i b u t i o n o f e a c h o f t h e n o n - c e n t r a l / 2 - r a n d o m v a r i a b l e s s h i f t s t o t h e r i g h t . T h e r e f o r e , t h e p r o b a b i l i t y i n (14.3) g o e s t o 1 a n d t h e p o w e r o f t h e t e s t c o n v e r g e s t o 1. []
References
Anderson, R. J. and J. R. Landis (1980). CATANOVA for Multidimensional Contingency tables: Nominal-scale response. Comm. Statist. Theory and Meth. 9(11), 1191 1206. Anderson, R. J. and J. R. Landis (1982). CATANOVA for Multidimensional Contingency tables: Ordinal-scale response. Comm. Statist. Theory and Meth. 11(3), 257-270. Bahadur, R. R. (1961). A representation of the joint distribution of responses to n dichotomous items. In: Studies in Item Analysis and Prediction (Ed., H. Solomon), pp. 158-176, Stanford University Press. Bishop, Y. M. M., S. E. Fienberg and P. W. Holland (1975). Discrete Multivariate Analysis - Theory and Practice. MIT Press Cambridge, Massachusetts. Blower, S. M. and A. R. McLean (1994). Prophylactic vaccines, risk behavior change, and the probability of eradicating HIV in San Francisco. Science 265, 1451 1454. Chakraborty, R. and C. R. Rao (1991). Measurement of genetic variation for evolutionary studies. In Handbook o f Statistics Vol. 8." Statistical Methods in Biological and Medical Sciences (Eds., C. R. Rao and R. Chakraborty), pp. 271-316. North-Holland. Coffin, J. M. (1986). Genetic Variation in AIDS Viruses. Cell 46, 1-4. Gini, C. W. (1912). Variabilita e Mutabilita. Studi Economico-Giuridici della R. Universita di Cagliari 3(2) 3 159. Graybill, F. A. (1961). An Introduction to Linear Statistical Models. McGraw-Hill, New York. Hahn, B. H., M. A. Gonda, G. M. Shaw, M. Popovic and J. A. Hoxie (1985). Genomic diversity of the acquired immune deficiency syndrome virus HTLV-III: Different viruses exhibit greatest divergence in their envelope genes. Proc. Natl. Acad. Sci. USA 82, 4813-4817. Hahn, B. H., G. M. Shaw, M. E. Taylor, R. R. Redfield and P. D. Markham (1986). Genetic variation in HTLV-III/LAV over time in patients with AIDS. Science 232, 1548-1553. Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19, 293 325. Karlin, S., B. Blaisdell and V. Brendel (1990). Identification of significant sequence patterns in proteins. In: Methods in Enzymology (Ed., R. Doolittle), pp. 388-402. Academic Press. Karnoub, M. C., F. Seillier-Moiseiwitsch and P. K. Sen (1999). A conditional approach to the detection of correlated mutations. In Statistics in Molecular Biology and Genetics. (Ed., F. SeillierMoiseiwitsch) Inst. Math. Statist. Lecture Notes-Monograph Series 33, 221-235. Kypr, J. and J. Mrfizek (1987). Unusual codon usage of HIV. Nature 327, 20. Liang, K., S. L. Zeger and B. Qaqish (1992). Multivariate regression analyses for categorical data. J. Roy. Statist. Soc. Ser. B 54, 3-40. Light, R. J. and B. H. Margolin (1971). An analysis of variance for categorical data. J. Amer. Statist. Assoc. 66, 534-544. Light, R. J. and B. H. Margolin (1974). An analysis of variance for categorical data II: Small sample comparisons with chi square and other competitors. J. Amer. Statist. Assoc. 69, 755-764. Mansky, L. M. and H. M. Temin (1995). Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J. Virology 69, 5087-5094. McCutchan, F. E., B. L. P. Ungar, P. Hegerich, C. R. Roberts, A. K. Fowler, S. K. Hira, P. L. Perine and D. S. Burke (1992). Genetic analysis of HIV-1 isolates from zambia and an expanded phylogenetic tree for HIV-1. J. Acquired Immune Deficiency Syndromes 5, 441-449.
746
H. P. Pinheiro, F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron Jr.
Nayak, T. K. (1986). An analysis of diversity using rao's quadratic entropy. Sankhyd: The Indian J. Statist. Ser. B 48, 315-330. Nayak, T. K. (1986). Sampling distribution in analysis of diversity. SankhyS: The Indian J. Statist. Ser. B 48, 1-9. Nayak, T. K. and J. L. Gastwith (1989). The use of diversity analysis to assess the relative influence of factors affecting the income distribution. J. Bus. Econ. Statist. 7, 453M60. Pinheiro, H. P. (1997). Modelling Variability in the H I V Genome. PhD thesis, University of North Carolina, December, Mimeo Series No. 2186T. Pinheiro, H. P., F. Seillier-Moiseiwitsch and P. K. Sen (1999). Analysis of variance based on the hamming distance. Preprint. Preston, B. D., B. J. Poiesz and L. A. Loeb (1988). Fidelity of HIV-1 reverse transcriptase. Science 242, 1168-1171. Rao, C. R. (1973). Linear Statistical Inference and Its Applications. John Wiley & Sons, second edn. Rao, C. R. (1982). Diversity and dissimilarity coefficients: A unified approach. Theor. Popln. Biol. 21, 24M3. Rao, C. R. (1982). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhy& Ser. A 44, 1-21. Rao, C. R. (1982). Gini-simpson index of diversity: A characterization, generalization and applications. Utilitus Mathematica 21, 273-282. Rao, C. R. (1984). Convexity properties of entropy functions and analysis of diversity. In Inequalities in Statistics and Probability. Inst. Math. Statist. Lecture Notes 5. Roy, S. N., B. G. Greenberg and A. E. Sarhan (1960). Evaluation of determinants, characteristic equations, and their roots for a class of patterned matrices. J. Roy. Statist. Soc. Ser. B 22(2), 348-359. Searle, S. R. (1971). Linear Models. John Wiley & Sons. Searle, S. R. (1982). Matrix Algebra Useful for Statistics. John Wiley & Sons. Seillier-Moiseiwitsch, F., B. H. Margolin and R. Swanstrom (1994). Genetic variability of human immunodeficiency virus: Statistical and biological issues. Annu. Rev. Genetics 28, 559 596. Sen, P. K. (1999). Utility-oriented Simpson-type indexes and inequality measures. Calcutta Statist. Assoc. Bull. 49 (in press). Shorrocks, A. F. (1980). The class of additively decomposable inequality measures. Econometrica 48, 613-626. Simpson, E. H. (1949). The Measurement of Diversity. Nature 163, 688. Varmus, H. and P. Brown (1989). Retroviruses. In Mobile D N A (Eds., D. E. Berg, and M. M. Howe), pp. 53-108. American Soc. for Microbiology.