The Promise of Composite Likelihood Methods for Addressing Computationally Intensive Challenges


Na Li
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455

I. Introduction
II. Composite Likelihood Methods
III. Applications in Population Genetics
IV. Applications in Fine Mapping of Disease Mutations
  A. Terwilliger's method
  B. Devlin et al.'s method
  C. Malécot model for linkage disequilibrium
  D. Other methods
V. Other Applications
VI. Prospectives and Discussion
References

ABSTRACT

High-dimensional genetic data, due to their complex correlation structure, pose an enormous challenge to standard likelihood–based methods for making statistical inference. As an approximation, composite likelihood has proved to be a successful strategy for some genetic applications. It has the potential to see even wider application, and much research is needed. We first give a brief description of composite likelihood. The advantages of this method and the potential challenges in inference are noted. Next, its applications in genetic studies are reviewed, specifically in estimating population genetics parameters such as the recombination rate, and in multi-locus linkage disequilibrium mapping of disease genes, with some discussion of future research directions.

I. INTRODUCTION

For high-dimensional correlated data, the joint distribution may be difficult to specify or calculate. One such example is population-based molecular genetic data. Genetic markers on the same chromosome from apparently "unrelated" individuals are correlated in a very complex way. The individual chromosomes are connected through an unobserved genealogy that may extend back many generations and is essentially of infinite dimension even for very small samples. Many stochastic population and evolutionary processes, such as subdivision, migration, and selection, can influence the genealogy. On a chromosome, genetic polymorphisms first arise through mutation. While marker alleles on the same chromosome tend to be passed together from generation to generation, recombination and gene conversion allow the exchange of genetic material among chromosomes.

As an approximation to the standard complete likelihood, composite likelihood methods (Lindsay, 1988) provide a good alternative strategy for modeling the data and making statistical inference. The composite likelihood is formed as the product of component likelihoods, which may be correlated. The approach is sometimes referred to as pseudo-likelihood (Besag, 1974; Cox and Reid, 2004). However, the term pseudo-likelihood is also used in other contexts; to avoid confusion, we will adhere to the nomenclature of Lindsay (1988). In addition to the genetic applications reviewed here, composite likelihood methods are often used in image analysis, spatial analysis (ecology), time series, and social network modeling.

II. COMPOSITE LIKELIHOOD METHODS

Suppose Y is a vector random variable with density f(y; θ), where θ is an unknown vector parameter. Under certain regularity conditions, the maximum likelihood estimator (MLE) of θ is obtained by setting the score function (the derivative of the log likelihood) to zero and solving the resulting equation:

U(θ) ≡ ∂ log f(y; θ)/∂θ = ∂ℓ(θ)/∂θ = 0

The score function U(θ) is unbiased, with E U(θ) = 0, and satisfies

E(UUᵀ) = −E(∂U/∂θ)     (1)

The above quantity is also known as the Fisher information I(θ). If we observe n independent realizations from the distribution f(y; θ), the MLE θ̂ from the score equation Σ_{i=1}^n U_i(θ) = 0 is consistent and has an asymptotic normal distribution,

√n (θ̂ − θ₀) → N(0, I⁻¹(θ₀))

as n goes to infinity, where θ₀ denotes the true value of the parameter.

In some situations, the joint distribution f may be difficult to express analytically and difficult to calculate. It could be easier to find a component log likelihood ℓ_i(θ) = log f(y_i; θ) for a subset of the data Y_i (say, i = 1, ..., m). The components Y_i need not be mutually exclusive, and the density may be expressed marginally or conditionally on other parts of the data. The choice depends on the application. For example, in a Markov random field the model is expressed in terms of conditional densities f(x_i | x_{n(i)}), where n(i) denotes the neighborhood of location i; it is then natural to use the conditional likelihoods as components. In other cases, the marginal likelihood f(y_i; θ) for mutually exclusive Y_i is more convenient. When there is little information about θ in the marginal likelihoods, we can also use the marginal pairwise likelihoods f(y_i, y_j; θ). Examples will be shown later on. In any case, the composite log likelihood is simply the sum of the log component likelihoods:

ℓ_C(θ) = Σ_{i=1}^m ℓ_i(θ)

The composite likelihood is not the true likelihood function unless the components are independent. Nevertheless, we can estimate the parameter θ by maximizing this composite log likelihood. Under suitable regularity conditions, we can exchange the differentiation and summation, and it can be shown that the maximum composite likelihood estimator is also consistent, since the composite score function

U_C(θ) = ∂ℓ_C(θ)/∂θ = Σ_{i=1}^m ∂ℓ_i(θ)/∂θ = Σ_{i=1}^m U_i(θ)

is an unbiased estimating function whenever each ℓ_i is a genuine log likelihood for component i, so that E(U_i) = 0. Note, however, that in population genetics applications the complex dependence structure means that the regularity conditions may not hold, or that it might be very difficult to prove that they do, as we will see later.
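As a concrete illustration of this construction (a toy example, not taken from the chapter, with every name and number invented here), the short Python sketch below builds a pairwise composite log likelihood for an exchangeable multivariate normal model and maximizes it numerically. Each component is a genuine bivariate normal log likelihood, so the composite score is a sum of unbiased component scores even though the components overlap and are correlated.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy data: n independent vectors of dimension d with common mean mu,
# unit variances, and exchangeable correlation r.  (The full likelihood is
# tractable here, which is what makes this a convenient test bed.)
n, d, mu_true, r_true = 200, 8, 1.0, 0.4
cov_true = (1 - r_true) * np.eye(d) + r_true * np.ones((d, d))
Y = rng.multivariate_normal(mu_true * np.ones(d), cov_true, size=n)

def neg_pairwise_cl(theta, Y):
    """Negative pairwise composite log likelihood for (mu, r)."""
    mu, r = theta
    dim = Y.shape[1]
    cov2 = np.array([[1.0, r], [r, 1.0]])
    total = 0.0
    for s in range(dim):
        for t in range(s + 1, dim):
            # Each component is the exact bivariate marginal log likelihood.
            total += multivariate_normal.logpdf(
                Y[:, [s, t]], mean=[mu, mu], cov=cov2).sum()
    return -total

fit = minimize(neg_pairwise_cl, x0=[0.0, 0.1], args=(Y,),
               bounds=[(None, None), (-0.1, 0.99)])
print("maximum composite likelihood estimate of (mu, r):", fit.x)
```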

The consistency of a composite likelihood estimator requires only the correct specification of the component likelihoods (e.g., the marginal distribution of a univariate Y_i), so it is more robust to model misspecification. The composite likelihood estimator can be consistent even when the complete likelihood estimator is not, since the complete likelihood is more prone to misspecification. Assuming the necessary conditions hold, the composite likelihood estimator θ̃ is asymptotically normal with mean θ₀. The asymptotic variance is the inverse of an analog of Fisher's information defined as (Lindsay, 1988)

I_C(θ) = E(−∂U_C/∂θ) [Var(U_C)]⁻¹ [E(−∂U_C/∂θ)]ᵀ = (Σ_{i=1}^m Var(U_i)) [E(U_C U_Cᵀ)]⁻¹ (Σ_{i=1}^m Var(U_i))     (2)

The middle term involves E(U_i U_j), which can be difficult to calculate, especially since the joint distribution of Y_i and Y_j may not even be specified. If independent realizations from f(y; θ) are available, it is possible to estimate it using the empirical variance, in the spirit of generalized estimating equations (Liang and Zeger, 1986; Zeger et al., 1988), which is consistent as the number of independent units goes to infinity. Otherwise, it might be possible to use weighted empirical variance estimates (Lumley and Heagerty, 1999) or resampling-based methods (Heagerty and Lele, 1998). In addition, the composite likelihood ratio statistic,

CLRT = 2(ℓ_C(θ̃) − ℓ_C(θ₀))

no longer has the usual asymptotic χ² distribution. Making valid inference based on composite likelihood is a major difficulty. Because the components of the composite likelihood are generally likely to be positively correlated, ignoring the correlation (assuming E(U_i U_j) = 0, or using the standard χ² distribution) is expected to be anti-conservative (an inflated type I error rate). In practice, at least in the genetic examples, inference based on composite likelihood is done informally, and simulation studies are often conducted to calibrate the results.

It can be shown that the information for the composite likelihood (2) is smaller than that of the complete likelihood (1). Except in special cases (e.g., at specific parameter values), the composite likelihood is less efficient than the complete likelihood. Lindsay (1988) noted that it might be possible to improve efficiency by using proper weights for the individual components. The idea is similar to the use of a suitable working correlation matrix (instead of the identity matrix) in generalized estimating equations (Liang and Zeger, 1995). In practice, the optimal weights may be difficult to find. However, when the components carry equal marginal information, using equal weights is the optimal strategy.
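To make the variance issue tangible, here is a hedged sketch (again a toy model with invented numbers, not from the chapter) in which the naive variance that treats the pairwise components as independent is compared with the sandwich-type variance implied by Eq. (2), using the empirical variance of the per-replicate composite scores in the spirit of the generalized estimating equation approach mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting (invented): n replicates of a d-vector with common mean mu,
# unit variances, and exchangeable correlation r treated as known; only mu
# is estimated, from the pairwise composite likelihood.
n, d, mu_true, r = 500, 6, 0.0, 0.5
cov = (1 - r) * np.eye(d) + r * np.ones((d, d))
Y = rng.multivariate_normal(mu_true * np.ones(d), cov, size=n)

def score_one(mu, y):
    """Composite score of one replicate: sum over pairs (s, t) of the
    bivariate-normal score d/dmu log f(y_s, y_t) = ((y_s-mu)+(y_t-mu))/(1+r)."""
    u = 0.0
    for s in range(d):
        for t in range(s + 1, d):
            u += ((y[s] - mu) + (y[t] - mu)) / (1 + r)
    return u

# The composite score is linear in mu, so the maximizer is the grand mean.
mu_hat = Y.mean()

scores = np.array([score_one(mu_hat, y) for y in Y])
H = n * d * (d - 1) / (1 + r)   # sensitivity -E(dU_C/dmu): d(d-1)/2 pairs, 2/(1+r) each
J = np.sum(scores ** 2)         # empirical Var(U_C) from the independent replicates

var_naive = 1.0 / H             # pretends the pairwise components are independent
var_sandwich = J / H ** 2       # sandwich variance implied by Eq. (2)
print("naive SE:", np.sqrt(var_naive), " sandwich SE:", np.sqrt(var_sandwich))
```

In this toy setting the sandwich standard error comes out several times larger than the naive one, which is exactly the anti-conservative behavior described above.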

III. APPLICATIONS IN POPULATION GENETICS

One problem in population genetics that has attracted tremendous interest recently is making inference about population parameters from a sample of DNA sequence or haplotype data. Consider a random sample of n haplotypes typed at L loci, h_ij, i = 1, ..., n, j = 1, ..., L. For simplicity, we assume that the loci are diallelic and ignore the fact that humans are diploid and that haplotypes often have to be inferred from unphased genotype data. The haplotypes are related because they are descended from common ancestors. Mutations arise to create new polymorphic loci. Adjacent alleles are likely to be passed together from generation to generation (i.e., they are in linkage). The haplotypes can also exchange alleles through recombination, which breaks down the linkage. In the current generation, however, there may not have been enough historical recombination events, especially between close loci, so the loci may still be correlated, that is, in linkage disequilibrium. One quantity of particular interest is the scaled recombination rate ρ = 4N_e r, where r is the per-site recombination rate per meiosis and N_e is the effective population size. It is a natural measure of multi-locus linkage disequilibrium (Pritchard and Przeworski, 2001).

Coalescent-based models (Hudson, 1983; Kingman, 1982) are commonly used to describe this stochastic population process and form the basis for statistical inference. For a given genealogical history of a sample, it is trivial to determine the probability of the data under the coalescent model for fixed parameter values. However, calculation of the likelihood requires integration over the huge set of possible genealogical histories compatible with the sample. Despite the development of sophisticated Monte Carlo algorithms (e.g., Fearnhead and Donnelly, 2001; Griffiths and Marjoram, 1996; Kuhner et al., 2000; Stephens and Donnelly, 2000), it remains an impossible task for all but the smallest data sets. To get around the computational challenge, several methods have been proposed based on approximations of the likelihood.

The first such approach to estimating ρ was proposed by Hey and Wakeley (1997). They first derived the likelihood function for a pair of loci in a sample of four haplotypes. Then, for three informative (i.e., polymorphic) loci, a composite likelihood is formed as the product of the two probabilities from loci 1 and 2 and from loci 2 and 3, and a maximum likelihood estimate of ρ is obtained. For a larger sample with more than four haplotypes and three loci, an overall estimate of ρ is taken to be the average of the estimates over every possible set of two intervals in all possible subsets of four haplotypes. In practice, this can only be done using a (relatively) small number of random subsets. They also discussed the possibility of using a composite likelihood over all possible subsets instead of simply averaging the estimates, but this was too computationally intensive to be feasible.

In Fearnhead and Donnelly (2001), an importance sampling–based approach was developed to estimate ρ from haplotype data. However, it was very computationally intensive, especially for a large number of segregating loci. The authors later proposed a composite likelihood approach (Fearnhead and Donnelly, 2002) for longer haplotypes. In this approximation, long sequences are divided into small disjoint subregions, and the likelihoods for the subregions are combined to form the composite likelihood. Through simulations they showed that the composite likelihood estimator performs fairly well. Despite the dependence of the components, standard asymptotic distributional results are useful approximations if the number of subregions is small and the subregions are relatively large.

Hudson (2001) proposed another composite likelihood method based on the likelihoods of all possible pairs of loci. For two diallelic loci, the data can be summarized in a 2 × 2 table. Under certain model assumptions, the probability of the data can then be calculated either analytically (for small numbers of haplotypes) or by a Monte Carlo method. These probabilities can be tabulated and saved to serve as a look-up table for future computation. As a result, this method is extremely fast. The pairwise likelihood method was later extended to allow for recurrent mutations (McVean et al., 2002), where importance sampling is used to calculate the pairwise likelihoods, and was applied to viral genetic data. With an emphasis on modeling fast-evolving virus populations (e.g., HIV-1), Carvajal-Rodriguez et al. (2006) further extended this composite likelihood estimator with a more general mutation model that allows more than two states and variable mutation rates among loci. Wall (2004) considered estimation of the gene conversion rate using a composite likelihood formed from three-locus likelihoods.

Most of these authors emphasized point estimation only and used simulations to evaluate performance empirically (Smith and Fearnhead, 2005). Fearnhead (2003) presented some theoretical results for composite likelihood–based estimators of ρ. Note that, because of shared ancestry and linkage, independent realizations from the population process are not available. In particular, since the individuals are highly correlated, the information content increases very slowly with the number of haplotypes (individuals), roughly on the order of log(n) (Felsenstein, 2006; Pluzhnikov and Donnelly, 1996). Therefore, Fearnhead (2003) considered only the asymptotics as the number of loci goes to infinity. It was shown that the composite likelihood estimator of Fearnhead and Donnelly (2002) is consistent, whereas the consistency of Hudson's pairwise likelihood estimator could not be proved. Intuitively, this is because the correlation between two subregions decreases and eventually goes to zero as the distance between them increases; furthermore, assuming equal subregion lengths and constant ρ, the data from each subregion are identically distributed. On the other hand, when all possible pairwise likelihoods are included, the components are not identically distributed and the correlation structure is complicated.
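The mechanics of the pairwise strategy of Hudson (2001) described above can be sketched in a few lines. In the sketch below the two-locus log likelihood is a purely synthetic stand-in so that the script runs; in a real analysis it would be a look-up into the precomputed table of two-locus probabilities indexed by the observed pair configuration and by the ρ between the two sites. The positions, grid, and function names are all invented, and the optional distance truncation anticipates the weighting discussed below.

```python
import numpy as np

# Schematic of a pairwise composite likelihood for the scaled recombination
# rate rho (here per kb), in the spirit of Hudson (2001).
positions_kb = np.array([0.0, 1.2, 3.5, 7.0, 9.1])
rho_grid = np.linspace(0.01, 20.0, 400)          # candidate rho per kb

def pair_loglik(i, j, rho_between):
    """Synthetic stand-in for a tabulated two-locus log likelihood; replace
    with a look-up into a precomputed coalescent table in real use."""
    dist = positions_kb[j] - positions_kb[i]
    return -0.5 * (np.log1p(rho_between) - np.log1p(5.0 * dist)) ** 2

def composite_loglik(rho_per_kb, max_kb=np.inf):
    """Sum the pairwise log likelihoods over all (optionally truncated) pairs."""
    total = 0.0
    L = len(positions_kb)
    for i in range(L):
        for j in range(i + 1, L):
            dist = positions_kb[j] - positions_kb[i]
            if dist > max_kb:      # truncation: drop distant, weakly
                continue           # informative pairs (cf. Fearnhead, 2003)
            total += pair_loglik(i, j, rho_per_kb * dist)
    return total

values = [composite_loglik(r) for r in rho_grid]
rho_hat = rho_grid[int(np.argmax(values))]
print("composite likelihood estimate of rho per kb:", rho_hat)
```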

Recently, Wiuf (2006) proved the consistency of the subregion-based composite likelihood methods for estimating population parameters in general (not just recombination rates), and under more general population models. The estimator based on a truncated version of the pairwise likelihood method, however, was shown to be consistent (Fearnhead, 2003). In this version, only pairs of loci that are within a certain fixed distance of each other are included. Again, this ensures that the correlation decreases to zero when the components (pairs of loci) are sufficiently far apart. It was also noted that this truncated composite likelihood corresponds to a weighted composite likelihood with weight 0 when the pair of single nucleotide polymorphisms (SNPs) is too far apart. The weighting scheme ensures the consistency of the estimator with little if any loss of efficiency.

Another topic related to estimating recombination rates is the detection of recombination hotspots (Crawford et al., 2004; Fearnhead et al., 2004; Li and Stephens, 2003). Fearnhead and Smith (2005) proposed a penalized likelihood model,

Σ_{i=1}^{S−5} ℓ_i(ρ_i) − λh

where S is the total number of loci and h is the (unknown) number of hotspots. Each component ℓ_i(ρ_i) is the log-likelihood function [based on an approximate version of Fearnhead and Donnelly (2002)] for a subregion of 5 loci, with ρ_i the average recombination rate in the subregion. Note that here the subregions are overlapping, and λ is a penalty parameter designed to prevent overfitting. This is also an example of a composite likelihood in which equal (but not fixed) weights are used.

Another application of composite likelihood methods in population genetics is to detect the effects of selection (e.g., Jensen et al., 2005; Kim and Stephan, 2002; Meiklejohn et al., 2004; Zhu and Bustamante, 2005). In particular, Kim and Nielsen (2004) extended the pairwise likelihood of Hudson (2001) to a selective sweep model. No theoretical results have been proved regarding the properties of these estimators; in particular, Wiuf (2006) noted that the theory developed in that paper does not apply to models with selection.

IV. APPLICATIONS IN FINE MAPPING OF DISEASE MUTATIONS

A simple example of the use of composite likelihood dates from the earliest days of linkage analysis. Exact multi-point linkage analysis software is very limited in the maximum size of pedigree it can handle, and even Markov chain Monte Carlo–based algorithms have their limits. When a very large pedigree is encountered, it is broken into smaller pieces to satisfy the constraints of the software.

The likelihoods (LOD scores) from the sub-pedigrees are then summed as if they were independent. It was shown that the loss of efficiency can be reduced when larger components are used (Chapman et al., 2001). In traditional genetic linkage and association studies, founders or population-based cases and controls are assumed to be independent. Although this is generally a good approximation, it is not always appropriate (Leutenegger et al., 2002), and care must be taken to properly control the type I error rate.

One of the challenges in linkage disequilibrium–based association analysis is to combine information from multiple marker loci. Several composite likelihood methods have been developed that are based on the association between a marker and the putative disease locus. The primary difference among them is the parametrization of the association parameter and the model used to relate the association parameter to genetic distance or recombination rate, which provides the basis for combining information across markers at varying distances from the disease locus.

A. Terwilliger's method

In Terwilliger (1995), the association between the disease and a (potentially multi-allelic) marker is characterized by the λ statistic, the relative increase in frequency of a marker allele among chromosomes carrying the disease mutation compared with the general population. Consider a multi-allelic marker and a sample of chromosomes containing either the disease allele D or the normal allele N. Let p_i be the population frequency of marker allele i, and q_i = Pr(i | D) the frequency of allele i on the D-bearing chromosomes. Assuming that allele i is associated with the disease, that is, it occurred on the ancestral haplotype on which the disease mutation D first appeared, we have q_i = p_i + λ(1 − p_i), which defines the association parameter λ. Using Bayes' theorem, it is easy to see that r_i = Pr(i | N) = p_i − λ(1 − p_i)[p_D/(1 − p_D)], where p_D is the frequency of the disease mutation. Similarly, for an allele j not associated with the disease, we can derive q_j = Pr(j | D) = p_j − λp_j, and r_j = Pr(j | N) follows from the constraint that the conditional allele frequencies sum to one. Thus, the data are reduced to an m × 2 table, with m being the number of alleles.

A likelihood function for λ can be written, assuming allele i is associated with the disease. Let X_j denote the number of haplotypes carrying allele j and allele D, and Y_j the number of haplotypes carrying allele j and allele N (the wild-type allele); the likelihood function is computed as

L_i = Π_{j=1}^m q_j^{X_j} r_j^{Y_j}

where q_j and r_j are given above and the product is taken over all m alleles of the marker, including the allele i assumed to be associated with D. If we assume in addition that the first disease mutation occurred independently of any marker allele, the probability that it arose on a chromosome carrying allele i is then p_i, the frequency of allele i. The overall likelihood is written as

L = Σ_{i=1}^m p_i L_i

which can be maximized to provide a maximum likelihood estimate of λ. Note that it is assumed here that the marker alleles are older than the disease mutation and that the marker allele frequencies do not change over time. One consequence of this assumption is that common alleles are more likely to be associated with the disease mutation. Another consequence of the model is that λ ≥ 0 (only positive correlation is allowed). To test the null hypothesis λ = 0, the likelihood ratio statistic asymptotically has a 1:1 mixture distribution of a point mass at 0 and χ²₁.

Likelihoods for multiple markers are then combined to form a composite likelihood. The relationship λ_k = α(1 − θ_k)^n is used to pool information across markers. Here, α is the proportion of the disease chromosomes bearing the same ancestral mutation D; it is a heterogeneity parameter that allows for uncertainty in the disease locus genotype (i.e., not all chromosomes from affected individuals carry the same mutation at the locus of interest). The quantity n is the number of generations since the disease mutation D occurred, and θ_k is the known recombination fraction between marker k and the test position. The author proposed using fixed values of α and n (which determine λ_k for each marker since θ_k is known) and using the sum of the likelihood ratio test statistics over all the markers as the overall test statistic, which is again assumed to have a 50:50 mixture of χ²₀ and χ²₁ as its distribution. The test becomes increasingly conservative as the number of markers grows. Alternatively, a composite likelihood approach can be used to estimate the unknown disease location: the single-marker likelihoods L(λ_k) are combined into a composite likelihood L(τ, α, n), where each λ_k is expressed in terms of the distance between marker k and the disease locus position τ using the formula above. For a given disease locus position τ, the likelihood is maximized with respect to α and n. The profile likelihood can then be plotted as a function of the putative disease locus position, similar to the location LOD score in linkage analysis. It was noted that the estimate of the disease locus position was not very reliable because the model assumptions might not be valid.
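A minimal numerical sketch of the single-marker likelihood, with invented counts and frequencies, is given below. It follows the parametrization reviewed above, obtains the normal-chromosome frequencies from the accounting identity p_j = q_j p_D + r_j (1 − p_D) (an assumption of this sketch, consistent with the Bayes argument above), and maximizes the mixture likelihood L(λ) = Σ_i p_i L_i(λ) by a grid search.

```python
import numpy as np

# Toy counts for one multi-allelic marker (invented for illustration):
# X[j] = haplotypes carrying marker allele j and the disease allele D,
# Y[j] = haplotypes carrying marker allele j and the normal allele N.
X = np.array([30, 5, 5])
Y = np.array([40, 30, 30])
p = np.array([0.5, 0.25, 0.25])   # population marker allele frequencies
p_D = 0.01                        # frequency of the disease mutation

def log_lik(lam):
    """log L(lambda) = log sum_i p_i L_i(lambda), mixing over which allele i
    sat on the ancestral disease-bearing haplotype."""
    m = len(p)
    log_Li = np.empty(m)
    for i in range(m):
        q = p * (1.0 - lam)                   # Pr(j | D) for j != i
        q[i] = p[i] + lam * (1.0 - p[i])      # Pr(i | D)
        r = (p - q * p_D) / (1.0 - p_D)       # Pr(j | N) via p = q*pD + r*(1-pD)
        log_Li[i] = np.sum(X * np.log(q) + Y * np.log(r))
    a = np.log(p) + log_Li                    # log-sum-exp over i
    return a.max() + np.log(np.exp(a - a.max()).sum())

lams = np.linspace(0.0, 0.99, 100)
lam_hat = lams[int(np.argmax([log_lik(l) for l in lams]))]
lrt = 2.0 * (log_lik(lam_hat) - log_lik(0.0))
print("lambda_hat:", lam_hat, " LRT (vs 50:50 chi2_0/chi2_1 mixture):", lrt)
```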

B. Devlin et al.'s method

For very rare alleles only, Devlin et al. (1996) proposed a different measure of association, the robust attributable risk δ, which, for a 2 × 2 marker–disease table with cell proportions π_ij, equals

δ = (π_11 π_22 − π_12 π_21) / (π_11 π_22 + π_21 π_22)     (3)

The sample statistic Y = −log δ̂ is then modeled approximately by a gamma distribution with mean μ and variance μ/λ. This model assumes a Wright–Fisher population with nonoverlapping generations. In addition, the marker polymorphism is assumed to be older than the disease mutation, and its frequency is assumed not to change over time. The disease mutation arose only once, as a single copy, and thus was initially in complete disequilibrium with nearby marker alleles. The mean μ is approximately nθ, where n is again the number of generations since the disease mutation occurred and θ is the recombination fraction between the marker and the disease locus. The variance of Y is harder to formulate directly; rather, it is expressed as Var(Y_j) = r_j α for the j-th marker, where α = Var(Y_1) and r_j, the relative variance, is assumed to depend only on θ and the allele frequencies. By summing the likelihood contributions of all the markers, a composite likelihood is formed with parameters (n, θ, α). The model was also extended in the same paper by adding a heterogeneity parameter to allow for two disease mutations instead of just one.
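The following schematic sketch (invented numbers throughout, and a deliberately simplified treatment of the relative variances, which in the actual method are derived from θ and the allele frequencies) shows the mechanical form such a gamma-based composite likelihood can take: each marker contributes a gamma log density moment-matched to mean nθ_j and variance r_j α, and the composite log likelihood is profiled over a grid of candidate disease locations.

```python
import numpy as np
from scipy.stats import gamma

# Invented illustration data: marker positions (cM), observed Y_j = -log(delta_hat_j),
# and relative variances r_j (taken as given here).
pos_cM = np.array([0.0, 0.4, 1.1, 1.9, 3.0])
Yobs = np.array([0.8, 0.5, 1.4, 2.3, 3.1])
r_rel = np.array([1.0, 1.1, 1.3, 1.6, 2.0])

def composite_loglik(tau_cM, n_gen, alpha):
    """Sum of gamma log densities, moment-matched to mean n*theta_j and
    variance r_j*alpha, with theta_j taken from the distance to tau."""
    theta = np.abs(pos_cM - tau_cM) / 100.0   # cM -> recombination fraction (small-distance approx.)
    theta = np.maximum(theta, 1e-6)           # avoid a degenerate mean at tau == marker
    mean = n_gen * theta
    var = r_rel * alpha
    shape, scale = mean ** 2 / var, var / mean
    return gamma.logpdf(Yobs, a=shape, scale=scale).sum()

taus = np.linspace(0.0, 3.0, 61)
profile = [max(composite_loglik(t, n, a)
               for n in (20, 50, 100, 200) for a in (0.2, 0.5, 1.0))
           for t in taus]
tau_hat = taus[int(np.argmax(profile))]
print("location with highest profile composite likelihood (cM):", tau_hat)
```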

C. Malécot model for linkage disequilibrium

Morton and his colleagues have developed a rich literature for modeling linkage disequilibrium and for LD mapping based on the association measure

ρ̂ = (ad − bc) / [(a + b)(c + d)]

where a, b, c, and d are the counts of the four haplotypes between two diallelic loci, arranged such that ad ≥ bc and b ≤ c (Maniatis et al., 2002; Morton et al., 2001; Morton et al., 2007; Morton, 2005). Under Malécot's model for isolation by distance (Malécot, 1973), its expectation can be written as

E(ρ̂) = ρ = (1 − L)M exp(−Σ_i ε_i d_i) + L

where L is a parameter that accounts for bias and can be interpreted as the association between unlinked loci, and M allows for multiple origins of the disease mutation and equals 1 if there was only a single mutation (i.e., complete initial LD when the mutation occurred). Σ ε_i d_i is the distance between the two loci, where the sum is taken over the intervals between pairs of adjacent loci; ε_i is assumed to be constant within each interval and d_i is the physical length of the interval (in kilobases). Thus ε_i d_i provides an additive measure of LD, analogous to genetic distance, measured in LD units (LDU) (Maniatis et al., 2002). ε_i is specific to each adjacent marker pair and can be estimated from population data in much the same way as linkage maps are constructed. Population-specific LD maps have been constructed using genome-wide SNP data from the HapMap project (Tapper et al., 2003, 2005), and their properties have been studied extensively (Maniatis et al., 2007; Zhang et al., 2002, 2004).


In the context of association mapping, the association measure ρ̂ between a marker and the disease phenotype can be represented either through the regression coefficient for continuous phenotypes or through the correlation coefficient for binary affection status (Collins and Morton, 1998; Maniatis et al., 2004). The expected value ρ_i of the association statistic ρ̂_i between marker i and the phenotype is again modeled by the Malécot equation,

ρ_i = (1 − L)M exp(−ε|S_i − S|) + L     (4)

where S_i is the location of marker i and S is the unknown position of the causal mutation. To combine information across multiple markers, a composite log likelihood is formed by effectively assuming ρ̂_i to have a Gaussian distribution with mean ρ_i and variance 1/K_i:

ℓ = −Σ_i K_i (ρ̂_i − ρ_i)² / 2

Under the null hypothesis that none of the SNPs in the region is associated with the disease, the parameter M is constrained to equal 0, S does not exist, and L can be estimated from external information. Under the alternative hypothesis, both M and S are parameters to be estimated from the data; the parameter L can be either estimated from the data or taken from external information. A likelihood ratio test based on the composite likelihood can then be applied, with 3 or 2 degrees of freedom depending on whether or not L is estimated from the current data.

As in other applications of composite likelihoods, statistical inference is often done informally. The primary difficulty relates to composite likelihood–based inference in general, as discussed in Section III; it is even harder in this genetic setting, since it is not possible to define the correlation of two components and there is essentially only one realization from the population process. In addition, because approximations and heuristic arguments are used in multiple parts of the model, it is difficult to evaluate the performance of the methods and to provide practical guidelines about the situations in which they might be most useful or could be improved upon. Maniatis et al. (2004) used simulations to evaluate the type I error rate of likelihood ratio tests based on the composite likelihood and concluded that the type I error rates are acceptable, with the test being slightly anti-conservative for the most complex model (all of M, S, and L estimated from the data). More recently, Morton et al. (2007) proposed a permutation-based method to better control the type I error rate and also considered estimation of the false discovery rate (FDR) in the setting of a genome-wide association scan. An advantage of composite likelihood–based methods is that they can not only provide an estimate of the location but also give confidence intervals (Maniatis et al., 2005; Zhang et al., 2006).
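As an illustration of how the composite log likelihood above can be fitted in practice, here is a minimal sketch with invented marker positions, association estimates, and weights; the decay constant and the unlinked-loci parameter L are treated as known (e.g., taken from an LD map and from external information), so only M and S are estimated, matching the 2-degree-of-freedom case. Maximizing ℓ is just a weighted least-squares fit of the Malécot trend, and, as discussed above, the nominal χ² calibration of the resulting composite likelihood ratio statistic should not be taken at face value.

```python
import numpy as np
from scipy.optimize import minimize

# Invented illustration data: marker locations S_i (on an LD-unit scale so a
# single decay constant applies), single-marker association estimates
# rho_hat_i, and information weights K_i (1/K_i is the working variance).
S = np.array([1.0, 2.5, 4.0, 5.5, 7.0, 8.5])
rho_hat = np.array([0.05, 0.12, 0.45, 0.60, 0.20, 0.08])
K = np.array([50.0, 60.0, 55.0, 65.0, 58.0, 52.0])
EPS = 1.0     # decay constant per LD unit, treated as known from the LD map
L_EXT = 0.02  # association between unlinked loci, from external information

def neg_cl(params):
    """-(composite log likelihood) for rho_i = (1-L)*M*exp(-eps*|S_i - S|) + L,
    i.e. a weighted least-squares fit of the Malecot trend."""
    M, S_dis = params
    rho = (1.0 - L_EXT) * M * np.exp(-EPS * np.abs(S - S_dis)) + L_EXT
    return 0.5 * np.sum(K * (rho_hat - rho) ** 2)

alt = minimize(neg_cl, x0=[0.5, 5.0],
               bounds=[(0.0, 1.0), (S.min(), S.max())])
null_value = neg_cl([0.0, S.mean()])   # M = 0: rho_i = L for every marker

clrt = 2.0 * (null_value - alt.fun)
print("M_hat, S_hat:", alt.x, " composite LRT (nominal 2 df):", clrt)
```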

D. Other methods

A composite likelihood approach was also adopted by Xiong and Guo (1997). For a single marker, the likelihood is defined as

ℓ(θ) = E[ Π_{i=1}^m p_{id}^{k_{id}} ]     (5)

where θ is the recombination fraction between the marker and the disease mutation, p_{id} is the frequency of marker allele i on the haplotypes bearing the disease mutation, and k_{id} is the observed number of disease-bearing haplotypes carrying marker allele i. The population frequency of each marker allele is assumed to be constant, but its frequency in the disease population changes over time and is a function of θ. The expectation in (5) is taken over all possible genealogies of the sample since the disease mutation first arose. This expectation is difficult to evaluate; under certain assumptions, Xiong and Guo (1997) derived first- and second-order approximations of the likelihood. For multiple markers, the composite likelihood is simply ℓ = Σ ℓ_i, with the recombination fractions between the markers and a disease locus at an unknown position x given by Haldane's map function.

The composite likelihood methods reviewed thus far are all based on modeling the linkage disequilibrium between a single marker and the disease locus. A common drawback of these methods is that they do not account for the dependence between adjacent markers. This is not satisfactory because densely spaced markers, when available, are likely to be highly correlated, and ignoring the correlation will result in a loss of efficiency and incorrect inference. For a rare, fully penetrant disease mutation, Garner and Slatkin (2002) developed a theoretical model of the sampling distribution of two-marker haplotypes linked to the disease mutation. They compared maximum likelihood methods with composite likelihood methods based on two single-marker likelihoods and concluded that the full two-marker likelihood method gave much better localization of the disease mutation; the composite likelihood methods sometimes placed the disease mutation in the wrong interval. This two-marker model could not be extended to include more markers, and it leaves open the question of which two markers to use when multiple markers are available.


It is conceivable that a composite likelihood could be constructed from both the two-marker and the single-marker likelihoods, for example (Cox and Reid, 2004),

ℓ(θ) = Σ_{s<t} ℓ(Y_s, Y_t; θ) − aK Σ_s ℓ(Y_s; θ)
where K is the number of markers, θ is the parameter of interest (i.e., the position of the disease mutation), and a is a constant chosen to achieve optimality (e.g., a = 0 corresponds to a pairwise likelihood). It is likely that some truncation, that is, ignoring pairs of markers that are too far apart, would improve efficiency.

A different approach to multi-locus linkage disequilibrium mapping is based on haplotypes (McPeek and Strahs, 1999). A likelihood model is defined for the probability of an observed haplotype given an ancestral haplotype and the disease status; the ancestral haplotype is a parameter to be estimated. Given the ancestral haplotype, the observed haplotypes are assumed to be independent, which amounts to assuming a star-shaped genealogy. Likelihood inference is then based on combining the individual component likelihoods, ignoring possible correlations among the haplotypes, so this method can also be considered an application of composite likelihood. The approach was later extended to a Bayesian framework (Liu et al., 2001).

V. OTHER APPLICATIONS

The application of composite likelihood is not limited to correlated genotype data; it extends to pedigree data and to high-dimensional phenotype data in general. For a case-control study, a haplotype-based likelihood ratio test of association can be conducted using multinomial likelihoods. However, when the subjects are related, as is often the case when the samples were originally collected for linkage studies, standard likelihood methods do not apply. Browning et al. (2005) proposed a weighted composite likelihood method for haplotype-based case-control association testing using pedigrees. In the simple case where haplotypes can be observed, the likelihood for one individual is L_i = p_{h_i1} p_{h_i2}, where h_i1 and h_i2 are the individual's two haplotypes and p_{h_i1}, p_{h_i2} are the corresponding population frequencies. A composite likelihood is formed by ignoring the fact that the individuals may be related,

L = Π_{i=1}^n (L_i)^{ω_i}

where ω_i is a weight derived from the kinship coefficient matrix K, chosen to minimize the variance of the haplotype frequency estimate. The composite likelihood for unphased data can be developed similarly, and different weighting schemes were also discussed.
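A stripped-down sketch of the weighted composite likelihood for phased data is shown below. The haplotype labels, counts, and weights are invented; the weights are simply taken as given, since their derivation from the kinship matrix is the part of the method that varies across weighting schemes. The weighted maximizer has a closed form (weighted haplotype counting), which the sketch uses directly.

```python
import numpy as np
from collections import defaultdict

# Toy phased data (invented): each individual contributes two haplotypes,
# labelled by arbitrary string codes, plus a weight omega_i derived from the
# kinship matrix (see Browning et al., 2005, for possible choices).
haplotypes = [("ab", "ab"), ("ab", "Ab"), ("Ab", "aB"), ("ab", "aB")]
omega = np.array([1.0, 1.0, 0.6, 0.6])   # the related pair is down-weighted

def weighted_freq_estimates(haplotypes, omega):
    """Maximizer of the weighted composite log likelihood
    sum_i omega_i * (log p_{h_i1} + log p_{h_i2}):
    weighted haplotype counting, normalized to sum to one."""
    counts = defaultdict(float)
    for (h1, h2), w in zip(haplotypes, omega):
        counts[h1] += w
        counts[h2] += w
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

def weighted_composite_loglik(freqs, haplotypes, omega):
    """ell = sum_i omega_i * log(p_{h_i1} * p_{h_i2})."""
    return sum(w * (np.log(freqs[h1]) + np.log(freqs[h2]))
               for (h1, h2), w in zip(haplotypes, omega))

freqs = weighted_freq_estimates(haplotypes, omega)
print(freqs)
print("weighted composite log likelihood:",
      weighted_composite_loglik(freqs, haplotypes, omega))
```

In a case-control analysis this estimation would be carried out under a pooled (null) model and separately for cases and controls (alternative), with twice the difference in weighted composite log likelihoods as the test statistic, keeping in mind the calibration issues discussed in Section II.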


In addition to increasingly high-dimensional genotype data, there is a trend toward collecting more and more phenotype data as the technology becomes available, such as multiple structural measures from medical imaging and measurements of lipid fractions by NMR. These measures can also be collected over time and thus pose a major challenge for the analysis, especially considering that many genetic analyses are complicated enough already and often need special software. In a nongenetic setting, the approach adopted by Fieuws and Verbeke (2006) provides an instructive example. The problem was to model hearing thresholds at 22 different frequencies measured over 20 years. The joint model L(Y_1, ..., Y_22; Θ*) for all 22 outcomes was intractable. Instead, the log likelihoods for all pairs of outcomes, L(Y_r, Y_s; Θ_rs), were maximized separately to give estimates of Θ = {Θ_rs; r < s}. Some of the parameters in the joint parameter Θ* appear multiple times in the set Θ, whereas others that are specific to outcomes r and s appear only once. Under suitable conditions, the estimator Θ̂ has an asymptotic normal distribution, and the parameter of interest Θ* is estimated by the transformation Θ̂* = AΘ̂, which effectively averages the estimates of the common parameters over all pairs. Although not exactly a composite likelihood method, this approach shares similar characteristics. Simulation studies showed that the pairwise approach yielded unbiased estimates with robust standard errors. Although there was some loss of efficiency, especially for the parameters common to multiple pairs of outcomes, this is not important in practice when the fully efficient likelihood estimator is not available.
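The "fit all pairs, then average the shared parameters" idea can be illustrated without any mixed-model machinery. The sketch below (invented data; closed-form bivariate Gaussian fits standing in for the bivariate mixed models of Fieuws and Verbeke) fits every pair of outcomes separately and then applies the averaging step corresponding to Θ̂* = AΘ̂.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Toy data (invented): n subjects, 4 correlated outcomes.
n, d = 100, 4
true_mean = np.array([1.0, 2.0, 3.0, 4.0])
cov = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)
Y = rng.multivariate_normal(true_mean, cov, size=n)

# Step 1: fit every pairwise (here bivariate Gaussian) model separately.
pair_estimates = {}
for r, s in combinations(range(d), 2):
    sub = Y[:, [r, s]]
    pair_estimates[(r, s)] = {
        "means": sub.mean(axis=0),                  # shared with other pairs
        "cov_rs": np.cov(sub.T, bias=True)[0, 1],   # specific to this pair
    }

# Step 2: the averaging step (Theta*_hat = A Theta_hat): each outcome-specific
# parameter appears in d-1 pairs, so average it over those pairs.
mean_hat = np.zeros(d)
for (r, s), est in pair_estimates.items():
    mean_hat[r] += est["means"][0] / (d - 1)
    mean_hat[s] += est["means"][1] / (d - 1)

print("averaged outcome means:", mean_hat)
```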

VI. PROSPECTIVES AND DISCUSSION

With the fast-paced revolution in technology, more and more data are becoming readily available to genetic researchers. Studies are currently under way to genotype up to a million SNPs using high-throughput platforms, and by pooling resources across studies, a "mega-study" may include tens of thousands of subjects. Developments in genomics and proteomics make it feasible to measure mRNA or protein levels for thousands of genes at the same time (Cheung et al., 2005). To make good use of such data, both statistical efficiency and computational feasibility have to be considered in modeling and inference. Composite likelihood may prove to be at least a good starting point in untangling the large volume of data.

Composite likelihood–based approaches have been applied very successfully to the estimation of population genetics parameters. Their use in linkage disequilibrium mapping, however, is not as widespread. There are a few reasons why association mapping is more difficult. First, the disease loci (most likely more than one for a complex disease) are unknown and thus cannot be modeled directly. The observed phenotypes provide only limited information about the underlying genotypes. To model the unobserved disease mutation genotypes, it is often necessary to specify a penetrance model, which is typically difficult. It is often implicitly assumed that there is one disease mutation in the region of interest, but in reality several different mutations may contribute to the disease in different subjects in the sample, or perhaps two mutations in the region jointly increase the risk of disease. These complications make the idea of "localizing" the disease mutation a not very well-defined problem. Some models of linkage disequilibrium require parameters, such as the age of the disease mutation, that cannot be reliably estimated without knowing the frequency of the disease allele, the genetic model (recessive, multiplicative, etc.), and the penetrance. Second, the theoretical results derived for population genetics parameters all assume that the length of the region (chromosome) goes to infinity. Aside from the practical limitation of finite chromosome length, for gene mapping only markers in the close neighborhood of the disease mutation (whose position is unfortunately unknown) are likely to provide useful information. Third, in genetic mapping studies various ascertainment schemes are used in attempts to increase statistical power, and it is not clear how to model ascertainment in the population genetics framework.

Nevertheless, there is a lot of potential for applications of composite likelihood methods in genetic studies of complex diseases. Several areas would be of particular interest. First, single-marker–based models (e.g., Graham and Thompson, 1998) can be further developed to allow information to be combined across markers (e.g., by the Malécot model) and thus form the basis of composite likelihoods. Second, we need to develop methods for multi-locus association mapping that incorporate coalescent models with more than just two loci or one haplotype; one example is the three-locus likelihood model based on the Wright–Fisher model of population genetics (Boitard et al., 2006). Based on existing examples, it is expected that using larger, higher-order components will lead to increased efficiency of the composite likelihood estimator and likely to better asymptotic behavior as well. Third, the validity of inference based on composite likelihoods needs to be studied carefully, through theoretical development or extensive simulations. Resampling-based methods (Morton et al., 2007) are very promising and can be explored further; recent theoretical development in this area includes Varin and Vidoni (2005). Finally, composite likelihood methods can also be used to model high-dimensional phenotype data. For example, bivariate linkage analysis is available in existing software (Almasy et al., 1997), but higher-dimensional phenotypes are more difficult to model directly because of the large number of variance parameters; a strategy similar to that of Fieuws and Verbeke (2006) may provide an attractive alternative.


References

Almasy, L., Dyer, T. D., and Blangero, J. (1997). Bivariate quantitative trait linkage analysis: Pleiotropy versus co-incident linkages. Genet. Epidemiol. 14, 953–958.
Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. R. Stat. Soc. B 36, 192–236.
Boitard, S., Abdallah, J., de Rochambeau, H., Cierco-Ayrolles, C., and Mangin, B. (2006). Linkage disequilibrium interval mapping of quantitative trait loci. BMC Genomics 7, 54.
Browning, S. R., Briley, J. D., Briley, L. P., Chandra, G., Charnecki, J. H., Ehm, M. G., Johansson, K. A., Jones, B. J., Karter, A. J., Yarnall, D. P., and Wagner, M. J. (2005). Case-control single-marker and haplotypic association analysis of pedigree data. Genet. Epidemiol. 28, 110–122.
Carvajal-Rodriguez, A., Crandall, K. A., and Posada, D. (2006). Recombination estimation under complex evolutionary models with the coalescent composite-likelihood method. Mol. Biol. Evol. 23, 817–827.
Chapman, N. H., Leutenegger, A. L., Badzioch, M. D., Bogdan, M., Conlon, E. M., Daw, E. W., Gagnon, F., Li, N., Maia, J. M., Wijsman, E. M., and Thompson, E. A. (2001). The importance of connections: Joining components of the Hutterite pedigree. Genet. Epidemiol. 21(Suppl. 1), S230–S235.
Cheung, V. G., Spielman, R. S., Ewens, K. G., Weber, T. M., Morley, M., and Burdick, J. T. (2005). Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369.
Collins, A., and Morton, N. E. (1998). Mapping a disease locus by allelic association. Proc. Natl. Acad. Sci. USA 95, 1741–1745.
Cox, D. R., and Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika 91, 729–737.
Crawford, D. C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M. J., Nickerson, D. A., and Stephens, M. (2004). Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 36, 700–706.
Devlin, B., Risch, N., and Roeder, K. (1996). Disequilibrium mapping: Composite likelihood for pairwise disequilibrium. Genomics 36, 1–16.
Fearnhead, P. (2003). Consistency of estimators of the population-scaled recombination rate. Theor. Popul. Biol. 64, 67–79.
Fearnhead, P., and Donnelly, P. (2001). Estimating recombination rates from population genetic data. Genetics 159, 1299–1318.
Fearnhead, P., and Donnelly, P. (2002). Approximate likelihood methods for estimating local recombination rates. J. R. Stat. Soc. B 64, 657–680.
Fearnhead, P., and Smith, N. G. C. (2005). A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes. Am. J. Hum. Genet. 77, 781–794.
Fearnhead, P., Harding, R. M., Schneider, J. A., Myers, S., and Donnelly, P. (2004). Application of coalescent methods to reveal fine-scale rate variation and recombination hotspots. Genetics 167, 2067–2081.
Felsenstein, J. (2006). Accuracy of coalescent likelihood estimates: Do we need more sites, more sequences, or more loci? Mol. Biol. Evol. 23, 691–700.
Fieuws, S., and Verbeke, G. (2006). Pairwise fitting of mixed models for the joint modeling of multivariate longitudinal profiles. Biometrics 62, 424–431.
Garner, C., and Slatkin, M. (2002). Likelihood-based disequilibrium mapping for two-marker haplotype data. Theor. Popul. Biol. 61, 153–161.


Graham, J., and Thompson, E. A. (1998). Disequilibrium likelihoods for fine-scale mapping of a rare allele. Am. J. Hum. Genet. 63, 1517–1530.
Griffiths, R. C., and Marjoram, P. (1996). Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3, 479–502.
Heagerty, P. J., and Lele, S. R. (1998). A composite likelihood approach to binary spatial data. J. Am. Stat. Assoc. 93, 1099–1111.
Hey, J., and Wakeley, J. (1997). A coalescent estimator of the population recombination rate. Genetics 145, 833–846.
Hudson, R. R. (1983). Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23, 183–201.
Hudson, R. R. (2001). Two-locus sampling distributions and their application. Genetics 159, 1805–1817.
Jensen, J. D., Kim, Y., DuMont, V. B., Aquadro, C. F., and Bustamante, C. D. (2005). Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 170, 1401–1410.
Kim, Y., and Nielsen, R. (2004). Linkage disequilibrium as a signature of selective sweeps. Genetics 167, 1513–1524.
Kim, Y., and Stephan, W. (2002). Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160, 765–777.
Kingman, J. F. C. (1982). The coalescent. Stochastic Process. Appl. 13, 235–248.
Kuhner, M. K., Yamato, J., and Felsenstein, J. (2000). Maximum likelihood estimation of recombination rates from population data. Genetics 156, 1392–1401.
Leutenegger, A. L., Génin, E., Thompson, E. A., and Clerget-Darpoux, F. (2002). Impact of parental relationships in maximum LOD score affected sib-pair method. Genet. Epidemiol. 23, 413–425.
Li, N., and Stephens, M. (2003). Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233.
Liang, K. Y., and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
Liang, K. Y., and Zeger, S. L. (1995). Inference based on estimating functions in the presence of nuisance parameters (with discussion). Stat. Sci. 10, 158–173.
Lindsay, B. (1988). Composite likelihood methods. Contemp. Math. 80, 221–239.
Liu, J. S., Sabatti, C., Teng, J., Keats, B. J., and Risch, N. (2001). Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11, 1716–1724.
Lumley, T., and Heagerty, P. (1999). Weighted empirical adaptive variance estimators for correlated data regression. J. R. Stat. Soc. B 61, 459–477.
Malécot, G. (1973). Isolation by distance. In "Genetic Structure of Populations" (N. E. Morton, ed.), pp. 72–75. University of Hawaii Press, Honolulu.
Maniatis, N., Collins, A., Xu, C. F., McCarthy, L. C., Hewett, D. R., Tapper, W., Ennis, S., Ke, X., and Morton, N. E. (2002). The first linkage disequilibrium (LD) maps: Delineation of hot and cold blocks by diplotype analysis. Proc. Natl. Acad. Sci. USA 99, 2228–2233.
Maniatis, N., Collins, A., Gibson, J., Zhang, W., Tapper, W., and Morton, N. E. (2004). Positional cloning by linkage disequilibrium. Am. J. Hum. Genet. 74, 846–855.
Maniatis, N., Morton, N., Gibson, J., Xu, C. F., Hosking, L., and Collins, A. (2005). The optimal measure of linkage disequilibrium reduces error in association mapping of affection status. Hum. Mol. Genet. 14, 143–153.
Maniatis, N., Collins, A., and Morton, N. E. (2007). Effects of single SNPs, haplotypes, and whole-genome LD maps on accuracy of association mapping. Genet. Epidemiol. 31, 179–188.
McPeek, M. S., and Strahs, A. (1999). Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet. 65, 858–875.


McVean, G., Awadalla, P., and Fearnhead, P. (2002). A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160, 1231–1241.
Meiklejohn, C. D., Kim, Y., Hartl, D. L., and Parsch, J. (2004). Identification of a locus under complex positive selection in Drosophila simulans by haplotype mapping and composite-likelihood estimation. Genetics 168, 265–279.
Morton, N. (2005). Linkage disequilibrium maps and association mapping. J. Clin. Invest. 115, 1425–1430.
Morton, N. E., Zhang, W., Taillon-Miller, P., Ennis, S., Kwok, P. Y., and Collins, A. (2001). The optimal measure of allelic association. Proc. Natl. Acad. Sci. USA 98, 5217–5221.
Morton, N., Maniatis, N., Zhang, W., Ennis, S., and Collins, A. (2007). Genome scanning by composite likelihood. Am. J. Hum. Genet. 80, 19–28.
Pluzhnikov, A., and Donnelly, P. (1996). Optimal sequencing strategies for surveying molecular genetic diversity. Genetics 144, 1247–1262.
Pritchard, J. K., and Przeworski, M. (2001). Linkage disequilibrium in humans: Models and data. Am. J. Hum. Genet. 69, 1–14.
Smith, N. G. C., and Fearnhead, P. (2005). A comparison of three estimators of the population-scaled recombination rate: Accuracy and robustness. Genetics 171, 2051–2062.
Stephens, M., and Donnelly, P. (2000). Inference in molecular population genetics (with discussion). J. R. Stat. Soc. B 62, 605–655.
Tapper, W. J., Maniatis, N., Morton, N. E., and Collins, A. (2003). A metric linkage disequilibrium map of a human chromosome. Ann. Hum. Genet. 67, 487–494.
Tapper, W., Collins, A., Gibson, J., Maniatis, N., Ennis, S., and Morton, N. E. (2005). A map of the human genome in linkage disequilibrium units. Proc. Natl. Acad. Sci. USA 102, 11835–11839.
Terwilliger, J. D. (1995). A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci. Am. J. Hum. Genet. 56, 777–787.
Varin, C., and Vidoni, P. (2005). A note on composite likelihood inference and model selection. Biometrika 92, 519–528.
Wall, J. D. (2004). Estimating recombination rates using three-site likelihoods. Genetics 167, 1461–1473.
Wiuf, C. (2006). Consistency of estimators of population scaled parameters using composite likelihood. J. Math. Biol. 53, 821–841.
Xiong, M., and Guo, S. W. (1997). Fine-scale genetic mapping based on linkage disequilibrium: Theory and applications. Am. J. Hum. Genet. 60, 1513–1531.
Zeger, S. L., Liang, K. Y., and Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics 44, 1049–1060.
Zhang, W., Collins, A., Maniatis, N., Tapper, W., and Morton, N. E. (2002). Properties of linkage disequilibrium (LD) maps. Proc. Natl. Acad. Sci. USA 99, 17004–17007.
Zhang, W., Collins, A., Gibson, J., Tapper, W. J., Hunt, S., Deloukas, P., Bentley, D. R., and Morton, N. E. (2004). Impact of population structure, effective bottleneck time, and allele frequency on linkage disequilibrium maps. Proc. Natl. Acad. Sci. USA 101, 18075–18080.
Zhang, W., Maniatis, N., Rodriguez, S., Miller, G. J., Day, I. N. M., Gaunt, T. R., Collins, A., and Morton, N. E. (2006). Refined association mapping for a quantitative trait: Weight in the H19-IGF2-INS-TH region. Ann. Hum. Genet. 70, 848–856.
Zhu, L., and Bustamante, C. D. (2005). A composite-likelihood approach for detecting directional selection from DNA sequence data. Genetics 170, 1411–1421.