Computational Statistics and Data Analysis 55 (2011) 2104–2113
Contents lists available at ScienceDirect
Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda
Association study for the relationship between a haplotype or haplotype set and multiple quantitative responses Makoto Tomita a,∗ , Noboru Hashimoto b , Yutaka Tanaka c a
Clinical Research Center, Tokyo Medical and Dental University Hospital, Faculty of Medicine, 113-8519, Japan
b
Nagoya University Graduate School of Medicine, 466-8550, Japan
c
Department of Information Systems and Mathematical Sciences, Nanzan University, 489-0863, Japan
article
info
Article history: Received 18 February 2010 Received in revised form 4 November 2010 Accepted 4 January 2011 Available online 21 January 2011 Keywords: Multivariate analysis Quantitative responses Haplotype Likelihood ratio test
abstract Though there have been several works on the analysis of the association between genotype and phenotype, little can be found for the association analysis between a haplotype or haplotype sets and multivariate quantitative responses. For example, QTLmarc is available for the analysis of multivariate responses, but it cannot be applied to the case of stochastic diplotype configurations and complex genetic models. The present paper proposes a method of association analysis between diplotype configuration and multivariate quantitative responses assuming the dominant, recessive and additive models. A comparative study is performed between the proposed method and QTLmarc by applying the two methods to numerical examples and small size simulated data sets with actual genotype information taken from the data set of the Hapmap project and artificial quantitative phenotype data which follow multivariate normal distributions. The results show that the proposed method is superior to QTLmarc in finding the assumed association. © 2011 Elsevier B.V. All rights reserved.
1. Introduction Recently the association has been actively studied between genotype and phenotype in post genomic research. Here ‘genotype’ means not only genotype itself but also haplotypes and diplotype configurations that are estimated from genotypes of the sample, and ‘phenotype’ indicates qualitative or quantitative variables which may be related to some specific diseases. A quantitative phenotype variable, called QTL (quantitative trait locus), includes covariates such as BMI and glucose level. Some algorithms have been proposed so far to analyze the association between the genotype information and the quantitative phenotypic QTL. The algorithm QTLhaplo (Shibata et al., 2004) deals with the association between the genotype and the univariate phenotype, assuming the normality of the conditional distribution of the phenotype given the genotype information. The likelihood is calculated on the basis of frequencies of diplotype configurations (joint probability of the frequencies of the haplotypes that compose the dipolotype) and the density function of a normal distribution. The algorithm QTLmarc (Kamitsuji and Kamatani, 2006) has been proposed for multivariate analysis of multiple quantitative responses, however, it can deal with only the case where each subject’s haplotype is determined uniquely from its genotype. It is doubtful whether it can evaluate the association properly in general cases. Therefore, it is valuable to develop a general method of association analysis for multivariate quantitative responses. In this paper, we extend the algorithm QTLhaplo so that it can deal with the association between the genotype and multiple quantitative variables assuming three types of models, i.e., the dominant, recessive and additive models.
∗
Corresponding author. E-mail address:
[email protected] (M. Tomita).
0167-9473/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2011.01.002
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
2105
2. Method 2.1. Univariate models Shibata et al. (2004) describe the algorithm QTLhaplo as follows. Suppose that there exist l linked loci. As DNA is of double helix structure and each haplotype has its counterpart, the number of possible haplotypes is L = 2l in total. Let the relative ∑L frequencies of the haplotypes be given as Θ = (θ1 , . . . , θj , . . . , θL ), where θj ≥ 0 and j=1 θj = 1. As each subject has a
combination of two haplotypes, there is a possibility that it has L2 possible combinations a1 , a2 , . . . , aL2 . The probability that the ith subject has a diplotype configuration ak of the lth and the mth haplotypes is given by P (di = ak |Θ ) = θl θm , where di is a diplotype configuration for the ith subject. Also suppose that the ith subject has quantitative phenotype ψi with probability density function f . Now consider that a sample of size N is observed in an experiment. The phenotype for each diplotype configuration is assumed to follow a normal distribution with a common variance but with a mean which varies depending on the diplotype configuration. The outcome of the experiment can then be expressed as (Θ , D, Ψ ), where D = (d1 , . . . , dN ) indicates the vector of the diplotype configurations and Ψ = (ψ1 , . . . , ψN ) indicates the matrix of the quantitative phenotypes. The observations for N subjects are classified into the two groups of subjects with and without specified haplotype ht in the diplotype configurations, and from these observations the group means µk and the common variance σ 2 can be estimated for the distributions of quantitative phenotypes, respectively. The problem is to test whether there exists any difference in the distribution of the phenotypes between the two groups. For a dominant model, D+ is defined as the set of subjects with diplotype configurations containing haplotype ht , and D− is defined as the set of those without ht . Then, the distribution of phenotypes is given by N (µ1 , σ 2 ) for diplotype di ∈ D+ or by N (µ2 , σ 2 ) for di ∈ D− . Denote the probability density functions by fµj (x), j = 1, 2. Thus the probability density function for ψi is defined by fµ1 (x) = f (ψi = x|di ∈ D+ ) in case di ∈ D+ and fµ2 (x) = f (ψi = x|di ∈ D− ) in case di ∈ D− . Let A and B denote haplotypes with specified ht and those without ht , respectively. Then every diplotype configuration is expressed as AA, AB, or BB, and sets D+ and D− can be defined as follows. In the case of dominant models, AA and AB belong to D+ , while BB belongs to D− . In the case of recessive models, AA belongs to D+ , while AB and BB belong to D− . For additive models, the distributions of ψi for AA, BB and AB are given by N (µ1 , σ 2 ), N (µ2 , σ 2 ) and N (µ3 , σ 2 ), respectively, where µ3 = (µ1 + µ2 )/2. 2.2. Extension to multivariate models We try to extend the above univariate model to a multivariate model. Suppose that the quantitative phenotype vector Ψi follows a multinormal distribution with a common variance–covariance matrix but with different mean vector corresponding to the group defined by the diplotype configurations. In dominant/recessive models, the density function is given by
1 1 ′ −1 e− 2 (x−µ1 ) Σ (x−µ1 ) (2π )p/2 |Σ |− 21 f (Ψi = x|di = ak , µ, Σ ) = 1 1 ′ −1 e− 2 (x−µ2 ) Σ (x−µ2 ) 1 − (2π )p/2 |Σ | 2
if ak ∈ D+ , (1) if ak ̸∈ D+ ,
while in an additive model it is given by
f (Ψi = x|di = ak , µ, Σ ) =
1 1 ′ −1 e− 2 (x−µ1 ) Σ (x−µ1 ) 1 − (2π )p/2 |Σ | 2 1 1 ′ −1 e− 2 (x−µ2 ) Σ
(x−µ2 )
1 (2π )p/2 |Σ |− 2 µ1 +µ2 ′ −1 µ1 +µ2 1 1 e− 2 (x− 2 ) Σ (x− 2 ) − 21 p / 2 (2π ) |Σ |
if ak ∈ DAA , if ak ∈ DBB ,
(2)
if ak ∈ DAB ,
where µ, Σ indicate the mean vector and the variance–covariance matrix, respectively, x is the vector of individual quantitative phenotypes, and p is the number of phenotype variables. 2.3. Likelihood function The observed data consist of the genotype and quantitative phenotype of N subjects. Let Gobs = (g1 , g2 , . . . , gN ) and Ψobs = (w1 , w2 , . . . , wN ) be the vectors of the observed genotypes and the matrix of the quantitative phenotypes, respectively. Then the likelihood function is given by
2106
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
L(Θ , µ, Σ ) ∝
N − ∏
P (di = ak |Θ )f (ψi = wi |di = ak , µ, Σ ),
i=1 ak ∈Ai
where Ai denotes the set of diplotype configurations ak , which is consistent with gi , and f is the probability density function for N (µ, Σ ), where µ depends on ak . Under the null hypothesis it is assumed that the distribution of the phenotype does not depend on the haplotype, i.e., the mean vector µ is equal to a common vector µ0 . Under the alternative hypothesis, two multinormal distributions, N (µ1 , Σ ) and N (µ2 , Σ ), are defined, if the model is dominant or recessive, and three multinormal distributions, N (µ1 , Σ ), N (µ2 , Σ ), and N ((µ1 +µ2 )/2, Σ ), are defined, if the model is additive. The phenotype x of the ith subject follows a distribution given by Eq. (1) or (2), depending on the assumed model. 2.4. Estimation of parameters and likelihood ratio test If the complete data of d1 , d2 , . . . , dN and Ψ1 , Ψ2 , . . . , ΨN were available, the maximum likelihood estimators for µ, Σ and the haplotype frequencies Θ = (θ1 , θ2 , . . . , θL ) could be obtained as
θˆj = nj /(2N ) (j = 1, 2, . . . , L), − − µˆ1 = ψi /N+ , µˆ2 = ψi /N− , di ∈D+
di ̸∈D+
−
ˆ = Σ
(Ψi − µ1 ) (Ψi − µ1 ) + ′
di ∈D+
−
(Ψi − µ2 ) (Ψi − µ2 )
′
N,
di ̸∈D+
where nj counts how often the jth haplotype appears among the N subjects, and N+ and N− denote the numbers of subjects who possess or do not possess haplotype ht , respectively. However, actually the complete data are not available and we can observe only genotypes and phenotypes of the subjects. For a subject, two or more haplotypes might correspond to one genotype. Therefore, there are two or more candidates for the diplotype configuration of the subject, given the haplotype. Several procedures have been developed for estimating haplotype frequencies Θ using the EM-algorithm, MCMC , or other algorithms. However, as the purpose of our study is different, we omit here a detailed explanation of those algorithms. Here we assume that the vector of frequencies of the haplotypes have been estimated using some appropriate software. Their results can be used to estimate the mean vectors and the variance–covariance matrix in our multivariate models. In the case of dominant and recessive models the log-likelihood function is expressed as l = log L(Θ , µ, Σ ) and the maximum likelihood estimators for µ1 and µ2 are obtained by N ∑
µˆ1 =
N ∑
ψi (ub /u0 )
i =1 N ∑
and
µˆ2 =
(ub /u0 )
i=1
ψi (vb /v0 )
i=1 N ∑
, (vb /v0 )
i =1
where
−
ub =
P (di = ak |Θ )f (ψi |di = ak , µ1 , σ ),
ak ∈D+ ∩Ai
u0 =
−
P (di = ak |Θ )f (ψi |di = ak , µ, σ ),
ak ∈Ai
−
vb =
P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ),
ak ∈Ai ∩D−
v0 =
−
P (di = ak |Θ )f (ψi |di = ak , µ, Σ ).
ak ∈Ai
The variance–covariance matrix is estimated as
ˆ = Σ
1 n
N N − − T T (ψi − µ1 ) (ψi − µ1 ) (ub /u0 ) + (ψi − µ2 ) (ψi − µ2 ) (vb /v0 ) , i =1
i =1
where n is i = 1 ( ub / u0 ) + i=1 (vb /v0 ). D+ is a set of the diplotype configurations which contain haplotype ht in such a way that it is consistent with dominant/recessive models. In the case of additive models, the mean vectors and the variance–covariance matrix are estimated by solving the following equations.
∑N
N − ub i=1
u0
∑N
+
N 1 − wb
4 i=1 w0
µ1 +
N 1 − wb
4 i =1 w 0
µ2 =
N − ub i=1
u0
+
1 2
u0 ub
xi
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113 N 1 − wb
4 i=1 w0
ˆ = Σ
µ1 +
N N − vb 1 − wb + v0 4 i=1 w0 i =1
µ2 =
N − i=1
vb 1 + v0 2
u0 ub
2107
xi
N N − − (ψi − µ1 ) (ψi − µ1 )T (ub /u0 ) + (ψi − µ2 ) (ψi − µ2 )T (vb /v0 ) n i=1 i=1 N − T + (ψi − (µ1 + µ2 )/2) (ψi − (µ1 + µ2 )/2) (wb /w0 ) . 1
i=1
In the above equations, u’s, v ’s and w ’s are defined by ub =
−
P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ),
ak ∈Ai ∩AA
u0 =
−
P (di = ak |Θ )f (ψi |di = ak , µ, Σ ),
ak ∈Ai
vb =
−
P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ),
ak ∈Ai ∩BB
v0 =
−
P (di = ak |Θ )f (ψi |di = ak , µ, Σ ),
ak ∈Ai
wb =
−
P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ) and
ak ∈Ai ∩AB
w0 =
−
P (di = ak |Θ )f (ψi |di = ak , µ, Σ ).
ak ∈Ai
These two systems of equations can be solved by using iterative procedures with appropriate initial values. The converged solution provides maximum likelihood estimates. Likelihood ratio tests can be applied to the association analysis between the haplotypes and the phenotypes. Let L0max and Lmax be the likelihood functions under the null and alternative hypotheses, respectively. It is known that under the null hypothesis the log-likelihood ratio −2 log(L0max /Lmax ) asymptotically follows a chi-squared distribution. The degree of freedom is given by the difference of the numbers of parameters corresponding to both hypotheses. 2.5. QTLmarc algorithm Kamitsuji and Kamatani (2006) proposed an algorithm ‘‘QTLmarc’’ for estimating haplotypes associated with several quantitative phenotypes. Their method is a discriminant analysis of the two groups defined by whether or not the genotype of the sample contains a specified haplotype on the basis of a linear combination of quantitative phenotype values. The goodness of discrimination is evaluated with the area under the ROC curve (AUC). As the two groups are defined simply as above, their method cannot deal with the differences among the dominant, recessive and additive models. In the next section, we show two numerical examples, where the comparison of the performance will be made between our method and the QTLmarc. 3. Numerical study and the results of analyses We consider the case where there are three loci with two kinds of alleles as genotypes and two quantitative phenotype variables. As a genotype input data set, an actual data set for 44 subjects was downloaded from the Hapmap project, and the information of the region (107,189 loci) of the X chromosome was used. Among a large number of loci, 40 loci with linkage disequilibria were selected by Tomita et al. (2008), where they studied this area of the X chromosome from the Hapmap project. See Table 1 for detailed information (rs numbers, chromosome positions) on loci of data and Table 2 for major/minor allele frequencies. Note that we do not have any information on haplotypes as the data set does not contain the phase information and there is no missing observation. Fig. 1 shows the loci and LD(D′ ) map with chromosome positions. Various linkage disequilibriums can be found in Fig. 1. The LD map has been made using GUI ‘‘Haploview’’ software (Barrett et al., 2005). In Fig. 1, we can confirm that there are blocks ‘‘Block 1 and 2’’ defined by the method of Wang et al. (2002). We estimated haplotype–diplotype configurations and selected htSNPs (haplotype tagging SNPs) using GUI ‘‘Integrated Environmental System for SNPs Data Analysis’’ software (Tomita et al., 2004). After using the software, we selected 4 loci {7, 11, 13, 19} for Blocks 1 and 2, since we wanted to have as many varieties as possible for stochastic diplotype configurations. To select htSNPs, we used the method of Kamatani et al. (2004) with cumulative haplotype frequency 0.95. As a result, we obtained the diplotype configurations shown in Table 3.
2108
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
Table 1 Information of loci. (rs number, chromosome position). Locus# rs# Position
1 rs197000 28409449
2 rs197005 28413819
3 rs197006 28416655
4 rs197012 28424066
5 rs197014 28430651
6 rs197016 28433916
Locus# rs# Position
7 rs197018 28435927
8 rs197021 28441493
9 rs197022 28442352
10 rs5943527 28442857
11 rs642519 28445155
12 rs404274 28446901
Locus# rs# Position
13 rs196983 28448532
14 rs115126 28449670
15 rs115125 28449943
16 rs196985 28452952
17 rs196986 28453150
18 rs17348455 28453190
Locus# rs# Position
19 rs196988 28453742
20 rs196990 28456060
21 rs1265497 28458373
22 rs6630730 28468511
23 rs196982 28468715
24 rs196975 28475390
Locus# rs# Position
25 rs1468134 28526074
26 rs724087 28644979
27 rs5985808 28645245
28 rs4103136 28645826
29 rs12863731 28650807
30 rs1586093 28675987
Locus# rs# Position
31 rs5985930 28680456
32 rs5985809 28681164
33 rs5943575 28681984
34 rs2521807 28703122
35 rs634270 28705667
36 rs6630793 28709418
Locus# rs# Position
37 rs5943579 28723040
38 rs628704 28724158
39 rs629965 28724381
40 rs11095138 28725017
Table 2 Summary for major/minor allele frequencies for each locus. Locus
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Major allele
Minor allele
Missing
Name
Count
Frequency
Name
Count
Frequency
Name
Count
Total
A A T G T A T A A C G G C T A A T T T C G C A C T A T G A C T C T G T A G A A T
49 49 49 49 49 49 56 49 49 49 64 46 76 67 67 67 46 54 46 75 67 67 45 66 45 44 59 59 72 66 62 66 62 60 62 52 70 70 68 58
0.5568 0.5568 0.5568 0.5568 0.5568 0.5568 0.6364 0.5568 0.5568 0.5568 0.7273 0.5227 0.8636 0.7614 0.7614 0.7614 0.5227 0.6136 0.5227 0.8523 0.7614 0.7614 0.5114 0.75 0.5114 0.5 0.6705 0.6705 0.8182 0.75 0.7045 0.75 0.7045 0.6818 0.7045 0.5909 0.7955 0.7955 0.7727 0.6591
G G C T C C C G G T A A T C C G C A A T A T G T C G C C T A C T C A C G A C G A
39 39 39 39 39 39 32 39 39 39 24 42 12 21 21 21 42 34 42 13 21 21 43 22 43 44 29 29 16 22 26 22 26 28 26 36 18 18 20 30
0.4432 0.4432 0.4432 0.4432 0.4432 0.4432 0.3636 0.4432 0.4432 0.4432 0.2727 0.4773 0.1364 0.2386 0.2386 0.2386 0.4773 0.3864 0.4773 0.1477 0.2386 0.2386 0.4886 0.25 0.4886 0.5 0.3295 0.3295 0.1818 0.25 0.2955 0.25 0.2955 0.3182 0.2955 0.4091 0.2045 0.2045 0.2273 0.3409
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
2109
Table 3 Diplotype configurations estimated by htSNPs loci {7, 11, 13, 19} for each subject. Sub#
Diplotype#
Probability
Pop. freq.
Diplotype configuration
1 2 3 4 5 6 7 8 9 10 11 12
1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 3 1 1 1 1 1 1 2 3 1 2 3 1 1 1 1 1 1 2 3 1 1 2 3 1 1 2 1 1
1 1 1 1 1 1 1 1 1 1 1 0.5671 0.4329 1 1 1 1 1 1 1 1 1 1 1 0.9367 0.0633 0.6271 0.3406 0.0324 1 1 1 1 1 0.6271 0.3406 0.0324 0.6271 0.3406 0.0324 1 1 1 1 1 0.6271 0.3406 0.0324 1 0.6271 0.3406 0.0324 1 0.929 0.071 1 1
0.0812 0.0573 0.0812 0.0405 0.0105 0.0069 0.0028 0.0321 0.0268 0.0014 0.0405 0.007 0.0053 0.0177 0.0812 0.0812 0.0321 0.0105 0.0405 0.0812 0.0321 0.0069 0.0405 0.0812 0.0291 0.002 0.0379 0.0206 0.002 0.0127 0.0573 0.0105 0.0321 0.0105 0.0379 0.0206 0.002 0.0379 0.0206 0.002 0.0028 0.0136 0.0321 0.0006 0.0405 0.0379 0.0206 0.002 0.0075 0.0379 0.0206 0.002 0.0812 0.0115 0.0009 0.0127 0.0177
TGCATGCA TGCT TGCA TGCATGCA TGCT TGCT CACA CACA CGCT CACT CGCA CGCA TGCA TGTT TGCT CACT TACA TACA TGCT TGCT CGCA CACT CGCT CACA CACT CACT TGCATGCA TGCATGCA TGCA TGTT CACA CACA TGCT TGCT TGCATGCA TGCA TGTT CGCT CACT TGCT TGCT TGCATGCA TGCA CACA TACA CGCA TGCA CACT TGCT CACA TACA CGCT TGTT TGTT TGCT TGCA TGCT CGCT TGCA TGTT TGCT CGCT TGCA CACT TGCT CACA TACA CGCT TGCA CACT TGCT CACA TACA CGCT CGCA CGCA CACT CACA TGCA TGTT CGTT CGTT TGCT TGCT TGCA CACT TGCT CACA TACA CGCT TGCT TACA TGCA CACT TGCT CACA TACA CGCT TGCATGCA TGTT CACA TACA CGTT TGTT TGTT CACT CACT
13 14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31
32
33 34 35 36 37 38
39 40
41 42 43 44 TGCA is the target haplotype.
We assumed haplotype ‘‘TGCA’’ as the target haplotype for our study. Then, the data set of our study were divided into three subsets. The first subset consists of the subjects who have two ‘‘TGCA’’s, i.e., ‘‘TGCA–TGCA’’, as the diplotype, the second subset consists of those with one target haplotype, and the third subset consists of those without the target haplotype. To describe the association we consider three models, i.e., the dominant, recessive, additive models. In the dominant model, group D+ is composed of the first and second subsets and group D− is composed of the third subset. In the recessive model, group D+ is composed of the first subset and group D− is composed of the second and third subsets. In the additive model, there are three groups which correspond to the three subsets defined above.
2110
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
Fig. 1. LD(D′ ) map with rs number and chromosome position for data.
Regarding the two quantitative variables, observations were generated assuming that they follow a bivariable normal distribution N (µ1 , Σ ), N (µ2 , Σ ) or N (µ3 , Σ ) depending on the number the subjects had haplotype ‘‘TGCA’’, where
µ1 = (120, 19),
µ2 = (118, 20),
Σ=
256 40
40 . 8
The bivariate normal random numbers were generated by using the library ‘‘mvtnorm’’ in R. In our numerical study, we first generated two data sets. In the first data set, it is assumed that the genotype-to-phenotype relationship can be described by a recessive model, where the phenotype follows N {µ1 , Σ } or N {µ2 , Σ }, depending whether it belongs to group D+ or group D− , respectively. In the second data set, it is assumed that the genotype-tophenotype relationship can be described by an additive model, where in the first group the phenotype follows N {µ1 , Σ }, in the second group the phenotype follows N {µ2 , Σ }, and in the third group the phenotype follows N {µ3 , Σ }, where µ3 = (µ1 + µ2 )/2. We obtained the results of association analysis with our method as shown in Tables 5 and 7 by using our R program. The ‘q-value’ in Tables 5 and 7 indicates q value of the false discovery rate (FDR) (Benjamini and Hochberg, 1995). Tables 4 and 6 show the results of analysis using QTLmarc (Kamitsuji and Kamatani, 2006). The results of our method show that p values for the target haplotype ‘‘TGCA’’ are smaller than q values, and that p values for other haplotypes are larger than q values both in Tables 5 and 7. On the other hand the results of QTLmarc show that the p value for the target haplotype is larger than 0.05, and in addition p values for other six haplotypes are smaller than 0.05 in Table 4 and that the p value for the target haplotype is 0.042, while p values for other five haplotypes are also smaller than 0.05 in Table 6. There are many other haplotypes which have similar or smaller p values than the target haplotype by their method. Therefore we have to judge that the target haplotype is not significant in Table 6. Thus we could find that the target haplotype ‘‘TGCA’’ is significant by our method, but not significant by the QTLmarc. These results confirm that QTLmarc is not applicable besides the dominant model. For the dominant model we carried out a small size simulation study on the comparison of the powers between QTLmarc and our method. We assumed a dominant model with µ1 = (120, 19): null case, Σ = (σ11 , σ12 , σ22 ) = (256, 40, 8), set the values of µ’s for alternative cases so that the Mahalanobis distances became approximately 0.25, 0.50, 0.75, 1.00 and 1.25, and then generated 100 sets of data for each case. The results of analysis are given in Table 8. This table shows clearly that our method has higher power than QTLmarc for detecting the genotype-to-phenotype relationship in the case of the dominant model.
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
2111
Table 4 Results of analysis using QTLmarc of the first data set based on a recessive model. Haplotype
Number of carriers
AUC
P-value
TACA CGCA TGTA TGTT CACA TGCT TACT CGTT CGCT CACT TGCA CATA TATA TGTA CATT TATT
9 10 6 8 11 22 8 2 12 13 22 NA NA NA NA NA
0.8114 0.7367 0.7286 0.7192 0.7175 0.6906 0.6727 0.6274 0.6106 0.6097 0.5371 NA NA NA NA NA
0.00068 0.00349 0.00583 0.00607 0.00768 0.00270 0.03782 0.15388 0.10776 0.12171 0.33023 NA NA NA NA NA
AUC is an area under the ROC (receiver operating characteristic) curve. TGCA is the target haplotype. Table 5 Results of the analysis using our method of the first data set based on a recessive model. Haplotype
Haplotype frequency
χ2
P-value
q-value
df
TGCA CGTT CGCA CACT TACA TGCT CACA TGTT
0.2368 0.0263 0.0526 0.1228 0.0877 0.1842 0.1140 0.0877
15.3214 4.9060 2.4720 2.0721 1.5365 0.4474 0.3173 0.2663
0.00047 0.08603 0.29054 0.35484 0.46381 0.79955 0.85331 0.87533
0.00625 0.01250 0.01875 0.02500 0.03125 0.03750 0.04375 0.05000
2 2 2 2 2 2 2 2
TGCA is the target haplotype. Table 6 Results of analysis using QTLmarc of the second data set based on an additive model. Haplotype
Number of carriers
AUC
P-value
TACA TGCT TACT CGCA CACA TGCA CGTT TGTA CGCT CACT TGTT CATA TATA CGTA CATT TATT
9 22 8 10 11 22 2 6 12 13 8 NA NA NA NA NA
0.7605 0.6825 0.6928 0.6873 0.6624 0.6328 0.6774 0.6745 0.6092 0.6034 0.5836 NA NA NA NA NA
0.00011 0.00448 0.00960 0.01345 0.03368 0.04209 0.06749 0.06923 0.09593 0.11959 0.23181 NA NA NA NA NA
AUC is an area under the ROC (receiver operating characteristic) curve. TGCA is the target haplotype. Table 7 Results of the analysis using our method of the second data set based on an additive model. Haplotype
Haplotype frequency
χ2
P-value
q-value
df
TGCA CGTT TACA TGCT CACT CGCA CACA TGTT
0.2368 0.0263 0.0877 0.1842 0.1228 0.0526 0.1140 0.0877
11.8377 5.7841 3.3868 1.7475 1.2107 0.5627 0.5247 0.0452
0.00269 0.05546 0.18390 0.41738 0.54588 0.75476 0.76925 0.97768
0.00625 0.01250 0.01875 0.02500 0.03125 0.03750 0.04375 0.05000
2 2 2 2 2 2 2 2
TGCA is the target haplotype.
2112
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
Table 8 Power comparison between QTLmarc and our method for dominant models based on a simulation study with 100 iterations. Mean vector µx Mahalanobis distance QTLmarc Power Ours
{119, 19.5} 0.25 0.23 0.23
{118.5, 19.7} 0.508 0.56 0.59
{118.1, 19.84} 0.753 0.73 0.77
{118, 20} 1 0.83 0.94
{117.77, 20.12} 1.252 0.90 0.98
Table 9 The chi-squared (top) and AIC (bottom) statistics obtained with the analysis assuming dominant, recessive and additive models of the three datasets. Hypothesis test by Dominant model
True model
Recessive model
Additive model
Dominant
11.5837132 −7.5837132
0.0384127 3.9615873
Recessive
3.4820052 0.5179948
15.3214361
8.2125950
−11.3214361
−4.2125950
Additive
8.7507100
−4.7507100
11.6215961
−7.6215961
4.2594829
−0.2594829
11.8376920
−7.8376920
4. Discussions and summary We proposed a method of multivariate association analysis and developed an R program. It is an extension of QTLhaplo to the case of multivariate analysis, where we can analyze the cases where the diplotype configuration is not determined uniquely from the genotype and we can assume any dominant, recessive and additive models. To study the effectiveness of our method for the recessive and additive models, we carried out a numerical study to analyze two data sets with real genotype data (Hapmap project) and simulated QTL data, and for the dominant model we performed a small size simulation study to compare the powers with QTLmarc. The results showed that our method was more effective for detecting the genotype-to-phenotype relationship than QTLmarc of Kamitsuji and Kamatani (2006) in all the dominant, recessive and additive models. We speculate on the reason for the big difference of the results for the recessive and additive models. In Tables 4 and 6, the number of carriers of the target haplotype ‘‘TGCA’’ is 22. However, after we estimated haplotype and diplotype configurations by the maximum likelihood method, there are only 7 subjects with both haplotypes ‘‘TGCA–TGCA’’ (so that there are only 7 carriers in the recessive model), and there are only 13 subjects with the haplotype ‘‘TGCA’’ (so that there are only 7 and 13 carriers in the additive model) stochastically. The QTLmarc algorithm selected also subjects #39 and #42 as the carriers, but they have no target haplotype. It may be due to their method of haplotype estimation being based on allele counting, and model fitting can be applicable only for the dominant model. For the dominant model the power of our method is higher than that of QTLmarc which may be explained by the first advantage which will be mentioned below. For dominant models there are few differences, possibly because QTLmarc considers subjects #39 and #42 with no target haplotype as the carriers. So far we discussed the comparison of the powers between our method and QTLmarc. In actual data analysis, however, the objective is to find out if there exists any haplotype which is closely related to the phenotypes assuming an appropriate one among the dominant, recessive and additive models. For this purpose we may operationally choose the model with the highest significance. Note that, if we wish to use the AIC statistics we can compute them up to an additive constant based on the values of likelihood ratio statistics, because the log-likelihood statistics for the null models are common among the dominant, recessive and additive models. To study how this idea works in actual data analysis we analyzed each of the three datasets generated in the manner described in Section 3 assuming the three models, i.e., the dominant, recessive and additive models. The chi-squared statistics and the AIC statistics are given in Table 9. It is noted that the true models could be detected correctly in all cases but that there might exist a tendency that the dataset generated assuming the additive model is easily misjudged to be taken from other models. There are two advantages of our method compared to the QTLmarc algorithm. One is that our method can treat genotype data with stochastically determined diplotypes and the other is that we can assume any model among dominant, recessive and additive models. It is expected that our method will be useful in association studies of complex diseases such as schizophrenia and autism, where the causes of the diseases are not yet resolved and there exist multiple candidate responses. Acknowledgements We are deeply grateful to referees who provided kindly considered feedback and valuable comments. This work was partly supported by KAKENHI (21700317; Grant-in-Aid for Young Scientists (B)). References Barrett, J.C., Fry, B., Maller, J., Daly, M.J., 2005. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 21 (2), 263–265. Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B 57 (1), 289–300.
M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113
2113
Kamatani, N., Sekine, A., Kitamoto, T., Iida, A., Saito, S., Kogame, A., Inoue, E., Kawamoto, M., Harigai, M., Nakamura, Y., 2004. Large-scale single-nucleotide polymorphism (SNP) and haplotype analyses, using dense SNP maps, of 199 drug-related genes in 752 subjects: the analysis of the association between uncommon SNPs within haplotype blocks and the haplotypes constructed with haplotype-tagging SNPs. American Journal of Human Genetics 75 (2), 190–203. Kamitsuji, S., Kamatani, N., 2006. Estimation of haplotype associated with several quantitative phenotypes based on maximization of are under a receiver operating characteristic (ROC) curve. Journal of Human Genetics 51 (4), 314–325. Shibata, K., Ito, T., Kitamura, Y., Iwaaki, N., Tanaka, H., Kamatani, N., 2004. Simultaneous estimation of haplotype frequencies and quantitative trait parameters: applications to the test of association between phenotype and diplotype configuration. Genetics 168, 525–539. Tomita, M., Hatsumichi, M., Kurihara, K., 2008. Identify LD blocks based on hierarchical spatial data. Computational Statistics and Data Analysis 52 (4), 1806–1820. Tomita, M., Inoue, E., Kamatani, N., 2004. Integrated environmental system for SNPs data analysis. In: Program and Abstracts of The 13th Takeda Science Foundation Symposium on Bioscience, 92. Wang, N., Akey, J.M., Zhang, K., Chakraborty, R., Jin, L., 2002. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. American Journal of Human Genetics 71 (5), 1227–1234.