Association study for the relationship between a haplotype or haplotype set and multiple quantitative responses

Association study for the relationship between a haplotype or haplotype set and multiple quantitative responses

Computational Statistics and Data Analysis 55 (2011) 2104–2113 Contents lists available at ScienceDirect Computational Statistics and Data Analysis ...

537KB Sizes 0 Downloads 78 Views

Computational Statistics and Data Analysis 55 (2011) 2104–2113

Contents lists available at ScienceDirect

Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda

Association study for the relationship between a haplotype or haplotype set and multiple quantitative responses Makoto Tomita a,∗ , Noboru Hashimoto b , Yutaka Tanaka c a

Clinical Research Center, Tokyo Medical and Dental University Hospital, Faculty of Medicine, 113-8519, Japan

b

Nagoya University Graduate School of Medicine, 466-8550, Japan

c

Department of Information Systems and Mathematical Sciences, Nanzan University, 489-0863, Japan

article

info

Article history: Received 18 February 2010 Received in revised form 4 November 2010 Accepted 4 January 2011 Available online 21 January 2011 Keywords: Multivariate analysis Quantitative responses Haplotype Likelihood ratio test

abstract Though there have been several works on the analysis of the association between genotype and phenotype, little can be found for the association analysis between a haplotype or haplotype sets and multivariate quantitative responses. For example, QTLmarc is available for the analysis of multivariate responses, but it cannot be applied to the case of stochastic diplotype configurations and complex genetic models. The present paper proposes a method of association analysis between diplotype configuration and multivariate quantitative responses assuming the dominant, recessive and additive models. A comparative study is performed between the proposed method and QTLmarc by applying the two methods to numerical examples and small size simulated data sets with actual genotype information taken from the data set of the Hapmap project and artificial quantitative phenotype data which follow multivariate normal distributions. The results show that the proposed method is superior to QTLmarc in finding the assumed association. © 2011 Elsevier B.V. All rights reserved.

1. Introduction Recently the association has been actively studied between genotype and phenotype in post genomic research. Here ‘genotype’ means not only genotype itself but also haplotypes and diplotype configurations that are estimated from genotypes of the sample, and ‘phenotype’ indicates qualitative or quantitative variables which may be related to some specific diseases. A quantitative phenotype variable, called QTL (quantitative trait locus), includes covariates such as BMI and glucose level. Some algorithms have been proposed so far to analyze the association between the genotype information and the quantitative phenotypic QTL. The algorithm QTLhaplo (Shibata et al., 2004) deals with the association between the genotype and the univariate phenotype, assuming the normality of the conditional distribution of the phenotype given the genotype information. The likelihood is calculated on the basis of frequencies of diplotype configurations (joint probability of the frequencies of the haplotypes that compose the dipolotype) and the density function of a normal distribution. The algorithm QTLmarc (Kamitsuji and Kamatani, 2006) has been proposed for multivariate analysis of multiple quantitative responses, however, it can deal with only the case where each subject’s haplotype is determined uniquely from its genotype. It is doubtful whether it can evaluate the association properly in general cases. Therefore, it is valuable to develop a general method of association analysis for multivariate quantitative responses. In this paper, we extend the algorithm QTLhaplo so that it can deal with the association between the genotype and multiple quantitative variables assuming three types of models, i.e., the dominant, recessive and additive models.



Corresponding author. E-mail address: [email protected] (M. Tomita).

0167-9473/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2011.01.002

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

2105

2. Method 2.1. Univariate models Shibata et al. (2004) describe the algorithm QTLhaplo as follows. Suppose that there exist l linked loci. As DNA is of double helix structure and each haplotype has its counterpart, the number of possible haplotypes is L = 2l in total. Let the relative ∑L frequencies of the haplotypes be given as Θ = (θ1 , . . . , θj , . . . , θL ), where θj ≥ 0 and j=1 θj = 1. As each subject has a

combination of two haplotypes, there is a possibility that it has L2 possible combinations a1 , a2 , . . . , aL2 . The probability that the ith subject has a diplotype configuration ak of the lth and the mth haplotypes is given by P (di = ak |Θ ) = θl θm , where di is a diplotype configuration for the ith subject. Also suppose that the ith subject has quantitative phenotype ψi with probability density function f . Now consider that a sample of size N is observed in an experiment. The phenotype for each diplotype configuration is assumed to follow a normal distribution with a common variance but with a mean which varies depending on the diplotype configuration. The outcome of the experiment can then be expressed as (Θ , D, Ψ ), where D = (d1 , . . . , dN ) indicates the vector of the diplotype configurations and Ψ = (ψ1 , . . . , ψN ) indicates the matrix of the quantitative phenotypes. The observations for N subjects are classified into the two groups of subjects with and without specified haplotype ht in the diplotype configurations, and from these observations the group means µk and the common variance σ 2 can be estimated for the distributions of quantitative phenotypes, respectively. The problem is to test whether there exists any difference in the distribution of the phenotypes between the two groups. For a dominant model, D+ is defined as the set of subjects with diplotype configurations containing haplotype ht , and D− is defined as the set of those without ht . Then, the distribution of phenotypes is given by N (µ1 , σ 2 ) for diplotype di ∈ D+ or by N (µ2 , σ 2 ) for di ∈ D− . Denote the probability density functions by fµj (x), j = 1, 2. Thus the probability density function for ψi is defined by fµ1 (x) = f (ψi = x|di ∈ D+ ) in case di ∈ D+ and fµ2 (x) = f (ψi = x|di ∈ D− ) in case di ∈ D− . Let A and B denote haplotypes with specified ht and those without ht , respectively. Then every diplotype configuration is expressed as AA, AB, or BB, and sets D+ and D− can be defined as follows. In the case of dominant models, AA and AB belong to D+ , while BB belongs to D− . In the case of recessive models, AA belongs to D+ , while AB and BB belong to D− . For additive models, the distributions of ψi for AA, BB and AB are given by N (µ1 , σ 2 ), N (µ2 , σ 2 ) and N (µ3 , σ 2 ), respectively, where µ3 = (µ1 + µ2 )/2. 2.2. Extension to multivariate models We try to extend the above univariate model to a multivariate model. Suppose that the quantitative phenotype vector Ψi follows a multinormal distribution with a common variance–covariance matrix but with different mean vector corresponding to the group defined by the diplotype configurations. In dominant/recessive models, the density function is given by

  1 1  ′ −1   e− 2 (x−µ1 ) Σ (x−µ1 )   (2π )p/2 |Σ |− 21  f (Ψi = x|di = ak , µ, Σ ) =    1 1 ′ −1   e− 2 (x−µ2 ) Σ (x−µ2 )  1 − (2π )p/2 |Σ | 2

if ak ∈ D+ , (1) if ak ̸∈ D+ ,

while in an additive model it is given by

f (Ψi = x|di = ak , µ, Σ ) =

  1 1 ′ −1   e− 2 (x−µ1 ) Σ (x−µ1 )  1  −   (2π )p/2 |Σ | 2      1 1 ′ −1 e− 2 (x−µ2 ) Σ

(x−µ2 )

1   (2π )p/2 |Σ |− 2        µ1 +µ2 ′ −1 µ1 +µ2 1 1   e− 2 (x− 2 ) Σ (x− 2 )  − 21 p / 2 (2π ) |Σ |

if ak ∈ DAA , if ak ∈ DBB ,

(2)

if ak ∈ DAB ,

where µ, Σ indicate the mean vector and the variance–covariance matrix, respectively, x is the vector of individual quantitative phenotypes, and p is the number of phenotype variables. 2.3. Likelihood function The observed data consist of the genotype and quantitative phenotype of N subjects. Let Gobs = (g1 , g2 , . . . , gN ) and Ψobs = (w1 , w2 , . . . , wN ) be the vectors of the observed genotypes and the matrix of the quantitative phenotypes, respectively. Then the likelihood function is given by

2106

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

L(Θ , µ, Σ ) ∝

N − ∏

P (di = ak |Θ )f (ψi = wi |di = ak , µ, Σ ),

i=1 ak ∈Ai

where Ai denotes the set of diplotype configurations ak , which is consistent with gi , and f is the probability density function for N (µ, Σ ), where µ depends on ak . Under the null hypothesis it is assumed that the distribution of the phenotype does not depend on the haplotype, i.e., the mean vector µ is equal to a common vector µ0 . Under the alternative hypothesis, two multinormal distributions, N (µ1 , Σ ) and N (µ2 , Σ ), are defined, if the model is dominant or recessive, and three multinormal distributions, N (µ1 , Σ ), N (µ2 , Σ ), and N ((µ1 +µ2 )/2, Σ ), are defined, if the model is additive. The phenotype x of the ith subject follows a distribution given by Eq. (1) or (2), depending on the assumed model. 2.4. Estimation of parameters and likelihood ratio test If the complete data of d1 , d2 , . . . , dN and Ψ1 , Ψ2 , . . . , ΨN were available, the maximum likelihood estimators for µ, Σ and the haplotype frequencies Θ = (θ1 , θ2 , . . . , θL ) could be obtained as

θˆj = nj /(2N ) (j = 1, 2, . . . , L), − − µˆ1 = ψi /N+ , µˆ2 = ψi /N− , di ∈D+

di ̸∈D+

 −

ˆ = Σ

(Ψi − µ1 ) (Ψi − µ1 ) + ′

di ∈D+



(Ψi − µ2 ) (Ψi − µ2 )





N,

di ̸∈D+

where nj counts how often the jth haplotype appears among the N subjects, and N+ and N− denote the numbers of subjects who possess or do not possess haplotype ht , respectively. However, actually the complete data are not available and we can observe only genotypes and phenotypes of the subjects. For a subject, two or more haplotypes might correspond to one genotype. Therefore, there are two or more candidates for the diplotype configuration of the subject, given the haplotype. Several procedures have been developed for estimating haplotype frequencies Θ using the EM-algorithm, MCMC , or other algorithms. However, as the purpose of our study is different, we omit here a detailed explanation of those algorithms. Here we assume that the vector of frequencies of the haplotypes have been estimated using some appropriate software. Their results can be used to estimate the mean vectors and the variance–covariance matrix in our multivariate models. In the case of dominant and recessive models the log-likelihood function is expressed as l = log L(Θ , µ, Σ ) and the maximum likelihood estimators for µ1 and µ2 are obtained by N ∑

µˆ1 =

N ∑

ψi (ub /u0 )

i =1 N ∑

and

µˆ2 =

(ub /u0 )

i=1

ψi (vb /v0 )

i=1 N ∑

, (vb /v0 )

i =1

where



ub =

P (di = ak |Θ )f (ψi |di = ak , µ1 , σ ),

ak ∈D+ ∩Ai

u0 =



P (di = ak |Θ )f (ψi |di = ak , µ, σ ),

ak ∈Ai



vb =

P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ),

ak ∈Ai ∩D−

v0 =



P (di = ak |Θ )f (ψi |di = ak , µ, Σ ).

ak ∈Ai

The variance–covariance matrix is estimated as

ˆ = Σ

1 n

  N N − − T T (ψi − µ1 ) (ψi − µ1 ) (ub /u0 ) + (ψi − µ2 ) (ψi − µ2 ) (vb /v0 ) , i =1

i =1

where n is i = 1 ( ub / u0 ) + i=1 (vb /v0 ). D+ is a set of the diplotype configurations which contain haplotype ht in such a way that it is consistent with dominant/recessive models. In the case of additive models, the mean vectors and the variance–covariance matrix are estimated by solving the following equations.

∑N



N − ub i=1

u0

∑N

+

N 1 − wb

4 i=1 w0

 µ1 +

N 1 − wb

4 i =1 w 0

µ2 =

N  − ub i=1

u0

+

1 2



u0 ub

 xi

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113 N 1 − wb

4 i=1 w0

ˆ = Σ

 µ1 +

N N − vb 1 − wb + v0 4 i=1 w0 i =1

 µ2 =

N  − i=1

vb 1 + v0 2



u0 ub

2107

 xi

 N N − − (ψi − µ1 ) (ψi − µ1 )T (ub /u0 ) + (ψi − µ2 ) (ψi − µ2 )T (vb /v0 ) n i=1 i=1  N − T + (ψi − (µ1 + µ2 )/2) (ψi − (µ1 + µ2 )/2) (wb /w0 ) . 1

i=1

In the above equations, u’s, v ’s and w ’s are defined by ub =



P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ),

ak ∈Ai ∩AA

u0 =



P (di = ak |Θ )f (ψi |di = ak , µ, Σ ),

ak ∈Ai

vb =



P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ),

ak ∈Ai ∩BB

v0 =



P (di = ak |Θ )f (ψi |di = ak , µ, Σ ),

ak ∈Ai

wb =



P (di = ak |Θ )f (ψi |di = ak , µ2 , Σ ) and

ak ∈Ai ∩AB

w0 =



P (di = ak |Θ )f (ψi |di = ak , µ, Σ ).

ak ∈Ai

These two systems of equations can be solved by using iterative procedures with appropriate initial values. The converged solution provides maximum likelihood estimates. Likelihood ratio tests can be applied to the association analysis between the haplotypes and the phenotypes. Let L0max and Lmax be the likelihood functions under the null and alternative hypotheses, respectively. It is known that under the null hypothesis the log-likelihood ratio −2 log(L0max /Lmax ) asymptotically follows a chi-squared distribution. The degree of freedom is given by the difference of the numbers of parameters corresponding to both hypotheses. 2.5. QTLmarc algorithm Kamitsuji and Kamatani (2006) proposed an algorithm ‘‘QTLmarc’’ for estimating haplotypes associated with several quantitative phenotypes. Their method is a discriminant analysis of the two groups defined by whether or not the genotype of the sample contains a specified haplotype on the basis of a linear combination of quantitative phenotype values. The goodness of discrimination is evaluated with the area under the ROC curve (AUC). As the two groups are defined simply as above, their method cannot deal with the differences among the dominant, recessive and additive models. In the next section, we show two numerical examples, where the comparison of the performance will be made between our method and the QTLmarc. 3. Numerical study and the results of analyses We consider the case where there are three loci with two kinds of alleles as genotypes and two quantitative phenotype variables. As a genotype input data set, an actual data set for 44 subjects was downloaded from the Hapmap project, and the information of the region (107,189 loci) of the X chromosome was used. Among a large number of loci, 40 loci with linkage disequilibria were selected by Tomita et al. (2008), where they studied this area of the X chromosome from the Hapmap project. See Table 1 for detailed information (rs numbers, chromosome positions) on loci of data and Table 2 for major/minor allele frequencies. Note that we do not have any information on haplotypes as the data set does not contain the phase information and there is no missing observation. Fig. 1 shows the loci and LD(D′ ) map with chromosome positions. Various linkage disequilibriums can be found in Fig. 1. The LD map has been made using GUI ‘‘Haploview’’ software (Barrett et al., 2005). In Fig. 1, we can confirm that there are blocks ‘‘Block 1 and 2’’ defined by the method of Wang et al. (2002). We estimated haplotype–diplotype configurations and selected htSNPs (haplotype tagging SNPs) using GUI ‘‘Integrated Environmental System for SNPs Data Analysis’’ software (Tomita et al., 2004). After using the software, we selected 4 loci {7, 11, 13, 19} for Blocks 1 and 2, since we wanted to have as many varieties as possible for stochastic diplotype configurations. To select htSNPs, we used the method of Kamatani et al. (2004) with cumulative haplotype frequency 0.95. As a result, we obtained the diplotype configurations shown in Table 3.

2108

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

Table 1 Information of loci. (rs number, chromosome position). Locus# rs# Position

1 rs197000 28409449

2 rs197005 28413819

3 rs197006 28416655

4 rs197012 28424066

5 rs197014 28430651

6 rs197016 28433916

Locus# rs# Position

7 rs197018 28435927

8 rs197021 28441493

9 rs197022 28442352

10 rs5943527 28442857

11 rs642519 28445155

12 rs404274 28446901

Locus# rs# Position

13 rs196983 28448532

14 rs115126 28449670

15 rs115125 28449943

16 rs196985 28452952

17 rs196986 28453150

18 rs17348455 28453190

Locus# rs# Position

19 rs196988 28453742

20 rs196990 28456060

21 rs1265497 28458373

22 rs6630730 28468511

23 rs196982 28468715

24 rs196975 28475390

Locus# rs# Position

25 rs1468134 28526074

26 rs724087 28644979

27 rs5985808 28645245

28 rs4103136 28645826

29 rs12863731 28650807

30 rs1586093 28675987

Locus# rs# Position

31 rs5985930 28680456

32 rs5985809 28681164

33 rs5943575 28681984

34 rs2521807 28703122

35 rs634270 28705667

36 rs6630793 28709418

Locus# rs# Position

37 rs5943579 28723040

38 rs628704 28724158

39 rs629965 28724381

40 rs11095138 28725017

Table 2 Summary for major/minor allele frequencies for each locus. Locus

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

Major allele

Minor allele

Missing

Name

Count

Frequency

Name

Count

Frequency

Name

Count

Total

A A T G T A T A A C G G C T A A T T T C G C A C T A T G A C T C T G T A G A A T

49 49 49 49 49 49 56 49 49 49 64 46 76 67 67 67 46 54 46 75 67 67 45 66 45 44 59 59 72 66 62 66 62 60 62 52 70 70 68 58

0.5568 0.5568 0.5568 0.5568 0.5568 0.5568 0.6364 0.5568 0.5568 0.5568 0.7273 0.5227 0.8636 0.7614 0.7614 0.7614 0.5227 0.6136 0.5227 0.8523 0.7614 0.7614 0.5114 0.75 0.5114 0.5 0.6705 0.6705 0.8182 0.75 0.7045 0.75 0.7045 0.6818 0.7045 0.5909 0.7955 0.7955 0.7727 0.6591

G G C T C C C G G T A A T C C G C A A T A T G T C G C C T A C T C A C G A C G A

39 39 39 39 39 39 32 39 39 39 24 42 12 21 21 21 42 34 42 13 21 21 43 22 43 44 29 29 16 22 26 22 26 28 26 36 18 18 20 30

0.4432 0.4432 0.4432 0.4432 0.4432 0.4432 0.3636 0.4432 0.4432 0.4432 0.2727 0.4773 0.1364 0.2386 0.2386 0.2386 0.4773 0.3864 0.4773 0.1477 0.2386 0.2386 0.4886 0.25 0.4886 0.5 0.3295 0.3295 0.1818 0.25 0.2955 0.25 0.2955 0.3182 0.2955 0.4091 0.2045 0.2045 0.2273 0.3409

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88 88

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

2109

Table 3 Diplotype configurations estimated by htSNPs loci {7, 11, 13, 19} for each subject. Sub#

Diplotype#

Probability

Pop. freq.

Diplotype configuration

1 2 3 4 5 6 7 8 9 10 11 12

1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 3 1 1 1 1 1 1 2 3 1 2 3 1 1 1 1 1 1 2 3 1 1 2 3 1 1 2 1 1

1 1 1 1 1 1 1 1 1 1 1 0.5671 0.4329 1 1 1 1 1 1 1 1 1 1 1 0.9367 0.0633 0.6271 0.3406 0.0324 1 1 1 1 1 0.6271 0.3406 0.0324 0.6271 0.3406 0.0324 1 1 1 1 1 0.6271 0.3406 0.0324 1 0.6271 0.3406 0.0324 1 0.929 0.071 1 1

0.0812 0.0573 0.0812 0.0405 0.0105 0.0069 0.0028 0.0321 0.0268 0.0014 0.0405 0.007 0.0053 0.0177 0.0812 0.0812 0.0321 0.0105 0.0405 0.0812 0.0321 0.0069 0.0405 0.0812 0.0291 0.002 0.0379 0.0206 0.002 0.0127 0.0573 0.0105 0.0321 0.0105 0.0379 0.0206 0.002 0.0379 0.0206 0.002 0.0028 0.0136 0.0321 0.0006 0.0405 0.0379 0.0206 0.002 0.0075 0.0379 0.0206 0.002 0.0812 0.0115 0.0009 0.0127 0.0177

TGCATGCA TGCT TGCA TGCATGCA TGCT TGCT CACA CACA CGCT CACT CGCA CGCA TGCA TGTT TGCT CACT TACA TACA TGCT TGCT CGCA CACT CGCT CACA CACT CACT TGCATGCA TGCATGCA TGCA TGTT CACA CACA TGCT TGCT TGCATGCA TGCA TGTT CGCT CACT TGCT TGCT TGCATGCA TGCA CACA TACA CGCA TGCA CACT TGCT CACA TACA CGCT TGTT TGTT TGCT TGCA TGCT CGCT TGCA TGTT TGCT CGCT TGCA CACT TGCT CACA TACA CGCT TGCA CACT TGCT CACA TACA CGCT CGCA CGCA CACT CACA TGCA TGTT CGTT CGTT TGCT TGCT TGCA CACT TGCT CACA TACA CGCT TGCT TACA TGCA CACT TGCT CACA TACA CGCT TGCATGCA TGTT CACA TACA CGTT TGTT TGTT CACT CACT

13 14 15 16 17 18 19 20 21 22 23 24 25

26 27 28 29 30 31

32

33 34 35 36 37 38

39 40

41 42 43 44 TGCA is the target haplotype.

We assumed haplotype ‘‘TGCA’’ as the target haplotype for our study. Then, the data set of our study were divided into three subsets. The first subset consists of the subjects who have two ‘‘TGCA’’s, i.e., ‘‘TGCA–TGCA’’, as the diplotype, the second subset consists of those with one target haplotype, and the third subset consists of those without the target haplotype. To describe the association we consider three models, i.e., the dominant, recessive, additive models. In the dominant model, group D+ is composed of the first and second subsets and group D− is composed of the third subset. In the recessive model, group D+ is composed of the first subset and group D− is composed of the second and third subsets. In the additive model, there are three groups which correspond to the three subsets defined above.

2110

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

Fig. 1. LD(D′ ) map with rs number and chromosome position for data.

Regarding the two quantitative variables, observations were generated assuming that they follow a bivariable normal distribution N (µ1 , Σ ), N (µ2 , Σ ) or N (µ3 , Σ ) depending on the number the subjects had haplotype ‘‘TGCA’’, where

µ1 = (120, 19),

µ2 = (118, 20),

Σ=



256 40



40 . 8

The bivariate normal random numbers were generated by using the library ‘‘mvtnorm’’ in R. In our numerical study, we first generated two data sets. In the first data set, it is assumed that the genotype-to-phenotype relationship can be described by a recessive model, where the phenotype follows N {µ1 , Σ } or N {µ2 , Σ }, depending whether it belongs to group D+ or group D− , respectively. In the second data set, it is assumed that the genotype-tophenotype relationship can be described by an additive model, where in the first group the phenotype follows N {µ1 , Σ }, in the second group the phenotype follows N {µ2 , Σ }, and in the third group the phenotype follows N {µ3 , Σ }, where µ3 = (µ1 + µ2 )/2. We obtained the results of association analysis with our method as shown in Tables 5 and 7 by using our R program. The ‘q-value’ in Tables 5 and 7 indicates q value of the false discovery rate (FDR) (Benjamini and Hochberg, 1995). Tables 4 and 6 show the results of analysis using QTLmarc (Kamitsuji and Kamatani, 2006). The results of our method show that p values for the target haplotype ‘‘TGCA’’ are smaller than q values, and that p values for other haplotypes are larger than q values both in Tables 5 and 7. On the other hand the results of QTLmarc show that the p value for the target haplotype is larger than 0.05, and in addition p values for other six haplotypes are smaller than 0.05 in Table 4 and that the p value for the target haplotype is 0.042, while p values for other five haplotypes are also smaller than 0.05 in Table 6. There are many other haplotypes which have similar or smaller p values than the target haplotype by their method. Therefore we have to judge that the target haplotype is not significant in Table 6. Thus we could find that the target haplotype ‘‘TGCA’’ is significant by our method, but not significant by the QTLmarc. These results confirm that QTLmarc is not applicable besides the dominant model. For the dominant model we carried out a small size simulation study on the comparison of the powers between QTLmarc and our method. We assumed a dominant model with µ1 = (120, 19): null case, Σ = (σ11 , σ12 , σ22 ) = (256, 40, 8), set the values of µ’s for alternative cases so that the Mahalanobis distances became approximately 0.25, 0.50, 0.75, 1.00 and 1.25, and then generated 100 sets of data for each case. The results of analysis are given in Table 8. This table shows clearly that our method has higher power than QTLmarc for detecting the genotype-to-phenotype relationship in the case of the dominant model.

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

2111

Table 4 Results of analysis using QTLmarc of the first data set based on a recessive model. Haplotype

Number of carriers

AUC

P-value

TACA CGCA TGTA TGTT CACA TGCT TACT CGTT CGCT CACT TGCA CATA TATA TGTA CATT TATT

9 10 6 8 11 22 8 2 12 13 22 NA NA NA NA NA

0.8114 0.7367 0.7286 0.7192 0.7175 0.6906 0.6727 0.6274 0.6106 0.6097 0.5371 NA NA NA NA NA

0.00068 0.00349 0.00583 0.00607 0.00768 0.00270 0.03782 0.15388 0.10776 0.12171 0.33023 NA NA NA NA NA

AUC is an area under the ROC (receiver operating characteristic) curve. TGCA is the target haplotype. Table 5 Results of the analysis using our method of the first data set based on a recessive model. Haplotype

Haplotype frequency

χ2

P-value

q-value

df

TGCA CGTT CGCA CACT TACA TGCT CACA TGTT

0.2368 0.0263 0.0526 0.1228 0.0877 0.1842 0.1140 0.0877

15.3214 4.9060 2.4720 2.0721 1.5365 0.4474 0.3173 0.2663

0.00047 0.08603 0.29054 0.35484 0.46381 0.79955 0.85331 0.87533

0.00625 0.01250 0.01875 0.02500 0.03125 0.03750 0.04375 0.05000

2 2 2 2 2 2 2 2

TGCA is the target haplotype. Table 6 Results of analysis using QTLmarc of the second data set based on an additive model. Haplotype

Number of carriers

AUC

P-value

TACA TGCT TACT CGCA CACA TGCA CGTT TGTA CGCT CACT TGTT CATA TATA CGTA CATT TATT

9 22 8 10 11 22 2 6 12 13 8 NA NA NA NA NA

0.7605 0.6825 0.6928 0.6873 0.6624 0.6328 0.6774 0.6745 0.6092 0.6034 0.5836 NA NA NA NA NA

0.00011 0.00448 0.00960 0.01345 0.03368 0.04209 0.06749 0.06923 0.09593 0.11959 0.23181 NA NA NA NA NA

AUC is an area under the ROC (receiver operating characteristic) curve. TGCA is the target haplotype. Table 7 Results of the analysis using our method of the second data set based on an additive model. Haplotype

Haplotype frequency

χ2

P-value

q-value

df

TGCA CGTT TACA TGCT CACT CGCA CACA TGTT

0.2368 0.0263 0.0877 0.1842 0.1228 0.0526 0.1140 0.0877

11.8377 5.7841 3.3868 1.7475 1.2107 0.5627 0.5247 0.0452

0.00269 0.05546 0.18390 0.41738 0.54588 0.75476 0.76925 0.97768

0.00625 0.01250 0.01875 0.02500 0.03125 0.03750 0.04375 0.05000

2 2 2 2 2 2 2 2

TGCA is the target haplotype.

2112

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

Table 8 Power comparison between QTLmarc and our method for dominant models based on a simulation study with 100 iterations. Mean vector µx Mahalanobis distance QTLmarc Power Ours

{119, 19.5} 0.25 0.23 0.23

{118.5, 19.7} 0.508 0.56 0.59

{118.1, 19.84} 0.753 0.73 0.77

{118, 20} 1 0.83 0.94

{117.77, 20.12} 1.252 0.90 0.98

Table 9 The chi-squared (top) and AIC (bottom) statistics obtained with the analysis assuming dominant, recessive and additive models of the three datasets. Hypothesis test by Dominant model

True model

Recessive model

Additive model

Dominant

11.5837132 −7.5837132

0.0384127 3.9615873

Recessive

3.4820052 0.5179948

15.3214361

8.2125950

−11.3214361

−4.2125950

Additive

8.7507100

−4.7507100

11.6215961

−7.6215961

4.2594829

−0.2594829

11.8376920

−7.8376920

4. Discussions and summary We proposed a method of multivariate association analysis and developed an R program. It is an extension of QTLhaplo to the case of multivariate analysis, where we can analyze the cases where the diplotype configuration is not determined uniquely from the genotype and we can assume any dominant, recessive and additive models. To study the effectiveness of our method for the recessive and additive models, we carried out a numerical study to analyze two data sets with real genotype data (Hapmap project) and simulated QTL data, and for the dominant model we performed a small size simulation study to compare the powers with QTLmarc. The results showed that our method was more effective for detecting the genotype-to-phenotype relationship than QTLmarc of Kamitsuji and Kamatani (2006) in all the dominant, recessive and additive models. We speculate on the reason for the big difference of the results for the recessive and additive models. In Tables 4 and 6, the number of carriers of the target haplotype ‘‘TGCA’’ is 22. However, after we estimated haplotype and diplotype configurations by the maximum likelihood method, there are only 7 subjects with both haplotypes ‘‘TGCA–TGCA’’ (so that there are only 7 carriers in the recessive model), and there are only 13 subjects with the haplotype ‘‘TGCA’’ (so that there are only 7 and 13 carriers in the additive model) stochastically. The QTLmarc algorithm selected also subjects #39 and #42 as the carriers, but they have no target haplotype. It may be due to their method of haplotype estimation being based on allele counting, and model fitting can be applicable only for the dominant model. For the dominant model the power of our method is higher than that of QTLmarc which may be explained by the first advantage which will be mentioned below. For dominant models there are few differences, possibly because QTLmarc considers subjects #39 and #42 with no target haplotype as the carriers. So far we discussed the comparison of the powers between our method and QTLmarc. In actual data analysis, however, the objective is to find out if there exists any haplotype which is closely related to the phenotypes assuming an appropriate one among the dominant, recessive and additive models. For this purpose we may operationally choose the model with the highest significance. Note that, if we wish to use the AIC statistics we can compute them up to an additive constant based on the values of likelihood ratio statistics, because the log-likelihood statistics for the null models are common among the dominant, recessive and additive models. To study how this idea works in actual data analysis we analyzed each of the three datasets generated in the manner described in Section 3 assuming the three models, i.e., the dominant, recessive and additive models. The chi-squared statistics and the AIC statistics are given in Table 9. It is noted that the true models could be detected correctly in all cases but that there might exist a tendency that the dataset generated assuming the additive model is easily misjudged to be taken from other models. There are two advantages of our method compared to the QTLmarc algorithm. One is that our method can treat genotype data with stochastically determined diplotypes and the other is that we can assume any model among dominant, recessive and additive models. It is expected that our method will be useful in association studies of complex diseases such as schizophrenia and autism, where the causes of the diseases are not yet resolved and there exist multiple candidate responses. Acknowledgements We are deeply grateful to referees who provided kindly considered feedback and valuable comments. This work was partly supported by KAKENHI (21700317; Grant-in-Aid for Young Scientists (B)). References Barrett, J.C., Fry, B., Maller, J., Daly, M.J., 2005. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 21 (2), 263–265. Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B 57 (1), 289–300.

M. Tomita et al. / Computational Statistics and Data Analysis 55 (2011) 2104–2113

2113

Kamatani, N., Sekine, A., Kitamoto, T., Iida, A., Saito, S., Kogame, A., Inoue, E., Kawamoto, M., Harigai, M., Nakamura, Y., 2004. Large-scale single-nucleotide polymorphism (SNP) and haplotype analyses, using dense SNP maps, of 199 drug-related genes in 752 subjects: the analysis of the association between uncommon SNPs within haplotype blocks and the haplotypes constructed with haplotype-tagging SNPs. American Journal of Human Genetics 75 (2), 190–203. Kamitsuji, S., Kamatani, N., 2006. Estimation of haplotype associated with several quantitative phenotypes based on maximization of are under a receiver operating characteristic (ROC) curve. Journal of Human Genetics 51 (4), 314–325. Shibata, K., Ito, T., Kitamura, Y., Iwaaki, N., Tanaka, H., Kamatani, N., 2004. Simultaneous estimation of haplotype frequencies and quantitative trait parameters: applications to the test of association between phenotype and diplotype configuration. Genetics 168, 525–539. Tomita, M., Hatsumichi, M., Kurihara, K., 2008. Identify LD blocks based on hierarchical spatial data. Computational Statistics and Data Analysis 52 (4), 1806–1820. Tomita, M., Inoue, E., Kamatani, N., 2004. Integrated environmental system for SNPs data analysis. In: Program and Abstracts of The 13th Takeda Science Foundation Symposium on Bioscience, 92. Wang, N., Akey, J.M., Zhang, K., Chakraborty, R., Jin, L., 2002. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. American Journal of Human Genetics 71 (5), 1227–1234.