Logical and test consistency in pairwise multiple comparisons

Logical and test consistency in pairwise multiple comparisons

Journal of Statistical Planning and Inference xxx (xxxx) xxx Contents lists available at ScienceDirect Journal of Statistical Planning and Inference...

809KB Sizes 0 Downloads 54 Views

Journal of Statistical Planning and Inference xxx (xxxx) xxx

Contents lists available at ScienceDirect

Journal of Statistical Planning and Inference journal homepage: www.elsevier.com/locate/jspi

Logical and test consistency in pairwise multiple comparisons ∗

Chi Tim Ng a , , Woojoo Lee b , Youngjo Lee c a b c

Department of Statistics, Chonnam National University, Gwangju 500-757, Korea Department of Statistics, Inha University, Incheon 402-751, Korea Department of Statistics, Seoul National University, Seoul 151-747, Korea

article

info

Article history: Received 9 July 2018 Received in revised form 20 September 2019 Accepted 20 September 2019 Available online xxxx MSC: 62F03 62F12 62F35

a b s t r a c t In this study, we reformulate pairwise comparisons as penalized likelihood estimation problems based on the L1 penalties on the pairwise differences. The proposed approach offers a natural way to avoid logically contradictory conclusions resulting from pairwise comparisons. Moreover, we show that under appropriate conditions, all nulls are not rejected and all alternatives are rejected with probability going to one. Two examples based on real data are used to illustrate the proposed method. © 2019 Elsevier B.V. All rights reserved.

Keywords: Pairwise comparisons Penalized likelihood Test consistency

1. Introduction Pairwise comparison methods have been widely used in various areas such as medicine, biology and education. Agresti et al. (2008) studied pairwise differences between proportion of adverse events such as nausea when four treatment groups from a randomized trial related to early Huntington’s disease were compared. Gelman et al. (2012) compared all states based on average scores on the National Assessment of Educational Progress fourth-grade mathematics test. Lin et al. (2014) and Lu et al. (2017) considered a clinical problem involving pairwise comparisons of several treatments to assess their relative efficacy. The classical pairwise comparison methods were described in Miller (1981) and Hochberg and Tamhane (1987). However, these methods can result in logically contradictory conclusions (Lehmann and Romano, 2005). For example, suppose there are three groups with means µ1 , µ2 and µ3 respectively, and the problem of interest employs multiple tests on three hypotheses: H012 : µ1 = µ2 vs. H112 : µ1 ̸ = µ2 , H013 : µ1 = µ3 vs. H113 : µ1 ̸ = µ3 , and H023 : µ2 = µ3 vs. H123 : µ2 ̸ = µ3 . The classical methods may yield the logically contradictory result that µ1 = µ2 , µ1 = µ3 , and µ2 ̸= µ3 . To circumvent the logical contradiction, the concepts of coherence and consonance have been introduced in Gabriel (1969), Sonnemann (2008) and Zhao et al. (2010). Finner and Strassberger (2002) regard coherence as a minimal requirement for multiple tests and Sonnemann (2008) shows that any coherent test can be constructed using the closure method proposed by Marcus et al. (1976). Sonnemann and Finner (1998) show that any incoherent multiple testing procedure can be replaced by a coherent one that rejects the same hypotheses and possibly more. Romano et al. (2011) show that many optimal tests satisfy the consonance property. Furthermore, Zhao et al. (2010) developed general methods for constructing coherent and consonant tests based on the partitioning method of Finner and Strassberger (2002). ∗ Corresponding author. E-mail addresses: [email protected] (C.T. Ng), [email protected] (W. Lee), [email protected] (Y. Lee). https://doi.org/10.1016/j.jspi.2019.09.009 0378-3758/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

2

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx Table 1 Outcomes of a multiple test. Hypothesis

Accept null

Reject null

Total

Null Alternative Total

N0 − V N1 − S N −R

V S R

N0 N1 N

To apply coherent and consonant testing procedures, various computational methods have been proposed. However, in practice, they are computationally intractable as the number of hypotheses N increases. For example, with N hypotheses, the number of tests required by the closure method is O(2N ) (Romano et al., 2011). This raises a severe computational difficulty because the total number of tests increases exponentially with N. Furthermore, constructing test statistics for the hypotheses is not easy in general. In order to reduce the number of tests, the so-called shortcut procedures have been studied in the literature (Sonnemann and Finner, 1998; Brannah and Bretz, 2010). However, they have focused mainly on some special cases such as consonant procedures (Goeman and Solari, 2011). Thus, to overcome such computationally difficulties in pairwise comparison test, we propose a procedure that automatically does not generate logically contradictory results. In this study, we reformulate pairwise comparisons as penalized likelihood problems. The proposed method eliminates logically contradictory conclusions and does not require that we perform O(2k ) tests. To illustrate this, consider the three-group example again. The L1 penalty can be defined as

λ(|µ1 − µ2 | + |µ1 − µ3 | + |µ2 − µ3 |), where λ is a tuning parameter. Let (µ ˆ 1, µ ˆ 2, µ ˆ 3 ) be the maximum penalized likelihood estimator. Our proposed method rejects the null H012 if µ ˆ 1 ̸= µ ˆ 2 . Due to the L1 penalty on the pairwise differences, for example, |µ ˆ1 −µ ˆ 2 | can take the value zero. Therefore, µ ˆ1 = µ ˆ 2 is possible. Obviously, µ ˆ1 = µ ˆ2, µ ˆ1 = µ ˆ 3 , and µ ˆ 2 ̸= µ ˆ 3 cannot occur at the same time. On the contrary, if some non-zero critical values T12 , T13 , T23 are used as thresholding values to determine whether µ1 − µ2 , µ1 − µ3 and µ3 − µ2 are in fact zeros, the possibility that |µ ˆ1 −µ ˆ 2 | < T12 , |µ ˆ1 −µ ˆ 3 | < T13 , and |µ ˆ2 −µ ˆ 3 | > T23 hold at the same time cannot be ruled out. Hocking et al. (2011) studied a similar penalized likelihood for clustering problems. Though Zhang and Zhang (2012) studied the penalized likelihood procedures based on the so-called structured LASSO, the validity of their assumptions in the above-mentioned penalty function requires justifications. Suppose that there are k groups and the group sizes are n1 , n2 , . . . , nk . Consider multiple tests H0i versus H1i for i = 1, 2, . . . , N, where N = k(k − 1)/2 . In this paper, we study ‘‘test consistency’’ of the proposed penalized likelihood procedure that as n = maxi=1,2,...,k ni increases, P(Not reject H0i if H0i is true and reject H1i if H1i is true for all i = 1, 2, . . . , N) → 1. In multiple tests, such as pairwise comparisons, the family-wise error rate (FWER) is often controlled at a fixed level of 5% or 1%. To control the FWER, Tukey (1953) proposes using simultaneous confidence intervals. Kramer (1956), Cheung and Chan (1996) and Lin et al. (2014) constructed simultaneous pairwise confidence intervals for pairwise comparisons based on the work of Tukey (1953). However, under appropriate conditions, as we shall see, the FWER goes to zero as the group size increases. Table 1 shows the quantities arising in the multiple tests. The test consistency is equivalent to P(V = 0 and N1 − S = 0) → 1 as n → ∞ , where V , N1 , and S depend on n. In pairwise comparisons, the number of hypotheses N is increasing with the number of groups. Gretton and Györfi (2010) establish test consistency results for a non-parametric test for the independence of two groups and Lee and Bjørnstad (2013) show the test consistency for independent multiple hypotheses. In this paper, we study the test consistency of pairwise comparisons. The remainder of the paper is organized as follows. In Section 2, we discuss pairwise comparisons under the penalized likelihood approach, and show that test consistency can be achieved under certain conditions. Then, in Sections 3 and 4, we describe a numerical study and two applications of the proposed method respectively. Section 5 concludes the paper including a discussion of possible extensions of the theory and methods developed in this paper. All proofs are presented in Appendix A. In Appendix B, we present a numerical algorithm that implements the penalized likelihood method. 2. Test consistency in pairwise comparisons Consider pairwise comparisons of k groups aiming to establish equalities in the group means. For each group i = 1, 2, . . . , k , the observations yit ∈ R t = 1, 2, . . . , ni are generated independently from a distribution with log-density function q(yit ; µi , τi ). Here, µi are the parameters of interest and τi are nuisance parameters. The pairwise comparison problem involves N = k(k − 1)/2 hypotheses tests: H0ij : µi = µj vs. H1ij : µi ̸ = µj for i ̸ = j = 1, 2, . . . , k. Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

3

2.1. Construction of the test Consider the penalized likelihood Qλ (µ1 , . . . , µk , τ1 , . . . , τk ) =

ni k ∑ ∑

q(yit ; µi , τi ) − λ



|µi − µj | .

(2.1)

1≤i
i=1 t =1

Let (µ ˆ 1, µ ˆ 2, . . . , µ ˆ k , τˆ1 , . . . , τˆk ) be the maximizer of (2.1). In this paper, we accept H0ij if µ ˆi = µ ˆ j and reject H0ij if µ ˆ i ̸= µ ˆ j . In practice, the estimate must be obtained through numerical approximation strategies. Therefore, a critical value is required to determine if |µ ˆi −µ ˆ j | is close to zero. This is discussed in Appendix B. Theorem 2.1. The penalized likelihood estimator always gives a coherent and consonant solution if a local maximum of (2.1) exists. In the heterogeneous Gaussian cases where the observations yit , i = 1, 2, . . . , k , t = 1, 2, . . . , ni follow the distribution N(µi , σi2 ), we might take the nuisance parameters to be the variances τi = σi2 ∈ R+ . The penalized likelihood becomes Qλ (µ1 , . . . , µk , τ1 , . . . , τk ) = −

k ∑ ni i=1

2

log τi −

k 1∑

2

i=1

τi−1

ni ∑



t =1

1≤i
(yit − µi )2 − λ

After profiling σi with its maximum likelihood estimator σi (µi ) = up to constants Qλ∗ (µ1 , . . . , µk ) =

k ∑

L∗i (µi ) − λ



√∑ ni

t =1 (yit

|µi − µj | .

(2.2)

− µi )2 /ni , the penalized likelihood becomes

|µi − µj | ,

(2.3)

1≤i
i=1

∑n

i 2 where L∗i (µi ) = −ni log{ t = 1 (yit −µi ) /ni }/2 . The resulting penalized profiled likelihood is a function of µ1 , . . . , µk only. In the homogeneous Gaussian case with τi = τ ∈ R+ , the maximizer obtained from (τ , λ) is the same as the maximizer from (1, τ λ) . In this case, τ Qλ (µ1 , . . . , µk , τ ) = Qτ λ (µ1 , . . . , µk , 1). Therefore, it suffices that we establish the theory assuming τ is known.

2.2. Test consistency Suppose that the group sizes fulfill maxi=1,2,...,k ni /mini=1,2,...,k ni = O(1). It is shown in this section that when n = maxi=1,2,...,k ni tends to infinity, there exists λ such that test consistency holds with probability going to one. Note also that the profiled penalized likelihood Qλ∗ (µ1 , . . . , µk ) and the penalized likelihood Qλ (µ1 , . . . , µk , τ1 , . . . , τk ) yield the same solution to (µ1 , . . . , µk ) . It suffices to establish the theory of Qλ∗ (µ1 , . . . , µk ) =

k ∑

L∗i (µi ) − λ

i=1



|µi − µj |

1≤i
under certain conditions on L∗i (µi ) . Definition 2.1. (i) A pairwise comparison procedure is said to be test consistent if as n → ∞ P(Conclude H0ij for all i, j with µi = µj and Conclude H1ij for all i, j with µi ̸ = µj ) → 1 . (ii) (µ ˆ 1, . . . , µ ˆ k ) is a strictly local maximum if Qλ∗ (µ ˆ 1, . . . , µ ˆ k ) > Qλ∗ (µ1 , . . . , µk ) for a neighborhood of (µ ˆ 1, . . . , µ ˆ k) . Then, we have the following theorem. Theorem 2.2. Let (µ01 , µ02 , . . . , µ0k ) be the true values of the parameters (µ1 , µ2 , . . . , µk ) and c be the number of clusters, where c is finite and unknown. Here, a cluster refers to a group of µ0i sharing a common value. Suppose that

µ01 = · · · = µ0b1 < µ0b1 +1 = · · · = µ0b2 < · · · < µ0bc −1 +1 = · · · = µ0k and mini̸=j,µ0 ̸=µ0 |µ0i − µ0j | > Kn−η for some constants K > 0 and 0 ≤ η < 1/2 . Let m1 = b1 , m2 = b2 − b1 , . . . , mc = i

j

k − bc −1 . For s = 1, 2, . . . , c , i = bs−1 + 1, . . . , bs , let

Zi = n−1/2 ∇µi L∗i (µ0bs ) and Ji = n−1 E {∇µ2 i µi L∗i (µ0bs )}, Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

4

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

where ∇ and ∇ 2 refer to the first- and second-order derivatives. Consider the following three regularity conditions: (I) all Ji are bounded above and below by some positive constants, (II) there exists a sequence of constants us (ms ) such that

⏐ ⏐ ⏐ ⏐Zi − 1 max i=bs−1 +1,...,bs ⏐ ms ⏐

⏐ ⏐ ⏐ Zj ⏐⏐ = Op (us (ms )) j=bs−1 +1 ⏐ bs ∑

and

λn−1/2 min1≤s≤c ms max1≤s≤c us (ms )

→ ∞,

and

λk2 n−1+η min1≤s≤c ms

→0

(2.4)

as n → ∞ , where k, λ, m1 , . . . , mc are allowed to depend on n and (III) Z1 , . . . , Zm are Op (1) and J1 , . . . , Jm are O(1). If the regularity conditions (I)–(III) hold, a strictly local maximum of (2.3) exists and is unique. Moreover, such a strictly local maximum satisfies test consistency. In the homogeneous Gaussian cases, the condition (2.4) can be replaced by a weaker condition,

λn−1/2 min {ms /us (ms )} → ∞ , 1≤s≤c

and

λkn−1+η → 0

(2.5)

as n → ∞ , where k, λ, m1 , . . . , mc are allowed to depend on n. In order to detect the between-group differences, conditions (I) and (III) require that the within-group variation should be bounded. Condition (II) requires the existence of λ satisfying (2.4) in general cases or (2.5) in the homogeneous Gaussian cases. The validity of these conditions is discussed below. To guarantee the existence of λ satisfying (2.5), the following condition is needed in the homogeneous Gaussian case, max {kus (ms )/ms } = o(n1/2−η ) . s

(2.6)

The quantities us (ms ) can be obtained using extreme value theory. For example, in the Gaussian cases yit ∼ N(µi , 1) , we have Zi ∼ N(0, 1) . It is well-known that maxi=1,2,...,m |Zi − Z¯ | − u(m)

v (m) converges in distribution to the extreme value distribution (Fisher and Tippett, 1928; Gnedenko, 1943). Following McCormick (1980), we take

v (m) = (2 log m)1/2 and u(m) = (2 log m)1/2 − (2 log m)−1/2 log(4π log m). Suppose that k = exp(nν /2) for some 0 < ν < 1 − 2η and m1 √ = O(k) , m2 = O(k) , . . . , mc = O(k) . In the Gaussian cases, from Chow and Teugels (1978), we have max{u(k), v (k)} ≈ 2 log k = nν/2 , so that the condition (2.6) holds. Thus, when n is sufficiently large, Theorem 2.2 guarantees that there exists a value of λ for the test consistency. For example, ∗ when k = exp(nν /2) , (2.5) can be satisfied by λ = Cn(ν +1)/2 /k for some C > 0 and ν < ν ∗ < 1 − 2η. When k = nν for ∗ a constant ν > 0 , condition (2.6) holds . Condition (2.5) can be satisfied by λ = n1/2 log n/k or λ = nν +1/2 /k for some ∗ 0 < ν < 1/2 − η . For non-Gaussian cases, if there exist rth moments E |Zi |r < ∞ , i = 1, 2, . . . , n for some r > 1, the following inequality holds

( E

max

i=bs−1 +1,...,bs

⎛ ⎞ )r bs ∑ |Zi | ≤ E ⎝ |Zi |r ⎠ = O(ms ) . j=bs−1 +1 1/r ms

If so, we let us (ms ) = . This guarantees test consistency for the Bernoulli distribution and for the multinomial distribution because r can be arbitrarily large. 2.3. Choice of the tuning parameter λ The Bayesian information criterion (BIC) is commonly used to choose λ (Wang et al., 2007; Konishi and Kitagawa, 2008) in the context of regression analysis. Here, we propose a definition of the BIC for the pairwise comparison problem as follows: Given λ , let c ∗ be the number of the clusters in the estimate corresponding to the tuning parameter λ. For s = 1, 2, . . . , c ∗ , suppose that µ ˆ b∗ +1 = · · · = µ ˆ b∗s and denote by µ ˆ s the common value. For convenience, assume that s−1

Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

5



the cluster labels are assigned such that µ ˆ1 < µ ˆ2 < ··· < µ ˆ c . To avoid the bias in µ ˆ s owing to the penalty terms, we can re-estimate ∗

bs ∑

µ ˆ sλ=0 = argmax µ

L∗i (µ) .

i=b∗ +1 s−1

Define ∗



BIC (λ) = −2





bs c ∑ ∑

ˆ sλ=0 ) + log c L∗i (µ

(

c )∑ ∗

s=1 i=b∗ +1 s−1

log ⎣

s=1



bs ∑

⎤ ni ⎦ .

i=b∗ s−1

The tuning parameter λ is chosen by minimizing BIC (λ). Note that the dependence of BIC on λ comes from c ∗ and b∗1 , b∗2 , . . . , b∗c ∗ . For balanced homogeneous Gaussian cases, we use the following BIC ∗

BIC (λ) = kn log σˆ 2 + log c ∗

(

c )∑

log[n(b∗s − b∗s−1 )] ,

s=1

where ∗



σˆ = 2

bs n ∑ ∑

c 1 ∑

kn

(yit − µ ˆ sλ=0 )2 .

s=1 i=b∗ +1 t =1 s−1

For balanced heterogeneous Gaussian cases, we use the following BIC, ∗

BIC (λ) = n



bs c ∑ ∑



log σˆ i2,s + log c ∗

s=1 i=b∗ +1 s−1

(

c )∑

log[n(b∗s − b∗s−1 )] ,

s=1

where

σˆ i2,s =

n 1∑ (yit − µ ˆ sλ=0 )2 . n t =1

∑c ∗

Note that the term s=1 log[n(b∗s − b∗s−1 )] can be derived from the Laplace approximation in Konishi and Kitagawa (2008). The multiplier log c ∗ is included in the penalty terms to improve the performance of BIC in the large k cases. Similar multipliers are proposed in the context of regression analysis, see e.g., Fan and Tang (2013), Leng and Tang (2012), Wang et al. (2009), and Wang and Leng (2007). It is important to note that our penalized likelihood method is not in conflict with other error control methods. For example, as an alternative to BIC, one can choose λ to control the FWER at α = 0.05. However, this approach does not achieve the test consistency that we discussed in this paper. Under the circumstance that all nulls are true, FWER controls the probability of concluding all nulls are true at the pre-specified level α even though the group sizes increase. On the contrary, test consistency implies that the probability that all decisions are correct (not reject H0ij if H0ij is true and reject H0ij if H0ij is false for all 1 ≤ i < j ≤ k) goes to one as n → ∞. Therefore, test consistency is stronger than FWER control. The details about asymptotic results for controlling FWER are beyond the scope of our paper. 3. Numerical study We conduct a numerical study to illustrate the performance of the proposed method. Four simulation settings are considered, namely, homogeneous and heterogeneous Gaussian cases, and balanced and unbalanced Bernoulli cases. The tuning parameter is selected by minimizing the BIC described in Section 2.3. The performance of the proposed method is assessed using two measures. The first one is the out-of-sample prediction error computed using both test and training datasets, where the test dataset is an independent copy of the training dataset. Such a performance measure has been widely used in the literature, see Efron (2004), Häaggström and de Luna (2010), and the references therein. Second, to describe the new concept of ‘‘test consistency’’ empirically, we introduce the notation Fℓ , the cumulative proportion of groups covered by the first ℓ largest clusters for ℓ = 1, 2, . . . , k. Ideally, the cumulative proportion of groups covered by the first c largest clusters should be close to one. In the following simulation studies, the true clusters are equal in sizes. In such a situation, the cumulative proportions of groups covered by the first c clusters should be close to (1, 2, . . . , c)/c. In all simulation examples, the experiment is repeated for 100 times and the expected proportions (1, 2, . . . , c)/c and the estimated proportions F1 , F2 , . . . , Fc are plotted in each figure. Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

6

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

Fig. 1. The performance of the proposed method for pairwise comparison problems: (a) c = 1 (b) c = 4.

3.1. Gaussian cases For i = 1, 2, . . . , k , t = 1, 2, . . . , ni , denote by ytest ,it and yit the observations in the test dataset and the training dataset respectively. For the homogeneous Gaussian model, consider the cases with n = 50, k = 200 and c = 1 or 4. When c = 1, the data are generated from a Gaussian distribution with mean 1 and variance 0.25. This setting corresponds to the global null, µ1 = · · · = µk . When c = 4, each cluster contains 50 groups and the observations are drawn from Gaussian distributions with variance 0.25 and means 1, 2, 3 and 4 respectively. Consider L∗i (µi ) = −

ni ∑

(yit − µi )2 /2

t =1

and obtain c , bs , and ˆ µsλ=0 (s = 1, 2, . . . , c ∗ ) as described in Section 2.3. For s = 1, 2, . . . , c ∗ , i = b∗s−1 + 1, . . . , b∗s , (I) s µλ=0 . Let c and bs (s = 1, 2, . . . , c) be the true values (unknown in practice but known in the simulation) define µ ˆi = ˆ and ∗



µ ˜ sλ=0 = argmax µ

bs ∑

L∗i (µ) .

(3.7)

i=bs−1 +1 (II)

For s = 1, 2, . . . , c , i = bs−1 + 1, . . . , bs , define ˆ µi and the PRMSE ratio as

∑k ∑ni

PRMSE = ∑ki=1 ∑nt =1 i i=1

t =1

=˜ µsλ=0 . Define the prediction error as

∑k ∑ni i=1

t =1

(I)

(ytest ,it − ˆ µi )2

(I)

(ytest ,it − ˆ µi )2 (II) 2 i )

(ytest ,it − ˆ µ

.

The PRMSE ratio is reported in each figure. The closer the PRMSE ratio is to one, the better is the proposed method. Fig. 1(a) (c = 1) and (b) (c = 4) show boxplots of the PRMSE ratios and the three quantile lines for the cumulative proportion covered by the first four clusters from 100 simulations. In Fig. 1(b), + denotes the reference points (1, 0.25), (2, 0.5), (3, 0.75) and (4, 1), corresponding to their perfect clustering by the oracle estimator. When c = 1, the PRMSE ratios are very close to one. In the worst case, the ratio is still less than 1.005. The cumulative proportion of groups covered by the largest cluster is always one. For c = 4, the PRMSE ratios are again very close to one. The three quantile lines are identical and overlap with the reference line . Thus, the clustering results are reliable. Here, the four clusters are clearly separated so that the proposed method shows the oracle clustering result. However, as we shall see later, when the clusters are less clearly separated, the proposed method has errors in clustering. For the heterogeneous Gaussian model, we compare cases with small and large heterogeneities. For the small heterogeneity case, σi is randomly generated from the uniform distribution on [0.6, 0.65]. For the large heterogeneity case, Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

7

the uniform distribution on [0.6, 0.9] is considered. This simulation shows the effects of heterogeneity on the clustering result. Consider

{ Li (µi ) = −ni log ∗

ni ∑

}

(yit − µi ) /ni 2

/2

t =1

and obtain c ∗ , b∗s , ˆ σi2,s , and ˆ µsλ=0 (s = 1, 2, . . . , c ∗ ) as described in Section 2.3. For s = 1, 2, . . . , c ∗ , i = b∗s−1 + 1, . . . , b∗s , (I)

define µ ˆi

= ˆ µsλ=0 and σˆ i2 = ˆ σi2,s . Let c and bs (s = 1, 2, . . . , c) be the true values (unknown in practice but known in the simulation). Define µ ˜ sλ=0 as in (3.7). For s = 1, 2, . . . , c , i = bs−1 + 1, . . . , bs , define µ ˆ (II) = ˜ µsλ=0 and i ∑k ∑ni ∑ni (I) 2 s 2 2 2 µi ) /ˆ σi and the PRMSE ratio as ˜ σi = t =1 (yit − ˜ µλ=0 ) /ni . Define the prediction error as i=1 t =1 (ytest ,it − ˆ ∑k ∑ni (I) 2 2 (ytest ,it − ˆ µi ) /ˆ σi PRMSE = ∑ki=1 ∑nt =1 . (II) 2 i µi ) /˜ σi2 t =1 (ytest ,it − ˜ i=1 In Fig. 2(a), the PRMSE ratios are close to one and the cumulative proportion of groups covered by the largest cluster is as high as that of the homogeneous variance case. In Fig. 2(b), the median cumulative proportions of groups covered by the first two clusters follow the reference points marked with a ‘‘+’’, whereas the third and fourth clusters do not. Thus, the proposed method tends to yield many small clusters consisting of few groups. Fig. 2(d) shows clearly that large heterogeneity deteriorates the clustering performance of the proposed method. Although the PRMSE ratios are still close to one, we find a large deviation of the quantile lines from the reference points marked with a ‘‘+’’ in Fig. 2(d), which implies that the clustering results are not as reliable as those in the case of small heterogeneity. Thus, we may conclude that large heterogeneity produces unnecessarily many clusters consisting of few variables, and results in larger variability when estimating the mean parameters. In the heterogeneous Gaussian model, a much larger n would be required to achieve test consistency. Next, consider an example with smaller sample size with n = 14 and k = 33. Consider two cases c = 1 and c = 3. When c = 1, the observations are generated from Gaussian distributions with mean 0.14 and variances uniformly distributed between 0.03 and 0.264. When c = 3, each cluster contains 11 groups and the data are generated from Gaussian distributions with means 0.04, 0.15, and 0.12 corresponding to the minimum, median, and maximum values respectively. In Fig. 3(a), the PRMSE ratios are close to one and the median cumulative proportion of groups covered by the largest cluster is around 0.8, which implies that clustering results are quite reliable. In Fig. 3(b), the median cumulative proportion of groups covered by the first three clusters is below the reference points marked with a ‘‘+’’. Thus, the proposed method tends to yield many small clusters consisting of few groups under this setting. The performances of the proposed method under the heterogeneity cases and small n cases are an interesting future research direction. 3.2. Bernoulli cases For i = 1, 2, . . . , k , t = 1, 2, . . . , ni , denote by ytest ,it and yit the observations in the test dataset and the training dataset respectively. Suppose that the observations in the ith group follow Bernoulli distribution with P(yit = 1) = 1 − P(yit = 0) = pi . Consider L∗i (pi ) =

ni ∑ {yi log pi + (1 − yi ) log(1 − pi )} t =1

and obtain c , b∗s , and ˆ psλ=0 (s = 1, 2, . . . , c ∗ ) as described in Section 2.3. For s = 1, 2, . . . , c ∗ , i = b∗s−1 + 1, . . . , b∗s , (I) s define pˆ i = ˆ pλ=0 . Let c and bs (s = 1, 2, . . . , c) be the true values (unknown in practice but known in the simulation) and ∗

bs ∑

p˜ sλ=0 = argmax p

L∗i (p) .

i=bs−1 +1 (II)

For s = 1, 2, . . . , c , i = bs−1 + 1, . . . , bs , define ˆ pi (I) (I) (I) niˆ pi )2 /niˆ pi (1 − ˆ pi ) and the PRMSE ratio as

∑k=1

(I)

(I)

= ˜ psλ=0 . Define the prediction error as

∑k

i=1 (ytest ,i



(I)

(ytest ,i − niˆ pi )2 /niˆ pi (1 − ˆ pi ) , PRMSE = ∑ki==11 (II) 2 (II) ˆ ˆ ˆ(II) i=1 (ytest ,i − ni pi ) /ni pi (1 − pi )

∑n

where ytest ,i = t =1 ytest ,it . For balanced cases, we set n = 50 (small group size) and n = 100 (large group size) and k = 200 , and c = 1 or 3 . When c = 1, the data are generated from a Bernoulli distribution with probability 0.3. When c = 3, the three clusters have sizes of (67, 67, 66) and the data are drawn from a Bernoulli distribution with probabilities 0.2, 0.5, and 0.8 respectively. Fig. 4(a) shows the result for n = 100, illustrating that the proposed method is reliable in terms of both Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

8

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

Fig. 2. The performance of the proposed method for pairwise comparison problems: Figures (a) and (b) in the upper panel are for the small heterogeneity cases with c = 1 and c = 4 respectively. Figures (c) and (d) in the lower panel are for the large heterogeneity case with c = 1 and c = 4 respectively.

prediction and clustering of the groups. In comparison, small n (n = 50) deteriorates the performance of the proposed method. In Fig. 4(c), the median of the PRMSE ratios is 1.48 and the median cumulative proportion of groups covered by the largest cluster is 0.91, increasing to 0.97 for the largest three clusters. This implies that, as in the Gaussian case with heterogeneous variances, the proposed method with small n produces unnecessarily many clusters. In Fig. 4(d), the median quantile line is still close to the reference points marked with a ‘‘+’’. Therefore, for this Bernoulli case, small n affects the prediction performance more strongly. For unbalanced cases, we set k = 200, and c = 1 or 3. The sample sizes of the k groups are generated from a Poisson distribution with mean 100 (large group size) and 50 (small group size). Fig. 5(a) and (b) show that the proposed method is reliable in the large group size cases in terms of both prediction and clustering of the groups. However, Fig. 5(c) and (d) show that reducing the group size deteriorates the performance of the proposed method. In general, the PRMSE ratios are larger and the median quantile line lies below the reference line. Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

9

Fig. 3. The performance of the proposed method for pairwise comparison problems under the real dataset setting (heterogeneous Gaussian case): (a) c = 1 (b) c = 3.

Next, consider a small sample example. Suppose that k = 25 and the sample sizes of these k groups are generated from a Poisson distribution with mean 110. Consider two cases c = 1 and c = 3. When c = 1, the data are generated from a Bernoulli distribution with pi = 0.52. When c = 3, the three clusters have size (8, 8, 9) and the data are drawn from a Bernoulli distribution with pi = 0.1, 0.5, and 0.8 respectively. For both c = 1 and c = 3, the number of Bernoulli data for each variable is generated from a Poisson distribution with mean 110. Fig. 6(a) shows that the proposed method performs well in terms of clustering of the groups. In Fig. 6(b), the median cumulative proportion of the groups covered by the first three largest clusters is identical to the reference points marked with a ‘‘+’’. This implies that under this setting the proposed method produces reliable clustering results with good prediction performance. 4. Real data analysis In this section we examine two real datasets. 4.1. Lactase persistence phenotype data over world Lactase is necessary for milk consumption and an estimated 65% of human adults reduce the amount of lactase produced in the body after weaning. Whether lactase is produced throughout adult life, which called lactase persistence (LP), is a trait determined by genetic factors. Itan et al. (2010) provide the LP phenotype frequencies in Europeans, Africans, and Southern and Middle-Eastern Asians. They gathered the number of phenotype frequencies for 25(= k) countries from a large set of literature. The minimum and maximum numbers for the Bernoulli observations (ni ) are 17 and 638, respectively. The proposed algorithm yields 13 clusters with the largest cluster including nine countries. Fig. 7(a) shows a dot plot illustrating the membership of the 25 countries. Each continent is denoted by a different color. To illustrate problems with widely used traditional pairwise comparison methods, we consider the Bonferroni correction. For illustration purposes, we plot the three largest clusters. The black bars at the left of the y-axis in Fig. 7(a) depict the clusters based on the Bonferroni correction. This shows that the classical pairwise comparison methods are not coherent because some countries belong to different clusters simultaneously. Therefore, we present the results of the proposed method only. The proposed method (Fig. 7(a)) suggests that North India and South India belong to different clusters. The genetic differences between ancestral north and south Indian populations are also reported in the existing literature (e.g., Tamang and Thangaraj, 2012; Lanka et al., 2014; ArunKumar et al., 2012). 4.2. Analysis of GDP data for 33 cities in China Nominal gross domestic product (GDP) data are obtained for the period 2001 to 2015 for 33 provincial capital cities, special administrative regions, and directly-controlled municipalities in China. The names of the cities are shown in Fig. 7(b). Let St denote the GDP in the tth year. We first transform the GDP to the annual growth rate log(St /St −1 ) in order Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

10

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

Fig. 4. The performance of the proposed method for pairwise comparison problems: Figures (a) and (b) in the upper panel is the large sample cases (n = 100). Figures (c) and (d) in the lower panel is the small sample cases (n = 50).

to remove the trend. After the trend correction, we are left with k = 33 and n = 14. The model presented in Section 2 assumes the independence of the observations. To validate such assumptions, the Box–Pierce test (Box and Pierce, 1970) is applied to test the serial independence of the transformed data. Test results show no statistical significance for all cities at the 0.05 level after correcting for the multiplicity of the tests. The classical Bonferroni correction gives two clusters: Hong Kong and all other cities in China. The penalized likelihood method based on a heterogeneous Gaussian model yields six clusters, with Hong Kong forming a single cluster. The largest cluster consists of 22 cities, with an estimated mean of 0.141. The second largest cluster consists of six cities, with an estimated mean of 0.138. Fig. 7(b) shows a dot plot of clusters for the 33 cities. The clustering is helpful for policy makers and economists when evaluating the economic performances of cities. In a country with a heterogeneous degree of social-economic development, it is natural to compare a city to cities belonging to the same or neighboring clusters. Since Hong Kong is a mature and developed economy, the growth rate is low. In contrast, Guiyang forms another single cluster, but is less developed and ranks 23rd in terms of GDP among the 33 cities in this Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

11

Fig. 5. The performance of the proposed method for pairwise comparison problems: Figures (a) and (b) in the upper panel is the large sample cases (Poisson mean = 100). Figures (c) and (d) in the lower panel is the small sample cases (Poisson mean = 50).

study. Major Chinese cities such Beijing, Shanghai, and Guangzhou are included in the small clusters (between Shanghai and Changchun in Fig. 7(b)). Interestingly, these clusters include the provincial capitals of three northeast provinces (Haerbin of Heilongjiang province, Shenyang of Liaoning province, and Changchun of Jilin province). Haerbin has a growth rate as low as that of Shanghai, even though Shanghai has a higher GDP. Compared with the classical pairwise comparison methods, our method yields seven clusters for the remaining 32 cities, all of which are meaningful. 5. Conclusion In this paper, we show that pairwise comparison problems can be reformulated as penalized likelihood estimation problems. The penalized likelihood method not only rules out the possibility of drawing contradictory conclusions, but also provides a condition for test consistency. Furthermore, the existence of a range of the tuning parameter λ allowing test consistency can be established. Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

12

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

Fig. 6. The performance of the proposed method for pairwise comparison problems under the real dataset setting (unbalanced Bernoulli case): (a) c = 1 (b) c = 3.

Fig. 7. (a) Analysis of lactase persistence phenotype data: dot plot of probability estimates for the phenotype frequencies (b) Analysis of GDP data: dot plot of mean estimates for the annual growth rate data. The black bars at the left of y-axis in each figure depict the clustering result based on traditional pairwise comparison methods (Bonferroni correction or Tukey’s procedure).

In the literature on variable selection, information criteria such as the BIC are commonly used to identify λ. Developing a rigorous theory for the choice of λ based on the BIC for pairwise comparison problems is left as future research. It is also interesting to study how to control FWER and FDR via our proposed method under finite sample settings. Acknowledgments The first two authors contributed equally to this work. Chi Tim NG’s research is supported by a research program of Rural Development Administration, Republic of Korea (project no. PJ01283009).Woojoo Lee was supported by a Grant from the Next-Generation BioGreen 21 program, Korea (Project No. PJ01337701), Rural Development Administration, Republic of Korea. Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

13

Appendix A. Proofs For simplicity, we present only the proofs for the balanced cases where n1 = n2 = · · · = nk = n . A.1. Proof of Theorem 2.2 Propositions A.1 and A.2 are needed to prove Theorem 2.2. Proposition A.1. For any given λ , the strictly local maximum of (2.1) is unique. Proof of Proposition A.1. Suppose that (2.1) admits a strictly local maximum (µ ˆ 1, . . . , µ ˆ k ) . By contradiction assume that there is another local maximum (µ ˆ ′1 , . . . , µ ˆ ′k ) (not necessarily a local maximum in strict sense). Let Qλ∗ (µ1 , . . . , µk ) be ˆ = (µ ˆ ′ = (µ the penalized likelihood function in (2.1). Consider the straight line between µ ˆ 1, . . . , µ ˆ k ) and µ ˆ ′1 , . . . , µ ˆ ′k ) . ′ ∗ ˆ + (1 − w)µ ˆ ) for 0 ≤ w ≤ 1 . The continuity of f (w) in the close interval [0, 1] suggests the Let f (w ) = Qλ (w µ existence of a minimum, f (1 − δ ) for some 0 ≤ δ ≤ 1. Clearly, f (1) cannot be such a minimum because it is a (strictly) local maximum. Therefore, δ > 0 . If f (0) ≥ f (1) , the convexity (not necessarily strict convexity) of f (w ) yields ˆ ) > f (µ ˆ ′) . f (1 − δ ) ≥ δ f (0) + (1 − δ )f (1) ≥ f (1) which is a contradiction. Therefore, f (0) < f (1) . This means that f (µ ′ ′ ′ ˆ ) < f (µ ˆ ) . This is Note that if (µ ˆ 1, . . . , µ ˆ k ) is strictly local maximum, the same arguments above apply and we have f (µ a contradiction. Therefore, the strictly local maximum is unique. □ Proposition A.2. The sufficient and necessary conditions for the existence of strictly local maximum point satisfying test consistency are: (i) The following equations admit a solution to (µ ˆ1 ,... ,µ ˆ c) ,

∇ L∗1 (µ ˆ 1 ) + · · · + ∇ L∗b1 (µ ˆ 1) =



∇ Lb 1 + 1 ( µ ˆ ) + · · · + ∇ Lb2 (µ ˆ )=



λm1 ms sgn(µ ˆ1 −µ ˆ s)

s̸ =1 2





2

λm2 ms sgn(µ ˆ2 −µ ˆ s)

s̸ =2

.. .. .=∑ . ∇ L∗bc −1 +1 (µ ˆ c ) + · · · + ∇ L∗k (µ ˆ c) = λmc ms sgn(µ ˆc −µ ˆ s) .

(A.8)

s̸ =c

(ii) For any s = 1, 2, . . . , c with ms > 1 , ℓs = 1, 2, . . . , ms − 1 ,

⎞ ⎛ ⏐ ⏐ ℓs ms ∑ ⏐ ⏐1 1 1 ∑ ∗ s ∗ s ⎠⏐ ⏐ ⎝ ∇ L(j) (µ ˆ )− ∇ L(j) (µ ˆ ) ⏐ < λ. ⏐ m s ℓs m s − ℓs j=1

(A.9)

j=ℓs +1

Here, ∇ L∗(j) is the jth order statistic of {∇ L∗b

s−1 +1

, . . . , ∇ L∗bs −1 } .

Condition (ii) is equivalent to condition (ii′ ) below. (ii′ ) For any s = 1, 2, . . . , c with ms > 1 , ℓ = 1, 2, . . . , ms − 1 , and ℓs chosen values {is1 , is2 , . . . , isℓs } from Ms = {bs−1 + 1, . . . , bs } , we have

⏐ ⏐ ⏐ ∗ s ⏐ s ∗ s ⏐ ⏐∇ L (µˆ ) + · · · + ∇ L∗ (µˆ s ) − ℓs [∇ L∗ ˆ ) + · · · + ∇ Lb s ( µ ˆ )]⏐ < λℓs (ms − ℓs ) . isℓs bs−1 +1 (µ ⏐ is1 ms

(A.10)

Proof of Proposition A.2. First, some notations are introduced. Consider the following re-parameterization. For s = 1, 2, . . . , c , ξbs = µbs and if ms > 1 ,

ξbs−1 +1 = µbs−1 +1 − µbs , . . . , ξbs −1 = µbs −1 − µbs . For convenience, set b0 = 1 . Let Qλ∗ (µ1 , µ2 , . . . , µk ) be the penalized likelihood function. Define (µ ˆ 1, . . . , µ ˆ c ) as the solution to Eq. (A.8) if exists. To complete the proof, we rely on the fact that the local maximum has non-positive directional derivatives along all directions. Let a be a unit vector such that abs = 0 for all s = 1, 2, . . . , c . The directional derivative of Qλ∗ (µ1 , µ2 , . . . , µk ) along the direction a at (µ ˆ 1, . . . , µ ˆ k ) is defined as follows, Qλ∗ (ξˆ1 + a1 δ, ξˆ2 + a2 δ, . . . , ξˆk + ak δ ) − Qλ∗ (ξˆ1 , ξˆ2 , . . . , ξˆk )

. δ Let As = abs−1 +1 +· · ·+ abs −1 , Ms = {bs−1 + 1, . . . , bs } , and Ms− = {bs−1 + 1, . . . , bs − 1} . After some algebraic manipulations, Da = lim

δ→0+

we have

Da =

∑∑ s



i∈Ms

ai ∇ L∗i (µ ˆ s) − λ

⎧ ∑⎨∑ s



i∈Ms−

| ai | +



⎫ ⎬ | ai − aj | ⎭ −

i
Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

14

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

−λ



(ms Ar − mr As )sgn(µ ˆr −µ ˆ s) .

r >s

Moreover, 2





r >s

r ̸ =s

(ms Ar − mr As )sgn(µ ˆr −µ ˆ s) =

(ms Ar − mr As )sgn(µ ˆr −µ ˆ s)



=

ms Ar sgn(µ ˆr −µ ˆ s) +

r ̸ =s

=2



mr As sgn(µ ˆs −µ ˆ r)

r ̸ =s

∑ r

Ar



ms sgn(µ ˆ −µ ˆ s) . r

s ̸ =r

If (i) holds, Da can further be written as Da =

∑∑ −

s



ˆ s) − λ ai ∇ L∗i (µ

⎧ ∑⎨∑ s

i∈Ms

∑ As ∑ s

ms



|ai | +

i∈Ms−



⎫ ⎬ | ai − aj | ⎭ −

(A.11)

i
∇ L∗i (µ ˆ s) .

(A.12)

i∈Ms

First, the equivalence between (ii) and (ii′ ) is established. Let Js = {is1 , is2 , . . . , isℓs } . After some algebraic manipulations, inequality (A.10) can be rewritten as

⎛ ⎞ ⏐ ⏐ ∑ ⏐1 ⏐ 1 ∑ 1 ∗ s ∗ s ⎠⏐ ⏐ ⎝ ∇ Li ( µ ˆ )− ∇ Li ( µ ˆ ) ⏐ < λ. ⏐ ms ℓs ms − ℓs i∈Js

(A.13)

i∈Ms \Js

Then, (ii′ ) H⇒ (ii) is trivial. To prove (ii) H⇒ (ii′ ), consider Js,ℓs ,max and Js,ℓs ,min that contains i corresponding to either the ˆ s ) respectively. Clearly, inequality (A.13) holds for Js = Js,ℓs ,max from (ii). Using the greatest and smallest ℓs values of ∇ L∗i (µ asymmetry



∇ L∗i (µ ˆ s) −

i∈Js

ℓs ∑ ms

∇ L∗i (µ ˆ s) = −



∇ L∗i (µ ˆ s) +

i∈ / Js

i∈Ms

m s − ℓs ∑ ms

∇ L∗i (µ ˆ s) ,

(A.14)

i∈Ms

(A.10) for Js,ℓs ,min is equivalent to (A.10) for Js,ms −ℓs ,max , which is also guaranteed by (ii). Next, we show that test consistency holds if and only if (i) and (ii′ ) are satisfied. The necessity of (i) is trivial. For the necessity of (ii′ ), if the local maximum exists, Da < 0 for any direction √ a . Given s , ℓ , and Js , condition (ii′ ) can be obtained ℓs for i = is1 , is2 , . . . , isℓs and ai = 0 otherwise. If by appropriate choice of a for (A.12) as follows. If b√ ∈ / J , set a = 1 / s s i bs ∈ Js , set ai = 0 for i = is1 , is2 , . . . , isℓs and ai = 1/ ms − ℓs otherwise. Condition (ii′ ) then follows from the asymmetry (A.14). ′ ∗ assume that both conditions ∑ ∑ (i) and2 (ii ) hold. Consider the Lagrange function Da of Da under constraint∗ ∑ For ∑ sufficiency, 2 ∗ s i∈Ms− ai − 1) . Here, ν is the Lagrange multiplier. Both critical points of Da s i∈Ms− ai = 1 , that is, Da = Da + ν ( and non-differentiable points of Da have to be studied. It suffices to show that Da < 0 at 2 1. the points of intersection of the spherical hyper-surface s i∈Ms− ai and any other k − c − 1 linearly independent constraints chosen from ai = aj and ai = 0 for i ̸ = j and i, j ∈ / {b1 , b2 , . . . , bc } , 2. the critical points of D∗a , and 3. the critical points of D∗a in the restricted space formed by any < k − c − 1 constraints chosen from ai = aj and ai = 0 for i ̸ = j and i, j ∈ / {b1 , b2 , . . . , bc } .

∑ ∑

Points of intersection: Suppose that k − c − 1 constraints are chosen and the number non-zero ai is ℓ . Then, the number of ‘ai = 0’-type and ‘ai = aj ’-type constraints are√k − c − ℓ and ℓ − √1 respectively. Therefore, all non-zero ai values must be the same and the common value is ai = 1/ ℓ (or ai = −1/ ℓ). Suppose that indexes of the non-zero groups are J1 ∪ J2 ∪ . . . ∪ Jc , where Js = {is1 , . . . , isℓs } ⊂ Ms , s = 1, 2, . . . , c . Set ℓ1 , ℓ2 , . . . , ℓc in condition (ii′ ) as the cardinalities of J1 , J2 , . . . , Jc respectively. Then, the corresponding Da value is

⎧ ⎨ 1 ∑ ∑

Da = ± √



s



∇ L∗i (µ ˆ s) −

i∈Js

ℓs ∑ ms

⎫ ⎬

1

∇ L∗i (µ ˆ s) − √ λ ⎭ ℓ i∈Ms



ℓs (ms − ℓs ).

s

From condition (ii′ ), Da < 0 . Critical points: For i ∈ Ms− , s = 1, 2, . . . , c , define qi as the order rank of ai among {ai : i ∈ Ms− } and Ei = ∇ L∗i (µ ˆ s) −

1 ∑ ms

∇ L∗j (µ ˆ s ) − λ[sgn(ai ) + 2qi − ms ] .

j∈Ms

Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

15

√∑ ∑ 2 − It can be checked that the critical points have ai = ±Ei / s i∈Ms− Ei for i ∈ Ms . Here, the ± signs must be − − either all√positive or all negative for all i ∈ M1 ∪ · · · ∪ Mc . Next, we show that the ± sign must be negative. If so, ∑ ∑ 2 Da = − s i∈Ms− Ei < 0 . Arbitrarily take s from {1, 2, . . . , c } . Suppose without loss of generality that abs−1 +1 < abs−1 +2 < · · · < abs −1 . If sgn(abs −1 ) = +1 , condition (ii′ ) for ℓs = 1 suggests that Ei < 0 . Then, the ± sign must be negative. If sgn(abs −1 ) = −1 , then abs−1 +1 = −1 . Condition (ii′ ) for ℓs = 1 implies that Ebs−1 +1 > 0 . Since sgn(abs−1 +1 ) = −1 , the ± sign must be negative. Critical points in restricted spaces: Here, we only deal with the case of one-constraint au = av , u ∈ Mα− , v ∈ Mβ− . Other cases can be handled in the same manner. For all i ∈ Ms− , s = 1, 2, . . . , c , define 1 ∑ ˆ s) − ˆ s ) − λ[sgn(ai ) + #{j ∈ Ms− : aj < ai } − #{j ∈ Ms− : aj > ai }] . Ei = ∇ L∗i (µ ∇ L∗j (µ ms

j∈Ms−

The critical points of D∗a satisfy au = av = ± √

(Eu + Ev )/2 (Eu + Ev /2 + )2



i∈(M1− ∪···∪Mc− )\{u,v}

Ei2

and for j ∈ (M1− ∪ · · · ∪ Mc− ) \ {u, v} , aj = ± √

Ej (Eu + Ev /2 + )2



i∈(M1− ∪···∪Mc− )\{u,v}

Ei2

.

Here, the ± signs are either all positive or all negative. It suffices to show that the ± must be negative. If so,

  2 Da = − √(Eu + Ev ) /2 +

∑ −

Ei2 < 0 . −

i∈(M1 ∪···∪Mc )\{u,v}

Arbitrarily take s from {1, 2, . . . , c } . Suppose without loss of generality that abs−1 +1 < abs−1 +2 < · · · < abs −1 . We need to deal with the possible situations with a tie at either abs−1 +1 or abs −1 . Consider the case sgn(abs −1 ) = +1 . Without tie at abs −1 , condition (ii′ ) for ℓs = 1 suggests that Ebs −2 < 0 . With tie at abs −1 , condition (ii′ ) for ℓs = 2 suggests that Ebs −2 + Ebs −1 < 0 . Therefore, the ± sign must be negative. Next, consider the case sgn(abs −1 ) = −1 . Note that sgn(ai ) = −1 for all i ∈ Ms− . If there is no tie in abs−1 +1 , from condition (ii′ ) for ℓs = 1 , Ebs−1 +1 = ∇ L∗bs−1 +1 (µ ˆ s) −

1 ∑ ms

∇ L∗j (µ ˆ s ) + λ(ms − 1) > 0 .

j∈Ms

If there is a tie, from condition (ii′ ) for ℓs = 2 , Ebs−1 +1 + Ebs−1 +2 = ∇ L∗bs−1 +1 (µ ˆ s ) + ∇ L∗bs−1 +2 (µ ˆ s) −

1 ∑ ms

∇ L∗j (µ ˆ s ) + 2λ(ms − 2) > 0 .

j ∈ Ms

Since sgn(as−1 + 1) = −1 , the ± sign must be negative. □ Proof of Theorem 2.2. To establish the conclusion on test consistency, Proposition A.2 is used. Moreover, Proposition A.1 guarantees that the strictly local maximum described in Proposition A.2 must be the unique if it exists. See Definition 2.1. Now, we verify condition (i) in Proposition A.2. Without loss of generality, suppose that the true parameters satisfy µ0b1 < µ0b2 < · · · < µ0bc . The crucial idea of the proof is that if (µ ˜ 1, . . . , µ ˜ c ) obtained from Eq. (A.15) satisfies 1 c µ ˜ < ··· < µ ˜ , it also solves Eq. (A.8), b1 ∑

∇ L∗i (µ ˜ 1 ) = −λm1 (m2 + · · · + mc )

i=1 b2 ∑

∇ L∗i (µ ˜ 2 ) = −λm2 (−m1 + m3 + · · · + mc )

i=b1 +1

.. . . = .. bc ∑

∇ L∗i (µ ˜ c ) = −λmc (−m1 − m2 − · · · − mc −1 ) .

(A.15)

i=bc −1 +1

Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

16

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

This can be seen from the fact that if µ ˜1 < ··· < µ ˜ c holds, m2 + · · · + mc = −sgn(µ ˜1 −µ ˜ 2 )m2 − · · · − sgn(µ ˜1 −µ ˜ c )mc ,

−m1 + m3 + · · · + mc = −sgn(µ ˜2 −µ ˜ 1 )m1 − sgn(µ ˜2 −µ ˜ 3 )m3 − · · · − sgn(µ ˜3 −µ ˜ c )mc , .. . . = .. −m1 − m2 − · · · − mc −1 = −sgn(µ ˜c −µ ˜ 1 )m1 − sgn(µ ˜c −µ ˜ 2 )m2 − · · · − sgn(µ ˜c −µ ˜ c −1 )mc −1 . It suffices to show that µ ˜1 < ··· < µ ˜ c with probability going to one. For such purpose, apply Taylor approximations to the left-hand-sides and subtract the sth equation from the (s − 1)th equation in (A.15). Under regularity condition (III), standard arguments based on Taylor expansion yield that

√ √ µ ˜s −µ ˜ s−1 = µ0bs − µ0bs−1 + Op (1/ ms n) + Op (1/ ms−1 n) + O(λkn−1 ) . Then, under the condition (2.4), with probability going to one we have µ ˜1 − µ ˜ 2 has the same sign as µ0b1 − µ0b2 , and so 1 2 c on. That means µ ˜ <µ ˜ < ··· < µ ˜ . Next, we check condition (ii′ ) of Proposition A.2. Let (µ ˆ 1, µ ˆ 2, . . . , µ ˆ c ) be the solution to (A.8). Note that

∇ L∗i (µ ˆ s) −

1 ms 1

= ∇ L∗i (µ0bs ) −

[∇ L∗bs−1 +1 (µ ˆ s ) + · · · + ∇ L∗bs (µ ˆ s )]

ms

[∇ L∗bs−1 +1 (µ0bs ) + · · · + ∇ L∗bs (µ0bs )] + Op (n(µ ˆ s − µ0bs )2 )

The first two terms on the right-hand-side are Op (n1/2 ) . In the Gaussian cases, the last term on the right-hand-side becomes zero. In the non-Gaussian cases,

√ |µ ˆ s − µ0bs | ≤ Op (1/ ms n) + O(λkn−1 ) . Then, condition (2.4) for general cases and condition (2.5) for Gaussian cases guarantee that max

i=bs−1 +1,...,bs

⏐ ⏐ ⏐ ∗ s ⏐ s ∗ s ⏐ ⏐∇ L (µˆ ) − 1 [∇ L∗ ˆ ) + · · · + ∇ Lb s ( µ ˆ )]⏐ < λms /2 . bs−1 +1 (µ ⏐ i ms

It follows obviously from triangle inequality that if ℓs = 1, 2, . . . , ms /2 , for any ℓs chosen ∇ L∗i values with i is chosen from bs−1 + 1, . . . , bs ,

⏐ ⏐ ⏐ ∗ s ⏐ s ∗ s ⏐ ⏐∇ L (µˆ ) + · · · + ∇ L∗ (µˆ s ) − ℓ [∇ L∗ ˆ ) + · · · + ∇ Lbs (µ ˆ )]⏐ < λℓs (ms − ℓs ) . isℓs bs−1 +1 (µ ⏐ is1 ms

For ℓs = ms /2 + 1, . . . , ms − 1 , similar results hold due to the asymmetry

∑{

∇ Li ( µ ˆ )− ∗

s

i∈Js

=−

∑{

1 ms

∇ L∗i (µ ˆ s) −

i∈ / Js

[∇ Lbs−1 +1 (µ ˆ ) + · · · + ∇ Lb s ( µ ˆ )] ∗

1 ms

s



s

}

} [∇ L∗bs−1 +1 (µ ˆ s ) + · · · + ∇ L∗bs (µ ˆ s )] ,

where Js is any index set chosen from bs−1 + 1, . . . , bs . test consistency follows immediately. Appendix B. Exterior point algorithm Arnold and Tibshirani (2014) present a generic approach to compute the so-called generalized Lasso problem including the fused Lasso problem as a special case, k ∑



i=1

1≤i
(yi − µi )2 + λ

|µi − µj | .

(B.16)

Their algorithm requires O(N 2 ) computational time in each iteration, where N = k(k − 1)/2 . As an alternative, Pan et al. (2013) developed the exterior point algorithm for the penalized likelihood (B.16), which can handle large k cases efficiently. In this paper we extend the algorithm of Pan et al. (2013) to include general problems of the form given in (2.1). The proposed algorithm is equivalent to that of Pan et al. (2013) when yit is Gaussian and n = 1. The main difficulty in optimizing (2.1) lies in the non-separable penalty term |µi − µj |, which may result in the coordinate-wise descent algorithm failing to provide a stationary point (Friedman et al., 2007). To overcome this, we Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

17

follow Pan et al. (2013) and reparameterize (2.1) by introducing new parameters Θij and adding quadratic penalties of µi − µj − Θij : ni k ∑ ∑



q(yit ; µi , τi ) − λ

|Θij | − ν

1≤i
i=1 t =1



(µi − µj − Θij )2 .

(B.17)

1≤i
Note that this new objective function contains a separable penalty term for Θij . When ν → ∞ , the solution to (B.17) approaches the solution to the original problem given in (2.1). The objective function (B.17) can be maximized using the following two-step iterative procedure. Step 1: Set ν = 1 . Choose the unpenalized estimation (λ = 0) as the initial guess of µ . Step 2: Fix µ , and update τ and Θij . Perform one-step Newton updates for each τi and set

Θij =

⎧ ⎪ ⎪ ⎪ ⎨0 ,

(old)

when |2ν (µi

µi(old) − µj(old) − λ/2ν , ⎪ ⎪ ⎪ ⎩µ(old) − µ(old) + λ/2ν , j

i

− µ(old) )| < λ , j

(old)

− µ(old) ) > λ, j

(old)

) < −λ . − µ(old) j

when 2ν (µi when 2ν (µi

These updating formulas are derived from solving (B.17) analytically with respect to Θij . Step 3: For Θ from Step 2, update µ by solving the following equation



(

⎣−ν 1k 1Tk + diag −

ni ∑

q′′ (yit ; µ

(old) ) i



+ kν

ni

∑ (old)

= ⎝−µi

⎦µ i=1,2,...,k

t =1





) (old) i



ni (old)

q′′ (yit ; µi

, τi(old) ) +

t =1



(old)

q′ (yit ; µi

, τi(old) ) + ν



.

Θij ⎠

j ̸ =i

t =1

i=1,2,...,k

Here, the differentiations are with respect to the parameters in µ . The notation diag(x) refers to the diagonal matrix with diagonal elements x . In the special cases of Gaussian distribution with equal variance, the equation becomes

⎛ ⎞ ni ∑ ∑ ] [ yit + ν Θij ⎠ −ν 1k 1Tk + diag (ni + kν)i=1,2,...,k µ = ⎝ j ̸ =i

t =1

. i=1,2,...,k

The computation can be simplified by using the fact that (a1k 1Tk + D)−1 = D−1 − (a−1 + 1Tk D−1 1k )−1 D−1 1k 1Tk D−1 . For homogeneous Gaussian cases, we can use (a1k 1Tk + bIk )−1 = b−1 Ik − a(b2 + kab)−1 1k 1Tk . Note that the matrix



(

⎣−ν 1k 1Tk + diag −

ni ∑ t =1



) (old)

q′′ (yit ; µi

, τi(old) ) + kν

⎦ i=1,2,...,k

is positive definite and therefore is invertible in a neighborhood of the true parameter vector according to continuity arguments and condition (I) in Theorem 2.2. Step 4: Repeat steps 2 and 3 until convergence. Step 5: Increase ν by a factor greater than one. For example, if the factor is two, this step becomes ν ← 2ν . Step 6: Repeat steps 2 to 5 until convergence. Under the regularity conditions of Theorem 2.2, if q(yit ; µi ) is convex, the proposed coordinate-wise descent algorithm provides the global maximum because the L1 penalty term on Θij is separable and convex (Tseng, 2001). In balanced cases where ni = n , i = 1, 2, . . . , k , the computational burden of each iteration is O(kn + k2 ) . Here, O(kn) computational time is required to compute the first and second order derivatives of the log-likelihood. In step 1, O(k2 ) steps are required to compute Θij , i, j = 1, 2, . . . , k , i ̸ = j . Step 3 requires O(k) computational time. References Agresti, A., Bini, M., Bertaccini, B., Ryu, E., 2008. Simultaneous conficence interval for comparing binomial parameters. Biometrics 64, 1270–1275.

Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.

18

C.T. Ng, W. Lee and Y. Lee / Journal of Statistical Planning and Inference xxx (xxxx) xxx

Arnold, T., Tibshirani, R., 2014. Efficient implementations of the generalized lasso dual path algorithm. arXiv:1405.3222v2 [stat.CO]. ArunKumar, G., et al., 2012. Population differentiation of southern Indian male lineages correlates with agricultural expansions predating the caste system. PLoS One 7, 11. Box, G.E.P., Pierce, D.A., 1970. Distribution of residual correlations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509–1526. Brannah, W., Bretz, F., 2010. Shortcuts for locally consonant closed test procedures. J. Amer. Statist. Assoc. 105, 660–669. Cheung, S.H., Chan, W.S., 1996. Simultaneous confidence intervals for pairwise comparison in a two-way unbalanced design. Biometrics 52, 463–472. Chow, T.L., Teugels, J.L., 1978. The sum and the maximum of i.i.d. random variables. In: Proc. Second Prague Symp. Asymptotic Statistics. North Holland, N.Y.. Efron, B., 2004. The estimation of prediction error: covariance penalties and cross-validation. J. Amer. Statist. Assoc. 99, 619–642. Fan, Y., Tang, C.Y., 2013. Tuning parameter selection in high dimensional penalized likelihood. Ann. Statist. 38, 3567–3604. Finner, H., Strassberger, K., 2002. The partitioning principle: a powerful tool in multiple decision theory. Ann. Statist. 30, 1194–1213. Fisher, R., Tippett, L.H.C., 1928. Limiting forms of the frequency distribution of largest or smallest member of a sample. Proc. Camb. Phil. Soc. 24, 180–190. Friedman, J., et al., 2007. Pathwise coordinate optimization. Ann. Appl. Stat. 2, 302–332. Gabriel, K., 1969. Simultaneous test procedures – some theory of multiple comparisons. Ann. Math. Stat. 40, 224–250. Gelman, A., Hill, J., Yajima, M., 2012. Why we (usually) don’t have to worry about multiple comparisons. J. Res. Educ. Effectiveness 5, 189–211. Gnedenko, B.V., 1943. Sur la distribution limite du terme d’une série aléatoire. Ann. of Math. 44, 423–453. Goeman, J., Solari, A., 2011. Multiple testing for exploratory research. Statist. Sci. 26, 584–597. Gretton, A., Györfi, 2010. Consistent nonparametric tests of independence. J. Mach. Learn. Res. 11, 1391–1423. Häaggström, J., de Luna, X., 2010. Estimating prediction error: Cross-validation vs. accumulated prediction error. Commun. Statist. Simul. Comput. 39, 880–898. Hochberg, Y., Tamhane, A.C., 1987. Multiple Comparison Procedures. Wiley, New York. Hocking, T.D., Joulin, A., Bach, F., Vert, J., 2011. Clusterpath: An algortihm for lcustering using convex fusion penalties. In: Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA. Itan, Y., et al., 2010. A worldwide correlation of lactase persistence phenotype and genotypes. BMC Evol. Biol. 10, 36. Konishi, S., Kitagawa, G., 2008. Informational Criteria and Statistical Modeling. Springer, New York. Kramer, C.Y., 1956. Extension of multiple range test to cluster means with unequal numbers of replications. Biometrics 12, 307–310. Lanka, R., et al., 2014. Mitochondrial DNA history of Sri Lankan ethnic people: their relations within the island and with the Indian subcontinental populations. J. Hum. Genet. 59, 28–36. Lee, Y., Bjørnstad, J.F., 2013. Extended likelihood approach to large-scale multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. http://dx.doi.org/10. 1111/rssb.12005. Lehmann, E.L., Romano, J.P., 2005. Testing Statistical Hypotheses(3rd). Springer, New York. Leng, C., Tang, C.Y., 2012. Penalized empirical likelihood and growing dimensional general estimating equations. Biometrika 99, 703–716. Lin, Y.Q., Cheung, S.H., Poon, W.Y., Lu, T.Y., 2014. Pairwise comparisons with ordered categorical data. Stat. Med. 32, 3192–3205. Lu, T.Y., Lin, Y., Zhong, J., 2017. Pairwise comparisons of treatments with ordinal responses in a two-way setting. Comm. Statist. Theory Methods 46, 10549–10563. Marcus, R., Peritz, E., Gabriel, K.R., 1976. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660. McCormick, W.P., 1980. Weak convergence for the maxima of stationary Gaussian processes using random normalization. Ann. Probab. 8, 483–497. Miller, R., 1981. Simultaneous Statistical Inference. Springer-Verlag, New York. Pan, W., Shen, X., Liu, B., 2013. Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. J. Mach. Learn. Res. 14, 1865–1889. Romano, J.P., Shaikh, A., Wolf, M., 2011. Consonance and the closure method in multiple testing. Int. J. Biostat. 7 (1), 12. Sonnemann, E., 2008. General solutions to multiple testing problems. Biom. J. 50, 641–656, translation with minor corrections of the original article Sonnemann, E. 1982. Allgemeine Lösungen multipler Testprobleme. EDV in Medizin und Biologie 13, 120–128 by Helmut Finner. Sonnemann, E., Finner, H., 1998. Vollstandigkeitssatze fur multiple Testprobleme. In: Bauer, P., Hommel, G., Sonnemann, E. (Eds.), Multiple Hypothesenprufung. Springer, Berlag, pp. 121–135. Tamang, R., Thangaraj, K., 2012. Genomic view on the peopling of India. Integrative Genet. 3, 20. Tseng, P., 2001. Convergence of block coordinate descent method for nondifferentialble maximization. J. Optim. Theory Appl. 109, 474–494. Tukey, J.W., 1953. The problem of multiple comparisons. Unpublished report, Princeton University, Princeton, New Jersey. Wang, H., Leng, C., 2007. Unified LASSO estimation via least squares approximation. J. Amer. Statist. Assoc. 101, 1418–1429. Wang, H., Li, B., Leng, C., 2009. Shrinkage tuning parameter selection with a diverging number of parameters. J. R. Stat. Soc. Ser. B Stat. Methodol. 71, 671–683. Wang, H., Li, R., Tsai, C.L., 2007. On the consistency of SCAD tuning parameter selector. Biometrika 94, 553–558. Zhang, C.H., Zhang, T., 2012. A General Framework of Dual Certificate Analysis for Structured Sparse Recovery Problems. ArXiv e-prints.1201.3302. Zhao, H., Wang, B., Cui, X., 2010. General solutions to consistency problems in multiple hypothesis testing. Biom. J. 52, 735–746.

Please cite this article as: C.T. Ng, W. Lee and Y. Lee, Logical and test consistency in pairwise multiple comparisons. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.009.