Probabilistic inference for multiple testing


International Journal of Approximate Reasoning 55 (2014) 654–665


Chuanhai Liu, Jun Xie *
Department of Statistics, Purdue University, 250 N. University St., West Lafayette, IN 47907, United States
* Corresponding author. E-mail addresses: [email protected] (C. Liu), [email protected] (J. Xie).

Article history: Received 25 April 2013; Received in revised form 24 September 2013; Accepted 25 September 2013; Available online 3 October 2013.

Keywords: Belief functions; Fiducial inference; Inferential model; Many-normal-means; Predictive random sets.

Abstract

An inferential model is developed for large-scale simultaneous hypothesis testing. Starting with a simple hypothesis testing problem, the inferential model produces a probability triplet (p, q, r) on an assertion of the null or alternative hypothesis. The probabilities p and q are for and against the truth of the assertion, whereas r = 1 − p − q is the remaining probability, called the probability of "don't know". For a large set of hypotheses, a sequence of assertions concerning the total number of true alternative hypotheses is proposed. The inferential model provides levels of belief, without a prior, for the sequence of assertions and offers a new multiple comparison procedure (MCP). The proposed method is obtained by improving Fisher's fiducial and the Dempster–Shafer theory of belief functions so that it produces probabilistic inferential results with desirable frequency properties. The new multiple comparison procedure is shown to have performance comparable with other existing MCPs and is favorable in terms of probabilistic interpretation. The proposed method is applied to identifying differentially expressed genes in microarray data analysis.

1. Introduction

There have been tremendous research efforts in the last decade on solving large-scale simultaneous hypothesis testing, where one is concerned with a large number n of pairs of competing hypotheses: H_0^{(i)} versus H_a^{(i)} for i = 1, ..., n. The multiple testing problem is introduced by modern scientific techniques, for example, gene expression microarrays used to identify differentially expressed genes from a large number of candidates or even the whole genome. Existing efforts have produced a series of multiple comparison procedures (MCPs), for example, procedures controlling the generalized family-wise error rate (gFWER) or the false discovery rate [5,12,27,28,20]. Dudoit and van der Laan's book [10] provides good descriptions of the existing MCPs. An alternative way of thinking about the multiple testing problem is to consider a sequence of assertions:

A_k = {there are at least k H_a^{(i)}'s that are true},  k = 1, 2, ..., n.

We will develop a probabilistic inference for this type of assertion and will use it to construct a multiple comparison procedure.

We start with a single test of a null hypothesis H_0 versus an alternative hypothesis H_a. The classic frequency theory of hypothesis testing developed by Neyman, Pearson, and Fisher has been known as the twentieth century's most influential piece of applied mathematics [6,11]. However, the p-value, computed from an observed test statistic assuming the truth of the null hypothesis, does not have a desirable probability interpretation of whether or not the null hypothesis is true. In fact, the interpretation of the p-value is so confusing for nonstatisticians that many falsely think it



is a probability of H_0. In the context of Bayesian hypothesis testing, Bayes factors are often computed to measure evidence in favor of one hypothesis over the other. However, like Fisher's p-value, Bayes factors do not have a desirable probability interpretation. An attractive approach is to seek prior-free probabilistic inferential methods. Ronald A. Fisher, the founding father of modern statistics, took a lead in this direction with his fiducial inference [16,17]. Intensive and stimulating investigations of fiducial inference in the 1940s–1960s led to the now commonly perceived conclusion that "fiducial inference stands as R.A. Fisher's one great failure" [32]. On the other hand, researchers have been inspired by the interesting ideas in fiducial inference. Perhaps due to the challenge of modern scientific inference, especially with high-dimensional problems such as the multiple testing discussed in this article, it is exciting to see renewed efforts in generalizing and modifying Fisher's fiducial. These include the Dempster–Shafer theory [26,8], generalized fiducial [18,19], algorithmic inference [1–3], exact confidence interval methods [4], confidence distributions [30,31], belief functions with good frequency properties [9], and inferential models [21,14,23]. For a brief review of these methods, see Liu and Xie [25].

In this paper, we tackle the problem of multiple testing from a general perspective of producing uncertainty assessments and derive a probabilistic inferential model for it. To be more precise, we follow Dempster [8] in viewing probabilistic inference for the truth of H_0 or H_a as producing a probability p for the truth of H_0, a probability q for the truth of H_a, and a residual probability r, called the probability of "don't know", for neither H_0 nor H_a. That is, the triplet (p, q, r) is our uncertainty assessment of H_0 and H_a. As in [21], we modify Fisher's fiducial [17,18] to obtain a new probabilistic inferential model, so that the triplet (p, q, r) has desirable frequency properties. The modified inferential framework is called an Inferential Model (IM). It is closely related to the Dempster–Shafer theory of belief functions [7,26]. Compared with the notions of Belief and Plausibility in belief functions [26,23], p equals the Belief and p + r equals the Plausibility of H_0. IM provides direct statistical evidence for H_0 and H_a. Most importantly, the (p, q, r) triplet is calculated from the specification of our uncertainty about the unknown model parameters; it is not a conditional probability under either the truth of H_0 or the truth of H_a.

In Section 2, we introduce the framework of IM for single hypothesis testing. We specifically discuss the interpretation of the probabilities used in IM, which are regarded as degrees of belief for the null or alternative hypothesis. In Section 3, we study the many-normal-means problem, where the inferential model is used for multiple testing. We show that IM offers a new multiple comparison procedure with performance comparable to other popular MCPs. In Section 4, we apply the inferential model of multiple testing to microarray data analysis, to identify differentially expressed genes. Finally, in Section 5 we conclude with a few remarks and point to future work in which an optimal IM can be derived for multiple testing.

2. The framework of IM

2.1. The basic framework

We start with a demonstration example.
Assume that a set of observed data X is available and that a model f_θ(X) for X in the sample space is specified, usually with an unknown parameter θ ∈ Θ. We use the following example to explain the framework of IM. The key idea is to use an unobserved auxiliary random variable to represent f_θ(X) and to make the uncertainty assessment by predicting this unobserved auxiliary random variable.

Example 1. The exponential distribution with rate parameter θ (> 0) has pdf f(x; θ) = θe^{−θx} for x ≥ 0 and f(x; θ) = 0 for x < 0. Its cdf is F(x; θ) = 1 − e^{−θx} for x ≥ 0. A random number generator for the exponential distribution, based on the inverse of its cdf, is given by

X = −θ^{−1} ln U,  where U ∼ Unif(0, 1).   (1)

This sampling mechanism preserves the model for X given θ. Moreover, it specifies a relationship among the observed data X, the unknown parameter θ, and the unobserved variable U. The unobserved variable U is called an auxiliary random variable, and Eq. (1) is referred to as an association equation. Our proposed IM inference about θ is based on predicting the unobserved U using what is called a predictive random set (PRS). An example of a PRS is the random interval

S = {u: |u − 0.5| ≤ |V − 0.5|} = [0.5 − |V − 0.5|, 0.5 + |V − 0.5|],  with V ∼ Unif(0, 1).

A realization S of the PRS is interpreted as the belief that the unobserved U lies in S. It follows from the association equation (1) that, given the observed data X, the true value of θ belongs to the set



{θ: X = −θ^{−1} ln u for some u ∈ S},

which is denoted by Θ_X; more specifically,

Θ_X ≡ [−X^{−1} ln(0.5 + |V − 0.5|), −X^{−1} ln(0.5 − |V − 0.5|)],  V ∼ Unif(0, 1).

Now suppose that an assertion of interest about θ is given by a subset A in the space of θ, i.e., A ⊂ (0, ∞). The realization S supports the truth of A if and only if Θ_X ⊆ A, and supports the truth of the negation of A, denoted by A^c,


if and only if Θ_X ⊆ A^c. In the case when Θ_X intersects with both A and A^c, we don't know which of A and A^c is true. This is the case called "don't know" by Dempster [8]. Accordingly, the probability

p = P(Θ_X ⊆ A) / P(Θ_X ≠ ∅)

provides evidence supporting the truth of A, the probability

q = P(Θ_X ⊆ A^c) / P(Θ_X ≠ ∅)

provides evidence supporting the truth of A^c, and the residual probability

r = 1 − p − q

is the probability of "don't know". This particular way of handling conflict cases, i.e., cases with Θ_X = ∅, in computing (p, q, r) corresponds to Dempster's rule of combination. A more efficient alternative is proposed by Ermini Leaf and Liu [14]. Conflict cases do not cause problems in any of the (p, q, r) evaluations in this paper, including this illustrative example. For simplicity, in what follows, we write

p = P(Θ_X ⊆ A)  and  q = P(Θ_X ⊆ A^c).

Formally, an inferential model for probabilistic inference about θ is given by probabilities defined on the collection of subsets of Θ. Its (sub-additive) probability measure is derived from an auxiliary random variable, for instance, the uniform variable U in Example 1. More specifically, a random set is constructed for inference about θ using the auxiliary random variable and conditioning on the observed data X, e.g., Θ_X in Example 1. The probability triplet is obtained correspondingly. In the next subsection, we provide a detailed discussion of the interpretation of the probability triplet; its components are not probabilities in the frequentist sense but levels of belief. To emphasize the fact that the (p, q, r) output is conditional on the observed data X, we write (p, q, r) as (p_X(A), q_X(A), r_X(A)).
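To make the triplet concrete for Example 1, the following minimal Python sketch (ours, not part of the paper; the function name and the illustrative values of X and the threshold c are assumptions) approximates (p_X(A), q_X(A), r_X(A)) for the one-sided assertion A = {θ: θ ≤ c} by Monte Carlo over the auxiliary variable V.

```python
import numpy as np

def triplet_exponential(x, c, n_mc=100_000, seed=0):
    """Monte Carlo (p, q, r) for the assertion A = {theta <= c} in Example 1.

    The PRS for U is the interval [0.5 - |V - 0.5|, 0.5 + |V - 0.5|] with
    V ~ Unif(0, 1); through the association X = -ln(U)/theta it maps to
    Theta_X = [-ln(0.5 + |V - 0.5|)/x, -ln(0.5 - |V - 0.5|)/x].
    """
    rng = np.random.default_rng(seed)
    v = rng.uniform(size=n_mc)
    lo = -np.log(0.5 + np.abs(v - 0.5)) / x   # lower endpoint of Theta_X
    hi = -np.log(0.5 - np.abs(v - 0.5)) / x   # upper endpoint of Theta_X
    p = np.mean(hi <= c)                      # Theta_X entirely inside A
    q = np.mean(lo > c)                       # Theta_X entirely inside A^c
    return p, q, 1.0 - p - q

print(triplet_exponential(x=2.0, c=1.0))      # illustrative observed X = 2 and threshold c = 1
```

Because Θ_X is a non-degenerate interval here, it is never empty, so no normalization by P(Θ_X ≠ ∅) is required, consistent with the remark above.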

The following concept of validity offers desirable long-run frequency properties for the probability triplet (p_X(A), q_X(A), r_X(A)).

Definition 1. The inferential model is valid for the assertion A if, for every α in (0, 1), both

P_θ({X: p_X(A) ≥ α}) ≤ 1 − α  and  P_θ({X: q_X(A) ≥ α}) ≤ 1 − α

hold respectively for every θ ∈ A^c = Θ \ A and for every θ ∈ A. The probabilities in P_θ(·) are defined with respect to the random variable X following f_θ(X).

In other words, validity requires p_X(A) and q_X(A), as functions of the random variable X, to be stochastically bounded by the uniform distribution over the unit interval (0, 1) in repeated experiments. Thus, the triplet (p_X(A), q_X(A), r_X(A)) provides strength of evidence for both A and A^c in terms of long-run frequency probability.

The performance of IMs depends on the validity of PRSs. Specifically, IMs are valid as long as the underlying PRSs are valid [23]. The validity of PRSs is defined as follows.

Definition 2. Let U be the auxiliary random variable taking values in a space 𝒰, and let S be a PRS with focal elements in the power set 2^𝒰. For each u′ ∈ 𝒰, denote by Q(u′) the probability that S misses u′, i.e., Q(u′) = P(S ∌ u′). If Q(U′), as a function of a random variable U′ equal in distribution to U, is stochastically smaller than or equal to the standard uniform random variable Unif(0, 1), then the PRS S is said to be valid for predicting U.

We refer to Martin and Liu [23] for more discussion on PRSs.

2.2. Interpretation of the probability triplet in IM

We interpret the probability triplet in IM as degrees of belief rather than probabilities in the ordinary frequentist sense. As shown in Example 1, the first step of an IM is to use an auxiliary random variable to represent the probability model of the observed data. The use of the auxiliary random variable is especially critical when there is no prior knowledge on the unknown parameters. Indeed, based on inference using such auxiliary variables, Martin and Liu [22] recently proposed a


new but natural way of combining information for prior-free probabilistic inference. Their results show that this new way of combining information includes the Bayes rule as a trivial case and sheds new light on frequentist thoughts on efficient methods such as Fisher's conditional inference. To some extent, prior-free probabilistic inference is made by "inverting" the probability measure of the auxiliary random variable to the parameter space using the association equation, e.g., Eq. (1), which relates the observed data and the unknown parameters through the auxiliary random variable. Before "inverting", all probabilistic operations, including implicit integration via the use of PRSs, are standard in the familiar probability theory built on the Kolmogorov axioms. It follows that before "inverting" to the parameter space, probability represented by the auxiliary random variable maintains the same interpretation as that in the specification of the sampling model. As remarked in [23], the final IM results, obtained via "inverting", are on assertions or subsets in the parameter space. They are no longer interpreted as the usual probability but instead as degrees of belief. In addition, suppose that the (numerical) probabilities specified in the sampling model are frequency calibrated. This is typically assumed in modeling so that the ordinary probability calculus can be used as a convenient mathematical tool for data analysis. In this case, inferential results produced by IM have desirable frequency properties.

The triplet (p_X(A), q_X(A), r_X(A)) is connected to the notions of Belief and Plausibility, denoted as Bel_X(A) and Pl_X(A) in the Dempster–Shafer theory of belief functions. In fact, we have Bel_X(A) = p_X(A) and Pl_X(A) = p_X(A) + r_X(A). Hence, the triplet provides lower and upper probabilities for the assertion A. The lower probability p_X(A) is sub-additive in the sense that p_X(A) + p_X(A^c) ≤ 1. Thresholds for p_X(A) and q_X(A) can be used to confirm the truth and falsity of A. When applying the technique in practice, we report all three probabilities, where a large value of p_X(A) supports A (e.g., the null hypothesis), a large value of q_X(A) supports A^c (e.g., the alternative hypothesis), and a large value of r_X(A) supports neither. Those familiar with the Neyman–Pearson framework of hypothesis testing may choose to reject H_0: θ ∈ A when the plausibility of the null hypothesis is smaller than the significance level α, i.e.,

p_X(A) + r_X(A) = 1 − q_X(A) < α.

In fact, Martin and Liu [24, Theorem 2] proved that under mild conditions Fisher's p-value can be interpreted as a plausibility of H_0. Therefore, controlling the plausibility is compatible with the Neyman–Pearson framework of hypothesis testing.

2.3. IM versus fiducial and Dempster–Shafer

In the context of sampling-model-based statistical inference, the Dempster–Shafer theory is a generalization of Fisher's fiducial. For recent contributions to fiducial, see Hannig [18] and Xie and Singh [30]. Our proposed IM modifies fiducial inference by using predictive random sets to predict auxiliary random variables. The use of PRSs corrects the bias in Fisher's fiducial probability. Ermini Leaf, Hui, and Liu [13] and Ermini Leaf and Liu [14] took a closer look at fundamental issues of probabilistic inference in the simple single-normal-mean problem. They showed that fiducial inference is problematic in this simple case for assertions with constrained parameters. By problematic, we mean that the fiducial probability is not frequency calibrated. To elaborate on this point, consider the simple normal example with the unknown θ constrained to be positive. That is, we have the association X = θ + Z with the auxiliary variable Z ∼ N(0, 1) and θ ∈ (0, ∞). Due to the constraint on θ, with fiducial (as a special case of Dempster–Shafer) we rule out the possibility that, given X, Z = X − θ ≥ X. This leads to the fiducial (posterior) distribution with cdf

F_X(θ) = [Φ(θ − X) − Φ(−X)] / [1 − Φ(−X)],  θ > 0,

where Φ(·) stands for the cdf of the standard normal N(0, 1). It is easy to verify, for example, that the 95% one-sided confidence interval based on this posterior does not have the correct coverage. This explains why fiducial inference does not have desirable frequency properties in general. When they are not frequency calibrated, fiducial probabilities are referred to here as biased probabilities.

Though related, IM makes a significant modification to fiducial and the Dempster–Shafer theory. More specifically, the auxiliary random variable U is unobservable and is predicted by a predictive random set (PRS). The use of a PRS effectively corrects the bias of fiducial posterior probabilities, a bias due to the fact that for every assertion A on θ, A ⊂ Θ, the corresponding event in the sample space of U is selected through the observed data X, which itself depends on the unobserved realization of U. Intuitively, IM differs from Fisher's fiducial and the Dempster–Shafer theory of belief functions in that it carries out the necessary probability/integration operations before "inverting" to the parameter space. More elaboration along this line is given in [23].
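Returning to the coverage claim above, it can be checked numerically. The following minimal Python sketch (ours, not from the paper; the true θ and the nominal level are illustrative choices) estimates the frequentist coverage of the one-sided 95% interval [l(X), ∞), where l(X) is the 5% quantile of the constrained fiducial cdf F_X given above. For small positive θ, the empirical coverage falls noticeably below the nominal 95%, illustrating the bias.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true, alpha, n_rep = 0.2, 0.05, 200_000    # illustrative true mean, level, replications
x = theta_true + rng.standard_normal(n_rep)      # X = theta + Z with Z ~ N(0, 1)
# The lower bound l solves F_X(l) = alpha, i.e. Phi(l - X) = Phi(-X) + alpha * (1 - Phi(-X)).
lower = x + norm.ppf(norm.cdf(-x) + alpha * (1.0 - norm.cdf(-x)))
print("empirical coverage of the nominal 95% interval:", np.mean(lower <= theta_true))
```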


3. The many-normal-means problem

The many-normal-means problem is a benchmark problem for inference about multiple testing. Suppose that the observed data set consists of n data points X_1, ..., X_n from the model

X_i ∼ N(θ_i, 1),  independently for i = 1, ..., n.

In the context of multiple testing, we consider a large number of hypotheses H_0^{(i)}: θ_i = 0 versus H_a^{(i)}: θ_i ≠ 0 for i = 1, ..., n. We refer to each non-zero θ_i (and the corresponding X_i) as a signal and want to identify the signals present in the data. As stated in the introduction, this problem is translated into producing (p, q, r) outputs for the sequence of assertions

A_k = {the number of true H_a^{(i)} is at least k},  k = 1, 2, ..., n.

An inferential model for the many-normal-means problem can be constructed through the following data generating mechanism,

X_i = θ_i + Φ^{−1}(U_i),  i = 1, ..., n,

where U_1, ..., U_n are auxiliary random variables from Unif(0, 1) and Φ^{−1}(·) stands for the inverse cdf of the standard normal N(0, 1). If U = (U_1, ..., U_n) were observed, the values of θ_i would be known by calculating

θ_i = X_i − Φ^{−1}(U_i).

We will predict the unobserved U_1, ..., U_n via a predictive random set, which leads to a random set for the parameter θ = (θ_1, ..., θ_n).

3.1. An IM for inference about the number of signals

Let S_n denote a simplex space of size n:

S_n = {(u_1, ..., u_n): 0 < u_1 < ··· < u_n < 1}.

A predictive random set (PRS) for the ordered uniform variables U = (U_(1), ..., U_(n)) is defined as

S = {u: u ∈ S_n, g(u) ≥ g(U)},

where g(·) takes the form

g(u) = Σ_{i=1}^{n} [a_i ln u_i + b_i ln(1 − u_i)],  u ∈ S_n,   (2)

with a_i = 1/(n − i + 0.7) and b_i = 1/(i − 1 + 0.7) for i = 1, ..., n. Martin and Liu [23] showed that the g(·) so defined provides an easy-to-compute predictive random set for predicting sorted uniforms. The function (2) measures how close a sequence of ordered values 0 < u_1 < ··· < u_n < 1 is to the median vector of the ordered uniform random variables U = (U_(1), ..., U_(n)). The coefficients a_i and b_i are chosen via stochastic approximation so that g(u) is maximized at the marginal medians of U. As an illustration, Fig. 1 shows contours of the function in the space S_2 = {(u_1, u_2): 0 < u_1 < u_2 < 1}. The function achieves its maximum at the marginal medians of U_(1) and U_(2) and decreases towards the boundary of S_2. The random set S defined above corresponds to the inner area of the contour at level g(U) and predicts a region for the unobserved uniform vector (U_(1), ..., U_(n)). For inference about the parameter of interest θ, we consider the random set

S_X = {θ: θ ∈ R^n, g(Φ(X − θ)) ≥ g(U)},   (3)

where Φ(X − θ) represents the sorted values of (Φ(X_1 − θ_1), ..., Φ(X_n − θ_n)). An assertion A about the parameter θ, as a subset of the parameter space, is evaluated by a probability triplet with p(A) = P(S_X ⊆ A) and q(A) = P(S_X ⊆ A^c). Following Corollary 1 and Theorem 2 of Martin and Liu [23], the inferential model with the random set (3) is valid, due to the fact that the boundary function g(·) is continuous.

The assertion A_k means "there are at least k signals", and thereby its negation is A_k^c = {‖θ‖_0 ≤ k − 1}, where ‖θ‖_0 denotes the number of non-zero components of θ. We use the random set (3) to produce a probability triplet for A_k and A_k^c, with









p(A_k) = q(A_k^c) = P(S_X ⊆ A_k),
q(A_k) = p(A_k^c) = P(S_X ⊆ A_k^c),
r(A_k) = 1 − p(A_k) − q(A_k).

To compute p(A_k^c) = P(S_X ⊆ A_k^c), we note that the event S_X ⊆ A_k^c is equivalent to ‖θ‖_0 ≤ k − 1 holding for all θ ∈ S_X = {θ: g(Φ(X − θ)) ≥ g(U)}. This is an impossible event, because if we take θ_i = X_i − Φ^{−1}(U_i) for i = 1, ..., n, then ‖θ‖_0 = n and g(Φ(X − θ)) ≥ g(U). Therefore,


Fig. 1. Contours of the g function defined in (2) in the space of two ordered uniform samples S_2 = {(u_1, u_2): 0 < u_1 < u_2 < 1}.









p(A_k^c) = P(S_X ⊆ A_k^c) = 0.

To compute p(A_k) = P(S_X ⊆ A_k), we note that the event S_X ⊆ A_k means that ‖θ‖_0 ≥ k holds for all θ satisfying g(Φ(X − θ)) ≥ g(U), which is equivalent to

max_{θ: ‖θ‖_0 ≤ k−1} g(Φ(X − θ)) < g(U).

The constraint ‖θ‖_0 ≤ k − 1 implies that "except for at most (k − 1) of the X_i's, the others form a sample of size (n − k + 1) from N(0, 1)". This idea was implicit in the Benjamini–Hochberg procedure and has been shown recently by Ferreira and Zwinderman [15]. To compute max_{θ: ‖θ‖_0 ≤ k−1} g(Φ(X − θ)), we choose to work with the corresponding g(·) function defined over the (n − k + 1)-dimensional space instead of the one defined over the n-dimensional space. (The resulting inference is more efficient because it marginalizes out the (k − 1) potential signals.) Let 𝒴_{n−k+1} be the set of all (n choose n − k + 1) combinations of (n − k + 1) X_i's that are from N(0, 1). We want to solve the following optimization problem:

max_{Y ∈ 𝒴_{n−k+1}} g(Φ(Y_(1)), ..., Φ(Y_(n−k+1))).

Let g* = max_{Y ∈ 𝒴_{n−k+1}} g(Φ(Y_(1)), ..., Φ(Y_(n−k+1))). Then p(A_k) = P(S_X ⊆ A_k) can be computed by first finding g* and then approximating the probability p(A_k) = P(g(U) > g*) via Monte Carlo methods, i.e.,

p(A_k) ≈ (1/M) Σ_{i=1}^{M} I{g(U^(i)) > g*},

where U^(1), ..., U^(M) are M samples drawn from the distribution of n − k + 1 sorted uniforms and I{g(U^(i)) > g*} = 1 if g(U^(i)) > g* and 0 otherwise. Direct calculation of g* = max_{Y ∈ 𝒴_{n−k+1}} g(Φ(Y_(1)), ..., Φ(Y_(n−k+1))) is a so-called NP-hard problem when all possible (n choose n − k + 1) combinations are considered. We are able to use an efficient algorithm that greatly reduces the computation. The detailed algorithm is provided in Appendix A.
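To make the Monte Carlo step concrete, here is a minimal Python sketch (ours, not from the paper; the function names are hypothetical) of the boundary function (2) and of the approximation of p(A_k) = P(g(U) > g*), assuming the optimized value g* has already been obtained, e.g., by the matching algorithm of Appendix A.

```python
import numpy as np

def g_boundary(u):
    """Boundary function (2) at a sorted vector u; the dimension used for the
    coefficients a_i, b_i is taken to be len(u) (here n - k + 1)."""
    m = len(u)
    i = np.arange(1, m + 1)
    a = 1.0 / (m - i + 0.7)
    b = 1.0 / (i - 1 + 0.7)
    return float(np.sum(a * np.log(u) + b * np.log(1.0 - u)))

def p_at_least_k(g_star, m, n_mc=10_000, seed=0):
    """Monte Carlo approximation of p(A_k) = P(g(U) > g*), with U a vector of
    m = n - k + 1 sorted Unif(0, 1) draws and g_star the optimized value g*."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_mc):
        u = np.sort(rng.uniform(size=m))
        hits += g_boundary(u) > g_star
    return hits / n_mc
```

For a data set X and a given k, one would first compute g* from the n − k + 1 candidate null observations and then call p_at_least_k(g_star, n - k + 1).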


3.2. Inference for the number of signals in an interval

Assume that there are k signals in the observed data set X_1, ..., X_n. That is, n − k of these n observed values are known to form a sample from N(0, 1). We are interested in the number of signals in a given interval, say [x_1, x_2] (e.g., x_1 = 3 and x_2 = ∞). Formally, we consider the assertion that "there are J ≥ j signals in [x_1, x_2], conditioning on k signals in the whole set of n observations". To make a probabilistic inference about this assertion, we start with a data generating mechanism for the observed count of X_i's that fall in [x_1, x_2], denoted by C_{x_1,x_2}. It is known that (n − k) of the X_i's are from N(0, 1). Consider an auxiliary random variable N_0, the number of these n − k standard normals that fall into the interval [x_1, x_2]. Then N_0 follows a binomial distribution, Binomial(n − k, Φ(x_2) − Φ(x_1)). There is a critical relationship among the observed count C_{x_1,x_2}, the quantity of interest J, and the unobserved random variable N_0, namely C_{x_1,x_2} = N_0 + J. If N_0 were observed, we could obtain an inference about J as C_{x_1,x_2} − N_0. Since N_0 is unobserved, we use a random set {0, 1, ..., N} to predict it, where N ∼ Binomial(n − k, Φ(x_2) − Φ(x_1)). This leads to a random set for inference about J,

S = {J: C_{x_1,x_2} − N ≤ J ≤ C_{x_1,x_2}},  where N ∼ Binomial(n − k, Φ(x_2) − Φ(x_1)).

The predictive random set {0, 1, ..., N} is the discrete counterpart of the one-sided continuous PRSs that are optimal for one-sided assertions under mild conditions (see [23] for more discussion on the construction of assertion-specific PRSs). For a probabilistic inference about the assertion {J: J ≥ j}, that is, "there are at least j signals in [x_1, x_2], conditioned on k signals in the whole set of n observations", we compute the (p, q, r) output as follows:





p = P(S ⊆ {J: J ≥ j}) = P(C_{x_1,x_2} − N ≥ j)
  = P(N ≤ C_{x_1,x_2} − j)
  = pBinomial(C_{x_1,x_2} − j, n − k, Φ(x_2) − Φ(x_1)),
q = P(S ⊆ {J: J < j}) = P(∅) = 0,
r = 1 − p,

where pBinomial(·, n − k, Φ(x_2) − Φ(x_1)) denotes the cdf of Binomial(n − k, Φ(x_2) − Φ(x_1)). For an analogy with the concept of false discovery rate (FDR), one may also report

FDR_{x_1,x_2} = max{0, (n − k)[Φ(x_2) − Φ(x_1)] / C_{x_1,x_2}},

which gives an expected number of falsely rejected null hypotheses in a rejection interval [x_1, x_2].
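Since the (p, q, r) output of this subsection reduces to a binomial cdf, a minimal Python sketch (ours, not from the paper; the function name and the example call values are hypothetical) is:

```python
from scipy.stats import binom, norm

def interval_signal_triplet(c_obs, j, n, k, x1, x2):
    """(p, q, r) for the assertion {J >= j}: at least j signals lie in [x1, x2],
    given k signals among the n observations and the observed count c_obs in [x1, x2]."""
    w = norm.cdf(x2) - norm.cdf(x1)         # chance that a null (N(0,1)) X_i falls in [x1, x2]
    p = binom.cdf(c_obs - j, n - k, w)      # P(N <= c_obs - j), N ~ Binomial(n - k, w)
    q = 0.0                                 # the one-sided PRS never supports {J < j}
    return p, q, 1.0 - p

# e.g., interval_signal_triplet(c_obs=12, j=5, n=1500, k=60, x1=3.0, x2=float("inf"))
```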

3.3. Multiple comparison procedure based on the IM method

We derive a multiple comparison procedure using the probabilistic inference for the sequence of assertions A_k, k = 1, ..., n. As in existing MCPs [10], we assume that an unadjusted p-value is calculated for each of the n pairs of hypotheses H_0^{(i)}: θ_i = 0 versus H_a^{(i)}: θ_i ≠ 0. The unadjusted p-value is simply 2(1 − Φ(|X_i|)) for the many-normal-means problem. We obtain the MCP based on the IM method in the following steps.

1. Order the unadjusted p-values from the smallest to the largest.
2. Calculate the probabilities for the sequence of assertions A_k, k = 1, ..., n, i.e.,

   p({the number of true H_a^{(i)} is at least k}).

3. For a given significance level α, e.g., α = 0.01, find the largest k* such that

   p({the number of true H_a^{(i)} is at least k*}) ≥ 1 − α.

4. The hypotheses corresponding to the k* smallest unadjusted p-values are declared to be the true alternatives.

Note that the criterion

p({the number of true H_a^{(i)} ≥ k*}) ≥ 1 − α

implies

p({the number of true H_a^{(i)} < k*}) + r({the number of true H_a^{(i)} < k*}) = q({the number of true H_a^{(i)} ≥ k*}) + r({the number of true H_a^{(i)} ≥ k*}) ≤ α.
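For concreteness, here is a minimal Python sketch of the four steps (ours, not from the paper); it assumes a helper p_at_least(k) returning p(A_k), for example computed as in Section 3.1, and the helper name and interface are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def im_mcp(x, p_at_least, alpha=0.01):
    """IM-based MCP of Section 3.3 for many-normal-means data x = (X_1, ..., X_n).

    p_at_least : callable k -> p(A_k), the belief that at least k alternatives are
                 true (e.g., computed as in Section 3.1); name/interface assumed.
    Returns the (0-based) indices of hypotheses declared to be true alternatives.
    """
    n = len(x)
    pvals = 2.0 * (1.0 - norm.cdf(np.abs(x)))   # step 1: unadjusted two-sided p-values,
    order = np.argsort(pvals)                   #         sorted from smallest to largest
    k_star = 0
    for k in range(1, n + 1):                   # steps 2-3: largest k* with p(A_k*) >= 1 - alpha
        if p_at_least(k) >= 1.0 - alpha:        # (p(A_k) is non-increasing in k, so stopping
            k_star = k                          #  at the first failure is enough)
        else:
            break
    return order[:k_star]                       # step 4: declare the k* smallest p-values
```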




Fig. 2. Comparison of the proposed MCP based on IM with the MCP based on Benjamini–Hochberg’s false discovery rate control and the MCP of gFWER control. The ROC curves are plots of true positive rates versus false positive rates at various significance levels, where the origin corresponds to no rejection in the multiple testing.

This MCP guarantees that our claimed number of true alternatives has a high level of belief (1 − α), and the plausibility of the number of true alternatives being less than our declared value k* is under control.

We use a simulation example to compare the proposed MCP with other existing procedures, more specifically, Benjamini–Hochberg's false discovery rate control [5] and gFWER [10]. The observed data are generated independently from the following distributions:

X_i ∼ N(μ_i, 1),  with μ_i = 0 for i = 1, ..., 1350 and μ_i = 3 for i = 1351, ..., 1500.

We repeat the simulation five times and compare the three MCPs in terms of ROC (Receiver Operating Characteristic) curves. A ROC curve is a plot of the true positive rate against the false positive rate for a multiple comparison procedure at various significance levels, where the true positive rate is the fraction of correctly rejected hypotheses out of the alternative hypotheses, or power, and the false positive rate is the fraction of falsely rejected hypotheses out of the null hypotheses, or type I error probability. The ROC plot provides a tool to compare different MCPs: the curve that has larger power for a given type I error corresponds to a better MCP. Fig. 2 shows three ROC curves for the proposed MCP, namely the IM procedure, the Benjamini–Hochberg procedure, and a gFWER procedure, respectively. For various significance levels, i.e., a variety of α values, the average true positive rate and false positive rate over the five simulations are plotted. The origin in the plot corresponds to no rejection in the multiple testing. Fig. 2 indicates that the IM procedure has performance comparable with the other two commonly used MCPs.

While the Benjamini–Hochberg procedure controls the false discovery rate, which is defined as the expected value of the ratio of falsely rejected hypotheses to all rejections, the gFWER procedure controls the probability of the event that there are at least v falsely rejected hypotheses. We use v = 15 in Fig. 2 but have also tried v = 5 and v = 25. The ROC curves of the gFWER procedure for v = 5, 15, and 25 are close to each other, with the curve for v = 15 in the middle of the other two (figure not shown). Both the Benjamini–Hochberg procedure and gFWER are frequentist procedures, based on the probability model for the observed data. The IM procedure, however, is based on the improved fiducial type of probabilistic inference, which provides a level of belief for hypotheses rather than a way to control a frequentist error rate. The quantity 1 − p(A_k), or q(A_k) + r(A_k), is analogous to the adjusted p-value [10] for existing MCPs, such as the gFWER procedure. We use q(A_k) + r(A_k) rather than q(A_k) here due to the sub-additivity of the IM probability measure. IM provides degrees of belief directly for the assertion of interest,

A_k = {the number of true H_a^{(i)} is at least k},

and offers a new multiple comparison procedure with comparable power to the other popular MCPs. We consider the IM procedure favorable in terms of its probability interpretation of the number of true alternative hypotheses.


Fig. 3. The probability p(A_k) of at least k significant genes out of 4608 in the gene expression microarray example.

4. Application in microarray data analysis

We study an HIV data set included in an R package called nudge (http://bioconductor.org/packages/2.2/bioc/html/nudge.html) for detection of differential gene expression. The data consist of cDNA from CD4+ T cell lines at 1 hour after infection with HIV, from a study by van't Wout et al. [29]. The data have the following structure (4608 × 8 data values):

            Sample 1                 Sample 2
            Dye 1       Dye 2        Dye 1       Dye 2
  Gene      r1    r2    r1    r2     r1    r2    r1    r2

where Sample 1 and Sample 2 correspond to the HIV-infected and the control samples, Dye 1 and Dye 2 correspond to two dye-labeling schemes, and r1 and r2 correspond to two duplicate microarray slides.

A standard approach to identifying significant gene expression is to calculate z-scores from the 4608 individual two-sample t-tests based on the four log-intensity values in Sample 1 and the four log-intensity values in Sample 2, where z_i = Φ^{−1}(F_6(t_i)) with F_6 the cumulative distribution function of a standard t variable with 6 degrees of freedom. The null distribution for these z-scores is modeled as either a standard normal or an empirical normal distribution. Then Benjamini–Hochberg's or the gFWER method can be employed to detect significant genes.

Alternatively, we apply the probabilistic inferential model for the multiple testing problem. The original 4608 × 8 data matrix is first transformed so that, for a given gene, the average difference between Sample 1 and Sample 2 is normally distributed and has mean zero for most of the 4608 genes. We denote the test statistic by Z_g, for g = 1, ..., 4608 genes. The details of the transformation are provided in the supplementary material. Identifying significantly expressed genes in this example then becomes the many-normal-means problem discussed in Section 3. We test 4608 hypotheses that the mean of Z_g is zero versus the alternative that it is non-zero.

The probability p(A_k) shown in Fig. 3 is the probability that the number of true alternative hypotheses is at least k, i.e., that there are at least k significant genes. This probability is about 1 for k up to 40, then drops dramatically and reaches 0 when k = 80. The probability curve indicates that the number of significantly expressed genes is in the range of 40 to 60.

In this gene expression data set, there are 13 genes known to be differentially expressed (HIV genes) and 29 genes known not to be (non-human genes). These two sets of genes serve as positive and negative controls, respectively. The 13 positive control genes have the 13 largest Z-values using our method and are marked in red in the list of extreme values in Fig. 4. The 29 negative control genes are marked in green in Fig. 4, where 28 out of 29 negative controls have small Z-values (absolute value less than 3) and hence are correctly identified as negative.

As comparisons, Benjamini–Hochberg's MCP and the gFWER procedure are also applied. Benjamini–Hochberg's procedure rejects 45 hypotheses with the false discovery rate controlled at α = 0.05. The rejection set includes the same 13 positive control genes and the one negative control gene that is rejected by the IM procedure. We apply two gFWER procedures. One has the probability of at least v = 25 falsely rejected hypotheses controlled at α = 0.05 and the other has the probability of at least v = 40 falsely rejected hypotheses controlled at α = 0.05.
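The gene-wise z-score transformation described above can be coded in a few lines. The sketch below (ours, not from the paper) uses a pooled-variance two-sample t statistic with 4 + 4 − 2 = 6 degrees of freedom per gene, which is one standard choice consistent with the description; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def z_scores(sample1, sample2):
    """Gene-wise z-scores z_i = Phi^{-1}(F_6(t_i)) from pooled-variance two-sample
    t statistics (4 + 4 - 2 = 6 df for this data layout); arrays are genes x replicates."""
    n1, n2 = sample1.shape[1], sample2.shape[1]
    m1, m2 = sample1.mean(axis=1), sample2.mean(axis=1)
    sp2 = ((n1 - 1) * sample1.var(axis=1, ddof=1) +
           (n2 - 1) * sample2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    t_stat = (m1 - m2) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return norm.ppf(t_dist.cdf(t_stat, df=n1 + n2 - 2))
```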


Fig. 4. The histogram of the Z -values with the underlying curve of the standard normal density. The Z -values for positive and negative controls are marked by red (text) and green (dots). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The two gFWER procedures reject 42 and 44 hypotheses respectively, both containing the 13 positive control genes and the one negative control gene that is rejected by the IM procedure. We conclude that the three MCPs have comparable performances in this microarray gene expression example.

5. Concluding remarks

We consider a formal way of summarizing uncertainty in statistical inference. We focus on the problem of hypothesis testing, especially multiple testing, via a probabilistic inferential model about an unobserved ordered uniform sample. We show that hypothesis testing problems can be treated as (p, q, r) computations. With the example of microarray data analysis, we have demonstrated the application of probabilistic inference in multiple testing. The proposed method can be extended to other problems involving simultaneous hypothesis testing, including the many-Poisson-means problem and variable selection in linear regression. In particular, the development of such methods for multinomial distributions with a large number of categories is both theoretically challenging and practically useful for single nucleotide polymorphism (SNP) analysis in genome-wide association studies. As a first step, without addressing the multiple testing issue, the authors have developed another probabilistic inferential model for multinomial distributions [25].

Optimal IM methods can be developed using more efficient predictive random sets, which may outperform other existing MCPs. The PRS used in this paper for multiple testing is, to some extent, conservative. The basic idea is to use the fact that if there are at most k signals, there must be at least n − k data points that can be considered as a sample of size n − k from the standard normal. The PRS does not assume any distribution for the alternative hypothesis and hence can be used for more general alternatives. When the normal distribution model is indeed adequate, more efficient methods can be developed. These will be topics of our future research.

Acknowledgements

This work is supported by the National Science Foundation Grant DMS-1007678. The authors thank the Editor, the Area Editor, and the anonymous referees for their very helpful comments and suggestions that helped enhance the paper. The authors thank Nan Gu for his assistance with the simulation example in Section 3.3.

Appendix A. An optimal matching algorithm

The maximization problem at the end of Section 3.1 can be solved in theory by evaluating the objective function for all possible "n-choose-m" combinations C_n^m. This brute-force method is inefficient. Here, we propose an efficient algorithm for solving this discrete optimization problem. Let U_(i) = Φ(X_(i)) for i = 1, ..., n. Rewrite the objective function as

g(s) = Σ_{i=1}^{m} [α_i ln U_(s(i)) + β_i ln(1 − U_(s(i)))],


Fig. 5. The range-expansion scheme of the proposed optimal matching algorithm (Step-2). The dots in the r_(i) row are α_i/(α_i + β_i) for i = 1, ..., m. The dots in the U_(i) row are the observed sample. The observation U_(14) is a preferred one in the range of the desired solution, but is mapped (back) to three points 8, 9, 10 in {1, ..., m}, given by s(8) = s(9) = s(10) = 14. Two possible ways of reducing the number of links to U_(14) toward a one-to-one mapping are shown, by modifying the leftmost link (i.e., from r_(8) to U_(14)) or the rightmost link (i.e., from r_(10) to U_(14)). The corresponding subsequent changes to other links, shown by dashed arrows, are required by the monotonicity of the desired mapping. The option with the larger boundary function value is chosen. Note that there exist cases where there is only one possible way, e.g., the two links to U_(1) and the two links to U_(2).

where s ∈ C_n^m represents a strictly monotone mapping from {1, ..., m} to {1, ..., n}, that is, 1 ≤ s(1) < ··· < s(m) ≤ n. The basic idea of the proposed algorithm is to (i) find the solution over an expanded or enlarged space obtained by allowing s to be a many-to-one mapping, and (ii) make iterative adjustments toward the solution in the original space, where the sought mapping s from {1, ..., m} to {1, ..., n} is one-to-one. That is, the proposed algorithm is iterative. Each iteration expands the range of the current mapping s(·) by including one more index that belongs to the range of the desired solution. This expansion scheme is depicted in Fig. 5.

The details are as follows. First, expand C_n^m to C̃_n^m to allow s to be any many-to-one monotone mapping. That is, a mapping in C̃_n^m can map multiple points in {1, ..., m} to a single point in {1, ..., n}. In other words, C̃_n^m corresponds to "choosing m from n with replacement". In this way, s(i) takes its most preferred match, the best U_(s(i)) that maximizes α_i ln U_(s(i)) + β_i ln(1 − U_(s(i))). Consequently, the corresponding optimal mapping s can be obtained straightforwardly by solving, for each i = 1, ..., m,

s(i) = arg max_{1 ≤ j ≤ n} [α_i ln U_(j) + β_i ln(1 − U_(j))].   (4)
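As an illustration of the initialization step (4) only (not of the full matching algorithm), a small Python sketch follows; the function name is ours.

```python
import numpy as np

def initial_mapping(u_sorted, alpha, beta):
    """Initialization (4): for each i, choose the index j maximizing
    alpha[i]*ln(U_(j)) + beta[i]*ln(1 - U_(j)); the result may be many-to-one."""
    scores = (np.outer(alpha, np.log(u_sorted))
              + np.outer(beta, np.log(1.0 - u_sorted)))    # m x n score matrix
    return scores.argmax(axis=1)                            # 0-based s(i), i = 1, ..., m
```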

Algorithm 1 (An optimal matching algorithm). Start with the initial mapping s(·) defined by (4). Repeat the following three steps until s(·) is one-to-one:

Step-1. Set M_j = {i: s(i) = j} for j = 1, ..., n, and take an arbitrary j from {j: |M_j| > 1} (to move one element of M_j to the next best match on either the left or the right side).

Step-2. Let Z_j ≡ {k: k < j; |M_k| = 0}. If Z_j = ∅, set L = 0 and l = s; otherwise set L = 1 and define

l(i) = max Z_j   if i = min M_{max Z_j + 1};
l(i) = s(i − 1)  if min M_{max Z_j + 1} < i ≤ min M_j;
l(i) = s(i)      otherwise.

Define R and r(i) in the same fashion as L and l(i), but shifting to the right.

Step-3. Set s = l if L g(l) > R g(r), and set s = r otherwise.

The following lemma is useful for establishing the needed convergence results for the proposed algorithm.

Lemma 1. Let S be the desired range, i.e., S = {j: j = s(i) for some i = 1, ..., m}. Let R be a subset of S. If R consists of a preferred set of consecutive indices for the image of a set of consecutive indices in {1, ..., m}, with |{i: s(i) ∈ R}| > |R|, then the expanded subset R ∪ {min(R) − 1} or R ∪ {max(R) + 1} is a subset of the desired range.


Proof. First, the fact that the desired mapping is one-to-one implies that R has to be expanded. The unimodality of each additive pair of components, α_i ln U_(s(i)) + β_i ln(1 − U_(s(i))), of the objective function and the fact that R is a preferred subset require the expansion index to be one of the two indices min(R) − 1 and max(R) + 1. This implies that the desired expansion is either R ∪ {min(R) − 1} or R ∪ {max(R) + 1}. We note that the two possible expansions are constructed by Step-2 of the optimal matching algorithm and that one of the two expansions is chosen by Step-3 of the algorithm. □

Theorem 1. The optimal matching algorithm converges to the desired solution in exactly Σ_j (|M_j| − 1) iterations.

Proof. First, it is easy to see that the algorithm will converge because each iteration reduces the size of some M_j with |M_j| > 1 by 1. That is, the algorithm converges in exactly Σ_j (|M_j| − 1) iterations, where the M_j refer to those defined at the beginning of the algorithm. Note that the initial subset R, the range of the starting many-to-one solution, satisfies the conditions of Lemma 1. Thus, it follows from Lemma 1 that the optimal matching algorithm converges to the desired solution, which is uniquely determined by its range since the mapping is both monotonic and one-to-one. □

Appendix B. Supplementary material

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.ijar.2013.09.017.

References

[1] B. Apolloni, D. Malchiodi, S. Gaito, Algorithmic Inference in Machine Learning, 2nd ed., International Series on Advanced Intelligence, vol. 5, Advanced Knowledge International, Magill, Adelaide, 2006.
[2] B. Apolloni, S. Bassis, Algorithmic inference of two-parameter gamma distribution, Commun. Stat., Simul. Comput. 38 (9) (2009) 1950–1968.
[3] B. Apolloni, S. Bassis, Confidence about possible explanations, IEEE Trans. Syst. Man Cybern., Part B, Cybern. 41 (6) (2011) 1639–1653.
[4] M.S. Balch, Mathematical foundations for a theory of confidence structures, Int. J. Approx. Reason. 53 (2012) 1003–1019.
[5] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. B 57 (1995) 289–300.
[6] J.O. Berger, Could Fisher, Jeffreys and Neyman have agreed on testing?, Stat. Sci. 18 (2003) 1–32.
[7] A.P. Dempster, Further examples of inconsistencies in the fiducial argument, Ann. Math. Stat. 34 (1963) 884–891.
[8] A.P. Dempster, Dempster–Shafer calculus for statisticians, Int. J. Approx. Reason. 48 (2008) 265–377.
[9] T. Denoeux, Constructing belief functions from sample data using multinomial confidence regions, Int. J. Approx. Reason. 42 (3) (2006) 228–252.
[10] S. Dudoit, M.J. van der Laan, Multiple Testing Procedures with Applications to Genomics, Springer, New York, 2008.
[11] B. Efron, Microarrays, empirical Bayes and the two-groups model, Stat. Sci. 23 (2008) 1–22.
[12] B. Efron, R. Tibshirani, J.D. Storey, V. Tusher, Empirical Bayes analysis of a microarray experiment, J. Am. Stat. Assoc. 96 (2001) 1151–1160.
[13] D. Ermini Leaf, J. Hui, C. Liu, Statistical inference with a single observation of N(θ, 1), Pak. J. Stat. 25 (2009) 571–586.
[14] D. Ermini Leaf, C. Liu, Inference about constrained parameters using the elastic belief method, Int. J. Approx. Reason. 53 (5) (2012) 709–727.
[15] J.A. Ferreira, A.H. Zwinderman, On the Benjamini–Hochberg method, Ann. Stat. 34 (4) (2006) 1827–1849.
[16] R.A. Fisher, Inverse probability, Proc. Camb. Philos. Soc. XXVI (1930) 528–535.
[17] R.A. Fisher, Statistical Methods and Scientific Inference, 3rd ed., Haffner Press, New York, 1973.
[18] J. Hannig, On generalized fiducial inference, Stat. Sin. 19 (2009) 491–544.
[19] J. Hannig, T.C.M. Lee, Generalized fiducial inference for wavelet regression, Biometrika 96 (2009) 847–860.
[20] F. Liang, C. Liu, N. Wang, A sequential Bayesian procedure for identification of differentially expressed genes, Stat. Sin. 12 (2007) 571–597.
[21] R.G. Martin, J. Zhang, C. Liu, Dempster–Shafer theory and statistical inference with weak beliefs, Stat. Sci. 25 (2010) 72–87.
[22] R. Martin, C. Liu, Conditional inferential models: combining information for prior-free probabilistic inference, 2012, unpublished manuscript.
[23] R.G. Martin, C. Liu, Inferential models: a framework for prior-free posterior probabilistic inference, J. Am. Stat. Assoc. 108 (2013) 301–313.
[24] R.G. Martin, C. Liu, On a 'plausible' interpretation of p-values, Stat. Sin. (2013), submitted for publication.
[25] C. Liu, J. Xie, Large scale two sample multinomial inferences and its applications in genome-wide association studies, Int. J. Approx. Reason. (2013), http://dx.doi.org/10.1016/j.ijar.2013.04.010.
[26] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, New Jersey, 1976.
[27] J.D. Storey, A direct approach to false discovery rates, J. R. Stat. Soc. B 64 (2002) 479–498.
[28] J.D. Storey, The positive false discovery rate: a Bayesian interpretation and the q-value, Ann. Stat. 31 (2003) 2013–2035.
[29] A.B. van't Wout, G.K. Lehrma, S.A. Mikheeva, G.C. O'Keeffe, M.G. Katze, R.E. Bumgarner, G.K. Geiss, J.I. Mullins, Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+-T-Cell lines, J. Virol. 77 (2003) 1392–1402.
[30] M. Xie, K. Singh, Confidence distribution, the frequentist distribution of a parameter – a review, Int. Stat. Rev. (2013), http://dx.doi.org/10.1111/insr.12000.
[31] M. Xie, K. Singh, W.E. Strawderman, Confidence distributions and a unifying framework for meta-analysis, J. Am. Stat. Assoc. 106 (2011) 320–333.
[32] S.L. Zabell, R.A. Fisher and the fiducial argument, Stat. Sci. 7 (1992) 369–387.