Simultaneous tests for patterned recognition using nonparametric partially sequential procedure


Statistical Methodology 5 (2008) 535–551 www.elsevier.com/locate/stamet

Uttam Bandyopadhyay a,∗, Amitava Mukherjee a, Barendra Purkait b
a Department of Statistics, University of Calcutta, Kolkata, India
b Geological Survey of India, Kolkata, India

Received 13 September 2007; received in revised form 4 February 2008; accepted 4 February 2008

Abstract

In the present paper we introduce a partially sequential sampling procedure to develop a nonparametric method for simultaneous testing. Our work, as in [U. Bandyopadhyay, A. Mukherjee, B. Purkait, Nonparametric partial sequential tests for patterned alternatives in multi-sample problems, Sequential Analysis 26 (4) (2007) 443–466], is motivated by an investigation of arsenic contamination in ground water. Here we incorporate the idea of multiple hypotheses testing as in [Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society B 57 (1995) 289–300]. We present some Monte Carlo studies of the proposed procedure and observe that the proposed sampling design reduces the expected sample sizes in different situations. The procedure as a whole effectively describes testing under dual patterned alternatives. We briefly indicate some large sample results and present a detailed analysis of geological field survey data.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Arsenic contamination; Dual patterned alternatives; Ground water quality; Multiple hypotheses tests; Partial sequential sampling; Wilcoxon score; World Health Organization

∗ Corresponding address: Department of Statistics, University of Calcutta, 35, Ballygunge Circular Road, 700019 Kolkata, West Bengal, India.

© 2008 Elsevier B.V. All rights reserved. 1572-3127/$ - see front matter. doi:10.1016/j.stamet.2008.02.001

1. Introduction

It is well known that different parts of eastern Asia are severely affected by naturally occurring high arsenic (As) concentrations in ground water. These typically occur in young areas formed by Quaternary aquifers in alluvial and deltaic sediments. Bandyopadhyay et al. [4] present an elaborate discussion of the presence of arsenic in ground water, its impact on human health and various recent initiatives by different organizations. For further discussion one may see, for example, [1,11,18,19,14]. According to the latest guideline [21] of the World Health Organization (WHO), a concentration greater than 0.01 mg/l is referred to as high. In India, the allowable limit is still considered to be 0.05 mg/l. High concentrations have also been identified in eight districts of West Bengal, India, and in some adjacent parts of Bangladesh. Our case study is confined to four blocks of the Malda district of West Bengal. The study area is part of the Ganga–Brahmaputra delta, comprising Quaternary sediments deposited by the Ganga–Pagla–Bhagirathi–Mahananda river system.

From the past records of the data used by Purkait and Mukherjee [14], we see that the levels of arsenic contamination in ground water show some pattern. They considered the arsenic contamination level in ground water as the response variable and the depth of the water source as the predictor. They then used suitable spline smoothers to fit generalised additive models and to identify the relationship between the response variable and the predictor. The fitted models reveal that two major patterns dominate in different areas. In some parts, mainly in moderately affected blocks like English Bazaar and Kaliachak-3, there is a decreasing pattern, while in other parts, namely the highly affected Kaliachak-1 and 2, there is an increasing pattern in the first phase followed by a decreasing pattern. This is clear from Figs. 1.1 and 1.2. However, the smoothed additive model explains only a little of the null deviance. Additive models of this type always have limitations when the volume of data is not large enough.

Fig. 1.1. Arsenic level (mg/l) versus depth (m) scatter plot with spline smoother for high affected zone.
Even when the smoother is applied to the whole data set, combining all four blocks, only weak evidence of an umbrella pattern is found. This is shown in Fig. 1.3. However, as pointed out earlier, the smoothers explain only about 10% of the null deviance, which is far from satisfactory. Thus, instead of choosing arbitrary models, we develop in the present work nonparametric tests to recognize whether there is really an underlying pattern or whether it is merely a result of oversmoothing. To test such hypotheses we need to collect water samples from different layers beneath the soil. Different layers may be taken to form different populations. The consolidated layering of sediment types up to 60 m depth at that place is given below:


Fig. 1.2. Arsenic level (mg/l) versus depth (m) scatter plot with spline smoother for moderately high affected zone.

Fig. 1.3. Arsenic level (mg/l) versus depth (m) scatter plot with spline smoother for pooled sample.

0–12 m: Fine sand with lenses of silty and brown clay and carbonate materials and at places inclined laminated micaceous silty materials;
12–18 m: Light brown and greenish white fine sands;
18–19.5 m: Fine to medium sands with pebbles at bottom 3 cm;
19.5–21 m: Greyish white, micaceous, medium sands;
21–24 m: Medium sands with elliptical to rounded ferruginous concretions and brownish clay lenses;
24–27.5 m: Coarse sands with elliptical to rounded ferruginous concretions and brownish clay lenses;
27.5–46 m: Alternation of coarse and pebbly sands and at places gritty with brownish patch and brownish clay lenses and calcareous concretions;
46–48 m: Ash grey, sticky clay with pebbles concretions;
48–60 m: Medium to very coarse sands at places calcareous.


For various reasons, collecting water from the first layer, referred to as the surface water sample, takes much less time and costs much less. Thus we choose to take a fixed sample of suitable size from the surface. This is our first sample. We introduce the random variable $X_k$ to denote the level of arsenic contamination (in mg/l) in the $k$th water sample, $k = 1, 2, \ldots, m$, so that $m$ is the initial, or first, sample size. Our working assumption is that the $X_k$'s are independent for all $k$, which can be checked using a simple nonparametric test based on runs. When there is no pattern, the levels of arsenic contamination in the other layers may be considered to be similar to that of the first. However, one cannot rule out the possibility of a gradual change in the location parameter from one population to another. This gradual change may occur in a particular direction or with an umbrella type pattern. Moreover, collecting water samples from greater depths is both more time consuming and costlier. Hence, from all other populations, we record observations following a sequential sampling scheme. In this way we not only reduce the sample size needed for decision making but also save on the cost of the experiment. This leads to a partial sequential inference procedure.

Partial sequential procedures were introduced by Wolfe [20] and by Orban and Wolfe [12] for testing problems related to two d.f.'s. Starting from a fixed number of sample blocks based on the initially sampled $X$ observations, they proposed an inverse sampling scheme that assigns nonnegative and nondecreasing scores to these sample blocks and attaches an appropriate score to each newly drawn $Y$ observation. That is, each $Y$-observation is assigned the score of the sample block in which it lies.
Various properties of such partial sequential procedures are studied by Orban and Wolfe [12,13], Randles and Wolfe [15], Chatterjee and Bandyopadhyay [8], Bandyopadhyay and Mukherjee [2] and Bandyopadhyay et al. [3]. In the present work we consider such a partial sequential technique for dual patterned alternatives, extending the several-sample partially sequential problems in [7,4]. Sample observations from the populations other than the surface are drawn as required. This, in turn, results in a prefixed expected sample size under the null hypothesis for the rest of the populations. Under the alternative hypothesis, it is expected to be even lower. So the obvious choice is an inverse sampling scheme based on the partial sequential procedure. Let us introduce the random variable $Y_i$ to denote the arsenic contamination level at the $i$th layer beneath the surface for $i = 1, 2, \ldots, s$. The layer immediately beneath the surface is represented by $Y_1$, the next layer by $Y_2$ and so on. The deepest layer is represented by $Y_s$. Based on the above information, we provide a statistical framework in the next section.

The rest of the paper is organized as follows. Section 3 describes the sampling strategy. The appropriate test statistics are introduced in Section 4. Some empirical results based on a Monte Carlo study are obtained in Section 5. Some asymptotic results are discussed in Section 6. Section 7 contains a case study using field data. Finally, Section 8 concludes with a discussion.

2. Statistical framework

In the light of the discussion of the above section, we may frame the problem statistically as follows. Let $X$ and $Y_1, Y_2, \ldots, Y_s$ be $(s+1)$ real-valued random variables such that $X \sim F$ and $Y_i \sim G_i$ for $i = 1, 2, \ldots, s$, where $F$ and the $G_i$'s are unknown continuous d.f.'s. We assume that $X$ and the $Y_i$'s are all independent and that $G_i(x) = F(x - \delta_i)$, $-\infty < \delta_i < \infty$, for $i = 1, 2, \ldots, s$.
In the present investigation, based on a partial sequential sampling scheme in which there is a random sample on $X$ of fixed size, we are concerned with the problem of testing

$$H_0 : \delta_i = 0 \quad \text{for } i = 1, 2, \ldots, s$$

against the order restricted alternative

$$H_1 : 0 \ge \delta_1 \ge \delta_2 \ge \cdots \ge \delta_s$$

with at least one strict inequality, or

$$H_2 : 0 \le \delta_1 \le \delta_2 \le \cdots \le \delta_l \ge \delta_{l+1} \ge \cdots \ge \delta_s$$

with at least one strict inequality. The form of the alternative may also be $H_1 \cup H_2$.

Clearly, the usual single hypothesis testing problem for $H_0$ against the alternative $H_1 \cup H_2$ will not serve the purpose any more. The simple reason is that, when the null hypothesis is rejected, one cannot be sure about the true nature of the pattern. It may be either $H_1$ or $H_2$. To be more precise, we may have, as alternatives, three mutually disjoint possibilities: $H_1 \cap H_2^c$, $H_1^c \cap H_2$, or simply $H_1 \cap H_2$. The first of these three hypotheses indicates that the initial location shift is leftward. The second describes the situation where the initial location shift is towards the right, with possible reversal or stability at higher layers. The third represents populations with an initially stable location parameter and some leftward shift at higher layers.

Naturally one may think of a two stage testing procedure when the null hypothesis is rejected, that is, testing $H_1$ versus $H_2$ in a second stage. Note that $H_1$ and $H_2$ are not mutually disjoint. Further, the choice between $H_1$ and $H_2$ would be vague from the practical point of view. On the mathematical side, the distribution of the test statistic under $H_1$ or $H_2$ is not distribution free. So the second stage test would have neither mathematical lucidity nor practical utility. Nevertheless, we can easily tackle the problem using the ideas of multiple hypotheses testing. Much research has been carried out in this area in the past twenty years. Simes [16] presented an improved Bonferroni procedure for multiple tests of significance. This procedure achieves weak control over the family-wise error rate.
For strong control over the false discovery rate, Benjamini and Hochberg [5] introduced a simple technique for multiple hypotheses tests (henceforth the BH procedure). For $\xi$ hypotheses tests, viz., $H_i = 0$ versus $H_i = 1$ for $i = 1, 2, \ldots, \xi$, they propose an algorithm for simultaneous inference. The algorithm first finds the ordered observed p-values $p_{(1)} \le p_{(2)} \le \cdots \le p_{(\xi)}$, then calculates $\hat{k} = \max\{k : p_{(k)} \le \alpha k/\xi\}$, and finally rejects the null hypotheses corresponding to $p_{(1)}, \ldots, p_{(\hat{k})}$. Benjamini and Yekutieli [6] show that the BH procedure is also valid under some kinds of dependence. One may see, for example, the web based lecture notes on "Multiple hypothesis testing - recent developments and future challenges" by Steinsland [17]. Multiple hypotheses tests are nowadays used extensively in microarray experiments, say, with DNA microarrays. To identify differentially expressed genes, methods have been developed for measuring expression levels of thousands of genes simultaneously. One may see the review article on multiple hypotheses testing in microarray experiments by Dudoit et al. [10].

We use typical multiple hypotheses tests in the environmental problem with a little modification. We propose to carry out two tests simultaneously. One is for $H_0$ against $H_1$ and the other is for $H_0$ against $H_2$. Let $p_1$ and $p_2$ be the p-values for the two tests respectively. We reject the null hypothesis at level $\alpha$ according to the following practically effective rule. We decide in favour of one of the following:


i. $H_1 \cap H_2$ if $\min(p_1, p_2) < \alpha/2$ and also $\max(p_1, p_2) < \alpha$,
ii. $H_1 \cap H_2^c$ if $p_1 < \alpha/2$ but $p_2 > \alpha$,
iii. $H_2 \cap H_1^c$ if $p_2 < \alpha/2$ but $p_1 > \alpha$.

3. Sampling strategy

Let $\mathbf{X}_m = (X_1, X_2, \ldots, X_m)$ be an initial random sample of fixed size $m$ corresponding to the random variable $X$. Let $Y_{i1}, Y_{i2}, \ldots, Y_{in}, \ldots$ be sequentially drawn observations corresponding to the random variable $Y_i$ for $i = 1, 2, \ldots, s$. Then, for any $i$, using the Wilcoxon score, the partial sequential procedure based on the inverse sampling scheme of Orban and Wolfe may be described by the following stopping variables:

$$M_i^+ = \min\{n : \max(S_{in}, n - S_{in}) \ge r/2\}, \tag{3.1}$$

where $r$ is a prefixed positive number and $S_{in} = \sum_{k=1}^{n} F_m(Y_{ik})$, with $F_m(\cdot)$ being the empirical d.f. based on $\mathbf{X}_m$. Thus, corresponding to a $Y$-observation drawn at any stage, the score is $j/(m+1)$ if the observation belongs to the $j$th sample block of $\mathbf{X}_m$, $j = 1, 2, \ldots, m+1$. It can easily be seen that the number of such blocks remains fixed for any draw. Note that such a stopping rule is an extension of the stopping rule

$$M_i = \min\{n : S_{in} \ge r/2\}, \tag{3.2}$$

described by Orban and Wolfe [12]. Such a rule is used by Costello and Wolfe [7] and also by Bandyopadhyay and Mukherjee [2] for testing the identity of several populations against monotonic increasing type ordered alternatives. The rule (3.1) is also a modification of the stopping rule

$$\bar{M}_i = \min\{n : (n - S_{in}) \ge r/2\}, \tag{3.3}$$

used for testing the identity of several populations against a monotonic decreasing trend in location among the different $Y$-populations. Observe that, as $\max(S_{in}, n - S_{in}) \ge S_{in}$ for any $n$, we have

$$M_i \ge M_i^+ \quad \text{and similarly} \quad \bar{M}_i \ge M_i^+, \quad \text{but} \quad \min(M_i, \bar{M}_i) = M_i^+ \tag{3.4}$$

for every $i$, where $M_i$ and $\bar{M}_i$ are, respectively, the same as in (3.2) and (3.3). (3.4) essentially indicates that the proposed rule will always ensure a gain in sample size from the $Y$-populations. Now we set

$$M_i^* = \begin{cases} M_i^+ & \text{if } M_i^+ - S_{iM_i^+} < r/2 \le S_{iM_i^+}, \\ 2r - M_i^+ & \text{if } S_{iM_i^+} < r/2 \le M_i^+ - S_{iM_i^+}. \end{cases} \tag{3.5}$$

This transformation efficiently recovers the loss incurred by considering a smaller number of sample observations when we use (3.1) instead of (3.2) or (3.3). In fact it can easily be seen that, for any $i$, the asymptotic null distribution of $M_i^*$ is exactly the same as that of $M_i$ or $\bar{M}_i$. It is easy to see that, under $H_0$,

$$E[F_m(Y_i)] = \frac{1}{2} \tag{3.6}$$


and, under any $0 \ge \delta_1 \ge \delta_2 \ge \cdots \ge \delta_s$,

$$\frac{1}{2} \ge E[F_m(Y_1)] \ge E[F_m(Y_2)] \ge \cdots \ge E[F_m(Y_s)]. \tag{3.7}$$

Thus, for any positive $r$, we expect a gradually increasing pattern of the $M_i$'s, and similarly of the $M_i^*$'s, under $H_1$. That is, under $H_1$, we expect

$$M_0^* \le M_1^* \le M_2^* \le \cdots \le M_s^* \tag{3.8}$$

with some strict inequality, where $M_0^* = E_{H_0} M_i$, $i = 1, 2, \ldots, s$. We have seen that this value is $r$ even for moderately large $m$. However, from (3.4), we see that the actual numbers of sample observations drawn from the test populations will not increase, because an increasing pattern of the $M_i$'s means a decreasing pattern of the $\bar{M}_i$'s. Equipped with the above, we now slightly modify the statistics introduced by Bandyopadhyay et al. [4] for a monotone trend and for patterned alternatives and use them simultaneously.

4. Test procedure based on regression principle

Bandyopadhyay et al. [4] developed a sequential version of the Hettmansperger and Norton [9] type tests for patterned alternatives. For fixed sample sizes from all the populations, Hettmansperger and Norton [9] developed a test statistic based on $R_{ij}$, the rank of the $j$th observation of the $i$th population in the combined data, where $j = 1, 2, \ldots, n_i$, $n_i$ being the fixed sample size for the $i$th population, and $i = 1, 2, \ldots, s+1$. Let $\bar{R}_i$ be the average rank of the $i$th population in the combined group. Then, under $H_0$, the $\bar{R}_i$'s are likely to be more or less equal except for sampling fluctuations. However, under $H_1$, the $\bar{R}_i$'s are expected to show a decreasing trend. Incorporating this idea, they argue strongly in favour of the test statistic

$$V = \sum_{i=1}^{s+1} \lambda_i (c_i - \bar{c}_\lambda) \bar{R}_i,$$

where the scores $c_i$ are suitably chosen depending upon the pattern of the alternatives and $\bar{c}_\lambda = \sum_{i=1}^{s+1} \lambda_i c_i$ with $\lambda_i = n_i / \sum_{i=1}^{s+1} n_i$.

Along the same line of thinking, Bandyopadhyay et al. [4] considered a nonparametric test for decreasing trend among the different populations by using the statistic

$$V_0 = \frac{1}{s+1} \sum_{i=1}^{s+1} (c_i - \bar{c}) \bar{M}_{i-1}, \tag{4.1}$$

where the $c_i$'s are suitable scores and $\bar{c}$ is the simple arithmetic mean of the $c_i$'s. As regards $\bar{M}_0$, we may note the following simple identity under the null hypothesis: $M_0^* = M_0 = \bar{M}_0 = E_{H_0} M_i = E_{H_0} \bar{M}_i$. For the ordered alternative $H_1$, that is, when $0 \ge \delta_1 \ge \delta_2 \ge \cdots \ge \delta_s$, an obvious choice for $c_i$ is $i$, since this is the Wilcoxon score for the nonparametric trend test. Then we have $\bar{c} = \frac{s+2}{2}$. Consequently the statistic

$$V^* = \frac{1}{s+1} \sum_{i=0}^{s} \left(\frac{s}{2} - i\right) M_i \tag{4.2}$$


may be used for testing $H_0$ against $H_1$. Using (3.1) and (3.5) we further modify the test statistic described in (4.2) as

$$V_1 = \frac{1}{s+1} \sum_{i=0}^{s} \left(i - \frac{s}{2}\right) M_i^* \tag{4.3}$$

for testing $H_0$ against $H_1$. Whereas, for testing $H_0$ against $H_2$, we use

$$V_2 = \frac{1}{s+1} \sum_{i=0}^{s} (\bar{c} - c_i) M_i^* \tag{4.4}$$

with $c_i = i + 1$ for $i = 0, 1, \ldots, l$ and $c_i = 2l + 1 - i$ for $i = l+1, \ldots, s$. This gives

$$\bar{c} = \frac{1}{s+1} \left[\frac{(l+1)(l+2)}{2} + \frac{(s-l)(3l-s+1)}{2}\right] = \frac{2l + 4ls + s + 2 - s^2 - 2l^2}{2(s+1)}.$$

It is intuitively clear that right tailed tests based on $V_1$ and $V_2$ are, respectively, appropriate for testing $H_0$ against $H_1$ and $H_2$. Hence we can evaluate the respective p-values, namely, $p_1$ and $p_2$.

5. Some Monte Carlo studies

In this section we present summarized results of some Monte Carlo simulation studies. The study has a two-fold purpose. In the first phase we present the gain in sample sizes under the null hypothesis when the proposed stopping rule is used instead of the Orban and Wolfe [12] or Bandyopadhyay et al. [4] type stopping rule. We see that if we start with $m = r = 25$, the expected percentage gain in sample size from each of the $Y$-populations under the null hypothesis is 9.6. The gain is about 7.4% if we start with $m = r = 50$ and about 5.6% if $m = r = 100$. The gain in sample size under an alternative will be even better. In the second phase we study the discovery rates of the true pattern, when there is indeed some pattern in the location parameters of the successive populations, between the two proposed procedures at some nominal level, say $\alpha = 0.05$.

We take the distribution of $X$ to be standard normal. A random sample of size $m$ is obtained from it using the software R 2.4.0. Then, under the null hypothesis, the common distribution of the $Y_i$'s corresponding to all other $Y$-populations is also standard normal. To determine the realized value of $M_i^+$, $i = 1, 2, \ldots, s$, observations corresponding to $Y_i$ are drawn sequentially using the stopping rule described in (3.1). Based on 10 000 such replicates of the Monte Carlo experiment, we obtain the upper 5% and 2.5% points of the respective test statistics.
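The sampling step used in these simulations can be sketched as follows. This is a minimal illustration, not the authors' R code: it assumes the score of a new $Y$ is the ordinary empirical d.f. $F_m$ of the $X$-sample, that $r$ is an integer, and that the "percentage gain" is measured as $100\,(1 - \sum M_i^+ / \sum M_i)$ over replicates. The function names `stopping_times` and `estimate_null_gain` are hypothetical.

```python
import numpy as np

def stopping_times(x_sample, draw_y, r, max_draws=100_000):
    """Draw Y-observations one at a time until both one-sided rules stop.

    Returns (M, M_bar, M_plus, M_star):
      M      : first n with S_n >= r/2                  (rule 3.2)
      M_bar  : first n with n - S_n >= r/2              (rule 3.3)
      M_plus : first n with max(S_n, n - S_n) >= r/2    (rule 3.1)
      M_star : M_plus recoded as in (3.5)
    """
    xs = np.sort(np.asarray(x_sample, dtype=float))
    m = len(xs)
    total = 0  # equals m * S_n; kept as an integer to avoid rounding error
    M = M_bar = None
    for n in range(1, max_draws + 1):
        # m * F_m(y): number of X-observations not exceeding the new Y
        total += int(np.searchsorted(xs, draw_y(), side="right"))
        if M is None and 2 * total >= r * m:
            M = n
        if M_bar is None and 2 * (n * m - total) >= r * m:
            M_bar = n
        if M is not None and M_bar is not None:
            break
    else:
        raise RuntimeError("a one-sided rule did not stop within max_draws")
    M_plus = min(M, M_bar)  # identity (3.4)
    # Assumption: if both sides cross on the same draw, use the upper case.
    M_star = M_plus if M <= M_bar else 2 * r - M_plus
    return M, M_bar, M_plus, M_star

def estimate_null_gain(m, r, reps=2000, seed=0):
    """Monte Carlo estimate of the percentage saving of (3.1) over (3.2) under H0."""
    rng = np.random.default_rng(seed)
    sum_M = sum_M_plus = 0
    for _rep in range(reps):
        x = rng.standard_normal(m)  # initial X-sample
        M, _M_bar, M_plus, _M_star = stopping_times(x, rng.standard_normal, r)
        sum_M += M
        sum_M_plus += M_plus
    return 100.0 * (1.0 - sum_M_plus / sum_M)
```

Calling `estimate_null_gain(25, 25)` gives a simulated counterpart of the gain quoted above for $m = r = 25$; whether it matches the reported 9.6 depends on how the authors defined the gain, which the text does not spell out.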
Now under the strictly monotone decreasing type patterned alternative, that is, $H_1 \cap H_2^c$, we generate random samples corresponding to the $X$-population from the standard normal distribution. For the $i$th $Y$-population, random samples are generated from $N(-i\delta, 1)$, where $\delta$ is the common spread between two successive populations. Under the umbrella type patterned alternative given by $H_1^c \cap H_2$, we generate random samples for the $X$-population from the standard normal distribution as before. Thereafter we draw sample observations from $N(i\delta, 1)$ corresponding to $Y_i$, $i = 1, 2, \ldots, l$, and from $N((2l - i)\delta, 1)$ for $Y_i$, $i = l+1, l+2, \ldots, s$. We then determine $M_i^+$ and compute the observed value of the test statistics. If the observed value of a test statistic exceeds the upper 5% (or 2.5%) point of its null distribution, the observed p-value is less than 0.05 (or 0.025). Thus we can determine whether $p_1$, the p-value corresponding to the test for $H_0$ against $H_1$, exceeds 0.025 (or 0.05) or not. A similar argument may be given for $p_2$, the p-value corresponding to the test for $H_0$ against $H_2$.

Using the above set-up, 10 000 replications of the joint test mechanism for a given $m$, $s$, $l$ and $\delta$ are performed separately, first assuming the true pattern to be $H_1 \cap H_2^c$, and then $H_1^c \cap H_2$. For each of the two situations, we note the proportion of times each of the three alternatives is identified. One minus the sum of these three proportions gives the proportion of times the procedure falsely accepts the null hypothesis even when there is a pattern. We summarize the proportions of true as well as false detections under the two most important alternatives. The numerical findings for selected choices of $m$, $l$ and $\delta$ are shown in Table 5.1.

From Table 5.1 it is clear that when the umbrella point occurs at or after the third $Y$-population, the test based on the regression principle performs satisfactorily for $s = 5$. When $s = 10$, the procedure works satisfactorily if the umbrella point occurs at or after the fifth $Y$-population. In such situations the test based on the regression principle ensures a reasonably high probability of detecting the true pattern. That is, under both forms of alternatives, be it a monotonically decreasing trend pattern or an umbrella pattern, the test based on the regression approach detects the true alternative with high probability. This is verified to be true for a wide range of values of $m$ starting from 25 and also for different magnitudes of the spread between the population means.

6. Some asymptotic results

Assume that, for each $m$, there is a positive number $r = r(m)$ such that, as $m \to \infty$, $r \to \infty$ but

$$\frac{r}{m} \to \lambda \in [0, \infty). \tag{6.1}$$

It can easily be seen that, for large $m$, the asymptotic distribution of $\frac{1}{2\sqrt{r}}(\bar{M}_i - r)$ is the same as that of $\frac{1}{\sqrt{r}}\left(S_N - \frac{N}{2}\right)$. From [2], we can see that the corresponding distribution will be normal with mean 0 and variance $\frac{1+\lambda}{12}$. Obviously $(2r - \bar{M}_i)$ will also have the same asymptotic distribution as that of $\bar{M}_i$. Therefore the asymptotic distribution of $M_i$ will also be the same as that of $\bar{M}_i$. Let $\Sigma$ be a positive definite matrix of order $s$ with diagonal elements $\sigma_{ii} = \frac{1+\lambda}{12}$ and off-diagonal elements $\sigma_{ii'} = \frac{\lambda}{12}$, $i \ne i'$. Then, by some straightforward computations, the following theorem can be established.

Theorem 6.1. Under $H_0$, the $s$-component random vector $\left\{\frac{1}{2\sqrt{r}}(M_i - r),\ i = 1, 2, \ldots, s\right\}$ converges to an $s$-variate normal distribution with mean vector $\mathbf{0}$ and dispersion matrix $\Sigma$ as $m \to \infty$.

Using the elementary notion of sampling distributions and the idea of the folded normal distribution, we see that the asymptotic null distributions of $M_i$ and $M_i^*$ are identical. Thus, using [4], the asymptotic null distribution of $V_1$ may be seen to be normal with zero mean and variance depending on $m$, $r$ and $s$. In fact, setting $\lambda = 1$, a straightforward computation shows that the p-value for testing $H_0$ against $H_1$ based on $V_1$ is approximately

$$1 - \Phi\left(6 v_1 \sqrt{\frac{s+1}{m s (s+2)}}\right), \tag{6.2}$$

Table 5.1
Discovery rates under different alternatives

                                   Decision
l   Actual hypothesis   H1∩H2^c   H2∩H1^c   H1∩H2

δ = 0.10, s = 5 and m = 25
1   H1∩H2^c             0.049     0.007     0.392
1   H2∩H1^c             0.009     0.028     0.292
2   H1∩H2^c             0.295     0.015     0.110
2   H2∩H1^c             0.002     0.049     0.008
3   H1∩H2^c             0.439     0.002     0.001
3   H2∩H1^c             0.007     0.160     0

δ = 0.10, s = 5 and m = 50
1   H1∩H2^c             0.077     0.004     0.706
1   H2∩H1^c             0.015     0.040     0.539
2   H1∩H2^c             0.492     0.006     0.246
2   H2∩H1^c             0.001     0.086     0.004
3   H1∩H2^c             0.773     0         0.0006
3   H2∩H1^c             0.004     0.300     0.001

δ = 0.10, s = 5 and m = 75
1   H1∩H2^c             0.045     0.001     0.871
1   H2∩H1^c             0.009     0.037     0.729
2   H1∩H2^c             0.587     0.001     0.302
2   H2∩H1^c             0         0.120     0.005
3   H1∩H2^c             0.911     0         0
3   H2∩H1^c             0.002     0.481     0.001

δ = 0.10, s = 5 and m = 100
1   H1∩H2^c             0.025     0         0.951
1   H2∩H1^c             0.005     0.026     0.859
2   H1∩H2^c             0.572     0         0.385
2   H2∩H1^c             0         0.127     0.003
3   H1∩H2^c             0.974     0         0
3   H2∩H1^c             0.002     0.576     0.001

δ = 0.25, s = 5 and m = 25
1   H1∩H2^c             0.036     0         0.958
1   H2∩H1^c             0.005     0.010     0.942
2   H1∩H2^c             0.710     0         0.252
2   H2∩H1^c             0.007     0.383     0.306
3   H1∩H2^c             0.992     0         0
3   H2∩H1^c             0.001     0.703     0

δ = 0.25, s = 5 and m = 50
1   H1∩H2^c             0         0         1
1   H2∩H1^c             0         0         0.999
2   H1∩H2^c             0.604     0         0.369
2   H2∩H1^c             0.001     0.417     0.528
3   H1∩H2^c             1         0         0
3   H2∩H1^c             0         0.961     0

δ = 0.25, s = 5 and m = 75
1   H1∩H2^c             0         0         1
1   H2∩H1^c             0         0         1
2   H1∩H2^c             0.513     0         0.487
2   H2∩H1^c             0         0.304     0.684
3   H1∩H2^c             1         0         0
3   H2∩H1^c             0         0.997     0

δ = 0.25, s = 5 and m = 100
1   H1∩H2^c             0         0         1
1   H2∩H1^c             0         0         1
2   H1∩H2^c             0.402     0         0.577
2   H2∩H1^c             0         0.191     0.801
3   H1∩H2^c             1         0         0
3   H2∩H1^c             0         0.999     0

δ = 0.10, s = 10 and m = 25
1   H1∩H2^c             0.000     0.000     0.992
1   H2∩H1^c             0.000     0.000     0.989
3   H1∩H2^c             0.141     0         0.842
3   H2∩H1^c             0.008     0.094     0.757
5   H1∩H2^c             0.972     0         0.023
5   H2∩H1^c             0.004     0.662     0.041

δ = 0.10, s = 10 and m = 50
1   H1∩H2^c             0         0         1
1   H2∩H1^c             0         0         1
3   H1∩H2^c             0.011     0         0.989
3   H2∩H1^c             0.001     0.026     0.967
5   H1∩H2^c             0.996     0         0.004
5   H2∩H1^c             0.000     0.859     0.063

δ = 0.10, s = 10 and m = 75
1   H1∩H2^c             0         0         1
1   H2∩H1^c             0         0         1
3   H1∩H2^c             0.001     0         0.999
3   H2∩H1^c             0         0.005     0.995
5   H1∩H2^c             0.998     0         0.002
5   H2∩H1^c             0         0.936     0.050

δ = 0.10, s = 10 and m = 100
1   H1∩H2^c             0         0         1
1   H2∩H1^c             0         0         1
3   H1∩H2^c             0         0         1
3   H2∩H1^c             0         0.001     0.999
5   H1∩H2^c             0.998     0         0.002
5   H2∩H1^c             0         0.937     0.060

where $v_1$ is the observed value of $V_1$. Further, under the same set-up, we see after some simple algebra that the p-value for testing $H_0$ against $H_2$ based on $V_2$ is approximately

$$1 - \Phi\left(\sqrt{\frac{3}{r}}\,(s+1)\, v_2\, [ss(c)]^{-1/2}\right), \tag{6.3}$$

where $v_2$ is the observed value of $V_2$ and $ss(c) = \sum_{i=0}^{s} (c_i - \bar{c})^2$ with $c_0 = 1$.
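As a worked illustration of (4.3), (4.4), (6.2) and (6.3) together with the decision rule i–iii of Section 2, the following sketch computes both statistics, their approximate p-values and the resulting decision. It assumes $\lambda = 1$ (that is, $r = m$) and takes $M_0^* = r$; all function names are hypothetical. The input used in the usage note is the set of $M_i^*$ values reported in the data study of Section 7.

```python
from math import erfc, sqrt

def upper_tail(z):
    """1 - Phi(z) for the standard normal d.f. Phi."""
    return 0.5 * erfc(z / sqrt(2.0))

def umbrella_scores(s, l):
    """c_i = i + 1 for i <= l and c_i = 2l + 1 - i for i > l, i = 0, ..., s."""
    return [i + 1 if i <= l else 2 * l + 1 - i for i in range(s + 1)]

def v1_statistic(m_star):
    """V1 of (4.3); m_star = [M_0*, M_1*, ..., M_s*]."""
    s = len(m_star) - 1
    return sum((i - s / 2) * mi for i, mi in enumerate(m_star)) / (s + 1)

def v2_statistic(m_star, l):
    """V2 of (4.4) with umbrella point l."""
    s = len(m_star) - 1
    c = umbrella_scores(s, l)
    c_bar = sum(c) / (s + 1)
    return sum((c_bar - ci) * mi for ci, mi in zip(c, m_star)) / (s + 1)

def p_value_v1(v1, s, m):
    """Approximate p-value (6.2), valid for r = m."""
    return upper_tail(6 * v1 * sqrt((s + 1) / (m * s * (s + 2))))

def p_value_v2(v2, s, l, r):
    """Approximate p-value (6.3)."""
    c = umbrella_scores(s, l)
    c_bar = sum(c) / (s + 1)
    ss_c = sum((ci - c_bar) ** 2 for ci in c)
    return upper_tail(sqrt(3 / r) * (s + 1) * v2 / sqrt(ss_c))

def decide(p1, p2, alpha=0.05):
    """Decision rule i-iii; any case not covered by the rule retains H0."""
    if min(p1, p2) < alpha / 2 and max(p1, p2) < alpha:
        return "H1 and H2"
    if p1 < alpha / 2 and p2 > alpha:
        return "H1 and not H2"
    if p2 < alpha / 2 and p1 > alpha:
        return "H2 and not H1"
    return "H0 not rejected"
```

With `m_star = [100, 62, 62, 61, 58, 65, 67]`, `s = 6`, `l = 3` and `m = r = 100`, these functions give $v_1 \approx -13.857$, $v_2 \approx 8.673$, $p_1 \approx 0.99925$ and $p_2 \approx 0.00006$, in agreement with the values reported in Section 7, and `decide` returns the umbrella alternative.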


Remark 6.1. Here, as in [4], we choose the same $r$ for drawing sampling units from each $Y$-population to avoid certain ambiguities. It is easy to check that the expected values of the $M_i$'s depend on $r$ as well as on the location parameter of the corresponding distribution. Thus, if different values of $r$ are chosen for different $Y$-populations, it will be difficult to ascertain whether the variations in the $M_i$'s are caused by differences in location or by the different $r$-values. However, even if we use different $r$ for different $Y$-populations, all the asymptotic results will remain valid if we ensure at each stage that the ratio $r/m$ tends to 0. Otherwise the asymptotic results will be more complicated and will not be of much practical interest.

7. A data study

For convenience, we denote the $Y$-populations by $\Pi_1, \ldots, \Pi_8$, and the $X$-population by $\Pi$. Here $s = 8$, and $\Pi$ is the depth layer covering the earth surface up to 12 m beneath the ground. Successive layers starting from 12 m depth, as described in the Introduction, are now referred to as different $Y$-populations. That is, $\Pi_1$ is the layer covering 12 m to 18 m beneath the ground and so on. The depth layer covering 48 m to 60 m constitutes our $Y$-population $\Pi_8$. The interpretation of $\delta_i$, $i = 1, \ldots, 8$, is as before. Here the shift parameter may not be uni-directional. As we see from the simulation studies, the simultaneous test mechanism works well if the umbrella point $l$ is greater than or equal to 3 when the number of $Y$-populations is about 5. Again, the mechanism performs better if the umbrella point $l$ is greater than or equal to 5 when the number of $Y$-populations is about 10. But from the geological viewpoint and according to Figs. 1.1 and 1.3, it is likely that a possible umbrella point exists at $l = 2$. To facilitate the effective use of the simultaneous test mechanism, we redefine our populations in the following way. We split the initial population into two halves.
We refer to the layer 0–6 m as the upper surface $P_0$ and the layer 6–12 m as the lower surface $P_1$. We now treat water samples from $P_0$ as the samples from the initial population and consider the population $P_1$ as just another $Y$-population. We then refer to the earlier $Y$-population $\Pi_i$ as the redefined $Y$-population $P_{i+1}$ for $i = 1, 2, 3$. Further, we combine the prior $Y$-populations $\Pi_4$ and $\Pi_5$ into a single modified $Y$-population $P_5$ and also merge the previous $Y$-populations $\Pi_6$, $\Pi_7$ and $\Pi_8$ into a single modified $Y$-population $P_6$. As a result, we now have altogether 6 $Y$-populations, and the possible umbrella point is at $l = 3$. Such pooling of adjacent $Y$-populations should help in reducing the false discovery rate and give a better result. It is intuitively quite clear that such pooling will hide neither a decreasing trend pattern nor the expected umbrella pattern, if any.

In the present investigation, we collect $m$ (= 100) observations from $\Pi$. These observations constitute the surface water sample, collected from rivers, ponds etc. as well as from ground water sources within 12 m depth. The data are represented in Fig. 7.1. We also check the randomness of these 100 sample observations using the run test. There are 54 runs about the median (= 0.0) against the expected 46.5 runs, which gives a p-value of 0.0972. There are 42 runs about the mean (= 0.0473) against the expected 38.5 runs, which gives a p-value of 0.3466. So in either case we find no reason to reject the null hypothesis of randomness at the 5% level of significance. This justifies our working assumption that the $X_k$'s are independent for all $k$. We then set $r = m$. This choice ensures that the expected sample size corresponding to each $Y$-population under any alternative will not exceed 100 and will meet the requisite cost constraints. We execute the rule described in (3.1) for collecting samples from $P_1, \ldots, P_6$.
We stop sampling as soon as S_in or n − S_in exceeds 50 and apply (3.5) simultaneously to determine the corresponding M_i*'s.
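The stopping mechanism described here can be sketched as follows. This is a minimal illustration and not a transcription of rule (3.1), which lies outside this section: it assumes that each new Y-observation contributes the empirical distribution function of the X-sample evaluated at that observation, that ties are counted with "less than or equal to", and that sampling stops at the first n for which either S_in or n − S_in strictly exceeds the threshold r. The function name and the toy data are hypothetical.

```python
from bisect import bisect_right

def sample_until_boundary(x_sample, y_stream, r):
    """Draw Y-observations one at a time until a boundary is crossed.

    Returns (n, S_n, boundary), where boundary names the quantity that
    exceeded r, or None if the stream is exhausted first.
    """
    xs = sorted(x_sample)
    m = len(xs)
    s = 0.0
    for n, y in enumerate(y_stream, start=1):
        rank = bisect_right(xs, y)   # number of X-values <= y (ASSUMED tie rule)
        s += rank / m                # empirical cdf of the X-sample at y
        if s > r:
            return n, s, "S_in"
        if n - s > r:
            return n, s, "n - S_in"
    return None

# Toy data, purely illustrative: ten X-values and constant Y-streams.
x_toy = list(range(1, 11))
upper = sample_until_boundary(x_toy, [6.5] * 20, r=2)   # each term is 0.6
lower = sample_until_boundary(x_toy, [2.5] * 20, r=2)   # each term is 0.2
print(upper, lower)
```

With m = 100 and r = 50, the same loop yields running sums of the kind tabulated for each Y-population, for example ranks 75, 65, 65 giving 0.750, 1.400, 2.050.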

Fig. 7.1. Box-plot and kernel density plot of 100 observations drawn from upper surface, i.e., X -population. [The depth layer 0–6 m.]

From (3.1), clearly M_i+ is the observed number of sample observations from Πi, i = 1, 2, ..., 8. Table 7.1 as well as Fig. 7.2 gives a brief account of the determination of M_i+ for the six Y-populations. Fig. 7.3 represents the summarized data in the form of a box-plot for each Y-population. We see from Table 7.1 that our observed M_i+-values are, respectively, 62, 62, 61, 58, 65 and 67. Since in every case S_in, and not n − S_in, is the first to exceed 50, we have M_i+ = M_i* for every i. The values s = 6 and m = 100 are large enough for all practical purposes, and hence the large sample results are assumed to be quite valid.

We see that the observed value of V1 is −13.85714. Using (6.2), we find p1 = 0.99925, the asymptotic p-value for the test of H0 against H1. Further, with l = 3, we see that the observed value of V2 is 8.67347. Here we can safely assume that V2 is normally distributed with mean 0 and variance 5.05345. Hence, using (6.3), we evaluate the approximate p-value of the test as p2 = 0.00006. Thus we see that p1 is much greater than 0.05 while p2 is much smaller than 0.025. This gives us ample grounds to doubt the null hypothesis on the basis of the given data at the 5% level. Further, as an immediate consequence, an umbrella pattern is detected.

8. Discussion

Many researchers are at present very much interested in problems of testing under the possible presence of order restricted hypotheses. Tests for order restricted alternatives have been studied extensively using both parametric and nonparametric approaches under both frequentist and Bayesian frameworks. The present paper is a development on the earlier work by Bandyopadhyay et al. [4] using the same sequential sampling scheme and for order

Table 7.1. Execution of the stopping rule to determine the sample size from different Y-populations

Test population   Observation no. (k)   Observed value   R_ik   F_100(Y_ik)   S_ik
P1                1                     0.030            75     0.750         0.750
                  2                     0.000            65     0.650         1.400
                  3                     0.000            65     0.650         2.050
                  –                     –                –      –             –
                  –                     –                –      –             –
                  61                    0.000            65     0.650         49.710
                  62                    0.000            65     0.650         50.360

P2                1                     0.100            90     0.900         0.900
                  2                     0.030            75     0.750         1.650
                  3                     0.070            89     0.890         2.540
                  –                     –                –      –             –
                  –                     –                –      –             –
                  61                    0.030            75     0.750         49.170
                  62                    0.100            90     0.900         50.070

P3                1                     0.000            65     0.650         0.650
                  2                     0.000            65     0.650         1.300
                  3                     0.030            75     0.750         2.050
                  –                     –                –      –             –
                  –                     –                –      –             –
                  60                    0.250            95     0.950         49.320
                  61                    0.300            96     0.960         50.280

P4                1                     0.600            100    1.000         1.000
                  2                     0.250            95     0.950         1.950
                  3                     0.500            100    1.000         2.950
                  –                     –                –      –             –
                  –                     –                –      –             –
                  57                    0.150            91     0.910         49.950
                  58                    0.180            91     0.910         50.860

P5                1                     0.010            66     0.660         0.660
                  2                     0.000            65     0.650         1.310
                  3                     0.010            66     0.660         1.970
                  –                     –                –      –             –
                  –                     –                –      –             –
                  64                    0.200            91     0.910         49.450
                  65                    0.500            100    1.000         50.450

P6                1                     0.060            75     0.750         0.750
                  2                     0.000            65     0.650         1.400
                  3                     0.000            65     0.650         2.050
                  –                     –                –      –             –
                  –                     –                –      –             –
                  66                    0.150            90     0.900         49.640
                  67                    0.070            89     0.890         50.530

restricted hypotheses. Nevertheless, a different sampling strategy is introduced here, which ensures a substantial gain in sample sizes. Moreover, we show the use of a simultaneous testing procedure in pattern recognition. With a little statistical manipulation we analyze field survey data and see that the proposed procedure nicely detects an umbrella pattern. This means

Fig. 7.2. Plot of the partial sequential sums of empirical distribution functions for each Y -population and their boundary crossing.

arsenic contamination level initially increases as depth increases. However, a reversal occurs after a certain layer. In fact, after the layer of fine to medium sands with pebbles at the bottom 3 cm, at 18–19.5 m depth, the arsenic contamination decreases. It is worth mentioning that many research problems in this area remain to be considered. One of the most interesting modifications would be the case of two or more umbrella points. Further, we see from the simulation studies that the proposed procedure works better when

Fig. 7.3. Box-plot of the observations sequentially drawn from each of the test populations.

umbrella points occur at a comparatively later stage. Developing simultaneous tests for pattern recognition that allow for an early umbrella point remains a challenge for the future. That will enable us to analyze the data even without redefining the Y-populations. The tests may also be modified by relaxing our working assumptions of equal scale parameters among the various populations and of independence of the m initial sample observations. We leave all these for future research.

Acknowledgments

The authors are grateful to an anonymous reviewer for his/her various comments, especially for a painstaking rechecking of the computations involving the data. The Council of Scientific and Industrial Research, India has provided partial financial support under the CSIR sanction no F. No 9/28(594)/ 2003-EMR-I to the second author. The third author is grateful to the Deputy Director General, Eastern Region, GSI, for providing infrastructural facilities to carry out this research and use these data for the present work.

References

[1] J.M. Azcue, J.O. Nriagu, in: J.O. Nriagu (Ed.), Arsenic: Historical perspectives. Arsenic in the Environment Part 1: Cycling and characterization: Advances in Environmental Science and Technology, John Wiley and Sons, New York, 1994.
[2] U. Bandyopadhyay, A. Mukherjee, Nonparametric partial sequential test for location shift at an unknown time point, Sequential Analysis 26 (1) (2007) 99–113.
[3] U. Bandyopadhyay, A. Mukherjee, A. Biswas, Controlling type-I error rate in monitoring structural changes using partially sequential procedures, Communications in Statistics: Simulation and Computation 37 (3) (2008) 466–485.
[4] U. Bandyopadhyay, A. Mukherjee, B. Purkait, Nonparametric partial sequential tests for patterned alternatives in multi-sample problems, Sequential Analysis 26 (4) (2007) 443–466.
[5] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society B 57 (1995) 289–300.
[6] Y. Benjamini, D. Yekutieli, A resampling based false discovery rate controlling multiple test procedure, Journal of Statistical Planning and Inference 82 (2001) 171–196.
[7] P. Costello, D.A. Wolfe, Partially sequential treatment versus control multiple comparison, Biometrika 67 (1981) 403–412.
[8] S.K. Chatterjee, U. Bandyopadhyay, Inverse sampling based on general scores for nonparametric two-sample problems, Calcutta Statistical Association Bulletin 33 (1984) 35–58.

[9] T.P. Hettmansperger, R.M. Norton, Tests for patterned alternatives in k-sample problems, Journal of American Statistical Association 82 (1987) 292–299.
[10] S. Dudoit, J.P. Shaffer, J.C. Boldrick, Multiple hypothesis testing in microarray experiments, Statistical Science 18 (2003) 71–103.
[11] R.T. Nickson, J.M. McArthur, P. Ravenscroft, W.G. Burgess, K.M. Ahmed, Mechanism of arsenic release to groundwater, Bangladesh and West Bengal, Applied Geochemistry 15 (2000) 403–413.
[12] J. Orban, D.A. Wolfe, Distribution free partially sequential placement procedure, Communications in Statistics - Theory and Methods A9 (9) (1980) 883–902.
[13] J. Orban, D.A. Wolfe, A class of distribution free two-sample tests based on placements, Journal of American Statistical Association 77 (1982) 666–672.
[14] B. Purkait, A. Mukherjee, A statistical study to correlate different factors influencing the arsenic contamination in ground water: a case study from Malda district, West Bengal, India, in: B.S. Dandapat, B.S. Majumdar (Eds.), Proceedings of International Conference on Application of Fluid Mechanics in Industry and Environment, Research Publishing, Chennai, Singapore, 2006, pp. 173–184.
[15] R.H. Randles, D.A. Wolfe, Introduction to the Theory of Nonparametric Statistics, John Wiley, New York, 1979.
[16] J.R. Simes, An improved Bonferroni procedure for multiple tests of significance, Biometrika 73 (1986) 751–754.
[17] Steinsland. Web based Lecture Note. http://www.math.ntnu.no/ingelins/pres/Presprove.pdf.
[18] A.H. Welch, M.S. Lico, J.L. Hughes, Arsenic in groundwater of the Western United States, Ground Water 26 (1988) 333–347.
[19] A.H. Welch, D.B. Westjohn, D.R. Helsel, R.B. Wanty, Arsenic in groundwater of the United States: Occurrence and geochemistry, Ground Water 38 (2000) 589–604.
[20] D.A. Wolfe, On a class of partially sequential two sample test procedure, Journal of American Statistical Association 72 (1977) 202–205.
[21] World Health Organization, Guidelines for Drinking Water Quality, vol. I, Geneva, 1993.