Hypothesis Testing


CHAPTER 8

Hypothesis Testing: Concept and Practice

8.1  HYPOTHESES IN INFERENCE

A Clinical Hypothesis
Suppose we want to know if a new antibiotic reduces the time for a particular type of lesion to heal. We already know the mean time μ to heal in the population, that is, the general public, without an antibiotic. We take a sample of lesion patients from the same population, randomizing the selection to ensure representativeness (see Section 1.9), and measure the time to heal when treated with the antibiotic. We compare our sample mean time to heal, say ma (a for antibiotic), with μ. Our clinical hypothesis (not statistical hypothesis) is that the antibiotic helps, i.e. ma < μ. To be sure that our comparison gives us a dependable answer, we go through the formal sequence of steps that composes scientific inference.

Decision Theory Introduced
In general terms, the decision theory portion of the scientific method uses a mathematically expressed strategy, termed a decision function (or sometimes decision rule), to make a decision. This function includes explicit, quantified gains and losses to reach a conclusion. Its goal is to optimize the outcome of the decision, that is, jointly to maximize gains and minimize losses. Gains might be factors such as faster healing, less pain, or greater patient satisfaction. Losses might be factors such as more side effects or greater costs – in time, effort, or inconvenience as well as money. Any number of possible decision functions exists, depending on the strategy selected, that is, on the gains and losses chosen for inclusion and their relative weightings. Is financial cost of treatment to be included? Is pain level more or less important than eventual overall satisfaction? In a given situation, quite different “optimum” decisions could be reached, depending on the decision function chosen. Those who wish to apply outcomes derived from an investigator’s use of decision theory should note that a personal or financial agenda may be involved in the choice of elements and weightings used in the decision function. To safeguard against such agendas, a user should accept only decision functions with natures that have been clearly and explicitly documented.

Statistics in Medicine. DOI: http://dx.doi.org/10.1016/B978-0-12-384864-2.00008-1 © 2012 Elsevier Inc. All rights reserved.


Operations Research
In most applications, the decision problem is expressed as a strategy to select one of a number of options based on a criterion of minimum loss. The loss might include risk of making a wrong decision, loss to the patient (financial cost as well as cost in pain, decreased quality of life, or even death), and/or financial or time cost to the investigator-institution. In the industrial, business, and military fields, applied decision theory most often has come under the heading of operations research (or operational analysis [British]). In medicine, some forms of applied decision theory using multiple sources of loss in the decision strategy appear under the heading of outcomes analysis, which is introduced in Chapter 20.

Decision Making by Testing a Statistical Hypothesis Section 1.4 notes that medicine’s major use of statistical inference is in making conclusions about a population on the basis of a sample from that population. Most often, we form statistical hypotheses, usually in a form different from the clinical hypothesis, about the population. We use sample data to test these hypotheses. This procedure is a special case of two-option decision theory in which the decision strategy uses only one loss, that of the risk (probability) of making an erroneous decision. More specifically, the loss is measured as known, controlled probability of each of two possible errors: choosing the first hypothesis when the second is true and choosing the second when the first is true.

How We Get from Observed Data to a Test
From the data, we calculate a value, called a statistic, that will answer the statistical question implied by the hypothesis. For example, to answer the time-to-heal question, we may compare the statistic ma (mean time to heal using the antibiotic) with the parameter μ (mean time to heal for the untreated population). Such statistics are calculated from data that follow probability distributions; therefore, the statistics themselves will follow a probability distribution. Section 4.7 notes that a mean, the time-to-heal statistic here, is distributed (at least approximately) normal. Areas under the tails of such distributions provide probabilities, or risks, of error associated with “yes” or “no” answers to the question asked. Estimation of these error probabilities from the sample data constitutes the test.

The Null Hypothesis
The question to be answered must be asked in the form of a hypothesis. This statistical hypothesis, which differs from the clinical hypothesis, must be stated carefully to relate to a statistic, especially because the statistic must have a known or derivable probability distribution. In most applications, we want to test for a difference, starting with the null hypothesis, symbolized H0, that our sample is no different (hence “null”) from known information. (The hypothesis to be tested must be based on the known distribution, in this case, of the established healing time, because it cannot be based on an unknown antibiotic healing time. This is explained further later.) In testing for a difference, we hypothesize that the population mean time to heal with an antibiotic, μa (which we estimate by our sample mean, ma), does not differ from the mean time to heal without the antibiotic, that is, H0: μa = μ. In contrast to testing for a difference, we could test for equivalence. In this case, we would hypothesize a specified decrease in healing times versus those seen without the antibiotic, that is, H0: μa = μ − decrease. (The historical name and symbol are carried over even though the hypothesis is no longer truly null.) Most of this chapter focuses on the more familiar difference testing. Equivalence testing is addressed in Chapter 16.

The Alternate Hypothesis After forming the null hypothesis, we form an alternate hypothesis, symbolized H1, stating the nature of the discrepancy from the null hypothesis if such discrepancy should appear. Hypotheses in words may be long and subject to misunderstanding. For clarity, they usually are expressed in symbols, where the symbols are defined carefully.

Forms the Hypotheses Can Take To answer the time-to-heal question, we hypothesized that the population mean μa from which our sample is drawn does not differ from μ. Our alternate hypothesis is formed logically as not H0. If we believe the antibiotic could either shorten or lengthen the healing, we use a two-sided alternate hypothesis, so that our hypotheses are: 

H0: μa = μ and H1: μa ≠ μ.    (8.1)

The two-sided alternate is conservative and is the usually chosen form. In case we truly believe that the antibiotic cannot lengthen the healing, the alternate to no difference is shortened healing and our alternate hypothesis would be H1: μa < μ. The alternate here is known as a one-sided hypothesis. When a decision is made about using or not using a medical treatment, the sidedness may be chosen from the decision possibilities rather than the physical possibilities. If we will alter clinical treatment only for significance in the positive tail and not in the negative tail, a one-tailed test may be used.

Why the Null Hypothesis Is Null It may seem a bit strange at first that our primary statistical hypothesis in testing for a difference says there is no difference, even when, according to our clinical hypothesis, we believe there is one, and might even prefer to see one. The reason lies in the ability to calculate errors in decision making. When the hypothesis says that our sample is no different from known information, we have available a known probability distribution and therefore can calculate the


area under the distribution associated with the erroneous decision: a difference is concluded when in truth there is no difference. This area under the probability curve provides us with the risk for a false-positive result. The alternate hypothesis, on the other hand, says just that our known distribution is not the correct distribution, not what the alternate distribution is. Without sufficient information regarding the distribution associated with the alternate hypothesis, we cannot calculate the area under the distribution associated with the erroneous decision: no difference exists when there is one, that is, the risk for a false-negative result.

A Numerical Example: Hypotheses about Prostate Volumes As a numerical example, consider the sample of the 10 prostate volumes in Table DB1.1. Suppose we want to decide if the population mean μv from which the sample was drawn is the same as the mean of the population of 291 remaining volumes. Because the mean of sample volumes may be either larger or smaller than the population mean, the alternate is a two-sided hypothesis. Our hypotheses are 

H0: μv = μ and H1: μv ≠ μ.    (8.2)

m estimates the unknown μv and, if H0 is true, m is distributed N(μ, σm²). (μ is the mean and σ the standard deviation of the 291 volumes, and σm is the standard error of the mean [SEM], σ/√n [see Section 4.8].) Standardization of m provides a statistic z, which we know to be distributed N(0,1), that is,



z = (m − μ)/σm = (m − μ)/(σ/√n).    (8.3)

We know that m = 32.73, μ = 36.60, and σ = 18.12; n = 10 is the size of the sample we are testing. By substituting in Eq. (8.3), we find z = −0.675; that is, the sample mean is about two-thirds of a (population) standard deviation below the population mean. The small excerpt (Table 8.1) from Table I (see Tables of Probability Distributions) shows the two-tailed probabilities for the 0.60 and 0.70 standard deviations. The calculated z lying between these two values tells us that the probability of finding a randomly drawn normal observation more than 0.675σ away from the mean is a little more than 0.484, or about 0.5. We conclude that we have a 50% chance of being wrong if we decide that the sample mean did not arise from this distribution. There is not enough evidence to conclude a difference. This result may be visualized on a standard normal distribution as in Figure 8.1.
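The arithmetic above is easy to reproduce. A minimal sketch in Python, standard library only (the variable names are ours; the numbers are from the text):

```python
from math import sqrt, erf

def normal_cdf(x):
    # Standard normal CDF, Phi(x), via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Values from the text: sample mean, population mean, population SD, sample size
m, mu, sigma, n = 32.73, 36.60, 18.12, 10
sem = sigma / sqrt(n)                       # standard error of the mean, sigma_m
z = (m - mu) / sem                          # Eq. (8.3): about -0.675
p_two_tailed = 2.0 * normal_cdf(-abs(z))    # area in both tails, about 0.5
```

The two-tailed area beyond |z| = 0.675 comes out near 0.5, matching the "about 0.5" read from the table excerpt.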

The Critical Value A “cut point” between whether a statistic is significant or non-significant is termed a critical value. In symbols, the critical values of z leading to a two-tailed α are zα/2

Table 8.1  Excerpt from Table I, Normal Distribution

z (No. Standard Deviations to Right of Mean)    Two-tailed α (Area in Both Tails)
0.60                                            0.548
0.70                                            0.484


FIGURE 8.1 A standard normal distribution showing 2.5% tail areas adding to 5% risk of error from concluding that a random value did not arise from this distribution (risk of a false positive). These shaded areas are rejection regions. The 0.025 area under the right tail represents the probability that a value drawn randomly from this distribution will be 1.96σ or farther above the mean, σ taking the value 1 in this standard normal form. We can see that z = −0.675, about 2/3 of a standard deviation below 0, lies far from a rejection region.

and z1−α/2. By symmetry, we need calculate only the right tail z1−α/2 and use its negative for the left. The decision about significance of the statistic z may be stated as a formula: z is not significant if −z1−α/2 ≤ z ≤ z1−α/2, or, using Eq. (8.3),

−z1−α/2 ≤ (m − μ)/σm ≤ z1−α/2.    (8.4)
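In code, the rule of Eq. (8.4) is a one-line comparison. A sketch in Python (assuming Python 3.8+, whose statistics.NormalDist supplies the critical value z1−α/2):

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(m, mu, sigma, n, alpha=0.05):
    # Eq. (8.4): not significant when -z_{1-alpha/2} <= z <= z_{1-alpha/2}
    sem = sigma / sqrt(n)
    z = (m - mu) / sem
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return z, not (-z_crit <= z <= z_crit)

# Prostate-volume example: z of about -0.675 lies well inside (-1.96, 1.96)
z, significant = z_test_two_sided(32.73, 36.60, 18.12, 10)
```

Applied to the prostate-volume numbers, the statistic falls far from either rejection region, so the test is not significant.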

The Most Common Types of Statistic Being Tested and Their Associated Probability Distributions The previous section demonstrated how a hypothesis having a general pattern of a sample mean divided by a known population SEM follows a standard normal distribution. In the same fashion, a hypothesis about a mean when the standard deviation must be estimated (the population standard deviation is


unknown) has the same general pattern, namely, a sample mean divided by a sample SEM, so it follows a t distribution. A hypothesis about a standard deviation uses a variance, which follows a chi-square distribution. A hypothesis comparing two standard deviations uses a ratio of two variances, which follows an F distribution. And finally, a hypothesis about a rate of occurrence follows a binomial or a Poisson distribution. Thus, a large number of statistical questions can be tested using just these six well-documented probability distributions (introduced in Section 4.7).

Confidence Intervals Are Closely Related Confidence intervals are closely related to hypothesis tests procedurally, in that null and alternate hypotheses could state that the interval does and does not, respectively, enclose the population mean. However, the use differs: a confidence interval is used for description, not to test.

Assumptions in Hypothesis Testing Section 4.6 shows that various assumptions underlie hypothesis testing and that the result of a test is not correct when these assumptions are violated. Furthermore, the methods do not tell the user when a violation occurs or by how much the results are affected. Results appear regardless. It is up to the user to verify that the assumptions are satisfied. The assumptions required vary from test to test, and which apply are noted along with each test methodology. The most common assumptions are that errors are normal and independent from one another and, for multiple samples, that variances are equal.

The Meaning of “Error” An “error” in an observation does not refer to an error in the sense of a mistake, but rather to the deviation of the individual observation from the typical. The term error, often and more appropriately called “residual”, is used for historical reasons; if you find the term “error” in referring to observations confusing in your reading, read “deviation from typical” instead.

The Assumption of Independence of Errors When a set of observations is taken, we assume that knowledge of the error on one tells us nothing about the error on another, that is, that the errors are independent from one another. Suppose we take temperature readings on a ward of patients with bacterial infections. We expect them to be in the 38° to 40°C range, averaging about 39°C. We assume that finding a 39.5°C reading for one patient (i.e. an error of 0.5°C greater than the average) tells us nothing about the deviation from average we will find for the patient in the next bed. How might such an assumption be violated? If the ward had been filled starting from the far end and working nearer as patients arrived, the patients at the far end might be improving and have lower temperatures. Knowledge of the temperature error of a particular patient might indeed give us a clue to

the temperature error in the patient in the next bed. Assuring the independence of errors is a major reason for incorporating randomness in sampling. This assumption is made in almost all statistical tests.

The Assumption of Normality of Errors
Second, many, but not all, tests also assume that these errors are drawn from a normal distribution. A major exception is rank-order (or non-parametric) tests. Indeed, the avoidance of this assumption is one of the reasons to use rank-order tests.

The Assumption of Equality of Standard Deviations
The third most frequently made assumption occurs when the means of two or more samples are tested. It is assumed that the standard deviations of the errors are the same. This assumption is stated more often as requiring equality of variances rather than of standard deviations. The term homoscedasticity sometimes encountered in a research article is just a fancy word for equality of variances.

Exercise 8.1. In DB4, a hypothesis to be investigated is that protease inhibitors reduce pulmonary admissions. Is this a clinical or a statistical hypothesis? What would be a statement of the other type of hypothesis?

Exercise 8.2. In DB14, we ask if the mean ages of patients with and without exercise-induced bronchoconstriction (EIB) are different. Write down the null and alternate hypotheses.

8.2  ERROR PROBABILITIES

Posing an Example: Mean PSA between Cancerous and Healthy Patients
We have a mean of prostate-specific antigen (PSA) readings from a cancer patient group and we want to compare it with the mean PSA of healthy patients to see if PSA can detect the presence of prostate cancer. The null hypothesis, H0, states that mean PSA is no different between groups.

Type I (α) Error
The null hypothesis may be rejected when it should have been accepted; we conclude that there is a difference in PSA between groups when there is none. Such an error is denoted a Type I error. Its probability of occurring by chance alone is denoted α. Formally, α = P[rejecting H0 | H0 true] (read: “probability of rejecting H0 given H0 is true”). The conclusion of a difference when there is none is similar to the false-positive result of a clinical test. Properly, α is chosen before data are gathered so that the choice of critical value cannot be influenced by study results.

Type II (β) Error; Power of a Test
Alternatively, the null hypothesis may be accepted when it should have been rejected; we conclude that mean PSA is no different between groups when in fact it is.


Table 8.2  Relationships among Types of Error and Their Probabilities as Dependent on the Decision and the Truth

                     Decision: H0 true                                      Decision: H0 false
Truth: H0 true       Correct decision; true negative (probability 1 − α)    Type I error; false positive (probability α)
Truth: H0 false      Type II error; false negative (probability β)          Correct decision; true positive (probability 1 − β)

Such an error is denoted a Type II error. Its probability of occurring by chance alone is denoted β. Formally, β = P[accepting H0 | H0 false]. The conclusion of no difference when there is one is similar to the false-negative result of a clinical test. 1 − β is the power of the test, which is referred to often in medical literature and used especially when assessing the sample size required in a clinical study (Chapter 18). Power is the probability of detecting a difference that exists.
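Both error rates can be made tangible by simulation. A sketch in Python (the population parameters, sample size, and effect size here are our illustrative choices, not from the text):

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
Z_CRIT = NormalDist().inv_cdf(0.975)   # two-sided test at alpha = 0.05

def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=25, trials=20_000):
    # Fraction of simulated samples whose z statistic rejects H0: mu = mu0
    sem = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        m = random.gauss(true_mu, sem)   # draw a sample mean directly
        if abs((m - mu0) / sem) > Z_CRIT:
            rejections += 1
    return rejections / trials

alpha_hat = rejection_rate(true_mu=0.0)   # H0 true: estimates alpha, near 0.05
power_hat = rejection_rate(true_mu=0.6)   # H0 false: estimates power, 1 - beta
```

When H0 is true, the rejection rate hovers near the chosen α; when the true mean is shifted, the rejection rate estimates the power for that particular shift.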

p-Value After the data have been gathered, new information is available: the value of the decision statistic, for example, the (standardized) difference between control and experimental means. The error of rejecting the null hypothesis when it is true can now be estimated using sample data and is termed the p-value. If the p-value is smaller than α, we reject the null hypothesis; otherwise, we do not have enough evidence to reject it. The size of the p-value can give a clue about the relationship of the null hypothesis to the data. A p-value near 0 or near 1 leaves little doubt as to the conclusion to be drawn from the study, but a p-value slightly exceeding α may suggest that further study is warranted. A listing of the actual p-value in a study result often adds information beyond just indicating a value greater or less than α. The next section further addresses this issue.

Relation among Truth, Decision, and Errors Type of error depends on the relationship between decision and truth, as is depicted in Table 8.2, a form of truth table.

LOGICAL STEPS IN A STATISTICAL TEST The logic of a statistical test of difference used historically is the following: (1) we choose a level of α, the risk for a Type I error, that we are willing to accept. (2) Because we know the distribution of the statistic we are using, for example, z or t, we use a probability table to find its critical value. (3) We take our sample and calculate the value of the statistic arising from it. (4) If our statistic falls on one side of the critical value, we do not have the evidence to reject H0; if on the other side, we do reject H0.
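The four steps can be walked through in code. A sketch with invented healing-time data and a tabulated t critical value (the sample values and μ0 below are illustrative, not from the text; 2.262 is the standard two-sided 5% critical value for 9 degrees of freedom):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical healing times (days) for n = 10 treated patients; mu0 is the
# assumed untreated population mean. Both are illustrative, not from the text.
sample = [30.1, 35.2, 28.7, 33.0, 31.4, 29.8, 36.5, 27.9, 32.2, 30.6]
mu0 = 36.0

# (1) Choose the Type I risk we will accept: alpha = 0.05, two-sided.
# (2) Known distribution -> critical value: t with n - 1 = 9 df gives 2.262.
t_crit = 2.262
# (3) Calculate the statistic from the sample (sigma unknown, so t, not z).
n = len(sample)
t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
# (4) Compare with the critical value.
reject_h0 = abs(t) > t_crit
```

With these invented data the statistic falls well beyond the critical value, so step (4) rejects H0.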

[Figure 8.2: illustrative null and alternate distributions separated by σm, with the critical value 1.645 marked; α is the tail area of the null distribution to the right of the critical value, and β is the area of the alternate distribution to its left.]

The Critical Value and the Errors Illustrated Illustrative null and alternate distributions and their α and β values are shown in Figure 8.2.

Choosing α but Not β Is a Common Reality
We usually know or have reason to assume the nature of the null distribution, but we seldom know or have enough information to assume that of the alternate distribution. Thus, instead of minimizing both risks, as we would wish to do, we fix a small α and try to make β small by taking as large a sample size as we can. In medical applications, choosing α = 5% has become commonplace, but 5% is by no means required. Other risks, perhaps 1% or 10%, may be chosen so long as they are stated and justified.
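The trade just described — fix α, then shrink β by growing n — can be made concrete for the z test. A sketch in Python (the effect size and σ are our illustrative choices):

```python
from math import sqrt
from statistics import NormalDist

def two_sided_power(effect, sigma, n, alpha=0.05):
    # Power of a two-sided z test when the true mean is shifted by `effect`
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = effect / (sigma / sqrt(n))    # true shift measured in SEM units
    # Probability the statistic lands in either rejection region
    return nd.cdf(-z_crit - shift) + 1 - nd.cdf(z_crit - shift)

# alpha stays fixed at 5%; power (1 - beta) grows as the sample size grows
powers = [two_sided_power(effect=0.5, sigma=2.0, n=n) for n in (10, 40, 160)]
```

Quadrupling n at each step steadily raises the power, that is, steadily lowers β, while α never moves.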

Addending a Confidence Interval Increasingly, readers of medical literature are asking for a confidence interval on the statistic being tested to “give assurance of adequate power”. Simple algebra can change Eq. (8.4) into the confidence interval Eq. (7.5). Significance in Eq. (8.4) is equivalent to observing that the confidence interval does not cross zero. Thus, the confidence interval does not add to the decision-making process. It does not tell us what the power is. However, it often provides useful insights into and understanding of the relationships among the parameter estimates involved.

Accepting H0 versus Failing to Reject H0 Previously, it was common in the medical literature to speak of accepting H0 if the statistic fell on the “null side” of the critical value (see Figure 8.2). Users

FIGURE 8.2 Depiction of null and alternate distributions along with the error probabilities arising from selecting a particular critical value. To the right of the critical value is the rejection region; to the left, the no-evidence-to-reject region.


often then were mystified to see that a larger sample led to contradiction of a conclusion already made. The issue is one of evidence. The conclusion is based on probabilities, which arise from the evidence given by the data. The more accurate statement about H0 is the following: the data provide inadequate evidence to infer that H0 is probably false. With additional evidence, H0 may be properly rejected. The interpretation is that our data-based evidence gives inadequate cause to say that it is untrue, and therefore we act in clinical decision making as if it were true. The reader encountering the “accept H0” statement should recall this admonition.

Testing for a Difference versus Equivalence
Testing for a difference, for example between mean times to resolve infection caused by two competing antibiotics, has a long history in medical statistics. In recent years, the “other side of the coin” has been coming into practice, namely testing for equivalence. If either antibiotic could be the better choice, the equivalence test is two-sided. However, in many cases we have an established (gold) standard and ask if a new competitor is as good as the standard, giving rise to a one-sided equivalence test. A one-sided test is often termed a non-inferiority test. For example, we may have an established but expensive brand-name drug we know to be efficacious. A much cheaper generic drug appears on the market. Is it equally efficacious? The concept is to pose a specific difference between the two effects, ideally the minimum difference that is clinically relevant, as the null hypothesis, H0. If H0 is rejected, we have evidence that the difference between the effects is either zero or, at most, clinically irrelevant and we believe the generic drug is effective. The methodology of equivalence testing is more fully addressed in Chapter 16; at this stage of learning, the student need only be aware of this option.

Exercise 8.3. A clinical hypothesis arising from DB3 might be: the mean serum theophylline level is greater at the end of the antibiotic course than at baseline. (a) What probability distribution is associated with this hypothesis? (b) What assumptions about the data would be required to investigate this hypothesis? (c) State in words the Type I and Type II errors associated with this hypothesis. (d) How would the probability (risk) of these errors be designated? How would the power of the test be designated?

Exercise 8.4.
A clinical hypothesis arising from DB9 might be: the standard deviation of platelet counts is 60 000 (which gives about 95% confidence (mean ± 2 × 60 000) coverage of the normal range of 240 000); this hypothesis is equivalent to supposing that the variance is 60 000² = 3 600 000 000. (a) What probability distribution is associated with this hypothesis? (b) What assumptions about the data would be required to investigate this hypothesis? (c) State in words the Type I and Type II errors associated with this hypothesis. (d) How would the probability (risk) of these errors be designated?

Exercise 8.5. A clinical hypothesis arising from DB5 might be: the variances (or standard deviations) of plasma silicon before and after implant are different.

(a) What probability distribution is associated with this hypothesis? (b) What assumptions about the data would be required to investigate this hypothesis? (c) State in words the Type I and Type II errors associated with this hypothesis. (d) How would the probability (risk) of these errors be designated?

8.3  TWO POLICIES OF TESTING

A Bit of History
During the early development of statistics, calculation was a major problem. It was done mostly by pen, sometimes assisted by bead frames (various forms of abaci) in many parts of the world (Asia, Middle East, Europe, the Americas). Later hand-crank calculators and then electric ones were used. Probability tables were produced for selected values with great effort, the bulk being done in the 1930s by hundreds of women given work by the US Works Project Administration during the Great Depression. It was not practical for a statistician to calculate a p-value for each test, so the philosophy became to make the decision of acceptance or rejection of the null hypothesis on the basis of whether the p-value was bigger or smaller than the chosen α (e.g., p < 0.05 vs. p ≥ 0.05) without evaluating the p-value itself. The investigator (and reader of published studies) then had to not reject H0 if p were not less than α (result not statistically significant) and reject it if p were less (result statistically significant).

Calculation by Computer Provides a New Option
With the advent of computers, calculation of even very involved probabilities became fast and accurate. It is now possible to calculate the exact p-value, for example, p = 0.12 or p = 0.02. The user now has the option to make a decision and interpretation on the exact error risk arising from a test.

Contrasting Two Approaches
The later philosophy has not necessarily become dominant, including in medicine. The two philosophies have generated some dissension among statisticians. Advocates of the older approach hold that sample distributions only approximate the probability distributions and that exactly calculated p-values are not accurate anyway; the best we can do is select a “significant” or “not significant” choice. Advocates of the newer approach – and these must include the renowned Sir Ronald Fisher in the 1930s – hold that the accuracy limitation is outweighed by the advantages of knowing the p-value. The action we take about the test result may be based on whether a not-significant result suggests a most unlikely difference (perhaps p = 0.80) or is borderline and suggests further investigation (perhaps p = 0.08) and, similarly, whether a significant result is close to the decision of having happened by chance (perhaps p = 0.04) or leaves little doubt in the reader’s mind (perhaps p = 0.004).


Other Factors Must Be Considered
The preceding comments are not meant to imply that a decision based on a test result depends solely on a p-value, the post hoc estimate of α. The post hoc estimate of β, the risk of concluding no difference when there is one, is germane. And certainly the sample size and the clinical difference being tested must enter into the interpretation. Indeed, the clinical difference is often the most influential of values used in a test equation. The comments on the interpretation of p-values relative to one another do hold for adequate sample sizes and realistic clinical differences.

Select the Approach that Seems Most Sensible to You
Inasmuch as the controversy is not yet settled, the user may select the philosophy preferred. The author tends toward the newer approach.

Exercise 8.6. In DB14, healthy subjects showed a mean decrease of 2.15 ppb exhaled nitric oxide (eNO) from before exercise to 20 minutes after exercise. A 5% significance test of this mean against a theoretical mean of 0 (paired t test) yielded p = 0.085. Make an argument for the two interpretations of this p-value discussed in Section 8.3.

8.4  ORGANIZING DATA FOR INFERENCE The First Step: Identify the Type of Data After the raw data are obtained, they must be organized into a form amenable to analysis by the chosen statistical test. We often have data in patient charts, instrument printouts, or observation sheets, and want to set them up so that we can conduct inferential logic. The first step is to identify the type of data to be tested. Section 2.4 shows that most data could be typed as categorical (we count the patients in a disease category, for example), rank-order (we rank the patients in numerical order), or continuous (we record the patient’s position on a scale). Although there is no comprehensive rule for organizing data, an example of each type will help the user become familiar with the terminology and concepts of data types.

Categorical Data
In the data of Table DB1.1, consider the association of a digital rectal examination (DRE) result with a biopsy result. We ask the following question: “Does DRE result give some indication of biopsy result, or are the two events independent?” Both results are either positive or negative by nature. We can count the number of positive results for either, so they are nominal data. For statistical analysis, we almost always want our data quantified. Each of these results may be quantified as 0 (negative result) or 1 (positive result). Thus, our variables in this question are the numerical values (0 or 1) of DRE and the numerical values (0 or 1) of biopsy. The possible combinations of results are 0,0 (0 DRE and 0 biopsy); 0,1; 1,0; and 1,1. We could set up a 2 × 2 table (Table 8.3) in the format of Table 8.2.

Table 8.3  Format for Recording Joint Occurrences of Two Binary Data Sets

                         DRE Result
                     0        1        Totals
Biopsy result   0
                1
                Totals

Table 8.4  Format of Table 8.3 with Table DB1.1 Data Entered

                         DRE Result
                     0        1        Totals
Biopsy result   0    3        4         7
                1    2        1         3
                Totals   5    5        10
Entering the Data After setting up the format, we go to the data (from Table DB1.1) and count the number of times each event occurs, entering the counts into the table. It is possible to count only some of the entries and obtain the rest by subtraction, but it is much safer to count the entry for every cell and use subtraction to check the arithmetic. The few extra seconds are a negligible loss compared with the risk of publishing erroneous data or conclusions. Entering the data produces Table 8.4. Our data table is complete and we are ready to perform a categorical test.
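The counting, and the arithmetic check the text advises, can be scripted. A sketch in Python (the patient-by-patient pairing below is invented so that the cell counts reproduce Table 8.4; it is not taken directly from Table DB1.1):

```python
from collections import Counter

# 0/1 DRE and biopsy results for 10 patients; the pairing is illustrative,
# arranged so the cell counts match Table 8.4
dre    = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
biopsy = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1]

cells = Counter(zip(biopsy, dre))                      # (biopsy, dre) -> count
table = [[cells[(b, d)] for d in (0, 1)] for b in (0, 1)]

# Recompute every total as an arithmetic check, rather than by subtraction
row_totals = [sum(row) for row in table]               # biopsy 0, biopsy 1
col_totals = [sum(col) for col in zip(*table)]         # DRE 0, DRE 1
grand_total = sum(row_totals)
```

Counting every cell and then summing the margins catches transcription slips before they reach a published table.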

Rank Data

Remaining with Table DB1.1, suppose we ask whether average prostate-specific antigen density (PSAD) is different for positive versus negative biopsy. PSAD is a continuous-type measure; that is, each value represents a point on a scale from 0 to a large number, which ordinarily would imply using a test of averages with continuous data. However, PSAD values come from a very right-skewed distribution. (The mean of the PSADs for the 301 patients is 0.27, which lies far from the center of the range 0–4.55.) The assumption of normality is violated, so it is appropriate to use rank methods.

Converting Continuous Data to Rank Data

Sometimes ranks arise naturally, as in ranking patients' severity during triage. In other cases, such as the one we are addressing now, we must convert continuous data to rank data. To rank the PSAD values, we write them down and assign rank 1 to the smallest, rank 2 to the next smallest, and so on. We associate the biopsy results and sort out the ranks for biopsy = 0 and biopsy = 1. These entries appear in Table 8.5. Our data entry is complete, and we are ready to perform a rank test.

Table 8.5  PSAD and Biopsy Data in Rank Format

PSAD    PSAD Ranks    Biopsy Results    Ranks for Biopsy = 0    Ranks for Biopsy = 1
0.24         6              0                     6
0.15         3              0                     3
0.36         9              1                                              9
0.27         8              1                                              8
0.22         5              1                                              5
0.11         1              0                     1
0.25         7              0                     7
0.14         2              0                     2
0.17         4              0                     4
0.48        10              0                    10
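The hand-ranking procedure above can be sketched in Python with the values from Table 8.5. Note that this simple version does not handle tied values, which require averaged ranks.

```python
# PSAD values and biopsy results for the 10 patients of Table 8.5.
psad = [0.24, 0.15, 0.36, 0.27, 0.22, 0.11, 0.25, 0.14, 0.17, 0.48]
biopsy = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]

# Rank 1 = smallest value, rank 2 = next smallest, and so on.
order = sorted(range(len(psad)), key=lambda i: psad[i])
ranks = [0] * len(psad)
for rank, i in enumerate(order, start=1):
    ranks[i] = rank

# Sort the ranks into the biopsy = 0 and biopsy = 1 groups.
ranks_0 = [r for r, b in zip(ranks, biopsy) if b == 0]
ranks_1 = [r for r, b in zip(ranks, biopsy) if b == 1]
print(ranks)    # [6, 3, 9, 8, 5, 1, 7, 2, 4, 10]
print(ranks_0)  # [6, 3, 1, 7, 2, 4, 10]
print(ranks_1)  # [9, 8, 5]
```

The two group lists reproduce the last two columns of Table 8.5 and are exactly the input a rank test needs.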

Continuous Measurement Data

Suppose we ask whether average PSA is different for positive versus negative biopsy results. For each patient, we would record the PSA level and a 0 or 1 for the biopsy result, ending with data as in columns 5 and 8 of Table DB1.1. Prostate-specific antigen level is a continuous measurement; that is, each value represents a point on a scale from 0 to a large number. It was noted earlier that PSA is not too far from normal when patients with benign prostatic hypertrophy are excluded. Thus, a continuous data type test of averages would be appropriate. We would need means and standard deviations; their method of calculation was given in Chapter 5. One way to display our data in a convenient form is shown in Table 8.6. Our data entry and setup are complete and we are ready to perform a means test.

What Appears to Be a Data Type May Not Act That Way in Analysis

Suppose the patients on a list of cancer cases have their pathologic stage rated T1, T2, T3, or T4. Each stage is more severe than the previous one, so the ratings are ordered data. It appears that your analysis will be a rank-order method. But wait! When you examine the data, you find that all the patients are either T2 or T3. You could classify them into low and high categories. You would actually analyze numbers of patients (counts) by category, using a categorical method. This phenomenon may also occur with continuous data, where the actual data readings may force the type into rank-order or categorical form, thereby changing the method of analysis. The investigator must remain aware of and sensitive to such a data event.

Table 8.6  PSA and Biopsy Data in Means-Test Format

PSA     Biopsy Result    PSA for Biopsy = 0    PSA for Biopsy = 1
7.6          0                  7.6
4.1          0                  4.1
5.9          1                                         5.9
9.0          1                                         9.0
6.8          1                                         6.8
8.0          0                  8.0
7.7          0                  7.7
4.4          0                  4.4
6.1          0                  6.1
7.9          0                  7.9
                         m = 6.54, s = 1.69    m = 7.23, s = 1.59

Exercise 8.7. Of what data type is: (a) the variable Respond versus Not respond in DB6? (b) The variable Nausea score in DB2? (c) The variable Platelet count in DB9?

Exercise 8.8. In DB10, rank the seconds to perform the triple hop for the operated leg, small to large.

Exercise 8.9. From DB14, set up tables as in Section 8.4: (a) EIB frequency by sex; (b) for the 6 EIB patients, 5-minute eNO differences by sex as ranks; (c) for the 6 EIB patients, 5-minute eNO differences by sex as continuous measurements.

8.5  EVOLVING A WAY TO ANSWER YOUR DATA QUESTION

Fundamentally, a Study Is Asking a Question

Clinical studies ask questions about variables that describe patient populations. A physician may ask about the PSA of a population of patients with prostate cancer. Most of these questions are about the characteristics of the probability distribution of those variables, primarily its mean, standard deviation, and shape. Of most interest in the prostate cancer population is the average PSA. Three stages in the evolution of scientific knowledge about the population in question are discussed in Section 1.2: description (the physician wants to know the average PSA of the cancerous population), explanation (the physician wants to know if and why it is different from the healthy population), and prediction (the physician wants to use PSA to predict which patients have cancer).

Description and Prediction

When we know little about the distribution in question, the first step is to describe it. We take a representative sample from the population and use the statistical summarizing methods of Chapter 5 to describe the sample. These


sample descriptors estimate the characteristics of the population. We can even express our confidence in the accuracy of these sample summaries by the confidence methods of Chapter 7. The description step will not be pursued again in this chapter. Prediction combines the result of the inference with a causal (explanatory) model to predict results for cases not currently in evidence. This is a more sophisticated stage in the evolution of knowledge that is discussed further in Chapter 21. The current chapter addresses statistical testing, which leads to the inference from sample to population.

Testing

We often want to decide (1) whether our patient sample arose from an established population (Does a sample of patients who had an infection have the same average white blood count [WBC] after treatment with antibiotics as the healthy population?) or (2) whether the populations from which two samples arose are the same or different (Does a sample of patients treated for infection with antibiotics have the same average WBC as a sample treated with a placebo?). In these examples, the variable being used to contrast the differences is mean WBC. Let us subscript the means with h for healthy patients, a for patients treated with antibiotics, and p for patients treated with a placebo. Recall that μ represents a population mean and m represents a sample mean. Then, (1) contrasts ma with μh and (2) contrasts ma with mp.
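Contrast (1), a sample against an established population, is typically handled by a one-sample test statistic that measures how many standard errors the sample mean lies from the known population mean. A minimal sketch follows; the WBC-like numbers and the healthy-population mean of 7 are made up for illustration, not taken from any dataset in this book.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (m - mu0) / (s / sqrt(n)): the distance of the sample mean m
    from the hypothesized population mean mu0, in standard-error units."""
    n = len(sample)
    m = statistics.mean(sample)
    s = statistics.stdev(sample)  # n - 1 divisor, as in Chapter 5
    return (m - mu0) / (s / math.sqrt(n))

# Hypothetical WBC counts (thousands per microliter) for 5 treated
# patients, contrasted with a hypothetical healthy-population mean of 7.
t = one_sample_t([8, 9, 7, 10, 6], mu0=7)
print(round(t, 3))  # 1.414
```

A large |t| casts doubt on the hypothesis that the sample arose from the established population; the formal decision rules are developed in the following sections and chapters.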

STEPS IN SETTING UP A TEST (WITH WBC COUNT AS AN EXAMPLE)

These contrasts are performed by statistical tests following the logic of inference (see Section 8.1) with measured risks of error (see Section 8.2). The step-by-step logic adhered to is as follows (assuming the simplest case of one question to be answered using one variable):

  1. Write down the question you will ask of your data. (Does treating my infected patients with a particular antibiotic make them healthy again?)
  2. Select the variable on which you can obtain data that you believe will best highlight the contrasts in the question. (I think WBC count best shows the state of health.)
  3. Select the descriptor of the distribution of the variable that will furnish the contrast, e.g. mean, standard deviation. (Mean WBC count will provide the most telling contrast.)
  4. Write down the null and alternate hypotheses indicated by the contrast. (H0: μa = μh; H1: μa ≠ μh.)
  5. Write down a detailed, comprehensive sentence describing the population(s) measured by the variable involved in that question. (The healthy population is the set of people who have no current or chronic infections affecting their WBC count. The treated population is the set of people who have the infection being treated and no other current or chronic infection affecting their WBC count.)
  6. Write down a detailed, comprehensive sentence describing the sample(s) from which your data on the variable will be drawn. (My sample is selected randomly from patients presenting to my clinic who pass my exclusion screen [e.g. other infections].)
  7. Ask yourself what biases might emerge from any distinctions between the makeup of the sample(s) and the population(s). Could this infection be worse for one age, sex, cultural origin, and so on, of one patient than another? Do the samples represent the populations with respect to the variable being recorded? (I have searched and found studies that show that mortality and recovery rates, and therefore probably WBC, are the same for different sexes and cultural groups. One might suspect that the elderly have begun to compromise their immune systems, so I will stratify my sample [see Section 1.8] to assure that it reflects the age distribution at large.)
  8. Recycle steps 1 through 7 until you are satisfied that all steps are fully consistent with one another.
  9. In terms of the variable descriptors and hypotheses being used, choose the most appropriate statistical test (see Chapter 2 for more examples) and select the α level you will accept.
  10. If your sample size is not preordained by availability and/or economics, satisfy yourself that you have an adequate sample size to answer the question (see Chapter 18).

At this point, you are ready to obtain your data (which might take hours or years).

Additional Example: Duration of the Common Cold

We should like to test (and lay to rest) the following assertion: "People who do not see a physician for a cold get well faster than people who do".

  1. Question being asked: does seeing a physician for a cold retard the time to heal?
  2. Variable to use: length of time (days) for symptoms (nasal congestion, etc.) to disappear.
  3. Descriptor of variable that will provide contrast between patients who see a physician (group 1) and those who do not (group 2): mean number of days μ1 (estimated by m1) and μ2 (estimated by m2).
  4. Hypotheses: H0: μ1 = μ2; H1: μ1 ≠ μ2.
  5. Populations: the populations are the sets of people in this nation who have cold symptoms but are otherwise healthy and see a physician for their condition (population 1) and those who do not (population 2).
  6. Samples: for sample 1, 50 patients will be randomly chosen from those who present at the walk-in clinic of a general hospital with cold symptoms but evince no other signs of illness. Data for sample 2 are more difficult to obtain. A random sample of five pharmacies in the area is taken and each is monitored for customers who have signs of a cold. Of these customers, a random sample of 10 who state they are not seeing a physician is taken from each pharmacy, with the customers agreeing to be included and to participate in follow-up by telephone.
  7. Biases and steps to prevent them: there are several possible sources of bias, as there are in most medical studies. However, the major questions of bias are the following: (a) are patients and customers at and in the vicinity of our general hospital representative of those in general? (b) Are customers who buy cold medicines at pharmacies representative of cold sufferers who do not see physicians? We can answer question (a) by analyzing the demographics statistically after our study is complete, whereas question (b) requires a leap of faith.
  8. Recycle: these steps seem to be adequately consistent as they are.
  9. Statistical test and choice of α: two-sample t test with α = 0.05.
  10. Sample size: Chapter 18 explains that we need not only (a) α (0.05), but also (b) the power (1 − β, which we take to be 0.80), (c) the difference between means that we believe to be clinically meaningful (which we choose as 2 days), and (d) σ1 and σ2. We estimate σ1 to be 3 from a pilot survey of patients with cold symptoms who were followed up by telephone and assume σ2 is the same. Section 18.4 explains that we need at least 36 in each sample; our planned 50 per group is a large enough sample.

Now we may begin to collect our data.

Exercise 8.10. Using the 2 × 2 table in DB2, follow the first nine steps of Section 8.5 in setting up a test to learn if the drug reduces nausea score.

Exercise 8.11. From DB13, we want to learn if our clinic results agree with those of the laboratory. Follow the first nine steps of Section 8.5 in setting up a test.
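As a check on step 10 of the cold-duration example, the common normal-approximation formula for the sample size of two equal groups, n = 2(z(1 − α/2) + z(1 − β))²σ²/δ², can be evaluated directly. This sketch assumes that formula; the exact method of Section 18.4 may differ in detail, but with these inputs it reproduces the "at least 36 per sample" figure.

```python
import math
from statistics import NormalDist

def n_per_group(alpha, power, sigma, delta):
    """Per-group n for a two-sided, two-sample comparison of means,
    using the standard normal-approximation formula (an assumption here,
    not necessarily the exact method of Section 18.4)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha quantile
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# alpha = 0.05, power = 0.80, sigma = 3 days, clinically meaningful
# difference delta = 2 days, as in the cold-duration example.
print(n_per_group(0.05, 0.80, sigma=3, delta=2))  # 36
```

With 50 planned per group comfortably above 36, the study as designed has adequate power for the chosen 2-day difference.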