Statistical analysis of anaesthetic experiments



Statistical analysis of anaesthetic experiments M. A. STAFFORD

Why use statistics? Statistics provide a clear demonstration of the results of a study or experiment, convert opinions to something more resembling facts, and present data in a compact form that nonetheless allows readers to form their own opinions from it. For these reasons statistics form an important part of nearly all medical research projects, but in spite of this widespread usage, surveys of both general medical literature (Pocock et al, 1987) and anaesthetic literature (Avram et al, 1985) show that statistical problems are common. Niggling about statistics takes up a high proportion of medical correspondence pages, discussion time at research society meetings and the reports prepared by referees for scientific journals. There are undoubtedly many carefully thought out and laboriously executed research projects which never reach public debate or publication because their result has been invalidated by a simple but fatal statistical flaw in the design.

Why are problems so common? The difficulty seems to be that statistics is a basic research tool, the importance of which is underrated by the new researcher until complications arise. No one should use a scientific tool without a basic knowledge of how it works and, more particularly, the circumstances in which the results are valid and those in which they are invalid. Without this theoretical background it is difficult to avoid making fundamental errors in the collection of data and the choice and application of techniques of analysis. If new researchers gave the learning of statistics a fraction of the effort that they put into learning the techniques of electromyography, pneumotachography or high performance liquid chromatography then many of the problems would not exist. It is worth noting that universities are now encouraged to put new MPhil and PhD students through a course of basic laboratory skills, including statistics, at the start of their projects.

Is there a short cut?
How can the new researcher use statistics without going through a vast and impractical amount of theory and practice? The solution is to be highly selective. Firstly, there are machines that perform statistical tests--computers with suitable programs--and they do the work well. The beginner should therefore concentrate on things that the machines cannot do, and completely ignore the inner workings of the tests. Note that this view is directly opposite to that presented by many short textbooks of statistics, which concentrate on the mechanics of the calculations at the expense of the theory. Essentially, effort should be concentrated on the production of valid data, data that are known to be representative of the target population, and to be suitable for the intended method of analysis. Further, restrict the statistical methods used to a few standard procedures, but be aware of the theory behind the selected methods, and in particular any limitations on their use. In most cases, experiments can be designed so that these are adequate--the secret is in the design. Finally, if the experiment cannot be designed to use a simple test, or if the data, in spite of the design, turn out to be unsuitable for the planned test, take advice from a statistician. If problems are foreseen at the design stage, take advice then.

Rules for the successful use of statistics by researchers are:

1. Make sure that the data are right.
2. Choose the correct tests for the data.
3. Design the experiment to keep the analysis simple.
4. Leave the number work to the machines.
5. Present not only the results but as much of the evidence on which they are based as possible.
6. Seek help early if necessary.

Baillière's Clinical Anaesthesiology--Vol. 2, No. 1, March 1988

One chapter cannot cover even highly selective statistical theory, practical statistics, use of computers and guidance to authors. Therefore the aim of this chapter is to cover the gaps between those areas, and to provide a large scale route map through the minefields.

THEORETICAL ASPECTS A number of statistical concepts are more important to the researcher than their presentation and context in the shorter statistical books suggest. They are presented here to bring attention to them.

Definition of statistic and statistics Statistics deals with techniques for collecting, describing, analysing and drawing conclusions from data. A statistic is such a description or conclusion expressed as a number. The techniques fall into two main groups.

1. Descriptive statistics are used to summarize data from samples, to describe a quantity of information in a few words or numbers.
2. Inferential statistics predict the properties of a population from a sample of that population.

Medical statistics tends to concentrate on epidemiology--data from observational studies or surveys. The kind of experiments common in anaesthesia, where the experimenter has considerable control over the data through selection of variables and groups, has more in common with the experiments of biologists and agriculturalists, and this should be borne in mind in selecting statistical textbooks.



Null hypothesis Philosophy states that it is not possible to prove anything, though it is not of course possible to prove this view. It is possible to disprove, but not to prove, a hypothesis, and therefore all scientific facts or 'laws' are only hypotheses that satisfactorily explain observations and that have not yet been disproved. Statistical 'proof' is obtained by setting up the hypothesis that the converse of the argument is true, and then disproving this 'null hypothesis'. One may hypothesize that the observed difference between two trial groups is due to random variation. Demonstrating that this is unlikely leads to the conclusion that the two groups are samples from different populations, and hence that the difference in, for example, the drug treatment of the two groups has caused the observed difference.

Samples and populations Descriptive statistics describe samples, whereas inferential statistics attempt to predict from samples the properties of the populations from which the samples were derived. The accuracy of the inferences depends on the sample being representative of the population, and this implies that the sample is randomly chosen. In an experiment, the various trial or treatment groups are the samples, and it is usually the aim to make them representative of the human population in general, so that one group represents all persons treated in one way. The results can then be held to be valid for the population, so that the particular treatment has the same effect whenever or wherever it is used. In most comparisons of two or more samples, the null hypothesis is that all samples are from the same population, and that any observed differences are due to the sampling process and are caused by random variation. It is therefore essential that the samples differ only in the variable(s) under investigation, and are otherwise comparable in all ways. Samples may of course differ in more than one variable as long as all the relevant variables are investigated, but an experiment is more sensitive, i.e. more likely to produce a definite result, if only one variable is involved. The most effective way of ensuring this is to allocate subjects to experimental groups at random. Interference with a chosen system of randomization is regarded as making it no longer truly random; once a subject is allocated to a group, the allocation must not change and any subject withdrawn from an experiment for any reason should not be replaced by another one. The cases needed to make up reasonable group sizes must be additional cases rather than replacing the drop-outs in the randomization system.
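As an illustration of random allocation, the following Python sketch draws up an allocation schedule before the experiment starts. The function name, the group labels, the seed and the use of equal group sizes are assumptions made for the example, not part of the text; the point is only that the schedule is fixed in advance and then followed without alteration.

```python
import random


def allocation_schedule(n_subjects, groups, seed):
    """Return the group for each subject, in order of entry to the study.

    Equal numbers of each group label are shuffled once, in advance,
    so that the allocation cannot be influenced later (an assumption
    of this sketch; other randomization schemes exist).
    """
    if n_subjects % len(groups) != 0:
        raise ValueError("n_subjects must be a multiple of the number of groups")
    schedule = list(groups) * (n_subjects // len(groups))
    random.Random(seed).shuffle(schedule)  # seeded, so the schedule is reproducible
    return schedule


schedule = allocation_schedule(20, ["A", "B"], seed=1988)
for subject_number, group in enumerate(schedule, start=1):
    print(subject_number, group)
```

A subject who is withdrawn keeps the allocated slot; additional cases are given the next unused entries of a longer schedule rather than the slot of the drop-out.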

Probability and significance The probability of a particular outcome of a trial or procedure is the proportion of trials producing that outcome out of an infinite number of trials. Probabilities have a value in the range 0 (the outcome never occurs) to 1 (the outcome always occurs), although the percentage, 100 times the probability, is often used. A popular example is the tossing of a coin. Two outcomes are possible, heads and tails, and the probability of either, with an unbiased coin, is 0.5. For any number of trials less than infinity, the actual proportion of a particular outcome, although likely to be close to the probability, can and will differ from it. For 10 tosses of a coin, 5 heads is the most likely result, but 4 or 6 will be common, and 0 or 10 rare, but not impossible. The probability produced by a statistical test is the probability that a value of the test statistic (usually a t, F or chi-squared value) at least as extreme as the one observed will be found if the null hypothesis is true. Thus a probability of 0.05 or 5% states that, if the null hypothesis is true, random variation in the data alone would produce such a value in 1 out of 20 tests. The smaller the probability, the less plausible it is that random variation alone accounts for the result, and the more reasonable it becomes to conclude that there is a true difference between the groups. Statistical significance is achieved if the probability of a test result falls below a preselected value, often 0.05 or 5%, but sometimes 0.01 or 1%, or even 0.005 or 0.5%. Preselection of the significance cut-off level is regarded as an important part of the design of the experiment, and values only fractionally above the level are expected to be reported as non-significant. On the other hand, results falling below the lower cut-off points are usually indicated as being more significant. In the days before computers were routinely used to perform the tests, and all calculations were done by hand, it was relatively easy to calculate the test statistics, but very difficult to calculate their probability. The values of the test statistic for the 0.05, 0.01 and 0.005 probability levels were obtained from tables, and the value of the statistic calculated was compared with these cut-off points. Thus the probability (P) could be described as P > 0.05, P < 0.01 or 0.05 > P > 0.01 (the probability is less than 0.05 and greater than 0.01) without explicitly calculating it.
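The coin-tossing example can be checked directly. The sketch below, in Python, uses the binomial formula to give the probability of each possible number of heads in 10 tosses of an unbiased coin; the function name is an assumption made for the illustration.

```python
from math import comb


def prob_heads(k, n=10, p=0.5):
    """Probability of exactly k heads in n tosses of a coin with P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


for k in range(11):
    print(f"{k:2d} heads: {prob_heads(k):.4f}")
```

Five heads has a probability of about 0.246, four or six heads about 0.205 each, and no heads or ten heads about 0.001 each: rare, but not impossible.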
Now that probabilities can be calculated easily, it is reasonable to quote the calculated probability value, rather than define the range it lies in. Readers can then decide on the significance of a probability of 0.049 or 0.051 for themselves. Statistical significance is not the same as practical or clinical significance; for example a t-test may show that treatment with a particular drug lowers systolic arterial pressure. Further analysis may then reveal that the most likely value for this reduction is 0.5 mmHg, and that the actual value is likely to lie in the range 0.3 to 0.7 mmHg. Here statistical and clinical significances are different. The estimation of the magnitude of a difference between test groups is becoming a common method of assessing the practical or clinical significance of a statistically significant result. In the absence of a statistically significant result, however, no practically significant result has been demonstrated either.

Variables, cases and data types

One number derived from one subject is an observation. All the observations derived from one individual together form a case. Variable is the collective term for the values of the same observation from all the individuals. Variables may be classified as nominal (or logical), ordinal (ranked) or interval. Interval variables may be discrete or continuous.



Nominal variable values have no fixed relationship to observations with a different value: the different values represent broad classifications such as male/female or animal/vegetable/mineral. It is not uncommon to use numbers to represent the different values of a nominal variable, a process known as coding, for example using 1 to represent male and 2 to represent female; this is often encouraged by statistical computer programs which are not designed to handle alphabetical data. This is perfectly correct, but it is incorrect to then treat the data as ordinal or interval, which they might then resemble. Ordinal data are ordered by magnitude or ranked: if one observation has a higher numerical value than another then its magnitude is greater, but the amount of the difference between two adjacent points on the scale is not defined, and may not be the same as the difference between two adjacent points at a different position on the scale. Interval data are from a range of continuous values on a scale of known constant intervals. Continuous means that the variable can take any value within a particular range, for example a measurement. Interval data can also be discrete, meaning that only a limited number of values exist, and that intermediate values between these points are impossible. The relationship between the points is, however, constant: the same difference between two values represents the same amount whatever the values. The commonest type of discrete interval data is a count of events. Most statistical tests for interval data expect continuous data, although they are often used for discrete data after applying corrections for discontinuity, especially when the difference between adjacent values is small compared with the mean or range--making the errors due to discontinuity small. When the errors due to discontinuity could be large, as may be the case for a low count, nonparametric tests, designed for ordinal data, should be used. 
These tests should also be used when the intervals between the points on the scale are unknown. Such data are not ordinal, but should be converted to ordinal before use. Typical of this class of data is observer assessment, such as 'difficulty of intubation' on a scale of, say, 0-4.
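Conversion to ordinal form amounts to replacing each observation by its rank. A minimal Python sketch follows; the function name and the scores are hypothetical, and the convention that tied values share the mean of their ranks is an assumption of the example (it is the convention commonly used by rank-based tests).

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks


scores = [0, 2, 2, 4, 1, 2]  # hypothetical 'difficulty of intubation' scores, 0-4
print(average_ranks(scores))
```

Only the order of the scores survives the conversion; the unknown size of the steps between 0, 1, 2 and so on no longer enters the analysis.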

Frequency distributions The frequency distribution of a variable is the relationship between the value of a variable and the frequency at which that value appears. If the possible values of such a variable are plotted (on the x axis) against the number of cases taking that value (on the y axis), the resulting graph represents the frequency distribution. The graph of a continuous variable, and its frequency distribution, can often be described by a mathematical equation. Frequency distributions of most biological continuous variables follow one of a few common patterns, the most common of which is the 'normal' or Gaussian distribution. Only two numbers are needed to distinguish different normal distributions: the mean, which describes the position of the curve on the x axis (the mean value is the value at the peak of the curve), and the standard deviation, which describes the shape of the curve (in particular the width of the bell). Many of the variables that do not follow the normal distribution can be transformed (converted) into variables that do; common methods of transformation are to replace each observation by its logarithm or arcsine. Because of the number of normal variables, or variables that can be made normal, most of the standard statistical tests are designed for this distribution, and only produce valid results if the data distribution is normal, or nearly so. There are a few other distributions found in biological systems, for example the Binomial and Poisson, each with standard statistical methods for dealing with them, but most of the methods for handling non-normal data make no assumptions about the underlying frequency distribution. These are the non-parametric tests: the term non-parametric applies to the tests and not to the data. For every standard test there is a non-parametric equivalent. Non-parametric tests can be used on normal data, but are less sensitive, i.e. less likely to show any difference between groups.
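The effect of a logarithmic transformation can be seen in a small Python sketch. The values are hypothetical and deliberately extreme; the choice of base 10 is an assumption, and any base gives the same picture.

```python
from math import log10
from statistics import mean, median

raw = [1, 2, 4, 8, 16, 32, 64]        # hypothetical, strongly right-skewed values
logged = [log10(x) for x in raw]       # each observation replaced by its logarithm

print("raw:    mean", round(mean(raw), 2), "median", median(raw))
print("logged: mean", round(mean(logged), 3), "median", round(median(logged), 3))
```

In the raw data the mean lies well above the median, a sign of asymmetry; after transformation the two agree, as they would for a symmetrical distribution.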

Descriptive statistics The term descriptive statistics covers three types of procedure, which all attempt to describe a variable as simply as possible.

Tables. In its simplest form, a table is a list of data, laid out in a way that makes it easy to read. The table may be expanded by showing some summarizing statistics--totals or averages, or reduced to the summary statistics only. The latter is the most efficient way of displaying nominal information. Tables take longer to assimilate than pictures, but contain more information: it is usually necessary to construct a table before converting the information in it to some form of graphical representation.

Graphical methods. Graphical representations give a good overall view of data, but at the expense of detail. Pie charts are simple alternatives to tables. There are two main forms of bar chart; one is a rectangular form of pie chart to demonstrate frequencies or proportions, the other is used to compare several samples. Histograms are a form of bar chart used to show a simplified version of the frequency distribution of a sample. The cumulative frequency distribution or ogive can be more useful under some circumstances. None of these methods shows individual data points. When two associated variables are being examined, the scattergram is a simple plot of each case, with the two variables as the two axes. The addition of a calculated regression line, and possibly the 95% confidence intervals of the regression line, makes the graph even more informative. A modified version of the scattergram can be used for single variables. In this case, the variable is used as the y axis, and the observations are plotted in an approximately vertical line, moved sideways where necessary so that they do not overlap. The mean and standard error or standard deviation can also be shown, and the data for several samples can be plotted on the same graph to allow comparison. This is a convenient method of giving full details of the data without resorting to tabulating the actual values, but is not suitable for large sample sizes, where a bar chart should be used.



Numerical representation. A normal distribution can be completely described by two numbers, the mean and the standard deviation, and while this is not completely possible for non-normal distributions, there are statistics which perform a similar function, by describing the average value and variability of the sample. The mean, median and mode all give an idea of the average value. The mean of a variable is the true average value, i.e. the sum of the observations divided by the number of observations. The median is the value of the middle observation after all the observations have been sorted into order of size. The mode is the most common value. For a normal distribution, all three values are the same, but for asymmetrical distributions, this may not be the case. The standard error of the mean is an estimate of how well a sample mean represents the mean of the population. It is in fact the standard deviation of the distribution of sample means, and so the population mean is 95% likely to lie within ±1.96 standard errors of the sample mean. The variance is the index of the variability of a sample of a normally distributed variable--the extent that individual observations differ from the mean. The standard deviation, which is the square root of the variance, is the form commonly used. An alternative method of expressing the same idea, particularly for non-normal variables, is to quote the range, the largest and smallest values. As this may give a false impression of the true variability--one exceptionally large or small value will markedly distort the range--an alternative is to quote the 25th and 75th centiles (the lower and upper quartiles); 25% of the observations have values below the 25th centile and 25% have values above the 75th centile. Skewness and kurtosis are further descriptors of the shape of a distribution. Skewness is a measure of the asymmetry of the variable, and abnormal values indicate that the mode differs from the mean.
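The measures of average value and variability described so far can be obtained from the Python standard library, as in the sketch below. The observations are hypothetical, and the 'inclusive' method of calculating centiles is one of several conventions in use; the 1.96 multiplier follows the text, although for a sample this small a multiplier from the t distribution would give a somewhat wider interval.

```python
from math import sqrt
from statistics import mean, median, mode, quantiles, stdev

sample = [4, 5, 5, 6, 7, 7, 7, 8, 9, 12]   # hypothetical observations

n = len(sample)
sd = stdev(sample)        # sample standard deviation (n - 1 divisor)
sem = sd / sqrt(n)        # standard error of the mean
# 25th, 50th and 75th centiles; the middle one is the median.
q1, q2, q3 = quantiles(sample, n=4, method="inclusive")

print("mean", mean(sample), "median", median(sample), "mode", mode(sample))
print("SD", round(sd, 2), "SEM", round(sem, 2))
print("95% interval for the population mean:",
      round(mean(sample) - 1.96 * sem, 2), "to",
      round(mean(sample) + 1.96 * sem, 2))
print("range", min(sample), "to", max(sample), "; quartiles", q1, "to", q3)
```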
Kurtosis is a measure of the flatness of the distribution. Both can be used as measures of the 'normality' of a variable, and may be produced routinely by computer programs.

Inferential statistical tests

Inferential statistics attempt to infer from samples the properties of the population from which they are derived. They are used either to estimate a particular parameter of the population, or to investigate the null hypothesis that the samples come from the same population. The main tests suitable for use by the researcher without professional statistical advice are summarized in Table 1. Before using any test, it is essential to be certain that the test is suitable for the type of data involved, and that any other limitations on the applicability of the test have been met.

Table 1. The basic statistical armamentarium.

                              Nominal data      Non-parametric test          Parametric test

Descriptive statistics        Tables            Median                       Mean
                              Percentages       Centile range                Standard deviation
                                                Range                        Range
                                                Frequency distribution

Comparison of means/medians   Chi-squared test
  Two samples, paired                           Wilcoxon signed-rank test    Paired t-test
  Two samples, independent                      Mann-Whitney U-test          Unpaired t-test
                                                Wilcoxon rank-sum test
  > 2 samples, paired                           Friedman test                Analysis of variance with
                                                                             repeated measures
  > 2 samples, independent                      Kruskal-Wallis analysis of   Analysis of variance
                                                variance of ranks

Measurement of association                      Spearman rank correlation    Correlation coefficient
                                                coefficient                  Linear regression
                                                Kendall's tau

One of the commonest errors in analysis is the use of multiple tests. Each statistical test used carries a distinct probability that a significant result is in fact due to chance alone. This probability is up to 0.05, or 1 in 20. If 20 tests are performed, the chance that at least one will produce a falsely positive result is therefore high (but not 100%: probabilities are combined by multiplication, not addition). A correctly designed analysis tests one null hypothesis with one test applied to all the data simultaneously. If multiple testing is necessary, it should not be performed until a single test on all the data has demonstrated that a difference exists--multiple testing should only locate, not detect, a difference. Then a specially designed test or series of tests should be used, but the use and interpretation of these tests is for experts only. The t-test is probably the test most frequently abused by multiple use, usually when an analysis of variance is more appropriate.
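The arithmetic behind the warning on multiple testing can be sketched in a few lines of Python. The sketch assumes that the tests are independent of each other, which is rarely exactly true of repeated tests on the same data, so the figures are a guide rather than an exact result; the function name is an assumption made for the illustration.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is falsely significant.

    The chance that a single test is NOT falsely significant is 1 - alpha;
    for n independent tests these chances multiply.
    """
    return 1 - (1 - alpha) ** n_tests


for n in (1, 2, 5, 10, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

With 20 tests at the 0.05 level the chance of at least one false positive is about 0.64: high, but not the 100% that simple addition would suggest.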

Chi-squared test and contingency tables The only information to be obtained from nominal data is the frequency of a particular value, and whether this differs between trial groups. The first step is to tabulate the total numbers of each value for each group, forming a contingency table. The totals form the cells of the table, which will have one row for each possible value, and one column for each group or sample (or vice versa). A two-way table has only two dimensions--group and value: multi-way tables, which allow for additional factors, are also possible. The main test for difference between the samples in a two-way table is the chi-squared test. There are limitations on the use of the chi-squared test; no cell of the table should have a value of less than two, and no cell should have an expected value of less than five. If the totals are small, and there are only two rows and two columns (note that a chi-squared test can handle any number of rows and columns), Fisher's exact test is more suitable.
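The expected values referred to above are calculated from the row and column totals of the table. The Python sketch below computes them, together with the chi-squared statistic, for a hypothetical two-way table; the counts and names are invented for the illustration, and no continuity correction is applied.

```python
def chi_squared(table):
    """Return (statistic, degrees of freedom, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count for a cell = row total x column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum((obs - exp) ** 2 / exp
                    for obs_row, exp_row in zip(table, expected)
                    for obs, exp in zip(obs_row, exp_row))
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts: rows are the two values of a nominal variable,
# columns are the two trial groups.
observed = [[12, 24],
            [28, 16]]
statistic, dof, expected = chi_squared(observed)
smallest_observed = min(min(row) for row in observed)
smallest_expected = min(min(row) for row in expected)

print("chi-squared", round(statistic, 2), "on", dof, "degree(s) of freedom")
print("smallest observed count", smallest_observed,
      "; smallest expected count", smallest_expected)
```

Checking the smallest observed and expected counts before looking at the statistic is a simple way of confirming that the limitations on the test have been met.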

Comparisons of average values The main tests used compare location, the average or mean values of one variable between two or more groups or samples. Analysis of variance is the general version of this test for normal data. Analysis of variance (anovar) determines how much of the variation (and this is the source of the name) in the observations is due to population differences, and how much is due to random variation. Comparison of the contribution of the various kinds of variation enables determination of the importance of the population differences in comparison with the random (residual, i.e. not otherwise accounted for) variation. In the common special case of there being only two samples, the test simplifies to a special form, a t-test. There are many forms of both analysis of variance and the t-tests. The standard forms assume independent samples. If the data are paired (i.e. the two groups consist of different data taken from the same individuals), for example measurements before and after treatment, then there is a paired t-test that examines only the differences between the pairs. The multiple sample equivalent of the paired t-test is analysis of variance with repeated measures, and this method can also handle more than two observations on each individual. Anovar and the t-tests are contingent on the data being independent (there is no association between the individual points), and normal, and the usual forms of the test also assume that the sample variances are equal. The tests are robust, i.e. not affected by small deviations from the conditions of normality and equality of variance. There are tests, the Kolmogorov-Smirnov test and Bartlett's test, for acceptability of the data; some computer programs perform these tests automatically. There are also versions of the t-test that do not assume equality of variance. However if the normality of the data is suspect, a non-parametric test should be used. The non-parametric analysis of variance is the Kruskal-Wallis analysis of variance of ranks, and it also has a repeated measures equivalent, the Friedman test, and two-sample equivalents: the Wilcoxon rank sum or Mann-Whitney test for two independent samples, and the sign and Wilcoxon signed rank tests for paired samples.
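The difference between the paired and unpaired forms of the t-test can be seen by calculating both statistics for the same data. The Python sketch below does this for hypothetical before-and-after measurements; the function names and the figures are assumptions made for the illustration, and the unpaired form shown is the usual one that assumes equal variances. Only the test statistics are calculated; the corresponding probabilities would come from tables or a statistical program.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(before, after):
    """t statistic for paired data: uses only the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def unpaired_t(x, y):
    """Two-sample t statistic for independent samples with equal variances."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled * (1 / nx + 1 / ny))


# Hypothetical systolic pressures (mmHg) in the same five subjects.
before = [120, 135, 128, 142, 131]
after = [115, 131, 122, 138, 126]

print("paired t  ", round(paired_t(before, after), 2), "on", len(before) - 1, "df")
print("unpaired t", round(unpaired_t(before, after), 2), "on", 2 * len(before) - 2, "df")
```

In these invented data every subject shows a similar fall, so the paired statistic is large, while the unpaired statistic is swamped by the variation between subjects. Treating paired data as independent throws away exactly the information that the design was meant to provide.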

Correlation and regression These are measures of the association between two variables. Correlation only measures the degree of association; regression assumes that the value of one variable (the dependent variable) depends on the value of another (the independent variable). Linear regression calculates the slope and intercept of the line that best fits the data. Multiple regression calculates the relationship between one dependent variable and several independent variables, which are assumed to be independent of each other. Analysis of covariance is a technique that combines analysis of variance and regression. It is commonly used to increase precision in randomized experiments or to remove sources of bias in observational experiments.
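Both quantities come from the same sums of squares and products, as the following Python sketch shows. The data and names are hypothetical; the slope and intercept are those of the least-squares line of the dependent variable on the independent variable, and r is the ordinary (product-moment) correlation coefficient.

```python
from math import sqrt
from statistics import mean


def linear_fit(x, y):
    """Least-squares slope and intercept of y on x, plus the correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data: dose is the independent variable, response the dependent.
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r = linear_fit(dose, response)
print("slope", round(slope, 3), "intercept", round(intercept, 3), "r", round(r, 4))
```

Exchanging the two variables leaves r unchanged but gives a different line, which is why regression requires a decision about which variable depends on which, and correlation does not.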

Curve fitting Linear regression is a method of calculating the coefficients of the (simple or linear) equation that best describes the relationship between independent and dependent variables. As an equation becomes a line on a graph, the result is described as the line of best fit, and the method as 'fitting' the line.

The method actually finds the equation that has the smallest sum of the squares of the distances of the data points from the line, so it is also known as the least squares fit. The least squares method can also be used if the relationship between the variables is not linear, but exponential or polynomial (the sum of several powers of x). Non-linear least squares fitting is commonly used in pharmacokinetics to calculate the equation describing the decay of the plasma levels of a drug.

Common errors

A number of statistical errors appear repeatedly in scientific reports. Most are avoidable, but often only at the design or data collection stage. The effect may range from reducing the impact of the experiment to completely invalidating the result.

Errors in design

1. The sample size is too small. A difference exists, but the data cannot demonstrate it.
2. The sample does not represent the population that was intended. Instead it only represents women, or left-handed people, because of bias in the selection of cases. This can also be thought of as an error in reporting, as quoting the correct population corrects the statistical error.
3. The samples are not comparable because of failure of the randomization. Other differences, in sex, age, or weight, may account for the observed difference ascribed to the treatment or procedure.
4. The measurement or end-point allows observer bias, and this is not measured or corrected. True differences due to treatment may be hidden by differences between the way the observers interpret their observations, or these differences may introduce a sample difference that does not exist.

Errors in analysis

1. The data do not meet the underlying assumptions of the test, or are outside the limits for accuracy. For example, data for a t-test are not normal (for instance skewed) or not continuous, or the groups have differing variances, or data for a chi-squared test have cells with fewer than 2 observed or 5 expected members.
2. Use of multiple tests on the same data without correction: multiple t-tests or Wilcoxon tests.
3. The application of tests for independent data to related data: using an unpaired t-test on paired data.
4. Information is present but not used, or an inappropriately simple test is used. For example, use of a non-parametric test on normal data. A true difference may be demonstrable but not demonstrated.
5. Inappropriate reversal of transformations. For example, use of a log transformation followed by quoting the antilog of the standard deviation (SD). The correct method is to use the mean and the SD of the logged values to find the logs of the points 1 SD above and below the mean, and quote these after taking antilogs.
6. Failure to eliminate outliers. One point at least 2 standard deviations from the remainder can be discarded (as most likely to be an error).
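The correct reversal of a log transformation described in item 5 can be sketched in Python as follows. The measurements are hypothetical and base 10 logarithms are an arbitrary choice; the essential point is that the points 1 SD either side of the mean are found on the log scale first, and only then converted back.

```python
from math import log10
from statistics import mean, stdev

values = [2.0, 4.0, 8.0, 16.0, 32.0]    # hypothetical skewed measurements
logs = [log10(v) for v in values]

log_mean = mean(logs)                    # mean of the logged values
log_sd = stdev(logs)                     # SD of the logged values

centre = 10 ** log_mean                  # antilog of the mean of the logs
lower = 10 ** (log_mean - log_sd)        # antilog of (mean - 1 SD) on the log scale
upper = 10 ** (log_mean + log_sd)        # antilog of (mean + 1 SD) on the log scale

print("centre", round(centre, 2), "range", round(lower, 2), "to", round(upper, 2))
```

The resulting range is not symmetrical about the centre on the original scale, which is why a single back-transformed SD cannot be quoted as a plus-or-minus figure.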

Errors in reporting

1. Confusion of regression and correlation: regression assumes that one variable depends on another; correlation does not, the two variables may both depend on a third.
2. Making too much of a significant correlation coefficient. The standard significance test only demonstrates that the coefficient is not zero. A low, non-zero correlation coefficient does not suggest an association between the variables in practice, and an association does not give any information about a causal relationship.
3. Confusion of statistical and clinical or practical significance. This is reduced if estimates of differences replace simple tests for differences.

PRACTICE: PERFORMING A STATISTICAL ANALYSIS Statistical analysis cannot be thought of as a discrete process that forms one of the last stages of an experiment. The key to success is to design and perform the experiment with the analysis in mind. It is often said that the best way to perform an experiment is to write the report first, leaving blank spaces for the numbers to be filled in when the work is actually done. The advance planning implied by this approach is most likely to lead to a satisfactory statistical analysis.

Design

The constant and justified complaint of medical statisticians is that researchers seeking help with the statistical analysis of an experiment do so after the experimental work is completed, when it is too late to change the design or to ensure adequate trial group sizes. In many cases, there are weaknesses in the design of the experiment that force the use of a complex analytical technique where better design could have allowed a simpler test. Most experiments in the field of anaesthesia can and should be designed so that a single simple null hypothesis can be tested by a simple, robust (insensitive to minor faults in the data) standard test. If this is not possible, it is essential that advice is obtained at that stage, before collecting data. The more advanced statistical devices are intended for use on observational data from surveys, where complicating factors cannot be designed out, not to correct poor experimental design.

Samples should be complete, comparable and truly random, and their sizes calculated as far as possible to demonstrate clearly any positive findings. The end-points should be clear, consistent and chosen in advance. Attempts should be made to ensure that there is no observer bias: observers should be blind to the trial group, end-points should be objective rather than subjective where possible, and observers should be evenly distributed through the groups by stratification of the randomization where possible.

Biologists are expected to abandon an experiment with unsatisfactory samples or data not matching the pre-assigned analysis. In the case of data obtained from human volunteers it is reasonable to attempt to retrieve the experiment by a more complex analysis, as it may be ethically undesirable to repeat an experiment that can be so rescued, but in these cases less confidence can be placed in the results.
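The advice that sample sizes be calculated in advance can be made concrete with a rough sketch. It assumes a comparison of two group means with equal group sizes and uses the common normal approximation, which slightly underestimates the number needed for a t-test; the function name, the default significance level and the default power are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a difference `delta`
    between two means, given a common standard deviation `sd`.

    Normal approximation: n = 2 * ((z_alpha + z_power) * sd / delta) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical planning figures: the smaller the difference sought,
# the larger the groups must be.
print(n_per_group(delta=10, sd=10))
print(n_per_group(delta=5, sd=10))
```

Halving the difference to be detected roughly quadruples the required group size, which is why the calculation belongs at the design stage.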

During the experiment

Make sure that the data are complete, so that few cases have to be withdrawn during the analysis because of incomplete data. This requires attention to detail and careful record keeping. Research students are taught to write everything (results, casual observations, speculations) into a bound notebook and not to use loose pieces of paper, which may get lost or cause confusion by being shuffled. In clinical experiments, collect as much data as possible in advance, as patients can disappear quickly after operations, and the missing piece of information, say weight or height, will not be in the patient's notes, assuming you can find them when you want them. Allow for circumstances in which information is produced faster than you can record it; use printers or chart recorders wherever possible. It can be safer to speak results on to a tape recorder than to try to write everything down. It is essential that the experimenter does not have any clinical responsibilities, as these may distract from the data recording (as well as vice versa). Even if a particular group number implies a particular treatment, write down what was actually done, as the randomization schedule may be lost or confused.
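One way to guard against a lost randomization schedule is to generate it by program from a recorded seed, so that it can be regenerated later. The sketch below assumes two treatments labelled A and B and blocks of 10 cases, so that group sizes are equal after every block; the function name, labels and seed are hypothetical. It does not replace recording what was actually done to each subject.

```python
import random


def block_randomization(n_blocks, block_size=10, treatments=("A", "B"), seed=1989):
    """Return a treatment schedule balanced within each block.

    The seed is recorded with the protocol so that the same schedule
    can be regenerated if the printed copy is lost.
    """
    if block_size % len(treatments) != 0:
        raise ValueError("block size must be a multiple of the number of treatments")
    rng = random.Random(seed)  # private generator, unaffected by other code
    per_treatment = block_size // len(treatments)
    schedule = []
    for _ in range(n_blocks):
        block = list(treatments) * per_treatment
        rng.shuffle(block)  # random order within the block
        schedule.extend(block)
    return schedule


schedule = block_randomization(n_blocks=3)
for case_number, treatment in enumerate(schedule, start=1):
    print(case_number, treatment)
```

Because the generator is seeded, calling the function again with the same arguments on the same Python version reproduces the schedule.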

Preliminary examination of data

Check for distribution and errors. Start with a simple graph of the data, a histogram of each group or a scatter plot if two variables may be related. If the proposed analysis is parametric (expecting normally distributed data), the histogram is a simple check for the normal distribution. An alternative way is to calculate the skewness and kurtosis of the samples. A scatter diagram gives the first estimate of the relationship or lack of it between the two variables, linear or non-linear. In both cases some of the data points may seem to differ wildly from the remainder. These are usually errors, and this can sometimes be confirmed by rechecking the paperwork. Even if there is no detectable clerical error, one point that is grossly different from the others can be safely discarded, but more than one should only be discarded after advice. Many of the computer programs that perform statistical tests will also perform some or all of this preliminary examination.
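The skewness and kurtosis check mentioned above can be sketched as follows. This version uses the simple moment definitions, with kurtosis expressed as an excess over the value of 3 expected for a normal distribution; statistical packages may apply small-sample corrections, so their figures can differ slightly. The function name and the example data are illustrative.

```python
def skewness_kurtosis(data):
    """Moment-based skewness and excess kurtosis of a sample.

    Both are close to zero for normally distributed data.
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis


# Made-up sample with one grossly different point: the skewness is strongly positive.
print(skewness_kurtosis([1, 1, 1, 2, 10]))
```

A markedly non-zero skewness, as in the made-up sample, is a prompt to look at the histogram and to recheck the paperwork for the extreme value.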



Check randomization, look for bias (confounding). At the same time it is necessary to check that any randomization scheme has worked, and that the samples are truly comparable. This is done by comparing the demographic data: weight and height by t-test, sex ratio by chi-squared test, and age. Age might well not be normally distributed. Its distribution can be checked by counting the numbers in each 5-10-year band and using a chi-squared test. If differences are found in any of the variables, it may be possible to correct for them.

Performing tests

There are two main ways of performing the actual tests: by hand or using a computer. The analysis should only be performed by hand if access to a suitable computer is impossible. Although a beginner must invest some time in learning to use both the computer and the particular computer program, this is an investment that will almost certainly pay off later. Analysis by hand is time consuming and prone to errors, especially by beginners, and the labour involved tends to inhibit a full analysis or the use of sophisticated tests.
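As an illustration of the randomization check described above, the following sketch compares the sex ratio of two groups with a chi-squared test on a 2 x 2 table. It assumes no continuity correction and one degree of freedom, for which the probability can be obtained from the complementary error function; the counts are invented, and small expected counts would call for a different test.

```python
import math


def chi2_2x2(a, b, c, d):
    """Chi-squared test (1 degree of freedom, no continuity correction)
    for the 2 x 2 table [[a, b], [c, d]].

    Returns the statistic and its probability under the null hypothesis.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Invented counts: group 1 has 12 men and 8 women, group 2 has 9 men and 11 women.
chi2, p = chi2_2x2(12, 8, 9, 11)
print(round(chi2, 2), round(p, 2))
```

For these counts the statistic is about 0.90 and the probability about 0.34, so there is no evidence that the sex ratio differs between the groups.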

Performing the analysis

1. Use a computer and a statistical program or package.
2. If the pedigree of the program is not certain, test it with worked examples from textbooks.
3. Test your ability to use the computer and program with worked examples from textbooks.
4. Separate the two tasks of preparing the data and analysing it.
5. Check data after entry into the computer, and before the analysis, by comparison with the original documents.
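Testing a routine against a worked example, as steps 2 and 3 recommend, might look like the sketch below. The routine is an ordinary pooled two-sample t statistic written for illustration, and the two small samples are made up so that the answer can be confirmed by hand (t is the square root of 6, on 4 degrees of freedom); in practice the check values would come from a textbook example.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))  # standard error of the difference
    return (mean(a) - mean(b)) / se, df


# Made-up data chosen so the result can be worked by hand:
# means 7 and 3, pooled variance 4, standard error sqrt(8/3), t = sqrt(6).
t, df = pooled_t([5, 7, 9], [1, 3, 5])
assert df == 4
assert abs(t - math.sqrt(6)) < 1e-9
print("routine agrees with the hand calculation")
```

The same pattern applies to a purchased package: run the worked example first and compare the output with the published figures before trusting the program with new data.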

Manual methods. Work through a textbook example first to be sure you have the method correct. Use squared paper and a pencil so that corrections can be made easily. Arithmetical and clerical errors (such as failing to include a data point in a calculation or using the wrong sample size) are likely to be common. Repeat all calculations until the results are the same twice running. Advanced calculators are helpful; some can automatically calculate mean and SD. Many of the older textbooks were written when manual methods were the norm and contain useful short-cuts and cross-checks, although the short-cuts are often at the expense of absolute accuracy.

Using a computer. Computer software is divided into two parts: the application program (usually abbreviated to program), which details a particular task, and the operating system, which contains the instructions common to all tasks to organize the machine and its peripherals, and perform fundamental operations such as loading the application program from the disk into the computer memory. It is not a good idea to use a computer without some idea of the operating system. Nearly all computers have some form of self-teaching manual for the operating system, and the new user should sit down at the machine and work through it. Most operating systems have a facility that allows a program to take data from a disk file when it expects data from the keyboard, and this facility should be used if the program does not take data directly from disk itself.

The application program that performs the statistical analysis will be described as a statistical program, a statistical program suite, or a statistical package. The terms package and program suite in this context refer to a number of related programs designed to work together. The individual programs might, for example, perform the individual statistical tests. Data prepared for one program should, where relevant, be accepted by any other program. Whether a single program or a number of related programs are used is determined by programming considerations, and may change during the evolution of the program. There is no practical difference. Apart from the program to perform the statistics, it may be necessary to use an editor program to prepare a file containing the data and possibly the parameters of the test to be performed.
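The facility for taking keyboard input from a disk file can be illustrated with a small sketch. The program below reads numbers from its standard input, so the same code serves whether the values are typed or redirected from a file; the function name, the file names in the usage note and the convention that lines starting with `#` are comments are all assumptions made for the example.

```python
import sys


def read_values(stream):
    """Read whitespace-separated numbers from a stream of text lines,
    skipping blank lines and lines that start with '#'."""
    values = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values.extend(float(token) for token in line.split())
    return values


if __name__ == "__main__":
    data = read_values(sys.stdin)
    if data:
        print("n =", len(data), "mean =", sum(data) / len(data))
    else:
        print("no data read")
```

On many operating systems a command of the form `python mean.py < data.txt` would make the program read the disk file `data.txt` in place of the keyboard, so the data need to be typed and checked only once.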

Data preparation and validation. An editor program prepares or modifies a disk file. For example, word processors can be used as editors, but there is a common problem, as most word processors normally store layout information in the disk file. There is usually a way of getting round this problem, which is necessary as the layout information usually confuses any application program that receives it. The way that this is done is very variable. Some word processors can be run in an editor or non-document mode. Others can remove the layout information when putting the text file onto the disk (usually described as 'saving as an ASCII file'). Yet others will do this if asked to 'print into a file', but may also need to be explicitly instructed not to add margins and headings.

Some statistical program packages include an input and data-validation program. This places the data in a disk file that can be used by the other programs, allows the data to be printed out to check for typing errors, and performs simple descriptive statistics, again for validation purposes. This program should be used if available. If there is a choice of an editor or a word processor, choose the word processor, as familiarity with its use will be helpful when writing the final report.

1. Prepare a fair copy of the data in columns; squared paper helps.
2. Prepare a disk file of the data using an editor, word processor or data input program.
3. Print the disk file data and compare the printout with the original data, not with any fair copy made. This is easier and more accurate if done by two people, one to read the original data and one to read the printout.
4. Use the editor, word processor or data input program to correct the errors in the disk file; there will certainly be some.
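A variation on the checking step above, not described in the text, is to key the data twice from the original documents and let the computer report the disagreements. The sketch below assumes each entry is plain text with one case per line and values separated by spaces; the function names and the sample values are hypothetical.

```python
from itertools import zip_longest


def parse_table(text):
    """Split plain text into rows of fields, ignoring blank lines."""
    return [line.split() for line in text.splitlines() if line.strip()]


def compare_entries(first, second):
    """Return (row, column, first_value, second_value) for every field
    that differs between two independently keyed copies (1-based positions).

    Fields are compared as text, so '68' and '68.0' count as different.
    A missing field or row is reported with the value None.
    """
    mismatches = []
    for r, (row_a, row_b) in enumerate(zip_longest(first, second, fillvalue=[]), start=1):
        for c, (a, b) in enumerate(zip_longest(row_a, row_b, fillvalue=None), start=1):
            if a != b:
                mismatches.append((r, c, a, b))
    return mismatches


# Hypothetical entries: case number, weight, height. The second copy has a transposition.
entry_1 = parse_table("1 70.5 172\n2 68.0 165\n")
entry_2 = parse_table("1 70.5 172\n2 86.0 165\n")
for mismatch in compare_entries(entry_1, entry_2):
    print(mismatch)
```

Each reported position still has to be resolved by going back to the original documents, since the program cannot tell which of the two copies is correct.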

Programming. The owner of a home computer faced with analysing the results of an experiment will be tempted to write a program to do the work. Writing a computer program is not a trivial task, and thoroughly testing it so that some confidence can be placed in the results usually requires more work than the original writing. Writing a program to perform a t-test is as useful as re-inventing the wheel. Buy, beg or borrow a statistical program or package. Don't steal; this inhibits the supply of good programs, and in nearly all cases, the cost of the program is worth paying just to obtain the instruction manual. Suppliers of statistical software are listed in the Appendix.

Spreadsheet programs. These were initially designed to aid accounting but have been found helpful in a range of simple arithmetical tasks. The program provides an empty spreadsheet, a table of cells each intended to contain a number. The table can be very large, with limits of 8192 rows and 256 columns not uncommon. The number in a cell may be entered directly as data, or derived from other cell contents using a formula supplied by the user. This formula is permanently associated with the cell. Whenever any cell is changed, the contents of any cell derived from it are updated. The cells can also store words to be used as labels, or text to make the sheet more readable when printed. The facility for reading in, changing and printing out the numbers is built in, as are several useful functions, such as the ability to calculate the sum of a column of numbers. These programs provide the safest and quickest way for do-it-yourself 'programming' for simple numerical tasks such as basic statistical tests. However, the ease of use does not remove the need for rigorous testing of any method (in this case cell formulae) used.

Statistical programs and packages. There are a large number of these in existence. The oldest were written over 20 years ago by teams of programmers and statisticians to perform the routine statistical tests done by professional statisticians. These mainframe packages have all evolved: errors have been corrected and improvements made. They are capable of advanced and sophisticated tests, some of which are beyond the ability of the average user to interpret correctly. The simple tests are performed in sophisticated ways; many routinely check or help to check that any test performed is valid for the data presented. For example, the t-test function may test that the variances of the samples are equal. Several related tests may be performed at the same time. To use t-testing as an example again, requesting a standard t-test may also produce the results of modified t-tests that are less affected by unequal variances.

These packages are supported by extensive documentation, but often there will be only one set per installation, and this will be kept in a library and be available for reference only. It is usually possible to purchase further copies of the manuals, but many packages have independently written 'shadow' manuals available in good bookshops at reasonable prices, and these, though sometimes less complete than the official manual, are generally easier to use. The programs are supplied and serviced by specialist organizations, often commercial firms set up for the purpose. The programs are now available on a variety of mainframe and minicomputers, and increasingly on larger microcomputers, although in a cut-down form. They are extensively used by universities and research establishments (where most originated), and are extremely effective.

Among the more commonly available are Minitab, GENSTAT, SPSS (now SPSSX), SAS, GLIM, BMD (now BMDP) and MLP. Minitab and SPSSX are available for micros (see Appendix for further details). Any of these can perform all the tests that can be safely done by the amateur statistician. If there is a choice, SPSSX has the widest range of documentation and is the most widely available. A comprehensive review of mainframe statistical software is available (Francis, 1981).

Mainframe packages have a reputation for being less than user-friendly. They assume that the user is relatively familiar with statistical methodology, the package and the computer. These packages were written for large mainframe computers which worked, as did all computers of that type at that time, in what is called 'batch mode'. The data, instructions to the operating system about the program, and instructions to the program were all punched onto computer cards. The cards were given to the computer operator, who loaded them into the computer in batches. The results were printed on a fast printer and returned to the user, who never saw the computer. If the data or the instructions contained errors, these would be listed on the printout, and the user would correct the cards and repeat the process. There was no facility for interaction between the user and the computer. Although interactive facilities have been added, the packages are still best used in the manner for which they were designed. Prepare disk files containing the data, instructions to the programs and instructions to the operating system. Instruct the computer to obey the latter, and the results will be placed in another disk file. Placing the results in a disk file is more productive than having them printed each time, and saves paper.
Examine the contents of the results file, and if they are satisfactory, have the file printed. If not, correct or modify the input file and repeat the process.

The second group of packages are intended to run on small minicomputers and microcomputers. They cover a much smaller range of tests than the mainframe packages, but do, in general, include all the tests that the amateur statistician will want to, or indeed should, use. Most are only available for one type of computer. The packages are aimed at users who use statistics as a tool, but who are not full-time statisticians, and most are written by such users. New users should have a reasonable understanding of the tests and the computer. These programs are written for interactive computer systems; if an error or problem is found, the computer stops and invites the user to correct the situation. Many programs expect all the program control information to be entered from the computer keyboard each time the program is run. As the program must often be run several times for each analysis, this can be tedious. Some programs expect the data to be entered in the same way. This is likely to give rise to errors, and at best means that the data entry stage has to be very carefully checked at each run, which is time consuming. The operating system may allow data entry expected from the keyboard to be redirected to come from a disk file.

Many of the packages are distributed free or at cost, often through user groups for a particular microcomputer or the computer user groups of professional bodies. Programs from these sources carry no guarantee; if they are serviced at all, errors are corrected only on a grace and favour basis by the author, if and when time is available. Some of the programs are still evolving, as the original authors or users add functions for which they have found a need. Some have been taken up by commercial firms who service and expand them. These programs are published or advertised for sale in computer magazines, especially those dedicated to one particular type of computer. The level of documentation is variable and its availability may be restricted, as many commercial microcomputer programs are sold with licences to be used on one machine only, and many software suppliers assume that this means only one user. If a request is made for more documentation, it is assumed that an illegal copy of the programs has been made. There is a review of a number of American commercial microcomputer packages (Carpenter et al, 1984).

If all else fails, a colleague may have written and tested a program to do the job, but these programs often misbehave when out of the sight of their creator, and so should only be used as a last resort, after careful testing.

Testing methods, programs and packages. The cheaper the program is, the more carefully it must be tested. Statistics textbooks contain worked examples which form the best test material. Use the examples from several books for the more amateur programs, but test even the more expensive packages to confirm that the whole analysis process works in your hands. The mainframe package manuals will contain worked examples that will test your ability to use the package and the computer system. Never run a package for the first time using new data. If at all possible prepare the input in advance, and do not type the data directly into the program. First, you will almost certainly need to run the program several times, and so will avoid having to re-type the data. Second, you will make typing errors, and may have to start all over again. Third, it is difficult to be confident that you have not made any typing error.

PRESENTING RESULTS

The results of the experiment should be presented clearly in such a way that the casual reader can quickly comprehend your conclusion, and the more careful reader can satisfy himself that your data are valid and your conclusion correctly drawn. Do not expect everyone to take your word for the results. Give them the evidence in a way that allows them to come to the same conclusion. Take note of the layout and content of the statistics while reading the papers on similar and related work that need to be assessed before designing the experiment.

Requirements for presentation of results

1. A description of the samples, with sample sizes and confirmation that any randomization scheme was successful.
2. The data, in a form as close to the original as possible.
3. A description of the data, with confirmation of normality if this appears doubtful.
4. Results of tests of the main null hypothesis, with the actual probability and confidence intervals of estimated differences.
5. Results of any other tests suggested by the data.

A scientific report or paper has four main sections: introduction, patients/materials and methods, results, and discussion. Statistical aspects of the design (mainly details of any randomization) should be presented at the start of the methods section, as this forms part of the planning of the study. The statistical methods appear at the end of the methods section, as the analysis is the last step in the experiment. The data, the values obtained from your sample, naturally form the results section, but any inference about the population from which the sample was drawn should be part of the discussion.

Methods section

If the number of subjects was calculated, give any assumptions made and the results of the calculation. If the subjects were divided into groups, state how the allocation was done (simple randomization or some modified form, separate tables for men and women, or stratified to make group sizes equal every 10 cases). State any methods used to measure or reduce observer bias. Give full details of the statistical methods used. It should be obvious which test was used to produce any statistic in the results section. Remember that there are several forms of t-test and analysis of variance. If the choice of test is in any way unusual, give the reasons for choosing it. If the test itself is unusual, quote the reference for its source. If any assumptions are made, justify them.

Results section

Start by giving full details of all cases withdrawn for any reason. State the numbers withdrawn for each reason from each trial group, and summarize any results obtained before withdrawal. Next, present the data for the minor variables: those recorded to check randomization or for subsidiary reasons. These can be presented in summary form, as means and standard deviations for continuous data, or as medians with range or quartile range.

It is important to present as much as possible of the data for the major variables, as a table if there are only a few cases, or in graphical form if there is a lot of information. The scattergram is ideal for this purpose. Include as a table the means and standard deviations or standard errors. Use the standard error if the mean values are important to your argument, and the standard deviation if the variability of the data is a feature. If using paired data, it is more relevant to present the mean and SD of the differences between pairs. Add the mean and SD/SEM to the graphics, in a way that does not obscure the data. For paired data, the graphics should indicate the link between pairs, or only show the differences.

Now present the main statistics. If the proposed method of analysis could not be used, say why, with any numbers if available. Include the information graphically as well as numerically if at all possible; for example, add a calculated regression line over the scattergram of the data. For the difference between group means, do not simply state whether or not there seems to be a difference, but estimate the value and add the 95% confidence intervals (Gardner and Altman, 1986). You may, for example, find a difference in plasma sodium between two groups with a probability of 0.001, but then demonstrate that the estimated difference is 0.1 mmol, showing statistical significance but no practical significance. If possible, present calculated values for probability, not '0.05 > P > 0.01' or '*, **, or ***'. If probabilities have been calculated and presented, it would be reasonable to defer consideration of their statistical significance to the discussion.

If the data suggested any further tests, present their results last and distinguish them from results obtained as part of the design of the study. These exploratory analyses can suggest new hypotheses, but not test them.
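The estimate of a difference with its 95% confidence interval, recommended above, can be sketched from summary statistics alone. The sketch assumes two independent groups with a pooled variance, and it takes the critical value of t from printed tables as a parameter rather than computing it; the function name and the group figures are invented for illustration.

```python
import math


def difference_with_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Difference between two group means with its confidence interval.

    `t_crit` is the tabulated two-sided critical value of t for
    n1 + n2 - 2 degrees of freedom (for example 2.101 for 18 degrees
    of freedom at the 95% level).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
    diff = mean1 - mean2
    return diff, diff - t_crit * se, diff + t_crit * se


# Invented summary figures for two groups of 10 subjects.
diff, lower, upper = difference_with_ci(140.0, 2.0, 10, 138.5, 2.0, 10, t_crit=2.101)
print(round(diff, 2), round(lower, 2), round(upper, 2))
```

For these invented figures the estimated difference is 1.5 with an interval from about -0.38 to 3.38. Reporting the interval shows the reader both the likely size of the effect and the fact that zero is not excluded.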


The British Medical Journal has published a discussion paper on statistical guidelines for contributors to medical journals (Altman et al, 1983), and also prints every 6 months its guidelines for its statistical referees (Anon, 1988).

Two factors influence the choice of reading. First, most short textbooks of statistics date from the time that the calculations represented the main problem in a statistical analysis. These books therefore tend to concentrate on a cookbook approach to the performance of the tests, at the expense of the theoretical background. The use of computer programs for the calculations means that the background reading of the casual statistician should be concentrated on the collection of valid data for input to the computer, the use of the correct tests, and the interpretation of the results. Secondly, medical statistics texts are biased towards observational studies, and the well-controlled experimental studies that are usual in anaesthesia have more in common with the experimental work of biologists and agriculturists. Short textbooks aimed at biologists are more likely to be of use. The textbooks by Campbell (1974), Snedecor and Cochran (1980) and Petrie (1987) are recommended.

REFERENCES

Altman DG, Gore SM, Gardner MJ & Pocock SJ (1983) Statistical guidelines for contributors to medical journals. British Medical Journal 286: 1489-1493.
Anon (1988) Guidelines for writing papers. British Medical Journal 296: 49-50.
Avram MJ, Shanks CA, Dykes MHM, Ronai AK & Stiers WM (1985) Statistical methods in anesthesia articles: an evaluation of two American journals during two six month periods. Anesthesia and Analgesia 64: 607-611.
Campbell RC (1974) Statistics for Biologists, 2nd edn. London: Cambridge University Press.
Carpenter J, Deloria D & Morganstein D (1984) Statistical software for microcomputers: a comprehensive analysis of 24 packages. Byte 9: 234-264.
Francis I (1981) Statistical software: a comparative review. North Holland: Elsevier.
Gardner MJ & Altman DG (1986) Confidence intervals rather than P-values: estimation rather than hypothesis testing. British Medical Journal 292: 746-750.
Petrie A (1987) Lecture Notes on Medical Statistics, 2nd edn. Oxford: Blackwell Scientific Publications.
Pocock SJ, Hughes MD & Lee RJ (1987) Statistical problems in the reporting of clinical trials: a survey of three medical journals. New England Journal of Medicine 317(7): 426-432.
Snedecor GW & Cochran WG (1980) Statistical Methods, 7th edn. Ames, Iowa: Iowa State University Press.