Early Human Development xxx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Early Human Development journal homepage: www.elsevier.com/locate/earlhumdev
WASP (Write a Scientific Paper) using Excel – 10: Contingency tables A R T I C L E I N F O
A B S T R A C T
Keywords: Software Computers Statistics Biostatistics
Contingency tables may be required to perform chi-test analyses. This provides pointers as to how to do this in Microsoft Excel and explains how to set up methods to calculate confidence intervals for proportions, including proportions with zero numerators.
1. Introduction A contingency table (a cross tabulation or crosstab) is a matrix, a table that depicts the frequency distribution of variables, typically as counts, i.e. the distribution of one variable in rows and another in columns. These are useful to study the association/s between two variables. This sort of data does not make (and indeed does not need to make) any assumptions about the data being normally distributed. The commonest form is a 2 by 2 matrix (Table 1). The dependent variable (the outcome) is conventionally placed at the top and the independent variable (predictor/exposure/test) is placed at the side. Thus, the outcome of interest is typically the top left cell (a). Most outcomes are usually negative events or failures such as death, disablement, disease or discomfort. Examples of predictors or exposures are smoking, alcohol, sex, blood group, hypertension, diabetes, or treatments. Some outcomes are not dichotomous (binary) but can be transformed into a binary format using cut-offs. A typical example is body mass index (BMI), which is a continuous variable. Sets of BMI values can be analysed in contingency tables by using cut-off thresholds, such as World Health Organisation values for obesity. Values are thus categorised as obese/not obese and can be input into a 2 by 2 matrix. Tables such as these may be extended to 2 by n. It is important to note that the comparison for two sets of predictor/exposure variables are “with” or “without”. The denominator (the total) is never invoked (and is unnecessary) as this is not a rate calculation. 2. Chi function Excel's inherent chi testing functions are rudimentary. The CHITEST function requires not only observed but also expected values. For example, Table 2 depicts a small database that comprises monthly live births, by sex, for the state of California for the year 2013. A pivot table easily extracts male and female totals as has been shown in a previous paper in this series [1]. The ratio of male to total births is 0.5124. Were the expected ratio 0.5 (equal number of male and female births), the expected values are calculated on the right hand side along with the formulas depicting the calculations. The observed ratio departs significantly from the expected ratio (p < < 0.0001) as shown, along with the Excel chi function. However, it is known that this ratio approximates 0.515. The expected values calculation is also shown on the right as well as the chi function, which is still statistically significant, but less so (p = 0.0003). 3. Excel limitations The following are not natively available in Excel. In 2 × 2 tables, a “Yates continuity correction” is conventionally applied so as to prevent overestimation of statistical significance for small datasets. There are additionally three common variants of the chi test:
• A Mantel-Haenszel test for the analysis of stratified or matched categorical data. https://doi.org/10.1016/j.earlhumdev.2018.02.020
0378-3782/ © 2018 Elsevier B.V. All rights reserved.
Early Human Development xxx (xxxx) xxx–xxx Table 1 Template for a 2 by 2 table. Outcome
predictor/ exposure
Present Absent
Present
Absent
a c
b d
Table 2 Monthly live births, by sex, for California, 2013 and chi calculations.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
A
B
C
Month January January February February March March April April May May June June July July August August September September October October November November December December
Sex Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male
Births 20,243 21,008 17,939 18,772 19,889 20,815 18,960 19,834 19,707 20,809 18,902 20,104 20,761 21,931 21,818 23,070 21,537 22,267 21,399 22,254 19,600 21,073 20,455 21,558
D
E
F
Row Labels Male Female Grand Total
Sum of Births 253,495 241,210 494,705
Male Female Male/Total p
Male Female Male/Total p
G
Observed 253,495 241,210 494,705 0.5124 2.5854E-68 =CHITEST(F9:F10,G9:G10)
Expected 247,352.5 247,352.5 494,705 0.5000
Observed 253,495 241,210 494,705 0.5124 0.000277041 =CHITEST(F18:F19,G18:G19)
Expected 254,773.075 239,931.925 494,705 0.5150
H
=F11/2 =F11/2
=F20*0.515 =F20*0.485
• A chi test for trend to ascertain whether a ratio is changing, e.g. with time and • McNemar's test for paired observation. Furthermore, it is considered inappropriate to use the chi test for 2 × 2 tables if the totals in the table are < 20 or if the totals are in the range 20–40 and the smallest expected value is < 5. In this case, Fisher's exact test is considered the better choice. 4. Excel add-ins Data for such tests can usually be extracted using pivot tables from a database, as shown above. It is laborious to attempt to carry out these analyses in Excel and for this reason, this paper will not demonstrate further how to construct more functional chi tests in Excel. However, several add-ins are available which can do all of these tests. A popular (and free) Excel add-in called Bio-Med-Stat performs all of these functions [2]. 5. Confidence intervals for proportions Confidence intervals are derived from the standard error which is in turn calculated from the standard deviation. However, contingency tables assume non-normality and these calculations are therefore not an inherent part of the calculation of statistical significance. However, the confidence interval of a proportion is still important as it gives researchers the 95% limits of probability of where the true population proportion lies. There are two common ways to calculate this. A good approximation is the following which first calculates the standard error (p is the proportion):
SE =
p(1−p) n
From this point, the confidence interval is conventionally calculated thus, where z is conventionally given as 1.96 (95%)
CI = p ± (z × SE) However, the so-called binomial methods should be used for the calculation of confidence intervals for greater precision when the proportion is ≤0.3 or ≥0.7 (i.e. 0.3 ≤ p ≤ 0.7) [3]. These are calculated thus (where q = 1-p): 2
Early Human Development xxx (xxxx) xxx–xxx Table 3 Confidence interval calculations for a proportion.
1 2 3 4 5 6 7 8 9 10 11 12 13
Upper CI =
Lower CI =
A
B
C
Affected Total (n)
231 26,117
p SE
0.008844814 0.000579367
=B1/B2 =SQRT(B4*(1-B4)/B2)
Upper CI p Lower CI
0.009980372 0.008844814 0.007709255
=B10 + (1.96*B7) =B4 =B10-(1.96*B7)
Upper CI p Lower CI
9.98 8.84 7.71
=B7*1000 =B8*1000 =B9*1000
(2np + z2 + 1) + z z2 + (2−1/n) + 4p(nq−1) 2(n + z2) (2np + z2−1) −z z2− (2 + 1/n) + 4p(nq + 1) 2(n + z2)
6. Confidence interval for zero numerator It is naturally possible for an event not to occur, say over a given period of time in a sample. For example, a rare congenital anomaly may not occur if the sample size is insufficiently large. This does not imply that the event cannot occur and confidence intervals can be calculated in order to establish the possibility. Naturally, if an event does not occur, a lower confidence interval cannot be calculated, but an upper confidence interval (which is naturally fully dependent on the denominator) can be estimated by the simple formula 3/n [4]. It is possible to key in these formulas into Excel. The example below pertains to an actual study investigating congenital heart disease in Malta. There were 231 live births with congenital heart disease who were diagnosed by one year of age, from a total population of 26,117 live births born over the period 1990–1994 (Table 3) [5]. The proportion and approximate 95% confidence intervals are shown as well as the calculations. Since these are rates, and rates of congenital malformations are conventionally displayed per 1000 live births, p and confidence intervals are shown multiplied by 1000, implying that the rate was of congenital heart disease in Malta was 8.84/1000 live births (95% CI: 7.71–9.98/1000 live births). However, the proportion here is ≤0.3 and the exact calculation of confidence intervals therefore yields slightly different values (95% CI: 7.76–10.08/1000 live births). It will be appreciated that since this is a relatively small population, very rare congenital heart lesions may not have been present in this five year sample. For any such lesions, the upper confidence interval of this proportion is 3/26117 = 0.00011, i.e. 0.11487/1000 live births. The next paper in this series will demonstrate how to calculate the various test characteristics from a contingency table in order to obtain sensitivity, specificity, positive and negative predictive values, diagnostic accuracy and prevalence [6]. Acknowledgments The inspiration for this series of papers arises from Thomas Douglas Victor Swinscow's original series of papers in the 1970s entitled “Statistics at Square One” [7] as well as the Excel-based statistics talks prepared for the international Write a Scientific Paper course (WASP – http://www.ithams. com/wasp) [8,9]. I would also like to thank Dr. Neville Calleja (Director at Department of Health Information & Research - Department of Health Information & Research, Ministry of Health, the Elderly and Community Care) for reviewing these manuscripts. Conflict of interest statement There are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. References [1] V. Grech, WASP (write a scientific paper) using excel - 2: pivot tables, Early Hum. Dev. (Jan 12 2018), http://dx.doi.org/10.1016/j.earlhumdev.2018.01.003 (pii: S0378-3782(18) 30007-0, [Epub ahead of print]). [2] P. Slezák, Microsoft excel add-in for the statistical analysis of contingency tables, Int. J. Innov. Educ. Res. 2 (2014) 90–100. [3] J.L. Fleiss, Statistical Methods for Rates and Proportions, 2nd edition, Wiley, New York, 1981. [4] J.A. Hanley, A. Lippman-Hand, If nothing goes wrong, is everything all right? Interpreting zero numerators, JAMA 249 (1983) 1743–1745. [5] V. Grech, Spectrum of congenital heart disease in Malta. An excess of lesions causing right ventricular outflow tract obstruction in a population-base study, Eur. Heart J. 19 (3) (1998 Mar) 521–525.
3
Early Human Development xxx (xxxx) xxx–xxx [6] [7] [8] [9]
Grech V. WASP (Write a Scientific Paper) using Excel – 11: Test characteristics. Early Hum. Dev. - in press. T. Swinscow, Statistics at square one, Br. Med. J. 1 (6020) (1976) 1240. V. Grech, WASP – write a scientific paper course: why and how, J. Vis. Commun. Med. 40 (3) (Jul 2017) 130–134. V. Grech, S. Cuschieri, Write a scientific paper (WASP) - a career-critical skill, Early Hum. Dev. (Jan 17 2018), http://dx.doi.org/10.1016/j.earlhumdev.2018.01.001 (pii: S03783782(18)30005-7 [Epub ahead of print]). ⁎
Victor Grech Academic Department of Paediatrics, Mater Dei Hospital, Malta E-mail address:
[email protected]
⁎
Department of Paediatrics, Mater Dei Hospital, Malta.
4