RBMOnline - Vol 9, No 2, 2004, 129–131. Reproductive BioMedicine Online; www.rbmonline.com/Article/1327; on web 10 June 2004
Commentary

Further thoughts regarding evidence offered in support of the 'Barker hypothesis'

E Walters1,3, RG Edwards2

1Babraham Institute and Churchill College, Cambridge; 2RBMOnline, Duck End Farm, Park Lane, Dry Drayton, Cambridge, UK
3Correspondence: [email protected]
Abstract

The original objection to the paper by Kwong et al. (2000) was that the use of an inappropriate (between-pup) estimate of experimental error had exaggerated the importance of the maternal nutrition effect. From the group's most recent response, it has been possible to regenerate the raw data and carry out a further detailed analysis. It is apparent that, despite now using a more sophisticated statistical tool, Kwong et al. have still, in effect, used the between-pup error, thus repeating the previous, probably exaggerated, finding. It is maintained that the nutrition effect should be studied using the between-rat variation, which provides a result that is a good deal less emphatic. Further, a very important point of principle is involved in this dispute, relating to the rigorous analysis of hierarchical data, particularly in small studies.

Keywords: Barker hypothesis, dispute, maternal nutrition, rats, statistics
Introduction
Statistical critique
In an earlier note (Walters and Edwards, 2003), the statistical treatment of data contained in a paper by Kwong et al. (2000) was criticized. The main complaint was that inferences regarding the two diet treatments applied to 12 rats should in essence be based on 12 summary statistics (one data point per rat), and not on the relatively numerous offspring data. It was felt that the authors had ignored the very important hierarchical structure of the data. Almost certainly, errors of this sort would tend to exaggerate the importance of any effect being investigated, by using estimates of random variation that were too small. It was suggested that the weight of evidence for the authors’ conclusions should be tested by a re-analysis of the data, using more rigorous methods.
Before embarking on a detailed discussion of the analysis, it may be useful to digress a little on the issue of valid statistical inference. In the customary method of presenting experimental results, the 'weight of evidence' for any conclusion is generally based on the significance probability (or P-value). This represents the probability, under the 'null hypothesis', of obtaining a result at least as extreme as that actually observed. The 'null hypothesis' here represents the assumption that the phenomenon (or treatment) of interest has no impact at all. For the customary statistical analysis to be valid, the treatments (here the two diets) need to be allocated at random to the experimental units (the rats). It is crucial to note that the randomization process applies to the experimental unit (the rat), and not to the pup. Consequently, it is variation among experimental units treated alike that should be used as experimental error to test treatment effects. The individual observations for the pups contribute to the study only by adding precision to the dam summary statistics.
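The consequence of ignoring the experimental unit can be illustrated by a small simulation. This sketch does not use the authors' data; the group sizes and variance components below are hypothetical. With a genuine between-dam variance component and no true diet effect at all, a test that treats pups as independent observations rejects the null hypothesis far more often than the nominal 5%, whereas a test based on one summary statistic per dam holds its level:

```python
import numpy as np
from scipy import stats

# Hypothetical set-up: two diets, 6 dams per diet, 5 pups per dam,
# a real between-dam variance component, and NO true diet effect.
rng = np.random.default_rng(1)
n_sims = 2000
false_pup = false_dam = 0
for _ in range(n_sims):
    # dam means vary about a common value; pups vary within each dam
    dams = rng.normal(110.0, 8.0, size=(2, 6))                  # diet x dam
    pups = dams[:, :, None] + rng.normal(0.0, 14.0, size=(2, 6, 5))
    # pup-level test: all pups wrongly treated as independent units
    p_pup = stats.ttest_ind(pups[0].ravel(), pups[1].ravel()).pvalue
    # dam-level test: one summary statistic (the mean) per dam
    p_dam = stats.ttest_ind(pups[0].mean(axis=1), pups[1].mean(axis=1)).pvalue
    false_pup += p_pup < 0.05
    false_dam += p_dam < 0.05

print(false_pup / n_sims)   # well above the nominal 0.05
print(false_dam / n_sims)   # close to the nominal 0.05
```

The pup-level test's false-positive rate is inflated because pups within a dam are correlated, so the effective sample size is far smaller than the pup count suggests.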
Kwong and her colleagues have now submitted a revised analysis (Kwong et al., 2004) in which they reiterate some of their earlier findings. The new paper has been examined, and even though the ‘multilevel modelling’ method used was capable of producing the correct analysis, the authors appear to have used yet again the ‘between-offspring’ error to make their inferences. It is maintained, therefore, that the inferences are biased in that they have grossly exaggerated the importance of the diet effect. Whereas in the response to the first Kwong paper (Kwong et al., 2000), it was not possible to carry out an independent detailed analysis, any objections being simply on a point of principle, the publication of new data has enabled a numerical investigation to be performed. In their more recent paper, Kwong et al. (2004) presented a diagram from which it was possible to abstract the blood pressure values for individual offspring.
In the analysis of hierarchical data, some analysts would first test whether the higher-stratum error (here, between rats) was significantly greater than the lower-stratum error (here, between pups). If it was not, they would argue that pooling the errors was justified. In small investigations, this is a very dangerous procedure indeed. When the higher error is based on relatively few degrees of freedom, the test for heterogeneity is fairly insensitive; indeed, it may be contended that such a procedure is often used simply to produce a lower error estimate for testing. The lower-stratum error will almost always (as here) be the smaller figure. As a general rule, the variation between experimental units should be used for testing purposes.

The main reason for advocating this rather severe stance is that so much published material in assisted reproduction techniques lacks what a statistician would call an experimental design. Quite often, the exercise is simply a speculative trawl through the records. If an investigation has shortcomings from the design angle, it should at the very least employ the most rigorous and fastidious analytical methods available; otherwise, the relaxation of so many of the validating conditions of statistical evaluation would detract radically from its value.

The numerical analysis of the data will now be considered in some detail. As already mentioned, it was possible to abstract the original systolic blood pressure values (n = 61) from the diagram, albeit rather imprecisely. Since the data conform to a standard hierarchical structure, the analysis of variance was computed, and is displayed in Table 1. It is satisfying to note that the 'within-dam' standard deviation (13.8) was very close to the figure of 13.5 quoted by Kwong et al. (2004), adding credibility to the abstraction of the data. Other aspects of the calculations were also consistent with the figures of Kwong et al. (2004), confirming that the method of regenerating the original data was fairly reliable.

Table 1. Hierarchical analysis of variance of systolic blood pressure values.

  Source                             Sum of squares    df    Mean square
  Between treatments                         1293.6     1         1293.6
  Between dams (within treatment)            2660.3    10          266.3
  Between pups (within dams)                 9310.6    49          190.1

The disagreement with Kwong et al. comes down very simply to two tests that can be carried out on this analysis of variance (ANOVA) table. It is maintained that the correct rigorous test is given by the variance ratio statistic F = 1293.6/266.3 = 4.86 (P = 0.052). By contrast, Kwong et al. in effect use the variance ratio statistic F = 1293.6/190.1 = 6.80 (P = 0.012). The P-value of 0.009 quoted by Kwong et al. (2004) appears to be the result of using normal probability tables rather than the (correct) variance ratio F, or Student's t table, since the error is an estimated figure based on 49 degrees of freedom. These findings are summarized in Table 2, where the weight of evidence is seen to increase dramatically with the adoption of progressively less rigorous analyses.

Table 2. Mean systolic blood pressures (BP) for the two experimental diets applied to dams, together with the results of applying three statistical tests. The three differences are quoted as mean ± SE.

  Diet (%)    Mean BP
  18          106.1
  9           116.2

  Test    Difference      P-value
  (a)     10.1 ± 4.58     0.052
  (b)     10.1 ± 3.87     0.012
  (c)     10.2 ± 3.87     0.009

Test (a) obtained by using 'between-dam' variation as experimental error, and referring to Student's t table. Test (b) obtained by using between-pup variation as experimental error, and referring to Student's t table. Test (c) is that quoted by Kwong et al. (2004), consistent with using between-pup variation and referring to normal probability tables.
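The competing tests can be reproduced directly from the mean squares and degrees of freedom in the ANOVA table. A minimal sketch using SciPy (the numerical inputs are those abstracted from the diagram, so all results are approximate):

```python
from scipy import stats

# Mean squares and degrees of freedom from the hierarchical ANOVA (Table 1)
ms_treatment = 1293.6   # between treatments, 1 df
ms_dams      = 266.3    # between dams (within treatment), 10 df
ms_pups      = 190.1    # between pups (within dams), 49 df

# Test (a): rigorous test, between-dam variation as experimental error
F_a = ms_treatment / ms_dams            # ~4.86
p_a = stats.f.sf(F_a, 1, 10)            # ~0.052

# Test (b): between-pup variation as experimental error
F_b = ms_treatment / ms_pups            # ~6.80
p_b = stats.f.sf(F_b, 1, 49)            # ~0.012

# Test (c): difference/SE referred to the normal table, as Kwong et al.
# appear to have done (close to the 0.009 quoted; the exact figure
# depends on rounding of the difference and its SE)
z = 10.2 / 3.87
p_c = 2 * stats.norm.sf(z)

print(round(F_a, 2), round(p_a, 3))
print(round(F_b, 2), round(p_b, 3))
print(round(p_c, 3))
```

Note that test (a) differs from test (b) only in the choice of error stratum: the numerator mean square is identical, so the entire disagreement rests on which denominator (and hence which degrees of freedom) is appropriate.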
It is appreciated that Kwong and her colleagues may not in fact have carried out the calculations precisely in the way described here, but the algorithm that they used seems to have provided a similar breakdown of the variation. Note in particular the identical standard error (3.87) of the effect of interest.
Discussion

Although Kwong et al. (2004) have now reanalysed the data from their earlier paper using more sophisticated statistical methods, it is apparent that they have cited exactly the same statistical result as before, that is, one obtained by using the between-pup variation as experimental error instead of the more rigorous between-dam error. Since the between-pup error will virtually always be the smaller figure, its indiscriminate use will often lead to exaggerated claims for treatment effects.

It is not the intention to attach too much importance to the simple fact that the P-value (0.052) calculated by reanalysis of the results did not quite achieve significance at the customary level, whereas that reported by Kwong et al. (P = 0.009) is highly significant. What is of vital importance, however, is that they have reported what they regard as a very emphatic effect, whereas it might be prudent to be more reticent. Clearly, further investigations are necessary, preferably using an adequate number of dams so as to avoid the need to pool various sources of variation in order to obtain a reasonable test of hypothesis. The consistency of the direction of the effect is acknowledged, but its true magnitude now needs to be evaluated precisely.

Now that it has been possible to study these data in detail, the point of principle outlined in the previous paper remains fully as relevant. Analysts need to ensure that the 'weight of evidence' in support of conclusions is evaluated as fastidiously and rigorously as possible. Mis-analyses of hierarchical data often represent the most severe distortions of the truth. In the present example, Kwong et al. (2004) were no doubt led to accept the lower-stratum (between-pup) error because of the relatively small 'between-dam' component of variance. As has been argued, however, this device can be quite misleading in small studies, and a more conservative approach is to be preferred.
It is noted that the practice that raises the strongest objections here is quite prevalent in published work in assisted reproduction technology. Nor is the confusion confined to biological journals, where statistics is not a mainstream discipline. Even in the relatively elevated statistical environment represented by the journal Biometrika, lack of attention to hierarchical structure is not unknown; see, for example, the dispute between Olkin and Vaeth (1981) and Walters and Rowell (1982).
The reader may wonder what, if anything, has been tested when the between-pup variation is used as experimental error. Quite simply, the test is of whether there is statistical evidence of the phenomenon for the particular experimental units under study (the dams). In virtually all research of this sort, it is preferable to make inferences about the infinite population from which a random sample is available for study. For this wider hypothesis, the between-dam variation should be used as experimental error. Readers who may be puzzled by this rather subtle point of inference should refer to the discussion on the impact of sample size in the earlier note (Walters and Edwards, 2003). These two types of hypotheses are sometimes referred to as the 'finite model' and the 'infinite model'. Although the issues raised may appear rather slight to many readers, a very important point of principle is involved.

Another paper has appeared on this topic (Huxley et al., 2003), in which the authors concluded that the 'fetal effect' on subsequent blood pressure had, on the whole, been exaggerated. Their study was a synthesis of more than 50 published papers, and they particularly noted that large effects were often drawn from small trials, whereas really large trials quoted much smaller effects. They also detected evidence of publication bias and inadequate attention to potential confounders. Since the Kwong study contained 12 rats in total, it is pertinent to mention that Huxley and her co-workers designated a study as being of 'moderate size' when there were over 1000 participants. The perception of the exaggeration of this 'fetal effect' is based solely on what is regarded as an unsatisfactory statistical evaluation.

References
Huxley R, Neil A, Collins R 2003 Unravelling the fetal origins hypothesis: is there really an inverse association between birthweight and subsequent blood pressure? Lancet 360, 659–665.
Kwong WY, Wild AE, Roberts P et al. 2000 Maternal under-nutrition during the preimplantation period of rat development causes blastocyst abnormalities and programming of postnatal hypertension. Development 127, 4195–4202.
Kwong WY, Osmond C, Fleming TP 2004 Support for Barker hypothesis upheld in rat model of maternal undernutrition during the preimplantation period: application of integrated 'random effects' statistical model. Reproductive BioMedicine Online 8, 574–576.
Olkin I, Vaeth M 1981 Maximum likelihood estimation in a two-way analysis of variance with correlated errors on one classification. Biometrika 68, 653–660.
Walters E, Edwards RG 2003 On a fallacious invocation of the Barker hypothesis of anomalies in newborn rats due to mothers' food restriction in preimplantation phases. Reproductive BioMedicine Online 7, 580–582.
Walters DE, Rowell JG 1982 Comments on a paper by I. Olkin and M. Vaeth on two-way analysis of variance with correlated errors. Biometrika 69, 664–666.
Received 5 March 2004; refereed 30 March 2004; accepted 28 May 2004.