Discussion forum
Open questions in conducting confirmatory replication studies: Commentary on "A purely confirmatory replication study of structural brain-behaviour correlations" by Boekel et al., 2015

Ryota Kanai

Sackler Centre for Consciousness Science, School of Psychology, University of Sussex, Brighton, United Kingdom
1. Introduction
Reproducibility is at the heart of science, and replications of previous studies are crucial for its healthy progress. Boekel and colleagues attempted to replicate previously reported structural brain-behaviour (SBB) correlations, pre-registering their planned data collection and analysis methods prior to data analysis (Boekel et al., in press). This is one of the first studies to combine pre-registration with Bayesian inference in examining structural neuroimaging findings. Pre-registration is an important feature of such attempts because it prevents researchers from flexibly changing data analysis strategies or selecting variables post hoc after seeing the results. The use of Bayesian statistics was also important because it enables a test of whether there is substantial evidence for the null hypothesis that there is no association between brain structure and behavioural measures. In this study, the authors claim that almost none of the previous findings they examined, including those published in my own work, were replicated.

I was one of the reviewers of this paper. Despite a couple of major concerns, I recommended publication for two reasons. The primary reason was that the issues I encountered during the review are open-ended; I reasoned that they should be openly discussed within the community, rather than debated privately between the authors and myself during the review process. The second reason is that taking pre-registration seriously entails publishing the paper as long as the study adhered to the original plan. As a referee, I did notice a few methodological shortcomings in this particular study (see below for full details).

The issues that I regard as worth public discussion are twofold. The first concerns statistical methods for a confirmatory study in the context of neuroimaging: the analysis method used by Boekel et al. was suboptimal in the sense that it necessarily underestimates correlations. The second concerns the review procedure for pre-registered studies in general. On one interpretation of pre-registration of a confirmatory study, authors should not deviate from the original analysis plan. But what if a reviewer spots a potential problem with the methods? Should authors address reviewers' requests for additional analyses? As pre-registration currently stands, it is unclear whether, as a reviewer, I am in a position to request additional analyses and/or experiments. In what follows, I discuss these two points in more detail. My first set of comments is focused on the replication results for two of my previous studies (Kanai, Bahrami, Roylance, & Rees, 2011; Kanai, Dong, Bahrami, & Rees, 2011) included in this replication study, as I am most familiar with the original methods. However, many of my comments will be applicable to the other studies.
2. The regions-of-interest (ROI) approach underestimates correlations

In Boekel et al.'s study, the authors extracted the mean grey matter volume from regions of interest (ROIs) defined as the voxels that, in the original studies, exhibited correlations at a statistical significance of p < .001, uncorrected. If the measures were low-dimensional behavioural data, such as reaction time differences across conditions, one could simply re-run the
original study for a confirmatory Bayesian statistical analysis. However, there is a unique methodological problem associated with high-dimensional MRI data. Take the example of the Cognitive Failures Questionnaire (CFQ) study (Kanai, Dong, et al., 2011). The statistical inference in the original study was that the regional grey matter volumes of some voxels within the fronto-parietal cortices showed a correlation significantly higher than chance after correcting for multiple comparisons across the search space (i.e., the frontal and parietal lobes). However, the exact location of the peak voxel and the exact spatial extent of the surrounding highly significant region (used to define the ROI) are subject to spatial uncertainty. There is little doubt that they lie close to the "true" peak coordinates that would be found for an infinitely large sample. However, because of this inaccuracy in the spatial definition of the ROI, the mean volume within the ROI includes noise in the confirmatory sample, and the correlation with behaviour is likely to be underestimated. I have illustrated this point schematically in Fig. 1. In this illustration, both the discovery sample and the confirmatory sample show relatively high correlations near the "true" correlated voxels (shown by the red lines). In both samples, the peak voxels fall within the spatial extent of the truly correlated voxels. An ROI (shown by the blue shade) is defined at p < .001, uncorrected, based on the data in the discovery sample. While the mean correlation coefficient in the discovery sample may be inflated above the true correlation coefficient, the mean correlation for the confirmatory sample is much lower within the ROI because of the inaccuracy in the original ROI definition.¹ The correlations estimated in Boekel et al.'s study would therefore necessarily be conservative and underestimate the actual correlational strength.
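To make the intuition behind Fig. 1 concrete, the following is a minimal one-dimensional simulation of this scenario. It is purely illustrative and reflects my own assumptions: the number of voxels, effect size, noise smoothness, sample sizes, and the r threshold standing in for p < .001 are all invented for the sketch, not taken from either study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vox, true_peak = 200, 100
# true structure-behaviour coupling: a smooth bump (cf. the red line in Fig. 1)
weights = 0.5 * np.exp(-0.5 * ((np.arange(n_vox) - true_peak) / 5.0) ** 2)
# smoothing kernel for the noise, scaled to keep the noise variance at 1,
# so neighbouring voxels are strongly correlated (as in real VBM data)
kernel = np.exp(-0.5 * (np.arange(-24, 25) / 8.0) ** 2)
kernel /= np.sqrt((kernel ** 2).sum())

def simulate(n):
    """One sample: behaviour scores plus voxelwise 'grey matter' values."""
    b = rng.standard_normal(n)
    noise = rng.standard_normal((n, n_vox))
    noise = np.apply_along_axis(lambda x: np.convolve(x, kernel, "same"), 1, noise)
    return b, b[:, None] * weights[None, :] + noise

def voxelwise_r(b, v):
    """Pearson correlation of behaviour b with every voxel column of v."""
    bz = (b - b.mean()) / b.std()
    vz = (v - v.mean(0)) / v.std(0)
    return vz.T @ bz / len(b)

peak_rs, roi_rs = [], []
for _ in range(500):
    # discovery sample defines the ROI at a hard threshold
    b_d, v_d = simulate(150)
    roi = voxelwise_r(b_d, v_d) > 0.27   # approx. p < .001 (two-tailed) at n = 150
    if not roi.any():
        continue
    # confirmatory sample: correlation of the ROI-mean volume with behaviour
    b_c, v_c = simulate(36)
    peak_rs.append(voxelwise_r(b_c, v_c)[true_peak])
    roi_rs.append(np.corrcoef(v_c[:, roi].mean(1), b_c)[0, 1])

print("mean confirmatory r at the true peak voxel:", round(np.mean(peak_rs), 2))
print("mean confirmatory r for the ROI-mean volume:", round(np.mean(roi_rs), 2))
```

Across repetitions, the ROI-mean correlation comes out systematically below the correlation at the true peak, because the ROI both includes weakly coupled voxels and is slightly misplaced by noise in the discovery sample.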
3. A conservative method with a small sample may explain the failures of replication

Another concern with the Boekel et al. study is the relatively small sample size (n = 36). The suboptimal ROI approach combined with the small sample size would make it rather difficult to replicate anything at all. This can be seen in Figure 8 of Boekel et al.'s study. Among the 17 correlations examined, many (at least 10) of the correlations in the replication sample appear to be just slightly weaker versions of the original results, and this could be due to the suboptimal ROI method. The replication results also had wider confidence intervals that overlapped with 0. My general impression is that the suboptimal ROI approach lowered the correlation coefficients and the small sample widened the confidence intervals; these two factors together made replication very difficult.

¹ Strictly speaking, the mean correlation within the ROI is different from the correlation computed for the average grey matter volume within the ROI. For example, if neighbouring voxels had independent noise, averaging voxel values over space could increase the correlation because the independent noise would cancel out. In practice, however, this is highly unlikely in structural neuroimaging, because neighbouring voxels are highly correlated within individuals (pairwise correlations of neighbouring voxels are near 1).
Fig. 1 – A schematic illustration of the influence of spatial uncertainty on the estimation of correlation within an ROI. A. A hypothetical statistical map for a discovery sample. The x-axis corresponds to a one-dimensional version of the MRI image space, and the y-axis corresponds to the correlation coefficient with a hypothetical behavioural score. The solid blue line shows how the correlation coefficient r is distributed over space (i.e., the brain). Data were generated from the true correlational pattern illustrated by the red line. An ROI (shown by the light blue shade) is defined as the voxels above a certain threshold. B. A correlation map for an independent confirmatory sample, shown by the solid blue line. In this particular example, a map with a slightly shifted peak was intentionally chosen for illustration purposes. Here, the voxels within the ROI show on average a much smaller correlation than in the original study, owing to the slight shift of the ROI from the true position of the correlated brain region. The potential mismatch between the discovery and confirmatory samples may be more severe when the ROI is defined in 3-dimensional space.
On the other hand, some of the results seemed genuinely not to replicate. These were the three correlations reported in Xu et al.'s paper (Xu et al., 2012) and two SBB correlations involving the amygdala (Kanai, Bahrami, et al., 2011). However, quite a few correlations appear to have been replicated in this confirmatory sample without reaching statistical significance. In fact, I am rather surprised that correlation coefficients numerically close to, or sometimes greater than, the originals were labelled in this confirmatory sample as anecdotal support for H0 (see Table 6 in Boekel et al.). Moreover, the associations of the amygdala and entorhinal cortex with egocentric online social network size, which were not replicated in Boekel et al.'s work, have been successfully replicated in a study from another group (Von Der Heide, Vyas, & Olson, 2014), suggesting that these correlations are replicable but happened not to be replicated in Boekel et al.'s sample, perhaps due to cultural
differences in the way Facebook is used across countries, or due to differences in the pre-processing pipeline (see below for more details).

The problem of small samples is further reflected in the fact that roughly half of the results reported here were labelled as "anecdotal" because their Bayes factors (where BF > 3 would imply substantial evidence for H0) were below 3. This implies that there was not enough evidence to support either hypothesis, and that more data are needed to disambiguate them. Furthermore, the BF suggested anecdotal support for the null hypothesis even where the correlation reported in the confirmatory sample was numerically greater (r = .19) than the original (r = .13). This made me doubt whether the analysis methods used in Boekel et al.'s study could replicate any well-established finding. One of my suggestions as a reviewer was therefore to check whether the authors' analysis method could replicate a well-established effect at all. For example, the effect of ageing is a topic studied extensively across several centres using large samples (Gogtay et al., 2004; Good et al., 2001; Sowell et al., 1999). However, this sort of methodological validation could not be conducted, on the grounds that this was a pre-registered confirmatory study (see Section 6 below for more on this point).
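As a rough illustration of how such a Bayes factor behaves as a function of r and n, here is a simplified Bayesian correlation test. This is a didactic sketch of my own, not the default Bayesian test used by Boekel et al.: it relies on the Fisher-z approximation and a uniform prior on the population correlation ρ, both of which are simplifying assumptions.

```python
import numpy as np
from scipy import integrate, stats

def bf01_correlation(r, n, prior=lambda rho: 0.5):
    """Approximate BF01 for H0: rho = 0 versus H1: rho drawn from `prior`
    (default: uniform density 0.5 on (-1, 1)), using the Fisher-z
    approximation atanh(r) ~ Normal(atanh(rho), 1/(n - 3))."""
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    like_h0 = stats.norm.pdf(z, loc=0.0, scale=se)
    # marginal likelihood under H1; limits avoid the atanh singularity at |rho| = 1
    marg_h1, _ = integrate.quad(
        lambda rho: stats.norm.pdf(z, loc=np.arctanh(rho), scale=se) * prior(rho),
        -0.999, 0.999)
    return like_h0 / marg_h1

# a replication correlation of r = .19 with n = 36 gives a BF01 in the
# 'anecdotal' range (between 1 and 3), despite r exceeding the original .13
print(round(bf01_correlation(0.19, 36), 2))
```

With samples of this size, correlations as large as the originally reported ones still tend to yield only anecdotal Bayes factors, which is the pattern visible in Boekel et al.'s Table 6.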
4. Some findings are clearly replicated if more conventional methods are used

To illustrate the point that the ROI analysis used in this study underestimated correlations, I show here that our CFQ study was clearly replicated using the same dataset as in Boekel et al.'s study when a more conventional analysis method was applied. In Boekel et al.'s paper, this SBB correlation was labelled as anecdotal support for the null hypothesis. For this analysis, the authors kindly shared their anonymised structural MRI scans for a secondary data analysis. This allowed me to conduct a voxel-based morphometry analysis on the same dataset using the analysis procedure of our original study (Kanai, Dong, et al., 2011).

There were two crucial differences between the original study and Boekel et al.'s study. The first concerns the method of co-registration. In our original study, we used the procedure called DARTEL in SPM8 for inter-individual co-registration, whereas Boekel et al. presumably used FNIRT as implemented in FSL's default VBM pipeline (Douaud et al., 2007). While we would hope to find consistent results regardless of such differences, methodological details should be closely matched to the original study when the purpose is to replicate it. The second crucial difference is that, to circumvent the problem of averaging over a suboptimally defined spatial region, I used small volume correction (SVC) (Worsley et al., 1996). Briefly, SVC calculates p-values corrected for multiple comparisons within a pre-defined spatial extent, and thereby improves the sensitivity of the statistical analysis relative to a whole-brain analysis, which would require correction for multiple comparisons across the whole brain. SVC has been widely used in neuroimaging when prior hypotheses are available about the location where effects are expected.
In this secondary data analysis, I pre-processed Boekel et al.'s MRI data using the same SPM8 pipeline as in our original study (DARTEL for co-registration; smoothing kernel, FWHM = 8 mm; normalisation to MNI space with Jacobian modulation; see Kanai, Dong, et al. (2011) for full details) and used SVC to test whether there is a statistically significant voxel at p < .05, corrected for multiple comparisons, within a small sphere centred at the original peak voxel (MNI coordinates: x = −15, y = −61, z = 54; radius = 8 mm). As covariates for the multiple regression, we included the distractibility score computed from a previous factor analysis (Wallace, Kass, & Stanny, 2002) and total grey matter volume, as in our original study. This analysis indeed revealed a significant correlation between regional grey matter volume in the left SPL and the distractibility subscale computed from the CFQ. There was a significant positive correlation close to the original coordinates in the left SPL [p(FWE-corr) = .039; MNI coordinates: x = −18, y = −61, z = 57; T(33) = 3.14]. In fact, this is a strong effect, as the correlation at the peak voxel corresponds to r = .48 (cf. the peak correlation in the original study, r = .38). This suggests that estimating the correlation by averaging over an ROI, as Boekel and colleagues did, considerably underestimates the correlation (r = .22), and may have painted the false picture that none of the brain-behaviour correlations was replicated.

Furthermore, we have also previously replicated the CFQ result in an independent sample of 36 participants collected in Denmark (Sandberg et al., 2014). In that study, we again found a positive correlation between the left SPL and the distractibility score [p(FWE-corr) = .039; MNI coordinates: x = −20, y = −61, z = 54; T(33) = 3.38; r = .51]. Therefore, the CFQ result has been confirmed in three independent samples, including Boekel et al.'s data set, when the pre-processing procedure was closely matched and SVC was used for the statistical analysis of small samples. Given these independent replications, it would be an exaggeration to claim that none of the SBB correlations was replicated. Instead, the highly consistent results indicate the reliability of the original finding.

As an auxiliary note, I would like to point out that in our original report we clearly indicated that the other correlation, in the left prefrontal cortex, was not statistically significant after correction for multiple comparisons. We wrote: "A weak negative correlation between gray matter volume and distractibility was found in the left mid prefrontal cortex … However, this did not reach statistical significance after correction for multiple comparisons (p = .755, FWE corrected)." Therefore, I did not expect that this correlation would be replicated. As such, we did not discuss it in Sandberg et al.'s study, and neither was it replicated in my re-analysis of Boekel et al.'s data.
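For readers who want to see the logic of a small-volume test in code, the sketch below implements a permutation-based variant: a max-t null distribution computed within an 8 mm sphere around the original peak. Several hedges apply: the actual re-analysis above used SPM8's random-field-theory SVC, not permutations; the file names and score file are hypothetical placeholders; and the total-grey-matter covariate used in the real analysis is omitted here for brevity.

```python
import numpy as np
import nibabel as nib

rng = np.random.default_rng(0)

# hypothetical inputs: pre-processed (normalised, modulated, smoothed)
# grey matter maps for 36 subjects, plus their distractibility scores
imgs = [nib.load(f"smwc1_subject{i:02d}.nii") for i in range(36)]
data = np.stack([img.get_fdata() for img in imgs])        # (n_sub, x, y, z)
scores = np.loadtxt("cfq_distractibility.txt")            # (n_sub,)

# 8 mm sphere around the original left-SPL peak (MNI -15, -61, 54)
ijk = np.indices(data.shape[1:]).reshape(3, -1).T
xyz = nib.affines.apply_affine(imgs[0].affine, ijk)
sphere = np.linalg.norm(xyz - np.array([-15.0, -61.0, 54.0]), axis=1) <= 8.0
roi = data.reshape(len(imgs), -1)[:, sphere]              # (n_sub, n_roi_voxels)

def peak_t(y, X):
    """Largest voxelwise t-statistic of behaviour y against the ROI columns."""
    yz = (y - y.mean()) / y.std()
    Xz = (X - X.mean(0)) / X.std(0)
    r = Xz.T @ yz / len(y)
    return (r * np.sqrt((len(y) - 2) / (1 - r ** 2))).max()

# FWE-corrected p within the sphere: compare the observed peak t against
# the permutation distribution of the maximum t under shuffled scores
t_obs = peak_t(scores, roi)
null = np.array([peak_t(rng.permutation(scores), roi) for _ in range(5000)])
print("small-volume FWE p =", ((null >= t_obs).sum() + 1) / (len(null) + 1))
```

The key idea carried over from SVC is that the multiple-comparisons correction is confined to the hypothesised region, so a genuine effect near the original peak is not drowned out by a whole-brain correction.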
5. What is the right statistical method for a pre-registered neuroimaging study?

Based on these considerations, an emerging question is how we can more efficiently incorporate prior information from previous studies in order to conduct a confirmatory neuroimaging study. As I discussed above, the ROI approach underestimates correlations because of the spatial uncertainty in the original
study. SVC seems a reasonable alternative, but it has its own problems, such as how to decide the appropriate size of the small volume; moreover, frequentist statistical inference does not allow one to quantify support for the null hypothesis, which is important in confirmatory studies. We need something akin to SVC combined with Bayesian inference. Ideally, the information about spatial uncertainty in the original study would be incorporated into confirmatory studies. Bayesian approaches in neuroimaging (Friston, Glaser, et al., 2002; Friston, Penny, et al., 2002) may be extended for this purpose. For example, posterior probability maps for the correlation coefficient exceeding a certain threshold (e.g., r > .2), or BF maps, could be computed and shared. Such methods would afford us a systematic way to combine previous knowledge and update our beliefs about the existence (or absence) of an SBB correlation.
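As a sketch of what such a shared map could look like, the code below computes, for every voxel of a toy correlation map, the posterior probability that ρ exceeds .2 under a normal prior in Fisher-z space. This is a minimal illustration of the proposal, not an existing SPM or FSL feature; the prior width and the toy data are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

def posterior_prob_map(r_map, n, rho0=0.2, prior_sd=0.3):
    """P(rho > rho0 | data) per voxel, with prior atanh(rho) ~ Normal(0, prior_sd^2)
    and the Fisher-z likelihood atanh(r) ~ Normal(atanh(rho), 1/(n - 3))."""
    z = np.arctanh(r_map)
    se2 = 1.0 / (n - 3)
    # conjugate normal-normal update in z-space
    post_var = 1.0 / (1.0 / prior_sd ** 2 + 1.0 / se2)
    post_mean = post_var * z / se2
    return 1.0 - stats.norm.cdf(np.arctanh(rho0), loc=post_mean,
                                scale=np.sqrt(post_var))

r_map = np.random.default_rng(1).uniform(-0.4, 0.6, size=(4, 4, 4))  # toy r values
ppm = posterior_prob_map(r_map, n=36)
print("max posterior probability of rho > .2:", round(float(ppm.max()), 2))
```

A map of this kind from a discovery study could serve directly as the prior for a confirmatory study, which is exactly the kind of systematic belief updating argued for above.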
6. Refereeing a pre-registration document
Strictly speaking, the protocol by Boekel et al. had not been formally reviewed. The document was shared with the authors of the original studies, but no formal peer review was carried out. Nevertheless, I treated this work as pre-registered, for two reasons. First, the authors had a written document describing the planned analysis on their blog and tried to follow the format of pre-registration. The authors first contacted me about this confirmatory study in April 2012, well before Cortex launched its new section for Registered Reports in May 2013. At that time, no formal review procedure was available, and making a public declaration of the study plan was a commendable choice on the authors' part. It would therefore be unfair to blame the authors for not going through a proper peer-review process. The second reason I treated this study as if it were properly pre-registered is that there is a high chance I would not have been able to spot the methodological issue even if I had been given the chance to formally review the pre-registration document. From the start, the authors were willing to collaborate with the original authors in carrying out this pre-registered study. At that stage, I was aware of the type of analyses they were going to conduct. I was also aware that the ROI analysis would underestimate correlations, but I thought the approach was reasonable. In that sense, I am not at all sure I would have recommended amendments to the ROI approach or to the number of participants they were planning to recruit had I been asked to review the pre-registration document. For these reasons, I treated the study as if it were properly pre-registered. A formal process of peer review is now available in the Registered Reports section of Cortex.

However, once I had decided to view the study as pre-registered, refereeing it turned out to be a rather difficult task. I expect that this experience will be shared by reviewers in the future as pre-registration becomes more common. The primary difficulty I encountered was deciding what my role as a reviewer was. On one interpretation, pre-registration means that authors cannot modify any analysis methods, and reviewers cannot suggest additional data collection. If so, should I just check whether the methods described in the pre-registration document were accurately followed? Does this mean I am not allowed to comment on any other aspects of the study? For example, during the first round of review, I was aware that the ROI approach was too conservative and that the CFQ correlation could be replicated in the same data if an alternative, more conventional method were used. A straightforward recommendation, as in a typical peer review, would be that the authors include the additional SVC analysis, because it would provide a more balanced view by showing that the original results could be replicated when the methods were more closely matched to those of the original studies. However, while I believe this would be a perfectly reasonable suggestion, it asks the authors to perform an analysis that was not described in their pre-registered analysis plan. If we take the role of pre-registration as not permitting any changes to analysis or data collection, such suggestions should not be addressed. Indeed, the authors responded to my comments with statements such as "Due to the confirmatory nature of our replication attempt, we cannot change the ROI-approach without re-labelling the analyses as being exploratory (which would defeat the purpose of this replication attempt)" and "We can unfortunately not perform such an analysis in a confirmatory way, as we already inspected our data". On one level, I understand the theoretical motivation for these responses, because the whole point of pre-registration is to prohibit flexible changes in data analysis strategies. On another level, however, I felt that such reviewer comments should be addressed with additional analyses for the sake of scientific merit, and that examining the same data from different angles is important for understanding the reasons for the failures of replication. It is important to point out that this does not imply the authors were reluctant to perform additional analyses. Instead, they suggested that new issues arising after the planned data analyses should be addressed by other researchers interested in alternative analysis methods. To this end, the authors have made the data publicly available for further scrutiny. Indeed, their data-sharing effort is what enabled me to perform the re-analysis of the CFQ VBM reported above.

On balance, my current inclination is to suggest that reviewers should be allowed to ask for additional analyses and/or data collection even for pre-registered studies, provided the requests are motivated by scientific merit. This need not defeat the point of pre-registration, because p-hacking through flexible changes in data analysis strategies by the authors would still be prohibited. Another reason is that, as discussed above, it is practically difficult to spot all major concerns when reviewing a pre-registration document, and data can reveal new problems that were not thought of during the planning stage. However, I can foresee that different opinions exist on this point. For example, such a review process may introduce a publication bias, and the file-drawer problem could arise, especially if papers were rejected for reporting null results. Care must therefore be taken not to revive the very problems pre-registration is designed to resolve. On this potential controversy I did not pass judgement as a reviewer, but decided instead to write this commentary to share the practical issues we need to address.
7. Summary
In sum, I have identified two open questions. The first concerns the methodological development of Bayesian approaches that utilise the spatial information from previous studies in a confirmatory study. The second concerns the role of peer review in a confirmatory study. I am sure that new practical issues will arise in this new endeavour to correct potential problems in our current system, but I hope they will be addressed in a constructive manner. Open discussion within and beyond the scientific community of cognitive neuroscience will be key to making healthy progress towards understanding the biological basis of the human mind.
References
Boekel, W., Wagenmakers, E.-J., Belay, L., Verhagen, J., Brown, S., & Forstmann, B. U. (2015). A purely confirmatory replication study of structural brain-behavior correlations. Cortex (in press).

Douaud, G., Smith, S., Jenkinson, M., Behrens, T., Johansen-Berg, H., Vickers, J., et al. (2007). Anatomically related grey and white matter abnormalities in adolescent-onset schizophrenia. Brain, 130(Pt 9), 2375–2386.

Friston, K. J., Glaser, D. E., Henson, R. N. A., Kiebel, S., Phillips, C., & Ashburner, J. (2002). Classical and Bayesian inference in neuroimaging: applications. NeuroImage, 16, 484–512.

Friston, K. J., Penny, W., Phillips, C., Kiebel, S., Hinton, G., & Ashburner, J. (2002). Classical and Bayesian inference in neuroimaging: theory. NeuroImage, 16, 465–483.

Gogtay, N., Giedd, J. N., Lusk, L., Hayashi, K. M., Greenstein, D., Vaituzis, A. C., et al. (2004). Dynamic mapping of human cortical development during childhood through early adulthood. Proceedings of the National Academy of Sciences of the United States of America, 101, 8174–8179.

Good, C. D., Johnsrude, I., Ashburner, J., Henson, R. N. A., Friston, K. J., & Frackowiak, R. S. J. (2001). A voxel-based morphometric study of ageing in 465 normal adult human brains. NeuroImage, 14, 21–36.

Kanai, R., Bahrami, B., Roylance, R., & Rees, G. (2011). Online social network size is reflected in human brain structure. Proceedings of the Royal Society B, 279, 1327–1334.

Kanai, R., Dong, M.-Y., Bahrami, B., & Rees, G. (2011). Distractibility in daily life is reflected in the structure and function of human parietal cortex. Journal of Neuroscience, 31, 6620–6626.

Sandberg, K., Blicher, J. U., Dong, M. Y., Rees, G., Near, J., & Kanai, R. (2014). Occipital GABA correlates with cognitive failures in daily life. NeuroImage, 87, 55–60.

Sowell, E. R., Thompson, P. M., Holmes, C. J., Batth, R., Jernigan, T. L., & Toga, A. W. (1999). Localizing age-related changes in brain structure between childhood and adolescence using statistical parametric mapping. NeuroImage, 9, 587–597.

Von Der Heide, R., Vyas, G., & Olson, I. R. (2014). The social network-network: size is predicted by brain structure and function in the amygdala and paralimbic regions. Social Cognitive and Affective Neuroscience, 9, 1962–1972.

Wallace, J. C., Kass, S. J., & Stanny, C. J. (2002). The cognitive failures questionnaire revisited: dimensions and correlates. Journal of General Psychology, 129, 238–256.

Worsley, K. J., Marrett, S., Neelin, P., Vandal, A. C., Friston, K. J., & Evans, A. C. (1996). A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4, 58–73.

Xu, J., Kober, H., Carroll, K. M., Rounsaville, B. J., Pearlson, G. D., & Potenza, M. N. (2012). White matter integrity and behavioural activation in healthy subjects. Human Brain Mapping, 33, 994–1002.
Received 24 December 2014; Reviewed 20 January 2015; Revised 19 February 2015; Accepted 25 February 2015