Functional magnetic resonance imaging: Comparison between activation maps and computation pipelines in a clinical context

Functional magnetic resonance imaging: Comparison between activation maps and computation pipelines in a clinical context

Magnetic Resonance Imaging 31 (2013) 555–566 Contents lists available at SciVerse ScienceDirect Magnetic Resonance Imaging journal homepage: www.mri...

2MB Sizes 0 Downloads 24 Views

Magnetic Resonance Imaging 31 (2013) 555–566

Contents lists available at SciVerse ScienceDirect

Magnetic Resonance Imaging journal homepage: www.mrijournal.com

Functional magnetic resonance imaging: Comparison between activation maps and computation pipelines in a clinical context Valentina Pedoia a,⁎, Sabina Strocchi b, Vittoria Colli b, Elisabetta Binaghi a, Leopoldo Conte c a b c

Dipartimento di Scienze Teoriche e Applicate–Sezione Informatica, Università degli Studi dell'Insubria Varese, Italy Servizio di Fisica Sanitaria, Ospedale di Circolo e Fondazione Macchi, Varese, Italy Dipartimento di Scienze Cliniche e Biologiche, Università degli Studi dell'Insubria Varese, Italy

a r t i c l e

i n f o

Article history: Received 11 January 2012 Revised 5 October 2012 Accepted 30 October 2012 Keywords: fMRI Statistical Parametric Map Statistical agreement Evaluation framework

a b s t r a c t In this study new evaluation strategies for comparing different Statistical Parametric Maps computed from fMRI time-series analysis software tools are proposed. The aim of our work is to assess and quantitatively evaluate the statistical agreement of activation maps. Some pre-processing steps are necessary to compare SPMs (Statistical Parametric Maps), including segmentation and co-registration. The study of the statistical agreement is carried out following two ways. The first way considers SPMs as the result of two classification processes and extracts confusion matrix and Cohen's kappa index to assess agreement. Some considerations will be made on the statistical dependence of classes and a new formulation of kappa index will be used for overcoming this problem. The second way considers SPMs as two 3D images, and computes the similarity of SPMs images with a fuzzy formulation of the Jaccard Index. Several experiments were conducted both to assess the performance of the proposed evaluation tools and to compare activation maps computation pipelines from two widely used software tools in a clinical context. © 2013 Elsevier Inc. All rights reserved.

1. Introduction Functional magnetic resonance imaging (fMRI) is an imaging technique able to provide information about brain activity, through Blood-Oxygen-Level-Dependent (BOLD) signal. In this manner, MRI can not only supply useful and detailed anatomical images, but it also becomes a very interesting functional imaging modality. The standard pipeline used to obtain an fMRI activation map is briefly recalled in the following. Firstly, acquisition of volumetric MR images of the patient's brain while he/she is performing a task is performed. The task paradigm is designed in order to acquire several volumes during task execution and to acquire several volumes during rest. Secondly, acquired volumes are statistically analyzed in order to evaluate a Statistical Parametric Map (SPM) of brain activation. Recently, the Medical Informatics community has been greatly involved in fMRI research and a large number of software tools for time-series analysis were developed [1,2]. The user is then faced with the question of how to choose among them. Of course, many considerations are important at this stage, including costs, operating systems and software compatibility, and more. In our opinion an important question should also be whether all the pipelines considered supply the same results when they have the same ⁎ Corresponding author. Valentina Pedoia, Collegue of Dipertimento di Scienze Teoriche e Applicate Università degli Studi dell'Insubria, Via Mazzini 5 Varese, Italy. Tel.: +39 0332218942. E-mail address: [email protected] (V. Pedoia). 0730-725X/$ – see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.mri.2012.10.013

information as an input. We especially investigate if it is possible to measure in a quantitative way such differences. It is not a topic of this work to find a way to evaluate the goodness of fMRI processing methods but to measure the differences between the Statistical Parametric Map obtained. A large amount of work has been done in comparing and evaluating methods of fMRI analysis and exam pipeline both in medical and computer science fields [3–5]. The state of the art in this field can be analyzed from both the aim of the evaluation strategy and the metrics used in the evaluation process. As regards the first aspect, some authors used synthetic phantoms to evaluate the performances of commonly used fMRI software packages with respect to motion correction [6,7]. The definition of a phantom that can simulate real data well is a critical part of the research in the field of fMRI tools evaluation. Pickens et al. [7] published a study focused on the development of a synthetic phantom based on real subject data. Often, the strategies based on simulated data used receiver operating characteristic (ROC) curves for assessing the methods sensitivity and specificity [8,9], with all the drawbacks arising from the use of synthetic data. Other approaches are focused on the repeatability of the exam pipeline. Strother et al. [10] and Tegeler at el. [11] have conducted longitudinal studies repeating the same pipeline on couples of sequential fMRI acquisitions. The results obtained show that the SPMs for the finger tapping tasks obtained by high-field experiments are well reproducible. On the other hand, Machielsen et al. [12] measured the reliability of a visual encoding task highlighting

556

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

inconsistencies in the activation in repeated examinations. These findings indicate that, whereas consistent patterns of activation exist, more insight is needed into what determines the volume of activation, especially to assess cognitive alterations in patients over time. Other authors have employed their effort to study the repeatability in terms of statistical methods used for fMRI volumes analysis, observing that the relative performance of the methods varies considerably across subjects and classification tasks [13,14]. From the point of view of the metrics used for the evaluation two categories of approaches can be considered: threshold and nonthreshold. SPM is a 3D map that describes brain activation where each voxel value identifies the statistical significance of the hypothesis of activation, thus SPM is a volume of discrete values between 0 and the maximum value of statistical significance obtained by the analysis. The assessment of the agreement between two such maps has been considered a complex task for a long time. Therefore, many approaches have defined a threshold value by which to create binary activation maps more easily treated. Genovese et al. [15] and Maitra et al. [16] have proposed a metric for evaluating binary maps based on pseudo-ROC analysis; Strother et al. and Tegeler et al. performed the repeatability analysis mentioned above on binary maps, using a metric designed ad hoc for the study. Moreover, classical statistical agreement tools as the Cohen's kappa index [17], Jaccard index [18] and Dice coefficient [19] were applied to assess accuracy and reliability in fMRI considering binary maps [20,21]. In recent years, the trend is to study evaluation strategy and metrics that can be applied on non-threshold maps. Maitra [22] reformulated the pseudo-ROC approach to eliminate the thresholding requirement. The intraclass correlation coefficient (ICC) [23] is a widely used non-threshold metric [24]. A new approach based on measuring the approximate mutual information between the fMRI time series of a validation data set and a calculated activation map is proposed in Afshin-Pour et al. [25]. This method can be applied on thresholded label maps or continuous maps. From the cited literature one can draw two important interrelated issues that are still being studied: on one hand, the management and quantitative assessment of inconsistencies of the results obtained from different processing methods and on the other hand, the choice of appropriated evaluation metrics. Proceeding from these considerations, a first objective of our work is to make a contribution to the ongoing debate on the agreement between the different processing methods, breaking down the problem in the clinical context in which fMRI is a widely accepted tool in pre-surgical planning procedures. For this purpose we developed a detailed comparative analysis aimed to show how and at what extent two of the most relevant and widely used processing methods differ in producing SPMs. The first method considered is IViewBOLD®, a widely used tool in clinical practice, and the second method considered is FMRI Expert Analysis Tool (FEAT) included in FMRIB Software Library (FSL) [1], which is one of the state-of-the-art methods within the scientific community. Dealing with the problem of choosing an adequate evaluation metric, we proceeded from the assumption that the evaluation conducted on graduate non-threshold maps is preferable in our context. A number of metrics work on non-threshold maps, someone conceived ad hoc [23], others derived from the standard indexes [25]. Consistently with what was done in literature we conceived a new evaluation strategy to compare non-threshold SPMs based on a continuous formulation of kappa and Jaccard statistical indexes. All the experiments were carried out on real fMRI data. The data set represents well the typical clinical data: the subjects were real neurosurgery patients awaiting surgery for removal of brain tumors and they underwent widely used clinical tasks, including visual, cognitive and motor aspects.

2. Materials and methods The magnetic resonance imaging device used is a Philips Achieva® 1.5 T. The evaluation framework is developed in Matlab® release 2010a. 2.1. Stimulation paradigms Stimulation paradigms are those most often used in neuroradiology: motor (finger tapping) and language (verb generation, word generation, object naming) tasks. All tasks are block designs. The finger tapping task is a monolateral thumb–forefinger movement of nine total blocks: four times OFF–ON plus a last OFF block. Each block lasts 30 s, where 10 volumes are acquired. The patient is vocally instructed by the operator about when to perform the activity. Object naming is performed showing simple color images (ON block) alternating a white cross on a black background (OFF block). During the viewing of the image (an image every 3 s) the patient has to think of the word representing the object shown. During the OFF block the patient has to think nothing. Each block lasts 30 s, where 10 volumes are acquired. The block sequence is as follows: four times OFF–ON plus a last OFF block. Verb generation is performed showing simple black and white drawings representing an action (ON block) alternating a white cross on a black background (OFF block). During the viewing of the drawings (an image every 3 s) the patient has to think of the verb representing the action shown. During the OFF block the patient has to think nothing. Each block lasts 30 s, where 10 volumes are acquired. The block sequence is as follows: six times OFF–ON plus a last OFF block. Word generation is performed showing single white letters on a black background (ON block) alternating a white cross on a black background (OFF block). During the viewing of the letters (a letter every 6 s) the patient has to think of all the words beginning with that letter that he/she can. During the OFF block the patient has to think nothing. Each block lasts 30 s, where 10 volumes are acquired. The block sequence is as follows: six times OFF–ON plus a last OFF block. 2.2. MRI images acquisition During the tasks, the brain volume was acquired using standard EPI BOLD acquisition sequence (TR 3000 ms, TE 50 ms, FA 90°, matrix 128 × 128, FOV 230 mm, axial orientation, slice thickness 4 mm, spacing 4 mm, number of slices 30). For anatomical image reference, a T1-weighted volumetric sequence was chosen (TR 7113 ms, TE 3208 ms, FA 8°, matrix 256 × 256, FOV 256 mm, sagittal orientation, slice thickness 1 mm, spacing 1 mm, number of slices 180). 2.3. Compared SW pipelines From the acquired volumes (90 for finger tapping and object naming, 130 for verb generation and word generation), activation maps were calculated as statistical parametric maps (SPMs) through two different pipelines. The first pipeline used IViewBOLD® software supplied by Philips®. Statistical analysis was carried out applying t-test to each voxel time course, after retrospective image registration for motion correction. SPMs are function of t, minimum t value 3, maximum t value as results from analysis. Contiguous voxels =4 and default mask were set. The second pipeline used FMRI Expert Analysis Tool (FEAT) included in FMRIB Software Library (FSL) [1], which uses univariate General Linear Model (GLM) for voxel time-course analysis [26]. Firstly brain was extracted from an anatomical reference sequence through Brain Extraction Tool [27]. Then FEAT was launched with parameters set as detailed in Appendix A. SPMs are function of z, minimum and maximum z values as result from analysis.

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

2.4. Patients We used images of 15 consecutive patients, who underwent different fMRI procedures, for a total of 49 tasks acquired (13 finger tapping, 13 object naming, 12 word generation, 11 verb generation). The patients underwent fMRI for clinical reasons, mainly presurgical for oncologic illness. 3. Statistical agreement of activation maps In this paper, new quantitative evaluation strategies to compare fMRI exam results in terms of software performance and exam pipeline have been studied and experimentally evaluated in an integrated evaluation framework. This fMRI quantitative analysis framework includes two essential preprocessing functions: boundary-based Brain Segmentation using graph searching principles [28] and Surface Registration based on neural estimation of affine transformation in spherical domain: affine SPHARM [29]. Our goal is to develop an objective map comparison in order to assess statistical agreement. In detail, our strategy is applied on SPMs obtained by using the software packages, whose processing mode was detailed in the previous section. Statistical analysis complexity, processing pipeline variability and parameters settings lead to different results also analyzing the same set of volumes with the software packages. In Fig. 1 two cases of SPM overlapping are shown: in Fig. 1(A) an example of good overlapping and in Fig. 1(B) an example of bad overlapping; our goal is to make this qualitative feeling a quantitative agreement measure. The study of statistical agreement was carried out in two ways. The first way considers SPMs as the result of two classifiers and extracts confusion matrix and Cohen's kappa index for assessing the agreement [17]. A new formulation of kappa index is proposed to deal with the statistical dependence of classes. The second way considers SPMs as two 3D images, and computes the similarity of the SPM images with a fuzzy formulation of the Jaccard index [18]. 3.1. Preprocessing: brain MRI alignment The preprocessing phase is composed of two steps included in the evaluation framework: brain segmentation and registration. Some software tools for fMRI time-series analysis do not include brain segmentation in the processing pipeline, so all MRI voxels are

557

analyzed and may be active. Therefore, active zones can be found outside the brain. These voxels are obviously wrong and must be excluded from the analysis, segmenting the brain on MR image. Moreover, the segmentation phase provides brain surface to be used in the registration phase. Several methods have been proposed and experimentally evaluated, working on either 2D images or the entire 3D brain volume [27,30–32]. Despite the sizable achievement obtained, strong deformations of the normal tissues, caused for example by tumor presence, may strongly affect the performances of these promising methods. For these reasons a novel fully automatic MRI brain segmentation method able to cope with critical cases in which tumors caused strong brain deformation has been proposed in a previous work [28]. The method is a 2D boundary-based segmentation using graph searching principles [33–35]. The MRI head has been unwrapped in polar coordinates, in such a way that it is possible to describe the brain edge like a feasible function with a set of constrains and then to represent the segmentation problem in terms of constrained optimization obtaining a fully automatic and unsupervised method for brain segmentation in MRI images. During fMRI examination a set of brain volume is acquired and SPM is the result of time-series analysis. The maps are shown in relation with one of the acquired volumes, usually the first or the last. Considering the processing performed with different software packages there is no certainty that the reference volume is the same. For this reason a registration process is essential to have a correct map overlap. In this work a surface registration method has been used [29]. This method performs affine 3D surface registration using shape modeling based on SPHerical HARMonic (SPHARM). This method generalizes the classical SPHARM registration technique [36,37] without losing the advantage of performing the whole of the registration process in spherical domain using a set of Radial Basis Function Networks (RBFN) to estimate the transformation between brain surfaces. Description of the 3D surface with spherical harmonic coefficients is brief but comprehensive and directly provides a metric of shape similarity. Therefore, registration is obtained aligning the SPHARM model through minimization of root mean square distance between coefficient vectors. 3.2. Statistical agreement: confusion matrix The first method for assessing the agreement of activation maps proposed considers the statistical nature of the data. It analyzes the

Fig. 1. IViewBOLD® and FEAT FSL SMPs superimposed on MRI acquisition (Red Map FEAT FSL, Green Map IViewBOLD®, Yellow intersection). (A) Good agreement. (B) Bad agreement.

558

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

software packages as two classifiers using a confusion matrix that contains information about actual and predicted classifications made by the systems. Performance is commonly evaluated using data in the matrix with a set of specific indices, the most widely used is kappa index [17]. In our particular case of study a normalization step is needed because the SPMs are expressed in non-homogeneous way therefore each SPM was normalized in relation to its own maximum value, hence each voxel assumes a value between 0 and 1. The process of normalization is not necessary to compare two pipelines that compute homogeneous SPM. Let SPM1 and SPM2 be two normalized and co-registered maps obtained from different analysis of the same fMRI exam. A set of quantization levels are defined from 0 to 1 and in each i,j position of the confusion matrix there is the number of overlapped voxels that belong to the i th quantization level for SPM1 and j th quantization level for SPM2. In this way, as in a classical confusion matrix, on the main diagonal there are the agreements and outside the diagonal the disagreements. Fig. 2(A, B) shows confusion matrixes, on 10 quantization levels, computed for examples in Fig. 1. Following the classical way to assess statistical agreement, kappa index is computed to evaluate the confusion matrix. For the first example K is 0.17 and for the second K is − 0.01. Kappa index takes values in the interval [− 1,1]. Considering both Landis and Fleiss kappa evaluation tables [38,39] a poor agreement is found in both cases. In general kappa index can be computed if the following conditions are true: • Analyzed instances are independent. • Classifiers judge independently. • Classes are mutually exclusive, exhaustive and independent. In this context, considering the way in which the confusion matrix is defined, the first two hypotheses are verified. Otherwise, the last condition is not true. This third condition allows the agreement to be

computed if all the errors are equal. Indeed, kappa index does not distinguish between major and minor errors. However, in this case, the misclassification between adjacent classes should weigh less in agreement evaluation than misclassification between far classes. The problem to weigh disagreement differently in the evaluation of the confusion matrix has been extensively studied in literature proposing new indexes such as weighted kappa [40,41]. The method applied in this work takes into account statistical dependence between classes and computes for each class, i.e. in our context each level of quantization, a mean error based on the distance (ε) which takes into account the distance between the classes: N 1 X ji− j j C ði; j Þ C ð j; j Þ i¼1 N −1 N 1 X ji−j j C ði; j Þ ¼ C ði; i Þ j¼1 N −1

εCol j ¼ εRowi

ð1Þ

where C is the confusion matrix built as detailed above and N is the number of the quantization levels. From the equations it is clear that the calculation of ε involves the application of a set of weights to the confusion matrix. The indexes εColj and εRowi can be interpreted as the producer accuracy and user accuracy of a new confusion matrix Cw to which has been applied a matrix of weights through the element by element product: Cw(i,j) = C(i,j) * W(i,j). The weight matrix is defined as follows: ( W ði; j Þ ¼

1; ji− j j ; N −1

if

i¼j

if

i≠j

ð2Þ

In this way the method takes into account class dependence not defining new indices but weighing the confusion matrix and calculating traditional metrics of evaluation. In Fig. 3(A, B) weighted confusion matrixes computed for the examples above are shown. New kappa values computed respectively are 0.61 and −0.08, that are consistent with visual inspection. It is worth noting that the application of the weight matrix does not imply an unconditioned amplification of the level of agreement. Indeed, if the agreement is under estimated because of the class dependence the application of the weight matrix leads to an increase in agreement. Otherwise, where the two classifiers, which are the results of the two software packages, are completely in disagreement, the application of the weight matrix does not change the comparison response. 3.3. Information overlap The SPMs comparison method based on weighted confusion matrix, although very good because it honors statistical nature of the data, lacks in that it moves away from how the map is qualitatively evaluated: as a volumetric image. Moreover, the weighted confusion matrix puts little emphasis on what classes are in agreement. One way could be to observe the trend of the producer and user accuracy, but this loses the overall vision of the problem. For these reasons, the evaluation framework presented here provides a second method to evaluate SPMs agreement. As before we consider SPM1 and SPM2 two normalized and co-registered maps obtained from different analysis of the same fMRI exam. A similarity index is defined in our previous work [42]. This index is based on the Jaccard index [18]. Intersection and union are computed using fuzzy logic definitions and Information Overlap Index (IO) is defined as:

Fig. 2. Examples of confusion matrix. (A) Good agreement. (B) Bad agreement.

  T N min SPM ; SPM 1;i 2;i SPM1 SPM2 1X   IO ¼ ¼ N i¼1 max SPM ; SPM SPM1 ∪SPM2 1;i

2;i

ð3Þ

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

Fig. 3. Examples of weighted confusion matrix. (A) Good agreement. (B) Bad agreement.

where N is the total number of voxels where max(SPM1,i, SPM2,i) N 0. This index gives a quantitative evaluation of behavior similarity of different tools. However, if SPMs are very noisy, IO can be polarized by the overlap of noise (voxels with low activation). For this reason, IO is computed as a function of different activation thresholds. Consider to obtain a binary mask Mth applying the threshold th on the activation map SPM. The activation map SPMth is computed applying the mask Mth to the original SPM: Th

SPM ði; j Þ ¼



SPMði; j Þ; 0;

if SPMði; j Þ≥Th if SPMði; j ÞbTh

ð4Þ

Information overlap matrix is computed as follows:   Th Th N min SPM j ; SPM k 1;i 2;i 1X   IOð j; k Þ ¼ N i¼1 max SPMTh j ; SPMThk 1;i

ð5Þ

2;i

The IO matrix values are high if the SPMs are in agreement. A joint variation of the two thresholds describe IO matrix from the high left corner to the low right one. IO matrix of two identical SPMs would be a perfect symmetrical matrix with all the values on the diagonal line equal to one. In real cases we can have IO matrixes approximating this one (excellent agreement). Hence, peaks along the main diagonal line or parallel to it mean that activation areas are in the same position in the two SPMs and have similar statistical significance. Along the diagonal line, if a peak is on the right side it corresponds to the overlapping of more significant areas in the original SPMs. Otherwise, if it is on the left it corresponds to the overlapping of areas with less statistical significance. In all cases, if a peak is high SPMs overlap well (good agreement), if it is quite high SPMs have similar significance (acceptable agreement).

559

If a peak does not have a symmetry in relation to the diagonal line it means that statistical relevance of the activation areas is different in the two SPMs (poor agreement). IO matrixes which exhibit no peaks describe SPMs that, even if partially overlapping has different clinical meanings (disagreement). Maps obtained from tasks that involve language necessarily involve also the visual cortex, this side effect is caused by stimulus that is produced showing images to the patient. We consider that the overlap of these areas is not significant in evaluating maps clinical agreement. For this reason IO matrixes are also evaluated eliminating data from voxels known to belong to visual functions (from a registration with Juelich Functional Atlas [1]). Hence, we could also evaluate the concordance of the two pipelines for small functional areas, as the visual area is always a big and bright one, when stimulated by our paradigms. It is worth noting that the registration process allows removal of activation of functional areas that are noise for the evaluation process (like the visual cortex in our application); or to assess agreement between the maps separately for different functional areas of interest. Fig. 4 shows the IO matrixes for the above examples with 10 threshold values. If overlap is really related to information and not to noise, IO increases on the main diagonal jointly with the threshold. This surface gives a quantitative description, therefore allowing a comparison of different experiments. If the tools lead to the same conclusions about brain activation, the surface has an unmistakable pattern similar to a “comet” as is shown in Fig. 4(A). Otherwise, IO matrix shown in Fig. 4(B) looks completely different and has much smaller values of agreement. This representation, differing from that made by the confusion matrix, highlights whether the agreement is on the high or low values of activation. To summarize IO matrix characteristics in an only figure of merit, a grid with 5 scores from − 2 to 2 was defined based on the previous observations: Score 2 Excellent agreement: Plot symmetric in relation to the diagonal line or to a line parallel to diagonal one Score 1 Good agreement: Well defined and high absolute maximum Score 0 Acceptable agreement: Well defined and quite high absolute maximum Score − 1 Poor agreement: Not well defined absolute maximum, monotone decreasing plot Score − 1 Disagreement: Quite uniformly flat plot Despite the judge being subjective, it is driven by clear rules. Moreover, this figure is based on bidimensional image analysis typical of human vision. As a result, IO matrix assessment is easier and less error prone than direct SPMs comparison with the same meaning. Considering the fact that the analysis of IO matrix characteristics is accomplished through an approximate evaluation of qualitative visual features, it was decided to design a fuzzy production system that could formalize and then automatically reproduce the expert decision attitudes [43]. We consider the following visual features computed on the IO matrixes normalized to 100 (SPM100) M: Maximum max value of IO matrix: Max ¼ maxðSPM100 Þ: I: Integral sum of all values in the SPM100: Nq Nq I ¼ ∑i¼1 ∑ j¼1 SPM100 ði; j Þ where Nq is the number of the quantization levels. DFD: Distance from the diagonal is the distance between the point that contains the maximum value and the matrix main diagonal. Let imax and jmax be the row and column of the maximum  value the distance from the diagonal is DFD = |imax − jmax| cos 4 :

560

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

DL: Diagonal length is the distance between the point that contains the maximum value and the left top corner DL = qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi i 2max þ j 2max . The overall IO matrix evaluation process was then modeled in terms of fuzzy relationships between visual features and levels of agreement and is organized in three conceptual steps: • Linguistic labeling of visual features and classes/levels of agreement. • Diagnostic rules definition. • Inference for class assignment. We formalize these concepts using fuzzy declarative propositions of the form X is A to represent the linguistic description of visual features. X is a linguistic variable denoting a given feature such as, for example, distance from the diagonal (DFD); A is a term, such as Low or Medium or High; it belongs to a given term set and represents a fuzzy set in a given universe of discourse U characterized by a membership function μLow, Med, High: U → [0,1]. U contains all the possible numerical values assumed by X. Letting u belonging to U, a value associated with the visual feature DFD μLow(u) is interpreted as the degree of possibility of the expert describing the feature concerned with the term Low, or in other words, the degree of compatibility between the numerical value u and the linguistic term Low. The five classes of agreement listed above are modeled as crisp classes. All the linguistic variable (M, I, DFD and DL) are normalized in a unit range [0,1], the associated term sets are Low, Medium and High and the membership functions of the corresponding fuzzy sets

are three Gaussian G(μ, σ) with mean equal to 0, 0.5 and 1 respectively and standard deviation equal to 0.2. Fuzzy production rules describing how combinations of visual features, expressed in terms of multiple fuzzy declarative propositions, related to classes of agreement, have the following general form: If X1 is A1 And … And Xn is An then Class is C The theoretical operator min proposed in the original formulation of fuzzy set theory is chosen for implementing the aggregation connective AND in the antecedent of rules [44]. In the fuzzy framework IO matrix evaluation is modeled as a classification process and consists in the deduction of a conclusion regarding a given matrix, the premises of which are the values of visual features and the classification rules in the knowledge base. As in any other logic, a rule of inference governs this deduction. When specific values are provided in input the fuzzy logic inference mechanism interprets the set of fuzzy production rules and deduces the final diagnostic results. A well-known deductive paradigm is the MAX–MIN method provided by Mamdani [45]. The MAX–MIN method tests the magnitudes of each rule and selects the highest one. Table 1 lists the complete set of rules elicited from experts indicating for each rule the antecedent and consequent part. A first set of rules (Table 1(a)) is used to aggregate visual features M and I and to infer intermediate variable IOVal that combined with the other features DL and DFD allow to infer the class of agreement Score using the set of rules in Table 1(b). The classification process is performed choosing the class with the maximum value of plausibility. In Fig. 5 the scheme of the overall procedure is shown. 4. Discussion and results This section presents our results with two aims. The first aim is to evaluate our method to assess the agreement of SPMs. The second aim is to evaluate agreement between the results obtained by the fMRI time-series analysis software tools considered. Concerning the first purpose, five examples of agreement evaluation, for both confusion matrix and information overlap analysis, are detailed. The Table 1 Rules elicited from the experts: (a) first stage: (I,M) →IOval, (b) second stage: (IOval, DFD, DL) → Score. (a)

Fig. 4. Examples of information overlap (IO) matrix. (A) Good agreement. (B) Bad agreement.

(b)

Maximum

Integral

IOVal

IOVal

DFD

DL

Score

Low Low Low Medium Medium Medium High High High

Low Medium High Low Medium High Low Medium High

Low Low Medium Low Medium High Medium High High

Low Low Low Low Low Low Low Low Low Medium Medium Medium Medium Medium Medium Medium Medium Medium High High High High High High High High High

Low Low Low Medium Medium Medium High High High Low Low Low Medium Medium Medium High High High Low Low Low Medium Medium Medium High High High

Low Medium High Low Medium High Low Medium High Low Medium High Low Medium High Low Medium High Low Medium High Low Medium High Low Medium High

Very Low Very Low Low Very Low Very Low Low Very Low Very Low Low Low High Very High Medium Medium High Medium Medium Medium High Very High Very High Medium High Very High Medium High High

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

561

Fig. 5. Scheme of the fuzzy recognition system.

shown examples are presented in order of increasing agreement, with a score from − 2 to 2. The quantization step for both IO and confusion matrix is 0.1. For each example a representation of SPMs overlap using color convention already applied in Fig. 1 is shown. Figs. 6, 7, 8, 9, and 10 show in order: SPMs overlap, confusion matrixes before and after the correction using a logarithmic color scale and IO matrix using the same color scale for each example. Fig. 6 shows the agreement analysis of a verb generation task. Both software tools have identified a poor activation with few overlap points focused in the inferior parietal lobule zone. The first software tool detects activations only in the left hemisphere. Otherwise the second processing detects some contralateral activations. Kappa

indexes computed on confusion matrix and on weighted confusion matrix are 0 and 0.08 respectively, which indicates a complete disagreement. It is worth that these values are computed without considering agreement in class 0, which would lead to an overestimation of agreement. Score −2 has been attributed to the IO matrix. It is quite uniformly flat with a small maximum value that is located far from the main diagonal line. Quantities that describe the IO matrix have the following values: M =19.79, I = 824.49 thus with a mean value of 8.24, DFD = 5, DL =6.08. Fig. 7 shows the agreement analysis of a Right Finger Tapping task. Activations are mainly in the left hemisphere for both processing. First the software detects activations focused in a small

Fig. 6. Agreement analysis example, IO score −2.

562

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

Fig. 7. Agreement analysis example, IO score −1.

and specific area, otherwise there is a greater dispersion as regards the second processing. Kappa indexes computed on the confusion matrix and on weighted confusion matrix are 0.04 and 0.22 respectively. Although they are higher than in the previous case,

they indicate little agreement. Qualitative analysis of the maps leads us to believe that one of the software is more conservative, but through the analysis of confusion matrices we cannot verify or refute this thesis. Otherwise, analysis of IO matrix clarifies this doubt.

Fig. 8. Agreement analysis example, IO score 0.

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

563

Fig. 9. Agreement analysis example, IO score 1.

Indeed, if the thesis is verified, IO matrix increases on the main diagonal line jointly with the threshold. The location of maximum far from the diagonal line tells us that the most activated voxels do not overlap. Score − 1 has been attributed to the IO matrix. Absolute maximum is not well defined, and the IO matrix is monotone

decreasing. Quantities that describe the IO matrix have the following values: M = 25.02, I = 1046.30 thus with a mean value of 10.46, DFD = 5, DL = 7.28. Fig. 8 shows the agreement analysis of an Object Naming task. For both the processing, activations are focused in Broca areas and a

Fig. 10. Agreement analysis example, IO score 2.

564

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

contralateral activation is noticed. Activation area in the left hemisphere detected by the second software is widest. Qualitative analysis leads to the conclusion that the maps are quite overlapping. Kappa indexes computed on confusion matrix and on weighted confusion matrix are − 0.02 and 0.37 respectively. The IO matrix looks not very different from the previous case because maximum and integral are similar. The main difference is the lower distance between maximum and main diagonal lines, that indicates a better overlap. Score 0 has been attributed to the IO matrix. Quantities that describe the IO matrix have the following values: M = 25.02, I = 1083.84 thus with a mean value of 10.82, DFD = 2, DL = 7.21. Fig. 9 shows the agreement analysis of a Verb Generation task. For both the processing activations are focused in Broca and Wernicke areas in the left hemisphere; both software tools detected a small contralateral activation. Qualitative analysis of the maps leads to judge the agreement good; despite the fact the maps are not identical the conclusions drawn from the analysis would probably be the same. Kappa indexes computed on confusion matrix and on weighted confusion matrix are respectively, − 0.06 and 0.47. It is

worth noting that, in the cases where the agreement is higher, to weight the confusion matrix becomes essential for the correct analysis. Score 1 has been attributed to the IO matrix. The high absolute maximum is well defined and is located on the main diagonal line, IO matrix is not evaluated with score 2 because maximum value is located at the level of quantization 0.6. An excellent IO matrix has a maximum located in the high levels of quantization. Quantities that describe the IO matrix have the following values: M = 36.73, I = 1701.14 thus with a mean value of 17.01, DFD = 0, DL = 8.49. Fig. 10 shows the agreement analysis of a Word Generation task. There are many active areas. Indeed, it is difficult to identify an activation core. SPMs are overlapped on every zone of accumulation of active voxels. Disagreement points are isolated and appear to be noise. Qualitative analysis of the maps leads to judge the agreement excellent. Kappa indexes computed on confusion matrix and on weighted confusion matrix are 0.08 and 0.52, respectively. Considering Landis Kappa evaluation tables agreement is only moderate. It seems that the Kappa evaluation method tends to underestimate the qualitative

Table 2 IViewBOLD® and FEAT FSL activation map agreement study. Case 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49

Task

M

I

DFD

Object Naming Word Generation Verb Generation Object Naming Word Generation Verb Generation Verb Generation Object Naming Finger Tapping Finger Tapping Word Generation Object Naming Finger Tapping Word Generation Verb Generation Object Naming Word Generation Verb Generation Verb Generation Object Naming Word Generation Finger Tapping Finger Tapping Verb Generation Object Naming Word Generation Verb Generation Object Naming Finger Tapping Finger Tapping Word Generation Object Naming Word Generation Object Naming Finger Tapping Finger Tapping Word Generation Verb Generation Verb Generation Object Naming Word Generation Object Naming Finger Tapping Finger Tapping Verb Generation Finger Tapping Finger Tapping Object Naming Word Generation

36.50 30.98 26.61 37.17 30.55 37.60 33.58 33.86 42.24 43.32 39.22 16.43 37.03 18.19 19.79 27.72 14.59 35.36 26.98 35.15 22.55 29.97 47.92 8.05 12.33 47.74 32.17 25.02 6.42 34.53 63.39 11.25 26.96 27.96 25.02 13.23 42.52 36.73 23.31 58.48 38.26 0.31 9.33 6.49 0.01 57.61 73.92 49.61 41.27

1494.37 1579.36 1325.93 1928.36 1429.11 1743.06 1775.96 1690.39 2310.11 2390.59 1866.45 627.86 2221.04 600.72 824.49 1303.59 391.51 2287.47 729.81 1742.91 1043.57 1702.37 1925.96 311.92 249.44 2879.23 940.52 1083.84 201.07 1861.89 2920.37 406.62 1184.04 1239.38 1046.30 638.51 2332.12 1701.14 995.97 2656.55 2078.56 2.36 228.00 118.10 0.01 2455.89 3863.13 1722.27 2092.11

0 2 0 2 7 2 1 1 0 1 0 2 1 6 5 1 9 2 9 3 3 1 1 2 4 2 0 2 4 3 1 3 5 0 5 5 3 0 4 1 1 1 7 9 0 1 0 3 4

DL

Human score

11.31 3.16 2.83 7.21 8.06 10.00 5.00 3.61 4.24 6.40 2.83 4.47 12.04 10.77 6.08 6.40 10.05 8.60 10.05 10.82 4.12 10.63 12.04 11.40 5.10 11.40 12.73 7.21 5.10 4.12 13.45 4.12 11.18 9.90 7.28 7.28 12.21 8.49 11.66 13.45 13.45 2.24 8.06 10.05 1.41 13.45 9.90 10.82 10.30

0.5 −0.50 0.00 1.00 −1.00 1.00 0.00 0.5 1.50 1.50 1.00 −2.00 0.50 −1.50 −1.50 0.5 −2.00 1.00 −1.00 1.00 −0.50 0.00 2.00 −2.00 −2.00 2.00 1.00 0.00 −2.00 0.00 2.00 −2.00 0.00 1.00 −1.00 −2.00 1.00 1.00 −1.00 2.00 1.00 −2.00 −2.00 −2.00 −2.00 2.00 2.00 2.00 1.00

Fuzzy score

K

WK

2 −1 −1 0 0 1 0 0 1 1 1 −2 2 −1 −2 0 −2 0 −2 1 0 2 2 −1 −2 2 −1 0 −2 0 2 −2 1 0 0 −2 1 0 1 2 2 −2 −2 −2 −2 2 2 1 1

0.04 0.03 0.04 0.03 0.00 0.05 0.09 0.06 0.06 0.04 0.07 0.07 0.04 −0.01 0.00 0.07 −0.05 0.05 0.02 0.02 0.02 0.08 0.08 −0.01 0.11 0.07 0.10 −0.02 −0.03 0.03 0.08 −0.01 0.04 0.10 0.04 −0.06 0.11 0.06 −0.04 0.27 0.08 0.00 −0.01 0.07 0.00 0.17 0.12 0.04 0.02

0.47 0.23 0.43 0.36 0.10 0.44 0.52 0.46 0.40 0.36 0.44 0.41 0.25 0.09 0.08 0,43 −0.02 0.26 0.04 0.27 0.20 0.47 0.47 0.37 0.35 0.40 0.50 0.33 0.13 0.21 0.52 0.25 0.22 0.53 0.22 0.04 0.46 0.47 0.01 0.49 0.52 0.00 0.08 0.14 0.00 0.60 0.60 0.36 0.28

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

agreement impression. Otherwise, IO matrix is evaluated with the highest score 2. The maximum value is high, well defined and located near the main diagonal line at the quantization level 0.9. Quantities that describe the IO matrix have the following values: M=63.39, I=2920.37 thus with a mean value of 29.37, DFD=1, DL =13.45. In the second phase of our experiments the 49 fMRI exams mentioned above were considered. Both the kappa and IO matrix evaluations were performed. The SPMs agreements were assessed using the confusion matrix and an observer attributed scores to all the IO matrixes twice (to verify the reproducibility of the attribution). All the observed IO matrixes were displayed with the same fixed color scale. The same IO matrixes are evaluated using the fuzzy system. In Table 2 all the results are reported. For each task the quantities describing IO matrix used as linguistic variable and confusion matrix are detailed. For the IO matrix: • • • • •

maximum value (M) total surface integral (I) horizontal distance of maximum from diagonal line (DFD) diagonal length (DL) mean of the scores assigned in both subjective evaluation of the IO the matrix (Score). • mean scores automatically assigned to the IO matrix by the fuzzy system (Fuzzy Score). For the confusion matrix: • “classical” kappa, evaluated with threshold 0.1 (K). • Kappa including dependent classes correction, evaluated with threshold 0.1 (WK). IO matrix description quantities were not known to the observer attributing subjective scores during his assessing process. Linear correlation coefficient between “classical” kappa and subjective score is R = 0.55. Linear correlation coefficient between corrected kappa and subjective score is R = 0.75. Hence, we can say that IO matrix comparison method correlates well with well-established statistical evaluation methods as that of Cohen's kappa. Moreover, the correction made to account for partial dependence of classes is valid, and causes a better correlation between this statistical method and a visual one. Moreover, our results show that the IO matrix evaluation performed using the fuzzy system has a high degree of agreement with that performed by the subjective analysis (R = 0.85). In this way the analysis is objective despite the linguistic nature of the problem being respected. Thus, our method is repeatable and fully transferable to other contexts. The mean scores attributed to IO matrixes of all the tasks analyzed, divided by task itself for the analysis with and without the visual cortex are reported in Table 3. When the comparison of the two pipelines is performed including visual cortex, Table 3 shows that there is concordance between the results of the two pipelines. It seems that concordance is better for language tasks than for motor task. If visual cortex is excluded from the analysis, results are completely different. For all tasks concordance is much smaller, with scores almost always zero or negative. This shows that the previously observed concordance is

Table 3 Mean scores by task. Mode With visual cortex Without visual cortex

Finger Tapping

Object Naming

Word Generation

Verb Generation

0.2 0.0

1.3 −0.2

0.5 0.1

0.5 0.4

565

almost always based on the concordance for the big and bright visual area. This is particularly true for language tasks, and this is logical: our language tasks are stimulated with visual stimuli. The difference is truly big for object naming (stimulated by color photos of objects) and verb generation (stimulated by hand drawing of actions). For word generation, the stimulus is a letter on a black background, which a simpler visual stimulus. Hence, if one wants to evaluate a coarse concordance between fMRI pipelines, he can include visual cortex. In view of a coarse evaluation, the two pipelines considered are in agreement. However, in our opinion, in a clinical context a fine concordance is necessary. In fact, positions of small language or motor areas are used in brain surgical planning. Hence, we consider that the concordance of the two pipelines, to be real and useful, should last also without visual cortex. In this case, the concordance between the two considered fMRI pipelines vanishes. Finger tapping, known as a robust task, simple to perform for patients, is the only task obtaining a concordance evaluated as mild (“0” mean score). 5. Conclusion In this work we presented a comparison analysis of two fMRI software tools: Philips IViewBOLD® and FMRI Expert Analysis Tool (FEAT) included in FMRIB Software Library (FSL), obtaining a quantitative assessment of the inconsistencies between the SPM generated. The experimental analysis adopted two non-threshold evaluation strategies based on standard indexes adapted to graduated “soft” data. Results obtained in our clinical context substantiate the general claim that often same sequence, processed in different ways, gives different results. This effect has been observed and quantitatively analyzed in the specific relevant context of clinical surgery in which variation in software requires attention. A second contribution of the present study concerns the use of evaluation metrics. Both indexes have been found useful in the clinical domain. The use of the first metric proposed, based on the weighted kappa, is facilitated by the similarity with the well-known kappa index. The second, more complex IO matrix strategy, better reflects the typical visual method used by clinicians to evaluate SPMs. Our conclusions about the inconsistency of the software pipeline are not generalizable to other time-series analysis methods, but the proposed comparison method can be applied to them. The strength of our work lies into the fact that the SPMs analyzed concern real patients and are extracted using the software tool embedded in acquisition machines widely used in clinical practice. Acknowledgments We wish to acknowledge Dr. R. Minotto of Neuroradiology Department, all neuroradiology staff and speech therapists G. Cracco and A. Barbieri for the collaboration. Appendix A In this appendix the values of FMRI Expert Analysis Tool (FEAT) (FSL) are listed, with the aim of making our experiments reproducible on other data sets. Misc Brain/background threshold 10% Design efficiency • • • •

Noise level 0.66% Temporal smoothness 0.34% Estimate from data z threshold 5.3

566

V. Pedoia et al. / Magnetic Resonance Imaging 31 (2013) 555–566

Data • Delete volumes 0 • High pass filter cut off (s): 100 Pre-Stats • • • • • • •

Motion correction MCFLIRT BO unwarping Slice timing correction NONE BET brain extraction Spatial smoothing FWHM (mm) 5 Intensity normalization Temporal filtering Highpass Stats

• • • • • • • • • •

Use film prewhitening Add motion parameters to model Add additional confound EVs Convolution gamma Phase 0 Stdev 3 Mean log 6 Orthogonalize Add tempoaral derivative Apply temporal filtering Post-Stats

• • • • • • • •

Pre-thresholding masking Thresholding Voxel Corrected voxel P threshold 0.05 Contrast masking—Rendering Use actual z min/max Trasparent blobs Create time-series plot Registration 7 D.O.F.

References [1] http://www.fmrib.ox.ac.uk/fsl/. [2] http://www.fil.ion.ucl.ac.uk/spm/. [3] Zhang J, Anderson JR, Liang L, Pulapura SK, Gatewood L, Rottenberg DA, et al. Evaluation and optimization of fMRI single-subject processing pipelines with npairs and second-level cva. Magn Reson Imaging 2009;27(2):264-78, http://dx.doi.org/10.1016/j.mri.2008.05.021. [4] Zang J, Liang L, Anderson JR, Gatewood L, Rottenberg DA, Strother SC. A Javabased processing pipeline evaluation system for assessment of univariate general linear model and multivariate canonical variate analysis-based pipelines. Neuroinformatics 2008;6(2):123-34, http://dx.doi.org/10.1016/ j.mri.2008.05.021. [5] Oakes T, Johnstone T, Walsh KO, Greischar L, Alexander A, Fox A, et al. Comparison of fMRI motion correction software tools. Neuroimage 2005;28(3):529-43, http://dx.doi.org/10.1016/j.neuroimage.2005.05.058. [6] Morgan VL, Dawant BM, Li Y, Pickens DR. Comparison of fMRI statistical software packages and strategies for analysis of images containing random and stimulus-correlated motion. Comput Med Imaging Graph 2007;31(6): 436-46, http://dx.doi.org/10.1016/j.compmedimag.2007.04.002. [7] Pickens DR, Li Y, Morgan VL, Dawant BM. Development of computer-generated phantoms for fMRI software evaluation. Magn Reson Imaging 2005;23(5): 653-63, http://dx.doi.org/10.1016/j.mri.2005.04.007. [8] Sorenson JA, Wang X. ROC methods for evaluation of fMRI techniques. Magn Reson Med 1996;36(5):737-44. [9] Skudlarski P, Constable R, Gore JC. ROC analysis of statistical methods used in functional MRI: individual subjects. Neuroimage 1999;9(3):311-29. [10] Strother SC, Lange N, Anderson JR, Schaper KA, Rehm K, Hansen LK, et al. Activation pattern reproducibility: measuring the effects of group size and data analysis models. Hum Brain Mapp 1997;5:312-6. [11] Tegeler C, Strother SC, Anderson JR, Kim SG. Reproducibility of BOLD-based functional MRI obtained at 4 T. Hum Brain Mapp 1999;7(4):267-83.

[12] Machielsen WCM, Rombouts SARB, Barkhof F, Scheltens P, Witter MP. fMRI of visual encoding: reproducibility of activation. Hum Brain Mapp 2000;9(3): 156-64. [13] Skup M. Longitudinal fMRI analysis: a review of methods. Stat Interface 2010;3(2):235-52. [14] Schmah T, Yourganov G, Zemel RS, Hinton GE, Small SL, Strother SC. Comparing classification methods for longitudinal fMRI studies. Neural Comput 2010;22(11):2729-62. [15] Genovese CR, Noll DC, Eddy WF. Estimating test-retest reliability in functional MR imaging I: Statistical methodology. Magn Reson Med 1997;38:497-507. [16] Maitra R, Roys SR, Gullapalli RP. Test-retest reliability estimation of functional MRI data. Magn Reson Med 2002;48(1):62-70. [17] Cohen J. A Coefficient of Agreement for Nominal Scales. Educ Psychol Meas 1960;20(1):37-46, http://dx.doi.org/10.1177/001316446002000104. [18] Jaccard P. The distribution of the flora in the alpine zone. New Phytol 1912;11(2):37-50. [19] Dice LR. Measures of the Amount of Ecologic Association Between Species. Ecology 1945;26(3):297-302. [20] Le TH, Hu X. Methods for assessing accuracy and reliability in functional MRI. NMR Biomed 1997;10(4–5):160-4. [21] Colwell RK, Coddington JA. Estimating terrestrial biodiversity through extrapolation. Phil Trans R Soc Lond 1994;345(1311):101-18. [22] Maitra R. Assessing certainty of activation or inactivation in test–retest fMRI studies. Neuroimage 2009;47(1):88-97. [23] Shrout P, Fleiss J. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86(2):420-8. [24] Caceres A, Hall DL, Zelaya FO, Williams SC, Mehta MA. Measuring fMRI reliability with the intra-class correlation coefficient. Neuroimage 2009;45(3): 758-68, http://dx.doi.org/10.1016/j.neuroimage.2008.12.035. [25] Afshin-Pour B, Soltanian-Zadeh H, Hossein-Zadeh G-A, Grady CL, Strother SC. A mutual information-based metric for evaluation of fMRI data-processing approaches. Hum Brain Mapp 2011;32(5):699-715. [26] Nelder J, Wedderburn R. Generalized linear models. J R Stat Soc 1972;135(3): 370-84. [27] Smith SMS. Fast robust automated brain extraction. Hum Brain Mapp 2002;17(3):143-55, http://dx.doi.org/10.1002/hbm.10062. [28] Pedoia V, Binaghi E, Balbi S, Benedictis AD, Monti E, Minotto R. 2d MRI brain segmentation by using feasibility constraints. Proceedings Vision And Medical Image Processing, VipIMAGE 2011;2011:251-6. [29] Pedoia V, Binaghi E, Gallo I. Affine SPHARM registration: neural estimation of affine transformation in spherical domain. Proceedings of International Conference on Computer Vision Theory and Applications (VISAPP). Olihao Portugal: INSTICC Press; 2011. [30] Balafar M, Ramli A, Saripan M, Mashohor S. Review of brain MRI image segmentation methods. Artif Intell Rev 2010;33:261-74, http://dx.doi.org/10.1007/s10462-010-9155-0. [31] Clarke L, Velthuizen R, Camacho M, Heine J, Vaidyanathan M, Hall L, et al. MRI segmentation: methods and applications. Magn Reson Imaging 1995;13(3): 343-68, http://dx.doi.org/10.1016/0730-725X(94)00124-L. [32] Atkins M, Mackiewich B. Fully automatic segmentation of the brain in MRI. IEEE Trans Med Imaging 1998;17(1):98–107. [33] Martelli A. Edge detection using heuristic search methods. Comput Graph Image Process 1972;1(2):169-82, http://dx.doi.org/10.1016/S0146-664X(72)80013-3. [34] Sonka VHM, Boyle R. Image processing, analysis and machine vision. 3rd ed. London: Chapman and Hall; 1993. [35] Thedens D, Skorton D, Fleagle S. Methods of graph searching for border detection in image sequences with applications to cardiac magnetic resonance imaging. IEEE Trans Med Imaging 1995;14(1):42-55. [36] Huang H, Shen L. Surface alignment of 3d spherical harmonic models: application to cardiac MRI analysis. MICCAI'2005; 2005. p. 67-74. [37] Shen L, Huang H, Makedon F, Saykin AJ. Efficient registration of 3d SPHARM surfaces. CRV '07: Proceedings of the Fourth Canadian Conference on Computer and Robot Vision, IEEE Computer Society, Washington, DC, USA; 2007. p. 81-8. http://dx.doi.org/10.1109/CRV.2007.26. [38] Landis JR, Koch GG. The measurement of observer agreement for the categorical data. Biometrics 1977;33:159-74. [39] Feliss L. Statistical method for rates and proportions. 3rd ed. New York: John Wiley and Sons; 1981.. [40] Bakeman R, Gottman JM. Observing interaction: an introduction to sequential analysis. Cambridge, UK: Cambridge University Press; 1997. [41] Cohen J. Weighed kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70(4):213-20. [42] Pedoia V, Colli V, Strocchi S, Vite C, Binaghi E, Conte L. fMRI analysis software tools: an evaluation framework. Proc. SPIE 7965; 2011. p. 796528-9. [43] Mamdani E, Gaines B. Fuzzy reasoning and its applications. London: Academic Press; 1981. [44] Dubois D, Prade H. A review of fuzzy set aggregation connectives. Inform Sci 1985;36(12):85–121. [45] Mamdani E. Advances in the linguistic synthesis of fuzzy controllers. International Journal of Man–Machine Studies 1976;8(6):669-78.