ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Article
Revised

Utilizing anatomical information for signal detection in functional magnetic resonance imaging

[version 2; peer review: 2 approved, 1 not approved]
PUBLISHED 25 Mar 2026
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

Abstract

Background

We are considering the statistical analysis of functional magnetic resonance imaging (fMRI) data. As demonstrated in previous work, grouping voxels into regions (of interest) and carrying out a multiple test for signal detection on the basis of these regions typically leads to a higher sensitivity when compared with voxel-wise multiple testing approaches.

Methods

In the case of a multi-subject study, we propose to define the regions for each subject separately based on their individual brain anatomy, represented, e.g., by regional labels. The aggregation of the subject-specific evidence for the presence of signals in the different regions is then performed by means of a combination function for p-values. We validate the proposed methodology with simulated data and apply it to real fMRI data of a hypothesis-driven approach towards identifying brain regions involved in understanding software code.

Results

The results of our simulated data indicate that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline. Testing our method on real fMRI data, we found that our approach yields overlapping results with a two-stage approach for which two independent experiments are needed, one for defining the regions and one for actual signal detection.

Conclusions

In this paper, we overall demonstrate that our method of utilizing anatomical information is a candidate to provide a more sensitive analysis of fMRI data.

Keywords

Aparc label; combination test; false discovery rate; mass-univariate linear model; program comprehension

Revised Amendments from Version 1

We revised our paper to improve accessibility, substantially clarified methodological details (especially regarding the multi‑stage testing procedure, validity of the partial conjunction test, dependence structure, and hypothesis interpretation), and expanded the introduction, methods, and discussion accordingly. Additionally, we extended the simulation study (including varying SNR), refined figures and explanations, added a detailed code tutorial, clarified limitations (e.g., sample size), and incorporated multiple reviewer‑suggested clarifications throughout.

See the authors' detailed response to the review by Qiran Jia
See the authors' detailed response to the review by Fabricio Cravo and Stephanie Noble
See the authors' detailed response to the review by Benedikt Sundermann

1. Introduction

Signal detection in high-dimensional data is a major topic of modern statistics. In the context of functional magnetic resonance imaging (fMRI) with its numerous volume units (voxels) as primary units of measurement, localization of brain function is a common structural assumption. Scattered, spread-out signals (in single voxels or very small groups of voxels) are likely to be artifacts (Ref. 41). Instead, topologically contiguous signals forming larger groups of voxels are much more plausible; see Refs. 21, 30. This information rules out certain patterns of the signal structure a priori, and can hence be exploited to increase the statistical power for signal detection; see, among many others.45 There are different possibilities to define or find such functional regions:

  • (i) One may refer to an atlas of the brain, like the Brodmann atlas (see Ref. 7) and aggregate data within the regions given by this atlas. This has been the strategy of Ref. 45.

  • (ii) One may find the regions (of interest) in a data-driven manner, e.g., by a cluster analysis. This has been proposed, among others, in Ref. 28. However, as emphasized for instance in Ref. 4, it is important that this data-driven definition of regions is “based on information outside the data that we set out to analyze”, meaning that the dataset used for defining the regions should be (stochastically) independent of the dataset which is used for signal detection, to avoid selection biases.

  • (iii) One may choose a statistical methodology which ensures statistically valid conclusions even for regions which are selected in a post-hoc manner after having seen the actual study data. This can be achieved by simultaneous inference methods which guarantee that any possible selection event is accounted for; see, e.g., Ref. 43 and references therein.

All of the three aforementioned strategies have their assets and their drawbacks: Strategy (i) is inexpensive and easy to implement, but the regions taken from the atlas may not be optimally aligned with the specific task at hand, and differences in the individual brain anatomies of the study participants may complicate its application. For a statistical approach to the alignment of fMRI data, see, e.g. Ref. 2 and the references therein. Strategy (ii) is costly (two independent experiments are needed), but is supposed to yield a more accurate definition of the regions (of interest). Strategy (iii) avoids both the (potentially suboptimal) a priori definition of regions and the need for an additional independent experiment. However, the issue of multiple testing (see, e.g., Refs. 13, 14) becomes much more severe if simultaneity over all possible selection events has to be guaranteed, and the selection of the regions is not based on a clear-cut (statistical) criterion, but on the expert judgment of the study data. Therefore, it is hard to compare the results of Strategy (iii) with those of Strategies (i) and (ii).

In the present work, we propose a new strategy and apply it to existing data of a study on program comprehension Ref. 48 (see Methods for details). Our proposed data analysis strategy is similar to Strategy (i). Namely, we rely on regions that are defined based on the freesurfer segmentation of the brain into so-called Aparc labels; cf. Section 2.3 for details. However, we apply this segmentation of the brain for each subject separately. The rationale for this novel approach is that functional brain regions show variable localization in brain templates between subjects, despite optimized anatomical co-registration. Only in the final step of analysis (see Step 4 in Section 2.4), a combination of subject-specific p -values for the activation of the brain regions of interest takes place, and we evaluate the significance of such a region (on the multi-subject level of data analysis) by means of the resulting combined p -value.

As we will demonstrate by means of the concrete example from the field of programing language comprehension, this new strategy can be similarly powerful as Strategy (ii) applied in Ref. 48, while avoiding the sequence of two separate fMRI studies where the data of the first is used for definition of suitable regions for the second. Strategy (ii) has been applied in (Ref. 48) as follows: The authors made use of the knowledge about cortical regions relevant for understanding software code identified in a previous study; cf. Ref. 47. This first study (Ref. 47) thus served as “localizer” for the second study (Ref. 48) in the sense of Strategy (ii). Only with this “pre-localization”, it was possible in Ref. 48 to declare certain brain regions statistically significantly activated during the (more specific, as compared to Ref. 47) task of programming language comprehension considered in Ref. 48.

In contrast, our present methodology will not at all rely on the data from the first study (Ref. 47), while achieving similar results as those obtained in Ref. 48. Thus, our proposed methodology can enable researchers to spare precious measurement time.

The rest of the paper is structured as follows. In Section 2, we describe our proposed statistical methodology. Section 3 describes conducted computer simulations comparing three different methods of fMRI data analysis. Section 4 is devoted to the detailed description of our re-analysis of the fMRI data from Ref. 48, and the results of this re-analysis are presented in Section 4.3. We conclude with a discussion in Section 5.

2. Methods

In this section, we describe our statistical model for fMRI data as well as the proposed data analysis workflow for detecting brain regions which are significantly associated with a certain cognitive task.

2.1 Reference data

As reference for our novel statistical approach we reuse the data of a study on program comprehension published in Ref. 48. This study made use of a functional localizer defined in a separate prior study with a comparable experimental design published Ref. 47 and thus followed Strategy (ii) for functional localization of regions involved in comprehending program code (see Table 1 for details of the two studies). In the first study (see Ref. 47), participants were asked to understand short program code snippets, such as shown in Listing 1. The program code did not contain any useful identifier names, which induces bottom-up comprehension (cf. Ref. 40). Ref. 47 contrasted the bottom-up comprehension task with a syntax task, in which participants were presented with similar program code snippets, but only had to focus on syntax errors (e.g., missing semicolon). This control condition was intended to reveal only brain activation that is necessary for programmers to comprehend program code in-depth. As an additional control condition, the experiment included phases of rest in between the comprehension and syntax conditions.

Table 1.

Overview of the two related fMRI studies of program comprehension.

Study 1 by Ref. 47 Study 2 by Ref. 48
Participant sessions1614
Trials1230
ConditionsBottom-up program comprehension, control (syntax), restTop-down program comprehension (n = 24), Bottom-up program comprehension (n = 3), control (syntax, n = 3), rest
Scans900900
5753c8cc-90cf-40f2-b6b0-4e103882baca_figure8.gif

Listing 1. Example code snippet in Java from Siegmund et al. (Ref. 47) that computes the length of the last word in a string. The snippet uses non-meaningful identifiers to induce bottom-up comprehension. Participants needed to figure out the output of this snippet “5”.

The second study (Ref. 48) was a follow-up study that also differentiated the program-comprehension task into more nuanced conditions. One aim was to differentiate between bottom-up comprehension and top-down comprehension (cf. Ref. 9) which was induced by varying the meaningfulness of identifier names and by prior training to provide participants with the necessary knowledge. As in the previous study, the syntax task served as a control condition. Another research question addressed the goal of confirming the activated brain areas from the first study. To this end, the second study built on the regions identified in the first study, thus following Strategy (ii). Both studies were approved by the ethics board of the University of Magdeburg (Application: 87/14).

2.2 Linear models for voxel-wise multiple tests

Let Yixt denote the observed data from a functional MRI experiment at voxel x and time t for the i -th subject. Here, we adopt the common view (cf. Section 5.4 in Ref. 30) of a mass-univariate linear model

(1)
Yixt=Xiβix+εixt
for the data, with a design matrix Xi containing variables with the expected blood oxygenation level dependent (BOLD) response related to the experimental stimuli or nuisance parameters like drifts of the MR signal. The random variable εixt is the error term, which is assumed to be normally distributed with zero expectation and a spatio-temporal correlation structure. The model in Eq. (1) is also referred to as a “within-subject model” in the fMRI literature; see, e. g., Section 12.4.1 in Ref. 37. Estimates β̂ix of the statistical parametric map (SPM) or their contrasts cTβ̂ix and estimates of their covariance matrices Σ̂ix (or the variances σ̂ix2=cTΣ̂ixc ) can then be obtained from a pre-whitened version of the linear model above; cf. Ref. 30.

The SPM then forms a random t -field (cf. Ref. 58) with an inherent multiple comparison problem due to the large number of local hypotheses. One common strategy is to define local p-values at each voxel x and for each subject i based on the local values of the random t -field and to control the family-wise error rate (FWER) using accordingly adjusted thresholds; cf. Ref. 59. However, this is known to be a very conservative approach with respect to the detectability of significant brain signals in the outlined framework. In contrast, approaches related to the control of the false discovery rate (FDR) can handle the multiple comparison problem, e.g., by the procedure proposed in Ref. 6.

2.3 Parcellation of the human brain

Neuroanatomic research has found that the human brain can be parcellated into different sub-regions based on structural similarities. One of the earliest atlases is the Brodmann atlas (see Ref. 7) which is based on the cytoarchitectural organization of the brain. The Brodmann Areas have been schematically transferred to a template brain, the so-called Talairach-Atlas (see Ref. 52) which is commonly used in fMRI studies to report the location of significant grand average activation, as used in Refs. 47, 48. For the analysis of these data in the current paper, we chose the Harvard-Oxford brain atlas (cf. Refs. 11, 33) that provides a parcellation based on gross anatomical landmarks and delivers an Aparc label j for each voxel of each individual brain space. However, any other brain parcellation to define regional labels could be used with our methodology.

2.4 Statistical inference

As outlined in the introduction, we re-used fMRI data from a program code comprehension task first analyzed in Ref. 48 and performed a new analysis comprising the four steps outlined below. The experiment used two different levels of software program code comprehension stimuli, henceforth denote as bottom-up and top-down comprehension, to infer on the related cognitive processes. In our strategy, we combined the methods from Ref. 48 (steps 1 and 2) and Ref. 45 (step 3). Furthermore, we implemented our new methodological contribution of combining the evidence for activation of a given brain region across the subjects (step 4). The first two steps were already conducted in Ref. 48 and we only exported the voxel-wise p-values of the random-effects linear model analysis for each subject separately. Steps 3 and 4 rely on the assumption that these voxel-wise p-values are valid (i. e., stochastically lower-bounded by a uniform distribution on [0, 1] under the null hypothesis). To ensure this, it is important that (i) the two conditions to be tested in Steps 1 and 2 are pre-defined before seeing any data which are used to compute p-values, and (ii) the explicit p-value calculation in Step 2 is only performed for those voxels which have been declared significant in Step 1. Technically, the p-values for all voxels which have not been declared significant in Step 1 are set to 1.0 in Step 2. Assumptions (i) and (ii) ensure that the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small).

Step 1: Program comprehension versus rest

In the first step, we contrasted (for each participant separately) the comprehension of program code (cf. Ref. 48) to the rest condition. This identifies brain areas with a positive deflection of the BOLD response. Furthermore, in order to account for the multiple comparison problem, we performed the Benjamini-Hochberg test (see Ref. 6) for FDR control. Only those voxels which have been declared significant by this procedure were considered in step 2. This methodology is justified by the fact that the FDR is an established screening criterion for high-dimensional multiple test problems.

Step 2: Bottom-up comprehension versus control condition

In this step, we contrasted (again, for each participant separately) one type of program comprehension: bottom-up comprehension. Bottom-up comprehension is induced when program code provides no semantic cues and programmers need to comprehend each line separately and then integrate the information in a slow, tedious process. For the significant voxels from the first step, we applied in a second step the same multiple test to the contrast of bottom-up comprehension against the control condition (syntax task) on the restricted set of voxels. As a result of this step, we get for each participant i and for each considered voxel k a p-value p~ik .

Step 3: Regional p-values for every participant i

This step builds upon the methodology from Ref. 45, and it delivers for each participant i and for each anatomical (regional) label j a (confirmatory) significance evaluation with respect to the contrast specified in step 2. Hence, the evidence from all voxels of participant i in the brain region labeled by j is combined in this step of data analysis.

To this end, let κ be a tuning parameter with values in the interval [0, 1] (i.e. in per cent) and let mj be the number of voxels contained in the brain region labeled by region label j . To keep the notation feasible, we implicitly assume here that for each participant i the same number mj of voxels belong to the brain region labeled by j . We consider the null hypothesis Hij of no relevant differential activation of the region labeled by j for participant i during the program comprehension and syntax tasks mentioned in step 2, together with its two-sided alternative hypothesis Kij . We call Hij the “regional null hypothesis” for the brain region labeled by j for participant i . We formalize Hij as a so-called partial conjunction hypothesis (see Ref. 45 and the references therein for a formal mathematical description), meaning that we consider the differential activation in region j for participant i relevant, if it contains at least ujκ·mj significant voxels. For testing Hij we calculate the “regional p-value” pijREGION , given by

pijREGIONmin1ımjuj+1{mjuj+1ı}p~i,(uj1+ı):mj
where the voxel-wise p-values p~i,1:mj,,p~i,mj:mj for participant i in region j are ordered from smallest to largest (see Ref. 5). For the p -value pijREGION to be valid, assumptions about the dependency structure among p~ik:1kK are required. In particular, weak dependency (meaning that the empirical cumulative distribution functions of the p -values corresponding to true and false null hypotheses, respectively, converge in the Glivenko-Cantelli sense as K ) has been assumed in Ref. 45, where K denotes the total number of voxels.

In order to achieve family-wise error rate (FWER) control, we have to choose the tuning parameter κ smaller than or equal to 1/J , where J is the number of regional labels. Choosing κ=1/J corresponds to the so-called Bonferroni multiplicity correction. The choice of κ is discussed further in Appendix S1 of Ref. 45.

Step 4: Combined regional hypothesis tests by Fisher’s method

In this final step, we combine for each regional label j the regional p-values calculated in step 3 over all participants i=1,,n . In order to do this, we apply the so-called Fisher method to combine p-values. Namely, the Fisher test statistic Tj for region j is given by

Tj2i=1nlog(pijREGION)

Under independence of the data with respect to the participants, Tj is asymptotically X2n2 -distributed (chi squared) with 2n degrees of freedom under the null. The latter independence assumption is justified, because the participants have been included in the study independently from each other.

Finally, we can reject the (over all participants i combined) regional hypothesis Hj (i.e., the respective partial conjunction hypothesis, but now with respect to the population, not with respect to a single participant) if and only if Fisher’s test statistic Tj is larger than the (1-ακ)-quantile of the X2n2 -distribution with 2n degrees of freedom, where the tuning parameter κ has been introduced in step 3. This parameter addresses the multiplicity of the test problem with respect to the J regional labels which are simultaneously under consideration.

3. Computer simulations

In order to compare the performance of three different methods for fMRI data analysis in a controlled framework, we have carried out computer simulations.

3.1 Simulation setting

We have simulated a dataset {Yixt} with eleven subjects (referring to the index i and corresponding to the group size in the real dataset below), a spatial grid of size 20×20×20=8,000 voxels (referring to the index x ), and 195 time points (scans, referring to the index t ). We have assumed that the 8,000 voxels are grouped into eight anatomical regions of size 10×10×10=1,000 each, and that these regions are correctly annotated for all eleven subjects. Two alternating stimuli in an ON-OFF block task design with a total of six ON blocks of a duration of 15 scans have been used for the temporal signal: In a circular area in one of the predefined anatomical regions the signal of one stimulus was twice as high as the signal by the other stimulus mimicking a signal contrast. In a corresponding area in a second anatomical region the signal was created with no difference between the stimuli. The datasets with first order autocorrelated Rician noise have been created using the R package neuRosim (Ref. 56); p-values p~ik for Contrast 1 (ON of either stimulus versus rest) and Contrast 2 (one stimulus versus the other) where determined using the R package fmri (Ref. 51). This simulation setup has been run 500 times. We repeated the simulation for varying signal-to-noise ratios (0.75, 1.0, 1.25, 1.5, 1.75).

3.2 Considered data analysis methods

For the statistical analysis of the simulated data, we have considered three different methods.

  • Voxel-wise: On the basis of voxel-wise Z -scores (aggregated over all eleven participants) and the resulting p-values for Contrast 1, voxels have been screened by applying the Benjamini-Hochberg method at level α=0.05 . For the screened voxels only, Z -scores (again aggregated over all eleven participants) and the resulting p-values for Contrast 2 have been computed. A region j has been declared significantly activated, if the number of Contrast 2 p-values below 0.05/sj in that region j has been larger than 1,000·κ , with sj denoting the number of screened voxels in region j and κ=0.01 , meaning that we considered a region j to be relevantly activated if at least 10 out of the 1,000 voxels in that region j are statistically significantly activated under the stimulus. The rejection threshold 0.05/sj accounts for the (reduced) multiplicity resulting from the pre-selection by means of Contrast 1.

  • Cluster-wise: Activation is typically spatially spread over multiple voxels. In fact, single isolated voxels are mostly considered as spurious rather than activated; see Ref. 30. Thus, we also analyzed the simulated datasets and the two contrasts with cluster-based thresholds: At the given significance level α=0.05 the value of the test statistic at a voxel must exceed some threshold and it has to belong to a cluster of at least s connected voxel. Thresholds can be pre-determined by simulations. To determine p-values we used the R package fmri (Ref. 51), where the method is implemented as function fmri.cluster. There, cluster-level inference thresholds are pre-defined by simulation for this function, see Ref. 42.

  • Region-wise (this paper): Our proposed method from Section 2, again with FDR level 0.05, κ=0.01 , and α=0.05 in Step 4.

3.3 Results

We have assessed the type I and the type II error behavior of the three methods described in the previous subsection by calculating (for each of the three methods) the proportion of simulation runs in which any region without true activation has been declared significantly activated (type I error component) as well as the proportion of simulation runs in which the one region with true activation has not been declared significantly activated (type II error component). Figure 1 summarizes the results of our computer simulations. The horizontal axis in Figure 1 refers to the signal-to-noise ratio (SNR).

5753c8cc-90cf-40f2-b6b0-4e103882baca_figure1.gif

Figure 1. Type-I- and Type-II-errors for different signal-to-noise ratios (SNR) for the three methods under consideration: Using voxel-wise inference (VOXEL), using a cluster-based method (CLUSTER), and using the method from this paper (REGION).

Table 2. The region of interest in BA21 identified in Ref. 47 consists of 2844 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. Aparc labels in bold are evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

# Voxels Overlapping with BA21 in %# Voxels of Entire Aparc Label in % in BA21
Left-Cerebral-White-Matter 63422%2106200.3%
ctx lh G temporal middle 61722%75158.2%
ctx lh S temporal sup 51218%93795.5%
ctx lh S temporal inf 853%24993.4%

From Figure 1 it becomes apparent that all three methods are capable of protecting reliably against type I errors under our data-generating model. However, the power of the proposed method appears to be considerably larger than that of the voxel-wise method. For example, during our 500 simulation runs at SNR=1.5 , our new method as well as the cluster-wise method always detected the truly activated region, while the voxel-wise method detected it only in 367 out of the 500 runs. As far as a power comparison between our proposed method and the cluster-wise method is concerned, our simulations indicate that the proposed method exhausts the significance level (α=0.05 ) better than the cluster-wise method in the sense that its estimated type I error probability is larger. Due to the structure of the decision rule, this automatically also implies higher (more precisely: non-smaller) power, meaning that the type II error probability of the proposed method is upper-bounded by the type II error probability of the cluster-wise method. Under our simulation settings with SNR1.5 , both procedures have identical (and perfect) power, but for SNR[0.8,1.1] the proposed method exhibits strictly larger power in comparison with the cluster-wise method. The difference in power can be substantial (up to 10% more power of the proposed method) for moderately large (and thus realistic) values of the SNR.

4. Real data analysis

We re-used the data from Ref. 48 and compared our results with those obtained by additionally utilizing47 as pre-study in the sense of Strategy (ii) outlined in the introduction.

4.1 Previous findings

In the two previous studies mentioned before, Siegmund et al. used similar analysis processes. They used BrainVoyager™ QX 2.8.4 [1]. The anatomical scans were transformed into the Talairach brain to account for differences in brain size; cf. Ref. 52. They preprocessed the functional data of both studies with a standard pipeline of: 3-D motion correction, slice-scan-time correction, and temporal filtering. In addition, they applied a spatial smoothing with a Gaussian filter (FWHM = 4 mm).

In the first study Ref. 47, the random-effects GLM revealed five brain areas (BAs 6, 21, 40, 44, 47) with significant activation with the contrast Bottom-Up Comprehension versus Control condition, i.e. syntax task. In the second study, the same contrast with a cluster-based inference revealed no significant areas anymore, likely due to the reduced statistical power of five instead of twelve bottom-up comprehension tasks per session. Thus, the authors ran a regions-of-interest analysis restricted to the identified activation clusters of the first study on the data of the second study. This resulted in a significantly stronger activation for Bottom-Up Comprehension versus Syntax in BAs 21, 40, and 44.

4.2 Data export and preparation for re-analysis

In Figure 2, we illustrate the overall process for the re-analysis of the data from Ref. 48. For our re-analysis, we exported the already pre-processed data from the second study. We did not use the data of the first study as our method does not need prior definitions of regions of interest. We used BrainVoyager to access the already computed GLM-analysis data from Siegmund et al. Specifically, we exported the statistical values (i.e., t-scores, p-values) of the obtained brain activation on a voxel basis for each participant. The voxel resolution is the same BrainVoyager uses, i.e., a 1 mm interpolated resolution.

5753c8cc-90cf-40f2-b6b0-4e103882baca_figure2.gif

Figure 2. Illustration of processing of the experimental data.

The box in gold on the left indicates analysis steps that have already been performed in Siegmund et al. (Ref. 48). The box in turquoise on the right indicates the processing steps proposed in this paper.

In addition, we used FreeSurfer to segment and parcellate the brain of each participant based on their anatomical scan; cf. Refs. 18, 19. We used the Destrieux’ cortical atlas to assign Aparc labels on an individual participant basis (see Ref. 12) as a suitable region definition. Next, we used Nipype (see Refs. 17, 23) to convert Freesurfer labels to a BrainVoyager-readable format.

Our last step annotated the exported functional data with the individual anatomical labels for each participant. We removed all functional voxels for coordinates that had no assigned Aparc label, which typically are voxels that are not considered gray matter.

4.3 Results

Figure 3 displays the six brain regions which have been declared as significant (at FWER level α=5% ) associated with the task at hand by our described methodology. There is an overlap with the results obtained by Ref. 48 in which they utilized prior knowledge, but also some differences which we discuss in the following in their nuances. We visualize the confirmed network of brain activation from Siegmund et al. in Figure 4 and our identified network of significantly activated Aparc labels in Figure 5. Figure 3 and Figure 4 show overlapping results with regard to the Brodmann area 21 that covers the middle and inferior temporal gyrus (separated by the inferior temporal sulcus). However, we observed differences regarding smaller brain regions. We compare Siegmund’s replication efforts to our results in Section 5.1.

5753c8cc-90cf-40f2-b6b0-4e103882baca_figure3.gif

Figure 3. The six significant brain regions (at FWER level α = 5%).

Activated brain areas are particularly in the middle and inferior temporal lobe.

5753c8cc-90cf-40f2-b6b0-4e103882baca_figure4.gif

Figure 4. Network of left-lateralized confirmed brain areas. In Ref. 48, BAs 21, 40, 44 were found activated during program comprehension.

5753c8cc-90cf-40f2-b6b0-4e103882baca_figure5.gif

Figure 5. Results of our analysis with significantly activated Aparc labels.

In each row, each point corresponds to the Aparc p-value pⅈjAPARC of one study participant i, where the index j refers to the area indicated by the code at the beginning of the row.

Our method found three Aparc labels in the inferior and middle temporal gyrus in the left hemisphere. Siegmund et al. found their largest and most robust activation cluster in BA21 of the left hemisphere, which covers several gyri in the temporal lobe. These left temporal gyri are often associated with semantic processing of natural language, which is typically left-lateralized for right-handed participants. In the context of programming, the activation is believed to be responsible for extracting the meaning of individual identifiers and symbols during program comprehension. We found two further Aparc labels bilaterally in the inferior temporal gyrus and the anterior collateral sulci, which are both in the temporal lobe as well.

For each of these six regions indexed by j , we display all subject-specific regional p-values {pijREGION:1in }, where n is the number of study participants. For each j considered in Figure 5, it can clearly be observed that not a single extreme outlier (one very small p-value corresponding to one individual subject) is responsible for the statistical significance with respect to the combination test statistic Tj but that the combined information contributed by all n subjects supports our statistical conclusions. Regional p-values pijREGION1 can occur, if none of the voxels belonging to region j has been selected in Steps 1 and 2 described before for a certain subject i . By construction of Tj this essentially means that the “effective sample size” for such a region is reduced, while the number of degrees of freedom for the null distribution of Tj remains unchanged.

5. Discussion

5.1 Statistical sensitivity

This paper introduces a new strategy to analyze fMRI data and demonstrates its performance by drawing a comparison to two studies by Siegmund et al. In their second study, they investigated a new research question which was based on a design with reduced statistical power regarding the network of brain areas activated during bottom-up program comprehension. Only by using prior knowledge of the location of identified clusters from the first study and conducting a regions-of-interest analysis with increased statistical power, they were able to exceed standard thresholds of statistical significance but restricted to the subset of areas identified in the first study.

Research that similarly aims to detect small differences between cognitive processes with state-of-the-art methods would need to increase statistical power, e.g. by conducting two studies: first, to identify the network of brain areas involved in the overarching cognitive process, and second, to identify differential effects within the activated brain areas. The method of utilizing anatomically predefined regions of interest as determined in each individual brain is a candidate for a more sensitive analysis as compared to statistical testing of single voxels with identical coordinates of a template brain. This is possibly due to allowing more flexibility with respect to the exact anatomical location of functional units across different brains. Our example of using ROIs from the Harvard-Oxford atlas can easily be replaced in future applications by brain parcellation schemes that are based on more refined anatomical and/or functional databases. We demonstrated by simulations that our method can accurately detect signals based on region aggregation and outperforms standard mass-univariate approach while achieving a similar performance as cluster-based inferences without the need to define thresholds based on data smoothness via simulations. Further, we demonstrated that our method can find significantly activated brain areas without relying on prior knowledge on a real fMRI dataset. Moreover, additional brain areas were identified which have been described in two fMRI studies of programmers and interpreted as being involved in visuo-spatial processing. One study investigated manipulating data structures, which share similarities to spatial rotation (see Ref. 27) and one writing program code (see Ref. 29). Since the study by Siegmund et al. (2014) was the first study on program comprehension they used a rather small FDR corrected significance level ( p<0.01 as compared to the more common p<0.05 ). Possibly this is one reason why the two areas identified by the current approach were not identified there and as a consequence could not emerge in Siegmund et al. (2017). Thus, our current approach seems valuable in cases where new research questions are explored and little to no prior work is available and which would require careful statistical hypothesis testing to initially minimize false positives.

However, inferior frontal gyrus (BA 44) and inferior parietal lobule (with BA 40), shown to be significant in Siegmund et al., were not significant with our method. Therefore, we exemplarily investigated this difference with the activation cluster in BA 40. In Figure 6, we display a histogram of p-values for the (statistically significant) Aparc label in the middle temporal gyrus, in which a majority of the participants contribute with small p-values. In contrast, in Figure 7, we display a histogram of p-values for the inferior parietal lobule. Across the entire group, we also observe an accumulation of very small p-values, but only three of eleven participants contribute to this. Unlike the regions-of-interest analysis done by Siegmund et al., our method relying on combining Aparc p-values by Fisher’s method does not reject hypotheses with such a p-value distribution.

5753c8cc-90cf-40f2-b6b0-4e103882baca_figure6.gif

Figure 6. Histogram of p values of ctx_lh_G_temporal_middle label that is evaluated as significantly activated.

The plot without a heading is across all participants, while the eleven plots with headings show the distribution across individual participants. The majority of participants is contributing to the significance.

5753c8cc-90cf-40f2-b6b0-4e103882baca_figure7.gif

Figure 7. Histogram of p values of ctx_lh_G_pariet_inf-Angular that is not evaluated as significantly activated.

The plot without the heading is across all participants, while the eleven plots with headings show the distribution across individual participants. While there are many small p- values, only a few participants contribute these small values.

By transforming the Aparc labels into the Talairach space, we observed that the voxels included in the activation cluster of BA 40 do not perfectly align with the Aparc label that should cover this brain area, i.e. the inferior angular gyrus of the parietal lobe. Table 3 shows that the overlap to the activation cluster in BA40 is less than 50% and that only around 5% of its voxels are assigned to an Aparc label at all. In contrast, Table 2 shows that at least 75% of the voxels of the BA 21 cluster are aligned to Aparc labels of which two are significant with our method. Thus, a small activation cluster within a large anatomical region may get lost with our approach. We further discuss this potential drawback of the used anatomical segmentation and possible remedies in Section 5.3.

Table 3. The region of interest in BA40 identified in Ref. 47 consists of 1777 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. No Aparc label is evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

# Voxels Overlapping with BA40 in %# Voxels of Entire Aparc Label in % in BA40
ctx lh G pariet inf-Supramar 30417.1%63184.3%
ctx lh G pariet inf-Angular 29016.3%49755.1%
Left-Cerebral-White-Matter 1578.9%1788750.1%

From the methodological point of view, our main contribution consists in a novel way how to combine evidence: Instead of aggregating single voxel data over all participants by mapping them to a standard brain template, we define subject-specific regions and combine the evidence on the level of these regions by means of a combination function. This is a generic methodological approach which is not restricted to the specific study setup of Siegmund et al., but can be applied to essentially any fMRI study design, involving an arbitrary number of contrasts.

Furthermore, also the (final) combination step of our proposed approach is generic in the sense that instead of the Fisher combination function any other (appropriate) combination function for p-values may be used. Recently, there has been a renewed interest in p-value combination methods; see, e.g., Ref. 57 (with discussion), Ref. 53, and Ref. 54.

It is also important to note that our results are based on an fMRI study with only 11 participants resulting in a threat to external validity. While this number might appear low, it is in line with similar studies in the domain of software engineering (Refs. 10, 17, 23, 39) where a smaller, but more homogeneous sample can be beneficial due to particularly concerning confounding factors related to programmer expertise, experience, and demographics (Refs 49, 55). Nevertheless, future work shall replicate our approach on datasets with a larger sample size and different domains.

5.2 Comparison to related fMRI analysis methods

Under the multiple testing framework, testing of grouped null hypotheses with (potential) application in fMRI is an active research topic. In Ref. 25 as well as in Ref. 4, clustering techniques were employed to define regions of interest, and the authors incorporated the heterogeneous cluster sizes in a weighting scheme for the linear step-up test from Ref. 6. In the same vein, the authors of Ref. 26 as well as Ref. 61 made use of the different proportions of true null hypotheses in each of the groups in their proposed weighting. A Bayesian variant of this idea has been derived in Ref. 32. Hierarchical methods, which exclude groups without strong evidence for the presence of signals in several stages of data analysis, have been worked out by several researchers; see, e. g., Refs. 3, 45 and 60. However, these methods rely on combining the subject-specific data on the voxel level, which is a standard technique as mentioned, for instance, in Section 5 of Ref. 31 Also on the basis of (combined) voxel data, a hierarchical independent component analysis for the comparison of brain functional networks has been proposed in Ref. 46. To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea.

It is worth mentioning that both our proposed methodology and the methods from previous literature that we are considering in our comparisons make certain assumptions about the data: First, a linear relationship between the features encoded in the design matrix and the response is assumed. Thus, non-linear effects cannot (fully) be captured by models of the type (Eq. 1). Second, Gaussianity of error terms is assumed, implying symmetry and light tails, among other things. It can be of interest to validate or to test these assumptions for a concrete dataset at hand. However, since this aspect is not the main focus of our present work, we defer such investigations to future research. In particular, modeling the data that we have re-analyzed in this work with different model assumptions would diminish the comparability of our data analysis results with the results from previous analyses. One simple (numerical) robustness analysis regarding the Gaussianity assumption can be performed by simulating data with a different error distribution and comparing the results with the results presented here. To this end, the source code which is available as supplementary material for this article may be helpful for practical implementation.

Even though the majority of fMRI studies still use GLM-based analyses which require corrections for multiple comparisons, we would like to note that statistical analyses more advanced than voxel-wise regression analyses increase in popularity; see, e.g., Ref. 36 for a recent approach or multi-voxel pattern analysis (MVPA) introduced by Haxby and colleagues, see Ref. 24 and the multiple papers cited therein. MVPA predicts stimulus event categories from the relative changes in activation across a set of voxels. Such a set of voxels is extracted in the first, feature selection step of analysis. The second step partitions the data into a training set and a testing set entered into a pattern classification algorithm. From a statistical point of view this requires a sufficiently large number of comparable events. In the experiments of Ref. 47 the events are code snippets that must be read and understood within 60 seconds to enable a certain complexity of software code and were thus limited in number to fit into the duration of a typical fMRI session of about 45 minutes. Such limited number of events may pose difficulties for MVPA cross-validation. Such low number of rather long events is, however, sufficient for generalized linear model (GLM) analyses of block design experiments that typically yield strong detection power (albeit low discrimination power).

Moreover, as functional activation is spatially distributed over several voxels rather than focused in single ones, cluster-based inferences have become rather standard; cf. Ref. 30. This is implemented and used in all major software package. However, a recent discussion by Ref. 16 revealed their flaws in practical situations. Furthermore, cluster-based inference requires simulation according to the smoothness of the data at hand.

The experiments we used to validate our approach employed a classical hypothesis-driven design, which is widely used in the literature, to unravel complex cognitive processing. They were constructed such that perceptual and cognitive processes necessary but not specific for understanding software code were controlled by specific test conditions, either requiring to read the same code but with a different task or controlling for attentional demands. Furthermore, understanding software code is a highly idiosyncratic process and therefore identification of brain activity using control conditions within individual subjects is possibly more feasible as a first step towards identifying the most relevant brain areas involved in such a complex cognitive process. Thus, our method can serve as a valuable alternative using spatial aggregation while still being fast and easy to implement.

5.3 Outlook: from anatomical to functional aggregation

Instead of utilizing a single voxel GLM group analysis in common brain templates, we used a parcellation of the brain for each individual participant into regional labels, here Aparc labels, before aggregation into the group analysis. This procedure provides more labels than a traditional Brodmann atlas. Still, some of the regions are very large and thus presumably contain several functional areas. There is ongoing research to subdivide the brain based on cytoarchitectonic details, e.g. the Jülich-Brain (see Ref. 1). This will provide more detailed parcellation schemes in the future and that could easily be integrated into our proposed method. Another refinement could be to use functionally defined brain regions for our presented methodology. A study that implements functional localizers could identify participant-specific functional maps of the brain for well-defined standardized tasks (e.g., see Ref. 35). Then, in the analysis, our presented methodology can aggregate across all participant-specific brains with less imprecision than traditional methods. We would like to note that such functional localizers are currently restricted to research areas that include brain regions with specific functional specialization; cf. Refs. 22 and 44. The studies we presented in this paper are concerned with a rather complex cognitive task. Moreover, understanding program code is highly individual since different programmers rely on different comprehension strategies based on their preferences, domain knowledge, and experience (e.g., Refs. 8, 9 and 50).

Ethical considerations

Our paper did not acquire any new data with human participants. The original studies of Siegmund et al. were approved by the responsible ethics board of the University of Magdeburg (Application: 87/14).

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 01 Oct 2025
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Peitek N, Brechmann A, Tabelow K and Dickhaus T. Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.12688/f1000research.166549.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 2
VERSION 2
PUBLISHED 25 Mar 2026
Revised
Views
7
Cite
Reviewer Report 30 Mar 2026
Benedikt Sundermann, Universitätsmedizin Oldenburg, Oldenburg, Germany;  Evangelisches Krankenhaus Oldenburg (Ringgold ID: 84511), Oldenburg, Lower Saxony, Germany;  University of Münster, Münster, Germany 
Approved
VIEWS 7
I Approved. Thank you ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Sundermann B. Reviewer Report For: Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.5256/f1000research.194983.r470520)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
15
Cite
Reviewer Report 27 Mar 2026
Qiran Jia, Division of Biostatistics and Health Data Science, University of Southern California, Los Angeles, California, USA 
Approved
VIEWS 15
I appreciate the authors’ careful revisions and responses, which address most of my main concerns. The revised introduction, methods, and expanded simulation section are much clearer and better organized. I also appreciate the clearer definition of the regional null hypothesis ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Jia Q. Reviewer Report For: Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.5256/f1000research.194983.r470521)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Version 1
VERSION 1
PUBLISHED 01 Oct 2025
Views
19
Cite
Reviewer Report 19 Jan 2026
Qiran Jia, Division of Biostatistics and Health Data Science, University of Southern California, Los Angeles, California, USA 
Approved with Reservations
VIEWS 19
This manuscript proposes a four-stage region-based inference workflow for signal detection in fMRI that leverages subject-specific anatomical parcellations to improve sensitivity over voxel-level multiple testing and to reduce reliance on independent “localizer” experiments. The authors apply a two-stage within-subject screening ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Jia Q. Reviewer Report For: Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.5256/f1000research.183550.r441581)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 25 Mar 2026
    Norman Peitek, Saarland University, Saarbrücken, Germany
    25 Mar 2026
    Author Response
    We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

    "R3-1: This manuscript ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 25 Mar 2026
    Norman Peitek, Saarland University, Saarbrücken, Germany
    25 Mar 2026
    Author Response
    We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

    "R3-1: This manuscript ... Continue reading
Views
30
Cite
Reviewer Report 26 Dec 2025
Fabricio Cravo, Psychology, Northeastern University College of Science (Ringgold ID: 195088), Boston, Massachusetts, USA 
Stephanie Noble, Northeastern University, Boston, Massachusetts, USA 
Not Approved
VIEWS 30
The authors introduce a multi-stage approach for aggregating evidence within a priori regions. They show this approach yields higher power than voxel-level inference and comparable power to cluster-based inference in simulations, and is able to highlight task-relevant regions in empirical ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Cravo F and Noble S. Reviewer Report For: Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.5256/f1000research.183550.r423766)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 25 Mar 2026
    Norman Peitek, Saarland University, Saarbrücken, Germany
    25 Mar 2026
    Author Response
    We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

    "R2-1: The ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 25 Mar 2026
    Norman Peitek, Saarland University, Saarbrücken, Germany
    25 Mar 2026
    Author Response
    We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

    "R2-1: The ... Continue reading
Views
35
Cite
Reviewer Report 07 Oct 2025
Benedikt Sundermann, Universitätsmedizin Oldenburg, Oldenburg, Germany;  Evangelisches Krankenhaus Oldenburg (Ringgold ID: 84511), Oldenburg, Lower Saxony, Germany;  University of Münster, Münster, Germany 
Approved with Reservations
VIEWS 35
The authors report an alternative strategy for assessing statistical significance in group fMRI studies, i.e. identifying consistent activation patterns across several participants (not group comparison studies). Instead of a conventional mass-univariate approach with voxel-wise multiple-comparison correction or cluster-level statistics, they ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Sundermann B. Reviewer Report For: Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.5256/f1000research.183550.r420203)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 25 Mar 2026
    Norman Peitek, Saarland University, Saarbrücken, Germany
    25 Mar 2026
    Author Response
    We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

    "R1.1: The ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 25 Mar 2026
    Norman Peitek, Saarland University, Saarbrücken, Germany
    25 Mar 2026
    Author Response
    We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

    "R1.1: The ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 01 Oct 2025
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.