Utilizing anatomical information for signal detection in functional magnetic resonance imaging

Norman Peitek; André Brechmann; Karsten Tabelow; Thorsten Dickhaus

doi:10.12688/f1000research.166549.1

Research Article

[version 1; peer review: 2 approved with reservations, 1 not approved]

Norman Peitek¹, André Brechmann², Karsten Tabelow³, Thorsten Dickhaus⁴

PUBLISHED 01 Oct 2025

¹ Saarland University, Saarbrücken, Saarland, Germany
² Leibniz Institute for Neurobiology, Magdeburg, Saxony-Anhalt, Germany
³ Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Berlin, Germany
⁴ University of Bremen, Bremen, Bremen, Germany

Norman Peitek
Roles: Data Curation, Investigation, Methodology, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

André Brechmann
Roles: Conceptualization, Data Curation, Funding Acquisition, Methodology, Resources, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Karsten Tabelow
Roles: Data Curation, Formal Analysis, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Thorsten Dickhaus
Roles: Conceptualization, Funding Acquisition, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Abstract

Background

We consider the statistical analysis of functional magnetic resonance imaging (fMRI) data. As demonstrated in previous work, grouping voxels into regions (of interest) and carrying out a multiple test for signal detection on the basis of these regions typically leads to higher sensitivity than voxel-wise multiple testing approaches.

Methods

In the case of a multi-subject study, we propose to define the regions for each subject separately based on their individual brain anatomy, represented, e.g., by regional labels. The aggregation of the subject-specific evidence for the presence of signals in the different regions is then performed by means of a combination function for p-values. We validate the proposed methodology with simulated data and apply it to real fMRI data of a hypothesis-driven approach towards identifying brain regions involved in understanding software code.

Results

In our simulations, the proposed method showed considerably higher power to detect true activation than a voxel-wise approach, and the same power as a cluster-wise approach. On real fMRI data, our approach yields results that overlap with those of a two-stage approach, which requires two independent experiments: one for defining the regions and one for the actual signal detection.

Conclusions

Overall, we demonstrate that our method of utilizing anatomical information is a candidate for providing a more sensitive analysis of fMRI data.

Keywords

Aparc label; combination test; false discovery rate; mass-univariate linear model; program comprehension

Corresponding author: Norman Peitek

Competing interests: No competing interests were disclosed.

Grant information: Financial support by the Deutsche Forschungsgemeinschaft (DFG) via grant DI 1723/3-2 is gratefully acknowledged. Brechmann’s work is supported by DFG grant BR 2267/7-2.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2025 Peitek N et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Peitek N, Brechmann A, Tabelow K and Dickhaus T. Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2025, 14:1019 (https://doi.org/10.12688/f1000research.166549.1) First published: 01 Oct 2025, 14:1019 (https://doi.org/10.12688/f1000research.166549.1) Latest published: 01 Oct 2025, 14:1019 (https://doi.org/10.12688/f1000research.166549.1)

1. Introduction

Signal detection in high-dimensional data is a major topic of modern statistics. Typically, structural information like, for instance, (the degree of) sparsity of the signal is necessary for its detectability and/or (consistent) estimability; see, e.g., Figure 1 in Refs. 13, 47, Chapter 7 in Ref. 50, as well as references therein.

Figure 1. Illustration of processing of the experimental data.

The box in Harvest Gold indicates analysis steps that have already been performed in Siegmund et al. (Ref. 43). The box in Monte Carlo indicates the processing steps proposed in this paper.

Especially in the context of functional magnetic resonance imaging (fMRI), another type of structural information is localization. The primary units of fMRI measurement are volume units (voxels) of the human brain. Localization means that scattered, spread-out signals (in single voxels or very small groups of voxels) are prone to be artifacts. Instead, topologically contiguous signals forming larger groups of voxels are much more plausible; see Refs. 19, 28. This information rules out certain patterns of the signal structure a priori, and can hence be exploited to increase the statistical power for signal detection; see, among many others, Ref. 40. There are different possibilities to define or find such regions: (i) One may refer to an atlas of the brain, like the Brodmann atlas (see Ref. 7), and aggregate data within the regions given by this atlas. This has been the strategy in Ref. 40. (ii) One may find the regions (of interest) in a data-driven manner, e.g., by a cluster analysis. This has been proposed, among others, in Ref. 26. However, as emphasized for instance in Ref. 4, it is important that this data-driven definition of regions is “based on information outside the data that we set out to analyze”, meaning that the dataset used for defining the regions should be (stochastically) independent of the dataset which is used for signal detection, to avoid selection biases. (iii) One may choose a statistical methodology which ensures statistically valid conclusions even for regions which are selected in a post-hoc manner after having seen the actual study data. This can be achieved by simultaneous inference methods which guarantee that any possible selection event is accounted for; see, e.g., Ref. 38 and references therein.

All three of the aforementioned strategies have their assets and their drawbacks: Strategy (i) is inexpensive and easy to implement, but the regions taken from the atlas may not be optimally aligned with the specific task at hand, and differences in the individual brain anatomies of the study participants may complicate its application. For a statistical approach to the alignment of fMRI data, see, e.g., Ref. 2 and the references therein. Strategy (ii) is costly (two independent experiments are needed), but is supposed to yield a more accurate definition of the regions (of interest). Strategy (iii) avoids both the (potentially suboptimal) a priori definition of regions and the need for an additional independent experiment. However, the issue of multiple testing (see, e.g., Refs. 12, 14) becomes much more severe if simultaneity over all possible selection events has to be guaranteed, and the selection of the regions is not based on a clear-cut (statistical) criterion, but on the expert judgment of the study data. Therefore, it is hard to compare the results of Strategy (iii) with those of Strategies (i) and (ii).

Strategy (ii) has been followed in two studies (published in Refs. 42, 43) in which programmers comprehended program code (see Table 1 for details). In the first study (see Ref. 42), participants were asked to understand short program code snippets, such as shown in Listing 1. The program code did not contain any useful identifier names, which induces bottom-up comprehension (cf. Ref. 37). In this first study, the bottom-up comprehension task was contrasted with a syntax task, in which participants were presented with similar program code snippets, but only had to focus on syntax errors (e.g., missing semicolon). This control condition was intended to reveal only brain activation that is necessary for programmers to comprehend program code in-depth. As an additional control condition, the experiment included phases of rest in between the comprehension and syntax conditions.

Table 1. Overview of the two related fMRI studies of program comprehension.

	Study 1; see Ref. 42	Study 2; see Ref. 43
Participant sessions	16	14
Trials	12	30
Conditions	Bottom-up program comprehension, control (syntax), rest	Top-down program comprehension, Bottom-up program comprehension, control (syntax), rest
Scans	900	900

Listing 1. Example code snippet in Java from Siegmund et al. (cf. Ref. 42) that computes the length of the last word in a string.

The snippet uses non-meaningful identifiers to induce bottom-up comprehension. Participants needed to figure out the output of this snippet, which is “5”.

The second study (see Ref. 43) was a follow-up study that also differentiated the program-comprehension task into more nuanced conditions. One aim was to differentiate between bottom-up comprehension and top-down comprehension (cf. Ref. 9) which was induced by varying the meaningfulness of identifier names and by prior training to provide participants with the necessary knowledge. As in the previous study, the syntax task served as a control condition. Another research question addressed the goal of confirming the activated brain areas from the first study. To this end, the second study built on the regions identified in the first study, thus following Strategy (ii). Both studies were approved by the ethics board of the University of Magdeburg (Application: 87/14).

In the present work, we propose a new strategy and apply it to the data from Ref. 43 without using the knowledge about cortical regions relevant for understanding software code that were identified in Ref. 42. We first utilize structural information from the individual’s brain scans. Here, we use an automatic parcellation of the brain into anatomical labels, which assigns the voxels of an individual to pre-defined anatomical regions. This step of the data analysis provides us with a significance evaluation (in terms of a p-value) for the presence of signals in anatomically defined regions for every study participant separately. In this way, the method incorporates the rationale that functional activation is spatially connected (cf. Ref. 28), and it is thus able to gain sensitivity without using the “localizer” data of the first study. Then, in a second step, we combine, for each of these regions, the regional p-values of all study participants using an appropriate combination function, and we evaluate the significance of the whole region by means of the resulting combined p-value. As we will demonstrate by means of the concrete example from the field of programming language comprehension, this new strategy can be similarly powerful as Strategy (ii) described above, while avoiding the sequence of two separate fMRI studies in which the data of the first are used to define suitable regions for the second. In fact, our methodology does not rely at all on the data from the first study (Ref. 42), which had been performed and utilized at that time by the authors of Ref. 43, while achieving similar results. Thus, it can enable researchers to save precious measurement time.

In general, statistical analyses that are more advanced than voxel-wise regression analyses are increasing in popularity; see, e.g., Ref. 34 for a recent approach. Another alternative approach to analyzing fMRI data is multi-voxel pattern analysis (MVPA), introduced by Haxby and colleagues; see Ref. 22 and the papers cited therein. MVPA predicts stimulus event categories from the relative changes in activation across a set of voxels. Such a set of voxels is extracted in the first step by a feature selection analysis. The second step partitions the data into a training set and a testing set, which are entered into a pattern classification algorithm. From a statistical point of view, this requires a sufficiently large number of comparable events. In the experiments described in Ref. 42, the events are code snippets that must be read and understood within 60 seconds to allow for a certain complexity of the software code, and they were thus limited in number to fit into the duration of a typical fMRI session of about 45 minutes. Such a limited number of events may pose difficulties for MVPA cross-validation. Such a low number of rather long events is, however, sufficient for general linear model (GLM) analyses of block design experiments, which typically yield strong detection power (albeit low discrimination power).

Moreover, as functional activation is spatially distributed over several voxels rather than focused in single ones, cluster-based inference has become rather standard (cf. Ref. 28). It is implemented and used in all major software packages. However, a recent discussion in Ref. 15 revealed its flaws in practical situations. Furthermore, cluster-based inference requires simulation according to the smoothness of the data at hand.

In fact, the majority of fMRI studies, like our studies from Refs. 42, 43, still use GLM-based analyses, which require corrections for multiple comparisons. The experiments were constructed such that perceptual and cognitive processes that are necessary but not specific for understanding software code were controlled by specific test conditions, either requiring participants to read the same code but with a different task or controlling for attentional demands. Thus, it was a classical hypothesis-driven design, which is widely used in the literature, especially to unravel complex cognitive processing. Besides that, understanding software code is a highly idiosyncratic process, and therefore identification of brain activity using control conditions within individual subjects is possibly more feasible as a first step towards identifying the most relevant brain areas involved in such a complex cognitive process. Thus, our method can serve as a valuable alternative that uses spatial aggregation while still being fast and easy to implement.

The rest of the paper is structured as follows. In Section 2, we describe our proposed statistical methodology. Section 3 describes conducted computer simulations comparing three different methods of fMRI data analysis. Section 4 is devoted to the detailed description of our re-analysis of the fMRI data from Ref. 43, and the results of this re-analysis are presented in Section 4.3. We conclude with a discussion in Section 5.

2. Methods

In this section, we describe our statistical model for fMRI data as well as the proposed data analysis workflow for detecting brain regions which are significantly associated with a certain cognitive task.

2.1 Linear models for voxel-wise multiple tests

Let $Y_{ixt}$ denote the observed data from a functional MRI experiment at voxel $x$ and time $t$ for the $i$-th subject. Here, we adopt the common view (cf. Section 5.4 in Ref. 28) of a mass-univariate linear model

$$Y_{ixt} = X_{i} β_{ix} + ε_{ixt} \qquad (1)$$

for the data, with a design matrix $X_{i}$ containing variables with the expected blood oxygenation level dependent (BOLD) response related to the experimental stimuli or nuisance parameters like drifts of the MR signal. The random variable $ε_{ixt}$ is the error term, which is assumed to be normally distributed with zero expectation and a spatio-temporal correlation structure. The model in (1) is also referred to as a “within-subject model” in the fMRI literature; see, e.g., Section 12.4.1 in Ref. 35. Estimates ${\hat{β}}_{ix}$ of the statistical parametric map (SPM) or their contrasts $c^{T} {\hat{β}}_{ix}$ and estimates of their covariance matrices ${\hat{Σ}}_{ix}$ (or the variances ${\hat{σ}}_{ix}^{2} = c^{T} {\hat{Σ}}_{ix} c$) can then be obtained from a pre-whitened version of the linear model above; cf. Ref. 28.
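For orientation, the voxel-wise test statistic that these estimates give rise to can be written as follows. This is the standard GLM contrast statistic, not a formula stated explicitly in the text; the symbols $t_{ix}$ and $ν_{i}$ (residual degrees of freedom of subject $i$) are notation assumed here for illustration.

```latex
t_{ix} = \frac{c^{T} \hat{β}_{ix}}{\hat{σ}_{ix}},
\qquad
\hat{σ}_{ix} = \sqrt{c^{T} \hat{Σ}_{ix} \, c}
```

Under the local null hypothesis $c^{T} β_{ix} = 0$, $t_{ix}$ is approximately $t$-distributed with $ν_{i}$ degrees of freedom, which is what allows a p-value to be attached to each voxel.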

The SPM then forms a random $t$-field (cf. Ref. 53) with an inherent multiple comparison problem due to the large number of local hypotheses. One common strategy is to define local p-values at each voxel $x$ and for each subject $i$ based on the local values of the random $t$-field and to control the family-wise error rate (FWER) using accordingly adjusted thresholds; cf. Ref. 54. However, this is known to be a very conservative approach with respect to the detectability of significant brain signals in the outlined framework. In contrast, approaches that control the false discovery rate (FDR) handle the multiple comparison problem less conservatively, e.g., by the procedure proposed in Ref. 6.

2.2 Parcellation of the human brain

Neuroanatomic research has found that the human brain can be parcellated into different sub-regions based on structural similarities. One of the earliest atlases is the Brodmann atlas (see Ref. 7), which is based on the cytoarchitectural organization of the brain. The Brodmann areas have been schematically transferred to a template brain, the so-called Talairach atlas (see Ref. 46), which is commonly used in fMRI studies to report the location of significant grand average activation, as in Refs. 42, 43. For the analysis of these data in the current paper, we chose the Harvard-Oxford brain atlas (cf. Refs. 10, 31), which provides a parcellation based on gross anatomical landmarks and delivers an Aparc label $j$ for each voxel of each individual brain space. However, any other brain parcellation to define regional labels could be used with our methodology.

2.3 Statistical inference

As outlined in the introduction, we re-used fMRI data from a program code comprehension task first analyzed in Ref. 43 and performed a new analysis comprising the four steps outlined below. The experiment used two different levels of software program code comprehension stimuli, henceforth denoted as bottom-up and top-down comprehension, to draw inferences about the related cognitive processes. In our strategy, we combined the methods from Ref. 43 (steps 1 and 2) and Ref. 40 (step 3). Furthermore, we implemented our new methodological contribution of combining the evidence for activation of a given brain region across the subjects (step 4). For the first two steps, we conducted, for each subject, a random-effects linear model analysis as described above for deriving voxel-wise p-values.

Step 1: Program comprehension versus rest

In the first step, we contrasted (for each participant separately) the comprehension of program code (cf. Ref. 43) to the rest condition. This identifies brain areas with a positive deflection of the BOLD response. Furthermore, in order to account for the multiple comparison problem, we performed the Benjamini-Hochberg test (see Ref. 6) for FDR control. Only those voxels which have been declared significant by this procedure were considered in step 2. This methodology is justified by the fact that the FDR is an established screening criterion for high-dimensional multiple test problems.
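The screening in this step can be sketched as follows. This is a minimal, standard-library illustration of the Benjamini-Hochberg step-up procedure, not the implementation used for the analyses in this paper; the function name `benjamini_hochberg` and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical voxel-wise p-values of one participant (comprehension vs. rest).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
# Indices of the voxels that would be passed on to step 2.
screened = [k for k, keep in enumerate(benjamini_hochberg(toy_p)) if keep]
```

Because the procedure is a step-up test, a voxel can be retained even if its own p-value exceeds its rank-specific threshold, as long as a larger p-value meets its threshold.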

Step 2: Bottom-up comprehension versus control condition

In this step, we focused (again, for each participant separately) on one type of program comprehension: bottom-up comprehension. Bottom-up comprehension is induced when program code provides no semantic cues and programmers need to comprehend each line separately and then integrate the information in a slow, tedious process. Restricted to the voxels declared significant in step 1, we applied the same multiple test to the contrast of bottom-up comprehension against the control condition (syntax task). As a result of this step, we obtain for each participant $i$ and for each considered voxel $k$ a p-value ${\tilde{p}}_{ik}$.

Step 3: Regional p-values for every participant $i$

This step builds upon the methodology from Ref. 40, and it delivers for each participant $i$ and for each anatomical (regional) label $j$ a (confirmatory) significance evaluation with respect to the contrast specified in step 2. Hence, the evidence from all voxels of participant $i$ in the brain region labeled by $j$ is combined in this step of data analysis.

To this end, let $κ$ be a tuning parameter with values in the interval $[0, 1]$ (i.e., a proportion) and let $m_{j}$ be the number of voxels contained in the brain region labeled by region label $j$. To keep the notation manageable, we implicitly assume here that for each participant $i$ the same number $m_{j}$ of voxels belongs to the brain region labeled by $j$. We consider the null hypothesis $H_{ij}$ of no relevant differential activation of the region labeled by $j$ for participant $i$ during the two tasks mentioned in step 2, together with its two-sided alternative hypothesis $K_{ij}$. We call $H_{ij}$ the “regional null hypothesis” for the brain region labeled by $j$ for participant $i$. We formalize $H_{ij}$ as a so-called partial conjunction hypothesis (see Ref. 40 and the references therein for a formal mathematical description), meaning that we consider the differential activation in region $j$ for participant $i$ relevant if the region contains at least $u_{j} := κ \cdot m_{j}$ significant voxels. For testing $H_{ij}$, we calculate the “regional p-value” $p_{ij}^{REGION}$, given by

$$p_{ij}^{REGION} := \min_{1 \leq \ell \leq m_{j} - u_{j} + 1} \left\{ \frac{m_{j} - u_{j} + 1}{\ell} \, {\tilde{p}}_{i, (u_{j} - 1 + \ell) : m_{j}} \right\},$$

where the voxel-wise p-values ${\tilde{p}}_{i, 1 : m_{j}}, \dots, {\tilde{p}}_{i, m_{j} : m_{j}}$ for participant $i$ in region $j$ are ordered from smallest to largest (see Ref. 5).

In order to achieve family-wise error rate (FWER) control, we have to choose the tuning parameter $κ$ smaller than or equal to $1 / J$ , where $J$ is the number of regional labels. Choosing $κ = 1 / J$ corresponds to the so-called Bonferroni multiplicity correction. The choice of $κ$ is discussed further in Appendix S1 of Ref. 40.

Step 4: Combined regional hypothesis tests by Fisher’s method

In this final step, we combine for each regional label $j$ the regional p-values calculated in step 3 over all participants $i = 1, \dots, n$ . In order to do this, we apply the so-called Fisher method to combine p-values. Namely, the Fisher test statistic $T_{j}$ for region $j$ is given by

$$T_{j} := - 2 \sum_{i = 1}^{n} \log \left( p_{ij}^{REGION} \right).$$

Under independence of the data with respect to the participants, $T_{j}$ is asymptotically $χ_{2 n}^{2}$-distributed (chi-squared with $2 n$ degrees of freedom) under the null. The latter independence assumption is justified, because the participants have been included in the study independently of each other.

Finally, we can reject the (over all participants $i$ combined) regional hypothesis $H_{j}$ (i.e., the respective partial conjunction hypothesis, but now with respect to the population, not with respect to a single participant) if and only if Fisher’s test statistic $T_{j}$ is larger than the $(1 - ακ)$-quantile of the $χ_{2 n}^{2}$-distribution, where the tuning parameter $κ$ has been introduced in step 3. This parameter addresses the multiplicity of the test problem with respect to the $J$ regional labels which are simultaneously under consideration.
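Step 4 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the code used for this paper: the function names are hypothetical, all regional p-values are assumed to lie in $(0, 1]$, and the chi-squared tail probability is computed with the closed form that holds for an even number of degrees of freedom, which avoids any external dependency.

```python
import math


def chi2_sf_even_df(t, n):
    """P(X > t) for a chi-squared variable X with 2n degrees of freedom.

    Uses the closed form exp(-t/2) * sum_{k=0}^{n-1} (t/2)^k / k!.
    """
    if t <= 0:
        return 1.0
    half = t / 2.0
    term = 1.0   # (t/2)^0 / 0!
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combined_p(regional_p_values):
    """Fisher statistic T_j and combined p-value over n participants."""
    n = len(regional_p_values)
    if n == 0 or not all(0.0 < p <= 1.0 for p in regional_p_values):
        raise ValueError("need at least one p-value, all in (0, 1]")
    t_stat = -2.0 * sum(math.log(p) for p in regional_p_values)
    return t_stat, chi2_sf_even_df(t_stat, n)


# Hypothetical regional p-values of two participants for one region.
t_j, p_j = fisher_combined_p([0.05, 0.05])
# Rejecting when T_j exceeds the (1 - alpha * kappa)-quantile is the same
# as rejecting when the combined p-value is below alpha * kappa.
reject = p_j < 0.05 * 0.01
```

Comparing the combined p-value with $ακ$ is equivalent to the quantile-based rule stated above and avoids having to invert the chi-squared distribution function.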

3. Computer simulations

In order to compare the performance of three different methods for fMRI data analysis in a controlled framework, we have carried out computer simulations.

3.1 Simulation setting

We have simulated a dataset $\{Y_{ixt}\}$ with eleven subjects (referring to the index $i$ and corresponding to the group size in the real dataset below), a spatial grid of size $20 \times 20 \times 20 = 8,000$ voxels (referring to the index $x$), and 195 time points (scans, referring to the index $t$). We have assumed that the 8,000 voxels are grouped into eight anatomical regions of size $10 \times 10 \times 10 = 1,000$ each, and that these regions are correctly annotated for all eleven subjects. Two alternating stimuli in an ON-OFF block task design with a total of six ON blocks of a duration of 15 scans each have been used for the temporal signal: In a circular area in one of the predefined anatomical regions, the signal of one stimulus was twice as high as the signal of the other stimulus, mimicking a signal contrast. In a corresponding area in a second anatomical region, the signal was created with no difference between the stimuli. The datasets with first-order autocorrelated Rician noise have been created using the R package neuRosim (Ref. 51); p-values ${\tilde{p}}_{ik}$ for Contrast 1 (ON of either stimulus versus rest) and Contrast 2 (one stimulus versus the other) were determined using the R package fmri (Ref. 45). This simulation setup has been run 500 times.
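To make the temporal design concrete, the sketch below builds boxcar indicators for the two stimuli. The exact block layout is not specified in the text, so the layout used here is an assumption: thirteen blocks of 15 scans that start and end with rest, with the six ON blocks alternating between the two stimuli. No haemodynamic response convolution is applied, and all names are hypothetical.

```python
N_SCANS = 195    # number of time points, as in the simulation setting
BLOCK_LEN = 15   # duration of one block in scans


def boxcar_design(n_scans=N_SCANS, block_len=BLOCK_LEN):
    """Return 0/1 indicator lists for stimulus A and stimulus B.

    Assumed layout: even-numbered blocks are rest, odd-numbered blocks
    are ON, and the ON blocks alternate between stimulus A and B.
    """
    stim_a = [0] * n_scans
    stim_b = [0] * n_scans
    on_count = 0
    for block, start in enumerate(range(0, n_scans, block_len)):
        if block % 2 == 1:
            target = stim_a if on_count % 2 == 0 else stim_b
            for t in range(start, min(start + block_len, n_scans)):
                target[t] = 1
            on_count += 1
    return stim_a, stim_b


stim_a, stim_b = boxcar_design()
# Contrast vectors over the two stimulus regressors.
contrast_1 = [1, 1]    # ON of either stimulus versus rest
contrast_2 = [1, -1]   # one stimulus versus the other
```

Under this assumed layout, 13 blocks of 15 scans give exactly the 195 time points of the simulation, with three ON blocks per stimulus.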

3.2 Considered data analysis methods

For the statistical analysis of the simulated data, we have considered three different methods.

• Voxel-wise: On the basis of voxel-wise $Z$-scores (aggregated over all eleven participants) and the resulting p-values for Contrast 1, voxels have been screened by applying the Benjamini-Hochberg method at level $α = 0.05$. For the screened voxels only, $Z$-scores (again aggregated over all eleven participants) and the resulting p-values for Contrast 2 have been computed. A region $j$ has been declared significantly activated if the number of Contrast 2 p-values below $0.05 / s_{j}$ in that region $j$ has been larger than $1,000 \cdot κ$, with $s_{j}$ denoting the number of screened voxels in region $j$ and $κ = 0.01$, meaning that we considered a region $j$ to be relevantly activated if at least 10 out of the 1,000 voxels in that region $j$ are statistically significantly activated under the stimulus. The rejection threshold $0.05 / s_{j}$ accounts for the (reduced) multiplicity resulting from the pre-selection by means of Contrast 1.
• Cluster-wise: Activation is typically spatially spread over multiple voxels. In fact, single isolated voxels are mostly considered as spurious rather than activated; see Ref. 28. Thus, we also analyzed the simulated datasets and the two contrasts with cluster-based thresholds: At the given significance level $α = 0.05$, the value of the test statistic at a voxel must exceed some threshold, and the voxel has to belong to a cluster of at least $s$ connected voxels. Thresholds can be pre-determined by simulations. To determine p-values, we used the R package fmri (Ref. 45), where the method is implemented.
• Region-wise (this paper): Our proposed method from Section 2, again with FDR level 0.05, $κ = 0.01$ , and $α = 0.05$ in Step 4.
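The decision rule of the voxel-wise comparison method can be sketched as follows. This is an illustration only; the function name and the toy p-values are hypothetical, and the rule is implemented in its "at least $1,000 \cdot κ$ significant voxels" reading, which is an interpretive assumption.

```python
def region_declared_active(contrast2_p_values, region_size=1000,
                           kappa=0.01, alpha=0.05):
    """Voxel-wise decision rule for one region.

    contrast2_p_values: Contrast 2 p-values of the s_j voxels of the
    region that survived the Contrast 1 screening.
    """
    s_j = len(contrast2_p_values)
    if s_j == 0:
        return False  # no voxel survived the screening
    # Bonferroni-type threshold over the screened voxels only.
    threshold = alpha / s_j
    n_significant = sum(1 for p in contrast2_p_values if p < threshold)
    # Assumption: "at least kappa * region_size" significant voxels.
    return n_significant >= region_size * kappa


# Hypothetical region: 50 screened voxels, 10 of them clearly significant.
active = region_declared_active([1e-6] * 10 + [0.5] * 40)
```

Dividing by $s_{j}$ rather than by the full region size is what makes the screening step pay off: the fewer voxels survive Contrast 1, the milder the correction for Contrast 2.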

3.3 Results

We have assessed the type I and the type II error behavior of the three methods described in the previous subsection by calculating (for each of the three methods) the proportion of simulation runs in which any region without true activation has been declared significantly activated (type I error component) as well as the proportion of simulation runs in which the one region with true activation has not been declared significantly activated (type II error component). Table 2 summarizes the results of our computer simulations.

Table 2. The results of our computer simulations.

The estimated type I error refers to the proportion of simulation runs in which any region without true activation has been declared significantly activated. The estimated type II error refers to the proportion of simulation runs in which the one region with true activation has not been declared significantly activated.

	Voxel-wise	Cluster-wise	Region-wise (this paper)
Estimated type I error	0%	0%	0.8%
Estimated type II error	24.6%	0%	0%
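The two error components reported in Table 2 are simple proportions over simulation runs. A minimal sketch of how they could be computed is given below; the data structure (one set of declared regions per run) and all names are hypothetical, and the toy input is not taken from the simulations of this paper.

```python
def estimate_error_rates(runs, true_region):
    """Estimate type I and type II error components over simulation runs.

    runs: list of sets, each holding the indices of the regions declared
    significantly activated in one simulation run.
    true_region: index of the one region with true activation.
    """
    n_runs = len(runs)
    # Type I: any region other than the truly activated one is declared.
    type_1 = sum(1 for declared in runs if declared - {true_region}) / n_runs
    # Type II: the truly activated region is missed.
    type_2 = sum(1 for declared in runs if true_region not in declared) / n_runs
    return type_1, type_2


# Four hypothetical runs with region 0 as the truly activated region.
toy_runs = [{0}, {0, 3}, set(), {0}]
type_1, type_2 = estimate_error_rates(toy_runs, true_region=0)
```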

From Table 2 it becomes apparent that all three methods are capable of protecting reliably against type I errors under our data-generating model. However, the power of the proposed method appears to be considerably larger than that of the voxel-wise method. Namely, during our 500 simulation runs, our new method as well as the cluster-wise method always detected the truly activated region, while the voxel-wise method detected it only in 367 out of the 500 runs.

4. Real data analysis

We re-used the data from Ref. 43 and compared our results with those obtained by additionally utilizing Ref. 42 as a pre-study in the sense of Strategy (ii) outlined in the introduction.

4.1 Previous findings

In the two previous studies mentioned before, Siegmund et al. used similar analysis processes. They used BrainVoyager™ QX 2.8.4[1]. The anatomical scans were transformed into the Talairach brain to account for differences in brain size; cf. Ref. 46. They preprocessed the functional data of both studies with a standard pipeline of: 3-D motion correction, slice-scan-time correction, and temporal filtering. In addition, they applied a spatial smoothing with a Gaussian filter (FWHM = 4 mm).

In the first study (Ref. 42), the random-effects GLM revealed five brain areas (BAs 6, 21, 40, 44, 47) with significant activation for the contrast Bottom-Up Comprehension versus Control condition, i.e., the syntax task. In the second study, the same contrast no longer revealed any significant areas, likely due to the reduced statistical power of five instead of twelve bottom-up comprehension tasks per session. Thus, the authors ran a regions-of-interest analysis on the data of the second study, restricted to the activation clusters identified in the first study. This resulted in a significantly stronger activation for Bottom-Up Comprehension versus Syntax in BAs 21, 40, and 44.

4.2 Data export and preparation for re-analysis

In Figure 1, we illustrate the overall process for the re-analysis of the data from Ref. 43. For our re-analysis, we exported the already pre-processed data from the second study. We did not use the data of the first study, as our method does not need prior definitions of regions of interest. We used BrainVoyager to access the data from Siegmund et al. We exported the statistical values (i.e., t-scores, p-values) of the obtained brain activation on a voxel basis for each participant. The voxel resolution is the same as the one BrainVoyager uses, i.e., a 1 mm interpolated resolution.

In addition, we used FreeSurfer to segment and parcellate the brain of each participant based on their anatomical scan; cf. Refs. 17, 18. We used the Destrieux cortical atlas to assign Aparc labels on an individual participant basis (see Ref. 11) as a suitable region definition. Next, we used Nipype (see Refs. 16, 21) to convert FreeSurfer labels to a BrainVoyager-readable format.

Our last step annotated the exported functional data with the individual anatomical labels for each participant. We removed all functional voxels for coordinates that had no assigned Aparc label, which typically are voxels that are not considered gray matter.

4.3 Results

Figure 2 displays the six brain regions which have been declared significantly associated with the task at hand (at FWER level $α = 5\%$) by our described methodology. There is an overlap with the results obtained in Ref. 43, which utilized prior knowledge, but there are also some differences, which we discuss in detail in the following. We visualize the confirmed network of brain activation from Siegmund et al. in Figure 3 and our identified network of significantly activated Aparc labels in Figure 4. Figure 3 and Figure 4 show overlapping results with regard to Brodmann area 21, which covers the middle and inferior temporal gyrus (separated by the inferior temporal sulcus). However, we observed differences regarding smaller brain regions. We compare Siegmund’s replication efforts to our results in Section 5.1.

Figure 2. The six significant brain regions (at FWER level α = 5%).

In each row, each point corresponds to the Aparc p-value $p_{ij}^{APARC}$ of one study participant $i$, where the index $j$ refers to the area indicated by the code at the beginning of the row.

Figure 3. Network of left-lateralized confirmed brain areas. In Ref. 43, BAs 21, 40, 44 were found activated during program comprehension.

Figure 4. Results of our analysis with significantly activated Aparc labels.

Activations are located primarily in the middle and inferior temporal lobe.

Our method found three Aparc labels in the inferior and middle temporal gyrus in the left hemisphere. Siegmund et al. found their largest and most robust activation cluster in BA21 of the left hemisphere, which covers several gyri in the temporal lobe. These left temporal gyri are often associated with semantic processing of natural language, which is typically left-lateralized for right-handed participants. In the context of programming, the activation is believed to be responsible for extracting the meaning of individual identifiers and symbols during program comprehension. We found two further Aparc labels bilaterally in the inferior temporal gyrus and the anterior collateral sulci, which are both in the temporal lobe as well.

For each of these six regions indexed by $j$, we display all subject-specific regional p-values $\{p_{ij}^{REGION} : 1 \leq i \leq n\}$, where $n$ is the number of study participants. For each $j$ considered in Figure 2, it can clearly be observed that not a single extreme outlier (one very small p-value corresponding to one individual subject) is responsible for the statistical significance with respect to the combination test statistic $T_{j}$, but that the combined information contributed by all $n$ subjects supports our statistical conclusions. Regional p-values $p_{ij}^{REGION} \equiv 1$ can occur if none of the voxels belonging to region $j$ has been selected in Steps 1 and 2 described before for a certain subject $i$. By construction of $T_{j}$, this essentially means that the “effective sample size” for such a region is reduced, while the number of degrees of freedom for the null distribution of $T_{j}$ remains unchanged.
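
The effect of regional p-values equal to one can be illustrated with a short sketch. It assumes that $T_{j}$ is the standard Fisher statistic $-2 \sum_{i} \log p_{ij}$ with a chi-square null distribution on $2n$ degrees of freedom; the function names and the example p-values are hypothetical and are not taken from the authors' code.

```python
import math


def fisher_statistic(p_values):
    """Fisher combination statistic: -2 times the sum of log p-values."""
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_sf_even(x, dof):
    """Chi-square survival function for an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combined_pvalue(p_values):
    """Combined p-value; every subject contributes two degrees of freedom."""
    return chi2_sf_even(fisher_statistic(p_values), 2 * len(p_values))


# Hypothetical region with n = 11 subjects (p-values must be positive).
p_all = [0.01] * 11                 # every subject contributes evidence
p_partial = [0.01] * 8 + [1.0] * 3  # three subjects have p identically 1

combined_all = fisher_combined_pvalue(p_all)
combined_partial = fisher_combined_pvalue(p_partial)
```

A p-value of one adds $\log 1 = 0$ to the statistic but still counts towards the degrees of freedom, so `combined_partial` is larger than `combined_all`. This is the reduced "effective sample size" mentioned above.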

5. Discussion

5.1 Statistical sensitivity

This paper introduces a new strategy for analyzing fMRI data and demonstrates its performance by comparison with two studies by Siegmund et al. In their second study, they investigated a new research question based on a design with reduced statistical power regarding the network of brain areas activated during bottom-up program comprehension. Only by using prior knowledge of the location of the clusters identified in the first study and conducting a regions-of-interest analysis with increased statistical power were they able to exceed standard thresholds of statistical significance, and then only for the subset of areas identified in the first study.

Research that similarly aims to detect small differences between cognitive processes with state-of-the-art methods would need to increase statistical power, e.g., by conducting two studies: first, to identify the network of brain areas involved in the overarching cognitive process, and second, to identify differential effects within the activated brain areas. The method of utilizing anatomically predefined regions of interest as determined in each individual brain is a candidate for a more sensitive analysis as compared to statistical testing of single voxels with identical coordinates of a template brain. This is possibly due to allowing more flexibility with respect to the exact anatomical location of functional units across different brains. Our example of using ROIs from the Harvard-Oxford atlas can easily be replaced in future applications by brain parcellation schemes that are based on more refined anatomical and/or functional databases.

We demonstrated by simulations that our method can accurately detect signals based on region aggregation and outperforms the standard mass-univariate approach, while achieving a performance similar to that of cluster-based inference without the need to define thresholds based on data smoothness via simulations. Further, we demonstrated on a real fMRI dataset that our method can find significantly activated brain areas without relying on prior knowledge. Moreover, additional brain areas were identified which have been described in two fMRI studies of programmers and interpreted as being involved in visuo-spatial processing. One study investigated manipulating data structures, which shares similarities with spatial rotation (see Ref. 25), and the other investigated writing program code (see Ref. 27). Since the study by Siegmund et al. (2014) was the first study on program comprehension, they used a rather small FDR-corrected significance level ($p < 0.01$ as compared to the more common $p < 0.05$). Possibly this is one reason why the two areas identified by the current approach were not identified there and, as a consequence, could not emerge in Siegmund et al. (2017). Thus, our current approach seems valuable in cases where new research questions are explored, little to no prior work is available, and careful statistical hypothesis testing is required to initially minimize false positives.

However, the inferior frontal gyrus (BA 44) and the inferior parietal lobule (BA 40), shown to be significant in Siegmund et al., were not significant with our method. We therefore investigated this difference using the activation cluster in BA 40 as an example. In Figure 5, we display a histogram of p-values for the (statistically significant) Aparc label in the middle temporal gyrus, in which a majority of the participants contribute small p-values. In contrast, in Figure 6, we display a histogram of p-values for the inferior parietal lobule. Across the entire group, we also observe an accumulation of very small p-values, but only three of eleven participants contribute to this. Unlike the regions-of-interest analysis done by Siegmund et al., our method, which combines Aparc p-values by Fisher’s method, does not reject hypotheses with such a p-value distribution.

Figure 5. Histogram of p-values of the `ctx_lh_G_temporal_middle` label, which is evaluated as significantly activated.

The plot without a heading is across all participants, while the eleven plots with headings show the distribution across individual participants. The majority of participants contribute to the significance.

Figure 6. Histogram of p-values of the `ctx_lh_G_pariet_inf-Angular` label, which is not evaluated as significantly activated.

The plot without a heading is across all participants, while the eleven plots with headings show the distribution across individual participants. While there are many small p-values, only a few participants contribute these small values.

By transforming the Aparc labels into the Talairach space, we observed that the voxels included in the activation cluster of BA 40 do not perfectly align with the Aparc label that should cover this brain area, i.e., the inferior angular gyrus of the parietal lobe. Table 4 shows that the overlap with the activation cluster in BA 40 is less than 50% and that only around 5% of its voxels are assigned to an Aparc label at all. In contrast, Table 3 shows that at least 75% of the voxels of the BA 21 cluster are aligned to Aparc labels, of which two are significant with our method. Thus, a small activation cluster within a large anatomical region may get lost with our approach. We further discuss this potential drawback of the anatomical segmentation used and possible remedies in Section 5.3.

Table 3. The region of interest in BA21 identified in Ref. 42 consists of 2844 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. Aparc labels in bold are evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

Aparc label	# Voxels overlapping with BA21	% of BA21 cluster	# Voxels of entire Aparc label	% of Aparc label in BA21
`Left-Cerebral-White-Matter`	634	22%	210620	0.3%
`ctx_lh_G_temporal_middle`	617	22%	7515	8.2%
`ctx_lh_S_temporal_sup`	512	18%	9379	5.5%
`ctx_lh_S_temporal_inf`	85	3%	2499	3.4%
…	…	…	…	…
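
The two percentage columns of Table 3 follow from the voxel counts by simple division, as the following sketch shows. The counts are copied from Table 3; the variable and function names are hypothetical, and the labels are written with underscores as in the caption of Figure 5.

```python
CLUSTER_SIZE = 2844  # voxels in the BA21 region of interest (Table 3)

# (label, overlapping voxels, voxels of the entire Aparc label)
ROWS = [
    ("Left-Cerebral-White-Matter", 634, 210620),
    ("ctx_lh_G_temporal_middle", 617, 7515),
    ("ctx_lh_S_temporal_sup", 512, 9379),
    ("ctx_lh_S_temporal_inf", 85, 2499),
]


def overlap_percentages(overlap, label_size, cluster_size):
    """Return (% of the cluster covered by the label, % of the label in the cluster)."""
    return 100.0 * overlap / cluster_size, 100.0 * overlap / label_size


table = {
    label: overlap_percentages(overlap, label_size, CLUSTER_SIZE)
    for label, overlap, label_size in ROWS
}

for label, (pct_cluster, pct_label) in table.items():
    print(f"{label}: {pct_cluster:.0f}% of cluster, {pct_label:.1f}% of label")
```

The second percentage makes the size mismatch explicit: even the best-matching cortical label has less than a tenth of its voxels inside the activation cluster.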

Table 4. The region of interest in BA40 identified in Ref. 42 consists of 1777 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. No Aparc label is evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

Aparc label	# Voxels overlapping with BA40	% of BA40 cluster	# Voxels of entire Aparc label	% of Aparc label in BA40
`ctx_lh_G_pariet_inf-Supramar`	304	17.1%	6318	4.3%
`ctx_lh_G_pariet_inf-Angular`	290	16.3%	4975	5.1%
`Left-Cerebral-White-Matter`	157	8.9%	178875	0.1%
…	…	…	…	…

From the methodological point of view, our main contribution consists in a novel way of combining evidence: instead of aggregating single-voxel data over all participants by mapping them to a standard brain template, we define subject-specific regions and combine the evidence on the level of these regions by means of a combination function. This is a generic methodological approach which is not restricted to the specific study setup of Siegmund et al., but can be applied to essentially any fMRI study design, involving an arbitrary number of contrasts.

Furthermore, the (final) combination step of our proposed approach is also generic in the sense that, instead of the Fisher combination function, any other (appropriate) combination function for p-values may be used. Recently, there has been a renewed interest in p-value combination methods; see, e.g., Ref. 52 (with discussion), Ref. 48, and Ref. 49.
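
To illustrate how the combination step could be exchanged, the sketch below implements two simple alternatives to the Fisher combination: a Bonferroni-type merge (number of p-values times the smallest one) and twice the arithmetic mean, an averaging rule of the kind studied in Ref. 48. Both remain valid under arbitrary dependence among the p-values, but they can be considerably more conservative. The function names and the example values are hypothetical, and this is not part of the authors' implementation.

```python
def bonferroni_merge(p_values):
    """Bonferroni-type combination: K times the smallest p-value, capped at 1."""
    return min(1.0, len(p_values) * min(p_values))


def twice_mean_merge(p_values):
    """Twice the arithmetic mean of the p-values, capped at 1."""
    return min(1.0, 2.0 * sum(p_values) / len(p_values))


# Hypothetical subject-specific regional p-values for one region.
example = [0.01, 0.5, 0.9]

p_bonferroni = bonferroni_merge(example)  # driven by the single smallest value
p_twice_mean = twice_mean_merge(example)  # driven by the overall level
```

The two rules react differently to the pattern discussed in Section 5.1: the Bonferroni-type merge can be small because of a single participant, whereas the averaging rule requires most participants to show small p-values.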

5.2 Comparison to related fMRI analysis methods

Under the multiple testing framework, testing of grouped null hypotheses with (potential) application in fMRI is an active research topic. In Ref. 23 as well as in Ref. 4, clustering techniques were employed to define regions of interest, and the authors incorporated the heterogeneous cluster sizes in a weighting scheme for the linear step-up test from Ref. 6. In the same vein, the authors of Ref. 24 as well as Ref. 56 made use of the different proportions of true null hypotheses in each of the groups in their proposed weighting. A Bayesian variant of this idea has been derived in Ref. 30. Hierarchical methods, which exclude groups without strong evidence for the presence of signals in several stages of data analysis, have been worked out by several researchers; see, e.g., Refs. 3, 40 and 55. However, these methods rely on combining the subject-specific data on the voxel level, which is a standard technique as mentioned, for instance, in Section 5 of Ref. 29. Also on the basis of (combined) voxel data, a hierarchical independent component analysis for the comparison of brain functional networks has been proposed in Ref. 41. To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea.

It is worth mentioning that both our proposed methodology and the methods from previous literature that we are considering in our comparisons make certain assumptions about the data: First, a linear relationship between the features encoded in the design matrix and the response is assumed. Thus, non-linear effects cannot (fully) be captured by models of the type (1). Second, Gaussianity of error terms is assumed, implying symmetry and light tails, among other things. It can be of interest to validate or to test these assumptions for a concrete dataset at hand. However, since this aspect is not the main focus of our present work, we defer such investigations to future research. In particular, modeling the data that we have re-analyzed in this work with different model assumptions would diminish the comparability of our data analysis results with the results from previous analyses. One simple (numerical) robustness analysis regarding the Gaussianity assumption can be performed by simulating data with a different error distribution and comparing the results with the results presented here. To this end, the source code which is available as supplementary material for this article may be helpful for practical implementation.
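
A numerical robustness check of this kind could look as follows. The sketch is a strongly simplified stand-in, not the authors' simulation code: it uses a single on/off regressor instead of the full design matrix of model (1), a normal approximation for the critical value, and Student-t errors with three degrees of freedom as one hypothetical heavy-tailed alternative. All names and parameter values are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def gaussian_error(rng):
    """Standard normal error term."""
    return rng.gauss(0.0, 1.0)


def student_t_error(rng, dof=3):
    """Heavy-tailed error term: Student-t with `dof` degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` d.o.f.
    return z / math.sqrt(chi2 / dof)


def null_rejection_rate(draw_error, n_obs=200, n_sim=1000, alpha=0.05, seed=1):
    """Share of simulated null datasets in which the slope test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Hypothetical block design: 10 scans on, 10 scans off, repeated.
    x = [1.0 if (t // 10) % 2 == 0 else 0.0 for t in range(n_obs)]
    x_mean = sum(x) / n_obs
    xc = [v - x_mean for v in x]
    sxx = sum(v * v for v in xc)
    rejections = 0
    for _ in range(n_sim):
        y = [draw_error(rng) for _ in range(n_obs)]  # true effect is zero
        y_mean = sum(y) / n_obs
        beta = sum(a * (b - y_mean) for a, b in zip(xc, y)) / sxx
        residuals = [(b - y_mean) - beta * a for a, b in zip(xc, y)]
        sigma2 = sum(r * r for r in residuals) / (n_obs - 2)
        t_stat = beta / math.sqrt(sigma2 / sxx)
        if abs(t_stat) > z_crit:
            rejections += 1
    return rejections / n_sim


rate_gaussian = null_rejection_rate(gaussian_error)
rate_heavy_tailed = null_rejection_rate(student_t_error)
```

Comparing the two empirical rejection rates with the nominal level `alpha` gives a first impression of how sensitive the voxel-wise tests are to the error distribution; a full check would repeat the complete multi-step procedure on the simulated data.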

5.3 Outlook: from anatomical to functional aggregation

Instead of utilizing a single-voxel GLM group analysis in common brain templates, we parcellated the brain of each individual participant into regional labels (here, Aparc labels) before aggregating them in the group analysis. This procedure provides more labels than a traditional Brodmann atlas. Still, some of the regions are very large and thus presumably contain several functional areas. There is ongoing research to subdivide the brain based on cytoarchitectonic details, e.g., the Jülich-Brain (see Ref. 1). This will provide more detailed parcellation schemes in the future, which could easily be integrated into our proposed method. Another refinement could be to use functionally defined brain regions for our presented methodology. A study that implements functional localizers could identify participant-specific functional maps of the brain for well-defined standardized tasks (e.g., see Ref. 33). Then, in the analysis, our presented methodology can aggregate across all participant-specific brains with less imprecision than traditional methods. We would like to note that such functional localizers are currently restricted to research areas that include brain regions with specific functional specialization; cf. Refs. 20 and 39. The studies we presented in this paper are concerned with a rather complex cognitive task. Moreover, understanding program code is highly individual since different programmers rely on different comprehension strategies based on their preferences, domain knowledge, and experience (e.g., Refs. 8, 9 and 44).

Ethical considerations

For this paper, we did not acquire any new data from human participants. The original studies of Siegmund et al. were approved by the responsible ethics board of the University of Magdeburg (Application: 87/14).

Data availability

Weierstrass Institute for Applied Analysis and Stochastics publication server: Utilizing anatomical information for signal detection in functional magnetic resonance imaging - Data. https://doi.org/10.20347/wias.data.9 (Ref. 36).

This project contains the underlying data:

• pvalueFromRealData.csv (p-value data mapped onto Aparc labels)

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

Source code available from: https://github.com/ktabelow/HierarchicalFMRI/

Archived software available from: https://zenodo.org/records/17097938

License: GNU General Public License v3.0

Acknowledgements

We thank André Neumann for collaborating with us on a previous version of this article (available as a preprint; see Ref. 32) and Jörg Stadler for his technical support in data processing with Nipype and FreeSurfer.

References

1. Amunts K, Mohlberg H, Bludau S, et al.: Julich-Brain: A 3D probabilistic atlas of the human brain’s cytoarchitecture. Science. 2020; 369: 988–992.
2. Andreella A, Feilong M, Halchenko Y, et al.: A Statistical Approach to the Alignment of fMRI Data. Book of Short Papers, SIS 2020. Pollice A, Salvati N, Spagnolo FS, editors. 2020; pp. 733–738.
3. Benjamini Y, Bogomolov M: Selective inference on multiple families of hypotheses. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014; 76: 297–318.
4. Benjamini Y, Heller R: False discovery rates for spatial signals. J. Am. Stat. Assoc. 2007; 102: 1272–1281.
5. Benjamini Y, Heller R: Screening for partial conjunction hypotheses. Biometrics. 2008; 64: 1215–1222.
6. Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 1995; 57: 289–300.
7. Brodmann K: Vergleichende Lokalisationslehre der Großhirnrinde in ihren Prinzipien dargestellt auf Grund des Zellbaues. Leipzig: Barth; 1909.
8. Brooks R: Using a behavioral theory of program comprehension in software engineering. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 1978; pp. 196–201.
9. Brooks R: Towards a theory of the comprehension of computer programs. Int. J. Man-Mach. Stud. 1983; 18: 543–554.
10. Desikan R, Ségonne F, Fischl B, et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006; 31: 968–980.
11. Destrieux C, Fischl B, Dale A, et al.: Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage. 2010; 53: 1–15.
12. Dickhaus T: Simultaneous statistical inference with applications in the life sciences. Berlin Heidelberg: Springer-Verlag; 2014.
13. Donoho D, Jin J: Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat. 2004; 32: 962–994. MR 2065195.
14. Dudoit S, van der Laan M: Multiple testing procedures with applications to genomics. Springer Series in Statistics. New York, NY: Springer; 2008.
15. Eklund A, Nichols TE, Knutsson H: Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proc. Natl. Acad. Sci. 2016; 113: 7900–7905.
16. Esteban O, Markiewicz CJ, Burns C, et al.: nipy/nipype: 1.5.0. 2020.
17. Fischl B, Salat DH, Busa E, et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron. 2002; 33: 341–355.
18. Fischl B, Van Der Kouwe A, Destrieux C, et al.: Automatically parcellating the human cerebral cortex. Cereb. Cortex. 2004; 14: 11–22.
19. Forman S, Cohen J, Fitzgerald M, et al.: Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. Magn. Reson. Med. 1995; 33: 636–647.
20. Friston K, Rotshtein P, Geng J, et al.: A critique of functional localisers. NeuroImage. 2006; 30: 1077–1087.
21. Gorgolewski K, Burns CD, Madison C, et al.: Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Front. Neuroinform. 2011; 5: Article 13.
22. Haxby JV: Multivariate pattern analysis of fMRI: The early beginnings. NeuroImage. 2012; 62: 852–855.
23. Heller R, Stanley D, Yekutieli D, et al.: Cluster-based analysis of fMRI data. NeuroImage. 2006; 33: 599–608.
24. Hu J, Zhao H, Zhou H: False discovery rate control with groups. J. Am. Stat. Assoc. 2010; 105: 1215–1227.
25. Huang Y, Liu X, Krueger R, et al.: Distilling neural representations of data structure manipulation using fMRI and fNIRS. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 2019; pp. 396–407.
26. Jarmasz M, Somorjai R: Exploring regions of interest with cluster analysis (EROICA) using a spectral peak statistic for selecting and testing the significance of fMRI activation time-series. Artif. Intell. Med. 2002; 25: 45–67.
27. Krueger R, Huang Y, Liu X, et al.: Neurological divide: An fMRI study of prose and code writing. Proceedings of International Conference on Software Engineering (ICSE). 2020; pp. 678–690.
28. Lazar N: The statistical analysis of functional MRI data. Statistics for Biology and Health. Springer; 2008.
29. Lindquist M: The statistical analysis of fMRI data. Stat. Sci. 2008; 23: 439–464. MR 2530545.
30. Liu Y, Sarkar S, Zhao Z: A new approach to multiple testing of grouped hypotheses. J. Statist. Plann. Inference. 2016; 179: 1–14. MR 3550875.
31. Makris N, Goldstein JM, Kennedy D, et al.: Decreased volume of left and total anterior insular lobule in schizophrenia. Schizophr. Res. 2006; 83: 155–171.
32. Neumann A, Peitek N, Brechmann A, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. WIAS Preprint No. 2806. 2021.
33. Nieto-Castañón A, Fedorenko E: Subject-specific functional localizers increase sensitivity and functional resolution of multi-subject analyses. NeuroImage. 2012; 63: 1646–1669.
34. Noble S, Mejia AF, Zalesky A, et al.: Improving power in functional magnetic resonance imaging by moving beyond cluster-level inference. Proc. Natl. Acad. Sci. USA. 2022; 119: e2203020119.
35. Ombao H, Lindquist M, Thompson W, et al.: Handbook of Neuroimaging Data Analysis. New York: CRC Press; 2016.
36. Peitek N, Brechmann A, Tabelow K, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. Data. 2024.
37. Pennington N: Stimulus structures and mental representations in expert comprehension of computer programs. Cogn. Psychol. 1987; 19: 295–341.
38. Rosenblatt J, Finos L, Weeda W, et al.: All-resolutions inference for brain imaging. NeuroImage. 2018; 181: 786–796.
39. Saxe R, Brett M, Kanwisher N: Divide and conquer: A defense of functional localizers. NeuroImage. 2006; 30: 1088–1096.
40. Schildknecht K, Tabelow K, Dickhaus T: More specific signal detection in functional magnetic resonance imaging by false discovery rate control for hierarchically structured systems of hypotheses. PLoS One. 2016; 11: 1–21.
41. Shi R, Guo Y: Investigating differences in brain functional networks using hierarchical covariate-adjusted independent component analysis. Ann. Appl. Stat. 2016; 10: 1930–1957. MR 3592043.
42. Siegmund J, Kästner C, Apel S, et al.: Understanding understanding source code with functional magnetic resonance imaging. Proceedings International Conference on Software Engineering (ICSE). ACM; 2014; pp. 378–389.
43. Siegmund J, Peitek N, Parnin C, et al.: Measuring neural efficiency of program comprehension. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE). New York, NY, USA: Association for Computing Machinery; 2017; pp. 140–150.
44. Soloway E, Ehrlich K: Empirical studies of programming knowledge. IEEE Trans. Softw. Eng. 1984; 10: 595–609.
45. Tabelow K, Polzehl J: Statistical parametric maps for functional MRI experiments in R: The package fmri. J. Stat. Softw. 2011; 44: 1–21.
46. Talairach J, Tournoux P: Co-planar stereotaxic atlas of the human brain. Thieme; 1988.
47. van de Geer S: Estimation and testing under sparsity. Lecture Notes in Mathematics. Cham: Springer; 2016; Vol. 2159. Lecture notes from the 45th Probability Summer School held in Saint-Flour, 2015, École d’Été de Probabilités de Saint-Flour. [Saint-Flour Probability Summer School]. MR 3526202.
48. Vovk V, Wang R: Combining p-values via averaging. Biometrika. 2020; 107: 791–808.
49. Vovk V, Wang R: E-values: Calibration, combination, and applications. Ann. Stat. 2021; 49: 1736–1754.
50. Wainwright M: High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press; 2019.
51. Welvaert M, Durnez J, Moerkerke B, et al.: neuRosim: An R package for generating fMRI data. J. Stat. Softw. 2011; 44: 1–18.
52. Wilson D: The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. 2019; 116: 1195–1200.
53. Worsley K: Local maxima and the expected Euler characteristic of excursion sets of χ², F and t fields. Adv. Appl. Probab. 1994; 26: 13–42.
54. Worsley K, Marrett S, Neelin P, et al.: A unified statistical approach for determining significant signals in images of cerebral activation. Hum. Brain Mapp. 1996; 4: 58–73.
55. Yekutieli D: Hierarchical false discovery rate-controlling methodology. J. Am. Stat. Assoc. 2008; 103: 309–316.
56. Zhao H, Zhang J: Weighted p-value procedures for controlling FDR of grouped hypotheses. J. Stat. Plann. Inference. 2014; 151-152: 90–106.

Footnotes

1 Brain Innovation BV, Maastricht, The Netherlands, http://brainvoyager.com

Competing interests

No competing interests were disclosed.

Grant information

Financial support by the Deutsche Forschungsgemeinschaft (DFG) via grant DI 1723/3-2 is gratefully acknowledged. Brechmann’s work is supported by DFG grant BR 2267/7-2.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Published: 01 Oct 2025, 14:1019

https://doi.org/10.12688/f1000research.166549.1

Copyright

© 2025 Peitek N et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite this article

Peitek N, Brechmann A, Tabelow K and Dickhaus T. Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2025, 14:1019 (https://doi.org/10.12688/f1000research.166549.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 19 Jan 2026

Qiran Jia, Division of Biostatistics and Health Data Science, University of Southern California, Los Angeles, California, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.183550.r441581

This manuscript proposes a four-stage region-based inference workflow for signal detection in fMRI that leverages subject-specific anatomical parcellations to improve sensitivity over voxel-level multiple testing and to reduce reliance on independent “localizer” experiments. The authors apply a two-stage within-subject screening scheme, aggregate voxel-level evidence into anatomical region p-values using a partial conjunction framework, and then combine regional evidence across subjects using Fisher’s method. Simple simulations show that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline, and in the re-analysis of program comprehension fMRI data, it identifies significant regions that partially overlap prior ROI-based findings.

Overall, the manuscript is informative enough about the context, motivation, and methodology, and discusses the workflow and results in detail; however, several concerns remain, and additional work is needed to adequately justify the proposed approach.

(1) In the Introduction, the authors present a three-strategy taxonomy to motivate an alternative approach; however, the proposed strategy is introduced at length and can be easy for readers to lose track of. Thus, it would benefit from defining the strategy and stating the novelty in one crisp sentence (make sure it is understandable for nonstatistical people).

(2) Figure 1 is confusing, beyond the “Monte Carlo” typo. It appears to treat both studies as having the same number of participants, n, whereas Table 1 reports different sample sizes across the two studies. In addition, Table 1 and Listing 1 do not seem essential for the main manuscript; that information would be better used to improve Figure 1. Also, P_ij^APARC and other formulas are not presented in the introduction, so it might be inappropriate to appear here. Overall, my suggestion is that the introduction can be less technical and more approachable.

(3) The Introduction’s discussion of MVPA and cluster-based inference feels somewhat tangential and is framed largely as a dismissal of common alternatives. It would be stronger to more directly and even-handedly summarize the pros and cons of MVPA and cluster-based inference versus standard GLM-based analyses, and then clearly state which components the proposed method adopts from each (and why).

(4) Methodologically, I am comfortable with Steps 1–2 as a within-subject pre-screening procedure, but Step 3 requires further justification. In particular, the manuscript should clarify whether the partial conjunction test from Ref. 40 remains valid after restricting inference to a screened subset of voxels, and whether FWER is still controlled across the full multi-stage pipeline. I am not sure whether their way of choosing kappa and achieving FWER control automatically guarantees global FWER control in your case. I also think the interpretation of kappa is actually essential, so it might be necessary to elaborate on it more and even conduct simulations with the choice of kappa.

(5) A clearer definition of the regional hypothesis H_ij is needed. In particular, the paper should explicitly state what it means to reject region j after the four-step procedure, and the interpretation of kappa might be part of the hypothesis definition.

(6) The simulation setting is somewhat idealized. The paper would be strengthened by additional simulations that better reflect realistic fMRI conditions, such as imperfect parcellation/label noise and varying degrees of spatial smoothness.

(7) In addition, the cluster-based baseline is only evaluated in the simplified simulation and is not compared on the real dataset; the authors should more explicitly articulate the practical benefit of the proposed method, given that cluster-based inference achieves similar power in their simulation results. While the manuscript highlights the advantage of avoiding smoothness-based threshold calibration via simulation, many contemporary fMRI inference methods provide automated, data-driven spatial modeling and thresholding (including recent deep-learning-based approaches), and these should be acknowledged and discussed as relevant baselines at least.

(8) I am comfortable with the real data analysis findings, but an additional application would always be better to show generalizability. Figures 5–6 and Tables 2–4 communicate key insights, but the current presentation feels exploratory rather than indexing-ready. These results would benefit from more polished, information-dense visualizations.
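To make the role of kappa in points (4) and (5) concrete, the following is a minimal sketch of a generic partial conjunction p-value in the style of Benjamini and Heller, built from Fisher's combination of the largest p-values. It is an illustration only: the function name `partial_conjunction_fisher`, the use of a count `u` in place of a fraction kappa, and the independence assumption are choices made here, and the sketch is not claimed to reproduce the test in Ref. 40 or the manuscript's pipeline.

```python
import numpy as np
from scipy import stats


def partial_conjunction_fisher(pvals, u):
    """p-value for the null 'fewer than u of the n voxels are active'.

    Hypothetical helper: combines the n - u + 1 largest p-values with
    Fisher's method, which assumes independent p-values.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    if not 1 <= u <= n:
        raise ValueError("u must be between 1 and the number of p-values")
    # Drop the u - 1 smallest p-values; the rest must still carry evidence.
    tail = np.clip(p[u - 1:], 1e-300, 1.0)
    stat = -2.0 * np.log(tail).sum()
    return float(stats.chi2.sf(stat, df=2 * tail.size))


if __name__ == "__main__":
    toy = [0.01, 0.2, 0.03, 0.5]  # made-up voxel p-values for one region
    for u in range(1, len(toy) + 1):
        print(u, partial_conjunction_fisher(toy, u))
```

With `u = 1` the sketch reduces to the ordinary Fisher test of the global null, and with `u = n` it returns the largest p-value, so the choice of `u` (and hence of kappa) directly changes which regional hypothesis is being rejected.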

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Biostatistics, high-dimensional data analysis, voxel-level multiple comparison, multi-view data analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 26 Dec 2025

Fabricio Cravo, Psychology, Northeastern University College of Science (Ringgold ID: 195088), Boston, Massachusetts, USA

Stephanie Noble, Northeastern University, Boston, Massachusetts, USA

Not Approved

https://doi.org/10.5256/f1000research.183550.r423766

The authors introduce a multi-stage approach for aggregating evidence within a priori regions. They show this approach yields higher power than voxel-level inference and comparable power to cluster-based inference in simulations, and is able to highlight task-relevant regions in empirical data compared with a voxel-driven approach. This is an interesting procedure. However, there are a few concerns with the statistical approach that must be addressed before the contribution of the study can be fully evaluated.

First, it was challenging to evaluate the rationale for the Step 1 & 2 pre-selection of voxels for p-value estimation + correction (only applicable for nested conditions?) and how its addition leads to a valid statistical test. The core inspiration for these steps comes from Ref 43 (Siegmund et al., 2017), but we were unable to find many details regarding the statistical rationale from that manuscript and thus were unable to evaluate multiple questions. For example:
1) The introduction references Strategy (ii), noting that regions must be defined based on information outside the data that we set out to analyze to avoid selection biases. However, in the proposed method, Step 1 uses the same functional data to select voxels. Could the authors clarify how this satisfies the independence requirement? The justification that "FDR is an established screening criterion for high-dimensional multiple test problems" is not sufficient to address this.
2) Related: the test in Ref 40 evaluates whether a sufficient proportion of voxels within an anatomically-defined region shows activation, and it appears it tests the complete set of voxels within the boundary. Could the authors clarify how applying an FDR screening selection within a boundary before applying the hypothesis test developed in Ref 40 still qualifies as an appropriate use of the test?
3) (If the above is appropriate) Why does program comprehension vs. rest serve as an adequate condition of pre-selection for voxels that may be significant in the bottom-up comprehension vs. control condition? Isn’t it possible that an effect may be stronger in the latter condition than the former condition, thus meeting significance only in the latter and not former condition?
4) When a region is declared significant, what specific hypothesis has been rejected? How should the result be interpreted, given the multiple stages of testing?
In general, it would help to provide a bit more detail regarding the statistical approaches used in the present manuscript to avoid too much consultation with the works in Ref 40 and 43 to obtain key details.
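The screening step questioned in points 1) and 2) can be pictured with a minimal sketch of Benjamini-Hochberg selection. It is not the authors' pipeline; the function name `bh_screen` and the toy p-values are hypothetical. The sketch shows one reason the concern matters: every surviving p-value is at most q by construction, so the survivors are not uniformly distributed under the null, and a downstream test that assumes uniform null p-values needs separate justification.

```python
import numpy as np


def bh_screen(pvals, q=0.05):
    """Boolean mask of p-values kept by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up thresholds q * i / m for the i-th smallest p-value.
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        keep[order[:k + 1]] = True
    return keep


if __name__ == "__main__":
    # Made-up voxel p-values for a single region.
    toy = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                    0.060, 0.074, 0.205, 0.212, 0.216])
    mask = bh_screen(toy, q=0.05)
    print("kept voxels:", np.nonzero(mask)[0])
    print("largest surviving p-value:", toy[mask].max())
```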

Second, and perhaps more importantly, the authors indicate throughout the manuscript (starting in the abstract and ending in the conclusion) that their main takeaway is that the proposed method outperforms voxel-based inference. For example:
Abstract: “The results of our simulated data show that our proposed method demonstrated significantly higher power in detecting true activation.”
which, based on Table 2, is only true when comparing to voxel-based inference, not cluster-based inference.
However, voxel-based inference is exceedingly rare in fMRI analysis due to the widespread appreciation that this approach is overconservative when correcting for so many tests, and the development of methods that leverage the known spatial dependence to obtain more powerful inferences. As such, cluster-based approaches are much more widely used. As the authors demonstrate in their one comparison with cluster-level inference (i.e., the simulation study) that the proposed approach yields nearly equivalent Type I and II error as cluster-based inference, it is unclear what the practical advantages are for the proposed method compared with the more commonly used cluster-based inference approach that also leverages spatial dependence.

Misc additional comments

1. It was not always clear in the manuscript where exactly the present work departs from that of Schildknecht et al. (2016) (Ref 40). The authors state “In the present work, we propose a new strategy and apply it to the data from Ref. 43”, but it appears that the bulk of this strategy is the region-based procedure by Schildknecht et al. (2016) (Ref 40), as they later state in the Methods. Similarly, in Fig. 1 they indicate “the part on the right indicates the processing steps proposed in this paper”, whose main component is the method of Schildknecht et al. Finally, they indicate “To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea,” but, again, inference in subject-specific regions appears to be part of the previous work. To our knowledge, the main contributions of the present manuscript appear to be the aggregation of evidence over a group of subjects instead of a single subject (Fisher’s method), the additional validation analyses, and, secondarily, the Step 1-2 estimation of p-values for preselected voxels that is only applicable for nested conditions (but again, see major concern #1). These are valuable additions but not always made clear. When it is not clear what the relevant additions are, it becomes difficult to know when to seek key details from the present manuscript compared with previous works. This all seems mostly clear enough in the abstract but becomes less so in the text.
2. The authors previously state (Ref. 40) that the weak dependency assumption may be problematic, and indeed substantial dependence between neighboring voxels is at the heart of many cluster- or region-based inference approaches. This limitation should be reiterated here.
3. It would be helpful to provide more detail on the method used for RFT cluster-level inference. The authors indicate that thresholds for cluster-level inference are determined by simulation, but it is unclear whether the thresholds in the present study were set a priori or via simulation, and how. This would be helpful to provide, as those thresholds substantially influence the results. Common practice is to set the cluster-determining threshold a priori and use theory or simulation to obtain the cluster-size threshold indicating significant clusters corresponding with a target level of FWER control in AFNI, FSL, or SPM; if the approach here departs from that, it would be helpful to mention that as well.
4. Note that Figure 1 suggests that Steps 1-2 were performed by Siegmund et al., but the methods indicate that the authors conducted their own GLM analysis for these steps. It would help to update the figure to include this detail to avoid confusion.
5. The real data validation uses a single dataset with n=11 subjects, which is on the lower side for contemporary fMRI experiments. In light of recent evidence that typical mass univariate studies may yield unreliable or underpowered results (e.g., Marek et al., 2020), it is unclear how robust evidence from this validation might be. It would help to mention this limitation or provide an argument to the contrary.
6. Minor: the use of “Monte Carlo” as the name of a color in Fig 1 was somewhat confusing, given that it could readily be mistaken for the method for stochastic simulation. It would also help to use more standard and readily appreciated color names.
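The common practice described in comment 3 (fix the cluster-determining threshold a priori, then obtain a cluster-size threshold by simulation) can be sketched as follows. This is a toy 2D illustration under stated assumptions, not the procedure of AFNI, FSL, or SPM and not the authors' implementation: the function names, the image size, the smoothness value, and the use of smoothed white noise as the null model are all hypothetical.

```python
import numpy as np
from scipy import ndimage, stats


def max_cluster_size(z_map, z_thresh):
    """Size of the largest face-connected cluster above z_thresh (0 if none)."""
    labels, n_clusters = ndimage.label(z_map > z_thresh)
    if n_clusters == 0:
        return 0
    return int(np.bincount(labels.ravel())[1:].max())


def cluster_size_threshold(shape=(32, 32), fwhm_vox=3.0, cdt_p=0.001,
                           alpha=0.05, n_sim=1000, seed=0):
    """Cluster extent k such that P(max null cluster size >= k) <= alpha."""
    rng = np.random.default_rng(seed)
    sigma = fwhm_vox / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM to sigma
    z_thresh = stats.norm.isf(cdt_p)  # cluster-determining threshold, fixed a priori
    sizes = np.empty(n_sim, dtype=int)
    for i in range(n_sim):
        noise = ndimage.gaussian_filter(rng.standard_normal(shape), sigma, mode="wrap")
        noise /= noise.std()  # rescale to roughly unit variance after smoothing
        sizes[i] = max_cluster_size(noise, z_thresh)
    sizes.sort()
    # Empirical (1 - alpha) quantile of the null maximum, plus one voxel.
    q = sizes[int(np.ceil((1.0 - alpha) * n_sim)) - 1]
    return int(q) + 1


if __name__ == "__main__":
    for fwhm in (1.0, 4.0):
        print("FWHM", fwhm, "-> cluster-size threshold", cluster_size_threshold(fwhm_vox=fwhm))
```

The resulting threshold depends on both the assumed smoothness and the cluster-determining threshold, which is why reporting how each was chosen matters for interpreting a cluster-based baseline.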

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Statistical methods for neuroimaging; computational methods for neuroimaging; computer science and applied mathematics (Cravo)

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Reviewer Report 07 Oct 2025

Benedikt Sundermann, Universitätsmedizin Oldenburg, Oldenburg, Germany; Evangelisches Krankenhaus Oldenburg (Ringgold ID: 84511), Oldenburg, Lower Saxony, Germany; University of Münster, Münster, Germany

Approved with Reservations

https://doi.org/10.5256/f1000research.183550.r420203

The authors report an alternative strategy for assessing statistical significance in group fMRI studies, i.e. identifying consistent activation patterns across several participants (not group comparison studies). Instead of a conventional mass-univariate approach with voxel-wise multiple-comparison correction or cluster-level statistics, they propose a region-based, step-wise approach. They aim to develop a testing scheme that allows researchers to skip additional independent localizer studies while maintaining relatively high statistical power in small samples. Specifically, voxel-wise statistics are followed by region-wise assessment: the original p-values are combined within atlas-based regions (co-registered with each individual’s anatomical data) to test for the statistically significant involvement of that region. The method is evaluated on both simulated small-sample data and an existing real fMRI dataset. The authors conclude that their approach provides improved sensitivity for regional activations without a substantial increase in false-positive findings. They also present diagnostic p-value histograms (suggesting relative insensitivity to outlier subjects) and compare their findings with those from the original study analysed using conventional methods.

In summary, this is a well-justified example of an fMRI analysis approach that aggregates first-level mass-univariate results within pre-specified regions (or groups of regions/networks) and, in a broader sense, is based on assessing an excess of low p-values. Such approaches have recently been gaining popularity.

In general, the paper is well structured. The writing style and some formalisms (commendable for their rigour) likely reflect the authors’ background in mathematics or computer science. However, this style may make the work slightly less accessible for researchers in applied neuroimaging. I would therefore recommend rewriting parts of the Introduction and Discussion in a manner more consistent with typical neuroimaging papers. This would improve readability for researchers who wish to apply the proposed analysis in their own work. Similarly, the associated GitHub repository could benefit from a clearer tutorial explaining how to run the provided code, especially while it is not yet integrated into common neuroimaging software.

While I cannot assess all mathematical details of the proposed approach (I am assessing the work from an applied neuroimaging perspective), it generally appears valid and without major methodological errors. The authors should, however, clarify the following points in more detail:

- The procedure involves an initial voxelwise screening (FDR) before regional testing. At which level is this applied—single subject or group? Please clarify whether this introduces any dependence between selection and subsequent tests, and how this might affect nominal error rates. Please justify this decision and consider presenting supplementary results without this step.
- Dependence structure: To what extent is the Fisher method affected by correlations among voxels within a region, and how might statistical dependence between regions (as expected in fMRI data; consider functional networks or recent work on eigenmodes of brain function) influence the final multiple-comparison adjustment?
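The dependence question in the second point can be explored with a small simulation. The sketch below is a toy model and not the authors' analysis: it assumes equicorrelated Gaussian test statistics within a region (correlation `rho`), one-sided p-values, and hypothetical function names. It estimates how often Fisher's method rejects at level alpha when no voxel is active.

```python
import numpy as np
from scipy import stats


def fisher_combine(pvals):
    """Fisher's combined p-value, valid for independent p-values."""
    p = np.clip(np.asarray(pvals, dtype=float), 1e-300, 1.0)
    stat = -2.0 * np.log(p).sum()
    return float(stats.chi2.sf(stat, df=2 * p.size))


def null_rejection_rate(rho, n_voxels=50, n_sim=5000, alpha=0.05, seed=0):
    """Share of null simulations in which Fisher's method rejects."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_sim, 1))       # component common to all voxels
    own = rng.standard_normal((n_sim, n_voxels))   # voxel-specific component
    z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own  # each z is N(0, 1)
    p = np.clip(stats.norm.sf(z), 1e-300, 1.0)     # one-sided p-values
    stat = -2.0 * np.log(p).sum(axis=1)
    p_comb = stats.chi2.sf(stat, df=2 * n_voxels)
    return float((p_comb <= alpha).mean())


if __name__ == "__main__":
    for rho in (0.0, 0.2, 0.5):
        print("rho =", rho, "null rejection rate =", null_rejection_rate(rho))
```

In this toy setting the rejection rate stays near alpha only for `rho = 0`; positive correlation pushes it above the nominal level, which is the behaviour the question about within-region dependence is probing.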

The manuscript might benefit from the following optional additions:
- Application to simulated data with different sample sizes
- Application to an additional fMRI dataset
- Comparison of results across different brain atlases

The Discussion is well balanced and considers most relevant aspects, including methodological limitations. It could benefit from a more direct comparison with other recently used approaches following a similar rationale. The conclusions follow logically from the results presented.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: clinical neuroradiology, functional neuroimaging (methods and applied research)

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report

07 Oct 2025 | for Version 1

Benedikt Sundermann, Universitätsmedizin Oldenburg, Oldenburg, Germany; Evangelisches Krankenhaus Oldenburg (Ringgold ID: 84511), Oldenburg, Lower Saxony, Germany; University of Münster, Münster, Germany

Approved With Reservations

The authors report an alternative strategy for assessing statistical significance in group fMRI studies, i.e. identifying consistent activation patterns across several participants (not group comparison studies). Instead of a conventional mass-univariate approach with voxel-wise multiple-comparison correction or cluster-level statistics, they propose a region-based, step-wise approach. They aim to develop a testing scheme that allows researchers to skip additional independent localizer studies while maintaining relatively high statistical power in small samples. Specifically, voxel-wise statistics are followed by region-wise assessment: the original p-values are combined within atlas-based regions (co-registered with each individual’s anatomical data) to test for the statistically significant involvement of that region. The method is evaluated on both simulated small-sample data and an existing real fMRI dataset. The authors conclude that their approach provides improved sensitivity for regional activations without a substantial increase in false-positive findings. They also present diagnostic p-value histograms (suggesting relative insensitivity to outlier subjects) and compare their findings with those from the original study analysed using conventional methods.

In summary, this is a well-justified example of an fMRI analysis approach that aggregates first-level mass-univariate results within pre-specified regions (or groups of regions/networks) and, in a broader sense, is based on assessing an excess of low p-values. Such approaches have recently been gaining popularity.

In general, the paper is well structured. The writing style and some formalisms (commendable for their rigour) likely reflect the authors’ background in mathematics or computer science. However, this style may make the work slightly less accessible for researchers in applied neuroimaging. I would therefore recommend rewriting parts of the Introduction and Discussion in a manner more consistent with typical neuroimaging papers. This would improve readability for researchers who wish to apply the proposed analysis in their own work. Similarly, the associated GitHub repository could benefit from a clearer tutorial explaining how to run the provided code, especially while it is not yet integrated into common neuroimaging software.

While I cannot assess all mathematical details of the proposed approach (I am assessing the work from an applied neuroimaging perspective), it generally appears valid and without major methodological errors. The authors should, however, clarify the following points in more detail:

- The procedure involves an initial voxelwise screening (FDR) before regional testing. At which level is this applied—single subject or group? Please clarify whether this introduces any dependence between selection and subsequent tests, and how this might affect nominal error rates. Please justify this decision and consider presenting supplementary results without this step.
- Dependence structure: To what extent is the Fisher method affected by correlations among voxels within a region, and how might statistical dependence between regions (as expected in fMRI data; consider functional networks or recent work on eigenmodes of brain function) influence the final multiple-comparison adjustment?
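
The dependence question can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not the authors' implementation: it combines one-sided voxel p-values of a null region with Fisher's method and compares the rejection rate at the 5% level for independent voxels and for equicorrelated voxels. The equicorrelation model, the region size of 50 voxels, and all function names are hypothetical choices.

```python
import math
import random


def fisher_combined_p(p_values):
    """Fisher's method: T = -2 * sum(log p_i) is chi-square with 2k degrees
    of freedom under the global null if the k p-values are independent."""
    k = len(p_values)
    half_t = -sum(math.log(p) for p in p_values)  # equals T / 2
    # Closed-form survival function for even degrees of freedom:
    # P(chi2_{2k} > T) = exp(-T/2) * sum_{j=0}^{k-1} (T/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_t / j
        total += term
    return min(1.0, math.exp(-half_t) * total)


def null_region_p_values(k, rho, rng):
    """One-sided p-values for k null voxels whose z-scores share a common
    factor, giving pairwise correlation rho (assumed toy model)."""
    shared = rng.gauss(0.0, 1.0)
    p_values = []
    for _ in range(k):
        z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        p_values.append(0.5 * math.erfc(z / math.sqrt(2.0)))  # P(Z > z)
    return p_values


def rejection_rate(rho, k=50, alpha=0.05, n_sim=2000, seed=1):
    """Fraction of simulated null regions declared active by Fisher's test."""
    rng = random.Random(seed)
    hits = sum(
        fisher_combined_p(null_region_p_values(k, rho, rng)) <= alpha
        for _ in range(n_sim)
    )
    return hits / n_sim


if __name__ == "__main__":
    print("independent voxels:", rejection_rate(rho=0.0))
    print("correlated voxels: ", rejection_rate(rho=0.5))
```

Under these assumptions the independent case should reject at close to the nominal 5%, while the correlated case rejects noticeably more often, which is the kind of inflation this point is concerned with.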

The manuscript might benefit from the following optional additions:
- Application to simulated data with different sample sizes
- Application to an additional fMRI dataset
- Comparison of results across different brain atlases

The Discussion is well balanced and considers most relevant aspects, including methodological limitations. It could benefit from a more direct comparison with other recently used approaches following a similar rationale. The conclusions follow logically from the results presented.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

clinical neuroradiology, functional neuroimaging (methods and applied research)

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

1. Amunts K, Mohlberg H, Bludau S, et al.: Julich-brain: A 3d probabilistic atlas of the human brain’s cytoarchitecture. Science. 2020; 369: 988–992.

2. Andreella A, Feilong M, Halchenko Y, et al.: A Statistical Approach to the Alignment of fMRI Data. Book of Short Papers, SIS 2020. Pollice A, Salvati N, Spagnolo FS, editors. 2020; pp. 733–738.

3. Benjamini Y, Bogomolov M: Selective inference on multiple families of hypotheses. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014; 76: 297–318.

4. Benjamini Y, Heller R: False discovery rates for spatial signals. J. Am. Stat. Assoc. 2007; 102: 1272–1281.

5. Benjamini Y, Heller R: Screening for partial conjunction hypotheses. Biometrics. 2008; 64: 1215–1222.

6. Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 1995; 57: 289–300.

7. Brodmann K: Vergleichende Lokalisationslehre der Großhirnrinde in ihren Prinzipien dargestellt auf Grund des Zellbaues. Leipzig: Barth; 1909.

8. Brooks R: Using a behavioral theory of program comprehension in software engineering. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 1978; pp. 196–201.

9. Brooks R: Towards a theory of the comprehension of computer programs. Int. J. Man-Mach. Stud. 1983; 18: 543–554.

10. Desikan R, Ségonne F, Fischl B, et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006; 31: 968–980.

11. Destrieux C, Fischl B, Dale A, et al.: Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage. 2010; 53: 1–15.

12. Dickhaus T: Simultaneous statistical inference with applications in the life sciences. Berlin Heidelberg: Springer-Verlag; 2014.

13. Donoho D, Jin J: Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat. 2004; 32: 962–994. MR 2065195.

14. Dudoit S, van der Laan M: Multiple testing procedures with applications to genomics. Springer Series in Statistics. New York, NY: Springer; 2008.

15. Eklund A, Nichols TE, Knutsson H: Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proc. Natl. Acad. Sci. 2016; 113: 7900–7905.

16. Esteban O, Markiewicz CJ, Burns C, et al.: nipy/nipype: 1.5.0. 2020.

17. Fischl B, Salat DH, Busa E, et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron. 2002; 33: 341–355.

18. Fischl B, Van Der Kouwe A, Destrieux C, et al.: Automatically parcellating the human cerebral cortex. Cereb. Cortex. 2004; 14: 11–22.

19. Forman S, Cohen J, Fitzgerald M, et al.: Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. Magn. Reson. Med. 1995; 33: 636–647.

20. Friston K, Rotshtein P, Geng J, et al.: A critique of functional localisers. NeuroImage. 2006; 30: 1077–1087.

21. Gorgolewski K, Burns CD, Madison C, et al.: Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python. Front. Neuroinform. 2011; 5: Article 13.

22. Haxby JV: Multivariate pattern analysis of fMRI: The early beginnings. NeuroImage. 2012; 62: 852–855.

23. Heller R, Stanley D, Yekutieli D, et al.: Cluster-based analysis of fMRI data. NeuroImage. 2006; 33: 599–608.

24. Hu J, Zhao H, Zhou H: False discovery rate control with groups. J. Am. Stat. Assoc. 2010; 105: 1215–1227.

25. Huang Y, Liu X, Krueger R, et al.: Distilling neural representations of data structure manipulation using fMRI and fNIRS. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 2019; pp. 396–407.

26. Jarmasz M, Somorjai R: Exploring regions of interest with cluster analysis (EROICA) using a spectral peak statistic for selecting and testing the significance of fMRI activation time-series. Artif. Intell. Med. 2002; 25: 45–67.

27. Krueger R, Huang Y, Liu X, et al.: Neurological divide: An fMRI study of prose and code writing. Proceedings of International Conference on Software Engineering (ICSE). 2020; pp. 678–690.

28. Lazar N: The statistical analysis of functional MRI data. Statistics for Biology and Health. Springer; 2008.

29. Lindquist M: The statistical analysis of fMRI data. Stat. Sci. 2008; 23: 439–464. MR 2530545.

30. Liu Y, Sarkar S, Zhao Z: A new approach to multiple testing of grouped hypotheses. J. Statist. Plann. Inference. 2016; 179: 1–14. MR 3550875.

31. Makris N, Goldstein JM, Kennedy D, et al.: Decreased volume of left and total anterior insular lobule in schizophrenia. Schizophr. Res. 2006; 83: 155–171.

32. Neumann A, Peitek N, Brechmann A, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. WIAS Preprint No. 2806. 2021.

33. Nieto-Castañón A, Fedorenko E: Subject-specific functional localizers increase sensitivity and functional resolution of multi-subject analyses. NeuroImage. 2012; 63: 1646–1669.

34. Noble S, Mejia AF, Zalesky A, et al.: Improving power in functional magnetic resonance imaging by moving beyond cluster-level inference. Proc. Natl. Acad. Sci. USA. 2022; 119: e2203020119.

35. Ombao H, Lindquist M, Thompson W, et al.: Handbook of Neuroimaging Data Analysis. New York: CRC Press; 2016.

36. Peitek N, Brechmann A, Tabelow K, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. Data. 2024.

37. Pennington N: Stimulus structures and mental representations in expert comprehension of computer programs. Cogn. Psychol. 1987; 19: 295–341.

38. Rosenblatt J, Finos L, Weeda W, et al.: All-resolutions inference for brain imaging. NeuroImage. 2018; 181: 786–796.

39. Saxe R, Brett M, Kanwisher N: Divide and conquer: A defense of functional localizers. NeuroImage. 2006; 30: 1088–1096.

40. Schildknecht K, Tabelow K, Dickhaus T: More specific signal detection in functional magnetic resonance imaging by false discovery rate control for hierarchically structured systems of hypotheses. PLoS One. 2016; 11: 1–21.

41. Shi R, Guo Y: Investigating differences in brain functional networks using hierarchical covariate-adjusted independent component analysis. Ann. Appl. Stat. 2016; 10: 1930–1957. MR 3592043.

42. Siegmund J, Kästner C, Apel S, et al.: Understanding understanding source code with functional magnetic resonance imaging. Proceedings International Conference on Software Engineering (ICSE). ACM; 2014; pp. 378–389.

43. Siegmund J, Peitek N, Parnin C, et al.: Measuring neural efficiency of program comprehension. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). New York, NY, USA: Association for Computing Machinery; 2017; pp. 140–150.

44. Soloway E, Ehrlich K: Empirical studies of programming knowledge. IEEE Trans. Softw. Eng. 1984; 10: 595–609.

45. Tabelow K, Polzehl J: Statistical parametric maps for functional MRI experiments in R: The package fmri. J. Stat. Softw. 2011; 44: 1–21.

46. Talairach J, Tournoux P: Co-planar stereotaxic atlas of the human brain. Thieme; 1988.

47. van de Geer S: Estimation and testing under sparsity. Lecture Notes in Mathematics. Cham: Springer; 2016; Vol. 2159. Lecture notes from the 45th Probability Summer School held in Saint-Flour, 2015, École d’Été de Probabilités de Saint-Flour. [Saint-Flour Probability Summer School]. MR 3526202.

48. Vovk V, Wang R: Combining p-values via averaging. Biometrika. 2020; 107: 791–808.

49. Vovk V, Wang R: E-values: Calibration, combination, and applications. Ann. Stat. 2021; 49: 1736–1754.

50. Wainwright M: High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press; 2019.

51. Welvaert M, Durnez J, Moerkerke B, et al.: neuRosim: An R package for generating fMRI data. J. Stat. Softw. 2011; 44: 1–18.

52. Wilson D: The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. 2019; 116: 1195–1200.

53. Worsley K: Local maxima and the expected Euler characteristic of excursion sets of χ², F and t fields. Adv. Appl. Probab. 1994; 26: 13–42.

54. Worsley K, Marrett S, Neelin P, et al.: A unified statistical approach for determining significant signals in images of cerebral activation. Hum. Brain Mapp. 1996; 4: 58–73.

55. Yekutieli D: Hierarchical false discovery rate-controlling methodology. J. Am. Stat. Assoc. 2008; 103: 309–316.

56. Zhao H, Zhang J: Weighted p-value procedures for controlling FDR of grouped hypotheses. J. Stat. Plann. Inference. 2014; 151-152: 90–106.

Figure 1. Illustration of processing of the experimental data.

Table 1. Overview of the two related fMRI studies of program comprehension.

Listing 1. Example code snippet in Java from Siegmund et al. (cf. Ref. 42) that computes the length of the last word in a string.

Table 2. The results of our computer simulations.

Figure 2. The six significant brain regions (at FWER level α = 5%).

Figure 3. Network of left-lateralized confirmed brain areas. In Ref. 43, BAs 21, 40, 44 were found activated during program comprehension.

Figure 4. Results of our analysis with significantly activated Aparc labels.

Figure 5. Histogram of p values of ctx_lh_G_temporal_middle label that is evaluated as significantly activated.

Figure 6. Histogram of p values of ctx_lh_G_pariet_inf-Angular that is not evaluated as significantly activated.

Table 3. The region of interest in BA21 identified in Ref. 42 consists of 2844 voxels.

Table 4. The region of interest in BA40 identified in Ref. 42 consists of 1777 voxels.
