Utilizing anatomical information for signal detection in functional magnetic resonance imaging

Norman Peitek; André Brechmann; Karsten Tabelow; Thorsten Dickhaus

doi:10.12688/f1000research.166549.2

Research Article

Revised

[version 2; peer review: 2 approved, 1 not approved]

Norman Peitek¹, André Brechmann², Karsten Tabelow³, Thorsten Dickhaus⁴

PUBLISHED 25 Mar 2026

¹ Saarland University, Saarbrücken, Saarland, Germany
² Leibniz Institute for Neurobiology, Magdeburg, Saxony-Anhalt, Germany
³ Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Berlin, Germany
⁴ University of Bremen, Bremen, Bremen, Germany

Norman Peitek
Roles: Data Curation, Investigation, Methodology, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

André Brechmann
Roles: Conceptualization, Data Curation, Funding Acquisition, Methodology, Resources, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Karsten Tabelow
Roles: Data Curation, Formal Analysis, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Thorsten Dickhaus
Roles: Conceptualization, Funding Acquisition, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Abstract

Background

We consider the statistical analysis of functional magnetic resonance imaging (fMRI) data. As demonstrated in previous work, grouping voxels into regions (of interest) and carrying out a multiple test for signal detection on the basis of these regions typically leads to higher sensitivity than voxel-wise multiple testing approaches.

Methods

In the case of a multi-subject study, we propose to define the regions for each subject separately based on their individual brain anatomy, represented, e.g., by regional labels. The aggregation of the subject-specific evidence for the presence of signals in the different regions is then performed by means of a combination function for p-values. We validate the proposed methodology with simulated data and apply it to real fMRI data of a hypothesis-driven approach towards identifying brain regions involved in understanding software code.

Results

Our simulation results indicate that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline. On real fMRI data, our approach yields results that overlap with those of a two-stage approach for which two independent experiments are needed, one for defining the regions and one for the actual signal detection.

Conclusions

Overall, we demonstrate that our method of utilizing anatomical information is a candidate for a more sensitive analysis of fMRI data.

Keywords

Aparc label; combination test; false discovery rate; mass-univariate linear model; program comprehension

Corresponding author: Norman Peitek

Competing interests: No competing interests were disclosed.

Grant information: Financial support by the Deutsche Forschungsgemeinschaft (DFG) via grant DI 1723/3-2 is gratefully acknowledged. Brechmann’s work is supported by DFG grant BR 2267/7-2.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2026 Peitek N et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Peitek N, Brechmann A, Tabelow K and Dickhaus T. Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.12688/f1000research.166549.2) First published: 01 Oct 2025, 14:1019 (https://doi.org/10.12688/f1000research.166549.1) Latest published: 25 Mar 2026, 14:1019 (https://doi.org/10.12688/f1000research.166549.2)

Revised Amendments from Version 1

We revised our paper to improve accessibility, substantially clarified methodological details (especially regarding the multi‑stage testing procedure, validity of the partial conjunction test, dependence structure, and hypothesis interpretation), and expanded the introduction, methods, and discussion accordingly. Additionally, we extended the simulation study (including varying SNR), refined figures and explanations, added a detailed code tutorial, clarified limitations (e.g., sample size), and incorporated multiple reviewer‑suggested clarifications throughout.

See the authors' detailed response to the review by Qiran Jia
See the authors' detailed response to the review by Fabricio Cravo and Stephanie Noble
See the authors' detailed response to the review by Benedikt Sundermann

1. Introduction

Signal detection in high-dimensional data is a major topic of modern statistics. In the context of functional magnetic resonance imaging (fMRI) with its numerous volume units (voxels) as primary units of measurement, localization of brain function is a common structural assumption. Scattered, spread-out signals (in single voxels or very small groups of voxels) are likely to be artifacts (Ref. 41). Instead, topologically contiguous signals forming larger groups of voxels are much more plausible; see Refs. 21, 30. This information rules out certain patterns of the signal structure a priori, and can hence be exploited to increase the statistical power for signal detection; see, among many others, Ref. 45. There are different possibilities to define or find such functional regions:

(i) One may refer to an atlas of the brain, like the Brodmann atlas (see Ref. 7) and aggregate data within the regions given by this atlas. This has been the strategy of Ref. 45.
(ii) One may find the regions (of interest) in a data-driven manner, e.g., by a cluster analysis. This has been proposed, among others, in Ref. 28. However, as emphasized for instance in Ref. 4, it is important that this data-driven definition of regions is “based on information outside the data that we set out to analyze”, meaning that the dataset used for defining the regions should be (stochastically) independent of the dataset which is used for signal detection, to avoid selection biases.
(iii) One may choose a statistical methodology which ensures statistically valid conclusions even for regions which are selected in a post-hoc manner after having seen the actual study data. This can be achieved by simultaneous inference methods which guarantee that any possible selection event is accounted for; see, e.g., Ref. 43 and references therein.

All of the three aforementioned strategies have their assets and their drawbacks: Strategy (i) is inexpensive and easy to implement, but the regions taken from the atlas may not be optimally aligned with the specific task at hand, and differences in the individual brain anatomies of the study participants may complicate its application. For a statistical approach to the alignment of fMRI data, see, e.g. Ref. 2 and the references therein. Strategy (ii) is costly (two independent experiments are needed), but is supposed to yield a more accurate definition of the regions (of interest). Strategy (iii) avoids both the (potentially suboptimal) a priori definition of regions and the need for an additional independent experiment. However, the issue of multiple testing (see, e.g., Refs. 13, 14) becomes much more severe if simultaneity over all possible selection events has to be guaranteed, and the selection of the regions is not based on a clear-cut (statistical) criterion, but on the expert judgment of the study data. Therefore, it is hard to compare the results of Strategy (iii) with those of Strategies (i) and (ii).

In the present work, we propose a new strategy and apply it to existing data of a study on program comprehension (Ref. 48; see Methods for details). Our proposed data analysis strategy is similar to Strategy (i). Namely, we rely on regions that are defined based on the FreeSurfer segmentation of the brain into so-called Aparc labels; cf. Section 2.3 for details. However, we apply this segmentation of the brain for each subject separately. The rationale for this novel approach is that functional brain regions show variable localization in brain templates between subjects, despite optimized anatomical co-registration. Only in the final step of analysis (see Step 4 in Section 2.4) are the subject-specific $p$-values for the activation of the brain regions of interest combined, and we evaluate the significance of such a region (on the multi-subject level of data analysis) by means of the resulting combined $p$-value.

As we will demonstrate by means of the concrete example from the field of programming language comprehension, this new strategy can be about as powerful as Strategy (ii) applied in Ref. 48, while avoiding the sequence of two separate fMRI studies where the data of the first is used to define suitable regions for the second. Strategy (ii) has been applied in Ref. 48 as follows: The authors made use of the knowledge about cortical regions relevant for understanding software code identified in a previous study; cf. Ref. 47. This first study (Ref. 47) thus served as “localizer” for the second study (Ref. 48) in the sense of Strategy (ii). Only with this “pre-localization” was it possible in Ref. 48 to declare certain brain regions statistically significantly activated during the task of programming language comprehension considered in Ref. 48, which is more specific than the task in Ref. 47.

In contrast, our present methodology will not at all rely on the data from the first study (Ref. 47), while achieving similar results as those obtained in Ref. 48. Thus, our proposed methodology can enable researchers to spare precious measurement time.

The rest of the paper is structured as follows. In Section 2, we describe our proposed statistical methodology. Section 3 describes conducted computer simulations comparing three different methods of fMRI data analysis. Section 4 is devoted to the detailed description of our re-analysis of the fMRI data from Ref. 48, and the results of this re-analysis are presented in Section 4.3. We conclude with a discussion in Section 5.

2. Methods

In this section, we describe our statistical model for fMRI data as well as the proposed data analysis workflow for detecting brain regions which are significantly associated with a certain cognitive task.

2.1 Reference data

As reference for our novel statistical approach we reuse the data of a study on program comprehension published in Ref. 48. This study made use of a functional localizer defined in a separate prior study with a comparable experimental design published in Ref. 47 and thus followed Strategy (ii) for functional localization of regions involved in comprehending program code (see Table 1 for details of the two studies). In the first study (see Ref. 47), participants were asked to understand short program code snippets, such as shown in Listing 1. The program code did not contain any useful identifier names, which induces bottom-up comprehension (cf. Ref. 40). Ref. 47 contrasted the bottom-up comprehension task with a syntax task, in which participants were presented with similar program code snippets, but only had to focus on syntax errors (e.g., a missing semicolon). This control condition was intended to reveal only brain activation that is necessary for programmers to comprehend program code in-depth. As an additional control condition, the experiment included phases of rest in between the comprehension and syntax conditions.

Table 1. Overview of the two related fMRI studies of program comprehension.

Characteristic	Study 1 (Ref. 47)	Study 2 (Ref. 48)
Participant sessions	16	14
Trials	12	30
Conditions	Bottom-up program comprehension, control (syntax), rest	Top-down program comprehension (n = 24), Bottom-up program comprehension (n = 3), control (syntax, n = 3), rest
Scans	900	900

Listing 1. Example code snippet in Java from Siegmund et al. (Ref. 47) that computes the length of the last word in a string. The snippet uses non-meaningful identifiers to induce bottom-up comprehension. Participants needed to determine the output of this snippet (“5”).

The second study (Ref. 48) was a follow-up study that also differentiated the program-comprehension task into more nuanced conditions. One aim was to differentiate between bottom-up comprehension and top-down comprehension (cf. Ref. 9) which was induced by varying the meaningfulness of identifier names and by prior training to provide participants with the necessary knowledge. As in the previous study, the syntax task served as a control condition. Another research question addressed the goal of confirming the activated brain areas from the first study. To this end, the second study built on the regions identified in the first study, thus following Strategy (ii). Both studies were approved by the ethics board of the University of Magdeburg (Application: 87/14).

2.2 Linear models for voxel-wise multiple tests

Let $Y_{ixt}$ denote the observed data from a functional MRI experiment at voxel $x$ and time $t$ for the $i$-th subject. Here, we adopt the common view (cf. Section 5.4 in Ref. 30) of a mass-univariate linear model

$$Y_{ixt} = X_{i} β_{ix} + ε_{ixt} \qquad (1)$$

for the data, with a design matrix $X_{i}$ containing variables with the expected blood oxygenation level dependent (BOLD) response related to the experimental stimuli or nuisance parameters like drifts of the MR signal. The random variable $ε_{ixt}$ is the error term, which is assumed to be normally distributed with zero expectation and a spatio-temporal correlation structure. The model in Eq. (1) is also referred to as a “within-subject model” in the fMRI literature; see, e.g., Section 12.4.1 in Ref. 37. Estimates ${\hat{β}}_{ix}$ of the statistical parametric map (SPM) or their contrasts $c^{T} {\hat{β}}_{ix}$ and estimates of their covariance matrices ${\hat{Σ}}_{ix}$ (or the variances ${\hat{σ}}_{ix}^{2} = c^{T} {\hat{Σ}}_{ix} c$) can then be obtained from a pre-whitened version of the linear model above; cf. Ref. 30.

The SPM then forms a random $t$-field (cf. Ref. 58) with an inherent multiple comparison problem due to the large number of local hypotheses. One common strategy is to define local p-values at each voxel $x$ and for each subject $i$ based on the local values of the random $t$-field and to control the family-wise error rate (FWER) using accordingly adjusted thresholds; cf. Ref. 59. However, this approach is known to be very conservative, i.e., it has low power for detecting significant brain signals in the outlined framework. A less conservative alternative is to control the false discovery rate (FDR), e.g., by the procedure proposed in Ref. 6.

2.3 Parcellation of the human brain

Neuroanatomic research has found that the human brain can be parcellated into different sub-regions based on structural similarities. One of the earliest atlases is the Brodmann atlas (see Ref. 7) which is based on the cytoarchitectural organization of the brain. The Brodmann Areas have been schematically transferred to a template brain, the so-called Talairach-Atlas (see Ref. 52) which is commonly used in fMRI studies to report the location of significant grand average activation, as used in Refs. 47, 48. For the analysis of these data in the current paper, we chose the Harvard-Oxford brain atlas (cf. Refs. 11, 33) that provides a parcellation based on gross anatomical landmarks and delivers an Aparc label $j$ for each voxel of each individual brain space. However, any other brain parcellation to define regional labels could be used with our methodology.

2.4 Statistical inference

As outlined in the introduction, we re-used fMRI data from a program code comprehension task first analyzed in Ref. 48 and performed a new analysis comprising the four steps outlined below. The experiment used two different levels of software program code comprehension stimuli, henceforth denoted as bottom-up and top-down comprehension, to draw inferences about the related cognitive processes. In our strategy, we combined the methods from Ref. 48 (Steps 1 and 2) and Ref. 45 (Step 3). Furthermore, we implemented our new methodological contribution of combining the evidence for activation of a given brain region across the subjects (Step 4). The first two steps were already conducted in Ref. 48 and we only exported the voxel-wise p-values of the random-effects linear model analysis for each subject separately. Steps 3 and 4 rely on the assumption that these voxel-wise p-values are valid (i.e., stochastically lower-bounded by a uniform distribution on [0, 1] under the null hypothesis). To ensure this, it is important that (i) the two contrasts to be tested in Steps 1 and 2 are pre-defined before seeing any data which are used to compute p-values, and (ii) the explicit p-value calculation in Step 2 is only performed for those voxels which have been declared significant in Step 1. Technically, the p-values for all voxels which have not been declared significant in Step 1 are set to 1.0 in Step 2. Conditions (i) and (ii) ensure that the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small).

Step 1: Program comprehension versus rest

In the first step, we contrasted (for each participant separately) the comprehension of program code (cf. Ref. 48) to the rest condition. This identifies brain areas with a positive deflection of the BOLD response. Furthermore, in order to account for the multiple comparison problem, we performed the Benjamini-Hochberg test (see Ref. 6) for FDR control. Only those voxels which have been declared significant by this procedure were considered in step 2. This methodology is justified by the fact that the FDR is an established screening criterion for high-dimensional multiple test problems.
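The screening rule used in Step 1 can be illustrated with a minimal sketch of the Benjamini-Hochberg step-up procedure. The function name `benjamini_hochberg` and its interface are illustrative assumptions; this is not the software used in Ref. 48.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank r such that the r-th smallest
    p-value is <= r * alpha / m, then reject the r smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda k: p_values[k])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank  # keep the largest rank satisfying the bound
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Toy p-values for ten hypothetical voxels of one participant.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
screened = benjamini_hochberg(toy_p, alpha=0.05)
```

In this sketch, `screened` plays the role of the set of voxels passed on to Step 2.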

Step 2: Bottom-up comprehension versus control condition

In this step, we examined (again, for each participant separately) one type of program comprehension: bottom-up comprehension. Bottom-up comprehension is induced when program code provides no semantic cues and programmers need to comprehend each line separately and then integrate the information in a slow, tedious process. To the voxels declared significant in Step 1, we applied the same multiple test, now for the contrast of bottom-up comprehension against the control condition (syntax task). As a result of this step, we get for each participant $i$ and for each considered voxel $k$ a p-value ${\tilde{p}}_{ik}$.
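The restriction to screened voxels can be sketched as follows, under the assumption that the Step 1 result is available as a boolean mask per participant. The function and variable names are hypothetical and chosen only for illustration.

```python
def restrict_to_screened(p_contrast2, screened):
    """Keep Step 2 p-values only for voxels that passed Step 1.

    Voxels not declared significant in Step 1 get the p-value 1.0,
    as described in Section 2.4.
    """
    if len(p_contrast2) != len(screened):
        raise ValueError("p-values and screening mask must have equal length")
    return [p if keep else 1.0 for p, keep in zip(p_contrast2, screened)]


# Toy example: four voxels, of which the first and third passed Step 1.
p_tilde = restrict_to_screened([0.002, 0.010, 0.300, 0.040],
                               [True, False, True, False])
```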

Step 3: Regional p-values for every participant $i$

This step builds upon the methodology from Ref. 45, and it delivers for each participant $i$ and for each anatomical (regional) label $j$ a (confirmatory) significance evaluation with respect to the contrast specified in Step 2. Hence, the evidence from all voxels of participant $i$ in the brain region labeled by $j$ is combined in this step of data analysis.

To this end, let $κ$ be a tuning parameter with values in the interval [0, 1] (i.e. in per cent) and let $m_{j}$ be the number of voxels contained in the brain region labeled by region label $j$. To keep the notation feasible, we implicitly assume here that for each participant $i$ the same number $m_{j}$ of voxels belong to the brain region labeled by $j$. We consider the null hypothesis $H_{ij}$ of no relevant differential activation of the region labeled by $j$ for participant $i$ during the program comprehension and syntax tasks mentioned in Step 2, together with its two-sided alternative hypothesis $K_{ij}$. We call $H_{ij}$ the “regional null hypothesis” for the brain region labeled by $j$ for participant $i$. We formalize $H_{ij}$ as a so-called partial conjunction hypothesis (see Ref. 45 and the references therein for a formal mathematical description), meaning that we consider the differential activation in region $j$ for participant $i$ relevant, if it contains at least $u_{j} ≔ κ \cdot m_{j}$ significant voxels. For testing $H_{ij}$ we calculate the “regional p-value” $p_{ij}^{REGION}$, given by

$$p_{ij}^{REGION} ≔ \min_{1 \leq ı \leq m_{j} - u_{j} + 1} \frac{m_{j} - u_{j} + 1}{ı} \, {\tilde{p}}_{i, (u_{j} - 1 + ı) : m_{j}}$$

where the voxel-wise p-values ${\tilde{p}}_{i, 1 : m_{j}}, \dots, {\tilde{p}}_{i, m_{j} : m_{j}}$ for participant $i$ in region $j$ are ordered from smallest to largest (see Ref. 5). For the $p$-value $p_{ij}^{REGION}$ to be valid, assumptions about the dependency structure among $({\tilde{p}}_{ik} : 1 \leq k \leq K)$ are required. In particular, weak dependency (meaning that the empirical cumulative distribution functions of the $p$-values corresponding to true and false null hypotheses, respectively, converge in the Glivenko-Cantelli sense as $K \to \infty$) has been assumed in Ref. 45, where $K$ denotes the total number of voxels.
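As an illustration, the regional p-value for one participant and one region can be computed with the following minimal sketch. Two details are assumptions made here and are not specified in the text: $u_{j}$ is rounded up to the next integer (and is at least 1), and the result is capped at 1. The function name `regional_p_value` is hypothetical.

```python
import math


def regional_p_value(p_voxels, kappa):
    """Partial conjunction p-value for one region of one participant.

    p_voxels: voxel-wise p-values of the region (any order).
    kappa:    proportion of voxels required to be active, in [0, 1].
    """
    p_sorted = sorted(p_voxels)            # ordered from smallest to largest
    m = len(p_sorted)
    if m == 0:
        raise ValueError("region contains no voxels")
    # Assumption: round u_j = kappa * m_j up, and require at least one voxel.
    u = max(1, math.ceil(kappa * m))
    n_tail = m - u + 1                     # m_j - u_j + 1
    # The l-th candidate uses the (u - 1 + l)-th smallest p-value.
    candidates = [n_tail / l * p_sorted[u - 2 + l] for l in range(1, n_tail + 1)]
    # Assumption: cap at 1 so that the result is a proper p-value.
    return min(1.0, min(candidates))


# Toy region with four voxels; kappa = 0.5 requires u = 2 active voxels.
p_region = regional_p_value([0.5, 0.001, 0.02, 0.01], kappa=0.5)
```

For `u = 1` the formula reduces to a Simes-type combination over all voxels of the region.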

In order to achieve family-wise error rate (FWER) control, we have to choose the tuning parameter $κ$ smaller than or equal to $1 / J$, where $J$ is the number of regional labels. Choosing $κ = 1 / J$ corresponds to the so-called Bonferroni multiplicity correction. The choice of $κ$ is discussed further in Appendix S1 of Ref. 45.

Step 4: Combined regional hypothesis tests by Fisher’s method

In this final step, we combine for each regional label $j$ the regional p-values calculated in Step 3 over all participants $i = 1, \dots, n$. In order to do this, we apply the so-called Fisher method to combine p-values. Namely, the Fisher test statistic $T_{j}$ for region $j$ is given by

$$T_{j} ≔ - 2 \sum_{i = 1}^{n} \log (p_{ij}^{REGION})$$

Under independence of the data with respect to the participants, $T_{j}$ is asymptotically $χ_{2 n}^{2}$-distributed (chi-squared with $2 n$ degrees of freedom) under the null. The latter independence assumption is justified, because the participants have been included in the study independently from each other.

Finally, we can reject the (over all participants $i$ combined) regional hypothesis $H_{j}$ (i.e., the respective partial conjunction hypothesis, but now with respect to the population, not with respect to a single participant) if and only if Fisher’s test statistic $T_{j}$ is larger than the $(1 - ακ)$-quantile of the $χ_{2 n}^{2}$-distribution, where the tuning parameter $κ$ has been introduced in Step 3. This parameter addresses the multiplicity of the test problem with respect to the $J$ regional labels which are simultaneously under consideration.
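The decision rule of Step 4 can be sketched with the standard library only, because the survival function of a chi-squared distribution with an even number $2 n$ of degrees of freedom has a closed form. Rejecting when $T_{j}$ exceeds the $(1 - ακ)$-quantile is equivalent to the combined p-value being smaller than $ακ$. All function names are hypothetical, and clipping p-values away from zero is an assumption added here for numerical safety.

```python
import math


def chi2_sf_even_df(t, n):
    """P(X > t) for X chi-squared with 2n degrees of freedom.

    Closed form: exp(-t/2) * sum_{k=0}^{n-1} (t/2)^k / k!
    """
    half = t / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_region_test(regional_p_values, alpha=0.05, kappa=0.01, floor=1e-300):
    """Combine the regional p-values of n participants for one region."""
    # Assumption: clip at a tiny positive value so that log() is defined.
    clipped = [max(p, floor) for p in regional_p_values]
    t_j = -2.0 * sum(math.log(p) for p in clipped)
    combined_p = chi2_sf_even_df(t_j, len(clipped))
    return t_j, combined_p, combined_p < alpha * kappa


# Toy example with eleven participants.
_, _, reject_strong = fisher_region_test([0.01] * 11)
_, _, reject_weak = fisher_region_test([0.5] * 11)
```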

3. Computer simulations

In order to compare the performance of three different methods for fMRI data analysis in a controlled framework, we have carried out computer simulations.

3.1 Simulation setting

We have simulated a dataset ${Y_{ixt}}$ with eleven subjects (referring to the index $i$ and corresponding to the group size in the real dataset below), a spatial grid of size $20 \times 20 \times 20 = 8,000$ voxels (referring to the index $x$), and 195 time points (scans, referring to the index $t$). We have assumed that the 8,000 voxels are grouped into eight anatomical regions of size $10 \times 10 \times 10 = 1,000$ each, and that these regions are correctly annotated for all eleven subjects. Two alternating stimuli in an ON-OFF block task design with a total of six ON blocks of a duration of 15 scans have been used for the temporal signal: In a circular area in one of the predefined anatomical regions, the signal of one stimulus was twice as high as the signal of the other stimulus, mimicking a signal contrast. In a corresponding area in a second anatomical region, the signal was created with no difference between the stimuli. The datasets with first-order autocorrelated Rician noise have been created using the R package neuRosim (Ref. 56); p-values ${\tilde{p}}_{ik}$ for Contrast 1 (ON of either stimulus versus rest) and Contrast 2 (one stimulus versus the other) were determined using the R package fmri (Ref. 51). This simulation setup has been run 500 times. We repeated the simulation for varying signal-to-noise ratios (0.75, 1.0, 1.25, 1.5, 1.75).
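For orientation, the following sketch builds boxcar indicators for one possible timing that is compatible with the stated numbers (195 scans, six ON blocks of 15 scans). The exact onsets are not given in the text, so the layout below (rest first, ON and OFF blocks alternating, the two stimuli taking turns) is purely an assumption, as are all names. The simulations themselves were run with the R packages neuRosim and fmri, not with this code.

```python
def block_design(n_scans=195, block_len=15):
    """Return two 0/1 indicator lists, one per stimulus (assumed layout)."""
    stim_a = [0] * n_scans
    stim_b = [0] * n_scans
    on_blocks = 0
    # Assumed layout: OFF, ON, OFF, ON, ... starting with a rest block.
    for start in range(block_len, n_scans, 2 * block_len):
        target = stim_a if on_blocks % 2 == 0 else stim_b
        for t in range(start, min(start + block_len, n_scans)):
            target[t] = 1
        on_blocks += 1
    return stim_a, stim_b


stim_a, stim_b = block_design()
# Contrast 1 (either stimulus versus rest) and Contrast 2 (one versus the other)
# expressed as per-scan regressors before convolution with a response function.
contrast1 = [a + b for a, b in zip(stim_a, stim_b)]
contrast2 = [a - b for a, b in zip(stim_a, stim_b)]
```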

3.2 Considered data analysis methods

For the statistical analysis of the simulated data, we have considered three different methods.

• Voxel-wise: On the basis of voxel-wise $Z$-scores (aggregated over all eleven participants) and the resulting p-values for Contrast 1, voxels have been screened by applying the Benjamini-Hochberg method at level $α = 0.05$. For the screened voxels only, $Z$-scores (again aggregated over all eleven participants) and the resulting p-values for Contrast 2 have been computed. A region $j$ has been declared significantly activated, if the number of Contrast 2 p-values below $0.05 / s_{j}$ in that region $j$ has been larger than $1,000 \cdot κ$, with $s_{j}$ denoting the number of screened voxels in region $j$ and $κ = 0.01$, meaning that we considered a region $j$ to be relevantly activated if at least 10 out of the 1,000 voxels in that region $j$ are statistically significantly activated under the stimulus. The rejection threshold $0.05 / s_{j}$ accounts for the (reduced) multiplicity resulting from the pre-selection by means of Contrast 1.
• Cluster-wise: Activation is typically spatially spread over multiple voxels. In fact, single isolated voxels are mostly considered as spurious rather than activated; see Ref. 30. Thus, we also analyzed the simulated datasets and the two contrasts with cluster-based thresholds: At the given significance level $α = 0.05$, the value of the test statistic at a voxel must exceed some threshold and the voxel has to belong to a cluster of at least $s$ connected voxels. To determine p-values we used the R package fmri (Ref. 51), where the method is implemented as function fmri.cluster. The cluster-level inference thresholds for this function are pre-determined by simulation; see Ref. 42.
• Region-wise (this paper): Our proposed method from Section 2, again with FDR level 0.05, $κ = 0.01$, and $α = 0.05$ in Step 4.
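The region-level decision of the voxel-wise baseline can be sketched as below. The text describes the count criterion both as "larger than $1,000 \cdot κ$" and as "at least 10"; the sketch follows the "at least" reading, which is an assumption. The function name and its arguments are hypothetical.

```python
def region_active_voxelwise(p_contrast2_screened, region_size=1000,
                            kappa=0.01, alpha=0.05):
    """Voxel-wise baseline: is the region declared significantly activated?

    p_contrast2_screened: Contrast 2 p-values of the s_j voxels in the
    region that survived the Contrast 1 screening.
    """
    s_j = len(p_contrast2_screened)
    if s_j == 0:
        return False                      # nothing was screened in this region
    threshold = alpha / s_j               # Bonferroni over the screened voxels
    n_significant = sum(1 for p in p_contrast2_screened if p < threshold)
    # Assumption: "at least kappa * region_size" significant voxels.
    return n_significant >= kappa * region_size


# Toy region: 100 screened voxels, 12 of them with very small p-values.
active = region_active_voxelwise([1e-6] * 12 + [0.5] * 88)
```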

3.3 Results

We have assessed the type I and the type II error behavior of the three methods described in the previous subsection by calculating (for each of the three methods) the proportion of simulation runs in which any region without true activation has been declared significantly activated (type I error component) as well as the proportion of simulation runs in which the one region with true activation has not been declared significantly activated (type II error component). Figure 1 summarizes the results of our computer simulations. The horizontal axis in Figure 1 refers to the signal-to-noise ratio (SNR).
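The two error components can be estimated from the simulation output as in the following sketch, assuming that each run is summarized by the set of region labels declared significant. The data layout and the names are assumptions made for illustration.

```python
def error_rates(declared_per_run, true_region):
    """Estimate type I and type II error components over simulation runs.

    declared_per_run: one set of region labels per run, containing the
                      regions declared significantly activated.
    true_region:      label of the single region with true activation.
    """
    n_runs = len(declared_per_run)
    if n_runs == 0:
        raise ValueError("no simulation runs given")
    # Type I component: any region without true activation was declared.
    type1 = sum(1 for declared in declared_per_run
                if any(j != true_region for j in declared)) / n_runs
    # Type II component: the truly activated region was missed.
    type2 = sum(1 for declared in declared_per_run
                if true_region not in declared) / n_runs
    return type1, type2


# Four toy runs with region 1 as the truly activated region.
rates = error_rates([{1}, {1, 2}, set(), {3}], true_region=1)
```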

Figure 1. Type-I- and Type-II-errors for different signal-to-noise ratios (SNR) for the three methods under consideration: Using voxel-wise inference (VOXEL), using a cluster-based method (CLUSTER), and using the method from this paper (REGION).

Table 2. The region of interest in BA21 identified in Ref. 47 consists of 2844 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. Aparc labels in bold are evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

Aparc label	# Voxels overlapping with BA21	% of BA21 region of interest	# Voxels of entire Aparc label	% of Aparc label in BA21
`Left-Cerebral-White-Matter`	634	22%	210620	0.3%
`ctx lh G temporal middle`	617	22%	7515	8.2%
`ctx lh S temporal sup`	512	18%	9379	5.5%
`ctx lh S temporal inf`	85	3%	2499	3.4%
…	…	…	…	…

From Figure 1 it becomes apparent that all three methods reliably protect against type I errors under our data-generating model. However, the power of the proposed method appears to be considerably larger than that of the voxel-wise method. For example, during our 500 simulation runs at $SNR = 1.5$, our new method as well as the cluster-wise method always detected the truly activated region, while the voxel-wise method detected it only in 367 out of the 500 runs. As far as a power comparison between our proposed method and the cluster-wise method is concerned, our simulations indicate that the proposed method exhausts the significance level ($α = 0.05$) better than the cluster-wise method in the sense that its estimated type I error probability is larger. Due to the structure of the decision rule, this automatically also implies higher (more precisely: non-smaller) power, meaning that the type II error probability of the proposed method is upper-bounded by the type II error probability of the cluster-wise method. Under our simulation settings with $SNR \geq 1.5$, both procedures have identical (and perfect) power, but for $SNR \in [0.8, 1.1]$ the proposed method exhibits strictly larger power in comparison with the cluster-wise method. The difference in power can be substantial (up to $\approx 10\%$ more power of the proposed method) for moderately large (and thus realistic) values of the SNR.

4. Real data analysis

We re-used the data from Ref. 48 and compared our results with those obtained by additionally utilizing Ref. 47 as a pre-study in the sense of Strategy (ii) outlined in the introduction.

4.1 Previous findings

In the two previous studies mentioned before, Siegmund et al. used similar analysis processes. They used BrainVoyager™ QX 2.8.4 [1]. The anatomical scans were transformed into the Talairach brain to account for differences in brain size; cf. Ref. 52. They preprocessed the functional data of both studies with a standard pipeline of: 3-D motion correction, slice-scan-time correction, and temporal filtering. In addition, they applied a spatial smoothing with a Gaussian filter (FWHM = 4 mm).

In the first study (Ref. 47), the random-effects GLM revealed five brain areas (BAs 6, 21, 40, 44, 47) with significant activation for the contrast Bottom-Up Comprehension versus Control condition, i.e., the syntax task. In the second study, the same contrast with a cluster-based inference no longer revealed any significant areas, likely due to the reduced statistical power of five instead of twelve bottom-up comprehension tasks per session. Thus, the authors ran a regions-of-interest analysis on the data of the second study, restricted to the activation clusters identified in the first study. This resulted in a significantly stronger activation for Bottom-Up Comprehension versus Syntax in BAs 21, 40, and 44.

4.2 Data export and preparation for re-analysis

In Figure 2, we illustrate the overall process for the re-analysis of the data from Ref. 48. For our re-analysis, we exported the already pre-processed data from the second study. We did not use the data of the first study as our method does not need prior definitions of regions of interest. We used BrainVoyager to access the already computed GLM-analysis data from Siegmund et al. Specifically, we exported the statistical values (i.e., t-scores, p-values) of the obtained brain activation on a voxel basis for each participant. The voxel resolution is the one used by BrainVoyager, i.e., a 1 mm interpolated resolution.

Figure 2. Illustration of processing of the experimental data.

The box in gold on the left indicates analysis steps that have already been performed in Siegmund et al. (Ref. 48). The box in turquoise on the right indicates the processing steps proposed in this paper.

In addition, we used FreeSurfer to segment and parcellate the brain of each participant based on their anatomical scan; cf. Refs. 18, 19. We used the Destrieux cortical atlas to assign Aparc labels on an individual participant basis (see Ref. 12) as a suitable region definition. Next, we used Nipype (see Refs. 17, 23) to convert the FreeSurfer labels to a BrainVoyager-readable format.

Our last step annotated the exported functional data with the individual anatomical labels for each participant. We removed all functional voxels for coordinates that had no assigned Aparc label, which typically are voxels that are not considered gray matter.
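The annotation and filtering step described above can be sketched as follows. This is a minimal illustration only: the record types, the function name `annotate_voxels`, and the toy coordinates are hypothetical and are not taken from the BrainVoyager export or from the accompanying software. The sketch assumes that the functional statistics and the anatomical labels of one participant share the same voxel coordinate grid.

```python
from typing import Dict, List, NamedTuple, Tuple

Coord = Tuple[int, int, int]  # voxel coordinates (x, y, z), hypothetical layout


class VoxelStat(NamedTuple):
    coord: Coord
    t_score: float
    p_value: float


class LabeledVoxel(NamedTuple):
    coord: Coord
    t_score: float
    p_value: float
    label: str


def annotate_voxels(
    stats: List[VoxelStat], labels: Dict[Coord, str]
) -> List[LabeledVoxel]:
    """Attach an anatomical label to each voxel and drop unlabeled voxels."""
    annotated = []
    for voxel in stats:
        label = labels.get(voxel.coord)
        if label is None:
            # No Aparc label at this coordinate (typically not gray matter).
            continue
        annotated.append(
            LabeledVoxel(voxel.coord, voxel.t_score, voxel.p_value, label)
        )
    return annotated


# Toy data for one participant (values are made up for illustration).
stats = [
    VoxelStat((10, 20, 30), 3.2, 0.002),
    VoxelStat((11, 20, 30), 0.4, 0.690),  # has no label and will be dropped
    VoxelStat((50, 60, 70), 2.1, 0.040),
]
labels = {
    (10, 20, 30): "ctx_lh_G_temporal_middle",
    (50, 60, 70): "ctx_lh_G_pariet_inf-Angular",
}
annotated = annotate_voxels(stats, labels)
for voxel in annotated:
    print(voxel.label, voxel.p_value)
```

Keeping the label next to each voxel-level p-value makes the later grouping by region a simple lookup per participant.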

4.3 Results

Figure 3 displays the six brain regions that our methodology declared significantly associated with the task at hand (at FWER level $\alpha = 5\%$). These overlap with the results of Ref. 48, which utilized prior knowledge, but there are also some differences, which we discuss in detail below. We visualize the confirmed network of brain activation from Siegmund et al. in Figure 4 and our identified network of significantly activated Aparc labels in Figure 5. Figure 3 and Figure 4 show overlapping results with regard to Brodmann area 21, which covers the middle and inferior temporal gyrus (separated by the inferior temporal sulcus). However, we observed differences regarding smaller brain regions. We compare the replication efforts of Siegmund et al. with our results in Section 5.1.

Figure 3. The six significant brain regions (at FWER level α = 5%).

Activated brain areas are particularly in the middle and inferior temporal lobe.

Figure 4. Network of left-lateralized confirmed brain areas. In Ref. 48, BAs 21, 40, 44 were found activated during program comprehension.

Figure 5. Results of our analysis with significantly activated Aparc labels.

In each row, each point corresponds to the Aparc p-value $p_{ij}^{APARC}$ of one study participant $i$, where the index $j$ refers to the area indicated by the code at the beginning of the row.

Our method found three Aparc labels in the inferior and middle temporal gyrus in the left hemisphere. Siegmund et al. found their largest and most robust activation cluster in BA21 of the left hemisphere, which covers several gyri in the temporal lobe. These left temporal gyri are often associated with semantic processing of natural language, which is typically left-lateralized for right-handed participants. In the context of programming, the activation is believed to be responsible for extracting the meaning of individual identifiers and symbols during program comprehension. We found two further Aparc labels bilaterally in the inferior temporal gyrus and the anterior collateral sulci, which are both in the temporal lobe as well.

For each of these six regions indexed by $j$, we display all subject-specific regional p-values $\{p_{ij}^{REGION} : 1 \leq i \leq n\}$, where $n$ is the number of study participants. For each $j$ considered in Figure 5, it can clearly be observed that not a single extreme outlier (one very small p-value corresponding to one individual subject) is responsible for the statistical significance with respect to the combination test statistic $T_{j}$, but that the combined information contributed by all $n$ subjects supports our statistical conclusions. Regional p-values $p_{ij}^{REGION} \equiv 1$ can occur if none of the voxels belonging to region $j$ has been selected in Steps 1 and 2 described before for a certain subject $i$. By construction of $T_{j}$, this essentially means that the “effective sample size” for such a region is reduced, while the number of degrees of freedom for the null distribution of $T_{j}$ remains unchanged.
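The effect of regional p-values equal to 1 can be made concrete with a small sketch. It assumes that $T_j$ is the Fisher combination statistic $T_j = -2 \sum_{i=1}^{n} \log p_{ij}^{REGION}$, with a chi-square null distribution on $2n$ degrees of freedom under independence across subjects. The function names and the example p-values are hypothetical and are not taken from the accompanying software.

```python
import math
from typing import Sequence, Tuple


def chi2_sf_even_df(x: float, n: int) -> float:
    """Survival function of a chi-square distribution with 2n degrees of freedom.

    For even degrees of freedom the closed form
    exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k! holds.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combination(p_values: Sequence[float]) -> Tuple[float, float]:
    """Return the Fisher statistic and its p-value; inputs must lie in (0, 1]."""
    n = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, n)


# Two informative subjects only.
stat_2, p_2 = fisher_combination([0.01, 0.02])

# The same two subjects plus two subjects whose regional p-value equals 1.
stat_4, p_4 = fisher_combination([0.01, 0.02, 1.0, 1.0])

print(stat_2, p_2)
print(stat_4, p_4)
```

In this toy example the two statistics coincide because $\log 1 = 0$, but the combined p-value is larger in the second case since the reference distribution has $2 \cdot 4$ instead of $2 \cdot 2$ degrees of freedom.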

5. Discussion

5.1 Statistical sensitivity

This paper introduces a new strategy to analyze fMRI data and demonstrates its performance by drawing a comparison to two studies by Siegmund et al. In their second study, they investigated a new research question based on a design with reduced statistical power regarding the network of brain areas activated during bottom-up program comprehension. Only by using prior knowledge of the location of the clusters identified in the first study, and by conducting a regions-of-interest analysis with increased statistical power, were they able to exceed standard thresholds of statistical significance, and then only for the subset of areas identified in the first study.

Research that similarly aims to detect small differences between cognitive processes with state-of-the-art methods would need to increase statistical power, e.g., by conducting two studies: first, to identify the network of brain areas involved in the overarching cognitive process, and second, to identify differential effects within the activated brain areas. The method of utilizing anatomically predefined regions of interest as determined in each individual brain is a candidate for a more sensitive analysis as compared to statistical testing of single voxels with identical coordinates of a template brain. This is possibly due to allowing more flexibility with respect to the exact anatomical location of functional units across different brains. Our example of using ROIs from the Harvard-Oxford atlas can easily be replaced in future applications by brain parcellation schemes that are based on more refined anatomical and/or functional databases. We demonstrated by simulations that our method can accurately detect signals based on region aggregation and outperforms the standard mass-univariate approach while achieving a performance similar to that of cluster-based inference, without the need to define thresholds based on data smoothness via simulations. Further, we demonstrated on a real fMRI dataset that our method can find significantly activated brain areas without relying on prior knowledge. Moreover, additional brain areas were identified which have been described in two fMRI studies of programmers and interpreted as being involved in visuo-spatial processing. One study investigated manipulating data structures, which shares similarities with spatial rotation (see Ref. 27), and the other investigated writing program code (see Ref. 29). Since the study by Siegmund et al. (2014) was the first study on program comprehension, they used a rather small FDR-corrected significance level ($p < 0.01$ as compared to the more common $p < 0.05$). Possibly this is one reason why the two areas identified by the current approach were not identified there and, as a consequence, could not emerge in Siegmund et al. (2017). Thus, our current approach seems valuable in cases where new research questions are explored and little to no prior work is available, and which would require careful statistical hypothesis testing to initially minimize false positives.

However, the inferior frontal gyrus (BA 44) and the inferior parietal lobule (with BA 40), shown to be significant in Siegmund et al., were not significant with our method. Therefore, we investigated this difference using the activation cluster in BA 40 as an example. In Figure 6, we display a histogram of p-values for the (statistically significant) Aparc label in the middle temporal gyrus, in which a majority of the participants contribute small p-values. In contrast, in Figure 7, we display a histogram of p-values for the inferior parietal lobule. Across the entire group, we also observe an accumulation of very small p-values, but only three of eleven participants contribute to this. Unlike the regions-of-interest analysis done by Siegmund et al., our method, which combines Aparc p-values by Fisher’s method, does not reject hypotheses with such a p-value distribution.

Figure 6. Histogram of p-values of the `ctx_lh_G_temporal_middle` label, which is evaluated as significantly activated.

The plot without a heading is across all participants, while the eleven plots with headings show the distribution across individual participants. The majority of participants contribute to the significance.

Figure 7. Histogram of p-values of the `ctx_lh_G_pariet_inf-Angular` label, which is not evaluated as significantly activated.

The plot without a heading is across all participants, while the eleven plots with headings show the distribution across individual participants. While there are many small p-values, only a few participants contribute these small values.

By transforming the Aparc labels into Talairach space, we observed that the voxels included in the activation cluster of BA 40 do not perfectly align with the Aparc label that should cover this brain area, i.e., the inferior angular gyrus of the parietal lobe. Table 3 shows that the overlap with the activation cluster in BA 40 is less than 50% and that only around 5% of the voxels of each overlapping parietal Aparc label lie within the cluster. In contrast, Table 2 shows that at least 75% of the voxels of the BA 21 cluster are aligned to Aparc labels, of which two are significant with our method. Thus, a small activation cluster within a large anatomical region may get lost with our approach. We further discuss this potential drawback of the anatomical segmentation used here, and possible remedies, in Section 5.3.

Table 3. The region of interest in BA40 identified in Ref. 47 consists of 1777 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. No Aparc label is evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

Aparc label	# Voxels overlapping with BA40	in % of BA40 cluster	# Voxels of entire Aparc label	in % of Aparc label in BA40
`ctx_lh_G_pariet_inf-Supramar`	304	17.1%	6318	4.3%
`ctx_lh_G_pariet_inf-Angular`	290	16.3%	4975	5.1%
`Left-Cerebral-White-Matter`	157	8.9%	178875	0.1%
…	…	…	…	…
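The two percentage columns of such an overlap table can be computed as sketched below. This is an illustration under simplifying assumptions: the activation cluster and each label are represented as sets of voxel coordinates in a common space, both sets are non-empty, and the function name `overlap_table` and the toy voxel sets are hypothetical. The numbers in the sketch are not the study data.

```python
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int, int]


def overlap_table(
    cluster: Set[Coord], label_voxels: Dict[str, Set[Coord]]
) -> List[dict]:
    """Summarize how each anatomical label overlaps with an activation cluster."""
    rows = []
    for name, voxels in label_voxels.items():
        shared = len(cluster & voxels)
        rows.append(
            {
                "label": name,
                "n_overlap": shared,
                # Share of the cluster that falls into this label.
                "pct_of_cluster": 100.0 * shared / len(cluster),
                "n_label": len(voxels),
                # Share of the label that lies inside the cluster.
                "pct_of_label": 100.0 * shared / len(voxels),
            }
        )
    rows.sort(key=lambda row: row["n_overlap"], reverse=True)
    return rows


# Toy example: a cluster of 10 voxels and two much larger labels.
cluster = {(x, 0, 0) for x in range(10)}
label_voxels = {
    "label_A": {(x, 0, 0) for x in range(6, 46)},  # 40 voxels, 4 in cluster
    "label_B": {(x, 0, 0) for x in range(3)}
    | {(x, 1, 0) for x in range(17)},  # 20 voxels, 3 in cluster
}
rows = overlap_table(cluster, label_voxels)
for row in rows:
    print(row)
```

A label can cover a sizable part of a cluster while the cluster still makes up only a small fraction of that label, which is the situation described for BA 40.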

From the methodological point of view, our main contribution consists in a novel way of combining evidence: Instead of aggregating single-voxel data over all participants by mapping them to a standard brain template, we define subject-specific regions and combine the evidence on the level of these regions by means of a combination function. This is a generic methodological approach which is not restricted to the specific study setup of Siegmund et al., but can be applied to essentially any fMRI study design, involving an arbitrary number of contrasts.

Furthermore, the (final) combination step of our proposed approach is also generic in the sense that, instead of the Fisher combination function, any other (appropriate) combination function for p-values may be used. Recently, there has been a renewed interest in p-value combination methods; see, e.g., Ref. 57 (with discussion), Ref. 53, and Ref. 54.
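To illustrate how the combination step could be exchanged, the following sketch implements two simple merging rules based on averaging, in the spirit of Ref. 53: twice the arithmetic mean and $e$ times the geometric mean of the p-values. To our understanding, both remain valid under arbitrary dependence between the p-values, at the price of being more conservative than Fisher's method under independence. The function names are hypothetical, and the sketch is not part of the accompanying software.

```python
import math
from typing import Sequence


def merge_arithmetic(p_values: Sequence[float]) -> float:
    """Twice the arithmetic mean of the p-values, capped at 1."""
    mean = sum(p_values) / len(p_values)
    return min(1.0, 2.0 * mean)


def merge_geometric(p_values: Sequence[float]) -> float:
    """e times the geometric mean of the p-values, capped at 1.

    Inputs must lie in (0, 1] so that the logarithm is defined.
    """
    log_mean = sum(math.log(p) for p in p_values) / len(p_values)
    return min(1.0, math.e * math.exp(log_mean))


# Hypothetical regional p-values of three subjects for one region.
example = [0.01, 0.02, 0.03]
merged_arithmetic = merge_arithmetic(example)
merged_geometric = merge_geometric(example)
print(merged_arithmetic, merged_geometric)
```

Either function could take the place of the Fisher combination for a given region, with the resulting p-value then entering the multiple test across regions.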

It is also important to note that our results are based on an fMRI study with only 11 participants, which poses a threat to external validity. While this number might appear low, it is in line with similar studies in the domain of software engineering (Refs. 10, 17, 23, 39), where a smaller but more homogeneous sample can be beneficial, particularly with regard to confounding factors related to programmer expertise, experience, and demographics (Refs. 49, 55). Nevertheless, future work should replicate our approach on datasets with a larger sample size and from different domains.

5.2 Comparison to related fMRI analysis methods

Under the multiple testing framework, testing of grouped null hypotheses with (potential) application in fMRI is an active research topic. In Ref. 25 as well as in Ref. 4, clustering techniques were employed to define regions of interest, and the authors incorporated the heterogeneous cluster sizes in a weighting scheme for the linear step-up test from Ref. 6. In the same vein, the authors of Ref. 26 as well as Ref. 61 made use of the different proportions of true null hypotheses in each of the groups in their proposed weighting. A Bayesian variant of this idea has been derived in Ref. 32. Hierarchical methods, which exclude groups without strong evidence for the presence of signals in several stages of data analysis, have been worked out by several researchers; see, e.g., Refs. 3, 45 and 60. However, these methods rely on combining the subject-specific data on the voxel level, which is a standard technique as mentioned, for instance, in Section 5 of Ref. 31. Also on the basis of (combined) voxel data, a hierarchical independent component analysis for the comparison of brain functional networks has been proposed in Ref. 46. To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea.

It is worth mentioning that both our proposed methodology and the methods from previous literature that we are considering in our comparisons make certain assumptions about the data: First, a linear relationship between the features encoded in the design matrix and the response is assumed. Thus, non-linear effects cannot (fully) be captured by models of the type (Eq. 1). Second, Gaussianity of error terms is assumed, implying symmetry and light tails, among other things. It can be of interest to validate or to test these assumptions for a concrete dataset at hand. However, since this aspect is not the main focus of our present work, we defer such investigations to future research. In particular, modeling the data that we have re-analyzed in this work with different model assumptions would diminish the comparability of our data analysis results with the results from previous analyses. One simple (numerical) robustness analysis regarding the Gaussianity assumption can be performed by simulating data with a different error distribution and comparing the results with the results presented here. To this end, the source code which is available as supplementary material for this article may be helpful for practical implementation.

Even though the majority of fMRI studies still use GLM-based analyses, which require corrections for multiple comparisons, we would like to note that statistical analyses more advanced than voxel-wise regression are increasing in popularity; see, e.g., Ref. 36 for a recent approach, or multi-voxel pattern analysis (MVPA) introduced by Haxby and colleagues (see Ref. 24 and the papers cited therein). MVPA predicts stimulus event categories from the relative changes in activation across a set of voxels. Such a set of voxels is extracted in the first step of the analysis, feature selection. The second step partitions the data into a training set and a testing set, which are entered into a pattern classification algorithm. From a statistical point of view, this requires a sufficiently large number of comparable events. In the experiments of Ref. 47, the events are code snippets that must be read and understood within 60 seconds to allow for a certain complexity of the software code; they were thus limited in number to fit into the duration of a typical fMRI session of about 45 minutes. Such a limited number of events may pose difficulties for MVPA cross-validation. Such a low number of rather long events is, however, sufficient for general linear model (GLM) analyses of block-design experiments, which typically yield strong detection power (albeit low discrimination power).

Moreover, as functional activation is spatially distributed over several voxels rather than focused in single ones, cluster-based inference has become rather standard; cf. Ref. 30. It is implemented and used in all major software packages. However, the recent discussion in Ref. 16 revealed its flaws in practical situations. Furthermore, cluster-based inference requires simulation according to the smoothness of the data at hand.

The experiments we used to validate our approach employed a classical hypothesis-driven design, which is widely used in the literature, to unravel complex cognitive processing. They were constructed such that perceptual and cognitive processes that are necessary but not specific for understanding software code were controlled by specific test conditions, either requiring participants to read the same code with a different task or controlling for attentional demands. Furthermore, understanding software code is a highly idiosyncratic process, and therefore identification of brain activity using control conditions within individual subjects is possibly more feasible as a first step towards identifying the most relevant brain areas involved in such a complex cognitive process. Thus, our method can serve as a valuable alternative that uses spatial aggregation while still being fast and easy to implement.

5.3 Outlook: from anatomical to functional aggregation

Instead of utilizing a single-voxel GLM group analysis in common brain templates, we used a parcellation of the brain of each individual participant into regional labels, here Aparc labels, before aggregation into the group analysis. This procedure provides more labels than a traditional Brodmann atlas. Still, some of the regions are very large and thus presumably contain several functional areas. There is ongoing research to subdivide the brain based on cytoarchitectonic details, e.g., the Jülich-Brain (see Ref. 1). This will provide more detailed parcellation schemes in the future, which could easily be integrated into our proposed method. Another refinement could be to use functionally defined brain regions for our presented methodology. A study that implements functional localizers could identify participant-specific functional maps of the brain for well-defined standardized tasks (e.g., see Ref. 35). Then, in the analysis, our presented methodology can aggregate across all participant-specific brains with less imprecision than traditional methods. We would like to note that such functional localizers are currently restricted to research areas that include brain regions with specific functional specialization; cf. Refs. 22 and 44. The studies we presented in this paper are concerned with a rather complex cognitive task. Moreover, understanding program code is highly individual since different programmers rely on different comprehension strategies based on their preferences, domain knowledge, and experience (e.g., Refs. 8, 9 and 50).

Ethical considerations

Our paper did not acquire any new data with human participants. The original studies of Siegmund et al. were approved by the responsible ethics board of the University of Magdeburg (Application: 87/14).

Data availability

Weierstrass Institute for Applied Analysis and Stochastics publication server: Utilizing anatomical information for signal detection in functional magnetic resonance imaging - Data. https://doi.org/10.20347/wias.data.9 (Ref. 38).

This project contains the underlying data:

• pvalueFromRealData.csv (p-value data mapped onto Aparc labels)

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

Source code available from: https://github.com/ktabelow/HierarchicalFMRI/

Archived software available from: https://zenodo.org/records/17097938

License: GNU General Public License v3.0

Acknowledgements

We thank André Neumann for collaborating with us on a previous version of this article (available as preprint Ref. 34) and Jörg Stadler for his technical support in data processing with Nipype and FreeSurfer.

References

1. Amunts K, Mohlberg H, Bludau S, et al.: Julich-brain: A 3d probabilistic atlas of the human brain’s cytoarchitecture. Science. 2020; 369: 988–992.
2. Andreella A, Feilong M, Halchenko Y, et al.: A Statistical Approach to the Alignment of fMRI Data. Book of Short Papers, SIS 2020. Pollice A, Salvati N, Spagnolo FS, editors. 2020; pp. 733–738.
3. Benjamini Y, Bogomolov M: Selective inference on multiple families of hypotheses. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014; 76: 297–318.
4. Benjamini Y, Heller R: False discovery rates for spatial signals. J. Am. Stat. Assoc. 2007; 102: 1272–1281.
5. Benjamini Y, Heller R: Screening for partial conjunction hypotheses. Biometrics. 2008; 64: 1215–1222.
6. Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 1995; 57: 289–300.
7. Brodmann K: Vergleichende Lokalisationslehre der Großhirnrinde in ihren Prinzipien dargestellt auf Grund des Zellbaues. Leipzig: Barth; 1909.
8. Brooks R: Using a behavioral theory of program comprehension in software engineering. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 1978; pp. 196–201.
9. Brooks R: Towards a theory of the comprehension of computer programs. Int. J. Man-Mach. Stud. 1983; 18: 543–554.
10. Castelhano J, Duarte I, Ferreira C, et al.: The Role of the Insula in Intuitive Expert Bug Detection in Computer Code: An fMRI Study. Brain Imaging Behav. 2018: 1–15.
11. Desikan R, Ségonne F, Fischl B, et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006; 31: 968–980.
12. Destrieux C, Fischl B, Dale A, et al.: Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage. 2010; 53: 1–15.
13. Dickhaus T: Simultaneous statistical inference with applications in the life sciences. Berlin Heidelberg: Springer-Verlag; 2014.
14. Dudoit S, van der Laan M: Multiple testing procedures with applications to genomics. Springer Series in Statistics. New York, NY: Springer; 2008.
15. Duraes J, Madeira H, Castelhano J, et al.: Understanding the Brain at Software Debugging. Proceedings International Symposium Software Reliability Engineering (ISSRE). 2016; pp. 87–92.
16. Eklund A, Nichols TE, Knutsson H: Cluster failure: Why fmri inferences for spatial extent have inflated false-positive rates. Proc. Natl. Acad. Sci. 2016; 113: 7900–7905.
17. Esteban O, Markiewicz CJ, Burns C, et al.: nipy/nipype: 1.5.0. 2020.
18. Fischl B, Salat DH, Busa E, et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron. 2002; 33: 341–355.
19. Fischl B, Van Der Kouwe A, Destrieux C, et al.: Automatically parcellating the human cerebral cortex. Cereb. Cortex. 2004; 14: 11–22.
20. Floyd B, Santander T, Weimer W: Decoding the Representation of Code in the Brain: An fMRI Study of Code Review and Expertise. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 2017; pp. 175–186.
21. Forman S, Cohen J, Fitzgerald M, et al.: Improved assessment of significant activation in functional magnetic resonance imaging (fmri): use of a cluster-size threshold. Magn. Reson. Med. 1995; 33: 636–647.
22. Friston K, Rotshtein P, Geng J, et al.: A critique of functional localisers. NeuroImage. 2006; 30: 1077–1087.
23. Gorgolewski K, Burns CD, Madison C, et al.: Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Front. Neuroinform. 2011; 5: Article 13.
24. Haxby JV: Multivariate pattern analysis of fMRI: The early beginnings. NeuroImage. 2012; 62: 852–855.
25. Heller R, Stanley D, Yekutieli D, et al.: Cluster-based analysis of fMRI data. NeuroImage. 2006; 33: 599–608.
26. Hu J, Zhao H, Zhou H: False discovery rate control with groups. J. Am. Stat. Assoc. 2010; 105: 1215–1227.
27. Huang Y, Liu X, Krueger R, et al.: Distilling neural representations of data structure manipulation using fMRI and fNIRS. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 2019; pp. 396–407.
28. Jarmasz M, Somorjai R: Exploring regions of interest with cluster analysis (EROICA) using a spectral peak statistic for selecting and testing the significance of fMRI activation time-series. Artif. Intell. Med. 2002; 25: 45–67.
29. Krueger R, Huang Y, Liu X, et al.: Neurological divide: An fMRI study of prose and code writing. Proceedings of International Conference on Software Engineering (ICSE). 2020; pp. 678–690.
30. Lazar N: The statistical analysis of functional MRI data. Statistics for Biology and Health. Springer; 2008.
31. Lindquist M: The statistical analysis of fMRI data. Stat. Sci. 2008; 23: 439–464. MR 2530545.
32. Liu Y, Sarkar S, Zhao Z: A new approach to multiple testing of grouped hypotheses. J. Statist. Plann. Inference. 2016; 179: 1–14. MR 3550875.
33. Makris N, Goldstein JM, Kennedy D, et al.: Decreased volume of left and total anterior insular lobule in schizophrenia. Schizophr. Res. 2006; 83: 155–171.
34. Neumann A, Peitek N, Brechmann A, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. WIAS Preprint No. 2806. 2021.
35. Nieto-Castañón A, Fedorenko E: Subject-specific functional localizers increase sensitivity and functional resolution of multi-subject analyses. NeuroImage. 2012; 63: 1646–1669.
36. Noble S, Mejia AF, Zalesky A, et al.: Improving power in functional magnetic resonance imaging by moving beyond cluster-level inference. Proc. Natl. Acad. Sci. USA. 2022; 119: e2203020119.
37. Ombao H, Lindquist M, Thompson W, et al.: Handbook of Neuroimaging Data Analysis. New York: CRC Press; 2016.
38. Peitek N, Brechmann A, Tabelow K, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. Data. 2024.
39. Peitek N, Apel S, Parnin C, et al.: Program comprehension and code complexity metrics: An fMRI study. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE; 2021; pp. 524–536.
40. Pennington N: Stimulus structures and mental representations in expert comprehension of computer programs. Cogn. Psychol. 1987; 19: 295–341.
41. Poldrack RA, Mumford JA, Nichols TE: Handbook of functional MRI data analysis. Cambridge University Press; 2011.
42. Polzehl J, Tabelow K: Magnetic Resonance Brain Imaging: Modelling and Data Analysis Using R. Springer International Publishing; 2023.
43. Rosenblatt J, Finos L, Weeda W, et al.: All-resolutions inference for brain imaging. NeuroImage. 2018; 181: 786–796.
44. Saxe R, Brett M, Kanwisher N: Divide and conquer: A defense of functional localizers. NeuroImage. 2006; 30: 1088–1096.
45. Schildknecht K, Tabelow K, Dickhaus T: More specific signal detection in functional magnetic resonance imaging by false discovery rate control for hierarchically structured systems of hypotheses. PLoS One. 2016; 11: 1–21.
46. Shi R, Guo Y: Investigating differences in brain functional networks using hierarchical covariate-adjusted independent component analysis. Ann. Appl. Stat. 2016; 10: 1930–1957. MR 3592043.
47. Siegmund J, Kästner C, Apel S, et al.: Understanding understanding source code with functional magnetic resonance imaging. Proceedings International Conference on Software Engineering (ICSE). ACM; 2014; pp. 378–389.
48. Siegmund J, Peitek N, Parnin C, et al.: Measuring neural efficiency of program comprehension. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE). New York, NY, USA: Association for Computing Machinery; 2017; pp. 140–150.
49. Siegmund J, Schumann J: Confounding parameters on program comprehension: a literature survey. Empir. Softw. Eng. 2015; 20: 1159–1192.
50. Soloway E, Ehrlich K: Empirical studies of programming knowledge. IEEE Trans. Softw. Eng. 1984; 10: 595–609.
51. Tabelow K, Polzehl J: Statistical parametric maps for functional mri experiments in R: The package fmri. J. Stat. Softw. 2011; 44: 1–21.
52. Talairach J, Tournoux P: Co-planar stereotaxic atlas of the human brain. Thieme; 1988.
53. Vovk V, Wang R: Combining p-values via averaging. Biometrika. 2020; 107: 791–808.
54. Vovk V, Wang R: E-values: Calibration, combination, and applications. Ann. Stat. 2021; 49: 1736–1754.
55. Wagner S, Wyrich M: Code comprehension confounders: A study of intelligence and personality. IEEE Trans. Softw. Eng. 2021; 48: 4789–4801.
56. Welvaert M, Durnez J, Moerkerke B, et al.: neuRosim: An R package for generating fmri data. J. Stat. Softw. 2011; 44: 1–18.
57. Wilson D: The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. 2019; 116: 1195–1200.
58. Worsley K: Local maxima and the expected Euler characteristic of excursion sets of χ², f and t fields. Adv. Appl. Probab. 1994; 26: 13–42.
59. Worsley K, Marrett S, Neelin P, et al.: A unified statistical approach for determining significant signals in images of cerebral activation. Hum. Brain Mapp. 1996; 4: 58–73.
60. Yekutieli D: Hierarchical false discovery rate-controlling methodology. J. Am. Stat. Assoc. 2008; 103: 309–316.
61. Zhao H, Zhang J: Weighted p-value procedures for controlling FDR of grouped hypotheses. J. Stat. Plann. Inference. 2014; 151-152: 90–106.

Footnotes

1 Brain Innovation BV, Maastricht, The Netherlands, http://brainvoyager.com

Competing interests

No competing interests were disclosed.

Grant information

Financial support by the Deutsche Forschungsgemeinschaft (DFG) via grant DI 1723/3-2 is gratefully acknowledged. Brechmann’s work is supported by DFG grant BR 2267/7-2.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (2)

version 2

Revised

Published: 25 Mar 2026, 14:1019

https://doi.org/10.12688/f1000research.166549.2

version 1

Published: 01 Oct 2025, 14:1019

https://doi.org/10.12688/f1000research.166549.1

© 2026 Peitek N et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Peitek N, Brechmann A, Tabelow K and Dickhaus T. Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 2; peer review: 2 approved, 1 not approved]. F1000Research 2026, 14:1019 (https://doi.org/10.12688/f1000research.166549.2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations: A number of small changes, sometimes more significant revisions are required to address specific details and improve the paper's academic merit.

Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Version 2 (published 25 Mar 2026, revised)

Reviewer Report 30 Mar 2026

Benedikt Sundermann, Universitätsmedizin Oldenburg, Oldenburg, Germany; Evangelisches Krankenhaus Oldenburg (Ringgold ID: 84511), Oldenburg, Lower Saxony, Germany; University of Münster, Münster, Germany

Approved

https://doi.org/10.5256/f1000research.194983.r470520

I Approved. Thank you ...

Reviewer Report 27 Mar 2026

Qiran Jia, Division of Biostatistics and Health Data Science, University of Southern California, Los Angeles, California, USA

Approved

https://doi.org/10.5256/f1000research.194983.r470521

I appreciate the authors’ careful revisions and responses, which address most of my main concerns. The revised introduction, methods, and expanded simulation section are much clearer and better organized. I also appreciate the clearer definition of the regional null hypothesis and the added simulation results with varying signal-to-noise ratios. Although I still think the simulation setting could be made more challenging beyond varying SNR alone, the current revision is sufficient for this manuscript. I also appreciate the added explanation and discussion of competing approaches, such as cluster-based inference. The software looks great, and the tutorial is very helpful. As one minor additional suggestion, it may still be helpful to include a slightly more detailed caption with simple math for each step in Figure 2, showing the purpose/basics of each of the four steps so that readers can more easily distinguish the screening, regional aggregation, and cross-subject inference components of the workflow.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Biostatistics, high-dimensional data analysis, voxel-level multiple comparison, multi-view data analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Version 1 (published 01 Oct 2025)

Reviewer Report 19 Jan 2026

Qiran Jia, Division of Biostatistics and Health Data Science, University of Southern California, Los Angeles, California, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.183550.r441581

This manuscript proposes a four-stage region-based inference workflow for signal detection in fMRI that leverages subject-specific anatomical parcellations to improve sensitivity over voxel-level multiple testing and to reduce reliance on independent “localizer” experiments. The authors apply a two-stage within-subject screening scheme, aggregate voxel-level evidence into anatomical region p-values using a partial conjunction framework, and then combine regional evidence across subjects using Fisher’s method. Simple simulations show that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline, and in the re-analysis of program comprehension fMRI data, it identifies significant regions that partially overlap prior ROI-based findings.

Overall, the manuscript is informative enough about the context, motivation, and methodology, and discusses the workflow and results in detail; however, several concerns remain, and additional work is needed to adequately justify the proposed approach.

(1) In the Introduction, the authors present a three-strategy taxonomy to motivate an alternative approach; however, the proposed strategy is introduced at length and can be easy for readers to lose track of. Thus, it would benefit from defining the strategy and stating the novelty in one crisp sentence (make sure it is understandable for nonstatistical people).

(2) Figure 1 is confusing, beyond the “Monte Carlo” typo. It appears to treat both studies as having the same number of participants, n, whereas Table 1 reports different sample sizes across the two studies. In addition, Table 1 and Listing 1 do not seem essential for the main manuscript; that information would be better used to improve Figure 1. Also, P_ij^APARC and other formulas are not presented in the introduction, so it might be inappropriate to appear here. Overall, my suggestion is that the introduction can be less technical and more approachable.

(3) The Introduction’s discussion of MVPA and cluster-based inference feels somewhat tangential and is framed largely as a dismissal of common alternatives. It would be stronger to more directly and even-handedly summarize the pros and cons of MVPA and cluster-based inference versus standard GLM-based analyses, and then clearly state which components the proposed method adopts from each (and why).

(4) Methodologically, I am comfortable with Steps 1–2 as a within-subject pre-screening procedure, but Step 3 requires further justification. In particular, the manuscript should clarify whether the partial conjunction test from Ref. 40 remains valid after restricting inference to a screened subset of voxels, and whether FWER is still controlled across the full multi-stage pipeline. I am not sure whether their way of choosing kappa and achieving FWER control automatically guarantees global FWER control in your case. I also think the interpretation of kappa is actually essential, so it might be necessary to elaborate on it more and even conduct simulations with the choice of kappa.

(5) A clearer definition of the regional hypothesis H_ij is needed. In particular, the paper should explicitly state what it means to reject region j after the four-step procedure, and the interpretation of kappa might be part of the hypothesis definition.

(6) The simulation setting is somewhat idealized. The paper would be strengthened by additional simulations that better reflect realistic fMRI conditions, such as imperfect parcellation/label noise and varying degrees of spatial smoothness.

(7) In addition, the cluster-based baseline is only evaluated in the simplified simulation and is not compared on the real dataset; the authors should more explicitly articulate the practical benefit of the proposed method, given that cluster-based inference achieves similar power in their simulation results. While the manuscript highlights the advantage of avoiding smoothness-based threshold calibration via simulation, many contemporary fMRI inference methods provide automated, data-driven spatial modeling and thresholding (including recent deep-learning-based approaches), and these should be acknowledged and discussed as relevant baselines at least.

(8) I am comfortable with the real data analysis findings, but an additional application would always be better to show generalizability. Figures 5–6 and Tables 2–4 communicate key insights, but the current presentation feels exploratory rather than indexing-ready. These results would benefit from more polished, information-dense visualizations.

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Biostatistics, high-dimensional data analysis, voxel-level multiple comparison, multi-view data analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-by-point reply to the issues raised by the reviewer.

"R3-1: This manuscript proposes a four-stage region-based inference workflow for signal detection in fMRI that leverages subject-specific anatomical parcellations to improve sensitivity over voxel-level multiple testing and to reduce reliance on independent “localizer” experiments. The authors apply a two-stage within-subject screening scheme, aggregate voxel-level evidence into anatomical region p-values using a partial conjunction framework, and then combine regional evidence across subjects using Fisher’s method. Simple simulations show that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline, and in the re-analysis of program comprehension fMRI data, it identifies significant regions that partially overlap prior ROI-based findings.

Overall, the manuscript is informative enough about the context, motivation, and methodology, and discusses the workflow and results in detail; however, several concerns remain, and additional work is needed to adequately justify the proposed approach.

(1) In the Introduction, the authors present a three-strategy taxonomy to motivate an alternative approach; however, the proposed strategy is introduced at length and can be easy for readers to lose track of. Thus, it would benefit from defining the strategy and stating the novelty in one crisp sentence (make sure it is understandable for nonstatistical people)."

Reply R3-1: Strategy (ii) mentioned in the Introduction refers to a data analysis plan in which the regions (of interest) themselves are defined based on study data. In our study, however, these regions are defined based on the FreeSurfer segmentation of the brain into aparc labels; cf. Section 2.2 for details. This APARC annotation is anatomical information which is not derived from the fMRI data themselves. In this sense, our data analysis plan is more similar to Strategy (i) than to Strategy (ii). Our methodological innovation, though, is that we do not map each subject's brain to a standard brain and perform the subsequent analysis for all subjects combined; instead, we carry out Steps 1 to 3 for each subject separately and only combine subject-specific data in Step 4. We clarify this in the revised version of the manuscript.

"R3-2: (2) Figure 1 is confusing, beyond the “Monte Carlo” typo. It appears to treat both studies as having the same number of participants, n, whereas Table 1 reports different sample sizes across the two studies. In addition, Table 1 and Listing 1 do not seem essential for the main manuscript; that information would be better used to improve Figure 1. Also, $P_{ij}^{\mathrm{APARC}}$ and other formulas are not presented in the introduction, so it might be inappropriate to appear here. Overall, my suggestion is that the introduction can be less technical and more approachable."

Reply R3-2: We corrected the typos in the Figure and the Caption and made sure that, in the published manuscript, the Figure appears only in Section 4, where it is first referenced and where all formulas from Section 2 have already been introduced. The left part of the Figure schematically describes only one study (Siegmund et al., 2017), in which functional and anatomical data were acquired for each of the n participants.
We would prefer to leave Table 1 and Listing 1 in the manuscript so that readers can evaluate the type of study without consulting the original publication. The introduction has been rewritten to be more concise and less technical.

"R3-3: (3) The Introduction’s discussion of MVPA and cluster-based inference feels somewhat tangential and is framed largely as a dismissal of common alternatives. It would be stronger to more directly and even-handedly summarize the pros and cons of MVPA and cluster-based inference versus standard GLM-based analyses, and then clearly state which components the proposed method adopts from each (and why)."

Reply R3-3: We agree that this discussion does not belong in the Introduction. We moved it to the Discussion (Section 5), where we elaborate on the pros and cons of each method.

"R3-4: (4) Methodologically, I am comfortable with Steps 1–2 as a within-subject pre-screening procedure, but Step 3 requires further justification. In particular, the manuscript should clarify whether the partial conjunction test from Ref. 40 remains valid after restricting inference to a screened subset of voxels, and whether FWER is still controlled across the full multi-stage pipeline. I am not sure whether their way of choosing kappa and achieving FWER control automatically guarantees global FWER control in your case. I also think the interpretation of kappa is actually essential, so it might be necessary to elaborate on it more and even conduct simulations with the choice of kappa."

Reply R3-4: When testing all regions $j=1,\ldots,J$ simultaneously, a multiplicity correction is indeed required. Therefore, in Step 4 we do not use the $(1-\alpha)$-quantile of $\chi^2_{2n}$ as the rejection threshold for each $T_j$, but the $(1-\alpha\kappa)$-quantile of $\chi^2_{2n}$, where $\kappa$ is approximately equal to $1/J$ (Bonferroni correction). The theory developed in Reference No. 40 allows for choosing this adjustment factor $\kappa$ slightly larger than $1/J$, but the order of magnitude of $\kappa$ is $O\left(J^{-1}\right)$. So, you may simply think of the final multiplicity correction as a Bonferroni correction with respect to the number $J$ of considered brain regions. This is a confirmatory analysis in the sense that the family-wise error rate (with respect to the family of region-specific "no activation" null hypotheses) is under control (up to the asymptotic nature of the chi-square distribution).
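
The threshold described in Reply R3-4 can be illustrated with a minimal Python sketch. It assumes that $T_j$ is Fisher's statistic $-2\sum_i \log p_{ij}$ (the reviewer's summary states that regional evidence is combined across subjects with Fisher's method) and that $\kappa = 1/J$ exactly, i.e. a plain Bonferroni correction. The function names and the data layout are hypothetical and are not taken from the authors' software.

```python
import math


def chi2_sf_even_df(x, n):
    """Survival function P(X > x) for X ~ chi-square with 2n degrees of freedom.

    For an even number of degrees of freedom the tail probability has the
    closed form exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_statistic(p_values):
    """Fisher's combination statistic; p-values must lie in (0, 1]."""
    return -2.0 * sum(math.log(p) for p in p_values)


def rejected_regions(p_matrix, alpha=0.05):
    """Return the indices of regions whose combined evidence passes the threshold.

    p_matrix[j] holds the n subject-specific p-values of region j.
    Assumption: kappa = 1/J, whereas the reply allows kappa slightly above 1/J.
    """
    kappa = 1.0 / len(p_matrix)
    rejected = []
    for j, p_values in enumerate(p_matrix):
        t_j = fisher_statistic(p_values)
        # T_j exceeding the (1 - alpha*kappa)-quantile of chi-square(2n) is
        # equivalent to the combined p-value being at most alpha*kappa.
        if chi2_sf_even_df(t_j, len(p_values)) <= alpha * kappa:
            rejected.append(j)
    return rejected


if __name__ == "__main__":
    # Two hypothetical regions, three subjects each.
    example = [[0.001, 0.002, 0.01], [0.5, 0.6, 0.7]]
    print(rejected_regions(example))
```

Comparing the combined p-value with $\alpha\kappa$ avoids computing the chi-square quantile itself, which keeps the sketch free of external dependencies.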

As far as the "screening" constituted by Step 1 is concerned, the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small) by this screening. This follows from the theory of fixed sequence multiple tests as explained, e.g., in Section 3.1.2.2 of doi.org/10.1007/978-3-642-45182-9. Technically, the p-values for all voxels which have not been declared significant in Step 1 are simply set to 1.0 in Step 2. Notice, moreover, that the number $m_j$ appearing in Step 3 of our proposed methodology refers to the total number of voxels belonging to region $j$, and not to the number of voxels which have been declared significant in Step 1. Thus, if (for instance) we have a region of total size $m_j = 100$ (voxels), and we set $\kappa = 0.5$, we require 50 small p-values in that region $j$ to declare that region significantly activated in Steps 3 and 4. Those p-values which have been set to 1.0 because the voxel did not pass the Step 1 screening do not contribute to achieving this criterion.
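
The counting argument in this example can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the partial conjunction test of Reference No. 40: the cut-off `small` for what counts as a "small" p-value, the function names, and the example numbers other than $m_j = 100$ and $\kappa = 0.5$ are hypothetical.

```python
import math


def screen(p_values, passed_step1):
    """Set the p-value of every voxel that failed the Step 1 screening to 1.0."""
    return [p if ok else 1.0 for p, ok in zip(p_values, passed_step1)]


def required_count(m_j, kappa):
    """Number of small p-values needed in a region with m_j voxels in total."""
    # The tiny offset guards against floating-point noise, e.g. in 0.1 * 30.
    return math.ceil(kappa * m_j - 1e-9)


def count_small(p_values, small=0.05):
    """Count p-values at or below the (hypothetical) cut-off `small`."""
    return sum(1 for p in p_values if p <= small)


if __name__ == "__main__":
    m_j, kappa = 100, 0.5
    raw = [0.001] * 60 + [0.8] * 40          # 60 voxels with small raw p-values
    passed = [True] * 45 + [False] * 55      # only 45 voxels survive Step 1
    screened = screen(raw, passed)
    print(required_count(m_j, kappa), count_small(screened))
```

Because `required_count` uses the total region size $m_j$ rather than the number of screened voxels, the 15 voxels with small raw p-values that failed Step 1 cannot help the region reach the required count.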

"R3-5: (5) A clearer definition of the regional hypothesis $H_{ij}$ is needed. In particular, the paper should explicitly state what it means to reject region j after the four-step procedure, and the interpretation of kappa might be part of the hypothesis definition."

Reply R3-5: If the regional null hypothesis $H_j$ is rejected in Step 4, the data provide evidence that at least $\kappa\cdot 100\%$ of voxels in region j are active during the visual presentation of code snippets. We have slightly extended the description of the method in the revised version of the manuscript.

"R3-6: (6) The simulation setting is somewhat idealized. The paper would be strengthened by additional simulations that better reflect realistic fMRI conditions, such as imperfect parcellation/label noise and varying degrees of spatial smoothness."

Reply R3-6: We extended the simulation study by varying the signal-to-noise ratio used in the data generation. This allows studying the Type I and Type II errors in a broader range of scenarios, giving insights into the statistical properties of the methods. Furthermore, we provide a tutorial and an explanation of the simulation code, which can be used to adapt the simulation to one's own purposes.

"R3-7: (7) In addition, the cluster-based baseline is only evaluated in the simplified simulation and is not compared on the real dataset; the authors should more explicitly articulate the practical benefit of the proposed method, given that cluster-based inference achieves similar power in their simulation results. While the manuscript highlights the advantage of avoiding smoothness-based threshold calibration via simulation, many contemporary fMRI inference methods provide automated, data-driven spatial modeling and thresholding (including recent deep-learning-based approaches), and these should be acknowledged and discussed as relevant baselines at least."

Reply R3-7: Thank you for highlighting this interesting aspect. We would like to clarify that a cluster-based inference was in fact attempted in the Siegmund et al. (2017) study. However, this approach did not yield any statistically significant effects in the real dataset. As a result, Siegmund et al. had to rely on an independent dataset collected in 2014 to obtain a meaningful signal. This experience was a primary motivation for developing and evaluating the method presented in this paper, since it is uncommon that a prior study exists that can be used in such a manner. We now explain this aspect more clearly in Section 4.1 of the revised submission.

"R3-8: (8) I am comfortable with the real data analysis findings, but an additional application would always be better to show generalizability. Figures 5–6 and Tables 2–4 communicate key insights, but the current presentation feels exploratory rather than indexing-ready. These results would benefit from more polished, information-dense visualizations."

Reply R3-8: We would like to stress that the purpose of this paper is the introduction of a new inference method for fMRI-related activation. The scope of the work is to outline the method, test it on already existing data, and compare the results with the previous findings for validation. Further potential neuroscientific insights are left to applications of the method to new study data.

Competing Interests: None

COMMENTS ON THIS REPORT

Author Response 25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

25 Mar 2026

Author Response

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

"R3-1: This manuscript ... Continue reading We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

"R3-1: This manuscript proposes a four-stage region-based inference workflow for signal detection in fMRI that leverages subject-specific anatomical parcellations to improve sensitivity over voxel-level multiple testing and to reduce reliance on independent “localizer” experiments. The authors apply a two-stage within-subject screening scheme, aggregate voxel-level evidence into anatomical region p-values using a partial conjunction framework, and then combine regional evidence across subjects using Fisher’s method. Simple simulations show that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline, and in the re-analysis of program comprehension fMRI data, it identifies significant regions that partially overlap prior ROI-based findings.

Overall, the manuscript is informative enough about the context, motivation, and methodology, and discusses the workflow and results in detail; however, several concerns remain, and additional work is needed to adequately justify the proposed approach.

(1) In the Introduction, the authors present a three-strategy taxonomy to motivate an alternative approach; however, the proposed strategy is introduced at length and can be easy for readers to lose track of. Thus, it would benefit from defining the strategy and stating the novelty in one crisp sentence (make sure it is understandable for nonstatistical people)."

Reply R3-1: Strategy (ii) mentioned in the Introduction refers to a data analysis plan in which the regions (of interest) themselves are defined based on study data. In our study, however, these regions are defined based on the freesurfer segmentation of the brain into aparc labels; cf. Section 2.2 for details. This APARC annotation is anatomical information which is not derived from the fMRI data themselves. In this sense, our data analysis plan is more similar to Strategy (i) than to Strategy (ii). Our methodological innovation, though, is that we do not map each subject's brain to a standard brain and perform the subsequent analysis for all subjects combined, but we carry out Steps 1 to 3 for each subject separately, and we only combine subject-specific data in Step 4. We clarify this in the revised version of the manuscript.

"R3-2: (2) Figure 1 is confusing, beyond the “Monte Carlo” typo. It appears to treat both studies as having the same number of participants, n, whereas Table 1 reports different sample sizes across the two studies. In addition, Table 1 and Listing 1 do not seem essential for the main manuscript; that information would be better used to improve Figure 1. Also, PijAPARC and other formulas are not presented in the introduction, so it might be inappropriate to appear here. Overall, my suggestion is that the introduction can be less technical and more approachable."

Reply R3-2: We corrected the typos in the Figure and its caption and made sure that, in the published manuscript, the Figure appears only in Section 4, where it is first referenced and where all formulas from Section 2 have already been introduced. The left part of the Figure schematically describes only one study (Siegmund et al., 2017), in which functional and anatomical data were acquired for each of the n participants.
We would prefer to keep Table 1 and Listing 1 in the manuscript so that readers can evaluate the type of study without consulting the original publication. The Introduction has been rewritten to be more concise and less technical.

"R3-3: (3) The Introduction’s discussion of MVPA and cluster-based inference feels somewhat tangential and is framed largely as a dismissal of common alternatives. It would be stronger to more directly and even-handedly summarize the pros and cons of MVPA and cluster-based inference versus standard GLM-based analyses, and then clearly state which components the proposed method adopts from each (and why)."

Reply R3-3: We agree that this discussion does not belong in the Introduction. We moved it to the Discussion (Section 5), where we elaborate on the pros and cons of each method.

"R3-4: (4) Methodologically, I am comfortable with Steps 1–2 as a within-subject pre-screening procedure, but Step 3 requires further justification. In particular, the manuscript should clarify whether the partial conjunction test from Ref. 40 remains valid after restricting inference to a screened subset of voxels, and whether FWER is still controlled across the full multi-stage pipeline. I am not sure whether their way of choosing kappa and achieving FWER control automatically guarantees global FWER control in your case. I also think the interpretation of kappa is actually essential, so it might be necessary to elaborate on it more and even conduct simulations with the choice of kappa."

Reply R3-4: When testing all regions $j=1,\ldots,J$ simultaneously, a multiplicity correction is indeed required. Therefore, in Step 4 we do not use the $(1-\alpha)$-quantile of $\chi^2_{2n}$ as the rejection threshold for each $T_j$, but the $(1-\alpha\kappa)$-quantile of $\chi^2_{2n}$, where $\kappa$ is approximately equal to $1/J$ (Bonferroni correction). The theory developed in Reference No. 40 allows for choosing this adjustment factor $\kappa$ slightly larger than $1/J$, but the order of magnitude of $\kappa$ is $O\left(J^{-1}\right)$. So, you may simply think of the final multiplicity correction as a Bonferroni correction with respect to the number J of considered brain regions. This is a confirmatory analysis in the sense that the family-wise error rate (with respect to the family of region-specific "no activation" null hypotheses) is under control (up to the asymptotic nature of the chi-square distribution).
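
As a rough illustration of the Step 4 decision rule described above, the following Python sketch combines the subject-specific regional p-values of each region with Fisher's method, $T_j = -2\sum_{i=1}^{n}\log p_{ij}$, and applies a Bonferroni-adjusted threshold. It is a minimal sketch under stated assumptions, not the authors' implementation: the adjustment factor is taken to be exactly $1/J$, the subjects are treated as independent, all p-values are assumed to be strictly positive, and the function names and the dictionary input format are hypothetical. Comparing the combined p-value with $\alpha/J$ is equivalent to comparing $T_j$ with the $(1-\alpha/J)$-quantile of $\chi^2_{2n}$.

```python
import math


def chi2_sf_even_df(x, n):
    """Upper tail probability of a chi-square distribution with 2n degrees of freedom.

    For an even number of degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!, so no external library is needed.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k  # (x/2)^k / k!, built up iteratively
        total += term
    return math.exp(-half) * total


def fisher_region_test(p_values_by_region, alpha=0.05):
    """Fisher combination per region with a Bonferroni threshold alpha / J.

    p_values_by_region: dict mapping a region label to the list of n
    subject-specific regional p-values (hypothetical input format).
    Returns a dict: region -> (T_j, combined p-value, rejected?).
    """
    J = len(p_values_by_region)  # number of regions tested simultaneously
    results = {}
    for region, pvals in p_values_by_region.items():
        n = len(pvals)
        t_j = -2.0 * sum(math.log(p) for p in pvals)  # Fisher's statistic
        p_comb = chi2_sf_even_df(t_j, n)
        results[region] = (t_j, p_comb, p_comb <= alpha / J)
    return results
```

For example, with two hypothetical regions and three subjects each, `fisher_region_test({"a": [0.001, 0.002, 0.01], "b": [0.5, 0.6, 0.7]})` rejects the null hypothesis for region "a" but not for region "b" at $\alpha = 0.05$.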

As far as the "screening" constituted by Step 1 is concerned, the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small) by this screening. This follows from the theory of fixed sequence multiple tests as explained, e.g., in Section 3.1.2.2 of doi.org/10.1007/978-3-642-45182-9. Technically, the p-values for all voxels which have not been declared significant in Step 1 are simply set to 1.0 in Step 2. Notice, moreover, that the number $m_j$ appearing in Step 3 of our proposed methodology refers to the total number of voxels belonging to region j, and not to the number of voxels which have been declared significant in Step 1. Thus, if (for instance) we have a region of total size $m_j = 100$ (voxels), and we set $\kappa = 0.5$, we require 50 small p-values in that region j to declare that region significantly activated in Steps 3 and 4. Those p-values which have been set to 1.0 because the corresponding voxels were not declared significant in Step 1 do not contribute to achieving this criterion.

"R3-5: (5) A clearer definition of the regional hypothesis Hij is needed. In particular, the paper should explicitly state what it means to reject region j after the four-step procedure, and the interpretation of kappa might be part of the hypothesis definition."

Reply R3-5: If the regional null hypothesis $H_j$ is rejected in Step 4, the data provide evidence that at least $\kappa\cdot 100\%$ of voxels in region j are active during the visual presentation of code snippets. We have slightly extended the description of the method in the revised version of the manuscript.

"R3-6: (6) The simulation setting is somewhat idealized. The paper would be strengthened by additional simulations that better reflect realistic fMRI conditions, such as imperfect parcellation/label noise and varying degrees of spatial smoothness."

Reply R3-6: We extended the simulation study by varying the signal-to-noise ratio used in the data generation. This allows studying the Type I and Type II errors in a broader range of scenarios, giving insights into the statistical properties of the methods. Furthermore, we provide a tutorial and explanation of the simulation code, which can be used to adapt the simulation to one's own purposes.

"R3-7: (7) In addition, the cluster-based baseline is only evaluated in the simplified simulation and is not compared on the real dataset; the authors should more explicitly articulate the practical benefit of the proposed method, given that cluster-based inference achieves similar power in their simulation results. While the manuscript highlights the advantage of avoiding smoothness-based threshold calibration via simulation, many contemporary fMRI inference methods provide automated, data-driven spatial modeling and thresholding (including recent deep-learning-based approaches), and these should be acknowledged and discussed as relevant baselines at least."

Reply R3-7: Thank you for highlighting this interesting aspect. We would like to clarify that a cluster-based inference was in fact attempted in the Siegmund et al. (2017) study. However, this approach did not yield any statistically significant effects in the real dataset. As a result, Siegmund et al. had to rely on an independent dataset collected in 2014 to obtain a meaningful signal. This experience was a primary motivation for developing and evaluating the method presented in this paper, since it is uncommon that a prior study exists that can be used in such a manner. We now explain this aspect more clearly in Section 4.1 of the revised submission.

"R3-8: (8) I am comfortable with the real data analysis findings, but an additional application would always be better to show generalizability. Figures 5–6 and Tables 2–4 communicate key insights, but the current presentation feels exploratory rather than indexing-ready. These results would benefit from more polished, information-dense visualizations."

Reply R3-8: We would like to stress that the purpose of this paper is the introduction of a new inference method for fMRI-related activation. The scope of the work is to outline the method, test it on already existing data, and compare the results with the previous findings for validation. Further potential neuroscientific insights are left to future applications of the method to new study data.

Competing Interests: None

Reviewer Report 26 Dec 2025

Fabricio Cravo, Psychology, Northeastern University College of Science (Ringgold ID: 195088), Boston, Massachusetts, USA

Stephanie Noble, Northeastern University, Boston, Massachusetts, USA

Not Approved

https://doi.org/10.5256/f1000research.183550.r423766

The authors introduce a multi-stage approach for aggregating evidence within a priori regions. They show this approach yields higher power than voxel-level inference and comparable power to cluster-based inference in simulations, and is able to highlight task-relevant regions in empirical data compared with a voxel-driven approach. This is an interesting procedure. However, there are a few concerns with the statistical approach that must be addressed before the contribution of the study can be fully evaluated.

First, it was challenging to evaluate the rationale for the Step 1 & 2 pre-selection of voxels for p-value estimation + correction (only applicable for nested conditions?) and how its addition leads to a valid statistical test. The core inspiration for these steps comes from Ref 43 (Siegmund et al., 2017), but we were unable to find many details regarding the statistical rationale from that manuscript and thus were unable to evaluate multiple questions. For example:
1) The introduction references Strategy (ii), noting that regions must be defined based on information outside the data that we set out to analyze to avoid selection biases. However, in the proposed method, Step 1 uses the same functional data to select voxels. Could the authors clarify how this satisfies the independence requirement? The justification that "FDR is an established screening criterion for high-dimensional multiple test problems" is not sufficient to address this.
2) Related: the test in Ref 40 evaluates whether a sufficient proportion of voxels within an anatomically-defined region shows activation, and it appears it tests the complete set of voxels within the boundary. Could the authors clarify how applying an FDR screening selection within a boundary before applying the hypothesis test developed in Ref 40 still qualifies as an appropriate use of the test?
3) (If the above is appropriate) Why does program comprehension vs. rest serve as an adequate condition of pre-selection for voxels that may be significant in the bottom-up comprehension vs. control condition? Isn’t it possible that an effect may be stronger in the latter condition than the former condition, thus meeting significance only in the latter and not former condition?
4) When a region is declared significant, what specific hypothesis has been rejected? How should the result be interpreted, given the multiple stages of testing?
In general, it would help to provide a bit more detail regarding the statistical approaches used in the present manuscript to avoid too much consultation with the works in Ref 40 and 43 to obtain key details.

Second, and perhaps more importantly, the authors indicate throughout the manuscript (starting in the abstract and ending in the conclusion) that their main takeaway is that the proposed method outperforms voxel-based inference. For example:
Abstract: “The results of our simulated data show that our proposed method demonstrated significantly higher power in detecting true activation.”
which, based on Table 2, is only true when comparing to voxel-based inference, not cluster-based inference.
However, voxel-based inference is exceedingly rare in fMRI analysis due to the widespread appreciation that this approach is overconservative when correcting for so many tests, and the development of methods that leverage the known spatial dependence to obtain more powerful inferences. As such, cluster-based approaches are much more widely used. As the authors demonstrate in their one comparison with cluster-level inference (i.e., the simulation study) that the proposed approach yields nearly equivalent Type I and II error as cluster-based inference, it is unclear what the practical advantages are for the proposed method compared with the more commonly used cluster-based inference approach that also leverages spatial dependence.

Misc additional comments

1. It was not always clear in the manuscript where exactly the present work departs from that of Schildknecht et al. (2016) (Ref 40). The authors state “In the present work, we propose a new strategy and apply it to the data from Ref. 43”, but it appears that the bulk of this strategy is the region-based procedure by Schildknecht et al. (2016) (Ref 40), as they later state in the Methods. Similarly, in Fig. 1 they indicate “the part on the right indicates the processing steps proposed in this paper”, whose main component is the method of Schildknecht et al. Finally, they indicate “To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea,” but, again, inference in subject-specific regions appears to be part of the previous work. To our knowledge, the main contributions of the present manuscript appear to be the aggregation of evidence over a group of subjects instead of a single subject (Fisher’s method), the additional validation analyses, and, secondarily, the Step 1-2 estimation of p-values for preselected voxels that is only applicable for nested conditions (but again, see major concern #1). These are valuable additions but not always made clear. When it is not clear what the relevant additions are, it becomes difficult to know when to seek key details from the present manuscript compared with previous works. This all seems mostly clear enough in the abstract but becomes less so in the text.
2. The authors previously state (Ref. 40) that the weak dependency assumption may be problematic, and indeed substantial dependence between neighboring voxels is at the heart of many cluster- or region-based inference approaches. This limitation should be reiterated here.
3. It would be helpful to provide more detail for the method used for RFT cluster-level inference. The authors indicate that a priori thresholds for cluster-level inference are determined by simulation, but it is unclear how and whether a priori thresholds are set a priori or via simulation for the present study. This would be helpful to provide as those thresholds substantially influence the results. Common practice is to set the cluster-determining threshold a priori and use theory or simulation to obtain the cluster-size threshold indicating significant clusters corresponding with a target level of FWER control in AFNI, FSL, or SPM; if the approach here departs from that, it would be helpful to mention that as well.
4. Note that Figure 1 suggests that Steps 1-2 were performed by Siegmund et al., but the methods indicate that the authors conducted their own GLM analysis for these steps. It would help to update the figure to include this detail to avoid confusion.
5. The real data validation uses a single dataset with n=11 subjects, which is on the lower side for contemporary fMRI experiments. In light of recent evidence that typical mass univariate studies may yield unreliable or underpowered results (e.g., Marek et al., 2020), it is unclear how robust evidence from this validation might be. It would help to mention this limitation or provide an argument to the contrary.
6. Minor: the use of “Monte Carlo” as the name of a color in Fig 1 was somewhat confusing, given that it could readily be mistaken for the method for stochastic simulation. It would also help to use more standard and readily appreciated color names.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Statistical methods for neuroimaging; computational methods for neuroimaging; computer science and applied mathematics (Cravo)

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Author Response 25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

"R2-1: The authors introduce a multi-stage approach for aggregating evidence within a priori regions. They show this approach yields higher power than voxel-level inference and comparable power to cluster-based inference in simulations, and is able to highlight task-relevant regions in empirical data compared with a voxel-driven approach. This is an interesting procedure. However, there are a few concerns with the statistical approach that must be addressed before the contribution of the study can be fully evaluated.

First, it was challenging to evaluate the rationale for the Step 1 & 2 pre-selection of voxels for p-value estimation + correction (only applicable for nested conditions?) and how its addition leads to a valid statistical test. The core inspiration for these steps comes from Ref 43 (Siegmund et al., 2017), but we were unable to find many details regarding the statistical rationale from that manuscript and thus were unable to evaluate multiple questions."

Reply R2-1: The reviewer is right that the two conditions in Steps 1 and 2 are nested. This makes sense from the application point of view, because only those voxels k are carried over to Step 2 which show activation in the less specific task considered in Step 1 (as compared to the more specific task in Step 2). From the mathematical-statistical point of view, however, this nestedness is not explicitly made use of in the statistical methodology. The relevant assumptions are that (i) the two conditions to be tested in Steps 1 and 2 are pre-defined before seeing any data which are used to compute p-values, and (ii) the explicit p-value calculation in Step 2 is only performed for those voxels which have been declared significant in Step 1. Technically, the p-values for all voxels which have not been declared significant in Step 1 are set to 1.0 in Step 2. Assumptions (i) and (ii) ensure that the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small). This follows from the theory of fixed sequence multiple tests as explained, e.g., in Section 3.1.2.2 of dx.doi.org/10.1007/978-3-642-45182-9.
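
The screening logic described above can be sketched in a few lines of Python. The manuscript is quoted as using FDR as the screening criterion in Step 1; the specific Benjamini-Hochberg step-up rule used below is an assumption, as are the function names, the level `q`, and the list-based input format. The sketch is illustrative only and is not the authors' code.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking the voxels selected by the
    Benjamini-Hochberg step-up procedure at FDR level q (assumed Step 1 rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # voxel indices by ascending p
    cutoff = 0
    for rank, k in enumerate(order, start=1):
        if p_values[k] <= q * rank / m:
            cutoff = rank  # largest rank satisfying the step-up condition
    selected = [False] * m
    for k in order[:cutoff]:
        selected[k] = True
    return selected


def screened_p_values(p_step1, p_step2, q=0.05):
    """Keep the Step 2 p-value only for voxels selected in Step 1.

    Voxels not declared significant in Step 1 receive the p-value 1.0,
    so they can never count as evidence for activation later on.
    """
    keep = benjamini_hochberg(p_step1, q)
    return [p2 if selected else 1.0 for p2, selected in zip(p_step2, keep)]
```

Note that the second voxel in a call such as `screened_p_values([0.001, 0.5, 0.012, 0.8], [0.02, 0.001, 0.3, 0.04])` ends up with the p-value 1.0 even though its Step 2 p-value is small, because it did not pass the Step 1 screening.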

"R.2-2 1) The introduction references Strategy (ii), noting that regions must be defined based on information outside the data that we set out to analyze to avoid selection biases. However, in the proposed method, Step 1 uses the same functional data to select voxels. Could the authors clarify how this satisfies the independence requirement? The justification that "FDR is an established screening criterion for high-dimensional multiple test problems" is not sufficient to address this."

Reply R2-2: Strategy (ii) mentioned in the Introduction refers to a data analysis plan in which the regions (of interest) themselves are defined based on study data. In our study, however, these regions are defined based on the FreeSurfer segmentation of the brain into aparc labels; cf. Section 2.3 for details. This APARC annotation is anatomical information which is not derived from the fMRI data themselves. In this sense, our data analysis plan is more similar to Strategy (i) than to Strategy (ii). Our methodological innovation, though, is that we do not map each subject's brain to a standard brain and perform the subsequent analysis for all subjects combined; instead, we carry out Steps 1 to 3 for each subject separately and only combine subject-specific data in Step 4.

As explained in our answer to a previous point, Step 1 is only used to set certain voxel-specific p-values to 1.0 if these voxels do not show activation in the more general task (contrasted with blocks of resting state) corresponding to Step 1 as compared to the more specific task contrast in Step 2. In particular, Step 1 is not used to define the regions $j=1,\ldots,J$.

"R.2-3 2) Related: the test in Ref 40 evaluates whether a sufficient proportion of voxels within an anatomically-defined region shows activation, and it appears it tests the complete set of voxels within the boundary. Could the authors clarify how applying an FDR screening selection within a boundary before applying the hypothesis test developed in Ref 40 still qualifies as an appropriate use of the test?"

Reply R2-3: The methodology developed in Reference 40 assumes that the p-values $(\tilde{p}_{ik})$ (in the notation of our present manuscript, cf. Step 2) are valid, meaning that, under the null hypothesis of no activation in Step 2, they are stochastically lower-bounded by a uniform distribution. As argued in our response to your first point, this is the case for our proposed method. Notice, moreover, that the number $m_j$ appearing in Step 3 of our proposed methodology refers to the total number of voxels belonging to region j, and not to the number of voxels which have been declared significant in Step 1. Thus, if (for instance) we have a region of total size $m_j = 100$ (voxels), and we set $\kappa = 0.5$, we require 50 small p-values in that region j to declare that region significantly activated in Steps 3 and 4. Those p-values which have been set to 1.0 because the corresponding voxels were not declared significant in Step 1 do not contribute to achieving this criterion.
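
To make the role of $m_j$ and $\kappa$ in the example above concrete, the following Python sketch computes a regional p-value from all $m_j$ voxel p-values of one region. The exact regional statistic of Reference 40 is not restated in this reply, so the sketch swaps in the Bonferroni-type partial conjunction p-value of Benjamini and Heller (2008), $(m_j - u + 1)\,p_{(u)}$ with $u = \lceil \kappa m_j \rceil$, purely for illustration; the function name is hypothetical.

```python
import math


def partial_conjunction_p(p_values, kappa):
    """Bonferroni-type partial conjunction p-value for one region.

    Tests the null hypothesis that fewer than u = ceil(kappa * m) of the m
    voxels are active. m counts ALL voxels of the region, including those
    whose p-value was set to 1.0 by the screening step.
    """
    m = len(p_values)
    # Small tolerance guards against floating-point overshoot in kappa * m.
    u = max(1, math.ceil(kappa * m - 1e-9))
    p_u = sorted(p_values)[u - 1]  # u-th smallest p-value in the region
    return min(1.0, (m - u + 1) * p_u)
```

With $m_j = 100$ and $\kappa = 0.5$, the statistic depends on the 50th smallest p-value. If only 40 voxels of the region survived the screening, the 50th smallest p-value equals 1.0 and the regional p-value is 1.0, which reflects the statement that screened-out voxels cannot contribute to reaching the criterion.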

"R.2-4 3) (If the above is appropriate) Why does program comprehension vs. rest serve as an adequate condition of pre-selection for voxels that may be significant in the bottom-up comprehension vs. control condition? Isn't it possible that an effect may be stronger in the latter condition than the former condition, thus meeting significance only in the latter and not former condition?"

Reply R2-4: The reviewer is correct, and this is exactly the reason why the nestedness of the conditions referring to Steps 1 and 2 is a meaningful setup. Step 1 makes the plausible assumption about functional specificity that any region involved in a specific cognitive function (here, program comprehension) shows stronger activation during that function than during rest. In Step 2, the syntax task serves as a more specific control condition: the same program snippets are shown to the participants, but with a different task (finding syntax errors) that does not require program comprehension.

"R.2-5 4) When a region is declared significant, what specific hypothesis has been rejected? How should the result be interpreted, given the multiple stages of testing?"

Reply R2-5: If the regional null hypothesis $H_j$ is rejected in Step 4, the data provide evidence that at least $\kappa\cdot 100\%$ of voxels in region j are active during the visual presentation of code snippets.

"R.2-6 In general, it would help to provide a bit more detail regarding the statistical approaches used in the present manuscript to avoid too much consultation with the works in Ref 40 and 43 to obtain key details."

Reply R2-6: In the revised version of the manuscript, we provide additional explanations of our proposed statistical methodology at several places in Sections 1 and 2, along the lines of our above answers to your respective points.

"R.2-7 Second, and perhaps more importantly, the authors indicate throughout the manuscript (starting in the abstract and ending in the conclusion) that their main takeaway is that the proposed method outperforms voxel-based inference. For example:
Abstract: ''The results of our simulated data show that our proposed method demonstrated significantly higher power in detecting true activation.''
which, based on Table 2, is only true when comparing to voxel-based inference, not cluster-based inference.
However, voxel-based inference is exceedingly rare in fMRI analysis due to the widespread appreciation that this approach is overconservative when correcting for so many tests, and the development of methods that leverage the known spatial dependence to obtain more powerful inferences. As such, cluster-based approaches are much more widely used. As the authors demonstrate in their one comparison with cluster-level inference (i.e., the simulation study) that the proposed approach yields nearly equivalent Type I and II error as cluster-based inference, it is unclear what the practical advantages are for the proposed method compared with the more commonly used cluster-based inference approach that also leverages spatial dependence."

Reply R2-7: We have made the statement in the abstract more specific: "Our simulations indicate that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline." The practical advantage of the proposed method is that it achieves similar statistical power when applied to the real data without utilizing the results of a functional localizer. We think that this is a convincing argument in favor of our suggested approach.

In the revision, we have extended our simulation study. In particular, we have now varied the signal-to-noise ratio (SNR). It turns out that for $\text{SNR}\in[0.8,1.1]$, our proposed procedure can achieve up to approximately 10% higher power than the cluster-based method.

"R.2-8 Misc additional comments
1. It was not always clear in the manuscript where exactly the present work departs from that of Schildknecht et al. (2016) (Ref 40). The authors state ''In the present work, we propose a new strategy and apply it to the data from Ref. 43'', but it appears that the bulk of this strategy is the region-based procedure by Schildknecht et al. (2016) (Ref 40), as they later state in the Methods. Similarly, in Fig. 1 they indicate ''the part on the right indicates the processing steps proposed in this paper'', whose main component is the method of Schildknecht et al. Finally, they indicate ''To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea,'' but, again, inference in subject-specific regions appears to be part of the previous work. To our knowledge, the main contributions of the present manuscript appears to be the aggregation of evidence over a group of subjects instead of a single subject (Fisher's method), the additional validation analyses, and, secondarily, the Step 1-2 estimation of p-values for preselected voxels that is only applicable for nested conditions (but again, see major concern \#1). These are valuable additions but not always made clear. When it is not clear what the relevant additions are, it becomes difficult to know when to seek key details from the present manuscript compared with previous works. This all seems mostly clear enough in the abstract but becomes less so in the text."

Reply R2-8: The reviewer is correct that Step 3 of our proposed methodology relies on Reference No. 40. However, the usage of the Fisher combination in Step 4, as well as the general idea to carry out Steps 1 to 3 on the subject level and to combine all subject-specific data only in Step 4, are, to the best of our knowledge, novel proposals.
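For readers unfamiliar with the Fisher combination, the following Python sketch shows how subject-specific p-values for one region could be combined into a single group-level p-value. It is a generic textbook version of Fisher's method, not the authors' implementation: it assumes independent p-values across subjects, and the function names are hypothetical. The combined statistic $-2\sum_i \ln p_i$ is referred to a chi-square distribution with $2n$ degrees of freedom, whose survival function has a closed form for even degrees of freedom.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Survival function of the chi-square distribution for even dof.

    Uses the closed form exp(-x/2) * sum_{k=0}^{dof/2 - 1} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combine(pvals):
    """Fisher's method for n independent p-values in (0, 1].

    Returns the statistic -2 * sum(log p_i) and the combined p-value.
    """
    if not pvals or any(p <= 0.0 or p > 1.0 for p in pvals):
        raise ValueError("p-values must lie in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return stat, chi2_sf_even_dof(stat, 2 * len(pvals))


# Hypothetical regional p-values of three subjects for one region j.
stat, p_combined = fisher_combine([0.04, 0.20, 0.01])
print(round(stat, 3), round(p_combined, 5))
```

A subject-level p-value of 1.0 contributes nothing to the statistic, so a subject without evidence in region j can only weaken, never strengthen, the combined evidence.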

"R.2-9 2. The authors previously state (Ref. 40) that the weak dependency assumption may be problematic, and indeed substantial dependence between neighboring voxels is at the heart of many cluster- or region-based inference approaches. This limitation should be reiterated here."

Reply R2-9: The reviewer is correct that weak dependency (in the specific sense that averaged empirical cdfs of p-values converge in the Glivenko-Cantelli sense, both under the null and under the alternative) is assumed for Step 3 of our proposed methodology. We explicitly mention this in the revised version of the manuscript.

"R.2-10 3. It would be helpful to provide more detail for the method used for RFT cluster-level inference. The authors indicate that a priori thresholds for cluster-level inference are determined by simulation, but it is unclear how and whether a priori thresholds are set a priori or via simulation for the present study. This would be helpful to provide as those thresholds substantially influence the results. Common practice is to set the cluster-determining threshold a priori and use theory or simulation to obtain the cluster-size threshold indicating significant clusters corresponding with a target level of FWER control in AFNI, FSL, or SPM; if the approach here departs from that, it would be helpful to mention that as well."

Reply R2-10: For the cluster-level inference, we use the R package fmri, which provides a function for this type of analysis. The thresholds therein are indeed determined by simulation for a wide range of significance levels, spatial smoothness, and data size, similar to AFNI's implementation; see Section 4.4.5 in Polzehl and Tabelow (2023), Magnetic Resonance Brain Imaging: Modelling and Data Analysis Using R. We added a corresponding clarification in the manuscript.

"R.2-11 4. Note that Figure 1 suggests that Steps 1-2 were performed by Siegmund et al., but the methods indicate that the authors conducted their own GLM analysis for these steps. It would help to update the figure to include this detail to avoid confusion."

Reply R2-11: The indication of Figure 1 is actually correct. We have re-used the existing data from Siegmund et al. after they had completed the GLM analysis of Steps 1 and 2. We have clarified the phrasing in our methods section to make this clearer.

"R.2-12 5. The real data validation uses a single dataset with n=11 subjects, which is on the lower side for contemporary fMRI experiments. In light of recent evidence that typical mass univariate studies may yield unreliable or underpowered results (e.g., Marek et al., 2020), it is unclear how robust evidence from this validation might be. It would help to mention this limitation or provide an argument on the contrary."

Reply R2-12: We appreciate the reviewer pointing out the potential issue of the small sample size. We would like to note that the area of the fMRI study of Siegmund et al. (software engineering) is particularly challenging for recruitment. Many fMRI studies in software engineering deliberately target a small, homogeneous sample rather than a larger sample in which confounding factors related to the multi-faceted dimensions of programmer expertise, experience, and demographics may dilute the observed effects.

Nevertheless, to address this valid concern about the results of our work more explicitly, we now elaborate on the threat to validity arising from our sample size and on the resulting need for future work.

"R.2-13 6. Minor: the use of ''Monte Carlo'' as the name of a color in Fig 1 was somewhat confusing, given that it could readily be mistaken for the method for stochastic simulation. It would also help to use more standard and readily appreciated color names."

Reply R2-13: We thank the reviewer for the suggestion and have simplified the color names. We additionally use "left" and "right" to increase colorblind-friendliness.
Competing Interests: None

COMMENTS ON THIS REPORT

Author Response 25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

25 Mar 2026

Author Response

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

"R2-1: The ... Continue reading We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

"R2-1: The authors introduce a multi-stage approach for aggregating evidence within a priori regions. They show this approach yields higher power than voxel-level inference and comparable power to cluster-based inference in simulations, and is able to highlight task-relevant regions in empirical data compared with a voxel-driven approach. This is an interesting procedure. However, there are a few concerns with the statistical approach that must be addressed before the contribution of the study can be fully evaluated.

First, it was challenging to evaluate the rationale for the Step 1 & 2 pre-selection of voxels for p-value estimation + correction (only applicable for nested conditions?) and how its addition leads to a valid statistical test. The core inspiration for these steps comes from Ref 43 (Seigmund et al., 2017), but we were unable to find many details regarding the statistical rationale from that manuscript and thus were unable to evaluate multiple questions."

Reply R2-1: The reviewer is right that the two conditions in Steps 1 and 2 are nested. This makes sense from the application point of view, because only those voxels k are carried over to Step 2 which show activation in the less specific task considered in Step 1 (as compared to the more specific task in Step 2). From the mathematical-statistical point of view, however, this nestedness is not explicitly made use of in the statistical methodology. The relevant assumptions are that (i) the two conditions to be tested in Steps 1 and 2 are pre-defined before seeing any data which are used to compute p-values, and (ii) the explicit p-value calculation in Step 2 is only performed for those voxels which have been declared significant in Step 1. Technically, the p-values for all voxels which have not been declared significant in Step 1 are set to 1.0 in Step 2. Assumptions (i) and (ii) ensure that the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small). This follows from the theory of fixed sequence multiple tests as explained, e. g., in Section 3.1.2.2 of dx.doi.org/10.1007/978-3-642-45182-9.

"R.2-2 1) The introduction references Strategy (ii), noting that regions must be defined based on information outside the data that we set out to analyze to avoid selection biases. However, in the proposed method, Step 1 uses the same functional data to select voxels. Could the authors clarify how this satisfies the independence requirement? The justification that "FDR is an established screening criterion for high-dimensional multiple test problems" is not sufficient to address this."

Reply R2-2: Strategy (ii) mentioned in the Introduction refers to a data analysis plan in which the regions (of interest) themselves are defined based on study data. In our study, however, these regions are defined based on the freesurfer segmentation of the brain into aparc labels; cf. Section 2.3 for details. This APARC annotation is anatomical information which is not derived from the fMRI data themselves. In this sense, our data analysis plan is more similar to Strategy (i) than to Strategy (ii). Our methodological innovation, though, is that we do not map each subject's brain to a standard brain and perform the subsequent analysis for all subjects combined, but we carry out Steps 1 to 3 for each subject separately, and we only combine subject-specific data in Step 4.

As explained in our answer to a previous point, Step 1 is only used to set certain voxel-specific p-values to 1.0 if these voxels do not show activation in the more general task (contrasted with blocks of resting state) corresponding to Step 1 as compared to the more specific task contrast in Step 2. In particular, Step 1 is not used to define the regions $j=1,\ldots,J$ .

"R.2-3 2) Related: the test in Ref 40 evaluates whether a sufficient proportion of voxels within an anatomically-defined region shows activation, and it appears it tests the complete set of voxels within the boundary. Could the authors clarify how applying an FDR screening selection within a boundary before applying the hypothesis test developed in Ref 40 still qualifies as an appropriate use of the test?"

Reply R2-3: The methodology developed in Reference 40 assumes that the p-values $(\tilde{p}_{ik})$ (in the notation of our present manuscript, cf. Step 2) are valid, meaning that they are under the null hypothesis of no activation in Step 2 stochastically lower-bounded by a uniform distribution. As argued in our response to your first point, this is the case for our proposed method. Notice, moreover, that the number $m_j$ appearing in Step 3 of our proposed methodology refers to the total number of voxels belonging to region j, and not to the number of voxels which have been declared significant in Step 1. Thus, if (for instance) we have a region of total size $m_j$ = 100 (voxels), and we set $\kappa$ = 0.5, we require 50 small p-values in that region j to declare that region significantly activated in Steps 3 and 4. Those p-values which have been set to 1.0 in Step 1 do not contribute to achieving this criterion.

"R.2-4 3) (If the above is appropriate) Why does program comprehension vs. rest serve as an adequate condition of pre-selection for voxels that may be significant in the bottom-up comprehension vs. control condition? Isn't it possible that an effect may be stronger in the latter condition than the former condition, thus meeting significance only in the latter and not former condition?"

Reply R2-4: The reviewer is correct, and this is exactly the reason why the nestedness of the conditions referring to Steps 1 and 2 is a meaningful setup. Step 1 makes the plausible assumption about functional specificity that any region that is involved in a specific cognitive function (here program comprehension) shows stronger activation during that function than during rest. In Step 2, the syntax task serves as a more specific control condition since the same program snippets are shown to the participants albeit with a different task of finding syntax errors which does not require program comprehension.

"R.2-5 4) When a region is declared significant, what specific hypothesis has been rejected? How should the result be interpreted, given the multiple stages of testing?"

Reply R2-5: If the regional null hypothesis $H_j$ is rejected in Step 4, the data provide evidence that at least $\kappa\cdot 100\%$ of voxels in region j are active during the visual presentation of code snippets.

"R.2-6 In general, it would help to provide a bit more detail regarding the statistical approaches used in the present manuscript to avoid too much consultation with the works in Ref 40 and 43 to obtain key details."

Reply R2-6: In the revised version of the manuscript, we provide additional explanations of our proposed statistical methodology at several places in Sections 1 and 2, along the lines of our above answers to your respective points.

"R.2-7 Second, and perhaps more importantly, the authors indicate throughout the manuscript (starting in the abstract and ending in the conclusion) that their main takeaway is that the proposed method outperforms voxel-based inference. For example:
Abstract: ''The results of our simulated data show that our proposed method demonstrated significantly higher power in detecting true activation.''
which, based on Table 2, is only true when comparing to voxel-based inference, not cluster-based inference.
However, voxel-based inference is exceedingly rare in fMRI analysis due to the widespread appreciation that this approach is overconservative when correcting for so many tests, and the development of methods that leverage the known spatial dependence to obtain more powerful inferences. As such, cluster-based approaches are much more widely used. As the authors demonstrate in their one comparison with cluster-level inference (i.e., the simulation study) that the proposed approach yields nearly equivalent Type I and II error as cluster-based inference, it is unclear what the practical advantages are for the proposed method compared with the more commonly used cluster-based inference approach that also leverages spatial dependence."

Reply R2-7: We now specify the statement in the abstract: "Our simulations indicate that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline." The practical advantage of the proposed method comes from the similar statistical power when applied to the real data while not utilizing the results of a functional localizer. We think that this is a convincing argument in favor of our suggested approach.

In the revision we have extended our simulation study. In particular, we have now varied the signal-to-noise ratio (SNR). It turns out that for $\text{SNR}\in[0.8,1.1]$ our proposed procedure can achieve up to approximately 10% higher power than the cluster-based method.

"R.2-8 Misc additional comments
1. It was not always clear in the manuscript where exactly the present work departs from that of Schildknecht et al. (2016) (Ref 40). The authors state ''In the present work, we propose a new strategy and apply it to the data from Ref. 43'', but it appears that the bulk of this strategy is the region-based procedure by Schildknecht et al. (2016) (Ref 40), as they later state in the Methods. Similarly, in Fig. 1 they indicate ''the part on the right indicates the processing steps proposed in this paper'', whose main component is the method of Schildknecht et al. Finally, they indicate ''To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea,'' but, again, inference in subject-specific regions appears to be part of the previous work. To our knowledge, the main contributions of the present manuscript appears to be the aggregation of evidence over a group of subjects instead of a single subject (Fisher's method), the additional validation analyses, and, secondarily, the Step 1-2 estimation of p-values for preselected voxels that is only applicable for nested conditions (but again, see major concern #1). These are valuable additions but not always made clear. When it is not clear what the relevant additions are, it becomes difficult to know when to seek key details from the present manuscript compared with previous works. This all seems mostly clear enough in the abstract but becomes less so in the text."

Reply R2-8: The reviewer is correct that Step 3 of our proposed methodology relies on Reference No. 40. However, the usage of the Fisher combination in Step 4, as well as the general idea to carry out Steps 1 to 3 on the subject level and only to combine all subject-specific data in Step 4, is, to the best of our knowledge, a novel proposal.

"R.2-9 2. The authors previously state (Ref. 40) that the weak dependency assumption may be problematic, and indeed substantial dependence between neighboring voxels is at the heart of many cluster- or region-based inference approaches. This limitation should be reiterated here."

Reply R2-9: The reviewer is correct that weak dependency (in the specific sense that averaged empirical cdfs of p-values converge in the Glivenko-Cantelli sense, both under the null and under the alternative) is assumed for Step 3 of our proposed methodology. We explicitly mention this in the revised version of the manuscript.

"R.2-10 3. It would be helpful to provide more detail for the method used for RFT cluster-level inference. The authors indicate that a priori thresholds for cluster-level inference are determined by simulation, but it is unclear how and whether a priori thresholds are set a priori or via simulation for the present study. This would be helpful to provide as those thresholds substantially influence the results. Common practice is to set the cluster-determining threshold a priori and use theory or simulation to obtain the cluster-size threshold indicating significant clusters corresponding with a target level of FWER control in AFNI, FSL, or SPM; if the approach here departs from that, it would be helpful to mention that as well."

Reply R2-10: For the cluster-level inference we use the R package fmri, which provides a function for this type of analysis. The thresholds therein are indeed determined by simulation for a wide range of significance levels, spatial smoothness, and data size, similar to AFNI's implementation; see Section 4.4.5 in Polzehl and Tabelow (2023), Magnetic Resonance Brain Imaging: Modelling and Data Analysis Using R. We added a corresponding clarification in the manuscript.

"R.2-11 4. Note that Figure 1 suggests that Steps 1-2 were performed by Siegmund et al., but the methods indicate that the authors conducted their own GLM analysis for these steps. It would help to update the figure to include this detail to avoid confusion."

Reply R2-11: The indication of Figure 1 is actually correct. We have re-used the existing data from Siegmund et al. after they had completed the GLM analysis of Steps 1 and 2. We have clarified the phrasing in our methods section to make this clearer.

"R.2-12 5. The real data validation uses a single dataset with n=11 subjects, which is on the lower side for contemporary fMRI experiments. In light of recent evidence that typical mass univariate studies may yield unreliable or underpowered results (e.g., Marek et al., 2020), it is unclear how robust evidence from this validation might be. It would help to mention this limitation or provide an argument on the contrary."

Reply R2-12: We appreciate the reviewer pointing out the potential issue of the small sample size. We would like to note that the area of the fMRI study of Siegmund et al. (software engineering) is particularly challenging for recruitment. Many fMRI studies in software engineering specifically target a (small) homogeneous sample rather than a larger sample in which confounding factors related to the multi-faceted dimensions of programmer expertise, experience, and demographics may dilute the observed effects.

Nevertheless, to address this valid concern about the results of our work more explicitly, we now elaborate on the threat to validity from our sample size and the resulting need for future work.

"R.2-13 6. Minor: the use of ''Monte Carlo'' as the name of a color in Fig 1 was somewhat confusing, given that it could readily be mistaken for the method for stochastic simulation. It would also help to use more standard and readily appreciated color names."

Reply R2-13: We thank the reviewer for the suggestion and have simplified the color names. We additionally use "left" and "right" to increase colorblind-friendliness.
Competing Interests: None

Reviewer Report 07 Oct 2025

Approved with Reservations

https://doi.org/10.5256/f1000research.183550.r420203

The authors report an alternative strategy for assessing statistical significance in group fMRI studies, i.e. identifying consistent activation patterns across several participants (not group comparison studies). Instead of a conventional mass-univariate approach with voxel-wise multiple-comparison correction or cluster-level statistics, they propose a region-based, step-wise approach. They aim to develop a testing scheme that allows researchers to skip additional independent localizer studies while maintaining relatively high statistical power in small samples. Specifically, voxel-wise statistics are followed by region-wise assessment: the original p-values are combined within atlas-based regions (co-registered with each individual’s anatomical data) to test for the statistically significant involvement of that region. The method is evaluated on both simulated small-sample data and an existing real fMRI dataset. The authors conclude that their approach provides improved sensitivity for regional activations without a substantial increase in false-positive findings. They also present diagnostic p-value histograms (suggesting relative insensitivity to outlier subjects) and compare their findings with those from the original study analysed using conventional methods.

In summary, this is a well-justified example of an fMRI analysis approach that aggregates first-level mass-univariate results within pre-specified regions (or groups of regions/networks) and, in a broader sense, is based on assessing an excess of low p-values. Such approaches have recently been gaining popularity.

In general, the paper is well structured. The writing style and some formalisms (commendable for their rigour) likely reflect the authors’ background in mathematics or computer science. However, this style may make the work slightly less accessible for researchers in applied neuroimaging. I would therefore recommend rewriting parts of the Introduction and Discussion in a manner more consistent with typical neuroimaging papers. This would improve readability for researchers who wish to apply the proposed analysis in their own work. Similarly, the associated GitHub repository could benefit from a clearer tutorial explaining how to run the provided code, especially while it is not yet integrated into common neuroimaging software.

While I cannot assess all mathematical details of the proposed approach (I am assessing the work from an applied neuroimaging perspective), it generally appears valid and without major methodological errors. The authors should, however, clarify the following points in more detail:

- The procedure involves an initial voxelwise screening (FDR) before regional testing. At which level is this applied—single subject or group? Please clarify whether this introduces any dependence between selection and subsequent tests, and how this might affect nominal error rates. Please justify this decision and consider presenting supplementary results without this step.
- Dependence structure: To what extent is the Fisher method affected by correlations among voxels within a region, and how might statistical dependence between regions (as expected in fMRI data; consider functional networks or recent work on eigenmodes of brain function) influence the final multiple-comparison adjustment?

The manuscript might benefit from the following optional additions:
- Application to simulated data with different sample sizes
- Application to an additional fMRI dataset
- Comparison of results across different brain atlases

The Discussion is well balanced and considers most relevant aspects, including methodological limitations. It could benefit from a more direct comparison with other recently used approaches following a similar rationale. The conclusions follow logically from the results presented.

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Partly

If applicable, is the statistical analysis and its interpretation appropriate?
I cannot comment. A qualified statistician is required.

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: clinical neuroradiology, functional neuroimaging (methods and applied research)

Author Response 25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-to-point reply to the issues raised by the reviewer.

"R1.1: The authors report an alternative strategy for assessing statistical significance in group fMRI studies, i.e. identifying consistent activation patterns across several participants (not group comparison studies). Instead of a conventional mass-univariate approach with voxel-wise multiple-comparison correction or cluster-level statistics, they propose a region-based, step-wise approach. They aim to develop a testing scheme that allows researchers to skip additional independent localizer studies while maintaining relatively high statistical power in small samples. Specifically, voxel-wise statistics are followed by region-wise assessment: the original p-values are combined within atlas-based regions (co-registered with each individual's anatomical data) to test for the statistically significant involvement of that region. The method is evaluated on both simulated small-sample data and an existing real fMRI dataset. The authors conclude that their approach provides improved sensitivity for regional activations without a substantial increase in false-positive findings. They also present diagnostic p-value histograms (suggesting relative insensitivity to outlier subjects) and compare their findings with those from the original study analysed using conventional methods.

In summary, this is a well-justified example of an fMRI analysis approach that aggregates first-level mass-univariate results within pre-specified regions (or groups of regions/networks) and, in a broader sense, is based on assessing an excess of low p-values. Such approaches have recently been gaining popularity.

In general, the paper is well structured. The writing style and some formalisms (commendable for their rigour) likely reflect the authors' background in mathematics or computer science. However, this style may make the work slightly less accessible for researchers in applied neuroimaging. I would therefore recommend rewriting parts of the Introduction and Discussion in a manner more consistent with typical neuroimaging papers. This would improve readability for researchers who wish to apply the proposed analysis in their own work. Similarly, the associated GitHub repository could benefit from a clearer tutorial explaining how to run the provided code, especially while it is not yet integrated into common neuroimaging software."

Reply R1-1: We followed the advice of the reviewer and made the Introduction more accessible for neuroimaging researchers, specifically by moving parts of the Introduction to the methods section as well as the paragraph on alternative statistical approaches into the Discussion. We believe that the novelty of our approach and its general applicability in fMRI studies is now more evident for researchers who are interested in applying it in their own work.

We agree with the reviewer that a more detailed tutorial for using and potentially extending the code is a useful addition to the associated GitHub repository.
We therefore provide a new tutorial file with a detailed explanation of the code, which should enable users to run their own analyses on fMRI data or their own simulations with only a few edits.

"R1.2: While I cannot assess all mathematical details of the proposed approach (I am assessing the work from an applied neuroimaging perspective), it generally appears valid and without major methodological errors. The authors should, however, clarify the following points in more detail:

- The procedure involves an initial voxelwise screening (FDR) before regional testing. At which level is this applied - single subject or group? Please clarify whether this introduces any dependence between selection and subsequent tests, and how this might affect nominal error rates. Please justify this decision and consider presenting supplementary results without this step."

Reply R1-2: Steps 1 to 3 are carried out for each subject separately. In particular, no data are shared between subjects in Steps 1 to 3. Only in Step 4 are the subject-specific analyses from Steps 1 to 3 combined (for each region j separately, by applying the Fisher combination function over the participants i from 1 to n, where n denotes the number of subjects).

"R1.3: Dependence structure: To what extent is the Fisher method affected by correlations among voxels within a region, and how might statistical dependence between regions (as expected in fMRI data; consider functional networks or recent work on eigenmodes of brain function) influence the final multiple-comparison adjustment?"

Reply R1-3: There are two aspects to consider in answering this question. (1) Since, as explained above, no data are shared between subjects in Steps 1 to 3, the Fisher combination function for region $j$ (with $j$ fixed) is applied to stochastically independent p-values $p_{1j}^{\text{REGION}},\ldots,p_{nj}^{\text{REGION}}$. This ensures that the chi-square distribution with $2n$ degrees of freedom is the correct marginal null distribution for each statistic $T_j$ when considered on its own. (2) When testing all regions $j=1,\ldots,J$ simultaneously, a multiplicity correction is indeed required. Therefore, we do not use the $(1-\alpha)$-quantile of $\chi^2_{2n}$ as the rejection threshold for each $T_j$, but the $(1-\alpha\kappa)$-quantile of $\chi^2_{2n}$, where $\kappa$ is approximately equal to $1/J$ (Bonferroni correction). The theory developed in Reference No. 40 allows for choosing this adjustment factor $\kappa$ slightly larger than $1/J$, but the order of magnitude of $\kappa$ is $O\left(J^{-1}\right)$. The final multiplicity correction can therefore be thought of as a Bonferroni correction with respect to the number $J$ of considered brain regions. This is a confirmatory analysis in the sense that the family-wise error rate (with respect to the family of region-specific "no activation" null hypotheses) is controlled (up to the asymptotic nature of the chi-square distribution).
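
To make the two aspects above concrete, the following is a minimal Python sketch of the across-subject combination step. It is not taken from the authors' repository: the function names (`fisher_statistic`, `chi2_sf_even_df`, `region_decisions`), the region labels, and the p-values are hypothetical. It also assumes the plain Bonferroni choice $\kappa = 1/J$, whereas the reply notes that the theory permits a slightly larger $\kappa$. Rejecting when $T_j$ exceeds the $(1-\alpha\kappa)$-quantile of $\chi^2_{2n}$ is expressed here equivalently as the combined p-value being at most $\alpha\kappa$.

```python
import math


def fisher_statistic(p_values):
    """Fisher combination statistic T = -2 * sum(log p_i) for one region."""
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_sf_even_df(x, n):
    """Upper tail P(X > x) for X ~ chi-square with 2n degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!, so no external library is needed.
    """
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def region_decisions(region_p, alpha=0.05):
    """Combine subject-level region p-values and apply a Bonferroni threshold.

    region_p maps a region label to the list of n independent subject-level
    p-values for that region. Assumption: kappa = 1/J (plain Bonferroni).
    Returns {region: (T_j, combined p-value, reject?)}.
    """
    kappa = 1.0 / len(region_p)
    decisions = {}
    for region, p_values in region_p.items():
        t_j = fisher_statistic(p_values)
        p_combined = chi2_sf_even_df(t_j, len(p_values))
        decisions[region] = (t_j, p_combined, p_combined <= alpha * kappa)
    return decisions


if __name__ == "__main__":
    # Hypothetical region-level p-values for n = 4 subjects and J = 2 regions.
    example = {
        "region_A": [0.001, 0.002, 0.01, 0.004],
        "region_B": [0.4, 0.6, 0.3, 0.8],
    }
    for region, (t_j, p_combined, reject) in region_decisions(example).items():
        print(region, round(t_j, 3), p_combined, reject)
```

The sketch relies on the subject-level p-values for a fixed region being independent, which is what justifies the $\chi^2_{2n}$ reference distribution described in aspect (1).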

"R1.4: The manuscript might benefit from the following optional additions:
- Application to simulated data with different sample sizes
- Application to an additional fMRI dataset
- Comparison of results across different brain atlases"

Reply R1-4: We thank the reviewer for the suggestions. We extended the documentation of the code and now provide a tutorial on how to create one's own simulated data. Together with the provided scripts, this enables users to test the method under other conditions. We think that including results for additional fMRI data and different atlases is beyond the scope of this (methodological) paper. By elaborating the description of the code and the tutorial, we make the method easier for interested researchers to apply to their own data. This allows them to take into account the background knowledge on the respective research question, which is often important for the final validation of results. Furthermore, we extended the simulation study by varying the signal-to-noise ratio used in the data generation. This allows the Type I and Type II errors to be studied in a broader range of scenarios, giving insight into the statistical properties of the methods.

"R1.5: The Discussion is well balanced and considers most relevant aspects, including methodological limitations. It could benefit from a more direct comparison with other recently used approaches following a similar rationale. The conclusions follow logically from the results presented."

Reply R1-5: We are not aware of other recent approaches following a similar rationale. We would like to leave this for further research.

Competing Interests: None



Version 2

VERSION 2 PUBLISHED 01 Oct 2025

Open Peer Review

Reviewer Status

Reviewer Reports

Invited Reviewers: 1, 2, 3
Version 2 (revision), 25 Mar 2026: reports from Reviewers 1 and 3
Version 1, 01 Oct 2025: reports from Reviewers 1, 2, and 3

Benedikt Sundermann, Universitätsmedizin Oldenburg, Oldenburg, Germany; Evangelisches Krankenhaus Oldenburg (Ringgold ID: 84511), Oldenburg, Germany; University of Münster, Münster, Germany
Fabricio Cravo, Northeastern University College of Science (Ringgold ID: 195088), Boston, USA

Stephanie Noble, Northeastern University, Boston, USA
Qiran Jia, University of Southern California, Los Angeles, USA


Reviewer Report

7 Views

30 Mar 2026 | for Version 2

Approved

I Approved. Thank you for your revisions and explanations.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

clinical neuroradiology, functional neuroimaging (methods and applied research)

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

15 Views

27 Mar 2026 | for Version 2

Qiran Jia, Division of Biostatistics and Health Data Science, University of Southern California, Los Angeles, California, USA

Approved

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Biostatistics, high-dimensional data analysis, voxel-level multiple comparison, multi-view data analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

19 Views

19 Jan 2026 | for Version 1

Qiran Jia, Division of Biostatistics and Health Data Science, University of Southern California, Los Angeles, California, USA

Approved With Reservations

Is the work clearly and accurately presented and does it cite the current literature? Yes

Is the study design appropriate and is the work technically sound? Partly

Are sufficient details of methods and analysis provided to allow replication by others? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Partly

Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Biostatistics, high-dimensional data analysis, voxel-level multiple comparison, multi-view data analysis

Responses (1)

Author Response

25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-by-point reply to the issues raised by the reviewer.

"R3-1: This manuscript proposes a four-stage region-based inference workflow for signal detection in fMRI that leverages subject-specific anatomical parcellations to improve sensitivity over voxel-level multiple testing and to reduce reliance on independent “localizer” experiments. The authors apply a two-stage within-subject screening scheme, aggregate voxel-level evidence into anatomical region p-values using a partial conjunction framework, and then combine regional evidence across subjects using Fisher’s method. Simple simulations show that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline, and in the re-analysis of program comprehension fMRI data, it identifies significant regions that partially overlap prior ROI-based findings.

Overall, the manuscript is informative enough about the context, motivation, and methodology, and discusses the workflow and results in detail; however, several concerns remain, and additional work is needed to adequately justify the proposed approach.

(1) In the Introduction, the authors present a three-strategy taxonomy to motivate an alternative approach; however, the proposed strategy is introduced at length and can be easy for readers to lose track of. Thus, it would benefit from defining the strategy and stating the novelty in one crisp sentence (make sure it is understandable for nonstatistical people)."

Reply R3-1: Strategy (ii) mentioned in the Introduction refers to a data analysis plan in which the regions (of interest) themselves are defined based on study data. In our study, however, these regions are defined based on the FreeSurfer segmentation of the brain into aparc labels; cf. Section 2.2 for details. This APARC annotation is anatomical information that is not derived from the fMRI data themselves. In this sense, our data analysis plan is more similar to Strategy (i) than to Strategy (ii). Our methodological innovation, though, is that we do not map each subject's brain to a standard brain and perform the subsequent analysis for all subjects combined; instead, we carry out Steps 1 to 3 for each subject separately and combine subject-specific data only in Step 4. We clarify this in the revised version of the manuscript.

"R3-2: (2) Figure 1 is confusing, beyond the “Monte Carlo” typo. It appears to treat both studies as having the same number of participants, n, whereas Table 1 reports different sample sizes across the two studies. In addition, Table 1 and Listing 1 do not seem essential for the main manuscript; that information would be better used to improve Figure 1. Also, PijAPARC and other formulas are not presented in the introduction, so it might be inappropriate to appear here. Overall, my suggestion is that the introduction can be less technical and more approachable."

Reply R3-2: We corrected the typos in the Figure and the caption and made sure that in the published manuscript the Figure appears only in Section 4, where it is first referenced and where all formulas from Section 2 have already been introduced. The left part of the Figure schematically describes only one study (Siegmund et al., 2017), in which functional and anatomical data were acquired for each of the n participants.
We would prefer to leave Table 1 and Listing 1 in the manuscript so that readers can assess the type of study without consulting the original publication. The Introduction has been rewritten to be more concise and less technical.

"R3-3: (3) The Introduction’s discussion of MVPA and cluster-based inference feels somewhat tangential and is framed largely as a dismissal of common alternatives. It would be stronger to more directly and even-handedly summarize the pros and cons of MVPA and cluster-based inference versus standard GLM-based analyses, and then clearly state which components the proposed method adopts from each (and why)."

Reply R3-3: We agree that this discussion does not belong in the Introduction. We moved it to the Discussion (Section 5), where we elaborate on the pros and cons of each method.

"R3-4: (4) Methodologically, I am comfortable with Steps 1–2 as a within-subject pre-screening procedure, but Step 3 requires further justification. In particular, the manuscript should clarify whether the partial conjunction test from Ref. 40 remains valid after restricting inference to a screened subset of voxels, and whether FWER is still controlled across the full multi-stage pipeline. I am not sure whether their way of choosing kappa and achieving FWER control automatically guarantees global FWER control in your case. I also think the interpretation of kappa is actually essential, so it might be necessary to elaborate on it more and even conduct simulations with the choice of kappa."

Reply R3-4: When testing all regions $j=1,\ldots,J$ simultaneously, a multiplicity correction is indeed required. Therefore, in Step 4 we do not use the $(1-\alpha)$-quantile of $\chi^2_{2n}$ as the rejection threshold for each $T_j$, but the $(1-\alpha\kappa)$-quantile of $\chi^2_{2n}$, where $\kappa$ is approximately equal to $1/J$ (Bonferroni correction). The theory developed in Reference No. 40 allows for choosing this adjustment factor $\kappa$ slightly larger than $1/J$, but the order of magnitude of $\kappa$ is $O\left(J^{-1}\right)$. So, you may simply think of the final multiplicity correction as a Bonferroni correction with respect to the number $J$ of considered brain regions. This is a confirmatory analysis in the sense that the family-wise error rate (with respect to the family of region-specific "no activation" null hypotheses) is under control (up to the asymptotic nature of the chi-square distribution).
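
The Bonferroni reading of this threshold can be written out as a short union-bound calculation. This is only a sketch under stated assumptions: the standard form of the Fisher statistic, $T_j = -2\sum_{i=1}^{n}\log p_{ij}^{\text{REGION}}$, the exact choice $\kappa = 1/J$, and a null distribution of $T_j$ whose upper tail is bounded by that of $\chi^2_{2n}$. The symbols $c$ (the rejection threshold) and $J_0$ (the index set of regions with a true null hypothesis) are introduced here for illustration.

$$c = \chi^2_{2n;\,1-\alpha\kappa}, \qquad \text{FWER} = P\Big(\bigcup_{j\in J_0}\{T_j > c\}\Big) \;\le\; \sum_{j\in J_0} P\left(T_j > c\right) \;\le\; |J_0|\,\alpha\kappa \;\le\; J\,\alpha\kappa \;=\; \alpha \quad \text{for } \kappa = 1/J.$$

The union bound in the first inequality requires no assumption on the dependence between regions, which is why statistical dependence between the $T_j$ does not invalidate this correction.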

As far as the "screening" constituted by Step 1 is concerned, the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small) by this screening. This follows from the theory of fixed sequence multiple tests as explained, e.g., in Section 3.1.2.2 of doi.org/10.1007/978-3-642-45182-9. Technically, the p-values for all voxels which have not been declared significant in Step 1 are simply set to 1.0 in Step 2. Notice, moreover, that the number $m_j$ appearing in Step 3 of our proposed methodology refers to the total number of voxels belonging to region $j$, and not to the number of voxels which have been declared significant in Step 1. Thus, if (for instance) we have a region of total size $m_j = 100$ (voxels), and we set $\kappa = 0.5$, we require 50 small p-values in that region $j$ to declare that region significantly activated in Steps 3 and 4. Those p-values which have been set to 1.0 because the voxel was not declared significant in Step 1 do not contribute to achieving this criterion.
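
The bookkeeping described in this reply can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy values, and the use of a ceiling when the product of proportion and region size is not an integer are assumptions made here.

```python
import math


def screen_p_values(step1_significant, step2_p_values):
    """Keep the Step 2 p-value where Step 1 declared the voxel significant;
    otherwise set the p-value to 1.0 (hypothetical helper)."""
    return [p if significant else 1.0
            for significant, p in zip(step1_significant, step2_p_values)]


def required_small_p_values(region_size, proportion):
    """Number of small p-values needed in a region.

    The count is taken relative to the full region size m_j, not to the
    number of voxels that survived Step 1. Rounding up is an assumption.
    """
    return math.ceil(proportion * region_size)


# Toy region with five voxels (made-up values).
step1_significant = [True, False, True, True, False]
step2_p_values = [0.001, 0.0005, 0.2, 0.01, 0.03]

screened = screen_p_values(step1_significant, step2_p_values)
needed = required_small_p_values(region_size=100, proportion=0.5)

print(screened)  # voxels 2 and 5 were screened out, so their p-values are 1.0
print(needed)    # 50, matching the example in the reply
```

Note that the second voxel has the smallest Step 2 p-value in the toy data, yet it cannot count towards the criterion because it did not pass Step 1.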

"R3-5: (5) A clearer definition of the regional hypothesis Hij is needed. In particular, the paper should explicitly state what it means to reject region j after the four-step procedure, and the interpretation of kappa might be part of the hypothesis definition."

Reply R3-5: If the regional null hypothesis $H_j$ is rejected in Step 4, the data provide evidence that at least $\kappa\cdot 100\%$ of voxels in region j are active during the visual presentation of code snippets. We have slightly extended the description of the method in the revised version of the manuscript.

"R3-6: (6) The simulation setting is somewhat idealized. The paper would be strengthened by additional simulations that better reflect realistic fMRI conditions, such as imperfect parcellation/label noise and varying degrees of spatial smoothness."

Reply R3-6: We extended the simulation study by varying the signal-to-noise ratio used in the data generation. This allows studying the Type I and Type II errors in a broader range of scenarios, giving insights into the statistical properties of the methods. Furthermore, we provide a tutorial and explanation of the simulation code, which can be used to adapt the simulation to one's own purposes.

"R3-7: (7) In addition, the cluster-based baseline is only evaluated in the simplified simulation and is not compared on the real dataset; the authors should more explicitly articulate the practical benefit of the proposed method, given that cluster-based inference achieves similar power in their simulation results. While the manuscript highlights the advantage of avoiding smoothness-based threshold calibration via simulation, many contemporary fMRI inference methods provide automated, data-driven spatial modeling and thresholding (including recent deep-learning-based approaches), and these should be acknowledged and discussed as relevant baselines at least."

Reply R3-7: Thank you for highlighting this interesting aspect. We would like to clarify that a cluster-based inference was in fact attempted in the Siegmund et al. (2017) study. However, this approach did not yield any statistically significant effects in the real dataset. As a result, Siegmund et al. had to rely on an independent dataset collected in 2014 to obtain a meaningful signal. This experience was a primary motivation for developing and evaluating the method presented in this paper, since it is not common that a prior study exists that can be used in such a manner. We now explain this aspect more clearly in Section 4.1 of the revised submission.

"R3-8: (8) I am comfortable with the real data analysis findings, but an additional application would always be better to show generalizability. Figures 5–6 and Tables 2–4 communicate key insights, but the current presentation feels exploratory rather than indexing-ready. These results would benefit from more polished, information-dense visualizations."

Reply R3-8: We would like to stress that the purpose of this paper is the introduction of a new inference method for fMRI-related activation. The scope of the work is to outline the method, test it on already existing data, and compare the results with the previous findings for validation. Further potential neuroscientific insights are left to applications of the method to new study data.

Competing Interests

None

Reviewer Report

31 Views

26 Dec 2025 | for Version 1

Fabricio Cravo, Psychology, Northeastern University College of Science (Ringgold ID: 195088), Boston, Massachusetts, USA

Stephanie Noble, Northeastern University, Boston, Massachusetts, USA

Not Approved

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Statistical methods for neuroimaging; computational methods for neuroimaging; computer science and applied mathematics (Cravo)

Responses (1)

Author Response

25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-by-point reply to the issues raised by the reviewer.

"R2-1: The authors introduce a multi-stage approach for aggregating evidence within a priori regions. They show this approach yields higher power than voxel-level inference and comparable power to cluster-based inference in simulations, and is able to highlight task-relevant regions in empirical data compared with a voxel-driven approach. This is an interesting procedure. However, there are a few concerns with the statistical approach that must be addressed before the contribution of the study can be fully evaluated.

First, it was challenging to evaluate the rationale for the Step 1 & 2 pre-selection of voxels for p-value estimation + correction (only applicable for nested conditions?) and how its addition leads to a valid statistical test. The core inspiration for these steps comes from Ref 43 (Seigmund et al., 2017), but we were unable to find many details regarding the statistical rationale from that manuscript and thus were unable to evaluate multiple questions."

Reply R2-1: The reviewer is right that the two conditions in Steps 1 and 2 are nested. This makes sense from the application point of view, because only those voxels $k$ are carried over to Step 2 which show activation in the less specific task considered in Step 1 (as compared to the more specific task in Step 2). From the mathematical-statistical point of view, however, this nestedness is not explicitly made use of in the statistical methodology. The relevant assumptions are that (i) the two conditions to be tested in Steps 1 and 2 are pre-defined before seeing any data which are used to compute p-values, and (ii) the explicit p-value calculation in Step 2 is only performed for those voxels which have been declared significant in Step 1. Technically, the p-values for all voxels which have not been declared significant in Step 1 are set to 1.0 in Step 2. Assumptions (i) and (ii) ensure that the p-value distribution under the null hypothesis tested in Step 2 is not deflated (p-values are not systematically too small). This follows from the theory of fixed sequence multiple tests as explained, e.g., in Section 3.1.2.2 of dx.doi.org/10.1007/978-3-642-45182-9.
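
A short calculation illustrates why setting screened-out p-values to 1.0 cannot make the resulting p-values systematically too small. This is only a simplified sketch, not the fixed sequence argument from the cited reference: it assumes that the Step 2 p-value of voxel $k$ in subject $i$, written here as $q_{ik}$, is valid under the Step 2 null hypothesis, i.e. $P(q_{ik}\le t)\le t$ for all $t\in[0,1]$. The symbols $q_{ik}$, $q_{ik}^{\text{scr}}$ and $S_{ik}$ (the event that the voxel is declared significant in Step 1) are notation introduced for this illustration and do not appear in the manuscript.

$$q_{ik}^{\text{scr}} = \begin{cases} q_{ik}, & \text{if } S_{ik} \text{ occurs},\\ 1, & \text{otherwise},\end{cases} \qquad P\left(q_{ik}^{\text{scr}}\le t\right) = P\left(S_{ik}\cap\{q_{ik}\le t\}\right) \;\le\; P\left(q_{ik}\le t\right) \;\le\; t \quad \text{for } 0\le t<1.$$

For $t = 1$ the bound holds trivially, so under these assumptions the screened p-value is again stochastically lower-bounded by a uniform distribution, whatever the dependence between the Step 1 selection and $q_{ik}$.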

"R.2-2 1) The introduction references Strategy (ii), noting that regions must be defined based on information outside the data that we set out to analyze to avoid selection biases. However, in the proposed method, Step 1 uses the same functional data to select voxels. Could the authors clarify how this satisfies the independence requirement? The justification that "FDR is an established screening criterion for high-dimensional multiple test problems" is not sufficient to address this."

Reply R2-2: Strategy (ii) mentioned in the Introduction refers to a data analysis plan in which the regions (of interest) themselves are defined based on study data. In our study, however, these regions are defined based on the FreeSurfer segmentation of the brain into aparc labels; cf. Section 2.3 for details. This APARC annotation is anatomical information that is not derived from the fMRI data themselves. In this sense, our data analysis plan is more similar to Strategy (i) than to Strategy (ii). Our methodological innovation, though, is that we do not map each subject's brain to a standard brain and perform the subsequent analysis for all subjects combined; instead, we carry out Steps 1 to 3 for each subject separately and combine subject-specific data only in Step 4.

As explained in our answer to a previous point, Step 1 is only used to set certain voxel-specific p-values to 1.0 if these voxels do not show activation in the more general task (contrasted with blocks of resting state) corresponding to Step 1 as compared to the more specific task contrast in Step 2. In particular, Step 1 is not used to define the regions $j=1,\ldots,J$ .

"R.2-3 2) Related: the test in Ref 40 evaluates whether a sufficient proportion of voxels within an anatomically-defined region shows activation, and it appears it tests the complete set of voxels within the boundary. Could the authors clarify how applying an FDR screening selection within a boundary before applying the hypothesis test developed in Ref 40 still qualifies as an appropriate use of the test?"

Reply R2-3: The methodology developed in Reference 40 assumes that the p-values $(\tilde{p}_{ik})$ (in the notation of our present manuscript, cf. Step 2) are valid, meaning that under the null hypothesis of no activation in Step 2 they are stochastically lower-bounded by a uniform distribution. As argued in our response to your first point, this is the case for our proposed method. Notice, moreover, that the number $m_j$ appearing in Step 3 of our proposed methodology refers to the total number of voxels belonging to region $j$, and not to the number of voxels which have been declared significant in Step 1. Thus, if (for instance) we have a region of total size $m_j = 100$ (voxels), and we set $\kappa = 0.5$, we require 50 small p-values in that region $j$ to declare that region significantly activated in Steps 3 and 4. Those p-values which have been set to 1.0 because the voxel was not declared significant in Step 1 do not contribute to achieving this criterion.

"R.2-4 3) (If the above is appropriate) Why does program comprehension vs. rest serve as an adequate condition of pre-selection for voxels that may be significant in the bottom-up comprehension vs. control condition? Isn't it possible that an effect may be stronger in the latter condition than the former condition, thus meeting significance only in the latter and not former condition?"

Reply R2-4: The reviewer is correct, and this is exactly the reason why the nestedness of the conditions referring to Steps 1 and 2 is a meaningful setup. Step 1 makes the plausible assumption about functional specificity that any region that is involved in a specific cognitive function (here program comprehension) shows stronger activation during that function than during rest. In Step 2, the syntax task serves as a more specific control condition since the same program snippets are shown to the participants albeit with a different task of finding syntax errors which does not require program comprehension.

"R.2-5 4) When a region is declared significant, what specific hypothesis has been rejected? How should the result be interpreted, given the multiple stages of testing?"

Reply R2-5: If the regional null hypothesis $H_j$ is rejected in Step 4, the data provide evidence that at least $\kappa\cdot 100\%$ of voxels in region j are active during the visual presentation of code snippets.

"R.2-6 In general, it would help to provide a bit more detail regarding the statistical approaches used in the present manuscript to avoid too much consultation with the works in Ref 40 and 43 to obtain key details."

Reply R2-6: In the revised version of the manuscript, we provide additional explanations of our proposed statistical methodology at several places in Sections 1 and 2, along the lines of our above answers to your respective points.

"R.2-7 Second, and perhaps more importantly, the authors indicate throughout the manuscript (starting in the abstract and ending in the conclusion) that their main takeaway is that the proposed method outperforms voxel-based inference. For example:
Abstract: ''The results of our simulated data show that our proposed method demonstrated significantly higher power in detecting true activation.''
which, based on Table 2, is only true when comparing to voxel-based inference, not cluster-based inference.
However, voxel-based inference is exceedingly rare in fMRI analysis due to the widespread appreciation that this approach is overconservative when correcting for so many tests, and the development of methods that leverage the known spatial dependence to obtain more powerful inferences. As such, cluster-based approaches are much more widely used. As the authors demonstrate in their one comparison with cluster-level inference (i.e., the simulation study) that the proposed approach yields nearly equivalent Type I and II error as cluster-based inference, it is unclear what the practical advantages are for the proposed method compared with the more commonly used cluster-based inference approach that also leverages spatial dependence."

Reply R2-7: We have made the statement in the abstract more specific: "Our simulations indicate that the proposed approach improves power relative to voxel-wise inference and performs comparably to a cluster-based baseline." The practical advantage of the proposed method comes from its similar statistical power when applied to the real data while not utilizing the results of a functional localizer. We think that this is a convincing argument in favor of our suggested approach.

In the revision we have extended our simulation study. In particular, we have now varied the signal-to-noise ratio (SNR). It turns out that for $\text{SNR}\in[0.8,1.1]$ our proposed procedure can achieve up to approximately 10% higher power than the cluster-based method.

"R.2-8 Misc additional comments
1. It was not always clear in the manuscript where exactly the present work departs from that of Schildknecht et al. (2016) (Ref 40). The authors state ''In the present work, we propose a new strategy and apply it to the data from Ref. 43'', but it appears that the bulk of this strategy is the region-based procedure by Schildknecht et al. (2016) (Ref 40), as they later state in the Methods. Similarly, in Fig. 1 they indicate ''the part on the right indicates the processing steps proposed in this paper'', whose main component is the method of Schildknecht et al. Finally, they indicate ''To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea,'' but, again, inference in subject-specific regions appears to be part of the previous work. To our knowledge, the main contributions of the present manuscript appears to be the aggregation of evidence over a group of subjects instead of a single subject (Fisher's method), the additional validation analyses, and, secondarily, the Step 1-2 estimation of p-values for preselected voxels that is only applicable for nested conditions (but again, see major concern #1). These are valuable additions but not always made clear. When it is not clear what the relevant additions are, it becomes difficult to know when to seek key details from the present manuscript compared with previous works. This all seems mostly clear enough in the abstract but becomes less so in the text."

Reply R2-8: The reviewer is correct that Step 3 of our proposed methodology relies on Reference No. 40. However, the usage of the Fisher combination in Step 4 as well as the general idea to carry out Steps 1 - 3 on the subject level and only to combine all subject-specific data in Step 4 is to the best of our knowledge a novel proposal.

"R.2-9 2. The authors previously state (Ref. 40) that the weak dependency assumption may be problematic, and indeed substantial dependence between neighboring voxels is at the heart of many cluster- or region-based inference approaches. This limitation should be reiterated here."

Reply R2-9: The reviewer is correct that weak dependency (in the specific sense that averaged empirical cdfs of p-values converge in the Glivenko-Cantelli sense, both under the null and under the alternative) is assumed for Step 3 of our proposed methodology. We explicitly mention this in the revised version of the manuscript.

"R.2-10 3. It would be helpful to provide more detail for the method used for RFT cluster-level inference. The authors indicate that a priori thresholds for cluster-level inference are determined by simulation, but it is unclear how and whether a priori thresholds are set a priori or via simulation for the present study. This would be helpful to provide as those thresholds substantially influence the results. Common practice is to set the cluster-determining threshold a priori and use theory or simulation to obtain the cluster-size threshold indicating significant clusters corresponding with a target level of FWER control in AFNI, FSL, or SPM; if the approach here departs from that, it would be helpful to mention that as well."

Reply R2-10: For the cluster-level inference we use the R package fmri, which provides a function for this type of analysis. The thresholds therein are indeed determined by simulation for a wide range of significance levels, spatial smoothness, and data size, similar to AFNI's implementation; see Section 4.4.5 in Polzehl and Tabelow (2023), Magnetic Resonance Brain Imaging: Modelling and Data Analysis Using R. We added a corresponding clarification in the manuscript.

"R.2-11 4. Note that Figure 1 suggests that Steps 1-2 were performed by Siegmund et al., but the methods indicate that the authors conducted their own GLM analysis for these steps. It would help to update the figure to include this detail to avoid confusion."

Reply R2-11: The indication in Figure 1 is actually correct. We re-used the existing data from Siegmund et al. after they had completed the GLM analysis of Steps 1 and 2. We have clarified the phrasing in our methods section to make this clearer.

"R.2-12 5. The real data validation uses a single dataset with n=11 subjects, which is on the lower side for contemporary fMRI experiments. In light of recent evidence that typical mass univariate studies may yield unreliable or underpowered results (e.g., Marek et al., 2020), it is unclear how robust evidence from this validation might be. It would help to mention this limitation or provide an argument on the contrary."

Reply R2-12: We appreciate the reviewer pointing out the potential issue of the small sample size. We would like to note that the area of the fMRI study of Siegmund et al. (software engineering) is particularly challenging for recruitment. Many fMRI studies in software engineering specifically target a (small) homogeneous sample rather than a larger sample in which confounding factors related to the multi-faceted dimensions of programmer expertise, experience, and demographics may dilute the observed effects.

Nevertheless, to address this valid concern about the results of our work more explicitly, we now elaborate on the threat to validity arising from our sample size and the resulting need for future work.

"R.2-13 6. Minor: the use of ''Monte Carlo'' as the name of a color in Fig 1 was somewhat confusing, given that it could readily be mistaken for the method for stochastic simulation. It would also help to use more standard and readily appreciated color names."

Reply R2-13: We thank the reviewer for the suggestion and have simplified the color names. We also additionally used "left" and "right" to increase colorblind-friendliness.

Competing Interests

None

Reviewer Report

35 Views

07 Oct 2025 | for Version 1

Approved With Reservations

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

clinical neuroradiology, functional neuroimaging (methods and applied research)

Responses (1)

Author Response

25 Mar 2026

Norman Peitek, Saarland University, Saarbrücken, Germany

We would like to thank the reviewer for their constructive feedback on our manuscript. Here, we give a point-by-point reply to the issues raised by the reviewer.

"R1.1: The authors report an alternative strategy for assessing statistical significance in group fMRI studies, i.e. identifying consistent activation patterns across several participants (not group comparison studies). Instead of a conventional mass-univariate approach with voxel-wise multiple-comparison correction or cluster-level statistics, they propose a region-based, step-wise approach. They aim to develop a testing scheme that allows researchers to skip additional independent localizer studies while maintaining relatively high statistical power in small samples. Specifically, voxel-wise statistics are followed by region-wise assessment: the original p-values are combined within atlas-based regions (co-registered with each individual's anatomical data) to test for the statistically significant involvement of that region. The method is evaluated on both simulated small-sample data and an existing real fMRI dataset. The authors conclude that their approach provides improved sensitivity for regional activations without a substantial increase in false-positive findings. They also present diagnostic p-value histograms (suggesting relative insensitivity to outlier subjects) and compare their findings with those from the original study analysed using conventional methods.

In summary, this is a well-justified example of an fMRI analysis approach that aggregates first-level mass-univariate results within pre-specified regions (or groups of regions/networks) and, in a broader sense, is based on assessing an excess of low p-values. Such approaches have recently been gaining popularity.

In general, the paper is well structured. The writing style and some formalisms (commendable for their rigour) likely reflect the authors' background in mathematics or computer science. However, this style may make the work slightly less accessible for researchers in applied neuroimaging. I would therefore recommend rewriting parts of the Introduction and Discussion in a manner more consistent with typical neuroimaging papers. This would improve readability for researchers who wish to apply the proposed analysis in their own work. Similarly, the associated GitHub repository could benefit from a clearer tutorial explaining how to run the provided code, especially while it is not yet integrated into common neuroimaging software."

Reply R1-1: We followed the advice of the reviewer and made the Introduction more accessible for neuroimaging researchers, specifically by moving parts of the Introduction to the methods section as well as the paragraph on alternative statistical approaches into the Discussion. We believe that the novelty of our approach and its general applicability in fMRI studies is now more evident for researchers who are interested in applying it in their own work.

We agree with the reviewer that a more detailed tutorial for using and potentially extending the code is a useful addition to the associated GitHub repository.
We therefore provide a new tutorial file with a detailed explanation of the code, which should enable users to run their own analyses on fMRI data or their own simulations with only a few edits.

"R1.2: While I cannot assess all mathematical details of the proposed approach (I am assessing the work from an applied neuroimaging perspective), it generally appears valid and without major methodological errors. The authors should, however, clarify the following points in more detail:

- The procedure involves an initial voxelwise screening (FDR) before regional testing. At which level is this applied - single subject or group? Please clarify whether this introduces any dependence between selection and subsequent tests, and how this might affect nominal error rates. Please justify this decision and consider presenting supplementary results without this step."

Reply R1-2: Steps 1 to 3 are carried out for each subject separately. In particular, no data are shared between subjects in Steps 1 - 3. Only in Step 4, the subject-specific analyses from Steps 1 to 3 are combined (for each region j separately by utilizing the Fisher combination function over the participants i from 1 to n, where n denotes the number of subjects).

"R1.3: Dependence structure: To what extent is the Fisher method affected by correlations among voxels within a region, and how might statistical dependence between regions (as expected in fMRI data; consider functional networks or recent work on eigenmodes of brain function) influence the final multiple-comparison adjustment?"

Reply R1-3: There are two aspects to consider for answering this question: (1) Since, as explained above, no data are shared between subjects in Steps 1 - 3, the (region-specific, meaning $j$ is fixed) Fisher combination function is applied to stochastically independent p-values $p_{1j}^{\text{REGION}},\ldots,p_{nj}^{\text{REGION}}$ for each region $j$. This ensures that the chi-square distribution with $2n$ degrees of freedom is the correct marginal null distribution for each statistic $T_j$ when considered "stand-alone". (2) When testing all regions $j=1,\ldots,J$ simultaneously, a multiplicity correction is indeed required. Therefore, we do not use the $(1-\alpha)$-quantile of $\chi^2_{2n}$ as the rejection threshold for each $T_j$, but the $(1-\alpha\kappa)$-quantile of $\chi^2_{2n}$, where $\kappa$ is approximately equal to $1/J$ (Bonferroni correction). The theory developed in Reference No. 40 allows for choosing this adjustment factor $\kappa$ slightly larger than $1/J$, but the order of magnitude of $\kappa$ is $O\left(J^{-1}\right)$. So, you may simply think of the final multiplicity correction as a Bonferroni correction with respect to the number $J$ of considered brain regions. This is a confirmatory analysis in the sense that the family-wise error rate (with respect to the family of region-specific "no activation" null hypotheses) is under control (up to the asymptotic nature of the chi-square distribution).
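
Points (1) and (2) can be illustrated with a minimal, standard-library Python sketch. It assumes the usual form of the Fisher statistic, $T_j = -2\sum_{i=1}^{n}\log p_{ij}^{\text{REGION}}$, and the plain Bonferroni factor $1/J$ in place of the slightly larger $\kappa$ permitted by Reference No. 40; the function names and toy p-values are hypothetical, and this is not the code from the authors' repository. Because the degrees of freedom $2n$ are even, the chi-square tail probability has a closed form, so rejecting when $T_j$ exceeds the $(1-\alpha/J)$-quantile is the same as rejecting when the tail probability is below $\alpha/J$.

```python
import math


def chi2_sf_even_df(x, n):
    """Upper tail P(X > x) for X ~ chi-square with 2n degrees of freedom.

    Closed form for even degrees of freedom:
    exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    """
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combination(p_values):
    """Return (T, combined p-value) for independent p-values in (0, 1]."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, len(p_values))


def reject_regions(region_p_values, alpha=0.05):
    """Bonferroni-type decision per region (assumes adjustment factor 1/J).

    region_p_values maps a region label to the list of subject-level
    regional p-values for that region.
    """
    adjustment = 1.0 / len(region_p_values)
    decisions = {}
    for region, p_values in region_p_values.items():
        _, combined_p = fisher_combination(p_values)
        decisions[region] = combined_p < alpha * adjustment
    return decisions


# Toy example: two regions, three subjects each (made-up p-values).
toy = {"region_A": [0.01, 0.02, 0.03], "region_B": [0.5, 0.6, 0.7]}
decisions = reject_regions(toy)
print(decisions)
```

The dependence between voxels within a region does not enter this step: only the $n$ subject-level p-values of one region are combined, and these are independent because no data are shared between subjects.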

"R1.4: The manuscript might benefit from the following optional additions:
- Application to simulated data with different sample sizes
- Application to an additional fMRI dataset
- Comparison of results across different brain atlases"

Reply R1-4: We thank the reviewer for the suggestions. We have extended the documentation of the code and now provide a tutorial on how to create one's own simulated data. Together with the provided scripts, users can now test the method under other conditions. We think that including results for further fMRI datasets and different atlases is beyond the scope of this (methodological) paper. By elaborating the description of the code and the tutorial, we make the method easier to apply to interested researchers' own data. This allows them to take into account the necessary background knowledge on the respective research question, which is often important for the final validation of results. Furthermore, we extended the simulation study by varying the signal-to-noise ratio used in the data generation. This allows studying the Type-I and Type-II errors in a broader range of scenarios and gives insight into the statistical properties of the methods.

"R1.5: The Discussion is well balanced and considers most relevant aspects, including methodological limitations. It could benefit from a more direct comparison with other recently used approaches following a similar rationale. The conclusions follow logically from the results presented."

Reply R1-5: We are not aware of other recent approaches following a similar rationale. We would like to leave this for further research.


Competing Interests

None


1. Amunts K, Mohlberg H, Bludau S, et al.: Julich-Brain: A 3D probabilistic atlas of the human brain’s cytoarchitecture. Science. 2020; 369: 988–992.

2. Andreella A, Feilong M, Halchenko Y, et al.: A Statistical Approach to the Alignment of fMRI Data. Book of Short Papers, SIS 2020. Pollice A, Salvati N, Spagnolo FS, editors. 2020; pp. 733–738.

3. Benjamini Y, Bogomolov M: Selective inference on multiple families of hypotheses. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014; 76: 297–318.

4. Benjamini Y, Heller R: False discovery rates for spatial signals. J. Am. Stat. Assoc. 2007; 102: 1272–1281.

5. Benjamini Y, Heller R: Screening for partial conjunction hypotheses. Biometrics. 2008; 64: 1215–1222.

6. Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 1995; 57: 289–300.

7. Brodmann K: Vergleichende Lokalisationslehre der Großhirnrinde in ihren Prinzipien dargestellt auf Grund des Zellbaues. Leipzig: Barth; 1909.

8. Brooks R: Using a behavioral theory of program comprehension in software engineering. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 1978; pp. 196–201.

9. Brooks R: Towards a theory of the comprehension of computer programs. Int. J. Man-Mach. Stud. 1983; 18: 543–554.

10. Castelhano J, Duarte I, Ferreira C, et al.: The Role of the Insula in Intuitive Expert Bug Detection in Computer Code: An fMRI Study. Brain Imaging Behav. 2018; 1–15.

11. Desikan R, Ségonne F, Fischl B, et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006; 31: 968–980.

12. Destrieux C, Fischl B, Dale A, et al.: Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage. 2010; 53: 1–15.

13. Dickhaus T: Simultaneous statistical inference with applications in the life sciences. Berlin Heidelberg: Springer-Verlag; 2014.

14. Dudoit S, van der Laan M: Multiple testing procedures with applications to genomics. Springer Series in Statistics. New York, NY: Springer; 2008.

15. Duraes J, Madeira H, Castelhano J, et al.: Understanding the Brain at Software Debugging. Proceedings International Symposium Software Reliability Engineering (ISSRE). 2016; pp. 87–92.

16. Eklund A, Nichols TE, Knutsson H: Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proc. Natl. Acad. Sci. 2016; 113: 7900–7905.

17. Esteban O, Markiewicz CJ, Burns C, et al.: nipy/nipype: 1.5.0. 2020.

18. Fischl B, Salat DH, Busa E, et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron. 2002; 33: 341–355.

19. Fischl B, Van Der Kouwe A, Destrieux C, et al.: Automatically parcellating the human cerebral cortex. Cereb. Cortex. 2004; 14: 11–22.

20. Floyd B, Santander T, Weimer W: Decoding the Representation of Code in the Brain: An fMRI Study of Code Review and Expertise. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 2017; pp. 175–186.

21. Forman S, Cohen J, Fitzgerald M, et al.: Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. Magn. Reson. Med. 1995; 33: 636–647.

22. Friston K, Rotshtein P, Geng J, et al.: A critique of functional localisers. NeuroImage. 2006; 30: 1077–1087.

23. Gorgolewski K, Burns CD, Madison C, et al.: Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python. Front. Neuroinform. 2011; 5: Article 13.

24. Haxby JV: Multivariate pattern analysis of fMRI: The early beginnings. NeuroImage. 2012; 62: 852–855.

25. Heller R, Stanley D, Yekutieli D, et al.: Cluster-based analysis of fMRI data. NeuroImage. 2006; 33: 599–608.

26. Hu J, Zhao H, Zhou H: False discovery rate control with groups. J. Am. Stat. Assoc. 2010; 105: 1215–1227.

27. Huang Y, Liu X, Krueger R, et al.: Distilling neural representations of data structure manipulation using fMRI and fNIRS. Proceedings of International Conference on Software Engineering (ICSE). IEEE; 2019; pp. 396–407.

28. Jarmasz M, Somorjai R: Exploring regions of interest with cluster analysis (EROICA) using a spectral peak statistic for selecting and testing the significance of fMRI activation time-series. Artif. Intell. Med. 2002; 25: 45–67.

29. Krueger R, Huang Y, Liu X, et al.: Neurological divide: An fMRI study of prose and code writing. Proceedings of International Conference on Software Engineering (ICSE). 2020; pp. 678–690.

30. Lazar N: The statistical analysis of functional MRI data. Statistics for Biology and Health. Springer; 2008.

31. Lindquist M: The statistical analysis of fMRI data. Stat. Sci. 2008; 23: 439–464. MR 2530545.

32. Liu Y, Sarkar S, Zhao Z: A new approach to multiple testing of grouped hypotheses. J. Statist. Plann. Inference. 2016; 179: 1–14. MR 3550875.

33. Makris N, Goldstein JM, Kennedy D, et al.: Decreased volume of left and total anterior insular lobule in schizophrenia. Schizophr. Res. 2006; 83: 155–171.

34. Neumann A, Peitek N, Brechmann A, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. WIAS Preprint No. 2806. 2021.

35. Nieto-Castañón A, Fedorenko E: Subject-specific functional localizers increase sensitivity and functional resolution of multi-subject analyses. NeuroImage. 2012; 63: 1646–1669.

36. Noble S, Mejia AF, Zalesky A, et al.: Improving power in functional magnetic resonance imaging by moving beyond cluster-level inference. Proc. Natl. Acad. Sci. USA. 2022; 119: e2203020119.

37. Ombao H, Lindquist M, Thompson W, et al.: Handbook of Neuroimaging Data Analysis. New York: CRC Press; 2016.

38. Peitek N, Brechmann A, Tabelow K, et al.: Utilizing anatomical information for signal detection in functional magnetic resonance imaging. Data. 2024.

39. Peitek N, Apel S, Parnin C, et al.: Program comprehension and code complexity metrics: An fMRI study. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE; 2021; pp. 524–536.

40. Pennington N: Stimulus structures and mental representations in expert comprehension of computer programs. Cogn. Psychol. 1987; 19: 295–341.

41. Poldrack RA, Mumford JA, Nichols TE: Handbook of functional MRI data analysis. Cambridge University Press; 2011.

42. Polzehl J, Tabelow K: Magnetic Resonance Brain Imaging: Modelling and Data Analysis Using R. Springer International Publishing; 2023.

43. Rosenblatt J, Finos L, Weeda W, et al.: All-resolutions inference for brain imaging. NeuroImage. 2018; 181: 786–796.

44. Saxe R, Brett M, Kanwisher N: Divide and conquer: A defense of functional localizers. NeuroImage. 2006; 30: 1088–1096.

45. Schildknecht K, Tabelow K, Dickhaus T: More specific signal detection in functional magnetic resonance imaging by false discovery rate control for hierarchically structured systems of hypotheses. PLoS One. 2016; 11: 1–21.

46. Shi R, Guo Y: Investigating differences in brain functional networks using hierarchical covariate-adjusted independent component analysis. Ann. Appl. Stat. 2016; 10: 1930–1957. MR 3592043.

47. Siegmund J, Kästner C, Apel S, et al.: Understanding understanding source code with functional magnetic resonance imaging. Proceedings International Conference on Software Engineering (ICSE). ACM; 2014; pp. 378–389.

48. Siegmund J, Peitek N, Parnin C, et al.: Measuring neural efficiency of program comprehension. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE). New York, NY, USA: Association for Computing Machinery; 2017; pp. 140–150.

49. Siegmund J, Schumann J: Confounding parameters on program comprehension: a literature survey. Empir. Softw. Eng. 2015; 20: 1159–1192.

50. Soloway E, Ehrlich K: Empirical studies of programming knowledge. IEEE Trans. Softw. Eng. 1984; 10: 595–609.

51. Tabelow K, Polzehl J: Statistical parametric maps for functional MRI experiments in R: The package fmri. J. Stat. Softw. 2011; 44: 1–21.

52. Talairach J, Tournoux P: Co-planar stereotaxic atlas of the human brain. Thieme; 1988.

53. Vovk V, Wang R: Combining p-values via averaging. Biometrika. 2020; 107: 791–808.

54. Vovk V, Wang R: E-values: Calibration, combination, and applications. Ann. Stat. 2021; 49: 1736–1754.

55. Wagner S, Wyrich M: Code comprehension confounders: A study of intelligence and personality. IEEE Trans. Softw. Eng. 2021; 48: 4789–4801.

56. Welvaert M, Durnez J, Moerkerke B, et al.: neuRosim: An R package for generating fMRI data. J. Stat. Softw. 2011; 44: 1–18.

57. Wilson D: The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. 2019; 116: 1195–1200.

58. Worsley K: Local maxima and the expected Euler characteristic of excursion sets of χ², F and t fields. Adv. Appl. Probab. 1994; 26: 13–42.

59. Worsley K, Marrett S, Neelin P, et al.: A unified statistical approach for determining significant signals in images of cerebral activation. Hum. Brain Mapp. 1996; 4: 58–73.

60. Yekutieli D: Hierarchical false discovery rate-controlling methodology. J. Am. Stat. Assoc. 2008; 103: 309–316.

61. Zhao H, Zhang J: Weighted p-value procedures for controlling FDR of grouped hypotheses. J. Stat. Plann. Inference. 2014; 151-152: 90–106.



Figure 6. Histogram of p values of `ctx_lh_G_temporal_middle` label that is evaluated as significantly activated.

Figure 7. Histogram of p values of `ctx_lh_G_pariet_inf-Angular` that is not evaluated as significantly activated.