ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Article

Utilizing anatomical information for signal detection in functional magnetic resonance imaging

[version 1; peer review: awaiting peer review]
PUBLISHED 01 Oct 2025
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS AWAITING PEER REVIEW

Abstract

Background

We are considering the statistical analysis of functional magnetic resonance imaging (fMRI) data. As demonstrated in previous work, grouping voxels into regions (of interest) and carrying out a multiple test for signal detection on the basis of these regions typically leads to a higher sensitivity when compared with voxel-wise multiple testing approaches.

Methods

In the case of a multi-subject study, we propose to define the regions for each subject separately based on their individual brain anatomy, represented, e.g., by regional labels. The aggregation of the subject-specific evidence for the presence of signals in the different regions is then performed by means of a combination function for p-values. We validate the proposed methodology with simulated data and apply it to real fMRI data of a hypothesis-driven approach towards identifying brain regions involved in understanding software code.

Results

The results of our simulated data show that our proposed method demonstrated significantly higher power in detecting true activation. Testing our method on real fMRI data, we found that our approach yields overlapping results with a two-stage approach for which two independent experiments are needed, one for defining the regions and one for actual signal detection.

Conclusions

In this paper, we overall demonstrate that our method of utilizing anatomical information is a candidate to provide a more sensitive analysis of fMRI data.

Keywords

Aparc label; combination test; false discovery rate; mass-univariate linear model; program comprehension

1. Introduction

Signal detection in high-dimensional data is a major topic of modern statistics. Typically, structural information like, for instance, (the degree of ) sparsity of the signal is necessary for its detectability and/or (consistent) estimability; see, e.g., Figure 1 in Refs. 13, 47, Chapter 7 in Ref. 50, as well as references therein.

b1abd514-30cf-471d-b55d-8b47477c46d7_figure1.gif

Figure 1. Illustration of processing of the experimental data.

The box in Harvest Gold indicates analysis steps that have already been performed in Siegmund et al.43 The box in Monte Carlo indicates the processing steps proposed in this paper.

Especially in the context of functional magnetic resonance imaging (fMRI), another type of structural information is localization. The primary units of fMRI measurement are volume units (voxels) of the human brain. Localization means that scattered, spread-out signals (in single voxels or very small groups of voxels) are prone to be artifacts. Instead, topologically contiguous signals forming larger groups of voxels are much more plausible; see Refs. 19, 28. This information rules out certain patterns of the signal structure a priori, and can hence be exploited to increase the statistical power for signal detection; see, among many others.40 There are different possibilities to define or find such regions: (i) One may refer to an atlas of the brain, like the Brodmann atlas (see Ref. 7) and aggregate data within the regions given by this atlas. This has been the strategy in Ref. 40. (ii) One may find the regions (of interest) in a data-driven manner, e.g., by a cluster analysis. This has been proposed, among others, in Ref. 26. However, as emphasized for instance in Ref. 4, it is important that this data-driven definition of regions is “based on information outside the data that we set out to analyze”, meaning that the dataset used for defining the regions should be (stochastically) independent of the dataset which is used for signal detection, to avoid selection biases. (iii) One may choose a statistical methodology which ensures statistically valid conclusions even for regions which are selected in a post-hoc manner after having seen the actual study data. This can be achieved by simultaneous inference methods which guarantee that any possible selection event is accounted for; see, e.g., Ref. 38 and references therein.

All of the three aforementioned strategies have their assets and their drawbacks: Strategy (i) is inexpensive and easy to implement, but the regions taken from the atlas may not be optimally aligned with the specific task at hand, and differences in the individual brain anatomies of the study participants may complicate its application. For a statistical approach to the alignment of fMRI data, see, e.g. Ref. 2 and the references therein. Strategy (ii) is costly (two independent experiments are needed), but is supposed to yield a more accurate definition of the regions (of interest). Strategy (iii) avoids both the (potentially suboptimal) a priori definition of regions and the need for an additional independent experiment. However, the issue of multiple testing (see, e.g., Refs. 12, 14) becomes much more severe if simultaneity over all possible selection events has to be guaranteed, and the selection of the regions is not based on a clear-cut (statistical) criterion, but on the expert judgment of the study data. Therefore, it is hard to compare the results of Strategy (iii) with those of Strategies (i) and (ii).

Strategy (ii) has been followed in two studies (published in Refs. 42, 43) in which programmers comprehended program code (see Table 1 for details). In the first study (see Ref. 42), participants were asked to understand short program code snippets, such as shown in Listing 1. The program code did not contain any useful identifier names, which induces bottom-up comprehension (cf. Ref. 37). In this first study, the bottom-up comprehension task was contrasted with a syntax task, in which participants were presented with similar program code snippets, but only had to focus on syntax errors (e.g., missing semicolon). This control condition was intended to reveal only brain activation that is necessary for programmers to comprehend program code in-depth. As an additional control condition, the experiment included phases of rest in between the comprehension and syntax conditions.

Table 1. Overview of the two related fMRI studies of program comprehension.

Study 1; see Ref. 42Study 2; see Ref. 43
Participant sessions1614
Trials1230
ConditionsBottom-up program comprehension, control (syntax), restTop-down program comprehension, Bottom-up program comprehension, control (syntax), rest
Scans900900
b1abd514-30cf-471d-b55d-8b47477c46d7_figure7.gif

Listing 1. Example code snippet in Java from Siegmund et al. (cf. Ref. 42) that computes the length of the last word in a string.

The snippet uses non-meaningful identifiers to induce bottom-up comprehension. Participants needed to figure out the output of this snippet “5”.

The second study (see Ref. 43) was a follow-up study that also differentiated the program-comprehension task into more nuanced conditions. One aim was to differentiate between bottom-up comprehension and top-down comprehension (cf. Ref. 9) which was induced by varying the meaningfulness of identifier names and by prior training to provide participants with the necessary knowledge. As in the previous study, the syntax task served as a control condition. Another research question addressed the goal of confirming the activated brain areas from the first study. To this end, the second study built on the regions identified in the first study, thus following Strategy (ii). Both studies were approved by the ethics board of the University of Magdeburg (Application: 87/14).

In the present work, we propose a new strategy and apply it to the data from Ref. 43 without using the knowledge about cortical regions relevant for understanding software code and identified in Ref. 42: We first utilize structural information from the individual’s brain scans. Here, we use an automatic parcellation of the brain into anatomical labels, which assign the voxels of an individual to pre-defined anatomic regions. This step of data analysis provides us with a significance evaluation (in terms of a p-value) for the presence of signals in anatomically defined regions for every study participant separately. By this, the method incorporates the rationale, that functional activation is also spatially connected28 and is thus able to gain more sensitivity without using the “localizer” data of the first study. Then, in a second step, we combine for each of these regions the p-values of all voxels within that region from all study participants using an appropriate combination function, and we evaluate the significance of the whole region by means of the resulting combined p-value. As we will demonstrate by means of the concrete example from the field of programming language comprehension, this new strategy can be similarly powerful as Strategy (ii) described above, while avoiding the sequence of two separate fMRI studies where the data of the first is used for definition of suitable regions for the second. In fact, our present methodology will not at all rely on the data from the first study42 that had been performed and utilized at that time by the authors of Ref. 43, while achieving similar results. Thus, it can enable researchers to spare precious measurement time.

In general, statistical analyses that are more advanced than voxel-wise regression analyses increase in popularity; see, e.g., Ref. 34 for a recent approach. Another alternative approach to analyze fMRI data is multi-voxel pattern analysis (MVPA) introduced by Haxby and colleagues, see Ref. 22 and the multiple papers cited therein. MVPA predicts stimulus event categories from the relative changes in activation across a set of voxels. Such a set of voxels is extracted in the first by a feature selection analysis. The second step partitions the data into a training set and a testing set entered into a pattern classification algorithm. From a statistical point of view this requires a sufficiently large number of comparable events. In the experiments described in Ref. 42 the events are code snippets that must be read and understood within 60 seconds to enable a certain complexity of software code and were thus limited in number to fit into the duration of a typical fMRI session of about 45 minutes. Such limited number of events may pose difficulties for MVPA cross-validation. Such low number of rather long events is, however, sufficient for general linear model (GLM) analyses of block design experiments that typically yield strong detection power (albeit low discrimination power).

Moreover, as functional activation is spatially distributed over several voxels rather than focused in single ones, cluster-based inferences have become rather standard (cf. Ref. 28). This is implemented and used in all major software package. However, a recent discussion in Ref. 15 revealed their flaws in practical situations. Furthermore, cluster-based inference requires simulation according to the smoothness of the data at hand.

In fact, the majority of fMRI studies like our studies from Refs. 42, 43 still use GLM-based analyses which require corrections for multiple comparisons. The experiments were constructed such that perceptual and cognitive processes necessary but not specific for understanding software code were controlled by specific test conditions, either requiring to read the same code but with a different task or controlling for attentional demands. Thus, it was a classical hypothesis-driven design, which is widely used in the literature, especially to unravel complex cognitive processing. Besides that, understanding software code is a highly idiosyncratic process and therefore identification of brain activity using control conditions within individual subjects is possibly more feasible as a first step towards identifying the most relevant brain areas involved in such a complex cognitive process. Thus, our method can serve as a valuable alternative using spatial aggregation while still being fast and easy to implement.

The rest of the paper is structured as follows. In Section 2, we describe our proposed statistical methodology. Section 3 describes conducted computer simulations comparing three different methods of fMRI data analysis. Section 4 is devoted to the detailed description of our re-analysis of the fMRI data from Ref. 43, and the results of this re-analysis are presented in Section 4.3. We conclude with a discussion in Section 5.

2. Methods

In this section, we describe our statistical model for fMRI data as well as the proposed data analysis workflow for detecting brain regions which are significantly associated with a certain cognitive task.

2.1 Linear models for voxel-wise multiple tests

Let Yixt denote the observed data from a functional MRI experiment at voxel x and time t for the i -th subject. Here, we adopt the common view (cf. Section 5.4 in Ref. 28) of a mass-univariate linear model

(1)
Yixt=Xiβix+εixt
for the data, with a design matrix Xi containing variables with the expected blood oxygenation level dependent (BOLD) response related to the experimental stimuli or nuisance parameters like drifts of the MR signal. The random variable εixt is the error term, which is assumed to be normally distributed with zero expectation and a spatio-temporal correlation structure. The model in (1) is also referred to as a “within-subject model” in the fMRI literature; see, e. g., Section 12.4.1 in Ref. 35. Estimates β̂ix of the statistical parametric map (SPM) or their contrasts cTβ̂ix and estimates of their covariance matrices Σ̂ix (or the variances σ̂ix2=cTΣ̂ixc ) can then be obtained from a pre-whitened version of the linear model above; cf. Ref. 28.

The SPM then forms a random t -field (cf. Ref. 53) with an inherent multiple comparison problem due to the large number of local hypotheses. One common strategy is to define local p-values at each voxel x and for each subject i based on the local values of the random t -field and to control the family-wise error rate (FWER) using accordingly adjusted thresholds; cf. Ref. 54. However, this is known to be a very conservative approach with respect to the detectability of significant brain signals in the outlined framework. In contrast, approaches related to the control of the false discovery rate (FDR) can handle the multiple comparison problem, e.g., by the procedure proposed in. Ref. 6.

2.2 Parcellation of the human brain

Neuroanatomic research has found that the human brain can be parcellated into different sub-regions based on structural similarities. One of the earliest atlases is the Brodmann atlas (see7) which is based on the cytoarchitectural organization of the brain. The Brodmann Areas have been schematically transferred to a template brain, the so-called Talairach-Atlas (see Ref. 46) which is commonly used in fMRI studies to report the location of significant grand average activation, as used in Refs. 42, 43. For the analysis of these data in the current paper, we chose the Harvard-Oxford brain atlas (cf. Refs. 10, 31) that provides a parcellation based on gross anatomical landmarks and delivers an Aparc label j for each voxel of each individual brain space. However, any other brain parcellation to define regional labels could be used with our methodology.

2.3 Statistical inference

As outlined in the introduction, we re-used fMRI data from a program code comprehension task first analyzed in Ref. 43 and performed a new analysis comprising the four steps outlined below. The experiment used two different levels of software program code comprehension stimuli, henceforth denote as bottom-up and top-down comprehension, to infer on the related cognitive processes. In our strategy, we combined the methods from Ref. 43 (steps 1 and 2) and Ref. 40 (step 3). Furthermore, we implemented our new methodological contribution of combining the evidence for activation of a given brain region across the subjects (step 4). For the first two steps, we conducted, for each subject, a random-effects linear model analysis as described above for deriving voxel-wise p-values.

Step 1: Program comprehension versus rest

In the first step, we contrasted (for each participant separately) the comprehension of program code (cf. Ref. 43) to the rest condition. This identifies brain areas with a positive deflection of the BOLD response. Furthermore, in order to account for the multiple comparison problem, we performed the Benjamini-Hochberg test (see Ref. 6) for FDR control. Only those voxels which have been declared significant by this procedure were considered in step 2. This methodology is justified by the fact that the FDR is an established screening criterion for high-dimensional multiple test problems.

Step 2: Bottom-up comprehension versus control condition

In this step, we contrasted (again, for each participant separately) one type of program comprehension: bottom-up comprehension. Bottom-up comprehension is induced when program code provides no semantic cues and programmers need to comprehend each line separately and then integrate the information in a slow, tedious process. For the significant voxels from the first step, we applied in a second step the same multiple test to the contrast of bottom-up comprehension against the control condition (syntax task) on the restricted set of voxels. As a result of this step, we get for each participant i and for each considered voxel k a p-value p~ik .

Step 3: Regional p-values for every participant i

This step builds upon the methodology from Ref. 40, and it delivers for each participant i and for each anatomical (regional) label j a (confirmatory) significance evaluation with respect to the contrast specified in step 2. Hence, the evidence from all voxels of participant i in the brain region labeled by j is combined in this step of data analysis.

To this end, let κ be a tuning parameter with values in the interval [0, 1] (i.e. in per cent) and let mj be the number of voxels contained in the brain region labeled by region label j . To keep the notation feasible, we implicitly assume here that for each participant i the same number mj of voxels belong to the brain region labeled by j . We consider the null hypothesis Hij of no relevant differential activation of the region labeled by j for participant i during the two tasks mentioned in step 2, together with its two-sided alternative hypothesis Kij . We call Hij the “regional null hypothesis” for the brain region labeled by j for participant i . We formalize Hij as a so-called partial conjunction hypothesis (see40 and the references therein for a formal mathematical description), meaning that we consider the differential activation in region j for participant i relevant, if it contains at least ujκ·mj significant voxels. For testing Hij we calculate the “regional p-value” pijREGION , given by

pijREGIONmin1ımjuj+1{mjuj+1ı}p~i,(uj1+ı):mj
where the voxel-wise p-values p~i,1:mj,,p~i,mj:mj for participant i in region j are ordered from smallest to largest (see Ref. 5).

In order to achieve family-wise error rate (FWER) control, we have to choose the tuning parameter κ smaller than or equal to 1/J , where J is the number of regional labels. Choosing κ=1/J corresponds to the so-called Bonferroni multiplicity correction. The choice of κ is discussed further in Appendix S1 of Ref. 40.

Step 4: Combined regional hypothesis tests by Fisher’s method

In this final step, we combine for each regional label j the regional p-values calculated in step 3 over all participants i=1,,n . In order to do this, we apply the so-called Fisher method to combine p-values. Namely, the Fisher test statistic Tj for region j is given by

Tj2i=1nlog(pijREGION)

Under independence of the data with respect to the participants, Tj is asymptotically X2n2 -distributed (chi squared) with 2n degrees of freedom under the null. The latter independence assumption is justified, because the participants have been included in the study independently from each other.

Finally, we can reject the (over all participants i combined) regional hypothesis Hj (i.e., the respective partial conjunction hypothesis, but now with respect to the population, not with respect to a single participant) if and only if Fisher’s test statistic Tj is larger than the (1-ακ)-quantile of the X2n2 -distribution with 2n degrees of freedom, where the tuning parameter κ has been introduced in step 3. This parameter addresses the multiplicity of the test problem with respect to the J regional labels which are simultaneously under consideration.

3. Computer simulations

In order to compare the performance of three different methods for fMRI data analysis in a controlled framework, we have carried out computer simulations.

3.1 Simulation setting

We have simulated a dataset {Yixt} with eleven subjects (referring to the index i and corresponding to the group size in the real dataset below), a spatial grid of size 20×20×20=8,000 voxels (referring to the index x ), and 195 time points (scans, referring to the index t ). We have assumed that the 8,000 voxels are grouped into eight anatomical regions of size 10×10×10=1,000 each, and that these regions are correctly annotated for all eleven subjects. Two alternating stimuli in an ON-OFF block task design with a total of six ON blocks of a duration of 15 scans have been used for the temporal signal: In a circular area in one of the predefined anatomical regions the signal of one stimulus was twice as high as the signal by the other stimulus mimicking a signal contrast. In a corresponding area in a second anatomical region the signal was created with no difference between the stimuli. The datasets with first order autocorrelated Rician noise have been created using the R package neuRosim51; p-values p~ik for Contrast 1 (ON of either stimulus versus rest) and Contrast 2 (one stimulus versus the other) where determined using the R package fmri45. This simulation setup has been run 500 times.

3.2 Considered data analysis methods

For the statistical analysis of the simulated data, we have considered three different methods.

  • Voxel-wise: On the basis of voxel-wise Z -scores (aggregated over all eleven participants) and the resulting p-values for Contrast 1, voxels have been screened by applying the Benjamini-Hochberg method at level α=0.05 . For the screened voxels only, Z -scores (again aggregated over all eleven participants) and the resulting p-values for Contrast 2 have been computed. A region j has been declared significantly activated, if the number of Contrast 2 p-values below 0.05/sj in that region j has been larger than 1,000·κ , with sj denoting the number of screened voxels in region j and κ=0.01 , meaning that we considered a region j to be relevantly activated if at least 10 out of the 1,000 voxels in that region j are statistically significantly activated under the stimulus. The rejection threshold 0.05/sj accounts for the (reduced) multiplicity resulting from the pre-selection by means of Contrast 1.

  • Cluster-wise: Activation is typically spatially spread over multiple voxels. In fact, single isolated voxels are mostly considered as spurious rather than activated; see Ref. 28. Thus, we also analyzed the simulated datasets and the two contrasts with cluster-based thresholds: At the given significance level α=0.05 the value of the test statistic at a voxel must exceed some threshold and it has to belong to a cluster of at least s connected voxel. Thresholds can be pre-determined by simulations. To determine p-values we used the R package fmri45, where the method is implemented.

  • Region-wise (this paper): Our proposed method from Section 2, again with FDR level 0.05, κ=0.01 , and α=0.05 in Step 4.

3.3 Results

We have assessed the type I and the type II error behavior of the three methods described in the previous subsection by calculating (for each of the three methods) the proportion of simulation runs in which any region without true activation has been declared significantly activated (type I error component) as well as the proportion of simulation runs in which the one region with true activation has not been declared significantly activated (type II error component). Table 2 summarizes the results of our computer simulations.

Table 2. The results of our computer simulations.

The estimated type I error refers to the proportion of simulation runs in which any region without true activation has been declared significantly activated. The estimated type II error refers to the proportion of simulation runs in which the one region with true activation has not been declared significantly activated.

Voxel-wise Cluster-wise Region-wise (this paper)
Estimated type I error0%0%0.8%
Estimated type II error24.6%0%0%

From Table 2 it becomes apparent that all three methods are capable of protecting reliably against type I errors under our data-generating model. However, the power of the proposed method appears to be considerably larger than that of the voxel-wise method. Namely, during our 500 simulation runs, our new method as well as the cluster-wise method always detected the truly activated region, while the voxel-wise method detected it only in 367 out of the 500 runs.

4. Real data analysis

We re-used the data from Ref. 43 and compared our results with those obtained by additionally utilizing42 as pre-study in the sense of Strategy (ii) outlined in the introduction.

4.1 Previous findings

In the two previous studies mentioned before, Siegmund et al. used similar analysis processes. They used BrainVoyager™ QX 2.8.4[1]. The anatomical scans were transformed into the Talairach brain to account for differences in brain size; cf. Ref. 46. They preprocessed the functional data of both studies with a standard pipeline of: 3-D motion correction, slice-scan-time correction, and temporal filtering. In addition, they applied a spatial smoothing with a Gaussian filter (FWHM = 4 mm).

In the first study42, the random-effects GLM revealed five brain areas (BAs 6, 21, 40, 44, 47) with significant activation with the contrast Bottom-Up Comprehension versus Control condition, i.e. syntax task. In the second study, the same contrast revealed no significant areas anymore, likely due to the reduced statistical power of five instead of twelve bottom-up comprehension tasks per session. Thus, the authors ran a regions-of-interest analysis restricted to the identified activation clusters of the first study on the data of the second study. This resulted in a significantly stronger activation for Bottom-Up Comprehension versus Syntax in BAs 21, 40, and 44.

4.2 Data export and preparation for re-analysis

In Figure 1, we illustrate the overall process for the re-analysis of the data from Ref. 43. For our re-analysis, we exported the already pre-processed data from the second study. We did not use the data of the first study as our method does not need prior definitions of regions of interest. We used BrainVoyager to access the data from Siegmund et al. We exported the statistical values (i.e., t-scores, p-values) of the obtained brain activation on a voxel basis for each participant. The voxel resolution is the same BrainVoyager uses, i.e., a 1 mm interpolated resolution.

In addition, we used FreeSurfer to segment and parcellate the brain of each participant based on their anatomical scan; cf. Refs. 17, 18. We used the Destrieux’ cortical atlas to assign Aparc labels on an individual participant basis (see Ref. 11) as a suitable region definition. Next, we used Nipype (see Refs. 16, 21) to convert Freesurfer labels to a BrainVoyager-readable format.

Our last step annotated the exported functional data with the individual anatomical labels for each participant. We removed all functional voxels for coordinates that had no assigned Aparc label, which typically are voxels that are not considered gray matter.

4.3 Results

Figure 2 displays the six brain regions which have been declared as significant (at FWER level α=5% ) associated with the task at hand by our described methodology. There is an overlap with the results obtained by Ref. 43 in which they utilized prior knowledge, but also some differences which we discuss in the following in their nuances. We visualize the confirmed network of brain activation from Siegmund et al. in Figure 3 and our identified network of significantly activated Aparc labels in Figure 4. Figure 3 and Figure 4 show overlapping results with regard to the Brodmann area 21 that covers the middle and inferior temporal gyrus (separated by the inferior temporal sulcus). However, we observed differences regarding smaller brain regions. We compare Siegmund’s replication efforts to our results in Section 5.1.

b1abd514-30cf-471d-b55d-8b47477c46d7_figure2.gif

Figure 2. The six significant brain regions (at FWER level α = 5%).

In each row, each point corresponds to the Aparc p-value pⅈjAPARC of one study participant i, where the index j refers to the area indicated by the code at the beginning of the row.

b1abd514-30cf-471d-b55d-8b47477c46d7_figure3.gif

Figure 3. Network of left-lateralized confirmed brain areas. In Ref. 43, BAs 21, 40, 44 were found activated during program comprehension.

b1abd514-30cf-471d-b55d-8b47477c46d7_figure4.gif

Figure 4. Results of our analysis with significantly activated Aparc labels.

Activation are particularly in the middle and inferior temporal lobe.

Our method found three Aparc labels in the inferior and middle temporal gyrus in the left hemisphere. Siegmund et al. found their largest and most robust activation cluster in BA21 of the left hemisphere, which covers several gyri in the temporal lobe. These left temporal gyri are often associated with semantic processing of natural language, which is typically left-lateralized for right-handed participants. In the context of programming, the activation is believed to be responsible for extracting the meaning of individual identifiers and symbols during program comprehension. We found two further Aparc labels bilaterally in the inferior temporal gyrus and the anterior collateral sulci, which are both in the temporal lobe as well.

For each of these six regions indexed by j , we display all subject-specific regional p-values {pijREGION:1in }, where n is the number of study participants. For each j considered in Figure 2, it can clearly be observed that not a single extreme outlier (one very small p-value corresponding to one individual subject) is responsible for the statistical significance with respect to the combination test statistic Tj but that the combined information contributed by all n subjects supports our statistical conclusions. Regional p-values pijREGION1 can occur, if none of the voxels belonging to region j has been selected in Steps 1 and 2 described before for a certain subject i . By construction of Tj this essentially means that the “effective sample size” for such a region is reduced, while the number of degrees of freedom for the null distribution of Tj remains unchanged.

5. Discussion

5.1 Statistical sensitivity

This paper introduces a new strategy to analyze fMRI data and demonstrates its performance by drawing a comparison to two studies by Siegmund et al. In their second study, they investigated a new research question which was based on a design with reduced statistical power regarding the network of brain areas activated during bottom-up program comprehension. Only by using prior knowledge of the location of identified clusters from the first study and conducting a regions-of-interest analysis with increased statistical power, they were able to exceed standard thresholds of statistical significance but restricted to the subset of areas identified in the first study.

Research that similarly aims to detect small differences between cognitive processes with state-of-the-art methods would need to increase statistical power, e.g. by conducting two studies: first, to identify the network of brain areas involved in the overarching cognitive process, and second, to identify differential effects within the activated brain areas. The method of utilizing anatomically predefined regions of interest as determined in each individual brain is a candidate for a more sensitive analysis as compared to statistical testing of single voxels with identical coordinates of a template brain. This is possibly due to allowing more flexibility with respect to the exact anatomical location of functional units across different brains. Our example of using ROIs from the Harvard-Oxford atlas can easily be replaced in future applications by brain parcellation schemes that are based on more refined anatomical and/or functional databases. We demonstrated by simulations that our method can accurately detect signals based on region aggregation and outperforms standard mass-univariate approach while achieving a similar performance as cluster-based inferences without the need to define thresholds based on data smoothness via simulations. Further, we demonstrated that our method can find significantly activated brain areas without relying on prior knowledge on a real fMRI dataset. Moreover, additional brain areas were identified which have been described in two fMRI studies of programmers and interpreted as being involved in visuo-spatial processing. One study investigated manipulating data structures, which share similarities to spatial rotation (see Ref. 25) and one writing program code (see Ref. 27). Since the study by Siegmund et al. (2014) was the first study on program comprehension they used a rather small FDR corrected significance level ( p<0.01 as compared to the more common p<0.05 ). Possibly this is one reason why the two areas identified by the current approach were not identified there and as a consequence could not emerge in Siegmund et al. (2017). Thus, our current approach seems valuable in cases where new research questions are explored and little to no prior work is available and which would require careful statistical hypothesis testing to initially minimize false positives.

However, inferior frontal gyrus (BA 44) and inferior parietal lobule (with BA 40), shown to be significant in Siegmund et al., were not significant with our method. Therefore, we exemplarily investigated this difference with the activation cluster in BA 40. In Figure 5, we display a histogram of p-values for the (statistically significant) Aparc label in the middle temporal gyrus, in which a majority of the participants contribute with small p-values. In contrast, in Figure 6, we display a histogram of p-values for the inferior parietal lobule. Across the entire group, we also observe an accumulation of very small p-values, but only three of eleven participants contribute to this. Unlike the regions-of-interest analysis done by Siegmund et al., our method relying on combining Aparc p-values by Fisher’s method does not reject hypotheses with such a p-value distribution.

b1abd514-30cf-471d-b55d-8b47477c46d7_figure5.gif

Figure 5. Histogram of p values of ctx_lh_G_temporal_middle label that is evaluated as significantly activated.

The plot without a heading is across all participants, while the eleven plots with headings show the distribution across individual participants. The majority of participants is contributing to the significance.

b1abd514-30cf-471d-b55d-8b47477c46d7_figure6.gif

Figure 6. Histogram of p values of ctx_lh_G_pariet_inf-Angular that is not evaluated as significantly activated.

The plot without the heading is across all participants, while the eleven plots with headings show the distribution across individual participants. While there are many small p- values, only a few participants contribute these small values.

By transforming the Aparc labels into the Talairach space, we observed that the voxels included in the activation cluster of BA 40 do not perfectly align with the Aparc label that should cover this brain area, i.e. the inferior angular gyrus of the parietal lobe. Table 4 shows that the overlap to the activation cluster in BA40 is less than 50% and that only around 5% of its voxels are assigned to an Aparc label at all. In contrast, Table 3 shows that at least 75% of the voxels of the BA 21 cluster are aligned to Aparc labels of which two are significant with our method. Thus, a small activation cluster within a large anatomical region may get lost with our approach. We further discuss this potential drawback of the used anatomical segmentation and possible remedies in Section 5.3.

Table 3. The region of interest in BA21 identified in Ref. 42 consists of 2844 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. Aparc labels in bold are evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

# Voxels Overlapping with BA21 in %# Voxels of Entire Aparc Label in % in BA21
Left-Cerebral-White-Matter 63422%2106200.3%
ctx lh G temporal middle 61722%75158.2%
ctx lh S temporal sup 51218%93795.5%
ctx lh S temporal inf 853%24993.4%

Table 4. The region of interest in BA40 identified in Ref. 42 consists of 1777 voxels.

Only a subset of these voxels is assigned to Aparc labels. However, the assigned Aparc labels are larger and only a smaller section overlaps with the activation cluster. No Aparc label is evaluated as significant. Only Aparc labels with at least 75 overlapping voxels are included.

# Voxels Overlapping with BA40 in %# Voxels of Entire Aparc Label in % in BA40
ctx lh G pariet inf-Supramar 30417.1%63184.3%
ctx lh G pariet inf-Angular 29016.3%49755.1%
Left-Cerebral-White-Matter 1578.9%1788750.1%

From the methodological point of view, our main contribution consists in a novel way how to combine evidence: Instead of aggregating single voxel data over all participants by mapping them to a standard brain template, we define subject-specific regions and combine the evidence on the level of these regions by means of a combination function. This is a generic methodological approach which is not restricted to the specific study setup of Siegmund et al., but can be applied to essentially any fMRI study design, involving an arbitrary number of contrasts.

Furthermore, also the (final) combination step of our proposed approach is generic in the sense that instead of the Fisher combination function any other (appropriate) combination function for p-values may be used. Recently, there has been a renewed interest in p-value combination methods; see, e.g., Ref. 52 (with discussion), Ref. 48, and Ref. 49.

5.2 Comparison to related fMRI analysis methods

Under the multiple testing framework, testing of grouped null hypotheses with (potential) application in fMRI is an active research topic. In Ref. 23 as well as in Ref. 4, clustering techniques were employed to define regions of interest, and the authors incorporated the heterogeneous cluster sizes in a weighting scheme for the linear step-up test from Ref. 6. In the same vein, the authors of Ref. 24 as well as Ref. 56 made use of the different proportions of true null hypotheses in each of the groups in their proposed weighting. A Bayesian variant of this idea has been derived in Ref. 30. Hierarchical methods, which exclude groups without strong evidence for the presence of signals in several stages of data analysis, have been worked out by several researchers; see, e. g., Refs. 3, 40 and 55. However, these methods rely on combining the subject-specific data on the voxel level, which is a standard technique as mentioned, for instance, in Section 5 of Ref. 29 Also on the basis of (combined) voxel data, a hierarchical independent component analysis for the comparison of brain functional networks has been proposed in Ref. 41. To the best of our knowledge, our proposed method to define subject-specific regions by means of regional labels and to combine the resulting subject-specific regional p-values by means of a combination function is a novel idea.

It is worth mentioning that both our proposed methodology and the methods from previous literature that we are considering in our comparisons make certain assumptions about the data: First, a linear relationship between the features encoded in the design matrix and the response is assumed. Thus, non-linear effects cannot (fully) be captured by models of the type (1). Second, Gaussianity of error terms is assumed, implying symmetry and light tails, among other things. It can be of interest to validate or to test these assumptions for a concrete dataset at hand. However, since this aspect is not the main focus of our present work, we defer such investigations to future research. In particular, modeling the data that we have re-analyzed in this work with different model assumptions would diminish the comparability of our data analysis results with the results from previous analyses. One simple (numerical) robustness analysis regarding the Gaussianity assumption can be performed by simulating data with a different error distribution and comparing the results with the results presented here. To this end, the source code which is available as supplementary material for this article may be helpful for practical implementation.

5.3 Outlook: from anatomical to functional aggregation

Instead of utilizing a single voxel GLM group analysis in common brain templates, we used a parcellation of the brain for each individual participant into regional labels, here Aparc labels, before aggregation into the group analysis. This procedure provides more labels than a traditional Brodmann atlas. Still, some of the regions are very large and thus presumably contain several functional areas. There is ongoing research to subdivide the brain based on cytoarchitectonic details, e.g. the Jülich-Brain (see Ref. 1). This will provide more detailed parcellation schemes in the future and that could easily be integrated into our proposed method. Another refinement could be to use functionally defined brain regions for our presented methodology. A study that implements functional localizers could identify participant-specific functional maps of the brain for well-defined standardized tasks (e.g., see Ref. 33). Then, in the analysis, our presented methodology can aggregate across all participant-specific brains with less imprecision than traditional methods. We would like to note that such functional localizers are currently restricted to research areas that include brain regions with specific functional specialization; cf. Refs. 20 and 39. The studies we presented in this paper are concerned with a rather complex cognitive task. Moreover, understanding program code is highly individual since different programmers rely on different comprehension strategies based on their preferences, domain knowledge, and experience (e.g., Refs. 8, 9 and 44).

Ethical considerations

Our paper did not acquire any new data with human participants. The original studies of Siegmund et al. were approved by the responsible ethics board of the University of Magdeburg (Application: 87/14).

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 01 Oct 2025
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Peitek N, Brechmann A, Tabelow K and Dickhaus T. Utilizing anatomical information for signal detection in functional magnetic resonance imaging [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:1019 (https://doi.org/10.12688/f1000research.166549.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status:
AWAITING PEER REVIEW
AWAITING PEER REVIEW
?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 01 Oct 2025
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.