MRI data harmonization across sites using ComBat enhances classification of meningioma and glioma brain-tumors in dogs: a case study [version 1; peer review: awaiting peer review]

Background: Magnetic resonance imaging (MRI) in clinical patients is often evaluated for diagnostic purposes. However, to develop a disease classifier, imaging data can be “noisy”, as in being heterogeneous (e.g., obtained from multiple sites), having significant crossover between normal and pathological processes, being highly imbalanced for the outcome variable (i.e., unequal numbers of cases and controls), or due to a lack of accurate quantitative analysis tools that are transferable, easily usable, and accurate to generate the final image variables for machine learning analyses. Methods: In this article, we demonstrate the effectiveness of ComBat harmonization of heterogeneous MRI data on dogs’ brains, collected across multiple sites, prior to using them in the random forest (RF) classifier to attempt to differentiate the meningioma and the glioma tumor-types. We consider three image variables generated from each of the brain scans and three clinical covariates – age, sex, and breedtype – for each


Introduction
Magnetic resonance imaging (MRI), a powerful technology to detect abnormalities in human and animal organs, 1-9 can be challenging for clinically differential diagnosis. [10][11][12][13][14] In omics sciences, data normalization (henceforth, "harmonization") is a crucial preprocessing step prior to downstream analyses, [15][16][17][18][19][20][21] mitigating any spurious effects on the scientific conclusions incorporated due to undesired sources of variation, such as batch effects, intrinsic factors within the subjects, and scanning sites. Such harmonization is also essential for MRI data, as the signal intensities in these data are measured in arbitrary units that vary across study-visits and patients. [22][23][24][25] In this study, we demonstrate the effectiveness of a batch-effect correction tool, ComBat, 26 widely used in transcriptomics 27,28 but also adopted for radiomics data, 29,30 in adjusting for undesirable effects of multiple sites on MRI signal intensities (SIs). We chose ComBat due to its superior performance in removing site-specific unwanted variations from fractional anisotropy and mean diffusivity maps in diffusor tensor MRI. 29 In their study, the authors considered only controls, used data that were from two "pure" sites, and implemented a sophisticated image-processing pipeline to generate the tissue outcome labels, which resulted in final measurements on the image variables (voxels) having dimensions in the order of 10,000's. In our case, however, each subject is diseased (meningioma/glioma) and the data come from two "impure" sites, i.e., the "outside" site consists of multiple non-CSU sites, the data thus potentially being noisy due to heterogeneous MRI scanners/protocols used. Notably, such site-heterogeneity can be commonplace to ensure a sufficient sample size. Additionally, we use only three manually recorded image variables, available for all subjects across the sites. Via the downstream performance of the ensemble machine learning classification tool, random forest [31][32][33][34] (RF), our study thus aims to demonstrate the utility of ComBat harmonization in a "non-ideal" yet practical scenario.

Study population and data generation
We use n = 244 subjects (dogs) in our study, belonging to one of the following four subpopulations: 1) glioma, scanned at the Colorado State University -Veterinary Teaching Hospital (CSU-VTH), n = 39; 2) glioma, obtained from a site outside CSU, n = 20; 3) meningioma scanned at the CSU-VTH, n = 106; and 4) meningioma, obtained from a site outside CSU, n = 79. Note that we treat the subjects as coming from only two sites --CSU and "outside". However, the "outside" site actually consists of 36 unique sites (Table 1). DN and XY, the two "processors", generate the data used in the final analyses. DN scans through the conclusion of each patient's brain MRI diagnostic report stored in the CSU-VTH Philips IntelliSpace PACS (picture archiving and communication system) Radiology software (henceforth referred to as "PACS") database, labeling the associated brain tumor-type as either "glioma" or "meningioma" based on the radiologist's/principal interpreter's conclusion including terms such as "likely"/"most likely"/"most consistent", etc. Therefore, these binary tumor-type labels are not based on surgical, histopathological evidence and are used as the outcome variable in the downstream RF classification (see the "Statistical analysis" section). Since we do not have access to the diagnostic reports for the subjects from the "outside" site, we consider instead the corresponding ones from the CSU PACS database that are closest to their original exam dates.
For each patient, we only consider the transverse/axial section, T1-weighted, post-contrast scans (typically labeled as "Trans T1 +C"). The processors scan through all the slices within each patient's respective DICOM file and select up to three representative slices in which the cancerous lesions are most prominently visible (i.e., highest contrast) by naked eye. Note that, among the 244 subjects, we settle with only one suitable slice for seven subjects and two for six subjects (Extended data: Table S1). 47 Then, within each chosen slice, two circular regions of interest (ROIs) are drawn encompassing the densest parts visually examined, one each on the lesion and on the "normal" tissue, using the PACS software in-built "drawing" tool. Also note that, as "normal" tissue, we choose facial muscle for seven meningioma subjects and muscle of mastication for the rest (Extended data: Table S2). 48 From each of these two ROIs, three statistics for the SIs are noted: mean, standard deviation, and the central point-value. See Figure 1 for an example.
Besides the three MRI variables, for each patient we also record the following covariates: three clinicalage (in months) at the time of MRI scan, sex (male, female, male castrated, female spade/spayed), and breedtype; six related to MRI scannerrepetition time (TR), echo time (TE), number of excitations (NEX), slice thickness (mm), frequency phase (X x Y), and field-of-view reconstruction (FOV recon; cm); and one technicalprocessor.
Note that, for the final analysis, we use both sex and breedtype as binary variables: sex (female/male) and breedtype (nonbrachycephalic/brachycephalic). Data on frequency phase are used as two independent scanner covariates. Due to the presence of missing data, we eventually omit the "FOV Recon" scanner covariate from the final analysis. Thus, we have three binary covariatessex, breedtype, and processor, coded as 0/1; the rest are treated as continuous variables. See Table 2 for a summary of all of the final variables used in our analyses.   Table 2. Summary of the three clinical covariates, one technical covariate, six magnetic resonance imaging (MRI) scanner covariates, and three MR curated image variables used in our statistical analyses. The data are grouped based on the four subpopulations as indicated in the columns. Apart from the three binary covariatessex, breedtype, and processorthat are coded as 0/1, the rest are treated as continuous variables; each cell-value indicates the range in the top line and the median (median absolute deviation in parentheses) in the bottom line.

Statistical analysis Preprocessing of the data and final variables
For each of up to three selected slices corresponding to each sample, we first normalize the mean, the standard deviation, and the central point-value of the SIs within the diseased ROI by taking respective ratios to the normal ROI within that same slice ( Figure 1). We call these three measures adj-mean (SI), adj-SD (SI), and adj-cent (SI), respectively. Next, for each sample, we compute the means of these adjusted measures across the selected slices. These three summarized measures, respectively referred to as μ (adj-mean (SI)), μ (adj-SD (SI)), and μ (adj-cent (SI)), are used as the final three image variables in the subsequent analyses (Figures S1 and S2). [50][51][52][53][54][55] The intercorrelations among the three continuous image variables and the disease labels (0 = glioma, 1 = meningioma) are shown in Figure S3. 56,57 We note that, for both CSU and outside sites μ (adj-mean (SI)) and μ (adj-cent (SI)) are maximally correlated with the disease labels and the correlations among the μ (adj-SD (SI)) and disease labels are negligible. Among the continuous covariates across both sites, while age (in months), μ (adj-mean (SI)), and μ (adj-cent (SI)) resemble a Gaussian distribution, those of others deviate greatly from it (data not shown).

Tumor classification
For the classification of meningioma and glioma brain-tumors (glioma treated as the "positive" class), we apply RF [31][32][33][34] and evaluate classification performance based on sensitivity, specificity, and total accuracy, benchmarked via "lower" and "upper" bounds (Table 3). Using the same site for training and test sets, we expect better RF classification performance (upper bound) compared to when using different sites (lower bound).
For the "lower" bound calculations, we use all the samples within the outside site (n = 99, M/G = 79/20) to train the RF classifier, and randomly subsample n = 38 subjects from the CSU population, ensuring M/G = 19/19 representation, for  Yes Scenario d Scenario c Table 5. Random forest (RF) classification median (median absolute deviation in parentheses) sensitivity ("Sens"), specificity ("Spec"), and total accuracies ("Tot Acc") corresponding to Case 1, scenarios a -d ( We use the original n = 79 meningioma samples and the n = 80 glioma cases that are generated using SMOTE. Within this training set, we tune the parameters of the RF classifier using 5-fold cross-validation repeated 25 times, and using all possible combinations of predictor variables in the model via the mtry argument in the train() function. For the "upper" bound calculations, we keep the identical test set compositions as in lower bound computations, and form the training set by randomly subsampling n = 79 "meningioma CSU" subjects from the remaining 87. We repeat this exercise of computing lower and upper bounds 75 times, each time with a different training-test split. Finally, we report the medians (and median absolute deviations) of the classification metrics across these 75 random samples; see Table 5 for an example.

Scenarios studied
We investigate the RF classifier performance at the lower and upper bounds for the following scenarios: • [Case 0: one scenario] We examine the effectiveness of using three clinical covariates only in classifying the tumor types. No image, technical, and scanner covariates are used, and therefore, no ComBat harmonization is involved.
• [Case 1: four scenarios] We use the three image variables in ComBat. Besides, we either use the three clinical covariates or not in ComBat and in subsequent RF, thus giving rise to four scenarios (ad; Table 4). We do not use any technical and scanner covariates in ComBat.
To assess the impact of ComBat harmonization on RF classification performance, we conduct nonparametric tests (Wilcoxon's signed-rank paired one-sided tests with continuity correction) to examine whether a post-ComBat classification metric lower bound is: (1) significantly greater than that for its pre-ComBat counterpart, and (2) significantly lower than the corresponding upper bound (Table 5). Glioma is treated as the "positive" class in classification and, therefore, sensitivity measures the proportion of true glioma cases correctly identified, specificity measures the proportion of true meningioma cases correctly identified, and total accuracy measures the total proportion of true meningioma and glioma cases correctly identified.

Results
Below we discuss the full set of results for the scenarios in Cases 0 and 1. [43][44][45][46][50][51][52][53][54][55][56][57] Note that, besides these two cases, we also examine the results of another case (Case 2) in which, alongside the three image variables, we include one technical covariate and six scanner covariates (see the "Study population and data generation" section) in the ComBat step. However, since the essence of these results is mostly similar to that of Case 1, we set them aside as "Extended data" (Extended data: Table S3). 49 Figure 2. Boxplots of random forest (RF) classification metrics corresponding to Case 0: "tota" = total accuracy, "sens" = sensitivity, and "spec" = specificity. L, U: lower bound (black) and upper bound (blue) obtained from RF models using only three clinical covariates.   Table 4. L3, U3: pre-ComBat lower bound (black) and upper bound (magenta) obtained from RF models using only three image variables; L6, L6.CB, U6: pre-ComBat lower bound (red), post-ComBat lower bounds (green, 1c; blue, 1d), and upper bound (cyan) obtained from RF models using three image variables and three clinical covariates.

Using only three clinical covariates in the RF classification model (no ComBat harmonization involved)
Using only the clinical covariates of the subjects in the RF model (Case 0), the lower bound total accuracies are not significantly lower than those for upper bounds: both medians = 57.9%; p-value = 0.332 ( Figure 2). The lower bounds of the sensitivity and the specificity measures are also not significantly lower than those for the upper bounds: p-values 0.133 and 0.884 respectively. Thus, the distributions of the age/sex/breed-type between meningioma/glioma subjects do not vary significantly across sites. For example, exact p-values corresponding to the Pearson's chi-squared tests (with Yates' continuity correction) on the two 2Â2 contingency tables for sex and breed-type distributions across CSU and Outside sites are 0.762 and 0.604, respectively. Also, among all scenarios, RF achieves the lowest medians of total accuracy and sensitivity in this case, which indicates an overall poor predictive strength of using only clinical covariates in the RF model (Figures 2 and 3).
Using only three image variables in the RF classification model Pre-harmonization These results confirm that using just the three image variables in the RF model, ComBat harmonization enhances the RF classification performance (except for specificity) compared to that in pre-ComBat and when using only the clinical covariates.
Using three image variables and three clinical covariates in the RF classification model Pre-harmonization  Figure 4-B).
Specificity: Using the image variables and the clinical covariates in the RF model, the lower bound specificity (pre-ComBat) is significantly higher than that using only three image variables in RF: medians 78.9% vs. 68.4%; p-value = 2.33 E-10. Similarly, the upper bound specificity is also significantly higher: medians 84.2% vs. 78.9%; p-value = 2.90 E-03 (Table 5, Figure 4-C).  Table 5).

Post-harmonization
Specificity: Using post-ComBat harmonization (scenarios c, d), the specificity lower bounds are again significantly higher compared to their pre-ComBat counterparts. For example, post-ComBat specificity lower bound using three image variables and three clinical covariates (scenario c) vs. pre-ComBat: medians 84.2% vs. 78.9%; p-value = 9.44 E-10 ( Table 5, Figure 4-C). This post-ComBat specificity lower bound in scenario c is also significantly higher than that using only image variables (scenario d): both medians 84.2%; p-value = 2.69 E-03 (Table 5, Figure 4-C) and compared to those when not using the clinical covariates in the RF model (scenario b): medians 84.2% vs. 73.7%; p-value = 3.05 E-12 (Table 5).
These results confirm that using the image variables and clinical covariates together in the RF model, with or without ComBat harmonization, results in better RF classification performance (except for sensitivity) than using only the image variables. Furthermore, using the image variables as well as the clinical covariates in both ComBat harmonization and the RF model provides the highest total accuracy and specificity across all scenarios.

Discussion
In this case-study, we demonstrate the efficacy of MRI data harmonization using ComBat in enhancing the downstream RF classification performance. Utilizing the clinical covariates along with the image variables both in ComBat and RF (Case 1, scenario c) results in the highest total accuracy. When adjusting for the technical and scanner covariates in ComBat (Case 2), we only notice significant improvements in specificity (correct identification of true meningioma cases; scenarios c, d) compared to when not using them (Case 1; Tables 5 and S3). For both cases, RF achieves the highest specificity with the clinical covariates included in the model, irrespective of including them in ComBat (e.g., maximum median value for Case 1 is 84.2%, scenarios c, d; Table 5). Of all cases and scenarios, RF attains the highest sensitivity (correct identification of true glioma cases) when we include the clinical covariates in ComBat but not in the classification model in Case 1 (maximum median value is 63.2%, scenario b; Table 5).
In summary, we confirm the overall effectiveness of ComBat harmonization in adjusting for the site-specific variability even for our "non-ideal" as a practically feasible, noisy, low-dimensional, manually processed MRI dataset.

Limitations
The highest median total accuracy we obtain is 71.1% (Case 1, scenario c). However, among the 75 repetitions, we do notice up to a maximum of 84.2%. The challenge in attaining any higher total accuracy is mainly poised by low sensitivity, i.e., correct identification of true glioma cases, possibly due to: 1) insufficient predictorswe have used three available, manually generated image variables and three covariates for our analyses; 2) the possible minor mislabeling of the tumor-types or imprecise ROIs because the labels are based on the visual inspection and subjective, expert conclusion of the examining radiologists at the CSU-VTH and not confirmed via surgical histopathology, or because the ROIs in each scan-slice are drawn by two non-radiologists, and hence can possibly incur imprecise diseased/normal ROIs; 3) nonhomogeneous sites -ComBat performance can potentially sharpen further with more homogeneous composition of the "outside" site; 4) an imbalanced outcome classesalthough we address the severe class imbalance, a more balanced distribution in the original data may enhance RF performance 36 ; and 5) the choice of class imbalance adjustor and classifierone may choose a different class-imbalance adjustment, such as "over-sampling", 37 or a different classifier, such as logistic regression. 38 However, our initial exploration suggests that the SMOTE-RF combination provides better results than those of some other alternatives (data not shown).  This project contains the following extended data:

Data availability
Figshare: Figure S3 39 We implemented the ComBat data harmonization using the neuroCombat R software package, which is publicly available in Jean-Philippe Fortin's GitHub: https://bit.ly/fortin-ComBat-git, and the SMOTE imbalanced class adjustment using the smote() function within the performanceEstimation CRAN package. 40 For the RF classifier, we use method = "rf" input argument in the train() function and compute the classification performance evaluation metrics using the confusionMatrix() function, both within the caret CRAN package. 41,42 As a freely available alternative to PACS for a DICOM viewer and imaging data generator, we suggest Horos. Client consent was obtained from the respective owners of all dogs included in this study to use all obtained images and medical data for the purposes of research. Consent for publication is not applicable.