Managing the effect of magnetic resonance imaging pulse sequence on radiomic feature reproducibility in the study of brain metastases [version 1; peer review: awaiting peer review]

Background: Despite the promise of radiomics studies, their limited reproducibility has hindered meaningful clinical translation. Variability in study designs as well as image acquisition and processing contribute to unreproducible radiomic results. This work’s purpose was to (i) quantitatively compare variability of radiomic features extracted from 2-D spin echo (SE) and 3-D spoiled gradient echo (SPGR) T1-weighted post-contrast magnetic resonance (MR) images of brain metastases acquired within the same patient in a single imaging session, and (ii) provide a framework to inform data acquisition for reproducible radiomics studies. Methods: A retrospective cohort of 29 patients with pathologically-confirmed brain metastases and contrast-enhanced T1-weighted MR images acquired using 2-D SE and 3-D SPGR sequences within one exam was identified. Metastases were segmented twice by different physicians using semi-automated methods. Radiomic features were extracted using PyRadiomics for 264 preprocessing variable combinations. Lin’s concordance correlation coefficient (CCC) was computed between features extracted from images acquired by both pulse sequences and different tumor segmentations. Results: We provided general recommendations to improve MR-based radiomic feature reproducibility by clustering and identifying low-concordance features and processing variables. Median CCC between 2-D SE and 3-D SPGR (measuring feature agreement between pulse sequences) was greater for fixed bin count intensity SE (0.93 versus 0.86) but improved to 0.93 for 3-D SPGR after low-concordance feature exclusion. Conclusions: The following recommendations are proposed to improve reproducibility: 1) Fixed bin count intensity discretization for all studies, 2) for studies with 2-D and 3-D datasets, excluding high-variability features from downstream analyses, 3) when segmentation is manual or semi-automated, using only 2-D SE images or excluding features susceptible to segmentation variability. who have undergone Gamma Knife radiosurgery. One study used radiomic features as a prognostic factor to predict effectiveness of Gamma Knife radiosurgery in brain metastases. Another important application exists in cases where brain metastases detected cases machine learning classifiers have been trained on radiomic feature data to predict metastatic tumor type. in apparent diffusion coefficient maps in cervical cancer, 33 and another compared intensity normalization and discretization methods for gadolinium-enhanced T1-weighted and T2-weighted fluid-attenuated inversion recovery series in glioma patients. 34 Recent MRI studies have also been performed in patients with Alzheimer ’ s, 35 multiple sclerosis, 35 lachrymal gland tumors, 36 breast lesions, 36 and glioblastoma multiform, 37,38 as well as healthy volunteers. 35,39 – 41 Many preprocessing variables have been studied, including contrast weighting, 40 resolution, 40 acceleration, 40,41 gray level discretization, 36,38 statistical normalization, 35,38 and bias field correction, 37 all of which affect the repeatability and reproducibility of extracted radiomic features. as single next-generation clustered heat maps (NG-CHM) via an interactive viewer at https://www.ngchm.net using the provided NG-CHM files for CCC between features from 2-D SE and 3-D SPGR data, CCC between different lesion segmentations on 2-D SE data, and CCC between different lesion segmentations on 3-D SPGR data. 46,47 The interactive NG-CHMs allow the convenient visualization of the results in their entirety in order to provide further information with which to support the design of radiomics studies. values extracted from 2-D SE and 3-D SPGR images. Preprocessing combinations include spatial normalization to voxel sizes of 0.4297 x 0.4297 x 5 mm (vs435), 1 x 1 x 5 mm (vs15), 3 x 3 x 3 mm (vs33), and 1 x 1 x 1 mm (vs11), intensity discretization to fixed bin width of 16 (bw16), 32 (bw32), 64 (bw64), 128 (bw128), and 256 (bw256), and 2-D (f2d) or 3-D (f3d) feature extraction.)


Introduction
It is estimated that 20% of cancer patients develop brain metastases, and prognosis following metastasis to the brain is generally poor. 1 Surgical resection, whole-brain radiation therapy, and stereotactic radiosurgery are the most prevalent treatments and are necessary to extend survival, preserve neurologic function, and provide palliative care. 2 Magnetic resonance imaging (MRI) plays a critical role in the diagnosis and treatment of brain metastases. Several different pulse sequences are routinely employed in the detection and monitoring of brain metastases, and their effectiveness for these purposes has been the subject of several studies. [3][4][5][6] In particular, post-contrast T1-weighted images can be acquired in a number of ways, commonly including 2-D conventional spin echo (SE) and 3-D spoiled gradient echo (SPGR). Generally, 2-D SE acquisitions provide images with better lesion conspicuity, whereas SPGR can be acquired more easily with thin slices in 3-D, allowing better detection of small metastases. 7 A recent consensus recommendation for imaging brain metastases includes post-contrast T1-weighted acquisitions with both of these pulse sequences. 8 Despite the valuable role of MRI in imaging brain metastases, some clinical questions that determine course of treatment, such as differentiation between tumor progression and radiation necrosis or determination of metastatic tumor type, are difficult or impossible to answer by evaluating MR images with the human eye. Radiomics aims to completely characterize data contained in an image or region of interest (ROI) by using dozens to hundreds of different radiomic features that each capture some characteristic of the image or ROI. Radiomics confers several potential advantages: radiomic features may capture image characteristics that are nearly or completely invisible to the human eye, features can quantitatively capture image characteristics that otherwise require qualitative evaluation, and radiomics can be employed in automatic tools to augment clinical decision making. One of the principal applications of radiomic features is to train machine learning classifiers that can assist clinical decisions. 9 Several studies demonstrate the feasibility of such tools in cases such as patient outcomes in non-small cell lung cancer. 10 There are several important applications of radiomics for brain metastases in particular. One such application is training machine learning classifiers to distinguish between tumor progression and radiation necrosis, particularly in patients who have undergone Gamma Knife radiosurgery. [11][12][13] One study used radiomic features as a prognostic factor to predict effectiveness of Gamma Knife radiosurgery in brain metastases. 14 Another important application exists in cases where brain metastases are detected before diagnosing the primary cancer. In these cases machine learning classifiers have been trained on radiomic feature data to predict metastatic tumor type. 15,16 While radiomics studies have already yielded exciting findings and these tools have many promising applications, there are concerns about the reproducibility of radiomics studies due to the range of possible study designs and the inherent variability in radiomic features as a function of imaging modality, acquisition parameters, scanner, image preprocessing methods, and feature definitions. 17 Several studies have addressed these concerns in computed tomography (CT) 18,19 and positron emission tomography (PET). The Image Biomarker Standardization Initiative (IBSI) has worked to standardize radiomic features across several imaging modalities, including MRI, CT, and PET 20 and has also published a manual in which consensus-based recommendations and guidelines for radiomics are presented, as well as a general radiomics image processing scheme. 21 A recent review of repeatability and reproducibility studies of radiomic features in cancer patients found good representation of CT and PET images, however, relatively few investigated MR-based radiomic features. Furthermore, repeatability and reproducibility investigations have been limited to a small number of cancer types, with types such as non-small cell lung cancer (NSCLC) being dominant in the literature. 22 None addressed brain metastases. The need for standardization of MR-based radiomic features in particular is well understood, 23 and there have been several suggestions for bringing uniformity to radiomics study workflows. 21,24 It is critical to understand the variability of MR-based radiomic features in order to identify which will be reliable for downstream applications, such as machine learning classifiers. As recently as 2016, a review by Yip and Aerts found no investigations of MR-based radiomic feature repeatability, highlighting the need for such work. 25 In the intervening time, several studies have begun to address important questions about the variability of MR-based radiomic features. In a mathematical phantom, Ford et al. investigated MR-based radiomic feature variability as a function of pulse sequence parameter selection, 26 and Bologna et al. studied feature robustness to various acquisition parameters in a digital phantom. 27 In MRI phantoms, Mayerhoefer et al. investigated how radiomic feature variability responded to variations in acquisition parameters, including the number of acquisitions, repetition time, echo time, sampling bandwidth, and spatial resolution. 28 Rai et al. performed a multicenter evaluation of MRI-based radiomic feature reproducibility in phantoms, 29 Wong et al. studied longitudinal acquisition repeatability of features on an American College of Radiology MRI phantom, 30 and Lee et al. used a test-retest scheme to quantify repeatability of radiomic features in an MRI radiomics phantom while varying acquisition parameters across multiple scanners. 31 Finally, in human images, one recent test-retest study investigated repeatability of MR-based radiomic features in glioblastoma across several preprocessing approaches, 32 one examined the sensitivity of radiomic features to inter-observer variability in apparent diffusion coefficient maps in cervical cancer, 33 and another compared intensity normalization and discretization methods for gadolinium-enhanced T1-weighted and T2-weighted fluid-attenuated inversion recovery series in glioma patients. 34 Recent MRI studies have also been performed in patients with Alzheimer's, 35 multiple sclerosis, 35 lachrymal gland tumors, 36 breast lesions, 36 and glioblastoma multiform, 37,38 as well as healthy volunteers. 35,[39][40][41] Many preprocessing variables have been studied, including contrast weighting, 40 resolution, 40 acceleration, 40,41 gray level discretization, 36,38 statistical normalization, 35,38 and bias field correction, 37 all of which affect the repeatability and reproducibility of extracted radiomic features.
Many studies of MR-based radiomic features have been test-retest repeatability studies, 30,[38][39][40][41] which typically minimize dataset heterogeneity in order to exclusively isolate the intra-scanner variability of radiomic features. However, several previous studies have pointed out the need for balance between dataset homogeneity, which results in low noise in order not to mask any radiomic signature, and heterogeneity, which offers increased generalizability for application to realworld datasets. [42][43][44] Previous studies have addressed some practical concessions that must be made to real-world dataset quality. Many concluded that one of the best options to standardize results from multi-center studies or retrospective cases is to perform preprocessing prior to feature extraction and determine the stable features. 27,42 Given the prevalence of this pragmatic approach, best practices must be determined for radiomics studies in these real-world conditions as well as in studies with highly controlled datasets.
The purpose of this work was to both (i) quantitatively compare the variability of radiomic features extracted from 2-D SE and 3-D SPGR MR images of brain metastases acquired within the same patient in a single imaging session to determine the impact of image acquisition on the identified radiomic features, and (ii) to provide a framework to use these results to inform data acquisition and processing to improve the reliability and reproducibility of radiomics studies. Consensus recommendations for acquiring post-contrast T1-weighted images using both pulse sequences are relatively recent, and many imaging protocols and existing datasets include only one post-contrast T1-weighted acquisition. With this unique dataset, it is critical to understand the variability of radiomic features extracted from both acquisition types, especially for retrospective studies where heterogeneous imaging data is often unavoidable. Furthermore, this information impacts the design of imaging protocols for future studies and the selection of appropriate radiomic data where data from both acquisitions are available. We consider this study to be innovative because it addresses a key aspect of reproducibility in MRI radiomics studies: the sensitivity of features to the input imaging data acquisition parameters. To our knowledge, no previous studies have compared variability of radiomics features derived from different common MR pulse sequences in human imaging data. Based on this data, we provide general recommendations for the design of reproducible radiomics studies, and the extended data can be used to guide study design in more specific cases.

Dataset
The study protocol (PA17-0374_MOD008) was approved by the MD Anderson Office of Human Subjects Protection. Ethical approval and consent from participants were waived by the committee. A retrospective cohort was identified consisting of 225 patients treated with Gamma Knife for brain metastasis who subsequently developed radiological/ clinical progression and required surgical resection of the same lesion. MR imaging was performed on multiple 1.5T and 3T General Electric MR scanners prior to surgical resection, after which brain metastases were confirmed via pathology. Patients were excluded who did not have T1-weighted post-contrast images acquired by both 2-D SE and 3-D SPGR in a single exam, resulting in 29 patients. T1-weighted post-contrast images were acquired after application of gadobutrol (Gadavist, Bayer Healthcare) with a dose of 0.1 mmol/kg body weight. Within the same exam, T1-weighted post contrast images were acquired using 2-D SE after contrast injection, followed by 3-D SPGR. Standard in-line reconstruction provided by the vendor was used, and no acceleration was employed. Table 1 shows characteristics of these 29 cases, and Table 2 lists scanners and pulse sequence parameter ranges for the two T1-weighted post-contrast acquisitions.
Tumor volume segmentation Tumor volumes were contoured by experienced physicians using the treatment planning system RayStation. In patients with multiple lesions, the contoured lesion was the same one ultimately surgically resected. The enhancing volume in both T1-weighted post-contrast images (2-D SE and 3-D SPGR) was contoured via semi-automated methods using tools available in RayStation and adjusted manually as necessary. The tumor volume on each image was contoured twice by different physicians in order to examine the variability introduced through segmentation. Contour similarity was evaluated using Dice similarity coefficient (DSC). Example images and tumor segmentations are shown in Figure 1.
Image preprocessing and radiomic feature extraction In total, 29 patients with brain metastases were included in this study. Each patient had two images (2-D SE and 3-D SPGR), and each tumor volume was segmented by two different physicians, resulting in 58 images and 116 segmentations of the tumor volumes that were analyzed. Radiomic features were extracted using PyRadiomics, which adheres to the image biomarker standardization initiative (IBSI) definitions of features. 45 For each combination of preprocessing variables, 105 radiomic features were extracted from several classes, including shape, first order, gray level co-occurrence matrix (GLCM), gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM), gray level dependence matrix (GLDM), and neighboring gray tone difference matrix (NGTDM). These features are listed in Table 3 under their respective classes along with abbreviations for the full feature names used in figures for readability.
Several preprocessing variables were investigated, including spatial normalization, 2-D or 3-D radiomic feature extraction, intensity discretization, and image filters. Spatial normalization resolutions included 0.4297 Â 0.4297 Â 5 mm, 1 Â 1 Â 5 mm, 3 Â 3 Â 3 mm, and 1 Â 1 Â 1 mm. Both 2-D and 3-D radiomic features were extracted, where 3-D features consider voxels from adjacent slices to be neighboring for purposes of feature computations, and 2-D features only consider neighboring voxels within the same slice. Intensity discretization performed on original images included both fixed bin count and fixed bin width methods. Fixed bin counts of 16, 32, 64, 128, and 256 and fixed bin widths of 16, 32, 64, 128, and 256 were considered. This resulted in a total of 264 preprocessing scenarios and 30,624 different feature extractions. In addition to original images, radiomic features were extracted from filtered images. Filters included Table 1. Characteristics of brain metastases cases treated with Gamma Knife that subsequently developed tumor progression, requiring surgical resection of the same lesion. In total, 29 cases had T1-weighted postcontrast images acquired by both 2-D spin echo and 3-D spoiled gradient echo pulse sequences in the same exam prior to surgical resection.

Characteristic
Cases (

Data analysis
Lin's concordance correlation coefficient (CCC) was used to determine concordance between radiomic feature values extracted from the two acquisition types. CCC is defined between two variables X and Y as where ρ is the correlation coefficient between X and Y, σ x and σ y are the standard deviations of X and Y, respectively, and μ x and μ y are the means of X and Y, respectively.
CCC was computed between values of each radiomic feature extracted from the 2-D SE series and those extracted from the 3-D SPGR series for all 264 combinations of preprocessing variables. Kernel density estimates were performed to compare the distribution of CCC among all radiomic features between each combination of preprocessing methods.   CCC was also computed between values of each radiomic feature extracted from the first segmentation of the tumor volume and those from the second segmentation of the same tumor volume. CCC was calculated for each radiomic feature, combination of preprocessing methods, and acquisition type. Kernel density estimates were again performed to compare the distribution of CCC among all radiomic features between each different acquisition type and combination of preprocessing methods. General recommendations for radiomic study design were given based on patterns of concordance. Exclusion was suggested for filters or preprocessing variable combinations with consistently low concordance or highly variable concordance between 2-D SE and 3-D SPGR extractions. Low concordance suggests these features, filters, and preprocessing variables are likely unreliable in heterogeneous datasets where both 2-D SE and 3-D SPGR pulse sequences were used. Features with consistently low concordance between extractions from the different lesion segmentations were likewise recommended for exclusion in studies in which segmentation variability was present.
A Kolmogorov-Smirnov (K-S) test was performed between radiomic feature values computed from 2-D SE images and those from 3-D SPGR images to assess whether the two distributions were significantly different. K-S tests were repeated to test for distribution differences in each radiomic feature and each different combination of preprocessing variables. Levene's test for equality of variances was also performed on the same data to determine whether significant differences in variability were present.

Results
How to interpret and use these results to design your MR acquisitions and analysis Results are provided for two scenarios: (i) concordance between radiomic features extracted from 2-D SE and from 3-D SPGR acquisitions, which measures feature agreement between pulse sequences, and (ii) concordance between radiomic features extracted from contours drawn by two different observers on (iia) 2-D SE or (iib) 3-D SPGR images, which captures feature sensitivity to inter-observer variability. For each of these scenarios, there were 105 radiomic features and 264 combinations of preprocessing decisions to consider. Our recommendations for radiomic study design are provided based on broad patterns of concordance displayed in these features and processing combinations, a subset of which is displayed in the following figures. These recommendations are presented in the form of flowcharts, and the quantitative effects of each study design decision are summarized graphically.
The extended data can also inform more specific decisions in the design of MR acquisitions and analysis by using the concordance of radiomic features for pulse sequences, contouring methods, and preprocessing methods of interest. The complete set of results can be found divided into several figures in the extended data. Alternatively, these data can be viewed as single next-generation clustered heat maps (NG-CHM) via an interactive viewer at https://www.ngchm.net using the provided NG-CHM files for CCC between features from 2-D SE and 3-D SPGR data, CCC between different lesion segmentations on 2-D SE data, and CCC between different lesion segmentations on 3-D SPGR data. 46,47 The interactive NG-CHMs allow the convenient visualization of the results in their entirety in order to provide further information with which to support the design of radiomics studies.

Radiomic features extracted from 2-D SE and 3-D SPGR images
CCC was computed between radiomic feature values extracted from 2-D SE and 3-D SPGR images. A value of 1 indicates perfect concordance, -1 indicates perfect discordance, and 0 indicates complete absence of concordance. Figure 2 shows the clustered heat map of CCC values computed for 2-D radiomic features extracted from original images and various preprocessing combinations, including spatial normalization to voxel sizes of 0.4297 Â 0.4297 Â 5 mm (vs435), 1 Â 1 Â 5 mm (vs15), 3 Â 3 Â 3 mm (vs33), and 1 Â 1 Â 1 mm (vs11), intensity discretization to fixed bin counts of 16 (bc16), 32 (bc32), 64 (bc64), 128 (bc128), and 256 (bc256), and intensity discretization to fixed bin width of 16 (bw16), 32 (bw32), 64 (bw64), 128 (bw128), and 256 (bw256). Figure 3 contains the clustered CCC values for 3-D radiomic features in the same combinations. Table 4 lists features with consistently high concordance between values extracted from 2-D SE and 3-D SPGR images. Heat maps of CCC values computed for filtered images, including Laplacian of Gaussian (LoG) with sigma values ranging from 0.5 mm to 5 mm and wavelet (LLH, LHL, LHH, HLL, HLH, HHL, HHH, and LLL), as well as square, square root, logarithm, exponential, and gradient normalization, are included in the extended data. Kernel density estimates of CCC distributions are also computed and displayed in the extended data for all combinations of spatial normalization and intensity discretization in original images.
Two-sample K-S goodness of fit tests were performed between radiomic feature values extracted from 2-D SE images and those from 3-D SPGR images for each radiomic feature and combination of preprocessing methods. Figures 4 and 5 show clustered binary heat maps of 2-D and 3-D features, respectively, and combinations of preprocessing methods for which the K-S test resulted in p < 0.05, indicating rejection of the null hypothesis that the two sets of radiomic feature values were from the same distribution. Similarly, Levene's test for equality of variances was performed between radiomic feature values extracted from 2-D SE images and those from 3-D SPGR images for each radiomic feature and combination of preprocessing methods. These results are included in the extended data.
By hierarchically clustering CCC for each feature and preprocessing variable, low-concordance combinations can be identified and recommended for exclusion in future radiomics studies employing both pulse sequences in the acquisition of T1-weighted post-contrast images in order to improve reproducibility. Figure 6 summarizes recommendations for radiomic feature extraction from T1-weighted post-contrast images acquired by mixed pulse sequences, and Figure 7 groups CCC values to show the effects of each recommendation. The median CCC between features extracted from  images acquired by 2-D SE and 3-D SPGR for recommended combinations of features and preprocessing variables was 0.79, compared to a median CCC of 0.51 for those combinations that are not recommended.
Radiomic features extracted from different segmentations DSC was used to measure the similarity of ROIs delineated by two different physicians. On 2-D SE images, mean DSC was 0.9072 and standard deviation was 0.0574. On 3-D SPGR images, mean DSC was 0.9067 and standard deviation was 0.0456. This suggests good overall agreement between segmentation, which is suitable for assessing feature sensitivity to inter-observer variability. CCC was also computed between radiomic feature values extracted from two different segmentations of the tumor volume on 2-D SE images and separately on 3-D SPGR images. Figure 8 shows the clustered heat map of CCC values computed for original 2-D SE images and differing preprocessing combinations, including  spatial normalization to varying voxel sizes, bin counts for relative intensity discretization, and bin widths for absolute intensity discretization. Figure 9 shows the same for 3-D features and original 3-D SPGR images. Table 5 lists the features with consistently low concordance between values extracted from the two segmentations on 3-D SPGR images. Results for 2-D SE and 3-D SPGR LoG-and wavelet-filtered images, as well as square, square root, logarithm, exponential, and gradient normalizations, are provided in extended data. Kernel density estimates of these CCC distributions from 2-D SE and 3-D SPGR images are computed and displayed in the extended data for all combinations of spatial normalization and intensity discretization in original images.
For each radiomic feature and combination of preprocessing methods, two-sample K-S goodness of fit tests were performed between radiomic feature values extracted from two different segmentations of the tumor volume on 2-D SE images and separately on 3-D SPGR images. Only 12 total combinations of features and preprocessing methods from both 2-D SE and 3-D SPGR images resulted in K-S tests with p < 0.05, indicating rejection of the null hypothesis that the two sets of radiomic feature values are from the same distribution. Detailed figures are shown in the extended data. Similarly, Levene's test for equality of variances was performed between radiomic feature values extracted from the two sets of tumor volume segmentations for each radiomic feature, each combination of preprocessing methods, and both 2-D SE and 3-D SPGR images. Seventeen features that consistently met the Levene's test significance threshold (p < 0.05) between two different lesion segmentations on 3-D SPGR series are listed in Table 6. Detailed results are again provided in extended data.
Low-concordance combinations of features and preprocessing variables were again identified and recommended for exclusion from future studies by hierarchically clustering CCC values for all combinations. Figure 10 summarizes recommendations for radiomic feature extraction from T1-weighted post-contrast images based on variability introduced by lesion segmentation methods, and Figure 11 groups CCC values to show the effects of each recommendation made. The median CCC between features extracted from two different lesion segmentations was 0.93 for images acquired by 2-D SE and 0.86 for images acquired by 3-D SPGR. By applying additional recommendations to feature extraction from 3-D SPGR data, median CCC of recommended features and preprocessing variables was increased to 0.93.

Discussion
This study compared the resulting MR-based radiomic features between two common T1-weighted MR pulse sequences (2-D SE and 3-D SPGR) for brain metastases imaging that were acquired in the same patient during the same imaging study. It highlights the impact of image acquisition and processing on the identified radiomic features by investigating feature variability as a function of pulse sequence, spatial normalization, 2-D or 3-D feature extraction, intensity discretization, image filters, and inter-observer variability in segmentation. Repeatability and reproducibility studies of radiomic features are necessary to assess the generalizability of radiomics applications. For example, machine learning classifiers trained on radiomic features may not be generalizable to other datasets if training is performed with features that perform poorly in repeatability or reproducibility metrics. Repeatability and reproducibility studies also contribute to the future design and standardization of high-quality, reliable radiomics studies. For these reasons, repeatability and reproducibility studies of radiomic features are increasingly necessary.
The images in this study were not acquired with uniform pulse sequence parameters. A previous study found that the sensitivity of second-order texture features to variation in acquisition parameters increases non-uniformly with spatial resolution across features. However, it also determined that variations in acquisitions such as number of acquisitions, Figure 6. Flowchart summarizing recommendations based on results from this work. This flowchart assumes an existing dataset with post-contrast T1-weighted images and provides suggestions for selecting pre-processing and feature extraction parameters.
repetition time, echo time, and slice bandwidth did not significantly affect results of pattern discrimination above certain spatial resolution thresholds, and it identified GLCM features as the most robust to variability introduced in datasets with lower or heterogeneous spatial resolutions. 28 Because many radiomic features are likely dependent on acquisition parameters, one review of feature repeatability and reproducibility studies recommends performing benchmarking studies on datasets with heterogeneous acquisition variables, such as slice thickness, acquisition protocols, multiple scanner manufacturers, and multiple institutions. 22 In these results, the fixed bin count intensity discretization method for original images generally displays greater CCC between 2-D and 3-D sequences than the fixed bin width method (Figures 2 and 3). This result reaffirms the widespread preference for relative discretization methods when processing MR images. Fixed bin methods introduce some normalization to images for which intensity units are arbitrary and contrast is important. 21 Features with consistently high concordance between 2-D SE and 3-D SPGR series are listed in Table 4. Some features, such as large area emphasis, large area high gray level emphasis, and zone variance from the GLSZM class, demonstrated high variability in concordance between 2-D SE and 3-D SPGR series. For these features, concordance may be sensitive to preprocessing variables but generally may not be robust to variation in acquisition parameters or pulse sequence. In a mathematical phantom, Ford et al. determined that radiomic features varied significantly between SE and SPGR pulse sequences, 26 which is supported by these results.
When considering features extracted from filtered images, the LoG filter, for all tested values of sigma, generally resulted in greater concordance between 2-D SE and 3-D SPGR sequences than other filters ( Figures S-1 and S-2). In particular,   For spatial normalization, kernel density estimates of CCC distributions among radiomic features ( Figure S-12) show that 2-D radiomic features extracted at the non-isotropic resolutions (0.4297 Â 0.4297 Â 5 mm and 1 Â 1 Â 5 mm) possess higher concordance as a whole between 2-D SE and 3-D SPGR features. 3-D radiomic features extracted at non-isotropic resolutions possess lower concordance as a whole, which is expected since this eliminates rotational invariance of features. For this reason, this combination is not recommended, and in-plane computation of radiomic features should be employed for non-isotropic spatial normalization. For both 2-D and 3-D radiomic feature extraction, the isotropic resolutions (3 Â 3 Â 3 mm and 1 Â 1 Â 1 mm) showed lower concordance as a whole between 2-D SE and 3-D SPGR. This suggests a significant penalty to agreement between features from the two sequences as a result of interpolation between slices.
From the two-sample K-S tests, several features met the significance threshold (p < 0.05) for being sampled from different distributions (Figures 4 and 5). This suggests that these features should be treated with caution, as they may not be robust to differences between the acquisition parameters and pulse sequences. Several of the same features met the significance threshold (p < 0.05) in Levene's test for equality of variances . Additional features displayed statistically significantly different variances between 2-D SE and 3-D SPGR features but did not reach significance for the K-S test. These features may be more stable for images acquired with one pulse sequence compared to the other.
Nearly all features and combinations of preprocessing methods demonstrated very high concordance between the two segmentations drawn by different physicians on the 2-D SE images (Figures 8,. This is encouraging as it indicates that almost all features were robust to inter-observer variability in segmentation of the tumor volume. Likewise, most features and combinations of preprocessing methods showed high concordance between segmentations on 3-D SE images as well (Figures 9 and S-26). However, several features yielded consistently low concordance between observers (Table 5). This indicates that these features may be poor choices for applications trained on data acquired with 3-D SPGR sequences, as they likely are not robust to inter-observer variability during segmentation. Similar features displayed low concordance for filtered 3-D SPGR images ( Figures S-27, S-28, S-29, and S-30). Again, most features demonstrated very low concordance for square, square root, exponential, and gradient normalization scales, suggesting that these are not reliable methods for processing MR images for feature extraction. As a whole, fixed bin width methods resulted in greater concordance between the two segmentations for 3-D SPGR series (Figures 9,. This might be explained by the inclusion or exclusion of voxels at the segmentation boundaries having less impact on the bins into which interior voxels fall for fixed bin widths than for fixed bin counts. Still, it is unlikely to be advantageous to sacrifice the normalizing effect of fixed bin count methods for MR data for better robustness to inter-observer variability from segmentation. If fix bin width intensity discretization is used, it is important to include an appropriate intensity normalization method in image pre-processing. 34 For 2-D SE data, spatial normalization did not appear to have a significant impact on agreement between features extracted from the two segmentations ( Figures S-33, S-35, and S-36). For 3-D SPGR data, only the 3 Â 3 Â 3-mm spatial normalization appeared to perform consistently poorly compared to the other resolutions tested (Figures S-34, S-37, and S-38). This resolution results in a significant amount of averaging of the original image data, and therefore loss of feature concordance is unsurprising.
For two-sample K-S tests between feature values from the two segmentations, only six features and preprocessing methods met the significance threshold (p < 0.05) for being sampled from different distributions for 2-D SE data , and six features and preprocessing methods met the significance threshold for 3-D SPGR data Figure 10. Flowchart summarizing recommendations based on results from this work. This flowchart assumes an imaging protocol with post-contrast T1-weighted images is being designed before data collection and provides suggestions for selecting pulse sequence and pre-processing and feature extraction parameters.
( . However, a few features from the 2-D SE data met the significance threshold (p < 0.05) for Levene's test for equality of variances , including long run high gray level emphasis and long run low gray level emphasis from the GLRLM class and large area high gray level emphasis, large area low gray level emphasis, and small area low gray level emphasis from the GLSZM class. Several more features from the 3-D SPGR data consistently met this significance threshold . This suggests that features extracted from 2-D SE images are generally more robust to inter-observer variability during segmentation than those from 3-D SPGR images. This may result from better lesion conspicuity or higher signal-to-noise ratio (SNR) in 2-D SE acquisitions. The features listed above may suffer instability in 3-D SPGR images as a result variability introduced during segmentation of the tumor volume.
The results from this study are summarized into a series of recommendations that are shown as flowcharts in Figures 6 and  10. Figure 6 walks through decisions for an existing dataset with post-contrast T1-weighted images and provides suggestions for selecting pre-processing and feature extraction parameters. For T1-weighted post-contrast images acquired by both SE and SPGR sequences, fixed bin count intensity discretization is strongly recommended, nonisotropic spatial normalization with comparable in-plane resolution and 2-D radiomic feature extraction are potentially beneficial, and exclusion of low-concordance feature groups, such as certain wavelet filters (HLH, HHL, and HHH) and square root, logarithm, and exponential normalization, is strongly recommended. Figure 7 demonstrates the effect of each individual recommendation and the cumulative impact of all recommendations. The strong recommendations above resulted in substantial improvement in CCC, and the potential recommendations resulted in more modest improvement.  (Table 4) and remaining features. (c) Distribution of CCC for features extracted from images acquired by 2-D spin echo (SE) and SPGR preceded by fixed bin count (BC) and fixed bin width (BW) intensity discretization after excluding not recommended features (Table 4). (d) Distribution of CCC between features extracted from images acquired by 3-D SPGR obtained by applying all recommendations and remaining features. Figure 10 considers decisions involved in designing an imaging protocol with post-contrast T1-weighted images before data collection and provides suggestions for selecting pulse sequences. For any segmentation with significant variability, especially manual or semi-automated methods, 2-D SE acquisitions are strongly preferred to reduce sensitivity to inter-observer variability. Fixed bin count intensity discretization and exclusion of low-concordance feature groups, such as HHL and HHH wavelet filters and square root and logarithm normalization, are strongly recommended for 2-D SE data. If 3-D SPGR acquisitions are included, fixed bin width intensity discretization may potentially reduce feature sensitivity to segmentation variability but requires appropriate intensity normalization. Exclusion of low-concordance features in Table 5 results in a comparable distribution of CCC values to those obtained from segmentations on images from 2-D SE acquisitions, so this is strongly recommended. Figure 11 again compares the effects of each individual decision in this process and the cumulative effect of all recommendations if 3-D SPGR acquisitions are included.
Some limitations to this work warrant further study. Although the sample set was small, the goal of this study was to leverage a unique dataset to compare the effect of two pulse sequences within the same patient and same tumors imaged in the same scanner in a single imaging session on the resulting radiomic features. Given that consensus recommendations on brain metastases protocols that include both 2-D SE and 3-D SPGR T1+C acquisitions are relatively recent, 8 we are unlikely to be able to assemble a larger dataset similar to this one retrospectively. Over 85% of patients from the original cohort were rejected because they did not have both T1-weighted post-contrast series in the same exam. Second, the scanner models and acquisition parameters were not uniform across patients included in this study. In typical test-retest studies, dataset heterogeneity is often considered to be a weakness. Because this study does not strictly compare identical measurements, but rather concordance of similar measurements, we believe increased generalizability better protects against dataset variable dependence in the results. Several previous studies have pointed out the need for balance between dataset homogeneity, which affords low noise in order to detect radiomic signatures, and heterogeneity, which offers increased generalizability for application to real-world datasets. [42][43][44] However, it is important to note potential drawbacks of dataset heterogeneity. Third, the main goal of this study was to study feature concordance between two different pulse sequences in brain metastasis imaging, but several additional variables affect reproducibility, such as intensity normalization, and would be useful to focus on in future work. These variables introduce additional considerations, such as white matter segmentation dependence, effects on image texture, and tumor size-dependent distortion. 34,35 Finally, it would be useful to complete the picture of brain metastasis imaging protocols by investigating other commonly employed sequences, such as T2 FLAIR.

Conclusions
MR-based radiomic features that demonstrate high concordance between values extracted from images acquired with different pulse sequences may be more reliable and robust inputs to feature-based models that assist with clinical decision making. Similarly, those with high concordance between feature values extracted from different tumor volume segmentations may be more stable against variability introduced during segmentation. Fixed bin count intensity discretization demonstrated higher concordance between features extracted from 2-D SE and 3-D SPGR images, which agrees with common recommendations for MR-based radiomic feature extraction. Non-isotropic spatial normalization was found to have higher concordance between features extracted from 2-D SE and 3-D SPGR images. This study found that the 2-D SE pulse sequence was more robust to inter-observer variability in tumor volume segmentation than the 3-D SPGR pulse sequence. We use these results to provide comprehensive recommendations for preprocessing in future radiomics studies with heterogeneous imaging data. This project contains the following underlying data:

Data availability
• pulsesequences_ccc_ngchm.ngchm. (The complete set of results for CCC between features from 2-D SE and 3-D SPGR data. These data can be viewed as a single next-generation clustered heat map (NG-CHM) via an interactive viewer at https://www.ngchm.net. The interactive NG-CHMs allow the convenient visualization of the results in their entirety in order to provide further information with which to support the design of radiomics studies, e.g. selection of pulse sequences, contouring methods, and preprocessing methods of interest.) • segmentation_2dse_ccc_ngchm.ngchm. (The complete set of results for CCC between different lesion segmentations on 2-D SE data. These data can be viewed as a single next-generation clustered heat map (NG-CHM) via an interactive viewer at https://www.ngchm.net. The interactive NG-CHMs allow the convenient visualization of the results in their entirety in order to provide further information with which to support the design of radiomics studies, e.g. selection of pulse sequences, contouring methods, and preprocessing methods of interest.) • segmentation_3dspgr_ccc_ngchm.ngchm. (The complete set of results for CCC between different lesion segmentations on 3-D SPGR data. These data can be viewed as a single next-generation clustered heat map (NG-CHM) via an interactive viewer at https://www.ngchm.net. The interactive NG-CHMs allow the convenient visualization of the results in their entirety in order to provide further information with which to support the design of radiomics studies, e.g. selection of pulse sequences, contouring methods, and preprocessing methods of interest.)

Data that cannot be shared
The following data cannot be shared due to restrictions on data sharing in the IRB protocol. Individuals may contact the corresponding author to apply for access to the data, which will be granted upon IRB approval.
• 29 T1-weighted post-contrast image series acquired by 3-D spoiled gradient echo sequences. This project contains the following extended data:

Extended data
• Figure