Radiomics is a rapidly evolving field that aims to link phenotypes characterized from medical images with clinical data, including, but not limited to, diagnostic, prognostic, and genomic information (1–7). Quantitative image features (also known as radiomics features) have been shown, for example, to be associated with distant metastasis in lung adenocarcinoma (8–10), pathological response in a variety of cancer types (11–15), cancer recurrence after radiation therapy (16–19), disease-free survival (20–23), and even genotypes in many different cancer types (1, 24–31). Although there are many published prediction models related to both disease and treatment, there is no standardized evaluation of their performance (2), for example, through the use of publicly available data and open-source feature extractors. The need for repeatability and reproducibility in radiomics has been increasingly emphasized (32, 33).
Therefore, the National Institutes of Health has encouraged medical imaging researchers to publicly share their data to stimulate open-science collaboration, and The Cancer Imaging Archive (TCIA) has evolved into a leading public database (34). TCIA is a service that hosts a large archive of medical images of cancer accessible for public download. Researchers nationwide are encouraged to submit data sets, and the current collection contains projects sponsored by private institutions and national programs. The Cancer Genome Atlas (TCGA) program is one such project that has generated a huge database of genomic, epigenomic, transcriptomic, and proteomic data from >20,000 samples spanning 33 cancer types (35). Clinical, genetic, and pathological data are stored on the Genomic Data Commons (GDC) data portal, while the radiological data reside in TCIA.
Many research groups have developed and released open-source software packages in the hope of establishing standardization to enhance the reproducibility and comparability of radiomics results (36–41). The use of textural features for image classification dates back to 1973 (42), and image pattern recognition technologies have been widely deployed in computer-aided detection and diagnosis for the past 3 decades (43). A standard lexicon has been adopted as reference: the Image Biomarker Standardization Initiative (IBSI), version 9, available as of May 19, 2019 (44). Still, differences in image acquisition and preprocessing parameters may impact feature extraction (32, 45–48). Different radiomics software packages have also been shown to vary in algorithm implementation, which results in different feature values and poor agreement (49–50).
To the best of our knowledge, it is unknown how differences in feature extractor selection and feature calculation may impact the overall classification performance. The purpose of this study was to investigate differences in the overall radiogenomic classification performance on publicly available computed tomography (CT) images of patients with non–small cell lung cancer (NSCLC) owing to the use of different feature extractors. We also report in detail our experience in the use of public data sets and open-source feature extractors.
Materials and Methods
The basic study design is shown in Figure 1. Public imaging data from TCIA relevant to our experiment were collected and split into training and validation cohorts. TCIA data consisted of 3 shared projects: NSCLC-Radiogenomics (51), TCGA-Lung Adenocarcinoma (TCGA-LUAD) (52), and TCGA-Lung Squamous Cell Carcinoma (TCGA-LUSC) (53). The training cohort was created using part of NSCLC-Radiogenomics (data collected from 2 institutions with relatively homogeneous CT scanning parameters), while the validation cohort was created using a mix of data from the 3 projects (data collected from 7 institutions with diverse CT scanning parameters). This split aimed to test the generalization ability of radiomics features/models from a relatively homogeneous data set to a more heterogeneous data set. Three feature extractors were used for feature extraction. Univariate and multivariate analyses were performed sequentially to predict epidermal growth factor receptor (EGFR) mutation status using each individual extractor, and performance was compared among the 3 extractors.
Patient Imaging and Clinical Data
The 3 data sets, NSCLC-Radiogenomics, TCGA-LUAD, and TCGA-LUSC, were obtained through the TCIA website. The NSCLC Radiogenomics data set was produced and described in detail by Bakr et al. (51). The NSCLC Radiogenomics data set included 211 cases with 129 EGFR wildtypes, 43 EGFR mutants, and 39 unknowns. The TCGA-LUAD (52) and TCGA-LUSC (53) data collections provide clinical images matched to subjects in TCGA. The TCGA-LUAD data set included 69 cases with 52 EGFR wildtypes, 11 EGFR mutants, and 6 unknowns. The TCGA-LUSC data set included 37 cases with 36 EGFR wildtypes and 1 EGFR mutant. Imaging data for TCGA were collected from many sites worldwide and are very heterogeneous in terms of scanner modalities, manufacturers, and acquisition protocols. We included all patients who had a chest CT scan and a known EGFR mutation status. We excluded cases that had no noticeable lesion, cases with artifacts such as a biopsy needle in the lesion, and cases with multiple lesions and no provided segmentation. In some cases, Pyradiomics and IBEX produced an error during feature extraction, and these cases were excluded as well. Additional details are provided in online supplemental Section S3. In total, 149 cases from NSCLC Radiogenomics and 79 cases from the TCGA-LUAD and TCGA-LUSC data sets were ultimately included in the study. Further details regarding all data, including information about scanning parameters, are included in the online supplemental Section S1A.
Training and Validation Set Split
The training and validation cohorts were split by data set and adjusted in order to maintain a balance of EGFR mutants and wildtypes in each cohort. The training cohort consisted of a random subset of the included NSCLC Radiogenomics cases, totaling 105, with 27 mutant and 78 wildtype cases. The validation cohort included the remaining 44 cases from NSCLC Radiogenomics and the cases from TCGA-LUAD and TCGA-LUSC, totaling 123, with 18 mutant and 105 wildtype cases. The validation cohort was a much more heterogeneous sample owing to the contribution from 3 data sets. Figure 1 details this split visually.
The validation cohort was also split into 3 subgroups corresponding to the 3 data sets. The NSCLC Radiogenomics subgroup had 44 cases with 33 EGFR wildtypes and 11 EGFR mutants. The TCGA-LUAD subgroup had 46 cases with 39 EGFR wildtypes and 7 EGFR mutants. The TCGA-LUSC subgroup had 33 EGFR wildtype cases only.
The NSCLC Radiogenomics data set provided segmentation for only 144 out of 211 cases. The remaining 67 cases and the TCGA-LUAD/-LUSC cases were segmented semiautomatically using a published segmentation algorithm incorporated into an open-source image viewing platform, WEASIS (54–55). The available and newly created segmentations were reviewed by an experienced thoracic radiologist (LE) and manually adjusted if necessary.
Three feature extractors were used to extract radiomics features from the segmented tumor volumes. The radiomics feature extractors included 2 open-source software packages, Pyradiomics, developed by Aerts' group (36), and the Imaging Biomarker Explorer (IBEX), developed by Court's group (37), and our in-house extractor, Columbia Image Feature Extractor (CIFE), developed by Zhao's group (32). Conditions among the 3 packages were controlled by using the recommended settings or, where none were available, the default settings.
Pyradiomics V2.1.2 (36) is an open-source Python package for the extraction of radiomics features from medical imaging. In total, 1319 features were extracted from each segmented tumor using Pyradiomics.
IBEX version 1.0β (37) is an open-source MATLAB and C/C++ software platform designed to support common radiomics workflow tasks, including but not limited to feature extraction. All available features were extracted without image preprocessing filters. In total, 1767 features were extracted from each segmented tumor using IBEX.
CIFE (32) is our in-house software package based on MATLAB 2016b (The MathWorks, Natick, MA) designed to extract radiomics features from medical imaging. In total, 1126 features were extracted from each segmented tumor using CIFE. (See online supplemental Section S1B for further details about settings of each feature extractor.)
Analysis was run separately and identically on the 3 different feature sets computed from the 3 feature extractors. In this work, the univariate and multivariate analyses were performed sequentially on the feature sets. The univariate analysis was performed on only the training cohort to select features, and the multivariate analysis was performed on the training cohort and validated in the validation cohort.
First, a large number of redundant (ie, highly correlated) and noninformative features were removed using unsupervised clustering and receiver operating characteristic analysis. The unsupervised hierarchical clustering was performed in 3 steps:
1. Spearman rank correlations were calculated between features.
2. Features were organized into a hierarchical clustering tree based on these correlations.
3. Features were separated into groups based on a set correlation threshold.
Within each group of redundant features, the correlation threshold was set to <0.2, and only features satisfying that criterion were selected as nonredundant (56). Nonredundant features were then examined in the univariate analysis, using the area under the receiver operating characteristic curve (AUC) to indicate the prediction performance of each feature. Only features with AUC > 0.6 were selected as informative features. Because the data set we used was relatively small, we used an unsupervised clustering–based algorithm instead of other widely used supervised feature selection algorithms (eg, mRMR and Relief), which might carry a high risk of overfitting (56–58).
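The selection procedure described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation (the study was run in MATLAB): the function names `auc_score` and `select_features` are hypothetical, the clustering distance 1 − |ρ| with average linkage is an assumed choice, and keeping the highest-AUC feature in each group is one possible reading of how a nonredundant representative is chosen.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def auc_score(values, labels):
    """AUC of one feature, computed from ranks (Mann-Whitney identity)."""
    labels = np.asarray(labels, dtype=bool)
    ranks = rankdata(values)  # average ranks, so ties are handled
    n_pos, n_neg = labels.sum(), (~labels).sum()
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    auc = u / (n_pos * n_neg)
    # Assumption: a feature is informative in either direction.
    return max(auc, 1.0 - auc)


def select_features(X, y, corr_threshold=0.2, auc_threshold=0.6):
    """Return indices of nonredundant, informative columns of X, plus all AUCs."""
    # Step 1: Spearman correlation = Pearson correlation of the column ranks.
    ranks = np.apply_along_axis(rankdata, 0, X)
    rho = np.corrcoef(ranks, rowvar=False)
    # Step 2: hierarchical tree on the distance 1 - |rho| (assumed metric).
    dist = np.clip(1.0 - np.abs(rho), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    # Step 3: cut the tree so that groups correspond to |rho| >= threshold.
    groups = fcluster(tree, t=1.0 - corr_threshold, criterion="distance")
    aucs = np.array([auc_score(X[:, j], y) for j in range(X.shape[1])])
    selected = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        best = members[np.argmax(aucs[members])]  # assumed representative
        if aucs[best] > auc_threshold:
            selected.append(int(best))
    return sorted(selected), aucs
```

Here `X` is a cases-by-features array from the training cohort and `y` holds the binary EGFR labels. Selection uses the training cohort only, so the validation cohort stays untouched until the models are evaluated.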
In the multivariate analysis, features obtained from the univariate analysis were used to build models on the training set using 4 widely used machine-learning classification algorithms: k-nearest neighbors (KNN), least absolute shrinkage and selection operator (LASSO), support vector machine (SVM), and random forest classifier techniques. Fivefold cross-validation was applied on the training cohort to establish a performance baseline. In the 5-fold cross-validation, the training cohort was randomly separated into 5 subsets. One subset was used as a testing set, whereas the other 4 subsets were used as the training set. The training and testing procedures were repeated 5 times until each sample in the data set was used as a testing sample exactly once. The same 5-fold subsets were used for every model. The final training AUC for the prediction model was estimated as the average of the 5 fold-level AUCs.
The performance of each model was then evaluated on the independent validation cohort. No samples in the independent validation cohort had been seen during training. The input to each model was the selected feature values and the output was the EGFR mutation status. A bootstrap approach reported by Aerts et al. (1) was used to assess the significance of differences between models obtained from each feature extractor. We calculated the AUC from 100 randomly selected samples 100 times, and the Wilcoxon test was used to assess significance.
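As a rough illustration of the cross-validation scheme, the following Python sketch evaluates several classifiers on identical folds. It uses scikit-learn rather than the MATLAB platform used in the study, covers only 3 of the classifier families for brevity, and leaves hyperparameters at arbitrary illustrative values. The helper names `cross_validated_auc` and `compare_models` are hypothetical, and stratifying the folds is an assumption made so that every fold contains both mutation classes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def cross_validated_auc(model, X, y, folds):
    """Average AUC over the folds; each sample is tested exactly once."""
    scores = []
    for train_idx, test_idx in folds:
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        prob = fitted.predict_proba(X[test_idx])[:, 1]  # P(class 1)
        scores.append(roc_auc_score(y[test_idx], prob))
    return float(np.mean(scores))


def compare_models(X, y, seed=0):
    """Evaluate each classifier on the same 5 folds."""
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    folds = list(splitter.split(X, y))  # computed once, reused by every model
    models = {
        "knn": KNeighborsClassifier(n_neighbors=5),
        "svm": SVC(probability=True, random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=200,
                                                random_state=seed),
    }
    return {name: cross_validated_auc(model, X, y, folds)
            for name, model in models.items()}
```

Materializing the folds as a list before the loop over models is what guarantees that every model sees the same 5 subsets.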
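A minimal Python sketch of such a bootstrap comparison follows. Several details are assumptions, because the text does not specify them: sampling is done with replacement, both models are scored on the same resample, and the paired Wilcoxon signed-rank test (`scipy.stats.wilcoxon`) is used. The function names are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney identity); labels are 0/1."""
    pos = np.asarray(labels, dtype=bool)
    ranks = rankdata(scores)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def bootstrap_aucs(labels, scores_a, scores_b,
                   n_rounds=100, n_samples=100, seed=0):
    """AUCs of two models on the same resamples of the validation cohort."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    aucs_a, aucs_b = [], []
    while len(aucs_a) < n_rounds:
        idx = rng.integers(0, len(labels), size=n_samples)
        if labels[idx].min() == labels[idx].max():
            continue  # AUC is undefined unless both classes are present
        aucs_a.append(auc(labels[idx], scores_a[idx]))
        aucs_b.append(auc(labels[idx], scores_b[idx]))
    return np.array(aucs_a), np.array(aucs_b)


def compare_extractors(labels, scores_a, scores_b):
    """Mean bootstrap AUC of each model and the paired Wilcoxon P value."""
    aucs_a, aucs_b = bootstrap_aucs(labels, scores_a, scores_b)
    _, p_value = wilcoxon(aucs_a, aucs_b)
    return float(aucs_a.mean()), float(aucs_b.mean()), float(p_value)
```

Here `scores_a` and `scores_b` would be the predicted probabilities from models built on two different extractors for the same validation cases.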
All statistical analysis was performed on MATLAB 2016b platform (The MathWorks). A 2-sided P value of <.05 was regarded as statistically significant.
The clinical characteristics of the 228 patients included in our experiment are presented in Tables 1 and 2. Differences were tested for statistical significance using the chi-square test for categorical data and the t test for continuous data. There was no significant difference between the training and validation cohorts in terms of age, sex, or tumor stage (P = .98, .74, and .39, respectively). The histological diagnosis showed a significant difference between the 2 cohorts, likely due to differences in data set origin (detailed in the Materials and Methods section). Although not statistically significant, there was a trend toward a difference between the training and validation cohorts in terms of the proportion of EGFR mutants and wildtypes (P = .054) owing to the increased number of EGFR wildtypes in the validation cohort.
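The cohort comparison of EGFR status can be reproduced from the counts given in the Materials and Methods section (27 mutant and 78 wildtype training cases; 18 mutant and 105 wildtype validation cases). The sketch below uses `scipy.stats.chi2_contingency`. Whether a continuity correction was applied in the study is not stated, so `correction=True` (Yates correction, the SciPy default for 2 × 2 tables) is an assumption.

```python
from scipy.stats import chi2_contingency

# Rows: training cohort, validation cohort.
# Columns: EGFR mutant, EGFR wildtype (counts taken from the text).
table = [[27, 78],
         [18, 105]]

# correction=True applies Yates continuity correction (an assumption here).
chi2, p_value, dof, expected = chi2_contingency(table, correction=True)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p_value:.3f}")
```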
For the 3 sets of features from each feature extractor, we selected candidate features with a correlation coefficient <0.2 and an AUC > 0.6. These features are presented in Table 2. The definitions of these features are presented in online supplemental Section S1B. From Pyradiomics, IBEX, and CIFE, 6, 5, and 4 features were identified, respectively.
Most of the features selected from each extractor were different. Pyradiomics and our CIFE both used forms of intensity minimum and skewness, but owing to the use of image preprocessing (Laplacian of Gaussian [LoG] filtering) in Pyradiomics, these values are not interchangeable. The distribution for every feature is included in the online supplemental Section S2B.
Moreover, we performed a nonparametric Wilcoxon rank sum test to assess the significance of the difference in feature distribution between EGFR wildtype and mutant cases for each individual candidate feature, and the results are shown in the online supplemental Figures S2–S4. All feature values had a significant difference in distribution between the wildtype and mutant subsets of the training cohort. However, 5/5 IBEX features, 6/6 Pyradiomics features, and 3/4 CIFE features did not have a significant difference between the wildtype and mutant subsets of the validation cohort. Only CIFE: intensity skewness had a significant difference between wildtype and mutant subsets of both the training and validation cohorts (P = .0072, P = .014) (see online supplemental Figure S4.4).
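A per-feature rank sum comparison of this kind can be sketched in Python with `scipy.stats.ranksums`. The helper name `feature_rank_sum_pvalues` and its arguments are hypothetical; the function would be called once on the training cohort and once on the validation cohort.

```python
import numpy as np
from scipy.stats import ranksums


def feature_rank_sum_pvalues(X, is_mutant, feature_names):
    """Two-sided Wilcoxon rank sum P value per feature, mutant vs wildtype."""
    is_mutant = np.asarray(is_mutant, dtype=bool)
    return {
        name: float(ranksums(X[is_mutant, j], X[~is_mutant, j]).pvalue)
        for j, name in enumerate(feature_names)
    }
```

A feature that separates the two groups in the training cohort but not in the validation cohort would show a small P value in the first call and a large one in the second.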
A correlogram of all features selected is shown in the online supplemental Figure S5 of Section S2C. There is little correlation between features from the same extractor, and there is some correlation between features from different extractors.
Multivariate Model Performance on Differentiating Lung Cancer EGFR Subtypes
The performance of multivariate models built from each feature extractor is summarized by the AUC value and is presented in Table 3. The optimal model from each feature extractor, determined by performance on the validation cohort, was produced using random forest classifier techniques. The performances from IBEX, Pyradiomics, and CIFE random forest models on the validation cohort were AUCs of 0.54, 0.56, and 0.64, respectively.
A pairwise comparison between each of the best models is shown in the online supplemental Table S2A. Comparisons were done using the bootstrap approach previously reported by Aerts et al. (1). Although the results of IBEX and Pyradiomics were not significantly different (P = .19), CIFE produced results significantly different from those of IBEX (P = 1.54e-14) and Pyradiomics (P = 2.02e-10).
A comparison between the performances of the best models on the training versus the validation data sets is shown in the online supplemental Table S2B. All models showed a significant difference in performance between the training and validation sets, and the difference appeared to be greater for IBEX and Pyradiomics.
In this study, we aimed to use different feature extractors on public imaging data to compare classification performance. The radiomics feature extractors included 2 open-source software packages, Pyradiomics (36) and IBEX (37), and our in-house extractor, CIFE (32). These software packages have seen extensive use by researchers worldwide in experiments to predict diagnostic, genomic, prognostic, and response outcomes for a wide range of diseases, and proved ideal candidates for our comparison (8, 13, 15, 59–67).
We initially extracted 1767, 1319, and 1126 features from IBEX, Pyradiomics, and CIFE, respectively. After removing redundant features and selecting informative features, we ultimately isolated 5, 6, and 4 candidate features from IBEX, Pyradiomics, and CIFE, respectively. This result is consistent with a previous report that there is a large amount of redundancy within feature extractors (68). Notably, the selected features mostly differed among extractors, but there were some similarities. Intensity minimum and skewness features were chosen from Pyradiomics and CIFE, although the implementation of the 2 is not exactly the same. There was some correlation between features from different extractors. This may suggest that similar biological characteristics are described.
Our results match those of existing literature on EGFR radiogenomic classification. Zhang et al. and Li et al. have found skewness to be predictive of EGFR mutation status and subtypes (69–70). Mei et al. also used the Pyradiomics feature extractor and similarly found that Size Zone NonUniformity Normalized was a predictor for EGFR mutation status (71).
We next used these selected nonredundant and informative candidate features to build multivariate prediction models using 4 commonly used machine-learning classification algorithms: k-nearest neighbors, support vector machine, random forest, and bagging. The best models created from IBEX, Pyradiomics, and CIFE features achieved similar training performance with cross-validated AUCs of 0.68, 0.67, and 0.69, respectively.
However, in validation, the performances from IBEX, Pyradiomics, and CIFE were AUCs of 0.54, 0.555, and 0.638, respectively. The validation performances were significantly decreased from the cross-validated training performance for models created from all 3 feature extractors. A pairwise comparison showed that CIFE had a significantly different validation performance than both IBEX and Pyradiomics, whereas the performance between IBEX and Pyradiomics was not significantly different. Our data were split into training and validation cohorts using a single data set for the training cohort and a mix of 3 data sets for the validation cohort. Therefore, the validation cohort is naturally more heterogeneous in terms of imaging parameters than the training cohort, as its cases come from 7 different institutions. Furthermore, the CT imaging parameters are more lung cancer–specific in the training data than those in the validation data. This splitting strategy was chosen to test how well models trained on relatively homogeneous data generalize to more heterogeneous data. The greater heterogeneity of the validation cohort may explain the decrease in performance of random forest (RF) models from all groups from the training cohort to the validation cohort. In addition, the trend toward a difference in proportion of EGFR wildtype and mutant cases between the training and validation cohorts may have also affected the performance. Although a decrease in performance from training to validation is commonly seen in machine-learning experiments (72), it is interesting that the performance of RF models built from IBEX and Pyradiomics features decreased more than the performance of the RF model built from CIFE features.
Our study has several limitations. For the open-source feature extractors, features were extracted as suggested by online documentation or by using the default settings of the features' parameters. Other researchers may find different results if they, for example, use different image preprocessing parameters. In addition, although we found 3 data sets with the information for our case example, our sample size is still limited. The validation cohort consists of 3 different data sets, which may have affected the performance of our model. Although it would be interesting to see the individual performances of each subgroup within the validation cohort, this analysis was not feasible owing to the limited number of cases and imbalances of mutant and wildtype cases. We did not consider the effect of imaging heterogeneity and segmentation on our results because the purpose of our study was to compare extractors rather than assess the potential effects of segmentation. Although we had a mix of provided segmentations and our own in-house-generated contours, we used the same images and tumor segmentations for all extractors. In addition, although the definitions of the implemented features are available, the meaning of some features is difficult to fully explain.
It is important to note that the purpose of the study was not to compare these feature extractors in terms of their capabilities of building prediction models, but to show that differences can exist when applying different feature extractors to the same clinical application.
Future work may include optimization of machine-learning models, larger data sets, and other clinical applications. Generating a combined model from features of all 3 extractors may also potentially increase performance. In addition, although the CIFE feature extractor has been used in several published studies, it has yet to be released to the public.
Overall, our experience with public data sets and open-source feature extraction software has been quite smooth. The majority of cases fulfilled the inclusion criteria for our experiment and were easily accessible and ready for use. We could extract features for the majority of cases with all software packages, and the packages had clear documentation to facilitate use by a beginner. Further details regarding our experience are included in the online supplemental Section S3.
Different radiomics features were selected from different feature extractors to predict EGFR mutation status in patients with NSCLC, which resulted in varying prediction performance. Correlation between features from different extractors may indicate similar biological characteristics are measured. However, attention should be paid to the generalizability of both individual radiomics features and radiomics prediction models. In the future, radiomics feature extraction techniques will undoubtedly improve and may further standardize, but for now researchers may find it useful to use multiple packages for their clinical applications.