Lung cancer remains the most common cause of cancer death worldwide, and the 5-year survival rates of non–small cell lung cancer (NSCLC) remain quite poor despite advances in diagnosis and treatment (1, 2). Further, many patients will develop recurrence or progression following primary treatment. The absolute risk of any recurrence at 5 years post-treatment ranges from 33% to 52%, with the majority occurring at a distant site (3, 4). Among prognostic factors in NSCLC, tumor stage based on the American Joint Committee on Cancer (AJCC) staging system is currently considered the best predictor of outcomes (5). More accurate clinical, imaging, and molecular biomarkers would be valuable for stratifying patients who are at a higher risk of recurrence and who might benefit from adjuvant or more aggressive treatment options (6).
Maximum standardized uptake value (SUVmax) on fluorine-18 fluoro-2-deoxy-D-glucose (FDG) positron emission tomography (PET) imaging has also been shown to predict recurrence or death in NSCLC (7). However, this is a single-voxel metric; we hypothesized that applying a radiomics approach to extract more complex information (eg, texture) from standard medical images could provide additional prognostic information (8, 9).
While recent work has evaluated the potential for radiomics features to augment traditional metrics of response (10–12), the majority of studies to date have focused on only the metabolic tumor volume (MTV) on PET and, to the best of our knowledge, no study has investigated the peritumoral region. Tumor invasion from the main mass can be defined by infiltration of stroma, blood vessels, or visceral pleura (13). Recent studies have also shown the potential for tumor cells to spread into air spaces in the lung tissue adjacent to the tumor volume (14). It is well known that these features may present as border spiculation, vascular convergence, or pleural attachment surrounding the tumor on anatomical imaging, and that they may result in subtle heterogeneous uptake on PET imaging (15).
We investigated the potential of FDG-PET radiomics to predict recurrence in NSCLC by (1) assessing the variability in radiomic feature extraction from PET images and (2) building and validating a radiomics model to predict time to recurrence. We hypothesize that computational imaging features in the tumor and surrounding area on FDG-PET can augment clinical features to improve recurrence prediction.
We retrospectively analyzed a total of 291 patients with NSCLC from 2 distinct cohorts of prospectively enrolled patients (n = 145 and n = 146). The study was approved by our Institutional Review Board, and all subjects signed informed consent before participation. Our study was also compliant with the Health Insurance Portability and Accountability Act.
The training cohort consisted of patients with early-stage NSCLC referred for surgical treatment at 2 local medical centers between 2008 and 2012, with PET/computed tomography (CT) performed before surgery (n = 145). This data set is publicly available on The Cancer Imaging Archive (16, 17). We used a second cohort (n = 146), drawn from 3 local medical centers between 2010 and 2016, for model validation. Subjects were selected from patients undergoing evaluation for lung cancer by PET/CT imaging before definitive treatment as part of an observational biomarker study. No patients in either the training or the validation cohort received neoadjuvant therapy.
The AJCC seventh edition system was used for staging. Pathological staging was used in the training cohort and a combination of clinical and pathological staging in the validation cohort. Demographic differences between the training and validation cohorts were assessed using the Wilcoxon rank-sum test for continuous variables and the χ2 test for categorical variables. All patients were followed per standard clinical protocol with clinical examination and imaging. We analyzed the combined endpoint of disease recurrence or progression. For stage I–IIIA subjects, we defined recurrence as either local, regional, or distant. For patients with stage IIIB–IV disease, we defined an event as any progression of disease. Time to event or last known follow-up was recorded from the date of pretreatment PET imaging.
Pretreatment FDG-PET/CT scans were acquired using a standard clinical protocol at 1 of 3 local medical centers. Images were acquired using either a GE Discovery VCT (GE Healthcare, Waukesha, WI), a GE Discovery LS PET/CT (GE Healthcare, Waukesha, WI), a Siemens Biograph mCT (Siemens Healthcare, Erlangen, Germany), or a Philips Allegro/Gemini TF PET/CT (Philips Healthcare, Cleveland, OH). Patients fasted for a minimum of 6–8 h before scanning. A dose of 12–17 mCi of FDG was administered, and patients were scanned from the skull base to mid-thigh ∼45–60 minutes after injection, using bed positions of 2–5 minutes each. Manufacturer-specific CT-based attenuation correction was performed, and images were reconstructed using ordered subset expectation maximization.
Region of Interest Delineations
Pretreatment PET images were converted to SUV units normalized by body weight. Two research assistants (S.M. and S.B.) were trained by a board-certified physician in Nuclear Medicine (G.D.) in using MIM Version 6.6 (MIM Software Inc., Cleveland, OH) to contour tumor MTVs using the semiautomatic PET-edge gradient-based segmentation tool. Both observers contoured all images independently in the training cohort. A subset of 21 images considered difficult to contour were reviewed by the same physician and re-delineated if necessary. To assess intraobserver variability, observer 1 (S.M.) contoured all images a second time after a delay of 3 months. We calculated the Dice similarity coefficient (DSC), mean absolute distance (MAD) of the boundary, and absolute volume difference between each set of contours to assess inter- and intraobserver variability of the MTV regions in the training cohort. Observer 1 alone contoured all images in the validation cohort.
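As an illustration of two of the agreement metrics named above, the following minimal sketch computes the DSC and the absolute volume difference between two contours. It assumes each contour is represented as a set of voxel indices; the function names, the toy contours, and the 0.064 mL voxel volume are hypothetical and are not taken from the study. The boundary-distance metric (MAD) is not covered here.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two sets of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty contours agree trivially
    return 2.0 * len(a & b) / (len(a) + len(b))


def absolute_volume_difference_ml(mask_a, mask_b, voxel_volume_ml):
    """Absolute difference in contoured volume, in mL."""
    return abs(len(set(mask_a)) - len(set(mask_b))) * voxel_volume_ml


# Hypothetical contours: a 3x3x3 cube, and the same cube shifted by one voxel in x.
observer_1 = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
observer_2 = {(x + 1, y, z) for (x, y, z) in observer_1}

dsc = dice_coefficient(observer_1, observer_2)
vol_diff = absolute_volume_difference_ml(observer_1, observer_2, voxel_volume_ml=0.064)
```

The shifted cube illustrates why both metrics are reported: the two contours have identical volume (difference of 0 mL) yet overlap only partially, which the DSC captures.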
We then generated a 3-dimensional penumbra region extending outward 1 cm from the surface of the MTV to sample surrounding uptake by using a 3D distance transform with a threshold of 1 cm. This distance was intuitively chosen to sample enough surrounding tissue given the voxel sizes of the PET images, while avoiding oversampling normal tissue. In addition to the MTV alone, we also evaluated the following 2 additional regions: the MTV plus penumbra and the penumbra only (excluding the MTV).
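The penumbra construction can be sketched as follows. This is a brute-force stand-in for thresholding a 3D Euclidean distance transform, not the implementation used in the study: it keeps every voxel outside the MTV whose centre lies within the margin of some MTV voxel centre. The function name, the set-of-indices representation, and the 5 mm isotropic spacing in the example are assumptions made for illustration.

```python
import math
from itertools import product


def penumbra_shell(mtv_voxels, spacing_mm, margin_mm=10.0):
    """Voxels outside the MTV whose centre is within margin_mm of an MTV voxel centre."""
    mtv = set(mtv_voxels)
    # How many voxels to search outward along each axis.
    reach = [int(math.ceil(margin_mm / s)) for s in spacing_mm]
    # Index offsets whose physical length does not exceed the margin.
    offsets = [
        off
        for off in product(*(range(-r, r + 1) for r in reach))
        if sum((o * s) ** 2 for o, s in zip(off, spacing_mm)) <= margin_mm ** 2
    ]
    shell = set()
    for voxel in mtv:
        for off in offsets:
            candidate = tuple(v + o for v, o in zip(voxel, off))
            if candidate not in mtv:
                shell.add(candidate)
    return shell


# Hypothetical single-voxel MTV on a 5 mm isotropic grid.
mtv = {(0, 0, 0)}
penumbra = penumbra_shell(mtv, spacing_mm=(5.0, 5.0, 5.0))
mtv_plus_penumbra = mtv | penumbra
```

The three regions analyzed in the text correspond to `mtv`, `mtv_plus_penumbra`, and `penumbra` in this sketch.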
We extracted radiomics features in the MTV, penumbra, and MTV plus penumbra regions in both cohorts using the Quantitative Image Feature Engine (18) implemented in MATLAB R2016b (The MathWorks, Natick, MA). In the MTV, features included size (n = 4), sphericity (n = 1), local volume-invariant integral (LVII) shape (n = 39), histogram intensity (n = 12), and gray-level co-occurrence matrix (GLCM) texture (n = 144) (19, 20), for a total of 200 features. Because the penumbra region was generated from the MTV, the 44 size and shape measures were not calculated in the penumbra and MTV plus penumbra regions (because they would not be independent measurements), for a total of 156 features in each. This resulted in a total of 512 features for analysis, as summarized in Table 1. We set a fixed intensity bin size of 0.2 SUV for texture feature calculation to allow a meaningful comparison between images on the same SUV scale. This discretization may also reduce the differences between the multiple scanners used in this study (21).
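To make the fixed-bin discretization and the GLCM maximum probability feature concrete, here is a simplified sketch. It works on a 2D patch with a single voxel offset, whereas the feature engine used in the study operates on 3D regions and produces 144 GLCM features; the function names and toy SUV values are hypothetical.

```python
import math
from collections import Counter


def discretize(suv_image, bin_width=0.2):
    """Map SUV values to integer gray levels using a fixed bin width."""
    return [[int(math.floor(v / bin_width)) for v in row] for row in suv_image]


def glcm_maximum_probability(levels, offset=(0, 1)):
    """Largest entry of the normalized, symmetric co-occurrence matrix for one offset."""
    dr, dc = offset
    rows, cols = len(levels), len(levels[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = levels[r][c], levels[r2][c2]
                counts[(i, j)] += 1
                counts[(j, i)] += 1  # count both orderings so the matrix is symmetric
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no voxel pairs available for this offset")
    return max(counts.values()) / total


# Hypothetical patches: uniform uptake versus steadily varying uptake.
homogeneous = [[1.05, 1.05, 1.05], [1.05, 1.05, 1.05]]
heterogeneous = [[1.05, 1.30, 1.50], [1.70, 1.90, 2.10]]

max_prob_homogeneous = glcm_maximum_probability(discretize(homogeneous))
max_prob_heterogeneous = glcm_maximum_probability(discretize(heterogeneous))
```

In this toy case the uniform patch concentrates all co-occurrences in one gray-level pair, while the varying patch spreads them evenly, so a lower maximum probability corresponds to more heterogeneous uptake.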
We then calculated intraclass correlation coefficients (ICCs) across the 3 sets of outlines for each radiomic feature to assess inter- and intraobserver variability. Robust features, defined as those with ICCs >0.8 in the training cohort, were selected for further analysis (22, 23).
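The robustness filter can be sketched as below. The text does not state which form of the ICC was used, so this example assumes ICC(2,1) (two-way random effects, absolute agreement, single measurement); the feature names and values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per tumor and one column per contour set."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical feature values: 4 tumors x 3 sets of outlines.
features = {
    "feature_stable": [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]],
    "feature_unstable": [[1, 4, 2], [2, 1, 4], [3, 3, 1], [4, 2, 3]],
}

ICC_THRESHOLD = 0.8  # threshold stated in the text
robust = [name for name, values in features.items() if icc_2_1(values) > ICC_THRESHOLD]
```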
Model Building and Validation
All radiomic features were normalized (Z-score transformation) before feature selection and model building. We further optimized the features through a generalized linear model via the least absolute shrinkage and selection operator (LASSO) (24) Cox regression using the glmnet package in R software version 3.4.3 (25). LASSO is a shrinkage and variable selection method for high-dimensional data, which was used to select top features to predict time to recurrence in the training cohort. The robust radiomic features and the 2 known clinical predictors (stage and SUVmax) were provided to LASSO. Alpha, the elastic-net mixing parameter, was set to 1 (LASSO penalty) to minimize the number of selected features by shrinking most of the coefficients to zero and to minimize potential overfitting in the training cohort. In total, 100 randomizations of 4-fold cross-validation were used to reduce the effect of randomness in fold selection. The cross-validated error curves were averaged for each value of the tuning parameter lambda across all randomizations. The lambda and corresponding radiomic features associated with the minimum error were selected.
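Two of the steps above, the Z-score transformation and the choice of lambda from averaged cross-validation curves, can be sketched without fitting a model. This is not the glmnet workflow itself: the LASSO Cox fit is omitted, the error curves are hypothetical numbers, and the use of the sample standard deviation is an assumption, since the text does not specify it.

```python
import statistics


def z_score(values):
    """Standardize a feature to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def select_lambda(lambdas, error_curves):
    """Average the CV error per lambda across randomizations; return the minimizer."""
    mean_errors = [
        statistics.fmean([curve[i] for curve in error_curves])
        for i in range(len(lambdas))
    ]
    best = min(range(len(lambdas)), key=mean_errors.__getitem__)
    return lambdas[best], mean_errors[best]


# Hypothetical error curves from 3 randomizations of the fold assignment.
lambdas = [0.05, 0.10, 0.20]
error_curves = [
    [0.30, 0.20, 0.25],
    [0.28, 0.22, 0.27],
    [0.32, 0.18, 0.26],
]

best_lambda, best_error = select_lambda(lambdas, error_curves)
standardized = z_score([1.0, 2.0, 3.0])
```

Averaging before taking the minimum means a single unlucky fold assignment cannot determine which lambda is chosen.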
We built univariate and multivariate Cox proportional hazards models in the training cohort using the most frequently selected radiomic and/or clinical features. We evaluated the Akaike information criterion (AIC) to compare the quality of the different models, with lower AICs representing a higher quality model. We assessed the likelihood ratio P-value for the derived models to show recurrence prediction significance. HRs and 95% CIs were reported for individual variables. To evaluate nested models combining the clinical and/or radiomic features, the likelihood ratio test was used to compare the goodness of fit.
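The model comparison described above can be illustrated with the standard formulas for the AIC and for a likelihood ratio test between nested models that differ by one covariate. The log partial likelihood values below are hypothetical and are not results from the study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio statistic and P-value for nested models differing by 1 parameter."""
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log partial likelihoods: stage alone vs stage plus one texture feature.
loglik_stage = -180.0
loglik_stage_texture = -176.2

aic_stage = aic(loglik_stage, n_params=1)
aic_stage_texture = aic(loglik_stage_texture, n_params=2)
lr_stat, lr_p = likelihood_ratio_test_1df(loglik_stage, loglik_stage_texture)
```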
To verify prediction validity, we locked the coefficients of the variables in the top model generated from the training cohort and evaluated it in the validation cohort. The prognostic value was assessed using the concordance index, with Noether's test used to determine whether it differed significantly from chance (0.5). We performed Kaplan–Meier analysis after separating patients into high- and low-risk groups based on the median risk score in the training cohort. We performed a Student's t test for dependent samples to compare concordance indices between the models. All statistical analyses and model building were performed using R. Statistical significance was assessed at the P < .05 level.
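A minimal sketch of the concordance index and the median risk split follows. It assumes Harrell's definition of concordance with risk ties counted as one half; the text does not specify the variant, Noether's test is not implemented here, and all times, events, and risk scores are hypothetical.

```python
import statistics


def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an event before subject j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up (months), event indicators (1 = recurrence), and risk scores.
times = [5, 12, 20, 30, 40]
events = [1, 1, 0, 1, 0]
risk_scores = [2.0, 0.3, 0.4, 1.0, 0.2]

c_index = concordance_index(times, events, risk_scores)

# The threshold is fixed on the training cohort and reused unchanged elsewhere.
median_risk = statistics.median(risk_scores)
high_risk = [score > median_risk for score in risk_scores]
```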
The training and validation cohorts were similarly matched with regard to median age (P = .057) and tumor location (P = .571) (Table 2). The training cohort had a higher proportion of males (P = .005) and adenocarcinoma histology (P = .035). There was a slightly higher proportion of stage IV patients in the validation cohort (P < .001), resulting in a larger percentage of patients who recurred/progressed (P = .038). The median time to recurrence was 14 months (range, 2–97) in the training cohort and 15 months (range, 1–59) in the validation cohort. The median follow-up time for censored patients without an event was 50 months (range, 1–115) in the training cohort and 32 months (range, 1–76) in the validation cohort.
| | | Training (n=145) | Validation (n=146) | P-value |
|---|---|---|---|---|
| Age, years | | 69 (42–87) | 71 (41–96) | .057 |
| Gender | Male | 109 (75%) | 87 (60%) | .005 |
| Tumor Location | Right upper lobe | 52 (36%) | 50 (34%) | .571 |
| | Right middle lobe | 14 (10%) | 9 (6%) | |
| | Right lower lobe | 21 (14%) | 26 (18%) | |
| | Left upper lobe | 38 (26%) | 34 (23%) | |
| | Left lower lobe | 20 (14%) | 27 (19%) | |
| Tumor Histology | Adenocarcinoma | 113 (78%) | 103 (71%) | .035 |
| | Squamous cell | 29 (20%) | 30 (21%) | |
| | Non–small cell cancer not otherwise specified | 3 (2%) | 13 (9%) | |
| Tumor Stage | 0a | 4 (3%) | 0 (0%) | <.001 |
| | I | 89 (61%) | 100 (68%) | |
| | II | 28 (19%) | 13 (9%) | |
| | III | 21 (14%) | 17 (12%) | |
| | IV | 3 (2%) | 16 (11%) | |
| Recurrence/Progression | Yes | 40 (28%) | 57 (39%) | .038 |
| | No | 105 (72%) | 89 (61%) | |
Table 3 shows the DSC, MAD, and absolute volume difference between observers in the training cohort. Overall, semiautomatic segmentations were highly reproducible, with an average DSC >0.9, MAD <1 mm, and volume difference <1 mL. When we inspected images with low DSC, high MAD, and/or high volume differences, we found that the lesions with the greatest variability tended to have low uptake (eg, SUVmax <2), heterogeneous uptake, and/or a location adjacent to structures with metabolic uptake similar to that of the tumor (eg, the heart or mediastinum), making the precise boundary of the tumor difficult to determine. These features were evident in ∼20% of the cases.
| Observer^a | Dice Similarity Coefficient (DSC) | Mean Absolute Boundary Distance (MAD, mm) | Absolute Volume Difference (mL)^b |
|---|---|---|---|
| A vs a (Intra) | 0.916 (0.090) | 0.548 (0.544) | 0.71 (1.66) |
| A vs B (Inter) | 0.917 (0.087) | 0.559 (0.507) | 0.58 (0.92) |
| a vs B (Inter) | 0.904 (0.105) | 0.628 (0.631) | 0.79 (1.46) |

^a Observer 1 contoured each tumor twice (A and a) and observer 2 contoured each lesion once (B).
Table 4 shows the ICCs of the 4 different classes of radiomic features in each of the 3 regions of interest. We found that a total of 435 of the 512 features (85%) had an ICC >0.8 (Table 5) and were considered robust to differences in the segmentations (22, 23).
Feature Selection and Model Training
Across the 100 randomizations, the average minimum cross-validation error was 10.5% at a lambda value of 0.1296 in the training cohort. This lambda yielded 2 features with nonzero coefficients: stage and 1 MTV plus penumbra GLCM texture feature (maximum probability). Although SUVmax has previously been shown to be associated with recurrence in NSCLC, it was not selected by LASSO as a top feature. However, it was found to be a significant univariate predictor in our cohort (Table 6), consistent with previous studies (7).
Figure 1 visualizes the Pearson correlation coefficients of the top features. For reference, correlation of the top features with MTV volume and SUVmax is also shown. All correlations were low and the radiomic feature showed no correlation with stage, volume, or SUVmax.
Univariate Cox regression model statistics, including the AIC, likelihood ratios, P-values, and HRs, are shown for the top features in Table 6. Both features were significant univariate predictors of time to recurrence. Overall, stage was the best univariate predictor.
Because stage was the best univariate predictor, the likelihood ratio test was performed to assess significant improvements to this well-established clinical model for recurrence prediction. Additional features were added to determine significant improvements to the model. Adding the MTV plus penumbra texture feature to stage significantly improved the model (P = .006). This multivariate model was a significant predictor of time to recurrence in the training cohort (likelihood ratio = 27.59, P < .001, concordance = 0.74 [95% CI: 0.66–0.81]). Both stage (HR = 1.92 [95% CI: 1.37–2.67], P < .001) and the radiomic texture feature (HR = 0.52 [95% CI: 0.30–0.91], P = .02) were significant covariates in the multivariate model. Adding SUVmax to stage did not significantly improve the clinical model performance (P = .22). It also did not significantly improve performance in the combined stage and radiomic model (P = .73).
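To show how the reported hazard ratios translate into a risk score, the sketch below recovers Cox coefficients as the natural logarithm of each HR and combines them into a linear predictor. It assumes that stage enters the model as a single numeric covariate and that the texture feature is Z-scored; the exact coding used in the study is not stated, and the patient values are hypothetical.

```python
import math

# Coefficients recovered from the reported multivariate hazard ratios.
BETA_STAGE = math.log(1.92)
BETA_TEXTURE = math.log(0.52)


def linear_predictor(stage_value, texture_z):
    """Cox linear predictor (log relative hazard) for one patient."""
    return BETA_STAGE * stage_value + BETA_TEXTURE * texture_z


def relative_hazard(patient, reference):
    """Hazard of `patient` relative to `reference`; each is (stage_value, texture_z)."""
    return math.exp(linear_predictor(*patient) - linear_predictor(*reference))


# Hypothetical comparisons: one unit higher stage, and one SD higher texture value.
hr_one_stage_up = relative_hazard((2, 0.0), (1, 0.0))
hr_one_sd_texture = relative_hazard((1, 1.0), (1, 0.0))
```

Because the texture HR is below 1, a higher maximum probability lowers the predicted hazard, which matches the interpretation that more heterogeneous uptake carries higher risk.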
Univariate results were confirmed in the validation cohort (Table 7), with all features being significant predictors of time to recurrence. The locked multivariate model from the training cohort, which included stage and the radiomic texture feature, was a significant predictor in the validation cohort (concordance = 0.74 [95% CI: 0.67–0.81], Noether's P < .001). We separated the patients into high- and low-risk groups on the basis of the median risk score in the training cohort. Kaplan–Meier time-to-recurrence curves for the multivariate model in both cohorts are shown in Figure 2. Recurrence was lower in the group below the median model risk score.
The multivariate model including stage and the radiomic feature significantly outperformed the best performing clinical model of stage in the training (P = .036) and validation (P = .033) cohorts. The combined model also outperformed the radiomic feature alone in both the training cohort (P = .019) and the validation cohort (P < .001).
Figure 3 exemplifies 2 patients with similar SUVmax that would typically be considered to be at a high risk of recurrence. Yet, the combined model including radiomics correctly predicted the recurrence status of each patient on the basis of the median risk value. Based on qualitative inspection, the high-risk patient had more heterogeneous uptake in the penumbra region compared with the low-risk patient.
We show here evidence that texture in the MTV and nearby surrounding region can predict recurrence in NSCLC. Furthermore, augmenting this radiomic feature with stage significantly improved performance over stage alone, which was validated in an independent data set. This model also showed potential value in risk-stratifying patients with NSCLC who are at high versus low risk of recurrence or progression. A general rule in modeling studies is that 10 patients are needed for every feature selected in the model (8). To minimize overfitting, our final model consisted of only 2 features. However, further studies on larger sample sizes with additional features may improve prognostic performance and applicability to other cohorts.
The radiomic feature selected was a GLCM texture feature in the combined MTV plus penumbra volume. This feature, which describes local texture variations, suggests that patients whose PET images show a more heterogeneous texture, specifically in the penumbra region surrounding the MTV, are more likely to recur. This suggests the importance of image data in the surrounding region for recurrence prediction. This region may contain uptake not measured in the MTV (and not by the SUVmax) and could indicate areas of disease adjacent to the primary mass. The texture being detected in this region may be indicative of an invasive component of the tumor, for example, spiculations or tumor spread through blood vessels, but this requires further investigation (15).
Notably, size or shape features, including the commonly used metrics of maximum axial diameter and 3D volume, were not selected as predictive features. SUVmax was also not selected, and adding it to clinical or combined models did not significantly improve performance. This suggests that texture features may provide more useful information than traditional metrics for predicting recurrence/progression.
Previous work in the field of radiomics has evaluated FDG-PET features for outcome prediction in lung cancer. Jansen et al. found the GLCM energy texture feature was a significant predictor of overall survival in oligometastatic NSCLC (26). Others have shown that texture features may be beneficial for predicting local control, distant metastasis, and disease-free survival in lung cancer (10–12). However, the majority of studies to date have focused on only the MTV. To the best of our knowledge, ours is the first study that evaluates the lung tumor penumbral region of PET images for recurrence prediction. Future work integrating CT imaging features or molecular data may improve prognostic performance.
Our study investigated PET/CT images from multiple scanners and institutions, potentially introducing variability in image data and quality and therefore the construction of a predictive model. We used a standard acquisition protocol across all institutions to minimize this variability (27, 28). This may still result in signal variations in the tumor and penumbra regions; therefore, further studies investigating single scanners are warranted and may improve model performance.
Previous work has also shown that PET radiomic features depend more on delineation variability than on the reconstruction algorithm (29) and that texture features are less affected by differences between scanners (30). Many radiomic features also show high test–retest stability with repeat PET imaging (31). The PET-edge segmentation tool we used for tumor segmentation showed high reproducibility with associated radiomic feature robustness. Segmentations were performed with commercially available software (MIM Software, Inc.), making it an easily deployed and integrated system.
Our work is also applicable in a “real world,” nonresearch setting, where different scanners and images of variable quality are routinely used for clinical assessment. However, additional external validation of this radiomics model is warranted to determine the impact of different scanners and acquisition protocols on model predictions.
Our study has several limitations. The primary limitation is that the penumbra region was not restricted to the lung volume; that is, it may at times have included the adjacent chest wall, major blood vessels, and/or mediastinum. However, because a feature from this region was selected, the region appears to provide relevant information for the prediction of recurrence. The effect of this and the efforts to minimize it remain the subject of further investigation. Owing to differences in breathing between the PET and CT images, accurate registration of the lung boundary is challenging. We also investigated only a single distance of 1 cm for the penumbra region; it is possible that larger or smaller distances could improve or degrade performance. Another limitation is the inherently low resolution of the PET images, which limits the amount of information we can analyze for each tumor because smaller tumors contain fewer voxels. Finally, the sample sizes analyzed were relatively small, and validation of this model in larger data sets is warranted.
In conclusion, a PET texture feature in the metabolic tumor volume and surrounding region augmented staging for NSCLC recurrence prediction. This model may be useful in identifying patients who are at a higher risk of recurrence or progression and may assist physicians in determining which patients may benefit from adjuvant or personalized treatment options at the time of diagnosis.