Breast cancer, the most common type of cancer among women, is a heterogeneous disease comprising subtypes with different biology, prognosis, and treatment outcome. Breast cancer can be classified into subtypes based on the hormone receptor (HR) status, including both estrogen and progesterone receptors, and human epidermal growth factor receptor 2 (HER2) expression to inform treatment decisions (1, 2). These breast cancer subtype classifications also have implications for disease-free survival and relapse (3). Further understanding of subtype-specific response and effective monitoring by imaging may provide means for early therapeutic intervention, leading to better outcomes (4).
Magnetic resonance imaging (MRI) is one of the most accurate imaging tools used to monitor and predict treatment response for patients undergoing chemotherapy (5–14). However, predictive performance varies among the quantitative measurements derived from MRI and with the parameters that define those measurements. Previous studies have found that the tumor volume measured using MRI for patients undergoing preoperative chemotherapy is strongly associated with recurrence-free survival (13, 15, 16), and that the association is influenced by the threshold settings of 2 contrast enhancement parameters (17). Another recent study has demonstrated that this influence varies among HR/HER2-defined breast cancer subtypes (18).
A standardized MRI-derived volume calculation procedure was used in the I-SPY 1 TRIAL (Investigation of Serial Studies to Predict Your Therapeutic Response with Imaging And moLecular Analysis) imaging sub-study: American College of Radiology Imaging Network (ACRIN) 6657. This procedure used empirically determined, site-specific analysis parameters, specifically an early time-point percent enhancement threshold (PEt) and a signal enhancement ratio threshold (SERt) for calculation of a functional tumor volume (FTV) for patients undergoing neoadjuvant (preoperative) chemotherapy (NACT) for breast cancer. FTV was shown to be predictive of both treatment response, as measured by pathological complete response (pCR) (15), and of recurrence-free survival (16) in the study population.
In the current study, we explored how the pCR prediction performance of FTV varies over a wide range of PEt and SERt, for different serial time-point MRI scans during the NACT course, and for different patient cohorts determined by HR and HER2 status. We show that the performance of MRI in predicting treatment response varies with the contrast enhancement thresholds, and that pCR prediction may be improved through subtype-specific contrast enhancement thresholds.
In total, 237 women with breast tumors measuring ≥3 cm, as evaluated by either clinical examination or imaging, were enrolled between 2002 and 2006 at 9 institutions in the USA. All patients provided written consent. As shown in Figure 1, 4 MRI examinations were conducted for each patient at the following time-points: before starting anthracycline–cyclophosphamide (AC) chemotherapy (MRI1); at least 2 weeks after the first cycle and before the second AC cycle (MRI2); between regimens if taxane was given (MRI3); and following the completion of chemotherapy but before surgery (MRI4). A subset of 116 patients who had image data from all 4 MRIs, pathological outcomes, and HR/HER2 status was analyzed for this retrospective study. The detailed design and previous findings of I-SPY 1 TRIAL/ACRIN 6657 have been published (15, 16, 19, 20).
Determination of Breast Cancer Subtype
HR status and HER2 receptor expression were determined by pretreatment core biopsy, using immunohistochemistry (IHC) and Allred score at study sites. The HER2 status was determined by IHC and/or fluorescence in situ hybridization assays. Unlike HR testing, HER2 testing (IHC and fluorescence in situ hybridization assays) was performed both locally at study sites and centrally at the University of North Carolina (19). Estrogen or progesterone receptor was considered positive if the Allred score was ≥3, that is, ≥3% cells stained positive. HER2 was considered positive if it tested positive at either a local or a central laboratory. The following 3 subtype groups were defined: HR+HER2−; HER2+ (HR either positive or negative); and triple-negative breast cancer (TNBC, ie, HR−HER2−) tumors.
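The grouping rule above can be written as a short Python sketch. The function and argument names are hypothetical and are not taken from the trial software. The subtype labels use ASCII hyphens in place of the minus signs used in the text.

```python
def hr_positive(er_allred: int, pr_allred: int) -> bool:
    """HR is positive if either receptor has an Allred score of 3 or more."""
    return er_allred >= 3 or pr_allred >= 3


def her2_positive(local_positive: bool, central_positive: bool) -> bool:
    """HER2 is positive if either the local or the central laboratory reports positive."""
    return local_positive or central_positive


def subtype(hr_pos: bool, her2_pos: bool) -> str:
    """Map receptor status to one of the 3 subtype groups used in this study."""
    if her2_pos:
        return "HER2+"  # HR may be positive or negative in this group
    return "HR+/HER2-" if hr_pos else "TNBC"


# Hypothetical patient: ER Allred 5, PR Allred 0, HER2 negative at both laboratories.
print(subtype(hr_positive(5, 0), her2_positive(False, False)))
```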
Evaluation of Pathological Response
The pCR was considered as the surrogate end point of NACT and was defined as the absence of residual invasive disease in the breast and axillary lymph nodes at surgery (19). By this definition, patients were classified into 2 groups at the end of NACT as follows: pCR and non-pCR (residual invasive cancer). In I-SPY 1/ACRIN 6657, pCR was evaluated locally by each institution's pathologist immediately after surgery. In the event of a patient declining surgery, there was no pCR status for that patient.
Each patient had 4 MRIs (Figure 1) at their participating site using a 1.5 T scanner and dedicated 4- or 8-channel breast radiofrequency coil. Imaging was performed with the patient in the prone position with an intravenous catheter inserted in the antecubital vein or hand. The image acquisition protocol was prespecified, and it included a localization scan and T2-weighted sequences, followed by a contrast-enhanced T1-weighted series. For the contrast-enhanced T1-weighted series, high spatial resolution (in-plane spatial resolution, ≤1 mm), 3-dimensional fat-suppressed T1-weighted imaging of the symptomatic breast was performed using a gradient-echo sequence with the following parameters: repetition time = 4.5 milliseconds, flip angle ≤45°, field of view = 16–18 cm, minimum matrix = 256 × 192, sections = 64, and section thickness ≤2.5 mm.
All imaging tests were performed unilaterally over the symptomatic breast and in the sagittal orientation. Imaging time for the T1-weighted sequence was between 4.5 and 5 minutes, with one data set acquired before injection of a gadolinium-based contrast agent and repeated 2–4 times immediately after injection. Interimaging delays were added as needed to result in postcontrast administration temporal sampling between 2 minutes 15 seconds and 2 minutes 30 seconds for early-phase images and between 7 minutes 15 seconds and 7 minutes 45 seconds for delayed-phase images.
Functional Tumor Volume Measurement
Following each MRI examination, image data were transferred to the ACRIN Core Lab for central archival and subsequently to the University of California at San Francisco for image analysis. All images were analyzed using in-house software developed in the IDL programming environment (ITT Visual Information Solutions, Boulder, Colorado) (21). For each dynamic contrast-enhanced (DCE-) MRI acquisition, a region of interest (ROI) encompassing the primary tumor as determined by signal enhancement was manually defined by a trained research associate by placing rectangular boxes on orthogonal maximum intensity projection images created from the early postcontrast scan (Figure 2A–C). Background air regions and suppressed fat regions were masked out using an automatically determined intensity threshold applied to the precontrast image.
The FTV was then measured using the signal enhancement ratio method within the ROI (22). The volumes of image voxels within the ROI that met PEt and SERt were summed to compute FTV, constrained by a minimum number of connected voxels to eliminate isolated voxels. PE and SER were calculated at each voxel as follows: PE = 100% × (S1 − S0)/S0 and SER = (S1 − S0)/(S2 − S0), where S0, S1, and S2 were the signal intensities at the precontrast, early postcontrast, and late postcontrast time-points, respectively, collected during the DCE-MRI scan (23). A cutoff PEt was first applied, followed by a connectivity test, to create an enhanced tissue mask. SER was then calculated for all voxels in the mask (Figure 2D), and SERt was applied to determine which voxels to include in the FTV. In ACRIN 6657, PEt was nominally set at 70% and adjusted empirically for each site to qualitatively reflect the extent of tumor and to account for unexpected variability in MRI systems and imaging parameters. SERt was set to zero across all participating sites in the primary aim analysis of the trial. All magnetic resonance images from a given site were processed using the same site-specific PEt. To study the effect of the PEt/SERt setting, we recalculated FTV by varying these 2 thresholds. PEt was changed from 30% to 200% in steps of 10% and SERt from 0 to 2 in steps of 0.2. FTV was recalculated at each MR examination as follows: baseline (FTV1), early treatment (FTV2), inter-regimen (FTV3), and before surgery (FTV4). Percent change of FTV was defined as the change in FTV relative to the baseline FTV1 value (ΔFTVn = 100% × (FTVn − FTV1)/FTV1, n = 2, 3, 4).
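The PE, SER, FTV, and ΔFTV definitions above can be illustrated with a minimal Python sketch. It is not the IDL software used in the study. The sketch makes several assumptions: the connectivity test is omitted, a voxel "meets" a threshold when its value is greater than or equal to it, and voxels with S0 ≤ 0 or S2 = S0 are skipped. All names and the example intensities are hypothetical.

```python
from typing import Iterable, List, Tuple

Voxel = Tuple[float, float, float]  # (S0, S1, S2) signal intensities

# Threshold grids described in the text.
PE_THRESHOLDS: List[int] = list(range(30, 201, 10))                    # 30%, 40%, ..., 200%
SER_THRESHOLDS: List[float] = [round(0.2 * i, 1) for i in range(11)]   # 0.0, 0.2, ..., 2.0


def percent_enhancement(s0: float, s1: float) -> float:
    """PE = 100% x (S1 - S0) / S0."""
    return 100.0 * (s1 - s0) / s0


def signal_enhancement_ratio(s0: float, s1: float, s2: float) -> float:
    """SER = (S1 - S0) / (S2 - S0)."""
    return (s1 - s0) / (s2 - s0)


def functional_tumor_volume(voxels: Iterable[Voxel], pe_t: float, ser_t: float,
                            voxel_volume: float = 1.0) -> float:
    """Sum the volume of voxels that pass PEt first and then SERt.

    Sketch-only assumptions: no connectivity test, ">=" comparisons,
    and voxels with S0 <= 0 or S2 == S0 are skipped to avoid division by zero.
    """
    count = 0
    for s0, s1, s2 in voxels:
        if s0 <= 0 or s2 == s0:
            continue
        if percent_enhancement(s0, s1) < pe_t:
            continue  # fails the enhancement mask
        if signal_enhancement_ratio(s0, s1, s2) >= ser_t:
            count += 1
    return count * voxel_volume


def percent_change_ftv(ftv_n: float, ftv_1: float) -> float:
    """Delta FTVn = 100% x (FTVn - FTV1) / FTV1."""
    return 100.0 * (ftv_n - ftv_1) / ftv_1


# Hypothetical ROI with 4 voxels and unit voxel volume.
roi: List[Voxel] = [(100.0, 180.0, 160.0), (100.0, 250.0, 200.0),
                    (100.0, 120.0, 150.0), (100.0, 190.0, 300.0)]
for pe_t, ser_t in [(70, 0.0), (70, 1.0), (120, 0.0)]:
    print(pe_t, ser_t, functional_tumor_volume(roi, pe_t, ser_t))
```

In this toy ROI, raising either threshold removes voxels from the volume, which is the mechanism by which the PEt/SERt setting changes FTV and therefore its association with pCR.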
FTV measurements were calculated for each pair of PEt/SERt values, and associations with pCR were evaluated using receiver operating characteristic (ROC) curve analysis. The area under the ROC curve (AUC) was estimated to provide a measure of predictor quality. In the statistical model, patients with pCR were considered as controls (negative outcome) and those with non-pCR were considered as cases (positive outcome). For each PEt/SERt pair, the AUC was estimated in the full cohort and separately in each specific breast cancer subtype. The AUCs were then mapped as a surface plot on the axes of PEt (range, 30%–200%) and SERt (range, 0–2) for each FTV measurement. Higher AUC indicates “stronger association” between the measurement and pCR status. The optimized PEt/SERt was selected as having the maximum AUC over the map of PEt/SERt combinations. The processes of calculating FTV for each specific PEt/SERt pair, estimating AUCs, and selecting optimized PEt/SERt based on AUC values were performed automatically after the ROI was defined.
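The threshold search described above can be sketched in Python as follows. The study used R with the pROC library; this sketch instead estimates the AUC with the equivalent pairwise (Mann-Whitney) formulation. It assumes that higher predictor values point toward non-pCR, which is a choice made for this sketch only. The function names and example values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[float, float]  # (PEt, SERt)


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a case (non-pCR) scores higher than a control (pCR); ties count 0.5."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def auc_map(predictor_by_pair: Dict[Pair, List[float]],
            is_non_pcr: Sequence[bool]) -> Dict[Pair, float]:
    """Estimate one AUC per PEt/SERt pair from per-patient predictor values."""
    result = {}
    for pair, values in predictor_by_pair.items():
        cases = [v for v, flag in zip(values, is_non_pcr) if flag]
        controls = [v for v, flag in zip(values, is_non_pcr) if not flag]
        result[pair] = auc(cases, controls)
    return result


def best_pair(aucs: Dict[Pair, float]) -> Tuple[Pair, float]:
    """Select the PEt/SERt pair with the maximum AUC (first one wins on ties)."""
    return max(aucs.items(), key=lambda item: item[1])


# Hypothetical percent-change values for 5 patients at 2 threshold pairs.
is_non_pcr = [True, True, True, False, False]
predictors = {
    (70, 0.0): [-20.0, -60.0, -10.0, -50.0, -90.0],
    (120, 0.0): [-20.0, -30.0, -10.0, -50.0, -90.0],
}
aucs = auc_map(predictors, is_non_pcr)
print(best_pair(aucs))
```

Because the maximum is taken over many threshold pairs on the same patients, the selected AUC is an optimistic estimate.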
Because of the small sample size, it was not feasible to perform cross validation; the AUCs and predictive accuracy estimates are therefore subject to overfitting. The optimal cutoff point was chosen as the point on the ROC curve closest to sensitivity = 100% and specificity = 100% (24). Data processing and optimization were performed in Matlab (R2012b 64bit for Mac, MathWorks Inc., Natick, Massachusetts), and all statistical analyses were conducted using the R statistical analysis software package and the pROC library (25, 26). Data are expressed as median with interquartile range. All tests were performed at the P < .05 level, and all results are provided with estimates, 95% confidence intervals (CIs), and P values if appropriate.
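The cutoff rule, choosing the ROC point closest to sensitivity = 100% and specificity = 100%, can be sketched in Python as follows. The sketch uses each observed value as a candidate cutoff and labels a patient as a case (non-pCR) when the predictor is at or above the cutoff. Both choices are assumptions of this sketch and may differ from how the pROC library enumerates thresholds. The example values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def closest_topleft_cutoff(case_scores: Sequence[float],
                           control_scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (cutoff, sensitivity, specificity) for the ROC point nearest (100%, 100%)."""
    candidates = sorted(set(case_scores) | set(control_scores))
    best = None
    for t in candidates:
        # Cases are called positive when score >= t; controls are correct when score < t.
        sens = sum(s >= t for s in case_scores) / len(case_scores)
        spec = sum(s < t for s in control_scores) / len(control_scores)
        dist = math.hypot(1.0 - sens, 1.0 - spec)
        if best is None or dist < best[0]:
            best = (dist, t, sens, spec)
    _, cutoff, sens, spec = best
    return cutoff, sens, spec


# Hypothetical percent-change values for 3 non-pCR cases and 3 pCR controls.
cutoff, sens, spec = closest_topleft_cutoff([-20.0, -60.0, -10.0], [-50.0, -90.0, -30.0])
print(cutoff, sens, spec)
```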
A cohort of 116 patients was analyzed. The status of HR and HER2 was available for primary tumors in 115 patients (99%). Characteristics of patients with and without pCR are described in Table 1.
|Characteristics||pCR (n = 34)||Non-pCR (n = 82)||P|
|Age, median (range)||47 (31−69)||49 (28−67)||0.3|
|Tumor size (cm), median (IQR)||5.0 (4.1−6.0)||5.5 (4.0−7.1)||0.25|
|Premenopausal||16 (13.8)||35 (30.2)|
|Postmenopausal||12 (10.3)||33 (28.4)|
|Undetermined||6 (5.2)||14 (12.1)|
|Mastectomy||16 (13.8)||43 (37.1)|
|Breast-conserving surgery||18 (15.5)||40 (34.5)|
|Invasive ductal carcinoma||30 (25.9)||66 (56.9)|
|Invasive lobular carcinoma||2 (1.7)||5 (4.3)|
|Mixed ductal and lobular carcinoma||0 (0.0)||1 (0.9)|
|Other||2 (1.7)||5 (4.3)|
|Negative||22 (19.0)||28 (24.1)|
|Positive||12 (10.3)||54 (46.6)|
|Negative||18 (15.5)||58 (50.0)|
|Positive||16 (13.8)||23 (19.8)|
|Missing||0 (0.0)||1 (0.9)|
|Axillary lymph node status at initial staging||0.01|
|Negative||24 (20.7)||27 (23.3)|
|Positive||19 (16.4)||55 (47.4)|
|NA||1 (0.9)||0 (0.0)|
|HR−/HER2− (TNBC)||11 (9.5)||19 (16.4)||0.01|
|HR+/HER2−||6 (5.2)||39 (33.6)|
|HER2+||16 (13.8)||23 (19.8)|
|Missing||1 (0.9)||0 (0.0)|
Effect of Varying PEt/SERt on Predicting pCR
Analyses of surgical samples revealed pCR in 34 patients (29%). The remaining 82 patients (71%) did not achieve pCR (non-pCR). Among 45 patients with HR+/HER2− breast cancer, only 6 (estimated percentage, 13%, with 95% CI of 5% to 27%) achieved pCR. Sixteen HER2+ patients out of 39 (estimated percentage: 41%, with 95% CI of 26% to 58%) achieved pCR and 11 out of 30 patients (estimated percentage: 37%, with 95% CI of 20% to 56%) achieved pCR in the TNBC subgroup.
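The pCR rates and their 95% CIs can be recomputed from the counts above. The article does not state which interval method was used. The sketch below assumes an exact binomial (Clopper-Pearson) interval, solved by bisection with only the Python standard library. The function names are hypothetical.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f: Callable[[float], float], iters: int = 100) -> float:
    """Root of a function that decreases from positive to negative on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion x / n."""
    if x == 0:
        lower = 0.0
    else:
        # Lower limit p solves P(X >= x | p) = alpha / 2.
        lower = _bisect_decreasing(lambda p: alpha / 2 - (1.0 - binom_cdf(x - 1, n, p)))
    if x == n:
        upper = 1.0
    else:
        # Upper limit p solves P(X <= x | p) = alpha / 2.
        upper = _bisect_decreasing(lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


# Counts reported in the text: pCR events / patients in each subtype.
for label, x, n in [("HR+/HER2-", 6, 45), ("HER2+", 16, 39), ("TNBC", 11, 30)]:
    lo, hi = clopper_pearson(x, n)
    print(f"{label}: {100 * x / n:.0f}% (95% CI {100 * lo:.0f}% to {100 * hi:.0f}%)")
```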
Figure 3 shows the highest AUCs observed for FTV measurements at different treatment time-points for the full cohort and by breast cancer subtype. In general, AUCs estimated within subtypes were higher than those in the full cohort, and the triple-negative subtype had the highest estimated AUCs. In addition, absolute FTVs and ΔFTVs at MRI2 and MRI3 showed higher AUCs than those measured at MRI1 and MRI4. The estimated AUC at ΔFTV3 in the HR+/HER2− subgroup was among the highest, with a narrow confidence interval. Although ΔFTV3 showed no significant difference relative to other FTV predictors in the full cohort and other subtypes, we focused our contrast threshold comparison between subgroups using ΔFTV3 as a predictor.
In the full cohort, among all PEt/SERt combinations, ΔFTV3 exhibited higher estimated AUCs (≥0.75) at 70% ≤ PEt ≤ 140% and in the lower range of SERt (0.0−1.0) (Figure 4A). Within specific subtypes, varying PEt/SERt had differing effects on the prediction of pCR using ΔFTV3. In the HR+/HER2− subgroup (Figure 5A), higher estimated AUCs occurred at higher PEt, ranging from 120% to 200%, across the entire range of SERt (0.0−2.0). In the HER2+ subgroup (Figure 6A), high AUCs occurred at PEt from 70% to 140% and at SERt from 1.0 to 2.0. In the TNBC subtype (Figure 7A), higher estimated AUCs occurred at a PEt range of 60% to 150% and across the entire range of SERt (0.0−2.0).
To demonstrate the improved discrimination of pCR versus non-pCR using optimized PE/SER thresholds, we examined ΔFTV3 in the full cohort and in breast cancer subtypes. Table 2 shows diagnostic performance for cutoff points selected from ROC curves (Figures 4–7B). In the full cohort, inconsistent effects on sensitivity and specificity were observed, whereas a consistent improvement was shown in subtypes. Table 3 shows ΔFTV3 values and differences between patients with pCR and those without pCR (non-pCR) (Figures 4–7C). P values in Table 3 were estimated by likelihood ratio test. Lower P values at optimized PEt/SERt in subtypes may indicate that ΔFTV3 calculated by optimized PEt/SERt has stronger predictive value for pCR than the default. Odds ratios were also estimated to be larger using optimized than default thresholds.
Figure 8 shows an example of the effect of PEt/SERt on tumor voxels and subsequent FTV calculations in DCE-MRI. In this example, a 38-year-old female patient with a tumor sized 4 cm was enrolled in the I-SPY 1 TRIAL. The tumor was identified to be HR+/HER2− before treatment. The patient received AC- and taxane-based chemotherapy, and she did not achieve pCR at the completion of the treatment.
The effect of varied PEt/SERt on estimated AUCs for FTV2, ΔFTV2, and FTV3 is shown in the supplement of this paper. When comparing the absolute measures FTV2 and FTV3 with the percent changes ΔFTV2 and ΔFTV3 (Figures 4–7A), the absolute measurements are more reliable in predicting pCR over a wider range of PEt/SERt. In the HR+/HER2− subtype, higher estimated AUCs were observed at high PEt in all FTV measurements. Estimated AUCs for HER2+ are generally lower than those for HR+/HER2− and TNBC, which can also be observed in Figure 3. A mixed effect of PEt/SERt was observed in TNBC: high AUCs were found at a higher range of PEt for FTV2, FTV3, and ΔFTV3 but at a lower range of PEt for ΔFTV2.
In this study, the impact of PE and SER thresholds on FTV prediction of neoadjuvant treatment response was retrospectively investigated using data from the I-SPY 1 TRIAL/ACRIN 6657. In that study, default PEt and SERt levels were used in the FTV calculations that were empirically set by visual evaluation of DCE images. In this paper, we present a semiautomated method to customize the PEt and SERt parameters, particularly for breast cancer subtypes, to account for the heterogeneity of tumor biology as reflected in imaging biomarkers. Through the optimization framework of this study, we seek to better understand the enhancement patterns of individual breast cancer subtypes and the association between enhancement measurements and pathologic outcomes of NACT.
Various forms of FTV have been investigated and compared previously to test predictive performance measured at different time-points during the treatment. Previous work on the ACRIN 6657 study reported AUCs of FTV ratios at MRI2, MRI3, and MRI4 relative to MRI1 in predicting pCR using the default PEt/SERt (15). In a study using earlier data from a pilot cohort of 64 patients imaged at a single center (18), the effect of varying PEt/SERt on FTV and ΔFTV was investigated. The percent change in FTV over the entire course of treatment from baseline to before surgery (ΔFTVf) was the predictor with the highest hazard ratio in the full cohort and the HR+/HER2− and HER2+ subgroups, whereas the absolute presurgical FTV (FTVf) was the highest for the TNBC subtype. In this study, FTV was calculated at MRI1–4 and percent change of FTV at MRI2, 3, and 4. Although the inter-regimen metrics FTV3/ΔFTV3 generally showed the highest estimated AUCs, AUCs of the presurgery values FTV4/ΔFTV4 varied across patient cohorts (Figure 3). Meanwhile, FTV2/ΔFTV2 had AUCs similarly high to those of FTV3/ΔFTV3 across all patient cohorts except HR+/HER2−. Given the small sample size, these observations are limited to this study only. Cross validation is needed to confirm them in a general population.
PE and SER measure the signal enhancement characteristics of pre- and postcontrast injection during DCE-MRI (22). These 2 basic measurements and their thresholds may have a profound effect on the subsequent FTV calculation and, hence, its predictive performance of response in breast cancer subtypes during the treatment course. The current study showed that higher AUCs were observed at higher PEt when absolute FTV was used to predict pCR in HR+/HER2− subtype. A similar finding was observed in the HR+ subgroup in a previous study (18), indicating that higher PEt may better discriminate regions of malignant tumor from the high background parenchymal enhancement often found in HR+ patients (27–30). High SER value is indicative of tissue with a strong contrast washout characteristic and is generally associated with malignancy (31). Many studies have reported that TNBC shows a malignant enhancement pattern on DCE-MRI (32–36). Li et al. reported that postchemotherapy tumor volume with high SER had a statistically significant association with disease recurrence (37). Among breast cancer subtypes in this study, HER2+ was most affected by SERt at FTV3 and ΔFTV3. Higher AUCs were observed at higher SERt, suggesting distinct biology and microenvironment within the HER2+ tumor that differ from other subtypes.
Compared with HR+/HER2− and TNBC, HER2+ had lower AUCs. This may be because of the heterogeneity within this subgroup, which included both HR+ and HR− tumors. Because of the small sample size, we could not further divide this group into HR+/HER2+ and HR−/HER2+. The heterogeneity within this subtype may limit the effectiveness of changing PEt/SERt to improve AUC. Furthermore, although trastuzumab is the current standard treatment for HER2+ patients, it was not used routinely in the timeframe of this study. Only 13 of 39 HER2+ patients received trastuzumab therapy. This adds complexity to this subtype and may have also created bias in our results. Because of the small sample size, we did not exclude these patients.
The presented retrospective study has a few limitations. First, the image quality may not be consistent in our patient cohort. Imaging data in this study were collected from a multicenter clinical trial and were acquired from 7 participating sites in the USA. The default PEt/SERt setting varied across sites, and we only studied the subsequent calculated FTVs by applying subtype-specific thresholds. Second, the sample size is too small to perform any kind of validation (or cross validation) of the optimization model. The highest AUCs found in the full cohort and in subtypes may therefore overestimate the true optimal values. Further study on an independent cohort should therefore be performed to evaluate the extent to which our estimated AUCs represent generalizable improvement in predictive values. Again, because of the relatively small sample sizes, AUCs estimated in subtypes have wider CIs than those estimated in the full cohort. In this study of 116 patients, we were unable to evaluate other factors such as age, tumor size, and axillary lymph node status. Third, the treatment was not the same for all subtypes. The data set was acquired between May 2002 and March 2006. All patients in our cohort had AC and taxane therapy before surgery, and one-third of HER2+ patients received additional trastuzumab. These different treatments can affect the predictive performance of ΔFTV with or without optimization. Finally, the HER2+ subtype comprised both HR+/HER2+ and HR−/HER2+ tumors, posing potential heterogeneity in the analysis. In our planned future study with a larger cohort, the HR+/HER2+ and HR−/HER2+ subsets will be analyzed separately.