Neoadjuvant chemotherapy (NAC) for breast cancer has shown effectiveness equivalent to that of adjuvant chemotherapy in terms of disease-free and overall survival (1, 2). NAC has the advantage of allowing downstaging of the primary tumor for breast-conserving surgery and providing in vivo information about a patient's response to a specific regimen (3–5). The I-SPY 2 TRIAL (Investigation of Serial Studies to Predict Your Therapeutic Response through Imaging and Molecular Analysis 2) is a multicenter clinical trial for patients with locally advanced breast cancer undergoing NAC with the primary endpoint of pathological complete response (pCR) (6). Patients undergo dynamic contrast-enhanced MRI (DCE-MRI) examinations before, during, and after NAC. DCE-MRI provides additional insight into tumor physiology and may be able to provide better imaging biomarkers of treatment response than anatomical imaging alone (7–10). Results from the ACRIN 6657 trial showed that functional tumor volume measured by magnetic resonance imaging (MRI) was associated with pCR and recurrence-free survival, and functional tumor volume was a stronger indicator of response than clinical assessment (11–14).
Background parenchymal enhancement (BPE) on breast DCE-MRI is a physiological feature describing signal enhancement resulting from the uptake of gadolinium-based intravenous contrast by normal breast tissue (15). BPE observed in breast fibroglandular tissue (FGT) shows an association with breast cancer risk (16–19) and has also been investigated for use as an imaging biomarker to predict NAC response (20–22). Studies have shown BPE to be subtype-dependent with positive association for hormone receptor status (23, 24). Currently, 4 categories of BPE are qualitatively defined in the Breast Imaging Reporting and Data System (BI-RADS) atlas: minimal, mild, moderate, and marked (25). Acceptance of BPE as a biomarker is constrained by limited single-site studies with small sample sizes and varying methods for visual and quantitative BPE assessment (26). A recent review by Liao et al. reported the results of a number of studies using quantitative BPE measurements to evaluate treatment outcomes with varying methods for BPE quantification between studies (19, 22, 27, 28). To address inter-reader variability associated with qualitative BPE assessment, a standardized quantitative method is also needed.
Here, we evaluated 3 segmentation approaches for measuring quantitative contralateral BPE and compared them for prediction of pCR using data from the multicenter I-SPY2 trial. The overall aim was to determine an accurate, fully automatic, and robust segmentation method to quantitatively measure contralateral BPE and optimize its predictive power for assessing treatment response.
Materials and Methods
Women ≥18 years of age diagnosed with stage II/III breast cancer and tumor size measuring ≥2.5 cm were eligible to enroll in the I‐SPY 2 TRIAL (6). Patients with evidence of distant metastasis were excluded from the study. Biomarker assessments based on hormone (estrogen and progesterone) receptors (HR+/−), human epidermal growth factor receptor 2 (HER2+/−) status, and a 70‐gene assay (MammaPrint, Agendia, Amsterdam, The Netherlands) were performed at baseline (T0) and used for treatment randomization (6). In addition to standard immunohistochemical and fluorescence in situ hybridization (FISH) assays, the protocol included a microarray‐based assay of HER2 expression (TargetPrint, Agendia) to assign HR and HER2 statuses. Patients with tumors that were designated as HR+/HER2− and low risk according to the MammaPrint 70‐gene assay were excluded because, for patients with less proliferative tumors, the potential benefit of receiving investigational drugs along with chemotherapy is low when weighed against the risk of drug side effects (29, 30). All patients provided written informed consent to participate in the trial. A second consent was obtained if the patient was randomized to an experimental treatment.
Pathologic Assessment of Response
Figure 1 shows the schema of the I‐SPY 2 TRIAL. Pathologic complete response (pCR), defined as the absence of residual cancer in the breast or lymph nodes as evaluated by a trained pathologist at the time of surgery, is the primary endpoint of the trial. All patients were classified as either pCR or non‐pCR. Patients who left the study without completing the entire course of treatment or did not undergo surgery for any reason were labeled as non‐pCR.
MRI examinations were performed before the initiation of NAC (baseline, T0), after 3 weeks of treatment (early‐treatment, T1), after 12 weeks and between drug regimens (inter-regimen, T2), and after completion of NAC and before surgery (presurgery, T3). MRI data were acquired with 1.5 T or 3 T scanners with a dedicated breast RF coil, across a variety of vendor platforms and institutions. All MRI examinations for the same patient were performed using the same magnet configuration (manufacturer, field strength, and breast coil model). The standardized image acquisition protocol included T2‐weighted and DCE‐MRI sequences performed bilaterally in the axial orientation.
DCE‐MRI was acquired as a series of 3D fat‐suppressed T1‐weighted images with the following parameters as specified in the I-SPY2 MRI protocol: repetition time = 4–10 milliseconds, minimum echo time, flip angle = 10°–20°, field of view = 260–360 mm to achieve full bilateral coverage, acquisition matrix = 384–512 with in‐plane resolution ≤ 1.4 mm, slice thickness ≤ 2.5 mm, and temporal resolution = 80–100 seconds. Gadolinium contrast agent was administered intravenously at a dose of 0.1 mmol/kg body weight and at a rate of 2 mL/s, followed by a 20‐mL saline flush. The same contrast agent brand was used for all MRI examinations for the same patient. Precontrast and multiple postcontrast images were acquired using identical sequence parameters. Postcontrast imaging continued for at least 8 minutes following contrast agent injection.
Quantitative Image Analysis
Nonuniformity of low spatial frequency intensity owing to coil sensitivity variations seen in the MRI data is known as bias or inhomogeneity. To correct for image inhomogeneity, all examinations were preprocessed with N4 bias correction, an improvement upon the N3 (nonparametric nonuniformity normalization) method (31). Automatic whole breast segmentation was performed on each slice of every examination using locally developed software. Both breasts were initially segmented from background for the volumes anterior to the sternal notch using the precontrast image reformatted to the coronal orientation. The FGT volume of only the contralateral breast was then segmented using fuzzy c-means (FCM) clustering (32). Segmentation of 3 different-sized subvolumes was investigated: all axial slices containing FGT voxels (full stack), the central 50% of included slices (half stack), and the central 5 slices (center 5). A visual representation of the subvolumes is shown in Figure 2. All magnetic resonance examinations were centrally processed at the I-SPY 2 imaging core laboratory using in‐house software developed in IDL (ITT Visual Information Solutions, Boulder, CO).
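The three sampling schemes can be illustrated with a short sketch. It assumes the FGT mask has already been reduced to a per-slice count of FGT voxels; the function name `select_subvolume`, the round-half-up rule for the half stack, and the use of the middle FGT-containing slice as the center are assumptions made for illustration and are not taken from the in-house IDL software.

```python
def select_subvolume(fgt_counts, method):
    """Return the axial slice indices sampled by a given method.

    fgt_counts: list with the number of FGT voxels in each axial slice
                (hypothetical input format).
    method:     "full", "half", or "center5".
    """
    # Full stack: every axial slice that contains at least one FGT voxel.
    full = [i for i, n in enumerate(fgt_counts) if n > 0]
    if method == "full" or not full:
        return full

    if method == "half":
        # Central 50% of the included slices (rounded up, at least 1 slice).
        keep = max(1, (len(full) + 1) // 2)
    elif method == "center5":
        # Central 5 slices, or fewer if the stack is shorter than 5.
        keep = min(5, len(full))
    else:
        raise ValueError("method must be 'full', 'half', or 'center5'")

    # Center the kept window within the list of included slices.
    start = (len(full) - keep) // 2
    return full[start:start + keep]


counts = [0, 0, 3, 5, 8, 9, 7, 6, 4, 2, 0, 0]
print(select_subvolume(counts, "full"))
print(select_subvolume(counts, "half"))
print(select_subvolume(counts, "center5"))
```

In this sketch the half stack and center 5 windows are both taken relative to the FGT-containing slices rather than to the full image volume, which is one plausible reading of the description above.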
Within each segmentation mask, mean background parenchymal enhancement (BPE) in the contralateral breast was calculated from DCE-MRI at each treatment time point as:
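A minimal sketch of this calculation follows, assuming BPE is defined as the mean voxel-wise percent enhancement, 100 × (S1 − S0) / S0, where S0 and S1 are the precontrast and early postcontrast signal intensities within the FGT mask. That definition, the function name `mean_bpe`, and the flattened-list inputs are assumptions for illustration only.

```python
def mean_bpe(pre, post, mask):
    """Mean percent enhancement over masked voxels.

    pre, post: flattened precontrast and postcontrast signal intensities.
    mask:      flattened FGT mask (truthy = voxel is included).
    Assumed definition: PE = 100 * (post - pre) / pre for each voxel.
    """
    values = [
        100.0 * (s1 - s0) / s0
        for s0, s1, m in zip(pre, post, mask)
        if m and s0 > 0  # skip voxels outside the mask or with zero signal
    ]
    if not values:
        raise ValueError("no valid voxels inside the mask")
    return sum(values) / len(values)


pre = [100.0, 200.0, 50.0, 80.0]
post = [120.0, 220.0, 100.0, 80.0]
mask = [1, 1, 0, 1]
print(mean_bpe(pre, post, mask))
```

The same function would be applied separately to each subvolume mask (full stack, half stack, center 5) and at each treatment time point.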
A subset of 148 patients underwent unilateral manual whole breast segmentation of the contralateral breast followed by automatic FGT segmentation to encapsulate as much FGT as possible. Manual whole breast segmentation excluded any regions with artifacts such as inhomogeneous fat saturation or coil bias, observed typically in the most superior and inferior axial slices, to minimize inclusion of non-FGT voxels in the BPE quantification. Owing to the time-consuming nature of manual delineation, the full cohort was not assessed, and this manually delineated subset was used as a reference standard. Pearson's linear correlation coefficient, r, was calculated to assess the agreement in BPE quantification between the fully automated and semimanual methods.
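For reference, Pearson's r between two sets of BPE measurements can be computed as in the following sketch. It is a plain standard-library version written for illustration; `scipy.stats.pearsonr` returns the same coefficient together with a P-value. The function name `pearson_r` and the example values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's linear correlation coefficient between paired samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical BPE values (%) from the automated and semimanual methods.
automated = [12.0, 18.5, 25.0, 31.0, 40.5]
semimanual = [11.0, 19.0, 24.0, 33.0, 41.0]
print(round(pearson_r(automated, semimanual), 3))
```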
Quality Assessment of BPE Calculation
Visual quality of breast tissue segmentation for each examination was examined by a radiologist and was graded on how well the automatic segmentation performed on tissue classification, because image quality (eg, coil artifacts, poor fat suppression) can cause errors in the segmentation process. Automatic segmentation quality was visually graded as good, adequate, poor, or failed quality using representative images chosen at the center slice and at ends of the selected subvolume. Figure 3 shows an example of a typical good tissue segmentation for accurate BPE quantification. The quality assurance grades were used to further stratify the quality of BPE values used for analysis.
An initial 990 I-SPY2 patients enrolled on drug arms completed by November 2016 were considered for analysis. Patients who did not have a DCE-MRI scan at early-treatment (T1) or inter-regimen (T2), had a rejected DCE-MRI scan, or had a failed segmentation quality grade were excluded; the remaining patients comprised the first cohort for analysis. A final cohort was defined after additional removal of examinations with poor segmentation quality, leaving only good and adequate segmentation-quality examinations.
Statistical analysis was performed to assess the predictive performance of a single magnetic resonance predictor for pCR vs non‐pCR outcomes. All statistical analyses were performed using SciPy 1.3 (https://scipy.org) and Python 3.7 (Python Software Foundation, Wilmington, DE).
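The percent change in mean BPE from T0 to T1 (%ΔBPE0_1) was used in a univariate analysis for pCR prediction and is calculated as:
$$\%\Delta\mathrm{BPE}_{0\_1} = \frac{\mathrm{BPE}_{T1} - \mathrm{BPE}_{T0}}{\mathrm{BPE}_{T0}} \times 100\%$$
where BPE_T0 and BPE_T1 are the mean BPE values at T0 and T1. The percent change from T0 to T2 (%ΔBPE0_2) is calculated analogously.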
The area under the ROC curve (AUC) of a logistic regression model was used to assess the predictive performance of %ΔBPE0_1 in the full cohort and within subtypes. P-values for AUCs being different from .5 were estimated using the Mann–Whitney U test. Results with P-values < 0.05 were considered statistically significant.
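The link between the AUC and the Mann–Whitney U statistic can be sketched as follows. For a single predictor, a logistic regression model is a monotone transform of that predictor, so its AUC equals the rank-based AUC up to the direction of the association. The function names and the example values below are hypothetical, and the sketch does not compute a P-value (`scipy.stats.mannwhitneyu` provides one).

```python
def percent_change(bpe_t0, bpe_t1):
    """Percent change in mean BPE from baseline to a later time point."""
    return 100.0 * (bpe_t1 - bpe_t0) / bpe_t0


def auc_mann_whitney(scores_pos, scores_neg):
    """AUC computed as U / (n_pos * n_neg).

    U counts the (positive, negative) pairs in which the positive case has
    the higher score; tied pairs contribute one half.
    """
    u = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return u / (len(scores_pos) * len(scores_neg))


# Hypothetical (BPE at T0, BPE at T1) pairs for pCR and non-pCR patients.
pcr = [(20.0, 12.0), (30.0, 21.0), (25.0, 20.0)]
non_pcr = [(20.0, 19.0), (30.0, 33.0), (25.0, 15.0), (40.0, 38.0)]

# Score each patient by the size of the BPE decrease (negated % change),
# so that a larger decrease gives a higher score.
pos = [-percent_change(t0, t1) for t0, t1 in pcr]
neg = [-percent_change(t0, t1) for t0, t1 in non_pcr]
print(auc_mann_whitney(pos, neg))
```

An AUC of 0.5 corresponds to no discrimination, which is why the P-values test for a difference from 0.5.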
In total, 990 patients with pCR outcome enrolled in the I-SPY2 TRIAL from drug arms completed before November 2016 were included in this study. Patients who did not have a DCE-MRI scan at early-treatment (T1) or inter-regimen (T2), had a failed segmentation quality grade, or had a DCE-MRI scan rejected because of other image quality or protocol adherence issues were excluded. After this preliminary exclusion, BPE was calculated in 735 women (median age, 49 years; range, 24–77), who formed the first cohort; of these, 258 (35.1%) achieved pCR. An additional 395 patients were excluded after strict quality assessment because of poor tissue segmentation, with causes including undersampling, coil artifacts, and poor fat suppression. The second cohort comprised 340 women (median age, 49 years; range, 24–77), of whom 113 (33.2%) achieved pCR. Patients with hormone receptor–negative disease were more likely to achieve pCR than those with hormone receptor–positive disease. Patient characteristics are shown in Table 1, and a flow diagram of patient exclusion is shown in Figure 4. No statistically significant differences were found in patient characteristics between the enrolled population of 990 patients and the final analysis cohort of 340 patients, which excluded poor-quality BPE.
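The cohort sizes and pCR rates quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the counts are those reported in the text.

```python
first_cohort = 735    # patients with BPE calculated after preliminary exclusion
excluded_poor = 395   # additionally excluded for poor segmentation quality
pcr_first = 258       # pCR in the first cohort
pcr_final = 113       # pCR in the quality-restricted cohort

final_cohort = first_cohort - excluded_poor


def percent(k, n):
    """Percentage of k out of n, rounded to one decimal place."""
    return round(100.0 * k / n, 1)


print(final_cohort)                          # size of the second cohort
print(percent(pcr_first, first_cohort))      # pCR rate, first cohort
print(percent(pcr_final, final_cohort))      # pCR rate, second cohort
print(percent(excluded_poor, first_cohort))  # share removed by quality review
```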
Table 1. Patient characteristics
| Characteristic | Any Segmentation Quality (N = 990) | Good or Adequate Segmentation Quality (N = 340) | P-value |
|---|---|---|---|
| Age at Screening (Years) | | | 0.78a |
| Mean (SD) | 48.8 (10.6) | 48.9 (10.0) | |
| American Indian or Alaska Native | 4 (0%) | 2 (1%) | |
| Asian | 68 (7%) | 23 (7%) | |
| Black or African American | 121 (12%) | 28 (8%) | |
| Mixed Race/Ethnicity | 7 (1%) | 4 (1%) | |
| Native Hawaiian or Pacific Islander | 5 (1%) | 2 (1%) | |
| White | 784 (79%) | 281 (83%) | |
| Post/Perimenopausal | 324 (41%) | 117 (43%) | |
| Premenopausal | 464 (59%) | 153 (57%) | |
| Pathologic Complete Response | | | 0.86b |
| pCR | 324 (33%) | 113 (33%) | |
| non-pCR | 666 (67%) | 227 (67%) | |
| HR+HER2+ | 156 (16%) | 57 (17%) | |
| HR+HER2− | 380 (38%) | 140 (41%) | |
| HR−HER2+ | 89 (9%) | 27 (8%) | |
| HR−HER2− | 363 (37%) | 116 (34%) | |
The comparability of the 3 segmentation methods (full stack, half stack, and center 5) was assessed in the full cohort and within HR and human epidermal growth factor receptor 2 (HER2) subtypes. To analyze the strength of the linear relationship, Pearson's linear correlation coefficient (r) was calculated between segmentation methods. The r values for full vs half, half vs center 5, and full vs center 5 were 0.953, 0.867, and 0.840, respectively. However, a high correlation is not necessarily indicative of agreement, as it does not provide information about possible bias. To distinguish systematic bias from random variation between the 3 automated segmentation methods, Bland–Altman plots (Figure 5) were generated in the quality-restricted second cohort to determine whether the automated methods differed from each other (33). The mean differences for all 3 comparisons were very close to 0, suggesting that the estimated bias is low. All 3 plots also show no apparent trend in the differences across mean values, with most of the points within the 95% limits of agreement.
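The quantities behind a Bland–Altman plot can be sketched as follows: the bias is the mean of the paired differences, and the 95% limits of agreement are the bias ± 1.96 standard deviations of those differences. The function name `bland_altman` and the example values are hypothetical.

```python
import statistics


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias:   mean of the paired differences a - b
    limits: bias +/- 1.96 * sample standard deviation of the differences
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical BPE values (%) from two segmentation methods.
full_stack = [10.0, 12.0, 14.0, 16.0]
half_stack = [11.0, 11.0, 15.0, 15.0]
bias, lower, upper = bland_altman(full_stack, half_stack)
print(bias, round(lower, 3), round(upper, 3))
```

On the plot itself, each point is placed at the pair mean, (a + b) / 2, on the horizontal axis and the pair difference, a − b, on the vertical axis.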
Comparison with Manual BPE Reference Standard
We calculated BPE for a subset of patients that had a manual whole breast segmentation to compare the relationship between automated and manual methods. Figure 6 shows the Pearson's linear correlation. All automated methods showed high agreement with the manual reference method, with best agreement using the half stack method. The r values between manual and full stack, half stack, and center 5 are 0.971, 0.977, and 0.925, respectively, with all P-values = .001.
Table 2 shows the pCR rate and the reported AUCs for percent change in mean contralateral BPE from baseline to early treatment (%ΔBPE0_1) and from baseline to inter-regimen (%ΔBPE0_2) for each segmentation method within the full cohort and within subtypes. The data in this table contain all segmentation quality categories including poor, adequate, and good visual segmentations. AUCs in the full cohort ranged from 0.51 to 0.53 and AUCs varied within subtype from 0.56 to 0.58 in HR+/HER2+, 0.52 to 0.53 in HR+/HER2−, 0.56 to 0.59 in HR−/HER2+, and 0.51 to 0.52 in HR−/HER2−. These results reached statistical significance in the HR+/HER2− subtype for the T2 time-point predictor (%ΔBPE0_2).
When patients were restricted to adequate and good visual segmentation quality (Table 3), the associated AUCs for both BPE predictors remained similar between segmentation methods. The pCR rate was higher for HR−/HER2+ patients with adequate/good quality than the patients with any segmentation quality (81.5% versus 68.9%). AUCs in the full cohort ranged from 0.50 to 0.51 and AUCs varied within subtype from 0.44 to 0.57 in HR+/HER2+, 0.54 to 0.57 in HR+/HER2−, 0.78 to 0.87 in HR−/HER2+, and 0.50 to 0.55 in HR−/HER2−. The highest AUC values, which were also statistically significant, were found in the HR−/HER2+ subtype at the early time point (%ΔBPE0_1). Although sample sizes were reduced by ≥50% in every subtype after restricting for segmentation quality, differences in AUCs between subtypes became more apparent, with notably higher AUC values achieved in the HR−/HER2+ subtype at the T1 time point. Comparison of AUC values in the quality-restricted and unrestricted cohorts highlights the small variation between segmentation methods relative to differences between subtypes.
BPE observed in breast FGT shows an association with breast cancer risk. We further investigated BPE's use as an imaging biomarker to predict NAC response. For BPE to be used as a robust, clinically meaningful biomarker, an automated quantitative segmentation method is necessary to remove subjectivity and inter-reader variability associated with qualitative methods (34). Although manual segmentation yields promising results, manual delineation of the breast surface and visual confirmation of tissue boundaries are time-consuming and subject to inter-reader variability. The use of automated segmentation may provide reproducible quantitative results required for validation and for ensuring repeatability. This study compared automated quantitative methods for BPE calculation using different levels of tissue sampling to improve segmentation quality and assessed the ability of each method to predict treatment response.
When visual assessment of segmentation quality was used, a large proportion of cases (54% of the data set) was excluded from analysis. A limitation of this retrospective study may be image quality, because the analysis included only patients enrolled up until 2016. Since then, we have implemented better equipment and are continually improving our segmentation methods. When the exclusion criteria were relaxed, allowing artifacts or undersampling of tissue, our findings remained consistent within and between subtypes. In Table 1, the Kruskal–Wallis rank sum test and Pearson's chi-square test performed between the quality-restricted second cohort and the initial 990 patients considered for analysis showed that the difference between cohorts is not statistically significant, suggesting that the included cohort reflects the population enrolled in the I-SPY2 trial. However, the results in the HR+/HER2− subtype for %ΔBPE0_2 were reinforced in the quality-limited cohort, with higher AUCs for pCR prediction, indicating that image quality may have different impacts in different subtypes.
AUC values were similar across segmentation methods, with only small differences in the full cohort and within subtypes; these differences do not appear meaningful for pCR prediction, either in the first cohort of 735 patients or in the quality-restricted cohort of 340 patients. Notable AUC results for %ΔBPE0_2 are seen in Table 2 for the HR+/HER2− subtype. Our results show that BPE at the later time point may be predictive of HR+/HER2− patients' response to treatment, with no clear differences between methods. Variations within subtype were relatively small in comparison to the AUC differences between subtypes. For example, in Table 3 the full-cohort AUCs for T0 to T1 differ by only 0.01 between methods. In the full cohort, percent change in BPE did not show strong predictive power. However, in HR−/HER2+ patients, in whom the pCR rate was higher, the change in mean BPE showed statistically significant predictive power for pCR at the earlier time point in response to taxane-based treatment. Within the HR−/HER2+ subtype, the jump in AUC may indicate that change in BPE is a good imaging biomarker for early prediction of pCR in a neoadjuvant setting. Because the HR−/HER2+ cohort consisted of only 27 patients, of whom 22 achieved pCR, additional validation needs to be performed on a larger sample.
Our results corroborate those of Dong et al., supporting current findings that women with HR− tumors were more likely to achieve pCR than women with HR+ tumors and indicating that decreased BPE in women with HER2+ breast cancer may predict effective response to NAC treatment (35). BPE is affected by hormonal changes, in which estrogen can lead to increased contrast uptake in tissue as well as dilation of the blood vessels (36).
Fully automatic segmentation demonstrated some limitations. Figure 7 illustrates limitations of the full stack method and the center 5 method. The full stack may pick up noise and false masking in the outermost regions of the DCE-MRI volume. The example on the left in Figure 7 shows an axial slice that contains a contralateral breast artifact from an implanted venous access port used to deliver chemotherapy. The artifact adversely affected the automatic segmentation, which falsely classified the artifact as tissue. The full stack method is also the most computationally intensive method and does not appear to provide more predictive benefit than the half stack. Another limitation was that the center referenced for the center 5 method may not always have been well centered within the breast, and thus it might not give a representative sample of the tissue, whereas the half stack method may sample enough of the breast to capture most of the FGT while excluding the outer edges that may pick up artifacts. The example on the right in Figure 7 shows the smaller volume of interest using the center 5 method.
We showed that using the half stack method was the best compromise to optimize our clinical decision tool through validation. This compromise uses fewer computational resources while still retaining the same predictive performance as the full stack method. Using the half stack method, our study, along with many others, shows the importance of a longitudinal analysis using BPE as a predictor for positive response to treatment. Based on these observations, we recommend using the half-stack volume of interest moving forward. Future plans include comparing our results to a manually segmented reference standard, implementing automatic nipple slice detection, and adding contralateral BPE into a multivariate model to hopefully improve predictive performance for treatment response.
In conclusion, quantitative BPE calculated from DCE-MRI is an emerging imaging biomarker that has shown promise as an indicator of early response to neoadjuvant treatment. We showed that our BPE calculations from different-sized subvolumes of DCE-MRI scans are consistent with one another and provide results in close agreement. Based on our study, we recommend the half stack method as a fully automatic segmentation method for repeatable quantitative BPE measurements.