Serial magnetic resonance imaging (MRI) studies during neoadjuvant chemotherapy (NAC) for breast cancer allow for in vivo observation of changes in the tumor to assess treatment response. Multiparametric breast MRI studies typically include a primary dynamic contrast-enhanced (DCE-MRI) acquisition for lesion visualization. These images can be used for morphologic characterization, quantitative and qualitative enhancement characterization of both lesion and background parenchyma, and quantification of lesion size. DCE-MRI-derived metrics have shown value for prediction of both pathological and survival outcomes for patients with breast cancer (1–3). Functional diffusion-weighted imaging (DWI), which reflects water mobility impeded by cellular constituents and interstitial tortuosity (4), can help in evaluating therapeutic efficacy by reflecting changes in tumor cellularity (5). The apparent diffusion coefficient (ADC) measured by DWI has been shown to improve specificity and positive predictive value of breast magnetic resonance (MR) examinations and to identify early tumor response to cytotoxic effects of breast cancer therapy (6–9).
Unlike conventional qualitative diagnostic imaging relying on differences in signal intensities, interpretation of changes in quantitative metrics such as ADC requires the measurement of confidence intervals (CIs). Metric changes exceeding the CI will, with 95% confidence, correspond to true parameter changes (beyond measurement error) (10). These intervals are determined by precision (repeatability) and accuracy (bias) of the applied DWI protocol and the physical model for a derived quantitative imaging biomarker (11). The baseline precision can be determined from test/retest (TT/RT) examination performed with identical imaging protocol for study subjects. To reliably detect changes in breast tumor diffusion characteristics, the measured changes in any lesion ADC metric must be compared to corresponding CIs.
Previous single-site studies performed in relatively small subject cohorts have investigated repeatability and reproducibility of breast ADC measures in normal (12–16) and cancerous (13, 14, 17–19) tissue. Within-subject coefficients of variation ranged from 5% to 11%. Recent findings from the multicenter ACRIN 6698 Trial investigating DWI biomarkers for predicting treatment response in breast cancer NAC (20) indicated excellent repeatability of mean and median tumor ADC metrics (21). These results were achieved with a standardized imaging protocol, centralized processing, and extensive quality assurance and control procedures. However, mean tumor ADC measures provided only moderate power for predicting treatment outcome (22). Other single-center research suggests that improved tumor characterization may be achieved using alternative histogram metrics (23–26), and that volume-based metrics may relate to the clinical standard Response Evaluation Criteria In Solid Tumors (RECIST) (27). Evidence on the precision of such alternative breast tumor ADC histogram metrics is still sparse and based on single-center studies (18, 28).
An objective measure of tumor burden is essential for clinical management and evaluation of cancer therapeutics. Radiographic assessments of solid lesions, including longest diameter, estimators of cross-sectional tumor area, and total tumor volume, have been used as indicators of tumor size and have formed the basis for objective criteria of response by mass shrinkage, as well as disease progression (27, 29, 30). Water mobility, on the other hand, is sensitive to tissue microenvironment, such that lower mobility (reflected by low ADC) implies higher cellular density. There is therefore potential to derive novel useful biomarkers that combine features of both tumor volume and density by means of ADC histogram analysis. Conceptually, the cumulative volume of voxels both within the tumor region of interest (ROI) and having an ADC value below a specified threshold (thus excluding presumably less-dense or already necrotic tissues reflected by higher ADC) provides an estimate of dense tumor volume (23).
In this retrospective study of data from the TT/RT arm of the multisite ACRIN 6698 trial, we analyzed repeatability and reproducibility of ADC histogram and volumetric characteristics to establish confidence intervals for corresponding biomarkers for use in treatment response assessment during breast cancer NAC.
The DW-MRI data for this study were acquired as part of the ACRIN 6698 Trial “Diffusion Weighted MR Imaging Biomarkers for Assessment of Breast Cancer Response to Neoadjuvant Treatment” (20), a sub-study of the multicenter I-SPY 2 TRIAL evaluating novel treatments for breast cancer. ACRIN 6698 was performed at a subset of I-SPY 2 sites that met additional prequalification requirements for performing DW-MRI. Both studies were HIPAA-compliant and performed under IRB approval, and all patients gave informed consent before enrolling. Women aged ≥18 years were eligible if they had a biopsy-confirmed diagnosis of stage II–III disease and clinically or radiologically measurable disease in the breast with a tumor longest diameter (LD) of >2.5 cm. Patients were classified by hormone receptor (HR), human epidermal growth factor receptor-2 (HER2), and MammaPrint (MP) status, and patients with low-risk disease (HR+/HER2−/MP-low) were excluded. A subset of patients participated in the repeatability arm of the trial. For this subset, “coffee break” style TT/RT DWI scans were acquired (as described below) for evaluation of whole-tumor ADC repeatability (21), and these scans were retrospectively analyzed for the current study.
Details of the multivisit I-SPY 2 MRI protocol, the standardized ACRIN 6698 DWI protocol, and the TT/RT DWI protocol have been previously reported (20, 21, 31). In brief, for repeatability evaluation, T2W and multi b-value DWI images were acquired; the patient was removed from the scanner and repositioned; the scans were repeated. A DCE acquisition was subsequently performed. All imaging was done in the axial plane with full bilateral coverage of the breasts, in the prone position. The standardized DWI protocol required acquisition using a fat-suppressed SS-EPI sequence using b values of 0, 100, 600, and 800 s/mm2. TT/RT DWI measurements for a given patient were performed on the same day in a single imaging session. A single TT/RT study was conducted for each consented subject at either pretreatment (T0 time-point) or early treatment (T1 time-point, after 3 weeks of treatment), with T0 specified as the preferred time-point. DWI images were assessed with a standardized QA protocol (32), and subjects with either TT or RT scans judged not analyzable owing to protocol deviations or poor image quality were excluded from further analyses.
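As an illustration of how the four protocol b values support a monoexponential ADC estimate, the following is a minimal sketch that fits S(b) = S0 · exp(−b · ADC) by log-linear least squares. The fitting method (log-linear rather than nonlinear), the function name `fit_adc`, and the array layout are assumptions made for illustration; the trial's actual processing software is not reproduced here.

```python
import numpy as np

# b values from the standardized DWI protocol, in s/mm^2
B_VALUES = np.array([0.0, 100.0, 600.0, 800.0])


def fit_adc(signals, b_values=B_VALUES):
    """Log-linear least-squares fit of S(b) = S0 * exp(-b * ADC).

    signals: array of shape (n_b, ...) holding positive signal intensities,
             with the first axis indexing the b values.
    Returns the ADC in um^2/ms (1 mm^2/s = 1000 um^2/ms).
    ASSUMPTION: a log-linear fit is used here for simplicity; the source
    only states that a monoexponential model with all b values was used.
    """
    signals = np.asarray(signals, dtype=float)
    b_values = np.asarray(b_values, dtype=float)
    n_b = signals.shape[0]
    log_s = np.log(signals.reshape(n_b, -1))
    # Linear model: ln S = ln S0 - b * ADC
    design = np.column_stack([np.ones(n_b), -b_values])
    coef, *_ = np.linalg.lstsq(design, log_s, rcond=None)
    adc_mm2_per_s = coef[1].reshape(signals.shape[1:])
    return adc_mm2_per_s * 1e3


# Synthetic single-voxel example with a known ADC of 1.2 um^2/ms
synthetic = 1000.0 * np.exp(-B_VALUES * 1.2e-3)
print(float(fit_adc(synthetic)))
```

In practice, voxels with zero or negative signal would need to be masked before taking the logarithm.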
Whole-Tumor ADC Histogram Analysis
ADC histogram analysis was conducted using the ADC maps and tumor regions of interest (ROIs) defined for the primary study analysis (21). The TT and RT ADC maps were calculated using all b values and a monoexponential decay model. Multislice, whole-tumor ROIs were manually defined by selecting regions with hyperintensity on high b-value DWI (b = 600 or 800 s/mm2) and relatively low ADC, while avoiding adjacent adipose and fibroglandular tissue, biopsy clip artifacts, and regions of high T2 signal (eg, seroma and necrosis). All apparent disease regions were included in the ROI by using multiple distinct contours per slice and multiple slices as necessary. All voxels from the individual contours were combined into a single composite ROI for histogram analysis. The TT and RT ROIs for a given patient were defined separately and independently with no cross-referencing between the 2 DWI scans, and were defined by the same operator to minimize operator variability. All ROI definitions were reviewed and adjusted, if necessary, by the senior operator (reader 1; >10 years of quantitative breast MR analysis experience). The composite ROIs were applied to the derived ADC maps and used to define subject-specific TT and RT histograms. Standard histogram statistics, including mean, standard deviation, skew, kurtosis, median, ranges, and percentiles (5th, 15th, 25th, 50th, 75th, and 95th), were calculated for each histogram. Dense tumor volumes (VADC), defined as the volume of tissue within the ROI with ADC values below a specified threshold ADC, were calculated by summing the appropriate histogram bins and multiplying by the image voxel volume obtained from the DICOM header. Fractional dense tumor volume (fVADC) was calculated as the volume at the ADC threshold (VADC) divided by the volume at an ADC threshold of 3.0 µm2/ms (V3.0). V3.0 corresponds to approximately the full ROI volume, discounting isolated voxels with ADC > 3.0 µm2/ms resulting from noise.
ADC thresholds used were 0.5, 0.6,…, 2.0, 2.5, and 3.0 µm2/ms.
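The percentile, VADC, and fVADC definitions above can be sketched as follows. This is an illustrative example only: the function name `adc_histogram_metrics`, the direct voxel counting (in place of summing histogram bins), and the strict "less than" comparison at each threshold are assumptions, not the study's implementation.

```python
import numpy as np


def adc_histogram_metrics(roi_adc, voxel_volume_cm3, thresholds=(0.9, 1.1, 1.3)):
    """Percentiles and thresholded volumes for the ADC values of one ROI.

    roi_adc: ADC values (um^2/ms) of all voxels in the composite ROI.
    voxel_volume_cm3: volume of a single voxel in cm^3.
    ASSUMPTION: voxels are counted directly rather than via histogram bins,
    and a voxel counts toward V_ADC when its ADC is strictly below threshold.
    """
    roi_adc = np.asarray(roi_adc, dtype=float).ravel()
    levels = (5, 15, 25, 50, 75, 95)
    percentiles = dict(zip(levels, np.percentile(roi_adc, levels)))

    # V3.0 approximates the full ROI volume, discounting noisy high-ADC voxels
    v_full = np.count_nonzero(roi_adc < 3.0) * voxel_volume_cm3

    volumes, fractions = {}, {}
    for thr in thresholds:
        v_thr = np.count_nonzero(roi_adc < thr) * voxel_volume_cm3
        volumes[thr] = v_thr
        fractions[thr] = v_thr / v_full if v_full > 0 else float("nan")
    return percentiles, volumes, fractions


# Toy ROI with five voxels; the last one exceeds 3.0 um^2/ms and is discounted
pcts, vols, fracs = adc_histogram_metrics([0.8, 1.0, 1.2, 1.4, 3.5], 0.01)
print(pcts[50], vols, fracs)
```

Because fVADC divides by V3.0, any error in the ROI outline enters both numerator and denominator, which is the motivation for the normalization.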
The measurement repeatability of each metric across subjects was quantified using Bland–Altman (BA) 95% limits of agreement (LOA) = Mean(TT − RT) ± 1.96 × SD(TT − RT), where Mean(TT − RT) and SD(TT − RT) are the mean and standard deviation of the difference between TT and RT values. The repeatability coefficient, RC = 1.96 × SD(TT − RT), was used for comparisons between metrics of the same units (33). The within-subject coefficient of variation (wCV) was also calculated for each metric (11, 34).
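A minimal sketch of these repeatability statistics for paired TT/RT values is given below. The LOA and RC follow the definitions stated above. The wCV expression is an assumption: it uses a commonly cited root-mean-square form for two measurements per subject, wCV = sqrt(mean(d² / (2m²))), where d is the TT − RT difference and m the TT/RT mean for each subject, and may differ from the formula used in the study. The function name and the use of the sample standard deviation (ddof = 1) are likewise illustrative choices.

```python
import numpy as np


def repeatability_stats(test, retest):
    """Bland-Altman LOA, RC, and wCV for paired test/retest measurements."""
    test = np.asarray(test, dtype=float)
    retest = np.asarray(retest, dtype=float)
    diff = test - retest

    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)  # ASSUMPTION: sample SD of the differences
    rc = 1.96 * sd_diff
    loa = (mean_diff - rc, mean_diff + rc)

    # ASSUMPTION: root-mean-square wCV for two measurements per subject
    pair_mean = (test + retest) / 2.0
    wcv = np.sqrt(np.mean(diff**2 / (2.0 * pair_mean**2)))

    return {
        "mean_diff": mean_diff,
        "loa": loa,
        "rc": rc,
        "wcv_percent": 100.0 * wcv,
    }


# Hypothetical median ADC values (um^2/ms) for four subjects
stats = repeatability_stats([1.0, 1.2, 0.9, 1.1], [1.1, 1.1, 1.0, 1.0])
print(stats)
```

This form of wCV presumes that measurement error scales with the mean, which may not hold for bounded metrics such as fractional volumes.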
As part of the primary analysis of the TT/RT arm of ACRIN 6698, a reader study for determining intra- and interoperator reproducibility was conducted using the RT scans from a subset of 20 patients. Reader 1 defined whole-tumor ROIs on the DWI twice (“RD1” and “RD1b”), while Reader 2 (4 years of experience in quantitative breast MRI analysis) defined a single set of ROIs on the studies (“RD2”). The readers operated independently. The ROIs were defined independently from those used in the repeatability analysis, but using the same ROI protocol. The second set of ROIs for intraoperator measures (RD1b) was defined 5–6 weeks after the first set. Reproducibility results for ROI characteristics and for mean tumor ADC were previously reported (21). For the current study, we applied the tumor segmentations from the reader study to calculate the intra- and interoperator reproducibility of the histogram percentile, VADC, and fVADC metrics. Reproducibility was determined using wCV and BA LOA analysis as described above.
All image and statistical analyses were performed using in-house IDL software (Exelis Visual Information Solutions, Boulder, CO), Matlab R2015b toolboxes (MathWorks, Natick, MA), and SAS™ software version 9.4 (SAS Institute, Cary, NC).
The ACRIN 6698 Trial consented 89 patients (median age, 47 years; range, 27–73 years) from 9 institutions to the TT/RT substudy. Of those, 18 patients were excluded from analysis owing to either MRI protocol inconsistencies between TT and RT acquisitions (N = 3) or unacceptable image quality on TT and/or RT scans (N = 15). Scans from the remaining 71 patients (median age, 46 years; range, 27–71 years), including 60 pretreatment (T0) and 11 early-treatment (T1) visits, were analyzed for this study. This cohort was identical to that analyzed for the original study mean ADC repeatability analysis (21). Figure 1 shows T2-weighted images (b = 0 s/mm2), high-b-value DWI images, the corresponding ADC maps, and the segmented tumor ADC histograms for TT and RT acquisitions from 1 subject. These illustrate typical differences in TT and RT tumor ROI segmentation and ADC map noise, leading to variations of the respective histogram characteristics.
Repeatability results for DWI histogram metrics are given in Table 1 and presented graphically in Figure 2, with values for the mean ADC included for comparison. Highest precision (wCV = 5.44%) was observed for the 50th percentile (median) metric, whose sample distribution overlapped with the distribution for the mean ADC metric. This overlap is consistent with the Gaussian measurement noise being the main source of observed TT-RT variations for ADC histogram percentile metrics. Precision was also good for moderately lower percentile metrics, with wCV = 8.1% and 6.6% for 15th and 25th percentiles, respectively, but was degraded to 13.9% at the fifth percentile. The BA plot for selected histogram percentile values (Figure 2A) illustrates consistent repeatability patterns for 15th, 25th, and 50th percentiles. The LOAs were very similar for these metrics: RC values = 0.174, 0.160, and 0.158 μm2/ms (LOA shown as horizontal dashed lines in the BA plot). The histograms of the binned mean values for these percentile metrics across our cohort are shown in Figure 2B. For the 15th percentile metric, 85% (60/71) of all cases and 92% of pretreatment cases (50/60) had ADC values < 1.1 μm2/ms, indicating the presence of appreciable dense tumor tissue in these cases.
Table 1. Repeatability of ADC histogram metrics (column headings: Units; Mean^a; wCV (%); wCV 95% CI (%); Delta^b; BA RC^c).
Precision was lower for ADC-thresholded volume metrics (VADC) and varied considerably across the tested ADC thresholds (Table 1; Figure 2, C and D). wCV values were >50% for ADC thresholds <0.9 μm2/ms, indicating very poor repeatability for these measures. At higher ADC thresholds (≥1.5 μm2/ms), VADC is dominated by the volume of the whole-tumor ROI for the majority of cases, as tissue with ADC above these thresholds would be included in the ROI only by error. This resulted in wCV ≅ 27% for these thresholds, representing the repeatability of the ROI size. V1.5 was >80% of the total ROI volume for 56 (79%) of all cases and 50 (83%) of the pretreatment cases. We therefore focused analysis on moderate threshold values of 0.9, 1.1, and 1.3 μm2/ms, finding wCV = 44.0%, 36.5%, and 29.1%, respectively. Figure 2, C and D, shows the BA plots and mean value histograms for the V0.9, V1.1, and V1.3 volume measures. The sample means (RC) for these 3 thresholds were 1.5 (1.5), 3.3 (2.2), and 5.0 (3.3) cm3 (LOAs shown in Figure 2C). RC values exceeded the mean metric values for all lower threshold volumes, consistent with low repeatability of these metrics. Results for fractional volumes (fVADC = VADC/V3.0; ADC = 0.9, 1.1, 1.3 μm2/ms) are shown in Table 1 and Figure 2, E and F. wCV values were lower than the respective VADC values but still 3–5 times greater than the percentile measure wCV values. For ADC thresholds 0.9 and 1.1 μm2/ms, fractional volumes were distributed fairly uniformly across the range from 0 to 1 (Figure 2F, dark and light green).
Figure 3 shows the dependence on metric parameter of the sample means and 95%CI [mean ± RC, RC = repeatability coefficient = 1.96 × SD(TT − RT)] for ADC percentiles (A), VADC volumes (B), and fVADC fractional volumes (C) across the full range of the parameters examined. The tightest CI were observed for 15th to 75th ADC percentiles, with some drop off in precision at the extremes of 5th and 95th percentiles, indicating good precision measurements across a very wide range. For VADC measures at thresholds at or above 1.4 µm2/ms, wide CI and relatively small changes in sample means with ADC threshold changes indicate the limiting effects of ROI size dependence and ROI variability on these measurements. Specifically, in this threshold range, VADC is reflecting a manually determined total tumor volume based primarily on the high-b-value image intensity. This volume has high variability owing to operator subjectivity. We therefore expect generally poor sensitivity for detecting volume changes with treatment in this parameter range. Figure 3B also indicates that confident measurement of typical VADC changes, in particular the negative tumor volume changes most commonly associated with therapy response, may be limited at thresholds <0.9 µm2/ms. In this range the RC is near to or greater than the mean, putting the lower CI below 0 cm3. With current ROI techniques the most promising threshold range for low ADC volume measurements would appear to be between 0.9 and 1.3 µm2/ms. The fVADC measures (Figure 3C) showed a similar effect of excessive variability at low ADC thresholds, and also lost sensitivity at higher thresholds due to compression of values against the upper limit of fVADC = 1.0. In the moderate threshold range, the 95%CI for fVADC metric appeared somewhat tighter than that for the absolute VADC volumes.
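The relationship between the sample mean, the RC, and the smallest detectable change can be made concrete with the V0.9, V1.1, and V1.3 means and RC values reported above. The sketch below is illustrative only: the helper names are hypothetical, and comparing the RC with the cohort mean (rather than with an individual patient's baseline) is a simplification.

```python
# Sample mean and RC (both in cm^3) reported in the text for V0.9, V1.1, V1.3,
# keyed by ADC threshold in um^2/ms
REPORTED = {0.9: (1.5, 1.5), 1.1: (3.3, 2.2), 1.3: (5.0, 3.3)}


def rc_interval(mean_value, rc):
    """Mean +/- RC interval, as plotted for each metric parameter."""
    return mean_value - rc, mean_value + rc


def min_detectable_fraction(mean_value, rc):
    """Change, as a fraction of the mean, that a measurement must exceed
    before it can be distinguished from test/retest variability."""
    return rc / mean_value


def exceeds_rc(baseline, followup, rc):
    """True when the observed change is larger in magnitude than the RC."""
    return abs(followup - baseline) > rc


for threshold, (mean_value, rc) in REPORTED.items():
    lower, upper = rc_interval(mean_value, rc)
    fraction = min_detectable_fraction(mean_value, rc)
    print(f"V{threshold}: interval = ({lower:.1f}, {upper:.1f}) cm^3, "
          f"minimum detectable change = {100 * fraction:.0f}% of the mean")
```

Under this simplification, a typical V0.9 would have to vanish entirely before the decrease exceeded the RC, while V1.1 and V1.3 would need to shrink by roughly two-thirds.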
Correlations between different histogram metrics and with ROI total area are shown in Figure 4. For the correlation analysis, the cohort was limited to the 60 patients with TT/RT acquisitions at the pretreatment (T0) study time point, to avoid complications from the upward shift in the overall population tumor ADC distribution with NAC treatment. The scatter plots illustrate the correlations of the absolute and fractional volumes V1.2 and fV1.2 with the median ADC (50th percentile, Figure 4A) and the ROI area (Figure 4B). fV1.2 showed a strong correlation with the median (R = −0.98), which was not seen for the absolute V1.2 (R = −0.13). Figure 4B shows the strong correlation (R = 0.81) between volume V1.2 and ROI area, pointing toward ROI variability as the most significant contributor to the poor repeatability of the volume metrics. This correlation is reduced but still moderate (R = −0.37) for the corresponding fractional volume. The color chart in Figure 4C shows the correlation results (Pearson R values) for pairwise comparisons between ROI area (left column and top row) and the 9 ADC histogram metrics. High correlation (|R| > 0.8) was observed within each metric type (3 × 3 arrays along the diagonal), between percentile values and fVADC, and between VADC and ROI area. Normalization of the volumes to create fractional volumes reduced the correlation to ROI area, but it was still moderate. P-values were consistent with significant correlations (P < 10⁻⁴) for all the comparisons indicating high correlation.
Tables 2 and 3 give intra- and interoperator reproducibility, respectively, for histogram metrics evaluated using the RT data (second acquisition on each patient) on a 20-patient subset. Results for reproducibility followed similar patterns to the repeatability results: the histogram percentiles between 15th and 50th showed good reproducibility, with wCV (95% CI) ranging from 3.8% (2.9, 5.5) to 4.8% (3.7, 7.0) for intraoperator variability and 3.8% (2.9, 5.5) to 5.3% (4.1, 7.7) for interoperator variability. Volume-based measures were considerably less reproducible. In our primary range of interest, for ADC thresholds from 0.9 to 1.3 µm2/ms, there was no discernible dependence of the VADC wCV on threshold. For fVADC, reproducibility values showed the expected trend of lower wCV (higher precision) with higher thresholds, with wCV (95% CI) values for threshold ADC = 1.3 µm2/ms the lowest at 34% (26%, 49%) and 13% (10%, 20%) for intra- and interoperator reproducibility, respectively. The poor wCV values and large CIs for volume measures were consistent with greater dependence on ROI characteristics, as seen in the repeatability measures presented above. However, substantial variability in wCV estimates may also be due to the small number of patients in the reproducibility cohort, and for fVADC measures, the wCV model constraints may not be well satisfied, as errors in these measures may not be proportional to the mean values.
Table 2. Intraoperator reproducibility of ADC histogram metrics (column headings: Units; Mean^a; wCV (%); wCV 95% CI (%); Delta^b; BA RC^c).
Table 3. Interoperator reproducibility of ADC histogram metrics (column headings: Units; Mean^a; wCV (%); wCV 95% CI (%); Delta^b; BA RC^c).
This study provides baseline precision and reproducibility for ADC histogram-based metrics along both ADC (ADC percentile) and volume dimensions. In our study cohort of patients undergoing NAC for invasive breast cancer, repeatability was better for ADC percentiles than for low-ADC volumes, the latter appearing more sensitive to ROI segmentation variations. Fractional volumes, that is, low-ADC volumes normalized to the total tumor ROI volume, showed reduced sensitivity to segmentation variability. However, compared with all volumetric measures, the low ADC percentiles (15th and 25th), which are of interest for quantifying changes in dense tumor tissue with treatment, showed at least 3-fold better repeatability and lower sensitivity to segmentation variability.
For precise measurement of response during NAC, it is critical to quantify changes in malignant tumor burden. This can be done with a variety of techniques including linear dimension measurements by clinical examination or from imaging studies (eg, RECIST), or volumetric measures such as functional tumor volume from DCE-MRI examinations (35). DW-MRI has the ability to more specifically quantify solid or viable tumor volumes, based on their low ADC values of <1 µm2/ms (23). However, our analysis indicates relatively low precision of such volume measurements when coupled with manual segmentation. Improved segmentation consistency, either through better prescribed procedures or preferably through more highly automated techniques, is likely needed for useful measurement of treatment-induced changes in ADC-based solid tumor volumes in the breast cancer NAC realm. The use of fractional dense tumor volumes, normalized to the full ROI volume, alleviated some of the dependence on segmentation reproducibility. The wCV values were still relatively poor, but this may be reflective in part of breakdown of the wCV model when the errors are not proportional to the mean. The fractional volume metrics did show strong correlations to the histogram percentile metrics, indicating a possible functional equivalence between them. The changes in these metrics over treatment will be explored as potential biomarkers for therapy response prediction in a future study.
The most significant limitation of this study was the restriction to manually defined whole-tumor ROIs. Given the great heterogeneity among breast cancer lesions, the wide variety of imaging platforms in the multicenter study, and the complex ROI definition procedure, considerable variability was introduced into the analysis. This limits the determination of a true repeatability value for the tested metrics. We were also limited by relatively small sample sizes, having only 71 analyzable studies for repeatability, with a very unequal split between the T0 and T1 time-points, and a further limitation to 20 subjects for the reader study.
In conclusion, development and validation of quantitative imaging tools for supporting cancer patient trials and ultimately routine clinical adoption have been the focus of the National Institutes of Health Quantitative Imaging Network over the past decade (34, 36–38). In the present study, we found that ADC histogram percentiles down to the 15th percentile have high repeatability and reproducibility, comparable to mean ADC, while low-ADC volumetric measures were substantially less repeatable. Tumor segmentation variability appeared to be the main source of TT/RT error for volume-based ADC histogram metrics. High correlation of ADC percentiles to fractional tumor volumes indicated functional equivalence, and both low percentile distribution and fractional volume analysis suggested best sensitivity to volume changes for ADC between 0.9 and 1.3 µm2/ms. The diagnostic and predictive performance of these biomarkers will be evaluated in future work.