Quantitative imaging (QI) metrics are emerging as a tool for therapeutic response assessment in cancer treatment (1). As QI tools have been technically validated, clinical trials have started to base decisions on these imaging metrics, for example, quantitative parameters derived from dynamic contrast-enhanced (DCE)-magnetic resonance imaging (MRI) (1, 2).
The DCE-MRI-derived QI metrics can be affected by differences in MRI platforms, pulse sequences, acquisition parameters, image reconstruction schemes, pharmacokinetic models, and quantification software packages (3–8), which limits deployment of DCE-MRI in clinical trials and practice. MRI scanners from each vendor have unique hardware configuration, vendor-specific pulse sequences, and reconstruction schemes, which can cause a systematic bias in estimated QI metrics (4). In addition, selection of magnetic resonance (MR) acquisition parameters can influence quantification of these metrics (5, 9). Furthermore, QI metrics derived from different image-processing software packages can lead to substantial variations in the metrics, even when using the same pharmacokinetic model, T1 map, arterial input function (AIF), and region of interest (6, 7). To address these challenges, collaborative efforts under the initiatives of professional societies and government agencies have been made for development of DCE-MRI profiles, T1 phantoms, digital reference object, and statistical methods to harmonize imaging acquisition across different platforms, to validate imaging hardware and software, to test computer algorithms, and to assess technical performance (4–6, 10–16). All these efforts are absolutely necessary but not sufficient to warrant the accuracy and precision of QI metrics obtained in each individual patient during a clinical trial, which could affect decision-making and even clinical outcomes (1). Therefore, it is necessary to develop and implement a quantitative quality assurance (QA) procedure to measure QI metrics acquired in the patients who are on the trial (17).
Accuracy, in general, refers to the closeness of a measured QI metric to a true or known value, while precision is the agreement between repeated measurements of a metric (17). For any QI metric whose true value is not available, its deviation from a reference value, obtained as a group mean from a large sample study in a standard reference region, can serve as a measure of its accuracy (17). Precision, more commonly known as repeatability, can be evaluated from repeated measurements, often called test–retest studies, in a normal reference region that is not expected to change during the test–retest interval (17). Under these principles, a reference value and repeatability coefficient (RC) of a QI metric in a reference region, under given conditions or constraints of image acquisition and processing, can be determined from a population sample with 95% confidence and used to assess the accuracy and precision of the metric measured in an individual patient.
DCE-MRI-derived blood volume (BV) is emerging as a promising QI metric in assessing therapeutic response in head and neck (HN) cancers (18, 19). Tumor subvolumes characterized by low BV have been reported to be high-risk imaging biomarkers for tumor progression (19–22). Boosting those poorly perfused subvolumes with high radiation doses could improve local and regional control (23, 24). To test this clinical hypothesis, a randomized phase-II adaptive radiation therapy (RT) trial that targets persisting poorly perfused subvolumes of the tumor with high radiation doses in patients with poor-prognosis HN cancers has been initiated (21, 22, 25). The persisting poorly perfused tumor subvolumes are defined on the basis of BV measurements pre-RT and 2 weeks after starting RT. Inaccurate and unrepeatable estimates of BV maps could generate false, poorly perfused subvolumes. Subsequently, intensifying radiation doses to these falsely classified subvolumes can lead to either tumor overdose or underdose, which could increase radiation toxicity or cause failure of disease control, respectively. To achieve the goal of the clinical trial, it is critical to ensure accuracy and precision of BV maps in each individual patient and thereby warrant proper segmentation of low-BV tumor subvolumes.
The present study developed and evaluated a framework for real-time quantitative assessment of the accuracy and precision of a QI metric in individual patients during a clinical trial. The method was applied to DCE-MRI-derived BV maps acquired during an ongoing clinical trial for poor-prognosis HN cancers. As the repeatability analysis cannot be done in the treated tumor volume owing to expected therapy-induced changes, a normal tissue region in the cerebellum that has little therapy-induced change was used as a reference region for BV measurements and hence to assess the accuracy and precision of BV maps. Our study showed that inaccurate and imprecise BV maps could be detected in real time before a clinical decision was made. This method can be extended to other QI metrics and body sites. This process should be a part of the workflow of a clinical trial.
Materials and Methods
Patients with advanced HN cancers were enrolled in an IRB-approved randomized phase-II clinical trial. Patients with advanced human papillomavirus-negative (HPV−) HN cancers (stage IV) or HPV+ T4/N3 HN cancers (stage III) were eligible for the trial. All patients gave their study-specific informed consent to participate in the trial. Patients underwent MRI scans before RT and after receiving 10 fractions (Fx) of 2 Gy per fraction of radiation.
All MRI scans were acquired on a 3 T MR scanner (Magnetom Skyra, Siemens Healthineers, Erlangen, Germany). Each patient underwent scanning in the radiation treatment position on a flat table top using the patient-specific immobilization face mask, head support, and bite bar. MRI series included 2-dimensional multislice pre- and postcontrast T1-weighted images with fat saturation (voxel size: 0.88 × 0.88 × 3.3 mm³; echo time [TE]/repetition time [TR] = 8.4/1040 milliseconds), 2-dimensional T2-weighted images (voxel size: 0.78 × 0.78 × 3.3 mm³; TE/TR = 89/11000 milliseconds), and 3-dimensional (3D) volumetric T1-weighted DCE images. The DCE image volumes were acquired using a 3D gradient-echo sequence in the sagittal orientation with a large field of view (FOV) in the superior and inferior directions to cover primary and nodal cancers, carotid artery, and cerebellum. The sagittal orientation allowed us to achieve higher temporal resolution and avoid time-of-flight effects from flowing blood spins (Figure 1). Other acquisition parameters included flip angle/TE/TR = 10°/0.97/2.73 milliseconds, FOV = 300 × 300 × 150 mm³, and voxel size ≈ 1.6 × 1.6 × 2.5 mm³. Sixty dynamic scans were collected over 3 minutes, with a temporal resolution of 3 seconds.
Extended Tofts Model for BV Quantification
Plasma volume maps (vp) were generated from the T1-weighted DCE image series using the extended Tofts model (26; equation 1). An in-house software package, the functional image analysis tool (FIAT), was used for image analysis and processing to generate parametric maps (20, 21); its implementation of the extended Tofts model has been validated using a digital reference object (DRO) (5). To convert the plasma volume maps to the BV maps, a hematocrit value of 0.45 was applied (27). A protocol-specific procedure of DCE analysis was established before initiation of the clinical trial, particularly regarding how to create an AIF. To obtain the AIF, a dynamic phase in which contrast had just entered the carotid artery was chosen by visually inspecting the temporal profile of the dynamic image volumes. Then, an AIF was generated by selecting the 20 voxels with the largest intensity changes on the selected phase compared with the average baseline image intensities. Finally, the AIF was visually inspected to make sure that its voxels were located within the carotid artery and had the expected dynamic profile. BV maps were derived from the extended Tofts model using the patient-specific AIF, and then coregistered to the postcontrast T1-weighted images at pre-RT using rigid-body transformation (20).
To ensure the quality of quantitative parametric maps, QA of hardware and software at the system level was performed routinely. System-level QA of the MRI scanner was performed daily, weekly, and yearly using an ACR water phantom following the ACR protocol. Daily signal-to-noise ratio variations were recorded and were stable. Also, in an NCI Quantitative Imaging Network (QIN) multicenter collaborative project, we evaluated accuracy, repeatability, and interplatform reproducibility of T1 quantification from variable flip angles using an NIST T1 water phantom on our scanner, compared to others (4). For software QA, performance of our implementation of the extended Tofts model was evaluated using a digital reference object, that is, synthesized DCE phantoms with and without noise, which was fully reported previously (5). Also, we participated in an NCI QIN multicenter AIF challenge to validate and compare our AIF delineation procedure with others' (15). Based upon these evaluations and validations, imFIAT has been granted a level-2 benchmark by NCI QIN (28).
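For orientation, the extended Tofts model is conventionally written in the form below. This is the standard textbook expression, not a transcription of the authors' equation (1). The symbols C_t (tissue contrast concentration), C_p (plasma concentration given by the AIF), K^trans, k_ep, and v_e are the usual ones and are assumed here, as is the common hematocrit relation between plasma volume and blood volume fractions.

```latex
% Extended Tofts model (conventional form; symbols assumed)
\[
C_t(t) = v_p\,C_p(t) + K^{\mathrm{trans}} \int_0^t C_p(\tau)\,
         e^{-k_{ep}(t-\tau)}\,d\tau ,
\qquad k_{ep} = \frac{K^{\mathrm{trans}}}{v_e}
\]
% Plasma volume fraction to blood volume fraction, with hematocrit Hct = 0.45
\[
v_b = \frac{v_p}{1-\mathrm{Hct}} = \frac{v_p}{0.55}
\]
```

Any further scaling needed to express BV in mL/100 g is not described in the text and is left out of this sketch.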
Individual-Level Assessment of Accuracy and Precision of BV Maps
Our pilot study indicates that the repeatability of BV values in the cerebellum is stable, at ∼18% (unpublished data). Also, the cerebellum in our patients received a mean radiation dose of <3 Gy after 10 Fx of 2 Gy treatment. Therefore, we chose the cerebellum as a reference region and manually drew bilateral volumes of interest (VOIs) across 2–3 slices, with a volume of ∼4 cc (∼1600 voxels), to extract mean BV values (Figure 2).
For each patient, MRI scanning was performed pre-RT and repeated after 10 Fx of radiation (2wkRT), which were considered as test and retest studies. An RC of BV values in the cerebellum VOIs was estimated using a 1-way analysis of variance (ANOVA) model (29). First, the within-subject mean square (WMS) was estimated from n patients. Then RC and relative RC were estimated by $\mathrm{RC} = 2.77 \times \sqrt{\mathrm{WMS}}$ and $\mathrm{rRC} = 100 \times \mathrm{RC}/\hat{X}$, respectively, where $\hat{X}$ was the grand mean of overall observations from n patients. Because the WMS for 2 repeated measurements was distributed as $\chi_n^2\,\mathrm{wSD}^2/n$, where wSD is the within-subject standard deviation, the 95% confidence interval (CI) of the estimated RC was given by $\mathrm{RC}_L = \mathrm{RC} \times \sqrt{n/\chi_n^2(0.975)}$ and $\mathrm{RC}_U = \mathrm{RC} \times \sqrt{n/\chi_n^2(0.025)}$, where $\chi_n^2(a)$ was the a-th quantile of the $\chi^2$ distribution with n degrees of freedom.
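The RC estimation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes 2 measurements per patient, takes WMS as the sum of squared test–retest differences divided by 2n, and uses the chi-square form of the 95% CI, RC × √(n/χ²ₙ(·)). The function names (`chi2_cdf`, `chi2_ppf`, `repeatability`) are hypothetical, and the chi-square quantile is computed with a small standard-library routine to keep the example self-contained.

```python
import math


def chi2_cdf(x, k):
    """Chi-square CDF with k degrees of freedom (series for the incomplete gamma)."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for i in range(1, 10000):
        term *= z / (a + i)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_ppf(p, k):
    """Quantile of the chi-square distribution, found by bisection on the CDF."""
    lo, hi = 0.0, k + 20.0 * math.sqrt(2.0 * k) + 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def repeatability(test, retest):
    """Return (RC, rRC in %, RC lower bound, RC upper bound) for paired scans."""
    n = len(test)
    # Within-subject mean square for 2 repeats: sum of squared differences / (2n)
    wms = sum((a - b) ** 2 for a, b in zip(test, retest)) / (2.0 * n)
    rc = 2.77 * math.sqrt(wms)
    grand_mean = (sum(test) + sum(retest)) / (2.0 * n)
    rrc = 100.0 * rc / grand_mean
    # 95% CI from the chi-square distribution with n degrees of freedom
    rc_lower = rc * math.sqrt(n / chi2_ppf(0.975, n))
    rc_upper = rc * math.sqrt(n / chi2_ppf(0.025, n))
    return rc, rrc, rc_lower, rc_upper
```

In practice a statistics library would normally supply the chi-square quantile; the hand-written version here only avoids an external dependency.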
To assess the accuracy and precision of BV values in each individual patient, two reference quantities were computed from n patients: a group mean (Mn) of BV in the cerebellum VOIs, used as a reference value with a 95% CI defined by the standard deviation (SDn), and an RCn with a 95% CI defined by RCL and RCU. For each new scan, we determined whether the mean BV value in the cerebellum VOI fell between Mn − 2SDn and Mn + 2SDn. If so, the BV map was deemed accurate with 95% confidence. For each new patient, we determined whether the difference in BV between the 2 scans (test and retest) fell between −RCn and RCn. If so, the BV maps of this new patient were considered repeatable with 95% confidence. When the new patient's data passed both tests, the BV maps could be used to update the reference value and RC. Otherwise, the BV maps from this patient were flagged for further evaluation or correction before being used in the clinical trial.
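The two per-patient checks can be expressed as a short Python sketch. The names `Reference` and `check_patient` are hypothetical. The choice to apply the accuracy check to both scans and to treat the limits as inclusive is an assumption, since the text does not spell out boundary handling.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    mean: float  # group mean M_n of cerebellum BV (mL/100 g)
    sd: float    # standard deviation SD_n
    rc: float    # repeatability coefficient RC_n


def check_patient(ref, bv_test, bv_retest):
    """Apply the accuracy and repeatability checks to one patient's 2 scans."""
    lower, upper = ref.mean - 2 * ref.sd, ref.mean + 2 * ref.sd
    # Accuracy: each scan's mean cerebellum BV must lie within M_n +/- 2 SD_n
    accurate = all(lower <= bv <= upper for bv in (bv_test, bv_retest))
    # Repeatability: the test-retest difference must lie within +/- RC_n
    repeatable = abs(bv_retest - bv_test) <= ref.rc
    return {"accurate": accurate, "repeatable": repeatable,
            "passed": accurate and repeatable}


# Illustrative values only; the second patient is invented for the example.
ref = Reference(mean=2.21, sd=0.16, rc=0.37)
ok = check_patient(ref, 2.25, 2.18)       # passes both checks
flagged = check_patient(ref, 2.20, 3.05)  # fails accuracy and repeatability
```

A patient whose result has `passed` set to true would then be added to the pool used to recompute the reference mean, SD, and RC.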
Other Statistical Analysis
A paired t test was performed to examine whether there was any difference between mean BVs measured at test and retest, with a P-value <0.05 considered statistically significant. The distribution of differences in mean BV values between the 2 scans was tested for normality using the Shapiro–Wilk test. In addition, to detect a potential relationship between the measurement error and the magnitude of the combined mean BV values of the 2 scans, a rank correlation (Kendall's tau) test of the absolute differences against their combined means was performed.
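As a rough illustration of two of these statistics, the sketch below computes the paired t statistic and Kendall's tau using only the Python standard library. It is an assumption-laden simplification: the function names are hypothetical, the tau shown is the tau-a variant (no tie correction), and P-values and the Shapiro–Wilk test are omitted because they would normally come from a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(test, retest):
    """t statistic of the paired differences (n - 1 degrees of freedom)."""
    d = [a - b for a, b in zip(test, retest)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def error_vs_magnitude(test, retest):
    """Tau between absolute test-retest differences and their combined means."""
    abs_diff = [abs(a - b) for a, b in zip(test, retest)]
    combined = [(a + b) / 2 for a, b in zip(test, retest)]
    return kendall_tau_a(abs_diff, combined)
```

A tau near zero from `error_vs_magnitude` would be consistent with measurement error that does not scale with the magnitude of BV.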
Association Between Repeatability of AIF Peak and BV
As noted, we used a fixed imaging protocol to minimize variations in acquisition. However, it was unknown how repeatability of AIF was associated with repeatability of BV values. To examine this association, we measured the AIF peak value for each scan and calculated the RC from the 2 scans. We compared percentage differences of AIF peaks between the 2 scans with those of BV values measured in the cerebellum VOIs.
At the time of this report, 62 consecutive patients (median age, 62 years; male, 52; female, 10) were enrolled in the clinical trial. For the first 10 patients, the mean (±SD) BV values from test and retest were 2.22 (±0.13) mL/100 g and 2.21 (±0.19) mL/100 g, respectively, and not significantly different (P-value = 0.79: paired t test), yielding an overall group mean (±SD) of 2.21 (±0.16) mL/100 g (see Table 1). The difference in the BV values between test–retest studies was independent of the combined mean (P-value = 0.21: Kendall tau test), indicating that the measurement error was independent of the magnitude of measured BV values. Also, the Shapiro–Wilk test showed that the differences in BV values between the 2 examinations were normally distributed. An RC of BV values between the 2 tests was estimated to be 0.37, yielding a relative RC (rRC) of 16.7% with a 95% CI of (11.7%, 29.4%). Using leave-1-out cross-validation, we did not find any outlier among the first 10 patients. Therefore, we used M10 and RC10 as starting reference values to evaluate the next patient (Table 1).
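The reported rRC and its interval for the first 10 patients can be checked with a short calculation. This worked example assumes the CI takes the chi-square form RC × √(n/χ²ₙ(·)) with n = 10 degrees of freedom, and uses the tabulated quantiles χ²₁₀(0.975) ≈ 20.48 and χ²₁₀(0.025) ≈ 3.247. Inputs are the rounded values quoted in the text, so agreement is only to rounding precision.

```latex
\[
\mathrm{rRC} = 100 \times \frac{0.37}{2.21} \approx 16.7\%
\]
\[
\mathrm{rRC}_L \approx 16.74 \times \sqrt{\frac{10}{20.48}} \approx 11.7\%,
\qquad
\mathrm{rRC}_U \approx 16.74 \times \sqrt{\frac{10}{3.247}} \approx 29.4\%
\]
```

Both bounds agree with the reported interval of (11.7%, 29.4%).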
BV measurements from 62 patients were evaluated in real time, and 3 patients were identified as having inaccurate BV values in 1 of the 2 scans (Figure 3). Mean BVs measured in these 3 patients ranged from 3.05 to 3.95 mL/100 g, well above the upper accuracy limit of the group mean + 2 × SD (2.52 mL/100 g). The repeatability tests found that the percentage differences of BV values between the 2 scans of these 3 patients were much greater than the uncertainty range defined by −RC and RC. Note that our procedure detected the large variations of BV values in these 3 scans in real time rather than in a retrospective analysis. The consequences of the BV maps for decision-making with and without correction were evaluated and discussed with the physicians during the clinical trial.
As patients were enrolled into the clinical trial, the data from these 3 patients were excluded from the updated reference values for accuracy and precision measurements. One additional patient, who had BV values within the normal range for both test and retest, was excluded owing to partial coverage of the cerebellum in 1 scan and mismatched cerebellum slices between the 2 scans. As a result, the data from 58 patients were included to update the reference values. The group mean (±SD) of BV values was 2.21 (±0.14) mL/100 g at test and 2.22 (±0.17) mL/100 g at retest, which were not significantly different (P-value = 0.73: paired t test; see Table 1), suggesting stability of the quantified BV maps. Also, the absolute differences were independent of their combined means (P-value = 0.67: Kendall tau test). ANOVA led to an RC of 0.35 and an rRC of 15.9% with a 95% CI of (13.5%, 19.5%). Note that the 95% CI (uncertainty) of the estimated RC narrowed as the number of patients increased. Figure 4 shows a plot of percentage differences of BV values between test–retest studies versus their combined means. As shown in the plot, the percentage differences of mean BVs from the 3 patients who had inaccurate mean BVs were much larger than the RC interval (% difference > 33% at the lowest), indicating imprecision in the repeated measures.
Finally, the relative RC of the AIF peak values was 61.8%. Figure 5 shows a scatter plot of percentage differences of BV values in the cerebellum VOIs versus those of AIF peak values between the 2 scans. Note that there was no association, or even a trend, between the 2 differences, suggesting that the variation of AIF peaks could not explain the variation in BV measurements.
In this study, we developed and evaluated a methodology and metrics for real-time quantitative assessment of the accuracy and precision of DCE-MRI-derived metrics using reference values in a normal reference tissue region. It is critical to establish such a real-time QA test in the workflow of a clinical trial to identify unreliable estimates of QI metrics before they are used in the trial. A subsequent action should be planned in the design of a clinical trial. A real-time QA procedure for QI metrics in individual patients would enhance the ability of the trial to achieve its objectives and increase the reliability of scientific findings. Our method can be extended to other QI metrics and body sites to support individualized therapy and improve therapeutic outcomes.
It is worth noting that the accuracy and precision of BV values investigated in this study do not represent how accurately the QI metric measures a true physiological BV. As discussed in the Introduction, they are measures of the bias and variation, relative to reference values, of BV values as a QI metric quantified from HN DCE-MRI using the extended Tofts model. Our data show that the group mean and RC of BV values in the cerebellum are stable, suggesting that the cerebellum is a good candidate for use as a reference region. As anticipated, the 95% CI of the estimated RC narrows as the sample size increases. Using these reference values, we are able to detect unreliable QI measures of individual patients in real time during the clinical trial. Our test is different from test and retest analysis performed before therapy. The latter helps us understand the general technical behavior of a QI metric in a population sample, but it does not tell us whether the metric acquired in each patient in a clinical trial is reliable. Finally, the impact of the uncertainty of a QI metric on the decision-making process needs to be investigated in the future.
As shown in this study, reference values have to be established in a reference tissue region to perform the proposed QA test. The reference tissue region chosen may depend upon the image type and body site of interest. However, the QI metric in a reference region has to be stable, little affected by therapy, and within the FOV of the scan. In our preliminary investigation, we tested the sternocleidomastoid muscle (SCM) contralateral to the tumor as a possible tissue reference region. We found that the BV values in the SCM were not as stable as those in the cerebellum, possibly owing to low BV in the SCM. Also, in some cases, tumors are distributed bilaterally, so that there is no noninvolved SCM that can be used as a reference region. On the other hand, the cerebellum receives a radiation dose of only a few Gy (<3 Gy) during HN cancer treatment, and BV changes in the cerebellum VOIs after 10 Fx of RT do not show any positive or negative trend (Figure 4), suggesting that the treatment effect within the cerebellum is minimal and can be ignored. Reference values of BV in the cerebellum VOIs are adequate for evaluation of the overall quality of BV maps, as MRI data are acquired in k-space and BV maps are determined by a single AIF. However, local motion, e.g., swallowing, can cause local degradation in DCE-MRI, which cannot be captured by the analysis performed in the normal reference region. Therefore, caution is still needed when using QI metrics during a therapeutic trial.
In our study, patient positioning, scanner, imaging protocol, acquisition procedure, and analysis software and process were controlled carefully to maintain consistency of QI metric quantification during the clinical trial. The factors that can influence repeatability of DCE-MRI-derived QI metrics include patient positioning, image registration, AIF delineation, image noise, image processing, treatment effect, and unknown physiological fluctuation. We further investigated the repeatability of AIF peaks, as well as its influence on the repeatability of BV maps, but found no relationship between the differences in the BV values and those in the AIF peaks between the 2 scans (Figure 5). These findings indicate that the AIF peak variation cannot by itself explain the variation in the BV measures.
In conclusion, the present study developed and evaluated a methodology for quantitative assessment of the accuracy and precision of DCE-MRI-derived BV maps in a phase-II randomized clinical trial for poor-prognosis HN cancers. The outlined framework was able to detect outliers, that is, to identify the individual patients who had unreliable BV values in real time during the clinical trial. Because the accuracy and precision of QI metrics influence decision-making in individualized and adaptive cancer therapy, individual QA testing of such QI metrics needs to be integrated into the clinical trial workflow to warrant success of the trial.