Quantitative imaging biomarkers (QIBs) provide medical image–derived intensity, texture, shape, and size features that have potential use in the characterization of disease and prediction of clinical outcomes. In the evolving field of radiomics, large numbers of potentially informative novel and diverse QIBs are extracted and studied for the personalization of disease treatment, particularly in oncology (1, 2). Examples of single institution–based studies of imaging biomarkers include brain cancer (3, 4), head and neck cancer (5–8), lung cancer (9–13), nasopharyngeal carcinoma (14), prostate cancer (15, 16), and sarcoma (17). Other research has focused on performance of QIBs across multiple institutions, such as the analysis provided by Castelli et al. (18) regarding the predictive value of quantitative fluorodeoxyglucose positron emission tomography (FDG PET) in 45 studies of head and neck cancer.
Despite the growing body of radiomics research and the established use of some imaging biomarkers, such as metabolic tumor volume (MTV), few new QIBs have been adopted for clinical decision-making. Cancer Research UK and the European Organisation for Research and Treatment of Cancer, with NCI involvement, recently convened a consensus group to make recommendations for accelerating the clinical translation of imaging biomarkers. To that end, the group published a roadmap for navigating 3 main domains through which biomarker development passes: 1) discovery, 2) validation, and 3) qualification (19). In general, discovery is the process of identifying biomarkers associated with a disease or disease outcome of interest in a limited patient population, whereas validation and qualification are formal assessments of biomarker performance and clinical utility in a broader population. Biomarker validation can be further divided into 2 complementary tasks, namely, technical validation and clinical validation, which focus on the quality of measured biomarker values and on measured associations with disease, respectively. The third domain, qualification, involves establishing the fitness of biomarkers for specific clinical applications. Application of appropriate statistical methods is essential for the development of new clinically applicable QIBs. In particular, this process requires proper statistical estimation of measurement accuracy and precision for both technical and clinical validation, and proper statistical design and analysis of clinical trials to establish clinical utility.
In this paper, we use a new statistical approach for technical and clinical validation of QIBs derived from head and neck cancer FDG PET scans to investigate the impact of tumor segmentation variability across multiple institutions on the estimation of study power to design clinical trials (20). The approach uses a hierarchical Bayesian model to estimate systematic and random QIB measurement errors and simultaneously estimate the effects of these errors on study power to predict clinical outcomes. Specifically, our study is focused on 22 radiomic QIBs that were previously investigated regarding their ability to predict outcome in the treatment of head and neck cancer (21). The QIBs are derived from lesion segmentations resulting from an FDG PET/CT segmentation challenge involving 7 institutional members of the NCI Quantitative Imaging Network (QIN) (22, 23). All participating QIN members routinely use different approaches for lesion segmentation. Thus, the network provides an ideal setting within which to study the impact of segmentation on radiomic QIBs across methods and institutions. While our work focuses on errors arising from the use of different segmentation tools, the statistical methods used are broadly applicable to other settings in which scanner, operator, or other image source differences contribute to QIB measurement errors.
Application to FDG PET imaging is of substantial interest for QIB development, because it is an established imaging approach for the quantification of cancer tumor burden (24–26). QIB extraction from FDG PET images involves several steps, including image acquisition and reconstruction. In addition, for many QIBs, segmentation of all tumors is required for calculating QIB values. Tumors may be segmented in a number of ways. Standard clinical practice is manual segmentation by trained experts (eg, radiation oncologists). Alternatively, a number of segmentation tools have been developed to help decrease human effort and increase segmentation consistency. These tools range from semiautomated to fully automated (27). Although QIBs derived from tumor segmentations can be profoundly impacted by variation and bias in segmentation methods, existing studies provide little insight into the impacts of different methods on derived QIBs. This work addresses that gap.
Errors in PET-derived QIBs have been studied previously, primarily in terms of repeatability and reproducibility. Traverso et al. (28) performed a systematic review of 41 full-text articles to assess consensus regarding the robustness of commonly utilized radiomics QIBs for PET, CT, and MRI. The authors encountered error metric reporting of intraclass correlation coefficient (ICC) in 14 studies, correlation coefficient in 12, and various other descriptive statistics in 9. Bailly et al. (29) assessed variability of QIBs in relation to their dependence on different PET/CT reconstruction methods with coefficient of variation and percent deviation. Dice coefficient, ICC, and confidence interval half widths were used by Altazi et al. (30) to evaluate PET/CT radiomic features in patients with cervical cancer. Kalpathy-Cramer et al. (31) report concordance correlation coefficients for the assessment of radiomic features from lung nodules in a multi-institutional study. Lu et al. (32) summarized reliability of radiomic features across image acquisition settings with R². Although the aforementioned studies use univariate or ANOVA-based statistics to estimate error, there are very few examples of simultaneous estimation of systematic and random errors. Beichel et al. (33) used linear mixed effects regression to compare quality and variability of tumor volume measurements from the same QIN PET segmentation challenge analyzed herein. However, the Bayesian approach used in this study analyzes the impact of all segmentation approaches simultaneously, provides estimates of study power, and covers 22 radiomic features; it therefore illustrates a new statistical approach for QIB validation and qualification.
Our Bayesian statistical approach and QIN challenge application are described in the following section. Thereafter, analysis results are given to offer comparisons of measurements from challenge participants. Finally, a discussion is provided of the results and their implications for the current and future state of radiomic biomarker assessment and development.
Quantitative Imaging Biomarkers
QIBs were derived from FDG PET/CT scans of patients with head and neck squamous cell carcinoma acquired at The University of Iowa Hospitals and Clinics (UIHC). Scans were collected, curated, and uploaded to TCIA (34) [collection: QIN-HEADNECK (35)] as part of the NCI QIN (22). A QIN segmentation challenge was conducted in which a subset of 10 diverse pretreatment scans containing 47 lesions were segmented manually by 3 experienced radiation oncologists at the UIHC and by the following QIN sites: Columbia University Medical Center, H. Lee Moffitt Cancer Center and University of South Florida, Memorial Sloan Kettering Cancer Center, Simon Fraser University (Canada), University of Pittsburgh, The University of Iowa, and The University of Washington Medical Center. Sites were allowed to use segmentation tools of their choosing. Tools included both commercially available software and academic, in-house-developed segmentation algorithms. Deidentified summaries of the methods are given in Table 1. Further details of the challenge scanner acquisition and segmentation methods as well as evaluations of segmentation performance are given by Beichel et al. (33).
Forty-seven head and neck tumors in the 10 PET/CT scans were segmented using 7 different methods by the challenge participants. Each scan was segmented twice with a time interval between initial and repeat segmentation. A challenge coordinator at The University of Iowa collected the segmentations and derived 22 QIBs with the 3D Slicer software for medical image informatics, image processing, and 3-dimensional visualization (36). The QIBs derived from lesion segmentations are summarized in Table 2 and include 5 of the most commonly used clinical biomarkers and 17 biomarkers available from the PET-IndiC extension (37) for the 3D Slicer, which were assessed in the context of outcome prediction by Beichel et al. (21). The PET-IndiC QIBs are generally designed to characterize standardized uptake value (SUV) patterns within segmented lesions by using descriptive statistics.
| QIB | Description | Type |
| --- | --- | --- |
| Max | Maximum value in region of interest (SUV) | C |
| Peak | Maximum average gray value calculated from a 1 cm³ sphere placed within the region of interest (45) (SUV) | C |
| Mean | Mean value in region of interest (SUV) | C |
| MTV | Volume of region of interest (mL) | C |
| TLG | Total lesion glycolysis (mL) | C |
| Min | Minimum value in region of interest (SUV) | I |
| Standard | Standard deviation in region of interest (SUV) | I |
| RMS | Root mean square value in region of interest (SUV) | I |
| First Quartile | 25th percentile value in region of interest (SUV) | I |
| Median | 50th percentile value in region of interest (SUV) | I |
| Third Quartile | 75th percentile value in region of interest (SUV) | I |
| Upper Adjacent | Largest value in region of interest not greater than the third quartile plus 1.5 times the interquartile range (SUV) | I |
| Q1 Distribution | Percent of gray values that fall within the first quarter of the grayscale range within the region of interest (%) | I |
| Q2 Distribution | Percent of gray values that fall within the second quarter (%) | I |
| Q3 Distribution | Percent of gray values that fall within the third quarter (%) | I |
| Q4 Distribution | Percent of gray values that fall within the fourth quarter (%) | I |
| Glycolysis Q1 | Lesion glycolysis calculated from the first quarter of the grayscale range within the region of interest (mL) | I |
| Glycolysis Q2 | Lesion glycolysis calculated from the second quarter (mL) | I |
| Glycolysis Q3 | Lesion glycolysis calculated from the third quarter (mL) | I |
| Glycolysis Q4 | Lesion glycolysis calculated from the fourth quarter (mL) | I |
| SAM | Standardized added metabolic activity (46) (mL) | I |
| RA | Rim average; mean of uptake in a 2-voxel-wide rim region around the region of interest (SUV) | I |

Type: C = clinical biomarker; I = PET-IndiC biomarker.
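As a concrete illustration of how the descriptive QIBs in the table above are computed, the sketch below derives several of them from a SUV image and a binary lesion mask (hypothetical NumPy arrays; the function name is illustrative, and the Peak, Upper Adjacent, glycolysis-quartile, SAM, and RA features are omitted for brevity):

```python
import numpy as np

def lesion_qibs(suv, mask, voxel_volume_ml):
    """Compute a few of the table's QIBs from a SUV image and a binary
    lesion mask (illustrative sketch of the descriptive statistics)."""
    vals = suv[mask > 0]                       # SUVs inside the region of interest
    mtv = vals.size * voxel_volume_ml          # metabolic tumor volume (mL)
    q1, med, q3 = np.percentile(vals, [25, 50, 75])
    qibs = {
        "Max": vals.max(),
        "Mean": vals.mean(),
        "MTV": mtv,
        "TLG": vals.mean() * mtv,              # total lesion glycolysis
        "Min": vals.min(),
        "Standard": vals.std(ddof=1),
        "RMS": np.sqrt(np.mean(vals ** 2)),
        "First Quartile": q1,
        "Median": med,
        "Third Quartile": q3,
    }
    # Qk Distribution: percent of voxels whose SUV falls in each quarter
    # of the lesion's grayscale range
    edges = np.linspace(vals.min(), vals.max(), 5)
    counts, _ = np.histogram(vals, bins=edges)
    for k in range(4):
        qibs[f"Q{k + 1} Distribution"] = 100.0 * counts[k] / vals.size
    return qibs
```

In practice these quantities were derived with 3D Slicer and the PET-IndiC extension; the sketch only mirrors the definitions given in the table.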
Statistical analysis focused on quantifying random and systematic differences in QIB measurements across segmentation methods. For each method, descriptive means and standard deviations were computed over the population of segmented lesions. Agreement between and variability within the methods were estimated with a Bayesian regression modeling approach (20). This approach was taken to ensure that statistical inferences accounted for the study design: biomarkers were derived from 8 different segmentation methods applied to a common set of 47 lesions, with manual segmentation performed by 3 different operators, each semiautomated segmentation performed by 1 operator, and 2 segmentations performed per operator and lesion. In brief, the statistical model for the biomarker measurement of lesion $i$, operator $j$, and segmentation $k$ is composed of the following mean and variance components:
$y_{ijk} = \mu + b_i + \epsilon_{ijk}$, with between-lesion effects $b_i \sim N(0, \sigma_b^2)$ and within-lesion errors $\epsilon_{ijk} \sim N(0, \sigma_w^2)$.

Biomarker measurements from multiple readers and/or multiple segmentations of the same lesion are averaged together as $\bar{y}_i$. The within-lesion variance $\sigma_w^2$ captures the variability of the multiple readers and segmentations of manual segmentation and of the single reader and repeat segmentations of the other methods; the between-lesion variance is $\sigma_b^2$. The modeled means and variances are allowed to vary by segmentation method, denoted below with a subscript $m$. Thus, application of the Bayesian model to the data provided estimates of mean differences between methods as well as of between- and within-lesion variability. Systematic mean differences were assessed relative to manual segmentation, the current standard for image segmentation of head and neck cancer, and were estimated as the relative biases in population mean QIB measurements from semiautomated methods:

$\mathrm{bias}_m = (\mu_m - \mu_{\mathrm{manual}}) / \mu_{\mathrm{manual}}$.

Agreement between methods was estimated with the concordance index (C-index) (38, 39). The C-index is a nonparametric, rank-based performance metric that can be interpreted as the probability that a randomly selected pair of lesions will have QIB measurements with the same ordering on both segmentation methods being compared. Values of 1 and 0.5 represent perfect and chance concordance, respectively. Relative between- and within-lesion variability were estimated with the ICC and the within-lesion coefficient of variation (wCV), respectively. The ICC is defined as the variance in biomarker values between lesions relative to the total variance, and it is calculated together with the wCV as follows:

$\mathrm{ICC}_m = \sigma_{b,m}^2 / (\sigma_{b,m}^2 + \sigma_{w,m}^2), \quad \mathrm{wCV}_m = \sigma_{w,m} / \mu_m$.
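The agreement and variability metrics just described can be sketched with simple moment-based estimators (a frequentist stand-in for the hierarchical Bayesian fit of reference 20; function names and the data layout are illustrative):

```python
import numpy as np

def icc_wcv(measurements):
    """ICC and within-lesion coefficient of variation (wCV) from an
    (n_lesions, n_repeats) array of repeated QIB measurements.
    Moment-based stand-in for the Bayesian variance-component estimates."""
    y = np.asarray(measurements, dtype=float)
    n, k = y.shape
    var_w = np.mean(y.var(axis=1, ddof=1))     # pooled within-lesion variance
    # variance of lesion means overstates sigma_b^2 by var_w / k
    var_b = max(y.mean(axis=1).var(ddof=1) - var_w / k, 0.0)
    return var_b / (var_b + var_w), np.sqrt(var_w) / y.mean()

def c_index(a, b):
    """Probability that a random pair of lesions is ordered the same by
    two methods' measurements a and b (1 = perfect, 0.5 = chance)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    agree, pairs = 0, 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            pairs += 1
            agree += (a[i] - a[j]) * (b[i] - b[j]) > 0
    return agree / pairs
```

The Bayesian model additionally propagates estimation uncertainty in these quantities, which the simple moment estimators above do not.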
Simulation studies were conducted to assess the impact of segmentation methods on estimating associations between QIBs and clinical outcomes, as described by Smith and Beichel (20). The general approach taken in the simulations is to define true relationships between manually segmented QIBs and a binary outcome and then to assess the degree to which QIBs from other segmentation methods can recover the true relationship. Specifically, the probability of a hypothetical binary outcome $y_i$ was defined in terms of a logistic relationship with the manually segmented QIB $x_i$, such that

$\mathrm{logit}\,\Pr(y_i = 1) = \beta_0 + \beta_1 x_i$, with $\beta_1 = \log(2)$ corresponding to an odds ratio of 2.
Then, logistic regression models were fit using the QIBs measured with the other (semiautomated) methods to estimate the statistical power to recover the true odds ratio (OR) of 2, a result useful for determining sample size when designing clinical trials. Also estimated from the simulations were the method-specific biases and variances of the estimated ORs, along with the root mean square error (RMSE), which combines estimation accuracy and precision. Lower bias, variance, and RMSE indicate better estimation of the true value.
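The power simulation just described can be sketched as follows, with segmentation error represented as additive noise on a standardized QIB (the noise level, sample size, and simulation count are illustrative choices, not the settings of the actual study):

```python
import numpy as np

def fit_logistic(x, y, iters=25):
    """Two-parameter logistic regression by Newton-Raphson.
    Returns the slope estimate and its standard error."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])            # Fisher information
        b += np.linalg.solve(H, X.T @ (y - p))
    cov = np.linalg.inv(H)
    return b[1], np.sqrt(cov[1, 1])

def power_for_method(noise_sd, n=100, sims=500, seed=1):
    """Fraction of simulated trials in which the slope (true OR = 2 per
    unit of the standardized QIB) is detected at the 5% level when the
    QIB is remeasured with extra segmentation noise of SD `noise_sd`."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        x_true = rng.standard_normal(n)       # "manually segmented" QIB
        p_true = 1.0 / (1.0 + np.exp(-np.log(2) * x_true))
        y = (rng.random(n) < p_true).astype(float)
        x_meas = x_true + noise_sd * rng.standard_normal(n)
        beta, se = fit_logistic(x_meas, y)
        hits += abs(beta / se) > 1.96
    return hits / sims
```

Noisier measurements attenuate the estimated slope and reduce power, which is the mechanism by which a more variable segmentation method inflates the sample size required for a trial.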
For descriptive comparisons, QIB means and standard deviations were computed over the measurements obtained from each segmentation method applied to the population of head and neck tumors included in the QIN challenge. The distributions of these method-specific population statistics are summarized with boxplots in Figures 1 and 2. In the plots, QIB values are displayed on the log scale to depict distributional variability relative to their different measurement scales. Distributions of the population means show how similar the methods are, on average, with respect to their QIB measurements. Accordingly, method means are most similar for the Max, Peak, and Mean clinical QIBs, as well as for the root mean square (RMS), First Quartile, Median, Third Quartile, and Upper Adjacent PET-IndiC QIBs. Similarities among the methods in the overall variability of their QIB measurements can be gauged from the distribution plots of population standard deviations. As with the population means, methods are most similar for the Max and Peak clinical QIBs. Otherwise, more dissimilarities are observed among the other clinical and PET-IndiC QIBs. Also noteworthy are the mean and standard deviation dissimilarities apparent in MTV measurements, indicating sensitivity of volumetric measurements to segmentation method. Method-specific estimates of the QIB means and standard deviations can be found in Supplemental Table 1.
Distributions of between- and within-method variability are summarized in Figures 3 and 4 with ICC and wCV, respectively. The ICC plots show the agreement of semiautomated segmentation methods with manual segmentation. Consistent with the population plots, there is near-perfect (ICC = 1) agreement among all methods for Max and Peak. High degrees of agreement are seen for Mean, Standard, RMS, First Quartile, Median, Third Quartile, and Upper Adjacent. Q1–Q4 Distributions exhibit very poor agreement, whereas the remaining QIBs have fairly good agreement. Within-method variability as measured by wCV tends to be low for many of the QIBs that have high agreement. Notable exceptions are MTV, TLG, and several of the PET-IndiC QIBs, which have moderate ICC but high wCV. Method-specific estimates of QIB variability as well as agreement are given in Supplemental Table 2.
Results of simulation studies are summarized in Figures 5 and 6 for N = 100 hypothetical binary clinical outcomes. As described in the methods, outcomes were repeatedly simulated from the logistic model applied to QIBs from manual segmentation. Bias, variability, and power were then calculated for ORs estimated with QIBs derived from the other, semiautomated segmentation methods. As such, manual segmentation defines the true relationship between QIBs and clinical outcomes, and the results quantify the quality of estimates that can be obtained with the other methods. Taking into account both estimation bias and variability, RMSE values plotted in Figure 5, and tabulated in Supplemental Table 3, show relatively low error for Upper Adjacent, RMS, Third Quartile, Mean, Glycolysis Q4, Max, and Peak. Statistical power to detect effects of the QIBs, at the 5% level of significance, is summarized in the heatmap of Figure 6. The QIBs and methods in the heatmap are ordered according to similarity measures from hierarchical clustering of their powers, with dendrograms for both displayed to the top and right of the heatmap. Power is generally inversely related to RMSE. Overall, the effects of clinical QIBs, compared with those of PET-IndiC QIBs, were less affected by the segmentation method. With respect to methods, QIB measurements from segmentation method 2 are most similar to those from manual segmentation. After that, the group of methods 3 and 7 is most similar to method 2. Method 4 produced the outlying values depicted as individual dots on the boxplots discussed previously and has the lowest power. Accordingly, a clinical trial planning to use segmentation method 4 would require a larger sample size for most of the QIBs. Likewise, within a method, the study power would vary depending on the biomarker for which a trial is being designed.
Based on the previously discussed measures of agreement, variability, and power, hierarchical clustering was used to identify QIB groupings for which the impact of segmentation was low, moderate, high, or extreme. Table 3 presents the clustering results and summarizes performance measures aggregated over the 7 semiautomated segmentation methods. Coefficients of variation computed from the segmentation-specific population means were 7.8%, 7.3%, 53.8%, and 27.7% in the low, moderate, high, and extremely impacted biomarker groups, respectively. Agreement with manual segmentation, as measured by absolute relative bias, was 6.7%, 13.2%, 52.0%, and 51.1%, respectively. The extreme group stood apart from the others, with a comparatively poor ICC of 0.603 and power of 26.9%. Average ICCs for the low through highly impacted groups were markedly better at 0.993, 0.966, and 0.892, and powers were 85.1%, 78.1%, and 67.7%, respectively.
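The grouping step above can be sketched with off-the-shelf hierarchical clustering (this assumes SciPy is available; average linkage on Euclidean distance is an illustrative stand-in, as the exact linkage and distance choices are not restated here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def group_qibs(profile_matrix, n_groups=4):
    """Cluster QIBs (rows) by the similarity of a performance profile,
    e.g., power across segmentation methods (columns), into groups such
    as the low/moderate/high/extreme impact categories of Table 3."""
    Z = linkage(profile_matrix, method="average", metric="euclidean")
    return fcluster(Z, t=n_groups, criterion="maxclust")
```

Rows with similar performance profiles land in the same group, regardless of the absolute scale of any single method's values.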
Segmentation Impact on QIBs
In this work, a unified Bayesian modeling approach was applied to estimate QIB measurement errors and their effects on statistical power. The approach enables quantification and comparison of the effects of different tumor segmentation methods on the panel of 22 QIBs. Clinical QIBs have long been used in clinical research and practice to characterize disease and to assess disease progression. A widely used example is the RECIST criteria for defining tumor response in clinical trials in terms of change in imaged tumor size (40, 41). Improvements in and increasing access to medical imaging have fueled interest in other, more advanced QIBs, such as those in the PET-IndiC panel. Unfortunately, QIB measurements are subject to errors from multiple sources, including scanner makes and models, settings, reconstruction algorithms, segmentation methods, and biologic variability. Our approach enables illustration of the effects of segmentation methods on random and systematic differences as well as on statistical power.
Segmentation of tumors defines volumes of interest (VOIs) within which voxel intensities are used to calculate various QIBs that quantify different properties of tumors. Typically, manual and semiautomated tumor segmentation approaches, as used within this work, are subject to various degrees and types of variation in the generated VOIs. An example of tumor segmentation differences between QIN sites is depicted in Figure 7. Intuition would suggest that segmentation methods have less of an effect on QIBs whose calculations are less dependent on accurate VOI definition. Indeed, several new segmentation methods have been motivated by the goal of extracting biomarkers that are insensitive to the exact VOI. For instance, Echegaray et al. (42, 43) propose "core samples" and "digital biopsy" segmentation methods for which several intensity and texture features were shown to be consistent with a reference standard. In our comparison of multiple segmentation methods, quantile-based QIBs (Max, Peak, First Quartile, Median, Third Quartile, and Upper Adjacent) extracted primarily from interior voxel intensities have particularly high agreement of the population means and relatively high agreement of the population standard deviations, high ICC, low wCV, low RMSE, and high power. The measurements of mean-based QIBs (Mean, Standard, RMS, and rim average [RA]) also exhibit relatively high degrees of reliability. However, such relatively simple QIBs might not be able to capture desirable characteristics of tumors, such as texture. The remaining QIBs, which utilize the VOI more broadly in their calculation, are more affected by the segmentation method but might provide relevant information. Thus, it is imperative to study the impact of tumor segmentation variability on subsequent predictive modeling. For example, discovering a relationship between a QIB and an outcome might require more samples with a segmentation method that is more prone to variability than with one that is less prone.
Our technical performance assessments were designed to quantify the impact of segmentation on QIB measurements; however, ultimate interest often lies in QIB performance in the prediction of clinical outcomes, also known as clinical performance. To address clinical performance, our Bayesian approach provides simulation study results to characterize the effect of segmentation on the ability to recover associations between QIBs and a hypothetical clinical outcome. Many of the quantile measures that had good technical performance also had good clinical performance, that is, low RMSE and high power. A few exceptions were the lower power of Median, First Quartile, and RA. In addition, the low statistical power and high variation across segmentation methods for MTV are noteworthy, because many studies propose to utilize MTV for outcome prediction.
Typically, segmentation methods are evaluated regarding only their segmentation performance. Our statistical analysis approach enables the selection of methods regarding their suitability for specific QIBs. Method 4 stands out as having noticeably lower power than the other methods. In general, power varies differentially across QIBs and methods, thus helping explain why a QIB may be identified as statistically significant in one research setting but not in another when different segmentation methods are used.
The illustrated statistical approach can aid QIB development by providing estimates of technical and clinical performance for biomarker validation and of statistical power for clinical trial design. The application considered involves development of QIBs derived from different semiautomated segmentation methods. Such computer-aided analysis of medical images has the potential to advance the development of QIBs by decreasing the time needed to extract them and by increasing the consistency of their measurements. Image analysis methods are advancing rapidly with several semiautomated tools currently available for the segmentation and quantification of FDG PET images. Given the range and freedom of choices that exist, understanding the effects of different tools on the technical and clinical performance of QIBs derived from them is essential. To that end, the technical and clinical performance analysis results provided by the present study represent a baseline and provide a starting point for future improvements in imaging biomarker quantification. Furthermore, our analysis explores performance within a multisite (QIN challenge) setting in which different segmentation tools are used. Results show degrees of systematic and random differences between sites that highlight the need for improved consistency of segmentation tool algorithms and their application. Multiple courses of action should be considered to improve consistency. Tool application guidelines and training are important at the user level. In addition, tool consistency could be improved with application-specific method development and benchmarking against publicly available and clinically relevant data sets.
Improved consistency of computer-aided tools will increase the utility of QIBs for disease characterization and response assessment. This is particularly relevant for multicenter clinical trials and the field of radiomics in general, where images may be processed quantitatively by different operators and at different institutions. Future adoption of standards for tool development and statistical assessment, as well as reduced requirements for user operability, would benefit image analysis in such decentralized applications. The current state, however, is quite heterogeneous with respect to technologies, operators, and assessments.
The QIN challenge data analyzed in this study have some notable limitations. First, the scope is limited to the effect of segmentation method on QIB measurements. Other factors, such as scanner type, settings, and reconstruction algorithm, will also affect the measurements; all images were obtained at the same institution to reduce the effects of image acquisition differences on the results of the QIN challenge. Second, the challenge results may not generalize to cancers other than head and neck cancer, as the stability of biomarker measurements has been observed to differ across cancer types (44). These 2 limitations are characteristic of the data source and not of the statistical approach, which can be applied to estimate measurement error and predictive performance in other settings in which additional sources of measurement error are present. Third, there is no absolute ground truth segmentation for head and neck tumors. Instead, manual segmentation was used as a surrogate ground truth, or reference standard, for the calculation of agreement (C-index) and for the simulation studies to estimate RMSE and statistical power. To mitigate variability in this reference standard, manual segmentations were performed by 3 expert radiation oncologists at 2 separate time points and combined to derive reference QIB measurements for each tumor. Fourth, a synthetic simulation study was conducted to assess clinical performance rather than using actual clinical outcomes from patients. The advantage of this approach is that the true relationship between QIBs and simulated outcomes is known and can thus be used to estimate RMSE and power. Moreover, simulation is a valid and commonly used approach for the design of clinical trials. The disadvantage is that the statistical model used in the simulation may not fully reflect the complexities of true relationships between QIBs and clinical outcomes.
QIBs are becoming increasingly important in the characterization, treatment, and prognostication of disease. Clinical markers such as maximum SUV and tumor volume have a long history of use. The simplicity of their calculations lends them well to widespread adoption. However, that simplicity may limit their utility as prognostic indicators. Thus, there is interest in more advanced markers that utilize texture, shape, and intensity information from imaged tumors. Such features can be more prone to measurement errors owing to differences in segmentation methods or other image acquisition or processing steps. The statistical approach used here can help quantify QIB measurement error in the real-world (eg, multi-institutional) settings for which QIBs are being developed. Results from the approach could be used to prioritize QIBs that are less sensitive to measurement error, to identify standardizations needed in the process by which QIBs are derived, or to determine statistical power for clinical trial design. For example, our finding that the PET-IndiC features Standard, RMS, First/Third Quartile, Upper Adjacent, and RA have technical performance similar to maximum SUV and tumor volume suggests that these more advanced markers can be measured as reliably and precisely as standard clinical markers. Across all of the markers analyzed, we observed a wide range of performances and thus conclude that errors due to segmentation methods need to be reduced. Therefore, we recommend establishment of reference imaging data set collections and reference segmentations against which segmentation methods can be benchmarked and tuned to ensure harmonization of QIBs. The presented findings summarize the current state of QIB variability and systematic differences owing to the segmentation methods used by NCI QIN members. Moreover, the statistical analysis of technical and clinical QIB performance offers an approach that could be used in the future to develop QIBs in other disease and imaging settings.