Medical imaging plays an ever greater role in disease diagnosis and patient care. One of the most exciting new areas related to cancer diagnosis, treatment planning, and response assessment is the field of radiomics, which involves the extraction and analysis of a large number of quantitative imaging features from medical images for characterization of tumor and tissue phenotypes (1, 2).
Owing to the associations between tumor phenotypes and underlying biological processes, radiomic features (RFs) or RF-derived phenotypes can act as biomarkers that convey information about disease to help with the management of therapies. To date, radiomics has shown promise in improving cancer diagnosis and prognostic assessment in several tumor types including lung (3–5), brain (6), breast (7), liver (8–10), kidney (11), and esophagus (12) cancers. Moreover, RFs also exhibit correlations with genetic mutation status (5) and disease recurrence (13), as well as therapeutic response (14) and survival (15) in lung cancer.
Before RFs can be fully utilized as imaging biomarkers in oncology, the influence of image acquisition settings on them should be well understood (16). Numerous studies have examined the “reproducibility” of RFs (17–20), that is, whether feature values remain the same when an object is reimaged using different equipment and different image acquisition settings. To the best of our knowledge, with the exception of studies on the accuracy of volume measurements (21, 22), there has been no report to date exploring the “reliability” of RFs. “Reliability” refers to whether the true feature value can be maintained when imaging with different scanners and image acquisition settings. The true feature value in our study was defined as the feature value calculated on a computed tomography (CT) image in which the CT number of each tissue composition equals its theoretical CT number at 120 kVp, for example, −1000 HU for air and 0 HU for water. The true feature value is therefore also called the reference value in our study.
The challenge of such a reliability study lies in the fact that reference values for RFs are generally quite difficult to obtain, especially for in vivo lesions, because of unknown tissue composition, as well as anatomic, physiologic, and even positional variations among different patients. In view of this point, we aimed to carry out a pilot study on RF reliability using the ACR CT phantom (American College of Radiology CT accreditation phantom) (23). The ACR CT phantom is a widely used CT QC phantom, and has a well-defined CT number for each object inside module 1.
In this study, we acquired CT images of the phantom under 24 image acquisition settings using a GE Discovery 750HD scanner (GE Healthcare, Waukesha, WI). The reliability of 8 widely used RFs (mean, std, skewness, kurtosis, GLCM [gray-level co-occurrence matrix (24)]-energy, GLCM-contrast, GLCM-correlation, and GLCM-homogeneity) was investigated on the 24 sets of CT images.
Scanning the ACR CT Phantom
A Gammex CT ACR 464 phantom was scanned on a GE Discovery 750HD scanner using a routine adult abdomen protocol at 4 different tube currents (25, 50, 100, 200 Effective mAs). The CT images were then reconstructed with 3 different slice thicknesses (1.25, 2.5, 5 mm) and 2 convolution kernels (STANDARD, SOFT), resulting in a total of 4 × 3 × 2 = 24 sets of CT images. The CT scanning parameters used in this study are listed in Table 1.
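The 4 × 3 × 2 design can be enumerated as a simple Cartesian product. The sketch below is illustrative only: the function name `setting_label` is hypothetical, and the label format is an assumption modeled on the setting names (such as "STANDARD_ST125_EffmAs200") that appear in the Results.

```python
from itertools import product

# Acquisition and reconstruction parameters as stated in the text.
kernels = ["STANDARD", "SOFT"]
slice_thicknesses_mm = [1.25, 2.5, 5.0]
effective_mas = [25, 50, 100, 200]


def setting_label(kernel, thickness_mm, mas):
    """Build a label such as 'STANDARD_ST125_EffmAs200' (assumed naming scheme)."""
    thickness_code = int(round(thickness_mm * 100))  # 1.25 mm -> 125
    return f"{kernel}_ST{thickness_code:03d}_EffmAs{mas}"


settings = [
    setting_label(k, t, m)
    for k, t, m in product(kernels, slice_thicknesses_mm, effective_mas)
]
print(len(settings))
```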
Preparation of Image Region and ROIs for Extracting Real Feature Value
The ACR CT phantom is composed of 4 modules and is primarily constructed from water-equivalent materials (23). Each module contains several components made of different materials. In our study, 2 circular objects from module 1, made of polyethylene and acrylic, respectively, were selected to create image patterns for feature extraction. Polyethylene and acrylic have CT numbers of −95 HU and 120 HU, respectively, at a 120-kVp setting, both of which fall within the range of the abdominal CT window.
For each object, a 2-dimensional region of 45 × 45 mm containing the object was cropped from the CT image located at the center of module 1 along the axial direction. Within the cropped region, 100 regions of interest (ROIs) were randomly generated. The criteria to generate ROIs included the following:
The center of the ROI should be located inside the object.
The ROI should cover part of the object and part of the background outside the object, so that radiomic features are studied on nonhomogeneous patterns rather than only on homogeneous patterns, such as those derived from cartridge phantoms filled with paper/rubber in the literature (19).
The size of the ROI should range from 12 × 12 mm to 18 × 18 mm.
The sizes of the cropped region and ROIs were empirically determined on the basis of the physical size of the object (a cylinder with diameter = 25 mm and depth = 4 cm, as provided in the manual of the ACR CT phantom). The process of preparing the cropped region and ROIs is illustrated in Figure 1.
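The three ROI criteria can be expressed as a rejection-sampling loop. This is a minimal sketch rather than the authors' procedure: it assumes square ROIs, an object centered in the 45 × 45 mm crop, ROIs that stay inside the crop, and uniform sampling of size and position. The function name `sample_rois` and the random seed are hypothetical.

```python
import numpy as np

CROP_MM = 45.0            # side of the cropped region (from the text)
RADIUS_MM = 12.5          # object diameter is 25 mm (from the text)
MIN_SIDE_MM, MAX_SIDE_MM = 12.0, 18.0


def sample_rois(n_rois, rng):
    """Return (center_x, center_y, side) tuples in mm that satisfy the criteria."""
    obj_center = np.array([CROP_MM / 2, CROP_MM / 2])  # assumed: object centered
    corner_signs = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
    rois = []
    while len(rois) < n_rois:
        side = rng.uniform(MIN_SIDE_MM, MAX_SIDE_MM)
        center = rng.uniform(0.0, CROP_MM, size=2)
        half = side / 2
        # Criterion 1: the ROI center lies inside the object.
        if np.linalg.norm(center - obj_center) > RADIUS_MM:
            continue
        # Assumption: the ROI must stay within the cropped region.
        if np.any(center - half < 0) or np.any(center + half > CROP_MM):
            continue
        # Criterion 2: the ROI also covers background, i.e. at least one
        # corner of the square falls outside the circular object.
        corners = center + half * corner_signs
        if np.all(np.linalg.norm(corners - obj_center, axis=1) <= RADIUS_MM):
            continue
        rois.append((float(center[0]), float(center[1]), float(side)))
    return rois


rois = sample_rois(100, np.random.default_rng(0))
```

Because the ROI center is inside the object, part of the object is always covered, so only the background condition needs an explicit check.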
Preparation of Computer-Generated Images for Extracting Reference Feature Value
A noise-free digital image series simulating module 1 of the ACR CT phantom was generated for the extraction of reference feature values. The 2 selected objects (polyethylene and acrylic) were reproduced via an image-processing algorithm on the basis of designated parameters (eg, location, size, shape, and density in CT number) provided in the phantom manual (23). The rest of each computer-generated image was defined as water-equivalent background with a CT number of 0 HU. The image region and ROIs for feature extraction from the computer-generated images were copied from those used in the scanned phantom images so that they were identical. In this way, variations introduced by position misalignment and density difference were minimized, and any bias of the real value relative to the reference value could be attributed to the different image acquisition settings.
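A noise-free reference image of one object can be sketched as a binary disc on a uniform background. The example below is a simplification under stated assumptions: it renders only the 45 × 45 mm cropped region rather than the whole of module 1, places the object at the crop center, uses 0.5 mm pixels, and applies a hard edge with no partial-volume blending. The function name `make_reference_image` is hypothetical.

```python
import numpy as np


def make_reference_image(object_hu, size_mm=45.0, pixel_mm=0.5,
                         radius_mm=12.5, background_hu=0.0):
    """Return a 2D array of CT numbers: a disc of object_hu on a flat background."""
    n = int(round(size_mm / pixel_mm))
    # Pixel-center coordinates in mm.
    coords = (np.arange(n) + 0.5) * pixel_mm
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    c = size_mm / 2  # assumed: object centered in the crop
    inside = (xx - c) ** 2 + (yy - c) ** 2 <= radius_mm ** 2
    image = np.full((n, n), background_hu, dtype=float)
    image[inside] = object_hu
    return image


polyethylene = make_reference_image(-95.0)  # theoretical CT number at 120 kVp
acrylic = make_reference_image(120.0)
```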
Extraction of Feature
In our study, 8 two-dimensional RFs were investigated: 4 histogram-based features (mean, std [standard deviation], skewness, and kurtosis) and 4 texture-based GLCM features (24) (GLCM-energy, GLCM-contrast, GLCM-correlation, and GLCM-homogeneity). Mean, std, skewness, and kurtosis are first-order statistical features that characterize a histogram of image intensity. GLCM features are textural features characterizing the gray-tone spatial dependencies of an image, that is, quantifying the relationship between pixels within an ROI. Detailed definitions of the 8 RFs are provided in the online Supplemental Material.
In the implementation, the 8 RFs were calculated on each ROI by using an in-house feature extraction algorithm programmed on the MATLAB 2016b platform (MathWorks, Natick, MA). Before feature calculation, images were interpolated to an isotropic pixel spacing of 0.5 × 0.5 mm².
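The 8 features can be illustrated with a short NumPy sketch. This is not the in-house MATLAB algorithm, whose exact definitions are given in the Supplemental Material. The choices below are assumptions: population (biased) moments, non-excess kurtosis, 16 gray levels, a single horizontal offset of 1 pixel, a symmetric normalized GLCM, energy as the sum of squared entries, and homogeneity with a 1 + |i − j| denominator. Interpolation to 0.5 mm pixels is omitted.

```python
import numpy as np


def histogram_features(roi):
    """First-order features; assumes the ROI is not constant (std > 0)."""
    x = np.asarray(roi, dtype=float).ravel()
    mean, std = x.mean(), x.std()
    z = (x - mean) / std
    return {"mean": float(mean), "std": float(std),
            "skewness": float(np.mean(z ** 3)),
            "kurtosis": float(np.mean(z ** 4))}


def glcm_features(roi, levels=16):
    """GLCM features for a horizontal 1-pixel offset (assumed settings)."""
    x = np.asarray(roi, dtype=float)
    lo, hi = x.min(), x.max()
    # Quantize intensities into `levels` gray levels.
    q = np.floor((x - lo) / (hi - lo) * levels).astype(int)
    q = np.clip(q, 0, levels - 1)
    # Count co-occurrences of each pixel with its right-hand neighbor.
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1.0)
    glcm = glcm + glcm.T          # make the matrix symmetric
    p = glcm / glcm.sum()         # normalize to joint probabilities
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt((((i - mu_i) ** 2) * p).sum())
    sd_j = np.sqrt((((j - mu_j) ** 2) * p).sum())
    return {
        "energy": float((p ** 2).sum()),
        "contrast": float((((i - j) ** 2) * p).sum()),
        "correlation": float(((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j)),
        "homogeneity": float((p / (1.0 + np.abs(i - j))).sum()),
    }
```

Both functions divide by a spread term, so they are undefined on perfectly uniform ROIs; the heterogeneous ROIs described above avoid that case in the noise-free images.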
Reliability of Feature
In our study, feature reliability was defined as the degree to which the reference feature value can be predicted from the real feature value. The reference feature value was the feature value extracted from noise-free computer-generated phantom images, while the real feature value was the feature value extracted from CT images acquired from the physical ACR phantom. High predictability means that a change in the reference feature value is correctly reflected by a proportional change in the real feature value. If an RF exhibited high predictability under a certain image acquisition setting, then the RF calculation was considered reliable under that setting.
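Consequently, R², a statistical metric widely used to assess the proportion of variance in the dependent variable that is predictable from the independent variable, was adopted to quantify feature reliability. An R² value of 1 indicated that the reference feature value could be fully predicted from the real feature value, whereas an R² value of 0 indicated that there was no relation between the reference feature value and the real feature value. In its standard form, R² can be defined as follows:

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ is the reference feature value of the $i$-th ROI, $\hat{y}_i$ is the value predicted from the corresponding real feature value, and $\bar{y}$ is the mean of the reference feature values.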
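As an illustration, the reliability of one feature under one acquisition setting could be computed as the coefficient of determination of a least-squares line that predicts reference values from real values. The use of a simple linear fit is an assumption here, and the function name `r_squared` is hypothetical.

```python
import numpy as np


def r_squared(real, reference):
    """R² of a least-squares line predicting reference values from real values."""
    real = np.asarray(real, dtype=float)
    reference = np.asarray(reference, dtype=float)
    slope, intercept = np.polyfit(real, reference, 1)
    predicted = slope * real + intercept
    ss_res = np.sum((reference - predicted) ** 2)        # unexplained variation
    ss_tot = np.sum((reference - reference.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)
```

For a simple linear fit with an intercept, this value equals the squared Pearson correlation between the two sets of feature values, so either formulation gives the same reliability score.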
Figure 2 shows an example of how R² is used to assess feature reliability under a given image acquisition setting. Graphs (A) and (B) in Figure 2 present the values of skewness, one of the histogram-based RFs, calculated from ROIs under the image acquisition settings of “convolution kernel = STANDARD, slice thickness = 1.25 mm, and Effective mAs = 200” and “convolution kernel = STANDARD, slice thickness = 1.25 mm, and Effective mAs = 25,” respectively. The feature data used to estimate each R² value consisted of 200 pairs of skewness values, corresponding to the reference and real skewness values calculated on the 200 ROIs (100 per object) on the computer-generated and physical phantom images, respectively. As shown in Figure 2, high reliability (R² = 0.9575) indicated that the real skewness values measured at high tube current closely tracked the reference skewness values, while low reliability (R² = 0.4021) indicated that the real skewness values measured at low tube current diverged from the reference skewness values.
Figure 3 shows the reliability values for the 8 RFs under the 24 image acquisition settings, that is, all combinations of 4 tube currents (25, 50, 100, 200 Effective mAs), 3 slice thicknesses (1.25, 2.5, 5 mm), and 2 convolution kernels (STANDARD and SOFT). Overall, we observed that feature reliability decreased with a decrease in tube current, that features were more reliable on 5-mm CT images than on 1.25- and 2.5-mm CT images, that there was little difference in feature reliability between CT images reconstructed with the STANDARD and SOFT convolution kernels, and that histogram-based RFs were more reliable than textural RFs.
We averaged the reliability values across individual image acquisition parameters to further investigate their influence (Table 2). For example, when investigating the influence of “200 Effective mAs,” we averaged the feature reliability values of “STANDARD_ST125_EffmAs200,” “STANDARD_ST250_EffmAs200,” “STANDARD_ST500_EffmAs200,” “SOFT_ST125_EffmAs200,” “SOFT_ST250_EffmAs200,” and “SOFT_ST500_EffmAs200” together as presented in Figure 3.
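The averaging step can be sketched as a filter over setting labels followed by a mean. The R² values in the example are placeholders chosen only to exercise the code; they are not results from this study. The function name `average_by_parameter` is hypothetical.

```python
from statistics import mean


def average_by_parameter(reliability, token):
    """Average R² over all settings whose label contains the given token.

    `reliability` maps labels such as 'SOFT_ST250_EffmAs200' to R² values;
    `token` is one label component, e.g. 'EffmAs200', 'ST500', or 'SOFT'.
    """
    selected = [r2 for label, r2 in reliability.items()
                if token in label.split("_")]
    return mean(selected)


# Placeholder values for illustration only (not measured data).
example = {
    f"{kernel}_ST{st}_EffmAs{mas}": (0.9 if mas == 200 else 0.5)
    for kernel in ("STANDARD", "SOFT")
    for st in ("125", "250", "500")
    for mas in (25, 50, 100, 200)
}
avg_200 = average_by_parameter(example, "EffmAs200")
```

Matching on whole label components rather than substrings avoids accidental matches between similar tokens.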
To facilitate the analysis, we empirically set R² > 0.85 as high reliability. As shown in Table 2, in the case of tube current, 100 Effective mAs could be regarded as a threshold to guide the application of RFs, that is, using tube current ≥100 Effective mAs resulted in more reliable RFs, especially the histogram-based RFs, while using tube current <100 Effective mAs produced only a few reliable RFs. For slice thickness, 5-mm CT images yielded more reliable RFs. For convolution kernel, the STANDARD and SOFT kernels showed a similar influence on feature reliability. The features mean and std showed extremely high reliability across all image acquisition settings.
We observed one unusual trend: the average reliability of GLCM-energy at a slice thickness of 1.25 mm was higher than that at a slice thickness of 2.5 mm (see Table 2). Turning to the detailed reliability values presented in Figure 3, we found that this trend was caused by a sharp drop in reliability at a tube current of 25 Effective mAs, a very low dose condition, for the slice thickness of 2.5 mm. Indeed, based on our results, low tube current readily led to unusual trends for some RFs, for example, GLCM-homogeneity at 50 and 25 Effective mAs and GLCM-homogeneity at 1.25- and 2.5-mm slice thicknesses in Table 2.
In this study, we introduced the concept of RF reliability and evaluated the reliability of 8 commonly used RFs under 24 different image acquisition settings. The 24 image acquisition settings involved 3 image acquisition parameters, namely, tube current, slice thickness, and convolution kernel, and covered a wide range of imaging protocols for abdominal CT imaging. Moreover, our study was based on heterogeneous ROIs, that is, ROIs containing both object and background, which is an advantage over previous studies using homogeneous ROI phantoms, for example, paper/rubber-filled cartridges (19).
Overall, for the ACR CT phantom, tube current affected reliability the most, slice thickness the second, and convolution kernel the least. The small effect of convolution kernels was due to the similarity of the 2 “smooth” kernels used in this abdominal study. The histogram-based RFs showed much higher reliability than textural RFs.
For tube current, 200 Effective mAs represented high-dose CT imaging, while 25 Effective mAs represented low-dose CT imaging. It is intuitive that CT images derived from high-dose scanning yield more reliable RFs, because high-dose scanning produces higher-quality images than low-dose scanning. Therefore, to obtain high RF reliability, high-dose CT imaging is recommended, especially for radiomic studies using textural RFs. When all other image acquisition parameters are kept unchanged, increasing the slice thickness from 1.25 mm to 5 mm can reduce image noise by 50%. It is therefore reasonable that thick-section CT imaging yielded more reliable RFs. However, thick-section CT imaging introduces a larger partial volume effect than thin-section CT imaging. In clinical practice, the partial volume effect is one of the main negative effects that lower image resolution and thus blur fine structures within or around lesions, for example, small vessels and tumor margins. It will also affect some RFs extracted from thick-section CT images. Therefore, the selection of RFs and slice thickness should depend on the aim of the radiomic study. Because the 2 convolution kernels, STANDARD and SOFT, are both smooth soft-tissue kernels that yield low-noise images, their influence on RF reliability was similar. Our results also showed that the choice between these smooth soft-tissue kernels, as used in abdominal CT scans, had little impact on RF reliability.
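The 50% figure for the change from 1.25 mm to 5 mm follows from the usual approximation that quantum noise scales with the inverse square root of slice thickness when dose and all other parameters are fixed. The sketch below works through that arithmetic; the scaling law is an assumption about the noise model, and the function name `relative_noise` is hypothetical.

```python
import math


def relative_noise(thin_mm, thick_mm):
    """Noise at thick_mm relative to noise at thin_mm, assuming noise ~ 1/sqrt(thickness)."""
    return math.sqrt(thin_mm / thick_mm)


ratio = relative_noise(1.25, 5.0)   # 4x thicker slice -> sqrt(1/4) of the noise
reduction = 1.0 - ratio             # fractional noise reduction
print(f"noise ratio = {ratio:.2f}, reduction = {reduction:.0%}")
```

Under the same assumption, going from 1.25 mm to 2.5 mm would reduce noise by about 29%, which is consistent with the 2.5-mm images behaving more like the thin-section images.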
In this study, 2 categories of RFs, histogram-based and textural, were investigated. Histogram-based RFs showed much higher reliability than textural RFs, especially the mean and std. It is one of the basic requirements for a CT scanner that the mean CT number be reliable across different image acquisition settings, and our results confirmed this. For the std, the high reliability was partly due to the use of the polyethylene and acrylic objects to create image patterns, because their intensities differ from the water-equivalent background by dozens of Hounsfield units (HU). Nevertheless, this finding suggests that std can be applied reliably in characterizing tumor lesions that differ from the background by dozens of HU, such as liver metastasis of colorectal cancer [mean, 68 HU; range, 40–115 HU as reported in the CRYSTAL clinical trials (25, 26)] and gastrointestinal stromal tumors [mean, 72 HU; range, 46–156 HU as reported in the Choi criteria study (27)].
In contrast to histogram-based RFs, more attention should be paid to the use of textural RFs. Textural RFs are easily affected by tube current, which is an imaging parameter directly proportional to patient radiation dose. High tube current guarantees high reliability of textural RFs, but leads to high patient dose. Therefore, the use of textural RFs should depend on the aim of a study. For example, it is inadvisable to use textural RFs in a low-dose CT screening study (28), whereas it might be safe to use textural RFs in a CT-based radiation therapy study (29).
There were several limitations of our pilot study. First, the created image patterns were simple, involving only 2 materials for each pattern: polyethylene and a water-equivalent background, or acrylic and a water-equivalent background. Second, only a small set of RFs from 2 feature categories was investigated. Third, only 1 CT scanner was used. To address these limitations, we propose future studies that include designing more sophisticated phantoms that mimic in vivo lesions with the help of 3D-printing techniques (30), using a high-throughput analysis method to evaluate a large number of RFs (20), and involving multiple scanners from multiple institutions to acquire CT images under more image acquisition settings (19).
In this study, we explored the reliability of RFs on multiple CT image acquisition settings. To the best of our knowledge, this is the first study investigating RF reliability by comparing real feature values calculated from scanned phantom images and reference feature values computed from computer-generated phantom images. We found that CT image acquisition settings influenced RF reliability to varying degrees. Therefore, attention should be paid when using RFs for CT-based radiomic studies, especially textural RFs.