In medical research, imaging plays an important role in identifying abnormalities by creating a visual depiction of the internal organs of the human body for clinical analysis. Radiomics (1, 2) refers to the extraction of quantitative features from medical images to discover prognostic or diagnostic disease markers. These features may enable classifiers to be built for effective detection, diagnosis, and therapy outcome prediction of cancer.
Computed tomography (CT) scans are used extensively in cancer diagnosis and treatment. CT scans between patients may have different acquisition and reconstruction parameters. These parameters vary among scanner vendors as well. In addition, every institution follows its own scan protocols; therefore, scans for the same body part may differ among institutions. As a result, a radiomics prediction model generated on one institution's data may not be usable or may not generate acceptable performance with another institution's data. Hence, it is necessary to analyze the stability of features under varying imaging parameters to assess the impact of the latter on the former.
Most reports on the stability and robustness of radiomics features with variation in image acquisition parameters are on patients' CT scans. In a previous study (3), we analyzed the variability and stability of radiomics features across different image acquisition parameters using 8 scanners from 3 different manufacturers. The acquired images had 7 different pixel sizes ranging from 0.39 to 0.98 mm, and the slice thickness varied from 1.25 to 3.75 mm. It was found that some radiomics features were voxel size–dependent, but with a proper normalization approach, this dependency could be reduced or eliminated. Galavis et al. (4) analyzed the variability of texture features under various acquisition and reconstruction parameters using 20 patients with solid tumors. In total, 50 texture features were extracted and further classified into 3 groups based on variation range: small variability features (range ≤ 5%), intermediate variability features (10% ≤ range ≤ 25%), and large variability features (range ≥ 30%). Hunter et al. (5) analyzed radiomics features that were stable and informative across different machines using 56 patients with non–small cell lung cancer (NSCLC) from 3 CT scan machines of 2 institutions. The Jaccard index and Dice similarity coefficient were used to analyze the stability of radiomics features across multiple machines. They found that redundancy and stability of features depended on the CT image type and CT scanner. Balagurunathan et al. (6) analyzed the stability of features from CT scans of 32 patients with NSCLC. Baseline and follow-up scans of the patients were obtained within a gap of 15 minutes using the same CT scanner and imaging protocol. There were 23 stable features out of a total of 219 features extracted. To show the prognostic potential of these 23 features, another independent NSCLC data set with 59 lung adenocarcinomas was used.
Every scanner has its own set of image acquisition and reconstruction parameters as shown in Table 1. The variability of image acquisition and reconstruction parameters across different machines could be measured to enable adjustments by scanning a given patient multiple times with different sets of imaging parameters for each machine. However, scanning the same patient multiple times is ethically questionable owing to the accumulated ionizing radiation dose. To address this problem, a physical phantom can be used to acquire multiple scans while varying imaging parameters for different machines. Zhao et al. (7) analyzed 22 phantom lesions for exploring slice thickness and reconstruction kernel variation using 14 radiomics features. In total, 3 different slice thicknesses (1.25, 2.5, and 5 mm) and 2 reconstruction kernels were used to obtain the scans. They observed that all features were significantly different when imaged at 1.25-mm versus 5-mm slice thickness and suggested that thinner (1.25 and 2.5 mm) and thicker (5 mm) slice images should not be used concurrently. Mackin et al. (8) investigated the interscanner variability of radiomics features using phantoms by obtaining scans from 17 different scanners. In total, 20 patients with NSCLC were also used to measure the variability of features from tumors. They concluded that the variability of some radiomics features extracted from NSCLC tumors was comparable to the variability of the same radiomics features obtained from CT scans of phantoms across different CT scanners. We also previously studied the variation of feature values across different scanners for several manufacturers (3).
|GE Discovery STE (GE1)a||120||250b||Helical||0.984||1.0||Standard||Det. Coverage = 40||1.25, 2.5, and 3.75|
|GE LightSpeed 32 pro (GE2)||120||250b||Helical||0.984||1.0||Standard||Det. Coverage = 40||1.25, 2.5, and 3.75|
|Philips Big Bore (P1)||120||250||Helical||1.024||1.0||Standard (B)||16 × 0.75||1.5, 2, and 3|
|Philips Brilliance 64 (P2)||120||250||Helical||1.024||1.0||Standard (B)||64 × 0.625||1.5, 2, and 3|
|Siemens Definition As (S1)||120||250||Helical||1.0||1.0||I31f-2||64 × 0.625||1.5, 2, and 3|
|Siemens Sensation 64 (S2)||120||250||Helical||1.0||1.0||B31f||64 × 0.625||1.5, 2, and 3|
|Siemens Sensation 40 (S3)||120||250||Helical||1.0||1.0||B31f||40 × 0.625||1.5, 2, and 3|
|Siemens Sensation 16 (S4)||120||250||Helical||1.0||1.0||B31f||16 × 0.75||1.5, 2, and 3|
In recent years, with the advancement of neural networks (9, 10), deep features obtained from deep neural networks have been proposed for analyzing cancerous tumors, alone or in combination with conventional radiomics features. One of the most crucial traits a deep feature needs to qualify as a potential imaging biomarker is stability across scans. Until now, little work has addressed the variability of deep features over scanner parameters other than our previous work (11), in which prerectified linear unit (pre-ReLU) features (deep features taken from the layer before the outputs of a pretrained convolutional neural network [CNN], before the ReLU activation function is applied) were used to analyze deep feature variability. Our current work extends that earlier analysis of deep feature stability. In this paper, we make the following contributions using CT radiomics phantom images:
In total, 8 different scanners from 3 different manufacturers were investigated in our current study.
As post-ReLU features have shown better classification performance (12), along with the pre-ReLU features, these were also examined for stability analysis.
In our previous work, only the rubber cartridge, which showed textural similarity to NSCLC tumors, was used. The dense cork cartridge also showed textural similarity to NSCLC tumors (13). In this study, we examined dense cork and natural cork cartridges in addition to the rubber cartridge.
As the rubber cartridges had Hounsfield unit (HU) values similar to those of NSCLC tumors, the stable deep features for the rubber cartridge were also tested on the 2 other cartridges for classification analysis.
The goal of this study was to analyze the stability of deep features extracted from CT scans (images) from 3 different manufacturers with different image acquisition parameters and to evaluate the stable features' utility in building accurate classifiers.
Materials and Methods
Image Acquisition and Reconstruction
The credence cartridge radiomics (CCR) phantom reported by Mackin et al. (8) was used in image acquisition. In total, 8 different scanners from 3 different manufacturers (GE, Philips, and Siemens) were used to obtain scans of the CCR phantom at the H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL. Slice thicknesses for the GE scanners were 1.25, 2.5, and 3.75 mm, and those for the Philips and Siemens scanners were 1.5, 2, and 3 mm. An adjacent reconstruction interval (zero interslice gap) was used for all CT phantom scans. For every slice thickness, the reconstruction field of view (FOV) varied from 200 to 500 mm (200, 250, 300, 350, 400, 450, and 500 mm), corresponding to pixel sizes ranging from 0.39 to 0.98 mm. The pixel size was calculated as FOV/matrix size, and a matrix size of 512 × 512 was kept constant for all scans. Parameters for each scanner are shown in Table 1. CT numbers (HU) and SDs for the different cartridges within the CCR phantom are reported in online supplemental Table 1. The noise power spectrum (NPS) of the rubber cartridge using 5 different scanners was reported in a recent paper (14) for the same pixel sizes and slice thicknesses. The NPS characterizes the noise texture of an image. Noise texture in a CT image varies with slice thickness, pixel size, and reconstruction kernel, and the NPS can be used to quantify the noise texture introduced by different imaging parameters. For example, scans at the same slice thickness that have similar noise texture yield similar NPS values. Noise texture is thus intrinsic to the images, whereas the NPS is an analytical tool for quantifying that texture. The impact of noise on deep features could be significant and needs future evaluation. For the same pixel size and slice thickness, CT images might provide the same NPS and thus similar noise texture. However, for different kernels, the NPS, as well as the noise texture, will be different.
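The relation between FOV and pixel size stated above can be checked with a short calculation. This is a minimal sketch that uses only the values given in the text (a 512 × 512 matrix and 7 FOVs); the function and variable names are illustrative and not taken from the study.

```python
# Pixel size = reconstruction FOV / matrix size (both values taken from the text).
MATRIX_SIZE = 512  # pixels per side, constant for all scans
FOVS_MM = [200, 250, 300, 350, 400, 450, 500]  # reconstruction FOVs in mm


def pixel_size_mm(fov_mm: float, matrix_size: int = MATRIX_SIZE) -> float:
    """Return the in-plane pixel size in mm for a square reconstruction matrix."""
    return fov_mm / matrix_size


pixel_sizes = [pixel_size_mm(fov) for fov in FOVS_MM]
# Pixel area in mm^2, the quantity the paper refers to for normalization.
pixel_areas = [size ** 2 for size in pixel_sizes]

for fov, size in zip(FOVS_MM, pixel_sizes):
    print(f"FOV {fov} mm -> pixel size {size:.2f} mm")
```

The smallest and largest FOVs give 0.39 mm and 0.98 mm, which matches the range reported for the scans.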
Convolutional Neural Networks and Transfer Learning
A CNN (9, 10) is a variant of feedforward neural networks, and it has been used extensively for object recognition and classification. A CNN typically consists of ≥1 convolutional layers along with pooling layers followed by ≥1 fully connected layers, as in a classical multilayer feedforward network. Each neuron has a bias input, accepts some input values on weighted links, executes a dot product, and forwards the output to the next layer. A nonlinear activation function is normally used on the outputs.
Training a CNN from scratch requires a large amount of data (preferably hundreds of images per class, eg, ImageNet data set). In medical imaging, obtaining a large amount of data (ie, a large number of images) is often difficult. To counter this problem, a transfer learning approach (15, 16) has been used. Using previously learned knowledge to solve a new task is known as transfer learning. In this study, we chose 1 CNN (visual geometry group-slow [VGG-S]) (17) already trained using natural camera images from the ImageNet data set (18). The VGG-S CNN architecture is shown in Figure 1. This pretrained CNN has 5 convolution layers followed by 3 fully connected layers. We obtained deep features from the penultimate layer of the CNN before (pre-ReLU) and after (post-ReLU) applying the ReLU activation function. The frequently used ReLU activation function for a CNN is represented by the following equation:

$$f(x) = \max(0, x)$$
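As a small illustration of the pre-ReLU and post-ReLU distinction, ReLU sets negative activations to 0 and leaves the rest unchanged. The sketch below uses made-up activation values; it does not reproduce the VGG-S network or its 4096-dimensional output.

```python
def relu(values):
    """Apply ReLU element-wise: negative activations become 0, others pass through."""
    return [max(0.0, v) for v in values]


# Hypothetical pre-ReLU activations from a penultimate layer (illustrative values only).
pre_relu = [-1.7, 0.0, 0.4, -0.2, 2.3]
post_relu = relu(pre_relu)

print(post_relu)  # [0.0, 0.0, 0.4, 0.0, 2.3]
```

Pre-ReLU features therefore keep sign information that post-ReLU features discard.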
Contouring and Feature Extraction
The phantom had 10 different cartridges, from which we chose the rubber, dense cork, and natural cork cartridges for our experiments. We analyzed the rubber and dense cork cartridges because of the similarity of their HU values to those of NSCLC (13), and natural cork was also investigated owing to some visible textural patterns. Throughout the scanning procedure, a 512 × 512 image size was used. Contouring of regions of interest (ROIs) was done with Mirada software (3); the central region of each cartridge was captured with a spherical ROI of volume 4.2 cm³. As this was a 2D approach, for every scan of the phantom using a different pixel size and FOV, 1 slice from each of the 3 cartridges was chosen for analysis.
The input image size for the pretrained network was 224 × 224; so, a subimage of the required size was extracted from the center of the 512 × 512 scanned image. The pretrained CNN was trained using color camera images (24-bit images), whereas the phantom's images were grayscale. Hence, deep features were extracted by feeding phantom images through the red channel only (zeros were sent through the green and blue channels). The vector size of the extracted deep features was 4096 (12). After extraction, deep features were normalized between −1 and 1. Figure 2 shows a phantom image slice of a rubber cartridge and 224 × 224 extracted subregions from different cartridges.
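The preprocessing described above (central 224 × 224 crop, grayscale slice placed in the red channel, features scaled to the range −1 to 1) can be sketched as follows. Several details are assumptions not stated in the text: a channels-last RGB layout, min-max scaling as the way features were mapped to that range, and a random array standing in for a CT slice. All function names are hypothetical.

```python
import numpy as np


def center_crop(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Extract a size x size subimage from the center of a 2D image."""
    height, width = image.shape
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def to_red_channel(gray: np.ndarray) -> np.ndarray:
    """Place a grayscale image in channel 0 (red, assumed RGB order); green and blue stay 0."""
    rgb = np.zeros(gray.shape + (3,), dtype=gray.dtype)
    rgb[..., 0] = gray
    return rgb


def scale_to_unit_range(features: np.ndarray) -> np.ndarray:
    """Min-max scale values to [-1, 1] (assumed scaling; constant input maps to 0)."""
    low, high = features.min(), features.max()
    if high == low:
        return np.zeros_like(features, dtype=float)
    return 2.0 * (features - low) / (high - low) - 1.0


# Stand-in for one 512 x 512 phantom slice (random values, not real CT data).
slice_512 = np.random.default_rng(0).normal(size=(512, 512)).astype(np.float32)
net_input = to_red_channel(center_crop(slice_512))
print(net_input.shape)  # (224, 224, 3)
```

The resulting array would then be passed to the pretrained network; that step is omitted here because it depends on the deep learning framework used.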
Using 1 cartridge of the phantom at a time (rubber, dense cork, or natural cork) for every scanner, we obtained scans at 3 different slice thicknesses and, for each slice thickness, 7 different pixel sizes. For every cartridge, this yielded 21 different scans, which were used to analyze the stability of each deep feature.
For each of the 4096 features, the concordance correlation coefficient (CCC) (19) across the 21 scans was calculated with respect to the pixel size before and after feature normalization. The maximum CCC after feature normalization was around 0.8 for some deep features, whereas those same features had much lower CCC (∼0.3) before feature normalization. Because the CCC values could be low even after normalization, improvements in the CCC values after normalization were assessed instead. Each CCC value was converted to a z value using the Fisher transformation (20),

$$z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right),$$

where r is the CCC value, and the improvement was calculated from the z values obtained before and after normalization. The resulting z value was then converted to a P-value, and the significance of the improvement was checked at the 95% confidence level (P = .05). Any feature whose improvement had P > .05 was not evaluated further.
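The CCC and Fisher transformation steps can be sketched in a few lines. The CCC below follows Lin's standard definition, and the Fisher transformation is the usual z = atanh(r). The comparison of the two z values is an assumption: the text does not reproduce the improvement equation, so the sketch uses the common test for the difference between two correlations with a one-sided normal P-value. How the paired vectors were formed from the 21 scans is also not specified here.

```python
import math


def concordance_cc(x, y):
    """Lin's concordance correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


def fisher_z(r):
    """Fisher transformation of a correlation-type coefficient (|r| < 1)."""
    return 0.5 * math.log((1 + r) / (1 - r))


def improvement_p_value(ccc_before, ccc_after, n_before, n_after):
    """Assumed test: standardized difference of Fisher z values, one-sided P-value."""
    standard_error = math.sqrt(1 / (n_before - 3) + 1 / (n_after - 3))
    z_stat = (fisher_z(ccc_after) - fisher_z(ccc_before)) / standard_error
    p_value = 0.5 * math.erfc(z_stat / math.sqrt(2))  # upper-tail normal probability
    return z_stat, p_value


# Example with the approximate CCC values mentioned in the text (0.3 before, 0.8 after)
# and 21 scans per condition.
z_stat, p_value = improvement_p_value(0.3, 0.8, 21, 21)
print(f"z = {z_stat:.2f}, P = {p_value:.4f}")
```

Under these assumptions, an improvement from about 0.3 to about 0.8 over 21 scans would pass the P < .05 filter.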
After filtering by CCC, the SD was also computed for each feature after normalization. Stability was determined using thresholds as follows. A threshold value of 0.25 was chosen for the SD: if a feature had an SD <0.25, it was considered stable across different pixel sizes. Because the features were normalized to the range −1 to 1, this threshold encompassed 12.5% of the feature range. The SD threshold mostly rules out features that take widely disparate values even though they vary little on average. After normalization, a feature was called stable with respect to a change of pixel size if the P-value for its CCC improvement was <.05 and its SD was <0.25.
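The two-part stability rule can be written as a small predicate. This is a sketch under assumptions: the population SD is used (the text does not say whether the sample or population SD was computed), and the example feature values and P-values are invented for illustration.

```python
import statistics

SD_THRESHOLD = 0.25          # SD threshold from the text
ALPHA = 0.05                 # significance threshold from the text
FEATURE_RANGE = 1.0 - (-1.0)  # features are normalized to [-1, 1]


def is_stable(values_across_scans, improvement_p):
    """Return True if a feature meets both the CCC-improvement and the SD criteria."""
    sd = statistics.pstdev(values_across_scans)  # population SD (assumed)
    return improvement_p < ALPHA and sd < SD_THRESHOLD


# The SD threshold as a fraction of the normalized feature range.
print(SD_THRESHOLD / FEATURE_RANGE)  # 0.125

# Hypothetical features: one tightly clustered, one swinging across the full range.
print(is_stable([0.10, 0.12, 0.09, 0.11], improvement_p=0.01))  # True
print(is_stable([-1.0, 1.0, -1.0, 1.0], improvement_p=0.01))    # False
```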
For the stability analysis, 4096 deep features were extracted from the pretrained CNN. With this many features, multiple comparisons can produce spurious results. To limit this, a false discovery rate (FDR) correction (22) was applied to the discovered stable features to adjust their P-values.
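One common way to adjust P-values for the false discovery rate is the Benjamini-Hochberg step-up procedure. The sketch below assumes that procedure; the text cites an FDR method (22) without naming it, so this may differ from what was actually used. The input P-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values for four features.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
still_significant = [i for i, p in enumerate(adjusted) if p < 0.05]
print(adjusted, still_significant)
```

A feature would be kept as stable only if its adjusted P-value remains below .05, which is why the count of stable features drops after correction.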
Experiments and Results
In this section, the results of multiple scans of the cartridges are analyzed for deep feature stability. If a feature varies with pixel size or voxel size, then a machine learning model built with deep features extracted from one institution's data may not work on another institution's data. Features that are stable across different scanner parameters are therefore needed. We called a feature stable if, after normalization, its value stayed within a chosen threshold limit independently of scanner parameters.
Based on both the SD and the improvement in CCC, we divided the deep features into 2 groups. Group 1 consisted of features whose CCC improvement had a P-value <.05 and whose SD was <0.25. These features showed less variability with changes in pixel size and slice thickness, and they were considered stable across pixel sizes. Group 2 comprised unstable features that did not become stable after normalization (ie, showed variability before and after normalization). We analyzed the stability of both pre-ReLU and post-ReLU features extracted from the 3 phantom cartridges. Figures 3 and 4 show examples of a feature that improved and of one that did not improve after normalization, respectively. Figure 5A shows stability results of VGG-S pre-ReLU and post-ReLU features obtained from the rubber cartridge. Figure 5B shows the analysis of the dense cork cartridge, and Figure 5C shows results from the natural cork cartridge.
We observed that normalization using the pixel area yielded more stable features (reducing variation and increasing stability across different pixel sizes) than normalization using voxel size. After analyzing intrascanner dependency (investigating features from every scanner separately), we analyzed features from different scanners jointly (interscanner dependency). The interscanner analysis evaluates the number of stable features by comparing all 4096 features among all scanners. Different cartridges (rubber, dense cork, and natural cork), normalization approaches, and deep features (post-ReLU and pre-ReLU) were also compared. Figure 6 shows the results obtained from the interscanner dependency analysis. Only 1 pre-ReLU feature obtained from different scanners was found to be stable across the 3 cartridges, whereas 19 post-ReLU features were found to be stable across the 3 cartridges. Figure 7 shows a feature found to be stable after normalization (interscanner dependency).
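Finding features that are stable on every scanner or every cartridge amounts to intersecting the per-condition sets of stable feature indices. The sketch below illustrates that idea with invented index sets; it is not the study's analysis code, and the indices are not results from the paper.

```python
def jointly_stable(stable_sets):
    """Return the feature indices present in every set (empty if no sets are given)."""
    sets = [set(s) for s in stable_sets]
    return set.intersection(*sets) if sets else set()


# Hypothetical stable feature indices per cartridge (illustrative only).
stable_by_cartridge = {
    "rubber": {12, 87, 410, 1033},
    "dense cork": {12, 410, 955, 1033, 2001},
    "natural cork": {12, 300, 1033},
}

common = jointly_stable(stable_by_cartridge.values())
print(sorted(common))  # [12, 1033]
```

Because each added scanner or cartridge can only shrink the intersection, the number of jointly stable features is expected to be much smaller than any per-cartridge count.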
FDR correction was applied to the discovered stable features to adjust their P-values, and the adjusted P-values were compared against the .05 threshold. After the FDR correction, the number of stable features was reduced (49, 71, and 70 deep features were stable on the rubber, dense cork, and natural cork cartridges, respectively). Interscanner dependency (analysis of features obtained from different scanners) was also explored. Each of these 3 cartridges had different HU values and textures. Only 1 post-ReLU feature (feature column 299) from the CNN was found to be stable across different scanners for all 3 cartridges after P-values were corrected using FDR. The stability of features changed significantly between cartridges; hence, stability was examined for each cartridge separately. The results obtained after applying FDR are shown in Figures 8 and 9.
One data set was chosen to analyze the utility of the stable deep features for prognostic analysis. Deidentified data from the National Lung Screening Trial (NLST) were obtained via the data access system of the National Cancer Institute under an IRB-approved process. Figure 10 describes the NLST study timeline as well as the criteria for dividing the SDLCs and NPCs into Cohort1 and Cohort2. It should be noted that the scans performed on the physical phantom (120 kVp, 250 mAs) were not equivalent to the low-dose scans in the NLST data set (120–140 kVp, 40–80 mAs). Nevertheless, we show that stable deep features identified using phantom images can be used to enhance malignancy classification in humans during low-dose CT screening, which uses a lower mAs value than the phantom scans. A detailed description of the data set, including Figure 10, is given in the online supplemental Appendix (23, 24). Deep features from the CNN were extracted using the red-color channel. In our previous study (21), we experimented with the original deep features (without choosing any stable features), and the top 5, 10, 15, or 20 deep features were selected using the symmetric uncertainty (25) feature selector. In Paul et al.'s study (21), VGG-S deep features from the red-color channel gave the best classification accuracy of 65.4% with 0.66 area under the curve (AUC) using 15 features.
The 49 deep features that were found to be stable on the rubber cartridge over different scanners and parameters were evaluated further to determine whether the use of stable features could improve classification performance. Classification performance was evaluated with respect to accuracy and area under the receiver operating characteristic curve (26). We chose the top 5, 10, 15, or 20 deep features from our stable features using the symmetric uncertainty feature selector, and also used all 49 features, for classification with a random forests classifier (27). We found that using the top 20 pixel area–normalized stable deep features gave 67.08% accuracy with 0.67 AUC, which was an improvement over our previous analysis using all 4096 features. The classification performance was further enhanced to 68.77% (0.68 AUC) by using all 49 normalized stable features, an increase of about 2 percentage points in accuracy over using unnormalized features. Detailed results are shown in Table 2. Online supplemental Appendix Table 2 compares the performance of different classifiers [decision tree (28), naïve Bayes (29), and nearest neighbors (30)] with unmodified deep features and normalized stable deep features. From this study, we observed the importance of assessing deep feature stability before prognostic evaluation. By choosing the stable features, we may avoid using unreliable and irrelevant features.
|# of Features||Original Deep Features (All 4096 Features)a||Stable Deep Features Onlyb||Stable Deep Features – Normalized by Pixel Areab|
|Top 5||62.02 (0.643)||64.135 (0.62)||62.02 (0.6)|
|Top 10||63.71 (0.66)||63.71 (0.66)||65.4 (0.64)|
|Top 15||65.4 (0.66)||65.82 (0.66)||65.4 (0.66)|
|Top 20||64.97 (0.66)||66.24 (0.66)||67.08 (0.67)|
|Top 49 or All 49||66.24 (0.68)||67.08 (0.68)||68.77 (0.68)|
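The feature-selection step used for these classification experiments ranks features by symmetric uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where I is mutual information and H is entropy. The sketch below implements that definition for discrete values. The equal-width binning of continuous features is an assumption (the discretization used in the study is not described), the toy data are invented, and the random forests classifier itself is omitted.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU = 2 * I(X; Y) / (H(X) + H(Y)); returns 0 when both entropies are 0."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0
    mutual_info = h_x + h_y - entropy(list(zip(x, y)))
    return 2 * mutual_info / (h_x + h_y)


def discretize(values, bins=4):
    """Equal-width binning of continuous values (assumed discretization)."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)
    width = (high - low) / bins
    return [min(int((v - low) / width), bins - 1) for v in values]


def top_k_features(feature_columns, labels, k):
    """Rank named feature columns by SU with the labels and keep the best k."""
    scores = {name: symmetric_uncertainty(discretize(col), labels)
              for name, col in feature_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy example: one feature tracks the label, the other does not.
labels = [0, 0, 1, 1]
features = {"f_informative": [0.1, 0.2, 0.9, 1.0], "f_noise": [0.1, 0.9, 0.2, 1.0]}
print(top_k_features(features, labels, k=1))  # ['f_informative']
```

In the experiments, the selected top 5, 10, 15, or 20 features would then be passed to the classifier.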
CT imaging plays a critical role in current NSCLC treatment and research. The FOV can vary from scan to scan depending on the size, location, and stage of the NSCLC tumor and on patient size. Slice thickness is another important parameter selected for obtaining a scan. How these image acquisition parameters affect features extracted from a CNN remains unexplored. Hence, the main focus of this study was to evaluate how deep features behave with variation in image acquisition parameters. There has been recent work (31) on finding semantic meaning for deep features, suggesting that meaning may be ascertained for stable deep features. To the best of our knowledge, this is the first work analyzing the stability of deep features with varying slice thickness and FOV. In this paper, only pixel size and slice thickness dependency and variability were chosen for analysis. The goal was to gain a better understanding of the variability and allow focus on ways to remove or reduce it.
In our previous study (21), deep features extracted from different color channels of a pretrained CNN were analyzed and variations in classification accuracies were also obtained. Deep features were extracted from the red channel of the pretrained CNN (VGG-S) for our current study. VGG-S was chosen for our study because deep features from VGG-S showed good classification for NSCLC nodules (21).
From this study, we observed that some deep features were stable within a chosen threshold, as shown in Figure 4 (small variability with the variation of pixel size). These stable features did not appreciably change with pixel size. Other deep features were pixel size–dependent and varied with changes in pixel size. Robust and stable features across different reconstruction kernels and image acquisition parameters are desirable in radiomics. In an attempt to stabilize these features across pixel sizes, we proposed 2 normalization procedures using pixel area and voxel size (11, 14). In some cases, features stabilized after being normalized by pixel area or voxel size. However, some of the pixel size–dependent features remained variable even after normalization and were not stable with pixel size. Hence, we found that some deep features have pixel size dependencies, as conventional radiomics features do, and that a similar correction approach could be used to reduce the dependency.
Normalization using pixel area generated a greater number of stable features (less variability across pixel size changes) than normalization using voxel size. In fact, voxel size normalization by itself was of minimal help, perhaps because we are dealing with planar (2D) images, 1 slice per scan, whereas voxel size is a volume. In some cases, it added only a couple more stable features to the set found with pixel area normalization. Interscanner dependency (analysis of features obtained from different scanners) was also explored. Each of these 3 cartridges (rubber, dense cork, and natural cork) had different HUs and textures, yet 1 pre-ReLU and 19 post-ReLU features from a VGG-S pretrained CNN were found to be stable across different scanners for all 3 cartridges.
Previously (12, 21), we found that using post-ReLU features provided better classification performance than using pre-ReLU features for lung nodules. Post-ReLU features lack negative values (all negative values are set to 0) owing to the ReLU activation function, whereas pre-ReLU features can be negative. Both sets of features (pre- and post-ReLU) were investigated to improve our understanding of feature stability. In our current study, we found that more post-ReLU deep features than pre-ReLU features could be stabilized.
We found that the features also changed significantly when using different scanners with different protocols. Our analysis also showed that more deep features were stable for the dense cork cartridge than for the rubber and natural cork cartridges. This happened because the texture of the dense cork cartridge is more uniform than that of the rubber and natural cork cartridges.
Finally, in this study, pre- and post-ReLU features were obtained through a transfer learning approach, which was a limitation of the study because the VGG-S CNN was not trained on medical images of any modality. We used transfer learning both because we do not yet have the large number of medical images needed to train a complex CNN and because we wanted to explore the utility of transfer learning. Given our present results, deep features from a CNN trained on CT images will be investigated in the future. Although the phantom cartridges were made of different materials with different textures, an analysis using real human CT scans would be useful. No scans from human subjects were used for the stability analysis, which was a limitation of this study. In future work, further analysis using patient data will be performed for prognostic evaluation after applying the proposed pixel size–based normalization. Deep features from only the red channel of the VGG-S network were used for this study, which was another limitation. In the future, feature normalization will be analyzed during CNN training to determine whether it is possible to identify characteristics of features that would benefit from normalization. The deep features will also be analyzed further using different scanner parameters and various reconstruction kernels.
The goal of this study was to analyze whether deep features were stable across different scanner parameters and manufacturers. Stability is one of the essential characteristics a deep feature needs to qualify as a potential imaging biomarker. From this study, we found that many deep features were dependent on pixel size, as are many conventional radiomics features. For some features, this dependency could be reduced by normalizing the deep features using pixel area and voxel size. We found that the stability of deep features changed significantly when using different phantom cartridges (49, 71, and 70 deep features were stable on the rubber, dense cork, and natural cork cartridges, respectively). We also looked for post-ReLU deep features that were stable across all 3 physical phantom cartridges and found 1 feature. The 3 cartridges were made of different components and had differences in texture uniformity and HU values. It is therefore advisable to analyze the stability of deep features for different cartridges independently. The stable and normalized deep features achieved improved classification performance compared with the original deep features chosen by the symmetric uncertainty feature selector, which shows the usefulness of stable features for prognostic analysis. Based on this study, some deep features may be candidates for future imaging biomarkers, but researchers must be cautious because most deep features show dependence on image acquisition parameter variations.