Approximately 81,400 new cases of bladder cancer (62,100 in men, 19,300 in women) will be diagnosed in 2020, resulting in 17,980 deaths (13,050 male, 4,930 female) according to estimates by the American Cancer Society (1). Only 51% of bladder cancers are diagnosed at an early stage (stage T1 or less) when the cancer involves only the inner mucosal layer of the bladder wall (1) and is relatively easier to treat.
Improvement in patient survival and a decrease in the probability of metastatic disease are observed when neoadjuvant chemotherapy is performed before radical cystectomy (2–4). However, neoadjuvant chemotherapy can cause significant toxicities, such as neutropenic fever, sepsis, mucositis, nausea, vomiting, malaise, and alopecia (5). Assessment of response to neoadjuvant chemotherapy is not reliable at present, so some patients may suffer adverse reactions to chemotherapy while gaining minimal benefit (6, 7). It is therefore important to develop an accurate method for assessing treatment response. Such a method could help personalize therapy for patients in the neoadjuvant chemotherapy setting. It might also facilitate optimal selection of patients for bladder-sparing therapy (8), in which trimodal therapy (ie, transurethral resection, chemotherapy, radiation) can be used as a curative option for patients who do not wish to undergo the morbidity of radical cystectomy.
A computerized decision-support system for muscle-invasive bladder cancer treatment response assessment (CDSS-T) using imaging information from computed tomography (CT) examinations was developed in our laboratory. The CDSS-T tool estimates the likelihood that a patient has completely responded to neoadjuvant chemotherapy (9). It integrates deep-learning convolutional neural networks (DL-CNN) and radiomics features. We have used the CDSS-T as a physicians' aid in an observer study for assessment of the likelihood that a patient has completely responded to neoadjuvant chemotherapy (10). The physicians' assessment accuracy was higher with CDSS-T than without it (10).
In this study, we evaluated the intraobserver variability in physicians' assessment aided by the CDSS-T of complete radiographic response to neoadjuvant chemotherapy, and the effects of that intraobserver variability on the physicians' assessment accuracy.
The study population consisted of 123 subjects with 157 muscle-invasive bladder cancers who had undergone CT scanning of the pelvis before and after neoadjuvant chemotherapy treatment before radical cystectomy. One hundred subjects were males with a mean age of 63 years (range, 43–84 years), and 23 were females with a mean age of 23 years (range, 37–82 years). The chemotherapy treatment was performed with MVAC (methotrexate, vinblastine, doxorubicin, cisplatin) or an alternative regimen (variably including carboplatin, paclitaxel, gemcitabine, etoposide). Three cycles of chemotherapy treatment were performed. Institutional Review Board (IRB) approval was obtained for this study.
For all subjects, pretreatment and posttreatment CT scans of the pelvis with or without contrast material were acquired with GE Healthcare (WI) Lightspeed MDCT scanners using 120 kVp and 120–280 mA, at a pixel size range of 0.586–0.977 mm and a slice interval range of 0.625–7 mm. Pretreatment CT scans were acquired ∼1 month before the first cycle of chemotherapy. Posttreatment imaging was acquired ∼1 month after completion of the therapy. The time interval between the pre- and posttreatment scans was 4 months on average. One to 2 months after completion of neoadjuvant chemotherapy, a radical cystectomy was performed. The final cancer stage and whether the subject had responded completely to neoadjuvant chemotherapy (ie, pathologic T0; the primary outcome measure) were determined on the basis of the pathology obtained from the bladder at the time of surgery. The pathology cancer stage was used as the reference standard. A radiologist (R.H.C.) with over 30 years of experience reading abdominal CT marked all cancer locations on the pre- and postchemotherapy CT scans and defined a volume of interest (VOI) with a bounding box that enclosed the cancers using a custom graphical user interface (GUI), MiViewer, developed at the University of Michigan CAD-AI Research Laboratory. This reference radiologist did not participate as an observer in the treatment response assessment experiment.
Computerized Decision Support System for Treatment Response Assessment (CDSS-T)
Our CDSS-T system integrates DL-CNN and radiomic features to distinguish between bladder cancers that have fully responded to treatment (ie, pathologic stage T0) and those that have not (ie, pathologic stage T1–T4) (9). The CDSS-T system segments bladder cancers using our in-house-developed segmentation tool, autoinitialized cascaded level sets (AI-CALS) (11). Radiomic features are then extracted from the segmented tumor. The image analysis pipeline of the CDSS-T system is shown in Figure 1.
DL-CNN Assessment Model
We trained a DL-CNN to distinguish complete responders from noncomplete-responders as described previously (9, 12). In brief, “hybrid” regions of interest (ROIs) were first generated from the pre- and posttreatment ROIs extracted from within the segmented cancers on the pre- and posttreatment CT scans. Each hybrid ROI was formed by digitally concatenating a pair of pre- and posttreatment ROIs side by side. For each cancer, a large number of hybrid ROIs were generated by taking different combinations of the pre- and posttreatment ROIs. All hybrid ROIs from the same cancer were labeled as a complete responder (ie, pathologic stage T0) or a noncomplete-responder (ie, pathologic stages T1–T4) according to the postcystectomy-determined pathologic cancer stage. A leave-one-case-out cross-validation scheme was used for the training and testing of the DL-CNN model. For each leave-one-case-out partition, all hybrid ROIs except for those from the left-out case were used as the training set to train the DL-CNN. The hybrid ROIs from the left-out case were used as the test set, and the trained DL-CNN was then applied to these test hybrid ROIs. In this way, a likelihood score of pathologic T0 disease was obtained for each of the test ROIs. Finally, a “per-cancer” summary score was obtained by averaging the likelihood scores of the ROIs associated with the specific cancer.
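The hybrid ROI construction and the per-cancer averaging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, ROIs are represented as plain nested lists, and pairing every pretreatment ROI with every posttreatment ROI is an assumption, because the text states only that different combinations were used.

```python
from itertools import product
from statistics import mean


def make_hybrid_rois(pre_rois, post_rois):
    """Build hybrid ROIs by placing a pre- and a posttreatment ROI side by side.

    Each ROI is a list of rows (lists of pixel values); all ROIs are assumed
    to have the same height. Assumption: every pre/post pairing is used.
    """
    hybrids = []
    for pre, post in product(pre_rois, post_rois):
        # Concatenate row by row: pretreatment on the left, posttreatment on the right.
        hybrids.append([pre_row + post_row for pre_row, post_row in zip(pre, post)])
    return hybrids


def per_cancer_score(roi_scores):
    """Average the per-ROI likelihood scores into one per-cancer summary score."""
    return mean(roi_scores)


# Toy 2x2 ROIs (illustrative values only).
pre_rois = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
post_rois = [[[0, 0], [0, 1]]]
hybrids = make_hybrid_rois(pre_rois, post_rois)

# Hypothetical DL-CNN outputs for the hybrid ROIs of one left-out cancer.
summary = per_cancer_score([0.2, 0.4, 0.9])
```

Averaging keeps the summary score on the same scale as the individual ROI scores, so cancers with different numbers of hybrid ROIs remain comparable.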
Radiomics Assessment Model
We also developed a radiomics-based model to distinguish complete responders from noncomplete-responders (9). In total, 91 radiomics features, which previously were shown to be useful in analyzing breast masses, lung nodules, and bladder cancer treatment response assessment, were extracted from every segmented cancer. Details of the radiomics features can be found in (9, 13, 14). The percent difference of each radiomic feature between the pre- and posttreatment tumor was calculated for every pre–post CT pair of a given bladder cancer. A 2-loop leave-one-case-out cross-validation scheme (15) was used to build this assessment model to separate the training procedure, which included feature selection and classifier training, from the testing cases. Within the inner loop, the subset of features was selected and the classifier weights were trained with a leave-one-case-out scheme by using the training partition. In the outer loop the trained classifier was deployed to the left-out test case. In such a way the test case is kept independent from the training process. An average of 4 features was selected, including 2 contrast features and 2 run-length statistics features.
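As a rough sketch of the two steps above, the following code computes the percent difference of each feature and enumerates the partitions of a 2-loop leave-one-case-out scheme. The exact percent-difference formula is not given in the text, so the form below (change relative to the pretreatment value, which must be nonzero) is an assumption, and the function names are hypothetical. Feature selection and classifier training are deliberately left out.

```python
def percent_difference(pre_features, post_features):
    """Percent change of each feature from pre- to posttreatment.

    Assumption: the change is expressed relative to the pretreatment value,
    which must be nonzero.
    """
    return [100.0 * (post - pre) / pre for pre, post in zip(pre_features, post_features)]


def two_loop_leave_one_case_out(case_ids):
    """Yield (test_case, training_cases, inner_partitions) for each outer fold.

    Outer loop: one case is held out for testing.
    Inner loop: leave-one-case-out partitions of the remaining training cases,
    which is where feature selection and classifier training would take place.
    """
    for test_case in case_ids:
        training = [c for c in case_ids if c != test_case]
        inner = [([c for c in training if c != held_out], held_out) for held_out in training]
        yield test_case, training, inner


deltas = percent_difference([2.0, 4.0], [1.0, 5.0])
folds = list(two_loop_leave_one_case_out(["A", "B", "C"]))
```

Because the held-out case never appears in its own training list or inner partitions, anything fitted inside the inner loop remains independent of that test case.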
CAD Score Generation
The final CDSS-T score was obtained by combining the test scores from the DL-CNN and the radiomics assessment models: the combined score was the maximum of the 2 scores. Receiver operating characteristic (ROC) analysis was performed on the CDSS-T scores. To communicate the CDSS-T scores conveniently to the physicians, the scores were linearly scaled to the interval between 1 and 10 and rounded to the nearest integer. These rounded scores were referred to as computer-aided diagnosis (CAD) scores. A score of 1 corresponded to the lowest likelihood that the lesion pair was indicative of complete response, and a score of 10 corresponded to the highest likelihood. Curves were fitted to the distributions of the linearly transformed scores for both the noncomplete-responders and the complete responders. The area under each of the fitted distribution curves was then normalized to a value of 1. The normalized fitted distribution curves (Figure 2C) were displayed on the GUI as a reference together with the cancer-specific CDSS-T likelihood score to be used as decision support in the computer-aided reading by the observer.
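The score combination and scaling can be illustrated with a short sketch. The text does not state which score values are mapped to 1 and to 10, nor how ties at .5 are rounded, so the `score_min` and `score_max` parameters and the round-half-up rule below are assumptions; the function names are hypothetical.

```python
import math


def combined_score(dl_cnn_score, radiomics_score):
    """Combine the two model outputs by taking their maximum."""
    return max(dl_cnn_score, radiomics_score)


def to_cad_score(score, score_min, score_max):
    """Linearly map a score from [score_min, score_max] to [1, 10] and round.

    Assumptions: score_min and score_max are the ends of the score range,
    and halves are rounded up.
    """
    scaled = 1.0 + 9.0 * (score - score_min) / (score_max - score_min)
    return int(math.floor(scaled + 0.5))


score = combined_score(0.3, 0.7)
cad_low = to_cad_score(0.0, 0.0, 1.0)
cad_mid = to_cad_score(0.5, 0.0, 1.0)
cad_high = to_cad_score(1.0, 0.0, 1.0)
```

`math.floor(x + 0.5)` is used instead of the built-in `round` because Python's `round` sends exact halves to the nearest even integer.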
Observer Performance Study
Twelve physicians participated as observers in this study including 5 abdominal-fellowship-trained attending radiologists (faculty experience, 2–36 years), 1 second-year radiology resident, 3 fourth-year radiology residents, 1 attending urologist (faculty experience, 11 years), and 2 attending oncologists (faculty experience, 3 and 10 years). Each observer reviewed each pre- and posttreatment CT pair displayed side by side on a specialized GUI that allows common interactive functions such as windowing, scrolling, and zooming (Figure 2). The observer was asked to provide an estimate of the likelihood of having complete response to treatment of the cancer by inspecting the pre- and posttreatment CT pair. The bladder tumor to be assessed was marked by a VOI box on both the pre- and posttreatment scans. In cases containing multiple cancers and therefore multiple VOIs, each VOI was analyzed separately (Figure 2A). Each observer was given unlimited time for the evaluation and was blinded to the reference standard and to the results of the other observers. To minimize bias related to fatigue or learning due to reading order, the sequence of cases in the reading list was randomized differently for each observer.
For each cancer, each observer provided an estimate of its likelihood of complete response on a scale of 0% to 100%, where 0% indicated definite residual viable neoplasm (>T0 disease) and 100% indicated definite complete response (T0 disease) (Figure 2B). Reader estimates were provided first without and then with access to the CAD likelihood score (Figure 2C). In this way, the observers were given the opportunity to modify their estimate after being provided the CAD score, although they could leave it unchanged if they wished.
Each observer was also asked to estimate a percentage response of tumor to the neoadjuvant chemotherapy on a scale of −100% to +100% using RECIST 1.1 (16) measurement criteria, where 0% indicated no change between pre- and posttreatment CT scans, −100% indicated at least doubling of tumor size, and 100% indicated a complete response.
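One way to read this scale is as a clamped percent change in measured tumor size. The sketch below is only an illustration under that assumption: the text does not give the formula the observers applied, and the function name and the single-diameter input are hypothetical.

```python
def percent_response(pre_size_mm, post_size_mm):
    """Percent response on a -100% to +100% scale.

    0 means no change, +100 means the tumor is no longer measurable, and
    -100 means the tumor has at least doubled in size (growth is clamped).
    Assumption: response is the relative decrease in a single size measurement.
    """
    if pre_size_mm <= 0:
        raise ValueError("pretreatment size must be positive")
    change = 100.0 * (pre_size_mm - post_size_mm) / pre_size_mm
    return max(-100.0, min(100.0, change))


halved = percent_response(30.0, 15.0)
gone = percent_response(30.0, 0.0)
unchanged = percent_response(30.0, 30.0)
doubled = percent_response(30.0, 60.0)
tripled = percent_response(30.0, 90.0)
```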
To study the intraobserver variability, each observer was asked to repeat the evaluation of the first 51 cases in the observer's individually randomized reading list after completing the evaluation of all cases in the list. Because each observer's list was randomized differently, the first 51 cases were different for each observer. We define the first reading of these first 51 cases for each observer as the “original evaluation” to distinguish it from the repeated evaluation in the following discussion. The washout time between the original and repeated evaluations was ∼1 month to avoid potential memorization effects. The observers were not informed that they were repeating the evaluation of the cases. The observers were also blinded to the reference standard and to the results of the other observers.
The observers' estimates were analyzed with multireader, multicase (MRMC) receiver operating characteristic (ROC) methodology using the radical cystectomy specimen as the reference standard (17). iMRMC methodology was also used for the analysis of the not “fully-crossed” intra-observer variability data, which were analyzed as an alternative design study (18, 19). The area under the curve (AUC) and the statistical significance of the difference in readings with and without CDSS-T were calculated. One outcome was a comparison of the diagnostic accuracy of the physicians in diagnosing T0 disease after treatment without CDSS-T and after the physicians had CDSS-T for decision support. Another outcome was an assessment of the intraobserver variability by comparing the results of the original and the repeated evaluation of the corresponding subsets of cases for each observer. The AUC and the statistical significance of the difference between the 2 evaluations were calculated.
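For readers unfamiliar with the AUC, the sketch below computes a simple empirical (Mann-Whitney) AUC from one observer's likelihood estimates. It is a nonparametric stand-in for illustration only: it does not reproduce the MRMC or iMRMC analyses used in the study, and the function name and the example estimates are hypothetical.

```python
def empirical_auc(responder_scores, nonresponder_scores):
    """Empirical AUC: the fraction of (responder, nonresponder) pairs in which
    the responder received the higher likelihood estimate; ties count as half."""
    wins = 0.0
    for pos in responder_scores:
        for neg in nonresponder_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(responder_scores) * len(nonresponder_scores))


# Hypothetical likelihood estimates (0-100%) from a single observer.
t0_estimates = [80, 60, 90]          # cancers with pathologic stage T0
non_t0_estimates = [10, 60, 30, 20]  # cancers with pathologic stage > T0
auc = empirical_auc(t0_estimates, non_t0_estimates)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of complete responders from noncomplete-responders.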
An additional measure of the intraobserver variability was based on the standard deviation of the differences between the observer's likelihood estimates in the original evaluation and the corresponding estimates in the repeated evaluation. The intraobserver variability assessments were performed for the observers' evaluations with and without CDSS-T and then compared.
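This variability measure can be sketched as follows. The text does not say whether the sample or the population standard deviation was used, so the sample form (`statistics.stdev`) is an assumption, and the function name and example estimates are hypothetical.

```python
from statistics import stdev


def intraobserver_sd(original_estimates, repeated_estimates):
    """Standard deviation of the paired differences between an observer's
    original and repeated likelihood estimates for the same cancers."""
    differences = [o - r for o, r in zip(original_estimates, repeated_estimates)]
    return stdev(differences)  # assumption: sample standard deviation (n - 1)


# Hypothetical estimates (0-100%) from one observer for four cancers.
original = [10, 50, 90, 30]
repeated = [20, 40, 90, 50]
sd = intraobserver_sd(original, repeated)
```

A smaller value means that the observer's repeated estimates stayed closer to the original ones, apart from any constant shift between the two readings.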
The average standard deviation of the likelihood estimates by the observers per treatment pair was analyzed to study the effects of CDSS-T on inter- and intraobserver variability. The standard deviation of the observers' likelihood estimates of a given cancer was used as a measure of the level of difficulty, assuming that inter- and intraobserver variabilities would be smaller for easier cancers.
Pearson correlation was used to examine whether there was a relationship between the average level of difficulty of a case group and the AUC obtained by a given observer in reading the same case group. The correlation was calculated for the AUCs of readings both with and without CDSS-T in both the original and repeated evaluations, and for the AUC of the CDSS-T alone. For all analyses, a P-value of <.05 was considered to indicate a significant difference.
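The difficulty measure and its correlation with the AUC can be sketched as below. The helper names and example numbers are hypothetical, and the use of the sample standard deviation is an assumption; the Pearson coefficient is written out explicitly so that the sketch depends only on the standard library.

```python
from math import sqrt
from statistics import mean, stdev


def case_difficulty(estimates_across_observers):
    """Difficulty of one cancer: spread of the observers' likelihood estimates."""
    return stdev(estimates_across_observers)  # assumption: sample standard deviation


def group_difficulty(cases):
    """Average difficulty over the cancers in one case group."""
    return mean([case_difficulty(case) for case in cases])


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical data: observers agree on the first cancer and disagree on the second.
easy_case = case_difficulty([50, 50, 50])
hard_case = case_difficulty([0, 100])

# Hypothetical group difficulties and AUCs that fall as difficulty rises.
r = pearson_r([1.0, 2.0, 3.0], [0.9, 0.8, 0.7])
```

A negative coefficient, as in this toy example, indicates that case groups with larger disagreement among observers tend to have lower AUCs.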
Surgical histology revealed that 25% (40 / 157) of bladder cancers were determined to have a pathologic stage of T0 following neoadjuvant chemotherapy (ie, 40 complete responders). The average maximum diameter for these 40 completely responding lesions was 30.1 mm on pretreatment scans and 14.3 mm on posttreatment scans. Suspected lesions on posttreatment scans in these patients were found to represent an inflamed bladder wall or an entirely necrotic treated tumor. The average maximum diameter for the remaining 117 incompletely responding lesions was 43.0 mm on pretreatment scans and 31.2 mm on posttreatment scans.
Approximately 24% (12/51) of the bladder cancers were determined to be complete responders after neoadjuvant chemotherapy for each of the 12 subsets of 51 cases for the 12 observers used to study the intra-observer variability.
Overall Results for All Cancers
The overall results for all cancers (157 cancer pairs) are summarized in the following as a reference for the current study. A detailed analysis of the overall results can be found elsewhere (10).
The individual AUC values of the 12 observers are shown in Table 1. In general, the physicians' diagnostic accuracy significantly increased (P = .01) and physicians' diagnostic variability significantly decreased (P < .001) with the aid of CDSS-T. The average AUC for all of the physicians combined was 0.74 (range, 0.66–0.78) without CDSS-T, and it increased to 0.77 (range, 0.73–0.81) with CDSS-T. This difference was statistically significant (P = .01). In comparison, the AUC for assessment of complete response by CDSS-T alone was 0.80 ± 0.04.
The original and repeated evaluations of the first 51 cases in each observer's individually randomized reading list and the estimation of the intraobserver variability are analyzed below. Twelve groups of 51 cases, each containing the first 51 cases read by one observer, were evaluated 2 times, referred to as the original evaluation and the repeated evaluation.
The individual AUC values of the 12 observers for the original and repeated evaluations are shown in Table 2 and Figures 3 and 4. For the original evaluation, the average AUC of the 12 observers without the CDSS-T was 0.76 (range, 0.65–0.88), which increased to 0.80 (range, 0.70–0.90) with CDSS-T. The improvement was statistically significant (P = .001). For the repeated evaluation, the average AUC of the observers without the CDSS-T was 0.78 (range, 0.65–0.88), which increased to 0.81 (range, 0.70–0.93) with CDSS-T. The improvement was also statistically significant (P = .010).
However, there was no statistically significant difference between the average AUCs for the original and the repeated evaluations without CDSS-T (P = .083) or for the evaluations with CDSS-T (P = .222).
The standard deviations of the AUCs were smaller for both the original and the repeated evaluations with CDSS-T than for those without CDSS-T: an average of 0.073 without CDSS-T versus an average of 0.069 with CDSS-T (P < .0002) for the original evaluation, and an average of 0.069 without CDSS-T versus an average of 0.064 with CDSS-T (P < .004) for the repeated evaluation. In addition, for both without and with CDSS-T, the standard deviations of the AUCs were smaller for the repeated evaluation than for the original evaluation. However, the differences did not reach statistical significance (P > .07).
When evaluating with CDSS-T, 2 observers performed better than the CDSS-T alone in the original evaluation (Figure 3). In the repeated evaluation, 3 additional observers (5 in total) performed better than the CDSS-T alone when they evaluated with CDSS-T (Figure 4). The average AUC over the 12 groups of 51 cases for assessment of complete response by CDSS-T alone was 0.85 ± 0.06.
The intraobserver variability, estimated as the mean across observers of the standard deviation of the differences between each observer's likelihood estimates in the original and the repeated evaluations, was 26.53 without CDSS-T and was reduced significantly to 21.59 with CDSS-T (P < .0001) (Table 3).
Difficulty of Cancers as a Performance Factor
The level of difficulty for the 12 case groups, estimated by the inter-reader standard deviation of the member cases within the groups, was moderately negatively correlated (r = −0.64) with the corresponding AUC for CDSS-T alone for the 12 groups.
The level of difficulty was also negatively correlated (r = −0.31) with the corresponding physicians' AUCs with and without CDSS-T for the 12 groups. In the repeated evaluation, the physicians' AUCs with and without CDSS-T were less negatively correlated (r = −0.10 and −0.22, respectively) with the level of difficulty than in the original evaluation with CDSS-T.
In this study, we evaluated the intraobserver variability of physicians' treatment response assessments of bladder cancer after neoadjuvant chemotherapy in CT examinations with the aid of CDSS-T.
We observed a statistically significant improvement in physicians' average performance when they used CDSS-T for evaluation compared with when they did not. There was improvement in all experiments, including the evaluation with the entire data set as well as the original and repeated evaluations of the individualized subsets. We found in the previous study (10) that the interobserver variability was significantly reduced with the use of CDSS-T, and in the current study that the intraobserver variability was also significantly reduced with CDSS-T. This is important, because the CDSS-T was able to consistently improve the accuracy of the observer evaluations and reduce the observer variability in the different experiments, including repeated evaluations.
The level of difficulty of the cases had a stronger impact on the performance of CDSS-T alone than on the observer performance. The observers were even less affected in the repeated evaluation with CDSS-T.
For evaluations both without and with CDSS-T, we observed a slight trend of improvement in the observers' performance (increased average AUCs and reduced average variability [standard deviations]) for the repeated evaluation compared with the original evaluation. However, the improvement was not statistically significant. In addition, a larger number of observers with CDSS-T performed better than the CDSS-T alone in the repeated evaluation. The observed trends of improved performance for the repeated evaluation are interesting. They may be attributed to the observers becoming more experienced with the decision-support tool and using it more effectively to improve their assessment. Understanding how users are influenced by their experience with and confidence in a decision-support tool is a topic of interest for future studies.
There are limitations in this study. First, the CDSS-T scores were obtained through the leave-one-case-out cross-validation owing to the lack of a large data set. Ideally, the system should have been evaluated on an independent test set (20). However, the leave-one-case-out cross-validation approach is well established in the machine learning literature and is a statistically valid technique for estimating classifier performance in an unknown population. In the future, as we collect a larger data set, we will evaluate our system on an independent test set.
Second, we used a sequential design for our observer study experiment (21–23). The main reason is that the use of CAD approved by the Food and Drug Administration so far is in the sequential mode as a second reader.
Third, although the performance of CDSS-T alone was higher than that of the observers in this study, the AUCs under all conditions were still modest, probably because of the challenging nature of this classification task. It is possible that the imaging modality itself provides limited radiomics or physiological information that neither a physician nor machine learning will be able to overcome. We are now attempting to improve the CDSS-T by using improved cancer segmentation methods (24), more advanced DL-CNN models (12), and most importantly, by combining the imaging-based assessment with other available clinical biomarkers, including results from bimanual examinations under anesthesia, results from transurethral resection of bladder cancer (25), and molecular biomarkers such as genomics and proteomics.
Fourth, none of our observers was experienced in using a decision-support tool for bladder cancer, because such decision-support tools are not yet available in the clinic for abdominopelvic applications. This may have limited the observers' confidence in the CDSS-T system at the beginning. We expect that physicians will become more receptive to CDSS-T “advice” after gaining experience with the system, as observed in the repeated evaluation results in the current study. The increased experience and improved confidence in CDSS-T may result in further improvements in diagnostic accuracy.
Intraobserver variability exists in physicians' assessment of patients' response to neoadjuvant chemotherapy for muscle-invasive bladder cancer in CT. This study shows that our computerized decision-support system, CDSS-T, can significantly reduce physicians' variability and improve their accuracy in identifying the complete response of muscle-invasive bladder cancer to neoadjuvant chemotherapy. To validate the impact of the CDSS-T on clinical decision-making, a large-scale observer study should be conducted in an independent case set. A fully validated CDSS-T may have the potential to improve physicians' decisions in the selection of patients with muscle-invasive bladder cancer for bladder-sparing therapy.