X-ray computed tomography (CT) is a nondestructive technique in which a source of X-rays (ionizing radiation) revolves around an object of interest, generating axial images of its structure. CT is today an indispensable tool in medicine for the diagnosis of multiple diseases (1). Since its introduction in 1970, there are about 30,000 CT scanners in the world and its number continues to increase exponentially (1). Even if CT techniques represent one of the greatest developments in the field of X-rays in the past 50 years (1), there is still room for improvement. Some of the proposed lines of work to achieve this objective include reduction of patient exposure time, reduction of acquisition and processing times, development of new techniques with new functionalities, and cost reduction (2). Some of the solutions to these points have so far focused on the development of higher performance hardware that will lower costs for and dose delivery to patients without compromising the effectiveness of the diagnosis. Work has also been done on the use of iterative methods for image reconstruction, reducing processing times and reducing the dose necessary for the acquisition of images with diagnostic value.
Current CT scanners use filtered back projection (FBP) for regular image reconstruction (3, 4). This technique is mathematically based on the radon transform (RT) (5, 6). In this technique, the image of an object is reconstructed from a set of X-ray projections of the aforementioned object (7). There are many computational algorithms that can be used to solve the RT. The theoretical framework of reconstruction methods can be found in the literature (2, 8, 9). Nevertheless, these are usually classified into 2 methodologies: analytical reconstruction methods and/or iterative reconstruction methods. In the first method, the use of FBP algorithms has been the standard to date (10), because it is a simple and fast reconstruction algorithm with low computational cost. However, it is prone to the appearance of artifacts and sometimes has low spatial resolution. In addition, it requires a large number of projections without noise for the reconstruction solution to be of quality. This is far from real-life situations, in which only a finite set of projections will be available. Moreover, “approximate” reconstruction methods are also known as iterative algorithms. These are less sensitive to incomplete sets of data, noise, and artifacts (10, 11). This translates into better quality of the reconstructed images. In addition, by requiring less information for reconstruction, they can use fewer projections/radiation by reducing the doses delivered in a study to a patient. For a review of the mathematics behind these methods, see the literature (12). The most common between them is maximum likelihood expectation maximization (MLEM) (13). Nevertheless, the main limitation of these techniques with respect to analytical reconstruction methods is their high computational cost.
One way to deal with the limitation of long computational times is to parallelize the reconstruction of an image. To this end, the graphics processing unit (GPU) can be used in combination with the parallel programming environment CUDA (compute unified device architecture) (14-16), same that allows to handle and process data in GPUs. It is made up of execution units, called streaming multiprocessors, which in turn are formed by computing cores called streaming processors, or CUDA cores. It is these nuclei that execute the instructions. That is, the mathematical operations or the routing and transfer of data in memory. The programming model of graphic cards involves both the central processing unit (CPU) and the GPU. The sequential part of the applications is executed in the CPU (Host), and the intensive computing part is accelerated by using the GPU (Device) in parallel. This technology is basic for the use and application of GPUs in fields as diverse as: bioinformatics, computational chemistry, computer imaging and vision, climate, numerical Analysis, etc. (17).
Medical image processing is one of the first fields in which GPUs were used (18, 19). Tatarchuk et al. (20) presented a set of methods that allowed interactive exploration of medical data sets in real time. Flores et al. (21, 22) presented the development of a fast algorithm implemented in GPUs to reconstruct high-quality images from projected, sampled, and noisy data. Belzunce et al. obtained a parallel implementation under CUDA for nuclear medicine data (SPECT and PET). An acceleration factor of up to 85 times was achieved with respect to a single-wire CPU implementation (23). Xie et al. proposed a way to synchronize the acquisition of CT data with the GPU analysis. This reduced the computational analysis time between 10 and 16 times (24). Finally, Xing et al. (25) showed more applications on the use of GPUs in this area.
In this study, medical CT images will be reconstructed using an iterative method that will solve the MLEM reconstruction algorithm. (13) This will be done using GPUs and CUDA to parallelize the analysis and reduce its computational costs. This analysis will be performed under nonideal noise conditions and their results will be compared with those of commercial medical imaging programs. Another objective of this study is that the work scheme developed can be implemented in a commercial laptop with a graphics card compatible with CUDA. This way and using a system with minimal computing requirements; a fast, simple and low-cost image reconstruction platform can be provided. The article is structured as follows: the next section presents the methodology used in the processing of the images comprising the structure of the algorithm modules, the quality metrics applied to the reconstructions obtained, and the software and hardware used. This is followed by the results section in which the benefits of the CUDA implementation of the MLEM algorithm is presented and compared.
First, the MLEM algorithm was implemented in series, both in MatLab® (MathWorks, Natick, MA) and in C. Second, the same algorithm was implemented in parallel using CUDA. Third, different image quality parameters were calculated for a series of clinical and phantom images based on the reconstruction time and the number of iterations. In a couple of cases Poisson noise was added to blur the original images and perform an evaluation of the effect of noise on the different reconstruction parameters. Finally, results of the reconstructed images were compared with those of commercial software.
Lange et al. (26) presented the MLEM reconstruction algorithm for emission and transmission tomography. The MLEM algorithm was derived from the following mathematical expression:
Here is a pixel, j is the image x in the r-th iteration, is the interval i of the projection measured given by the CT scanner, is the system matrix whose coefficients connect the values of the image with projections and describe the probability of detecting a photon in píxel j for the i bin.
All algorithm implementations developed in this study required of an system matrix that was used in the resolution of the MLEM algorithm given in the literature (1). Creating a system matrix depended on the number of projection lines , the number of projection angles and the size of the image to be reconstructed . This matrix could be calculated in different ways, and in this project, the methodology given in the literature (27-29) was used to generate and model the matrix in MatLab.
Implementation of the Algorithm for CT Image Reconstruction
As already mentioned, to compare processing times and performance, the MLEM algorithm was implemented, first serially, in MatLab and C languages, and then in parallel using CUDA on the graphics card.
Two versions of the MLEM algorithm were developed, one in MatLab and one in C (Serial-MatLab and Serial-C, respectively). The different processes that are part of the programmed algorithms are described as follows:
Input Data: In a first step, the files provided by the CT scanner were loaded. This included the system matrix and the measured projections . These structures were called matrix and vector, respectively. The matrix and the vector provided information such as the number of rows and columns of the system matrix or the number of elements of the projection vector .
Initial Image Estimation: To reduce the number of iterations, an initial estimate of the image was used— . It was created either with all its values equal to the average of the projection data or by setting its values at an intensity between 1 and 255 (grayscale).
Forward-Projection: The first iteration was performed on the image with an initial estimate that simulated the projections of a given . This was required for the calculation of to determine the corresponding projection values.
Comparison with the measured data: The comparison step was performed by dividing the original projection data with the data from simulated projections , obtaining a relation . Here formed a multiplicative correction factor for each projection.
Back-Projection: The calculated was back-projected to obtain a correction factor for the initial image. This was done with the following system calculation .
Normalization: The correction factor was normalized before being multiplied by . It was divided by a weighting term based on the system model to apply the desired intensity of each image correction factor. This produced , where represented the normalized correction factor.
Image Update: The initial estimation of image was multiplied by the normalized correction factor obtaining the new image estimation which was then fed into the algorithm in the initial image estimation step.
These processes were executed iteratively until the relative error between the estimated and the actual image reached a value <5%. To verify that this behavior was maintained, up to 1000 iterations were performed.
From the serial implementation of the algorithm, the performance was measured for each of the processes. It was found that the processes that consumed the most resources were the steps of Forward-Projection and Back-Projection. Between them, they used ∼75% of computational time.
Two versions were developed: parallel using GPU (Parallel-GPU) and parallel using GPU but optimizing data transfer (Parallel-OpT). Both implementations resolved equation (1) using both C and CUDA. CUDA was used to exclusively develop the Forward-Projection and Back-Projection processes, known for being the ones that consume the most computing resources.
The calculations performed on the data structures were handled in C by four kernels (Back-Projection, Forward-Projection, Normalization, Image Update), which distributed the data to the blocks of threads. This was done in similarity to the serial case. The different processes of the parallel versions are described as follows:
Forward-Projection module: To this end, a Forward-Projection module, using resources from the cuBLAS library from CUDA, was programmed (30, 31). No Forward-Projection or Back-Projection modules exists per se in cuBLAS, so they were developed specifically by authors for this project. Matrix was loaded to the Host and transferred to the Device; vector was generated in the Host and also transferred to the Device. Matrix was partitioned in grids of threads blocks. Each thread read a row of the matrix and multiplied it by vector to obtain a vector in the Device. The resulting vector was then transferred to the Host.
Back-Projection module: As in the previous process, the Back-Projection module was launched on the Device. This module required most of the computational resources for the entire reconstruction process, as it consisted of 2 operations: calculation of the transposition of and that of its product with vector .
Optimization of GPU computing time. The main data transfers between Host and Device were carried out when the Forward-Projection and Back-Projection processes were executed. To improve the performance of the programs, the transfer of data between the RAM of the CPU and the global memory of the GPU was handled by joining the Forward-Projection, comparing the measured data and Back-Projection modules in one function, and maintaining data calculations in the GPU. This created the optimized parallel version (Parallel-Opt).
As in the serial case, these processes were executed iteratively until the relative error between the estimated and the actual image reached a value <5%.
Image Quality Parameters
In this study, 2 clinical CT images were initially reconstructed along with a 512 × 512 image of the Shepp–Logan phantom (SLP) (32). The first clinical image was an axial cut of the brain that covered the cerebellum, as well as the nasal and auditory cavities (33). The second clinical image represented an abdominal axial cut through the lungs, heart, pancreas, and liver (33). The matrices of both CT images were 512 × 512 with a dynamic range of 8 bits per pixel, with a system matrix of and a vector projection of for a 5° sweeping angle.
To evaluate the quality of the reconstructed images, different image quality parameters were used. The main parameters studied were: contrast-to-noise ratio and signal-to-noise ratio (34). The first indicated how well 2 neighboring tissues could be distinguished. The latter how strong a signal was with respect to background noise. Other image quality parameters evaluated were: mean squared error (MSE) (35), peak signal-to-noise ratio (PSNR) (36) or structural similarity index (SSIM) (37). MSE provided a measurement of the differences between a reconstructed image and a reference image. The lower the MSE value, the better the result. The PSNR described the relationship of the maximum possible power of a signal with the power of noise corruption. It was represented on a decibel scale, and a high PSNR value meant that the reconstruction was of high quality. Finally, the SSIM evaluated the following 3 image parameters: luminance, contrast, and structural correlation. The SSIM was used to measure the similarity between 2 images, namely, original CT image and reconstructed images. SSIM values were normalized between 0 and 1; being 1 the situation in which both images were equal.
All image quality parameters were measured for all the CT images and for the 4 implementations of the algorithm developed and the two versions of TIGRE. In addition, 2 scenarios were considered, in which 5% Poisson noise was added to the images under study to assess the impact of noise on the reconstruction time and quality.
Comparison with Commercial Software
To compare the implementations developed in this work for the MLEM algorithm with those of an external (commercial) code, the TIGRE MatLab library was used (38). From it, 2 iterative tomographic reconstruction toolboxes based on GPUs were used, namely, the OS-SART algorithm that is used to solve reconstructions of the conical beam CT images using simultaneous algebraic reconstruction techniques with ordered subsets, and the ASD-POCS algorithm that is based on the most pronounced adaptive descent projection in convex subsets (39).
Hardware and Software
Serial and parallel implementations were developed and executed on a laptop with an Intel Core i7-5500U processor running at 2.40 GHz and equipped with an NVIDIA GeForce 840 M graphics card, with a global memory of 2048 MB and 384 CUDA cores on an Ubuntu operating system 15.04 (40). MATLAB R2017a was used (27) as well as GCC 4.9.2 (41) for the serial part. CUDA C 7.5 (31) was used for the parallel calculations. The Open Source Computer Vision library (OpenCV 3.2.0) was used to calculate image quality measurements (42, 43).
Computer algorithms and code developed here will be available after petition to authors.
Figures 1 to 3 present the results obtained for the 4 implementations of the algorithm developed here (Serial-MatLab, Serial-C, and the 2 parallel versions) and the external test software (2 versions).
Images in Figure 1 correspond to a reconstruction with Parallel-OpT software with a maximum of 1000 iterations. In the first row, SLP and two 512 × 512 CT images are presented. In the following row 4 128 × 128 regions of interest of the previous images are presented. These regions were selected randomly by authors. The original image of the SLP (first column) is shown in the first row of Figure 1. In the second column, the reconstructed image “iteration 64” is shown, which will be referred from now on as “the best” and called iteration64. Number 64 just quantifies the number of iterations necessary to reach it. In the third column, the reconstructed image called “functional” is shown. In this case it is called “iteration 35,” and it is characterized by the fact that the differences in its quality parameters with those of the “the best” are <5%. The last column shows the “functional” image with the Poisson noise. The CT images of the central row correspond to the axial section of the brain, while the abdominal axial section is presented in the last row of Figure 1. The data of the last 2 rows are presented in the same scheme as in the SLP image.
The values of the different quality parameters of each image are presented in Table 1. The minimum number of necessary iterations to obtain the given image quality value and the time taken to achieve it are also presented.
Figure 2 shows the mean values for the different quality parameters of the reconstructed images of Figure 1: MSE, PSNR, and SSIM. Table 2 vs. number of iterations presents the quantification of these values. Images obtained from reconstruction had 512 × 512 matrices formed from CT projections that varied between 0° and 180° in 5° steps (views). Initially, the quality values of the reconstructed images were calculated for all serial and parallel implementations; however, the differences in the absolute values of these quality parameters were <0.001%.
In addition, the dependence between the optimal number of views/projections for image reconstruction in the implementations was studied. These results are presented as complementary material (see online supplemental Figure 1). In this analysis, the image quality parameters, as well as the time calculations, were compared based on the number of iterations and the number of views. It was found that the use of jumps of 5° per view to cover the entire field of vision, as well as 50 to 100 iterations, produced reliable results without extending the calculation time while obtaining high-quality image parameter values.
It is worth mentioning that when the field of view was covered either with projections that varied between 0°–180° and 0°–360°, the reconstructions of the different programs produced very similar results (regarding image quality parameters). Because of this and from this point, only reconstructions with a system matrix that varied from 0° to 180° were taken into account.
Figure 3 presents the average computation time for the 4 implementations/programs developed by the authors, as well as the 2 versions based on TIGRE. It also shows the results with added Poisson noise. The data are presented versus the number of iterations.
It can be observed that the differences in computation time between the Parallel-OpT and Serial-C implementations were of the order of 12 times faster for the parallel solution. The Parallel-OpT was 170 times faster than Serial-MatLab. Parallel-OpT and Parallel-GPU implementations differed by a factor of 2 when computer processing time was considered. It was observed that the calculation times of the ASD_POCS version of TIGRE were comparable with those of the Serial-C and ∼10 times slower than Parallel-OpT. In contrast, TIGRE OS-SART was the fastest of all, being 2 times faster than the Parallel-OpT implementation in CUDA. When the data were analyzed with aggregate noise, no differences were found in the computation time with respect to Paralell-OpT. This was for the reconstruction of the 3 images studied. Finally, it is worth noting that the relationships between the calculation times of implementations of the MLEM algorithm were constant and remained independent of the number of iterations or added noise.
As mentioned earlier and as seen in Figures 2 and 3, Parallel-OpT was up to 13 times faster than ASD_POCS and 7% better in image quality. The OS-SART was, in contrast, 50% faster, but with lower image result quality. The MSE was 720% worse; the PSNR was 7.5% worse
These differences are highlighted in Figure 4. In Figure 4, a general comparison between Parallel-OpT and the 2 chosen from the TIGRE library are shown. The best quality/time ratio can be appreciated from the first implementation.
Finally, to test the stability and effectiveness of these programs, data from a clinical study were reconstructed using the Parallel-OpT implementation. Further 101 CT images, corresponding to a single axial cut of the pelvis were obtained from each volunteer. Data were downloaded from The Cancer Genome Atlas Sarcoma database (TCGA-SARC) (33); therefore, no ethical permission was needed to perform this study on our side. In each image the SSIM was calculated as in previous sections. The SSIM data are presented in Figure 5 (using MatLab Distribution Fitter app software). SSIM values reached a maximum of 0.9. Data presented a normal distribution form of 0.902 ± 0.019 (mean ± SD). This was confirmed by conducting a Shapiro–Wilk test calculated with SPSS software (P < .05). All data, except one, exceeded the threshold value of 0.85 SSIM, a standard limit for good image quality.
Discussion and Conclusions
This work shows the parallel implementation of the MLEM reconstruction algorithm using graphic card capabilities. Software solutions developed here were compared with reconstructions of clinical images using several image quality parameters (CNR, SSIM, MSE, etc.). Computer time used for image reconstruction in CUDA-based programs was shorter than that associated with serial solutions. Acceleration factors that varied between 2 and 170 times were obtained. In comparison with the TIGRE software, images were obtained of similar quality, but in some cases and for some specific image quality factors, images were 7 times better for our CUDA programs. The addition of 5% noise to signals did not increase the reconstruction time proportionally; in fact, it left it almost unaltered. Moreover, image quality decreased in general by a 5% factor (except for the PSNR parameter that lost 30% of its maximum value in some scenarios). The clinical study results showed that calculations were consistent and reproducible. Furthermore, the quality of image reconstruction was conserved even at short reconstruction times. All these results imply that the MLEM implementations in CUDA using GPUs’ capabilities presented here were reliable and fast.
With respect to image acquisition, it was shown that to generate the system matrix, only projections in the range of 0° to 180° were needed. This contrasted with the acquisition of full-field projections (from 0° to 360°). This is a well-established fact in the field, so no data or images were presented to support this fact in this work. In general, acquisitions over small angles (1°, 2°, or 3°) at 1000 iterations took up to 5000 seconds. This amount of time was obviously greater than reconstructions in steps of 5° or larger. Measurements with larger angles were described in the online supplemental Figure 1. There it could be appreciated that the 5° measurements presented a perfect compromise between reconstruction quality and reconstruction time. Therefore, this value was the one selected for reconstruction in this study.
Figure 2 showed how the different image quality metrics (MSE, PSNR, SSIM) decreased first and then increased rapidly as the number of iterations increased. This allowed to deduce an “ideal” range of iterations for image reconstruction for this software implementation that was set between 50 and 100. Therefore, as it can be seen in Figure 2B, the PSNR parameter for the SLP and the abdominal image reached a local maximum value within the range of 50 to 100 iterations. For the head image, this effect was similar but less prominent. Obviously when larger number of iterations were performed, values of this last parameter were larger and therefore better. Nevertheless, the time used to improve 5% of this parameter could be easily 5–10 times larger. That made this approach of larger iterations less attractive for practical purposes. The maximum values of the SSIM metric for the SLP, the head, and the abdominal images were 0.992, 0.987, and 0.986, respectively. These maximum values were reached within the proposed range of iterations (see Figure 2C). After the maximum, the values remained stable and were not affected by a greater number of iterations. As indicated before in the text, values of SSIM over 0.85 represented images of good clinical quality.
Once the iteration interval and the degrees between projections used to acquire the images were established, the processing times of the different implementations developed were compared (Figure 3). It was found that the difference in time between Serial-C and Parallel-OpT to achieve the same result was ∼1000 seconds at 50 iterations (12 times more) and 9000 seconds in 1000 iterations (170 times more). These results showed that implementations of the MLEM algorithm based on CUDA could perform good image reconstructions in shorter times than traditional CPU-based methods. Surprisingly, the addition, the Poisson noise did not affect the reconstruction times of the Parallel-OpT implementation. When compared with the results from TIGRE implementations, equal or better time performances were achieved, always with better image quality reconstruction.
If the reader considers that one of the basic ideas behind this work was to implement the MLEM algorithm in a novel way by using the GPUs’ capabilities in commercial/low capability computing systems, he/she can observe that this was accomplished. Therefore, the solution for image analysis presented here could be used in institutions or health systems with low financial resources in a reliable way, conserving image quality and maintaining reconstruction times as low as possible.