Echocardiogram enhancement using supervised manifold denoising

Echocardiogram enhancement using supervised manifold denoising

Medical Image Analysis 24 (2015) 41–51 Contents lists available at ScienceDirect Medical Image Analysis journal homepage: www.elsevier.com/locate/me...

3MB Sizes 5 Downloads 64 Views

Medical Image Analysis 24 (2015) 41–51

Contents lists available at ScienceDirect

Medical Image Analysis journal homepage: www.elsevier.com/locate/media

Echocardiogram enhancement using supervised manifold denoising Hui Wu a,∗, Toan T. Huynh b, Richard Souvenir a a b

Department of Computer Science, University of North Carolina at Charlotte Department of General Surgery, Carolinas Medical Center

a r t i c l e

i n f o

Article history: Received 12 September 2014 Revised 29 April 2015 Accepted 5 May 2015 Available online 21 May 2015 Keywords: Biomedical image processing Echocardiography Image denoising Supervised learning Video signal processing

a b s t r a c t This paper presents data-driven methods for echocardiogram enhancement. Existing denoising algorithms typically rely on a single noise model, and do not generalize to the composite noise sources typically found in real-world echocardiograms. Our methods leverage the low-dimensional intrinsic structure of echocardiogram videos. We assume that echocardiogram images are noisy samples from an underlying manifold parametrized by cardiac motion and denoise images via back-projection onto a learned (non-linear) manifold. Our methods incorporate synchronized side information (e.g., electrocardiography), which is often collected alongside the visual data. We evaluate the proposed methods on a synthetic data set and real-world echocardiograms. Quantitative results show improved performance of our methods over recent image despeckling methods and video denoising methods, and a visual analysis of real-world data shows noticeable image enhancement, even in the challenging case of noise due to dropout artifacts. © 2015 Elsevier B.V. All rights reserved.

1. Introduction Whether for visual enhancement or as pre-processing for downstream algorithms, such as anomaly detection (Qian et al., 2011), object segmentation (Santiago et al., 2013), motion analysis (Huang et al., 2014; Papademetris et al., 2002; Yu et al., 2014), or atlas construction (Duchateau et al., 2011), denoising facilitates visual data interpretation by removing noise, de-emphasizing distractors, and increasing the definition of relevant organ structures. Compared with other cardiac imaging techniques such as magnetic resonance imaging (Huang et al., 2011), denoising is especially important for eachocardiograms since the images often suffer from ultrasound imaging noise. Fig. 1 shows sample echocardiograms depicting common cases of ultrasound imaging noise. The images in the first column are relatively clear examples with well-defined cardiac structures. The frames in the second column depict an elevated amount of speckle noise.1 This can be observed as the granular pattern resulting from the scattering of ultrasound signals, which can introduce discontinuities at the boundaries of larger tissue and, often, a lack or loss of clarity of smaller structures (e.g., heart valves). The third column shows images containing dropout artifacts, which are typically caused by a loss of tight contact between the transducer and the patient, ∗

Corresponding author. Tel.: +17 046123398. E-mail address: [email protected] (H. Wu). 1 For image enhancement, speckles are considered as unwanted visual artifacts. However, it is worth noting that speckle patterns can be useful features for automated tracking algorithms (e.g., Yue et al., 2009).

http://dx.doi.org/10.1016/j.media.2015.05.004 1361-8415/© 2015 Elsevier B.V. All rights reserved.

insufficient conductive gel, or extra fluid or fat tissue between the transducer and the heart. Dropout artifacts can result in the loss of visibility in part of the structures of interest, and often affect consecutive frames in a video. These example frames demonstrate a common issue: often, real-world biomedical data contains multiple sources of noise. Many existing video denoising algorithms extend single-image algorithms, with some modification to incorporate temporal regularization. In the single-image case, most methods assume a prior underlying model on the noise distribution, such as zero-mean Gaussian noise or a Poisson distribution (Zhang et al., 2008). For single images, even in the biomedical domain, these assumptions are often reasonable and provide adequate denoising. However, as in the echocardiography example shown in Fig. 1, a single statistical model may be insufficient to represent the complexity of multiple, different noise sources. In this paper, we do not make strong assumptions regarding the noise model, but rather make use of the fact that the number of underlying causes of image change in biomedical video is often quite low. That is, the number of degrees of freedom in such sets is small and often enumerable. For example, cardiac ultrasound frames may vary due to cardiac phase, patient breathing, transducer motion, and imaging noise. In the case of denoising, or other biomedical video analysis tasks, this structure can often provide a more perceptually meaningful basis for regularization than the temporal order of frames in a sequence. However, factoring the causes of image change directly from visual data and recovering this underlying structure is a non-trivial problem due to the high dimensionality of data and low

42

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

Fig. 1. Biomedical video often suffers from multiple sources of noise. These images show echocardiogram frames that suffer from speckle noise (middle) and dropout artifacts (right) highlighted by arrows.

signal-to-noise-ratio (SNR). To address this problem, we take advantage of the fact that in clinical biomedical settings, there is often available, associated metadata collected during image acquisition that can serve as a proxy for the dominant cause of image change. For example, in echocardiography, electrocardiograms (ECG), which record cardiac electrical activity and, therefore, heart phase information, are usually acquired along with synchronized echocardiograms. This paper, which extends our previous work (Wu et al., 2013) with a new kernelized model, incorporates motion-relevant context for data-driven video denoising rather than learning this structure directly from noisy image data. Since our approach relies on the underlying relationships between images for denoising, it naturally applies to cases when multiple sources of noise are present. 2. Related work The literature on video denoising, both within and without the biomedical domain, is vast. Many of the approaches tailored to video are extensions of single-image denoising methods that incorporate some form of temporal regularization. In general, these methods consider noisy images as ideal noise-free images corrupted by random noise drawn from a specified probability distribution and introduce additional constraints to solve the under-constrained task of separating the noise-free image from the noisy input. Image-based video denoising. One common model is to assume independent and identically distributed, zero-mean Gaussian additive noise, where images are denoised by finding the mean of neighboring pixels, and the neighborhood can be defined spatially and/or temporally based on some notion of similarity (Buades et al., 2005; 2008; Ghoniem et al., 2008; Ren et al., 2012; Xu et al., 2010). For ultrasound image analysis, more complex noise models have been considered, including multiplicative, locally-correlated speckle noise. This model has been incorporated into algorithms that employ nonlocal means (Coupe et al., 2009) and wavelet shrinkage (Gupta et al., 2007; Khare et al., 2010). One of the most widely-applied approaches, speckle reducing anisotropic diffusion (SRAD) (Yu and Acton, 2002), considers speckle removal as an iterative edge-preserving diffusing process. There have been a number of extensions for ultrasound enhancement (Aja-Fernández and Alberola-López, 2006; Aksel et al., 2006; Krissian et al., 2005; 2007; Yu et al., 2010), which follow the general SRAD framework. These image denoising methods all share the same inherent assumption of a single noise model, whereas in this paper, we consider cases of biomedical image analysis, which

exhibit noise models with composite causes, including factors that are difficult to model parametrically. Manifold denoising. Compared to video denoising methods extended from single-image algorithms, there has been some work that considers the entire video as a whole by considering image-level relationships. Manifold denoising approaches take advantage of the property that related images, when considered as points in a high-dimensional space, lie on or near a low-dimensional manifold. Manifold-based methods denoise images by combining information from nearby images on the manifold, rather than temporal neighbors. One method for manifold denoising learns a global manifold representation of the image set using density estimation, and images are denoised by minimizing the sum of deviation from the learned manifold (Sun et al., 2012). Similarly, there have been denoising approaches based on other unsupervised nonlinear dimensionality reduction techniques, such as Kernel Principal Component Analysis (Kwok and Tsang, 2003), autoencoders (Vincent et al., 2010), and Gaussian Process Latent Variable Model (Gao et al., 2008). These methods use the learned low-dimensional representation of images to denoise via back-projection into image space. There are other methods that do not rely on constructing a global manifold structure, but rather exploit the locally smooth and linear properties of a manifold. Local Principal Component Analysis (PCA) has been used to estimate local manifold directions, and images are denoised by varying the image along the vector in image space in the direction orthogonal to the tangent space (Wang and Carreira-Perpiñán, 2010; Wang et al., 2011), or by minimizing the overall reconstruction error (Gong et al., 2010). The nearest-neighbor graph, used in graph embedding approaches, has been used to approximate the manifold topology and denoise images by graph diffusion, which iteratively reduces the differences between neighboring images (Hein, 2007; Hein and Maier, 2006). An important assumption underlying these manifold denoising methods is that the manifold structure can be well-approximated solely from the input images. However, robustly learning a low-dimensional manifold is a non-trivial problem in presence of composite causes of imaging noise, where the (multimodal) noise variance may dominate data variance. Denoising with side information. One approach to side-stepping the problem of learning a low-dimensional representation for image data is to incorporate side information. In contrast to the large amount of work done in unsupervised image and video denoising, there is much less that incorporates side information, or metadata. In some methods side information has been used during training to learn a

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

denoising function that maps noisy image patches to denoised pixel intensities (van Ginneken and Mendrik, 2006; Li, 2009). These approaches have not been applied to biomedical video analysis, as the construction of a training set well-sampled from the feature space is difficult due to complex structural variations and unavailability of ground truth images. Our approach for biomedical video denoising utilizes the side information for supervised learning of the underlying low-dimensional structure of the entire video, and, similar to related manifold denoising methods, is entirely data-driven and does not make strong assumptions on the noise model. 3. Background In this section, we describe the concept of video denoising through subspace learning. Consider an image set, X, with (noisy) images (in column-vector representation), xi ∈ RD , where D is the number of pixels in each image. Under this model, we assume the ideal (noise-free) images, zi = M(ti ), are sampled from a function, M : Rl → RD , where t ∈ Rl denotes the low-dimensional (l  D) manifold coordinate for the ideal image, zi . Under this model, the observed input image, xi , is the manifold image perturbed by an unknown noise model

xi = M ( ti ) + η ,

(1)

and the goal of denoising is to estimate zˆ i , such that zˆ i is as close to M(ti ) as possible. Given M, the denoising problem would reduce to an optimization that maps the input sample to the closest point on the manifold. In the unsupervised case, both M and each ti are unknown and must be estimated from the data. In the supervised case, ti (or a proxy) are provided. To motivate the need for supervised (nonlinear) subspace learning, we first review approaches for denoising using unsupervised subspace learning (Section 3.1).

(a) PCA

43

(b) SPCA

Fig. 2. For a set of 2-D data points, the learned subspace (represented by a line) may differ depending on the amount of supervision. (a) PCA finds the subspace in the direction of the largest variance of data. (b) When metadata is provided (indicated by colors of the points), supervised PCA (SPCA) finds the linear subspace that best correlates with the metadata.

An implicit assumption of these subspace learning methods, both linear or nonlinear, is that the subspace or manifold can be estimated in an unsupervised fashion, i.e., only using noisy data samples as input. However, when dealing with images with low SNR, when denoising is most needed, it is possible that the image set variation due to noise dominates the change(s) due to the motion of interest. This issue is exacerbated when the manifold is not densely sampled, i.e., N  D, as is the case for most biomedical image sets. Fig. 2 illustrates a case where, for a toy 2D data set, the direction of maximum variance does not correlate with the associated motion metadata (indicated by colors of the points). In the case of echocardiogram enhancement, this corresponds to the case where imaging noise is more dominant than the motion of interest (e.g., motion due to heartbeat). When metadata related to the underlying motion of interest is available, it can be incorporated by providing information correlated with the structure of the underlying manifold. In the next section, we describe our approach to video denoising by incorporating provided metadata for subspace learning.

3.1. Denoising by unsupervised subspace learning Existing approaches for manifold denoising often introduce additional assumptions on the subspace properties to constrain the problem of estimating the underlying low-dimensional subspace given noisy samples. The most common assumption is that the ideal images are drawn from a linear subspace. Under this assumption, Principal Component Analysis (PCA) has been used to find the linear subspace that minimizes the difference from noisy images to their projections (Zhang et al., 2010; 2009). Given the learned linear subspace using PCA, images can be denoised by projecting highdimensional images onto the subspace, followed by back-projection to the image space. This linear assumption, however, does not approximate well the underlying manifolds of biomedical videos; the image changes most often observed (e.g., speckle noise, non-rigid deformation, dropout artifacts) are rather complex and difficult to model with a linear basis. To address these limitations, there have been methods that consider learning nonlinear subspaces of image sets. These methods often rely on the locally linear and smooth manifold assumptions. One group of approaches (Gao et al., 2008; Kwok and Tsang, 2003) explicitly learns low-dimensional manifold coordinates using nonlinear analogs to PCA, such as ISOMAP (Tenenbaum et al., 2000) or LLE (Roweis and Saul, 2000). However, nonlinear dimensionality reduction techniques often lack a parametric mapping to the image space, so a separate step is needed to learn the back-projection function (Duchateau et al., 2013; Gerber et al., 2010). Another group of nonlinear approaches (Gong et al., 2010; Hein, 2007; Hein and Maier, 2006; Sun et al., 2012; Wang and Carreira-Perpiñán, 2010) focuses on denoising local patches of the manifold estimated by the k-nearest-neighbor graph without explicitly estimating the global low-dimensional parametrization.

4. Methods In this section, we introduce our algorithms for video denoising using SPCA and KSPCA (Barshan et al., 2011), which take as input (noisy) images, X, and their associated metadata, S, and return a corˆ We assume that the metadata responding set of denoised images Z. correlates with the motion(s) of interest within the image set. 4.1. Supervised PCA (SPCA) Given the metadata, S = [s1 , . . . , sN ], associated with the (centered) input images, X, and a Gaussian kernel function, κ m ( · ), the N × N kernel matrix Km encodes the pairwise metadata similarity. The basis, U, that optimally correlates with the metadata can be computed as the top d eigenvectors of matrix XKm XT . This requires the eigendecomposition of a D × D matrix, which is impractical as the number of pixels per frame, D, is often quite large (i.e., millions). The dual version of the problem only requires the eigendecomposition of a N × N matrix, where N is the number of images in the set and usually N  D. Let L be the Cholesky decomposition of Km , such that Km = LT L. Following the eigendecomposition of the matrix, LXT XLT , let V be the matrix of the first d eigenvectors in each column, and  be a diagonal matrix containing the square roots of the first d eigenvalues in the diagonal. The optimal basis of the learned subspace can be computed as follows:

U = XV−1

(2)

where U is a D × d matrix where each column is a basis vector of the linear subspace. Given U, the projection operation for images is

44

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

(a) SPCA

(b) KSPCA

Fig. 3. For a set of 2-D data points (colored by associated metadata), the learned subspace (represented by a line or curve) may differ depending on the algorithm. (a) Supervised PCA (SPCA) finds the linear subspace that best correlates with the metadata. (b) Kernel Supervised PCA (KSPCA) finds the nonlinear variation in data that correlates with the metadata.

as follows:

Y = UT X,

(3)

where Y = [y1 , · · · , yN ] are the coordinates of projected images onto the learned subspace. Similar to PCA, in SPCA, the denoised images can be estimated by linear back-projection, Zˆ = UY. However, using SPCA directly in this manner leads to a similar limitation as with PCA. Even if the subspace is correlated with the motion of interest, in biomedical video, complex, non-rigid motion often leads to image changes that are poorly approximated by linear basis functions.

1 1 K1 − 2kx j )T HV−2 VT H(kxi − K1 ) N N 1 2 + 2 1T K1 + K j j − 1T kx j (6) N N

φ (x j ) − φ (zˆ i )22 = (kxi +

Depending on the kernel, the feature space distances can be converted to image space distances. In our case, with the Gaussian kernel, the squared Euclidean distance between the denoised image, zˆ i , and the j-th image, xj , is as follows:



d2j = −2σ 2 log 1 −

4.2. Kernel supervised PCA (KSPCA) The kernel version of SPCA can approximate the nonlinear manifold structure underlying the input noisy images, while also incorporating metadata for supervised subspace learning. KSPCA implicitly induces a feature mapping, denoted by φ ( · ), from the input image space to a different feature space, so that linear subspace learning in feature space corresponds to nonlinear subspace learning in the original image space. With KSPCA, images are projected to the lowdimensional subspace (in feature representation), then denoised by back-projection to the feature space. There is then an additional step, unlike PCA and SPCA, of an inverse feature mapping to reconstruct the denoised image. In KSPCA, learning the optimal linear subspace in feature space and obtaining the low-dimensional projections have a similar form as the dual version of SPCA in image space. Let  = [φ (x1 ), . . . , φ (xN )] be the (implicit) feature representation of the corresponding images. H denotes the centered data in the feature space, where H is the centering matrix, H = I − N1 11T , I is an N × N identity matrix, and 1 is an N × 1 vector composed of 1’s. As with SPCA, Km is the kernel matrix of metadata and L is the Cholesky decomposition of Km . The optimal subspace (linear in the implicit feature space) can be computed by solving the following eigen decomposition:

VT VT = LHT HLT

(4) )T φ (x

Using the so-called “kernel trick”, the dot product φ (xi j ) in the implicit feature space can be substituted by a kernel function. Let K be the N × N kernel matrix, where Ki, j = κ (xi − x j  ), and (4) becomes

VT VT = LHKHLT

between the basis vector learned by SPCA, and the structure learned by KSPCA, which is linear in the implicit feature space, but nonlinear in image space. Although implicit feature mapping provides the benefit of nonlinear subspace learning, it lacks a parametric inverse mapping to the image space to recover the denoised image (or pre-image Mika et al., 1998). To denoise an image xi , first, the feature mapping of its denoised version φ (zˆ i ) is estimated by back-projecting the lowdimensional coordinate to the feature space, φ (zˆ i ) = UUT (φ (xi ) − φ¯ ) + φ¯ , then by computing the inverse mapping of φ (zˆ i ) to the image space. Computing the inverse mapping of φ ( · ) is non-trivial. Given a point in the feature space, there may not exist an exact solution to its inverse mapping. Therefore, we use an approximate solution that computes the inverse mapping by preserving the distances to local neighbors in the feature space (Kwok and Tsang, 2003). The squared Euclidean distance of φ (zˆ i ) to the j-th point in the feature space can be written entirely in terms of dot product between points in feature space, i.e., in terms of kernel functions

(5)

where V contains the first d eigenvectors in each column and  is the diagonal matrix that contains the square roots of the first d eigenvalues. The optimal basis can be computed as U = HV−1 . Given the learned basis U, the low-dimensional coordinate of the i-th image in the subspace is UT (φ (xi ) − φ¯ ), where φ¯ is the mean in feature space. Or, using the kernel representation, −1 VH(kxi − K1 ), where kxi is the kernel of xi to all input images. Fig. 3 illustrates the differences



1 φ (x j ) − φ (zˆ i )22 , 2

(7)

The inverse mapping can be approximated as a weighted linear combination of input images. Using (6) to find the indices of the k nearest neighbors of φ (zˆ i ) in feature space, denoted by Ni , the least square solution for the inverse mapping is as follows:

zˆ i = argmin



z

(x j − z 2 − d j )2 ,

(8)

j∈Ni

which can be solved using singular value decomposition. 4.3. Algorithms Our supervised manifold denoising (SMD) algorithms both perform supervised nonlinear subspace learning, but differ in how the metadata is incorporated and at what stage nonlinearity is modeled. SMDE explicitly incorporates the metadata parameter for denoising, while SMDI only relies on the metadata to build an underlying model, and, therefore, only implicitly uses these values for denoising. 4.3.1. SMDE Assuming there exists an invertible mapping between the intrinsic manifold parameter t and the provided metadata s, t = g(s ), then learning M(t ) can be reformulated as learning a nonlinear function G (s ) = M(g(s )). In this setting, learning G with linear methods (e.g., PCA) or general nonlinear regression suffers from issues of high bias or the curse of dimensionality, respectively. Therefore, for SMDE , we take a two-step approach to learn G (s ): a pre-processing step that reduces the dimension of input images, and an explicit parametrization step that learns the nonlinear dependency between the metadata and the images in the lower-dimensional subspace. For dimensionality reduction, we use SPCA, so the learned basis optimally correlates with the metadata. Given the metadata S and low-dimensional coordinates in the linear subspace Y, we learn a nonlinear function h : Rl → Rd . In our implementation, we employ a radial basis function to give

h (s ) =

nb  j=1

w j ϕ ( s − s j  ) ,

(9)

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

45

the kernel width of metadata, σ m , is selected as the mean pairwise Euclidean distance between the metadata of all images. For SMDI , the kernel width of images, σ , is the mean Euclidean distance between the training images, and the number of nearest neighbors, k, is selected to be equal to 15% of the number of images in the set. To evaluate the performance of the proposed algorithms, we compare the results against the following denoising algorithms. For each method, the free parameters were optimized to give the best visual results on a few example echocardiogram images. Fig. 4. SMDE uses metadata for both subspace learning and computing the nonlinear mapping to the SPCA subspace; SMDI is based on KSPCA and allows for nonlinear subspace modeling but requires an additional inverse mapping step to reconstruct the output image.

Algorithm 1 SMDE . Input: image set, X; metadata, S; kernel for metadata, κm (· ); dimension for the subspace, d Output: denoised image set, Zˆ 1: 2: 3: 4: 5: 6:

Compute basis vectors, U (Eq. 2) for all xi do Compute embedding, yi (Eq. 3) Learn nonlinear function, h : s → y (Eq. 9) for all si do Compute denoised image, zˆ i = U · h(si )













where nb is the number of kernels, wj is the kernel weight, and ϕ ( · ) is the radial basis function. Given the learned nonlinear mapping to the subspace, h(s), the mapping to the image space is zˆ i = U · h(si ). Algorithm 1 provides the pseudocode for this method. 4.3.2. SMDI In the SMDE algorithm, SPCA only models linear variation in the data, therefore a separate nonlinear mapping step is incorporated for denoising in the presence of nonlinear variations. However, KSPCA can be used to learn a nonlinear subspace in image space. Our SMDI algorithm, based on KSPCA projection and back-projection, uses the metadata in an implicit manner to model the image subspace and denoise images. We assume the feature mapping φ ( · ) projects the original nonlinear manifold to a linear subspace in the feature space. Using the back-projection of the low-dimensional coordinate of φ (xi ) to compute the feature mapping of denoised image, φ (zˆ i ), the denoised image zˆ i can be estimated by finding the inverse feature mapping of φ (zˆ i ) to the input space. Algorithm 2 provides the pseudocode for SMDI . Fig. 4 outlines the two proposed SMD algorithms. 5. Evaluation We evaluated the proposed SMD methods on both synthetic data and real-world echocardiograms. The methods were implemented in Matlab on a standard desktop computer. For all instances of supervised subspace learning, the dimensionality of the learned subspace was equal to the number of non-zero eigenvalues. For both methods,

Algorithm 2 SMDI . Input: image set, X; metadata, S; kernel for metadata, κm (· ); kernel for images, κ (· ); dimension for the subspace, d; number of nearest neighbors, k Output: denoised image set, Zˆ Compute basis vectors, V, and eigenvalues,  (Eq. 4) for all xi ∈ X do 3: Find Ni , the k-nearest neighbors of φ (zˆ i ) (Eq. 6) 4: Compute denoised image zˆ i (Eq. 8) 1: 2:



TEMPR is a commonly used baseline approach for denoising by temporal averaging. The temporal window size is 3. Lee (Lee, 1980) is a moving window based denoising algorithm using local statistics. The local window size is 16 × 16 and the number of iterations is 10. SRAD (Yu and Acton, 2002) is an anisotropic diffusion based despeckling algorithm that uses edge-preserving partial differential equations. The step size of iterative diffusion is 0.025 and the number of iterations is 100. GENLIK (Pizurica et al., 2003) is a wavelet domain method that uses local statistics to estimate the probability of signal presence. The local window size is 20 × 20 and the multiplicative factor for wavelet thresholding is 15. TV (Chambolle, 2004) is a total variation minimization based image denoising algorithm. The regularization parameter is 20, and the number of iterations is 1000. 3DS (Negi and Labate, 2012) is a video denoising algorithm that extends 2D shearlet transform to incorporate the temporal dimension. 3-level shearlet decomposition is used, and the number of directional bands is [16, 32, 64] from the coarest to the finest level. VBM3D (Dabov et al., 2007a) is a video denoising algorithm that extends the image denoising algorithm, BM3D (Dabov et al., 2007b), which is based on grouping similar image patches and performing collaborative filtering within each group. The noise standard deviation for VBM3D is 35.

5.1. Synthetic data denoising The synthetic data consists of star-like shape, shown in Fig. 5, warped by a nonlinear deformation. The magnitude of the deformation is controlled by a sinusoidal parameter corrupted by zero-mean Gaussian noise, as shown in Fig. 5(a). This results in a video containing the synthetic shape deforming with motion similar to the cardiac cycle. To add realistic speckle noise to the synthetic video, we use the Field II ultrasound simulation toolbox (Jensen, 1996; Jensen and Svendsen, 1992). For the synthetic video, the foreground (star shape) and background intensities are modeled parametrically. The energy strengths of scatters in the background are drawn from a zero-mean Gaussian distribution with standard deviation of 1, and the energy strengths of foreground scatters are drawn from a zero-mean Gaussian distribution with standard deviation of 0.1. In each frame, 10,000 scatters are randomly placed. Example frames of the simulated ultrasound video are shown in Fig. 5(c). 5.1.1. Quantitative metrics Two metrics are used for quantitative evaluation of the denoising results: Structural Similarity (SSIM) and Ultrasound Despeckling Assessment Index (USDSAI). SSIM (Wang et al., 2004) was originally proposed to compute the similarity between two images. When a noise-free image and its denoised version are provided as input, SSIM can be used to evaluate the denoising quality. SSIM incorporates three factors for image comparison: luminance, contrast and structure. Given image I and the denoised image J, both having image intensity range of [0, 255], the

46

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

Fig. 5. For a synthetic video, (a) shows the corresponding metadata values each frame, with five selected frames highlighted in red. For these frames, (b) shows the ground truth shapes, and (c) shows the images after ultrasound noise is added. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

SSIM index at the i-th pixel is as follows:

SSIM(Ii , Ji ) =

(2μIi μJi + (255c1 ) )(2σIi Ji + (255c1 ) ) , (μIi 2 + μJi 2 + (255c2 )2 )(σIi 2 + σJi 2 + (255c2 )2 ) 2

2

where μIi and μJi are the local average intensities for image I and image J respectively, σIi 2 and σJi 2 are the local standard deviations, σIi Ji is the local covariance, c1 and c2 are predefined constants to prevent numerical instability. All statistics are computed within a local window. The overall similarity between I and J can be computed as the mean of SSIM values (MSSIM) at each pixel location. MSSIM ranges from 0 to 1, and larger values indicate better denoising performance. In our experiment, we set the window size as 20 × 20 to compute local statistics; c1 is 0.01 and c2 is 0.03. For the images generated using the Field II simulator, the input binary masks cannot be directly used as ground truth images since the intensity levels are not comparable to the noisy images, so instead, using the binary ground truth masks to indicate the foreground-background segmentation, we take the average intensity of the input images in each region to serve as the ground truth values. USDSAI (Tay et al., 2010) is a denoising metric designed for images with distinct intensity classes. Our synthetic data consists of two classes, foreground Cf and the background Cb . After denoising, pixels intensities should show high within-class agreement and low inter-

class agreement. Given denoised images, Qalg measures the degree of discrimination between the two classes

Qalg =

(μC f − μCb )2 , σC2f + σC2b

(10)

where μC f and μCb are the mean of the two classes, and σC2 and σC2 f

b

are the standard deviations. When the two classes are well separated with small within-class variation, Qalg is large. The same measure can be defined for the noisy input image, denoted as Q0 , and USDSAI is Q

defined as the ratio between the two measures, Qalg . USDSAI values 0 larger than 1 indicate improved image quality after denoising, and larger values indicate better noise reduction performance. 5.1.2. Results The input is a synthetic video (with associated metadata) of 300 frames containing approximately 10 cycles of motion (Fig. 5). This video was denoised using our two proposed SMD methods and six related denoising algorithms. For each method, we computed the MSSIM and USDSAI metrics for each output frame, and the average, over all frames, is used as the overall measure. Table 1 shows these results. Both metrics show a similar pattern. The SMD methods outperform both the image despeckling and video denoising methods for this task, with SMDI slightly better than SMDE .

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

47

Table 1 Quantitative results for denoising synthetic data. For both metrics, higher is better.

MSSIM USDSAI

Lee

SRAD

GENLIK

TV

TEMPR

3DS

VBM3D

SMDE

SMDI

0.590 1.747

0.694 2.041

0.575 1.944

0.671 2.189

0.584 1.792

0.673 2.403

0.761 2.513

0.807 2.714

0.815 2.821

Fig. 6. Qualitative results from three frames of denoising of the synthetic data (described in Fig. 5).

Fig. 6 shows three selected frames of the synthetic video denoised using each method. In terms of speckle removal, SRAD, 3DS, VBM3D and the proposed SMD methods show more reduction than Lee, GENLIK and TV. While SRAD reduced speckle noise, the recovered contours of the shapes, though well-defined, are distorted and do not accurately reflect the ground truth shape. Also, the performance of SRAD varies noticeably across input images. For the second example frame, the intensity of the recovered shape is less homogeneous, and more artifacts of bright and dark spots are produced than the other two examples. The contour of the toy shape recovered from 3DS is significantly less defined than those from SMD methods. Meanwhile, compared with SMD methods, the contours recovered by VBM3D are less smooth and the pixel intensities are less homogeneous within the background and the foreground regions. In contrast, both SMD methods can effectively remove speckle noise, produce homogeneous intensities of the background and the shape, and recover the definition of the shape with minimal distortion. While a slight quantitative difference was observed between the SMD methods, visually, the resulting images are very similar.

5.2. Echocardiogram denoising We collected echocardiogram videos from patients in a clinical setting using a Philips CX50 Ultrasound System, operating at 33 Hz. Each of the obtained videos contains roughly 6–10 heartbeats. To extract the metadata that is related with the cardiac movement, we use the electrocardiogram (ECG) signal that is commonly collected alongside the echocardiogram. An example ECG is shown in the top row of Fig. 7. We locate key points in the signal to separate systole and diastole phases (Zhang et al., 2006). The metadata is interpolated linearly within each phase for each frame, and then projected to a 2D unit circle composed of a systole semi-circle and a diastole semi-circle. The resulting metadata is a 2D signal, and changes along the unit circle alternating between phases of cardiac cycles. 5.2.1. Despeckling results Fig. 8 shows a frame that captures the parasternal short axis view of the heart and denoised images using each of the methods with zoomed-in views of the left ventricle (highlighted by red boxes). In

48

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

Fig. 7. Metadata is inferred for systole and diastole by interpolating between key points in the ECG data.

Fig. 8. (b–g) show the output of denoising an echocardiogram from the parasternal short axis view (a) with a zoomed-in view of the left ventricle (boxed in red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

Fig. 9. (b–g) show the output of denoising an echocardiogram from the apical two chamber view (a) with a zoomed-in view of the left ventricle (boxed in red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

49

Fig. 10. Selected frames from an echocardiogram video degraded by dropout artifacts (identified with arrows).

the original image, even though the global structure of the heart chamber is clear, speckle noise obstructs small structures, such as the mitral valve in the left ventricle. Also, finer details are obscured; for example, the shape of the inner boundary of left ventricular wall is difficult to discern. Of the tested methods, SRAD, TV and Lee all oversmooth the images, as seen from the loss of fine details, like the inner boundary of left ventricle. The despeckling effect of TEMPR is not very obvious. GENLIK, 3DS and VBM3D preserve the weak boundary, with VBM3D performing slightly better than GENLIK and 3DS, however, the visual appearance of homogenous tissues, such as the chamber wall, is not improved after denoising. In contrast, the SMD methods recover the homogenous tissues and preserve the inner boundary of left ventricle, as observed by the consistently dark intensities within the heart chamber and the distinct definition and more homogeneous intensities of chamber walls. Fig. 9 shows a frame that captures the apical two chamber view of the heart and denoised images using each method. Globally, the input shows strong contrast, excluding the region of the left ventricular wall on the top right of the image. We focus on the denoising results in this area. Lee, SRAD, GENLIK, TV and 3DS all produce oversmoothed images with the boundary of the left ventricular wall difficult to discern. TEMPR and VBM3D both preserve the shape of the left ventricular wall, but TEMPR does not reduce the speckle noise effectively, and VBM3D does not recover the definition of the mitral valve as well as SMD methods. In contrast, the SMD methods recover the shape of the mitral valve and preserve the boundary of left ventricle. 5.3. Evaluation of subspace learning for denoising In this section, we seek to evaluate the effect of supervised nonlinear subspace learning for denoising. We compare our proposed methods to baseline subspace learning approaches. We employ a similar framework of projection followed by back-projection for PCA and (linear) SPCA. Both approaches follow the methodology of the SMD methods. For PCA, we empirically selected a dimensionality of 14 for the linear subspace, and for SPCA, the parameters matched those of the SMDE algorithm. For these experiments, we consider the case of video with composite noise sources, including speckle noise and dropout artifacts (Fig. 10). Compared to images solely corrupted by speckle noise, dropout artifacts have a significant impact on the visual appearance of the underlying structures and can occur over consecutive frames.

Fig. 12. Each row shows three echocardiograms denoised by subspace learning approaches. The images correspond to the points labeled A, B, and C, respectively in Fig. 11. Frames A and B contain large dropout artifacts and the denoising results vary among the different methods. Frame C does not contain these types of artifacts and shows similar performance across all methods.

In Fig. 11, each graph shows a visualization of the 2D embedding of this video. Each point represents a frame from the original video colored based on the value of the associated metadata (inferred from the corresponding ECG signal), where full systole is represented in light blue and full diastole in red. Additionally, consecutive frames are connected by edges. The frames corresponding to the labeled points (A,B,C) are shown in Fig. 12 along with denoised images using each of the subspace learning approaches. Early frames of the video, including those marked A and B, show dropout artifacts with a loss of visualization of left ventricle. PCA selects basis vectors to maximize the variance of the data. For this data set, we notice that the frames represented by A and B,

Fig. 11. Each graph shows a visualization of the 2D embedding of an echocardiogram video degraded by dropout artifacts. Each point represents a frame from the original video, connected sequentially, and colored based on the value of the associated metadata. The frames corresponding to the labeled points (A, B, C) are shown in Fig. 12. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

50

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51

which are corrupted due to dropout artifacts, are distant from the rest of the points. The learned subspace primarily encodes the variance due to images with dropouts and those without. A side effect is that there is little differentiation between the diastolic process (changing from light blue to red) and the systolic process (changing from red to light blue). The differences between the embeddings of SPCA and PCA demonstrate the effect of incorporating metadata. The difference in the diastolic process and the systolic process is better represented as a cyclic trajectory. Even though the frame marked B appears to be an outlier, the left ventricle walls not visible in Frame A and Frame B can be seen in the reconstructed images by SPCA. By implicitly incorporating metadata, the SPCA basis de-emphasizes the variation caused by artifacts, while focusing on the variation caused by cardiac motion. The limitation of SPCA is that the linear basis underfits the nonlinear manifold structure to reduce the variance caused by the dropout artifact. The resulting SPCA coordinates of Frame A and Frame B significantly deviate from other images with similar phases, and reconstructed images of Frame A and Frame B have strong speckle noise. Both of our proposed methods improve the performance on Frame A and Frame B. SMDE explicitly requires smooth transitions between images with similar metadata, so the reconstructed images of Frame A and Frame B recover the left ventricle and reduce speckle noise. Compared to the embedding of SPCA, the 2-D embedding of SMDI yields better cyclic patterns, and the deviation of Frame B is reduced. The transition between images with similar phases in SMDE is smoother than SMDI due to the explicit parametrization step using the metadata. The reconstructed images of Frame A and Frame B using SMDI also preserve the left ventricle and reduce speckle noise, but are slightly less defined than SMDE . 6. Conclusions and future work By incorporating metadata related to the motion of interest, our supervised manifold denoising (SMD) methods consider and exploit image-level relationships within a given data set. This allows multiple visually-similar frames to be combined in a phase-weighted manner for denoising. As demonstrated in the previous section, the result is improved image enhancement both qualitatively and quantitatively on synthetic and real-world examples. Both methods rely on supervised, nonlinear subspace learning for denoising and perform similarly in most cases. However, depending on the situation, one may choose SMDE , which is more robust against dropout artifacts, or SMDI for more speckle noise reduction. For SMDE , once the linear basis U and the nonlinear function h( · ) are learned, the denoised image zˆ i only depends on the associated metedata si . Whereas in SMDI , denoising the i-th image only requires the noisy image xi as input. We presented supervised manifold denoising methods that use metadata extracted from ECG signals for echocardiogram enhancement. Such algorithms are increasingly relevant with the growing interest in automated echocardiogram analysis as the development of the portable device industry accelerates. Further, as the need for bedside real-time image acquisition continues to grow, automated video enhancements and interpretation will prove paramount to the delivery of care for hospitalized patients. Acknowledgment We would like to thank Dr. Geoffrey A. Rose, MD and the staff at the Sanger Heart & Vascular Institute at Carolinas Medical Center for providing the anonymized data, annotations, and interpretations of the echocardiograms used in this work. References Aja-Fernández, S., Alberola-López, C., 2006. On the estimation of the coefficient of variation for anisotropic diffusion speckle filtering. IEEE Trans. Image Process. 15 (9), 2694–2701.

Aksel, A., Janiczek, R.L., Hossack, J.A., French, B.A., Acton, S.T., 2006. Ultrasound myocardial tracking with speckle reducing anisotropic diffusion assisted initialization. In: Proc. ICIP, pp. 1945–1948. Barshan, E., Ghodsi, A., Azimifar, Z., Zolghadri Jahromi, M., 2011. Supervised principal component analysis: visualization, classification and regression on subspaces and submanifolds. Pattern Recognit. 44 (7), 1357–1371. Buades, A., Coll, B., Morel, J.-M., 2005. A non-local algorithm for image denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition., vol. 2, pp. 60–65. Buades, A., Coll, B., Morel, J.-M., 2008. Nonlocal image and movie denoising. Int. J. Comput. Vis. 76 (2), 123–139. Chambolle, A., 2004. An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 20 (1–2), 89–97. Coupe, P., Hellier, P., Kervrann, C., Barillot, C., 2009. Nonlocal means-based speckle filtering for ultrasound images. Image Process. IEEE Trans. 18 (10), 2221–2229. Dabov, K., Foi, A., Egiazarian, K., 2007a. Video denoising by sparse 3d transform-domain collaborative filtering. In: Proceedings of the Conference on European Signal Processing, pp. 145–149. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K., 2007b. Image denoising by sparse 3-d transform-domain collaborative filtering. Image Process. IEEE Trans. 16 (8), 2080– 2095. Duchateau, N., De Craene, M., Piella, G., Silva, E., Doltra, A., Sitges, M., Bijnens, B.H., Frangi, A.F., 2011. A spatiotemporal statistical atlas of motion for the quantification of abnormal myocardial tissue velocities. Med. Image Anal. 15 (3), 316–328. Duchateau, N., De Craene, M., Sitges, M., Caselles, V., 2013. Adaptation of multiscale function extension to inexact matching: Application to the mapping of individuals to a learnt manifold. In: Proceedings of the Conference on Geometric Science of Information, vol. 8085, pp. 578–586. Gao, Y., Chan, K.L., Yau, W.-Y., 2008. Manifold denoising with Gaussian process latent variable models. In: Proceedings of the 19th International Conference on Pattern Recognition., pp. 1–4. Gerber, S., Tasdizen, T., Thomas Fletcher, P., Joshi, S., Whitaker, R., 2010. Manifold modeling for brain population analysis. Med. Image Anal. 14 (5), 643–653. Ghoniem, M., Chahir, Y., Elmoataz, A., 2008. Video denoising via discrete regularization on graphs. In: Proceedings of the 19th International Conference on Pattern Recognition., pp. 1–4. van Ginneken, B., Mendrik, A., 2006. Image denoising with k-nearest neighbor and support vector regression. In: Proceedings of the 18th International Conference on Pattern Recognition, vol. 3, pp. 603–606. Gong, D., Sha, F., Medioni, G., 2010. Locally linear denoising on image manifolds. Proc. Artif. Intell. Stat. 9, 265–272. Gupta, S., Kaur, L., Chauhan, R.C., Saxena, S.C., 2007. A versatile technique for visual enhancement of medical ultrasound images. Digital Signal Process. 17 (3), 542– 560. Hein, M., 2007. Manifold denoising as preprocessing for finding natural representations of data. In: Proceedings of the AAAI Conference on Artificial Intelligence., pp. 1646– 1649. Hein, M., Maier, M., 2006. Manifold denoising. In: Proceedings of the Conference on Advances in Neural Information Processing Systems, vol. 20, pp. 1–8. Huang, J., Zhang, S., Metaxas, D., 2011. Efficient mr image reconstruction for compressed mr imaging. Med. Image Anal. 15 (5), 670–679. Huang, X., Dione, D.P., Compas, C.B., Papademetris, X., Lin, B.A., Bregasi, A., Sinusas, A.J., Staib, L.H., Duncan, J.S., 2014. Contour tracking in echocardiographic sequences via sparse representation and dictionary learning. Med. Image Anal. 18 (2), 253– 271. Jensen, J.A., 1996. Field: a program for simulating ultrasound systems. In: Proceedings of the 10th Nordic-Baltic Conference on Biomedical Imaging, 34. Citeseer, pp. 351– 353. Jensen, J.A., Svendsen, N.B., 1992. Calculation of pressure fields from arbitrarily shaped, apodized, and excited ultrasound transducers. Ultrason. Ferroelectr. Freq. Control, IEEE Trans. 39 (2), 262–267. Khare, A., Khare, M., Jeong, Y., Kim, H., Jeon, M., 2010. Despeckling of medical ultrasound images using Daubechies complex wavelet transform. Signal Process. 90 (2), 428–439. Krissian, K., Kikinis, R., Westin, C.-F., Vosburgh, K., 2005. Speckle-constrained filtering of ultrasound images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition., 2, pp. 547–552. Krissian, K., Westin, C.F., Kikinis, R., Vosburgh, K.G., 2007. Oriented speckle reducing anisotropic diffusion. Image Process. IEEE Trans. 16 (5), 1412–1424. Kwok, J.T., Tsang, I.W., 2003. The pre-image problem in kernel methods. In: Proceedings of the 20th International Conference on Machine Learning, pp. 408–415. Lee, J.-S., 1980. Digital image enhancement and noise filtering by use of local statistics. Pattern Anal.Mach. Intell. IEEE Trans. 2 (2), 165–168. Li, D., 2009. Support vector regression based image denoising. Image Vis. Comput. 27 (6), 623–627. Mika, S., Schölkopf, B., Smola, A.J., Müller, K.-R., Scholz, M., Rätsch, G., 1998. Kernel PCA and de-noising in feature spaces. In: Proceedings of the Conference on Advances in Neural Information Processing Systems, pp. 536–542. Negi, P.S., Labate, D., 2012. 3-d discrete shearlet transform and video processing. Image Process. IEEE Trans. 21 (6), 2944–2954. Papademetris, X., Sinusas, A., Dione, D.P., Constable, R.T., Duncan, J.S., 2002. Estimation of 3-d left ventricular deformation from medical images using biomechanical models. Med. Imaging IEEE Trans. 21 (7), 786–800. Pizurica, A., Philips, W., Lemahieu, I., Acheroy, M., 2003. A versatile wavelet domain noise filtration technique for medical imaging. Med. Imaging IEEE Trans. 22 (3), 323–331.

H. Wu et al. / Medical Image Analysis 24 (2015) 41–51 Qian, Z., Liu, Q., Metaxas, D.N., Axel, L., 2011. Identifying regional cardiac abnormalities from myocardial strains using nontracking-based strain estimation and spatiotemporal tensor analysis. Med. Imaging IEEE Trans. 30 (12), 2017–2029. Ren, J., Zhuo, Y., Liu, J., Guo, Z., 2012. Illumination-invariant non-local means based video denoising. In: Proceedings of the IEEE International Conference on Image Processing., pp. 1185–1188. Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), 2323–2326. Santiago, C., Nascimento, J.C., Marques, J.S., 2013. 3d left ventricular segmentation in echocardiography using a probabilistic data association deformable model. In: Proceedings of the IEEE International Conference on Image Processing, pp. 606– 610. Sun, K., Bruno, E., Marchand-Maillet, S., 2012. Unsupervised skeleton learning for manifold denoising. In: Proceedings of the International Conference on Pattern Recognition., pp. 2719–2722. Tay, P.C., Garson, C.D., Acton, S.T., Hossack, J.A., 2010. Ultrasound despeckling for contrast enhancement. Image Process. IEEE Trans. 19 (7), 1847–1860. Tenenbaum, J.B., De Silva, V., Langford, J.C., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319–2323. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408. Wang, W., Carreira-Perpiñán, M.Á., 2010. Manifold blurring mean shift algorithms for manifold denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition., pp. 1759–1766. Wang, W., Lu, Z., Carreira-Perpiñán, M.Á., 2011. A denoising view of matrix completion. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 334–342.

51

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality assessment: from error visibility to structural similarity. Image Process. IEEE Trans. 13 (4), 600– 612. Wu, H., Bowers, D.M., Huynh, T.T., Souvenir, R., 2013. Biomedical video denoising using supervised manifold learning. In: Proceedings of the IEEE International Symposium on Biomedical Imaging, pp. 1244–1247. Xu, Q., Jiang, H., Scopigno, R., Sbert, M., 2010. A new approach for very dark video denoising and enhancement. In: Proceedings of the IEEE International Conference on Image Processing, pp. 1185–1188. Yu, J., Tan, J., Wang, Y., 2010. Ultrasound speckle reduction by a SUSAN-controlled anisotropic diffusion method. Pattern Recognit. 43 (9), 3083–3092. Yu, Y., Acton, S.T., 2002. Speckle reducing anisotropic diffusion. Image Process. IEEE Trans. 11 (11), 1260–1270. Yu, Y., Zhang, S., Li, K., Metaxas, D., Axel, L., 2014. Deformable models with sparsity constraints for cardiac motion analysis. Med. Image Anal. 18 (6), 927–937. Yue, Y., Clark, J.W., Khoury, D.S., 2009. Speckle tracking in intracardiac echocardiography for the assessment of myocardial deformation. Biomed. Eng. IEEE Trans. 56 (2), 416–425. Zhang, B., Fadili, J.M., Starck, J.-L., 2008. Wavelets, ridgelets, and curvelets for Poisson noise removal. Image Process. IEEE Trans. 17 (7), 1093–1108. Zhang, L., Dong, W., Zhang, D., Shi, G., 2010. Two-stage image denoising by principal component analysis with local pixel grouping. Pattern Recognit. 43 (4), 1531– 1549. Zhang, L., Vaddadi, S., Jin, H., Nayar, S.K., 2009. Multiple view image denoising. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1542–1549. Zhang, Q., Manriquez, A.I., Medigue, C., Papelier, Y., Sorine, M., 2006. An algorithm for robust and efficient location of T-wave ends in electrocardiograms. Biomed. Eng. IEEE Trans. 53 (12), 2544–2552.