next up previous
Next: Engineering Enhancements for the Up: Estimating Uncorrupted-Signal Markov Statistics Previous: Inverse Problem: KL-Divergence Optimality

Optimization Using the EM Algorithm

The inverse problem we have here is that of mixture-density parameter estimation--the parameter here is the set $\Theta = \{ {\bf z}_u \}_{u \in \mathcal{U}}$ of the means of Gaussians that defines the uncorrupted-signal Markov PDF. We propose to solve this using the EM algorithm [43,104]. The EM algorithm computes a ML parameter estimate when the data are incomplete, i.e., a part of the data remains unobserved or hidden. We now describe the key idea behind the working of the EM algorithm.

The optimization formulation in (5.9) is a little unwieldy because it contains the logarithm of a sum. If we knew which Gaussian component generated each observation, then we could obtain the probability $\hat P_C ({\bf\tilde z}_t)$ by evaluating a single Gaussian: the one that generated ${\bf\tilde z}_t$. The EM approach gets rid of the summation that the logarithm applies to. The key idea behind EM is that it assumes the existence of one hidden RV associated with each observation ${\bf\tilde z}_t$. The PDF of this hidden RV gives the probabilities for different Gaussian components to have generated ${\bf\tilde z}_t$. Let us call this RV $L$. The values of $L$ are, however, never observed. The EM algorithm starts by assuming a joint PDF $P ({\bf\tilde
Z},L)$ of the observed and hidden RVs, i.e., the complete data. It defines the probability of the observation ${\bf\tilde z}_t$ assuming that it came from the $l$-th Gaussian as

$\displaystyle P ({\bf\tilde z}_t \vert l)
= G ({\bf\tilde z}_t - {\bf z'}_l, \Psi'_l),$     (124)

where ${\bf z'}_l$ and $\Psi'_l$ are the mean and covariance values, respectively, for the $l$-th Gaussian. The goal of the EM algorithm is to iteratively find the ML estimate of the parameter $\Theta$ as
$\displaystyle \Theta^*$ $\textstyle =$ $\displaystyle \mathop{\mbox{argmax }}_{\Theta}
\log P ({\bf\tilde z} \vert \Theta)$  
  $\textstyle =$ $\displaystyle \mathop{\mbox{argmax }}_{\Theta}
\log
\bigg(
\int_{\mathcal{S}_L} P ({\bf\tilde z}, l \vert \Theta) dl
\bigg),$ (125)

where $\mathcal{S}_L$ is the support of $P (L)$. Each iteration comprises the E (expectation) step and the M (maximization) step. The E step formulates an expectation of the complete-data likelihood function over the PDF of the hidden RV conditioned on the observed data and current parameter estimate. The M step maximizes this expectation with respect to the parameter. After much simplification [19], the maximization performed in the $m$-th iteration reduces to
$\displaystyle \mathop{\mbox{argmax }}_{\Theta}
\sum_{u \in \mathcal{U}}
\sum_{t...
...ert {\bf\tilde z}_t; \Theta^{m-1})
\log P ({\bf\tilde z}_t \vert u, {\bf z}_u),$     (126)

where $\Theta^{m-1}$ is the $(m-1)$-th parameter estimate that is held constant and $\Theta = \{ {\bf z}_u \}_{u \in \mathcal{U}}$ is the free variable. The parameter updates guarantee no decrease in the likelihood $P ({\bf\tilde z} \vert \Theta)$ of the observed data and, hence, the sequence of estimates converge to a local maximum of the likelihood function.

An important element in this entire process of inferring the uncorrupted-signal Markov statistics is the initial choice of the sample $\{ {\bf\hat z}_u^0 \}_{u \in \mathcal{U}}$ for the EM algorithm. We initialize $\{ {\bf\hat z}_u^0 \}_{u \in \mathcal{U}}$ to comprise a small random fraction of the entire set of observed neighborhood-intensities $\{ {\bf
\tilde z}_t \}_{t \in \mathcal{T}}$, spread uniformly over the image domain $\mathcal{T}$. This ensures the representation of all important features in the image and produces an initial estimate close to the global maximum of the likelihood function.

The EM updates, for density estimation using a sum of Gaussians, are as follows.


  1. Let $\{ {\bf\hat z}_u^m \}_{u \in \mathcal{U}}$ be the parameter estimate at the $m$-th iteration.

  2. Use the lookup tables to compute ${\bf\hat {z'}}^{m}_u$ and ${\Psi'}_u^m$, $\forall u \in U$, where
    $\displaystyle {\bf\hat {z'}}^{m}_u (i)$ $\textstyle =$ $\displaystyle \mathcal{L}_{\mu} ({\bf\hat z}_u^m (i), \sigma, \sigma_R)
\mathrm { and}$  
    $\displaystyle \hat {\Psi'}_u^{m} (i,i)$ $\textstyle =$ $\displaystyle \mathcal{L}_{\sigma} ({\bf\hat z}_u^m (i), \sigma, \sigma_R).$ (127)



  3. Compute
    $\displaystyle \forall u \in \mathcal{U}, \forall t \in \mathcal{T},
P ({\bf\tilde z}_t \vert u)
= G ({\bf\tilde z}_t - {\bf\hat {z'}}^{m}_u, \Psi'_u)$     (128)



  4. Use Bayes rule to evaluate $P (u \vert {\bf\tilde z}_t), \forall t \in \mathcal{T}, \forall u \in
\mathcal{U}$. Because we derive the initial set of observations ${\bf\hat z}_u^0$ from the PDF $P
({\bf\tilde Z})$ that is close to $P ({\bf Z})$, we can ignore the a priori probabilities $P (u)$--treat them equal for all $u$. Thus, we compute
    $\displaystyle \forall u \in \mathcal{U}, \forall t \in \mathcal{T},
P (u \vert ...
...tilde z}_t \vert u)
}
{
\sum_{v \in \mathcal{U}}
P ({\bf\tilde z}_t \vert v)
}.$     (129)



  5. Update the current parameter estimate using a gradient-ascent scheme using first-order finite forward differences:
    $\displaystyle \forall u \in \mathcal{U},
{\bf\hat z}_u^{m+1}
=
{\bf\hat z}_u^m
...
...t \in \mathcal{T}} P (u \vert {\bf\tilde z}_t) }
-
{\bf\hat {z'}}_u^{m}
\Bigg),$     (130)

    where the Jacobian is a diagonal matrix--each component of the vector neighborhood ${\bf\hat
z}_u$ is corrupted independently because of the conditional independence assumption on the noise model--that can be computed numerically using the lookup table $\mathcal{L}_{\mu} (\cdot)$. The partial derivatives in the Jacobian are the reciprocal of the rate of change of the shift in the mean of the Rician-corrupted Gaussians with respect to the change in the means of the input Gaussian (for i.i.d. additive Gaussian noise the Jacobian is exactly identity). We have numerically found that this derivative is always greater than unity, and approaches unity for large SNR (where Rician noise behaves very similar to i.i.d. additive Gaussian noise). For low SNR, however, the derivative can be much larger than unity and this may lead to numerically-large updates. In practice, we treat the Jacobian as identity. This results in a projected-gradient ascent strategy that is still guaranteed to converge.

  6. If $\sum_{u \in \mathcal{U}} \parallel {\bf\hat z}_u^{m+1} - {\bf\hat z}_u^m \parallel_2^2 <
\epsilon$, where $\epsilon$ is a small threshold, then stop, otherwise go to Step 3.


next up previous
Next: Engineering Enhancements for the Up: Estimating Uncorrupted-Signal Markov Statistics Previous: Inverse Problem: KL-Divergence Optimality
Suyash P. Awate 2007-02-21