next up previous
Next: Brain Tissue Classification Up: MRI Brain Tissue Classification Previous: Learning Per-Class Markov Statistics


Classification via Mutual-Information Maximization

This section formulates the classification problem as an optimal-segmentation problem using with an information-theoretic goodness measure associated with the Markov PDFs. It begins by forming a connection between information-theoretic measures, such as mutual information, entropy [34], and classification.

Loosely speaking, the mutual information between two random variables quantifies the degree of functional dependence between them. For functionally-dependent random variables, each variable uniquely determines the other, and the mutual information is maximized. On the other hand, independent random variables convey no information about each other, and their mutual information is zero (minimal). For image segmentation [87], we can say that a good segmentation is one in which the voxel-neighborhood intensity values provide the most information about the class labels. Likewise, knowing the voxel class should provide the most reliable estimate of the voxel neighborhood. Clearly, there is no strict functional dependence and images are inherently stochastic, but mutual information provides a well-founded mechanism for quantifying the degree to which these properties hold.

Using the set of conditional PDFs $\{ P_k ({\bf Z}) \}_{k=1}^{K}$ for the $K$ classes, we can define a joint PDF $P (L, {\bf Z})$ between the RVs $L$ and ${\bf Z}$. At each voxel $t$, an instance $(l_t, {\bf z}_t)$ is drawn from the joint PDF. What we observe, however, are only the intensity-neighborhood vectors ${\bf z}_t$. The label values $l_t$ define the classification and must be estimated. We define the optimal segmentation as the one that maximizes the mutual information between $L$ and $\Z$, i.e.,

$\displaystyle I (L, {\bf Z})$ $\textstyle =$ $\displaystyle h ({\bf Z}) - h ({\bf Z} \vert L)$  
  $\textstyle =$ $\displaystyle h ({\bf Z}) - \sum_{k=1}^{K} P (L = k) h ({\bf Z} \vert L = k),$ (138)

where $I (\cdot)$ is the mutual information function and $h (\cdot)$ is the entropy. Entropy is a measure of randomness or uncertainty associated with a PDF [34], and regions $\mathcal{T}_k$ having low entropies $h ({\bf Z} \vert L = k)$ for Markov PDFs exhibit a high degree of predictability in their neighborhoods.

The entropy of class $k$ is

$\displaystyle h ({\bf Z} \vert L = k)
=
- \int_{\Re^{\vert d\vert}}
P_k ({\bf z})
\log P_k ({\bf z})
d{\bf z},$     (139)

where $d = \vert\mathcal{N}_t\vert$ is the neighborhood size.

The entropy of the Markov PDF associated with the entire image, $h ({\bf Z})$, is independent of the label assignment $L$ and we can ignore it during the optimization. Thus, (6.3) implies that the optimal segmentation is the one that minimizes a weighted average of entropies $h ({\bf Z} \vert L = k)$ of the $K$ Markov PDFs associated with the $K$ stationary-ergodic regions. The present mutual-information-based energy gives more importance, or weight, to reducing entropies of larger regions in the image in direct proportion to their size--the weights are the probability of occurrence of the classes $P (L = k)$ in the image. Rewriting $I (L, {\bf Z}) = h (L) - h (L \vert {\bf Z})$ provides more insight into this optimality metric. We see that the metric encourages segmentations with equal voxel counts for the classes--uniform PDF for $L$ implying maximal $h (L)$--while demanding high predictability of the label at each voxel $t$ given its neighborhood intensities ${\bf z}_t$--low $h (L \vert {\bf z}_t)$ leading to low $h (L \vert {\bf Z})$.

Equations (6.3) and (6.4) give the optimal segmentation as

$\displaystyle \{ \mathcal{T}^*_k \}_{k=1}^K$ $\textstyle =$ $\displaystyle \mathop{\mbox{argmin }}_{ \{ \mathcal{T}_k \}_{k=1}^K }
\sum_{k=1}^K P (L = k) h ({\bf Z} \vert L = k)$  
  $\textstyle =$ $\displaystyle \mathop{\mbox{argmax }}_{ \{ \mathcal{T}_k \}_{k=1}^K }
\Bigg(
\s...
...k=1}^K
P (L = k)
\int_{\Re^d} P_k ({\bf z}) \log P_k ({\bf z}) d{\bf z}
\Bigg).$ (140)

Treating entropy as the expectation of negative log-probability and approximating the expectation, in turn, by the sample mean [34], gives
$\displaystyle \{ \mathcal{T}^*_k \}_{k=1}^K$ $\textstyle =$ $\displaystyle \mathop{\mbox{argmax }}_{ \{ \mathcal{T}_k \}_{k=1}^K }
\Bigg(
\sum_{k=1}^K
P (L = k)
E_{P_k ({\bf Z})}
\Big[ \log P_k ({\bf Z}) \Big]
\Bigg)$  
  $\textstyle \approx$ $\displaystyle \mathop{\mbox{argmax }}_{ \{ \mathcal{T}_k \}_{k=1}^K }
\Bigg(
\s...
...athcal{S}'_k\vert}
\sum_{{\bf z} \in \mathcal{S}'_k}
\log P_k ({\bf z})
\Bigg),$ (141)

where $\mathcal{S}'_k$ is a random sample [47,161] derived from the PDF $P_k ({\bf Z})$. Assuming ergodicity [47], in addition to stationarity, enables us to approximate ensemble averages using $\mathcal{S}_k$ with spatial averages using $\mathcal{T}_k$. Hence we have
$\displaystyle \{ \mathcal{T}^*_k \}_{k=1}^K
\approx \mathop{\mbox{argmax }}_{ \...
...vert\mathcal{T}_k\vert}
\sum_{t \in \mathcal{T}_k} \log P_k ({\bf z}_t)
\Bigg).$     (142)

To estimate $P (L = k)$ from the data, we observe that the discrete random variable $L$ can take only $K$ possible values. Furthermore, $\vert\mathcal{T}_k\vert$ voxels, out of a total of $\vert\mathcal{T}\vert$ voxels, have $L =
k$. Thus,

$\displaystyle P (L = k) = \frac {\vert\mathcal{T}_k\vert} {\vert\mathcal{T}\vert}.$     (143)

Substituting (6.8) in (6.7) gives
$\displaystyle \{ \mathcal{T}^*_k \}_{k=1}^K
\approx \mathop{\mbox{argmax }}_{ \...
...l{T}\vert}
\sum_{k=1}^K
\sum_{t \in \mathcal{T}_k} \log P_k ({\bf z}_t)
\Bigg).$     (144)

The probabilities $P_k ({\bf z}_t)$ are given by the Parzen-window density estimate in (6.2), i.e.,
$\displaystyle P_k ({\bf z}_t)
\approx
\frac {1} {\vert\mathcal{A}_t\vert}
\sum_{u \in \mathcal{A}_t} G_d ({\bf z}_t - {\bf z}_u; \Psi_d),$     (145)

where the set $\mathcal{A}_t$ is a small subset of $\mathcal{T}_k$. Section 3.5.1 describes how to construct $\mathcal{A}_t$, unique for each voxel $t$, to effectively estimate the probability.

So far, we have not taken into account any a priori information in the segmentation process and we have derived all probabilities solely from the data. The formulation, however, extends in a straightforward manner to include a priori information using standard Bayesian strategies followed by optimization involving the resulting posterior probabilities. Section 6.4.3 discusses how to integrate a priori information in the form of brain tissue probabilistic atlases into the proposed method. For the minimization in (6.9), we manipulate the regions $\mathcal{T}_k$ using a gradient-descent optimization strategy. Section 6.4.2 gives the details.


next up previous
Next: Brain Tissue Classification Up: MRI Brain Tissue Classification Previous: Learning Per-Class Markov Statistics
Suyash P. Awate 2007-02-21