next up previous
Next: Conditional Entropy Up: Information Theory Previous: Information Theory

Entropy

The concept of entropy was prevalent, before Shannon, in the thermodynamics and statistical mechanics literature. In classical thermodynamics, the important second law states that the total entropy of any isolated thermodynamic system tends to increase with time. Ludwig Boltzmann and Josiah W. Gibbs, in the late 1800s, statistically analyzed the randomness associated with an ensemble of gas particles. They called this measure entropy and defined it to be proportional to the logarithm of the number of microstates such a gas could occupy. Their mathematical formulation of entropy, albeit in a different context, was equivalent to the definition by Shannon.

Shannon defined a measure of uncertainty or randomness associated with an RV, calling it entropy [154]. Thus, entropy is the average uncertainty associated with each possible value of the RV:

$\displaystyle h (X)$ $\textstyle =$ $\displaystyle \int_{\mathcal{S}_X} P (x) \log \Bigg( \frac {1} {P (x)} \Bigg) dx$ (58)
  $\textstyle =$ $\displaystyle - \int_{\mathcal{S}_X} P (x) \log P (x) dx,$ (59)

where $\mathcal{S}_X = \{ x : P(x) > 0 \}$ is the support set of $P(X)$.

Alfred Renyi [138] generalized Shannon's measure of entropy by presenting a family of entropy functions parameterized by a continuous parameter $\alpha $:

$\displaystyle h_{\alpha} (X)
= \frac {1} {1 - \alpha} \log \Bigg( \int_{\mathcal{S}_X} \bigg( P (x) \bigg)^{\alpha} dx \Bigg).$     (60)

He showed that the Renyi entropy converges to the Shannon entropy in the limit as $\alpha
\rightarrow 1$. Many other measures of entropy exist such as the Havrda-Chavrat entropy [84], Hartley entropy [72], and Kapur's measures of entropy [85,84]. This dissertation utilizes the Shannon measure for all purposes and, hence, we will restrict our focus to that measure.

We can also interpret Shannon entropy as the expectation of the RV $\Big( - \log P (X) \Big)$, i.e.,

$\displaystyle h (X)
= E_{P (X)} \Big[ - \log P (X) \Big].$     (61)

We saw previously that, given a random sample, an unbiased and consistent estimator of the expectation of the RV is the sample mean. Thus, given a random sample derived from an RV $X$, an estimate for the entropy of $X$ as
$\displaystyle h (X)$ $\textstyle \approx$ $\displaystyle \frac {1} {n} \sum_{i=1}^{n} \Bigg( - \log P (x_i) \Bigg)$  
  $\textstyle =$ $\displaystyle - \frac {1} {n} \log \Bigg( \prod_{i=1}^{n} P (x_i) \Bigg).$ (62)

We can observe that the expression on the right involves the product of the probabilities of occurrence of the observations. This product is, in fact, the likelihood function associated with the observations. Recall that the ML estimate selects that parameter value that maximizes the likelihood function--where each term is the probability conditioned on the parameter value. Indeed, we can prove that the ML parameter estimates are the same as the minimum-entropy parameter estimates when dealing with Shannon's entropy measure:

$\displaystyle \mathop{\mbox{argmax }}_{\theta} \prod_{i=1}^{n} P (x_i \vert \theta)$ $\textstyle =$ $\displaystyle \mathop{\mbox{argmin }}_{\theta} \frac {-1} {n} \log \Bigg( \prod_{i=1}^{n} P (x_i \vert \theta) \Bigg)$  
  $\textstyle =$ $\displaystyle \mathop{\mbox{argmin }}_{\theta} \frac {1} {n} \sum_{i=1}^{n} \Big( - \log P (x_i \vert \theta) \Big)$  
  $\textstyle \approx$ $\displaystyle \mathop{\mbox{argmin }}_{\theta} h (X).$ (63)

The joint entropy of two RVs $X$ and $Y$ is

$\displaystyle h (X, Y) = \int_{\mathcal{S}_X} \int_{\mathcal{S}_Y} - P (x, y) \log P (x, y) dx dy,$     (64)

analogous to the definition of the entropy of a single RV [154].


next up previous
Next: Conditional Entropy Up: Information Theory Previous: Information Theory
Suyash P. Awate 2007-02-21