Entropy

Definition (Entropy)

Let $X$ be a discrete random variable with finite alphabet $\mathscr{X}$ and pmf given by $p_X(a):=P(X=a)$, $a\in\mathscr{X}$. The entropy of $X$ is denoted by $H(X)$ and given by:
$$H(X)=-\sum_{a\in\mathscr{X}}p_X(a)\log_2(p_X(a))$$
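As a minimal sketch, the definition translates directly into code (the function name and example pmf below are illustrative, not from the source):

```python
import math

def entropy(pmf, base=2):
    """Shannon entropy of a discrete pmf given as a list of probabilities.

    Zero-probability terms are skipped, using the convention 0 * log(0) = 0.
    """
    assert abs(sum(pmf) - 1.0) < 1e-9, "pmf must sum to 1"
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin conveys exactly 1 bit of information on average.
print(entropy([0.5, 0.5]))  # 1.0
```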

Intuition

Taking the notion of Self-Information, we can look at entropy as the average information conveyed by a random variable, i.e. the expected information from a given distribution. Examining the definition of entropy $H(X)$, we have
$$\begin{align*} H(X)&=-\sum_{a\in\mathscr{X}}p_X(a)\log_2(p_X(a))\\ &=\sum_{a\in\mathscr{X}}p_X(a)\, I(p_X(a))\\ &=E_{p_X}[I(p_X(x))]\\ &=E_{p_X}[-\log_2(p_X(x))] \end{align*}$$
so $H(X)$ is the expected (statistical average) amount of information one gains about the occurrence of one of its $|\mathscr{X}|$ outcomes.

Remark

$H(X)$ denotes entropy with base $b=2$ (units of bits). Other values of $b$ will be denoted using a subscript, as in $H_b(X)$.

Lemma (Nonnegativity and change of base)

  1. $H(X)\ge0$, with equality if and only if $X$ is deterministic (i.e. $\exists c\in\mathscr{X}$ such that $p_X(c)=1$)
  2. $H_b(X):=-\sum_{a\in\mathscr{X}} p_X(a)\log_b p_X(a)=(\log_b 2)\,H(X)$
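A quick numerical check of the change-of-base identity, using an illustrative pmf (not from the source):

```python
import math

pmf = [0.5, 0.25, 0.25]

# Entropy in bits (base 2) and in nats (base e), each from the definition.
h_bits = -sum(p * math.log2(p) for p in pmf if p > 0)
h_nats = -sum(p * math.log(p) for p in pmf if p > 0)

# Change of base with b = e: H_e(X) = (log_e 2) * H(X)
print(abs(h_nats - math.log(2) * h_bits) < 1e-12)  # True
```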

Theorem (Upper bound of Entropy)

If $X\sim p_X$ is a discrete RV (with finite alphabet $\mathscr{X}$), then $H(X)\le\log_2|\mathscr{X}|$, with equality if and only if $X$ is uniformly distributed over $\mathscr{X}$.
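The bound is easy to observe numerically; a sketch with an illustrative alphabet of size 4 (the skewed pmf is an arbitrary example):

```python
import math

def entropy(pmf):
    # Shannon entropy in bits; 0 * log(0) terms are skipped.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))                # 2.0, i.e. log2(4): the maximum
print(entropy(skewed) < math.log2(n))  # True: a non-uniform pmf falls below
```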

Intuition

From the perspective of Information Theory, entropy tells us the limit of how efficiently our information can be encoded (i.e. any lossless encoding scheme's average bit usage per symbol is lower bounded by the entropy of the input).

You can also think of it as giving a benchmark for the most efficient way to encode information from a given distribution.

Intuitively, entropy $H(X)$ tells us how “random” $X$ is:

  1. If $X$ is deterministic, then $H(X)=0$
    1. This is the minimum value of entropy or randomness for $X$
  2. If $X$ is uniform, then $H(X)=\log_2|\mathscr{X}|$
    1. This is the maximal value of entropy or randomness for $X$

Definition (Conditional Entropy)

Given $(X,Y)\sim p_{XY}$, the conditional entropy of $X$ given $Y$ is denoted by $H(X|Y)$ and given by
$$\begin{align*}H(X|Y)&=E_{p_{XY}}[-\log_2(p_{X|Y}(x|y))]\\&=-\sum_{a\in\mathscr{X}}\sum_{b\in\mathscr{Y}}p_{XY}(a,b)\,\log_2(p_{X|Y}(a|b))\end{align*}$$
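Since $p_{X|Y}(a|b)=p_{XY}(a,b)/p_Y(b)$, the double sum can be computed from the joint pmf alone. A sketch with an illustrative joint pmf (the numbers are arbitrary):

```python
import math

# Illustrative joint pmf p_{XY}(a, b) over X = {0, 1}, Y = {0, 1}.
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.5, (1, 1): 0.0}

def conditional_entropy(p_xy):
    """H(X|Y) = -sum_{a,b} p(a,b) log2 p(a|b), skipping zero-probability terms."""
    # Marginal p_Y(b) obtained by summing out a.
    p_y = {}
    for (a, b), p in p_xy.items():
        p_y[b] = p_y.get(b, 0.0) + p
    # p(a|b) = p(a,b) / p_Y(b)
    return -sum(p * math.log2(p / p_y[b])
                for (a, b), p in p_xy.items() if p > 0)

print(conditional_entropy(p_xy))
```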

Lemma (Chain Rule)

$$H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)$$
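A numerical check of the chain rule on an illustrative joint pmf (values chosen arbitrarily):

```python
import math

# Illustrative joint pmf over X = {0, 1}, Y = {0, 1}.
p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def H(pmf):
    # Shannon entropy in bits of an iterable of probabilities.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Marginal of Y, summing the joint over a.
p_y = {}
for (a, b), p in p_xy.items():
    p_y[b] = p_y.get(b, 0.0) + p

# H(X|Y) directly from its definition.
h_x_given_y = -sum(p * math.log2(p / p_y[b])
                   for (a, b), p in p_xy.items() if p > 0)

# Chain rule: H(X,Y) = H(Y) + H(X|Y)
h_xy = H(p_xy.values())
print(abs(h_xy - (H(p_y.values()) + h_x_given_y)) < 1e-12)  # True
```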

Theorem (Conditioning Reduces Entropy)

If $(X,Y)\sim p_{XY}$, then $H(X)\ge H(X|Y)$, and if $X\perp\!\!\!\perp Y$ then $H(X)=H(X|Y)$.
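Both cases of the theorem can be observed numerically; the correlated and independent joint pmfs below are illustrative:

```python
import math

def H(pmf):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def H_cond(p_xy):
    # H(X|Y) computed from a joint pmf dict {(a, b): p}.
    p_y = {}
    for (a, b), p in p_xy.items():
        p_y[b] = p_y.get(b, 0.0) + p
    return -sum(p * math.log2(p / p_y[b])
                for (a, b), p in p_xy.items() if p > 0)

p_x = [0.5, 0.5]  # marginal of X in both examples below

# Correlated pair: observing Y strictly reduces uncertainty about X.
corr = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(H(p_x) > H_cond(corr))  # True: H(X) > H(X|Y)

# Independent pair: observing Y tells us nothing about X.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(abs(H(p_x) - H_cond(indep)) < 1e-12)  # True: H(X) = H(X|Y)
```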
