Entropy

Definition (Entropy)

Let $X$ be a discrete random variable with finite alphabet $\mathscr{X}$ and pmf given by $p_X(a):=P(X=a)$, $a\in\mathscr{X}$. The entropy of $X$ is denoted by $H(X)$ and given by:
$$H(X)=-\sum_{a\in\mathscr{X}}p_X(a)\log_2(p_X(a))$$
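As a minimal sketch, the definition translates directly into code (the function name and example pmf below are illustrative, not from the source):

```python
import math

def entropy(pmf, base=2):
    """Shannon entropy of a discrete pmf given as a list of probabilities.

    Zero-probability terms are skipped, using the convention 0 * log(0) = 0.
    """
    assert abs(sum(pmf) - 1.0) < 1e-9, "pmf must sum to 1"
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin conveys exactly 1 bit of information on average.
print(entropy([0.5, 0.5]))  # 1.0
```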

Intuition

Taking the notion of Self-Information, we can look at entropy as the average information conveyed by a random variable, i.e. the expected information from a given distribution. Examining the definition of entropy $H(X)$, we have
$$\begin{align*} H(X)&=-\sum_{a\in\mathscr{X}}p_X(a)\log_2(p_X(a))\\ &=\sum_{a\in\mathscr{X}}p_X(a)\, I(p_X(a))\\ &=E_{p_X}[I(p_X(x))]\\ &=E_{p_X}[-\log_2(p_X(x))] \end{align*}$$
so $H(X)$ is the expected (statistical average) amount of information one gains about the occurrence of one of its $|\mathscr{X}|$ outcomes.

Remark

$H(X)$ denotes entropy with base $b=2$ (units of bits). Other values of $b$ will be denoted using a subscript, as in $H_b(X)$.

Lemma (Nonnegativity and change of base)

  1. $H(X)\ge0$, with equality if and only if $X$ is deterministic (i.e. $\exists c\in\mathscr{X}$ such that $p_X(c)=1$)
  2. $H_b(X):=-\sum_{a\in\mathscr{X}} p_X(a)\log_b p_X(a)=(\log_b 2)\,H(X)$
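A quick numerical check of the change-of-base identity, using an illustrative pmf (not from the source):

```python
import math

pmf = [0.5, 0.25, 0.25]

# Entropy in bits (base 2) and in nats (base e), each from the definition.
h_bits = -sum(p * math.log2(p) for p in pmf if p > 0)
h_nats = -sum(p * math.log(p) for p in pmf if p > 0)

# Change of base with b = e: H_e(X) = (log_e 2) * H(X)
print(abs(h_nats - math.log(2) * h_bits) < 1e-12)  # True
```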

Theorem (Upper bound of Entropy)

If $X\sim p_X$ is a discrete RV (with finite alphabet $\mathscr{X}$), then $H(X)\le\log_2|\mathscr{X}|$, with equality if and only if $X$ is uniformly distributed over $\mathscr{X}$.
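The bound is easy to observe numerically; a sketch with an illustrative alphabet of size 4 (the skewed pmf is an arbitrary example):

```python
import math

def entropy(pmf):
    # Shannon entropy in bits; 0 * log(0) terms are skipped.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))                # 2.0, i.e. log2(4): the maximum
print(entropy(skewed) < math.log2(n))  # True: a non-uniform pmf falls below
```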

Intuition

From the perspective of Information Theory, entropy tells us the limit of how efficiently our information can be encoded (i.e. any lossless encoding scheme's average bit usage per symbol is lower bounded by the entropy of the input).

You can also think of it as giving a benchmark for the most efficient way to encode information from a given distribution.

Intuitively, entropy $H(X)$ tells us how “random” $X$ is:

  1. If $X$ is deterministic, then $H(X)=0$
    1. This is the minimum value of entropy or randomness for $X$
  2. If $X$ is uniform, then $H(X)=\log_2|\mathscr{X}|$
    1. This is the maximal value of entropy or randomness for $X$

Definition (Conditional Entropy)

Given $(X,Y)\sim p_{XY}$, the conditional entropy of $X$ given $Y$ is denoted by $H(X|Y)$ and given by
$$\begin{align*}H(X|Y)&=E_{p_{XY}}[-\log_2(p_{X|Y}(x|y))]\\&=-\sum_{a\in\mathscr{X}}\sum_{b\in\mathscr{Y}}p_{XY}(a,b)\,\log_2(p_{X|Y}(a|b))\end{align*}$$
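Since $p_{X|Y}(a|b)=p_{XY}(a,b)/p_Y(b)$, the double sum can be computed from the joint pmf alone. A sketch with an illustrative joint pmf (the numbers are arbitrary):

```python
import math

# Illustrative joint pmf p_{XY}(a, b) over X = {0, 1}, Y = {0, 1}.
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.5, (1, 1): 0.0}

def conditional_entropy(p_xy):
    """H(X|Y) = -sum_{a,b} p(a,b) log2 p(a|b), skipping zero-probability terms."""
    # Marginal p_Y(b) obtained by summing out a.
    p_y = {}
    for (a, b), p in p_xy.items():
        p_y[b] = p_y.get(b, 0.0) + p
    # p(a|b) = p(a,b) / p_Y(b)
    return -sum(p * math.log2(p / p_y[b])
                for (a, b), p in p_xy.items() if p > 0)

print(conditional_entropy(p_xy))
```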

Lemma (Chain Rule)

$$H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)$$
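A numerical check of the chain rule on an illustrative joint pmf (values chosen arbitrarily):

```python
import math

# Illustrative joint pmf over X = {0, 1}, Y = {0, 1}.
p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def H(pmf):
    # Shannon entropy in bits of an iterable of probabilities.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Marginal of Y, summing the joint over a.
p_y = {}
for (a, b), p in p_xy.items():
    p_y[b] = p_y.get(b, 0.0) + p

# H(X|Y) directly from its definition.
h_x_given_y = -sum(p * math.log2(p / p_y[b])
                   for (a, b), p in p_xy.items() if p > 0)

# Chain rule: H(X,Y) = H(Y) + H(X|Y)
h_xy = H(p_xy.values())
print(abs(h_xy - (H(p_y.values()) + h_x_given_y)) < 1e-12)  # True
```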

Theorem (Conditioning Reduces Entropy)

If $(X,Y)\sim p_{XY}$, then $H(X)\ge H(X|Y)$, and if $X\perp\!\!\!\perp Y$ then $H(X)=H(X|Y)$.
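Both cases of the theorem can be observed numerically; the correlated and independent joint pmfs below are illustrative:

```python
import math

def H(pmf):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def H_cond(p_xy):
    # H(X|Y) computed from a joint pmf dict {(a, b): p}.
    p_y = {}
    for (a, b), p in p_xy.items():
        p_y[b] = p_y.get(b, 0.0) + p
    return -sum(p * math.log2(p / p_y[b])
                for (a, b), p in p_xy.items() if p > 0)

p_x = [0.5, 0.5]  # marginal of X in both examples below

# Correlated pair: observing Y strictly reduces uncertainty about X.
corr = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(H(p_x) > H_cond(corr))  # True: H(X) > H(X|Y)

# Independent pair: observing Y tells us nothing about X.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(abs(H(p_x) - H_cond(indep)) < 1e-12)  # True: H(X) = H(X|Y)
```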
