Definition (Mutual Information)
Given $(X,Y)\sim p_{XY}$ on $\mathcal{X}\times\mathcal{Y}$, the mutual information between $X$ and $Y$, denoted by $I(X;Y)$, is given by

$$I(X;Y) := D(p_{XY}\,\|\,p_X p_Y) = \mathbb{E}_{p_{XY}}\!\left[\log_2\frac{p_{XY}(X,Y)}{p_X(X)\,p_Y(Y)}\right] = \sum_{a\in\mathcal{X}}\sum_{b\in\mathcal{Y}} p_{XY}(a,b)\,\log_2\frac{p_{XY}(a,b)}{p_X(a)\,p_Y(b)}$$
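As a quick numerical sketch of this definition (a NumPy illustration, not part of the original notes; the helper name `mutual_information` is ours), the double sum can be computed directly from a joint pmf stored as a 2-D array:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits, from a joint pmf given as a 2-D array indexed (x, y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal pmf of X (column vector)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal pmf of Y (row vector)
    mask = p_xy > 0                         # convention: 0 log 0 = 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum())

# Example: a binary pair where Y equals X with probability 0.9
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])
print(round(mutual_information(p), 3))  # 0.531 bits, i.e. 1 - H_b(0.1)
```

For an independent pair ($p_{XY}=p_X p_Y$) the same function returns 0, matching the nonnegativity property below.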
Proposition (Properties of Mutual Information)
- Symmetry: $I(X;Y)=I(Y;X)$
- Chain Rule: $I(X;Y)=H(X)-H(X\mid Y)=H(Y)-H(Y\mid X)=H(X)+H(Y)-H(X,Y)$
- Mutual Information of the same variable is Entropy: $I(X;X)=H(X)$
- Nonnegativity: $I(X;Y)\ge 0$ (with equality iff $X\perp\!\!\!\perp Y$)
- LUB: $I(X;Y)\le\min\{\log_2|\mathcal{X}|,\log_2|\mathcal{Y}|\}$
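All five properties can be spot-checked numerically on a random joint pmf (an illustrative sketch; the helpers `H` and `mi` are ours, not notation from the notes):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a pmf given as a flat array."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mi(p_xy):
    """I(X;Y) in bits from a 2-D joint pmf."""
    p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log2(p_xy[m] / (p_x * p_y)[m])).sum())

rng = np.random.default_rng(0)
p_xy = rng.random((3, 4)); p_xy /= p_xy.sum()   # random joint pmf on a 3x4 alphabet

assert np.isclose(mi(p_xy), mi(p_xy.T))          # symmetry
assert mi(p_xy) >= 0                             # nonnegativity
assert mi(p_xy) <= np.log2(3)                    # LUB: min(log2 3, log2 4)
p_x = p_xy.sum(axis=1)
assert np.isclose(mi(np.diag(p_x)), H(p_x))      # I(X;X) = H(X): pmf of (X,X) is diagonal
assert np.isclose(mi(p_xy),                      # I = H(X) + H(Y) - H(X,Y)
                  H(p_xy.sum(1)) + H(p_xy.sum(0)) - H(p_xy.ravel()))
```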
Definition (Conditional Mutual Information)
Let $(X,Y,Z)\sim p_{XYZ}$ on $\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$. The conditional mutual information between $X$ and $Y$ given $Z$ is:

$$\begin{aligned}
I(X;Y\mid Z) &:= D(p_{XY|Z}\,\|\,p_{X|Z}\,p_{Y|Z}\mid p_Z) \\
&= \sum_{a\in\mathcal{X}}\sum_{b\in\mathcal{Y}}\sum_{c\in\mathcal{Z}} p_Z(c)\,p_{XY|Z}(a,b\mid c)\,\log_2\frac{p_{XY|Z}(a,b\mid c)}{p_{X|Z}(a\mid c)\,p_{Y|Z}(b\mid c)} \\
&= \sum_{c\in\mathcal{Z}} p_Z(c)\sum_{a\in\mathcal{X}}\sum_{b\in\mathcal{Y}} p_{XY|Z}(a,b\mid c)\,\log_2\frac{p_{XY|Z}(a,b\mid c)}{p_{X|Z}(a\mid c)\,p_{Y|Z}(b\mid c)} \\
&= \mathbb{E}_{z\sim p_Z}\!\left[D(p_{XY|z}\,\|\,p_{X|z}\,p_{Y|z})\right]
\end{aligned}$$
Lemma (Conditional Analog Property)
$$I(X;Y\mid Z) = H(X\mid Z)-H(X\mid Z,Y) = H(Y\mid Z)-H(Y\mid Z,X) = H(X\mid Z)+H(Y\mid Z)-H(X,Y\mid Z)$$
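The expectation form $\mathbb{E}_{z\sim p_Z}[D(p_{XY|z}\,\|\,p_{X|z}\,p_{Y|z})]$ translates directly into code: average the ordinary MI of each conditional slice. Below is a sketch (helper names `mi`/`cmi` and the Markov-chain example are ours) that also illustrates why conditioning matters: for a chain $X - Z - Y$, $I(X;Y\mid Z)=0$ even though $I(X;Y)>0$.

```python
import numpy as np

def mi(p_xy):
    p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log2(p_xy[m] / (p_x * p_y)[m])).sum())

def cmi(p_xyz):
    """I(X;Y|Z) = E_{z~p_Z}[ D(p_XY|z || p_X|z p_Y|z) ], array indexed (x, y, z)."""
    total = 0.0
    for c in range(p_xyz.shape[2]):
        p_z = p_xyz[:, :, c].sum()
        if p_z > 0:
            total += p_z * mi(p_xyz[:, :, c] / p_z)   # weight slice MI by p_Z(c)
    return total

# Markov chain X - Z - Y: X and Y are conditionally independent given Z
p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[0.9, 0.2], [0.1, 0.8]])   # columns indexed by z
p_y_given_z = np.array([[0.7, 0.3], [0.3, 0.7]])
p = np.einsum('c,ac,bc->abc', p_z, p_x_given_z, p_y_given_z)  # p[a,b,c]

print(cmi(p))             # ~0: given Z, the slices factor
print(mi(p.sum(axis=2)))  # > 0: X and Y are still dependent through Z
```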
Lemma (Chain Rule for Mutual Information)
Let the random vector $X^n$ and the RV $Y$ be jointly distributed with joint pmf $p_{X^nY}$. Then

$$I(X^n;Y) = \sum_{i=1}^n I(X_i;Y\mid X^{i-1})$$
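Since the chain rule is an identity for any joint pmf, it can be verified numerically for $n=2$: $I(X_1,X_2;Y)=I(X_1;Y)+I(X_2;Y\mid X_1)$. A sketch under the same array conventions as above (the `mi`/`cmi` helpers are our illustrative names):

```python
import numpy as np

def mi(p_xy):
    p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log2(p_xy[m] / (p_x * p_y)[m])).sum())

def cmi(p_xyz):     # I(X;Y|Z) for an array indexed (x, y, z)
    return sum(p_xyz[:, :, c].sum() * mi(p_xyz[:, :, c] / p_xyz[:, :, c].sum())
               for c in range(p_xyz.shape[2]) if p_xyz[:, :, c].sum() > 0)

rng = np.random.default_rng(1)
p = rng.random((2, 3, 4)); p /= p.sum()     # arbitrary joint pmf of (X1, X2, Y)

lhs = mi(p.reshape(6, 4))                             # I(X1,X2; Y): merge (X1,X2)
rhs = mi(p.sum(axis=1)) + cmi(p.transpose(1, 2, 0))   # I(X1;Y) + I(X2;Y|X1)
assert np.isclose(lhs, rhs)
```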
Lemma (Mutual Information is Minimized by RD Function)
- Let $X\sim p_X$, and let the RV $Y\in\hat{\mathcal{X}}$ have conditional pmf $p(y\mid x)$ such that $\mathbb{E}[d(X,Y)]\le D$. Then $I(X;Y)\ge R(D)$.
- Let $X\sim p_X$, and let $Y\in\hat{\mathcal{X}}$ be any RV. Then $I(X;Y)\ge R(\mathbb{E}[d(X,Y)])$.
Lemma (Independence Bound of Mutual Information)
Let $X_1,\dots,X_n$ be iid RVs. Then for any RVs $\hat{X}_1,\dots,\hat{X}_n$,

$$I(X^n;\hat{X}^n) \ge \sum_{i=1}^n I(X_i;\hat{X}_i)$$
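The inequality can be strict. As an illustrative extreme case (our construction, not from the notes): take $X_1,X_2$ iid fair bits and let the reconstruction swap them, $\hat{X}_1=X_2$, $\hat{X}_2=X_1$. Then $I(X^2;\hat{X}^2)=2$ bits while each per-letter term is $0$:

```python
import numpy as np

def mi(p_xy):
    p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log2(p_xy[m] / (p_x * p_y)[m])).sum())

# X1, X2 iid fair bits; reconstruction swaps them: Xh1 = X2, Xh2 = X1
p = np.zeros((2, 2, 2, 2))           # joint pmf indexed (x1, x2, xh1, xh2)
for x1 in range(2):
    for x2 in range(2):
        p[x1, x2, x2, x1] = 0.25

joint = mi(p.reshape(4, 4))          # I(X^2; Xh^2): bijection of 4 outcomes -> 2 bits
per_letter = mi(p.sum(axis=(1, 3))) + mi(p.sum(axis=(0, 2)))  # each marginal pair independent -> 0
assert joint >= per_letter           # independence bound, here 2 >= 0
```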
Definition (Differential mutual information)
Let $(X,Y)\sim f_{XY}$ with support $S_{XY}\subseteq\mathbb{R}^2$. Then the mutual information between $X$ and $Y$ is

$$I(X;Y) := D(f_{XY}\,\|\,f_X f_Y) = \int_{S_{XY}} f_{XY}(x,y)\,\log_2\frac{f_{XY}(x,y)}{f_X(x)\,f_Y(y)}\,dx\,dy = h(X)+h(Y)-h(X,Y) = h(X)-h(X\mid Y) = h(Y)-h(Y\mid X)$$
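The defining integral is an expectation of the log-density ratio under $f_{XY}$, so it can be estimated by Monte Carlo. A sketch (our example, not from the notes) using the standard bivariate Gaussian with correlation $\rho$, for which the well-known closed form is $I(X;Y)=-\tfrac{1}{2}\log_2(1-\rho^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.8, 200_000
# sample from a standard bivariate Gaussian with correlation rho
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# log2 of the density ratio f_XY(x,y) / (f_X(x) f_Y(y)) at each sample
log_ratio = (-0.5 * np.log(1 - rho**2)
             - (x**2 - 2*rho*x*y + y**2) / (2 * (1 - rho**2))
             + (x**2 + y**2) / 2) / np.log(2)

estimate = log_ratio.mean()                 # sample mean approximates I(X;Y)
closed_form = -0.5 * np.log2(1 - rho**2)    # about 0.737 bits for rho = 0.8
assert abs(estimate - closed_form) < 0.02
```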
Conclusion
Mutual information is a universal information measure in information theory: it reduces to entropy ($I(X;X)=H(X)$), ties together joint and conditional entropies, and extends naturally to the conditional and differential settings above.
Theorem (Chain Rule for differential mutual information)
$$I(X_1,\dots,X_n;Y) = \sum_{i=1}^n I(X_i;Y\mid X_{i-1},\dots,X_1)$$