
Divergence

🌱

Definition
InfoTheory

Given two pmfs $p$ and $q$ on the same alphabet $\mathscr{X}$, the divergence between $p$ and $q$, denoted by $D(p\|q)$, is given by:
$$\begin{align*} D(p\|q)&=\sum_{a\in\mathscr{X}}p(a)\log_2\left(\frac{p(a)}{q(a)}\right)\\ &=E_p\left[\log_2\left(\frac{p(X)}{q(X)}\right)\right] \end{align*}$$
where $\log_2\left(\frac{p(X)}{q(X)}\right)$ is called the "log-likelihood ratio".

$D(p\|q)$ is a measure of dissimilarity or distance between the distributions $p$ and $q$. It measures the inefficiency of assuming that the distribution of the observed data $X$ is $q$ when the true one is actually $p$.

It can also be thought of as the added entropy (i.e. the extra average bits) incurred by using the approximating distribution $q$ in place of the true distribution $p$, with $p$ as the reference point (hence the expectation is taken with respect to $p$).
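For a concrete feel, take $p=\left(\tfrac{1}{2},\tfrac{1}{2}\right)$ and $q=\left(\tfrac{1}{4},\tfrac{3}{4}\right)$ on $\mathscr{X}=\{0,1\}$ (an illustrative example, not from the note):
$$D(p\|q)=\frac{1}{2}\log_2\frac{1/2}{1/4}+\frac{1}{2}\log_2\frac{1/2}{3/4}=\frac{1}{2}-\frac{1}{2}\log_2\frac{3}{2}\approx0.2075\text{ bits},$$
while $D(q\|p)=\frac{1}{4}\log_2\frac{1/4}{1/2}+\frac{3}{4}\log_2\frac{3/4}{1/2}\approx0.1887$ bits, already hinting at the asymmetry remarked on below.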

Convention

If $\mathscr{X}$ contains points of zero mass under $p$ or $q$ (i.e. it is not the support of both $p$ and $q$), then:
$$\begin{align*} 0\log\frac{0}{q}&=0 &&\text{for } q\ge0\\ p\log\frac{p}{0}&\to+\infty &&\text{for } p>0 \end{align*}$$

>[!rmk]
>Divergence is not a true distance as it satisfies neither symmetry nor the triangle inequality.
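A minimal computational sketch of the definition together with these conventions (Python; the function name and example values are my own, not from the note):

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits, with 0*log(0/q) = 0 and p*log(p/0) = +inf."""
    d = 0.0
    for pa, qa in zip(p, q):
        if pa == 0:
            continue          # convention: 0 * log(0/q) = 0
        if qa == 0:
            return math.inf   # convention: p * log(p/0) -> +inf for p > 0
        d += pa * math.log2(pa / qa)
    return d

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # ~0.2075 bits
print(kl_divergence(q, p))  # ~0.1887 bits -- not symmetric, as the remark says
```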

Lemma (Non-negativity of divergence)

D(pq)0D(p\|q)\ge0 with equality if and only if p=qp=q

For $p$ and $q$ with support $\mathscr{X}$, the divergence satisfies $$\frac{1}{2\log2}\|p-q\|^2\le D(p\|q)\le\frac{1}{(\log2)\min_{a\in\mathscr{X}}\{p(a),q(a)\}}\|p-q\|$$
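The note does not pin down which norm $\|\cdot\|$ is meant; reading it as the $\ell_1$ norm (as in Pinsker's inequality) and $\log$ as the natural logarithm, a quick numerical sanity check could look like this (a sketch under those assumptions):

```python
import math

def kl_bits(p, q):
    return sum(pa * math.log2(pa / qa) for pa, qa in zip(p, q) if pa > 0)

p = [0.6, 0.4]
q = [0.4, 0.6]

d  = kl_bits(p, q)                                # ~0.1170 bits
l1 = sum(abs(pa - qa) for pa, qa in zip(p, q))    # ||p - q||_1 = 0.4

lower = l1 ** 2 / (2 * math.log(2))               # Pinsker lower bound, ~0.1154
upper = l1 / (math.log(2) * min(min(p), min(q)))  # upper bound, ~1.4427

assert lower <= d <= upper
print(f"{lower:.4f} <= {d:.4f} <= {upper:.4f}")
```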

Given two joint pmfs $p_{XY}=p_Xp_{Y|X}$ and $q_{XY}=q_Xq_{Y|X}$ on $\mathscr{X}\times\mathscr{Y}$, the divergence can be expressed as $$D(p_{XY}\|q_{XY})=D(p_X\|q_X)+D(p_{Y|X}\|q_{Y|X}|p_X)$$
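This chain rule follows directly from the two factorisations (a short derivation, filled in here for completeness):
$$D(p_{XY}\|q_{XY})=E_{p_{XY}}\left[\log_2\frac{p_X(X)\,p_{Y|X}(Y|X)}{q_X(X)\,q_{Y|X}(Y|X)}\right]=E_{p_X}\left[\log_2\frac{p_X(X)}{q_X(X)}\right]+E_{p_{XY}}\left[\log_2\frac{p_{Y|X}(Y|X)}{q_{Y|X}(Y|X)}\right],$$
where the two terms are, by definition, $D(p_X\|q_X)$ and $D(p_{Y|X}\|q_{Y|X}|p_X)$ respectively.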

Let $p_X$ be a pmf on $\mathscr{X}$ and let $p_{Y|X}$ and $q_{Y|X}$ be two different conditional pmfs on $\mathscr{Y}\times\mathscr{X}$. Then the conditional divergence between $p_{Y|X}$ and $q_{Y|X}$ given $p_X$ is
$$\begin{align*}D(p_{Y|X}\|q_{Y|X}|p_X):&=E_{p_{Y|X}p_X}\left[\log_2\left(\frac{p_{Y|X}(Y|X)}{q_{Y|X}(Y|X)}\right)\right]\\&=\sum_{a\in\mathscr{X}}\sum_{b\in\mathscr{Y}}p_X(a)p_{Y|X}(b|a)\log_2\frac{p_{Y|X}(b|a)}{q_{Y|X}(b|a)}\\&=\sum_{a\in\mathscr{X}}p_X(a)\sum_{b\in\mathscr{Y}}p_{Y|X}(b|a)\log_2\frac{p_{Y|X}(b|a)}{q_{Y|X}(b|a)}\\&=E_{p_X}\left[D(p_{Y|X=X}\|q_{Y|X=X})\right]\end{align*}$$
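A small numerical sketch (illustrative pmfs and variable names of my own) computing the conditional divergence via the double sum above, then checking it against the chain rule with $q_X=p_X$:

```python
import math

def kl_bits(p, q):
    return sum(pa * math.log2(pa / qa) for pa, qa in zip(p, q) if pa > 0)

# Illustrative pmfs: p_X on {0,1}; p_{Y|X} and q_{Y|X} as one row per x.
p_x   = [0.3, 0.7]
p_y_x = [[0.9, 0.1], [0.2, 0.8]]
q_y_x = [[0.5, 0.5], [0.4, 0.6]]

# Conditional divergence: E_{p_X}[ D(p_{Y|X=a} || q_{Y|X=a}) ].
cond_div = sum(pa * kl_bits(py, qy)
               for pa, py, qy in zip(p_x, p_y_x, q_y_x))

# Chain-rule check with q_X = p_X (so D(p_X || q_X) = 0):
p_xy = [pa * pya for pa, py in zip(p_x, p_y_x) for pya in py]
q_xy = [pa * qya for pa, qy in zip(p_x, q_y_x) for qya in qy]
assert abs(kl_bits(p_xy, q_xy) - cond_div) < 1e-12

print(cond_div)  # ~0.2517 bits
```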

Let $(X,\hat X,Z)\sim p_{X\hat XZ}$ on $\mathscr{X}\times\mathscr{X}\times\mathscr{Z}$. Then $$D(p_{X|Z}\|p_{\hat X|Z}|p_Z)\ge D(p_X\|p_{\hat X})$$
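One way to see this (a proof sketch) is to apply the chain rule above to $p_{XZ}=p_Zp_{X|Z}$ and the auxiliary pmf $q_{XZ}:=p_Zp_{\hat X|Z}$ in both factorisation orders:
$$D(p_{X|Z}\|p_{\hat X|Z}|p_Z)=D(p_{XZ}\|q_{XZ})=D(p_X\|q_X)+D(p_{Z|X}\|q_{Z|X}|p_X)\ge D(p_X\|p_{\hat X}),$$
using $D(p_Z\|p_Z)=0$ for the first equality, $q_X=\sum_{z}p_Z(z)p_{\hat X|Z}(\cdot|z)=p_{\hat X}$ for the marginal, and non-negativity of the conditional divergence for the final bound.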
