
Divergence

🌱

Definition
InfoTheory

Given two pmfs $p$ and $q$ on the same alphabet $\mathscr{X}$, the divergence between $p$ and $q$, denoted by $D(p\|q)$, is given by:
$$\begin{align*} D(p\|q)&=\sum_{a\in\mathscr{X}}p(a)\log_2\left(\frac{p(a)}{q(a)}\right)\\ &=E_p\left[\log_2\left(\frac{p(X)}{q(X)}\right)\right] \end{align*}$$
where $\log_2\left(\frac{p(X)}{q(X)}\right)$ is called the "log-likelihood ratio".

$D(p\|q)$ is a measure of dissimilarity or distance between the distributions $p$ and $q$. It measures the inefficiency of assuming that the distribution of the observed data $X$ is $q$ when the true one is actually $p$.

It can also be thought of as the added entropy (i.e. the extra average bits) incurred by using the approximating distribution $q$ in place of the true distribution $p$, with $p$ as the reference point (hence the expectation is taken with respect to $p$).
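For a concrete feel, take $p=\left(\tfrac{1}{2},\tfrac{1}{2}\right)$ and $q=\left(\tfrac{1}{4},\tfrac{3}{4}\right)$ on $\mathscr{X}=\{0,1\}$ (an illustrative example, not from the note):
$$D(p\|q)=\frac{1}{2}\log_2\frac{1/2}{1/4}+\frac{1}{2}\log_2\frac{1/2}{3/4}=\frac{1}{2}-\frac{1}{2}\log_2\frac{3}{2}\approx0.2075\text{ bits},$$
while $D(q\|p)=\frac{1}{4}\log_2\frac{1/4}{1/2}+\frac{3}{4}\log_2\frac{3/4}{1/2}\approx0.1887$ bits, already hinting at the asymmetry remarked on below.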

Convention

If $\mathscr{X}$ contains points of zero mass under $p$ or $q$ (i.e. it is not the support of both $p$ and $q$), then:
$$\begin{align*} 0\log\frac{0}{q}&=0 &&\text{for } q\ge0\\ p\log\frac{p}{0}&\to+\infty &&\text{for } p>0 \end{align*}$$

>[!rmk]
>Divergence is not a true distance as it satisfies neither symmetry nor the triangle inequality.
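A minimal computational sketch of the definition together with these conventions (Python; the function name and example values are my own, not from the note):

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits, with 0*log(0/q) = 0 and p*log(p/0) = +inf."""
    d = 0.0
    for pa, qa in zip(p, q):
        if pa == 0:
            continue          # convention: 0 * log(0/q) = 0
        if qa == 0:
            return math.inf   # convention: p * log(p/0) -> +inf for p > 0
        d += pa * math.log2(pa / qa)
    return d

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # ~0.2075 bits
print(kl_divergence(q, p))  # ~0.1887 bits -- not symmetric, as the remark says
```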

Lemma (Non-negativity of divergence)

D(pq)0D(p\|q)\ge0 with equality if and only if p=qp=q

For $p$ and $q$ with support $\mathscr{X}$, the divergence satisfies $$\frac{1}{2\log2}\|p-q\|^2\le D(p\|q)\le\frac{1}{(\log2)\min_{a\in\mathscr{X}}\{p(a),q(a)\}}\|p-q\|$$
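The note does not pin down which norm $\|\cdot\|$ is meant; reading it as the $\ell_1$ norm (as in Pinsker's inequality) and $\log$ as the natural logarithm, a quick numerical sanity check could look like this (a sketch under those assumptions):

```python
import math

def kl_bits(p, q):
    return sum(pa * math.log2(pa / qa) for pa, qa in zip(p, q) if pa > 0)

p = [0.6, 0.4]
q = [0.4, 0.6]

d  = kl_bits(p, q)                                # ~0.1170 bits
l1 = sum(abs(pa - qa) for pa, qa in zip(p, q))    # ||p - q||_1 = 0.4

lower = l1 ** 2 / (2 * math.log(2))               # Pinsker lower bound, ~0.1154
upper = l1 / (math.log(2) * min(min(p), min(q)))  # upper bound, ~1.4427

assert lower <= d <= upper
print(f"{lower:.4f} <= {d:.4f} <= {upper:.4f}")
```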

Given two joint pmfs $p_{XY}=p_Xp_{Y|X}$ and $q_{XY}=q_Xq_{Y|X}$ on $\mathscr{X}\times\mathscr{Y}$, the divergence can be expressed as $$D(p_{XY}\|q_{XY})=D(p_X\|q_X)+D(p_{Y|X}\|q_{Y|X}|p_X)$$
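This chain rule follows directly from the two factorisations (a short derivation, filled in here for completeness):
$$D(p_{XY}\|q_{XY})=E_{p_{XY}}\left[\log_2\frac{p_X(X)\,p_{Y|X}(Y|X)}{q_X(X)\,q_{Y|X}(Y|X)}\right]=E_{p_X}\left[\log_2\frac{p_X(X)}{q_X(X)}\right]+E_{p_{XY}}\left[\log_2\frac{p_{Y|X}(Y|X)}{q_{Y|X}(Y|X)}\right],$$
where the two terms are, by definition, $D(p_X\|q_X)$ and $D(p_{Y|X}\|q_{Y|X}|p_X)$ respectively.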

Let $p_X$ be a pmf on $\mathscr{X}$ and let $p_{Y|X}$ and $q_{Y|X}$ be two different conditional pmfs on $\mathscr{Y}\times\mathscr{X}$. Then the conditional divergence between $p_{Y|X}$ and $q_{Y|X}$ given $p_X$ is
$$\begin{align*}D(p_{Y|X}\|q_{Y|X}|p_X):&=E_{p_{Y|X}p_X}\left[\log_2\left(\frac{p_{Y|X}(Y|X)}{q_{Y|X}(Y|X)}\right)\right]\\&=\sum_{a\in\mathscr{X}}\sum_{b\in\mathscr{Y}}p_X(a)p_{Y|X}(b|a)\log_2\frac{p_{Y|X}(b|a)}{q_{Y|X}(b|a)}\\&=\sum_{a\in\mathscr{X}}p_X(a)\sum_{b\in\mathscr{Y}}p_{Y|X}(b|a)\log_2\frac{p_{Y|X}(b|a)}{q_{Y|X}(b|a)}\\&=E_{p_X}\left[D(p_{Y|X=X}\|q_{Y|X=X})\right]\end{align*}$$
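A small numerical sketch (illustrative pmfs and variable names of my own) computing the conditional divergence via the double sum above, then checking it against the chain rule with $q_X=p_X$:

```python
import math

def kl_bits(p, q):
    return sum(pa * math.log2(pa / qa) for pa, qa in zip(p, q) if pa > 0)

# Illustrative pmfs: p_X on {0,1}; p_{Y|X} and q_{Y|X} as one row per x.
p_x   = [0.3, 0.7]
p_y_x = [[0.9, 0.1], [0.2, 0.8]]
q_y_x = [[0.5, 0.5], [0.4, 0.6]]

# Conditional divergence: E_{p_X}[ D(p_{Y|X=a} || q_{Y|X=a}) ].
cond_div = sum(pa * kl_bits(py, qy)
               for pa, py, qy in zip(p_x, p_y_x, q_y_x))

# Chain-rule check with q_X = p_X (so D(p_X || q_X) = 0):
p_xy = [pa * pya for pa, py in zip(p_x, p_y_x) for pya in py]
q_xy = [pa * qya for pa, qy in zip(p_x, q_y_x) for qya in qy]
assert abs(kl_bits(p_xy, q_xy) - cond_div) < 1e-12

print(cond_div)  # ~0.2517 bits
```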

Let $(X,\hat X,Z)\sim p_{X\hat XZ}$ on $\mathscr{X}\times\mathscr{X}\times\mathscr{Z}$. Then $$D(p_{X|Z}\|p_{\hat X|Z}|p_Z)\ge D(p_X\|p_{\hat X})$$
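One way to see this (a proof sketch) is to apply the chain rule above to $p_{XZ}=p_Zp_{X|Z}$ and the auxiliary pmf $q_{XZ}:=p_Zp_{\hat X|Z}$ in both factorisation orders:
$$D(p_{X|Z}\|p_{\hat X|Z}|p_Z)=D(p_{XZ}\|q_{XZ})=D(p_X\|q_X)+D(p_{Z|X}\|q_{Z|X}|p_X)\ge D(p_X\|p_{\hat X}),$$
using $D(p_Z\|p_Z)=0$ for the first equality, $q_X=\sum_{z}p_Z(z)p_{\hat X|Z}(\cdot|z)=p_{\hat X}$ for the marginal, and non-negativity of the conditional divergence for the final bound.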
