Differential Entropy

Introduction

So far, we’ve dealt primarily with discrete-time systems with discrete alphabets. We now extend these ideas from discrete to continuous alphabets.

Let $X$ be a real-valued RV taking values in $\mathcal{X}\subset\mathbb{R}$ with cdf $F_X(x)=P(X\le x),\ x\in\mathbb{R}$, and pdf $f_X$, where $$F_X(x)=\int_{-\infty}^{x}f_X(t)\,dt,\ x\in\mathbb{R}.$$ The support of $X$ with pdf $f_X$ is $$S_X=\{x\in\mathbb{R}:f_X(x)>0\}.$$

Some intuition

Recall that for a discrete (finite-alphabet) random variable, entropy has an operational meaning: it is the minimal number of code bits per source symbol needed to describe the source losslessly. For a real-valued RV taking values in a continuum, however, the number of bits required would be infinite, since there are infinitely many source symbols.

Definition (Differential Entropy)

The differential entropy of a real-valued RV $X$ with pdf $f_X$ and support $S_X\subset\mathbb{R}$ is given by $$h(X):=-\int_{S_X}f_X(t)\log_2 f_X(t)\,dt=E_X\left[-\log_2 f_X(X)\right]\quad\text{(bits)}$$ when the integral exists. The usual multivariate extension holds, i.e. for real-valued $X^n=(X_1,\ldots,X_n)$ with pdf $f_{X^n}$ and support $S_{X^n}\subset\mathbb{R}^n$ the joint differential entropy is $$h(X^n)=-\int_{S_{X^n}}f_{X^n}(x_1,\ldots,x_n)\log_2 f_{X^n}(x_1,\ldots,x_n)\,dx_1\cdots dx_n=E_{X^n}\left[-\log_2 f_{X^n}(X^n)\right]\quad\text{(bits)}$$ when the integral exists.
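As a sanity check on the definition, the integral can be approximated numerically. The sketch below (not part of the original notes) evaluates $h(X)$ for a Gaussian pdf by a Riemann sum and compares it against the closed form $\frac{1}{2}\log_2(2\pi e\sigma^2)$; the function name and grid parameters are illustrative choices:

```python
import numpy as np

def differential_entropy_bits(pdf, lo, hi, num=200_000):
    """Approximate h(X) = -∫ f(t) log2 f(t) dt by a Riemann sum over [lo, hi]."""
    t = np.linspace(lo, hi, num)
    f = pdf(t)
    mask = f > 0                      # integrate only over the support S_X
    dt = t[1] - t[0]
    return -np.sum(f[mask] * np.log2(f[mask])) * dt

sigma = 2.0
gauss = lambda t: np.exp(-t**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_numeric = differential_entropy_bits(gauss, -40, 40)
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)  # both ≈ 3.047 bits
```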

Lemma (Uniform Quantization of Real-Valued Source)

Consider a real-valued RV $X$ with support $S_X=[a,b)$ and pdf $f_X$ such that:

  1. $f_X\log_2 f_X$ is Riemann-integrable (i.e. integrable in the sense we’re originally familiar with) and,
  2. $-\int_a^b f_X(t)\log_2 f_X(t)\,dt<\infty$, i.e. the differential entropy over the support is finite.

Let $[X]_n$ be the uniform quantization of $X$ with quantization step size $$\Delta=\frac{b-a}{2^n},$$ i.e. with $n$-bit accuracy.

Then for $n$ sufficiently large, $$H([X]_n)\approx-\int_a^b f_X(t)\log_2 f_X(t)\,dt+n\quad\text{(bits)}$$ i.e. $$\lim_{n\to\infty}\left[H([X]_n)-n\right]=-\int_a^b f_X(t)\log_2 f_X(t)\,dt.$$
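A minimal numerical illustration of the lemma, assuming the triangular pdf $f(t)=2t$ on $[0,1)$ (an arbitrary example, with exact differential entropy $h=\frac{1}{2\ln 2}-1\approx-0.279$ bits): samples are quantized with $n$-bit accuracy and the empirical $H([X]_n)-n$ is compared to $h$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Triangular pdf f(t) = 2t on S_X = [0, 1); h(X) = 1/(2 ln 2) - 1 ≈ -0.279 bits
h_true = 0.5 / np.log(2) - 1

N, n = 1_000_000, 6                       # sample count, quantizer accuracy in bits
x = np.sqrt(rng.random(N))                # inverse-CDF sampling: F(t) = t^2
delta = 1.0 / 2**n                        # step Δ = (b - a) / 2^n on [a, b) = [0, 1)
cells = np.floor(x / delta).astype(int)   # [X]_n: index of each quantization cell

p = np.bincount(cells, minlength=2**n) / N
p = p[p > 0]
H_quantized = -np.sum(p * np.log2(p))     # empirical H([X]_n) in bits

print(H_quantized - n, h_true)  # both ≈ -0.279
```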

Remark

This result also holds for any real-valued RV with:

  1. Possibly unbounded support and,
  2. A well-defined pdf

Intuition

The operational meaning of differential entropy follows from the fact that $H([X]_n)$ is the minimum average number of bits needed to losslessly describe $[X]_n$, the uniformly quantized $X$. We then obtain that approximately $h(X)+n$ bits are needed to describe $X$ when uniformly quantizing it with $n$-bit accuracy, i.e. $$H([X]_n)\approx n+h(X)$$ for $n$ sufficiently large. So the larger $h(X)$ is, the larger the average number of bits required to describe a uniformly quantized $X$ with $n$-bit accuracy. If $(X,Y)\sim f_{XY}$ on $S_{XY}\subset\mathbb{R}^2$, then $$H([X]_n,[Y]_m)\approx m+n+h(X,Y)$$ for $m,n$ sufficiently large.

Proposition (Properties)

  1. Conditioning Reduces Entropy: $h(X)\ge h(X|Y)$, with equality iff $X\perp\!\!\!\perp Y$.
  2. Chain Rule for Differential Entropy: $h(X^n)=h(X_1,\ldots,X_n)=\sum\limits_{i=1}^n h(X_i|X_{i-1},\ldots,X_1)$
  3. Independence Bound for Differential Entropy: $h(X^n)\le\sum\limits_{i=1}^n h(X_i)$, with equality $\iff$ all $X_i$’s are independent.
  4. Invariance of Differential Entropy Under Translation: $h(X+c)=h(X)$ for every constant $c\in\mathbb{R}$.
  5. Differential Entropy Under Scaling: $h(aX)=h(X)+\log_2|a|$ for every constant $a\neq 0$.
  6. Joint Differential Entropy Under Linear Maps: Let $\underline X=(X_1,\ldots,X_n)^T$ be a random column vector with joint pdf $f_{\underline X}=f_{X^n}$ and let $\underline Y=A\underline X$, where $A$ is an $n\times n$ invertible matrix (i.e. $\det(A)\neq 0$). Then $$h(\underline Y)=h(Y_1,\ldots,Y_n)=h(\underline X)+\log_2|\det(A)|.$$
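Properties 4 and 5 can be checked numerically. The sketch below (an illustrative choice, not from the notes) verifies the scaling law for an exponential RV using a simple Riemann-sum entropy estimate:

```python
import numpy as np

def h_bits(pdf, lo, hi, num=400_000):
    # Riemann-sum approximation of h = -∫ f log2 f dt over the support
    t = np.linspace(lo, hi, num)
    f = pdf(t)
    m = f > 0
    return -np.sum(f[m] * np.log2(f[m])) * (t[1] - t[0])

# X ~ Exp(1): f(t) = e^{-t}, t > 0; h(X) = log2(e). Scaling by a gives pdf (1/a)e^{-t/a}.
exp_pdf = lambda t: np.where(t > 0, np.exp(-t), 0.0)
a = 3.0
scaled_pdf = lambda t: np.where(t > 0, np.exp(-t / a) / a, 0.0)

h_X = h_bits(exp_pdf, 0, 60)
h_aX = h_bits(scaled_pdf, 0, 180)
print(h_aX, h_X + np.log2(a))  # scaling law: h(aX) = h(X) + log2|a|
```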

Definition (Conditional differential entropy)

Let $(X,Y)\sim f_{XY}$ with support $S_{XY}\subset\mathbb{R}^2$. The conditional differential entropy of $Y$ given $X$ is given by $$h(Y|X):=-\int_{S_{XY}}f_{XY}(x,y)\log_2 f_{Y|X}(y|x)\,dx\,dy=E_{XY}\left[-\log_2 f_{Y|X}(Y|X)\right].$$

Lemma (Chain Rule)

Similar to the discrete case, $$h(X,Y)=h(X)+h(Y|X)=h(Y)+h(X|Y).$$
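The chain rule can be verified in closed form for a jointly Gaussian pair, using the well-known Gaussian entropy expressions (the covariance values below are arbitrary):

```python
import numpy as np

# Jointly Gaussian (X, Y) with covariance K; closed-form differential entropies
K = np.array([[2.0, 0.8],
              [0.8, 1.5]])
two_pi_e = 2 * np.pi * np.e

h_XY = 0.5 * np.log2(two_pi_e**2 * np.linalg.det(K))   # joint entropy
h_X = 0.5 * np.log2(two_pi_e * K[0, 0])                # marginal entropy
# For jointly Gaussian pairs, Var(Y|X) = det(K) / Var(X)
h_Y_given_X = 0.5 * np.log2(two_pi_e * np.linalg.det(K) / K[0, 0])

print(h_XY, h_X + h_Y_given_X)  # equal: chain rule h(X,Y) = h(X) + h(Y|X)
```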

Theorem (Shannon Lower Bound)

Let $X\sim f$ have finite differential entropy $h(X)$. Then its rate distortion function under MSE distortion is lower bounded as $$R(D)\ge h(X)-\frac{1}{2}\log_2(2\pi eD)=:R_{SLB}(D)$$ or equivalently $$R(D)\ge R_G(D)-D(X\|X_G),$$ where the divergence measures the “non-Gaussianness” of $X$; correspondingly, the distortion rate function satisfies $$D(R)\ge\frac{1}{2\pi e}2^{-2(R-h(X))}.$$
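For a Gaussian source the bound is tight: $R_{SLB}(D)=R_G(D)=\frac{1}{2}\log_2(\sigma^2/D)$. A quick numerical check (the values of $\sigma^2$ and $D$ are illustrative):

```python
import numpy as np

# For Gaussian X ~ N(0, σ²), R(D) = ½ log2(σ²/D) and the SLB holds with equality
sigma2, D = 4.0, 0.5
h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma2)      # Gaussian differential entropy
R_SLB = h_X - 0.5 * np.log2(2 * np.pi * np.e * D)   # Shannon lower bound
R_G = 0.5 * np.log2(sigma2 / D)                     # Gaussian rate-distortion function
print(R_SLB, R_G)  # both 1.5 bits: the SLB is achieved by the Gaussian source
```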

Lemma (Hadamard Inequality)

For any real-valued $n\times n$ positive definite matrix $K=[k_{ij}]$ we have $$\det(K)\le\prod_{i=1}^n k_{ii},$$ with equality iff $K$ is a diagonal matrix.
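A quick numerical check of the inequality on a randomly generated positive definite matrix (the construction and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
K = A @ A.T + 5 * np.eye(5)        # symmetric positive definite by construction
lhs = np.linalg.det(K)
rhs = np.prod(np.diag(K))
print(lhs <= rhs)  # True: Hadamard's inequality
# Equality holds for a diagonal matrix:
Kd = np.diag(np.diag(K))
print(np.isclose(np.linalg.det(Kd), np.prod(np.diag(Kd))))  # True
```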

Theorem (Maximal Differential Entropy for Real-Valued Random Vectors)

Let $\underline X=(X_1,\ldots,X_n)^T$ be a real-valued random vector with support $S_{X^n}=\mathbb{R}^n$, mean vector $\underline\mu$ and (invertible) covariance matrix $K_{\underline X}$. Then $$h(\underline X)=h(X_1,\ldots,X_n)\le\frac{1}{2}\log_2\left[(2\pi e)^n\det(K_{\underline X})\right],$$ with equality iff $\underline X\sim\mathcal{N}(\underline\mu,K_{\underline X})$, i.e. $\underline X$ is Gaussian.
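For $n=1$ this says the Gaussian maximizes $h$ among all densities with a given variance. The sketch below compares closed-form entropies at equal variance (the comparison distributions are chosen for illustration):

```python
import numpy as np

sigma2 = 1.0
# Gaussian upper bound: ½ log2(2πe σ²)
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
# Uniform on (a, b) with the same variance: σ² = (b-a)²/12, so h = log2(b-a)
h_unif = 0.5 * np.log2(12 * sigma2)
# Laplacian with the same variance: σ² = 2λ², so h = log2(2eλ)
lam = np.sqrt(sigma2 / 2)
h_lap = np.log2(2 * np.e * lam)
print(h_gauss, h_lap, h_unif)  # ≈ 2.047 > 1.943 > 1.792: Gaussian is maximal
```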

Theorem (Maximal Differential Entropy Under Various Constraints)

The following results on the maximal possible value of $h(X)$ for a continuous RV $X$ under various constraints can be shown:

  1. If $S_X=(0,\infty)$ and $E[X]=\mu<\infty$, then $$h(X)\le\log_2\frac{e}{\lambda}\quad\text{(bits)}$$ with equality iff $X$ is exponential with parameter $\lambda=\frac{1}{\mu}$.
  2. If $S_X=\mathbb{R}$ and $E[|X|]=\lambda<\infty$, then $$h(X)\le\log_2(2e\lambda)\quad\text{(bits)}$$ with equality iff $X$ is Laplacian with mean zero and variance $2\lambda^2$.
  3. If $S_X=(a,b)$, then $$h(X)\le\log_2(b-a)\quad\text{(bits)}$$ with equality iff $X$ is uniform over $(a,b)$.
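As an illustration of case 1, the sketch below compares the exponential's entropy with that of a half-normal distribution matched to the same mean on $(0,\infty)$ (the comparison distribution is an arbitrary choice; $h(|Z|)=h(Z)-1$ bits follows from the symmetry of the Gaussian pdf):

```python
import numpy as np

mu = 1.0  # common mean, support (0, ∞)
# Exponential with λ = 1/μ achieves the maximum: h = log2(e/λ) = log2(e·μ)
h_exp = np.log2(np.e * mu)
# Half-normal |Z|, Z ~ N(0, σ²), matched to the same mean: σ√(2/π) = μ
sigma2 = np.pi / 2 * mu**2
h_half_normal = 0.5 * np.log2(2 * np.pi * np.e * sigma2) - 1  # h(|Z|) = h(Z) - 1
print(h_exp, h_half_normal)  # ≈ 1.443 > 1.373: exponential is maximal
```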

Remark

In general, if we want to maximize $h(f)$ over all pdfs $f$ defined on support $S\subset\mathbb{R}$ subject to the constraints $$\int_S m_i(x)f(x)\,dx=a_i,\ i=1,\ldots,l,$$ then it can be shown that the maximizer has the form $$f^*(x)=e^{-\lambda_0-\sum\limits_{i=1}^l\lambda_i m_i(x)},\ x\in S,$$ where $\lambda_0,\ldots,\lambda_l$ are chosen so that all $l$ constraints are satisfied and $\int_S f^*(x)\,dx=1$.
