
Differential Entropy

🌱

Definition
InfoTheory

Introduction

So far, we’ve dealt primarily with discrete-time systems with discrete alphabets. Now we move from discrete alphabets to continuous ones.

Let $X$ be a real-valued RV taking values in $\mathcal{X}\subset\mathbb{R}$ with cdf $F_{X}(x)=P(X\le x),\ x\in\mathbb{R}$, and pdf $f_{X}$, where
$$F_{X}(x)=\int_{-\infty}^{x}f_{X}(t)\,dt,\ x\in\mathbb{R}$$
The support of $X$ with pdf $f_{X}$ is
$$S_{X}=\{x\in\mathbb{R}:f_{X}(x)>0\}$$

Some intuition

Recall that for a discrete (finite) alphabet random variable, entropy has an operational meaning: it is the minimal number of code bits per source symbol needed to losslessly describe the source. But for a real-valued RV taking values in a continuum, the number of bits would be infinite, since there are infinitely many source symbols.

The differential entropy of a real-valued RV $X$ with pdf $f_{X}$ and support $S_{X}\subset\mathbb{R}$ is given by
$$h(X):=-\int_{S_{X}}f_{X}(t)\log_{2}f_{X}(t)\,dt=E_{X}\left[-\log_{2}f_{X}(X)\right]$$
when the integral exists. The usual multivariate extension holds, i.e. for real-valued $X^{n}=(X_{1},\ldots,X_{n})$ with pdf $f_{X^{n}}$ and support $S_{X^{n}}\subset\mathbb{R}^{n}$ the joint differential entropy is
$$\begin{align*} h(X^{n})&=-\int_{S_{X^{n}}}f_{X^{n}}(x_{1},\ldots,x_{n})\log_{2}f_{X^{n}}(x_{1},\ldots,x_{n})\,dx_{1}\cdots dx_{n}\\ &=E_{X^{n}}\left[-\log_{2}f_{X^{n}}(X^{n})\right]\quad\text{(bits)} \end{align*}$$
when the integral exists.
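As a quick sanity check of the definition (not from the original note): a Monte Carlo estimate of $E[-\log_{2}f_{X}(X)]$ for a standard Gaussian should match the closed form $\frac{1}{2}\log_{2}(2\pi e\sigma^{2})$. The seed and sample size below are arbitrary illustrative choices.

```python
# Minimal sketch: estimate h(X) = E[-log2 f_X(X)] by Monte Carlo for X ~ N(0, sigma^2)
# and compare against the closed form. Sample size and seed are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, sigma, size=1_000_000)

# Gaussian pdf evaluated at the samples
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_mc = np.mean(-np.log2(f))                           # Monte Carlo estimate of h(X)
h_exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)  # closed form, in bits

print(h_mc, h_exact)   # both ≈ 2.05 bits
```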

Intuition

So the operational meaning of differential entropy: $H([X]_{n})$ is the minimum average number of bits needed to losslessly describe $[X]_{n}$, the uniformly quantized $X$. We then obtain that approximately $h(X)+n$ bits are needed to describe $X$ when uniformly quantizing it with $n$-bit accuracy, i.e.
$$H([X]_{n})\approx n+h(X)$$
for $n$ sufficiently large. So the larger $h(X)$ is, the larger the average number of bits required to describe a uniformly quantized $X$ with $n$-bit accuracy. If $(X,Y)\sim f_{XY}$ on $S_{XY}\subset\mathbb{R}^{2}$, then
$$H([X]_{n},[Y]_{m})\approx m+n+h(X,Y)$$
for $m,n$ sufficiently large.
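A rough empirical illustration of $H([X]_{n})\approx n+h(X)$, assuming a standard Gaussian source and $n=4$ bits of accuracy (both illustrative choices); the histogram-based entropy estimate carries a small bias.

```python
# Sketch: quantize Gaussian samples to n-bit accuracy (bin width 2^-n), estimate the
# entropy of the bin index, and compare with n + h(X). Sample size and n are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)

n = 4                                    # bits of accuracy, bin width Delta = 2^-n
bins = np.floor(x * 2**n).astype(int)    # uniform quantizer index
_, counts = np.unique(bins, return_counts=True)
p = counts / counts.sum()

H_quantized = -np.sum(p * np.log2(p))    # estimate of H([X]_n)
h_X = 0.5 * np.log2(2 * np.pi * np.e)    # h(X) for N(0,1), in bits
print(H_quantized, n + h_X)              # ≈ 6.05 vs ≈ 6.05
```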

1. Conditioning Reduces Entropy: $h(X)\ge h(X|Y)$ with equality if $X\perp\!\!\!\perp Y$.
2. Chain Rule for Differential Entropy: $h(X^{n})=h(X_{1},\ldots,X_{n})=\sum\limits_{i=1}^{n}h(X_{i}|X_{i-1},\ldots,X_{1})$
3. Independence Bound for Differential Entropy: $h(X^{n})\le\sum\limits_{i=1}^{n}h(X_{i})$ with equality $\iff$ all $X_{i}$’s are independent.
4. Invariance of Differential Entropy Under Translation: $h(X+c)=h(X)\ \forall\text{ constant }c\in\mathbb{R}$
5. Differential Entropy Under Scaling: $h(aX)=h(X)+\log|a|\ \ \forall\text{ constant }a\ne 0$ (a small numeric check of properties 4 and 5 follows this list)
6. Joint Differential Entropy Under Linear Maps: Let $\underline X=(X_{1},\ldots,X_{n})^{T}$ be a random column vector with joint pdf $f_{\underline X}=f_{X^{n}}$ and let $\underline Y=A\underline X$ where $A$ is an $n\times n$ invertible matrix (i.e. $\det(A)\ne 0$). Then $h(\underline{Y})=h(Y_{1},\ldots,Y_{n})=h(\underline X)+\log|\det(A)|$
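A small closed-form check of properties 4 and 5 for a Gaussian $X\sim\mathcal{N}(0,\sigma^{2})$, where $h(aX+c)$ is available in closed form; the variance, scale $a$, and shift $c$ below are arbitrary.

```python
# Sketch: verify h(aX + c) = h(X) + log2|a| for a Gaussian via closed forms (in bits).
import numpy as np

def h_gaussian(var):
    """Differential entropy (bits) of a Gaussian with variance var."""
    return 0.5 * np.log2(2 * np.pi * np.e * var)

sigma2, a, c = 2.0, -3.0, 5.0
lhs = h_gaussian(a**2 * sigma2)              # h(aX + c): translation by c has no effect
rhs = h_gaussian(sigma2) + np.log2(abs(a))   # h(X) + log2|a|
print(lhs, rhs)                              # identical
```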

Let $(X,Y)\sim f_{XY}$ with support $S_{XY}\subset\mathbb{R}^{2}$. The conditional differential entropy of $Y$ given $X$ is given by
$$\begin{align*} h(Y|X):&=-\int_{S_{XY}}f_{XY}(x,y)\log_{2}f_{Y|X}(y|x)\,dx\,dy\\ &=E_{XY}\left[-\log_2 f_{Y|X}(Y|X)\right] \end{align*}$$

Similar to the discrete case,
$$\begin{align*} h(X,Y)&=h(X)+h(Y|X)\\ &=h(Y)+h(X|Y) \end{align*}$$
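For a bivariate Gaussian all three terms have closed forms, so the chain rule can be verified directly; the variances and correlation below are illustrative.

```python
# Sketch: check h(X,Y) = h(X) + h(Y|X) for a bivariate Gaussian with correlation rho.
import numpy as np

sx2, sy2, rho = 1.5, 2.0, 0.6
K = np.array([[sx2, rho * np.sqrt(sx2 * sy2)],
              [rho * np.sqrt(sx2 * sy2), sy2]])

h_joint = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_x = 0.5 * np.log2(2 * np.pi * np.e * sx2)
h_y_given_x = 0.5 * np.log2(2 * np.pi * np.e * sy2 * (1 - rho ** 2))
print(h_joint, h_x + h_y_given_x)        # equal
```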

Let $X\sim f$ with finite differential entropy $h(X)$. Then its rate distortion function for MSE distortion is lower bounded as
$$R(D)\ge h(X)-\frac{1}{2}\log(2\pi eD)=R_{SLB}(D)$$
or
$$R(D)\ge R_{G}(D)-D(X\|X_{G})$$
where the divergence measures the “non-Gaussianness” of $X$, and for the distortion rate function
$$D(R)\ge \frac{1}{2\pi e}2^{-2(R-h(X))}$$
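For a Gaussian source the Shannon lower bound is tight, i.e. $R_{SLB}(D)$ coincides with $R_{G}(D)=\frac{1}{2}\log_{2}(\sigma^{2}/D)$; a quick numeric check with illustrative values:

```python
# Sketch: for X ~ N(0, sigma^2), R_SLB(D) = h(X) - (1/2)log2(2*pi*e*D) equals (1/2)log2(sigma^2/D).
import numpy as np

sigma2, D = 4.0, 0.5
h_x = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
R_slb = h_x - 0.5 * np.log2(2 * np.pi * np.e * D)
R_gauss = 0.5 * np.log2(sigma2 / D)
print(R_slb, R_gauss)                    # both = 1.5 bits
```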

For any real-valued $n\times n$ positive definite matrix $K=[k_{ij}]$ we have
$$\det(K)\le\prod_{i=1}^{n}k_{ii}$$
with equality iff $K$ is a diagonal matrix.
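A quick numeric check on a randomly generated positive definite matrix (constructed as $AA^{T}$ plus a small diagonal, purely for illustration):

```python
# Sketch: det(K) <= product of diagonal entries for a random positive definite K.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
K = A @ A.T + 0.1 * np.eye(4)            # positive definite by construction

print(np.linalg.det(K), np.prod(np.diag(K)))   # det(K) is the smaller of the two
```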

Let $\underline X=(X_{1},\ldots,X_{n})^{T}$ be a real-valued random vector with support $S_{X^{n}}=\mathbb{R}^{n}$, mean vector $\underline\mu$ and (invertible) covariance matrix $K_{\underline X}$. Then
$$h(\underline X)=h(X_{1},\ldots,X_{n})\le \frac{1}{2}\log_{2}\left[(2\pi e)^{n}\det(K_{\underline X})\right]$$
with equality iff $\underline X\sim\mathcal{N}(\underline\mu,K_{\underline X})$, i.e. $\underline X$ is Gaussian.
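To illustrate the bound in the scalar case: for unit variance, a Laplacian density has strictly smaller differential entropy than the Gaussian value $\frac{1}{2}\log_{2}(2\pi e)$; both values are closed-form here.

```python
# Sketch: a unit-variance Laplacian has smaller differential entropy than the Gaussian bound.
import numpy as np

var = 1.0
h_gauss_bound = 0.5 * np.log2(2 * np.pi * np.e * var)   # maximum-entropy value for this variance
b = np.sqrt(var / 2)                                     # Laplacian scale giving this variance
h_laplace = np.log2(2 * np.e * b)                        # differential entropy of Laplace(b), bits
print(h_laplace, h_gauss_bound)          # ≈ 1.94 < ≈ 2.05 bits
```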

The following results regarding the maximal possible value of $h(X)$ for a continuous RV $X$ under various constraints can be shown:

1. If $S_{X}=(0,\infty)$ and $E[X]=\mu<\infty$, then $$h(X)\le\log_{2}\frac{e}{\lambda}\quad\text{(bits)}$$ with equality iff $X$ is exponential with parameter $\lambda$, where $\lambda=\frac{1}{\mu}$ (a quick numeric comparison for this case follows the list).
2. If $S_{X}=\mathbb{R}$ and $E[|X|]=\lambda<\infty$, then $$h(X)\le\log_{2}(2e\lambda)\quad\text{(bits)}$$ with equality iff $X$ is Laplacian with mean zero and variance $2\lambda^{2}$.
3. If $S_{X}=(a,b)$, then $$h(X)\le\log_{2}(b-a)\quad\text{(bits)}$$ with equality iff $X$ is uniform over $(a,b)$.
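For case 1, among positive RVs with mean $\mu$ the exponential attains $\log_{2}(e\mu)$ bits; comparing against a uniform on $(0,2\mu)$ (same mean, chosen just for illustration) shows the gap.

```python
# Sketch: exponential vs uniform with the same mean mu, differential entropies in bits.
import numpy as np

mu = 2.0
h_exponential = np.log2(np.e * mu)       # maximum-entropy value for positive support, mean mu
h_uniform = np.log2(2 * mu)              # uniform(0, 2*mu) has the same mean but smaller entropy
print(h_uniform, "<=", h_exponential)    # 2.0 <= ≈ 2.44 bits
```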

In general, if we want to maximize $h(f)$ over all pdfs $f$ defined on support $S\subset\mathbb{R}$ subject to the constraints
$$\int_{S}m_{i}(x)f(x)\,dx=a_{i}, \ i=1,\ldots,l$$
then it can be shown that
$$f^{*}(x)=e^{-\lambda_{0}-\sum\limits_{i=1}^{l}\lambda_{i}m_{i}(x)},\ x\in S$$
where $\lambda_{0},\ldots,\lambda_{l}$ are chosen so that all $l$ constraints are satisfied and $\int_{S}f^{*}(x)\,dx=1$.
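Specializing this recipe to $S=(0,\infty)$ with the single constraint $E[X]=\mu$ recovers the exponential density from case 1 above; the quadrature below (using scipy, with an illustrative $\mu$) checks normalization and the mean constraint for the resulting $f^{*}$.

```python
# Sketch: f*(x) = exp(-lambda0 - lambda1*x) on (0, inf) with lambda1 = 1/mu and
# exp(-lambda0) = lambda1 is the exponential density; verify the constraints numerically.
import numpy as np
from scipy.integrate import quad

mu = 2.0
lam1 = 1.0 / mu
lam0 = -np.log(lam1)
f_star = lambda x: np.exp(-lam0 - lam1 * x)

total, _ = quad(f_star, 0, np.inf)                  # normalization: should be 1
mean, _ = quad(lambda x: x * f_star(x), 0, np.inf)  # mean constraint: should be mu
print(total, mean)
```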
