
Differential Entropy

🌱

Definition
InfoTheory

Introduction

So far, we’ve dealt primarily with discrete-time systems with discrete alphabets. Now we move from discrete alphabets to continuous ones.

Let $X$ be a real-valued RV taking values in $\mathcal{X}\subset\mathbb{R}$ with cdf $F_{X}(x)=P(X\le x),\ x\in\mathbb{R}$, and pdf $f_{X}$, where
$$F_{X}(x)=\int_{-\infty}^{x}f_{X}(t)\,dt,\ x\in\mathbb{R}$$
The support of $X$ with pdf $f_{X}$ is
$$S_{X}=\{x\in\mathbb{R}:f_{X}(x)>0\}$$

Some intuition

Recall that for a discrete (finite) alphabet random variable, entropy has an operational meaning: it is the minimal number of code bits per source symbol needed to losslessly describe the source. But for a real-valued RV taking values in a continuum, the number of bits would be infinite, since there are infinitely many source symbols.

The differential entropy of a real-valued RV $X$ with pdf $f_{X}$ and support $S_{X}\subset\mathbb{R}$ is given by
$$h(X):=-\int_{S_{X}}f_{X}(t)\log_{2}f_{X}(t)\,dt=E_{X}\left[-\log_{2}f_{X}(X)\right]$$
when the integral exists. The usual multivariate extension holds, i.e. for real-valued $X^{n}=(X_{1},\ldots,X_{n})$ with pdf $f_{X^{n}}$ and support $S_{X^{n}}\subset\mathbb{R}^{n}$ the joint differential entropy is
$$\begin{align*} h(X^{n})&=-\int_{S_{X^{n}}}f_{X^{n}}(x_{1},\ldots,x_{n})\log_{2}f_{X^{n}}(x_{1},\ldots,x_{n})\,dx_{1}\cdots dx_{n}\\ &=E_{X^{n}}\left[-\log_{2}f_{X^{n}}(X^{n})\right]\quad\text{(bits)} \end{align*}$$
when the integral exists.
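As a quick sanity check of the definition (not from the original note): a Monte Carlo estimate of $E[-\log_{2}f_{X}(X)]$ for a standard Gaussian should match the closed form $\frac{1}{2}\log_{2}(2\pi e\sigma^{2})$. The seed and sample size below are arbitrary illustrative choices.

```python
# Minimal sketch: estimate h(X) = E[-log2 f_X(X)] by Monte Carlo for X ~ N(0, sigma^2)
# and compare against the closed form. Sample size and seed are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, sigma, size=1_000_000)

# Gaussian pdf evaluated at the samples
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_mc = np.mean(-np.log2(f))                           # Monte Carlo estimate of h(X)
h_exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)  # closed form, in bits

print(h_mc, h_exact)   # both ≈ 2.05 bits
```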

Intuition

So the operational meaning of differential entropy: $H([X]_{n})$ is the minimum average number of bits needed to losslessly describe $[X]_{n}$, the uniformly quantized $X$. We then obtain that approximately $h(X)+n$ bits are needed to describe $X$ when uniformly quantizing it with $n$-bit accuracy, i.e.
$$H([X]_{n})\approx n+h(X)$$
for $n$ sufficiently large. So the larger $h(X)$ is, the larger the average number of bits required to describe a uniformly quantized $X$ with $n$-bit accuracy. If $(X,Y)\sim f_{XY}$ on $S_{XY}\subset\mathbb{R}^{2}$, then
$$H([X]_{n},[Y]_{m})\approx m+n+h(X,Y)$$
for $m,n$ sufficiently large.
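A rough empirical illustration of $H([X]_{n})\approx n+h(X)$, assuming a standard Gaussian source and $n=4$ bits of accuracy (both illustrative choices); the histogram-based entropy estimate carries a small bias.

```python
# Sketch: quantize Gaussian samples to n-bit accuracy (bin width 2^-n), estimate the
# entropy of the bin index, and compare with n + h(X). Sample size and n are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)

n = 4                                    # bits of accuracy, bin width Delta = 2^-n
bins = np.floor(x * 2**n).astype(int)    # uniform quantizer index
_, counts = np.unique(bins, return_counts=True)
p = counts / counts.sum()

H_quantized = -np.sum(p * np.log2(p))    # estimate of H([X]_n)
h_X = 0.5 * np.log2(2 * np.pi * np.e)    # h(X) for N(0,1), in bits
print(H_quantized, n + h_X)              # ≈ 6.05 vs ≈ 6.05
```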

1. Conditioning Reduces Entropy: $h(X)\ge h(X|Y)$ with equality if $X\perp\!\!\!\perp Y$.
2. Chain Rule for Differential Entropy: $h(X^{n})=h(X_{1},\ldots,X_{n})=\sum\limits_{i=1}^{n}h(X_{i}|X_{i-1},\ldots,X_{1})$
3. Independence Bound for Differential Entropy: $h(X^{n})\le\sum\limits_{i=1}^{n}h(X_{i})$ with equality $\iff$ all $X_{i}$’s are independent.
4. Invariance of Differential Entropy Under Translation: $h(X+c)=h(X)\ \forall\text{ constant }c\in\mathbb{R}$
5. Differential Entropy Under Scaling: $h(aX)=h(X)+\log|a|\ \ \forall\text{ constant }a\ne 0$ (a small numeric check of properties 4 and 5 follows this list)
6. Joint Differential Entropy Under Linear Maps: Let $\underline X=(X_{1},\ldots,X_{n})^{T}$ be a random column vector with joint pdf $f_{\underline X}=f_{X^{n}}$ and let $\underline Y=A\underline X$ where $A$ is an $n\times n$ invertible matrix (i.e. $\det(A)\ne 0$). Then $h(\underline{Y})=h(Y_{1},\ldots,Y_{n})=h(\underline X)+\log|\det(A)|$
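A small closed-form check of properties 4 and 5 for a Gaussian $X\sim\mathcal{N}(0,\sigma^{2})$, where $h(aX+c)$ is available in closed form; the variance, scale $a$, and shift $c$ below are arbitrary.

```python
# Sketch: verify h(aX + c) = h(X) + log2|a| for a Gaussian via closed forms (in bits).
import numpy as np

def h_gaussian(var):
    """Differential entropy (bits) of a Gaussian with variance var."""
    return 0.5 * np.log2(2 * np.pi * np.e * var)

sigma2, a, c = 2.0, -3.0, 5.0
lhs = h_gaussian(a**2 * sigma2)              # h(aX + c): translation by c has no effect
rhs = h_gaussian(sigma2) + np.log2(abs(a))   # h(X) + log2|a|
print(lhs, rhs)                              # identical
```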

Let $(X,Y)\sim f_{XY}$ with support $S_{XY}\subset\mathbb{R}^{2}$. The conditional differential entropy of $Y$ given $X$ is given by
$$\begin{align*} h(Y|X):&=-\int_{S_{XY}}f_{XY}(x,y)\log_{2}f_{Y|X}(y|x)\,dx\,dy\\ &=E_{XY}\left[-\log_2 f_{Y|X}(Y|X)\right] \end{align*}$$

Similar to the discrete case,
$$\begin{align*} h(X,Y)&=h(X)+h(Y|X)\\ &=h(Y)+h(X|Y) \end{align*}$$
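For a bivariate Gaussian all three terms have closed forms, so the chain rule can be verified directly; the variances and correlation below are illustrative.

```python
# Sketch: check h(X,Y) = h(X) + h(Y|X) for a bivariate Gaussian with correlation rho.
import numpy as np

sx2, sy2, rho = 1.5, 2.0, 0.6
K = np.array([[sx2, rho * np.sqrt(sx2 * sy2)],
              [rho * np.sqrt(sx2 * sy2), sy2]])

h_joint = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_x = 0.5 * np.log2(2 * np.pi * np.e * sx2)
h_y_given_x = 0.5 * np.log2(2 * np.pi * np.e * sy2 * (1 - rho ** 2))
print(h_joint, h_x + h_y_given_x)        # equal
```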

Let $X\sim f$ with finite differential entropy $h(X)$. Then its rate distortion function for MSE distortion is lower bounded as
$$R(D)\ge h(X)-\frac{1}{2}\log(2\pi eD)=R_{SLB}(D)$$
or
$$R(D)\ge R_{G}(D)-D(X\|X_{G})$$
where the divergence measures the “non-Gaussianness” of $X$, and for the distortion rate function
$$D(R)\ge \frac{1}{2\pi e}2^{-2(R-h(X))}$$
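For a Gaussian source the Shannon lower bound is tight, i.e. $R_{SLB}(D)$ coincides with $R_{G}(D)=\frac{1}{2}\log_{2}(\sigma^{2}/D)$; a quick numeric check with illustrative values:

```python
# Sketch: for X ~ N(0, sigma^2), R_SLB(D) = h(X) - (1/2)log2(2*pi*e*D) equals (1/2)log2(sigma^2/D).
import numpy as np

sigma2, D = 4.0, 0.5
h_x = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
R_slb = h_x - 0.5 * np.log2(2 * np.pi * np.e * D)
R_gauss = 0.5 * np.log2(sigma2 / D)
print(R_slb, R_gauss)                    # both = 1.5 bits
```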

For any real-valued $n\times n$ positive definite matrix $K=[k_{ij}]$ we have
$$\det(K)\le\prod_{i=1}^{n}k_{ii}$$
with equality iff $K$ is a diagonal matrix.
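A quick numeric check on a randomly generated positive definite matrix (constructed as $AA^{T}$ plus a small diagonal, purely for illustration):

```python
# Sketch: det(K) <= product of diagonal entries for a random positive definite K.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
K = A @ A.T + 0.1 * np.eye(4)            # positive definite by construction

print(np.linalg.det(K), np.prod(np.diag(K)))   # det(K) is the smaller of the two
```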

Let $\underline X=(X_{1},\ldots,X_{n})^{T}$ be a real-valued random vector with support $S_{X^{n}}=\mathbb{R}^{n}$, mean vector $\underline\mu$ and (invertible) covariance matrix $K_{\underline X}$. Then
$$h(\underline X)=h(X_{1},\ldots,X_{n})\le \frac{1}{2}\log_{2}\left[(2\pi e)^{n}\det(K_{\underline X})\right]$$
with equality iff $\underline X\sim\mathcal{N}(\underline\mu,K_{\underline X})$, i.e. $\underline X$ is Gaussian.
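To illustrate the bound in the scalar case: for unit variance, a Laplacian density has strictly smaller differential entropy than the Gaussian value $\frac{1}{2}\log_{2}(2\pi e)$; both values are closed-form here.

```python
# Sketch: a unit-variance Laplacian has smaller differential entropy than the Gaussian bound.
import numpy as np

var = 1.0
h_gauss_bound = 0.5 * np.log2(2 * np.pi * np.e * var)   # maximum-entropy value for this variance
b = np.sqrt(var / 2)                                     # Laplacian scale giving this variance
h_laplace = np.log2(2 * np.e * b)                        # differential entropy of Laplace(b), bits
print(h_laplace, h_gauss_bound)          # ≈ 1.94 < ≈ 2.05 bits
```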

The following results regarding the maximal possible value of $h(X)$ for a continuous RV $X$ under various constraints can be shown:

1. If $S_{X}=(0,\infty)$ and $E[X]=\mu<\infty$, then $$h(X)\le\log_{2}\frac{e}{\lambda}\quad\text{(bits)}$$ with equality iff $X$ is exponential with parameter $\lambda$, where $\lambda=\frac{1}{\mu}$ (a quick numeric comparison for this case follows the list).
2. If $S_{X}=\mathbb{R}$ and $E[|X|]=\lambda<\infty$, then $$h(X)\le\log_{2}(2e\lambda)\quad\text{(bits)}$$ with equality iff $X$ is Laplacian with mean zero and variance $2\lambda^{2}$.
3. If $S_{X}=(a,b)$, then $$h(X)\le\log_{2}(b-a)\quad\text{(bits)}$$ with equality iff $X$ is uniform over $(a,b)$.
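For case 1, among positive RVs with mean $\mu$ the exponential attains $\log_{2}(e\mu)$ bits; comparing against a uniform on $(0,2\mu)$ (same mean, chosen just for illustration) shows the gap.

```python
# Sketch: exponential vs uniform with the same mean mu, differential entropies in bits.
import numpy as np

mu = 2.0
h_exponential = np.log2(np.e * mu)       # maximum-entropy value for positive support, mean mu
h_uniform = np.log2(2 * mu)              # uniform(0, 2*mu) has the same mean but smaller entropy
print(h_uniform, "<=", h_exponential)    # 2.0 <= ≈ 2.44 bits
```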

In general, if we want to maximize $h(f)$ over all pdfs $f$ defined on support $S\subset\mathbb{R}$ subject to the constraints
$$\int_{S}m_{i}(x)f(x)\,dx=a_{i}, \ i=1,\ldots,l$$
then it can be shown that
$$f^{*}(x)=e^{-\lambda_{0}-\sum\limits_{i=1}^{l}\lambda_{i}m_{i}(x)},\ x\in S$$
where $\lambda_{0},\ldots,\lambda_{l}$ are chosen so that all $l$ constraints are satisfied and $\int_{S}f^{*}(x)\,dx=1$.
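Specializing this recipe to $S=(0,\infty)$ with the single constraint $E[X]=\mu$ recovers the exponential density from case 1 above; the quadrature below (using scipy, with an illustrative $\mu$) checks normalization and the mean constraint for the resulting $f^{*}$.

```python
# Sketch: f*(x) = exp(-lambda0 - lambda1*x) on (0, inf) with lambda1 = 1/mu and
# exp(-lambda0) = lambda1 is the exponential density; verify the constraints numerically.
import numpy as np
from scipy.integrate import quad

mu = 2.0
lam1 = 1.0 / mu
lam0 = -np.log(lam1)
f_star = lambda x: np.exp(-lam0 - lam1 * x)

total, _ = quad(f_star, 0, np.inf)                  # normalization: should be 1
mean, _ = quad(lambda x: x * f_star(x), 0, np.inf)  # mean constraint: should be mu
print(total, mean)
```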
