Differential Entropy

Introduction

So far, we’ve dealt primarily with discrete-time systems with discrete alphabets. We now extend these ideas from discrete to continuous alphabets.

Let $X$ be a real-valued RV taking values in $\mathcal{X}\subset\mathbb{R}$ with cdf $F_X(x)=P(X\le x),\ x\in\mathbb{R}$, and pdf $f_X$, where $$F_X(x)=\int_{-\infty}^{x}f_X(t)\,dt,\ x\in\mathbb{R}.$$ The support of $X$ with pdf $f_X$ is $$S_X=\{x\in\mathbb{R}:f_X(x)>0\}.$$

Some intuition

Recall that for a discrete (finite-alphabet) random variable, entropy has an operational meaning: it is the minimal number of code bits per source symbol needed to describe the source losslessly. For a real-valued RV taking values in a continuum, however, the number of bits required would be infinite, since there are infinitely many source symbols.

Definition (Differential Entropy)

The differential entropy of a real-valued RV $X$ with pdf $f_X$ and support $S_X\subset\mathbb{R}$ is given by $$h(X):=-\int_{S_X}f_X(t)\log_2 f_X(t)\,dt=E_X\left[-\log_2 f_X(X)\right]\quad\text{(bits)}$$ when the integral exists. The usual multivariate extension holds, i.e. for real-valued $X^n=(X_1,\ldots,X_n)$ with pdf $f_{X^n}$ and support $S_{X^n}\subset\mathbb{R}^n$ the joint differential entropy is $$h(X^n)=-\int_{S_{X^n}}f_{X^n}(x_1,\ldots,x_n)\log_2 f_{X^n}(x_1,\ldots,x_n)\,dx_1\cdots dx_n=E_{X^n}\left[-\log_2 f_{X^n}(X^n)\right]\quad\text{(bits)}$$ when the integral exists.
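As a sanity check on the definition, the integral can be approximated numerically. The sketch below (not part of the original notes) evaluates $h(X)$ for a Gaussian pdf by a Riemann sum and compares it against the closed form $\frac{1}{2}\log_2(2\pi e\sigma^2)$; the function name and grid parameters are illustrative choices:

```python
import numpy as np

def differential_entropy_bits(pdf, lo, hi, num=200_000):
    """Approximate h(X) = -∫ f(t) log2 f(t) dt by a Riemann sum over [lo, hi]."""
    t = np.linspace(lo, hi, num)
    f = pdf(t)
    mask = f > 0                      # integrate only over the support S_X
    dt = t[1] - t[0]
    return -np.sum(f[mask] * np.log2(f[mask])) * dt

sigma = 2.0
gauss = lambda t: np.exp(-t**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_numeric = differential_entropy_bits(gauss, -40, 40)
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)  # both ≈ 3.047 bits
```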

Lemma (Uniform Quantization of Real-Valued Source)

Consider a real-valued RV $X$ with support $S_X=[a,b)$ and pdf $f_X$ such that:

  1. $f_X\log_2 f_X$ is Riemann-integrable (i.e. integrable in the sense we’re originally familiar with) and,
  2. $-\int_a^b f_X(t)\log_2 f_X(t)\,dt<\infty$, i.e. the differential entropy over the support is finite.

Let $[X]_n$ be the uniform quantization of $X$ with quantization step size $$\Delta=\frac{b-a}{2^n},$$ i.e. with $n$-bit accuracy.

Then for $n$ sufficiently large, $$H([X]_n)\approx-\int_a^b f_X(t)\log_2 f_X(t)\,dt+n\quad\text{(bits)}$$ i.e. $$\lim_{n\to\infty}\left[H([X]_n)-n\right]=-\int_a^b f_X(t)\log_2 f_X(t)\,dt.$$
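A minimal numerical illustration of the lemma, assuming the triangular pdf $f(t)=2t$ on $[0,1)$ (an arbitrary example, with exact differential entropy $h=\frac{1}{2\ln 2}-1\approx-0.279$ bits): samples are quantized with $n$-bit accuracy and the empirical $H([X]_n)-n$ is compared to $h$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Triangular pdf f(t) = 2t on S_X = [0, 1); h(X) = 1/(2 ln 2) - 1 ≈ -0.279 bits
h_true = 0.5 / np.log(2) - 1

N, n = 1_000_000, 6                       # sample count, quantizer accuracy in bits
x = np.sqrt(rng.random(N))                # inverse-CDF sampling: F(t) = t^2
delta = 1.0 / 2**n                        # step Δ = (b - a) / 2^n on [a, b) = [0, 1)
cells = np.floor(x / delta).astype(int)   # [X]_n: index of each quantization cell

p = np.bincount(cells, minlength=2**n) / N
p = p[p > 0]
H_quantized = -np.sum(p * np.log2(p))     # empirical H([X]_n) in bits

print(H_quantized - n, h_true)  # both ≈ -0.279
```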

Remark

This result also holds for any real-valued RV with:

  1. Possibly unbounded support and,
  2. A well-defined pdf

Intuition

The operational meaning of differential entropy follows from the fact that $H([X]_n)$ is the minimum average number of bits needed to losslessly describe $[X]_n$, the uniformly quantized $X$. We then obtain that approximately $h(X)+n$ bits are needed to describe $X$ when uniformly quantizing it with $n$-bit accuracy, i.e. $$H([X]_n)\approx n+h(X)$$ for $n$ sufficiently large. So the larger $h(X)$ is, the larger the average number of bits required to describe a uniformly quantized $X$ with $n$-bit accuracy. If $(X,Y)\sim f_{XY}$ on $S_{XY}\subset\mathbb{R}^2$, then $$H([X]_n,[Y]_m)\approx m+n+h(X,Y)$$ for $m,n$ sufficiently large.

Proposition (Properties)

  1. Conditioning Reduces Entropy: $h(X)\ge h(X|Y)$, with equality iff $X\perp\!\!\!\perp Y$.
  2. Chain Rule for Differential Entropy: $h(X^n)=h(X_1,\ldots,X_n)=\sum\limits_{i=1}^n h(X_i|X_{i-1},\ldots,X_1)$
  3. Independence Bound for Differential Entropy: $h(X^n)\le\sum\limits_{i=1}^n h(X_i)$, with equality $\iff$ all $X_i$’s are independent.
  4. Invariance of Differential Entropy Under Translation: $h(X+c)=h(X)$ for every constant $c\in\mathbb{R}$.
  5. Differential Entropy Under Scaling: $h(aX)=h(X)+\log_2|a|$ for every constant $a\neq 0$.
  6. Joint Differential Entropy Under Linear Maps: Let $\underline X=(X_1,\ldots,X_n)^T$ be a random column vector with joint pdf $f_{\underline X}=f_{X^n}$ and let $\underline Y=A\underline X$, where $A$ is an $n\times n$ invertible matrix (i.e. $\det(A)\neq 0$). Then $$h(\underline Y)=h(Y_1,\ldots,Y_n)=h(\underline X)+\log_2|\det(A)|.$$
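Properties 4 and 5 can be checked numerically. The sketch below (an illustrative choice, not from the notes) verifies the scaling law for an exponential RV using a simple Riemann-sum entropy estimate:

```python
import numpy as np

def h_bits(pdf, lo, hi, num=400_000):
    # Riemann-sum approximation of h = -∫ f log2 f dt over the support
    t = np.linspace(lo, hi, num)
    f = pdf(t)
    m = f > 0
    return -np.sum(f[m] * np.log2(f[m])) * (t[1] - t[0])

# X ~ Exp(1): f(t) = e^{-t}, t > 0; h(X) = log2(e). Scaling by a gives pdf (1/a)e^{-t/a}.
exp_pdf = lambda t: np.where(t > 0, np.exp(-t), 0.0)
a = 3.0
scaled_pdf = lambda t: np.where(t > 0, np.exp(-t / a) / a, 0.0)

h_X = h_bits(exp_pdf, 0, 60)
h_aX = h_bits(scaled_pdf, 0, 180)
print(h_aX, h_X + np.log2(a))  # scaling law: h(aX) = h(X) + log2|a|
```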

Definition (Conditional differential entropy)

Let $(X,Y)\sim f_{XY}$ with support $S_{XY}\subset\mathbb{R}^2$. The conditional differential entropy of $Y$ given $X$ is given by $$h(Y|X):=-\int_{S_{XY}}f_{XY}(x,y)\log_2 f_{Y|X}(y|x)\,dx\,dy=E_{XY}\left[-\log_2 f_{Y|X}(Y|X)\right].$$

Lemma (Chain Rule)

Similar to the discrete case, $$h(X,Y)=h(X)+h(Y|X)=h(Y)+h(X|Y).$$
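The chain rule can be verified in closed form for a jointly Gaussian pair, using the well-known Gaussian entropy expressions (the covariance values below are arbitrary):

```python
import numpy as np

# Jointly Gaussian (X, Y) with covariance K; closed-form differential entropies
K = np.array([[2.0, 0.8],
              [0.8, 1.5]])
two_pi_e = 2 * np.pi * np.e

h_XY = 0.5 * np.log2(two_pi_e**2 * np.linalg.det(K))   # joint entropy
h_X = 0.5 * np.log2(two_pi_e * K[0, 0])                # marginal entropy
# For jointly Gaussian pairs, Var(Y|X) = det(K) / Var(X)
h_Y_given_X = 0.5 * np.log2(two_pi_e * np.linalg.det(K) / K[0, 0])

print(h_XY, h_X + h_Y_given_X)  # equal: chain rule h(X,Y) = h(X) + h(Y|X)
```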

Theorem (Shannon Lower Bound)

Let $X\sim f$ have finite differential entropy $h(X)$. Then its rate distortion function under MSE distortion is lower bounded as $$R(D)\ge h(X)-\frac{1}{2}\log_2(2\pi eD)=:R_{SLB}(D)$$ or equivalently $$R(D)\ge R_G(D)-D(X\|X_G),$$ where the divergence measures the “non-Gaussianness” of $X$; correspondingly, the distortion rate function satisfies $$D(R)\ge\frac{1}{2\pi e}2^{-2(R-h(X))}.$$
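For a Gaussian source the bound is tight: $R_{SLB}(D)=R_G(D)=\frac{1}{2}\log_2(\sigma^2/D)$. A quick numerical check (the values of $\sigma^2$ and $D$ are illustrative):

```python
import numpy as np

# For Gaussian X ~ N(0, σ²), R(D) = ½ log2(σ²/D) and the SLB holds with equality
sigma2, D = 4.0, 0.5
h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma2)      # Gaussian differential entropy
R_SLB = h_X - 0.5 * np.log2(2 * np.pi * np.e * D)   # Shannon lower bound
R_G = 0.5 * np.log2(sigma2 / D)                     # Gaussian rate-distortion function
print(R_SLB, R_G)  # both 1.5 bits: the SLB is achieved by the Gaussian source
```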

Lemma (Hadamard Inequality)

For any real-valued $n\times n$ positive definite matrix $K=[k_{ij}]$ we have $$\det(K)\le\prod_{i=1}^n k_{ii},$$ with equality iff $K$ is a diagonal matrix.
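A quick numerical check of the inequality on a randomly generated positive definite matrix (the construction and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
K = A @ A.T + 5 * np.eye(5)        # symmetric positive definite by construction
lhs = np.linalg.det(K)
rhs = np.prod(np.diag(K))
print(lhs <= rhs)  # True: Hadamard's inequality
# Equality holds for a diagonal matrix:
Kd = np.diag(np.diag(K))
print(np.isclose(np.linalg.det(Kd), np.prod(np.diag(Kd))))  # True
```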

Theorem (Maximal Differential Entropy for Real-Valued Random Vectors)

Let $\underline X=(X_1,\ldots,X_n)^T$ be a real-valued random vector with support $S_{X^n}=\mathbb{R}^n$, mean vector $\underline\mu$ and (invertible) covariance matrix $K_{\underline X}$. Then $$h(\underline X)=h(X_1,\ldots,X_n)\le\frac{1}{2}\log_2\left[(2\pi e)^n\det(K_{\underline X})\right],$$ with equality iff $\underline X\sim\mathcal{N}(\underline\mu,K_{\underline X})$, i.e. $\underline X$ is Gaussian.
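For $n=1$ this says the Gaussian maximizes $h$ among all densities with a given variance. The sketch below compares closed-form entropies at equal variance (the comparison distributions are chosen for illustration):

```python
import numpy as np

sigma2 = 1.0
# Gaussian upper bound: ½ log2(2πe σ²)
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
# Uniform on (a, b) with the same variance: σ² = (b-a)²/12, so h = log2(b-a)
h_unif = 0.5 * np.log2(12 * sigma2)
# Laplacian with the same variance: σ² = 2λ², so h = log2(2eλ)
lam = np.sqrt(sigma2 / 2)
h_lap = np.log2(2 * np.e * lam)
print(h_gauss, h_lap, h_unif)  # ≈ 2.047 > 1.943 > 1.792: Gaussian is maximal
```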

Theorem (Maximal Differential Entropy Under Various Constraints)

The following results on the maximal possible value of $h(X)$ for a continuous RV $X$ under various constraints can be shown:

  1. If $S_X=(0,\infty)$ and $E[X]=\mu<\infty$, then $$h(X)\le\log_2\frac{e}{\lambda}\quad\text{(bits)}$$ with equality iff $X$ is exponential with parameter $\lambda=\frac{1}{\mu}$.
  2. If $S_X=\mathbb{R}$ and $E[|X|]=\lambda<\infty$, then $$h(X)\le\log_2(2e\lambda)\quad\text{(bits)}$$ with equality iff $X$ is Laplacian with mean zero and variance $2\lambda^2$.
  3. If $S_X=(a,b)$, then $$h(X)\le\log_2(b-a)\quad\text{(bits)}$$ with equality iff $X$ is uniform over $(a,b)$.
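As an illustration of case 1, the sketch below compares the exponential's entropy with that of a half-normal distribution matched to the same mean on $(0,\infty)$ (the comparison distribution is an arbitrary choice; $h(|Z|)=h(Z)-1$ bits follows from the symmetry of the Gaussian pdf):

```python
import numpy as np

mu = 1.0  # common mean, support (0, ∞)
# Exponential with λ = 1/μ achieves the maximum: h = log2(e/λ) = log2(e·μ)
h_exp = np.log2(np.e * mu)
# Half-normal |Z|, Z ~ N(0, σ²), matched to the same mean: σ√(2/π) = μ
sigma2 = np.pi / 2 * mu**2
h_half_normal = 0.5 * np.log2(2 * np.pi * np.e * sigma2) - 1  # h(|Z|) = h(Z) - 1
print(h_exp, h_half_normal)  # ≈ 1.443 > 1.373: exponential is maximal
```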

Remark

In general, if we want to maximize $h(f)$ over all pdfs $f$ defined on support $S\subset\mathbb{R}$ subject to the constraints $$\int_S m_i(x)f(x)\,dx=a_i,\ i=1,\ldots,l,$$ then it can be shown that the maximizer has the form $$f^*(x)=e^{-\lambda_0-\sum\limits_{i=1}^l\lambda_i m_i(x)},\ x\in S,$$ where $\lambda_0,\ldots,\lambda_l$ are chosen so that all $l$ constraints are satisfied and $\int_S f^*(x)\,dx=1$.
