Let $X$ be a discrete random variable with finite alphabet $\mathcal{X}$ and pmf given by
$$p_X(a) := P(X = a), \quad a \in \mathcal{X}$$
The entropy of a discrete random variable $X$ with pmf $p_X$ and alphabet $\mathcal{X}$ is denoted by $H(X)$ and given by:
$$H(X) = -\sum_{a \in \mathcal{X}} p_X(a) \log_2\left(p_X(a)\right)$$

## Intuition

Taking the notion of self-information, we can view entropy as the average information conveyed by a random variable, i.e. the expected information gained from a draw of its distribution. Examining the definition of entropy $H(X)$, we have that
$$H(X) = -\sum_{a \in \mathcal{X}} p_X(a) \log_2\left(p_X(a)\right) = \sum_{a \in \mathcal{X}} p_X(a)\, I\!\left(p_X(a)\right) = E_{p_X}\!\left[I\left(p_X(X)\right)\right] = E_{p_X}\!\left[-\log_2\left(p_X(X)\right)\right]$$
or that $H(X)$ is the expected (statistical average) amount of information one gains about the occurrence of one of its $|\mathcal{X}|$ outcomes.
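As a minimal sketch (the `entropy` helper and the example pmfs are my own, not from the source), this expectation of self-information can be computed directly:

```python
import math

def entropy(pmf):
    """Entropy H(X) in bits for a pmf given as {outcome: probability}.

    Computed as the expectation of the self-information -log2 p_X(a),
    treating 0 * log2(0) as 0 by skipping zero-probability outcomes.
    """
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# A fair coin conveys 1 bit per toss; a biased coin conveys strictly less.
print(entropy({"H": 0.5, "T": 0.5}))   # 1.0
print(entropy({"H": 0.9, "T": 0.1}))   # ~0.469
```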
1. $H(X) \geq 0$, with equality if and only if $X$ is deterministic (i.e. $\exists\, c \in \mathcal{X}$ such that $p_X(c) = 1$)
2. $H_b(X) := -\sum_{a \in \mathcal{X}} p_X(a) \log_b p_X(a) = (\log_b 2)\, H(X)$
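Property 2 says that changing the logarithm base only rescales the entropy. A quick numerical check (sketch only; `entropy_base` is a hypothetical helper), comparing bits ($b = 2$) against nats ($b = e$):

```python
import math

def entropy_base(pmf, b):
    """Entropy of a pmf computed with logarithm base b."""
    return -sum(p * math.log(p, b) for p in pmf.values() if p > 0)

pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
H_bits = entropy_base(pmf, 2)        # H(X)   = 1.5 bits
H_nats = entropy_base(pmf, math.e)   # H_e(X) in nats

# Property 2 with b = e: H_e(X) = (log_e 2) * H(X)
print(H_nats, math.log(2) * H_bits)  # both ~1.0397
```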
If $X \sim p_X$ is a discrete RV (with finite alphabet $\mathcal{X}$), then $H(X) \leq \log_2 |\mathcal{X}|$, with equality if and only if $X$ is uniformly distributed over $\mathcal{X}$
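As a worked check (my own numbers, not from the source), take $\mathcal{X} = \{1, 2, 3, 4\}$, so $\log_2 |\mathcal{X}| = 2$ bits. The non-uniform pmf $p_X = \left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\right)$ gives
$$H(X) = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75 \text{ bits} < 2 \text{ bits},$$
while the uniform pmf $\left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right)$ attains the bound exactly.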
## Intuition
From the perspective of information theory, entropy tells us how efficiently our information can be encoded: any encoding scheme's average bit usage per symbol is lower bounded by the entropy of the source.
You can also think of it as giving a benchmark for the most efficient way to encode information from a given distribution.
Intuitively, entropy $H(X)$ tells us how "random" $X$ is:
1. If $X$ is deterministic, then $H(X) = 0$
	1. This is the minimum value of entropy (randomness) for $X$
2. If $X$ is uniform, then $H(X) = \log_2 |\mathcal{X}|$
	1. This is the maximal value of entropy (randomness) for $X$
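A quick numerical sketch of these two extremes for a binary alphabet (the `binary_entropy` helper is my own): entropy is 0 at the deterministic endpoints $p \in \{0, 1\}$ and peaks at $\log_2 2 = 1$ bit for the uniform (fair-coin) case $p = 0.5$.

```python
import math

def binary_entropy(p):
    """H(X) in bits for X ~ Bernoulli(p)."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  ->  H = {binary_entropy(p):.3f} bits")
# 0 bits at p = 0 or 1 (deterministic), maximal 1 bit at p = 0.5 (uniform)
```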
Given $(X, Y) \sim p_{XY}$, the conditional entropy of $X$ given $Y$ is denoted by $H(X \mid Y)$ and given by
$$H(X \mid Y) = E_{p_{XY}}\!\left[-\log_2\left(p_{X \mid Y}(X \mid Y)\right)\right] = -\sum_{a \in \mathcal{X}} \sum_{b \in \mathcal{Y}} p_{XY}(a, b) \log_2\left(p_{X \mid Y}(a \mid b)\right)$$
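A minimal sketch of the double-sum formula (the joint pmf and helper name here are hypothetical, chosen for illustration):

```python
import math

# Hypothetical joint pmf p_XY(a, b), stored as {(a, b): probability}
p_xy = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

def conditional_entropy(p_xy):
    """H(X|Y) in bits via -sum_{a,b} p_XY(a,b) log2 p_{X|Y}(a|b)."""
    # Marginal p_Y(b), needed for p_{X|Y}(a|b) = p_XY(a,b) / p_Y(b)
    p_y = {}
    for (a, b), p in p_xy.items():
        p_y[b] = p_y.get(b, 0.0) + p
    return -sum(p * math.log2(p / p_y[b])
                for (a, b), p in p_xy.items() if p > 0)

print(conditional_entropy(p_xy))  # ~0.722 bits, whereas H(X) = 1 bit here
```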
$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$$
If $(X, Y) \sim p_{XY}$, then $H(X) \geq H(X \mid Y)$, and if $X \perp\!\!\!\perp Y$ then $H(X) = H(X \mid Y)$
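As a numerical check of both the chain rule above and this inequality (the joint pmfs and helpers are my own illustration, not from the source):

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(p_xy, axis):
    """Marginal pmf of coordinate `axis` (0 for X, 1 for Y) of a joint pmf."""
    out = {}
    for pair, p in p_xy.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

# A correlated joint pmf and an independent one, for comparison
p_corr = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_ind  = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

for p_xy in (p_corr, p_ind):
    h_xy = H(p_xy)                        # joint entropy H(X, Y)
    h_x, h_y = H(marginal(p_xy, 0)), H(marginal(p_xy, 1))
    h_x_given_y = h_xy - h_y              # chain rule: H(X|Y) = H(X,Y) - H(Y)
    print(f"H(X) = {h_x:.3f}, H(X|Y) = {h_x_given_y:.3f}")
# Correlated case: H(X|Y) < H(X).  Independent case: H(X|Y) = H(X).
```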