FIND ME ON

GitHub

LinkedIn

Lempel-Ziv Coding

🌱

InfoTheory

LZ77 Code

Let xik=xixi+1…xk,Β Β Β xk=x1k=x1x2…xkx_{i}^{k}=x_{i}x_{i+1}\dots x_{k}, \ \ \ x^{k}=x_{1}^{k}=x_{1}x_{2}\dots x_{k} - Set the window length to w∈Nw\in\mathbb{N}. - A string xn=x1x2…xnx^{n}=x_{1}x_{2}\dots x_{n} from the finite alphabet X\mathcal{X} is to be encoded. - Assume that x1…xiβˆ’1x_{1}\dots x_{i-1} has already been encoded. - We now find the largest kk such that for some jj in the range 1≀j≀w1\le j\le w xiβˆ’jiβˆ’j+kβˆ’1=xii+kβˆ’1x_{i-j}^{i-j+k-1}=x_{i}^{i+k-1}i.e.Β the longest match of a string of kk not-yet-encoded symbols (phrase) with a string starting in the window (search buffer) consisting of the last ww symbols xiβˆ’wiβˆ’1=xiβˆ’wxiβˆ’w+1…xiβˆ’1x_{i-w}^{i-1}=x_{i-w}x_{i-w+1}\dots x_{i-1} - Represent the phrase xii+kβˆ’1x_{i}^{i+k-1} with the pair (j,k)(j,k), i.e.Β the location where the match starts in the window and the length of the match - If no match is found, send xix_{i} uncompressed. - The encoded phrase is represented by the pointer (F,j,k)(F,j,k) or (F,xi)(F,x_{i}) where F=1F=1 if a match is found and F=0F=0 if there is no match. Think of jj as the number of steps you need to go backwards to find the beginning of the matching phrase and kk as the length of the phrase.

There is a prefix code C:Nβ†’{0,1}βˆ—\mathcal{C}:\mathbb{N}\to \{ 0,1 \}^{*} with codeword lengths satisfying ∣C(k)∣=log⁑k+2log⁑log⁑k+O(1)|\mathcal{C}(k)|=\log k+2\log \log k+O(1)

Assume {Xi}=…,Xβˆ’1,X0,X1,…\{ X_{i} \}=\dots,X_{-1},X_{0},X_{1},\dots is a Stationary & Ergodic Source having entropy rate H~(X)\tilde{H}(\mathcal{X}). Then the avg code rate of the simplified version of the LZ77 algorithm satisfies lim⁑nβ†’βˆž1nE[ln(Xn)]=lim⁑nβ†’βˆžRΛ‰n=H~(X)\lim_{ n \to \infty } \frac{1}{n}E[\mathscr{l}_{n}(X^{n})]=\lim_{ n \to \infty } \bar{R}_{n}=\tilde{H}(\mathcal{X}) i.e.Β it is optimal.

LZ78 Code

This method builds an explicit dictionary by incrementally parsing the input sequence into shortest phrases that have not been seen so far. Then we represent each phrase in terms of others that have previously appeared in the form of a pair (i,s)(i,s) where ii is the index of referenced symbol and ss and the new symbol added on to the preexisting one.

Example

Our phrase abbaaacbaacbaaabbaaacbaacbaais parsed as a,b,ba,aa,c,bb,aac,baaa,b,ba,aa,c,bb,aac,baaand we represent it as (0,a),(0,b),(2,a),(1,a),(0,c),(2,b),(4,c),(3,a)(0,a),(0,b),(2,a),(1,a),(0,c),(2,b),(4,c),(3,a) # Properties 1. Code Length: l(Xn)=c(xn)(log⁑c(xn)+O(1))\mathscr{l}(X^{n})=c(x^{n})(\log c(x^{n})+O(1))where c(xn)c(x^{n}) is the number of phrases in the dictionary obtained by parsing xnx^{n}. 2. Alternative statement for entropy rate lim⁑nβ†’βˆžE[c(Xn)(log⁑c(Xn)+O(1))]n=H~(X)\lim_{ n \to \infty } \frac{E[c(X^{n})(\log c(X^{n})+O(1))]}{n}=\tilde{H}(\mathcal{X}) >[!thm] LZ78 is a Universal Code >The LZ78 algorithm is universal in the class of all stationary and ergodic sources on alphabet X\mathcal{X} i.e.Β lim⁑nβ†’βˆžE[c(Xn)(log⁑c(Xn)+O(1))]n=lim⁑nβ†’βˆžE[ln(Xn)]=H~(X)\lim_{ n \to \infty } \frac{E[c(X^{n})(\log c(X^{n})+O(1))]}{n}=\lim_{ n \to \infty } E[\mathscr{l}_{n}(X^{n})]=\tilde{H}(\mathcal{X}) for any stationary ergodic source {Xi}i=1∞\{ X_{i} \}_{i=1}^{\infty} with entropy rate H~(X)\tilde{H}(\mathcal{X}).

Linked from