Lempel-Ziv Coding

LZ77 Code

Let xik=xixi+1xk,   xk=x1k=x1x2xkx_{i}^{k}=x_{i}x_{i+1}\dots x_{k}, \ \ \ x^{k}=x_{1}^{k}=x_{1}x_{2}\dots x_{k}

  • Set the window length to wNw\in\mathbb{N}.
  • A string xn=x1x2xnx^{n}=x_{1}x_{2}\dots x_{n} from the finite alphabet X\mathcal{X} is to be encoded.
  • Assume that x1xi1x_{1}\dots x_{i-1} has already been encoded.
  • We now find the largest kk such that for some jj in the range 1jw1\le j\le w xijij+k1=xii+k1x_{i-j}^{i-j+k-1}=x_{i}^{i+k-1}i.e. the longest match of a string of kk not-yet-encoded symbols (phrase) with a string starting in the window (search buffer) consisting of the last ww symbols xiwi1=xiwxiw+1xi1x_{i-w}^{i-1}=x_{i-w}x_{i-w+1}\dots x_{i-1}
  • Represent the phrase xii+k1x_{i}^{i+k-1} with the pair (j,k)(j,k), i.e. the location where the match starts in the window and the length of the match
  • If no match is found, send xix_{i} uncompressed.
  • The encoded phrase is represented by the pointer (F,j,k)(F,j,k) or (F,xi)(F,x_{i}) where F=1F=1 if a match is found and F=0F=0 if there is no match. Think of jj as the number of steps you need to go backwards to find the beginning of the matching phrase and kk as the length of the phrase.

Lemma (Existence of LZ77 Codelengths)

There is a prefix code C:N{0,1}\mathcal{C}:\mathbb{N}\to \{ 0,1 \}^{*} with codeword lengths satisfying C(k)=logk+2loglogk+O(1)|\mathcal{C}(k)|=\log k+2\log \log k+O(1)

Theorem (LZ77 is Optimal for Stationary Ergodic Sources)

Assume {Xi}=,X1,X0,X1,\{ X_{i} \}=\dots,X_{-1},X_{0},X_{1},\dots is a Stationary & Ergodic Source having entropy rate H~(X)\tilde{H}(\mathcal{X}). Then the avg code rate of the simplified version of the LZ77 algorithm satisfies limn1nE[ln(Xn)]=limnRˉn=H~(X)\lim_{ n \to \infty } \frac{1}{n}E[\mathscr{l}_{n}(X^{n})]=\lim_{ n \to \infty } \bar{R}_{n}=\tilde{H}(\mathcal{X}) i.e. it is optimal.

LZ78 Code

This method builds an explicit dictionary by incrementally parsing the input sequence into shortest phrases that have not been seen so far. Then we represent each phrase in terms of others that have previously appeared in the form of a pair (i,s)(i,s) where ii is the index of referenced symbol and ss and the new symbol added on to the preexisting one.

Example

Our phrase abbaaacbaacbaaabbaaacbaacbaais parsed as a,b,ba,aa,c,bb,aac,baaa,b,ba,aa,c,bb,aac,baaand we represent it as (0,a),(0,b),(2,a),(1,a),(0,c),(2,b),(4,c),(3,a)(0,a),(0,b),(2,a),(1,a),(0,c),(2,b),(4,c),(3,a)

Properties

  1. Code Length: l(Xn)=c(xn)(logc(xn)+O(1))\mathscr{l}(X^{n})=c(x^{n})(\log c(x^{n})+O(1))where c(xn)c(x^{n}) is the number of phrases in the dictionary obtained by parsing xnx^{n}.
  2. Alternative statement for entropy rate limnE[c(Xn)(logc(Xn)+O(1))]n=H~(X)\lim_{ n \to \infty } \frac{E[c(X^{n})(\log c(X^{n})+O(1))]}{n}=\tilde{H}(\mathcal{X}) >[!thm] LZ78 is a Universal Code >The LZ78 algorithm is universal in the class of all stationary and ergodic sources on alphabet X\mathcal{X} i.e. limnE[c(Xn)(logc(Xn)+O(1))]n=limnE[ln(Xn)]=H~(X)\lim_{ n \to \infty } \frac{E[c(X^{n})(\log c(X^{n})+O(1))]}{n}=\lim_{ n \to \infty } E[\mathscr{l}_{n}(X^{n})]=\tilde{H}(\mathcal{X}) for any stationary ergodic source {Xi}i=1\{ X_{i} \}_{i=1}^{\infty} with entropy rate H~(X)\tilde{H}(\mathcal{X}).

Linked from