Lempel-Ziv Coding

🌱

InfoTheory

LZ77 Code

Let $x_{i}^{k}=x_{i}x_{i+1}\dots x_{k}$ and $x^{k}=x_{1}^{k}=x_{1}x_{2}\dots x_{k}$.

- Set the window length to $w\in\mathbb{N}$.
- A string $x^{n}=x_{1}x_{2}\dots x_{n}$ from the finite alphabet $\mathcal{X}$ is to be encoded.
- Assume that $x_{1}\dots x_{i-1}$ has already been encoded.
- Find the largest $k$ such that, for some $j$ in the range $1\le j\le w$, $x_{i-j}^{i-j+k-1}=x_{i}^{i+k-1}$, i.e. the longest match of a string of $k$ not-yet-encoded symbols (the phrase) with a string starting in the window (search buffer) consisting of the last $w$ symbols $x_{i-w}^{i-1}=x_{i-w}x_{i-w+1}\dots x_{i-1}$.
- Represent the phrase $x_{i}^{i+k-1}$ with the pair $(j,k)$, i.e. the location where the match starts in the window and the length of the match.
- If no match is found, send $x_{i}$ uncompressed.
- The encoded phrase is represented by the pointer $(F,j,k)$ or $(F,x_{i})$, where $F=1$ if a match is found and $F=0$ if there is no match.

Think of $j$ as the number of steps you need to go backwards to find the beginning of the matching phrase and $k$ as the length of the phrase.
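
The steps above can be sketched in Python as follows. This is a minimal brute-force illustration, not a reference implementation: the function names `lz77_encode` and `lz77_decode` are hypothetical, indices are 0-based rather than 1-based, and ties between equally long matches are broken in favour of the smallest $j$ (an assumption, since the note does not specify a tie-breaking rule). The match is allowed to run past position $i$, as the condition $x_{i-j}^{i-j+k-1}=x_{i}^{i+k-1}$ permits.

```python
def lz77_encode(x, w):
    """Encode string x with window length w.

    Returns a list of tokens: (1, j, k) for a match, (0, symbol) otherwise.
    """
    tokens = []
    i, n = 0, len(x)
    while i < n:
        best_j, best_k = 0, 0
        # Try every backward offset j that stays inside the window.
        for j in range(1, min(w, i) + 1):
            k = 0
            # Extend the match; it may overlap the not-yet-encoded part.
            while i + k < n and x[i - j + k] == x[i + k]:
                k += 1
            if k > best_k:  # strict '>' keeps the smallest j on ties (assumption)
                best_j, best_k = j, k
        if best_k > 0:
            tokens.append((1, best_j, best_k))
            i += best_k
        else:
            tokens.append((0, x[i]))
            i += 1
    return tokens


def lz77_decode(tokens):
    """Invert lz77_encode by copying symbols from j steps back."""
    out = []
    for token in tokens:
        if token[0] == 1:
            _, j, k = token
            for _ in range(k):
                out.append(out[-j])  # symbol-by-symbol copy handles overlap
        else:
            out.append(token[1])
    return "".join(out)


print(lz77_encode("abababab", 4))
# [(0, 'a'), (0, 'b'), (1, 2, 6)]
```

In the printed example the third token goes back $j=2$ steps and copies $k=6$ symbols, which is longer than the two symbols available in the window at that point; the decoder handles this by copying one symbol at a time.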

There is a prefix code $\mathcal{C}:\mathbb{N}\to \{ 0,1 \}^{*}$ with codeword lengths satisfying $|\mathcal{C}(k)|=\log k+2\log \log k+O(1)$
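
One well-known code meeting this length bound is the Elias delta code, whose codeword for $k$ has length $\lfloor\log_{2}k\rfloor+2\lfloor\log_{2}(\lfloor\log_{2}k\rfloor+1)\rfloor+1$. Whether the note has this particular code in mind is an assumption; the sketch below only illustrates that such a prefix code exists. The function names `elias_gamma` and `elias_delta` are chosen for illustration.

```python
def elias_gamma(k):
    """Elias gamma code: floor(log2 k) zeros, then the binary expansion of k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]
    return "0" * (len(bits) - 1) + bits


def elias_delta(k):
    """Elias delta code: gamma code of the bit length of k,
    followed by the bits of k without the leading 1."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]
    return elias_gamma(len(bits)) + bits[1:]


print(elias_delta(10))
# 00100010
```

For $k=10$ the binary expansion `1010` has 4 bits, so the codeword is the gamma code of 4 (`00100`) followed by `010`, giving 8 bits in total.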

Assume $\{ X_{i} \}=\dots,X_{-1},X_{0},X_{1},\dots$ is a stationary and ergodic source with entropy rate $\tilde{H}(\mathcal{X})$. Then the average code rate of the simplified version of the LZ77 algorithm satisfies $\lim_{ n \to \infty } \frac{1}{n}E[\mathscr{l}_{n}(X^{n})]=\lim_{ n \to \infty } \bar{R}_{n}=\tilde{H}(\mathcal{X})$, i.e. it is asymptotically optimal.

LZ78 Code

This method builds an explicit dictionary by incrementally parsing the input sequence into the shortest phrases that have not been seen so far. Each phrase is then represented in terms of a phrase that appeared earlier, as a pair $(i,s)$ where $i$ is the index of the referenced phrase and $s$ is the new symbol appended to it. The index $i=0$ refers to the empty phrase.
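
The incremental parsing can be sketched in Python as below. This is a minimal illustration under stated assumptions: the name `lz78_parse` is hypothetical, phrases are numbered from 1 with index 0 reserved for the empty phrase, and a trailing fragment that is already in the dictionary is emitted as a repeat of an existing phrase (the note does not say how the end of the input is handled).

```python
def lz78_parse(x):
    """Parse x into LZ78 phrases.

    Returns (phrases, pairs), where pairs[m] = (i, s) means that phrase m+1
    equals phrase i followed by the symbol s (phrase 0 is the empty phrase).
    """
    dictionary = {"": 0}  # phrase -> index
    phrases, pairs = [], []
    current = ""
    for s in x:
        if current + s in dictionary:
            # Still a known phrase: keep extending it.
            current += s
        else:
            # Shortest phrase not seen so far: record it and start over.
            pairs.append((dictionary[current], s))
            phrases.append(current + s)
            dictionary[current + s] = len(dictionary)
            current = ""
    if current:
        # Leftover input that repeats a known phrase (assumed handling).
        pairs.append((dictionary[current[:-1]], current[-1]))
        phrases.append(current)
    return phrases, pairs


phrases, pairs = lz78_parse("abbaaacbbaacbaa")
print(phrases)
# ['a', 'b', 'ba', 'aa', 'c', 'bb', 'aac', 'baa']
print(pairs)
# [(0, 'a'), (0, 'b'), (2, 'a'), (1, 'a'), (0, 'c'), (2, 'b'), (4, 'c'), (3, 'a')]
```

Because every phrase is an earlier phrase plus one symbol, the dictionary is closed under taking prefixes, which is why the lookup `dictionary[current[:-1]]` for a leftover fragment always succeeds.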

Example

The string $abbaaacbbaacbaa$ is parsed as $a,b,ba,aa,c,bb,aac,baa$ and represented as $(0,a),(0,b),(2,a),(1,a),(0,c),(2,b),(4,c),(3,a)$.

# Properties

1. Code length: $\mathscr{l}(x^{n})=c(x^{n})(\log c(x^{n})+O(1))$, where $c(x^{n})$ is the number of phrases in the dictionary obtained by parsing $x^{n}$.
2. Alternative statement for the entropy rate: $\lim_{ n \to \infty } \frac{E[c(X^{n})(\log c(X^{n})+O(1))]}{n}=\tilde{H}(\mathcal{X})$

>[!thm] LZ78 is a Universal Code
>The LZ78 algorithm is universal in the class of all stationary and ergodic sources on alphabet $\mathcal{X}$, i.e. $\lim_{ n \to \infty } \frac{E[c(X^{n})(\log c(X^{n})+O(1))]}{n}=\lim_{ n \to \infty } \frac{1}{n}E[\mathscr{l}_{n}(X^{n})]=\tilde{H}(\mathcal{X})$ for any stationary ergodic source $\{ X_{i} \}_{i=1}^{\infty}$ with entropy rate $\tilde{H}(\mathcal{X})$.
