Information Theory - Chapter X Lempel-Ziv coding


This chapter discusses LZ78 and its asymptotic optimality.

LZ78: Scan along the sequence; whenever the string observed since the last reset is not yet in the dictionary, add it to the dictionary and encode it as $(\text{pointer to the prefix}, \text{last symbol})$. An example is shown below:

For the $0$-$1$ string $(0000001101010000011010)$, first parse the string into phrases: $(0,00,000,1,10,101,0000,01,1010)$. Then encode each phrase in two parts: the first is the index of its prefix (set to $0$ if there is no prefix) and the second is its last digit. The encoding result is shown below (see also the code sketch after the table):

| phrase | index of prefix | last digit |
|--------|-----------------|------------|
| 0      | 0000            | 0          |
| 00     | 0001            | 0          |
| 000    | 0010            | 0          |
| 1      | 0000            | 1          |
| 10     | 0100            | 0          |
| 101    | 0101            | 1          |
| 0000   | 0011            | 0          |
| 01     | 0001            | 1          |
| 1010   | 0110            | 0          |
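
To make the procedure concrete, here is a minimal Python sketch (my addition, not part of the original notes) that reproduces the table above; the 4-bit pointer width matches the example:

```python
def lz78_encode(s, ptr_bits=4):
    """Parse s into distinct phrases and emit (prefix index, last symbol) pairs.
    Any incomplete final phrase is ignored in this sketch."""
    dictionary = {}          # phrase -> index (1-based; 0 means "no prefix")
    phrase = ""
    codes = []
    for symbol in s:
        phrase += symbol
        if phrase not in dictionary:
            dictionary[phrase] = len(dictionary) + 1
            prefix_index = dictionary.get(phrase[:-1], 0)
            codes.append((format(prefix_index, f"0{ptr_bits}b"), symbol))
            phrase = ""      # reset and start a new phrase
    return codes

print(lz78_encode("0000001101010000011010"))
# [('0000', '0'), ('0001', '0'), ('0010', '0'), ('0000', '1'), ...]
```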

Now we consider its performance.

The worst case: the number of phrases $C(n)$ in any binary sequence of length $n$ satisfies

\[C(n)\le \frac{n}{\log n}(1+o(1)).\]

Proof:

First, denote by $n_k$ the sum of the lengths of all distinct binary strings of length $\le k$. So

\[n_k=\sum_{j=1}^kj2^j=(k-1)2^{k+1}+2.\]

Since the phrases are distinct, the phrase count for a sequence of length $n_k$ is largest when the phrases are exactly all strings of length $\le k$, so

\[\begin{aligned} C(n_k)\le \sum_{j=1}^k2^j = 2^{k+1}-2&=\frac{(k-1)2^{k+1}-2(k-1)}{k-1} \\ &<\frac{(k-1)2^{k+1}+2}{k-1} = \frac{n_k}{k-1}. \end{aligned}\]
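
A quick numerical sanity check of the closed form and this bound (an added sketch, not part of the proof):

```python
# Check n_k = sum_{j<=k} j*2^j = (k-1)*2^(k+1) + 2,
# and C(n_k) = sum_{j<=k} 2^j < n_k/(k-1).
for k in range(2, 12):
    n_k = sum(j * 2**j for j in range(1, k + 1))
    assert n_k == (k - 1) * 2**(k + 1) + 2
    c_nk = sum(2**j for j in range(1, k + 1))   # phrases = all strings of length <= k
    assert c_nk < n_k / (k - 1)
print("closed form and C(n_k) < n_k/(k-1) verified for k = 2..11")
```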

Now let $n_k\le n<n_{k+1}$; more precisely, write $n=n_k+m$ with $0\le m<n_{k+1}-n_k$. The extra $m$ symbols are parsed into phrases of length at least $k+1$, contributing at most $\frac{m}{k+1}$ additional phrases, so

\[\begin{aligned} C(n)&\le \frac{n_k}{k-1} + \frac{m}{k+1}\\ &\le\frac{n_k}{k-1} + \frac{m}{k-1}\\ &=\frac{n}{k-1}. \end{aligned}\]

To bound $k$, note that

\[\begin{aligned} &n\ge n_k\ge 2^k \Rightarrow k\le \log n, \\ &n< n_{k+1}=k2^{k+2}+2\le (k+2)2^{k+2} \le (\log n +2)2^{k+2} \end{aligned}\]

\[\Rightarrow k+2\ge \log\frac{n}{\log n+2}\]

\[\begin{aligned} \Rightarrow k-1&\ge \log n - \log(\log n +2) -3 \\ &\ge (1-\epsilon_n)\log n, \end{aligned}\]

where

\[\epsilon_n=\min\left\{1, \frac{\log\log n +4}{\log n}\right\}\rightarrow 0 \quad (\text{as } n\rightarrow \infty).\]

So

\[C(n)\le \frac{n}{k-1}\le \frac{n}{(1-\epsilon_n)\log n}. ∎\]
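
To illustrate the bound, here is an added sketch (with a hypothetical helper `lz78_phrase_count`) that parses random binary strings and compares $C(n)$ with $\frac{n}{(1-\epsilon_n)\log n}$:

```python
import math
import random

def lz78_phrase_count(s):
    """Number of LZ78 phrases of s (any incomplete tail counts as one phrase)."""
    seen, phrase, count = set(), "", 0
    for ch in s:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)

for n in (10**3, 10**4, 10**5):
    s = "".join(random.choice("01") for _ in range(n))
    # eps_n = min{1, (log log n + 4)/log n}; the bound is informative once eps_n < 1
    eps = min(1.0, (math.log2(math.log2(n)) + 4) / math.log2(n))
    bound = n / ((1 - eps) * math.log2(n))
    print(f"n={n:>6}  C(n)={lz78_phrase_count(s):>5}  bound={bound:,.0f}")
```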

For random strings, LZ78 asymptotically achieves the entropy.

The random case: suppose $X$ is a memoryless source with PMF $P_X(i)=P_i$, so $P(x^n)=\prod_{i=1}^nP_X(x_i)=\prod_{i=1}^n P_{x_i}$. Parse $x^n=(x_1, x_2, \cdots, x_n)$ into distinct phrases $(x_1, x_2, \cdots, x_n)=(y_1, y_2, \cdots , y_C)$, so that $P(x^n)=\prod_{i=1}^C P(y_i)$.

Proof:

Let $c_l$ be the number of phrases of length $l$ among $y_i$, $i=1, \cdots, C$.

Before going on, we first prove the following lemma:

Ziv’s Lemma:

\[-\log P(x^n)\ge \sum_lc_l\log c_l\]

Proof of lemma:

\[P(x^n)=\prod_l\prod_{y_i: |y_i|=l}P(y_i)\]

We know that all $y_i$ are distinct, so the probabilities of the phrases of each length sum to at most $1$. Since a product of $c_l$ nonnegative numbers with sum at most $1$ is maximized when all of them equal $\frac{1}{c_l}$,

\[\begin{aligned} &\sum_{y_i:|y_i|=l}P(y_i)\le 1 \\ \Rightarrow& \prod_{y_i:|y_i|=l} P(y_i)\le \left(\frac{1}{c_l}\right)^{c_l} \\ \Rightarrow& \log P(x^n)\le \sum_{l}c_l\log \frac{1}{c_l}. \qquad∎ \end{aligned}\]
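
The inequality is deterministic, so it can be checked on any concrete string. Below is an added sketch using a Bernoulli($p$) string with the arbitrary choice $p=0.3$; the last printed value previews the $-\log P(x^n)\approx nH(P)$ approximation used next:

```python
import math
import random
from collections import Counter

def lz78_phrases(s):
    """Distinct LZ78 phrases of s (hypothetical helper; incomplete tail dropped)."""
    seen, phrase, out = set(), "", []
    for ch in s:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            out.append(phrase)
            phrase = ""
    return out

p = 0.3                                    # arbitrary Bernoulli parameter
x = "".join("1" if random.random() < p else "0" for _ in range(20000))
phrases = lz78_phrases(x)
covered = "".join(phrases)                 # the prefix the lemma is applied to
c = Counter(len(y) for y in phrases)       # c[l] = number of phrases of length l
lhs = -sum(math.log2(p if ch == "1" else 1 - p) for ch in covered)
rhs = sum(cl * math.log2(cl) for cl in c.values())
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(f"-log P(x^n) = {lhs:.0f} >= sum_l c_l log c_l = {rhs:.0f}"
      f"  (n*H(P) = {len(covered) * H:.0f})")
```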

Suppose there are $m_i$ occurrences of symbol $i$ among $x_1, \cdots, x_n$. Then

\[P(x^n)=\prod_{i\in X} P^{m_i}_i.\]

Invoking the LLN, $m_i \approx nP_i$, so

\[P(x^n)\approx \prod_{i\in X} P^{nP_i}_i,\]

\[\log P(x^n)\approx n\sum_{i\in X} P_{i} \log P_i = -nH(P).\]

So with Ziv’s lemma,

\[H(P)\ge \frac{\sum_l c_l\log {c_l}}{n}.\]

So we will have shown that LZ78 compresses to the entropy once we establish that:

\[\sum_{l}c_l\log c_l\approx C\log C.\]

We note that

\[\sum_l c_l=C \Rightarrow \sum_{l}c_l\log c_l=C\log C+C\sum_l\frac{c_l}{C} \log \frac{c_l}{C},\]

where $-\sum_l\frac{c_l}{C}\log\frac{c_l}{C}$ is the entropy of the distribution $Q=\left(q_l=\frac{c_l}{C},\ l=1, 2, \cdots\right).$
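
A quick numerical check of this decomposition (added, with made-up counts $c_l$):

```python
import math

c = {1: 3, 2: 5, 3: 4, 4: 2}                  # arbitrary hypothetical phrase-length counts
C = sum(c.values())
lhs = sum(cl * math.log2(cl) for cl in c.values())
rhs = C * math.log2(C) + C * sum((cl / C) * math.log2(cl / C) for cl in c.values())
print(abs(lhs - rhs) < 1e-9)                  # True: the identity holds
```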

Consider the following maximization problem:

\[\begin{aligned} \max\quad & H(Q)\\ \text{s.t.}\quad & \sum_l{lc_l}=n. \end{aligned}\]

An exact solution can be found (the maximizing $Q$ is essentially a geometric distribution with mean $\frac{n}{C}$), but the crude estimate $H(Q)=O(\log\frac{n}{C})$ suffices:

\[\begin{aligned} nH(P)\ge\sum_{l}c_l\log c_l=C\log C-CH(Q)&\ge C\log C-C\cdot O\left(\log\frac{n}{C}\right)\\ \Rightarrow \frac{C\log C}{n}&\le H(P) + \frac{C}{n}\cdot O\left(\log\frac{n}{C}\right). \end{aligned}\]

Since $C\le \frac{n}{(1-\epsilon_n)\log n}$ by the worst-case bound, the last term vanishes as $n\rightarrow\infty$. ∎

Since $C$ is the number of phrases and each phrase is encoded with about $\log C$ bits for the prefix pointer (plus one bit for the last symbol), the total code length is about $C\log C$ bits. So the rate $\frac{C\log C}{n}$ is asymptotically bounded by $H(P)$, and asymptotic optimality holds.
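
As a closing illustration (my addition), simulating a Bernoulli($p$) source shows $\frac{C\log C}{n}$ drifting toward $H(P)$; LZ78's convergence is slow, so at moderate $n$ the rate still sits noticeably above the entropy:

```python
import math
import random

def lz78_phrase_count(s):
    """Hypothetical helper: number of LZ78 phrases of s."""
    seen, phrase, count = set(), "", 0
    for ch in s:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)

p = 0.1                                    # arbitrary Bernoulli parameter
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
for n in (10**4, 10**5, 10**6):
    x = "".join("1" if random.random() < p else "0" for _ in range(n))
    C = lz78_phrase_count(x)
    print(f"n={n:>7}  C log C / n = {C * math.log2(C) / n:.3f}  H(P) = {H:.3f}")
```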