Information Theory - Chapter 4 Asymptotic Equipartition Property (AEP)
Informal statement of AEP: Let $X_1,\cdots,X_n$ be i.i.d. Then almost all realizations $(x_1, \cdots,x_n)$ have probability close to $2^{-nH(X)}$. The set of all these realizations is called the typical set.
Formal statement:
Theorem 1: Let $X_1,\cdots,X_n$ be i.i.d. and let $H(X)$ be the entropy of their common PMF $P_X$. Then $-\frac{1}{n}\log P_{X_1,\cdots,X_n}(X_1,\cdots,X_n)\rightarrow H(X)$ in probability as $n\rightarrow\infty$.
Proof:
$\log P_{X_1,\cdots,X_n}(X_1,\cdots, X_n)=\log\prod_{i=1}^n P_{X}(X_i)=\sum_{i=1}^n\log P_{X}(X_i)$.
Since the $X_i$ are i.i.d., the random variables $\log P_{X}(X_i)$ are also i.i.d.
By the weak law of large numbers,
$-\frac{1}{n}\sum_{i=1}^n\log P_{X}(X_i)\rightarrow -\mathbb{E}[\log P_{X}(X)]=H(X)$ in probability.
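A quick numerical sanity check of this convergence (a sketch of my own; the four-symbol PMF and the variable names are arbitrary choices, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
pmf = np.array([0.5, 0.25, 0.125, 0.125])     # assumed example distribution P_X
H = -np.sum(pmf * np.log2(pmf))               # H(X) = 1.75 bits

for n in [10, 100, 1000, 10000]:
    x = rng.choice(len(pmf), size=n, p=pmf)   # realizations x_1, ..., x_n
    rate = -np.mean(np.log2(pmf[x]))          # -1/n * sum_i log2 P_X(x_i)
    print(f"n={n:6d}  -1/n log2 p = {rate:.4f}  H(X) = {H:.4f}")
```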
The typical set $A_\epsilon^{(n)}$ (with respect to $P_X$) is defined as the set of sequences $(x_1, \cdots,x_n)\in\mathcal{X}^n$ such that $2^{-n(H(X)+\epsilon)}\le P_{X_1,\cdots,X_n}(x_1,\cdots,x_n)\le 2^{-n(H(X)-\epsilon)}$.
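Membership can be tested directly from this definition. Below is a small sketch (the helper name `is_typical` is my own), using the equivalent condition $\vert-\frac{1}{n}\log_2 P_{X_1,\cdots,X_n}(x_1,\cdots,x_n)-H(X)\vert\le\epsilon$:

```python
import numpy as np

def is_typical(x, pmf, eps):
    """Check whether the sequence x (symbol indices) lies in the typical set A_eps^(n)."""
    H = -np.sum(pmf * np.log2(pmf))               # entropy of the common PMF P_X
    rate = -np.mean(np.log2(pmf[np.asarray(x)]))  # -1/n log2 p(x_1, ..., x_n)
    return abs(rate - H) <= eps                   # equivalent to the two-sided bound

# e.g. is_typical([0, 1, 0, 2, 0], np.array([0.5, 0.25, 0.25]), eps=0.3)  -> True
```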
Theorem (Properties of the Typical Set):
1. If $(x_1,\cdots,x_n)\in A_\epsilon^{(n)}$, then $\vert-\frac{1}{n}\log P_{X_1,\cdots,X_n}(x_1,\cdots,x_n)-H(X)\vert\le \epsilon$.
2. $\forall\epsilon>0$, for $n$ sufficiently large, $Pr(A_\epsilon^{(n)})>1-\epsilon$.
3. $\forall\epsilon>0$, for $n$ sufficiently large, $(1-\epsilon)2^{n(H(X)-\epsilon)}\le \vert A_\epsilon^{(n)}\vert\le 2^{n(H(X)+\epsilon)}$.
Proof:
Property 1 follows directly from the definition of $A_\epsilon^{(n)}$ by taking logarithms and dividing by $-n$.
Property 2: convergence in probability means that $\forall\delta>0$, for large enough $n=n(\delta)$, $Pr(\vert -\frac{1}{n}\log P_{X_1,\cdots,X_n}(X_1,\cdots,X_n)-H(X)\vert\le\epsilon)\ge 1-\delta$. Taking $\delta=\epsilon$ gives $Pr(A_\epsilon^{(n)})\ge 1-\epsilon$.
Property 3, upper bound $\vert A_\epsilon^{(n)}\vert\le 2^{n(H(X)+\epsilon)}$: $1=\sum_{(x_1,\cdots,x_n)\in\mathcal{X}^n}P_{X_1,\cdots,X_n}(x_1,\cdots,x_n)\ge \sum_{(x_1,\cdots,x_n)\in A_\epsilon^{(n)}}P_{X_1,\cdots,X_n}(x_1,\cdots,x_n)\ge 2^{-n(H(X)+\epsilon)}\vert A_\epsilon^{(n)}\vert$.
Property 3, lower bound $(1-\epsilon)2^{n(H(X)-\epsilon)}\le \vert A_\epsilon^{(n)}\vert$:
From property 2, $Pr(A_\epsilon^{(n)})\ge 1-\epsilon$ for $n$ large enough.
$\Rightarrow 1-\epsilon\le Pr(A_\epsilon^{(n)})\le \sum_{(x_1,\cdots,x_n)\in A_\epsilon^{(n)}}2^{-n(H(X)-\epsilon)}=2^{-n(H(X)-\epsilon)}\vert A_\epsilon^{(n)}\vert$, hence $\vert A_\epsilon^{(n)}\vert\ge(1-\epsilon)2^{n(H(X)-\epsilon)}$.
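These properties can also be checked numerically. The sketch below is my own example; a Bernoulli source is assumed so that $Pr(A_\epsilon^{(n)})$ and $\vert A_\epsilon^{(n)}\vert$ can be summed exactly over the number of ones $k$:

```python
from math import comb, log2

p, eps, n = 0.3, 0.1, 200
H = -(p * log2(p) + (1 - p) * log2(1 - p))    # entropy of Bernoulli(p)

prob_A, size_A = 0.0, 0
for k in range(n + 1):                        # k = number of ones in the sequence
    rate = -(k * log2(p) + (n - k) * log2(1 - p)) / n   # -1/n log2 p(x) for such sequences
    if abs(rate - H) <= eps:                  # every sequence with k ones is then typical
        size_A += comb(n, k)
        prob_A += comb(n, k) * p**k * (1 - p)**(n - k)

print(f"Pr(A) = {prob_A:.4f}  (property 2: close to 1)")
print(f"(1/n) log2 |A| = {log2(size_A)/n:.4f}  vs  H(X) = {H:.4f}  (property 3)")
```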
Application of the AEP: lossless data compression. Let $X_1,\cdots,X_n\sim P_X$ be i.i.d.
For compression, we need to encode every sequence $(x_1,\cdots,x_n)\in\mathcal{X}^{n}$.
For $(x_1,\cdots,x_n)\in A_\epsilon^{(n)}\subseteq \mathcal{X}^{n}$, prefix it by $0$; since $\vert A_\epsilon^{(n)}\vert\le 2^{n(H(X)+\epsilon)}$, at most $n(H(X)+\epsilon)+1+1$ bits suffice to encode the sequence (one bit for the prefix and at most $n(H(X)+\epsilon)+1$ bits to index the sequence within $A_\epsilon^{(n)}$).
For $(x_1,\cdots,x_n)\notin A_\epsilon^{(n)}$, prefix it by $1$; at most $n\log\vert\mathcal{X}\vert+1+1$ bits suffice to encode the sequence.
Now compute the expected length of the encoding. Denote the length of the encoding of $(x_1, \cdots,x_n)$ by $l(x_1,\cdots,x_n)$.
\[\begin{aligned} \mathbb{E}[l(X_1,\cdots,X_n)]&\le Pr(A_\epsilon^{(n)})(n(H(X)+\epsilon)+2)+Pr(\bar{A_\epsilon^{(n)}})(n\log\vert\mathcal{X}\vert+2) \\ &\le n(H(X)+\epsilon)+\epsilon n\log\vert\mathcal{X}\vert+2 \\ &=n(H(X)+\epsilon^\prime), \text{ where } \epsilon^\prime=\epsilon+\epsilon\log\vert\mathcal{X}\vert+\frac{2}{n} \end{aligned}\]The second inequality uses $Pr(A_\epsilon^{(n)})\le 1$, $Pr(\bar{A_\epsilon^{(n)}})\le\epsilon$, and $Pr(A_\epsilon^{(n)})+Pr(\bar{A_\epsilon^{(n)}})=1$. Thus, on average, about $n(H(X)+\epsilon^\prime)$ bits suffice, i.e., we have found a short description of $X_1,\cdots,X_n$.
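For a tiny alphabet the scheme can be prototyped by brute force, enumerating $A_\epsilon^{(n)}$ directly. The sketch below is my own illustration (the three-symbol PMF, $n=12$, and $\epsilon=0.25$ are arbitrary): typical sequences get a $0$ flag plus an index into $A_\epsilon^{(n)}$, all other sequences get a $1$ flag plus the raw description, and the expected length per symbol is estimated by sampling.

```python
import math
from itertools import product
import numpy as np

rng = np.random.default_rng(1)
pmf = np.array([0.7, 0.2, 0.1])               # assumed example P_X
H = -np.sum(pmf * np.log2(pmf))
n, eps = 12, 0.25

def rate(seq):                                # -1/n log2 p(x_1, ..., x_n)
    return -sum(math.log2(pmf[s]) for s in seq) / n

# Enumerate A_eps^(n) (feasible only for tiny n) and assign each member an index.
typical = [seq for seq in product(range(len(pmf)), repeat=n) if abs(rate(seq) - H) <= eps]
index = {seq: i for i, seq in enumerate(typical)}
idx_bits = math.ceil(math.log2(len(typical))) # <= n(H(X)+eps) + 1
sym_bits = math.ceil(math.log2(len(pmf)))     # bits per symbol for the raw code

def encode(seq):
    if seq in index:                          # flag '0' + index within A_eps^(n)
        return "0" + format(index[seq], f"0{idx_bits}b")
    return "1" + "".join(format(s, f"0{sym_bits}b") for s in seq)   # flag '1' + raw bits

samples = [tuple(map(int, rng.choice(len(pmf), size=n, p=pmf))) for _ in range(2000)]
avg = np.mean([len(encode(s)) for s in samples]) / n
print(f"average bits per symbol = {avg:.3f},  H(X) = {H:.3f}")
```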
Denote by $B_\delta^{(n)}$ the smallest set such that $Pr(B_\delta^{(n)})\ge 1-\delta$. How much smaller than $A_\epsilon^{(n)}$ can it be?
Since $Pr(A_\epsilon^{(n)})\ge 1-\epsilon$ for $n$ large enough and $Pr(B_\delta^{(n)})\ge 1-\delta$, the union bound on the complements gives $Pr(A_\epsilon^{(n)}\cap B_\delta^{(n)})\ge 1-\delta-\epsilon$, so
\[\begin{aligned} 1-\delta-\epsilon\le Pr(A_\epsilon^{(n)}\cap B_\delta^{(n)})&=\sum_{(x_1, \cdots,x_n)\in A_\epsilon^{(n)}\cap B_\delta^{(n)}}Pr(x_1,\cdots,x_n) \\ &\le \sum_{(x_1, \cdots,x_n)\in A_\epsilon^{(n)}\cap B_\delta^{(n)}}2^{-n(H(X)-\epsilon)} \\ &\le \vert B_\delta^{(n)}\vert 2^{-n(H(X)-\epsilon)} \end{aligned}\] \[\Rightarrow \vert B_\delta^{(n)}\vert\ge(1-\delta-\epsilon)2^{n(H(X)-\epsilon)}\\ \Rightarrow \frac{1}{n}\log\frac{\vert B_\delta^{(n)}\vert}{\vert A_\epsilon^{(n)}\vert}\rightarrow 0\]
That is, to first order in the exponent, the smallest high-probability set $B_\delta^{(n)}$ has about as many elements as the typical set $A_\epsilon^{(n)}$.
Example 1: For $X_i\sim\text{Bernoulli}(0.9)$, the most probable sequence is $(1,1,\cdots,1)$, which is in $B_\delta^{(n)}$ but not in $A_\epsilon^{(n)}$ (for small $\epsilon$).
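A quick check of Example 1 (my own numbers): the all-ones sequence has per-symbol information $-\log_2 0.9\approx 0.152$ bits, far from $H(0.9)\approx 0.469$ bits, so it fails the typicality condition for any small $\epsilon$.

```python
from math import log2

p, eps = 0.9, 0.05
H = -(p * log2(p) + (1 - p) * log2(1 - p))   # H(X) of Bernoulli(0.9) ~ 0.469 bits
rate_all_ones = -log2(p)                     # -1/n log2 p(1, ..., 1) = -log2(0.9) ~ 0.152
print(f"H(X) = {H:.3f}, all-ones rate = {rate_all_ones:.3f}, "
      f"typical: {abs(rate_all_ones - H) <= eps}")
```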
Types: let $\mathcal{X}$ be a finite set with $\vert\mathcal{X}\vert=q$, $\mathcal{X}=\{a_1, \cdots, a_q\}$. Denote a sequence $(x_1, \cdots,x_n)\in\mathcal{X}^n$ by $x$. The empirical distribution (type) of $x$ is $P_x=(P_x(a_1), \cdots, P_x(a_q))$, where
\[\begin{aligned} P_x(a)=\frac{\vert \{ i:x_i=a\} \vert}{n}=\frac{n_a}{n},\quad a\in\mathcal{X} \end{aligned}\]
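A minimal sketch (the helper name `empirical_type` is my own) computing the type of a sequence over a finite alphabet:

```python
from collections import Counter

def empirical_type(x, alphabet):
    """Return the empirical distribution (P_x(a_1), ..., P_x(a_q)) of the sequence x."""
    counts = Counter(x)                           # n_a = |{i : x_i = a}| for each symbol a
    n = len(x)
    return tuple(counts[a] / n for a in alphabet)

# Example: the sequence 'abacab' over {a, b, c} has type (3/6, 2/6, 1/6).
print(empirical_type("abacab", "abc"))
```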