我需要在习题课用英语向基础一般的本科生讲解这份作业的全部解题思路，请完成解题用中文撰写讲解要点，并撰...

Question

我需要在习题课用英语向基础一般的本科生讲解这份作业的全部解题思路，请完成解题用中文撰写讲解要点，并撰写英文LaTeX讲稿，确保通俗易懂，深入浅出，对每一道题给出清晰的明白的解题思路，并给一些高价值的takeaway，学生下课后仍能记得关键思想和insight

Answer 1

先直接给你两部分内容：

中文讲解要点（面向你自己备课用）
英文 LaTeX 讲稿（可直接放进 .tex 用作习题课讲稿）

作业题目来自 Homework 4。(HW_4.pdf)

一、中文讲解要点（备课提纲）

下面把每一题、每一小问都按“思路 + 运算骨架 + 课堂可以强调的 takeaway”来整理。你可以根据时间决定推导细节讲多少。

题 1 Reverse I-projection

总体概念提醒

KL 散度： $D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$ 。
I-projection（正向投影）：在一堆分布集合 $\mathcal{P}$ 里找一个 $p^\star$ ，使得 $D(p^\star\|q)$ 最小。
Reverse I-projection（反向投影）：这次是固定 $p$ ，在集合 $\mathcal{Q}$ 里找 $q^\star$ ，使得 $D(p\|q^\star)$ 最小。
这道题的核心是：KL 散度 + 凸/对数凸集合 → 有“勾股定理”式的几何结构。

1(a) 逆 I-投影的勾股定理

题目要证：
若 $\mathcal{Q}$ 非空、闭、log-convex，且

q^\star = \arg\min_{q\in\mathcal{Q}} D(p\|q),

则对所有 $q\in\mathcal{Q}$ ，

D(p\|q) \ge D(p\|q^\star) + D(q^\star\|q).

关键概念：log-convex 集合

给定 $r,s\in\mathcal{Q}$ 且不互相奇异，定义“对数凸组合” $q_\lambda(x) = \frac{r(x)^\lambda s(x)^{1-\lambda}} {\sum_z r(z)^\lambda s(z)^{1-\lambda}},\quad 0\le\lambda\le1.$
若 $\mathcal{Q}$ 是 log-convex，说明 沿着“log 空间直线”从一个分布走到另一个分布，路径上的所有点也在 $\mathcal{Q}$ 内。

证明思路（课堂版）：

在 $\mathcal{Q}$ 里挑一条“log 直线”
固定任意 $q\in\mathcal{Q}$ ，和 $q^\star$ 一起，构造
$q_\lambda(x) = \frac{q(x)^\lambda q^\star(x)^{1-\lambda}} {\sum_z q(z)^\lambda q^\star(z)^{1-\lambda}},$
其中 $q_0 = q^\star, q_1 = q$ ，且所有 $q_\lambda \in \mathcal{Q}$ 。
看 KL 散度沿这条曲线如何变化
定义
$f(\lambda) = D(p\|q_\lambda).$
因为 $q^\star$ 是 $\mathcal{Q}$ 上的最优点，对这条路径也必须是最小点，所以
$f(\lambda) \ge f(0),\ \forall \lambda\in[0,1] \quad\Rightarrow\quad f'(0^+) \ge 0.$
算 $f'(0)$ （可以在黑板上做一次）：
- 写出 $\log q_\lambda(x) = \lambda\log q(x) + (1-\lambda)\log q^\star(x) - \log Z(\lambda),$ 其中 $Z(\lambda)$ 是归一化因子。
- 代入 $f(\lambda) = \sum_x p(x)\big(\log p(x)-\log q_\lambda(x)\big),$ 用链式法则求导，在 $\lambda=0$ 处整理，会得到 $f'(0) = D(p\|q) - D(p\|q^\star) - D(q^\star\|q).$
利用最小点的导数条件
因为 $f'(0^+)\ge 0$ ，所以
$D(p\|q) - D(p\|q^\star) - D(q^\star\|q) \ge 0 \quad\Rightarrow\quad D(p\|q)\ge D(p\|q^\star)+D(q^\star\|q).$

课堂 takeaway：

KL 散度在“log-凸路径”上具有类似欧式空间中的勾股性质。
从几何上看：从 $p$ 到集合 $\mathcal{Q}$ 的“最短点” $q^\star$ ，把“距离”分成了两段：从 $p$ 到 $q^\star$ ，再从 $q^\star$ 到 $q$ 。

1(b)（练习）逆 I-投影、指数族与线性族

你可以把这题讲成“指数族–线性族的双对偶关系”的一条具体体现。

设定回顾：

给定基准分布 $q^\star$ 和充分统计量 $t(x)$ 。
指数族 $\mathcal{E}_t(q^\star) = \big\{ q_\theta(x)=\exp(\theta^\top t(x)-\psi(\theta))q^\star(x) \big\}.$
线性族 $\mathcal{L}_t(q^\star) = \big\{ p: \mathbb{E}_p[t(X)] = \mathbb{E}_{q^\star}[t(X)] \big\}.$

要证：
$q^\star$ 是 $p$ 在指数族 $\mathcal{E}_t(q^\star)$ 上的 reverse I-projection
$\iff$ $p \in \mathcal{L}_t(q^\star)$ 。

思路：看 $D(p\|q_\theta)$ 对 $\theta$ 的导数。

写出 $D(p\|q_\theta) = \sum_x p(x)\log\frac{p(x)}{q_\theta(x)} = \text{const} + \psi(\theta) - \theta^\top \mathbb{E}_p[t(X)].$
求导： $\nabla_\theta D(p\|q_\theta) = \nabla\psi(\theta) - \mathbb{E}_p[t(X)] = \mathbb{E}_{q_\theta}[t(X)] - \mathbb{E}_p[t(X)].$
若 $q^\star = q_{\theta=0}$ 是 reverse I-projection，就必须使得
$\theta=0$ 处梯度为零： $\mathbb{E}_{q^\star}[t(X)] = \mathbb{E}_p[t(X)],$ 即 $p \in \mathcal{L}_t(q^\star)$ 。
反过来，若 $p\in\mathcal{L}_t(q^\star)$ ，则 $\theta=0$ 处导数为 0，
又因为 $D(p\|q_\theta)$ 对 $\theta$ 是凸的， $\theta=0$ 给出全局最小值，
所以 $q^\star$ 就是 reverse I-projection。

课堂 takeaway：

“指数族 + KL 最小化” ⇔ “线性约束（矩约束）”。
这就是最大熵 / 最大似然和指数族之间关系的本质。

1(c) 逆 I-投影的另一种刻画

题意：
给定分布 $p$ 和非空、闭、凸的集合 $\mathcal{Q}$ 。
证明： $q^\star$ 是 $p$ 在 $\mathcal{Q}$ 上的 reverse I-projection，当且仅当

D(p'\|q') + D(p'\|p) \ge D(p'\|q^\star), \quad \forall p',\ \forall q'\in\mathcal{Q}.

可以把它理解成： $q^\star$ 是那个让“所有点” $p'$ 相比其它 $q'$ 都不会更吃亏的点（多了一个补偿项 $D(p'\|p)$ ）。

“如果”方向（⇐）：比较简单，可以课堂上详细写。

假设不等式对所有 $p',q'$ 成立。
取 $p'=p$ ，得到 $D(p\|q') + D(p\|p) \ge D(p\|q^\star) \Rightarrow D(p\|q') \ge D(p\|q^\star),\ \forall q'\in\mathcal{Q},$ 因为 $D(p\|p)=0$ 。
这就说明 $q^\star$ 确实是 $\arg\min_{q\in\mathcal{Q}} D(p\|q)$ 。

“只有如果”方向（⇒）：思路型讲解即可

假设 $q^\star$ 是 reverse I-projection。由于 $\mathcal{Q}$ 是凸集，
可把 KL 看成一个 Bregman 散度，对凸集的“Bregman 投影”有一个
很类似的勾股定理和“法向条件”（KKT 条件）。
固定任意 $p'$ 和 $q'\in\mathcal{Q}$ ，构造连接 $p$ 与 $p'$ 的直线 $p_\lambda = (1-\lambda)p' + \lambda p,\quad 0\le\lambda\le1,$ 然后考虑函数 $F(\lambda) = D(p_\lambda\|q') - D(p_\lambda\|q^\star).$
通过 KL 的显式形式展开并利用 $q^\star$ 的最优性（类似 1(a) 的方法），
可以证明 $F(1) \ge 0$ ，这就等价于题目中的不等式。
（具体代数展开比较长，习题课可以只画出推导路线。）

课堂 takeaway：

KL 散度作为 Bregman 散度时，对“凸集的投影”有一类统一的勾股式刻画。
这个不等式是一个“三点不等式”：涉及 $p',p,q^\star,q'$ 四个对象，
但最终是用“所有的 $p'$ ”来刻画一个“最佳 $q^\star$ ”。

题 2 Modeling & Model Capacity

设 $x\in\{0,1,2\}$ ， $y\in\{0,1\}$ 。条件分布是

\begin{aligned} p_y(y;0)&=\varepsilon^y(1-\varepsilon)^{1-y}\quad &\text{(Bernoulli($\varepsilon$))},\\ p_y(y;1)&=(1-\varepsilon)^y\varepsilon^{1-y}\quad &\text{(Bernoulli($1-\varepsilon$))},\\ p_y(y;2)&=\tfrac12\quad &\text{(完全噪声)}. \end{aligned}

其中 $0<\varepsilon<1/2$ 。

记先验 $p_x(x)$ ，互信息 $I_{p_x}(x;y)$ ，模型容量

C = I_{p_x^\star}(x;y),

其中 $p_x^\star$ 是使互信息最大的先验（“least informative prior”，其实是 capacity-achieving prior）。

2(a)(i) 写出互信息的结构形式

要找到 $f(\varepsilon), g(\varepsilon)$ 使

I_{p_x}(x;y) = H_B\!\left(\frac12 + f(\varepsilon)\big(p_x(1)-p_x(0)\big)\right) - g(\varepsilon)p_x(2) - H_B(\varepsilon),

其中 $H_B(p) = -p\log p-(1-p)\log(1-p)$ 。

思路：直接用 $I=H(Y)-H(Y|X)$

记
$p_x(0)=p_0,\quad p_x(1)=p_1,\quad p_x(2)=p_2.$
计算 $y$ 的边缘分布：
$\begin{aligned} P(Y=1) &= p_0\varepsilon + p_1(1-\varepsilon) + p_2\cdot\tfrac12,\\[3pt] P(Y=0) &= 1-P(Y=1). \end{aligned}$
把 $1/2$ 拆到每一项：
$P(Y=1)-\tfrac12 = (p_0(\varepsilon-\tfrac12)) + (p_1((1-\varepsilon)-\tfrac12)).$
注意 $\varepsilon-\tfrac12=-(\tfrac12-\varepsilon)$ ， $1-\varepsilon-\tfrac12=\tfrac12-\varepsilon$ 。
所以
$P(Y=1)-\tfrac12 = \big(\tfrac12-\varepsilon\big)(p_1-p_0).$
即
$P(Y=1) = \tfrac12 + \big(\tfrac12-\varepsilon\big)(p_1-p_0).$
特别重要：不依赖 $p_2$ 。
于是
$H(Y) = H_B\!\left(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\right).$
条件熵：
$\begin{aligned} H(Y|X) &= p_0 H_B(\varepsilon) + p_1 H_B(1-\varepsilon) + p_2 H_B(\tfrac12) \\ &= (p_0+p_1)H_B(\varepsilon) + p_2 H_B(\tfrac12) \\ &= (1-p_2) H_B(\varepsilon) + p_2 H_B(\tfrac12), \end{aligned}$
因为 $H_B(1-\varepsilon)=H_B(\varepsilon)$ 。
互信息：
$\begin{aligned} I_{p_x}(x;y) &= H(Y)-H(Y|X)\\ &= H_B\!\left(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\right) - (1-p_2)H_B(\varepsilon) - p_2 H_B(\tfrac12)\\ &= H_B\!\left(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\right) - H_B(\varepsilon) - p_2\big(H_B(\tfrac12)-H_B(\varepsilon)\big). \end{aligned}$

对应题目的形式：

$f(\varepsilon) = \tfrac12-\varepsilon$ ；
$g(\varepsilon) = H_B(\tfrac12) - H_B(\varepsilon)$ 。

2(a)(ii) 找到最优先验 $p_x^\star$ 和容量 C

我们要最大化上面的 $I_{p_x}(x;y)$ 对 $p_0,p_1,p_2$ 。

观察互信息结构：

第一项 $H_B(\cdot)$ 只依赖于 $p_1-p_0$ ，与 $p_2$ 无关。
第三项 $-g(\varepsilon)p_2$ 是关于 $p_2$ 的线性减项，因为 $g(\varepsilon)>0$ （ $H_B(1/2)$ 最大）。

直观结论：

$x=2$ 对 $y$ 来说完全是噪声（ $y$ 均匀），增加 $p_2$ 只会增大条件熵 $H(Y|X)$ ，
不改变 $Y$ 的边缘分布 → 信息只会减小。
⇒ 最优一定是 $p_2^\star = 0$ 。
在 $p_2=0$ 情况下，约束变成 $p_0+p_1=1$ ，且 $P(Y=1) = \tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0).$ 互信息是 $I = H_B\!\left(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\right) - H_B(\varepsilon).$ 而 $H_B(\cdot)$ 在 $[0,1]$ 上关于 $1/2$ 对称并在 $1/2$ 处最大。
⇒ 为了最大化 $H(Y)$ ，应使 $P(Y=1)=1/2$ 。
这要求 $p_1-p_0=0\Rightarrow p_1=p_0=1/2$ 。

得到：

最优先验： $p_x^\star(0)=p_x^\star(1)=\tfrac12,\quad p_x^\star(2)=0.$
容量： $C = I_{p_x^\star}(x;y) = H_B(\tfrac12) - H_B(\varepsilon).$
因此题目要求的 $C = \alpha H_B(1/2) + \beta H_B(\varepsilon)$ 中： $\alpha = 1,\quad \beta = -1.$

课堂 takeaway：

最优先验不会给“纯噪声”状态（这里的 $x=2$ ）任何权重。
容量达成时：输出 $Y$ 也是均匀的（信息论中的常见结论）。

2(a)(iii) 对应的输出混合分布 $p_y^\star$

用最优先验 $p_x^\star$ 作为权重：

\begin{aligned} P^\star(Y=1) &= \tfrac12\varepsilon + \tfrac12(1-\varepsilon) + 0\cdot\tfrac12 = \tfrac12,\\ P^\star(Y=0) &= \tfrac12. \end{aligned}

所以

p_y^\star(0) = p_y^\star(1) = \tfrac12.

takeaway： 容量点时，输出端看上去是“最均匀、最不偏”的。

2(b)(i) 用均匀先验时的输出分布

现在 $p_x(x) = 1/3$ for $x=0,1,2$ 。

\begin{aligned} P(Y=1) &= \tfrac13\varepsilon + \tfrac13(1-\varepsilon) + \tfrac13\times\tfrac12\\ &= \tfrac13\big(\varepsilon+1-\varepsilon+\tfrac12\big) = \tfrac13\cdot\tfrac32 = \tfrac12, \end{aligned}

因此仍然是

p_y(0) = p_y(1) = \tfrac12.

2(b)(ii) 均匀先验下互信息的损失

先求均匀先验下的互信息：

$H(Y) = H_B(1/2)$ （因为还是均匀）。
条件熵 $H(Y|X) = \tfrac13 H_B(\varepsilon) + \tfrac13 H_B(1-\varepsilon) + \tfrac13 H_B(1/2) = \tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2).$
所以 $\begin{aligned} I_{\text{unif}} &= H_B(1/2) - \left(\tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2)\right)\\ &= \tfrac23 H_B(1/2) - \tfrac23 H_B(\varepsilon)\\ &= \tfrac23\big(H_B(1/2)-H_B(\varepsilon)\big)\\ &= \tfrac23 C. \end{aligned}$

损失：

C - I_{\text{unif}} = C - \tfrac23 C = \tfrac13 C = \tfrac13 H_B(1/2) - \tfrac13 H_B(\varepsilon).

所以

\gamma = \tfrac13,\quad \delta = -\tfrac13.

课堂 takeaway：

“随便用均匀先验”会损失固定比例（这里是 1/3）的容量。
但均匀先验仍然达到 $2/3$ 的容量，说明它还算“不是太差”。

2(c) 一般模型下 $C - I_{q_x} = D(q_1\|q_2)$

设一般模型 $p_y(\cdot;x)$ ，容量达到时的先验为 $p_x^\star$ ，对应输出为 $p_y^\star$ 。给定任意先验 $q_x$ 。

假设 $p_x^\star(x) > 0$ 对所有 $x$ 。

利用“等距离性质（Equidistance Property）”：

对容量点满足：对所有 $x$ , $D\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = C.$

关键恒等式（链式法则）：

对任意先验 $q_x$ 对应的输出 $q_y(y) = \sum_x q_x(x) p_y(y;x),$ 可以证明 $\sum_x q_x(x) D(p_{y|x}(\cdot|x)\|p_y^\star) = I_{q_x}(x;y) + D(q_y\|p_y^\star).$ （在课堂上可以把 log 分成 $\log \frac{p_{y|x}}{q_y} + \log\frac{q_y}{p_y^\star}$ 两部分展开。）

带入等距离性质：

左边是 $\sum_x q_x(x)\cdot C = C$ ，所以 $C = I_{q_x}(x;y) + D(q_y\|p_y^\star).$
于是 $C - I_{q_x}(x;y) = D(q_y\|p_y^\star).$

因此应取：

$q_1(\cdot) = q_y(\cdot)$ ：在先验 $q_x$ 下的 $y$ 的边缘分布；
$q_2(\cdot) = p_y^\star(\cdot)$ ：容量先验对应的边缘分布。

2(d) 若 $p_x^\star$ 不是满支撑的情况

现在假定存在 $x$ 使得 $p_x^\star(x)=0$ 。经典 Kuhn–Tucker 条件告诉我们：

对所有 $x$ ， $D(p_{y|x}(\cdot|x)\|p_y^\star) \le C,$
且仅在 $p_x^\star(x)>0$ 的那些点上等号成立。

于是对任意 $q_x$ ：

\sum_x q_x(x) D(p_{y|x}\|p_y^\star) \le C.

按刚才同样的分解：

\sum_x q_x(x) D(p_{y|x}\|p_y^\star) = I_{q_x}(x;y) + D(q_y\|p_y^\star),

所以

I_{q_x}(x;y) + D(q_y\|p_y^\star) \le C \quad\Rightarrow\quad C - I_{q_x}(x;y) \ge D(q_y\|p_y^\star).

这正对应选项 B。

课堂 takeaway：

若容量先验不是满支撑，则一些 $x$ 对应的输出分布离 $p_y^\star$ 比 $C$ 更近（信息更小）。
对任意其他先验，其互信息与容量的差至少大于某个 KL 散度下界（这里是 $D(q_y\|p_y^\star)$ ）。

题 3 Typical Sequences（典型序列）

设 $y_1,\dots,y_N$ i.i.d. Bernoulli， $P(y_i=1)=\alpha$ ， $p$ 为单个 $y_i$ 的分布。对所有 $\log$ 取 base 2。

典型集定义（标准版本）：

T_\varepsilon(p;N) = \left\{ y^N:\ \left|-\frac1N\log p(y^N) - H(p)\right|\le\varepsilon \right\}.

3(a) $\alpha=1/2$ 的典型集与大小

当 $\alpha = 1/2$ 时，每个 $y_i$ 服从 Bernoulli(1/2)，
对任意序列 $y^N\in\{0,1\}^N$ 都有

p(y^N) = \left(\tfrac12\right)^N = 2^{-N}.

信息密度： $-\frac1N\log p(y^N) = -\frac1N\log 2^{-N} = 1.$
单个变量的熵： $H(p) = H_B(1/2)=1.$

于是对任意 $\varepsilon>0$ ：

\left|-\frac1N\log p(y^N) - H(p)\right| = |1-1| = 0 \le \varepsilon.

结论：

典型集就是所有长度为 N 的二进制序列： $T_\varepsilon(p;N) = \{0,1\}^N.$
大小： $|T_\varepsilon(p;N)| = 2^N.$

课堂 takeaway：

完全对称的 Bernoulli(1/2) 源下，所有序列都是同样“典型”的，概率完全一样。

3(b) $\alpha=1/3$ 时的典型集与增长指数

现在 $P(y_i=1)=1/3$ ， $P(y_i=0)=2/3$ ，记 $k$ 为序列里 1 的个数。

1. 写出单个序列的概率

p(y^N) = \left(\tfrac13\right)^k \left(\tfrac23\right)^{N-k}.

信息密度：

-\frac1N\log p(y^N) = -\frac{k}{N}\log\tfrac13 - \left(1-\frac{k}{N}\right)\log\tfrac23.

记经验频率 $\hat{p}_N = k/N$ 。则

-\frac1N\log p(y^N) = -\hat{p}_N\log\tfrac13 -(1-\hat{p}_N)\log\tfrac23.

而源熵为

H(p) = H_B(1/3) = -\tfrac13\log\tfrac13 - \tfrac23\log\tfrac23.

2. 典型条件对应“1 的比例接近 1/3”

典型条件：

\left|-\frac1N\log p(y^N) - H_B(1/3)\right|\le\varepsilon

可以看成是对 $\hat{p}_N$ 的约束：
因为上式是一个连续函数 $g(\hat{p}_N)$ ，在 $\hat{p}_N=1/3$ 处取值正好是 $H_B(1/3)$ ，所以存在一个 $\delta(\varepsilon)$ ，使得

|\hat{p}_N - 1/3|\le\delta(\varepsilon)

就等价于典型条件成立（对足够大的 N ）。

因此：

典型集可以写成

T_\varepsilon(p;N) = \left\{ y^N:\ \left|\frac{\#\{i:y_i=1\}}{N} - \frac13\right|\le\delta(\varepsilon)\right\}.

3. 大小的增长指数

标准典型序列定理告诉我们：对于离散无记忆源，

对足够大的 $N$ ， $(1-\eta)2^{N(H(p)-\varepsilon)} \le |T_\varepsilon(p;N)| \le 2^{N(H(p)+\varepsilon)}$ 对某个很小的 $\eta>0$ 。

对 $\alpha=1/3$ ：

H(p) = H_B(1/3).

因此， $|T_\varepsilon(p;N)|$ 的增长指数为

\lim_{N\to\infty}\frac1N\log|T_\varepsilon(p;N)| = H_B(1/3).

课堂上可以再补一句：

近似地说，当 $N$ 很大时 $|T_\varepsilon(p;N)| \approx 2^{N H_B(1/3)}.$
每个典型序列的概率大约是 $2^{-N H_B(1/3)}.$
于是几乎所有概率质量都集中在“数量级为 $2^{NH}$ ” 的这些典型序列上。

课堂 takeaway：

典型集的大小指数就是源熵 $H(p)$ 。
“熵 ≈ 每个典型序列的信息量 ≈ 典型序列个数的指数”，这是信息论里最重要的直觉之一。

二、英文 LaTeX 讲稿（可直接使用）

下面是一份可以直接放进 .tex 的讲稿草稿，风格是“课堂上我会怎么说 + 公式”，你可以按需要删减。

latex
\documentclass[11pt]{article}
\usepackage{amsmath, amssymb, amsfonts}
\usepackage{bm}

\newcommand{\KL}{D}
\newcommand{\E}{\mathbb{E}}
\newcommand{\HB}{H_{\mathrm{B}}}

\begin{document}

\title{Homework 4 -- Discussion Session Notes}
\author{Theory and Methods for Statistical Inference}
\date{}
\maketitle

In this recitation we go through all problems of Homework 4.
The audience is assumed to be undergraduate students with a basic
background in probability and information theory.
I will try to emphasize geometric pictures and key takeaways.

Throughout, for two distributions $p$ and $q$ on the same finite alphabet,
we write the Kullback--Leibler divergence as
$$
  \KL(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}.
$$

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Problem 1: Reverse I-projection}

\subsection*{Set-up and intuition}

\textbf{I-projection vs.\ reverse I-projection.}
Given a convex set of distributions $\mathcal{P}$ and a fixed distribution $q$,
the (forward) I-projection is
$$
  p^\star = \arg\min_{p\in\mathcal{P}} \KL(p\|q).
$$
In contrast, given a set $\mathcal{Q}$ and a fixed distribution $p$,
the \emph{reverse} I-projection is
$$
  q^\star = \arg\min_{q\in\mathcal{Q}} \KL(p\|q).
$$
So in reverse I-projection, the first argument $p$ is fixed, and we move $q$
inside the set $\mathcal{Q}$.

Geometrically, we can think of KL divergence as a kind of ``distance''
(although it is not symmetric), and I-projection means
``the closest point in the set''.

\subsection*{(a) Pythagoras' theorem for reverse I-projection}

\textbf{Goal.} Let $\mathcal{Q}$ be nonempty, closed, and \emph{log-convex}.
Let $q^\star$ be the reverse I-projection of $p$ onto $\mathcal{Q}$:
$$
  q^\star = \arg\min_{q\in\mathcal{Q}} \KL(p\|q).
$$
We want to show that for all $q\in\mathcal{Q}$,
$$
  \KL(p\|q) \;\ge\; \KL(p\|q^\star) + \KL(q^\star\|q).
$$

\textbf{Log-convex combinations.}
For any $r,s\in\mathcal{Q}$ that are not mutually singular, and $\lambda\in[0,1]$,
we define the log-convex combination
$$
  q_\lambda(x)
  = \frac{r(x)^\lambda s(x)^{1-\lambda}}
    {\sum_z r(z)^\lambda s(z)^{1-\lambda}}.
$$
The set $\mathcal{Q}$ is log-convex if for every such $r,s$ and $\lambda$,
the distribution $q_\lambda$ belongs to $\mathcal{Q}$.

You can think of this as ``taking a straight line in log-space'' between $r$
and $s$, and staying inside the set along this path.

\medskip
\textbf{Step 1: Follow a log-geodesic from $q^\star$ to $q$.}

Fix any $q\in\mathcal{Q}$ and set
$$
  q_\lambda(x)
  = \frac{q(x)^\lambda q^\star(x)^{1-\lambda}}
    {\sum_z q(z)^\lambda q^\star(z)^{1-\lambda}},\quad \lambda\in[0,1].
$$
Then
$$
  q_0 = q^\star,\qquad q_1 = q,
$$
and $q_\lambda\in\mathcal{Q}$ for all $\lambda$ by log-convexity.

Define
$$
  f(\lambda) = \KL(p\|q_\lambda).
$$

Because $q^\star$ is the minimizer of $\KL(p\|\cdot)$ over $\mathcal{Q}$,
it must also minimize along this particular path:
$$
  f(\lambda) \ge f(0)\quad\text{for all }\lambda\in[0,1].
$$
In particular, the right derivative at $\lambda=0$ satisfies
$$
  f'(0^+) \ge 0.
$$

\medskip
\textbf{Step 2: Compute the derivative $f'(0)$.}

We write
$$
  \log q_\lambda(x)
  = \lambda\log q(x) + (1-\lambda)\log q^\star(x) - \log Z(\lambda),
$$
where
$$
  Z(\lambda) = \sum_z q(z)^\lambda q^\star(z)^{1-\lambda}
$$
is the normalizing constant.

Then
$$
  f(\lambda)
  = \sum_x p(x)\Big(\log p(x) - \log q_\lambda(x)\Big),
$$
and
$$
  f'(\lambda)
  = -\sum_x p(x)\,\frac{\partial}{\partial\lambda}\log q_\lambda(x).
$$

A straightforward but slightly tedious calculation (which you can carry out
on the board) shows that at $\lambda=0$,
$$
  f'(0)
  = \KL(p\|q) - \KL(p\|q^\star) - \KL(q^\star\|q).
$$
Intuitively, this derivative compares how fast the divergence increases
when we move from $q^\star$ in the direction of $q$.

\medskip
\textbf{Step 3: Use the optimality of $q^\star$.}

Since $f$ has a minimum at $\lambda=0$ along this path,
we must have $f'(0^+)\ge 0$, so
$$
  \KL(p\|q) - \KL(p\|q^\star) - \KL(q^\star\|q) \;\ge\; 0,
$$
which is exactly the desired inequality:
$$
  \KL(p\|q) \;\ge\; \KL(p\|q^\star) + \KL(q^\star\|q).
$$

\medskip
\textbf{Takeaway:}
for a log-convex set $\mathcal{Q}$, the reverse I-projection $q^\star$ behaves
like the foot of a perpendicular in a ``KL geometry'':
going from $p$ to any other point $q\in\mathcal{Q}$ costs at least
the divergence from $p$ to $q^\star$ plus the divergence from $q^\star$ to $q$.

\subsection*{(b) Practice: exponential and linear families}

Here we recall:
\begin{itemize}
  \item Given a base distribution $q^\star$ and a sufficient statistic
        $t(x)\in\mathbb{R}^d$, the exponential family is
        $$
          \mathcal{E}_t(q^\star)
          = \left\{
                 q_\theta(x) 
                 = \exp(\theta^\top t(x) - \psi(\theta))\,q^\star(x)
            \right\}.
        $$
  \item The associated linear family is
        $$
          \mathcal{L}_t(q^\star)
          = \left\{
                 p:\ \E_p[t(X)] = \E_{q^\star}[t(X)]
            \right\}.
        $$
\end{itemize}

\textbf{Claim.}
$q^\star$ is the reverse I-projection of $p$ onto $\mathcal{E}_t(q^\star)$
if and only if $p\in\mathcal{L}_t(q^\star)$.

\medskip
\textbf{Proof idea: look at derivatives of $\KL(p\|q_\theta)$ in $\theta$.}

Using the definition of $q_\theta$, we get
$$
  \KL(p\|q_\theta)
  = \sum_x p(x)\log\frac{p(x)}{q_\theta(x)}
  = \text{constant in $\theta$}
    + \psi(\theta) - \theta^\top\E_p[t(X)].
$$

Differentiating with respect to $\theta$,
$$
  \nabla_\theta \KL(p\|q_\theta)
  = \nabla\psi(\theta) - \E_p[t(X)]
  = \E_{q_\theta}[t(X)] - \E_p[t(X)].
$$

If $q^\star$ is the reverse I-projection, then it must correspond to some
parameter, say $\theta=0$, and at this point the gradient must vanish:
$$
  \E_{q^\star}[t(X)] = \E_p[t(X)].
$$
So $p$ belongs to $\mathcal{L}_t(q^\star)$.

Conversely, if $p\in\mathcal{L}_t(q^\star)$, then at $\theta=0$ the gradient
vanishes, and by convexity of $\KL(p\|q_\theta)$ in $\theta$, this point is
a global minimizer. Hence $q^\star$ is indeed the reverse I-projection.

\medskip
\textbf{Takeaway:}
in exponential families, \emph{minimizing} KL divergence over the family
is equivalent to \emph{matching} the expected sufficient statistics.
This is the core of many maximum entropy and maximum likelihood results.

\subsection*{(c) Another characterization of reverse I-projection}

We are given a nonempty closed \emph{convex} set $\mathcal{Q}$.
Let $p$ be any distribution. We want to prove:

$$
  q^\star\ \text{is the reverse I-projection of $p$ onto $\mathcal{Q}$}
$$
if and only if for all $p'$ and any $q'\in\mathcal{Q}$,
$$
  \KL(p'\|q') + \KL(p'\|p) \;\ge\; \KL(p'\|q^\star).
$$

This is a kind of ``three-point inequality'' involving $p',p,q^\star$ and $q'$.

\medskip
\textbf{(``If'' direction)} Assume the inequality holds for all $p',q'$.

If we plug in $p'=p$, we obtain
$$
  \KL(p\|q') + \KL(p\|p) \ge \KL(p\|q^\star).
$$
But $\KL(p\|p)=0$, hence
$$
  \KL(p\|q') \ge \KL(p\|q^\star)\quad\forall q'\in\mathcal{Q}.
$$
Therefore $q^\star$ is indeed the minimizer of $\KL(p\|\cdot)$ over $\mathcal{Q}$,
i.e., the reverse I-projection.

\medskip
\textbf{(``Only if'' direction)} Suppose $q^\star$ is the reverse I-projection.

In this direction, the algebra is more involved. A clean way to think about it
is to view KL divergence as a \emph{Bregman divergence} generated by the
negative entropy function
$$
  \varphi(r) = \sum_x r(x)\log r(x).
$$
Then
$$
  \KL(p\|q) = \varphi(p) - \varphi(q) - \nabla\varphi(q)^\top(p-q).
$$

For a convex set $\mathcal{Q}$, the Bregman projection $q^\star$ of $p$
onto $\mathcal{Q}$ is characterized by a first-order (KKT-like) condition
involving $\nabla\varphi(q^\star)$.
By expanding the Bregman divergence expressions for
$\KL(p'\|q')$, $\KL(p'\|p)$, and $\KL(p'\|q^\star)$ and using this condition,
one can show that
$$
  \KL(p'\|q') + \KL(p'\|p) - \KL(p'\|q^\star) \ge 0
$$
for all $p'$ and all $q'\in\mathcal{Q}$.

In a homework-level solution, you can follow the same pattern as in the usual
Pythagoras theorem for forward I-projection: express the difference in KLs,
use convexity of $\mathcal{Q}$, and apply the optimality of $q^\star$.

\medskip
\textbf{Takeaway:}
this inequality says that $q^\star$ is the ``best possible'' element of
$\mathcal{Q}$ in a very strong sense:
for \emph{every} distribution $p'$, using $q^\star$ is never worse than using
any $q'\in\mathcal{Q}$, up to the additive ``penalty term'' $\KL(p'\|p)$,
which does not depend on $q'$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Problem 2: Modeling and capacity}

We have a parameter $x\in\{0,1,2\}$ and an observable binary variable $y\in\{0,1\}$.
The conditional distributions of $y$ given $x$ are
$$
\begin{aligned}
p_y(y;0) &= \varepsilon^y(1-\varepsilon)^{1-y}
           &&\text{(Bernoulli($\varepsilon$)),} \\
p_y(y;1) &= (1-\varepsilon)^y\,\varepsilon^{1-y}
           &&\text{(Bernoulli($1-\varepsilon$)),} \\
p_y(y;2) &= 1/2
           &&\text{(pure noise),}
\end{aligned}
$$
with $0<\varepsilon<1/2$.

Let $p_x$ be a prior distribution on $\{0,1,2\}$ and
$I_{p_x}(x;y)$ denote the mutual information between $x$ and $y$ under $p_x$.
The \emph{model capacity} is
$$
  C = I_{p_x^\star}(x;y),
$$
where $p_x^\star$ is the capacity-achieving (least informative) prior.

\subsection*{(a)(i) Expressing the mutual information}

We denote $H_B(p) = -p\log p-(1-p)\log(1-p)$, the entropy of a Bernoulli($p$).

Let
$$
  p_x(0)=p_0,\quad p_x(1)=p_1,\quad p_x(2)=p_2.
$$

\textbf{Step 1: Compute the marginal of $Y$.}
$$
\begin{aligned}
P(Y=1)
 &= p_0\varepsilon + p_1(1-\varepsilon) + p_2\cdot\frac12, \\
P(Y=0) &= 1 - P(Y=1).
\end{aligned}
$$
Subtract $1/2$:
$$
\begin{aligned}
P(Y=1) - \tfrac12
 &= p_0(\varepsilon-\tfrac12) + p_1((1-\varepsilon)-\tfrac12) \\
 &= (\tfrac12-\varepsilon)(p_1-p_0).
\end{aligned}
$$
Therefore
$$
  P(Y=1) = \tfrac12 + \big(\tfrac12-\varepsilon\big)(p_1-p_0),
$$
and this does \emph{not} depend on $p_2$.

Thus
$$
  H(Y)
  = H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big).
$$

\textbf{Step 2: Compute $H(Y|X)$.}

$$
\begin{aligned}
H(Y|X)
 &= p_0 H_B(\varepsilon)
   + p_1 H_B(1-\varepsilon)
   + p_2 H_B(1/2) \\
 &= (p_0+p_1)H_B(\varepsilon) + p_2 H_B(1/2)
  = (1-p_2)H_B(\varepsilon) + p_2 H_B(1/2),
\end{aligned}
$$
because $H_B(1-\varepsilon) = H_B(\varepsilon)$.

\textbf{Step 3: Mutual information.}

$$
\begin{aligned}
I_{p_x}(x;y)
 &= H(Y) - H(Y|X) \\
 &= H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big)
    - (1-p_2)H_B(\varepsilon) - p_2H_B(1/2) \\
 &= H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big)
    - H_B(\varepsilon)
    - p_2\big(H_B(1/2) - H_B(\varepsilon)\big).
\end{aligned}
$$

Comparing with the required form
$$
I_{p_x}(x;y)
 = H_B\!\Big(\tfrac12 + f(\varepsilon)(p_x(1)-p_x(0))\Big)
   - g(\varepsilon)p_x(2) - H_B(\varepsilon),
$$
we identify
$$
  f(\varepsilon) = \tfrac12 - \varepsilon,
  \qquad
  g(\varepsilon) = H_B(1/2) - H_B(\varepsilon).
$$

\subsection*{(a)(ii) The capacity-achieving prior and $C$}

We now maximize $I_{p_x}(x;y)$ over $p_0,p_1,p_2$.

\textbf{Observation 1: $x=2$ only adds noise.}  

The term involving $p_2$ is
$$
  -p_2\big(H_B(1/2)-H_B(\varepsilon)\big),
$$
and $H_B(1/2)>H_B(\varepsilon)$.
So for fixed $(p_1-p_0)$, increasing $p_2$ strictly \emph{decreases}
the mutual information.
Hence the optimum must satisfy
$$
  p_2^\star = 0.
$$

\textbf{Observation 2: with $p_2=0$, maximize the entropy of $Y$.}

When $p_2=0$, we have $p_0+p_1=1$ and
$$
  P(Y=1) = \tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0).
$$
Then
$$
  I(x;y)
  = H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big)
    - H_B(\varepsilon).
$$

The binary entropy $H_B(\cdot)$ is concave in its argument and is maximized
at $1/2$. Thus, to maximize $I(x;y)$, we want $P(Y=1)=1/2$, i.e.
$$
  p_1-p_0 = 0 \quad\Rightarrow\quad p_1=p_0=1/2.
$$

\textbf{Therefore}
$$
  p_x^\star(0)=p_x^\star(1)=\tfrac12,\quad p_x^\star(2)=0,
$$
and
$$
  C = I_{p_x^\star}(x;y)
    = H_B(1/2) - H_B(\varepsilon).
$$

So in the expression
$$
  C = \alpha H_B(1/2) + \beta H_B(\varepsilon)
$$
we have
$$
  \alpha = 1,\qquad \beta = -1.
$$

\subsection*{(a)(iii) The mixture distribution $p_y^\star$}

Using the optimal prior $p_x^\star$,
$$
\begin{aligned}
P^\star(Y=1)
 &= \tfrac12\varepsilon + \tfrac12(1-\varepsilon) + 0\cdot\tfrac12 \\
 &= \tfrac12,\\
P^\star(Y=0) &= \tfrac12.
\end{aligned}
$$

Thus
$$
  p_y^\star(0) = p_y^\star(1) = \tfrac12.
$$

\subsection*{(b)(i) Mixture distribution with uniform prior}

Now take $p_x(x)=1/3$ for all $x$.
We get
$$
\begin{aligned}
P(Y=1)
 &= \tfrac13\varepsilon + \tfrac13(1-\varepsilon) + \tfrac13\times\tfrac12 \\
 &= \tfrac13\Big(\varepsilon + 1 - \varepsilon + \tfrac12\Big)
  = \tfrac13\cdot\tfrac32 = \tfrac12,
\end{aligned}
$$
and $P(Y=0)=1/2$.

So again,
$$
  p_y(0)=p_y(1)=\tfrac12.
$$

\subsection*{(b)(ii) Information loss under the uniform prior}

\textbf{Mutual information for the uniform prior.}

We already know $H(Y)=H_B(1/2)$.
The conditional entropy is
$$
  H(Y|X)
  = \tfrac13 H_B(\varepsilon) + \tfrac13 H_B(1-\varepsilon) + \tfrac13 H_B(1/2)
  = \tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2).
$$

Thus
$$
\begin{aligned}
I_{\text{unif}}
 &= H(Y) - H(Y|X) \\
 &= H_B(1/2) - \left(\tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2)\right) \\
 &= \tfrac23 H_B(1/2) - \tfrac23 H_B(\varepsilon) \\
 &= \tfrac23\big(H_B(1/2) - H_B(\varepsilon)\big) \\
 &= \tfrac23 C.
\end{aligned}
$$

\textbf{Information loss.}

$$
  C - I_{\text{unif}}
  = C - \tfrac23 C
  = \tfrac13 C
  = \tfrac13 H_B(1/2) - \tfrac13 H_B(\varepsilon).
$$

So the constants in
$$
  C - I_{p_x}(x;y) = \gamma H_B(1/2) + \delta H_B(\varepsilon)
$$
are
$$
  \gamma = \tfrac13,\qquad \delta = -\tfrac13.
$$

\subsection*{(c) General model: expressing the loss as a KL divergence}

Now consider a general model $p_y(\cdot;x)$ for $y\in\mathcal{Y}$,
parameterized by $x\in\mathcal{X}$.
Let $p_x^\star$ be a capacity-achieving prior and $p_y^\star$ be the
corresponding marginal distribution of $y$.
Let $q_x$ be an arbitrary distribution on $\mathcal{X}$.

Assume $p_x^\star(x) > 0$ for all $x\in\mathcal{X}$.

\textbf{Equidistance property.}
For a capacity-achieving prior with full support, we have
$$
  \KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big)
  = C
$$
for all $x$.

\textbf{Mixture under $q_x$.}
Let
$$
  q_y(y) = \sum_x q_x(x)\,p_{y|x}(y|x)
$$
be the marginal distribution of $y$ under $q_x$.

\textbf{Key identity.}
We can show
$$
  \sum_x q_x(x)\,\KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big)
  = I_{q_x}(x;y) + \KL(q_y\|p_y^\star).
$$
This comes from writing
$$
  \log\frac{p_{y|x}(y|x)}{p_y^\star(y)}
  = \log\frac{p_{y|x}(y|x)}{q_y(y)}
    + \log\frac{q_y(y)}{p_y^\star(y)}
$$
and then averaging over $x$ and $y$.

\textbf{Apply equidistance.}
By the equidistance property, the left-hand side equals
$$
  \sum_x q_x(x)\cdot C = C,
$$
so
$$
  C = I_{q_x}(x;y) + \KL(q_y\|p_y^\star).
$$

Therefore
$$
  C - I_{q_x}(x;y) = \KL(q_y\|p_y^\star).
$$

So we can take
$$
  q_1(\cdot) = q_y(\cdot),\qquad q_2(\cdot) = p_y^\star(\cdot).
$$

\subsection*{(d) The case where $p_x^\star$ is not full support}

Now suppose that $p_x^\star(x)=0$ for some $x$.
In this case, the Kuhn--Tucker conditions for channel capacity say:
$$
  \KL(p_{y|x}(\cdot|x)\|p_y^\star(\cdot)) \le C,\quad\forall x,
$$
with equality if and only if $p_x^\star(x)>0$.

Therefore, for any prior $q_x$,
$$
  \sum_x q_x(x)\,\KL(p_{y|x}(\cdot|x)\|p_y^\star(\cdot)) \le C.
$$

Using the same decomposition as before,
$$
  \sum_x q_x(x)\,\KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big)
  = I_{q_x}(x;y) + \KL(q_y\|p_y^\star).
$$
Hence
$$
  I_{q_x}(x;y) + \KL(q_y\|p_y^\star) \le C,
$$
and so
$$
  C - I_{q_x}(x;y) \ge \KL(q_y\|p_y^\star).
$$

This corresponds to option \textbf{B}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Problem 3: Typical sequences}

Let $y_1,\dots,y_N$ be i.i.d.\ Bernoulli random variables with
$$
  \mathbb{P}(y_i=1)=\alpha,\quad \mathbb{P}(y_i=0)=1-\alpha.
$$
Let $p$ be the distribution of a single $y_i$, and let all logarithms be base 2.

The typical set (weakly typical) is defined as
$$
  T_\varepsilon(p;N)
  = \left\{
        y^N:\ \left| -\frac{1}{N}\log p(y^N) - H(p)\right|\le\varepsilon
    \right\},
$$
where $H(p)$ is the entropy of a single Bernoulli($\alpha$) variable.

\subsection*{(a) Case $\alpha=1/2$}

If $\alpha=1/2$, then each $y_i$ is Bernoulli($1/2$),
and for any sequence $y^N\in\{0,1\}^N$ we have
$$
  p(y^N) = \left(\tfrac12\right)^N = 2^{-N}.
$$

Therefore
$$
  -\frac{1}{N}\log p(y^N)
  = -\frac{1}{N}\log 2^{-N}
  = 1
$$
for all sequences.

The entropy of a single Bernoulli($1/2$) variable is
$$
  H(p) = H_B(1/2) = 1.
$$

Thus, for any $\varepsilon>0$,
$$
  \left| -\frac{1}{N}\log p(y^N) - H(p)\right|
  = |1-1| = 0 \le \varepsilon,
$$
so
$$
  T_\varepsilon(p;N) = \{0,1\}^N,
$$
and
$$
  |T_\varepsilon(p;N)| = 2^N.
$$

\subsection*{(b) Case $\alpha=1/3$: typical set and growth exponent}

Now assume $\alpha=1/3$, so
$$
  \mathbb{P}(y_i=1)=\tfrac13,\quad
  \mathbb{P}(y_i=0)=\tfrac23.
$$

Let $k$ be the number of ones in the sequence $y^N$:
$$
  k = \#\{i:y_i=1\}.
$$

\textbf{Probability of a given sequence.}
Every sequence with $k$ ones and $N-k$ zeros has probability
$$
  p(y^N) = \left(\tfrac13\right)^k\left(\tfrac23\right)^{N-k}.
$$
Hence
$$
  -\frac{1}{N}\log p(y^N)
  = -\frac{k}{N}\log\tfrac13 - \left(1-\frac{k}{N}\right)\log\tfrac23.
$$

Let $\hat{p}_N = k/N$ be the empirical fraction of ones.
Then
$$
  -\frac{1}{N}\log p(y^N)
  = -\hat{p}_N\log\tfrac13 - (1-\hat{p}_N)\log\tfrac23.
$$

The entropy is
$$
  H(p) = H_B(1/3)
       = -\tfrac13\log\tfrac13 - \tfrac23\log\tfrac23.
$$

\textbf{Typical set description.}
The typicality condition
$$
  \left| -\frac{1}{N}\log p(y^N) - H_B(1/3)\right|\le\varepsilon
$$
is a condition on $\hat{p}_N$.
Because the function of $\hat{p}_N$ above is continuous and achieves
$H_B(1/3)$ at $\hat{p}_N=1/3$, there exists some $\delta(\varepsilon)>0$
such that, for large $N$,
$$
  \left|\hat{p}_N - \tfrac13\right|\le\delta(\varepsilon)
$$
is equivalent to the typicality condition.

So we can write
$$
  T_\varepsilon(p;N)
  = \left\{
        y^N:\ 
        \left|\frac{\#\{i:y_i=1\}}{N} - \frac13\right|\le\delta(\varepsilon)
    \right\}.
$$

\textbf{Growth exponent of $|T_\varepsilon(p;N)|$.}
The asymptotic equipartition property (AEP) for discrete memoryless sources
states that, for large $N$,
$$
  (1-\eta)\,2^{N(H(p)-\varepsilon)}
  \;\le\; |T_\varepsilon(p;N)|
  \;\le\; 2^{N(H(p)+\varepsilon)}
$$
for some small $\eta>0$.

Thus the exponential growth rate of $|T_\varepsilon(p;N)|$ is
$$
  \lim_{N\to\infty}\frac{1}{N}\log |T_\varepsilon(p;N)|
  = H(p)
  = H_B(1/3).
$$

Intuitively, for large $N$,
$$
  |T_\varepsilon(p;N)| \approx 2^{N H_B(1/3)}.
$$

\medskip
\textbf{Key takeaway:}
for a Bernoulli source, the number of typical sequences is roughly
$2^{NH(p)}$, and each typical sequence has probability around $2^{-NH(p)}$.
Most of the probability mass is concentrated on these typical sequences.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\end{document}

如果你愿意，我也可以帮你把这份 LaTeX 再“瘦身版”/“板书版”各做一份，或者帮你把中文要点直接翻译成 PPT 提纲。

我需要在习题课用英语向基础一般的本科生讲解这份作业的全部解题思路，请完成解题用中文撰写讲解要点，并撰...

سؤال

الفكرة

الإجابة

一、中文讲解要点（备课提纲）

题 1 Reverse I-projection

总体概念提醒

1(a) 逆 I-投影的勾股定理

1(b)（练习）逆 I-投影、指数族与线性族

1(c) 逆 I-投影的另一种刻画

题 2 Modeling & Model Capacity

2(a)(i) 写出互信息的结构形式

2(a)(ii) 找到最优先验 $p_x^\star$ 和容量 C

2(a)(iii) 对应的输出混合分布 $p_y^\star$

2(b)(i) 用均匀先验时的输出分布

2(b)(ii) 均匀先验下互信息的损失

2(c) 一般模型下 $C - I_{q_x} = D(q_1\|q_2)$

2(d) 若 $p_x^\star$ 不是满支撑的情况

题 3 Typical Sequences（典型序列）

3(a) $\alpha=1/2$ 的典型集与大小

3(b) $\alpha=1/3$ 时的典型集与增长指数

二、英文 LaTeX 讲稿（可直接使用）

مشاركة هذا السؤال والجواب

سؤال

الفكرة

الإجابة

一、中文讲解要点（备课提纲）

题 1 Reverse I-projection

总体概念提醒

1(a) 逆 I-投影的勾股定理

1(b)（练习） 逆 I-投影、指数族与线性族

1(c) 逆 I-投影的另一种刻画

题 2 Modeling & Model Capacity

2(a)(i) 写出互信息的结构形式

2(a)(ii) 找到最优先验 px⋆p_x^\starpx⋆​ 和容量 C

2(a)(iii) 对应的输出混合分布 py⋆p_y^\starpy⋆​

2(b)(i) 用均匀先验时的输出分布

2(b)(ii) 均匀先验下互信息的损失

2(c) 一般模型下 C−Iqx=D(q1∥q2)C - I_{q_x} = D(q_1\|q_2)C−Iqx​​=D(q1​∥q2​)

2(d) 若 px⋆p_x^\starpx⋆​ 不是满支撑的情况

题 3 Typical Sequences（典型序列）

3(a) α=1/2\alpha=1/2α=1/2 的典型集与大小

3(b) α=1/3\alpha=1/3α=1/3 时的典型集与增长指数

二、英文 LaTeX 讲稿（可直接使用）

مشاركة هذا السؤال والجواب

1(b)（练习）逆 I-投影、指数族与线性族

2(a)(ii) 找到最优先验 $p_x^\star$ 和容量 C

2(a)(iii) 对应的输出混合分布 $p_y^\star$

2(c) 一般模型下 $C - I_{q_x} = D(q_1\|q_2)$

2(d) 若 $p_x^\star$ 不是满支撑的情况

3(a) $\alpha=1/2$ 的典型集与大小

3(b) $\alpha=1/3$ 时的典型集与增长指数