
Created on: November 25, 2025

Answered using GPT-5.1 Pro via Chat01

Question

HW_4.pdf

I need to explain, in English, the full solution ideas of this homework to undergraduates with an average background during a recitation session. Please work out the solutions, write the teaching points up as lecture-prep notes, and also prepare an English LaTeX lecture script. Make it accessible and intuitive, give a clear solution idea for every problem, and include some high-value takeaways so that students still remember the key ideas and insights after class.

Thought process

Teaching opportunity

I’m working through teaching reverse I-projection and connections to the exponential family and moment matching.

Reevaluating assumptions

I’m rethinking the "reverse I-projection" onto a convex set, focusing on cross-entropies and convexity properties, to clearly demonstrate the orthogonality property.

Working through inequalities

I’m exploring the link between cross-entropies and the orthogonality of the support hyperplane in reverse I-projection, requiring the inequality to hold for any distributions p' and q' in the convex set Q.

Formulating mutual information

I’m mapping out mutual information for the probability distribution, aiming to express it in terms of entropy functions and given parameters. Simplifying this expression is the next step.

Validating assumptions

I'm thinking through the mutual information formula, ensuring dependencies on parameters and functions are accurate, and simplifying expressions to align with the desired form.

Finalizing mutual information

I’m establishing functions f(ε) and g(ε) linked to entropy shifts, then analyzing optimal priors and mixture distributions to pinpoint the "least informative prior" that maximizes mutual information.

Evaluating distribution choices

I'm digging into various distributions to see if mutual information or key properties stay intact when using q_x instead of p_x. Equidistance properties of KL divergence help.

Evaluating equidistance

Interestingly enough, using the equidistance property for KL divergence when some input symbols are absent, mutual information holds, but KL divergence inequality weakens.

Verifying mutual information

OK, let's confirm that for any weighted distribution q_x, C - I_q consistently meets or surpasses D(p_q||p*_y), especially when q_x excludes certain symbols.

Estimating sequence counts

OK, let me see. For α=1/2, every sequence is typical, so there are 2^N sequences. For α=1/3, sequences with roughly αN ones are typical.

Calculating growth exponent

I'm working through the growth exponent for the typical set size for α=1/3. The exponent is H_B(1/3), and typical sequences have an empirical frequency of ones close to α.

Crafting the lecture

I’m ensuring clarity and engaging visuals, making complex concepts like KL divergence and typical sets accessible for students with a foundation in discrete probability theory.

Confirming log-convexity

I’m verifying the log-convexity of distributions and their connections to KL divergence by examining derivatives, e-geodesic combinations, and specific distributions within the set Q.

Connecting ideas

I’m focusing on the intersection of exponential and linear families via duality. Utilizing the KL divergence and convexity, I’m demonstrating q* is a reverse I-projection of p onto exponential family.

Connecting the dots

I'm aligning the Bregman projection theorem with convex duality, emphasizing how general inequalities extend the Pythagorean theorem in the context of KL divergence.

Mapping the structure

I’m detailing the lecture's layout, diversifying sections between Chinese and English, using bullet points, and integrating formulas and note citations. This approach ensures clarity and educational impact.

Clarifying inequality

OK, let's examine the inequality with p'=p. This implies q* minimizes D(p||q) on Q, resembling a reverse I-projection.

Detailing entropy constraints

I’m integrating entropy constraints from prior cases, using logarithmic properties, and calculating empirical frequencies for typical sequences. This builds a comprehensive understanding of typical set analysis.

Piecing together

I'm shaping the LaTeX script with sections on KL divergence, mutual information, typical sequences, and high-level problem solutions. Ensuring mathematical clarity and addressing student prerequisites.

Reasoned for 14m 56s

Answer

Here are the two parts directly:

  1. Teaching points (lesson-prep notes for your own use)
  2. An English LaTeX lecture script (ready to drop into a .tex file for the recitation)

The problems are from Homework 4. (HW_4.pdf)


Part I: Teaching points (lesson-prep outline)

Each problem and sub-part below is organized as "idea + skeleton of the computation + takeaways worth emphasizing in class". You can decide how much derivation detail to present depending on time.


Problem 1: Reverse I-projection

General concept reminders

  • KL divergence: $D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$.
  • I-projection (forward projection): within a set of distributions $\mathcal{P}$, find the $p^\star$ that minimizes $D(p^\star\|q)$ for a fixed $q$.
  • Reverse I-projection: this time $p$ is fixed, and we look for the $q^\star$ in a set $\mathcal{Q}$ that minimizes $D(p\|q^\star)$.
  • The core of this problem: KL divergence + convex / log-convex sets → a "Pythagorean-theorem-like" geometric structure.

1(a) Pythagoras' theorem for the reverse I-projection

We must show: if $\mathcal{Q}$ is nonempty, closed, and log-convex, and

$q^\star = \arg\min_{q\in\mathcal{Q}} D(p\|q),$

then for all $q\in\mathcal{Q}$,

$D(p\|q) \ge D(p\|q^\star) + D(q^\star\|q).$

Key concept: log-convex set

  • Given $r,s\in\mathcal{Q}$ that are not mutually singular, define the "log-convex combination" $q_\lambda(x) = \dfrac{r(x)^\lambda s(x)^{1-\lambda}}{\sum_z r(z)^\lambda s(z)^{1-\lambda}},\quad 0\le\lambda\le1.$
  • Saying $\mathcal{Q}$ is log-convex means: if you walk from one distribution to another along a "straight line in log space", every point on the path also stays in $\mathcal{Q}$.

Proof idea (classroom version):

  1. Pick a "log line" inside $\mathcal{Q}$.
    Fix an arbitrary $q\in\mathcal{Q}$ and, together with $q^\star$, construct

    $q_\lambda(x) = \dfrac{q(x)^\lambda q^\star(x)^{1-\lambda}}{\sum_z q(z)^\lambda q^\star(z)^{1-\lambda}},$

    so that $q_0 = q^\star$, $q_1 = q$, and every $q_\lambda \in \mathcal{Q}$ by log-convexity.

  2. Watch how the KL divergence changes along this curve.
    Define

    $f(\lambda) = D(p\|q_\lambda).$

    Because $q^\star$ is optimal over all of $\mathcal{Q}$, it must in particular be the minimum along this path, so

    $f(\lambda) \ge f(0)\ \ \forall \lambda\in[0,1] \quad\Rightarrow\quad f'(0^+) \ge 0.$
  3. Compute $f'(0)$ (worth doing once on the board):

    • Write $\log q_\lambda(x) = \lambda\log q(x) + (1-\lambda)\log q^\star(x) - \log Z(\lambda),$ where $Z(\lambda)$ is the normalizing constant.
    • Substitute into $f(\lambda) = \sum_x p(x)\big(\log p(x)-\log q_\lambda(x)\big)$ and differentiate. At $\lambda=0$ we have $Z(0)=1$ and $Z'(0)=\sum_z q^\star(z)\log\frac{q(z)}{q^\star(z)} = -D(q^\star\|q)$, so collecting terms gives $f'(0) = D(p\|q) - D(p\|q^\star) - D(q^\star\|q).$
  4. Use the first-order condition at the minimum.
    Since $f'(0^+)\ge 0$,

    $D(p\|q) - D(p\|q^\star) - D(q^\star\|q) \ge 0 \quad\Rightarrow\quad D(p\|q)\ge D(p\|q^\star)+D(q^\star\|q).$

Classroom takeaways:

  • Along log-convex paths, KL divergence has a Pythagorean property analogous to squared Euclidean distance.
  • Geometrically: the "closest point" $q^\star$ of $\mathcal{Q}$ to $p$ splits the "distance" into two legs, from $p$ to $q^\star$ and from $q^\star$ to $q$. (A quick numerical check follows below.)
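If you want a quick numerical sanity check to show in class (purely illustrative, not part of the homework), the Python sketch below takes $\mathcal{Q}$ to be the log-convex path between two fixed distributions, finds $q^\star$ by grid search, and verifies the Pythagorean inequality; all distributions and names here are invented.

```python
import numpy as np

def kl(p, q):
    # KL divergence (in nats) between distributions on a finite alphabet
    return float(np.sum(p * np.log(p / q)))

def log_mix(r, s, lam):
    # log-convex (geometric) combination of r and s, renormalized
    w = r**lam * s**(1 - lam)
    return w / w.sum()

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))
r = rng.dirichlet(np.ones(5))
s = rng.dirichlet(np.ones(5))

# Take Q to be the log-convex path between r and s, and find the
# reverse I-projection q* of p onto Q by grid search over lambda.
lams = np.linspace(0.0, 1.0, 2001)
path = [log_mix(r, s, lam) for lam in lams]
q_star = min(path, key=lambda q: kl(p, q))

# Pythagorean inequality: D(p||q) >= D(p||q*) + D(q*||q) for every q on the path.
gap = min(kl(p, q) - kl(p, q_star) - kl(q_star, q) for q in path)
print("smallest gap along the path:", gap)   # ~0 or positive (up to grid error)
```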

1(b) (Practice) Reverse I-projection, exponential families, and linear families

You can present this part as one concrete instance of the duality between exponential families and linear families.

Setup recap:

  • We are given a base distribution $q^\star$ and a sufficient statistic $t(x)$.
  • Exponential family: $\mathcal{E}_t(q^\star) = \big\{ q_\theta(x)=\exp(\theta^\top t(x)-\psi(\theta))\,q^\star(x) \big\}.$
  • Linear family: $\mathcal{L}_t(q^\star) = \big\{ p: \mathbb{E}_p[t(X)] = \mathbb{E}_{q^\star}[t(X)] \big\}.$

To show:
$q^\star$ is the reverse I-projection of $p$ onto the exponential family $\mathcal{E}_t(q^\star)$
    $\iff$ $p \in \mathcal{L}_t(q^\star)$.

Idea: look at the derivative of $D(p\|q_\theta)$ with respect to $\theta$.

  1. Write out $D(p\|q_\theta) = \sum_x p(x)\log\frac{p(x)}{q_\theta(x)} = \text{const} + \psi(\theta) - \theta^\top \mathbb{E}_p[t(X)].$
  2. Differentiate: $\nabla_\theta D(p\|q_\theta) = \nabla\psi(\theta) - \mathbb{E}_p[t(X)] = \mathbb{E}_{q_\theta}[t(X)] - \mathbb{E}_p[t(X)].$
  3. If $q^\star = q_{\theta=0}$ is the reverse I-projection, the gradient must vanish at $\theta=0$: $\mathbb{E}_{q^\star}[t(X)] = \mathbb{E}_p[t(X)],$ i.e. $p \in \mathcal{L}_t(q^\star)$.
  4. Conversely, if $p\in\mathcal{L}_t(q^\star)$, the derivative at $\theta=0$ is zero;
    since $D(p\|q_\theta)$ is convex in $\theta$, $\theta=0$ is a global minimizer,
    so $q^\star$ is indeed the reverse I-projection.

Classroom takeaway:

  • "Exponential family + KL minimization" ⇔ "linear (moment) constraints".
    This is the essence of the relationship between maximum entropy / maximum likelihood and exponential families; a small numerical illustration follows below.
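As an optional illustration of the moment-matching statement (not from the homework; the alphabet, base distribution, statistic, and $p$ below are all invented), minimizing $D(p\|q_\theta)$ over a one-parameter exponential family by brute force lands at a tilt whose mean of $t$ matches $\mathbb{E}_p[t(X)]$:

```python
import numpy as np

# Toy example: minimizing D(p || q_theta) over a 1-parameter exponential family
# tilts q_star until the mean of t matches E_p[t(X)]  ("moment matching").
x_vals = np.arange(5)
t = x_vals.astype(float)                              # sufficient statistic t(x) = x
q_star = np.array([0.10, 0.30, 0.20, 0.25, 0.15])     # base distribution (made up)
p      = np.array([0.05, 0.15, 0.20, 0.30, 0.30])     # the fixed distribution p (made up)

def q_theta(theta):
    # exponential tilt of q_star: q_theta(x) is proportional to exp(theta*t(x)) q_star(x)
    w = q_star * np.exp(theta * t)
    return w / w.sum()

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

thetas = np.linspace(-3, 3, 6001)
theta_hat = min(thetas, key=lambda th: kl(p, q_theta(th)))   # grid-search minimizer

print("E_p[t]          =", float(p @ t))
print("E_{q_theta*}[t] =", float(q_theta(theta_hat) @ t))    # nearly equal
```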

1(c) Another characterization of the reverse I-projection

Statement:
given a distribution $p$ and a nonempty, closed, convex set $\mathcal{Q}$,
show that $q^\star$ is the reverse I-projection of $p$ onto $\mathcal{Q}$ if and only if

$D(p'\|q') + D(p'\|p) \ge D(p'\|q^\star), \quad \forall p',\ \forall q'\in\mathcal{Q}.$

You can read it this way: $q^\star$ is the point that, for every distribution $p'$, never does worse than any other $q'$, once the compensation term $D(p'\|p)$ is added.

"If" direction (⇐): simple, worth writing out in full in class.

  • Assume the inequality holds for all $p', q'$.
  • Take $p'=p$: $D(p\|q') + D(p\|p) \ge D(p\|q^\star) \Rightarrow D(p\|q') \ge D(p\|q^\star)$ for all $q'\in\mathcal{Q}$, because $D(p\|p)=0$.
  • So $q^\star$ is indeed $\arg\min_{q\in\mathcal{Q}} D(p\|q)$.

"Only if" direction (⇒): present the key steps.

  • Assume $q^\star$ is the reverse I-projection. Because $\mathcal{Q}$ is convex, the segment $(1-\lambda)q^\star+\lambda q'$ stays in $\mathcal{Q}$ for any $q'\in\mathcal{Q}$; differentiating $D\big(p\,\|\,(1-\lambda)q^\star+\lambda q'\big)$ at $\lambda=0^+$ and using optimality gives the first-order (KKT-type) condition $\sum_x p(x)\frac{q'(x)}{q^\star(x)} \le 1.$ (This is the "normal" condition of the Bregman projection of $p$ onto the convex set $\mathcal{Q}$.)
  • Now fix any $p'$ and write $D(p'\|q') + D(p'\|p) - D(p'\|q^\star) = \sum_x p'(x)\log\frac{p'(x)}{r(x)},$ where $r(x) := \frac{p(x)\,q'(x)}{q^\star(x)}$.
  • Since $\sum_x r(x)\le 1$, the log-sum inequality (divergence with respect to a sub-normalized measure) gives $\sum_x p'(x)\log\frac{p'(x)}{r(x)} \ge \log\frac{1}{\sum_x r(x)} \ge 0$, which is exactly the required inequality.
    (In class you can state the first bullet and the final inequality and leave the middle algebra as an exercise.)

Classroom takeaways:

  • When KL divergence is viewed as a Bregman divergence, projections onto convex sets all admit this kind of Pythagorean-style characterization.
  • The inequality is a "three-point inequality" involving four objects $p', p, q^\star, q'$:
    in the end, a single best $q^\star$ is characterized by quantifying over all $p'$. (A quick numerical sanity check follows below.)
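For completeness, here is an optional numerical check of the three-point inequality on a toy convex set (all mixtures of two fixed distributions); everything here is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r, s = (rng.dirichlet(np.ones(4)) for _ in range(3))

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

def mix(a):
    # Q = all mixtures (1 - a) r + a s of two fixed distributions (a convex set)
    return (1 - a) * r + a * s

# Reverse I-projection of p onto Q, by grid search over the mixture weight.
q_star = min((mix(a) for a in np.linspace(0, 1, 2001)), key=lambda q: kl(p, q))

# Check D(p'||q') + D(p'||p) >= D(p'||q*) for random p' and many q' in Q.
worst = min(
    kl(pp, mix(a)) + kl(pp, p) - kl(pp, q_star)
    for pp in rng.dirichlet(np.ones(4), size=200)
    for a in np.linspace(0, 1, 51)
)
print("worst slack:", worst)   # should be >= 0 (up to grid error in q_star)
```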

Problem 2: Modeling and model capacity

Here $x\in\{0,1,2\}$ and $y\in\{0,1\}$. The conditional distributions are

$p_y(y;0)=\varepsilon^y(1-\varepsilon)^{1-y}$  (Bernoulli($\varepsilon$)),
$p_y(y;1)=(1-\varepsilon)^y\varepsilon^{1-y}$  (Bernoulli($1-\varepsilon$)),
$p_y(y;2)=\tfrac12$  (pure noise),

with $0<\varepsilon<1/2$.

Write $p_x(x)$ for the prior and $I_{p_x}(x;y)$ for the mutual information; the model capacity is

$C = I_{p_x^\star}(x;y),$

where $p_x^\star$ is the prior that maximizes the mutual information (the "least informative prior", i.e. the capacity-achieving prior).


2(a)(i) Structural form of the mutual information

We need to find $f(\varepsilon), g(\varepsilon)$ such that

$I_{p_x}(x;y) = H_B\!\left(\frac12 + f(\varepsilon)\big(p_x(1)-p_x(0)\big)\right) - g(\varepsilon)\,p_x(2) - H_B(\varepsilon),$

where $H_B(p) = -p\log p-(1-p)\log(1-p)$.

Idea: use $I = H(Y) - H(Y|X)$ directly.

  1. Write $p_x(0)=p_0,\ p_x(1)=p_1,\ p_x(2)=p_2.$
  2. Compute the marginal of $y$:

    $P(Y=1) = p_0\varepsilon + p_1(1-\varepsilon) + p_2\cdot\tfrac12,\qquad P(Y=0)=1-P(Y=1).$

    Split a $\tfrac12$ off each term:

    $P(Y=1)-\tfrac12 = p_0\big(\varepsilon-\tfrac12\big) + p_1\big((1-\varepsilon)-\tfrac12\big).$

    Note $\varepsilon-\tfrac12=-(\tfrac12-\varepsilon)$ and $1-\varepsilon-\tfrac12=\tfrac12-\varepsilon$, so

    $P(Y=1)-\tfrac12 = \big(\tfrac12-\varepsilon\big)(p_1-p_0),$

    i.e.

    $P(Y=1) = \tfrac12 + \big(\tfrac12-\varepsilon\big)(p_1-p_0).$

    Importantly, this does not depend on $p_2$.

  3. Hence

    $H(Y) = H_B\!\left(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\right).$
  4. Conditional entropy:

    $H(Y|X) = p_0 H_B(\varepsilon) + p_1 H_B(1-\varepsilon) + p_2 H_B(\tfrac12) = (1-p_2) H_B(\varepsilon) + p_2 H_B(\tfrac12),$

    since $H_B(1-\varepsilon)=H_B(\varepsilon)$.

  5. Mutual information:

    $I_{p_x}(x;y) = H(Y)-H(Y|X) = H_B\!\left(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\right) - H_B(\varepsilon) - p_2\big(H_B(\tfrac12)-H_B(\varepsilon)\big).$

Matching the form required in the problem (a quick numerical cross-check follows below):

  • $f(\varepsilon) = \tfrac12-\varepsilon$
  • $g(\varepsilon) = H_B(\tfrac12) - H_B(\varepsilon)$
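A quick way to convince students the bookkeeping is right is to compare the closed form against a direct computation of $I(X;Y)$; the sketch below is illustrative and the specific prior and $\varepsilon$ are arbitrary.

```python
import numpy as np

def hb(p):
    # binary entropy in bits (clipped to avoid log(0))
    p = np.clip(p, 1e-300, 1 - 1e-300)
    return float(-p*np.log2(p) - (1-p)*np.log2(1-p))

eps = 0.1
prior = np.array([0.5, 0.2, 0.3])                 # (p0, p1, p2), any prior works
py1_given_x = np.array([eps, 1 - eps, 0.5])       # P(Y=1 | X=x) for x = 0, 1, 2

# Direct computation: I = H(Y) - H(Y|X)
py1 = float(prior @ py1_given_x)
I_direct = hb(py1) - float(prior @ np.array([hb(q) for q in py1_given_x]))

# Closed form from 2(a)(i)
f, g = 0.5 - eps, hb(0.5) - hb(eps)
I_formula = hb(0.5 + f*(prior[1] - prior[0])) - g*prior[2] - hb(eps)

print(I_direct, I_formula)   # the two values agree
```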

2(a)(ii) The optimal prior $p_x^\star$ and the capacity $C$

We now maximize $I_{p_x}(x;y)$ over $p_0,p_1,p_2$.

Look at the structure of the mutual information:

  1. The first term $H_B(\cdot)$ depends only on $p_1-p_0$ and not on $p_2$.
  2. The third term $-g(\varepsilon)p_2$ is linearly decreasing in $p_2$, because $g(\varepsilon)>0$ ($H_B(1/2)$ is the maximum of the binary entropy).

Intuitive conclusions:

  • From the point of view of $y$, the input $x=2$ is pure noise ($y$ is uniform). Increasing $p_2$ only increases the conditional entropy $H(Y|X)$ without changing the marginal of $Y$, so the information can only decrease.
    Hence the optimum must have $p_2^\star = 0$.
  • With $p_2=0$ the constraint becomes $p_0+p_1=1$, and $P(Y=1) = \tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0).$ The mutual information is $I = H_B\!\left(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\right) - H_B(\varepsilon).$ On $[0,1]$, $H_B(\cdot)$ is symmetric about $1/2$ and maximized at $1/2$.
    ⇒ To maximize $H(Y)$ we want $P(Y=1)=1/2$,
    which requires $p_1-p_0=0\Rightarrow p_1=p_0=1/2$.

We obtain (a brute-force check follows below):

  • Optimal prior: $p_x^\star(0)=p_x^\star(1)=\tfrac12,\quad p_x^\star(2)=0.$
  • Capacity: $C = I_{p_x^\star}(x;y) = H_B(\tfrac12) - H_B(\varepsilon).$
  • So in the requested form $C = \alpha H_B(1/2) + \beta H_B(\varepsilon)$ we have $\alpha = 1,\ \beta = -1.$
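If you want to demo this in class, a brute-force grid search over priors (illustrative code, arbitrary $\varepsilon$) recovers the claimed capacity-achieving prior and capacity:

```python
import numpy as np

def hb(p):
    p = np.clip(p, 1e-300, 1 - 1e-300)
    return float(-p*np.log2(p) - (1-p)*np.log2(1-p))

eps = 0.1
py1_given_x = np.array([eps, 1 - eps, 0.5])   # P(Y=1 | X=x)

def mi(prior):
    # I(X;Y) = H(Y) - H(Y|X) for a given prior on {0,1,2}
    py1 = float(prior @ py1_given_x)
    return hb(py1) - float(prior @ np.array([hb(q) for q in py1_given_x]))

best, best_prior = -1.0, None
grid = np.linspace(0, 1, 101)
for p0 in grid:
    for p1 in grid:
        if p0 + p1 <= 1:
            prior = np.array([p0, p1, 1 - p0 - p1])
            if mi(prior) > best:
                best, best_prior = mi(prior), prior

print("grid-search capacity:", best, "at prior", best_prior)   # ~ (1/2, 1/2, 0)
print("claimed capacity    :", hb(0.5) - hb(eps))
```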

Classroom takeaways:

  • The optimal prior puts no weight on the "pure noise" state (here $x=2$).
  • At capacity, the output $Y$ is also uniform (a very common pattern in information theory).

2(a)(iii) The corresponding output mixture $p_y^\star$

Using the optimal prior $p_x^\star$ as the weights:

$P^\star(Y=1) = \tfrac12\varepsilon + \tfrac12(1-\varepsilon) + 0\cdot\tfrac12 = \tfrac12,\qquad P^\star(Y=0) = \tfrac12.$

So

$p_y^\star(0) = p_y^\star(1) = \tfrac12.$

Takeaway: at the capacity point, the output looks "as uniform and unbiased as possible".


2(b)(i) The output distribution under the uniform prior

Now $p_x(x) = 1/3$ for $x=0,1,2$:

$P(Y=1) = \tfrac13\varepsilon + \tfrac13(1-\varepsilon) + \tfrac13\cdot\tfrac12 = \tfrac13\big(\varepsilon+1-\varepsilon+\tfrac12\big) = \tfrac13\cdot\tfrac32 = \tfrac12,$

so we still get

$p_y(0) = p_y(1) = \tfrac12.$
2(b)(ii) 均匀先验下互信息的损失

先求均匀先验下的互信息:

  • H(Y)=HB(1/2)H(Y) = H_B(1/2)(因为还是均匀)。
  • 条件熵 H(YX)=13HB(ε)+13HB(1ε)+13HB(1/2)=23HB(ε)+13HB(1/2).H(Y|X) = \tfrac13 H_B(\varepsilon) + \tfrac13 H_B(1-\varepsilon) + \tfrac13 H_B(1/2) = \tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2).
  • 所以 Iunif=HB(1/2)(23HB(ε)+13HB(1/2))=23HB(1/2)23HB(ε)=23(HB(1/2)HB(ε))=23C.\begin{aligned} I_{\text{unif}} &= H_B(1/2) - \left(\tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2)\right)\\ &= \tfrac23 H_B(1/2) - \tfrac23 H_B(\varepsilon)\\ &= \tfrac23\big(H_B(1/2)-H_B(\varepsilon)\big)\\ &= \tfrac23 C. \end{aligned}

损失:

CIunif=C23C=13C=13HB(1/2)13HB(ε).C - I_{\text{unif}} = C - \tfrac23 C = \tfrac13 C = \tfrac13 H_B(1/2) - \tfrac13 H_B(\varepsilon).

所以

γ=13,δ=13.\gamma = \tfrac13,\quad \delta = -\tfrac13.

Classroom takeaways:

  • Using the uniform prior "by default" costs a fixed fraction of the capacity (here 1/3).
  • On the other hand, the uniform prior still achieves $2/3$ of the capacity, so it is "not too bad"; the quick check below makes this concrete.
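A short sanity check of the "2/3 of capacity" claim for several values of $\varepsilon$ (illustrative only):

```python
import numpy as np

def hb(p):
    p = np.clip(p, 1e-300, 1 - 1e-300)
    return float(-p*np.log2(p) - (1-p)*np.log2(1-p))

for eps in (0.05, 0.1, 0.25, 0.4):
    C = hb(0.5) - hb(eps)
    I_unif = hb(0.5) - (2/3)*hb(eps) - (1/3)*hb(0.5)   # I under the uniform prior
    print(eps, I_unif, (2/3)*C)                         # last two columns agree
```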

2(c) General model: $C - I_{q_x} = D(q_1\|q_2)$

Consider a general model $p_y(\cdot;x)$, let $p_x^\star$ be a capacity-achieving prior and $p_y^\star$ the corresponding output distribution, and let $q_x$ be an arbitrary prior.

Assume $p_x^\star(x) > 0$ for all $x$.

Use the equidistance property:

  • At the capacity point, for every $x$, $D\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = C.$

Key identity (a chain-rule-style decomposition):

  • For the output $q_y(y) = \sum_x q_x(x)\, p_y(y;x)$ induced by any prior $q_x$, one can show $\sum_x q_x(x)\, D(p_{y|x}(\cdot|x)\|p_y^\star) = I_{q_x}(x;y) + D(q_y\|p_y^\star).$ (In class, derive it by splitting $\log \frac{p_{y|x}}{p_y^\star} = \log\frac{p_{y|x}}{q_y} + \log\frac{q_y}{p_y^\star}$ and averaging over $x$ and $y$.)

Plug in the equidistance property:

  • The left-hand side is $\sum_x q_x(x)\cdot C = C$, so $C = I_{q_x}(x;y) + D(q_y\|p_y^\star).$
  • Therefore $C - I_{q_x}(x;y) = D(q_y\|p_y^\star).$

So we should take (a numerical check of the decomposition follows below):

  • $q_1(\cdot) = q_y(\cdot)$: the marginal of $y$ under the prior $q_x$;
  • $q_2(\cdot) = p_y^\star(\cdot)$: the output marginal under the capacity-achieving prior.
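The decomposition is easy to verify numerically; note that, as an identity, it holds for any reference output distribution in place of $p_y^\star$ (the capacity-achieving choice only matters when we invoke equidistance). The channel, prior, and reference below are random and purely illustrative.

```python
import numpy as np

# Check:  sum_x q(x) D(p_{y|x} || p_ref) = I_q(X;Y) + D(q_y || p_ref)
rng = np.random.default_rng(2)
W = rng.dirichlet(np.ones(4), size=3)     # channel: rows are p_{y|x}(.|x), 3 inputs, 4 outputs
q_x = rng.dirichlet(np.ones(3))           # an arbitrary prior on x
p_y_ref = rng.dirichlet(np.ones(4))       # a reference output distribution ("p_y*")

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

q_y = q_x @ W                                            # output marginal under q_x
I_q = sum(q_x[i] * kl(W[i], q_y) for i in range(3))      # mutual information I_q(X;Y)

lhs = sum(q_x[i] * kl(W[i], p_y_ref) for i in range(3))
rhs = I_q + kl(q_y, p_y_ref)
print(lhs, rhs)   # equal up to floating-point error
```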

2(d) The case where $p_x^\star$ is not full-support

Now suppose there is some $x$ with $p_x^\star(x)=0$. The classical Kuhn–Tucker conditions for capacity tell us:

  • For all $x$, $D(p_{y|x}(\cdot|x)\|p_y^\star) \le C,$
  • with equality at every $x$ where $p_x^\star(x)>0$ (and in general only "$\le$" where $p_x^\star(x)=0$).

Hence for any $q_x$,

$\sum_x q_x(x) D(p_{y|x}\|p_y^\star) \le C.$

With the same decomposition as before,

$\sum_x q_x(x) D(p_{y|x}\|p_y^\star) = I_{q_x}(x;y) + D(q_y\|p_y^\star),$

so

$I_{q_x}(x;y) + D(q_y\|p_y^\star) \le C \quad\Rightarrow\quad C - I_{q_x}(x;y) \ge D(q_y\|p_y^\star).$

This corresponds to option B.

Classroom takeaways:

  • If the capacity-achieving prior is not full-support, then the inputs $x$ it excludes have output distributions that are closer to $p_y^\star$ than $C$ (they are less informative).
  • For any other prior, the gap between its mutual information and the capacity is lower-bounded by a KL divergence (here $D(q_y\|p_y^\star)$); the sketch below checks this on the Problem 2 model, whose capacity prior indeed excludes $x=2$.
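Here is a sketch of this check on the Problem 2 model itself (where $p_x^\star=(1/2,1/2,0)$ is indeed not full-support); the value of $\varepsilon$ and the test priors are arbitrary.

```python
import numpy as np

def hb(p):
    p = np.clip(p, 1e-300, 1 - 1e-300)
    return float(-p*np.log2(p) - (1-p)*np.log2(1-p))

def kl2(a, b):
    # KL divergence in bits between two distributions on {0,1} given as vectors
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum(np.where(a > 0, a * np.log2(a / b), 0.0)))

eps = 0.1
W = np.array([[1-eps, eps], [eps, 1-eps], [0.5, 0.5]])   # rows: p_{y|x}(.|x)
p_y_star = np.array([0.5, 0.5])
C = hb(0.5) - hb(eps)

# Kuhn-Tucker check: D(p_{y|x} || p_y*) = C for x = 0, 1 and < C for x = 2
print([kl2(W[x], p_y_star) for x in range(3)], "vs C =", C)

# For a few priors q_x, verify  C - I_q >= D(q_y || p_y*)
rng = np.random.default_rng(3)
for q_x in rng.dirichlet(np.ones(3), size=5):
    q_y = q_x @ W
    I_q = sum(q_x[x] * kl2(W[x], q_y) for x in range(3))
    print(C - I_q >= kl2(q_y, p_y_star) - 1e-12)   # True
```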

Problem 3: Typical sequences

Let $y_1,\dots,y_N$ be i.i.d. Bernoulli with $P(y_i=1)=\alpha$, and let $p$ denote the distribution of a single $y_i$. All logarithms are base 2.

Definition of the typical set (standard weak-typicality version):

$T_\varepsilon(p;N) = \left\{ y^N:\ \left|-\frac1N\log p(y^N) - H(p)\right|\le\varepsilon \right\}.$

3(a) The typical set and its size for $\alpha=1/2$

When $\alpha=1/2$, each $y_i$ is Bernoulli(1/2), so every sequence $y^N\in\{0,1\}^N$ has

$p(y^N) = \left(\tfrac12\right)^N = 2^{-N}.$

  • Information density: $-\frac1N\log p(y^N) = -\frac1N\log 2^{-N} = 1.$
  • Entropy of a single symbol: $H(p) = H_B(1/2)=1.$

So for every $\varepsilon>0$,

$\left|-\frac1N\log p(y^N) - H(p)\right| = |1-1| = 0 \le \varepsilon.$

Conclusion:

  • The typical set is all binary sequences of length $N$: $T_\varepsilon(p;N) = \{0,1\}^N.$
  • Its size: $|T_\varepsilon(p;N)| = 2^N.$

Classroom takeaway:

  • For the completely symmetric Bernoulli(1/2) source, every sequence is equally "typical" and has exactly the same probability.

3(b) The typical set and its growth exponent for $\alpha=1/3$

Now $P(y_i=1)=1/3$, $P(y_i=0)=2/3$; let $k$ denote the number of ones in the sequence.

1. Probability of a single sequence

$p(y^N) = \left(\tfrac13\right)^k \left(\tfrac23\right)^{N-k}.$

Information density:

$-\frac1N\log p(y^N) = -\frac{k}{N}\log\tfrac13 - \left(1-\frac{k}{N}\right)\log\tfrac23.$

Writing $\hat{p}_N = k/N$ for the empirical frequency of ones,

$-\frac1N\log p(y^N) = -\hat{p}_N\log\tfrac13 -(1-\hat{p}_N)\log\tfrac23,$

while the source entropy is

$H(p) = H_B(1/3) = -\tfrac13\log\tfrac13 - \tfrac23\log\tfrac23.$

2. The typicality condition says "the fraction of ones is close to 1/3"

The typicality condition

$\left|-\frac1N\log p(y^N) - H_B(1/3)\right|\le\varepsilon$

is a constraint on $\hat{p}_N$:
the left-hand expression is a continuous (in fact affine) function of $\hat{p}_N$ with slope $\log\frac{2/3}{1/3}=1$ bit, and it equals $H_B(1/3)$ at $\hat{p}_N=1/3$. So there is a $\delta(\varepsilon)$ (here simply $\delta(\varepsilon)=\varepsilon$) such that

$|\hat{p}_N - 1/3|\le\delta(\varepsilon)$

is equivalent to the typicality condition.

Therefore the typical set can be written as

$T_\varepsilon(p;N) = \left\{ y^N:\ \left|\frac{\#\{i:y_i=1\}}{N} - \frac13\right|\le\delta(\varepsilon)\right\}.$

3. Growth exponent of the size

The standard typical-sequence theorem (AEP) for a discrete memoryless source says:

  • For any $\eta>0$ and all large enough $N$, $(1-\eta)\,2^{N(H(p)-\varepsilon)} \le |T_\varepsilon(p;N)| \le 2^{N(H(p)+\varepsilon)}.$

With $\alpha=1/3$,

$H(p) = H_B(1/3),$

so, up to the $\pm\varepsilon$ slack, the growth exponent of $|T_\varepsilon(p;N)|$ is

$\lim_{N\to\infty}\frac1N\log|T_\varepsilon(p;N)| = H_B(1/3).$

Worth adding in class:

  • Roughly, for large $N$, $|T_\varepsilon(p;N)| \approx 2^{N H_B(1/3)}.$
  • Each typical sequence has probability about $2^{-N H_B(1/3)}.$
  • So essentially all of the probability mass is concentrated on these "about $2^{NH}$" typical sequences. (The counting sketch below makes this concrete.)
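If you have a minute in class, this small counting experiment (illustrative, with arbitrary $N$ and $\delta$) shows both facts: the size exponent approaches $H_B(1/3)$ up to the $\pm\delta$ slack, and the set captures most of the probability as $N$ grows.

```python
from math import comb, log2

def hb(p):
    return -p*log2(p) - (1-p)*log2(1-p)

alpha, delta = 1/3, 0.02
for N in (200, 1000, 5000):
    # sequences whose fraction of ones is within delta of 1/3
    ks = [k for k in range(N + 1) if abs(k/N - alpha) <= delta]
    count = sum(comb(N, k) for k in ks)
    # total probability carried by that set (summed in log space to avoid overflow)
    mass = sum(2.0**(log2(comb(N, k)) + k*log2(alpha) + (N - k)*log2(1 - alpha)) for k in ks)
    # exponent is H_B(1/3) + O(delta); shrinking delta pushes it toward H_B(1/3)
    print(N, round(log2(count)/N, 4), "vs H_B(1/3) =", round(hb(alpha), 4),
          "| P(set) ~", round(mass, 3))
```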

Classroom takeaways:

  • The size exponent of the typical set is exactly the source entropy $H(p)$.
  • "Entropy ≈ information content of each typical sequence ≈ exponent of the number of typical sequences" is one of the most important intuitions in information theory.

Part II: English LaTeX lecture script (ready to use)

Below is a draft that can go straight into a .tex file; the style is "what I would say in class + the formulas", and you can trim it as needed.

latex
\documentclass[11pt]{article} \usepackage{amsmath, amssymb, amsfonts} \usepackage{bm} \newcommand{\KL}{D} \newcommand{\E}{\mathbb{E}} \newcommand{\HB}{H_{\mathrm{B}}} \begin{document} \title{Homework 4 -- Discussion Session Notes} \author{Theory and Methods for Statistical Inference} \date{} \maketitle In this recitation we go through all problems of Homework 4. The audience is assumed to be undergraduate students with a basic background in probability and information theory. I will try to emphasize geometric pictures and key takeaways. Throughout, for two distributions $p$ and $q$ on the same finite alphabet, we write the Kullback--Leibler divergence as $$ \KL(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}. $$ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Problem 1: Reverse I-projection} \subsection*{Set-up and intuition} \textbf{I-projection vs.\ reverse I-projection.} Given a convex set of distributions $\mathcal{P}$ and a fixed distribution $q$, the (forward) I-projection is $$ p^\star = \arg\min_{p\in\mathcal{P}} \KL(p\|q). $$ In contrast, given a set $\mathcal{Q}$ and a fixed distribution $p$, the \emph{reverse} I-projection is $$ q^\star = \arg\min_{q\in\mathcal{Q}} \KL(p\|q). $$ So in reverse I-projection, the first argument $p$ is fixed, and we move $q$ inside the set $\mathcal{Q}$. Geometrically, we can think of KL divergence as a kind of ``distance'' (although it is not symmetric), and I-projection means ``the closest point in the set''. \subsection*{(a) Pythagoras' theorem for reverse I-projection} \textbf{Goal.} Let $\mathcal{Q}$ be nonempty, closed, and \emph{log-convex}. Let $q^\star$ be the reverse I-projection of $p$ onto $\mathcal{Q}$: $$ q^\star = \arg\min_{q\in\mathcal{Q}} \KL(p\|q). $$ We want to show that for all $q\in\mathcal{Q}$, $$ \KL(p\|q) \;\ge\; \KL(p\|q^\star) + \KL(q^\star\|q). $$ \textbf{Log-convex combinations.} For any $r,s\in\mathcal{Q}$ that are not mutually singular, and $\lambda\in[0,1]$, we define the log-convex combination $$ q_\lambda(x) = \frac{r(x)^\lambda s(x)^{1-\lambda}} {\sum_z r(z)^\lambda s(z)^{1-\lambda}}. $$ The set $\mathcal{Q}$ is log-convex if for every such $r,s$ and $\lambda$, the distribution $q_\lambda$ belongs to $\mathcal{Q}$. You can think of this as ``taking a straight line in log-space'' between $r$ and $s$, and staying inside the set along this path. \medskip \textbf{Step 1: Follow a log-geodesic from $q^\star$ to $q$.} Fix any $q\in\mathcal{Q}$ and set $$ q_\lambda(x) = \frac{q(x)^\lambda q^\star(x)^{1-\lambda}} {\sum_z q(z)^\lambda q^\star(z)^{1-\lambda}},\quad \lambda\in[0,1]. $$ Then $$ q_0 = q^\star,\qquad q_1 = q, $$ and $q_\lambda\in\mathcal{Q}$ for all $\lambda$ by log-convexity. Define $$ f(\lambda) = \KL(p\|q_\lambda). $$ Because $q^\star$ is the minimizer of $\KL(p\|\cdot)$ over $\mathcal{Q}$, it must also minimize along this particular path: $$ f(\lambda) \ge f(0)\quad\text{for all }\lambda\in[0,1]. $$ In particular, the right derivative at $\lambda=0$ satisfies $$ f'(0^+) \ge 0. $$ \medskip \textbf{Step 2: Compute the derivative $f'(0)$.} We write $$ \log q_\lambda(x) = \lambda\log q(x) + (1-\lambda)\log q^\star(x) - \log Z(\lambda), $$ where $$ Z(\lambda) = \sum_z q(z)^\lambda q^\star(z)^{1-\lambda} $$ is the normalizing constant. Then $$ f(\lambda) = \sum_x p(x)\Big(\log p(x) - \log q_\lambda(x)\Big), $$ and $$ f'(\lambda) = -\sum_x p(x)\,\frac{\partial}{\partial\lambda}\log q_\lambda(x). 
$$ A straightforward but slightly tedious calculation (which you can carry out on the board) shows that at $\lambda=0$, $$ f'(0) = \KL(p\|q) - \KL(p\|q^\star) - \KL(q^\star\|q). $$ Intuitively, this derivative compares how fast the divergence increases when we move from $q^\star$ in the direction of $q$. \medskip \textbf{Step 3: Use the optimality of $q^\star$.} Since $f$ has a minimum at $\lambda=0$ along this path, we must have $f'(0^+)\ge 0$, so $$ \KL(p\|q) - \KL(p\|q^\star) - \KL(q^\star\|q) \;\ge\; 0, $$ which is exactly the desired inequality: $$ \KL(p\|q) \;\ge\; \KL(p\|q^\star) + \KL(q^\star\|q). $$ \medskip \textbf{Takeaway:} for a log-convex set $\mathcal{Q}$, the reverse I-projection $q^\star$ behaves like the foot of a perpendicular in a ``KL geometry'': going from $p$ to any other point $q\in\mathcal{Q}$ costs at least the divergence from $p$ to $q^\star$ plus the divergence from $q^\star$ to $q$. \subsection*{(b) Practice: exponential and linear families} Here we recall: \begin{itemize} \item Given a base distribution $q^\star$ and a sufficient statistic $t(x)\in\mathbb{R}^d$, the exponential family is $$ \mathcal{E}_t(q^\star) = \left\{ q_\theta(x) = \exp(\theta^\top t(x) - \psi(\theta))\,q^\star(x) \right\}. $$ \item The associated linear family is $$ \mathcal{L}_t(q^\star) = \left\{ p:\ \E_p[t(X)] = \E_{q^\star}[t(X)] \right\}. $$ \end{itemize} \textbf{Claim.} $q^\star$ is the reverse I-projection of $p$ onto $\mathcal{E}_t(q^\star)$ if and only if $p\in\mathcal{L}_t(q^\star)$. \medskip \textbf{Proof idea: look at derivatives of $\KL(p\|q_\theta)$ in $\theta$.} Using the definition of $q_\theta$, we get $$ \KL(p\|q_\theta) = \sum_x p(x)\log\frac{p(x)}{q_\theta(x)} = \text{constant in $\theta$} + \psi(\theta) - \theta^\top\E_p[t(X)]. $$ Differentiating with respect to $\theta$, $$ \nabla_\theta \KL(p\|q_\theta) = \nabla\psi(\theta) - \E_p[t(X)] = \E_{q_\theta}[t(X)] - \E_p[t(X)]. $$ If $q^\star$ is the reverse I-projection, then it must correspond to some parameter, say $\theta=0$, and at this point the gradient must vanish: $$ \E_{q^\star}[t(X)] = \E_p[t(X)]. $$ So $p$ belongs to $\mathcal{L}_t(q^\star)$. Conversely, if $p\in\mathcal{L}_t(q^\star)$, then at $\theta=0$ the gradient vanishes, and by convexity of $\KL(p\|q_\theta)$ in $\theta$, this point is a global minimizer. Hence $q^\star$ is indeed the reverse I-projection. \medskip \textbf{Takeaway:} in exponential families, \emph{minimizing} KL divergence over the family is equivalent to \emph{matching} the expected sufficient statistics. This is the core of many maximum entropy and maximum likelihood results. \subsection*{(c) Another characterization of reverse I-projection} We are given a nonempty closed \emph{convex} set $\mathcal{Q}$. Let $p$ be any distribution. We want to prove: $$ q^\star\ \text{is the reverse I-projection of $p$ onto $\mathcal{Q}$} $$ if and only if for all $p'$ and any $q'\in\mathcal{Q}$, $$ \KL(p'\|q') + \KL(p'\|p) \;\ge\; \KL(p'\|q^\star). $$ This is a kind of ``three-point inequality'' involving $p',p,q^\star$ and $q'$. \medskip \textbf{(``If'' direction)} Assume the inequality holds for all $p',q'$. If we plug in $p'=p$, we obtain $$ \KL(p\|q') + \KL(p\|p) \ge \KL(p\|q^\star). $$ But $\KL(p\|p)=0$, hence $$ \KL(p\|q') \ge \KL(p\|q^\star)\quad\forall q'\in\mathcal{Q}. $$ Therefore $q^\star$ is indeed the minimizer of $\KL(p\|\cdot)$ over $\mathcal{Q}$, i.e., the reverse I-projection. \medskip \textbf{(``Only if'' direction)} Suppose $q^\star$ is the reverse I-projection. 
In this direction, the algebra is more involved. A clean way to think about it is to view KL divergence as a \emph{Bregman divergence} generated by the negative entropy function $$ \varphi(r) = \sum_x r(x)\log r(x). $$ Then $$ \KL(p\|q) = \varphi(p) - \varphi(q) - \nabla\varphi(q)^\top(p-q). $$ For a convex set $\mathcal{Q}$, the Bregman projection $q^\star$ of $p$ onto $\mathcal{Q}$ is characterized by a first-order (KKT-like) condition involving $\nabla\varphi(q^\star)$. By expanding the Bregman divergence expressions for $\KL(p'\|q')$, $\KL(p'\|p)$, and $\KL(p'\|q^\star)$ and using this condition, one can show that $$ \KL(p'\|q') + \KL(p'\|p) - \KL(p'\|q^\star) \ge 0 $$ for all $p'$ and all $q'\in\mathcal{Q}$. In a homework-level solution, you can follow the same pattern as in the usual Pythagoras theorem for forward I-projection: express the difference in KLs, use convexity of $\mathcal{Q}$, and apply the optimality of $q^\star$. \medskip \textbf{Takeaway:} this inequality says that $q^\star$ is the ``best possible'' element of $\mathcal{Q}$ in a very strong sense: for \emph{every} distribution $p'$, using $q^\star$ is never worse than using any $q'\in\mathcal{Q}$, up to the additive ``penalty term'' $\KL(p'\|p)$, which does not depend on $q'$. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Problem 2: Modeling and capacity} We have a parameter $x\in\{0,1,2\}$ and an observable binary variable $y\in\{0,1\}$. The conditional distributions of $y$ given $x$ are $$ \begin{aligned} p_y(y;0) &= \varepsilon^y(1-\varepsilon)^{1-y} &&\text{(Bernoulli($\varepsilon$)),} \\ p_y(y;1) &= (1-\varepsilon)^y\,\varepsilon^{1-y} &&\text{(Bernoulli($1-\varepsilon$)),} \\ p_y(y;2) &= 1/2 &&\text{(pure noise),} \end{aligned} $$ with $0<\varepsilon<1/2$. Let $p_x$ be a prior distribution on $\{0,1,2\}$ and $I_{p_x}(x;y)$ denote the mutual information between $x$ and $y$ under $p_x$. The \emph{model capacity} is $$ C = I_{p_x^\star}(x;y), $$ where $p_x^\star$ is the capacity-achieving (least informative) prior. \subsection*{(a)(i) Expressing the mutual information} We denote $H_B(p) = -p\log p-(1-p)\log(1-p)$, the entropy of a Bernoulli($p$). Let $$ p_x(0)=p_0,\quad p_x(1)=p_1,\quad p_x(2)=p_2. $$ \textbf{Step 1: Compute the marginal of $Y$.} $$ \begin{aligned} P(Y=1) &= p_0\varepsilon + p_1(1-\varepsilon) + p_2\cdot\frac12, \\ P(Y=0) &= 1 - P(Y=1). \end{aligned} $$ Subtract $1/2$: $$ \begin{aligned} P(Y=1) - \tfrac12 &= p_0(\varepsilon-\tfrac12) + p_1((1-\varepsilon)-\tfrac12) \\ &= (\tfrac12-\varepsilon)(p_1-p_0). \end{aligned} $$ Therefore $$ P(Y=1) = \tfrac12 + \big(\tfrac12-\varepsilon\big)(p_1-p_0), $$ and this does \emph{not} depend on $p_2$. Thus $$ H(Y) = H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big). $$ \textbf{Step 2: Compute $H(Y|X)$.} $$ \begin{aligned} H(Y|X) &= p_0 H_B(\varepsilon) + p_1 H_B(1-\varepsilon) + p_2 H_B(1/2) \\ &= (p_0+p_1)H_B(\varepsilon) + p_2 H_B(1/2) = (1-p_2)H_B(\varepsilon) + p_2 H_B(1/2), \end{aligned} $$ because $H_B(1-\varepsilon) = H_B(\varepsilon)$. \textbf{Step 3: Mutual information.} $$ \begin{aligned} I_{p_x}(x;y) &= H(Y) - H(Y|X) \\ &= H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big) - (1-p_2)H_B(\varepsilon) - p_2H_B(1/2) \\ &= H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big) - H_B(\varepsilon) - p_2\big(H_B(1/2) - H_B(\varepsilon)\big). 
\end{aligned} $$ Comparing with the required form $$ I_{p_x}(x;y) = H_B\!\Big(\tfrac12 + f(\varepsilon)(p_x(1)-p_x(0))\Big) - g(\varepsilon)p_x(2) - H_B(\varepsilon), $$ we identify $$ f(\varepsilon) = \tfrac12 - \varepsilon, \qquad g(\varepsilon) = H_B(1/2) - H_B(\varepsilon). $$ \subsection*{(a)(ii) The capacity-achieving prior and $C$} We now maximize $I_{p_x}(x;y)$ over $p_0,p_1,p_2$. \textbf{Observation 1: $x=2$ only adds noise.} The term involving $p_2$ is $$ -p_2\big(H_B(1/2)-H_B(\varepsilon)\big), $$ and $H_B(1/2)>H_B(\varepsilon)$. So for fixed $(p_1-p_0)$, increasing $p_2$ strictly \emph{decreases} the mutual information. Hence the optimum must satisfy $$ p_2^\star = 0. $$ \textbf{Observation 2: with $p_2=0$, maximize the entropy of $Y$.} When $p_2=0$, we have $p_0+p_1=1$ and $$ P(Y=1) = \tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0). $$ Then $$ I(x;y) = H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big) - H_B(\varepsilon). $$ The binary entropy $H_B(\cdot)$ is concave in its argument and is maximized at $1/2$. Thus, to maximize $I(x;y)$, we want $P(Y=1)=1/2$, i.e. $$ p_1-p_0 = 0 \quad\Rightarrow\quad p_1=p_0=1/2. $$ \textbf{Therefore} $$ p_x^\star(0)=p_x^\star(1)=\tfrac12,\quad p_x^\star(2)=0, $$ and $$ C = I_{p_x^\star}(x;y) = H_B(1/2) - H_B(\varepsilon). $$ So in the expression $$ C = \alpha H_B(1/2) + \beta H_B(\varepsilon) $$ we have $$ \alpha = 1,\qquad \beta = -1. $$ \subsection*{(a)(iii) The mixture distribution $p_y^\star$} Using the optimal prior $p_x^\star$, $$ \begin{aligned} P^\star(Y=1) &= \tfrac12\varepsilon + \tfrac12(1-\varepsilon) + 0\cdot\tfrac12 \\ &= \tfrac12,\\ P^\star(Y=0) &= \tfrac12. \end{aligned} $$ Thus $$ p_y^\star(0) = p_y^\star(1) = \tfrac12. $$ \subsection*{(b)(i) Mixture distribution with uniform prior} Now take $p_x(x)=1/3$ for all $x$. We get $$ \begin{aligned} P(Y=1) &= \tfrac13\varepsilon + \tfrac13(1-\varepsilon) + \tfrac13\times\tfrac12 \\ &= \tfrac13\Big(\varepsilon + 1 - \varepsilon + \tfrac12\Big) = \tfrac13\cdot\tfrac32 = \tfrac12, \end{aligned} $$ and $P(Y=0)=1/2$. So again, $$ p_y(0)=p_y(1)=\tfrac12. $$ \subsection*{(b)(ii) Information loss under the uniform prior} \textbf{Mutual information for the uniform prior.} We already know $H(Y)=H_B(1/2)$. The conditional entropy is $$ H(Y|X) = \tfrac13 H_B(\varepsilon) + \tfrac13 H_B(1-\varepsilon) + \tfrac13 H_B(1/2) = \tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2). $$ Thus $$ \begin{aligned} I_{\text{unif}} &= H(Y) - H(Y|X) \\ &= H_B(1/2) - \left(\tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2)\right) \\ &= \tfrac23 H_B(1/2) - \tfrac23 H_B(\varepsilon) \\ &= \tfrac23\big(H_B(1/2) - H_B(\varepsilon)\big) \\ &= \tfrac23 C. \end{aligned} $$ \textbf{Information loss.} $$ C - I_{\text{unif}} = C - \tfrac23 C = \tfrac13 C = \tfrac13 H_B(1/2) - \tfrac13 H_B(\varepsilon). $$ So the constants in $$ C - I_{p_x}(x;y) = \gamma H_B(1/2) + \delta H_B(\varepsilon) $$ are $$ \gamma = \tfrac13,\qquad \delta = -\tfrac13. $$ \subsection*{(c) General model: expressing the loss as a KL divergence} Now consider a general model $p_y(\cdot;x)$ for $y\in\mathcal{Y}$, parameterized by $x\in\mathcal{X}$. Let $p_x^\star$ be a capacity-achieving prior and $p_y^\star$ be the corresponding marginal distribution of $y$. Let $q_x$ be an arbitrary distribution on $\mathcal{X}$. Assume $p_x^\star(x) > 0$ for all $x\in\mathcal{X}$. \textbf{Equidistance property.} For a capacity-achieving prior with full support, we have $$ \KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = C $$ for all $x$. 
\textbf{Mixture under $q_x$.} Let $$ q_y(y) = \sum_x q_x(x)\,p_{y|x}(y|x) $$ be the marginal distribution of $y$ under $q_x$. \textbf{Key identity.} We can show $$ \sum_x q_x(x)\,\KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = I_{q_x}(x;y) + \KL(q_y\|p_y^\star). $$ This comes from writing $$ \log\frac{p_{y|x}(y|x)}{p_y^\star(y)} = \log\frac{p_{y|x}(y|x)}{q_y(y)} + \log\frac{q_y(y)}{p_y^\star(y)} $$ and then averaging over $x$ and $y$. \textbf{Apply equidistance.} By the equidistance property, the left-hand side equals $$ \sum_x q_x(x)\cdot C = C, $$ so $$ C = I_{q_x}(x;y) + \KL(q_y\|p_y^\star). $$ Therefore $$ C - I_{q_x}(x;y) = \KL(q_y\|p_y^\star). $$ So we can take $$ q_1(\cdot) = q_y(\cdot),\qquad q_2(\cdot) = p_y^\star(\cdot). $$ \subsection*{(d) The case where $p_x^\star$ is not full support} Now suppose that $p_x^\star(x)=0$ for some $x$. In this case, the Kuhn--Tucker conditions for channel capacity say: $$ \KL(p_{y|x}(\cdot|x)\|p_y^\star(\cdot)) \le C,\quad\forall x, $$ with equality if and only if $p_x^\star(x)>0$. Therefore, for any prior $q_x$, $$ \sum_x q_x(x)\,\KL(p_{y|x}(\cdot|x)\|p_y^\star(\cdot)) \le C. $$ Using the same decomposition as before, $$ \sum_x q_x(x)\,\KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = I_{q_x}(x;y) + \KL(q_y\|p_y^\star). $$ Hence $$ I_{q_x}(x;y) + \KL(q_y\|p_y^\star) \le C, $$ and so $$ C - I_{q_x}(x;y) \ge \KL(q_y\|p_y^\star). $$ This corresponds to option \textbf{B}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Problem 3: Typical sequences} Let $y_1,\dots,y_N$ be i.i.d.\ Bernoulli random variables with $$ \mathbb{P}(y_i=1)=\alpha,\quad \mathbb{P}(y_i=0)=1-\alpha. $$ Let $p$ be the distribution of a single $y_i$, and let all logarithms be base 2. The typical set (weakly typical) is defined as $$ T_\varepsilon(p;N) = \left\{ y^N:\ \left| -\frac{1}{N}\log p(y^N) - H(p)\right|\le\varepsilon \right\}, $$ where $H(p)$ is the entropy of a single Bernoulli($\alpha$) variable. \subsection*{(a) Case $\alpha=1/2$} If $\alpha=1/2$, then each $y_i$ is Bernoulli($1/2$), and for any sequence $y^N\in\{0,1\}^N$ we have $$ p(y^N) = \left(\tfrac12\right)^N = 2^{-N}. $$ Therefore $$ -\frac{1}{N}\log p(y^N) = -\frac{1}{N}\log 2^{-N} = 1 $$ for all sequences. The entropy of a single Bernoulli($1/2$) variable is $$ H(p) = H_B(1/2) = 1. $$ Thus, for any $\varepsilon>0$, $$ \left| -\frac{1}{N}\log p(y^N) - H(p)\right| = |1-1| = 0 \le \varepsilon, $$ so $$ T_\varepsilon(p;N) = \{0,1\}^N, $$ and $$ |T_\varepsilon(p;N)| = 2^N. $$ \subsection*{(b) Case $\alpha=1/3$: typical set and growth exponent} Now assume $\alpha=1/3$, so $$ \mathbb{P}(y_i=1)=\tfrac13,\quad \mathbb{P}(y_i=0)=\tfrac23. $$ Let $k$ be the number of ones in the sequence $y^N$: $$ k = \#\{i:y_i=1\}. $$ \textbf{Probability of a given sequence.} Every sequence with $k$ ones and $N-k$ zeros has probability $$ p(y^N) = \left(\tfrac13\right)^k\left(\tfrac23\right)^{N-k}. $$ Hence $$ -\frac{1}{N}\log p(y^N) = -\frac{k}{N}\log\tfrac13 - \left(1-\frac{k}{N}\right)\log\tfrac23. $$ Let $\hat{p}_N = k/N$ be the empirical fraction of ones. Then $$ -\frac{1}{N}\log p(y^N) = -\hat{p}_N\log\tfrac13 - (1-\hat{p}_N)\log\tfrac23. $$ The entropy is $$ H(p) = H_B(1/3) = -\tfrac13\log\tfrac13 - \tfrac23\log\tfrac23. $$ \textbf{Typical set description.} The typicality condition $$ \left| -\frac{1}{N}\log p(y^N) - H_B(1/3)\right|\le\varepsilon $$ is a condition on $\hat{p}_N$. 
Because the function of $\hat{p}_N$ above is continuous and achieves $H_B(1/3)$ at $\hat{p}_N=1/3$, there exists some $\delta(\varepsilon)>0$ such that, for large $N$, $$ \left|\hat{p}_N - \tfrac13\right|\le\delta(\varepsilon) $$ is equivalent to the typicality condition. So we can write $$ T_\varepsilon(p;N) = \left\{ y^N:\ \left|\frac{\#\{i:y_i=1\}}{N} - \frac13\right|\le\delta(\varepsilon) \right\}. $$ \textbf{Growth exponent of $|T_\varepsilon(p;N)|$.} The asymptotic equipartition property (AEP) for discrete memoryless sources states that, for large $N$, $$ (1-\eta)\,2^{N(H(p)-\varepsilon)} \;\le\; |T_\varepsilon(p;N)| \;\le\; 2^{N(H(p)+\varepsilon)} $$ for some small $\eta>0$. Thus the exponential growth rate of $|T_\varepsilon(p;N)|$ is $$ \lim_{N\to\infty}\frac{1}{N}\log |T_\varepsilon(p;N)| = H(p) = H_B(1/3). $$ Intuitively, for large $N$, $$ |T_\varepsilon(p;N)| \approx 2^{N H_B(1/3)}. $$ \medskip \textbf{Key takeaway:} for a Bernoulli source, the number of typical sequences is roughly $2^{NH(p)}$, and each typical sequence has probability around $2^{-NH(p)}$. Most of the probability mass is concentrated on these typical sequences. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document}

If you like, I can also prepare a slimmed-down version and a "blackboard version" of this LaTeX script, or turn the teaching points above into a slide (PPT) outline.
