I need to explain all the solution ideas of this homework, in English, to undergraduates with an average background during the recitation session. Please work out the solutions, write the teaching points in Chinese, and dra...
Created on: 25 November 2025
Answered using GPT-5.1 Pro via Chat01
Here are the two parts, directly:
The problems are from Homework 4. (HW_4.pdf)
Below, every problem and every sub-question is organized as "idea + skeleton of the computation + classroom takeaway". You can decide, depending on the available time, how much of the derivation detail to present.
Problem 1(a): Pythagoras' theorem for the reverse I-projection.

The problem asks us to prove:
if $\mathcal{Q}$ is nonempty, closed, and log-convex, and $q^\star = \arg\min_{q\in\mathcal{Q}} D(p\|q)$,
then for all $q\in\mathcal{Q}$, $D(p\|q) \ge D(p\|q^\star) + D(q^\star\|q)$.

Key concept: log-convex set. $\mathcal{Q}$ is log-convex if, for any $r,s\in\mathcal{Q}$ and $\lambda\in[0,1]$, the normalized geometric mixture $q_\lambda(x)\propto r(x)^\lambda s(x)^{1-\lambda}$ also lies in $\mathcal{Q}$.

Proof idea (classroom version):

Pick a "log line" inside $\mathcal{Q}$.
Fix any $q\in\mathcal{Q}$ and, together with $q^\star$, construct
$q_\lambda(x) = q(x)^\lambda q^\star(x)^{1-\lambda}\big/\sum_z q(z)^\lambda q^\star(z)^{1-\lambda}$,
where $\lambda\in[0,1]$, so that $q_0=q^\star$, $q_1=q$, and all $q_\lambda\in\mathcal{Q}$ by log-convexity.

Watch how the KL divergence changes along this curve.
Define $f(\lambda) = D(p\|q_\lambda)$.
Because $q^\star$ is the minimizer over $\mathcal{Q}$, it must also be a minimum along this particular path, so $f(\lambda)\ge f(0)$ for all $\lambda$, and in particular $f'(0^+)\ge 0$.

Compute $f'(0)$ (worth doing once on the board):
$f'(0) = D(p\|q) - D(p\|q^\star) - D(q^\star\|q)$.

Use the derivative condition at the minimum.
Since $f'(0^+)\ge 0$, we get $D(p\|q) \ge D(p\|q^\star) + D(q^\star\|q)$.

Classroom takeaway: for a log-convex set, the reverse I-projection $q^\star$ acts like the foot of a perpendicular in "KL geometry": going from $p$ to any other $q\in\mathcal{Q}$ costs at least $D(p\|q^\star)$ plus $D(q^\star\|q)$.
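If students want a reality check on the derivative formula, the following minimal Python sketch (my own illustration, not part of the homework) estimates $f'(0)$ by a finite difference for arbitrary distributions in the roles of $p$, $q^\star$, and $q$, and compares it with the closed form $D(p\|q) - D(p\|q^\star) - D(q^\star\|q)$; all names and values are made up for the demo.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p||q) in nats, assuming full support."""
    return float(np.sum(p * np.log(p / q)))

def geo_mix(q, qstar, lam):
    """Normalized geometric mixture q_lambda ∝ q^lam * qstar^(1-lam)."""
    w = q**lam * qstar**(1 - lam)
    return w / w.sum()

rng = np.random.default_rng(0)
p, qstar, q = (rng.dirichlet(np.ones(5)) for _ in range(3))

f = lambda lam: kl(p, geo_mix(q, qstar, lam))
h = 1e-6
fprime0 = (f(h) - f(0.0)) / h                        # finite-difference estimate of f'(0)
closed_form = kl(p, q) - kl(p, qstar) - kl(qstar, q)
print(fprime0, closed_form)                          # the two values should agree closely
```

Note that the derivative identity itself holds for any $q^\star$, not only the optimal one; optimality is what upgrades $f'(0^+)\ge 0$ into the Pythagorean inequality.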
Problem 1(b): exponential and linear families.
You can present this part as one concrete instance of the duality between exponential families and linear families.

Setup recap: given a base distribution $q^\star$ and a statistic $t(x)\in\mathbb{R}^d$, the exponential family is $\mathcal{E}_t(q^\star) = \{ q_\theta(x) = \exp(\theta^\top t(x) - \psi(\theta))\, q^\star(x) \}$, and the associated linear family is $\mathcal{L}_t(q^\star) = \{ p:\ \mathbb{E}_p[t(X)] = \mathbb{E}_{q^\star}[t(X)] \}$.

To prove: $q^\star$ is the reverse I-projection of $p$ onto the exponential family $\mathcal{E}_t(q^\star)$
if and only if $p\in\mathcal{L}_t(q^\star)$.

Idea: look at the derivative of $D(p\|q_\theta)$ with respect to $\theta$:
$\nabla_\theta D(p\|q_\theta) = \mathbb{E}_{q_\theta}[t(X)] - \mathbb{E}_p[t(X)]$.
It vanishes at $\theta=0$ (i.e., at $q^\star$) exactly when the expected statistics match, and convexity of $D(p\|q_\theta)$ in $\theta$ turns this stationary point into a global minimum.

Classroom takeaway: within an exponential family, minimizing KL divergence is the same as matching expected sufficient statistics; this is the engine behind many maximum entropy and maximum likelihood results.
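To make the moment-matching message concrete, here is a small numerical sketch (the alphabet, statistic $t$, and distributions are my own illustrative choices): we tilt a base distribution along a one-parameter exponential family, minimize $D(p\|q_\theta)$ by grid search, and check that at the minimizer the mean of $t$ under $q_\theta$ matches $\mathbb{E}_p[t(X)]$.

```python
import numpy as np

# Base distribution q*, statistic t, and the distribution p to project (illustrative).
qstar = np.array([0.2, 0.5, 0.3])
t = np.array([0.0, 1.0, 2.0])
p = np.array([0.1, 0.3, 0.6])

def q_theta(theta):
    """One-parameter exponential family: tilt q* by exp(theta * t) and renormalize."""
    w = np.exp(theta * t) * qstar
    return w / w.sum()

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Minimize D(p || q_theta) over theta by a simple grid search.
thetas = np.linspace(-5, 5, 20001)
best = min(thetas, key=lambda th: kl(p, q_theta(th)))

# Moment matching: E_{q_theta}[t] at the minimizer should be ≈ E_p[t].
print(np.dot(q_theta(best), t), np.dot(p, t))
```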
Problem 1(c): another characterization of the reverse I-projection.

Problem statement: we are given a distribution $p$ and a nonempty, closed, convex set $\mathcal{Q}$.
Prove: $q^\star$ is the reverse I-projection of $p$ onto $\mathcal{Q}$ if and only if, for every distribution $p'$ and every $q'\in\mathcal{Q}$,
$D(p'\|q') + D(p'\|p) \ge D(p'\|q^\star)$.
You can read this as: $q^\star$ is the point for which no $p'$ ever does worse with $q^\star$ than with any other $q'\in\mathcal{Q}$, once we allow the compensation term $D(p'\|p)$ (which does not depend on $q'$).

"If" direction (⇐): easy, worth writing out in full in class. Set $p'=p$; since $D(p\|p)=0$, the inequality becomes $D(p\|q')\ge D(p\|q^\star)$ for all $q'\in\mathcal{Q}$, which is exactly the definition of the reverse I-projection.

"Only if" direction (⇒): present at the level of ideas. Expand the three KL terms, use convexity of $\mathcal{Q}$, and apply the first-order optimality condition of $q^\star$, in the same spirit as the Pythagoras theorem for the forward I-projection.

Classroom takeaway: committing to $q^\star$ is a uniformly safe choice inside $\mathcal{Q}$: whatever the true distribution $p'$ turns out to be, the extra loss relative to any other $q'\in\mathcal{Q}$ is at most $D(p'\|p)$.
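An empirical sanity check can help here too (illustrative only; the set, the distributions, and the grid search are my own choices): take $\mathcal{Q}$ to be the convex segment between two random distributions, approximate the reverse I-projection $q^\star$ of a random $p$ by grid search, and verify that the three-point quantity stays nonnegative over many random $p'$ and $q'\in\mathcal{Q}$.

```python
import numpy as np

def kl(a, b):
    """KL divergence in nats for full-support distributions."""
    return float(np.sum(a * np.log(a / b)))

rng = np.random.default_rng(1)
r, s, p = (rng.dirichlet(np.ones(4)) for _ in range(3))
segment = lambda lam: lam * r + (1 - lam) * s      # Q = convex segment between r and s

# Approximate the reverse I-projection q* of p onto Q by grid search over lambda.
lams = np.linspace(0.0, 1.0, 5001)
qstar = segment(min(lams, key=lambda lam: kl(p, segment(lam))))

# Check D(p'||q') + D(p'||p) - D(p'||q*) >= 0 for many random p' and q' in Q.
worst = min(
    kl(pp, segment(rng.uniform())) + kl(pp, p) - kl(pp, qstar)
    for pp in (rng.dirichlet(np.ones(4)) for _ in range(200))
)
print(worst)   # should be >= 0, up to the grid-search resolution
```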
Problem 2: modeling and capacity.

Let $x\in\{0,1,2\}$ and $y\in\{0,1\}$. The conditional distributions are
$p_y(y;0) = \varepsilon^y(1-\varepsilon)^{1-y}$ (Bernoulli($\varepsilon$)),
$p_y(y;1) = (1-\varepsilon)^y\varepsilon^{1-y}$ (Bernoulli($1-\varepsilon$)),
$p_y(y;2) = 1/2$ (pure noise),
where $0<\varepsilon<1/2$.
Write $p_x$ for the prior on $\{0,1,2\}$, $I_{p_x}(x;y)$ for the mutual information under $p_x$, and define the model capacity
$C = I_{p_x^\star}(x;y)$,
where $p_x^\star$ is the prior that maximizes the mutual information (the "least informative prior", which is in fact the capacity-achieving prior).

(a)(i) We need to find $f(\varepsilon)$ and $g(\varepsilon)$ such that
$I_{p_x}(x;y) = H_B\big(\tfrac12 + f(\varepsilon)(p_x(1)-p_x(0))\big) - g(\varepsilon)\,p_x(2) - H_B(\varepsilon)$,
where $H_B(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy.
Idea: use $I_{p_x}(x;y) = H(Y) - H(Y|X)$ directly.
Write $p_0 = p_x(0)$, $p_1 = p_x(1)$, $p_2 = p_x(2)$.
Compute the marginal of $y$:
$P(Y=1) = p_0\varepsilon + p_1(1-\varepsilon) + p_2\cdot\tfrac12$.
Subtract $1/2$ term by term:
note that $\varepsilon - \tfrac12 = -(\tfrac12-\varepsilon)$ and $(1-\varepsilon)-\tfrac12 = \tfrac12-\varepsilon$.
So $P(Y=1) - \tfrac12 = (\tfrac12-\varepsilon)(p_1-p_0)$,
i.e., $P(Y=1) = \tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)$.
Particularly important: this does not depend on $p_2$.
Hence $H(Y) = H_B\big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\big)$.
Conditional entropy: $H(Y|X) = (1-p_2)H_B(\varepsilon) + p_2 H_B(1/2)$,
because $H_B(1-\varepsilon) = H_B(\varepsilon)$.
Mutual information:
$I_{p_x}(x;y) = H_B\big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\big) - H_B(\varepsilon) - p_2\big(H_B(1/2)-H_B(\varepsilon)\big)$.
Matching the form in the problem statement gives $f(\varepsilon) = \tfrac12 - \varepsilon$ and $g(\varepsilon) = H_B(1/2) - H_B(\varepsilon)$.
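A quick numerical sanity check of this identity (the prior and the value of $\varepsilon$ below are my own illustrative choices, not part of the required solution): compute $I(X;Y)$ directly from the channel and compare with the closed form.

```python
import numpy as np

def hb(p):
    """Binary entropy in bits, with clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

eps = 0.1
px = np.array([0.2, 0.5, 0.3])                        # an arbitrary prior (p0, p1, p2)
py_given_x = np.array([[1 - eps, eps],                # x = 0: Bernoulli(eps)
                       [eps, 1 - eps],                # x = 1: Bernoulli(1 - eps)
                       [0.5, 0.5]])                   # x = 2: pure noise

py = px @ py_given_x                                  # marginal of y
I_direct = hb(py[1]) - px @ hb(py_given_x[:, 1])      # H(Y) - H(Y|X)

f, g = 0.5 - eps, hb(0.5) - hb(eps)
I_formula = hb(0.5 + f * (px[1] - px[0])) - g * px[2] - hb(eps)
print(I_direct, I_formula)                            # should agree
```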
(a)(ii) Now maximize $I_{p_x}(x;y)$ over the prior.
Look at the structure of the mutual information: the term $-p_2\big(H_B(1/2)-H_B(\varepsilon)\big)$ is strictly negative whenever $p_2>0$ (since $H_B(1/2) > H_B(\varepsilon)$ for $\varepsilon<1/2$), so $x=2$ only adds noise; and once $p_2=0$, the entropy term $H_B\big(\tfrac12+(\tfrac12-\varepsilon)(p_1-p_0)\big)$ is maximized when its argument is $1/2$, i.e., when $p_1=p_0$.
Intuitive conclusion: $p_x^\star(0)=p_x^\star(1)=\tfrac12$, $p_x^\star(2)=0$.
We obtain $C = H_B(1/2) - H_B(\varepsilon)$, i.e., $\alpha=1$ and $\beta=-1$ in $C = \alpha H_B(1/2) + \beta H_B(\varepsilon)$.
Classroom takeaway: a parameter value that produces pure noise gets zero weight in the capacity-achieving prior, and the remaining weight is split so that the output is as uniform as possible.
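If you want to convince the class without calculus, a brute-force search over the simplex reproduces the same optimizer; this is only an illustrative sketch (my own $\varepsilon$ and grid), not a proof.

```python
import numpy as np

def hb(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

eps = 0.1
f, g = 0.5 - eps, hb(0.5) - hb(eps)

def I(p0, p1):
    """Mutual information from the closed-form expression of (a)(i)."""
    p2 = 1 - p0 - p1
    return hb(0.5 + f * (p1 - p0)) - g * p2 - hb(eps)

# Brute-force search over the probability simplex on {0,1,2}.
grid = np.linspace(0, 1, 201)
best = max(((p0, p1) for p0 in grid for p1 in grid if p0 + p1 <= 1),
           key=lambda t: I(*t))
print(best, I(*best), hb(0.5) - hb(eps))   # expect (0.5, 0.5) and C = 1 - H_B(eps)
```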
(a)(iii) Use the optimal prior $p_x^\star$ as the mixture weights:
$p_y^\star(1) = \tfrac12\varepsilon + \tfrac12(1-\varepsilon) + 0\cdot\tfrac12 = \tfrac12$,
so $p_y^\star(0) = p_y^\star(1) = \tfrac12$.
Takeaway: at capacity, the output distribution looks as uniform and unbiased as possible.
(b)(i) Now $p_x(x) = 1/3$ for all $x$. Then
$p_y(1) = \tfrac13\varepsilon + \tfrac13(1-\varepsilon) + \tfrac13\cdot\tfrac12 = \tfrac12$,
so the output marginal is still $p_y(0)=p_y(1)=\tfrac12$.
(b)(ii) First compute the mutual information under the uniform prior:
$I_{\mathrm{unif}} = H_B(1/2) - \big(\tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2)\big) = \tfrac23\big(H_B(1/2)-H_B(\varepsilon)\big) = \tfrac23 C$.
The loss:
$C - I_{\mathrm{unif}} = \tfrac13 C = \tfrac13 H_B(1/2) - \tfrac13 H_B(\varepsilon)$,
so $\gamma = \tfrac13$ and $\delta = -\tfrac13$.
Classroom takeaway: the uniform prior induces exactly the same output marginal as the optimal prior, yet it loses one third of the capacity, because one third of the weight is spent on the pure-noise parameter $x=2$.
(c) General model. Let $p_y(\cdot;x)$ be any model with capacity-achieving prior $p_x^\star$ and corresponding output marginal $p_y^\star$; let $q_x$ be an arbitrary prior.
Assume $p_x^\star(x) > 0$ for all $x$.
Use the equidistance property: $D\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star\big) = C$ for every $x$.
Key identity (a chain-rule style decomposition): with $q_y(y) = \sum_x q_x(x)\,p_{y|x}(y|x)$,
$\sum_x q_x(x)\,D\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star\big) = I_{q_x}(x;y) + D(q_y\|p_y^\star)$.
Plug in the equidistance property: the left-hand side equals $C$, so
$C - I_{q_x}(x;y) = D(q_y\|p_y^\star)$.
Therefore we should take $q_1 = q_y$ and $q_2 = p_y^\star$.
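The identity $C - I_{q_x}(x;y) = D(q_y\|p_y^\star)$ can be checked numerically on any channel whose capacity-achieving prior has full support. Below is a minimal sketch using a binary symmetric channel, where the uniform prior achieves capacity $C = 1 - H_B(\varepsilon)$ and $p_y^\star$ is uniform (standard facts, used here only for the check); the prior $q_x$ and $\varepsilon$ are my own illustrative choices.

```python
import numpy as np

def hb(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def kl2(a, b):
    """KL divergence in bits between two distributions on the same alphabet."""
    return float(np.sum(a * np.log2(a / b)))

# Binary symmetric channel: uniform prior achieves capacity, p_y* is uniform.
eps = 0.1
W = np.array([[1 - eps, eps],
              [eps, 1 - eps]])
C = 1 - hb(eps)
py_star = np.array([0.5, 0.5])

qx = np.array([0.8, 0.2])                  # an arbitrary prior on the inputs
qy = qx @ W                                # induced output marginal
I_qx = sum(qx[i] * kl2(W[i], qy) for i in range(2))   # mutual information I_{q_x}(x;y)

print(C - I_qx, kl2(qy, py_star))          # the two sides of C - I = D(q_y || p_y*)
```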
(d) Now suppose there exists $x$ with $p_x^\star(x) = 0$. The classical Kuhn–Tucker conditions for capacity tell us:
$D\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star\big) \le C$ for all $x$, with equality whenever $p_x^\star(x) > 0$.
Then for any prior $q_x$:
$\sum_x q_x(x)\,D\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star\big) \le C$.
With the same decomposition as before:
$I_{q_x}(x;y) + D(q_y\|p_y^\star) \le C$,
so $C - I_{q_x}(x;y) \ge D(q_y\|p_y^\star)$.
This corresponds exactly to option B.
Classroom takeaway: when the capacity-achieving prior is not full support, the equality of part (c) weakens to an inequality, because the unused inputs can sit strictly closer than $C$ to $p_y^\star$.
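The channel of this very problem is an instance of part (d), since $p_x^\star(2)=0$. A minimal numerical sketch (the value of $\varepsilon$ is my own choice): with the uniform prior from part (b), $q_y$ equals $p_y^\star$, so the right-hand side is $0$ while the left-hand side is $C/3 > 0$, showing that the inequality can be strict.

```python
import numpy as np

def hb(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def kl2(a, b):
    return float(np.sum(a * np.log2(a / b)))

# The three-input channel from Problem 2; its capacity-achieving prior puts zero mass on x = 2.
eps = 0.1
W = np.array([[1 - eps, eps], [eps, 1 - eps], [0.5, 0.5]])
C = 1 - hb(eps)
py_star = np.array([0.5, 0.5])

qx = np.ones(3) / 3                               # the uniform prior from part (b)
qy = qx @ W                                       # equals p_y*, so D(q_y || p_y*) = 0
I_qx = sum(qx[i] * kl2(W[i], qy) for i in range(3))

print(C - I_qx, kl2(qy, py_star))                 # left side is C/3 > 0, right side is 0
```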
Problem 3: typical sequences.

Let $y_1,\dots,y_N$ be i.i.d. Bernoulli with $\mathbb{P}(y_i=1)=\alpha$, let $p$ be the distribution of a single $y_i$, and take all logarithms base 2.
Typical set definition (standard weakly typical version):
$T_\varepsilon(p;N) = \big\{ y^N:\ \big|-\tfrac1N\log p(y^N) - H(p)\big|\le\varepsilon \big\}$.
(a) When $\alpha=1/2$, each $y_i$ is Bernoulli(1/2),
and every sequence $y^N$ has $p(y^N) = 2^{-N}$, so $-\tfrac1N\log p(y^N) = 1 = H(p)$.
Hence for any $\varepsilon>0$: $\big|-\tfrac1N\log p(y^N) - H(p)\big| = 0 \le \varepsilon$.
Conclusion: $T_\varepsilon(p;N) = \{0,1\}^N$ and $|T_\varepsilon(p;N)| = 2^N$.
Classroom takeaway: when the source is already uniform, every sequence is typical; the typical set is the whole space and nothing can be compressed.
(b) Now $\alpha=1/3$, so $\mathbb{P}(y_i=1)=\tfrac13$ and $\mathbb{P}(y_i=0)=\tfrac23$; let $k$ be the number of 1s in the sequence.
1. Probability of a single sequence:
$p(y^N) = (\tfrac13)^k(\tfrac23)^{N-k}$.
Information density:
$-\tfrac1N\log p(y^N) = -\tfrac{k}{N}\log\tfrac13 - \big(1-\tfrac{k}{N}\big)\log\tfrac23$.
Writing $\hat p_N = k/N$ for the empirical fraction of 1s,
$-\tfrac1N\log p(y^N) = -\hat p_N\log\tfrac13 - (1-\hat p_N)\log\tfrac23$,
while the source entropy is $H(p) = H_B(1/3) = -\tfrac13\log\tfrac13 - \tfrac23\log\tfrac23$.
2. The typicality condition says "the fraction of 1s is close to 1/3".
Typicality condition:
$\big|-\tfrac1N\log p(y^N) - H_B(1/3)\big|\le\varepsilon$,
which can be read as a constraint on $\hat p_N$.
Because the expression above is a continuous function of $\hat p_N$ that equals $H_B(1/3)$ exactly at $\hat p_N = 1/3$, there exists $\delta(\varepsilon)>0$ such that
$|\hat p_N - \tfrac13|\le\delta(\varepsilon)$
is equivalent to the typicality condition (for large enough $N$).
Therefore:
the typical set can be written as
$T_\varepsilon(p;N) = \big\{ y^N:\ \big|\tfrac{\#\{i:y_i=1\}}{N} - \tfrac13\big|\le\delta(\varepsilon) \big\}$.
3. Growth exponent of its size.
The standard typical-sequence theorem (AEP) for discrete memoryless sources gives, for large $N$,
$(1-\eta)\,2^{N(H(p)-\varepsilon)} \le |T_\varepsilon(p;N)| \le 2^{N(H(p)+\varepsilon)}$ for some small $\eta>0$.
For $\alpha=1/3$:
$\lim_{N\to\infty}\tfrac1N\log|T_\varepsilon(p;N)| = H(p) = H_B(1/3)$,
so the growth exponent of $|T_\varepsilon(p;N)|$ is $H_B(1/3)$.
In class you can add one more remark: intuitively $|T_\varepsilon(p;N)| \approx 2^{N H_B(1/3)}$, which is exponentially smaller than the $2^N$ sequences in total.
Classroom takeaway: for a Bernoulli source, there are roughly $2^{NH(p)}$ typical sequences, each with probability around $2^{-NH(p)}$, and most of the probability mass concentrates on them.
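The growth exponent can also be shown numerically by counting the typical sequences exactly through binomial coefficients. The sketch below uses my own choice of $\varepsilon$ and the concrete threshold $\delta(\varepsilon)=\varepsilon$ (which works here because, for this source, the information density is affine in $k/N$ with unit slope in bits); the empirical exponent should settle within about $\pm\varepsilon$ of $H_B(1/3)$, consistent with the AEP sandwich.

```python
from math import comb, log2

# For alpha = 1/3 the information density -1/N log2 p(y^N) is affine in k/N with
# slope exactly 1 bit, so |...| <= eps reduces to |k/N - 1/3| <= eps.
alpha, eps = 1 / 3, 0.01
HB = -(alpha * log2(alpha) + (1 - alpha) * log2(1 - alpha))   # H_B(1/3) ≈ 0.918

for N in (100, 1000, 10000):
    size = sum(comb(N, k) for k in range(N + 1) if abs(k / N - alpha) <= eps)
    # The empirical exponent stays within roughly +/- eps of H_B(1/3) for large N,
    # matching 2^{N(H - eps)} <~ |T| <= 2^{N(H + eps)}.
    print(N, round(log2(size) / N, 4), round(HB, 4))
```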
Below is a draft of the lecture notes that you can drop directly into a .tex file. The style is "what I would say in class, plus the formulas"; trim it as needed.
\documentclass[11pt]{article}
\usepackage{amsmath, amssymb, amsfonts}
\usepackage{bm}
\newcommand{\KL}{D}
\newcommand{\E}{\mathbb{E}}
\newcommand{\HB}{H_{\mathrm{B}}}
\begin{document}
\title{Homework 4 -- Discussion Session Notes}
\author{Theory and Methods for Statistical Inference}
\date{}
\maketitle
In this recitation we go through all problems of Homework 4. The audience is assumed to be undergraduate students with a basic background in probability and information theory. I will try to emphasize geometric pictures and key takeaways. Throughout, for two distributions $p$ and $q$ on the same finite alphabet, we write the Kullback--Leibler divergence as
$$ \KL(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}. $$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Problem 1: Reverse I-projection}
\subsection*{Set-up and intuition}
\textbf{I-projection vs.\ reverse I-projection.} Given a convex set of distributions $\mathcal{P}$ and a fixed distribution $q$, the (forward) I-projection is
$$ p^\star = \arg\min_{p\in\mathcal{P}} \KL(p\|q). $$
In contrast, given a set $\mathcal{Q}$ and a fixed distribution $p$, the \emph{reverse} I-projection is
$$ q^\star = \arg\min_{q\in\mathcal{Q}} \KL(p\|q). $$
So in reverse I-projection, the first argument $p$ is fixed, and we move $q$ inside the set $\mathcal{Q}$. Geometrically, we can think of KL divergence as a kind of ``distance'' (although it is not symmetric), and I-projection means ``the closest point in the set''.
\subsection*{(a) Pythagoras' theorem for reverse I-projection}
\textbf{Goal.} Let $\mathcal{Q}$ be nonempty, closed, and \emph{log-convex}. Let $q^\star$ be the reverse I-projection of $p$ onto $\mathcal{Q}$:
$$ q^\star = \arg\min_{q\in\mathcal{Q}} \KL(p\|q). $$
We want to show that for all $q\in\mathcal{Q}$,
$$ \KL(p\|q) \;\ge\; \KL(p\|q^\star) + \KL(q^\star\|q). $$
\textbf{Log-convex combinations.} For any $r,s\in\mathcal{Q}$ that are not mutually singular, and $\lambda\in[0,1]$, we define the log-convex combination
$$ q_\lambda(x) = \frac{r(x)^\lambda s(x)^{1-\lambda}} {\sum_z r(z)^\lambda s(z)^{1-\lambda}}. $$
The set $\mathcal{Q}$ is log-convex if for every such $r,s$ and $\lambda$, the distribution $q_\lambda$ belongs to $\mathcal{Q}$. You can think of this as ``taking a straight line in log-space'' between $r$ and $s$, and staying inside the set along this path.
\medskip
\textbf{Step 1: Follow a log-geodesic from $q^\star$ to $q$.} Fix any $q\in\mathcal{Q}$ and set
$$ q_\lambda(x) = \frac{q(x)^\lambda q^\star(x)^{1-\lambda}} {\sum_z q(z)^\lambda q^\star(z)^{1-\lambda}},\quad \lambda\in[0,1]. $$
Then
$$ q_0 = q^\star,\qquad q_1 = q, $$
and $q_\lambda\in\mathcal{Q}$ for all $\lambda$ by log-convexity. Define
$$ f(\lambda) = \KL(p\|q_\lambda). $$
Because $q^\star$ is the minimizer of $\KL(p\|\cdot)$ over $\mathcal{Q}$, it must also minimize along this particular path:
$$ f(\lambda) \ge f(0)\quad\text{for all }\lambda\in[0,1]. $$
In particular, the right derivative at $\lambda=0$ satisfies
$$ f'(0^+) \ge 0. $$
\medskip
\textbf{Step 2: Compute the derivative $f'(0)$.} We write
$$ \log q_\lambda(x) = \lambda\log q(x) + (1-\lambda)\log q^\star(x) - \log Z(\lambda), $$
where
$$ Z(\lambda) = \sum_z q(z)^\lambda q^\star(z)^{1-\lambda} $$
is the normalizing constant. Then
$$ f(\lambda) = \sum_x p(x)\Big(\log p(x) - \log q_\lambda(x)\Big), $$
and
$$ f'(\lambda) = -\sum_x p(x)\,\frac{\partial}{\partial\lambda}\log q_\lambda(x).
$$ A straightforward but slightly tedious calculation (which you can carry out on the board) shows that at $\lambda=0$, $$ f'(0) = \KL(p\|q) - \KL(p\|q^\star) - \KL(q^\star\|q). $$ Intuitively, this derivative compares how fast the divergence increases when we move from $q^\star$ in the direction of $q$. \medskip \textbf{Step 3: Use the optimality of $q^\star$.} Since $f$ has a minimum at $\lambda=0$ along this path, we must have $f'(0^+)\ge 0$, so $$ \KL(p\|q) - \KL(p\|q^\star) - \KL(q^\star\|q) \;\ge\; 0, $$ which is exactly the desired inequality: $$ \KL(p\|q) \;\ge\; \KL(p\|q^\star) + \KL(q^\star\|q). $$ \medskip \textbf{Takeaway:} for a log-convex set $\mathcal{Q}$, the reverse I-projection $q^\star$ behaves like the foot of a perpendicular in a ``KL geometry'': going from $p$ to any other point $q\in\mathcal{Q}$ costs at least the divergence from $p$ to $q^\star$ plus the divergence from $q^\star$ to $q$. \subsection*{(b) Practice: exponential and linear families} Here we recall: \begin{itemize} \item Given a base distribution $q^\star$ and a sufficient statistic $t(x)\in\mathbb{R}^d$, the exponential family is $$ \mathcal{E}_t(q^\star) = \left\{ q_\theta(x) = \exp(\theta^\top t(x) - \psi(\theta))\,q^\star(x) \right\}. $$ \item The associated linear family is $$ \mathcal{L}_t(q^\star) = \left\{ p:\ \E_p[t(X)] = \E_{q^\star}[t(X)] \right\}. $$ \end{itemize} \textbf{Claim.} $q^\star$ is the reverse I-projection of $p$ onto $\mathcal{E}_t(q^\star)$ if and only if $p\in\mathcal{L}_t(q^\star)$. \medskip \textbf{Proof idea: look at derivatives of $\KL(p\|q_\theta)$ in $\theta$.} Using the definition of $q_\theta$, we get $$ \KL(p\|q_\theta) = \sum_x p(x)\log\frac{p(x)}{q_\theta(x)} = \text{constant in $\theta$} + \psi(\theta) - \theta^\top\E_p[t(X)]. $$ Differentiating with respect to $\theta$, $$ \nabla_\theta \KL(p\|q_\theta) = \nabla\psi(\theta) - \E_p[t(X)] = \E_{q_\theta}[t(X)] - \E_p[t(X)]. $$ If $q^\star$ is the reverse I-projection, then it must correspond to some parameter, say $\theta=0$, and at this point the gradient must vanish: $$ \E_{q^\star}[t(X)] = \E_p[t(X)]. $$ So $p$ belongs to $\mathcal{L}_t(q^\star)$. Conversely, if $p\in\mathcal{L}_t(q^\star)$, then at $\theta=0$ the gradient vanishes, and by convexity of $\KL(p\|q_\theta)$ in $\theta$, this point is a global minimizer. Hence $q^\star$ is indeed the reverse I-projection. \medskip \textbf{Takeaway:} in exponential families, \emph{minimizing} KL divergence over the family is equivalent to \emph{matching} the expected sufficient statistics. This is the core of many maximum entropy and maximum likelihood results. \subsection*{(c) Another characterization of reverse I-projection} We are given a nonempty closed \emph{convex} set $\mathcal{Q}$. Let $p$ be any distribution. We want to prove: $$ q^\star\ \text{is the reverse I-projection of $p$ onto $\mathcal{Q}$} $$ if and only if for all $p'$ and any $q'\in\mathcal{Q}$, $$ \KL(p'\|q') + \KL(p'\|p) \;\ge\; \KL(p'\|q^\star). $$ This is a kind of ``three-point inequality'' involving $p',p,q^\star$ and $q'$. \medskip \textbf{(``If'' direction)} Assume the inequality holds for all $p',q'$. If we plug in $p'=p$, we obtain $$ \KL(p\|q') + \KL(p\|p) \ge \KL(p\|q^\star). $$ But $\KL(p\|p)=0$, hence $$ \KL(p\|q') \ge \KL(p\|q^\star)\quad\forall q'\in\mathcal{Q}. $$ Therefore $q^\star$ is indeed the minimizer of $\KL(p\|\cdot)$ over $\mathcal{Q}$, i.e., the reverse I-projection. \medskip \textbf{(``Only if'' direction)} Suppose $q^\star$ is the reverse I-projection. 
In this direction, the algebra is more involved. A clean way to think about it is to view KL divergence as a \emph{Bregman divergence} generated by the negative entropy function $$ \varphi(r) = \sum_x r(x)\log r(x). $$ Then $$ \KL(p\|q) = \varphi(p) - \varphi(q) - \nabla\varphi(q)^\top(p-q). $$ For a convex set $\mathcal{Q}$, the Bregman projection $q^\star$ of $p$ onto $\mathcal{Q}$ is characterized by a first-order (KKT-like) condition involving $\nabla\varphi(q^\star)$. By expanding the Bregman divergence expressions for $\KL(p'\|q')$, $\KL(p'\|p)$, and $\KL(p'\|q^\star)$ and using this condition, one can show that $$ \KL(p'\|q') + \KL(p'\|p) - \KL(p'\|q^\star) \ge 0 $$ for all $p'$ and all $q'\in\mathcal{Q}$. In a homework-level solution, you can follow the same pattern as in the usual Pythagoras theorem for forward I-projection: express the difference in KLs, use convexity of $\mathcal{Q}$, and apply the optimality of $q^\star$. \medskip \textbf{Takeaway:} this inequality says that $q^\star$ is the ``best possible'' element of $\mathcal{Q}$ in a very strong sense: for \emph{every} distribution $p'$, using $q^\star$ is never worse than using any $q'\in\mathcal{Q}$, up to the additive ``penalty term'' $\KL(p'\|p)$, which does not depend on $q'$. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Problem 2: Modeling and capacity} We have a parameter $x\in\{0,1,2\}$ and an observable binary variable $y\in\{0,1\}$. The conditional distributions of $y$ given $x$ are $$ \begin{aligned} p_y(y;0) &= \varepsilon^y(1-\varepsilon)^{1-y} &&\text{(Bernoulli($\varepsilon$)),} \\ p_y(y;1) &= (1-\varepsilon)^y\,\varepsilon^{1-y} &&\text{(Bernoulli($1-\varepsilon$)),} \\ p_y(y;2) &= 1/2 &&\text{(pure noise),} \end{aligned} $$ with $0<\varepsilon<1/2$. Let $p_x$ be a prior distribution on $\{0,1,2\}$ and $I_{p_x}(x;y)$ denote the mutual information between $x$ and $y$ under $p_x$. The \emph{model capacity} is $$ C = I_{p_x^\star}(x;y), $$ where $p_x^\star$ is the capacity-achieving (least informative) prior. \subsection*{(a)(i) Expressing the mutual information} We denote $H_B(p) = -p\log p-(1-p)\log(1-p)$, the entropy of a Bernoulli($p$). Let $$ p_x(0)=p_0,\quad p_x(1)=p_1,\quad p_x(2)=p_2. $$ \textbf{Step 1: Compute the marginal of $Y$.} $$ \begin{aligned} P(Y=1) &= p_0\varepsilon + p_1(1-\varepsilon) + p_2\cdot\frac12, \\ P(Y=0) &= 1 - P(Y=1). \end{aligned} $$ Subtract $1/2$: $$ \begin{aligned} P(Y=1) - \tfrac12 &= p_0(\varepsilon-\tfrac12) + p_1((1-\varepsilon)-\tfrac12) \\ &= (\tfrac12-\varepsilon)(p_1-p_0). \end{aligned} $$ Therefore $$ P(Y=1) = \tfrac12 + \big(\tfrac12-\varepsilon\big)(p_1-p_0), $$ and this does \emph{not} depend on $p_2$. Thus $$ H(Y) = H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big). $$ \textbf{Step 2: Compute $H(Y|X)$.} $$ \begin{aligned} H(Y|X) &= p_0 H_B(\varepsilon) + p_1 H_B(1-\varepsilon) + p_2 H_B(1/2) \\ &= (p_0+p_1)H_B(\varepsilon) + p_2 H_B(1/2) = (1-p_2)H_B(\varepsilon) + p_2 H_B(1/2), \end{aligned} $$ because $H_B(1-\varepsilon) = H_B(\varepsilon)$. \textbf{Step 3: Mutual information.} $$ \begin{aligned} I_{p_x}(x;y) &= H(Y) - H(Y|X) \\ &= H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big) - (1-p_2)H_B(\varepsilon) - p_2H_B(1/2) \\ &= H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big) - H_B(\varepsilon) - p_2\big(H_B(1/2) - H_B(\varepsilon)\big). 
\end{aligned} $$ Comparing with the required form $$ I_{p_x}(x;y) = H_B\!\Big(\tfrac12 + f(\varepsilon)(p_x(1)-p_x(0))\Big) - g(\varepsilon)p_x(2) - H_B(\varepsilon), $$ we identify $$ f(\varepsilon) = \tfrac12 - \varepsilon, \qquad g(\varepsilon) = H_B(1/2) - H_B(\varepsilon). $$ \subsection*{(a)(ii) The capacity-achieving prior and $C$} We now maximize $I_{p_x}(x;y)$ over $p_0,p_1,p_2$. \textbf{Observation 1: $x=2$ only adds noise.} The term involving $p_2$ is $$ -p_2\big(H_B(1/2)-H_B(\varepsilon)\big), $$ and $H_B(1/2)>H_B(\varepsilon)$. So for fixed $(p_1-p_0)$, increasing $p_2$ strictly \emph{decreases} the mutual information. Hence the optimum must satisfy $$ p_2^\star = 0. $$ \textbf{Observation 2: with $p_2=0$, maximize the entropy of $Y$.} When $p_2=0$, we have $p_0+p_1=1$ and $$ P(Y=1) = \tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0). $$ Then $$ I(x;y) = H_B\!\Big(\tfrac12 + (\tfrac12-\varepsilon)(p_1-p_0)\Big) - H_B(\varepsilon). $$ The binary entropy $H_B(\cdot)$ is concave in its argument and is maximized at $1/2$. Thus, to maximize $I(x;y)$, we want $P(Y=1)=1/2$, i.e. $$ p_1-p_0 = 0 \quad\Rightarrow\quad p_1=p_0=1/2. $$ \textbf{Therefore} $$ p_x^\star(0)=p_x^\star(1)=\tfrac12,\quad p_x^\star(2)=0, $$ and $$ C = I_{p_x^\star}(x;y) = H_B(1/2) - H_B(\varepsilon). $$ So in the expression $$ C = \alpha H_B(1/2) + \beta H_B(\varepsilon) $$ we have $$ \alpha = 1,\qquad \beta = -1. $$ \subsection*{(a)(iii) The mixture distribution $p_y^\star$} Using the optimal prior $p_x^\star$, $$ \begin{aligned} P^\star(Y=1) &= \tfrac12\varepsilon + \tfrac12(1-\varepsilon) + 0\cdot\tfrac12 \\ &= \tfrac12,\\ P^\star(Y=0) &= \tfrac12. \end{aligned} $$ Thus $$ p_y^\star(0) = p_y^\star(1) = \tfrac12. $$ \subsection*{(b)(i) Mixture distribution with uniform prior} Now take $p_x(x)=1/3$ for all $x$. We get $$ \begin{aligned} P(Y=1) &= \tfrac13\varepsilon + \tfrac13(1-\varepsilon) + \tfrac13\times\tfrac12 \\ &= \tfrac13\Big(\varepsilon + 1 - \varepsilon + \tfrac12\Big) = \tfrac13\cdot\tfrac32 = \tfrac12, \end{aligned} $$ and $P(Y=0)=1/2$. So again, $$ p_y(0)=p_y(1)=\tfrac12. $$ \subsection*{(b)(ii) Information loss under the uniform prior} \textbf{Mutual information for the uniform prior.} We already know $H(Y)=H_B(1/2)$. The conditional entropy is $$ H(Y|X) = \tfrac13 H_B(\varepsilon) + \tfrac13 H_B(1-\varepsilon) + \tfrac13 H_B(1/2) = \tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2). $$ Thus $$ \begin{aligned} I_{\text{unif}} &= H(Y) - H(Y|X) \\ &= H_B(1/2) - \left(\tfrac23 H_B(\varepsilon) + \tfrac13 H_B(1/2)\right) \\ &= \tfrac23 H_B(1/2) - \tfrac23 H_B(\varepsilon) \\ &= \tfrac23\big(H_B(1/2) - H_B(\varepsilon)\big) \\ &= \tfrac23 C. \end{aligned} $$ \textbf{Information loss.} $$ C - I_{\text{unif}} = C - \tfrac23 C = \tfrac13 C = \tfrac13 H_B(1/2) - \tfrac13 H_B(\varepsilon). $$ So the constants in $$ C - I_{p_x}(x;y) = \gamma H_B(1/2) + \delta H_B(\varepsilon) $$ are $$ \gamma = \tfrac13,\qquad \delta = -\tfrac13. $$ \subsection*{(c) General model: expressing the loss as a KL divergence} Now consider a general model $p_y(\cdot;x)$ for $y\in\mathcal{Y}$, parameterized by $x\in\mathcal{X}$. Let $p_x^\star$ be a capacity-achieving prior and $p_y^\star$ be the corresponding marginal distribution of $y$. Let $q_x$ be an arbitrary distribution on $\mathcal{X}$. Assume $p_x^\star(x) > 0$ for all $x\in\mathcal{X}$. \textbf{Equidistance property.} For a capacity-achieving prior with full support, we have $$ \KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = C $$ for all $x$. 
\textbf{Mixture under $q_x$.} Let $$ q_y(y) = \sum_x q_x(x)\,p_{y|x}(y|x) $$ be the marginal distribution of $y$ under $q_x$. \textbf{Key identity.} We can show $$ \sum_x q_x(x)\,\KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = I_{q_x}(x;y) + \KL(q_y\|p_y^\star). $$ This comes from writing $$ \log\frac{p_{y|x}(y|x)}{p_y^\star(y)} = \log\frac{p_{y|x}(y|x)}{q_y(y)} + \log\frac{q_y(y)}{p_y^\star(y)} $$ and then averaging over $x$ and $y$. \textbf{Apply equidistance.} By the equidistance property, the left-hand side equals $$ \sum_x q_x(x)\cdot C = C, $$ so $$ C = I_{q_x}(x;y) + \KL(q_y\|p_y^\star). $$ Therefore $$ C - I_{q_x}(x;y) = \KL(q_y\|p_y^\star). $$ So we can take $$ q_1(\cdot) = q_y(\cdot),\qquad q_2(\cdot) = p_y^\star(\cdot). $$ \subsection*{(d) The case where $p_x^\star$ is not full support} Now suppose that $p_x^\star(x)=0$ for some $x$. In this case, the Kuhn--Tucker conditions for channel capacity say: $$ \KL(p_{y|x}(\cdot|x)\|p_y^\star(\cdot)) \le C,\quad\forall x, $$ with equality if and only if $p_x^\star(x)>0$. Therefore, for any prior $q_x$, $$ \sum_x q_x(x)\,\KL(p_{y|x}(\cdot|x)\|p_y^\star(\cdot)) \le C. $$ Using the same decomposition as before, $$ \sum_x q_x(x)\,\KL\big(p_{y|x}(\cdot|x)\,\big\|\,p_y^\star(\cdot)\big) = I_{q_x}(x;y) + \KL(q_y\|p_y^\star). $$ Hence $$ I_{q_x}(x;y) + \KL(q_y\|p_y^\star) \le C, $$ and so $$ C - I_{q_x}(x;y) \ge \KL(q_y\|p_y^\star). $$ This corresponds to option \textbf{B}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Problem 3: Typical sequences} Let $y_1,\dots,y_N$ be i.i.d.\ Bernoulli random variables with $$ \mathbb{P}(y_i=1)=\alpha,\quad \mathbb{P}(y_i=0)=1-\alpha. $$ Let $p$ be the distribution of a single $y_i$, and let all logarithms be base 2. The typical set (weakly typical) is defined as $$ T_\varepsilon(p;N) = \left\{ y^N:\ \left| -\frac{1}{N}\log p(y^N) - H(p)\right|\le\varepsilon \right\}, $$ where $H(p)$ is the entropy of a single Bernoulli($\alpha$) variable. \subsection*{(a) Case $\alpha=1/2$} If $\alpha=1/2$, then each $y_i$ is Bernoulli($1/2$), and for any sequence $y^N\in\{0,1\}^N$ we have $$ p(y^N) = \left(\tfrac12\right)^N = 2^{-N}. $$ Therefore $$ -\frac{1}{N}\log p(y^N) = -\frac{1}{N}\log 2^{-N} = 1 $$ for all sequences. The entropy of a single Bernoulli($1/2$) variable is $$ H(p) = H_B(1/2) = 1. $$ Thus, for any $\varepsilon>0$, $$ \left| -\frac{1}{N}\log p(y^N) - H(p)\right| = |1-1| = 0 \le \varepsilon, $$ so $$ T_\varepsilon(p;N) = \{0,1\}^N, $$ and $$ |T_\varepsilon(p;N)| = 2^N. $$ \subsection*{(b) Case $\alpha=1/3$: typical set and growth exponent} Now assume $\alpha=1/3$, so $$ \mathbb{P}(y_i=1)=\tfrac13,\quad \mathbb{P}(y_i=0)=\tfrac23. $$ Let $k$ be the number of ones in the sequence $y^N$: $$ k = \#\{i:y_i=1\}. $$ \textbf{Probability of a given sequence.} Every sequence with $k$ ones and $N-k$ zeros has probability $$ p(y^N) = \left(\tfrac13\right)^k\left(\tfrac23\right)^{N-k}. $$ Hence $$ -\frac{1}{N}\log p(y^N) = -\frac{k}{N}\log\tfrac13 - \left(1-\frac{k}{N}\right)\log\tfrac23. $$ Let $\hat{p}_N = k/N$ be the empirical fraction of ones. Then $$ -\frac{1}{N}\log p(y^N) = -\hat{p}_N\log\tfrac13 - (1-\hat{p}_N)\log\tfrac23. $$ The entropy is $$ H(p) = H_B(1/3) = -\tfrac13\log\tfrac13 - \tfrac23\log\tfrac23. $$ \textbf{Typical set description.} The typicality condition $$ \left| -\frac{1}{N}\log p(y^N) - H_B(1/3)\right|\le\varepsilon $$ is a condition on $\hat{p}_N$. 
Because the function of $\hat{p}_N$ above is continuous and achieves $H_B(1/3)$ at $\hat{p}_N=1/3$, there exists some $\delta(\varepsilon)>0$ such that, for large $N$, $$ \left|\hat{p}_N - \tfrac13\right|\le\delta(\varepsilon) $$ is equivalent to the typicality condition. So we can write $$ T_\varepsilon(p;N) = \left\{ y^N:\ \left|\frac{\#\{i:y_i=1\}}{N} - \frac13\right|\le\delta(\varepsilon) \right\}. $$ \textbf{Growth exponent of $|T_\varepsilon(p;N)|$.} The asymptotic equipartition property (AEP) for discrete memoryless sources states that, for large $N$, $$ (1-\eta)\,2^{N(H(p)-\varepsilon)} \;\le\; |T_\varepsilon(p;N)| \;\le\; 2^{N(H(p)+\varepsilon)} $$ for some small $\eta>0$. Thus the exponential growth rate of $|T_\varepsilon(p;N)|$ is $$ \lim_{N\to\infty}\frac{1}{N}\log |T_\varepsilon(p;N)| = H(p) = H_B(1/3). $$ Intuitively, for large $N$, $$ |T_\varepsilon(p;N)| \approx 2^{N H_B(1/3)}. $$ \medskip \textbf{Key takeaway:} for a Bernoulli source, the number of typical sequences is roughly $2^{NH(p)}$, and each typical sequence has probability around $2^{-NH(p)}$. Most of the probability mass is concentrated on these typical sequences. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document}
If you like, I can also prepare a slimmed-down version and a blackboard version of this LaTeX, or turn the key points above directly into a PPT outline.