SVM p-norm Derivation

Created: January 13, 2025

Model used: O1 by Chat01

Question

For the soft-margin SVM, replace the slack penalty on $\xi_i$ with the $p$-norm form $|\xi_i|^p$. Derive its Primal and Dual problems in detail, and explain how the SVM problem differs between $p=1$ and $p>1$.

Thinking

Reworking the objective

I'm curious about tweaking the objective in the primal form of the soft-margin SVM. The focus shifts to minimizing $\frac{1}{2}\|w\|^2 + C \sum_i |\xi_i|^p$ with specific constraints.

Reevaluating absolute values

I'm thinking aloud about the use of absolute values for $\xi_i$ in the objective. Since $\xi_i \ge 0$, there's no need for $|\xi_i|$, so $|\xi_i|^p$ simplifies to $\xi_i^p$.

Tackling derivatives

I’m piecing together partial derivatives for $w$, $b$, and $\xi_i$ to determine the optimal conditions. I’m also considering complementary slackness for $\alpha_i$ and $\mu_i$ to ensure the solution fits perfectly.

Taking a closer look

Pulling together the dual problem, rewriting $\xi_i$ in terms of $\alpha_i$ and $C$, and ensuring the dual Lagrangian is tight involves considering these key elements.

Synthesizing dual objectives

I'm figuring out the dual function by minimizing over $\alpha_i$ and $\mu_i$, solving for $\xi_i$ using constraints, and ensuring optimal solutions align with the support vector method.

Incorporating constraints

OK, let's work through the constraints. By defining $f(\xi_i)$ and solving for $\xi_i^*$, I ensure the solution fits the constraints, keeping everything balanced.

Breaking down constraints

OK, let’s do a step-by-step analysis of constraints. Simplifying terms between constraints ensures a clear, accurate, and insightful expression. This approach effectively clarifies complex elements.

Breaking down the difference

I'm working through the difference in terms of $C$ and $p$. It's fascinating to see how using known expressions for conjugate functions can simplify this process significantly.

Mapping out conjugate function implications

OK, let me see. For $p=1$, the dual problem aligns with the standard SVM linear hinge loss. For $p=2$, it shifts to the squared hinge loss.

Revisiting alpha constraints

OK, let's see: for $p=1$, $\alpha_i$ is bounded by $C$, while for $p>1$, it’s unbounded. This shift creates different constraints for the dual problem.

Analyzing changes

For $p>1$, the unbounded $\alpha_i$ and stronger non-linear penalty alter the margin geometry, resulting in fewer margin violations compared to $p=1$.

Restating key points

For $p>1$, the primal and dual objectives are clarified, showing a nuanced change in constraint geometry and properties, leading to a clear progression in understanding.

Refining the dual

By establishing $\mu_i \geq 0$, the dual function is streamlined, ignoring $\mu_i$ and piecing together the dual variables' relationship with $C$, $p$, and $\alpha_i$.

Piecing together the dual

Tracing the dual problem involves maximizing $\sum_i \alpha_i$ minus $\sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\mathsf{T} x_j$ terms, minus the conjugate function $h(\alpha_i)$.

Examining dual constraints

For $p=1$, $\alpha_i$ is limited by $C$, while $p>1$ allows unbounded $\alpha_i$. The geometry of slack variables also shifts with $p$, leading to fewer violations.

Piecing together

By setting up the Lagrangian and using KKT conditions, I can determine optimal $w$, $b$, and $\xi_i$ values. This approach simplifies finding the optimal solution.

Assessing constraint impacts

OK, let me see: for $p=1$, $\alpha_i$ is capped by $C$. For $p>1$, $\alpha_i$ can grow unbounded. Constraints and properties adapt accordingly.

Thought about SVM p-norm optimization for 1m 48s

Answer

Below, we generalize the linear ("$L_1$") penalty on the slack variables $\xi_i$ in the soft-margin SVM to a general $p$-th power penalty $|\xi_i|^p$, derive the corresponding primal and dual problems, and highlight the key differences in the dual between $p=1$ and $p>1$.


1. Review: the classical soft-margin SVM

In binary classification, we are given training samples $\{(x_i, y_i)\}_{i=1}^m$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$. The primal problem of the classical soft-margin SVM can be written as

\begin{aligned} \min_{w, b, \xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i \\ \text{s.t.} \quad & y_i \bigl(w^\mathsf{T} x_i + b\bigr) \;\ge\; 1 - \xi_i,\quad i=1,\dots,m,\\ & \xi_i \;\ge\; 0,\quad i=1,\dots,m. \end{aligned}

Its dual problem works out to

\begin{aligned} \max_{\alpha} \quad & \sum_{i=1}^m \alpha_i \;-\; \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j\,y_i y_j\, x_i^\mathsf{T} x_j\\ \text{s.t.}\quad & 0 \;\le\; \alpha_i \;\le\; C,\quad i=1,\dots,m,\\ & \sum_{i=1}^m \alpha_i y_i \;=\; 0. \end{aligned}

where the $\alpha_i$ are Lagrange multipliers.
Note that each $\alpha_i$ is confined to the interval $[0, C]$, i.e. the dual exhibits the "box constraint" $\alpha_i \le C$.
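This standard dual is a small quadratic program and can be checked numerically. Below is a minimal sketch using `scipy.optimize.minimize` (SLSQP); the toy dataset and all variable names are illustrative assumptions, not part of the derivation above:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data (illustrative assumption)
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -1.5], [-3.0, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Negative dual objective: we minimize -(sum(alpha) - 1/2 alpha^T Q alpha)
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),                            # box constraint 0 <= alpha_i <= C
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))        # a support vector (largest multiplier)
b = y[sv] - w @ X[sv]             # y_sv (w^T x_sv + b) = 1 on the margin
pred = np.sign(X @ w + b)
```

On this well-separated toy set no $\alpha_i$ reaches the upper bound $C$ and all training points are classified correctly; on noisier data some $\alpha_i$ would sit at the bound.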


2. Generalizing the soft-margin penalty from $L_1$ to $L_p$

We now replace the linear penalty $\sum \xi_i$ on the slack variables $\xi_i$ in the traditional SVM with the more general $p$-norm penalty $\sum \xi_i^p$, where typically $p \ge 1$. The new primal problem becomes

\begin{aligned} \min_{w, b, \{\xi_i\}} \quad & \frac{1}{2}\|w\|^2 \;+\; C\sum_{i=1}^m \bigl(\xi_i\bigr)^p\\ \text{s.t.}\quad & y_i\bigl(w^\mathsf{T} x_i + b\bigr) \;\ge\; 1 - \xi_i,\quad i=1,\dots,m,\\ & \xi_i \;\ge\; 0,\quad i=1,\dots,m. \end{aligned}

:由于约束 ξi0\xi_i \ge 0,因此 ξip=ξip\lvert \xi_i \rvert^p = \xi_i^p

Below we derive its dual problem following Lagrangian duality.


2.1 Constructing the Lagrangian

αi0\alpha_i \ge 0 为对不等式

y_i(w^\mathsf{T}x_i + b) \;\ge\; 1-\xi_i

and $\mu_i \ge 0$ the Lagrange multiplier for $\xi_i \ge 0$. The Lagrangian is then

\begin{aligned} \mathcal{L}(w,b,\{\xi_i\}, \{\alpha_i\}, \{\mu_i\}) =\;& \frac{1}{2}\|w\|^2 \;+\; C\sum_{i=1}^m \bigl(\xi_i\bigr)^p \\ &-\; \sum_{i=1}^m \alpha_i\,\Bigl[y_i\bigl(w^\mathsf{T}x_i + b\bigr) - 1 + \xi_i \Bigr]\\ &-\; \sum_{i=1}^m \mu_i \,\xi_i. \end{aligned}

We first minimize over $(w,b,\{\xi_i\})$ and then maximize over $(\{\alpha_i\}, \{\mu_i\})$.


2.2 Minimizing over $w$, $b$, $\xi_i$

  1. ww 求偏导并令其为零:

    Lw=w    i=1mαiyixi=0        w=i=1mαiyixi.\frac{\partial \mathcal{L}}{\partial w} = w \;-\;\sum_{i=1}^m \alpha_i\,y_i\,x_i = 0 \;\;\Longrightarrow\;\; w = \sum_{i=1}^m \alpha_i\,y_i\,x_i.
  2. bb 求偏导并令其为零:

    Lb=i=1mαiyi=0        i=1mαiyi=0.\frac{\partial \mathcal{L}}{\partial b} = - \sum_{i=1}^m \alpha_i\,y_i = 0 \;\;\Longrightarrow\;\; \sum_{i=1}^m \alpha_i\,y_i = 0.
  3. ξi\xi_i 求偏导并令其为零:

    Lξi=Cp(ξi)p1    αi    μi=0.\frac{\partial \mathcal{L}}{\partial \xi_i} = C\,p\,(\xi_i)^{\,p-1} \;-\;\alpha_i \;-\;\mu_i = 0.

    因此

    αi+μi  =  Cp(ξi)p1.\alpha_i + \mu_i \;=\; C\,p\,(\xi_i)^{\,p-1}.

In addition, the KKT complementary slackness conditions give

  1. $\alpha_i \,\bigl[y_i(w^\mathsf{T}x_i + b) - 1 + \xi_i\bigr] = 0$.
  2. $\mu_i \,\xi_i = 0$.
  3. $\alpha_i \ge 0,\ \mu_i \ge 0,\ \xi_i \ge 0$.

μiξi=0\mu_i \,\xi_i = 0 可见:

  • ξi>0\xi_i > 0,则 μi=0\mu_i=0,并由上式可得 αi=Cp(ξi)p1\alpha_i = C\,p\,(\xi_i)^{p-1}
  • ξi=0\xi_i=0,则 αi+μi=Cp(ξi)p1=0\alpha_i + \mu_i = C\,p\,(\xi_i)^{p-1}=0,通常意味着 αi=0,μi\alpha_i=0,\mu_i 可能为正等情形。

2.3 Obtaining the dual objective

w \;=\;\sum_{i=1}^m \alpha_i\,y_i\,x_i

Substituting this back into the Lagrangian and optimizing (i.e. minimizing) over $\xi_i$ yields the dual function $W(\{\alpha_i\},\{\mu_i\})$. Maximizing over $\alpha_i, \mu_i$ subject to $\alpha_i \ge 0,\ \mu_i \ge 0$ then gives the dual problem. Since the full derivation is somewhat tedious, we list the key points:

  1. The $\|w\|^2$ term:

    \frac{1}{2}\|w\|^2 = \frac{1}{2}\Bigl\|\sum_i \alpha_i\,y_i\,x_i\Bigr\|^2 = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\,y_i y_j\, x_i^\mathsf{T} x_j.
  2. The linear term $\sum_i \alpha_i$ arises from the constraint $y_i(w^\mathsf{T}x_i + b) \ge 1-\xi_i$ (expanding $-\sum_i \alpha_i\bigl[y_i(w^\mathsf{T}x_i+b) - 1 + \xi_i\bigr]$ contributes $+\sum_i \alpha_i$).

  3. The interplay between $\xi_i^p$ and $\alpha_i + \mu_i$:
    When $p=1$: $\alpha_i + \mu_i = C\,p = C$, which directly yields the box constraint $\alpha_i \le C$;
    When $p>1$: there is no longer a fixed upper bound on $\alpha_i$; it is instead constrained through the nonlinear relation $\alpha_i + \mu_i = C\,p\,(\xi_i)^{p-1}$.

Putting these together, the dual problem takes the general form

\begin{aligned} \max_{\alpha\ge 0} \quad & \sum_{i=1}^m \alpha_i \;-\; \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j\,y_i y_j\, x_i^\mathsf{T} x_j \;-\;\underbrace{\sum_{i=1}^m \phi(\alpha_i)}_{\text{dual counterpart of }(\xi_i)^p} \\ \text{s.t.} \quad & \sum_{i=1}^m \alpha_i\,y_i \;=\; 0, \end{aligned}

where $\phi(\alpha_i)$ comes from the analytic solution of $\min_{\xi_i\ge 0}\bigl\{C\,(\xi_i)^p - \alpha_i\,\xi_i - \mu_i\,\xi_i\bigr\}$ or from the complementary slackness conditions. For a specific value of $p$, an explicit form can be written out.
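As a concrete illustration for the common case $p=2$ (a closed-form evaluation of this minimization; take $\mu_i = 0$, which is optimal since any $\mu_i > 0$ only makes the resulting value smaller):

\begin{aligned} \frac{d}{d\xi_i}\bigl(C\,\xi_i^2 - \alpha_i\,\xi_i\bigr) = 2C\,\xi_i - \alpha_i = 0 \;&\Longrightarrow\; \xi_i = \frac{\alpha_i}{2C},\\ C\Bigl(\frac{\alpha_i}{2C}\Bigr)^{2} - \alpha_i\cdot\frac{\alpha_i}{2C} \;&=\; -\,\frac{\alpha_i^2}{4C}. \end{aligned}

Hence $\phi(\alpha_i) = \alpha_i^2/(4C)$ for $p=2$, and the dual becomes the familiar squared-hinge (L2-SVM) dual, constrained only by $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.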


3. Differences between $p=1$ and $p>1$

  1. $p=1$ (classical soft-margin SVM)

    • The primal problem is $\min_{w,b,\xi} \;\frac12\|w\|^2 \;+\; C\sum_{i=1}^m \xi_i.$
    • The dual problem contains $0 \le \alpha_i \le C$, the "box constraint" of the standard soft-margin SVM. Intuitively, the upper bound on $\alpha_i$ is exactly $C$, matching the strength of the linear penalty on the slack variables $\xi_i$.
  2. $p>1$ (nonlinear penalty)

    • The primal problem is $\min_{w,b,\xi} \;\frac12\|w\|^2 \;+\; C\sum_{i=1}^m \bigl(\xi_i\bigr)^p.$
    • In the dual, the simple box constraint $\alpha_i \le C$ no longer appears, and $\alpha_i$ may take larger values.
    • Geometrically and algorithmically, large slacks (samples with large errors) incur a heavier penalty, since $\xi_i^p$ grows faster than linearly in $\xi_i$ when $p>1$. Training therefore penalizes a few large violations more strongly, which to some extent pushes more samples toward satisfying a stricter margin requirement.
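The effect of the exponent can be seen directly in the one-dimensional slack minimization $\min_{\xi\ge 0}\{C\xi^p - \alpha\xi\}$ that underlies $\phi(\alpha_i)$. The following numerical sketch (the constants and function names are illustrative assumptions) shows that for $p=2$ the value matches the closed form $-\alpha^2/(4C)$, while for $p=1$ it is $0$ when $\alpha \le C$ but unbounded below once $\alpha > C$, which is precisely why the dual must impose $\alpha_i \le C$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

C = 2.0

def penalty_conjugate(alpha, p, xi_max=1e3):
    """Numerically evaluate min_{0 <= xi <= xi_max} C*xi**p - alpha*xi.

    xi_max is an artificial cap; for p = 1 and alpha > C the true
    infimum over xi >= 0 is -infinity (the objective is linear and
    decreasing), which the solver signals by running to the cap.
    """
    res = minimize_scalar(lambda xi: C * xi**p - alpha * xi,
                          bounds=(0.0, xi_max), method="bounded")
    return res.fun

# p = 2: closed form is -alpha**2 / (4 * C)
vals_p2 = [penalty_conjugate(a, 2) for a in (0.5, 1.0, 3.0)]

# p = 1: zero while alpha <= C, runs off toward -infinity once alpha > C
ok = penalty_conjugate(1.0, 1)        # alpha <= C -> minimum at xi = 0
diverges = penalty_conjugate(10.0, 1)  # alpha > C -> heads to the cap
```
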

4. Summary

  • Primal: after replacing $\sum \xi_i$ with $\sum \xi_i^p$, the constraints remain the same:

    \begin{aligned} &\min_{w,b,\{\xi_i\}} \quad \frac12 \|w\|^2 \;+\; C\sum_{i=1}^m (\xi_i)^p,\\ &\text{s.t.}\quad y_i(w^\mathsf{T}x_i + b)\,\ge\, 1-\xi_i,\quad \xi_i \ge 0. \end{aligned}
  • Dual: via the Lagrangian method, we obtain

    \begin{aligned} &\max_{\alpha\ge 0} \;\; \sum_{i=1}^m \alpha_i \;-\; \frac12 \sum_{i,j=1}^m \alpha_i\,\alpha_j\, y_i\,y_j\, x_i^\mathsf{T} x_j \;-\;\sum_{i=1}^m \phi(\alpha_i),\\ &\text{s.t.}\quad \sum_{i=1}^m \alpha_i\,y_i = 0, \end{aligned}

    where $\phi(\alpha_i)$ is determined jointly by $\xi_i^p$ and the complementary slackness conditions.

  • The core differences between $p=1$ and $p>1$:

    • When $p=1$, the dual variable $\alpha_i$ has the simple upper bound $\alpha_i \le C$;
    • When $p>1$, there is no box constraint $\alpha_i \le C$; instead, a nonlinear relation ($\alpha_i = C\,p\,(\xi_i)^{p-1}$) implicitly limits the range of $\alpha_i$. Geometrically, this means large errors are penalized more severely, which in turn affects the distribution of support vectors and the final classification margin.
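To make the summary concrete, the $p=2$ dual can be solved numerically with only $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$ as constraints, using the known closed form $\phi(\alpha_i) = \alpha_i^2/(4C)$ for the squared-hinge case. A minimal sketch; the toy dataset and variable names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-D data; the pair (+/-1.0, +/-0.5) sits closest to the boundary
X = np.array([[2.0, 2.0], [2.5, 1.5], [1.0, 0.5],
              [-2.0, -2.0], [-2.5, -1.5], [-1.0, -0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual_p2(a):
    # p = 2 dual: sum(a) - 1/2 a^T Q a - sum(a^2)/(4C); no upper bound on a
    return -(a.sum() - 0.5 * a @ Q @ a - (a**2).sum() / (4 * C))

res = minimize(neg_dual_p2, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, None)] * len(y),                         # only alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X
# For p = 2, xi_i = alpha_i/(2C), and any alpha_i > 0 satisfies
# y_i (w^T x_i + b) = 1 - alpha_i/(2C) exactly, which recovers b
sv = int(np.argmax(alpha))
b = y[sv] * (1.0 - alpha[sv] / (2 * C)) - w @ X[sv]
pred = np.sign(X @ w + b)
```

Unlike the $p=1$ solver, no upper bound `C` appears in the bounds; on this toy set the optimal $\alpha_i$ happen to stay below $C$ anyway, but nothing in the formulation caps them.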

Further reading

  1. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  2. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
  3. For the correspondence between generalized $p$-norm penalties and duality, see the treatment of conjugate (Fenchel) functions in convex analysis.

This completes the derivation of the primal and dual problems when the linear penalty on $\xi_i$ in the soft-margin SVM is extended to a $p$-norm penalty, along with the main differences between $p=1$ and $p>1$ in the dual constraints and their geometric interpretation.
