Optimization of Variational Inequalities As mentio...

Question

Optimization of Variational Inequalities
As mentioned in Section 3, both saddle point optimization and game optimization can be cast as
variational inequality problems. In this section, we present methods to solve (VIP).
Standard batch methods. The two standard methods studied in the VIP literature are the gradient
method (Bruck, 1977) and the extragradient method (Korpelevich, 1976). The iterates of the gradient
method are known to converge linearly when the operator is strongly monotone7
(Chen and Rockafellar, 1997), but oscillates for a bilinear operator as shown in Fig. 1. On the other hand, the uniform
average of the iterates of the gradient method converges for any monotone operator with a O(1/
√
t)
rate (Nedic and Ozdaglar ´ , 2009). Finally, the extragradient method does not require any averaging to
converge for monotone operators, and can even converge at the faster O(1/t) rate (Nesterov, 2007).
The idea of this method is to compute a lookahead step (see extrapolation step in Alg. 2) in order to
compute a more stable direction to follow. As we have seen in Fig. 1, oscillations for the gradient
method may result from the game between the generator and the discriminator. The extragradient
method takes advantage of the regularity of F to prevent such oscillations.

Stochastic extensions We present extensions of these standard methods in the context of a stochastic operator, i.e. F(x) := Eξ∼P [F(x, ξ)]. We assume that this expectation cannot be computed and
that we can only sample ξ ∼ P and compute F(x, ξ). This is motivated from the GAN formulation
where we only have access to a finite sample estimate of the expectation, and the gradients are usually
computed on mini-batches. For GANs, ξ is thus a mini-batch of points coming from the true data
distribution p and the generator distribution qθ. In the following, we focus on stochastic algorithms
designed to solve (VIP). These algorithms use stochastic approximations of the gradients (6) of
the respective cost functions of the discriminator and the generator. To handle constraints such as
parameter clipping (Arjovsky et al., 2017), we present a projected version of the algorithms, where
PX (y) denotes the projection of y onto X (see App. A). Note that when X = R
d
, the projection is
the identity mapping (unconstrained setting). For our analysis, we require one of the two following
assumption on the stochastic operator F(x, ξ):
Assumption 1. Bounded variance by σ
2
: Eξ[kF(x) − F(x, ξ)k
2
] ≤ σ
2
, ∀x ∈ X .
Assumption 2. Bounded expected squared norm by M2
: Eξ[kF(x, ξ)k
2
] ≤ M2
, ∀x ∈ X .

Assump. 1 is standard in stochastic variational analysis, while Assump. 2 is a stronger assumption
sometimes made in stochastic convex optimization. To illustrate how strong Assump. 2 is, note that
it does not hold for an unconstrained bilinear objective. It is thus mainly reasonable for bounded
constraint sets. Note that in practice we have σ M. In the following, we present three algorithms:
the first one is the stochastic extension of the gradient method for solving (VIP); the second one is a
stochastic extensions of the extragradient method and the third one is OneSEM, a novel variant of
SEM. Throughout this section
• x
∗
is a solution of (VIP) and x0 is the initialization of our sequence.
• R is set big enough such that R > kx0 − x
∗k and F is a monotone operator.
While we assume the monotonicity of F for our theoretical convergence results, we note that the
merit function presented in Sec. 3 is still defined for non-convex objectives and we can derive local
convergence results (see App. B.3).

Stochastic gradient method. The gradient method to solve (VIP) is also known as the forwardbackward algorithm. It has been studied with additive noise by Rosasco et al. (2016) in the general
context of monotone inclusion and by Nemirovski et al. (2009) in the context of stochastic saddle
point optimization. Alg. 1 presents the stochastic gradient method with averaging (AvgSGD), which
reduces to the standard (simultaneous) SGD updates for the two player games used in the GAN
literature, but here with averaging. We have not found in the literature a proof of convergence rate for
AvgSGD for a general stochastic VIP, and thus we present it here (all proofs are in App. E):
Theorem 1. Under Assump. 1 and 2, AvgSGD (Alg. 1) with constant step size γ gives,
E[ErrR(x¯T )] ≤
R2
2γT

γ
M2 + σ
2
2
where x¯T :=
1
T
T
X−1
t=0
xt , ∀T ≥ 1 . (11)
The constant term γ(M2 + σ
2
)/2 on the RHS of (11) is called the variance term. This kind of bound
is standard in stochastic optimization. We also provide in App. E a similar rate with an extra log
factor when γt = √
γ
t
. We show that this variance term is smaller than the one of SGD with prediction
method (Yadav et al., 2018) in App. D.

Stochastic extragradient method. The stochastic extragradient method (SEM), described in
Alg. 2, adds an extrapolation step compared to AvgSGD to reduce the oscillations due to the
game between the two players. Consequently, it has a smaller variance term than (11). We extend
the analysis from Juditsky et al. (2011), which assumed that X is convex and bounded, to prove the
convergence of SEM when X is potentially unbounded (allowing the unconstrained setting). As
discussed previously, Assump. 2 made in Thm. 1 for the convergence of AvgSGD is very strong in
the unbounded setting. One advantage of SEM is that Thm. 2 does not require this assumption.
Theorem 2. Under Assump. 1, if Eξ[F] is L-Lipschitz, then SEM (Alg.2) with a constant step size
γ ≤ √
1
3L
gives,
E[ErrR(y¯T )] ≤
R2
γT
+
7
2
γσ2 where y¯T :=
1
T
T
X−1
t=0
yt , ∀T ≥ 1 . (12)
Moreover, since in practice σ M, the variance term of SEM (12) is significantly smaller than the
one of AvgSGD (11). To summarize, SEM provides better convergence guarantees than AvgSGD but
requires two gradient computations and samples per iteration. This motivates our new method, that
we call OneSEM, that achieves the best of both worlds.
Re-using the gradient. One issue with SEM is that the algorithm "wastes" a gradient (Alg. 2
Line 4) when it steps back to xt after the extrapolation step. Thus, each iteration is twice costly than
a standard gradient method iteration. The idea of OneSEM is to store (Alg. 3 Line 4) and reuse the
previous extrapolated gradient to compute the new extrapolation point. With that, OneSEM requires
the computation of only one sample and one gradient per iteration. To start the algorithm, an initial
stored gradient d−1 (Alg.3 Line 1) is required for the first extrapolation step. In theory and in practice
we set it to F(x0, ξ−1) so that the first iteration of OneSEM becomes identical to SEM.
Theorem 3. Under Assump. 1, if Eξ[F] is L-Lipschitz, then OneSEM (Alg.3) with a constant step
size γ ≤
1
2
√
3L
gives,
E[ErrR(y¯T )] ≤
R2
γT
+
13
2
γσ2 where y¯T :=
1
T
T
X−1
t=0
yt ∀T ≥ 1 .

объясни подробно что здесь написано

Answer 1

Ниже ‒ подробный «разбор по полочкам» всего фрагмента. Я последовательно объясняю (1) постановку задачи, (2) классические (детерминированные) методы, (3) переход к стохастике, (4) три алгоритма — AvgSGD, SEM и One‑SEM — и (5) их теоретические гарантии.

1. Вариационные неравенства (VIP) и связь с играми

Вариационное неравенство для оператора $F\!:X\to\mathbb{R}^d$ на множестве $X\subseteq\mathbb{R}^d$ формулируется так:

\text{найти }x^\* \in X:\quad \langle F(x^\*),\,x-x^\*\rangle \;\ge 0\quad\forall x\in X.

В задачах седловой оптимизации и обучения в играх (напр. GAN‑ы) условия оптимальности Куна–Таккера сводятся ровно к такому неравенству, поэтому вся теория VIP автоматически даёт нам алгоритмы для этих задач.
Монотонность оператора $F$ (аналог «выпуклости» для функций) означает
$\langle F(x)-F(y),\,x-y\rangle\ge 0$ .
Сильная монотонность (как «строгая выпуклость») добавляет коэффициент $\mu>0$ :
$\langle F(x)-F(y),\,x-y\rangle\ge\mu\|x-y\|^2$ .

2. Детерминированные методы

Метод	Идея шага	Ключевой факт
Gradient method (Bruck, 1977)	$x_{t+1}=P_X\!\bigl[x_t-\gamma F(x_t)\bigr]$	Линейная сходимость при сильной монотонности, но для чисто билинейных игр (GAN‑ов) начинает «крутиться по кругу» (рис. 1 в статье).
Extragradient (Korpelevich, 1976)	«Смотрим вперёд на пол‑шага»: 1) $x_{t+\frac12}=P_X[x_t-\gamma F(x_t)]$ ; 2) $x_{t+1}=P_X[x_t-\gamma F(x_{t+\frac12})]$	Для просто монотонных операторов сходится без усреднения и даёт более быстрый порядок $O(1/t)$ . Суть: второй градиент берётся в более «стабильной» точке, что гасит колебания.

3. Переход к стохастическим операторам

В практических задачах, особенно в GAN‑ах, $F(x)$ — это математическое ожидание по мини‑батчу $\xi$ :

F(x)=\mathbb{E}_{\xi\sim P}\bigl[F(x,\xi)\bigr].

Мы не можем вычислить его точно, а лишь получаем стохастическую оценку $F(x,\xi)$ . Поэтому алгоритм каждый раз видит «зашумлённый» градиент.

Чтобы анализировать такие методы, вводятся два варианта гипотезы о шуме:

Assumption 1 (ограниченная дисперсия)
$\displaystyle \mathbb{E}_{\xi}\|F(x)-F(x,\xi)\|^2\le\sigma^2$ .
Традиционное допущение: шум в среднем «нормальный».
Assumption 2 (ограниченное ожидание квадрата нормы)
$\displaystyle \mathbb{E}_{\xi}\|F(x,\xi)\|^2\le M^2$ .
Сильнее первой: требует ограниченности самого стох. градиента. При неограниченных $X$ и билинейных играх почти никогда не выполняется, поэтому нужна лишь в особых случаях (например, когда параметры жёстко обрезаются).

4. Три стохастических алгоритма

4.1 AvgSGD (Stochastic Gradient с усреднением)

text
x_{t+1} = P_X[x_t - γ F(x_t, ξ_t)]
ŷ_T     = (1/T) Σ_{t=0}^{T-1} x_t

Усреднение (последняя строка) критично: сами итераты могут болтаться, но их среднее сходится.
Теорема 1. При шаге $γ=\text{const}$ :

\mathbb{E}[\text{Err}_R(\bar x_T)]\;\le\; \frac{R^2}{2γT}\;+\;\frac{γ}{2}\bigl(M^2+\sigma^2\bigr).

Первая часть (∝ 1/T) — «оптимизационная»;
вторая (∝ γ) — «шумовая» (variance term).

4.2 SEM (Stochastic Extra‑gradient)

text
y_t     = P_X[x_t - γ F(x_t, ξ_t)]      # экстраполяция
x_{t+1} = P_X[x_t - γ F(y_t, ζ_t)]      # основной шаг (новый батч)
ŷ_T     = (1/T) Σ y_t

Требует два вызова градиента на итерацию (ξ_t и ζ_t).
Теорема 2. При $L$ -липшицевом $\mathbb{E}[F]$ и $γ\le\sqrt{1/(3L)}$ :

\mathbb{E}[\text{Err}_R(\bar y_T)] \;\le\; \frac{R^2}{γT}\;+\;\frac{7}{2}\,γ\sigma^2.

Сравните: вместо $M^2+\sigma^2$ осталась только $\sigma^2$ и даже с меньшим коэффициентом 7/2 < 1/2 × 2. Т.е. сем» гасит шум лучше, но за счёт двойных вычислений.

4.3 One‑SEM (одноградиентный вариант)

Идея: «не выбрасывать» первый градиент, а сохранить его и использовать в следующем шаге.

text
# t = -1: вычисляем d_{-1} = F(x_0, ξ_{-1})
for t = 0,1,2,...
    y_t     = P_X[x_t - γ d_{t-1}]        # экстраполяция
    d_t     = F(y_t, ξ_t)                 # считаем градиент один раз
    x_{t+1} = P_X[x_t - γ d_t]            # основной шаг

На каждом цикле всего один градиент.
Теорема 3. При том же $L$ и $γ\le\frac{1}{2}\sqrt{1/(3L)}$ :

\mathbb{E}[\text{Err}_R(\bar y_T)] \;\le\; \frac{R^2}{γT}\;+\;\frac{13}{2}\,γ\sigma^2.

Шумовой коэффициент хуже, чем у SEM (13/2 против 7/2), но всё ещё лучше, чем у AvgSGD, а стоимость итерации такая же, как у AvgSGD. Поэтому One‑SEM «берёт лучшее» от двух миров.

5. Как читать эти оценки

$\text{Err}_R(\cdot)$ — так называемая мерит‑функция: мера того, насколько точка далека от решения VIP. У автора она определяется через проекцию на шар радиуса $R$ и имеет свойство, что $\text{Err}_R(x)=0$ ⇔ $x$ — решение.
Формула всегда имеет вид
$\underbrace{\frac{\text{константа}}{γT}}_{\text{«детерминированная» часть}} \;+\; \underbrace{γ\cdot\text{(шум)}}_{\text{variance term}}.$
Оптимальный порядок $O(1/\sqrt{T})$ достигается, если выбрать шаг $γ=Θ(1/\sqrt{T})$ . Тогда обе части становятся $O(1/\sqrt{T})$ .
Но константы при шуме различны:
$\begin{aligned} \text{AvgSGD}&:\quad \tfrac12(M^2+\sigma^2),\\ \text{SEM} &:\quad \tfrac72\,\sigma^2,\\ \text{One‑SEM}&:\quad \tfrac{13}{2}\,\sigma^2. \end{aligned}$
При $M\gg\sigma$ (типично в GAN‑ах) SEM и One‑SEM выигрывают.

6. Практическая интерпретация для GAN‑ов

Oscillation классического симметричного SGD в GAN‑ах («генератор гонится за дискриминатором») объясняется тем, что задача билинейная, оператор не сильно‑монотонный, и градиентный метод просто вращается вокруг седла.
Extragradient (SEM) добавляет «заглянуть вперёд», тем самым предугадывает реакцию второй стороны и сглаживает траекторию.
One‑SEM делает то же, но без удвоения вычислительных затрат, что критично, когда обучение и так дорого.

Короткое резюме

AvgSGD — самый простой, но шумно сходится.
SEM — самый устойчивый к шуму, но вдвое дороже.
One‑SEM — компромисс: почти устойчивость SEM при цене AvgSGD.

Все три метода сходятся с порядком $O(1/\sqrt{T})$ ; разница — в коэффициентах при шуме и в затратах на градиенты.