A final report for Prof. R. Srikant’s excellent ECE490 class at UIUC. Please note that while the proofs follow the same arguments as the sources, some typos have been corrected.

Introduction

Optimization, or the identification of the inputs at which a function attains its extreme value(s), is a critical task in industry and a fruitful academic subject. Training neural networks, for example, is an application of optimization which has had a major impact on industry and which was only made possible by academic breakthroughs. In this report, we examine two variations of gradient descent: mirror descent and Nesterov’s accelerated gradient descent.

Preliminaries

Let the optimization problem be that of finding the minimum. A local minimum of \(f: S \to \mathbb{R}\) is an \(x^* \in S\) such that \[f(x^*) \leq f(x) \ \forall \ x \in \{ x : || x - x^* || \leq \epsilon \}\] for some \(\epsilon > 0\). A global minimum is an \(x^*\) such that \(f(x^*) \leq f(x) \ \forall \ x \in S\). Since a local minimum is at least as good as the points in a nearby region, but a global minimum is at least as good as any point, optimization algorithms usually aim to find a global minimum. In this report, we consider optimization on functions \(f\) which are convex, i.e. which satisfy \[f(\alpha x + (1 - \alpha)y) \leq \alpha f(x) + (1-\alpha)f(y) \ \ \forall x,y \in S, \ \ \alpha \in [ 0, 1].\] In fact, if \(x^*\) is a local minimum of a convex function \(f\), then it is also a global minimum of \(f\).
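To see why the last claim holds, here is a sketch of the standard argument (the point \(y\) below is hypothetical and introduced only for the contradiction). Suppose \(x^*\) is a local minimum with radius \(\epsilon\) but some \(y \in S\) satisfies \(f(y) < f(x^*)\). For \(\alpha \in (0, 1]\), convexity gives \[f(\alpha y + (1-\alpha) x^*) \leq \alpha f(y) + (1-\alpha) f(x^*) < f(x^*),\] while \(|| \alpha y + (1-\alpha) x^* - x^* || = \alpha || y - x^* || \leq \epsilon\) for \(\alpha\) small enough, contradicting local minimality.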

Some functions are continuous and can be reasonably minimized, but are not differentiable everywhere, e.g. \(f(x) = \lvert x \rvert\). Accordingly, we apply the notion of a subgradient to perform optimization. A subgradient of \(f\) at the point \(a \in S\) is any \(v \in \mathbb{R}^d\) such that for all \(x \in S\) \[f(x) - f(a) \geq v^T (x - a).\] Throughout this report we use notation \(\nabla f(x)\) to refer to gradients, but it can be replaced with the corresponding subgradient if appropriate, since a convex function has a subgradient at each point.
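As a concrete example (ours, not from the sources), take \(f(x) = \lvert x \rvert\) on \(\mathbb{R}\) and \(a = 0\): any \(v\) with \(\lvert v \rvert \leq 1\) satisfies \(\lvert x \rvert - 0 \geq v x\) for all \(x\), so the subgradients at \(0\) form the interval \([-1, 1]\); at any \(a \neq 0\) the subgradient is unique and equals the ordinary derivative \(\mathrm{sign}(a)\).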

Gradient Descent

The standard optimization algorithm which can be applied to a differentiable \(f\) is gradient descent. Gradient descent uses the information provided by the gradient \(\nabla f\) to update its guess for the minimum. This is natural because the gradient encodes the direction of steepest descent of the function; going in the direction of the steepest descent should decrease the function’s value. Usually, the algorithm is presented in steepest descent form, \[\begin{aligned} x_{k+1} &= x_k - \alpha \nabla f(x_k), \\ x_0 &= \textup{initial guess} \in \mathbb{R}^d, \end{aligned}\] where \(\alpha > 0\) is the (scalar) step size. This is so-called because we move directly in the opposite direction of the gradient. As we saw in class [1], on a convex function, steepest descent with fixed step size converges with rate \(\mathcal{O} (1/k)\).
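As a minimal illustration (our own sketch, not from the sources; the quadratic objective and the step size are arbitrary choices), steepest descent with a fixed step size can be written in a few lines of Python:

    import numpy as np

    def gradient_descent(grad_f, x0, alpha=0.1, num_iters=100):
        """Steepest descent with a fixed step size alpha."""
        x = np.asarray(x0, dtype=float)
        for _ in range(num_iters):
            x = x - alpha * grad_f(x)  # move opposite the gradient
        return x

    # Example: minimize f(x) = ||x||^2, whose gradient is 2x.
    x_min = gradient_descent(lambda x: 2 * x, x0=[1.0, -2.0])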

However, gradient descent can be slow for certain types of practical problems [2]. In this report, we state and prove convergence for two algorithms which can provide improvements on gradient descent, Nesterov’s Accelerated Gradient (NAG) and mirror descent.

Mirror Descent

Overview

To understand mirror descent, we will show how it is a generalization of gradient descent. The following derivation is adapted from [3] and [4]. Consider the constrained optimization problem \[\min_{x \in S \subset \mathbb{R}^d} f(x).\] The projected gradient algorithm (PGA) applies gradient descent, using the Euclidean projection to constrain \(x\): \[\begin{aligned} x_{k+1} &= [x_k - \alpha_k \nabla f(x_k)]^+ \\ x_0 &= \textup{initial guess} \in \mathbb{R}^d. \end{aligned}\] Using the definition of Euclidean projection, the PGA can be written \[\begin{aligned} x_{k+1} &= \mathop{\mathrm{arg\,min}}_{x \in S} || (x_k - \alpha_k \nabla f(x_k)) - x ||^2 \\ &= \mathop{\mathrm{arg\,min}}_{x \in S} || (x_k - x) - \alpha_k \nabla f(x_k) ||^2 \\ &= \mathop{\mathrm{arg\,min}}_{x \in S} \biggl\{ || x - x_k ||^2 + 2 \alpha_k (x - x_k)^T \nabla f(x_k) \biggr\}. \end{aligned}\] Dropping terms that don’t depend on \(x\) and rearranging, \[x_{k+1} = \mathop{\mathrm{arg\,min}}_{x \in S} \biggl\{ \langle x, \nabla f(x_k) \rangle + \frac{1}{2\alpha_k} || x - x_k ||^2 \biggr\}.\] It follows that the projected gradient algorithm can be extended to a generalized notion of distance \(D: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^+\): \[x_{k+1} = \mathop{\mathrm{arg\,min}}_{x \in S} \biggl\{ \langle x, \nabla f(x_k) \rangle + \frac{1}{\alpha_k} D(x, x_k) \biggr\},\] where \(D\) satisfies \(D(u, v) > 0\) for \(u \neq v\) and \(D(u, u) = 0\) for all \(u\). The concept of distance we will use is the Bregman divergence. Let \(\phi: S \to \mathbb{R}\) be continuously differentiable and strongly convex with strong convexity parameter \(\sigma > 0\). Then the Bregman divergence is \[B_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla \phi (y) \rangle .\] Using \(B_\phi\) as the measure of distance, we obtain what Beck and Teboulle [3] call the subgradient algorithm with nonlinear projections (SANP): \[\begin{aligned} x_{k+1} = \mathop{\mathrm{arg\,min}}_{x \in S} \biggl\{ \langle x, \nabla f(x_k) \rangle + \frac{1}{\alpha_k} B_\phi(x, x_k) \biggr\}. \end{aligned}\] Though the mirror descent algorithm was originally stated by Nemirovskij and Yudin [5], a major contribution of [3] was to show that the original description of mirror descent is equivalent to SANP. Thus we refer to SANP and mirror descent interchangeably.
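To make the SANP update concrete, the following sketch (our own illustration, not the implementation of [3]) solves the per-iteration subproblem numerically with SciPy for the Euclidean choice \(\phi(x) = \frac{1}{2}||x||^2\), for which \(B_\phi(x, y) = \frac{1}{2}||x - y||^2\) and the update reduces to projected gradient descent; swapping in a different \(\phi\) only changes the bregman() function.

    import numpy as np
    from scipy.optimize import minimize

    def bregman(x, y):
        # Bregman divergence of phi(x) = 0.5 * ||x||^2, i.e. 0.5 * ||x - y||^2.
        return 0.5 * np.sum((x - y) ** 2)

    def sanp_step(grad_f, x_k, alpha, bounds):
        # One SANP step: argmin_x <x, grad f(x_k)> + (1/alpha) * B_phi(x, x_k),
        # here over a box constraint set handled by the solver.
        g = grad_f(x_k)
        objective = lambda x: np.dot(x, g) + bregman(x, x_k) / alpha
        return minimize(objective, x0=x_k, bounds=bounds).x

    # Example: one step on f(x) = ||x||^2 constrained to the box [0, 1]^2.
    x_next = sanp_step(lambda x: 2 * x, np.array([0.9, 0.4]), alpha=0.1,
                       bounds=[(0.0, 1.0), (0.0, 1.0)])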

Convergence

The convergence result is adapted from Beck and Teboulle’s [3] SANP analysis.

Theorem 1. SANP converges, i.e. \(\min_{1 \leq s \leq k} f(x_s) - f(x^*) \to 0\) as \(k \to \infty\), assuming the subgradients \(\nabla f(x_s)\) are bounded and the step sizes satisfy \(\sum_{s} \alpha_s = \infty\) and \(\alpha_s \to 0\).

Proof. For SANP, optimality of \(x_{k+1}\) in the minimization defining it implies \[\langle x - x_{k+1}, \alpha_k \nabla f(x_k) + \nabla \phi (x_{k+1}) - \nabla \phi (x_k) \rangle \geq 0 \ \ \forall x \in S.\]
Taking \(x = x^*\), \[\langle x^* - x_{k+1}, \nabla \phi (x_k) - \nabla \phi (x_{k+1}) - \alpha_k \nabla f(x_k) \rangle \leq 0.\] Since \(f\) is convex, by the definition of subgradient, \[\begin{aligned} 0 &\leq \alpha_k (f(x_k) - f(x^*)) \\ &\leq \alpha_k \langle x_k - x^*, \nabla f(x_k) \rangle \\ &= \langle x^* - x_{k+1}, \nabla \phi (x_k) - \nabla \phi(x_{k+1}) - \alpha_k \nabla f(x_k) \rangle \\ &+ \langle x^* - x_{k+1}, \nabla \phi (x_{k+1}) - \nabla \phi (x_k) \rangle \\ &+ \langle x_k - x_{k+1}, \alpha_k \nabla f(x_k) \rangle \\ & =: s_1 + s_2 + s_3, \end{aligned}\] where \(s_1, s_2, s_3\) denote the three summed terms, respectively. Then \[\begin{aligned} s_1 &\leq 0, \\ s_2 &= B_\phi (x^*, x_k) - B_\phi (x^*, x_{k+1}) - B_\phi (x_{k+1}, x_k), \\ s_3 &\leq \frac{1}{2\sigma} \alpha_k^2 || \nabla f(x_k) ||^2 + \frac{\sigma}{2} || x_k - x_{k+1} ||^2. \end{aligned}\] The first line is the optimality condition above, the second results from Lemma 4.1 in [3], and the final line follows from Young’s inequality, \[\langle a, b \rangle \leq \frac{1}{2\sigma} ||a||^2 + \frac{\sigma}{2} || b ||^2 \ \ \forall a,b \in \mathbb{R}^d.\]
Since \(\phi\) is \(\sigma\)-strongly convex, \(B_\phi (x_{k+1}, x_k) \geq \frac{\sigma}{2} || x_k - x_{k+1} ||^2\), i.e. \(-B_\phi (x_{k+1}, x_k) + \frac{\sigma}{2} || x_k - x_{k+1} ||^2 \leq 0\), so \[\begin{aligned} \alpha_k (f(x_k) - f(x^*)) &\leq s_1 + s_2 + s_3 \\ &\leq B_\phi (x^*, x_k) - B_\phi (x^*, x_{k+1}) + \frac{1}{2\sigma} \alpha_k^2 || \nabla f(x_k) ||^2. \end{aligned}\] Summing both sides over the iterations \(s = 1, \dots, k\) and canceling the telescoping sums, \[\begin{aligned} \sum_{s=1}^{k} \alpha_s (f(x_s) - f(x^*)) \leq B_\phi (x^*, x_1) - B_\phi (x^*, x_{k+1}) + \frac{1}{2\sigma} \sum_{s=1}^{k} \alpha_s^2 || \nabla f(x_s) ||^2. \end{aligned}\] Since \(B_\phi (x^*, x_{k+1}) \geq 0\) and \(\min_{1 \leq s \leq k} (f(x_s) - f(x^*)) \sum_{s=1}^{k} \alpha_s \leq \sum_{s=1}^{k} \alpha_s (f(x_s) - f(x^*))\), we directly obtain \[\begin{aligned} \min_{1 \leq s \leq k} f(x_s) - f(x^*) \leq \frac{B_\phi (x^*, x_1) + \frac{1}{2\sigma}\sum_{s=1}^{k} \alpha_s^2 ||\nabla f(x_s) ||^2}{\sum_{s=1}^{k} \alpha_s}. \end{aligned}\] With bounded subgradients, \(\alpha_s \to 0\), and \(\sum_{s} \alpha_s = \infty\), the right-hand side approaches zero as \(k \to \infty\). ◻

Interpretation

The approach above admits a proximal interpretation of mirror descent, in which we are trying to pick a close point according to the Bregman divergence. On the other hand, mirror descent can be written, viewing \(\phi\) as an operator, as \[x_{k+1} = (\nabla \phi)^{-1} (\nabla \phi (x_k) - \alpha \nabla f(x_k)),\] where \((\nabla \phi)^{-1} = \nabla(\phi^*)\) and \(\phi^*\) denotes the convex conjugate of \(\phi\) [6]. Formulated as such, we can view \(\nabla \phi\) as a map from the primal space to the dual space, where the gradient update is performed. Afterward, the updated point is mapped back to the primal space. The resulting point may lie outside the constraint set, in which case we use the Bregman projection to correct it. The procedure is illustrated in Figure 1.
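As a quick sanity check (ours, not from [6]), consider the Euclidean mirror map \(\phi(x) = \frac{1}{2} || x ||^2\). Then \(\nabla \phi(x) = x\) and \((\nabla \phi)^{-1}(z) = z\), so the update becomes \[x_{k+1} = x_k - \alpha \nabla f(x_k),\] recovering standard gradient descent; the scheme only behaves differently when \(\phi\) is non-Euclidean.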

Figure 1: Illustration of the mirror map from dual to primal space [7]

The underlying motivation for this scheme is twofold. First, it accounts for the fact that \(x\) and \(\nabla f(x)\) live in different spaces – \(x\) in the primal, \(\nabla f(x)\) in the dual. This distinction is easy to overlook when optimizing in \(\mathbb{R}^d\) because \(\mathbb{R}^d\) is self-dual, but it can have a significant impact. Second, the use of the Bregman divergence instead of the Euclidean distance can take advantage of the case where \(f\) is \(L\)-smooth with respect to a non-Euclidean norm [6].

Finally, an illustrative example of the improvement over gradient descent is given by the use of negative entropy as the mirror map [7], that is, \[\phi(x) = \sum_{i=1}^{d} x(i) \log x(i).\] In this case, using the Bregman divergence to project onto the simplex constraint set \(\Delta_d = \{ x \in \mathbb{R}^d_+ : \sum_{i=1}^{d} x(i)=1 \}\) amounts to the simple renormalization \(x \to \frac{x}{|| x ||_1}\). It can be seen that minimizing an \(f\) with bounded subgradients over \(\Delta_d\) results in a rate of convergence of order \(\sqrt{\frac{\log d}{k}}\), improving on the rate of order \(\sqrt{\frac{d}{k}}\) obtained by standard (projected) gradient descent in this setting.
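A minimal sketch of this negative-entropy variant (our own illustration; the linear objective is an arbitrary choice) follows the mirror-map recipe: a multiplicative gradient step in the dual followed by renormalization onto \(\Delta_d\).

    import numpy as np

    def entropic_mirror_descent(grad_f, x0, alpha=0.1, num_iters=100):
        """Mirror descent on the simplex with the negative-entropy mirror map."""
        x = np.asarray(x0, dtype=float)
        for _ in range(num_iters):
            # Dual-space gradient step, then the Bregman projection,
            # which here is just renormalization onto the simplex.
            w = x * np.exp(-alpha * grad_f(x))
            x = w / np.sum(w)
        return x

    # Example: minimize the linear function <c, x> over the simplex.
    c = np.array([0.3, 0.1, 0.6])
    x_star = entropic_mirror_descent(lambda x: c, x0=np.ones(3) / 3)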

Nesterov’s Accelerated Gradient Method

Overview

Nesterov proposed the accelerated gradient method (NAG) in 1983 [8], and in the last decade it has been successfully adapted to deep learning problems [9], [10] when interpreted as a momentum method.

The algorithm makes use of the sequences \[\begin{aligned} \lambda_k &= \frac{1 + \sqrt{1 + 4\lambda_{k-1}^2}}{2} \ \ \ (\lambda_0 = 0), \\ \gamma_k &= \frac{1 - \lambda_k}{\lambda_{k+1}}. \end{aligned}\] NAG consists of a gradient update term similar to standard gradient descent, as well as an additional correcting term: \[\begin{aligned} y_{k+1} &= x_k - \alpha \nabla f(x_k), \\ x_{k+1} &= (1 - \gamma_k) y_{k + 1} + \gamma_k y_k. \end{aligned}\] In the analysis below, \(f\) is assumed to be \(L\)-smooth and the step size is taken to be \(\alpha = \frac{1}{L}\).
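A direct transcription into Python (a sketch under the assumption that \(f\) is \(L\)-smooth; the quadratic test function is our own choice) could read:

    import numpy as np

    def nag(grad_f, x0, L, num_iters=100):
        """Nesterov's accelerated gradient method with step size 1/L."""
        x = np.asarray(x0, dtype=float)
        y = x.copy()
        lam = 1.0  # lambda_1 = 1, since the recursion starts from lambda_0 = 0
        for _ in range(num_iters):
            lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2  # lambda_{k+1}
            gamma = (1 - lam) / lam_next                    # gamma_k
            y_next = x - grad_f(x) / L                      # gradient step
            x = (1 - gamma) * y_next + gamma * y            # correcting term
            y, lam = y_next, lam_next
        return y

    # Example: minimize f(x) = ||x||^2, which is L-smooth with L = 2.
    x_min = nag(lambda x: 2 * x, x0=np.array([5.0, -3.0]), L=2.0)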

Convergence

The following derivation is adapted from [7]. We will use the descent lemma for an \(L\)-smooth \(f\) (that is, \(\nabla f\) is Lipschitz with constant \(L\)): \[f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} || y - x ||^2.\] Applied to the gradient step \(y_{k+1} = x_k - \frac{1}{L} \nabla f(x_k)\) and combined with convexity, it yields, for any \(z\), \[f(y_{k+1}) - f(z) \leq \nabla f(x_k)^T (x_k - z) - \frac{1}{2L} || \nabla f(x_k) ||^2.\]

Theorem 2. NAG satisfies \[f(y_k) - f(x^*) \leq \frac{2 L || x_1 - x^* ||^2}{k^2}.\]

Proof. Taking \(z = y_k\) in the inequality above and using \(\nabla f(x_k) = L(x_k - y_{k+1})\), \[\begin{aligned} f(y_{k+1}) - f(y_k) &\leq \nabla f(x_k)^T (x_k - y_k) - \frac{1}{2L} || \nabla f(x_k)||^2 \nonumber \\ &= L (x_k - y_{k+1})^T (x_k - y_k) - \frac{L}{2} ||x_k - y_{k+1} ||^2 \label{eq:1}. \end{aligned}\] Similarly, taking \(z = x^*\), \[\label{eq:2} f(y_{k+1}) - f(x^*) \leq L (x_k - y_{k+1})^T (x_k - x^*) - \frac{L}{2} || x_k - y_{k+1} ||^2.\] Multiplying ([eq:1]) by \((\lambda_k - 1)\) and adding the result to ([eq:2]), then letting \(\delta_k = f(y_k) - f(x^*)\), \[\begin{aligned} \lambda_k \delta_{k+1} - (\lambda_k - 1)\delta_k &\leq L (x_k - y_{k+1})^T (\lambda_k x_k - (\lambda_k - 1)y_k - x^*) \\ &- \frac{L}{2} \lambda_k ||x_k - y_{k+1} ||^2. \end{aligned}\] Next, multiplying this expression by \(\lambda_k\) and using the facts that \(\lambda^2_{k-1} = \lambda_k^2 - \lambda_k\) and \(2a^T b - || a ||^2 = ||b||^2 - ||b-a||^2\), \[\begin{aligned} \lambda_{k}^2 \delta_{k+1} - \lambda_{k-1}^2 \delta_{k} &\leq \frac{L}{2} \left( 2\lambda_{k}(x_{k} - y_{k+1})^{T}(\lambda_{k} x_k - (\lambda_{k} - 1)y_{k} - x^{*}) - || \lambda_{k}(y_{k+1} - x_{k}) ||^{2} \right) \nonumber \\ &= \frac{L}{2} \left( || \lambda_{k}x_{k} - (\lambda_{k} - 1)y_{k} - x^{*} ||^{2} - || \lambda_{k}y_{k+1} - (\lambda_{k} - 1)y_{k} - x^{*} ||^{2} \right) \label{eq:3}. \end{aligned}\] On the other hand, by definition, \[\begin{aligned} &x_{k+1} = y_{k+1} + \gamma_{k}(y_{k} - y_{k+1}) \nonumber \\ \iff& \lambda_{k+1}x_{k+1} = \lambda_{k+1}y_{k+1} + (1 - \lambda_{k})(y_{k} - y_{k+1}) \nonumber \\ \iff& \lambda_{k+1}x_{k+1} - (\lambda_{k+1} - 1)y_{k+1} = \lambda_{k}y_{k+1} - (\lambda_{k} - 1)y_{k} \label{eq:4}. \end{aligned}\] So combining ([eq:3]) and ([eq:4]), letting \(u_{k} = \lambda_{k}x_{k} - (\lambda_{k} - 1)y_{k} - x^{*}\), we get \[\lambda_{k}^{2}\delta_{k+1} - \lambda_{k-1}^{2}\delta_{k} \leq \frac{L}{2} \left( || u_{k} ||^{2} - || u_{k+1} ||^{2} \right).\] Summing over \(s = 1, \dots, k-1\), canceling the telescoping sums, using \(\lambda_0 = 0\), and dropping the \(-|| u_{k} ||^{2}\) term, \[\begin{aligned} \sum_{s=1}^{k-1} \left( \lambda_{s}^{2}\delta_{s+1} - \lambda_{s-1}^{2}\delta_{s} \right) &\leq \frac{L}{2} \sum_{s=1}^{k-1} \left( || u_{s} ||^{2} - || u_{s+1} ||^{2} \right) \\ \implies \ \ \delta_{k} &\leq \frac{L}{2 \lambda_{k-1}^{2}} || u_{1} ||^{2}. \end{aligned}\] Since \(\lambda_1 = 1\), we have \(u_1 = x_1 - x^*\). It can be shown by induction that \(\lambda_{k-1} \geq \frac{k}{2}\), hence we obtain the desired result, \[f(y_k) - f(x^*) \leq \frac{2 L || x_1 - x^* ||^2}{k^2},\] which demonstrates the convergence. ◻
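For completeness, the induction behind \(\lambda_{k-1} \geq \frac{k}{2}\) (equivalently \(\lambda_s \geq \frac{s+1}{2}\)) is short: the base case is \(\lambda_1 = 1 \geq 1\), and if \(\lambda_{s-1} \geq \frac{s}{2}\), then \[\lambda_s = \frac{1 + \sqrt{1 + 4\lambda_{s-1}^2}}{2} \geq \frac{1 + 2\lambda_{s-1}}{2} = \frac{1}{2} + \lambda_{s-1} \geq \frac{s+1}{2}.\]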

Interpretation

NAG can be seen as a type of momentum method, though it took until the last decade for this interpretation to be pointed out and applied to machine learning [9]. Momentum methods are ones in which the update depends to some extent on the previous update vectors. Thus, like a ball rolling down a hill, it gains some momentum in the steepest direction [11].

However, in NAG, there is a subtle yet fundamental difference in how the momentum is used to perform the update. While the standard momentum method performs the gradient update first and then makes a big jump in the direction of the accumulated momentum, NAG makes a big jump in the direction of the momentum first, and then corrects it with the gradient evaluated at the resulting position. Geoffrey Hinton describes the improvement with the phrase “it’s better to correct a mistake after you have made it” [12].
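The contrast can be seen directly in the update rules. The sketch below (our own schematic, following the momentum-form parameterization described in [9]; \(\mu\) is the momentum coefficient and \(\alpha\) the step size) shows classical momentum next to the Nesterov variant:

    def momentum_step(x, v, grad_f, alpha, mu):
        # Classical momentum: gradient evaluated at the current point.
        v_new = mu * v - alpha * grad_f(x)
        return x + v_new, v_new

    def nag_step(x, v, grad_f, alpha, mu):
        # Nesterov: jump along the accumulated momentum first, then correct
        # with the gradient evaluated at the look-ahead point.
        v_new = mu * v - alpha * grad_f(x + mu * v)
        return x + v_new, v_new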

The theoretical superiority of NAG is apparent in the convergence proof: it converges at rate \(\mathcal{O}(1 / k^2)\). As mentioned earlier, gradient descent only converges at rate \(\mathcal{O}(1 / k)\). In NAG, the careful choice of the sequence \(\lambda_k\) allowed us to introduce \(k^2\) in the denominator, yielding the accelerated \(\mathcal{O}(1 / k^2)\) rate.

Conclusion

We reviewed two advanced optimization methods which derive from gradient descent. While the theoretical properties demonstrated here are interesting and (reasonably) accessible, they are still that – theoretical. A practical introduction to how optimizers are used can be obtained by exploring the PyTorch implementations [13].
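For instance, a minimal sketch of how such an optimizer is invoked in PyTorch (the model and hyperparameters below are placeholders; see [13] for the full interface) is:

    import torch

    model = torch.nn.Linear(10, 1)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, nesterov=True)

    loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()  # one Nesterov-momentum update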

[1] R. Srikant, “Introduction to Optimization.” Class notes for ECE 490, University of Illinois Urbana-Champaign, Aug. 2023.

[2] R. Tibshirani, “Optimization: Lecture 6.” Class notes for 10-725, Carnegie Mellon University, Aug. 2013.

[3] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.

[4] T. Lienart, “Mirror descent algorithm.” https://tlienart.github.io/posts/2018/10/27-mirror-descent-algorithm/, 2018.

[5] A. S. Nemirovskij and D. B. Yudin, “Problem complexity and method efficiency in optimization,” 1983.

[6] A. Andersen, “Mirror descent part 2: The algorithm.” https://angms.science/doc/CVX/MD_algo.pdf, Nov. 12, 2021.

[7] S. Bubeck et al., “Convex optimization: Algorithms and complexity,” Foundations and Trends in Machine Learning, vol. 8, no. 3–4, pp. 231–357, 2015.

[8] Y. E. Nesterov, “A method for solving the convex programming problem with convergence rate \(O(\frac{1}{k^2})\),” in Dokl. Akad. Nauk sssr, 1983, pp. 543–547.

[9] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, PMLR, 2013, pp. 1139–1147.

[10] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, “Advances in optimizing recurrent networks,” in 2013 IEEE international conference on acoustics, speech and signal processing, IEEE, 2013, pp. 8624–8628.

[11] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.

[12] G. Hinton, “Lecture 6C: The momentum method.” https://www.youtube.com/watch?v=N18Km9YIIug, 2013.

[13] “PyTorch 2.3 documentation.” Available: https://pytorch.org/docs/stable/optim.html