Stanford EE364A Convex Optimization I | Stephen Boyd | 2023 | Lecture 14
24 Mar 2024
Gradient Descent
- Gradient descent works poorly when the level sets are poorly conditioned (anisotropic) and works well when the level sets are well-conditioned (isotropic or spherical).
- The efficiency or speed of gradient descent depends on how aspherical or anisotropic the sublevel sets are.
- Steepest descent direction depends on the norm used.
- Steepest descent in the Euclidean norm is precisely gradient descent.
- The choice of the metric or norm for steepest descent makes a big difference in convergence.
- It is desirable to use a norm whose unit ball is roughly aligned with the shape of the sublevel sets of the function (see the sketch below).
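The following is a minimal sketch (not code from the lecture) of steepest descent in a quadratic norm ||z||_P = (z^T P z)^(1/2) compared with plain gradient descent (P = I) on a poorly conditioned quadratic; the matrix A, the starting point, and the backtracking parameters are illustrative choices. When P matches the Hessian, the sublevel sets look spherical in that norm and convergence is dramatically faster.

```python
import numpy as np

A = np.diag([1.0, 100.0])                # poorly conditioned quadratic: condition number 100
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def steepest_descent(P, x0, iters=25, alpha=0.25, beta=0.5):
    """Steepest descent in the norm defined by P; P = I reduces to gradient descent."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        dx = -np.linalg.solve(P, g)      # steepest descent direction in the P-norm
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:   # backtracking line search
            t *= beta
        x = x + t * dx
    return f(x)

x0 = np.array([1.0, 1.0])
print("P = I (gradient descent):  f =", steepest_descent(np.eye(2), x0))
print("P = A (norm matched to f): f =", steepest_descent(A, x0))
```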
Newton's Method
- The Newton step involves using the Hessian (second derivative) of the function at the current point as a metric to determine the direction of steepest descent.
- The Newton step can be thought of as a variable metric method, as the metric is changed at each step.
- Another interpretation of the Newton step is to minimize a second-order approximation of the function.
- The Newton step can also be derived by linearizing the optimality condition (gradient equals zero) and solving for the displacement.
- In one dimension, the Newton step involves finding the zero-crossing of a linear approximation of the derivative.
- Repeated application of the Newton step leads to rapid convergence to the minimum of a smooth convex function.
- Newton's method minimizes a function by repeatedly taking (damped) Newton steps, with a line search on the step length (a minimal implementation sketch appears after this list).
- The Newton decrement λ(x) measures how close x is to optimal; λ(x)²/2 is the decrease in the second-order model when a full Newton step is taken, and it serves as the stopping criterion.
- The quadratic metric defined by the Hessian transforms consistently under affine changes of coordinates.
- As a result, Newton's method is affine invariant: it is independent of affine (linear plus offset) changes of coordinates.
- Newton's method works exceptionally well for quadratic functions because it solves the entire problem in one step.
- The convergence of Newton's method depends on the third derivative of the function.
- Newton's method works exceptionally well when the third derivative of a function is small.
- A small third derivative indicates that the function is close to quadratic.
- A Lipschitz constant L on the Hessian is used to quantify how small the third derivative is.
- If the norm of the gradient exceeds a threshold η, each Newton step is guaranteed to decrease the function value by at least a fixed amount γ.
- Once the gradient norm falls below η, convergence accelerates and becomes quadratic: the number of correct digits roughly doubles at each step.
- The damped Newton phase is followed by the quadratically convergent phase.
- The number of damped Newton steps is therefore at most (f(x⁰) − p⋆)/γ.
- The number of additional iterations in the quadratic phase grows only as the logarithm of the logarithm of the desired accuracy, which is essentially a small constant in practice.
- The classical analysis of Newton's method involves hypotheses and constants (such as a strong convexity constant and a Lipschitz constant for the Hessian) that are unknown in practical problems, which makes the resulting convergence bounds impossible to evaluate, and in that sense useless.
- Boyd criticizes the emphasis on this kind of theoretical convergence analysis over practical performance, and suggests that many convergence results in the Western optimization literature are conceptually flawed in the same way.
- Quasi-Newton methods and other variations of Newton's method can be used to approximate the Hessian matrix and make the method more practical.
- Boyd also calls the classical convergence analysis aesthetically embarrassing: the bound is not affine invariant, even though the algorithm (and the code) is.
- He encourages a focus on practical performance and simplicity in optimization methods, rather than overly complex theoretical analysis.
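A minimal sketch of damped Newton's method as described above, with a backtracking line search and λ²/2 (half the squared Newton decrement) as the stopping criterion; the test function, tolerance, and line-search parameters below are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def newton(f, grad, hess, x0, tol=1e-10, alpha=0.25, beta=0.5, max_iter=50):
    x = x0.copy()
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)       # Newton step: solve H dx = -g
        lam_sq = -g @ dx                  # lambda(x)^2 = g^T H^{-1} g
        if lam_sq / 2 <= tol:             # predicted decrease small enough: stop
            break
        t = 1.0                           # backtracking line search (damped phase)
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# Illustrative smooth convex test function: f(x) = sum(exp(x_i) + exp(-x_i)) + a^T x.
a = np.array([0.3, -0.7])
f    = lambda x: np.sum(np.exp(x) + np.exp(-x)) + a @ x
grad = lambda x: np.exp(x) - np.exp(-x) + a
hess = lambda x: np.diag(np.exp(x) + np.exp(-x))

print(newton(f, grad, hess, np.zeros(2)))
```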
Practical Considerations
- Newton's method is a powerful optimization algorithm that converges quadratically near the solution: roughly speaking, the number of correct digits doubles at each iteration.
- It is particularly effective for smooth, unconstrained optimization problems.
- Each iteration of Newton's method computes the gradient and Hessian (matrix of second derivatives) of the objective at the current point to determine the search direction; the step length is then chosen by a backtracking line search.
- Despite its efficiency, Newton's method can be computationally expensive, especially for large-scale problems with many variables.
- Quasi-Newton methods, which build an approximation to the Hessian from gradient information, are common practical alternatives (see the sketch after this list).
- Newton's method is widely used in various fields, including machine learning, signal processing, and optimal control problems, where it can achieve high accuracy with relatively few iterations.
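As a usage note, in practice one often calls a library quasi-Newton routine such as SciPy's BFGS, which builds up a Hessian approximation from gradient differences instead of forming the Hessian explicitly; the test function here is an illustrative assumption, reused from the sketch above.

```python
import numpy as np
from scipy.optimize import minimize

# Same illustrative convex test function as above.
a = np.array([0.3, -0.7])
f = lambda x: np.sum(np.exp(x) + np.exp(-x)) + a @ x

# BFGS approximates the Hessian; with no jac argument, SciPy also
# approximates the gradient by finite differences.
res = minimize(f, x0=np.zeros(2), method="BFGS")
print(res.x, res.nit)
```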
Advanced Analysis
- The classical analysis of Newton's method is less sophisticated than the algorithm it analyzes.
- The issue is that the classical analysis expresses "the third derivative is small" in a way that is not affine invariant.
- Nesterov and Nemirovski introduced the concept of self-concordance, which is an affine invariant way of saying the third derivative is small.
- Many practical functions are self-concordant, and Newton's method based on self-concordance has a simpler analysis.
- Nesterov and Nemirovski's self-concordance analysis of Newton's method yields a complexity bound with no unknown problem constants, which makes it far more useful than the classical result (the definition and bound are given below).
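For reference, the definition of self-concordance used in the lecture (from Boyd & Vandenberghe, Ch. 9) and the shape of the resulting iteration bound; a function on Rⁿ is self-concordant if it is self-concordant along every line, and γ here is a constant determined only by the line-search parameters, not by the problem.

```latex
% Self-concordance (scalar case) and the resulting bound on Newton iterations
% needed to reach accuracy \epsilon from a starting point x^{(0)}.
\[
  |f'''(x)| \le 2\, f''(x)^{3/2} \quad \text{for all } x \in \operatorname{dom} f,
\]
\[
  \#\text{iterations} \;\le\; \frac{f(x^{(0)}) - p^\star}{\gamma}
  \;+\; \log_2 \log_2 (1/\epsilon).
\]
```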
Implementation Details
- Newton's method reduces the solution of minimizing a smooth convex function to minimizing a sequence of quadratic functions.
- Each iteration of Newton's method solves a linear system of equations, the Newton system ∇²f(x) Δx = −∇f(x) (a sketch of this step appears after this list).
- If the Hessian is structured (for example banded or sparse), this system can be solved far faster than the generic cubic cost, in some cases in nearly linear time.
- A variant sometimes called the "lazy Newton" method reuses the Hessian (and its factorization), recomputing it only every 20 steps or so, which can save substantial computation.
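A minimal sketch of the linear algebra inside one Newton iteration: solving the Newton system via a Cholesky factorization, assuming the Hessian is dense and positive definite. For banded or sparse Hessians, a structured factorization (for example scipy.linalg.solveh_banded or a sparse Cholesky) would replace the dense one and the cost drops well below cubic.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(H, g):
    """Return dx solving H dx = -g, exploiting H = H^T > 0 via Cholesky."""
    c, low = cho_factor(H)            # roughly n^3/3 flops for a dense Hessian
    return cho_solve((c, low), -g)    # two triangular solves, about n^2 flops each

# Tiny example with an assumed positive definite Hessian and gradient.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
print(newton_step(H, g))
```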