A Simple Convex Optimization Algorithm

In this section, we aim to show that convex optimization is a computationally tractable problem, by presenting a simple method for general-purpose convex optimization, the subgradient method.

  • The subgradient method

  • Convergence analysis and complexity

  • Projected subgradient method

The subgradient method

The subgradient method is a simple algorithm for minimizing a non-differentiable convex function, and more generally, for solving convex optimization problems. Its complexity in terms of problem size is very good (each iteration is cheap), but in terms of accuracy, very poor (the algorithm typically requires thousands or millions of iterations). The convergence analysis of the algorithm has an important theoretical consequence: it shows that convex optimization is essentially ‘‘tractable’’, provided we can compute subgradients of the objective and constraint functions efficiently.

The method is part of a large class of methods known as first-order methods, which are based on subgradients or closely related objects. In contrast, second-order methods typically use Hessian information; each iteration is expensive (both in flops and memory), but convergence is obtained very quickly.

Setup

The setup is as follows. Let f: mathbf{R}^n to mathbf{R} be a convex function. We seek to solve the unconstrained problem
 p^ast = min_x : f(x).
To minimize f, the subgradient method described below uses only subgradient information at every step.

Now consider the convex, inequality constrained problem:
 min : f_0(x) ~:~ f_i(x) le 0, ;; i=1,ldots, m.
We can formulate this as an unconstrained problem of minimizing the function h, with
 h(x) = f_0(x) if f_i(x) le 0, ;; i=1,ldots,m, and h(x) = +infty otherwise.
Certainly, h is not differentiable, but it is convex. So we can apply the subgradient method for unconstrained minimization to this function.
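The function h can be transcribed directly; the one-dimensional instance below (minimize f_0(x) = x^2 subject to the single constraint f_1(x) = 1 - x le 0) is an assumed example for illustration, not from the text.

```python
import math

# Assumed 1-D instance: f0(x) = x^2 with one constraint f1(x) = 1 - x <= 0.
def h(x):
    """h equals f0 on the feasible set and +infinity elsewhere."""
    return x**2 if 1 - x <= 0 else math.inf
```

For example, h(2.0) returns 4.0, while h(0.5) (infeasible) returns math.inf; the minimizer of h is the constrained minimizer x = 1.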

Algorithm

The algorithm proceeds as follows:
 x^{(k+1)} = x^{(k)} - alpha _k g^{(k)}, ;; g^{(k)} in partial f(x^{(k)}) .
Here, x^{(k)} is the k-th iterate, g^{(k)} is any subgradient of f at x^{(k)}, and alpha_k>0 is the k-th step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. Since this is not a descent method (the objective is not guaranteed to decrease at each step), we must keep track of the best solution seen so far, via the values
 p_{rm best} ^{(k)} = min _{1leq i leq k} : f(x^{(i)}) .
We will also denote overline{p} = lim_{krightarrow infty} p_{rm best} ^{(k)}.
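The iteration above can be sketched in a few lines. The example below minimizes f(x) = |x|_1 (an assumed test function, with subgradient sign(x)) and tracks the best value seen so far, as described:

```python
import numpy as np

def subgradient_method(x0, step, n_iter=1000):
    """Minimize f(x) = ||x||_1 by the subgradient method; step(k) gives alpha_k."""
    x = x0.copy()
    p_best = np.sum(np.abs(x))            # best objective value seen so far
    for k in range((1), n_iter + 1):
        g = np.sign(x)                    # a subgradient of ||.||_1 at x
        x = x - step(k) * g               # not a descent method: f may increase
        p_best = min(p_best, np.sum(np.abs(x)))
    return p_best

p = subgradient_method(np.array([3.0, -2.0]), step=lambda k: 1.0 / np.sqrt(k))
```

With the diminishing step sizes alpha_k = 1/sqrt(k) used here, p approaches the optimal value 0.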

Convergence analysis and complexity

To analyze the convergence, we make the following assumptions:

  1. p^ast is finite and attained

  2. There is a constant G such that, for every x, and every g in partial f(x), we have |g|_2 le G.

  3. There is a constant R such that |x^{(1)} - x^star|_2 le R for any minimizer x^star of f.

Now let x^star be any minimizer of f. We have
 begin{array}{rcl}
 |x^{(k+1)} - x^star|_2^2 &=& |x^{(k)} - alpha _k g^{(k)} - x^star|_2^2
 &=& |x^{(k)} - x^star|_2^2 - 2alpha_k g^{(k)T} (x^{(k)}-x^star) + alpha_k^2 |g^{(k)}|_2^2
 &le& |x^{(k)} - x^star|_2^2 - 2alpha_k (f(x^{(k)})-p^star) + alpha_k^2 |g^{(k)}|_2^2 ,
 end{array}
where we have used the subgradient inequality:
 p^star = f(x^star) ge f(x^{(k)}) + g^{(k)T}(x^star-x^{(k)}) .
Applying this recursively, we get
 begin{array}{rcl}
 |x^{(k+1)} - x^star|_2^2 &le & |x^{(1)}-x^star|_2^2 - 2sum_{i=1}^k alpha_i (f(x^{(i)})-p^star) + sum_{i=1}^k alpha_i^2 |g^{(i)}|_2^2
 &le& R^2 - 2sum_{i=1}^k alpha_i (f(x^{(i)})-p^star) + G^2 sum_{i=1}^k alpha_i^2 .
 end{array}
By definition of p_{rm best} ^{(k)}, we have
 sum_{i=1}^k alpha_i (f(x^{(i)})-p^star) ge (p_{rm best} ^{(k)}-p^star) left( sum_{i=1}^k alpha_i right),
which, since |x^{(k+1)} - x^star|_2^2 ge 0, yields
 p_{rm best}^{(k)}-p^star le frac{R^2 + G^2 sum_{i=1}^k alpha_i^2}{2sum_{i=1}^k alpha_i}.

This shows that

  1. For constant step sizes (alpha _k = alpha for every k), the condition G^2alpha/2 le epsilon guarantees that overline{p}-p^star le G^2 alpha /2 le epsilon.

  2. For constant step lengths (alpha _k = gamma /|g^{(k)}|_2, so that each step has length |x^{(k+1)}-x^{(k)}|_2 = gamma), the condition G gamma /2 le epsilon guarantees that overline{p}-p^star le G gamma /2 le epsilon.
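The constant-step-size case can be checked numerically; the values of R, G, and alpha below are assumed for illustration. As k grows, the bound indeed approaches G^2 alpha/2:

```python
# Evaluate the bound (R^2 + G^2 * sum alpha_i^2) / (2 * sum alpha_i) for
# constant steps alpha_i = alpha; R, G, alpha are assumed values.
R, G, alpha = 1.0, 2.0, 0.01
limit = G**2 * alpha / 2                  # the k -> infinity limit of the bound
for k in (100, 100_000):
    bound = (R**2 + G**2 * k * alpha**2) / (2 * k * alpha)
    # bound shrinks toward `limit` as k grows
```

At k = 100 the bound is still dominated by the R^2/(2 k alpha) term; by k = 100,000 it is within a few percent of G^2 alpha/2.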

Complexity

The convergence analysis result above depends on the choice of the step size rule. Which rule is optimal for this bound? The problem of minimizing the upper bound above is convex and symmetric (the bound does not change when we exchange the alpha_i's). Hence, the optimal alpha_i's are all equal at optimum, to a number alpha = (R/G) k^{-1/2}, and the bound becomes RG/sqrt{k}. With this choice of step size rule, the number of iterations needed to achieve epsilon-sub-optimality, as predicted by the analysis, is (RG/epsilon)^2. The dependence on epsilon is thus O(1/epsilon^2), which is very slow. Interior-point methods alluded to earlier typically behave as O(log(1/epsilon)), but they do not apply to general convex problems, only to a special class.
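The claimed optimal choice can also be verified numerically; with assumed values of R, G, and k, the step alpha_i = (R/G) k^{-1/2} makes the bound equal to RG/sqrt{k}:

```python
import math

R, G, k = 2.0, 5.0, 400                       # assumed values for illustration
alpha = (R / G) / math.sqrt(k)                # the optimal constant step size
bound = (R**2 + G**2 * k * alpha**2) / (2 * k * alpha)
# algebraically, bound = R*G/sqrt(k), so epsilon-accuracy needs k = (R*G/epsilon)^2
```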

The above discusses the dependence of complexity on the accuracy epsilon. The complexity depends on problem size as well, and more generally, on the structure of the function involved, via the need to compute subgradients at each step, and also the parameters R and G.

Example: We seek to find a point in a given intersection of closed convex sets,
 C=C_1cap cdots cap C_m subseteq mathbf{R}^n .
We formulate this problem as minimizing f, where
 f(x)=max(mbox{bf dist}(x,C_1), ldots, mbox{bf dist}(x,C_m)) .
Let P_j be the projection operator onto C_j, and let
 f_j(x)=mbox{bf dist}(x,C_j) = |x - P_j(x)|_2
be the corresponding distance function; that is, P_j(x) is the point of C_j closest to x. Then, for x notin C_j, a subgradient of f_j at x is
 g_j (x) = frac {x-P_j (x)} {|x-P_j (x)|_2}.
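A small sketch of this example, assuming two sets in mathbf{R}^2 with closed-form projections (a unit ball and a halfspace, chosen here for illustration). The step size alpha_k = f(x^{(k)}) used below is a Polyak-type step, not one of the rules analyzed above; it is valid here because the optimal value is zero and the subgradients have unit norm.

```python
import numpy as np

def proj_ball(x):                         # projection onto C1 = {|x|_2 <= 1}
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

def proj_halfspace(x):                    # projection onto C2 = {x[0] >= 0.5}
    return np.array([max(x[0], 0.5), x[1]])

x = np.array([-3.0, 4.0])
for _ in range(100):
    projs = [proj_ball(x), proj_halfspace(x)]
    dists = [np.linalg.norm(x - p) for p in projs]
    j = int(np.argmax(dists))             # f(x) = max_j dist(x, C_j)
    if dists[j] < 1e-9:                   # x is (numerically) in the intersection
        break
    g = (x - projs[j]) / dists[j]         # the subgradient g_j(x) from the text
    x = x - dists[j] * g                  # alpha_k = f(x^{(k)}): a Polyak step
```

With this step size, each iteration projects x onto the farthest set, and the loop terminates at a point of the intersection.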

Projected subgradient method

One extension of the subgradient method for solving constrained optimization problems is the projected subgradient method. We consider the problem
 min_x : f(x) ~:~ x in C,
where C is a convex set, which can be defined by a set of inequality constraints. The projected subgradient method uses the iteration
 x^{(k+1)}=P_C (x^{(k)} - alpha_k g^{(k)}),
where P_C is the projection onto C, and g^{(k)} is any subgradient of f at x^{(k)}.
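A minimal sketch of the projected subgradient iteration, on an assumed problem with a known answer (not from the text): minimize f(x) = |x - c|_2 over the unit ball C. When c lies outside the ball, the optimum is c/|c|_2, with optimal value |c|_2 - 1.

```python
import numpy as np

def proj_C(x):                            # Euclidean projection onto the unit ball
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

c = np.array([3.0, 4.0])                  # assumed data; the optimum is c/|c|_2
x = np.zeros(2)
best = np.inf
for k in range(1, 1001):
    g = (x - c) / np.linalg.norm(x - c)   # subgradient of |x - c|_2 (for x != c)
    x = proj_C(x - g / np.sqrt(k))        # subgradient step, then project onto C
    best = min(best, np.linalg.norm(x - c))
# best approaches the optimal value |c|_2 - 1
```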

Example: Let A in mathbf{R}^{m times n}, b in mathbf{R}^m, with m le n, A full rank (hence, AA^T succ 0). We consider the problem
 min_x : f(x) ~:~ Ax = b.
The projection of a point z onto the affine set {x ::: Ax=b } is
 P(z) = z - A^T(AA^T )^{-1} (Az-b).
If the initial point x^{(1)} is feasible, every iterate remains feasible, and the projected subgradient iterations become
 x^{(k+1)} = P(x^{(k)} - alpha_k g^{(k)}) = x^{(k)} - alpha_k P_N g^{(k)},
where P_N = I - A^T(AA^T )^{-1} A is the (linear) projection onto the nullspace of A; the second equality uses Ax^{(k)} = b.
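A sketch of this last example, assuming f(x) = |x|_1 and random data A, b for illustration. Starting from a feasible point, each iterate stays feasible because the step is confined to the nullspace of A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))            # m = 2 < n = 5; full rank for generic data
b = rng.standard_normal(2)

P_N = np.eye(5) - A.T @ np.linalg.solve(A @ A.T, A)   # nullspace projector
x = A.T @ np.linalg.solve(A @ A.T, b)      # feasible start: the least-norm solution
best = np.sum(np.abs(x))
for k in range(1, 2001):
    g = np.sign(x)                         # subgradient of f(x) = |x|_1
    x = x - (P_N @ g) / np.sqrt(k)         # the step stays in {x : Ax = b}
    best = min(best, np.sum(np.abs(x)))
```

Every iterate satisfies Ax = b (up to roundoff), and `best` tracks the smallest ell_1 norm found.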