Strong Duality

  • Primal and dual problems

  • Strong duality via Slater’s condition

  • Geometric interpretation

  • Examples

Primal and dual problems

In this section, we consider a convex optimization problem
 p^\ast := \displaystyle\min_x \: f_0(x) ~:~ \begin{array}[t]{l} f_i(x) \le 0, \quad i=1,\ldots,m, \\ h_i(x) = 0, \quad i=1,\ldots,p, \end{array}
where the functions f_0, f_1, \ldots, f_m are convex, and h_1, \ldots, h_p are affine. We denote by \mathcal{D} the domain of the problem (which is the intersection of the domains of all the functions involved), and by \mathcal{X} \subseteq \mathcal{D} its feasible set.

To the problem we associate the Lagrangian \mathcal{L} : \mathbf{R}^n \times \mathbf{R}^m \times \mathbf{R}^p \rightarrow \mathbf{R}, with values
\mathcal{L}(x,\lambda,\nu) := f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x).
The dual function is g : \mathbf{R}^m \times \mathbf{R}^p \rightarrow \mathbf{R}, with values
 g(\lambda,\nu) := \min_x \: \mathcal{L}(x,\lambda,\nu).
The associated dual problem is
 d^\ast = \max_{\lambda \ge 0, \: \nu} \: g(\lambda,\nu).
Note that the dual variables associated with the affine equality constraints (in vector \nu) are not constrained in sign.
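To make these definitions concrete, here is a minimal numerical sketch on a hypothetical one-dimensional instance (not taken from the text): minimize f_0(x) = x^2 subject to f_1(x) = 1 - x \le 0. It evaluates the dual function by brute force and exhibits the lower bound it provides.

```python
import numpy as np

# Toy instance (hypothetical): minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0.
# Analytically, g(lam) = min_x x^2 + lam*(1 - x) = lam - lam^2/4, so that
# d* = max_{lam >= 0} g(lam) = 1, attained at lam = 2, while p* = 1 (attained at x = 1).

def dual_function(lam, xs=np.linspace(-5.0, 5.0, 20001)):
    """Evaluate g(lam) = min_x L(x, lam) by brute force over a grid."""
    return np.min(xs**2 + lam * (1.0 - xs))

lams = np.linspace(0.0, 4.0, 401)
g_vals = np.array([dual_function(lam) for lam in lams])
print(g_vals.max())   # ~1.0: the best lower bound d* coincides with p* for this instance
```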

Strong duality via Slater’s condition

Duality gap and strong duality

We have seen how weak duality allows us to form a convex optimization problem that provides a lower bound on the original (primal) problem, even when the latter is non-convex. The duality gap is the non-negative number p^\ast - d^\ast.

We say that strong duality holds for the above problem if the duality gap is zero: p^\ast = d^\ast.

Slater’s condition

We say that the problem satisfies Slater’s condition if it is strictly feasible, that is:
 \exists \: x_0 \in \mathcal{D} ~:~ f_i(x_0) < 0, \;\; i=1,\ldots,m, \;\; h_i(x_0) = 0, \;\; i=1,\ldots,p.
We can replace the above by a weaker form of Slater's condition, in which strict feasibility is not required for the constraints whose functions f_i are affine.

We then have the following result.

Theorem: Strong duality via Slater's condition

If the primal problem is convex and satisfies the weak form of Slater's condition, then strong duality holds, that is, p^\ast = d^\ast.

Note that there are many other similar results that guarantee a zero duality gap. For example:

If f_0 is convex quadratic, and the functions f_1,\ldots,f_m, h_1,\ldots,h_p are all affine, then the duality gap is always zero, provided one of the primal or dual problems is feasible. In particular, strong duality holds for any feasible linear optimization problem.
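The following sketch illustrates this fact numerically on a hypothetical convex quadratic program with affine inequality constraints (random data, CVXPY assumed available); it compares the primal optimum with the dual function evaluated at the optimal multipliers returned by the solver.

```python
import numpy as np
import cvxpy as cp

# Hypothetical instance: minimize (1/2)||x||^2 + c^T x subject to A x <= b.
np.random.seed(0)
n, m = 4, 6
A = np.random.randn(m, n)
x0 = np.random.randn(n)
b = A @ x0 + np.abs(np.random.randn(m))   # x0 is strictly feasible, so Slater's condition holds
c = np.random.randn(n)

x = cp.Variable(n)
constr = [A @ x <= b]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(x) + c @ x), constr)
p_star = prob.solve()

# Minimizing the Lagrangian over x in closed form gives
# g(lam) = -(1/2)||c + A^T lam||^2 - b^T lam.
lam = constr[0].dual_value
d_star = -0.5 * np.sum((c + A.T @ lam) ** 2) - b @ lam
print(p_star, d_star)   # equal up to solver tolerance: zero duality gap
```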

Geometric interpretation

Assume that there is only one inequality constraint in the primal problem (m=1), and let
\mathcal{A} := \left\{ (u,t) ~:~ \exists \: x \in \mathbf{R}^n, \;\; u \ge f_1(x), \;\; t \ge f_0(x) \right\}.

The problem is feasible if and only if \mathcal{A} intersects the left half-plane \{(u,t) : u \le 0\}. Furthermore, we have
 p^\ast = \min_{u,t} \: t ~:~ (u,t) \in \mathcal{A}, \;\; u \le 0,
and
 g(\lambda) = \min_{u,t} \: (\lambda,1)^T(u,t) ~:~ (u,t) \in \mathcal{A}.
If the minimum is finite, then the inequality (\lambda,1)^T(u,t) \ge g(\lambda) defines a supporting hyperplane of \mathcal{A}, with slope -\lambda. (See Figs. 5.3 and 5.4 in BV, p. 233.)

If the problem is convex, then \mathcal{A} is also convex. If Slater's condition holds, then the interior of \mathcal{A} intersects the left half-plane, and strong duality holds. (See Fig. 5.6 in BV, p. 236.)
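Continuing the hypothetical one-dimensional instance used in the earlier sketch (f_0(x) = x^2, f_1(x) = 1 - x), the following lines trace the lower boundary of \mathcal{A} and check numerically that the line \lambda u + t = g(\lambda) supports it.

```python
import numpy as np

# Points (f1(x), f0(x)) on the boundary of A = {(u,t) : exists x, u >= f1(x), t >= f0(x)}.
xs = np.linspace(-3.0, 3.0, 6001)
boundary = np.column_stack([1.0 - xs, xs**2])

lam = 2.0                                          # optimal multiplier for this toy instance
g_lam = np.min(boundary @ np.array([lam, 1.0]))    # g(lam) = min over A of lam*u + t
print(g_lam)                                       # ~1.0: the hyperplane lam*u + t = g(lam) supports A
```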

Examples

A convex counterexample

Convexity alone is not enough to guarantee strong duality. Consider for example the convex problem
 \min_{x, \: y>0} \: e^{-x} ~:~ x^2/y \le 0,
with variables x and y, and domain \mathcal{D} = \{(x,y) \;|\; y > 0\}. We have p^\ast = 1. The Lagrangian is \mathcal{L}(x,y,\lambda) = e^{-x} + \lambda x^2/y, and the dual function is
 g(\lambda) = \inf_{x, \: y>0} \left( e^{-x} + \lambda x^2/y \right) = \left\{ \begin{array}{ll} 0 & \lambda \ge 0, \\ -\infty & \lambda < 0, \end{array} \right.
so we can write the dual problem as
 d^\ast = \max_{\lambda} \: 0 ~:~ \lambda \ge 0,
with optimal value d^\ast = 0. The optimal duality gap is p^\ast - d^\ast = 1. In this problem, Slater's condition is not satisfied, since the constraint forces x = 0 for any feasible pair (x,y).
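A quick numerical sanity check of the counterexample (a sketch, with an arbitrarily chosen value of \lambda): along the path x = k, y = k^3 the Lagrangian tends to 0, so the dual function cannot exceed 0, while every feasible point has objective value 1.

```python
import numpy as np

# For any fixed lam >= 0, the Lagrangian e^{-x} + lam*x^2/y tends to 0 along x = k, y = k^3.
lam = 0.5
for k in [1.0, 10.0, 100.0, 1000.0]:
    print(k, np.exp(-k) + lam * k**2 / k**3)   # values decrease toward 0, so g(lam) <= 0

# Feasibility (x^2/y <= 0 with y > 0) forces x = 0, hence p* = e^0 = 1 and the gap is 1.
```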

Minimum Euclidean distance problem

The minimum distance to an affine set, mentioned in XXX, is
 \min \: \frac{1}{2} \|x\|_2^2 ~:~ Ax = b,
where A \in \mathbf{R}^{p \times n}, b \in \mathbf{R}^p, and we assume that A has full row rank, so that AA^T is invertible. The problem is convex, and satisfies Slater's condition (in fact, strong duality always holds for this convex quadratic problem). Hence, we know that p^\ast = d^\ast. This allows us to compute the optimal value of the problem analytically: p^\ast = d^\ast = \frac{1}{2} b^T(AA^T)^{-1} b.

We can also find a corresponding optimal point: for every \nu, the point x(\nu) = -A^T\nu achieves the minimum in the definition of the dual function g(\nu). Let us set x^\ast := x(\nu^\ast), where \nu^\ast = -(AA^T)^{-1}b denotes the optimal dual variable. The point x^\ast = A^T(AA^T)^{-1}b is optimal for the primal problem. Indeed, it is feasible, since Ax^\ast = AA^T(AA^T)^{-1}b = b, and its objective value equals the optimal value: \frac{1}{2} \|x^\ast\|_2^2 = \frac{1}{2} b^T(AA^T)^{-1} b = d^\ast = p^\ast. Hence, x^\ast is optimal, as claimed.
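The closed-form expressions above are easy to verify numerically; the sketch below uses random data (full row rank holds with probability one) and checks primal feasibility and the zero duality gap.

```python
import numpy as np

# Random instance of min (1/2)||x||^2 s.t. Ax = b (hypothetical data, A full row rank).
np.random.seed(0)
p_dim, n = 3, 6
A = np.random.randn(p_dim, n)
b = np.random.randn(p_dim)

nu_star = -np.linalg.solve(A @ A.T, b)   # optimal dual variable nu* = -(A A^T)^{-1} b
x_star = -A.T @ nu_star                  # x* = A^T (A A^T)^{-1} b
p_star = 0.5 * x_star @ x_star
d_star = 0.5 * b @ np.linalg.solve(A @ A.T, b)

print(np.allclose(A @ x_star, b))        # True: x* is primal feasible
print(p_star, d_star)                    # equal values: zero duality gap
```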

Linear optimization problem

Consider the LP in inequality form:
 p^\ast = \min_x \: c^Tx ~:~ Ax \le b,
where A \in \mathbf{R}^{m \times n}, b \in \mathbf{R}^m. Assume that the above problem is feasible, so that strong duality holds. Then the problem can be equivalently written in dual form, as an LP:
 p^\ast = d^\ast = \max_{\lambda} \: -b^T\lambda ~:~ \lambda \ge 0, \;\; A^T\lambda + c = 0.
The above LP is in standard form, with the number of constraints and variables exchanged.

Duality thus provides another way to convert any LP in inequality form into standard form, and vice versa. (The other method is via the introduction of new variables.)
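As an illustration, here is a sketch (random data assumed, constructed so that both problems are feasible) that solves a small inequality-form LP and its standard-form dual with scipy.optimize.linprog and compares the two optimal values.

```python
import numpy as np
from scipy.optimize import linprog

# Random instance of min c^T x s.t. A x <= b, constructed so that the primal is
# feasible (x0 satisfies A x0 < b) and bounded (c = -A^T lam0 with lam0 >= 0).
np.random.seed(1)
m, n = 8, 3
A = np.random.randn(m, n)
x0 = np.random.randn(n)
b = A @ x0 + np.abs(np.random.randn(m))
c = -A.T @ np.abs(np.random.randn(m))

# Primal LP in inequality form (x is a free variable).
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * n, method="highs")

# Dual LP in standard form: max -b^T lam s.t. lam >= 0, A^T lam + c = 0,
# solved here as min b^T lam subject to the same constraints.
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * m, method="highs")

print(primal.fun, -dual.fun)   # p* and d* coincide (strong duality for feasible LPs)
```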

Support vector machine classification

Return to the binary classification example seen earlier. Given m data points x_i \in \mathbf{R}^n, each of which is associated with a label y_i \in \{-1,1\}, the problem is to find a hyperplane that separates, as much as possible, the two classes. Let us denote Z = [y_1x_1, \ldots, y_mx_m] \in \mathbf{R}^{n \times m}.

Ideally, we would like to minimize the number of errors on the training set (x_i,y_i)_{i=1}^m. This is hard as it involves a non-convex function. An upper bound on the number of errors is provided by the so-called hinge loss function
 L(w,b) := \sum_{i=1}^m \left(1 - y_i(w^Tx_i + b)\right)_+.
We would also like to control the robustness of the resulting linear classifier, and at the same time guarantee uniqueness of the solution. It turns out that these objectives can be achieved via the following problem:
 \min_{w,b} \: C \cdot L(w,b) + \frac{1}{2} \|w\|_2^2,
where C>0 is a parameter that controls the trade-off between robustness and performance on the training set (a greater C encourages performance at the expense of robustness).

The above can be written as a QP, by introducing slack variables:
 \min_{w,b,v} \: \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^m v_i ~:~ v \ge 0, \;\; y_i(w^Tx_i + b) \ge 1 - v_i, \;\; i=1,\ldots,m,
or, more compactly:
 \min_{w,b,v} \: \frac{1}{2} \|w\|_2^2 + C\, v^T\mathbf{1} ~:~ v \ge 0, \;\; v + Z^Tw + b y \ge \mathbf{1}.
The corresponding Lagrangian is
\mathcal{L}(w,b,v,\lambda,\mu) = \frac{1}{2} \|w\|_2^2 + C\, v^T\mathbf{1} + \lambda^T(\mathbf{1} - v - Z^Tw - by) - \mu^Tv,
where \lambda \in \mathbf{R}^m corresponds to the inequality constraints v + Z^Tw + by \ge \mathbf{1}, and \mu \in \mathbf{R}^m corresponds to the sign constraints on v.

The dual function is given by
 g(\lambda,\mu) = \min_{w,b,v} \: \mathcal{L}(w,b,v,\lambda,\mu).
We can readily solve for w by taking derivatives, which leads to w(\lambda,\mu) = Z\lambda. Taking derivatives with respect to v yields the constraint C\mathbf{1} = \lambda + \mu (otherwise the minimum over v is -\infty), while taking derivatives with respect to b leads to the dual constraint \lambda^Ty = 0. We obtain
 g(\lambda,\mu) = \left\{ \begin{array}{ll} \lambda^T\mathbf{1} - \frac{1}{2} \|Z\lambda\|_2^2 & \mbox{if } \lambda^Ty = 0, \;\; \lambda + \mu = C\mathbf{1}, \\ -\infty & \mbox{otherwise.} \end{array}\right.
We obtain the dual problem
 d^\ast = \max_{\lambda \ge 0, \: \mu \ge 0} \: g(\lambda,\mu) = \max_{\lambda} \: \lambda^T\mathbf{1} - \frac{1}{2}\lambda^T Z^TZ\lambda ~:~ 0 \le \lambda \le C\mathbf{1}, \;\; \lambda^Ty = 0.
Strong duality holds, since the primal problem is a QP.

Note that the result depends only on the so-called kernel matrix K = Z^TZ \in \mathcal{S}_+^m, and the dual problem involves only m variables, with simple box constraints and a single equality constraint. Hence, the only dependence on the number of dimensions (features), n, is via the required computation of the kernel matrix, that is, on the scalar products x_i^Tx_j, 1 \le i \le j \le m. Thus, duality allows a great reduction in the computational effort, compared to solving the original QP, whose number of variables grows with n. This is known as the "kernel trick."
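The sketch below (synthetic two-dimensional data assumed, CVXPY assumed available) solves both the primal QP and the dual QP and checks that their optimal values coincide; the dual objective is written as \lambda^T\mathbf{1} - \frac{1}{2}\|Z\lambda\|_2^2, which mathematically equals \lambda^T\mathbf{1} - \frac{1}{2}\lambda^T K \lambda.

```python
import numpy as np
import cvxpy as cp

# Synthetic, linearly separable data (hypothetical instance).
np.random.seed(0)
n, m, C = 2, 40, 1.0
X = np.vstack([np.random.randn(m // 2, n) + 2.0, np.random.randn(m // 2, n) - 2.0])
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])
Z = (X * y[:, None]).T                      # Z = [y_1 x_1, ..., y_m x_m], of shape n x m

# Primal QP: min (1/2)||w||^2 + C 1^T v  s.t.  v >= 0, v + Z^T w + b y >= 1.
w, b, v = cp.Variable(n), cp.Variable(), cp.Variable(m)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(v)),
                    [v >= 0, v + Z.T @ w + b * y >= 1])
p_star = primal.solve()

# Dual QP: max 1^T lam - (1/2) lam^T K lam  s.t.  0 <= lam <= C 1, lam^T y = 0,
# with the quadratic term written as (1/2)||Z lam||^2 = (1/2) lam^T (Z^T Z) lam.
lam = cp.Variable(m)
dual = cp.Problem(cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(Z @ lam)),
                  [lam >= 0, lam <= C, lam @ y == 0])
d_star = dual.solve()

w_from_dual = Z @ lam.value                 # recover the primal weights via w = Z lam
print(p_star, d_star)                       # equal up to solver tolerance
print(np.linalg.norm(w.value - w_from_dual))  # small: dual solution reproduces w
```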

Note also that duality allows us to show that the optimal value of the problem is a convex function of the kernel matrix K, which makes it possible to optimize over the kernel matrix itself.