Lec-optimization

Lecture notes: Optimization formulations

Plan/outline

I will summarize what we covered in the three lectures on formulating problems as optimization. The main takeaways here are:

How can we express different problems, particularly "combinatorial" problems (like shortest path, minimum spanning tree, matching, etc.) as optimization?
Linear and convex optimization: these are specific optimization problems that are known to be solvable efficiently (in polynomial time).
Expressing combinatorial problems as linear programs by "relaxing" binary constraints, and how these "relaxations" can be used to obtain algorithms. This is known as the relax and round paradigm for developing algorithms.

What is optimization?

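In an optimization problem, we are given a function $f$ of variables $x_1, x_2, \dots, x_n$, together with a domain $\mathcal{D}$ for the variables (for example, if every $x_i$ is real-valued, the domain is $\mathbb{R}^n$). The goal is to optimize (i.e., minimize or maximize) $f$ over $\mathcal{D}$, subject to constraints of the form $g_i (x_1, x_2, \dots, x_n) \ge 0$ for $1 \le i \le m$.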

Depending on the domain $\mathcal{D}$, the objective $f$, the type and the number of constraints, etc., solving optimization problems can have vastly different complexities. Studying procedures for optimization is an area in itself, and is well beyond the scope of this course.

That said, many optimization algorithms are iterative procedures, analogous to local search (which we saw earlier). One common procedure is gradient descent: start with a feasible point $x^{(0)}$, and repeatedly take a small step in the direction that improves the objective, as given by the gradient of $f$ (for minimization, this means moving against the gradient). This simple heuristic has been extremely successful in practice, with most of modern machine learning using variants of gradient descent. One of the main issues with gradient descent and other "local search" approaches is that they can converge to "local optima" (for minimization, troughs of the function which may have sub-optimal objective value).
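The behavior described here can be seen in a small sketch. The objective below, its gradient, the step size and the iteration count are all illustrative assumptions (they are not from the lecture); the point is only that two different starting points can end in two different troughs, one of which has a worse objective value.

```python
def gradient_descent(grad, x0, step=0.01, iters=2000):
    """Minimize a one-variable function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x


# Hypothetical objective with two troughs: f(x) = x^4 - 3x^2 + x.
def f(x):
    return x**4 - 3 * x**2 + x


def f_grad(x):
    return 4 * x**3 - 6 * x + 1


# Same procedure, two starting points.
left = gradient_descent(f_grad, x0=-2.0)
right = gradient_descent(f_grad, x0=2.0)

print("from x0 = -2:", left, f(left))
print("from x0 = +2:", right, f(right))
```

Both runs stop at points where the gradient is essentially zero, but the run started at $x^{(0)} = 2$ ends in the shallower trough: it is a local optimum, not the global one.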

Since the difficulty of an optimization problem depends heavily on $f, \mathcal{D}, g_i$, etc., and since problems typically have multiple optimization formulations (which may have very different complexities), we will see some concrete examples and illustrate the power of the optimization paradigm in algorithm design.

Expressing problems as optimization

Let us start with some basic examples of optimization.

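Least squares regression. Suppose we are given points $v_1, v_2, \dots, v_m$ in $\mathbb{R}^n$, along with a real value $y_i$ for each $v_i$. The goal is to find a function $f$ such that $f(v_i) = y_i$. In linear regression, we look for a linear $f$; a linear function is one of the form $f(v_i) = \langle x, v_i \rangle$ for some $x \in \mathbb{R}^n$.

Given the data $\{ v_i, y_i \}_{i=1}^m$, we thus want to find an $x$ such that $\langle x, v_i \rangle \approx y_i$ for all $i$. One way to formalize this is to minimize the sum of the squared errors, $\sum_{i=1}^m (\langle x, v_i \rangle - y_i)^2$.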

This is now an optimization problem:

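Variables: the coordinates of the vector $x$ we wish to find
Constraints: none; $x$ can be any vector in $\mathbb{R}^n$ (this is therefore known as an unconstrained optimization)
Objective: minimize $\sum_{i=1}^m (\langle x, v_i \rangle - y_i)^2$.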

It turns out that the least squares regression problem can be solved efficiently (in polynomial time; indeed, there has been a lot of research leading to a nearly linear time algorithm for it).
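As a small illustration of the least squares objective, here is a sketch for the special case $n = 2$. It solves the so-called normal equations (a standard closed-form approach for least squares, not necessarily the algorithm alluded to above) by Cramer's rule. The data points and function names are made up for the example.

```python
def squared_error(x, vs, ys):
    """Objective: sum over i of (<x, v_i> - y_i)^2."""
    return sum(
        (sum(xj * vj for xj, vj in zip(x, v)) - y) ** 2
        for v, y in zip(vs, ys)
    )


def solve_least_squares_2d(vs, ys):
    """Minimize the squared error for n = 2 via the normal equations.

    Solves the 2x2 system (sum_i v_i v_i^T) x = sum_i y_i v_i by Cramer's rule.
    Assumes the v_i are not all parallel, so the determinant is nonzero.
    """
    a11 = sum(v[0] * v[0] for v in vs)
    a12 = sum(v[0] * v[1] for v in vs)
    a22 = sum(v[1] * v[1] for v in vs)
    b1 = sum(v[0] * y for v, y in zip(vs, ys))
    b2 = sum(v[1] * y for v, y in zip(vs, ys))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Hypothetical data: v_i = (1, t_i), so <x, v_i> = x_1 + x_2 * t_i (a line fit).
vs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ys = [1.0, 3.1, 4.9, 7.0]

x = solve_least_squares_2d(vs, ys)
print(x, squared_error(x, vs, ys))
```

Moving $x$ slightly in any direction can only increase the objective, which is a quick sanity check that the point found is a minimizer.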

The three aspects above: variables, constraints and objective, are what define an optimization problem, and these should be explicitly stated in any optimization formulation.

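Linear programming. Here the variables are $x_1, x_2, \dots, x_n$, with domain $\mathbb{R}^n$. The objective is a linear function of $x$, i.e., one of the form $c_1 x_1 + c_2 x_2 + \dots + c_n x_n$ (also written as $c^T x$ or $\langle c, x\rangle$). The constraints are linear constraints, i.e., ones of the form $a_1 x_1 + a_2 x_2 + \dots + a_n x_n \le b$. Note that there can be multiple such constraints.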

We will go a bit deeper into linear programming (LP) in a little bit.

Combinatorial problems

In the examples above, the variables were real-valued $x_i$, and the optimization problem we ended up with is a "continuous" one. However, many of the problems we've been looking at in the course are discrete (or "combinatorial"): finding a shortest path or a spanning tree, where we need to select a set of edges, or finding a matching or a schedule (which require finding permutations), etc.

The key question we study is: can optimization be useful for solving combinatorial problems? Let us first see that if we allow discrete variables (where the domain is a discrete set), these problems can easily be phrased as optimization.

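Matching in bipartite graphs. Suppose we have $n$ children and $n$ gifts, and let $H_{i,j}$ denote the happiness of child $i$ on receiving gift $j$ (a value $\ge 0$). The goal is to find a permutation $\sigma$ of $\{1,2,\dots, n\}$ (child $i$ receives gift $\sigma(i)$) such that the total happiness $\sum_i H_{i, \sigma(i)}$ is maximized.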

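One natural choice of variables is to have a variable $y_i$ that stands for $\sigma(i)$, so that $y_i$ takes values in $\{1, 2, \dots, n\}$ (this is the domain of the variable). The drawback is that the objective is then a sum of terms $H_{i, y_i}$, in which $y_i$ appears as an index; this is not a "nice" function of $y_i$ (unlike, say, an expression such as $4y_i^2 + H_{11} y_i$).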

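A cleaner choice is to use $n^2$ binary variables $x_{ij}$, each with domain $\{0,1\}$, one for every child $i$ and every gift $j$. The intention is that $x_{ij} =1$ if gift $j$ is assigned to child $i$, and $x_{ij}=0$ if gift $j$ is not assigned to child $i$.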

Given this intention, we can impose the following constraints:

(1) $x_{ij} \in \{0,1\}$ for all $i, j$ (domain is binary)
(2) for every child $i$, $\sum_{j} x_{ij} = 1$ (every child receives precisely one gift)
(3) for every gift $j$, $\sum_i x_{ij} =1$ (every gift is assigned to one child)

The main thing to notice is that any setting of the $n^2$ variables $x_{ij}$ that satisfies (1-3) above corresponds to a valid assignment of gifts to children. Now, in terms of these variables, the objective (total happiness) has a very clean form:

$\sum_i \sum_j x_{ij} H_{ij}$ (***)

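(To see this, note that for every child $i$, the inner sum $\sum_j x_{ij} H_{ij}$ is exactly the happiness of child $i$: if $j$ is the gift assigned to $i$, then $x_{ij}$ is $1$, while $x_{ij'}$ is $0$ for all $j' \ne j$, and so those terms do not contribute to the objective).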

To summarize, the variables are $\{ x_{ij}\}_{1\le i, j \le n}$, and we maximize (***) subject to conditions (1-3). We have the following observation:

Main observation. Solving the matching problem is exactly equivalent to solving the optimization problem described above.

The observation generally requires a formal proof, showing that (a) optimum objective values of the two problems are equal and (b) one can move from an optimal solution of one problem to an optimal solution of the other.

In this case, both parts are easy to see. First, any permutation $\sigma$ gives a solution to the optimization problem (set $x_{ij} = 1$ if $j = \sigma(i)$ and $0$ otherwise), and this satisfies all the constraints. Second, a solution to the optimization problem (as observed earlier) must correspond to a permutation. The objective values are also preserved, as we observed while writing the objective (***).

Test your intuition. Would the formulation still capture the problem exactly if we were to have only the constraints (1) and (2) above? [Hint: NO, because in this case the optimization formulation can assign one gift to many children (and possibly not assign some gifts to anyone)].
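For a tiny instance, the equivalence (and the answer to the question above) can be checked by brute force. The sketch below enumerates all permutations on one side and all binary matrices satisfying the constraints on the other. The happiness matrix `H` and the function names are made up for illustration, and the enumeration is exponential, so this is only a sanity check, not an algorithm.

```python
from itertools import permutations, product


def best_permutation_value(H):
    """Original problem: maximize sum_i H[i][sigma(i)] over permutations sigma."""
    n = len(H)
    return max(sum(H[i][s[i]] for i in range(n)) for s in permutations(range(n)))


def best_binary_value(H, use_gift_constraints=True):
    """Optimization formulation: maximize sum_ij x_ij * H_ij over binary x."""
    n = len(H)
    best = None
    for bits in product((0, 1), repeat=n * n):          # constraint (1)
        x = [bits[i * n:(i + 1) * n] for i in range(n)]
        if any(sum(row) != 1 for row in x):             # constraint (2)
            continue
        if use_gift_constraints and any(                 # constraint (3)
            sum(x[i][j] for i in range(n)) != 1 for j in range(n)
        ):
            continue
        value = sum(x[i][j] * H[i][j] for i in range(n) for j in range(n))
        if best is None or value > best:
            best = value
    return best


# Hypothetical happiness values: H[i][j] for child i and gift j.
H = [[5, 1, 0],
     [4, 2, 1],
     [3, 3, 2]]

print(best_permutation_value(H))                        # 9
print(best_binary_value(H))                             # 9
print(best_binary_value(H, use_gift_constraints=False)) # 12
```

With only constraints (1) and (2), every child can be given the first gift, so the optimum jumps to 12 and no longer corresponds to a valid assignment.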

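Set cover. Suppose we have a bipartite graph $(U, V, E)$, where $U$ is a set of people and $V$ is a set of skills, and there is an edge between $i$ and $j$ if person $i$ possesses skill $j$. Let $|U| = n$ and $|V|= m$. The goal is to find the smallest subset $S$ of $U$ such that all the skills are covered (i.e., for every skill $j \in V$, there is someone in $S$ who possesses the skill).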

Choosing variables. Since we need to choose a subset of $U$, we have a binary variable $x_u$ for every $u \in U$, which indicates whether $x_u$ is to be chosen or not.

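Choosing constraints. For every skill $j$, we need at least one person possessing $j$ to be in $S$. If $\Gamma(j)$ denotes the set of neighbors of $j$ (the people who possess skill $j$), this is the constraint $\sum_{u \in \Gamma(j)} x_u \ge 1$. We have one such constraint for every $j$, so there are $m$ constraints on the $n$ variables that we defined.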

In addition, every variable has the domain $\{0,1\}$, as we discussed.

Objective. The size of the chosen set $S$, i.e., $|S|$, is precisely $\sum_{u \in U} x_u$. Thus minimizing this is simply the objective.

Once again, it is easy to see that solving Set Cover is equivalent to solving the optimization problem above. (Because we can go from solutions of one to solutions of the other while maintaining the objective value.)
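The formulation can be tried out on a toy instance by enumerating all binary settings of the variables. The people, the skills and the function name below are hypothetical; the enumeration takes $2^n$ steps, so this is only meant to make the variables, constraints and objective concrete.

```python
from itertools import product


def solve_set_cover_by_enumeration(skills_of):
    """skills_of[u] is the set of skills possessed by person u.

    Returns a binary assignment x (person -> 0/1) minimizing sum_u x_u
    subject to: for every skill j, sum of x_u over people u possessing j >= 1.
    """
    people = sorted(skills_of)
    all_skills = set().union(*skills_of.values())
    best = None
    for bits in product((0, 1), repeat=len(people)):
        x = dict(zip(people, bits))
        covered = all(
            sum(x[u] for u in people if j in skills_of[u]) >= 1
            for j in all_skills
        )
        if covered and (best is None or sum(bits) < sum(best.values())):
            best = x
    return best


# Hypothetical instance with n = 4 people and m = 4 skills.
skills_of = {
    "ann": {"python", "sql"},
    "bob": {"sql", "design"},
    "cat": {"design"},
    "dan": {"python", "design", "writing"},
}

x = solve_set_cover_by_enumeration(skills_of)
chosen = {u for u, xu in x.items() if xu == 1}
print(chosen, sum(x.values()))
```

Here "dan" must be chosen (nobody else can write), and one more person is needed for "sql", so the optimum objective value is 2.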

General paradigm

So far, we have seen examples of phrasing problems of interest in the form of mathematical optimization. Why is this useful? In some sense, reducing one problem to a different one does not magically make it easier. The point is that there have been numerous heuristics and efficient algorithms developed for optimization. Thus, we can hope that phrasing a problem as optimization yields a new (and reasonably efficient) way of solving it.

The picture we drew in class illustrates the overall paradigm:

As discussed above, for the optimization formulation to be equivalent to the original problem, we must be able to convert any solution output by the optimizer to one of the original instance (while maintaining the cost).

More examples

In the examples above, it was almost "immediate" that the optimization formulation "exactly captures" the combinatorial problem we started out with (i.e., solutions to the original problem correspond exactly to the solutions to the optimization formulation). The next example illustrates that this step can sometimes be tricky.

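Minimum spanning tree (MST). Given an undirected graph $G = (V, E)$ with edge weights $\{w_e\}_{e \in E}$, the goal is to choose a subset $S$ of the edges such that (a) using only the edges in $S$, there is a path from every vertex to every other vertex, (b) the total weight of the edges chosen is minimized.

Variables. We have a binary variable $x_e$ for every edge $e$, with $x_e=1$ if $e$ is chosen, and $0$ otherwise.

Objective. Minimize $\sum_{e \in E} w_e x_e$.

Constraints. We need to ensure that using only the edges in $S$, there is a path from every vertex to every other vertex.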

Idea 1: the first idea is to place conditions that "ensure" that the edges chosen form a tree. These can be ones such as:

exactly $(n-1)$ edges are chosen, where $n = |V|$: $\sum_{e \in E} x_e = n-1$.
for every set of edges $f_1, f_2, \dots, f_r \in E$ that forms a cycle in $G$: $x_{f_1} + x_{f_2} + \dots + x_{f_r} \le r-1$.
$x_e \in \{0,1\}$

(These constraints force the chosen edges to form a spanning tree: the only way to have $n-1$ edges without having a cycle is to have a spanning tree. This is a simple exercise if it's not clear.)

Number of constraints. In the formulation above, the number of constraints is equal to the number of cycles in the graph, which is typically exponential in the number of vertices.

Idea 2: the above method is based on ensuring that we don't pick anything more than a tree. It did not explicitly enforce "connectivity between every pair" of vertices. It turns out that a nice way to enforce this is as follows:

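For every subset $T$ of the vertices other than $\emptyset$ and $V$, let $\rho(T)$ denote the set of edges with one endpoint in $T$ and the other endpoint outside $T$ (i.e., the edges leaving $T$). For every such $T$, we add the constraint $\sum_{e \in \rho(T)} x_e \ge 1$.

As before, the domain is $x_e \in \{0,1\}$, and the objective is as before.

Why does this ensure connectivity? Suppose $\{x_e\}$ satisfies the constraint for all $T$, and let $G'$ be the graph formed by the edges $e$ with $x_e = 1$. We claim that $G'$ is connected. If $G'$ were not connected, then $G'$ would have a connected component $R$ (with $R \ne V$), and there would be no edges between $R$ and $V \setminus R$ in $G'$. But $\{x_e\}$ satisfies $\sum_{e \in \rho(R)} x_e \ge 1$, which is a contradiction!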
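The cut-based formulation of Idea 2 can be checked by brute force on a very small graph: enumerate every binary setting of the edge variables, keep those in which every nonempty proper subset $T$ of the vertices has a chosen edge with exactly one endpoint in $T$, and take the cheapest. The graph, its weights and the function name are made up for illustration, and the weights are assumed to be positive (so that an optimum never contains a redundant edge).

```python
from itertools import combinations, product


def cheapest_connected_edge_set(vertices, weights):
    """weights maps an edge (u, v) to its weight w_e.

    Minimizes sum_e w_e x_e over binary x subject to one cut constraint
    per nonempty proper subset T of the vertices.
    """
    edges = list(weights)
    proper_subsets = [
        set(T) for r in range(1, len(vertices)) for T in combinations(vertices, r)
    ]
    best_value, best_edges = None, None
    for bits in product((0, 1), repeat=len(edges)):
        chosen = [e for e, b in zip(edges, bits) if b]
        # Cut constraint for T: some chosen edge has exactly one endpoint in T.
        if all(any((u in T) != (v in T) for u, v in chosen) for T in proper_subsets):
            value = sum(weights[e] for e in chosen)
            if best_value is None or value < best_value:
                best_value, best_edges = value, chosen
    return best_value, best_edges


# Hypothetical graph on 4 vertices with positive weights.
vertices = ["a", "b", "c", "d"]
weights = {("a", "b"): 1, ("b", "c"): 2, ("c", "d"): 3, ("a", "d"): 4, ("a", "c"): 5}

value, tree = cheapest_connected_edge_set(vertices, weights)
print(value, tree)  # 6 [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

Note that the constraints only enforce connectivity; the optimum turns out to be a tree (with $n - 1 = 3$ edges) because dropping an edge from a cycle keeps the graph connected and lowers the weight.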

Efficient optimization

As we said before, if optimization formulations are to be useful for solving problems, the optimizer itself must be efficient. Indeed, as we said earlier, there are often many ways to express a problem as optimization, and the art is to choose one that can be solved easily using a solver.

It is therefore useful to know what kind of optimization problems can be solved efficiently in practice.

If we care about provably efficient procedures, perhaps the most important class is convex optimization, where the goal is to minimize a convex function over a convex set. For those of you interested in knowing more about these notions, please refer to the Wikipedia page on convex optimization and Chapter 2 of these notes. There are also plenty of other notes online (and an extensive textbook by Boyd).

Linear programming

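Recall that in linear programming, we have $n$ real-valued variables $x_1, x_2, \dots, x_n$, and the objective is a linear function $c_1 x_1 + \dots + c_n x_n$, for some given real numbers $c_i$. There are $m$ constraints, where the $i$th one has the form $a_{i1} x_1 + a_{i2} x_2 + \dots + a_{in} x_n \le b_i$, for given real numbers $a_{ij}$ and $b_i$. (This is often written as $a_i^T x \le b_i$, viewing $a_i$ and $x$ as vectors.)

The set of points in $\mathbb{R}^n$ that satisfy one constraint (i.e., $a_i^T x \le b_i$) is often referred to as a half-space. The feasible set for the linear program is thus the intersection of $m$ half-spaces (one per constraint). The feasible region is called a "polytope". It can be either a bounded or an unbounded set; it can even be empty, in which case the problem is said to be infeasible.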
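To make the geometry concrete, here is a sketch for $n = 2$ that uses the picture of a feasible region cut out by half-spaces. It assumes the region is nonempty and bounded, in which case an optimum is attained at a corner, i.e., at a point where two constraint boundaries cross. The code simply tries all such crossing points. This is a brute-force illustration (not the simplex algorithm or any production LP method), and the instance and function name are made up.

```python
from fractions import Fraction
from itertools import combinations


def solve_2d_lp(c, constraints):
    """Maximize c[0]*x1 + c[1]*x2 subject to a1*x1 + a2*x2 <= b
    for every (a1, a2, b) in constraints (integer coefficients).

    Assumes the feasible region is nonempty and bounded.
    Returns (optimum value, optimum point).
    """
    best = None
    for (a1, a2, b), (d1, d2, e) in combinations(constraints, 2):
        det = a1 * d2 - a2 * d1
        if det == 0:
            continue  # parallel boundaries never cross in a single point
        # Crossing point of the two boundary lines (Cramer's rule).
        x1 = Fraction(b * d2 - a2 * e, det)
        x2 = Fraction(a1 * e - b * d1, det)
        # Keep the point only if it lies in every half-space.
        if all(p * x1 + q * x2 <= r for p, q, r in constraints):
            value = c[0] * x1 + c[1] * x2
            if best is None or value > best[0]:
                best = (value, (x1, x2))
    return best


# Hypothetical instance: maximize 3*x1 + 2*x2 subject to
#   x1 + x2 <= 4,  x1 + 3*x2 <= 6,  x1 <= 3,  x1 >= 0,  x2 >= 0.
constraints = [(1, 1, 4), (1, 3, 6), (1, 0, 3), (-1, 0, 0), (0, -1, 0)]
value, point = solve_2d_lp((3, 2), constraints)
print(value, point)  # optimum value 11 at the corner (3, 1)
```

The constraints $x_1 \ge 0$ and $x_2 \ge 0$ are written as $-x_1 \le 0$ and $-x_2 \le 0$ so that every constraint has the form $a_i^T x \le b_i$.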

It turns out that many interesting problems arising in planning and operations research can be phrased as linear programming, and thus LPs are one of the most fundamental objects in optimization.

For a gentle introduction to linear programming (via examples in low dimension like we saw in class), refer to the excellent notes by Luca Trevisan.

A fundamental result is that linear programs can be solved in polynomial time (polynomial in the number of bits needed to write down the $a_{ij}, b_i$, and in $n, m$). Even though this was a theoretical breakthrough, a simpler "local search" procedure known as the "simplex algorithm" (which basically moves from one "vertex" of the polytope to a neighboring one in a systematic manner) works quite well in practice. More recently, a set of techniques collectively known as "interior point methods" have emerged as powerful alternatives to the simplex algorithm both in theory and in practice.

In what follows, we will treat the solution of LP as an efficient black-box, and see how this can help in solving combinatorial problems.

Combinatorial problems and "relaxations"

In the examples we saw earlier, we expressed problems such as matching, set cover and MST as optimization, but the variables involved were binary ones. In fact, in all of these examples, the constraints involved as well as the objective function were all linear (as we defined above), but the variables are binary, not real valued.

Such problems are called integer linear programs, or ILPs. While many heuristics exist for solving ILPs reasonably quickly, the problem is NP-hard, so we do not expect to have a general algorithm that scales well with the problem size.

Given that we care about algorithms with provable guarantees on the running time, does this mean that the optimization approach is useless?

The answer is no, because as we will see, we can often "relax" the binary constraints on the variables and still achieve meaningful insights into the problem. In what follows let us focus on a simple case of the set cover problem, namely the "vertex cover" question we saw earlier in the course.

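Vertex cover. Given a graph $G = (V, E)$, the goal is to find the smallest subset $S$ of the vertices such that for every edge $ij$, at least one of $i, j$ is in $S$. (This is a special case of Set Cover, if we view the vertices and edges as two sides of a bipartite graph.)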

The optimization formulation for this problem (which is an ILP, as we saw before) is:

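Variables: a binary variable $x_u$ for every $u \in V$.

Objective: minimize $\sum_{u \in V} x_u$ subject to the constraints:

for every $u \in V$, $x_u \in \{0,1\}$.
for every edge $ij$, $x_i + x_j \ge 1$.

The idea of a relaxation is to replace the non-linear constraint $x_u \in \{0,1\}$ with the linear constraint $0 \le x_u \le 1$ (to be clear, these are two linear constraints).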

Thus the LP relaxation of the formulation above is:

minimize $\sum_{u \in V} x_u$ subject to:

for every $u \in V$, $0 \le x_u \le 1$.
for every edge $ij$, $x_i + x_j \ge 1$.
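For reference, the two vertex cover formulations can be written side by side; they differ only in the last line, where the binary domain is replaced by an interval. (The labels ILP and LP are just names for the two programs.)

```latex
\begin{align*}
\text{(ILP)}\quad & \min \sum_{u \in V} x_u
  & \text{(LP)}\quad & \min \sum_{u \in V} x_u \\
& x_i + x_j \ge 1 \quad \text{for every edge } ij \in E
  & & x_i + x_j \ge 1 \quad \text{for every edge } ij \in E \\
& x_u \in \{0,1\} \quad \text{for every } u \in V
  & & 0 \le x_u \le 1 \quad \text{for every } u \in V
\end{align*}
```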

Comparing the objective values. The first thing to note is that both the original formulation (which we will call the ILP) and the relaxation are minimization problems. Further, every feasible solution to the ILP is also a feasible solution to the relaxation. In other words, the relaxation is a minimization over a "larger set" of solutions, and thus the optimum objective value of the relaxation is less than or equal to the optimum objective value of the ILP.

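The optimum value of the relaxation can be strictly smaller. Consider a triangle (three vertices, with an edge between every pair). Any vertex cover must contain at least two of the three vertices, so the optimum value of the ILP is $2$. However, setting $x_u = 1/2$ for every vertex satisfies $0 \le x_u \le 1$ as well as $x_i + x_j \ge 1$ for every edge, and it has an objective value of $3/2 = 1.5$.

Thus in this case, the optimum objective value of the relaxation is strictly smaller than that of the ILP.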
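The gap can be verified directly on a triangle (three vertices, every pair joined by an edge). The sketch below finds the ILP optimum by enumerating binary solutions, and checks that the all-halves point is feasible for the relaxation. The function names are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def is_feasible(x, edges):
    """Checks 0 <= x_u <= 1 for every vertex and x_i + x_j >= 1 for every edge."""
    return all(0 <= x[u] <= 1 for u in x) and all(x[i] + x[j] >= 1 for i, j in edges)


def ilp_optimum(vertices, edges):
    """Smallest value of sum_u x_u over feasible binary solutions (brute force)."""
    best = None
    for bits in product((0, 1), repeat=len(vertices)):
        x = dict(zip(vertices, bits))
        if is_feasible(x, edges) and (best is None or sum(bits) < best):
            best = sum(bits)
    return best


triangle_vertices = [1, 2, 3]
triangle_edges = [(1, 2), (2, 3), (1, 3)]

half = {u: Fraction(1, 2) for u in triangle_vertices}

print(ilp_optimum(triangle_vertices, triangle_edges))            # 2
print(is_feasible(half, triangle_edges), sum(half.values()))     # True 3/2
```

In fact $3/2$ is exactly the optimum of the relaxation here: adding up the three edge constraints gives $2 \sum_u x_u \ge 3$.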

Such a gap between the objective value of the ILP and that of the relaxation is called the integrality gap of the formulation. One of the central themes in approximation algorithms is to come up with ILP formulations for different problems that have a (provably) small integrality gap. This way, even though the LP solution can be fractional (and thus may not be useful to recover a solution to the original problem), we have a decent estimate of the optimum objective value.

Rounding fractional solutions

Given the gap above, one may ask if for the vertex cover problem, there are instances in which the optimum objective value for the relaxation is way smaller than that of the ILP (say a factor 100 smaller).

We show now that the "gap" is always bounded by a factor of 2. Further, we do this in a "constructive" way: given any feasible solution to the relaxation (which potentially has fractional values), we construct a feasible solution to the ILP (i.e., a binary solution), with the guarantee that the objective value is not much larger. This process in general is known as rounding LP solutions.

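More precisely, given a feasible solution $\{x_u\}$ to the relaxation, we construct a binary solution $\{y_u\}$ that is feasible for the ILP, such that $\sum_u x_u$ and $\sum_u y_u$ are not too far apart.

Rounding for vertex cover. If $x_u \ge 1/2$, set $y_u = 1$, and if $x_u < 1/2$, set $y_u = 0$. We do this for every $u \in V$.

First, we claim that $\{y_u\}$ is feasible for the ILP, i.e., for every edge $ij$, we have $y_i + y_j \ge 1$. This is because $\{x_u\}$ is feasible for the relaxation, and $\{y_u\}$ is obtained by thresholding it at $1/2$: since $x_i + x_j \ge 1$, at least one of $x_i, x_j$ is $\ge 1/2$, and the corresponding $y$ value is $1$. Thus $y_i + y_j \ge 1$ for every edge!

Next, we compare $\sum_u y_u$ with $\sum_u x_u$. For every $u$, we have $y_u \le 2 x_u$: if $x_u < 1/2$, then $y_u = 0$ (and $x_u \ge 0$), while if $x_u \ge 1/2$, then $y_u =1 \le 2 x_u$. In both cases, $y_u \le 2x_u$.

Summing over all $u$, we get $\sum_u y_u \le 2 \sum_u x_u$. Thus we have a feasible solution $\{y_u\}$ to the ILP whose objective value is at most $2$ times that of the given solution to the LP relaxation.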
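The rounding rule for vertex cover (set $y_u = 1$ exactly when $x_u \ge 1/2$) is a one-liner, and both of its guarantees can be checked on an example. The path graph and the fractional solution below are made up for illustration; the fractional solution is feasible for the relaxation but is not claimed to be optimal.

```python
from fractions import Fraction


def round_vertex_cover(x):
    """Threshold rounding: y_u = 1 if x_u >= 1/2, and y_u = 0 otherwise."""
    return {u: 1 if xu >= Fraction(1, 2) else 0 for u, xu in x.items()}


# Hypothetical path graph a - b - c - d and a feasible fractional solution.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
x = {
    "a": Fraction(1, 4),
    "b": Fraction(3, 4),
    "c": Fraction(1, 2),
    "d": Fraction(1, 2),
}

y = round_vertex_cover(x)
print(y)                                   # {'a': 0, 'b': 1, 'c': 1, 'd': 1}
print(sum(y.values()), sum(x.values()))    # 3 2
```

Every edge has an endpoint with $y$ value $1$, and the rounded objective ($3$) is at most twice the fractional one ($2$), as the argument predicts.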

An approximation algorithm for Vertex cover

We will now see why the above implies a factor 2 approximation algorithm for the vertex cover problem, i.e., an algorithm that is guaranteed to output a solution whose objective value is at most two times the optimum objective value. (Recall that we saw in one of our HWs that a "lazy" greedy algorithm also achieves this.)

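Given an instance $(V, E)$ of vertex cover (VC),

first write down the ILP (i.e. our optimization formulation with binary variables); as we saw, there's a correspondence between solutions to this and those of the VC instance.
next, solve the LP relaxation of the ILP; let $\{ x_u\}$ be the optimum solution.
finally, round $\{x_u\}$ as described above to obtain $\{y_u\}$, a feasible solution to the ILP, and then output the corresponding solution to the VC problem.

To analyze this, let the optimum value for the instance $(V, E)$ be $OPT$. The optimum value of the ILP is also $OPT$, and the optimum value of the relaxation, $OPT_{LP}$, is at most $OPT$. Since $x$ is an optimum solution to the relaxation, we have $\sum_u x_u \le OPT$, and so the rounding guarantee gives $\sum_u y_u \le 2 \cdot OPT$, which is what we claimed.

(For example, on a triangle, the optimum solution to the relaxation has $x_u = 1/2$ for every vertex, and so the algorithm outputs all $3$ vertices, which is sub-optimal, but not by much).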
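The whole analysis fits in one chain of inequalities. Here $x$ is the optimum solution of the LP relaxation, $y$ is its rounding, $OPT_{LP}$ is the optimum value of the relaxation and $OPT$ is the size of the smallest vertex cover:

```latex
\[
\sum_{u \in V} y_u \;\le\; 2 \sum_{u \in V} x_u \;=\; 2 \cdot OPT_{LP} \;\le\; 2 \cdot OPT .
\]
```

The first step is the rounding guarantee ($y_u \le 2 x_u$ for every $u$), the equality holds because $x$ is optimal for the relaxation, and the last step holds because the relaxation minimizes over a larger set than the ILP.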

Relax and round. The above process is an example of a more general framework, known as relax-and-round. It is a powerful framework for designing approximation algorithms for combinatorial optimization problems. The first step is to write down an ILP formulation involving (typically) binary variables. The next step is to "relax" the constraints to obtain a linear program, which can be solved in polynomial time using an LP solver. This results in a potentially "fractional" solution. The last step is to "round" the fractional solution to a binary one, obtaining a feasible solution to the ILP. This is then used to obtain a solution to the original problem.