CS 6963 Summaries
Aug 26, 2008
Problem: routing in a network.
We're given a directed graph on N nodes. Each edge is a link between communicating nodes. A link can carry one packet at a time. We assume that nodes have queues for each outgoing edge. All nodes have IDs, so we can refer to node 1, node 10, and so on.
Consider permutation routing: here, each node wants to send a packet to exactly one other node, and each node receives exactly one packet. A routing algorithm has to decide, for each (i, j = d(i)) start-end pair, how to route the packet that goes from i to j. Every edge can carry only one packet per time step: thus the running time of the routing algorithm is lower bounded by the maximum congestion on an edge.
Let's consider an "easy" class of routing algorithms: namely, algorithms that are oblivious: the route used depends solely on the pair (i,j). More precisely, the path taken by a packet from i depends only on j, and on nothing else. Oblivious routing needs little state: each (src,dest) pair uniquely defines a route, and so routing tables are easy to maintain.
Theorem: Let the graph have N nodes and out-degree d. Then there exists a permutation that forces Ω(√(N/d)) congestion for any deterministic oblivious routing scheme.
Consider a special network: the Boolean hypercube on n-bit strings (two nodes are adjacent if they differ in exactly one bit). Here the degree is d = n and the size of the network is N = 2^n. The above result says that the number of steps needed is at least Ω(√(2^n/n)).
We will prove the following result:
Theorem: There exists a randomized routing scheme for the hypercube that requires only O(n) steps with high probability.
The deterministic lower bound works by a counting argument (show that each "destination graph" linking a fixed destination to its sources must have a large number of congested edges, and use averaging to show that there must be some large set of routes that all hit the same bad edge. Since routes are oblivious, this defines a permutation that forces congestion).
But let's get some intuition on the hypercube using a simple algorithm called 'bit fixing': find the leftmost bit where the current location differs from the target, and traverse the edge that fixes that bit.
Lemma: there's an Ω(√N/n) = Ω(2^{n/2}/n) lower bound for bit-fixing methods.
Proof: consider the transpose permutation: the node i = ab (writing each node as the concatenation of two (n/2)-bit halves a and b) is routed to the node j = ba.
Idea: under bit fixing, every such route must pass through a node of the form XX, and there are only 2^{n/2} of these. Thus there exists at least one node that takes in 2^{n/2} packets, and it has degree n, so some edge carries at least 2^{n/2}/n packets.
Here's a clue though: the adversary beat the algorithm by exploiting the deterministic bit-fixing protocol. Could we confuse the adversary by scattering the packets so that no bottleneck is created?
Randomized algorithm
- Route every packet from i to a random location σ(i)
- Route every packet from σ(i) to d(i)
Intuition: the randomization spreads the packets out, so the routes don't get bottlenecked.
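To make the two-phase scheme concrete, here is a minimal simulation sketch in Python (a sketch of our own; the function names and the trial counts are not from the lecture). It builds bit-fixing routes and measures the maximum edge congestion of Phase 1 under random intermediate destinations.

    import random

    def bit_fixing_route(src, dst, n):
        """Return the directed edges on the bit-fixing route src -> dst."""
        edges, cur = [], src
        for b in range(n - 1, -1, -1):        # scan bit positions left to right
            if (cur ^ dst) & (1 << b):        # bits differ: fix this bit
                nxt = cur ^ (1 << b)
                edges.append((cur, nxt))
                cur = nxt
        return edges

    def phase1_max_congestion(n, trials=5):
        """Max edge congestion of Phase 1 (each i routed to a random sigma(i))."""
        worst = 0
        for _ in range(trials):
            load = {}
            for i in range(2 ** n):
                sigma_i = random.randrange(2 ** n)
                for e in bit_fixing_route(i, sigma_i, n):
                    load[e] = load.get(e, 0) + 1
            worst = max(worst, max(load.values(), default=0))
        return worst

    # On the 8-dimensional hypercube (256 nodes) the observed congestion
    # stays around n, matching the O(n) bound proved below.
    print(phase1_max_congestion(8))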
Proof sketch. Suppose we show that Phase 1 has congestion c with high probability. Then Phase 2 also has congestion c with high probability, by reversing time. Does this mean that the total congestion is at most 2c? Yes, provided no packets from Phase 1 interfere with packets from Phase 2. So we make all packets wait until c steps are done before starting Phase 2, so that we can be sure there's no interference.
What remains is to prove that Phase 1 causes only congestion c. Fix a packet v_i from node i, and look at how long it takes: its travel time is (route length) + delay ≤ n + delay. How large is the delay?
Part I: identify the structural cause of delay (independent of randomization).
Part II: analyze that structure using the randomization.
What is the cause of delay?
Fix the route of a particular packet v. Let S be the set of all other packets that traverse at least one edge of this route. Claim: the delay of v is at most |S|. Intuition: any packet that shares an edge with v's route can delay it only once.
Observation: once two routes separate, they cannot merge again. Proof: in the bit-fixing scheme, two routes diverging means that at some step the two packets are at locations with different already-fixed prefixes. Each such prefix is a prefix of that packet's final destination and only grows, so once this happens the routes cannot merge again.
[Back to the claim] We will charge every unit of delay for v to some packet in S. Define the lag: if at time t, v is about to traverse edge e_j (the j-th edge of its route), its lag is t − j. Suppose v's lag goes from l to l+1 while v is waiting to cross e_j: this means that some other packet crosses e_j at that time step, and that packet therefore also has lag l.
Now fix the latest time at which some packet on v's route has lag l. That packet must leave the route at that point, for otherwise it could continue on with lag l, contradicting maximality. We charge v's increase in lag to this packet, and since (by the observation) it never returns to the route, we will never charge it again.
Analyze this structure.
We now need to understand how many packets overlap v's route. Fix packet i with route ρ_i, and let H_{ij} = 1 if the routes for i and j share an edge, and 0 otherwise. By the claim, the delay of packet i is at most Σ_j H_{ij}. Destinations are chosen randomly, so for fixed i the H_{ij} are independent Poisson trials, and we need to estimate E[Σ_j H_{ij}].
Fix an edge e, and let T(e) = # of routes passing through it. Then Σ_j H_{ij} ≤ Σ_{e ∈ ρ_i} T(e) (for ρ_i the route of packet i).
By symmetry, every edge looks the same as every other in Phase 1, so E[T(e)] is the same for all edges e. Since the expected route length is n/2 and there are N routes, the expected total number of (route, edge) pairs is Nn/2; there are Nn (directed) edges in the hypercube, and so E[T(e)] = 1/2.
Therefore, E[Σ_j H_{ij}] ≤ Σ_{e ∈ ρ_i} E[T(e)] ≤ n · (1/2) = n/2.
Applying a standard Chernoff bound, the probability that the delay exceeds 6n is at most 2^{−6n}. Moreover, we have to take a union bound over all 2^n packets, which multiplies this by 2^n, giving a failure probability of at most 2^{−5n}.
NOTE: the T(e) are not independent, but the H_{ij} are, so we must apply the Chernoff bound to Σ_j H_{ij} rather than to Σ_e T(e).
Aug 28, 2008
The Probabilistic Method I
In today's lecture, we shall discuss some simple yet very effective probabilistic techniques that can be used for proving existence of objects with certain properties. These techniques fall under the broad umbrella of "Probabilistic Methods" and, using the power of randomization, they give us results that could be hard to prove otherwise. The basic recipe is simple:
Construct a sample space of objects. Now check to see if, with positive probability, a randomly selected object from this space has the required property. We are done.
Although the basic premise is simple, even in the randomized setting, certain combinatorial arguments are needed for proving the existence of such objects. The probabilistic method thus helps us by providing a bunch of techniques to deal with this. Apart from giving the existence results, the power of randomization is further evidenced by its ability to help us develop randomized algorithms that can actually find such objects. We shall discuss three important techniques and example situations where they yield useful results.
The Basic Counting Argument
The basic counting argument works by constructing an appropriate probability space S and then, by enumerating over all possible randomly chosen objects of interest, showing that the probability that a randomly selected object has the required properties is non-zero.
Example: Consider the problem of coloring the edges of an n-vertex graph with only two colors such that there are no large cliques (say having k vertices) with all edges having the same color (i.e. no monochromatic subgraph). The following theorem gives the condition that guarantees the existence of such a coloring.
Theorem: If C(n,k) · 2^{1 − k(k−1)/2} < 1, then it is possible to color the edges of K_n with two colors so that it has no monochromatic K_k subgraph.
Proof: The proof works by constructing a sample space of all k-vertex cliques and computing the probability that all edges in each such clique are given the same color. It's easy to show that this probability is 2^{1 − k(k−1)/2}. Using the union bound over all C(n,k) cliques (and applying the condition given in the theorem above), we can easily see that the probability of at least one such clique being monochromatic is less than 1. Thus the probability that none of the k-cliques is monochromatic is greater than zero, and so such a coloring must exist.
Now that we know that such a coloring exists, we can even actually go further and find such a coloring using a randomized construction algorithm. We shall see examples of constructing both a Monte Carlo and a Las Vegas algorithm for finding such coloring. We will also investigate the conditions under which such constructions are practically useful (i.e. whether the Monte Carlo algorithm indeed yields a low error probability, or whether the Las Vegas algorithm can run in polynomial time).
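Here is what such a construction might look like as a sketch (our own code, with hypothetical helper names): the Las Vegas version resamples random colorings until no k-clique is monochromatic, which is fast exactly when the theorem's condition makes the failure probability bounded away from 1.

    import random
    from itertools import combinations

    def random_two_coloring(n):
        """Color each edge of K_n with color 0 or 1 independently at random."""
        return {e: random.randint(0, 1) for e in combinations(range(n), 2)}

    def has_monochromatic_clique(coloring, n, k):
        """Check every k-subset of vertices for a single-colored clique."""
        for verts in combinations(range(n), k):
            if len({coloring[e] for e in combinations(verts, 2)}) == 1:
                return True
        return False

    def las_vegas_coloring(n, k):
        """Resample until good; the expected number of attempts is constant
        when C(n,k) * 2^(1 - k(k-1)/2) is bounded below 1."""
        while True:
            c = random_two_coloring(n)
            if not has_monochromatic_clique(c, n, k):
                return c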
The Expectation Argument
The basic counting argument works by constructing a probability space from which a randomly chosen object is shown to have the desired properties with positive probability. However, sometimes it is easier to prove such a thing with an expectation (averaging) argument instead. The expectation argument is based on a fairly simple (and intuitive) fact: a random variable must, with positive probability, take at least one value that is no greater than its expectation, and at least one value that is no smaller than its expectation.
Example: An instance of the satisfiability problem (a SAT formula) is a logical expression that is the conjunction (AND) of a set of clauses, where each clause is a disjunction (OR) of literals. A solution to a SAT formula is a True/False assignment of the variables such that all the clauses are satisfied. In general, determining whether a SAT formula has a solution is NP-hard. Given a SAT formula, a related problem is to satisfy as many clauses as possible; this is called the MAXSAT problem. We shall see that, using the expectation argument, we can give a lower bound on the number of satisfiable clauses.
Theorem: Given a set of m clauses, let k_i be the number of literals in the i-th clause for i = 1,...,m. Let k = min_i k_i. Then there is a truth assignment that satisfies at least Σ_{i=1}^m (1 − 2^{−k_i}) ≥ m(1 − 2^{−k}) clauses.
Proof: The proof works by first finding the probability that each clause is satisfied under a uniformly random assignment of the variables (clause i is satisfied with probability 1 − 2^{−k_i}). Given these probabilities, we can easily compute the expected number of satisfied clauses by linearity of expectation. Invoking the expectation argument, we see that there must be an assignment that satisfies at least that many clauses.
We shall see another example: finding a large cut in a graph. In general, finding the maximum cut in a graph is NP-hard. However, using the expectation argument, we can show that the value of the maximum cut is at least half the number of edges in the graph. Turning this existence argument into an algorithm, we will also see a Las Vegas algorithm that finds a cut meeting this lower bound.
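A minimal sketch of that Las Vegas algorithm (our own code): assign each vertex a random side of the cut; since the expected cut size is m/2 and a cut can never exceed m, resampling until the cut reaches m/2 terminates after O(m) attempts in expectation.

    import random

    def las_vegas_large_cut(n, edges):
        """Resample random sides until the cut has at least m/2 edges."""
        m = len(edges)
        while True:
            side = [random.randint(0, 1) for _ in range(n)]
            cut = sum(1 for u, v in edges if side[u] != side[v])
            if 2 * cut >= m:                  # expectation argument bound met
                return side, cut

    # Example: a 4-cycle has m = 4 edges; we always find a cut of size >= 2.
    print(las_vegas_large_cut(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))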
Sample and Modify
In some cases, it is easier to construct the object with the required properties indirectly. The sample-and-modify method gives a two-stage indirect construction. In the first stage, a random structure is constructed that need not yet have the required properties. In the second stage, this structure is incrementally modified until it becomes one that does.
Example 1: An independent set of a graph is a set of vertices with no edges between them. Finding the largest independent set in a graph is NP-hard. Using sample and modify, however, we can give a lower bound on the size of the largest independent set of a graph.
Theorem: Let G = (V,E) be a graph on n vertices with m edges. Then G has an independent set with at least n^2/(4m) vertices.
Proof: The proof (which can also be thought of as an algorithm) works by first keeping a random sample of the vertices and then incrementally removing vertices until no edges remain among those kept; the remaining vertices form an independent set. Thus in this case we get both a bound on the size of the independent set (which, as we will see, is n^2/(4m)) and an algorithm for finding such a set.
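As a sketch of the proof-as-algorithm (our own code; the sampling probability p = n/2m is the value that optimizes the expectation bound): keep each vertex with probability p, then delete one endpoint of every edge that survived, leaving an independent set of expected size at least n^2/4m.

    import random

    def sample_and_modify_independent_set(n, edges):
        """Expected output size: np - mp^2 = n^2/(4m) at p = n/(2m)."""
        m = len(edges)
        p = min(1.0, n / (2.0 * m))           # stage 1: sample vertices
        kept = {v for v in range(n) if random.random() < p}
        for u, v in edges:                    # stage 2: modify
            if u in kept and v in kept:
                kept.discard(v)               # break the edge
        return kept                           # independent by construction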
Example 2: Another example of the sample-and-modify technique is proving the existence of graphs with large girth. The girth of a graph is the length of its shortest cycle.
Theorem: For any integer k ≥ 3, there is a graph with n nodes, at least (1/4)·n^{1 + 1/k} edges, and girth at least k.
Usually we expect dense graphs (i.e., ones having a large number of edges) to have small girth. But the above theorem says that it is possible even for dense graphs to have large girth.
Proof: The proof works by first sampling a random graph: starting with n isolated vertices, we add an edge between each pair of vertices independently with some probability p. Then, in the second step, we count all cycles of length up to k − 1 and destroy all such cycles. The destruction is simple: delete one edge from each of these cycles. In the graph we are left with, every cycle has length at least k. To count the edges in the resulting graph, we subtract the number of edges deleted in the second step from the number of edges we began with; it can be shown that this is at least (1/4)·n^{1 + 1/k}.
Sept 02, 2008
The Probabilistic Method II
Continuing in the same vein as the previous lecture, we shall see yet another powerful technique for proving the existence of rare events (i.e., ones having a very small probability). This is in contrast with the other probabilistic methods we have seen, which show the existence of events that have reasonably high probability.
Consider a set of "bad" events E_1,...,E_n. We often want a bound for the case when no bad event happens (which in general would be a rare event, possibly with very small probability). That is, we want a lower bound on Pr(∩_{i=1}^n Ē_i). If the events are mutually exclusive or mutually independent, it is easy to obtain such a bound. However, often we are faced with situations that only allow limited dependence (or, seen another way, limited independence) among the events. The Lovasz Local Lemma shows that even under such limited dependence, we can prove such bounds, given certain conditions.
Mutual Exclusion
If all the bad events Ei are mutually exclusive, then the probability of at least one of them happening (i.e. their union) is merely the sum of their probabilities. This gives us the probability of a good event (i.e. none of the bad events happening) immediately.
Mutual Independence
For a set of mutually independent events, each with probability bounded away from 1, it can easily be shown that the probability that none of the events occurs is positive. To see this, think of E_1,...,E_n as mutually independent events in some probability space. Then Pr(∩_i Ē_i) = Π_i Pr(Ē_i) = Π_i (1 − Pr(E_i)); if E_1,...,E_n are mutually independent then the complements Ē_1,...,Ē_n are independent as well. If each probability is bounded, i.e. Pr(E_i) < 1, then Π_i (1 − Pr(E_i)) > 0, giving the desired result.
Limited Dependence
In many scenarios, mutual independence is too strong an assumption (and mutual exclusion too is very rare). However, often there are cases when the problem structure requires (or allows) us to slightly relax the independence conditions. In such cases, the events are no longer mutually independent but only have limited dependence among them. Such dependencies can be captured by a dependency graph which is a directed graph such that there exists an edge between any two nodes only if the associated events depend on each other.
Based on this notion of limited dependence, the Lovasz Local Lemma (LLL) gives conditions under which it is guaranteed that all of a set of "bad" events can be avoided. The LLL comes in two versions, symmetric and asymmetric. For most applications the symmetric version suffices, and we shall consider only the symmetric version.
Lovasz Local Lemma
Let E_1,...,E_n be a set of events, with the following assumptions:
- for all i, Pr(E_i) ≤ p.
- the degree of the dependency graph given by E_1,...,E_n is bounded by D (i.e. each E_i is mutually independent of all but at most D other events E_j).
- 4pD < 1.
Then Pr(∩_{i=1}^n Ē_i) > 0.
Note: The condition 4pD < 1 means that, given p, the dependency degree D cannot be too large; and the less likely the bad events are, the more dependence we can tolerate, because even chained bad events can't hurt us.
Note: For D = 0, the mutually independent case follows.
Applications
Listed below are some problems, and the related results which we shall see how to prove using the Lovasz Local Lemma.
- Edge-disjoint paths: The problem is to come up with a communication strategy for n pairs of users such that no edge is shared between any two paths. Further, each pair i = 1,..,n can choose its path from a set Fi of m paths. Using LLL, we can show that if any path in Fi does not share edges with too many paths in Fj, then there exists a way to choose n edge-disjoint paths connecting the n pairs.
- k-satisfiability problem: The problem is to decide whether there is an assignment satisfying a set of clauses, each containing k variables. Using the LLL, we can show that such an assignment exists if no variable appears in too many clauses (a small numeric check of the LLL condition appears after this list).
- Routing using prespecified paths: Consider a routing problem on a network where the path for each packet has already been specified (i.e., for packet i, a path P_i that connects vertices s_i and t_i). The restriction is that each edge in the network can carry only one packet per time step, which leads to congestion on an edge when several packets are waiting to use it. We further assume that the paths are edge-simple, that is, a path does not repeat an edge. We want to minimize the maximum time taken by any packet to reach its destination; equivalently, we want a schedule that keeps the queue sizes at each edge as small as possible. It turns out that, using the LLL, it is possible to come up with such a schedule.
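As a small worked illustration of the symmetric condition (our own sketch, not from the lecture): for the k-satisfiability application, take E_i to be the event that clause i is unsatisfied by a uniformly random assignment, so p = 2^{-k}, and let D be the maximum number of other clauses sharing a variable with any one clause; the LLL applies whenever 4pD < 1.

    def lll_condition_holds(clauses, k):
        """Check 4*p*D < 1 for a k-SAT instance.

        clauses: list of sets of variable indices (signs are irrelevant to
        the dependency structure). Two bad events depend on each other iff
        the corresponding clauses share a variable, so Pr[E_i] = 2**-k.
        """
        p = 2.0 ** (-k)
        D = max(sum(1 for j, cj in enumerate(clauses) if j != i and ci & cj)
                for i, ci in enumerate(clauses))
        return 4.0 * p * D < 1.0

    # Two 3-variable clauses sharing one variable: D = 1, p = 1/8, 4pD = 1/2.
    print(lll_condition_holds([{0, 1, 2}, {2, 3, 4}], k=3))  # True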
Sept 04, 2008
Hashing I
Hash tables are among the most powerful data structures. In this lecture, we will see how a hash table is built internally using randomization.
Why hashing?
We have a lot of data structures like balanced search trees, random treaps, and random skip lists, so why do we need hashing? The problem with all of the above data structures is that they need Ω(log s) time to process a search or update operation on a set of size s. However, as we all know, using hash tables we can do lookups as well as updates in O(1) time.
Naive way of doing hashing Suppose we have a set S whose keys are chosen from a totally ordered universe M of size m. Now create a table T of size m; a table is simply an array supporting random access. This strategy supports lookups and updates in O(1) time. However, the catch is that m is many orders of magnitude larger than |S|. Even though the approach is impractical, it gives us a new model which lets us get around the comparison-based lower bounds on searching in a totally ordered set. Our goal now is to reduce the size of the table to bring it close to the size of the set S.
Difference between static and dynamic hashing
- In the static dictionary problems, we are given a fixed set of keys S and the job is to organize them into a data structure that supports efficient processing of FIND queries
- In the dynamic case, the set S is not known in advance. Instead it is constructed by insert and delete operations that are intermingled with FIND queries.
Definition of a hash function
A hash function h is a mapping from M to N, where m >> n. N is a table consisting of entries indexed by N = {0,1,...,n−1}. The size of each word in the table is log m bits.
Defining collision A collision occurs when two different elements of M map to the same cell of the table N.
Definition of Perfect hash function
A hash function h: M → N is said to be perfect for a set S ⊆ M if h causes no collisions among the keys of the set S.
- For the static case, perfect hash functions can be used to obtain good solutions, and we will talk about this in detail next Tuesday.
- In the dynamic case, it is not possible to come up with a perfect hash function: no hash function can be perfect for all possible sets S ⊆ M, since n << m. A fixed deterministic hash function is also a bad idea, since an adversary could then choose an input that beats the algorithm. Hence we need randomized, nearly perfect hash functions.
- For more, refer to Exercise 8.11 and the discussion below it.
Universal Hash Families
Definition: Let M = {0,1,...,m−1} and N = {0,1,...,n−1}, with m ≥ n. A family H of functions from M into N is said to be 2-universal if, for all x, y ∈ M with x ≠ y, and for h chosen uniformly at random from H,
Pr[h(x) = h(y)] ≤ 1/n.
Can we do better than this bound? The answer is no; refer to Theorem 8.12 (and the notation above it).
Good properties of random functions Refer to Theorem 8.14.
Why total randomness is bad A totally random mapping from M to N has collision probability exactly 1/n; thus, a random choice from a 2-universal family of hash functions gives a seemingly random function. The collection of all possible functions from M to N is a 2-universal family, but it has several disadvantages: picking a random function requires Ω(m log n) random bits, and this is also the number of bits of storage required to represent the chosen function.
Solution to above problem We need a smaller 2-universal family of functions that requires a small amount of space and is easy to evaluate. In particular, the goal is to build a 2-universal family using only a small subset of all possible functions.
Intuition Essentially, what is happening above is that we have a lot of randomness, and it is not good for us, since we must do a lot of work to represent our hash function. Hence, we derandomize to shrink the space of possible hash functions; the small space we get after derandomization can be used to pick and represent our hash functions cheaply.
Derandomization to reduce randomness
Constructing Universal Hash Families
Fix m and n, and choose a prime p ≥ m. We will work over the field Z_p = {0,1,...,p−1}. Let g: Z_p → N be the function given by g(x) = x mod n. For all a, b ∈ Z_p define the linear function f_{a,b}: Z_p → Z_p and the hash function h_{a,b}: M → N as follows.
- f_{a,b}(x) = (ax + b) mod p
- h_{a,b}(x) = g(f_{a,b}(x))
The family of hash functions H = {h_{a,b} | a, b ∈ Z_p with a ≠ 0} is 2-universal. We can prove this fact using the definition of universal hash families; a code sketch of the family appears below.
- Refer to Theorems 8.15 and 8.16.
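Here is a minimal implementation sketch of exactly this family (our own code; the trial-division prime finder is only for illustration). Note that a sampled function is represented by just the pair (a, b), i.e. O(log m) bits.

    import random

    def next_prime(m):
        """Smallest prime p >= m (naive trial division, fine for a demo)."""
        def is_prime(x):
            if x < 2:
                return False
            d = 2
            while d * d <= x:
                if x % d == 0:
                    return False
                d += 1
            return True
        p = m
        while not is_prime(p):
            p += 1
        return p

    def sample_hash(m, n):
        """Draw h_{a,b}(x) = ((a*x + b) mod p) mod n from the family."""
        p = next_prime(m)                 # work over the field Z_p, p >= m
        a = random.randrange(1, p)        # a != 0
        b = random.randrange(0, p)
        return lambda x: ((a * x + b) % p) % n

    h = sample_hash(m=10**6, n=97)        # storing (a, b, p) takes O(log m) bits
    print(h(123456), h(123457))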
Summarize: If you look at this family of hash functions, it is 2-universal and requires only O(log m) bits, as compared to Ω(m log n) bits in the totally random case. The big message here is that randomness is good, but, as with everything else, an excess of it is bad. When we have too much, we can derandomize the algorithm and use pseudorandomness to obtain efficient algorithms with provable bounds.
Definition of Strongly Universal Hash families Refer to Definition 8.6.
Sept 09, 2008
Hashing II
Motivation
In the last lecture, we encountered hash functions for dynamic dictionaries. With these hash functions we could perform searches in expected constant time. However, the constant time bound came from an average-case analysis, and for dynamic dictionaries the worst-case time is unbounded. In this lecture, we introduce the design of hash functions for static dictionaries, which have constant search time in the worst case. The trade-off is that the dictionary must be fixed and known in advance; a dynamic dictionary, by contrast, can grow and shrink, so its size need not be fixed.
Why do we need perfect hash families ?
In order to ensure constant search time in the worst case for static dictionaries, we need to find (perfect) hash functions which cause zero collisions. We start by showing that families of perfect hash functions do exist for static dictionaries; for example, the trivial family of all functions from M to T. However, this family has a very large size, and we hope to obtain (almost) as good results with hash families of much smaller size. In addition, we impose the requirement that the size of the family of good hash functions be polynomial in m, where m is the size of the universe M of keys.
Why aren't perfect hash families sufficient ?
As already mentioned, for perfect hash functions, the hash family size is very large. The hope is that by relaxing the "perfect" requirement on the hash functions, we can bring down the size of the near-perfect hash family. However, near perfect hash families might cause collisions. But we cannot afford to have many collisions since we intend to achieve O(1) search time in the worst case. To this effect, we introduce a secondary level of hashing which uses perfect hash functions to rehash all keys which collide within a particular bin in the primary hash table.
Overall hashing scheme
So, now we have:
- a near-perfect primary hash function which maps all keys from S ⊆ M to the primary hash table, and
- an exactly perfect secondary hash function which maps colliding keys in primary cells to a secondary hash table.
We still need to prove:
- the existence of such primary and secondary hash functions
- Ex. 8.15 and 8.16 suggest using probabilistic methods to prove their existence
- guarantees on the overall size of the hash table and the worst case search time
- Th 8.19 proves that, despite the fact that secondary hash tables require size quadratic in the bin sizes of the primary hash table cells, we can still obtain a linear bound on the overall size of the entire hash table by intelligently choosing the primary hash function. Th 8.19 also proves that in the worst case the FIND operation requires constant time
Food for thought
1. Given that we can construct such two-level (double hashing) schemes for static dictionaries, why not do the same for dynamic dictionaries?
2. Defn 8.10 is analogous to the construction of universal hash functions in the initial portions of section 8.4.3 (page 219). But now we only have a multiplicative factor in x (we multiply it by k) and no additive term (as we had b in the earlier case). Why?
Sep 11, 2008
Randomized incremental constructions
The idea of a randomized incremental construction (RIC) is to "beat" an adversary by randomizing the input, so that the adversary can't set up a worst-case sequence that makes life difficult for the algorithm. For example, insertion sort has worst-case performance Ω(n^2) because the adversary can create a bad input. But suppose we were allowed to permute the input after the adversary chose it? We then get a randomized version of the sort procedure, one in which we can argue that the bad cases get "averaged out".
A generic randomized incremental procedure looks like this:
- Randomly order inputs
- insert them into a data structure one by one
This is incredibly easy from a programming point of view, and makes these algorithms very effective in practice. However, from an analysis point of view, it turns out that a better description is recursive.
Recursive-Solve(S)
- Pick s_i ∈ S at random
- Recursive-Solve(S − s_i)
- Insert s_i into the resulting structure.
Unrolling the recursion makes the algorithm look "backward": It seems as if we're picking the s_i randomly in reverse order, running the algorithm from back to front. This allows us to employ a form of analysis called 'backward' analysis, where we analyze the algorithm as if it were running backward.
The advantage of this approach is that it changes the nature of the events we condition on when figuring out the cost of inserting (or deleting) a new item. Roughly speaking, the cost of insertion (a random variable), instead of being analyzed in terms of the existing structure, can be analyzed in terms of the resulting structure. This is because if we take the 'backwards' view of the algorithm, we can relate the cost of insertion in step 3 to the cost of deletion in Step 2. This changes the conditioning, and as we shall see in a number of examples, makes the analysis tremendously easier.
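As a tiny illustration of the template (our own sketch, exploiting the QUICKSORT connection mentioned below): inserting keys in random order into a plain binary search tree is the RIC view of quicksort, and backward analysis shows that the expected depth of a node is O(log n).

    import random

    class Node:
        __slots__ = ("key", "left", "right")
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        """Plain BST insertion; all the randomness is in the insertion order."""
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def depth_stats(root, depth=0):
        """Return (sum of node depths, node count)."""
        if root is None:
            return (0, 0)
        ls, lc = depth_stats(root.left, depth + 1)
        rs, rc = depth_stats(root.right, depth + 1)
        return (ls + rs + depth, lc + rc + 1)

    keys = list(range(1000))
    random.shuffle(keys)                  # step 1: randomly order the inputs
    root = None
    for k in keys:                        # step 2: insert one by one
        root = insert(root, k)
    total, count = depth_stats(root)
    print(total / count)                  # about 2 ln(n), i.e. O(log n)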
We'll look at 4 examples, in progressively increasing complexity:
- A re-analysis of QUICKSORT in this RIC setting
- Planar convex hulls
- Delaunay triangulations of convex polygons
- Fixed dimensional linear programming.
Sept 16, 2008
Random Sampling
There are situations for which randomized incremental construction, discussed in the last class, might not be appropriate. For instance, randomized incremental construction is inherently sequential, and may thus be unsuitable for designing parallel randomized geometric algorithms. Also, building a geometric structure is often not an end in itself but a means for solving search problems; for example, the Voronoi diagram serves as the basis for nearest-neighbor queries. We therefore focus on designing randomized geometric algorithms based on random sampling (also called randomized divide-and-conquer).
Point Location Problem
We are given a set S of n numbers, and wish to answer membership queries: a query is a number, and we are to report whether or not the query number is a member of S. We pick a random sample R of r numbers from S, where r is a constant that is small compared to n. We sort the elements of R (in constant time), and then partition S\R into r + 1 subsets (in O(n log r) time, by binary search against the sorted sample); the i-th subset contains those elements of S\R that are larger than exactly i elements of R. The sample R is good if every one of the r + 1 resulting subsets of S\R has size at most (an log r)/r, for a fixed, suitable constant a.
Search Structure
For each subset containing more than b elements, for a suitable constant value of b > r, we recur by again choosing a random subset of r elements from it, and so on. This process induces a search-tree. Given a query q, we identify (in constant time) one of the r + 1 subsets of S in which to continue the search for q. We search recursively in the sub-tree associated with this subset.
Cost of Search
Let Q(n) be the cost of a search on a set of n elements. We have the recurrence Q(n) ≤ c + Q((an log r)/r), where a is small compared to r/log r and c is a constant representing the cost of descending one level of the tree. This solves to Q(n) = O(log n), because in the process of constructing the search tree we ensured that the random sample at every level was good.
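A minimal sketch of this search structure for the membership problem (our own code; the constants r and b are placeholders, and the resampling check that the sample is good is omitted for brevity):

    import bisect
    import random

    def build(S, r=8, b=64):
        """Random-sampling search tree: sorted sample R at each node,
        one subtree per gap; small sets become leaves."""
        if len(S) <= b:
            return ("leaf", set(S))
        R = sorted(random.sample(S, r))
        buckets = [[] for _ in range(r + 1)]
        for x in S:
            if x not in R:
                # bucket i holds elements larger than exactly i sample points
                buckets[bisect.bisect_left(R, x)].append(x)
        return ("node", R, [build(bkt, r, b) for bkt in buckets])

    def member(T, q):
        if T[0] == "leaf":
            return q in T[1]
        _, R, kids = T
        i = bisect.bisect_left(R, q)
        if i < len(R) and R[i] == q:
            return True
        return member(kids[i], q)         # constant work per level: O(log n)

    T = build(list(range(0, 10000, 3)))
    print(member(T, 999), member(T, 1000))    # True False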
Point Location in Arrangements
Let L be a set of n lines in the plane. The lines in L partition the plane into O(n^2) convex polygonal regions; the resulting structure is known as an arrangement of lines. We are only interested in the portion of this arrangement that lies within a fixed triangle τ that contains in its interior all points of intersection between lines in L. This can be viewed as a planar graph as follows. There is a vertex of the graph for each point at which two lines meet. In addition, there is also a vertex for each point at which a line of L intersects the boundary of τ. An edge between two vertices corresponds in the natural sense to the line segment between two vertices that are adjacent in the arrangement. Each face of this planar graph is one of the polygonal regions into which τ is partitioned by the lines in L. We study the following query problem: given a query point q in the plane, which face of this graph contains the query point?
We now have: given L, a triangulated arrangement T(L) of the lines in L can be computed in time O(n^2).
Point location problem in T(L)
1. Pick a random sample R of r lines from L, where r is a suitably large constant. Construct the arrangement T(R). The number of facets in T(R) is O(r^2), and is thus a constant.
2. For each (triangular) facet f in T(R), determine the set of lines of L\R intersecting f; denote this set of lines by L_f. This can be done in time O(nr^2). We say a facet f is good if it is intersected by no more than (an log r)/r lines of L, for a suitable constant a. We say the random sample R is good if every facet of T(R) is good. If the chosen sample R is not good, we repeatedly pick samples R until we get a good one.
3. For each facet f of T(R) with |L_f| > b, for a constant b, we recur on this process. In the recursive steps, the enclosing triangle is just the triangle bounding the facet f. We maintain a pointer from each facet f to the triangular arrangement of the recursive random sample of lines intersecting f. These pointers will facilitate the search process.
Note: Step 2 can in fact be implemented in time O(nr). (Hint: Zone Theorem.)
We shall prove the following lemma.
Lemma
The probability that any facet of T(R) is intersected by more than (an log r)/r lines of L is less than 1/2, for a suitably large constant a.
Corollary
The expected number of trials before we obtain a good sample R is at most 2.
The construction time satisfies the recurrence T(n) ≤ cnr + O(r^2) · T((an log r)/r), where c is a constant and T(k) denotes the upper bound on the expected cost of constructing the data structure for an arrangement of k lines. This solves to T(n) = O(n^{2 + ε(r)}), where ε(r) is a positive constant that becomes smaller as r gets larger.
Theorem
The data structure is constructed in expected time O(n^{2 + ε}) for a set of n lines in the plane, for any fixed ε > 0, and it supports point location queries in time O(log n).
Point to ponder
What are the effects of increasing r on the construction time for the search structure, and on the query time?
Sept 18, 2008
Randomized Rounding
In this lecture, we will show a randomized approach for solving a special class of problems, so-called 0-1 integer programs. We will show how we can relax the integrality constraints, solve the relaxed problem, and then (randomly) force the constraints back, still getting a good solution to a problem that would otherwise be hard to solve. We will finally analyze how the rounding affects the solution.
Integer Programs: Integer programs (IPs) are a special case of linear programs (LPs) with the additional constraint that the variables we optimize over must be integers; in a 0-1 integer program, x_i ∈ {0,1} for all i.
We know that solving IPs is NP-hard, so we relax this constraint and convert the IP into an LP. This means that we no longer force the x_i to be integers but instead allow real values, 0 ≤ x_i ≤ 1. Here is the basic algorithm to solve the original 0-1 integer program using randomized rounding.
- Solve the relaxed version of IP i.e. LP.
- Randomly set each variable x_i to one or zero according to the rule: x_i = 1 with probability x̂_i, and x_i = 0 otherwise, where x̂_i is the value of x_i obtained by solving the LP.
- Not really part of the algorithm: analyze how rounding affects the solution. That is, prove that after performing Step 2 there is a non-zero probability that a valid solution results, and analyze the quality of that solution.
We will look at three examples (routing problem for VLSI, multi-commodity flow and k-matching for hypergraphs) to see how this technique works.
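As a concrete sketch of the recipe (our own code, applied to a small set-cover-style 0-1 program rather than to the three examples above; it assumes SciPy's linprog for the LP step): solve the relaxation, then include each variable independently with probability proportional to its fractional value, retrying until the rounded solution is feasible.

    import random
    import numpy as np
    from scipy.optimize import linprog

    def round_set_cover(sets, n_elems, boost=2):
        """Randomized rounding for set cover (a sketch)."""
        m = len(sets)
        A = np.zeros((n_elems, m))
        for j, s in enumerate(sets):
            for e in s:
                A[e, j] = 1.0
        # Step 1: LP relaxation. Cover constraints Ax >= 1 become -Ax <= -1.
        res = linprog(c=np.ones(m), A_ub=-A, b_ub=-np.ones(n_elems),
                      bounds=[(0, 1)] * m)
        x_hat = res.x
        # Step 2: round; boosting/retrying drives the failure probability down.
        while True:
            pick = [j for j in range(m)
                    if random.random() < min(1.0, boost * x_hat[j])]
            covered = set().union(*(sets[j] for j in pick)) if pick else set()
            if len(covered) == n_elems:
                return pick

    print(round_set_cover([{0, 1}, {1, 2}, {0, 2}], 3))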
Sept 23, 2008
Algebraic Fingerprinting
Multiplying two polynomials of degree n can be done in O(n log n) time using the FFT. We show that verifying f_1 f_2 = f_3 can be done in O(n) time. Further, verifying that f_1 = f_2 for some polynomials f_1 and f_2 can be done with the same algorithm. Extending this past polynomials, we can verify the equality of two strings a and b by viewing them as polynomials with integer coefficients and degree at most n. We do this by taking strings a = (a_1,...,a_n) and b = (b_1,...,b_n) and viewing them as the polynomials A(x) = Σ_{i=1}^n a_i x^i and B(x) = Σ_{i=1}^n b_i x^i.
To do this verification we take the fingerprint of each polynomial by selecting a random number r and evaluating the polynomial at r. If the two fingerprints differ, then obviously the polynomials differ too. We shall show that if their fingerprints are equal, then the polynomials are equal with high probability.
The notes used for this lecture follow:
Randomized Algorithms Motwani and Raghavan, sections 7.1-7.2
Notes by Scott Alfeld
23.9.08
Section 7.1 - Fingerprinting and Freivalds' Technique
Multiplying two n × n matrices A and B can be done in O(n^3) time via the standard matrix multiplication algorithm.
The fastest alternative algorithm currently known runs in O(n^{2.376}) time. Verifying, however, that AB = C can be done in O(n^2) time with a randomized algorithm, by exploiting the fact that we can multiply a matrix with a vector in O(n^2) time.
Let r ∈ {0,1}^n be a random vector.
Let y = A(Br) and z = Cr.
Clearly AB = C implies y = z.
Obviously, the converse is not true. We will show, however, that
Pr[y = z | AB ≠ C] ≤ 1/2.
Consider the n × n matrix D = AB − C.
We note that P(y = z) = P(Dr = 0).
Because AB ≠ C, we know that D is not the zero matrix.
Let the k-th row of D, d^T, be some row with a non-zero entry.
Further, let z be the index of some non-zero entry of d^T.
The k-th element of Dr is the inner product d^T r.
Clearly, Pr(Dr = 0) ≤ Pr(d^T r = 0).
We note that d^T r = 0 iff Σ_i d_i r_i = 0.
This is to say r_z = −(Σ_{i ≠ z} d_i r_i) / d_z.
The right-hand side of the above equation can be 1, 0, or some other number. Because r_z is either 0 or 1 with equal likelihood, we find the (tight) upper bound
Pr(d^T r = 0) ≤ 1/2.
By this analysis, we see that the decision to draw the elements of r from {0, 1} was arbitrary.
Any two distinct elements would lead to the same bound.
Further, if instead we draw the elements of r from some set S of size s, then
Pr[y = z | AB ≠ C] ≤ 1/s.
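A minimal implementation sketch of this verifier (our own code; integer matrices keep the comparison exact, and independent rounds amplify the 1/2 bound):

    import numpy as np

    def freivalds(A, B, C, rounds=20):
        """Freivalds' check: is AB = C? One-sided error <= 2**-rounds.

        Each round costs three matrix-vector products, i.e. O(n^2) time.
        """
        n = C.shape[0]
        for _ in range(rounds):
            r = np.random.randint(0, 2, size=n)     # r uniform in {0,1}^n
            if not np.array_equal(A @ (B @ r), C @ r):
                return False                        # certainly AB != C
        return True                                 # equal w.h.p.

    A = np.random.randint(0, 10, (50, 50))
    B = np.random.randint(0, 10, (50, 50))
    print(freivalds(A, B, A @ B), freivalds(A, B, A @ B + 1))   # True False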
Section 7.2 - Verifying Polynomial Identities
We now extend this technique to verifying polynomial multiplication. Given polynomial functions f_1(x), f_2(x) of degree n, and f_3(x) of degree 2n, we want to verify that f_1(x) · f_2(x) = f_3(x).
Multiplying f1(x) and f2(x) takes O(nlog(n)) time using the Fast Fourier Transform. We will show that, using a randomized algorithm, we can verify the multiplication in O(n) time.
Let r be a random number drawn uniformly from a set S of integers with |S| >> n.
We claim the polynomial multiplication to be valid if f_1(r) · f_2(r) = f_3(r).
Clearly,
f_1 f_2 = f_3 implies f_1(r) f_2(r) = f_3(r),
but the converse is not true. In parallel with the previous section, we will show that
Pr[f_1(r) f_2(r) = f_3(r) | f_1 f_2 ≠ f_3] is small.
We let Q(x) = f_1(x) f_2(x) − f_3(x).
Because the degrees of f_1 and f_2 are n, Q has degree at most 2n and thus at most 2n distinct roots. Therefore,
Pr[Q(r) = 0 | Q ≢ 0] ≤ 2n/|S|.
We can illustrate this pictorially.
The graph of Q(x) crosses the x-axis at most 2n times; therefore, when we select r, the probability that it is one of these zeros is at most 2n/|S|.
We could also use a deterministic approach: simply pick 2n + 1 points and evaluate Q(x) at each.
However, the naive version of this takes O(n^2) time, and even a more advanced technique brings this only to O(n log^2 n) time (longer than multiplying the original polynomials directly).
In general, this procedure can be used to determine a polynomial identity f1(x) = f2(x) by simply letting Q(x) = f1(x) − f2(x). Given f1(x) and f2(x) explicitly, we can also compare coefficients directly - a deterministic O(n) procedure. This cannot be done, however, when the polynomials are instead given implicitly. One key advantage of this randomized algorithm is that it allows a black-box view of the polynomials.
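A minimal black-box sketch of this test (our own code; the choice |S| = 100·(degree bound) is an arbitrary illustrative constant): evaluate both polynomials at random points and compare.

    import random

    def polys_equal(f1, f2, degree_bound, trials=10):
        """Randomized identity test for black-box integer polynomials.

        If f1 != f2, then Q = f1 - f2 is nonzero of degree <= degree_bound,
        so a random r from a set of size 100*degree_bound is a root with
        probability at most 1/100; independent trials amplify this.
        """
        size = 100 * degree_bound
        for _ in range(trials):
            r = random.randrange(size)
            if f1(r) != f2(r):
                return False          # a witness: the polynomials differ
        return True                   # equal with probability >= 1 - 100**-trials

    f = lambda x: (x + 1) * (x - 1)
    g = lambda x: x * x - 1
    h = lambda x: x * x + 1
    print(polys_equal(f, g, 2), polys_equal(f, h, 2))   # True False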
We now extend this technique to handle the case for multivariate polynomials.
First we adopt the book's change of variables in Q.
Rather than the degree of the polynomials, n now represents the number of variables.
The degree of a term in Q is the sum of the exponents, and the total degree d of Q is the maximum of the degrees of its terms.
We now choose a random vector r = (r_1,...,r_n), with each r_i drawn independently and uniformly from a set S, and want to prove
Pr[Q(r_1,...,r_n) = 0 | Q ≢ 0] ≤ d/|S|.
We prove this by induction on n. Our base case, n = 1,
Pr[Q(r_1) = 0 | Q ≢ 0] ≤ d/|S|,
follows from our preceding analysis (a nonzero univariate polynomial of degree d has at most d roots).
We then factor out x_1 (or any other variable of our choosing), defining k to be the greatest exponent of x_1 in Q.
This leads to
Q(x_1,...,x_n) = Σ_{i=0}^k x_1^i Q_i(x_2,...,x_n),
where Q_i(x_2,...,x_n) is the coefficient of x_1^i.
We note that each Q_i is a polynomial in n − 1 variables, and has degree at most d − i.
We now consider the polynomial Q_k.
The total degree of Q_k is at most d − k, and by our choice of k, we know that Q_k ≢ 0.
We now consider the two cases.
Case I: Q_k(r_2,...,r_n) = 0.
By our induction hypothesis, we know that Pr[Q_k(r_2,...,r_n) = 0] ≤ (d − k)/|S|.
Case II: Q_k(r_2,...,r_n) ≠ 0.
We define a new, univariate, polynomial q(x_1) = Q(x_1, r_2,...,r_n) = Σ_{i=0}^k x_1^i Q_i(r_2,...,r_n).
Because the coefficient of x_1^k in q is Q_k(r_2,...,r_n) ≠ 0, we know that q has degree k, and q ≢ 0.
Because q is univariate, we know that Pr[q(r_1) = 0] ≤ k/|S|.
We now know that
Pr[Q(r) = 0 | Q_k(r_2,...,r_n) ≠ 0] ≤ k/|S|,
and that
Pr[Q_k(r_2,...,r_n) = 0] ≤ (d − k)/|S|.
Exercise 7.3 in the book shows that for any two events A and B,
Pr(A) ≤ Pr(A | B̄) + Pr(B).
Therefore
Pr[Q(r) = 0 | Q ≢ 0] ≤ k/|S| + (d − k)/|S| = d/|S|.
Exercise 7.3
For events A and B we want to show

We know that

and, by Bayes' Rule

and clearly,

Therefore

Sept 25, 2008
Problem: Fingerprinting and Applications
In this lecture, we will look at another fingerprinting technique. Suppose two parties, Alice and Bob, want to verify that they have the same information stored in their respective databases. The simple way is for one to send all of its information to the other, but if the information is n bits long, then all n bits must be transferred. Our goal is to verify that both have the same data without sending all the bits.
To solve this we use a fingerprinting technique based on reducing the information modulo some prime p: Alice sends the fingerprint F_p(a) = a mod p to Bob, who compares it with his own fingerprint F_p(b). The technique fails only when F_p(a) = F_p(b) even though a ≠ b. We will see that if we pick p at random from a big enough set, then Pr[F_p(a) = F_p(b) | a ≠ b] is small. We will also see that with this technique we need only O(log n) bits to verify, which is an exponential improvement.
We will also look at two applications:
1. Arithmetic modulo a random prime
If we have very large integers, it is computationally expensive to operate on them; such integers can arise, for example, as results of evaluating polynomials. Generally our goal is to decide whether two large integers are equal. If we apply the modulo operator during evaluation, then we can ensure a correct answer with high probability while handling only O(log n)-bit numbers.
2. Pattern matching
We are given a string X = x_1...x_n and a shorter pattern Y = y_1...y_m with m < n. The goal is to determine whether Y occurs as a contiguous substring of X. The brute-force method takes O(mn) time; there exist complicated deterministic algorithms that run in O(m+n) time, but they are difficult to implement and have large overheads. We will see that we can use the fingerprinting technique to get a randomized algorithm that runs in O(m+n) time. Again the probability of failure is small, and we can further reduce the error probability by converting the Monte Carlo algorithm into a Las Vegas algorithm.
At the end, I will summarize different fingerprinting techniques and we can discuss which technique should be used depending on the problem.
Food for thought: Why do we solve pattern matching using this technique? Why not the technique used for comparing polynomials?
September 30, 2008
Approximate Counting I
If you are following along in the book, this presentation will actually cover sections 6.2-6.5, with more emphasis on the later sections.
Introduction of Random Walks
The main thrust of this talk involves determining bounds on various characteristics of random walks on graphs. While this does involve counting, it may not be exactly what you had in mind. A random walk on a graph is just what you would imagine: pick a vertex, randomly choose one of its neighbors, walk to this new vertex, and then repeat the process. This kind of stochastic process has a wide variety of applications. In this seminar we will explore several characteristics of random walks: specifically, we will investigate their connections with Markov chains, and analyze how many steps (moves from node to node) it takes to return to the origin node or to visit every node in the graph.
Markov Chain recap
A Markov Chain is a discrete-time stochastic process defined over a set of states S in terms of a matrix P of transition probabilities. The MC can be in only one state at a time and makes state transitions in discrete steps. The entry P_{i,j} of the transition matrix is the probability that the next state will be j given that the current state is i. Also, for all i and j, P_{i,j} ≥ 0 and Σ_j P_{i,j} = 1. A key trait of MCs is the memorylessness property: the future behavior of an MC depends only on the current state and not on how it arrived at that state, i.e. Pr[X_{t+1} = j | X_0,...,X_t] = Pr[X_{t+1} = j | X_t].
A stationary distribution for a given MC with transition matrix P is a probability distribution π such that π = πP. Intuitively, once a Markov Chain enters its stationary distribution it will not leave.
More on Random Walks
Some definitions
1. h_{i,j} - the expected number of steps to reach j starting from i; known as the hitting time or access time.
2. f_{i,j} - the probability that a walk started at i ever visits j.
3. C_{u,v} - the commute time: the expected number of steps needed to get from u to v and back again.
4. C_u(G) - the cover time: the expected length of a walk that starts at u and visits every other node in G. The cover time of the graph is C(G) = max_u C_u(G).
Theorem 6.5: For any edge (u,v) ∈ E, C_{u,v} = h_{u,v} + h_{v,u} ≤ 2m.
- The commute time across an edge is at most twice the number of edges in the graph.
Random Walks as Electrical Networks
A physics 102 refresher:
1. Kirchhoff's Law: the sum of the currents entering a node equals the sum of the currents leaving it.
2. Ohm's Law: Voltage = Current × Resistance.
Given an undirected graph G, let N(G) be the electrical network defined as follows: for each vertex in G create a vertex in N(G). For every edge in G place a 1 Ohm resistor between corresponding vertices in N(G).
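To see the electrical view in action, here is a small sketch (our own code) computing commute times from the Laplacian pseudoinverse, using the standard identity C_{u,v} = 2m * R_eff(u,v) for the 1-Ohm network N(G):

    import numpy as np

    def commute_time(adj, u, v):
        """Commute time C_{u,v} = 2m * R_eff(u,v) on an undirected graph.

        R_eff is the effective resistance between u and v in N(G),
        computed via the pseudoinverse of the graph Laplacian.
        """
        A = np.asarray(adj, dtype=float)
        L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
        Lp = np.linalg.pinv(L)
        r_eff = Lp[u, u] + Lp[v, v] - 2 * Lp[u, v]
        m = A.sum() / 2                          # number of edges
        return 2 * m * r_eff

    # Path 0-1-2: two unit resistors in series, R_eff(0,2) = 2, m = 2,
    # so C_{0,2} = 8, matching h_{0,2} + h_{2,0} = 4 + 4.
    print(commute_time([[0, 1, 0], [1, 0, 1], [0, 1, 0]], 0, 2))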
Graph Cover Time Bounds
Theorem 6.9: mR(G) ≤ C(G) ≤ O(mR(G) log n), where R(G) denotes the maximum effective resistance between any pair of nodes in N(G).
These bounds cannot be generally improved (beyond constant factors). There are graphs for which neither bound is tight to within a factor of log n.
October 02, 2008
Approximate Counting II
Today's seminar is much closer to the idea of counting than the previous one, yet they overlap in various ways. The big picture is a story of wanting to estimate the size of a set that is a subset of a much larger set. We want to sample from the superset, but in a targeted way. Can we construct such a sampling scheme so as to get at the subset we are interested in without having to sample too much?
Introduction to #P
What is a counting problem? Normally in this algorithms seminar we think in terms of decision problems: algorithms that return YES/NO upon completion. In today's seminar we are going to explore a class of problems that involve counting the number of solutions of a given decision problem. For example, perhaps we want to determine the number of Hamiltonian cycles present in a graph, or the number of cliques on 5 vertices, etc. More formally, a counting problem takes an instance of a decision problem and returns the number of solutions for that instance.
The class of counting problems that are associated with the complexity class NP is denoted as #-P (Sharp-P). This is the set of all counting problems corresponding to decision problems in NP.
Randomized Approximation Schemes
Consider a problem Π and let #(I) denote the number of distinct solutions for instance I of Π.
An approximation algorithm takes input I and outputs A(I), which we hope is close to #(I).
1. PAS - Polynomial Approximation Scheme: A deterministic algorithm that takes I and ε > 0 as input and in polynomial time produces A(I) such that (1 − ε)#(I) ≤ A(I) ≤ (1 + ε)#(I).
2. PRAS - Polynomial Randomized Approximation Scheme: A randomized algorithm that takes I and ε > 0 as input and in polynomial time produces A(I) such that Pr[(1 − ε)#(I) ≤ A(I) ≤ (1 + ε)#(I)] ≥ 3/4.
The 3/4 term is somewhat arbitrary, and this scheme gives us no information regarding the quarter of cases that do not fall within this bound.
3. (ε,δ)-FPRAS - Fully Polynomial Randomized Approximation Scheme: Computes an ε-approximation to #(I) with probability at least 1 − δ, in time polynomial in |I|, 1/ε, and log(1/δ).
Disjunctive Normal Form (DNF) Counting Problem
Let F = C_1 ∨ C_2 ∨ ... ∨ C_m be a boolean formula in disjunctive normal form. Recall that a DNF formula is a set of m clauses joined by OR operations, where each clause is a conjunction (AND) of literals, a literal being either a boolean variable or its negation. We assume that a variable appears at most once per clause.
A truth assignment a is a true/false assignment to every variable. A truth assignment satisfies the formula when F(a) evaluates to true.
Let #F be the number of truth assignments that satisfy F; clearly 0 ≤ #F ≤ 2^n (and computing #F exactly is #P-complete).
The DNF counting problem is essentially to find the value of #F. Specifically, we want to find an (ε,δ)-FPRAS for this problem.
The Coverage Algorithm
To solve the DNF counting problem described above, it is helpful to first formulate this problem as an abstraction as follows.
Let V be a finite set, and suppose we are given m subsets H_1,...,H_m ⊆ V, with the following assumptions:
1. V is the space of all truth assignments.
2. H_i is the set of truth assignments that satisfy clause C_i of F.
3. |H_i| is computable in polynomial time.
4. It is possible to sample uniformly from any H_i.
5. For all v ∈ V, it is possible to determine whether v ∈ H_i in polynomial time.
Ask yourself if these assumptions seem reasonable.
From here, the goal is to estimate the size of H = H_1 ∪ H_2 ∪ ... ∪ H_m, which is hard to do by brute force or by the inclusion-exclusion principle.
Given assumptions 3-5 above, we can actually use some simple Monte Carlo methods to approximate |H|. (Recall, in this abstraction, H represents the set of all truth assignments that satisfy at least one clause of F, thereby satisfying F itself.)
Now, this in turn becomes the union of sets problem and we solve it as follows.
Define the multiset U = H_1 ⊎ H_2 ⊎ ... ⊎ H_m, the multiset union of all the H_i. The elements of U can be seen as the ordered pairs (v,i) such that v ∈ H_i. The size of U is |U| = Σ_{i=1}^m |H_i|.
The coverage set of an element v is defined by cov(v) = {(v,i) : (v,i) ∈ U}; the size of the coverage set is exactly the number of H_i's containing the element v. In our DNF counting example, cov(v) is the set of clauses satisfied by the truth assignment v.
What does the coverage set tell us?
1. The number of (nonempty) coverage sets is |H|.
2. The coverage sets partition U.
3. The size of U is the sum of the sizes of the coverage sets, |U| = Σ_v |cov(v)| = Σ_i |H_i|, and this is easy to compute.
4. For all v ∈ H, 1 ≤ |cov(v)| ≤ m.
Next we define a canonical element:
Let f: U → {0,1} be defined as f((v,i)) = 1 if i = min{j : v ∈ H_j}, and 0 otherwise. The canonical element for v is simply the pair corresponding to the occurrence of v in the lowest-numbered H_i.
Let G be the set of canonical elements. Then |G| = |H|, because each v ∈ H contributes exactly one canonical element. Our goal is to estimate the size of G, which is exactly what we are looking for in terms of our original DNF problem.
Lastly, we are left with Theorem 11.3, which states:
The Monte Carlo method yields an ε-approximation to |G| with probability at least 1 − δ, provided the number of samples is N ≥ (4m/ε^2) ln(2/δ).
And the running time is polynomial in N. It is fairly easy to verify that computing f and sampling from U uniformly can be done in polynomial time. The result is an (ε,δ)-FPRAS, utilizing importance sampling, that approximates the number of satisfying truth assignments of the DNF formula.
October 07, 2008
Randomized Complexity Classes I
This talk introduces complexity classes for randomized algorithms. We start with definitions of RP and coRP and try to understand them by drawing comparisons with the known notions of NP and coNP. Thereafter, we introduce ZPP. We show that the class ZPP essentially captures the class of problems for which we can design Las Vegas algorithms. We also establish the relationship ZPP = RP ∩ coRP.
The definition of the class RP is somewhat stringent in that it allows only one-sided error. For most practical purposes we would like to relax this restriction, which leads us to the class BPP, allowing two-sided error. What about coBPP? How is BPP related to RP and NP? These are some of the questions that will be answered in the course of the talk.
Next we introduce the class PP. We compare PP with BPP and show how having a small or large gap between the probability bounds in the definitions of these classes can make life easy or difficult. In addition, we will see that in all the above cases the actual values of the probability bounds don't matter much and can be replaced with other values without changing the classes. Moreover, we can use the same strategy to understand intuitively why it's good to have a large gap, as in BPP.
Finally, we conclude by establishing the containment hierarchy that includes all classes that we already knew and all that we have learned so far.
October 09, 2008
Randomized Complexity Classes II
The last lecture introduced complexity classes for randomized algorithms, namely RP, coRP, ZPP, BPP and PP. We saw that using probability amplification techniques the probability that a randomized algorithm gives a wrong answer can be made extremely small. This essentially captures the importance/success of randomized algorithms: although they have some probability of error, this error probability can be made so small that in practice randomized algorithms are as good as deterministic ones. In addition, we also learned the containment hierarchy of randomized complexity classes: P ⊆ ZPP ⊆ RP ⊆ BPP and RP ⊆ NP ⊆ PP. However, we don't know how BPP relates to NP.
The big question for this talk is: Where does BPP lie ?
We give two answers to this question:
1: We know that NP allows existential quantifiers (∃) and coNP allows universal quantifiers (∀). What if a problem contains both? The Polynomial Hierarchy (PH) was defined to capture problems in this category. The PH generalizes the notion of P-NP-coNP and consists of complexity classes Σi and Πi. The Polynomial Hierarchy can be defined by an offline approach using the usual Deterministic Turing Machine (DTM) or by an online approach using the Alternating Turing Machine (ATM), which alternates between the existential and universal quantifiers. We formally define ATMs and show that ATMs can be used to realize the complexity classes Σi and Πi. Thereafter, we prove: BPP ⊆ Σ2 ∩ Π2.
2: Researchers have tried to use boolean circuits to attack the P = NP question, the assumption being that boolean circuits are mathematically simpler than Turing machines. Circuit complexity (measured in terms of boolean circuit size) provides an alternative definition of the Polynomial Hierarchy. We define non-uniform complexity using families of non-uniform circuits. We also look at an alternative definition which uses "an advice" string to test membership of input strings. The complexity class P/poly is the class of problems which have algorithms with polynomial time complexity and polynomial (circuit) size complexity. We prove that BPP also lies in P/poly, i.e., BPP ⊆ P/poly.
October 21, 2008
Interactive Proofs: IP and PSPACE
IP
In computational complexity theory, the class IP is the class of problems solvable by an interactive proof system. An interactive proof system consists of two machines: a prover, P, which presents a proof that a given string x is a member of some language, and a verifier, V, which checks that the presented proof is correct. The prover is assumed to be computationally unbounded, while the verifier is a probabilistic polynomial-time machine with access to a random bit string whose length is polynomial in the size of x. The two machines exchange a polynomial number, p(|x|), of messages, and once the interaction is complete, the verifier must decide whether or not x is in the language, with at most a 1/3 chance of error.
Formally,
Let k: N → N be a function with k(n) computable in poly(n) time. A language L is in IP[k] if there is a probabilistic Turing machine V that can have a k(n)-round interaction with a prover P such that, on inputs x, r, a_1,...,a_i, V runs in time polynomial in |x|, and:
(Completeness) x ∈ L ⟹ there exists a prover P such that Pr[V accepts] ≥ 2/3.
(Soundness) x ∉ L ⟹ for every prover P, Pr[V accepts] ≤ 1/3.
We can then define IP = ∪_{c ≥ 1} IP[n^c].
Example: Proving that graphs are not isomorphic
This is an example in IP that is not known to be in NP. We say two graphs G1 and G2 are isomorphic if there is a permutation π of the labels of the nodes of G1 such that π(G1) = G2. Graph Isomorphism (GI) is in NP, since a certificate is simply the description of the permutation π. Graph Non-Isomorphism (GNI) is deciding whether two given graphs are not isomorphic.
PSPACE
PSPACE is all the problems which can be solved by programs which only need a polynomial (in the length of the problem instance specification) amount of memory to run.
The formal definition of PSPACE is:
PSPACE = ∪_k SPACE(n^k).
The following relations are known between PSPACE and other complexity classes: P ⊆ NP ⊆ PSPACE ⊆ EXPTIME.
PSPACE-completeness
A language B is PSPACE-complete if:
a) B ∈ PSPACE
b) for all A ∈ PSPACE, A ≤_p B,
where A ≤_p B means that there is a polynomial-time many-one reduction from A to B.
Remarks: IP = PSPACE
- Given any verifier V, we can compute the optimum prover P (which, given x, maximizes the verifier's acceptance probability) using poly(|x|) space (and hence 2^{poly(|x|)} time). Thus IP ⊆ PSPACE.
- We will also prove that PSPACE ⊆ IP. To do so, we'll show that TQBF ∈ IP. This is sufficient because every L ∈ PSPACE is polytime reducible to TQBF (True Quantified Boolean Formula).
October 24, 2008
The PCP Theorem
Suppose your professor, the infinitely powerful prover, wants to provide you with some proof. As a limited student, you wish to verify that the proof is valid. However, you do not have time to, or perhaps want to, read every page of the proof. Your tactic is simple: throw all the pages of the proof in the air, catch a few, read those,
and then deduce whether or not the proof as a whole is valid.
We now formalize this problem with the notion of a PCP system.
We let P be the prover (the professor), and V be the verifier (the student).
The system works as follows.
- Given an instance x of length n of some problem, P writes down a proof π as a binary string.
- V does some precomputation, looking only at x.
- V then randomly chooses a limited set I of indices of π.
- V then constructs a function φ, returning YES or NO.
- V then gets the values of π at each of the indices, passes the values through φ, and returns the result.
We then define the limitations on P and V:
- V's precomputation must be done in poly(n) time.
- V has Q(n) bits of randomness.
- V can look at r(n) bits of π.
- V can use x and its random bits, but not any values of π, while constructing φ.
- If π is valid for x, V must claim so.
- If π is not valid for x, V must claim so with probability at least 1/2.
We define the class of problems PCP(Q,r) such that a language L is in PCP(Q,r) if:
- x ∈ L implies P can write down a proof that V always accepts in the above procedure.
- x ∉ L implies that no matter what proof P writes, V accepts it with probability at most 1/2.
This may seem to be an odd or arbitrary way to define a class of problems. The punch line, however, is the PCP Theorem:
NP = PCP(O(log n), O(1))
We note that there are key differences between the PCP approach to NP and the more well-known interpretations. Instead of reading in an entire proof and verifying it directly, in the PCP system V asks P for only pieces of the full proof. From those pieces, V then guesses whether or not the entire proof is valid, and V's guess is wrong with bounded probability.
The PCP Theorem will not be proven in this lecture, as it would take far too long. Instead, we tie the PCP Theorem to the notion of a problem being hard to approximate. Specifically, we show that the PCP Theorem is equivalent to saying that, for some constant s, GAP-3SAT(s) is NP-hard. To do this, we now define the problem GAP-3SAT(s).
Given a 3CNF formula ψ with m clauses, we define OPT to be the maximum number of clauses in ψ
that can be satisfied simultaneously.
The problem of GAP-3SAT(s) is to:
- output YES if OPT = m.
- output NO if OPT < sm.
We make no restrictions on the output if sm ≤ OPT < m.
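For concreteness, here is a tiny sketch (our own encoding: a clause is a triple of signed integers) of the quantities involved; the brute-force OPT is exponential and only meant to make the YES/NO promise tangible:

```python
from itertools import product

def num_satisfied(clauses, assignment):
    # Count clauses with at least one true literal; +v / -v encode v / not-v.
    return sum(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

def opt(clauses, nvars):
    # Brute-force OPT over all 2^nvars assignments (toy instances only).
    return max(num_satisfied(clauses, dict(enumerate(bits, start=1)))
               for bits in product([False, True], repeat=nvars))

clauses = [(1, 2, -3), (-1, 3, 4), (2, -4, 1), (-2, -3, -4)]
m, s = len(clauses), 0.9
best = opt(clauses, 4)
print("YES" if best == m else "NO" if best < s * m else "either answer is allowed")
```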
We now go on to prove that the PCP Theorem implies that there exists a constant s such that GAP-3SAT(s) is NP-hard. To do this, we show that, given the PCP Theorem, there exists a reduction R from the NP-complete problem 3-Color to GAP-3SAT(s). Because 3-Color is in NP, and we're assuming the PCP Theorem, we know there exists a PCP proof π written by the prover P, as defined previously. We let π_i denote the i-th bit of π.
We think of each of the πi's as a variable for a 3SAT formula.
Given the input graph G, R enumerates all N = 2^{Q(n)} = 2^{O(log n)} = poly(n) possible random strings the verifier V could choose. We denote these as Q_1, ..., Q_N. Each of the Q_i's determines C proof locations and a predicate φ, where C = O(1) is the number of bits V reads. R constructs a 3CNF formula from each φ.
Because each φ is a function of C variables, we know that each generated 3CNF formula has at most K = C·2^C clauses. To simplify the analysis, we assume that each generated 3CNF formula has exactly K clauses. R then returns the conjunction of all the generated 3CNF formulae, yielding a total of m = NK clauses.
Clearly, if G is 3-colorable, then by the PCP Theorem there exists a π that satisfies all of V's checks. Thus all of the m clauses can be satisfied, and OPT = m, as required for the reduction to be valid.
If, however, G is not 3-colorable, then for any proof at least N/2 of V's checks must fail, again by the PCP Theorem. If Q_i results in a failure, then the 3CNF formula constructed from Q_i's φ is not satisfied by the assignment, meaning that at most K − 1 of its clauses can be satisfied. Thus the total number of clauses that can be satisfied is at most (N/2)·K + (N/2)·(K − 1) = m − N/2 = m(1 − 1/(2K)). Taking s = 1 − 1/(2K), this shows that the PCP Theorem implies that GAP-3SAT(s) is NP-hard.
We now show that GAP-3SAT(s) being NP-hard implies the PCP Theorem.
We do so by first assuming that GAP-3SAT(s) is NP-hard.
Our assumption means that any NP-complete problem, for example 3-Color, can be reduced to GAP-3SAT(s). This is to say that we can reduce 3-Color, with input graph G, to a 3CNF formula R(G) such that:
- G ∈ 3-Color implies R(G) has OPT = m.
- G ∉ 3-Color implies R(G) has OPT < sm.
Given the existence of this reduction, we construct V and P's proof for the PCP system. V runs the reduction above as its precomputation, giving R(G) = ψ, a 3CNF formula with m clauses. P's proof π is the canonical representation of an assignment to the variables in ψ. V uses its random bits to pick a clause C of ψ at random, and checks that C is satisfied by π.
Clearly, if G ∈ 3-Color, then by the definition of R(G) any clause picked by V will be satisfied, because OPT = m. If G ∉ 3-Color, we know that OPT < sm, again by our construction of R(G). This means that the probability that V picks a satisfied clause is less than s. Since s is a constant, through repetitions of the process we can bring the probability of accepting a bad proof below 1/2.
We've now shown that the PCP Theorem is equivalent to saying that GAP-3SAT(s) is NP-hard.
We now change the tone to a discussion of the PCP Theorem at a higher level.
One key aspect of the PCP Theorem is that V needs only to look at a constant number of bits of the proof.
We now turn our efforts to providing some rough intuition as to why it is possible to verify a potentially very long proof while looking at only a constant number of its bits.
Suppose we have the following game. Alice has a number n, and Bob is trying to find out what n is. Bob knows the range in which n lies, but can only look at n through fuzzy glasses. While in the PCP system this corresponds to reading only a constant number of bits of the proof, in this game we say Bob can only read Alice's number to within five. This is to say that when Bob attempts to read a number q from Alice, he receives some value in [q − 5, q + 5].
The question is how Bob can accurately determine the value of n. A key principle aiding Bob is that a problem being in PCP(Q,r) means that P *can* write a proof that V can verify. In our game, Bob can simply request that Alice multiply n by some number, say, 100. Now, Bob asks Alice for her number, receives 100n ± 5, and easily decodes n by rounding to the nearest multiple of 100.
By taking the original space of possible proofs, and projecting it into a larger space, we are able to cause valid proofs to be spread out. Then, by obtaining only a rough sense of their location in the larger space, we can ascertain the original proof. While the Alice and Bob game has a distance measure of numerical difference, the distance used for proofs would be a measure of how many bits differ between any two proofs.
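The game itself is runnable; a sketch of the "multiply by 100" code, with the ±5 noise standing in for the fuzzy glasses:

```python
import random

def alice_encode(n):
    return 100 * n                           # spread codewords 100 apart

def fuzzy_read(value):
    return value + random.randint(-5, 5)     # Bob's glasses: off by at most 5

def bob_decode(reading):
    return round(reading / 100)              # snap to the nearest codeword

n = 37
assert all(bob_decode(fuzzy_read(alice_encode(n))) == n for _ in range(1000))
print("decoded", n, "correctly every time")
```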
October 28, 2008
Expanders
Let G be an undirected, d-regular graph, possibly with self-loops and/or multiple edges between its vertices.
We define E(S,T) to be the number of edges connecting the sets of vertices S, T ⊆ V. For a set S ⊆ V, we then define δS = E(S, V \ S) to be the number of edges leaving S. This leads to our measure of the Expansion of a graph, h(G). Formally:
- G = (V,E), |V| = n
- δS = E(S, V \ S)
- h(G) = min over all S ⊆ V with |S| ≤ n/2 of δS / |S|
Finally, we say that G is an Expander if h(G) is large, specifically if h(G) ≥ c for some constant threshold c.
Intuitively, Expanders can be thought of as graphs wherein one becomes lost easily, or graphs with no bottlenecks.
Expanders and Eigenvalues
We now connect the idea of an Expander to the eigenvalues of its adjacency matrix. We define the adjacency matrix A(G) as follows: A(G)_{u,v} is the number of edges between vertex u and vertex v. Clearly, A(G) is symmetric, meaning that A = A^T. Linear algebra therefore tells us that A(G) has real eigenvalues μ_0, μ_1, ..., μ_{n−1}. To simplify the notation, we assume the eigenvalues are sorted such that μ_0 ≥ μ_1 ≥ ... ≥ μ_{n−1}.
We call the eigenvalues of A(G) the Spectrum of G.
Why do we care about the Spectrum of G? We note some connections between the Spectrum of G and other properties of G. These properties will not be proven in this lecture.
- μ_0 = d (recall that G is a d-regular graph.)
- G is connected iff μ_1 < μ_0.
- (d − μ_1)/2 ≤ h(G) ≤ √(2d(d − μ_1))
Letting the Spectral Gap be Δ = μ_0 − μ_1 = d − μ_1, property 3, above, becomes
Δ/2 ≤ h(G) ≤ √(2dΔ)
This theorem, from Cheeger & Buser, as well as Tanner, Alon & Milman, gives us a way to bound h(G) by looking only at the eigenvalues of A(G). In this lecture, we won't be using this theorem directly. It's important to note, however, that we are able to detect that a graph G is an Expander given its adjacency matrix, by finding and sorting A(G)'s eigenvalues.
The Expander Mixing Lemma
We now discuss a very important aspect of Expanders. For reasons that will become clear later in this lecture, an extremely beneficial property of Expanders is their similarity to random graphs. We define a random graph to be constructed in the following manner:
- For each node v ∈ V, pick d nodes u_1, ..., u_d from V uniformly at random, and add the edges (v, u_1), ..., (v, u_d) to E.
The Expander Mixing Lemma can be thought of as a bound on how much an Expander G = {V,E} differs from a randomly constructed graph.
Specifically, we claim that for all sets S, T ⊆ V, the number of edges from S to T will not differ too greatly from the expected number of edges between S and T in a randomly constructed graph. For a random graph R with n vertices, we can see that the expected number of edges between S and T is
E_R(S,T) = d·|S|·|T| / n
The Expander Mixing Lemma states that for an Expander G with n nodes, the difference between E_R(S,T) and E_G(S,T) is bounded for all S, T ⊆ V. Specifically,
| E_G(S,T) − d·|S|·|T|/n | ≤ λ·√(|S|·|T|)
where λ = max(|μ_1|, |μ_{n−1}|). This lemma will not be proven in this lecture; however, we will use its result in a later application. We now focus on a seemingly unrelated issue, which we will tie back to the notion of Expanders and the Expander Mixing Lemma later.
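These spectral quantities are easy to compute numerically. The sketch below builds a 3-regular graph on Z_p with edges x ↔ x±1 and x ↔ x⁻¹ (a classical expander candidate, used here simply as a convenient test graph), reads λ off the spectrum, and spot-checks the Mixing Lemma on random sets:

```python
import numpy as np

p, d = 101, 3                                    # a prime; vertices are Z_p
A = np.zeros((p, p))
for x in range(p):
    A[x, (x + 1) % p] += 1                       # x -> x+1
    A[x, (x - 1) % p] += 1                       # x -> x-1
    A[x, pow(x, -1, p) if x else 0] += 1         # x -> x^{-1} (0 maps to itself)
assert np.allclose(A.sum(axis=1), d) and np.allclose(A, A.T)

mu = np.sort(np.linalg.eigvalsh(A))[::-1]        # real spectrum, descending
lam = max(abs(mu[1]), abs(mu[-1]))
print("spectral gap:", d - mu[1], " lambda:", lam)

rng = np.random.default_rng(0)
for _ in range(5):                               # spot-check the Mixing Lemma
    S = rng.random(p) < 0.3
    T = rng.random(p) < 0.3
    e_st = A[np.ix_(S, T)].sum()                 # E_G(S, T)
    assert abs(e_st - d * S.sum() * T.sum() / p) <= lam * np.sqrt(S.sum() * T.sum())
print("mixing lemma held on all sampled S, T")
```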
Error Amplification for BPP
Recall that a language L is in BPP if there exists a polynomial-time algorithm A that, given an instance x of a problem and a random string r of length m, returns the correct answer with probability at least 2/3. We let f(x) be the function that, given x, always returns the correct classification (f need not be computable in polynomial time; we only use its existence to simplify the analysis). This is to say f(x) = 1 if x ∈ L, and f(x) = 0 if x ∉ L. Because the problem is in BPP, we know that the probability (over the random string r) that A is incorrect is at most 1/3:
Pr_r[A(x,r) ≠ f(x)] ≤ 1/3
A standard amplification technique is to take a set r_1, ..., r_d of d independent random strings, and return the majority value of A(x,r_1), ..., A(x,r_d). This process requires m random bits for each of the d strings, for a total of dm random bits.
We can think of this process in terms of selecting nodes from a graph G = (V,E) with |V| = 2^m. To simplify the notation, we use the vertices and their indices (m-bit strings) interchangeably; for example, for v ∈ V, A(x,v) = A(x,r) where r is the index of v.
Suppose we let G be a randomly constructed graph. We let some vertex V* be a random node in G, and let R be the set of d indices of V*'s neighbors. Because of the random construction of G, this process is identical to picking d m-length strings randomly for R.
Through this process of choosing V* and taking its neighbors, we are allowing ourselves to obtain d random strings, each of length m, while only using m random bits. This process, however, requires a fully constructed G, which could require far more than dm random bits. Ideally, we want to be able to construct the neighborhood of V* without having to construct all of G.
By the random construction of G, it is easy to construct neighborhoods on the fly. However, our goal is to conserve the number of random bits needed, and so we must look for a deterministic method. We therefore need a deterministic graph that behaves like a randomly constructed one. We now bring our discussion back to Expanders.
We do the same graph-based method for choosing R as before, but now we let G be an Expander rather than a randomly constructed graph. Our method is as follows:
- Draw V* uniformly at random from V.
- Let R be the d neighbors of V*.
- Return the majority value of A(x,v) over v ∈ R.
We now show that the probability (over V*) of this majority being wrong is bounded. Specifically,
Pr[majority is wrong] ≤ 12λ²/d²
We begin by making some definitions:
- N = 2^m = |V|
- T ⊆ V is the set of vertices r for which A(x,r) ≠ f(x) (the bad random strings).
- S ⊆ V is the set of vertices v for which the majority vote is wrong when V* = v.
We then make the following observations:
- Because of A's probability of error, |T| ≤ N/3.
- By the definition of S, every vertex v ∈ S has at least d/2 neighbors in T: E(S,T) ≥ |S|·d/2.
- Because V* is drawn uniformly at random from V, Pr[majority is wrong] = |S|/N.
- The Expander Mixing Lemma: E(S,T) ≤ d·|S|·|T|/N + λ·√(|S|·|T|).
From algebra we can then see that |S|·d/2 ≤ d·|S|·|T|/N + λ·√(|S|·|T|) ≤ d·|S|/3 + λ·√(|S|·N/3), so |S|·d/6 ≤ λ·√(|S|·N/3), which gives |S| ≤ 12λ²N/d² and hence an error probability of |S|/N ≤ 12λ²/d².
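In code, the whole amplifier is one random vertex plus a majority vote over its neighborhood (a sketch; `A` is any 0/1 base algorithm, and `neighbors` maps each vertex of the expander to a list of its d neighbors):

```python
import random

def amplify_with_neighbors(A, x, neighbors):
    # One m-bit random choice: a uniform vertex of the expander.
    v_star = random.choice(list(neighbors))
    # Deterministically run A on all d neighbors and take the majority.
    votes = [A(x, v) for v in neighbors[v_star]]
    return sum(votes) > len(votes) / 2
```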
October 30, 2008
In the previous class, using the properties of a d-regular expander graph, we saw that we can amplify the success probability of a BPP algorithm to 1 − O(λ²/d²) using only m bits of randomness. Specifically, that technique selects a random vertex (using m bits of randomness) and uses all of its d neighbors for amplification. In this class, we will see a different technique: instead of considering all the neighbors, we start from the initial random vertex and walk randomly for t steps, picking up all the vertices traversed. The t vertices traversed by this walk are used for amplification. We will show that this reduces the error probability exponentially in t while using only an extra O(t) random bits.
Notations
An (n,d,α)-graph is an undirected d-regular graph G with n vertices; let λ_0 ≥ λ_1 ≥ ... ≥ λ_{n−1} be the eigenvalues of its adjacency matrix A. Since the rows and columns of A sum to d, λ_0 = d, and we require max(|λ_1|, |λ_{n−1}|) ≤ αd for the number α.
The normalized adjacency matrix Â of G is defined to be Â = A/d. If the eigenvalues of A are the λ_i, then the eigenvalues of Â are the λ_i/d, so the largest is 1 and all the others are at most α in absolute value.
Let X be a random vertex in G with probability vector p, and let Y be a uniformly chosen neighbor of X; then the probability vector of Y is Âp.
Random Walk on Expander Graph
In an (n,d,α)-graph G there is a large set of good vertices (satisfying some condition), and we want to pick a small sample of t vertices that contains at least one good vertex. Let B ⊆ V be the set of bad vertices, with density β = |B|/n. If we choose v_1, ..., v_t ∈ V independently and uniformly at random, then Pr[all v_i ∈ B] = β^t. This approach uses t·log n random bits. We will show that by choosing one vertex randomly and then performing a random walk in G of length t, the probability of seeing only bad vertices is still exponentially small, while using far fewer random bits.
Let G be an (n,d,α)-graph and B ⊆ V with density β = |B|/|V|. Choose X_0 ∈ V uniformly at random and let X_0, X_1, ..., X_t be a random walk on G starting at X_0. Denote by (B,t) the event that the random walk is confined to B, i.e., X_i ∈ B for all 0 ≤ i ≤ t.
Theorem: Pr[(B,t)] ≤ (β + α)^t
Let P = P_B be the diagonal matrix that projects onto the coordinates in B, i.e., P_{ii} = 1 if i ∈ B and P_{ii} = 0 otherwise. Basically, P confines a distribution vector to the set B.
Lemma: Pr[(B,t)] = ‖(PÂ)^t P u‖_1, where u is the uniform distribution vector.
Lemma: ‖PÂPz‖ ≤ (β + α)·‖z‖ for every vector z.
From both these lemmas we can prove the theorem.
Amplifying the success probability
We will now see how this theorem can be used to reduce the error probability without increasing the number of random bits by a large factor.
RP
Remember that if L ∈ RP, then some randomized algorithm A has the property that, for a random string r of length m:
- If x ∈ L, then Pr_r[A(x,r) = 1] ≥ 1/2.
- If x ∉ L, then A(x,r) = 0 for every r.
Build an (n,d,α)-graph with V = {0,1}^m (all possible random strings). For some input x, let B be the set of all coin tosses for which A(x) is wrong. Let A' be the following algorithm:
- pick a vertex v_0 ∈ V uniformly at random
- perform a random walk of length t, resulting in the set of vertices v_0, v_1, ..., v_t
- return the OR of the values A(x, v_i)
From the previous theorem, Pr[A' errs on x ∈ L] = Pr[(B,t)] ≤ (β + α)^t.
The error is reduced exponentially while the number of random bits used is only m + t·log d = m + O(t).
BPP
Let L ∈ BPP and assume a randomized algorithm A decides, using a random string of length m, whether x ∈ L with a two-sided error probability of β = 1/10. Let A' be the following algorithm:
- pick a vertex v_0 ∈ V uniformly at random
- perform a random walk of length t, resulting in the set of vertices v_0, v_1, ..., v_t
- return the majority value of the A(x, v_i)
A' fails only if a majority of the v_i's lie in B. Fix a set of indices K ⊆ {0, ..., t} with |K| ≥ t/2; by (the subset version of) the theorem above,
Pr[v_i ∈ B for all i ∈ K] ≤ (β + α)^{|K|} ≤ (β + α)^{t/2}
Assume that α is small enough that β + α ≤ 1/8. Taking a union bound over the at most 2^t choices of K,
Pr[A' fails] ≤ 2^t · (β + α)^{t/2} ≤ 2^t · (1/8)^{t/2} = (1/2)^{t/2}
The error probability has been reduced exponentially using only m + O(t) random bits.
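Both variants are easy to sketch (adjacency-list representation, names ours; `mode` selects OR for RP and majority for BPP):

```python
import random

def walk_sample(neighbors, t):
    # One uniform start vertex (log2 n bits), then t uniform neighbor
    # steps (log2 d bits each): m + O(t) random bits in total.
    v = random.choice(list(neighbors))
    walk = [v]
    for _ in range(t):
        v = random.choice(neighbors[v])
        walk.append(v)
    return walk

def amplified(A, x, neighbors, t, mode="RP"):
    answers = [A(x, v) for v in walk_sample(neighbors, t)]
    if mode == "RP":
        return any(answers)                 # one-sided error: OR
    return sum(answers) > len(answers) / 2  # two-sided error: majority
```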
Two-Point Sampling
Consider a different technique to reduce the error probability of RP algorithms. The algorithm A picks a random number r from Z_n (the ring of integers modulo n) for a suitable choice of prime n. Assume that for x ∈ L there are at least n/2 values of r for which A(x,r) = 1.
The algorithm can independently choose t random numbers r_1, ..., r_t, compute the values A(x,r_i), and return their OR. By the independence assumption, the probability of incorrectly classifying an input x ∈ L by declaring that it is not in L is at most 2^{−t}. But this uses t·log n random bits.
Consider a different algorithm A':
- sample only two values a and b from Z_n.
- for i = 1, ..., t let r_i = a·i + b (mod n) and evaluate A(x, r_i)
- return the OR of the values A(x, r_i)
Since the r_i's are pairwise independent, for x ∈ L the sum Y = Σ_i A(x,r_i) has E[Y] ≥ t/2 and Var[Y] ≤ t/4. The probability that the pairwise-independent iterations produce an incorrect classification is therefore Pr[Y = 0] ≤ Pr[|Y − E[Y]| ≥ E[Y]] ≤ Var[Y]/E[Y]² ≤ 1/t, by Chebyshev's inequality.
Thus, by using only 2·log n random bits, the error probability is reduced to at most 1/t, which is a considerable improvement over the error bound of 1/4 achieved by the naive use of a and b as two independent samples.
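A sketch of the sampler (n must be prime for the pairwise-independence argument, and we need t < n; names ours):

```python
import random

def two_point_sample(n, t):
    # t pairwise-independent elements of Z_n from two random seeds a, b.
    a, b = random.randrange(n), random.randrange(n)
    return [(a * i + b) % n for i in range(1, t + 1)]

def amplified_rp(A, x, n, t):
    # For x in L, Chebyshev gives error <= 1/t using only 2*log2(n) bits.
    return any(A(x, r) for r in two_point_sample(n, t))
```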
Nov 4, 2008
The High Level Picture
We will define pseudo-randomness in roughly this way:
If it's too hard to tell, with sufficient accuracy, whether bits came from a real random source or a deterministic function, then we'll call that function a Pseudo Random Number Generator (PRNG).
We formalize "too hard to tell" in terms of the size S of circuit that distinguishes between the two, and "sufficient accuracy" with ε, the chance that we get the answer right.
Circuits
Recall that we measure the size of a circuit by counting the number of AND and OR gates it contains (NOT gates are not counted). The size of the circuit required to compute a function is a reasonable measure of the complexity of that function.
For the purposes of this lecture, we will only consider circuits that have n one-bit inputs and a single output bit. Such a circuit can always be constructed with O(2^n) gates: consider that for any function f of n inputs, we can "factor out" the first input, leaving us with a constant-sized circuit and two functions of n − 1 inputs.
The other piece we require is the ability to understand how circuit size relates to something we are more familiar with: complexity in terms of runtime. Using the previous result, we can show that for any Turing machine that always terminates within time t(n), we can construct a circuit that has size O(t(n)²).
Pseudorandom Generators
We start with the notion of indistinguishability: intuitively, whether we can identify the source of a series of bits. Formally, two random variables X and Y are (S,ε)-indistinguishable if:
| Pr[C(X) = 1] − Pr[C(Y) = 1] | ≤ ε
... for any circuit C of size at most S.
That is, the probability that our circuit returns 1 is the same (within ε) for values drawn from either distribution. Note that this must hold for every circuit, so we are saying that there does not exist a circuit (of size S or smaller) that can distinguish them. Also note that while two instances of the same distribution (e.g., the uniform distribution over the same range) meet this criterion, two different distributions (e.g., a uniform and a Gaussian) generally will not.
Now, we can define a pseudorandom generator very easily: it is simply indistinguishable from the uniform distribution:
A random variable X is called pseudorandom if it is (S,ε)-indistinguishable from the uniform distribution over {0,1}^n.
More specifically, we consider the case of a BMY (Blum-Micali-Yao) generator. Such a generator takes in a string of n bits and outputs a string of l(n) bits, where l(n) > n; l is called the stretch of the generator. The generator must be computable in poly time (in order to be practical), and when passed a string of n bits drawn from the uniform distribution, its output is pseudorandom with S = p(n) and ε = 1/q(n) (where p and q are any polynomials).
Basically: a BMY generator is one that takes as input a string of truly random bits, and returns a longer string of bits that "look" random.
There are a few important things to notice here: First, it's really quite hard to get good random bits on a computer. What a BMY-type generator lets us do is take a small number of random bits and turn them into a larger number of reasonably good random bits. Second, since this generator is deterministic, and the size of the output is greater than the size of the input, not every output string is possible. Thus, it should be possible to memorize every possible output of the generator, and use that to distinguish generator output from truly random output. However, it's not possible to do this with a poly-size circuit.
Another important thing is that BMY generators can only exist if P ≠ NP. Because the generating function G() is computable in poly time, given an output y we can nondeterministically guess the seed x used as input to the function and verify it; this is clearly an NP computation. If we could do it in poly time, we could distinguish the output of the generator from truly random strings in poly time, and thus break the BMY conditions.
One Way Functions
The idea of a one way function is simple: it's a function whose input cannot be reconstructed from its output. We will tie the existence of one way functions (specifically, one way permutations) to the existence of PRNGs, and during this lecture or the next, show how to produce a PRNG using, in part, a one-way permutation.
A one-way permutation is a specific type of one-way function: it takes an n-bit string as input, and returns an n-bit string as output. Furthermore, it is a bijection; that is, every input results in a different output, and every possible output is used.
Formally, a bijective function f: {0,1}^n → {0,1}^n is an (S,ε) one-way permutation if for every circuit C of size at most S,
Pr[C(f(x)) = x] ≤ ε
In other words, no circuit (of size less than or equal to S) can find x from f(x) with significant probability. In the asymptotic version of this definition, we take S to be any polynomial in n and ε to be any inverse polynomial in n.
Finally, we define "hard-core predicates" for permutations. Simply, a hard-core predicate is a function B with a one-bit output that, for some permutation f, can be computed from x but not (reliably) from f(x). Formally, B is (S,ε) hard-core for f if, for every circuit C of size at most S:
Pr[C(f(x)) = B(x)] ≤ 1/2 + ε
Note that the 1/2 is in this equation because the output of B is one bit, and therefore we have a one-half chance of simply guessing it.
It's easy to see that if a hard-core predicate exists for a permutation, then that permutation is one-way: if the permutation were not one way, we could break the hard-core predicate by inverting the permutation.
Building a PRNG
We now show that, if a one-way permutation exists, and a hard-core predicate exists for that permutation, a PRNG exists. We do so by showing how to build a PRNG from these two things. (We have already shown that if a hard-core predicate exists, the permutation is one-way.)
Recall that a BMY PRNG must have stretch; that is, it must output more bits than were provided as input. The one-way permutation outputs exactly as many bits as were input, but we can use the hard-core function to get one additional bit, giving us a tiny stretch of n + 1. We simply concatenate the bit strings from the permutation p(x) and its related hard-core function B(x):
G(x) = p(x),B(x)
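As a toy illustration (emphatically not a secure construction: the parameters are tiny, and we merely borrow the classic Blum-Micali candidate, where the permutation is discrete exponentiation and the hard-core predicate asks whether x lies in the lower half of the domain):

```python
def is_generator(g, p):
    # Brute-force check that g generates Z_p^* (toy sizes only).
    return len({pow(g, k, p) for k in range(p - 1)}) == p - 1

P = 211                                   # illustrative toy prime
GEN = next(h for h in range(2, P) if is_generator(h, P))

def perm(x):
    # x -> GEN^x mod P is a bijection on {1, ..., P-1}.
    return pow(GEN, x, P)

def hardcore_bit(x):
    # Blum-Micali predicate: is x in the lower half of the domain?
    return 1 if x < (P - 1) / 2 else 0

def G_prng(x):
    # G(x) = p(x), B(x): turns the input into one extra "pseudorandom" bit.
    return perm(x), hardcore_bit(x)

print(G_prng(5))
```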
Of course, we must prove that the resulting function is in fact (S,ε) pseudorandom. We do so using a proof by contradiction; if there were a distinguisher function that could distinguish the final bit (contributed by B(x)) from a random one, this would imply the existence of a function that could compute the hard-core predicate from the permutation, which we know cannot exist.
Our proof uses the following algorithm:
Input: z (= p(x)). Pick b ∈ {0,1} uniformly at random. If D(z,b) = 1 then output b, else output 1 − b.
... where D is a distinguisher circuit of size S whose job is to distinguish (p(x), B(x)) from (p(x), r) (where r is a randomly chosen bit). Let's call the output of this algorithm A_b(p(x)), for some choice of b. What we want to find is the value of Pr_{x,b}[A_b(p(x)) = B(x)]; if this value is at least 1/2 + ε, we have our contradiction.
One thing we know is that:
Pr_{x,b}[A_b(p(x)) = B(x)] = (1/2)·Pr[A_b(p(x)) = B(x) | b = B(x)] + (1/2)·Pr[A_b(p(x)) = B(x) | b ≠ B(x)]
... this is simply averaging over the two (equally likely) cases where b = B(x) and b ≠ B(x).
We also know that:
Pr[A_b(p(x)) = B(x) | b = B(x)] = Pr[D(p(x), B(x)) = 1], and Pr[A_b(p(x)) = B(x) | b ≠ B(x)] = Pr[D(p(x), 1 − B(x)) = 0]
... from the definition of the algorithm.
Nov 6, 2008
Previously...
What we know:
one way permutation + hard core predicate -> pseudorandom generator
How do we find such a hard-core predicate?
Answer: take a one-way permutation p and define a new one-way permutation g(x,y) = (p(x), y) (clearly one-way). Let B(x,y) = ⟨x,y⟩, the inner product mod 2.
Theorem (Goldreich-Levin): B is a hard-core predicate for g. Specifically, no efficient predictor, given (p(x), r), can guess ⟨x,r⟩ with probability more than 1/2 + eps/2.
The Goldreich-Levin construction of hard-core predicates
Proof idea. Suppose B is not hard-core. Then there is a predictor which, given (p(x), r), outputs a guess B_x(r) with Pr_r[B_x(r) = ⟨x,r⟩] > 1/2 + eps/2. We will show how to reconstruct x (with reasonable probability), thus inverting g (which is not possible).
Now let's think of B_x(y) as a function that's trying to approximate the "linear" function f_x(r) = ⟨x,r⟩. The above statement says that B_x is reasonably good at doing this. Our goal is to "learn" the coefficients of the linear function and therefore learn x.
Warmup I
- Suppose eps = 1 (i.e., B_x is perfect). Then invoke B_x(e_i) for each standard basis vector e_i and extract each coefficient x_i = ⟨x, e_i⟩.
Warmup II
Suppose eps < 1 but is large (B_x is nearly perfect). Will the above work? No, because all the e_i could lie in the bad part of B_x (where it is wrong). So let's sample randomly, since we know that on random points B_x looks like f_x.
Trick: by linearity, ⟨x,z⟩ = ⟨x, z + r⟩ − ⟨x, r⟩
Algorithm: for input z, pick r at random and compute B_x(z + r) − B_x(r). Claim: this gives ⟨x,z⟩. Proof: it fails only if z + r or r lies in the "bad set" for B_x, which happens with probability at most 2·(1/2 − eps/2) = 1 − eps; call this delta. Now run the above algorithm on each e_i. All n runs succeed with probability at least 1 − n·delta, so we need delta < 1/2n, which means eps > 1 − 1/2n. Not great.
So the main problem now is to find this "almost exact" routine that serves as a proxy for f().
Main proof
New idea: one of our problems was the uncertainty in our oracle: we're using B_x as a proxy for f_x(), and this only goes so far. Let's pretend that we can actually compute f_x on a few points. In particular, assume that for some randomly chosen strings r_1, ..., r_m we can compute exactly b_i = ⟨x, r_i⟩.
Now we compute ⟨x,z⟩ as follows. For each nonempty subset S of [m], sum up the corresponding r_i's, yielding r_S = Σ_{i ∈ S} r_i. Now do the shift trick:
⟨x,z⟩ = ⟨x, z + r_S⟩ − ⟨x, r_S⟩
Note that by linearity, the last term is merely Σ_{i ∈ S} b_i. Now we replace ⟨x, z + r_S⟩ by the oracle value B_x(z + r_S), yielding the estimator B_x(z + r_S) − Σ_{i ∈ S} b_i.
Do this for all S and take the majority value.
Finally, run this algorithm on e_i as before.
Why does it work ?
IDEA 1: the r_S are pairwise independent, which means Chebyshev!! Let X_S be the indicator variable denoting that subset S gives the right answer. Then Pr[X_S = 1] ≥ (1 + eps)/2 and Var[X_S] ≤ (1 − eps²)/4. For X = Σ_S X_S over the M subsets, E[X] ≥ M(1 + eps)/2 and Var[X] ≤ M(1 − eps²)/4 ≤ M/4.
Since the X_S are pairwise independent, we can use Chebyshev:
Pr[|X − μ| > A] < Var(X)/A²
The majority vote fails only if X ≤ M/2, i.e., |X − μ| ≥ M·eps/2, so take A = M·eps/2. Solving, we get the probability of bad stuff as at most 1/(M·eps²).
This means that the overall probability of failure, over all n coordinates, is n/(M·eps²) < 1/2,
so we need M = 2n/eps².
IDEA 2: But how do we get these magic b_i? Well, we need only m = log M of the strings r_i, so we guess all possible choices of (b_1, ..., b_m): only 2^m = M = 2n/eps² guesses are needed!! For the correct guess the procedure recovers x, and each candidate x can be checked by evaluating g.
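The estimator is concrete enough to run. In this sketch we hand the algorithm the true values b_i directly (the real algorithm enumerates all 2^m guesses and verifies each candidate x against g); also, for simplicity our oracle flips a fresh coin on every call, whereas in the real setting B_x is a fixed function and the pairwise independence of the r_S is what makes the majority vote work:

```python
import math, random

n, eps = 8, 0.5
random.seed(1)
x = [random.randint(0, 1) for _ in range(n)]        # the hidden string

def ip(u, v):                                       # inner product mod 2
    return sum(a & b for a, b in zip(u, v)) % 2

def B(r):                                           # agrees with <x,r> w.p. (1+eps)/2
    bit = ip(x, r)
    return bit if random.random() < (1 + eps) / 2 else 1 - bit

M = math.ceil(2 * n / eps ** 2)                     # number of subset estimates
m = math.ceil(math.log2(M + 1))                     # base strings r_1 .. r_m
rs = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
bs = [ip(x, r) for r in rs]                         # the "magic" b_i, assumed known

def recover_bit(j):
    e_j = [1 if k == j else 0 for k in range(n)]
    votes = 0
    for s in range(1, 2 ** m):                      # all nonempty S subset of [m]
        idx = [i for i in range(m) if (s >> i) & 1]
        r_s = [sum(rs[i][k] for i in idx) % 2 for k in range(n)]
        b_s = sum(bs[i] for i in idx) % 2           # <x, r_S> by linearity
        votes += B([a ^ b for a, b in zip(e_j, r_s)]) ^ b_s
    return int(votes > (2 ** m - 1) / 2)            # majority over the estimates

print("recovered:", [recover_bit(j) for j in range(n)])
print("hidden x :", x)
```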
Nov 11, 2008
In today's lecture, we will see some of the inequalities that can be used in the analysis of randomized algorithms. We will start our discussion with a review of the Markov inequality, Chebyshev's inequality, the Chernoff bound, and Hoeffding's inequality.
A good and concise description of all of these bounds except the Chernoff bound is given on the wiki, so I will explain only the Chernoff bound here; for the rest, I refer to the links provided above.
Chernoff bound
The Chernoff bound applies to a class of random variables and gives exponential fall-off of probability with distance from the mean. The critical condition that's needed for a Chernoff bound is that the random variable be a sum of independent indicator random variables. The Chernoff bound is stated in terms of two tails, an upper tail and a lower tail.
Upper tail: Let X_1, ..., X_n be independent random variables taking on values 0 or 1, and assume that Pr[X_i = 1] = p_i. Then, if we let X = Σ_i X_i and let μ be the expectation of X, for any δ > 0:
Pr[X > (1 + δ)μ] < ( e^δ / (1 + δ)^{1+δ} )^μ
By a more complicated argument, which we won't give here, you can show that for δ < 2e − 1, the Chernoff bound simplifies to:
Pr[X > (1 + δ)μ] < e^{−μδ²/4}
Lower tail: the lower tail of the Chernoff bound is given as:
Pr[X < (1 − δ)μ] < ( e^{−δ} / (1 − δ)^{1−δ} )^μ
This bound is quite good, but can be clumsy to compute. We can simplify it to a weaker bound, which is:
Pr[X < (1 − δ)μ] < e^{−μδ²/2}
In the analysis of many randomized algorithms, we come across variables that are not independent. Note that the Chernoff bound (and the Hoeffding bound) are only applicable when all the random variables are independent. Below we extend the idea of the Chernoff bound to variables that are not independent, but are dependent in a specific way (negatively dependent).
Negative dependence
Intuitively, negative dependence can be described as follows: given a set of random variables, if a subset of the random variables takes "high" values, then a disjoint subset of the variables will take "low" values. In a very loose sense, this says that the sum of the values the random variables take remains roughly constant. Formally:
Let X_1, ..., X_n be a vector of random variables. The random variables are negatively associated if for every two disjoint index sets I, J ⊆ [n] and all non-decreasing functions f and g,
E[ f(X_i, i ∈ I) · g(X_j, j ∈ J) ] ≤ E[ f(X_i, i ∈ I) ] · E[ g(X_j, j ∈ J) ]
Once we have established the notion of negative dependence, we can carry it through the proof of the Chernoff bound and get a bound that is suitable for negatively dependent variables. We will now show two properties that help in establishing negative association among random variables.
Closure under Products
If {X_i} and {Y_j} are two mutually independent families of random variables that are separately negatively associated, then the combined family {X_i} ∪ {Y_j} is also negatively associated.
Disjoint Monotone Aggregation
If X_1, ..., X_n are negatively associated, and I_1, I_2, ... is a family of disjoint subsets of [n], then the random variables Y_k = f_k(X_i, i ∈ I_k) are also negatively associated, where the f_k are arbitrary non-decreasing functions.
We will show an example from "balls and bins" experiments (see the sketch below) of how this behavior arises and how the bound can be used in practice.
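Concretely, when m balls are thrown independently into n bins, the bin loads are negatively associated: a heavy bin forces the others to be lighter. A small simulation (ours) showing the negative covariance between two bin loads:

```python
import random

n_bins, n_balls, trials = 10, 50, 20000
loads0, loads1 = [], []
for _ in range(trials):
    counts = [0] * n_bins
    for _ in range(n_balls):
        counts[random.randrange(n_bins)] += 1
    loads0.append(counts[0]); loads1.append(counts[1])

m0 = sum(loads0) / trials
m1 = sum(loads1) / trials
cov = sum(a * b for a, b in zip(loads0, loads1)) / trials - m0 * m1
print("Cov(load_0, load_1) =", cov)   # negative: about -n_balls/n_bins^2 = -0.5
```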
We will now discuss another inequality that will be useful in the analysis of randomized algorithms.
Janson's Inequality
Given a collection of underlying independent random choices, consider events A_1, ..., A_n, where each A_i depends on a subset of those choices; if the subsets for A_i and A_j intersect, we write A_i ~ A_j (such events can be dependent). Janson's inequality is then stated in the following way:
Let A_1, ..., A_n be as above, let I_i be the indicator of A_i, and let X = Σ_i I_i. Define
- μ = E[X] = Σ_i E[I_i],
- Δ = Σ_{i ~ j, i ≠ j} E[I_i · I_j], where the sum is over "ordered pairs". Then for any 0 ≤ t ≤ μ,
Pr[X ≤ μ − t] ≤ exp( −t² / (2(μ + Δ)) )
Nov 18, 2008
Using biased random variables in our algorithms, we can prove that the result has certain good properties, but only in a probabilistic sense; on any one particular run, nothing is guaranteed. Can we instead obtain a comparably good result in a single run, deterministically, by being careful with the variables, while still keeping the time complexity tractable (polynomial)?
In this class, we illustrate such a deterministic method using the approximation of an integer program. First, we introduce the lattice approximation problem and the randomized rounding process used to solve it. Then, we prove the existence of an integer approximation within a certain error bound, and analyze the whole probability space using the method of conditional probabilities. After that, we derive an efficient deterministic algorithm by finding a proper pessimistic estimator.
The Lattice Approximation Problem
Input: an n × r matrix C in which c_{ij} ∈ [0,1] for all i,j, and an r-vector p, where each p_j is a real number in [0,1].
Goal: an integer (0/1) r-vector q that approximates p "well", in the sense that every coordinate of C(p − q) is small in absolute value.
The randomized rounding process: set each q_j to 1 with probability p_j, independently of all the other components of q. For the random variable
ψ_i = Σ_j c_{ij} q_j,
the expectation is
s_i = E[ψ_i] = Σ_j c_{ij} p_j,
and we can rewrite the error in coordinate i as
- Δ_i = |ψ_i − s_i|.
Generalizing, every ψ_i above is a weighted sum of Bernoulli trials: let a_1, ..., a_r be reals in [0,1], and let X_1, ..., X_r be independent Bernoulli trials with E[X_j] = p_j. We get the general form for the variable:
ψ = Σ_j a_j X_j,
and for the expectation:
E[ψ] = Σ_j a_j p_j.
Using the Markov inequality and some small tricks, we can derive Chernoff-type tail bounds for ψ:
THEOREM 1. Let δ > 0 and m = E[ψ] ≥ 0. Then Pr[ψ > m(1 + δ)] < ( e^δ / (1 + δ)^{1+δ} )^m,
and
THEOREM 2. Pr[ψ < m(1 − δ)] < e^{−mδ²/2}.
Now we define two notations, B(m,δ) and D(m,x), which we will use afterwards for convenience. The bound:
- B(m,δ) = ( e^δ / (1 + δ)^{1+δ} )^m.
The deviation (error rate) D(m,x) is the value of δ that makes the bound equal to x:
- B(m, D(m,x)) = x.
Based on the above bounds, we have:
THEOREM 3. There exists an integer approximation vector q such that, for every i,
- Δ_i ≤ s_i·D(s_i, 1/2n),
where n is the number of rows of the matrix C.
Proof: According to the above definition of D(m,x),
- Pr[ψ_i > s_i + s_i·D(s_i, 1/2n)] < 1/2n,
- Pr[ψ_i < s_i − s_i·D(s_i, 1/2n)] < 1/2n.
So the probability of the bad event β_i (coordinate i deviating too much) is < 1/n, and the probability that any of the n bad events β_i occurs is < 1. Thus an integer approximation vector q satisfying all n conditions exists with non-zero probability.
The method of conditional probabilities
We now know that an integer approximation q to p satisfying the n conditions in Theorem 3 exists with non-zero probability. If we could see the whole probability space, we would see that some outcomes of q are good while others are bad. The question is how to find one good outcome for q deterministically, without enumerating all possible outcomes.
First we need to organize all the outcomes for q and analyze them; the method used here is called the method of conditional probabilities. Using this method, we model the construction of the r-vector q by a complete binary decision tree T of r levels, where level j represents the setting of q_j to 0 or 1. The tree has 2^r leaves, corresponding to all the possible r-vectors q, so we know some leaves are good and some are bad.
So finding a good q is equivalent to finding a path down the tree to a good leaf.
Define P_j = Pr[q is bad | q_1, ..., q_j] to be the conditional probability of ending at a bad vector q given the settings of q_1, ..., q_j, assuming that randomized rounding is used to compute q_{j+1}, ..., q_r. Then
- P_j = p_{j+1}·P_{j+1}(q_{j+1} = 1) + (1 − p_{j+1})·P_{j+1}(q_{j+1} = 0);
since the coefficients p_{j+1} and 1 − p_{j+1} sum to 1,
- P_j ≥ min( P_{j+1}(q_{j+1} = 1), P_{j+1}(q_{j+1} = 0) ).
Now we can use a greedy algorithm: for j = 1 to r, at level j, set q_j to 0 or 1 so as to minimize P_j. According to Theorem 3, P_0 < 1, so we get
- 1 > P_0 ≥ P_1 ≥ ... ≥ P_r = P(leaf).
At a leaf, P(leaf) is exactly 0 or 1, so using this greedy algorithm we will end at a good leaf.
Deterministic Approximation using pessimistic estimator
The problem with the above greedy algorithm is that we need to compute the exact conditional probabilities P_{j+1}(q_{j+1} = 0) and P_{j+1}(q_{j+1} = 1) to decide how to set each coordinate, and this exact computation needs exponential time. So at every level, instead of using P_j, we use an upper bound U_j ≥ P_j, the pessimistic estimator. To ensure that the greedy algorithm still finds a good leaf, the pessimistic estimator should satisfy the following properties:
At every level,
(a) U is an upper bound on the function P;
(b) U never rises in the process (the smaller child is at most the parent);
and in addition U must start below 1 and be computable in polynomial time.
For the lattice approximation problem, following the existence proof before, we can choose U to be the sum, over all n rows, of the Chernoff-bound estimates of the lower- and upper-tail probabilities of the ψ_i crossing the thresholds
- L_i+ = s_i[1 + D(s_i, 1/2n)]
- L_i− = s_i[1 − D(s_i, 1/2n)]
During the computation, we can use the parameters
- t_i = ln[1 + D(s_i, 1/2n)]
which were used when deriving the tail bound for ψ.
Now we check property (b) for U when setting q_1: the value of U before q_1 is set equals the expectation of U over the random choice of q_1, i.e., a convex combination of its two children, so the smaller child is at most the parent. A similar analysis at every level shows that (b) is satisfied by U.
Besides, the update rule at the k-th level is: in the exponential-moment terms, replace each factor E[e^{t_i c_{ik} q_k}] by e^{t_i c_{ik}} when setting q_k to 1, or by 1 when setting q_k to 0.
Analysis of U shows that it can be computed in polynomial time, so the greedy algorithm can now be completed in polynomial time too.
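The whole derandomization fits in a few lines. Below is a compact sketch with a simplified estimator of our own: a single fixed parameter t for every row instead of the optimized t_i above, and a crude deviation threshold Δ; by Markov, U upper-bounds the probability that some row deviates by more than Δ, and the greedy choice never increases it:

```python
import math
import numpy as np

def lattice_round(C, p, t=1.0):
    n, r = C.shape
    s = C @ p                                        # the target values s_i
    Delta = math.sqrt(2 * max(s) * math.log(2 * n))  # crude threshold (unoptimized)
    q = np.zeros(r)

    def U(j):
        # Pessimistic estimator with q_1..q_j fixed: for each row and each
        # tail, exp(fixed contribution so far) times the product of the
        # per-coordinate MGFs of the still-random coordinates, divided by
        # exp(t * Delta). By Markov this bounds Pr[some |psi_i - s_i| > Delta].
        total = 0.0
        for i in range(n):
            for sign in (+1.0, -1.0):
                val = math.exp(sign * t * (C[i, :j] @ q[:j] - s[i]))
                for k in range(j, r):
                    val *= p[k] * math.exp(sign * t * C[i, k]) + (1 - p[k])
                total += val / math.exp(t * Delta)
        return total

    for j in range(r):                               # greedy descent of the tree
        q[j] = 1.0
        u1 = U(j + 1)
        q[j] = 0.0
        if u1 < U(j + 1):
            q[j] = 1.0
    return q, Delta

rng = np.random.default_rng(0)
C, p = rng.random((5, 12)), rng.random(12)
q, Delta = lattice_round(C, p)
print("max deviation:", np.abs(C @ (q - p)).max(), " threshold:", Delta)
```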
Nov 17, 2008
Game Trees
Let's consider a simple game with two players, C and R. The players take turns making moves, and there is an overall score kept for the game. Player C goes first, and has the goal of maximizing this score, and R has the goal of minimizing this score.
We can think of the possible game space as comprising a d-ary tree of depth 2k: d is the number of choices available at each point during the game (note that we assume the number of choices is the same for both players over the entire game), and k is the number of moves each player gets, so each player moves on half the levels. The leaf nodes of this tree (of which there are d^{2k}) are labeled with "values": real numbers indicating the outcome of the game.
To evaluate a game tree, we set the values of interior nodes according to the values of their children, and find the value of the root. Internal nodes at even depth (the root is at depth 0, where the maximizing player moves) are MAX nodes, whose values are set to the maximum value of their children, and those at odd depth are MIN nodes, set to the minimum.
In this formulation, we must read the values of all d^{2k} leaves to evaluate a tree, since we must evaluate all subtrees of an internal node in order to find its MIN or MAX.
Let's consider a simpler version of this tree: a T_{2,k} game, in which each internal node has two children. Further, let's say that the only possible values are 0 and 1. This turns the MAX nodes into OR nodes (1 if either child is 1), and the MIN nodes into AND nodes (0 if either child is 0).
In this formulation, it may be possible to read only some of the leaves: while evaluating an AND node, if the first child we read is a 0, we can avoid reading the subtree rooted at the second child. Keep in mind that, for an AND node, if we are forced to evaluate both children and the AND node ends up returning a 1, the parent OR node now has at least one child with a 1, and we can thus skip its other child if this was the child we read first.
In the presence of an adversary, however, any deterministic algorithm must read the entire tree: since the adversary knows the order in which children will be evaluated, it can arrange things so that the first child of every AND node is 1, and the first child of every OR node is 0. Thus, we have to read all 2^{2k} = 4^k leaves in the worst case.
If we randomize the order in which we read the children, the adversary cannot intentionally "hide" a 0 or a 1 from us, and thus over all possible inputs, the expected number of children we have to evaluate for each node is 3/2 rather than 2. We can show by induction that this means the expected number of leaf nodes we must investigate is 3^k.
We take n to be the number of leaves in the tree, which is 4^k; thus our randomized algorithm has an expected running time of n^{0.793} (since 3^k = (4^k)^{log_4 3}), as compared to n for the deterministic algorithm. We now move on to investigate whether it is possible to create a better randomized algorithm for this problem, by using Yao's technique to find a lower bound for the runtime of any randomized algorithm on this problem.
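A sketch of the randomized evaluation, instrumented to count the leaves it reads (random 0/1 leaves rather than adversarial ones, so the count is typically well below the 3^k worst-case expectation):

```python
import random

def eval_tree(node, depth, counter):
    # node is a 0/1 leaf or a pair [left, right];
    # even depths are OR (MAX) nodes, odd depths are AND (MIN) nodes.
    if node in (0, 1):
        counter[0] += 1
        return node
    kids = node[:]
    random.shuffle(kids)                 # randomize the evaluation order
    is_or = depth % 2 == 0
    v = eval_tree(kids[0], depth + 1, counter)
    if (is_or and v == 1) or (not is_or and v == 0):
        return v                         # short-circuit: skip the second subtree
    return eval_tree(kids[1], depth + 1, counter)

def random_tree(levels):
    if levels == 0:
        return random.randint(0, 1)
    return [random_tree(levels - 1), random_tree(levels - 1)]

k = 6
tree = random_tree(2 * k)                # 4^k leaves
counter = [0]
eval_tree(tree, 0, counter)
print("leaves read:", counter[0], "of", 4 ** k, "; 3^k =", 3 ** k)
```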
Minimax
We now interest ourselves with the question of whether any randomized algorithm can have a lower expected running time. To do so, we use Yao's minimax principle.
Let's switch gears to a different type of game, that from game theory. Again, we have a two-player game, and again, one player is trying to maximize the payout and the other is trying to minimize. This time, however, the game is described as a simple matrix M, with the rows corresponding to the choice made by player R, the maximizing player, and the columns corresponding to choices made by the minimizing player C. The entries of the matrix correspond to the payout.
Because the amount paid by one player is equal to the amount received by the other, this is called a zero sum game. We will consider zero-information games, in which neither player knows the other player's strategy.
There are two kinds of strategies the players can take: in a pure strategy, the row player always picks the same particular row, and the column player always picks the same column. In a mixed strategy, the row (or column) picked by a player is a random variable with some distribution over the possible choices. (It's easy to see that a pure strategy is a special case of a mixed strategy.)
An optimal pure strategy for the maximizing player is the choice of a row that maximizes the minimum value the player can receive; that is, given that we have picked a row, we assume the other player will pick the column that is least advantageous to us. A similar argument applies to the minimizing player.
It's easy to see that, for pure strategies on a payoff matrix M, with i and j indexing rows and columns:
max_i min_j M_{ij} ≤ min_j max_i M_{ij}
That is, V_R = max_i min_j M_{ij}, the best payoff the maximizing player can guarantee himself, is no greater than V_C = min_j max_i M_{ij}, the smallest payout the minimizing player can guarantee having to make. Put another way,
V_R ≤ V_C
This does not say that the maximizing player always wins (or loses)! Recall the maximizing player wants V_R as high as possible, and the minimizing player wants V_C as small as possible. So, this is simply saying that the best payoff the maximizing player can guarantee himself is less than or equal to (as good as or worse than) the best payout the minimizing player can guarantee paying.
If V_R = V_C, then the game is said to have a solution, with a value V.
Mixed Strategies and Von Neumann's Minimax
Let's say that the players are allowed to use probability distributions over their choices (p for R, and q for C), instead of having to stick with a single choice. The expected value of the payoff is now:
E[payoff] = p^T M q = Σ_i Σ_j p_i M_{ij} q_j
... i.e., just every entry of the payoff matrix, weighted by the probability that it is chosen.
Von Neumann's Minimax Theorem tells us that, for mixed strategies, every two-player zero-sum, zero-information game has a solution:
max_p min_q p^T M q = min_q max_p p^T M q
This is rather surprising; for pure strategies, this is an inequality, but when we add randomness, it becomes an equality. We will not prove this theorem, but consider the following example.
Think of rock-paper-scissors, in which there are three possible outcomes: player A wins, player B wins, or there is a tie. Assign the payouts 1, −1, and 0 to these outcomes. In the pure case, V_R = −1: whatever row the maximizing player selects, there is a response that beats it. Similarly, V_C = 1. Thus, we have V_R < V_C.
Now, let's say each player selects uniformly at random. Since this is a "fair" game, and all outcomes are equally likely, each player's expected payoff is a tie, and thus we have V = V_R = V_C = 0.
Yao's Minimax
Let's apply this to the design of randomized algorithms. Let's say that the row (maximizing) player is the adversary providing us inputs, and we are the column (minimizing) player. Each row corresponds to some input that the adversary can select, and each column corresponds to some deterministic algorithm we can run. The entries in the matrix refer to some value, say, time, that the algorithm takes to run on the input. The adversary's goal is to maximize this quantity, and we want to minimize it.
A pure strategy for the adversary consists of a "bad" string (the hardest for every algorithm), and for us, consists of a "good" deterministic algorithm (the one with the best lower bound).
A mixed strategy for us corresponds to a randomized algorithm: we select from among some set of algorithms with some probability distribution. Though we talk about selecting from a set of deterministic algorithms, we could also think of this as selecting the random bits for one (or more) randomized algorithms.
Let's define the distributional complexity of a problem as the expected running time of the best deterministic algorithm for the worst distribution of inputs. As we'll see, this is a lower bound on the worst-case expected runtime of any randomized algorithm.
We can re-write Von Neumann's minimax theorem as:
max_p min_q E[C(I_p, A_q)] = min_q max_p E[C(I_p, A_q)]
... for random input I_p chosen according to distribution p, random algorithm A_q chosen according to distribution q, and where C(I,A) is the runtime of algorithm A on input I. Alternately, since the inner optimum can always be attained by a pure strategy,
max_p min_A E[C(I_p, A)] = min_q max_I E[C(I, A_q)]
... where A ranges over the space of all deterministic algorithms we're selecting from (which must be finite), and I ranges over the input space.
These lead to Yao's minimax principle: for any input distribution p and any algorithm distribution q,
min_A E[C(I_p, A)] ≤ max_I E[C(I, A_q)]
The left side of this inequality is the best expected runtime of any deterministic algorithm on input distribution p: thus, it is the best we can hope to do as algorithm designers if we know the input distribution. The right side is the worst-case expected runtime of the randomized algorithm A_q, the best the adversary can force upon us if he knows the distribution of algorithms we are using. So, in effect, this says that we can pick any input distribution, evaluate the expected running time of the best deterministic algorithm against it, and this will provide us with a lower bound: when we design a randomized algorithm, assuming an adversary, we can do no better than this.
Note that the quality of our lower bound is only as good as the p that we picked; though it will give a lower bound, it might not be a tight lower bound. To get the tightest possible lower bound, we have to pick the worst possible p, not an easy proposition!
Randomness and Non-Uniformity
Re-visiting circuits for a moment: recall that the circuits we have dealt with so far have a fixed set of inputs and compute deterministic functions. Let's add some additional inputs that come not from the problem instance, but instead take random values. We'll consider randomized circuits with one-sided error: that is, if the true function would return 0, the circuit must return 0 regardless of the random inputs, and if the true function would return 1, the circuit should return 1 with probability greater than one half. Call a random string on which the circuit returns 1 (it must then be correct) a witness for that input.
Over all inputs for which the function returns 1 and all strings of random bits, we know from the definition that at least half of the time, the circuit returns a 1. Thus, there is at least one random string that serves as a witness to at least half of the inputs. Let's build a circuit that "hard wires" this random string. Removing all inputs that we've just covered, we repeat the process, creating another circuit, and so on, n times. We OR together all of these circuits; we have created a single circuit that is no more than n times bigger than the original; if the original was poly-size, so is our new circuit. This shows us that we can construct an equivalent circuit without using randomness.
Note that we have not shown that we can do away with randomness entirely in polynomial-time computation: our circuit only works for a particular input length n. A circuit for a function taking n + 1 bits as input might look completely different. This is formalized in the notion of uniformity.
We say that a function takes advice if it uses a read-only string of bits to decide whether a given string is part of a language or not; this advice string is a function of the length of the input string. We can think of our procedure above as generating this advice for a particular length. A uniform function is one that does not need advice, and a non-uniform function is one that needs it.
Nov 25, 2008
Random Graphs I
The theory of random graphs draws heavily upon graph theory + probability theory.
Loosely speaking, a random graph can be obtained by starting with a set of n vertices and randomly adding edges between pairs of vertices. There are several ways a random graph can be defined. One model is the following: given a positive integer n and a number p ∈ [0,1], the random graph G(n,p) is the distribution over graphs on the vertex set {1, ..., n} in which each of the C(n,2) possible edges is present independently with probability p. Another model of random graphs is parametrized by the total number of edges e and is denoted G(n,e); this model assigns equal probability to all graphs with exactly e edges. It turns out that G(n,e) has properties similar to G(n,p) with p = e / C(n,2). In this lecture, we will follow the probability-based model only. The random graph model can be used to prove the existence of certain graphs; however, in this lecture, we will only be looking at specific properties of random graphs.
One can also view a random graph as a time-varying phenomenon. Consider, for example, that we associate a potential x_{i,j}, drawn uniformly from [0,1], with each pair of vertices i < j. Now take a light source and increase its intensity p gradually from 0 to 1. The edge between i and j is turned on when p reaches x_{i,j}, and stays on thereafter. At any given "time" instant p, the graph defined by all turned-on edges is exactly G(n,p). In the extreme case (p = 1), all the edges will be on.
Random Graphs and Graph Theoretic Properties
Since random graphs are defined in terms of probabilities, we can determine the probability that a given random graph satisfies a certain graph theoretic property (say, "the graph is triangle-free", or "the clique number is k", etc.). Given a random graph G(n,p) and some graph theoretic property A, we write Pr[G(n,p) satisfies A] to denote precisely this probability. We shall look at the specific example of triangles, as this property shows up in many places, such as in the analysis of web graphs.
It turns out that for the triangle case (and for many other natural graph theoretic properties), there is only a narrow range of probabilities where the property is satisfied. This has been made formal by the following definition:
Definition: r(n) is called a threshold function for a graph theoretic property A if:
1. When p(n)/r(n) → 0, Pr[G(n,p) satisfies A] → 0.
2. When p(n)/r(n) → ∞, Pr[G(n,p) satisfies A] → 1.
Note: p(n) in the above definition is the edge probability as a function of n; r(n), similarly, is the threshold function, which is again a function of n.
In the triangle example, as we shall see, 1 / n turns out to be the threshold function. Also note that the threshold function for a given graph theoretic property isn't necessarily unique. For example, in the triangle case, even 10 / n would be a threshold function since it satisfies the definition above.
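A quick simulation of the threshold behavior at p = c/n (illustrative parameters; the jump from "almost never" to "almost surely" is already visible at this small scale):

```python
import random
from itertools import combinations

def has_triangle(n, p):
    adj = [[False] * n for _ in range(n)]
    for u, v in combinations(range(n), 2):
        adj[u][v] = adj[v][u] = random.random() < p
    return any(adj[u][v] and adj[v][w] and adj[u][w]
               for u, v, w in combinations(range(n), 3))

n, trials = 40, 200
for c in (0.1, 1.0, 3.0, 10.0):
    hits = sum(has_triangle(n, c / n) for _ in range(trials))
    print(f"p = {c}/n: Pr[triangle] ~ {hits / trials:.2f}")
```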
More to come...
Branching Processes:
The Giant Component:
Phase Transition:
Dec 2, 2008
Random Graphs II
Coming soon...