Algorithms Seminar/Fall07
Approximate High Dimensional Data Analysis (Fall 2007). Fridays 10:40-12:00, WEB 1460
Talks
| Date | Paper(s) | Presenter |
|---|---|---|
| 08/31/07 | Randomized algorithms and tail bound analysis, a whirlwind tour of exact computational geometry, explanation of the data stream model, and some background on approximation algorithms (Michel Goemans lecture notes) | Suresh Venkatasubramanian |
| Random Sampling | | |
| 09/07/07 | Sublinear algorithms in metric spaces (also see Two algorithms for nearest neighbour searching) | Aaron Wood |
| 09/14/07 | Approximate clustering via core sets (BHI02) (focus on the k-median sections) | Andrew Nelson |
| 09/21/07 | Approximate clustering via core sets (BHI02) (focus on the k-median sections) (continued) | Andrew Nelson |
| 09/28/07 | Linear time algorithms for clustering in any dimension (and a followup) | Amit Goyal |
| Core Sets | | |
| 10/05/07 | The core-set construction from (BHI02), further improvements by Badoiu/Clarkson, I, II | Avishek Saha |
| 10/19/07 | Clarkson's meta-paper on core set constructions (not all of it: talk to me first) | Piyush Rai |
| 11/02/07 | Core-sets via grids: Section 2,3,6 of the survey | Hal Daume III |
| Stream Algorithms | | |
| 11/13/07 | Basic sketching technology: AMS, Flajolet-Martin. Lower bounds via communication complexity | Nathan Gilbert |
| 11/16/07 | norm estimation | Amit Goyal |
| 11/30/07 | Count sketches | Arvind Agarwal |
| Compressed Sensing | | |
| 12/07/07 | Compressed sensing and the Orthogonal Matching Pursuit algorithm | Jeff Blanchard |
Summaries
Background: 08/31
Both the data streams survey and Michel Goemans' lecture notes are rather long, and I don't expect you to read either document in its entirety. My purpose in adding these two documents is to provide a reference for longer-term use: for tomorrow's seminar, all you really need is the first few pages of the Goemans notes and the first 13 pages of the data streams survey.
This survey by Agarwal and Sharir is dated, but Sections 7 and 8 provide a good overview of the results on clustering.
Sampling I: 09/07
Suresh: There is much to read in these two papers, but what we should focus on is the following sampling idea: in order to approximate a quantity defined over a set of points, we merely sample a small set of points at random and use this sample to estimate the quantity. This idea is used in the Kleinberg paper to quickly check which of two points is closer to a query point, and in the Indyk paper to estimate sums of distances quickly.
In both cases, we rely on either the triangle inequality or deeper geometric structure to make the desired argument.
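As a toy illustration of this sampling principle (not the algorithm from either paper), here is a minimal sketch that estimates the average pairwise distance of a point set from a small random sample of pairs; the sample size s is an arbitrary illustrative choice.

```python
# Toy illustration of the sampling idea: estimate the average pairwise
# distance of a point set from a small random sample of pairs.
# The sample size s is an illustrative choice, not from either paper.
import numpy as np

def avg_pairwise_distance_estimate(P, s=500, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(P)
    i = rng.integers(0, n, size=s)   # random pairs of indices
    j = rng.integers(0, n, size=s)
    return np.linalg.norm(P[i] - P[j], axis=1).mean()

P = np.random.default_rng(1).normal(size=(2000, 10))
print(avg_pairwise_distance_estimate(P))   # close to the true average for modest s
```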
Arvind: We talked about the problem of finding the k nearest neighbours of a query point in a high dimensional space. There has been a good amount of work on algorithms for this problem in both the approximate and exact settings. Most of these algorithms achieve a complexity that is logarithmic in n but exponential in the dimension d. This paper presents two approximate algorithms that achieve much better complexity. The first algorithm has a preprocessing complexity of O((n log d)^(2d)) and a query time of O((d log^2 d)(d + log n)), while the second algorithm achieves near-linear preprocessing time and a query time that improves asymptotically on brute-force search in all dimensions.
Here we talk about the first algorithm. In this algorithm, query processing is deterministic while preprocessing involves random sampling. The basic idea is to choose a set of L vectors uniformly at random from the unit sphere S^(d-1), project all points onto these randomly chosen one-dimensional directions, and then use Lemma 2.1 of the paper to find the nearest neighbours. Lemma 2.1 says, roughly, that if two vectors are projected onto a randomly chosen direction, then with probability strictly greater than 1/2 the shorter vector has the smaller projection. This means that a point that is closer to the query in the original high dimensional space will also, with high probability, win comparisons computed using its projections onto the random vectors (of which there are far fewer than d) rather than distances computed in R^d. The algorithm uses this lemma to compare the distances of two points from the query point.
Preprocessing: Here are the main steps of the preprocessing stage.
- First, choose a set of L vectors v_1, ..., v_L uniformly at random from the unit sphere S^(d-1).
- For each vector v_l, project the centers of all pairs onto this vector (compute v_l · p_ij, where p_ij is the center of the pair (i, j)) and sort the resulting values. Let's call this sorted list S_l.
- Once we have S_l for each random vector, we select the realizable traces from these lists. A trace σ = (σ_1, ..., σ_L) is a sequence of primitive intervals σ_l (an interval between two adjacent points in the sorted list S_l). A trace is realizable if there exists a point q in R^d such that, for every l, v_l · q lies in the primitive interval σ_l of S_l.
- Now, for each realizable trace, build a complete digraph with a directed edge from i to j > i if i σ-dominates j in more than half of the sorted lists, and an edge from j to i otherwise. Finally, construct an apex ordering and store the apex orderings together with their corresponding realizable traces.
Query processing: Once we have preprocessed the data, finding the k nearest neighbours of a query point q is fairly simple. For every random vector v_l, we do a binary search to insert the value v_l · q into the sorted list S_l. Let σ_l be the primitive interval into which this value falls; from these we obtain a trace σ = (σ_1, ..., σ_L) over the L vectors. Next, we look up the apex ordering of the digraph corresponding to σ and return its first k entries, which are the k nearest neighbours.
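To make the projection-comparison idea concrete, here is a minimal sketch (not Kleinberg's actual data structure, and with an illustrative choice of L) of the primitive behind Lemma 2.1: deciding which of two points is closer to a query by a majority vote over random one-dimensional projections.

```python
# Sketch of the comparison primitive behind Lemma 2.1 (illustrative only):
# decide which of two points is closer to q by majority vote over
# projections onto L random unit directions.
import numpy as np

def closer_by_projection(p1, p2, q, L=101, rng=None):
    """Return 1 if p1 appears closer to q than p2 does, else 2."""
    rng = np.random.default_rng() if rng is None else rng
    votes_p1 = 0
    for _ in range(L):
        v = rng.normal(size=len(q))
        v /= np.linalg.norm(v)                 # uniform direction on the unit sphere
        # the shorter of (p1 - q), (p2 - q) tends to have the smaller projection
        if abs(v @ (p1 - q)) < abs(v @ (p2 - q)):
            votes_p1 += 1
    return 1 if votes_p1 > L / 2 else 2

rng = np.random.default_rng(0)
q = rng.normal(size=50)
p1, p2 = q + 0.5 * rng.normal(size=50), q + 1.0 * rng.normal(size=50)
print(closer_by_projection(p1, p2, q, rng=rng))   # usually prints 1
```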
Remarks:
- Here are some notes on VC-dimension
Sampling II: 09/14
Aaron: In the seminar we discussed an approximation algorithm for the k-median problem given in Approximate clustering via core sets by Badoiu, Har-Peled, and Indyk. The k-median problem is the following: for a set P of n points in R^d, find the subset K of R^d of k points that minimizes the sum, over the points p in P, of dist(p, K), where the distance from a point to a set is defined to be the minimum distance from that point to some point in the set. In what follows, med-opt(P,k) will denote the minimal cost of the k-median problem, i.e., the minimal sum described above, and AvgMed(P,k) = med-opt(P,k)/|P| will denote the average distance from K to a point in P. The idea is to first consider the 1-median problem and then use the results in an iterative process.
In the 1-median problem, we are interested in finding the point c such that the sum of distances from c to the points in P is minimized. The first result proved is that, with constant probability, for a random sample X of O((1/ε^3) log(1/ε)) points of P, the flat span(X) contains a (1+ε)-approximation to c. This result is proved by partitioning the sampled set X into rounds and considering the projection c_i of c onto the flat F_i spanned by the points in the first i rounds. The partitioning is done in such a way as to guarantee that the distance between c and c_i decreases as i increases; for example, the first round of sampling ends when we find a point that is within distance 2·AvgMed(P,1) of c. A probabilistic counting argument shows that the number of points sampled only needs to be of the order given above. Now that we know that span(X) contains a good approximation x to the 1-median, the next step is to have some method to find it, or rather to approximate the approximation. The idea is to put an exponential grid on span(X) and identify each point in the span with the grid vertex closest to it. The grid is chosen so that the points don't "move" too far in this identification. Now the desired point in the span of X is close to a vertex of the grid, and this vertex is the desired approximate 1-median. Note that this procedure of constructing a grid and "snapping" the points to the grid relies on Gonzalez's k-center algorithm.
- Description of Gonzalez's k-center algorithm
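For reference, here is a minimal sketch of Gonzalez's farthest-point algorithm (the standard 2-approximation for k-center): repeatedly add the point farthest from the centers chosen so far.

```python
# Sketch of Gonzalez's farthest-point algorithm for k-center (2-approximation):
# repeatedly pick the point farthest from the current set of centers.
import numpy as np

def gonzalez_k_center(P, k, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    centers = [P[rng.integers(len(P))]]             # arbitrary first center
    dist = np.linalg.norm(P - centers[0], axis=1)   # distance to nearest chosen center
    for _ in range(k - 1):
        far = int(np.argmax(dist))                  # farthest point so far
        centers.append(P[far])
        dist = np.minimum(dist, np.linalg.norm(P - P[far], axis=1))
    return np.array(centers)

P = np.random.default_rng(2).normal(size=(1000, 5))
print(gonzalez_k_center(P, 4).shape)   # (4, 5)
```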
Next we look at the 2-median problem, assuming that the problem is irreducible; that is, assuming that the point set P doesn't "nicely" break up into two sets with two different 1-medians. (Otherwise, the approximation to the 1-median problem can serve as an approximation to the 2-median problem.) Under this assumption there are two cases. First, if the components are comparable in size, then we exhaustively search through partitions of P until we find the "nice" partition, and then run the 1-median algorithm on each component to get the solution. Second, we consider the case where one of the components is very much larger than the other, say P = P1 ∪ P2 with |P1| >> |P2|. By repeated sampling of P, with high probability we end up with only points that are in P1. We run the 1-median algorithm on P1 to get one of the desired medians, say c1'. We then define a set P1' to be the points in P that are within some fixed distance of c1'; that is, P1' is the set of points for which c1' is the 1-median. This fixed distance is found by taking advantage of the required distance between the two medians given by the irreducibility condition. The claim here is that "most" of P1 is in P1'; that is, there are very few points that are not within this distance of c1', and hence they will not affect the approximation obtained by running the 1-median algorithm on P − P1'. Since P − P1' mostly consists of points that are in P2, the point c2' obtained by running the 1-median algorithm on this set is a good approximation to the 1-median of P2. We use these two points, {c1', c2'}, to approximate the solution to the 2-median problem.
To solve the k-median problem, we use the 2-median algorithm recursively; that is, we reduce the number of clusters by one at each stage and compute at most k reductions.
The 1-median, 2-median, and k-median algorithms all run in time near-linear in n and polynomial in d, with factors that depend exponentially on 1/ε (and, for k-median, on k); in particular, all of the running times are sub-exponential in n and d.
Broader view of k-means algorithm: 09/27
Amit:
For the past two weeks we have looked at the k-median problem for clustering. This time we are going to look at a different clustering objective, k-means. The algorithm is linear in the size of the input, i.e., in n (the number of points) and d (the number of dimensions). k and ε are treated as constants, which means the algorithm doesn't need to be linear in them :).
The first question that arises is: what's the difference between k-means and k-median?
You can consider k-median the more general problem and k-means a special case of it: in k-median you can minimize any distance (i.e., d(P,K)), whereas in k-means we minimize the sum of squared Euclidean distances.
The basic idea of the algorithm is very simple. It begins with the observation of Inaba et al. [11] that, given a set of points, their centroid can be approximated very well by sampling a constant number of points and taking the centroid of this sample. So if we knew the clusters formed by the optimal solution, we could get good approximations to the actual centers. Of course, we do not know them. However, if we sample O(k) points, we know we will get a constant number of points from the largest cluster. Thus, by trying all subsets of constant size from this sample, we can essentially sample points from the largest cluster, and in this way estimate the centers of the large clusters. However, in order to sample from the smaller clusters, we need to prune points from the larger clusters. This pruning has to balance two requirements: we would not like to remove points from the smaller clusters, and yet we want to remove enough points from the larger clusters. This is the same thing we did last week: the algorithm is very similar in spirit to that of Badoiu et al. [3], and both begin with the same premise of random sampling. However, in order to sample from the smaller clusters, the algorithm of Badoiu et al. has to guess the sizes of the smaller clusters and the distances between clusters, which costs an O((log n)^k) multiplicative factor in its running time. The present paper completely avoids this extra factor by a much more careful pruning step; how the pruning is done is the crux of the matter (i.e., how we avoid the (log n)^k factor), and is what we will talk about tomorrow.
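As a small illustration of the Inaba et al. observation the algorithm starts from, here is a minimal sketch (with an arbitrary illustrative sample size) showing that the centroid of a small random sample has nearly the same 1-mean cost as the true centroid.

```python
# Illustration of the Inaba et al. observation: the centroid of a small
# random sample approximates the true centroid well under the 1-mean
# (sum of squared distances) cost. The sample size 32 is illustrative.
import numpy as np

def one_mean_cost(P, c):
    return np.sum(np.linalg.norm(P - c, axis=1) ** 2)

rng = np.random.default_rng(3)
P = rng.normal(size=(10000, 20))
sample = P[rng.choice(len(P), size=32, replace=False)]   # constant-size sample

opt = one_mean_cost(P, P.mean(axis=0))                   # cost of the true centroid
approx = one_mean_cost(P, sample.mean(axis=0))           # cost of the sample centroid
print(approx / opt)                                      # typically very close to 1
```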
On coreset constructions for k-center clustering problem: 10/05
by Avishek
The following lines highlight the gist of each of the three papers:
BHI2002:
1a. constructs core sets of size O(1/ε^2)
1b. proposes an algorithm to construct the 1-center MEB with time complexity (TC) = O((1/ε)^10 log(1/ε))
1c. the algorithm proposed in (1b) extends to 2-center and k-center MEBs, with running times whose dependence on 1/ε (and on k) is exponential
An improvement over the previously known TC, but still EXPONENTIAL!!!
BC2004:
2a. uses the same algorithm as (1a) of BHI2002 to construct core sets of size = O(1 / ε),
Improvement over BHI2002!!!
2b. presents a gradient-descent algorithm for the 1-center MEB with TC = O(1/ε^5) (a minimal code sketch of this style of iteration appears after this list)
BC2006 (Sections 1-3):
3a. presents a lower bound of Ω(1/ε) on the coreset size
LOWER BOUND PROVED!!!
3b. presents an upper bound of O(1/ε) on the core-set size
UPPER BOUND PROVED, which makes Θ(1/ε) the optimal coreset size!!!
(N.B.: we will not cover (3b) as the mathematical details are really involved and not very intuitive)
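The sketch promised above: a minimal illustration of the Badoiu-Clarkson style iteration for an approximate 1-center (MEB), walking toward the current farthest point with a shrinking step. The ~1/ε^2 iteration count follows the analyses referenced above; everything else is an illustrative choice.

```python
# Sketch of the Badoiu-Clarkson style iteration for an approximate 1-center:
# step toward the farthest point with weight 1/(i+1) for ~1/eps^2 iterations.
import numpy as np

def approx_meb_center(P, eps=0.1):
    c = P[0].astype(float).copy()              # start at an arbitrary point
    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        far = P[np.argmax(np.linalg.norm(P - c, axis=1))]   # current farthest point
        c += (far - c) / (i + 1)               # shrinking step toward it
    return c

P = np.random.default_rng(4).normal(size=(5000, 30))
c = approx_meb_center(P, eps=0.1)
print(np.max(np.linalg.norm(P - c, axis=1)))   # approximate MEB radius
```

The farthest points visited along the way are, roughly, the core set the analyses talk about.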
Sparse Approximation: generalizations, variants, and special cases: 10/19
Piyush:
The problem of maximizing a concave function f(x) over a simplex S can be solved iteratively using a limited form of gradient ascent. Essentially, instead of moving in the direction of the gradient, the optimization can be done in the direction of the largest component of the gradient. The key property of this scheme is that it produces a "sparse" solution x, one with a very small number of non-zero entries. The convergence results show that these sparse solutions are provably good in an approximate sense. This iterative scheme and its variants have been used for various problems related to sparse approximation.
Clarkson's paper ties together various already known results, gives a formal definition of a concept based on solution sparsity (coresets), gives tighter results to improve previously known bounds, and also provides general versions of algorithms which are based on the notion of a sparse solution (e.g. minimum enclosing balls and support vector machines).
The paper builds on the Frank-Wolfe algorithm, which maximizes a concave function f() over a feasible polytope F by repeatedly maximizing a local linear approximation of f():

$f(x') \approx f(x) + (x' - x)^T \nabla f(x).$

For the special case where the polytope is the simplex S and the direction of the local linear approximation is a vertex e(i) of the simplex, the maximum value of the above approximation becomes

$\max_i \; f(x) + (e(i) - x)^T \nabla f(x),$

which is equal to

$f(x) + \max_i (\nabla f(x))_i - x^T \nabla f(x).$

This function of x is called the Wolfe dual, w(x), and the procedure above satisfies

$f(x^*) - f(x_k) \le \frac{4 C_f}{k+3},$

where x_k is the value of the k-th iterate and f(x*) is the value of the optimal solution to the primal problem.
The Wolfe dual problem to the original problem of maximizing f() over S can then be written as

$\min_x \; w(x) = f(x) + \max_i (\nabla f(x))_i - x^T \nabla f(x),$

and this minimization problem remains the same even if we restrict x to lie in S. Concavity of f() ensures that $w(x) \ge f(x)$ for every x, and the weak duality condition says that $\min_x w(x) \ge f(x^*)$.
The weak duality condition also helps in establishing the relationship between primal and dual approximations. Let h(x) = (f(x*) − f(x)) / 4C_f be the scaled primal gap measure and g(x) = (w(x) − f(x)) / 4C_f the dual gap measure. The analysis gives bounds on h(x_k) of the form O(1/k): one stated purely in terms of the iteration count, and one stated in terms of the dual gap g(x_k). The latter bound is always at least as good as the former (this follows from weak duality, which gives g(x) ≥ h(x)); when g(x_k) − h(x_k) is large, the bound is much better.
The paper also gives a probabilistic argument for the existence of sparse solutions to such problems and variations to the sparse greedy algorithm.
In essence, the basic idea the paper conveys is that whenever we have a maximization problem that can be transformed into a dual problem of this form, there exists a sparse solution to it. A variety of objectives, such as general quadratic functions and the objective functions of minimum enclosing balls, convex approximation, support vector machines, support vector regression, boosting, etc., can be written in such a dual form, leading to sparse solutions.
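To make the sparse greedy idea concrete, here is a minimal sketch of the Frank-Wolfe iteration over the simplex for one concrete concave objective, f(x) = −‖Ax − b‖²; the step size 2/(k+2) and the objective are illustrative choices, not the paper's exact setup.

```python
# Sketch of the sparse greedy / Frank-Wolfe iteration over the simplex:
# move toward the vertex that maximizes the linearization of f at x.
# The iterate is a convex combination of at most k+1 vertices after k steps.
import numpy as np

def frank_wolfe_simplex(grad_f, n, iters=30):
    x = np.zeros(n)
    x[0] = 1.0                          # start at a vertex of the simplex
    for k in range(iters):
        i = int(np.argmax(grad_f(x)))   # best vertex e(i) for the linear approximation
        step = 2.0 / (k + 2)
        x *= (1 - step)                 # x <- (1 - step) x + step e(i)
        x[i] += step
    return x

rng = np.random.default_rng(5)
A, b = rng.normal(size=(40, 100)), rng.normal(size=40)
grad = lambda x: -2 * A.T @ (A @ x - b)        # gradient of f(x) = -||Ax - b||^2
x = frank_wolfe_simplex(grad, 100)
print(np.count_nonzero(x), round(x.sum(), 6))  # sparse iterate, still on the simplex
```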
Core sets as grids (11/02)
Suresh:
In its original form, a core set can be thought of as a kind of grid. For example, let's say that we place a unit grid over a set of points in the plane. Suppose the grid has R cells in each dimension; then there are R^2 grid cells. We can snap each point of the input to the center of the grid cell that contains it (many points might snap to a single cell). In this way, we can reduce a point set of size n to a gridded point set of size at most R^2. Since R will often be a constant, or something only weakly dependent on n, we have reduced the point set size. Each point is displaced by at most √2/2 units (half the diagonal of a unit cell), so this is the error incurred.
The problem is that R, which is within a constant factor of the diameter of the point set, can be exponential in the input size. For example, a point could have coordinates (2^k, 2^k), which take only O(k) bits to write down, while the diameter is on the order of 2^k. So this gridding approach needs some tweaking before it can yield efficient approximation algorithms.
Let's say we want to approximate the minimum enclosing ball (MEB) of a point set. We know that the diameter Δ of the point set is at most twice the radius R* of the MEB. So we set the grid cell size to αΔ, which means that the total number of non-empty grid cells is at most O(1/α^2). Snap all points as before, and run your favorite MEB algorithm on the resulting point set. The running time of this procedure is n + f(1/α^2). Consider the answer returned: since every point is moved by at most O(αΔ), and Δ ≤ 2R*, we can choose α so that the error is at most εR*, and thus we obtain a (1+ε)-approximation to the MEB in time n + f(1/ε^2).
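A minimal sketch of this gridding step (assuming a point set small enough that the diameter can be computed by brute force; the MEB solver itself is left abstract):

```python
# Sketch of the gridding idea: snap points to cells of side alpha * diameter,
# keep one representative per non-empty cell, and hand the much smaller set
# to your favourite (expensive) MEB routine.
import numpy as np

def snap_to_grid(P, alpha):
    # brute-force diameter; fine for an illustration on small point sets
    diam = np.max(np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1))
    cell = alpha * diam
    cells = np.floor(P / cell)                         # cell index of each point
    return (np.unique(cells, axis=0) + 0.5) * cell     # one representative per cell

P = np.random.default_rng(6).normal(size=(500, 2))
Q = snap_to_grid(P, alpha=0.05)
print(len(P), "->", len(Q))   # far fewer points to feed the MEB solver
```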
In general though, we can't always hope that the quantity of interest can be related to the diameter, allowing us this simple gridding strategy. So we need something more general. For this, we define the notion of an approximate extent: we want to pick a subset of points Q from P such that for any direction u, the projection of Q along u has approximately the same diameter as the projection of P along u.
Why would such a property be useful? For many natural geometric optimization problems (diameter, width, bounding box), it can be shown that the only points that matter are those on the convex hull, and that what matters are the actual extents along directions. Thus, approximating P by Q can help. Formally, we'll call a measure faithful if the extent approximation is also a core set. Many standard geometric measures are faithful.
One obvious choice for Q is the convex hull of P. By construction, the convex hull preserves all extents exactly. However, the convex hull might consist of ALL of P, and so is not a small enough set! What this tells us, though, is that something that approximates extents is really approximating a convex polytope.
So given a point set, how do we compute approximate extents? Here's how the procedure looks (a minimal code sketch appears after the notes below):
- Make P "alpha-fat" (define alpha-fat using hypercubes)(This can be done via an affine transformation that preserves extent relationships.)
- Make a large grid; Key point here is the size of the grid = f(eps, d, alpha)
- Choose one point from highest and lowest cell in each column.
- Number of points in Q is at most size of boundary, which is (1 / ε * α)(d − 1)
Notes:
- a point set is α-fat if it can be enclosed within a hypercube AND a scaled copy of that hypercube can be placed inside it.
- Claim: Choosing points from highest and lowest cells approximates the extent. Why: each point in Q is close to a point in P and each extent in P is reasonably large (since P is fat)
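The promised sketch of the column construction, in the plane and assuming P is already α-fat (the fattening transformation and the exact grid resolution are simplified for illustration):

```python
# Sketch of the grid-column extent coreset in 2-d (assumes P is already fat):
# bucket points into columns of width ~ eps and keep the highest and lowest
# point of each column. Directional extents of Q approximate those of P.
import numpy as np

def extent_coreset_2d(P, eps=0.05):
    lo, hi = P[:, 0].min(), P[:, 0].max()
    cols = np.floor((P[:, 0] - lo) / (eps * (hi - lo) + 1e-12)).astype(int)
    Q = []
    for c in np.unique(cols):
        pts = P[cols == c]
        Q.append(pts[np.argmin(pts[:, 1])])   # lowest point in the column
        Q.append(pts[np.argmax(pts[:, 1])])   # highest point in the column
    return np.unique(np.array(Q), axis=0)

P = np.random.default_rng(7).normal(size=(5000, 2))
Q = extent_coreset_2d(P)
u = np.array([np.cos(0.7), np.sin(0.7)])      # an arbitrary direction
print(len(Q), np.ptp(P @ u), np.ptp(Q @ u))   # extents of Q track those of P
```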
An even better approach, which reduces the size of the core set to roughly its square root, is to do the gridding in dual space rather than in primal space. Specifically, build a grid on a sphere that encloses the point set, and return the nearest neighbours (in P) of the grid points.
Generalizations:
- dual space core sets: we can define an extent function on the envelope of a collection of lines; by duality, we can construct a core set by constructing a core set for the dual set of points
- Functions that are polynomials: the "linearization" trick allows us to map these functions to hyperplanes in a higher dimensional space, after which we can use the above technique.
Basic Sketching Algorithms for finding frequency moments. (11/13)
Nathan:
There exist problems from varied domains (database systems, sensor networks, satellite communications) where keeping an entire data set in memory is either impossible or too costly to be reasonable. This fact has motivated research on algorithms that can make inferences about characteristics of a given data set after a single pass through it, using only sub-linear space.
The specific problem we will encounter in this paper is bounding the space complexity of approximating the frequency moments F_k of a data stream. Recall that in the data stream model we are only allowed (or able) to look at the data once.
The frequency moments of a sequence $A = (a_1, a_2, \ldots, a_m)$ of elements of $\{1, \ldots, n\}$ are defined as

$F_k = \sum_{i=1}^{n} m_i^k$

for $k \ge 0$, where $m_i$ is the number of elements of A that have label i. Several values of k are used quite frequently in various applications, for example:
- F0 represents the number of distinct elements appearing in the sequence.
- F1 represents the length of the sequence: m
- F2, the Gini index of homogeneity, useful for various error-checking applications on the sequence.
- $F_\infty^* = \max_i m_i$, the number of occurrences of the most frequent label in the sequence.
For the sake of an example, given the sequence A = (1,1,3,5,1,2,0,0), the frequency moments are F0 = 5, F1 = 8, F2 = 16, and $F_\infty^* = 3$.
Previous Work: Flajolet & Martin devised a method for approximating F0 using only O(log n) bits.
Alon et al. devised methods for approximating all of the frequency moments $F_k$. Essentially, they defined a random variable that can be computed under the given space constraints, whose expected value is F_k and whose variance is as small as possible.
...then they looked to Chebyshev to save them...
Claim 1: For every $k \ge 1$, $\lambda > 0$ and $\varepsilon > 0$, there exists a randomized algorithm that, given a sequence $A = (a_1, \ldots, a_m)$ of members of $\{1, \ldots, n\}$, computes in one pass, using

$O\!\left(\frac{k \log(1/\varepsilon)}{\lambda^2}\, n^{1-1/k} \,(\log n + \log m)\right)$

memory bits, a number Y such that the probability that Y deviates from $F_k$ by more than $\lambda F_k$ is at most ε.
Proof (sketch):
Let $s_1 = \frac{8 k\, n^{1-1/k}}{\lambda^2}$ and $s_2 = 2\log(1/\varepsilon)$.
- Compute $s_2$ random variables $Y_1, \ldots, Y_{s_2}$ and output their median Y. Each $Y_i$ is the average of $s_1$ random variables $X_{ij}$, $1 \le j \le s_1$.
- Each of the X variables is computed as follows (using O(log n + log m) bits):
- choose a uniformly random index $p \in \{1, \ldots, m\}$;
- suppose $a_p = l$; let $r = |\{q : q \ge p,\ a_q = l\}|$, which is the number of occurrences (inclusive) of the label l in the remainder of the sequence A following $a_p$;
- X is then defined as $X = m\,(r^k - (r-1)^k)$.
From here we wish to calculate the expected value and variance of these random variables. As it turns out, $E[X] = F_k$, which is what we wanted.
The variance is a little trickier, but relies on the bound

$\mathrm{Var}(X) \le E[X^2] \le k\, F_1 F_{2k-1},$

which is used together with the definition of $Y_i$, the computation of E[X], and the property

$F_1 F_{2k-1} \le n^{1-1/k} F_k^2,$

to give the result

$\mathrm{Var}(Y_i) \le \frac{E[X^2]}{s_1} \le \frac{k\, n^{1-1/k} F_k^2}{s_1} \le \frac{\lambda^2 F_k^2}{8}.$
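A minimal, non-streaming sketch of the estimator described above (the random variable X, averaged and median-ed); in a true one-pass implementation the count r would be maintained online, and the parameters here are deliberately small:

```python
# Sketch of the AMS-style estimator: X = m (r^k - (r-1)^k) for a random
# position, averaged over s1 copies and median-ed over s2 groups.
# This version rescans the list for clarity; a real streaming version
# maintains r online in a single pass.
import numpy as np

def ams_estimate_Fk(A, k, s1=200, s2=9, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    m = len(A)
    def one_X():
        p = rng.integers(m)                          # uniformly random position
        label = A[p]
        r = sum(1 for a in A[p:] if a == label)      # occurrences of label from p onward
        return m * (r**k - (r - 1)**k)
    Ys = [np.mean([one_X() for _ in range(s1)]) for _ in range(s2)]
    return float(np.median(Ys))

A = [1, 1, 3, 5, 1, 2, 0, 0]
print(ams_estimate_Fk(A, k=2))   # should be close to F2 = 16
```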
Other useful information
Another, more recent, paper on today's topic: Simpler algorithm for estimating frequency moments of data streams
Compressed Sensing (12/07)
Jeff: There is a shift in perspective from previous topics to this one. Compressed sensing considers sampling to be the most costly portion of signal processing. For example, taking an MRI of a child requires the child to remain perfectly still for about 45 minutes (usually requiring medically unnecessary sedation). With this focus on cost, the algorithmic complexity of reconstruction or approximation is allowed to grow essentially unchecked. If one could prove that some algorithm would allow taking the MRI of the child in 45 seconds, someone would build a machine to process the collected data. Compressed sensing is a point of view stating that, with the correct sampling and reconstruction techniques, we can ignore the Nyquist rate theory.
Suppose s is a data set, say a signal, in dimension d. We find a representation system (a spanning set) $\{\varphi_i\}_{i=1}^{N}$ and some coefficients $b_i$ so that we can represent s in terms of this system: $s = \sum_{i=1}^{N} b_i \varphi_i$. This representation system is generally referred to as a dictionary. It is often the case that we can closely approximate s by a sparse set of coefficients: that is, there exists some representation requiring only $m \ll N$ non-zero coefficients, $s \approx \sum_{i \in \Lambda} b_i \varphi_i$ with $|\Lambda| = m$. Sometimes we assume that the signals we are receiving have only m non-zero entries, because we know the signal will be compressible. The standard example is a digital photo. You take your 7 megapixel photo, but email your mom a highly compressed, very accurate, sparse representation of only 256KB. Mom still sees what you intended her to see. The question is: why do we gather so much data (7MB) if most of the information lives on some sparse set (256KB)? Can we not simply take fewer samples and capture the important information? This is the study of compressed sensing, or compressive sampling.
I will give a compressed sensing overview. I'll introduce the main problem of replacing the combinatorial NP-hard problem with a linear programming problem. Donoho and Tanner have provided necessary and sufficient conditions for when this works, using high dimensional polytopes and the notion of neighborliness. If you are interested in this, you may read the papers listed below. This problem is usually considered from a measurement-matrix point of view, and multiple people have provided sufficient conditions, most notably the Restricted Isometry Property (RIP). If there is time at the end, I'll discuss RIP and how it helps guide algorithm development. There is nothing you should read for this; I'll be introducing the overarching ideas.
Knowing the audience, this will serve only as motivation for the algorithms used in compressed sensing. For the majority of the time, we'll discuss orthogonal matching pursuit and show how this algorithm will find the sparsest representation of a signal. I will present Joel Tropp's article on OMP. I suggest reading as much as you can, but certainly I; II A,B,C; IIIA; IVA. If you just want to see the algorithm, that's section IIC.
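For those who want a concrete picture before the talk, here is a minimal sketch of orthogonal matching pursuit (OMP): greedily pick the dictionary column most correlated with the current residual, then re-fit by least squares on the chosen columns. The matrix sizes and sparsity level below are illustrative.

```python
# Sketch of orthogonal matching pursuit (OMP): greedy column selection
# followed by a least-squares re-fit on the selected support.
import numpy as np

def omp(Phi, y, m):
    """Recover an m-sparse coefficient vector x with y ~ Phi @ x."""
    residual, support = y.copy(), []
    x = np.zeros(Phi.shape[1])
    for _ in range(m):
        j = int(np.argmax(np.abs(Phi.T @ residual)))      # most correlated column
        support.append(j)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef             # orthogonalized residual
    x[support] = coef
    return x

rng = np.random.default_rng(8)
Phi = rng.normal(size=(64, 256)) / np.sqrt(64)            # random measurement matrix
x_true = np.zeros(256)
x_true[rng.choice(256, 5, replace=False)] = rng.normal(size=5)
x_hat = omp(Phi, Phi @ x_true, m=5)
print(np.allclose(x_hat, x_true, atol=1e-8))              # usually True in this regime
```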
Papers
Some of the papers we might cover:
- Sanjeev Arora's PTAS for planar TSP and other problems and Joe Mitchell's guillotine method (the power of dynamic programming on grids). k-median clustering in the plane
- Shape approximations (diameter, width and Dudley's result): Barequet and Har-Peled, Chan, and more Chan.
- High dimensional approximate nearest neighbour algorithms by Clarkson, Kleinberg, Indyk and Motwani, and Kushilevitz, Ostrovsky and Rabani. Also see Clarkson's survey of nearest neighbour methods.
- Core sets: a survey, the Badoiu-Clarkson optimal core set for min enclosing ball, more core sets for k-median clustering, even more core sets for k-median clustering.
- Random sampling: Sublinear algorithms in metric spaces, Sublinear geometric algorithms, linear time algorithms for clustering in any dimension (and a followup)
Presentation slides for many of these papers can be found here
- Low-rank projections: The Random Projection Method is a good reference.
- Metric Embeddings: This course has all the reference material we need.
- Random projections: Random Projections of Smooth Manifolds combines ideas for sampling from manifolds with the Johnson-Lindenstrauss Lemma in order to do dimensionality reduction on manifolds.
- Compressed sensing: The definitive bibliography for all compressed sensing material. See also the blog posts by Igor Carron and Terry Tao. Martin Strauss has lecture notes at the recent MADALGO streaming summer school. (I (Jeff) suggest the following compressed sensing articles: for an overview Compressive Sampling by Candes or the more technical Compressed Sensing by Donoho; for a single sample from the algorithm literature Greed is Good by Tropp; on high dimensional geometric interpretations of compressed sensing (in order) DT1 and DT2 by Donoho and Tanner. If you find one of these areas interesting and want more, go to the definitive bibliography mentioned above.)
Synopsis
The "curse of dimensionality" is a phrase that acknowledges the inherent intractability of dealing with data in high dimensional spaces. In general, the "curse" manifests itself by penalizing an algorithm exponentially as the dimension of data increases. This renders most of geometric methods from low dimensional spaces ineffective, and makes data analysis in high dimensions a very challenging problem. As a consequence, much of the more recent research in high dimensional geometry has focused on the twin tools of randomization and approximation:
If we're willing to tolerate quantifiably inexact answers generated by algorithms with probabilistic (but provable) guarantees of success, can we circumvent the curse of dimensionality ?
In short, the answer is yes. Techniques developed over the last fifteen years have shown us how approximations, random projections, sampling, and other algorithmic tools can ease the pain of dealing with high dimensional data. All of these techniques yield provable guarantees on the worst-case running time and the quality of the solutions produced.
In this seminar, we will explore the range of methods that have been developed for managing the geometry of high dimensional data. Our focus will be on techniques and tricks rather than on problems, and part of the challenge of this seminar will be to distill out and clarify the main ideas that underlie algorithms in this area.