Dealing With Large Data
Pointers and notes for a talk at Google Kirkland, Mar 6, 2008.
Basic Premise
Dealing with large data means compromising on everything.
- We can't see everything.
- Even if we can see everything, we can't store everything.
- Even if we can store things, we can't compute precisely.
- Even if we can compute imprecisely, we might still need to be uncertain about our answers.
The two paradigms that are central to any "large data analysis" are approximations and randomization. They come together in the realm of data streaming.
Approximations
What do we do when a problem is hard to compute precisely ?
- Assume that the problem is "well structured". For example, independent set is NP-hard, but not on paths (or on trees): maybe your input comes from a specialized class of instances. There is a notion of a 'tree-like' graph, on which some problems are no longer NP-hard.
- Assume the input comes from some specialized distribution (Bayesian methods).
- Get as close to the optimal solution as possible.
This last notion is called an approximation algorithm.
Formal definition
A problem consists of instances. Each instance I has a set of feasible solutions σ(I). Each solution s in σ(I) has a cost associated with it; call this cost f(s).
An optimal solution to the instance is a solution whose cost is no larger than the cost of any other feasible solution (for minimization).
Let's fix an algorithm A. The job of the algorithm, given an instance, is to pick some feasible solution - ANY solution.
The performance of this algorithm is defined by its behaviour with respect to OPT. Suppose we denote A's cost on an instance I by A(I), and OPT's cost by OPT(I):
- Absolute approximation: A(I) - OPT(I) < c
- Relative approximation: A(I)/OPT(I) < c
Also, you can talk about absolute and asymptotic guarantees:
- Absolute: for all I, the guarantee holds
- Asymptotic: there exists some n such that for all n' > n and all instances I with |I| = n', the guarantee holds (i.e., "in the limit")
Examples
- Take Vertex cover. Our "algorithm" returns all vertices. This is an absolute approximation of n-1 (OPT might be as little as one vertex, if the graph is a star). It's also a relative approximation of n (same reasoning)
- Planar graph coloring: 3-coloring a planar graph is NP-complete. Check whether the graph can be 2-colored (using a simple breadth-first search), and if it can't, output a 5-coloring (not trivial, but doable in polynomial time: think about it !). This is an absolute approximation of 2, and a 5/3-relative approximation.
Absolute approximations are VERY HARD, and so we don't often use them. In the reading linked below, there's an example of how some problems can't be approximated absolutely AT ALL unless P = NP. Intuitively, this is because additive approximations are not "invariant to scale": if we increase the size of a graph, an absolute approximation starts looking better and better, to the point where we should be able to solve the problem exactly ! So we prefer to work with relative approximations in general (although we will often encounter both kinds)
Approximations make sense even for P-time problems, if we want to reduce the running time from a "bad" polynomial to something more reasonable. For example, suppose we wanted to determine the farthest pair of points in a metric space. A naive algorithm takes O(n^2) time (look at all pairs), and we can't really do any better (think of a metric space where one point has distance 2 to one other point, and all other distances are of length 1).
However, if we take any one point and find its farthest neighbor (which takes linear time) and double this value, we get a 2-approximation. This follows directly from the triangle inequality. So by sacrificing a factor of two in accuracy, we drop the running time from quadratic to linear.
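Concretely (a quick Python sketch; points are assumed to be coordinate tuples, and Euclidean distance stands in for an arbitrary metric):

```python
import math

def approx_diameter(points):
    """2-approximation to the farthest-pair distance: pick an arbitrary
    point, find its farthest neighbor in one linear scan, and double that
    distance.  By the triangle inequality the true farthest-pair distance
    D satisfies d <= D <= 2d, so the returned value 2d is within a factor
    2 of D.  Euclidean distance is used here purely for illustration."""
    anchor = points[0]
    d = max(math.dist(anchor, p) for p in points[1:])
    return 2 * d
```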
Approximate Binary Search (and counting)
Consider binary search on n items drawn from the range [1, n]. Binary search takes log n time. This is an easy problem. However, suppose we are willing to be approximate, in that we merely want to know that there's some item 'close' to the query (say with a value that's within a factor 2 of the query). Then we can do a lot better.
- Snap each item to the largest power of 2 below it.
- Do binary search on the snapped set, using a snapped version of the query.
There are only log n distinct values to worry about, and so binary search takes log log n time !! If we "find" an item, we know that the actual item is no more than a factor two away.
Suppose factor 2 is no good, and we really want a (1+ε)-approximation. The same argument applies (but with "snapping" to the nearest power of (1+ε)), and now the number of distinct items is log n / log(1+ε), which is roughly (1/ε) log n. Binary search on this yields a running time of log((1/ε) log n) = log log n + log(1/ε).
This causes some puzzlement: what happens when ε goes to zero ! The point is that with a running time that's exponentially better than in the exact case, we can get "reasonably" close to the right answer.
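Here's a small Python sketch of the snapping idea; the function names and the floating-point snapping are just illustrative choices:

```python
import bisect
import math

def snap(x, eps=1.0):
    """Snap x >= 1 down to the nearest power of (1 + eps) at or below it;
    with eps = 1 this is 'the largest power of 2 below x'.  Floating-point
    snapping keeps the sketch short; an integer version avoids precision
    issues."""
    base = 1.0 + eps
    return base ** math.floor(math.log(x, base))

def build_index(items, eps=1.0):
    """Keep only the distinct snapped values, sorted: roughly
    log(n)/log(1+eps) of them."""
    return sorted({snap(x, eps) for x in items})

def approx_contains(index, query, eps=1.0):
    """Binary search over the snapped set.  If this returns True, some
    stored item lies within a factor (1+eps) of the query, because the
    query and that item snapped to the same value."""
    q = snap(query, eps)
    i = bisect.bisect_left(index, q)
    return i < len(index) and index[i] == q
```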
This might seem like a toy problem, but it connects to a real application.
Suppose we want to count the number of objects streaming past us. We can count the number of items passing by using log n space. Using an idea sort of like the one above, we can actually use log log n space !
We all know that log n is good. If we're counting up to 2^32 IP addresses, for example, log 2^32 = 32 bits suffice. But log log 2^32 = 5 !
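Here is a sketch of the Morris-style counter mentioned in the reading below (simplest version, no variance reduction):

```python
import random

class MorrisCounter:
    """Morris-style approximate counter: store only an exponent c and, on
    each event, increment c with probability 2^-c.  Then 2^c - 1 is an
    unbiased estimate of the number of events, and c fits in about
    log log n bits.  A single counter is noisy; averaging several
    independent counters reduces the variance."""

    def __init__(self):
        self.c = 0   # roughly the base-2 logarithm of the count

    def increment(self):
        if random.random() < 2.0 ** (-self.c):
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1
```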
We won't always be this lucky: but the moral is that a problem that is expensive to solve exactly can often be solved much faster if we are willing to be approximate.
Reading
- A definitive text on approximation algorithms is Approximation Algorithms by Vijay Vazirani. Also see A compendium of NP-optimization problems for an encyclopaedic reference of NP-hard problems and what's known about approximating them.
- The idea of approximating diameter, and doing approximate binary search, is folklore. The trick for approximate counting is due to Morris, and is surveyed in this paper by Flajolet.
Randomization
The other pillar of large-data analysis is randomization: the idea that tossing coins when making decisions in our algorithms can actually help.
Randomization introduces uncertainty into algorithms, both in terms of how long it might take to run, and what kind of answer it gives.
A Monte Carlo algorithm will run for a deterministic amount of time, but might give a wrong answer with some probability. A Las Vegas algorithm always gives the correct answer, but its running time is a random variable whose expected value is bounded.
Waiting time lemma: if an event has a good outcome with probability p, the expected number of times you need to repeat the event before you get the good outcome is 1/p
If you can verify whether the output of a Monte Carlo algorithm is "good" (for example, whether a claimed graph coloring is actually legal), then you can use the above lemma to convert a Monte Carlo algorithm into a Las Vegas algorithm (run the algorithm, check if the answer is good, and repeat if not).
To convert Las Vegas to Monte Carlo, just stop at some predetermined time, and output what you have at that point. Then we have to figure out the probability of failure, which depends on the problem.
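In code, the conversion is just a loop (the two callables are placeholders for any Monte Carlo procedure and its verifier):

```python
import random

def las_vegas_from_monte_carlo(monte_carlo, is_valid):
    """Generic Monte Carlo -> Las Vegas conversion: rerun the randomized
    procedure until its output passes the verifier.  If a single run is
    good with probability p, the waiting time lemma says we expect 1/p
    runs.  Both arguments are placeholder callables."""
    while True:
        answer = monte_carlo()
        if is_valid(answer):
            return answer

# Toy usage: keep guessing until we hit a multiple of 3 (p is about 1/3).
print(las_vegas_from_monte_carlo(lambda: random.randrange(100),
                                 lambda x: x % 3 == 0))
```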
The main attraction of randomization is that the algorithms are relatively simple, and it's the analysis that's hard. This means that coding a randomized algorithm can sometimes be easier than coding a deterministic algorithm for the same problem.
Example: approximate medians
Computing a median is "easy": it takes deterministic linear time, although the standard deterministic algorithm has a reasonably large constant and can be inefficient in practice.
In many settings, we just want to use the median to split data roughly equally (for load balancing for example), and it's not important that we find the exact median, as long as it yields a reasonable balance. Let's say that the rank of an element is the number of items at most as large as it in the set. A median has rank n/2.
An approximate median is an element with rank r in the range [(1/2 - ε)n, (1/2 + ε)n]. How much work do we need to do to return such an element ? Only a CONSTANT amount of work is needed (specifically, sample O((1/ε²) log(1/δ)) elements at random and return the median of the sample, where δ is the allowed failure probability).
The proof works by noticing that one of the two bad cases occurs when at least half of the sampled elements have rank in the range [(1/2+ε)n, n], even though that range contains strictly less than half of the elements. This occurs with very small probability (think about tossing 50% tails with a coin biased towards heads).
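In code, the sampling approach looks something like this (the constants are illustrative, not the tight ones):

```python
import math
import random
import statistics

def approx_median(data, eps=0.05, delta=0.01):
    """Approximate median by random sampling: draw a small sample and
    return the sample median.  With sample size on the order of
    (1/eps^2) * log(1/delta), the result has rank in
    [(1/2 - eps)n, (1/2 + eps)n] except with probability delta."""
    k = min(len(data), math.ceil((1.0 / eps ** 2) * math.log(1.0 / delta)))
    sample = random.choices(data, k=k)   # sampling with replacement is fine
    return statistics.median(sample)
```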
Tail Bounds
In general, to analyze properties of randomized algorithms (how likely is it that we deviate from the expected value by more than a small amount ?), we need tail bounds: inequalities that bound the probability of deviating from the mean of a distribution, under stronger and stronger assumptions on the nature of the random variables. This set of notes has more details on this, and this set of notes explains more about the power of amplification via repetition.
Readings
Apart from the above notes, the two best sources on randomized algorithms are
- Randomized Algorithms by Motwani and Raghavan
- Probability and Computing, by Mitzenmacher and Upfal
Streaming
What is a streaming algorithm ? It's a Turing machine with limited (sublinear) memory that reads from a tape in one direction ONLY. It computes whatever it can from the data, and then spits out an answer. The streaming model captures the idea that we have a short memory, and can't remember everything that happened in the past (contrast this with online algorithms, later).
Frequency estimation: The most elementary problem one can consider in a stream setting is counting. We just saw the problem of counting in limited space. Suppose we wanted to determine the number of distinct elements that appear in a stream. It's easy to do this if we maintain a count for each distinct item we see, but that might be too many counters ! How can we determine distinct counts (approximately) without needing space proportional to the support size ?
Clustering: Another interesting class of streaming problems comes from computational geometry. The main motivation here is clustering: given a stream of (possibly high-dimensional) points, can we compute a clustering in small space ? Obviously, we can't store the cluster mapping (since we'll need one piece of information for each point), but we can hope to compute some kind of aggregate or summary consisting of the cluster centers.
Estimation: Given two vectors, can we estimate the distance between them ? Here, the main problem is that the vector coordinates might appear out of order, or the coordinates might get updated from time to time. This is a core problem that lies at the heart of many other stream computing problems.
Sketching: Ultimately, most stream algorithms reduce to finding some kind of sketch: a reduced representation of the data. This is often not based on sampling, but on projections of some kind. The important idea is that sketches live in a linear space, so you can combine sketches. Further, you want a norm on the sketches to preserve some property of the input (like the ℓ1 or ℓ2 difference between the streams). Other sketches use carefully designed hash functions and counters.
Warmup: Medians revisited
Here's an algorithm inspired by a streaming algorithm of Munro and Paterson:
Fix parameters b and k: b buffers, each of size k. Start reading data into the buffers. If all are full, merge two of them, and increment the "weight" of the merged buffer. Repeat until done.
If the total space bk is O((1/ε) log²(εn)), this gives an ε-approximate quantile (in the rank sense we described above). Using a later result of Greenwald and Khanna, we can even remove one log factor while maintaining a deterministic algorithm.
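Here is a toy rendering of the buffer-and-merge idea in Python; the carry-style merging and the weighting are a simplification, not the exact Munro-Paterson procedure, and the error analysis is omitted:

```python
class MergingQuantiles:
    """One buffer of size k per 'level'.  A full batch of k raw items
    becomes a level-0 buffer; two buffers at level i are merged
    (merge-sort, keep every other element) into one buffer at level i+1,
    whose entries each stand for 2^(i+1) original items.  Space is
    O(k log(n/k))."""

    def __init__(self, k=128):
        self.k = k
        self.levels = {}     # level -> one sorted buffer of size k
        self.current = []    # raw items not yet forming a full buffer

    def add(self, x):
        self.current.append(x)
        if len(self.current) == self.k:
            self._carry(0, sorted(self.current))
            self.current = []

    def _carry(self, level, buf):
        # binomial-counter style: two buffers at a level merge one level up
        if level not in self.levels:
            self.levels[level] = buf
        else:
            other = self.levels.pop(level)
            merged = sorted(buf + other)
            self._carry(level + 1, merged[::2])  # keep every other element

    def quantile(self, phi):
        """Approximate phi-quantile from the weighted surviving items."""
        weighted = [(x, 2 ** lvl) for lvl, buf in self.levels.items() for x in buf]
        weighted += [(x, 1) for x in self.current]
        weighted.sort()
        total = sum(w for _, w in weighted)
        acc = 0
        for x, w in weighted:
            acc += w
            if acc >= phi * total:
                return x
```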
Another cool trick: how to sample randomly from a stream (more details here).
This means that algorithms that work by random sampling can be made to work in a stream setting.
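One standard way to do this is reservoir sampling; a quick sketch:

```python
import random

def reservoir_sample(stream, k):
    """Classic reservoir sampling: keep the first k items; after that, the
    i-th item replaces a uniformly random reservoir slot with probability
    k/i.  Every item ends up in the sample with probability exactly k/n,
    and only O(k) memory is used."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(x)
        else:
            j = random.randrange(i)   # uniform in [0, i)
            if j < k:
                reservoir[j] = x
    return reservoir
```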
Frequency Estimation: The AMS results
What are frequency estimates ? Suppose we're looking at a stream of items A = (a_1, a_2, ..., a_m), where each item is drawn from the range {1, ..., n}. Let the frequency of i in the stream be m_i. Frequency moments of this sequence are defined as F_k = Σ_i (m_i)^k for some k ≥ 0, where the m_i values represent the number of elements from A that have label i. Several values of k are used quite frequently in various applications, for example:
- F0: represents the number of distinct elements appearing in the sequence.
- F1: represents the length of the sequence: m
- F2: Gini index, useful for various error-checking applications on the sequence.
- F∞: the frequency of the most frequent element
For the sake of an example, given the sequence A = (1,1,3,5,1,2,0,0), the frequency moments for this sequence would be: F0 = 5, F1 = 8, F2 = 16 and F∞ = 3.
A high level idea of how to estimate such quantities is this: we need an estimator of the desired frequency moment, so we run a basic randomized process to generate one. Then we repeat this O(1/ε²) times and take the average, so as to reduce the approximation error to a (1±ε) factor. Then we repeat THAT O(log(1/δ)) times and take the median, to reduce the probability of error to δ. All of this follows from the correct application of Chebyshev and Chernoff bounds.
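Schematically, the average-then-median amplification looks like this (function names and constants are placeholders):

```python
import math
import statistics

def amplify(estimator, eps=0.1, delta=0.01):
    """Average O(1/eps^2) independent copies of a basic unbiased estimator
    to shrink the variance (Chebyshev), then take the median of
    O(log(1/delta)) such averages to drive the failure probability down to
    delta (Chernoff).  `estimator` is any zero-argument callable returning
    one noisy unbiased estimate; the constants are illustrative."""
    copies = max(1, math.ceil(4.0 / eps ** 2))
    groups = max(1, math.ceil(2.0 * math.log(1.0 / delta)))
    means = [statistics.mean(estimator() for _ in range(copies))
             for _ in range(groups)]
    return statistics.median(means)
```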
Consider F2 (the sum of squares of the frequencies). Every time we see an element i, we add a value ε_i to a running sum Z, where ε_i is either +1 or -1, chosen randomly once per label. At the end of the stream Z = Σ_i ε_i m_i, so Z² = Σ_i m_i² plus a cross term Σ_{i≠j} ε_i ε_j m_i m_j that gets nuked out in expectation if the ε_i are pairwise independent. Since ε_i² = 1, E[Z²] = Σ_i m_i² = F2, and we get the desired estimate.
Now although this scheme "works", the problem is storing the ε_i values. We need one for each distinct item, and we certainly can't store that many random values. The trick here is to realize that if we can generate the random values from a seed, then we can compute each value when needed, rather than storing it. Of course, if we can compute a random value, how can it be random ! The answer is, as with all pseudorandom generators, that we can hope to compute something that is sufficiently random. In this case, to get the desired expected value we need pairwise independence, and to get a high-confidence bound on the estimate we need 4-wise independent random variables (because we need to bound a variance). It turns out that such 4-wise independent random variables can be generated from a seed of sublinear size (in fact only a logarithmic number of bits is necessary), which concludes the argument.
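A sketch of the resulting tug-of-war estimator, with the 4-wise independent signs generated from a four-element seed (the prime and the parity trick for extracting a sign are illustrative choices):

```python
import random

class AMSSketchF2:
    """Each label i gets a pseudo-random sign s(i) in {-1, +1} computed
    from a degree-3 polynomial over a prime field, which gives 4-wise
    independence from a seed of just four field elements.  The running sum
    z is updated by s(i) on every arrival, and z^2 estimates F2.
    Independent copies can be combined with the average-then-median trick
    described earlier."""

    P = (1 << 61) - 1   # a Mersenne prime comfortably larger than the domain

    def __init__(self, rng=random):
        self.coeffs = [rng.randrange(self.P) for _ in range(4)]
        self.z = 0

    def _sign(self, i):
        a, b, c, d = self.coeffs
        h = (((a * i + b) * i + c) * i + d) % self.P
        return 1 if h % 2 == 0 else -1

    def update(self, i):
        self.z += self._sign(i)

    def estimate(self):
        return self.z * self.z
```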
How to get an F0 estimator ? Rough idea: hash the items randomly into a domain of size n, and keep track of the largest number of trailing zero bits seen in any hash value (this is a variant of a trick to sample randomly from the support: hash every value randomly and take the min value). If r is that maximum, you can show that the probability that 2^r is far from F0 is small, depending on how far from F0 you are willing to be. Total memory needed is log n bits (to maintain the running maximum).
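A sketch of this style of F0 estimator, using an off-the-shelf hash and the trailing-zeros variant (a single copy is crude and would be averaged in practice):

```python
import hashlib

def f0_estimate(stream):
    """Flajolet-Martin style distinct count: hash every item, track the
    maximum number of trailing zero bits r seen in any hash value, and
    report 2^r.  A single copy is only accurate to within a constant
    factor; real uses average many copies (or use refinements such as
    HyperLogLog)."""
    def trailing_zeros(h):
        return (h & -h).bit_length() - 1 if h else 64

    r = 0
    for x in stream:
        digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        r = max(r, trailing_zeros(h))
    return 2 ** r
```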
Doing F∞ (the most frequent element) is very hard: there is no reasonable approximation even using a randomized algorithm. This is bad news because top-K queries are quite important.
Frequency Estimation: Other counts
Next best thing: produce a list of items so that every item on the list has frequency at least (1-ε) times the kth largest frequency, and every item with frequency at least (1+ε) times the kth largest frequency is on the list.
Explain the basic idea of the +1/-1 estimator (maintain a counter c and a random sign s_i(q) per item, so that E[c · s_i(q)] = n_i), and how it needs to be replicated to make the probabilities work, yielding a t × b table of counters (the count-sketch idea).
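A minimal version of such a table of signed counters might look like this (the hash choices are simplistic, for illustration only):

```python
import random
import statistics

class CountSketch:
    """A t x b table of +1/-1 counters: t independent rows, each with b
    buckets, a bucket hash and a sign hash per row.  The estimate for an
    item is the median over rows of (sign * counter), which is unbiased
    per row.  Python's hash with a per-row seed stands in for proper
    pairwise-independent hash functions."""

    def __init__(self, t=5, b=256, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.row_seeds = [rng.randrange(1 << 30) for _ in range(t)]
        self.table = [[0] * b for _ in range(t)]

    def _bucket_and_sign(self, row, x):
        h = hash((self.row_seeds[row], x))
        return h % self.b, 1 if (h >> 31) & 1 else -1

    def update(self, x, count=1):
        for r in range(self.t):
            j, s = self._bucket_and_sign(r, x)
            self.table[r][j] += s * count

    def estimate(self, x):
        ests = []
        for r in range(self.t):
            j, s = self._bucket_and_sign(r, x)
            ests.append(s * self.table[r][j])
        return statistics.median(ests)
```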
Another approach: take a user-specified parameter s, and return all items that have counts of at least sN, up to a fractional error term ε.
[outline lossy counting paper from MM02]
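To make that note concrete, here is a compressed sketch of the lossy counting algorithm (simplified from the MM02 paper; treat it as illustrative):

```python
import math

def lossy_counting(stream, s=0.01, eps=0.001):
    """Process the stream in buckets of width 1/eps, keep
    (count, allowed-error) entries, and at every bucket boundary prune
    entries whose count plus error is at most the id of the bucket that
    just ended.  At the end, report items whose count is at least
    (s - eps) * N."""
    width = math.ceil(1.0 / eps)
    entries = {}   # item -> (count, delta)
    n = 0
    for x in stream:
        n += 1
        bucket = math.ceil(n / width)
        if x in entries:
            c, d = entries[x]
            entries[x] = (c + 1, d)
        else:
            entries[x] = (1, bucket - 1)
        if n % width == 0:   # end of bucket: prune low-count entries
            entries = {k: (c, d) for k, (c, d) in entries.items()
                       if c + d > bucket}
    return {k: c for k, (c, d) in entries.items() if c >= (s - eps) * n}
```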
An important stream variant: Sliding windows
Stream algorithms compute over the "whole" stream. Suppose we want to maintain statistics over only the most recent items (a so-called sliding window). Even a simple thing like counting the number of ones in the window is tricky: doing it approximately requires about (1/ε) log² n bits of space.
Readings
- The Data Stream Model
- J. Ian Munro, Mike Paterson: Selection and Sorting with Limited Storage.
- Random sampling for approximate quantiles
- Space-efficient quantiles (Greenwald/Khanna)
- Space complexity of approximating frequency moments (and an informal perspective here)