I work in algorithms and computational geometry. My areas of interest shift around quite a bit, but there are some major threads. The subareas below describe much, but not all, of my recent work; for a fuller (but less coherent) description, my papers are the best source.
Algorithmic Statistics
The growth of massive data sets, and the need to analyze them, has led to an explosion of interest in what I’ll call “scalable statistics”: methods for applying statistical tools and techniques to data sets far larger than those they were designed for.
To analyze such data sets, one must not only compute summary statistics efficiently and in limited space, but also develop inferential procedures that work on these approximations of the data. Probability distributions lie at the heart of many inference procedures, and a major challenge with massive data sets is manipulating, and drawing inferences from, empirically constructed distributions.
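To make “limited space” concrete, here is a toy single-pass frequency summary (the classic Misra-Gries algorithm) written in Python. It is not the machinery used in the papers below, just a standard example of a small-space summary that approximates an empirical distribution while keeping only k counters.

```python
def misra_gries(stream, k):
    """One-pass frequency summary using at most k - 1 counters (Misra-Gries).

    Any item occurring more than n/k times in a stream of length n is
    guaranteed to survive, and each stored count undercounts the true
    frequency by at most n/k.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No room for a new item: decrement everything and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# A heavy item survives the pass even though we never store the full histogram.
print(misra_gries(["a", "b", "a", "c", "a", "d", "a", "b", "a"], k=3))
```

The interesting (and hard) part is the next step: any inference drawn from such a summary has to be robust to the approximation it introduces.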
Some of my work in this regard:
- An Information-Theoretic Approach to Detecting Changes in Multi-dimensional Data Streams
- On stationarity in Internet measurements through an information-theoretic lens
- The Hunting of the Bump: On Maximizing Statistical Discrepancy
- Streaming and Sublinear Approximation of Entropy and Information Distances
- Rapid Identification of Column Heterogeneity
- Validating Multi-column Schema Matchings by Type
Geometric Pattern Matching
Matching shapes is probably one of the human skills most difficult to replicate in machines. Geometric pattern matching studies methods for matching shapes, whether they are as simple as clouds of points or as complex as curves and surfaces. At one level, the human body is a giant shape-matching machine: proteins bind with each other and with drugs largely on the basis of shape complementarity (a fancy phrase for saying that they fit together).
My Ph.D. thesis dealt with matching problems in the context of drug design.
… given a collection of drugs that all have similar (or the same) pharmacological effect on us, what interesting shapes might they have in common that allow this to happen? …
I have also looked at more abstract pattern matching problems, where you have different measures of similarity and even different kinds of shapes!
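To make “matching point clouds” concrete, one standard measure of similarity is the Hausdorff distance; here is a brute-force Python sketch. This is only a toy: in real pattern matching problems one usually also minimizes such a distance over transformations (rotations, translations), which is where the algorithmic difficulty comes in.

```python
import math

def directed_hausdorff(A, B):
    """Max over a in A of the distance from a to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance: small exactly when every point of either
    cloud has a nearby point in the other.  Brute force, O(|A| * |B|)."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# Two noisy samples of the same square are close under this measure.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(x + 0.05, y - 0.02) for (x, y) in square]
print(hausdorff(square, shifted))
```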
Graphics Cards As Stream Computers
Here is an abstract from a paper I wrote for the SIGMOD-DIMACS workshop on management and processing of data streams (I gave a longer talk at the University of Chicago):
Massive data sets have radically changed our understanding of how to design efficient algorithms; the streaming paradigm, whether in terms of the number of passes of an external memory algorithm or the single pass and limited memory of a streaming algorithm, appears to be the dominant method for coping with large data.
A very different kind of massive computation has had the same effect at the level of the CPU. The most prominent example is the computation performed by a graphics card. The operations themselves are very simple and need very little memory, but they must be performed extremely fast and in parallel to whatever degree possible. The result is a stream processor that is highly optimized for stream computations. An intriguing side effect is the growing use of graphics cards as general purpose stream processing engines. In an ever-increasing array of applications, researchers are discovering that performing a computation on a graphics card is far faster than performing it on a CPU, and so are using the GPU as a stream co-processor.
My recent work has focused on using fast graphics cards as stream processors to solve problems in computational geometry and graphics. Some of the problems that can be solved this way, faster than on a traditional CPU:
- Map Simplification
- (Dynamic) 2D Depth Contours
- Geometric Optimization
- Solid Modelling
- Natural Neighbour Interpolation
This page has more information. If you want to try this at home, you will need a good graphics card and some nifty software.
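To give a flavour of the programming model (in plain Python rather than an actual fragment program, and not code from any of the papers above): many such computations boil down to evaluating the same small function independently at every pixel of a grid. A distance field is the canonical example; labelling each pixel with its nearest site instead gives a rasterized Voronoi diagram, which graphics hardware can produce very quickly, for instance by drawing one distance cone per site.

```python
import math

def distance_field(sites, width, height):
    """Brute-force distance field: for every grid pixel, the distance to the
    nearest site.  Each pixel is computed independently of every other one,
    which is exactly the 'same small program at every pixel, no shared state'
    workload that a graphics card executes in parallel."""
    return [[min(math.dist((x, y), s) for s in sites) for x in range(width)]
            for y in range(height)]

sites = [(3, 4), (12, 9), (7, 15)]
field = distance_field(sites, width=20, height=20)
print(field[4][3])   # 0.0 -- this pixel sits on the first site
```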
External Memory Algorithms
Many of our notions of efficiency and good algorithmic design are challenged by having to deal with really large data sets (I’m talking terabytes here!). Cost models change (good ol’ Turing doesn’t quite cut it), and even something as simple as looking at a data item more than once cannot be taken for granted.
I’ve done some work on indexing, graph searching, and shortest paths.
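As a toy illustration of the cost model (not code from any of my papers): the textbook external-memory algorithm is two-phase merge sort, whose running time is dominated by sequential disk transfers rather than by comparisons. A minimal Python sketch:

```python
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, chunk_lines=100_000):
    """Two-phase external merge sort for a file with one record per line.

    Phase 1: read memory-sized chunks, sort each in RAM, and write sorted
             runs to temporary files (one sequential pass over the input).
    Phase 2: k-way merge the runs with a heap, streaming the output
             (a second sequential pass).
    """
    runs = []
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            # Guard against a missing final newline so runs stay line-aligned.
            chunk = [line if line.endswith("\n") else line + "\n" for line in chunk]
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)

    with open(output_path, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
```

What an external-memory analysis counts is the number of passes over the data, not the number of comparisons.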