Anomaly detection has important applications in biosurveilance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely, finding clusters in the measured data may not actually represent true anomalies. These clusters may likely be the clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different measured data is to baseline data within a region. An anomalous region is thus defined to be one with high discrepancy.
In this talk, I'll discuss algorithms for finding regions with high discrepancy for a wide class of (statistical) discrepancy functions. The main idea is a procedure to approximate a general convex discrepancy function by a small set of "simple" discrepancy functions, for which we can design efficient search algorithms. One specific consequence of this is a new and efficient approximation algorithm for finding regions that maximize the Kulldorff scan statistic, a well known discrepancy function in biosurveillance.
This is joint work with Deepak Agarwal (AT&T) and Jeff Phillips (Duke U.) The paper may be found at http://arxiv.org/abs/cs.CG/0510004