Anomaly detection is important in a variety of data monitoring settings. Determining ``bumps'', or clusters with unusually high or low data density, is a specific case of anomaly detection that has applications in biosurveillance and environmental monitoring. One of the most popular measures for detecting clusters is the scan statistic, a likelihood ratio test that determines how likely it is that measurements in a specific region differ from those in the surrounding areas. One such statistic is the Kulldorff scan statistic, a Poisson-based measure that is used widely in biosurveillance settings.
In general, the statistic is a \emph{discrepancy function} that measures the difference between measurements in a region and baseline data for that region. The goal, given such a function and a class of regions, is to find the region maximizing this discrepancy function.
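As a concrete instance, writing $m_R$ and $b_R$ for the fractions of the total measured and baseline mass falling in a region $R$ (illustrative notation, not drawn from the body of the paper), the Kulldorff statistic in its standard normalized one-sided form is
\[
d(m_R, b_R) \;=\; m_R \ln\frac{m_R}{b_R} \;+\; (1-m_R)\ln\frac{1-m_R}{1-b_R},
\]
evaluated when $m_R > b_R$ and taken to be $0$ otherwise.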
In this paper, we present algorithms for maximizing statistical discrepancy over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to \emph{any convex discrepancy function}. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one need only maximize a \emph{linear} discrepancy function over the same set of shapes.
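One way to see this reduction (a sketch, not the formal statement): a convex function is the upper envelope of its supporting hyperplanes, so
\[
d(m, b) \;=\; \sup_i \bigl(\alpha_i m + \beta_i b + \gamma_i\bigr),
\]
and maximizing $d$ over a class of regions reduces to maximizing each linear function $\alpha_i m_R + \beta_i b_R$ over the same class; retaining only a bounded family of hyperplanes sufficient for an $\epsilon$-approximation is what introduces the $1/\epsilon$ factors in the bounds below.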
For the case of the Kulldorff scan statistic, we give an algorithm running in time $O(\frac{1}{\epsilon}n^2\log^2 n)$ that computes the maximum discrepancy rectangle to within additive error $\epsilon$. Prior to our work, the best known algorithms were exact and ran in time $O(n^4)$.
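For intuition about the problem being solved, a naive brute force over rectangles whose sides pass through input points can be sketched as follows. This is even slower than the $O(n^4)$ exact algorithms mentioned above (it re-sums weights per rectangle), and the helper names and the normalized Kulldorff formula used here are assumptions for illustration, not code from the paper:

```python
import math

def kulldorff(m, b):
    """One-sided normalized Kulldorff discrepancy for measured fraction m
    and baseline fraction b (standard Poisson likelihood-ratio form;
    assumes 0 < b whenever m > 0)."""
    if m <= b:
        return 0.0
    term1 = m * math.log(m / b)
    term2 = (1 - m) * math.log((1 - m) / (1 - b)) if m < 1 else 0.0
    return term1 + term2

def max_discrepancy_rect(points):
    """Brute-force maximum over axis-parallel rectangles with sides through
    input points. points: list of (x, y, measured_weight, baseline_weight),
    all weights positive. O(n^4) rectangles, O(n) work each."""
    M = sum(p[2] for p in points)
    B = sum(p[3] for p in points)
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    best = 0.0
    for x1 in xs:
        for x2 in (x for x in xs if x >= x1):
            for y1 in ys:
                for y2 in (y for y in ys if y >= y1):
                    inside = [p for p in points
                              if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
                    m = sum(p[2] for p in inside) / M
                    b = sum(p[3] for p in inside) / B
                    best = max(best, kulldorff(m, b))
    return best
```

On a small point set with a dense measured cluster against a sparse baseline, the maximizing rectangle isolates the cluster; the algorithms in this paper approximate the same maximum without enumerating all rectangles.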
We also derive general discrepancy functions for the one-parameter exponential family, allowing us to generalize the Kulldorff scan statistic to continuous data via the use of Gaussian distributions. This function is (to the best of our knowledge) novel, and can be maximized over axis-parallel rectangles using our general framework to within additive error $\epsilon$ in $O(\frac{1}{\epsilon}n^3\log n\log \log n)$ time, and to within relative error $\epsilon$ in $O(\frac{1}{\epsilon}n^2\log^2 n)$ time. Similar results hold for discrepancy functions based on the Bernoulli and gamma distributions.
Our results also extend to \emph{prospective discrepancy}, where the region under consideration extends from some time $t$ to the current time, in addition to its spatial extent. They also carry over to hyper-rectangles in $d$ dimensions.