Quality And Efficiency in Kernel Density Estimates for Large Data


Overview

Kernel density estimates are important for a broad variety of applications. Their construction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques.
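
For reference, the quantity our algorithms approximate is the standard kernel density estimate: given a point set P and a kernel K, kde_P(x) = (1/|P|) * sum over p in P of K(p, x). The sketch below is a minimal brute-force evaluation of this definition in C++, shown only to make the target quantity concrete; the Gaussian kernel, its normalization, the bandwidth, and all identifiers are illustrative choices, and it is not the approximation algorithms from the paper or the released code.

// Minimal, illustrative brute-force KDE in C++ (not the paper's algorithms).
// kde_P(x) = (1/|P|) * sum_{p in P} K(p, x), here with an example 2D
// Gaussian kernel of bandwidth sigma; the paper's methods are kernel- and
// bandwidth-agnostic, so both choices are for illustration only.
#include <cmath>
#include <cstdio>
#include <vector>

struct Point { double x, y; };

// Example kernel: K(p, q) = exp(-||p - q||^2 / (2 * sigma^2)).
double gaussian_kernel(const Point& p, const Point& q, double sigma) {
    double dx = p.x - q.x, dy = p.y - q.y;
    return std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
}

// Exact KDE: O(|P|) work per query, which is the cost the approximate
// algorithms in the paper are designed to avoid on massive point sets.
double kde(const std::vector<Point>& P, const Point& query, double sigma) {
    double sum = 0.0;
    for (const Point& p : P) sum += gaussian_kernel(p, query, sigma);
    return sum / static_cast<double>(P.size());
}

int main() {
    std::vector<Point> P = {{0.0, 0.0}, {1.0, 1.0}, {2.0, 0.5}};  // toy data
    Point q = {1.0, 0.5};
    std::printf("kde(q) = %f\n", kde(P, q, /*sigma=*/1.0));
    return 0;
}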

Papers and Talks

1. Quality And Efficiency in Kernel Density Estimates for Large Data

    Full version: (Paper), (Talk), (Poster)

Source Code

Important Notice

If you use this code in your work, please cite our paper. Thanks!

If you find any bugs or have any suggestions/comments, we would be very happy to hear from you!

Code Description

The code package includes the centralized version, the centralized version bundled with the data used in the paper, and the distributed version (using MapReduce). The centralized code is written in GNU C++; the distributed version is written for Hadoop MapReduce. The folder names match the names of the algorithms in the paper. To compile the centralized code, go to each folder and type 'make'. For the distributed version, first type 'make', then run 'hadoop jar HadoopKernel.jar org.myorg.HadoopKernel' to print usage instructions. The MapReduce code was developed and tested on Hadoop 1.0.3, although it may also work with later versions of Hadoop.

Download

KDE centralized code [tgz], KDE centralized code with sample data (43M) [tgz], KDE MapReduce code [tgz]

Quick Install

The folder names are self-explanatory, and each folder contains a Makefile for easy compilation. We also include a KDE_EVAL folder with code for generating test data and evaluating results. All programs come with a readme.txt and verbose help output explaining the expected parameters.

Dataset

We have generated and experimented with the datasets described in the paper. A sample dataset is provided; please refer to readme.txt for an example of its format. Currently, our code handles 1D and 2D data.

Acknowledgement

Research described here has been funded by the NSF under grants CCF-1115677, IIS-0916488, and IIS-1053979. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Contacts

Yan Zheng