Coresets for Kernel Regression

Overview

Kernel regression is an essential and ubiquitous tool for non-parametric data analysis, particularly popular for time series and spatial data. However, its central operation, evaluating a kernel on the data set, is performed many times and takes time linear in the number of data points on each evaluation. This is impractical for modern large data sets. In this paper we describe coresets for kernel regression: compressed data sets which can be used as a proxy for the original data and have provably bounded worst-case error. The size of a coreset is independent of the raw number of data points; it depends only on the error guarantee and, in some cases, on the size of the domain and the amount of smoothing. We evaluate our methods on very large time series and spatial data, and demonstrate that they incur negligible error, can be constructed extremely efficiently, and result in great computational gains compared to using the full data set.
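To make the idea concrete, here is a minimal Python sketch of why a smaller weighted proxy helps. It is an illustration only and is not taken from the code package, which is written in GNU C++. It assumes a Gaussian kernel and the standard Nadaraya-Watson estimator, and it uses uniform random sampling as a simple stand-in for a coreset construction; the constructions in the paper may differ and come with their own error guarantees. The names `gaussian_kernel`, `kernel_regression`, and `uniform_sample_coreset` are hypothetical.

```python
import math
import random


def gaussian_kernel(x, q, sigma):
    """Gaussian kernel between data point x and query q with bandwidth sigma."""
    return math.exp(-((x - q) ** 2) / (2.0 * sigma ** 2))


def kernel_regression(points, q, sigma):
    """Nadaraya-Watson estimate at query q.

    `points` is a list of (x, y, weight) triples. Each query touches every
    point once, so the cost is linear in len(points).
    """
    num = 0.0
    den = 0.0
    for x, y, w in points:
        k = w * gaussian_kernel(x, q, sigma)
        num += k * y
        den += k
    return num / den if den > 0.0 else 0.0


def uniform_sample_coreset(points, m, seed=0):
    """Stand-in proxy (assumption, not the paper's method): keep m points
    chosen uniformly at random and scale their weights by n / m so the
    total weight matches the original data."""
    rng = random.Random(seed)
    scale = len(points) / m
    return [(x, y, w * scale) for x, y, w in rng.sample(points, m)]


# Synthetic time series: a noisy sine wave on [0, 1).
n = 20000
noise = random.Random(1)
data = [
    (i / n, math.sin(2 * math.pi * i / n) + noise.gauss(0, 0.1), 1.0)
    for i in range(n)
]
proxy = uniform_sample_coreset(data, 2000)

full = kernel_regression(data, 0.25, 0.05)    # scans 20000 points
approx = kernel_regression(proxy, 0.25, 0.05)  # scans 2000 points
print(full, approx)
```

Each query against `proxy` scans ten times fewer points than a query against `data`. A random sample gives only a probabilistic guarantee, which is the gap that coresets with provable worst-case error are meant to close.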

Papers and Talks

1. Coresets for Kernel Regression

Full version:

Source Code

Important Notice

If you use this code for your work, please kindly cite our paper. Thanks!

If you find any bugs or have any suggestions/comments, we would be very happy to hear from you!

Code Description

The code package includes GNU C++ implementations of the coreset construction methods, scripts to run the experiments in the paper, and the real and synthetic datasets.

Download

Kernel regression code [tgz]

Quick Install

The folder names are self-explanatory, and each folder contains a Makefile for easy compilation. All programs have a readme and verbose help output explaining the expected parameters.

Dataset

We have generated and experimented with the datasets described in the paper. A sample dataset is provided; please refer to the readme (in the Kernel regression code package) for an example of the sample data. For now, our code handles time series data and spatial data.

Acknowledgement

The research described here has been funded by the NSF under grants IIS-1251019, CCF-1350888, ACI-1443046, and CNS-1514520. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Contacts

Yan Zheng