Scalable Histograms on Large Probabilistic Data

[Overview] [Papers and Talks] [Source Code] [Dataset] [Contacts] 


Histogram construction is a fundamental problem in data management, and a good histogram supports numerous mining operations. Recent work has extended histograms to probabilistic data. However, constructing histograms for probabilistic data can be extremely expensive, and existing studies sur from limited scalability. This work designs novel approximation methods to construct scalable histograms on probabilistic data. We show that our methods provide constant approximations compared to the optimal histograms produced by the state-of-the-art in the worst case. We also extend our methods to parallel and distributed settings so that they can run gracefully in a cluster of commodity machines. We introduced novel synopses to reduce communication cost when running our methods in such settings. Extensive experiments on large real data sets have demonstrated the superb scalability and educiency achieved by our methods, when compared to the state-of-the-art methods. They also achieved excellent approximation quality in practice.

Papers and Talks

1. Scalable Histogram on Large Probabilistic Data,

    Full version: Talk:

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or have any suggestions/comments, we would be very happy to hear from you!

Library Description

The library was developed in C++ on Ubuntu 12. For installation and usage, please refer to the README file in the tar ball.


Histogram on Large Probabilistic Data Library: [tar.gz]


We have generated and experimented with the datasets described in the paper. To generate experimental data sets please follow the directions outlined in our paper.


Mingwang Tang