Building Wavelet Histograms on Large Data in MapReduce

[Overview] [Papers and Talks] [Source Code] [Dataset] [Contacts] 

Overview

MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often orders of magnitude) performance improvement achieved by our new algorithms.

Papers and Talks

1. Building Wavelet Histograms on Large Data in MapReduce,

    Full version:   Talk:  

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or any suggestions/comments, we are very happy to hear from you!

Library Description

The library is developed in Java for use with Hadoop 0.20.2. The code is self-explanatory. A makefile is included to compile the library into a jar file. The classpath in the makefile should be updated to refer to the Hadoop 0.20.2 jar file and the Apache Commons Logging jar file. A customized scheduler which permits data-local only mappers and a user specified single reducer is also included.

Download

Wavelet Histogram MapReduce Library [tar.gz]

Quick Install

Hadoop 0.20.2 may be obtained from here and Apache Commons Logging may be obtained from here. After obtaining both, update the makefiles as mentioned above in order to compile. The library may be invoked by specifying the command:
hadoop jar wavelet.jar org.HadoopWavelet
A detailed description of command line arguments is displayed.

Dataset

The real dataset used in this paper is publicly available here.

Contacts

Jeffrey Jestes     
Feifei Li