BIGDATA: Small: DCM: DA: Building a Mergeable and Interactive Distributed Data Layer for Big Data Summarization Systems

Principal Investigators: Feifei Li, Jeff Phillips, supported by the BIGDATA program from NSF, award #1251019.

Students: Klemen Simonic, Mina Ghashami, Robert Christensen, Jason Vuong, Jamie Long, Tony Tuttle

Overview

Big data today is stored in a distributed fashion across many different machines or data sources. This poses new algorithmic and system challenges to performing efficient analysis on the full data set. To address these difficulties, the PIs are building the MIDDLE (Mergeable and Interactive Distributed Data LayEr) Summarization System and deploying it on large real-world datasets. The MIDDLE system builds and maintains a special class of summaries that can be efficiently constructed and updated while still allowing fine-grained analysis on the heavy tail. Mergeable summaries can represent any data set with a guaranteed tradeoff between size and accuracy, and any two such summaries can be merged to create a new summary with the same size-accuracy tradeoff.
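To make the merge property concrete, here is a minimal sketch of one well-known mergeable summary, a Misra-Gries frequency summary. It is an illustration only, not the MIDDLE system's code: the class name MisraGries and its methods are hypothetical. With at most k counters, each estimated count is below the true count by at most n/(k+1), where n is the number of items summarized, and merging two summaries keeps both the size k and that bound for the combined data.

```python
from collections import Counter


class MisraGries:
    """Frequency summary with at most k counters (hypothetical name, illustration only)."""

    def __init__(self, k):
        self.k = k              # maximum number of counters kept
        self.n = 0              # number of items summarized so far
        self.counts = Counter()

    def update(self, item):
        """Process one stream item."""
        self.n += 1
        self.counts[item] += 1
        if len(self.counts) > self.k:
            self._shrink()

    def _shrink(self):
        """Subtract the (k+1)-th largest count from all counters; drop the rest."""
        ranked = sorted(self.counts.values(), reverse=True)
        cut = ranked[self.k] if len(ranked) > self.k else 0
        self.counts = Counter(
            {x: c - cut for x, c in self.counts.items() if c > cut}
        )

    def merge(self, other):
        """Return a new summary of the union of both inputs, still with k counters."""
        merged = MisraGries(self.k)
        merged.n = self.n + other.n
        merged.counts = self.counts + other.counts  # add counters item by item
        merged._shrink()
        return merged

    def estimate(self, item):
        """Lower bound on the true count; off by at most n / (k + 1)."""
        return self.counts.get(item, 0)


if __name__ == "__main__":
    # Two machines summarize their local data independently, then merge.
    site_a, site_b = MisraGries(k=3), MisraGries(k=3)
    for x in ["a"] * 50 + list("bcdefg") * 5:
        site_a.update(x)
    for x in ["a"] * 30 + ["b"] * 40 + list("hijk") * 5:
        site_b.update(x)
    combined = site_a.merge(site_b)
    print(combined.n, dict(combined.counts))
```

Because the merged summary has the same form and the same guarantee as its inputs, summaries built on separate machines can be combined in any order, which is what makes this class of summary a natural fit for distributed data.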

Interactive summaries can be quickly adapted to a specified query range of data while maintaining the same size-accuracy tradeoffs relative to the data in that range. This allows accurate, efficient analysis to zero in on small subsets of big data.

The MIDDLE system enables different big data users to develop a wide spectrum of efficient and scalable data analytic tasks through the use of data summaries. The MIDDLE system is being evaluated and refined with the aid of domain experts. Since the prospect of data-summary-based analytics becoming a standard technique for processing big data is tantalizing, this research generates broader impacts on the nation's government agencies, research institutes, education system, and high-tech industries. Our broader impacts also extend to academia and community outreach, through the design and development of big data curriculum and education, and the involvement of the general public in understanding and using big data through concise summaries.

Papers and Talks

Distributed Online Tracking (SIGMOD 2015)

    Full version:  

L_infinity Error and Bandwidth Selection for Kernel Density Estimates of Large Data (SIGKDD 2015)

    Full version:  

Improved Practical Matrix Sketching with Guarantees (ESA 2014)

    Full version:  

Quality and Efficiency in Kernel Density Estimates for Large Data (SIGMOD 2013)

    Full version:  , Project Website

Scalable Histograms on Large Probabilistic Data (SIGKDD 2014)

    Full version:  , Project Website

Continuous Matrix Approximation on Distributed Data (VLDB 2014)

    Full version:  , Project Website

Scalable Keyword Search on Large RDF Data (IEEE TKDE 2014)

    Full version:  , Project Website

Building Wavelet Histograms on Large Data in MapReduce (VLDB 2012)

    Full version:  , Project Website

Patents

Scalable summarization of data graphs, US Patent 8977650

Scalable summarization of data graphs, US Patent 8984019

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or have any suggestions or comments, we would be very happy to hear from you!

Library Description

Please refer to the individual project websites and papers for detailed descriptions of the following libraries.

Download

ProbString Library [tar.gz]

Quick Install

DistributedRank Library [tar.gz]

Quick Install

The subfolder names are self-explanatory. Each subfolder contains a Makefile for easy compilation. Every main test program prints verbose help output explaining the parameters it expects.

Dataset

Please refer to the individual project websites and papers for descriptions of the datasets.

Acknowledgement

This project is supported by the National Science Foundation (NSF) under the project: CAREER: Novel Query Processing Techniques for Probabilistic Data. Any opinions, findings, and conclusions or recommendations expressed in this project are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Contacts

Feifei Li