Data Summaries for Massive Data

Overview

Data analytics on a massive dataset can require extensive resources, and running analytics tasks directly over data at today's scale is often unrealistic, especially when resource usage is restricted. Users are often willing to give up some of the accuracy of an exact solution in exchange for an approximate one that saves orders of magnitude in computation, I/O, and communication. One particularly useful way to obtain approximate solutions is to first construct a data summary that provides quality guarantees for a particular set of queries. A nice feature of data summaries is that their size is typically independent of the dataset size and can be made to depend only on the desired query error, so summaries are often only kilobytes or megabytes in size. Building a data summary is usually a one-time cost; once built, the summary serves as a surrogate for the original dataset, allowing data analytics to be performed faster.

In this project we explore efficient techniques to construct data summaries for accelerating data analytics tasks.
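
As a rough illustration of the point that a summary's size can depend only on the desired error and not on the dataset size, the sketch below uses a uniform random sample as the summary. This is a generic textbook technique chosen for illustration, not necessarily the construction used in the papers listed here. The function names (`sample_size`, `build_summary`, `estimate_fraction`) and the parameters `eps` and `delta` are hypothetical. The sample size follows the standard Hoeffding bound: for a single range-fraction query, ln(2/delta) / (2 eps^2) samples give additive error at most `eps` with probability at least 1 - `delta`.

```python
import math
import random


def sample_size(eps, delta):
    """Samples needed so that one range-fraction query has additive
    error <= eps with probability >= 1 - delta (Hoeffding bound).
    Note that the dataset size does not appear in the formula."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


def build_summary(data, eps, delta, seed=0):
    """Build the summary: a uniform random sample drawn with replacement.
    This is a one-time pass; the result stands in for `data` afterwards."""
    rng = random.Random(seed)
    k = sample_size(eps, delta)
    return [rng.choice(data) for _ in range(k)]


def estimate_fraction(summary, lo, hi):
    """Estimate the fraction of the original records with lo <= x <= hi,
    using only the summary."""
    hits = sum(1 for x in summary if lo <= x <= hi)
    return hits / len(summary)


if __name__ == "__main__":
    small = list(range(1_000))
    large = list(range(100_000))
    # Both summaries have the same size, fixed by eps and delta alone.
    s_small = build_summary(small, eps=0.05, delta=0.01)
    s_large = build_summary(large, eps=0.05, delta=0.01)
    print(len(s_small), len(s_large))
    # The true fraction of `large` in [0, 24999] is 0.25.
    print(estimate_fraction(s_large, 0, 24_999))
```

A sample is the simplest summary with this property; summaries tailored to a specific query class can often be smaller or give stronger guarantees for the same error target.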

Papers and Talks

  1. Quality and Efficiency for Kernel Density Estimates in Large Data,
    by Y. Zheng, J. Jestes, J. Phillips, F. Li
    In Proceedings of the 32nd ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), pages TBA, New York, NY, June 2013.
  2. Ranking Large Temporal Data (Talk),
    by J. Jestes, J. Phillips, F. Li, M. Tang
    In Proceedings of the 38th International Conference on Very Large Data Bases (VLDB 2012), PVLDB 5(11): 1412-1423, Istanbul, Turkey, August 2012.
  3. Building Wavelet Histograms on Large Data in MapReduce (Project Website), (Talk),
    by J. Jestes, K. Yi, F. Li
    In Proceedings of the 38th International Conference on Very Large Data Bases (VLDB 2012), PVLDB 5(2): 109-120, Istanbul, Turkey, August 2012.
  4. Efficient Parallel kNN Joins for Large Data in MapReduce (Project Website), (Talk),
    by C. Zhang, F. Li, J. Jestes
    In Proceedings of the 15th International Conference on Extending Database Technology (EDBT 2012), pages 38-49, Berlin, Germany, March 2012.

Contacts

Jeffrey Jestes