Performing data analytics on a massive dataset can require extensive resources, and issuing analytics tasks over data at today's scale can be unrealistic, especially when resource utilization is restricted. Users are often willing to trade some of the accuracy of an exact solution for an approximate solution that saves orders of magnitude in computation, I/O, and communication. One particularly useful way of obtaining approximate solutions is to first construct a data summary that provides quality guarantees for a particular set of queries. A nice feature of data summaries is that their size is typically independent of the dataset size and can be made to depend only on the desired query error, so a summary is often only kilobytes or megabytes in size. Building a data summary is usually a one-time cost; the summary then serves as a surrogate for the original dataset, allowing data analytics to be performed faster.
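As an illustration of this size property, the sketch below uses a Count-Min sketch, chosen here purely as a familiar example summary rather than as the specific summaries studied in this project. Its table dimensions are set by the desired error eps and failure probability delta alone, regardless of how many items are summarized; frequency queries are then answered with a bounded overestimate.

```python
import math
import random

class CountMinSketch:
    """Count-Min sketch: an example summary whose size depends only on the
    desired error (eps) and failure probability (delta), not on the
    number of items it summarizes."""

    def __init__(self, eps: float, delta: float, seed: int = 0):
        self.width = math.ceil(math.e / eps)           # columns per row
        self.depth = math.ceil(math.log(1.0 / delta))  # number of hash rows
        self.total = 0                                 # total count inserted
        rng = random.Random(seed)
        # One salt per row; (salt, item) pairs are hashed into columns.
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cols(self, item):
        # Column index for the item in each of the depth rows.
        for salt in self.salts:
            yield hash((salt, item)) % self.width

    def update(self, item, count: int = 1):
        self.total += count
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item) -> int:
        # Overestimates the true count by at most eps * total
        # with probability at least 1 - delta.
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

# Usage: summarize a stream of items, then answer frequency queries.
cms = CountMinSketch(eps=0.01, delta=0.01)
for word in ["a", "b", "a", "c", "a"]:
    cms.update(word)
print(cms.estimate("a"))  # at least 3, and close to 3 with high probability
```

With eps = 0.01 and delta = 0.01 the table has a few hundred counters per row and a handful of rows, on the order of kilobytes, whether it summarizes thousands or billions of items.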
In this project we explore efficient techniques to construct data summaries for accelerating data analytics tasks.