CAREER: Novel Query Processing Techniques for Distributed Probabilistic Data

Princial Investigator : Feifei Li, supported by the IIS program from NSF, award #1053979 (while at FSU), and award #1200792 (after transferring to Utah).

Students: Mingwang Tang, Jeffrey Jestes, Robert Christensen, Oscar Marshall

[Overview] [Papers and Talks] [Source Code] [Dataset] [Acknowledgement] [Contacts] 


Data are increasingly generated, stored, and processed distributively. Meanwhile, when large amounts of data are generated, ambiguity, uncertainty, and errors are inherently introduced, especially in a distributed setup. It is best to represent such data in a distributed probabilistic database. In distributed data management, summary queries are useful tools for obtaining the most important answers from massive quantities of data effectively and efficiently, e.g., top-k queries, heavy hitters (aka frequent items), histograms and wavelets, threshold monitoring queries, etc. This project investigates novel query processing techniques for various, important summary queries in distributed probabilistic data. Broadly classified, this project examines both snapshot summary queries in static (i.e., no updates) distributed probabilistic databases, and continuous summary queries in dynamic (i.e., with updates) distributed probabilistic databases. A number of techniques are explored to design novel, communication and computation efficient algorithms for processing these queries. A distributed probabilistic data management system (DPDMS) prototype is implemented based on the query processing techniques developed in this project. This DPDMS is released to and used in practice by scientists and engineers from other science disciplines as well as industry. Graduate and undergraduate students, including those from minority groups, are actively involved in this project. Findings from the project have been integrated into different courses, demos, and educational projects.

Papers and Talks

Exact and Approximate Flexible Aggregate Similarity Search (VLDBJ, To Appear, 2016)

    Full version:  

Distributed Online Tracking (SIGMOD 2015)

    Full version:  

Scalable Histograms on Large Probabilistic Data(SIGKDD 2014)

    Full version:  

Quality and Efficiency in Kernel Density Estimates for Large Data (SIGMOD 2013)

    Full version:  , Project Website

Optimal Splitters for Temporal and Mulit-version Databases (SIGMOD 2013)

    Full version:  , Project Website

Efficient Parallel kNN Joins for Large Data in MapReduce (EDBT 2012)

    Full version:  , Project Website

Building Wavelet Histograms on Large Data in MapReduce (VLDB 2012)

    Full version:  , Project Website

Efficient Treshold Monitoring for Distributed Probabilistic Data (ICDE 2012)

    Full version:  

Probabilistic String Similarity Joins (SIGMOD 2010, Project Website)

    Full version:   Talk:  

Ranking Distributed Probabilistic Data (SIGMOD 2009, Project Website)

    Full version:   Talk:  

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or any suggestions/comments, we are very happy to hear from you!

Library Description

Please refer to individual project website and paper for detailed description for the following libraries.


ProbString Library [tar.gz]

Quick Install

DistributedRank Library [tar.gz]

Quick Install

The subfolder's names are self-explain. Each subfolder contains a Makefile for easy-compilation. All the main test program has a verbose help output to explain what parameters it expects.


Please refer to individual project website and paper for the description of the datasets.


This project is supported by the National Science Foundation (NSF) under the project: CAREER: Novel Query Processing Techniques for Probabilistic Data. Any opinions, findings, and conclusions or recommendations expressed in this project are those of author(s) and do not necessarily reflect the views of the National Science Foundation.


Feifei Li