Efficient Ranking and Aggregate Query Processing in Probabilistic Databases

Princial Investigator: Feifei Li, supported by the IIS program from NSF, award #0916448 (while at FSU), #1212310 (after transering to Utah), NSF link 1, NSF link 2.

Students: Jeffrey Jestes, Chi Zhang, Mingwang Tang, Justin DeBrabant

[Abstract] [Papers and Talks] [Source Code] [Dataset] [Related sites for subprojects] [Acknowledgement] [Contacts]　

Abstract

When dealing with massive quantities of data, ranking and aggregate queries are powerful techniques for focusing attention on the most important answers. Many applications that produce such massive quantities of data inherently introduce uncertainty in the same time, for example, probabilistic match in data integration, imprecise measurements from sensors, fuzzy duplicates in data cleaning, inconsistency in scientific data. Hence, the importance of these queries is even greater in probabilistic data, where a relation can encode exponentially many possible worlds. Uncertainty opens the gate to many possible definitions for ranking and aggregate queries. This project systematically examines the underlying properties associated with the rich semantics of ranking and aggregate queries for large amounts of probabilistic data. More importantly, this project investigates the issue of how to design novel and scalable algorithms for processing these queries efficiently in the offline, centralized environment and the streaming model. With the emergence of probabilistic data in many important application domains, the demand for understanding and processing the ranking and aggregate queries efficiently from the scientific community and beyond (e.g., government and military agencies) is expected to intensify in the coming years. The results of this project lay down a firm foundation for this important problem.

Papers and Talks

Ranking Large Temporal Data (VLDB 2012), Project Website

Full version:

Efficient Treshold Monitoring for Distributed Probabilistic Data (IEEE ICDE 2012)

Full version:

Flexible Aggregate Similarity Search (ACM SIGMOD 2011)

Full version:

Optimal Location Queries in Road Network Databases (ICDE 2011), Project Website

Full version: Talk:

Logging Every Footstep: Quantile Summaries for the Entire History (SIGMOD 2010)

Full version:

Semantics of Ranking Queries on Probabilistic Data (IEEE TKDE, accepted in 2010)

Full version:

Top-K queries on Temporal Data (VLDBJ,Vol. 13, No. 5, pages 715-733, 2010. Project Website)

Full version: Talk:

Probabilistic String Similarity Joins (SIGMOD 2010, Project Website)

Full version: Talk:

Ranking Distributed Probabilistic Data (SIGMOD 2009, Project Website)

Full version: Talk:

Semantics of Ranking Queries for Probabilistic Data and Expected Ranks (ICDE 2009, Project Website),

Full version: Talk:

Finding Frequent Items in Probabilistic Data (SIGMOD 2008, Project Website),

Full version: Talk:

Efficient Processing of Top-k Queries on Uncertain Databases (ICDE 2008 and IEEE TKDE, Vol. 20, No. 12, pages 1669-1682, 2008 Project Website),

Full version: Talk:

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or any suggestions/comments, we are very happy to hear from you!

Library Description

The library is developed in GNU C++. New algorithms are contained in the folder newtopk and the state of the art is in the folder soliman. Generators for various datasets are in the folder generators and all the datasets used for the experiment could be downloaded below. The naming convention of source codes and executable files are self-explain. The library includes all the 8 algorithms and 5 different generators.

Download

Seb-tree Library [tar.gz]

FrequentItem Library [tar.gz]

ExpectedRank Library [tar.gz]

UTopk Library [tar.gz]

Quick Install

The subfolder's names are self-explain. Each subfolder contains a Makefile for easy-compilation. All the main test program has a verbose help output to explain what parameters it expects.

Dataset

We have generated and experimented with the following datasets, where N is the total number of tuples when represented in relational model, xnumber is the number of xtuples and xdegree is the number of alternatives each xtuple may have.

1. For the single-alternative case, uniform distribution for the score and different distributions for the confidence, including uniform, normal with different means and exponential with different means, in this case the score and the confidence are not correlated. The naming convention for them is indep_uu_N, indep_un_N_mean_standard-deviation and indep_ux_mean.

2. For the single-alternative case, correlated distributions between score and confidence. Specifically, using normal distributions for both and a mean value of 0.5 for the confidence, we vary the correlation value (in the covariance matrix) of the two. The naming convention for them is cor_N_corval.

3. For the multi-alternative case, we generated different degrees of correlated data (between score and confidence) and vary both the number of xtuples and xdegrees. All the tuples are stored in the file starting with xtuple_*. The naming convention for them is xtuple_N_corval_xnumber_xdegree. The file to specify the relationships among alternatives (i.e., which tuples are belonging to the same xtuple but just different alternatives) is xid_*. The naming convention for them is xid_N_corval_xnumber_xdegree.

The dataset below contains the ones used in our experiment. However, it is very convenient to generate other datasets with different properties using our generators which you could download from the above link together with source codes for all algorithms.

Dataset [tar.gz]

Related sites for subprojects

Top-k queries in probabilistic data
Semantics for ranking probabilistic data
Finding frequent items in probabilistic data
Similarity join in probabilistic string and text
Top-k queries for temporal data

Acknowledgement

This project is supported by the National Science Foundation (NSF) under the project: III:Small:Efficient Ranking and Aggregate Query Processing for Probabilistic Data. Any opinions, findings, and conclusions or recommendations expressed in this project are those of author(s) and do not necessarily reflect the views of the National Science Foundation.

Contacts

Feifei Li