Supported in part by the IIS program from NSF, award #0916448 (while at FSU), #1212310 (after transering to Utah), NSF link 1, NSF link 2.

Ke Yi, Feifei Li, Divesh Srivastava and George Kollios

[Overview] [Papers and Talks] [Source Code] [Dataset] [Contacts]

This work introduces new algorithms for processing top-$k$ queries in uncertain databases, under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly takes one value from one or more alternatives. Soliman et al. first introduced the problem of top-$k$ query processing in uncertain databases, but their solutions are far from optimal. Our new results significantly improve the state of the art, in terms of both running time and memory usage. In the single-alternative case, our new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multi-alternative case, the improvement is even more dramatic: while the previous algorithms have exponential complexity in both time and space, our algorithms run in near linear or low polynomial time. Our study covers both types of top-$k$ queries proposed in previous work. We provide both the theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches.

1. Efficient Processing of Top-k Queries on Uncertain Databases,

**Important Notice**

**If you use this library for your work, please kindly cite our paper. Thanks!**

If you find any **bugs** or any **suggestions/comments**,
we are **very happy to hear from you**!

**Library Description**

The library is developed in GNU C++. New algorithms are contained in the folder newtopk and the state of the art is in the folder soliman. Generators for various datasets are in the folder generators and all the datasets used for the experiment could be downloaded below. The naming convention of source codes and executable files are self-explain. The library includes all the 8 algorithms and 5 different generators.

**Download**

**UTopk Library** [tar.gz] [please send email to Feifei for this library.]

**Quick Install**

The subfolder's names are self-explain. Each subfolder contains a Makefile for easy-compilation. All the main test program has a verbose help output to explain what parameters it expects.

We have generated and experimented with the following datasets, where N is the total number of tuples when represented in relational model, xnumber is the number of xtuples and xdegree is the number of alternatives each xtuple may have.

1. For the single-alternative case, uniform distribution for the score and
different distributions for the confidence, including uniform, normal with
different means and exponential with different means, in this case the score and
the confidence are not correlated. The naming convention for them is **
indep_uu_N, indep_un_N_mean_standard-deviation and indep_ux_mean.**

2. For the single-alternative case, correlated distributions between score
and confidence. Specifically, using normal distributions for both and a mean
value of 0.5 for the confidence, we vary the correlation value (in the
covariance matrix) of the two. The naming convention for them is **cor_N_corval.**

3. For the multi-alternative case, we generated different degrees of
correlated data (between score and confidence) and vary both the number of
xtuples and xdegrees. All the tuples are stored in the file starting with **
xtuple_*. **The naming convention for them is **
xtuple_N_corval_xnumber_xdegree. **The file to specify the relationships among
alternatives (i.e., which tuples are belonging to the same xtuple but just
different alternatives) is **xid_***. The naming convention for them is **
xid_N_corval_xnumber_xdegree**.

The dataset below contains the ones used in our experiment. However, it is very convenient to generate other datasets with different properties using our generators which you could download from the above link together with source codes for all algorithms.

**Dataset** [tar.gz]