Supported in part by the IIS program from NSF, award #0916448 (while at FSU), #1212310 (after transering to Utah), NSF link 1, NSF link 2.
Ke Yi, Feifei Li, Divesh Srivastava and George Kollios
[Overview] [Papers and Talks] [Source Code] [Dataset] [Contacts]
This work introduces new algorithms for processing top-$k$ queries in uncertain databases, under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly takes one value from one or more alternatives. Soliman et al. first introduced the problem of top-$k$ query processing in uncertain databases, but their solutions are far from optimal. Our new results significantly improve the state of the art, in terms of both running time and memory usage. In the single-alternative case, our new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multi-alternative case, the improvement is even more dramatic: while the previous algorithms have exponential complexity in both time and space, our algorithms run in near linear or low polynomial time. Our study covers both types of top-$k$ queries proposed in previous work. We provide both the theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches.
1. Efficient Processing of Top-k Queries on Uncertain Databases,
Full version: Talk:
If you use this library for your work, please kindly cite our paper. Thanks!
If you find any bugs or any suggestions/comments, we are very happy to hear from you!
The library is developed in GNU C++. New algorithms are contained in the folder newtopk and the state of the art is in the folder soliman. Generators for various datasets are in the folder generators and all the datasets used for the experiment could be downloaded below. The naming convention of source codes and executable files are self-explain. The library includes all the 8 algorithms and 5 different generators.
UTopk Library [tar.gz] [please send email to Feifei for this library.]
The subfolder's names are self-explain. Each subfolder contains a Makefile for easy-compilation. All the main test program has a verbose help output to explain what parameters it expects.
We have generated and experimented with the following datasets, where N is the total number of tuples when represented in relational model, xnumber is the number of xtuples and xdegree is the number of alternatives each xtuple may have.
1. For the single-alternative case, uniform distribution for the score and different distributions for the confidence, including uniform, normal with different means and exponential with different means, in this case the score and the confidence are not correlated. The naming convention for them is indep_uu_N, indep_un_N_mean_standard-deviation and indep_ux_mean.
2. For the single-alternative case, correlated distributions between score and confidence. Specifically, using normal distributions for both and a mean value of 0.5 for the confidence, we vary the correlation value (in the covariance matrix) of the two. The naming convention for them is cor_N_corval.
3. For the multi-alternative case, we generated different degrees of correlated data (between score and confidence) and vary both the number of xtuples and xdegrees. All the tuples are stored in the file starting with xtuple_*. The naming convention for them is xtuple_N_corval_xnumber_xdegree. The file to specify the relationships among alternatives (i.e., which tuples are belonging to the same xtuple but just different alternatives) is xid_*. The naming convention for them is xid_N_corval_xnumber_xdegree.
The dataset below contains the ones used in our experiment. However, it is very convenient to generate other datasets with different properties using our generators which you could download from the above link together with source codes for all algorithms.