Ranking Distributed Probabilistic Data

[Overview] [Papers and Talks] [Source Code] [Dataset] [Contacts]　

Overview

Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.

Papers and Talks

1. Ranking Distributed Probabilistic Data,

Full version: Talk:

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or any suggestions/comments, we are very happy to hear from you!

Library Description

The library is developed in GNU C++. It also comes with the data Generator in Matlab. Before compilation you must install the GNU Linear Programming Kit available here. To compile, simply go to each folder and type Make.

Download

Ranking Distributed Probabilistic Data Library [tar.gz]

Quick Install

The subfolder's names are self-explain. Each subfolder contains a Makefile for easy-compilation. All the main test program has a verbose help output to explain what parameters it expects.

Dataset

We have generated and experimented with the datasets described in the paper. In the source-code released above, it also contains the generator for the synthetic data sets. For real data sets, please follow the description in our paper.

Contacts

Jeffrey Jestes