Ke Yi, Feifei Li, Graham Cormode, Marios Hadjieleftheriou, George Kollios and Divesh Srivastava
[Overview] [Papers and Talks] [PIRS: Polynomial Identity Random Synopses] [Source Code] [Dataset] [Contacts]
The overwhelming flow of information in many data stream applications forces
many companies to outsource to a third-party the deployment of a Data Stream
Management System (DSMS) for performing desired computations. Remote
computations intrinsically raise issues of trust, making query execution
assurance on data streams a problem with practical implications. Consider a
client observing the same data stream as a remote server (e.g., network
traffic),
that registers a continuous query on the server's DSMS, and receives answers
upon request. The client needs to verify the integrity of the results using
significantly fewer resources than evaluating the query locally. Towards that
goal, we propose a {\em probabilistic algorithm} for selection and
aggregate/group-by queries, that uses constant space irrespective of the
result-set size, has low update cost, and arbitrarily small probability of
failure. We generalize this algorithm to allow some tolerance on the number of
errors permitted (irrespective of error magnitude), and also discuss the
hardness of permitting arbitrary errors of small magnitude. We also perform an
empirical evaluation using live network traffic.
1. Randomized Synopses for Query Assurance on Data Streams,
Please refer papers above for details.
Illustration of the problem settings
Important Notice
If you find any bugs or any suggestions/comments, we are very happy to hear from you!
Library Description
The library is developed in GNU C++. The library depends on: Tools Library and GNU GMP for arbitrary precision arithmetic calculation.
Our library is tested with the following version of corresponding dependent libraries:
1. Tools Library : [zip]
2. GNU GMP: [tar.gz]
Download
PIRS Library [tar.gz]
Update: (05/14/08)Fixed some bugs and newly released the PIRS* and FM-PIRS!
This version includes the followings: PIRS, PIRS test with World Cup Data Set, PIRS test with IPs data set, PIRS^\gamma (PIRSET, the exact solution) , PIRS^\pm\gamma (PIRSAT, the approximate solution) and their corresponding buffering techniques (using LRU to explore locality), and PIRS-attack to simulate attacks to PIRS. Now also includes PIRS* (to identify which groups contain errors and) and FM-PIRS that could indentify how many errors in all groups!
Quick Install
The subfolder's names are self-explain (pirsgi refers to PIRS*, gi stands for group identifier). Each subfolder contains a Makefile for easy-compilation. All the main test program has a verbose help output to explain what parameters it expects.
We have tested PIRS (and its variants) for two sets of real data streams. One is the World Cup data set which consists of the access log for 1998 world cup web sites. You may find detailed description of this data set from here. The second one is the live IP traces from AT&T backbone network. Due to privacy and legal requirement, we are not able to release the IPs dataset. However, the World Cup data stream is public. Our experiment consists of the traffic/request from Day 46 to Day 47. You may download the original data from WCUP_data_day4647. However, you need to combine the individual small data files into one file. For testing purpose, you may grab any one small file from ITA website and test-run PIRS and its variants.