Ranking and Monitoring Probabilistic Data

Overview

Uncertainty in observed data is common in many applications, such as scientific measurements and text data (e.g. name entry errors). The problem is compounded because data is frequently generated at terabyte and petabyte scale, often too quickly for cleaning and quality assurance. When processing massive amounts of probabilistic data, one of the most commonly used and essential tools for reducing it to a more compact, easily inspected form is the ranking query, which returns the top-k most important records from a dataset. Moreover, large uncertain data is often distributed as well: it may be collected and integrated from multiple locations, such as sensor fields whose equipment produces imprecise or fuzzy measurements, or multiple collaborating scientific institutes with inconsistent data.

In this project we study two fundamental topics, ranking and monitoring probabilistic data, and propose techniques that are efficient in processing and, for distributed systems, in communication.

Papers and Talks

  1. Efficient Threshold Monitoring for Distributed Probabilistic Data (Talk),
    by M. Tang, F. Li, J. Phillips, J. Jestes,
    In Proceedings of 28th IEEE International Conference on Data Engineering (ICDE 2012), pages 1120-1131, Washington DC, April 2012.
  2. Semantics of Ranking Queries for Probabilistic Data (Project Website),
    by J. Jestes, G. Cormode, F. Li, K. Yi,
    IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), Vol. 23, No. 12, pages 1903-1917, 2011.
  3. Probabilistic String Similarity Joins (Project Website), (Talk),
    by J. Jestes, F. Li, Z. Yan, K. Yi,
    In Proceedings of 29th ACM SIGMOD International Conference on Management of Data (SIGMOD 2010), pages 327-338, Indianapolis, Indiana, June 2010.
  4. Ranking Distributed Probabilistic Data (Project Website), (Talk),
    by F. Li, K. Yi, J. Jestes,
    In Proceedings of 28th ACM SIGMOD International Conference on Management of Data (SIGMOD 2009), pages 361-374, Providence, USA, June 2009.

Contacts

Jeffrey Jestes