Uncertainty in observed data is common in numerous applications, such as scientific measurements and text data (e.g., name entry errors), and the problem is compounded by the fact that data is frequently generated at the terabyte and petabyte scale, often too quickly for cleaning and quality assurance. When processing massive amounts of probabilistic data, one of the most commonly used and essential tools for reducing it to a more compact and easily inspected form is the ranking query, which returns the top-k most important records from a dataset. Moreover, large uncertain data is often distributed in nature as well, e.g., data collected and integrated from multiple locations, such as sensor fields whose equipment produces imprecise or fuzzy measurements, or multiple collaborating scientific institutes with inconsistent data.
In this project we study two fundamental topics, ranking and monitoring probabilistic data, and propose techniques that are efficient in processing and, in the case of distributed systems, in communication.