Refreshments 3:20 p.m.
Abstract
Massive data have become ubiquitous and are being generated at an
ever-increasing rate almost everywhere (e.g., in large data centers).
This phenomenon demands building and retrieving effective and concise
summaries efficiently to represent the underlying data for further
reasoning, mining, and analytics. There are two general types of
summaries: data summaries, which represent the entire data set, and
query summaries, which summarize a subset of the data selected by user input. In
this talk, we present a comprehensive study on how to summarize
massive data effectively and efficiently, using wavelet histograms on
large distributed data (a data summary) and aggregate similarity
search (a query summary) as examples. We leverage both algorithmic
(sampling, sketch, geometry) and (database) system techniques
(indexing, MapReduce) to fulfill our goal. We demonstrate that by
using distributed and parallel frameworks, and blending algorithmic
and database techniques, excellent scalability and efficiency can be
achieved. We also briefly address the data modeling challenges in
massive data (e.g., probabilistic data).
Bio
Feifei Li has been an assistant professor at the Computer Science
Department, Florida State University, since August 2007. He obtained
his B.S. in computer engineering from Nanyang Technological
University, Singapore in 2002 (transferred from Tsinghua University,
China) and his Ph.D. in computer science from Boston University in 2007. His
research focuses on large-scale data management, including query
processing, indexing, and query optimization in database
systems. He also works on probabilistic data, text/string
processing, semantic web/graph data (e.g., RDF), as well as security
and privacy issues in data management. His research has been actively
supported by NSF, HP Labs, FSU, and the Florida Department of Revenue.
He won an NSF CAREER Award in 2011 and the IEEE ICDE Best Paper
Award in 2004.