Refreshments 3:20 p.m.
Abstract
Nowadays, a vast ocean of data is collected from trillions of connected
devices every day. Useful knowledge is usually buried in multiple genres of
data that come from different sources, in different formats, and with
different types of representation. Many interesting patterns cannot be
extracted from a single data collection, but have to be discovered from the
integrative analysis of all heterogeneous data sources available. Although
many algorithms have been developed to analyze multiple information sources,
real applications continuously pose new challenges: Data can be gigantic,
noisy, unreliable, dynamically evolving, highly imbalanced, and
heterogeneous. Meanwhile, users provide limited feedback, have growing
privacy concerns, and ask for actionable knowledge. In this talk, I will
discuss my thesis work on exploring the power of multiple heterogeneous
information sources in challenging learning scenarios. I will present two
perspectives of learning from multiple sources, i.e., exploring their
similarities (knowledge integration) or their differences (inconsistency
detection). First, for knowledge integration, I proposed a graph-based
consensus maximization framework to combine multiple supervised and
unsupervised models, which greatly improves classification accuracy. Second,
I developed approaches based on probabilistic models and spectral embedding
techniques to detect objects performing inconsistently across multiple
sources as a new type of outlier. I will show the effectiveness of these
general learning techniques with a few sample applications in social
networks, the Internet, multimedia, and cyber-security.
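
To make the knowledge-integration idea concrete, the following is a minimal illustrative sketch, not the speaker's actual framework. It assumes a simplified scheme in which every predicted class (from a classifier) and every cluster (from a clustering model) becomes a "group" node, objects are linked to the groups they belong to, and label distributions are averaged back and forth until the models agree. The function name `consensus_labels` and the parameters `alpha` and `n_iter` are hypothetical choices made for this example.

```python
def consensus_labels(classifier_preds, cluster_assignments, n_classes,
                     alpha=2.0, n_iter=50):
    """Combine classifier and clustering outputs by propagating label
    distributions between objects and the groups they were placed in.

    Simplified illustration only; alpha is an assumed weight that pulls a
    classifier group toward the class it predicted.
    """
    n = len((classifier_preds or cluster_assignments)[0])

    groups = {}   # group key -> list of member object indices
    priors = {}   # classifier group key -> one-hot class distribution
    for m, preds in enumerate(classifier_preds):
        for i, label in enumerate(preds):
            key = ("clf", m, label)
            groups.setdefault(key, []).append(i)
            priors[key] = [1.0 if k == label else 0.0
                           for k in range(n_classes)]
    for m, assign in enumerate(cluster_assignments):
        for i, cluster_id in enumerate(assign):
            # Cluster groups carry no class label, hence no prior.
            groups.setdefault(("clu", m, cluster_id), []).append(i)

    # For each object, the groups it belongs to (one per model).
    memberships = [[] for _ in range(n)]
    for key, members in groups.items():
        for i in members:
            memberships[i].append(key)

    # Start every object with a uniform class distribution.
    u = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for _ in range(n_iter):
        # Group step: average the members, plus the prior if one exists.
        q = {}
        for key, members in groups.items():
            total = [sum(u[i][k] for i in members) for k in range(n_classes)]
            weight = float(len(members))
            if key in priors:
                total = [t + alpha * p for t, p in zip(total, priors[key])]
                weight += alpha
            q[key] = [t / weight for t in total]
        # Object step: average the distributions of the object's groups.
        u = [[sum(q[g][k] for g in memberships[i]) / len(memberships[i])
              for k in range(n_classes)]
             for i in range(n)]

    return [max(range(n_classes), key=lambda k: u_i[k]) for u_i in u]


# Two imperfect classifiers and one clustering over six objects.
clf_a = [0, 0, 1, 1, 1, 1]      # mislabels object 2
clf_b = [0, 0, 0, 1, 1, 0]      # mislabels object 5
clusters = [7, 7, 7, 9, 9, 9]   # cluster ids carry no class meaning
labels = consensus_labels([clf_a, clf_b], [clusters], n_classes=2)
```

In this toy input the two classifiers disagree on objects 2 and 5, and the unlabeled clustering supplies the extra evidence that breaks each tie. That is the intuition behind combining supervised and unsupervised models, although the framework presented in the talk may differ in its objective and update rules.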