Refreshments 3:40 p.m.
Abstract
In data mining applications and in spatial and multimedia databases,
a useful tool is the kNN join, which produces, for every point in a
dataset R, its k nearest neighbors (NN) from a dataset P. Since
it involves both the join and the NN search, performing kNN joins
efficiently is a challenging task. Meanwhile, applications continue to
see rapid (in some cases exponential) growth in the amount of
data to be processed. A popular model today for large-scale data
processing is a shared-nothing cluster of commodity
machines running MapReduce [6]. Hence, how to execute kNN joins
efficiently on large data that are stored in a MapReduce cluster is an
intriguing problem that meets many practical needs. This work proposes
novel (exact and approximate) algorithms in MapReduce to perform
efficient parallel kNN joins on large data. We demonstrate our ideas
using Hadoop, an open-source MapReduce framework. Extensive
experiments on large real datasets of at least tens of millions of
records demonstrate the efficiency, effectiveness,
and scalability of our methods.
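
To make the kNN join operation defined above concrete, here is a minimal single-machine sketch in Python. It is a brute-force baseline that only illustrates what the join computes; it is not the MapReduce algorithm presented in the talk. The function name `knn_join`, the use of Euclidean distance, and the representation of points as tuples are all assumptions made for illustration.

```python
import heapq
import math


def knn_join(R, P, k):
    """Brute-force kNN join (illustrative baseline, not the talk's method).

    For every point r in R, return the k points of P closest to r
    under Euclidean distance (an assumed metric), nearest first.
    Points are assumed to be tuples of floats of equal dimension.
    """
    result = {}
    for r in R:
        # Scan all of P for each r: O(|R| * |P|) distance computations.
        result[r] = heapq.nsmallest(k, P, key=lambda p: math.dist(r, p))
    return result


if __name__ == "__main__":
    R = [(0.0, 0.0), (5.0, 5.0)]
    P = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
    for r, neighbors in knn_join(R, P, k=2).items():
        print(r, "->", neighbors)
```

The quadratic number of distance computations in this baseline is what makes the join expensive on large data and motivates distributing the work across a cluster.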