Probabilistic String Similarity Joins

Supported in part by the IIS program from NSF, award #0916448 (while at FSU), #1212310 (after transering to Utah), NSF link 1, NSF link 2.

[Overview] [Papers and Talks] [Source Code] [Dataset] [Contacts] 

Overview

Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted in processing (deterministic) string joins and managing probabilistic data respectively, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings only exists at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.

Papers and Talks

1. Probabilistic String Similarity Joins,

    Full version: Talk: Poster:

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or have any suggestions/comments, we would be very happy to hear from you!

Library Description

The library was developed in Visual Studio 2005 in C++. It also comes with the data generator. To compile open the .sln file in Visual Studio 2005 and use the Build command. After building you may import the library into SQL Server using the Deploy command. Note: we have tested the code using Visual Studio 2005 Professional Edition and SQL Server 2008 Enterprise Edition.

Download

Probabilistic String Similarity Joins Library [tar.gz]

Quick Install

The folder names are self-explanatory. The generator folder contains a Makefile for easy-compilation. All generator programs have a verbose help output to explain what parameters are expected. The myagg and ped folders contain the aggregate UDF and UDFs respectively. They must be compiled in Visual Studio 2005.

Dataset

We have generated and experimented with the datasets described in the paper. To generate experimental data sets please follow the directions outlined in our paper. Please obtain the real data sets from the individual data proprietors.

Contacts

Feifei Li