Utah High School ACM Programming Contest
2005 Take-Home Problem

Dataset Duplicate Detection Device

Project Details


Table of Contents

1.0 The Story So Far

2.0 Problem Description
    2.1 Inputs & Outputs
    2.2 Example Duplicates

3.0 Take-Home Logistics
    3.1 Submission
    3.2 Scoring

4.0 Getting Started
    4.1 Skeleton Solution
    4.2 Sample Test Data
    4.3 How Do I Start?

5.0 Help!
    5.1 Frequently Asked Questions (Updated 4/7/05)
    5.2 Contact Information
    5.3 Hints

Change History


1.0 The Story So Far

You've just landed your first real software development job working for a brand new company named Acquired Clinics, Medical (ACM). Their business is to buy out smaller medical clinics, take over their patient's health care, and thus grow their empire! Your first task as a new employee at ACM is to purge the company's quickly growing database of duplicate entries.

Back to Table of Contents


2.0 Problem Description

Your task is to design a system to detect duplicate records in a given dataset. Your program will search through a dataset of up to 1,000 entries. Each entry will contain the following fields: unique identifier, first name, last name, address, zip code, phone number, and email address. All fields will always be provided. A duplicate pair is defined as any two records that share enough data to certainly be considered the same record. Every record will be uniquely identified using a number. This field is used to identify which records are considered duplicates and should not be considered when comparing records.

This definition is imprecise, but the crux of this real-world problem is deciding how to conservatively tag duplicate pairs. If the matching algorithm is too liberal, time will be wasted entering customer data again. If the matching algorithm is too conservative, too much time will be spent by humans finding duplicate pairs.

2.1 Inputs & Outputs

It is extremely important that you strictly adhere to the input and output format guidelines. This will allow us to efficiently and accurately judge your submissions.

Your program will take an input filename as the one and only command line parameter. The program should not prompt for any input. The skeleton solution that we have provided properly handles receiving the input filename as a command line parameter. The input file will contain the dataset and will be formatted such that each data field is on a separate line in the proper order, and a blank line separates each complete record. For example:

    0
    John
    Doe
    1234 Example St.
    84100
    123-456-7890
    john@doe.net
    1
    Jane
    Doe
    456 Example St.
    84200
    234-567-8901
    jane.doe@gmail.com
    2
    Jon
    Doe
    1234 Example St.
    84100
    123-456-7890
    john@doe.net

After processing the dataset, the program will output to the screen (standard out) a single line for each duplicate pair found. In the example dataset above, records #0 and #2 match, so the output would be:

    0,2

NOTE: Please report duplicate pairs in numerical order, i.e. "0,2" not "2,0". Each duplicate pair reported a second (or subsequent) time will be scored as incorrect. If there is an instance where more than 2 items are found to be duplicates, you must report all duplicate pairs. For instance if records 2, 8, and 10 match your output would be:

    2,8 
    8,10 
    2,10 

It should also be noted that the order of the pairs in the file is not important. For instance, "2,10" does not have to come before "8,10".

Nothing besides the listing of duplicate pairs should be output.

2.2 Example Duplicates

The following list is intended to aid in understanding what types of errors could possibly cause duplicate records in the data set. This list is certainly not exhaustive, but will help get you on the right track. Section 4.2 provides links to a sample test data file and a solution file. This test data has examples exhibiting the items listed below. The final test data that we use to judge your program will contain examples demonstrating these properties plus additional, more difficult examples with more complicated errors.

  1. Change of case - Some records may have been entered in all caps while others may have used mixed case. For example: John Doe and JOHN DOE would be duplicates.
  2. Letter transposition - It is common for letters to be transposed due to a data entry error. For example: John Doe and Jonh Doe would be duplicates.
  3. Letter omission - It is common for letters to be omitted due to a data entry error. For example: John Doe and John De would be duplicates.
  4. Letter replacement - It is common for letters to be replaced by other letters due to a data entry error. For example: John Doe and John Dle would be duplicates.
  5. Letter addition - It is common for letters to be added due to a data entry error. For example John Doe and Johnn Doe would be considered duplicates.
  6. Alphanumeric substitution - It is common for letters to be replaced by a similar looking number due to scanning errors. For example: John Doe and John 4oe would be considered duplicates.
  7. Missing punctuation - It is common for punctuation to be missing due to inaccurate data entry. For example: 123 Main St. and 123 Main St would be considered duplicates.

The examples above don't use all of the fields in the description, but all fields should be considered in your algorithm. For instance, examine the following records:

    0
    John
    Doe
    1234 Example St.
    84100
    123-456-7890
    john@doe.net
    1
    Jon
    Doe
    91 Elm Dr.
    10096
    240-374-1846
    john.doe@foobar.net
    2
    Jon
    Doe
    1234 Example St.
    84100
    123-456-7890
    john@doe.net

You program should detect that "0,2" as duplicates. It should not however detect that "0,1" or "1,2" are duplicates. Even though the names could seem to be a data entry error the other data would lead your program to not report those records as duplicates. Your program should also not assume that there will only be differences in a single field. Duplicate pairs may have multiple fields that don't match exactly.

Back to Table of Contents


3.0 Take-Home Logistics

3.1 Submission

The take home problem should be submitted on the day of the contest. Please follow these directions carefully, as improper submissions may prevent your program from being judged. Bring the following with you when your team checks in:

  1. A floppy disk or CD containing your commented source code, a compiled executable, a readme.txt file, and a feedback.txt file. The floppy disk or CD should be labeled with your school name.
  2. An organized print-out of all your source code, readme.txt file, and feedback.txt file with your school and team members clearly identified.

Your source code should be commented liberally and use good programming style. Your comments should explain the key algorithms and design so that we can better understand your program.

For your compiled executable, if your submission was written in C++ you should submit a single executable file named recordMatcher so that we can run it with the command:

       RecordMatcher input.txt.

If your submission was written in Java, you should ensure that you submit compiled .class files. The class that contains the main function should be called RecordMatcher so that we can run it with the command:

       java RecordMatcher input.txt.

Your readme.txt file should contain the name of your school and the names of the members in your team. It should also indicate what development environment you used and instructions for recompiling your code. If it becomes necessary we will recompile your code from source. It must compile with one of the following compilers: Microsoft Visual C++.NET 2003 or earlier, GCC/G++ 3.0 or earlier, or Java SDK 1.5 or earlier. Note that all C++ code must by ANSI C++ compliant so you should not use features that are specific to an individual compiler and/or platform. Additionally, the readme.txt file should briefly explain what you did and why, and anything else you tried that you didn't use in your final solution. Explain what you learned while working on the take-home problem.

Your feedback.txt file should contain information so that we can continue to improve the contest. What did you think about the problem? How difficult or easy was it? What were the hardest and easiest parts of the problem? Provide any other comments that you think may be useful to us as we develop future take-home problems.

3.2 Scoring

Points will be awarded for each "correct" duplicate pair reported, but points will be deducted for each "incorrect" one. (Note that there will be some subjectivity here. Our grading will assume that your matching algorithm is relatively conservative.) Your solution should be computed in a timely manner, so judges will only score output for 5 minutes per data set. Each submission will be graded according to the following criteria:

To compute the final score for the take-home problem, each solution will be ranked on the criteria listed above relative to the other systems in the competition. This score will be scaled so that the team with the highest average receives 200 points. Teams that submit a take home solution that does no more than our sample code will receive 0 points.

Back to Table of Contents


4.0 Getting Started

4.1 Skeleton Solution

A sample solution has been provided in both C++ and Java that does a bare minimum amount of work. It inputs the file, compares records looking only for differences in case (upper versus lower), and outputs a list of pairs of records that match with only case differences.

C++ skeleton code - This code can be compiled and run using the following commands (assuming a Unix/Linux environment):

     > g++ record.cc recordMatcher.cc -o recordMatcher 
     > ./recordMatcher testSet0.txt 
  

Java skeleton code - This code can be compiled and run using the following commands:

    > javac Record.java RecordMatcher.java
    > java RecordMatcher testSet0.txt
  

4.2 Sample Test Data

We have provided a file with 85 records that you can use to test your program. This data contains a few pairs of records that differ in minor ways. It by no means includes all the possible ways in which "duplicate" records might differ.

testSet0.txt

We have also provided a list of these pairs (in the format that your program should use for its output) so you can check your work. Note that this list was compiled by hand. The skeleton program provided will not find all of the duplicate pairs that are listed.

solution0.txt

4.3 How Do I Start?

Your first task should be to modify the skeleton program so it handles all the types of duplicates included in the Sample Test Data.

NOTE: When we test your program with the Sample Test Data, the records will be modified such that they have similar errors, but not exactly. Hence, you may not hard code these specific answers into your program.

Next, you should add code to handle other types of "duplicates" that seem most probable. Our Official Test Data will contain as many different examples as we can think of, so it's unlikely that anyone will get all of them. Just do as many as you have time for, and you will be scored relative to how you do relative to other teams.

Back to Table of Contents


5.0 Help!

Please check back here often. As questions arise, we will update this FAQ with new information.

5.1 Frequently Asked Questions

Q1: In the take home problem description you specify that we can use Java SDK 1.4 or later. I am wondering as to wether we will be able to use Java SDK 1.5 specific items such as generics and the new for loop syntax. Thank you.
A1: Yes, you may use the new features available in Java SDK 1.5. If you are writing your solution in C++, your code must be ANSI C++ compliant.

Q2: Will you provide the official test data used for judging after the contest?
A1: Yes. Sorry that it took us so long to get them online but they are now available at the following links:
        Test Cases
        Answers

5.2 Contact Information

If you have any questions that aren't answered in the above section, please send email to hspc at cs.utah.edu.

5.3 Hints

There are many simple methods that can go a long way to solving this problem. There are also more complicated methods discussed in the literature. Several references are linked below. Use them at your own risk...

An Assessment of Name Matching Algorithms

Tolerating Spelling Errors during Patient Validation

Back to Table of Contents


Change History

Date Description
17-Feb-2005 Initial release
24-Feb-2005 Corrected listing of tools that the submission should compile on and added first FAQ question.
24-Feb-2005 Added additional reference (Tolerating Spelling Errors during Patient Validation).
7-April-2005 Added links to official test data.

Back to Table of Contents