When you go shopping at just about any business today, your transaction involves a computer. At a Web-based business, such as an online catalog or bookstore, you use a Web browser to search for and buy products. At a ``bricks and mortar'' business, such as a neighborhood grocery store, your chosen items are scanned by a computer at the checkout counter. In either case, the things that you buy are tracked by computer.
Not only does the computer produce your bill automatically, but it interacts with the store's other software systems as well. For instance, the computer at the checkout counter might tell the store's inventory system what is being bought, so that the store can keep track of what it has and what it needs to order. At an online store, when you add an item to your virtual ``shopping cart,'' the computer may consult an online database to find other products that you might like to buy. Finally, when you check out, the computer may remember your purchases in a database, so that yet more programs can analyze the data to spot emerging patterns and trends.
The idea of analyzing large datasets in order to discover hidden patterns and trends is called data mining. For instance, a data mining program might analyze a store's transaction records and discover that most of the people who buy cookies also buy milk at the same time. This information can be used by the store in various ways: for instance, the store might decide to lower the price of cookies, in hopes of selling more milk. A data mining program might also discover other useful trends: for example, that people who buy only a few items at a time are more likely to buy prepared food, and people who buy many things are more likely to buy food staples (e.g., flour) and family-size packages. This might lead the store to put prepared foods near the front of the store --- so that the few-item customers can dash in and out --- and to put staples near the back.
Taking advantage of patterns and trends means that a store can maximize its profits, which makes data mining a critical technology for many businesses. Data mining is different from normal database queries in that the goal of data mining is to discover patterns and structures that we might not anticipate. Further, to be most useful, a data mining tool must be as automatic as possible, ideally operating without any human interaction or guidance at all. A fully automatic tool can run continuously, always looking for emerging trends and hidden patterns.
As described previously, the patterns that are discovered by a data mining tool can be used in many ways. One way to use such patterns is in a recommender system: a tool that predicts what products you might like to buy, based on the products that you've bought in the past and the products that are currently in your shopping cart.
For instance, recommender systems are commonly used at online book and music stores in order to increase sales. When you add a CD by your favorite band to your online shopping cart, the store's Web site will probably point out other CDs by that band. It might also recommend other CDs, based on what you and other customers have purchased in the past. (Maybe people who like your favorite band also seem to like certain other bands as well.) If the store makes good recommendations, then the store is likely to increase its sales and profits. If the store makes bad recommendations, however, sales may actually go down, because it risks alienating customers. You might buy a CD on the store's recommendation, but then find out that you don't like the music. This might make you less likely to trust the store's advice in the future, meaning that ultimately, you might buy fewer CDs.
Your task is to write a data mining tool and recommender system for a small store. Your program will read two files.
As just described, your program will read two separate files: a training file and a test file. Each file is a text file containing a number of records, and each record describes a single customer transaction. Each transaction has the following form:
customer-id product-id [product-id ...]
That is, a single customer identifier followed by one or more product identifiers. A customer identifier is an upper-case letter (`A' through `Z'), and tells us what customer was involved in this purchase. A product identifier is a positive integer between 1 and 100, inclusive, and represents a single instance of a particular kind of product. Every unique customer and kind of product has a unique identifier. This means, of course, that there are 26 possible customers and 100 possible product kinds that may appear in an input file. There will be no more than ten (10) product identifiers in a single transaction; i.e., no more than ten items will be purchased at one time.
Within a record, identifiers are separated by one or more space characters. Additionally, spaces may appear before the customer-id or after the last product-id. Every record is contained on a single line, and is terminated by an end-of-line sequence (i.e., a newline). There will be no more than ten thousand (10,000) transactions in an input file.
Below is an example input file describing five separate transactions:
A 10 12 7 D 12 B 3 8 10 8 7 A 1 10 4 E 3 8 2
The first line in the above file says that customer A purchased three items in one transaction: one 10, one 12, and one 7. The second line says that customer D bought product 12. The third line says that customer B bought one 3, two 8's, a 7, and a 10. On the fourth line, customer A made another transaction, this time buying three items. Finally, customer E buys three items. This file illustrates some important points:
In the input training file, the transaction records represent complete transactions. In the input test file, the records represent partial transactions. (But even in the test file, every record will contain at least one product identifier.) For instance, your program might be run with the following training and test files:
Training Test ------------------ ------------------ A 10 12 7 A 3 7 D 12 C 8 B 3 8 10 8 7 D 3 12 8 A 1 10 4 B 10 5 E 3 8 2
Your program will input both files as described previously, and then output its recommendations as described in the next section.
Your program will output exactly one recommendation for each record in the test file. This recommendation is your program's guess about what items the customer wants to add to his or her shopping basket as part of the corresponding transaction. Your program also picks a number of ``points'' to associate with each item in the recommendation. If the customer accepts an item in your recommendation (i.e., buys that item), then your program is awarded the points associated with that item. If the customer rejects a recommended item, however, your program loses those points. (Overall scoring is described later on in this document.)
A recommendation is a line of the following form:
[product-id points] ...
That is, a list of zero or more pairs of alternating product identifiers and points, all on a single line. A product identifier is a positive integer as described previously. A point value is a positive integer between 1 and 10, inclusive. Each product identifier is followed by an associated point value. The product identifiers and point values are all separated by one or spaces, and the recommendation is terminated by the end of line. An empty recommendation (one that recommends no products) is represented by an empty or blank line.
Consider again our example training and test files:
Training Test ------------------ ------------------ A 10 12 7 A 3 7 D 12 C 8 B 3 8 10 8 7 D 3 12 8 A 1 10 4 B 10 5 E 3 8 2
Since there are four records in the test file, your program must output exactly four recommendations, one for each test record. For the first record, your program might recommend one instance of product 10. Notice that in the training file, customer A always bought a 10. Further, in the training data, everyone who bought a 7 also bought a 10. Given this evidence, your program might choose a fairly high number of points for this item; say that your program decides to recommend product 10 for nine points. Your program would output the following recommendation:
10 9
For the second partial transaction, your program might recommend an instance of product 8 for one point, and an instance of product 3 for five points:
8 1 3 5
For the third test, perhaps your program would recommend no items, since the only apparent connection is between products 3 and 8, and the customer already has one of each. In this case, to make no recommendation, your program simply outputs a blank line. Your program must output a blank line to make an empty recommendation.
Finally, for the last test transaction, your program might recommend product 7 for three points. Thus, in this example, the total output of your program would be the following four lines: (Note that the third output line is blank.)
10 9 8 1 3 5 7 3
Remember the following when generating recommendations:
Your program sends its output to a file named on the program's command line. After writing its recommendations to the file, your program exits.
Your program must accept the following three arguments on the command line, in the following order:
Obviously, your program must read the first two files and write the third. Your program may assume that the output file does not already exist. Your program may not require any additional command line arguments.
As described above in the Output section, your program will output a list of recommended items for each partial transaction in the test file. Your program also picks a number of ``points'' to associate with each item in the recommendation. If the customer accepts an item in your recommendation (i.e., buys that item), then your program is awarded the points associated with that item. If the customer rejects a recommended item, however, your program loses those points.
Your program starts with zero points, and gains or loses points as described for each recommendation. Your program's overall score is the sum of the wins minus the sum of the losses, as you would expect.
Whether a customer accepts or rejects your program's recommendations is determined by an answer file that is separate from the training and test files given to your program. For each partial transaction in the test file, the answer file contains a list of the additional items that the customer decided to buy. Continuing our example, we now have a total of four files as shown below:
[Input] [Input]
Training Test
------------------ ------------------
A 10 12 7 A 3 7
D 12 C 8
B 3 8 10 8 7 D 3 12 8
A 1 10 4 B 10 5
E 3 8 2
[Output] [Hidden]
Recommendation Answer
------------------ ------------------
10 9 10 1
8 1 3 5
8 8
7 3 7
The lines in the answer file have a one-to-one correspondence with the lines in the test and recommendation file. The first line of the answer file tells us that customer A added one instance of product 10 and one instance of product 1. Comparing this with the recommendation, we see that the program correctly guessed that the customer would buy an item 10, and so the program gains the points associated with that item: in this case, nine points. Customer A also bought one item 1, but the program did not guess this, and so receives no points. Thus, the net gain for the first recommendation is nine points --- not too bad!
The remaining lines of the answer file are handled in the same way. The second line of the answer file indicates that the customer (in this case, customer C) added no additional items to his or her shopping cart. Thus, the program loses points for each recommended item: one point for item 8, and five points for item 3. For the third test, the customer bought two instances of product 8. The recommendation was empty, however, and so the program neither gains or loses any points. Finally, in the fourth test, the program gains three points for its correct guess that the customer would buy product 7.
Overall, in this example, the program made two correct product recommendations (for a net gain of 12 points), made two incorrect product recommendations (for a net loss of 6 points), and missed three items that customers actually bought (neither gaining or losing points). Thus, in this example, the program's final score is 6 points (12-6).
Your recommender program does not have access to the answer file as it runs. (Otherwise, it would be too easy to make accurate recommendations, don't you think?) This means that your program cannot compute its own score.
Rather, your program's score is determined by a separate scoring program that you can run to evaluate the quality of your data mining program's recommendations. We will provide this program to you (see support programs, below). Of course, we will be running this scoring program to grade your recommender! The scoring program reads two files: the answer file, and the recommendation file that was output by your program. The program computes the overall score as described above and then prints the final score.
A complete test requires three related files: a training file, a test file, and an answer file. The first two are read by your data mining program, and the third is used to evaluate the quality of your program's recommendations. Your program will be run with many different sets of test files, so your program must use general data mining techniques to spot patterns and trends.
To make sure that your data mining techniques work well for many different test sets, we will provide you with a tool that can generate test sets for your data mining program. (See the support programs, below.) In particular, we will provide you with:
By using the test generator program, you can generate as many different test suites as you like, in order to make sure that your data mining program works for all kinds of specific inputs. Your goal is to write a program that gets high scores for as many test suites as possible!
At this point you might be asking, ``How do I actually write a program to find hidden data patterns?'' This is where you get to do some research and be creative! Think about what kinds of patterns there might be in the transactions records of a store:
Also, think about how much ``support'' or ``evidence'' your program might need in order to make product recommendations. If you spot a pattern, it will usually be a pattern that doesn't happen all of the time. Maybe the pattern only matches only 50% or even only 10% of the time. How strong does the pattern need to be in order for it to be useful for making recommendations? How does the strength of a pattern relate to the number of points that your program should give to each recommended product?
A good recommender program will certainly make many wrong guesses, so don't be concerned if your program often loses points. The trick is to make more points from the right guesses than you lose from the wrong guesses! If your program can do this consistently, for many test suites, then you have a good recommender!
There are many Web sites that contain great information about data mining in general, and about the kind of pattern mining you'll be doing here in particular. There are also many books about data mining at your local libraries and bookstores. We encourage you to look around and find examples of algorithms that might be useful.
You may use any pseudo-code algorithms you find, but it is not acceptable to use someone else's source code. That would be considered cheating, and your team will be disqualified.
Some topics that you might search for include:
Here are a few links to get you started:
The long-awaited skeleton code is now available. If you have any questions or comments about the code, please check the FAQ first. If that doesn't help, send email to hspc@cs.utah.edu for assistance.
The ZIP archive is 11 KB, and all the files separately are about 26 KB. The skeleton contains about 800 lines of code including comments and whitespace. If you use this skeleton in writing your solution, please add comments to make it clear where you have made changes, what those changes do, and why you made them. This will help us grade your data mining program more accurately.
The skeleton code implements a basic algorithm to find ``association rules'' that describe patterns in the training file. The first step of the algorithm is to find sets of items that appear together often enough to be ``statistically significant,'' or in other words, that are common enough to be reliable for making recommendations. These are called the frequent item sets. The second part of the algorithm takes those frequent item sets and generates every possible rule that one could make with the items in some set. A rule has a left hand side (lhs) and a right hand side (rhs), and usually is written like this:
[lhs] => [rhs]
The left hand side ``implies'' the right hand side. This means that if you find all the items in the lhs in a transaction, you're likely to find the rhs in the transaction as well. In addition to a lhs and rhs, every rule has a confidence value that tells you how often the rule appears to hold. Confidence over 50% means that you usually found the rhs in the training data, and confidence less than 50% means that you usually did not find all of the rhs present. In fact, very low confidence means that if all the items in lhs are present, you'll very rarely find all the items in the rhs.
From a single frequent item set, you can make many different association rules. For instance, from the set [1 2 3], you could make rules including the following:
[1] => [2 3]
[2] => [1 3]
[1 2] => [3]
[1] => [3]
[3] => [2]
Some of these rules will have high confidence values, and these are the ones that usually give the best recommendations. For example, if the rule [1 2] => [3] has a confidence of 90%, it means that in the training data, 90% of all the transactions that include both items 1 and 2 also include item 3. So for a partial transaction [1 2], we believe that with 90% probability, if we recommend an item 3, it will be accepted by the customer.
Based on the confidences (probabilities) of the association rules, your program must decide which items to recommend, and what point values to give to those items. There are many ``betting strategies'' that will work well, so feel free to try anything. As long as your program gets positive scores over the long haul, you're on the right track. Of course, the higher your scores, the better you're doing! Always remember that there are two ways to increase your score: increase the number of recommendations you get right and the points you get for those recommendations, or decrease the number of incorrect recommendations and the points you lose from those recommendations.
The support programs include the test generation and scoring programs described previously. We give you the (C++) source code for these programs, so that you can compile them on your system and see how they work. Download the files below:
If for some reason you can't compile the test generator or the scoring program, you can download precompiled versions:
The test generator reads a raw data file and randomly ``slices'' it into a training file, a test file, and an answer file. The program has a command line interface, and you can invoke the program as follows:
testgen [-s <seed>]
<raw-file>
<training-file> <training-size>
<test-file> <test-size>
<answer-file>
The command line arguments should be on a single line, of course. The file arguments are the names of the indicated files. The size arguments are the number of transactions to be output into the files. There is no answer-size argument, because the test and answer files must always contain the same number of lines.
The seed argument is optional: if given, the seed is a number that is used to initialize the random number generator. By specifying different seed values, you can generate reproducible ``slicings'' of the raw data. If no seed is specified, then the test generator picks a slicing completely at random.
For example, suppose you want to use the raw data file raw-01.dat to generate a 5,000-transaction training file called train.dat, a 3,000-transaction test file called test.dat, and an answer file called answer.dat. You would type the following command:
testgen raw-01.dat train.dat 5000 test.dat 3000 answer.dat
Each time you run the above command, you will get a completely new set of data in each file. If you specify a seed value, you will get the same output files every time you use that seed. For example:
testgen -s 1 raw-01.dat train.dat 5000 test.dat 3000 answer.dat
The scoring program reads two files: a recommendation file that was output by your data mining program, and an answer file (previously produced by the test generator). The scoring program computes and prints the overall score as described previously. The command line for the scoring program is straightforward:
score <recommendation-file> <answer-file>
For example, suppose your program's recommendations are stored in recommend.dat, and you want to determine your score according to the answers in answer.dat. You would type the following command:
score recommend.dat answer.dat
To sum up, you can run the test generator, your program, and the scoring program in sequence to make a ``complete'' test case. If your program is called datamine, for example, you might type a series of commands like this:
testgen raw-01.dat train.dat 5000 test.dat 3000 answer.dat datamine train.dat test.dat recommend.dat score recommend.dat answer.dat
(The command line syntax of your program was described above.) By running the above series of commands over and over, you can test your program on many different input file sets, to make sure that your data mining techniques produce good results reliably and consistently.
Your program will be graded mostly on the quality of its recommendations (i.e., the scores that it receives) on a wide variety of test suites. We will test your program with many different input files. These inputs will be derived (by the test generation program) both from the raw data files that we provide to you, and from similar data files that we will use just for grading. The best solutions will be those that use general data mining techniques to achieve consistently high scores across many tests.
In addition to the quality of its recommendations, your program will also be graded on programming style and comments. Use style and comments to make your code clear and understandable by us!
So that we can better understand your program, include comments inside your code describing how your algorithms work. Also, write a ``README'' file explaining what you did and why, and anything else that you tried that you didn't use in your final solution. Explain what you learned while working on the take-home problem. Finally, so that we can continue to improve the contest, please give us some feedback about what you thought about the problem: how difficult or easy it was, what the hardest and easiest parts were, and other topics you would like to see in a take-home problem.
When you come to the contest, bring the following items on disk(s):
We will recompile your program from source if necessary. Please make sure that your source code compiles with one of the following compilers: Microsoft Visual C++, GCC/G++ (free), or DJGPP (free) for DOS/Win. Please ask us if you don't have any of these compilers and would like to use a different one. Be warned: If we cannot get your source code to compile, we will not be able to grade your program!
If you have a question about the take-home problem, first check the list of frequently asked questions (FAQ). With luck, your question will already be answered there!
If you can't find an answer there, then ask us via email. We will answer your question as soon as possible. We might also add your question to the frequently asked questions list.
Finally, remember to check the 2002 High School Programming Contest home page for late-breaking news and contest updates! Good luck!
| High School Programming Contest <hspc@cs.utah.edu> | Last modified: Mon Jan 28 22:32:03 MST 2002 |