Back to the take-home problem home page.
(Last update: Jan 14 2002)
Send email to hspc@cs.utah.edu. We will answer your question as soon as possible. We might also add your question to this Web page!
(Last update: Jan 17 2002)
Yes. Each time your program is run, it must finish in less than 60 seconds.
(Last update: Jan 17 2002)
We say in the take-home problem that there will be 100 possible product kinds that may appear in an input file. These will always be identified by numbers from 1-100.
(Last update: Jan 22 2002)
On January 22, we updated the skeleton code to fix the difficulties when compiling with MS Visual C++. If you downloaded the code before then, please update your copy of the skeleton and try again. If you still have trouble compiling the code under MS Visual C++, please tell us.
(Last update: Jan 23 2002)
We provide you with a skeleton for your data mining program, but you do not have to use it if you don't want to. In your program, you can use as much or as little of the skeleton code as you like.
In any case, whether you use the skeleton or not, your program must meet the specifications laid out in the problem statement. This includes accepting the required command-line arguments, reading and writing the proper files and file formats, and so on. Your program also needs to run fairly quickly: see question 2.
(Last update: Jan 23 2002)
You must use C++ this year. We're thinking about adding Java next year if there's enough demand. This will be one of the topics we'll discuss with the teachers on Feb. 4.
(Last update: Jan 31 2002)
To find rules for making recommendations, the first thing you need to do is to find the ``frequent item sets.'' These are groups of items that appear together often enough that we might be able to make a reliable rule out of them. (The more occurrences there are, the more we believe that our measured confidence in that rule is a good estimate of the true confidence in that rule.) The job of the expandSets function is to find those frequent item sets.
The general idea of expandSets is to find all of the frequent item sets by iteratively ``growing'' from the smallest possible set, the empty set []. We grow our sets by adding one item at a time, and seeing if the result is a frequent item set. From the empty set, we generate all of the one-item sets ([1], [2], [3], ..., [100]) and see which of those sets are frequent. From there, we'll grow all of the frequent one-item sets to discover all of the frequent two-item sets, and so on.
The actual expansion of a set is handled by the function countExpandedSet, which is called by expandSets. The countExpandedSet function takes an item set (the set parameter), generates all of the sets that can be made by adding one new item to that set, and determines how many times each of those sets occurs in the training data (the parameter d). So, given the empty set, countExpandedSet counts the number of transactions that contain [1], the number of transactions that contain [2], and so on up to [100]. These counts are returned in the array called c: c[i] is the number of transactions that contain the expanded set that is made from the original set by adding item i. (Note that countExpandedSet will not try to expand a set by adding more instances of products that are already in the set. You might want to change this in your program.)
countExpandedSet counts all the occurrences of all the possible expansions of a set: but which of those expanded sets are frequent? This is determined by the cutoff local variable in the expandSets function. A set must occur at least cutoff times in the training data to be considered frequent. To select and save the frequent item sets, expandSets calls another helper function, addtobin.
The addtobin function iterates over an array of expanded sets (represented by the parameter set, which describes the ``base set,'' and the s parameter, which is an array giving the scores of all the possible expansions of set). For every expanded set that occurs at least cutoff times, we would like to keep it as a frequent item set, and use it for future expansion. However, to keep things manageable, we place a limit on the total number of frequent item sets that we will keep around. This is the purpose of the bin parameter, which is our collection of the ``best'' frequent item sets. The bin parameter is a two-dimensional array: EXPANSION_LIMIT is the maximum number of frequent item sets that we will keep around, and MAX_ITEMS is the maximum number of items in a transaction. The bin is kept sorted by the ``scores'' of the item sets, which are held in the score array: score[i] is the number of times that the set in bin[i] occurred in the training data. When a frequent item set is added to the bin, if the bin is already full, we drop the least-frequent set from the bin. The actual insertion of a set into the bin is handled by the function insert.
Finally, we come back to the expandSets function! As described previously, the idea is to start with the empty set, and iteratively find all of the frequent item sets by ``growing'' from the frequent sets that we've already found. At each step, we have a bin that holds the best frequent item sets that we found in the previous step. The bins are held in the array called bins: bins[i] is the bin for sets of size i. The numbins array holds the number of items in each bin: numbins[i] is the number of sets in bins[i]. We trust that you can figure out the purposes of the nbins, score, and parent arrays from looking at the code.
From the sets in bin[i-1], we expand and find sets to put into bins[i]. We keep doing this until either we stop finding frequent item sets, or we generate the sets that contain the maximum number of items that can be contained in a transaction (MAX_ITEMS).
Let's do a little walk-through. Starting from the empty set, we generate all of the one-item sets ([1], [2], [3], ..., [100]), see which of these sets are frequent, and store the best of those in our bin for sets of size 1, which is bins[1]. Usually, almost all of the one-item sets will be frequent.
Now we start looking for two-item frequent sets. Suppose that in the previous round, we found that [1] was a frequent set. Now we'll generate all of the expansions of [1], which are [1 2], [1 3], ..., up to [1 100]. We then determine which of these are frequent, and add them to our bin for two-item sets (which is bins[2]). We do the same for all of the one-item frequent sets that we found; when we're done, we have filled our new bin with the best two-item frequent sets.
We iterate again, this time trying to fill our bin for three-item sets (bins[3]) by searching all of the expansions of the sets in our bin for two-item sets (bins[2]). Suppose that [1 4] was a frequent item set: we then generate and test the expansions [1 4 2], [1 4 3], ..., up to [1 4 100]. We do the same for all of the other two-item frequent sets, to produce all of the best three-item sets.
The expandSets function keeps this up, iteratively finding larger and larger frequent items sets, and finally quits when either it stops finding frequent item sets (mostly likely) or it has generated the frequent item sets of maximum size.
As a final step, the expandSets function collects the best frequent item sets from all of its bins. The best sets are stored in the f array; fsize[i] is the number of items in the set stored in f[i]. The total number of frequent item sets in f is stored in fnum. This number will never be more that MAX_FREQ_ITEMSETS. So, in the end, the contents of the f array are the best frequent item sets; the contents of the fsize array are the sizes of those sets; and fnum is the total number of sets in f.
There are some things you can do to improve the way this works, since right now it doesn't properly eliminate duplicates, which causes quite a few problems later.
| High School Programming Contest <hspc@cs.utah.edu> | Last modified: Thu Jan 31 06:53:51 MST 2002 |