TAs: Reza Esfadani (email) | Office hours: Tuesday 11am-noon @ 3115 MEB

+ Raghvendra Singh (email) | Office hours: Monday 3-4pm @ 3423 MEB (near grad lounge)

Spring 2014 | Mondays, Wednesdays 5:15 pm - 6:35 pm

WEB 2230

Catalog number: CS 5140 01 or CS 6140 01

Data mining is the study of efficiently finding structures and patterns in data sets. We will also study what structures and patterns you

This class may differ greatly from many data mining classes offered elsewhere. Perhaps it should be called "Large Scale Data Mining" since many of the techniques we will discuss have been designed to deal with (or have survived the onslaught of) very large scale data. Many of these techniques use randomized algorithms - these are often extremely simple to use, but more difficult to analyze. We will focus more on how to use them, and give explanations (but often not proofs) of their correctness.

Topics will include: similarity search, clustering, regression/dimensionality reduction, link analysis (PageRank), and small space summaries. We may also discuss anomaly detection, compressed sensing, and pattern matching.

When material is not covered by the books, free reference material will be linked to or produced.

For undergrads, the prerequisites are CS 3505 and CS 2100. It is also highly recommended that you have taken CS 3130 - in many ways, this class is the natural continuation of that course.

In the past, this class has had undergraduates, masters, and PhD students, including many from outside of Computer Science. Most have kept up fine, and yet most have still been challenged. If you are unsure whether the class is right for you, contact the instructor.

Date | Topic | Video | Link | Assignment (latex) | Project |
---|---|---|---|---|---|

Mon 1.06 | (Instructor Traveling - No Class) | ||||

Wed 1.08 | Class Overview | 1,2 | MMDS 1.1 | ||

Mon 1.13 | Statistics Principles : Birthday Paradox + Coupon Collector | 1,2 | MMDS 1.2 | ||

Wed 1.15 | Chernoff-Hoeffding Bounds + Applications | 1,2,3 | CSTIA 2.3 | Terry Tao Notes | Tarjan Notes | ||

Mon 1.20 | (MLK Day - No Class) | ||||

Wed 1.22 | Similarity : Jaccard + k-Grams | 1,2 | MMDS 3.1 + 3.2 | CSTIA 7.3 | ||

Mon 1.27 | Similarity : Min Hashing | 1,2,3 | MMDS 3.3 | ||

Wed 1.29 | Similarity : LSH | 1,2,3 | MMDS 3.4 | Statistical Principles | |

Mon 2.03 | Similarity : Distances | 1,2,3 | MMDS 3.5 + 7.1 | CSTIA 8.1 | Proposal | |

Wed 2.05 | Similarity : SIFT and ANN vs. LSH | 1,2,3 | MMDS 3.7 + 7.1.3 | ||

Mon 2.10 | Clustering : Hierarchical | 1,2,3 | MMDS 7.2 | CSTIA 8.7 | ||

Wed 2.12 | Clustering : K-Means | 1,2,3 | MMDS 7.3 | CSTIA 8.3 | ||

Mon 2.17 | (Presidents Day - No Class) | ||||

Wed 2.19 | Clustering : Spectral | 1,2,3 | MMDS 10.4 | CSTIA 8.4 | Luxburg | Gleich | Document Hash(tex) | |

Mon 2.24 | Frequent Items : Heavy Hitters | 1,2,3 | MMDS 4.1 | CSTIA 7.1.3 | Count-Min Sketch | Misra-Gries | Data Collection Report | |

Wed 2.26 | Frequent Itemsets : Apriori Algorithm | 1,2,3 | MMDS 6+4.3 | Careful Bloom Filter Analysis | ||

Mon 3.03 | Regression : Basics in 2-dimensions | 1,2,3 | ESL 3.2 and 3.4 | ||

Wed 3.05 | Regression : SVD + PCA | 1,2,3 | Geometry of SVD - Chap 3 | CSTIA 4 | Clustering (tex) | |

Mon 3.10 | (Spring Break - No Class) | ||||

Wed 3.12 | (Spring Break - No Class) | ||||

Mon 3.17 | Regression : Column Sampling and Frequent Directions | 1,2,3 | MMDS 9.4 | CSTIA 2.7 + 7.2.2 | arXiv | ||

Wed 3.19 | Regression : Compressed Sensing and OMP | 1,2,3 | CSTIA 10.3 | Tropp + Gilbert | Intermediate Report | |

Mon 3.24 | Regression : L1 Regression and Lasso | 1,2,3 | Davenport | ESL 3.8 | ||

Wed 3.26 | Noise : Noise in Data | 1,2,3 | MMDS 9.1 | Tutorial | Frequent (tex) | |

Mon 3.31 | Noise : Privacy | 1,2,3 | Dwork | ||

Wed 4.02 | Graph Analysis : Markov Chains | 1,2,3 | MMDS 10.1 + 5.1 | CSTIA 5 | Weckesser notes | ||

Mon 4.07 | Graph Analysis : PageRank | 1,2,3 | MMDS 5.1 + 5.4 | ||

Wed 4.09 | Graph Analysis : MapReduce (room change: hill east of MEB) | 1,2,3 | MMDS 2 | Old Lecture 1, 2, 3 | Overview Lecture | Regression (tex) | |

Mon 4.14 | Graph Analysis : PageRank via MapReduce | 1,2,3 | MMDS 5.2 | Final Report | |

Wed 4.16 | Graph Analysis : Communities | 1,2,3 | MMDS 10.2 + 5.5 | CSTIA 8.8 + 3.4 | Poster Outline | |

Mon 4.21 | Graph Analysis : Graph Sparsification | 1,2,3 | MMDS 4.1 | ||

Wed 4.23 | Poster Day !!! | Poster Presentation | |||

Mon 4.29 | Graphs (tex) |

We plan to have 5 or 6 short homework assignments, roughly covering each main topic in the class. The homeworks will usually consist of an analytical problem set, and sometimes a light programming exercise. There will be no specific programming language for the class, but some assignments may be designed around a specific one that is convenient for that task.

Each person in the class will be responsible for a small project. I will allow small groups to work together. The project will be very open-ended; basically it will consist of finding an interesting data set, exploring it with one or more techniques from class, and presenting what you found. I will try to provide suggestions for data sources and topics, but ultimately the groups will need to decide on their own topic. There will be several intermediate deadlines so projects are not rushed at the end of the semester.

This class has the following collaboration policy. For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. For projects, you may of course work however you like within your groups. You may discuss your project with anyone as well, but anyone whose input contributes to your final product must be acknowledged (this does not count towards page limits). Of course, any outside materials used must be referenced appropriately.