Data Mining

Instructor : Jeff Phillips (email) | Office hours: Th 11-noon @ MEB 3442 (and often directly before class in WEB L101)
TAs: Ankit Agarwal (email) | Office Hours: MEB 3419 @ Tu 9-10am
+ Anusha Buchireddygari (email) | Office Hours: MEB 3419 @ Mon+Wed 2-3pm
+ Tony Tuttle (email) | Office Hours: MEB 3419 @ Tu + Th 12-1pm
Spring 2015 | Mondays, Wednesdays 5:15 pm - 6:35 pm
WEB L101
Catalog number: CS 5140 01 or CS 6140 01

Syllabus

Description:
Data mining is the study of efficiently finding structures and patterns in large data sets. We will focus on several aspects of this: (1) converting from a messy and noisy raw data set to a structured and abstract one, (2) applying scalable and probabilistic algorithms to these well-structured abstract data sets, and (3) formally modeling and understanding the error and other consequences of parts (1) and (2), including choice of data representation and trade-offs between accuracy and scalability. These steps are essential for training as a data scientist.
Algorithms, probability, and linear algebra are required mathematical tools for understanding these approaches.
Topics will include: similarity search, clustering, regression/dimensionality reduction, graph analysis, PageRank, and small space summaries. We will also cover several recent developments and the use of these topics in modern applications, often relating to large internet-based companies.
Upon completion, students should be able to read and understand many data mining research papers, and implement the methods they describe.

Books:
MMDS(v1.3): Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
CSTIA: Computer Science Theory for the Information Age by John Hopcroft and Ravi Kannan. This is currently only a set of collated lecture notes from a theory class that covers some similar topics. It provides some proofs and formalisms not explicitly covered in lecture.
When material is not covered by the books, free reference material will be linked to or produced.

Videos: We plan to videotape all lectures, and make them available online. They will appear on this playlist on our YouTube Channel.

Prerequisites: A student who is comfortable with basic probability, basic linear algebra, basic big-O analysis, and basic programming and data structures should be qualified for the class. There is no specific language we will use. However, programming assignments will often (intentionally) not be as specific as in lower-level classes. This will partially simulate real-world settings where one is given a data set and asked to analyze it; in such settings even less direction is provided.
For undergrads, the prerequisites are CS 3500 and CS 3130 and MATH 2270 (or equivalent), and CS 4150 is a corequisite.
In the past, this class has had undergraduates, masters, and PhD students, including many from outside of Computer Science. Most have kept up fine, yet most have still found it challenging. If you are unsure if the class is right for you, contact the instructor.

Schedule: (subject to change - some linked material is from the previous iteration of the class)

Date	Topic (+ Notes)	Video	Link	Assignment (latex)	Project
Mon 1.12	Class Overview	Vid	MMDS 1.1
Wed 1.14	Statistics Principles + Chernoff Bounds	Vid	MMDS 1.2
Mon 1.19	(MLK Day - No Class)
Wed 1.21	Similarity : Jaccard + k-Grams	V1+V2	MMDS 3.1 + 3.2 | CSTIA 7.3
Mon 1.26	Similarity : Min Hashing	Vid	MMDS 3.3
Wed 1.28	Similarity : LSH	Vid	MMDS 3.4	Statistical Principles
Mon 2.02	Similarity : Distances	Vid	MMDS 3.5 + 7.1 | CSTIA 8.1		Proposal
Wed 2.04	Similarity : SIFT and ANN vs. LSH	Vid	MMDS 3.7 + 7.1.3
Mon 2.09	Clustering : Hierarchical	Vid	MMDS 7.2 | CSTIA 8.7
Wed 2.11	Clustering : K-Means	Vid	MMDS 7.3 | CSTIA 8.3
Mon 2.16	(Presidents Day - No Class)
Wed 2.18	Clustering : Spectral	Vid	MMDS 10.4 | CSTIA 8.4 | Luxburg | Gleich	Document Hash
Mon 2.23	Frequent Items : Heavy Hitters	Vid	MMDS 4.1 | CSTIA 7.1.3 | Min-Count Sketch | Misra-Gries		Data Collection Report
Wed 2.25	Frequent Itemsets : Apriori Algorithm	Vid	MMDS 6+4.3 | Careful Bloom Filter Analysis
Mon 3.02	Regression : Basics in 2-dimensions	Vid	ESL 3.2 and 3.4
Wed 3.04	Regression : SVD + PCA	Vid	Geometry of SVD - Chap 3 | CSTIA 4	Clustering
Mon 3.09	QUIZ #1
Wed 3.11	Regression : Matrix Sketching	V1+V2+V3	MMDS 9.4 | CSTIA 2.7 + 7.2.2 | arXiv
Mon 3.16	(Spring Break - No Class)
Wed 3.18	(Spring Break - No Class)
Mon 3.23	Regression : Compressed Sensing and OMP	Vid	CSTIA 10.3 | Tropp + Gilbert		Intermediate Report
Wed 3.25	Regression : L1 Regression and Lasso	Vid	Davenport | ESL 3.8	Frequent
Mon 3.30	Noise : Noise in Data	Vid	MMDS 9.1 | Tutorial
Wed 4.01	Noise : Privacy	V1+V2+V3	Dwork
Mon 4.06	Graph Analysis : Markov Chains	Vid	MMDS 10.1 + 5.1 | CSTIA 5 | Weckesser notes
Wed 4.08	Graph Analysis : PageRank	V1+V2+V3	MMDS 5.1 + 5.4
Mon 4.13	Graph Analysis : MapReduce	Vid	MMDS 2 | Old Lecture 1, 2, 3 | Overview Lecture	Regression
Wed 4.15	Graph Analysis : PageRank via MapReduce	Vid	MMDS 5.2		Final Report
Mon 4.20	Graph Analysis : Communities	Vid	MMDS 10.2 + 5.5 | CSTIA 8.8 + 3.4		Poster Outline
Wed 4.22	Graph Analysis : Graph Sparsification	Vid	MMDS 4.1
Mon 4.27	QUIZ #2
Wed 4.29	Poster Day !!! (5-7pm)			Graphs	Poster Presentation

LaTeX: I highly recommend using LaTeX for writing up homework. It is something that everyone should know for research and for writing scientific documents. This linked directory contains a sample .tex file, as well as the .pdf it compiles to. It also has a figure .pdf to show how to include figures.