Math for Data (Foundations of Data Analysis)
Instructor: Jeff Phillips (email) | Office hours: Thursdays 10-11am @ MEB 3442 (and directly after class in MEB 3105)
TA: Mehran Javanmardi (email) | Office hours: Mondays 2-4pm @ MEB 3115
Fall 2016 | Tuesdays and Thursdays, 12:25 pm - 1:45 pm
MEB 3105
Catalog number: CS 4964 01




Description:
This class will be an introduction to computational data analysis, focusing on the mathematical foundations. The goal is to carefully develop and explore several core topics that form the backbone of modern data analysis, including Machine Learning, Data Mining, Artificial Intelligence, and Visualization. This will include some background in probability and linear algebra, and then various topics including Bayes Rule and its connection to inference; linear regression and its polynomial and high-dimensional extensions; principal component analysis and dimensionality reduction; and classification and clustering. We will also focus on modern PAC (probably approximately correct) and cross-validation models for algorithm evaluation.
These topics are often covered only briefly at the end of a probability or linear algebra class, and are then assumed knowledge in advanced data mining or machine learning classes. This class will fill that gap. While some students may want to jump straight to advanced data analysis classes (e.g., CS 5350, CS 5340, CS 5140, CS 5630, CS 6300), it may be wise to take this class first. The planned pace will be closer to CS 3130 or Math 2270 than to the 5000/6000-level courses. Also, some students may want to go back and solidify their foundations if the 5000/6000-level classes felt a bit fast-paced.

The current plan is to use Python in class to demonstrate and explore basic concepts, but programming will not be the main focus.
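For instance, an in-class demonstration might look like the following sketch, which uses numpy to fit a least-squares line to noisy data (the data and numbers here are made up for illustration; this is not an official course example):

    import numpy as np

    # Make up noisy data from the line y = 2x + 1 (illustrative values only).
    rng = np.random.RandomState(0)
    x = rng.uniform(0, 10, size=50)
    y = 2 * x + 1 + rng.normal(0, 1, size=50)

    # Fit a line by least squares: solve for [slope, intercept] in A w = y.
    A = np.column_stack([x, np.ones_like(x)])
    slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
    print(slope, intercept)  # both should land close to 2 and 1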

Book:
I have had trouble finding a single text which covers the concepts in this class at the right level in the right way. Here are a few books that cover some of the material, but at a more advanced level:
  • More Advanced Books: Understanding ML | Foundations of Data Science | Introduction to Statistical Learning
Here is a list of nice resources I believe may be useful, with relevant parts at roughly the right level for this course:
  • Probability: ProbStat course | P1 | P2
  • Bayes Rule/Reasoning: B1 | B2 | B3 | B4
  • Linear Algebra: No-BS Book | LA1 | LA2 | LA3
  • Linear Regression: LR1 | LR2
  • Gradient Descent: GD1 | GD2
  • PCA: PCA1 | PCA2 | PCA3 | PCA4
  • Clustering: C1 | C2 | C3 | C4
  • Classification: L1 | L2 | L3

I hope to post my own notes to accompany each set of lectures.

Prerequisites:
The official pre-requisites are CS 2100 and CS 2420. These are to ensure a basic level of mathematical maturity (CS 2100) and a basic understanding of how to store and manipulate data with some efficiency (CS 2420).

This will be the first iteration of this class. If it goes well, future versions may require CS 3130 and Math 2270 as pre-requisites (so the class can start with less review). This class may then in turn become a pre-requisite for CS 5350, CS 5140, CS 6300, etc., as part of a new Data Science pipeline.

Schedule:
Date | Topic | Assignment
Tue 8.23 | Class Overview
Thu 8.25 | Probability Review : Sample Space, Random Variables, Independence
Tue 8.30 | Probability Review : PDFs, CDFs, Expectation, Variance, Joint and Marginal Distributions | HW 1 out
Thu 9.01 | Bayes Rule
Tue 9.06 | Bayes Rule : Bayesian Reasoning
Thu 9.08 | Convergence : Central Limit Theorem and Estimation
Tue 9.13 | Convergence : PAC Algorithms and Concentration of Measure | HW 1 due
Thu 9.15 | Linear Algebra Review : Vectors, Matrices, Multiplication and Scaling | Quiz 1
Tue 9.20 | Linear Algebra Review : Norms, Linear Independence, Rank | HW 2 out
Thu 9.22 | Linear Algebra Review : Inverse, Orthogonality, numpy
Tue 9.27 | Linear Regression : dependent, independent variables
Thu 9.29 | Linear Regression : multiple regression, polynomial regression
Tue 10.04 | Linear Regression : overfitting and cross-validation | HW 2 due
Thu 10.06 | Linear Regression : (slack) or kernels | Quiz 2
Tue 10.11 | FALL BREAK
Thu 10.13 | FALL BREAK
Tue 10.18 | Gradient Descent : functions, minimum, maximum, convexity | HW 3 out
Thu 10.20 | Gradient Descent : gradients and algorithmic variants
Tue 10.25 | Gradient Descent : fitting models to data and stochastic gradient descent
Thu 10.27 | PCA : SVD
Tue 11.01 | PCA : (oops -- class retroactively canceled)
Thu 11.03 | PCA : rank-k approximation and eigenvalues | HW 3 due
Tue 11.08 | PCA : power method (Election Day -- don't forget to vote) | HW 4 out
Thu 11.10 | PCA : centering, MDS, and dimensionality reduction | Quiz 3
Tue 11.15 | Clustering : Voronoi Diagrams
Thu 11.17 | Clustering : k-means
Tue 11.22 | Clustering : EM | HW 4 due
Thu 11.24 | THANKSGIVING | HW 5 out
Tue 11.29 | Classification : Linear prediction
Thu 12.01 | Classification : Perceptron Algorithm
Tue 12.06 | Classification : variants (kernels, KNN, maybe neural nets)
Thu 12.08 | in-class Review | Quiz 4
Fri 12.09 | | HW 5 due
Mon 12.12 | FINAL EXAM (10:30am - 12:30pm) | (practice)



Class Organization: The class will be run through this webpage and Canvas. The schedule, notes, and links will be maintained here. All homeworks will be turned in through Canvas.


Grading: There will be one final exam worth 20% of the grade. Homeworks and quizzes will be worth the remaining 80%. There will be 5 homeworks and 4 quizzes; the lowest one (either one homework or one quiz) will be dropped, so each of the 8 counted homeworks/quizzes is worth 10% of the grade.
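To make the arithmetic concrete, here is a small sketch (the scores are made up for illustration) of how a final percentage would be assembled under this scheme:

    # Hypothetical percentage scores for the 5 homeworks and 4 quizzes.
    hw_and_quiz_scores = [92, 85, 78, 88, 95, 70, 90, 82, 60]
    final_exam_score = 84

    # Drop the single lowest homework/quiz; the remaining 8 count 10% each,
    # and the final exam counts for the other 20%.
    counted = sorted(hw_and_quiz_scores)[1:]
    final_grade = 0.10 * sum(counted) + 0.20 * final_exam_score
    print(final_grade)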

The homeworks will usually consist of an analytical problem set, and sometimes light programming exercises in Python. When Python is used, we will typically work through examples in class first.


Late Policy: To get full credit for an assignment, it must be turned in through Canvas by the start of class, specifically 12-noon. Assignments turned in after the 12-noon deadline lose 10%, and for every subsequent 24 hours until they are turned in, another 10% is deducted. For example, a homework worth 10 points turned in 30 hours late will have lost 2 points. Once the graded assignment is returned, or 48 hours have passed, any assignment not yet turned in will be given a 0.
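A sketch of that deduction schedule as code (my own reading of the rule above, not an official calculator; the exact boundary cases are assumptions):

    import math

    def late_penalty_fraction(hours_late):
        """Fraction of credit lost when turned in `hours_late` past 12-noon."""
        if hours_late <= 0:
            return 0.0   # on time
        if hours_late > 48:
            return 1.0   # past the 48-hour cutoff: scored 0
        # 10% for missing the deadline, plus 10% per full subsequent 24 hours.
        return 0.10 * (1 + math.floor(hours_late / 24))

    print(late_penalty_fraction(30))  # 0.2 -> a 10-point homework loses 2 points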


Academic Conduct Policy: The Utah School of Computing has an academic misconduct policy, which requires all registered students to sign an Acknowledgement Form. This form must be signed and turned in to the department office before any homeworks are graded.

This class has the following collaboration policy:
For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. If you collaborated with another student on a homework to the extent that you expect your answers may start to look similar, you must explicitly explain on the homework the extent to which you collaborated. Students whose homeworks appear too similar and who did not explain the collaboration will get a 0 on that assignment.

For quizzes and the final exam, talking to anyone (other than instructors/TAs) during the examination period is not allowed and will result in a 0 on that test or quiz.