TAs: Iain Lee (email) | Office hours: Mondays 3-5pm (MEB 3159)

Haocheng Dai (email) | Office hours: Mondays 7-9pm (Zoom)

Peter Jacobs (email) | Office hours: Tuesdays 8-10am (MEB 3159 or Zoom)

Manoj Thanneru (email) | Office hours: Tuesdays 2-4pm (MEB 3159 or Zoom)

Fall 2021 | Tuesdays, Thursdays 10:45 am - 12:05 pm

WEB L103 (Zoom/YouTube)

Catalog number: CS/DS 3190 01

and meets with COMP 5960 01 (meant for non-SoC graduate and non-matriculated students)

Google Calendar of all lectures & office hours

This class will be an introduction to computational data analysis, focusing on the mathematical foundations, but providing some basic experience in analysis techniques. The goal will be to carefully develop and explore several core topics that form the backbone of modern data analysis, including Machine Learning, Data Mining, Artificial Intelligence, and Visualization. This will include some background in probability and linear algebra, and then various topics including Bayes' Rule and its connection to inference, linear regression and its polynomial and high-dimensional extensions, principal component analysis and dimensionality reduction, as well as classification and clustering. We will also focus on modern PAC (probably approximately correct) and cross-validation models for algorithm evaluation.

Some of these topics are often very briefly covered at the end of a probability or linear algebra class, and then are often assumed knowledge in advanced data mining or machine learning classes. This class fills that gap. The planned pace will be closer to CS 3130 or Math 2270 than the 5000-level advanced data analysis courses.

We will use Python in the class to demonstrate and explore basic concepts, but programming will not be the main focus.

Former TA Hasan Poormahmood created a short Python tutorial on loading, manipulating, processing, and plotting data in Python in Colab. Here is the Python notebook so you can follow along.

A free version (v0.6) of the book is available online as a PDF. In v1.0, the formatting and page numbering are updated, and the writing is improved in spots. Some content is also added in v1.0, but it does not affect the parts covered in this course.


As in previous years, lectures will also be live streamed on YouTube.

The official pre-requisites are CS 2100, CS 2420, and Math 2270. These are to ensure a certain very basic mathematical maturity (CS 2100), a basic understanding of how to store and manipulate data with some efficiency (CS 2420), and the basics of linear algebra and high dimensions (Math 2270).

We have as a co-requisite CS 3130 (or Math 3070) to ensure some familiarity with probability.

A few lectures will be devoted to reviewing linear algebra and probability, but at a fast pace and with a focus on the data interpretation of these domains. I understand students now obtain background in data analysis in a variety of different ways; contact the instructor if you think you may manage without these pre-requisites.

This course is a pre-requisite for CS 5350 (Machine Learning) and CS 5140 (Data Mining), and is part of a new Data Science pipeline.

Date | Chapter | Video | Topic | Assignment |
---|---|---|---|---|
Tue 8.24 | | yt | Class Overview | |
Thu 8.26 | Ch 1 - 1.2 | yt | Probability Review : Sample Space, Random Variables, Independence | Quiz 0 |
Tue 8.31 | Ch 1.3 - 1.6 | yt | Probability Review : PDFs, CDFs, Expectation, Variance, Joint and Marginal Distributions (colab) | HW 1 out |
Thu 9.02 | Ch 1.7 | yt | Bayes' Rule : MLEs and Log-likelihoods | |
Tue 9.07 | Ch 1.8 | yt | Bayes' Rule : Bayesian Reasoning | |
Thu 9.09 | Ch 2.1 - 2.2 | yt | Convergence : Central Limit Theorem and Estimation (colab) | Quiz 1 |
Tue 9.14 | Ch 2.3 | yt | Convergence : PAC Algorithms and Concentration of Measure | HW 1 due |
Thu 9.16 | Ch 3.1 - 3.2 | yt | Linear Algebra Review : Vectors, Matrices, Multiplication and Scaling | HW 2 out |
Tue 9.21 | Ch 3.3 - 3.5 | yt | Linear Algebra Review : Norms, Linear Independence, Rank and numpy (colab) | |
Thu 9.23 | Ch 3.6 - 3.8 | yt | Linear Algebra Review : Inverse, Orthogonality | Quiz 2 |
Tue 9.28 | Ch 5.1 | yt | Linear Regression : explanatory & dependent variables (colab) | HW 2 due |
Thu 9.30 | Ch 5.2 - 5.3 | yt | Linear Regression : multiple regression (colab), polynomial regression (colab) | |
Tue 10.05 | Ch 5.4 | yt | Linear Regression : overfitting and cross-validation (colab) | HW 3 out |
Thu 10.07 | Ch 5 | yt | Linear Regression : mini review + slack (colab) | Quiz 3 |
Tue 10.12 | | | | |
Thu 10.14 | | | | |
Tue 10.19 | Ch 6.1 - 6.2 | yt | Gradient Descent : functions, minimum, maximum, convexity & gradients | |
Thu 10.21 | Ch 6.3 | yt | Gradient Descent : algorithmic & convergence (colab) | |
Tue 10.26 | Ch 6.4 | yt | Gradient Descent : fitting models to data and stochastic gradient descent | HW 3 due |
Thu 10.28 | Ch 7.1 - 7.2 | yt | Dimensionality Reduction : project onto a basis | Quiz 4 |
Tue 11.02 | Ch 7.2 - 7.3 | yt | Dimensionality Reduction : SVD and rank-k approximation (colab) | HW 4 out |
Thu 11.04 | Ch 7.4 | yt | Dimensionality Reduction : eigendecomposition and power method (colab) | |
Tue 11.09 | Ch 7.5 - 7.6 | yt1, yt2 | Dimensionality Reduction : PCA, centering (colab), and MDS (colab) | |
Thu 11.11 | Ch 8.1 | yt | Clustering : Voronoi Diagrams + Assignment-based Clustering | Quiz 5 |
Tue 11.16 | Ch 8.3 | yt | Clustering : k-means (colab) | HW 4 due |
Thu 11.18 | Ch 8.4, 8.7 | yt | Clustering : EM, Mixture of Gaussians, Mean-Shift | |
Tue 11.23 | Ch 9.1 | yt | Classification : Linear prediction | HW 5 out |
Thu 11.25 | | | | |
Tue 11.30 | Ch 9.2 | yt | Classification : Perceptron Algorithm | |
Thu 12.02 | Ch 9.3 | yt | Classification : Kernels and SVMs | Quiz 6 |
Tue 12.07 | Ch 9.4 - 9.5 | yt | Classification : Neural Nets, Decision Trees, etc | |
Thu 12.09 | | yt | Semester Review | |
Fri 12.10 | | | | HW 5 due |
Fri 12.17 | | | FINAL EXAM overlaps with (10:30am - 12:30pm) (practice) | |

The homeworks will usually consist of an analytical problem set, and sometimes light programming exercises in Python. When Python is used, we will typically work through examples in class first.

This class has the following collaboration policy:

For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. If you collaborated with another student on homeworks to the extent that you expect your answers may start to look similar, you must explicitly explain on the homework the extent to which you collaborated. Students whose homeworks appear too similar, and who did not explain the collaboration, will get a 0 on that assignment.

I hope the book provides all the information required to understand the material for the class, and for a solid footing beyond. However, it is sometimes useful to also explore other sources. Wikipedia is often a good source on many of these topics. In the past, students have also enjoyed 3Blue1Brown.

Here are a few other books that cover some of the material, but at a more advanced level:

Understanding ML | Foundations of Data Science | Introduction to Statistical Learning

Here is a list of nice resources that I believe may be useful, with relevant parts at roughly the right level for this course, but often with disparate notation: