Mathematical Foundations for Data Analysis
Jeff M. Phillips
This ``book'' consists if a series of lecture notes I have built for an introductionory class on the mathematical
Foundations of Data Analysis
at the University of Utah. It is designed for sophomore or junior undergraduate who has had some programming, and some basic exposure to probability and linear algebra.
These notes also overlap some with other notes on Data Mining : Algorithms, Geometry, and Probability, that builds on some of these topics for more advanced students.
You can either download the complete book or individual chapters below.
0. Title Page and Preface
This book is meant for a self-contained course introducing many basic principles and techniques needed for modern data analysis. In particular, this course was designed as preparation for students planning to take rigorous Machine Learning and Data Mining courses. With this goal in mind, it both introduces key conceptual tools which are often helpful to see multiple times, and the most basic techniques to begin a basic familiarity with the backbone of modern data analysis.
Interaction with other courses.
It is recommended that students taking this class have calculus and a familiarity with programming and algorithms. They should have also taken some probability and/or linear algebra; but we also review key concepts in probability and linear algebra, so as to keep the book self-contained. Thus, it may be appropriate for students to take these classes simultaneously. If appropriately planned for, it is the hope that this course could be taken at the sophomore level so that more rigorous and advanced data analysis classes can already be taken at a junior level.
Although we touch on Bayesian Inference, we do not cover most of classical statistics; neither frequen- tist hypothesis testing or the similar Bayesian perspectives. Most universities have well-developed classes on these topics which while also very useful, provide a complimentary view of data analysis. Classical statistical modeling approaches are often essential when a practitioner needs to provide some modeling assumptions to harness maximum power from limited data. But in the era of big data this is not always necessary. Rather, the topics in this course provide tools for using some of the data to help choose the model.
Scope and topics.
Vital concepts introduced include concentration of measure and PAC bounds, cross-validation, gradient descent, and principal component analysis. These ideas are essential for modern data analysis, but not often taught in other introductory mathematics classes in a computer science or math department. Or if these concepts are taught, they are also presented in a very different context.
We also survey basic techniques in supervised (regression and classification) and unsupervised (principal component analysis and clustering) learning. We make an effort to keep the presentation and concepts on these topics simple. We stick to those which attempt to minimize sum of squared errors, and do not go into regularization. We stick to classic but magical algorithms like Lloyd’s algorithm for k-means, the power method for eigenvectors, and perceptron for linear classification. For many students (especially those in a computer science program), these are the first numerical algorithms they will have encountered.
While this course is mainly focused on a mathematical preparation, what would data analysis be without data? As such we provide discussion on how to use these tools and techniques on actual data, with examples given in python. We choose python since it has many powerful libraries with extremely efficient backends in C. So for most data sets, this provides the proper interface for working with these tools.
But as important as writing the code itself is a discussion on when and when-not to use techniques from the immense toolbox available. This is one of the main ongoing questions a data scientist must ask. And so, the text attempts to introduce the readers to this ongoing discussion.
1. Probability Review
Probability is a critical tool for modern data analysis. It arises in dealing with uncertainty, in randomized algorithms, and in Bayesian analysis. To understand any of these concepts correctly, it is then paramount to have a solid and rigorous statistical foundation. Hence we will review some key definitions.
2. Bayes' Rule
This topic is on Bayes' Rule and Bayesian reasoning. Bayes' Rule is the key component in how to build likelihood functions, which are key to evaluating models based on data. Bayesian reasoning is surprisingly different, it is much more about modeling uncertainty.
This topic will overview a variety of extremely powerful analysis results that span statistics, estimation theorem, and big data. It provides a framework to think about how to aggregate more and more data to get better and better estimates. It will cover the Central Limit Theorem (CLT), Chernoff Hoeffding bounds, and Probably Approximately Correct (PAC) algorithms.
4. Linear Algebra Review
In this chapter we briefly review many key aspects of linear algebra that will be necessary for the remainder of the course.
5. Linear Regression
We introduce the basic model of linear regression. It builds a linear model to predict one variable from one other variable or from a set of other variables. We will demonstrate how this simple technique can extend to building potentially much more complex polynomial models. Then we will introduce the central and extremely powerful idea of cross-validation. This method fundamentally changes the statistical goal of validating a model, to characterizing the data.
6. Gradient Descent
In this chapter we discuss optimizing over general functions f. Typically the function is defined f : R^d -> R; that is its domain is multi-dimensional (in this case d-dimensional) and output is a real scalar (R). This often arises to describe the "cost" of a model which has d parameters which describe the model (e.g., degree (d-1)-polynomial regression) and the goal is to find the parameters with minimum cost. Although there are special cases where we can solve for these optimal parameters exactly, there are many cases where we cannot. What remains in these cases is to analyze the function f, and try to find its minimum point. The most common solution for this is gradient descent where we try to "walk" in a direction so the function decreases until we no-longer can.
7. Principal Component Analysis
This chapter builds a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is to understand the "shape" of the data, for instance a low-dimensional representation that captures most of meaning of the high-dimensional data. This is sometimes referred to as unsupervised learning (as opposed to regression and classification, where the data has labels, known as supervised learning). Like most unsupervised settings, it can be a lot of fun, but its easy to get yourself into trouble if you are not careful.
We will cover many interconnected tools, including the singular value decomposition (SVD), eigenvectors and eigenvalues, the power method, principal component analysis, and multidimensional scaling.
This chapter focuses on automatically grouping data points into subsets of similar points. There are numer- ous ways to define this problem, and most of them are quite messy. And many techniques for clustering actually lack a mathematical formulation. We will focus on what is probably the cleanest and most used formulation: k-means clustering. But, for background, we will begin with a mathematical detour in Voronoi diagrams.
This chapter returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive or negative. Ideas from linear regression can be applied here, but we will instead overview a different, but still beautiful family of techniques based on linear classification.