Mathematical Foundations for Data Analysis
Jeff M. Phillips
This book has evolved from a series of lecture notes I compiled for two courses. The first is an introductory class on the Mathematical Foundations of Data Analysis at the University of Utah. It is designed for sophomore or junior undergraduates who have had some programming and some basic exposure to probability and linear algebra.
The second is an advanced Data Mining course, also taught at the University of Utah. It is designed for advanced undergraduates and beginning graduate students; its prerequisites are basically those of the introductory course. Not all of the content from that Data Mining course has found its way into the book -- including most of the more algorithmic topics, such as sections on streaming algorithms, which are harder to motivate without truly massive amounts of data.


The complete book (v0.6) is available for download.


Preface
This book is meant for use with a self-contained course that introduces many of the basic mathematical principles and techniques needed for modern data analysis. In particular, it was constructed from material taught mainly in two courses. The first is an early undergraduate course designed to prepare students to succeed in rigorous Machine Learning and Data Mining courses. The second is the advanced Data Mining course described above. The book should be useful for any combination of such courses.
The book introduces key conceptual tools which are often absent or treated only briefly in the undergraduate curriculum, and which most students benefit from seeing multiple times. On top of these, it introduces the generic versions of the most basic techniques that comprise the backbone of modern data analysis. It then delves deeper into a few more advanced topics and techniques -- still focusing on clear, intuitive, and lasting ideas, instead of specific details in the ever-evolving state of the art.

Interaction with other courses. Students taking a course taught from this book should have taken calculus and, especially for the more advanced material, a basic familiarity with programming and algorithms is useful for fully understanding the computational techniques. Students also benefit from some prior exposure to probability and linear algebra, but this text also reviews key concepts in these areas, so as to keep the book more self-contained. It may be appropriate for students to take these classes before or concurrently. If appropriately planned for, the hope is that a first course taught from this book could be taken as early as the undergraduate sophomore level, so that more rigorous and advanced data analysis classes can already be taken during the junior year.

Although we touch on Bayesian Inference, we do not cover most of classical statistics; neither frequentist hypothesis testing nor the similar Bayesian perspectives. Most universities have well-developed courses on these topics which, while also very useful, provide a complementary view of data analysis. Classical statistical modeling approaches are often essential when a practitioner needs to provide some modeling assumptions to harness maximum power from limited data. But in the era of big data this is not always necessary. Rather, the topics in this course provide tools for using some of the data to help choose the model.

Scope and topics. Vital concepts introduced include concentration of measure and PAC bounds, cross-validation, gradient descent, a variety of distances, principal component analysis, and graphs. These ideas are essential for modern data analysis, but are not often taught in other introductory mathematics classes in a computer science or math department; or, when these concepts are taught, they are presented in a very different context.

We also survey basic techniques in supervised (regression and classification) and unsupervised (principal component analysis and clustering) learning. We make an effort to keep the presentation and concepts on these topics simple, and initially stick mainly to those methods which attempt to minimize the sum of squared errors. We lead with classic but magical algorithms like Lloyd's algorithm for $k$-means, the power method for eigenvectors, and the perceptron for linear classification. For many students (even those in a computer science program), these are the first iterative, non-discrete algorithms they will have encountered; a small sketch of one such algorithm appears below.
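
As a taste of this style of algorithm, here is a minimal sketch of the power method in Python with NumPy. It is not taken from the book itself; the example matrix and the fixed iteration count are illustrative placeholders.

    import numpy as np

    def power_method(A, iterations=100):
        # Start from a random vector and repeatedly multiply by A;
        # the result converges to the dominant eigenvector of A.
        v = np.random.rand(A.shape[0])
        for _ in range(iterations):
            v = A @ v                  # push v toward the top eigenvector
            v = v / np.linalg.norm(v)  # renormalize so v does not blow up
        return v

    # A small symmetric matrix, made up for illustration.
    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
    v = power_method(A)
    print(v, v @ A @ v)  # eigenvector estimate and its eigenvalue estimate

Each pass through the loop is just a matrix-vector product and a renormalization; the answer emerges in the limit rather than after a fixed, discrete number of combinatorial steps.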
Sometimes the book ventures beyond these basics into concepts like regularization and lasso, locality sensitive hashing, multi-dimensional scaling, spectral clustering, neural net basics, and data sketching. These can be sprinkled in to allow courses to go deeper and become more advanced, as is suitable for the level of the students.

On data. While this text is mainly focused on mathematical preparation, what would data analysis be without data? As such, we provide discussion on how to use these tools and techniques on actual data, with simple examples given in Python. We choose Python since it has increasingly many powerful libraries, often with efficient backends in low-level languages like C or Fortran. So for most data sets, this provides the proper interface for working with these tools.
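
For instance, here is a minimal sketch of this workflow (the tiny data set is made up for illustration): a few lines of Python hand the real numerical work of a least-squares fit to NumPy's compiled backend.

    import numpy as np

    # A tiny made-up data set: explanatory values x and noisy responses y.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

    # Build the design matrix [1, x] and solve the least-squares problem;
    # the heavy lifting runs in NumPy's efficient low-level (LAPACK) backend.
    X = np.column_stack([np.ones_like(x), x])
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"fit: y = {alpha:.2f} + {beta:.2f} x")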

But arguably more important than writing the code itself is a discussion of when and when not to use techniques from the immense toolbox available. This is one of the main ongoing questions a data scientist must ask. And so, the text attempts to introduce readers to this question -- but resists diving into an explicit discussion of the actual tools available.

Examples, Geometry, and Ethics. Three themes that this text highlights to try to aid in the understanding and broader comprehension of these fundamentals are examples, geometry, and ethical connotations. These are each offset in colored boxes.
  • Examples with a Focus on Simplicity: We try to provide numerous simple examples to demonstrate key concepts. We aim to be as simple as possible, and to make data examples small, so they can be fully digested.
  • Geometry of Data and Proofs: Many of the ideas in this text are inherently geometric, and hence we attempt to provide many geometric illustrations to show what is going on. These boxes often go more in depth into the material, and include the most technical proofs.
  • Ethics of Data Analysis: As data analysis glides into an abstract, automatic, but nebulous place within decision making everywhere, the surrounding ethical questions are becoming more important. We highlight various ethical questions which may arise in the course of using the analysis described in this text. We intentionally do not offer solutions, since there may be no single good answer to some of the dilemmas presented. Moreover, we believe the most important part of instilling positive ethics is to make sure analysts at least think about the consequences, which we hope is partially achieved via these highlighting boxes.

Thanks. I would like to thank the NSF for its gracious support in the form of grants CCF-1350888, IIS-1251019, ACI-1443046, CNS-1514520, CNS-1564287, and IIS-1816149, which have funded my cumulative research efforts during the writing of this text. I would also like to thank the University of Utah, as well as the Simons Institute for the Theory of Computing, for providing excellent work environments while this text was written. And thanks to Natalie Cottrill, Yang Gao, Koriann South, and many other students for careful reading and feedback.

This version (v0.6), released online in December 2019, includes about 20 additional pages over the previous version (v0.5), many new figures and examples, and hopefully the eradication of most of the typos. Any updates, and eventually links to code snippets, can be found at mathfordata.com. This is likely the final public free online version of the book. A forthcoming formally published version will hopefully correct any lingering typos, add a few more clarifying examples, and have more exercises.


    Jeff M. Phillips
    Salt Lake City, December 2019


    Contents

1 Probability Review
   1.1 Sample Spaces
   1.2 Conditional Probability and Independence
   1.3 Density Functions
   1.4 Expected Value
   1.5 Variance
   1.6 Joint, Marginal, and Conditional Distributions
   1.7 Bayes' Rule
     1.7.1 Model Given Data
   1.8 Bayesian Inference

2 Convergence and Sampling
   2.1 Sampling and Estimation
   2.2 Probably Approximately Correct (PAC)
   2.3 Concentration of Measure
     2.3.1 Union Bound and Examples
   2.4 Importance Sampling
     2.4.1 Sampling Without Replacement with Priority Sampling

3 Linear Algebra Review
   3.1 Vectors and Matrices
   3.2 Addition and Multiplication
   3.3 Norms
   3.4 Linear Independence
   3.5 Rank
   3.6 Inverse
   3.7 Determinant
   3.8 Orthogonality

4 Distances and Nearest Neighbors
   4.1 Metrics
   4.2 Lp Distances and their Relatives
     4.2.1 Lp Distances
     4.2.2 Mahalanobis Distance
     4.2.3 Cosine and Angular Distance
     4.2.4 KL Divergence
   4.3 Distances for Sets and Strings
     4.3.1 Jaccard Distance
     4.3.2 Edit Distance
   4.4 Modeling Text with Distances
     4.4.1 Bag-of-Words Vectors
     4.4.2 k-Grams
   4.5 Similarities
     4.5.1 Normed Similarities
     4.5.2 Set Similarities
   4.6 Locality Sensitive Hashing
     4.6.1 Properties of Locality Sensitive Hashing
     4.6.2 Prototypical Tasks for LSH
     4.6.3 Banding to Amplify LSH
     4.6.4 LSH for Angular Distance
     4.6.5 LSH for Euclidean Distance
     4.6.6 Minhashing as LSH for Jaccard Distance

5 Linear Regression
   5.1 Simple Linear Regression
   5.2 Linear Regression with Multiple Explanatory Variables
   5.3 Polynomial Regression
   5.4 Cross Validation
   5.5 Regularized Regression
     5.5.1 Tikhonov Regularization for Ridge Regression
     5.5.2 Lasso
     5.5.3 Dual Constrained Formulation
     5.5.4 Orthogonal Matching Pursuit

6 Gradient Descent
   6.1 Functions
   6.2 Gradients
   6.3 Gradient Descent
     6.3.1 Learning Rate
   6.4 Fitting a Model to Data
     6.4.1 Least Mean Squares Updates for Regression
     6.4.2 Decomposable Functions

7 Principal Component Analysis
   7.1 Data Matrices
     7.1.1 Projections
     7.1.2 SSE Goal
   7.2 Singular Value Decomposition
     7.2.1 Best Rank-k Approximation
   7.3 Eigenvalues and Eigenvectors
   7.4 The Power Method
   7.5 Principal Component Analysis
   7.6 Multidimensional Scaling
   7.7 Linear Discriminant Analysis
   7.8 Distance Metric Learning
   7.9 Matrix Completion
   7.10 Random Projections

8 Clustering
   8.1 Voronoi Diagrams
     8.1.1 Delaunay Triangulation
     8.1.2 Connection to Assignment-based Clustering
   8.2 Gonzalez Algorithm for k-Center Clustering
   8.3 Lloyd's Algorithm for k-Means Clustering
     8.3.1 Lloyd's Algorithm
     8.3.2 k-Means++
     8.3.3 k-Medoid Clustering
     8.3.4 Soft Clustering
   8.4 Mixture of Gaussians
     8.4.1 Expectation-Maximization
   8.5 Hierarchical Clustering
   8.6 Density-based Clustering and Outliers
     8.6.1 Outliers
   8.7 Mean Shift Clustering

9 Classification
   9.1 Linear Classifiers
     9.1.1 Loss Functions
     9.1.2 Cross Validation and Regularization
   9.2 Perceptron Algorithm
   9.3 Kernels
     9.3.1 The Dual: Mistake Counter
     9.3.2 Feature Expansion
     9.3.3 Support Vector Machines
   9.4 kNN Classifiers
   9.5 Neural Networks

10 Graphs
   10.1 Markov Chains
     10.1.1 Ergodic Markov Chains
     10.1.2 Metropolis Algorithm
   10.2 PageRank
   10.3 Spectral Clustering on Graphs
     10.3.1 Laplacians and their Eigen-Structure
   10.4 Communities in Graphs
     10.4.1 Preferential Attachment
     10.4.2 Betweenness
     10.4.3 Modularity

11 Big Data and Sketching
   11.1 The Streaming Model
     11.1.1 Mean and Variance
     11.1.2 Reservoir Sampling
   11.2 Frequent Items
     11.2.1 Warm Up: Majority
     11.2.2 Misra-Gries Algorithm
     11.2.3 Count-Min Sketch
     11.2.4 Count Sketch
   11.3 Matrix Sketching
     11.3.1 Covariance Matrix Summation
     11.3.2 Frequent Directions
     11.3.3 Row Sampling
     11.3.4 Random Projections and Count Sketch Hashing for Sparse Matrices