Jeff M. Phillips

This book has evolved from a a series of lecture notes I compiled for two courses. The first is an introductionory class on the mathematical Foundations of Data Analysis at the University of Utah. It is designed for sophomore or junior undergraduates who have had some programming, and some basic exposure to probability and linear algebra.

The second is an advanced Data Mining course also taught at the University of Utah. It is designed for advanced undergraduates and beginning graduate students; its pre-requisites are basically those of the previous introductory course. Not all of the content from that Data Mining course have found there way into the book -- including most of the more algorithmic topics such as sections on streaming algorithms, which are harder to motivate without truely massive amounts of data.

The complete book (v0.6) is available for download.

This book is meant for use with a self-contained course that introduces many of the basic mathematical principles and techniques needed for modern data analysis. In particular, it was constructed from material taught mainly in two courses. The first is an early undergraduate course which was designed to prepare students to succeed in rigorous Machine Learning and Data Mining courses. The second course is that advanced Data Mining course. It should be useful for any combination of such courses.

The book introduces key conceptual tools which are often absent or brief in undergraduate curriculum, and for most students, helpful to see multiple times. On top of these, it introduces the generic versions of the most basic techniques that comprise the backbone of modern data analysis. And then it delves deeper in a few more advanced topics and techniques -- still focusing on clear, intuitive, and lasting ideas, instead of specific details in the ever-evolving state-of-the-art.

Although we touch on Bayesian Inference, we do not cover most of classical statistics; neither frequentist hypothesis testing or the similar Bayesian perspectives. Most universities have well-developed courses on these topics which while also very useful, provide a complimentary view of data analysis. Classical statistical modeling approaches are often essential when a practitioner needs to provide some modeling assumptions to harness maximum power from limited data. But in the era of big data this is not always necessary. Rather, the topics in this course provide tools for using some of the data to help choose the model.

We also survey basic techniques in supervised (regression and classification) and unsupervised (principal component analysis and clustering) learning. We make an effort to keep the presentation and concepts on these topics simple. We mainly initially stick to those methods which attempt to minimize sum of squared errors. We lead with classic but magical algorithms like Lloyd's algorithm for $k$-means, the power method for eigenvectors, and perceptron for linear classification. For many students (even those in a computer science program), these are the first iterative, non-discrete algorithms they will have encountered.

Sometimes the book ventures beyond these basics into concepts like regularization and lasso, locality sensitive hashing, multi-dimensional scaling, spectral clustering, neural net basics, and data sketching. These can be sprinkled in, to allow courses to go deeper and more advanced as is suitable for the level of students.

But arguably more important than writing the code itself is a discussion on when and

Jeff M. Phillips

Salt Lake City, December 2019

1

1.1 Sample Spaces ......................................... 9

1.2 Conditional Probability and Independence........................... 10

1.3 Density Functions........................................ 11

1.4 Expected Value ......................................... 12

1.5 Variance............................................. 13

1.6 Joint, Marginal, and Conditional Distributions......................... 14

1.7 Bayes' Rule........................................... 16

1.7.1 Model Given Data ................................... 17

1.8 Bayesian Inference ....................................... 19

2

2.1 Sampling and Estimation.................................... 25

2.2 Probably Approximately Correct (PAC) ............................ 27

2.3 Concentration of Measure.................................... 28

2.3.1 Union Bound and Examples .............................. 32

2.4 Importance Sampling...................................... 35

2.4.1 Sampling Without Replacement with Priority Sampling................ 38

3

3.1 Vectors and Matrices ...................................... 43

3.2 Addition and Multiplication .................................. 45

3.3 Norms.............................................. 47

3.4 Linear Independence ...................................... 48

3.5 Rank............................................... 49

3.6 Inverse.............................................. 49

3.7 Determinant........................................... 50

3.8 Orthogonality.......................................... 50

4

4.1 Metrics ............................................. 55

4.2 Lp Distances and their Relatives ................................ 55

4.2.1 Lp Distances ...................................... 55

4.2.2 Mahalanobis Distance ................................. 59

4.2.3 Cosine and Angular Distance.............................. 59

4.2.4 KL Divergence..................................... 60

4.3 Distances for Sets and Strings ................................. 61

4.3.1 Jaccard Distance .................................... 61

4.3.2 Edit Distance...................................... 63

4.4 Modeling Text with Distances ................................. 64

4.4.1 Bag-of-Words Vectors ................................. 65

4.4.2 k-Grams ........................................ 67

4.5 Similarities ........................................... 69

4.5.1 Normed Similarities .................................. 69

4.5.2 Set Similarities ..................................... 70

4.6 Locality Sensitive Hashing ................................... 71

4.6.1 Properties of Locality Sensitive Hashing ........................ 74

4.6.2 Prototypical Tasks for LSH............................... 75

4.6.3 Banding to Amplify LSH................................ 78

4.6.4 LSH for Angular Distance ............................... 78

4.6.5 LSH for Euclidean Distance .............................. 79

4.6.6 Minhashing as LSH for Jaccard Distance ....................... 80

5

5.1 Simple Linear Regression.................................... 83

5.2 Linear Regression with Multiple Explanatory Variables.................... 85

5.3 Polynomial Regression ..................................... 89

5.4 Cross Validation......................................... 90

5.5 Regularized Regression..................................... 94

5.5.1 Tikhonov Regularization for Ridge Regression .................... 94

5.5.2 Lasso.......................................... 95

5.5.3 Dual Constrained Formulation............................. 96

5.5.4 Orthogonal Matching Pursuit.............................. 98

6

6.1 Functions ............................................105

6.2 Gradients ............................................106

6.3 Gradient Descent ........................................107

6.3.1 Learning Rate ......................................108

6.4 Fitting a Model to Data .....................................112

6.4.1 Least Mean Squares Updates for Regression ......................113

6.4.2 Decomposable Functions ................................113

7

7.1 Data Matrices ..........................................117

7.1.1 Projections .......................................118

7.1.2 SSE Goal ........................................119

7.2 Singular Value Decomposition .................................120

7.2.1 Best Rank-k Approximation ..............................123

7.3 Eigenvalues and Eigenvectors..................................125

7.4 The Power Method .......................................127

7.5 Principal Component Analysis .................................129

7.6 Multidimensional Scaling ....................................129

7.7 Linear Discriminant Analysis ..................................130

7.8 Distance Metric Learning ....................................131

7.9 Matrix Completion .......................................133

7.10 Random Projections .......................................134

8

8.1 Voronoi Diagrams ........................................139

8.1.1 Delaunay Triangulation .................................142

8.1.2 Connection to Assignment-based Clustering ......................143

8.2 Gonzalez Algorithm for k-Center Clustering ..........................143

8.3 Lloyd's Algorithm for k-Means Clustering ...........................145

8.3.1 Lloyd's Algorithm ...................................146

8.3.2 k-Means++ .......................................148

8.3.3 k-Mediod Clustering ..................................149

8.3.4 Soft Clustering .....................................150

8.4 Mixture of Gaussians ......................................150

8.4.1 Expectation-Maximization ...............................151

8.5 Hierarchical Clustering .....................................152

8.6 Density-based Clustering and Outliers .............................154

8.6.1 Outliers .........................................155

8.6 Mean Shift Clustering ......................................155

9

9.1 Linear Classifiers ........................................159

9.1.1 Loss Functions .....................................161

9.1.2 Cross Validation and Regularization ..........................162

9.2 Perceptron Algorithm ......................................163

9.3 Kernels .............................................166

9.3.1 The Dual: Mistake Counter ..............................167

9.3.2 Feature Expansion ...................................167

9.3.3 Support Vector Machines ................................168

9.4 kNN Classifiers .........................................169

9.5 Neural Networks ........................................169

10

10.1 Markov Chains .........................................175

10.1.1 Ergodic Markov Chains ................................177

10.1.2 Metropolis Algorithm .................................179

10.2 PageRank ............................................ 180

10.3 Spectral Clustering on Graphs .................................182

10.3.1 Laplacians and their Eigen-Structure ..........................183

10.4 Communities in Graphs ..................................... 187

10.4.1 Preferential Attachment ................................188

10.4.2 Betweenness ......................................189

10.4.3 Modularity .......................................189

11

11.1 The Streaming Model ......................................194

11.1.1 Mean and Variance ...................................194

11.1.2 Reservoir Sampling ...................................194

11.2Frequent Items .........................................195

11.2.1 Warm Up: Majority ...................................196

11.2.2 Misra-Gries Algorithm .................................196

11.2.3 Count-Min Sketch ...................................197

11.2.4 Count Sketch .....................................199

11.3 Matrix Sketching ........................................200

11.3.1 Covariance Matrix Summation .............................200

11.3.2 Frequent Directions ...................................201

11.3.3 Row Sampling .....................................202

11.3.4 Random Projections and Count Sketch Hashing for Sparse Matrices .........203