Foundations for Geometric Analysis of Noisy Data
PI : Jeff M. Phillips
NSF CCF-1350888 (May 2014 - April 2019; on NCE thru April 2021)
An important role of computational geometry is to understand and formalize the structure of data. And as data is becoming a central currency of modern science, this role is growing in importance. However, much of classical computational geometry inherently assumes that all aspects of data are known and precise. This is rarely the case in practice. This project focuses on building the foundations for two extensions to classic geometric settings pertinent to noisy data.

1. We study locational uncertainty in point sets, where the location of each data point is described by a probability distribution. Given such an input, the goal is to formalize how to construct, approximate, and concisely represent the distribution of geometric queries on this uncertain data.

2. We study the geometric consequences of applying a statistical kernel (e.g. a Gaussian kernel) to a data set. He will investigate how this process can smooth data, remove degeneracies, and implicitly simplify and regularize algorithms. Moreover, he will explore the geometric structure of the resulting kernel density estimate, and how it relates to algorithms for the data and approximate representations of the data.

The PI will lead the development of a data-focused educational program around the themes of data analysis, algorithmics, and visualization. The PI is developing a model course for this program on data mining; it focuses on the geometric, statistical, and algorithmic properties of data. An extensive set of course notes is being compiled, accompanied with videotaped lectures freely available online. This class and program attract many interdisciplinary and diverse students and observers. This program is part of a larger effort to make relevant data analysis techniques from computational geometry available to a broader data-rich audience.


Publications:
  • Finding an Approximate Mode of a Kernel Density Estimate.
         Jasper C.H. Lee, Jerry Li, Christopher Musco, Jeff M. Phillips, and Wai Ming Tai. European Symposium on Algorithms (ESA). September 2021.
         arXiv:1912.07673. December 2019.

  • Optimal Courset for Gaussian Kernel Density Estimation.
         Wai Ming Tai. arXiv:2007.08031. July 2020.

  • The GaussianSketch for Almost Relative Error Kernel Distance.
         Jeff M. Phillips and Wai Ming Tai. International Conference on Randomization and Computation (RANDOM). August 2020.
        arXiv:1811.04136. December 2019.

  • Sketched MinDist.
         Jeff M. Phillips and Pingfan Tang. International Symposium on Computational Geometry (SOCG). June 2020.
        arXiv:1907.02171. July 2019.

  • On Measuring and Mitigating Biased Inferences of Word Embeddings.
         Sunipa Dev, Tao Li, Jeff M. Phillips and Vivek Srikumar. AAAI Conference on Artificial Intellegence (AAAI). February 2020.
        arXiv:1908.09369. August 2019.

  • Simple Distances for Trajectories via Landmarks.
         Jeff M. Phillips and Pingfan Tang. ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL). November 2019.
        arXiv:1804.11284. June 2019.

  • The Kernel Spatial Scan Statistic.
         Mingxuan Han, Michael Matheny, and Jeff M. Phillips. ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL). November 2019.
        arXiv:1906.09381. June 2019.

  • Closed Form Word Embedding Alignment. (Invited to KAIS Special Issue)
         Sunipa Dev, Saffia Hassan, and Jeff M. Phillips. International Conference on Data Mining (ICDM). November 2019.
        arXiv:1806.01330. June 2018.

  • On the VC Dimension of Metric Balls under Frechet and Hausdorff Distances.
         Anne Driemel, Andre Nusser(*), Jeff M. Phillips, Ioannis Psarros. International Symposium on Computational Geometry (SoCG). June 2019.
        arXiv:1903.03211. November 2019 (*: adds upper bound with Hausdorff in high dimensions, Andre Nusser added as co-author).

  • Attenuating Bias in Word Vectors.
         Sunipa Dev and Jeff M. Phillips. International Conference on Artificial Intelligence and Statistics (AIStats). April 2019.
        arXiv:1901.07656. January 2019.

  • Computing Approximate Statistical Discrepancy.
         Michael Matheny and Jeff M. Phillips. International Symposium on Algorithm and Computation (ISAAC). December 2018.
        arXiv:1804.11287. April 2018.

  • Approximating the Distribution of the Median and other Robust Estimators on Uncertain Data.
         Kevin Buchin, Jeff M. Phillips and Pingfan Tang. International Symposium on Computational Geometry (SOCG). June 2018.
        arxiv:1601.00630. March 2018.

  • Near-Optimal Coresets of Kernel Density Estimates. (Invited to Special Issue)
         Jeff M. Phillips and Wai Ming Tai. International Symposium on Computational Geometry (SOCG). June 2018.
        arxiv:1802.01751. February 2018.

  • Improved Coresets for Kernel Density Estimates.
         WaiMing Tai and Jeff M. Phillips. 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SoDA). January 2018.
        arxiv:1710.04325. October 2017.

  • Visualization of Big Spatial Data using Coresets for Kernel Density Estimates.
         Yan Zheng, Yi Ou, Alexander Lex, and Jeff M. Phillips. Visual Data Science (VDS). October 2017.
        arxiv:1709.04453. September 2017.
        Project Page (and code).

  • Visualizing Sensor Network Coverage with Location Uncertainty.
         Tim Sodergren, Jessica Hair, Jeff M. Phillips, and Bei Wang. Visual Data Science (VDS). October 2017.
        arxiv:1710.06925. September 2017.

  • Relative Error Embeddings for the Gaussian Kernel Distance.
         Di Chen and Jeff M. Phillips. Algorithmic Learning Theory (ALT). October 2017.
        arxiv:1602.05350. February 2016.

  • Coresets for Kernel Regression.
         Yan Zheng and Jeff M. Phillips. ACM Conference on Knowledge Discovery and Data Mining (KDD). August 2017.
         arxiv:1702.03644. February 2017.
         Project Page

  • The Robustness of Estimator Composition.
         Pingfan Tang and Jeff M. Phillips. Conference on Neural Information Processing (NIPS). December 2016.

  • epsilon-Kernel Coresets for Stochastic Points.
         Lingxiao Huang, Jian Li, Jeff M. Phillips, and Haitao Wang. European Symposium on Algorithms (ESA). August 2016.
         arxiv.org:1411.0194. November 2014.

  • An Integrated Classification Scheme for Mapping Estimates and Errors of Estimation from the American Community Survey.
         Ran Wei, Daoqin Tong, and Jeff M. Phillips. Computers, Environment and Urban Systems (CEUS). April 2016.

  • Coresets and Sketches.
         Jeff M. Phillips. Handbook of Discrete and Computational Geometry. 3rd edition, CRC Press, Chapter 48. 2016.
         arxiv.org:1601.00617. January 2016.

  • Streaming Kernel Principal Component Analysis.
         Mina Ghashami, Daniel Perry, and Jeff M. Phillips. International Conference on Artificial Intelligence and Statistics (AISTATS). May 2016.
         arxiv.org:1512.05059. December 2015.

  • Subsampling in Smooth Range Spaces.
         Jeff M. Phillips and Yan Zheng. Algorithmic Learning Theory (ALT). October 2015.
         short version appeared in Computational Geometry : Young Researchers Forum. June 2015.

  • L_infity Error and Bandwidth Selection for Kernel Density Estimates of Large Data.
         Yan Zheng and Jeff M. Phillips. ACM Conference on Knowledge Discovery and Data Mining (KDD). August 2015.

  • Geometric Inference on Kernel Density Estimates.
         Jeff M. Phillips, Bei Wang, and Yan Zheng. International Symposium on Computational Geometry (SOCG). June 2015.
         full version: arXiv:1307.7760. March 2015.
         early version appeared as Kernel Distance for Geometric Inference. Jeff M. Phillips and Bei Wang. 22nd Fall Workshop on Computational Geometry. October 2012.


    Educational Development and Outreach:
    Mathematical Foundations for Data Analysis is a textbook (with verison free online) published by Springer-Nature in Spring 2021.
    This text book specifically supports two courses: an undergraduate course on the Foundations of Data Analysis and a graduate course in Data Mining. Both cover this material (now essential in data science) from a very geomtric and algorithmic perspective.

  • Data Science Program (formly Big Data Program) with bent towards algorithmic principles and geometric interpretation of data. Includes new Bachelors in Data Science, Graduate and Undergraduate Certificates in Data Science, and a track in Data Management and Analysis within the MS and PhD degrees in Computing.
    Program Youtube Channel with lectures from many classes.

  • Utah Data Science Day co-organized in 2017, 2018, 2019. Each had over 150 registered participants. Job fairs with 15-25 local companies. Industry panels, poster sessions, invited talks, information sessions about educational opportunties. Bridging the foundational aspects of data analysis with the application domains which motivate the problems and benefit from the developments.

  • Workshop on Geometry and Machine Learning co-organized at CG Week in 2016, 2017, 2018, 2019, 2021. Highlighted developments of computational geometry to and from the Machine Learning commuinity.


    Personnel:
  • PI: Jeff Phillips
  • GRA: Yan Zheng (PhD 2017, first job @ Visa Research)
  • GRA: Pingfan Tang (PhD 2019, first job @ Google)
  • GRA: Sunipa Dev (PhD 2020, first job @ NSF CI Fellow, UCLA)
  • GRA: Wai Ming Tai (PhD 2020, first job @ postdoc, UChicago)
  • Ug: Benwei Shi (BS Thesis 2017, now PhD student in CS @ Utah)
  • Ug: Jamie Iong (BS Thesis 2015, first job @ EMC)