The fruits of data mining pervade every aspect of our life. We are recommended books and movies; given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power.
Since the internals of a learning process can be complex and opaque, it is important to verify that the results satisfy the properties claimed by the algorithm. The importance of this goes beyond merely checking the algorithm itself. Validation mechanisms also provide a way for users to demand accountability from authorities who might use the results of mining to affect users in some way (by changing their insurance rates, or even putting them on a terrorism watchlist). As the results of data mining affect more and more of our lives, the more crucial it is that the user be able to validate decisions made on their behalf and that affect them.
This is a broad (and fluid!) topic: we will investigate technical challenges in building a framework for validating data mining outcomes, and we will also make forays into the problem of “explanation”: how does one even “explain” the results of data mining (let alone validate it). Some of the topics we will encounter will include interactive proofs, privacy mechanisms, theories of explanations in science, and homomorphic encryption.
Participants in the seminar will be expected to
- present at least one paper
- read all the assigned readings and participate in class discussions
In addition, each student will prepare notes for one lecture (scribing format will be provided).
Schedule of Readings
Jan 9: Seminar overview.
We will focus our study of accountability in data mining on three aspects: checking if a computation was performed accurately, protecting data from unwarranted access/computations, and attempting to explain the results of data mining. In all of this, our goal will be to understand what the theory offers, how effective it is, and what questions remain to be asked.
- Kevin Slavin’s talk on how algorithms are taking over the world (it also has a transcript).
- Cory Doctorow’s story Human Readable is an engaging tale of what happens when algorithms control all aspects of our world, and how difficult it is to demand accountability for the behavior of machines (note that the story is available as a podcast, or as part of the anthology With a Little Help).
- Databuse: Digital Privacy and the Mosaic, by Benjamin Wittes. On why privacy as a holistic term is not sufficient for our modern digital data needs.
- Your next mayor: a computer, by Will Doig @ Salon.com. A short speculation on cities of the future.
- Six Provocations for big data, by danah boyd and Kate Crawford. Implications of the data revolution for how we think about data, science, knowledge, and ethics.
Presenter: Suresh Venkat.
Scribe: Suresh Venkat.
Jan 16: Checking computations: some background.
An overview of the theory of interactive proofs: in brief, how a weak verifier, with some well chosen queries, can verify the results of a much harder computation presented by an all-powerful prover.
Presenter: Amirali Abdullah.
Scribe: John Moeller (notes)
Jan 23: Checking computations: the big-data problem.
In the world of big data, your ability to check a computation is limited even more so by lack of access to all the data. We will look at new research that does verification of a computation even if you can only watch as data streams by.
- Verifying Computations with Streaming Interactive Proofs (VLDB 2012)
- Annotations in Data Streams (ECCC, ICALP 2009)
Presenter: Chenxu Ding
Scribe: Mina Ghashami
Jan 30: Checking computations: a matter of trust.
Resource limits are one problem with verifying computations. Trust is another. Suppose you are trying to get answers from an untrusted server, and also have limited access to a trusted entity. Can you ensure that computations are correct ? We will start with a study of Merkle trees: a key data structure that shows up in this setting, and go onto an exploration of authenticated data structures.
- Merkle trees (or hash trees)
- Authenticated dictionaries (Goodrich/Tamassia)
- Jeff Erickson’s notes on skip lists.
Also look at Section 2 of Charalampos Papamanthou’s Ph.D thesis for a good overview of authenticated data structures and the basic model.
Presenter: Samira Daruki
Scribe: Aaditya Landge
Feb 6: Checking computations: Outsourcing computation.
In the cloud, no one knows you’re a machine. Or something like that anyway. As we outsource more computations to third parties, how can we check that the computations are being done correctly ?
Non-Interactive Verifiable Computing: Outsourcing Computation to Untrusted Workers. Rosario Gennaro, Craig Gentry and Bryan Parno. CRYPTO 2010. Here’s a video (with transcript!) describing the material.
Presenter: Swetha Machanavajhala
Scribe: Amirali Abdullah
NO class: (Suresh in San Diego)
Feb 20: Protecting your data: Asking hidden queries.
Search engines mine the queries you ask to learn things about you. Can you hide your queries and still get the answers you need ?
- Private Information Retrieval. Chor, Kushilevitz, Goldreich and Sudan.
- Replication Is Not Needed: Single Database, Computationally-Private Information Retrieval. Kushilevitz and Ostrovsky.
Presenter: Aaditya Landge
Scribe: Samira Daruki
Feb 27: Protecting your data: Computing without revealing your data
In the Millionaire’s Problem, two millionaires want to figure out who has more money without revealing their actual worth to each other. In general, you might have $n$ players that each possess a variable $x_i$, and they wish to compute a function $f(x_1, \ldots, x_n)$ without anyone learning anyone else’s input.
Readings: Chapter 1 from this book.
Presenter: Yan Zheng
NO class: (Suresh in Germany)
Mar 20: Protecting your data: Anonymizing data
Instead of performing computations on hidden data, you can release a modified version of your data. How should you do that, and what guarantees can you enforce on the data ? We look at some traditional statistical notions of privacy preserving data publishing.
- On Measures of Privacy
- Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization
Presenter: Parasaran Raman
Scribe: Swetha Machanavajhala
Mar 27: Protecting your data: Differential Privacy
Differential privacy brings cryptographic principles to the problem of data privacy. How does it work ?
Presenter: John Moeller
Scribe: Samira Daruki
Apr 3: Questions on differential privacy
Some of the papers that we came across in our discussions:
- Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Marianne Winslett:
Functional Mechanism: Regression Analysis under Differential Privacy. PVLDB 2012, 1364-1375
- This paper is the latest in a sequence that seeks only to release the results of data mining, not the data or a query mechanism (Model 3 in our discussion)
Papers on private/secure near neighbor search:
- In the MPC setting, there’s a (not too) large line of work here, most of which bases on the Yao protocols to compute distances in a distributed manner, add up sums, do comparisons, etc.
- The next is the setting where the server is blind to both the data and the user query. http://www.cs.utah.edu/~lifeifei/papers/snnicde.pdf
- And there’s also a setting where the server has to be blind to the query, but is allowed to know the database. (I.e, a map/GPS based system where the user wants to be guaranteed his locational privacy, but the data is publically available.) Here is one of the most cited papers. Usually, it’s based on a suitable spatial transform and Hilbert / space filling curve. http://dl.acm.org/citation.cfm?id=1376631
Privacy and Clustering:
- Multi-Party computation of Clustering: (Analyst gets to look at scrambled data)
- Data owner publishes clustering results Anonymously (Model 3)
Apr 10: Understanding results: Concise explanations.
Occam’s razor expresses the idea that a model used to explain data should be concise. We will briefly review different approaches for achieving conciseness in models (VC-dimension, MDL, sparsity and so on)
Apr 17: Understanding results: Explaining recommendations.
Recommendation engines like Netflix and Amazon will often generate “explanations” along with the recommendation. For example, “We think you’ll like this movie because you watched that one”, or “You should buy this because your friends bought things similar to it” and so on. How does one generate good explanations ? what makes an explanation more effective ? We will look at this in the context of recommendation systems.
- Recommender systems: from algorithms to user experience (Section 5.4)
- Evaluating the effectiveness of explanations for recommender systems
- Netflix recommendations: Beyond the 5 stars (part I)
- Explaining Recommendations: Satisfaction vs. Promotion
- Explaining Collaborative Filtering Recommendations
- Tagsplanations: Explaining Recommendations Using Tags
Presenter: Parasaran Raman
Apr 24: Understanding results: Causality versus correlation.
One of the easiest mistakes in data mining is confusing correlation with causality. But what is causality anyway ? We’ll take a brief look at Judea Pearl’s theory of counterfactuals and how that might lead to better ways of generating hypotheses from a mining process.
- Causal inference in statistics: An overview,” Judea Pearl. Statistics Surveys, 3:96–146, 2009.
- Chapter 21 of Cosma Shalizi’s new book on data analysis.
Presenter: Amirali Abdullah