http://www.cs.utah.edu/~suresh
suresh at cs utah edu
Ph: 801 581 8233
Room 3404, School of Computing
50 S. Central Campus Drive,
Salt Lake City, UT 84112.

Wednesdays 1:25-2:45
MEB 3105

## Overview

The fruits of data mining pervade every aspect of our life. We are recommended books and movies; given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power.

Since the internals of a learning process can be complex and opaque, it is important to verify that the results satisfy the properties claimed by the algorithm. The importance of this goes beyond merely checking the algorithm itself. Validation mechanisms also provide a way for users to demand accountability from authorities who might use the results of mining to affect users in some way (by changing their insurance rates, or even putting them on a terrorism watchlist). As the results of data mining affect more and more of our lives, the more crucial it is that the user be able to validate decisions made on their behalf and that affect them.

This is a broad (and fluid!) topic: we will investigate technical challenges in building a framework for validating data mining outcomes, and we will also make forays into the problem of “explanation”: how does one even “explain” the results of data mining (let alone validate it). Some of the topics we will encounter will include interactive proofs, privacy mechanisms, theories of explanations in science, and homomorphic encryption.

## Class Mechanics

Participants in the seminar will be expected to

• present at least one paper

In addition, each student will prepare notes for one lecture (scribing format will be provided).

#### Jan 9: Seminar overview.

We will focus our study of accountability in data mining on three aspects: checking if a computation was performed accurately, protecting data from unwarranted access/computations, and attempting to explain the results of data mining. In all of this, our goal will be to understand what the theory offers, how effective it is, and what questions remain to be asked.

• Kevin Slavin’s talk on how algorithms are taking over the world (it also has a transcript).
• Cory Doctorow’s story Human Readable is an engaging tale of what happens when algorithms control all aspects of our world, and how difficult it is to demand accountability for the behavior of machines (note that the story is available as a podcast, or as part of the anthology With a Little Help).
• Databuse: Digital Privacy and the Mosaic, by Benjamin Wittes. On why privacy as a holistic term is not sufficient for our modern digital data needs.
• Your next mayor: a computer, by Will Doig @ Salon.com. A short speculation on cities of the future.
• Six Provocations for big data, by danah boyd and Kate Crawford. Implications of the data revolution for how we think about data, science, knowledge, and ethics.

#### Scribe: Suresh Venkat.

Jan 16: Checking computations: some background.

An overview of the theory of interactive proofs: in brief, how a weak verifier, with some well chosen queries, can verify the results of a much harder computation presented by an all-powerful prover.

Presenter: Amirali Abdullah.

Scribe: John Moeller (notes)

Jan 23: Checking computations: the big-data problem.

In the world of big data, your ability to check a computation is limited even more so by lack of access to all the data. We will look at new research that does verification of a computation even if you can only watch as data streams by.

Presenter: Chenxu Ding

Scribe: Mina Ghashami

Jan 30: Checking computations: a matter of trust.

Resource limits are one problem with verifying computations. Trust is another. Suppose you are trying to get answers from an untrusted server,  and also have limited access to a trusted entity. Can you ensure that computations are correct ? We will start with a study of Merkle trees: a key data structure that shows up in this setting, and go onto an exploration of authenticated data structures.

Also look at Section 2 of Charalampos Papamanthou’s Ph.D thesis for a good overview of authenticated data structures and the basic model.

Presenter: Samira Daruki

Feb 6: Checking computations: Outsourcing computation.

In the cloud, no one knows you’re a machine. Or something like that anyway. As we outsource more computations to third parties, how can we check that the computations are being done correctly ?

Presenter: Swetha Machanavajhala

Scribe: Amirali Abdullah

Feb 13: NO class: (Suresh in San Diego)

Search engines mine the queries you ask to learn things about you. Can you hide your queries and still get the answers you need ?

Scribe: Samira Daruki

In the Millionaire’s Problem, two millionaires want to figure out who has more money without revealing their actual worth to each other. In general, you might have $n$ players that each possess a variable $x_i$, and they wish to compute a function $f(x_1, \ldots, x_n)$ without anyone learning anyone else’s input.

Readings: Chapter 1 from this book.

Presenter: Yan Zheng

Scribe:

Mar 6: NO class: (Suresh in Germany)
Mar 20: Protecting your data: Anonymizing data

Instead of performing computations on hidden data, you can release a modified version of your data. How should you do that, and what guarantees can you enforce on the data ? We look at some traditional statistical notions of privacy preserving data publishing.

Presenter: Parasaran Raman

Scribe: Swetha Machanavajhala

Mar 27: Protecting your data: Differential Privacy

Differential privacy brings cryptographic principles to the problem of data privacy. How does it work ?

Readings: Differential Privacy (Cynthia Dwork), Differential Privacy Survey

Presenter: John Moeller

Scribe: Samira Daruki

Apr 3: Questions on differential privacy

Some of the papers that we came across in our discussions:

Papers on private/secure near neighbor search:

• In the MPC setting, there’s a (not too) large line of work here, most of which bases on the Yao protocols to compute distances in a distributed manner, add up sums, do comparisons, etc.
• The next is the setting where the server is blind to both the data and the user query. http://www.cs.utah.edu/~lifeifei/papers/snnicde.pdf
• And there’s also a setting where the server has to be blind to the query, but is allowed to know the database. (I.e, a map/GPS based system where the user wants to be guaranteed his locational privacy, but the data is publically available.) Here is one of the most cited papers. Usually, it’s based on a suitable spatial transform and Hilbert / space filling curve. http://dl.acm.org/citation.cfm?id=1376631

Graph Privacy:

Privacy and Clustering:

• Multi-Party computation of Clustering: (Analyst gets to look at scrambled data)
• Data owner publishes clustering results Anonymously (Model 3)

Apr 10: Understanding results: Concise explanations.

Occam’s razor expresses the idea that a model used to explain data should be concise. We will briefly review different approaches for achieving conciseness in models (VC-dimension, MDL, sparsity and so on)

Presenter: Scribe:

Apr 17: Understanding results: Explaining recommendations.

Recommendation engines like Netflix and Amazon will often generate “explanations” along with the recommendation. For example, “We think you’ll like this movie because you watched that one”, or “You should buy this because your friends bought things similar to it” and so on. How does one generate good explanations ? what makes an explanation more effective ? We will look at this in the context of recommendation systems.

Presenter: Parasaran Raman

Scribe:

Apr 24: Understanding results: Causality versus correlation.

One of the easiest mistakes in data mining is confusing correlation with causality. But what is causality anyway ? We’ll take a brief look at Judea Pearl’s theory of counterfactuals and how that might lead to better ways of generating hypotheses from a mining process.