Course Welcome

Security and Privacy in Data Science (CS 763)

September 02, 2020

Welcome to Virtual CS 763!

Norms for virtual class

Mute yourself when you are not talking
Recommended (not required): turn on your video
Use the chat for questions/side discussions

If you wouldn’t do it in a real classroom, you probably shouldn’t do it virtually.

Guidelines for discussion

Basically: be nice to one another
WAIT: Why Am I Talking?
One mic: one person speaks at a time

Remote students

Strongly recommended to attend live lectures
If you can’t (e.g., lecture in the middle of the night):
- All lectures will be recorded on BBCU: watch them
- Do two paper reviews per week instead of presentation+summary

Let me know ASAP if you are remote so I can set you up with paper reviews

Security and Privacy

It’s everywhere!

Stuff is totally insecure!

What topics to cover?

A really, really vast field

Things we will not be able to cover:
- Real-world attacks
- Computer systems security
- Defenses and countermeasures
- Social aspects of security
- Theoretical cryptography
- …

Theme 1: Formalizing S&P

Mathematically formalize notions of security
Rigorously prove security
Guarantee that certain breakages can’t occur

Remember: definitions are tricky things!

Theme 2: Automating S&P

Use computers to help build more secure systems
Automatically check security properties
Search for attacks and vulnerabilities

Five modules

Differential privacy
Adversarial machine learning
Cryptography in machine learning
Algorithmic fairness
PL and verification

This course is broad!

Each module could be its own course
- We won’t be able to go super deep
- You will probably get lost
Our goal: broad survey of multiple areas
- Lightning tour, focus on high points

Hope: find a few things that interest you

This course is technical!

Approach each topic from a rigorous point of view
Parts of “data science” with provable guarantees
This is not a “theory course”, but…

Differential privacy

A mathematical definition of privacy

Simple and clean formal property
Satisfied by many algorithms
Degrades gracefully under composition

Adversarial machine learning

Manipulating ML systems

Crafting examples to fool ML systems
Messing with training data
Extracting training information

Cryptography in machine learning

Crypto in data science

Learning models without raw access to private data
Collecting analytics data privately, at scale
Side channels and implementation issues
Verifiable execution of ML models
Other topics (e.g., model watermarking)

Algorithmic fairness

When is a program “fair”?

Individual and group fairness
Inherent tradeoffs and challenges
Fairness in unsupervised learning
Fairness and causal inference

PL and verification

Proving correctness

Programming languages for security and privacy
Interpreting neural networks and ML models
Verifying properties of neural networks
Verifying probabilistic programs

Tedious course details

Lecture schedule

First ten weeks: lectures MWF
- Intensive lectures, get you up to speed
- I will present once a week
- You will present twice a week
Last five weeks: no lectures
- Intensive work on projects
- I will be available to meet, one-on-one

You should attend/watch all lectures

Class format

Three components:
1. Paper presentations
2. Presentation summaries
3. Final project
Announcements/schedule/materials on website
Discussions/forming groups on Piazza

Paper presentations

In pairs, lead a discussion on a group of papers
- See website for detailed instructions
- See website for schedule of topics
One week before presentation: meet with me
- Come prepared with draft slides and outline
- Run through your outline, I will give feedback

Presentation summaries

In pairs, prepare a written summary of another group’s presentation
- See website for detailed instructions
- See website for schedule of topics
One week after presentation: send me summary
- I will work with you to polish report
- Writeups will be shared with the class

Final project

In groups of 2-3
See website for project details
Key dates:
- October 12: Milestone 1
- November 6: Milestone 2
- End of class: Final writeups and presentations

Todos for you

Complete the course survey
Explore the course website
Think about which lecture you want to present
Think about which lecture you want to summarize
Form project groups and brainstorm topics

Sign up for slots and projects here

We will move quickly

First deadline: next Wednesday, September 9
- Form paper and project groups
- Signup sheet here
First slot is soon: Monday, September 14
- I will help the first group prepare

Defining privacy

What does privacy mean?

Many kinds of “privacy breaches”
- Obvious: third party learns your private data
- Retention: you give data, company keeps it forever
- Passive: you don’t know your data is collected

Why is privacy hard?

Hard to pin down what privacy means!
Once data is out, it can’t be taken back
A data release that preserves privacy today may violate it tomorrow, once combined with “side-information”
Data may be used many times, often doesn’t change

Hiding private data

Delete “personally identifiable information”
- Name and age
- Birthday
- Social security number
- …
Publish the “anonymized” or “sanitized” data

Problem: not enough

Can match up anonymized data with public sources
De-anonymize data, associate names to records
Really, really hard to think about side information
- May not even be public at time of data release!

Netflix prize

Database of movie ratings
Published: ID number, movie rating, and rating date
Competition: predict which movies each ID will like
Result
- Tons of teams competed
- Winner: beat Netflix’s best by 10%

A triumph for machine learning contests!

Privacy flaw?

Attack
- Public info on IMDB: names, ratings, dates
- Reconstruct names for Netflix IDs
Result
- Netflix settled lawsuit ($10 million)
- Netflix canceled future challenges
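
The matching step of such an attack can be sketched in a few lines. This is a toy illustration only: every record, name, and ID below is made up, and the `link` helper is hypothetical. It assumes exact matches on (movie, rating, date), whereas a real attack would have to tolerate noisy and partial matches.

```python
# Toy linkage attack. All records, names, and IDs are invented for illustration.

# "Anonymized" release: names removed, but ratings and dates kept.
anonymized = [
    {"id": 101, "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"id": 101, "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
    {"id": 102, "movie": "Movie A", "rating": 3, "date": "2005-06-10"},
    {"id": 102, "movie": "Movie C", "rating": 4, "date": "2005-06-11"},
]

# Public side information: reviews posted elsewhere under real names.
public = [
    {"name": "alice", "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"name": "alice", "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
    {"name": "bob", "movie": "Movie C", "rating": 4, "date": "2005-06-11"},
]


def link(anonymized, public):
    """Map each anonymized ID to the names whose public reviews match it."""
    def key(record):
        return (record["movie"], record["rating"], record["date"])

    # Index the public reviews by (movie, rating, date).
    names_by_key = {}
    for record in public:
        names_by_key.setdefault(key(record), set()).add(record["name"])

    # Look up every anonymized record in that index.
    matches = {}
    for record in anonymized:
        for name in names_by_key.get(key(record), ()):
            matches.setdefault(record["id"], set()).add(name)
    return matches


reidentified = link(anonymized, public)
print(reidentified)
```

In this toy data, each ID matches exactly one name, so removing the names protected nothing.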

“Blending in a crowd”

Only release records that are similar to others
k-anonymity: each released record must be identical to at least k-1 others on its identifying attributes
Other variants: l-diversity, t-closeness, …
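
As a rough sketch of what checking k-anonymity could look like: the function below groups records by their identifying attributes and reports the size of the smallest group. The table, the column names, and the `anonymity_level` helper are all hypothetical, and the choice of which columns count as identifying is an assumption the data publisher has to make.

```python
from collections import Counter


def anonymity_level(records, quasi_identifiers):
    """Return the largest k for which a non-empty table is k-anonymous."""
    # Count how many records share each combination of identifying attributes.
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    # The smallest group determines k.
    return min(groups.values())


# Made-up table: zip code and age are generalized, diagnosis is sensitive.
table = [
    {"zip": "537**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "537**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "537**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "537**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "537**", "age": "30-39", "diagnosis": "flu"},
]

k = anonymity_level(table, ["zip", "age"])
print(k)  # 2: the smallest group (ages 20-29) has two records
```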

Problem: composition

Repeating k-anonymous releases may lose privacy
Privacy protection may fall off a cliff
- First few queries fine, then suddenly total violation
Again, interacts poorly with side-information
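
One way composition can fail is an intersection attack, sketched below with made-up data. The two releases, the age groupings, and the `candidates` helper are hypothetical. The sketch assumes the target appears in both releases and that the attacker knows which group the target falls into in each one.

```python
def candidates(release, target_group):
    """Sensitive values consistent with the target's group in one release."""
    return {value for group, value in release if group == target_group}


# Two releases about overlapping populations, each 3-anonymous on its own.
release_a = [
    ("age 30-39", "flu"),
    ("age 30-39", "asthma"),
    ("age 30-39", "diabetes"),
]
release_b = [
    ("age 35-44", "diabetes"),
    ("age 35-44", "cancer"),
    ("age 35-44", "ulcer"),
]

# The attacker knows the target is 37, so they fall in both groups.
in_a = candidates(release_a, "age 30-39")
in_b = candidates(release_b, "age 35-44")

# Each release alone leaves three possibilities; together only one remains.
print(len(in_a), len(in_b), in_a & in_b)
```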

Differential privacy

Yet another privacy definition

A new approach to formulating privacy goals: the risk to one’s privacy, or in general, any type of risk… should not substantially increase as a result of participating in a statistical database. This is captured by differential privacy.

Proposed by Dwork, McSherry, Nissim, Smith (2006)

Basic setting

Private data: set of records from individuals
- Each individual: one record
- Example: set of medical records
Private query: function from database to output
- Randomized: adds noise to protect privacy

Basic definition

A query Q is (\varepsilon, \delta)-differentially private if for every two databases db, db' that differ in one individual’s record, and for every subset S of outputs, we have:

\Pr[ Q(db) \in S ] \leq e^\varepsilon \cdot \Pr[ Q(db') \in S ] + \delta
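
To make the definition concrete, here is a minimal sketch of one standard randomized query, a counting query with Laplace noise, which satisfies the definition with \delta = 0. It is not taken from the slides: the databases, the `smoker` field, and all function names are made up. The check at the end compares output densities on a small grid of points; a bound on densities at every output would carry over to every subset S by integration.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(db, predicate, epsilon, rng):
    """Noisy count of matching records; one record changes a count by at most 1."""
    true_count = sum(1 for record in db if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)


def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution with the given mean and scale."""
    return math.exp(-abs(x - mean) / scale) / (2 * scale)


epsilon = 0.5
# Two made-up databases that differ in one individual's record.
db = [{"smoker": True}, {"smoker": False}, {"smoker": True}]
db_prime = [{"smoker": True}, {"smoker": False}, {"smoker": False}]


def is_smoker(record):
    return record["smoker"]


rng = random.Random(0)
print(private_count(db, is_smoker, epsilon, rng))
print(private_count(db_prime, is_smoker, epsilon, rng))

# Compare output densities: true counts are 2 and 1, noise scale is 1/epsilon.
grid = [i / 10 for i in range(-50, 80)]
max_ratio = max(
    laplace_pdf(x, 2, 1 / epsilon) / laplace_pdf(x, 1, 1 / epsilon) for x in grid
)
print(max_ratio <= math.exp(epsilon) * (1 + 1e-9))
```

Smaller values of \varepsilon mean more noise and a tighter bound on how much any one record can shift the output distribution.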

Basic reading

Output of program doesn’t depend too much on any single person’s data

Property of the algorithm/query/program
- No: “this data is differentially private”
- Yes: “this query is differentially private”