In this episode of the Data Show, I spoke with Chang Liu, applied research scientist at Georgian Partners. In a previous post, I highlighted early tools for privacy-preserving analytics, both for improving decision-making (business intelligence and analytics) and for enabling automation (machine learning). One of the tools I mentioned is an open source project for SQL-based analysis that adheres to state-of-the-art differential privacy (a formal guarantee that provides robust privacy assurances). Since business intelligence typically relies on SQL databases, this open source project is something many companies can already benefit from today.
What about machine learning? While I didn’t have space to point this out in my previous post, differential privacy has been an area of interest to many machine learning researchers. Most practicing data scientists aren’t aware of the research results, and popular data science tools haven’t incorporated differential privacy in meaningful ways (if at all). But things will change over the coming months. For example, Liu wants to make ideas from differential privacy accessible to industrial data scientists, and she is part of a team building tools to make this happen.
Here are some highlights from our conversation:
Differential privacy and machine learning
In the literature, there are actually multiple ways differential privacy is used in machine learning. We can inject noise directly at the input data level, or while we’re training a model. During training, we can inject noise into the gradient: at every iteration where we compute the gradients, we can inject some sort of noise. We can also inject noise during aggregation, so if we’re using ensembles, we can inject noise there. And we can inject noise at the output level: after we’ve trained the model and we have our vector of weights, we can inject noise directly into the weights.
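To make three of these injection points concrete, here is a minimal NumPy sketch for a linear regression model. It is an illustration only, not the tooling Liu's team is building: the function names, the per-example clipping step, the choice of Laplace and Gaussian noise, and all noise scales are assumptions made for this example, and none of the scales are calibrated to a specific privacy budget. Aggregation-level noise for ensembles is left out for brevity.

```python
import numpy as np

# Seeded generator so the sketch is reproducible.
rng = np.random.default_rng(0)

def perturb_inputs(X, scale):
    """Input-level noise: add Laplace noise to every feature value."""
    return X + rng.laplace(loc=0.0, scale=scale, size=X.shape)

def noisy_gradient_step(w, X, y, lr, clip_norm, noise_multiplier):
    """Gradient-level noise: one step of gradient descent on squared loss.

    Each example's gradient is clipped to at most `clip_norm` in L2 norm,
    the clipped gradients are summed, Gaussian noise is added to the sum,
    and the result is averaged before the weight update.
    """
    residuals = X @ w - y                          # shape (n,)
    per_example_grads = residuals[:, None] * X     # shape (n, d)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(X)
    return w - lr * noisy_mean_grad

def perturb_weights(w, scale):
    """Output-level noise: add Laplace noise to the trained weight vector."""
    return w + rng.laplace(loc=0.0, scale=scale, size=w.shape)

# Toy data: 200 examples, 3 features, noiseless linear target.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Train with noisy gradients, then optionally perturb the final weights.
w = np.zeros(3)
for _ in range(200):
    w = noisy_gradient_step(w, X, y, lr=0.5, clip_norm=1.0, noise_multiplier=1.0)
w_released = perturb_weights(w, scale=0.05)
```

In practice only one of these mechanisms is usually applied, and the noise scale has to be derived from the sensitivity of the quantity being perturbed and the desired privacy parameters. The clipping step in the gradient variant is what bounds that sensitivity.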
A mechanism for building robust models
There is a chance that differential privacy methods can actually make your model more general. Essentially, when models memorize their training data, it could be due to overfitting. So injecting all of this noise may help move the resulting model further away from overfitting, and you get a more general model.
Related resources:
- “How to build analytic products in an age when data privacy has become critical”
- “Managing risk in machine learning models”: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain.
- “Data regulations and privacy discussions are still in the early stages”: Aurélie Pols on GDPR, ethics, and ePrivacy.
- “Data collection and data markets in the age of privacy and machine learning”