Privacy and Fairness: two very different machine learning ideals?
Hello everyone,
I hope you've been keeping well! I'm trying out a new format for the email newsletter: slightly longer form, but with more context around the videos posted.
YouTube channel update
I'm happy to say I've posted 5 videos in the month of February, with new videos coming weekly! Check out the recent YouTube videos here. They roughly fit into two topics:
Privacy:
Bias / Fairness:
- What constitutes dataset bias?
- Does debiasing word embeddings actually work?
- Counterfactuals - the solution to fairness and explainability?
The first 100 subscribers are always the hardest: I'm up to 43 now, but I'd really appreciate it if you could share these videos with friends!
If you want to keep track of how the channel is progressing, I've written a post that I'll be updating with major milestones.
Privacy vs Fairness
Privacy and fairness are both ideals we'd like our machine learning systems to satisfy, but they're very different in practice.
Privacy is well-defined
The objectives of privacy are better defined: we don't want our models to memorise our data, and we want to keep control of our data. The former requirement is enshrined in a mathematical definition, differential privacy, which gives us a privacy guarantee we can measure with a quantifiable "privacy budget". The latter can be addressed through techniques like federated learning, which lets us train models without the data ever leaving our devices.
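To make the "privacy budget" idea concrete, here's a minimal Python sketch of the Laplace mechanism, one of the basic building blocks of differential privacy. The function name and numbers are purely my own illustration, not code from any particular library:

```python
import numpy as np

def private_mean(values, lower, upper, epsilon):
    """Differentially private estimate of the mean of `values`.

    `epsilon` is the privacy budget: a smaller budget means more noise
    and therefore a stronger privacy guarantee.
    """
    values = np.clip(values, lower, upper)          # bound each person's influence
    sensitivity = (upper - lower) / len(values)     # max change one record can cause
    noise = np.random.laplace(scale=sensitivity / epsilon)
    return values.mean() + noise

ages = np.array([23, 35, 41, 29, 52, 38])
print(private_mean(ages, lower=18, upper=90, epsilon=0.5))
```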
What does it mean to be "fair"?
Fairness is a bit trickier. Researchers have proposed multiple definitions of algorithmic fairness, but they're not all compatible with each other and each has its shortcomings. To cut a long story short, fairness is subjective, and I think this is more a matter of policy than of simply settling on a statistical definition. (I'll cover these statistical definitions in a future video!)
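As a taste of what those statistical definitions look like, here's a toy sketch (the data is entirely made up) of two common metrics, demographic parity and equal opportunity. Note that in this example one is satisfied while the other isn't, which is exactly the kind of incompatibility I mean:

```python
import numpy as np

# Made-up decisions, true outcomes, and group membership for 8 people.
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0])   # model's hiring decisions
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # who was actually qualified
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # sensitive attribute

def demographic_parity_diff(y_pred, group):
    # Difference in positive-decision rates between the two groups.
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_diff(y_pred, y_true, group):
    # Difference in true positive rates (among the qualified) between groups.
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

print(demographic_parity_diff(y_pred, group))         # 0.25: decision rates differ across groups
print(equal_opportunity_diff(y_pred, y_true, group))  # 0.0: qualified people are treated alike
```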
How do you identify and eliminate bias?
To compound this, even once you've agreed on a definition of fairness, the task of eliminating bias is not straightforward. First, you've got to identify the sources of bias in your dataset - and they're much broader than you might think (making this video was an eye-opener for me!).
There are techniques out there for debiasing word embeddings, but they're often "lipstick on a pig" - they superficially mask the bias without eradicating it. More here
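For a flavour of how that bias is usually measured, here's a rough sketch using tiny made-up vectors (real embeddings like word2vec or GloVe have hundreds of dimensions): project each word onto a "gender direction" such as he - she.

```python
import numpy as np

# Tiny made-up "embeddings", purely for illustration.
emb = {
    "he":       np.array([ 0.9, 0.1, 0.3, 0.2]),
    "she":      np.array([-0.9, 0.1, 0.3, 0.2]),
    "engineer": np.array([ 0.6, 0.5, 0.1, 0.4]),
    "nurse":    np.array([-0.6, 0.5, 0.1, 0.4]),
}

gender_direction = emb["he"] - emb["she"]
gender_direction /= np.linalg.norm(gender_direction)

for word in ["engineer", "nurse"]:
    bias = float(emb[word] @ gender_direction)   # signed projection onto the direction
    print(f"{word}: {bias:+.2f}")

# Many debiasing methods simply zero out this projection. The catch is that
# bias can still be recovered from which words cluster together, which is
# why the result can end up as "lipstick on a pig".
```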
So is all lost for fairness?
On the contrary, I think it's one of the most exciting fields because you can still shape it! One approach to defining fairness is counterfactual fairness - asking whether you would have got the job had you been of a different race. I personally think this is an intuitive way of reasoning about fairness. Counterfactuals are a form of causal inference: a probabilistic framework in which you formally define your assumptions about which factors depend on each other.
Remember how differential privacy gives you a rock-solid mathematical guarantee for privacy? Counterfactual fairness could be similar for fairness, where you can guarantee fairness given your assumptions about causality.
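To give a flavour of what that means in practice, here's a toy structural causal model - entirely my own illustration, with made-up variables - showing the kind of counterfactual question you'd ask: does the decision change if only the sensitive attribute A is flipped, with everything else held fixed?

```python
import numpy as np

rng = np.random.default_rng(0)

def decision(a, u_skill, u_noise):
    # Assumed causal structure: skill -> score, plus a direct effect of the
    # sensitive attribute A on the score (e.g. via a biased referral channel).
    score = 2.0 * u_skill + 1.0 * a + u_noise
    return score > 2.5                       # hire if the score clears a threshold

# Sample the background (exogenous) factors once, then intervene on A only.
u_skill, u_noise = rng.normal(1.0, 0.5), rng.normal(0.0, 0.3)

print("decision with A=1:", decision(1, u_skill, u_noise))
print("decision with A=0:", decision(0, u_skill, u_noise))
# Counterfactual fairness asks that flipping A (with the background factors
# held fixed) should not change the decision, given your causal assumptions.
```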
There's a catch
Current methods for privacy and fairness appear to conflict. The noise added by differentially private algorithms seems to disproportionately affect minorities. On the flip side, to make a model fair we likely need to factor in sensitive demographic information - exactly the data that privacy asks us to protect.
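Here's a quick back-of-the-envelope illustration (my own numbers, not from a paper) of why the same amount of noise hits small groups harder: an absolute error of a few counts is negligible for a large group but substantial for a small one.

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon = 0.5                                  # privacy budget for a counting query
true_counts = {"majority group": 10_000, "minority group": 50}

for name, count in true_counts.items():
    noisy = count + rng.laplace(scale=1.0 / epsilon)   # sensitivity of a count is 1
    rel_error = abs(noisy - count) / count
    print(f"{name}: true={count}, noisy={noisy:.1f}, relative error={rel_error:.2%}")
```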
I'll have more to say about this in future videos: there's just so much interesting content in this space! Not to mention the entire topic of explainability :)
This Sunday's video will be on PATE, a differentially private algorithm that won Best Paper at ICLR 2017 and, along with DP-SGD (covered in the first Differential Privacy video), is probably one of the most influential differentially private algorithms out there.
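If you're curious what the core of DP-SGD looks like, here's a bare-bones sketch of a single update step - clip each example's gradient, then add Gaussian noise. The gradients below are random stand-ins; real implementations such as Opacus or TensorFlow Privacy handle the per-example gradients and the privacy accounting properly.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1):
    # 1. Clip each example's gradient so no single record dominates.
    clipped = [g * min(1.0, clip_norm / np.linalg.norm(g)) for g in per_example_grads]
    # 2. Sum the clipped gradients and add Gaussian noise calibrated to the clip norm.
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=clipped[0].shape)
    # 3. Take an ordinary SGD step with the averaged, noisy gradient.
    return -lr * noisy_sum / len(per_example_grads)

fake_grads = [rng.normal(size=3) for _ in range(8)]   # stand-ins for per-example gradients
print(dp_sgd_step(fake_grads))
```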
Remember to subscribe if you haven't already, and please spread the word! Privacy and fairness are real concerns for ML practitioners.
Till next time,
Mukul