The problem of fairness comes up in any discussion of data ethics. We’ve seen analyses of products like COMPAS, we’ve seen the maps that show where Amazon first offered same-day delivery, and we’ve seen how job listings shown to women are skewed toward lower-paying jobs.
We also know that “fair” is a difficult concept for any number of reasons, not the least of which is the data used to train machine learning models. Kate Crawford’s recent NIPS keynote, The Trouble with Bias, is an excellent introduction to the problem. Fairness is almost always future-oriented and aspirational: we want to be fair, we want to build algorithms that are fair. But the data we train with is, by definition, backward-looking; it reflects our history, and that history frequently isn’t fair. Real estate data reflects the effects of racial discrimination in housing, which is still taking place many years after it became illegal. Employment data reflects assumptions about what men and women are expected to do (and have historically done): women get jobs as nurses, men get jobs as engineers. Not only are our models based on that historical data; they’ve also proven to be excellent at inferring characteristics like race, gender, and age, even when they’re supposed to be neutral with respect to those attributes.
In his keynote for the O’Reilly Strata Data Conference in Singapore (see slides and write-up here), Ben Lorica talked about the need to build systems that are fair, and isolated several important problems facing developers: disparate impact (for example, real estate redlining), disproportionate error rates (for example, recommending incorrect treatments for elderly patients), and unwarranted associations (for example, tagging pictures of people as “gorillas”). But the problem is bigger than figuring out how to make machine learning fair; that is only a first step. Simply understanding how to build fair systems would only help us craft a small number of them. In the future, machine learning will be in everything: not just applications for housing and employment, but possibly even in our databases themselves. We’re not looking at a future with a few machine learning applications; we’re looking at machine learning embedded in everything.
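To make the first two of those problems concrete, here is a minimal sketch of how disparate impact and disproportionate error rates can be measured from a model’s predictions. The toy data, the group labels, and the 0.8 threshold (the familiar “four-fifths rule”) are illustrative assumptions on my part, not anything Lorica’s keynote prescribes.

```python
import numpy as np

def disparate_impact(y_pred, group):
    """Ratio of favorable-outcome rates between two groups (1.0 = parity)."""
    rate_a = y_pred[group == "a"].mean()
    rate_b = y_pred[group == "b"].mean()
    return min(rate_a, rate_b) / max(rate_a, rate_b)

def error_rate_gap(y_true, y_pred, group):
    """Absolute difference in error rates between the two groups."""
    errors = (y_true != y_pred)
    return abs(errors[group == "a"].mean() - errors[group == "b"].mean())

# Toy data: 1 = favorable outcome (say, a loan approval).
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

print(disparate_impact(y_pred, group))        # 0.33 here; below 0.8 fails the four-fifths rule
print(error_rate_gap(y_true, y_pred, group))  # 0.25 here
```

Unwarranted associations are harder to capture with a single number, which is one reason fairness can’t be handed off entirely to software. And none of these metrics change the larger point: machine learning is about to be embedded in everything.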
This will be a shift in software development that’s perhaps as radical as the industrial revolution was for manufacturing, though that’s another topic. In practice, it means that a company won’t have just a handful of machine learning models to deal with; even a small company might well have thousands, and a Google or a Facebook might have millions.
We also have to think about how our models evolve over time. Machine learning isn’t static; you don’t just finish an application, send it over to operations, and walk away. People change, and so does the way they use applications. Many models constantly re-train themselves on the most recent data, and as we’ve seen, the data flowing into a model has become a new attack vector: a new security problem, and a big one, as danah boyd and others have argued. Ensuring that a model is fair when it is deployed isn’t enough. We have to ensure that it remains fair, and that means constantly testing and re-testing it for fairness.
Fairness won’t be possible at this scale if we expect developers to craft models one at a time. Likewise, we can’t expect teams to test and re-test thousands of models once they have been deployed. That artisanal vision of AI development has started us down this road, but it certainly will not survive.
Lorica suggests in his keynote that the answer lies in monitoring our models for fairness. This monitoring has to be automated; there’s no way humans can keep track of thousands or millions of models running in everything from thermostats and cell phones to database servers. We need to use machine learning to monitor machine learning. Fortunately, this vision isn’t far-fetched: we already use similar tools in modern IT. The importance of monitoring and maintaining running systems is part of what we’ve learned through the devops movement. We’d never deploy a system without putting monitoring in place; we constantly measure throughput, latency, and many other variables on systems in production. And those monitoring tools increasingly use machine learning to detect problems ranging from failing hardware to attacks.
Automated tools for monitoring and testing fairness don’t exist yet; as far as we can tell, they aren’t even on the research agenda. But that doesn’t mean they can’t exist, or won’t exist. We just need to build them. They should be able to test for fairness in several ways: disparate impact between different groups, disproportionate error rates for specific groups, and unwarranted or erroneous associations, to start with. And our tools should be able to watch models as their performance inevitably drifts over time.
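As a thought experiment, here is a sketch of the kind of automated watcher that paragraph imagines: re-run the fairness checks over successive windows of live traffic and raise a flag when a metric drifts past a threshold. The FairnessReport shape, the thresholds, and the windowing are assumptions for illustration; as noted above, no standard tooling for this exists yet.

```python
from dataclasses import dataclass

@dataclass
class FairnessReport:
    disparate_impact: float   # ratio of favorable-outcome rates (1.0 = parity)
    error_rate_gap: float     # absolute gap in error rates between groups

    def violations(self, di_floor=0.8, gap_ceiling=0.05):
        """Return the names of any checks this report fails."""
        flags = []
        if self.disparate_impact < di_floor:
            flags.append("disparate impact")
        if self.error_rate_gap > gap_ceiling:
            flags.append("disproportionate error rate")
        return flags

def monitor(windows, evaluate, alert):
    """Watch a deployed model over time.

    `windows` yields batches of (y_true, y_pred, group) from recent traffic,
    `evaluate` turns a batch into a FairnessReport (for instance, using the
    metrics sketched earlier), and `alert` is whatever notifies a human.
    """
    for i, batch in enumerate(windows):
        report = evaluate(*batch)
        for flag in report.violations():
            alert(f"window {i}: possible {flag} violation")
```

The loop itself is the easy part, and it is exactly the shape of the monitoring we already run for throughput and latency; the hard work is in `evaluate` and in choosing thresholds appropriate to each application.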
Furthermore, we have to recognize that fairness is ultimately a human problem. We can’t expect a tool to do the whole job; we can’t solve the problem of “fairness” with software. We can, however, expect a tool to assist humans and help them make decisions about what is and isn’t fair. We don’t want to take humans out of the fairness loop; rather, we want to give the humans responsible for that loop the ability to monitor thousands of models. That means designing and building “fairness dashboards” that let humans see how their systems are behaving at a glance; we may want tools to generate alarms (or at least defer actions) when they see something suspicious. Yes, alarm fatigue is a problem, but so is tagging a human as a gorilla.
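One hedged illustration of “defer actions”: when the most recent fairness report has open flags, hold the model’s decision for human review instead of acting on it automatically. The names here (the model object, the review queue) are hypothetical; the point is the shape of the loop, with a person still in it.

```python
def decide(application, model, latest_report, review_queue):
    """Apply the model's decision only when no fairness alarm is open."""
    decision = model.predict(application)
    if latest_report.violations():                    # any open flag from monitoring?
        review_queue.append((application, decision))  # hold for a human to review
        return "deferred"
    return decision
```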
So far, we’ve only asked our machine learning applications to optimize business outcomes. We need to do more; we have to do more. Unfair models won’t just be a source of embarrassment; they’ll injure people, damage reputations, and become legal liabilities. But aside from the business consequences, we need to think about how the systems we build interact with our world, and we need to build systems that make our world a better place. We need to build machine learning systems that are not only fair, but that remain fair over time. And to do so, we’ll need to build tools that augment our human abilities: tools and dashboards for monitoring fairness.
That’s a challenge, but I believe we can meet it.