Could Auditing Tools Help Fight Algorithmic Bias?

Like people, algorithms can have significant blind spots, and when they get things wrong, the consequences for society can be serious.

And as algorithms become increasingly ubiquitous, the choices they make for us will have far-reaching implications, according to Aron Culotta, associate professor of Computer Science at the Illinois Institute of Technology.

“In applications like criminal sentencing, loan applications and self-driving cars, we need algorithms that are not only accurate, but also error-free,” said Culotta. “When algorithms make errors that are somehow unfair or are systematically biased against certain groups of people, they reinforce and worsen any existing inequities.”

Image: Aron Culotta, associate professor of Computer Science at the Illinois Institute of Technology.

Biased training data leads to biased algorithms

Algorithmic bias often stems from the data that is used to train the algorithm. And because bias runs deep in humans on many levels, training algorithms to be completely free of those biases is a nearly impossible task, said Culotta. Even if you want to combat bias, knowing where to look for it can be harder than it sounds.

“Data scientists do not necessarily know when they are making the algorithm that it will make incorrect or biased predictions,” said Culotta.

Algorithmic biases often stem from the text and images that data scientists use to train their models. For example, if you search for images of a “police officer” or “boss” on the internet, most of the pictures that show up will likely be of white men. If you fed this data into your algorithm, your model would likely conclude that bosses and cops are usually white and male, perpetuating stereotypes against women, minorities and other groups.

Everything from the tools and equipment used to collect data, to the factors data scientists choose to analyze, to how they train their models can cause biases to creep into algorithms. But there are steps data scientists, policymakers and other stakeholders can take to minimize bias.

Building fairer algorithms

One way to build better algorithms is to use auditing tools to detect biases in a trained model before deploying it in the real world. Aequitas is one such open-source toolkit developed at the University of Chicago.

To get clarity about exactly how Aequitas works, we caught up with one of the primary creators of the software, Rayid Ghani, distinguished career professor at the Heinz College of Information Systems and Public Policy and the School of Computer Science at Carnegie Mellon University.

Ghani was also the chief scientist for the Obama for America 2012 Election Campaign and director of the Center for Data Science & Public Policy at the University of Chicago.

The idea of fairness is different for each user or application, said Ghani, because every government, society or organization will have its own definition of fairness.

For example, one policymaker may define fairness as no one being left behind, while another stakeholder may want algorithms to proactively reduce inequity across all sub-groups over time.

Before we dive into how Aequitas works, however, let’s take a step back to understand how data scientists decide if their models are accurate.

In data science, there are four kinds of findings:

True positives: when an algorithm spots a real-world pattern.
False positives: when an algorithm identifies a pattern but there isn’t one.
True negatives: when there is no pattern, and the algorithm doesn’t identify one, either.
False negatives: when the algorithm fails to spot a pattern that exists in the real world.

Put simply, your model is biased if false positive or false negative rates are significantly higher or lower for a subgroup of people than for the population as a whole, or when compared to another sub-group.
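As a rough illustration of the definition above, the following Python sketch counts the four outcome types and derives false positive and false negative rates for each subgroup. The function names and the binary encoding (1 means the pattern is present or predicted) are assumptions made for this example; they are not taken from Aequitas.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            counts["tp"] += 1  # spotted a real pattern
        elif pred == 1 and truth == 0:
            counts["fp"] += 1  # flagged a pattern that is not there
        elif pred == 0 and truth == 0:
            counts["tn"] += 1  # correctly saw no pattern
        else:
            counts["fn"] += 1  # missed a real pattern
    return counts


def error_rates(y_true, y_pred):
    """Return the false positive rate and false negative rate."""
    c = confusion_counts(y_true, y_pred)
    negatives = c["fp"] + c["tn"]
    positives = c["tp"] + c["fn"]
    return {
        "fpr": c["fp"] / negatives if negatives else float("nan"),
        "fnr": c["fn"] / positives if positives else float("nan"),
    }


def error_rates_by_group(y_true, y_pred, groups):
    """Compute error rates separately for each subgroup label."""
    split = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        truths, preds = split.setdefault(group, ([], []))
        truths.append(truth)
        preds.append(pred)
    return {g: error_rates(t, p) for g, (t, p) in split.items()}
```

Comparing each subgroup's rates with the rates for the whole population (or with another subgroup) is then a matter of calling `error_rates` on the full data and `error_rates_by_group` on the same data with group labels.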

The tricky part about avoiding bias, Ghani said, is that predictive models inevitably rely on some level of generalization.

“To get the greatest possible number of people correct, the algorithm is going to be biased or incorrect about some smaller sub-group of people,” said Ghani, adding that this is why we often see algorithms fail to predict accurate results for sub-groups such as women, minorities and others.

Most data scientists are focused on building models that predict correctly for the largest possible number of scenarios — this concept is called “accuracy.” But for some, it can be easy to get so swept up in optimizing for accuracy that they forget about the notion of fairness.
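A small worked example may help show how a high accuracy number can hide poor results for a smaller group. The data below is entirely hypothetical (90 people in a majority group, 10 in a minority group) and was chosen only to make the arithmetic easy to follow.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Hypothetical data: the model is right about everyone in the majority group
# but right about only half of the minority group.
majority_true = [1] * 90
majority_pred = [1] * 90
minority_true = [1] * 10
minority_pred = [1] * 5 + [0] * 5

overall_accuracy = accuracy(majority_true + minority_true,
                            majority_pred + minority_pred)
majority_accuracy = accuracy(majority_true, majority_pred)
minority_accuracy = accuracy(minority_true, minority_pred)

print(f"overall:  {overall_accuracy:.2f}")
print(f"majority: {majority_accuracy:.2f}")
print(f"minority: {minority_accuracy:.2f}")
```

In this made-up case the model scores 95 percent overall while being no better than a coin flip for the minority group, which is the kind of gap a single accuracy figure does not reveal.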

Part of the problem in the industry today, Ghani said, is that accuracy and fairness are viewed as being mutually exclusive.

“‘It’s too hard. These are mutually exclusive so I should just give up.’ Those were the kinds of stories that we were hearing from policy makers,” Ghani said.

Ghani and his team started thinking about the fairness and equity of the entire system. What they found was that, based on the desired outcomes, some error variables were more important than others.

“You shouldn’t care about all types of disparities equally,” Ghani said. “In machine learning models, there is a lot in data. Ultimately, what’s most important is that the overall system is fair.”

Image: Aequitas Fairness Tree, courtesy of Aequitas.

The fairness tree

Ghani’s team decided to design a Fairness Tree: a systematic way for data scientists and stakeholders to navigate their way directly to the errors that are most impactful to the outcome they were trying to achieve. One of the key considerations in the fairness tree is whether a proposed intervention is punitive or assistive.

For example, if the algorithm is charged with punitive intervention, such as deciding whether someone should go to jail, then a high false positive rate for any subgroup (sending too many people to jail) becomes much more important than a false negative rate (sending too few), said Ghani, because an incorrect prediction can have a massive impact on an individual’s life and perpetuate societal inequities.

Alternatively, if you are working on a model whose function is assistive, such as an algorithm that helps to find the best health insurance option, a false positive (recommending an insurance plan one time too many) is far less harmful.

It’s these types of considerations that the Aequitas toolkit and fairness tree help data scientists and policymakers work through.
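The punitive-versus-assistive question can be pictured as one branch of a decision function. The sketch below is a deliberate simplification and not the actual Fairness Tree, which asks more questions than this. The function name is hypothetical, and the assistive branch (prioritizing false negatives) is an assumption inferred from the insurance example rather than something stated outright in the article.

```python
def priority_error(intervention):
    """Suggest which error type to scrutinize first for an intervention.

    A simplified, hypothetical stand-in for one branch of a fairness tree.
    """
    kind = intervention.strip().lower()
    if kind == "punitive":
        # Wrongly flagging someone (for example, for jail) does the most harm.
        return "false positives"
    if kind == "assistive":
        # Assumption: wrongly leaving someone out of help does the most harm,
        # since an unneeded recommendation costs comparatively little.
        return "false negatives"
    raise ValueError(f"unknown intervention type: {intervention!r}")


print(priority_error("punitive"))
print(priority_error("assistive"))
```

Encoding the choice explicitly, even in a toy form like this, makes it something stakeholders can review and challenge before any model is trained.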

Assessing bias using false positives and negatives

Aequitas works by comparing the false positive and false negative rates of each “protected or selected” group against those of the “overall reference” group. If the value for a “protected or selected” group is between 80 and 125 percent of the value for the reference group, the audit passes; otherwise, it fails.
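The pass/fail rule described above can be sketched in a few lines of Python. This is a minimal illustration of the 80-to-125-percent check, assuming the disparity is the group's rate divided by the reference group's rate; the function names and example rates are invented for this sketch and are not the Aequitas API.

```python
def disparity(group_rate, reference_rate):
    """Ratio of a group's error rate to the reference group's rate."""
    if reference_rate == 0:
        raise ValueError("reference rate must be non-zero to form a ratio")
    return group_rate / reference_rate


def passes_audit(group_rate, reference_rate, low=0.8, high=1.25):
    """True if the group's rate is within 80 to 125 percent of the reference."""
    return low <= disparity(group_rate, reference_rate) <= high


# Hypothetical false positive rates, with a reference group at 20 percent.
print(passes_audit(0.22, 0.20))  # about 110 percent of the reference
print(passes_audit(0.30, 0.20))  # about 150 percent of the reference
print(passes_audit(0.10, 0.20))  # about 50 percent of the reference
```

Note that the band is symmetric as a ratio (125 percent is the reciprocal of 80 percent), so a group fails whether its error rate is much higher or much lower than the reference.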

The tool assesses several metrics, such as false negative rate parity, false positive rate parity and false discovery rate parity. Each is a criterion that considers whether a given error rate is the same across all subgroups. The tool then creates a report indicating which metrics are biased.
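To make the idea of a multi-metric report concrete, here is a hedged sketch that derives three rates from confusion-matrix counts and marks each as passing or failing against a reference group. The metric formulas are the standard ones (the false discovery rate is the share of positive predictions that turn out to be wrong); the function names, the report layout, and the counts are assumptions for illustration and do not reproduce the Aequitas output. The sketch also assumes every denominator is non-zero.

```python
def rates(tp, fp, tn, fn):
    """Derive three error rates from confusion-matrix counts.

    Assumes each denominator is non-zero.
    """
    return {
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (fn + tp),  # false negative rate
        "fdr": fp / (fp + tp),  # false discovery rate
    }


def audit_report(counts_by_group, reference, low=0.8, high=1.25):
    """For each non-reference group, map each metric to True (pass) or False."""
    ref_rates = rates(**counts_by_group[reference])
    report = {}
    for group, counts in counts_by_group.items():
        if group == reference:
            continue
        group_rates = rates(**counts)
        report[group] = {
            metric: low <= group_rates[metric] / ref_rates[metric] <= high
            for metric in ref_rates
        }
    return report


# Hypothetical counts for a reference group and one protected group.
counts = {
    "reference": {"tp": 40, "fp": 10, "tn": 40, "fn": 10},
    "group_b": {"tp": 30, "fp": 11, "tn": 39, "fn": 20},
}
report = audit_report(counts, reference="reference")
print(report)
```

A group can pass on one metric and fail on another, which is why deciding in advance which error types matter most for the application is useful when reading such a report.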

Aequitas is a free, open-source toolkit available on GitHub and is compatible with Python 3.6+. Data scientists and policymakers can test the application using their own data or one of the sample data sets provided on the site.

Image: Aequitas bias audit results using data from the COMPAS dataset.

Ghani and his team’s work has yielded some important discoveries about what practitioners can do to create better algorithms:

Define “fairness” within the context of what you want to achieve. Does it mean not leaving anyone behind, or does it mean reducing error rates gradually over time? The answer might depend on the consequences of getting things wrong.
Optimizing for accuracy isn’t always the best solution. For most problems, the most accurate model for predicting the behavior of large groups of people might lead to unfair outcomes for members of smaller subsets. To avoid that, data scientists can use auditing tools to build models that are both accurate and fair. While it may impact accuracy numbers a bit at first, over time, it might just be the better approach.
Use auditing tools from the beginning. Having a conversation about bias metrics upfront will help data scientists to encode systems more effectively. Talking about fairness as an outcome-based metric gives data scientists a seat at the table from the get-go, said Ghani. It also offers an opportunity to talk frankly about what kinds of biases are most important to avoid.
Be open about your technology’s shortcomings. Stakeholders should make the results of their audits public. “If the algorithm doesn’t pass the audit, a stakeholder can either decide not to use it or use but let the public know we’ve vetted the problem and it’s the best we can do,” said Ghani. This helps to eliminate public backlash.
Convene a comprehensive group of stakeholders. Ghani said it should not just be the policymaker or lead data scientists at a company who define the ideal outcome. When it comes to building algorithms that impact entire communities, outside stakeholders should be involved from the get-go.
Lastly, try to create some way of testing your algorithm once it is released into the world. “You should be constantly monitoring the system to see if you are having the impact you thought you were going to have,” said Ghani. If that means you need to spend extra funds to compare and track results, then think of it as buying insurance. “It’s in our nature to say it’s built. You’ve showed me it works. But it’s important to track AI tools and make sure we have guard rails in place. The alternative is too risky,” he said.
