This post was originally published by Jonathan Davis at Medium [AI]
Understanding behavioral economics can help data scientists create better, more effective machine learning models.
In 1975, Herbert A. Simon was awarded the Turing Award by the Association for Computing Machinery. This award, given to “an individual selected for contributions of a technical nature made to the computing community,” is considered to be the Nobel Prize for computing.
Simon and co-recipient Allen Newell made basic contributions to artificial intelligence, the psychology of human cognition, and list processing.
It is interesting to note that, alongside his contributions to artificial intelligence and list processing, he was also recognized for his contribution to human cognition. At first glance, one would think that understanding how humans think is about as far from computer science as you can get!
However, there are two key arguments that explain why human cognition is important for any advancements in computer science, and especially AI.
In his 1950 seminal paper “Computing Machinery and Intelligence,” Alan Turing introduced what became known as the Turing test. A computer and a human have a written dialogue and in this “imitation game” the computer tries to fool the human participant into thinking it is also a human by devising responses that it thinks a human would make.
One of the key aims of AI is to train computers to make decisions like humans, whether labelling pictures or responding to questions. Even if the aim is task-specific, not centred around replicating humans in their entirety, it is crucial that developers of AI have some understanding of human cognition, so that they can replicate it.
One of the many modern applications of AI, specifically machine learning, is in human-facing interaction. Whether recommending products to drive sales or auto-completing sentences in emails, machine learning models are trained to understand what users want. However, the methods, data and metrics used to develop these models need to be informed by an understanding of how the model output will interact with its human users.
In this article, we’ll focus on the interaction between humans and computer models, understanding how behavioral economics can be used to help data scientists develop and train more effective machine learning models.
Classical economics is based on the assumption that all individuals behave rationally, i.e. they will make the decision with the greatest personal utility (benefit).
However, modern economists began to realize that humans often behave irrationally. Not only that, but they are predictably irrational, behaving in the same irrational way every time they make similar decisions. Behavioral economics is the study of these predictably irrational decisions, known as cognitive biases.
Sitting somewhere between psychology and economics, behavioral economists try, through experimentation, to measure these systematic deviations from rational behavior and to identify them in the real world.
Daniel Kahneman and Amos Tversky, widely considered to be the founding fathers of the field, wrote extensively on the practical implications of cognitive bias in various fields, including finance, clinical judgment and management.
There are several types of cognitive biases that data scientists can use to improve the efficacy of their machine learning models.
Confirmation bias is the tendency for humans to search for information that confirms one’s prior beliefs. This occurs because people naturally cherry-pick information that aligns with what they already believe is true.
As an extreme example, if you believe that the world is flat you will search extensively for evidence, no matter how scarce or unreliable, that supports your hypothesis, and ignore the widely available and reliable evidence against it.
Although he did not call it “confirmation bias,” one of the earliest experimental demonstrations of it came from Peter Wason in 1960.
In his experiment, he challenged subjects to identify the rule governing triplets of numbers, giving [2, 4, 6] as an example that fit the rule. To try and learn the rule, they could propose any triplet of their own, and the experimenter would tell them whether or not it fit.
Wason found that most subjects devised extremely complex rules and generated many triplets that conformed to the rule. This is a poor tactic considering you cannot prove a rule definitively no matter how many combinations the experimenter confirms, but you can disprove it with just one. The rule was simply a sequence in ascending order, and only 6 out of 29 subjects identified it on their first guess.
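Wason's task is easy to simulate. The sketch below (the rule-checking function and the triplets are hypothetical stand-ins) shows why confirming triplets alone can never distinguish a subject's narrow guess, such as "even numbers increasing by 2," from the true, broader rule:

```python
# The hidden rule is simply "any ascending sequence".
def hidden_rule(triplet):
    a, b, c = triplet
    return a < b < c

# Confirmatory tests: these fit the narrow guess AND the true rule,
# so every "yes" from the experimenter teaches the subject nothing.
confirming = [(2, 4, 6), (10, 12, 14), (20, 22, 24)]
print([hidden_rule(t) for t in confirming])  # → [True, True, True]

# A single disconfirming test: it violates the narrow guess but still
# fits the hidden rule, immediately revealing the guess was too specific.
print(hidden_rule((1, 2, 50)))  # → True
```

The lesson mirrors Wason's finding: triplets chosen to falsify a hypothesis carry far more information than triplets chosen to confirm it.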
In his 2011 Ted Talk, Eli Pariser talks about what he calls the “filter bubble”, the internet phenomenon where users are shown only what is most relevant to them. This is generally done using a recommender system method called collaborative filtering where users are recommended items based on what other people similar to them have interacted with (I’ll use interacted as a generic term for clicked, watched, bought, etc.).
The result of this is that you are shown more of what you have already interacted with. If you generally read conservative-leaning news articles, you’ll be shown more conservative-leaning news articles; if you watch action movies, you’ll be recommended more action movies.
However, Pariser points out that this isolates people from a variety of information and opinions by trapping them in their filter bubble, without them even knowing. This reinforces confirmation bias: not only is the user searching for information that confirms their beliefs, it is all they have available to them.
There are two main issues with this. Firstly, there are ethical concerns with unknowingly providing users with biased content. It becomes harder for people to form well-rounded opinions, which depend on well-balanced information sources. In Pariser’s words,
“The danger of these filters is that you think you are getting a representative view of the world and you are really, really not, and you don’t know it.”
The second issue is the holistic effectiveness of recommender systems. I like whisky, so when I look at any of my social media streams, it is full of online whisky sellers. Will I then go on to buy whisky? Yes.
So why is this such a bad thing?
Well, because I like whisky, and even without the adverts, I will search around the internet for interesting bottles and likely buy some anyway.
Both of these concerns, ethical and effectiveness, can be addressed by introducing an element of variation to recommended items. Perhaps a Republican response to an article, or a bottle of gin that whisky lovers tend to like.
How can this be done? A simple method is to add a penalty term for similarity to the recommendation algorithm, so that items too similar to those already shown are ranked down. This addresses the ethical concern quite well.
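One way to sketch such a penalty term is greedy re-ranking in the spirit of maximal marginal relevance. Everything below (item names, relevance scores, and the crude category-based similarity) is hypothetical, purely for illustration:

```python
def rerank_with_diversity(candidates, similarity, relevance, penalty=0.3):
    """Greedily pick items, penalising similarity to items already chosen."""
    chosen, pool = [], list(candidates)
    while pool:
        def score(item):
            max_sim = max((similarity(item, c) for c in chosen), default=0.0)
            return relevance(item) - penalty * max_sim
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy catalogue: (category, relevance score) per item.
items = {"whisky A": ("whisky", 0.9), "whisky B": ("whisky", 0.8),
         "gin A": ("gin", 0.6)}
sim = lambda a, b: 1.0 if items[a][0] == items[b][0] else 0.0
rel = lambda item: items[item][1]
print(rerank_with_diversity(items, sim, rel, penalty=0.5))
```

With `penalty=0.5`, the gin bottle outranks the second whisky despite its lower raw relevance, which is exactly the kind of variation described above.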
However, to improve model effectiveness, a change of perspective when training may help. Instead of measuring model performance by how many recommended items are interacted with, try measuring how many more items are purchased than would have been purchased without the recommendations.
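That uplift-style metric can be sketched by comparing a group that saw recommendations against a holdout group that did not (all numbers below are hypothetical):

```python
def incremental_purchases(treated_purchases, treated_n,
                          control_purchases, control_n):
    """Uplift per user: purchase rate with recommendations minus without."""
    treated_rate = treated_purchases / treated_n
    control_rate = control_purchases / control_n
    return treated_rate - control_rate

# 120 of 1,000 users who saw recommendations bought something,
# but 100 of 1,000 holdout users bought anyway (the whisky effect).
uplift = incremental_purchases(120, 1000, 100, 1000)
print(f"uplift per user: {uplift:.3f}")  # → uplift per user: 0.020
```

A raw click-through or purchase count would credit the model with all 120 purchases; the uplift view credits it with only the 20 incremental ones.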
Without an understanding of confirmation bias, data scientists would be unlikely to realize that the filter bubble phenomenon is occurring, let alone know how to try and mitigate it.
Availability bias occurs when people rely on the information that is most readily available to them, generally more recent information.
In an experiment, Tversky and Kahneman showed participants either a list of 19 famous men and 20 less famous women or 19 famous women and 20 less famous men. The participants were generally able to recall more of the famous gender than less famous gender and estimated that the list of the famous gender was longer than the less famous gender.
Kahneman and Tversky argued that this was caused by availability bias. Despite being a poor heuristic for judging probability, participants used the number of celebrities more readily available to them as an estimate for the total number. As they were likely to recall more famous celebrities, they estimated that it was the longer list.
When training machine learning models, availability bias can often cause data bias. If only the most readily available data is used to train the model, it may contain inherent bias.
A well-known example of this is gender bias in machine translation. This can occur when translating from gendered languages to gender-neutral languages. For example, “he” and “she” in English both translate to the ungendered pronoun “o” in Turkish.
People started noticing that this created gender biases in Google translate, such as translating “o bir doktor” to “he is a doctor” and “o bir hemşire” to “she is a nurse.” This was because of an inherent bias in the training data, as historically more men were doctors and more women were nurses.
This is a consequence of availability bias, where data scientists took the data available to them without considering whether it would create the most effective model.
In order to mitigate this bias, data scientists need to change their thinking from “How can I make my model using the data I have?” to:
“What data do I need to create my model?”
In the example above, Google’s solution was to create a new dataset containing queries labelled as either male, female or gender-neutral, which they used to train their model. By thinking outside of what was immediately available to them, they were able to create a much more effective machine learning model.
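Before collecting new data, a quick audit of the existing labels can surface this kind of skew. A minimal sketch, using invented sentence pairs and counts:

```python
from collections import Counter

# Hypothetical audit of a translation training set: count how often each
# profession co-occurs with each gendered pronoun before training.
pairs = [("doctor", "he"), ("doctor", "he"), ("doctor", "she"),
         ("nurse", "she"), ("nurse", "she"), ("nurse", "he")]

by_profession = Counter(pairs)
for (profession, pronoun), n in sorted(by_profession.items()):
    print(f"{profession:>6} + {pronoun}: {n}")
```

If the counts are heavily skewed for a profession, the model will learn that skew, so either the data needs rebalancing or, as Google did, the labels need to be made gender-explicit.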
In 1943, during World War II, the US military studied the damage to planes and decided to reinforce the areas that were most commonly damaged, in order to reduce bomber losses.
However, statistician Abraham Wald realized that the areas that were most often hit were actually the least vulnerable, as planes hit there were still able to return to base. Instead, the areas with the least evidence of damage should be reinforced, because the lack of damage indicated that planes hit in those areas went on to crash.
Survivorship bias is a type of availability bias, but instead of focusing on the most readily available information, humans focus on the most visible information, typically because it has passed through some selection process.
The above story teaches a very important lesson about the limitations of data science. Sometimes all the available data is still not enough to create a good model. What is not available might be just as important.
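Wald's insight is easy to reproduce in a toy simulation. The areas and the probability that a hit in each area downs the plane are invented for illustration:

```python
import random

random.seed(0)
AREAS = ["engine", "cockpit", "fuselage", "wing"]
# Hypothetical vulnerability: probability that a hit in this area downs the plane.
DOWN_PROB = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wing": 0.1}

# Each plane takes one hit in a random area; we can only inspect survivors.
survivor_hits = {area: 0 for area in AREAS}
for _ in range(10_000):
    area = random.choice(AREAS)
    if random.random() >= DOWN_PROB[area]:  # the plane made it back to base
        survivor_hits[area] += 1

# Among survivors, the most-damaged areas are the LEAST vulnerable ones.
print(survivor_hits)
```

The observed dataset (survivors) shows heavy damage on the fuselage and wings precisely because hits there are survivable; the engines look clean only because engine hits rarely make it into the data.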
Unfortunately, this often means that machine learning models can go into production before the data scientists who developed them realize that they aren’t working.
Understanding this limitation is extremely important in avoiding wasted time and money. A simple solution is to include domain experts in the machine learning development and data collection processes. These experts will be able to spot domain-specific issues that cannot be seen from the data without additional context.
Anchoring occurs when a person relies too heavily on a piece of information they have already received. All future decisions and judgments are then made using this piece of information as an “anchor”, even though it may be irrelevant.
In a 1974 article published in Science, Kahneman and Tversky describe an experiment where they spun a rigged wheel of fortune in front of participants. The wheel would either land on 10 or 65. The participants were then asked to estimate the total number of African countries in the United Nations.
The group that saw the wheel land on 10 estimated, on average, 25 countries. On the other hand, those that saw the wheel land on 65 guessed, on average, 45 countries. This is despite the fact that the participants thought the wheel was completely random.
Once humans have been provided with an anchor, they use it as the starting point for any decision. In the above experiments, those who saw 10 on the wheel subconsciously used 10 as the starting point for the number of African countries in the UN. They would then increase the number until they were comfortable with their estimate.
Since most people are not one-hundred percent certain about any estimate, there is a window of uncertainty. Approaching that window from two different directions, starting from a low anchor or a high one, can produce wildly different estimates on either side of it.
Anchoring can be a particularly important consideration when creating training datasets for machine learning models. These datasets are often created by tasking humans with manually labelling the data using their own judgment. As, in many cases of machine learning, the aim is to reproduce human decision making, this is often the most accurate, if not the only, way to create a dataset of the “ground truth.”
This can be quite straightforward if you are labelling whether images are cats or dogs. But imagine you have asked a group of real estate experts to estimate the price of houses. If the first house you show them is a multi-million dollar mansion, the subsequent estimates are likely to be much higher than if you were to start with a run-down bungalow.
The results of this could be a machine learning model that consistently over or underestimates the price of houses, not because the model performs badly, but because the data is biased. In fact, it is likely that a data scientist wouldn’t spot the poor performance, as the validation dataset would have been labeled in the same way, so would contain the same bias.
There are several ways to mitigate anchoring. The first is to deliberately show participants specific initial data points. This could be a series of houses that have been judged to be mid-range. Alternatively, it could be a set of examples: three houses with low, medium and high price tags, shown along with those prices.
In both of these cases, anchoring is not being mitigated; rather, the anchor is deliberately set to avoid bias. Alternatively, to mitigate anchoring itself, each data point can be labelled by multiple participants, with the average taken as the final label. Each participant would receive a random selection of data points, in a random order, so that averaging counteracts any individual biases.
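A minimal sketch of that averaging approach, where the labellers, houses, and price estimates are all hypothetical:

```python
import random
from statistics import mean

random.seed(0)
houses = ["bungalow", "semi-detached", "mansion"]

# Hypothetical price estimates from two independent labellers.
estimates = {
    "alice": {"bungalow": 150_000, "semi-detached": 300_000, "mansion": 2_000_000},
    "bob":   {"bungalow": 180_000, "semi-detached": 320_000, "mansion": 1_800_000},
}

# Each labeller sees the houses in their own random order, so no single
# opening example (a mansion or a run-down bungalow) anchors everyone.
orders = {name: random.sample(houses, k=len(houses)) for name in estimates}

# Final "ground truth" label: the mean estimate across labellers.
final_labels = {h: mean(e[h] for e in estimates.values()) for h in houses}
print(final_labels)
```

Any one labeller's anchor shifts only their own estimates; averaging over several labellers, each anchored differently, pulls the final label back toward an unbiased value.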
Cognitive bias is an unavoidable phenomenon in human decision making. However, research over the past half-century has shown us that these irrational decisions are predictable, and this predictability can be used to mitigate them.
Although machine learning models cannot in and of themselves hold cognitive biases, they can inherit bias from human cognitive biases, because they act as an interface for human decision making.
Whether ensuring there is no bias contained within the data going in, or accounting for the bias of the humans that use the data going out, data scientists need to consider human decision making.
Without these considerations, we have seen how models can be ineffective, or even wrong. And we may not even know that it’s happening.