Data selection bias - can it be a matter of life or death?
Data selection in predictive modelling is often described as an art form in itself, and taking historic data at face value can sometimes be dangerous.
Remember the story of the WW2 US bombers? The military wanted to understand where they should improve the armour on the aircraft. So they analysed where the returning planes had received the most enemy fire and found the most bullet holes on the main and tail wings. The recommendation was to reinforce those areas.
But the data was inherently biased. As Abraham Wald, a highly regarded statistician from Columbia University noted, the analysis only considered the aircraft that had survived and made it back to base. Lost aircraft weren’t included in the analysis. Wald recommended reinforcement in the areas that were not damaged on the returning aircraft - his analysis was put into action and was still being used years later.
It's a well-known example in operational research, but what has this got to do with credit risk? And can data selection bias lead to life or death decisions here?
Well, maybe. From the customer’s point of view, there’s a reason for regulation, and financial health is very closely linked to mental and even physical health. For the lender, the quality of its decisions is directly related to the quality of its business.
From data to decisions
Ultimately, it’s critical to remember that data is used to create models, which are then used to make decisions. Will an application be accepted or rejected? How much can the applicant borrow? For how long? At what price?
Throughout the credit lifecycle, decisions are made every day: what to offer a customer, whether to authorise a transaction, whether to increase or decrease a credit line, when to chase payment and how. The list goes on. And each of those decisions is weighed according to its many potential pros and cons: revenues, losses, costs, response likelihood and more.
They’re of critical importance to lender and customer alike. Bad decisions can have a profound effect on both, which means improving them is an essential, ongoing pursuit.
How?
Well, we don’t start with the data. Understanding the decision outcomes is an essential first step. Only then can you start thinking about the data. What data is available? How does it relate to the decision and the outcome? Historic data and decisions will inform this, but beware the WW2 bombers.
Exploring the bias
Let’s take the example of improving personal loan application decisions. For every applicant, we want to determine whether to accept or decline the application, and what loan price (interest rate) to offer. Looking at our historic pricing decisions allows us to see how increasing or decreasing the price from the advertised or typical rate impacts the likelihood of the applicant accepting the offer and taking the loan.
We know that increasing price will decrease the take-up rates, but by how much? And how much does price sensitivity vary from applicant to applicant? This sounds relatively straightforward, but what if our historic data is in some way incomplete? Or biased?
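To make that concrete, here is a minimal sketch (not from the article) of how historic offer data might be used to estimate price sensitivity, using a simple logistic regression of take-up against the price offset. The file and column names are hypothetical.

```python
# A minimal sketch of estimating price sensitivity from historic offers.
# File and column names (price_offset, took_up) are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historic application data: one row per loan offer.
offers = pd.read_csv("historic_offers.csv")   # hypothetical file

# Price offset: offered APR minus the advertised/typical rate for that product.
X = offers[["price_offset"]].to_numpy()
y = offers["took_up"].to_numpy()              # 1 = took the loan, 0 = declined

model = LogisticRegression().fit(X, y)

# Predicted take-up probability across a range of price offsets.
grid = np.linspace(-2.0, 2.0, 9).reshape(-1, 1)   # e.g. -2% to +2% APR
take_up = model.predict_proba(grid)[:, 1]
for offset, p in zip(grid.ravel(), take_up):
    print(f"offset {offset:+.2f}% -> take-up {p:.1%}")
```

In practice the model would include applicant and product characteristics alongside the price offset, but the principle is the same: the historic offers are the only evidence we have of how price moves take-up.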
We see a similar impact of data selection bias when developing origination risk scorecards, hence the need for reject inference. We only have loan default performance (the bullet holes of our analogy) for the loans that were historically accepted, and we have no performance to analyse for the historically rejected applicants. Might we be armouring the wrong bit of the plane?
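For readers unfamiliar with reject inference, the sketch below illustrates one common approach, sometimes called fuzzy augmentation: a known-good/bad model built on the accepted loans is used to assign weighted inferred outcomes to the rejects before retraining. The files, DataFrames and feature names are hypothetical.

```python
# A minimal sketch of "fuzzy" reject inference. File and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

accepts = pd.read_csv("accepted_loans.csv")   # features plus observed "defaulted" flag
rejects = pd.read_csv("rejected_apps.csv")    # features only, no outcome observed

features = ["bureau_score", "income", "debt_to_income"]   # hypothetical features

# 1. Known-good/bad model trained on the accepted loans only.
kgb = LogisticRegression().fit(accepts[features], accepts["defaulted"])

# 2. Score the rejects and split each into a weighted "bad" and "good" record.
p_bad = kgb.predict_proba(rejects[features])[:, 1]
inferred_bad  = rejects.assign(defaulted=1, weight=p_bad)
inferred_good = rejects.assign(defaulted=0, weight=1 - p_bad)

# 3. Retrain on accepts (weight 1) plus the weighted inferred rejects.
augmented = pd.concat([accepts.assign(weight=1.0), inferred_bad, inferred_good])
final = LogisticRegression().fit(
    augmented[features],
    augmented["defaulted"],
    sample_weight=augmented["weight"],
)
```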
Data selection bias can impact the accuracy and success of such models. For price sensitivity models, it’s common that higher-risk, more credit-hungry applicants will historically have received higher prices, and vice versa. Where a similar profile of applicant has received a range of different prices, this provides good data for understanding price sensitivity for that type or profile of applicant.
But what do we do for profiles where the historic pricing has not varied much, or where it has only varied for very small numbers of applicants? There is a danger that the data is taken at face value, and the take-up rates observed for the differently priced applicants are simply compared to each other.
From this data we may see what we expect to see: where the relatively high-risk applicants were priced lower than normal, their take-up rates increased. But sometimes the complete inverse is seen, where the lower-priced applicants have lower take-up.
At first glance this can be hard to understand and explain. However, it could be that there is additional information about these customers, not necessarily contained in the data, that makes them less credit hungry and therefore less likely to take up a loan offer (they may have large savings balances, for example).
Put simply, if pricing decisions have been targeted historically and not assigned at random, then the historic data will contain bias. Developing and informing price sensitivity models purely from the historic data, without understanding how that data was created (how price decisions were made in the past), can be a recipe for disaster.
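A toy simulation can make the danger explicit. In the sketch below (illustrative numbers, not real data), an unobserved "credit hunger" drives both the historic pricing decision and the likelihood of take-up, so a naive comparison of take-up by price band shows the inverse of the true price sensitivity.

```python
# A toy simulation (illustrative only) of how targeted pricing can invert
# the apparent price response when a confounder is ignored.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unobserved "credit hunger": hungrier applicants are both more likely to
# take up an offer and, under the historic strategy, more likely to be
# priced above the standard rate.
hunger = rng.normal(size=n)

# Targeted historic pricing: hungrier (higher-risk) applicants get +1% APR.
price_offset = np.where(hunger > 0, 1.0, 0.0)

# True behaviour: take-up falls with price but rises with hunger.
logit = 0.5 + 1.5 * hunger - 0.8 * price_offset
take_up = rng.random(n) < 1 / (1 + np.exp(-logit))

# A naive comparison of take-up by price band ignores the confounder...
print("take-up at standard rate:", take_up[price_offset == 0].mean())
print("take-up at +1% APR      :", take_up[price_offset == 1].mean())
# ...and shows HIGHER take-up at the higher price, the opposite of the
# true (negative) price sensitivity built into the simulation.
```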
Price sensitivity models are often a key component of pricing optimisation solutions that maximise portfolio revenues and profitability subject to competing constraints such as losses, capital and market share. Inaccurate price sensitivity models lead to sub-optimal pricing decisions for the business.
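As a simple illustration of that final step, the sketch below picks one price per hypothetical segment to maximise expected margin subject to a minimum expected volume. The segments, take-up rates and margins are made-up numbers, and real optimisation solutions use far richer models and constraints.

```python
# A minimal sketch of constrained price optimisation over hypothetical segments.
from itertools import product

# Per segment: {price_offset: (expected take-up, expected margin per loan)}
segments = {
    "low_risk":  {0.0: (0.60, 120), 0.5: (0.50, 180), 1.0: (0.35, 240)},
    "mid_risk":  {0.0: (0.55, 150), 0.5: (0.45, 210), 1.0: (0.30, 270)},
    "high_risk": {0.0: (0.50, 180), 0.5: (0.40, 240), 1.0: (0.25, 300)},
}
applicants = {"low_risk": 1000, "mid_risk": 800, "high_risk": 500}
min_volume = 900   # constraint: at least 900 expected booked loans

best = None
for combo in product(*[segments[s].keys() for s in segments]):
    volume = profit = 0.0
    for seg, price in zip(segments, combo):
        take_up, margin = segments[seg][price]
        volume += applicants[seg] * take_up
        profit += applicants[seg] * take_up * margin
    if volume >= min_volume and (best is None or profit > best[0]):
        best = (profit, dict(zip(segments, combo)))

print("best expected profit:", best[0])
print("prices by segment   :", best[1])
```

If the take-up rates feeding a search like this come from a biased price sensitivity model, the "optimal" prices it recommends will be anything but.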
How much this example is a matter of life or death for the business is debatable, but every decision rule-set can have a significant impact on success and profitability, as well as on the customers on the receiving end.