Predictive modelling – are your models degrading and under-performing because of data pitfalls?
Many modellers and model owners see substantial model degradation and a drop-off in performance between pre-live validation and in-live monitoring. Is this all attributable to population shifts and data drift, or are there other factors at play? Many common modelling mistakes can also contribute to under-performing models, and they sometimes hide behind the population or data shift explanations. To highlight key areas where things can start to go wrong for your models, here are three questions, all related to understanding your data. Challenging yourself or your modelling teams to answer them fully will help you avoid some common predictive modelling pitfalls, which I have illustrated below with real-life examples of how things can go wrong.
Where has the data come from, and will it be sourced in the same way when the model is in production? Is the data limited in any way? What is missing, and has it been influenced by how it was requested and collected?
In some databases, field values can be overwritten when new information is obtained: for example, the date of last address change being automatically updated when someone moves. Without strong data management processes in place, this can cause problems when the data is used for predictive modelling. When developing any predictive model, it is vital that the information used to generate your set of candidate modelling characteristics is ‘as at the time of scoring’; for an application scorecard, for example, that means as at the time of the application. It is easy to see how this may not always be the case if field values can be updated over time and the data is not time-stamped.
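To make this concrete, here is a minimal sketch in Python (using pandas, with hypothetical table and field names) of deriving a ‘months since last address change’ characteristic from a time-stamped event history, so that only information known at the scoring date feeds the characteristic:

```python
import pandas as pd

# Hypothetical time-stamped address history, kept instead of a single
# 'date_of_last_address_change' field that gets overwritten on each move.
address_events = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "change_date": pd.to_datetime(["2018-03-01", "2020-06-15", "2022-01-10"]),
})

scoring_date = pd.Timestamp("2021-04-30")  # e.g. the application date

# Only events already known at the scoring point may feed the characteristic;
# the 2022 move must not leak into a 2021 application score.
known = address_events[address_events["change_date"] <= scoring_date]
last_change = known.groupby("customer_id")["change_date"].max()
months_since_change = (scoring_date - last_change).dt.days / 30.44
```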
Data for modelling is often obtained from different sources. In credit risk it is common to use retrospective CRA (bureau) data. Again, for this data to be useful for model development it must represent what would have been available at the scoring point. Unfortunately, I have come across many developments where this has not been the case. Take behaviour scoring, for example, where an historic month or set of months is taken as the Observation period, with the Performance period being the corresponding following 12 months. If the observation month is October 2020, with performance assessed November 2020 to October 2021, the bureau data needs to represent what was available prior to 1st October 2020, not what was available as at 31st October 2020. The difference sounds small, but it can have a major impact on the predictive patterns and power seen in the bureau data, which will not be replicated when the model goes live in production.
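As a minimal sketch of that cut-off logic (with hypothetical table and column names), the retrospective bureau extract should be filtered on what was available strictly before the observation month begins:

```python
import pandas as pd

# Observation month: October 2020; performance outcome: Nov 2020 - Oct 2021.
observation_month_start = pd.Timestamp("2020-10-01")

# Hypothetical retro bureau extract, with the date each record first became
# available to the lender.
bureau = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "available_from": pd.to_datetime(["2020-09-28", "2020-10-05", "2020-10-20"]),
})

# The retro data must reflect what was on file before 1st October 2020,
# not everything that had accumulated by 31st October 2020.
usable = bureau[bureau["available_from"] < observation_month_start]
```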
Information that has not been requested from an applicant is very different to information that has not been provided by an applicant, and the ability to distinguish between the two is often very valuable. Both may appear as ‘missing’ in your data, but the meaning of the ‘missing’ is quite different. Changes to application forms and processes that make some data fields mandatory or optional can have a strong impact on missing data and on how it relates to predicting credit risk. The same applies beyond missing values: consistency in how every field is requested, collected and stored matters.
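One simple way to preserve that distinction is to store the reason for the missing value as its own category. The sketch below assumes a hypothetical application extract where each form version records which fields it actually requested:

```python
import numpy as np
import pandas as pd

# Hypothetical application extract: 'income' can be missing either because the
# field was never shown on that form version, or because the applicant left it blank.
apps = pd.DataFrame({
    "form_version": ["v1", "v2", "v2", "v2"],
    "income": [np.nan, np.nan, 52000, 31000],
})

# Fields requested by each form version (assumed for illustration).
requested = {"v1": set(), "v2": {"income"}}

def income_status(row):
    if pd.notna(row["income"]):
        return "provided"
    return "not_provided" if "income" in requested[row["form_version"]] else "not_requested"

# 'not_requested' and 'not_provided' can then be treated as separate classes
# when grouping, rather than one undifferentiated 'missing' bucket.
apps["income_status"] = apps.apply(income_status, axis=1)
```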
Have you generated reliable and sensible characteristics/features as candidate predictors for your model?
I have seen Credit Card and Current Account behaviour scorecards where customers behaving in a completely consistent fashion from month to month, in terms of their spend and payments, nonetheless saw their monthly behaviour risk score rise or fall in particular months. Clearly this is not ideal. It was due to the way certain characteristics (or features) were generated using “number of days”, with the resulting values affected by the number of days in each calendar month (which varies from 28 to 31).
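A simple guard is to normalise any “number of days” style characteristic by the length of the month. A minimal sketch with made-up spend figures:

```python
import pandas as pd

# A customer spending a steady 50 per day looks volatile if only the monthly
# total is used, because calendar months vary from 28 to 31 days.
df = pd.DataFrame({
    "month": pd.period_range("2023-01", periods=4, freq="M"),
    "monthly_spend": [1550, 1400, 1550, 1500],  # 50 per day in every month
})
df["days_in_month"] = df["month"].dt.days_in_month
df["spend_per_day"] = df["monthly_spend"] / df["days_in_month"]  # constant 50.0
```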
When developing ‘scorecard’ type models, have you grouped or binned your predictors appropriately?
When performing grouping, beware of small raw volumes in your fine classes. I have often come across low-volume fine classes merged into a larger group purely on the basis of their weight of evidence (WoE). Relying on WoE values alone to arrive at final coarse classes that have satisfactory volumes, and are therefore deemed reliable groups, is a classic case of over-fitting. A parameterised grouping tool, such as the one available in Paragon’s Modeller, can help avoid such bad practice and over-fitting.
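The sketch below illustrates the principle only (it is not Paragon Modeller’s algorithm): enforce minimum volume constraints when merging adjacent fine classes, and only then compute WoE on the resulting coarse classes.

```python
import numpy as np
import pandas as pd

def coarse_classes(fine, min_count=500, min_bads=30):
    """Merge adjacent fine classes until each coarse class meets minimum
    volume constraints, then compute WoE -- rather than merging sparse
    classes purely because their WoE values look similar."""
    rows = fine.sort_values("fine_class")[["fine_class", "goods", "bads"]].values.tolist()
    merged = [rows[0]]
    for label, goods, bads in rows[1:]:
        prev = merged[-1]
        if prev[1] + prev[2] < min_count or prev[2] < min_bads:
            # previous coarse class still too small: absorb this fine class
            merged[-1] = [f"{prev[0]}+{label}", prev[1] + goods, prev[2] + bads]
        else:
            merged.append([label, goods, bads])
    # if the final class is still too small, fold it into its neighbour
    if len(merged) > 1 and (merged[-1][1] + merged[-1][2] < min_count or merged[-1][2] < min_bads):
        last = merged.pop()
        merged[-1] = [f"{merged[-1][0]}+{last[0]}", merged[-1][1] + last[1], merged[-1][2] + last[2]]
    out = pd.DataFrame(merged, columns=["coarse_class", "goods", "bads"])
    woe = np.log((out["goods"] / out["goods"].sum()) / (out["bads"] / out["bads"].sum()))
    return out.assign(woe=woe)
```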
These questions and pitfalls are by no means a comprehensive list and are just some of the types of questions that should be posed by model validation teams, reviewers and approvers of credit risk models. Over the last 20+ years I have developed, reviewed, approved or consulted on hundreds of predictive models that will have informed millions of credit risk decisions impacting millions of individuals and SMEs. That is the reason why models need to be as good as they can be – they directly impact individuals, businesses and the lenders that use them. The modelling approaches, the data sources, the tools and techniques have evolved and changed over the last 20 years, but the fundamentals remain the same. These fundamentals boil down to (i) understanding of your data and (ii) getting your modelling design right.
In recent years there has been much greater emphasis, both internally and from regulators, on independent validation of models and on model risk management best practice across the model lifecycle. This is certainly helping to ensure that the modelling fundamentals are adhered to and that attention is not distracted solely by the latest and greatest modelling algorithms. It does not matter which statistical or machine learning technique is used if your data and model design fundamentals are lacking or flawed. Using a model risk management tool such as Focus also provides an effective way for all model stakeholders to understand what has been asked, answered and managed appropriately throughout the model lifecycle, from model design to development, implementation, ongoing use and periodic validation.
Please feel free to share in the comments section any of your modelling pitfall experiences.