5. Data

Authors

Nigel Carpenter

Isabelle Williams

Jacky Poon

John McCarthy

Published

November 22, 2022

What data is available to develop ML techniques on?

We have split this question into two parts:

Q5a: What data can you use while learning or benchmarking ML techniques?

First, you can use the data available to your organisation; it is always motivating to use real data that is relevant to you when learning new methodology.

That said, if the data available to you is limited or not at the right level of granularity, there are some publicly available options that may be helpful. Collections of public insurance data, such as the insurance datasets packaged with R, or more general data from websites like Kaggle or the UCI Machine Learning Repository, may be useful in your learning journey.

Using synthetic data can be a good way of benchmarking different machine learning approaches, and increasingly researchers are using simulated data in their publications and making that data available. Reading the papers and trying the code can be a good way to build an understanding of what data can be useful and how it can be incorporated into a machine learning algorithm. Examples of papers with published data and code, sometimes synthetic, include:

The working party has reviewed several data simulation packages that can be helpful to create data for use in machine learning:
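As a purely illustrative sketch, not drawn from any of the papers or packages referred to above, the Python snippet below simulates a toy individual-claims dataset with occurrence, reporting and payment delays: the kind of granular data that simulation tools produce and on which ML reserving approaches can be benchmarked. All distributions and column names here are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_claims = 1_000

# Toy simulator: occurrence dates, reporting delays and a handful of payments per claim.
occurrence = rng.integers(0, 365 * 4, n_claims)            # days since start of a 4-year window
report_delay = rng.exponential(30, n_claims).astype(int)   # days from occurrence to notification
n_payments = rng.poisson(2, n_claims) + 1                  # at least one payment per claim

rows = []
for claim_id in range(n_claims):
    for _ in range(n_payments[claim_id]):
        pay_delay = report_delay[claim_id] + int(rng.gamma(2, 90))  # days from occurrence to payment
        rows.append({
            "claim_id": claim_id,
            "occurrence_day": occurrence[claim_id],
            "report_day": occurrence[claim_id] + report_delay[claim_id],
            "payment_day": occurrence[claim_id] + pay_delay,
            "paid": rng.lognormal(mean=7, sigma=1),
        })

claims = pd.DataFrame(rows)
print(claims.head())
```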

Q5b: What internal and external data should you use when applying ML in your organisation?

Overview

The chain ladder reserving techniques and their many variants are very efficient at extracting insights from triangular data. Similarly, machine learning algorithms are very effective at finding predictive patterns in large volumes of data. In recent years, the combination of machine learning and big data has delivered improvements in predictive accuracy in areas ranging from weather forecasting and journey times to cancer detection, to name a few. We hypothesise that machine learning could improve the process of actuarial reserving, but that gives rise to these questions:

  • What additional data is most relevant to improving reserve estimates?
  • How can this data be incorporated into a machine learning based reserving process?
  • How can we research and apply these techniques if we don’t have access to granular transactional data?

These are all good questions, and they are among those the working party is seeking to answer. In truth, we won't know the answers until we have researched and tried a range of approaches. But experience from other domains gives us encouragement that research will yield insight and answers. In fact, these data questions have already resulted in some useful artefacts that can help our research.

Data from your organisation

Chain ladder reserving techniques are very efficient at extracting insights from aggregated triangular data, whereas machine learning algorithms are very effective at finding predictive patterns in larger volumes of data. So make sure you retain access to the granular source data, and make aggregation the last step in your process; that way you always have the option to work at the most granular level if you wish to.
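To illustrate the point, here is a minimal sketch, assuming a hypothetical transaction-level claims table with illustrative column names, in which the familiar origin-year / development-year paid triangle is only built in the final step.

```python
import pandas as pd

# Hypothetical granular claim transactions (column names are illustrative only).
claims = pd.DataFrame({
    "claim_id":       [1,      1,    2,    3,    3,    4],
    "occurrence_day": [10,    10,  200,  400,  400,  800],
    "payment_day":    [40,   500,  260,  450,  900,  830],
    "paid":           [1000,  500, 2000,  750,  250, 3000],
})

# Derive origin and development periods only when a triangle is actually needed;
# everything upstream of this point stays at transaction level.
claims["origin_year"] = claims["occurrence_day"] // 365
claims["dev_year"] = claims["payment_day"] // 365 - claims["origin_year"]

triangle = (
    claims.pivot_table(index="origin_year", columns="dev_year",
                       values="paid", aggfunc="sum")
          .cumsum(axis=1)   # incremental -> cumulative paid; unobserved cells stay NaN
)
print(triangle)
```

Because the triangle is derived at the very end, the same granular table remains available for ML approaches that want to use individual claim features.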

With regard to what additional data we should be using, it is likely there is a lot to gain and learn from speaking to the Pricing, Claims and Operations functions. Pricing holds a wealth of risk-related and external data that it can show affects the likelihood and quantum of a claim. All of this data is known at policy outset and is hopefully already recorded in databases, so it should be available to include in reserving datasets.

The Claims and Operations functions will also have a wealth of data from their workforce planning and automation tools that is pertinent to claim development patterns. There will be data about the stage of every claim in its lifecycle through to settlement, as well as information about staffing levels, call demand and even external data used to support claim verification, such as weather data, all of which can be helpful.
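As a sketch of how that data might come together, the snippet below joins hypothetical policy-level pricing features and claim lifecycle information onto a granular claims extract; all table and column names are illustrative assumptions rather than a prescribed layout.

```python
import pandas as pd

# Hypothetical extracts; table and column names are illustrative only.
policies = pd.DataFrame({
    "policy_id": [101, 102, 103],
    "vehicle_age": [3, 7, 1],
    "region": ["North", "South", "North"],
})
claims = pd.DataFrame({
    "claim_id": [1, 2, 3],
    "policy_id": [101, 101, 103],
    "paid": [1200.0, 300.0, 4500.0],
})
claim_ops = pd.DataFrame({
    "claim_id": [1, 2, 3],
    "status": ["settled", "open", "open"],
    "handler_team": ["motor_fast_track", "motor_fast_track", "large_loss"],
})

# Enrich the granular reserving dataset with pricing (policy-level) and
# claims-operations features before any aggregation or modelling.
reserving_data = (
    claims.merge(policies, on="policy_id", how="left")
          .merge(claim_ops, on="claim_id", how="left")
)
print(reserving_data)
```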

For external data such as weather data, it is increasingly possible to purchase historic records retrospectively to fill any gaps you may have.

Data quality

The reality for many actuaries is that we don't always have good quality data available to us, granular or otherwise. This makes it difficult to develop ML models in the first place, let alone a robust process to refresh them, as we are often building shiny new models on top of data processes whose foundations are shaky at best.

What do you do when you don’t have great data? One answer is to learn from the techniques used to prep data for ML models. For example, imputing missing data in a smart way can lead to much more efficient and less biased models (depending on the quantity of and reasons for missingness, of course).
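As a minimal sketch of the sort of technique involved, the snippet below uses scikit-learn's SimpleImputer with a median strategy on a hypothetical feature set; model-based or multiple imputation may reduce bias further where missingness is substantial.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical reserving feature set with gaps in some rating factors.
X = pd.DataFrame({
    "vehicle_age": [3, np.nan, 7, 1, np.nan],
    "sum_insured": [20_000, 35_000, np.nan, 15_000, 50_000],
})

# Median imputation is a simple, robust starting point; whether it is adequate
# depends on the quantity of and reasons for the missingness.
imputer = SimpleImputer(strategy="median")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(X_imputed)
```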

Sometimes, sorting out a data problem can be as simple as reaching out to the people who deal with it further down the pipeline. Actuaries can be reluctant to reach out to departments like BI or Data Engineering, and those departments are often reluctant to respond to actuaries too. If we can strengthen the relationships that Actuarial departments have with these teams, issues with data are more likely to be addressed at source (where the issue might actually be), rather than "patched over" by the end user.

The point here is that we need to build a virtuous cycle around data, where processes take minimal time and energy to run and issues are addressed quickly, rather than a vicious one, where actuaries spend so much time running poorly constructed data processes and fixing bad data that they have little time to do anything else.

Finally, any model, whether traditional or ML, is only as good as the data it is based on. Companies with good data give their actuaries the opportunity to properly interpret and understand model outputs and advise on the risks facing the organisation. Companies with poor data will find their actuaries drowning in manual adjustments and fixes, and ultimately that uncertainty has a cost, given that insurance companies need to hold risk margins.


