Introduction
A significant question to answer with individual claims modelling is how to structure the data going into the model. For example, what should the y variate be: ultimate, incremental payments, cumulative payments or something else? Which variables should you include? How should the data be split between training and test datasets?
This sets out a summary of some of our initial findings, from investigations undertaken a year or so ago using the basic Feedforward Neural Network, as defined in our NN framework, on a well behaved and stable dataset. We looked at a variety of different options, but it was not an exhaustive list, and we continue to investigate other alternatives.
NB: Use of different datasets and model approaches may well lead to different conclusions.
Summary of findings
- The test/train split that seems best so far is the one we define as ‘S3’ (ie data up to the valuation date only, with settled claims as at the valuation date in the training dataset, and open claims as at the valuation date in the test dataset), along with a y variate of the ultimate claim value. However it still underestimates outstanding claims.
- It seems that the model is not picking up development period patterns over time; it would be interesting to see if time series based NNs, such as RNNs, LSTMs or GRUs, would work better
- Whilst the ‘rectangular’ test/train split produced the best results, this is not a practical option in a real life reserving situation where these predictions would be undertaken as at a particular valuation date, and where the details of future claims are not yet known. (NB: splits like the rectangular one can be of value; the practicality of the split ultimately depends on the model’s purpose and the features used.)
There is much more that we could investigate - indeed this exercise led to many more questions. However we decided the best next step would be to move onto looking at time series based NNs to see if they would work better on the outstanding claims in the S3 test dataset and at picking up any development period trends.
We did look at more scenarios than are listed here, including additional test and train splits. We have set out the more pertinent findings from this initial look here. We would encourage anyone to run the code for themselves to see the full range of diagnostics and to try out their own range of scenarios.
The dataset
The dataset used is simulated SPLICE data. More information on the structure and character of the dataset can be found here.
The code used to produce the dataset can be downloaded here, which allows replication of this work as well as providing a clear definition of how the dataset is structured.
SPLICE 1
* Stable, short tail claims
* One row per transaction
* A transaction for every development period
Variables
| Variables included | Description |
|---|---|
| claim_no | claim ID |
| occurrence_time | Time of accident |
| notidel | Reporting delay from occurrence |
| development_period | Time of payment from occurrence, rounded up to the nearest integer |
| pmt_no | Transaction number, ie 1 for the first payment for a particular claim, 2 for the second etc |
| log1_paid_cumulative | Natural logarithm of 1 + cumulative paid amount |
| claim_size | Ultimate settled claim size |
| max_paid_dev_factor | Maximum development factor by the time of that development period for that claim |
| min_paid_dev_factor | Minimum development factor by the time of that development period for that claim |
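As a rough illustration of how some of these derived fields could be constructed, here is a pandas sketch (not the working party's code; the dataframe `txns` and the raw `paid_cumulative` column are assumed names):

```python
import numpy as np
import pandas as pd

def add_derived_variables(txns: pd.DataFrame) -> pd.DataFrame:
    """Derive log1_paid_cumulative and the running dev factor columns."""
    txns = txns.sort_values(["claim_no", "development_period"]).copy()

    # log(1 + cumulative paid), so rows with nothing paid map to zero
    txns["log1_paid_cumulative"] = np.log1p(txns["paid_cumulative"])

    # Period-on-period paid development factor for each claim
    prev = txns.groupby("claim_no")["paid_cumulative"].shift(1)
    dev_factor = (txns["paid_cumulative"] / prev).replace([np.inf, np.nan], 1.0)

    # Running max/min development factor seen so far for each claim
    txns["max_paid_dev_factor"] = dev_factor.groupby(txns["claim_no"]).cummax()
    txns["min_paid_dev_factor"] = dev_factor.groupby(txns["claim_no"]).cummin()
    return txns
```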
Test/train splits
When modelling with Neural Networks you split the data into training and test datasets. The training dataset is used to fit the model, and then data that has been withheld from the training dataset, the test dataset, is used to test how well the model performs.
We tried lots of combinations of different y variates and test/train data splits. The big limiting factor when choosing test/train splits is what the test dataset can contain for future claim transactions, eg it can’t use paid data as that would not be known at the time of the valuation. As stated earlier, these are some of our initial findings - there are other alternatives that we have not yet investigated.
Here we set out 3 test/train splits. We looked at more splits than are shown here but these are the ones we found to be most useful/relevant.
* Rectangular
* S3
* S5
NB: The naming conventions S3 and S5 are just numbers 3 and 5 of the various test/train splits we tested. A new naming convention would be helpful – suggestions welcome!
The following 10 minute video provides some background to the SPLICE dataset and looks at the different test/train splits: Diagnostics Demo - 1. Data NB: It was created for internal use; ie for members of the working party.
Rectangular split
In our NN explainer blog we simply split the datasets randomly by claim, with 60% of claims in the training dataset and the remaining 40% in the test dataset. The full history of a particular claim is included in each dataset, so you can think of this as akin to ‘rectangular data’.
In traditional reserving practice we would tend to include development up to a particular calendar period in the training data, with data after that calendar period being the projected estimates; akin to the ‘triangular data’ we are used to in claims reserving.
We have used this rectangular test/train split in the explainer blog as initially we were trying to understand what the models were doing, and we thought this type of train/test split would give the most stable results, even if it isn’t practical when projecting claims in real life.
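For concreteness, a minimal sketch of this random-by-claim split (assuming a transaction-level dataframe `txns` with a `claim_no` column; not the working party's code):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
claims = txns["claim_no"].unique()
train_claims = rng.choice(claims, size=int(0.6 * len(claims)), replace=False)

# Each claim's full history goes to one side or the other: 'rectangular'
train = txns[txns["claim_no"].isin(train_claims)]
test = txns[~txns["claim_no"].isin(train_claims)]
```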
S3 split
The S3 split also randomly splits by claim (60/40) as the rectangular split does. However here we only include data that would be known as at the valuation date; only settled claims are in the training data, and only open claims are in the test dataset.
S5 split
The training dataset is identical to that in S3. The test dataset is also the same as S3 but in addition includes future claims, ie information about claim transactions that took place after the valuation date. This means that variables such as cumulative payments and the timing of future payments for open claims would not be known at the valuation date, so should not really be included in the test dataset. Other variables such as notification delay and occurrence period can still be included as they would not change in the future after the valuation date.
The following sketch illustrates how S3 and S5 can be defined in code (an illustrative reconstruction rather than the original definitions, which are in the downloadable code):
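```python
# Illustrative reconstruction, not the original code. Assumes a
# transaction-level dataframe `txns` with hypothetical `payment_time`
# and `settlement_time` columns, and a valuation date `t_val`.
# (The 60/40 random element by claim is omitted for brevity.)
is_settled = txns["settlement_time"] <= t_val  # claim closed by the valuation date
known = txns["payment_time"] <= t_val          # transaction known at the valuation date

# S3: data up to the valuation date only;
# settled claims in training, open claims in test
s3_train = txns[known & is_settled]
s3_test = txns[known & ~is_settled]

# S5: same training data as S3, but the test dataset also keeps the
# post-valuation transactions of the open claims
s5_train = s3_train
s5_test = txns[~is_settled]
```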
The following graphs look into some of the differences between the rectangular and S3 splits, and can help to show what each type of split looks like.
Code to reproduce these graphs can be found here.
You can draw your own conclusions, but some initial observations are set out below:
If we look by development period (the top row of graphs), the bar charts show numbers of claims: grey in the training dataset, and blue in the test dataset. The notification period is short, so with the rectangular data the number of claims in each development period is pretty stable by about development period 4. We can also see the 60/40 split in the numbers of claims.
With S3, only data before the valuation date is included, which is why there are fewer claims with data at later development periods. We can also see how few claims there are in the test dataset; this is because it only contains claims that are open as at the valuation date.
The lines on these graphs show total cumulative paid amounts (black=train, blue=test). As the rectangular data contains the full history of each claim, we see the line increasing until it plateaus when all claims are settled. Again we can see the 60/40 split in the difference in size between the training and test datasets.
The graphs on the bottom row are the same but with occurrence period on the x axis. The number and size of claims for the SPLICE 1 dataset will vary randomly by occurrence period but be roughly even over the whole dataset. Again we can clearly see the 60/40 split between the train and test datasets in the rectangular split. S3 again is dominated by the fact that we are only using data up to the valuation date, so there will be more claims at earlier occurrence periods (and vice versa). In the test dataset (blue bars and lines) we only have claims that are outstanding at the valuation date, so claims in earlier occurrence periods will likely have settled by the valuation date.
Data Structures Investigated
| Structure element | Description |
|---|---|
| Y variates | Ultimate and Cumulative Paid |
| Test/train splits | Rectangular, S3 and S5 |
| Data | SPLICE 1 dataset |
| Format | One row per transaction, with a transaction for every development period, even if there is no payment |
List of variables in the datasets:
| Variable Name | Description |
|---|---|
| claim_no | claim ID |
| occurrence_time | Time of accident |
| notidel | Reporting delay from occurrence |
| development_period | Time of payment from occurrence, rounded up to the nearest integer |
| pmt_no | Transaction number, ie 1 for the first payment for a particular claim, 2 for the second etc |
| log1_paid_cumulative | Natural logarithm of 1 + cumulative paid amount |
Findings
The following graphs show how the rectangular test/train split works very well indeed. However it includes future claim information that would not be known as at the valuation date, so it would not be realistic to use in real life.
It can be seen that using the ultimate value as the y variate provides a much better fit than using cumulative paid. It is also interesting to note that the model still works quite well with ultimate as the y variate, even if we don’t include cumulative paid as a variable in the model. We see something like this again later on when looking at split S5, where including the timing (and not necessarily the size) of future payments makes a big difference to the performance of the Neural Network.
Graph descriptions
For an introduction to the graphs shown here you might find this 10 minute demo useful Diagnostics Demo - 2. Tensorboard. NB: It was originally intended for internal use, ie for working party members only. See also our Diagnostics blog.
Logged AvsE Ultimates on the test dataset, ultimates only
This graph plots the logged actual ultimate against the logged ultimate prediction for each claim. ‘Ultimates only’ means that only one value is shown for each claim (from the last transaction in the dataset for that claim). Logging the values means that we can see smaller claims more clearly. There is a very skewed distribution of claim sizes, with many more claims being small. The mean claim size of the dataset is around 225,000, which equates to a logged value of 12.3.
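A sketch of how such a plot can be produced (assuming a `test` dataframe with the actual `claim_size` and a hypothetical `pred_ultimate` prediction column):

```python
import matplotlib.pyplot as plt
import numpy as np

# One point per claim: take the last transaction in the dataset per claim
last = test.sort_values("development_period").groupby("claim_no").tail(1)

plt.scatter(np.log(last["claim_size"]), np.log(last["pred_ultimate"]),
            s=5, alpha=0.4)
plt.axline((0, 0), slope=1, color="grey")  # y = x reference line
plt.xlabel("log(actual ultimate)")
plt.ylabel("log(predicted ultimate)")
plt.show()
```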
QQ plot of the test dataset
A QQ or quantile-quantile plot is a measure of the goodness of fit of a model. It orders the actual y values into (in this example) 5% quantiles. For each quantile it then calculates the mean of the expected, or predicted, values from the model. These actual and expected means are then plotted as a scatterplot. The closer the points are to a straight line the better the fit is.
NB: for the predicted values there is an ultimate prediction for each row in the data, where each row represents a development period for a particular claim, so each claim will have multiple different expected ultimates dependent on the development period.
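A minimal sketch of this calculation (the column names `y_actual` and `y_pred` are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Bucket rows into 20 quantiles (5% each) of the actual y values
bucket = pd.qcut(test["y_actual"], q=20, labels=False)
means = test.groupby(bucket).agg(actual=("y_actual", "mean"),
                                 predicted=("y_pred", "mean"))

# The closer the points sit to the y = x line, the better the fit
plt.scatter(means["actual"], means["predicted"])
plt.axline((means["actual"].iloc[0],) * 2, slope=1, color="grey")
plt.xlabel("mean actual per 5% quantile")
plt.ylabel("mean predicted per 5% quantile")
plt.show()
```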
AvsE on the test dataset, ultimates only
This plots the actual ultimate against the ultimate prediction for each claim. Only one value is shown for each claim (from the last transaction in the dataset for that claim).
We chose to use the S3 split with ultimate y variate going forward, as the best of the options we looked at, taking into account the practical realities of reserving at a valuation date. At the same time we also note that it underestimates outstanding claims. (We found similar results of underestimating outstanding claims using the chain ladder method on individual claims on a well behaved, stable triangle.)
The above is a comparison of the S3 and S5 splits, and of ultimate and cumulative paid as the y variate.
S5 was tempting to use as it uses future claim payment information in the test dataset. However in practice you cannot include variables such as cumulative paid and pmt_no as they would not be known as at the valuation date. It can be seen that excluding these variables makes the fit poorer.
It was interesting to note that if we took out cumulative paid but included pmt_no, then S5 did produce results similar to those of S3, which suggests the timing of future payments has a significant part to play in the model.
For the S3 split we could not use cumulative paid as the y variate, as the structure of S3 means there are no claims with future cumulative payments to test it on.
The following graphs show the total actual and predicted claim costs, split by training and test datasets for the S3 split. The first shows the totals by development period, and the second graph by occurrence period.
We can see from these graphs that the actual versus expected (AvsE) for the training data is quite good, but here again we see the model underestimating the ultimate claim amounts in the test dataset, ie for claims open as at the valuation date. We can also see how the prediction is worse for the earlier development periods.
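As a sketch of the underlying aggregation (one plausible version only; `pred_ultimate` is an assumed prediction column, and the occurrence period version would follow the same pattern):

```python
import pandas as pd

# One plausible aggregation: at each development period, compare the total
# actual ultimates with the total predicted ultimates across the claims
# that have a transaction in that period.
def totals_by_dev(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("development_period").agg(
        actual=("claim_size", "sum"),
        predicted=("pred_ultimate", "sum"),
    )

by_dev = pd.concat({"train": totals_by_dev(s3_train),
                    "test": totals_by_dev(s3_test)},
                   names=["dataset"])
print(by_dev)
```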
Other observations:
- There is some evidence that the model is not picking up development patterns over time
- In the S3 test dataset, which only includes claims open as at the valuation date, performance is poor
- Predictions at later development periods are closer to the ultimate value
- More recent accident periods show poorer performance
- If we include current paid, ie the paid value at the valuation date, then this dominates the predicted value
- As we set out in our hyperparameters blog, the initialization value used is very important (here it is the mean claim size; see the sketch after this list). We noted that smaller claims do not tend to move away from the initial value very much (as can be seen in the Logged QQ plots)
- Even though the model has been logged, large claims may still be dominating the fitting
- In addition, small claims will probably have fewer transactions (ie pmt_nos), so less weight might be given to these claims
- With the rectangular split, despite the fit of individual claims looking good, the total expected value is too high. This might also be related to the small claims which are not moving away from the initial values enough.
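On the initialization point above, here is a minimal PyTorch sketch of starting the output bias at the logged mean claim size (a generic feedforward net for illustration, not the working party's architecture):

```python
import numpy as np
import torch
import torch.nn as nn

mean_claim_size = 225_000.0  # roughly the dataset mean quoted earlier

net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

with torch.no_grad():
    # Start predictions near log(mean claim size) ~ 12.3, so the network
    # learns deviations from a sensible baseline rather than from zero
    net[-1].bias.fill_(float(np.log(mean_claim_size)))
```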
How to run the code for yourself
You can look at the full output of graphs in Tensorboard yourself. To do so, download code here and run your own scenarios to see for yourself and draw your own conclusions.
Additional comments
- The inclusion of claim_no as a variable could distort the model, as later occurrence periods have higher claim numbers. However when we reran the model without claim_no it did not change these results, probably because in the SPLICE 1 dataset there is no trend by occurrence period.
- The data included rows (transactions) for every development period up to period 40, so this meant that the cumulative paid value was included in the data for settled claims. It would be interesting to see if not including a cumulative paid value for settled claims would improve the prediction on the test data.
- There are lots more investigations that could be done, eg including case estimates, including additional covariates such as age of claimant or injury severity, or looking at less well behaved data, ie longer tailed claims, or data which includes trends or shocks. These alternatives are available in other SPLICE datasets.
With thanks to Matthew Lambrianidis for his feedback and advice, and to Jacky Poon for the original NN model code