import numpy as np #deal with arrays in python
import pandas as pd #python data frames
import statsmodels.api as sm # GLM
import statsmodels.formula.api as smf # formula based GLM
import matplotlib.pyplot as plt #standard pythonic approach for plots
import seaborn as sns #nice library for plots
This is a repost of our previous Reserving with GLMs article with all code in Python rather than in R. There are some minor edits due to the language swap, but essentially the article is unchanged. If this is your first time seeing this article, and you are an R user, then you may prefer to read the original article.
Introduction
An aim of the MLR working party is to promote the use of machine learning (ML) in reserving. So why then are we talking about using GLMs for reserving? Well, as noted in our introductory post, we consider that getting familiar with using GLMs for reserving is a good place to begin your ML journey - GLMs should already be familiar from pricing so making the switch to reserving with GLMs is a useful first step.
The first time I did a reserving job (back in 2001) I used GLMs. Coming from a statistical background and being new to the actuarial workplace at the time, this didn’t seem unusual to me. Since then, most of my reserving jobs have used GLMs - personally I find it a lot easier and less error-prone than working in Excel. Also, once you know what you are doing, you can do everything with a GLM that you can do with an Excel-based model, and more.
However, some people reading this article may be new to the idea of using GLMs in reserving. So I’m going to use an example where we start with a chain ladder model, fitted as a GLM and then explore the additional features that we can add using a GLM. All the Python code will be shared here.
The material is mostly based on a 2016 CAS monograph Stochastic Loss Reserving Using Generalized Linear Models that I co-authored with Greg Taylor.
Before we begin, let’s import the Python packages that we need.
Data
The data used here were sourced from the Meyers and Shi (2011) database, and are the workers compensation triangle of the New Jersey Manufacturers Group. They are displayed in Section 1.3 of the monograph. We’ve made a CSV file of the data (in long format) available here for convenience.
msdata = pd.read_csv(
    "https://institute-and-faculty-of-actuaries.github.io/mlr-blog/documents/csv/glms_meyershi.csv",
    dtype = {'acc_year': int,
             'dev_year': int,
             'incremental': float,
             'cumulative': float}
)
print(msdata)
acc_year dev_year cumulative incremental
0 1 1 41821.0 41821.0
1 1 2 76550.0 34729.0
2 1 3 96697.0 20147.0
3 1 4 112662.0 15965.0
4 1 5 123947.0 11285.0
5 1 6 129871.0 5924.0
6 1 7 134646.0 4775.0
7 1 8 138388.0 3742.0
8 1 9 141823.0 3435.0
9 1 10 144781.0 2958.0
10 2 1 48167.0 48167.0
11 2 2 87662.0 39495.0
12 2 3 112106.0 24444.0
13 2 4 130284.0 18178.0
14 2 5 141124.0 10840.0
15 2 6 148503.0 7379.0
16 2 7 154186.0 5683.0
17 2 8 158944.0 4758.0
18 2 9 162903.0 3959.0
19 3 1 52058.0 52058.0
20 3 2 99517.0 47459.0
21 3 3 126876.0 27359.0
22 3 4 144792.0 17916.0
23 3 5 156240.0 11448.0
24 3 6 165086.0 8846.0
25 3 7 170955.0 5869.0
26 3 8 176346.0 5391.0
27 4 1 57251.0 57251.0
28 4 2 106761.0 49510.0
29 4 3 133797.0 27036.0
30 4 4 154668.0 20871.0
31 4 5 168972.0 14304.0
32 4 6 179524.0 10552.0
33 4 7 187266.0 7742.0
34 5 1 59213.0 59213.0
35 5 2 113342.0 54129.0
36 5 3 142908.0 29566.0
37 5 4 165392.0 22484.0
38 5 5 179506.0 14114.0
39 5 6 189506.0 10000.0
40 6 1 59475.0 59475.0
41 6 2 111551.0 52076.0
42 6 3 138387.0 26836.0
43 6 4 160719.0 22332.0
44 6 5 175475.0 14756.0
45 7 1 65607.0 65607.0
46 7 2 110255.0 44648.0
47 7 3 137317.0 27062.0
48 7 4 159972.0 22655.0
49 8 1 56748.0 56748.0
50 8 2 96063.0 39315.0
51 8 3 122811.0 26748.0
52 9 1 52212.0 52212.0
53 9 2 92242.0 40030.0
54 10 1 43962.0 43962.0
So we have four columns:
- ‘acc_year’: accident year, numbered from 1 to 10
- ‘dev_year’: development year, also numbered from 1 to 10
- ‘cumulative’: cumulative payments to date
- ‘incremental’: incremental payments for that accident year, development year combination.
We can also plot the data
fig, axs = plt.subplots(1, 2)
plt.subplots_adjust(wspace=1)

g = sns.lineplot(x='dev_year', y='cumulative', data=msdata, hue='acc_year', palette=sns.color_palette("deep"), ax=axs[0])
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize = 6)
g.set_title("Cumulative payments")

g1 = sns.lineplot(x='dev_year', y='incremental', data=msdata, hue='acc_year', palette=sns.color_palette("deep"), ax=axs[1])
g1.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize = 6)
g1.set_title("Incremental payments")
The data look quite well behaved - each year seems to have a similar development pattern.
Before we go further, we’ll add some additional variables to the data set:
- Categorical (or factor) versions of ‘acc_year’ and ‘dev_year’ - these will be needed later for modelling
- A calendar year variable, ‘cal_year’
Later on we’ll need to be able to apply the same categorical structure to subsets of ‘acc_year’ and ‘dev_year’ so we make an object to hold the categorisation, ‘cat_type’.
We’ll have a look at the data and check the variable types to ensure everything is as it should be.
cat_type = pd.api.types.CategoricalDtype(categories=list(range(1,11)), ordered=True)

msdata["acc_year_factor"] = msdata["acc_year"].astype(cat_type)
msdata["dev_year_factor"] = msdata["dev_year"].astype(cat_type)
msdata["cal_year"] = msdata["acc_year"] + msdata["dev_year"] - 1  # subtract 1 so that first cal_year is 1 not 2

print(msdata.head(6))
acc_year dev_year cumulative ... acc_year_factor dev_year_factor cal_year
0 1 1 41821.0 ... 1 1 1
1 1 2 76550.0 ... 1 2 2
2 1 3 96697.0 ... 1 3 3
3 1 4 112662.0 ... 1 4 4
4 1 5 123947.0 ... 1 5 5
5 1 6 129871.0 ... 1 6 6
[6 rows x 7 columns]
print(msdata.dtypes)
acc_year int32
dev_year int32
cumulative float64
incremental float64
acc_year_factor category
dev_year_factor category
cal_year int32
dtype: object
Chain ladder model
Fitting the model
Our first model will be the familiar chain ladder (volume all) model. The monograph (and references therein) notes that a certain type of GLM gives exactly the same result as the chain ladder, so I’m going to use that GLM to get the chain ladder result.
The specific model I’m using that replicates the chain ladder result is the Over-dispersed Poisson (ODP) cross classified (cc) model (Sections 3.3.2 and 3.3.3 of the monograph).
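As a reminder of the structure (in notation broadly following the monograph), the cross-classified ODP model takes the incremental payment \(Y_{kj}\) for accident year \(k\) and development year \(j\) to be over-dispersed Poisson with

\[
\log \operatorname{E}[Y_{kj}] = \alpha_k + \beta_j, \qquad \operatorname{Var}[Y_{kj}] = \phi \, \operatorname{E}[Y_{kj}],
\]

so there is one parameter per accident year and one per development year. Fitted with a log link and no intercept, this is exactly the GLM specified below.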
To apply the model, we will use the glm() function from the statsmodels.formula.api package. The cross-classified model requires separate levels for each of accident and development year, so we use the categorical versions of these variates.

The -1 in the model formula instructs glm() to fit without an intercept. Doing this will allow us to recover parameter estimates similar to those in the monograph.

To fit an ODP model, we use the scale='X2' argument to fit() - the X2 stands for chi-squared and tells glm() to use a Pearson chi-squared estimate for the scale parameter.
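For reference, the Pearson chi-squared estimate of the scale (dispersion) parameter that scale='X2' requests is

\[
\hat\phi = \frac{1}{n - p} \sum_{i} \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)},
\]

where \(n - p\) is the residual degrees of freedom and \(V(\mu) = \mu\) for the (over-dispersed) Poisson. You can check this against the summary output below: the reported scale equals the Pearson chi-squared statistic divided by the residual degrees of freedom.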
glm_fit1 = smf.glm('incremental ~ -1 + acc_year_factor + dev_year_factor',
                   data = msdata,
                   family = sm.families.Poisson()).fit(scale='X2')
print(glm_fit1.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: incremental No. Observations: 55
Model: GLM Df Residuals: 36
Model Family: Poisson Df Model: 18
Link Function: Log Scale: 114.54
Method: IRLS Log-Likelihood: -20.828
Date: Sun, 13 Nov 2022 Deviance: 4128.1
Time: 13:43:06 Pearson chi2: 4.12e+03
No. Iterations: 7 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=========================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------
acc_year_factor[1] 10.6568 0.032 336.795 0.000 10.595 10.719
acc_year_factor[2] 10.7953 0.030 360.507 0.000 10.737 10.854
acc_year_factor[3] 10.8992 0.029 377.465 0.000 10.843 10.956
acc_year_factor[4] 10.9890 0.028 391.327 0.000 10.934 11.044
acc_year_factor[5] 11.0388 0.028 396.654 0.000 10.984 11.093
acc_year_factor[6] 11.0159 0.029 385.868 0.000 10.960 11.072
acc_year_factor[7] 11.0081 0.029 373.734 0.000 10.950 11.066
acc_year_factor[8] 10.8905 0.033 333.463 0.000 10.826 10.955
acc_year_factor[9] 10.8361 0.037 295.349 0.000 10.764 10.908
acc_year_factor[10] 10.6911 0.051 209.454 0.000 10.591 10.791
dev_year_factor[T.2] -0.2047 0.023 -8.993 0.000 -0.249 -0.160
dev_year_factor[T.3] -0.7474 0.028 -26.512 0.000 -0.803 -0.692
dev_year_factor[T.4] -1.0167 0.033 -30.954 0.000 -1.081 -0.952
dev_year_factor[T.5] -1.4516 0.042 -34.446 0.000 -1.534 -1.369
dev_year_factor[T.6] -1.8325 0.055 -33.495 0.000 -1.940 -1.725
dev_year_factor[T.7] -2.1403 0.072 -29.933 0.000 -2.280 -2.000
dev_year_factor[T.8] -2.3483 0.093 -25.218 0.000 -2.531 -2.166
dev_year_factor[T.9] -2.5132 0.127 -19.831 0.000 -2.762 -2.265
dev_year_factor[T.10] -2.6645 0.199 -13.369 0.000 -3.055 -2.274
=========================================================================================
Loss reserve
Now we’ll have a look at the loss reserve. If you’ve done the chain ladder calculations, you should find this gives the same answer.
Before we get there, we’ll write a function to produce a data set from vectors of acc_year and dev_year variables - this will be useful later on as well.
The function below takes in numpy 1-d arrays of acc and dev and the categorisation we created above and produces a pandas dataframe with the various covariates that we have created so far.
# Write a function to create part of a triangle
def CreateData(acc, dev, cat_type):
    '''
    Creates a data set to apply predict function to

    Parameters
    -------------------------
    acc: np 1-d array of acc values
    dev: np 1-d array of dev values
    It is possible for one of these to be length 1 and this gets repeated

    Returns
    -------------------------
    df with all the various terms that we need:
        acc_year
        dev_year
        acc_year_factor
        dev_year_factor
        cal_year
    '''

    if acc.size == dev.size:
        df = pd.DataFrame({'acc_year': acc, 'dev_year': dev})
    elif acc.size == 1:
        df = pd.DataFrame({'acc_year': np.repeat(acc, dev.size), 'dev_year': dev})
    elif dev.size == 1:
        df = pd.DataFrame({'acc_year': acc, 'dev_year': np.repeat(dev, acc.size)})
    else:
        print("Length of acc and dev must be equal or one must contain one element only")
        return

    # add the other variables and fix types
    df['cal_year'] = df['acc_year'] + df['dev_year'] - 1
    df['acc_year_factor'] = df['acc_year'].astype(cat_type)
    df['dev_year_factor'] = df['dev_year'].astype(cat_type)

    df = df.astype(dtype={'acc_year': int,
                          'dev_year': int,
                          'cal_year': int})

    return df
We now use this function and give it:
- An accident array that repeats the numbers 1-10 10 times each
- A development array that repeats the sequence 1-10 10 times
- Our previously defined cat_type
This outputs the full reserving data set (upper and lower triangles) in long form.
We then subset this to keep only the future (lower triangle) data and reindex (number the rows). Finally we create a column that holds the fitted values from our first model.
fulltr = CreateData(acc = np.repeat(np.arange(1,11), 10),
                    dev = np.tile(np.arange(1,11), 10),
                    cat_type = cat_type)

# extract lower tri - make a deep copy so it exists independently of
# fulltr rather than being a slice
futdata = fulltr[fulltr['cal_year'] > 10].copy(deep=True)

# reindex this
futdata.index = np.arange(0, 45, dtype=int)

futdata['fitted1'] = glm_fit1.predict(futdata)
print(futdata.head())
acc_year dev_year cal_year acc_year_factor dev_year_factor fitted1
0 2 10 11 2 10 3397.665217
1 3 9 11 3 9 4385.341383
2 3 10 12 3 10 3769.510642
3 4 8 11 4 8 5657.645836
4 4 9 12 4 9 4797.591370
Now we can look at the OCL by accident year and in total.
# now get ocl by accident year
ocl_year = futdata.groupby(by="acc_year").sum()
print(ocl_year['fitted1'])
acc_year
2 3397.665217
3 8154.852025
4 14579.105829
5 22645.065096
6 31865.349506
7 45753.129496
8 60093.456331
9 80983.200079
10 105874.473778
Name: fitted1, dtype: float64
ocl_total = ocl_year['fitted1'].sum()
print(ocl_total)
373346.29735591734
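As a quick sanity check - this snippet is not part of the original article, just a sketch using the msdata frame loaded above - the volume-all chain ladder calculation can be run directly on the cumulative triangle; its total reserve should reproduce ocl_total.

# Cross-check (illustrative): volume-all chain ladder on the same triangle
tri = msdata.pivot(index='acc_year', columns='dev_year', values='cumulative')

# volume-all development factors: sum of column d+1 over the matching sum of column d
factors = []
for d in range(1, 10):
    mask = tri[d + 1].notna()
    factors.append(tri.loc[mask, d + 1].sum() / tri.loc[mask, d].sum())

# roll each accident year's latest cumulative forward to ultimate
latest = tri.apply(lambda row: row.dropna().iloc[-1], axis=1)
ultimate = latest.copy()
for acc in tri.index:
    for d in range(tri.loc[acc].last_valid_index(), 10):
        ultimate[acc] *= factors[d - 1]

print((ultimate - latest).sum())   # should match ocl_total above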
Model diagnostics
Calculations
So far, we’ve fitted a model and have calculated the loss reserve. We can do all this with the chain ladder algorithm. Now we start looking at some of the extras that using a GLM gives us - for a start the statistical model structure means that we can calculate residuals with known properties, assuming the assumptions underlying the model are valid.
If the model assumptions are satisfied by the data, then the residuals should not have any remaining structure. Plots of residuals can therefore be useful to detect violations of model assumptions. Note that we need to be careful to use the right types of residuals - in many cases we want to use standardised deviance residuals:
- Deviance residuals because the more recognisable Pearson residuals (based on actual - fitted) are difficult to interpret for non-normal models.
- Standardised because the raw residuals are on different scales depending on the scale of the underlying values.
Some more details may be found in Chapter 6 of the monograph and also in Chapter 5 of A Practitioner’s Introduction to Stochastic Reserving.
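For concreteness, the standardised deviance residual used below for observation \(i\) is (consistent with the formula quoted in the residual function later in this article)

\[
r_i = \frac{\operatorname{sign}(y_i - \hat\mu_i)\,\sqrt{d_i}}{\sqrt{\hat\phi\,(1 - h_{ii})}},
\]

where \(d_i\) is the contribution of observation \(i\) to the deviance, \(\hat\phi\) is the estimated scale parameter and \(h_{ii}\) is the \(i\)-th diagonal element of the hat matrix.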
Here we look at the following:
- Residual Scatterplots
- by linear predictor
- by accident, development and calendar years
- if the model assumptions are satisfied then the residuals should look homogeneous (or, in layperson’s language, like a random cloud), centred around zero
- Heat map of actual vs fitted laid out in triangular form
- In this we get the actual/fitted ratio in each (acc, dev) cell (subject to lower and upper bounds of [0.5, 2]) and then plot the colour-coded triangle of the actual/fitted values
- heat maps are helpful to check for model fit and may help to identify missing interactions.
We have to prepare the data by adding the fitted values and residuals.
Because this model has a lot of parameters, there are two observations where the fitted is exactly equal to the actual – (acc_year=1, dev_year=10) and (acc_year=10, dev_year=1). This is because these observations have a unique parameter.
statsmodels returns unstandardised deviance residuals, so we’ve included a function to calculate the standardised versions.
The deviance calculations below return nan (not a number) and inf for these points, but the residual should really be 0, so this adjustment is made manually.
We also add actual/fitted ratios and the log of these (restricted to the range [log(0.5), log(2)]) - these will be used for a heatmap later.
- The restricted range is used to generate easier-to-read shadings in the heat map, while the conversion to log means that the shading will be of similar intensity for A/F values of \(x\)% and \(1/x\)%.
First, here’s a function to calculate standardised deviance residuals.
# Write a function to calculate std deviance residuals
# statsmodels just gives deviance residuals, but we want standardised.
# For definitions of these residuals consult (e.g) http://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_genmod_details25.htm
#
# to calculate standardised deviance residuals:
# - resid_deviance from statsmodels = sqrt(di) sign(y-mu)
# - std_resid_deviance = sqrt(di) sign(y-mu) / sqrt(phi (1-hat matrix))
def StdDevResids(df, glm_obj, resid_colname = 'std_dev_resid'):
    '''
    Calculates the standardised deviance residuals.

    Parameters
    -------------------------
    df: Pandas dataframe to which we add the residuals - should have dimensions of training/model data
    glm_obj: Statsmodels glm object containing results of fit
    resid_colname = name of residual column name

    Returns
    -------------------------
    df updated with std resids column, called resid_colname

    resid_deviance = sqrt(di) sign(y-mu) - stored in glm_obj
    std_resid_deviance = sqrt(di) sign(y-mu) / sqrt(phi (1-hat matrix)) - calculated

    For definitions of these residuals consult http://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_genmod_details25.htm
    '''

    # calculate residuals
    df[resid_colname] = glm_obj.resid_deviance / (np.sqrt(glm_obj.scale * (1 - glm_obj.get_hat_matrix_diag())))

    # when actual=fitted can get nans and infs in results so replace these with zero
    df[resid_colname].replace(np.nan, 0, inplace = True)
    df[resid_colname].replace(np.inf, 0, inplace = True)

    return(df)
Now calculate the residuals, and add fitted values and the AvsF values as well.
There will be a warning message from the residual calculation function - this relates to the cells where actual == fitted and can be ignored.
msdata = StdDevResids(msdata, glm_fit1, 'residuals1')
<string>:23: RuntimeWarning: invalid value encountered in sqrt
msdata['fitted1'] = glm_fit1.predict(msdata)
msdata['linear_predictor1'] = glm_fit1.predict(msdata, linear=True)
msdata['AvsF1'] = msdata['incremental'] / msdata['fitted1']
msdata['AvsF_restricted1'] = np.log(np.maximum(0.5, np.minimum(2, msdata['AvsF1'])))

msdata.head(10)
acc_year dev_year ... AvsF1 AvsF_restricted1
0 1 1 ... 0.984516 -1.560475e-02
1 1 2 ... 1.003241 3.235741e-03
2 1 3 ... 1.001466 1.464611e-03
3 1 4 ... 1.038796 3.806213e-02
4 1 5 ... 1.134358 1.260672e-01
5 1 6 ... 0.871577 -1.374512e-01
6 1 7 ... 0.955659 -4.535425e-02
7 1 8 ... 0.922093 -8.110930e-02
8 1 9 ... 0.998183 -1.818724e-03
9 1 10 ... 1.000000 7.771561e-15
[10 rows x 12 columns]
Plotting
Now let’s look at the residual scatterplots. In the linear predictor scatterplot, the points are colour coded by development year.
fig, axs = plt.subplots(2, 2)
plt.subplots_adjust(wspace=1, hspace=1)

g = sns.scatterplot(x='linear_predictor1', y='residuals1', hue=msdata.dev_year, data=msdata, palette=sns.color_palette("deep"), ax=axs[0,1])
g.set_title("Linear Predictor")
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize = 6)
sns.scatterplot(x='acc_year', y='residuals1', data=msdata, palette=sns.color_palette("deep"), ax=axs[0,0]).set_title("Accident Year")
sns.scatterplot(x='dev_year', y='residuals1', data=msdata, palette=sns.color_palette("deep"), ax=axs[1,0]).set_title("Development Year")
sns.scatterplot(x='cal_year', y='residuals1', data=msdata, palette=sns.color_palette("deep"), ax=axs[1,1]).set_title("Calendar Year")
These results are quite good - bear in mind there are only a small number of points so plots must be interpreted in relation to this. In particular:
- The residuals do not appear to fan out or fan in (once you take into account that the later development years have only a small number of points)
- They appear centred around 0
Now construct and draw the heat map. Note that the colours are:
- blue (A/F = 50%)
- white (A/F = 100%)
- red (A/F = 200%)
with shading for in-between values
fig, axs = plt.subplots(1)

cmap = sns.diverging_palette(255, 0, sep=16, as_cmap=True)

sns.heatmap(data = msdata.pivot(index='acc_year', columns='dev_year', values='AvsF_restricted1'),
            linewidths = .5,
            cmap = cmap,
            center = 0,
            cbar = False)
In a heat map for a reserving triangle, we look for a random scattering of red and blue points. This plot looks quite good (though we’ll revisit this shortly).
Refining the model
We could stop here - and just use the results from this model, which match those produced by the chain ladder. The diagnostics suggest that the model fits quite well. However, because this is a GLM, we have more options than just replicating the chain ladder.
In particular, can we:
- identify simplifications to the model to make it more parsimonious (i.e. reduce the number of parameters)?
- identify any areas of poorer fit that may suggest missing model terms including interactions?
Simplifying the model
First we consider if we can use a parametric shape for the accident and development year parameters. The end result should be something similar to the chain ladder approach but with far fewer parameters.
Accident year
First plot the accident year parameters. We could access the parameters directly and plot these, but an alternative way of doing this is to make a data set where we have all values of acc_year and a single value of dev_year - the base level, which is a parameter estimate of 0 - and then get the linear predicted values for this. This latter method will be useful later when we refine the model so we’ll use it here too.
We therefore use the CreateData()
function we wrote earlier to make a data set in suitable form and then add the linear predicted values to this.
acc_fit = CreateData(acc = np.arange(1,11), dev = np.array(1), cat_type = cat_type)
acc_fit['fitted1'] = glm_fit1.predict(acc_fit, linear = True)

print(acc_fit)
acc_year dev_year cal_year acc_year_factor dev_year_factor fitted1
0 1 1 1 1 1 10.656759
1 2 1 2 2 1 10.795334
2 3 1 3 3 1 10.899190
3 4 1 4 4 1 10.989037
4 5 1 5 5 1 11.038835
5 6 1 6 6 1 11.015898
6 7 1 7 7 1 11.008077
7 8 1 8 8 1 10.890500
8 9 1 9 9 1 10.836129
9 10 1 10 10 1 10.691081
sns.lineplot(x = 'acc_year', y = 'fitted1', marker = 'o', data = acc_fit)
Note that the shape of the accident parameters closely resembles that of a parabola.
This suggests that we can replace the 10 accident year parameters by:
- the overall intercept
- an acc_year term
- an acc_year squared term (we’ll need to add this to the dataframe)

So refit the model on this basis:
- Drop the -1 from the glm_fit1 formula to allow the model to have an intercept
- Replace the acc_year_factor term with the parabola terms.
"acc_year_2"] = np.power(msdata["acc_year"], 2)
msdata[
= smf.glm('incremental ~ acc_year + acc_year_2 + dev_year_factor',
glm_fit2 = msdata,
data = sm.families.Poisson() ).fit(scale='X2')
family print(glm_fit2.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: incremental No. Observations: 55
Model: GLM Df Residuals: 43
Model Family: Poisson Df Model: 11
Link Function: Log Scale: 102.58
Method: IRLS Log-Likelihood: -24.714
Date: Sun, 13 Nov 2022 Deviance: 4427.0
Time: 13:43:49 Pearson chi2: 4.41e+03
No. Iterations: 7 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=========================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 10.4710 0.034 304.264 0.000 10.404 10.538
dev_year_factor[T.2] -0.2056 0.021 -9.661 0.000 -0.247 -0.164
dev_year_factor[T.3] -0.7501 0.026 -28.314 0.000 -0.802 -0.698
dev_year_factor[T.4] -1.0148 0.031 -32.755 0.000 -1.076 -0.954
dev_year_factor[T.5] -1.4520 0.040 -36.484 0.000 -1.530 -1.374
dev_year_factor[T.6] -1.8305 0.052 -35.432 0.000 -1.932 -1.729
dev_year_factor[T.7] -2.1422 0.068 -31.734 0.000 -2.274 -2.010
dev_year_factor[T.8] -2.3527 0.088 -26.758 0.000 -2.525 -2.180
dev_year_factor[T.9] -2.5137 0.120 -21.011 0.000 -2.748 -2.279
dev_year_factor[T.10] -2.6609 0.188 -14.167 0.000 -3.029 -2.293
acc_year 0.2001 0.014 14.071 0.000 0.172 0.228
acc_year_2 -0.0179 0.001 -13.210 0.000 -0.021 -0.015
=========================================================================================
Now let’s compare the previous and current fit. We go back to the acc_fit dataframe and add the linear predicted values from the glm_fit2 model (and hopefully it now becomes clearer why it can be convenient to use a data set with the predict() method rather than accessing parameter estimates directly).
# Plot the values before and after
# first add acc_year_2 to acc_fit
acc_fit["acc_year_2"] = np.power(acc_fit["acc_year"], 2)

# add glm_fit2 linpred
acc_fit['fitted2'] = glm_fit2.predict(acc_fit, linear = True)

print(acc_fit)
acc_year dev_year cal_year ... fitted1 acc_year_2 fitted2
0 1 1 1 ... 10.656759 1 10.653146
1 2 1 2 ... 10.795334 4 10.799501
2 3 1 3 ... 10.899190 9 10.910041
3 4 1 4 ... 10.989037 16 10.984768
4 5 1 5 ... 11.038835 25 11.023682
5 6 1 6 ... 11.015898 36 11.026781
6 7 1 7 ... 11.008077 49 10.994067
7 8 1 8 ... 10.890500 64 10.925539
8 9 1 9 ... 10.836129 81 10.821197
9 10 1 10 ... 10.691081 100 10.681042
[10 rows x 8 columns]
sns.lineplot(x = 'acc_year',
             y = 'value',
             hue = 'variable',
             marker = 'o',
             data = pd.melt(acc_fit[['acc_year', 'fitted1', 'fitted2']], 'acc_year'))
This looks very good - the fit is very similar, but we have 7 fewer parameters.
Development year
Now we do the same thing for development year - we build a dataset with all dev_year levels and acc_year set to the same value (I’ve chosen 1), and then we calculate the linear predicted values and plot these.
dev_fit = CreateData(acc = np.array(1), dev = np.arange(1,11), cat_type = cat_type)
dev_fit["acc_year_2"] = np.power(dev_fit["acc_year"], 2)
dev_fit['fitted2'] = glm_fit2.predict(dev_fit, linear = True)

sns.lineplot(x = 'dev_year', y = 'fitted2', marker = 'o', data = dev_fit)
Looking at this plot, it appears that a straight line would fit quite well. However, this fit would be improved by allowing the straight line to bend (have a knot) at dev_year = 7. So let’s try this below.
Note we actually fit dev_year - 1 rather than dev_year. This means that the parameter estimate at dev_year = 1 is 0, just as it is in the glm_fit2 model.
msdata['dev_year_m1'] = msdata["dev_year"] - 1
msdata['dev_year_ge_7'] = np.maximum(msdata["dev_year"] - 7.5, 0)

glm_fit3 = smf.glm('incremental ~ acc_year + acc_year_2 + dev_year_m1 + dev_year_ge_7',
                   data = msdata,
                   family = sm.families.Poisson()).fit(scale='X2')
print(glm_fit3.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: incremental No. Observations: 55
Model: GLM Df Residuals: 50
Model Family: Poisson Df Model: 4
Link Function: Log Scale: 242.06
Method: IRLS Log-Likelihood: -25.865
Date: Sun, 13 Nov 2022 Deviance: 11879.
Time: 13:44:08 Pearson chi2: 1.21e+04
No. Iterations: 7 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 10.5095 0.052 201.734 0.000 10.407 10.612
acc_year 0.2042 0.022 9.451 0.000 0.162 0.247
acc_year_2 -0.0183 0.002 -8.891 0.000 -0.022 -0.014
dev_year_m1 -0.3641 0.009 -41.160 0.000 -0.381 -0.347
dev_year_ge_7 0.2389 0.088 2.701 0.007 0.066 0.412
=================================================================================
Assuming the fit is satisfactory, our original model with 19 parameters has now been simplified to 5 parameters - much more parsimonious and robust.
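A quick way to see this reduction directly from the fitted objects is sketched below (len(fit.params) and fit.scale are standard statsmodels attributes; this comparison is illustrative rather than a formal model selection test).

# Compare the number of estimated coefficients and the estimated scale for the models so far
for name, fit in [('glm_fit1', glm_fit1), ('glm_fit2', glm_fit2), ('glm_fit3', glm_fit3)]:
    print(name, '- parameters:', len(fit.params), ', scale:', round(fit.scale, 1))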
Let’s check the fit by dev_year to see.
# check fit by dev_year
dev_fit['dev_year_m1'] = dev_fit["dev_year"] - 1
dev_fit['dev_year_ge_7'] = np.maximum(dev_fit["dev_year"] - 7.5, 0)

dev_fit['fitted3'] = glm_fit3.predict(dev_fit, linear = True)

sns.lineplot(x = 'dev_year',
             y = 'value',
             hue = 'variable',
             marker = 'o',
             data = pd.melt(dev_fit[['dev_year', 'fitted2', 'fitted3']], 'dev_year'))
This looks good. However, dev_year = 2 is a bit underfit in the latest model, so we can add something to improve this fit (a term at dev_year = 2). Refit and replot.
# add dev_year_eq_2 term
msdata['dev_year_eq_2'] = 1 * (msdata["dev_year"] == 2)

glm_fit4 = smf.glm('incremental ~ acc_year + acc_year_2 + dev_year_m1 + dev_year_ge_7 + dev_year_eq_2',
                   data = msdata,
                   family = sm.families.Poisson()).fit(scale='X2')
print(glm_fit4.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: incremental No. Observations: 55
Model: GLM Df Residuals: 49
Model Family: Poisson Df Model: 5
Link Function: Log Scale: 107.27
Method: IRLS Log-Likelihood: -27.579
Date: Sun, 13 Nov 2022 Deviance: 5273.8
Time: 13:44:17 Pearson chi2: 5.26e+03
No. Iterations: 7 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 10.4687 0.035 297.896 0.000 10.400 10.538
acc_year 0.2001 0.014 13.867 0.000 0.172 0.228
acc_year_2 -0.0179 0.001 -13.018 0.000 -0.021 -0.015
dev_year_m1 -0.3577 0.006 -59.405 0.000 -0.370 -0.346
dev_year_ge_7 0.2356 0.059 3.996 0.000 0.120 0.351
dev_year_eq_2 0.1545 0.019 7.932 0.000 0.116 0.193
=================================================================================
# check fit by dev_year
dev_fit['dev_year_eq_2'] = 1 * (dev_fit["dev_year"] == 2)

dev_fit['fitted4'] = glm_fit4.predict(dev_fit, linear = True)

sns.lineplot(x = 'dev_year',
             y = 'value',
             hue = 'variable',
             marker = 'o',
             data = pd.melt(dev_fit[['dev_year', 'fitted2', 'fitted3', 'fitted4']], 'dev_year'))
Looks good! Fitting dev_year = 2 better has also improved the tail fitting (dev_year > 7).
Identifying missing structure
The second part of the model refining process involves checking for missing structure.
Let’s have a better look at the heat map, as it stands after the model simplification process.
# look at heat map again - remember log scale
msdata['fitted4'] = glm_fit4.predict(msdata)
msdata['AvsF4'] = msdata['incremental'] / msdata['fitted4']
msdata['AvsF_restricted4'] = np.log(np.maximum(0.5, np.minimum(2, msdata['AvsF4'])))

fig, axs = plt.subplots(1)

sns.heatmap(data = msdata.pivot(index='acc_year', columns='dev_year', values='AvsF_restricted4'),
            linewidths = .5,
            cmap = cmap,
            center = 0,
            cbar = False)
Let’s refer to the annotated version of this plot from the previous version of this article in R. Note that the heatmap colours are a little different, but the underlying values are the same.
We see:
- development year 1, a distinct area of blue in the earlier accident years (A < F), followed by red (A > F)
- development year 2, a distinct area of red in the earlier accident years (A > F), followed by blue (A < F)
- development year 3, a possible progression from red to blue with increasing accident year (F increasing relative to A)
- development year 4, nearly all red (A > F)
This suggests the payment pattern has altered and can be accommodated by (mostly) interaction terms within the GLM. Consider adding the following terms:
- (development year = 1) * (accident year is between 1 and 6)
- (development year = 2) * (accident year is between 1 and 6)
- (development year = 3) * (accident year linear trend)
- (development year = 4)
So, let’s refit the model with terms to capture these and have a look at the heat map again
# fit model with interactions
msdata['dev_year_eq_1'] = 1 * (msdata["dev_year"] == 1)
msdata['dev_year_eq_3'] = 1 * (msdata["dev_year"] == 3)
msdata['dev_year_eq_4'] = 1 * (msdata["dev_year"] == 4)
msdata['acc_year_1_6'] = 1 * (msdata['acc_year'] >= 1)*(msdata['acc_year'] <= 6)

formula4 = 'incremental ~ acc_year + acc_year_2 + dev_year_m1 + dev_year_ge_7 + dev_year_eq_2' # previous model
formula5 = formula4 + ' + dev_year_eq_4 + dev_year_eq_1:acc_year_1_6 + dev_year_eq_2:acc_year_1_6 + dev_year_eq_3:acc_year'

glm_fit5 = smf.glm(formula5,
                   data = msdata,
                   family = sm.families.Poisson()).fit(scale='X2')
print(glm_fit5.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: incremental No. Observations: 55
Model: GLM Df Residuals: 45
Model Family: Poisson Df Model: 9
Link Function: Log Scale: 53.933
Method: IRLS Log-Likelihood: -28.461
Date: Sun, 13 Nov 2022 Deviance: 2426.9
Time: 13:44:29 Pearson chi2: 2.43e+03
No. Iterations: 7 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 10.4904 0.032 327.787 0.000 10.428 10.553
acc_year 0.2066 0.011 19.007 0.000 0.185 0.228
acc_year_2 -0.0183 0.001 -16.078 0.000 -0.021 -0.016
dev_year_m1 -0.3685 0.007 -53.317 0.000 -0.382 -0.355
dev_year_ge_7 0.2720 0.044 6.145 0.000 0.185 0.359
dev_year_eq_2 0.0375 0.025 1.518 0.129 -0.011 0.086
dev_year_eq_4 0.0528 0.024 2.217 0.027 0.006 0.099
dev_year_eq_1:acc_year_1_6 -0.0671 0.028 -2.392 0.017 -0.122 -0.012
dev_year_eq_2:acc_year_1_6 0.1273 0.031 4.101 0.000 0.066 0.188
dev_year_eq_3:acc_year -0.0113 0.004 -2.913 0.004 -0.019 -0.004
==============================================================================================
This model should match that displayed in Table 7-5 of the monograph - and indeed it does (some very minor differences in parameter values - the model in the monograph was fitted in SAS).
Look at the updated heat map again - has the model resolved the identified issues?
msdata['fitted5'] = glm_fit5.predict(msdata)
msdata['AvsF5'] = msdata['incremental'] / msdata['fitted5']
msdata['AvsF_restricted5'] = np.log(np.maximum(0.5, np.minimum(2, msdata['AvsF5'])))

fig, axs = plt.subplots(1)

sns.heatmap(data = msdata.pivot(index='acc_year', columns='dev_year', values='AvsF_restricted5'),
            linewidths = .5,
            cmap = cmap,
            center = 0,
            cbar = False)
Here’s the annotated R plot which helps show the impact of the interactions
This looks much better.
We should also look at the residual plots again
# look at residual plots again
msdata = StdDevResids(msdata, glm_fit5, 'residuals5')
msdata['linear_predictor5'] = glm_fit5.predict(msdata, linear = True)

fig, axs = plt.subplots(2, 2)
plt.subplots_adjust(wspace=1, hspace=1)

g = sns.scatterplot(x='linear_predictor5', y='residuals5', hue=msdata.dev_year, data=msdata, palette=sns.color_palette("deep"), ax=axs[0,1])
g.set_title("Linear Predictor")
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize = 6)
sns.scatterplot(x='acc_year', y='residuals5', data=msdata, palette=sns.color_palette("deep"), ax=axs[0,0]).set_title("Accident Year")
sns.scatterplot(x='dev_year', y='residuals5', data=msdata, palette=sns.color_palette("deep"), ax=axs[1,0]).set_title("Development Year")
sns.scatterplot(x='cal_year', y='residuals5', data=msdata, palette=sns.color_palette("deep"), ax=axs[1,1]).set_title("Calendar Year")
These residuals do look better than those from the chain ladder model.
Loss reserve
Now that we have a model, let’s produce the estimate of the outstanding claims by accident year and in total.
- Take the lower triangle data [futdata] created above
- Add on the new variates we created
- Score the model on this data
- Summarise the results
# add all the new terms to futdata
futdata['acc_year_2'] = np.power(futdata['acc_year'], 2)
futdata['dev_year_m1'] = futdata["dev_year"] - 1
futdata['dev_year_ge_7'] = np.maximum(futdata['dev_year'] - 7.5, 0)
futdata['dev_year_eq_2'] = 1*(futdata['dev_year'] == 2)
futdata['dev_year_eq_1'] = 1*(futdata['dev_year'] == 1)
futdata['dev_year_eq_3'] = 1*(futdata['dev_year'] == 3)
futdata['dev_year_eq_4'] = 1*(futdata['dev_year'] == 4)
futdata['acc_year_1_6'] = 1*(futdata['acc_year'] >= 1)*(futdata['acc_year'] <= 6)

futdata['fitted5'] = glm_fit5.predict(futdata)
print(futdata.head(6))
acc_year dev_year cal_year ... dev_year_eq_4 acc_year_1_6 fitted5
0 2 10 11 ... 0 1 3618.769139
1 3 9 11 ... 0 1 4470.907136
2 3 10 12 ... 0 1 4059.634560
3 4 8 11 ... 0 1 5324.840843
4 4 9 12 ... 0 1 4835.016085
5 4 10 13 ... 0 1 4390.249630
[6 rows x 15 columns]
Reserves by accident year
ocl_year = futdata.groupby(by='acc_year').sum()
print(ocl_year['fitted5'])
acc_year
2 3618.769139
3 8530.541696
4 14550.106558
5 22172.719479
6 32458.174082
7 45694.907228
8 62955.431279
9 79300.613414
10 101211.917139
Name: fitted5, dtype: float64
Total reserves
ocl_total = ocl_year['fitted5'].sum()
print(ocl_total)
370493.1800137104
Reviewing this example
Looking back over this example, we started with a chain ladder model and then showed how a GLM lets us fit a more parsimonious model (i.e. one with fewer parameters). It may then be possible to reconcile the shape of the parametric fit by accident year to underlying experience in the book - here we saw higher payments in the middle accident years. Is this due to higher claims experience or higher premium volumes? Does this give us an insight that allows us to better extrapolate into the future when setting reserves?
We have also used model diagnostics to identify areas of misfit and then used GLM interactions to capture these changes.
Practical use of GLMs in traditional reserving
Modelling
The ideas in this simple example extend to more complex traditional scenarios. By traditional I mean that the data you have available to you are classified by accident (or underwriting), development and calendar periods only.
First decide what you are going to model. Here we had a single model of incremental payments. However you could fit a Payments Per Claim Finalised (PPCF) model which consists of 3 submodels - numbers of claims incurred by accident period, number of claims finalised by (accident, development period) and payments per claim finalised by (accident, development period). Each of these could then be fitted by a GLM.
For whatever you’re modelling, you then pick the two triangle directions that you think are most critical for that experience. You can’t include all three from the start since they are not independent - calendar period is just accident period plus development period.
So, for PPCF submodels:
- for number of claims incurred models, accident and development period effects are likely to be where you start.
- numbers of claims finalised will usually depend on development period (type of claim) and calendar period (to take account of changes in claim settlement processes)
- for claim size models, you will probably want development and calendar period effects. For these models you could use operational time instead of development period to avoid changes in the timing of claims finalisations impacting your model.
Then fit the models by starting with the modelled effects as factors and use methods such as those outlined above to reduce the number of parameters by using parametric shapes. Look for missing structure and consider adding interactions or (carefully) adding limited functions of the third triangular direction. Take advantage of GLM tools to refine your model. Use what you know about the portfolio to inform your model - if you know that there was a period of rapid claims inflation, then include that in your model.
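To make the PPCF idea above a little more concrete, here is a minimal hedged sketch of how the three submodels might be set up with statsmodels, reusing the imports from the top of this article. The dataframe ppcf_data and its columns (claims_incurred, claims_finalised, payments, exposure, plus the accident/development/calendar variables used earlier) are hypothetical - nothing like this is defined in this article - and any real application would need the same refinement and diagnostic steps shown above.

# Hypothetical aggregate data set: one row per (acc_year, dev_year) cell with
# claims_incurred, claims_finalised, payments, exposure and the factor/calendar variables.

# 1. Numbers of claims incurred: accident and development period effects,
#    with exposure as an offset.
fit_numbers = smf.glm('claims_incurred ~ acc_year_factor + dev_year_factor',
                      data=ppcf_data,
                      offset=np.log(ppcf_data['exposure']),
                      family=sm.families.Poisson()).fit(scale='X2')

# 2. Numbers of claims finalised: development and calendar period effects.
fit_finalised = smf.glm('claims_finalised ~ dev_year_factor + cal_year',
                        data=ppcf_data,
                        family=sm.families.Poisson()).fit(scale='X2')

# 3. Payments per claim finalised: development and calendar period effects
#    (an ODP family is used here for consistency with the article; a gamma
#    family is another common choice).
ppcf_data['ppcf'] = ppcf_data['payments'] / ppcf_data['claims_finalised']
fit_size = smf.glm('ppcf ~ dev_year_factor + cal_year',
                   data=ppcf_data,
                   family=sm.families.Poisson()).fit(scale='X2')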
Setting reserves
It is possible to overlay judgement onto a GLM’s predictions. At the end of the day, the predictions are just based on a mathematical formula. So, taking claims inflation as an example, if you’ve been seeing 5% p.a. over the last 3 years, but you think this is going to moderate going forward, then you can adjust the projections by removing the 5% p.a. rate into the future and replacing it with, say, 2% p.a. Once you get familiar with using GLMs, you might find it easier to incorporate judgement - the GLM can capture more granularity about past experience which in turn may make it easier to work out how things might change in future and how to numerically include these changes.
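As a rough illustration of the sort of adjustment described above (a sketch only, not part of the original analysis): if the fitted model effectively carries 5% p.a. claims inflation into the future and we judge 2% p.a. to be more appropriate, the projected cells could be rescaled by the ratio of the two inflation paths, measured from the last diagonal.

# Sketch only: replace an assumed 5% p.a. future inflation with 2% p.a. by rescaling
# each projected cell according to how many years beyond the last diagonal
# (cal_year = 10 in this example) it sits.
years_ahead = futdata['cal_year'] - 10
futdata['fitted5_adj'] = futdata['fitted5'] * ((1.02 / 1.05) ** years_ahead)

print(futdata.groupby(by='acc_year')['fitted5_adj'].sum())
print(futdata['fitted5_adj'].sum())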
Additional References
The references below have further examples of fitting GLMs in this way, and show how to capture quite complex experience. Although both use individual data, the methodology can be used in a similar manner for aggregate data.
- Loss Reserving with GLMs: A Case Study
- Individual Claim modelling of CTP data
- Predictive Modeling Applications in Actuarial Science, Frees, Derrig and Meyers (eds), 2014 and 2016 - in particular see Chapter 18 in Volume 1 and Chapter 3 in Volume 2.
Please feel free to add references to other useful material in the comments.