library(SPLICE) # Load the package once installed
library(DT)
library(data.table)
library(ggplot2)
library(patchwork)
Introduction
The Foundations Workstream’s aims are to:
- Create a roadmap for learning machine learning
- Gather the relevant learning materials
- Develop notebooks of example code in R and Python
In this blog I will be detailing the steps I have taken to upskill myself in the basics of machine learning in reserving and hence provide the roadmap mentioned above, with references to the relevant learning materials along the way.
The journey will mostly be based on previous blogs written by the working party, which I will link throughout. Note that a number of these blogs have also been collected in an easy-to-follow format in a living book published by the working party, as well as in the Actuarial Analytics Cookbook, a collection of articles put together by the Young Data Analytics Working Group, a working group of the Actuaries Institute (the Australian actuarial body); the relevant blogs sit under its General Insurance section.
The first blog to look at is Introducing the foundations workstream and articles. Here, we introduce the initial steps we will be going through:
- Choice of programming language
- What data to use
- What machine learning methods are available:
- GLMs (not machine learning but a good starting point)
- Regularised GLMs (LASSO method)
- Decision trees, random forest and XGBoost
- Neural networks, which we don’t cover in this blog but which have been covered by the working party
I have decided to base my work within this blog on R. The reason for this is that I have some experience in R (as is standard for anyone who has studied for the IFoA actuarial exams), whereas I have none in Python, and many of the relevant previous blogs are also based in R. The blog states that it is important not to get bogged down by the choice of programming language, hence my decision is one of convenience.
My aim is to reproduce the models in the previous blogs on my own data set, so as to produce something original and to force myself to work through each step in creating the models rather than just blindly copying and pasting each block of code. I recommend doing the same. I will endeavour to highlight any potential pitfalls and difficulties that I encountered.
Alongside working through the working party blogs I have taken some basic refresher courses in R on Data Camp. This was useful to put my brain in R mode, making it easier to understand the code that I was copying from the previous blogs and to sort out any minor errors I was getting.
Deciding Data
There are a number of options when choosing a data set. The data used in this blog is triangle data with accident/underwriting period, calendar and development periods being the features (machine learning terminology for independent variables) and payments being the target variable (dependent variable). The data needs to include historical and future values.
Probably the easiest option is to use a readily made triangle that you have access to and format it in Excel to match the format used in the models. However, the downsides of this approach are that you cannot control what trends are incorporated into your dataset, and that the triangle may be too volatile to be modelled effectively.
There are plenty of ways to simulate data, with many examples throughout the various blogs covered. For this exercise I have used SPLICE, a synthetic paid loss and incurred cost experience simulator built in R. It is built upon its predecessor SynthETIC, with added capabilities for simulating case estimates. SynthETIC would be sufficient for this exercise; however, it makes more sense to use, and become familiar with, the most up-to-date tool, hence I have used SPLICE to generate the dataset. SPLICE and SynthETIC were developed at UNSW, Australia, and a blog has been written on SPLICE which gives some extra background. A description of SPLICE can also be found on GitHub along with the code, how to use it, and descriptions of the different datasets it can create.
The dataset I created has 30 accident and development periods, with an average of 2,000 claims per period, using scenario 2, which has 2% p.a. inflation and notification and settlement delays that depend on claim size. These are the trends we are hoping the machine learning models will pick up. A forewarning: generating the dataset with 2,000 claims per period takes a while, so I recommend doing it once, saving the dataset, and reading it back in on every subsequent run. The reason I have chosen so many claims is to make the dataset less volatile, giving the machine learning techniques a better chance of picking up the trends.
Note I have included a list of all the packages required in this blog in the appendices. Further, please note that to start I have used base R to manipulate data, whereas later in this blog I use the data.table package, in line with the code in the previous working party blogs. I also use some other packages for displaying data.
Here is the code to generate the data. Since it takes a long time to run, I have saved the data set. It is available here as an RDS file.
# Generate the dataset
datasets <- generate_data(
  n_claims_per_period = 2000, # expected claims per period parameter
  n_periods = 30, # number of accident and development periods
  complexity = 2, # scenario 2
  random_seed = 42)

saveRDS(datasets, file="_sim_data.RDS") # Save the dataset
Read in the dataset.
datasets = readRDS("_sim_data.RDS")
As mentioned, the SPLICE dataset contains a large amount of information including a claims data frame, an incurred data frame and a paid data frame. I will be using the paid data frame, which includes the following features:
- Claim number
- Payment number
- Occurrence period (origin period)
- Occurrence time
- Claim size
- Notification delay
- Settlement delay
- Payment time
- Payment period
- Payment size
- Inflated payment size
- Payment delay
However, we do not need all of these features (yet!) and will simplify the data set to a recognisable triangle dataset paid_data, containing accident year acc_year, development year dev_year, the inflated payment amount payments and a past/future indicator train_ind. This involves aggregating the individual payments in each period, resulting in an incremental triangle. Finally, we strip out payments with development years greater than 30, as this is consistent with the working party blogs; in reality we would use the triangle as is, so as to use all the data available.
paid_data = datasets$payment_dataset # Strip out the paid data frame
dev_year = paid_data$payment_period - paid_data$occurrence_period + 1 # Create a development period variable
paid_data = cbind(paid_data, dev_year) # Attach the development period variable
paid_data = aggregate(paid_data$payment_inflated, list(paid_data$occurrence_period, paid_data$dev_year), FUN = sum) # Sum over individual payments to get aggregated triangles
colnames(paid_data) = c("acc_year", "dev_year", "payments")
paid_data = paid_data[paid_data$dev_year < 31,] # Strip out future payments, since the blogs are built on triangles with equal accident and development periods
I have also added some code to insert 0-increment data points, since the simulated data implicitly omits any accident/development cell with no payments. This matters because, when fitting models, a cell observed as 0 affects the fitted parameters, whereas a missing (null) cell is simply ignored, so the two give different parameters.
# Add 0 increments into the data set
zerovec = matrix(c(0), ncol = 2, nrow = 30) # Create a 30 x 2 matrix of zeros
colnames(zerovec) = c("acc_year", "dev_year")
zerovec1 = NULL

for (i in 1:30){
  zerovec[,1] = i # Set the first column as i (accident years)
  zerovec[,2] = c(1:30) # Set the second column as 1 - 30 (development years)
  zerovec1 = rbind(zerovec1, zerovec) # Combine the matrices to get a matrix with a development year for each accident year
}

zerovec2 = cbind(c(0), zerovec1) # Set the payments to 0
colnames(zerovec2) = c("payments", "acc_year", "dev_year")
paid_data = rbind(paid_data, zerovec2)

paid_data = aggregate(paid_data$payments, list(paid_data$acc_year, paid_data$dev_year), FUN = sum) # Aggregating leaves 0 increments in null triangle points
colnames(paid_data) = c("acc_year", "dev_year", "payments")
We add a past/future value indicator to determine whether a data point is in our training data set, and strip out future periods after development period 30.
for (i in c(1 : length(paid_data$payments))) {
  if ((paid_data$acc_year[i] + paid_data$dev_year[i]) < 32) { paid_data$train_ind[i] = TRUE } else { paid_data$train_ind[i] = FALSE } # Set future data points as FALSE and past as TRUE
}

paid_data = paid_data[paid_data$dev_year < 31,] # Strip out future periods as this is the required format for the models
Let’s have a look at our data set.
datatable(paid_data) |> formatRound("payments", digits = 0)
And see what the development looks like cumulatively and incrementally by accident and development period. If you want to just look at the history, then take the # off the first line of code.
# paid_data = paid_data[paid_data$train_ind == TRUE,] # remove the leading # to plot the history only

p1 <- ggplot(data=paid_data, aes(x=dev_year, y=payments, colour=as.factor(acc_year))) + # Plot
  geom_line(linewidth = 1) +
  scale_color_viridis_d(begin=0.9, end=0) +
  ggtitle("Incremental") +
  theme_bw() +
  theme(legend.position = "none", legend.title=element_blank(), legend.text=element_text(size=8))

p1
cum_paid_data = NULL

for (i in 1 : 30) {
  obj = paid_data$payments[paid_data$acc_year == i]
  obj = cumsum(obj)
  obj = data.frame(acc_year = i, dev_year = 1:(30), cumulative = obj)
  cum_paid_data = rbind(cum_paid_data, obj)
}
p2 <- ggplot(data=cum_paid_data, aes(x=dev_year, y=cumulative, colour=as.factor(acc_year))) + # Plot
  geom_line(linewidth=1) +
  scale_color_viridis_d(begin=0.9, end=0) +
  ggtitle("Cumulative") +
  theme_bw() +
  theme(legend.position = "bottom", legend.title=element_blank(), legend.text=element_text(size=6))

p2
We can see that our dataset is quite homogeneous and has a clear inflationary trend.
Reserving with GLMs
Before we start looking at any machine learning techniques we will cover the blog on reserving with GLMs. Since many machine learning techniques are based on GLMs this is a good place to start. I also found it useful as a place to re-familiarise myself with R while already understanding the underlying maths. Finally, using a GLM on unedited triangle data produces the same reserves as a volume-weighted chain ladder estimate using all years, so it is easy to check that the results being produced by R are what you expect.
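For reference, the model fitted below is the standard cross-classified over-dispersed (quasi-) Poisson model; a sketch of its structure, in my own notation:

$$
\log \mathbb{E}[Y_{ij}] = \alpha_i + \beta_j, \qquad \operatorname{Var}(Y_{ij}) = \phi \, \mathbb{E}[Y_{ij}],
$$

where $Y_{ij}$ is the incremental payment for accident year $i$ and development year $j$, the $\alpha_i$ and $\beta_j$ are the accident and development year effects (the accf and devf factors in the code below), and $\phi$ is the dispersion parameter. It is this model whose fitted reserves coincide with the chain ladder estimate.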
If you are unsure of how GLMs work, I found the 2016 CAS monograph referenced in the GLM blog very useful. I’ve also found this article, which is a good place to start from scratch.
Following the code in the GLM blog is relatively simple. It begins with how the data is downloaded; this can be ignored as we don’t use it. Instead we will link in our SPLICE dataset as a data table and filter it for past values. We also load (or install) a number of packages.
library(data.table) # manipulate the data
library(patchwork) # easily combine plots
options(scipen = 99) # get rid of scientific notation
paid_data_GLM = data.table(paid_data) # GLMs function requires data tables
paid_data_GLM = paid_data_GLM[paid_data_GLM$train_ind == TRUE] # set the data set for the GLM as the training data set only
Then we fit the model. Note we add a calendar year variable, which the blog uses for diagnostics. Also, we are using accf and devf as our factor names to agree with code in later blogs.
paid_data_GLM[, accf := as.factor(acc_year) # GLMs require factors not numbers
              # Note we are using 'accf' and 'devf' as our factor names to agree with code in later blogs
            ][, devf := as.factor(dev_year)
            ][, cal_year := acc_year + dev_year - 1]
glm_fit1 <- glm(data = paid_data_GLM,
                family = quasipoisson(link = "log"),
                formula = "payments ~ 0 + accf + devf")

glm_fit1$coeff_table <- data.table(parameter = names(glm_fit1$coefficients),
                                   coeff_glm_fit1 = glm_fit1$coefficients)
head(glm_fit1$coeff_table)
parameter coeff_glm_fit1
1: accf1 9.579442
2: accf2 9.459583
3: accf3 9.500844
4: accf4 9.540363
5: accf5 9.518344
6: accf6 9.514399
Next, we create a null dataset of future increments, to which we apply the GLM prediction, giving us our reserve estimate. This is achieved by following the code in the underlying blog, and is summarised in the table below.
# first make the lower triangle data set
ay <- NULL
dy <- NULL

for(i in 2:30){
  ay <- c(ay, rep(i, times=(i-1)))
  dy <- c(dy, (30-i+2):30)
}

futdata <- data.table(acc_year = ay, dev_year = dy)
# make factors
futdata[, cal_year := acc_year + dev_year
      ][, accf := as.factor(acc_year)
      ][, devf := as.factor(dev_year)]
# make the prediction and sum by acc_year
x <- predict(glm_fit1, newdata = futdata, type="response")

futdata[, incremental := x]

# data.table syntax to get summary by accident year
ocl_year <- futdata[, lapply(.SD, sum), .SDcols=c("incremental"), by="acc_year"]

# total ocl
ocl_total <- futdata[, sum(incremental)]

# print the acc year table with total
ocl_year[, acc_year := as.character(acc_year)] # to make a table with total row
ocl_year_print <- rbind(ocl_year, data.table(acc_year="Total", incremental=ocl_total))
setnames(ocl_year_print, "incremental", "OCL") # rename column for printing

datatable(ocl_year_print) |> formatRound("OCL", digits = 0)
As a check, I am going to see if my reserves match the chain ladder method (which they should, unless one of the payments on the leading diagonal is zero). The code is taken from the LASSO blog, introduced below, edited to fit the column names in our dataset and to perform a volume-weighted chain ladder using all accident years.
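For clarity, the volume-weighted development factor that the loop below calculates for each development year $j$ is:

$$
\hat{f}_j = \frac{\sum_{i=1}^{30-j} C_{i,j+1}}{\sum_{i=1}^{30-j} C_{i,j}},
$$

where $C_{i,j}$ is the cumulative payment for accident year $i$ at development year $j$, and the sum runs over the accident years for which both development years $j$ and $j+1$ have been observed.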
# get cumulative payments to make it easier to calculate CL factors
paid_data_GLM[, cumpmts := cumsum(payments), by=.(acc_year)][train_ind==FALSE, cumpmts := NA]

cl_fac <- numeric(29) # hold CL factors for number of years we require reserves

for (j in 1:29){ # Code edited to be a volume weighted average on all years
  cl_fac[j] <-
    paid_data_GLM[train_ind==TRUE & dev_year == (j+1) & acc_year <= (30-j), sum(cumpmts)] /
    paid_data_GLM[train_ind==TRUE & dev_year == (j) & acc_year <= (30-j), sum(cumpmts)]
}

# accumulate the CL factors
cl_cum <- rev(cumprod(rev(cl_fac)))

# leading diagonal for projection
leading_diagonal <- paid_data_GLM[train_ind==TRUE & acc_year + dev_year == 31 & acc_year > 1, cumpmts]

# CL amounts now
cl_os <- cl_cum * leading_diagonal - leading_diagonal

sum(cl_os) - ocl_year_print$OCL[30]
[1] 0.3830585
The difference is essentially zero - rounding / floating point errors are likely the source of it not being exactly zero.
Results by accident year are shown below.
cl_os = rev(cl_os)
cl_os[30] = sum(cl_os)

ocl_year_print$cl_os = cl_os

ocl_year_print$difference = ocl_year_print$cl_os - ocl_year_print$OCL

datatable(ocl_year_print) |> formatRound(c("OCL", "cl_os"), digits = 0) |> formatRound("difference", digits = 4)
The GLM blog also gives a number of ways of performing diagnostics on the model, in order to refine it, for example by reducing overfitting or incorporating interaction terms. I won’t go into this in detail as it is standard GLM work and is also fairly complex, especially when using your own data set.
Reserving with LASSO
The next blog describes our first machine learning technique: Self-assembling claim reserving models using the LASSO. LASSO stands for Least Absolute Shrinkage and Selection Operator. It is a ‘regularised regression’ method: it fits a GLM-style model (as described in the previous section) to a large set of candidate features, the basis functions (functions of acc, dev and cal), but penalises the size of the fitted coefficients, so additional features, and hence more complex models, are only included if they earn their keep. There is some detail on this in the LASSO blog under the Set of basis functions heading; however, it took me a little while to get my head around, and after some googling I found this website to be useful.
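To make the basis function idea concrete, here is a small sketch of my own (not from the blog) showing a single ramp basis function in the development direction, using the same LinearSpline helper we will copy from the LASSO blog further down:

# LinearSpline(var, start, stop) returns min(stop - start, max(0, var - start))
LinearSpline <- function(var, start, stop){
  pmin(stop - start, pmax(0, var - start))
}

dev <- 1:10                 # development periods 1 to 10
LinearSpline(dev, 3, 999)   # zero up to dev = 3, then increasing by 1 each period
# [1] 0 0 0 1 2 3 4 5 6 7

The LASSO is offered one of these ramps for every possible knot in each of acc, dev and cal, plus their interactions, and decides which ones to keep.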
The LASSO blog details some of the mathematics behind the model, which should be fairly simple to follow for anyone with a good knowledge of statistics, including the model selection section, which introduces Lambda, the hyperparameter for this model. A hyperparameter is a parameter whose value is used to control the learning process. It cannot be fitted from the data; rather, it needs to be tuned by finding the value that gives the best trade-off between the speed and quality of the learning process.
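Glossing over the detail in the LASSO blog, the fitted coefficients minimise a penalised Poisson deviance of roughly this form:

$$
\hat{\beta} = \arg\min_{\beta}\left\{ D\big(y, \exp(X\beta)\big) + \lambda \sum_k |\beta_k| \right\},
$$

where the columns of $X$ are the (scaled) basis functions, $D$ is the Poisson deviance and $\lambda$ is the hyperparameter: the larger $\lambda$ is, the more coefficients are shrunk to exactly zero and the simpler the resulting model.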
The blog then sets out the advantages of the LASSO model: it is an automatic method of dealing with data sets containing complex features, such as shock inflationary events, where chain ladder methods fail. GLMs can also handle these, but designing them can be very time consuming. The LASSO model has generally performed well at this.
Again the LASSO blog then generates its own data set, which we can skip over, bringing in our own data set instead (which needs to be changed to a data table). However, we do need to load the glmnet
package (this might need to be installed). We need to create another column in our data for calendar year, named cal
(accident + development -1), and it would also be useful to change the other column names to match those used in the blog (pmts
, acc
and dev
). Note I have also changed our dataset to the data table called dat
to match the blog.
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
# colour palette for some plots
# IFoA colours
# primary colours
<- "#113458"
dblue <- "#4096b8"
mblue <- "#d9ab16"
gold <- "#dcddd9"
lgrey <- "#3f4548"
dgrey <- "#3F4548"
black #secondary colours
<- "#d01e45"
red <- "#8f4693"
purple <- "#ee741d"
orange <- "#e9458c"
fuscia <- "#8076cf"
violet
= paid_data # Change the name to fit with the blog code
dat colnames(dat) = c( "acc", "dev", "pmts","train_ind" ) # Change the columns to fit with the blog code
= cbind(dat , cal = dat$acc + dat$dev - 1 )
dat = data.table(dat) # change the format to a data table so the manipulation below works dat
Before we start coding the basis functions, there is a note in the LASSO blog on scaling them: each basis function is divided by a scaling factor. We calculate one scaling factor for each feature (i.e. accident, development or calendar period) and divide each basis function by the scaling factor of the feature it relates to. The scaling factors are given by a simple formula that resembles a standard deviation calculation. Scaling stops the raw magnitude of a basis function from influencing how heavily its coefficient is penalised in the loss function; the strength of the penalty should be determined only by Lambda.
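Concretely, the scaling factor for a feature $f$ (this is what the GetScaling function below computes over the training data) is:

$$
\rho_f = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(f_i - \bar{f}\big)^2},
$$

i.e. the population standard deviation of that feature.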
The code first sets up the scaling parameters and then generates the basis functions. Note the use of the linear spline function, which is defined when creating the LASSO dataset that the LASSO blog is using. Since we have skipped this section, we have to go back to get this code before we generate our basis functions as well as defining the numperiods
variable (in our case 30).
num_periods = 30 # Set to the number of periods in your data set

LinearSpline <- function(var, start, stop){
  pmin(stop - start, pmax(0, var - start))
}

GetScaling <- function(vec) {
  fn <- length(vec)
  fm <- mean(vec)
  fc <- vec - fm
  rho_factor <- ((sum(fc^2))/fn)^0.5
}

# function to create the ramps for a particular primary vector
GetRamps <- function(vec, vecname, np, scaling){

  # vec = fundamental regressor
  # vecname = name of regressor
  # np = number of periods
  # scaling = scaling factor to use

  # pre-allocate the matrix to hold the results for speed/efficiency
  n <- length(vec)
  nramps <- (np-1)

  mat <- matrix(data=NA, nrow=n, ncol=nramps)
  cnames <- vector(mode="character", length=nramps)

  col_indx <- 0

  for (i in 1:(np-1)){
    col_indx <- col_indx + 1
    mat[, col_indx] <- LinearSpline(vec, i, 999) / scaling
    cnames[col_indx] <- paste0("L_", i, "_999_", vecname)
  }

  colnames(mat) <- cnames
  return(mat)
}

# create the step (heaviside) function interactions
GetInts <- function(vec1, vec2, vecname1, vecname2, np, scaling1, scaling2) {

  # vec1 = fundamental regressor 1
  # vec2 = fundamental regressor 2
  # vecname1 = name of regressor 1
  # vecname2 = name of regressor 2
  # np = number of periods
  # scaling1 = scaling factor to use for regressor 1
  # scaling2 = scaling factor to use for regressor 2

  # pre-allocate the matrix to hold the results for speed/efficiency
  n <- length(vec1)
  nints <- (np-1)*(np-1)

  mat <- matrix(data=NA_real_, nrow=n, ncol=nints)
  cnames <- vector(mode="character", length=nints)

  col_indx <- 0

  for (i in 2:np){
    ivec <- LinearSpline(vec1, i-1, i) / scaling1
    iname <- paste0("I_", vecname1, "_ge_", i)

    if (length(ivec[is.na(ivec)]>0)) print(paste("NAs in ivec for", i))

    for (j in 2:np){
      col_indx <- col_indx + 1
      mat[, col_indx] <- ivec * LinearSpline(vec2, j-1, j) / scaling2
      cnames[col_indx] <- paste0(iname, "xI_", vecname2, "_ge_", j)

      jvec <- LinearSpline(vec2, j-1, j) / scaling2

      if (length(jvec[is.na(jvec)]>0)) print(paste("NAs in jvec for", j))
    }
  }

  colnames(mat) <- cnames
  return(mat)
}
When setting up the scaling factors in ‘rho_factor_list’ I couldn’t get the for loop to work (something to do with the ‘get’ function), most likely due to my R skills. Instead of spending time sussing this out, I set the three scaling factors individually using the ‘GetScaling’ function.
# get the scaling values
rho_factor_list <- vector(mode="list", length=3)
names(rho_factor_list) <- c("acc", "dev", "cal")
rho_factor_list[["acc"]] = GetScaling(dat[dat$train_ind == TRUE, acc]) # I could not get the 'base::get()' function to work hence have split this into 3 lines
rho_factor_list[["dev"]] = GetScaling(dat[dat$train_ind == TRUE, dev])
rho_factor_list[["cal"]] = GetScaling(dat[dat$train_ind == TRUE, cal])
Once this has been done it should be trivial to set up our scaled basis functions and calculate the effects from our triangle, which is then stored in ‘varset’. We then drop any interaction columns that are constant e.g. columns involving future periods; these won’t affect the model but dropping them is good practice and makes the model run faster.
# main effects - matrix of values of Ramp functions
main_effects_acc <- GetRamps(vec = dat[, acc], vecname = "acc", np = num_periods, scaling = rho_factor_list[["acc"]])
main_effects_dev <- GetRamps(vec = dat[, dev], vecname = "dev", np = num_periods, scaling = rho_factor_list[["dev"]])
main_effects_cal <- GetRamps(vec = dat[, cal], vecname = "cal", np = num_periods, scaling = rho_factor_list[["cal"]])

main_effects <- cbind(main_effects_acc, main_effects_dev, main_effects_cal)

# interaction effects
int_effects <- cbind(
  GetInts(vec1=dat[, acc], vecname1="acc", scaling1=rho_factor_list[["acc"]], np=num_periods,
          vec2=dat[, dev], vecname2="dev", scaling2=rho_factor_list[["dev"]]),
  GetInts(vec1=dat[, dev], vecname1="dev", scaling1=rho_factor_list[["dev"]], np=num_periods,
          vec2=dat[, cal], vecname2="cal", scaling2=rho_factor_list[["cal"]]),
  GetInts(vec1=dat[, acc], vecname1="acc", scaling1=rho_factor_list[["acc"]], np=num_periods,
          vec2=dat[, cal], vecname2="cal", scaling2=rho_factor_list[["cal"]])
)

varset <- cbind(main_effects, int_effects)

# drop any constant columns over the training data set
# do this by identifying the constant columns and dropping them

# get the past data subset only using TRUE/FALSE from train_ind in dat
varset_train <- varset[dat$train_ind, ]

# identify constant columns as those with max=min
rm_cols <- varset_train[, apply(varset_train, MARGIN=2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]

# drop these constant columns
varset <- varset[, !(colnames(varset) %in% colnames(rm_cols))]
Finally we come to fitting the model, using the cv.glmnet
function, which is described well in the LASSO blog, and then plotting log(Lambda) against the Poisson deviance. This shows which Lambda produces the lowest deviance in the cross-validation exercise performed by the function, and hence gives the best model. Cross-validation is explained well in the next blog, R: Machine Learning Triangles, so I will not go into it in detail here, except to say it is the most common method of tuning hyperparameters. The graph shows the Lambda at which the minimum deviance is achieved, as well as a larger Lambda one standard error away (lambda.1se). The reason for quoting the larger Lambda is that some people view it as a good compromise between a model that fits well and the possible overfitting of the lambda.min model.
Note I have increased the maxit
max iterations to 500,000, since the Lasso was not converging. This does not have a huge impact on the time it takes to run.
my_pmax <- num_periods^2 # max number of variables ever to be nonzero
my_dfmax <- num_periods*10 # max number of vars in the model

time1 <- Sys.time()
cv_fit <- cv.glmnet(x = varset[dat$train_ind,],
                    y = dat[train_ind==TRUE, pmts],
                    family = "poisson",
                    nlambda = 200, # default is 100 but often not enough for this type of problem
                    nfolds = 8,
                    thresh = 1e-08, # default is 1e-7 but we found we needed to reduce it from time to time
                    lambda.min.ratio = 0, # allow very small values in lambda
                    dfmax = my_dfmax,
                    pmax = my_pmax,
                    alpha = 1,
                    standardize = FALSE,
                    maxit = 500000) # convergence can be slow so increase max number of iterations
print("time taken for cross validation fit: ")
[1] "time taken for cross validation fit: "
Sys.time() - time1
Time difference of 41.5844 secs
plot(cv_fit)
cv_fit$lambda.min
[1] 945.3654
cv_fit$lambda.1se
[1] 2678.311
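If you preferred to adopt the more conservative lambda.1se model rather than lambda.min, the only change needed is the value passed to s when predicting. A minimal sketch, re-using the cv_fit and varset objects created above (the fitted_1se name is just mine for illustration):

# predicted payments for every cell under the more heavily penalised lambda.1se model
fitted_1se <- as.vector(predict(cv_fit,
                                newx = varset,
                                s = cv_fit$lambda.1se,
                                type = "response"))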
The LASSO blog gives a couple of ways of assessing the appropriateness of the model; I will not go into these now as we will be carrying out a full model validation exercise in the next section. However, we do need the first bit of code from that section, which attaches the model’s predicted values to the data. Note that if you do carry out the assessment exercise, the LASSO blog’s data contains an underlying ‘mu’ value which ours does not, so it needs to be removed from the graphing function.
Now we are ready to produce some reserves using the LASSO, and compare them to the chain ladder/GLM reserves. The code in the blog contains the code for the chain ladder reserves which we used above, hence we will not repeat it. The blog also produces a comparison plot and a summary of the outstanding reserves for each method.
dat[, fitted := as.vector(predict(cv_fit,
                                  newx = varset,
                                  s = cv_fit$lambda.min,
                                  type="response"))]

dat_LASSO = copy(dat) # Define a separate object to keep the results for comparison later

os_acc <- dat[train_ind == FALSE, # Summarise the LASSO and simulated reserves by accident year
              .(LASSO = sum(fitted), simulated = sum(pmts)), # Note we have removed the 'underlying' as this does not exist in our data
              keyby=.(acc)]

os <- os_acc[, .(LASSO = sum(LASSO), # Create totals
                 simulated = sum(simulated))]

# Add the CL amounts now
os_acc[, Chainladder := cl_os[-30]] # note we have already reversed the order of our chainladder reserves so do not need to here, and we remove the total we added
os[, Chainladder := cl_os[30]] # the total element we appended earlier (sum(cl_os) would double count it)

# make a long [tidy format] version of the data for use with ggplot2
os_acc_long <- melt(os_acc, id.vars = "acc")

os_plot <-
  ggplot(data=os_acc_long, aes(x=acc, y=value, colour=variable,
                               linetype=variable, size=variable, alpha=variable))+
  geom_line()+
  scale_linetype_manual(name="", values=c("solid", "dashed", "dotted", "solid"))+
  scale_colour_manual(name="", values=c(red, dgrey, dgrey, mblue))+
  scale_size_manual(name="", values=c(2, 1, 1, 1.5))+
  scale_alpha_manual(name="", values=c(0.8, 0.5, 0.5, 0.8))+
  theme_classic()+
  theme(legend.position="bottom", legend.title=element_blank())+
  labs(x="Accident quarter", y="Amount")+
  ggtitle("Outstanding amounts")

os_plot
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
We discuss the effectiveness of the LASSO model at predicting the future reserves at the end of this blog, so I won’t comment on it now.
Machine Learning Triangles
The next blog brings together everything we’ve done so far, and compares a number of machine learning and non-machine learning methods:
- GLM
- LASSO
- Decision trees
- Random Forest
- XGBoost
It predominantly uses the mlr3 R package, which requires a number of packages that are shown in the MLT blog, so these need to be installed. The “mlr3extralearners” package is not on CRAN, but the blog provides code for this to be installed. R will ask you “Enter one or more numbers, or an empty line to skip updates”. Enter 1 in the Console and press enter to install all the updates. Note that the mlr3 package is constantly being developed so the version installed needs to be checked.
The MLT blog is focused on setting up data (which we don’t need to worry about as we will use our own), applying ML models to the data, tuning the hyperparameters and comparing the performance of the models.
When loading our data we can use the same code as for the LASSO model. However, note that we do need to set the accident and development years as factors and we want to label our rows 1:900.
library(mlr3) # base mlr3 package
library(mlr3learners) # implements common learners
library(mlr3tuning) # tuning methodology
Loading required package: paradox
library(paradox) # used to create parameter sets for hyperparameter tuning
library(mlr3extralearners) # needed for glms
library(rpart.plot) # to draw decision trees created by rpart
Loading required package: rpart
# ensure these packages are installed - but you don't need to attach them in the library command
# rpart
# ranger
# xgboost
# Reset our data
dat = paid_data # Change the name to fit with the blog code
colnames(dat) = c("acc", "dev", "pmts", "train_ind") # Change the columns to fit with the blog code
dat = cbind(dat, cal = dat$acc + dat$dev - 1)
dat = data.table(dat) # change the format to a data table so the manipulation below works

# create the num_periods variable - number of acc/dev periods
num_periods <- dat[, max(acc)]

dat[, accf := as.factor(acc)]
dat[, devf := as.factor(dev)]

# data.table code, .N = number of rows
dat[, row_id := 1:.N]
We then set up an mlr3 task, which holds the data and sets roles for the data. When doing this, note that when setting the roles for the future data, you need to define the role as “holdout” instead of “validation”, as there has been a change to the mlr3 package since the MLT blog was written.
# note can also use the sugar function tsk()
<- TaskRegr$new(id = "reserves",
task backend = dat[, .(pmts, train_ind, acc, dev, cal, row_id)],
target = "pmts")
We then set the training dataset “use” and the validation dataset “holdout”.
task$set_row_roles(dat[train_ind==FALSE, row_id], roles = "holdout")

# make row_id a name
task$set_col_roles("row_id", roles="name")

# drop train_ind from the feature list
task$col_roles$feature <- setdiff(task$feature_names, "train_ind")

# check the task has the correct variables now
task
<TaskRegr:reserves> (465 x 4)
* Target: pmts
* Properties: -
* Features (3):
- dbl (3): acc, cal, dev
# check all the col_roles
print(task$col_roles)
$feature
[1] "acc" "cal" "dev"
$target
[1] "pmts"
$name
[1] "row_id"
$order
character(0)
$stratum
character(0)
$group
character(0)
$weight
character(0)
As mentioned in the previous section of this blog, we now come on to how to tune the hyperparameters of a machine learning model using cross-validation. A simple summary of cross-validation: it splits the data we are using to fit the model (i.e. the historic triangle data) into a training data set and a test data set. This is done by splitting the data into K partitions (folds), fitting the model to all bar one of these partitions, and letting the partition we left out be the test data set. We repeat the process K times so that each partition is the test data set exactly once. We then calculate the performance of the fitted model on each test set and average it. For this we need a performance measure, which here will be the root mean square error (RMSE). We compare the average performance of models fitted with a variety of hyperparameter values to determine the optimal hyperparameters. There is a more detailed description of this in the MLT blog, but hopefully this summary will make it easier to follow.
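For reference, the performance measure used for the rest of this blog is:

$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i\big)^2},
$$

where the $y_i$ are the observed incremental payments in the test fold and the $\hat{y}_i$ are the model’s predictions; lower is better.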
Setting up the cross-validation in mlr3 is very simple, and is done below. The MLT blog uses 6 folds (partitions) of the data and 25 evaluations (the number of different hyperparameter combinations to test).
# create object
crossval = rsmp("cv", folds=6)

set.seed(42) # reproducible results

# populate the folds
# NB: mlr3 will only use data where role was set to "use"
crossval$instantiate(task)

# Set the performance measure to root mean square error
measure <- msr("regr.rmse")
evals_trm = trm("evals", n_evals = 25)
Now it’s time to set up the decision tree, random forest and XGBoost models. The MLT blog doesn’t go into great detail about how these models work, so I have sourced some useful resources from a quick Google; these are listed in the appendix.
We now set up the decision tree, using the mlr3 and rpart.plot packages. The coding and theory behind decision trees are fairly simple:
# create a learner, ie a container for a decision tree model
<- lrn("regr.rpart")
lrn_rpart_default
# fit the model on all the past data (ie where row_roles==use)
$train(task)
lrn_rpart_default
# Visualise the tree
::rpart.plot(lrn_rpart_default$model, roundint = FALSE) rpart.plot
$param_set lrn_rpart_default
<ParamSet>
id class lower upper nlevels default value
1: cp ParamDbl 0 1 Inf 0.01
2: keep_model ParamLgl NA NA 2 FALSE
3: maxcompete ParamInt 0 Inf Inf 4
4: maxdepth ParamInt 1 30 30 30
5: maxsurrogate ParamInt 0 Inf Inf 5
6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
7: minsplit ParamInt 1 Inf Inf 20
8: surrogatestyle ParamInt 0 1 2 0
9: usesurrogate ParamInt 0 2 3 2
10: xval ParamInt 0 Inf Inf 10 0
However, tuning the hyperparameters of the tree is more involved, and uses the paradox R library. The MLT blog shows you how to tune the complexity parameter cp and the minimum number of observations a node needs before a split is attempted, minsplit. The complexity parameter sets the minimum improvement in fit a split must deliver before a node is added, and as a result it controls the complexity, speed and overfitting of the model. The code defines a range for cp and for minsplit and tests 25 combinations within those ranges (a 5-point grid in each dimension), returning the one with the lowest RMSE.
tune_ps_rpart <- ps(
  cp = p_dbl(lower = 0.001, upper = 0.1), # set the cp
  minsplit = p_int(lower = 1, upper = 10) # set the minsplit
)

# to see what's searched if we set a grid of 5
# rbindlist(generate_design_grid(tune_ps_rpart, 5)$transpose())

# create the hyper-parameter tuning instance for this particular learner/model
instance_rpart <- TuningInstanceSingleCrit$new(
  task = task,
  learner = lrn("regr.rpart"),
  resampling = crossval,
  measure = measure,
  search_space = tune_ps_rpart,
  terminator = evals_trm
)

# create a grid-search tuner for a 5-point grid
tuner <- tnr("grid_search", resolution = 5)

# suppress log output for readability
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

tuner$optimize(instance_rpart)
cp minsplit learner_param_vals x_domain regr.rmse
1: 0.001 10 <list[3]> <list[2]> 1855345
# restart log output
::get_logger("bbotk")$set_threshold("info")
lgr::get_logger("mlr3")$set_threshold("info")
lgr
print(instance_rpart$result_learner_param_vals)
$xval
[1] 0
$cp
[1] 0.001
$minsplit
[1] 10
<- lrn("regr.rpart")
lrn_rpart_tuned $param_set$values = instance_rpart$result_learner_param_vals
lrn_rpart_tuned$train(task)
lrn_rpart_tuned
# plot the model
::rpart.plot(lrn_rpart_tuned$model, roundint = FALSE) rpart.plot
As noted in the MLT blog, decision trees are fairly poor models for our purposes but are a good introduction to mlr3 and tuning hyperparameters. There are plenty of other hyperparameters that can be tuned (many of which are mentioned in the link above), however, these are unlikely to help us much.
Setting up and tuning the random forest model is very similar. This time we will be tuning the number of trees and the minimum node size (the minimum number of observations allowed in a terminal node of each tree in the forest). mlr3 finds the best values for these two hyperparameters; the remaining hyperparameters keep their defaults, and the fitted model and its parameters are then output.
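As a reminder of what the num.trees hyperparameter controls: a random forest prediction is simply the average of many decision trees, each grown on a bootstrap sample of the training data with some randomness in the candidate splits, so that

$$
\hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x),
$$

where $B$ is num.trees and $T_b$ is the $b$-th tree; min.node.size limits how deep each individual tree can grow.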
# Random Forest
lrn("regr.ranger")$param_set
<ParamSet>
id class lower upper nlevels default
1: alpha ParamDbl -Inf Inf Inf 0.5
2: always.split.variables ParamUty NA NA Inf <NoDefault[3]>
3: holdout ParamLgl NA NA 2 FALSE
4: importance ParamFct NA NA 4 <NoDefault[3]>
5: keep.inbag ParamLgl NA NA 2 FALSE
6: max.depth ParamInt 0 Inf Inf
7: min.node.size ParamInt 1 Inf Inf 5
8: min.prop ParamDbl -Inf Inf Inf 0.1
9: minprop ParamDbl -Inf Inf Inf 0.1
10: mtry ParamInt 1 Inf Inf <NoDefault[3]>
11: mtry.ratio ParamDbl 0 1 Inf <NoDefault[3]>
12: num.random.splits ParamInt 1 Inf Inf 1
13: num.threads ParamInt 1 Inf Inf 1
14: num.trees ParamInt 1 Inf Inf 500
15: oob.error ParamLgl NA NA 2 TRUE
16: quantreg ParamLgl NA NA 2 FALSE
17: regularization.factor ParamUty NA NA Inf 1
18: regularization.usedepth ParamLgl NA NA 2 FALSE
19: replace ParamLgl NA NA 2 TRUE
20: respect.unordered.factors ParamFct NA NA 3 ignore
21: sample.fraction ParamDbl 0 1 Inf <NoDefault[3]>
22: save.memory ParamLgl NA NA 2 FALSE
23: scale.permutation.importance ParamLgl NA NA 2 FALSE
24: se.method ParamFct NA NA 2 infjack
25: seed ParamInt -Inf Inf Inf
26: split.select.weights ParamUty NA NA Inf
27: splitrule ParamFct NA NA 3 variance
28: verbose ParamLgl NA NA 2 TRUE
29: write.forest ParamLgl NA NA 2 TRUE
id class lower upper nlevels default
parents value
1: splitrule
2:
3:
4:
5:
6:
7:
8:
9: splitrule
10:
11:
12: splitrule
13: 1
14:
15:
16:
17:
18:
19:
20:
21:
22:
23: importance
24:
25:
26:
27:
28:
29:
parents value
tune_ps_ranger <- ps(
  num.trees = p_int(lower = 100, upper = 900),
  min.node.size = p_int(lower = 1, upper = 5)
)
# to see what's searched if we set a grid of 5
#rbindlist(generate_design_grid(tune_ps_ranger, 5)$transpose())
instance_ranger <- TuningInstanceSingleCrit$new(
  task = task,
  learner = lrn("regr.ranger"),
  resampling = crossval,
  measure = measure,
  search_space = tune_ps_ranger,
  terminator = evals_trm
)

# suppress output
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

# We can reuse the tuner we set up for the decision tree
# this was tuner <- tnr("grid_search", resolution = 5)
tuner$optimize(instance_ranger)
num.trees min.node.size learner_param_vals x_domain regr.rmse
1: 700 1 <list[3]> <list[2]> 1678458
# turn output back on
::get_logger("bbotk")$set_threshold("info")
lgr::get_logger("mlr3")$set_threshold("info")
lgr
$result_learner_param_vals instance_ranger
$num.threads
[1] 1
$num.trees
[1] 700
$min.node.size
[1] 1
# Fit the model to all the parameters
<- lrn("regr.ranger")
lrn_ranger_tuned $param_set$values = instance_ranger$result_learner_param_vals
lrn_ranger_tuned$train(task)
lrn_ranger_tuned$model lrn_ranger_tuned
Ranger result
Call:
ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), case.weights = task$weights$weight, num.threads = 1L, num.trees = 700L, min.node.size = 1L)
Type: Regression
Number of trees: 700
Sample size: 465
Number of independent variables: 3
Mtry: 1
Target node size: 1
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 2887242432235
R squared (OOB): 0.9447452
Next, we set up the XGBoost model. Of all the models, I’d recommend doing more research into this one: it was introduced in 2016 and has since taken the machine learning world by storm, so it is worth paying particular attention to. As with the decision tree and random forest, we will be setting a few, but not all, of the hyperparameters. There are many more hyperparameters available to tune in the XGBoost model than in the previous two, so properly fitting one of these models will require more research. Here is a summary of the hyperparameters:
lrn("regr.xgboost")$param_set
<ParamSet>
id class lower upper nlevels default
1: alpha ParamDbl 0 Inf Inf 0
2: approxcontrib ParamLgl NA NA 2 FALSE
3: base_score ParamDbl -Inf Inf Inf 0.5
4: booster ParamFct NA NA 3 gbtree
5: callbacks ParamUty NA NA Inf <list[0]>
6: colsample_bylevel ParamDbl 0 1 Inf 1
7: colsample_bynode ParamDbl 0 1 Inf 1
8: colsample_bytree ParamDbl 0 1 Inf 1
9: disable_default_eval_metric ParamLgl NA NA 2 FALSE
10: early_stopping_rounds ParamInt 1 Inf Inf
11: eta ParamDbl 0 1 Inf 0.3
12: eval_metric ParamUty NA NA Inf rmse
13: feature_selector ParamFct NA NA 5 cyclic
14: feval ParamUty NA NA Inf
15: gamma ParamDbl 0 Inf Inf 0
16: grow_policy ParamFct NA NA 2 depthwise
17: interaction_constraints ParamUty NA NA Inf <NoDefault[3]>
18: iterationrange ParamUty NA NA Inf <NoDefault[3]>
19: lambda ParamDbl 0 Inf Inf 1
20: lambda_bias ParamDbl 0 Inf Inf 0
21: max_bin ParamInt 2 Inf Inf 256
22: max_delta_step ParamDbl 0 Inf Inf 0
23: max_depth ParamInt 0 Inf Inf 6
24: max_leaves ParamInt 0 Inf Inf 0
25: maximize ParamLgl NA NA 2
26: min_child_weight ParamDbl 0 Inf Inf 1
27: missing ParamDbl -Inf Inf Inf NA
28: monotone_constraints ParamUty NA NA Inf 0
29: normalize_type ParamFct NA NA 2 tree
30: nrounds ParamInt 1 Inf Inf <NoDefault[3]>
31: nthread ParamInt 1 Inf Inf 1
32: ntreelimit ParamInt 1 Inf Inf
33: num_parallel_tree ParamInt 1 Inf Inf 1
34: objective ParamUty NA NA Inf reg:squarederror
35: one_drop ParamLgl NA NA 2 FALSE
36: outputmargin ParamLgl NA NA 2 FALSE
37: predcontrib ParamLgl NA NA 2 FALSE
38: predictor ParamFct NA NA 2 cpu_predictor
39: predinteraction ParamLgl NA NA 2 FALSE
40: predleaf ParamLgl NA NA 2 FALSE
41: print_every_n ParamInt 1 Inf Inf 1
42: process_type ParamFct NA NA 2 default
43: rate_drop ParamDbl 0 1 Inf 0
44: refresh_leaf ParamLgl NA NA 2 TRUE
45: reshape ParamLgl NA NA 2 FALSE
46: sample_type ParamFct NA NA 2 uniform
47: sampling_method ParamFct NA NA 2 uniform
48: save_name ParamUty NA NA Inf
49: save_period ParamInt 0 Inf Inf
50: scale_pos_weight ParamDbl -Inf Inf Inf 1
51: seed_per_iteration ParamLgl NA NA 2 FALSE
52: skip_drop ParamDbl 0 1 Inf 0
53: strict_shape ParamLgl NA NA 2 FALSE
54: subsample ParamDbl 0 1 Inf 1
55: top_k ParamInt 0 Inf Inf 0
56: training ParamLgl NA NA 2 FALSE
57: tree_method ParamFct NA NA 5 auto
58: tweedie_variance_power ParamDbl 1 2 Inf 1.5
59: updater ParamUty NA NA Inf <NoDefault[3]>
60: verbose ParamInt 0 2 3 1
61: watchlist ParamUty NA NA Inf
62: xgb_model ParamUty NA NA Inf
id class lower upper nlevels default
parents value
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13: booster
14:
15:
16: tree_method
17:
18:
19:
20: booster
21: tree_method
22:
23:
24: grow_policy
25:
26:
27:
28:
29: booster
30: 1
31: 1
32:
33:
34:
35: booster
36:
37:
38:
39:
40:
41: verbose
42:
43: booster
44:
45:
46: booster
47: booster
48:
49:
50:
51:
52: booster
53:
54:
55: booster,feature_selector
56:
57: booster
58: objective
59:
60: 0
61:
62:
parents value
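Before tuning, it is worth keeping in mind what the two main hyperparameters we search over below actually do. Gradient boosting builds the model additively, one small tree at a time, roughly as

$$
\hat{y}^{(m)}(x) = \hat{y}^{(m-1)}(x) + \eta\, f_m(x), \qquad m = 1, \dots, \text{nrounds},
$$

where each $f_m$ is a small tree (its size limited by max_depth) and $\eta$ is the learning rate eta. A smaller eta typically needs more rounds, which is why the search below tunes eta and nrounds together.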
It should also be noted that we need a lot more evaluations to run XGBoost, and hence it takes a longer time to run. The MLT blog provides the code for modelling the XGBoost with any number of evaluations, as well as a best fit model, which had been previously developed. I am going to fit two models, one with 25 evaluations and one with 500, but not attempt a best fit model.
Editor’s note: since this code takes a while to run (~18 mins on a 6.5 year old computer), I haven’t rerun it here. The next code chunk below uses the results from this block of code to fit the two models, one tuned with the 25 evaluations and one from the 500.
tune_ps_xgboost <- ps(
  # ensure non-negative values
  objective = p_fct("reg:tweedie"),
  tweedie_variance_power = p_dbl(lower = 1.01, upper = 1.99),
  # eta can be up to 1, but usually it is better to use low eta, and tune nrounds for a more fine-grained model
  eta = p_dbl(lower = 0.01, upper = 0.3),
  # select value for gamma
  # tuning can help overfitting but we don't investigate this here
  gamma = p_dbl(lower = 0, upper = 0),
  # We know that the problem is not that deep in interactivity so we search a low depth
  max_depth = p_int(lower = 2, upper = 6),
  # nrounds to stop overfitting
  nrounds = p_int(lower = 100, upper = 500)
)
instance_xgboost <- TuningInstanceSingleCrit$new(
  task = task,
  learner = lrn("regr.xgboost"),
  resampling = crossval,
  measure = measure,
  search_space = tune_ps_xgboost,
  terminator = evals_trm
)

# need to make a new tuner with resolution 4
#tuner <- tnr("grid_search", resolution = 4)
set.seed(84) # for random search for reproducibility
tuner <- tnr("random_search")

lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

time_bef <- Sys.time()
tuner$optimize(instance_xgboost)

lgr::get_logger("bbotk")$set_threshold("info")
lgr::get_logger("mlr3")$set_threshold("info")

Sys.time() - time_bef
#500 evals
= trm("evals", n_evals = 500)
evals_trm2
<- ps(
tune_ps_xgboost_500
# ensure non-negative values
objective = p_fct("reg:tweedie"),
tweedie_variance_power = p_dbl(lower = 1.01, upper = 1.99),
# eta can be up to 1, but usually it is better to use low eta, and tune nrounds for a more fine-grained model
eta = p_dbl(lower = 0.01, upper = 0.3),
# select value for gamma
# tuning can help overfitting but we don't investigate this here
gamma = p_dbl(lower = 0, upper = 0),
# We know that the problem is not that deep in interactivity so we search a low depth
max_depth = p_int(lower = 2, upper = 6),
# nrounds to stop overfitting
nrounds = p_int(lower = 100, upper = 500)
)
instance_xgboost_500 <- TuningInstanceSingleCrit$new(
  task = task,
  learner = lrn("regr.xgboost"),
  resampling = crossval,
  measure = measure,
  search_space = tune_ps_xgboost_500,
  terminator = evals_trm2
)

# need to make a new tuner with resolution 4
#tuner <- tnr("grid_search", resolution = 4)
set.seed(84) # for random search for reproducibility
tuner <- tnr("random_search")

lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

time_bef <- Sys.time()
tuner$optimize(instance_xgboost_500)

lgr::get_logger("bbotk")$set_threshold("info")
lgr::get_logger("mlr3")$set_threshold("info")

Sys.time() - time_bef
We need to manually train our XGBoost models, by setting the hyperparameters.
If you’ve run the code above then you can see the parameters using this code:
instance_xgboost$result_learner_param_vals
instance_xgboost_500$result_learner_param_vals
Set the hyperparameters (using previously run tuning parameters if you haven’t rerun the code) and train the models:
= lrn("regr.xgboost", objective="reg:tweedie", nrounds=192,
lrn_xgboost_25 tweedie_variance_power=1.448514, eta=0.1726785, gamma=0, max_depth=3)
$train(task)
lrn_xgboost_25
= lrn("regr.xgboost", objective="reg:tweedie", nrounds=159,
lrn_xgboost_500 tweedie_variance_power=1.138227, eta=0.07166036, gamma=0, max_depth=2)
$train(task) lrn_xgboost_500
Include XGBoost in our set of models:
# make a new copy of the data, deleting the row_id, accf and devf columns
model_forecasts <- copy(dat)[, c("row_id", "accf", "devf") := NULL]

# add the model projections - specify list of learners to use and their names
lrnrs <- c(lrn_rpart_tuned, lrn_ranger_tuned, lrn_xgboost_25, lrn_xgboost_500) # add 500

# the id property is used to name the variables - since we have 2 xgboosts, adjust id where necessary
lrn_xgboost_25$id <- "regr.xgboost_25"
lrn_xgboost_500$id <- "regr.xgboost_500"

for(l in lrnrs){
  # learner$predict(task, row_ids) is how to get predicted values, returns 3 cols with preds in response
  # use row_ids to tell it to use entire data set, ie "use" and validation rows
  #model_forecasts[, (nl) := l$predict(task, row_ids=1:nrow(dat))$response]
  model_forecasts[, l$id := l$predict(task=task, row_ids=1:nrow(dat))$response]
}
We then add our GLM estimate. In the MLT blog this is done with a new GLM, however, we can re-use ours.
# GLM
"regr.glm" := predict(glm_fit1, newdata=dat[, .(pmts, accf, devf)], type="response")] model_forecasts[,
Include our LASSO estimates in the model forecasts. The MLT blog shows how to use the mlr3 package to perform the LASSO model, but this is surplus to requirement here.
# Bring in LASSO results from before
model_forecasts$regr.cv_glmnet = dat_LASSO$fitted
Now it’s a case of comparing all the models. We calculate the RMSE and the standard error contribution of the training and test data and output a number of graphs so we can investigate our models.
# get list of variables to form observations by - this is why we used a common naming structure
model_names <- names(model_forecasts)[names(model_forecasts) %like% "regr"]

# wide -> long
model_forecasts_long <- melt(model_forecasts,
                             measure.vars = model_names,
                             id.vars=c("acc", "dev", "cal", "pmts", "train_ind"))

setnames(model_forecasts_long, c("variable", "value"), c("model", "fitted"))

model_rmse <- model_forecasts_long[, se_contrib := (fitted-pmts)^2
                                 ][, .(se_contrib = sum(se_contrib), num = .N), by=.(model, train_ind)
                                 ][, rmse := sqrt(se_contrib / num)]
datatable(model_rmse) |> formatRound("se_contrib", digits = 0)|> formatRound("rmse", digits = 0)
We can see the XGBoost with 500 evaluations has the lowest RMSE on the future values, with the LASSO and decision tree next, and the random forest performing the worst. This is interesting, as the random forest has the lowest RMSE on the past values, which suggests it is overfitting and not picking up the inflationary trend.
Let’s add in the actuals for comparison.
# add in the actual values for comparison
actuals <- model_forecasts_long[model=="regr.glm",]
actuals[, model := "actual"
      ][, fitted := pmts]

model_forecasts_long_plot <- rbind(actuals, model_forecasts_long)

# calculate log_fitted
model_forecasts_long_plot[, log_fitted := log(pmax(1, fitted))]
Our first diagnostic graphs give us a quick visual comparison of the fitted values compared to the actuals.
<- function(dat, x="dev", y="acc", facet="model", actual, fitted, lims=c(0.25, 4),
GraphHeatMap xlab="Development quarter", ylab="Accident Quarter"){
# copy data to avoid modifying original
<- copy(dat)
localdat
# get fails if there is a variable with the same name so make local copies
<- x
local_x <- y
local_y <- actual
local_actual <- fitted
local_fitted
<- max(localdat[[y]])
np := get(local_actual) / get(local_fitted)
localdat[, .avsf := log(pmax(min(lims), pmin(max(lims), .avsf)))
][, .avsf_restrict_log := np + 1 - get(local_y)]
][, .past_line
<- ggplot(data=localdat, aes_string(x=local_x, y=local_y)) +
g geom_tile(aes(fill = .avsf_restrict_log))+scale_y_reverse()+
facet_wrap(~get(facet), ncol=2)+
theme_classic()+
scale_fill_gradient2(name="log(Acc vs Fit)", low=mblue, mid="white", high=red, midpoint=0, space="Lab", na.value="grey50", guide="colourbar")+
labs(x=xlab, y=ylab)+
geom_line(aes_string(x=".past_line", y=local_y), colour=dgrey, size=2)+
theme(strip.text = element_text(size=8,colour="grey30"), strip.background = element_rect(colour="white", fill="white"))+
theme(axis.title.x = element_text(size=8), axis.text.x = element_text(size=7))+
theme(axis.title.y = element_text(size=8), axis.text.y = element_text(size=7))+
theme(element_line(linewidth=0.25, colour="grey30"))+
theme(legend.position="bottom", legend.direction = "horizontal", legend.text=element_text(size=6) , legend.title =element_text(size=8) )
NULL
invisible(list(data=localdat, graph=g))
}
<- GraphHeatMap(model_forecasts_long, x="dev", y="acc", facet="model", actual="pmts", fitted="fitted")
g $graph g
There are a couple of interesting observations here. We can see the random forest performs well on the history but not on the future. The XGBoost_500 looks good in general. The LASSO model performs steadily worse as we go further into the future. The XGBoost_25 also struggles to pick up the trend in the later development periods.
We can now look at a reserve predictions comparison.
# summarise the future payments and each model projection by accident quarter
# remember ml is a character vector of all the model names, which are columns of
# fitted values in model_forecasts
# get reserves and payments by accident quarter
os_acc <- model_forecasts_long[train_ind == FALSE, .(actual=sum(pmts), reserve=sum(fitted)), by=.(model, acc)]

# make full tidy data set by stacking actuals
os_acc <- rbind(os_acc[model=="regr.glm",][, reserve := actual][, model := "actual"],
                os_acc)

setkey(os_acc, model, acc)

# create a factor variable from model so we can order it in the plot as we wish
g1 <- ggplot(data=os_acc, aes(x=acc, y=reserve, group=model)) +
  geom_line(aes(linetype=model, colour=model, size=model, alpha=model))+
  scale_colour_manual(name="", values=c(dgrey, red, dgrey, fuscia, dgrey, mblue, gold))+
  scale_linetype_manual(name="", values=c("solid", "solid", "dotted", "dotdash", "dashed", "longdash", "solid"))+
  scale_size_manual(name="", values=c(2.5, 2, 1, 1.25, 1, 1.25, 1.5))+
  scale_alpha_manual(name="", values=c(1, 0.9, 0.5, 0.7, 0.5, 0.7, 0.9))+
  coord_cartesian(ylim=c(0, 500000000), xlim=c(0, 40))+
  theme_classic() +
  theme(legend.position = "bottom")+
  labs(y="Reserve", x="Accident Quarter")+
  NULL

g1

os <- os_acc[, .(reserve=sum(reserve), actual=sum(actual)), by=.(model)]
os[, ratio := reserve / actual][, actual := NULL]

# sort by distance from the actual (ratio is a proportion, so measure distance from 1)
os[, ratio_diff := abs(ratio-1)]
os <- os[order(ratio_diff)][, ratio_diff := NULL]

# output table
setnames(os, c("model", "reserve", "ratio"), c("Model", "Reserve", "Ratio to actual(%)"))
datatable(os) |> formatRound("Reserve", digits = 0) |> formatPercentage("Ratio to actual(%)")
We can see again the XGBoost performing really well here. Interestingly, the GLM/chain ladder performs well in terms of the total reserve, but clearly there is some offsetting, as it predicts very low reserves for the later accident periods. The LASSO model and random forest under-predict the reserves in the later accident periods, showing they do not pick up the inflationary trend adequately. It is interesting to note the relatively low RMSE for the LASSO when looking at these results, which could reflect its strong performance in the earlier accident periods.
The decision tree forecast (rpart) performs surprisingly well on this data set. Based on anecdotal experience from this working party this is unusual.
The MLT blog also looks at a large number of other diagnostics which would be useful when assessing the model performance but I will let the reader do this independently.
Conclusion
I found working through these blogs and learning these techniques to be a tough but rewarding journey. Hopefully this blog will make the process easier for others looking to learn machine learning in reserving. However, this blog only covers the very beginnings of what the working party has produced and is researching. For example, there are blogs that cover this material in Python, if that is your preferred programming language. Furthermore, there are blogs working through neural networks, ML model diagnostics and practical considerations for insurance companies.
Appendix
I have included a summary of all the links in this blog for completeness:
Working party material
Other useful resources
- Actuarial cookbook
- SPLICE
- DataCamp introduction to R
- CAS monograph on stochastic reserving
- GLMs
- Basis functions and regularisation
- Decision trees
- Random forests
- xgboost
Finally, here are details of the R session as well as commented out code to install the various packages
#install.packages("SPLICE")
#install.packages("DT")
#install.packages("data.table") # manipulate the data
#install.packages("ggplot2") # plot the data
#install.packages("viridis") # plot colours that are friendly to colour blindness
#install.packages("patchwork") # easily combine plots
#install.packages("glmnet")
#install.packages("mlr3")
#install.packages("mlr3learners")
#install.packages("mlr3tuning")
#install.packages("paradox")
#install.packages("remotes")
#remotes::install_github("mlr-org/mlr3extralearners")
#install.packages("rpart.plot")
#install.packages("ranger")
#install.packages("xgboost")
sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_Ireland.utf8 LC_CTYPE=English_Ireland.utf8
[3] LC_MONETARY=English_Ireland.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Ireland.utf8
time zone: Europe/Dublin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rpart.plot_3.1.1 rpart_4.1.16
[3] mlr3extralearners_0.5.49-9000 mlr3tuning_0.14.0
[5] paradox_0.10.0 mlr3learners_0.5.4
[7] mlr3_0.14.0 glmnet_4.1-8
[9] Matrix_1.5-4.1 patchwork_1.1.3
[11] ggplot2_3.4.4 data.table_1.14.8
[13] DT_0.30 SPLICE_1.1.1
[15] SynthETIC_1.0.5
loaded via a namespace (and not attached):
[1] tidyselect_1.1.2 viridisLite_0.4.1 dplyr_1.0.10
[4] farver_2.1.1 fastmap_1.1.1 digest_0.6.33
[7] lifecycle_1.0.3 survival_3.5-5 magrittr_2.0.3
[10] compiler_4.3.1 rlang_1.1.1 sass_0.4.2
[13] tools_4.3.1 utf8_1.2.2 yaml_2.3.7
[16] knitr_1.40 labeling_0.4.2 htmlwidgets_1.5.4
[19] xgboost_1.6.0.1 withr_2.5.0 purrr_0.3.5
[22] mlr3misc_0.11.0 grid_4.3.1 fansi_1.0.3
[25] mlr3measures_0.5.0 colorspace_2.0-3 future_1.28.0
[28] globals_0.16.1 scales_1.2.1 iterators_1.0.14
[31] cli_3.6.1 rmarkdown_2.16 crayon_1.5.2
[34] bbotk_0.5.4 generics_0.1.3 rstudioapi_0.14
[37] future.apply_1.9.1 DBI_1.1.3 cachem_1.0.6
[40] stringr_1.4.1 splines_4.3.1 assertthat_0.2.1
[43] parallel_4.3.1 vctrs_0.6.4 jsonlite_1.8.2
[46] listenv_0.8.0 crosstalk_1.2.0 foreach_1.5.2
[49] lgr_0.4.4 jquerylib_0.1.4 glue_1.6.2
[52] parallelly_1.32.1 codetools_0.2-19 stringi_1.7.8
[55] shape_1.4.6 gtable_0.3.1 palmerpenguins_0.1.1
[58] munsell_0.5.0 tibble_3.1.8 pillar_1.8.1
[61] htmltools_0.5.5 R6_2.5.1 evaluate_0.16
[64] lattice_0.21-8 backports_1.4.1 renv_1.0.3
[67] bslib_0.4.0 Rcpp_1.0.9 uuid_1.1-0
[70] checkmate_2.1.0 ranger_0.14.1 xfun_0.33
[73] pkgconfig_2.0.3