1 Introduction

This is the first notebook of a series of three that outlines and elaborates upon code used to replicate the central scenario in the paper of Maximilien Baudry “NON-PARAMETRIC INDIVIDUAL CLAIM RESERVING IN INSURANCE”

https://www.institutdesactuaires.com/global/gene/link.php?doc_id=11747&fg=1
http://chaire-dami.fr/files/2016/10/Reserving-article.pdf

Below we step through the process to create a single simulated dataset, as set out in Section 5 of Baudry’s paper. Having stepped through the data creation process, at the end, we present the code in the form of a function that returns a simulated policy and claim dataset.

The second notebook details the process for creating a reserving database and the third outlines the process for creating reserves using machine learning.

Before we start we import a few pre-requisite R packages.

# Importing packages
library(data.table)
library(magrittr)
library(lubridate)
library(ggplot2)
library(cowplot)
library(repr)
library(kableExtra)

2 Create Policy Data set

We start by simulating number of policies sold by day and by policy coverage type.

2.1 Policy count by date

# polices sold between start 2016 to end 2017
dt_policydates <- data.table(date_UW = seq(as.Date("2016/1/1"), as.Date("2017/12/31"), "day"))

# number of policies per day follows Poisson process with mean 700 (approx 255,500 pols per annum)
dt_policydates[, ':='(policycount = rpois(.N,700),
                      date_lapse = date_UW %m+% years(1),
                      expodays = as.integer(date_UW %m+% years(1) - date_UW),
                      pol_prefix = year(date_UW)*10000 + month(date_UW)*100 + mday(date_UW))]

2.2 Policy covers by date

We then add policy coverage columns in proportions 25% Breakage, 45% Breakage and Oxidation and 30% Breakage, Oxidation and Theft.

# Add columns defining Policy Covers   
dt_policydates[, Cover_B := round(policycount * 0.25)]
dt_policydates[, Cover_BO := round(policycount * 0.45)]
dt_policydates[, Cover_BOT := policycount - Cover_B - Cover_BO]

kable(head(dt_policydates), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
date_UW policycount date_lapse expodays pol_prefix Cover_B Cover_BO Cover_BOT
2016-01-01 709 2017-01-01 366 20160101 177 319 213
2016-01-02 683 2017-01-02 366 20160102 171 307 205
2016-01-03 672 2017-01-03 366 20160103 168 302 202
2016-01-04 691 2017-01-04 366 20160104 173 311 207
2016-01-05 710 2017-01-05 366 20160105 178 320 212
2016-01-06 668 2017-01-06 366 20160106 167 301 200

2.3 Policy transaction file

We then create a policy transaction file containing 1 row per policy, with columns to indicate policy coverage details.

2.3.1 Policy date & number

The first step is to create a policy table with 1 row per policy and underwriting date.

# repeat rows for each policy by UW-Date
dt_policy <- dt_policydates[rep(1:.N, policycount),c("date_UW", "pol_prefix"), with = FALSE][,pol_seq:=1:.N, by=pol_prefix]

# Create a unique policy number 
dt_policy[, pol_number := as.character(pol_prefix * 10000 + pol_seq)]

kable(head(dt_policy), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
date_UW pol_prefix pol_seq pol_number
2016-01-01 20160101 1 201601010001
2016-01-01 20160101 2 201601010002
2016-01-01 20160101 3 201601010003
2016-01-01 20160101 4 201601010004
2016-01-01 20160101 5 201601010005
2016-01-01 20160101 6 201601010006

2.3.2 Policy coverage type

Then we add the policy coverage details appropriate to each row.

# set join keys
setkey(dt_policy,'date_UW')
setkey(dt_policydates,'date_UW')  

# remove pol_prefix before join
dt_policydates[, pol_prefix := NULL]  

# join cover from summary file (dt_policydates)
dt_policy <- dt_policy[dt_policydates]  

# now create Cover field for each policy row
dt_policy[,Cover := 'BO']
dt_policy[pol_seq <= policycount- Cover_BO,Cover := 'BOT']
dt_policy[pol_seq <= Cover_B,Cover := 'B']  

dt_policy[, Cover_B := as.factor(Cover)]  

# remove interim calculation fields
dt_policy[, ':='(pol_prefix = NULL,
                 policycount = NULL,
                 pol_seq = NULL,
                 Cover_B = NULL,
                 Cover_BOT = NULL,
                 Cover_BO = NULL)]

# check output
kable(head(dt_policy), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
date_UW pol_number date_lapse expodays Cover
2016-01-01 201601010001 2017-01-01 366 B
2016-01-01 201601010002 2017-01-01 366 B
2016-01-01 201601010003 2017-01-01 366 B
2016-01-01 201601010004 2017-01-01 366 B
2016-01-01 201601010005 2017-01-01 366 B
2016-01-01 201601010006 2017-01-01 366 B

2.3.3 Policy Brand, Price, Model features

Now further details can be added to the policy dataset; such as policy brand, model and price details.

# Add remaining policy details
dt_policy[, Brand := rep(rep(c(1,2,3,4), c(9,6,3,2)), length.out = .N)]
dt_policy[, Base_Price := rep(rep(c(600,550,300,150), c(9,6,3,2)), length.out = .N)]

# models types and model cost multipliers
for (eachBrand in unique(dt_policy$Brand)) {
  dt_policy[Brand == eachBrand, Model := rep(rep(c(3,2,1,0), c(10, 7, 2, 1)), length.out = .N)]
  dt_policy[Brand == eachBrand, Model_mult := rep(rep(c(1.15^3, 1.15^2, 1.15^1, 1.15^0), c(10, 7, 2, 1)), length.out = .N)]
}

dt_policy[, Price := ceiling (Base_Price * Model_mult)]

# check output
kable(head(dt_policy), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
date_UW pol_number date_lapse expodays Cover Brand Base_Price Model Model_mult Price
2016-01-01 201601010001 2017-01-01 366 B 1 600 3 1.520875 913
2016-01-01 201601010002 2017-01-01 366 B 1 600 3 1.520875 913
2016-01-01 201601010003 2017-01-01 366 B 1 600 3 1.520875 913
2016-01-01 201601010004 2017-01-01 366 B 1 600 3 1.520875 913
2016-01-01 201601010005 2017-01-01 366 B 1 600 3 1.520875 913
2016-01-01 201601010006 2017-01-01 366 B 1 600 3 1.520875 913

2.3.4 Tidy and save

The final step is to keep only columns of interest and save the resulting policy file.

# colums to keep
cols_policy <- c("pol_number",
                 "date_UW",
                 "date_lapse",
                 "Cover",
                 "Brand",
                 "Model",
                 "Price")

dt_policy <- dt_policy[, cols_policy, with = FALSE]

# check output
kable(head(dt_policy), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
pol_number date_UW date_lapse Cover Brand Model Price
201601010001 2016-01-01 2017-01-01 B 1 3 913
201601010002 2016-01-01 2017-01-01 B 1 3 913
201601010003 2016-01-01 2017-01-01 B 1 3 913
201601010004 2016-01-01 2017-01-01 B 1 3 913
201601010005 2016-01-01 2017-01-01 B 1 3 913
201601010006 2016-01-01 2017-01-01 B 1 3 913
save(dt_policy, file = "./dt_policy.rda")

3 Create claims file

3.1 Sample Claims from Policies

We now simulate claims arising from the policy coverages. Claim rates vary by policy coverage and type.

3.1.1 Breakage Claims

We start with breakages claims and sample from the policies data set to create claims.

# All policies have breakage cover
# claims uniformly sampled from policies
claim <- sample(nrow(dt_policy), size = floor(nrow(dt_policy) * 0.15))

# Claim severity multiplier sampled from beta distribution
dt_claim <- data.table(pol_number = dt_policy[claim, pol_number],
                       claim_type = 'B',
                       claim_count = 1,
                       claim_sev = rbeta(length(claim), 2,5))

# check output
kable(head(dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
pol_number claim_type claim_count claim_sev
201605200661 B 1 0.4072562
201610230512 B 1 0.0747138
201712250647 B 1 0.2870089
201712020699 B 1 0.1255599
201605210350 B 1 0.1755683
201705290189 B 1 0.2807473

3.1.2 Oxidation Claims

Oxidation claims follow a similar process, just with different incidence rates and severities.

# identify all policies with Oxidation cover
cov <- which(dt_policy$Cover != 'B')

# sample claims from policies with cover
claim <- sample(cov, size = floor(length(cov) * 0.05))

# add claims to table 
dt_claim <- rbind(dt_claim,
                  data.table(pol_number = dt_policy[claim, pol_number],
                             claim_type = 'O',
                             claim_count = 1,
                             claim_sev = rbeta(length(claim), 5,3)))

# check output
kable(head(dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
pol_number claim_type claim_count claim_sev
201605200661 B 1 0.4072562
201610230512 B 1 0.0747138
201712250647 B 1 0.2870089
201712020699 B 1 0.1255599
201605210350 B 1 0.1755683
201705290189 B 1 0.2807473

3.1.3 Theft Claims

In the original paper the distribution for Theft severity claims is stated to be Beta(alpha = 5, beta = 0.5).

# identify all policies with Theft cover
# for Theft claim frequency varies by Brand
# So need to consider each in turn...

for(myModel in 0:3) {

    cov <- which(dt_policy$Cover == 'BOT' & dt_policy$Model == myModel)
    claim <- sample(cov, size = floor(length(cov) * 0.05*(1 + myModel)))
  
    dt_claim <- rbind(dt_claim,
                      data.table(pol_number = dt_policy[claim, pol_number],
                                 claim_type = 'T',
                                 claim_count = 1,
                                 claim_sev = rbeta(length(claim), 5,.5)))
}

# check output
kable(head(dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
pol_number claim_type claim_count claim_sev
201605200661 B 1 0.4072562
201610230512 B 1 0.0747138
201712250647 B 1 0.2870089
201712020699 B 1 0.1255599
201605210350 B 1 0.1755683
201705290189 B 1 0.2807473
kable(tail(dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
pol_number claim_type claim_count claim_sev
201607300342 T 1 0.8668355
201604090169 T 1 0.9933972
201703070300 T 1 0.9226491
201602260241 T 1 0.8307499
201611130304 T 1 0.9994143
201702170212 T 1 0.9103530

3.2 Claim dates and costs

3.2.1 Policy UW_date and value

We now need to add details to claims, such as policy underwritng date and phone cost. These details come from the policy table.

# set join keys
setkey(dt_policy, pol_number)
setkey(dt_claim, pol_number)

#join Brand and Price from policy to claim
dt_claim[dt_policy,
         on = 'pol_number',
         ':='(date_UW = i.date_UW,
              Price = i.Price,
              Brand = i.Brand)]

# check output
kable(head(dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
pol_number claim_type claim_count claim_sev date_UW Price Brand
201601010005 B 1 0.6181657 2016-01-01 913 1
201601010008 B 1 0.0969650 2016-01-01 913 1
201601010009 B 1 0.4714690 2016-01-01 913 1
201601010024 B 1 0.2941159 2016-01-01 794 1
201601010053 B 1 0.4574479 2016-01-01 728 2
201601010057 B 1 0.4703381 2016-01-01 457 3

3.2.2 Add simulated Claim occrrence, reporting and payment delays

The occurrence delay is assumed uniform over policy exposure period. Reporting and payment delays are assumed to follow Beta distributions.

# use lubridate %m+% date addition operator 
dt_claim[, date_lapse := date_UW %m+% years(1)]
dt_claim[, expodays := as.integer(date_lapse - date_UW)]
dt_claim[, occ_delay_days := floor(expodays * runif(.N, 0,1))]

dt_claim[ ,delay_report := floor(365 * rbeta(.N, .4, 10))]  
dt_claim[ ,delay_pay := floor(10 + 40* rbeta(.N, 7,7))]  

dt_claim[, date_occur := date_UW %m+% days(occ_delay_days)]
dt_claim[, date_report := date_occur %m+% days(delay_report)]
dt_claim[, date_pay := date_report %m+% days(delay_pay)]

dt_claim[, claim_cost := round(Price * claim_sev)]

# check output
kable(head(dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
pol_number claim_type claim_count claim_sev date_UW Price Brand date_lapse expodays occ_delay_days delay_report delay_pay date_occur date_report date_pay claim_cost
201601010005 B 1 0.6181657 2016-01-01 913 1 2017-01-01 366 30 36 17 2016-01-31 2016-03-07 2016-03-24 564
201601010008 B 1 0.0969650 2016-01-01 913 1 2017-01-01 366 72 8 30 2016-03-13 2016-03-21 2016-04-20 89
201601010009 B 1 0.4714690 2016-01-01 913 1 2017-01-01 366 87 19 22 2016-03-28 2016-04-16 2016-05-08 430
201601010024 B 1 0.2941159 2016-01-01 794 1 2017-01-01 366 26 10 28 2016-01-27 2016-02-06 2016-03-05 234
201601010053 B 1 0.4574479 2016-01-01 728 2 2017-01-01 366 341 1 43 2016-12-07 2016-12-08 2017-01-20 333
201601010057 B 1 0.4703381 2016-01-01 457 3 2017-01-01 366 167 21 26 2016-06-16 2016-07-07 2016-08-02 215

3.2.3 Claim key and tidy

The final stage is to do some simple tidying and add a unique claim key. The file is then saved for future use.

Note that the original paper spoke of using a “competing hazards model” for simulating claims. I have taken this to mean that a policy can only give rise to one claim. Where the above process has simulated multiple claims against the same policy I keep only the first occurring claim.

# add a unique claimkey based upon occurence date

dt_claim[, clm_prefix := year(date_occur)*10000 + month(date_occur)*100 + mday(date_occur)]
dt_claim[, clm_seq := seq_len(.N), by = clm_prefix]
dt_claim[, clm_number := as.character(clm_prefix * 10000 + clm_seq)]


# keep only first claim against policy (competing hazards)
setkeyv(dt_claim, c("pol_number", "clm_prefix"))
dt_claim[, polclm_seq := seq_len(.N), by = .(pol_number)]
dt_claim <- dt_claim[polclm_seq == 1,]

# colums to keep
cols_claim <- c("clm_number",
                "pol_number",
                "claim_type",
                "claim_count",
                "claim_sev",
                "date_occur",
                "date_report",
                "date_pay",
                "claim_cost")

dt_claim <- dt_claim[, cols_claim, with = FALSE]

# check output
kable(head(dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
clm_number pol_number claim_type claim_count claim_sev date_occur date_report date_pay claim_cost
201601310001 201601010005 B 1 0.6181657 2016-01-31 2016-03-07 2016-03-24 564
201603130001 201601010008 B 1 0.0969650 2016-03-13 2016-03-21 2016-04-20 89
201603280001 201601010009 B 1 0.4714690 2016-03-28 2016-04-16 2016-05-08 430
201601270001 201601010024 B 1 0.2941159 2016-01-27 2016-02-06 2016-03-05 234
201612070001 201601010053 B 1 0.4574479 2016-12-07 2016-12-08 2017-01-20 333
201606160001 201601010057 B 1 0.4703381 2016-06-16 2016-07-07 2016-08-02 215
save(dt_claim, file = "./dt_claim.rda")

4 Checking exhibits

Baudry’s paper produced a number of summary exhibits from the simulated data. Let’s recreate them to get some comfort that we have correctly recreated the data.

You can see that the severity exhibit, Chart B, is inconsistent with that presented in the original paper. The cause of that difference is unclear at the time of writing. It’s likely to be because something nearer to a Beta(alpha = 5, beta = 0.05) has been used. However using that creates other discrepancies likely to be due to issues with the competing hazards implementation. For now we note the differences and continue with the data as created here.

dt_claim[, Days_reportdelay := as.numeric(difftime(date_report, date_occur, units="days"))]


hist(dt_claim[, Days_reportdelay],breaks = 50)

dt_claim[, Days_paymentdelay := as.numeric(difftime(date_pay, date_report, units="days"))]
hist(dt_claim[, Days_paymentdelay],breaks = 60)

The final set of exhibits are those on slide 44. The only difference of note here is in Chart B, the Claim Rate by phone brand.

Baudry’s exhibits show Brand 1 to have a 5% higher claim frequency than other Brands. From reading the paper I can’t see why we should expect that to be the case. Claim rate varies by phone Model but Model incidence doesn’t vary by Brand. Therefore I can’t see how the Chart B equivalent could be correct given the details in the paper.

I leave the code as is noting the difference but recognising that it will not affect the wider aims of the paper.

## Warning: Removed 1 row(s) containing missing values (geom_path).

5 Function to create policy and claim data

The above code can be wrapped into a function which returns a list containing the policy and claim datasets.

By calling the function with a seed you can simulate policy and claim datasets.

tmp <- simulate_central_scenario(1234)

kable(head(tmp$dt_claim), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
clm_number pol_number claim_type claim_count claim_sev date_occur date_report date_pay claim_cost
201606080001 201601010001 B 1 0.3337923 2016-06-08 2016-06-08 2016-07-21 305
201609150001 201601010014 B 1 0.3692034 2016-09-15 2016-09-15 2016-10-17 309
201609090001 201601010025 B 1 0.4496012 2016-09-09 2016-09-09 2016-10-07 357
201602190001 201601010027 B 1 0.4019731 2016-01-25 2016-02-19 2016-03-21 319
201605140001 201601010043 B 1 0.2146653 2016-05-14 2016-05-14 2016-06-15 196
201612110001 201601010045 B 1 0.2783313 2016-12-11 2016-12-11 2017-01-06 254

6 Download the code and try it yourself

To finish, if you wish to try out this code in your own local instance of R then we have made this code and the supporting files and folders available in a zip file here.

Download and extract the zip file to a local directory and then open the R project file Baudry_1.rproj in your local R software installation. In the root of the project folder you will see two files;

  1. Notebook_1_SimulateData_v1.Rmd - which is the source code used to recreate this notebook
  2. Notebook_1_SimulateData_v1.R - the equivalent code provided as an R script

Please note that, depending upon your R installation , you may have to install R libraries before you can run the code provided. R will warn you if you have missing dependencies and you can then install them from CRAN.