MDS5130 Project

Due date: May 9, 2023

• Outstanding projects will be invited to give a presentation on April 26, 2023. Students who have given a presentation can receive a maximum of 10 bonus points in their final exam.

• Students who want to present their work need to submit the project by April 19, 2023. Submissions after April 19 will not be invited for presentation. All students can revise their work before May 9, 2023.

• The submitted code must be clearly written in an R file that outputs the MSE.

• A report describing your analysis is required.

1 Background

In this project, we will analyze a dataset about horse racing. We begin with a brief introduction to horse racing. In a particular race, there are 14 horses. Before a particular time $t_{\text{final}}$, people are allowed to bet on which horse will win. Let $b_i(t)$ be the total amount bet on horse $i$ at time $t$. Note that $b_i(t)$ is increasing before $t_{\text{final}}$. After the race, $b_i(t_{\text{final}})$ has been bet on horse $i$ for $i = 1, \ldots, 14$. If horse $I$ wins the race, people who bet on horse $I$ receive the dividend

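$$d_I^{(f)} = d_I(t_{\text{final}}) = \frac{(1 - \nabla) \sum_{j=1}^{n} b_j(t_{\text{final}})}{b_I(t_{\text{final}})}$$

for each dollar bet, where $\nabla = 0.175$ is the percentage track-take and $n$ is the number of horses (14 here). Note that the dividends

$$d_i(t) = \frac{(1 - \nabla) \sum_{j=1}^{n} b_j(t)}{b_i(t)}$$

for horse $i$, $i = 1, \ldots, 14$, are known to all bettors at time $t < t_{\text{final}}$. Because $b_i(t)$ varies with time, so does $d_i(t)$.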
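As a minimal sketch of the dividend formula, the R snippet below computes the dividend per dollar for every horse from the amounts in the pool, that is, (1 − take) times the total pool divided by the amount on that horse. The function name `dividends` and the bet amounts are made up for illustration; only the track-take of 0.175 comes from the project description. Returning 0 for a horse with no money on it is an assumed convention that mirrors the data's use of 0 for horses that did not run.

```r
# Hypothetical helper: dividend per dollar bet for each horse, given the
# amounts bet on every horse at some time t.
dividends <- function(bets, take = 0.175) {
  stopifnot(is.numeric(bets), all(bets >= 0))
  pool <- sum(bets)                 # total win pool at time t
  d <- (1 - take) * pool / bets     # d_i(t) = (1 - take) * sum_j b_j(t) / b_i(t)
  d[bets == 0] <- 0                 # assumed convention: 0 marks an absent horse
  d
}

# Illustrative (made-up) amounts bet on 14 horses
b <- c(500, 300, 200, 150, 120, 100, 90, 80, 70, 60, 50, 40, 30, 20)
d <- dividends(b)
round(d, 2)
```

One quick sanity check on any set of dividends computed this way: when every horse has money on it, the reciprocals `1 / d` sum to `1 / (1 - take)`.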

Now suppose we have some insider information and believe that we know the “true” winning probability $\pi_i$ of each horse $i$. A bet on horse $i$ is worthwhile only if its expected return per dollar, $\pi_i d_i^{(f)}$, exceeds 1, so one betting strategy is to bet on horse $i$ if $d_i^{(f)} > 1/\pi_i$. However, we do not know $d_i^{(f)}$ at the time we bet ($t_{\text{bet}}$). Let $b_i = b_i(t_{\text{bet}})$, $d_i = d_i(t_{\text{bet}})$, let $f_i W$ be the amount we bet on horse $i$ at $t_{\text{bet}}$, and let $C_i$ be the amount bet on horse $i$ by other parties after $t_{\text{bet}}$. Then

we have

$$d_i^{(f)} = \frac{(1 - \nabla) \sum_{j=1}^{n} (b_j + C_j + f_j W)}{b_i + C_i + f_i W}.$$

The unknown quantities here are $C_i$ for $i = 1, \ldots, 14$. In this project, your task is to estimate $C_{\text{sum}} = \sum_i C_i$ before $t_{\text{bet}}$.
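The following R sketch illustrates how late money changes the final dividends and why its total matters. It assumes the final amount on horse i is the pool at the betting time plus the late money from other parties plus our own bet, and it applies the rule of betting only when the final dividend exceeds one over the win probability. The function name `final_dividends` and all numbers are invented for illustration, and four horses are used instead of 14 for brevity.

```r
# Hypothetical helper: final dividends once late money is added.
# b    : amounts in the pool at t_bet
# C    : amounts bet by other parties after t_bet (unknown in practice)
# ours : our own bets, f_i * W
final_dividends <- function(b, C, ours, take = 0.175) {
  stopifnot(length(b) == length(C), length(b) == length(ours))
  total <- b + C + ours             # amount on each horse at t_final
  (1 - take) * sum(total) / total
}

b    <- c(400, 300, 200, 100)       # made-up pool at t_bet
C    <- c(100, 150,  50, 100)       # made-up late money
ours <- c(  0,   0,  10,   0)       # we bet 10 on horse 3
p    <- c(0.35, 0.30, 0.25, 0.10)   # made-up "true" win probabilities

d_final <- final_dividends(b, C, ours)
C_sum   <- sum(C)                   # the quantity the project asks you to forecast
bet_ok  <- d_final > 1 / p          # bet on horse i only if d_i^(f) > 1 / pi_i
```

In practice `C` is not observed at the betting time, which is why a forecast of its total is needed before the rule can be evaluated.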

2 Data

The datasets “data20XX.RData” with XX=14,15,16,17,18 are given. They all have the same set of column names, which are

ID: It is of the form “yyyymmddrr”, which means Year yyyy, Month mm, Date dd, Race rr. Note that there is more than one race on each day, and the number of races can differ from day to day.

WIN POOL.x: The total amount in the pool at time $t_{\text{bet}}$.

WIN POOL.y: The total amount in the pool at time $t_{\text{final}}$. Hence $C_{\text{sum}}$ is the difference between WIN POOL.y and WIN POOL.x.

WIN TAKE.x: The track-take, $\nabla = 0.175$. It is the same as WIN TAKE.y.

WIN ODDS i.x: $d_i = d_i(t_{\text{bet}})$. If it is 0, horse $i$ was not in the race.

WIN ODDS i.y: $d_i^{(f)} = d_i(t_{\text{final}})$. If it is 0, horse $i$ was not in the race.

WIN MODEL i.x: “True” winning probability $\pi_i$. If it is 0, horse $i$ was not in the race. It is the same as WIN MODEL i.y.

WIN TIME.y: The “yyyymmdd” part of ID.

WIN NUMBER.y: The “rr” part of ID.
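To make the column descriptions concrete, here is a hedged R sketch on a toy data frame. The values are made up, and the underscore-separated column names (`WIN_POOL.x`, `WIN_ODDS_1.x`) are an assumption about how the names appear in the files; check `names()` on the loaded object and adjust. The sketch builds the target as the difference of the two pool columns, splits the ID into date and race number, and inverts the dividend formula to approximate the amount on a horse at the betting time.

```r
# Toy data frame mimicking the layout described above (made-up values;
# column names with underscores are an assumption).
races <- data.frame(
  ID           = c("2018010101", "2018010102", "2018010403"),
  WIN_POOL.x   = c(1200000, 1500000, 900000),
  WIN_POOL.y   = c(2000000, 2600000, 1300000),
  WIN_ODDS_1.x = c(3.2, 0, 5.5),
  stringsAsFactors = FALSE
)

# Target: late money C_sum = pool at t_final minus pool at t_bet
races$C_sum <- races$WIN_POOL.y - races$WIN_POOL.x

# Split the "yyyymmddrr" ID into a date and a race number
races$date <- as.Date(substr(races$ID, 1, 8), format = "%Y%m%d")
races$race <- as.integer(substr(races$ID, 9, 10))

# Invert d_i = (1 - take) * pool / b_i to approximate the amount on horse 1
# at t_bet; odds of 0 mark a horse that did not run.
take <- 0.175
odds <- races$WIN_ODDS_1.x
races$b1 <- ifelse(odds > 0, (1 - take) * races$WIN_POOL.x / odds, 0)
```

If the odds in the data are rounded, the recovered amounts are only approximate. Note also that `C_sum` uses a column observed at the final time, so it can serve as the training target but never as a predictor.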

In this project, you are required to forecast $C_{\text{sum}}$ for each race in data2018.RData. Note that you MUST use only the information available BEFORE $t_{\text{bet}}$ to forecast the $C_{\text{sum}}$ of a particular race. Let $N$ be the total number of races in 2018, $R_{2018}$ be the set of all races in 2018, $x_r$ be the true $C_{\text{sum}}$ of race $r$, and $\hat{x}_r$ be your forecast.

Your goal is to minimize the mean absolute percentage error,

$$\text{MAPE} = \frac{1}{N} \sum_{r \in R_{2018}} \left| \frac{x_r - \hat{x}_r}{x_r} \right|.$$
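A minimal R sketch of the evaluation metric follows, assuming the usual definition of MAPE: the average over races of the absolute forecast error divided by the true value. The function name `mape` and the numbers are illustrative only.

```r
# Mean absolute percentage error between true values x and forecasts x_hat.
mape <- function(x, x_hat) {
  stopifnot(length(x) == length(x_hat), all(x != 0))
  mean(abs(x - x_hat) / abs(x))
}

x     <- c(800000, 1100000, 400000)   # made-up true C_sum values
x_hat <- c(720000, 1210000, 400000)   # made-up forecasts
mape(x, x_hat)
```

Because each error is scaled by the true value, a miss on a race with little late money is penalized as heavily as a proportionally equal miss on a race with a lot of it.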

Please note the following.

1. Your work will be evaluated on another dataset, “data2019.RData”, which has the same set of columns as the given datasets.

2. Only the given datasets and the information provided in the project can be used. Do not use any additional information in your analysis.