Assignment 4 (Advanced Regression Analysis)


Note: This assignment will count towards your final module mark.

Please submit your individual solution as one pdf file containing the required theoretical derivations, your R output (plots) and R script.


1. In what follows please use the motorcycle dataset (motor-dat.txt) available on Moodle under week 8.

(a) Define the nonparametric regression problem and implement the local constant estimator in R (for the motorcycle dataset) using the Epanechnikov kernel and a bandwidth of 10% of the covariate (time) range, then plot the estimated curve.

Recall that you have also obtained the local linear estimator in the practicals. Compare the local constant estimator with the local linear estimator. Which estimator do you think is better? Please write down your reason.

(b) Try different bandwidths, e.g. 6%, 25%, 50% and 100% of the range of the covariate, and plot the corresponding estimated curves. What is the impact of choosing different bandwidths on the estimator?


2. In what follows please use the labour dataset (labor.suppl(2).dat) available on Moodle under week 10.

(a) State the formula of the kernel-based density estimator and implement in R the kernel-based density estimator (for the labour dataset), using the Gaussian kernel with its corresponding optimal bandwidth. What is the optimal band-width value you obtained using the formula at the point above? Plot your estimator.

(b) Now also use other bandwidths, where denotes the optimal bandwidth at point (a), and plot each corre-sponding estimated density. What is the impact of choosing different band-widths on the density estimator?

Hint for both problems: overleaf you will find some incomplete R code to guide you through the estimation steps.


Incomplete R code for Question 1

dat<-read.table(file.choose(),header=T) #choose motor_dat.txt

y<-dat[,2] #response (head acceleration)

x<-dat[,1] #covariate (time)


plot(x,y, xlab="time", ylab="head acceleration")


min<-??? #min observed time

max<-??? #max observed time

n<-length(y) #number of observations


ngrid<-100

xgrid<-1:ngrid

d<-(max - min)/ngrid

for(i in 1:ngrid)

{

xgrid[i] = min + d * i #holds x values at which we will estimate unknown m

}


h<-??? #take bandwidth at 10% of the covariate range


mhat<-1:ngrid #this will hold estimated m[xgrid]: for now we only initialised it


for(i in 1:ngrid)#Estimating at every grid point#

{

B<-c(0,0)

for(l in 1:n)

{

cc<-??? #evaluate distance between observed time point and xgrid[i]

if(abs(cc) < h) #only use those time points in the neighbourhood of xgrid[i]

{

k<-??? #evaluate Epanechnikov kernel at cc/h: use (1-t^2)_+ kernel

B[1] <- B[1] + ???

B[2] <- B[2] + ???

}

}

mhat[i]<-B[1]/B[2]

}

lines(xgrid, mhat, type="l",col=2) #estimated curve


Incomplete R code for Question 2

dat<-read.table(file.choose()) #choose file labor.suppl(2).dat


x<-dat[,3] #observations on hourly earnings


X11() #crude idea of how the density function might look like

hist(x, xlab="hourly earnings", ylim=c(0, 0.13), freq=F, main=’’)


#work out the range of hourly earnings (x)

min<-???

max<-???

n<-??? #number of observations


#construct the equally spaced points within the range of x at which

#we shall estimate the density function

ngrid<-100

xgrid<-1:ngrid

d<-(max - min)/ngrid

for(i in 1:ngrid)

{

xgrid[i] = min + d * ???

}


### When the Gaussian kernel is used ###

h<-??? #compute corresponding optimal bandwidth

h


mhat<-rep(0,ngrid) #this will hold the estimates of the density function at xgrid

#for now we only initialised it

for(i in 1:ngrid)

{

mhat[i] = 0

for(l in 1:n)

{

cc<-??? #compute distance between each observation and xgrid[i]

k<-??? #evaluate Gaussian kernel (check formula in notes)

mhat[i] <- mhat[i]+???

}

mhat[i] <- mhat[i]/???

}

lines(xgrid, mhat, lty=3,col=2)