Assignment 2

Big Data and Machine Learning for Economics and Finance

Provide your answers in a PDF document that must be generated by R Markdown.

For each exercise, provide

● complete R code with detailed comments. (Those comments should explain each line of R instructions; please also explain the general idea behind each code block.)

● R output (tables, graphs, function outputs, etc.)

● your detailed comments on the output. (Each exercise specifies a minimal and a maximal number of words.)

The function set.seed() should never be used while answering the assignment questions.

Exercise 1. (5 points) For this exercise, the only extra packages allowed are boot and Ecdat. Consider the dataset Strike in the package Ecdat. The dataset contains a variable named duration that measures the length of some factory strikes (in number of days). An economist is interested in studying that duration variable. If n is the sample size and $X_i$ represents the duration of strike i, then the economist is interested in computing the quantity

n

n i

i=1

as this gives an estimate of a parameter a that could be used later as a building block for a theoretical economic model.

After computing $\hat{a}$ for this sample and assessing its accuracy through the standard error computed by drawing 1000 bootstrap samples, you feel unsatisfied with the fact that the standard error fluctuates each time you re-run the bootstrap algorithm.

Design a Monte Carlo experiment to assess the accuracy of the bootstrap standard errors. Comment on the results. (Your output comments should be between 50 and 200 words in length.)
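One possible way to set up such an experiment is to repeat the whole bootstrap procedure many times and look at how much the resulting standard errors vary. The sketch below is only an illustration under stated assumptions: the data are simulated stand-ins for the duration variable, the statistic `stat_fn` is a placeholder (the sample mean) rather than the quantity defined in the exercise, and the names `boot_se`, `B`, `M` and `mc_summary` are hypothetical. It uses base R resampling; the boot package could be used instead.

```r
# Placeholder data (assumption): simulated durations, not the Strike dataset.
x <- rexp(60, rate = 1 / 40)

# Placeholder statistic (assumption): replace with the quantity of interest.
stat_fn <- function(v) mean(v)

# One bootstrap standard error based on B resamples drawn with replacement.
boot_se <- function(v, B = 1000) {
  reps <- replicate(B, stat_fn(sample(v, size = length(v), replace = TRUE)))
  sd(reps)
}

# Monte Carlo experiment: recompute the bootstrap standard error M times.
M <- 200
se_draws <- replicate(M, boot_se(x, B = 1000))

# Summarise how much the standard error moves from one run to the next.
mc_summary <- c(mean   = mean(se_draws),
                sd     = sd(se_draws),
                rel_sd = sd(se_draws) / mean(se_draws))
print(mc_summary)
```

The spread of `se_draws` (for example its standard deviation relative to its mean) gives a rough measure of the Monte Carlo noise in a single bootstrap standard error; rerunning with a different `B` shows how that noise depends on the number of bootstrap samples.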

Exercise 2. (7 points) For this exercise, the only extra package allowed is ISLR2.

Consider the variables age and wage in the dataset Wage from ISLR2. We are interested in clustering the data into either 2 or 3 groups. Use k-means clustering and hierarchical clustering (with complete, average, single and centroid linkages) to generate 2 clusters in a first step and 3 clusters in a second step. For each clustering method and number of clusters, provide a two-dimensional figure that contains a scatter plot of the data, in which each observation is colored according to the cluster it belongs to. What are your impressions about the performance of the different clustering methods? Compare to the figures on page 3 in Lecture Slide 7. (Your output comments should be between 100 and 300 words in length.)
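The workflow described above could be sketched as follows. This is a minimal illustration, not a reference solution: the data frame `dat` is simulated as a stand-in for the age and wage columns of Wage, the variables are left on their raw scale (standardising them first is a design choice the exercise leaves open), and the helper `hc_fit` is a hypothetical name. Centroid linkage is applied to squared Euclidean distances here, which is how it is commonly used with `hclust`.

```r
# Placeholder data (assumption): simulated stand-ins for Wage$age and Wage$wage.
n <- 300
dat <- data.frame(age = runif(n, 18, 80))
dat$wage <- 60 + 3 * dat$age - 0.03 * dat$age^2 + rnorm(n, sd = 25)

k <- 2  # number of clusters; set to 3 for the second step

# k-means with several random starts to reduce dependence on initial centres.
km <- kmeans(dat, centers = k, nstart = 20)

# Hierarchical clustering with the four linkages, each cut into k groups.
d <- dist(dat)
linkages <- c("complete", "average", "single", "centroid")
hc_fit <- function(m) {
  dd <- if (m == "centroid") d^2 else d  # squared distances for centroid
  hclust(dd, method = m)
}
hc_labels <- lapply(linkages, function(m) cutree(hc_fit(m), k = k))
names(hc_labels) <- linkages

# One scatter plot per method, coloured by cluster label.
par(mfrow = c(2, 3))
plot(dat$age, dat$wage, col = km$cluster, pch = 20,
     main = "k-means", xlab = "age", ylab = "wage")
for (m in linkages) {
  plot(dat$age, dat$wage, col = hc_labels[[m]], pch = 20,
       main = m, xlab = "age", ylab = "wage")
}
```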

Exercise 3. (8 points) For this exercise, the only extra package allowed is ISLR2.

Consider the dataset Wage in ISLR2. We are interested in predicting wage given age.

1. We are interested in writing R code that computes the following function:

$$M(x) = D(x)\,\frac{15}{16}\,(1 - x^2)^2$$

Here $D(x) = 1$ if $-1 \le x \le 1$ and $0$ otherwise. Write an R function quart1 that takes a number u as an input argument and returns the value of the function M evaluated at u. Using the function quart1, plot M over the interval $[-1.5, 1.5]$.
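As a hedged illustration, taking M(x) to be (15/16)(1 - x^2)^2 on [-1, 1] and 0 elsewhere (the quartic kernel), a function of this kind might look like the sketch below. The vectorised form and the grid name `xs` are choices made for this example, not requirements of the exercise.

```r
# Quartic kernel: 15/16 * (1 - u^2)^2 on [-1, 1], and 0 outside that interval.
quart1 <- function(u) {
  inside <- as.numeric(u >= -1 & u <= 1)  # indicator D(u)
  inside * 15 / 16 * (1 - u^2)^2
}

# Evaluate on a fine grid over [-1.5, 1.5] and draw the curve.
xs <- seq(-1.5, 1.5, length.out = 301)
plot(xs, quart1(xs), type = "l", xlab = "x", ylab = "M(x)")
```

Because the indicator multiplies the polynomial, the curve is flat at zero outside [-1, 1] and reaches its maximum of 15/16 at x = 0.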

2. We are interested in writing R code that computes the following function:

$$Q(x; y) = \frac{1}{y}\, M\!\left(\frac{x}{y}\right)$$

Here M is the same function as in the previous question, x is an arbitrary number and y is a positive number.

Write an R function quart that takes a number u and a positive number b as input arguments and returns the value of the function Q evaluated at u and b.
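Reading Q as the rescaled kernel (1/y) M(x/y), a minimal sketch could be the following. The block redefines `quart1` so that it runs on its own, and the `stopifnot` input check is an addition for illustration only.

```r
# Quartic kernel M, repeated here so the block is self-contained.
quart1 <- function(u) as.numeric(abs(u) <= 1) * 15 / 16 * (1 - u^2)^2

# Rescaled kernel Q(u; b) = M(u / b) / b for a positive bandwidth b.
quart <- function(u, b) {
  stopifnot(b > 0)  # illustrative guard: b must be positive
  quart1(u / b) / b
}

quart(0, 2)  # height at the centre with b = 2
quart(3, 2)  # outside [-2, 2], so the value is 0
```

Dividing by b keeps the area under the curve equal to one while stretching the support from [-1, 1] to [-b, b].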

3. Given a dataset $(x_1, \ldots, x_i, \ldots, x_n)$ consisting of n observations of one variable X, we are interested in generating a new variable W with n observations: $(w_1, \ldots, w_i, \ldots, w_n)$. Here, the ith observation $w_i$ is given by

$$w_i = \frac{Q(x_i - z; y)}{\sum_{j=1}^{n} Q(x_j - z; y)}$$

Here Q is the same function as in the previous question, z is an arbitrary number and y is a positive number.

Write an R function that computes the n observations of the variable W given the n observations of the variable X and the two numbers z and y.

That function should be called wt and should take three input arguments: a vector xi, a number x and a positive number b. That function should return a vector of the same length as xi.
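The weights described above are kernel values normalised to sum to one. A possible sketch, with `quart1` and `quart` repeated so the block runs on its own, is given below; the example vector is made up for illustration.

```r
# Kernel helpers, repeated here so the block is self-contained.
quart1 <- function(u) as.numeric(abs(u) <= 1) * 15 / 16 * (1 - u^2)^2
quart  <- function(u, b) quart1(u / b) / b

# Normalised kernel weights of the observations xi around the point x.
wt <- function(xi, x, b) {
  k <- quart(xi - x, b)  # kernel value for each observation
  k / sum(k)             # rescale so that the weights sum to one
}

# Illustrative call: observations closer to x = 2 receive larger weights.
w <- wt(c(1, 2, 3, 10), x = 2, b = 2)
print(w)
```

One caveat of this sketch: if no observation lies strictly within distance b of x, every kernel value is zero and the division returns NaN, so very small bandwidths need some care.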

4. We are now ready to write an R function nwe that gives nonparametric regression predictions of wage given age.

The R function nwe takes four input arguments:

● A vector y containing the observations for the regression output variable Y

● A vector x containing the observations for the regression input variable X

● A positive number bwdth.

● A number newdata representing the point of the variable X at which you would like to compute a prediction for the variable Y.

The function nwe should output a single number that contains your prediction.

The function proceeds in the following way: given the arbitrary number newdata and the positive number bwdth, it constructs n weights $w_i$ using the function wt (with the inputs x, newdata and bwdth). Then your prediction at newdata is given by

$$\sum_{i=1}^{n} w_i y_i$$

where $y_i$ is the ith observation in the vector y.

Using 50 to 200 words, explain your understanding of how nwe would be successful at accomplishing predictions.
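The prediction rule in this question is a locally weighted average of the observed outputs. A minimal sketch, assuming the helper functions from the earlier questions (repeated here so the block is self-contained) and using made-up numbers for the example call, might be:

```r
# Helpers from the previous questions, repeated so the block runs on its own.
quart1 <- function(u) as.numeric(abs(u) <= 1) * 15 / 16 * (1 - u^2)^2
quart  <- function(u, b) quart1(u / b) / b
wt     <- function(xi, x, b) { k <- quart(xi - x, b); k / sum(k) }

# Kernel-weighted average of y, with weights centred at newdata.
nwe <- function(y, x, bwdth, newdata) {
  w <- wt(x, newdata, bwdth)  # weights of the observed x around newdata
  sum(w * y)                  # weighted average of the observed y
}

# Illustrative call on made-up data lying exactly on the line y = 2 * x.
x_demo <- 1:5
y_demo <- 2 * x_demo
p_demo <- nwe(y_demo, x_demo, bwdth = 2.5, newdata = 3)
print(p_demo)
```

Since the weights are non-negative and sum to one, the prediction always lies between the smallest and largest y values that receive positive weight.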

5. Using the data age and wage, predict wage over a grid of 100 points between age 18 and age 80. Plot the predicted wage using 3 different values of bwdth (2, 10 and 100 respectively). Choose the best value of bwdth among those 3 values by cross-validation. (Your output comments should be between 200 and 400 words in length.)
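One way the comparison could be organised is sketched below. Several things here are assumptions rather than part of the exercise: the data are simulated stand-ins for age and wage, leave-one-out is used as the cross-validation scheme (k-fold would be another option), and the names `loo_mse`, `cv_mse` and `best_bw` are hypothetical.

```r
# Helpers from the previous questions, repeated so the block runs on its own.
quart1 <- function(u) as.numeric(abs(u) <= 1) * 15 / 16 * (1 - u^2)^2
quart  <- function(u, b) quart1(u / b) / b
wt     <- function(xi, x, b) { k <- quart(xi - x, b); k / sum(k) }
nwe    <- function(y, x, bwdth, newdata) sum(wt(x, newdata, bwdth) * y)

# Placeholder data (assumption): simulated stand-ins for Wage$age and Wage$wage.
n <- 500
age  <- sample(18:80, n, replace = TRUE)
wage <- 60 + 3 * age - 0.03 * age^2 + rnorm(n, sd = 25)

# Predictions on a grid of 100 ages, one column per candidate bandwidth.
grid  <- seq(18, 80, length.out = 100)
bws   <- c(2, 10, 100)
preds <- sapply(bws, function(b) sapply(grid, function(g) nwe(wage, age, b, g)))

# Plot the three fitted curves on the same axes.
matplot(grid, preds, type = "l", lty = 1, col = 1:3,
        xlab = "age", ylab = "predicted wage")
legend("topright", legend = paste("bwdth =", bws), col = 1:3, lty = 1)

# Leave-one-out cross-validation: predict each point from all the others.
loo_mse <- function(b) {
  err <- sapply(seq_len(n), function(i) {
    wage[i] - nwe(wage[-i], age[-i], b, age[i])
  })
  mean(err^2)
}
cv_mse <- sapply(bws, loo_mse)
names(cv_mse) <- bws

# Bandwidth with the smallest cross-validated mean squared error.
best_bw <- bws[which.min(cv_mse)]
print(cv_mse)
print(best_bw)
```

With a small bandwidth, a held-out point may have no neighbour inside the kernel window, in which case its prediction is NaN; on real data this case would need to be checked before comparing the averages.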