
Assignment 1

Big Data and Machine Learning for Economics and Finance

Provide your answers in a document generated by RMarkdown. For each answer, provide the R code, the R output and your comments on the output. Comment each line of your R code as well.  Give thorough explanations throughout.

Exercise 1. (10 points) For this exercise, the only extra package allowed is ISLR2. The dataset Default will be used throughout the exercise and is accessible through the ISLR2 package.
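As a common starting point for all parts, the package and data can be loaded and inspected roughly as follows (a minimal sketch; the exploratory steps shown are only suggestions):

```r
# Load the ISLR2 package, which contains the Default dataset
library(ISLR2)

# Structure of the data: default and student are factors (Yes/No),
# balance and income are numeric
str(Default)

# Numerical summary of all four variables
summary(Default)
```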

I.  Consider the following figure constructed from the dataset Default.

[Figure 1]

a) Write the R code to reproduce that plot.

b) What is the conditioning variable in that plot? Give a thorough interpretation.

II. Consider another figure constructed from the same dataset.

[Figure 2]

a) Write the R code to reproduce that plot.

b) Carry out a regression exercise where you are attempting to predict balance given only the variable default. (A starting-point sketch for items 1–3 is given after the list below.)

1. Write the R code to train that model.

2. Modify the plot on figure 2 to add the predicted regression line.

3.  Give predictions of balance for all possible values of default. Show how to do the calculations directly in R and by using the regression output.
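The sketch below addresses items 1–3. It assumes Figure 2 plots balance against default, so the plotting commands are indicative only and should be adapted to the actual figure.

```r
# 1. Train a linear model that predicts balance from the factor default
fit <- lm(balance ~ default, data = Default)
summary(fit)  # intercept = mean balance for "No";
              # slope = extra mean balance for "Yes"

# 2. Re-draw a balance-vs-default plot and overlay the fitted line
#    (assumes default is on the x-axis, coded 1 = "No", 2 = "Yes")
plot(as.numeric(Default$default), Default$balance,
     xlab = "default (1 = No, 2 = Yes)", ylab = "balance")
abline(a = coef(fit)[1] - coef(fit)[2], b = coef(fit)[2])  # line through the two group means

# 3. Predictions of balance for the two possible values of default
predict(fit, newdata = data.frame(default = c("No", "Yes")))

# The same numbers computed by hand from the regression output
coef(fit)[1]                  # prediction when default = "No"
coef(fit)[1] + coef(fit)[2]   # prediction when default = "Yes"
```

Because default has only two levels, the fitted line simply connects the two group means, so the prediction for each level equals the average balance within that level.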

III. Consider another figure from the same dataset.

[Figure 3]

a) What are the differences between this plot and the previous one?

b) Would you obtain the same regression results as with the previous figure? Illustrate everything with R code and conceptual justifications if necessary.
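For part b), one way to check this in R is sketched below. It rests on the assumption that the third figure differs from Figure 2 only in how default is encoded (as a numeric 0/1 dummy rather than a factor); the variable default01 is a hypothetical recoding introduced here purely for illustration.

```r
# Hypothetical recoding of default as a numeric 0/1 dummy (assumption)
Default$default01 <- as.numeric(Default$default == "Yes")

fit_factor  <- lm(balance ~ default,   data = Default)  # factor coding
fit_numeric <- lm(balance ~ default01, data = Default)  # 0/1 dummy coding

# With treatment contrasts, the factor is internally converted to the same
# 0/1 dummy, so the fitted values coincide
all.equal(unname(fitted(fit_factor)), unname(fitted(fit_numeric)))
```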

Exercise 2. (5 points) I asked ChatGPT the following question:

Question: My data are made up from observations of individual incomes and probabilities of defaults for those individuals computed by some expert. Does this supervised learning exercise fall into the category of classification?

ChatGPT gave the following long answer:

Yes, this supervised learning exercise falls into the category of classification.

In classification, the goal is to assign each observation to one of several predefined classes or categories based on its features. In your case, you have observations of incomes along with probabilities of defaults. The task is to predict whether an individual will default or not based on their income, which makes it a classification problem.

Here's why it's a classification problem:

1. Target Variable: The probabilities of defaults computed by the expert are essentially a measure of the likelihood of defaulting. You want to use these probabilities to classify individuals into two categories: default and non-default.

2. Classes: You have two classes: default and non-default. The goal is to determine which class an individual belongs to based on their income and the associated probability of default.

3. Supervised Learning: You have labeled data where each observation is associated with a known class (default or non-default). This makes it a supervised learning problem.

4. Predictive Task: The goal of the exercise is to predict the class label (default or non-default) of new, unseen observations based on their income and the computed probability of default.

Overall, you are using features (incomes and probabilities of defaults) to make predictions about categorical outcomes (default or non-default), which aligns with the principles of classification in supervised learning.

Find all the mistakes in ChatGPT's output (if any exist at all), and justify thoroughly.

Exercise 3. (5 points) For this exercise, the only extra package allowed is ISLR2. The dataset Boston will be used throughout the exercise and is accessible through the ISLR2 package.

We are interested in predicting medv given a single input variable chosen from among the three variables crim, rm and ptratio. We ran three regressions in R, each using one of these variables as the single input. Then we computed the average squared error between our predictions for medv and the actual observed values of medv over the whole sample. We obtained the following figure.

[Figure 4]

a) Write the R code to reproduce that plot. Explain the details of any calculations required. (A starting-point sketch of the calculations is given after part b) below.)

b) Based on this plot, does it make sense to say that the best model among the three is the one that uses rm as input? Or should it be the one that uses crim instead? Or something else?
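A minimal sketch of the calculations behind such a figure is given below, under the assumption that the plot displays the three in-sample average squared errors as a bar chart (the actual chart type may differ):

```r
library(ISLR2)  # provides the Boston dataset

# One simple regression of medv on each candidate input variable
fits <- list(
  crim    = lm(medv ~ crim,    data = Boston),
  rm      = lm(medv ~ rm,      data = Boston),
  ptratio = lm(medv ~ ptratio, data = Boston)
)

# Average squared error over the whole sample (in-sample MSE):
# mean squared difference between observed medv and fitted values
mse <- sapply(fits, function(f) mean((Boston$medv - fitted(f))^2))
mse

# One possible display: a bar chart of the three errors
barplot(mse, xlab = "input variable", ylab = "average squared error")
```

Note that these errors are computed on the same sample used to fit the models, which is worth keeping in mind when answering part b).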