Data Mining

August 25, 2018

Overall 30 points (+5 bonus points)

This is a closed-book exam

There are 4 questions below. Note that (2e) and (4) are bonus questions. Good luck.

1. Cross-validation (10 points)

a. What is cross-validation? What is it used for in machine learning? And why is it important to use this technique? (2 points)

b. You find a paper that reports 99% accuracy predicting previous stock-market trends for 10 stocks. Their cross-validation method was as follows.

i. They aggregated daily stock-market prices for the 10 stocks over the past 20 years. A single sample was therefore one day’s prices for all 10 stocks.

ii. They randomly assigned samples to a training (80%) and testing (20%) set.

iii. They fit the model on the training set, and evaluated it on the testing set.

If you think that this method is fine, explain why. If you think it is problematic, explain what the authors could do to improve their method and how their results might change if they implemented your proposed method instead. (8 points)

2. Regression & regularization (10 points + 2 bonus points)

a. You are hired by the HR department of a large company to predict what pay to offer new employees based on their total years of experience and their previous salary. Which machine-learning model would you use to achieve this? Explain why you would use that model. (1 point)

b. What is the objective function for ordinary least-squares (OLS) regression? What quantity is being minimized in OLS? What is the OLS solution? (3 points)

c. For Ridge regression, what is the objective function and what are the effects of this regularization method on its parameter estimates? (3 points)

d. For LASSO, what is the objective function and what are the effects of this regularization method on its parameter estimates? (3 points)

e. Bonus: What phenomenon that we discussed in class might create a problem when applying the model you suggested above in (a) to predict the current salary from the number of years of experience and the previous salary? Explain how the problem might come about and suggest a solution (2 bonus points)

3. Prevno University decides to automate its graduate-school admission procedure. For the past 10 years it has collected the following information about its incoming students: undergraduate GPA; undergraduate institution name; age; gender; the mean grade of their personal statement across several independent human readers; the mean grade of their reference letters across several independent human readers; letter grades in the 3 most relevant courses; and a special-merit score from 0 (none) to 4 (various publications, awards, etc.). It also gathered these students’ final GPA at Prevno upon graduation, and now has data from 10,000 students. You are tasked with constructing a machine-learning model that outputs a score in the range of 1 (certain rejection) to 5 (certain acceptance) to help guide the admissions committee in its decision. (10 points)

a. Suggest an architecture of a neural network that would achieve this goal. How many and what kind of neurons would you choose for your input, output, and hidden layers? Explain why you think that architecture would be good. (5 points)

b. The administration decided that they want you to construct another model that would output a binary answer: accept or reject. What changes would you need to make in your neural network model? Explain your answer. (2 points)

c. You are instructed not to use neural networks in (a) and (b) above. Which machine-learning algorithms would you use instead in (a) and which in (b)? Explain your choices. How would the performance of your machine-learning algorithms compare to that of your suggested neural networks? Explain your answer. (3 points)

4. Bonus question: You have a small sample from a very noisy dataset. Examples might be EEG recordings containing a lot of movement artifacts, photos of faces taken from very far away with a low-resolution camera, or a speech dataset recorded using a crackling microphone. To what extent, if at all, would additional data collected in the same manner and from the same source improve your machine-learning model? Explain your answer and sketch a qualitative graph of the model’s predicted accuracy as a function of the number of samples (accuracy on the y axis and the number of samples on the x axis). (3 bonus points)