Requirements for Part 1
In this part of the homework, you will project data for neural networks.
The file ElectionData.csv contains the fraction of votes by county earned by President Trump and Secretary Clinton in the 2016 US Presidential election, sorted by county FIPS code (FIPS stands for Federal Information Processing Standards; a FIPS code is simply a geographic identifier). It also includes a number of variables that may be useful to predict these fractions. Use the data to develop a neural network model for either candidate.
When constructing your model, please consider the following:
· Explore the data set with scatter plots and compute the correlation matrix.
· Split the entire dataset into a training and test set. The test set should contain a random sample of roughly 30% of the observations (to determine how many observations are in the data set, use the length() function on a column of interest, or nrow() on the data frame).
· Normalize the data before fitting the model.
· Neural Network structure (inputs to the neuralnet( ) function):
o Decide how many hidden layers to include in the network.
o Decide how many nodes will be in each hidden layer.
o Decide which activation function to use (e.g. set act.fct="logistic" or act.fct="tanh").
o Decide which independent variables to include.
· Fit the model using neuralnet( ).
· Plot the resulting network.
· For both the training set and the test set:
o Compute predictions using the fitted model. Be sure to scale the inputs and unscale the outputs.
o Compute the error (using MAE or an error metric of your choice).
· In the homework document, analyze the model. How well does your model predict the election results? Do you think it will generalize well to new data? What could be done to improve the model?
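The Part 1 steps above can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the data frame is synthetic, and the column names (income, college, trump_frac), the seed, and the single hidden layer of 3 nodes are all made up for illustration. Substitute the actual ElectionData.csv columns and your own network structure. The neuralnet fit is guarded so that the base R steps still run where the package is not installed.

```r
set.seed(1)

# Synthetic stand-in for ElectionData.csv (column names are hypothetical).
n <- 200
dat <- data.frame(income = runif(n, 20, 100), college = runif(n, 5, 60))
dat$trump_frac <- with(dat, 0.8 - 0.004 * income - 0.003 * college + rnorm(n, 0, 0.03))

# Roughly 30% of the rows go to the test set.
test_idx <- sample(nrow(dat), size = round(0.3 * nrow(dat)))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Min-max scaling with training-set ranges only, so no test information leaks in.
mins <- sapply(train, min)
maxs <- sapply(train, max)
scale_df <- function(df) as.data.frame(scale(df, center = mins, scale = maxs - mins))
unscale_y <- function(y) {
  y * (maxs[["trump_frac"]] - mins[["trump_frac"]]) + mins[["trump_frac"]]
}
mae <- function(actual, predicted) mean(abs(actual - predicted))

train_s <- scale_df(train)
test_s  <- scale_df(test)
inputs  <- c("income", "college")

if (requireNamespace("neuralnet", quietly = TRUE)) {
  fit <- neuralnet::neuralnet(trump_frac ~ income + college, data = train_s,
                              hidden = 3, act.fct = "logistic", linear.output = TRUE)
  # plot(fit)  # uncomment to draw the fitted network
  # Predict on scaled inputs, then map the outputs back to vote fractions.
  pred_train <- unscale_y(as.numeric(predict(fit, train_s[, inputs])))
  pred_test  <- unscale_y(as.numeric(predict(fit, test_s[, inputs])))
  cat("Train MAE:", mae(train$trump_frac, pred_train), "\n")
  cat("Test MAE: ", mae(test$trump_frac, pred_test), "\n")
}
```

Comparing the two MAE values gives a first indication of how well the model generalizes: a test error much larger than the training error suggests overfitting.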
Requirements for Part 2
In this part of the homework, you will work with a subset of a data set of home sales in DC (DC_PropertieResidentialunder1mill.csv), restricted to sales under 1 million dollars that occurred between 2015 and 2018. Your task is to develop a neural network model using the H2O and LIME packages in R to predict sales price from the independent variables available in the data set. To do this, the RStudio instance below contains a blank R script called Use_H2O_and_LIME.R for you to perform your work. It also contains several scripts and data sets from the module that you can use to write your script; use and modify as much code as you need to complete the assignment.
Many attributes in this dataset are categorical. For example, ASSESSMENT_NBHD (assessment neighborhood) will need to be converted from a neighborhood name to a numerical representation. Consider carefully how you might do this. There are also a lot of variables, and it will be easy to overfit your model. To this end, the h2o.deeplearning() function has an argument called l2 (L2 regularization). Investigate this setting and see if you find it valuable. There is also a function h2o.varimp_plot() that will create a bar chart of the relative importance of the inputs to your model. That might help you decide which variables to include in your final model.
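One possible way to convert a categorical column such as ASSESSMENT_NBHD is one-hot (dummy) encoding, sketched below in base R with model.matrix(). The neighborhood names, the prices, and the PRICE column name are made up for illustration. This is only one option: h2o can also accept factor columns directly and encode them itself (see the categorical_encoding argument of h2o.deeplearning()).

```r
# Hypothetical rows; the neighborhood names and prices are made up for illustration.
homes <- data.frame(
  ASSESSMENT_NBHD = factor(c("Brookland", "Capitol Hill", "Brookland", "Petworth")),
  PRICE = c(450000, 820000, 510000, 600000)
)

# One indicator column per neighborhood; "- 1" drops the intercept so no level is omitted.
nbhd_dummies <- model.matrix(~ ASSESSMENT_NBHD - 1, data = homes)
colnames(nbhd_dummies) <- make.names(levels(homes$ASSESSMENT_NBHD))

# Replace the text column with its numeric indicator columns.
homes_numeric <- cbind(as.data.frame(nbhd_dummies), PRICE = homes$PRICE)
print(homes_numeric)
```

Note that one-hot encoding adds one input per neighborhood, so a column with many levels increases the number of weights considerably, which is one reason regularization matters here.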
When constructing your model, also consider the following:
· Explore the data set with scatter plots and compute the correlation matrix.
· Split the entire dataset into a training and test set, and both should be converted into h2o data frames. The test set should contain a random sample of roughly 30% of the entire data.
· Select several data points from the training data set to analyze with LIME later on.
· Fit the neural network using h2o.deeplearning( ):
o Decide which independent variables to include.
o Decide how many hidden layers to include in the network.
o Decide how many nodes will be in each hidden layer.
o Decide which activation function to use (for reference, h2o supports the following: "Tanh", "TanhWithDropout", "Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout" ).
o Specify whether or not to use an adaptive learning rate (argument: adaptive_rate).
o L1 regularization can result in many weights being set to 0, add stability, and improve generalization. Set the l1 argument (larger values correspond to more regularization).
o L2 regularization can result in many weights being set to small values, add stability, and improve generalization. Set the l2 argument (larger values correspond to more regularization). Note: if the L1 coefficient is higher than the L2 coefficient, then the algorithm will favor L1 regularization and vice versa.
o Specify how many epochs the training algorithm should run for.
o Set the random seed argument (seed) to guarantee you get the same results each time you run the training algorithm.
· Plot the training and validation loss vs. the training epochs.
· Use the summary( ) function to examine the fitted model.
· Use the data points selected earlier to compute and analyze the neural network's predictions using the lime( ) function from the lime package.
o Visualize the results of this analysis.
· In the homework document, analyze the model and summarize your findings. How well does your model predict house price? Do you think it will generalize well to new data? Which variables ended up being most important? What could be done to improve the model?
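The Part 2 workflow above might look roughly like the following. This is a sketch under stated assumptions rather than a reference solution: the data are synthetic, the column names (GBA, LANDAREA, AGE, PRICE) are stand-ins for whichever variables you select, and the layer sizes, l1/l2 values, and epoch count are illustrative and untuned. The modelling part is guarded so that the data preparation still runs where h2o and lime are not installed.

```r
set.seed(42)

# Synthetic stand-in for the DC sales data (column names are assumptions).
n <- 300
sales <- data.frame(
  GBA = runif(n, 600, 3500),        # building area
  LANDAREA = runif(n, 500, 8000),   # lot size
  AGE = runif(n, 0, 120)            # hypothetical age of the home in years
)
sales$PRICE <- 150000 + 180 * sales$GBA + 20 * sales$LANDAREA -
  800 * sales$AGE + rnorm(n, 0, 30000)

predictors <- c("GBA", "LANDAREA", "AGE")
test_idx <- sample(nrow(sales), size = round(0.3 * nrow(sales)))
train_df <- sales[-test_idx, ]
test_df  <- sales[test_idx, ]
lime_points <- train_df[1:4, predictors]  # training rows set aside for LIME

# Guarded so the preparation above still runs without h2o and lime installed.
if (requireNamespace("h2o", quietly = TRUE) && requireNamespace("lime", quietly = TRUE)) {
  library(h2o)
  library(lime)
  h2o.init()
  train_h2o <- as.h2o(train_df)
  test_h2o  <- as.h2o(test_df)

  model <- h2o.deeplearning(
    x = predictors, y = "PRICE",
    training_frame = train_h2o, validation_frame = test_h2o,
    hidden = c(16, 8),              # two hidden layers
    activation = "Rectifier",
    adaptive_rate = TRUE,
    l1 = 1e-5, l2 = 1e-4,           # illustrative values, not tuned
    epochs = 50,
    variable_importances = TRUE,
    seed = 42, reproducible = TRUE
  )

  plot(model)                       # scoring history for training and validation data
  summary(model)
  h2o.varimp_plot(model)
  print(h2o.mae(h2o.performance(model, newdata = test_h2o)))

  # LIME: build an explainer on the training inputs, then explain the chosen rows.
  explainer <- lime(train_df[, predictors], model)
  explanation <- explain(lime_points, explainer, n_features = 3)
  print(plot_features(explanation))
}
```

Here the test set doubles as the validation frame only to keep the sketch short; holding out a separate validation split from the training data keeps the test error an honest estimate.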
The dataset in this assignment is available from Kaggle and is used in accordance with the Creative Commons license.
NOTE: Because of the amount of data in this assignment, you may experience delays while R runs the computations. If you encounter warnings that the H2O cluster node is behaving slowly, paste h2o.removeAll() into the Console and run it to free up as much memory as you can.
2024-11-29