ECE4524 SP21 - Prof. Jones – HW 5

Neural Network Architecture

Due Thursday, April 29, 2021 – 11:59 PM via Canvas


In this assignment you will be investigating the relative performance of a set of models, including artificial neural networks. To do this, you should do the following:

• Create a new Python project, and add the packages numpy, pandas, openpyxl and scikit-learn (the starter code imports the last of these as sklearn). If you are using PyCharm, you can find information on adding packages here:

https://www.jetbrains.com/help/pycharm/installing-uninstalling-and-upgrading-packages.html#interpreter-settings

• Load a dataset from the file named “pop-2020.xlsx”. This dataset contains the play-by-play results for every play in every NFL game of the 2020 season. Load it into a pandas DataFrame using the pandas package for Python.

• We are going to find the best architecture to use here, but the file has too much data for our purposes (the assignment takes too long to run). Randomly sample the full DataFrame to obtain a subset 20% of its size. This is the dataset that you will work with. A pandas DataFrame has methods that will help you with this. Use 42 as your random seed.
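The sampling step might look like the minimal sketch below. It uses the pandas DataFrame.sample method; the toy table and the names full_df and sample_df are made up for illustration and stand in for the real play-by-play data.

```python
import pandas as pd

# Hypothetical stand-in for the full play-by-play table.
full_df = pd.DataFrame({"Yards": range(100), "Down": [1, 2, 3, 4] * 25})

# Keep a random 20% of the rows; random_state=42 makes the draw repeatable.
sample_df = full_df.sample(frac=0.2, random_state=42)

print(full_df.shape, "->", sample_df.shape)
```

Passing the same random_state on every run returns the same rows, which keeps results comparable between runs.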

• The file has columns for the Quarter, Minute and Second in the game, but these are not well suited to modeling. Calculate a new feature column called “SecondsLeft” from these three columns, add it to the table and remove the Quarter, Minute and Second columns. Note: in a football game, the Quarter tells which of the 15-minute quarters we are in (1 to 4), and the Minute and Second tell how much time is left in the current quarter.
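One way to build this column is sketched below. It assumes SecondsLeft means the time remaining in regulation, that is, the remaining full quarters plus the time left in the current quarter. The toy rows are invented, and overtime plays (a Quarter value above 4, if the file has any) would need separate handling.

```python
import pandas as pd

SECONDS_PER_QUARTER = 15 * 60

# Hypothetical rows: start of game, mid second quarter, two late-game plays.
plays = pd.DataFrame({
    "Quarter": [1, 2, 4, 4],
    "Minute": [15, 7, 2, 0],
    "Second": [0, 30, 0, 5],
})

# Assumption: full quarters still to be played, plus time left in this quarter.
plays["SecondsLeft"] = (
    (4 - plays["Quarter"]) * SECONDS_PER_QUARTER
    + plays["Minute"] * 60
    + plays["Second"]
)

# The three source columns are no longer needed once SecondsLeft exists.
plays = plays.drop(columns=["Quarter", "Minute", "Second"])
print(plays)
```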

• I want you to predict the continuous variable “Yards”, which is the yards gained by each play. The predictor variables you should use are:

The categoricals: Formation, PlayType, PassType and RushDirection

The binaries: isRush and isPass

The numericals: SecondsLeft, Down, ToGo and YardLine

Pass these as lists of column-name strings to the preprocDataFrame function, which will process each kind of column properly.

• Normalize the columns in the data set to the range [-1, 1]; you can use the scikit-learn MinMaxScaler class for this (its feature_range parameter sets the output range).
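A minimal sketch of the scaling step, assuming a small all-numeric table; the column values here are invented. MinMaxScaler returns a NumPy array, so the sketch wraps the result back into a DataFrame to keep the column names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features.
features = pd.DataFrame({
    "Down": [1.0, 2.0, 3.0, 4.0],
    "ToGo": [10.0, 5.0, 1.0, 20.0],
})

# Map each column's minimum to -1 and its maximum to +1.
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(scaled)
```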

• Preprocess the DataFrame and split it into training and test partitions; I have given you a Python function to help with this (see the starter .py file I have supplied).

• Train a LinearRegression model on this data, and measure its performance. I have supplied a function to print some popular metrics for a regression model.

• Train a Decision Tree regression model on this data, and measure its performance in the same way.
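The two baseline models can be sketched as follows. This is an illustration on synthetic data, not the assignment solution: the 25% test fraction, the generated X and y, and the baseline_mse dictionary are all assumptions, and the sketch calls mean_squared_error directly so that it runs without the supplied starter file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy linear target on three scaled features.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Assumption: hold out 25% of the rows for testing.
trainX, testX, trainy, testy = train_test_split(
    X, y, test_size=0.25, random_state=42
)

baseline_mse = {}
for name, model in [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=42)),
]:
    model.fit(trainX, trainy)
    baseline_mse[name] = mean_squared_error(testy, model.predict(testX))
    print(f"{name}: test MSE = {baseline_mse[name]:.4f}")
```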

• For the major part of this assignment, you are to try a number of different artificial neural network (multilayer perceptron) regression models on the data. The process is as follows:

• For various numbers of hidden nodes in the first hidden layer:

    o For various numbers of hidden nodes in the second hidden layer:

        ▪ Do the following four times:

            • Train an MLPRegressor on the training data

                o Use the default relu activation function and a max_iter of 200

            • Measure its MSE on both training and test data

        ▪ Compute the average MSE values for the four trials

        ▪ Print this information along with the numbers of nodes used

• This will take a while! Mine took around an hour and a half.
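The nested search described above can be sketched as below. The layer sizes tried, the synthetic data and the results list are placeholders chosen so the example runs in seconds; the real run uses the play-by-play data and whatever node counts you choose. No random_state is passed to MLPRegressor, so each of the four trials starts from a different random initialization.

```python
import itertools
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic stand-in data with a mildly nonlinear target.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(120, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=120)
trainX, testX, trainy, testy = train_test_split(
    X, y, test_size=0.25, random_state=42
)

layer1_sizes = [2, 4]   # hypothetical candidate sizes, first hidden layer
layer2_sizes = [2, 4]   # hypothetical candidate sizes, second hidden layer
N_TRIALS = 4

results = []  # one (n1, n2, mean train MSE, mean test MSE) tuple per architecture
for n1, n2 in itertools.product(layer1_sizes, layer2_sizes):
    train_mses, test_mses = [], []
    for _ in range(N_TRIALS):
        model = MLPRegressor(
            hidden_layer_sizes=(n1, n2), activation="relu", max_iter=200
        )
        model.fit(trainX, trainy)
        train_mses.append(mean_squared_error(trainy, model.predict(trainX)))
        test_mses.append(mean_squared_error(testy, model.predict(testX)))
    results.append((n1, n2, np.mean(train_mses), np.mean(test_mses)))
    print(f"({n1}, {n2}): train MSE = {results[-1][2]:.4f}, "
          f"test MSE = {results[-1][3]:.4f}")
```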

• Once it’s done, present your results in a table of some sort, and identify the model architectures that gave:

The best performance on training data

The best performance on test data

The largest and smallest generalization gaps
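Picking these architectures out of a results table could be sketched as follows. The numbers are placeholders for illustration only, not real results, and the sketch assumes the generalization gap is defined as test MSE minus training MSE.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT real results.
results = pd.DataFrame({
    "layer1": [2, 2, 4, 4],
    "layer2": [2, 4, 2, 4],
    "train_mse": [50.0, 48.0, 45.0, 40.0],
    "test_mse": [52.0, 51.0, 49.0, 50.0],
})

# Assumption: generalization gap = test MSE - training MSE.
results["gap"] = results["test_mse"] - results["train_mse"]

best_train = results.loc[results["train_mse"].idxmin()]
best_test = results.loc[results["test_mse"].idxmin()]
largest_gap = results.loc[results["gap"].idxmax()]
smallest_gap = results.loc[results["gap"].idxmin()]

print(results.to_string(index=False))
print("best on training data:", int(best_train["layer1"]), int(best_train["layer2"]))
print("best on test data:", int(best_test["layer1"]), int(best_test["layer2"]))
```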

• Finally, choose the model architecture that gave you the best performance on training data, and train an ANN with that architecture on the full dataset from the “pop-2020.xlsx” file (don’t subsample as you did above). Report the performance of this model.

When you are done, write a couple of paragraphs presenting your results. Be sure to explain:

• Which model(s) are the best (don’t forget Occam’s razor)

• How much model size affects performance

• Whether the linear, decision tree or final neural network models give better performance

• How much time your code took to run

I need to see the model performance results for your linear and decision tree models, the tables of results for your ANN testing, and the ANN model performance results for your chosen architecture. Add to your Word or PDF document all of the code for your main program (no need to include the code that I supplied to you); as always, no dark mode or screenshots. Put all of this into a single Word or PDF file and submit it via Canvas. Also attach your Python source file. Please submit your files separately, not in a zip archive.

NOTE: if you use Jupyter, you MUST download and attach the .py files – do NOT submit an .ipynb file!


# ECE4524 SP21 HW5
# Created by Creed Jones on April 4, 2021

import numpy as np
import pandas as pd
import DQR
import sklearn.linear_model as linmod
import sklearn.preprocessing as preproc
import sklearn.model_selection as modelsel
import sklearn.metrics as metrics
import sklearn.neural_network as nnet
import sklearn.tree as tree
import warnings
import time
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
pd.options.mode.chained_assignment = None  # default='warn'


def dataFrameFromFile(filename):
    df = pd.read_excel(filename)        # read an Excel spreadsheet
    print('File ', filename, ' is of size ', df.shape)
    print(df.dtypes)

    # fix any datetime64 columns
    for label, content in df.items():
        if pd.api.types.is_datetime64_ns_dtype(content.dtype):
            df[label] = df[label].dt.strftime('%Y-%m-%d').tolist()

    dqreport = DQR.DQR(df)
    dqreport.writexlsx(filename)
    return df


def create_unknowns(dframe, categorical_column):
    # replace missing values in a categorical column with the label 'Unknown'
    newcol = dframe[categorical_column].replace({np.nan: 'Unknown'})
    dframe[categorical_column] = newcol
    return newcol


def preprocDataFrame(df, IDLabels, TargetLabels, Categoricals, Binaries, Numericals, StratifyColumn):
    # start preparing the dataset - missing values, categoricals, etc.
    # mark the excluded play types as missing so that dropna() below removes those rows
    # (the original call discarded the result of replace(), so it had no effect)
    df['PlayType'] = df['PlayType'].replace({'EXTRA POINT': np.nan, 'TIMEOUT': np.nan,
                                             'TWO-POINT CONVERSION': np.nan, 'NO PLAY': np.nan})
    df.dropna(subset=(IDLabels + TargetLabels + ['Formation', 'PlayType']), inplace=True)
    newdf = df[IDLabels + Binaries + Numericals + StratifyColumn].copy()

    # one-hot encoding of categoricals
    for label in Categoricals:
        catcolumns = pd.get_dummies(create_unknowns(df, label), prefix=label)
        newdf[catcolumns.columns] = catcolumns

    dqreport = DQR.DQR(newdf)
    dqreport.writexlsx('C:/Data/NFL/PROCESSED.xlsx')

    target = df[TargetLabels]
    return newdf, target


def printRegressionModelStats(model, testX, testy, names, targetName, outputPrefix):
    Ypred = model.predict(testX)
    (ntest, npred) = testX.shape
    r2 = metrics.r2_score(testy, Ypred)
    adjr2 = 1 - ((ntest-1)/(ntest - npred - 1))*(1-r2)
    mse = metrics.mean_squared_error(testy, Ypred)
    print("\n\r%s: R2 = %f, adjR2 = %f, MSE = %f" % (outputPrefix, r2, adjr2, mse))
    if hasattr(model, 'coef_'):
        coeff = model.coef_.ravel()
        print(targetName, " = %6.4f" % model.intercept_, end="", flush=True)
        for featCount in range(len(names)):
            print(" + %6.4f*%s" % (coeff[featCount], names[featCount]), end="", flush=True)


# YOUR CODE GOES HERE