ECMM444 Fundamentals of Data Science Course Assessment 2
This course assessment (CA2) represents 60% of the overall module assessment.
Submission deadline: 8 January 2024, 12 noon
Aim: Show understanding of linear algebra methods for data analysis with pandas and numpy.
This is an individual exercise and your attention is drawn to the University guidelines on collaboration and plagiarism, which are available from the university website.
Notes on how to use the notebook:
1. do not change the name of this notebook, i.e. the notebook file has to be: CA2.ipynb.
2. do not add your name or student code in the notebook or in the file name (it must be an anonymous submission).
3. do not remove or delete or add any cell in this notebook: you should work on a separate, private notebook, and only when you are finished debugging copy the function implementations into the cells of this notebook. Make sure to copy only the function implementation and nothing else.
4. remove the raise NotImplementedError() under the #YOUR CODE HERE and replace it with your code: note that if you leave this command in the cell you will fail the associated test.
Submission:
· to access this notebook you have downloaded the archive ecmm444_ca2.zip, and unzipped it to a folder ecmm444_ca2
· the folder ecmm444_ca2 contains some images (.png), a notebook (.ipynb) and some other files for the datasets
· to submit your completed Jupyter notebook, save it in the folder ecmm444_ca2 without changing the filename, i.e. the notebook has to have the file name CA2.ipynb
· create a .zip archive (not any other compression format, only .zip) of the folder ecmm444_ca2 with your updated notebook
· submit a single file, the zipped archive, using the ELE submission system
Evaluation criteria:
Each question asks for one or more functions to be implemented.
· Each function is awarded a number of marks.
· Hidden unit tests will be used to evaluate if desired properties of the required function are met.
· If you make a typing error (e.g. misspelling a variable), this will likely cause a syntax error, and the function will fail the hidden unit tests.
· The coding style (including clarity, conciseness, appropriate use of commands and data structures, efficiency, good programming practices) will also be taken into consideration to award full marks.
· Note that functions may be tested in the unit tests on some randomly generated input.
· Notebooks not conforming to the required format (see notes on how to use the notebook) will be penalised.
Notes:
Students are expected to do some autonomous readings and research to familiarise themselves with the topics of the exercises.
Students are not allowed to import additional external libraries unless explicitly stated in the question.
Do not assume that the implementations provided in the Workshops exercises do not contain mistakes. You should write and are ultimately responsible for the code that you submit in this assessment.
Questions are not strict software specifications. Students are expected to use their knowledge of the subject to interpret correctly the meaning of the questions.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Part 1
Aim: Show competence in using the numpy library, and understanding of principal component analysis and the singular value decomposition.
Overview of the questions:
Questions 1.1-1.4 are about the construction of the dataset.
Questions 1.5-1.7 are about principal component analysis.
Questions 1.8-1.9 are about the rank r approximation.
Question 1.1 [marks 5]
Create a function create_rot_mat(A) that takes a non-singular, square array A as input and outputs a rotation matrix (array) with the same shape as the input. The function should first apply the Gram-Schmidt process to the columns in A. Following this, the sign of the last column should be flipped if the determinant is negative. The function should raise an AssertionError if the input array is singular. The function should not change the original array A.
def create_rot_mat(A):
    #YOUR CODE HERE
    raise NotImplementedError()

#This cell is reserved for the unit tests. Do not consider this cell.
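The Gram-Schmidt step described in Question 1.1 can be illustrated with a minimal sketch. The helper name gram_schmidt_rotation is hypothetical, and detecting singularity with a matrix-rank check is an assumption, since the question does not say how singularity should be tested.

```python
import numpy as np

def gram_schmidt_rotation(A):
    """Hypothetical helper: orthonormalise the columns of A, then fix the sign."""
    A = np.array(A, dtype=float)   # np.array always copies, so the caller's array is untouched
    assert A.ndim == 2 and A.shape[0] == A.shape[1], "input must be square"
    # Assumption: singularity is detected with a rank check.
    assert np.linalg.matrix_rank(A) == A.shape[0], "input must be non-singular"

    Q = np.zeros_like(A)
    for j in range(A.shape[1]):
        v = A[:, j].copy()
        # Subtract the projections onto the columns already orthonormalised.
        v -= Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = v / np.linalg.norm(v)

    if np.linalg.det(Q) < 0:
        Q[:, -1] *= -1             # flip the last column so that det(Q) = +1
    return Q

A = np.array([[1, 0], [-1, 1]])
R = gram_schmidt_rotation(A)
print(np.allclose(R.T @ R, np.eye(2)), np.isclose(np.linalg.det(R), 1.0))
```

The two printed checks (orthonormal columns and determinant +1) are the properties that make the result a rotation matrix.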
Question 1.2 [marks 5]
Create a function means = create_means(R, k) that takes as input an n x n rotation matrix R and an integer k. The output will be an n x k array containing the coordinates of k means (of the data to be generated later). The first column (mean coordinates) in the output array should be a unit vector with means[0,0] = 1. The second column should be equal to the first column multiplied from the left by the rotation matrix R, the third column should be equal to the first column multiplied twice from the left by the rotation matrix R, the fourth column should be equal to the first column multiplied thrice from the left by the rotation matrix R, etc. The function should raise an AssertionError if the array R is not a rotation matrix.
def create_means(R, k):
    #YOUR CODE HERE
    raise NotImplementedError()

#This cell is reserved for the unit tests. Do not consider this cell.
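The construction in Question 1.2 amounts to applying R repeatedly to the first standard basis vector. The following is a small sketch under stated assumptions: the helper name rotated_unit_vectors is hypothetical, and "rotation matrix" is checked here as orthogonal with determinant +1.

```python
import numpy as np

def rotated_unit_vectors(R, k):
    """Hypothetical helper: columns are e1, R e1, R^2 e1, ..., R^(k-1) e1."""
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    # Assumption: a rotation matrix is square, orthogonal, with determinant +1.
    assert R.shape == (n, n)
    assert np.allclose(R.T @ R, np.eye(n)) and np.isclose(np.linalg.det(R), 1.0)

    out = np.zeros((n, k))
    out[0, 0] = 1.0                      # first column is the unit vector e1
    for j in range(1, k):
        out[:, j] = R @ out[:, j - 1]    # each column is the previous one rotated by R
    return out

# Example: a 45-degree rotation in the plane places 8 means evenly on the unit circle.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
means = rotated_unit_vectors(R, 8)
print(means.shape)
```

Because a rotation preserves length, every column of the result stays a unit vector.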
Question 1.3 [marks 3]
Create a function create_PSD_matrix(R, eigenvalues) that takes as input an n x n rotation matrix R and a 1 x n dimensional array eigenvalues of positive numbers. The function should output a positive definite matrix with the eigenvectors specified by the columns in R and the associated eigenvalues given by the values in eigenvalues. The function should raise an AssertionError if the array R is not a rotation matrix or if any of the eigenvalues are not positive.
def create_PSD_matrix(R, eigenvalues):
    #YOUR CODE HERE
    raise NotImplementedError()

#This cell is reserved for the unit tests. Do not consider this cell.
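A matrix with prescribed orthonormal eigenvectors (the columns of R) and eigenvalues can be assembled as R diag(eigenvalues) R^T. The sketch below illustrates this identity; the helper name matrix_from_eigendecomposition is hypothetical, and the rotation check (orthogonal, determinant +1) is an assumption about what the question intends.

```python
import numpy as np

def matrix_from_eigendecomposition(R, eigenvalues):
    """Hypothetical helper: build R @ diag(eigenvalues) @ R.T."""
    R = np.asarray(R, dtype=float)
    lam = np.ravel(eigenvalues).astype(float)   # accepts shape (n,) or (1, n)
    n = R.shape[0]
    # Assumption: a rotation matrix is square, orthogonal, with determinant +1.
    assert R.shape == (n, n) and lam.size == n
    assert np.allclose(R.T @ R, np.eye(n)) and np.isclose(np.linalg.det(R), 1.0)
    assert np.all(lam > 0), "all eigenvalues must be positive"
    return R @ np.diag(lam) @ R.T

theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
M = matrix_from_eigendecomposition(R, [0.5, 2.0])
# Each column of R is an eigenvector of M with the matching eigenvalue.
print(np.allclose(M @ R[:, 0], 0.5 * R[:, 0]), np.allclose(M @ R[:, 1], 2.0 * R[:, 1]))
```

Positive eigenvalues together with orthonormal eigenvectors guarantee that the result is symmetric and positive definite.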
Question 1.4 [marks 4]
Create a function X, targets = make_data(means, cov, m) that outputs a data matrix X and a one-dimensional class vector targets. The function takes as input an n x k array means, where each column in means represents the mean vector for each class, an n x n array cov that specifies the covariance for all classes, and an integer m. Generate the same number of instances for each class for a total of m instances (assume that m is an exact multiple of k).
The output X should be an m x n array. The vector targets should contain a class indicator for each instance (i.e. an integer between 0 and k-1 indicating the class the corresponding row in X belongs to). All data should be simulated from a multivariate normal distribution.
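When executing the following code, the generated data are displayed as a scatter plot (the figure is not reproduced here):
A = np.array([[1, 0], [-1, 1]])
R = create_rot_mat(A)
k = 8
means = create_means(R, k)
A = np.array([[1, 2], [4, 5]])
R = create_rot_mat(A)
eigenvalues = 0.05 * np.array([0.05, 1.5])
cov = create_PSD_matrix(R, eigenvalues)
m = 800
data_matrix, targets = make_data(means, cov, m)
for i in range(k):
    plt.scatter(data_matrix[targets==i, 0], data_matrix[targets==i, 1])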
def make_data(means, cov, m):
    #YOUR CODE HERE
    raise NotImplementedError()

#This cell is reserved for the unit tests. Do not consider this cell.
#Just run the following code, do not modify it
A = np.array([[1, 0], [-1, 1]])
R = create_rot_mat(A)
k = 8
means = create_means(R, k)
A = np.array([[1, 2], [4, 5]])
R = create_rot_mat(A)
eigenvalues = 0.05 * np.array([0.05, 1.5])
cov = create_PSD_matrix(R, eigenvalues)
m = 800
data_matrix, targets = make_data(means, cov, m)
for i in range(k):
    plt.scatter(data_matrix[targets==i, 0], data_matrix[targets==i, 1])
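The sampling scheme of Question 1.4 (equal-sized classes drawn from normal distributions that share one covariance) can be sketched as follows. The helper name sample_classes and its optional rng argument are hypothetical additions for reproducibility; the question's own signature has no such argument.

```python
import numpy as np

def sample_classes(means, cov, m, rng=None):
    """Hypothetical helper: draw m // k points per class from N(means[:, c], cov)."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = means.shape
    per_class = m // k                   # m is assumed to be an exact multiple of k
    # One block of per_class rows for each class, stacked into an (m, n) array.
    X = np.vstack([rng.multivariate_normal(means[:, c], cov, size=per_class)
                   for c in range(k)])
    targets = np.repeat(np.arange(k), per_class)   # 0,...,0,1,...,1,...,k-1
    return X, targets

# Two well-separated classes in the plane with a small shared covariance.
means = np.array([[0.0, 10.0],
                  [0.0, 10.0]])
cov = 0.01 * np.eye(2)
X, targets = sample_classes(means, cov, 200, rng=np.random.default_rng(0))
print(X.shape, np.bincount(targets))
```

In this sketch the rows of X are grouped by class; whether the rows should be shuffled is not specified by the question.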
Question 1.5 (3 marks)
Write the function mu, cov = get_mean_cov(X) that takes an m x n data matrix X as input and returns the mean vector mu as a one-dimensional numpy vector of size n and the covariance matrix cov as a numpy matrix object of size n x n.
Provide your own implementation of the covariance. Do not use functions from the numpy library or any other library to directly compute the covariance matrix.
def get_mean_cov(X):
    #YOUR CODE HERE
    raise NotImplementedError()

#This cell is reserved for the unit tests. Do not consider this cell.
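The covariance in Question 1.5 can be computed by centring the data and forming a matrix product. The sketch below makes two assumptions that the question leaves open: it uses the unbiased normaliser m - 1, and it returns a plain ndarray rather than an np.matrix. The helper name mean_and_covariance is hypothetical, and np.cov appears only as a sanity check on the result.

```python
import numpy as np

def mean_and_covariance(X):
    """Hypothetical helper: mean vector and covariance from centred data."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    mu = X.sum(axis=0) / m            # column means, shape (n,)
    Xc = X - mu                       # centre each column
    cov = Xc.T @ Xc / (m - 1)         # assumption: unbiased estimate (divide by m - 1)
    return mu, cov

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 7.0],
              [7.0, 9.0]])
mu, cov = mean_and_covariance(X)
# Sanity check only: the library routine should agree with the manual computation.
print(np.allclose(cov, np.cov(X, rowvar=False)))
```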
Question 1.6 (7 marks)
Write your own code to perform the PCA dimensionality reduction (i.e. do not use functions provided by the scikit library, such as sklearn.decomposition.PCA or any other library that computes the PCA directly).
Write a function PCA(X, threshold) that takes as input an m x n data matrix consisting of m vectors in n dimensions and a threshold between 0 and 1. The function should return the centred projection of X, using the minimal number of principal components needed to ensure that the explained variance of the PCA exceeds threshold.
def PCA(X, threshold):
    #YOUR CODE HERE
    raise NotImplementedError()

#This cell is reserved for the unit tests. Do not consider this cell.
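One way to choose the number of components in Question 1.6 is through the singular values of the centred data, whose squares are proportional to the variance along each principal direction. The sketch below takes that route as an assumption (an eigendecomposition of the covariance matrix would be an alternative), reads "centred projection" as the projection of the centred data, and uses the hypothetical helper name pca_project.

```python
import numpy as np

def pca_project(X, threshold):
    """Hypothetical helper: project centred X onto the fewest leading components
    whose cumulative explained-variance ratio is strictly above threshold."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # centre the data
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)       # cumulative explained-variance ratio
    # Index of the first entry strictly greater than threshold, plus one component.
    r = int(np.searchsorted(explained, threshold, side="right")) + 1
    r = min(r, len(s))                               # cap at the number of available components
    return Xc @ Vt[:r].T

# Variance 8 along the first axis and 2 along the second: ratios are 0.8 and 1.0.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(pca_project(X, 0.5).shape, pca_project(X, 0.9).shape)
```

With a threshold of 0.5 the first component alone suffices, while 0.9 needs both, so the two projections have one and two columns respectively.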
2024-01-05