
ECMM444 Fundamentals of Data Science

Course Assessment 2

This course assessment (CA2) represents 60% of the overall module assessment.

Submission deadline: 8 January 2024, 12 noon

Aim: Show understanding of linear algebra methods for data analysis with pandas and numpy.

This is an individual exercise and your attention is drawn to the University guidelines on collaboration and plagiarism, which are available from the university website.

Notes on how to use the notebook:

1. Do not change the name of this notebook, i.e. the notebook file has to be: CA2.ipynb.

2. Do not add your name or student code in the notebook or in the file name (it must be an anonymous submission).

3. Do not remove, delete or add any cell in this notebook: you should work on a separate, private notebook, and only when you have finished debugging copy the function implementations into the cells of this notebook. Make sure to copy only the function implementation and nothing else.

4. Remove the raise NotImplementedError() under the # YOUR CODE HERE and replace it with your code: note that if you leave this command in the cell you will fail the associated test.

Submission:

· To access this notebook you have downloaded the archive ecmm444_ca2.zip and unzipped it to a folder ecmm444_ca2.

· The folder ecmm444_ca2 contains some images (.png), a notebook (.ipynb) and some other files for the datasets.

· To submit your completed Jupyter notebook, save it in the folder ecmm444_ca2 without changing the filename, i.e. the notebook has to have the file name CA2.ipynb.

· Create a .zip archive (not any other compression format, only .zip) of the folder ecmm444_ca2 with your updated notebook.

· Submit a single file, the zipped archive, using the ELE submission system.

Evaluation criteria:

Each question asks for one or more functions to be implemented.

· Each function is awarded a number of marks.

· Hidden unit tests will be used to evaluate whether the desired properties of the required function are met.

· If you make a typo (e.g. misspelling a variable), this will likely cause an error, and the function will fail the hidden unit tests.

· The coding style (including clarity, conciseness, appropriate use of commands and data structures, efficiency, good programming practices) will also be taken into consideration to award full marks.

· Note that functions may be tested in the unit tests on some randomly generated input.


· Notebooks not conforming to the required format (see notes on how to use the notebook) will be penalised.

Notes:

Students are expected to do some independent reading and research to familiarise themselves with the topics of the exercises.

Students are not allowed to import additional external libraries unless explicitly stated in the question.

Do not assume that the implementations provided in the Workshop exercises do not contain mistakes. You should write, and are ultimately responsible for, the code that you submit in this assessment.

Questions are not strict software specifications. Students are expected to use their knowledge of the subject to interpret correctly the meaning of questions.

%matplotlib inline

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

Part 1

Aim: Show competence in using the numpy library, and understanding of principal component analysis and the singular value decomposition.

Overview of the questions:

Questions 1.1-1.4 are about the construction of the dataset.

Questions 1.5-1.7 are about principal component analysis.

Questions 1.8-1.9 are about the rank r approximation.

Question 1.1 [marks 5]

Create a function create_rot_mat(A) that takes a non-singular, square array A as input and outputs a rotation matrix (array) with the same shape as the input. The function should first apply the Gram-Schmidt process to the columns in A. Following this, the sign of the last column should be flipped if the determinant is negative. The function should raise an AssertionError if the input array is singular. The function should not change the original array A.
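As an illustration of the steps described above, here is a minimal sketch, not a reference solution. The function name gram_schmidt_rotation and the tolerance tol used to detect linearly dependent columns are assumptions made for this example; the question does not specify how singularity should be detected.

```python
import numpy as np

def gram_schmidt_rotation(A, tol=1e-12):
    """Illustrative sketch: orthonormalise the columns of A, then fix the sign."""
    A = np.asarray(A, dtype=float)
    assert A.ndim == 2 and A.shape[0] == A.shape[1], "A must be square"
    n = A.shape[0]
    Q = np.zeros((n, n))  # results go into a new array, A is never modified
    for j in range(n):
        v = A[:, j].copy()
        # Remove the components along the orthonormal columns built so far
        for i in range(j):
            v -= (Q[:, i] @ v) * Q[:, i]
        norm = np.linalg.norm(v)
        # A (numerically) zero remainder means the columns are dependent;
        # tol is an assumed threshold, not part of the question
        assert norm > tol, "A is singular"
        Q[:, j] = v / norm
    # An orthogonal matrix has determinant +1 or -1; flip the last column if -1
    if np.linalg.det(Q) < 0:
        Q[:, -1] *= -1
    return Q
```

For the array [[1, 0], [-1, 1]] used later in this notebook, the sketch returns a rotation by -45 degrees, i.e. [[1, 1], [-1, 1]] divided by the square root of 2.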

def create_rot_mat(A):
    # YOUR CODE HERE
    raise NotImplementedError()

# This cell is reserved for the unit tests. Do not consider this cell.

Question 1.2 [marks 5]

Create a function means = create_means(R, k) that takes as input an n x n rotation matrix R and an integer k. The output will be an n x k array containing the coordinates of k means (of the data to be generated later). The first column (mean coordinates) in the output array should be a unit vector with means[0,0] = 1. The second column should be equal to the first column multiplied from the left by the rotation matrix R, the third column should be equal to the first column multiplied twice from the left by R, the fourth column should be equal to the first column multiplied thrice from the left by R, etc. The function should raise an AssertionError if the array R is not a rotation matrix.
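The construction above amounts to column j being R applied j times to the first unit vector. The following is a hedged sketch of that idea; the names is_rotation_matrix and rotated_means_sketch and the tolerance tol are assumptions for illustration. A rotation matrix is taken here to mean an orthogonal matrix with determinant +1.

```python
import numpy as np

def is_rotation_matrix(R, tol=1e-8):
    """Check R^T R = I and det(R) = +1, up to an assumed tolerance."""
    R = np.asarray(R, dtype=float)
    if R.ndim != 2 or R.shape[0] != R.shape[1]:
        return False
    orthogonal = np.allclose(R.T @ R, np.eye(R.shape[0]), atol=tol)
    proper = np.isclose(np.linalg.det(R), 1.0, atol=tol)
    return bool(orthogonal and proper)

def rotated_means_sketch(R, k):
    """Column j holds R applied j times to the first unit vector."""
    R = np.asarray(R, dtype=float)
    assert is_rotation_matrix(R), "R must be a rotation matrix"
    n = R.shape[0]
    means = np.zeros((n, k))
    means[0, 0] = 1.0  # first column is the unit vector e_1
    for j in range(1, k):
        # Reuse the previous column instead of recomputing powers of R
        means[:, j] = R @ means[:, j - 1]
    return means
```

Because a rotation preserves length, every column of the result is again a unit vector; in two dimensions the k means lie on the unit circle.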

def create_means(R, k):
    # YOUR CODE HERE
    raise NotImplementedError()

# This cell is reserved for the unit tests. Do not consider this cell.

Question 1.3 [marks 3]

Create a function create_PSD_matrix(R, eigenvalues) that takes as input an n x n rotation matrix R and a 1 x n dimensional array eigenvalues of positive numbers. The function should output a positive definite matrix with the eigenvectors specified by the columns in R and the associated eigenvalues given by the values in eigenvalues. The function should raise an AssertionError if the array R is not a rotation matrix or if any of the eigenvalues are not positive.
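A symmetric matrix with orthonormal eigenvectors in the columns of R and eigenvalues lambda can be written as R diag(lambda) R^T. The sketch below illustrates this identity under stated assumptions: the name spd_from_eigen_sketch is hypothetical, and the rotation check uses numpy's default tolerances, which the question does not prescribe.

```python
import numpy as np

def spd_from_eigen_sketch(R, eigenvalues):
    """Build R @ diag(eigenvalues) @ R.T for a rotation matrix R."""
    R = np.asarray(R, dtype=float)
    lam = np.ravel(np.asarray(eigenvalues, dtype=float))  # accepts (n,) or (1, n)
    n = R.shape[0]
    assert R.shape == (n, n) and lam.shape == (n,), "shape mismatch"
    # Rotation check: orthogonal with determinant +1 (default tolerances assumed)
    assert np.allclose(R.T @ R, np.eye(n)), "R is not orthogonal"
    assert np.isclose(np.linalg.det(R), 1.0), "det(R) is not +1"
    assert np.all(lam > 0), "eigenvalues must be positive"
    # Broadcasting scales column j of R by lam[j], which equals R @ diag(lam)
    return (R * lam) @ R.T
```

Since all eigenvalues are strictly positive, the result is positive definite, and column j of R satisfies C @ R[:, j] = eigenvalues[j] * R[:, j].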

def create_PSD_matrix(R, eigenvalues):
    # YOUR CODE HERE
    raise NotImplementedError()

# This cell is reserved for the unit tests. Do not consider this cell.

Question 1.4 [marks 4]

Create a function X, targets = make_data(means, cov, m) that outputs a data matrix X and a one-dimensional class vector targets. The function takes as input an n x k array means, where each column in means represents the mean vector for each class, an n x n array cov that specifies the covariance for all classes, and an integer m. Generate the same number of instances for each class for a total of m instances (assume that m is an exact multiple of k).

The output X should be an m x n array. targets should contain a class indicator for each instance (i.e. an integer between 0 and k-1 indicating the class the corresponding row in X belongs to). All data should be simulated from a multivariate normal distribution.
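One possible way to simulate such data is sketched below, using numpy's Generator.multivariate_normal. The function name sample_classes_sketch and the seed parameter are assumptions added for reproducibility in this example; they are not part of the required signature.

```python
import numpy as np

def sample_classes_sketch(means, cov, m, seed=None):
    """Draw m // k points per class from N(means[:, c], cov)."""
    means = np.asarray(means, dtype=float)
    n, k = means.shape
    assert m % k == 0, "m must be an exact multiple of k"
    per_class = m // k
    rng = np.random.default_rng(seed)  # seed is a hypothetical extra argument
    # One (per_class, n) block per class, stacked into an (m, n) array
    X = np.vstack([
        rng.multivariate_normal(means[:, c], cov, size=per_class)
        for c in range(k)
    ])
    # Class labels 0..k-1, each repeated per_class times, aligned with X
    targets = np.repeat(np.arange(k), per_class)
    return X, targets
```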

When executing the following code, each of the k classes is drawn as a separate scatter series:

A = np.array([[1, 0], [-1, 1]])
R = create_rot_mat(A)
k = 8
means = create_means(R, k)
A = np.array([[1, 2], [4, 5]])
R = create_rot_mat(A)
eigenvalues = 0.05 * np.array([0.05, 1.5])
cov = create_PSD_matrix(R, eigenvalues)
m = 800
data_matrix, targets = make_data(means, cov, m)
for i in range(k):
    plt.scatter(data_matrix[targets == i, 0], data_matrix[targets == i, 1])

def make_data(means, cov, m):
    # YOUR CODE HERE
    raise NotImplementedError()

# This cell is reserved for the unit tests. Do not consider this cell.

# Just run the following code, do not modify it
A = np.array([[1, 0], [-1, 1]])
R = create_rot_mat(A)
k = 8
means = create_means(R, k)
A = np.array([[1, 2], [4, 5]])
R = create_rot_mat(A)
eigenvalues = 0.05 * np.array([0.05, 1.5])
cov = create_PSD_matrix(R, eigenvalues)
m = 800
data_matrix, targets = make_data(means, cov, m)
for i in range(k):
    plt.scatter(data_matrix[targets == i, 0], data_matrix[targets == i, 1])

Question 1.5 [marks 3]

Write the function mu, cov = get_mean_cov(X) that takes an m x n data matrix X as input and returns the mean vector mu as a one-dimensional numpy vector of size n and the covariance matrix cov as a numpy matrix object of size n x n.

Provide your own implementation of the covariance. Do not use functions from the numpy library or any other library to directly compute the covariance matrix.
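For reference, the sample covariance can be computed from the centred data alone. The sketch below assumes the unbiased estimator with denominator m - 1; the question does not state whether m or m - 1 is expected, so treat that choice, the name mean_cov_sketch, and the plain ndarray return type as assumptions.

```python
import numpy as np

def mean_cov_sketch(X):
    """Mean vector and sample covariance of the rows of X."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    mu = X.mean(axis=0)          # shape (n,): one mean per feature
    Xc = X - mu                  # centre every row
    # Sum of outer products of centred rows, divided by m - 1 (assumed)
    cov = Xc.T @ Xc / (m - 1)
    return mu, cov
```

The result is symmetric by construction, and entry (i, j) is the covariance between features i and j.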

def get_mean_cov(X):
    # YOUR CODE HERE
    raise NotImplementedError()

# This cell is reserved for the unit tests. Do not consider this cell.

Question 1.6 [marks 7]

Write your own code to perform the PCA dimensionality reduction (i.e. do not use functions provided by the scikit-learn library, such as sklearn.decomposition.PCA, or any other library that computes the PCA directly).

Write a function PCA(X, threshold) that takes as input an m x n data matrix consisting of m vectors in n dimensions and a threshold between 0 and 1. The function should return the centred projection of X, using the minimal number of principal components needed to ensure that the explained variance of the PCA exceeds threshold.
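The procedure described above can be sketched as follows, as one possible reading of the question rather than a definitive implementation. It assumes that "explained variance" means the cumulative fraction of the total variance, that "exceeds" is a strict inequality, and that all components are kept when no smaller number exceeds the threshold. The name pca_sketch is hypothetical.

```python
import numpy as np

def pca_sketch(X, threshold):
    """Project centred X onto the fewest components exceeding threshold."""
    X = np.asarray(X, dtype=float)
    assert 0 <= threshold <= 1, "threshold must lie in [0, 1]"
    m, n = X.shape
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = Xc.T @ Xc / (m - 1)               # sample covariance (m - 1 assumed)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    # Index of the first cumulative ratio strictly greater than threshold
    first = int(np.searchsorted(explained, threshold, side="right"))
    r = min(first + 1, n)                   # keep all n if none exceeds
    return Xc @ eigvecs[:, :r]              # shape (m, r)
```

The sign of each projected column is arbitrary, since eigenvectors are only defined up to sign; tests that compare projections should allow for this.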

def PCA(X, threshold):
    # YOUR CODE HERE
    raise NotImplementedError()

# This cell is reserved for the unit tests. Do not consider this cell.