DATA3888 (2022): Assignment 1

Question 1: Brain-box

A physics instructor, Louis, has created a data set stored under “Spiker_box_Louis.zip” that contains a series of sequences of varying lengths. The file name determines the eye movement. For example, the file ‘LRL_L3.wav’ corresponds to left-right-left eye movements; the file ‘LLRLRLRL_L.wav’ corresponds to left-left-right-left-right-left-right-left eye movements. There are a total of 31 files. Build a classification rule for detecting {L, R} under streaming conditions, where the function takes a signal sequence as input.

● (i) Estimate the accuracy of your classifier. Is your value reasonable?

● (ii) Does the length of the sequence affect the classification accuracy?

Hint: (a) Consider what metric you will use to define “performance”. You will need to explain your choice and justify your answer.

dir("data/Spiker_box_Louis/Short")

## [1] "LLL_L1.wav" "LLL_L2.wav" "LLL_L3.wav" "LLR_L1.wav" "LLR_L2.wav"
## [6] "LLR_L3.wav" "LRL_L1.wav" "LRL_L2.wav" "LRL_L3.wav" "LRR_L1.wav"
## [11] "LRR_L2.wav" "LRR_L3.wav" "RLL_L1.wav" "RLL_L2.wav" "RLL_L3.wav"
## [16] "RLR_L1.wav" "RLR_L2.wav" "RLR_L3.wav" "RRL_L1.wav" "RRL_L2.wav"
## [21] "RRL_L3.wav" "RRR_L1.wav" "RRR_L2.wav" "RRR_L3.wav"

dir("data/Spiker_box_Louis/Medium")

## [1] "LLRLRLRL_L.wav" "LLRRLLLR_L.wav" "LLRRRLLL_L.wav" "LRRRLLRL_L.wav"
## [5] "RRRLRLLR_L.wav"

dir("data/Spiker_box_Louis/Long")

## [1] "LLLRLLLRLRRLRRRLRLLL_L.wav" "RRLRRLRLRLLLLLLRRLRL_L.wav"
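Because the file name encodes the true sequence of eye movements, the ground-truth labels can be read straight from the listings above. The following is a minimal sketch in base R, not a prescribed solution: `parse_labels` and `sequence_score` are hypothetical helper names, the predicted sequence is made up, and the edit-distance score is only one possible choice of performance metric.

```r
# Hypothetical helper: recover the true event sequence from a file name
# such as "LRL_L3.wav" (everything before the first underscore).
parse_labels <- function(file_name) {
  stem <- sub("_.*$", "", basename(file_name))
  strsplit(stem, "")[[1]]
}

# One possible sequence-level score (an assumption, not the required metric):
# 1 minus the Levenshtein distance, normalised by the longer sequence length.
sequence_score <- function(predicted, truth) {
  pred_str <- paste(predicted, collapse = "")
  true_str <- paste(truth, collapse = "")
  d <- drop(utils::adist(pred_str, true_str))
  1 - d / max(nchar(pred_str), nchar(true_str))
}

truth <- parse_labels("LLRLRLRL_L.wav")
predicted <- c("L", "L", "R", "L", "R", "R", "L")  # made-up classifier output
score <- sequence_score(predicted, truth)
```

An edit-distance score tolerates a streaming classifier that misses or inserts an event, which a position-by-position comparison would penalise along the whole remaining sequence.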

Question 2: Prevalidated model

(from Week 3 lecture) The Kidney Transplant data from “GSE46474” contains the gene expression profiles of 40 blood samples. Of those, 20 patients rejected their kidney and 20 had stable grafts and will be treated as controls. Using this gene expression data, let's build a classification model incorporating two types of data using the prevalidation principle. Here, we first build a molecular signature (set of features) from the gene expression platform to obtain a single variable known as the prevalidated outcome. Next, we model this prevalidated outcome in combination with the other clinical variables to build a classifier of the outcome of interest.

● (a) Build a classifier using a support vector machine (SVM) to predict the outcome of graft survival and generate a prevalidated outcome from the gene expression data.

● (b) Use it together with the clinical variables in a logistic regression to build a risk model. Describe your final model for classifying graft survival in different individuals and your estimate of its accuracy.

● (c) What is the final prediction based on your final model for a 70-year-old male whose transcriptomics profile is predicted to have a favourable survival outcome?
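The prevalidation steps in parts (a) to (c) can be sketched as follows. This is an illustration under stated assumptions, not the assignment's solution: the data are synthetic, the outcome coding (1 = rejection) and the `age` variable are made up, and a base R nearest-centroid score stands in for the SVM so that the block runs without extra packages. The point it shows is that each sample's prevalidated score comes from a model fitted without that sample.

```r
set.seed(1)
n <- 40
p <- 200
X <- matrix(rnorm(n * p), n, p)        # synthetic stand-in for expression data
y <- rep(c(0, 1), each = n / 2)        # assumed coding: 0 = stable, 1 = rejection
X[y == 1, 1:5] <- X[y == 1, 1:5] + 1   # plant a weak signal in five genes
age <- round(runif(n, 20, 75))         # made-up clinical variable

folds <- sample(rep(1:5, length.out = n))
preval <- numeric(n)
for (k in 1:5) {
  test <- folds == k
  # Stand-in for the SVM: a nearest-centroid direction fitted on the
  # training folds only, then applied to the held-out fold.
  mu1 <- colMeans(X[!test & y == 1, , drop = FALSE])
  mu0 <- colMeans(X[!test & y == 0, , drop = FALSE])
  w <- mu1 - mu0
  mid <- (mu1 + mu0) / 2
  preval[test] <- drop(sweep(X[test, , drop = FALSE], 2, mid) %*% w)
}

# Risk model: prevalidated score plus the clinical variable.
model_data <- data.frame(y = y, preval = preval, age = age)
risk_model <- glm(y ~ preval + age, data = model_data, family = binomial)

# Hypothetical new patient: aged 70 with a low (favourable) molecular score.
new_patient <- data.frame(preval = unname(quantile(preval, 0.1)), age = 70)
risk_70 <- predict(risk_model, newdata = new_patient, type = "response")
```

With the real data, the stand-in score would be replaced by the held-out SVM prediction and the model would include whichever clinical variables are available, such as sex.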

Question 3: Blood vs Biopsy Biomarker

In the data GSE46474, we estimated the accuracy of our predictive model for graft rejection from a peripheral blood gene expression dataset. However, rejection is a very active process that occurs in the kidney itself. Here we will look at a similar kidney microarray dataset. Therefore, instead of genes being isolated and sequenced from blood, we examine another dataset, GSE138043, where the samples have been sequenced from a kidney biopsy. Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474 and use the selected genes to build a classifier using randomForest to predict the outcome of graft survival. Visualize your results. We have broken this task into the following four subsections.

● (a) Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474. Combine the two sets of genes. How many genes are in the union of the two lists?

● (b) Build two classifiers using randomForest to predict the outcome of graft survival using the genes selected in part (a).

● (c) Perform repeated 5-fold cross-validation for each dataset and calculate the accuracy. What is the average accuracy for the blood vs biopsy biomarker models?

● (d) Select an appropriate graphic to communicate the difference between these two classification accuracies.
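The gene selection in part (a) and the repeated cross-validation in part (c) can be sketched in base R as below. Everything here is an assumption made for illustration: the matrices are synthetic rather than the GEO data, the helper names are hypothetical, and a nearest-centroid rule stands in for randomForest so that the block needs no extra packages.

```r
set.seed(2)
# Synthetic stand-ins (rows = genes, columns = samples); not the GEO data.
blood <- matrix(rnorm(300 * 40), nrow = 300,
                dimnames = list(paste0("gene", 1:300), NULL))
biopsy <- matrix(rnorm(300 * 30), nrow = 300,
                 dimnames = list(paste0("gene", 101:400), NULL))

# Names of the k genes with the largest variance across samples.
top_variable <- function(expr, k = 50) {
  gene_var <- apply(expr, 1, var)
  names(sort(gene_var, decreasing = TRUE))[seq_len(k)]
}
gene_union <- union(top_variable(blood), top_variable(biopsy))

# Repeated k-fold CV: reshuffle the fold labels on every repeat and
# average the per-repeat accuracies.
repeated_cv_accuracy <- function(x, y, fit_predict, k = 5, repeats = 10) {
  n <- length(y)
  acc <- numeric(repeats)
  for (r in seq_len(repeats)) {
    folds <- sample(rep(seq_len(k), length.out = n))
    pred <- character(n)
    for (f in seq_len(k)) {
      test <- folds == f
      pred[test] <- fit_predict(x[!test, , drop = FALSE], y[!test],
                                x[test, , drop = FALSE])
    }
    acc[r] <- mean(pred == y)
  }
  mean(acc)
}

# Stand-in classifier (nearest centroid), used here in place of randomForest.
nearest_centroid <- function(train_x, train_y, test_x) {
  classes <- unique(train_y)
  centroids <- sapply(classes, function(cl)
    colMeans(train_x[train_y == cl, , drop = FALSE]))
  # Squared distance from every test sample to every class centroid.
  d <- sapply(seq_along(classes), function(j)
    rowSums(sweep(test_x, 2, centroids[, j])^2))
  d <- matrix(d, nrow = nrow(test_x))
  classes[max.col(-d)]
}

# Only genes measured on the blood platform can be used in the blood model.
blood_genes <- intersect(gene_union, rownames(blood))
x_blood <- t(blood[blood_genes, ])                  # samples x genes
y_blood <- rep(c("stable", "rejection"), each = 20) # made-up labels
acc_blood <- repeated_cv_accuracy(x_blood, y_blood, nearest_centroid)
```

Because the synthetic labels are unrelated to the synthetic expression values, the accuracy should sit near chance; the same function applied to the biopsy matrix gives the second number needed for the comparison in part (d).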

Question 4: Visualisation on world map

Sully and colleagues have curated a public dataset containing characteristics linked to coral bleaching over the last two decades. The data is in the file “Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv”, and the authors curated coral bleaching events at 3351 locations in 81 countries from 1998 to 2017. The column “Average bleaching” records the percentage of coral reefs worldwide that were bleached during the sampling periods, while the column “ClimSST” quantifies the sea-surface temperature (SST) at various locations.

● (a) Use ggplot to visualize the percentage of bleaching in coral reefs on a world map and look at which areas of the world have the most severe coral bleaching.

● (b) The team of scientists believes that “coral bleaching is less common in localities with a high variance of sea-surface temperature Anomaly (SSTA) over time.” Use one or two appropriate graphics together to demonstrate this point. Please explain your choice. Hint: Determine which column of the data measures “sea-surface temperature Anomaly (SSTA)”.