闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

STATS 782, 2022SC

Assignment 2

Some requirements

❼ You can submit either a PDF ﬁle that contains everything (e.g., don’t edit out error or warning messages from the results of your code), or an R markdown ﬁle plus the HTML ﬁle produced from your R markdown ﬁle.

❼ Comment, explain and demonstrate how your code works at critical steps (especially, for a long piece of code).

❼ Unless asked for, avoid using explicit loops or anything equivalent as much as possible. If unsure of your choices in terms of eﬃciency, use system .time() to compare run times.

❼ You will lose marks for signiﬁcantly poor R code or lack of explanation/demonstration.

Question 1 [10 marks]

It is well known that binomial coeﬃcients /k(亿)、can be found by creating Pascal’s triangle. “In much of the Western world, it is named after the French mathematician Blaise Pascal, although other mathematicians studied it centuries before him in India, Persia (Iran), China, Germany,

and Italy” [Wikipedia].

The triangle looks like the following:

1 1

1 2 1

1 3 3 1

1 ... ... ... 1

the (n + 1)-th row of which contains the values of /k(亿)、, for k = 0, 1, . . . , n. It is created as follows. Starting with the ﬁrst row which has 1 only, each row thereafter is produced based on the row before it, in the way that the ﬁrst and last values in the row are always 1s and any value in between is simply the sum of the two values right above it.

Implement this method in a recursive R function that, given a nonnegative integer n, ﬁnds and returns the (n + 1)-th row of the triangle. Demonstrate that your function works well for n = 0, 1, 2, 5, 10, 100.

Question 2 [20 marks]

In this question, let us investigate a classiﬁcation problem that has many more variables than observations, i.e., a p > n problem. As required in Parts (a) and (b) below, two classiﬁcation methods (slightly simpliﬁed) are to be implemented and used.

Modern technology such as DNA microarrays has made it feasible to measure the expression levels of large numbers of genes simultaneously for a sample (or subject/observation). On Canvas, the following three ﬁles are available (data source: Golub et al. (1999), Science 286, 531–536):

training .csv - Training data set needed for building a classiﬁer. It contains

the expression levels of p = 7129 genes for n = 38 samples, along with the gene accession numbers (or gene IDs).

test .csv - Independent test data set that can be used for examining

the performance of a classiﬁer. It contains the expression levels of the same 7129 genes for 34 new samples, where the genes, as indicated by their accession numbers, are given in the same order as in the training set.

samples .txt - Class labels of the 72 (= 38 + 34) samples. The class of a

sample is one of two types of tumor, either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).

Researchers are interested in predicting the type of a tumor from the expression levels of genes. For a two-class problem like this, one can label/classify a new observation x = (x1 , . . . , xp ) as class 1 (or ALL) if

f1 (x) > f2 (x),

or class 2 (or AML) if otherwise, where f1 and f2 are, respectively, the density functions of the observations in the two classes. If the variables xj are treated as being independent and normally distributed, the above classiﬁcation criterion is equivalent to using the inequality given by

p p

log{φ(xj ; µ1j , σj )} > log{φ(xj ; µ2j , σj )}, (1)

j=1 j=1

where φ(.; µ, σ) denotes the univariate normal density function with mean µ and standard deviation σ , µkj (k = 1, 2) the mean of the expression levels of gene j for class k and σj the corresponding within-class standard deviation. The two classiﬁcation methods below diﬀer in

their estimates of µkj , where σj are replaced with their pooled estimates j from the training

Note that as typical for gene data, the genes are given along the rows but should be treated as variables, while the samples are along the columns but should be treated as observations, in both the training and test sets.

In your code, you should avoid using loops as much as you can.

(a) The Na ı¨ve Bayes method simply uses the sample mean kj in place of the unknown true

(b) A problem with the Nave Bayes method is that it includes all variables in its classiﬁer,

but most variables (genes here) are likely irrelevant to classiﬁcation. A soft thresholding method known as Nearest Shrunken Centroids (Tibshirani et al. (2002), PNAS 99, 6567– 6572) provides a better solution. It ﬁnds and uses only a subset of variables in its classiﬁer and yet often gives more accurate predictions.

The method works as follows. Let

kj - j

samples of class k . For a properly-chosen threshold value λ > 0, dkj is shrunken towards 0 using

d = sign(dkj )(|dkj | - λ)+ ,

where “+” means positive part (a+ = a if a > 0, and = 0 otherwise). The shrunken

versions of kj are then obtained by reversing the above transformation as follows: = j + mk j d .

The classiﬁer uses in (1) in place of µkj . Note that for any j, if dj = dj = 0, then

Implement the Nearest Shrunken Centroids method and apply it to the test set using λ = 6. Provide the predicted class labels of the test samples and ﬁnd out how many test samples are misclassiﬁed. Print the accession numbers (IDs) of the genes that are deemed having predictive power by the method.

[An appropriate value of λ can be determined solely from the training set, via a data resampling method such as cross-validation. This is not required for answering this as- signment question.] [10 marks]

Question 3 [20 marks]

The ABO locus used for blood types incorporates the three alleles A, B, and O and exhibits the four observable phenotypes A, B, AB and O. Since alleles A and B are genetically dom- inant to allele O, this results that, for example, both genotypes A/O and A/A exhibit the same phenotype A. From the observed numbers of people with blood types A, B, AB and O, respectively, one can estimate the proportions of alleles A, B and O in a population via the maximum likelihood method, which assumes the Hardy-Weinberg law of population genetics.

Suppose that a random sample of a human population is of size n = 505 and consists of nA = 172, na = 43, nAa = 11 and nO = 279 people for phenotypes A, B, AB and O, respectively. The log-likelihood function is given by

l(pA , pa , pO ) = nA log(pA(2) + 2pApO ) + na log(pa(2) + 2papO ) + nAa log(2pApa ) + nO log(pO(2)),

where pA , pa , pO 2 0 are the proportions of alleles A, B and O in the population, satisfying pA + pa + pO = 1.

(a) Use the optim() function to ﬁnd the maximum likelihood estimates of pA , pa and pO .

You should use the BFGS (or L-BFGS-B) method, without providing the derivatives of the log-likelihood function. [10 marks]

(b) Re-do Part (a), with the derivatives of the log-likelihood function provided to optim(). [5 marks]

(c) Provide a contour plot of the log-likelihood function, in terms of pA and pa only, where pO = 1 - pA - pa . [5 marks]