STATS 782, 2022SC Assignment 2
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
STATS 782, 2022SC
Assignment 2
Some requirements
❼ You can submit either a PDF file that contains everything (e.g., don’t edit out error or warning messages from the results of your code), or an R markdown file plus the HTML file produced from your R markdown file.
❼ Comment, explain and demonstrate how your code works at critical steps (especially, for a long piece of code).
❼ Unless asked for, avoid using explicit loops or anything equivalent as much as possible. If unsure of your choices in terms of efficiency, use system .time() to compare run times.
❼ You will lose marks for significantly poor R code or lack of explanation/demonstration.
Question 1 [10 marks]
It is well known that binomial coefficients /k(亿)、can be found by creating Pascal’s triangle. “In much of the Western world, it is named after the French mathematician Blaise Pascal, although other mathematicians studied it centuries before him in India, Persia (Iran), China, Germany,
and Italy” [Wikipedia].
The triangle looks like the following:
1
1 1
1 2 1
1 3 3 1
1 ... ... ... 1
the (n + 1)-th row of which contains the values of /k(亿)、, for k = 0, 1, . . . , n. It is created as follows. Starting with the first row which has 1 only, each row thereafter is produced based on the row before it, in the way that the first and last values in the row are always 1s and any value in between is simply the sum of the two values right above it.
Implement this method in a recursive R function that, given a nonnegative integer n, finds and returns the (n + 1)-th row of the triangle. Demonstrate that your function works well for n = 0, 1, 2, 5, 10, 100.
Question 2 [20 marks]
In this question, let us investigate a classification problem that has many more variables than observations, i.e., a p > n problem. As required in Parts (a) and (b) below, two classification methods (slightly simplified) are to be implemented and used.
Modern technology such as DNA microarrays has made it feasible to measure the expression levels of large numbers of genes simultaneously for a sample (or subject/observation). On Canvas, the following three files are available (data source: Golub et al. (1999), Science 286, 531–536):
training .csv - Training data set needed for building a classifier. It contains
the expression levels of p = 7129 genes for n = 38 samples, along with the gene accession numbers (or gene IDs).
test .csv - Independent test data set that can be used for examining
the performance of a classifier. It contains the expression levels of the same 7129 genes for 34 new samples, where the genes, as indicated by their accession numbers, are given in the same order as in the training set.
samples .txt - Class labels of the 72 (= 38 + 34) samples. The class of a
sample is one of two types of tumor, either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).
Researchers are interested in predicting the type of a tumor from the expression levels of genes. For a two-class problem like this, one can label/classify a new observation x = (x1 , . . . , xp ) as class 1 (or ALL) if
f1 (x) > f2 (x),
or class 2 (or AML) if otherwise, where f1 and f2 are, respectively, the density functions of the observations in the two classes. If the variables xj are treated as being independent and normally distributed, the above classification criterion is equivalent to using the inequality given by
p p
log{φ(xj ; µ1j , σj )} > log{φ(xj ; µ2j , σj )}, (1)
j=1 j=1
where φ(.; µ, σ) denotes the univariate normal density function with mean µ and standard deviation σ , µkj (k = 1, 2) the mean of the expression levels of gene j for class k and σj the corresponding within-class standard deviation. The two classification methods below differ in
their estimates of µkj , where σj are replaced with their pooled estimates j from the training
Note that as typical for gene data, the genes are given along the rows but should be treated as variables, while the samples are along the columns but should be treated as observations, in both the training and test sets.
In your code, you should avoid using loops as much as you can.
(a) The Na ı¨ve Bayes method simply uses the sample mean kj in place of the unknown true
(b) A problem with the Nave Bayes method is that it includes all variables in its classifier,
but most variables (genes here) are likely irrelevant to classification. A soft thresholding method known as Nearest Shrunken Centroids (Tibshirani et al. (2002), PNAS 99, 6567– 6572) provides a better solution. It finds and uses only a subset of variables in its classifier and yet often gives more accurate predictions.
The method works as follows. Let
kj - j
samples of class k . For a properly-chosen threshold value λ > 0, dkj is shrunken towards 0 using
d = sign(dkj )(|dkj | - λ)+ ,
where “+” means positive part (a+ = a if a > 0, and = 0 otherwise). The shrunken
versions of kj are then obtained by reversing the above transformation as follows: = j + mk j d .
The classifier uses in (1) in place of µkj . Note that for any j, if dj = dj = 0, then
Implement the Nearest Shrunken Centroids method and apply it to the test set using λ = 6. Provide the predicted class labels of the test samples and find out how many test samples are misclassified. Print the accession numbers (IDs) of the genes that are deemed having predictive power by the method.
[An appropriate value of λ can be determined solely from the training set, via a data resampling method such as cross-validation. This is not required for answering this as- signment question.] [10 marks]
Question 3 [20 marks]
The ABO locus used for blood types incorporates the three alleles A, B, and O and exhibits the four observable phenotypes A, B, AB and O. Since alleles A and B are genetically dom- inant to allele O, this results that, for example, both genotypes A/O and A/A exhibit the same phenotype A. From the observed numbers of people with blood types A, B, AB and O, respectively, one can estimate the proportions of alleles A, B and O in a population via the maximum likelihood method, which assumes the Hardy-Weinberg law of population genetics.
Suppose that a random sample of a human population is of size n = 505 and consists of nA = 172, na = 43, nAa = 11 and nO = 279 people for phenotypes A, B, AB and O, respectively. The log-likelihood function is given by
l(pA , pa , pO ) = nA log(pA(2) + 2pApO ) + na log(pa(2) + 2papO ) + nAa log(2pApa ) + nO log(pO(2)),
where pA , pa , pO 2 0 are the proportions of alleles A, B and O in the population, satisfying pA + pa + pO = 1.
(a) Use the optim() function to find the maximum likelihood estimates of pA , pa and pO .
You should use the BFGS (or L-BFGS-B) method, without providing the derivatives of the log-likelihood function. [10 marks]
(b) Re-do Part (a), with the derivatives of the log-likelihood function provided to optim(). [5 marks]
(c) Provide a contour plot of the log-likelihood function, in terms of pA and pa only, where pO = 1 - pA - pa . [5 marks]
2022-08-22