Data Science in R: Final Assignment

General Instructions

•    Analyse each of the datasets described below and write a short report (roughly 10-12 pages) on your analyses. Note: It is the content that is important. Longer reports will not be penalized directly, but longer is not necessarily better (whereas concise is definitely better).

•    For each dataset, your report should consist of an introduction stating the purpose of the analysis.

•   This should be followed by a data and methods section describing the data and explaining why the method(s) you have chosen are appropriate and briefly, in your own words, how they function. (You may want to describe how the methods function very briefly in the main text and include a more detailed, but still brief, description of the methods in an appendix, which you then refer to in the main report.)

•   The analyses should then be described in detail in a results section. Recall: any tables, graphs or plots included should be carefully labelled and discussed.

•   The report should have a concluding section in which you summarise and interpret your results. If appropriate, the results from different methods should be compared and any similarities or differences commented on. If appropriate, you should attempt to draw practical conclusions from your analysis (e.g. Step 8 of Practical 4B, Assignment 2); see below.

•   Tip: In general, you should not report the poor results of lots of classifiers that you have tried in the main text. You may wish to include this sort of trial and improvement in an appendix (but if you do, it should still be presented in the formal report style).

•    If the data are pre-processed in any way (e.g. scaling, normalisation), this should be explained and justified (at the appropriate place in the report).

•   The report should have a future work section in which you suggest how the analysis could be extended and/or improved.

•    R code should be included in an Appendix.

(Note: This is included for two reasons: to check you can perform this analysis correctly (so the marker may run your code to check it works as intended) and to identify errors if the results are not as expected (and hopefully still reward your efforts). Therefore, it is in your interest to include all R code clearly and concisely; see model solutions for good examples.)

Data set 1: Biomedical data

Background and Motivation

•   The data arose in a study to develop screening methods to identify carriers of a rare genetic disorder.

•    Four measurements m1, m2, m3, m4 were made on blood samples.

•    The current industry standard is to use m1 (only) to identify carriers of the disease.

•    The purpose of the analysis is to develop a new screening procedure to detect carriers and to describe its effectiveness. Are any of the measurements better than the others? Should the measurements be combined?

The Data

•   The data are in two files, one for “normals” (normals.txt: 127 samples) and one for carriers of the disease (carriers.txt: 67 samples).

•   The data have been stripped of the names and other identifiers, otherwise the data are as received by the analyst.

•    Each file has 6 columns. The first column contains the age of the patient, the second contains the date on which the blood sample was taken (mmddyy), and columns 3 to 6 contain the 4 measurements m1, m2, m3 and m4 respectively.
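The column layout above could be read into R along the following lines. This is only a minimal sketch: the helper name read_group, the column names, and the assumption that the files are whitespace-separated with no header row are illustrative and should be checked against the real files (as should any coding of missing values). The date column is read as text so that a leading zero in mmddyy is not lost. The two rows used in the demonstration are made up.

```r
# Hypothetical helper: read one group file and attach a class label.
# Assumes whitespace-separated columns and no header row (check the real files).
read_group <- function(path, label) {
  d <- read.table(path, header = FALSE,
                  col.names = c("age", "date", "m1", "m2", "m3", "m4"),
                  colClasses = c("numeric", "character", rep("numeric", 4)))
  d$date  <- as.Date(d$date, format = "%m%d%y")  # mmddyy -> Date
  d$group <- label
  d
}

# Demonstration on two made-up rows written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("30 030178 50.1 80.2 12.3 150.4",
             "25 112979 60.5 90.1 20.7 210.9"), tmp)
demo <- read_group(tmp, "normal")
str(demo)

# With the real files, the two groups could then be stacked, e.g.
# bio <- rbind(read_group("normals.txt", "normal"),
#              read_group("carriers.txt", "carrier"))
```

Note that R's %y conversion maps two-digit years 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, so it is worth checking that the parsed dates fall in a plausible range.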

Points to Note and Suggestions to Include

•    Experts in the field have noted that young people tend to have higher measurements.

•   The laboratory that prepared the measurements is worried that there may be a systematic drift over time in their measurement process.

•   These two effects should be considered in the analysis.

•    In your conclusion, can you suggest how the experimental design could have been improved?

•    As the disease is rare, there are fewer carriers of the disease from whom data are available than for normal controls.

•    Recall the purpose of this analysis (stated under Background and Motivation above): you should include an investigation/discussion that compares using m1 only (within an appropriate method) with at least one other alternative.

•   The four measurements m1, m2, m3, m4 are not equally easy to obtain (in terms of cost and accuracy). In the future work section for this data set, suggest how this type of information could be taken into account (e.g. within your analysis or when developing the screening procedure in light of your results etc.).
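One possible way to set up the comparison of m1 only against an alternative, while allowing for the age and time effects noted above, is sketched below. Everything here is an assumption made for illustration: the choice of logistic regression, the cross-validated AUC as the measure of effectiveness, the helper names auc and cv_auc, and above all the data, which are simulated with invented effect sizes and say nothing about the real measurements. Other methods and other measures of screening performance may be equally or more appropriate.

```r
# Simulated stand-in for the real data (all numbers are invented)
set.seed(1)
n <- 194
carrier <- rep(c(0, 1), times = c(127, 67))
age  <- round(runif(n, 20, 60))
days <- runif(n, 0, 700)  # days since the first sample, standing in for the date
sim <- data.frame(
  carrier, age, days,
  m1 = rnorm(n, 50 + 20 * carrier - 0.3 * age, 15),
  m2 = rnorm(n, 90 + 5 * carrier, 8),
  m3 = rnorm(n, 12 + 5 * carrier, 5),
  m4 = rnorm(n, 160 + 40 * carrier + 0.02 * days, 40)
)

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# k-fold cross-validated AUC for a logistic regression
cv_auc <- function(formula, data, k = 10) {
  fold  <- sample(rep(seq_len(k), length.out = nrow(data)))
  score <- numeric(nrow(data))
  for (i in seq_len(k)) {
    fit <- glm(formula, family = binomial, data = data[fold != i, ])
    score[fold == i] <- predict(fit, newdata = data[fold == i, ], type = "link")
  }
  auc(score, data$carrier)
}

# Same seed before each call so both models are scored on the same folds
set.seed(2); auc_m1  <- cv_auc(carrier ~ m1 + age + days, sim)
set.seed(2); auc_all <- cv_auc(carrier ~ m1 + m2 + m3 + m4 + age + days, sim)
c(m1_only = auc_m1, all_four = auc_all)

# A simple drift check: regress a measurement on time among normals only
drift_fit <- lm(m4 ~ days + age, data = sim, subset = carrier == 0)
coef(summary(drift_fit))["days", ]
```

Scoring both models on held-out folds avoids rewarding the larger model simply for having more parameters. Because carriers are the minority class, a rank-based measure such as AUC is used here rather than raw accuracy, which can look good even when few carriers are detected.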

Data set 2: DNA data

Background and Motivation

•    Bacteriophages, or just phages, are viruses that infect bacteria and regulate bacterial populations in natural ecosystems.

•    However, phages invade the human body, just as they do other natural environments, and they contribute to the evolution of bacterial cells in the human body by acquiring and spreading DNA.

•   The immune system reacts to them, although it is not clear to what extent, and their impact on human health is not yet known.

•    Studies suggest that more attention needs to be paid to their interference.

•   The aim of this analysis is to distinguish between human and phage DNA sequences.

The Data

•   The data set human-phage.txt contains 300 human DNA sequences (labelled “pos” in the first column of the file) and 300 phage DNA sequences (labelled “neg” in the first column of the file).

Points to Note and Suggestions to Include

•    Rather than just using the DNA sequence as input, try extracting features from the sequence. For example, you could count the numbers of individual bases (A, T, C or G) or any repeated patterns (see Practical 8A and associated VLC).

•    Compare the classification results you obtain using your extracted features and classifier(s) of your choice with those obtained using the DNA sequence and a random forest classifier (as in Week 5).

•    Use the resources given in Week 9 (and any additional research you may perform) in the introduction/conclusions sections to give practical biological insight (at an appropriate level: 1 paragraph maximum in each section).
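The feature extraction suggested above (counting individual bases and repeated patterns) might be sketched in base R as follows. The function names count_bases and count_kmer, the choice of the pattern "AT", and the GC-content feature are illustrative assumptions, and the two sequences are made up; the approach in Practical 8A may differ.

```r
# Count each base (A, C, G, T) in every sequence; returns one row per sequence
count_bases <- function(seqs) {
  bases <- c("A", "C", "G", "T")
  chars <- strsplit(toupper(seqs), "")
  counts <- vapply(chars,
                   function(ch) as.integer(table(factor(ch, levels = bases))),
                   integer(4))
  counts <- t(counts)
  colnames(counts) <- bases
  as.data.frame(counts)
}

# Count overlapping occurrences of a short pattern (k-mer) in every sequence
count_kmer <- function(seqs, kmer) {
  k <- nchar(kmer)
  vapply(toupper(seqs), function(s) {
    n <- nchar(s)
    if (n < k) return(0L)
    starts <- seq_len(n - k + 1)
    sum(substring(s, starts, starts + k - 1) == kmer)
  }, integer(1), USE.NAMES = FALSE)
}

# Two made-up sequences for illustration
seqs <- c("ATATCG", "GGGCCC")
features <- cbind(count_bases(seqs), AT = count_kmer(seqs, "AT"))
features$GC_content <- (features$G + features$C) / nchar(seqs)
features
```

The resulting data frame has one row per sequence and one column per feature, so it can be passed to most classifiers once the class labels are attached. If the sequences differ in length, proportions such as GC_content may be more comparable across sequences than raw counts.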

Generic Project Markscheme

University Grade | Percentage Range | Brief Description

Fail | 0-29 | Does not demonstrate a basic understanding of the fundamental course materials.

Compensatable Pass | 30-39 | Most aspects can be improved (includes mistakes/omissions). However, demonstrates a basic understanding of the fundamental course materials.

Third | 40-49 | Many aspects can be improved (includes mistakes/omissions). However, demonstrates a basic understanding of the core course materials.

2:2 | 50-59 | Demonstrates a sound understanding of the course materials. A number of aspects could be somewhat improved (includes minor mistakes/omissions).

2:1 | 60-69 | Demonstrates a good understanding of the course materials. Very few aspects could be somewhat improved (perhaps includes very minor mistakes/omissions).

First | 70-79 | Demonstrates a deep understanding of the course materials. Displays initiative and flair. Only small improvements can be suggested.

High First | 80-100 | As above but at a level well beyond what is reasonably expected at this level (e.g. of publishable quality).