MATH6068 Statistical Genetics

 

Assessment 2020/2021

 

Due date: Tuesday 11th May 2021, 3 pm

 

There is only one assignment for this course and it carries a total of 100 marks. You will write two short reports on the analysis of two sets of genetic data. Your assignment is to be handed in using Turnitin on Blackboard by 3 pm on Tuesday 11th May 2021.

 

Unauthorised late coursework will be penalised by deducting a percentage of the marks awarded for each working day it is late, in line with the University’s late submission penalties.

 

Policy on Collaboration

We encourage students to discuss and exchange ideas since this is an important part of the educational process. You are allowed to discuss your general analytic strategy with your fellow students, but you must carry out your own analyses and interpret and write up the results yourself. It is NOT acceptable for a student to read, or take ideas for his or her coursework from, another student’s finished work. If copying between pieces of coursework occurs, it will be penalised after discussion with the students concerned, and a report will be made to the programme coordinators of the students concerned.

 

Guidance Notes

Whilst it is not necessary for your assessment to be typed, legibility is essential.

Your report does not need to be in any particular format, but you need to state clearly the purpose of the analysis (including any background information if appropriate), the method(s) used and your interpretation of the results.

Write up your answers to the TDT and case-control analyses separately. If it is appropriate to comment on similarities and differences between the two analyses, add a paragraph at the end.

It would be helpful to include a copy of the Stata code used at the end of your write-up for each question. This will help us to understand exactly what you have done, and has the potential to increase your marks.

Figures in square brackets give the marks available for each part; they total 100.

 

You will use two sets of genuine genetic data.

The datasets, Chrom5and16.dta and Chrom2casecon.dta, will be posted onto Blackboard for MATH6068 in the Assignments Folder so they should be accessible to all students registered on the course.

These datasets relate to single nucleotide polymorphisms. A single nucleotide polymorphism or SNP (pronounced “snip”) is a small genetic change, or variation, that can occur within a person’s DNA sequence. SNP variation occurs when a single nucleotide, such as an A, replaces one of the other three nucleotide letters C, G, or T. Although many SNPs do not produce physical changes in people, scientists believe that some SNPs may predispose a person to a certain disease. Here the interest is in the potential association between these SNPs and asthma.

 

You must use Stata for your assessment. In particular, use the Stata ado files written by David Clayton that you used in Computer Practicals 1 and 2; they are entitled ‘ado files for Computer Practicals 1 and 2’ and are available on MATH6068 on Blackboard.

 

Assessment

1. Carry out an analysis of the family-based Chrom5and16.dta data and write a report.

These data consist of single nucleotide polymorphisms on chromosome 5 (the IL4 gene) and chromosome 16 (the IL4 receptor alpha gene). Here the interest is in the potential association between these SNPs and asthma (as recorded on a case report form). SNP1 and SNP2 are on chromosome 5 and SNP3 and SNP4 are on chromosome 16. The variables in the Chrom5and16.dta data file are:

● famcon: family id (consecutive integers, runs from 1 to 340)

● id: nine-digit participant id number

● fatherid: id number of father (with some values missing)

● motherid: id number of mother (with some values missing)

● sex: sex (1=female, 2=male)

● crf: asthma status as recorded on the case report form (1 = not asthmatic, 2 = asthmatic)

● snp1al1: single nucleotide polymorphism number 1, allele number 1

● snp1al2: single nucleotide polymorphism number 1, allele number 2

● snp2al1: single nucleotide polymorphism number 2, allele number 1

● snp2al2: single nucleotide polymorphism number 2, allele number 2

● snp3al1: single nucleotide polymorphism number 3, allele number 1

● snp3al2: single nucleotide polymorphism number 3, allele number 2

● snp4al1: single nucleotide polymorphism number 4, allele number 1

● snp4al2: single nucleotide polymorphism number 4, allele number 2

● totige: age-corrected total IgE (1 = not raised, 2 = raised)

 

IgE or immunoglobulin E is an antibody involved in allergic reactions. It can be measured in a blood serum analysis. A high level of serum total IgE is often interpreted as the general tendency or predisposition to develop allergic diseases.


Carry out appropriate analyses for each of the 4 SNPs separately, using a 0.05 level of significance for inference and commenting fully on the results [40 marks]:

(a) for asthma as reported on the case report form, i.e. crf,

(b) for raised total IgE, i.e. totige.

 

Comment on the differences, if any, in these analyses. Give an interpretation of your results. [10 marks]
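
As a reminder of the arithmetic behind a basic TDT for a two-allele SNP, the following is a minimal sketch in Python. It is an illustration only and not a substitute for the Stata analysis required above. The function name tdt_statistic and the counts passed to it are made up for this example; they are not taken from Chrom5and16.dta or from the course ado files. The sketch assumes you have already counted, among heterozygous parents of affected children, how often the allele of interest was transmitted and how often it was not.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Return the McNemar-type TDT chi-square (1 df) and its p-value.

    transmitted   -- number of times heterozygous parents passed the
                     allele of interest to an affected child
    untransmitted -- number of times they passed the other allele
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no informative transmissions from heterozygous parents")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p_value = tdt_statistic(transmitted=60, untransmitted=40)
print(f"TDT chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

Under the null hypothesis of no linkage or association, each allele of a heterozygous parent is equally likely to be transmitted, which is why only the two discordant counts enter the statistic.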

 

2. Carry out an appropriate analysis of the association between SNP1 and case-control status using the Chrom2casecon.dta data. These data come from a population-based case-control study of the potential association between asthma and a SNP on chromosome 2. Use a 0.05 level of significance for inference. [50 marks]

The variables in the Chrom2casecon.dta data file are:

● id: participant id (consecutive sequence). The first 322 id numbers are the first sibling case group, and id numbers 323 to 506 are the controls.

● crfasthm: the CRF asthma variable, which indicates whether the participant has asthma. One participant is not ‘CRF asthma’ but is still counted as a case in the case-control indicator variable (ccind), because they have a doctor’s diagnosis of asthma although they are not on current medication. You can ignore this variable.

● ccind: the case-control indicator variable (1 = case, 0 = control)

● snp1a_1: single nucleotide polymorphism 1, allele number 1

● snp1a_2: single nucleotide polymorphism 1, allele number 2
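
One common way to compare cases and controls is a 2 x 2 table of allele counts (rare allele versus all other alleles, by case-control status) tested with a Pearson chi-square. The sketch below, in Python, shows that arithmetic only; it is one possible approach rather than the required analysis, which must be done in Stata. The function name pearson_chi2_2x2 and all counts are hypothetical, and the allele-based test shown assumes the two alleles of each person can be treated as independent observations.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    Rows are cases and controls; columns are rare-allele and other-allele counts.
    Returns (chi-square, p-value, odds ratio).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return chi2, p_value, odds_ratio


# Hypothetical allele counts, not taken from Chrom2casecon.dta:
# cases: 30 rare, 70 other; controls: 20 rare, 80 other.
chi2, p_value, odds_ratio = pearson_chi2_2x2(30, 70, 20, 80)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}, OR = {odds_ratio:.3f}")
```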

 

Notes

● One part of Computer Practical 1 involved the creation of a genotype variable; you may find this useful for some types of case-control analysis.

● The allele variables, snp1a_1 and snp1a_2, use the _1 and _2 postfix notation used in Computer Practical 1.

● There are 3 different alleles (i, j and k), but only one of them is rare in this dataset.

● There is a lot of missing data in the file (a full stop is the missing value indicator). Stata excludes cases with missing values from the analysis.
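
The idea of a genotype variable and the effect of missing alleles can be sketched as follows. This is a Python illustration with made-up records, not the Stata code from Computer Practical 1. The helper make_genotype is hypothetical, the allele codes "i", "j" and "k" are assumed to be stored literally, and None stands in for Stata's full-stop missing value.

```python
from collections import Counter


def make_genotype(allele1, allele2):
    """Combine two allele codes into an unordered genotype label.

    Returns None when either allele is missing, so that the record can be
    left out of the tabulation.
    """
    if allele1 is None or allele2 is None:
        return None
    first, second = sorted((allele1, allele2))
    return f"{first}/{second}"


# Made-up records of (ccind, allele 1, allele 2), for illustration only.
records = [
    (1, "i", "j"), (1, "j", "i"), (1, "i", "i"), (1, None, "i"),
    (0, "i", "i"), (0, "i", "k"), (0, None, None), (0, "i", "i"),
]

# Cross-tabulate genotype by case-control status using complete records only.
table = Counter()
for ccind, allele1, allele2 in records:
    genotype = make_genotype(allele1, allele2)
    if genotype is None:
        continue  # a missing allele means the genotype is unknown
    table[(ccind, genotype)] += 1

for (ccind, genotype), count in sorted(table.items()):
    print(f"ccind={ccind} genotype={genotype} n={count}")
```

Sorting the two alleles before joining them means that "i" then "j" and "j" then "i" give the same genotype, since the order in which the two alleles are recorded carries no information here.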