
MATH6068 Statistical Genetics

Assessment 2022/2023

Due date: 2nd May 2023, 4 pm

There is only one assignment for this assessment. You will write two short reports on the analysis of two sets of genetic data. Your assignment is to be submitted via Turnitin on Blackboard by 2nd May 2023, 4 pm. Make sure you submit to the Turnitin assignment ‘MATH6068 Assessment 2023’, which will be available from 24th April 2023.

The use of artificial intelligence tools, such as ChatGPT, is not allowed for this assessment. As in all coursework, anything you have written must be in your own words and not copied from another student or from output generated using an artificial intelligence tool.

Unauthorised late coursework will be penalised by deducting a percentage of the marks awarded for each working day it is late, in line with the University’s late submission penalties.

Policy on Collaboration

We encourage students to discuss and exchange ideas since this is an important part of the educational process. You are allowed to discuss your general analytic strategy with your fellow students, but you must do your own analyses as well as interpret and write up the results yourself. However, it is NOT acceptable for a student to read and gain ideas for his or her coursework from another student’s finished work. If copying between pieces of coursework occurs, it will be penalised after discussion with the students concerned, and a report will be made to the students’ programme coordinators.

Guidance Notes

• Whilst it is not necessary for your assessment to be typed, legibility is essential.

• Your report does not need to be in any particular format, but you need to state clearly the purpose of the analysis (including any background information if appropriate), the method(s) used and your interpretation of the results.

• Write up your answers to the TDT and case-control analyses separately. If it is appropriate to comment on similarities and differences between the two analyses, a paragraph can be added at the end.

• Figures in brackets give the marking scheme and total 100.

You will use two sets of genuine genetic data.

The datasets, CHROM14.dta and CHROM14CC.dta, will be posted onto Blackboard for MATH6068 in the Assignments Folder.

These datasets relate to the same single nucleotide polymorphisms on chromosome 14. A single nucleotide polymorphism or SNP (pronounced “snip”) is a small genetic change, or variation, that can occur within a person’s DNA sequence. SNP variation occurs when a single nucleotide, such as an A, replaces one of the other three nucleotide letters C, G, or T. Although many SNPs do not produce physical changes in people, scientists believe that some SNPs may predispose a person to certain diseases. In this assignment the interest is in the potential association of a SNP in the CMA1 gene, located on the long arm of chromosome 14 (14q 11.2), and asthma.

You must use Stata for your assessment. In particular, use the Stata ado files written by David Clayton that you used for Computer Practicals 1 and 2; they are available on Blackboard for MATH6068 under the title ‘ado files for Computer Practicals 1 and 2’.

Assessment

Q1. Provide a background/introduction to the study [5 marks] and exploratory analysis [5 marks]. Carry out appropriate family-based analyses using these data [20 marks], explaining why you think your analysis is appropriate [5 marks]. Interpret the results of your analyses as fully as you can [15 marks].

These data consist of a single nucleotide polymorphism on chromosome 14. Here the interest is in the potential association of a SNP in the CMA1 gene, located on the long arm of chromosome 14 (14q 11.2), and asthma (as recorded on a case report form). CHROM14.dta has 1390 observations in total. The variables in the CHROM14.dta file are:

• famcon: family identification number (there are 341 different families)

• id: id number of each person in the study

• fatherid: this variable provides the id number of the individual’s father (some values are necessarily missing)

• motherid: this variable provides the id number of the individual’s mother (some values are necessarily missing)

• sex: sex of child (1=female, 2=male)

• crf: asthma as recorded on the case report form (1 = not asthmatic, 2 = asthmatic)

• pc204: severe asthma, i.e. PC20 < 4 (1 = not affected, 2 = affected)

• pc2016: mild/moderate asthma, i.e. PC20 ≤ 16 (1 = not affected, 2 = affected)

• totige: age-corrected total IgE (1 = not raised, 2 = raised)

• atop2: atopic status (1 = not atopic, 2 = atopic)

• allele1: single nucleotide polymorphism, allele number 1 (a or g)

• allele2: single nucleotide polymorphism, allele number 2 (a or g)

Carry out a TDT analysis for the SNP (and any other analyses you deem appropriate) for:

(a) asthma as reported on the case report form, i.e. crf

(b) severe asthma, i.e. pc204

(c) mild/moderate asthma, i.e. pc2016

(d) raised total IgE, i.e. totige

(e) atopic status, i.e. atop2

Use a 0.05 level of significance for testing.
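
As a reminder of what the TDT measures, the sketch below computes the standard McNemar-type TDT statistic, (b - c)^2 / (b + c), which is referred to a chi-squared distribution with 1 degree of freedom. It is written in Python purely for illustration and does not replace the Stata ado files required for the assessment. The function name `tdt_statistic` and the counts 60 and 40 are hypothetical; they are not taken from CHROM14.dta.

```python
import math


def tdt_statistic(b, c):
    """Return (chi-squared statistic, p-value) for a biallelic TDT.

    b: number of times heterozygous parents transmitted one allele
       to an affected child
    c: number of times heterozygous parents transmitted the other allele
    Only heterozygous parents are informative, so b + c is the number
    of informative transmissions.
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-squared distribution with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only (not from the assignment data).
chi2, p_value = tdt_statistic(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

With these made-up counts the statistic is 4.0, which falls just below the 0.05 threshold; real transmission counts must come from your own analysis in Stata.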

Notes:

• The CHROM14.dta dataset contains a mixture of parents and their offspring; consider this when exploring/describing the dataset.

• There are missing data in the file (a full stop is the missing-value indicator). Stata excludes these cases.

Q2. Carry out case-control analyses of the CHROM14CC.dta dataset [15 marks]. This should also include: a short background/introduction [4 marks]; exploratory analyses [12 marks]; and a discussion of the assumptions made [10 marks]. Comment as fully as you can on the results of these analyses, including consideration of links to other analyses if appropriate [9 marks]. Use a 0.05 level of significance for testing.

These data consist of a population-based case-control study of the potential association between asthma and a SNP in the CMA1 gene, located on the long arm of chromosome 14 (14q 11.2). There are 308 cases and 176 controls. The variables in the CHROM14CC.dta file are:

• id: id number (consecutive sequence). The first 308 id numbers are the first-sibling case group, and id numbers 309 to 484 are the controls.

• crf: the CRF asthma variable, indicating whether the individual has asthma.

One case is not ‘CRF asthma’ positive but is still included as a case in the case-control indicator variable (ccind). This is because this individual is not on current medication but still has a doctor’s diagnosis of asthma. Use the ccind variable for your analyses.

• allele1: single nucleotide polymorphism, allele number 1 (g or a)

• allele2: single nucleotide polymorphism, allele number 2 (g or a)

• ccind: The case-control indicator variable: 1 = case and 0 = control.

Notes:

• One part of Computer Practical 1 involved the creation of a genotype variable; you may find this useful for some types of case-control analysis.
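
The idea behind a genotype variable can be sketched as follows: the two allele columns are combined into a single unordered genotype, so that a/g and g/a are treated as the same heterozygote, and the genotypes are then tabulated against case-control status. The sketch is in Python for illustration only and is not the method from Computer Practical 1. The function name `genotype`, the handling of missing values as the text ".", and the six toy rows are assumptions; none of them comes from CHROM14CC.dta.

```python
from collections import Counter

MISSING = "."  # assumed text form of the missing-value indicator


def genotype(allele1, allele2):
    """Combine two alleles into an unordered genotype such as 'aa', 'ag' or 'gg'.

    Returns None if either allele is missing, so the record can be excluded.
    """
    if allele1 in (MISSING, None, "") or allele2 in (MISSING, None, ""):
        return None
    # Sorting makes the result independent of allele order: a/g == g/a.
    return "".join(sorted((allele1, allele2)))


# Hypothetical toy records: (ccind, allele1, allele2). Not real data.
rows = [
    (1, "a", "g"),
    (1, "g", "a"),
    (1, "g", "g"),
    (0, "a", "a"),
    (0, "g", "g"),
    (0, ".", "g"),  # excluded: one allele is missing
]

# Cross-tabulate case-control status against genotype.
table = Counter()
for ccind, a1, a2 in rows:
    g = genotype(a1, a2)
    if g is not None:
        table[(ccind, g)] += 1

for (ccind, g), n in sorted(table.items()):
    print(f"ccind={ccind} genotype={g} n={n}")
```

A table of this shape (status by genotype) is the usual starting point for genotype-based case-control comparisons; the equivalent variable for the assessment should be created in Stata.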