Statistics 4H Projects Session 2021-2022
Statistical modelling of correlated count data from biological experiments (Epi B)
Brief Description of Project
With the advent of powerful high-throughput experimental technologies over the last two decades, it is becoming common for biology labs to generate massive datasets, in the form of long sequences of counts, in their quest to determine features of biological interest within the human genome. For instance, the methylation of DNA is a biological process that leads to certain parts of the DNA being better protected against mutations that could lead to loss of function or even the onset of cancer, and is thus of great interest to medical scientists. The study of DNA methylation often necessitates collecting measurements at the same locations of DNA from different people. Due to biological constraints on the genome, data sequences on the same set of genomic locations are often highly correlated, which makes hidden patterns more difficult to detect. Most models currently in use for biological sequence data assume that data sequences are independent, potentially leading to incorrect inference. This project will focus on data generated from a particular type of sequencing experiment, called bisulfite sequencing, designed to study patterns of DNA methylation in the human genome. Various types of models for correlated count data will be considered and assessed, to determine an appropriate set of models (and assumptions) for data generated in such experiments. Finally, statistical inferences from the model(s) will be validated by comparison with current biological knowledge.
Key Questions of Interest
What kind of statistical models are appropriate for data generated from experiments to study DNA methylation?
How can correlations between sequences of counts be incorporated into such models?
Can such models usefully differentiate regions having specific biological functions?
Analysis Summary
What level of difficulty do you think the project will have for the typical student?
Moderate/Difficult
Is any Programming/Simulation required? Yes
Coding and implementing probability functions for non-standard distributions in R. Depending on the student, implementing techniques for either maximum likelihood estimation or Bayesian estimation (using MCMC) of model parameters.
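As a rough illustration of what "coding a probability function for a non-standard distribution" and fitting it by maximum likelihood can involve, the sketch below implements a univariate beta-binomial log-pmf and a crude grid-search MLE. It is written in Python using only the standard library, although the project itself specifies R; the same structure should carry over. The choice of the beta-binomial, the function names, the toy counts and the grid are all assumptions made for illustration, not part of the project specification. Note that the likelihood here treats sites as independent, which is exactly the assumption the project sets out to relax.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_logpmf(k, n, a, b):
    """Log P(K = k) for a beta-binomial with n trials and shape parameters a, b > 0."""
    if not 0 <= k <= n:
        return -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)

def log_likelihood(data, a, b):
    """Sum of log-pmf values over (methylated reads, total reads) pairs.

    Assumption: observations are treated as independent.
    """
    return sum(betabinom_logpmf(k, n, a, b) for k, n in data)

def grid_mle(data, grid):
    """Crude MLE: evaluate the log-likelihood on a grid of (a, b) and keep the best pair."""
    return max(((a, b) for a in grid for b in grid),
               key=lambda ab: log_likelihood(data, *ab))

# Made-up illustrative counts (hypothetical, not real data):
# (methylated reads, total reads) at six sites.
toy_data = [(9, 10), (1, 12), (8, 8), (0, 15), (14, 15), (2, 11)]
grid = [0.1 * i for i in range(1, 51)]  # candidate shape values 0.1, 0.2, ..., 5.0

a_hat, b_hat = grid_mle(toy_data, grid)
print("grid MLE of (a, b):", a_hat, b_hat)
```

A grid search is used only to keep the sketch dependency-free; in practice a numerical optimiser would normally replace it, and a Bayesian treatment would put priors on the shape parameters and sample them with MCMC instead.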
Please specify the statistical techniques which the project is likely to require, and any that are essential (combined and WP(5) may not have covered them, since they have options):
Probability and inference
Bayesian Statistics/Advanced Bayesian Methods
Suggested reading:
1. Introduction to epigenomics and epigenome-wide analysis. Fazzari MJ, Greally JM. Methods Mol Biol. 2010;620:243-65.
2. Bacon with Your Eggs? Applications of a New Bivariate Beta-Binomial Distribution. Danaher, P., & Hardie, B. (2005). The American Statistician, 59(4), 282-286. [Available at: https://www.jstor.org/stable/27643695]
3. DNA methylome analysis using short bisulfite sequencing data. Krueger, F., Kreck, B., Franke, A. et al. Nat Methods 9, 145-151 (2012).
(Note: This project is related to projects Epi A, Epi C and Epi D.)
2022-01-15