Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Computational Biology

Section A: Integrity Statement

By completing this assessment, I agree to the following declaration:

I understand the University expects all students to complete coursework with integrity and honesty. I promise to complete all online assessment with the same academic integrity standards and values. Any identified form of poor academic practice or academic misconduct will be followed up and may result in disciplinary action.

As a member of the University’s student body, I will complete this assessment in a fair, honest, responsi- ble and trustworthy manner. This means that:

• I declare that this assessment is my own work.

• I will not seek out any unauthorised help in completing this assessment.

• I am aware the University may use plagiarism detection tools to check my content.

I will not discuss the content of the assessment with anyone else in any form, including Canvas, Piazza, Facebook, Twitter or any other social media or online platform within the assessment period.

• I will not reproduce the content of this assessment anywhere in any form at any time.

I declare that I generated the calculations and data in this assessment independently, using only the tools and resources defined for use in this assessment.

• I will not share or distribute any tools or resources I developed for completing this assessment.

Section B: Computational Biology and Numerical Integration

1. Your research assistant generated the bifurcation plot below but excluded some important details.

Aside from the plot you know that:

i) The difference equation of the system being studied is: x_t₊₁ = exp(−ax²) + b

ii) In this plot, the parameter a = 4.9

iii) When b = 0, there is a periodic attractor.

Based on the plot and the information you have, answer the following questions.

(a) What would be an appropriate label for the horizontal axis? [ 1 mark]

(b) What would be an appropriate label for the vertical axis? [ 1 mark] (c)Why is it impossible for someone who only has access to this plot to identify an exact set of

parameter values where the system’s dynamics are certain to be chaotic? [2 marks]

(d) Given the information provided above, describe an approximate region in parameter space where this system has a periodic attractor that involves 4 different states. [1 mark]

(e) Describe the technique presented in lectures for creating a diagram like this one. You may use pseudo-code to do so, but are not required to. [10 marks]

(f) The technique you have just described has two important parameters that must be tuned to make the bifurcation diagram legible and accurate. Identify these parameters and explain why the value selected for each should not be too high nor too low. [7 marks]

Section C: Sequence Alignments and Assembly

2. You are given a function global(A, B, match, mismatch, gap) that solves the global alignment problem. It returns the optimal score and the optimal alignment when given match/mismatch scores, a linear gap penalty, and a pair of sequences.

(a) Can you use the global function to align two sequences in such a way that the alignment has only matches and gaps? If so explain how, otherwise suggest a modification of global(A, B, match, mismatch, gap) that would enable this. In the latter case, you do not have to use pseudo-code; just explain your idea in one sentence. [2 mark]

(b) You want to calculate the optimal overlap alignment of two sequences of equal length. Can this problem be solved using the global function, while not modifying global? Explain your answer. You may use pseudo-code to do so, but are not required to. [6 marks]

3. The global alignment function global(A, B, match=2, mismatch=-3, gap =-1) pro- duced the optimal path traced here:

F 2 :

Answer the following questions and explain your answers.

(a)What will the optimal alignment look like? [ 2 marks] (b)Is it possible with the given information to calculate the optimal score? If so, write down the

optimal score. If not, explain why not. [3 marks]

4. You are given the following profile matrix:

1 1

4 4

C 0 1 1

1 1

2 4

T 0 1 1

¹ 0 0 0

(a)Write down a multiple sequence alignment that corresponds to this profile matrix. [ 2 marks] (b)How many different alignments are there that correspond to this profile matrix? Are all

multiple sequence alignments corresponding to this profile the same size (in terms of number of sequences and alignment length)? Explain your answers. [2 marks]

5. You are given a set of error-free reads S = {AGC, CGC, CTT, GCG, GCT} and you want to reconstruct the DNA sequence X from which the reads originated. Show step-by-step how you will solve this sequence reconstruction problem using the following two approaches:

i. Write down the reconstructed sequence and explain how you obtained it. That is, explain the order in which the reads were considered and why you selected that order. [4 marks]

ii. Is the obtained sequence unique? If it is not unique, write down an alternative solution as an ordered sequence of reads. [2 mark]

(a) For the following, use the approach based on the Hamiltonian path. Work with the 2-mer composition of the provided reads.

i. Show the graph and explain how you constructed it. [ 3 marks] ii.Explain how you read the sequence from the graph. Provide the reconstructed sequence,

if it exists, and explain if the reconstructed sequence is unique. [3 marks]

Section D: Simulation, HMMs and Trees

6.(a)What is the distribution of the random variable that the randomVar pseudo-code below generates given that randomUniform produces a uniformly distributed random variable between 0 and 1? Explain your answer.

randomVar(rate) var = 0

x = - log(randomUniform())/rate while x < 1

var = var + 1

x = x - log(randomUniform())/rate return var

[4 marks]

(b)Given a method PoissPro1(t) that returns the event times of a Poisson process with fixed rate 1 over an arbitrary time period, t, and randomUniform as in part (a), write a pseudo- code method PoissPro(rate, t) that returns the event times of a Poisson process with arbitrary rate over an arbitrary time period, t. Describe one experiment you could run to check the accuracy of your implementation. [4 marks]

7. In the genomes of some phages, the coding regions are richer in purines (bases A and G) than pyrimidines (bases C and T). An HMM with states Y for pyrimindine and U for purine correspond to coding/non-coding regions in these sequences. In the model, the probability that state U is followed by state Y is 3%. The probability that state Y is followed by state U is 1%. The sequence starts in state U 80% of the time.

Each state can emit any of the 4 bases A, C, G, or T. Emission probabilities are given by

0.35

0.2

0.1

0.25

0.45

0.2

0.1

0.35

(a) Sketch a diagram of the HMM, showing all states, possible transitions and transition proba- bilities. Include the begin state but no end state. [3 marks]

(b) What is the joint probability P(x,π) of the state sequence π = UUUY and the symbol sequence x = ACCC? Leave your answer as a product or sum of numbers. [3 marks]

(c) Suppose you have multiple sequences of the same length x1 , . . . , xn . How could you use P(xj) for j = 1,...,n to decide whether or not the sequences came from the model de- scribed here? [3 marks]

(d) The model was trained using observed sequences. Describe how you could train the algorithm in the case where you knew where coding and non-coding regions were to start with, and in the case you didn’t have this information. What are the strengths and weaknesses of the two training approaches? [7 marks]

(e) Complete the entries a, b, and c in the Viterbi matrix below and draw in the traceback pointers.

Remember to show your working.

8. Let the symmetric matrix

[5 marks]

specify the pairwise distances, D_ij, between the four sequences x₁, . . . , x₄.

(a) Construct a UPGMA tree from D showing your working. [4 marks]

(b) Perform the first step in the neighbour-joining algorithm on D (up to where the first pair of neighbours i, j are joined and the distances of i and j to the new node are calculated).

[4 marks]

(c) Will UPGMA or neighbour-joining (or both or neither) reconstruct the correct tree in this case? Explain your answer. [4 marks]

9. Consider the four aligned sequences, W,X,Y, and Z:

12345

W: TCAAT

X: GCAAG

Y: CGATC

Z: CGATG

(a) Explain parsimony informative means, and identify the parsimony informative sites in the alignment. [3 marks]

(b) By calculating the parsimony score for each possible tree topology for these four taxa, find the maximum parsimony tree. [4 marks]

(c) Add two more columns to this alignment so that the new alignment has a parsimony tree which is topologically different to the parsimony tree for the original alignment. [3 marks]

(d) What typically prevents us from being able to find the maximum parsimony tree and why?

[2 marks]

(e) Under what circumstances might we opt to use a parsimony method over a likelihood based method? [2 marks]

10. (a)Describe a heuristic which can be used to try to find the maximum likelihood tree. Will this heuristic typically find the tree that maximises the likelihood? Explain your answer.

[4 marks]

(b) Given mutation rate parameter μ and normalised rate matrix Q, how do you calculate the probability that an A mutates to a C along a lineage of length t = 2? (Recall we denote, for example, the (A, A)th entry of a matrix B by BAA .) [3 marks]

(c) Let X and Y be sequences of length L. How can you use the calculation in part (b) to calculate the probability that X mutates into Y over a lineage of length t = 2? Explain any assumptions you are making. [3 marks]

(d) Explain why it is not always possible to estimate the height of a tree and the mutation rate at the same time. What data can be used to overcome this problem? [4 marks]