COMP 462 / 561 – Computational Biology Methods Assignment #1
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
COMP 462 / 561 – Computational Biology Methods
Assignment #1
Due date: 23:59, October 3, 2022, on MyCourses
Notes: You can either type your assignment (using Word, Latex, etc), or handwrite it and scan it. In either case, you must submit a PDF document. For question 1, also submit your code.
Question 1. (45 points) Sequence alignment slippage-aware scoring
Insertions and deletions are often caused by DNA polymerase slippage, which results in the duplication of one or more nucleotides, or the deletion of one or more copies of a repeated nucleotide. See the examples below. For this question, you will need to design an algorithm to solve a modified version global pairwise alignment that captures this notion, and you will need to implement and use it.
Slippage-aware alignment scoring scheme
Given:
• Sequences S and T aligned in an alignment A
• Substitution cost matrix M
• Slippage gap penalty cs
• Non-slippage gap penalty cn (generally we would have cn > cs )
Score(A) = same as standard linear gap penalty scoring scheme, except that
1. if A[1,i] = ‘- ‘, the score assigned to column i is: cs if A[2,i]=A[2,i- 1]
cn otherwise
2. if A[2,i] = ‘- ‘, then case assigned to column i is: cs if A[1,i]=A[1,i- 1]
cn otherwise
Examples: M = +1/- 1 for matches/mismatches, cs = - 1, cn = -2
AAAG AACG AT---GA ATTA-GT
A--T A--T ACCGGGA AT-AAG-
Score: 1111=-2 1121=-3 1112111=-2 1111112=0
Note: Numbers in red are negative numbers.
a) ( 10 points) Give the pseudocode of an exact algorithm for optimal slippage-aware alignment problem. Your algorithm should run in time O(m n), where m and n are the lengths of the two sequences.
b) (35 points) Implement your algorithm, including the trace-back procedure, in the programming language of your choice. Do not use any external alignment package/library; write your program from scratch. Your algorithm should take five arguments: (1) File containing the two sequences to be aligned (FASTA format). (2) The score for matches; (3) The score for mismatches (we will assume that all mismatches are score identically); (4) The slippage gap penalty cs ; (5) The non-slippage gap penalty cn .
Your program should print out the optimal alignment score, and the optimal alignment itself.
Use your program to compute the optimal slippage-aware alignment for the following pairs of sequences, assuming matchScore=1, mismatchScore=- 1, cs = - 1, cn = -2
http://www.cs.mcgill.ca/~blanchem/561/hw1_brca1_3utr_small.fa
http://www.cs.mcgill.ca/~blanchem/561/hw1_brca1_3utr_full.fa
Report the alignments found, together with their score, and submit your code on MyCourses. Your code will not be marked, but submitting an answer (alignment score) that would not have been produced by your own code would be considered cheating.
Question 2. (10 points) Needleman-Wunsch algorithm
What is the optimal global alignment for APPLE and CAPE under the linear gap penalty scoring scheme (match score = +1, mismatch score = - 1, and gap cost c=- 1). Show the dynamic programming matrix. Report all optimal solutions.
Question 3. (5 points) Affine gap penalty
Give an example of a pair of short sequences for which the optimal pairwise alignment under the linear gap penalty scheme (cost(L) = - 1*L) is different from the optimal pairwise alignment under an affine gap penalty with cost(L) = -2 - 0.5*L. Use substitution cost = +1 for matches and - 1 for mismatches. Try to find the shortest sequences possible. Show the optimal alignment under the two scoring schemes.
Question 4. (5 points) Longest common subsequence problem
Suppose that you have a very fast machine to execute the Smith-Waterman algorithm with any user-specific substitution cost matrix and linear gap penalty. How would you use it to solve the longest common subsequence: Given two DNA sequences S and T, find the longest sequence X that is a subsequence of both S and T?
Note: Subsequences and substrings are two different things. A subsequence is made of characters that occur in the right order in the input spring, but not necessarily consecutively. A substring is a sequence of consecutive characters of the input string. So every substring is a subsequence, but not vice-versa. For example, BOICS is a subsequence of BIOINFORMATICS, but it is not a substring. IRO is neither a subsequence nor a substring of BIOINFORMATICS.
Question 5. (15 points) NoDeletion alignment problem
Consider the NoDeletion global alignment problem, which consists of optimally aligning sequences S and T under a linear gap penalty for insertions but where deletions from S to T are not allowed. For example, these alignments are allowed:
S: ACAG S: AG-AT- S: ---A---
T: ATAG T: ACCATG T: CAGAGGC
But these alignments are not allowed:
S: ACAG S: AG-AT- S: C---GGC
T: A-AG T: -CCATG T: CAGAG--
Let m and n be the lengths of S and T respectively, and let k = n – m (of course, the problem only makes sense if m <= n). Give an algorithm to solve the problem in time O( m*k ).
Question 6. (10 points) Progressive multiple sequence alignment
The progressive alignment algorithm we have seen in class to solve the multiple sequence alignment is not guaranteed to produce an optimal solution. Assume a sum-of-pairs scoring scheme with a linear gap penalty of c = - 1 and the cost of matches/mismatches is +1/- 1.
Give a simple example of specific set of short DNA strings where the algorithm fails to produce an optimal result. Give the alignment and score produced by the algorithm, as well as the optimal alignment and its score.
Question 7. (10 points) Blast algorithm
Give an example of the two most similar DNA sequences of length 20 that Blast using word length w=5 will fail to align.
2022-09-25