CS 128 | [MP2] DNA Forensics
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
CS 128 | [MP2] DNA Forensics
Background
DNA encodes the genetic information found in all known organisms. Using four nitrogenous bases: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C), DNA encodes for different proteins which are responsible for the organism’s functionality. DNA profiling, the process of determining an individual's DNA characteristics, is commonly used in forensic science, parentage tests, and medical research. However, there are over 3 billion nitrogenous bases in a typical human genome and comparing every possible alignment to each person being profiled would be too computationally expensive. So how do we determine which Person a given DNA sequence belongs to? We leverage the fact that most of the human genome is relatively similar, but there are certain areas of the human genome that have high genetic diversity. So instead of matching every person’s DNA to the given DNA, we can compare these highly diverse regions. These regions contain Short Tandem Repeats (STR’s) which are short sequences of DNA that repeat consecutively.
Using multiple STRs, rather than just one, can improve the accuracy of DNA profiling. If the probability that two people have the same number of repeats for a single STR is 5%, and the analyst looks at 10 different STRs, then the probability that two DNA samples match purely by chance is about 1 in 1 quadrillion (assuming all STRs are independent of each other). So if two DNA samples match in the number of repeats for each of the STRs, the analyst can be pretty confident they came from the same person.
Suppose we have 3 people (Michael, Reese, and Nathan), with STR counts for ATTA AATG and TATC. Suppose Michael has 15 consecutive occurrences of ATTA, 10 consecutive occurrences of AATG, and 12 consecutive occurrences of TATC. Similarly, Reese has 4 consecutive occurrences of ATTA, 6 consecutive
occurrences of AATG, 2 consecutive occurrences of TATC and Nathan has 10 consecutive occurrences of ATTA, 4 consecutive occurrences of AATG, and 2 consecutive occurrences of TATC.
Now suppose you’re given the following DNA Strand:
ATTAATTAATTAATTAAATGAATGAATGAATGAATGAATGTATCTATCATTAAATGTATC
Well, imagine that you looked through the DNA sequence for the longest consecutive sequence of repeated ATTAs and found that the longest sequence was 4 repeats long. If you then found that the longest sequence of AATGs is 6 repeats long, and the longest sequence of TATC is 2 repeats long, that would provide pretty good evidence that the DNA was Reese’s. Notice that the longest consecutive sequenceis not simply the overall countof the STR in the strand.
Of course, it's also possible that once you take the counts for each of the STRs, it doesn't match anyone in your DNA database, in which case you have no match. If you were given the DNA strand:
ATTAATTAATTAATTAATTA
then there would be no match as none of the persons have only 5 ATTA’s in a row.
Assignment
Your program will be given the name of a DNA database file and a string representation of the DNA sequence we would like to determine to whom it belongs. DNA database files can vary from one another in the number of STRs and number of people they contain. You can assume that a DNA database file will have at least one STR and one person present. You cannot assume that all DNA databases will contain exactly the same number of STRs (e.g., 3).
To begin the analysis, your first task is to write a program that reads into memory the DNA database. In this assignment, the DNA database will be encoded as a CSV file, where each row corresponds to an individual and each column corresponds to an STR. For example,
Name,ATTA,AATG,TATC
Michael,15,10,12
Reese,4,6,2
Nathan,10,4,2
The DNA database encoded as a CSV file in this manner communicates three important pieces of information. First, the STRs that we will be using in our analysis (ATTA,AATG,TATC). Second, the names (Michael, Reese, Nathan) of the individuals involved in our study. Finally, the number of times each individual has a specific STR repeated consecutively in her/his DNA. Michael has ATTA repeated 15 times consecutively somewhere in his DNA, AATG 10 times,
and TATC 12 times. Reese has ATTA repeated 4 times consecutively somewhere in his DNA, AATG 6 times, and TATC 2 times. Nathan has ATTA repeated 10 times consecutively somewhere in his DNA, AATG 4 times, and TATC 2 times.
All of the information stored in the DNA database needs to be read into memory. We recommend that you read the database line-by-line into your program using std::getline until there are no more lines to be read. For example, if we would like to read each line from the file example.dat, we could write something like this:
What's happening? We attempt to extract a line from the input file stream ifs and store it in line. This is the conditional of our for-statement. If this operation succeeds, we enter the loop body. That is, if a line is successfully extracted from ifs, we enter the loop body, and (in this example) put the line to standard out. After each iteration of the loop body, the line is "reset" to the empty string, and we attempt to extract another line from ifs. This process continues until there are no more lines to get, at which point getline will fail to extract, and the conditional will evaluate to false terminating the loop.
Always assume that the first row of anyDNA database will be the column names. The first column will not always be Name — for instance, if the dataset were French, it might read Nom — though the remaining columns will always be the STRs. Therefore, in our example, std::getline would
read Name,ATTA,AATG,TATC into a single std::string. To help you obtain the different “columns”, we have provided you the function utilities::GetSubstrs that can produce a std::vector<std::string> {"Name", "ATTA", "AATG", "TATC"}. After reading the first row from the file, each additional row will correspond to an individual and each column to the number of times consecutively a particular STR repeats in their DNA. Once we begin reading str counts, notice the integer values from the file become stored in std::string form. This goes against our preference since an std::string has different behaviors than an integer. Therefore, it is advisable to convert those numbers into integer values. You should perform the conversion using std::stoi from the <string> facilities. Here's an example:
std::string number_as_string = "8"; int number_as_int = std::stoi(number_as_string); /* OK, since number_as_string constitutes valid int. */ |
std::string string_as_string = "Howdy"; int string_as_int = std::stoi(string_as_string); /* ERROR: terminating with uncaught exception of type std::invalid argument: stoi: no conversion */ |
You have the freedom to determine how you would like to organize all of this information in your program.
Now, we direct our attention to the DNA sequence under analysis. For each of the STRs from the DNA database, you compute and record
the longestconsecutive run of repeats for that STR in the DNA sequence. Subsequently, you compare these counts to each individual’s counts. Should the STR counts in the DNA sequence match exactly one person from the database, write the name of the individual to standard output. Otherwise, you should write “No match” to standard output.
Acknowledgement
Components of this MP were inspired by this NIFTY assignment.
Starter Code
classroom.github.com/a/bpXTDOyv
Requirements
Your program must consider each STR independent of the other STRs. For example, if you had a function that returned the longest consecutive sequence of an STR in a given DNA sequence (we recommend this), that function would be called once per STR!
When looking for an STR in the strand, you cannot reuse characters for one match, to create another match for the sameSTR; you cannot overlap characters for matches of the sameSTR. Accordingly, upon finding an STR match in the sequence, you would continue scanning for other matches from the character in the strand that follows directly after the characters comprising the match. For example,
Kindly understand that the longest consecutive sequenceis not simply the overall countof the STR in the strand. The longest consecutive sequence is the longest consecutive"count" or "run" of an STR side-by-side itself in the strand. In contrast, the overall count of an STR is the total number of times the STR is in the strand.
The first comma-separated entry on the first line of the DNA Database will always be a string, but you cannot assume that that string will always be Names.
DNA database input files will vary in the number of STRs. You can assume that the DNA database input file will have at least one STR. You cannot assume that our test input files contain only have three STRs.
You must create your own STR DNA database input files and DNA strands to test your application
STR DNA database
Do not create these in Excel. Instead, create a new file in VS code while connected to your container/VM
Compose this STR DNA database file such that it follows the format specified in this prompt
Name the file using any extension (e.g., .dat, .csv, etc.); what's important is the format of the file's contents
DNA sequences
You should compose multiple DNA sequences for testing
We recommend having strands that place the longest consecutive sequence of the STR in different positions relative to shorter consecutive sequences of the same STR
Remember, it is the longest consecutive sequence of STRs that matters
Program Usage
Your program must require as its first command-line argument the name of a data file containing STRs (e.g., data.csv) and must require as its second
command-line argument the string of DNA to analyze. Note: the zeroth argument is always the name of the program.
Your program must write the correct response to standard output and subsequently terminate.
For example,
> ./bin/exec data.csv ATTAATTAATTAATTAAATGAATGAATGAATGAATGAATGTATCTATC
Reese
The command and arguments to the program are shown in red; the output in blue. Note: “<”is the terminal prompt; it is not a part of either the input nor the output.
Note: if you're running your program with our visual debugger, please provide the absolute path to the input file. For example, if you're developing
in ~/src/mp-dna-forensics-michaelrnowak/ and your test case is in ./tests/input-test-1.dat relative to that directory, the absolute path to input-test-
1.dat is /home/vagrant/src/mp-dna-forensics-michaelrnowak/tests/input-test-1.dat. The absolute path to the tests directory could be found by cd into tests and executing pwd.
Your program must compile without warnings/errors when compiled with: clang++ using the -std=c++20 and the following flags -Wall -Wextra -Werror -
pedantic
Constraints
Only the following header files are allowed to be #included in your solution files:
<cctype> <cassert> <fstream> <iostream> <map> <string> <vector> "functions.hpp" "utilities.hpp"
Example
For the CSV file:
Names,AGATG,AATG,TAT
Casey,5,2,8
Amani,3,7,4
Blair,6,1,5
The DNA sequence
AGACGGGTTACCATGACTATTATTATTATTATTATTATTATACGTACGTACGTATGAGATGAGATGAGATGAGATGAGATGCCTCGACTTCGATCGCAATGAATG
would map to Casey,
TATTATTATTATAACCCTGCGCGCGCGCGATCCAGCATTAGCTAGCATCAAGATGAGATGAGATGGAATTTCGAAATGAATGAATGAATGAATGAATGAATG
to Amani, and
GGTACAGATGGCAAAGATGAGATGAGATGGTCGTCGAGCAATCGTTTCGATAATGAATGAATGAATGAATGAATGAATGACACACGTCGATGCTAGCGGCG
to No match.
Hints
Be sure to create helper functions
Look into std::stoi to convert strings to integers
No Match happens when none of the people matches the DNA or when more than one person matches the DNA
Be sure to include helper functions to facilitate the comparisons, otherwise you will end up with spaghetti code
2023-02-16