Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CS 128 | [MP2] DNA Forensics

Background

DNA encodes the genetic information found in all known organisms. Using four nitrogenous bases: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C), DNA encodes for different proteins which are responsible for the organism’s functionality. DNA profiling, the process of determining an individual's DNA   characteristics, is commonly used in forensic science, parentage tests, and medical research. However, there are over 3 billion nitrogenous bases in a typical  human genome and comparing every possible alignment to each person being profiled would be too computationally expensive. So how do we determine  which Person a given DNA sequence belongs to? We leverage the fact that most of the human genome is relatively similar, but there are certain areas of the human genome that have high genetic diversity. So instead of matching every person’s DNA to the given DNA, we can compare these highly diverse          regions. These regions contain Short Tandem Repeats (STR’s) which are short sequences of DNA that repeat consecutively.

 

Using multiple STRs, rather than just one, can improve the accuracy of DNA profiling. If the probability that two people have the same number of repeats for a single STR is 5%, and the analyst looks at 10 different STRs, then the probability that two DNA samples match purely by chance is about 1 in 1 quadrillion        (assuming all STRs are independent of each other). So if two DNA samples match in the number of repeats for each of the STRs, the analyst can be pretty        confident they came from the same person.

Suppose we have 3 people (Michael, Reese, and Nathan), with STR counts for ATTA AATG and TATC. Suppose Michael has 15 consecutive occurrences            of ATTA, 10 consecutive occurrences of AATG, and 12 consecutive occurrences of TATC. Similarly, Reese has 4 consecutive occurrences of ATTA, 6 consecutive

occurrences of AATG, 2 consecutive occurrences of TATC and Nathan has 10 consecutive occurrences of ATTA, 4 consecutive occurrences of AATG, and 2 consecutive occurrences of TATC.

Now suppose you’re given the following DNA Strand:

ATTAATTAATTAATTAAATGAATGAATGAATGAATGAATGTATCTATCATTAAATGTATC

Well, imagine that you looked through the DNA sequence for the longest consecutive sequence of repeated ATTAs and found that the longest sequence was 4 repeats long. If you then found that the longest sequence of AATGs is 6 repeats long, and the longest sequence of TATC is 2 repeats long, that would           provide pretty good evidence that the DNA was Reese’s. Notice that the longest consecutive sequenceis not simply the overall countof the STR in the strand.

Of course, it's also possible that once you take the counts for each of the STRs, it doesn't match anyone in your DNA database, in which case you have no match. If you were given the DNA strand:

ATTAATTAATTAATTAATTA

then there would be no match as none of the persons have only 5 ATTA’s in a row.

Assignment

Your program will be given the name of a DNA database file and a string representation of the DNA sequence we would like to determine to whom it            belongs. DNA database files can vary from one another in the number of STRs and number of people they contain. You can assume that a DNA database file will have at least one STR and one person present. You cannot assume that all DNA databases will contain exactly the same number of STRs (e.g., 3).

To begin the analysis, your first task is to write a program that reads into memory the DNA database. In this assignment, the DNA database will be encoded as a CSV file, where each row corresponds to an individual and each column corresponds to an STR. For example,

Name,ATTA,AATG,TATC

Michael,15,10,12

Reese,4,6,2

Nathan,10,4,2

The DNA database encoded as a CSV file in this manner communicates three important pieces of information. First, the STRs that we will be using in our     analysis (ATTA,AATG,TATC). Second, the names (Michael, Reese, Nathan) of the individuals involved in our study. Finally, the number of times each individual has a specific STR repeated consecutively in her/his DNA. Michael has ATTA repeated 15 times consecutively somewhere in his DNA, AATG 10 times,

and TATC 12 times. Reese has ATTA repeated 4 times consecutively somewhere in his DNA, AATG 6 times, and TATC 2 times. Nathan has ATTA repeated 10 times consecutively somewhere in his DNA, AATG 4 times, and TATC 2 times.

All of the information stored in the DNA database needs to be read into memory. We recommend that you read the database line-by-line into your program   using std::getline until there are no more lines to be read. For example, if we would like to read each line from the file example.dat, we could write something like this:

What's happening? We attempt to extract a line from the input file stream ifs and store it in line. This is the conditional of our for-statement. If this operation succeeds, we enter the loop body. That is, if a line is successfully extracted from ifs, we enter the loop body, and (in this example) put the line to standard out. After each iteration of the loop body, the line is "reset" to the empty string, and we attempt to extract another line from ifs. This process continues until there are no more lines to get, at which point getline will fail to extract, and the conditional will evaluate to false terminating the loop.

Always assume that the first row of anyDNA database will be the column names. The first column will not always be Name — for instance, if the dataset were French, it might read Nom — though the remaining columns will always be the STRs. Therefore, in our example, std::getline would

read Name,ATTA,AATG,TATC into a single std::string. To help you obtain the different   “columns”, we have provided you the function utilities::GetSubstrs that can produce a std::vector<std::string> {"Name", "ATTA", "AATG", "TATC"}. After reading the first row from the file, each additional row will correspond to an individual and each column to the number of times consecutively a particular STR repeats in their DNA. Once we begin reading str counts, notice the integer values from the file become stored in std::string form. This goes against our preference since an std::string has different behaviors than an integer. Therefore, it is advisable to convert those numbers into integer values. You should perform the conversion using std::stoi from the <string> facilities. Here's an example:

std::string number_as_string = "8";

int number_as_int = std::stoi(number_as_string);

/* OK, since number_as_string constitutes valid int. */

std::string string_as_string = "Howdy";

int string_as_int = std::stoi(string_as_string);

/* ERROR: terminating with uncaught exception of type std::invalid argument: stoi: no conversion  */

You have the freedom to determine how you would like to organize all of this information in your program.

Now, we direct our attention to the DNA sequence under analysis. For each of the STRs from the DNA database, you compute and record

the longestconsecutive run of repeats for that STR in the DNA sequence. Subsequently, you compare these counts to each individual’s counts. Should the STR counts in the DNA sequence match exactly one person from the database, write the name of the individual to standard output. Otherwise, you should  write   “No match” to standard output.

Acknowledgement

Components of this MP were inspired by this NIFTY assignment.

Starter Code

classroom.github.com/a/bpXTDOyv

Requirements

Your program must consider each STR independent of the other STRs. For example, if you had a function that returned the longest consecutive sequence of an STR in a given DNA sequence (we recommend this), that function would be called once per STR!

When looking for an STR in the strand, you cannot reuse characters for one match, to create another match for the sameSTR; you cannot overlap          characters for matches of the  sameSTR. Accordingly, upon finding an STR match in the sequence, you would continue scanning for other matches from the character in the strand that follows directly after the characters comprising the match. For example,

 

Kindly understand that the longest consecutive sequenceis not simply the overall countof the STR in the strand. The longest consecutive   sequence is the longest consecutive"count" or "run" of an STR side-by-side itself in the strand. In contrast, the overall count of an STR is the total number of times the STR is in the strand.

The first comma-separated entry on the first line of the DNA Database will always be a string, but you cannot assume that that string will always be Names.

DNA database input files will vary in the number of STRs. You can assume that the DNA database input file will have at least one STR. You cannot assume that our test input files contain only have three STRs.

You must create your own STR DNA database input files and DNA strands to test your application

  STR DNA database

   Do not create these in Excel. Instead, create a new file in VS code while connected to your container/VM

   Compose this STR DNA database file such that it follows the format specified in this prompt

   Name the file using any extension (e.g., .dat, .csv, etc.); what's important is the format of the file's contents

   DNA sequences

You should compose multiple DNA sequences for testing

We recommend having strands that place the longest consecutive sequence of the STR in different positions relative to shorter consecutive sequences of the same STR

Remember, it is the longest consecutive sequence of STRs that matters

Program Usage

  Your program must require as its first command-line argument the name of a data file containing STRs (e.g., data.csv) and must require as its second

command-line argument the string of DNA to analyze. Note: the zeroth argument is always the name of the program.

  Your program must write the correct response to standard output and subsequently terminate.

   For example,

> ./bin/exec data.csv ATTAATTAATTAATTAAATGAATGAATGAATGAATGAATGTATCTATC

Reese

The command and arguments to the program are shown in red; the output in blue. Note:   “<”is the terminal prompt; it is not a part of either the input nor the output.

   Note: if you're running your program with our visual debugger, please provide the absolute path to the input file. For example, if you're developing

in ~/src/mp-dna-forensics-michaelrnowak/ and your test case is in ./tests/input-test-1.dat relative to that directory, the absolute path to input-test-

1.dat is /home/vagrant/src/mp-dna-forensics-michaelrnowak/tests/input-test-1.dat. The absolute path to the tests directory could be found by cd into tests and executing pwd.

  Your program must compile without warnings/errors when compiled with: clang++ using the -std=c++20 and the following flags -Wall -Wextra -Werror -

pedantic

Constraints

Only the following header files are allowed to be #included in your solution files:

<cctype> <cassert> <fstream> <iostream> <map> <string> <vector> "functions.hpp" "utilities.hpp"

Example

For the CSV file:

Names,AGATG,AATG,TAT

Casey,5,2,8

Amani,3,7,4

Blair,6,1,5

The DNA sequence

AGACGGGTTACCATGACTATTATTATTATTATTATTATTATACGTACGTACGTATGAGATGAGATGAGATGAGATGAGATGCCTCGACTTCGATCGCAATGAATG

would map to Casey,

TATTATTATTATAACCCTGCGCGCGCGCGATCCAGCATTAGCTAGCATCAAGATGAGATGAGATGGAATTTCGAAATGAATGAATGAATGAATGAATGAATG

to Amani, and

GGTACAGATGGCAAAGATGAGATGAGATGGTCGTCGAGCAATCGTTTCGATAATGAATGAATGAATGAATGAATGAATGACACACGTCGATGCTAGCGGCG

to No match.

Hints

Be sure to create helper functions

Look into std::stoi to convert strings to integers

No Match happens when none of the people matches the DNA or when more than one person matches the DNA

Be sure to include helper functions to facilitate the comparisons, otherwise you will end up with spaghetti code