Genome Bioinformatics (BIO726P) - Queen Mary

2022

Assignment details

This is a graded assignment. Briefly, you are expected to assemble a genome (Task 1) and answer questions on QMplus based on your assembly (Task 2).

We suggest you read through the entire document carefully before starting. Do not hesitate to ask on QMplus if there is any ambiguity or if you need help.

Task 1: Genome assembly

We want you to perform a *de novo* genome assembly for the red fire ant, *Solenopsis invicta*, because the official genome assembly of the species is quite fragmented (RefSeq accession number: GCF_000188075.2).

We obtained 50x genome coverage of long-molecule sequences from Pacific Biosciences' Sequel platform. However, as the assembly approaches for long-molecule sequences are relatively new (and constantly evolving!), we would like to test different combinations of assembly parameters.

We need your help in this experiment: we would like each of you to perform an assembly using a different combination of assembly parameters. The assembler and the parameters to use are shown in the table below (Table 1).

Input

Sequenced reads for assembly:

Please assume that no read cleaning is needed; use the input as it is.

How to run the assembly

It is best to run the assembly on Apocrita. To submit your job, follow the HPC documentation:

(check the Job Script Builder section as well).

Assembly software is installed on Apocrita, but you will need to activate it in your job script before the assembly command:

We recommend that you request 16 CPU cores, a total of 96 GB of RAM, and a run time of 3 days. This can be specified by adding the following lines to your job script:

*Remember*, you must tell the assembly software how many CPU cores you requested. Otherwise, it will use only 4 CPUs (see manual). Your assembly will likely take less than 72 hours (we expect between 6 and 24 hours), but it is good practice to request more run time than you expect to need.
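As a minimal sketch of passing the requested core count through to the assembler, the Python snippet below builds a wtdbg2 command line. It relies on wtdbg2's `-t` option for the number of threads. The file names are placeholders, not the real input path, and reading the core count from the `NSLOTS` environment variable assumes that the scheduler sets it inside a running job, as Grid Engine does.

```python
import os
import shlex


def build_wtdbg2_command(reads, prefix, assigned_params, threads=None):
    """Return a wtdbg2 command as an argument list.

    reads and prefix are hypothetical placeholders; assigned_params is the
    parameter string from your row of Table 1.
    """
    if threads is None:
        # Assumption: the scheduler exports NSLOTS inside the job (Grid Engine
        # does). Without it, fall back to wtdbg2's own default of 4 threads.
        threads = int(os.environ.get("NSLOTS", "4"))
    return (
        ["wtdbg2", "-t", str(threads), "-i", reads, "-o", prefix]
        + shlex.split(assigned_params)
    )


if __name__ == "__main__":
    cmd = build_wtdbg2_command(
        reads="reads.fastq.gz",   # placeholder, not the real input path
        prefix="assembly",        # placeholder output prefix
        assigned_params="-p 18 -S 1 -K 0.5 -A -s 0.05 -L 5000",
        threads=16,
    )
    # Print the command so it can be checked and recorded before running it.
    print(shlex.join(cmd))
```

If you follow wtdbg2 with its consensus tool wtpoa-cns, note that this step takes its own `-t` option as well.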

So that we can collect and evaluate your assemblies and your approach, please copy your entire assembly directory to the computer you used for the practicals.

The analyses that you do of the assembly can happen on any computer, including on the cluster. Please do the follow-up analyses in other directories.

A major challenge in this exercise can be getting things to run. For issues that you feel may be specific to the practical or your understanding, please ask on the QMplus forum. For issues with the cluster, permissions, or disk space, please contact ITS research support. In both cases, if it is a technical issue, please clearly state the command you ran and the error message you got.

Task 2: Answer questions

This task depends on having obtained an assembly from the previous task. If you are stuck and unable to get an assembly and time is running out (e.g., by Wednesday), don't hesitate to ask.

Using your assembly, answer the following questions. We want you to answer the questions directly on a form on QMplus; the questions are listed below for reference.

Answering some of the questions will require you to run additional commands. This can be done on the computer you used for the practicals. If you need to use specific software to answer any of the questions below, ask for help.

1. Provide the absolute path to the directory where you ran your assembly.

We expect you to run the assembly software using a directory structure similar to the practicals. You will be scored for getting the directory structure right and for keeping a record of the commands you ran in a file.

(hint: we should be able to run to view your directory structure)

2. Provide the absolute path to your assembly fasta file.

We will use your assembly for further analysis, so please make sure to provide the correct path.

(hint: we should be able to run to view the file)

3. Provide the number of contigs in your assembly.

4. Provide the total length of your assembly.

5. Provide the N50 length of your assembly.

6. Provide the length of the longest contig in your assembly.

7. WITHOUT running gene prediction software, provide an estimate (even if vague) of how many protein-coding genes the longest contig in your assembly contains.

There are several ways to answer this question and there is no best approach. The approach you come up with can be as simple or as sophisticated as you like. We ask you to explain your approach in the next question.

8. Explain how you answered the previous question and indicate potential shortcomings of your approach.

As indicated in the previous question, there is no best approach.

9. Cytochromes P450 (CYPs) are a family of enzymes produced by almost all living organisms. CYPs play a key role in the synthesis and metabolism of various molecules in the cell. In humans, a total of 57 genes encode CYPs; other species may have more or fewer. Provide an estimate of how many CYP genes are present in your assembly.

10. Explain how you answered the previous question and indicate potential shortcomings of your approach.

11. We have a set of paired Illumina reads that we would like you to map to your assembly as you normally would, and subsequently calculate the mean coverage for each contig using the software mosdepth. Path to reads:

Please assume that no read cleaning is needed; use the input as it is. Now, consider the ten contigs with the highest mean coverage. What is their coverage like? Provide a range, for example, "34 to 2300".
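For question 11, the ten contigs and their range can be read off the per-contig summary that mosdepth writes. The Python sketch below assumes the usual tab-separated summary file (typically named `<prefix>.mosdepth.summary.txt`) with a header containing the columns `chrom` and `mean`. The function names and any path you pass in are illustrative only.

```python
import csv


def top_contigs_by_mean_coverage(summary_path, n=10):
    """Return (contig, mean coverage) pairs for the n highest-coverage contigs."""
    rows = []
    with open(summary_path, newline="") as handle:
        for record in csv.DictReader(handle, delimiter="\t"):
            name = record["chrom"]
            # Skip the genome-wide "total" row and any per-region summary rows.
            if name == "total" or name.endswith("_region"):
                continue
            rows.append((name, float(record["mean"])))
    # Highest mean coverage first.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


def coverage_range(pairs):
    """Return (lowest, highest) mean coverage among the given contigs."""
    means = [mean for _, mean in pairs]
    return min(means), max(means)
```

Calling `coverage_range(top_contigs_by_mean_coverage(path))` gives the two numbers for the answer. Keeping the contig names alongside the means is deliberate, since the next question asks what those contigs might be.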

12. Are the ten contigs with the highest mean coverage likely from the normal nuclear fire ant genome or might some represent something different? How did you figure out the answer? Why do these contigs have such a high coverage?

13. In a real-life situation where this assembly will be the basis of several months or even years of follow-up work, which metrics would you use to have sufficient confidence that the assembly is "good enough"?
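The contiguity figures asked for in questions 3 to 6 (number of contigs, total length, N50 and longest contig) can be computed directly from the assembly fasta file. The following Python sketch is one way to do it, with function names chosen for illustration. It uses the common definition of N50: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length.

```python
import gzip


def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file (plain or .gz)."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    lengths = []
    current = None  # length of the sequence being read, None before the first header
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which the cumulative sum, longest first,
    reaches at least half of the total length."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarise(lengths):
    """Collect the basic contiguity metrics in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "longest": max(lengths, default=0),
    }
```

For example, `summarise(read_contig_lengths("assembly.fa"))` reports all four values, where `assembly.fa` is a placeholder file name. These metrics describe contiguity only, so they say nothing about whether the sequence is complete or correct.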

Table 1: Assembly software and parameters assigned to each student

Student ID	Assembly software	Assembly parameters
bt22027	wtdbg2	-p 18 -S 1 -K 0.5 -A -s 0.05 -L 5000
bt22031	wtdbg2	-p 18 -S 1 -K 0.4 -A -s 0.05 -L 5000
bt22866	wtdbg2	-p 18 -S 1 -K 0.2 -A -s 0.05 -L 5000
bt18597	wtdbg2	-p 18 -S 1 -K 0.3 -A -s 0.05 -L 5000
bt22038	wtdbg2	-p 18 -S 2 -K 0.5 -A -s 0.05 -L 5000
bt22003	wtdbg2	-p 18 -S 2 -K 0.4 -A -s 0.05 -L 5000
bt22048	wtdbg2	-p 18 -S 2 -K 0.2 -A -s 0.05 -L 5000
bt19339	wtdbg2	-p 18 -S 2 -K 0.3 -A -s 0.05 -L 5000
bt22947	wtdbg2	-p 18 -S 4 -K 0.5 -A -s 0.05 -L 5000
bt22007	wtdbg2	-p 18 -S 4 -K 0.4 -A -s 0.05 -L 5000
bt22043	wtdbg2	-p 18 -S 4 -K 0.2 -A -s 0.05 -L 5000
bt22047	wtdbg2	-p 18 -S 4 -K 0.3 -A -s 0.05 -L 5000
bt22941	wtdbg2	-p 19 -S 1 -K 0.5 -A -s 0.05 -L 5000
bt22040	wtdbg2	-p 19 -S 1 -K 0.4 -A -s 0.05 -L 5000
bt22934	wtdbg2	-p 19 -S 1 -K 0.2 -A -s 0.05 -L 5000
bt18541	wtdbg2	-p 19 -S 1 -K 0.3 -A -s 0.05 -L 5000
bt19682	wtdbg2	-p 19 -S 2 -K 0.5 -A -s 0.05 -L 5000
bt22880	wtdbg2	-p 19 -S 2 -K 0.4 -A -s 0.05 -L 5000
bt22911	wtdbg2	-p 19 -S 2 -K 0.2 -A -s 0.05 -L 5000
bt22068	wtdbg2	-p 19 -S 2 -K 0.3 -A -s 0.05 -L 5000
bt22900	wtdbg2	-p 19 -S 4 -K 0.5 -A -s 0.05 -L 5000
ml21566	wtdbg2	-p 19 -S 4 -K 0.4 -A -s 0.05 -L 5000
bt22862	wtdbg2	-p 19 -S 4 -K 0.2 -A -s 0.05 -L 5000
bt19540	wtdbg2	-p 19 -S 4 -K 0.3 -A -s 0.05 -L 5000
bt211031	wtdbg2	-p 20 -S 1 -K 0.5 -A -s 0.05 -L 5000
bt22050	wtdbg2	-p 20 -S 1 -K 0.4 -A -s 0.05 -L 5000
bt22010	wtdbg2	-p 20 -S 1 -K 0.2 -A -s 0.05 -L 5000
bt22061	wtdbg2	-p 20 -S 1 -K 0.3 -A -s 0.05 -L 5000
bt22071	wtdbg2	-p 20 -S 2 -K 0.5 -A -s 0.05 -L 5000
bt19575	wtdbg2	-p 20 -S 2 -K 0.4 -A -s 0.05 -L 5000
bt18070	wtdbg2	-p 20 -S 2 -K 0.2 -A -s 0.05 -L 5000
bt22032	wtdbg2	-p 20 -S 2 -K 0.3 -A -s 0.05 -L 5000
bt22879	wtdbg2	-p 20 -S 4 -K 0.5 -A -s 0.05 -L 5000
bt22028	wtdbg2	-p 20 -S 4 -K 0.4 -A -s 0.05 -L 5000
bt22640	wtdbg2	-p 20 -S 4 -K 0.2 -A -s 0.05 -L 5000
bt22627	wtdbg2	-p 20 -S 4 -K 0.3 -A -s 0.05 -L 5000