
Genome Bioinformatics (BIO726P) - Queen Mary

2022


Assignment details

This is a graded assignment. Briefly, you are expected to assemble a genome (Task 1) and answer questions on QMplus based on your assembly (Task 2).

We suggest you read through the entire document carefully before starting. Do not hesitate to ask on QMplus if there is any ambiguity or if you need help.

Task 1: Genome assembly

We want you to perform a *de novo* genome assembly for the red fire ant, *Solenopsis invicta*. This is because the official genome assembly of the species is quite fragmented (RefSeq accession number: GCF_000188075.2).

We obtained 50x genome coverage of long-molecule sequences from Pacific Biosciences' Sequel platform. However, as the assembly approaches for long-molecule sequences are relatively new (and constantly evolving!), we would like to test different combinations of assembly parameters.

We need your help in this experiment: we would like each of you to perform an assembly using a different combination of assembly parameters. The assembler and the parameters to use are shown in the table below (Table 1).

Input

Sequenced reads for assembly:

Please assume that no read cleaning is needed: use the input as it is.

How to run the assembly

It is best to run the assembly on Apocrita. To submit your job, follow the HPC documentation:

(check the Job Script Builder section as well).

Assembly software is installed on Apocrita, but you will need to activate it in your job script before the assembly command:

We recommend that you request 16 CPU cores, a total of 96 GB of RAM, and a run time of 3 days. This can be specified by adding the following lines to your job script:

*Remember*, you must tell the assembly software how many CPU cores you requested. Otherwise, it will use only 4 CPUs (see the manual). Your assembly will likely take less than 72 hours (we expect between 6 and 24 hours), but it is good practice to request more run time than you expect to need.
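As an illustration of passing the core count to the assembler, here is a minimal Python sketch that builds a wtdbg2 command line from one row of Table 1. It assumes a Grid Engine style scheduler that exports the granted core count as `NSLOTS`, and it uses wtdbg2's `-t` (threads), `-i` (input reads) and `-o` (output prefix) options. The file paths are hypothetical placeholders, and the separate consensus step that wtdbg2 needs afterwards is not shown.

```python
import os
import shlex


def build_wtdbg2_command(assigned_params, reads_path, output_prefix, threads=None):
    """Return the wtdbg2 argument list for one assigned parameter set."""
    if threads is None:
        # Assumption: the scheduler exports NSLOTS with the number of granted
        # cores; fall back to 1 when running outside a job.
        threads = int(os.environ.get("NSLOTS", "1"))
    command = ["wtdbg2"]
    command += shlex.split(assigned_params)  # the parameters from Table 1
    command += ["-t", str(threads)]          # without -t, wtdbg2 uses 4 threads
    command += ["-i", reads_path]            # input reads
    command += ["-o", output_prefix]         # prefix for output files
    return command


if __name__ == "__main__":
    cmd = build_wtdbg2_command(
        "-p 18 -S 1 -K 0.5 -A -s 0.05 -L 5000",
        reads_path="input/reads.fq.gz",      # hypothetical path
        output_prefix="tmp/assembly",        # hypothetical prefix
        threads=16,
    )
    print(shlex.join(cmd))
```

Printing the command before submitting makes it easy to paste the exact line into your record of commands.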


So that we can collect and evaluate your assemblies and your approach, please copy your entire directory to the computer you used for the practicals.

The analyses that you do of the assembly can happen on any computer, including on the cluster. Please do the follow-up analyses in other directories.

A major challenge in this exercise can be getting things to run. For issues that you feel may be specific to the practical or your understanding, please ask on the QMplus forum. For issues with the cluster, permissions, or disk space, please contact ITS research support. In both cases, if it is a technical issue, please clearly state the command you ran and the error message you got.

Task 2: Answer questions

This task depends on having obtained an assembly from the previous task. If you are stuck and unable to get an assembly and time is running out (e.g., by Wednesday), don't hesitate to ask.

Using your assembly, answer the following questions. Please answer them directly on the form on QMplus; the questions are listed below for reference.

Answering some of the questions will require you to run additional commands. This can be done on the computer you used for the practicals. If you need to use specific software to answer any of the questions below, ask for help.

1. Provide the absolute path to the directory where you ran your assembly.

We expect you to run the assembly software using a directory structure similar to the practicals. You will be scored for getting the directory structure right and for keeping a record of the commands you ran in a file.

 (hint: we should be able to run to view your directory structure)

2. Provide the absolute path to your assembly fasta file.

We will use your assembly for further analysis, so please make sure to provide the correct path.

(hint: we should be able to run to view the file)

3. Provide the number of contigs in your assembly.

4. Provide the total length of your assembly.

5. Provide the N50 length of your assembly.

6. Provide the length of the longest contig in your assembly.

7. WITHOUT running gene prediction software, provide an estimate (even if vague) of how many protein-coding genes the longest contig in your assembly contains.

There are several ways to answer this question and there is no best approach. The approach you come up with can be as simple or as sophisticated as you like. We ask you to explain your approach in the next question.

8. Explain how you answered the previous question and indicate potential shortcomings of your approach.

As indicated in the previous question, there is no best approach.

9. Cytochromes P450 (CYPs) are a family of enzymes produced by almost all living organisms. CYPs play a key role in the synthesis and metabolism of various molecules in the cell. In humans, a total of 57 genes encode CYPs; other species may have more or fewer. Provide an estimate of how many CYP genes are present in your assembly.

10. Explain how you answered the previous question and indicate potential shortcomings of your approach.

11. We have a set of paired Illumina reads that we would like you to map to your assembly as you normally would, and subsequently calculate the mean coverage for each contig using the software mosdepth. Path to reads:

Please assume that no read cleaning is needed: use the input as it is. Now, consider the ten contigs with the highest mean coverage. What is their coverage like? Provide a range, for example, "34 to 2300".

12. Are the ten contigs with the highest mean coverage likely from the normal nuclear fire ant genome or might some represent something different? How did you figure out the answer? Why do these contigs have such a high coverage?

13. In a real-life situation where this assembly will be the basis of several months or even years of follow-up work, which metrics would you use to have sufficient confidence that the assembly is "good enough"?
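The contiguity metrics asked for in questions 3 to 6 can be computed directly from contig lengths. The following is a minimal Python sketch, not a replacement for the tools used in the practicals. It assumes a plain (uncompressed) FASTA input, and all function names are illustrative. N50 is taken here as the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size.

```python
def read_contig_lengths(fasta_lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new contig; store the previous one first.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which the running total reaches half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarise(lengths):
    """Collect contig count, total length, N50 and longest contig."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "longest": max(lengths, default=0),
    }


if __name__ == "__main__":
    # Toy input with three contigs of 10, 6 and 3 bases.
    toy = [">ctg1", "ACGTACGTAC", ">ctg2", "ACGT", "AC", ">ctg3", "ACG"]
    print(summarise(read_contig_lengths(toy)))
```

For a real assembly, an open file handle can be passed to `read_contig_lengths` in place of the toy list.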
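For question 11, the per-contig mean coverage can be ranked once mosdepth has run. Here is a minimal Python sketch, assuming the tab-separated `*.mosdepth.summary.txt` file that mosdepth writes, with a header containing `chrom` and `mean` columns and a final `total` row. The function names and the toy values are illustrative only.

```python
import csv
import io


def top_coverage_contigs(summary_text, n=10):
    """Return (contig, mean coverage) pairs for the n highest-coverage contigs."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    rows = []
    for row in reader:
        name = row["chrom"]
        # Skip the genome-wide "total" row and any per-region rows.
        if name == "total" or name.endswith("_region"):
            continue
        rows.append((name, float(row["mean"])))
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


def coverage_range(pairs):
    """Format the lowest and highest mean coverage as 'low to high'."""
    means = [mean for _, mean in pairs]
    return f"{min(means):g} to {max(means):g}"


if __name__ == "__main__":
    # Toy summary with made-up numbers, only to show the expected layout.
    toy = (
        "chrom\tlength\tbases\tmean\tmin\tmax\n"
        "ctg1\t1000\t30000\t30.00\t0\t80\n"
        "ctg2\t500\t1000000\t2000.00\t10\t5000\n"
        "ctg3\t800\t36000\t45.00\t0\t90\n"
        "total\t2300\t1066000\t463.48\t0\t5000\n"
    )
    top = top_coverage_contigs(toy, n=2)
    print(top, coverage_range(top))
```

Listing contig lengths alongside the ranked coverage can help with question 12, since it shows whether the high-coverage contigs are unusually short.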

Student ID

Assembly software

Assembly parameters

bt22027

wtdbg2

-p 18 -S 1 -K 0.5 -A -s 0.05 -L 5000

bt22031

wtdbg2

-p 18 -S 1 -K 0.4 -A -s 0.05 -L 5000

bt22866

wtdbg2

-p 18 -S 1 -K 0.2 -A -s 0.05 -L 5000

bt18597

wtdbg2

-p 18 -S 1 -K 0.3 -A -s 0.05 -L 5000

bt22038

wtdbg2

-p 18 -S 2 -K 0.5 -A -s 0.05 -L 5000

bt22003

wtdbg2

-p 18 -S 2 -K 0.4 -A -s 0.05 -L 5000

bt22048

wtdbg2

-p 18 -S 2 -K 0.2 -A -s 0.05 -L 5000

bt19339

wtdbg2

-p 18 -S 2 -K 0.3 -A -s 0.05 -L 5000

bt22947

wtdbg2

-p 18 -S 4 -K 0.5 -A -s 0.05 -L 5000

bt22007

wtdbg2

-p 18 -S 4 -K 0.4 -A -s 0.05 -L 5000

bt22043

wtdbg2

-p 18 -S 4 -K 0.2 -A -s 0.05 -L 5000

bt22047

wtdbg2

-p 18 -S 4 -K 0.3 -A -s 0.05 -L 5000

bt22941

wtdbg2

-p 19 -S 1 -K 0.5 -A -s 0.05 -L 5000

bt22040

wtdbg2

-p 19 -S 1 -K 0.4 -A -s 0.05 -L 5000

bt22934

wtdbg2

-p 19 -S 1 -K 0.2 -A -s 0.05 -L 5000

bt18541

wtdbg2

-p 19 -S 1 -K 0.3 -A -s 0.05 -L 5000

bt19682

wtdbg2

-p 19 -S 2 -K 0.5 -A -s 0.05 -L 5000

 

Student ID

Assembly software

Assembly parameters

bt22880

wtdbg2

-p 19 -S 2 -K 0.4 -A -s 0.05 -L 5000

bt22911

wtdbg2

-p 19 -S 2 -K 0.2 -A -s 0.05 -L 5000

bt22068

wtdbg2

-p 19 -S 2 -K 0.3 -A -s 0.05 -L 5000

bt22900

wtdbg2

-p 19 -S 4 -K 0.5 -A -s 0.05 -L 5000

ml21566

wtdbg2

-p 19 -S 4 -K 0.4 -A -s 0.05 -L 5000

bt22862

wtdbg2

-p 19 -S 4 -K 0.2 -A -s 0.05 -L 5000

bt19540

wtdbg2

-p 19 -S 4 -K 0.3 -A -s 0.05 -L 5000

bt211031

wtdbg2

-p 20 -S 1 -K 0.5 -A -s 0.05 -L 5000

bt22050

wtdbg2

-p 20 -S 1 -K 0.4 -A -s 0.05 -L 5000

bt22010

wtdbg2

-p 20 -S 1 -K 0.2 -A -s 0.05 -L 5000

bt22061

wtdbg2

-p 20 -S 1 -K 0.3 -A -s 0.05 -L 5000

bt22071

wtdbg2

-p 20 -S 2 -K 0.5 -A -s 0.05 -L 5000

bt19575

wtdbg2

-p 20 -S 2 -K 0.4 -A -s 0.05 -L 5000

bt18070

wtdbg2

-p 20 -S 2 -K 0.2 -A -s 0.05 -L 5000

bt22032

wtdbg2

-p 20 -S 2 -K 0.3 -A -s 0.05 -L 5000

bt22879

wtdbg2

-p 20 -S 4 -K 0.5 -A -s 0.05 -L 5000

bt22028

wtdbg2

-p 20 -S 4 -K 0.4 -A -s 0.05 -L 5000

bt22640

wtdbg2

-p 20 -S 4 -K 0.2 -A -s 0.05 -L 5000

bt22627

wtdbg2

-p 20 -S 4 -K 0.3 -A -s 0.05 -L 5000