Genome Bioinformatics (BIO726P) - Queen Mary
2022
Assignment details
This is a graded assignment. Briefly, you are expected to assemble a genome (Task 1) and answer questions on QMplus based on your assembly (Task 2).
We suggest you read through the entire document carefully before starting. Do not hesitate to ask on QMplus if there is any ambiguity or if you need help.
Task 1: Genome assembly
We want you to perform a *de novo* genome assembly for the red fire ant, *Solenopsis invicta*. This is because the official genome assembly of the species is quite fragmented (RefSeq accession number: GCF_000188075.2).
We obtained 50x genome coverage of long-molecule sequences from Pacific Biosciences' Sequel platform. However, as the assembly approaches for long-molecule sequences are relatively new (and constantly evolving!), we would like to test different combinations of assembly parameters.
We need your help in this experiment: We would like each of you to perform an assembly using a different combination of assembly parameters. The assembler and the parameters to use are shown in the table below (Table 1).
Input
Sequenced reads for assembly:
Please assume that no read cleaning is needed: you should use the input as it is.
How to run the assembly
It is best to run the assembly on Apocrita. To submit your job, follow the HPC documentation:
(check the Job Script Builder section as well).
Assembly software is installed on Apocrita, but you will need to activate it in your job script before the assembly command:
We recommend you request 16 CPU cores, a total of 96 GB of RAM, and a run time of 3 days. This can be specified by adding the following lines in your job script:
*Remember*, you must tell the assembly software how many CPU cores you requested. Otherwise, it will use only 4 CPUs (see manual). Your assembly will likely take less than 72 hours (we expect between 6 and 24 hours), but it is good practice to request more run time than you expect to need.
So that we can collect and evaluate your assemblies and your approach, please copy your entire directory to the computer you used for the practicals.
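The point about CPU cores can be illustrated with a minimal Python sketch that assembles the command line without running it. It assumes that the scheduler exposes the number of granted cores in the `NSLOTS` environment variable and that wtdbg2 takes its thread count via `-t`, its input via `-i`, and its output prefix via `-o`; please check both assumptions against the HPC documentation and the wtdbg2 manual. The file name and prefix are hypothetical, and any other options the manual requires are left out.

```python
import os
import shlex


def build_wtdbg2_command(reads, prefix, assigned_params, default_threads=4):
    """Return the wtdbg2 command as a list of arguments (nothing is executed)."""
    # Assumption: NSLOTS holds the number of cores granted by the scheduler.
    threads = int(os.environ.get("NSLOTS", default_threads))
    return (
        ["wtdbg2", "-t", str(threads), "-i", reads, "-o", prefix]
        + shlex.split(assigned_params)
    )


if __name__ == "__main__":
    cmd = build_wtdbg2_command(
        reads="reads.fq.gz",   # hypothetical input file name
        prefix="assembly",     # hypothetical output prefix
        assigned_params="-p 18 -S 1 -K 0.5 -A -s 0.05 -L 5000",  # example row from Table 1
    )
    print(shlex.join(cmd))
```

Reading the core count from the environment keeps the thread setting in step with the resources requested in the job script, so the two cannot drift apart.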
The analyses that you do of the assembly can happen on any computer, including on the cluster. Please do the follow-up analyses in other directories.
A major challenge in this exercise can be getting things to run. For issues that you feel may be specific to the practical or your understanding, please ask on the QMplus forum. For issues with the cluster, permissions, or disk space, please contact ITS research support. In both cases, if it is a technical issue, please clearly state the command you ran and the error message you got.
Task 2: Answer questions
This task depends on having obtained an assembly from the previous task. If you are stuck and unable to get an assembly and time is running out (e.g., by Wednesday), don't hesitate to ask.
Using your assembly, answer the following questions. We want you to answer the questions directly on a form on QMplus; the questions are listed below for reference.
Answering some of the questions will require you to run additional commands. This can be done on the computer you used for the practicals. If you need to use specific software to answer any of the questions below, ask for help.
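Several of the questions below (number of contigs, total length, N50, longest contig) reduce to simple statistics over contig lengths. As a minimal sketch, assuming the assembly is a standard FASTA file (optionally gzip-compressed), they could be computed as follows. The function names are illustrative, and N50 is taken here as the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total.

```python
import gzip


def read_contig_lengths(fasta_path):
    """Return a list with the length of every sequence in a FASTA file."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    lengths = []
    current = None  # length of the record being read, None before the first header
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which the cumulative sum reaches half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarise(lengths):
    """Collect the basic contiguity statistics in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "longest": max(lengths, default=0),
    }
```

A call such as `summarise(read_contig_lengths("assembly.fa"))`, with a hypothetical file name, returns all four values at once. Comparing them against a dedicated assembly statistics tool is a sensible cross-check.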
1. Provide the absolute path to the directory where you ran your assembly.
We expect you to run the assembly software using a directory structure similar to the practicals. You will be scored for getting the directory structure right and for keeping a record of the commands you ran in a file.
(hint: we should be able to run to view your directory structure)
2. Provide the absolute path to your assembly fasta file.
We will use your assembly for further analysis, so please make sure to provide the correct path.
(hint: we should be able to run to view the file)
3. Provide the number of contigs in your assembly.
4. Provide the total length of your assembly.
5. Provide the N50 length of your assembly.
6. Provide the length of the longest contig in your assembly.
7. WITHOUT running a gene prediction software, provide an estimate (even if vague) of how many protein-coding genes the longest contig in your assembly contains.
There are several ways to answer this question and there is no best approach. The approach you come up with can be as simple or as sophisticated as you like. We ask you to explain your approach in the next question.
8. Explain how you answered the previous question and indicate potential shortcomings of your approach.
As indicated in the previous question, there is no best approach.
9. Cytochromes P450 (CYPs) are a family of enzymes produced by almost all living organisms. CYPs play a key role in the synthesis and metabolism of various molecules in the cell. In humans, a total of 57 genes encode CYPs; other species may have more or fewer. Provide an estimate of how many CYP genes are present in your assembly.
10. Explain how you answered the previous question and indicate potential shortcomings of your approach.
11. We have a set of paired Illumina reads that we would like you to map to your assembly as you normally would, and subsequently calculate the mean coverage for each contig using the software mosdepth. Path to reads:
Please assume that no read cleaning is needed: you should use the input as it is. Now, consider the ten contigs with the highest mean coverage. What is their coverage like? Provide a range, for example, "34 to 2300".
12. Are the ten contigs with the highest mean coverage likely from the normal nuclear fire ant genome or might some represent something different? How did you figure out the answer? Why do these contigs have such a high coverage?
13. In a real-life situation where this assembly will be the basis of several months or even years of follow-up work, which metrics would you use to have sufficient confidence that the assembly is "good enough"?
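For question 11, the per-contig means can be ranked with a few lines of Python. The sketch below assumes the tab-separated summary file that mosdepth writes alongside its other outputs, with a header row containing `chrom` and `mean` columns, a final `total` row, and possibly `_region` rows; check this against the file you actually obtain. The function name is illustrative.

```python
import csv


def top_mean_coverage(summary_path, n=10):
    """Return the n (contig, mean coverage) pairs with the highest mean coverage."""
    rows = []
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            name = row["chrom"]
            # Skip the genome-wide "total" row and any "_region" rows.
            if name == "total" or name.endswith("_region"):
                continue
            rows.append((name, float(row["mean"])))
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]
```

The requested range is then the smallest and largest mean among the returned pairs.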
Student ID |
Assembly software |
Assembly parameters |
bt22027 |
wtdbg2 |
-p 18 -S 1 -K 0.5 -A -s 0.05 -L 5000 |
bt22031 |
wtdbg2 |
-p 18 -S 1 -K 0.4 -A -s 0.05 -L 5000 |
bt22866 |
wtdbg2 |
-p 18 -S 1 -K 0.2 -A -s 0.05 -L 5000 |
bt18597 |
wtdbg2 |
-p 18 -S 1 -K 0.3 -A -s 0.05 -L 5000 |
bt22038 |
wtdbg2 |
-p 18 -S 2 -K 0.5 -A -s 0.05 -L 5000 |
bt22003 |
wtdbg2 |
-p 18 -S 2 -K 0.4 -A -s 0.05 -L 5000 |
bt22048 |
wtdbg2 |
-p 18 -S 2 -K 0.2 -A -s 0.05 -L 5000 |
bt19339 |
wtdbg2 |
-p 18 -S 2 -K 0.3 -A -s 0.05 -L 5000 |
bt22947 |
wtdbg2 |
-p 18 -S 4 -K 0.5 -A -s 0.05 -L 5000 |
bt22007 |
wtdbg2 |
-p 18 -S 4 -K 0.4 -A -s 0.05 -L 5000 |
bt22043 |
wtdbg2 |
-p 18 -S 4 -K 0.2 -A -s 0.05 -L 5000 |
bt22047 |
wtdbg2 |
-p 18 -S 4 -K 0.3 -A -s 0.05 -L 5000 |
bt22941 |
wtdbg2 |
-p 19 -S 1 -K 0.5 -A -s 0.05 -L 5000 |
bt22040 |
wtdbg2 |
-p 19 -S 1 -K 0.4 -A -s 0.05 -L 5000 |
bt22934 |
wtdbg2 |
-p 19 -S 1 -K 0.2 -A -s 0.05 -L 5000 |
bt18541 |
wtdbg2 |
-p 19 -S 1 -K 0.3 -A -s 0.05 -L 5000 |
bt19682 |
wtdbg2 |
-p 19 -S 2 -K 0.5 -A -s 0.05 -L 5000 |
Student ID |
Assembly software |
Assembly parameters |
bt22880 |
wtdbg2 |
-p 19 -S 2 -K 0.4 -A -s 0.05 -L 5000 |
bt22911 |
wtdbg2 |
-p 19 -S 2 -K 0.2 -A -s 0.05 -L 5000 |
bt22068 |
wtdbg2 |
-p 19 -S 2 -K 0.3 -A -s 0.05 -L 5000 |
bt22900 |
wtdbg2 |
-p 19 -S 4 -K 0.5 -A -s 0.05 -L 5000 |
ml21566 |
wtdbg2 |
-p 19 -S 4 -K 0.4 -A -s 0.05 -L 5000 |
bt22862 |
wtdbg2 |
-p 19 -S 4 -K 0.2 -A -s 0.05 -L 5000 |
bt19540 |
wtdbg2 |
-p 19 -S 4 -K 0.3 -A -s 0.05 -L 5000 |
bt211031 |
wtdbg2 |
-p 20 -S 1 -K 0.5 -A -s 0.05 -L 5000 |
bt22050 |
wtdbg2 |
-p 20 -S 1 -K 0.4 -A -s 0.05 -L 5000 |
bt22010 |
wtdbg2 |
-p 20 -S 1 -K 0.2 -A -s 0.05 -L 5000 |
bt22061 |
wtdbg2 |
-p 20 -S 1 -K 0.3 -A -s 0.05 -L 5000 |
bt22071 |
wtdbg2 |
-p 20 -S 2 -K 0.5 -A -s 0.05 -L 5000 |
bt19575 |
wtdbg2 |
-p 20 -S 2 -K 0.4 -A -s 0.05 -L 5000 |
bt18070 |
wtdbg2 |
-p 20 -S 2 -K 0.2 -A -s 0.05 -L 5000 |
bt22032 |
wtdbg2 |
-p 20 -S 2 -K 0.3 -A -s 0.05 -L 5000 |
bt22879 |
wtdbg2 |
-p 20 -S 4 -K 0.5 -A -s 0.05 -L 5000 |
bt22028 |
wtdbg2 |
-p 20 -S 4 -K 0.4 -A -s 0.05 -L 5000 |
bt22640 |
wtdbg2 |
-p 20 -S 4 -K 0.2 -A -s 0.05 -L 5000 |
bt22627 |
wtdbg2 |
-p 20 -S 4 -K 0.3 -A -s 0.05 -L 5000 |
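The parameter column above appears to form a full grid: three values of `-p`, three of `-S`, and four of `-K`, with `-A -s 0.05 -L 5000` held constant. The short Python sketch below regenerates the 36 combinations in the order they are listed. It is only an illustration of the experimental design inferred from the table, not a description of how the table was actually produced.

```python
from itertools import product

P_VALUES = [18, 19, 20]
S_VALUES = [1, 2, 4]
K_VALUES = [0.5, 0.4, 0.2, 0.3]    # order as listed in the table
FIXED = "-A -s 0.05 -L 5000"       # options shared by every row


def parameter_grid():
    """Return every -p / -S / -K combination as a parameter string."""
    return [
        f"-p {p} -S {s} -K {k} {FIXED}"
        for p, s, k in product(P_VALUES, S_VALUES, K_VALUES)
    ]


if __name__ == "__main__":
    for params in parameter_grid():
        print(params)
```

Because only one parameter changes between neighbouring rows, assemblies from different students can be compared to isolate the effect of each option.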
2022-10-13