SIMILARITY SEARCH IN DATABANKS

Published: 2024-06-08


Exercise 1 : BLASTP on SwissProt

Download the Ex.1.VirtualProt.tfa sequence from Ecampus. We will search it with BLASTP (protein-protein BLAST) against the SwissProt databank to determine the function of this protein and to infer its evolution.

We will use the NCBI portal (https://blast.ncbi.nlm.nih.gov/Blast.cgi) to send a first request. In the Web BLAST part of the page, click on the Protein BLAST section. This opens the BLASTP suite page.

• Paste the VirtualProt.tfa protein sequence into the « Enter Query Sequence » box (copy/paste).

• Select the UniProtKB/Swiss-Prot databank in the Database list of the « Choose Search Set » section.

• Launch the program with the default parameters by clicking on the BLAST button.

Question 1 : Searching for protein function

• Why are we doing the search with the protein instead of doing it with the gene?

• What are the pros and cons of the protein databanks SwissProt and nr (non-redundant)? Consult the online documentation if necessary.

• What is the function of this protein?

• Which protein domains are found in this protein?

Question 2 : Searching for homologous proteins

• Which proteins are homologous to your query protein?

• For each distinct gene identified, give the highest and the lowest E-value.

• Make a summary table: protein name, best reported E-value (and organism), worst E-value.
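The summary table can also be assembled programmatically once the hits have been copied out of the Descriptions table. The sketch below is one possible way to do it in Python; the function name `summarise_hits` is hypothetical and the rows in `toy_hits` are made-up placeholders, not real BLAST results.

```python
from collections import defaultdict

def summarise_hits(hits):
    """Group hits by protein name and keep the best and worst E-value.

    `hits` is an iterable of (protein_name, organism, e_value) tuples.
    Returns {protein_name: (best_e_value, organism_of_best_hit, worst_e_value)}.
    """
    grouped = defaultdict(list)
    for name, organism, e_value in hits:
        grouped[name].append((e_value, organism))
    summary = {}
    for name, values in grouped.items():
        best_e, best_organism = min(values)  # lowest E-value = most significant
        worst_e, _ = max(values)             # highest E-value = least significant
        summary[name] = (best_e, best_organism, worst_e)
    return summary

# Placeholder rows for illustration only (NOT real results).
toy_hits = [
    ("protein A", "organism 1", 1e-120),
    ("protein A", "organism 2", 3e-45),
    ("protein B", "organism 3", 2e-30),
]

for name, (best_e, organism, worst_e) in summarise_hits(toy_hits).items():
    print(f"{name}\t{best_e:.0e} ({organism})\t{worst_e:.0e}")
```

Replace `toy_hits` with the values you read from your own BLAST output.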

Question 3 : Study of the graphical representation

Click in turn on the first and last red lines of the graphical representation, then on the first and last pink lines.

• What do the colours pink and red mean?

• Which sequences are involved in these four cases?

• What are the length, score and E-value of those four alignments?

• Draw a summary diagram of the regions of similarity between VirtualProt and these four homologous proteins.

Question 4 : Changing the default parameters of the algorithm

Click on the « Algorithm parameters » link, in small print at the bottom left of the page. The General Parameters section then opens, and you can change:

• The number of sequences reported by BLAST (Max target sequences): set this parameter to 1000.

• Does this parameter allow you to recover sequences with non-significant E-values?

• Do you identify some new homologous genes? Which ones?

Click on the score link of 2 or 3 of the new genes, chosen at random, and analyse the alignments.

• Which region of the VirtualProt protein do they match? Is the match significant (score, alignment length, E-value)?

In conclusion :

• What hypothesis or hypotheses can you propose concerning the evolution of this protein?

Exercise 2 : Identifying the function of a protein

You have a protein sequence, Ex.2.orf19_ecoli.tfa. Search it against the SwissProt databank. BLAST at NCBI: (http://www.ncbi.nlm.nih.gov/BLAST/)

• What is the function of this protein?

• What strategy do you propose to refine your search?

Exercise 3 : BLASTN

You have the sequence Ex3.copia.tfa. Perform a BLASTN query against the default database. BLAST at NCBI: (http://www.ncbi.nlm.nih.gov/BLAST/)

• What is this copia.tfa sequence? What do the results tell you about the structure of the sequence?

• Is the query you made correct?

• Which settings should be changed? Try several solutions and comment on them.

Searching in a dedicated database: FlyBase.

Connect to the BLAST search page of the FlyBase database (dedicated to the Drosophila genome).

BLAST in FlyBase: http://flybase.bio.indiana.edu/blast/

Choose the TRANSPOSONS database.

• How many copies of copia do you find on each chromosome of Drosophila melanogaster?

• Why was the search effective this time?

• For the best sequence found, how many hits do you get with copia?

• How do you explain this based on the comments? Draw a summary diagram.

Exercise 4 : Understanding a simple blast search

IMPORTANT: do NOT limit your search to "bacteria" in PART 1 (we are looking for insulin).

Below is the mRNA sequence for insulin from a South American rodent, the Degu (Octodon degus). You can also find it in the Ex4.Insulin_Degu.txt file on Ecampus.

>gi|202471|gb|M57671.1|OCOINS Octodon degus insulin mRNA, complete cds

GCATTCTGAGGCATTCTCTAACAGGTTCTCGACCCTCCGCCATGGCCCCGTGGATGCATCTCCTCACCGT GCTGGCCCTGCTGGCCCTCTGGGGACCCAACTCTGTTCAGGCCTATTCCAGCCAGCACCTGTGCGGCTCC AACCTAGTGGAGGCACTGTACATGACATGTGGACGGAGTGGCTTCTATAGACCCCACGACCGCCGAGAGC TGGAGGACCTCCAGGTGGAGCAGGCAGAACTGGGTCTGGAGGCAGGCGGCCTGCAGCCTTCGGCCCTGGA GATGATTCTGCAGAAGCGCGGCATTGTGGATCAGTGCTGTAATAACATTTGCACATTTAACCAGCTGCAG AACTACTGCAATGTCCCTTAGACACCTGCCTTGGGCCTGGCCTGCTGCTCTGCCCTGGCAACCAATAAAC CCCTTGAATGAG

We will now use a BLASTN search at NCBI to determine whether this sequence looks like the human mRNA for insulin. There are two ways we can do this:

• search the entire database and look for human hits in the results,

• specifically search the human part of the database.

We will try both of these possibilities.

Search against NR

• Follow the "nucleotide blast" link from the main BLAST page.

• In the section "Program Selection" select the option "Somewhat similar sequences (blastn)"

• Choose "Nucleotide collection (nr/nt)" as the search database. NR is the "Non Redundant" database, which contains all non-redundant (non-identical) sequences from GenBank and the full genome databases.

• Click the BLAST button to launch the search.

After the search has completed, make yourself familiar with the BLAST output page. After a header with some information about the search, there are three main parts:

• Graphic Summary

o each hit is represented by a line showing which part of the query sequence the alignment covers. The lines are coloured according to alignment score.

• Descriptions

o a table with a one-line description of each hit and some alignment statistics.

• Alignments

o the actual alignments between the query and the database hits.

The columns in the Descriptions table are:

• Description — the description line from the database

• Scientific Name — the scientific name of the database hit.

• Max score — the alignment score of the best match (local alignment) between the query and the database hit

• Total score — the sum of alignment scores for all matches (alignments) between the query and the database hit (if there is only one match per hit, these two scores are identical)

• Query cover — the percentage of the query sequence that is covered by the alignment(s)

• E value — the Expect value calculated from the Max score (i.e. the number of unrelated hits with that score or better you would expect to find for random reasons)

• Per.Ident — the percent identity in the alignment(s)

• Acc.Len — the accession length of the database hit.

• Accession — the accession number of the database hit.
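To make the percent identity, gap and query cover columns concrete, here is a minimal Python sketch that computes them for a single pairwise alignment. It assumes that identity is counted over the alignment length (gap columns included) and that gaps are written as "-". The function name `alignment_stats` and the toy sequences are hypothetical.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Compute simple statistics for one gapped pairwise alignment.

    Both aligned strings must have the same length; '-' marks a gap.
    `query_length` is the length of the full (unaligned) query.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    gaps = aligned_query.count("-") + aligned_subject.count("-")
    query_residues = columns - aligned_query.count("-")
    return {
        "alignment_length": columns,
        "percent_identity": 100.0 * identical / columns,
        "gaps": gaps,
        "query_cover": 100.0 * query_residues / query_length,
    }

# Toy alignment: 8 identical columns out of 10, one gap in the query.
stats = alignment_stats("ACGT-ACGTA", "ACGTTACCTA", query_length=12)
print(stats)
```

When a hit consists of several separate alignments, the web page combines them for the Query cover column, which this single-alignment sketch does not attempt.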

First, take a look at the best hit. Since our search sequence (the query) was taken from GenBank, which is part of NR, we should find an identical sequence in the search. Make sure this is the case!

QUESTION 1.1:

Answer the following questions about the best hit:

• what is the identifier (Accession)?

• what is the alignment score ("max score")?

• what is the percent identity and query coverage?

• what is the E-value?

• are there any gaps in the alignment?

Then, find the best hit from human (Homo sapiens) that is not a synthetic construct. (Tip: you can press Ctrl-F in most browsers to search in the page).

QUESTION 1.2:

Answer the same questions as before about the hit you found now.

Search against Human G+T

Open a new window/tab with the BLAST homepage. Make a new BLASTN search with the same query sequence, this time with Database set to Human genomic + transcript (Human G+T).

Remember again to select Somewhat similar sequences (blastn) under Program Selection. Consider the best hit.

Note: even though you may not have found exactly the same database entry in the two searches, the alignment should be the same. Make sure this is the case by comparing the actual alignments in the two windows where you made the searches.

QUESTION 1.3:

Answer the same questions as before about the best hit you found in this search.

Concerning database size and E-values

When answering the previous two questions, you may have noticed that the E-value changed, while the alignment score did not. We will now investigate this further.

QUESTION 1.4:

What are the sizes (in base pairs) of the databases we used for the two BLAST searches? (Tip: Expand the "Search summary" section near the top by clicking it).

QUESTION 1.5:

• What is the ratio between the database sizes in the two BLAST searches?

• What is the ratio between the E-values (for the best human hits) in the two BLAST searches?

• What is the relationship between database size and E-value for hits with identical alignment score?

• In conclusion: if the database size is doubled, what will happen to the E-value?
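Your observations can be checked against the standard relation between bit score and E-value, E = m × n × 2^(−S'), where m is the query length, n the database length and S' the bit score. The Python sketch below evaluates this formula for two hypothetical database sizes. BLAST itself uses length-corrected ("effective") values of m and n, so the numbers will not reproduce the web output exactly. The function name and all numeric values are assumptions chosen for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score.

    Implements E = m * n * 2**(-S'), with m = query_length,
    n = database_length and S' = bit_score.
    """
    return query_length * database_length * 2.0 ** (-bit_score)

# Hypothetical numbers, chosen only to show how E scales with database size.
small_db = expect_value(bit_score=100, query_length=400, database_length=1_000_000_000)
large_db = expect_value(bit_score=100, query_length=400, database_length=2_000_000_000)
ratio = large_db / small_db

print(f"E (small database): {small_db:.3g}")
print(f"E (large database): {large_db:.3g}")
print(f"ratio: {ratio}")
```

With the alignment score held fixed, the only term that differs between the two calls is the database length.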

Exercise 5 : Assessing the statistical significance of BLAST hits

IMPORTANT: limit your search to "bacteria" (taxid: 2) in ALL of this section (PART 2) to make the BLAST searches run quicker.

As discussed in the lecture, there is a risk of getting false positive results (hits to sequences that are not related to our input sequence) purely by chance. In this first part of the exercise we will investigate this further, by examining what happens when we submit randomly generated sequences to BLAST searches.

Rather than giving out a set of pre-generated DNA/peptide sequences where you only have our word for their randomness, you will generate your own random sequences with the Sequence Manipulation Suite. We previously used d4/d20 dice to generate these sequences manually, but we have decided to let the computer do the work in order to save you some time. It is important to understand that these computer-generated sequences are totally random, just as if you were rolling a die to determine each nucleotide/amino acid in each sequence.

Random DNA sequences and BLASTN

• Generate three DNA sequences of length 25 bp using the random DNA generator from the Sequence Manipulation Suite. Note: as three is not an option, generate ten sequences and copy the first three.
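For readers who want to see what "rolling a die for each position" amounts to, the following Python sketch draws each residue uniformly at random and prints the result in FASTA format. It is only an illustration of the idea, not a replacement for the Sequence Manipulation Suite. The function names, the sequence identifiers and the seed are assumptions.

```python
import random

def random_sequences(alphabet, length, count, seed=None):
    """Return `count` strings of `length` symbols drawn uniformly from `alphabet`."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(alphabet) for _ in range(length))
        for _ in range(count)
    ]

def to_fasta(sequences, prefix="random"):
    """Format sequences as FASTA records named prefix_1, prefix_2, ..."""
    return "\n".join(
        f">{prefix}_{i}\n{seq}" for i, seq in enumerate(sequences, start=1)
    )

# Three random 25 bp DNA sequences (equivalent to rolling a d4 per base).
dna = random_sequences("ACGT", length=25, count=3, seed=1)
print(to_fasta(dna, prefix="random_dna"))
```

Passing the 20-letter amino acid alphabet `"ACDEFGHIKLMNPQRSTVWY"` instead of `"ACGT"` gives uniformly random peptides in the same way.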

QUESTION 2.1:

Report the three sequences in FASTA format.

We will now do a BLASTN search using these three random sequences as queries. Follow the "nucleotide blast" link from the main BLAST page, and, as before, select the option "Somewhat similar sequences (blastn)" in the section "Program Selection". Choose "Nucleotide Collection (nr/nt)" as the search database.

VERY IMPORTANT: For this special situation, where we BLAST short artificial sequences, we need to turn off some of the automatic adjustments that NCBI applies when short sequences are detected. Otherwise we will not be able to see the intended results:

• Expand the "Algorithm parameters" section (see the screen shot below) in order to gain access to fine-tuning the options.

1. Deselect the "Automatically adjust parameters for short input sequences" option.

2. Set the E-value cut-off ("Expect threshold") to 50

Remember to adjust the BLAST settings

• Paste in your three sequences in FASTA format and start the BLAST search.

Browsing BLAST results: select which of your query sequences to inspect in the drop-down box near the top of the page

• Inspect the results.

QUESTION 2.2:

Answer the following small questions, and document your findings by pasting in examples of alignments / text snippets from the overview table:

• Do you find any sequences that look like your input sequences (paste in a few example alignments in your report).

• What is the typical length of the hits (the alignment length)?

• What is the typical % identity?

• In what range are the bit scores ("max score")?

o Notice: This is conceptually the same as the "alignment score" we have already met in the pairwise alignment exercise.

• What is the range of the E-values?

QUESTION 2.3:

• What is the biological significance of these hits / is there any biological meaning?

Random protein sequences and BLASTP

Now it's time to work with a set of protein sequences: Generate three peptide sequences of length 25aa using the random protein generator.

• Notice 1: The distribution of amino acids will be equal (5% prob) and this is different from true biological sequences - however this is not important for this first part of the exercise.

• Notice 2: Please recall from the lecture that the way BLASTP selects candidate sequences for full Smith-Waterman alignment is different from BLASTN. (BLASTN: a single short (11 bp or longer) perfect-match hit is needed. BLASTP: a pair of "near match" hits of 3 aa within a 40 aa window is needed.)
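The BLASTN seeding rule from Notice 2 can be illustrated with a small Python sketch that lists every position where the query and a subject share an exact word of 11 bases. It shows only the seeding step (no extension, scoring or filtering), and the function name and toy sequences are hypothetical.

```python
def shared_words(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word matches.

    Positions are 0-based start offsets of the shared word.
    """
    # Index every word of the query by its start position(s).
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and look each of its words up in the index.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds

# Toy sequences sharing exactly one 11-base word ("CCCGGGTTTAC").
query = "AAACCCGGGTTTACGTA"
subject = "GGCCCGGGTTTACTT"
print(shared_words(query, subject))
```

A 25 bp random query contains only 15 such words, which is one reason the short-sequence settings matter in this exercise.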

QUESTION 2.4:

Report the sequences in FASTA format.

Locate the "Protein BLAST" page at NCBI and choose blastp as the algorithm to use.

Paste in your sequences in FASTA format, and choose the "NR" database (this is the protein version, consisting of translated CDSs, UniProt etc).

VERY IMPORTANT: We also need to tweak the parameters this time - in the "Algorithm Parameters" section select BLOSUM62 as the alignment matrix to use and set the "Expect threshold" to 1000 (default: 0.05) - and DISABLE the "Short queries" parameters as we did in the DNA search a moment ago - otherwise our carefully tweaked parameters will be ignored.

• Perform the BLAST search.

• Inspect the results.

QUESTION 2.5:

(Remember to document your answers in the same manner as Q2.2)

• What is the typical length of the alignment and do they contain gaps?

• What is the range of E-values?

• Try to inspect a few of the alignments in detail ("+" means similar residues) - do you find any that look plausible, if we for a moment ignore the length/E-value?

• If we had used the default E-value cut-off of 0.05 would any hits have been found?

QUESTION 2.6:

• If we compare the result from BLAST'ing random DNA sequences to random Peptide sequences - which kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?

o Remember to take E-values into your consideration.

Exercise 6 : Using BLAST to transfer functional information by finding homologs

IMPORTANT: limit your search to "bacteria" (taxid: 2) in ALL of this section to make the BLAST searches run quicker. (The organisms we're looking for all belong to the "Bacteria" domain of life, so this restriction is OK).

Homo-, Ortho- and Paralogs

One of the most common ways to use BLAST is when you have a sequence of unknown function and want to find out what that function is. Since a large amount of sequence data has been gathered over the years, chances are that an evolutionarily related sequence with known function has already been identified. In general such a related sequence is known as a "homolog".

Homo-, Ortho- and Paralogs:

• A Homolog is a general term that describes a sequence that is related by any evolutionary means.

• An Ortholog ("Ortho" = True) is a sequence that is "the same gene" in a different organism: the sequences share a single common ancestor sequence and have diverged through speciation (e.g. the Alpha-globin gene in Human and Mouse).

• A Paralog arises due to a gene duplication within a species. For example, Alpha- and Beta-globin are each other's paralogs.

Source: gwLee's blog

Notice that in both cases it's possible to transfer information, for example information about gene family / protein domains. We have already touched upon comparison of (potentially) evolutionarily related sequences in the pairwise alignment exercise. However, this time we do not start out with two sequences we assume are related, but we rather start out with a single sequence ("query sequence") which we will use to search the databases for homologs (we often informally speak of "BLAST hits", when discussing the sequences found).

BLAST example 1

Let's start out with a sequence that will produce some good hits in the database. The sequence below is a full-length transcript (mRNA) from a prokaryote. Let's find out what it is.

>Unknown_transcript01

CCACTTGAAACCGTTTTAATCAAAAACGAAGTTGAGAAGATTCAGTCAACTTAACGTTAATATTTGTTTC CCAATAGGCAAATCTTTCTAACTTTGATACGTTTAAACTACCAGCTTGGACAAGTTGGTATAAAAATGAG GAGGGAACCGAATGAAGAAACCGTTGGGGAAAATTGTCGCAAGCACCGCACTACTCATTTCTGTTGCTTT TAGTTCATCGATCGCATCGGCTGCTGAAGAAGCAAAAGAAAAATATTTAATTGGCTTTAATGAGCAGGAA GCTGTTAGTGAGTTTGTAGAACAAGTAGAGGCAAATGACGAGGTCGCCATTCTCTCTGAGGAAGAGGAAG TCGAAATTGAATTGCTTCATGAATTTGAAACGATTCCTGTTTTATCCGTTGAGTTAAGCCCAGAAGATGT GGACGCGCTTGAACTCGATCCAGCGATTTCTTATATTGAAGAGGATGCAGAAGTAACGACAATGGCGCAA TCAGTGCCATGGGGAATTAGCCGTGTGCAAGCCCCAGCTGCCCATAACCGTGGATTGACAGGTTCTGGTG TAAAAGTTGCTGTCCTCGATACAGGTATTTCCACTCATCCAGACTTAAATATTCGTGGTGGCGCTAGCTT TGTACCAGGGGAACCATCCACTCAAGATGGGAATGGGCATGGCACGCATGTGGCCGGGACGATTGCTGCT TTAAACAATTCGATTGGCGTTCTTGGCGTAGCGCCGAGCGCGGAACTATACGCTGTTAAAGTATTAGGGG CGAGCGGTTCAGGTTCGGTCAGCTCGATTGCCCAAGGATTGGAATGGGCAGGGAACAATGGCATGCACGT TGCTAATTTGAGTTTAGGAAGCCCTTCGCCAAGTGCCACACTTGAGCAAGCTGTTAATAGCGCGACTTCT AGAGGGGTTCTTGTTGTAGCGGCATCTGGGAATTCAGGTGCAGGCTCAATCAGCTATCCGGCCCGTTATG CGAACGCAATGGCAGTCGGAGCGACTGACCAAAACAACAACCGCGCCAGCTTTTCACAGTATGGCGCAGG GCTTGACATTGTCGCACCAGGTGTAAACGTGCAGAGCACATACCCAGGTTCAACGTATGCCAGCTTAAAC GGTACATCGATGGCTACTCCTCATGTTGCAGGTGCAGCAGCCCTTGTTAAACAAAAGAACCCATCTTGGT CCAATGTACAAATCCGCAATCATCTAAAGAATACGGCAACGAGCTTAGGAAGCACGAACTTGTATGGAAG CGGACTTGTCAATGCAGAAGCGGCAACACGCTAATCAATAATAATAGGAGCTGTCCCAAAAGGTCATAGA TAAATGACCTTTTGGGGTGGCTTTTTTACATTTGGATAAAAAAGCACAAAAAAATCGCCTCATCGTTTAA AATGAAGGTACC

BLASTN search

Perform a BLAST search in the NR/NT database (BLASTN) using default settings. Remember to set Expect threshold back to the default value, 0.05.

QUESTION 3.1:

(Once again remember to document your findings)

• Do we get any significant hits?

• What kind of genes (function) do we find?

BLASTP search

Now let's try to do the same at the protein level.

• Find the longest ORF using Virtual Ribosome (hint: remember to search all Plus (1,2,3) reading frames) and save or copy the sequence in FASTA format.

• BLAST the sequence (BLASTP) against the NR database.
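The ORF search in the first step can be sketched in a few lines of Python. This is an illustration of the idea only, not how Virtual Ribosome is implemented: it assumes the standard genetic code, treats ATG as the only start codon (bacteria can also use alternative starts such as GTG or TTG), and requires a stop codon at the end of the ORF. The function name and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def longest_orf(dna):
    """Return (frame, protein) for the longest ATG-to-stop ORF on the plus strand.

    `frame` is 1, 2 or 3; the protein is returned without the stop symbol.
    Returns (0, "") if no complete ORF is found.
    """
    dna = "".join(dna.upper().split())  # tolerate spaces and line breaks
    best = (0, "")
    for frame in range(3):
        protein = "".join(
            CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with unknown bases
            for i in range(frame, len(dna) - 2, 3)
        )
        # Every segment except the last one is terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame + 1, segment[start:])
    return best

# Toy example: the only complete ORF (ATG AAA TTT TAG) lies in frame 3.
frame, protein = longest_orf("CCATGAAATTTTAGGG")
print(f">toy_orf frame={frame}\n{protein}")
```

Applied to the Unknown_transcript01 sequence, a function like this should point to the same reading frame as Virtual Ribosome if the assumptions above hold; differences would most likely come from the choice of start codon.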

QUESTION 3.2: (Document!)

• Report your translated protein sequence in FASTA format.

• Do we find any conserved protein domains? (Click the Graphic Summary tab).

Identifying known protein domains can provide important clues to the function of an unknown protein.

• Do we find any significant hits? (E-value?)

• Are all the best hits the same category of enzymes?

• From what you have seen, what is best for identifying intermediate quality hits - DNA or Protein BLAST?

BLAST example 2

In the previous section, we used a sequence that was already in the database. What would happen if we used a truly unknown sequence?

The sequence is a DNA fragment from an unknown microorganism. It was cloned and sequenced directly from DNA extracted from a soil sample, and named "CLONE89". It was amplified using degenerate PCR primers that target the middle ("core cloning") of the sequence of a group of known enzymes and has never been submitted to databanks.

ID CLONE89.DNA 1145 BP DS-DNA UPDATED 07/10/23

DE Unknown sequence from soil-sample extract microorganism

AC -

KW -

OS –

SQ Sequence 218 A 306 C 342 G 279 T 0 OTHER

ACCCCCTTAA AAAGAACAAA CAGGCACGCG CGGGCAGACA GGTGCAGTGC CCATCTGAAT 120

GGAAGTGCAT GCTGATGGAG ACACTACCCC CACCGTTTTC CCTGCCTGTT TGCAAATCGC 180

CATGGCTGAC AGCGGGCACA TTGCCGACCG CAGTGGAGCT CCTATCCACG TAATGAGGGT 240

CCTTCGCACC GTCCTGGGGA TAGATATGCC GCTGCTGTAC ATCACAGGCA CCTTCGGCCT 300

AATGGCGGAC GTGCAAGGTA CTCTCAAGAA AGGTAATACT TCTCTGCGCT CTGTTTCTGG 360

CGAGGCAAAG ATCTGCAAAG ACGTGCCGGG GCTGGCGGCG ACACAGTTGC TCCAGATACT 420

TCTGTTCATC ATGGTGCAGT ATTTGTTCCA GCAGAACCTC TGCGTACCGG TCACCGTTGT 480

GTTCGATTCG GCCACCGGCC CGGTGCTCGG CACACTGTTG GGCGTGCTCT ACTGCACCTT 540

GCTGGGCACT GTGCCGGCTT CATGCTGCTA TTTAATGACG CGCCTCGTGT GCGTGCGGCT 600

CGGGGGGCGG GGCGAGGCAA GCCTAATGAA GGGGAAGGGA TTGCCTAAGA AACGCACACA 660

GGTCAGCCGG AACCGATCCG ACTTACTCGG TCGCTCGGTA TTCCTCCGCC GAAAACCAAA 720

AGTGCCCCTC TGGATCCTGA AGTTAGGGTC TCCGGTCGGT GGCCTTACGC TTTGGATGCG 780

TGCACTGGTG CTAGGGATCG GCATTATACA ACTGAGGTAC TCGTCGGTGC GTTTCCAATC 840

ACTAGCGTAC GCCTCATCAG CCGGCGGCTG CCCTATAGTG GCGCCGTTCG TCAAACTGTT 900

GCTGGCCGGA CTCGGTGTGA CGTTGCTCGG CGTGCTACCG GCCGAGAAGC GATTTGCGTC 960

AGCTCAGGGG CATGATACAT GGATGCGGCC GTCTTCTTCG ACGTATGTGT GAGGTCTGAG 1020

GGGGTCGACG TGAGAATCTT CATCCATGGC TGTTTGTCTG AGCGTGTGTA TGTGGGTGTG 1080

TGTGTCTACG TTTATGCTGT TGGGATATCT TCGACAAGTT TAGCCTGCTC GTGCACCAGC 1140

CTGGT 1145

QUESTION 3.3 (Long question - read all):

Your task is now to find out what kind of enzyme this sequence is likely to encode, using the methods you have learned.

INSTRUCTIONS: As part of the continuous assessment, you will be asked to write the answer to this question in the form of a report; just be sure to include the sub-questions in your answers. You will have to gather all the clues yourself, reason about which tools/databases to use and document your conclusions. You will submit this document on Ecampus in the repository provided for this purpose.

STEP 1 - cleaning up the sequence:

The sequence is (more or less) in EMBL format and the NCBI BLAST server expects the input to be in FASTA format, or to be "raw" unformatted sequence. So you will have to convert the sequence to FASTA format manually and quote it in your report.
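The manual conversion boils down to keeping the letters of the sequence lines and discarding the block spacing and the position numbers at the end of each line. A minimal Python sketch of that clean-up is shown below; it expects only the sequence lines (not the ID, DE, AC, KW, OS or SQ header lines), and the function name and toy input are hypothetical.

```python
def embl_sequence_to_fasta(sequence_lines, name, width=60):
    """Convert EMBL-style sequence lines into a FASTA record.

    Only alphabetic characters are kept, which removes the spaces between
    blocks of ten bases and the running position number ending each line.
    """
    residues = []
    for line in sequence_lines:
        residues.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(residues).upper()
    body = "\n".join(
        sequence[i:i + width] for i in range(0, len(sequence), width)
    )
    return f">{name}\n{body}"

# Toy input in the same layout as the entry above (NOT the CLONE89 data).
toy_lines = [
    "ACGTACGTAC GGTTAACCGG 20",
    "TTGCA 25",
]
print(embl_sequence_to_fasta(toy_lines, name="toy"))
```

After converting, compare the length of your cleaned sequence with the length stated in the entry header as a sanity check.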

STEP 2 - thinking about the task:

Consider the following before you start on solving this task:

• Based on the information given: is the sequence protein-coding?

• If it is, can you trust it will contain both a START and STOP codon?

• Do we know if the sequence is sense or anti-sense?

Then think about the consequences that the answers to these questions should have for your choice of methods and parameters.
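If the strand of the clone is unknown, the opposite strand may have to be considered as well. The short Python sketch below computes a reverse complement, so that both strands can be inspected with the same tools. The function name is hypothetical, and only the four unambiguous bases are handled.

```python
# Translation table mapping each base to its complement (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCCG"))
```

Characters other than A, C, G and T are passed through unchanged, so ambiguity codes would need extra handling.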

Subquestion: Give a summary of your considerations.

STEP 3 - Performing the database search:

Significance: We will set the significance threshold at an E-value of 1e-10 (remember: the higher the E-value, the worse the significance).

Subquestion:

Cover the following in your answer:

• What tool(s) and database(s) will be relevant to use?

• Document the results from the different BLAST searches - what works and what does not work?

• In conclusion: What kind of enzyme is CLONE89? Gather as much evidence as possible.

Exercise 7 : BLAST'ing Genomes

IMPORTANT: do NOT limit your search to "bacteria" here - now we are actively looking at organism specific queries.

So far we have been using BLAST to search the big, broad databases that cover a huge set of sequences from a large range of organisms. In this final part of the exercise we will do some more focused searches in smaller databases by targeting specific genomes.

Typically this will be useful if you have a gene of known function from one organism (say a cell-cycle controlling gene from Yeast, Saccharomyces cerevisiae) and want to find the human homolog/ortholog of this gene (genes that control cell division are often involved in cancer).

When performing the BLAST searches, you have probably already noticed that it's possible to search specifically in the Human and Mouse genomes (these databases only contain sequences from Human/Mouse). It's also possible to restrict the output from searches in the large databases (e.g. NR) to specific organisms.

A growing number of organisms have been fully sequenced, and the research teams responsible for a large scale genome project typically put up their own Web resources for accessing the data. For example the Yeast genome is principally hosted in the Saccharomyces Genome Database (SGD - www.yeastgenome.org) - it should be noted that SGD also offers BLAST as a means to search the database.

Genome specific analysis of histones

SGD

Let's do a small study of the relationship between the histones found in Yeast and in Human (evolutionary distance: ~1-1.5 billion years).

Look up the HTA2 gene in SGD (http://www.yeastgenome.org - use the search box at the top of the page). Notice that a brief description of the function of the gene and its protein product is displayed (a huge amount of additional information can be found further down the page - much of it Yeast specific).

QUESTION 4.1:

What information is given about the relationship between this gene and the gene "HTA1"?

Browse the page and locate the link to the protein sequence. Save the sequence as a file, we'll need it in question 4.2.

NCBI

Now return to the NCBI blastp page. Set Database to "Reference proteins (refseq_protein)", and enter Saccharomyces cerevisiae in the Organism field (and accept the suggestion with taxid:4932). Then run a search with the HTA2 sequence you saved earlier.

QUESTION 4.2:

(Remember to document your answers)

• How many high-confidence hits do we get?

• Do the hits make sense, from what you have read about HTA2 at the SGD webpage?

Tip: click on the Gene links under Related Information (to the right of the alignments) to see the gene names for the protein hits.

The next step is to search the translated version of the human genome.

Do as before, still with Database set to "Reference proteins (refseq_protein)", just enter Human in the Organism field.

QUESTION 4.3:

• How many high-confidence hits (with E-value better than 1e-10) are found? (Approximately)

• What are all the high-confidence hits called?