SCM8051 Written Assignment, 2021

Analysis of an RNA-Seq experiment

Instructions

This 1,500-word written report (not including references, figures, tables or appendices) is based upon a published RNA-Seq dataset. Following the directions below you will repeat certain analyses and interpret the results. The specific analyses you need to perform and the questions to address are indicated in red text. The report should be structured according to the sections below and include the figures indicated. The questions should be answered with free text rather than bullet points, citing key supporting references included in a bibliography at the end of the document.

Background

As the impact of COVID-19 became apparent there was a rush to understand how the virus works. One of the earliest studies aimed to characterise the transcriptional response of cells to infection with SARS-CoV-2. Available first as a preprint on the bioRxiv server, the study was subsequently published in the prestigious journal Cell: Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19. Blanco-Melo D et al. (2020) Cell 181(5), 1036-1045.e9 (pdf available in Canvas).

The authors reported that cells mount an ‘inappropriate inflammatory response defined by elevated chemokine expression in the absence of Type I and III interferons’. You will evaluate the bioinformatic approach employed to assess gene expression and the way in which the results were presented in the Cell study.

You can perform the analyses using the standalone applications in R or in Galaxy (the public instances; note that Galaxy Europe will be down for scheduled maintenance for 24 hrs from Nov 24th at 17.45).

 

Section 1: Study design and data reporting (25%)

When reporting analyses of high throughput expression data, it is good practice for the raw data and appropriate processed data to be deposited in a publicly available archive. Indeed, most journals now require adherence to the MINSEQE (Minimum Information About a Next-generation Sequencing Experiment) guidelines for the minimum information that should be included when describing a sequencing study.

The key information that should be available in the publication and archive submission (in this case Gene Expression Omnibus (GEO)) includes:

· Descriptions of experimental work used to generate the data

o Sample type: Cells/tissues/organism

o Treatments

o Library preparation and sequencing strategy

o Reagents and protocols

· The genome and annotations used for alignments

· All software tools employed (including versions)

· Specific parameters used (default or custom with explanation)

Is this information provided within the Cell paper in sufficient detail to enable replication of the bioinformatics reported in the study?

Evaluate the pros and cons of the following choices made by the authors: the model systems, the library prep kits and the assignment of reads to genes (consider aligners such as STAR and mappers such as Salmon that use pseudoalignment).

 

Section 2: Overview and normalisation of the data (25%)

Download and unzip the expression matrix of raw read counts from the GEO series page. We are going to focus on the following mock and SARS-CoV-2 infected samples, so make a matrix containing only these samples (tip: select the relevant columns from the expression matrix).

NHBE:

Series1_NHBE_Mock_1, Series1_NHBE_Mock_2, Series1_NHBE_Mock_3, Series1_NHBE_SARS-CoV-2_1, Series1_NHBE_SARS-CoV-2_2, Series1_NHBE_SARS-CoV-2_3

A549:

Series2_A549_Mock_1, Series2_A549_Mock_2, Series2_A549_Mock_3, Series2_A549_SARS-CoV-2_1, Series2_A549_SARS-CoV-2_2, Series2_A549_SARS-CoV-2_3

A549-ACE2:

Series6_A549-ACE2_Mock_1, Series6_A549-ACE2_Mock_2, Series6_A549-ACE2_Mock_3, Series6_A549-ACE2_SARS-CoV-2_1, Series6_A549-ACE2_SARS-CoV-2_2, Series6_A549-ACE2_SARS-CoV-2_3

Calu3:

Series7_Calu3_Mock_1, Series7_Calu3_Mock_2, Series7_Calu3_Mock_3, Series7_Calu3_SARS-CoV-2_1, Series7_Calu3_SARS-CoV-2_2, Series7_Calu3_SARS-CoV-2_3

Lung biopsy:

Series15_HealthyLungBiopsy_2, Series15_HealthyLungBiopsy_1, Series15_COVID19Lung_2, Series15_COVID19Lung_1

For an initial look at the data, check how many reads there are in each sample (column) and present this as a bar graph (Figure 1). Explain why some samples have more reads than others and the potential implications for analysis of gene expression.
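
As an illustration of the column selection and per-sample read totals described above, the following is a minimal Python sketch using only the standard library. It assumes the expression matrix is tab-delimited with one row per gene and one column per sample, which should be checked against the file actually downloaded. The counts shown are made up, and the `library_sizes` helper and the `gene` column header are hypothetical names.

```python
import csv
import io

# Made-up toy matrix standing in for the downloaded file (assumed tab-delimited,
# genes in rows, samples in columns). These counts are not real data.
RAW = (
    "gene\tSeries1_NHBE_Mock_1\tSeries1_NHBE_SARS-CoV-2_1\tSeries2_A549_Mock_1\n"
    "GeneA\t0\t4\t1\n"
    "GeneB\t120\t480\t75\n"
    "GeneC\t9000\t15000\t7000\n"
)

# Columns to keep; any sample not listed here is ignored.
KEEP = ["Series1_NHBE_Mock_1", "Series1_NHBE_SARS-CoV-2_1"]


def library_sizes(handle, keep):
    """Return {sample: total read count} for the selected columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    totals = {name: 0 for name in keep}
    for row in reader:
        for name in keep:
            totals[name] += int(row[name])
    return totals


sizes = library_sizes(io.StringIO(RAW), KEEP)
for sample, total in sizes.items():
    print(f"{sample}\t{total}")
```

The resulting totals are the values that would be plotted as the bars of Figure 1; for a real file, `io.StringIO(RAW)` would be replaced by an opened file handle.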

Load the count matrix into edgeR and perform normalization using the TMM, RLE and upper quartile methods.
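
To give a feel for what scaling-factor normalisation does, here is a simplified Python sketch of the upper quartile and RLE ideas. It is not edgeR's implementation, it omits TMM entirely, and the function names and toy counts are hypothetical. The toy samples differ only in sequencing depth (each column is an exact multiple of the first), so both methods are expected to return factors close to 1, because library size alone already explains the difference.

```python
import math
import statistics

# Made-up counts: rows are genes, columns are three samples sequenced at
# 1x, 2x and 4x depth. The last gene is never detected.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [50, 100, 200],
    [5, 10, 20],
    [0, 0, 0],
]


def scale_to_unit_geomean(factors):
    """Rescale factors so that their geometric mean is 1."""
    log_mean = sum(math.log(f) for f in factors) / len(factors)
    return [f / math.exp(log_mean) for f in factors]


def upper_quartile_factors(counts):
    """75th percentile of each sample's counts, relative to its library size."""
    kept = [row for row in counts if any(row)]  # drop genes that are all zero
    raw = []
    for col in zip(*kept):
        q75 = statistics.quantiles(col, n=4, method="inclusive")[2]
        raw.append(q75 / sum(col))
    return scale_to_unit_geomean(raw)


def rle_factors(counts):
    """Median ratio of each sample to a per-gene geometric-mean reference."""
    kept = [row for row in counts if all(c > 0 for c in row)]
    lib = [sum(col) for col in zip(*counts)]
    reference = [math.exp(sum(math.log(c) for c in row) / len(row)) for row in kept]
    raw = []
    for j, size in enumerate(lib):
        ratios = [row[j] / ref for row, ref in zip(kept, reference)]
        raw.append(statistics.median(ratios) / size)
    return scale_to_unit_geomean(raw)


uq = upper_quartile_factors(counts)
rle = rle_factors(counts)
print("upper quartile:", [round(f, 3) for f in uq])
print("RLE:", [round(f, 3) for f in rle])
```

Factors that depart from 1 indicate a compositional difference between samples, for example a few very highly expressed genes taking up a large share of the reads in one sample.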

As a first step in any analysis it is helpful to visualise the relationship between all samples in terms of their global transcriptomes. This can be achieved using multidimensional scaling (MDS) or principal component analysis (PCA).

Provide guidelines for how to generate a PCA plot that clearly depicts the relationships between the different samples based on the TMM normalised count matrix. Include the plot in your report (Figure 2).
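
The core PCA calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is assumed to be already normalised and log-transformed (for example log2 counts per million), the values are made up, and the first component is found by power iteration rather than by the routines used in R or Galaxy. The `first_pc_scores` function is a hypothetical helper.

```python
import math

# Made-up log2-CPM values: rows are genes, columns are samples.
samples = ["Mock_1", "Mock_2", "Infected_1", "Infected_2"]
logcpm = [
    [5.0, 5.2, 9.0, 9.1],  # higher in infected samples
    [7.0, 6.9, 3.0, 3.2],  # lower in infected samples
    [4.0, 4.1, 4.0, 4.2],  # essentially unchanged
]


def first_pc_scores(matrix, iterations=200):
    """Return (PC1 score per sample, fraction of variance explained by PC1)."""
    # Centre each gene so that only differences between samples remain.
    centred = []
    for row in matrix:
        mean = sum(row) / len(row)
        centred.append([v - mean for v in row])
    n = len(centred[0])
    # Sample-by-sample Gram matrix; its eigenvectors give the sample scores.
    gram = [[sum(row[i] * row[j] for row in centred) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    explained = eigenvalue / sum(gram[i][i] for i in range(n))
    scores = [math.sqrt(eigenvalue) * v for v in vec]
    return scores, explained


scores, explained = first_pc_scores(logcpm)
for name, score in zip(samples, scores):
    print(f"{name}\tPC1 = {score:.2f}")
print(f"PC1 explains {explained:.1%} of the variance")
```

The sign of a principal component is arbitrary, so only the relative positions of samples along an axis carry meaning, together with the proportion of variance that axis explains.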

Explain briefly how to interpret the plot. Based solely on this plot, suggest which cell type you think would provide the best in vitro model for studying SARS-CoV-2 infection in the lung.

Append the R scripts you used for edgeR in this section to your report (these are not included in the word count). Tip: If using Galaxy you can select the ‘output R Script’ option in edgeR.

 

Section 3: Differential gene expression (25%)

To examine the host transcriptional response to SARS-CoV-2 in more detail the authors used primary Human Bronchial Epithelial Cells (NHBE) (Fig 2 in the paper). Repeat the analysis of differential gene expression, but just between the mock and SARS-CoV-2 infected NHBE samples that you analysed in section 2 above.

Firstly, use edgeR with default parameters (P-Value Adjusted Threshold = 0.05, P-Value Adjustment Method = Benjamini & Hochberg (1995), Minimum Log2 Fold Change = 0) but with each of the normalization methods (Trimmed Mean of M-values (TMM), Relative Log Expression (RLE) and Upper quartile (UQ)).
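
The Benjamini & Hochberg adjustment named in these parameters can be illustrated with a short Python sketch. The p-values are made up and `benjamini_hochberg` is a hypothetical helper written for this example; in practice the adjustment is carried out inside the differential expression tool.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value to the smallest so that the running
    # minimum keeps the adjusted values monotonic.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank when sorted in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
padj = benjamini_hochberg(pvals)
significant = [p < 0.05 for p in padj]
for raw, adj, keep in zip(pvals, padj, significant):
    print(f"raw = {raw:<6} adjusted = {adj:.5f} significant = {keep}")
```

In this toy example the third and fourth genes have raw p-values below 0.05 but adjusted values just above it, which shows how the correction reduces the number of genes called significant.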

Compare the lists of altered genes detected by each method and present the overlaps as a Venn diagram (Figure 3).
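
The numbers that go into each region of a three-way Venn diagram can be obtained with set operations, as in the Python sketch below. The gene identifiers are placeholders rather than real results, and `venn3_counts` is a hypothetical helper.

```python
def venn3_counts(a, b, c, labels=("A", "B", "C")):
    """Return the number of items in each of the seven Venn regions."""
    a, b, c = set(a), set(b), set(c)
    la, lb, lc = labels
    return {
        f"{la} only": len(a - b - c),
        f"{lb} only": len(b - a - c),
        f"{lc} only": len(c - a - b),
        f"{la} & {lb} only": len((a & b) - c),
        f"{la} & {lc} only": len((a & c) - b),
        f"{lb} & {lc} only": len((b & c) - a),
        f"{la} & {lb} & {lc}": len(a & b & c),
    }


# Placeholder gene lists standing in for the three sets of altered genes.
tmm = {"G1", "G2", "G3", "G4"}
rle = {"G2", "G3", "G4", "G5"}
uq = {"G3", "G4", "G6"}

regions = venn3_counts(tmm, rle, uq, labels=("TMM", "RLE", "UQ"))
for region, n in regions.items():
    print(f"{region}: {n}")
```

The region counts sum to the size of the union of the three lists, which is a quick check that no gene has been counted twice.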

Briefly explain the rationale for each method and explain how important you consider the choice of normalization method to be.

Now use three different software packages, edgeR, DESeq2 and limma-voom, with thresholds of log2FC > 1 or < -1 and adjusted p-value < 0.05. Use the default parameters (i.e. use edgeR with TMM normalization) and note these in your answer.
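
Applying these two thresholds to a results table amounts to a simple filter, sketched here in Python. The table rows are made up and `classify` is a hypothetical helper; the column layout of the real output differs between the three packages.

```python
# Made-up results: (gene, log2 fold change, adjusted p-value).
results = [
    ("G1", 2.3, 0.001),
    ("G2", -1.4, 0.03),
    ("G3", 0.6, 0.0001),  # significant but below the fold change cut-off
    ("G4", 3.0, 0.20),    # large change but not significant
    ("G5", -1.0, 0.01),   # exactly at the cut-off, so excluded by a strict test
]


def classify(results, lfc=1.0, alpha=0.05):
    """Split genes into up- and down-regulated lists using strict thresholds."""
    up = [gene for gene, fc, padj in results if padj < alpha and fc > lfc]
    down = [gene for gene, fc, padj in results if padj < alpha and fc < -lfc]
    return up, down


up, down = classify(results)
print("up:", up)
print("down:", down)
```

Whether the comparisons are strict (> and <) or inclusive (>= and <=) affects genes that sit exactly on a threshold, so it is worth stating which was used.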

Compare the lists of altered genes detected by each method and present the overlaps as a Venn diagram (Figure 4).

Explain why the different algorithms may have detected different numbers of significantly altered genes.

 

Section 4: Gene set enrichment analysis (GSEA) (25%)

GSEA can be used to identify common characteristics or functions of the altered genes. This involves testing whether more of the altered genes are present in a given set of genes than would be expected by chance. Commonly this is applied to sets of genes defined by gene ontology (GO) terms.
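
The test described above, often called over-representation analysis, is commonly based on the hypergeometric distribution. The Python sketch below shows that calculation with made-up numbers; it is not the exact procedure used by any particular server, and `enrichment_pvalue` is a hypothetical helper.

```python
from math import comb


def enrichment_pvalue(universe, in_set, selected, overlap):
    """Probability of seeing at least `overlap` gene-set members by chance.

    universe: number of genes that could have been detected
    in_set:   number of those genes annotated with the term
    selected: number of altered genes
    overlap:  number of altered genes annotated with the term
    """
    upper = min(in_set, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(in_set, k) * comb(universe - in_set, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up example: 100 altered genes out of 20,000, of which 10 fall in a
# 200-gene set. By chance about 100 * 200 / 20000 = 1 would be expected.
p = enrichment_pvalue(universe=20000, in_set=200, selected=100, overlap=10)
print(f"p = {p:.3g}")
```

Because many gene sets are tested at once, the resulting p-values are normally corrected for multiple testing before they are reported.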

Use the STRING server mentioned in the paper (https://string-db.org/) to analyse the protein-protein interaction network of the genes detected as altered in NHBE cells by SARS-CoV-2 infection using edgeR (default parameters, TMM normalization, p < 0.05, FC > 1). Tip: input the gene names as a list of ‘multiple proteins’.

Show the protein network in your report (Figure 5) and provide a brief explanation of how to interpret this plot (tip: consider the PPI enrichment p-value provided in the ‘Analysis’ tab). 

Consider the gene enrichment analysis for the GO dataset for biological processes (tip: also available under the ‘Analysis’ tab). Include a table of the top 18 enriched terms ranked by FDR in your report, including the ‘count in network’ and ‘strength’, and explain the meaning of these terms. Compare these results with those presented in Figure 2C of the paper. Explain why there may be differences and whether you consider the results reported in the Cell paper to be a fair representation of the data.