Todos Santos Computational Biology and Genomics 2018
In this exercise we will go through an RNA-seq pipeline, including reference-free transcriptome assembly and differential gene expression analysis.
We will use the Trinity software package and associated scripts (https://github.com/trinityrnaseq/trinityrnaseq/wiki).
This exercise is adapted from Haas et al (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols. https://www.nature.com/articles/nprot.2013.084
1.1.
Connect to the server containing the data and software:
%%bash
ssh yourname@rna.biology.colostate.edu
1.2.
Change into the directory containing the materials for the exercise:
%%bash
cd TrinityExercise
1.3.
Examine the contents of the directory:
%%bash
ls
We will analyze paired end data from yeast from various growth conditions (these are subsets of larger datasets):
Sp.ds.1M.left.fq.gz # left reads for diauxic shift
Sp.ds.1M.right.fq.gz # right reads for diauxic shift
Sp.hs.1M.left.fq.gz # left reads for heatshock
Sp.hs.1M.right.fq.gz # right reads for heatshock
Sp.log.1M.left.fq.gz # left reads for log phase
Sp.log.1M.right.fq.gz # left reads for log phase
Sp.plat.1M.left.fq.gz # left reads for plateau phase
Sp.plat.1M.right.fq.gz # right reads for plateau phase
We will use the GUI program FastQC to assess the quality of our libraries.
Want to learn more? See https://www.youtube.com/watch?v=bz93ReOv87Y
2.1.
Run FastQC:
%%bash
fastqc file_name.fq.gz
%%bash
python -m SimpleHTTPServer 8000
2.2.
Open a web browser and go to http://rna.biology.colostate.edu:8000
2.3.
Click on the appropriate fastqc.html
file to view fastqc output.
Here, we will use Trimmomatic to remove adapter sequence and low quality bases from reads and filter out low quality reads and reads shorter than 36 nt.
See the Trimmomatic manual for more details - http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
3.1.
Trim off adapters and filter low quality reads from the diauxic shift treated samples:
%%bash
trimmomatic PE \
`input.left.fq.gz` \
`input.right.fq.gz` \
`Trimmed_input.left.paired.fq.gz` \
`Trimmed_input.left.unpaired.fq.gz` \
`Trimmed_input.rigt.paired.fq.gz` \
`Trimmed_input.right.unpaired.fq.gz` \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This is the actual code with the fastq file names:
%%bash
trimmomatic PE \
Sp.ds.1M.left.fq.gz \
Sp.ds.1M.right.fq.gz \
Trimmed_Sp.ds.1M.left.paired.fq.gz \
Trimmed_Sp.ds.1M.left.unpaired.fq.gz \
Trimmed_Sp.ds.1M.right.paired.fq.gz \
Trimmed_Sp.ds.1M.right.unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
3.2.
Repeat adapter trimming and quality filter for each library.
3.3.
Examine files from before and after adapter trimming and quality filtering using zmore
or zless
.
Trinity will be used to assemble transcripts.
Want to know how it works? See https://www.broadinstitute.org/videos/trinity-how-it-works
4.1.
Combine the forward RNA-seq reads and the reverse RNA-seq reads from all samples into single files:
%%bash
cat Trimmed*left*gz >reads.ALL.left.fq.gz
cat Trimmed*right*gz >reads.ALL.right.fq.gz
4.2.
Assemble the reads into a reference transcriptome (~102 min with 4 threads):
%%bash
Trinity \
--seqType fq \
--left reads.ALL.left.fq.gz \
--right reads.ALL.right.fq.gz \
--SS_lib_type RF \
--CPU 4 \
--max_memory 20G
--seqType # fastq (fq)
--left # left reads fastq file
--right # right reads fastq file
--SS_lib_type # strand specificity; RF indicates the right reads corresond to the transcribed strands.
--CPU # controls the amount of RAM used
The transcripts are in the output file trinity_out_dir/Trinity.fasta
.
4.3.
Assess the quality of the assembly:
%%bash
scripts/TrinityStats.pl \
trinity_out_dir/Trinity.fasta
We will now compare our de novo transcriptome to the S.pombe reference transcriptome using BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).
5.1.
Make a BLAST database using the S. pombe reference transcriptome:
%%bash
makeblastdb \
-in S_pombe_refTrans.fasta \
-dbtype nucl
5.2.
Use megablast to blast the trinity assembled transcriptome against the reference transcriptome:
%%bash
blastn -query trinity_out_dir/Trinity.fasta \
-db S_pombe_refTrans.fasta \
-out Trinity_vs_S_pombe_refTrans.blastn \
-evalue 1e-20 \
-dust no \
-task megablast \
-num_threads 4 \
-max_target_seqs 1 \
-outfmt 6
5.3.
Summarize blast results:
%%bash
scripts/analyze_blastPlus_topHit_coverage.pl \
Trinity_vs_S_pombe_refTrans.blastn \
trinity_out_dir/Trinity.fasta \
S_pombe_refTrans.fasta
Here we will count the numbers of reads aligning to each trancript in each of our libraries using RSEM.
A tutorial on RSEM can be found here - https://github.com/bli25broad/RSEM_tutorial.
A paper descibing RSEM is here - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323
6.1.
Count the number of reads for each transcript identified by Trinity in the diauxic shift sample:
%%bash
scripts/align_and_estimate_abundance.pl \
--seqType fq \
--transcripts trinity_out_dir/Trinity.fasta \
--left Sp.ds.1M.left.fq.gz \
--right Sp.ds.1M.right.fq.gz \
--seqType fq \
--est_method RSEM \
--aln_method bowtie2 \
--trinity_mode \
--prep_reference \
--output_dir ds \
--thread_count 4
6.2.
Repeat read counts for each library.
6.3.
Examine one of the rsem output files:
%%bash
more ds/RSEM.genes.results
We will use edgeR to identify differentially expressed genes.
For information about edgeR, see its bioconductor page - http://bioconductor.org/packages/release/bioc/html/edgeR.html
For the paper describing it, see - https://academic.oup.com/bioinformatics/article/26/1/139/182458
7.1.
Make a sample expression matrix:
%%bash
scripts/abundance_estimates_to_matrix.pl \
--est_method RSEM \
--gene_trans_map none \
--name_sample_by_basedir \
log/RSEM.genes.results \
ds/RSEM.genes.results \
hs/RSEM.genes.results \
plat/RSEM.genes.results
7.2.
Run the Trinity differential expression analysis pipeline:
%%bash
Analysis/DifferentialExpression/run_DE_analysis.pl \
--matrix RSEM.isoform.counts.matrix \
--method edgeR \
--dispersion 0.1 \
--output edgeR_output
--matrix # the matrix file from 7.1.
--method # the differential expression method, DEseq2, edgeR, etc.
--dispersion # a value used to determine how much the variance for each gene deviates from the mean. We will use 0.1 as an estimate because we don't have biological replicates.
7.3.
Change into the edgeR output directory and examine the output:
%%bash
cd edgeR_output
ls
7.4.
Read counts, DE results, MA plots, and volcano plots are generated from each sample comparison. Examine these files along with the instructor. PDF files can be viewed in Cyberduck - select file and hit the spacebar
. Text files can be viewed in the terminal or downloaded and opened in a text editor or Excel.
7.4.
Generate heat map and sample correlation matrix:
Scaling normalization is typically recommended. The Trinity pipeline has a helper script that uses the TMM method (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25) that can be run prior to this step. See run_TMM_normalization_write_ FPKM_matrix.pl
.
../Analysis/DifferentialExpression/analyze_diff_expr.pl \
--matrix ../RSEM.isoform.counts.matrix \
-C 2 -P 0.001
RSEM.isoform.counts.matrix # file generated in step 7.2
-C # log2 fold change
-P # P value cutoff
7.5.
Examine the heatmap (diffExpr.P0.001_C2.matrix.log2.centered.genes_vs_samples_heatmap.pdf
) and correlation matrix(diffExpr.P0.001_C2.matrix.log2.centered.sample_cor_matrix.pdf
) by selecting each file in Cyberduck and pressing the spacebar
.