Todos Santos Computational Biology and Genomics 2019
In this exercise we will go through an RNA-seq pipeline, including reference-free transcriptome assembly and differential gene expression analysis.
We will use the Trinity software package and associated scripts (https://github.com/trinityrnaseq/trinityrnaseq/wiki).
This exercise is adapted from Haas et al (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols. https://www.nature.com/articles/nprot.2013.084
1.1.
Create a new document in a text editor (BBEdit, Notepad++, or similar). As we go through the exercise, we will type our commands in the text editor and then copy them into the terminal one command at a time. This is a good way to edit and keep track of your code. Output from the various commands should also be copied into the document.
1.2.
Connect to the server containing the data and software:
%%bash
ssh yourname@rna.biology.colostate.edu
1.3.
Change into the directory containing the materials for the exercise:
%%bash
cd TrinityExercise
1.4.
Examine the contents of the directory:
%%bash
ls
We will analyze paired end data from yeast from various growth conditions (these are subsets of larger datasets):
heatshock_left.fq.gz # left reads for heatshock
heatshock_right.fq.gz # right reads for heatshock
log_left.fq.gz # left reads for log phase
log_right.fq.gz # left reads for log phase
plateau_left.fq.gz # left reads for plateau phase
plateau_right.fq.gz # right reads for plateau phase
We will use the GUI program FastQC to assess the quality of our libraries.
Want to learn more? See https://www.youtube.com/watch?v=bz93ReOv87Y
2.1.
Run FastQC (normally you should do this for every library but we'll just focus on one for the sake of time):
%%bash
fastqc file_name.fq.gz
2.2.
Open Cyberduck and connect to the server using SFTP.
2.3.
Navivate to the fastq output which will have a .html
extension.
2.4.
Select the FastQC output and press the space bar to view.
Here, we will use Trimmomatic to remove adapter sequence and low quality bases from reads and filter out low quality reads and reads shorter than 36 nt (note that we are using very leniant critaria for trimming reads).
See the Trimmomatic manual for more details - http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
3.1.
Trim off adapters and filter low quality reads:
%%bash
trimmomatic PE \ # call trimmomatic in paired-end mode
`input_left.fq.gz` \ #input file read 1 set
`input_right.fq.gz` \ #input file read 2 set
`Trimmed_input_left.paired.fq.gz` \ #output file read 1 set, paired reads
`Trimmed_input_left.unpaired.fq.gz` \ #output file read 1 set, unpaired reads
`Trimmed_input_right.paired.fq.gz` \ #output file read 2 set, paired reads
`Trimmed_input_right.unpaired.fq.gz` \ #output file read 2 set, unpaired reads
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 #trim
This is the actual code with the fastq file names for the heatshock data:
%%bash
trimmomatic PE \
heatshock_left.fq.gz \
heatshock_right.fq.gz \
Trimmed_heatshock_left_paired.fq.gz \
Trimmed_heatshock_left_unpaired.fq.gz \
Trimmed_heatshock_right_paired.fq.gz \
Trimmed_heatshock_right_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
3.2.
Repeat adapter trimming and quality filtering for each library.
Trinity will be used to assemble transcripts.
Want to know how it works? See https://www.broadinstitute.org/videos/trinity-how-it-works
4.1.
Combine the forward RNA-seq reads and the reverse RNA-seq reads from all samples into single files:
%%bash
cat Trimmed*left*gz >reads_ALL_left.fq.gz
cat Trimmed*right*gz >reads_ALL_right.fq.gz
4.2.
Assemble the reads into a reference transcriptome (~120 min with 4 threads):
%%bash
Trinity \
--seqType fq \
--left reads_ALL_left.fq.gz \
--right reads_ALL_right.fq.gz \
--SS_lib_type RF \
--CPU 4 \
--max_memory 20G
--seqType # fastq (fq)
--left # left reads fastq file
--right # right reads fastq file
--SS_lib_type # strand specificity; RF indicates the right reads corresond to the transcribed strands.
--CPU # controls the amount of RAM used
The transcripts are in the output file trinity_out_dir/Trinity.fasta
.
4.3.
Assess the quality of the assembly:
%%bash
scripts/TrinityStats.pl \
trinity_out_dir/Trinity.fasta
We will now compare our de novo transcriptome to the S.pombe reference transcriptome using BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).
5.1.
Make a BLAST database using the S. pombe reference transcriptome:
%%bash
makeblastdb \
-in S_pombe_refTrans.fasta \
-dbtype nucl
5.2.
Use megablast to blast the trinity assembled transcriptome against the reference transcriptome:
%%bash
blastn -query trinity_out_dir/Trinity.fasta \
-db S_pombe_refTrans.fasta \
-out Trinity_vs_S_pombe_refTrans.blastn \
-evalue 1e-20 \
-dust no \
-task megablast \
-num_threads 4 \
-max_target_seqs 1 \
-outfmt 6
5.3.
Summarize blast results:
%%bash
Here we will count the numbers of reads aligning to each trancript in each of our libraries using RSEM.
A tutorial on RSEM can be found here - https://github.com/bli25broad/RSEM_tutorial.
A paper descibing RSEM is here - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323
6.1.
Count the number of reads for each transcript identified by Trinity in the heatshock sample:
%%bash
scripts/align_and_estimate_abundance.pl \
--seqType fq \
--transcripts trinity_out_dir/Trinity.fasta \
--left heatshock_left.fq.gz \
--right heatshock_right.fq.gz \
--seqType fq \
--est_method RSEM \
--aln_method bowtie2 \
--trinity_mode \
--prep_reference \
--output_dir heatshock \
--thread_count 4
6.2.
Repeat read counts for each library - heatshock, log, plateau.
6.3.
Examine one of the rsem output files:
%%bash
more heatshock/RSEM.genes.results
We will use edgeR to identify differentially expressed genes.
For information about edgeR, see its bioconductor page - http://bioconductor.org/packages/release/bioc/html/edgeR.html
For the paper describing edgeR, see - https://academic.oup.com/bioinformatics/article/26/1/139/182458
7.1.
Make a sample expression matrix:
%%bash
scripts/abundance_estimates_to_matrix.pl \
--est_method RSEM \
--gene_trans_map none \
--name_sample_by_basedir \
heatshock/RSEM.isoforms.results \
log/RSEM.isoforms.results \
plateau/RSEM.isoforms.results
7.2.
Run the Trinity differential expression analysis pipeline:
%%bash
Analysis/DifferentialExpression/run_DE_analysis.pl \
--matrix RSEM.isoform.counts.matrix \
--method edgeR \
--dispersion 0.1 \
--output edgeR_output
--matrix # the matrix file from 7.1.
--method # the differential expression method, DEseq2, edgeR, etc.
--dispersion # a value used to determine how much the variance for each gene deviates from the mean. We will use 0.1 as an estimate because we don't have biological replicates.
7.3.
Change into the edgeR output directory and examine the output:
%%bash
cd edgeR_output
ls
7.4.
Read counts, DE results, MA plots, and volcano plots are generated from each sample comparison. Examine these files along with the instructor. PDF files can be viewed in Cyberduck - select file and hit the spacebar
. Text files can be viewed in the terminal or downloaded and opened in a text editor or Excel.
7.5.
Optional: Generate heat map and sample correlation matrix:
Scaling normalization is typically recommended. The Trinity pipeline has a helper script that uses the TMM method (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25) that can be run prior to this step. See run_TMM_normalization_write_ FPKM_matrix.pl
.
../Analysis/DifferentialExpression/analyze_diff_expr.pl \
--matrix ../RSEM.isoform.counts.matrix \
-C 2 -P 0.001
RSEM.isoform.counts.matrix # file generated in step 7.2
-C # log2 fold change
-P # P value cutoff
Examine the heatmap (diffExpr.P0.001_C2.matrix.log2.centered.genes_vs_samples_heatmap.pdf
) and correlation matrix(diffExpr.P0.001_C2.matrix.log2.centered.sample_cor_matrix.pdf
) by selecting each file in Cyberduck and pressing the spacebar
.