The large size of genomic datasets often lends itself to large and complex data plots. In particular, it is often useful or necessary to generate data plots with lots of separate panels (e.g., for different chromosomes, experimental treatments, tissue types, etc.). In generating these plots, researchers can save a lot of time and get a higher quality end-product by using code-based graphing language such as R (and particularly the ggplot2 package within R). The ability to make minor changes to code and re-run it to generate new or modified figures is a huge advantage over programs such as Microsoft Excel that require extensive manual work in generating data plots. The ggplot package is also designed to produce attractive figures without extensive customization or modification of default settings.
The goal of this exercise will be use the ggplot2 package in R to plot data that were produced by mapping RNA-seq reads to an unusually large and fragmented plant mitochondrial genome that consists of 63 chromosomes. You will generate an individual plot for a single chromosome and then generate a massive multipanel figure with one plot for each of the 63 chromosomes.
Launch an R or RStudio console session. Load ggplot2 with the following command.
library(ggplot2)
If you receive an error because ggplot2 is not installed on your machine, you can install it with the command install.packages("ggplot2")
.
We are going to work from a data set that summarizes read depth from RNA-seq reads that were mapped onto a genome that consists of multiple chromosomes. It is called SlidingWindow.txt
and can be downloaded here.
Read in the data with the following command (you may need to change your working directory or provide the path to where you stored this data file):
RNAseq = read.delim ("SlidingWindow.txt")
Check out the structure of the dataset by using the str
functioning and printing the first five rows of data with the following commands:
str(RNAseq)
## 'data.frame': 28581 obs. of 5 variables:
## $ Chrom : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Pos : int 251 501 751 1001 1251 1501 1751 2001 2251 2501 ...
## $ Region : chr "intergenic" "intergenic" "intergenic" "intergenic" ...
## $ RCM : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Enrichment: num 0 0 0 0 0 0 0 0 0 0 ...
RNAseq[1:5,]
## Chrom Pos Region RCM Enrichment
## 1 1 251 intergenic 1 0
## 2 1 501 intergenic 1 0
## 3 1 751 intergenic 1 0
## 4 1 1001 intergenic 1 0
## 5 1 1251 intergenic 1 0
Note that there are five data columns, containing the following.
Even though this data file is a highly reduced summary of the raw RNA-seq data, it still contains more than 28,000 data points for 63 different chromosomes. Let’s simplify a little by selecting the subset of data points that are from Chromosome 21 and start with those. The which
function in the R base package is useful for this.
RNAseq_chrom21 = RNAseq[which(RNAseq$Chrom==21),]
Note that if you have tidyverse installed and loaded, you can also use the filter
function in dplyr, which is a little more readable: RNAseq_chrom21 = RNAseq %>% filter(Chrom==21)
.
Let’s begin by defining a new plot with the ggplot function. We will work from the data subset for chromosome 21 only, and we will make nucleotide position our x-variable and read depth our y-variable.
singlePlot = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM))
Now, try viewing that plot by entering its name on the command line.
singlePlot
Oops. That gave us a blank plot. This is a common mistake (and source of frustration) with ggplot. In the native plot function in R, simply defining the x- and y-variables is sufficient to generate a plot. For example:
plot (RNAseq_chrom21$Pos, RNAseq_chrom21$RCM)
But this is not sufficient for ggplot. In this case, we have only done the aesthetic mappings to associate the variables Pos
and RCM
in the dataset with the X and Y positions, respectively. To visualize the data, we need to specify at least one geometry. Let’s generate a scatter plot by adding the geom_point()
statement. This will inherit the mapping information from the aes()
arguments in the main ggplot
call.
singlePlot = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM)) + geom_point()
singlePlot
Now, let’s distinguish between “genic” and “intergenic” regions as defined in the input file. We can add to our aesthetic mappings by including a color
statement and associating with the Region
variable in our data set.
singlePlot = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM, color=Region)) + geom_point()
singlePlot
If you are not fond of the color scheme, you can modify it by adding an additional statement to the existing plot.
singlePlot + scale_color_manual(values=c("red", "black"))
Now we see that the region of highest expression is an annotated gene. But there are also highly expressed regions in uncharacterized intergenic sequence. Note that we could have used an alternative color scheme for our points. For example, instead of putting them into two categories (genic and intergenic) we could have used a “heat map” to color them based on a continuous variable such as Enrichment
(see above for description of the Enrichment
data). Let’s try that by defining a new plot entirely.
singlePlot2 = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM, color=Enrichment)) + geom_point()
singlePlot2
Now the shading of the points reflects the extent to which the transcripts are enriched in the mitochondria relative to the total-cellular fraction (lighter blue indicates more enrichment). If you don’t like the default color scheme, the viridis
package offers a nice palette for heat maps (and note that this palette is better perceived by color-blind individuals than classic red/green heat map scales). We can amend our stored plot as follows. Note that if you don’t have the viridis
package installed you can add it with install.packages("viridis")
.
library(viridis)
## Loading required package: viridisLite
singlePlot2 + scale_color_viridis()
Now, we clearly see that most of the highly expressed regions are also greatly enriched in mitochondrial RNA samples. But that is not the case for one region (three purple points around position 10 kb). This region is likely a foreign chloroplast or nuclear gene that has been inserted into the mitochondrial genome but not actually highly expressed in the mitochondria themselves.
Above, we made a modification to a stored plot (singlePlot2
). But we can also build up the full code in one statement, and you do not always need to store your plot in a vairable. Just running the code without assigning it to a variable will automatically display the results. Let’s just enter the full code for the next step. It is often best to display expression data on a log-scale, so let’s make a further modification to the y-axis scale.
ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM, color=Enrichment)) + geom_point() + scale_color_viridis() + scale_y_log10()
It’s always good to label your axes clearly. Let’s change the axes labels to something a little less cryptic. Note that this involves editing the theme
of the plot (i.e., the non-data text and lines that appear on the plot). This can be done with the theme
function, but xlab
and ylab
are shortcuts for just setting the axes labels.
ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM, color=Enrichment)) + geom_point() + scale_color_viridis() + scale_y_log10() + xlab("Nucleotide Position") + ylab("Read Count Per Million")
Hmmm. When we scale this up to a multipanel figure, we might not be able to write out those long number for the x-axis positions. Let’s convert those to kb by changing our x mapping from Pos
to Pos/1000
and updating our xlab
accordingly to indicate that the scale is in kb.
ggplot (data=RNAseq_chrom21, aes(x=Pos/1000, y=RCM, color=Enrichment)) + geom_point() + scale_color_viridis() + scale_y_log10() + xlab("Nucleotide Position (kb)") + ylab("Read Count Per Million")
Speaking of space, having a smaller point size might help us down the line. You can map size
to a variable with aes()
if you want it to reflect some aspect of your dataset, but you can also set it to a constant as an attribute of geom_point
. Let’s reduce the point size by adding size=0.5
to our geom_point()
call. This might be smaller than we’d like for a single plot like this, but it will help reduce overlap when we build up to a dense multipanel plot.
ggplot (data=RNAseq_chrom21, aes(x=Pos/1000, y=RCM, color=Enrichment)) + geom_point(size=0.5) + scale_color_viridis() + scale_y_log10() + xlab("Nucleotide Position (kb)") + ylab("Read Count Per Million")
One last tweak to our individual plot. There are numerous theme
elements you can adjust individually, but ggplot also provides pre-packaged themes. For example, theme_bw()
and theme_classic()
can provide a bit of a cleaner overall look than the default theme and its gray plot background. Let’s add + theme_bw()
to update our theme.
ggplot (data=RNAseq_chrom21, aes(x=Pos/1000, y=RCM, color=Enrichment)) + geom_point(size=0.5) + scale_color_viridis() + scale_y_log10() + xlab("Nucleotide Position (kb)") + ylab("Read Count Per Million") + theme_bw()
We have just developed code to plot read depth along the length of a single chromosome. But what about the other 62 chromosomes in this genome? It turns out that with very small additions and modifications to our existing code, we can generate a single figure with a separate panel for each and everyone of the chromosomes. Compare the following code for plotting data from all 63 chromosomes to what we just ran above for a single chromosome.
ggplot (data=RNAseq, aes(x=Pos/1000, y=RCM, color=Enrichment)) + geom_point(size=0.5) + scale_color_viridis() + scale_y_log10() + xlab("Nucleotide Position (kb)") + ylab("Read Count Per Million") + theme_bw() + scale_x_continuous (breaks=c(0,75,150)) + facet_wrap(~Chrom, nrow=7)
The two commands are almost exactly the same. As highlighted in bold text, the only differences are that 1) we are now using the full dataset with all chromosomes (RNAseq
instead of RNAseq_chrom21
), 2) we have manually set the breaks on the x-acis for tick marks to improve readability, and 3) we have added the facet_wrap
statement, which separates the analysis by chromosome and reports an array of individual plots. The facet_wrap
command (and related facet_grid
command) is an extremely valuable feature of ggplot, allowing you to build up complicated multipanel figures with relative ease.
Try running the above code and storing it to a variable called “multiPlot”. Then print it by typing multiPlot
.
There are multiple ways to output a figure to a PDF. One is use the pdf
and dev.off
commands as follows (you can also explore the ggsave
command as well as exporting from the viewer from RStudio).
pdf ("multiPlot.pdf")
multiPlot
dev.off()
To verify that you have produced a publication-quality multipanel figure, open the PDF and check it out!
Further develop your ggplot skills by generating your own code.
If you have your own dataset and vision for a plot, try to build it with ggplot. Be creative! Discuss ideas with your peers and the instructor.
If you don’t have a dataset read to go, try to recreate one of the two plots below with their associated datasets. Inspect the plot and consider how you might build it up from some of the commands you learned above. Start simple and add elements in steps.
The following plot shows how mitochondrial DNA copy number increases in polyploid plants relative to diploid plants. Quantifications are based on ddPCR analysis of four genes, each shown in a separate facet. The data are available here. Note that the plot uses two different geometries: geom_boxplot
and geom_jitter
. Layering multiple geometries is a common approach with ggplot.
The following plot shows distributions of mRNA expression levels in plant leaf tissue from genes in the nuclear, mitochondrial and chloroplast genomes. Data are based on ribo-zero RNA-seq analysis and are available here. The distributions in the ridgeline plot are visualized with geom_density_ridges
, which is a geometry available in the ggridges package.
The plots and datasets in this exercise were adapted from the following studies:
Forsythe ES, Grover CE, Miller ER, Conover JL, Chavarro CF, Arick MA, Peterson DG, Leal-Bertioli SCM, Sharbrough J, Wendel JF, Sloan DB. 2022 Organellar transcripts dominate the cellular mRNA pool across plants of varying ploidy levels. bioRxiv. 2022.03.12.484027
Fernandes Gyorfy M, Miller ER, Conover JL, Grover CE, Wendel JF, Sloan DB, Sharbrough J. 2021. Nuclear-cytoplasmic balance: whole genome duplications induce elevated organellar genome copy number. Plant Journal. 108: 219–230.
Wu Z, Stone JD, Štorchová H, Sloan DB. 2015. High transcript abundance, RNA editing, and small RNAs in intergenic regions within the massive mitochondrial genome of the angiosperm Silene noctiflora. BMC Genomics. 16: 938.
Datacamp: Check out the courses on Introduction to Data Visualization with ggplot2, Intermediate Data Visualization with ggplot2, and Visualization Best Practices in R.
ggtree: If you work with phylogenetic trees and want to generate plots or tables that line up with the tips of your tree, ggtree is a usual extension to ggplot.