Background

The large size of genomic datasets often lends itself to large and complex data plots. In particular, it is often useful or necessary to generate data plots with lots of separate panels (e.g., for different chromosomes, experimental treatments, tissue types, etc.). In generating these plots, researchers can save a lot of time and get a higher quality end-product by using code-based graphing language such as R (and particularly the ggplot2 package within R). The ability to make minor changes to code and re-run it to generate new or modified figures is a huge advantage over programs such as Microsoft Excel that require extensive manual work in generating data plots. The ggplot package is also designed to produce attractive figures without extensive customization or modification of default settings.

Objectives

The goal of this exercise will be use the ggplot2 package in R to plot data that were produced by mapping RNA-seq reads to an unusually large and fragmented plant mitochondrial genome that consists of 63 chromosomes. You will generate an individual plot for a single chromosome and then generate a massive multipanel figure with one plot for each of the 63 chromosomes.

Software and Dependencies

Protocol

1. Open the R console and load the ggplot2 package

Launch an R or RStudio console session. Load ggplot2 with the following command.

library(ggplot2)

If you receive an error because ggplot2 is not installed on your machine, you can install it with the command install.packages("ggplot2") (but you can ignore warning messages that ggplot2 was built under a different version of R).

2. Import RNA-seq summary data

We are going to work from a data set that summarizes read depth from RNA-seq reads that were mapped onto a genome that consists of multiple chromosomes. It is called SlidingWindow.txt and can be downloaded here.

Read in the data with the following command (you may need to provide the path to where you stored this data file):

RNAseq = read.delim ("SlidingWindow.txt")


Print the first five rows of data with the following command:

RNAseq[1:5,]
##   Chrom  Pos     Region RCM Enrichment
## 1     1  251 intergenic   1          0
## 2     1  501 intergenic   1          0
## 3     1  751 intergenic   1          0
## 4     1 1001 intergenic   1          0
## 5     1 1251 intergenic   1          0


Note that there are five data columns, containing the following.

  • Chrom – Chromosome number
  • Pos – Nucleotide position for a 250-bp window used to calculated average read depth
  • Region – Identification of the corresponding window as either “genic” or “intergenic”
  • RCM – Average read depth in the corresponding window expressed as read count per million
  • Enrichment – Proportional enrichment of coverage in a mitochondrial fraction relative to total cellular RNA (log scale)

3. Generate a subset of the data that only includes a single chromosome

Even though this data file is a highly reduced summary of the raw RNA-seq data, it still contains more than 28,000 data points for 63 different chromosomes. Let’s simplify a little by selecting the subset of data points that are from Chromosome 21 and start with those. The which function is very useful for this.

RNAseq_chrom21 = RNAseq[which(RNAseq$Chrom==21),]

4. Generate plots of read depth along the length of chromosome 21 with ggplot

Let’s begin by defining a new plot with the ggplot function. We will work from the data subset for chromosome 21 only, and we will make nucleotide position our x-variable and read depth our y-variable.

singlePlot = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM))


Now, try viewing that plot by entering its name on the command line.

singlePlot




Oops. That gave us a blank plot. This is a common mistake (and source of frustration) with ggplot. In the native plot function in R, simply defining the x- and y-variables is sufficient to generate a plot. For example:

plot (RNAseq_chrom21$Pos, RNAseq_chrom21$RCM)



But this is not sufficient for ggplot. In this case, we must at least specify that we want to generate a scatter plot by adding the geom_point() statement.

singlePlot = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM)) + geom_point()

singlePlot




But what if we want to distinguish between “genic” and “intergenic” regions as defined in the input file? We can re-define our plot by adding a color statement.

singlePlot = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM, color=Region)) + geom_point()

singlePlot




If you are not fond of the color scheme, you can modify it by adding an additional statement to the existing plot.

singlePlot + scale_colour_manual(values=c("red", "black"))




Now we see that the region of highest expression is an annotated gene. But there are also highly expressed regions in uncharacterized intergenic sequence. Note that we could have used an alternative color scheme for our points. For example, instead of putting them into two categories (genic and intergenic) we could have used a “heat map” to color them based on a continuous variable such as “Enrichment” (see above). Let’s try that by defining a new plot entirely.

singlePlot2 = ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM, color=Enrichment)) + geom_point()

singlePlot2




Now the shading of the points reflects the extent to which the transcripts are enriched in the mitochondria relative to the total cellular fraction (lighter blue indicates more enrichment). But maybe we can make the visualization a little clearer by manually defining the heat-map color scheme as follows:

singlePlot2 + scale_colour_gradient(limits=c(-1, 1), low = "green", high = "red")




Now, we clearly see that most of the highly expressed regions are also greatly enriched in mitochondrial RNA samples. But that is not the case for one region (green points around position 10 kb). This region is likely a foreign chloroplast or nuclear gene that has been inserted into the mitochondrial genome but not actually highly expressed in the mitochondria themselves.

It is often best to display expression data on a log-scale, so let’s make a further modification by manually defining the y-axis including the use of a log transformation as follows.

singlePlot2 + scale_colour_gradient(limits=c(-1, 1), low = "green", high = "red") + scale_y_continuous(limits=c(1, 1000), breaks=c(1, 10, 100, 1000), trans="log1p")




For one last tweak, let’s change the axes labels to something a little less cryptic.

singlePlot2 + scale_colour_gradient(limits=c(-1, 1), low = "green", high = "red") + scale_y_continuous(limits=c(1, 1000), breaks=c(1, 10, 100, 1000), trans="log1p") + xlab("Nucleotide Position") + ylab("Read Count Per Million")

5. Generate a multipanel figure showing read depth across each of the 63 chromosomes

We have just developed code to plot read depth along the length of a single chromosome. But what about the other 62 chromosomes in this genome? It turns out that with very small additions and modifications to our existing code, we can generate a single figure with a separate panel for each and everyone of the chromosomes. Compare the following blocks of code for plotting data from just one chromosome or all 63 chromosomes.

One chromosome:
ggplot (data=RNAseq_chrom21, aes(x=Pos, y=RCM, color=Enrichment)) + geom_point() + scale_colour_gradient(limits=c(-1, 1), low = "green", high = "red") + scale_y_continuous(limits=c(1, 1000), breaks=c(1, 10, 100, 1000), trans="log1p") + xlab("Nucleotide Position") + ylab("Read Count Per Million")
All chromosomes:
ggplot (data=RNAseq, aes(x=Pos, y=RCM, color=Enrichment)) + geom_point() + scale_colour_gradient(limits=c(-1, 1), low = "green", high = "red") + scale_y_continuous(limits=c(1, 1000), breaks=c(1, 10, 100, 1000), trans="log1p") + xlab("Nucleotide Position") + ylab("Read Count Per Million") + facet_wrap(~Chrom, nrow=7, ncol=9)

These two commands are almost exactly the same. As highlighted in bold text, the only differences are that, in the second command, we are now using the full dataset with all chromosomes (RNAseq instead of RNAseq_chrom21) and that we have added the facet_wrap statement, which separates the analysis by chromosome and reports an array of individual plots. To generate a multipanel plot, execute the above command for “All chromosomes” in your R console.




This is a good start, but you may notice that the points are looking a little too large when they are packed into this many plots, and some of the text is starting to overlap. Let’s make a few modifications to our code (i.e., change point size and text size and redefine the x-axis). The code changes are shown in bold.

multiPlot = ggplot (data=RNAseq, aes(x=Pos/1000, y=RCM, color=Enrichment)) + geom_point(size=0.5) + scale_colour_gradient(limits=c(-1, 1), low = "green", high = "red") + scale_y_continuous(limits=c(1, 1000), breaks=c(1, 10, 100, 1000), trans="log1p") + xlab("Nucleotide Position (kb)") + ylab("Read Count Per Million") + facet_wrap(~Chrom, nrow=7, ncol=9) + scale_x_continuous(limits=c(0, 200), breaks=c(0, 75, 150)) + theme (text = element_text(size=8))


To output a figure to a PDF, use the pdf and dev.off commands:

pdf ("multiPlot.pdf")  
multiPlot  
dev.off()


To verify that you have produced a publication-quality multipanel figure, open the PDF. You can do that in many ways (such as simply double-clicking on it). But you can also launch a file from the R console as follows:

system2('open', args = 'multiPlot.pdf')




In this exercise, we have taken a look at just one feature of ggplot (i.e., scatter plots). To get a better sense of the diverse capabilities of this package, explore the ggplot documentation and try the online tutorial from the Harvard IQSS.