Generating graphics is a central component of scientific communication that is important in both written publications and oral presentations. Many researchers are accustomed to manually “drawing” figures using program such as Inkscape, Adobe Illustrator, or Microsoft PowerPoint. These are valuable tools, but in cases where graphics are meant to accurately represent quantitative data, there are benefits to use a coding language. You are probably familiar with R as a statistical language and a resource for generating plots and graphs. But R can also be used for drawing shapes and graphics. Some of the advantages of using R in this context are:
The ability to generate and save code/scripts that will make it efficient to produce new figures based on related datasets or modified versions of the same figure without starting from scratch each time.
The ability to imbed these steps into larger scripts or pipelines and avoid laborious manual work.
The ability to be precise and accurate with placement of objects to faithfully represent the data.
Note that Processing is an even more powerful tool for code-based graphic design. We will not be using it in this workshop, but it is something to be aware of depending on your research and visualization needs.
Launch an R or RStudio console session, and create a canvas for yourself by entering the plot.new()
command:
plot.new()
Now, let’s use the polygon
function to create a simple shape. When calling this function, the minimum necessary information to provide is a vector of x-coordinates and a vector of y-coordinates. For example, the following command should generate a rectangle that is 0.2 units long and 0.7 units high, with its lower-left hand corner at the origin. Note that the order of the points is important because the outline of the shape will be traced in the order that the points are listed.
polygon(c(0, 0, 0.2, 0.2), c(0, 0.7, 0.7, 0))
After entering that command, you should see the outline of a rectangle appear in your plot window (see above).
Now enter two more polygon commands.
polygon(c(0.3, 0.4, 0.5), c(0.05, 0.4, 0.05), lwd=4)
polygon (c(0, 0.1, 0.5, 0.5, 0.1), c(0.4, 0.5, 0.5, 0.3, 0.3), col="green")
You should see two more shapes appear in your plot window: a triangle and an arrow-like pentagon. Note that we have also added an additional parameter to each of these function calls. The lwd
parameter is used to set the line-weight. Notice that we have increased it above the default value of 1, creating a thicker line around the triangle. The col
parameter sets the fill color for the object, and we have used it to make the pentagon green. By this point you should have a nice piece of abstract art! (See above)
To learn more about options and parameters, try exploring the par documentation in R by entering the following at the R command prompt:
?par
Finally, note that the green polygon is partially covering the triangle. This reflects the general rule that objects that are added later to the plot will be placed on top of others. This can be important for constructing complex or layered images.
Now that we see how we can draw shapes in R, let’s use these tools to make an image based on biological data such as the gene map below.
This figure compares homolgous regions of two related bacterial genomes. The shapes represent genes, and the gray shading connects pairs of orthologous genes (note that the few genes that are unique to one of the two genomes are shown in yellow).
This is a much more complicated figure than the ones we have drawn so far, but it is still just a set of different lines and polygons, so let’s walk through the steps that would be used to generate this in R.
First lets import some data that summarizes the positions of genes.
genes=read.table("~/Desktop/csb2017/r_figure_drawing/genes.txt", header=TRUE)
Print out the data to see how they are formatted.
genes
## Start End Strand Species Color
## 1 1 426 -1 1 steelblue3
## 2 426 1733 -1 1 steelblue3
## 3 1918 2751 1 1 steelblue3
## 4 2753 4135 -1 1 steelblue3
## 5 4218 5582 1 1 steelblue3
## 6 5623 6591 -1 1 steelblue3
## 7 6596 7045 -1 1 steelblue3
## 8 7060 7911 1 1 steelblue3
## 9 7895 9523 1 1 steelblue3
## 10 9533 10909 1 1 steelblue3
## 11 11175 11780 1 1 steelblue3
## 12 11794 12990 1 1 khaki1
## 13 12987 14396 1 1 steelblue3
## 14 14397 15743 -1 1 steelblue3
## 15 15855 17546 1 1 steelblue3
## 16 17548 18030 1 1 khaki1
## 17 18027 18449 1 1 steelblue3
## 18 18739 19296 1 1 khaki1
## 19 19293 20150 1 1 steelblue3
## 20 244879 245307 -1 2 steelblue3
## 21 245322 246620 -1 2 steelblue3
## 22 250359 251189 1 2 steelblue3
## 23 251449 252439 -1 2 steelblue3
## 24 253012 254382 1 2 steelblue3
## 25 254379 255350 -1 2 steelblue3
## 26 255356 255811 -1 2 steelblue3
## 27 255836 256681 1 2 steelblue3
## 28 256659 258260 1 2 steelblue3
## 29 260255 261616 1 2 steelblue3
## 30 263608 264030 -1 2 khaki1
## 31 265481 266029 1 2 steelblue3
## 32 267276 268658 1 2 steelblue3
## 33 268691 270037 -1 2 steelblue3
## 34 270138 271835 1 2 steelblue3
## 35 273254 273697 1 2 steelblue3
## 36 276841 276951 -1 2 khaki1
## 37 277710 278570 1 2 steelblue3
As you can see, there are five columns, which provide the following information for each gene:
This table provides lots of information for each gene. But it does not tell us which genes/regions of the two genomes are homologous to each other. Let’s import another dataset that summarizes that information (from a sequence analysis thay was performed previously).
connections=read.table("~/Desktop/csb2017/r_figure_drawing/connections.txt", header=TRUE)
connections
## SpeciesA SpeciesB StartA EndA StartB EndB
## 1 1 2 1 1733 244879 246620
## 2 1 2 1918 4135 250359 252439
## 3 1 2 4218 9523 253012 258260
## 4 1 2 9533 10909 260255 261616
## 5 1 2 11175 11780 265481 266029
## 6 1 2 12987 17546 267276 271835
## 7 1 2 18027 18449 273254 273697
## 8 1 2 19293 20150 277710 278570
Each row in this table summarizes the start and end positions of a region that is homologous between our two genomes.
Now lets calculate the lengths for each of the two genome region and determine which is longer. We will use this to scale the size of our entire figure.
species1_min = min(genes[which(genes$Species==1), "Start"])
species1_max = max(genes[which(genes$Species==1), "End"])
species1_length = species1_max - species1_min + 1
species2_min = min(genes[which(genes$Species==2), "Start"])
species2_max = max(genes[which(genes$Species==2), "End"])
species2_length = species2_max - species2_min + 1
largestLength = max (species1_length, species2_length)
Look carefully at each of these commands to understand what they are doing. Note that they are taking a subset of the data for either species 1 or species 2 with the which
command. They are also using the max
and min
functions to determine the beginning (min) and ends (max) of the regions in each species. Total lengths are calculated by subtracting the min positions from the max positions (and adding 1 to be inclusive). You can check that your calculations were performed correctly by looking at the value of the largestLength
variable:
largestLength
## [1] 33692
Now let’s create a new plot and start drawing our map. We can use the segments
function to draw lines for our two genome regions. This command takes x- and y-coordinates for a start point and for an end point. In this case, we will use arbitrary y-coordinates of 0.4 and 0.5 so one of our genomic regions will be on top of the other. The x-coordinates will start at 0, and the end will be determined by the length of the region in the data set. The lwd
specifies the “thickness” of the line.
plot.new()
height1 = 0.4
height2 = 0.5
segments(0, height1, species1_length/largestLength, height1, lwd=3)
segments(0, height2, species2_length/largestLength, height2, lwd=3)
We can then add gray shaded regions to note regions of homology between the two genomes. We are doing this now because we want these shapes to be “under” our genes in our final drawing. As in Part 1, we will use the polygon
function to draw these shapes. So we need to define the x- and y-coordinates for each region. This can be done with the data that we imported from teh connections.txt file, but we have to scale the nucleotide position coordinates to the length of the genomic region. For example lets look at the first line in that file:
connections[1,]
## SpeciesA SpeciesB StartA EndA StartB EndB
## 1 1 2 1 1733 244879 246620
To define our x-coordinates, we need compare the nucleotide positions to the start positions and total lengths of these regions as follows. Note that we are defining the color of our shapes with the col
option and turning off any outline with the border = NA
option.
left1 = (connections[1, "StartA"] - species1_min)/largestLength
right1 = (connections[1, "EndA"] - species1_min)/largestLength
left2 = (connections[1, "StartB"] - species2_min)/largestLength
right2 = (connections[1,"EndB"] - species2_min)/largestLength
polygon(c(left1, left2, right2, right1), c(height1, height2, height2, height1), col = "gray85", border = NA)
That adds one of the homology regions we want to show. To add all of them, we can use a for
loop to go through all the lines in our file instead of manually adding each one individually. The following code will loop through all the lines in our connections.txt file instead of justing look at 1. Note that we are using the “index” variable i
and set it equal to row number for each pass through the loop, and we determine the number of rows in the table with the dim
function. As you can see, the code inside the foor loop is almost exactly the same as above except that we have replaced the specific row number (1) with the index variable i
.
for (i in 1:dim(connections)[1]){
left1 = (connections[i, "StartA"] - species1_min)/largestLength
right1 = (connections[i, "EndA"] - species1_min)/largestLength
left2 = (connections[i, "StartB"] - species2_min)/largestLength
right2 = (connections[i,"EndB"] - species2_min)/largestLength
polygon(c(left1, left2, right2, right1), c(height1, height2, height2, height1), col = "gray85", border = NA)
}
Note that are homology shapes are starting to cover up our line segments. We can always put them back on “top” by re-entering the segments
commands.
segments(0, height1, species1_length/largestLength, height1, lwd=3)
segments(0, height2, species2_length/largestLength, height2, lwd=3)
Now all we have to do is add our genes. Once again, we will use the polygon
function for each gene, and the challenge will be to define our x- and y-coordinates based on the nucletide position in the genes.txt file. Lets look at the first line in that table.
genes[1,]
## Start End Strand Species Color
## 1 1 426 -1 1 steelblue3
The following code is compicated but try to look at it to understand how it is defining an “arrow” (5-pointed polygon) for each gene in the genes.txt file. Note that once again we are using a for
loop to go over each row in the table. We are also using if
and else
statements because the code is going to have to be different depending on whether the gene is on the forward or reverse strand or if we are looking at species 1 or species 2.
#We need to define the length of the triangle portion of each arrow and the height of our gene shapes:
arrowLen = 0.01;
boxHeight = 0.04;
for (i in 1:dim(genes)[1]){
#Define the "left" and "right" x-coordinates.
if (genes[i, "Species"] == 1){
left = (genes[i,"Start"]-species1_min)/largestLength
right = (genes[i,"End"]-species1_min)/largestLength
height = height1
}else if (genes[i, "Species"] == 2){
left = (genes[i,"Start"]-species2_min)/largestLength
right = (genes[i,"End"]-species2_min)/largestLength
height = height2
}
#we have already defined the gene colors in our input data file.
geneFillColor=as.character(genes[i,"Color"]);
#Draw arrows, distinguishing between forward and reverse oriented genes
if (genes[i,"Strand"] == 1){
arrowStart = max (left, right - arrowLen)
polygon (c(left, arrowStart, right, arrowStart, left), c(height-boxHeight/2, height-boxHeight/2, height, height+boxHeight/2, height+boxHeight/2), col = geneFillColor, lwd = 0.5)
}else if (genes[i,"Strand"] == -1){
arrowStart = min (right, left + arrowLen)
polygon (c(left, arrowStart, right, right, arrowStart), c(height, height-boxHeight/2, height-boxHeight/2, height+boxHeight/2, height+boxHeight/2), col = geneFillColor, lwd = 0.5)
}
}
That involved a lot of code, but hopefully you can see how simple shape drawing can be built up into more complex and precise figures based on biological data.
It can be very useful to output R plots or graphics directly to a PDF rather than displaying them within a R plot window. Let’s use the polygon function again by entering the following four commands at the R command prompt.
pdf("~/Desktop/csb2017/r_figure_drawing/yellow_square.pdf")
plot.new()
polygon(c(0, 0, 0.5, 0.5), c(0, 0.5, 0.5, 0), col= "yellow")
dev.off()
Note that the polygon function call should have generated a beautiful yellow square, but you probably noticed that nothing showed up in your plot window. The reason for that is we called the pdf
function first. By doing that, we created a new PDF file and stored the new plot directly into that file. Also, note the dev.off()
command at the end. That is important because it essentially closes and saves the file. Until that command is run, you will find the PDF is just an empty file.
Verify that you successfully generated the PDF file, by entering the following command.
system2('open', args = '~/Desktop/csb2017/r_figure_drawing/yellow_square.pdf')
Being able to write directly to a PDF may seem rather useless (after all, you could always export the plot window from an R session as a PDF). But as we will see, it is a handy feature that helps when we want build R scripts and skip the R console entirely.
One advantage of code-based statistical packages like R is that you can combine scripts into large data analysis pipelines. For this reason, it can often be useful to run R code without opening an RStudio or R console session. Lets see an example of this. To draw the gene map in Part 2, we could of built a script instead of entering all that code piece by piece. In fact, if you inspect the file called comparative_gene_map.R
, you will find that it contains essentially the same code as we used to make our map. However, it can be called directly from the command line (with any genes and connections file to produce any map you want).
To test this out, open a Terminal session (not Rstudio or the R console) and cd
in the directory for this exercise. You can also view the content of the script using less
(and type q
to exit when your done looking at it).
#Run the following from a bash terminal (not in R)
cd ~/Desktop/csb2017/r_figure_drawing
less comparative_gene_map.R
Run the following command which will execute the script in R and pass the arguments for input and output files to the script.
#Run the following from a bash terminal (not in R). Note that for this to work, the Rscript executable must be in your PATH.
Rscript comparative_gene_map.R genes.txt connections.txt Rscript_output.pdf
Confirm that this generated the same map that we did in Part 2 by opening the PDF output as follows:
#Run the following from a bash terminal (not in R)
open Rscript_output.pdf