Background

Generating graphics is a central component of scientific communication that is important in both written publications and oral presentations. Many researchers are accustomed to manually “drawing” figures using program such as Inkscape, Adobe Illustrator, or Microsoft PowerPoint. These are valuable tools, but in cases where graphics are meant to accurately represent quantitative data, there are benefits to use a coding language. You are probably familiar with R as a statistical language and a resource for generating plots and graphs. But R can also be used for drawing shapes and graphics. Some of the advantages of using R in this context are:

Note that Processing is an even more powerful tool for code-based graphic design. We will not be using it in this workshop, but it is something to be aware of depending on your research and visualization needs.

Objectives

Software and Dependencies

Protocol

1. Open R console and use it to draw shapes

Launch an R or RStudio console session, and create a canvas for yourself by entering the plot.new() command:

plot.new()

Now, let’s use the polygon function to create a simple shape. When calling this function, the minimum necessary information to provide is a vector of x-coordinates and a vector of y-coordinates. For example, the following command should generate a rectangle that is 0.2 units long and 0.7 units high, with its lower-left hand corner at the origin. Note that the order of the points is important because the outline of the shape will be traced in the order that the points are listed.

polygon(c(0, 0, 0.2, 0.2), c(0, 0.7, 0.7, 0))

After entering that command, you should see the outline of a rectangle appear in your plot window (see above).

Now enter two more polygon commands.

polygon(c(0.3, 0.4, 0.5), c(0.05, 0.4, 0.05), lwd=4)
polygon (c(0, 0.1, 0.5, 0.5, 0.1), c(0.4, 0.5, 0.5, 0.3, 0.3), col="green")

You should see two more shapes appear in your plot window: a triangle and an arrow-like pentagon. Note that we have also added an additional parameter to each of these function calls. The lwd parameter is used to set the line-weight. Notice that we have increased it above the default value of 1, creating a thicker line around the triangle. The col parameter sets the fill color for the object, and we have used it to make the pentagon green. By this point you should have a nice piece of abstract art! (See above)

To learn more about options and parameters, try exploring the par documentation in R by entering the following at the R command prompt:

?par

Finally, note that the green polygon is partially covering the triangle. This reflects the general rule that objects that are added later to the plot will be placed on top of others. This can be important for constructing complex or layered images.

2. Import data on gene positions and use R to draw a gene map

Now that we see how we can draw shapes in R, let’s use these tools to make an image based on biological data such as the gene map below.

This figure compares homolgous regions of two related bacterial genomes. The shapes represent genes, and the gray shading connects pairs of orthologous genes (note that the few genes that are unique to one of the two genomes are shown in yellow).

This is a much more complicated figure than the ones we have drawn so far, but it is still just a set of different lines and polygons, so let’s walk through the steps that would be used to generate this in R.

First lets import some data that summarizes the positions of genes.

genes=read.table("~/Desktop/csb2017/r_figure_drawing/genes.txt", header=TRUE)

Print out the data to see how they are formatted.

genes
##     Start    End Strand Species      Color
## 1       1    426     -1       1 steelblue3
## 2     426   1733     -1       1 steelblue3
## 3    1918   2751      1       1 steelblue3
## 4    2753   4135     -1       1 steelblue3
## 5    4218   5582      1       1 steelblue3
## 6    5623   6591     -1       1 steelblue3
## 7    6596   7045     -1       1 steelblue3
## 8    7060   7911      1       1 steelblue3
## 9    7895   9523      1       1 steelblue3
## 10   9533  10909      1       1 steelblue3
## 11  11175  11780      1       1 steelblue3
## 12  11794  12990      1       1     khaki1
## 13  12987  14396      1       1 steelblue3
## 14  14397  15743     -1       1 steelblue3
## 15  15855  17546      1       1 steelblue3
## 16  17548  18030      1       1     khaki1
## 17  18027  18449      1       1 steelblue3
## 18  18739  19296      1       1     khaki1
## 19  19293  20150      1       1 steelblue3
## 20 244879 245307     -1       2 steelblue3
## 21 245322 246620     -1       2 steelblue3
## 22 250359 251189      1       2 steelblue3
## 23 251449 252439     -1       2 steelblue3
## 24 253012 254382      1       2 steelblue3
## 25 254379 255350     -1       2 steelblue3
## 26 255356 255811     -1       2 steelblue3
## 27 255836 256681      1       2 steelblue3
## 28 256659 258260      1       2 steelblue3
## 29 260255 261616      1       2 steelblue3
## 30 263608 264030     -1       2     khaki1
## 31 265481 266029      1       2 steelblue3
## 32 267276 268658      1       2 steelblue3
## 33 268691 270037     -1       2 steelblue3
## 34 270138 271835      1       2 steelblue3
## 35 273254 273697      1       2 steelblue3
## 36 276841 276951     -1       2     khaki1
## 37 277710 278570      1       2 steelblue3

As you can see, there are five columns, which provide the following information for each gene:

  • Start – Start nucleotide position for the gene
  • End – End nucleotide position for the gene
  • Strand – Forward (1) or reverse (-1) orientation of the gene
  • Species – Which of the two species (genomes) the gene is found in
  • Color – The color we will use for the gene

This table provides lots of information for each gene. But it does not tell us which genes/regions of the two genomes are homologous to each other. Let’s import another dataset that summarizes that information (from a sequence analysis thay was performed previously).

connections=read.table("~/Desktop/csb2017/r_figure_drawing/connections.txt", header=TRUE)

connections
##   SpeciesA SpeciesB StartA  EndA StartB   EndB
## 1        1        2      1  1733 244879 246620
## 2        1        2   1918  4135 250359 252439
## 3        1        2   4218  9523 253012 258260
## 4        1        2   9533 10909 260255 261616
## 5        1        2  11175 11780 265481 266029
## 6        1        2  12987 17546 267276 271835
## 7        1        2  18027 18449 273254 273697
## 8        1        2  19293 20150 277710 278570

Each row in this table summarizes the start and end positions of a region that is homologous between our two genomes.

Now lets calculate the lengths for each of the two genome region and determine which is longer. We will use this to scale the size of our entire figure.

species1_min = min(genes[which(genes$Species==1), "Start"])

species1_max = max(genes[which(genes$Species==1), "End"])

species1_length = species1_max - species1_min + 1

species2_min = min(genes[which(genes$Species==2), "Start"])

species2_max = max(genes[which(genes$Species==2), "End"])

species2_length = species2_max - species2_min + 1

largestLength = max (species1_length, species2_length)

Look carefully at each of these commands to understand what they are doing. Note that they are taking a subset of the data for either species 1 or species 2 with the which command. They are also using the max and min functions to determine the beginning (min) and ends (max) of the regions in each species. Total lengths are calculated by subtracting the min positions from the max positions (and adding 1 to be inclusive). You can check that your calculations were performed correctly by looking at the value of the largestLength variable:

largestLength
## [1] 33692

Now let’s create a new plot and start drawing our map. We can use the segments function to draw lines for our two genome regions. This command takes x- and y-coordinates for a start point and for an end point. In this case, we will use arbitrary y-coordinates of 0.4 and 0.5 so one of our genomic regions will be on top of the other. The x-coordinates will start at 0, and the end will be determined by the length of the region in the data set. The lwd specifies the “thickness” of the line.

plot.new()

height1 = 0.4
height2 = 0.5

segments(0, height1, species1_length/largestLength, height1, lwd=3)
segments(0, height2, species2_length/largestLength, height2, lwd=3)

We can then add gray shaded regions to note regions of homology between the two genomes. We are doing this now because we want these shapes to be “under” our genes in our final drawing. As in Part 1, we will use the polygon function to draw these shapes. So we need to define the x- and y-coordinates for each region. This can be done with the data that we imported from teh connections.txt file, but we have to scale the nucleotide position coordinates to the length of the genomic region. For example lets look at the first line in that file:

connections[1,]
##   SpeciesA SpeciesB StartA EndA StartB   EndB
## 1        1        2      1 1733 244879 246620

To define our x-coordinates, we need compare the nucleotide positions to the start positions and total lengths of these regions as follows. Note that we are defining the color of our shapes with the col option and turning off any outline with the border = NA option.

left1 = (connections[1, "StartA"] - species1_min)/largestLength
right1 = (connections[1, "EndA"] - species1_min)/largestLength
left2 = (connections[1, "StartB"] - species2_min)/largestLength
right2 = (connections[1,"EndB"] - species2_min)/largestLength

polygon(c(left1, left2, right2, right1), c(height1, height2, height2, height1), col = "gray85", border = NA)

That adds one of the homology regions we want to show. To add all of them, we can use a for loop to go through all the lines in our file instead of manually adding each one individually. The following code will loop through all the lines in our connections.txt file instead of justing look at 1. Note that we are using the “index” variable i and set it equal to row number for each pass through the loop, and we determine the number of rows in the table with the dim function. As you can see, the code inside the foor loop is almost exactly the same as above except that we have replaced the specific row number (1) with the index variable i.

for (i in 1:dim(connections)[1]){
  left1 = (connections[i, "StartA"] - species1_min)/largestLength
  right1 = (connections[i, "EndA"] - species1_min)/largestLength
  left2 = (connections[i, "StartB"] - species2_min)/largestLength
  right2 = (connections[i,"EndB"] - species2_min)/largestLength
  polygon(c(left1, left2, right2, right1), c(height1, height2, height2, height1), col = "gray85", border = NA)
} 

Note that are homology shapes are starting to cover up our line segments. We can always put them back on “top” by re-entering the segments commands.

segments(0, height1, species1_length/largestLength, height1, lwd=3)
segments(0, height2, species2_length/largestLength, height2, lwd=3)

Now all we have to do is add our genes. Once again, we will use the polygon function for each gene, and the challenge will be to define our x- and y-coordinates based on the nucletide position in the genes.txt file. Lets look at the first line in that table.

genes[1,]
##   Start End Strand Species      Color
## 1     1 426     -1       1 steelblue3

The following code is compicated but try to look at it to understand how it is defining an “arrow” (5-pointed polygon) for each gene in the genes.txt file. Note that once again we are using a for loop to go over each row in the table. We are also using if and else statements because the code is going to have to be different depending on whether the gene is on the forward or reverse strand or if we are looking at species 1 or species 2.

#We need to define the length of the triangle portion of each arrow and the height of our gene shapes:
arrowLen = 0.01;
boxHeight = 0.04;

for (i in 1:dim(genes)[1]){
  #Define the "left" and "right" x-coordinates.
  if (genes[i, "Species"] == 1){
    left = (genes[i,"Start"]-species1_min)/largestLength
    right = (genes[i,"End"]-species1_min)/largestLength
    height = height1
  }else if (genes[i, "Species"] == 2){
    left = (genes[i,"Start"]-species2_min)/largestLength
    right = (genes[i,"End"]-species2_min)/largestLength
    height = height2
  }
  
  #we have already defined the gene colors in our input data file.
  geneFillColor=as.character(genes[i,"Color"]);

  #Draw arrows, distinguishing between forward and reverse oriented genes
  if (genes[i,"Strand"] == 1){
        arrowStart = max (left, right - arrowLen)
        polygon (c(left, arrowStart, right, arrowStart, left), c(height-boxHeight/2, height-boxHeight/2, height, height+boxHeight/2, height+boxHeight/2), col = geneFillColor, lwd = 0.5)
    }else if (genes[i,"Strand"] == -1){
        arrowStart = min (right, left + arrowLen)
        polygon (c(left, arrowStart, right, right, arrowStart), c(height, height-boxHeight/2, height-boxHeight/2, height+boxHeight/2, height+boxHeight/2), col = geneFillColor, lwd = 0.5)
    }
}

That involved a lot of code, but hopefully you can see how simple shape drawing can be built up into more complex and precise figures based on biological data.

3. Write R plots directly to a PDF

It can be very useful to output R plots or graphics directly to a PDF rather than displaying them within a R plot window. Let’s use the polygon function again by entering the following four commands at the R command prompt.

pdf("~/Desktop/csb2017/r_figure_drawing/yellow_square.pdf")  
plot.new()  
polygon(c(0, 0, 0.5, 0.5), c(0, 0.5, 0.5, 0), col= "yellow")  
dev.off()

Note that the polygon function call should have generated a beautiful yellow square, but you probably noticed that nothing showed up in your plot window. The reason for that is we called the pdf function first. By doing that, we created a new PDF file and stored the new plot directly into that file. Also, note the dev.off() command at the end. That is important because it essentially closes and saves the file. Until that command is run, you will find the PDF is just an empty file.

Verify that you successfully generated the PDF file, by entering the following command.

system2('open', args = '~/Desktop/csb2017/r_figure_drawing/yellow_square.pdf')

Being able to write directly to a PDF may seem rather useless (after all, you could always export the plot window from an R session as a PDF). But as we will see, it is a handy feature that helps when we want build R scripts and skip the R console entirely.

4. Run R scripts directly from the command line (Terminal)

One advantage of code-based statistical packages like R is that you can combine scripts into large data analysis pipelines. For this reason, it can often be useful to run R code without opening an RStudio or R console session. Lets see an example of this. To draw the gene map in Part 2, we could of built a script instead of entering all that code piece by piece. In fact, if you inspect the file called comparative_gene_map.R, you will find that it contains essentially the same code as we used to make our map. However, it can be called directly from the command line (with any genes and connections file to produce any map you want).

To test this out, open a Terminal session (not Rstudio or the R console) and cd in the directory for this exercise. You can also view the content of the script using less (and type q to exit when your done looking at it).

#Run the following from a bash terminal (not in R)
cd ~/Desktop/csb2017/r_figure_drawing
less comparative_gene_map.R

Run the following command which will execute the script in R and pass the arguments for input and output files to the script.

#Run the following from a bash terminal (not in R). Note that for this to work, the Rscript executable must be in your PATH.
Rscript comparative_gene_map.R genes.txt connections.txt Rscript_output.pdf
Confirm that this generated the same map that we did in Part 2 by opening the PDF output as follows:
#Run the following from a bash terminal (not in R)
open Rscript_output.pdf