Background

A significant challenge in dealing with large genomic datasets is being able to visualize them in an effective way. Generating informative and attractive figures is one of the most important things you can do to make your presentations and publications more impactful. If you look through genome papers, you will find that it is very common to generate circular images and summaries of genomic data. See the figure below summarizing two related bacterial genomes. When done properly, this can be very effective even in cases where you are working with genomes that are not circular themselves.

Objectives

The goal of this exercise will be to use the program Circos to generate publication-quality images from genomic data.

Software and Dependencies

Circos is already installed on our workshop server, but it is not yet in your “PATH”, the list of directories that Bash searches automatically for executables whenever you enter a command. Circos can be found in the following directory: /home/apps/circos-0.69-6/bin. You could type that full absolute path every time you want to call the program, or you can add the directory to your “PATH” so that you only need to type circos. Add it with the following command. Note that the change applies only to your current shell session, so you will need to re-enter the command if you log out and back in.

PATH=$PATH:/home/apps/circos-0.69-6/bin

Protocol

1. Generate the Circos figure from the example dataset distributed with the software

Before we get into the details of Circos, let’s see what it can do and confirm that the software is properly installed by generating the “example” figure based on the data distributed with the software.

First, open a Terminal session and ssh into our workshop server. Then enter the following command to move into the directory with the relevant files for this exercise.

cd ~/TodosSantos/circos/example

Then run Circos by simply entering the name of the program. All the configuration files for this dataset are already in place in this directory, so no additional information is necessary as long as Circos is installed and in your PATH. We will go through how to set up these configuration files later.

circos

The program should take about a minute to run and report a number of status updates along the way. It will return to the command prompt when finished. To see the figure that was generated, enter the following command.

[Note that the open command used below and throughout this exercise assumes that you are working locally on a Mac OS X machine. If you are using a local Linux machine, you can use xdg-open instead. If you are working on a remote server, it will be easiest to transfer the image file to your local machine before viewing it.]

open circos.png

You should see something like this:

Whoa. One could argue whether a figure that “busy” could ever be informative. In truth, there are a lot of bad Circos figures that get published and convey little more than “we have a lot of data”. But this example file is merely meant to give you a sense of the large amount of data that can be incorporated into circular figures and the many different ways Circos allows you to visualize those data. Let’s work through some of the steps required to convert genomic data into the input files Circos uses to generate figures like the one above (except simpler and more useful).

2. Set up circos.conf and karyotype files

The heart of a Circos run is the configuration file. When you call Circos, it will assume that a file with the name circos.conf is present in your current working directory. Otherwise, you can specify the name and location of your configuration file with the -conf option when you call Circos from the command line.

From your Terminal window, move into the exercise directory:

cd ~/TodosSantos/circos/exercise

Use less to view the contents of the configuration file.

less circos.conf

This file contains some standard features that will always be present including references to some additional configurations that are distributed with Circos. It also contains parameters that you can set to alter the appearance of your figure. One key parameter is the one that defines the name and location of the karyotype file. That file defines the chromosomes that will be used for the figure.

Exit less by typing q, and then print the contents of the karyotype file to the terminal screen with the cat command.

cat karyotype.human.txt
## chr - hs1 1 0 249250621 red
## chr - hs2 2 0 243199373 blue
## chr - hs3 3 0 198022430 red
## chr - hs4 4 0 191154276 blue
## chr - hs5 5 0 180915260 red
## chr - hs6 6 0 171115067 blue
## chr - hs7 7 0 159138663 red
## chr - hs8 8 0 146364022 blue
## chr - hs9 9 0 141213431 red
## chr - hs10 10 0 135534747 blue
## chr - hs11 11 0 135006516 red
## chr - hs12 12 0 133851895 blue
## chr - hs13 13 0 115169878 red
## chr - hs14 14 0 107349540 blue
## chr - hs15 15 0 102531392 red
## chr - hs16 16 0 90354753 blue
## chr - hs17 17 0 81195210 red
## chr - hs18 18 0 78077248 blue
## chr - hs19 19 0 59128983 red
## chr - hs20 20 0 63025520 blue
## chr - hs21 21 0 48129895 red
## chr - hs22 22 0 51304566 blue
## chr - hsX X 0 155270560 red
## chr - hsY Y 0 59373566 blue

You should see that each line of text defines one chromosome, using the following format:

chr - ID LABEL START END COLOR

Most of the columns are self-explanatory. The leading chr marks the line as a chromosome definition, and the - is a placeholder for a parent field that chromosomes do not use. ID is the name you will use to refer to the chromosome in other data and configuration files, whereas LABEL is the text that will be printed on the figure. START and END give the coordinate range of the chromosome (typically in bp), so with a START of 0, END is the chromosome length. COLOR defines the color used for that chromosome in the figure.
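As a rough illustration of this format, the following Python sketch builds karyotype lines from a list of chromosome labels and lengths. The function name, the hs prefix default, and the alternating color scheme are assumptions chosen to mirror karyotype.human.txt; none of them are part of Circos itself.

```python
def karyotype_lines(chromosomes, colors=("red", "blue"), prefix="hs"):
    """Build Circos karyotype lines: chr - ID LABEL START END COLOR.

    chromosomes: list of (label, length_in_bp) tuples.
    colors and prefix are illustrative defaults (hypothetical).
    """
    lines = []
    for i, (label, length) in enumerate(chromosomes):
        chrom_id = f"{prefix}{label}"      # ID used in data and config files
        color = colors[i % len(colors)]    # alternate colors between neighbors
        lines.append(f"chr - {chrom_id} {label} 0 {length} {color}")
    return lines


if __name__ == "__main__":
    # First three human chromosomes, lengths as listed in karyotype.human.txt
    example = [("1", 249250621), ("2", 243199373), ("3", 198022430)]
    for line in karyotype_lines(example):
        print(line)
```

Redirecting the printed lines to a text file would give something that could be named on the karyotype line of circos.conf.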

3. Generate a circular representation of the human genome

Run Circos in your current directory (exercise).

circos

Once the run has completed, open the output. You should see that it updates to the following image. These are the 24 human chromosomes (including the 22 autosomes and both the X and the Y chromosome) that were defined in the karyotype file.

open circos.png

4. Modify thickness of the chromosomes

Now let’s try making some modifications. Launch a text editor such as BBEdit that can connect to a remote server and open the following file: ~/TodosSantos/circos/exercise/circos.conf

Many of the parameters that define how the figure is displayed are found within the <ideogram> block. Change the thickness parameter so that it reads:

thickness = 20p

Save that change to the text file and re-run Circos.

circos

You should see the thickness of the chromosomes has been reduced in the updated figure (you may have to re-open the circos.png file).

5. Add labels for each of the chromosomes

In your text editor, paste the following text into the circos.conf file anywhere between the <ideogram> and </ideogram> lines.

show_label = yes
label_font = default
label_radius = 1r + 75p
label_size = 50
label_parallel = yes

Save the file and re-run Circos from the command-line.

circos

Verify that your figure has been updated to add labels for each chromosome.

6. Visualize the single chromosome of a bacterial genome and add label positions

In your text editor, change the karyotype line in the circos.conf file so that it reads:

karyotype = karyotype.ecoli.txt

If you save that change and re-run Circos, you should see a rather dull-looking green circle. Let’s add some tick marks to the outside of the genome to indicate position. Use cat to view the contents of the ticks.conf file in the current directory.

cat ticks.conf
## 
## show_ticks          = yes
## show_tick_labels    = yes
## 
## <ticks>
## skip_first_label     = no
## skip_last_label      = no
## radius               = dims(ideogram,radius_outer)
## tick_separation      = 2p
## min_label_distance_to_edge = 0p
## label_separation = 5p
## label_offset     = 5p
## multiplier = 0.000001
## color = black
## 
## 
## <tick>
## size     = 20p
## thickness      = 4p
## spacing        = 50u
## show_label     = no
## </tick>
## 
## <tick>
## size     = 30p
## thickness      = 8p
## spacing        = 1000u
## show_label     = yes
## suffix = " Mb"
## label_size     = 50p
## format         = %s
## </tick>
## </ticks>

This file sets various parameters for how to display ticks around the circle. These particular settings will display big, labeled marks every 1 Mb and small, unlabeled marks every 50 kb. Let’s add these ticks to our figure, but instead of copying the whole body of text into our main configuration file, we can just add the following line at the top of circos.conf.

<<include ticks.conf>>

Note that this is a general strategy. Rather than have one configuration file get bigger and bigger, you can refer to additional files in your main configuration file to keep your project more organized. If you save this change and re-run Circos, you should see the following updated image.

circos
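The tick spacings in ticks.conf are written in units of u, which Circos scales by the chromosomes_units value set in the main configuration. That value is not shown in this exercise; the sketch below assumes chromosomes_units = 1000, which is what the 50-kb and 1-Mb spacings described above imply. The function names and the genome length in the usage note are hypothetical, and the code only reproduces the arithmetic without calling Circos.

```python
CHROMOSOMES_UNITS = 1000   # assumed value of chromosomes_units in circos.conf
MULTIPLIER = 0.000001      # multiplier from ticks.conf (bp -> Mb for labels)


def spacing_in_bp(spacing_u, units=CHROMOSOMES_UNITS):
    """Convert a tick spacing such as 50u into base pairs."""
    return spacing_u * units


def tick_labels(genome_length, spacing_u, suffix=" Mb"):
    """List (position_bp, label) pairs for labeled ticks along one chromosome."""
    step = spacing_in_bp(spacing_u)
    return [(pos, f"{round(pos * MULTIPLIER)}{suffix}")
            for pos in range(0, genome_length + 1, step)]
```

For example, spacing_in_bp(50) gives 50000 bp for the small ticks, and tick_labels(4600000, 1000) lists labeled ticks every 1,000,000 bp for an illustrative 4.6-Mb chromosome.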

7. Add GC skew data to the plot

Our bacterial genome plot is still pretty boring. Let’s add some information about the genome. GC skew is a measure of strand-biased nucleotide composition that is defined as \((G - C)/(G + C)\), where \(G\) and \(C\) are the numbers of guanines and cytosines in one strand of DNA. GC skew can be very valuable in identifying the origin of replication in bacterial genomes, because the sign of the skew typically switches at the origin and at the terminus of replication.
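As a worked example with made-up counts, suppose a window of sequence contains 1,300 guanines and 1,200 cytosines on the strand being read. Then

\[ \text{GC skew} = \frac{G - C}{G + C} = \frac{1300 - 1200}{1300 + 1200} = \frac{100}{2500} = 0.04 \]

A positive value means that strand has an excess of G over C in that window. Swapping the two counts would give -0.04.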

Let’s calculate GC skew across the entire genome using a Perl script that reads a genome in FASTA format. You can investigate the contents of the script if you are interested, but we will only need its output for this exercise. This raises an important point about Circos. Most of the formats for inputting data into Circos are very simple, but they are not the standard file formats that you may already have (e.g., GFF, GenBank, BLAST, etc.). MS Excel can go a long way, but being comfortable with a scripting language (e.g., Python or Perl) can be VERY helpful for producing input files for Circos.

Enter the following command to run the Perl script.

./gc_skew_for_circos.pl Ecoli.genome.fas 5000 1000 main blue orange > gc_skew.txt

This script has calculated GC skew in a 5-kb sliding window with a 1-kb step size. It has formatted the output for Circos such that positive GC skew values will be shown in blue and negative values will be shown in orange. The term “main” simply refers to the name of the chromosome in the E. coli karyotype file. You can print the first 10 lines of the output file with the head command (you can also explore the entire file with less). You should see that each line specifies a location in the genome and a corresponding GC skew value.

head gc_skew.txt
## main 1   1   0.0434782608695652  fill_color=blue
## main 1001    1001    0.0488  fill_color=blue
## main 2001    2001    0.0659047619047619  fill_color=blue
## main 3001    3001    0.0715900527505652  fill_color=blue
## main 4001    4001    0.0618042226487524  fill_color=blue
## main 5001    5001    0.0471083875909613  fill_color=blue
## main 6001    6001    0.0119922630560928  fill_color=blue
## main 7001    7001    0.00235478806907378 fill_color=blue
## main 8001    8001    0   fill_color=orange
## main 9001    9001    -0.0015527950310559 fill_color=orange
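To make the sliding-window idea concrete, here is a minimal Python sketch that produces lines in the same layout as the output above. It is not a translation of gc_skew_for_circos.pl, whose internals are not shown here. The function names are hypothetical, and three behaviors are assumptions inferred from the sample output: each window starts at the reported 1-based position, START equals END, and a skew of exactly zero receives the negative color.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for one window; 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def skew_track(seq, chrom, window=5000, step=1000,
               pos_color="blue", neg_color="orange"):
    """Yield Circos data lines: CHROM START END VALUE fill_color=COLOR.

    Assumptions (not taken from the Perl script): windows start at the
    reported position, START equals END, and zero skew is colored as negative.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        value = gc_skew(seq[start:start + window])
        color = pos_color if value > 0 else neg_color
        pos = start + 1  # convert 0-based index to 1-based coordinate
        yield f"{chrom}\t{pos}\t{pos}\t{value}\tfill_color={color}"
```

To apply this to a real genome, you would first read the sequence from the FASTA file into a single string (skipping the header line) and write the yielded lines to a file. Note that the sketch does not wrap windows around the end of a circular chromosome.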

We can visualize these data in our figure by adding a plots block to our main configuration file. In your text editor, paste the following lines after the </ideogram> line near the end of the circos.conf file.

<plots>
<plot>
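# draw the data as a histogram track
type = histogram
file = gc_skew.txt
# extend each bin to meet its neighbors so there are no gaps
extend_bin = yes
# outline thickness; bars are colored by fill_color in the data file
thickness = 0
# inner and outer edges of the track, as fractions of the ideogram radius
r0 = 0.6r
r1 = 1.0r
orientation = out
# value range spanned by the track (min at r0, max at r1)
min = -0.2
max = 0.2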
</plot>
</plots>

If you save these changes, and re-run Circos, you should see the following updated figure.

circos

Do you have a guess as to where the origin of replication and termination of replication are in E. coli?
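In the plot block above, min and max set the range of values that the track spans, so it can be worth checking how your data compare with those limits before settling on them. The following Python sketch reads lines in the gc_skew.txt layout and reports how many values fall outside a chosen range. The function names are hypothetical, and the assumption that the value is always the fourth whitespace-separated field is based only on the sample output shown earlier.

```python
def skew_values(lines):
    """Extract the VALUE column (4th field) from Circos data lines."""
    values = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 4:   # skip blank or incomplete lines
            values.append(float(fields[3]))
    return values


def fraction_outside(values, lo=-0.2, hi=0.2):
    """Fraction of values that fall outside the [lo, hi] axis range."""
    if not values:
        return 0.0
    outside = sum(1 for v in values if v < lo or v > hi)
    return outside / len(values)
```

A possible use is to open gc_skew.txt, pass the file handle to skew_values, and then print the minimum, the maximum, and fraction_outside for the result. If a large share of the windows lies outside the range, widening min and max may give a more faithful plot.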

We have just added a histogram “track” to our Circos plot. But these tracks can take on many forms:

  • Histograms
  • Scatter plots
  • Line plots
  • Heat maps
  • Tiles (i.e., stacking elements like read mapping)
  • Text labels
  • Glyphs (i.e., symbols)

8. Explore the Circos tutorials

We have barely scratched the surface of the diverse visualization options in Circos. Even for the code we have used in these examples, we have not gone into detail on many of the parameters. As time allows, try out some of the tutorials on the Circos website.