Visualizing genomic data and making figures with Circos

Protocol

1. Generate the Circos figure from the example dataset distributed with the software

Before we get into the details of Circos, let’s see what it can do and confirm that the software is properly installed by generating the “example” figure based on the data distributed with the software.

First, open a Terminal session and ssh into our workshop server. Then enter the following command to move into the directory with the relevant files for this exercise.

cd ~/TodosSantos/circos/example

Then run Circos by simply entering the name of the program. All the configuration files for this dataset are already in place in this directory, so no additional information is necessary as long as Circos is installed and in your PATH. We will go through how to set up these configuration files later.

circos

The program should take about a minute to run and report a number of status updates along the way. It will return to the command prompt when finished. To see the figure that was generated, enter the following command.

[Note that the open command used below and throughout this exercise assumes that you are working locally on a Mac OS X machine. If you are using a local Linux machine, you can use xdg-open instead. If you are using a remote server, it will be easiest to transfer the image file to your local machine before viewing it.]

open circos.png

You should see something like this:

Whoa. One could argue whether a figure that “busy” could ever be informative. In truth, there are a lot of bad Circos figures that get published and convey little more than “we have a lot of data”. But this example file is merely meant to give you a sense of the large amount of data that can be incorporated into circular figures and the many different ways Circos allows you to visualize those data. Let’s work through some of the steps required to convert genomic data into the input files Circos uses to generate figures like the one above (except simpler and more useful).

2. Set up circos.conf and karyotype files

The heart of a Circos run is the configuration file. When you call Circos, it will assume that a file with the name circos.conf is present in your current working directory. Otherwise, you can specify the name and location of your configuration file with the -conf option when you call Circos from the command line.

From your Terminal window, move into the exercise directory:

cd ~/TodosSantos/circos/exercise

Use less to view the contents of the configuration file.

less circos.conf

This file contains some standard features that will always be present including references to some additional configurations that are distributed with Circos. It also contains parameters that you can set to alter the appearance of your figure. One key parameter is the one that defines the name and location of the karyotype file. That file defines the chromosomes that will be used for the figure.

Exit less by typing q, and then print the contents of the karyotype file to the terminal screen with the cat command.

cat karyotype.human.txt

## chr - hs1 1 0 249250621 red
## chr - hs2 2 0 243199373 blue
## chr - hs3 3 0 198022430 red
## chr - hs4 4 0 191154276 blue
## chr - hs5 5 0 180915260 red
## chr - hs6 6 0 171115067 blue
## chr - hs7 7 0 159138663 red
## chr - hs8 8 0 146364022 blue
## chr - hs9 9 0 141213431 red
## chr - hs10 10 0 135534747 blue
## chr - hs11 11 0 135006516 red
## chr - hs12 12 0 133851895 blue
## chr - hs13 13 0 115169878 red
## chr - hs14 14 0 107349540 blue
## chr - hs15 15 0 102531392 red
## chr - hs16 16 0 90354753 blue
## chr - hs17 17 0 81195210 red
## chr - hs18 18 0 78077248 blue
## chr - hs19 19 0 59128983 red
## chr - hs20 20 0 63025520 blue
## chr - hs21 21 0 48129895 red
## chr - hs22 22 0 51304566 blue
## chr - hsX X 0 155270560 red
## chr - hsY Y 0 59373566 blue

You should see that each chromosome is defined by one line of text in the following format:

chr - ID LABEL START END COLOR

The columns should be largely self-explanatory. Note that the difference between ID and LABEL is that ID is what you will refer to in other data and configuration files, whereas LABEL is the text that will be put on the figure. START and END give the coordinate range of the chromosome (typically in bp), so END is normally the chromosome length. COLOR defines the color used for that chromosome in the figure.
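If you need a karyotype file for your own genome, the format above is simple enough to generate with a few lines of code. The following is a minimal Python sketch, not part of the workshop files. The function name karyotype_lines and its prefix and colors arguments are made up for illustration; the three chromosome lengths are copied from karyotype.human.txt.

```python
def karyotype_lines(chromosomes, prefix="hs", colors=("red", "blue")):
    """Return Circos karyotype lines for a list of (label, length) pairs.

    The ID is built as prefix + label, and colors are cycled in order.
    Both choices are illustrative assumptions, not Circos requirements.
    """
    lines = []
    for i, (label, length) in enumerate(chromosomes):
        color = colors[i % len(colors)]
        # Format: chr - ID LABEL START END COLOR
        lines.append(f"chr - {prefix}{label} {label} 0 {length} {color}")
    return lines


# First three human chromosomes, lengths taken from karyotype.human.txt.
example = [("1", 249250621), ("2", 243199373), ("3", 198022430)]
for line in karyotype_lines(example):
    print(line)
```

Redirecting the printed lines to a text file gives you something you can point the karyotype parameter at in circos.conf.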

3. Generate a circular representation of the human genome

Run Circos in your current directory (exercise).

circos

Once the run has completed, open the output. You should see that it updates to the following image. These are the 24 human chromosomes (including the 22 autosomes and both the X and the Y chromosome) that were defined in the karyotype file.

open circos.png

4. Modify thickness of the chromosomes

Now let’s try making some modifications. Launch a text editor such as BBEdit that can connect to a remote server and open the following file: ~/TodosSantos/circos/exercise/circos.conf

Many of the parameters that define how the figure is displayed are found within the <ideogram> block. Change the thickness parameter so that it reads:

thickness = 20p

Save that change to the text file and re-run Circos.

circos

You should see the thickness of the chromosomes has been reduced in the updated figure (you may have to re-open the circos.png file).

5. Add labels for each of the chromosomes

In your text editor, paste the following text into the circos.conf file anywhere between the <ideogram> and </ideogram> lines.

show_label = yes
label_font = default
label_radius = 1r + 75p
label_size = 50
label_parallel = yes

Save the file and re-run Circos from the command line.

circos

Verify that your figure has been updated to add labels for each chromosome.

6. Visualize the single chromosome of a bacterial genome and add label positions

In your text editor, change the karyotype line in the circos.conf file so that it reads:

karyotype = karyotype.ecoli.txt

If you save that change and re-run Circos, you should see a rather dull-looking green circle. Let’s add some tick marks to the outside of the genome to indicate position. Use cat to view the contents of the ticks.conf file in the current directory.

cat ticks.conf

## 
## show_ticks          = yes
## show_tick_labels    = yes
## 
## <ticks>
## skip_first_label     = no
## skip_last_label      = no
## radius               = dims(ideogram,radius_outer)
## tick_separation      = 2p
## min_label_distance_to_edge = 0p
## label_separation = 5p
## label_offset     = 5p
## multiplier = 0.000001
## color = black
## 
## 
## <tick>
## size     = 20p
## thickness      = 4p
## spacing        = 50u
## show_label     = no
## </tick>
## 
## <tick>
## size     = 30p
## thickness      = 8p
## spacing        = 1000u
## show_label     = yes
## suffix = " Mb"
## label_size     = 50p
## format         = %s
## </tick>
## </ticks>

This file sets various parameters for how to display ticks around the circle. These particular settings will display big, labeled marks every 1 Mb and small, unlabeled marks every 50 kb. Let’s add these ticks to our figure, but instead of copying the whole body of text into our main configuration file, we can just add the following line at the top of circos.conf.

<<include ticks.conf>>

Note that this is a general strategy. Rather than have one configuration file get bigger and bigger, you can refer to additional files in your main configuration file to keep your project more organized. If you save this change and re-run Circos, you should see the following updated image.

circos
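The tick spacings in ticks.conf are given in "u" units rather than base pairs, so it can help to check the arithmetic. The Python sketch below assumes that chromosomes_units is set to 1000 in circos.conf. That value is not shown in this protocol, but it is what makes 50u equal 50 kb and 1000u equal 1 Mb as described above. The function names are hypothetical.

```python
CHROMOSOMES_UNITS = 1000  # ASSUMED value of chromosomes_units in circos.conf
MULTIPLIER = 0.000001     # multiplier value from ticks.conf


def tick_spacing_bp(spacing_u, units=CHROMOSOMES_UNITS):
    """Convert a tick spacing written as '<n>u' into base pairs."""
    return spacing_u * units


def tick_label(position_bp, multiplier=MULTIPLIER, suffix=" Mb"):
    """Approximate the label text: scaled position followed by the suffix."""
    return f"{position_bp * multiplier:g}{suffix}"


print(tick_spacing_bp(50))    # small ticks, in bp
print(tick_spacing_bp(1000))  # large ticks, in bp
print(tick_label(3000000))    # label on the tick at 3,000,000 bp
```

If your own genome is much smaller or larger, changing the spacing values (or the units) is usually all that is needed to keep the tick density readable.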

7. Add GC skew data to the plot

Our bacterial genome plot is still pretty boring. Let’s add some information about the genome. GC skew is a measure of strand-biased nucleotide composition that is defined as \((G - C)/(G + C)\), where \(G\) and \(C\) are the number of guanines and cytosines in one strand of DNA. GC skew can be very valuable in identifying the origin of replication in bacterial genomes.

Let’s calculate GC skew across the entire genome. To do so, we will run a Perl script that reads in a genome in FASTA format and calculates GC skew. You can investigate the contents of the script if you are interested, but we will only need the output for this exercise. This raises an important point about Circos. Most of its input formats are very simple, but they are not the standard file formats that you may already have (e.g., GFF, GenBank, BLAST, etc.). MS Excel can go a long way, but being comfortable with a scripting language (e.g., Python or Perl) can be VERY helpful for producing input files for Circos.

Enter the following command to run the Perl script.

./gc_skew_for_circos.pl Ecoli.genome.fas 5000 1000 main blue orange > gc_skew.txt

This script has calculated GC skew in a 5-kb sliding window with a 1-kb step size. It has formatted the output for Circos such that positive GC skew values will be shown in blue and negative values will be shown in orange. The “main” term simply refers to the name of the chromosome in the E. coli karyotype file. You can print the first 10 lines of the output file with the head command (you can also explore the entire file with less). You should see that each line specifies a location in the genome and a corresponding GC skew value.

head gc_skew.txt

## main 1   1   0.0434782608695652  fill_color=blue
## main 1001    1001    0.0488  fill_color=blue
## main 2001    2001    0.0659047619047619  fill_color=blue
## main 3001    3001    0.0715900527505652  fill_color=blue
## main 4001    4001    0.0618042226487524  fill_color=blue
## main 5001    5001    0.0471083875909613  fill_color=blue
## main 6001    6001    0.0119922630560928  fill_color=blue
## main 7001    7001    0.00235478806907378 fill_color=blue
## main 8001    8001    0   fill_color=orange
## main 9001    9001    -0.0015527950310559 fill_color=orange

We can visualize these data in our figure by adding a plots block to our main configuration file. In your text editor, paste the following lines after the </ideogram> line near the end of the circos.conf file.

<plots>
<plot>
type = histogram
file = gc_skew.txt
extend_bin = yes
thickness = 0
r0 = 0.6r
r1 = 1.0r
orientation = out
min = -0.2
max = 0.2
</plot>
</plots>

If you save these changes and re-run Circos, you should see the following updated figure.

circos

Do you have a guess as to where the origin of replication and termination of replication are in E. coli?

We have just added a histogram “track” to our Circos plot. But these tracks can take on many forms:

Histograms
Scatter plots
Line plots
Heat maps
Tiles (i.e., stacking elements like read mapping)
Text labels
Glyphs (i.e., symbols)

8. Identify repeats within the genome with links

Circular genome representations can be very useful for drawing connections between different parts of a genome. For example, they are very effective at visualizing repeated sequences. Let’s show the locations of repeats within the E. coli genome. First, we need to find the repeats. To do that, we will use BLAST. Run the following two commands to make a BLAST database and run a BLASTN search.

makeblastdb -in Ecoli.genome.fas -dbtype nucl

blastn -evalue 1e-10 -db  Ecoli.genome.fas -query Ecoli.genome.fas -out self_blast.txt

Note that we are BLASTing the same sequence against itself. That may seem like a waste of time, but it is a very common way to identify repeats. In this case, we are using the default MEGABLAST algorithm within blastn, so we will only identify very similar repeats (word size = 28). Running the following command will call a BioPerl script that parses the BLAST output and summarizes it in a format that Circos can read.

./blast_repeats_for_circos.pl self_blast.txt 1000 0.9 main > blast_repeats.txt

[Note that if you do not have local BLAST or BioPerl installed, a copy of this output file is already provided as blast_repeats.premade.txt.]

The command line parameters instruct the script to summarize all blast hits that are at least 1000 bp long and have at least 90% nucleotide sequence identity. Once again “main” simply refers to the name of the chromosome in the E. coli karyotype file. You can use head to print the first 10 lines of the blast_repeats.txt output file.

head blast_repeats.txt

## repeat1  main    4166317 4166317
## repeat1  main    2731499 2731499
## repeat2  main    2726052 2726052
## repeat2  main    4171773 4171773
## repeat3  main    3423658 3423658
## repeat3  main    228885  228885
## repeat4  main    223467  223467
## repeat4  main    3429065 3429065
## repeat5  main    4035385 4035385
## repeat5  main    223625  223625

Each pair of lines specifies the connection points for the given repeat pair.
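If you do not have BioPerl available, a similar file can be produced in other ways. The Python sketch below is one possible approach and is not a translation of blast_repeats_for_circos.pl. It assumes the BLAST search was run with -outfmt 6 (tab-separated output), which is not what the blastn command above does. It also assumes that each link endpoint is placed at the start coordinate of the hit, and that reciprocal hits report identical coordinates so they can be collapsed. The function name and the example rows are invented for illustration.

```python
def blast_to_links(rows, min_len=1000, min_ident=0.9, chrom="main"):
    """Turn tabular (-outfmt 6) self-BLAST rows into two-line link records.

    Columns used: pident (3rd), length (4th), qstart, qend, sstart, send
    (7th to 10th). Endpoint placement at hit starts is an assumption.
    """
    seen = set()
    lines = []
    for row in rows:
        f = row.rstrip("\n").split("\t")
        pident, length = float(f[2]), int(f[3])
        qstart, qend, sstart, send = (int(x) for x in f[6:10])
        if length < min_len or pident < min_ident * 100:
            continue
        if (qstart, qend) == (sstart, send):
            continue  # the genome aligned to itself, not a repeat
        # A hit and its reciprocal share the same pair of left-most positions.
        key = tuple(sorted((qstart, min(sstart, send))))
        if key in seen:
            continue
        seen.add(key)
        name = f"repeat{len(seen)}"
        lines.append(f"{name}\t{chrom}\t{qstart}\t{qstart}")
        lines.append(f"{name}\t{chrom}\t{sstart}\t{sstart}")
    return lines


# Invented example rows: full self-hit, a repeat, its reciprocal, a short hit.
rows = [
    "main\tmain\t100.000\t4600000\t0\t0\t1\t4600000\t1\t4600000\t0.0\t8000000",
    "main\tmain\t99.500\t1200\t6\t0\t1000\t2199\t50000\t51199\t0.0\t2100",
    "main\tmain\t99.500\t1200\t6\t0\t50000\t51199\t1000\t2199\t0.0\t2100",
    "main\tmain\t95.000\t300\t15\t0\t7000\t7299\t9000\t9299\t1e-100\t450",
]
for line in blast_to_links(rows):
    print(line)
```

Only the second row survives the filters here, so the sketch prints a single pair of lines for repeat1.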

We can visualize these data with Circos by adding a links block to our main configuration file. In your text editor, paste the following lines after the </ideogram> line in the circos.conf file.

<links>
radius = 0.8r
bezier_radius = -0.1r
bezier_radius_purity = 0.9
crest = 0.5
perturb = yes
perturb_crest = 0.9,1.1
perturb_bezier_radius = 0,2
perturb_bezier_radius_purity = 0
<link singles>
z = 0
show = yes
color = black
thickness = 2
file = blast_repeats.txt
</link>
</links>

In Circos, links are simply curved lines that connect two points (Bézier curves). The above parameters largely determine how “bendy” you want those curves to be and whether you “perturb” them so that similar connections do not overlap quite so much.

If you save these changes and re-run Circos, you should see the following updated figure.

circos

In this case, our repeats are very small relative to the size of the whole genome, so we are using simple links, which essentially represent a connection between points. But if you want to show connections between larger elements, you can use “ribbons” in Circos such that the width of the connecting line reflects the size of each region.

9. Explore the Circos tutorials

We have barely scratched the surface of the diverse visualization options in Circos. Even for the code we have used in some of these examples, we have not gone into detail on many of the parameters. As time allows, try out some of the tutorials on the Circos website.