A significant challenge in dealing with large genomic datasets is being able to visualize them in an effective way. Generating informative and attractive figures is one of the most important things you can do to make your presentations and publications more impactful. If you look through genome papers, you will find that it is very common to generate circular images and summaries of genomic data. See the figure below summarizing two related bacterial genomes. When done properly, this can be very effective even in cases where you are working with genomes that are not circular themselves.
The goal of this exercise will be to use the program Circos to generate publication-quality images from genomic data.
Before we get into the details of Circos, let’s see what it can do and confirm that the software is properly installed by generating the “example” figure based on the data distributed with the software.
First, open a Terminal session and enter the command to move into the directory with the relevant files for this exercise.
cd ~/Desktop/csb2017/circos/example
Then run Circos by simply entering the name of the program. All the configuration files for this dataset are already in place in this directory, so no additional information is necessary as long as Circos is installed and in your PATH. We will go through how to set up these configuration files later.
circos
The program should take about a minute to run and report a number of status updates along the way. It will return to the command prompt when finished. To see the figure that was generated, enter the following command.
open circos.png
You should see something like this:
Whoa. One could argue whether a figure that “busy” could ever be informative. In truth, there are a lot of bad Circos figures that get published and convey little more than “we have a lot of data”. But this example file is merely meant to give you a sense of the large amount of data that can be incorporated into circular figures and the many different ways Circos allows you to visualize those data. Let’s work through some of the steps required to convert genomic data into the input files Circos uses to generate figures like the one above (except simpler and more useful).
The heart of a Circos run is the configuration file. When you call Circos, it will assume that a file with the name circos.conf is present in your current working directory. Otherwise, you can specify the name and location of your configuration file with the -conf option when you call Circos from the command line.
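As a rough illustration of the two ways to call Circos, the following Python sketch builds the command with or without the -conf option. The helper names build_circos_command and run_circos and the path my_project/circos.conf are hypothetical; only the circos program name and its -conf option come from the text above.

```python
import shutil
import subprocess


def build_circos_command(conf_path=None):
    """Return the argument list for a Circos run.

    With no conf_path, Circos looks for circos.conf in the current
    working directory; otherwise the file is passed with -conf.
    """
    command = ["circos"]
    if conf_path is not None:
        command += ["-conf", conf_path]
    return command


def run_circos(conf_path=None):
    """Run Circos if it is on the PATH; raise a clear error otherwise."""
    if shutil.which("circos") is None:
        raise FileNotFoundError("circos was not found in your PATH")
    # check=True raises CalledProcessError if Circos exits with an error.
    subprocess.run(build_circos_command(conf_path), check=True)


# Hypothetical path, used only to show the resulting command line.
print(" ".join(build_circos_command("my_project/circos.conf")))
```

Calling run_circos() with no argument corresponds to typing circos in a directory that already contains circos.conf.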
From your Terminal window, move into the exercise directory:
cd ~/Desktop/csb2017/circos/exercise
Use less to view the contents of the configuration file.
less circos.conf
This file contains some standard features that will always be present including references to some additional configurations that are distributed with Circos. It also contains parameters that you can set to alter the appearance of your figure. One key parameter is the one that defines the name and location of the karyotype file. That file defines the chromosomes that will be used for the figure.
Exit less by typing q, and then print the contents of the karyotype file to the terminal screen with the cat command.
cat karyotype.human.txt
## chr - hs1 1 0 249250621 red
## chr - hs2 2 0 243199373 blue
## chr - hs3 3 0 198022430 red
## chr - hs4 4 0 191154276 blue
## chr - hs5 5 0 180915260 red
## chr - hs6 6 0 171115067 blue
## chr - hs7 7 0 159138663 red
## chr - hs8 8 0 146364022 blue
## chr - hs9 9 0 141213431 red
## chr - hs10 10 0 135534747 blue
## chr - hs11 11 0 135006516 red
## chr - hs12 12 0 133851895 blue
## chr - hs13 13 0 115169878 red
## chr - hs14 14 0 107349540 blue
## chr - hs15 15 0 102531392 red
## chr - hs16 16 0 90354753 blue
## chr - hs17 17 0 81195210 red
## chr - hs18 18 0 78077248 blue
## chr - hs19 19 0 59128983 red
## chr - hs20 20 0 63025520 blue
## chr - hs21 21 0 48129895 red
## chr - hs22 22 0 51304566 blue
## chr - hsX X 0 155270560 red
## chr - hsY Y 0 59373566 blue
You should see that each line of text defines one chromosome, using the following format:
chr - ID LABEL START END COLOR
The columns should be largely self-explanatory. Note that the difference between ID and LABEL is that ID is what you will refer to in other data and configuration files, whereas LABEL is the text that will be put on the figure. START and END give the coordinate range of the chromosome (typically in bp), so with a START of 0 the END value is the chromosome length. COLOR defines the color used for that chromosome in the figure.
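If you need to build a karyotype file for your own genome, the format is simple enough to generate with a few lines of code. The following Python sketch formats and parses karyotype lines; the function names are hypothetical, and the lengths are copied from the first three lines of karyotype.human.txt shown above.

```python
def karyotype_line(chrom_id, label, end, color, start=0):
    """Format one chromosome as 'chr - ID LABEL START END COLOR'."""
    return f"chr - {chrom_id} {label} {start} {end} {color}"


def parse_karyotype_line(line):
    """Split a karyotype line back into its named columns."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "chr":
        raise ValueError(f"not a chromosome definition: {line!r}")
    _, _, chrom_id, label, start, end, color = fields
    return {
        "id": chrom_id,
        "label": label,
        "start": int(start),
        "end": int(end),
        "color": color,
    }


# Lengths copied from the first three lines of karyotype.human.txt.
lengths = {"1": 249250621, "2": 243199373, "3": 198022430}
colors = ["red", "blue"]  # alternate colors, as in that file

lines = [
    karyotype_line(f"hs{name}", name, length, colors[i % 2])
    for i, (name, length) in enumerate(lengths.items())
]
print("\n".join(lines))
```

The ID column (hs1, hs2, ...) is the value that data files must use to refer to a chromosome, so it is worth keeping it short and free of spaces.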
Run Circos in your current directory (exercise).
circos
Once the run has completed, open the output. You should see that it updates to the following image. These are the 24 human chromosomes (including the 22 autosomes and both the X and the Y chromosome) that were defined in the karyotype file.
open circos.png
Now let’s try making some modifications. Open the following file in any text editor that you prefer: ~/Desktop/csb2017/circos/exercise/circos.conf
Many of the parameters that define how the figure is displayed are found within the <ideogram> block. Change the thickness parameter so that it reads:
thickness = 20p
Save that change to the text file and re-run Circos.
circos
You should see that the thickness of the chromosomes has been reduced in the updated figure (you may have to re-open the circos.png file).
In your text editor, paste the following text into the circos.conf file anywhere between the <ideogram> and </ideogram> lines.
show_label = yes
label_font = default
label_radius = 1r + 75p
label_size = 50
label_parallel = yes
Save the file and re-run Circos from the command line.
circos
Verify that your figure has been updated to add labels for each chromosome.
In your text editor, change the karyotype line in the circos.conf file so that it reads:
karyotype = karyotype.ecoli.txt
If you save that change and re-run Circos, you should see a rather dull-looking green circle. Let’s add some tick marks to the outside of the genome to indicate position. Use cat to view the contents of the ticks.conf file in the current directory.
cat ticks.conf
##
## show_ticks = yes
## show_tick_labels = yes
##
## <ticks>
## skip_first_label = no
## skip_last_label = no
## radius = dims(ideogram,radius_outer)
## tick_separation = 2p
## min_label_distance_to_edge = 0p
## label_separation = 5p
## label_offset = 5p
## multiplier = 0.000001
## color = black
##
##
## <tick>
## size = 20p
## thickness = 4p
## spacing = 50u
## show_label = no
## </tick>
##
## <tick>
## size = 30p
## thickness = 8p
## spacing = 1000u
## show_label = yes
## suffix = " Mb"
## label_size = 50p
## format = %s
## </tick>
## </ticks>
This file sets various parameters for how to display ticks around the circle. These particular settings will display big, labeled marks every 1 Mb and small, unlabeled marks every 50 kb. Let’s add these ticks to our figure, but instead of copying the whole body of text into our main configuration file, we can just add the following line at the top of circos.conf.
<<include ticks.conf>>
Note that this is a general strategy. Rather than have one configuration file get bigger and bigger, you can refer to additional files in your main configuration file to keep your project more organized. If you save this change and re-run Circos, you should see the following updated image.
circos
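The spacing values in ticks.conf are given in units (the u suffix) rather than in bp. The statement that 1000u is 1 Mb and 50u is 50 kb implies a unit of 1000 bp; the setting that defines this unit is not shown in this section, so the value of 1000 used below is an assumption inferred from that statement. The following Python sketch works through the arithmetic, including how the multiplier and suffix turn a position into a label. The function names and the 3.5-Mb chromosome length are hypothetical.

```python
UNIT_BP = 1000  # assumed size of one "u" in bp, inferred from the text


def spacing_in_bp(spacing_u, unit_bp=UNIT_BP):
    """Convert a tick spacing given in units (e.g. 50u) to bp."""
    return spacing_u * unit_bp


def tick_positions(chrom_length_bp, spacing_u, unit_bp=UNIT_BP):
    """List the positions (in bp) where ticks of this spacing fall."""
    step = spacing_in_bp(spacing_u, unit_bp)
    return list(range(0, chrom_length_bp + 1, step))


def tick_label(position_bp, multiplier=0.000001, suffix=" Mb"):
    """Scale a position by the multiplier and append the suffix."""
    value = round(position_bp * multiplier, 6)
    return f"{value:g}{suffix}"


print(spacing_in_bp(50))    # small, unlabeled ticks
print(spacing_in_bp(1000))  # large, labeled ticks

# Hypothetical 3.5-Mb chromosome, used only for illustration.
labels = [tick_label(p) for p in tick_positions(3_500_000, 1000)]
print(labels)
```

Under these assumptions the small ticks fall every 50,000 bp and the large ticks every 1,000,000 bp, and the multiplier of 0.000001 is what lets the labels read in Mb instead of bp.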
Our bacterial genome plot is still pretty boring. Let’s add some information about the genome. GC skew is a measure of strand-biased nucleotide composition that is defined as \((G - C)/(G + C)\), where \(G\) and \(C\) are the number of guanines and cytosines in one strand of DNA. GC skew can be very valuable in identifying the origin of replication in bacterial genomes.
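To make the definition concrete, here is a minimal Python sketch of the GC skew formula applied to a single sequence. The function name gc_skew and the choice to return 0 when a sequence contains no G or C are assumptions made for illustration.

```python
def gc_skew(sequence):
    """Return (G - C) / (G + C) for one strand of DNA."""
    seq = sequence.upper()
    g = seq.count("G")
    c = seq.count("C")
    if g + c == 0:
        return 0.0  # assumption: report 0 rather than dividing by zero
    return (g - c) / (g + c)


print(gc_skew("GGGCAT"))  # 3 G, 1 C -> (3 - 1) / (3 + 1) = 0.5
print(gc_skew("GCCCAT"))  # 1 G, 3 C -> (1 - 3) / (1 + 3) = -0.5
```

Positive values indicate an excess of G over C on the strand being read, and negative values indicate the opposite.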
Let’s calculate GC skew across the entire genome. To do so, run the following script. The script is written in Perl. It reads in a genome in FASTA format to calculate GC skew. You can investigate the contents of the script if you are interested, but we will only need the output for this exercise. This raises an important point about Circos. Most of the formats for inputting data into Circos are very simple, but they are not the standard file formats that you may already have (e.g., GFF, GenBank, BLAST, etc.). MS Excel can go a long way, but being comfortable with a scripting language (e.g., Python or Perl) can be VERY helpful for producing input files for Circos.
Enter the following command to run the Perl script.
./gc_skew_for_circos.pl Ecoli.genome.fas 5000 1000 main blue orange > gc_skew.txt
This script has calculated GC skew in a 5-kb sliding window with a 1-kb step size. It has formatted the output for Circos such that positive GC skew values will be shown in blue and negative values will be shown in orange. The “main” term simply refers to the name of the chromosome in the E. coli karyotype file. You can print the first 10 lines of the output file with the head command (you can also explore the entire file with less). You should see that each line specifies a location in the genome and a corresponding GC skew value.
head gc_skew.txt
## main 1 1 0.0434782608695652 fill_color=blue
## main 1001 1001 0.0488 fill_color=blue
## main 2001 2001 0.0659047619047619 fill_color=blue
## main 3001 3001 0.0715900527505652 fill_color=blue
## main 4001 4001 0.0618042226487524 fill_color=blue
## main 5001 5001 0.0471083875909613 fill_color=blue
## main 6001 6001 0.0119922630560928 fill_color=blue
## main 7001 7001 0.00235478806907378 fill_color=blue
## main 8001 8001 0 fill_color=orange
## main 9001 9001 -0.0015527950310559 fill_color=orange
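For readers who prefer Python, the following sketch shows one way to produce lines in this format with a sliding window. It is not a translation of gc_skew_for_circos.pl, whose internals are not shown here. The function name, the choice to report each window at its 1-based start position, the handling of the sequence end, and the exact number formatting are all assumptions. Zero is colored like a negative value, matching the line for position 8001 in the output above.

```python
def skew_track_lines(sequence, chrom, window=5000, step=1000,
                     pos_color="blue", neg_color="orange"):
    """Yield 'CHR START END VALUE fill_color=COLOR' lines.

    Each window is reported at its 1-based start position (an
    assumption); windows that would run past the end are skipped.
    """
    seq = sequence.upper()
    for offset in range(0, len(seq) - window + 1, step):
        chunk = seq[offset:offset + window]
        g, c = chunk.count("G"), chunk.count("C")
        skew = (g - c) / (g + c) if g + c else 0.0
        color = pos_color if skew > 0 else neg_color
        position = offset + 1  # convert 0-based offset to 1-based
        yield f"{chrom} {position} {position} {skew} fill_color={color}"


# Toy sequence with a tiny window so the result is easy to check by hand.
toy_lines = list(skew_track_lines("GGGGCCCC", "main", window=4, step=2))
print("\n".join(toy_lines))
```

With the toy sequence, the three windows are GGGG, GGCC, and CCCC, giving skew values of 1.0, 0.0, and -1.0. On a real genome you would read the sequence from the FASTA file and keep the default 5000 and 1000 settings.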
We can visualize these data in our figure by adding a plots block to our main configuration file. In your text editor, paste the following lines after the </ideogram> line near the end of the circos.conf file.
<plots>
<plot>
type = histogram
file = gc_skew.txt
extend_bin = yes
thickness = 0
r0 = 0.6r
r1 = 1.0r
orientation = out
min = -0.2
max = 0.2
</plot>
</plots>
If you save these changes and re-run Circos, you should see the following updated figure.
circos
Do you have a guess as to where the origin of replication and termination of replication are in E. coli?
We have just added a histogram “track” to our Circos plot. But these tracks can take on many other forms, such as scatter plots, line plots, heat maps, and tiles.
Circular genome representations can be very useful for drawing connections between different parts of a genome. For example, they are very effective at visualizing repeated sequences. Let’s show the locations of repeats within the E. coli genome. First, we need to find the repeats. To do that, we will use BLAST. Run the following two commands to make a BLAST database and run a BLASTN search.
makeblastdb -in Ecoli.genome.fas -dbtype nucl
blastn -evalue 1e-10 -db Ecoli.genome.fas -query Ecoli.genome.fas -out self_blast.txt
Note that we are BLASTing the same sequence against itself. That may seem like a waste of time, but it is a very common way to identify repeats. In this case, we are using the default MEGABLAST algorithm within blastn, so we will only identify very similar repeats (word size = 28). Running the following command will call a BioPerl script that parses the BLAST output and summarizes it in a format that Circos can read.
./blast_repeats_for_circos.pl self_blast.txt 1000 0.9 main > blast_repeats.txt
[Note that if you do not have local BLAST or BioPerl installed, a copy of this output file is already provided as blast_repeats.premade.txt.]
The command-line parameters instruct the script to summarize all BLAST hits that are at least 1000 bp long and have at least 90% nucleotide sequence identity. Once again, “main” simply refers to the name of the chromosome in the E. coli karyotype file. You can use head to print the first 10 lines of the blast_repeats.txt output file.
head blast_repeats.txt
## repeat1 main 4166317 4166317
## repeat1 main 2731499 2731499
## repeat2 main 2726052 2726052
## repeat2 main 4171773 4171773
## repeat3 main 3423658 3423658
## repeat3 main 228885 228885
## repeat4 main 223467 223467
## repeat4 main 3429065 3429065
## repeat5 main 4035385 4035385
## repeat5 main 223625 223625
Each pair of lines specifies the connection points for the given repeat pair.
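The filtering and formatting step can be sketched in Python as follows. This is not the logic of blast_repeats_for_circos.pl, which is not shown here. It assumes the BLAST hits have already been parsed into tuples of (query start, query end, subject start, subject end, identity as a fraction), and it assumes that each link is anchored at the midpoint of the matching region. The function name and the example coordinates are hypothetical.

```python
def link_lines(hits, chrom, min_length=1000, min_identity=0.9):
    """Turn parsed self-BLAST hits into pairs of Circos link lines."""
    lines = []
    seen = set()
    for q_start, q_end, s_start, s_end, identity in hits:
        q_lo, q_hi = sorted((q_start, q_end))
        s_lo, s_hi = sorted((s_start, s_end))
        if (q_lo, q_hi) == (s_lo, s_hi):
            continue  # a region matching itself is not a repeat
        if q_hi - q_lo + 1 < min_length or identity < min_identity:
            continue  # too short or too divergent
        pair = tuple(sorted(((q_lo, q_hi), (s_lo, s_hi))))
        if pair in seen:
            continue  # a self-search may report the same pair twice
        seen.add(pair)
        name = f"repeat{len(seen)}"
        for lo, hi in ((q_lo, q_hi), (s_lo, s_hi)):
            midpoint = (lo + hi) // 2  # assumption: link to the midpoint
            lines.append(f"{name} {chrom} {midpoint} {midpoint}")
    return lines


# Hypothetical hits, chosen to exercise each filter once.
hits = [
    (1, 100000, 1, 100000, 1.0),         # whole sequence vs. itself
    (2000, 3499, 50000, 51499, 0.95),    # kept
    (50000, 51499, 2000, 3499, 0.95),    # mirror of the hit above
    (7000, 7299, 90000, 90299, 0.99),    # shorter than 1000 bp
    (10000, 11999, 80000, 81999, 0.85),  # identity below 0.9
]
repeat_lines = link_lines(hits, "main")
print("\n".join(repeat_lines))
```

Only the second hit survives the filters, so the output is one pair of lines sharing the name repeat1, one line for each end of the link.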
We can visualize these data with Circos by adding a links block to our main configuration file. In your text editor, paste the following lines after the </ideogram> line in the circos.conf file.
<links>
radius = 0.8r
bezier_radius = -0.1r
bezier_radius_purity = 0.9
crest = 0.5
perturb = yes
perturb_crest = 0.9,1.1
perturb_bezier_radius = 0,2
perturb_bezier_radius_purity = 0
<link singles>
z = 0
show = yes
color = black
thickness = 2
file = blast_repeats.txt
</link>
</links>
In Circos, links are simply curved lines that connect two points (Bézier curves). The above parameters largely determine how “bendy” you want those curves to be and whether you “perturb” them so that similar connections do not overlap quite so much.
If you save these changes and re-run Circos, you should see the following updated figure.
circos
In this case, our repeats are very small relative to the size of the whole genome, so we are using simple links, which essentially represent a connection between points. But if you want to show connections between larger elements, you can use “ribbons” in Circos such that the width of the connecting line reflects the size of each region.
We have barely scratched the surface of the diverse visualization options in Circos. Even for the code we have used in some of these examples, we have not gone into detail on many of the parameters. As time allows, try out some of the tutorials on the Circos website.