Sequencing and assembling environmental DNA samples to produce “metagenomes” has become common practice. But working with DNA from a mixture of different species/taxa poses many technical challenges. One of the central steps in a metagenomics analysis involves grouping or “binning” assembled contigs based on their species or taxa.
The goal of this exercise is to illustrate the concept of metagenomic binning. We will use three lines of evidence (depth of coverage, GC content, and sequence homology) to identify contigs that derive from the same bacterial species. We will do so using an assembly based on DNA isolated from tobacco whiteflies (Bemisia tabaci) and their associated bacterial communities.
The assembly was built from sequencing reads of DNA extracted from whole individuals of the insect Bemisia tabaci (tobacco whitefly) and its associated microbes. This is a very simple dataset by the standards of metagenomics and environmental sequencing, but it will help illustrate the principles of metagenomic binning.
Fluorescence in situ hybridization (FISH) of intracellular bacteria species in a whitefly (Gottlieb et al. 2010).
Open a terminal session and ssh into our workshop server. Then change directories as follows.
cd TodosSantos/metagenomic_binning
You can view the contents of the directory with ls.
Note that the directory contains a file named bemisia.contigs.fas. This file was produced by assembling the total DNA sample with Velvet, but it has been filtered to remove contigs with length less than 1500 bp or coverage less than 3x. Our first step in analyzing these contigs will be to BLAST them against a set of representative bacterial sequences.
makeblastdb -in bacterial_references.fas -dbtype nucl
    
blastn -task blastn -evalue 1e-6 -num_threads 2 -db bacterial_references.fas -query bemisia.contigs.fas -out bemisia.blast.txt
The next step is to calculate the GC content of each contig and summarize it alongside the contig’s top BLAST hit and its reported sequencing depth. In the same terminal window, run the following command (all on one line).
perl contig_summary.pl bemisia.contigs.fas bemisia.blast.txt colors.txt 1500 3 39 > bemisia.metasummary.txt
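The internals of contig_summary.pl are not shown here, so the following Python sketch only illustrates the kind of per-contig calculation described above; it is not the script's actual implementation. It assumes Velvet-style contig names of the form NODE_<id>_length_<n>_cov_<x>, and it reports GC content as a percentage, which is also an assumption. The function names (read_fasta, gc_content, velvet_coverage) are hypothetical.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Return percent G+C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0


# Assumed Velvet-style header, e.g. NODE_12_length_2040_cov_15.37
VELVET_RE = re.compile(r"NODE_\d+_length_(\d+)_cov_([\d.]+)")


def velvet_coverage(header):
    """Return the coverage embedded in the contig name, or None if absent."""
    match = VELVET_RE.search(header)
    return float(match.group(2)) if match else None


# Toy input for illustration only; not taken from the workshop data.
example = [">NODE_1_length_8_cov_12.5", "ATGC", "GGCC"]
for header, seq in read_fasta(example):
    print(header, round(gc_content(seq), 1), velvet_coverage(header))
```

To try it on the real assembly, pass an open file handle instead of the toy list, for example read_fasta(open("bemisia.contigs.fas")).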
Now we will make a quick plot of our data. Later in the workshop, we will introduce R and data visualization more thoroughly, so we will move quickly here to get to the final output.
From the terminal start an R session simply by typing the letter R:
R
Now read the data into R with the following command:
contig_data = read.table("bemisia.metasummary.txt", header = TRUE)
We are going to make a plot that requires an R library called scales. Load this with the following command:
library(scales)
Now generate a plot (it will be written to a file called Rplots.pdf) using the following command:
plot(contig_data$GC_content, contig_data$Coverage, log = "y", pch = 19, col = alpha(contig_data$Color, 0.5), xlab = "GC Content", ylab = "Coverage")
Now quit your R session as follows:
quit()
Transfer the Rplots.pdf file to your own computer to look at the plot. You should see a plot like the following (except without the added circles and labeling):
Note the discrete clusters of points (contigs). Each cluster is strongly associated with BLAST hits to the same species/taxon. Some or all of these three types of information (depth of coverage, GC content, and sequence homology) form the basis for most metagenomic binning tools.
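To put numbers on the clusters seen in the plot, the summary table can be grouped and each group's typical GC content and coverage reported. The Python sketch below is a rough illustration, not part of the workshop scripts. It assumes that bemisia.metasummary.txt is whitespace-delimited with one header name per column, and that the Color column reflects each contig's top BLAST hit. Only the GC_content, Coverage, and Color column names come from the R commands above; the function name summarize_by_color, the Contig column, and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_by_color(lines):
    """Group contigs by Color; return {color: (n, median GC, median coverage)}."""
    rows = iter(lines)
    header = next(rows).split()
    groups = defaultdict(list)
    for line in rows:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        record = dict(zip(header, fields))
        groups[record["Color"]].append(
            (float(record["GC_content"]), float(record["Coverage"]))
        )
    summary = {}
    for color, values in groups.items():
        gc_values = [gc for gc, _ in values]
        cov_values = [cov for _, cov in values]
        summary[color] = (len(values), median(gc_values), median(cov_values))
    return summary


# Toy table for illustration only; these values are made up.
toy = [
    "Contig GC_content Coverage Color",
    "c1 30.0 100 red",
    "c2 32.0 120 red",
    "c3 60.0 8 blue",
]
for color, (n, gc, cov) in summarize_by_color(toy).items():
    print(color, n, gc, cov)
```

Groups whose contigs share similar GC content and coverage, and hit the same reference, are the ones a binning tool would place together.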
Bonus question: Why are all the clusters slanted downward?