Background

Identifying and aligning similar DNA and protein sequences is one of the most common tasks in computational biology, and BLAST (Basic Local Alignment Search Tool) still represents one of the most widely used and effective tools. While many users are familiar with running BLAST searches through the NCBI webserver, it is also possible and often preferable to download BLAST software and databases and run searches on a local computer or server. Advantages of running local BLAST include…

There are numerous specialized “flavors” of BLAST, but the following five programs represent the classic and most widely used methods. They differ based on whether the inputs are nucleotide or amino acid sequences and whether the alignments are based on nucleotide or (translated) amino acid sequences.

Program Query Sequences Database Sequences Alignment
blastn nucleotide nucleotide nucleotide
blastp amino acid amino acid amino acid
blastx nucleotide amino acid amino acid*
tblastn amino acid nucleotide amino acid*
tblastx nucleotide nucleotide amino acid*

*blastx, tblastn, and tblastx translate the input query and/or database sequences in all six possible reading frames before performing alignment.

Objectives

The goal of this exercise will be to make sequence comparisons between entire sets of proteins and entire genomes of closely related bacteria. This will involve generating databases and running searches with local BLAST, parsing the output with BioPerl scripts, and plotting data with R.

Software and Dependencies

[Note: If you do not have local BLAST and/or BioPerl installed, intermediate files are stored in the ~/TodosSantos/local_blast/prerun/ directory so that you can follow along.]

Protocol

1. Make BLAST databases

Prior to running a local BLAST search, you must first download or create a BLAST database. Familiar databases like “nr” or “nt” can be downloaded directly from NCBI for use in local searches, but you can also create a custom BLAST database from any input file in FASTA format. In this exercise, we will make two BLAST databases. One will be made from the entire set of protein sequences from a particular strain of Escherichia coli, and the other will consist of a single nucleotide sequence of the entire E. coli genome.

First, open a Terminal session and connect to your account on our workshop server. Once you have done that, enter the following command to move into the directory with the relevant files for this exercise.

cd ~/TodosSantos/local_blast/

The directory contains FASTA files with E. coli protein sequences (Ecoli.proteins.fas) and the E. coli genome sequence (Ecoli.genome.fas). To convert these files into BLAST databases run the following commands. As you can see, the -dptype flag is required so that you can specify whether the sequences are nucleotide (nucl) or amino acid (prot) sequences.

makeblastdb -in Ecoli.proteins.fas -dbtype prot

makeblastdb -in Ecoli.genome.fas -dbtype nucl

Enter the ls command and note that you have now generated three files with additional extensions for each of the FASTA input files. These are the actual BLAST database files that will be used to run searches.

Ecoli.genome.fas
Ecoli.genome.fas.nhr
Ecoli.genome.fas.nin
Ecoli.genome.fas.nsq

Ecoli.proteins.fas
Ecoli.proteins.fas.phr
Ecoli.proteins.fas.pin
Ecoli.proteins.fas.psq

3. Summarize BLAST results by parsing output file with a BioPerl script

The output text file from BLAST is easy to read, but when you do large numbers of searches it is far too much to view all of it. Extracting key information from BLAST output with Perl or Python scripts can be very valuable. Here, we will use a Perl script that relies on modules from BioPerl that make it easy to parse BLAST output.

perl BLASTP_parse.pl blastp.txt > blastp.parsed.txt

Lets look at the last few lines of the output file with the tail command.

tail blastp.parsed.txt
## zipA zipA    76.3848396501458
## zntA zntA    84.9655172413793
## zntB ydaN    92.6829268292683
## zntR zntR    92.9577464788732
## znuA znuA    86.984126984127
## znuB znuB    96.1832061068702
## znuC znuC    95.2380952380952
## zraP zraP    63.2
## zur  zur 93.0232558139535
## zwf  zwf 97.3577235772358

Each line corresponds to a Salmonella protein and reports its top E. coli hit and the percent amino acid identity between the query and the hit. You can view all of the parsed output file (blastp.parsed.txt) from the command line with less or by opening it in Microsoft Excel.

Note that we have only extracted a very small amount of information from the BLAST output (name of the query sequence, name of the top hit sequence, and percent identity of the top hit). But depending on your needs, you can extract essentially any of the information you could want from the file, including alignment lengths, positions, orientations, sequences, scores, etc.

Open the BLASTP_parse.pl script in BBEdit (or a command-line editor like nano). The Perl code is complicated, but look at it to get a general sense of what it is doing.

Now try changing line 25 from…

$value = $hsp_obj->percent_identity;

to…

$value = $hsp_obj->evalue;

If you save these changes and repeat the parsing of the BLAST output, you will find the parsed output file now reports the e-value for each hit rather than the percent amino-acid identity.

5. Plot data in R

Launch an R session on the workshop server by simply typing R.

R
blastnData = read.table ("blastn.parsed.txt", sep = '\t', header=TRUE)

And then make an XY scatter plot showing the locations of the BLASTN hits in the two genomes with the following command. Note that the cex options simply sets the size of the points. We will save the plot to a PDF file by calling pdf() and the closing the file with dev.off(). The output should look like the image below. To view the output PDF file you will neen to transfer my_dotplot.pdf to your own computer with Cyberduck.

pdf ("my_dotplot.pdf")
plot (blastnData$Query_Start, blastnData$Hit_Start, cex = .25)
dev.off()
quit()

What is this plot telling us about the similarities and differences between the genome structures of E. coli and Salmonella? How can you explain the downward-sloping set of points that ends at ~2 Mb on the x-axis?