Identifying and aligning similar DNA and protein sequences is one of the most common tasks in computational biology, and BLAST (Basic Local Alignment Search Tool) still represents one of the most widely used and effective tools. While many users are familiar with running BLAST searches through the NCBI webserver, it is also possible and often preferable to download BLAST software and databases and run searches on a local computer or server. Advantages of running local BLAST include…
The ability to use customized databases including your own unpublished genome and transcriptome sequences that are not available on GenBank
The ability to submit searches for thousands of query sequences simultaneously (e.g., all contigs from a de novo transcriptome assembly)
Integration into custom scripts and pipelines
There are numerous specialized “flavors” of BLAST, but the following five programs represent the classic and most widely used methods. They differ based on whether the inputs are nucleotide or amino acid sequences and whether the alignments are based on nucleotide or (translated) amino acid sequences.
Program | Query Sequences | Database Sequences | Alignment |
---|---|---|---|
blastn | nucleotide | nucleotide | nucleotide |
blastp | amino acid | amino acid | amino acid |
blastx | nucleotide | amino acid | amino acid* |
tblastn | amino acid | nucleotide | amino acid* |
tblastx | nucleotide | nucleotide | amino acid* |
*blastx, tblastn, and tblastx translate the input query and/or database sequences in all six possible reading frames before performing alignment.
The goal of this exercise will be to make sequence comparisons between entire sets of proteins and entire genomes of closely related bacteria. This will involve generating databases and running searches with local BLAST, parsing the output with BioPerl scripts, and plotting data with R.
[Note: If you do not have local BLAST and/or BioPerl installed, intermediate files are stored in the ~/TodosSantos/local_blast/prerun/
directory so that you can follow along.]
Prior to running a local BLAST search, you must first download or create a BLAST database. Familiar databases like “nr” or “nt” can be downloaded directly from NCBI for use in local searches, but you can also create a custom BLAST database from any input file in FASTA format. In this exercise, we will make two BLAST databases. One will be made from the entire set of protein sequences from a particular strain of Escherichia coli, and the other will consist of a single nucleotide sequence of the entire E. coli genome.
First, open a Terminal session and connect to your account on our workshop server. Once you have done that, enter the following command to move into the directory with the relevant files for this exercise.
cd ~/TodosSantos/local_blast/
The directory contains FASTA files with E. coli protein sequences (Ecoli.proteins.fas
) and the E. coli genome sequence (Ecoli.genome.fas
). To convert these files into BLAST databases run the following commands. As you can see, the -dptype
flag is required so that you can specify whether the sequences are nucleotide (nucl
) or amino acid (prot
) sequences.
makeblastdb -in Ecoli.proteins.fas -dbtype prot
makeblastdb -in Ecoli.genome.fas -dbtype nucl
Enter the ls
command and note that you have now generated three files with additional extensions for each of the FASTA input files. These are the actual BLAST database files that will be used to run searches.
Ecoli.genome.fas
Ecoli.genome.fas.nhr
Ecoli.genome.fas.nin
Ecoli.genome.fas.nsq
Ecoli.proteins.fas
Ecoli.proteins.fas.phr
Ecoli.proteins.fas.pin
Ecoli.proteins.fas.psq
Now use the BLASTP program to run a search for every protein sequence from a strain of Salmonella enterica (Salmonella.proteins.fas
) against the E. coli protein database you just made. Enter the following command:
blastp -query Salmonella.proteins.fas -db Ecoli.proteins.fas -evalue 1e-6 -num_threads 4 -out blastp.txt
Note that the -evalue
option sets the significance threshold for reporting hits. This can be modified depending on how stringent/permissive you would like to be with your search. There are 5027 proteins in the Salmonella file, so this will run 5027 separate BLAST searches. The -num_threads
option tells the program to use multiple cores (in this case 4) to make the search go a little faster. It should take about 1 min. For more information on BLAST settings you can type: blastp -help
Once the search is complete, view the output file with the following command:
less blastp.txt
This is a plain-text version of the information you would see when you run a BLAST search online. To exit viewing the file, type q
.
The output text file from BLAST is easy to read, but when you do large numbers of searches it is far too much to view all of it. Extracting key information from BLAST output with Perl or Python scripts can be very valuable. Here, we will use a Perl script that relies on modules from BioPerl that make it easy to parse BLAST output.
perl BLASTP_parse.pl blastp.txt > blastp.parsed.txt
Lets look at the last few lines of the output file with the tail
command.
tail blastp.parsed.txt
## zipA zipA 76.3848396501458
## zntA zntA 84.9655172413793
## zntB ydaN 92.6829268292683
## zntR zntR 92.9577464788732
## znuA znuA 86.984126984127
## znuB znuB 96.1832061068702
## znuC znuC 95.2380952380952
## zraP zraP 63.2
## zur zur 93.0232558139535
## zwf zwf 97.3577235772358
Each line corresponds to a Salmonella protein and reports its top E. coli hit and the percent amino acid identity between the query and the hit. You can view all of the parsed output file (blastp.parsed.txt
) from the command line with less
or by opening it in Microsoft Excel.
Note that we have only extracted a very small amount of information from the BLAST output (name of the query sequence, name of the top hit sequence, and percent identity of the top hit). But depending on your needs, you can extract essentially any of the information you could want from the file, including alignment lengths, positions, orientations, sequences, scores, etc.
Open the BLASTP_parse.pl
script in BBEdit (or a command-line editor like nano
). The Perl code is complicated, but look at it to get a general sense of what it is doing.
Now try changing line 25 from…
$value = $hsp_obj->percent_identity;
to…
$value = $hsp_obj->evalue;
If you save these changes and repeat the parsing of the BLAST output, you will find the parsed output file now reports the e-value for each hit rather than the percent amino-acid identity.
Now try performing a nucleotide-based search to compare the entire Salmonella genome against the entire E. coli genome by entering the following command:
blastn -task blastn -query Salmonella.genome.fas -db Ecoli.genome.fas -evalue 1e-20 -num_threads 4 -out blastn.txt
Note the use of the -task blastn
flag. That may seem redundant because the program itself is BLASTN, but this program defaults to use a version of BLAST called MEGABLAST, which has very stringent default parameters (including a word size of 28). Therefore, for many or most applications it is necessary to either reduce the word size in MEGABLAST or to run classic BLASTN by using the task flag. Also note that we are using and unusually low e-value (1e-20). This will eliminate a large number of significant but short alignments from our output, which we are not interested in for this exercise. For more information on BLAST settings you can type: blastn -help
Once the search is complete, view the output file with the following command.
less blastn.txt
This is a plain-text version of the information you would see when you run a BLAST search online. To exit viewing the file, type q
. As above, we will use a BioPerl script to parse and summarize the BLAST output. But in this case, we are going to use a script that will report much more information about each alignment.
perl BLASTN_parse.pl blastn.txt > blastn.parsed.txt
The parsed output file (blastn.parsed.txt
) has numerous columns so open it in Microsoft Excel to view it (you will have to transfer it back to your local machine first). Note that the output file reports the positions of both the query and the hit for each BLAST alignment. Try using these to make a “dot-plot”.
Launch an R session on the workshop server by simply typing R
.
R
blastnData = read.table ("blastn.parsed.txt", sep = '\t', header=TRUE)
And then make an XY scatter plot showing the locations of the BLASTN hits in the two genomes with the following command. Note that the cex
options simply sets the size of the points. We will save the plot to a PDF file by calling pdf()
and the closing the file with dev.off()
. The output should look like the image below. To view the output PDF file you will neen to transfer my_dotplot.pdf
to your own computer with Cyberduck.
pdf ("my_dotplot.pdf")
plot (blastnData$Query_Start, blastnData$Hit_Start, cex = .25)
dev.off()
quit()
What is this plot telling us about the similarities and differences between the genome structures of E. coli and Salmonella? How can you explain the downward-sloping set of points that ends at ~2 Mb on the x-axis?