Python Scripting and Biopython

Background

Python is a popular scripting language that has a well-developed set of tools for bioinformatics. These tools are part of of Biopython and allow the use of various biologically-relevant objects built from FASTA files, Blast output, etc.

Objectives

The goal of this exercise is to provide a brief introduction to Python scripting and the functions available in Biopython.

Software and Dependencies

This exercise requires Python, Biopython, and BBEdit.

Part 1: Python variables

We will start by using an interactive shell to get the hang of Python. Log in to the server cctsi-104.cvmbs.colostate.edu and navigate to the directory ~/TodosSantos/scripting_exercises/. Type python into the command line to enter an interactive shell.

In Python, syntax is important, including spaces and indents. Also note the version number listed when you enter the shell. An important point is that Python versions 2 and 3 are quite different and aren’t necessarily compatible with one another, so always be aware of the version you’re using.

Python can handle multiple types of variables. First, it can handle numbers. Put the following into the Python shell:

number1 = 1
number2 = 2.0

Now type “number1” and “number2” into the shell and make sure your variables were stored properly:

number1

## 1

number2

## 2.0

Python can also have character variables, which are denoted by using apostrophes (’’) or quotation marks (“”). Enter the following into the shell:

singlechar = "a"
anotherchar = 'b'

Now type the variable names into the shell as you did above.

singlechar

## 'a'

anotherchar

## 'b'

Multiple characters within apostrophes or quotation marks make what’s called a string. Here are two examples:

name = "Todos Santos"
othername = 'Mexico'

name

## 'Todos Santos'

othername

## 'Mexico'

Part 2: Python lists

Variables can be stored together into lists. Lists are denoted with square brackets ([]).

We can group multiple variables of the same type into a list:

greetings = ["hello", "hola", "bonjour", "hallo"]

We can also group variables of different types into a list:

newList = ["a", 1, "Todos Santos", 2, "Mexico"]

Check that both of your lists are correct by typing their names into the shell.

We can also start by making an empty list and then add elements using append:

numbers = []
numbers.append(1)
numbers.append(2)
numbers.append(3)
numbers

## [1, 2, 3]

We can obtain different items in the list using indexing. Indexing allows us to get the item at a specific position in the list. Indexing starts at 0 in Python, so the first element of the list is number 0, the second element of the list is number 1, and so on. Try this short indexing exercise:

greetings

## ['hello', 'hola', 'bonjour', 'hallo']

greetings[0]

## 'hello'

greetings[1]

## 'hola'

Did the indexing print out the expected item in greetings?

We can also index starting at the back of the list rather than the front by using negative numbers. The last item in the list is number -1, the second to last item in the list is number -2, and so on. Try this indexing exercise:

greetings

## ['hello', 'hola', 'bonjour', 'hallo']

greetings[-1]

## 'hallo'

greetings[-2]

## 'bonjour'

In this case, what negative index value can be used to get the first item in greetings (“hello”)? Try it out. Did Python print out “hello?”

We can also remove items in a list. To remove a specific item from greetings:

greetings

## ['hello', 'hola', 'bonjour', 'hallo']

greetings.remove("hello")
greetings

## ['hola', 'bonjour', 'hallo']

Now greetings lacks “hello” at the front, so now the first item in greetings is “hola.”

We can also delete an item based on its index:

greetings

## ['hola', 'bonjour', 'hallo']

del greetings[0]
greetings

## ['bonjour', 'hallo']

Now greetings only has two elements.

Pop can also be used to remove items (and will print the item that was removed):

alphabet = ["a", "b", "c", "d"]
alphabet.pop()

## 'd'

alphabet

## ['a', 'b', 'c']

By default, pop() will remove the last item in the list. You can also tell it which index to remove:

alphabet.pop(0)

## 'a'

alphabet

## ['b', 'c']

Part 3: For loops

Now, instead of typing commands directly into the Python shell, we will write them into a script. A script is a file that contains a series of commands, which are run sequentially when we run the script.

Exit the Python shell using exit(). Open the file loops_ifelse.py through BBEdit or another text editor.

Keeping scripts allows you to reproduce your work easily, since you can run the same script over and over again. It also allows you to keep notes about the code you used so that it’s easier to understand what you did later. Comments are made in Python using pound signs (#). If you put a pound sign at the beginning of a line, the computer will not read anything on that line, so it can be used to take notes. At the top of the script, the name of the script, its use, the date it was created, and the creator’s name are all included in comments.

Look at the next portion of the script. You’ll notice that the list greetings is created:

greetings = ["hello", "hola", "bonjour", "hallo"]

and then we iterate through the list using a for loop. A for loop is used to complete the same task on multiple items, in this case all of the items in a list. This for loop prints out each item in the list.

Run the script on the server from the command line with:

python loops_ifelse.py

## hello
## hola
## bonjour
## hallo

You’ll notice that the four greetings were printed to the screen. Let’s make the greetings print to a file instead. To do this:

Comment out the current for loop by putting pound signs at the front of every line.
Underneath the original for loop, add the following code:

file = open("output.txt","w") #opens a new file for writing

The open command makes a new file with the name given as a string, and the “w” after the name tells Python that this file is being opened for writing. We could also use “r” to open the file for reading.

Add the new for loop:

for word in greetings: #for each word in the greetings list
  file.write(word) #write the word to our new file

This will add each word in the list to the file instead of printing it to the screen.

After the for loop, add:

file.close()

This command closes the file. We can’t access the edited file until it is closed.

Now save the file to the server and run it again. Open the resulting output file (output.txt). What is wrong with the output file?

When we print to a file in Python, we need to tell Python to add spaces. In this case, we want to enter between words so they’re each on one line. The enter symbol in Python is “”. So open up the script again and replace the latest additions with these commands:

file = open("output.txt","w") #opens a new file for writing
for word in greetings: #for each word in the greetings list
  file.write(word) #write the word to our new file
  file.write("\n") #enter after each word
file.close()

Now save the file, upload it to the server (in the directory ~/TodosSantos/scripting_exercises/), and run it again. Did we fix the output?

We can also use a for loop to print out each letter in the word “hola.” “hola” is the second element of the greetings list, so we can index it with greetings[1].

Add the following for loop to the end of the script:

for char in greetings[1]:
    print(char)

## h
## o
## l
## a

Reupload the script and run it again. What prints to the screen this time?

There are also loops called while loops, which loop through items until a certain condition is met. We will not write a while loop, but the concept is similar to that of a for loop.

Stop when you get to this point.

Exercise: Pseudocode

Part 4: Python strings

For this exercise, use an interactive Python shell again.

In addition to being able to loop through a list of strings (like the list greetings above), we can also loop through an individual word’s characters. Let’s say we have a file called Homo_sapiens_genes.fasta. Start by inputting the file name:

filename = "Homo_sapiens_genes.fasta"

Now let’s loop through all of the characters in that file name:

for char in filename: 
  print char

## H
## o
## m
## o
## _
## s
## a
## p
## i
## e
## n
## s
## _
## g
## e
## n
## e
## s
## .
## f
## a
## s
## t
## a

We can index a string the same way we can index a list:

filename[0]

## 'H'

filename[-1]

## 'a'

We can also use Python to “parse” the file name. Parsing is the process of breaking up a string into smaller components. Perhaps we want the part of the file name before the extension. To split the filename at the period, we can use the split() function. Try this command:

filename.split(".")

## ['Homo_sapiens_genes', 'fasta']

You’ll notice that this command produces both components of the file name after the split (and note that the character used to split the string is eliminated). To save one of these components, we can use indexing:

beforeExt = filename.split(".")[0]
beforeExt

## 'Homo_sapiens_genes'

Perhaps we also want to make a new file called “Homo_sapiens_genes.txt.” We can use concatenation to do this. Concatenation is the merging of two strings.

beforeExt = filename.split(".")[0]
newName = beforeExt + ".txt"
newName

## 'Homo_sapiens_genes.txt'

There are many other string functions available in Python. You’ll see two more (endswith and startswith) in Part 5 of this exercise. A good resource is w3schools.

Part 5: Python if/else statements

For this exercise, exit the Python shell and continue using the script loops_ifelse.py.

Now we will use if/else statements to print out certain words in our greetings list. An if/else statement completes an action only if a particular condition is met (and if it’s not, you can use an else to perform a different action).

Comment out all of the lines in loops_ifelse.py except the line that makes the greetings list. Below the commented out sections, add the code:

for word in greetings: #go through each word in the list
    if word.startswith("h"): #if the word starts with h
        print(word)

## hello
## hola
## hallo

This code will iterate through each word in the list, just like before, but will only print out the word if it starts with the letter “h”. Save the script and run it on the server. Which words now print to the screen?

Say that we want also want to print “no” if the word does not start with “h”. Update your for loop to the following:

for word in greetings: #go through each word in the list
    if word.startswith("h"): #if the word starts with h
        print(word)
    else:
      print("no")

## hello
## hola
## no
## hallo

We can also add multiple conditional statements. Say we want to print any word that ends with the letter “o”“, print”yes" for any word that starts with “h” but does not end in “o”, and print “no” for any word that doesn’t meet either of those criteria. Replace your loop with the following:

for word in greetings: #go through each word in the list
    if word.endswith("o"): #if the word ends with o
        print(word)
    elif word.startswith("h"): #"elif"" stands for "else if"
      print("yes")
    else:
      print("no")

## hello
## yes
## no
## hallo

Run the script again (and note that “elif” stands for “else if”). Do the words print out correctly and in the order you expected?

Part 6: Dictionaries

For this exercise, enter another interactive Python shell on the server.

A Python dictionary is a special type of list that allows you to store a key and value together. This type of storage lends itself well to biological data, which we will see in the Biopython section below.

Let’s say we want to keep a dictionary containing instructor names and their favorite numbers: Alissa, 6 Dan, 112 Tai, 23 Mark, 5555 (these are made up apart from Alissa’s favorite number :)).

Dictionaries are initialized using curly braces ({}). Let’s make one for this data:

favNums = {"Alissa":6,"Dan":112,"Tai":23, "Mark":5555}

In this case, the names are keys, while the favorite numbers are values. To retrieve the value associated with “Alissa”:

favNums["Alissa"]

## 6

There are many Python functions associated with dictionaries. Here is a good resource: w3schools

For now, we’ll finish this section by looping through a dictionary. In this case, let’s loop through both keys and values of the dictionary:

for key, value in favNums.items():
  print(key,value)

## ('Dan', 112)
## ('Mark', 5555)
## ('Alissa', 6)
## ('Tai', 23)

Stop when you get to this point.

Slides: Biopython

Part 7: Biopython

Biopython is a great tool that builds on Python to specifically cater to biological data. All of the concepts we have just used apply to Biopython, and Biopython has additional functionality. For a long tutorial on all of the things Biopython can do, see the official Biopython website.

A particularly useful package in Biopython is SeqIO. SeqIO allows you to read in different file types of biological relevance, including fasta/fastq, phylip, genbank, etc. These files are used to create SeqRecord objects, so you may also find useful information in documentation about SeqRecords.

Open BBEdit and make a file called biopython.py. Remember to comment! At the top, add an import statement:

from Bio import SeqIO

Biopython is not part of the Python core, so we have to load specific Bipython packages when we need them. In this case, we are importing SeqIO, a package that lets us read in different types of files. In this case, we’ll be reading the file Subtree_7365_seqs.fasta, so add the following line below the import statement in the script:

fastaFilename = "Subtree_7365_seqs.fasta"

One particularly useful thing that SeqIO can do is iterate easily through all of the entries in a fasta (or other type of) file. Let’s print out the record (sequence) ID and the length of the sequence for each entry in the fasta file. Add the following code to the end of the script:

for record in SeqIO.parse(fastaFilename, "fasta"):
    print(record.id, len(record))

Now run the script and note what is printed to the screen.

SeqIO can also be used to make a fasta file into a dictionary, where the keys are the sequence headers (prefixed with “>”) and the values are the actual sequences. Add the following code to the script and run it again:

fastadict = SeqIO.to_dict(SeqIO.parse(fastaFilename, 'fasta'))

Say I want the entry associated with the ID “G_maderense_prot_G_maderense__g.48982comp14163_c0_seq1" Add this to the script:

seqName = "G_maderense_prot_G_maderense__g.48982comp14163_c0_seq1"

(Note that the sequence name does not include the “>.” Python automatically removed that when I created the dictionary.)

I can now retrieve the information associated with that sequence name (add this to the script):

entry = fastadict[seqName]
print(entry)

Now run the script again.

What if I want only the sequence assoicated with that ID? Add this code and run the script yet again:

print(entry.seq)

You can also extract the sequence IDs or other attributes (name, description, etc). See the SeqIO documentation as well as the Biopython documentation’s SeqRecord section.

We can also use Biopython to execute Blast searches and manipulate the output.

Let’s do a blast search on a specific sequence. We’ll use the sequence of clpP1 from Arabidopsis thaliana, which I’ve put int the file clpP1_Athaliana.txt. To open and read the file’s contents, add the following to biopython.py:

f = open("clpP1_Athaliana.txt","r")
seq = f.read()
f.close()

To use this sequence in blastn (nucleotide blast), first import the correct package by adding this line to the top of the script:

from Bio.Blast import NCBIWWW

and then use the following command (this will take several minutes, so the script may run more slowly):

result_handle = NCBIWWW.qblast("blastn", "nt", seq)

We can now use result_handle to read the blast output. First, add another import statement to the top of the script:

from Bio.Blast import NCBIXML

And then add to the end of the script:

blast_records = NCBIXML.read(result_handle)

(if we had blasted more than one sequence at the same time, we’d use NCBIXML.parse instead of NCBIXML.read).

We can now use a for loop to go through all of the blast hits. Use the following code to print out all alignments in the blast output that met a certain e-value threshold. Run the script once you add this code. Note: blast_records contains multiple blast hits, which each have many different attributes. Walk yourself through the following code and use a comment to record what you think each command is doing in this for loop. Hint: hsp = “high scoring pair”

E_VALUE_THRESH = 0.04
for alignment in blast_records.alignments:
     for hsp in alignment.hsps:
         if hsp.expect < E_VALUE_THRESH:
             print("****Alignment****")
             print("sequence:", alignment.title)
             print("length:", alignment.length)

This is just one way of many that we can use Python to parse (break apart) Blast output into useful attributes. See the Biopython tutorial’s Blast section for more information.

Part 8: Additional exercises

Write a script to determine the GC content of the sequence “ATCGGGCTTTTAAGGCC.” Hint: make a file containing this sequence called “sequence.txt” and read it into Python by including:
```
file = open("sequence.txt","r")
seq = file.read()
file.close()
```
Make a dictionary containing at least 10 key:value pairs, where the value for each key is a number. Write a for loop to add up all of the values in the dictionary.
Make a dictionary containing at least 10 key:value pairs, where only some values are numbers. Write a for loop to add up all of the numbers in the dictionary, this time using an if statement to determine whether the value is a number.