Todos Santos Computational Biology and Genomics 2018

Introduction to Scripting

The purpose of this exercise is to introduce bash scripting for creating command line workflows.

OUTLINE

  1. Why scripts are useful - a prerequisite to scripting.
  2. Text editor and bash scripting basics.

Script: A series of commands or instructions that automates a task. The commands are written in a text file that is then executed by a program without first being compiled (converted into binary machine code).

Scripting language: A computer programming language that supports scripts. The scripts are typically interpreted by the program and do not have to be compiled.


Part 1. Prerequisite to scripting

In this exercise, we’ll discuss variable assignment and manipulation. The goal is to show why repeated tasks are worth automating; the next exercise introduces scripts for that purpose.

1. Open a shell or terminal window.

2. Assign a DNA sequence to a variable (sequence):

In [ ]:
%%bash
sequence=ACTGTACGGTACAC

3. Complement the sequence using tr:

In [ ]:
%%bash
# Quote the character sets so the shell does not treat the brackets as a filename pattern
echo "$sequence" | tr 'ACTGactg' 'TGACtgac'

4. Reverse the sequence using rev:

In [ ]:
%%bash
echo "$sequence" | rev

5. Reverse complement the sequence using tr and rev:

In [ ]:
%%bash
echo "$sequence" | rev | tr 'ACTGactg' 'TGACtgac'

6. Calculate the length of the sequence using wc -m:

In [ ]:
%%bash
echo -n "$sequence" | wc -m

Each of these steps had to be typed separately. If we wanted to repeat them for another sequence, a script could automate the whole series, as demonstrated in Part 2.
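
Two details of the commands above can be checked directly: rev and tr can be applied in either order, and bash can report the length of a string without wc. The sketch below assumes a bash shell with rev and tr available. The variable sequence comes from step 2; the names rc_a and rc_b and the quoted form of the tr character sets are choices made for this illustration.

```shell
#!/bin/bash
# Sketch: the Part 1 steps gathered in one place.
# rc_a and rc_b are illustrative names, not part of the exercise.
sequence=ACTGTACGGTACAC

# Reverse then complement, and complement then reverse, give the same
# result because tr translates one character at a time.
rc_a=$(echo "$sequence" | rev | tr 'ACTGactg' 'TGACtgac')
rc_b=$(echo "$sequence" | tr 'ACTGactg' 'TGACtgac' | rev)

echo "rev | tr: $rc_a"
echo "tr | rev: $rc_b"

# ${#sequence} expands to the number of characters stored in the variable.
echo "Length: ${#sequence}"
```

Both pipelines should print the same reverse complement, and the length should agree with the value reported by wc -m in step 6.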


Part 2. Bash scripting

In this exercise, we will write a bash script to identify the length, the complement, the reverse, and the reverse complement of a DNA sequence.

TEXT EDITORS

A good text editor is essential for writing scripts. Microsoft Word and Mac TextEdit should not be used for writing scripts.

Common Free Text Editors: TextWrangler (macOS) and Notepad++ (Windows), both used as examples in the steps below.

1. Open a shell or terminal window.

2. Create a new directory named bash_scripts using mkdir.

3. Change into the bash_scripts directory using cd.

4. Open a text editor, such as TextWrangler or Notepad++, and use it to create a new file called iseq.sh within the bash_scripts directory. On macOS, the editor can be launched from the terminal using open. In the next steps, we will write a bash script by adding commands to the file.

5. Confirm that you are in the bash_scripts directory using pwd and that the iseq.sh script is in the directory using ls.

6. Insert a shebang as the first line of the iseq.sh file so that the script is run by bash: #!/bin/bash

7. Prompt the user for a sequence and store as a variable (seq) using read -p:

read -p "Enter a sequence: " seq

8. Complement the sequence using tr and store as new variable (comp):

comp=$(echo "$seq" | tr 'ACTGactg' 'TGACtgac')

9. Reverse the sequence using rev and store as a new variable (reverse):

reverse=$(echo "$seq" | rev)

10. Reverse complement a sequence using tr and rev and store as a new variable (revcomp):

revcomp=$(echo "$seq" | tr 'ACTGactg' 'TGACtgac' | rev)

11. Calculate the length of the sequence using wc -m and store as a new variable (length):

length=$(echo -n "$seq" | wc -m)

12. Print to the shell the output of each of the above steps along with a brief description:

echo ""
echo "Original sequence: $seq"
echo "Complement: $comp"
echo "Reverse: $reverse"
echo "Reverse complement: $revcomp"
echo "Length: $length"
echo ""
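
Steps 7 through 12 can also be wrapped in a shell function, which makes the logic easy to try on a known input without typing at a prompt. This is a sketch, not the iseq.sh file itself: the function name describe_sequence is hypothetical, and passing the sequence as an argument (rather than reading it with read -p) is an assumption made for illustration.

```shell
#!/bin/bash
# Sketch: the same steps as iseq.sh, but the sequence arrives as an argument.
# describe_sequence is a hypothetical name, not part of the exercise.
describe_sequence() {
    local seq=$1
    local comp reverse revcomp length

    comp=$(echo "$seq" | tr 'ACTGactg' 'TGACtgac')
    reverse=$(echo "$seq" | rev)
    revcomp=$(echo "$seq" | tr 'ACTGactg' 'TGACtgac' | rev)
    # tr -d ' ' strips the padding that some wc implementations add
    length=$(printf '%s' "$seq" | wc -m | tr -d ' ')

    echo "Original sequence: $seq"
    echo "Complement: $comp"
    echo "Reverse: $reverse"
    echo "Reverse complement: $revcomp"
    echo "Length: $length"
}

describe_sequence ACTG
```

Calling the function with a short sequence whose answers are easy to work out by hand, such as ACTG, is a quick way to confirm that each line of output is labeled correctly.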

13. Return to your terminal window and execute the script using ./iseq.sh. Did you receive an error message? How can you make the script executable?

14. Edit iseq.sh to repeat indefinitely using a while loop as follows:

a. Add while before the read command.

while read -p "Enter a sequence: " seq

b. On a new line directly after the line containing while, add the command do:

do

c. At the end of the script add a new line and the command done:

done
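
A while read loop ends when read fails, which happens at the end of the input (Ctrl-D at an interactive prompt). The sketch below shows the same while/do/done structure fed from a here-document so that it stops on its own. The two input sequences and the counter count are made up for the example, the loop body is shortened to a single calculation, and read -r is used in place of read -p because there is no user to prompt.

```shell
#!/bin/bash
# Sketch: the while/do/done structure from step 14, reading from a
# here-document instead of the keyboard. The input lines are invented.
count=0
while read -r seq
do
    revcomp=$(echo "$seq" | tr 'ACTGactg' 'TGACtgac' | rev)
    echo "Reverse complement of $seq: $revcomp"
    count=$((count + 1))
done <<EOF
ACTG
GGCA
EOF

echo "Processed $count sequences"
```

The loop runs once per input line and then falls through to the final echo, which is the same thing that happens in iseq.sh when the input is closed.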

15. Return to your terminal window and execute the script again.

16. Edit iseq.sh to exit if something other than a DNA sequence is entered.

a. Directly after the read statement, check if the variable contains an empty string using a conditional statement:

if [[ $seq == "" ]]
    then
        echo ""
        echo "All done. Adios!"
        echo ""
        exit
fi

b. Directly after the previous conditional statement, insert a conditional statement that checks for non-DNA characters:

if [[ $seq =~ [^ACTGactg] ]]
    then
        echo ""
        echo "Non-DNA characters detected. Adios!"
        echo ""
        exit
fi
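
The pattern [^ACTGactg] matches any single character that is not one of the eight listed letters, so the =~ test succeeds as soon as one such character appears anywhere in the string. The sketch below isolates that check so it can be tried on several inputs. The function name is_dna is hypothetical, and treating an empty string as invalid is a choice made here, not something the exercise specifies.

```shell
#!/bin/bash
# Sketch: is_dna is a hypothetical helper, not part of iseq.sh.
# It returns 0 (success) only for a non-empty string made up of
# A, C, T, and G in either case.
is_dna() {
    local seq=$1
    [[ -n $seq ]] || return 1                # empty input is rejected here
    [[ $seq =~ [^ACTGactg] ]] && return 1    # a non-DNA character was found
    return 0
}

for candidate in ACTG actgGTCA ACXG "AC TG"
do
    if is_dna "$candidate"
    then
        echo "'$candidate': DNA"
    else
        echo "'$candidate': not DNA"
    fi
done
```

Note that a space counts as a non-DNA character, so a sequence pasted with internal whitespace is rejected by this test just as it would be by the conditional in iseq.sh.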

17. Return to your terminal window and execute the script using ./iseq.sh. Try entering non-DNA characters. Does the program terminate?