Background

Bash is a scripting language that we can use to execute commands in a shell.

Objectives

The goal of this exercise is to provide an overview of useful Bash commands.

Software and Dependencies

Bash is a Unix/Linux shell and command language, and is already on the server.

Part 1: Changing PATH

Log on to the server cctsi-104.cvmbs.colostate.edu. We will start with the command which, which tells us 1) whether a program is automatically found by the computer and 2) where that program is located on the server.

Try the following command:
which python

Since this command gave you an output, Python is automatically located by the server. Further, you now know that the Python program on the server is located at /usr/bin/python, that is, inside the directory /usr/bin.

Try to find R:
which R

Is R also automatically located on the server?

We are going to use the program dna2prot to translate a DNA sequence. Use which to see where this program is located. What does the which command give you in this case?

The dna2prot program is not found by the server. Why can’t it be found? The reason has to do with PATH, which is a variable in Bash.

Use ls -l to look at the contents of your home directory. Now use ls -al. Notice that ls -al lists more files than ls -l does. This is because the “a” part of ls -al also lists files whose names begin with a dot (.), which are normally hidden from view.
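
A quick way to see the effect of the “a” flag is to try it in a scratch directory. The sketch below is only an illustration: the temporary directory and the file names visible.txt and .hidden_example are made up for this example and are not part of the exercise files.

```shell
# Make a temporary scratch directory (the file names here are made-up examples)
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden_example"

# ls -l shows only the file whose name does not start with a dot
ls -l "$demo_dir"

# ls -al also shows .hidden_example, plus the special entries . and ..
ls -al "$demo_dir"

# Capture just the file names so the two listings can be compared
without_a=$(ls "$demo_dir")
with_a=$(ls -a "$demo_dir")
echo "Without -a: $without_a"
echo "With -a:"
echo "$with_a"
```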

We want to look at the file called .profile. Note that on some computers and servers, this is called .bash_profile. Use cat to look at .profile.

This file contains information about your profile as a user on this server. The important lines for our purposes are:
# set PATH so it includes user's private bin directories
PATH="$HOME/bin:$HOME/.local/bin:$PATH"
# include /home/apps/bin in path 
PATH="$PATH:/home/apps/bin"

The lines that begin with a pound sign (#) are called comments. A comment is a way to take notes within your script. Anything written after # on the same line is not read by the computer. Comments are very important because they allow you to take notes about what you did, which makes it easier to understand/read your code later.

The lines beginning with PATH define where the server looks for programs. PATH is a list of directories separated by colons (:). The server can’t find the program dna2prot because it is not located in any of the directories listed in PATH.
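
To see which directories are currently in PATH, you can print the variable. The sketch below also uses command -v, a shell built-in that performs a lookup similar to which. The program ls is used only as an example of something that should be found on any system.

```shell
# Show the current value of PATH (directories separated by colons)
echo "$PATH"

# Show one directory per line, which is easier to read
path_entries=$(echo "$PATH" | tr ':' '\n')
echo "$path_entries"

# Ask the shell whether it can find a program; ls is used only as an example
if command -v ls > /dev/null ; then
    ls_status="found"
else
    ls_status="not found"
fi
echo "ls is $ls_status"
```

If you replace ls with dna2prot at this point in the exercise, the check should report that the program is not found.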

Try to run the following command. This program takes a DNA sequence as an input and will output the translation.
dna2prot ACGTAGCTAGTATATGCTGCATATTGACTGCATAGCTAGCATATTTATATGC 
This command doesn’t work for the reason outlined above: the server doesn’t know how to find the program. This program is located in the folder /home/amwill/scripts/, so instead try the command:
/home/amwill/scripts/dna2prot ACGTAGCTAGTATATGCTGCATATTGACTGCATAGCTAGCATATTTATATGC 

The program should now run.

But what if we don’t want to type the whole path to the program every time? We have two options. First, we can define a new PATH in the current session. Type in the command:
PATH=$PATH:/home/amwill/scripts
Now try this command again:
dna2prot ACGTAGCTAGTATATGCTGCATATTGACTGCATAGCTAGCATATTTATATGC 
It works! Now log out of the server and log back in. Try the above command one more time:
dna2prot ACGTAGCTAGTATATGCTGCATATTGACTGCATAGCTAGCATATTTATATGC 

Now it doesn’t work. It doesn’t work because the definition of PATH we used was only for the previous session; we have to retype it every time we log in for it to work. What if we want it to work every time without entering it in? We can change the .profile we looked at before.
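
The session-only step can be tried safely with a throwaway program. This is a minimal sketch under stated assumptions: hello_tool is a made-up program created in a temporary directory, standing in for dna2prot, and the script cannot show the effect of logging out.

```shell
# Create a throwaway directory holding a tiny example program
# (hello_tool is a made-up name standing in for dna2prot)
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from hello_tool"\n' > "$demo_dir/hello_tool"
chmod +x "$demo_dir/hello_tool"

# Before the directory is in PATH, the shell cannot find the program
if command -v hello_tool > /dev/null ; then before="found" ; else before="not found" ; fi
echo "Before changing PATH: $before"

# Add the directory to the end of PATH; this lasts only for the current session
PATH="$PATH:$demo_dir"

if command -v hello_tool > /dev/null ; then after="found" ; else after="not found" ; fi
echo "After changing PATH: $after"

# The program can now be run by name alone
output=$(hello_tool)
echo "$output"
```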

Stop when you get to this point. We will do this part together.

Enter this file with the command:
vi .profile
and use the down arrow to get to the last line of the file. Press the “o” key, which opens a new, empty line below the cursor and puts you in insert mode. On this new line, type:
PATH="$PATH:/home/amwill/scripts/"
Once you add this line, hit the esc key and then type in
:wq

to save and quit.

Now use cat to make sure your addition to .profile was saved. We also have to reload the .profile one time, which we can do with the command:
source ~/.profile
Now use this command again:
dna2prot ACGTAGCTAGTATATGCTGCATATTGACTGCATAGCTAGCATATTTATATGC 

It now runs because we added the path to this program to .profile. It will still work if you log out and then back in again (try it out!).
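
The same line can also be added without opening an editor, by appending it with “>>”. The sketch below is an alternative, not the method used in class. To keep it safe to run, it writes to a temporary stand-in file (profile_file, a made-up name) rather than your real .profile, and it checks first so the line is not added twice.

```shell
# Stand-in for ~/.profile so the example does not touch your real file
profile_file=$(mktemp)

# Single quotes keep $PATH as literal text, so it is written to the file as-is
new_line='PATH="$PATH:/home/amwill/scripts/"'

# Append the line only if the file does not already contain it
add_path_line() {
    if ! grep -qxF "$new_line" "$profile_file" ; then
        echo "$new_line" >> "$profile_file"
    fi
}

add_path_line
add_path_line   # the second call does nothing, so there is no duplicate

cat "$profile_file"
line_count=$(grep -cxF "$new_line" "$profile_file")
echo "Matching lines in file: $line_count"
```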

Part 2: Using bash scripts

Say that we want to rename a file on the server. Find the file practice_compiledSeqs.txt by navigating to ~/TodosSantos/scripting_exercises/. To rename this single file, we can use the command mv (move):
mv practice_compiledSeqs.txt practice_compiledSeqs.fas

When you type in ls, you should see that the file now has the extension .fas instead of .txt.

But what if we want to rename hundreds or thousands of files in this way? We don’t want to have to rename each file individually. Instead, we can use a bash script. A script is a file that contains commands. These commands are run sequentially.

Open the file rename.sh in BBEdit. “.sh” is the extension used on bash scripts.

You’ll notice that there are comments at the top of the script; they start with a pound sign (#). Again, anything written on the same line as a pound sign (after the pound sign) is not read by the computer. You’ll notice that the comments list the script’s purpose, creator, and date of creation.

The only command in this script is one that, for all .txt files in a directory, will change the extension from .txt to .fas.

Every “for” loop in Bash needs a “do” before the commands to repeat and a “done” at the end. The semicolons (;) are also important: they separate the parts of the loop when it is written on a single line. What this code is saying is “for all .txt files, change the .txt extension to a .fas extension.” The syntax is a bit strange, but basically a variable is introduced by name (here, file) and then referenced using a dollar sign ($). The “${file%.txt}.fas” part takes the file name, removes “.txt” from the end, and adds “.fas” in its place.
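
The contents of rename.sh are not reproduced in this handout, so the following is only a sketch of a loop with the shape described above, not necessarily the exact command in the script. It runs in a temporary directory with made-up file names (seq1.txt, seq2.txt) so that no real files are renamed.

```shell
# Work in a scratch directory with made-up example files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch seq1.txt seq2.txt

# ${file%.txt} is the value of $file with ".txt" removed from the end
file="seq1.txt"
new_name="${file%.txt}.fas"
echo "$new_name"

# The loop on one line: semicolons separate the "for", "do", and "done" parts
for file in *.txt ; do mv "$file" "${file%.txt}.fas" ; done

# Both files now end in .fas
ls
```

The double quotes around "$file" keep the command working even if a file name contains spaces.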

An important part of the command is the *. * is a wildcard character, so in this case, *.txt is any file that ends in “.txt”.
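
You can see what a wildcard matches by passing it to echo, since the shell fills in the matching file names before the command runs. This sketch uses made-up files in a temporary directory and assumes default Bash settings.

```shell
# Work in a scratch directory with made-up example files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.txt b.txt c.fas

# The shell replaces *.txt with the matching file names before echo runs
txt_files=$(echo *.txt)
echo "$txt_files"

# A * on its own matches every file that is not hidden
all_files=$(echo *)
echo "$all_files"

# With default settings, a pattern that matches nothing is left unchanged
no_match=$(echo *.csv)
echo "$no_match"
```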

Stop when you get to this point.

Exercise: Using the wildcard character

Enter the directory sequences using the cd command. Use the ls command to look at the contents of this directory. We are going to use rename.sh to change the extensions of all of these files, which are all already in fasta format, but have a .txt extension.

From this directory, type in the command:
bash ../rename.sh 

bash tells the server to run a Bash script, and it is followed by the script’s name. Since this script is in the directory above the current one, we use “../” before its name. If the script were in the same directory as the files that we wanted to change, we could just use the script name.

Note that this script will only change the extensions of all .txt files in the directory. If a file has an extension other than .txt, it will not be changed, but be aware of any .txt files you don’t want to make into .fas files when you run the script. (You could also change the script slightly to avoid changing them.)
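
One possible way to change the script so that it leaves a particular file alone is to skip that file inside the loop. This is a sketch, assuming a made-up file called README.txt that should keep its .txt extension. It runs in a temporary directory so no real files are touched.

```shell
# Work in a scratch directory with made-up example files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch seq1.txt seq2.txt README.txt

for file in *.txt ; do
    # Leave README.txt alone (a hypothetical file we want to keep as .txt)
    if [ "$file" = "README.txt" ] ; then
        continue
    fi
    mv "$file" "${file%.txt}.fas"
done

# README.txt is unchanged; the other two files now end in .fas
ls
```

The continue command jumps straight to the next file in the loop, so the mv command is never run for README.txt.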

Let’s look at an individual .fas file in this directory. Use more or less to inspect a file’s contents. You’ll notice that there are multiple sequences in each fasta file. Let’s say I want to align the sequences in each file.

I want to use MAFFT for alignment (MAFFT is an alignment software). Specifically, I want to use MAFFT’s E-INS-i method, which is run with the command einsi. Let’s run it on a single file, the one whose extension we changed at the beginning of Part 2.

cd back to ~/TodosSantos/scripting_exercises/. Use ls to make sure practice_compiledSeqs.fas is in this directory. Now let’s run MAFFT on this file:
einsi practice_compiledSeqs.fas > aligned_practice_compiledSeqs.fas 

We want to give the output file (to the right of the “>”) a different name than the input file (to the left of the “>”). If we used the same name for both, the shell would empty the file before MAFFT had a chance to read it, and we would lose the original sequences.
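
This behavior of “>” is not specific to MAFFT. The sketch below shows it with the standard sort command standing in for einsi, using small made-up files in a temporary directory.

```shell
# Work in a scratch directory with a small made-up input file
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'b\na\n' > input.txt
cp input.txt copy.txt

# Safe: the output goes to a different file, so input.txt is untouched
sort input.txt > sorted.txt

# Unsafe: the shell empties copy.txt before sort gets to read it
sort copy.txt > copy.txt

# copy.txt now has a size of 0, while input.txt and sorted.txt have content
ls -l
```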

Once the command finishes running, use ls to confirm that there is now a file called aligned_practice_compiledSeqs.fas.

Now open a new file in BBEdit and copy and paste the content of rename.sh into it. Save this file as align.sh. Change the informational comments to include the function of this script, as well as your name and the date.

Now get rid of the original command (that changes the file extension) and replace it with this command:
for file in *.fas ; do einsi "$file" > "aligned_$file" ; done

This command will align the sequences within each .fas file using MAFFT einsi. Save the file in the same directory as rename.sh and navigate back into the directory sequences.
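
One thing to be aware of: if align.sh is run a second time in the same directory, the aligned_ files from the first run also match *.fas and would be aligned again. The sketch below shows one possible guard against that. So that it can run anywhere, it uses fake_align, a made-up stand-in function that simply copies its input. On the server you would call einsi in its place. The file names are also made up.

```shell
# Stand-in for einsi so the example runs anywhere: it just copies its input.
# On the server you would call einsi here instead.
fake_align() { cat "$1" ; }

# Work in a scratch directory with two tiny made-up FASTA files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '>s1\nACGT\n' > geneA.fas
printf '>s1\nTTGA\n' > geneB.fas

run_alignments() {
    for file in *.fas ; do
        # Skip outputs from an earlier run so they are not aligned again
        case "$file" in
            aligned_*) continue ;;
        esac
        fake_align "$file" > "aligned_$file"
    done
}

run_alignments
run_alignments   # a second run does not create aligned_aligned_* files

ls
```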

Now run the command:
bash ../align.sh 

After the script finishes running (when it stops printing text to the screen), use ls to look at the directory’s contents. You’ll notice that all .fas files now have an aligned version.

We could have done both of the above tasks (renaming and alignment) in a single bash script. See both_tasks.sh.
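
The contents of both_tasks.sh are not shown in this handout, so the following is only a guess at how the two loops might be placed one after the other in a single script. As before, fake_align is a made-up stand-in for einsi, and the example files are created in a temporary directory.

```shell
# Stand-in for einsi so the example runs anywhere: it just copies its input.
# On the server you would call einsi here instead.
fake_align() { cat "$1" ; }

# Work in a scratch directory with two tiny made-up sequence files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '>s1\nACGT\n' > geneA.txt
printf '>s1\nTTGA\n' > geneB.txt

# Task 1: change the extension of every .txt file to .fas
for file in *.txt ; do mv "$file" "${file%.txt}.fas" ; done

# Task 2: align every .fas file and write the result to a new file
for file in *.fas ; do fake_align "$file" > "aligned_$file" ; done

ls
```

Because the commands in a script run in order, the second loop only starts after every file has been renamed.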