Background

The R statistical computing environment is powerful, flexible, and widely used software for analyzing and visualizing data. It is fast becoming the standard for performing statistical tests. As the research community increasingly expects documentation, transparency, and reproducibility in data analysis, R Markdown is a valuable tool that allows users to merge documentation, code, statistical output, and graphics into a single document.

Objectives

The goal of this exercise is to provide a brief introduction to (or reminder of) the basic data import and statistical functionality of R, and then to generate an R Markdown document that records both the steps and the output of a statistical analysis.

Software and Dependencies

Protocol

1. Launch an interactive R session

Launch the RStudio program to initiate an interactive R session. An R session can also be launched from a terminal or the standard R console, but we will use RStudio for this exercise because it incorporates R Markdown functionality.

Interactive R sessions allow you to enter individual R commands. R then runs each command and reports the relevant output. For example, R can perform simple arithmetic. Try entering a simple math problem at the command prompt in the window labeled “Console” and then press enter/return. You should see the correct answer reported back to you in the console window.

2+2
## [1] 4


2. View a data set

Typically when using R, you will be importing your own data sets from files. But R also comes with a few built-in datasets that are good for practicing. For example, one is stored in a variable called “cars”.

You can display the data simply by typing the name of our variable, “cars”.

cars
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 11    11   28
## 12    12   14
## 13    12   20
## 14    12   24
## 15    12   28
## 16    13   26
## 17    13   34
## 18    13   34
## 19    13   46
## 20    14   26
## 21    14   36
## 22    14   60
## 23    14   80
## 24    15   20
## 25    15   26
## 26    15   54
## 27    16   32
## 28    16   40
## 29    17   32
## 30    17   40
## 31    17   50
## 32    18   42
## 33    18   56
## 34    18   76
## 35    18   84
## 36    19   36
## 37    19   46
## 38    19   68
## 39    20   32
## 40    20   48
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

You can see that the data are organized similarly to a spreadsheet. You can view a subset of the data by row and/or column using the following syntax: cars[row(s), column(s)]. If you leave one of those fields blank, R shows all rows (or all columns). For example, you could view only rows 1 through 5 as follows.

cars[1:5, ]
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
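
The same bracket syntax accepts column names and logical conditions as well as row numbers. The following is a minimal sketch using the built-in cars data; the variable name fast_cars is only illustrative.

```r
# Rows 1 to 3, only the "dist" column (a single column comes back as a plain vector)
cars[1:3, "dist"]

# All columns, but only the rows where speed is greater than 23
fast_cars <- cars[cars$speed > 23, ]
fast_cars

# head() is a common shortcut for viewing the first n rows
head(cars, 5)
```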


3. Calculate mean and variance from a data set

R contains numerous built-in statistical functions. Let’s apply some simple ones and calculate mean and variance for car speed in this dataset. Notice that another way to refer to a specific column in the data set is by using $ followed by the column name.

mean(cars$speed)
## [1] 15.4
var(cars$speed)
## [1] 27.95918
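
To see what these functions compute, the mean and the sample variance can be reproduced by hand. This is a sketch for illustration only; the names x, n, manual_mean, and manual_var are not part of the exercise.

```r
x <- cars$speed
n <- length(x)

# Mean: the sum of the values divided by the number of values
manual_mean <- sum(x) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - manual_mean)^2) / (n - 1)

# Compare with the built-in functions; both comparisons should be TRUE
all.equal(manual_mean, mean(x))
all.equal(manual_var, var(x))

# The standard deviation is the square root of the variance
all.equal(sqrt(manual_var), sd(x))
```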


4. Import a data set

Now let’s import some of our own data. The file salaries.txt (tab-delimited text file) contains a simple data set for this exercise and can be downloaded here. It summarizes salary data for 12 employees.

There are multiple options for importing files with R commands such as read.table, read.csv, and read.delim. Let’s use read.table to import our data. To find out more about any function in R, you can simply type the name of the function preceded by a ?. Enter the following command and read through the documentation that appears.

?read.table

Note that some of the default settings differ from what we want in this case. The salaries.txt data set does have header names for the columns, and the columns are tab-delimited (which we will indicate with \t).

Import this dataset into R with the following command. We are storing the data in a variable called “salary_data”, which can be done with either the = sign or the <- characters. Note that you may need to provide the path to the file depending on where it is located on your computer.

salary_data = read.table("salaries.txt", header = TRUE, sep = "\t")

If you got an error when trying to import the file, it is likely because you do not have the salaries.txt file in your current working directory. There are two solutions to this problem. First, you can provide the full path to the file. Alternatively, you can change your working directory. To determine your current working directory, use the following command:

getwd()

To change your working directory, use the following command:

setwd("/THE/PATH/TO/YOUR/DESIRED/DIRECTORY")
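
One way to avoid the import error is to check whether the file is visible from the working directory before reading it. The following is a rough sketch of that idea, not a required step; the names data_file and found are illustrative.

```r
data_file <- "salaries.txt"  # file name used in this exercise

# Is the file present relative to the current working directory?
found <- file.exists(data_file)

if (found) {
  salary_data <- read.table(data_file, header = TRUE, sep = "\t")
} else {
  message("Could not find ", data_file, " in ", getwd())
  # List the .txt files that are present, to help spot a misnamed file
  print(list.files(pattern = "\\.txt$"))
}
```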

Now try repeating the import step and display the imported data by typing the name of our variable, “salary_data”.

salary_data = read.table("salaries.txt", header = TRUE, sep = "\t")
salary_data
##      Name ResearcherType Salary
## 1   Alice  CompBiologist 105000
## 2     Bob      Biologist  65000
## 3  Carlos  CompBiologist  99000
## 4   Diana      Biologist  72000
## 5     Eva  CompBiologist 109000
## 6   Frank      Biologist  73000
## 7  George  CompBiologist 112000
## 8  Helena      Biologist  68000
## 9  Ingrid  CompBiologist  97000
## 10   Juan      Biologist  61000
## 11  Katie  CompBiologist 103000
## 12  Laura      Biologist  70000
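
The import functions mentioned earlier are closely related: read.delim is essentially read.table with defaults suited to tab-delimited files that have a header row. The sketch below illustrates this with a small temporary file that stands in for salaries.txt (it contains only the first three rows shown above); tmp_file, via_table, and via_delim are illustrative names.

```r
# Write a small tab-delimited file to a temporary location
tmp_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Name\tResearcherType\tSalary",
  "Alice\tCompBiologist\t105000",
  "Bob\tBiologist\t65000",
  "Carlos\tCompBiologist\t99000"
), tmp_file)

# Explicit options with read.table versus the defaults of read.delim
via_table <- read.table(tmp_file, header = TRUE, sep = "\t")
via_delim <- read.delim(tmp_file)

# Check that the two imports agree
all.equal(via_table, via_delim)

# str() summarizes the column names and the type of each column
str(via_table)
```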


5. Run a t-test

R has built-in functions and additional packages to perform almost any imaginable statistical test. Let’s try running a simple t-test to see if there is a significant difference in salary between biologists and computational biologists. Enter the following command in the console to perform this test. The syntax here (Salary ~ ResearcherType) is a formula: it asks whether the dependent variable Salary differs significantly between the groups defined by the independent variable ResearcherType.

t.test(Salary ~ ResearcherType, salary_data)
## 
##  Welch Two Sample t-test
## 
## data:  Salary by ResearcherType
## t = -12.052, df = 9.4908, p-value = 4.573e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -42704.19 -29295.81
## sample estimates:
##     mean in group Biologist mean in group CompBiologist 
##                    68166.67                   104166.67

If you are familiar with interpreting results from a t-test, hopefully you can see that it definitely pays to become a computational biologist!!
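
To see where the reported t and df values come from, the Welch statistic can be reproduced by hand. This is a sketch for illustration: the two vectors retype the salaries from the table above, and names such as se_diff, t_manual, and df_manual are not part of the exercise.

```r
# Salaries from the table above, split by researcher type
biologist      <- c(65000, 72000, 73000, 68000, 61000, 70000)
comp_biologist <- c(105000, 99000, 109000, 112000, 97000, 103000)

# Variance of each group mean (sample variance divided by group size)
v1 <- var(biologist) / length(biologist)
v2 <- var(comp_biologist) / length(comp_biologist)

# Welch t statistic: difference in means over its standard error
se_diff  <- sqrt(v1 + v2)
t_manual <- (mean(biologist) - mean(comp_biologist)) / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(biologist) - 1) + v2^2 / (length(comp_biologist) - 1))

# Compare with the built-in test (Welch is the default for two samples)
welch <- t.test(biologist, comp_biologist)
c(t_manual = t_manual, t_builtin = unname(welch$statistic))
c(df_manual = df_manual, df_builtin = unname(welch$parameter))
```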


6. Generate an R Markdown document

But how could we go about generating a complete and reproducible record of the analysis we just performed? To do this, create a new R Markdown (.Rmd) file from the New File menu.



Add your own title to the document, such as “Todos Santos t-test”, and then click OK.



You will see that RStudio has created a new Untitled1 document that already has a lot of example text, including the title you entered above. This is written in the R Markdown language, which is a modified version of Markdown. It has a relatively user-friendly and readable syntax that can be converted to HTML.

To convert the new document to HTML, choose the “Knit to HTML” option under the knit icon. It will ask you to save the untitled Rmd file. Save it in the same directory where you have the salaries.txt file.



You should see from the resulting HTML output that the example code in the new document generates a mix of text, R code, R output, and R plots. Now let’s generate our own code that summarizes the t-test analysis we did above.


7. Add content to R Markdown document

Start by deleting all the text in your R Markdown document that is BELOW the header information that includes title, author, date, and output.

Now start entering some new text. One feature of R Markdown is that you can type any text you want, and it will appear in your output. This is a great way to take notes. So go ahead and enter some notes. Perhaps something like the following…

Today, I am performing a t-test on salary data in order to try out R Markdown.

The powerful thing about R Markdown is that you can also write in R commands (as well as code from other languages such as Bash and Python). It will then do three things for you: store the code, run the code, and report the output.

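Let’s enter the code we already ran above in the following format. Here read.delim stands in for read.table: its defaults (header = TRUE, sep = "\t") match the options we set explicitly earlier.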

```{r}
salary_data = read.delim("salaries.txt")
salary_data
t.test(Salary ~ ResearcherType, salary_data)
```

The first set of three backticks indicates that we are entering code, and the {r} indicates that it will be R code. We then simply add our R commands before indicating the end of the code with three more backticks.

Now save your Rmd file and “Knit to HTML” again. You should see that the resulting output reports your notes, the R code, and the output from running the R code… a perfect and reproducible research notebook!
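
Knitting can also be run from the console instead of the toolbar button. The following is a sketch, assuming the rmarkdown package and pandoc (bundled with RStudio) are available. It writes a tiny throwaway Rmd file to a temporary location so that it does not touch your own document; rmd_file and html_file are illustrative names.

```r
# Create a minimal R Markdown file in a temporary location
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Render test\"",
  "output: html_document",
  "---",
  "",
  "Some notes about the analysis.",
  "",
  "```{r}",
  "mean(cars$speed)",
  "```"
), rmd_file)

# Render to HTML only if the required tools are installed
html_file <- NA_character_
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
}

# Path of the HTML output, or NA if rendering was skipped
html_file
```

To render your own document this way, you would pass the path of your saved Rmd file to rmarkdown::render() in place of rmd_file.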


8. Add an R plot

R commands can also produce plots, and R Markdown will report the plot images that are produced as output from running R code. So let’s add some more code to our file to generate a plot. Add the following code at the bottom of the R Markdown file, which will plot salary for biologists vs. computational biologists.

```{r}
plot(Salary ~ ResearcherType, salary_data)
```

If you now save your Rmd file and re-run “Knit to HTML”, you should see that your output now contains the additional code and a nice plot of your data.
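
Depending on your R version, ResearcherType may be imported as text rather than as a factor (the default for stringsAsFactors changed in R 4.0.0). Converting the column explicitly makes the grouping unambiguous for plotting. The sketch below is one possible variant of the plot with labeled axes; it rebuilds the data inline from the table above so that it runs on its own, and the labels are only suggestions.

```r
# Salary data retyped from the table above so the example is self-contained
salary_data <- data.frame(
  ResearcherType = rep(c("CompBiologist", "Biologist"), times = 6),
  Salary = c(105000, 65000, 99000, 72000, 109000, 73000,
             112000, 68000, 97000, 61000, 103000, 70000)
)

# Treat the grouping column as a factor (categorical variable)
salary_data$ResearcherType <- factor(salary_data$ResearcherType)

# Box plot of salary by researcher type, with descriptive labels
boxplot(Salary ~ ResearcherType, data = salary_data,
        xlab = "Researcher type", ylab = "Salary",
        main = "Salary by researcher type")
```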


9. Test out the many formatting options in R Markdown

R Markdown offers lots of ways to format text, include images, and run code from R, Python, Bash, etc. Check out the R Markdown Cheat Sheet and try out some of the formatting options.

The output is HTML, so you can include images, tables, etc. Try building your own webpage!