The R statistical computing environment is powerful, flexible, and widely used software for analyzing and visualizing data. It is fast becoming the standard for performing statistical tests. As the research community increasingly expects documentation, transparency, and reproducibility in data analysis, R Markdown is a valuable tool that allows users to merge documentation, code, statistical output, and graphics into a single document.
The goal of this exercise is to provide a brief introduction to (or reminder of) the basic statistical functionality of R and then to generate an R Markdown document that records both the steps and the output of a statistical analysis.
Launch the RStudio program to initiate an interactive R session. An R session can also be launched from a terminal or the standard R console, but we will use RStudio for this exercise because it incorporates R Markdown functionality.
Interactive R sessions allow you to enter individual R commands. R then runs each command and reports the relevant output. For example, R can perform simple arithmetic. Type a simple math problem at the command prompt in the window labeled “Console” and press enter/return. You should see the correct answer reported back to you in the console window.
2+2
## [1] 4
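As a small extension of the arithmetic example, the sketch below stores a few numbers in a variable and applies two of R's built-in summary functions. The variable name `salaries` and the three values are chosen only for illustration.

```r
# Store three illustrative values in a vector (the name is hypothetical).
salaries <- c(65000, 72000, 73000)

# Built-in summary functions operate on the whole vector at once.
mean(salaries)   # arithmetic mean: 70000
sd(salaries)     # sample standard deviation

# Arithmetic is applied element by element.
salaries / 1000
```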
Now let’s import some data to perform slightly more complicated operations in R. The file salaries.txt (a tab-delimited text file) contains a simple data set for this exercise and can be downloaded here. It summarizes salary data for 12 employees. Import it into R with the following command, which stores the data in a variable called “salary_data”. Note that you may need to provide the path to the file, depending on where it is located on your computer.
salary_data = read.delim("salaries.txt")
Now display the imported data by typing the name of our variable, “salary_data”.
salary_data
##      Name ResearcherType Salary
## 1   Alice  CompBiologist 105000
## 2     Bob      Biologist  65000
## 3  Carlos  CompBiologist  99000
## 4   Diana      Biologist  72000
## 5     Eva  CompBiologist 109000
## 6   Frank      Biologist  73000
## 7  George  CompBiologist 112000
## 8  Helena      Biologist  68000
## 9  Ingrid  CompBiologist  97000
## 10   Juan      Biologist  61000
## 11  Katie  CompBiologist 103000
## 12  Laura      Biologist  70000
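If the import fails or the columns look wrong, it can help to test `read.delim()` on a file whose location you control. The sketch below writes the first three rows shown above to a temporary file and reads them back. The file name `salaries_example.txt` and the variable names are hypothetical and used only for this check.

```r
# Write a small tab-delimited file to a temporary location.
# (Hypothetical path; for the real data, point to your copy of salaries.txt.)
example_path <- file.path(tempdir(), "salaries_example.txt")
writeLines(c(
  "Name\tResearcherType\tSalary",
  "Alice\tCompBiologist\t105000",
  "Bob\tBiologist\t65000",
  "Carlos\tCompBiologist\t99000"
), example_path)

# read.delim() expects tab-separated columns and a header row by default.
example_data <- read.delim(example_path)

str(example_data)    # column names and types
nrow(example_data)   # number of rows read
```

The same pattern applies to the full file: pass the complete path to `read.delim()` if salaries.txt is not in R's current working directory, which `getwd()` reports.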
R has built-in functions and additional packages to perform almost any imaginable statistical test. Let’s try running a simple t-test to see if there is a significant difference in salary between biologists and computational biologists. Enter the following command in the console to perform this test.
t.test(Salary ~ ResearcherType, salary_data)
##
## Welch Two Sample t-test
##
## data: Salary by ResearcherType
## t = -12.052, df = 9.4908, p-value = 4.573e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -42704.19 -29295.81
## sample estimates:
##     mean in group Biologist mean in group CompBiologist
##                    68166.67                   104166.67
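If you are familiar with interpreting results from a t-test, you can see that the difference of about 36,000 in mean salary between the two groups is highly significant (p-value = 4.573e-07). It definitely pays to become a computational biologist!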
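The printed report is convenient to read, but individual values can also be pulled out of the test result for later use. The sketch below rebuilds the salary table inline so that it runs without salaries.txt (the `Name` column is omitted because the test does not use it) and stores the test in a variable called `result`, a name chosen here for illustration.

```r
# Recreate the salary data from the table above; the rows alternate
# between the two researcher types.
salary_data <- data.frame(
  ResearcherType = rep(c("CompBiologist", "Biologist"), times = 6),
  Salary = c(105000, 65000, 99000, 72000, 109000, 73000,
             112000, 68000, 97000, 61000, 103000, 70000)
)

# Store the test instead of only printing it.
result <- t.test(Salary ~ ResearcherType, data = salary_data)

result$estimate   # the two group means
result$conf.int   # 95 percent confidence interval for the difference
result$p.value    # p-value of the Welch two-sample t-test

# Difference in means (CompBiologist minus Biologist).
group_means <- result$estimate
mean_difference <- unname(group_means[2] - group_means[1])
mean_difference
```

Storing the result this way makes it possible to quote the exact numbers in your notes rather than copying them by hand from the console.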
But how could we go about generating a complete and reproducible record of the analysis we just performed? To do this, create a new R Markdown (.Rmd) file from the New File menu.
Add your own title to the document, such as “Todos Santos t-test”, and then click OK.
You will see that RStudio has created a new Untitled1 document that already contains a lot of example text, including the title you entered above. This is written in the R Markdown language, which is a modified version of Markdown. It has a relatively user-friendly and readable syntax that can be converted to HTML.
To convert the new document to HTML, choose the “Knit to HTML” option under the knit icon. It will ask you to save the untitled Rmd file. Save it in the same directory where you have the salaries.txt file.
You should see from the resulting HTML output that the example code in the new document generates a mix of text, R code, R output, and R plots. Now let’s generate our own code that summarizes the t-test analysis we did above.
Start by deleting all the text in your R Markdown document that is BELOW the header information that includes title, author, date, and output.
Now start entering some new text. One feature of R Markdown is that you can type any text you want, and it will appear in your output. This is a great way to take notes. So go ahead and enter some notes. Perhaps something like the following…
Today, I am performing a t-test on salary data in order to try out R Markdown.
The powerful thing about R Markdown is that you can also write in R commands (as well as code from other languages such as Bash and Python). It will then do three things for you: store the code, run the code, and report the output.
Let’s enter the code we already ran above in the following format.
```{r}
salary_data = read.delim("salaries.txt")
salary_data
t.test(Salary ~ ResearcherType, salary_data)
```
The set of three backticks indicates that we are entering code, and the {r} indicates that it will be R code. We then simply add our R commands and mark the end of the code with three more backticks.
Now save your Rmd file and “Knit to HTML” again. You should see that the resulting output reports your notes, the R code, and the output from running the R code… a perfect and reproducible research notebook!
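The knit step can also be run from the console instead of the button, using the `render()` function from the rmarkdown package. This is a minimal sketch: the file name `todos_santos_ttest.Rmd` is an assumption, so replace it with the name you gave your own file.

```r
# Hypothetical file name; replace with the name of your saved .Rmd file.
rmd_file <- "todos_santos_ttest.Rmd"

if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  # Runs the code chunks and writes the HTML file next to the .Rmd file.
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Install rmarkdown and check the file path before rendering.")
}
```

Rendering from a command is handy when the same report needs to be regenerated repeatedly, for example after the data file has been updated.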
R commands can also produce plots, and R Markdown will report the plot images that are produced as output from running R code. So let’s add some more code to our file to generate a plot. Add the following code at the bottom of the R Markdown file, which will plot salary for biologists vs. computational biologists.
```{r}
plot(Salary ~ ResearcherType, salary_data)
```
If you now save your Rmd file and re-run “Knit to HTML”, you should see that your output now contains the additional code and a nice plot of your data.
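Depending on your R version, `read.delim()` may import ResearcherType as plain text rather than as a factor (since R 4.0.0, text columns are kept as character by default). The sketch below converts the column explicitly and requests a box plot by name, so that the grouping does not depend on that default. The data are rebuilt inline so the example runs without salaries.txt, and the axis labels and title are illustrative choices.

```r
# Recreate the salary data from the table above (rows alternate by type).
salary_data <- data.frame(
  ResearcherType = rep(c("CompBiologist", "Biologist"), times = 6),
  Salary = c(105000, 65000, 99000, 72000, 109000, 73000,
             112000, 68000, 97000, 61000, 103000, 70000)
)

# Make the grouping variable an explicit factor.
salary_data$ResearcherType <- factor(salary_data$ResearcherType)

# Draw one box per researcher type, with labeled axes.
boxplot(Salary ~ ResearcherType, data = salary_data,
        xlab = "Researcher type", ylab = "Salary",
        main = "Salary by researcher type")
```

Placed inside an `{r}` chunk of the R Markdown file, this code appears in the knitted output together with its plot, just like the shorter `plot()` call.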
R Markdown offers lots of ways to format text, include images, and run code from R, Python, Bash, etc. Check out the R Markdown Cheat Sheet and try out some of the formatting options.