Chapter 2 The toolbox
2.1 Embarking on Data Science
The picture below represents data science in 1918: it probably took a clerk a day to generate the figure.
Disregarding the time investment, this is data science. You collect data (in this case, influenza mortality), look for patterns and try to find underlying mechanisms that may explain the patterns (age, gender, marital status).
(source)
2.2 Why do statistical programming?
Since you’re a life science student (that is my target audience at least), you have probably worked with Excel or SPSS at some time. Have you ever wondered:
- Why am I doing this exact same series of mouse clicks again and again? Is there not a more efficient way?
- How can I describe my work reproducibly when it consists of a series of mouse clicks?
If so, then R may be your next favorite data analysis tool. It takes a little effort at first, but once you get the hang of it you will never create a plot in Excel again.
With R, as with any programming language:
- Redoing an analysis or generating a report with minor adjustments is a breeze.
- The analysis is central, not the output. Because every step is recorded as code, the work is completely reproducible.
Overview of the toolbox
This chapter will introduce you to a toolbox that will serve you well during your data quests.
It consists of
- The R programming language and built-in functionality
- The RStudio Integrated Development Environment (IDE)
- R Markdown as documenting and reporting tool
2.3 Tool 1: The R programming language
Nobody likes to pay for computer tools. R is completely free of charge. Moreover, it is completely open source. This is of course one of the main reasons for its popularity; other statistical tools are not free and sometimes downright expensive.
Besides this free nature, R is very popular because it has an interactive mode. We call this a read–evaluate–print loop: REPL. This means you don’t need to write programs to run code. You simply type a command in the console, press enter and immediately get the result on the line below.
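To give an impression of this interactive mode, here is a minimal sketch of a few commands you could type in the console. The variable name `x` and the numbers are made up for illustration only.

```r
# Each line is evaluated as soon as you press enter in the console
1 + 1              # prints [1] 2
x <- c(4, 8, 15)   # an assignment prints nothing
mean(x)            # prints [1] 9
```

No program file is needed here: the console reads each line, evaluates it and prints the result.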
As stated above, because you store your analyses in code, repeating these analyses (possibly with new data or changed settings) is very easy.
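As a small sketch of what this looks like in practice: once an analysis is stored as code, you can apply it to new data without repeating any manual steps. The function name `summarise_values` and both data sets below are hypothetical and only serve as illustration.

```r
# Hypothetical helper: the "analysis" is stored once, as a function
summarise_values <- function(values) {
  c(n = length(values), mean = mean(values), sd = sd(values))
}

# Two made-up data sets
first_batch  <- c(5.1, 4.9, 5.6, 5.0)
second_batch <- c(6.2, 5.8, 6.0, 6.4, 5.9)

# Repeating the analysis with new data is a single call
summarise_values(first_batch)
summarise_values(second_batch)
```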
One of my personal favorite features is that R supports “literate programming” for creating presentations (such as this one!) and other publications (reports, papers etc.). PDF documents, Microsoft Word documents, web pages (html) and e-books are all possible outputs of a single R Markdown master document.
Finally, R has advanced embedded graphical support. This means that graphical output (a plot) is as easy to generate as textual output!
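A minimal sketch of this, using the `cars` data set that ships with R (the choice of data set is mine, purely for illustration):

```r
# Textual output: the first rows of the built-in cars data set
head(cars)

# Graphical output: a scatter plot of the same data, also a single call
plot(cars)
```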
Here are some figures to whet your appetite. You will be able to create all of these yourself at the end of this course (actually, a pair of courses).
2.4 Tool 2: RStudio as development environment
RStudio is a so-called Integrated Development Environment (IDE). This means it is a “Swiss Multitool” for programming. With it, you manage and run code, organize files, consult documentation on the language (help pages) and build different output formats. The workbench has several panels and looks like this when you run the application.
You primarily work with 4 panels of the workbench:
- Code editor where you write your scripts and R Markdown documents: text files with code you want to execute more than once
- R console where you execute lines of code one by one
- Environment and History: see what data you have in memory, and what you have done so far
- Plots, Help & Files
You use the console to do basic calculations, try pieces of code, develop a function, or load scripts (from the code editor) into memory. The code editor, on the other hand, is used to work on code that has a life span longer than a few minutes: analyses you may want to repeat or develop further in the form of scripts and R Markdown documents. The code editor supports many file types for viewing and editing: regular text, structured data files (text, csv, data files), scripts (programs), and analytical notebooks (R Markdown).
What is nice about the code editor compared to regular text editors such as Notepad, Wordpad or TextEdit is that it knows about different file types and their constituent elements. It helps you read, write (auto-complete, error alerts), scan and organize them by displaying these elements using coloring, font types and other visual aids.
Here is the same piece of code, which is a plain text file, in two different editors. First as plain text in the Mac TextEdit app and next in the RStudio code editor:
It is clearly visible where the code elements, numeric data and character data are within the code.
2.5 Tool 3: R Markdown
Using R Markdown you can combine regular text and figures with embedded R code that will be executed to generate a final document. We call this literate programming.
You can use it to create reports in Word, PDF or web (html) format, presentations (PDF or web) and even eBooks and websites. This entire eBook itself is written in R Markdown!
Markdown is, just like html (the language of the web), a markup language. Markup means that you use textual elements to indicate structure instead of content. The R extension to Markdown, R Markdown, simply is Markdown with embedded pieces of R code. Consider this piece of Markdown:
## Tool 3: R Markdown
![](figures/markdown_logo.jpg)
Using RMarkdown you can combine regular text and figures with embedded R code that will be executed to generate a final document.
The result of this snippet, after it is converted into html, is the top of the current paragraph you are reading.
Here is a piece of R code we call a code chunk that plots some random data in a scatter plot. In RStudio this piece of R code within (the current) R Markdown document looks like this:
Every code chunk consists of two parts; its header and body. The header tells the conversion engine (knitr) how to deal with the code within the chunk, and its output. In this case, this header is
{r simple-scatter-demo-1, fig.asp=0.6, out.width='80%', fig.align='center', fig.cap='A simple scatter plot'}
This header specifies quite a few things. First, the programming language (r) and the label, or “name”, of the chunk (simple-scatter-demo-1). Next, several aspects of the generated plot are specified: its aspect ratio, relative width, alignment on the page and the figure caption. Only the programming language is required here.
Next, when you knit (translate) the document into web format it results in the piece below, together with its output, a scatter plot.
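For reference, the body of such a chunk could look like the sketch below. This is not necessarily the exact code of the original chunk: the seed, the number of points and the variable names `x` and `y` are assumptions made for illustration.

```r
# Assumed chunk body: plot some random data in a scatter plot
set.seed(42)                  # fix the random numbers so the plot is repeatable
x <- rnorm(50)                # 50 random values from a normal distribution
y <- x + rnorm(50, sd = 0.5)  # y follows x, with some added noise
plot(x, y, main = "A simple scatter plot")
```

In the R Markdown document this body sits below the chunk header, and knitr replaces the chunk with the code and the resulting plot.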
R Markdown is translated into html, the markup language of the web, before any further processing occurs. That is why you can also embed html code elements within it but that is outside the scope of this course.
Here are the most basic elements you can use in Markdown documents.
Finally, it is also possible to embed LaTeX elements. For instance, equations can be defined in a text format. This:
$$d(p, q) = \sqrt{\sum_{i = 1}^{n}(q_i-p_i)^2}$$
results in this:
\[d(p, q) = \sqrt{\sum_{i = 1}^{n}(q_i-p_i)^2}\]
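The equation itself is the Euclidean distance between two points p and q. As a small sketch, this is how you could compute it in R; the function name `euclidean_distance` is my own choice and not part of R.

```r
# Euclidean distance between two numeric vectors of equal length
euclidean_distance <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((q - p)^2))
}

euclidean_distance(c(0, 0), c(3, 4))  # prints [1] 5
```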
RStudio provides several cheatsheets for R, including one for R Markdown. Have a look at Help → Cheatsheets.
A final note: to be able to convert R Markdown into Word format you need to have MS Word installed on that machine. If you want to be able to generate PDF documents, you will need a bit more: see the screencast Setting up on a Windows system. It is a bit outdated, so use more recent version numbers than the ones shown.
2.6 Resources
Swirl course
Swirl course to accompany this course: http://github.com/MichielNoback/R_Data_Analysis
Data Files
In this section all data files used or required for the course presentations or exercises are listed.
- Whale selenium data: whale_selenium.txt
- Bird observation data: Observations-Data-2014.xlsx or, as csv
- Food constituents: food_constituents.txt
- Wine review data: winemag-data-130k-v2.csv
Web resources and references
R Markdown
R Markdown is a Markdown “dialect” used for presenting, documenting and reporting in R: http://rmarkdown.rstudio.com
R cheat sheet
The R cheat sheet.
R Markdown reference
The R Markdown reference cards with extensive documentation. Also available at the computer exam!
Bioconductor
Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data: http://www.bioconductor.org