Chapter 2 The toolbox

2.1 Embarking on Data Science

The picture below represents data science in 1918: it probably took a clerk a day to generate the figure.

But disregarding the time investment, this is data science. You collect data (in this case, influenza mortality), look for patterns and try to find underlying mechanisms that may explain the patterns (age, gender, marital status).

(source)

2.2 Why do statistical programming?

Since you’re a life science student (that is my target audience, at least), you have probably worked with Excel or SPSS at some time. Have you ever wondered

  • Why am I doing this exact same series of mouse clicks again and again? Is there not a more efficient way?
  • How can I describe my work reproducibly when it is a series of mouse clicks?

If so, then R may be your next favorite data analysis tool. It takes a little effort at first, but once you get the hang of it you will never create a plot in Excel again.

With R - as with any programming language,

  • Redoing an analysis or generating a report with minor adjustments is a breeze
  • The analysis is central, not the output. Because every step is stored as code, anyone can rerun it and reproduce the result

Overview of the toolbox

This chapter will introduce you to a toolbox that will serve you well during your data quests.
It consists of

  • The R programming language and built-in functionality
  • The RStudio Integrated Development Environment (IDE)
  • R Markdown as documenting and reporting tool

2.3 Tool 1: The R programming language

Nobody likes to pay for computer tools. R is completely free of charge. Moreover, it is completely open source. This is of course one of the main reasons for its popularity; other statistical tools are not free and sometimes downright expensive. Besides this free nature, R is very popular because it has an interactive mode. We call this a read–evaluate–print loop: REPL. This means you don’t need to write programs to run code. You simply type a command in the console, press enter and immediately get the result on the line below.
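
As a minimal illustration of the interactive mode, the lines below can be typed at the console one at a time. The variable name `ages` and its values are made up for this sketch.

```r
# Type each line at the console prompt and press enter;
# R evaluates it and prints the result immediately.
2 + 3                        # simple arithmetic

ages <- c(23, 35, 41, 58)    # made-up example values; an assignment prints nothing
mean(ages)                   # the mean of the four values
ages > 40                    # a logical vector, one value per element
```
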
As stated above, because you store your analyses in code, repeating these analyses (possibly with new data or changed settings) is very easy. One of my personal favorite features is that R supports “literate programming” for creating presentations (such as this one!) and other publications (reports, papers etc.). PDF documents, Microsoft Word documents, web pages (html) and e-books are all possible outputs of a single R Markdown master document.

Finally, R has advanced embedded graphical support. This means that graphical output (a plot) is as easy to generate as textual output!
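
To make that comparison concrete, here is a small sketch that produces a textual and a graphical summary of the same data using only base R. The simulated `heights` vector is invented for the example and does not come from the course data.

```r
set.seed(42)                                # make the random numbers repeatable
heights <- rnorm(200, mean = 175, sd = 8)   # simulated data, not real measurements

summary(heights)                            # textual output: min, quartiles, mean, max

hist(heights,                               # graphical output: a histogram
     main = "Simulated heights",
     xlab = "Height (cm)")
```

Both summaries take a single function call, which is the point the paragraph above makes.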

Here are some figures to whet your appetite. You will be able to create all of these yourself at the end of this course (actually, a pair of courses).


Figure 2.1: A facetplot - multiple similar plots split over a single nominal or ordinal variable


Figure 2.2: A polar plot - the dimensions are not your normal 2d x and y

A custom jitter visualization

2.4 Tool 2: RStudio as development environment


RStudio is a so-called Integrated Development Environment. This means it is a “Swiss Multitool” for programming. With it, you manage and run code, organize files, consult documentation on the language (help pages) and build different output formats. The workbench has several panels and looks like this when you run the application.

You primarily work with 4 panels of the workbench:

  1. Code editor where you write your scripts and R Markdown documents: text files with code you want to execute more than once
  2. R console where you execute lines of code one by one
  3. Environment and History: see what data you have in memory, and what you have done so far
  4. Plots, Help & Files

You use the console to do basic calculations, try pieces of code, develop a function, or load scripts (from the code editor) into memory. On the other hand, the code editor is used to work on code that has a life span longer than a few minutes: analyses you may want to repeat, or develop further in the form of scripts and R Markdown documents. The code editor supports many file types for viewing and editing: regular text, structured data files (text, csv), scripts (programs), and analytical notebooks (R Markdown).
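
The division of labor between the two panels can be sketched as follows. The script is written to a temporary file here only to keep the example self-contained; normally you would type it in the code editor and save it. The function name `celsius_to_fahrenheit` is hypothetical.

```r
# Contents you would normally type in the code editor and save as a script
script_lines <- c(
  "celsius_to_fahrenheit <- function(celsius) {",
  "  celsius * 9 / 5 + 32",
  "}"
)
script_path <- tempfile(fileext = ".R")
writeLines(script_lines, script_path)

# In the console: load the script into memory, then try out the function
source(script_path)
celsius_to_fahrenheit(c(0, 37, 100))
```
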

What is nice about the code editor compared to regular text editors such as Notepad, WordPad or TextEdit is that it knows about different file types and their constituent elements. It helps you read, write (auto-complete, error alerts), scan and organize them by displaying these elements using coloring, font types and other visual aids.

Here is the same piece of code, which is a plain text file, in two different editors. First as plain text in the Mac TextEdit app and next in the RStudio code editor:

code in TextEdit

exact same file in RStudio editor

It is clearly visible where the code elements, numeric data and character data are within the code.

2.5 Tool 3: R Markdown

Using R Markdown you can combine regular text and figures with embedded R code that will be executed to generate a final document. We call this literate programming.
You can use it to create reports in Word, PDF or web (html) format, presentations (PDF or web) and even eBooks and websites. This entire eBook itself is written in R Markdown!

Markdown is, just like the language for the web, html, a markup language. Markup means that you use textual elements to indicate structure instead of content. The R extension to Markdown, R Markdown, simply is Markdown with embedded pieces of R code. Consider this piece of Markdown:

## Tool 3: R Markdown

![](figures/markdown_logo.jpg)

Using RMarkdown you can combine regular text and figures with embedded R code that will be executed to generate a final document. 

The result of this snippet, after it is converted into html, is the top of the section you are currently reading.

Here is a piece of R code we call a code chunk that plots some random data in a scatter plot. In RStudio this piece of R code within (the current) R Markdown document looks like this:

Every code chunk consists of two parts; its header and body. The header tells the conversion engine (knitr) how to deal with the code within the chunk, and its output. In this case, this header is

{r simple-scatter-demo-1, fig.asp=0.6, out.width='80%', fig.align='center', fig.cap='A simple scatter plot'}

This header specifies quite a few things. First, the programming language (r) and the label, or “name”, of the chunk (simple-scatter-demo-1). Next, several aspects of the generated plot are specified: its aspect ratio, relative width, alignment on the page and the figure caption. Only the programming language is required here.

Next, when you knit (translate) the document into web format it results in the piece below, together with its output, a scatter plot.

x <- 1:100
y <- rnorm(100) + 1:100*rnorm(100, 0.2, 0.1)
plot(x, y)

Figure 2.3: A simple scatter plot
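
Knitting can also be started from code instead of the Knit button. The sketch below assembles a tiny R Markdown document with one chunk and renders it to html, assuming the `rmarkdown` package and Pandoc are installed (the rendering step is skipped otherwise). The chunk label `scatter-demo`, the document title and the temporary file are invented for the example.

```r
# Build the lines of a tiny R Markdown document.
fence <- strrep("`", 3)   # three backticks open and close a code chunk
rmd_lines <- c(
  "---",
  "title: \"Chunk demo\"",
  "output: html_document",
  "---",
  "",
  "Regular text, followed by a code chunk.",
  "",
  paste0(fence, "{r scatter-demo, fig.asp=0.6, fig.align='center'}"),
  "x <- 1:100",
  "y <- x + rnorm(100, sd = 10)",
  "plot(x, y)",
  fence
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Knit only when the required tools are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(html_path)   # location of the generated html file
}
```

In day-to-day work the Knit button in RStudio does the same job; calling `rmarkdown::render()` yourself is mainly useful when you want to regenerate reports from a script.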

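When you knit a document, knitr first runs the code chunks and turns the R Markdown into plain Markdown, which Pandoc then converts into the requested output format. Markdown allows embedded html, the markup language of the web, so you can also embed html code elements within it (these are meant for html-based output), but that is outside the scope of this course.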
Here are the most basic elements you can use in Markdown documents.

RMarkdown

Finally, it is also possible to embed LaTeX elements. For instance, equations can be defined in a text format. This:

$$d(p, q) = \sqrt{\sum_{i = 1}^{n}(q_i-p_i)^2}$$

results in this:

\[d(p, q) = \sqrt{\sum_{i = 1}^{n}(q_i-p_i)^2}\]
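
The equation is the Euclidean distance between two points p and q. As a small worked example, here is how it could be computed in R; the function name `euclidean_distance` and the two points are made up for this sketch.

```r
# Euclidean distance between two points p and q of equal length
euclidean_distance <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((q - p)^2))
}

p <- c(0, 0)
q <- c(3, 4)
euclidean_distance(p, q)   # 5, the hypotenuse of a 3-4-5 triangle

# Base R's dist() uses the Euclidean distance by default
as.numeric(dist(rbind(p, q)))
```
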

RStudio provides several cheatsheets for R, including one for R Markdown. Have a look at Help → Cheatsheets.

A final note. To be able to convert R Markdown into Word format you need to have MS Word installed on that machine. If you want to be able to generate PDF documents, you will need a bit more: see the screencast Setting up on a Windows system. The screencast is a bit outdated, so install more recent versions than the ones it shows.

2.6 Resources

Swirl course

Swirl course to accompany this course: http://github.com/MichielNoback/R_Data_Analysis

Data Files

In this section all data files used or required for the course presentations or exercises are listed.

Web resources and references

Screencasts

  • Setting up on a Windows system YouTube
  • Starting with R studio YouTube