GOALS

Goals

The objectives of these two lessons are to

  • demystify the concept of programming
  • show why programming can be better than Excel (or SPSS)
  • give you a friendly introduction to programming with R

WHY R?

The analytical toolbox

Many tools are used in the life sciences to perform data analysis and visualisation tasks.

The best-known one is Excel, but you may also have encountered SPSS.

Another much-used tool is R, a programming language with embedded statistics and graphics support.

What is R?

Besides being free, R is very popular because it

  • has an interactive mode (read–evaluate–print loop: REPL)
  • makes repeating analyses (with new data) very easy
  • supports “literate programming” for creating presentations (such as this one!) and reports

RSTUDIO

The workbench

Panels of the workbench

You work with 4 panels in the workbench:

  1. Code editor, where you write your scripts: text files with code you want to execute more than once
  2. R console, where you execute lines of code one by one
  3. Workspace and history, where you see what data you have in memory and what you have done so far
  4. Plots, Help & Files

The console vs code editor

  • Use the console to do basic calculations and to try out pieces of code.

  • Use the code editor to work on
    • scripts - analyses you may want to repeat or develop further.
    • data files
    • analytical notebooks (RMarkdown)

USING THE CONSOLE

Let’s calculate

R, like any programming language, supports all math operations in the way you would expect them:

+     is ‘plus’, as in 2 + 2 = 4

-     is ‘minus’, as in 2 - 2 = 0

*     multiply

/     divide

^     exponent (identical to: **)

Remember that \(\sqrt{n} = n^{0.5}\)

Use parentheses () for grouping parts of equations.

Precedence

All “operators” adhere to the standard mathematical precedence rules (PEMDAS):

    Parentheses (simplify inside these)
    Exponents
    Multiplication and Division (from left to right)
    Addition and Subtraction (from left to right)
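
The rules above can be checked in the console. The following is a minimal sketch; the variable names are made up for this illustration.

```r
# Multiplication is done before addition
mult_first <- 2 + 3 * 4        # 14
# Parentheses are simplified first
parens_first <- (2 + 3) * 4    # 20
# Exponents are done before multiplication
expo_first <- 2 * 3^2          # 18
# Operators of equal rank are evaluated from left to right
left_to_right <- 12 / 3 * 2    # 8

mult_first
parens_first
expo_first
left_to_right
```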

–Practice 1–

In the console, calculate the following:

\(31 + 11\)

\(66 - 24\)

\(\frac{126}{3}\)

\(12^2\)

\(\sqrt{256}\)

\(\frac{3*(4+\sqrt{8})}{5^3}\)

Solutions

\(31 + 11 = 42\)

\(66 - 24 = 42\)

\(\frac{126}{3} = 42\)

\(12^2 = 144\)

\(\sqrt{256} = 16\)

\(\frac{3 * (4 + \sqrt{8})}{5^3} = 0.1638823\) – in R: (3 * (4 + 8^0.5))/5^3

DATA TYPES

Four types of data

You have seen that R works with numbers. There are a few more types of data:

numeric:  numbers with a decimal part:
- 3.123 & 5000.0 & 4.1E3
integer:  numbers without a decimal part:
- 1 & 0 & 2999
logical:  also called boolean values:
- TRUE or FALSE (always in capitals)
character:  text, always between quotes:
- "hello R" or "GATC"
factor:  nominal and ordinal scales (visited later)

Note: when you type a number in the console it will always be numeric, even if it has no decimal part.
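
You can ask R for the type of a value with the function class(). The sketch below is an illustration only; the L suffix, which is not covered in these slides, is the way to ask for a true integer.

```r
class(3.123)    # "numeric"
class(42)       # "numeric": no decimal part, but still stored as numeric
class(42L)      # "integer": the L suffix asks for an integer
class(TRUE)     # "logical"
class("GATC")   # "character"
```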

FUNCTIONS

Definition

Simple mathematics is not the core business of R.

You want to do more complex things and this is where functions come in.

A function is a named piece of functionality that you can execute by typing its name, followed by a pair of parentheses. Within these parentheses, you can pass data for the function to work on. Functions may or may not return a value.

It has this general form: \[function\_name(argument, argument, ...)\]
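
As a small illustration of this general form, here is a sketch that uses the built-in function round(). The variable names are made up for the example.

```r
# One argument: the number to round
rounded_whole <- round(3.14159)
# A second, named argument: digits sets the number of decimal places
rounded_two <- round(3.14159, digits = 2)
# The result of one function can be passed to another function
rounded_root <- round(sqrt(10), digits = 1)

rounded_whole   # 3
rounded_two     # 3.14
rounded_root    # 3.2
```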

Example: Square root with sqrt()

You have already seen that the square root can be calculated as \(n^{0.5}\).

However, there is also a function for it: sqrt(). It returns the square root of the given number, e.g. sqrt(36)

sqrt(36)
[1] 6
36^0.5
[1] 6

Another example: paste()

The paste function can take any number of arguments and returns them, combined into a single text. You can also supply a separator using sep=<separator string>:

paste(1, 2, 3, sep="---")
[1] "1---2---3"

The quotes around the three dashes "---" indicate it is text data.

Getting help on a function

Type ?function_name in the console to get help on a function.
For instance, typing ?sqrt will give the help page of the square root function.

Scroll down in the help page to see example usages of the function.

–Practice 2–

  1. View the help page for paste. There are two variants of this function.
    • Which?
    • What is the difference between them?
    • Use both variants to generate exactly this message "welcome to R" from these arguments: "welcome", "to", "R"
  2. What does the abs function do?
    • What is returned by abs(-20) and what is abs(20)?
  3. What does the c function do?
    • What is the difference in its result when you combine 1, 3 and "a" as arguments, or 1, 2 and 3?

VARIABLES

What are variables?

  • Variables are used to name and reuse pieces of data
  • E.g., x <- 42 is used to create a variable x with a value of 42.
  • Variables are really variable - their value can change!
  • You assign a value to a variable using <- in R; typed in the console, x = 42 has the same effect as x <- 42
  • The next chunk creates a variable called dna and then uses it in paste():
dna <- 'GATC'
paste("The DNA letters are:", dna)
[1] "The DNA letters are: GATC"
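
To show that the value of a variable really can change, here is a minimal sketch. The names count and doubled are made up for this example.

```r
count <- 42            # create the variable
count                  # 42
count <- count + 1     # overwrite it, using its old value
count                  # 43
doubled <- count * 2   # use it to create another variable
doubled                # 86
```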

–Practice 3–

Create these three variables: x=20, y=10 and z=3.
Next, calculate the following with these variables:

  1. \(x+y\)
  2. \(x^z\)
  3. \(q = x \times y \times z\)
  4. \(\sqrt{q}\)
  5. \(\frac{q}{\pi}\) (pi is just pi in R)
  6. \(\log_{10}{(x \times y)}\)

VECTORS AND INDEXING

The function c()

The function c() generates a vector from the passed arguments.
All elements of a vector have the same data type: R converts the arguments to the most general type among them.

c(1, 2, 5)
[1] 1 2 5
c(1, "a", 2)
[1] "1" "a" "2"
c(1, TRUE, FALSE)
[1] 1 1 0
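
Wrapping the same calls in class() shows which type was chosen. This is only a sketch to make the conversion visible; length() is added here to count the elements.

```r
class(c(1, 2, 5))          # "numeric"
class(c(1, "a", 2))        # "character": the numbers become text
class(c(1, TRUE, FALSE))   # "numeric": TRUE becomes 1, FALSE becomes 0
class(c(TRUE, FALSE))      # "logical"
length(c(1, 2, 5))         # 3 elements
```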

Vectors and Indexing

  • In R, all data lives inside vectors.
  • A vector is a series of elements maintained as a single unit.
  • Individual elements can be accessed using their index - a number between square brackets: []
nucleotides <- c("A", "C", "G", "T")
nucleotides
[1] "A" "C" "G" "T"
nucleotides[3]    # fetch the third
[1] "G"
nucleotides[2]
[1] "C"

Note the use of # to put regular text within code: this is a code comment and is ignored by R.

More on indexing

  • You can use any order and combination of numbers to select elements from a vector.
nucleotides <- c("A", "C", "G", "T")
nucleotides[c(3,1,4)]
[1] "G" "A" "T"

or use a series of indices with a:b (a through b)

nucleotides[1:3] 
[1] "A" "C" "G"
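
The a:b notation simply creates a vector of whole numbers, which is then used as the index. A minimal sketch, with a made-up vector called bases:

```r
2:5                  # 2 3 4 5
4:1                  # 4 3 2 1: counting down works as well
bases <- c("A", "C", "G", "T")
bases[2:4]           # "C" "G" "T"
bases[4:1]           # "T" "G" "C" "A": the vector reversed
bases[c(1, 1, 4)]    # "A" "A" "T": an index may be used more than once
```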

–Practice 4–

Given the letters of the alphabet, available as the variable letters in R, make a selection that gives you this output:

  1. “b”
  2. “z”
  3. “c” “d” “e” “f”
  4. “d” “n” “a”
  5. “dna” (Look at the help for paste and use the collapse argument):

Indexing on conditions

One of the nice things about indexing is that you can use logical indexing to select elements you are interested in.

q <- c(2, 1, 4)
q[c(TRUE, FALSE, TRUE)] #select with a logical
[1] 2 4
q > 2 #which values in q are higher than 2?
[1] FALSE FALSE  TRUE
q[q > 2] #select those
[1] 4

q <- c(2, 1, 4)
q == 1
[1] FALSE  TRUE FALSE
q[q == 1]
[1] 1
q <= 3
[1]  TRUE  TRUE FALSE
q[q <= 3]
[1] 2 1
q == 1 | q == 4 # using logical "OR"
[1] FALSE  TRUE  TRUE
q[q == 1 | q == 4]
[1] 1 4

Overview logical operators

  • Compare
    • > (greater than)
    • >= (greater than or equal)
    • < (less than)
    • <= (less than or equal)
    • == (equal)
  • Combine
    • & (AND)
    • | (OR)
  • Negate
    • ! (NOT)
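
Here is a short sketch of the combine and negate operators at work, reusing the vector q from the previous slide:

```r
q <- c(2, 1, 4)
q >= 2 & q < 4       # TRUE FALSE FALSE: both conditions must hold
q[q >= 2 & q < 4]    # 2
!(q > 2)             # TRUE TRUE FALSE: NOT flips every value
q[!(q > 2)]          # 2 1
q[q < 2 | q > 3]     # 1 4: at least one condition must hold
```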

Here is a named vector to demonstrate logical indexing further:

grades <- c(3.4, 5.6, 8.3, 2.9, 6.8)
# Attach names to the vector for readable display
names(grades) <- c("Ian", "Mark", "Lara", "Rowan", "Iris")
grades
  Ian  Mark  Lara Rowan  Iris 
  3.4   5.6   8.3   2.9   6.8 

pass_test <- grades >= 5.5
pass_test
  Ian  Mark  Lara Rowan  Iris 
FALSE  TRUE  TRUE FALSE  TRUE 
grades[pass_test] 
Mark Lara Iris 
 5.6  8.3  6.8 

The highest grade, and the students with a grade from 5.5 up to (but not including) 7:

grades[grades == max(grades)]
Lara 
 8.3 
grades[grades >= 5.5 & grades < 7] 
Mark Iris 
 5.6  6.8 
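
Because TRUE counts as 1 and FALSE as 0, a logical vector can also be used for counting. The sketch below is an addition to the slides; it creates the grades vector in a single step by naming the elements inside c(), which is assumed here to give the same named vector as before.

```r
grades <- c(Ian = 3.4, Mark = 5.6, Lara = 8.3, Rowan = 2.9, Iris = 6.8)
pass_test <- grades >= 5.5
sum(pass_test)     # 3: the number of students who passed
mean(pass_test)    # 0.6: the fraction of students who passed
```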

–Practice 5 (intro)–

Given these vectors, representing a hypothetical controlled drug test experiment:

participant_ids <- c("P01", "P02", "P03", "P04", "P05", "P06")
placebo_given <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
patient_responses <- c(76, 44, 38, 92, 28, 81)
names(placebo_given) <- participant_ids
names(patient_responses) <- participant_ids
placebo_given
  P01   P02   P03   P04   P05   P06 
FALSE  TRUE  TRUE FALSE  TRUE FALSE 
patient_responses
P01 P02 P03 P04 P05 P06 
 76  44  38  92  28  81 

–Practice 5–

Copy the code from the previous slide and, using only logical selections, select

  1. those participant_ids for which a placebo was given
  2. those participant_ids for which NO placebo was given
  3. the responses for which a placebo was given
  4. the responses for which a placebo was given and calculate the mean of this group (using mean())
  5. the highest value (using max()) of the patients who were given a placebo
  6. (challenge) the patient responses with a response higher than the mean of all responses

REAL DATA

Protein analysis results

The data is the analysis result of a set of proteins encoded on the Staphylococcus aureus genome.

It has been run through the sequence analysis tools SignalP, LipoP and TMHMM.

How it looks in Excel

Export to simple text file

I modified the Excel sheet a bit and then exported the data as a plain text file. The data is now in a semicolon-delimited file called protein_processing_pred.csv.
When opened in a simple text editor (e.g. Notepad) it looks like this.

Loading into R using read.table()

Here is how you load data files in R
No mouse clicks!

protein_data <- read.table(file = "data/protein_processing_pred.csv",
                        header = TRUE, 
                        sep = ";", 
                        dec = ",", 
                        as.is = c(1, 2, 3, 21)) 

Don’t worry - you do not need to do this for the test. The next slide explains what happens.


  • read.table is the function you use to load data from a file. It accepts many optional arguments and only one mandatory argument: the file name.
  • the key = value pairs between the parentheses such as sep = ";" and header=TRUE are the function arguments that specify where the file is and how it should be loaded.
  • the data are assigned to a variable called protein_data.
  • type ?read.table if you are interested in the details.
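
To see what sep and dec do without needing the protein file, here is a self-contained sketch. It first writes a tiny file to a temporary location and then reads it back. The file content, column names and variable names are all made up for this illustration.

```r
# Write a tiny semicolon-delimited file with decimal commas (made-up values)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("protein;score;size",
             "protA;0,111;453",
             "protB;0,168;377"),
           tmp_file)

# Read it back with the same kind of arguments as used for the protein data
small_data <- read.table(file = tmp_file,
                         header = TRUE,  # the first line holds the column names
                         sep = ";",      # fields are separated by semicolons
                         dec = ",",      # numbers use a decimal comma
                         as.is = 1)      # keep column 1 as plain text
small_data
class(small_data$score)   # "numeric": 0,111 was read as the number 0.111
```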

Inspect the data: column names

names(protein_data)
 [1] "FASTA_Header"          "Sequence"              "SignalP_name"          "SignalP_Cmax"         
 [5] "SignalP_Cmax_pos"      "SignalP_Ymax"          "SignalP_Ymax_pos"      "SignalP_Smax"         
 [9] "SignalP_Smax_pos"      "SignalP_Smean"         "SignalP_D"             "SignalP_YesNo"        
[13] "SignalP_Dmaxcut"       "SignalP_Networks_used" "LipoP_Localisation"    "LipoP_Score"          
[17] "TMHMM_Length"          "TMHMM_ExpAA"           "TMHMM_First60"         "TMHMM_PredHel"        
[21] "TMHMM_Topology"       

Inspect the data: the first few entries

head(protein_data)
                                                                                                          FASTA_Header
1    gi_15925706_ref_NP_373239.1_ chromosomal replication initiator protein [Staphylococcus aureus subsp. aureus N315]
2               gi_15925707_ref_NP_373240.1_ DNA polymerase III, beta chain [Staphylococcus aureus subsp. aureus N315]
3               gi_15925708_ref_NP_373241.1_ conserved hypothetical protein [Staphylococcus aureus subsp. aureus N315]
4 gi_15925709_ref_NP_373242.1_ DNA repair and genetic recombination protein [Staphylococcus aureus subsp. aureus N315]
5                         gi_15925710_ref_NP_373243.1_ DNA gyrase subunit B [Staphylococcus aureus subsp. aureus N315]
6                         gi_15925711_ref_NP_373244.1_ DNA gyrase subunit A [Staphylococcus aureus subsp. aureus N315]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Sequence
1                                                                                                                                                                                                                                                                                                                                                                                                                                                     MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK
5                                                                                                                                                                                                                                                      MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF
6 MAELPQSRINERNITSEMRESFLDYAMSVIVARALPDVRDGLKPVHRRILYGLNEQGMTPDKSYKKSARIVGDVMGKYHPHGDSSIYEAMVRMAQDFSYRYPLVDGQGNFGSMDGDGAAAMRYTEARMTKITLELLRDINKDTIDFIDNYDGNEREPSVLPARFPNLLANGASGIAVGMATNIPPHNLTELINGVLSLSKNPDISIAELMEDIEGPDFPTAGLILGKSGIRRAYETGRGSIQMRSRAVIEERGGGRQRIVVTEIPFQVNKARMIEKIAELVRDKKIDGITDLRDETSLRTGVRVVIDVRKDANASVILNNLYKQTPLQTSFGVNMIALVNGRPKLINLKEALVHYLEHQKTVVRRRTQYNLRKAKDRAHILEGLRIALDHIDEIISTIRESDTDKVAMESLQQRFKLSEKQAQAILDMRLRRLTGLERDKIEAEYNELLNYISELETILADEEVLLQLVRDELTEIRDRFGDDRRTEIQLGGFEDLEDEDLIPEEQIVITLSHNNYIKRLPVSTYRAQNRGGRGVQGMNTLEEDFVSQLVTLSTHDHVLFFTNKGRVYKLKGYEVPELSRQSKGIPVVNAIELENDEVISTMIAVKDLESEDNFLVFATKRGVVKRSALSNFSRINRNGKIAISFREDDELIAVRLTSGQEDILIGTSHASLIRFPESTLRPLGRTATGVKGITLREGDEVVGLDVAHANSVDEVLVVTENGYGKRTPVNDYRLSNRGGKGIKTATITERNGNVVCITTVTGEEDLMIVTNAGVIIRLDVADISQNGRAAQGVRLIRLGDDQFVSTVAKVKEDAEDETNEDEQSTSTVSEDGTEQQREAVVNDETPGNAIHTEVIDSEENDEDGRIEVRQDFMDRVEEDIQQSSDEDEE
                  SignalP_name SignalP_Cmax SignalP_Cmax_pos SignalP_Ymax SignalP_Ymax_pos
1 gi_15925706_ref_NP_373239.1_        0.111               59        0.101               41
2 gi_15925707_ref_NP_373240.1_        0.168               39        0.179               39
3 gi_15925708_ref_NP_373241.1_        0.109               13        0.098               33
4 gi_15925709_ref_NP_373242.1_        0.131               36        0.133               15
5 gi_15925710_ref_NP_373243.1_        0.124               16        0.096               25
6 gi_15925711_ref_NP_373244.1_        0.115               35        0.104               29
  SignalP_Smax SignalP_Smax_pos SignalP_Smean SignalP_D SignalP_YesNo SignalP_Dmaxcut
1        0.132               37         0.069     0.089             N            0.45
2        0.372               36         0.125     0.158             N            0.45
3        0.121               30         0.085     0.093             N            0.45
4        0.226                7         0.167     0.147             N            0.45
5        0.107               29         0.076     0.088             N            0.45
6        0.165               20         0.100     0.102             N            0.45
  SignalP_Networks_used LipoP_Localisation LipoP_Score TMHMM_Length TMHMM_ExpAA TMHMM_First60
1            SignalP-TM                CYT   -0.200913          453        0.03          0.00
2            SignalP-TM                CYT   -0.200913          377        0.00          0.00
3            SignalP-TM                CYT   -0.200913           81        0.00          0.00
4            SignalP-TM                CYT   -0.200913          370        0.00          0.00
5            SignalP-TM                CYT   -0.200913          644        0.06          0.00
6            SignalP-TM                CYT   -0.200913          889        0.02          0.01
  TMHMM_PredHel TMHMM_Topology
1             0              o
2             0              o
3             0              o
4             0              o
5             0              o
6             0              o

Inspect the data: View like Excel

If you want to have a look at the data “spreadsheet-style”, you can type View(protein_data) in the Console. The Viewer will show the data associated with the variable in the editor panel.

DATAFRAMES

What is a dataframe

  • The dataset protein_data is what is called a dataframe in R.
  • In a dataframe, data is organized in rows and columns.
  • Columns contain measurements of a single variable; they are formed by vectors.
  • Rows contain observations (here: different measurements on a protein).
  • A dataframe is an ordered list of vectors of the same length.
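
The last point can be made visible with a tiny, made-up dataframe. This is a sketch only; the name measurements and its columns are hypothetical.

```r
measurements <- data.frame(sample = c("s1", "s2", "s3"),
                           weight = c(1.5, 2.5, 3.5))
is.list(measurements)         # TRUE: a dataframe is a list
length(measurements)          # 2: the number of vectors (columns) in the list
length(measurements$weight)   # 3: each vector holds one value per row
class(measurements$weight)    # "numeric": a column is an ordinary vector
mean(measurements$weight)     # 2.5
```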

Accessing a column of a dataframe

  • Use the dollar sign $ to get hold of a single column:
protein_data$SignalP_YesNo
  [1] N N N N N N N N N N N N N N N Y N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
 [48] N N N N N N N N N N N N N N N N N N N N N N N N N Y N N N Y Y N N N N N N Y N N Y N N N N N N
 [95] N N N N N N N N N N N N N N N Y Y N N N N N N Y Y N Y Y N Y N N Y N Y Y Y N Y N N Y N Y Y N N
[142] Y Y Y Y Y Y Y Y Y Y Y Y N Y Y N Y Y Y Y N Y N N N N N N N Y N Y N N Y N N N Y N N Y N N N Y N
[189] Y N N N N Y Y N Y Y Y Y Y Y N N N N Y N N N Y N N Y N N Y Y Y Y Y N N Y N Y N N N Y Y Y Y Y Y
[236] Y N Y Y Y N N N N N N N N N Y N N N Y Y Y N N Y N N N N N N N N N N N N N N Y Y Y Y N Y Y Y Y
[283] Y N Y Y N N Y N N N N N Y Y N N N Y Y Y Y Y N N N Y Y Y Y Y N N N N N N N N Y N N N N Y Y N N
[330] N Y N Y Y Y N N N N N Y N N Y Y N Y N Y Y Y Y N N N N N N Y N Y Y N N Y N N N N Y N N N Y Y Y
[377] Y Y Y Y N Y Y Y N N N
Levels: N Y

  • Here is a table summary of column “SignalP_YesNo”, showing that 134 out of 387 proteins have a putative signal sequence:
table(protein_data$SignalP_YesNo)

  N   Y 
253 134 

Since it is a vector, you can use indexing on a column:

protein_data$SignalP_YesNo[330:345]
 [1] N Y N Y Y Y N N N N N Y N N Y Y
Levels: N Y

The number of rows and columns

Get the dimensions of the dataset:

dim(protein_data)
[1] 387  21

This is a vector of two integers. Which one is the number of rows?

Indexing on dataframes

  • Indexing on whole dataframes is like using a coordinate system:
    dataframe[rows, columns]
  • Leave empty if you want all values of the row or column
  • The next few slides use this example dataframe called students

students <- data.frame(sid=paste0("S0", 1:5), 
                       name=c("Mark", "Lynn", "Lianne", "Peter", "Rose"),
                       sex=factor(c("m", "f", "f", "m", "f"), 
                                  labels = c("female", "male")),
                       biology=c(5.6, 6.2, 7.9, 4.4, 9.1),
                       statistics=c(6.1, 5.1, 8.0, 4.7, 7.3),
                       informatics=c(6.3, 6.1, 7.7, 5.4, 9.5),
                       stringsAsFactors = FALSE)
students
  sid   name    sex biology statistics informatics
1 S01   Mark   male     5.6        6.1         6.3
2 S02   Lynn female     6.2        5.1         6.1
3 S03 Lianne female     7.9        8.0         7.7
4 S04  Peter   male     4.4        4.7         5.4
5 S05   Rose female     9.1        7.3         9.5

students[1, 2] # row 1, column 2
[1] "Mark"
students[2, 4:6] # all grades of Lynn
  biology statistics informatics
2     6.2        5.1         6.1

students[, 2] # all student names - same as students$name
[1] "Mark"   "Lynn"   "Lianne" "Peter"  "Rose"  
students[2, ] # row 2
  sid name    sex biology statistics informatics
2 S02 Lynn female     6.2        5.1         6.1

Using logical selections

All grades for Lynn:

students[students$name == "Lynn", 4:6] 
  biology statistics informatics
2     6.2        5.1         6.1

All data for the female students:

students[students$sex == "female", ]
  sid   name    sex biology statistics informatics
2 S02   Lynn female     6.2        5.1         6.1
3 S03 Lianne female     7.9        8.0         7.7
5 S05   Rose female     9.1        7.3         9.5
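
Columns can also be selected by name instead of by number, which keeps working when the column order changes. The sketch below uses a small, made-up dataframe called pets rather than the students data.

```r
pets <- data.frame(name = c("Rex", "Tom", "Kim"),
                   species = c("dog", "cat", "dog"),
                   age = c(3, 7, 5),
                   stringsAsFactors = FALSE)

# Rows by condition, columns by name
dogs <- pets[pets$species == "dog", c("name", "age")]
dogs

# A single column selected by name gives a plain vector
older_names <- pets[pets$age > 4, "name"]
older_names    # "Tom" "Kim"

# Conditions can be combined with & and |
old_dogs <- pets[pets$species == "dog" & pets$age > 4, ]
old_dogs
```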

–Practice 6–

To get the student data into your session, type source("https://git.io/fjfMW") in the console. It is this file: https://raw.githubusercontent.com/MichielNoback/intro_R_lessons/gh-pages/data/intro_lesson_data.R

Type students or View(students) in the console to verify you have it

  1. select all informatics grades
  2. select the whole third and fourth rows
  3. select the statistics grade for Peter
  4. select the biology and statistics grades for the female students
  5. select the student names where the biology grade is below 6

–Practice 6 Solutions–

1: select all informatics grades: two alternatives

students$informatics
[1] 6.3 6.1 7.7 5.4 9.5
students[, 6]
[1] 6.3 6.1 7.7 5.4 9.5

2: select the third and fourth row entirely: two alternatives

students[3:4, ]
  sid   name    sex biology statistics informatics
3 S03 Lianne female     7.9        8.0         7.7
4 S04  Peter   male     4.4        4.7         5.4
students[c(3, 4), ]
  sid   name    sex biology statistics informatics
3 S03 Lianne female     7.9        8.0         7.7
4 S04  Peter   male     4.4        4.7         5.4

3: select the statistics grade for Peter: three alternatives

students[4, 5] # simple
[1] 4.7
students$statistics[4] # also OK
[1] 4.7
students[students$name == "Peter", 5] # better
[1] 4.7

4: select the biology and statistics grades for the female students

students[students$sex == "female", c(4, 5)]
  biology statistics
2     6.2        5.1
3     7.9        8.0
5     9.1        7.3

5: select the student names where the biology grade is below 6

students[students$biology < 6, 2]
[1] "Mark"  "Peter"

CREATING FIGURES

Load the data yourself.

If you want to tag along, download and load the file as follows:

protein_data <- read.table(file = "https://git.io/fjfum",
                        header = TRUE,
                        sep = ";",
                        dec = ",",
                        as.is = c(1, 2, 3, 21))

(if the short URL does not work, use
https://raw.githubusercontent.com/MichielNoback/intro_R_lessons/gh-pages/data/protein_processing_pred.csv)

A scatterplot

Let’s investigate the relation Cmax vs Ymax of the SignalP analysis:

head(protein_data$SignalP_Cmax)
[1] 0.111 0.168 0.109 0.131 0.124 0.115
head(protein_data$SignalP_Ymax)
[1] 0.101 0.179 0.098 0.133 0.096 0.104

See http://www.cbs.dtu.dk/services/SignalP-4.1/output.php for a description of these analysis results.


Create variables for convenience, and plot:

cmax <- protein_data$SignalP_Cmax
ymax <- protein_data$SignalP_Ymax
plot(x = cmax, y = ymax)

Tweak a little

plot(x = cmax, y = ymax,
     xlab = "Cmax value", ylab="Ymax value",
     pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))
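
The extra arguments can be tried out on a handful of points. The data below are made up purely for this sketch; the comments describe what each setting does.

```r
# Made-up data, only for showing the plot settings
x_values <- c(0.1, 0.2, 0.2, 0.4, 0.8)
y_values <- c(0.1, 0.3, 0.2, 0.5, 0.7)

# rgb(red, green, blue, alpha): each value lies between 0 and 1;
# alpha is the opacity, so 0.3 gives see-through points
see_through_blue <- rgb(0, 0, 1, 0.3)

plot(x = x_values, y = y_values,
     xlab = "x", ylab = "y",   # axis labels
     pch = 19,                 # plotting symbol: solid circle
     cex = 2,                  # symbol size, relative to the default of 1
     col = see_through_blue)   # symbol colour
```

Overlapping points look darker with a see-through colour, which helps when many points lie on top of each other.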

Or, even better, with a log transform:

plot(x = log2(cmax), y = log2(ymax),
     xlab = "log2(Cmax value)", ylab="log2(Ymax value)",
     pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))

Add a regression line

plot(x = log2(cmax), y = log2(ymax),
     xlab = "log2(Cmax value)", ylab="log2(Ymax value)",
     pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))
model <- lm(log2(ymax)  ~  log2(cmax))
abline(model, col = "red", lwd=2)
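
To see what lm() and abline() do, it may help to use data where the answer is known in advance. In this sketch the made-up points lie exactly on the line y = 1 + 2x, so the fitted intercept should come out as 1 and the slope as 2. All names are hypothetical.

```r
# Made-up data lying exactly on the line y = 1 + 2x
x_values <- c(1, 2, 3, 4, 5)
y_values <- 1 + 2 * x_values

# The formula y_values ~ x_values reads as "y explained by x"
model_example <- lm(y_values ~ x_values)
coef(model_example)    # the fitted intercept and slope

plot(x = x_values, y = y_values)
# abline() takes the intercept and slope from the model and draws that line
abline(model_example, col = "red", lwd = 2)
```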

Or with a smoother?

scatter.smooth(x = log2(cmax), y = log2(ymax),
     xlab = "log2(Cmax value)", ylab="log2(Ymax value)",
     pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))
abline(model, col = "red", lwd=2)

Summary

  • You have seen R at work over a variety of activities, including basic plotting.
  • Creating a thorough analysis is quite some work, as with any analysis platform, but repeating the analysis with new data is as simple as a mouse click.
  • Advice: use RMarkdown whenever you are going to do a data analysis project.

On the test

What is expected of you at the test of this course:

You should be able to

  • perform basic mathematical operations
  • apply selections on vectors and dataframes
  • read documentation for a function
  • use existing functions correctly

You will be allowed to use all R documentation as well as this presentation.

FINAL ASSIGNMENT

Final assignment

Download the assignment RMarkdown document from this location:

https://git.io/JeOIh

Long URL:
https://michielnoback.github.io/intro_R_lessons/final_assignment_R_BMR.Rmd
Web-page view:
https://michielnoback.github.io/intro_R_lessons/final_assignment_R_BMR.html

Save the file on your computer, open it in RStudio and then work through the assignments. Put your solutions where it says
## YOUR CODE HERE.

You can process the document into Word form by clicking “Knit”. Submit this Word document.