A friendly intro to R

(as friendly as possible)

Michiel Noback

GOALS

Goals

The objectives of these two lessons are

Why R?

The analytical toolbox

Many tools are used in the life sciences to perform data analysis and visualisation tasks.

The best-known one is Excel, but you may also have encountered SPSS.

Another much-used tool is R, a programming language with embedded statistics and graphics support.

What is R?

Besides its free nature, R is very popular because it

RSTUDIO

The workbench

Panels of the workbench

You work with 4 panels in the workbench:

  1. Code editor: where you write your scripts, which are text files with code you want to execute more than once
  2. R console: where you execute lines of code one by one
  3. Workspace and history: see what data you have in memory, and what you have done so far
  4. Plots, Help & Files

The console vs code editor

USING THE CONSOLE

Let’s calculate

R, like any programming language, supports all math operations in the way you would expect them:

+     is ‘plus’, as in 2 + 2 = 4

-     ditto, subtract, as in 2 - 2 = 0

*     multiply

/     divide

^     exponent

Remember that \(\sqrt{n} = n^{0.5}\)

Use parentheses () for grouping parts of equations.

Precedence

All “operators” adhere to the standard mathematical precedence rules (PEMDAS):

    Parentheses (simplify inside these)
    Exponents
    Multiplication and Division (from left to right)
    Addition and Subtraction (from left to right)
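
The precedence rules above can be checked directly in the console. This is a minimal sketch; the variable names are made up for illustration and the values are arbitrary.

```r
# Exponent first, then multiplication, then addition
no_parens <- 2 + 3 * 4^2       # 4^2 = 16, 3 * 16 = 48, 2 + 48 = 50
# Parentheses are simplified before anything else
with_parens <- (2 + 3) * 4^2   # 5 * 16 = 80
# A leading minus sign is applied AFTER the exponent in R
neg_first <- -2^2              # read as -(2^2) = -4
neg_grouped <- (-2)^2          # 4
no_parens
with_parens
neg_first
neg_grouped
```

When in doubt about the order, adding parentheses never hurts.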

–Practice 1–

In the console, calculate the following:

\(31 + 11\)

\(66 - 24\)

\(\frac{126}{3}\)

\(12^2\)

\(\sqrt{256}\)

\(\frac{3*(4+\sqrt{8})}{5^3}\)

Solutions

\(31 + 11 = 42\)

\(66 - 24 = 42\)

\(\frac{126}{3} = 42\)

\(12^2 = 144\)

\(\sqrt{256} = 16\)

\(\frac{3 * (4 + \sqrt{8})}{5^3} = 0.1638823\) – in R: (3 * (4 + 8^0.5))/5^3

DATA TYPES

Four types of data

You have seen that R works with numbers. There are a few more types of data:

numeric:  numbers with a decimal part:
- 3.123, 5000.0 or 4.1E3
integer:  numbers without a decimal part (written with an L suffix in R code, as in 1L):
- 1, 0 or 2999
logical:  also called boolean values:
- TRUE or FALSE
character:  text, always between quotes:
- "hello R" or "GATC"
factor:  nominal and ordinal scales (not dealt with here)
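
To see which type R assigns to a value, you can ask with class(). A small sketch with arbitrary example values; note that a plain number without the L suffix is stored as numeric, even when it has no decimal part.

```r
class(3.123)     # "numeric"
class(1)         # "numeric": a plain 1 is stored as a decimal number
class(1L)        # "integer": the L suffix asks for a true integer
class(TRUE)      # "logical"
class("GATC")    # "character"
```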

FUNCTIONS

Definition

Simple calculations are not the core business of R.

You want to do more complex things and this is where functions come in.

A function is a named piece of functionality that you can execute by typing its name, followed by a pair of parentheses. Within these parentheses, you can pass data for the function to work on. Functions may or may not return a value.

It has this general form:

\[function\_name(argument, argument, ...)\]

Example: Square root with sqrt()

You have already seen that the square root can be calculated as \(n^{0.5}\).

However, there is also a function for it: sqrt(). It returns the square root of the given number, e.g. sqrt(36)

sqrt(36)
[1] 6
36^0.5
[1] 6

Another example: paste()

The paste function can take any number of arguments and returns them, combined into a single text. You can also supply a separator using sep=<separator string>:

paste(1, 2, 3, sep="---")
[1] "1---2---3"

The quotes around the three dashes "---" indicate it is text data.

Getting help on a function

Type ?function_name in the console to get help on a function.
For instance, typing ?sqrt will give the help page of the square root function.

Scroll down in the help page to see example usages of the function.
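
A few ways to ask for help, sketched below. All of these are standard R functions; args() is a quick alternative when you only want to see which arguments a function accepts.

```r
?sqrt            # open the help page for sqrt
help("sqrt")     # the same request, written as a function call
args(paste)      # show only the argument list of paste
```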

–Practice 2–

  1. View the help page for paste. There are two variants of this function.
    • Which?
    • What is the difference between them?
    • Use both variants to generate exactly this message "welcome to R" from these arguments: "welcome", "to", "R"
  2. What does the abs function do?
    • What is returned by abs(-20) and what is abs(20)?
  3. What does the c function do?
    • What is the difference in its result when you combine 1, 3 and "a" as arguments, or 1, 2 and 3?

VARIABLES

What are variables?

dna <- 'GATC'
paste("The DNA letters are:", dna)
[1] "The DNA letters are: GATC"
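
A variable is a name attached to a value, created with the assignment arrow <-. The sketch below uses made-up names and numbers (a hypothetical sequence length and GC count) to show that a variable can be reused in calculations and overwritten later.

```r
# Store values under a name with <-
seq_length <- 1200    # hypothetical sequence length in bases
gc_count <- 540       # hypothetical number of G and C bases
# Use the names wherever you would use the values
gc_fraction <- gc_count / seq_length
gc_fraction           # 0.45
# Assigning again replaces the old value
gc_count <- 600
gc_count / seq_length # 0.5
```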

–Practice 3–

Create these three variables: x=20, y=10 and z=3.
Next, calculate the following with these variables:

  1. \(x+y\)
  2. \(x^z\)
  3. \(q = x \times y \times z\)
  4. \(\sqrt{q}\)
  5. \(\frac{q}{\pi}\) (pi is just pi in R)
  6. \(\log_{10}{(x \times y)}\)

VECTORS AND INDEXING

The function c()

The function c() combines the passed arguments into a vector.
All elements of a vector have the same data type, so R converts the arguments to the most general type present: logical, then numeric, then character.

c(1, 2, 5)
[1] 1 2 5
c(1, "a", 2)
[1] "1" "a" "2"
c(1, TRUE, FALSE)
[1] 1 1 0
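
You can confirm the resulting type of each vector with class(). A short sketch using the same kind of mixed arguments:

```r
class(c(1, 2, 5))          # "numeric"
class(c(1, "a", 2))        # "character": the numbers become text
class(c(1, TRUE, FALSE))   # "numeric": TRUE becomes 1, FALSE becomes 0
mixed <- c(TRUE, "a")      # logical and text together
mixed                      # "TRUE" "a"
```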

Vectors and Indexing

nucleotides <- c("A", "C", "G", "T")
nucleotides
[1] "A" "C" "G" "T"
nucleotides[3]    # fetch the third
[1] "G"
nucleotides[2]
[1] "C"

Note the use of # to put regular text within code: this is a code comment and is ignored by R.

More on indexing

nucleotides <- c("A", "C", "G", "T")
nucleotides[c(3,1,4)]
[1] "G" "A" "T"

Or use a series of indices with a:b, meaning a through b:

nucleotides[1:3] 
[1] "A" "C" "G"

–Practice 4–

Given the letters of the alphabet, available as the variable letters in R, make a selection that gives you this output:

  1. “b”
  2. “z”
  3. “c” “d” “e” “f”
  4. “d” “n” “a”
  5. “dna” (look at the help for paste and use the collapse argument)

Indexing on conditions

One of the nice things about indexing is that you can use logical indexing to select elements you are interested in.

q <- c(2, 1, 4)
q[c(TRUE, FALSE, TRUE)] #select with a logical
[1] 2 4
q > 2 #which values in q are higher than 2?
[1] FALSE FALSE  TRUE
q[q > 2] #select those
[1] 4
q <- c(2, 1, 4)
q == 1
[1] FALSE  TRUE FALSE
q[q == 1]
[1] 1
q <= 3
[1]  TRUE  TRUE FALSE
q[q <= 3]
[1] 2 1
q == 1 | q == 4 # using logical "OR"
[1] FALSE  TRUE  TRUE
q[q == 1 | q == 4]
[1] 1 4

Overview logical operators

Here is a named vector to demonstrate logical indexing further:

grades <- c(3.4, 5.6, 8.3, 2.9, 6.8)
# Attach names to the vector for readable display
names(grades) <- c("Ian", "Mark", "Lara", "Rowan", "Iris")
grades
  Ian  Mark  Lara Rowan  Iris 
  3.4   5.6   8.3   2.9   6.8 
pass_test <- grades >= 5.5
pass_test
  Ian  Mark  Lara Rowan  Iris 
FALSE  TRUE  TRUE FALSE  TRUE 
grades[pass_test] 
Mark Lara Iris 
 5.6  8.3  6.8 

Highest grade and average students

grades[grades == max(grades)]
Lara 
 8.3 
grades[grades >= 5.5 & grades < 7] 
Mark Iris 
 5.6  6.8 
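
As a compact overview, the comparison and logical operators can be tried on a small vector. This is a sketch with an arbitrary vector named x; each comment shows the result for that line.

```r
x <- c(2, 1, 4)
x == 1           # equal to:                  FALSE  TRUE FALSE
x != 1           # not equal to:               TRUE FALSE  TRUE
x < 2            # less than:                 FALSE  TRUE FALSE
x >= 2           # greater than or equal to:   TRUE FALSE  TRUE
x > 1 & x < 4    # AND, both must hold:        TRUE FALSE FALSE
x < 2 | x > 3    # OR, at least one holds:    FALSE  TRUE  TRUE
!(x > 2)         # NOT, flips each value:      TRUE  TRUE FALSE
```

Each of these results can be placed between square brackets to select the matching elements, as in x[x != 1].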

–Practice 5 (intro)–

Given these vectors, representing a hypothetical controlled drug test experiment:

participant_ids <- c("P01", "P02", "P03", "P04", "P05", "P06")
placebo_given <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
patient_responses <- c(76, 44, 38, 92, 28, 81)
names(placebo_given) <- participant_ids
names(patient_responses) <- participant_ids
placebo_given
  P01   P02   P03   P04   P05   P06 
FALSE  TRUE  TRUE FALSE  TRUE FALSE 
patient_responses
P01 P02 P03 P04 P05 P06 
 76  44  38  92  28  81 

–Practice 5–

Copy the code from the previous slide and, using only logical selections, select

  1. those participant_ids for which a placebo was given
  2. those participant_ids for which NO placebo was given
  3. the responses for which a placebo was given
  4. the responses for which a placebo was given and calculate the mean of this group (using mean())
  5. the highest value (using max()) of the patients who were given a placebo
  6. (challenge) the patient responses with a response higher than the mean of all responses

REAL DATA

Protein analysis results

The data is the analysis result of a set of proteins encoded on the Staphylococcus aureus genome.

It has been run through the sequence analysis tools SignalP, LipoP and TMHMM.

How it looks in Excel

Export to simple text file

I modified the Excel sheet a bit and then exported the data as a plain text file. The data is now in a semicolon-delimited file called protein_processing_pred.csv, with a comma as the decimal mark.
When opened in a simple text editor (e.g. Notepad) it looks like this.

Loading into R using read.table()

Here is how you load a data file in R.
No mouse clicks!

protein_data <- read.table(file = "data/protein_processing_pred.csv",
                        header = TRUE, 
                        sep = ";", 
                        dec = ",", 
                        as.is = c(1, 2, 3, 21)) 

Don’t worry - you do not need to be able to do this for the test. The next slide explains what happens.

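protein_data <- read.table(file = "data/protein_processing_pred.csv", # path to the file
                        header = TRUE,            # the first line holds the column names
                        sep = ";",                # fields are separated by semicolons
                        dec = ",",                # numbers use a decimal comma
                        as.is = c(1, 2, 3, 21))   # keep these columns as plain text, not factors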
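
To try these arguments without the real file, you can feed read.table() a few lines of text through its text argument. The sketch below uses a tiny made-up table (the names raw_text, demo, protA and protB are invented for illustration) in the same layout: semicolons between fields and a comma as decimal mark.

```r
# A made-up mini file: ';' between fields, ',' as decimal mark
raw_text <- "name;score;hit
protA;0,111;N
protB;0,372;Y"
demo <- read.table(text = raw_text,
                   header = TRUE,   # first line holds the column names
                   sep = ";",       # fields are separated by semicolons
                   dec = ",",       # 0,111 is read as 0.111
                   as.is = 1)       # keep column 1 as plain text
demo$score          # 0.111 0.372
class(demo$name)    # "character"
```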

Inspect the data: column names

names(protein_data)
 [1] "FASTA_Header"          "Sequence"              "SignalP_name"          "SignalP_Cmax"         
 [5] "SignalP_Cmax_pos"      "SignalP_Ymax"          "SignalP_Ymax_pos"      "SignalP_Smax"         
 [9] "SignalP_Smax_pos"      "SignalP_Smean"         "SignalP_D"             "SignalP_YesNo"        
[13] "SignalP_Dmaxcut"       "SignalP_Networks_used" "LipoP_Localisation"    "LipoP_Score"          
[17] "TMHMM_Length"          "TMHMM_ExpAA"           "TMHMM_First60"         "TMHMM_PredHel"        
[21] "TMHMM_Topology"       

Inspect the data: the first few entries

head(protein_data)
                                                                                                          FASTA_Header
1    gi_15925706_ref_NP_373239.1_ chromosomal replication initiator protein [Staphylococcus aureus subsp. aureus N315]
2               gi_15925707_ref_NP_373240.1_ DNA polymerase III, beta chain [Staphylococcus aureus subsp. aureus N315]
3               gi_15925708_ref_NP_373241.1_ conserved hypothetical protein [Staphylococcus aureus subsp. aureus N315]
4 gi_15925709_ref_NP_373242.1_ DNA repair and genetic recombination protein [Staphylococcus aureus subsp. aureus N315]
5                         gi_15925710_ref_NP_373243.1_ DNA gyrase subunit B [Staphylococcus aureus subsp. aureus N315]
6                         gi_15925711_ref_NP_373244.1_ DNA gyrase subunit A [Staphylococcus aureus subsp. aureus N315]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Sequence
1                                                                                                                                                                                                                                                                                                                                                                                                                                                     MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK
5                                                                                                                                                                                                                                                      MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF
6 MAELPQSRINERNITSEMRESFLDYAMSVIVARALPDVRDGLKPVHRRILYGLNEQGMTPDKSYKKSARIVGDVMGKYHPHGDSSIYEAMVRMAQDFSYRYPLVDGQGNFGSMDGDGAAAMRYTEARMTKITLELLRDINKDTIDFIDNYDGNEREPSVLPARFPNLLANGASGIAVGMATNIPPHNLTELINGVLSLSKNPDISIAELMEDIEGPDFPTAGLILGKSGIRRAYETGRGSIQMRSRAVIEERGGGRQRIVVTEIPFQVNKARMIEKIAELVRDKKIDGITDLRDETSLRTGVRVVIDVRKDANASVILNNLYKQTPLQTSFGVNMIALVNGRPKLINLKEALVHYLEHQKTVVRRRTQYNLRKAKDRAHILEGLRIALDHIDEIISTIRESDTDKVAMESLQQRFKLSEKQAQAILDMRLRRLTGLERDKIEAEYNELLNYISELETILADEEVLLQLVRDELTEIRDRFGDDRRTEIQLGGFEDLEDEDLIPEEQIVITLSHNNYIKRLPVSTYRAQNRGGRGVQGMNTLEEDFVSQLVTLSTHDHVLFFTNKGRVYKLKGYEVPELSRQSKGIPVVNAIELENDEVISTMIAVKDLESEDNFLVFATKRGVVKRSALSNFSRINRNGKIAISFREDDELIAVRLTSGQEDILIGTSHASLIRFPESTLRPLGRTATGVKGITLREGDEVVGLDVAHANSVDEVLVVTENGYGKRTPVNDYRLSNRGGKGIKTATITERNGNVVCITTVTGEEDLMIVTNAGVIIRLDVADISQNGRAAQGVRLIRLGDDQFVSTVAKVKEDAEDETNEDEQSTSTVSEDGTEQQREAVVNDETPGNAIHTEVIDSEENDEDGRIEVRQDFMDRVEEDIQQSSDEDEE
                  SignalP_name SignalP_Cmax SignalP_Cmax_pos SignalP_Ymax SignalP_Ymax_pos
1 gi_15925706_ref_NP_373239.1_        0.111               59        0.101               41
2 gi_15925707_ref_NP_373240.1_        0.168               39        0.179               39
3 gi_15925708_ref_NP_373241.1_        0.109               13        0.098               33
4 gi_15925709_ref_NP_373242.1_        0.131               36        0.133               15
5 gi_15925710_ref_NP_373243.1_        0.124               16        0.096               25
6 gi_15925711_ref_NP_373244.1_        0.115               35        0.104               29
  SignalP_Smax SignalP_Smax_pos SignalP_Smean SignalP_D SignalP_YesNo SignalP_Dmaxcut
1        0.132               37         0.069     0.089             N            0.45
2        0.372               36         0.125     0.158             N            0.45
3        0.121               30         0.085     0.093             N            0.45
4        0.226                7         0.167     0.147             N            0.45
5        0.107               29         0.076     0.088             N            0.45
6        0.165               20         0.100     0.102             N            0.45
  SignalP_Networks_used LipoP_Localisation LipoP_Score TMHMM_Length TMHMM_ExpAA TMHMM_First60
1            SignalP-TM                CYT   -0.200913          453        0.03          0.00
2            SignalP-TM                CYT   -0.200913          377        0.00          0.00
3            SignalP-TM                CYT   -0.200913           81        0.00          0.00
4            SignalP-TM                CYT   -0.200913          370        0.00          0.00
5            SignalP-TM                CYT   -0.200913          644        0.06          0.00
6            SignalP-TM                CYT   -0.200913          889        0.02          0.01
  TMHMM_PredHel TMHMM_Topology
1             0              o
2             0              o
3             0              o
4             0              o
5             0              o
6             0              o

Inspect the data: View like Excel

If you want to have a look at the data “spreadsheet-style”, you can type View(protein_data) in the Console. The Viewer will show the data associated with the variable in the editor panel.

DATAFRAMES

What is a dataframe

Accessing a column of a dataframe

protein_data$SignalP_YesNo
  [1] N N N N N N N N N N N N N N N Y N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
 [48] N N N N N N N N N N N N N N N N N N N N N N N N N Y N N N Y Y N N N N N N Y N N Y N N N N N N
 [95] N N N N N N N N N N N N N N N Y Y N N N N N N Y Y N Y Y N Y N N Y N Y Y Y N Y N N Y N Y Y N N
[142] Y Y Y Y Y Y Y Y Y Y Y Y N Y Y N Y Y Y Y N Y N N N N N N N Y N Y N N Y N N N Y N N Y N N N Y N
[189] Y N N N N Y Y N Y Y Y Y Y Y N N N N Y N N N Y N N Y N N Y Y Y Y Y N N Y N Y N N N Y Y Y Y Y Y
[236] Y N Y Y Y N N N N N N N N N Y N N N Y Y Y N N Y N N N N N N N N N N N N N N Y Y Y Y N Y Y Y Y
[283] Y N Y Y N N Y N N N N N Y Y N N N Y Y Y Y Y N N N Y Y Y Y Y N N N N N N N N Y N N N N Y Y N N
[330] N Y N Y Y Y N N N N N Y N N Y Y N Y N Y Y Y Y N N N N N N Y N Y Y N N Y N N N N Y N N N Y Y Y
[377] Y Y Y Y N Y Y Y N N N
Levels: N Y
table(protein_data$SignalP_YesNo)

  N   Y 
253 134 

Since it is a vector, you can use indexing on a column:

protein_data$SignalP_YesNo[330:345]
 [1] N Y N Y Y Y N N N N N Y N N Y Y
Levels: N Y

The number of rows and columns

Get the dimensions of the dataset:

dim(protein_data)
[1] 387  21

This is a vector of two integers. Which one is the number of rows?

Indexing on dataframes

students <- data.frame(sid=paste0("S0", 1:5), 
                       name=c("Mark", "Lynn", "Lianne", "Peter", "Rose"),
                       sex=factor(c("m", "f", "f", "m", "f"), 
                                  labels = c("female", "male")),
                       biology=c(5.6, 6.2, 7.9, 4.4, 9.1),
                       statistics=c(6.1, 5.1, 8.0, 4.7, 7.3),
                       informatics=c(6.3, 6.1, 7.7, 5.4, 9.5),
                       stringsAsFactors = FALSE)
students
  sid   name    sex biology statistics informatics
1 S01   Mark   male     5.6        6.1         6.3
2 S02   Lynn female     6.2        5.1         6.1
3 S03 Lianne female     7.9        8.0         7.7
4 S04  Peter   male     4.4        4.7         5.4
5 S05   Rose female     9.1        7.3         9.5
students[1, 2] # row 1, column 2
[1] "Mark"
students[2, 4:6] # all grades of Lynn
  biology statistics informatics
2     6.2        5.1         6.1
students[, 2] # all student names - same as students$name
[1] "Mark"   "Lynn"   "Lianne" "Peter"  "Rose"  
students[2, ] # row 2
  sid name    sex biology statistics informatics
2 S02 Lynn female     6.2        5.1         6.1
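
Columns can also be selected by name instead of by number, which keeps working when the column order changes. A minimal sketch on a hypothetical mini data frame (samples and its columns are made up for this example):

```r
# Hypothetical mini data frame, made up for this sketch
samples <- data.frame(id = c("A1", "A2", "A3"),
                      weight = c(12.5, 9.8, 14.1),
                      height = c(30, 25, 33))
samples[2, "weight"]            # one cell: 9.8
samples[, c("id", "height")]    # two columns, all rows
samples[c(1, 3), ]              # rows 1 and 3, all columns
samples[["weight"]]             # same as samples$weight
```

The general pattern is always dataframe[rows, columns]; leaving one side empty means "all of them".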

Using logical selections

  sid   name    sex biology statistics informatics
1 S01   Mark   male     5.6        6.1         6.3
2 S02   Lynn female     6.2        5.1         6.1
3 S03 Lianne female     7.9        8.0         7.7
4 S04  Peter   male     4.4        4.7         5.4
5 S05   Rose female     9.1        7.3         9.5

All grades for Lynn:

students[students$name == "Lynn", 4:6] 
  biology statistics informatics
2     6.2        5.1         6.1

All data for the female students:

students[students$sex == "female", ]
  sid   name    sex biology statistics informatics
2 S02   Lynn female     6.2        5.1         6.1
3 S03 Lianne female     7.9        8.0         7.7
5 S05   Rose female     9.1        7.3         9.5
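
Logical selections on rows can be stored in a variable and combined with & and !, just as with plain vectors. A sketch on a made-up data frame (samples, group and response are hypothetical names and values):

```r
# Hypothetical data, made up for this sketch
samples <- data.frame(id = c("A1", "A2", "A3", "A4"),
                      group = c("ctrl", "drug", "drug", "ctrl"),
                      response = c(76, 44, 38, 92))
is_drug <- samples$group == "drug"           # one TRUE/FALSE per row
samples[is_drug, ]                           # rows A2 and A3
samples[is_drug & samples$response > 40, ]   # only row A2
samples$id[!is_drug]                         # "A1" "A4"
mean(samples$response[is_drug])              # 41
```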

–Practice 6–

To get the student data into your session, type source("https://git.io/fjfMW") in the console. It is this file: https://raw.githubusercontent.com/MichielNoback/intro_R_lessons/gh-pages/data/intro_lesson_data.R

Type students or View(students) in the console to verify you have it.

  1. select all informatics grades
  2. select the whole third and fourth rows
  3. select the statistics grade for Peter
  4. select the biology and statistics grades for the female students
  5. select the student names where the biology grade is below 6

–Practice 6 Solutions–

1: select all informatics grades: two alternatives

students$informatics
[1] 6.3 6.1 7.7 5.4 9.5
students[, 6]
[1] 6.3 6.1 7.7 5.4 9.5

2: select the third and fourth rows entirely: two alternatives

students[3:4, ]
  sid   name    sex biology statistics informatics
3 S03 Lianne female     7.9        8.0         7.7
4 S04  Peter   male     4.4        4.7         5.4
students[c(3, 4), ]
  sid   name    sex biology statistics informatics
3 S03 Lianne female     7.9        8.0         7.7
4 S04  Peter   male     4.4        4.7         5.4

3: select the statistics grade for Peter: three alternatives

students[4, 5] # simple but not really the intention
[1] 4.7
students$statistics[4] # same remark
[1] 4.7
students[students$name == "Peter", 5] # better
[1] 4.7

4: select the biology and statistics grades for the female students

students[students$sex == "female", c(4, 5)]
  biology statistics
2     6.2        5.1
3     7.9        8.0
5     9.1        7.3

5: select the student names where the biology grade is below 6

students[students$biology < 6, 2]
[1] "Mark"  "Peter"

CREATING FIGURES

Load the data yourself.

If you want to follow along, download and load the file as follows:

protein_data <- read.table(file = "https://git.io/fjfum",
                        header = TRUE,
                        sep = ";",
                        dec = ",",
                        as.is = c(1, 2, 3, 21))

(if the short URL does not work, use
https://raw.githubusercontent.com/MichielNoback/intro_R_lessons/gh-pages/data/protein_processing_pred.csv)

A scatterplot

Let’s investigate the relation Cmax vs Ymax of the SignalP analysis:

head(protein_data$SignalP_Cmax)
[1] 0.111 0.168 0.109 0.131 0.124 0.115
head(protein_data$SignalP_Ymax)
[1] 0.101 0.179 0.098 0.133 0.096 0.104

See http://www.cbs.dtu.dk/services/SignalP-4.1/output.php for a description of these analysis results.

Create variables for convenience, and plot:

cmax <- protein_data$SignalP_Cmax
ymax <- protein_data$SignalP_Ymax
plot(x = cmax, y = ymax)

Tweak a little

plot(x = cmax, y = ymax,
     xlab = "Cmax value", ylab="Ymax value",
     pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))

Or, even better, with a log transform:

plot(x = log2(cmax), y = log2(ymax),
     xlab = "log2(Cmax value)", ylab="log2(Ymax value)",
     pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))

Add a regression line

plot(x = log2(cmax), y = log2(ymax),
     xlab = "log2(Cmax value)", ylab="log2(Ymax value)",
     pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))
model <- lm(log2(ymax)  ~  log2(cmax))
abline(model, col = "red", lwd=2)
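
The object returned by lm() holds the fitted line, and coef() extracts its intercept and slope. The sketch below uses made-up data that follow y = 2x + 1 exactly, so the outcome is easy to check; the names x and y are arbitrary and this is not the protein data.

```r
# Made-up data that lie exactly on the line y = 2x + 1
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1
model <- lm(y ~ x)     # formula: y explained by x
coef(model)            # intercept 1 and slope 2
plot(x, y, pch = 19)
abline(model, col = "red", lwd = 2)   # draws the fitted line
```

With real measurements the points scatter around the line, and the coefficients describe the best-fitting straight line through them.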

Summary

On the test

What is expected of you at the test of this course:

You should be able to

You will be allowed to use all R documentation as well as this presentation.

FINAL ASSIGNMENT

Final assignment

Download the assignment RMarkdown document from this location:

https://git.io/JeOIh

Long URL:
https://michielnoback.github.io/intro_R_lessons/final_assignment_R_BMR.Rmd
Web-page view:
https://michielnoback.github.io/intro_R_lessons/final_assignment_R_BMR.html

Save the file on your computer, open it in RStudio and then work through the assignments. Put your solutions where it says
## YOUR CODE HERE.

You can convert the document to Word format by clicking “Knit”. Submit this Word document.