Michiel Noback
The objectives of these two lessons are
Many tools are used in the life sciences to perform data analysis and visualisation tasks.
The most well-known one is Excel, but you may have encountered SPSS.
Another much-used tool is R, a programming language with embedded statistics and graphics support.
Besides its free nature, R is very popular because it
You work with 4 panels in the workbench:
Use the console to do basic calculations, try pieces of code.
Use the code editor to work on
R, like any programming language, supports all math operations in the way you would expect them:
+ is ‘plus’, as in 2 + 2 = 4
- ditto, subtract, as in 2 - 2 = 0
* multiply
/ divide
^ exponent
Remember that \(\sqrt{n} = n^{0.5}\)
Use parentheses () for grouping parts of equations.
All “operators” adhere to the standard mathematrical precedence rules (PEMDAS):
Parentheses (simplify inside these)
Exponents
Multiplication and Division (from left to right)
Addition and Subtraction (from left to right)
In the console, calculate the following:
\(31 + 11\)
\(66 - 24\)
\(\frac{126}{3}\)
\(12^2\)
\(\sqrt{256}\)
\(\frac{3*(4+\sqrt{8})}{5^3}\)
\(31 + 11 = 42\)
\(66 - 24 = 42\)
\(\frac{126}{3} = 42\)
\(12^2 = 144\)
\(\sqrt{256} = 16\)
\(\frac{3 * (4 + \sqrt{8})}{5^3} = 0.1638823\) – in R: (3 * (4 + 8^0.5))/5^3
You have seen that R works with numbers. There are a few more types of data:
numeric: numbers with a decimal part:
- 3.123 & 5000.0 & 4.1E3
integer: numbers without a decimal part:
- 1 & 0 & 2999
logical: also called boolean values:
- true or false
character: text, always between quotes:
- "hello R" or "GATC"
factor: nominal and ordinal scales (not dealt with here)
Simple calculations is not the core business of R.
You want to do more complex things and this is where functions come in.
A function is a named piece of functionality that you can execute by typing its name, followed by a pair of parentheses. Within these parentheses, you can pass data for the function to work on. Functions may or may not return a value
It has this general form:
\[function\_name(argument, argument, ...)\]
sqrt()You have already seen that the square root can be calculated as \(n^{0.5}\).
However, there is also a function for it: sqrt(). It returns the square root of the given number, e.g. sqrt(36)
[1] 6
[1] 6
paste()The paste function can take any number of arguments and returns them, combined into a single text. You can also supply a separator using sep=<separator string>:
[1] "1---2---3"
The quotes around the three dashes"---" indicate it is text data.
Type ?function_name in the console to get help on a function.
For instance, typing ?sqrt will give the help page of the square root function.
Scroll down in the help page to see example usages of the function.
paste. There are two variants of this function.
"welcome to R" from these arguments: "welcome", "to", "R"abs function do?
abs(-20) and what is abs(20)?c function do?
1, 3 and "a" as arguments, or 1, 2 and 3?x <- 42 is used to create a variable x with a value of 42.<- in R, so “x <- 42” is the same as “x = 42”paste():[1] "The DNA letters are: GATC"
Create these three variabless: x=20, y=10 and z=3.
Next, calculate the following with these variables:
pi in R)c()The function c() generates a vector from the passed arguments.
The data type will be the one that best fits the arguments.
[1] 1 2 5
[1] "1" "a" "2"
[1] 1 1 0
[]nucleotides <- c("A", "C", "G", "T")
nucleotides
[1] "A" "C" "G" "T"
nucleotides[3] # fetch the third
[1] "G"
nucleotides[2]
[1] "C"Note the use of # to put regular text within code: this is code comment and is ignored by R.
[1] "G" "A" "T"
or use a series of indices with a:b – a through b
[1] "A" "C" "G"
Given the letters of the alphabet, available as the variable letters in R, make a selection that gives you this output:
paste and use the collapse argument):One of the nice things about indexing is that you can use logical indexing to select elements you are interested in.
[1] 2 4
[1] FALSE FALSE TRUE
[1] 4
[1] FALSE TRUE FALSE
[1] 1
[1] TRUE TRUE FALSE
[1] 2 1
[1] FALSE TRUE TRUE
[1] 1 4
> (greater then)>= (greater then or equal)< (less then)<= (less then or equal)== (equal)& (AND)| (OR)! (NOT)Here is a named vector to demonstrate logical indexing further:
grades <- c(3.4, 5.6, 8.3, 2.9, 6.8)
# Attach names to the vector for readable display
names(grades) <- c("Ian", "Mark", "Lara", "Rowan", "Iris")
grades Ian Mark Lara Rowan Iris
3.4 5.6 8.3 2.9 6.8
Ian Mark Lara Rowan Iris
3.4 5.6 8.3 2.9 6.8
Ian Mark Lara Rowan Iris
FALSE TRUE TRUE FALSE TRUE
Mark Lara Iris
5.6 8.3 6.8
Lara
8.3
Mark Iris
5.6 6.8
Given these vectors, representing a hypothetical controlled drug test experiment:
participant_ids <- c("P01", "P02", "P03", "P04", "P05", "P06")
placebo_given <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
patient_responses <- c(76, 44, 38, 92, 28, 81)
names(placebo_given) <- participant_ids
names(patient_responses) <- participant_ids
placebo_given P01 P02 P03 P04 P05 P06
FALSE TRUE TRUE FALSE TRUE FALSE
P01 P02 P03 P04 P05 P06
76 44 38 92 28 81
Copy the code from the previous slide and, using only logical selections, select
mean())max()) of the patients who were given a placeboThe data is the analysis result of a set of proteins encoded on the Staphylococcus aureus genome.
It has been run through the sequence analysis tools SignalP, LipoP and TMHMM.
I modified the Excel sheet a bit and then exported the data as a plain text file. The data is now in a tab-delimited file called protein_processing_pred.csv.
When opened in a simple text editor (e.g. Notepad) it looks like this.
Here is how you load data files in R
No mouse clicks!
protein_data <- read.table(file = "data/protein_processing_pred.csv",
header = TRUE,
sep = ";",
dec = ",",
as.is = c(1, 2, 3, 21)) Don’t worry - you do not need to be able to do this for the test. The next slide explains what happens.
protein_data <- read.table(file = "data/protein_processing_pred.csv",
header = TRUE,
sep = ";",
dec = ",",
as.is = c(1, 2, 3, 21))read.table is the function you use to load data from a file. It accepts many optional arguments and only one mandatory - the file name.sep = ";" and header=TRUE are the function arguments that specify where the file is and how it should be loaded.protein_data.?read.table if you are interested in the details. [1] "FASTA_Header" "Sequence" "SignalP_name" "SignalP_Cmax"
[5] "SignalP_Cmax_pos" "SignalP_Ymax" "SignalP_Ymax_pos" "SignalP_Smax"
[9] "SignalP_Smax_pos" "SignalP_Smean" "SignalP_D" "SignalP_YesNo"
[13] "SignalP_Dmaxcut" "SignalP_Networks_used" "LipoP_Localisation" "LipoP_Score"
[17] "TMHMM_Length" "TMHMM_ExpAA" "TMHMM_First60" "TMHMM_PredHel"
[21] "TMHMM_Topology"
FASTA_Header
1 gi_15925706_ref_NP_373239.1_ chromosomal replication initiator protein [Staphylococcus aureus subsp. aureus N315]
2 gi_15925707_ref_NP_373240.1_ DNA polymerase III, beta chain [Staphylococcus aureus subsp. aureus N315]
3 gi_15925708_ref_NP_373241.1_ conserved hypothetical protein [Staphylococcus aureus subsp. aureus N315]
4 gi_15925709_ref_NP_373242.1_ DNA repair and genetic recombination protein [Staphylococcus aureus subsp. aureus N315]
5 gi_15925710_ref_NP_373243.1_ DNA gyrase subunit B [Staphylococcus aureus subsp. aureus N315]
6 gi_15925711_ref_NP_373244.1_ DNA gyrase subunit A [Staphylococcus aureus subsp. aureus N315]
Sequence
1 MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLIDDIQFIQNKVQTQEEFFYTFNELHQNNKQIVISSDRPPKEIAQLEDRLRSRFEWGLIVDITPPDYETRMAILQKKIEEEKLDIPPEALNYIANQIQSNIRELEGALTRLLAYSQLLGKPITTELTAEALKDIIQAPKSKKITIQDIQKIVGQYYNVRIEDFSAKKRTKSIAYPRQIAMYLSRELTDFSLPKIGEEFGGRDHTTVIHAHEKISKDLKEDPIFKQEVENLEKEIRNV
2 MMEFTIKRDYFITQLNDTLKAISPRTTLPILTGIKIDAKEHEVILTGSDSEISIEITIPKTVDGEDIVNISETGSVVLPGRFFVDIIKKLPGKDVKLSTNEQFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAELNKIMSDNEEDIDIFFASNQVLFKVGNVNFISRLLEGHYPDTTRLFPENYEIKLSIDNGEFYHAIDRASLLAREGGNNVIKLSTGDDVVELSSTSPEIGTVKEEVDANDVEGGSLKISFNSKYMMDALKAIDNDEVEVEFFGTMKPFILKPKGDDSVTQLILPIRTY
3 MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDAGSFLIIHQGEQ
4 MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKIEGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELGQISAVYLNDLAQYQRILKQKNNYLKQLQLGQKKDLTMLEVLNQQFAEYAMKVTDKRAHFIQELESLAKPIHAGITNDKEALSLNYLPSLKFDYAQNEAARLEEIMSILSDNMQREKERGISLFGPHRDDISFDVNGMDAQTYGSQGQQRTTALSIKLAEIELMNIEVGEYPILLLDDVLSELDDSRQTHLLSTIQHKVQTFVTTTSVDGIDHEIMNNAKLYRINQGEIIK
5 MVTALSDVNNTDNYGAGQIQVLEGLEAVRKRPGMYIGSTSERGLHHLVWEIVDNSIDEALAGYANKIEVVIEKDNWIKVTDNGRGIPVDIQEKMGRPAVEVILTVLHAGGKFGGGGYKVSGGLHGVGSSVVNALSQDLEVYVHRNETIYHQAYKKGVPQFDLKEVGTTDKTGTVIRFKADGEIFTETTVYNYETLQQRIRELAFLNKGIQITLRDERDEENVREDSYHYEGGIKSYVELLNENKEPIHDEPIYIHQSKDDIEVEIAIQYNSGYATNLLTYANNIHTYEGGTHEDGFKRALTRVLNSYGLSSKIMKEEKDRLSGEDTREGMTAIISIKHGDPQFEGQTKTKLGNSEVRQVVDKLFSEHFERFLYENPQVARTVVEKGIMAARARVAAKKAREVTRRKSALDVASLPGKLADCSSKSPEECEIFLVEGDSAGGSTKSGRDSRTQAILPLRGKILNVEKARLDRILNNNEIRQMITAFGTGIGGDFDLAKARYHKIVIMTDADVDGAHIRTLLLTFFYRFMRPLIEAGYVYIAQPPLYKLTQGKQKYYVYNDRELDKLKSELNPTPKWSIARYKGLGEMNADQLWETTMNPEHRALLQVKLEDAIEADQTFEMLMGDVVENRRQFIEDNAVYANLDF
6 MAELPQSRINERNITSEMRESFLDYAMSVIVARALPDVRDGLKPVHRRILYGLNEQGMTPDKSYKKSARIVGDVMGKYHPHGDSSIYEAMVRMAQDFSYRYPLVDGQGNFGSMDGDGAAAMRYTEARMTKITLELLRDINKDTIDFIDNYDGNEREPSVLPARFPNLLANGASGIAVGMATNIPPHNLTELINGVLSLSKNPDISIAELMEDIEGPDFPTAGLILGKSGIRRAYETGRGSIQMRSRAVIEERGGGRQRIVVTEIPFQVNKARMIEKIAELVRDKKIDGITDLRDETSLRTGVRVVIDVRKDANASVILNNLYKQTPLQTSFGVNMIALVNGRPKLINLKEALVHYLEHQKTVVRRRTQYNLRKAKDRAHILEGLRIALDHIDEIISTIRESDTDKVAMESLQQRFKLSEKQAQAILDMRLRRLTGLERDKIEAEYNELLNYISELETILADEEVLLQLVRDELTEIRDRFGDDRRTEIQLGGFEDLEDEDLIPEEQIVITLSHNNYIKRLPVSTYRAQNRGGRGVQGMNTLEEDFVSQLVTLSTHDHVLFFTNKGRVYKLKGYEVPELSRQSKGIPVVNAIELENDEVISTMIAVKDLESEDNFLVFATKRGVVKRSALSNFSRINRNGKIAISFREDDELIAVRLTSGQEDILIGTSHASLIRFPESTLRPLGRTATGVKGITLREGDEVVGLDVAHANSVDEVLVVTENGYGKRTPVNDYRLSNRGGKGIKTATITERNGNVVCITTVTGEEDLMIVTNAGVIIRLDVADISQNGRAAQGVRLIRLGDDQFVSTVAKVKEDAEDETNEDEQSTSTVSEDGTEQQREAVVNDETPGNAIHTEVIDSEENDEDGRIEVRQDFMDRVEEDIQQSSDEDEE
SignalP_name SignalP_Cmax SignalP_Cmax_pos SignalP_Ymax SignalP_Ymax_pos
1 gi_15925706_ref_NP_373239.1_ 0.111 59 0.101 41
2 gi_15925707_ref_NP_373240.1_ 0.168 39 0.179 39
3 gi_15925708_ref_NP_373241.1_ 0.109 13 0.098 33
4 gi_15925709_ref_NP_373242.1_ 0.131 36 0.133 15
5 gi_15925710_ref_NP_373243.1_ 0.124 16 0.096 25
6 gi_15925711_ref_NP_373244.1_ 0.115 35 0.104 29
SignalP_Smax SignalP_Smax_pos SignalP_Smean SignalP_D SignalP_YesNo SignalP_Dmaxcut
1 0.132 37 0.069 0.089 N 0.45
2 0.372 36 0.125 0.158 N 0.45
3 0.121 30 0.085 0.093 N 0.45
4 0.226 7 0.167 0.147 N 0.45
5 0.107 29 0.076 0.088 N 0.45
6 0.165 20 0.100 0.102 N 0.45
SignalP_Networks_used LipoP_Localisation LipoP_Score TMHMM_Length TMHMM_ExpAA TMHMM_First60
1 SignalP-TM CYT -0.200913 453 0.03 0.00
2 SignalP-TM CYT -0.200913 377 0.00 0.00
3 SignalP-TM CYT -0.200913 81 0.00 0.00
4 SignalP-TM CYT -0.200913 370 0.00 0.00
5 SignalP-TM CYT -0.200913 644 0.06 0.00
6 SignalP-TM CYT -0.200913 889 0.02 0.01
TMHMM_PredHel TMHMM_Topology
1 0 o
2 0 o
3 0 o
4 0 o
5 0 o
6 0 o
If you want to have a look at the data “spreadsheet-style”, you can type View(prot_data) in the Console. The Viewer will show the data associated with the variable in the editor panel.
protein_data is what is called a dataframe in R.$ to get hold of a single columnm [1] N N N N N N N N N N N N N N N Y N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
[48] N N N N N N N N N N N N N N N N N N N N N N N N N Y N N N Y Y N N N N N N Y N N Y N N N N N N
[95] N N N N N N N N N N N N N N N Y Y N N N N N N Y Y N Y Y N Y N N Y N Y Y Y N Y N N Y N Y Y N N
[142] Y Y Y Y Y Y Y Y Y Y Y Y N Y Y N Y Y Y Y N Y N N N N N N N Y N Y N N Y N N N Y N N Y N N N Y N
[189] Y N N N N Y Y N Y Y Y Y Y Y N N N N Y N N N Y N N Y N N Y Y Y Y Y N N Y N Y N N N Y Y Y Y Y Y
[236] Y N Y Y Y N N N N N N N N N Y N N N Y Y Y N N Y N N N N N N N N N N N N N N Y Y Y Y N Y Y Y Y
[283] Y N Y Y N N Y N N N N N Y Y N N N Y Y Y Y Y N N N Y Y Y Y Y N N N N N N N N Y N N N N Y Y N N
[330] N Y N Y Y Y N N N N N Y N N Y Y N Y N Y Y Y Y N N N N N N Y N Y Y N N Y N N N N Y N N N Y Y Y
[377] Y Y Y Y N Y Y Y N N N
Levels: N Y
N Y
253 134
Since it is a vector, you can use indexing on a column:
[1] N Y N Y Y Y N N N N N Y N N Y Y
Levels: N Y
Get the dimensions of the dataset:
[1] 387 21
This is a vector of two integers. Which one is the number of rows?
dataframe[rows, columns]studentsstudents <- data.frame(sid=paste0("S0", 1:5),
name=c("Mark", "Lynn", "Lianne", "Peter", "Rose"),
sex=factor(c("m", "f", "f", "m", "f"),
labels = c("female", "male")),
biology=c(5.6, 6.2, 7.9, 4.4, 9.1),
statistics=c(6.1, 5.1, 8.0, 4.7, 7.3),
informatics=c(6.3, 6.1, 7.7, 5.4, 9.5),
stringsAsFactors = F)
students sid name sex biology statistics informatics
1 S01 Mark male 5.6 6.1 6.3
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
4 S04 Peter male 4.4 4.7 5.4
5 S05 Rose female 9.1 7.3 9.5
sid name sex biology statistics informatics
1 S01 Mark male 5.6 6.1 6.3
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
4 S04 Peter male 4.4 4.7 5.4
5 S05 Rose female 9.1 7.3 9.5
[1] "Mark"
biology statistics informatics
2 6.2 5.1 6.1
sid name sex biology statistics informatics
1 S01 Mark male 5.6 6.1 6.3
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
4 S04 Peter male 4.4 4.7 5.4
5 S05 Rose female 9.1 7.3 9.5
[1] "Mark" "Lynn" "Lianne" "Peter" "Rose"
sid name sex biology statistics informatics
2 S02 Lynn female 6.2 5.1 6.1
sid name sex biology statistics informatics
1 S01 Mark male 5.6 6.1 6.3
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
4 S04 Peter male 4.4 4.7 5.4
5 S05 Rose female 9.1 7.3 9.5
All grades for Lynn:
biology statistics informatics
2 6.2 5.1 6.1
sid name sex biology statistics informatics
1 S01 Mark male 5.6 6.1 6.3
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
4 S04 Peter male 4.4 4.7 5.4
5 S05 Rose female 9.1 7.3 9.5
All grades for girls:
sid name sex biology statistics informatics
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
5 S05 Rose female 9.1 7.3 9.5
To get the student data into your session, type source("https://git.io/fjfMW") in the console. It is this file: https://raw.githubusercontent.com/MichielNoback/intro_R_lessons/gh-pages/data/intro_lesson_data.R
Type students or View(students) in the console to verify you have it
informatics gradesstatistics grade for Peter1: select all informatics grades: two alternatives
[1] 6.3 6.1 7.7 5.4 9.5
[1] 6.3 6.1 7.7 5.4 9.5
2: select the third and fourth row entirely: two alternatives
sid name sex biology statistics informatics
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
sid name sex biology statistics informatics
2 S02 Lynn female 6.2 5.1 6.1
3 S03 Lianne female 7.9 8.0 7.7
3: select the statistics grade for Peter: three alternatives
[1] 4.7
[1] 4.7
[1] 4.7
4: select the biology and statistics grades for the female students
biology statistics
2 6.2 5.1
3 7.9 8.0
5 9.1 7.3
5: select the student names where the biology grade is below 6
[1] "Mark" "Peter"
If you want to tag along, download and load the file as follows:
protein_data <- read.table(file = "https://git.io/fjfum",
header = TRUE,
sep = ";",
dec = ",",
as.is = c(1, 2, 3, 21))(if the short URL does not work, use
“https://raw.githubusercontent.com/MichielNoback/intro_R_lessons/gh-pages/data/protein_processing_pred.csv”)
Let’s investigate the relation Cmax vs Ymax of the SignalP analysis:
[1] 0.111 0.168 0.109 0.131 0.124 0.115
[1] 0.101 0.179 0.098 0.133 0.096 0.104
See http://www.cbs.dtu.dk/services/SignalP-4.1/output.php for a description of these analysis results.
Create variables for convenience, and plot:
plot(x = cmax, y = ymax,
xlab = "Cmax value", ylab="Ymax value",
pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))Or, even better, with a log transform:
plot(x = log2(cmax), y = log2(ymax),
xlab = "log2(Cmax value)", ylab="log2(Ymax value)",
pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))plot(x = log2(cmax), y = log2(ymax),
xlab = "log2(Cmax value)", ylab="log2(Ymax value)",
pch=19, cex = 0.8, col=rgb(0, 0, 1, 0.3))
model <- lm(log2(ymax) ~ log2(cmax))
abline(model, col = "red", lwd=2)What is expected of you at the test of this course:
You should be be able to
You will be allowed to use all R documentation as well as this presentation.
Download the assignment RMarkdown document from this location:
Long URL:
https://michielnoback.github.io/intro_R_lessons/final_assignment_R_BMR.Rmd
Web-page view:
https://michielnoback.github.io/intro_R_lessons/final_assignment_R_BMR.html
Save the file on your computer, open it in RStudio and then deal with the assignments. Put your solutions where it says
## YOUR CODE HERE.
You can process the document into Word form by klicking “Knit” Submit this Word document.