Chapter 10 Exercises

This chapter only contains exercises. The solutions are in the next chapter which has a numbering parallel to this one. To work on the exercises it is probably best to create either an RMarkdown Notebook (File → New File → R Notebook) or an RMarkdown document (File → New File → R Markdown… → choose Document as type and give it a title). A Notebook has the advantage that you can toggle the visibility of the code, but it has fewer output options (no pdf or Word). I strongly suggest you play around with both.

This is a link to an R cheat sheet that you could use for selecting the right function for a task.

If you want you can download this chapter as separate file and work in that. You can download it here. Before commencing you should put in a new Markdown header at the top of the file, with at least this contents:

---
title: "Exercise solutions DAVuR1"
author: "<YOUR NAME>"
---

Plotting rules

With all plots, take care to adhere to the rules regarding titles and other decorations. Tip: the site Quick-R has nice detailed information with examples on the different plot types and their configuration. Especially the section on plotting is helpful for these assignments.

10.1 Getting started

10.1.1 Install the tools

Install these tools on your own PC or laptop, in this order:

  1. R itself https://cran.r-project.org/bin/windows/base/
  2. RStudio https://rstudio.com/products/rstudio/download/ (choose the free version of course)
  3. [Optionally] If you want to generate pdf documents you should also install a Latex version: MikTeX or TinyTex for Windows or MacTeX on MacOS. If you want to keep things simple I suggest you stick to HTML and MS Word output (Word can also export to PDF).

10.2 Toolbox

10.2.1 Résumé

Create an R Markdown document called Résumé.Rmd and create your Curriculum Vitae using Markdown syntax. Use the R Markdown Reference Guide or Cheat Sheet or this Markdown Cheatsheet (you don’t need R for your résumé so markdown alone is enough).

10.3 Basic R

10.3.1 Math in the console

In the console, calculate the following:

\(31 + 11\)

\(66 - 24\)

\(\frac{126}{3}\)

\(12^2\)

\(\sqrt{256}\)

\(\frac{3*(4+\sqrt{8})}{5^3}\)

10.3.2 First look at functions

A
View the help page for paste(). There are two variants of this function. Which? And what is the difference between them? Use both variants to generate exactly this message "welcome to R" from these arguments: "welcome ", "to ", "R"

B
What does the abs function do? What is returned by abs(-20) and what is abs(20)?

C
What does the c function do? What is the difference in returned value (result) of c() when you combine either 1, 3 and "a" as arguments, or 1, 2 and 3? Use the function class() to answer this.

D
What does the function install.packages() do? Use it to install the package RColorBrewer. We’ll have a look at this package later on.

E
Type ?round in the console. There are several functions related to rounding numbers. Give this vector:

numberz <- c(-1.33, -1.55239, 0.4432, 0.5000001, 1.98765)

generate these -exact- outputs using ons of the listed functions:

[1] -1 -1  1  1  2
[1] -1 -1  0  0  1
[1] -1.330 -1.552  0.443  0.500  1.988
[1] -2 -2  0  0  1

F
Using the on the website provided cheat sheet, find the String manipulation function that will convert this vector of author names

names <- c("William Shakespeare", "Agatha Christie", "J. K. Rowling")

into this:

[[1]]
[1] "William"     "Shakespeare"

[[2]]
[1] "Agatha"   "Christie"

[[3]]
[1] "J."      "K."      "Rowling"

(The output is a list, a data type that we’ll see later on)

G

Using the on the website provided cheat sheet, find the data selection function that will convert this vector of bird names

birds <- c("robin", "wagtail", "blackbird", "robin", "blackbird", "buzzard")
unique(birds)

10.3.3 Variables

Create three variables with the given values - x=20, y=10 and z=3. Next, calculate the following with these variables:

  1. \(x+y\)
  2. \(x^z\)
  3. \(q = x \times y \times z\)
  4. \(\sqrt{q}\)
  5. \(\frac{q}{\pi}\) (pi is simply pi in R)
  6. \(\log_{10}{(x \times y)}\)

10.3.4 Vectors

Circles

The circumference of a circle is \(2\pi\cdot r\), its surface \(4\pi \cdot r^2\) and its volume \(4/3 \pi\cdot r^3\). Given this vector of circle radiuses,

radiuses <- c(0, 1, 2, pi, 4)

A
Calculate their cirumference.

B
Calculate their surface.

C
Calculate their volume.

Creating vectors

Create the following vectors, as efficiently as possible. The functions rep(), seq(), paste0() and the colon operator : can be used, in any combination.

A
[1] 1 2 5 1 2 5

B
[1] 9 9 9 8 8 8 7 7 7 6 6 6 5 5 5

C
[1] 1 1 1 4 4 4 9 9 9 1 1 1 4 4 4 9 9 9

D
[1] "1a" "2b" "3c" "4d" "5e" "1a" "2b" "3c" "4d" "5e"

E
[1] "0z" "0.2y" "0.4x" "0.6w" "0.8v" "1u"

F
[1] "505" "404" "303" "202" "101" "000"

G [Challenge]
[1] "0.5A5.0" "0.4B4.0" "0.3C3.0" "0.2D2.0" "0.1E1.0"

10.3.5 Stair walking and heart rate

The vectors below hold data for a staircase walking experiment. A subject of normal weight and height was asked to ascend a (long) stairs wearing a heart-rate monitor. The subjects’ heart was registered for different step heights. Create a line plot showing the dependence of heart rate (y axis) on stair height (x axis).

#number of steps on the stairs
stair_height <- c(0, 5, 10, 15, 20, 25, 30, 35)
#heart rate after ascending the stairs
heart_rate <- c(66, 65, 67, 69, 73, 79, 86, 97)

10.3.6 More subjects

The experiment from the previous question was extended with three more subjects. One of these subjects was like the first of normal weight, whereas the two others were obese. The data are given below. Create a single scatter plot with connector lines between the points showing the data for all four subjects. Give the normal-weighted subjects a green line and symbol and the obese subjects a red line and symbol.
You can add new data series to a plot by using the points(x, y) function. Use the ylim() function to adjust the Y-axis range.

#number of steps on the stairs
stair_height <- c(0, 5, 10, 15, 20, 25, 30, 35)
#heart rates for subjects with normal weight
heart_rate_1 <- c(66, 65, 67, 69, 73, 79, 86, 97)
heart_rate_2 <- c(61, 61, 63, 68, 74, 81, 89, 104)
#heart rates for obese subjects
heart_rate_3 <- c(58, 60, 67, 71, 78, 89, 104, 121)
heart_rate_4 <- c(69, 73, 77, 83, 88, 96, 102, 127)

10.3.7 Chickens on a diet

The body weights of chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. In the experiment there were four groups of chicks on different protein diets. Here are the data for the first four chicks. Chick one and two were on diet 1 and chick three and four were on diet 2. Create a single line plot showing the data for all four chicks. Give each chick its own color.

# chick weight data
time <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21)
chick_1 <- c(42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205)
chick_2 <- c(40, 49, 58, 72, 84, 103, 122, 138, 162, 187, 209, 215)
chick_3 <- c(42, 53, 62, 73, 85, 102, 123, 138, 170, 204, 235, 256)
chick_4 <- c(41, 49, 61, 74, 98, 109, 128, 154, 192, 232, 280, 290)

10.3.8 Chicken bar plot

With the data from the previous question, create a bar plot of the maximum weights of the chicks.

10.3.9 Discoveries

The R language comes with a wealth of datasets for you to use as practice materials. We will see several of these. One of these datasets is The Time-Series dataset called discoveries holding the numbers of “great” inventions and scientific discoveries in each year from 1860 to 1959. Type its name in the console to see it. Create plot(s) answering these questions:

A
What is the frequency distribution of the number of discoveries per year? Use the barplot() and table() functions for this.

B
What is the 5-number summary of discoveries per year?

C
What is the trend over time for the numbers of discoveries per year?

PS: This is actually not a simple vector but a vector with some time-related attributes. It is called a Time-Series (a ts class), but this does not really matter for this assignment.

10.3.10 Lung cancer

The R datasets package has three related timeseries datasets relating to lung cancer deaths. These are ldeaths, mdeaths and fdeaths for total, male and female deaths respectively. Create a line plot showing the monthly mortality holding all three of these datasets. Use the legend() function to add a legend to the plot, as demonstrated in this example:

t <- 1:5
y1 <- c(2, 3, 5, 4, 6)
y2 <- c(1, 3, 4, 5, 7)
plot(t, y1, type = "b", ylab = "response", ylim = c(0, 8))
points(t, y2, col = "blue", type = "b")
legend("topleft", legend = c("series 1", "series 2"), col = c("black", "blue"), pch = 1, lty = 1)

A
Create the mentioned line plot. Do you see trends and/or patterns and if so, can you explain these?

B
Create a combined boxplot of the three time-series. Are there outliers? If so, can you figure out when this occurred, and why?

10.4 Complex datatypes

This section serves you some datatype challenges.

10.4.1 Creating factors

A
Given this vector:

animal_risk <- c(2, 4, 1, 1, 2, 4, 1, 4, 1, 1, 2, 1)

and these possible levels: 1: harmless 2: risky 3: dangerous 4: deadly

Create a factor from this data and then barplot the result.

B
Given this data, a simulation of wealth distribution of “poor”, “middle class”, “wealthy” "rich:

set.seed(1234)
wealth_male <- sample(x = letters[1:4], 
                 size = 1000,
                 replace= TRUE, 
                 prob = c(0.7, 0.17, 0.12, 0.01))
wealth_female <- sample(x = letters[1:4], 
                 size = 1000,
                 replace= TRUE, 
                 prob = c(0.8, 0.15, 0.497, 0.003))

Create a factor from these two and report the cumulative percentage of its individual levels starting at the most abundant level, combined for male and female. Hint: use table() and prop.table().

Next, create a side-by-side barplot of this data. Don’t forget the legend!

10.4.2 A dictionary with a named vector

Almost all programming languages know the (hash)map / dictionary data structure storing so-called “key-and-value” pairs. They make it possible to “look up” the value belonging to a “key”. That is where the term dictionary comes from. A dictionary holds keys (the words) and their meaning (values). R does not have a dictionary type but you could make a dict-like structure using a vector with named elements. Here follows an example.

If I wanted to create and use a DNA codon translation table, and use it to translate a piece of DNA, I could do something like what is shown below (there are only 4 of the 64 codons included). See if you can figure out what is going on there

## define codon table as named vector
codons <- c("Gly", "Pro", "Lys", "Ser")
names(codons) <- c("GGA", "CCU", "AAA", "AGU")

## the DNA to translate
my_DNA <- "GGACCUAAAAGU"
my_prot <- ""
## iterate the DNA and take only every third position
for (i in seq(1, nchar(my_DNA), by=3)) {
    codon <- substr(my_DNA, i, i+2);
    my_prot <- paste(my_prot, codons[codon])
}
print(my_prot)
## [1] " Gly Pro Lys Ser"

A
Make a modified copy of this code chunk in such a way that no spaces are present between the amino acid residues (use help on paste() to figure this out) and that single-letter codes of amino acids are used instead of three-letter codes.

B
[Challenge] Here is a vector called nuc_weights. It holds the weights for the nucleotides A, C, G and U respectively. Make it a named vector, iterate my_DNA from the above code chunk and calculate its molecular weight.

nuc_weights <- c(491.2, 467.2, 507.2, 482.2)

10.4.3 Protein concentrations with Lowry

The Lowry method is a widely used spectroscopic method to quantify protein amounts in solutions. Here are some real Lowry results for the quantification of proteins: The calibration curve is made using BSA (bovine serum albumin) as a standard. The concentrations are in mg/ml:

bsa_conc <- c(0,    0,  0.025,  0.025,  0.075,  0.075,  0.125,  0.125)

Note that the duplos are all put into one vector. We’ll deal with that later.

These are the obtained absorption values at 750 nm (again for the duplos):

OD750 <- c(0.063,   0.033,  0.16,   0.181,  0.346,  0.352,  0.491,  0.488)

A

Combine these two vectors in a dataframe and assign to a variable with name dilution. Do not add column names yet!

B

Now add the appropriate column names: prot_conc and absorption.

C

You forgot to include the last datapoints (again these are duplos):

bsa_conc2 <- c(0.175,   0.175,  0.25,   0.25)
OD750_2 <- c(0.597, 0.595,  0.743,  0.742)

Generate a dataframe assigned to the variable df_temp from these datapoints. This time, add the appropriate column names with creation of the dataframe. be sure you use exactly the same comlumn names.

D

Now add the second dataframe df_temp to dataframe with name dilution and assign it to the name dilution (thus overwrite the original variable):

E

Generate a line plot with the following properties:

  • Title: “Absorbance as a function of protein concentration”.
  • Axis labels with clear indicated quantities and units.
  • Datapoints and connector lines should be visible and have a blue color.
  • Y-axis lower limit should be 0 and the upper limit should be 1.
  • The plot character (symbol) should be a filled circle.

Use col = rgb(0, 0, 1, 0.5) to get transparent blue plot symbols.

F

Generate a new dataframe from the dataframe dilution, now with the duplo measurements side-by-side. So, instead of 2 you will now have 4 columns.

  • Fist generate a dataframe even using the even numbered rows.
  • Then generate a dataframe oddusing the odd numbered rows.
  • Finally combine these two new dataframes into one and assign it to the variable dilution_duplo.

Hint: use subsetting with TRUE and FALSE and vector cycling.

G

The result will be a dataframe with two columns named prot_conc and two columns named absorption. Delete the second column named prot_conc (this is an exact copy after all!) and rename the column names to prot_conc, abs1 and abs2.

H

Calculate the mean of the two abs measurements and add it as a column named mean_abs:

I

Generate a bar plot with the following properties:

  • Title: “Absorbance versus protein concentration”.
  • Axis labels with clear indicated quantities and units.
  • Y-axis lower-limit should be 0, upper-limit should be 1.
  • Plot the mean absorbance on the Y-axis.
  • Bars should have a green color.

10.4.4 HPLC data

The data below are from an educational experiment “Thymol quantification in Thyme”. This data are obtained by a reverse-phase HPLC method using thymol in a standard curve. Peak areas were calculated using an integrator device. Note that peak areas are dimensionless. The concentration thymol is in μg/ml.

  • co = concentration
  • pa = peak area
co <- c(100,    100,    75, 75, 50, 50, 25, 25, 10, 10, 5, 5)
pa <- c(1969017,    1858012,    1399762,    1449423,    963014, 832137, 467856, 562012, 200123, 145545, 94567,  64752)

A

Add the vectors to a dataframe named hplc_data. Use appropriate and clear column names.

B

The concentrations are in descending order. Use order() to sort the dataframe so that the concentrations are in ascending order.

C

Generate a scatter plot with the following settings: - Title: “Peak area as a function of thymol concentration” - Axis labels with clear indicated quantities and units. - Datapoints should be visible (no connector lines). - Datapoints should be blue - Add a red colored linear regression line with a lineweigth of 2.

10.4.5 Airquality

The airquality dataset is also one of the datasets included in the datasets package. We’ll explore this for a few questions.

A
Create a scatterplot of Temperature as a function of Solar radiation. Is there, as you might naively expect, a strong correlation? You could use cor.test() to find out. Add a linear model using lm() to extend your plot.

B
Create a boxplot-series of Temp as a function of Month (use ?boxplot to find out how this works). What appears to be the warmest month?

C
What date (day/month) has the lowest recorded temperature? Which the highest? Please give temperature values in Celsius, not Fahrenheit! (Yes, this is an extra challenge!)

D
Create a histogram of the wind speeds, and add a thick blue vertical line for the value of the mean and a fat red line for the median (use abline() for this).

E
Use the pairs() function with argument panel = panel.smooth to plot all pairwise correlations between Ozone, Solar radiation, Wind and Temperature. Which pair shows the strongest correlation in your opinion? Verify this using the cor() function after removing incomplete cases. Create a separate well annotated scatterplot of this pair.

[Challenge] Through the summary() function you can obtain the R-squared value -the strength of the correlation- like this: summary(your_linear_model)$r.squared. Can you place it within the plot using the text() function?

10.4.6 File reading practice

The files in this online folder on github directory contain some (simulated) gene-array data. The dataset contains a selection of induced transcripts after some stimulus. The columns represent:

  • the mRNA entry
  • fold induction after some stimulus
  • the protein entry number
  • protein length in number of amino acids of the corresponding NP entry
  • the protein family the protein belongs to
  • the predicted cellular localization

Whatever the contents of a file, you always need to address (some of) these questions:

  • Are there comment lines at the top?
  • Is there a header line with column names?
  • What is the column separator?
  • Are there quotes around character data?
  • How are missing values encoded?
  • How are numeric values encoded?
  • What is the type in each column?

Also: read the help for the read.table function carefully.

Read the content of the each file in the file_reading directory into a dataframe object and assign it to the variable name df. There are 15 different files to practice your file loading skills.

You don’t need to download them manually; simply use this:

download.file(url = "https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/file_reading/file01.txt",
              destfile = "file01.txt")

Then you’ll have the file in the current working directory. Replace the file number as needed.

10.4.7 Bird observations

You will explore a bird observation dataset, downloaded from GOLDEN GATE AUDUBON SOCIETY. This file lists bird observations collected by this bird monitoring group in the San Francisco Bay Area. I already cleaned it a bit and placed it here: data/Observations-Data-2014.csv.

You can download it as follows:

file_name <- "Observations-Data-2014.csv"
remote_url <- paste0("https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/", file_name)

download.file(url = remote_url, destfile = file_name)

Load the observation data into R and assign it to a variable called bird_obs.

From here on, it is assumed that you have the dataframe bird_obs loaded. This series of exercises deals with cleaning and transforming data, and exploring a cleaned dataset using basic plotting techniques and descriptive statistics.

A
First, explore the raw data as they are.

  • What data on bird observations were recorded (i.e. what kind of variables do you have)?
  • What did R do to the original column names?
  • Are all column names clear to you?

B
How many bird observations were recorded?

C
The column holding observation “Number” is actually not a number. Into what type has R converted it?

D
Convert the “Number” column into an integer column using as.integer(), but assign it to a new column called “Count” (i.e. do not overwrite the original values). Compare the first 50 values or so of these two columns. What happened to the data? Is this OK?

E
The previous question has shown that converting factors to numbers is a bit dangerous. It is often easiest to convert characters to numbers. The best way to do this is by using the as.is = c(<column indices>) argument for the read.table() function.

So, which columns should be loaded as real factor data and which as plain character data? Use read.table() and the as.is argument to reload the data, and then transform the Number column to integer again as Count.

F
Compare the first 50 values of the Number and Count columns again. Has the conversion succeeded? How many Number values could not be transformed into an integer value? Hint: use is.na()

G
Explore the sighting counts:

  • Report the sighting with the maximum number of birds in a single sighting?
  • What is the mean sighting count?
  • What is the median of the sighting count?

H
Is the Count variable a normal distributed value? You can use hist(...), table() or plot(density(...)) to explore this further.

I
Explore the species constitution:

  • How many different species were recorded?
  • How many genera do they constitute?
  • What species from the genus “Puffinus” have been observed?

Hint: use the function unique() here.

[Challenge] The number of unique “common names” should equal the number of unique “scientific names” since each species has a common name and a scientific name. The scientific name can be generated from Genus and Species. Investigate this. Is it correct? If not, what went wrong?

J [Challenge]
This is a challenge exercise for those who like to grind their brains! Think of a strategy to “rescue” the NAs that appear after transforming “Number” to “Count”. Hint: use gsub() orgrep()

10.5 Functions

10.5.1 The cut() function I

Suppose we have a gene expression data set of gene array data. The vector below contains the fold-change of genome-wide mRNA expression after exposure of cells to dihydrotestosterone. A value above 1 means an induction of gene expression, a value below 1 means a repression of gene expression. However, we know that the accuracy of the fold induction measurements is approximately plus or minus 0.1. This means that we can set the breakpoints as follows:

  • range 0 < x < 0.9 is a repression
  • range 0.9 <= x < 1.1 is no net change
  • range x >= 1.1 is an induction
fold_change <- c(0.10, 8.50, 0.95, 56.08, 0.90, 99.15, 1.00, 0.01, 0.65, 1.10, 2.34, 62.2)

A

Use the cut() function to convert the data into a factor that tells you if there is an induction, no net change (no_change) or a repression. Make sure that you get an ordered factor. Assign the factor to the variable induct_type.

B

Create a data frame of both vectors and assign it to induction. Use fold_change and induction_type as column names.

C

Create a well annotated plot of the induction type that shows the number of genes that are repressed, have no change or are induced upon dihydrotestosterone treatment. Bars should have a blue color. The bars should be named repressed, no change, induced.

10.5.2 The cut() function II

We will use the cut() function on a slightly larger data set with fictitious gene array data. The text file is located here.You can simply use that as argument to read.table():

remote_location <- "https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/gene_expression_data.txt"
read.table(remote_location,  )

The columns are: - transcript_ID: NM_ entry number of the reference mRNA.
- fold_change: the same rules as in the previous exercise for repression, no change and induction apply here.
- family: the protein family that the gene encodes for.
- location: the cellular localization of the gene product.

A

Load the file and assign it to the variable gene_array.

B

Add a column induction_repression to the data frame based on the fold_induction column. Use the cut function to accomplish this. The same rules apply for the breakpoints.

C

Create a well annotated plot of the induction type that shows the number of genes that are repressed, have no change or are induced upon dihydrotestosterone treatment. Bars should be blue. The bars should be named repressed, no change, induced. Make sure that the bars are within the axis borders.

10.6 Regular Expressions

10.6.1 Restriction enzymes

A
The restriction enzyme PacI has the recognition sequence “TTAATTAA”. Define (at least) three alternative regex patterns that will catch these sites.

B
The restriction enzyme SfiI has the recognition sequence “GGCCNNNNNGGCC”. Define (at least) three alternative regex patterns that will catch these sites.

10.6.2 Prosite Patterns

A
The Prosite pattern PS00211 (ABC-transporter-1; https://prosite.expasy.org/PS00211) has the pattern:
“[LIVMFYC]-[SA]-[SAPGLVFYKQH]-G-[DENQMW]-[KRQASPCLIMFW]-[KRNQSTAVM]-[KRACLVM]-[LIVMFYPAN]-{PHY}-[LIVMFW]-[SAGCLIVP]-{FYWHP}-{KRHP}-[LIVMFYWSTA].” Translate it into a regex pattern. Info on the syntax is here: https://prosite.expasy.org/prosuser.html#conv_pa

B
The Prosite pattern PS00018 (EF-hand calcium-binding domain; https://prosite.expasy.org/PS00018) has the pattern: “D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)- [DE]-[LIVMFYW].” Translate it into a regex pattern.

You could exercise more by simply browsing Prosite. Test your pattern by fetching the proteins referred to within the Prosite pattern details page.

10.6.3 Fasta Headers

The fasta sequence format is a very common sequence file format used in molecular biology. It looks like this (I omitted most of the actual protein sequences for better representation):

>gi|21595364|gb|AAH32336.1| FHIT protein [Homo sapiens]
MSFRFGQHLIK...ALRVYFQ
>gi|15215093|gb|AAH12662.1| Fhit protein [Mus musculus]
MSFRFGQHLIK...RVYFQA
>gi|151554847|gb|AAI47994.1| FHIT protein [Bos taurus]
MSFRFGQHLIK...LRVYFQ

As you can see there are several distinct elements within the Fasta header which is the description line above the actual sequence: one or more database identification strings, a protein description or name and an organism name. Study the format - we are going to extract some elements from these fasta headers using the stringr package. Install it if you don’t have it yet.

Here is a small example:

library(stringr)
hinfII_re <- "GA[GATC]TC"
sequences <- c("GGGAATCC", "TCGATTCGC", "ACGAGTCTA")
str_extract(string = sequences,
            pattern = hinfII_re)
## [1] "GAATC" "GATTC" "GAGTC"

Function str_extract() simply extracts the exact match of your regex (shown above). On the other hand, function str_match() supports grouping capture through bounding parentheses:

phones <- c("+31-6-23415239", "+49-51-55523146", "+31-50-5956566")
phones_re <- "\\+(\\d{2})-(\\d{1,2})" #matching country codes and area codes
matches <- str_match(phones, phones_re) 
matches
##      [,1]     [,2] [,3]
## [1,] "+31-6"  "31" "6" 
## [2,] "+49-51" "49" "51"
## [3,] "+31-50" "31" "50"

Thus, each set of parentheses will yield a column in the returned matrix. Simply use its column index to get that result set:

matches[, 2] ##the country codes
## [1] "31" "49" "31"

Now, given the fasta headers in ./data/fasta_headers.txt which you can simply load into a character vector using readLines(), extract the following.

A
- Extract all complete organism names.
- Extract all species-level organism names (omitting subspecies and strains etc).

B
Extract all first database identifiers. So in this header element >gi|224017144|gb|EEF75156.1| you should extract only gi|224017144

C
Extract all protein names/descriptions.

10.7 Scripting

This section serves you some exercises that will help you improve your function-writing skills.

10.7.1 Illegal reproductions

As an exercise, you will re-invent the wheel here for some statistical functions.

The mean

Create a function, my_mean(), that duplicates the R function mean(), i.e. calculates and returns the mean of a vector of numbers, without actually using mean().

Standard deviation

Create a function, my_sd(), that duplicates the R function sd(), i.e. calculates and returns the standard deviation of a vector of numbers, without actually using sd().

Maximum

Suppose we have some random numbers:

set.seed(123) 
my_nums <- sample(10000, 1000)

The use of set.seed() assures that the generated random numbers will always be the same, for everyone who runs it, every time.

Use a for loop and if/else flow control to find the largest number in my_nums. Do not use the build in max() or sort() functions. Put this in a function with the name my_max and demonstrate its use. You may use the build-in max() function to verify your result.

10.7.1.1 Find match locations

Create a function that will report the match locations some value in a given vector. Name it where_is_it(). Use only for() and if/else to get to your solution, no functions except c().

So, this snippet should return the given expected reult

x <- c("Pro", "Glu", "Lys", "Pro", "Ser")
where_is_it(x, "Pro")
##expected:
#[1] 1 4

Median

[Challenge] Create a function, my_median(), that duplicates the R function median(), i.e. calculates and returns the median of a vector of numbers. This is actually a bit harder than you might expect. Hint: use the sort() function.

10.7.2 Count odd and even numbers

Create a function, count_odd_even(), that counts the number of even and odd numbers in an input vector. It should return the result as a named vector. Do not make use of R base functions.

10.7.3 Add column compared to mean

Create a function that accepts as input:

  • a dataframe
  • a column name. This column name should refer to a numeric column

The function should return a copy of this dataframe with an extra column attached. The column with name ‘compared_to_mean’ should have values “greater” (greater than mean) or “less” (less than or equal to the mean) when compared to the indicated column. Apply a check to the users’ input and stop with an appropriate error message if something is not right (not a dataframe, column does not exist, not a numeric column).

10.7.4 Interquantile ranges

Create a function that will calculate a custom “interquantile range”. The function should accept three arguments: a numeric vector, a lower quantile and an upper quantile. It should return the difference (range) between these two quantile values. The lower quantile should default to 0 and the higher to 1, thus returning max(x) minus min(x). The function therefore has this “signature”:

interquantile_range <- function(x, lower = 0, higher = 100) {}

Perform some tests on the arguments to make a robust method: are all arguments numeric?

To test you method, you can compare interquantile_range(some_vector, 0.25, 0.75) with IQR(some_vector) - they should be the same.

10.7.5 Vector distance

Create a function, distance(p, q), that will calculate and return the Euclidean distance between two vectors of equal length. A numeric vector can be seen as a point in multidimensional space. Euclidean distance is defined as

\[d(p, q) = \sqrt{\sum_{i = 1}^{n}(q_i-p_i)^2}\] Where p and q are the two vectors and n the length of the two vectors.
You should first perform a check whether the two vectors are of equal length and both of type numeric or integer. If not, the function should abort with an appropriate error message.

Other distance measures

Extend the function of the previous assignment in such a way that a third argument is accepted, method =, which defaults to “euclidean”. Other possible distance measures are “Manhattan” (same as “city block” and “taxicab”) and [Challenge] Pearson correlation.

Look up the equations for these in Wikipedia or some other place.

10.7.6 Quadratic equation solver

Write a function that will use the quadratic equation (ABC formula) to solve an equation in the form \(ax^2 + bx + c = 0\). Call the function abc_solver. Use flow control to perform the following:

  • First calculate the discriminant (you should know the equation from your math lessons). The website of Khan Academy (https://www.khanacademy.org/) gives ample review.
  • Test if the discriminant is negative. If so, return a warning message that tells the user that the equation can not be solved.
  • Then test if the discriminant is positive. If so, there are two solutions for the equation. Return them in a named vector.
  • Then test if the discriminant is 0. If so, there is one solution for the equation. Return the solution as a named vector.

Test your function for the following quadratic models:
- \(ax^1 + 2x + 3\) (no solution)
- \(ax^3 + 7x + 4\) (two intersections)
- \(ax^2 + 4x + 2\) (1 intersection)

10.7.7 G/C percentage of DNA

[Challenge XL] Create a function, GC_perc(), that calculates and returns the GC percentage of a DNA or RNA sequence. Accept as input a sequence and a flag -strict- indicating whether other characters are accepted than core DNA/RNA (GATUC). If strict = FALSE, the percentage of other characters should be reported using a warning() call. If strict = TRUE, the function should terminate with an error message. Use stop() for this. strict should default to TRUE. NOTE, usage of strict can complicate things, so start with the core functionality! You can use strsplit() or substr() to get hold of individual sequence characters.

10.8 Function apply and its relatives

In this section you will encounter some exercises revolving around the different flavors of apply.

10.8.1 Whale selenium

On the course website under Resources you will find a link to file whale_selenium.txt. You could download it into your working directory manually or use download.file() to obtain it. However, there is a third way to get its contents without actually downloading it as a local copy. You can read it directly using read.table() as shown here.

whale_sel_url <- "https://raw.githubusercontent.com/MichielNoback/davur1/gh-pages/exercises/data/whale_selenium.txt"
whale_selenium <- read.table(whale_sel_url,
    header = T,
    row.names = 1)

Note: when you are going to load a file many times it is probably better to store a local copy.

A
Report the means of both columns using apply().

B Report the standard deviation of both columns, using apply()

C
Report the standard error of the mean of both columns, using apply() The SEM is calculated as \[\frac{sd}{\sqrt{n}}\] where \(sd\) is the sample standard deviation and \(n\) the number of measurements. You should create the function calculating this statistic yourself.

D
Using apply(), calculate the ratio of \(Se_{tooth} / Se_{liver}\) and attach it to the whale_selenium dataframe as column ratio. Create a histogram of this ratio.

E
Using print() and paste(), report the mean and the standard deviation of the ratio column, but do this with an inline expression, e.g. an expression embedded in the R markdown paragraph text.

10.8.2 ChickWeight

This exercise revolves around the ChickWeight dataset of the built-in datasets package.

A
Report the number of chickens used in the experiment.

B
Use aggregate() to get the mean weight of the chickens for the different Diets.

C
Use coplot() to plot a panel with weight as function of Time, split over Diet.

D
[Challenge] Add a column called weight_gain to the dataframe holding values for the weight gain since the last measurement. Take special care with rows marking the boundaries between individual chickens! You could consider using a traditional for loop here. In the next course, we’ll see a more efficient way of doing this.

E
Split the weight_gain column on Diet and report the mean, median and standard deviation for each diet. If you were not successful in the previous question, load and attach the data from file ChickWeight_weight_gain.Rdata downloadable from https://github.com/MichielNoback/davur1_gitbook/raw/master/data/ChickWeight_weight_gain.Rdata. You can use this code chunk for downloading and loading the data into variable stored_weight_gain. Don’t forget to attach the column to the data frame!

local_file <- "ChickWeight_weight_gain.Rdata"
download.file(paste0("https://github.com/MichielNoback/davur1_gitbook/raw/master/data/", local_file), local_file)
load(local_file)

F
Create a (single-panel) boxplot for weight gain, split over Diet. Hint: read the boxplot() help page!

10.8.3 Food constituents

The food constituents dataset holds information on ingredients for different foods. Individual foods are simply marked with an id.

A
Load the data and report the different food categories (Type). Also report the numbers of entries for each Type.

B
What is the mean energy content of chocolate foods?

C
What is the food category with the highest mean fat content?

D
What food category has the highest mean energy content, and which has the lowest?

E
[Challenge] Create a boxplot showing the difference in sugar content between drink and solid food.

F
Assuming both saturated fats and sugar are bad for you, what food category do you consider the worst? Think of a means to answer this, explain it and carry it out.

10.8.4 Urine properties

The urine specimens dataset contains a readme.txt file and a data file (urine.csv; direct link: “https://raw.githubusercontent.com/MichielNoback/datasets/master/urine/urine.csv”). Study the readme to get an idea of the data. Download the file like this:

urine_file_name <- "urine.csv"
url <- paste0("https://raw.githubusercontent.com/MichielNoback/datasets/master/urine/", urine_file_name)
local_name <- paste0("../", urine_file_name) #specifiy your own folder!
download.file(url = url, destfile = local_name)

A
Load the data into a dataframe with name urine.

B
Convert the column r into a factor with two levels: yes and no, and give it a better name: ox_crystals.

C
Using apply(), report the mean and standard deviation of the numeric columns only. Give these with only two decimal digits. Use a named vector so you get this output:

     gravity   ph   osmo  cond   urea calc
mean    1.02 6.03 615.04 20.90 266.41 4.14
sd      0.01 0.72 238.25  7.95 131.25 3.26

D
Using aggregate, report the mean of the numeric columns, but split over the ox_crystals variable levels. Which variables seem most likely candidates for a relation with oxalate crystal formation.

E
Use the pairs() plotting function to explore the pairwise relationships between the different numeric variables

F
Use the heatmap.2() and cor() functions together to create a heatmap of pairwise variable correlations. You will need to install package gplots for the heatmap.2 function. Alternatively, use heatmap() from base R.

Which visualization do you prefer - the pairs() or the heatmap.2()?

G
[Challenge] There does not seem to be an interesting correlation between pH and any of the other variables. Sometimes this is becuase you are not looking in enough detail. Let’s dig a little further. Use cut() to split the pH variable into a factor with three levels: “acidic”, “neutral” and “basic” with breakpoints between them at pH 5.5 and 7. Next, use the split() and lapply() functions to calculate the mean, meadian and sd of the other numeric variables for each level of this pH factor. This exercise can of course be done in many ways. one of them is by using an apply() within the applied function of lapply().

10.8.5 Bird observations revisited

This exercise revisits the bird observations dataset. You can download it here. (Re)load the dataset.

A
Report the number of observations per County. Use both a textual as a barplot representation. With the barplot, you should order the bars according to observation numbers.

B
Report the number of observations per Observer.1 but only for observers with more than 10 observations, ordered from high to low observation count. Use order() to achieve this.

C
Which observer has the highest number of observations listed (and how many is that)?

D
Report the different observed species (using Common.name) for each genus. [Challenge] Report only the 5 Genera with the highest number of observed species.

E
[Challenge] Create a Dataframe holding the number of birds per day (use Date.start) and plot it with date on the x-axis and number of birds on the y-axis. Hint: use as.Date() to convert the character date to a real date field. See this page how you can do that Date Values.