Introduction

This is the final assignment of the intro lessons on R. You should work within this document, and “Knit” it into a Word document when you are finished. Submit the resulting Word document as well as this .Rmd document on Blackboard.

The presentation is available here: https://michielnoback.github.io/intro_R_lessons/Intro_lesson_1.html

and as flat HTML as well: https://michielnoback.github.io/intro_R_lessons/Intro_lesson_simple.html

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

This will include both the code and its result in the generated document.
Clicking the small green arrow in the top right corner of a code chunk in RStudio executes only that chunk.

Including Plots

You can also embed plots, for example:

plot(pressure)

Load the data

Run this chunk to download the data and load it into your R session.

protein_data <- read.table(file = "https://git.io/fjfMC",
                        header = TRUE,
                        sep = ";",
                        dec = ",",
                        as.is = c(1, 2, 3, 21))

The assignments

Assignment 1

You have seen the function table() which can be used to create a frequency table. Apply this function to generate a table for the different localizations as identified by LipoP.
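
As a reminder of how table() behaves, here is a minimal illustration on the built-in mtcars data set rather than the assignment data. The variable name cyl_counts is only used for this example.

```r
# Illustration on the built-in mtcars data set (not the assignment data).
# table() counts how often each distinct value occurs in a vector.
cyl_counts <- table(mtcars$cyl)
cyl_counts
```

The same call pattern should work on any single column of a data frame.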

## YOUR CODE HERE

Assignment 2

Use table() again, this time to create a “contingency table” (cross tabulation) for LipoP localization against SignalP predictions (Yes/No). Is there a (cor)relation between these analyses?
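
A cross tabulation is obtained by passing two vectors of equal length to table(). The sketch below uses the built-in mtcars data set, not the assignment data; the argument names cylinders and transmission are labels chosen for this example.

```r
# Illustration on the built-in mtcars data set (not the assignment data).
# Two vectors give a two-way table: rows are the first, columns the second.
# Named arguments become the dimension labels of the table.
cross_tab <- table(cylinders = mtcars$cyl, transmission = mtcars$am)
cross_tab
```

Each cell holds the number of rows that share that combination of values.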

## YOUR CODE HERE

Assignment 3

First run the chunk below to get a new column for the TMHMM analysis, giving a simplified version of column TMHMM_PredHel. If the number of transmembrane helices is greater than 1, it is reduced to “2+”.

# Helix counts above 1 become "2+"; counts of 0 and 1 are kept.
# Because "2+" is text, the whole new column is stored as character.
protein_data$TMHMM_PredHelSimple <- ifelse(protein_data$TMHMM_PredHel > 1, "2+", protein_data$TMHMM_PredHel)
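
To see what ifelse() does in the chunk above, here is a small sketch on a made-up vector. The name helix_counts and its values are hypothetical and not taken from the data.

```r
# Hypothetical helix counts, made up for this illustration.
helix_counts <- c(0, 1, 3, 7)
# Where the condition is TRUE the result is "2+", elsewhere the original value.
# Mixing text and numbers makes the whole result a character vector.
simplified <- ifelse(helix_counts > 1, "2+", helix_counts)
simplified
class(simplified)
```

Note that the result is text, so the values 0 and 1 are stored as "0" and "1".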

Have a look at this new column.

Next, create a “contingency table” for SignalP against TMHMM_PredHelSimple predictions. Is there a (cor)relation between these analyses? What does this say?

## YOUR CODE HERE

Assignment 4

Find out how many proteins have a negative result in each of the three analyses (SignalP, LipoP, TMHMM). Negative means N (SignalP), CYT (LipoP) and 0 (TMHMM PredHel). Use nrow() on the resulting selection to get the number of cases.

Next, find out how many are negative for all three analyses. Hint: use a combination of logical selections: condition_A & condition_B
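
The pattern from the hint can be sketched on the built-in mtcars data set (not the assignment data). The conditions and variable names below are only examples.

```r
# Illustration on the built-in mtcars data set (not the assignment data).
# Each condition is a logical vector with one TRUE/FALSE per row.
four_cylinders <- mtcars$cyl == 4
manual_gearbox <- mtcars$am == 1
# & keeps only the rows where both conditions are TRUE.
selection <- mtcars[four_cylinders & manual_gearbox, ]
nrow(selection)
```

More conditions can be chained in the same way: condition_A & condition_B & condition_C.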

## YOUR CODE HERE

Assignment 5

Read the help page of the summary() function. Then, use this function and combine it with logical selection to find out what the mean, minimum and maximum values of the LipoP score are for the four different protein classes (CYT, SpI, SpII, TMH). Hint: select each subgroup in the data frame using logical selection on the rows, and only the “LipoP_Score” column.
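
Selecting rows with a logical condition and a single column by name, then summarizing, could look like the sketch below. It uses the built-in mtcars data set, not the assignment data, and the variable names are only for this example.

```r
# Illustration on the built-in mtcars data set (not the assignment data).
# Rows: cars with 4 cylinders; column: only "mpg".
mpg_four_cyl <- mtcars[mtcars$cyl == 4, "mpg"]
# For a numeric vector, summary() reports minimum, quartiles, mean and maximum.
mpg_summary <- summary(mpg_four_cyl)
mpg_summary
```

Repeating this with a different condition gives the summary for another subgroup.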

## YOUR CODE HERE

Assignment 6

Create a scatter plot of SignalP_Smax versus LipoP_Score and give it nice axis labels.
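
Axis labels are set with the xlab and ylab arguments of plot(). A minimal sketch on the built-in cars data set (not the assignment data), with label texts chosen for this example:

```r
# Illustration on the built-in cars data set (not the assignment data).
# The first vector goes on the x axis, the second on the y axis.
plot(x = cars$speed, y = cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance versus speed")
```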

## YOUR CODE HERE

Assignment 7

Study the help page of the hist() function. Create a histogram of the protein lengths. There are three large proteins that distort the image quite a lot. Try to exclude these and generate the histogram again, or try to use the breaks argument to get a better image.
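
Both approaches, excluding extreme values and using the breaks argument, can be sketched on the built-in mtcars data set (not the assignment data). The cutoff of 300 and the variable names are arbitrary choices for this example.

```r
# Illustration on the built-in mtcars data set (not the assignment data).
# Keep only values below an arbitrary cutoff to drop the extreme ones.
hp_trimmed <- mtcars$hp[mtcars$hp < 300]
# A single number for breaks is a suggestion; hist() picks "pretty" bin edges.
hp_hist <- hist(hp_trimmed, breaks = 10,
                main = "Horsepower (values below 300)",
                xlab = "Gross horsepower")
```

hist() also returns its bins and counts invisibly, here stored in hp_hist, which can help when checking what was plotted.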

## YOUR CODE HERE

The end