Chapter 5 Complex Datatypes and File Reading
5.1 Matrices are vectors with dimensions
We will not detail on them in this course, only this one small paragraph. This does not mean they are not important, but they are just not the focus here. Some functions require or return a matrix so you should be aware of them.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
5.2 Factors: Nominal & Ordinal scales
Although factors are not actually a complex datatype but very much one of the five base types in R, I saved them because they have some more complex and sometimes puzzling behaviour.
Factors represent different discrete levels of a variable - the nominal and ordinal scales known from statistics.
For instance:
- eye color (brown, blue, green)
- weight class (underweight, normal, obese)
- autism spectrum (none, minimal, heavy)
Factor creation
Factors are used to represent data in nominal and ordinal scales. Nominal has no order (e.g. eye color). Ordinal has order (e.g. autism spectrum), but can not be calculated with, other than ordering from high to low. No distance is defined between separate levels. The following functions are used to create factors:
factor()
: constructor function, from factor, character, numeric or logicalas.factor()
: coercion function, from factor, character, numeric or logicalcut()
: conversion function from numeric vector
So what is the difference between factor()
and as.factor()
? Function as.factor()
is a wrapper for factor()
. The difference lies in behaviour when the input is a factor itself: factor
will omit unused levels. Besides this, as.factor()
does not specify the arguments for labels and levels.
## [1] a b
## Levels: a b c
## [1] a b
## Levels: a b
## [1] a b
## Levels: a b c
Suppose you have surveyed the eye color of your class room and found these values
eye_colors <- c("green", "blue", "brown", "brown", "blue",
"brown", "brown", "brown", "blue", "brown", "green",
"brown", "brown", "blue", "blue", "brown")
Next you would like to plot or tabulate these findings. Simply plotting gives an error:
## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Error in plot.window(...): need finite 'ylim' values
However, plotting a character vector converted to a factor is easy
Factors are also really easy to tabulate and filter
## eye_colors
## blue brown green
## 5 9 2
## [1] 5
Levels, Labels and Ordering
When working with ordinal scales, defining the order of the factors (levels) is crucial. By default, R uses the natural ordering which means it will stick to either numerical (numeric
, integer
and logical
) or alphabetical ordering (character
). When you want a different ordering you need to specify this. You can even define missing levels, as shown in the following example.
classSizes <- factor(c("big", "small", "huge", "huge",
"small","big","small","big"),
levels = c("small", "normal", "big", "huge"),
ordered = TRUE) #make it an ordinal scale!
plot(classSizes)
When you have factor, you can do a -limited- set of calulations with it. However, comparators only work with ordinal scale. As with all equality tests, sum()
works as well:
## [1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## [1] 2
Convert existing factors
When you already have an unordered factor, you can make it ordered by using the function ordered()
together with a fvector specifying the levels.
classSizes <- factor(c("big", "small", "huge", "huge",
"small", "big", "small", "big"))
classSizes <- ordered(classSizes,
levels = c("small", "big", "huge"))
classSizes
## [1] big small huge huge small big small big
## Levels: small < big < huge
When calculations get corrupted
Especially when a factor consists of numeric levels, calculations can get your mind screwed big time:
## Warning in Ops.factor(x, 1): '+' not meaningful for factors
## [1] NA NA NA NA
## [1] 2 3 4 3
## [1] 4 5 6
The only way to get the numbers back with numeric factors is by using this trick
## [1] 3 4 5 4
## Levels: 3 4 5
## [1] 3 4 5 4
But this makes for really unintelligible code so try to prevent this at all costs!
5.3 Lists
A list is an ordered collection of vectors. These vectors can have differing types and differing lengths.
List creation
Create a list with or without element names:
list(element1, element2, ...)
list(name1 = element1, name2 = element2, ...)
Without names:
## [[1]]
## [1] 2 3 1
##
## [[2]]
## [1] "foo" "bar"
## [1] "foo" "bar"
With names:
## $numbers
## [1] 2 3 1
##
## $words
## [1] "foo" "bar"
This is the preferred way to create and use them because it gives you more and easier ways to access its elements and it makes for much more reradable code. That’s why you will only see lists with named elements from here on.
Making selections on lists
Accessing named elements can be done in three ways:
- By index, within double or single brackets:
[[<index>]]
or[<index>]
- By name of the element, within double or single brackets:
[[<name>]]
or[<name>]
- By name of the element, using the dollar sign on the list name:
$<name>
Here are all three:
## [1] "foo" "bar"
## [1] "foo" "bar"
## [1] "foo" "bar"
Single brackets selection on a list returns a list; double brackets and
$
return a vector.
## $words
## [1] "foo" "bar"
## [1] "foo" "bar"
## [1] "foo" "bar"
In R, selections are often chained. In the following example the second vector element of the second list element is selected.
## $numbers
## [1] 2 3 1
##
## $words
## [1] "foo" "bar"
##
## [1] "bar"
When you need multiple elements of a list, use single brackets. Remember: single brackets return a list; that’s why you need single brackets here.
## $numbers
## [1] 2 3 1
##
## $words
## [1] "foo" "bar"
##
## $numbers
## [1] 2 3 1
Accessing named elements has its limitations. You can not use a variable in combination with the dollar sign selector.
## [1] "foo" "bar"
## NULL
Chaining of selectors can become awkward, as this example demonstrates.
## [1] "foo" "bar"
5.4 Dataframes
A dataframe is an ordered collection of vectors. These vectors can have differing types but must have equal lengths.
A dataframe is very similar to the square grid-like structures you have probably worked with in Excel. Variables are in columns in which all elements are of the same type. Examples (observations) are in rows - they can have differing types.
Dataframes can be constructed using the data.frame()
function in the same way as the list
function:
data.frame(column1 = vector1, column2 = vector2, ...)
Here is a first example.
geneNames <- c("P53","BRCA1","VAMP1", "FHIT")
sig <- c(TRUE, TRUE, FALSE, FALSE)
meanExp <- c(4.5, 7.3, 5.4, 2.4)
genes <- data.frame(
"name" = geneNames,
"significant" = sig,
"expression" = meanExp)
genes
## name significant expression
## 1 P53 TRUE 4.5
## 2 BRCA1 TRUE 7.3
## 3 VAMP1 FALSE 5.4
## 4 FHIT FALSE 2.4
Here you can see the structure of a dataframe: each column has a single datatype but rows can have differing types for neighboring fields.
5.4.1 Selections on dataframes
Making selections on dataframes is not very surprising when you already know how to do it with vectors and lists. There is only one extension. The fact that it is a square grid-like structure makes it possible to add an extra way of making selections: combining rows and column selections as subgrids. This section extensively reviews all means of making selections.
This is a summary:
- Select a single column using
$
will return a vector - Selecting with double brackets
[[<name>]]
or[[<index>]]
will return a vector - Selecting with single brackets
[<name>]
or[<index>]
will return a dataframe - Selecting with row-and-column coordinates
[row_selection, col_selection]
returns either a vector or a dataframe, depending on the selection made. Here,row_selection
andcol_selection
can be- a numerical vector of length 1 or more
- a logical vector of length 1 or more
- empty (to select all rows/columns)
Here follow a few examples.
## [1] "BRCA1"
## name significant
## 2 BRCA1 TRUE
## name expression
## 2 BRCA1 7.3
## [1] "P53" "BRCA1" "VAMP1" "FHIT"
## name expression
## 1 P53 4.5
## 2 BRCA1 7.3
## 3 VAMP1 5.4
## 4 FHIT 2.4
## name significant
## 1 P53 TRUE
## 2 BRCA1 TRUE
## 3 VAMP1 FALSE
## 4 FHIT FALSE
## name significant expression
## 1 P53 TRUE 4.5
## 2 BRCA1 TRUE 7.3
As with vectors and lists, R will cycle selectors, and you can select an element as often as you want.
## [1] "P53" "VAMP1"
## name significant expression
## 1 P53 TRUE 4.5
## 1.1 P53 TRUE 4.5
## 1.2 P53 TRUE 4.5
## 2 BRCA1 TRUE 7.3
A dataframe is much like a list, but not entirely equal:
## [1] "P53" "BRCA1" "VAMP1" "FHIT"
## [1] "data.frame"
## 'data.frame': 4 obs. of 3 variables:
## $ name : chr "P53" "BRCA1" "VAMP1" "FHIT"
## $ significant: logi TRUE TRUE FALSE FALSE
## $ expression : num 4.5 7.3 5.4 2.4
Selections with subset()
Function subset()
can serve as alternative to “bracket-based” selections ([ , ]
).
You can use subset()
to make both column and row selections. Here, using subset =
, the rows are selected for which Solar.R is available using the is.na()
function.
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 May 1
## 2 36 118 8.0 72 May 2
## 3 12 149 12.6 74 May 3
## 4 18 313 11.5 62 May 4
## 7 23 299 8.6 65 May 7
## 8 19 99 13.8 59 May 8
Note that you don’t even need to use quotes for column names.
Select columns only with the select =
argument.
## Ozone Solar.R
## 1 41 190
## 2 36 118
## 3 12 149
## 4 18 313
## 5 NA NA
## 6 28 NA
Of course, you can combine row and colum selection:
## Ozone Solar.R
## 1 41 190
## 2 36 118
## 3 12 149
## 4 18 313
## 7 23 299
## 8 19 99
subset()
can be used more sophisticated; however we are going to see subset()
on steroids in the next course: the functions in package dplyr
.
5.4.2 Read from file using read.table()
Usually your data comes from file, loaded into memory as a dataframe. The most common data transfer & storage format is text. The text will have column separators that can be any of a whole number of characters, but tab- or comma-delimited are most common.
Here is an example dataset in a file (“whale_selenium.txt”) where the separator is a space character:
whale liver.Se tooth.Se
1 6.23 140.16
2 6.79 133.32
3 7.92 135.34
...
19 41.23 206.30
20 45.47 141.31
To load this data into an R session you can use the function read.table()
. Let’s try
## V1 V2 V3
## 1 whale liver.Se tooth.Se
## 2 1 6.23 140.16
## 3 2 6.79 133.32
## 4 3 7.92 135.34
## 5 4 8.02 127.82
## 6 5 9.34 108.67
## 'data.frame': 21 obs. of 3 variables:
## $ V1: chr "whale" "1" "2" "3" ...
## $ V2: chr "liver.Se" "6.23" "6.79" "7.92" ...
## $ V3: chr "tooth.Se" "140.16" "133.32" "135.34" ...
That is not entirely correct: all columns are imported as a factor while obviously they should be numeric. The cause of this is that, when loading the data,
- there is no special consideration for the header line
- the separator is assumed to be a space
- the decimal is assumed to be a dot “.”
(and some more assumptions)
Therefore, to read a file correctly, you have to specify its format in every detail. in this case,
- the first line is a header with column names
- the first column contains the row names
Here is a new attempt with some format specifications:
Before proceeding with your data, you should always perform some checks. Several helper methods exist for this purpose:
head()
shows you the firstn
linesstr()
gives you a structure description: what types have the columns and what dimension does the data frame have?summary()
gives you a 6-number sumary of the data
## liver.Se tooth.Se
## 1 6.23 140.16
## 2 6.79 133.32
## 3 7.92 135.34
## 4 8.02 127.82
## 'data.frame': 20 obs. of 2 variables:
## $ liver.Se: num 6.23 6.79 7.92 8.02 9.34 ...
## $ tooth.Se: num 140 133 135 128 109 ...
## liver.Se tooth.Se
## Min. : 6.230 Min. :108.7
## 1st Qu.: 9.835 1st Qu.:134.8
## Median :14.905 Median :143.4
## Mean :20.685 Mean :156.6
## 3rd Qu.:33.633 3rd Qu.:175.1
## Max. :45.470 Max. :245.1
There are various other helper methods the you can use to inspect the contents and nature of your dataframe columns and rows:
dim()
gives the rows and columnsncol()
gives the number of columnsnrow()
gives the number of rowsnames()
gives the column names (synonym tocolnames()
)rownames()
gives the row names
But visualizing your data speaks more than a thousand words of course.
plot(whale_selenium$liver.Se, whale_selenium$tooth.Se,
xlab = "liver Selenium", ylab = "tooth Selenium")
abline(lm(whale_selenium$tooth.Se ~ whale_selenium$liver.Se))
This is absolutely not the whole filereading story! The topic will be addressed again in a later chapter.
5.4.3 Dataframe manipulations
Changing column names
Sometimes the existing column names are not convenient to work with (unclear, too long etc.). In that case it may be a good idea to change the column names. To do this you can use either names()
or colnames()
.
## liver tooth
## 1 6.23 140.16
## 2 6.79 133.32
## pancreas colon
## 1 6.23 140.16
## 2 6.79 133.32
Adding columns
You can add a single column by simply specifying its name and the value(s) to be attached.
## add simulated stomach data
whale_selenium$stomach <- rnorm(nrow(whale_selenium), 42, 6)
head(whale_selenium, n=2)
## liver tooth stomach
## 1 6.23 140.16 52.53270
## 2 6.79 133.32 50.69782
Alternatively, use cbind
. It is a bit more versatile because you can add multiple columns at once.
## liver tooth stomach brain
## 1 6.23 140.16 52.53270 1
## 2 6.79 133.32 50.69782 0
## 3 7.92 135.34 39.30723 1
## 4 8.02 127.82 42.84040 0
## 5 9.34 108.67 39.75810 1
## 6 10.00 146.22 34.76140 0
## 7 10.57 131.18 26.03192 1
## 8 11.04 145.51 41.84943 0
## 9 12.36 163.24 41.79376 1
## 10 14.53 136.55 51.85415 0
## 11 15.28 112.63 30.02723 1
## 12 18.68 245.07 38.02698 0
## 13 22.08 140.48 47.28023 1
## 14 27.55 177.93 44.05432 0
## 15 32.83 160.73 42.73052 1
## 16 36.04 227.60 38.27543 0
## 17 37.74 177.69 43.20567 1
## 18 40.00 174.23 40.58457 0
## 19 41.23 206.30 42.68197 1
## 20 45.47 141.31 38.53242 0
Adding rows: rbind()
Adding rows to a dataframe is similar. There is however a constraint: the column names of both dataframes need to match for this operation to succeed.
my_data1 <- data.frame(colA = 1:3, colB = c("a", "b", "c"))
my_data2 <- data.frame(colA = 4:5, colB = c("d", "e"))
my_data_complete <- rbind(my_data1, my_data2)
my_data_complete
## colA colB
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e