11 Text processing with regex

11.1 Regex syntax

11.1.1 Why regexes?

This chapter deals with processing data in textual form: character data.
When working with text, you really need to be able to work with regular expressions. That is why these are dealt with first, together with the base R regex functions. After that the functions from the stringr package are discussed.

It is easy enough to look for the word “Chimpanzee” in a vector containing animal species names:

animals = c("Chimpanzee", "Cow", "Camel")
animals == "Chimpanzee"

## [1]  TRUE FALSE FALSE

but what are you going to do if there are multiple variants of the word you are looking for? This?

animals = c("Chimpanzee", "Chimp", "chimpanzee", "Camel")
animals == "Chimpanzee" | animals == "Chimp" | animals == "chimpanzee"

## [1]  TRUE  TRUE  TRUE FALSE

The solution here is not to use literals, but to describe patterns.

Look at the above example. How would you describe a pattern that would correctly identify all Chimpanzee occurrences?

Is your pattern something like this?

A letter C in upper- or lowercase, followed by ‘himp’, followed by nothing or ‘anzee’

In programming we use regular expressions or RegEx to describe such a pattern in a formal concise way:

[Cc]himp(anzee)?

To apply such a pattern in R, we use one of several functions dedicated to this task. Here is one, grepl(), which returns TRUE for each vector element that the regex matches.

grepl("[Cc]himp(anzee)?", animals)

## [1]  TRUE  TRUE  TRUE FALSE

Pattern matching is the process of finding, locating, extracting and replacing patterns in character data that usually cannot be literally described.

Base functions using regex

There are several base R functions dedicated to finding patterns in character data. They differ in intent and output. Later, the stringr counterparts will be discussed.

| Task | Question | Function |
|---|---|---|
| finding | Does an element contain a pattern (TRUE/FALSE)? | grepl(pattern, string) |
| locating | Which elements contain a pattern (INDEX)? | grep(pattern, string) |
| extracting | Get the content of matching elements | grep(pattern, string, value = TRUE) |
| replace | Replace the first occurrence of the pattern | sub(pattern, replacement, string) |
| replace all | Replace all occurrences of the pattern | gsub(pattern, replacement, string) |
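
As a minimal sketch, the five base R functions listed above can be applied to the animals vector from the previous section. The replacement of "e" by "E" is made up purely for illustration:

```r
animals <- c("Chimpanzee", "Chimp", "chimpanzee", "Camel")
pattern <- "[Cc]himp(anzee)?"

# finding: one TRUE/FALSE per element
found <- grepl(pattern, animals)

# locating: indices of the matching elements
positions <- grep(pattern, animals)

# extracting: the matching elements themselves
matches <- grep(pattern, animals, value = TRUE)

# replacing: sub() changes only the first "e" in each element,
# gsub() changes every "e"
first_replaced <- sub("e", "E", animals)
all_replaced <- gsub("e", "E", animals)

found
positions
matches
first_replaced
all_replaced
```

Note the difference in the last two results: sub() turns "Chimpanzee" into "ChimpanzEe", whereas gsub() gives "ChimpanzEE".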

Note that the stringr package from the tidyverse has many user-friendly functions in this field as well. Two of them will be dealt with in the exercises.

11.1.2 Regex components

A regular expression can be built out of any combination of

character sequences - Literal character sequences such as ‘chimp’
character classes - A listing of possibilities for a single position.
- Between brackets: [adgk] means ‘a’ or ‘d’ or ‘g’ or ‘k’.
- Use a hyphen to create a series: [3-9] means digits 3 through 9 and [a-zA-Z] means all alphabet characters.
- Negate using ^. [^adgk] means anything but a, d, g or k.
- A special case is the dot .: any character matches.
- Many special character classes exist (digits, whitespaces etc). They are discussed in a later paragraph.
alternatives - Are defined by the pipe symbol |: “OR”
quantifiers - How many times the preceding block should occur. See next paragraph.
anchors - ^ means matching at the start of a string. $ means at the end.
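
The components listed above can be tried out on a small vector. This is only a sketch; the vector `words` is invented for the purpose:

```r
words <- c("cat", "cut", "cot", "coat", "dog")

# character class: 'a' or 'u' in the middle position
class_match <- grepl("c[au]t", words)

# negated class: anything but 'a' in the middle position
negated_match <- grepl("c[^a]t", words)

# dot: any single character in the middle position
dot_match <- grepl("c.t", words)

# alternatives: either 'cat' or 'dog'
alt_match <- grepl("cat|dog", words)

data.frame(words, class_match, negated_match, dot_match, alt_match)
```

"coat" matches none of the first three patterns because it has two characters between the "c" and the "t", and each of these patterns allows exactly one.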

An excellent cheat sheet from the RStudio website is also included here

Quantifiers

Use quantifiers to specify how many times a character or series of characters should occur.

{n}: exactly n times
{n,}: at least n times
{0,n}: at most n times
{n,m}: at least n and at most m times
*: 0 or more times; same as {0,}
+: 1 or more times; same as {1,}
?: 0 or 1 time; same as {0,1}

Do not put spaces inside the braces; they are part of the pattern syntax.
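
To see the quantifiers at work, here is a small sketch in which the letter "b" is quantified in different ways. The test vector `x` is made up for illustration:

```r
x <- c("ac", "abc", "abbc", "abbbc")

star      <- grepl("ab*c", x)      # zero or more b's
plus      <- grepl("ab+c", x)      # one or more b's
optional  <- grepl("ab?c", x)      # zero or one b
exactly2  <- grepl("ab{2}c", x)    # exactly two b's
atleast2  <- grepl("ab{2,}c", x)   # two or more b's
one_to_2  <- grepl("ab{1,2}c", x)  # one or two b's

data.frame(x, star, plus, optional, exactly2, atleast2, one_to_2)
```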

Anchoring

Using anchors, you can make sure the matched string contains nothing before or after what the pattern explicitly states:

dates <- c("15/2/2019", "15-2-2019", "15-02-2019", "015/2/20191", "15/2/20191")
dateRegex <- "^[0-9]{2}[/-][0-9]{1,2}[/-][0-9]{4}$"
grep(pattern = dateRegex, x = dates, value = TRUE)

## [1] "15/2/2019"  "15-2-2019"  "15-02-2019"

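With both anchors in place the date matching is correct: "015/2/20191" and "15/2/20191" are rejected because they contain extra characters before or after the date pattern.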
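
The two anchors can also be used separately. The following sketch, with a made-up vector `x`, shows the effect of each anchor on its own and of both together:

```r
x <- c("chimp", "a chimp", "chimps")

starts_with <- grepl("^chimp", x)   # pattern must be at the start
ends_with   <- grepl("chimp$", x)   # pattern must be at the end
whole       <- grepl("^chimp$", x)  # the string must be exactly 'chimp'

data.frame(x, starts_with, ends_with, whole)
```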

Metacharacters: Special character classes

Since patterns such as [0-9] occur so frequently, they have dedicated character classes such as [[:digit:]]. The most important other ones are

digits [[:digit:]] or \\d: equivalent to [0-9]
alphabet characters [[:alpha:]]: equivalent to [a-zA-Z]
lowercase characters [[:lower:]]: equivalent to [a-z]
uppercase characters [[:upper:]]: equivalent to [A-Z]
whitespace characters [[:space:]] or \\s: Space, tab, vertical tab, newline, form feed, carriage return
punctuation characters [[:punct:]]: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

(have a look at the cheat sheet for all)
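
As a short sketch of these predefined classes in combination with gsub(), the example below strips or replaces punctuation, digits and whitespace. The input string is invented for illustration:

```r
x <- "Hello, World! 42 times."

# remove all punctuation characters
no_punct <- gsub("[[:punct:]]", "", x)

# replace each run of digits by a single '#'
no_digits <- gsub("[[:digit:]]+", "#", x)

# replace each run of whitespace by an underscore
no_space <- gsub("\\s+", "_", x)

no_punct
no_digits
no_space
```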

Here is the date example again, this time using these predefined character classes. Note that this pattern is not anchored, so "15/2/20191" still matches:

dates <- c("15/2/2019", "15-2-2019", "15-02-2019", "15022019", "15/2/20191")
dateRegex <- "[[:digit:]]{2}[/-]\\d{1,2}[/-]\\d{4}"
grep(pattern = dateRegex, x = dates, value = TRUE)

## [1] "15/2/2019"  "15-2-2019"  "15-02-2019" "15/2/20191"

Alternatives

To apply two alternative choices for a single regex element you use the pipe symbol |. You can use parentheses, as in (creatine|calcium), to fence alternatives off.

column_names <- c("Subject", "Age", "T0_creatine", "T0_calcium", "T1_creatine", "T1_calcium") 
grep(pattern = "T[01]_(creatine|calcium)", x = column_names, value = TRUE)

## [1] "T0_creatine" "T0_calcium"  "T1_creatine" "T1_calcium"
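
Why the parentheses matter can be seen by leaving them out. In this sketch the pipe then separates the whole left-hand side from the whole right-hand side of the pattern, which is probably not what was intended:

```r
column_names <- c("Subject", "Age", "T0_creatine", "T0_calcium",
                  "T1_creatine", "T1_calcium")

# without parentheses: 'T0_creatine' OR 'calcium'
ungrouped <- grep("T0_creatine|calcium", column_names, value = TRUE)

# with parentheses: 'T0_' followed by 'creatine' OR 'calcium'
grouped <- grep("T0_(creatine|calcium)", column_names, value = TRUE)

ungrouped
grouped
```

The ungrouped pattern also returns "T1_calcium", because the bare alternative "calcium" matches it.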

11.1.3 Some examples

Restriction enzymes

This is the recognition sequence for the HincII restriction endonuclease:

5'-GTYRAC-3'
3'-CARYTG-5'

Before reading on: how would you define a regular expression that precisely describes this recognition sequence?

Molecular biology sequence ambiguity codes can be found here

HincII_rs <- "GT[CT][AG]AC"
sequences <- c("GTCAAC",
               "GTCGAC",
               "GTTGAC",
               "aGTTAACa",
               "GTGCAC")
grep(pattern = HincII_rs, x = sequences, value = TRUE)

## [1] "GTCAAC"   "GTCGAC"   "GTTGAC"   "aGTTAACa"
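
If you also want to know where the site is located, or want to accept lowercase sequences, something like the following sketch could be used. It relies on the base R function regexpr() and on the ignore.case argument of grepl(); the lowercase sequence "gttaac" is added here for illustration only:

```r
HincII_rs <- "GT[CT][AG]AC"
sequences <- c("GTCAAC", "aGTTAACa", "gttaac", "GTGCAC")

# start position of the first match in each element; -1 means no match
site_start <- as.integer(regexpr(HincII_rs, sequences))

# the same pattern, but matching regardless of case
any_case <- grepl(HincII_rs, sequences, ignore.case = TRUE)

site_start
any_case
```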

Dutch dates

Here are some Dutch dates, in different accepted formats. The last two are not in a correct notation. Create a RegEx that will determine whether an element contains a Dutch date string.

dates <- c("15/2/2019", "15-2-2019", "15-02-2019", "015/2/20191", "15/2/20191")
dateRegex <- "[0-9]{2}[/-][0-9]{1,2}[/-][0-9]{4}"
grep(pattern = dateRegex, x = dates, value = TRUE)

## [1] "15/2/2019"   "15-2-2019"   "15-02-2019"  "015/2/20191" "15/2/20191"

Why were the last two matched? Because the pattern is there, albeit embedded in a longer string. We have to anchor the pattern to be more specific.

Exercise: Postal codes

Here are some Dutch zip (postal) codes, in different accepted formats. The last two are not in a correct notation. Can you create a RegEx that will determine whether an element contains a Dutch zip code?

zips <- c("1234 AA", "2345-BB", "3456CC", "4567 dd", "56789aa", "6789a_")
zips

## [1] "1234 AA" "2345-BB" "3456CC"  "4567 dd" "56789aa" "6789a_"

Exercise: Prosite patterns

Prosite is a database of amino acid sequence motifs. One of them is the Histidine Triad profile (PDOC00694).

[NQAR]-x(4)-[GSAVY]-x-[QFLPA]-x-[LIVMY]-x-[HWYRQ]-
[LIVMFYST]-H-[LIVMFT]-H-[LIVMF]-[LIVMFPT]-[PSGAWN]

Write this down as a RegEx.
Using the gsub() function, can you convert it into a RegEx using code? It may take several iterations. Was that efficient?
Next, use an appropriate function to find if, and where, this pattern is located within the sequences in file data/hit_proteins.txt (here)

Amino Acid codes and Prosite pattern encoding can be found here

11.2 The `stringr` package

This is the last presentation in the tidyverse series. It revolves around processing textual data: finding, extracting, and replacing patterns. Central to this task is pattern matching using regular expressions. Pattern matching is the process of finding, locating, extracting and replacing patterns in character data that usually cannot be literally described. Regular expression syntax is the language in which patterns are described in a wide range of programming languages, including R.

This topic was dealt with in an introductory manner previously (course DAVuR1) and is repeated and expanded here. Instead of the base R functions, we now switch to the stringr package.

Like all packages from the tidyverse, stringr has many functions (type help(package = "stringr") to see which). This package has a great cheat sheet as well.

Here, a few of them will be reviewed.

11.2.1 A few remarks on “locale”

Many functions of the tidyverse packages related to time and text (and currency) accept arguments specifying the locale. The locale is a container for all location-specific display of information.
Think of:

Character set of the language
Time zone, Daylight savings time
Thousands separator and decimal symbol
Currency symbol

Dealing with locales is a big challenge indeed for any programming language. However, since this is only an introductory course we will stick to US English and work with the current locale for times only. This note is to make you aware of the concept so that you remember this when the appropriate time comes.
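
To give an impression of why the locale matters for text, here is a minimal sketch using the locale argument of str_to_upper(). It assumes the stringr package is installed; Turkish ("tr") is chosen because it has both a dotted and a dotless i:

```r
library(stringr)

# default (English) rules: i becomes I
upper_en <- str_to_upper("i")

# Turkish rules: i becomes a capital I with a dot above (U+0130)
upper_tr <- str_to_upper("i", locale = "tr")

upper_en
upper_tr
```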

11.2.2 The `stringr` essentials

Case conversion

These functions all change the capitalization of (some of) the word characters of an input string. They all ignore non-word characters such as punctuation and other symbols.

str_to_upper() converts all word characters to uppercase
str_to_lower() converts all word characters to lowercase
str_to_title() capitalizes all first characters of words
str_to_sentence() capitalizes the first character in the string, not after every period

library(stringr)

sntc <- "the path of the righteous man is beset on all sides by the iniquities of the selfish,  and the tyranny of evil men. --quote from?"

str_to_title(sntc)

## [1] "The Path Of The Righteous Man Is Beset On All Sides By The Iniquities Of The Selfish,  And The Tyranny Of Evil Men. --Quote From?"

str_to_sentence(sntc)

## [1] "The path of the righteous man is beset on all sides by the iniquities of the selfish,  and the tyranny of evil men. --quote from?"
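
The other two case conversion functions work the same way. A short sketch, using a made-up mixed-case string:

```r
library(stringr)

mixed <- "the Path of the Righteous man."

all_upper <- str_to_upper(mixed)
all_lower <- str_to_lower(mixed)

all_upper
all_lower
```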

Split, join and substring

Combining two vectors into one, collapsing one vector into a single string, or doing the reverse: splitting. These are all string-based operations that are carried out quite often in scripting.

Here are some joining operations, using str_c():

l1 <- letters[1:5]
l2 <- letters[6:10]

str_c(l1, collapse = "=")

## [1] "a=b=c=d=e"

str_c(l1, l2, sep = "+")

## [1] "a+f" "b+g" "c+h" "d+i" "e+j"

str_c(l1, l2, sep = "+", collapse = "=")

## [1] "a+f=b+g=c+h=d+i=e+j"

When you want to combine variables and text, str_glue() comes in handy:

str_glue("The value of pi is {pi} and the first month of the year is {month.name[1]}")

## The value of pi is 3.14159265358979 and the first month of the year is January

This is a friendlier approach than using paste().
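
For comparison, here is a sketch of the same message built with str_glue() and with base R paste0(). The variables `species` and `n` are invented for the example:

```r
library(stringr)

species <- "Chimp"
n <- 3

# variables are interpolated inside curly braces
glued <- str_glue("Found {n} records for {species}")

# with paste0() the text has to be cut into pieces around the variables
pasted <- paste0("Found ", n, " records for ", species)

glued
pasted
```

Both give the same text; the str_glue() version keeps the sentence readable as a whole.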

Splitting is slightly more tricky since it accepts a regex pattern as split argument. For instance, you can get the words of a sentence by splitting like this:

words <- str_split(sntc, "([[:punct:]]|[[:space:]])+")
words
# alternative:
# str_split(sntc, "[^a-zA-Z]+")

## [[1]]
##  [1] "the"        "path"       "of"         "the"        "righteous" 
##  [6] "man"        "is"         "beset"      "on"         "all"       
## [11] "sides"      "by"         "the"        "iniquities" "of"        
## [16] "the"        "selfish"    "and"        "the"        "tyranny"   
## [21] "of"         "evil"       "men"        "quote"      "from"      
## [26] ""
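
The last element of this result is an empty string, because the sentence ends with a separator (the question mark). One possible way to get rid of such empty pieces is sketched below on a shortened, made-up version of the sentence:

```r
library(stringr)

quote_end <- "evil men. --quote from?"

# str_split() returns a list; take its first (and only) element
pieces <- str_split(quote_end, "[^a-zA-Z]+")[[1]]

# drop the empty string left after the trailing separator
words_only <- pieces[pieces != ""]

pieces
words_only
```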

There are two ways to get parts of character strings, or substrings. The first is by index. You can omit both start and end arguments; they will default to start and end of the string, respectively.

nucs <- c("Adenine", "Guanine", "Cytosine", "Thymine")
str_sub(nucs, end = 3)

## [1] "Ade" "Gua" "Cyt" "Thy"

You can even use this function in an assignment, to replace the selected substring:

str_sub(nucs, start = 4) <- "......"
nucs

## [1] "Ade......" "Gua......" "Cyt......" "Thy......"
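
Besides omitting start or end, str_sub() also accepts both arguments at once, and negative indices that count from the end of the string. A small sketch, using a fresh copy of the nucleotide names:

```r
library(stringr)

bases <- c("Adenine", "Guanine", "Cytosine", "Thymine")

# characters 2 through 4
middle <- str_sub(bases, start = 2, end = 4)

# the last three characters: negative positions count backwards
suffix <- str_sub(bases, start = -3)

middle
suffix
```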

This does not work with literals! The following chunk gives an error:

str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......"

## Error in str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......": target of assignment expands to non-language object

Matching

When you match a pattern to a string, you usually want to know if it is there, which elements have it, where it is located in those elements or how often it is present. For each of these questions there is a dedicated function:

str_detect(string, pattern) detects the presence of a pattern match in a string.
```
fruits <- c("Banana", "Apple", "Orange", "Cherry")
str_detect(fruits, "[Aa]")
```
```
## [1]  TRUE  TRUE  TRUE FALSE
```
str_subset(string, pattern) returns only the strings that contain a pattern match
```
str_subset(fruits, "[Aa]")
```
```
## [1] "Banana" "Apple"  "Orange"
```
str_which(string, pattern) finds the indexes of strings that contain a pattern match.
```
str_which(fruits, "[Aa]")
```
```
## [1] 1 2 3
```
str_count(string, pattern) counts the number of matches in a string.
```
str_count(fruits, "[Aa]")
```
```
## [1] 3 1 1 0
```

str_locate(string, pattern) and str_locate_all(string, pattern) locate the positions of pattern matches in a string

str_locate_all(fruits, "[Aa]")

## [[1]]
##      start end
## [1,]     2   2
## [2,]     4   4
## [3,]     6   6
## 
## [[2]]
##      start end
## [1,]     1   1
## 
## [[3]]
##      start end
## [1,]     3   3
## 
## [[4]]
##      start end

Extracting and replacing

If you want to obtain the character sequences matching your pattern you can use the str_extract() and str_extract_all() functions:

str_extract_all(fruits, "an")

## [[1]]
## [1] "an" "an"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "an"
## 
## [[4]]
## character(0)
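
The non-"all" variant, str_extract(), returns a plain character vector with only the first match of each element. A sketch, again assuming the four fruit names used above:

```r
library(stringr)

fruits <- c("Banana", "Apple", "Orange", "Cherry")

# first match per element; NA where there is no match
first_an <- str_extract(fruits, "an")
first_an
```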

Finally, replacing occurrences of a pattern is carried out using str_replace() or str_replace_all().

str_replace_all(fruits, "an", "..")

## [1] "B....a" "Apple"  "Or..ge" "Cherry"
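
For comparison, this sketch shows str_replace(), which replaces only the first occurrence in each element (the fruit names are assumed to be the same as above):

```r
library(stringr)

fruits <- c("Banana", "Apple", "Orange", "Cherry")

# only the first 'an' in 'Banana' is replaced
first_only <- str_replace(fruits, "an", "..")
first_only
```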
