Chapter 6 Processing text with: stringr and regex

6.1 Introduction

This is the last presentation in the tidyverse series. It revolves around processing textual data: finding, extracting, and replacing patterns. Central to this task is pattern matching using regular expressions. Pattern matching is the process of finding, locating, extracting and replacing patterns in character data that usually cannot be literally described. Regular expression syntax is the language in which patterns are described in a wide range of programming languages, including R.

This topic was introduced in a previous course (DAVuR1) and is repeated and expanded here. Instead of the base R functions, we now switch to the stringr package.

Like all packages from the tidyverse, stringr has many functions (type help(package = "stringr") to see which). This package has a great cheat sheet as well.

Here, a few of them will be reviewed.

6.1.1 A few remarks on “locale”

Many functions of the tidyverse packages related to time and text (and currency) accept arguments specifying the locale. The locale is a container for all location-specific display of information. Think of:

  • Character set of the language
  • Time zone, Daylight savings time
  • Thousands separator and decimal symbol
  • Currency symbol

Dealing with locales is a big challenge indeed for any programming language. However, since this is only an introductory course we will stick to US English and work with the current locale for times only. This note is to make you aware of the concept so that you remember this when the appropriate time comes.
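
To make the concept a bit more concrete, here is a minimal sketch of how a locale argument can change the result of a case conversion. It assumes the stringr package is installed; the Turkish locale code "tr" is chosen purely for illustration.

```r
library(stringr)

# With the default locale ("en"), a lowercase i becomes a plain capital I
upper_en <- str_to_upper("i")

# Turkish distinguishes a dotted and a dotless i, so the capital gets a dot
upper_tr <- str_to_upper("i", locale = "tr")

upper_en
upper_tr
```

The same input therefore gives two different capitals, depending only on the locale.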

6.2 Review of regular expressions

Many of the stringr functions take a regular expression as one of their arguments. Regular expression syntax has been dealt with in a previous course/presentation. For your convenience, an overview is presented here as well.

6.2.1 Regex syntax elements

A regex can be built out of any combination of

  • character sequences - Literal sequences, such as ‘chimp’
  • character classes - A listing of possibilities for a single position.
  • alternatives - Are defined by the pipe symbol |.
  • quantifiers - How many times the preceding block should occur.
  • anchors - ^ means matching at the start of a string. $ means at the end.
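
The five building blocks listed above can each be tried out on a small vector. This is only a sketch: the vector x and the variable names are made up for this illustration, and str_detect() is discussed in more detail later in this chapter.

```r
library(stringr)

x <- c("chimp", "champ", "chimps", "a chimp")

literal     <- str_detect(x, "chimp")        # literal character sequence
char_class  <- str_detect(x, "ch[ia]mp")     # class: 'i' or 'a' at the third position
alternative <- str_detect(x, "chimp|champ")  # either one word or the other
quantifier  <- str_detect(x, "chimps?$")     # an optional 's' at the end
anchored    <- str_detect(x, "^chimp$")      # the whole string must be 'chimp'

literal
anchored
```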

The stringr cheat sheet also contains a summary of regex syntax.

6.2.2 Character classes and negation

Character classes (groups of matching characters for a single position) are placed between brackets: [adgk] means ‘a’ or ‘d’ or ‘g’ or ‘k’. Use a hyphen to create a series: [3-9] means digits 3 through 9 and [a-zA-Z] means all alphabet characters.

Character classes can be negated by putting a ^ at the beginning of the list: [^adgk] means anything but the letters a, d, g or k.

Since character classes such as [0-9] occur so frequently, they have dedicated shorthand notations such as [[:digit:]] or (equivalently) \\d. The most important ones are these:

  • any character (wildcard) is specified by a dot (.). If you want to search for a literal dot, you need to escape its special meaning using two backslashes: \\.
  • digits [[:digit:]] or \\d: equivalent to [0-9]
  • alphabet characters [[:alpha:]]: for plain English (ASCII) text, equivalent to [a-zA-Z]
  • lowercase characters [[:lower:]]: for plain English (ASCII) text, equivalent to [a-z]
  • uppercase characters [[:upper:]]: for plain English (ASCII) text, equivalent to [A-Z]
  • whitespace characters [[:space:]] or \\s: Space, tab, vertical tab, newline, form feed, carriage return
  • punctuation characters [[:punct:]]: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

(Have a look at the cheat sheet for the complete list.)
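
A short sketch of some of these classes in action may help. The example string s is made up, and str_extract_all() and str_count() are covered later in this chapter.

```r
library(stringr)

s <- "Room 12B, floor 3."

digits  <- str_extract_all(s, "\\d")[[1]]          # every single digit
uppers  <- str_extract_all(s, "[[:upper:]]")[[1]]  # every uppercase letter
negated <- str_extract_all(s, "[^a-z ]")[[1]]      # anything but lowercase letters and spaces

n_space <- str_count(s, "\\s")  # whitespace characters
n_dot   <- str_count(s, "\\.")  # literal dots only (escaped)
n_any   <- str_count(s, ".")    # unescaped dot: matches every character

digits
uppers
```

Note the difference between the last two counts: the escaped dot matches only the period at the end, whereas the unescaped dot matches each character of the string.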

6.2.3 Quantifiers

Quantifiers specify how often (part of) a pattern should occur.

  • *: 0 or more times
  • +: 1 or more times
  • ?: 0 or 1 time
  • {n}: exactly n times
  • {n,}: at least n times
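  • {0,n}: at most n times (base R also accepts the short form {,n}, but the regex engine used by stringr may reject it)
  • {n,m}: at least n and at most m times (no space after the comma).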
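
To see how the quantifiers differ, you can apply them to strings containing zero up to three b's. The vector x below is made up for this sketch, and all patterns are anchored so that the whole string has to match.

```r
library(stringr)

x <- c("ac", "abc", "abbc", "abbbc")

star     <- str_detect(x, "^ab*c$")      # zero or more b's
plus     <- str_detect(x, "^ab+c$")      # one or more b's
optional <- str_detect(x, "^ab?c$")      # zero or one b
exactly2 <- str_detect(x, "^ab{2}c$")    # exactly two b's
atleast2 <- str_detect(x, "^ab{2,}c$")   # two or more b's
one_two  <- str_detect(x, "^ab{1,2}c$")  # one or two b's

star
optional
```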

The quantifiers * (zero or more times) and ? (zero or one time) are sometimes confusing. Why zero? A good example is the Dutch postal code. These are all valid postal codes:

pc <- c("1234 AA", "2345-BB", "3456CC", "4567 dd")
pc
## [1] "1234 AA" "2345-BB" "3456CC"  "4567 dd"

and therefore a pattern could be "\\d{4}[ -]?[a-zA-Z]{2}" where the question mark specifies that either a space or a hyphen may occur zero or one time: It may or may not be present.

The stringr package provides two nice utility functions to visualize regex matches in a character vector: str_view_all() and str_view(). The difference is that the latter function only shows the first match, if present. (In stringr 1.5.0 and later, str_view() shows all matches and str_view_all() is deprecated.)

str_view_all(c(pc, "56789aa"), "^\\d{4}[ -]?[a-zA-Z]{2}$")

As you can see, the appended element (“56789aa”) is not a good postal code: it has five digits, so the anchored pattern does not match it.

Note that [a-zA-Z] could have been replaced by [[:alpha:]].
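
Outside an interactive viewer, the same pattern can be used to validate and clean up codes. The sketch below is one possible approach, not the only one; the names codes, is_valid and normalised are made up, and the functions used are discussed later in this chapter.

```r
library(stringr)

codes   <- c("1234 AA", "2345-BB", "3456CC", "4567 dd", "56789aa")
pattern <- "^\\d{4}[ -]?[a-zA-Z]{2}$"

# TRUE for every element that is a well-formed postal code
is_valid <- str_detect(codes, pattern)

# Keep the valid ones, drop the optional separator and use uppercase letters
normalised <- str_to_upper(str_replace(codes[is_valid], "[ -]", ""))

is_valid
normalised
```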

6.2.4 Anchoring

Using anchoring, you can make sure the matching string is not longer than you explicitly state.

  • ^ anchors a pattern to the start of a string
  • $ anchors a pattern to the end of a string

sntc <- "the path of the righteous man is beset on all sides by the iniquities of the selfish,  and the tyranny of evil men. --quote from?"

str_view(sntc, "evil")   # matches: "evil" occurs somewhere in the string
str_view(sntc, "evil$")  # does not match: the string ends with "from?", not "evil"
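
The effect of both anchors is perhaps easier to see on a few short strings. This is a small sketch with a made-up vector x:

```r
library(stringr)

x <- c("evil men", "men of evil", "evil")

at_start <- str_detect(x, "^evil")   # must start with "evil"
at_end   <- str_detect(x, "evil$")   # must end with "evil"
exact    <- str_detect(x, "^evil$")  # must be exactly "evil", nothing more

at_start
at_end
exact
```

Using both anchors together is the usual way to make sure nothing precedes or follows the pattern.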

6.2.5 Alternatives

To apply two alternative choices for a single regex element you use the pipe symbol |. You can use parentheses to fence alternatives off, as in (foo|bar).

str_view_all(sntc, "(y\\s)|(\\sf)")
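
Why the parentheses matter becomes clear when alternatives are combined with other elements. In the sketch below (the vector x is made up), the pipe without parentheses splits the entire pattern in two, which is usually not what you want.

```r
library(stringr)

x <- c("gray", "grey", "gra", "ey")

# Parentheses limit the alternatives to a single position: 'a' or 'e'
grouped <- str_detect(x, "^gr(a|e)y$")

# Without parentheses this reads as: "^gra" OR "ey$"
ungrouped <- str_detect(x, "^gra|ey$")

grouped
ungrouped
```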

6.3 The stringr essentials

6.3.1 Case conversion

These functions all change the capitalization of (some of) the word characters of an input string. They all ignore non-word characters such as punctuation and other symbols.

  • str_to_upper() converts all word characters to uppercase
  • str_to_lower() converts all word characters to lowercase
  • str_to_title() capitalizes all first characters of words
  • str_to_sentence() capitalizes only the first character of the string, not the first character after every period

str_to_title(sntc)
## [1] "The Path Of The Righteous Man Is Beset On All Sides By The Iniquities Of The Selfish,  And The Tyranny Of Evil Men. --Quote From?"
str_to_sentence(sntc)
## [1] "The path of the righteous man is beset on all sides by the iniquities of the selfish,  and the tyranny of evil men. --quote from?"
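
A common use of case conversion is comparing strings regardless of capitalization. Here is a minimal sketch; the example strings are made up.

```r
library(stringr)

s <- "hello, world! 42"

upper <- str_to_upper(s)  # punctuation and digits are left untouched
title <- str_to_title(s)

# Normalise case before comparing, so "Apple" and "APPLE" count as equal
spellings <- c("Apple", "APPLE", "apple")
same_word <- str_to_lower(spellings) == "apple"

upper
title
same_word
```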

6.3.2 Split, join and substring

Combining two vectors into one, collapsing one vector into a single string, or doing the reverse (splitting): these are all string-based operations that are carried out quite often in scripting.

Here are some joining operations, using str_c():

l1 <- letters[1:5]
l2 <- letters[6:10]

str_c(l1, collapse = "=")
## [1] "a=b=c=d=e"
str_c(l1, l2, sep = "+")
## [1] "a+f" "b+g" "c+h" "d+i" "e+j"
str_c(l1, l2, sep = "+", collapse = "=")
## [1] "a+f=b+g=c+h=d+i=e+j"
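
Two details of str_c() are worth knowing: shorter arguments are recycled, and missing values are contagious. The sketch below illustrates both; str_replace_na() is one way to turn NA into the literal text "NA" first, if that is what you want.

```r
library(stringr)

# A single string is recycled to the length of the longer vector
labels <- str_c("x", 1:3)

# A missing value in the input gives a missing value in the output
with_na <- str_c(c("a", NA), "b")

# Convert NA to the string "NA" before joining
na_as_text <- str_c(str_replace_na(c("a", NA)), "b")

labels
with_na
na_as_text
```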

When you want to combine variables and text, str_glue() comes in handy:

str_glue("The value of pi is {pi} and the first month of the year is {month.name[1]}")
## The value of pi is 3.14159265358979 and the first month of the year is January

This is friendlier than the equivalent call to paste().
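
For comparison, here are both approaches side by side. This is only a sketch; the variables pkg and n are made up for the illustration.

```r
library(stringr)

pkg <- "stringr"
n   <- 3

# With paste0() the text is chopped into pieces around the variables
pasted <- paste0("Package ", pkg, " is used in ", n, " examples.")

# With str_glue() the variables are embedded in a single template
glued <- str_glue("Package {pkg} is used in {n} examples.")

# Any R expression is allowed between the braces
doubled <- str_glue("Twice that is {n * 2}.")

pasted
glued
doubled
```

Both calls produce the same text, but the template is easier to read and to edit.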

Splitting is slightly trickier since str_split() takes a regex pattern as the split argument. For instance, you can get the words of a sentence by splitting like this:

words <- str_split(sntc, "([[:punct:]]|[[:space:]])+")
words
##alternative
#str_split(sntc, "[^a-zA-Z]+")
## [[1]]
##  [1] "the"        "path"       "of"         "the"        "righteous" 
##  [6] "man"        "is"         "beset"      "on"         "all"       
## [11] "sides"      "by"         "the"        "iniquities" "of"        
## [16] "the"        "selfish"    "and"        "the"        "tyranny"   
## [21] "of"         "evil"       "men"        "quote"      "from"      
## [26] ""
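
Note the empty string at the end of the result: the sentence ends with a separator (the question mark), so there is an empty piece after it. A minimal sketch of how you could deal with that, using a shortened version of the sentence:

```r
library(stringr)

# str_split() returns a list with one element per input string
pieces <- str_split("evil men. --quote from?", "[^a-zA-Z]+")[[1]]

# Drop the empty strings caused by leading or trailing separators
words_clean <- pieces[pieces != ""]

pieces
words_clean
```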

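There are two ways to get parts of character strings, or substrings. The first is by index, using str_sub(); the second is by pattern, which is discussed in the section on extracting. You can omit either or both of the start and end arguments of str_sub(); they default to the start and end of the string, respectively.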

nucs <- c("Adenine", "Guanine", "Cytosine", "Thymine")
str_sub(nucs, end = 3)
## [1] "Ade" "Gua" "Cyt" "Thy"

You can even use this function to replace the selected substring:

str_sub(nucs, start = 4) <- "......"
nucs
## [1] "Ade......" "Gua......" "Cyt......" "Thy......"
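
Indices can also be negative, in which case they count backwards from the end of the string. A short sketch (the vector bases is simply a fresh copy of the nucleotide names):

```r
library(stringr)

bases <- c("Adenine", "Guanine", "Cytosine", "Thymine")

first_letter <- str_sub(bases, 1, 1)   # first character only
middle       <- str_sub(bases[1], 2, 4)  # characters 2 through 4
last_three   <- str_sub(bases, -3)     # from the third-last character to the end

first_letter
middle
last_three
```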

This does not work with literals! The following chunk gives an error:

str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......"
## Error in str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......": target of assignment expands to non-language object

6.3.3 Matching

When you match a pattern to a string, you usually want to know if it is there, which elements have it, where it is located in those elements or how often it is present. For each of these questions there is a dedicated function:

  • str_detect(string, pattern) detects the presence of a pattern match in a string.

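    # example vector, inferred from the outputs shown in this section
    fruits <- c("Banana", "Apple", "Orange", "Cherry")
    str_detect(fruits, "[Aa]")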
    ## [1]  TRUE  TRUE  TRUE FALSE
  • str_subset(string, pattern) returns only the strings that contain a pattern match

    str_subset(fruits, "[Aa]")
    ## [1] "Banana" "Apple"  "Orange"
  • str_which(string, pattern) finds the indexes of strings that contain a pattern match.

    str_which(fruits, "[Aa]")
    ## [1] 1 2 3
  • str_count(string, pattern) counts the number of matches in a string.

    str_count(fruits, "[Aa]")
    ## [1] 3 1 1 0
  • str_locate(string, pattern) and str_locate_all(string, pattern) locate the positions of pattern matches in a string; the first returns only the first match per string, the second returns all of them

    str_locate_all(fruits, "[Aa]")
    ## [[1]]
    ##      start end
    ## [1,]     2   2
    ## [2,]     4   4
    ## [3,]     6   6
    ## 
    ## [[2]]
    ##      start end
    ## [1,]     1   1
    ## 
    ## [[3]]
    ##      start end
    ## [1,]     3   3
    ## 
    ## [[4]]
    ##      start end
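
Because str_detect() returns a logical vector, it combines well with ordinary R subsetting and summarising. The sketch below assumes the fruits vector holds the four names that appear in the outputs above; the other variable names are made up.

```r
library(stringr)

fruits <- c("Banana", "Apple", "Orange", "Cherry")

# How many elements contain an 'a' or 'A'?
n_with_a <- sum(str_detect(fruits, "[Aa]"))

# negate = TRUE selects the elements that do NOT match
without_a <- fruits[str_detect(fruits, "[Aa]", negate = TRUE)]

# str_locate() gives one row per string: the first match only, or NA
first_pos <- str_locate(fruits, "[Aa]")

n_with_a
without_a
first_pos
```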

6.3.4 Extracting and replacing

If you want to obtain the character sequences matching your pattern you can use the str_extract() and str_extract_all() functions. The first returns only the first match in each string, the second returns all matches:

str_extract_all(fruits, "an")
## [[1]]
## [1] "an" "an"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "an"
## 
## [[4]]
## character(0)
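
For comparison, here is what str_extract() gives on the same data, together with the simplify argument of str_extract_all(). As before, this sketch assumes fruits holds the four names seen in the outputs of this section.

```r
library(stringr)

fruits <- c("Banana", "Apple", "Orange", "Cherry")

# One result per string: the first match, or NA when there is none
first_match <- str_extract(fruits, "an")

# simplify = TRUE returns a character matrix instead of a list,
# padded with empty strings where a string has fewer matches
match_matrix <- str_extract_all(fruits, "an", simplify = TRUE)

first_match
match_matrix
```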

Finally, replacing occurrences of a pattern is carried out using str_replace(), which replaces only the first match in each string, or str_replace_all(), which replaces every match.

str_replace_all(fruits, "an", "..")
## [1] "B....a" "Apple"  "Or..ge" "Cherry"
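
Replacements can do more than insert fixed text. The sketch below shows the first-match-only behaviour of str_replace(), a backreference (\\1 refers to the part matched between the first pair of parentheses) and several replacements in one call using a named vector. It again assumes fruits holds the four names seen in the outputs above.

```r
library(stringr)

fruits <- c("Banana", "Apple", "Orange", "Cherry")

# Only the first "an" in each string is replaced
first_only <- str_replace(fruits, "an", "..")

# Put brackets around every lowercase vowel, reusing the matched character
wrapped <- str_replace_all(fruits, "([aeiou])", "<\\1>")

# A named vector specifies several pattern = replacement pairs at once
multi <- str_replace_all("abc", c("a" = "1", "b" = "2"))

first_only
wrapped
multi
```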