Chapter 6 Processing text with stringr and regex
6.1 Introduction
This is the last presentation in the tidyverse series. It revolves around processing textual data: finding, extracting, and replacing patterns. Central to this task is pattern matching using regular expressions. Pattern matching is the process of finding, locating, extracting and replacing patterns in character data that usually cannot be literally described. Regular expression syntax is the language in which patterns are described in a wide range of programming languages, including R.
This topic has been dealt with in an introductory manner previously (course DAVuR1) and is repeated and expanded here. Instead of the base R functions, we now switch to the stringr package.
Like all packages from the tidyverse, stringr has many functions (type help(package = "stringr") to see which). This package has a great cheat sheet as well.
Here, a few of them will be reviewed.
6.1.1 A few remarks on “locale”
Many functions of the tidyverse packages related to time and text (and currency) accept arguments specifying the locale.
The locale is a container for all location-specific display of information.
Think of:
- Character set of the language
- Time zone, Daylight savings time
- Thousands separator and decimal symbol
- Currency symbol
Dealing with locales is a big challenge indeed for any programming language. However, since this is only an introductory course we will stick to US English and work with the current locale for times only. This note is to make you aware of the concept so that you remember this when the appropriate time comes.
6.2 Review of regular expressions
Many of the stringr functions take a regular expression as one of their arguments.
Regular expression syntax has been dealt with in a previous course/presentation. For your convenience, an overview is presented here as well.
6.2.1 Regex syntax elements
A regex can be built out of any combination of
- character sequences - Literal sequences, such as ‘chimp’.
- character classes - A listing of possibilities for a single position.
- alternatives - Are defined by the pipe symbol |.
- quantifiers - How many times the preceding block should occur.
- anchors - ^ means matching at the start of a string. $ means at the end.
The stringr cheat sheet also contains a summary of regex syntax.
6.2.2 Character classes and negation
Character classes (groups of matching characters for a single position) are placed between brackets: [adgk] means ‘a’ or ‘d’ or ‘g’ or ‘k’. Use a hyphen to create a series: [3-9] means digits 3 through 9 and [a-zA-Z] means all alphabet characters.
Character classes can be negated by putting a ^ at the beginning of the list: [^adgk] means anything but the letters a, d, g or k.
Since character classes such as [0-9] occur so frequently, they have dedicated predefined character classes such as [[:digit:]] or (equivalently) \\d. The most important other ones are these:
- any character (wildcard) is specified by a dot: .
  If you want to search for a literal dot, you need to escape its special meaning using two backslashes: \\.
- digits [[:digit:]] or \\d: equivalent to [0-9]
- alphabet characters [[:alpha:]]: equivalent to [a-zA-Z]
- lowercase characters [[:lower:]]: equivalent to [a-z]
- uppercase characters [[:upper:]]: equivalent to [A-Z]
- whitespace characters [[:space:]] or \\s: space, tab, vertical tab, newline, form feed, carriage return
- punctuation characters [[:punct:]]: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
(have a look at the cheat sheet for all)
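As a rough illustration of the classes listed above, the sketch below applies a few of them through stringr functions. The samples vector is made up for this example and is not data from the course.

```r
library(stringr)

# Hypothetical sample strings, chosen only for illustration
samples <- c("chimp 42", "Gorilla-7", "no digits here")

# Shorthand class \\d: collect every single digit per element
digits <- str_extract_all(samples, "\\d")
# digits is a list: c("4", "2"), "7", character(0)

# Negated class: remove everything that is not a letter
letters_only <- str_replace_all(samples, "[^a-zA-Z]", "")
letters_only
## [1] "chimp"        "Gorilla"      "nodigitshere"

# An escaped dot matches a literal dot only
has_dot <- str_detect(c("file.txt", "filetxt"), "\\.")
has_dot
## [1]  TRUE FALSE
```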
6.2.3 Quantifiers
Quantifiers specify how often (a part of) a pattern should occur.
- *: 0 or more times
- +: 1 or more times
- ?: 0 or 1 time
- {n}: exactly n times
- {n,}: at least n times
- {0,n}: at most n times
- {n,m}: at least n and at most m times
The * (zero or more times) and ? (zero or one time) quantifiers are sometimes confusing. Why zero? A good example is the Dutch postal code. These are all valid postal codes:
pc <- c("1234 AA", "2345-BB", "3456CC", "4567 dd")
pc
## [1] "1234 AA" "2345-BB" "3456CC" "4567 dd"
and therefore a pattern could be "\\d{4}[ -]?[a-zA-Z]{2}", where the question mark specifies that either a space or a hyphen may occur zero or one time: it may or may not be present.
The stringr package provides two nice utility functions to visualize regex matches in a character vector: str_view_all() and str_view(). The difference is that the latter function only shows the first match, if present.
str_view_all(pc, "^\\d{4}[ -]?[a-zA-Z]{2}$")
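As you can see, all four elements match in full. A string such as “56789aa” would not be a good postal code: the anchors leave room for exactly four digits before the letters.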
Note that [a-zA-Z] could have been replaced by [[:alpha:]].
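The same check can also be done programmatically. The following is a minimal sketch that uses str_detect() instead of the viewer functions; the candidates vector adds one deliberately wrong code ("56789aa") to the four postal codes used above.

```r
library(stringr)

# The four postal codes from the text plus one invalid candidate
candidates <- c("1234 AA", "2345-BB", "3456CC", "4567 dd", "56789aa")

# Four digits, an optional space or hyphen, two letters, anchored on both sides
pattern <- "^\\d{4}[ -]?[a-zA-Z]{2}$"

# TRUE for every element that is a complete, valid postal code
is_valid <- str_detect(candidates, pattern)
is_valid
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
```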
6.2.4 Anchoring
Using anchoring, you can make sure the matching string is not longer than you explicitly state.
- ^ anchors a pattern to the start of a string
- $ anchors a regex to the end of a string
sntc <- "the path of the righteous man is beset on all sides by the iniquities of the selfish, and the tyranny of evil men. --quote from?"
str_view(sntc, "evil") ## matches
str_view(sntc, "evil$") ## does not match
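To see the effect of the anchors without the viewer, here is a small sketch using str_detect() on the same sentence. The two extra patterns ("^the" and "from\\?$") are added for illustration only.

```r
library(stringr)

sntc <- "the path of the righteous man is beset on all sides by the iniquities of the selfish, and the tyranny of evil men. --quote from?"

anywhere  <- str_detect(sntc, "evil")      # TRUE: "evil" occurs somewhere
at_end    <- str_detect(sntc, "evil$")     # FALSE: the string does not end with "evil"
at_start  <- str_detect(sntc, "^the")      # TRUE: the string starts with "the"
ends_from <- str_detect(sntc, "from\\?$")  # TRUE: the literal "?" must be escaped
```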
6.2.5 Alternatives
To apply two alternative choices for a single regex element you use the pipe symbol |. You can use parentheses to fence alternatives off, as in the following pattern.
str_view_all(sntc, "(y\\s)|(\\sf)")
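Why the parentheses matter is easiest to see when the alternatives are only part of a pattern. The sketch below uses a hypothetical colours vector: with parentheses the choice is limited to a single position, without them the pipe splits the whole pattern in two.

```r
library(stringr)

# Hypothetical test strings, for illustration only
colours <- c("gray", "grey", "gra", "ey")

# Alternatives fenced off: "gr", then "a" or "e", then "y"
fenced <- str_detect(colours, "^gr(a|e)y$")
fenced
## [1]  TRUE  TRUE FALSE FALSE

# No parentheses: either "starts with gra" or "ends with ey"
unfenced <- str_detect(colours, "^gra|ey$")
unfenced
## [1] TRUE TRUE TRUE TRUE
```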
6.3 The stringr essentials
6.3.1 Case conversion
These functions all change the capitalization of (some of) the word characters of an input string. They all ignore non-word characters such as punctuation and other symbols.
- str_to_upper() converts all word characters to uppercase
- str_to_lower() converts all word characters to lowercase
- str_to_title() capitalizes all first characters of words
- str_to_sentence() capitalizes the first character in the string, not after every period
str_to_title(sntc)
## [1] "The Path Of The Righteous Man Is Beset On All Sides By The Iniquities Of The Selfish, And The Tyranny Of Evil Men. --Quote From?"
str_to_sentence(sntc)
## [1] "The path of the righteous man is beset on all sides by the iniquities of the selfish, and the tyranny of evil men. --quote from?"
6.3.2 Split, join and substring
Combining two vectors into one, collapsing one vector into a single string, or doing the reverse (splitting): these are all string-based operations that are carried out quite often in scripting.
Here are some joining operations, using str_c():
l1 <- letters[1:5]
l2 <- letters[6:10]
str_c(l1, collapse = "=")
## [1] "a=b=c=d=e"
str_c(l1, l2, sep = "+")
## [1] "a+f" "b+g" "c+h" "d+i" "e+j"
str_c(l1, l2, sep = "+", collapse = "=")
## [1] "a+f=b+g=c+h=d+i=e+j"
When you want to combine variables and text, str_glue() comes in handy:
str_glue("The value of pi is {pi} and the first month of the year is {month.name[1]}")
## The value of pi is 3.14159265358979 and the first month of the year is January
This is a more friendly approach than with paste().
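To make the comparison with paste() concrete, here is a small sketch that builds the same message both ways. The variables n_samples and group are hypothetical and only serve the example.

```r
library(stringr)

# Hypothetical values, for illustration only
n_samples <- 12
group <- "control"

# Base R: every piece of text and every variable is a separate argument
with_paste <- paste0("Group ", group, " has ", n_samples, " samples")

# stringr: variables are embedded in the text between curly braces
with_glue <- str_glue("Group {group} has {n_samples} samples")

# Both approaches give the same text
identical(as.character(with_glue), with_paste)
## [1] TRUE
```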
Splitting is slightly more tricky since str_split() accepts a regex pattern as its split argument. For instance, you can get the words of a sentence by splitting like this:
words <- str_split(sntc, "([[:punct:]]|[[:space:]])+")
words
## alternative
# str_split(sntc, "[^a-zA-Z]+")
## [[1]]
## [1] "the" "path" "of" "the" "righteous"
## [6] "man" "is" "beset" "on" "all"
## [11] "sides" "by" "the" "iniquities" "of"
## [16] "the" "selfish" "and" "the" "tyranny"
## [21] "of" "evil" "men" "quote" "from"
## [26] ""
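Note the empty string at position 26: the sentence ends with a question mark, so the split leaves an empty piece after the last separator. The sketch below shows two possible ways to get only the words; neither is presented as the preferred one.

```r
library(stringr)

sntc <- "the path of the righteous man is beset on all sides by the iniquities of the selfish, and the tyranny of evil men. --quote from?"

# Option 1: split as before, take the first list element, drop empty strings
pieces <- str_split(sntc, "([[:punct:]]|[[:space:]])+")[[1]]
words_split <- pieces[pieces != ""]

# Option 2: do not split, but extract every run of letters
words_extract <- str_extract_all(sntc, "[[:alpha:]]+")[[1]]

length(words_split)
## [1] 25
identical(words_split, words_extract)
## [1] TRUE
```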
There are two ways to get parts of character strings, or substrings. The first is by index. You can omit both start and end arguments; they will default to start and end of the string, respectively.
nucs <- c("Adenine", "Guanine", "Cytosine", "Thymine")
str_sub(nucs, end = 3)
## [1] "Ade" "Gua" "Cyt" "Thy"
You can even use this function in an assignment to replace the selected substring:
str_sub(nucs, start = 4) <- "......"
nucs
## [1] "Ade......" "Gua......" "Cyt......" "Thy......"
This does not work with literals! The following chunk gives an error:
str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......"
## Error in str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......": target of assignment expands to non-language object
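The sketch below shows a few more index-based selections, including negative indices that count from the end of the string, and the assignment form applied to a variable rather than a literal. The variable name masked is made up for this example.

```r
library(stringr)

nucs <- c("Adenine", "Guanine", "Cytosine", "Thymine")

# Characters 2 through 4 of each element
middle <- str_sub(nucs, start = 2, end = 4)
middle
## [1] "den" "uan" "yto" "hym"

# Negative indices count backwards from the end: the last three characters
tail3 <- str_sub(nucs, start = -3)
tail3
## [1] "ine" "ine" "ine" "ine"

# Assignment needs a variable as its target, so work on a copy
masked <- nucs
str_sub(masked, start = 4) <- "..."
masked
## [1] "Ade..." "Gua..." "Cyt..." "Thy..."
```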
6.3.3 Matching
When you match a pattern to a string, you usually want to know if it is there, which elements have it, where it is located in those elements or how often it is present. For each of these questions there is a dedicated function. The examples below use the vector fruits, which holds "Banana", "Apple", "Orange" and "Cherry".
- str_detect(string, pattern) detects the presence of a pattern match in a string.
str_detect(fruits, "[Aa]")
## [1] TRUE TRUE TRUE FALSE
- str_subset(string, pattern) returns only the strings that contain a pattern match.
str_subset(fruits, "[Aa]")
## [1] "Banana" "Apple" "Orange"
- str_which(string, pattern) finds the indexes of strings that contain a pattern match.
str_which(fruits, "[Aa]")
## [1] 1 2 3
- str_count(string, pattern) counts the number of matches in a string.
str_count(fruits, "[Aa]")
## [1] 3 1 1 0
- str_locate(string, pattern) and str_locate_all(string, pattern) locate the positions of pattern matches in a string.
str_locate_all(fruits, "[Aa]")
## [[1]]
##      start end
## [1,]     2   2
## [2,]     4   4
## [3,]     6   6
##
## [[2]]
##      start end
## [1,]     1   1
##
## [[3]]
##      start end
## [1,]     3   3
##
## [[4]]
##      start end
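For comparison, str_locate() reports only the first match per element and returns a single matrix, with NA where nothing matches. In the sketch below the fruits vector is reconstructed from the outputs shown above, which is an assumption since its definition is not part of this text.

```r
library(stringr)

# Assumed definition, reconstructed from the outputs shown in this section
fruits <- c("Banana", "Apple", "Orange", "Cherry")

# First match only: one row per element of fruits
loc <- str_locate(fruits, "[Aa]")
loc
##      start end
## [1,]     2   2
## [2,]     1   1
## [3,]     3   3
## [4,]    NA  NA

# The positions can be fed back into str_sub() to get the matched text
first_hit <- str_sub(fruits, loc[, "start"], loc[, "end"])
first_hit
## [1] "a" "A" "a" NA
```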
6.3.4 Extracting and replacing
If you want to obtain the character sequences matching your pattern you can use the str_extract() and str_extract_all() functions:
str_extract_all(fruits, "an")
## [[1]]
## [1] "an" "an"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "an"
##
## [[4]]
## character(0)
Finally, replacing occurrences of a pattern is carried out using str_replace() or str_replace_all().
str_replace_all(fruits, "an", "..")
## [1] "B....a" "Apple" "Or..ge" "Cherry"
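The variants without the _all suffix act on the first match in each element only. A minimal sketch of the difference follows; the fruits vector is again an assumed definition reconstructed from the outputs above.

```r
library(stringr)

# Assumed definition, reconstructed from the outputs shown in this section
fruits <- c("Banana", "Apple", "Orange", "Cherry")

# First match only: a character vector, NA where there is no match
first_an <- str_extract(fruits, "an")
first_an
## [1] "an" NA   "an" NA

# Only the first "an" in "Banana" is replaced
first_replaced <- str_replace(fruits, "an", "..")
first_replaced
## [1] "B..ana" "Apple"  "Or..ge" "Cherry"
```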