11 Text processing with regex
11.1 Regex syntax
11.1.1 Why regexes?
This chapter deals with processing data in textual form: character data.
When working with text, you really need to be able to work with regular expressions. That is why these are dealt with first, together with the base R regex functions. After that, the functions from the stringr package are discussed.
It is easy enough to look for the word “Chimpanzee” in a vector containing animal species names:
animals = c("Chimpanzee", "Cow", "Camel")
animals == "Chimpanzee"
## [1] TRUE FALSE FALSE
but what are you going to do if there are multiple variants of the word you are looking for? This?
animals = c("Chimpanzee", "Chimp", "chimpanzee", "Camel")
animals == "Chimpanzee" | animals == "Chimp" | animals == "chimpanzee"
## [1] TRUE TRUE TRUE FALSE
The solution is not to use literals, but to describe patterns.
Look at the above example. How would you describe a pattern that would correctly identify all Chimpanzee occurrences?
Is your pattern something like this?
A letter C in upper or lower case, followed by ‘himp’, followed by nothing or ‘anzee’
In programming we use regular expressions or RegEx to describe such a pattern in a formal concise way:
[Cc]himp(anzee)?
To apply such a pattern in R, we use one of several functions dedicated to this task. Here is one, grepl(), which returns TRUE for each vector element that the regex matches.
grepl("[Cc]himp(anzee)?", animals)
## [1] TRUE TRUE TRUE FALSE
Pattern matching is the process of finding, locating, extracting and replacing patterns in character data that usually cannot be literally described.
Base functions using regex
There are several base R functions dedicated to finding patterns in character data.
They differ in intent and output. Later, the stringr counterparts will be discussed.
- finding: Does an element contain a pattern (TRUE/FALSE)?
  grepl(pattern, string)
- locating: Which elements contain a pattern (INDEX)?
  grep(pattern, string)
- extracting: Get the content of matching elements
  grep(pattern, string, value = TRUE)
- replace: Replace the first occurrence of the pattern
  sub(pattern, replacement, string)
- replace all: Replace all occurrences of the pattern
  gsub(pattern, replacement, string)
Note that the stringr package from the tidyverse has many user-friendly functions in this field as well. Two of them will be dealt with in the exercises.
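As a quick illustration, the five base functions can be applied side by side to the animals vector used earlier. This is only a minimal sketch; the replacement of "e" by "E" is an arbitrary choice made for this example.

```r
animals <- c("Chimpanzee", "Chimp", "chimpanzee", "Camel")
pattern <- "[Cc]himp(anzee)?"

found <- grepl(pattern, animals)               # logical, one per element
idx   <- grep(pattern, animals)                # integer indices of matches
hits  <- grep(pattern, animals, value = TRUE)  # the matching elements

# Replacement: "e" -> "E" is an arbitrary choice for this illustration
first_only <- sub("e", "E", animals)   # only the first "e" per element
all_hits   <- gsub("e", "E", animals)  # every "e" per element

found       # TRUE TRUE TRUE FALSE
idx         # 1 2 3
hits        # "Chimpanzee" "Chimp" "chimpanzee"
first_only  # "ChimpanzEe" "Chimp" "chimpanzEe" "CamEl"
all_hits    # "ChimpanzEE" "Chimp" "chimpanzEE" "CamEl"
```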
11.1.2 Regex components
A regular expression can be built out of any combination of
- character sequences - Literal character sequences such as ‘chimp’
- character classes - A listing of possibilities for a single position.
  - Between brackets: [adgk] means ‘a’ or ‘d’ or ‘g’ or ‘k’.
  - Use a hyphen to create a series: [3-9] means digits 3 through 9 and [a-zA-Z] means all alphabet characters.
  - Negate using ^ as the first character inside the brackets. [^adgk] means anything but a, d, g or k.
  - A special case is the dot . : any character matches.
  - Many special character classes exist (digits, whitespaces etc). They are discussed in a later paragraph.
- alternatives - Are defined by the pipe symbol | : “OR”
- quantifiers - How many times the preceding block should occur. See next paragraph.
- anchors - ^ means matching at the start of a string. $ means at the end.
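The components listed above can be tried out one at a time with grepl(). The vector below is made up for this sketch and is not part of the course data.

```r
# Hypothetical test words, chosen to differ in one position only
words <- c("chimp", "champ", "chomp", "chmp", "Chimp")

literal   <- grepl("chimp", words)        # literal character sequence
class_ia  <- grepl("ch[ia]mp", words)     # class: 'i' or 'a' at position 3
negated   <- grepl("ch[^ia]mp", words)    # anything but 'i' or 'a' there
any_char  <- grepl("ch.mp", words)        # dot: any single character
either    <- grepl("chimp|chomp", words)  # alternatives
anchored  <- grepl("^[Cc]himp$", words)   # anchors plus a class

literal   # TRUE FALSE FALSE FALSE FALSE
class_ia  # TRUE TRUE FALSE FALSE FALSE
negated   # FALSE FALSE TRUE FALSE FALSE
any_char  # TRUE TRUE TRUE FALSE FALSE
either    # TRUE FALSE TRUE FALSE FALSE
anchored  # TRUE FALSE FALSE FALSE TRUE
```

Note that "chmp" never matches the class or dot patterns: a character class always consumes exactly one character.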
An excellent cheat sheet from the RStudio website is also included here
Quantifiers
Use quantifiers to specify how many times a character or series of characters should occur.
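- {n}: exactly n times
- {n,}: at least n times
- {0,n}: at most n times (the shorthand {,n} is not accepted by every regex engine, so writing the 0 explicitly is safer)
- {n,m}: at least n and at most m times
- *: 0 or more times; same as {0,}
- +: 1 or more times; same as {1,}
- ?: 0 or 1 time; same as {0,1}
Write the bounds without spaces; depending on the regex engine, a space inside the braces may be rejected or not treated as a quantifier.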
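To see the quantifiers in action, here is a small sketch on a made-up vector in which only the number of "a" characters varies. The patterns are anchored with ^ and $ so that the whole element has to match.

```r
# Hypothetical strings with 0, 1, 2 and 3 times the letter 'a'
x <- c("ct", "cat", "caat", "caaat")

exactly_2   <- grepl("^ca{2}t$", x)
at_least_2  <- grepl("^ca{2,}t$", x)
at_most_2   <- grepl("^ca{0,2}t$", x)
one_or_two  <- grepl("^ca{1,2}t$", x)
star        <- grepl("^ca*t$", x)
plus        <- grepl("^ca+t$", x)
optional    <- grepl("^ca?t$", x)

exactly_2   # FALSE FALSE TRUE FALSE
at_least_2  # FALSE FALSE TRUE TRUE
at_most_2   # TRUE TRUE TRUE FALSE
one_or_two  # FALSE TRUE TRUE FALSE
star        # TRUE TRUE TRUE TRUE
plus        # FALSE TRUE TRUE TRUE
optional    # TRUE TRUE FALSE FALSE
```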
Anchoring
Using anchoring, you can make sure the string is not longer than you explicitly state:
dates <- c("15/2/2019", "15-2-2019", "15-02-2019", "015/2/20191", "15/2/20191")
dateRegex <- "^[0-9]{2}[/-][0-9]{1,2}[/-][0-9]{4}$"
grep(pattern = dateRegex, x = dates, value = TRUE)
## [1] "15/2/2019" "15-2-2019" "15-02-2019"
With both anchors in place, only the elements that consist of a date and nothing else are matched.
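The contribution of each anchor becomes visible when they are added one at a time. The sketch below reuses the date pattern without anchors (stored in a helper variable named core, a name chosen for this example) and a shortened version of the dates vector.

```r
dates <- c("15/2/2019", "015/2/20191", "15/2/20191")
core  <- "[0-9]{2}[/-][0-9]{1,2}[/-][0-9]{4}"

no_anchor  <- grepl(core, dates)                    # pattern anywhere
start_only <- grepl(paste0("^", core), dates)       # must start the string
end_only   <- grepl(paste0(core, "$"), dates)       # must end the string
both       <- grepl(paste0("^", core, "$"), dates)  # must span the string

no_anchor   # TRUE TRUE TRUE
start_only  # TRUE FALSE TRUE
end_only    # TRUE FALSE FALSE
both        # TRUE FALSE FALSE
```

The start anchor alone still accepts "15/2/20191", because the extra digit trails the match; only the end anchor rejects it.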
Metacharacters: Special character classes
Since patterns such as [0-9] occur so frequently, they have dedicated character classes such as [[:digit:]]. The most important other ones are
- digits [[:digit:]] or \\d: equivalent to [0-9]
- alphabet characters [[:alpha:]]: equivalent to [a-zA-Z]
- lowercase characters [[:lower:]]: equivalent to [a-z]
- uppercase characters [[:upper:]]: equivalent to [A-Z]
- whitespace characters [[:space:]] or \\s: Space, tab, vertical tab, newline, form feed, carriage return
- punctuation characters [[:punct:]]: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
(have a look at the cheat sheet for all)
Here is the date example again, this time using these predefined character classes. Because this pattern is not anchored, "15/2/20191" is matched as well:
dates <- c("15/2/2019", "15-2-2019", "15-02-2019", "15022019", "15/2/20191")
dateRegex <- "[[:digit:]]{2}[/-]\\d{1,2}[/-]\\d{4}"
grep(pattern = dateRegex, x = dates, value = TRUE)
## [1] "15/2/2019" "15-2-2019" "15-02-2019" "15/2/20191"
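A short sketch of the individual classes, using a made-up vector. The double backslash in "\\d" is needed because the backslash itself has to be escaped inside an R string literal; the regex engine receives \d.

```r
# Hypothetical strings: letters, capitals, digits, a space, punctuation
x <- c("abc", "ABC", "123", "a b", "a.b")

has_digit   <- grepl("[[:digit:]]", x)
has_digit2  <- grepl("\\d", x)             # shorthand for the same class
has_upper   <- grepl("[[:upper:]]", x)
has_space   <- grepl("[[:space:]]", x)
has_punct   <- grepl("[[:punct:]]", x)
only_alpha  <- grepl("^[[:alpha:]]+$", x)  # letters from start to end

has_digit   # FALSE FALSE TRUE FALSE FALSE
has_upper   # FALSE TRUE FALSE FALSE FALSE
has_space   # FALSE FALSE FALSE TRUE FALSE
has_punct   # FALSE FALSE FALSE FALSE TRUE
only_alpha  # TRUE TRUE FALSE FALSE FALSE
```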
Alternatives
To apply two alternative choices for a single regex element you use the pipe symbol |. You can use parentheses, as in (creatine|calcium), to fence alternatives off.
column_names <- c("Subject", "Age", "T0_creatine", "T0_calcium", "T1_creatine", "T1_calcium")
grep(pattern = "T[01]_(creatine|calcium)", x = column_names, value = TRUE)
## [1] "T0_creatine" "T0_calcium" "T1_creatine" "T1_calcium"
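Why the parentheses matter can be shown by leaving them out. In the sketch below, the extra column name "calcium_total" is hypothetical and was added only to make the difference visible.

```r
column_names <- c("Subject", "Age", "T0_creatine", "T0_calcium",
                  "T1_creatine", "T1_calcium",
                  "calcium_total")  # hypothetical extra column

# Without parentheses the pipe splits the WHOLE pattern:
# "T[01]_creatine" OR "calcium"
loose  <- grep("T[01]_creatine|calcium", column_names, value = TRUE)

# With parentheses only the part inside them is an alternative
fenced <- grep("T[01]_(creatine|calcium)", column_names, value = TRUE)

loose   # also returns "calcium_total"
fenced  # the four T0_/T1_ columns only
```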
11.1.3 Some examples
Restriction enzymes
This is the recognition sequence for the HincII restriction endonuclease:
5'-GTYRAC-3'
3'-CARYTG-5'
Before reading on: how would you define a regular expression that precisely describes this recognition sequence?
Molecular biology sequence ambiguity codes can be found here
HincII_rs <- "GT[CT][AG]AC"
sequences <- c("GTCAAC",
"GTCGAC",
"GTTGAC",
"aGTTAACa",
"GTGCAC")
grep(pattern = HincII_rs, x = sequences, value = TRUE)
## [1] "GTCAAC" "GTCGAC" "GTTGAC" "aGTTAACa"
Dutch dates
Here are some Dutch dates, in different accepted formats. The last two are not a correct notation. Create a RegEx that will determine whether an element contains a Dutch date string.
dates <- c("15/2/2019", "15-2-2019", "15-02-2019", "015/2/20191", "15/2/20191")
dateRegex <- "[0-9]{2}[/-][0-9]{1,2}[/-][0-9]{4}"
grep(pattern = dateRegex, x = dates, value = TRUE)
## [1] "15/2/2019" "15-2-2019" "15-02-2019" "015/2/20191" "15/2/20191"
Why were the last two matched? Because the pattern is there, albeit embedded in a longer string. We have to anchor the pattern to be more specific.
Exercise: Postal codes
Here are some Dutch zip (postal) codes, in different accepted formats. The last two are not a correct notation. Can you create a RegEx that will determine whether an element contains a Dutch zip code?
zips <- c("1234 AA", "2345-BB", "3456CC", "4567 dd", "56789aa", "6789a_")
zips
## [1] "1234 AA" "2345-BB" "3456CC" "4567 dd" "56789aa" "6789a_"
Exercise: Prosite patterns
Prosite is a database of amino acid sequence motifs. One of them is the Histidine Triad profile (PDOC00694).
[NQAR]-x(4)-[GSAVY]-x-[QFLPA]-x-[LIVMY]-x-[HWYRQ]-
[LIVMFYST]-H-[LIVMFT]-H-[LIVMF]-[LIVMFPT]-[PSGAWN]
- Write this down as a RegEx
- Was that efficient? Using the gsub() function, can you convert it into a RegEx using code? It may take several iterations. Was that efficient?
- Next, use an appropriate function to find if, and where, this pattern is located within the sequences in file data/hit_proteins.txt (here)
Amino Acid codes and Prosite pattern encoding can be found here
11.2 The stringr package
This is the last presentation in the tidyverse series. It revolves around processing textual data: finding, extracting, and replacing patterns. Central to this task is pattern matching using regular expressions. Pattern matching is the process of finding, locating, extracting and replacing patterns in character data that usually cannot be literally described. Regular expression syntax is the language in which patterns are described in a wide range of programming languages, including R.
This topic has been dealt with in an introductory manner previously (course DAVuR1), and is repeated and expanded here. Instead of the base R functions we now switch to the stringr package.
Like all packages from the tidyverse, stringr has many functions (type help(package = "stringr") to see which). This package has a great cheat sheet as well.
Here, a few of them will be reviewed.
11.2.1 A few remarks on “locale”
Many functions of the tidyverse packages related to time and text (and currency) accept arguments specifying the locale.
The locale is a container for all location-specific display of information.
Think of:
- Character set of the language
- Time zone, Daylight savings time
- Thousands separator and decimal symbol
- Currency symbol
Dealing with locales is a big challenge indeed for any programming language. However, since this is only an introductory course we will stick to US English and work with the current locale for times only. This note is to make you aware of the concept so that you remember this when the appropriate time comes.
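For the curious, here is a minimal sketch of what a locale argument can change. It assumes the stringr package is installed and uses the Turkish locale code "tr", in which the uppercase of "i" is a dotted capital İ.

```r
library(stringr)

default_upper <- str_to_upper("i")                 # default: English rules
turkish_upper <- str_to_upper("i", locale = "tr")  # Turkish rules

default_upper  # "I"
turkish_upper  # "İ" (capital I with a dot above)
```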
11.2.2 The stringr essentials
Case conversion
These functions all change the capitalization of (some of) the word characters of an input string. They all ignore non-word characters such as punctuation and other symbols.
- str_to_upper() converts all word characters to uppercase
- str_to_lower() converts all word characters to lowercase
- str_to_title() capitalizes all first characters of words
- str_to_sentence() capitalizes the first character in the string, not after every period
sntc <- "the path of the righteous man is beset on all sides by the iniquities of the selfish, and the tyranny of evil men. --quote from?"
str_to_title(sntc)
## [1] "The Path Of The Righteous Man Is Beset On All Sides By The Iniquities Of The Selfish, And The Tyranny Of Evil Men. --Quote From?"
str_to_sentence(sntc)
## [1] "The path of the righteous man is beset on all sides by the iniquities of the selfish, and the tyranny of evil men. --quote from?"
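The other two functions behave as their names suggest. A small sketch, assuming stringr is installed; the input string is made up for this illustration.

```r
library(stringr)

# Hypothetical input containing letters, a digit and punctuation
x <- "Histone H2A, variant Z"

upper <- str_to_upper(x)
lower <- str_to_lower(x)

upper  # "HISTONE H2A, VARIANT Z"
lower  # "histone h2a, variant z"
```

Digits, the comma and the spaces are left untouched in both cases.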
Split, join and substring
Combining two vectors into one, collapsing one vector into a single string, or doing the reverse (splitting): these are all string-based operations that are carried out quite often in scripting.
Here are some joining operations, using str_c():
l1 <- letters[1:5]
l2 <- letters[6:10]
str_c(l1, collapse = "=")
## [1] "a=b=c=d=e"
str_c(l1, l2, sep = "+")
## [1] "a+f" "b+g" "c+h" "d+i" "e+j"
str_c(l1, l2, sep = "+", collapse = "=")
## [1] "a+f=b+g=c+h=d+i=e+j"
When you want to combine variables and text, str_glue() comes in handy:
str_glue("The value of pi is {pi} and the first month of the year is {month.name[1]}")
## The value of pi is 3.14159265358979 and the first month of the year is January
This is a friendlier approach than using paste().
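For comparison, this is roughly what the same sentence looks like with base R. The sketch uses paste0() (paste() without a separator) and assumes stringr is installed for the str_glue() call.

```r
library(stringr)

# str_glue(): variables are embedded in the text with {}
glued <- str_glue("The value of pi is {pi} and the first month of the year is {month.name[1]}")

# Base R: text and variables are passed as separate, comma-separated pieces
pasted <- paste0("The value of pi is ", pi,
                 " and the first month of the year is ", month.name[1])

# Both approaches produce the same text
as.character(glued) == pasted
```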
Splitting is slightly more tricky since it accepts a regex pattern as split argument. For instance, you can get the words of a sentence by splitting like this:
words <- str_split(sntc, "([[:punct:]]|[[:space:]])+")
words
##alternative
#str_split(sntc, "[^a-zA-Z]+")
## [[1]]
## [1] "the" "path" "of" "the" "righteous"
## [6] "man" "is" "beset" "on" "all"
## [11] "sides" "by" "the" "iniquities" "of"
## [16] "the" "selfish" "and" "the" "tyranny"
## [21] "of" "evil" "men" "quote" "from"
## [26] ""
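The empty string at the end of the result appears because the sentence ends with a separator (the question mark): str_split() also returns what comes after the last separator, which is nothing. One possible way to drop such empty elements is sketched below on a shortened version of the sentence; it assumes stringr is installed.

```r
library(stringr)

short <- "the tyranny of evil men. --quote from?"

# str_split() returns a list; [[1]] takes the words of the first string
words <- str_split(short, "([[:punct:]]|[[:space:]])+")[[1]]
words
# "the" "tyranny" "of" "evil" "men" "quote" "from" ""

# Keep only the non-empty elements
clean <- words[words != ""]
clean
# "the" "tyranny" "of" "evil" "men" "quote" "from"
```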
There are two ways to get parts of character strings, or substrings. The first is by index. You can omit the start argument, the end argument, or both; they default to the start and end of the string, respectively.
nucs <- c("Adenine", "Guanine", "Cytosine", "Thymine")
str_sub(nucs, end = 3)
## [1] "Ade" "Gua" "Cyt" "Thy"
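Two further variations, sketched under the assumption that stringr is installed: giving both start and end, and using a negative index, which str_sub() counts from the end of the string.

```r
library(stringr)

nucs <- c("Adenine", "Guanine", "Cytosine", "Thymine")

middle <- str_sub(nucs, start = 2, end = 4)  # characters 2 through 4
tail4  <- str_sub(nucs, start = -4)          # the last four characters

middle  # "den" "uan" "yto" "hym"
tail4   # "nine" "nine" "sine" "mine"
```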
You can even use this function to replace the selected substring:
str_sub(nucs, start = 4) <- "......"
nucs
## [1] "Ade......" "Gua......" "Cyt......" "Thy......"
This does not work with literals! The following chunk gives an error:
str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......"
## Error in str_sub(c("Adenine", "Guanine", "Cytosine", "Thymine"), start = 4) <- "......": target of assignment expands to non-language object
Matching
When you match a pattern to a string, you usually want to know if it is there, which elements have it, where it is located in those elements or how often it is present. For each of these questions there is a dedicated function. The examples use the following vector (its definition is inferred from the outputs shown):
fruits <- c("Banana", "Apple", "Orange", "Cherry")
- str_detect(string, pattern) detects the presence of a pattern match in a string.
  str_detect(fruits, "[Aa]")
  ## [1] TRUE TRUE TRUE FALSE
- str_subset(string, pattern) returns only the strings that contain a pattern match.
  str_subset(fruits, "[Aa]")
  ## [1] "Banana" "Apple" "Orange"
- str_which(string, pattern) finds the indexes of strings that contain a pattern match.
  str_which(fruits, "[Aa]")
  ## [1] 1 2 3
- str_count(string, pattern) counts the number of matches in a string.
  str_count(fruits, "[Aa]")
  ## [1] 3 1 1 0
- str_locate(string, pattern) and str_locate_all(string, pattern) locate the positions of pattern matches in a string.
  str_locate_all(fruits, "[Aa]")
  ## [[1]]
  ##      start end
  ## [1,]     2   2
  ## [2,]     4   4
  ## [3,]     6   6
  ##
  ## [[2]]
  ##      start end
  ## [1,]     1   1
  ##
  ## [[3]]
  ##      start end
  ## [1,]     3   3
  ##
  ## [[4]]
  ##      start end
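The str_locate() variant without _all reports only the first match per element and returns a single matrix instead of a list. A sketch, assuming stringr is installed and assuming the fruits vector holds the four names seen in the outputs above:

```r
library(stringr)

# Assumed content of the fruits vector
fruits <- c("Banana", "Apple", "Orange", "Cherry")

first_hit <- str_locate(fruits, "[Aa]")
first_hit
# One row per element, with columns start and end:
# Banana 2 2, Apple 1 1, Orange 3 3, Cherry NA NA (no match)
```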
Extracting and replacing
If you want to obtain the character sequences matching your pattern you can use the str_extract() and str_extract_all() functions:
str_extract_all(fruits, "an")
## [[1]]
## [1] "an" "an"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "an"
##
## [[4]]
## character(0)
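In contrast, str_extract() returns only the first match per element, as a plain character vector with NA where nothing matched. A sketch, assuming stringr is installed and the same assumed fruits vector:

```r
library(stringr)

# Assumed content of the fruits vector
fruits <- c("Banana", "Apple", "Orange", "Cherry")

first_match <- str_extract(fruits, "an")
first_match
# "an" NA "an" NA
```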
Finally, replacing occurrences of a pattern is carried out using str_replace() or str_replace_all().
str_replace_all(fruits, "an", "..")
## [1] "B....a" "Apple" "Or..ge" "Cherry"
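The difference with str_replace(), which stops after the first match in each element, is visible in "Banana". A sketch, assuming stringr is installed and the same assumed fruits vector:

```r
library(stringr)

# Assumed content of the fruits vector
fruits <- c("Banana", "Apple", "Orange", "Cherry")

first_only <- str_replace(fruits, "an", "..")
first_only
# "B..ana" "Apple" "Or..ge" "Cherry"
```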