10. Regular expressions

10.1. Introduction

Sometimes, literal text searching is not enough.

For instance, when searching for email addresses, zip codes, telephone numbers, dates or biological sequence patterns (my field), you cannot say exactly which text you want to find, but you can say what it will look like.

This is where regular expressions come in. They provide a dedicated mini “language” for specifying what the pattern looks like.

For instance, the pattern

pattern = "[0-9]{4}[a-zA-Z]{2}"

specifies a pattern that matches four digits followed by two upper- or lowercase letters.

The Python re module provides all functionality for this kind of work.
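As a minimal sketch, the pattern above can be tried out like this. The sample sentence is invented for illustration and is not part of the original text.

```python
import re

# Four digits followed by two upper- or lowercase letters.
pattern = "[0-9]{4}[a-zA-Z]{2}"

# Hypothetical sample text, made up for this illustration.
sample = "Ship to 1234AB or 9876xy, but not to 12AB or 12345."

codes = re.findall(pattern, sample)
print(codes)  # ['1234AB', '9876xy']
```

The last two candidates are skipped because "12AB" has too few digits and "12345." has no letters after the digits.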

Regex

A regular expression or regex consists of a combination of literal character sequences, character classes, quantifiers, groupings and positional anchors that can be used to search for patterns in text in order to locate, extract or replace the occurrences.

In this tutorial the concepts will be demonstrated using the Python re module. To use this module, you will need to import it first:

import re

The re module provides some useful functions for working with patterns:

| Function | Description |
|---|---|
| match(pattern, text) | Returns a Match object if pattern matches at the start of text; use fullmatch(pattern, text) to require that the whole text matches |
| search(pattern, text) | Returns a Match object if there is a match of pattern anywhere in text |
| findall(pattern, text) | Returns a list containing all matches of pattern in text |
| finditer(pattern, text) | Returns an iterator of Match objects for all matches of pattern in text |
| split(pattern, text) | Returns a list where the string has been split on each occurrence of pattern in text |
| sub(pattern, replacement, text) | Replaces one or many matches of pattern with replacement in text and returns the resulting string |
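The differences between these functions are easiest to see side by side. The following is a small sketch on an invented sample string; the variable names are chosen for this illustration only.

```python
import re

text = "order 66 shipped, order 7 pending"

# match() only looks at the start of the string, so the numbers are missed.
first_at_start = re.match("[0-9]+", text)
print(first_at_start)                      # None

# search() scans the whole string and stops at the first hit.
first_anywhere = re.search("[0-9]+", text)
print(first_anywhere.group(0))             # 66

# findall() collects every non-overlapping hit as a list of strings.
all_numbers = re.findall("[0-9]+", text)
print(all_numbers)                         # ['66', '7']

# finditer() yields Match objects, which also carry the match positions.
positions = [m.start() for m in re.finditer("[0-9]+", text)]
print(positions)                           # [6, 24]

# split() cuts the string at each occurrence of the pattern.
parts = re.split(", ", text)
print(parts)                               # ['order 66 shipped', 'order 7 pending']

# sub() replaces each occurrence and returns a new string.
masked = re.sub("[0-9]+", "#", text)
print(masked)                              # order # shipped, order # pending
```

Note that match() and search() return None when nothing is found, so the result should be checked before calling group() on it.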

For example, to extract occurrences of the word ‘the’ from a larger body of text you can use

text = "The following telephone numbers can be used to get the required information on your order: 020-1234567 or 06-12345678"
re.findall("[Tt]he", text)
['The', 'the']

where [Tt] specifies a character class.

Before looking at the functions of the re module in more detail, we’ll cover the basics of all regex elements first.

10.2. Regex syntax

10.2.1. Character classes

One of the pillars of regular expression syntax is the character class. A character class specifies, for a single character position in the pattern, which characters are allowed at that position.

For instance, the class [Tt] in the example above specifies that both ‘T’ and ‘t’ are allowed at the first position of the expression. Character classes are generally written between brackets [], but a few much-used character classes have their own symbol.

| Character class | Description |
|---|---|
| [AaBb] | Matches the characters ‘A’, ‘a’, ‘B’ and ‘b’ |
| [a-z] | Matches all lowercase characters between ‘a’ and ‘z’ |
| [a-zA-Z] | Matches all characters between ‘a’ and ‘z’ and ‘A’ and ‘Z’ |
| [0-9] | Matches all digits |
| [-] | Matches a literal hyphen |
| [^a] | Matches any character except ‘a’ |
| [^0-9] | Matches any character except a digit |
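A short sketch of these classes at work, using an invented sample string:

```python
import re

sample = "Room B-12, floor 3"

# A range: every uppercase letter.
uppercase = re.findall("[A-Z]", sample)
print(uppercase)      # ['R', 'B']

# A range of digits; each digit is a separate match.
digits = re.findall("[0-9]", sample)
print(digits)         # ['1', '2', '3']

# A negated class: anything that is not a letter, a digit or a space.
punctuation = re.findall("[^a-zA-Z0-9 ]", sample)
print(punctuation)    # ['-', ',']

# A hyphen placed first in the class is taken literally.
hyphen_or_comma = re.findall("[-,]", sample)
print(hyphen_or_comma)  # ['-', ',']
```

Each class matches exactly one character; matching several characters in a row requires a quantifier.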

10.2.1.1. Special character classes

There are quite a few; see the docs. Here are only the most-used ones.

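| Special sequence | Description |
|---|---|
| . | Matches any character except a newline (unless the re.DOTALL flag is used) |
| \d | Matches all digits |
| \s | Matches all whitespace characters |
| \w | Matches all word characters; for ASCII text equivalent to [0-9a-zA-Z_] |
| \t | Matches the tab character |
| \n | Matches the newline character |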

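Here is an example that looks for a whitespace character, a digit, another whitespace character and a word character. The pattern is written as a raw string, r"...", so that the backslashes reach the regex engine unchanged.

re.findall(r"\s\d\s\w", "If I count to 9\twill you count to 8 please?")
[' 9\tw', ' 8 p']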

The escape character \

The backslash \ has special meaning; it escapes the special meaning of the following symbol, or gives it special meaning (as seen above) depending on context.

It is the cause of many programming errors and bugs, especially when the backslash itself is part of the pattern.

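When looking for a literal ‘[’ or ‘]’ in a search string, for instance, you need to escape them. Writing the pattern as a raw string (r"...") keeps Python from interpreting the backslashes before the regex engine sees them:

re.findall(r"[\[\]]", "hallo [daar] ben ik weer")
['[', ']']

When looking for a literal ‘\’ it gets even harder, because the backslash must be escaped in the pattern and also in the ordinary Python string that holds the text:

re.findall(r"[\\]", "hallo \\ daar ben ik weer")
['\\']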
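To make the double layer of escaping more concrete, here is a small sketch comparing a raw string with an ordinary string, followed by re.escape(), which adds the required backslashes for you. The sample strings are invented for illustration.

```python
import re

# A raw string (r"...") keeps every backslash as typed, so these two
# spellings hand exactly the same pattern to the regex engine.
same_pattern = (r"\d+\\" == "\\d+\\\\")
print(same_pattern)  # True

# The pattern means: one or more digits followed by a literal backslash.
hits = re.findall(r"\d+\\", r"paths: 12\ and 345\ but not 6/")
print(hits)          # ['12\\', '345\\']

# re.escape() escapes the special characters in a string so that it can
# be used as a pattern that matches the string literally.
literal = re.escape("[daar]")
bracketed = re.findall(literal, "hallo [daar] ben ik weer")
print(bracketed)     # ['[daar]']
```

In the printed lists a single backslash is shown as \\ because Python displays strings in their escaped form.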

10.2.2. Quantifiers

Quantifiers specify how often a character, or a series of characters, is allowed to occur.
The general form is the {from,to} syntax, but there are a few shortcuts here as well:

| Quantifier | Meaning |
|---|---|
| {3,8} | Matches a repetition of 3 to 8 times |
| {,2} | Matches a repetition of 0 to 2 times; equivalent to {0,2} |
| {3,} | Matches a repetition of 3 or more times |
| + | Matches a repetition of one or more times |
| ? | Matches a repetition of zero or one times |
| * | Matches a repetition of zero or more times |

Here are a few examples.

print("3-4:", re.findall("a{3,4}", "Please say aaaa, not aaa!"))
print("1 or more:", re.findall("a+", "Please say aaaa, not aaa!"))
print("2 or more:", re.findall("a{2,}", "Please say aaaa, not aaa!"))
3-4: ['aaaa', 'aaa']
1 or more: ['a', 'a', 'aaaa', 'aaa']
2 or more: ['aaaa', 'aaa']
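The ? and * shortcuts are not covered by the examples above. The sketch below illustrates them on an invented string, together with the {n} form for an exact count, which is standard regex syntax but is not listed in the table.

```python
import re

text = "color colour colouur"

# ? makes the preceding character optional (zero or one 'u').
optional_u = re.findall("colou?r", text)
print(optional_u)    # ['color', 'colour']

# * allows any number of repetitions, including none.
any_u = re.findall("colou*r", text)
print(any_u)         # ['color', 'colour', 'colouur']

# {3} asks for exactly three repetitions; matches do not overlap,
# so the trailing 9 of 6789 is left over.
triples = re.findall("[0-9]{3}", "12 345 6789")
print(triples)       # ['345', '678']
```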

10.2.3. Anchors

Use ^ to anchor a pattern at the start of the search string and $ to anchor it at the end. In the special case that you want the whole search string to match the pattern, use both anchors.

print("At the start:", re.findall("^[Tt]he", "The CEO is the boss"))
print("At the end:", re.findall("!$", "Please! Say something!"))
At the start: ['The']
At the end: ['!']
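Using both anchors turns a search into a check of the whole string. Here is a minimal sketch with the four-digits-two-letters pattern from the introduction; the candidate strings are invented.

```python
import re

# Anchored on both sides: nothing may precede or follow the code.
pattern = "^[0-9]{4}[a-zA-Z]{2}$"

candidates = ["1234AB", "x1234AB", "1234ABC"]
results = {c: bool(re.search(pattern, c)) for c in candidates}
print(results)  # {'1234AB': True, 'x1234AB': False, '1234ABC': False}

# re.fullmatch() performs a whole-string check without explicit anchors.
whole = re.fullmatch("[0-9]{4}[a-zA-Z]{2}", "1234AB")
print(whole is not None)  # True
```

One difference worth knowing: $ also matches just before a newline at the very end of the string, whereas re.fullmatch() requires the pattern to cover every character.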

10.2.4. Alternatives

To indicate alternative patterns you can use the | OR sign. It can also be used in conjunction with capturing sub-patterns using parentheses. See below for details.

print("Literal alternatives:", re.findall("banana|apple", "I want a banana or an apple!"))
print("Alternative patterns:", re.findall("bi[kt]e|[bc]ar", "A bike, a bite, a car, a bar"))
Literal alternatives: ['banana', 'apple']
Alternative patterns: ['bike', 'bite', 'car', 'bar']
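Without parentheses, | separates everything to its left from everything to its right. Parentheses restrict the alternatives to one part of the pattern, as this small sketch suggests:

```python
import re

text = "I want a banana or an apple!"

# "a" or "an", a space, then one of the two fruits.
first = re.search("an? (banana|apple)", text)
print(first.group(0))  # a banana
print(first.group(1))  # banana

# When the pattern contains one group, findall() returns only the
# text captured by that group, not the whole match.
fruits = re.findall("an? (banana|apple)", text)
print(fruits)          # ['banana', 'apple']
```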

10.2.5. Being greedy or non-greedy

Sometimes, you want to influence the way a pattern is matched with regard to the length of the match. Consider this example:

print("Greedy:", re.findall("^.+ ", "To be greedy or not to be greedy"))
Greedy: ['To be greedy or not to be ']

This looks for a sequence of characters at the beginning of the search string up to a space. The result is the entire phrase up to the last space. We call this greedy behaviour, which is the default for regex.
What if you are only interested in the first word, up to the first space? In that case you need to use the non-greedy modifier, ?. Again a symbol with multiple meanings in the context of regex…

print("Non-greedy:", re.findall("^.+? ", "To be greedy or not to be greedy"))
Non-greedy: ['To ']

The non-greedy modifier can be used in combination with any quantifier: +?, *?, ??, {,}?.
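A typical use case is extracting delimited pieces of text. The following sketch uses *? on an invented sentence with two quoted words:

```python
import re

text = 'He said "yes" and then "no".'

# Greedy: .* runs on to the LAST double quote in the string.
greedy = re.findall('".*"', text)
print(greedy)  # ['"yes" and then "no"']

# Non-greedy: .*? stops at the first double quote that completes a match.
lazy = re.findall('".*?"', text)
print(lazy)    # ['"yes"', '"no"']
```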

10.3. Working with match objects

The search() and finditer() functions return a (sequence of) Match objects. These match objects give much flexibility in dealing with your matches: from sub-patterns to match locations.

With this type of analysis it is often a good idea to create a compiled pattern with re.compile(). This can save computation time if the operation is repeated often.

When placing pairs of parentheses in the pattern you can capture sub-patterns, which are then available through the group() method of the Match object.

Here is an example of the use of a compiled pattern and Match objects:

postal_code_pattern = re.compile("([0-9]{4}[- ]?[A-Za-z]{2}),? ([a-zA-Z]+)")
text = """Please send a copy of this message to John Doe on Marktstraat 5, 2633 AX Someplace, 
and to Jane Doe, Brink 3, 1221ZA, Nowhere."""
for match in postal_code_pattern.finditer(text):
    print("whole match:", match.group(0))
    print("postal code:", match.group(1))
    print("town:       ", match.group(2))
    print("match start:", match.start())
whole match: 2633 AX Someplace
postal code: 2633 AX
town:        Someplace
match start: 65
whole match: 1221ZA, Nowhere
postal code: 1221ZA
town:        Nowhere
match start: 111
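Match objects offer a few more methods besides group() and start(). The sketch below reuses the postal code pattern on a shortened, invented string and also shows what happens when nothing is found.

```python
import re

postal_code_pattern = re.compile("([0-9]{4}[- ]?[A-Za-z]{2}),? ([a-zA-Z]+)")

# A compiled pattern has its own search(), match(), findall() etc. methods.
match = postal_code_pattern.search("Brink 3, 1221ZA, Nowhere.")
if match is not None:
    print(match.groups())  # ('1221ZA', 'Nowhere')  all captured groups
    print(match.span())    # (9, 24)   start and end of the whole match
    print(match.span(2))   # (17, 24)  start and end of group 2

# When there is no match, search() returns None instead of a Match object.
no_match = postal_code_pattern.search("no postal code here")
print(no_match)  # None
```

Checking for None before calling methods on the result avoids an AttributeError on text that does not contain the pattern.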

10.4. Modifying flags

Finally, the Python regex engine gives you the possibility to modify the behaviour of its regex functions by using the flags= argument. See the documentation of the re module for details on the flags.

The most-used flags are listed below:

| Flag short form | Flag long form | Behaviour |
|---|---|---|
| re.I | re.IGNORECASE | Case-insensitive matching |
| re.M | re.MULTILINE | Makes ^ match at the beginning of each line and $ at the end of each line |
| re.S | re.DOTALL | Makes the ‘.’ special character also match a newline |

If you want to combine multiple flags, you need to ‘OR’ them:

flags=re.I | re.DOTALL
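The effect of each flag can be seen in this small sketch, which uses an invented two-line string:

```python
import re

text = "The end.\nthe start"

# re.I: upper- and lowercase are treated alike.
ignore_case = re.findall("the", text, flags=re.I)
print(ignore_case)     # ['The', 'the']

# Without re.M, ^ only matches at the very start of the string.
string_start = re.findall("^the", text, flags=re.I)
print(string_start)    # ['The']

# With re.M, ^ also matches directly after each newline.
line_starts = re.findall("^the", text, flags=re.I | re.M)
print(line_starts)     # ['The', 'the']

# By default '.' stops at a newline; with re.S it runs across it.
dot_default = re.findall("end.+", text)
dot_all = re.findall("end.+", text, flags=re.S)
print(dot_default)     # ['end.']
print(dot_all)         # ['end.\nthe start']
```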