Regex in Python

Raw strings

"Raw string notation keeps regular expressions sane." — re tutorial

Raw strings in Python

Just like the regex engine, Python uses \ to escape characters in strings that otherwise have special meaning (e.g. ' and \ itself) and to create tokens with special meaning (e.g. \n).

print("Hello\nWorld")
Hello
World

Left unescaped, a single quotation mark takes on its special meaning as a string delimiter.

'It's raining'
SyntaxError: invalid syntax (3769801028.py, line 1)

To give it its literal meaning as an apostrophe, we can either escape it or, as below, delimit the string with double quotes.

"It's raining"
"It's raining"
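For completeness, the escaped variant works too:

```python
# Backslash-escaping the quote gives it its literal meaning
print('It\'s raining')  # It's raining
```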

Python and regex interaction

A string is processed by the Python interpreter before being passed on to the regex engine. One consequence of this is that if we want our regex pattern to treat a character literally when that character has special meaning in both Python and regex, we have to escape it twice.

For example: to search for a literal backslash in our regex pattern, we need to write \\\\. The Python interpreter reads this as \\ and passes it to the regex engine, which then reads it as \ as desired.

import re

s = "a \ b"
m = re.search("a \\\\ b", s)
print(m[0])
m[0]
a \ b

'a \\ b'

This is obviously cumbersome. A useful alternative is to use raw strings (r'...'), which make the Python interpreter read special characters as literals, obviating the first set of escapes. Hence, it’s a good idea to always use raw strings for regex patterns in Python.

m = re.search(r"a \\ b", s)
print(m.group())
a \ b

Escape sequences rabbit hole

First things first: an escape sequence is "a sequence of characters that does not represent itself when used within a string literal but is translated into another character or sequence of characters that might be difficult or impossible to represent" (from Wikipedia).

When I tried a version of this

string = "foo 1a bar 2baz"
pattern = "\b\d[a-z]\b"

re.findall(pattern, string)
[]

it took me 10 minutes to figure out why 1a didn’t match. The short answer is: thou shalt use raw strings!

raw_pattern = r"\b\d[a-z]\b"

re.findall(raw_pattern, string)
['1a']

But why? Because Python interprets escape sequences in strings according to the rules of Standard C, where \b happens to stand for the backspace character. Hence, the pattern without the r prefix means “a backspace immediately followed by a digit, immediately followed by a lowercase letter, immediately followed by another backspace”, which is not present in the string.

To convince ourselves of this, we can add backspaces to the string and try again – now the pattern matches.

string = "foo \N{backspace}1a\N{backspace} baz 2bar"

re.findall(pattern, string)
['\x081a\x08']

One point that was not immediately obvious to me was why patterns like the one below work without a raw string – why do the backslashes in \d and \w not need escaping?

pattern = "\d\w"
re.findall(pattern, string)
['1a', '2b']

The explanation is that a \ is interpreted literally if it is not part of a recognized escape sequence, as in

print("a\k")
a\k

and \d and \w aren’t escape sequences in Python (or C). Hence, these two tokens are passed on unaltered to the regex engine, where they are interpreted according to regex syntax rules.
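We can verify this directly: since \d is not a recognized escape sequence, the backslash is preserved and the raw and non-raw spellings are the same string, whereas \b (a recognized escape) produces different strings. A minimal check (recent Python versions warn about unrecognized escapes, which is yet another reason to prefer raw strings):

```python
# \d is not an escape sequence: the backslash survives, so the two
# spellings denote the identical two-character string
assert "\d" == r"\d"

# \b IS an escape sequence (backspace), so the spellings differ
assert "\b" != r"\b"
print(len("\b"), len(r"\b"))  # 1 2
```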

Remove punctuation rabbit hole

I wanted to remove punctuation in a string like the below.

s = "Some .' test & with * punctuation \ characters."

Thinking I was clever, I thought of the useful constants provided by the string module, which provide easy access to character sequences like the set of punctuation characters.

import string

string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

I did the below and was about to celebrate victory.

p = string.punctuation
try:
    re.sub(p, " ", s)
except Exception as e:
    print(e)
multiple repeat at position 10

Oops! It’s a clear case where I jumped to a conclusion a little too soon, and where spending a few more minutes thinking things through before starting to code would probably have helped me see the two flaws in my approach: I need to escape special characters, and, given that I want to search for characters individually, I need to wrap them in a character class rather than passing them as a single string 🤦‍♂️

p = f"[{re.escape(string.punctuation)}]"
r = re.sub(p, "", s)
r
'Some  test  with  punctuation  characters'

To remove extra whitespace, I could use:

re.sub(" +", " ", r)
'Some test with punctuation characters'

Alternatively, I could use a regex-native approach.

p = r"[\W_]"
re.sub(p, " ", s)
'Some    test   with   punctuation   characters '

re module

import re

Overview of search methods

pattern = "a"
string = "Jack is a boy"

methods = [
    ("re.match (start of string)", re.match(pattern, string)),
    ("re.search (anywhere in string)", re.search(pattern, string)),
    ("re.findall (all matches)", re.findall(pattern, string)),
    ("re.finditer (all matches as iterator)", re.finditer(pattern, string)),
]

for desc, result in methods:
    print("{:40} -> {}".format(desc, result))
re.match (start of string)               -> None
re.search (anywhere in string)           -> <re.Match object; span=(1, 2), match='a'>
re.findall (all matches)                 -> ['a', 'a']
re.finditer (all matches as iterator)    -> <callable_iterator object at 0x11236d2e0>

re.findall()

Returns a list of all full matches if no capturing groups are specified; with capturing groups, it returns a list of the captured groups (tuples if there are several).

Example: find stand-alone numbers

data = """
 012
foo34 
     56
78bar
9
 a10b
"""

Without capturing groups, the entire match is returned

proper_digits = r"\s+\d+\s+"
re.findall(proper_digits, data, flags=re.MULTILINE)
['\n 012\n', ' \n     56\n', '\n9\n ']

One capturing group returns a list of the captured groups

proper_digits = r"(?m)\s+(\d+)\s+"
re.findall(proper_digits, data, flags=re.MULTILINE)
['012', '56', '9']

Multiple capturing groups return a list of tuples of captured groups

proper_digits = r"\s+(\d)(\d+)?\s+"
re.findall(proper_digits, data, flags=re.MULTILINE)
[('0', '12'), ('5', '6'), ('9', '')]

To return the full match if the pattern uses capturing groups, simply capture the entire match, too.

s = "Hot is hot. Cold is cold."
p = r"(?i)((\w+) is \2)"
[groups[0] for groups in re.findall(p, s)]
['Hot is hot', 'Cold is cold']

Finding overlapping matches

pattern = r"(?=(\w+))"
re.findall(pattern, "abc")
['abc', 'bc', 'c']
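For contrast: by default, matches consume the characters they match, so findall never returns overlaps. The zero-width lookahead above captures text without consuming it, which lets the engine retry at every position.

```python
import re

# Ordinary matching consumes characters: only one match
print(re.findall(r"\w+", "abc"))        # ['abc']

# A capturing group inside a zero-width lookahead consumes nothing,
# so the engine retries at every position and overlaps appear
print(re.findall(r"(?=(\w+))", "abc"))  # ['abc', 'bc', 'c']
```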

re.match()

Find a pattern at the beginning of a string

line = '"688293"|"777"|"2011-07-20"|"1969"|"20K to 30K"'

pattern = r'"\d+"\|"(?P<user_id>\d+)"'

match = re.match(pattern, line)
print(match)
print(match.group("user_id"))
print(match["user_id"])  # alternative, simpler, syntax
<re.Match object; span=(0, 14), match='"688293"|"777"'>
777
777
from itertools import compress

addresses = [
    "5412 N CLARK",
    "5148 N CLARK",
    "5800 E 58TH",
    "2122 N CLARK",
    "5645 N RAVENSWOOD",
    "1060 W ADDISON",
    "4801 N BROADWAY",
    "1039 W GRANVILLE",
]


def large_house_number(address, threshold=2000):
    house_number = int(re.match(r"\d+", address)[0])
    return house_number > threshold


has_large_number = [large_house_number(x) for x in addresses]
list(compress(addresses, has_large_number))
['5412 N CLARK',
 '5148 N CLARK',
 '5800 E 58TH',
 '2122 N CLARK',
 '5645 N RAVENSWOOD',
 '4801 N BROADWAY']

re.escape()

I want to match “(other)”. To match the parentheses literally, I’d have to escape them; if I don’t, the regex engine interprets them as a capturing group.

m = re.search("(other)", "some (other) word")
print(m)
m[0]
<re.Match object; span=(6, 11), match='other'>

'other'

I can escape manually.

re.search(r"\(other\)", "some (other) word")
<re.Match object; span=(5, 12), match='(other)'>

But if I have many fields with metacharacters (e.g. variable values that contain parentheses) this is a massive pain. The solution is to just use re.escape(), which does all the work for me.

re.search(re.escape("(other)"), "some (other) word")
<re.Match object; span=(5, 12), match='(other)'>
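For instance, to match any of several literal fields (hypothetical values here), escape each one and join them with alternation:

```python
import re

# Hypothetical field values containing metacharacters
fields = ["(other)", "a+b", "[ok]"]
pattern = "|".join(map(re.escape, fields))

print(re.findall(pattern, "x (other) y a+b z [ok]"))  # ['(other)', 'a+b', '[ok]']
```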

re.split()

pattern = r"(?<=\w)(?=[A-Z])"
s = "ItIsAWonderfulWorld"
re.split(pattern, s)
['It', 'Is', 'A', 'Wonderful', 'World']
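Two further re.split features worth knowing: maxsplit caps the number of splits, and a capturing group in the pattern keeps the separators in the result.

```python
import re

# maxsplit limits how many splits are performed
print(re.split(r",\s*", "a, b, c, d", maxsplit=2))  # ['a', 'b', 'c, d']

# a capturing group keeps the separators in the output
print(re.split(r"(\d+)", "a1b22c"))  # ['a', '1', 'b', '22', 'c']
```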

re.sub()

Strip a string of whitespace, punctuation, and underscores.

s = "String. With! Punctu@tion# and _whitespace"
re.sub(r"[\W_]", "", s)
'StringWithPunctutionandwhitespace'

Using zero-width match to turn CamelCase into snake_case

s = "ThisIsABeautifulDay"
pattern = r"(?<=[a-zA-Z])(?=[A-Z])"
re.sub(pattern, "_", s).lower()
'this_is_a_beautiful_day'

Use the same approach with MULTILINE mode to comment out all lines.

s = """first
second
third"""

pattern = "(?m)^"
print(re.sub(pattern, "#", s))
#first
#second
#third

Matching end of line and end of string

\Z matches the strict end of the string but does not match before a trailing line-break

a = """no newline 
at end"""

b = """newline
at end
"""

print(re.search(r"d\Z", a))
print(re.search(r"d\Z", b))
<re.Match object; span=(17, 18), match='d'>
None

$ matches the end of the string flexibly (i.e. before or after a final line-break)

a = """no newline 
at end"""

b = """newline
at end
"""

print(re.findall(r"[ed]$", a))
print(re.findall(r"[ed]$", b))
['d']
['d']

$ with MULTILINE mode matches at the end of each line

a = """no newline
at end"""

b = """newline
at end
"""

print(re.findall(r"(?m)[ed]$", a))
print(re.findall(r"(?m)[ed]$", b))
['e', 'd']
['e', 'd']

regex module


# would usually import as `import regex as re`, but because I
# want to compare with the built-in re here, I import it as regex

# the default version is VERSION0, which emulates re; to use the
# additional functionality shown below, use VERSION1

import regex

regex.DEFAULT_VERSION = regex.VERSION1

Keep out token

The keep out token \K drops everything matched thus far from the overall match to be returned.

pattern = r"\w+_\K\d+"
string = "abc_12"

regex.match(pattern, string)[0]
'12'
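The built-in re has no \K; a rough stdlib equivalent (a sketch, not a full replacement) is to put just the part we want into a capturing group.

```python
import re

# No \K in the stdlib: capture only the part we want to keep
m = re.search(r"\w+_(\d+)", "abc_12")
print(m.group(1))  # 12
```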

Inline flags

Flags placed inside the regex pattern take effect from that point onwards. As an example, this helps us find uppercase words that later appear in lowercase. To start, let’s match all words that reappear later in the string.

string = "HELLO world hello world"
pattern = r"(?i)(\b\w+\b)(?=.*\1)"

re.findall(pattern, string)
['HELLO', 'world']

To only match uppercase words that later reappear in lowercase, we can do this:

pattern = r"(\b[A-Z]+\b)(?=.*(?=\b[a-z]+\b)(?i)\1)"
regex.findall(pattern, string)
['HELLO']
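Note that the built-in re (since Python 3.11) rejects global inline flags that are not at the start of the pattern; it does, however, support scoped flags (?i:...), which apply only inside their group:

```python
import re

# (?i:...) makes only the group case-insensitive; the rest of the
# pattern stays case-sensitive
print(re.findall(r"(?i:hello) world", "HELLO world hello World"))  # ['HELLO world']
```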

Subroutines

Subroutines obviate the repetition of long capturing groups.

s = "Tarzan loves Jane"
p = r"(Tarzan|Jane) loves (?1)"
m = regex.search(p, s)
m[0], m[1]
('Tarzan loves Jane', 'Tarzan')

Recursive patterns

Subroutines can call themselves to create a recursive pattern, which can be useful to match tokens where one letter is perfectly balanced by another.

s = "ab and aabb and aab and aaabbb and abb"
p = r"\b(a(?1)?b)\b"
regex.findall(p, s)
['ab', 'aabb', 'aaabbb']

Experimental

  • (?R) recurses the entire pattern, so it only works for standalone expressions
s = "aaaabbbb aabb aab ab"
p = r"a(?R)?b"
regex.findall(p, s)
['aaaabbbb', 'aabb', 'ab', 'ab']
s = "a a a a b b b b aabb aab ab"
p = r"\b ?a(?R)? b\b"
regex.findall(p, s)
['a a a a b b b b']

Pre-defined subroutines

We can predefine subroutines to produce nicely modular patterns that can easily be reused throughout our regex. (The \ in the pattern is needed because in free-spacing mode, whitespace that we want to match rather than ignore needs to be escaped.)

defs = """
    (?(DEFINE)
        (?<quant>\d+)
        (?<item>\w+)
    )
    """

pattern = rf"{defs} (?&quant)\ (?&item)"
string = "There were 5 elephants walking towards the water hole."

regex.search(pattern, string, flags=regex.VERBOSE)
<regex.Match object; span=(11, 22), match='5 elephants'>

A useful application of this is to create real-word boundaries (rwb) that match between letters and other characters (rather than between word and non-word characters).

defs = """
    (?(DEFINE)
        (?<rwb>
            (?i)                   # case insensitive
            (?<![a-z])(?=[a-z])    # beginning of word
            |(?<=[a-z])(?![a-z])   # end of word
        )
    )
    """

pattern = rf"{defs} (?&rwb)\w+(?&rwb)"
string = """
cats23,
 +dogs55,
%bat*"""

regex.findall(pattern, string, flags=regex.VERBOSE)
['cats', 'dogs', 'bat']

Using default word boundaries in the above string would also return digits and underscores, since they are word characters.

regex.findall(r"\b\w+\b", string)
['cats23', 'dogs55', 'bat']

Named groups

regex supports named groups with a cleaner syntax: (?<name>...) instead of the somewhat verbose (?P<name>...)

s = "Zwätschgi was born on 23 Dec 1986"
p = r"\b(?<day>\d{2}) (?<month>\w{3}) (?<year>\d{4})\b"
regex.search(p, s).groupdict()
{'day': '23', 'month': 'Dec', 'year': '1986'}

and \g<name> instead of (?P=name) for backreferences.

s = "2012-12-12"
p = r"\d\d(?<yy>\d\d)-\g<yy>-\g<yy>"
regex.match(p, s)
regex.match(p, s)
<regex.Match object; span=(0, 10), match='2012-12-12'>

Unicode categories

regex provides support for unicode categories, which can be super handy.

## search for any punctuation character

s = ". and _"
pattern = r"\p{P}"
regex.findall(pattern, s)
['.', '_']

Variable-width lookbehinds

One useful feature of regex is that it allows for variable-width lookbehinds. Like most regex engines, re doesn’t, and tells you so if you try.

For example, if we want to match uppercase words preceded by a prefix composed of digits and an underscore, such as BANANA in 123_BANANA, the below doesn’t work:

string = "123456_ORANGE abc12_APPLE"
pattern = r"(?<=\b\d+_)[A-Z]+\b"

try:
    re.findall(pattern, string)
except Exception as e:
    print(e)
look-behind requires fixed-width pattern

In contrast, regex succeeds.

regex.findall(pattern, string)
['ORANGE']

Another application is if we wanted (for whatever reason) to match all words beginning with a at the beginning of a line from lines three onwards.

string = """abba
abacus
alibaba ada
beta adagio
aladin abracadabra
"""

pattern = r"(?<=\n.*\n)a\w+"

regex.findall(pattern, string)
['alibaba', 'aladin']

Character class set operations

Intersection

# the inner [] are optional but can make the pattern easier to read
pattern = r"[[\W]&&[\S]]"
subject = "a.b*5_c 8!"
regex.findall(pattern, subject)
['.', '*', '!']

Union

pattern = r"[ab||\d]"
subject = "a.b*5_c 8!"
regex.findall(pattern, subject)
['a', 'b', '5', '8']

Subtraction

pattern = r"[[a-z]--[b]]"
subject = "a.b*5_c 8!"
regex.findall(pattern, subject)
['a', 'c']
pattern = r"[\w--[_\d]]"
subject = "a b 3 k _ f 4"
regex.findall(pattern, subject)
['a', 'b', 'k', 'f']

Pandas

import pandas as pd

Insert text in position

Insert an underscore between words

df = pd.DataFrame({"a": ["HelloWorld", "HappyDay", "SunnyHill"]})

pattern = r"(?<=[a-z])(?=[A-Z])"
df["a"] = df.a.str.replace(pattern, "_", regex=True)
df

             a
0  Hello_World
1    Happy_Day
2   Sunny_Hill
def colname_cleaner(df):
    """Convert column names to stripped lowercase with underscores."""
    df.columns = df.columns.str.lower().str.strip()
    return df


def str_cleaner(df):
    """Convert string values to stripped lowercase."""
    str_cols = df.select_dtypes("object")
    for col in str_cols:
        df[col] = df[col].str.lower().str.strip()
    return df


from vega_datasets import data  # source of the movies dataset

movies = data.movies().pipe(colname_cleaner).pipe(str_cleaner)
movies.head(2)

                    title  us gross  worldwide gross  ...  imdb rating  imdb votes
0          the land girls  146083.0         146083.0  ...          6.1      1071.0
1  first love, last rites   10876.0          10876.0  ...          6.9       207.0

[2 rows x 16 columns]

Finding a single pattern in text

pattern = "hello"
text = "hello world it is a beautiful day."

match = re.search(pattern, text)
match.start(), match.end(), match.group()
(0, 5, 'hello')

In Pandas

movies.title.str.extract("(love)")

         0
0      NaN
1     love
2      NaN
3      NaN
4      NaN
...    ...
3196   NaN
3197   NaN
3198   NaN
3199   NaN
3200   NaN

[3201 rows x 1 columns]

  • contains(): Test if pattern or regex is contained within a string of a Series or Index.
  • match(): Determine if each string starts with a match of a regular expression.
  • fullmatch(): Determine if each string entirely matches a regular expression.
  • extract(): Extract capture groups in the regex pat as columns in a DataFrame.
  • extractall(): Extract all matches of capture groups (not just the first) as rows in a DataFrame.
  • find(): Return the lowest index of a substring (no regex support).
  • findall(): Find all occurrences of a pattern or regex in each string.
  • replace(): Replace each occurrence of a pattern/regex with another string.
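A quick demonstration of the main differences on a toy Series (hypothetical data):

```python
import pandas as pd

s = pd.Series(["foo123", "bar", "123baz"])

print(s.str.contains(r"\d").tolist())    # [True, False, True]   match anywhere
print(s.str.match(r"\d").tolist())       # [False, False, True]  match at start
print(s.str.fullmatch(r"\d+").tolist())  # [False, False, False] match entire string
print(s.str.findall(r"\d").tolist())     # all matches per element
```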
movies.title.str.replace("girls", "hello")
0                   the land hello
1           first love, last rites
2       i married a strange person
3             let's talk about sex
4                             slam
                   ...            
3196    zack and miri make a porno
3197                        zodiac
3198                          zoom
3199           the legend of zorro
3200             the mask of zorro
Name: title, Length: 3201, dtype: object

Let’s drop all movies by distributors with “Universal” or “Pictures” in their name.

# inverted masking

names = ["Universal", "Pictures"]
pattern = "|".join(names)
mask = movies.distributor.str.contains(pattern, na=True)
result = movies[~mask]
result.head(2)

                    title  us_gross  worldwide_gross  ...  imdb_rating  imdb_votes
0          The Land Girls  146083.0         146083.0  ...          6.1      1071.0
1  First Love, Last Rites   10876.0          10876.0  ...          6.9       207.0

[2 rows x 16 columns]
# negated regex – note that [^...] negates a set of characters, not the
# alternation, so this matches any distributor that contains at least one
# character outside the set

names = ["Universal", "Pictures"]
pattern = r"\|".join(names)
neg_pattern = f"[^{pattern}]"
mask = movies.distributor.str.contains(neg_pattern, na=False)
result2 = movies[mask]
neg_pattern
'[^Universal\\|Pictures]'
def drop_card_repayments(df):
    """Drop card repayment transactions from current accounts."""
    tags = ["credit card repayment", "credit card payment", "credit card"]
    pattern = "|".join(tags)
    mask = df.auto_tag.str.contains(pattern) & df.account_type.eq("current")
    return df[~mask]

Sources