Using regexes with Python

2021-04-05
Python regex




Table of Contents

1. Regex basics
2. Regex with Python
3. Regex with pandas
4. Regex with Redshift

A regular expression, or regex for short, is a pattern that describes a certain amount of text. The simplest regex is a plain string, like hello, which will match the literal text hello.

There are some characters with special meanings. For example, . will match any character (except a line break), so ip.d will match ipad, ipod, ip@d, etc.

If we want to only match ipod or ipad we can use ip[ao]d.

1.1. Brackets []

Brackets ([ and ]) are used to specify which characters are allowed at a given position. They work like an or operator. For example, ip[ao]d matches ipad and ipod but not ip@d.
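
As a quick sketch in Python (the re module is covered in section 2):

import re

# ip[ao]d: "ip", then either "a" or "o", then "d"
print(bool(re.search(r"ip[ao]d", "ipad")))  # True
print(bool(re.search(r"ip[ao]d", "ipod")))  # True
print(bool(re.search(r"ip[ao]d", "ip@d")))  # False, "@" is not in [ao]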

1.2. Position

It is possible to anchor a match to a position with ^ (start of the string or line) and $ (end of the string or line).

For example ^ab will only match the first ab in ab abc ab.
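
A small sketch of the difference, using re.findall (introduced later):

import re

text = "ab abc ab"

print(re.findall(r"ab", text))   # ['ab', 'ab', 'ab']
print(re.findall(r"^ab", text))  # ['ab'], only the one at the start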

1.3. Classes

There are certain keywords that match different types of characters. For example \d will match any digit (0123456789). These classes can be inverted with capital letters: \D will match any character except a digit.

The main classes available are:

regex description inverse
\d Digits \D
\w Alphanumeric characters and _ \W
\s Whitespace characters (spaces, tabs, line breaks) \S
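
A quick sketch of these classes in action:

import re

text = "room 42, floor 3"

print(re.findall(r"\d", text))   # ['4', '2', '3']
print(re.findall(r"\D", text))   # every character that is not a digit
print(re.findall(r"\w+", text))  # ['room', '42', 'floor', '3']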

1.4. Quantifiers

It is possible to specify how many times a letter or group (defined in the next section) should appear.

As an example, take the letter a followed by the letter b, where b appears a different number of times:

regex description
ab* a followed by zero or more b
ab+ a followed by one or more b
ab? a followed by an optional b (zero or one)
ab{2} a followed by exactly 2 b
ab{2,5} a followed by between 2 and 5 b (inclusive)
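
A small sketch of the quantifiers:

import re

text = "a ab abb b"

print(re.findall(r"ab*", text))    # ['a', 'ab', 'abb']
print(re.findall(r"ab+", text))    # ['ab', 'abb']
print(re.findall(r"ab{2}", text))  # ['abb']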

1.5. Groups

It is possible to define groups in a regex. They are used to capture part of the text and are declared with parentheses (). We can use any regex inside the group.
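
As a minimal sketch, capturing the word that follows hello:

import re

out = re.search(r"hello (\w+)", "hello world, hello regex")
print(out.group(0))  # 'hello world', the whole match
print(out.group(1))  # 'world', the first group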

1.6. Boundaries

It is possible to declare a boundary with \b (with the inverse \B). A boundary is the position between a word character and a non-word character (or the start/end of the string). It is similar to the start (^) and end ($) anchors.

It is easier to understand it with some examples:

regex text: Impossible to do text: I'm possible :) text: possibler
possible ✔️ ✔️ ✔️
\bpossible ❌ ✔️ ✔️
\bpossible\b ❌ ✔️ ❌
possible\B ❌ ❌ ✔️
\Bpossible ✔️ ❌ ❌
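
For example, the \bpossible\b row can be checked with:

import re

texts = ["Impossible to do", "I'm possible :)", "possibler"]

for text in texts:
    print(text, "->", bool(re.search(r"\bpossible\b", text)))
# Impossible to do -> False
# I'm possible :) -> True
# possibler -> False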

1.7. Flags

It is possible to change the behaviour of a regex with some flags. The most common ones are i (case-insensitive matching), m (multiline, so that ^ and $ apply to each line) and s (so that . also matches line breaks).
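
In Python these are passed as re.IGNORECASE, re.MULTILINE and re.DOTALL. A small sketch:

import re

text = "Hello world\nbye world"

# Case-insensitive match
print(re.findall(r"hello", text))                 # []
print(re.findall(r"hello", text, re.IGNORECASE))  # ['Hello']

# With re.MULTILINE, ^ matches the start of every line
print(re.findall(r"^\w+", text, re.MULTILINE))    # ['Hello', 'bye']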

1.8. Greedy vs lazy

Imagine that we have the following text:

<h1> Title </h1>
<p> Hello </p>

If we want to capture the start of an html tag (<h1> and <p> in this case) one might be tempted to use something like <.+>. However, that will match the whole line (for both lines), since we are asking for < followed by any text and ended with >. If we want to stop at the first appearance of > we can use the non-greedy regex <.+?>.

Some better alternatives are <\w+> or <[^<>]+>.

To sum up: quantifiers are greedy by default, meaning they match as much text as possible; adding ? right after a quantifier makes it lazy, meaning it matches as little text as possible.

It can also be used with quantifiers other than +, for example *? or {2,5}?.
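
A sketch of the greedy and lazy versions side by side:

import re

text = "<h1> Title </h1>"

print(re.search(r"<.+>", text).group())     # '<h1> Title </h1>', greedy
print(re.search(r"<.+?>", text).group())    # '<h1>', lazy
print(re.search(r"<[^<>]+>", text).group()) # '<h1>'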

2. Regex with Python

2.1. re package

To use regexes in Python you first need to import the re module with

import re

Then, to check if a regex matches a text, we can use out = re.match(regex, text):

out = re.match(r"hello", "hello world")

# out is a re.Match object (or None if there is no match).
# If we only want to know whether there is a match we can use bool(out).

When declaring regexes in Python it is good practice to declare them as raw strings with r"regex", so that backslashes (like in \d) are not interpreted as escape sequences.

2.2. re.match vs re.search

We have used re.match since it is faster. However, it only looks at the beginning of the string. To search anywhere in the string we can use re.search.

Let's compare them by checking different regexes against the following text:

text = """hello world
bye world"""

The results would be:

regex re.match re.search
r"hello" ✔️ ✔️
r"world" ❌ ✔️
r"bye" ❌ ✔️

2.3. re.compile

If we plan to reuse the same regex we can compile it with regex = re.compile(regex):

regex = re.compile(r"hello")

And then we can use match and/or search with:

regex.match("hello world")

regex.search("hello world")

The advantage is that the regex is only compiled once, so reusing it is faster than compiling it every time.

2.4. Working with groups

Imagine that we have the following file names:

filenames = [
    "Informe mensual Indexa Capital - AABBCCDD - 2020-01.pdf",
    "Informe mensual Indexa Capital - XXYYZZWW - 2020-01.pdf",
    "Informe mensual Indexa Capital - AABBCCDD - 2021-03.pdf",
    "Informe mensual Indexa Capital - XXYYZZWW - 2020-12.pdf",
]

2.4.1. Unnamed groups

It looks more or less clear that we can extract some data from those names:

regex = re.compile(r"Informe mensual Indexa Capital - (\w{8}) - (\d{4})-(\d{2}).pdf")

out = regex.match(filenames[0])
out.groups()
Out: ('AABBCCDD', '2020', '01')

2.4.2. Named groups

We can use named groups to extract the data in a clearer way:

regex = re.compile(r"Informe mensual Indexa Capital - (?P<account>\w{8}) - (?P<year>\d{4})-(?P<month>\d{2}).pdf")

out = regex.match(filenames[1])
out.groupdict()
Out: {'account': 'XXYYZZWW', 'year': '2020', 'month': '01'}

Clearly we can use that to rename the files doing something like:

out_pattern = "{year}_{month} Indexa capital {account}.pdf"

for filename in filenames:
    groups = regex.match(filename).groupdict()
    new_name = out_pattern.format(**groups)

    print(new_name)
2020_01 Indexa capital AABBCCDD.pdf
2020_01 Indexa capital XXYYZZWW.pdf
2021_03 Indexa capital AABBCCDD.pdf
2020_12 Indexa capital XXYYZZWW.pdf

3. Regex with pandas

There are multiple functions in pandas that can be used with regex. They are:

1. count: count occurrences that match a regex
2. replace: replace based on a regex
3. match: same as re.match
4. contains: same as re.search
5. findall: same as re.findall
6. split / rsplit: split text based on a regex
7. extract: extract groups using a regex

In order to explain them let's create a dummy DataFrame with:

import pandas as pd

queries = [
    "select *  FROM  users",
    "SELECT count(id) AS cnt FROM public.users",
    "SELECT * FROM orders",
    "SELECT city, count(1) FROM orders GROUP BY 1",
]
df = pd.DataFrame(queries, columns=["query"])

3.1. series.str.count

In this case we want to count the number of queries that come from the users table. It doesn't matter if the schema is specified or not. It can be done with:

df["query"].str.count(r"\s+FROM\s+(\w+\.)?users\b")

Out:
1
1
0
0

3.2. series.str.replace

We can use this for removing the column aliases (like AS cnt):

df["query"].str.replace(r"\s+AS\s+\w+\b", "", regex=True)

Out:
select *  FROM  users
SELECT count(id) FROM public.users
SELECT * FROM orders
SELECT city, count(1) FROM orders GROUP BY 1

3.3. series.str.match

In this case we can check which queries are selecting all columns (SELECT * FROM):

df["query"].str.match(r"(?i)(SELECT)\s+\*\s+(FROM)")

Out:
True
False
True
False

3.4. series.str.contains

Similar to the replace example, let's get the queries that have an alias (AS x):

df["query"].str.contains(r"\s+AS\s+\w+")

Out:
False
True
False
False

3.5. series.str.findall

We can use this for getting a list with all words that have between 5 and 8 letters:

df["query"].str.findall(r"\w{5,8}")

Out:
[select, users]
[SELECT, count, public, users]
[SELECT, orders]
[SELECT, count, orders, GROUP]

Note that the output is one list per row.

3.6. series.str.split / series.str.rsplit

In this case let's split each query in 2 parts: everything before the FROM and everything after it:

df["query"].str.split(r"(?i)\s+from\s+")

Out:
[select *, users]
[SELECT count(id) AS cnt, public.users]
[SELECT *, orders]
[SELECT city, count(1), orders GROUP BY 1]

3.7. series.str.extract

With this function we can extract data from groups. The output of this function is a DataFrame where each group is a column.

regex = r"(?i)SELECT\s+(?:.+)\s+FROM\s+(?P<schema>\w+\.)?(?P<table>\w+)\b"
df["query"].str.extract(regex)

The output in this case is:

schema table
NaN users
public. users
NaN orders
NaN orders

The regex used in this example might be a little bit complex, so let's analyze it by parts:

1. (?i) makes the whole match case-insensitive.
2. SELECT\s+ matches the SELECT keyword followed by whitespace.
3. (?:.+)\s+ matches the column list (anything) in a non-capturing group, followed by whitespace.
4. FROM\s+ matches the FROM keyword followed by whitespace.
5. (?P<schema>\w+\.)? optionally captures the schema (including the trailing dot) in a group named schema.
6. (?P<table>\w+)\b captures the table name in a group named table.

4. Regex with Redshift

In Redshift there are 4 string functions that work with regular expressions. They are:

Function Description
REGEXP_COUNT Returns an integer that indicates the number of times the pattern occurs in the string. If no match is found, then the function returns 0.
REGEXP_INSTR Returns an integer that indicates the beginning position or ending position of the matched substring. If no match is found, then the function returns 0.
REGEXP_REPLACE Replaces every occurrence of the pattern with the specified string.
REGEXP_SUBSTR Returns the characters extracted from a string by searching for a regular expression pattern.

There is also the SIMILAR TO operator for using regexes inside WHERE clauses. For example:

SELECT count(*) FROM event WHERE name SIMILAR TO '%(Ring|Die)%';

-- It is the same as:
SELECT count(*) FROM event WHERE name LIKE '%Ring%' OR name LIKE '%Die%';