A regular expression, or regex for short, is a pattern describing a certain amount of text.
The most simple regex is a regular string, like hello
which will match the literal text hello
.
There are some characters with special meanings. For example .
will match any character (except line break).
As an example ip.d
will match ipad
, ipod
, ip@d
etc.
If we want to only match ipod
or ipad
we can use ip[ao]d
.
[]
Brackets ([
and ]
) are used for specifying which characters are allowed. It is similar to an or operator. As an example:
[abc]
: a
, b
or c
[a-c]
: a
, b
or c
[^a-c]
: anything but a
, b
or c
[0-9a-fA-F]
: any number or any letter between a
and f
or between A
and F
. This represents hexadecimalIt is possible to specify matches based on the position with:
^
: Start $
: EndFor example ^ab
will only match the first ab
in ab abc ab
(Demo)
There are certain keywords that will match different types of characters.
For example \d
will match any digit (any of those 0123456789
).
Those classes can be inversed with capital letters. As an example \D
will match any character except a digit.
The main classes available are:
regex | description | example | inverse |
---|---|---|---|
\d |
Digits | d_example | \D |
\w |
Alphanumeric and/or _ |
w_example | \W |
\s |
Whitespaces and equivalents (like tabs) | s_example | \S |
It is possible to specify how many times a letter or group (defined in the next section) should appear.
As an example we want the letter a
+ the letter b
which should appear different number of times:
regex | description | example |
---|---|---|
ab* |
a + b zero times or more |
example_ab* |
ab+ |
a + b one time or more |
example_ab+ |
ab? |
a + optional b (zero or one time) |
example_ab? |
ab{2} |
a + b exactly 2 times |
example_ab{2} |
ab{2,5} |
a + b between 2 and 5 times (inclusive) |
example_ab{2,5} |
It is possible to define groups in regex.
They are use to capture part of the text and are declared with parenthesis ()
.
We can use any regex inside the group. For example if we want to math hello
:
(hello)
: capturing group (Demo) (?:hello)
: non-capturing group(?P<name>hello)
: same but giving the group a name (name
in this case)(hello|bye):
or operator in groups.It is possible to declare a boundary with \b
(with the inverse \B
).
A boundary means that is surrounded with a non word type character.
It is similar to the start (^
) and end ($
).
It is easier to understand it with some examples:
regex | text: Impossible to do |
text: I'm possible :) |
text: possibler |
example |
---|---|---|---|---|
possible |
Impossible to do | I'm possible :) | possibler | boundaries_1 |
\bpossible |
Impossible to do | I'm possible :) | possibler | boundaries_2 |
\bpossible\b |
Impossible to do | I'm possible :) | possibler | boundaries_3 |
possible\B |
Impossible to do | I'm possible :) | possibler | boundaries_4 |
\Bpossible |
Impossible to do | I'm possible :) | possibler | boundaries_5 |
It is possible to change the behaviour of the regex with some flags:
(?i)
: case insensitive. For example (?i)hello
will match HELLO
(demo)(?m)
: multi line. When this is active the anchor ^
will be start of line instead of start of text (demo)(?s)
: single line.Imagine that we have the following text:
<h1> Title </h1> <p> Hello </p>
If we want to capture the start of an html tag (<h1>
and <p>
in this case) one might be tempted to use something like <\w+>
.
However that will match the whole line (for both lines) since we are specifying <
followed by any text and ended with >
(demo).
If we want to stop it at the first appearence of >
we can use the non greedy regex <.+?>
(demo).
Some better alternatives are <\w+>
or <[^<>]+>
.
To sum up:
<.+>
: greedy<.+?>
: non greedyIt can also be used with other quantifiers that are not the
+
.
re
packageTo use regex in python you first need to import the re
package with
import re
Then to check if there is a match in a text with one regex using out = re.match(regex, text)
:
out = re.match(r"hello", "hello world") # out is a class if we only want to know if there is a match we can use `bool(out)`
When declaring regexs in python it is a good practise to declare them as raw strings with
r"regex"
match
vs search
We have use re.match
since it's faster. However this only looks at the begining of the string.
To do a more in deep search we can use re.search
.
Let's compare them by doing checking different regexs against the following text:
text = """hello world bye world"""
The results would be:
regex | re.match | re.search |
---|---|---|
r"hello" |
✔️ | ✔️ |
r"world" |
❌ | ✔️ |
r"bye" |
❌ | ✔️ |
re.compile
If we plan to reuse the same regex we can compile it with regex = re.compile(regex)
:
regex = re.compile(r"hello")
And then we can use both match
and/or search
with:
regex.match("hello world") regex.search("hello world")
The advantage is that a compiled regex is faster than a non-compiled one.
Imagine that we have the following file names:
filenames = [ "Informe mensual Indexa Capital - AABBCCDD - 2020-01.pdf", "Informe mensual Indexa Capital - XXYYZZWW - 2020-01.pdf", "Informe mensual Indexa Capital - AABBCCDD - 2021-03.pdf", "Informe mensual Indexa Capital - XXYYZZWW - 2020-12.pdf", ]
It looks more or less clear that we can extract some data from those names:
regex = re.compile(r"Informe mensual Indexa Capital - (\w{8}) - (\d{4})-(\d{2}).pdf") out = regex.match(filenames[0]) out.groups()
We can use named groups to better extract the data
regex = re.compile(r"Informe mensual Indexa Capital - (?P<account>\w{8}) - (?P<year>\d{4})-(?P<month>\d{2}).pdf") out = regex.match(filenames[1]) out.groupdict()
Clearly we can use that to rename the files doing something like:
out_pattern = "{year}_{month} Indexa capital {account}.pdf" for filename in filenames: groups = regex.match(filename).groupdict() new_name = out_pattern.format(**groups) print(new_name)
There are multiple functions in pandas that can be used with regex. They are:
count
: count occurrences that match a regexreplace
: replace based on a regexmatch
: same as re.match
contains
: same as re.search
findall
: same as re.findall
split
/ rsplit
: splits text based on a regexextract
: extract groups using a regexIn order to explain them let's create a really dummy dataframe with:
import pandas as pd queries = [ "select * FROM users", "SELECT count(id) AS cnt FROM public.users", "SELECT * FROM orders", "SELECT city, count(1) FROM orders GROUP BY 1", ] df = pd.DataFrame(queries, columns=["query"])
series.str.count
In this case we want to count
the number of queries that come from the users
table.
It doesn't matter if the schema is specified or not. It can be done with:
df["query"].str.count(r"\s+FROM\s+(\w+\.)?users\b")
series.str.replace
We can use this for deleting the renamings (like AS cnt
):
df["query"].str.replace(r"\s+AS\s+\w+\b", "", regex=True)
select * FROM users
SELECT count(id) FROM public.users
SELECT * FROM orders
SELECT city, count(1) FROM orders GROUP BY 1
series.str.match
In this case we can look which queries are selecting all columns (SELECT * FROM
):
df["query"].str.match(r"(?i)(SELECT)\s+\*\s+(?i)(FROM)")
series.str.contains
Similar to the replace
example, let's get the queries that have a rename AS x
:
df["query"].str.contains(r"\s+AS\s+\w+")
series.str.findall
We can use this for getting a list with all words that have between 5 and 8 letters:
df["query"].str.findall("\w{5,8}")
select
, users
]SELECT
, count
, public
, users
]SELECT
, orders
]SELECT
, count
, orders
, GROUP
]Note that the output is one list per row.
series.str.split
/ series.str.rsplit
In this case let's split the query in 2 parts, all before the FROM
and all after it:
df["query"].str.split("\s+(?i)from\s+")
select *
, users
]SELECT count(id) AS cnt
, public.users
]SELECT *
, orders
]SELECT city, count(1)
, orders GROUP BY 1
]series.str.extract
With this function we can extract data from groups. The output of this function is a DataFrame where each group is a column.
regex = r"(?i)SELECT\s+(?:.+)\s+(?i)FROM\s+(?P<schema>\w+\.)?(?P<table>\w+)\b" df["query"].str.extract(regex)
The output in this case is:
schema | table |
---|---|
Nan | users |
public. | users |
Nan | orders |
Nan | orders |
The regex used in this example might be a little bit complex, let's analyze it by parts:
(?i)SELECT
: SELECT
statement case insensitive\s+
: one or more spaces(?:.+)
: any number of characters in a non-capturing group (so that it does not appear as column in the output)\s+
: one or more spaces(?i)FROM
: FROM
statement case insensitive\s+
: one or more spaces(?:(?P<schema>\w+)\.)?
: Optional non-capturing group composed with:(?P<schema>\w+)
: Capturing group named schema
made only with words\.
: a dot (.
)(?P<table>\w+)\b
: Capturing group named table
made only with words and that has a right boundaryIn redshift there are 4 string functions that work with regular expressions (regex). They are:
Function | Description |
---|---|
REGEXP_COUNT | Returns an integer that indicates the number of times the pattern occurs in the string. If no match is found, then the function returns 0. |
REGEXP_INSTR | Returns an integer that indicates the beginning position or ending position of the matched substring. If no match is found, then the function returns 0. |
REGEXP_REPLACE | Replaces every occurrence of the pattern with the specified string. |
REGEXP_SUBSTR | Returns the characters extracted from a string by searching for a regular expression pattern. |
There is also the SIMILAR
operator for using regexs inside WHERE
statements. For example:
SELECT count(*) FROM event WHERE name SIMILAR TO '%(Ring|Die)%'; -- It is the same as: SELECT count(*) FROM event where name LIKE '%Ring%' OR name LIKE '%Die%';