4.4.2. Regex basics

Tip

The best ways to learn and use regex are:

https://regexone.com is so good, I’m loath to add anything else to this page.
The official Python documentation and its regex HOWTO page
Google + Stack Overflow. “Has someone done something similar? Yes? Great!”

Imagine you have a webpage or document which includes (buried in the text) a bunch of numbers. How can you collect all the phone numbers?

A: Look for all the instances of this pattern: (###) ###-####.

Your eyeballs can easily do that, but once the job involves enough numbers, it makes sense to let your computer do it for you.

Regex is how you tell a computer to search for any pattern within a string.

Phone numbers
Emails (regex is why people don’t spell out their emails “correctly” online)
Addresses
Certain words/topics (like assignment 5!)
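
As a minimal sketch of the phone number example above, assuming every number is written exactly as (###) ###-####: the sample text and the variable names below are made up for illustration, and the pieces of the pattern are explained later on this page.

```python
import re

# Hypothetical sample text; the phone numbers are invented for illustration.
page_text = "Call (555) 123-4567 or (555) 987-6543. Room 1234 is not a phone number."

# \( and \) match literal parentheses; \d{3} means exactly three digits.
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"

phone_numbers = re.findall(phone_pattern, page_text)
print(phone_numbers)
```

Real phone numbers come in many formats, so a pattern this strict will miss anything written differently (for example 555-123-4567).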

4.4.2.1. Regex in Python

Regex is a skill that works in all programming languages, so this lesson is portable - you can use regex in R or whatever your language of choice is.

But obviously, we’re going to use Python. Run import re to load the regex module.

4.4.2.1.1. Common functions:

The full list of functions is in the official documentation for the re module.

re.search(pattern, string, flags=0) looks for the first instance of the regex pattern within the string and returns a “match object” if one is found. It returns None if there is no match.

import re
re.search("c", "abcdef")

<re.Match object; span=(2, 3), match='c'>
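
Because re.search returns None when nothing matches, calling .group() directly on the result can raise an error. A small sketch of one way to guard against that; the helper name first_match_text is hypothetical, not part of the re module.

```python
import re

def first_match_text(pattern, string):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, string)
    if m is None:          # no match: re.search returned None
        return None
    return m.group(0)      # the text that matched

found = first_match_text("c", "abcdef")
missing = first_match_text("z", "abcdef")
print(found, missing)
```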

re.findall(pattern, string) returns a list of matching strings. Taking the length of that list is how you can count the number of matches.

text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)

['carefully', 'quickly']

len(re.findall(r"\w+ly", text))

re.finditer(pattern, string) is similar to findall but gives you an iterator of match objects, which is nice if you want to get more info about the matches than just the string.

# I want to find all of the adverbs AND THEIR POSITIONS in some text
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

07-16: carefully
40-47: quickly

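pattern_to_use = re.compile(pattern) will create a compiled pattern object. You can pass it as the pattern to re.search, re.findall, and re.finditer, or call its own methods, such as pattern_to_use.search(string).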
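
A short sketch of both ways to use a compiled pattern, reusing the adverb example from above. The variable names are illustrative.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Compile once, then reuse the pattern object.
adverb_pattern = re.compile(r"\w+ly")

via_module = re.findall(adverb_pattern, text)   # pass it to a module-level function
via_method = adverb_pattern.findall(text)       # or call the method on the pattern itself
first = adverb_pattern.search(text)             # returns a match object, like re.search

print(via_module, via_method, first.group(0))
```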

.group(#): if your pattern has parenthesized subgroups, you can access the text matched by each one by calling .group(#) on the match object, where # is the number of the subgroup.

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")  
m.group(0)       # The entire match

'Isaac Newton'

m.group(1)       # The first parenthesized subgroup.

'Isaac'

m.group(2)       # The second parenthesized subgroup.

'Newton'

m.group(1, 2)    # Multiple arguments give us a tuple.

('Isaac', 'Newton')

4.4.2.1.2. Special characters to build your patterns

Most of this is taken directly from the official documentation.

Char	Matches
`.`	any character except a newline
`^`	start of the string. `^[a-z]+` matches the “hi” in “hi there” but not “there”
`$`	end of the string or just before the newline at the end of the string. `foo` matches both ‘foo’ and ‘foobar’, but `foo$` matches only ‘foo’
`*`	match 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*` will match “a”, “ab”, or “abbbbbbb”
`+`	match 1 or more repetitions of the preceding RE, as many repetitions as are possible. `ab+` will match “ab”, or “abbbbbbb” but not “a”
`?`	match 0 or 1 repetitions of the preceding RE. `ab?` will match “a” and “ab”
`{m}`	match `m` repetitions of the preceding RE. `ab{3}` will match “abbb” but not “abb”
`{m,n}`	match `m` to `n` repetitions of the preceding RE. `ab{3,5}` will match “abbb” and “abbbbb” but not “abb”
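
The examples in the table can be checked with a few lines of code. This sketch uses re.fullmatch, which only succeeds if the whole string fits the pattern; that choice is an assumption made here to get a clear yes/no answer for each row.

```python
import re

# (pattern, candidate string) pairs taken from the table's examples
examples = [
    ("ab*", "a"), ("ab*", "abbbbbbb"),
    ("ab+", "a"), ("ab+", "abbbbbbb"),
    ("ab?", "ab"),
    ("ab{3}", "abb"), ("ab{3}", "abbb"),
    ("ab{3,5}", "abb"), ("ab{3,5}", "abbbbb"),
]

# True if the entire string matches the pattern, False otherwise
results = {(p, s): re.fullmatch(p, s) is not None for p, s in examples}
for (p, s), ok in results.items():
    print(f"{p!r} vs {s!r}: {ok}")

# ^ and $ anchor the match to the start and end of the string
start_match = re.search(r"^[a-z]+", "hi there").group(0)
end_miss = re.search(r"foo$", "foobar")
```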

Note: Do you want the largest match or the smallest?

*, +, ? and {m,n} are GREEDY: they match as much text as possible. So if you search ab+ against “abbb” it will match the full string “abbb”. But sometimes you want the smallest possible match instead.
If you add ? after any of those, it will perform the match in a minimal way: using ab+? on the string “abbbbb” will just return “ab”. Use ab*? instead and you’ll get “a”.
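
To make the greedy versus minimal distinction concrete, here is a small sketch. The ? goes directly after the quantifier it modifies; the second example, with angle brackets, follows the one in the official re documentation.

```python
import re

s = "abbbbb"
greedy_plus = re.search(r"ab+", s).group(0)    # grabs as many b's as possible
lazy_plus = re.search(r"ab+?", s).group(0)     # stops after the first b
lazy_star = re.search(r"ab*?", s).group(0)     # zero b's is enough

tagged = "<a> b <c>"
greedy_tag = re.search(r"<.*>", tagged).group(0)   # runs to the LAST ">"
lazy_tag = re.search(r"<.*?>", tagged).group(0)    # stops at the FIRST ">"

print(greedy_plus, lazy_plus, lazy_star, greedy_tag, lazy_tag)
```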

Char	Matches
`\`	1. escapes special characters: `\*` will actually search for an asterisk. 2. signals a “special sequence”
`[]`	Indicates a set of characters. In a set: `[amk]` will match ‘a’, ‘m’, or ‘k’. Common ranges: `[a-z]`, `[A-Z]`, `[0-9]`. You can combine ranges: `[A-Za-z0-9]`. Special characters lose their special meaning inside sets. For example, `[(+*)]` will match any of the literal characters `(`, `+`, `*`, or `)`.
`(...)`	Makes a group. POWERFUL and necessary in most uses of regex. If you actually want to match parentheses, use a backslash: `\(`

There are SO MANY more special characters. If you can imagine a “feature” in the pattern of a string, there is probably a special character for it. \b matches word boundaries, \d matches digits, \s matches whitespace, and more.
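
A quick sketch that exercises escapes, sets, and the special sequences just mentioned. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text
text = "Order #42 shipped on 2021-03-15 (cost: $19.99*)."

digit_runs = re.findall(r"\d+", text)                  # every run of digits
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group(0)  # a ####-##-## shaped date
price = re.search(r"\$\d+\.\d+", text).group(0)        # \$ and \. are literal characters
symbols = re.findall(r"[(*)]", text)                   # inside a set, ( * ) are literal
s_words = re.findall(r"\bs\w+", text)                  # words that START with "s"
n_spaces = len(re.findall(r"\s", text))                # count whitespace characters

print(digit_runs, date, price, symbols, s_words, n_spaces)
```

Note that \bs\w+ skips the “s” inside “cost” because \b requires a word boundary right before the “s”.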

Tip

Most “regex” in practice is just Googling for someone who has done a similar thing.

A few pointers:

You mainly benefit from using re.compile when you have a bunch of regex patterns that you reuse many times. In that case, you “compile” them once and can reuse them all quickly. But if you only have a few patterns, don’t bother.
re.match is similar to re.search, but it only matches at the beginning of the string. I almost never use match.
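
The difference between re.match and re.search is easiest to see side by side. A minimal sketch using the same string as the earlier re.search example:

```python
import re

s = "abcdef"

match_c = re.match("c", s)        # None: "c" is not at the start of the string
search_c = re.search("c", s)      # finds "c" at position 2
anchored_c = re.search("^c", s)   # None: ^ ties the search to the start of the string
match_abc = re.match("abc", s)    # matches, because "abc" is at the start

print(match_c, search_c.span(), anchored_c, match_abc.group(0))
```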

Raw string notation

You’ll often see people put an “r” in front of the regex pattern. For example: re.search(r"c", "abcdef").

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash (‘\’) in a regular expression would have to be prefixed with another one to escape it.

# These lines are functionally identical
>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

# so are these:
>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
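
Forgetting the “r” can fail silently rather than raise an error. A small sketch of what goes wrong, assuming a made-up sentence: in a normal Python string, "\b" is a backspace character, so the regex engine never sees the word-boundary sequence.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash and the letter n
lengths = (len(plain), len(raw))

sentence = "cat concat cat"
with_raw = re.findall(r"\bcat\b", sentence)     # \b reaches the regex engine as a word boundary
without_raw = re.findall("\bcat\b", sentence)   # "\b" is a backspace character here, so nothing matches

print(lengths, with_raw, without_raw)
```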

LeDataSciFi-2021
