4.4.2. Regex basics
Tip
The best ways to learn and use regex are:
https://regexone.com/ is so good, I’m loath to add anything else to this page.
The official Python documentation and its HOWTO page
Google + Stack Overflow. If someone has done something similar, and found a solution, great!
Imagine you have a webpage or document which includes (buried in the text) a bunch of numbers. How can you collect all the phone numbers?
A: Look for all the instances of this pattern: (###) ###-####.
Your eyeballs can easily do that, but once the job involves enough numbers, it makes sense to let your computer do it for you.
Regex is how you tell a computer to search for any pattern within a string. For example:
Phone numbers
Emails (regex is why people don’t spell out their emails “correctly” online)
Addresses
Certain words/topics (like assignment 5!)
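As a rough sketch of the phone-number idea above, the (###) ###-#### pattern can be written as a regex. The sample text and variable names here are made up for illustration, and the pattern assumes the numbers appear in exactly that format.

```python
import re

# Hypothetical sample text; only two of the numbers follow the (###) ###-#### format.
page_text = "Call (610) 555-0123 or (212) 555-0199. Fax: 555-0000."

# \( and \) match literal parentheses; \d{3} means exactly three digits.
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"

phone_numbers = re.findall(phone_pattern, page_text)
print(phone_numbers)  # ['(610) 555-0123', '(212) 555-0199']
```

Real-world phone numbers come in many formats, so a pattern like this usually needs loosening before it catches everything.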
4.4.2.1. Regex in Python
Tip
Regex is a skill that works in all programming languages, so this lesson is portable - you can use regex in R or whatever your language of choice is.
But obviously, we’re going to use Python. Add import re to the top of your code to load the regex module.
4.4.2.1.1. Common functions:
The full list of functions is in the official documentation for the re module.
re.search(pattern, string, flags=0)
looks for the first instance of the regex pattern within the string and returns a “match object” if one is found. Returns None if there is no match.
import re
re.search("c", "abcdef")
<re.Match object; span=(2, 3), match='c'>
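Because re.search returns None when nothing matches, it is usually worth checking the result before calling methods on it. A minimal sketch; first_match is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, string):
    """Return the text of the first match, or None if the pattern is not found."""
    m = re.search(pattern, string)
    if m is None:          # no match: calling m.group() here would raise AttributeError
        return None
    return m.group(0)      # group(0) is the entire matched text

found = first_match("c", "abcdef")
missing = first_match("z", "abcdef")
print(found)    # c
print(missing)  # None
```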
re.findall(pattern,string)
returns a list of matching strings, and is how you can count the number of matches
text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)
['carefully', 'quickly']
len(re.findall(r"\w+ly", text))
2
re.finditer(pattern, string)
is similar to findall but gives you an iterator of match objects, which is nice if you want to get more info about the matches than just the string
# i want to find all of the adverbs AND THEIR POSITIONS in some text
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
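pattern_to_use = re.compile(pattern)
will create a compiled pattern you can pass as the pattern input to search, findall, and finditer. You can also call those as methods on it, as in pattern_to_use.findall(string).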
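A small sketch of how a compiled pattern might be reused; the sentences and variable names are invented for illustration.

```python
import re

# Compile the adverb pattern once, then reuse it on several strings.
adverb = re.compile(r"\w+ly")

sentences = [
    "He was carefully disguised.",
    "She ran quickly and quietly.",
    "No match here.",
]

# The compiled pattern has its own findall/search/finditer methods...
counts = [len(adverb.findall(s)) for s in sentences]
print(counts)  # [1, 2, 0]

# ...and it can also be passed to the module-level functions in place of a string.
first = re.search(adverb, sentences[0]).group(0)
print(first)  # carefully
```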
.group(#)
if your search or match has parenthesized subgroups, you can access each parenthetical.
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
m.group(0) # The entire match
'Isaac Newton'
m.group(1) # The first parenthesized subgroup.
'Isaac'
m.group(2) # The second parenthesized subgroup.
'Newton'
m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')
4.4.2.1.2. Special characters to build your patterns
Most of this is taken directly from the official documentation.
Char | Matches |
---|---|
. | any character except a newline |
^ | start of the string |
$ | end of the string or just before the newline at the end of the string |
* | match 0 or more repetitions of the preceding RE, as many repetitions as are possible. |
+ | match 1 or more repetitions of the preceding RE, as many repetitions as are possible. |
? | match 0 or 1 repetitions of the preceding RE. |
{m} | match exactly m copies of the preceding RE |
{m,n} | match from m to n repetitions of the preceding RE |
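A quick, non-exhaustive sketch of how these special characters (including the {m} and {m,n} repetition counts) behave. The sample strings are invented for illustration.

```python
import re

# "." matches any single character except a newline
dot = re.findall(r"b.t", "bat bit bet boot")

# "^" anchors the pattern to the start of the string, "$" to the end
starts_with_abc = re.search(r"^abc", "abcdef") is not None
starts_with_def = re.search(r"^def", "abcdef") is not None
ends_with_def = re.search(r"def$", "abcdef") is not None

# "*" = 0 or more, "+" = 1 or more, "?" = 0 or 1 of the preceding item
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")
optional = re.findall(r"colou?r", "color colour")

# {m} = exactly m repetitions, {m,n} = between m and n repetitions
exactly_three = re.findall(r"\d{3}", "12 345 6789")
two_or_three = re.findall(r"\d{2,3}", "1 22 333 4444")

print(dot)            # ['bat', 'bit', 'bet']
print(starts_with_abc, starts_with_def, ends_with_def)  # True False True
print(star)           # ['a', 'ab', 'abb']
print(plus)           # ['ab', 'abb']
print(optional)       # ['color', 'colour']
print(exactly_three)  # ['345', '678']
print(two_or_three)   # ['22', '333', '444']
```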
Note: Do you want the largest match or the smallest?
*, +, ?, and {m,n} are GREEDY: they match as much text as possible. So if you search ab+ against “abbb”, it will match the full string “abbb”. But sometimes you want the smallest match instead. If you add ? after any of those, it will perform the match in a minimal way: using ab+? on the string “abbbbb” will just return “ab”. Use ab*? instead and you’ll get “a”.
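The greedy versus minimal behavior can be checked directly. This is a small sketch; the HTML-like string is a made-up example of where the difference tends to matter.

```python
import re

greedy = re.search(r"ab+", "abbbbb").group(0)        # takes as many b's as it can
lazy_plus = re.search(r"ab+?", "abbbbb").group(0)    # takes as few b's as allowed (one)
lazy_star = re.search(r"ab*?", "abbbbb").group(0)    # zero b's is allowed, so takes none

print(greedy)     # abbbbb
print(lazy_plus)  # ab
print(lazy_star)  # a

# Hypothetical snippet where greedy matching grabs too much.
snippet = "<b>bold</b>"
greedy_tags = re.findall(r"<.*>", snippet)   # one match spanning the whole string
lazy_tags = re.findall(r"<.*?>", snippet)    # each tag separately

print(greedy_tags)  # ['<b>bold</b>']
print(lazy_tags)    # ['<b>', '</b>']
```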
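Char | Matches |
---|---|
\ | escapes special characters (so \. matches a literal period) |
[] | Indicates a set of characters. In a set, [amk] matches a, m, or k, and [a-z] matches any lowercase ASCII letter. |
() | Makes a group. POWERFUL and necessary in most uses of regex. |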
There are SO MANY more special characters. If you can imagine a “feature” in the pattern of a string, there is probably a special character for it. \b matches word boundaries, \d matches digits, \s matches whitespace, and more.
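A short sketch of escapes, sets, groups, and the \b, \d, and \s shorthands in action. The sample text is invented for illustration.

```python
import re

text = "Order 66 shipped on 2021-03-04 to cat, not category."

# A set matches any one of the characters listed inside the brackets.
vowels = re.findall(r"[aeiou]", "regex")

# A backslash escapes a special character: \. is a literal period, not "any character".
periods = re.findall(r"\.", "a.b.c")

# \b is a word boundary, so \bcat\b will not match the "cat" inside "category".
loose = re.findall(r"cat", text)
whole_word = re.findall(r"\bcat\b", text)

# Groups capture pieces of the match; with groups, findall returns tuples.
dates = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)

# \s matches whitespace (spaces, tabs, newlines).
words = re.split(r"\s+", "a  b \t c")

print(vowels)      # ['e', 'e']
print(periods)     # ['.', '.']
print(loose)       # ['cat', 'cat']
print(whole_word)  # ['cat']
print(dates)       # [('2021', '03', '04')]
print(words)       # ['a', 'b', 'c']
```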
Tip
In practice, most “regex” is just Googling to see if someone has done a similar thing.
A few pointers:
You mostly benefit from using re.compile when you reuse the same pattern many times, for example inside a loop: you “compile” it once and every later use is quick. But if you only have a few patterns, don’t bother, since Python caches recently used patterns anyway.
re.match is similar to re.search, but only matches at the beginning of the string. I almost never use match.
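The difference between match and search shows up on the string from the earlier search example. A minimal sketch:

```python
import re

# match only succeeds if the pattern fits at position 0 of the string.
match_c = re.match("c", "abcdef")
# search scans forward through the string until it finds the pattern.
search_c = re.search("c", "abcdef")
# match works when the pattern really is at the start.
match_abc = re.match("abc", "abcdef")

print(match_c)             # None
print(search_c.span())     # (2, 3)
print(match_abc.group(0))  # abc
```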
Raw string notation
You’ll often see people put an “r” in front of the regex pattern. For example: re.search(r"c", "abcdef").
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash (‘\’) in a regular expression would have to be prefixed with another one to escape it.
# These lines are functionally identical
>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
# so are these:
>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
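One way to see why the "r" matters is to compare what Python actually stores for each spelling. A small sketch; the sample string is made up.

```python
import re

# Doubling the backslash and using a raw string produce the same two characters.
same = ("\\d" == r"\d")
print(same)  # True

# Without the "r", Python turns "\b" into a single backspace character
# before the regex engine ever sees it.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)  # 1 2

# So the non-raw pattern looks for literal backspace characters and finds nothing.
without_r = re.findall("\bcat\b", "a cat sat")
with_r = re.findall(r"\bcat\b", "a cat sat")
print(without_r)  # []
print(with_r)     # ['cat']
```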