4.4.2. Regex basics


The best ways to learn and use regex are:

  1. https://regexone.com/ is so good, I’m loath to add anything else to this page.

  2. The official Python documentation and its regex HOWTO page

  3. Google + Stack Overflow. If someone has already solved a similar problem, great!

Imagine you have a webpage or document which includes (buried in the text) a bunch of phone numbers. How can you collect all of them?

A: Look for all the instances of this pattern: (###) ###-####.

Your eyeballs can easily do that, but once the job involves enough numbers, it makes sense to let your computer do it for you.
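As a minimal sketch of handing that job to the computer, the snippet below searches a made-up string for the (###) ###-#### pattern. The sample text and the variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample text; the numbers are made up for illustration.
text = "Call (555) 123-4567 or (555) 987-6543 before 5pm on 12/31."

# \( and \) match literal parentheses; \d{3} means exactly three digits.
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"

phones = re.findall(phone_pattern, text)
print(phones)  # ['(555) 123-4567', '(555) 987-6543']
```

The other numbers in the text (5 and 12/31) are ignored because they do not fit the pattern.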

Regex is how you tell a computer to search for any pattern within a string.

  • Phone numbers

  • Emails (regex is why people don’t spell out their emails “correctly” online)

  • Addresses

  • Certain words/topics (like assignment 5!)

Regex in Python


Regex is a skill that works in all programming languages, so this lesson is portable - you can use regex in R or whatever your language of choice is.

But obviously, we’re going to use Python. Add import re to the top of your code to load the regex package. The full list of functions is in the official documentation for the re module. Common functions:

  • re.search(pattern, string, flags=0) looks for the first instance of the regex pattern within the string and returns a “match object” if one is found. Returns None if there is no match.

import re
re.search("c", "abcdef")
<re.Match object; span=(2, 3), match='c'>
  • re.findall(pattern, string) returns a list of matching strings; take the length of that list to count the number of matches

text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)
['carefully', 'quickly']
len(re.findall(r"\w+ly", text))
2
  • re.finditer(pattern, string) is similar to findall but gives you an iterator of match objects, which is nice if you want to get more info about the matches than just the string

# i want to find all of the adverbs AND THEIR POSITIONS in some text
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
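  • pattern_to_use = re.compile(pattern) creates a compiled pattern you can pass as the pattern argument to search, findall, and finditer. You can also call those methods on it directly, as in pattern_to_use.search(string).

  • .group(#): if your pattern has parenthesized subgroups, call .group(#) on the match object returned by search or match to access each parenthetical.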

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
m.group(0)       # The entire match
'Isaac Newton'
m.group(1)       # The first parenthesized subgroup.
'Isaac'
m.group(2)       # The second parenthesized subgroup.
'Newton'
m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

Special characters to build your patterns

Most of this is taken directly from the official documentation.




  • . (dot): any character except a newline

  • ^ (caret): start of the string. ^[a-z]+ matches the “hi” in “hi there” but not “there”

  • $ (dollar sign): end of the string or just before the newline at the end of the string. foo matches both ‘foo’ and ‘foobar’, but foo$ matches only ‘foo’

  • * (asterisk): match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match “a”, “ab”, or “abbbbbbb”

  • + (plus): match 1 or more repetitions of the preceding RE, as many repetitions as are possible. ab+ will match “ab”, or “abbbbbbb” but not “a”

  • ? (question mark): match 0 or 1 repetitions of the preceding RE. ab? will match “a” and “ab”

  • {m}: match exactly m repetitions of the preceding RE. ab{3} will match “abbb” but not “abb”

  • {m,n}: match m to n repetitions of the preceding RE. ab{3,5} will match “abbb” and “abbbbb” but not “abb”
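The anchors and repetition operators described above can be checked directly. This is a small sketch, not an exhaustive test; the candidate strings and variable names are assumptions chosen to mirror the examples in the text.

```python
import re

# ^ anchors the pattern to the start of the string.
start_words = re.findall(r"^[a-z]+", "hi there")

# $ anchors the pattern to the end of the string.
ends_with_foo = [s for s in ["foo", "foobar"] if re.search(r"foo$", s)]

# re.fullmatch only succeeds if the WHOLE string fits the pattern,
# which makes it easy to see what each repetition operator allows.
candidates = ["a", "ab", "abb", "abbb", "abbbbb"]
allowed = {}
for pattern in [r"ab*", r"ab+", r"ab?", r"ab{3}", r"ab{3,5}"]:
    allowed[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]

print(start_words)    # ['hi']
print(ends_with_foo)  # ['foo']
for pattern, strings in allowed.items():
    print(pattern, strings)
```

re.fullmatch is used here instead of re.search because search would happily find “abbb” inside a longer run of b’s, which hides the difference between the operators.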

Note: Do you want the largest match or the smallest?

  • *, +, ? and {m,n} are GREEDY: they match as much text as possible. So if you search ab+ against “abbb” it will match the full string “abbb”. But sometimes you want the smallest possible match instead.

  • If you add ? after any of those, it will perform the match in a minimal way: using ab+? on the string “abbbbb” will just return “ab”. Use ab*? instead and you’ll get “a”.
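A quick sketch of greedy versus minimal matching, using the “abbbbb” string from the note plus a made-up HTML-like string (an assumption for illustration) where the difference matters more in practice:

```python
import re

s = "abbbbb"
greedy_plus = re.search(r"ab+", s).group(0)   # takes every b it can
lazy_plus = re.search(r"ab+?", s).group(0)    # stops after one b
lazy_star = re.search(r"ab*?", s).group(0)    # stops after zero b's

print(greedy_plus, lazy_plus, lazy_star)  # abbbbb ab a

# Hypothetical snippet of markup: grab each tag separately.
html = "<b>bold</b> and <i>italic</i>"
greedy_tags = re.findall(r"<.+>", html)   # one giant match
lazy_tags = re.findall(r"<.+?>", html)    # one match per tag

print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```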




  • \ (backslash): either (1) escapes special characters, so \* will actually search for an asterisk, or (2) signals a “special sequence”

  • [] (square brackets): indicates a set of characters. In a set, [amk] will match ‘a’, ‘m’, or ‘k’. Common ranges: [a-z], [A-Z], [0-9]. You can combine ranges: [A-Za-z0-9]. Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters (, +, *, or ).

  • () (parentheses): makes a group. POWERFUL and necessary in most uses of regex. If you actually want to match parentheses, use a backslash: \(
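Here is a short sketch of sets, groups, and escaped parentheses working together. The sample strings (including the email address) are made up for illustration.

```python
import re

# A set matches exactly one character from the listed options.
set_hits = re.findall(r"[amk]", "make a map")

# Combined ranges: runs of letters and digits.
tokens = re.findall(r"[A-Za-z0-9]+", "Room 42-B")

# Inside a set, ( + * ) are just literal characters.
literals = re.findall(r"[(+*)]", "(1+2)*3")

# Groups let you pull out pieces of a match (hypothetical address).
m = re.search(r"(\w+)@(\w+)\.com", "contact: jane@example.com")
user, domain = m.group(1), m.group(2)

# \( and \) match literal parentheses; the inner ( ) is a group.
# With one group in the pattern, findall returns just the group's text.
area_codes = re.findall(r"\((\d{3})\)", "(555) 123-4567")

print(set_hits)      # ['m', 'a', 'k', 'a', 'm', 'a']
print(tokens)        # ['Room', '42', 'B']
print(literals)      # ['(', '+', ')', '*']
print(user, domain)  # jane example
print(area_codes)    # ['555']
```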

There are SO MANY more special characters. If you can imagine a “feature” in the pattern of a string, there is probably a special character for it: \b matches word boundaries, \d matches digits, \s matches whitespace, and more.
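To make \d, \b, and \s concrete, here is a small sketch on made-up text (the sentence and variable names are assumptions):

```python
import re

text = "Order 66 shipped on 2024-01-15 to cat lovers; concatenate that."

# \d+ : runs of digits
numbers = re.findall(r"\d+", text)

# Without \b, "cat" is also found inside "concatenate".
loose = re.findall(r"cat", text)
# With \b on both sides, only the standalone word matches.
strict = re.findall(r"\bcat\b", text)

# \s+ : any run of whitespace (spaces, tabs, newlines)
pieces = re.split(r"\s+", "a  b\tc\nd")

print(numbers)  # ['66', '2024', '01', '15']
print(loose)    # ['cat', 'cat']
print(strict)   # ['cat']
print(pieces)   # ['a', 'b', 'c', 'd']
```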


In practice, most “regex” is just Googling to see if someone has done a similar thing.

A few pointers:

  • You only benefit from using re.compile when you are creating a bunch of regex patterns or reusing the same pattern many times. In that case, you “compile” them once and can immediately use them all quickly. But if you only have a few patterns, don’t bother.

  • re.match is similar to re.search, but only matches at the beginning of the string. I almost never use match.
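A brief sketch of both pointers: reusing one compiled pattern across several strings, and the difference between match and search. The list of sentences is an assumption, adapted from the adverb example earlier on this page.

```python
import re

# Compile once, then reuse the pattern object on many strings.
adverb = re.compile(r"\w+ly")
lines = [
    "He was carefully disguised.",
    "Captured quickly by police.",
    "No match here.",
]
counts = [len(adverb.findall(line)) for line in lines]
print(counts)  # [1, 1, 0]

# match only tries at position 0; search scans the whole string.
from_start = re.match(r"c", "abcdef")
anywhere = re.search(r"c", "abcdef")
print(from_start)       # None
print(anywhere.span())  # (2, 3)

# Both return None on failure, so check before calling .group().
m = re.search(r"z", "abcdef")
result = m.group(0) if m is not None else "no match"
print(result)  # no match
```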

Raw string notation

You’ll often see people put an “r” in front of the regex pattern. For example: re.search(r"c", "abcdef").

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash (‘\’) in a regular expression would have to be prefixed with another one to escape it.

# These lines are functionally identical
>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

# so are these:
>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>