4.4.2. Regex basics

Tip

The best ways to learn and use regex are:

  1. https://regexone.com is so good, I’m loath to add anything else to this page.

  2. The official Python documentation and its HOWTO page

  3. Google+stackoverflow. “Has someone done something similar? Yes? Great!”

Imagine you have a webpage or document which includes (buried in the text) a bunch of phone numbers. How can you collect all of them?

A: Look for all the instances of this pattern: (###) ###-####.

Your eyeballs can easily do that, but once the job involves enough numbers, it makes sense to let your computer do it for you.

Regex is how you tell a computer to search for any pattern within a string.

  • Phone numbers

  • Emails (regex is why people don’t spell out their emails “correctly” online)

  • Addresses

  • Certain words/topics (like assignment 5!)

4.4.2.1. Regex in Python

Regex is a skill that works in all programming languages, so this lesson is portable - you can use regex in R or whatever your language of choice is.

But obviously, we’re going to use Python. Run import re to load the regex package.
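
As a first taste, here is a minimal sketch of the phone-number search described at the top of this page. The sample text and the variable names are made up for illustration, and the pattern assumes every number is written exactly as (###) ###-####.

```python
import re

# Hypothetical sample text; the numbers are invented for this example.
text = "Call (555) 123-4567 before noon, or (555) 987-6543 after. Room 1234 is closed."

# \( and \) match literal parentheses; \d{3} means exactly three digits.
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"

phone_numbers = re.findall(phone_pattern, text)
print(phone_numbers)  # ['(555) 123-4567', '(555) 987-6543']
```

Note that "Room 1234" is ignored because it does not fit the full pattern. Real-world phone numbers come in many formats, so a production pattern would likely need to be more forgiving.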

4.4.2.1.1. Common functions:

The full list of functions is in the official documentation for the re module.

  • re.search(pattern, string, flags=0) looks for the first instance of the regex pattern within the string and returns a “match object” if one is found. It returns None if there is no match.

import re
re.search("c", "abcdef")
<re.Match object; span=(2, 3), match='c'>
  • re.findall(pattern,string) returns a list of matching strings, and is how you can count the number of matches

text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)
['carefully', 'quickly']
len(re.findall(r"\w+ly", text))
2
  • re.finditer(pattern, string) is similar to findall but gives you an iterator of match objects instead of a list of strings, which is nice if you want more info about the matches than just the matched text

# i want to find all of the adverbs AND THEIR POSITIONS in some text
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
  • pattern_to_use = re.compile(pattern) creates a compiled pattern object. You can pass it as the pattern argument to search, findall, and finditer, or call its own methods, e.g. pattern_to_use.search(string).

  • .group(#): if your pattern has parenthesized subgroups, the match object returned by search or match lets you access each one by number. .group(0) is the entire match.

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")  
m.group(0)       # The entire match
'Isaac Newton'
m.group(1)       # The first parenthesized subgroup.
'Isaac'
m.group(2)       # The second parenthesized subgroup.
'Newton'
m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
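
Two of the bullets above have no example: re.compile, and the fact that re.search returns None when nothing matches. Here is a minimal sketch of both, reusing the adverb pattern from earlier. The variable names are just for illustration.

```python
import re

# Compile once, then reuse the pattern object's own methods.
adverb = re.compile(r"\w+ly")

text = "He was carefully disguised but captured quickly by police."

first = adverb.search(text)   # a match object, or None if nothing matches
if first is not None:
    print(first.group(0), first.span())   # carefully (7, 16)

print(adverb.findall(text))   # ['carefully', 'quickly']

# search returns None when there is no match, so check before calling .group()
missing = adverb.search("No adverbs here.")
print(missing is None)        # True
```

Calling .group() on None raises an AttributeError, which is why the "is not None" check is worth the extra line.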

4.4.2.1.2. Special characters to build your patterns

Most of this is taken directly from the official documentation.

Char     Matches

.        any character except a newline

^        start of the string
         ^[a-z]+ matches the “hi” in “hi there” but not “there”

$        end of the string or just before the newline at the end of the string
         foo matches both ‘foo’ and ‘foobar’, but foo$ matches only ‘foo’

*        match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match “a”, “ab”, or “abbbbbbb”

+        match 1 or more repetitions of the preceding RE, as many repetitions as are possible. ab+ will match “ab”, or “abbbbbbb” but not “a”

?        match 0 or 1 repetitions of the preceding RE. ab? will match “a” and “ab”

{m}      match m repetitions of the preceding RE. ab{3} will match “abbb” but not “abb”

{m,n}    match m to n repetitions of the preceding RE. ab{3,5} will match “abbb” and “abbbbb” but not “abb”
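
To check the repetition characters above for yourself, here is a small sketch that tests each pattern against a handful of candidate strings. It uses re.fullmatch so that a pattern only counts if it matches the entire candidate. The helper function matches_fully is hypothetical, written just for this example.

```python
import re

def matches_fully(pattern, candidate):
    """Hypothetical helper: True when the pattern matches the ENTIRE candidate."""
    return re.fullmatch(pattern, candidate) is not None

candidates = ["a", "ab", "abb", "abbb", "abbbbb"]

results = {}
for pattern in [r"ab*", r"ab+", r"ab?", r"ab{3}", r"ab{3,5}"]:
    results[pattern] = [c for c in candidates if matches_fully(pattern, c)]
    print(pattern, results[pattern])

# ab*      ['a', 'ab', 'abb', 'abbb', 'abbbbb']
# ab+      ['ab', 'abb', 'abbb', 'abbbbb']
# ab?      ['a', 'ab']
# ab{3}    ['abbb']
# ab{3,5}  ['abbb', 'abbbbb']
```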

Note: Do you want the largest match or the smallest?

  • *, +, ? and {m,n} are GREEDY: they match as much text as possible. So if you search ab+ against “abbb” it will match the full string “abbb”. But sometimes you want the smallest match instead.

  • If you add ? after any of those (*?, +?, ??, {m,n}?), it will perform the match in a minimal way: using ab+? on the string “abbbbb” will just return “ab”. Use ab*? instead and you’ll get “a”.
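
Here is a short sketch of greedy versus minimal matching. The first half uses the “abbbbb” string from the note above. The second half, an HTML-like snippet, is an extra illustration that is not from this page.

```python
import re

text = "abbbbb"

greedy_plus = re.search(r"ab+", text).group(0)   # 'abbbbb' (as much as possible)
lazy_plus = re.search(r"ab+?", text).group(0)    # 'ab'     (at least one b, then stop)
lazy_star = re.search(r"ab*?", text).group(0)    # 'a'      (zero b's is enough)
print(greedy_plus, lazy_plus, lazy_star)

# Illustrative extra case: pulling out tags one at a time.
html = "<b>bold</b>"
greedy_tag = re.findall(r"<.*>", html)    # ['<b>bold</b>'] : one giant match
lazy_tags = re.findall(r"<.*?>", html)    # ['<b>', '</b>'] : each tag separately
print(greedy_tag, lazy_tags)
```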

Char     Matches

\        1. escapes special characters: \* will actually search for a literal asterisk.
         2. or signals a “special sequence”

[]       Indicates a set of characters. Characters can be listed individually: [amk] will match ‘a’, ‘m’, or ‘k’.
         Common ranges: [a-z], [A-Z], [0-9]. You can combine ranges: [A-Za-z0-9].
         Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters (, +, *, or ).

(...)    Makes a group. POWERFUL and necessary in most uses of regex.
         If you actually want to match parentheses, use a backslash: \(

There are SO MANY more special characters. If you can imagine a “feature” in the pattern of a string, there is probably a special character for it: \b matches word boundaries, \d matches digits, \s matches whitespace, and more.
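
The sketch below tries out escapes, sets, groups, and the \b and \d sequences on one made-up sentence. The sample text is invented for illustration; the expected results in the comments follow from that text.

```python
import re

# Hypothetical sample text for illustration.
text = "Prices: 3*4 = 12, call (555) 123-4567, cat concatenate cat."

# \* escapes the asterisk so it is matched literally.
escaped = re.findall(r"\d\*\d", text)
print(escaped)                       # ['3*4']

# Groups capture the pieces of a phone number separately.
m = re.search(r"\((\d{3})\) (\d{3})-(\d{4})", text)
print(m.group(1, 2, 3))              # ('555', '123', '4567')

# \b keeps "cat" from matching inside "concatenate".
loose_count = len(re.findall(r"cat", text))
strict_count = len(re.findall(r"\bcat\b", text))
print(loose_count, strict_count)     # 3 2

# A set: any single vowel.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)                        # ['e', 'e']
```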

Tip

Most “regex” in practice is just Googling for someone who has done a similar thing.

A few pointers:

  • re.compile mostly pays off when you reuse the same pattern many times or work with a large number of patterns, since each pattern is compiled once and can then be used quickly. Python also caches recently used patterns, so if you only have a few, don’t bother.

  • re.match is similar to re.search, but it only matches at the beginning of the string. I almost never use match.
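
The difference between re.match and re.search from the last pointer is easiest to see side by side. This is a minimal sketch using the same “abcdef” string as earlier; the variable names are just for illustration.

```python
import re

text = "abcdef"

found = re.search(r"c", text)           # finds 'c' anywhere: span (2, 3)
at_start_c = re.match(r"c", text)       # None: the string does not START with 'c'
at_start_abc = re.match(r"abc", text)   # matches, because 'abc' is at the beginning

# search with ^ behaves like match on this single-line string.
anchored = re.search(r"^c", text)       # None

print(found.span(), at_start_c, at_start_abc.group(0), anchored)
```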

Raw string notation

You’ll often see people put an “r” in front of the regex pattern. For example: re.search(r"c", "abcdef").

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash (‘\’) in a regular expression would have to be prefixed with another one to escape it.

# These lines are functionally identical
>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

# so are these:
>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
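
One case where forgetting the “r” silently gives the wrong answer is \b. This is an extra illustration, not from the examples above: in a normal Python string "\b" is the backspace character, so the regex engine never sees the word-boundary sequence.

```python
import re

text = "a cat sat"

# Without the r prefix, Python turns "\b" into a single backspace character.
not_raw = re.findall("\bcat\b", text)
# With the r prefix, the backslash and the b reach the regex engine intact.
raw = re.findall(r"\bcat\b", text)

print(len("\b"), len(r"\b"))   # 1 2
print(not_raw)                 # []
print(raw)                     # ['cat']
```

No error is raised in the first case, which is what makes it easy to miss. Using raw strings for every pattern avoids the problem.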