4.4.3. Developing a regex

  1. Think of the PATTERN you want to capture in general terms. “I want three-letter words.”

  2. Write pattern = "\w{3}" and then try it on a few practice strings. The goal is to BREAK your pattern, find out where it fails, and notice new parts of the pattern you missed.

import re
pattern = "\w{3}"
re.findall(pattern,"hey there guy") # whoops, "the" isn't a word here, it's the start of "there"
['hey', 'the', 'guy']

That produced “the” as a 3-letter string, but we wanted words.

pattern = "(\w{3}) " # I added a space after it.
# the () is a way of saying "I'm interested in the stuff (a word, hopefully) inside the parens"
re.findall(pattern,"hey there guy") # whoops, now we get "ere" and lose "guy"
['hey', 'ere']

Darn. It found the two places where three letters were followed by a space. It missed “guy” because that isn’t followed by a space. But there is a way to catch multiple kinds of word boundaries: “\b”.

pattern = "(\w{3})\b" # \b means the string has to be followed by a word boundary
re.findall(pattern,"hey there guy") 
[]

Except that still didn’t work, this time for an annoying reason. \b does mean word boundary in regex, but before the string reaches the regex engine, Python processes it, and \b in a normal Python string means “backspace”. Look:

print('\b') # prints a backspace character, which is invisible


The workaround was discussed at the bottom of the last page: Use “raw string notation” by putting an r in front of the opening quote. This means Python leaves the backslashes in the string “as is”. Look:

print(r'\b')
\b
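
If printing an invisible character feels unconvincing, here is a small sketch that compares the two strings directly. The variable names plain and raw are just for illustration.

```python
plain = '\b'   # Python turns this into ONE character: backspace (code 8)
raw = r'\b'    # raw string: TWO characters, a backslash and the letter b

print(len(plain), len(raw))   # lengths of the two strings
print(plain == '\x08')        # is the plain version really a backspace?
print(raw == '\\b')           # the raw version is the same as escaping the backslash by hand
```

So the regex engine only ever sees the two-character \b (its word boundary) when the string is raw, or when the backslash is escaped by hand.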

Ok, back to our idea of using the word boundary to get “guy”, this time with a “raw string”.

pattern = r"(\w{3})\b" 
re.findall(pattern,"hey there guy") 
['hey', 'ere', 'guy']

Now, we need to get rid of “ere”. A word starts after a word boundary, right?

pattern = r"\b(\w{3})\b"  # make sure the word has a boundary before it
re.findall(pattern,"hey there guy")  # got it!
['hey', 'guy']
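
Following the "try to BREAK it" advice from step 2, here is a sketch that runs the final pattern on a few more practice strings. The extra test strings are made up for illustration; they poke at two things that might surprise you: an apostrophe counts as a word boundary, and \w matches digits and underscores as well as letters.

```python
import re

pattern = r"\b(\w{3})\b"

tests = [
    "hey there guy",   # the original example
    "the cat's toy",   # the apostrophe splits "cat's" into "cat" and "s"
    "abc a_1 x99",     # \w also matches digits and underscores
]

# run the pattern on every practice string and collect the matches
results = {s: re.findall(pattern, s) for s in tests}
for s, found in results.items():
    print(repr(s), found)
```

If "a_1" and "x99" are not what you mean by "words", that is the cue to refine the pattern again, for example by swapping \w for a character class like [a-zA-Z].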