4.4.3. Developing a regex
Think of the PATTERN you want to capture in general terms. “I want three-letter words.”
Write
pattern = "\w{3}"
and then try it on a few practice strings. The goal is to BREAK your pattern, find out where it fails, and notice new parts of the pattern you missed.
import re
pattern = "\w{3}"
re.findall(pattern,"hey there guy") # whoops, "the" is only the start of "there", not a 3-letter word
['hey', 'the', 'guy']
That produced “the” as a 3-letter string, but we wanted whole words.
So let’s try this next pattern, which looks for three letters and then a space:
pattern = "(\w{3}) " # I added a space after it.
# the () is a way of saying "I'm interested in the stuff (a word, hopefully) inside the parentheses
# even though I need to tell you about other things to find our target words"
re.findall(pattern,"hey there guy") # whoops, "ere" is only the end of "there", and "guy" is missing
['hey', 'ere']
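As a quick aside, the comment about parentheses can be checked directly. The sketch below compares the same pattern with and without a capture group; the variable names `without_group` and `with_group` are just illustrative and do not appear elsewhere in this section.

```python
import re

text = "hey there guy"

# No parentheses: findall returns the whole match, trailing space included.
without_group = re.findall(r"\w{3} ", text)

# With parentheses: findall returns only the part captured by the group.
with_group = re.findall(r"(\w{3}) ", text)

print(without_group)  # ['hey ', 'ere ']
print(with_group)     # ['hey', 'ere']
```

Either way the pattern has the same two problems; the group only changes what gets reported back.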
Darn. It found the two places where three letters were followed by a space. It missed “guy” because that isn’t followed by a space. But there is a way to catch multiple kinds of word boundaries: “\b”
pattern = "(\w{3})\b" # \b means the string has to be followed by a word boundary
re.findall(pattern,"hey there guy")
[]
Except that still didn’t work, but this time for an annoying reason. Yes, \b means word boundary in regex, but before the string is handed to the regex engine, Python processes the string, and \b in a Python string means “backspace”. Annoying, but look:
print('\b')
The workaround was discussed at the bottom of the last page: Use “raw string notation”. This means the string is left “as is”. Look:
print(r'\b')
\b
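Because a backspace is invisible when printed, it can help to inspect the two strings instead of printing them. This is a minimal sketch; the names `backspace` and `raw` are made up for the illustration.

```python
backspace = '\b'   # ordinary string: Python turns \b into one backspace character
raw = r'\b'        # raw string: the backslash and the "b" stay as two characters

print(len(backspace), len(raw))  # 1 2
print(backspace == '\x08')       # True: \x08 is the backspace character
print(raw == '\\b')              # True: same as escaping the backslash by hand
```

So the earlier pattern was asking regex to find three word characters followed by a literal backspace, which is why it matched nothing.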
Ok, back to our idea of using the word boundary to get “guy”, this time with a “raw string”.
pattern = r"(\w{3})\b"
re.findall(pattern,"hey there guy")
['hey', 'ere', 'guy']
That worked! Now, we need to get rid of “ere”. A word starts after a word boundary, right?
pattern = r"\b(\w{3})\b" # make sure the word has a boundary before it
re.findall(pattern,"hey there guy") # got it!
['hey', 'guy']
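In the spirit of trying to BREAK the pattern, it may be worth running the final version on a few more practice strings. The helper `try_pattern` and the sample strings below are hypothetical, written only for this illustration.

```python
import re

def try_pattern(pattern, samples):
    """Run the pattern on each sample and return {sample: matches}."""
    return {s: re.findall(pattern, s) for s in samples}

pattern = r"\b(\w{3})\b"
samples = ["hey there guy", "The cat sat.", "call 911 now", "don't stop"]

results = try_pattern(pattern, samples)
for sample, matches in results.items():
    print(repr(sample), "->", matches)
# 'hey there guy' -> ['hey', 'guy']
# 'The cat sat.' -> ['The', 'cat', 'sat']
# 'call 911 now' -> ['911', 'now']
# "don't stop" -> ['don']
```

The last two strings show where this pattern still fails: \w also matches digits (and underscores), and an apostrophe counts as a word boundary, so “911” and “don” are reported as three-letter words. Whether that matters depends on the text you plan to search.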