Intro to "Regular Expressions" (aka "regex" aka "regx" aka "re")
Why are we learning regex
Q: Imagine you have a webpage or document which includes (buried in the text) a bunch of numbers. How can you collect all the phone numbers?
A: Look for all the instances of this pattern: (###) ###-####.
Your eyeballs can easily do that, but once the job involves enough numbers, it makes sense to let your computer do it for you.
Regex is how you tell a computer to search for any pattern within a string:
- Phone numbers
- Emails (regex is why people don't spell out their emails "correctly" online)
- Addresses
- Certain words/topics (like assignment 5!)
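For example, the phone-number pattern from the intro can be written directly as a regex. This is just a sketch on a made-up string, and it only catches the exact `(###) ###-####` format, not every way people write phone numbers:

```python
import re

text = "Call (610) 758-3000 or (212) 555-0123 for info."
# \( and \) match literal parentheses, \d{3} matches exactly 3 digits
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"
print(re.findall(phone_pattern, text))  # ['(610) 758-3000', '(212) 555-0123']
```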
Learning by playing
Let's go to https://regexone.com/ . Watch me first, then you can take off.
Regex in Python
Regex is a skill that works in all programming languages, so this lesson is portable - you can use regex in R or whatever your language of choice is.
But obviously, we're going to use Python. Run `import re` to load the regex package.
Common functions:
The full list of functions is in the official documentation.
`re.search(pattern, string, flags=0)` looks for the first instance of the regex pattern within the string and returns a "match object" if one is found. Returns `None` if no match.

```python
>>> re.search("c", "abcdef")
<re.Match object; span=(2, 3), match='c'>
```
`re.findall(pattern, string)` returns a list of matching strings, and is how you can count the number of matches.

```python
>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']
>>> len(re.findall(r"\w+ly", text))
2
```
`re.finditer(pattern, string)` is similar to `findall` but gives you an iterator of match objects, which is nice if you want to get more info about the matches than just the string.

```python
# I want to find all of the adverbs AND THEIR POSITIONS in some text
>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
```
`pattern_to_use = re.compile(pattern)` will create a pattern object whose `search`, `match`, and `findall` methods you can call directly.

```python
result = re.search(pattern, string)
# is equivalent to:
prog = re.compile(pattern)
result = prog.search(string)
```
`.group(#)`: if your search or match has parenthesized subgroups, you can access each parenthetical.

```python
# looks for two words with 1 space between
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)     # The entire match
'Isaac Newton'
>>> m.group(1)     # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)     # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)  # Multiple arguments give us a tuple.
('Isaac', 'Newton')
```
A few pointers:
- You only benefit from using `re.compile` when you are creating a bunch of regex patterns. In that case, you "compile" them once and can then use them all quickly. But if you only have a few patterns, don't bother.
- `re.match` is similar to `re.search`, but only looks for the pattern at the beginning of the string. I almost never use `match`.
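The difference between the two is easy to see on a small string:

```python
import re

s = "abcdef"
print(re.search("c", s))  # finds 'c' anywhere in the string -> a match object
print(re.match("c", s))   # None: 'c' is not at position 0
print(re.match("a", s))   # a match object, because 'a' starts the string
```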
Raw string notation
You'll often see people put an `r` in front of the regex pattern. For example: `re.search(r"c", "abcdef")`.
Raw string notation (`r"text"`) keeps regular expressions sane. Without it, every backslash (`'\'`) in a regular expression would have to be prefixed with another one to escape it.
```python
# These lines are functionally identical
>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

# so are these:
>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
```
Special characters to build your patterns
Most of this is taken directly from the official documentation.
| Char | Matches |
|---|---|
| `.` | any character except a newline |
| `^` | start of the string. `^[a-z]+` matches the "hi" in "hi there" but not "there" |
| `$` | end of the string or just before the newline at the end of the string. `foo` matches both 'foo' and 'foobar', but `foo$` matches only 'foo' |
| `*` | 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*` will match "a", "ab", or "abbbbbbb" |
| `+` | 1 or more repetitions of the preceding RE, as many repetitions as are possible. `ab+` will match "ab" or "abbbbbbb", but not "a" |
| `?` | 0 or 1 repetitions of the preceding RE. `ab?` will match "a" and "ab" |
| `{m}` | exactly m repetitions of the preceding RE. `ab{3}` will match "abbb" but not "abb" |
| `{m,n}` | m to n repetitions of the preceding RE. `ab{3,5}` will match "abbb" and "abbbbb" but not "abb" |
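A quick way to see those quantifiers in action, on a made-up string:

```python
import re

tests = "a ab abbb abbbbb"
print(re.findall(r"ab*", tests))      # ['a', 'ab', 'abbb', 'abbbbb']
print(re.findall(r"ab+", tests))      # ['ab', 'abbb', 'abbbbb']
print(re.findall(r"ab?", tests))      # ['a', 'ab', 'ab', 'ab']
print(re.findall(r"ab{3}", tests))    # ['abbb', 'abbb']
print(re.findall(r"ab{3,5}", tests))  # ['abbb', 'abbbbb']
```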
Note: Do you want the largest match or the smallest? `*`, `+`, `{m}` and `{m,n}` are GREEDY: they match as much text as possible. So if you search `ab+` against "abbb", it will match the full string "abbb". But sometimes you want the smallest match. If you add `?` to any of those quantifiers, it will perform the match in a minimal way: using `ab+?` on the string "abbbbb" will just return "ab". Use `ab*?` instead and you'll get "a".
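Greedy vs. minimal, side by side:

```python
import re

s = "abbbbb"
print(re.search(r"ab+", s).group())   # 'abbbbb' - greedy grabs all the b's
print(re.search(r"ab+?", s).group())  # 'ab'     - adding ? makes it minimal
print(re.search(r"ab*?", s).group())  # 'a'      - * allows zero b's
```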
| Char | Matches |
|---|---|
| `\` | 1. escapes special characters: `\*` will actually search for an asterisk. 2. or signals a "special sequence" |
| `[]` | Indicates a set of characters. In a set: `[amk]` will match 'a', 'm', or 'k'. Common ranges: `[a-z]`, `[A-Z]`, `[0-9]`. You can combine ranges: `[A-Za-z0-9]`. Special characters lose their special meaning inside sets. For example, `[(+*)]` will match any of the literal characters `(`, `+`, `*`, or `)`. |
| `(...)` | Makes a group. POWERFUL and necessary in most uses of regex. If you actually want to match parentheses, use a backslash: `\(` |
There are SO MANY more special characters. If you can imagine a "feature" in the pattern of a string, there is probably a special character for it: `\b` matches word boundaries, `\d` matches digits, `\s` matches whitespace, and more.
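A few of those special sequences at work, on a made-up string:

```python
import re

text = "Order 66 shipped on 2024-05-01 to Bob"
print(re.findall(r"\d+", text))        # runs of digits: ['66', '2024', '05', '01']
print(re.findall(r"\b\w{3}\b", text))  # exactly-3-letter "words": ['Bob']
print(re.split(r"\s+", text))          # split on whitespace
```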
Moral: Most "regex" in practice is just Googling for someone who has done a similar thing.
```python
pattern = "\w{3}"
re.findall(pattern, "hey there guy")  # whoops, "the" isn't a 3 letter word
# tried but failed:
# "(\w{3}) "   <-- a space
# "(\w{3})\b"  <-- a word boundary should work! why not?

pattern = r"(\w{3})\b"  # trying that raw string notation thing
re.findall(pattern, "hey there guy")
# it made the \b work, but the pattern is still failing...

pattern = r"\b(\w{3})\b"  # make sure the word has a boundary before it too
re.findall(pattern, "hey there guy")  # got it!
```
Finding words near each other
You can find and download this function here.
I usually put it in the same folder as my code for an assignment, and then in the assignment write `from NEAR_regex import NEAR_regex`. Then, you can use it in the assignment without pasting this big block of code into it.
```python
def NEAR_regex(list_of_words, max_words_between=5, partial=False, cases_matter=False):
    '''
    Parameters
    ----------
    list_of_words : list
        A list of "words", each element is a string.
        This program will return a regex that will look for times where word1
        is near word2, or word2 is near word1.
        It works with multiple words: you can see if word1 is near word2 or
        word3.
    max_words_between : int, optional
        How many "words" are allowed between words in list_of_words. The default
        is 5, but you should consider this carefully.
        "Words" in between are chunks of characters. "DON don don- don12 2454"
        is 5 words.
        This will not allow matches if the words are separated by a newline
        ("\n") character.
    partial : bool, optional
        If True, will accept longer words than you give. For example, if one
        word in your list is "how", it will match to "howdy". Be careful in
        choosing this based on your problem. Partial makes more sense with
        longer words.
        The default is False.
    cases_matter : bool, optional but IMPORTANT
        If True, will return a regex string that will only catch cases where
        words in the string have the same case as given as input to this
        function. For example, if one word here is "Hi", then the regex
        produced by this function will not catch "hi".
        If False, will return a regex string that will only work if all letters
        in the search string are lowercase.
        The default is False.

    Warning
    -------
    This will NOT find matches where the words are separated by line breaks.
    I recommend purging line breaks from your strings (replace them with
    spaces), unless you are SURE the only breaks left are meaningful
    paragraph breaks.

    Unsure about speed
    ------------------
    I don't think this is a very "fast" function, but it should be robust.

    Suggested use
    -------------
    a_string_you_have = 'Jack and Jill went up the hill'
    # 1. define words and set up the regex
    words = ['jack','hill']
    rgx = NEAR_regex(words)
    # 2. convert the string to lowercase before searching!
    a_string_you_have = a_string_you_have.lower()
    # 3. len+findall+rgx = counts the number of times the word groups are close
    count = len(re.findall(rgx, a_string_you_have))
    print(count)

    Returns
    -------
    A string which is a regex that can be used to look for cases where all the
    input words are near each other.
    '''
    from itertools import permutations

    start = r'(?:\b'  # the r means "raw": the backslash is just a backslash, not an escape character
    if partial:
        gap = r'[A-Za-z]*\b(?: +[^ \n\r]*){0,' + str(max_words_between) + r'} *\b'
        end = r'[A-Za-z]*\b)'
    else:
        gap = r'\b(?: +[^ \n]*){0,' + str(max_words_between) + r'} *\b'
        end = r'\b)'

    regex_list = []
    for permu in list(permutations(list_of_words)):
        # catch this permutation: start + word + gap (+ word + gap)... + end
        if cases_matter:  # case sensitive - what cases the user gives are given back
            regex_list.append(start + gap.join(permu) + end)
        else:  # the resulting search will only work if all words are lowercase
            lowerpermu = [w.lower() for w in permu]
            regex_list.append(start + gap.join(lowerpermu) + end)

    return '|'.join(regex_list)
```
```python
import re

test = 'This is a partial string another break with words'

words = ['part','with']
rgx = NEAR_regex(words)
print(len(re.findall(rgx,test)))  # no match (partials not allowed) - good!

rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))  # match (partials allowed) - good!

rgx = NEAR_regex(words,partial=True,max_words_between=1)
print(len(re.findall(rgx,test)))  # no match (too far apart) - good!

words = ['part','With']
rgx = NEAR_regex(words,partial=True,cases_matter=True)
print(len(re.findall(rgx,test)))  # no match - good! "with" != "With"

words = ['part','with','this']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))          # no match - good! "This" != "this"
print(len(re.findall(rgx,test.lower())))  # match - good!

test = 'This is a partial string \n another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))  # fails because of the \n break

test = 'This is a partial string \r another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))  # fails with \r too

test = 'This is a partial string      another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))  # extra spaces don't affect it
```
Golden rules for parsing text files
These rules should usually be followed, but there are exceptions. For example, sometimes, the case of a letter matters, and sometimes, keeping punctuation can help. But usually, textual analysis proceeds as follows:
- Use html tags to change/remove unneeded sections. If there are tables you don't want to parse or useless header or footer information, toss them out. Sometimes, you can use the html tags to extract just the part of the files you want. If so, do it! If not, proceed:
- Remove html tags, and turn the document into a pure text string.
- Lower case everything.
- Delete punctuation.
- Delete all excess whitespace.
- Now you can search/parse the text.
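Steps 3-5 can all be done with `str.lower` and `re.sub`. A sketch on a made-up string (step 2, stripping html, is what BeautifulSoup's `get_text()` is for):

```python
import re

# a made-up messy string standing in for text extracted from a document
text = "  Heavy RAIN is expected...  Tomorrow, maybe!  "
text = text.lower()                       # step 3: lowercase everything
text = re.sub(r"[^a-z0-9 ]", " ", text)   # step 4: delete punctuation
text = re.sub(r"\s+", " ", text).strip()  # step 5: delete excess whitespace
print(text)  # 'heavy rain is expected tomorrow maybe'
```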
```python
from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459019003165/0001564590-19-003165.txt'
edgar_resp = requests.get(url)

# (in the assignment: save the file here,
# then in the next part of the assignment, you'll load it again)
# here, I'm skipping those steps so we can grab one document to look at
loaded_file = edgar_resp.content

from NEAR_regex import NEAR_regex
help(NEAR_regex)

# try to use NEAR_regex... look for it working and failing...
```