Parsing and Searching Text

Outline

  1. Intro to "Regular Expressions"
  2. Golden rules for parsing text files

But first...

Let's check in on the local sports team.

Intro to "Regular Expressions" (aka "regex" aka "regx" aka "re")

Why are we learning regex

Q: Imagine you have a webpage or document which includes (buried in the text) a bunch of numbers. How can you collect all the phone numbers?

A: Look for all the instances of this pattern: (###) ###-####.

Your eyeballs can easily do that, but once the job involves enough numbers, it makes sense to let your computer do it for you.
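As a quick preview, here is a minimal sketch of that phone-number search in Python. The sample text and variable names are made up for illustration, and the pattern only covers the exact (###) ###-#### format.

```python
import re

# Made-up sample text for illustration
text = "Call (555) 123-4567 before 5pm, or try (555) 987-6543. Room 1234 is closed."

# \( and \) match literal parentheses; \d{3} means exactly three digits
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"

phone_numbers = re.findall(phone_pattern, text)
print(phone_numbers)  # ['(555) 123-4567', '(555) 987-6543']
```

Note that "1234" is skipped because it does not fit the full pattern.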

Regex is how you tell a computer to search for any pattern within a string.

  • Phone numbers
  • Emails (regex is why people don't spell out their emails "correctly" online)
  • Addresses
  • Certain words/topics (like assignment 5!)

Learning by playing

Let's go to https://regexone.com/ . Watch me first, then you can take off.

Regex in Python

Regex is a skill that works in all programming languages, so this lesson is portable - you can use regex in R or whatever your language of choice is.

But obviously, we're going to use Python. Run import re to load the regex package.

Common functions:

The full list of functions is here.

  • re.search(pattern, string, flags=0) looks for the first instance of the regex pattern within the string and returns a "match object" if one is found. Returns None if no match.
    >>> re.search("c", "abcdef")
    <re.Match object; span=(2, 3), match='c'>
    
  • re.findall(pattern,string) returns a list of matching strings, and is how you can count the number of matches
    >>> text = "He was carefully disguised but captured quickly by police."
    >>> re.findall(r"\w+ly", text)
    ['carefully', 'quickly']
    >>> len(re.findall(r"\w+ly", text))
    2
    
  • re.finditer(pattern,string) is similar to findall but gives you an iterator of match objects, which is nice if you want to get more info about the matches than just the string
    # i want to find all of the adverbs AND THEIR POSITIONS in some text
    >>> text = "He was carefully disguised but captured quickly by police."
    >>> for m in re.finditer(r"\w+ly", text):
    ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
    07-16: carefully
    40-47: quickly
    
  • pattern_to_use = re.compile(pattern) will create a compiled pattern you can pass to search, finditer, and findall (or call its own .search, .finditer, and .findall methods).
    result = re.search(pattern, string)
    # is equivalent to:
    prog = re.compile(pattern)
    result = prog.search(string)
    
  • .group(#) if your search or match has parenthesized subgroups, you can access each parenthetical.
    # looks for two words with 1 space between
    >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")  
    >>> m.group(0)       # The entire match
    'Isaac Newton'
    >>> m.group(1)       # The first parenthesized subgroup.
    'Isaac'
    >>> m.group(2)       # The second parenthesized subgroup.
    'Newton'
    >>> m.group(1, 2)    # Multiple arguments give us a tuple.
    ('Isaac', 'Newton')
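
The functions above combine naturally. Here is a small sketch that compiles a pattern once, then uses finditer and group together to pull out every parenthesized piece of every match. The sample text and the names name_job and people are hypothetical.

```python
import re

# Hypothetical sample text
text = "Isaac Newton, physicist; Marie Curie, chemist"

# Compile once, then call the pattern object's own methods
name_job = re.compile(r"(\w+) (\w+), (\w+)")

# group(1, 2, 3) returns a tuple of the three parenthesized subgroups
people = [m.group(1, 2, 3) for m in name_job.finditer(text)]
print(people)
# [('Isaac', 'Newton', 'physicist'), ('Marie', 'Curie', 'chemist')]
```

When a pattern has more than one group, findall returns the same list of tuples, so finditer is mostly worth it when you also need positions via m.start() and m.end().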
    

A few pointers:

  • You mainly benefit from re.compile when you reuse the same pattern many times (for example, inside a loop over many documents). If you only use a few patterns a few times, don't bother.
  • re.match is similar to re.search, but only matches at the beginning of the string. I almost never use match.
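
A short sketch of the difference between search and match, reusing the "abcdef" string from above (the variable names are just for illustration):

```python
import re

text = "abcdef"

search_hit = re.search("c", text)   # scans the whole string, finds 'c' at span (2, 3)
match_miss = re.match("c", text)    # None: the string does not START with 'c'
match_hit = re.match("abc", text)   # matches, because 'abc' is at the very beginning

print(search_hit)
print(match_miss)
print(match_hit)
```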

Raw string notation

You'll often see people put an "r" in front of the regex pattern. For example: re.search(r"c", "abcdef").

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.
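
To see why this matters, here is a small sketch (the sample string is made up). Without the r, Python itself interprets \b as a backspace character before the regex engine ever sees it, so the pattern silently means something else.

```python
import re

plain_len = len("\b")    # 1: Python turned \b into a single backspace character
raw_len = len(r"\b")     # 2: a backslash followed by the letter b

text = "hey there guy"
without_raw = re.findall("\bhey\b", text)   # searches for backspace + "hey" + backspace
with_raw = re.findall(r"\bhey\b", text)     # searches for the whole word "hey"

print(plain_len, raw_len)  # 1 2
print(without_raw)         # []
print(with_raw)            # ['hey']
```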

# These lines are functionally identical
>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

# so are these:
>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>

Special characters to build your patterns

Most of this is taken directly from the official documentation.

Char    Matches
.       any character except a newline
^       start of the string. ^[a-z]+ matches the "hi" in "hi there" but not "there"
$       end of the string or just before the newline at the end of the string. foo matches both 'foo' and 'foobar', but foo$ matches only 'foo'
*       match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match "a", "ab", or "abbbbbbb"
+       match 1 or more repetitions of the preceding RE, as many repetitions as are possible. ab+ will match "ab", or "abbbbbbb" but not "a"
?       match 0 or 1 repetitions of the preceding RE. ab? will match "a" and "ab"
{m}     match m repetitions of the preceding RE. ab{3} will match "abbb" but not "abb"
{m,n}   match m to n repetitions of the preceding RE. ab{3,5} will match "abbb" and "abbbbb" but not "abb"

Note: Do you want the largest match or the smallest?

  • *, +, ? and {m,n} are GREEDY: they match as much text as possible. So if you search ab+ against "abbb" it will match the full string "abbb". But sometimes you want the smallest possible match.
  • If you add ? after any of those, it will perform the match in a minimal way: using ab+? on string "abbbbb" will just return "ab". Use ab*? instead and you'll get "a".
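
A minimal sketch of greedy versus minimal matching. The quoted-text example is an added illustration of where the difference usually bites in practice.

```python
import re

text = "abbbbb"

greedy_plus = re.search(r"ab+", text).group(0)    # 'abbbbb' (as much as possible)
lazy_plus = re.search(r"ab+?", text).group(0)     # 'ab'     (as little as possible)
lazy_star = re.search(r"ab*?", text).group(0)     # 'a'      (zero b's is allowed)

# Made-up example: grabbing the text inside quotes
quote_text = 'He said "yes" and then "no".'
greedy_quotes = re.findall(r'".*"', quote_text)   # ['"yes" and then "no"']
lazy_quotes = re.findall(r'".*?"', quote_text)    # ['"yes"', '"no"']

print(greedy_plus, lazy_plus, lazy_star)
print(greedy_quotes)
print(lazy_quotes)
```
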
Char    Matches
\       1. escapes special characters: \* will actually search for an asterisk. 2. or signals a "special sequence"
[]      Indicates a set of characters. In a set: [amk] will match 'a', 'm', or 'k'. Common ranges: [a-z], [A-Z], [0-9]. You can combine ranges: [A-Za-z0-9]. Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters (, +, *, or ).
(...)   Makes a group. POWERFUL and necessary in most uses of regex. If you actually want to match parentheses, use a backslash: \(

There are SO MANY more special characters. If you can imagine a "feature" in the pattern of a string, there is probably a special character. \b matches word boundaries, \d matches digits, \s matches whitespace, and more.
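
Here is a quick sketch of \d, \b, and \s in action. The sample sentence is invented, and re.sub (which replaces every match with new text) is used here even though it isn't in the function list above.

```python
import re

# Invented sample text
text = "Order 66 shipped on 2019-02-19   to cat lovers in Concat City"

digits = re.findall(r"\d+", text)                # every run of digits
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)   # digits arranged like a date
whole_cat = re.findall(r"\bcat\b", text)         # only the standalone word "cat"
any_cat = re.findall(r"cat", text)               # also catches the "cat" in "Concat"
collapsed = re.sub(r"\s+", " ", text)            # runs of whitespace become one space

print(digits)     # ['66', '2019', '02', '19']
print(dates)      # ['2019-02-19']
print(whole_cat)  # ['cat']
print(any_cat)    # ['cat', 'cat']
print(collapsed)
```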

Moral: Most "regex" in practice is just Googling for someone who has done a similar thing.

Developing your regex

  1. Think of the PATTERN you want to capture in general terms. "I want three letter words."
  2. Write pattern = "\w{3}" and then try it on a few practice strings. The goal is to BREAK your pattern, find out where it fails, and notice new parts of the pattern you missed.
pattern = "\w{3}"
re.findall(pattern,"hey there guy") # whoops, that "the" is just the start of "there", not a 3 letter word
['hey', 'the']
# tried but failed: 
#      "(\w{3}) "     <-- a space
#      "(\w{3})\b"    <-- a word boundary should work! why not?
pattern = r"(\w{3})\b" # trying that raw string notation thing 
re.findall(pattern,"hey there guy")  
# it made the `\b` work! but the pattern is still failing...
['hey', 'ere', 'guy']
pattern = r"\b(\w{3})\b"  # make sure the word has a boundary before it
re.findall(pattern,"hey there guy")  # got it!
['hey', 'guy']

Finding words near each other

You can find and download this function here.

I usually put it in the same folder as my code for an assignment, and then in the assignment write from NEAR_regex import NEAR_regex. Then, you can use it in an assignment without pasting this big block of code into it.

def NEAR_regex(list_of_words,max_words_between=5,partial=False,cases_matter=False):
    '''
    Parameters
    ----------
    list_of_words : list
        A list of "words", each element is a string
        
        This program will return a regex that will look for times where word1 
        is near word2, or word2 is near word 1.
        
        It works with multiple words: You can see if word1, word2, and 
        word3 all appear near each other, in any order. 
        
    max_words_between : int, optional
        How many "words" are allowed between words in list_of_words. The default
        is 5, but you should consider this carefully.
        
        "words" in between are chunks of characters. "DON don don- don12 2454" 
        is 5 words.
        
        This will not allow matches if the words are separated by a newline 
        ("\n") character.
        
    partial : Boolean, optional
        If true, will accept longer words than you give. For example, if one 
        word in your list is "how", it will match to "howdy". Be careful in 
        choosing this based on your problem. Partial makes more sense with 
        longer words. 
        The default is False.
        
    cases_matter: Boolean, optional but IMPORTANT
        If True, will return a regex string that will only catch cases where  
        words in the string have the same case as given as input to this 
        function. For example, if one word here is "Hi", then the regex 
        produced by this function will not catch "hi".
        
        If false, will return a regex string that will only work if all letters
        in search string are lowercase.
        
        The default is False.
     
        
    Warning / Feature
    -------
    This will NOT find matches where the words are separated by line breaks.
    
    I recommend purging line breaks from your strings, in most cases, unless 
    you are SURE the only breaks left are meaningful paragraph breaks. 
    
        
    Unsure about speed
    -------
    I don't think this is a very "fast" function, but it should be robust. 
  
    
    Suggested use
    -------
    a_string_you_have = 'Jack and Jill went up the hill'
    
    # 1. define words and set up the regex
    words = ['jack','hill']                         
    rgx = NEAR_regex(words)                       
    
    # 2. convert the string to lowercase before searching!
    a_string_you_have = a_string_you_have.lower()   
    
    # 3. len+findall+rgx = counts the number of times the word groups are close
    count = len(re.findall(rgx,a_string_you_have))              
    print(count)                                 

    
    Returns
    -------
    A string which is a regex that can be used to look for cases where all the 
    input words are near each other.

    '''
               
    from itertools import permutations
    
    start = r'(?:\b' # the r means "raw" as in the backslash is just a backslash, not an escape character
    
    if partial:
        gap   = r'[A-Za-z]*\b(?: +[^ \n\r]*){0,' +str(max_words_between)+r'} *\b'
        end   = r'[A-Za-z]*\b)'
    else:
        gap   = r'\b(?: +[^ \n]*){0,' +str(max_words_between)+r'} *\b'
        end   = r'\b)'
        
    regex_list = []
    
    for permu in list(permutations(list_of_words)):
        # catch this permutation: start + word + gap (+ word + gap)... + end
        if cases_matter: # case sensitive - what cases the user gives are given back
            regex_list.append(start+gap.join(permu)+end)
        else: # the resulting search will only work if all words are lowercase
            lowerpermu = [w.lower() for w in permu]
            regex_list.append(start+gap.join(lowerpermu)+end)
    
    return '|'.join(regex_list)
import re

test  = 'This is a partial string another break with words'
words = ['part','with']
rgx   = NEAR_regex(words)
print(len(re.findall(rgx,test)))            # no match (partials not allowed) - good!
0
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))            # match (partials allowed) - good!
1
rgx   = NEAR_regex(words,partial=True,max_words_between=1)
print(len(re.findall(rgx,test)))            # no match (too far apart) - good!
0
words = ['part','With']
rgx   = NEAR_regex(words,partial=True,cases_matter=True)
print(len(re.findall(rgx,test)))
0
words = ['part','with','this']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))           # no match - good! "This" != "this"
print(len(re.findall(rgx,test.lower())))    # match - good!
0
1
test  = 'This is a partial string \n another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))            # fails because of the \n break
0
test  = 'This is a partial string \r another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))            # fails with \r too.
0
test  = 'This is a partial string                      another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test)))            # extra spaces don't affect the match
1

Golden rules for parsing text files

These rules should usually be followed, but there are exceptions. For example, sometimes, the case of a letter matters, and sometimes, keeping punctuation can help. But usually, textual analysis proceeds as follows:

  1. Use html tags to change/remove unneeded sections. If there are tables you don't want to parse or useless header or footer information, toss them out. Sometimes, you can use the html tags to extract just the part of files you want. If so, do it! If not, proceed:
  2. Remove html tags, and turn the document into a pure text string.
  3. Lower case everything.
  4. Delete punctuation.
  5. Delete all excess whitespace.
  6. Now you can search/parse the text.
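
Steps 2 through 5 can be sketched in a few lines. This is a rough illustration under stated assumptions, not a definitive cleaner: the function name clean_text and the sample html are made up, and the tag removal is a crude regex stand-in (BeautifulSoup's get_text() is likely the more robust choice for real filings). Step 1 is skipped because it depends on the document.

```python
import re
import string

def clean_text(html):
    """Rough sketch of golden rules 2-5 (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", html)        # 2. strip tags (crude regex stand-in)
    text = text.lower()                         # 3. lower case everything
    # 4. replace each punctuation character with a space
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()    # 5. collapse excess whitespace
    return text

# Made-up sample document
sample = "<html><body><h1>Risk Factors</h1>\n<p>Demand may FALL,   quickly.</p></body></html>"
cleaned = clean_text(sample)
print(cleaned)  # risk factors demand may fall quickly
```

Punctuation is replaced with a space rather than deleted outright so that words joined only by punctuation (like "fall,quickly") don't get glued together.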

Practice

Let's use Tesla:

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459019003165/0001564590-19-003165.txt'
edgar_resp = requests.get(url)
# save the downloaded file
# then in the next part of the assignment, you'll load it again
# here - I'm skipping those steps so we can grab one document to look at

loaded_file = edgar_resp.content

from NEAR_regex import NEAR_regex 

help(NEAR_regex)

# try to use NEAR_regex... look for it working and failing...