4.4.4. Intro to NLP - The Anchor Phrase Technique¶
Congrats on surviving the last few pages on regex. I’m guessing you’re either asleep, ready to give up on regex, or struggling to keep your eyes from glazing over.
That’s natural! No one really likes or understands regex. I mean, the memes are good… and vicious. But the prior two pages give you background that will help you understand how to use a key function we are going to rely on: NEAR_regex().
Here is what’s going on:
We can collect a TON of webpages (documents). These pages have text and information we’d like to use.
But hiring employees to manually extract that info is expensive at scale. Python code and computer runtime is cheap.
NLP - natural language processing - is a field concerned with turning text into usable datasets. The field is advanced and has many powerful and cool methods.
“Anchor phrase” is a simple yet powerful technique. The basic idea is to look for a word (or words) near another word (or words) to see if (and how much) a document is discussing some topic. Are you discussing risk in the context of supply chain issues?
I wrote a function - NEAR_regex() - to do anchor phrase searches. It leverages the power of regex while moooostly hiding the work of writing a dang regex from us.
Again, this function means you won’t need to write a regex pattern!
The community codebook (a valuable resource!) has a file in it called “near_regex.py”. To get that file into the folder where you need the function (e.g., your class notes folder, or the assignment folder), you have two options:
1. If you already cloned the textbook’s repo (good job, you!), then the file is already on your computer. Simply find it inside ledatascifi-2024/community_codebook/, copy it, and paste it into the folder where it is needed.
2. Clicking on the filename in the above link, then the “Raw” button, will get you to the raw file. If you right-click on that page and then select “Save Page As” (or similar, depending on your browser), you can save it into the folder where it is needed.
After you copy the function to the same location as your code, load it into your code by adding this to the top of your code:
from near_regex import NEAR_regex # yes, the case differs between the file name and the function name
Tip
At the bottom of the near_regex.py file, I included examples. You don’t need to keep them, and if you do, they will cause your code to print extra stuff, which will be confusing to someone reading it. So when you copy near_regex.py into a folder to use it, delete all the examples and everything else below the “return” line of the function.
4.4.4.1. Demo¶
Let me start by showing you some examples. After these examples, load the function and read the help documentation in it, and then we can do some practice.
Each example looks inside a document (here, the “document” is just a short sentence) for some words.
To use this function:
1. Load your document and clean the string.
2. Create a list of strings you want to look for. The function will create a (complex) regex pattern that will detect if all of the strings are near each other in the document.
3. Give the list of strings (from step 2) to the function to create the regex pattern.
4. Use the pattern (from step 3) to
   - Count how many times the pattern hits
   - Print out the text matches to check the text
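Here is a minimal sketch of those four steps on a toy “document” (the sentence and the search words below are made up for illustration):

import re
from near_regex import NEAR_regex

# step 1: a toy "document", already lowercased with punctuation stripped
doc = 'the firm faces risk from supply chain disruptions in china'

# step 2: the strings we want to find near each other
words = ['risk','supply']

# step 3: build the (complex) regex pattern
rgx = NEAR_regex(words)

# step 4a: count the hits ("risk" is one word away from "supply", so this prints 1)
print(len(re.findall(rgx,doc)))

# step 4b: print the matching text to eyeball what was caught
print([m.group(0) for m in re.finditer(rgx,doc)])

For reference, here is the function itself, as it lives in near_regex.py: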
def NEAR_regex(list_of_words,max_words_between=5,partial=False,cases_matter=False):
'''
Parameters
----------
list_of_words : list
A list of "words", each element is a string
This program will return a regex that will look for times where word1
is near word2, or word2 is near word1.
It works with multiple words: you can see if word1 is near word2 or
word3.
max_words_between : int, optional
How many "words" are allowed between words in list_of_words. The default
is 5, but you should consider this carefully.
"words" in between are chunks of characters. "DON don don- don12 2454"
is 5 words.
This will not allow matches if the words are separated by a newline
("\n") character.
partial : Boolean, optional
If True, will accept longer words than you give. For example, if one
word in your list is "how", it will match to "howdy". Be careful in
choosing this based on your problem. Partial makes more sense with
longer words.
The default is False.
cases_matter : Boolean, optional but IMPORTANT
If True, will return a regex string that will only catch cases where
words in the string have the same case as given as input to this
function. For example, if one word here is "Hi", then the regex
produced by this function will not catch "hi".
If False, will return a regex string that will only work if all letters
in the search string are lowercase.
The default is False.
Warning
-------
See the last example. The regex is greedy.
Feature Requests (participation credit available)
-------
1. A wrapper that takes the above inputs, the string to search, and a variable=count|text, and
returns either the number of hits or the text of the matching elements. (See the examples below.)
2. Optionally clean the string before the regex stuff happens.
3. Optionally ignore line breaks.
4. Optionally make it lazy (in the last example,
the "correct" answer is probably 2, but it gives 1.)
Unsure about speed
-------
I don't think this is a very "fast" function, but it should be robust.
Suggested use
-------
# clean your starting string
a_string_you_have = 'jack and jill went up the hill'
# 1. define words and set up the regex
words = ['jack','hill']
rgx = NEAR_regex(words)
# 2a. count the number of times the word groups are in the text near each other
count = len(re.findall(rgx,a_string_you_have))
print(count)
# 2b. print the actual text matches <-- great for checking!
text_of_matches = [m.group(0) for m in re.finditer(rgx,a_string_you_have)]
print(text_of_matches)
Returns
-------
A string which is a regex that can be used to look for cases where all the
input words are near each other.
'''
    from itertools import permutations

    start = r'(?:\b'  # the r means "raw": the backslash is just a backslash, not an escape character

    if partial:
        # allow extra letters on the end of each search word (e.g., "how" matches "howdy")
        gap = r'[A-Za-z]*\b(?: +[^ \n\r]*){0,' + str(max_words_between) + r'} *\b'
        end = r'[A-Za-z]*\b)'
    else:
        gap = r'\b(?: +[^ \n]*){0,' + str(max_words_between) + r'} *\b'
        end = r'\b)'

    regex_list = []
    for permu in list(permutations(list_of_words)):
        # catch this permutation: start + word + gap (+ word + gap)... + end
        if cases_matter:  # case sensitive - the cases the user gives are given back
            regex_list.append(start + gap.join(permu) + end)
        else:  # the resulting search will only work if all words are lowercase
            lowerpermu = [w.lower() for w in permu]
            regex_list.append(start + gap.join(lowerpermu) + end)

    return '|'.join(regex_list)
Is the word “part” near the word “with” in this string? No:
from near_regex import NEAR_regex
import re
test = 'This is a partial string another break with words'
words = ['part','with']
rgx = NEAR_regex(words)
print(len(re.findall(rgx,test))) # no match (partials not allowed by default)
0
But we can use partial=True in the function, which will match any words starting with “part” (partial, partly, etc.) and “with” (without, withhold, etc.). Be careful using partial=True, as it can lead to too many matches, including ones you don’t want.
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test))) # match (partials allowed)
1
You can change how many words are allowed between the search terms. Shorter distances mean the terms are more likely to be related grammatically. For example, max_words_between=3 might be used if you want to find an adjective directly modifying a noun (with the allowance that it might sit in a list of adjectives). Higher values are more useful if you want to see if two ideas (nouns) are in the same sentence or paragraph.
In our example, “partial” and “with” are three words apart, so with max_words_between=1 this correctly returns zero:
rgx = NEAR_regex(words,partial=True,max_words_between=1)
print(len(re.findall(rgx,test))) # no match (too far apart)
0
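To make the “related grammatically” point concrete, here is a sketch with a made-up (already cleaned) sentence, where the adjective reaches its noun through a short list of other adjectives:

test2 = 'the sudden severe and lasting supply disruptions hurt margins'
rgx2 = NEAR_regex(['severe','disruptions'],max_words_between=3)
print(len(re.findall(rgx2,test2))) # 1 - "severe" reaches "disruptions" through three words

Exactly three words sit between the two terms, so max_words_between=3 catches it, and any smaller value would not.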
Two caveats:
This is a “dumb” function in the sense that it doesn’t do anything to actually detect whether two words are in the same sentence or paragraph. There are ways to do that, but we’re going to keep it simple and just stick to this function as-is. A useful rule of thumb: a sentence averages about 20 words and a paragraph about 200 words, so you can set max_words_between to roughly approximate “same sentence” or “same paragraph” searches.
The next two examples show what happens with line breaks: if your document has a newline symbol (“\n”) or a return symbol (“\r”), this function won’t search across it. If a document uses those symbols to split paragraphs, great: our function will do within-paragraph searching, as long as our cleaning process doesn’t delete those symbols. However, some documents have those symbols at the end of every single line, and we need to delete them while cleaning; in that case, we can’t assume the function only looks within paragraphs.
test = 'This is a partial string \n another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test))) # fails because of the \n break
0
test = 'This is a partial string \r another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test))) # fails with \r too.
0
test = 'This is a partial string      another break with words'
words = ['part','with']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test))) # extra spaces are treated as one space, no impact
1
The cases_matter parameter is pretty simple. If True, it only reports a match when the string’s case exactly matches how you typed your search terms.
words = ['part','With'] # changed to Capitalized "With"
rgx = NEAR_regex(words,partial=True,cases_matter=True)
print(len(re.findall(rgx,test))) # no match
0
You can look for three terms that all have to be near each other:
words = ['part','with','this']
rgx = NEAR_regex(words,partial=True)
print(len(re.findall(rgx,test))) # no match - good! "This" != "this"
print(len(re.findall(rgx,test.lower()))) # match - good!
0
1
4.4.4.2. Tips¶
Is topic1 near topic2 (instead of word1 near word2)?
The function doesn’t just look for words that are near each other; it looks to see if regex patterns are near each other. Above, I said that NEAR_regex builds “a (complex) regex pattern that will detect if all of the strings are near each other in the document.” What it actually does is put the strings into a regex. In our examples so far, each string has been a single word, but your list of search terms can include “mini-regexes”. The most common use: you want to look for a topic (a word) but make sure it is being used in a specific way.
Suppose we want to know if a text message chain involves a greeting specifically for James, but not John. Annoyingly, James is sometimes referred to by a nickname!
Well, we’d want to build a search that looks for:
A greeting term: “hey” OR “hi” OR “sup”.
James’ name and nicknames: “james” OR “jimmy”
So we can exploit how regex works: when we specify the words = [element1, element2] list,
element1 is the set of greeting terms, and
element2 is the set of acceptable names.
To build a regex that looks for “hey” OR “hi” OR “sup”, we need to implement three things:
1. In regex, OR is “|”.
2. No spaces between terms.
3. Important: Put the parentheses around the whole set of terms!
So, “hey” OR “hi” OR “sup” becomes '(hey|hi|sup)'.
Constructing a search like this allows you to look for a topic (here, the greeting term) being used in a specific way. Examples:
The cost (term 1) of building factories (term 2)… but not general costs or costs for other items
Discussions of risks (term 1) from particular sources like floods (term 2)… but not general risks like competition
texts = 'hey jimmy hi james sup john'
words = ['(hey|hi|sup)','(jimmy|james)'] # search for a greeting
rgx = NEAR_regex(words,max_words_between=1)
print(len(re.findall(rgx,texts))) # both are caught
2
Printing out your matches
This is very useful to double-check what your search is finding!
[m.group(0) for m in re.finditer(rgx,texts)]
['hey jimmy', 'hi james']
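For instance, here is a sketch of the second example above (risks from floods); the term lists are just guesses that you would refine for your actual documents:

text = 'we face material risk from flood damage at our coastal plants'
words = ['(risk|risks)','(flood|floods|hurricane|storm)']
rgx = NEAR_regex(words)
print(len(re.findall(rgx,text))) # 1
print([m.group(0) for m in re.finditer(rgx,text)]) # ['risk from flood']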
4.4.4.3. Caveats¶
Two caveats
Caveat One: When a regex finds a match, the characters inside that match can’t be used again, and thus won’t be counted toward another match. Above, you could argue that, mechanically, there are FOUR matches:
hey jimmy
jimmy hi
hi james
james sup
But the “hey jimmy” match “uses up” “jimmy”, so “jimmy hi” can’t be found. Same with matches #3 and #4.
Caveat Two: The function is “greedy”. Look at the next example, where I allow up to two words between terms.
Starting from “hey”, the regex finds “jimmy” right away, but it doesn’t stop there. (Stopping at the first possible match would be a “lazy” regex.) It keeps extending the match as far as the word allowance permits, taking the largest mechanical match possible: when it finds “james”, it deems “hey jimmy hi james” the match. Because of that, “hi james” is already used up and can’t be a match. So it reports 1 match, not 2.
rgx = NEAR_regex(words,max_words_between=2)
print(len(re.findall(rgx,texts))) # the regex is greedy - it misses inner matches!
[m.group(0) for m in re.finditer(rgx,texts)]
1
['hey jimmy hi james']
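If the greedy behavior under-counts for your purposes, one standard regex trick is to wrap the whole pattern in a lookahead, which lets matches overlap. This is not built into NEAR_regex, so treat it as a sketch to verify on your own text:

# a lookahead match consumes no characters, so nothing gets "used up"
overlap_rgx = '(?=(' + rgx + '))'
matches = [m.group(1) for m in re.finditer(overlap_rgx,texts)]
print(len(matches),matches)

Beware: this counts a match starting at every possible position, so it usually over-counts (here it should find four overlapping matches, one starting at each greeting or name). Always print the matches and eyeball them before trusting the counts.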
4.4.4.4. Practice¶
Let’s use Tesla. Copy this code into an empty notebook, and put the near_regex.py file in the same folder.
Let’s see how many times the document mentions China in the context of three issues:
political considerations (broadly, and specific risks)
rare earth elements
supply chains, import tariffs
I strongly suggest approaching this by using two terms: one for China, and one for the issue. You might need multiple words within each term.
Print out the text of the hits!
Play around with the partial and max_words_between parameters. Specifically, look for when changing them gives you BAD results: either way too many irrelevant matches are included, or your search returns too few results because a single hit encompasses several hits that should be counted separately.
Try to get the “most accurate” sense of
Does the document discuss each issue at all? Is the number of hits 0, or >0?
How much does the document discuss the issue? How many hits?
Does the number of hits reflect the scale of the issue? Why or why not?
If not, is there a way to transform the number of hits to reflect the scale? Some ideas:
Going from 1 to 2 hits is probably more informative than from 99 hits to 100
If you use ln(hits), ln(2)-ln(1) is more than ln(100)-ln(99)
Try to use ln(hits) and see if you spot a problem! What’s a fix you can use?
Binning: no discussion (hits=0), some discussion (hits<=#), lots of discussion (hits>#)
How many bins and how you set the cutoffs is ad hoc and depends on the documents and topics! (A sketch of these ideas follows.)
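Here is a minimal sketch of the last two ideas (and the fix for the ln problem); the cutoff of 5 below is a placeholder you would tune to your documents and topics:

import numpy as np

hits = 3 # stand-in for len(re.findall(rgx,cleaned_document))

# ln(0) is undefined, so the usual fix is ln(1+hits)
log_hits = np.log(1+hits)

# binning: the cutoffs are ad hoc - pick ones that fit your sample
if hits == 0:
    discussion = 'none'
elif hits <= 5:
    discussion = 'some'
else:
    discussion = 'lots'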
from bs4 import BeautifulSoup
import requests
import re
url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459019003165/0001564590-19-003165.txt'
r = requests.get(url, headers={'User-Agent': 'your_name your_email@example.com'}) # SEC's servers typically reject requests without an identifying User-Agent; use your own info
# putting the "Good ideas" from 4.4 to work to clean the document:
lower = BeautifulSoup(r.content,'html.parser').get_text().lower()
no_punc = re.sub(r'\W',' ',lower)
cleaned = re.sub(r'\s+',' ',no_punc).strip()
# copy the function into your working directory next to this file
from near_regex import NEAR_regex
# if this prints numbers, that means you haven't deleted the examples inside near_regex.py. delete em
help(NEAR_regex) # look at and read the documentation!
# try to use NEAR_regex... look for it working and failing...
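To get you started, here is one possible sketch for the supply chain and tariff issue. The term lists are guesses you should refine, and the parameters are worth experimenting with per the steps above:

words = ['china','(supply|tariff|tariffs|import|imports)']
rgx = NEAR_regex(words,max_words_between=20) # ~20 words is roughly "same sentence"
print(len(re.findall(rgx,cleaned))) # how much discussion?
print([m.group(0) for m in re.finditer(rgx,cleaned)][:5]) # eyeball the first few hits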