4.4. The Power of Textual Data
This section of the textbook is only a small appetizer of what can and should be done when working with textual data and Python
string objects. I mostly cover the essentials needed for this class, for this semester.
But it’s important to provide this overview and cover this topic! The web is awash with (mostly unstructured) text data. Twitter is one obvious site where scraping the text could yield insights, but as of 2020, there are 40 ZETTABYTES of data across the web. A zettabyte is a billion terabytes, or 10^21 bytes. By some estimates, that is about as many bytes as there are stars in the observable universe.
Something like 80% of that data is unstructured text. Figuring out how to put structure on that data is POWERFUL.
Google is the most famous example of the power of structuring data. Its approach was to rank documents (the method is described in a patent) and then return a ranked list of the documents that contain your search words.
One very common task is identifying topics in unstructured text: Does the document discuss topic X?
This section of class tries to show you how to implement a seemingly simple approach to do just that. What you’ll see is that the “basic” code we construct is the foundation for a very powerful framework! After this, you will have code that can be adapted quickly to other text sources and extended to implement more sophisticated methods!
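To make the question "Does the document discuss topic X?" concrete, here is a minimal sketch of one very simple way to answer it: check whether any of a handful of topic keywords appears in the text as a whole word. The function name mentions_topic, the example sentence, and the keyword lists are all hypothetical, chosen only for illustration, and the check only handles single-word keywords. It is not necessarily the approach this class ends up using.

```python
import re


def mentions_topic(text, keywords):
    """Return True if any single-word keyword appears as a whole word in text."""
    # Lower-case the text and split it into word tokens (letters, digits, underscore).
    words = set(re.findall(r"\w+", text.lower()))
    # Compare lower-cased keywords against the set of words found in the text.
    return any(kw.lower() in words for kw in keywords)


# Hypothetical example document and keyword lists.
doc = "The firm faces new Tariffs on imported steel."
print(mentions_topic(doc, ["tariff", "tariffs"]))
print(mentions_topic(doc, ["cyber"]))
```

Matching on whole words means "cyber" does not match "cybersecurity". Whether that is the behavior you want depends on the topic, which is part of why the keyword list deserves some thought.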
Some good ideas when parsing text files
These rules should usually be followed, but there are exceptions. For example, sometimes the case of a letter matters, and sometimes keeping punctuation can help. But usually, textual analysis proceeds as follows:
1. Use HTML tags to change or remove unneeded sections, or to select the section of text that you want. If there are tables you don’t want to parse, or useless header or footer information, toss them out.
2. With what remains after step 1, remove the HTML tags and turn the document into a pure text string.
    from bs4 import BeautifulSoup
    # naming the parser explicitly avoids a warning from BeautifulSoup
    text = BeautifulSoup(some_html, "html.parser").get_text()
3. Lower case everything. (Python string method: .lower())
4. Delete punctuation (I usually replace it with a space).
    import re
    # \W matches any character that is not a letter, digit, or underscore
    text = re.sub(r'\W', ' ', your_string_variable_goes_here)
5. Delete all excess whitespace.
6. Now you can search/parse the text.
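The lower-casing, punctuation, and whitespace steps above can be sketched as one small function. This is a minimal sketch, not the only way to do it. It assumes the HTML tags have already been removed, so the input is a plain text string. The function name clean_text and the example string are hypothetical.

```python
import re


def clean_text(text):
    """Lower-case text, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()               # lower case everything
    text = re.sub(r"\W", " ", text)   # replace each non-word character with a space
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace into one space
    return text.strip()               # drop leading and trailing spaces


# Hypothetical example string, standing in for the output of get_text().
raw = "Hello,   World!\nRisk-factors: 10% UP"
print(clean_text(raw))
```

Replacing punctuation with a space rather than deleting it keeps hyphenated or slash-separated words from being glued together ("risk-factors" becomes "risk factors", not "riskfactors"). The whitespace step then cleans up the extra spaces this creates.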