Tokenizing with tidytext
Tokenizing by n-grams
Tokenization is the process of breaking your text into pieces, or tokens. As Introduction to Information Retrieval defines it, a token is “an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.” In practice, tokens usually correspond to words.
Unigrams
An n-gram is a contiguous sequence of n items from a sample of text. An n-gram of size 1 is called a unigram, size 2 is a bigram, and size 3 is a trigram.
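To make the sizes concrete, here is a minimal sketch tokenizing a single made-up sentence at each n-gram size. The sentence is purely illustrative; `unnest_tokens()` comes from the tidytext package used throughout this tutorial.

```r
library(dplyr)
library(tidytext)

sentence = tibble(text = "the quick brown fox jumps")

# unigrams: the, quick, brown, fox, jumps
sentence %>% unnest_tokens(word, text, token = "ngrams", n = 1)

# bigrams: the quick, quick brown, brown fox, fox jumps
sentence %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

# trigrams: the quick brown, quick brown fox, brown fox jumps
sentence %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
```

Note that a sentence of k words yields k − n + 1 n-grams, which is why the bigram and trigram lists get shorter.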
Here’s an example of how to tokenize by unigrams. We’ll use our text from Metamorphosis.
# load the packages used throughout this tutorial
library(gutenbergr); library(tidytext); library(dplyr)
library(knitr); library(kableExtra)

# download Metamorphosis from Project Gutenberg
meta = gutenberg_download(5200)
unigrams = meta %>% unnest_tokens(word, text, token = "ngrams", n = 1)
unigrams %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | word |
|---|---|
| 5200 | copyright |
| 5200 | c |
| 5200 | 2002 |
| 5200 | david |
| 5200 | wyllie |
Bigrams
And here’s an example of tokenizing by bigrams. Notice that the only argument in unnest_tokens() that needs to change from the unigram example is n. (The first argument is the name of the output column, which can be whatever you want.)
bigrams = meta %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
# the lines below are purely for visualization purposes
bigrams %>%
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | bigram |
|---|---|
| 5200 | copyright c |
| 5200 | c 2002 |
| 5200 | 2002 david |
| 5200 | david wyllie |
| 5200 | wyllie metamorphosis |
Removing Stop Words
Stop words are words that matter more to the grammar of a sentence than to its meaning. They are helper words that string a sentence together without carrying much content themselves.
For humans, more words are needed but for NLP they add nothing. Let’s look at that last sentence as an example.
For humans, more words are needed but for NLP they add nothing.
Now without the stop words!
humans more words needed NLP add nothing.
As you can see, computers really only need caveman speak to get at the semantic meaning of a sentence.
How do we know if something is a stop word?
As you might guess, what exactly constitutes a stop word is somewhat subjective. When thinking about defining a stop word, we need to think about generalities and scale. For example, “no” can be a really important word semantically, and for certain tasks, like sentiment analysis, you may want to keep certain stop words, especially words like “no” or “not”. Handling these words is called negation handling. But we will get into that when we cover sentiment analysis.
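If you do want to keep negations for a later sentiment analysis, one simple approach is to filter them out of the stop word lexicon before using it. This is a sketch, not part of the original workflow; the `negations` vector is an illustrative choice of words to keep.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# hypothetical list of negation words to preserve
negations = c("no", "not", "never", "without")

my_stop_words = stop_words %>%
  filter(!word %in% negations)
```

You would then filter against `my_stop_words$word` instead of `stop_words$word` in the steps below.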
Let’s use our bigram example from before. We are going to base our methodology for removing stop words on the approach from tidytextmining. It’s pretty simple and somewhat scalable.
First, split the n-grams into their own columns. In this case, since we’re working with bigrams, the columns will be “word1” and “word2”:
# separate() comes from the tidyr package
library(tidyr)
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
Next, for each word column, filter out all the stop words:
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
Finally, concatenate the columns back together with unite():
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
And there you have it! A clean dataframe of bigrams without stop words the tidytext way.
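For unigrams there is no need to separate and unite at all; a single `anti_join()` (from dplyr) against the stop word lexicon does the job. A quick sketch, assuming the `unigrams` data frame from earlier:

```r
library(dplyr)
library(tidytext)

data(stop_words)

# drop every row whose word appears in the stop word lexicon
unigrams_filtered = unigrams %>%
  anti_join(stop_words, by = "word")
```

This is the standard tidytext pattern for single-word tokens; the separate/filter/unite dance above is only needed once each token spans multiple words.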
Counting n-grams
While it may seem simple, counting the frequency of n-grams (in our case, bigrams) tells us a lot about the content of a text. Let’s look at the bigram frequencies in Metamorphosis with and without stop words removed:
With Stop Words
bigrams %>%
count(bigram, sort = TRUE) %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| bigram | n |
|---|---|
| of the | 96 |
| he had | 92 |
| in the | 88 |
| it was | 87 |
| he was | 82 |
Without Stop Words
bigrams_united %>%
count(bigram, sort = TRUE) %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| bigram | n |
|---|---|
| chief clerk | 34 |
| gregor’s father | 24 |
| gregor’s mother | 19 |
| gregor’s sister | 15 |
| earn money | 4 |
So as we can see, before removing stop words we largely get unremarkable bigrams at the top of the list. In fact, I would guess that “of the” would rank near the top in nearly any book or paper.
Once we remove the stop words, we begin to generate content that is specific to our book. Interestingly, “earn money” is one of the top five bigrams in the book!
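If you want to see more than the top five, a quick bar chart is often easier to read than a table. Here is a sketch using ggplot2, assuming the `bigrams_united` data frame built above; the choice of ten bars is arbitrary.

```r
library(dplyr)
library(ggplot2)

bigrams_united %>%
  count(bigram, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  # reorder() sorts the bars by count so the most frequent bigram is on top
  ggplot(aes(x = n, y = reorder(bigram, n))) +
  geom_col() +
  labs(x = "count", y = NULL,
       title = "Top bigrams in Metamorphosis (stop words removed)")
```

Because the counts come straight from count(), the same snippet works unchanged for the unigram or trigram data frames.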
Example: Trigrams for Heart of Darkness
Overall, this tutorial is simply meant to acquaint you with the basic premises of preparing text data for analysis, specifically in the tidytext universe.
Now, try to repeat the above steps for “Heart of Darkness” by Joseph Conrad. But this time, let's use trigrams instead of bigrams. The Gutenberg id is 219.
Problem
We will be organizing the book “Heart of Darkness” into trigrams. You will need all the packages and skill sets described above to prepare “Heart of Darkness” for analysis.
Hints:
- tidytext is mostly scalable but will require some minor edits to handle trigrams.
- Make sure you remove stop words from all three word positions of each trigram.
Solution
Step 1: Download the Book
darkness = gutenberg_download(219)
Step 2: Tokenize the Dataset
darkness %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | trigram |
|---|---|
| 219 | heart of darkness |
| 219 | of darkness by |
| 219 | darkness by joseph |
| 219 | by joseph conrad |
| 219 | joseph conrad i |
Step 3: Remove the Stopwords
darkness %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word) %>%
unite(trigram, word1, word2, word3, sep = " ") %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | trigram |
|---|---|
| 219 | cruising yawl swung |
| 219 | canvas sharply peaked |
| 219 | mournful gloom brooding |
| 219 | gloom brooding motionless |
| 219 | bones marlow sat |
Conclusion
As you can see, handling text data with tidytext is very simple. The big advantage of using this package is that it works with other tidyverse functions, and this framework can even be used for some analytic tasks such as sentiment analysis and topic modeling.