Tokenizing with tidytext
Tokenizing by n-grams
Tokenization is the process of breaking your text into pieces, or tokens. As Introduction to Information Retrieval defines it, a token is “an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.” In practice, tokens usually correspond to words.
Unigrams
An n-gram is a contiguous sequence of n items from a sample of text. An n-gram of size 1 is called a unigram, size 2 is a bigram, and size 3 is a trigram.
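To make the sizes concrete, here is a minimal sketch tokenizing a single made-up sentence at each n-gram size. The sentence is purely illustrative; `unnest_tokens()` comes from the tidytext package used throughout this tutorial.

```r
library(dplyr)
library(tidytext)

sentence = tibble(text = "the quick brown fox jumps")

# unigrams: the, quick, brown, fox, jumps
sentence %>% unnest_tokens(word, text, token = "ngrams", n = 1)

# bigrams: the quick, quick brown, brown fox, fox jumps
sentence %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

# trigrams: the quick brown, quick brown fox, brown fox jumps
sentence %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
```

Note that a sentence of k words yields k − n + 1 n-grams, which is why the bigram and trigram lists get shorter.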
Here’s an example of how to tokenize by unigrams. We’ll use our text from Metamorphosis.
# load the packages used throughout this tutorial
library(gutenbergr); library(tidytext); library(dplyr)
library(knitr); library(kableExtra)

# download Metamorphosis from Project Gutenberg
meta = gutenberg_download(5200)
unigrams = meta %>% unnest_tokens(word, text, token = "ngrams", n = 1)
unigrams %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | word |
|---|---|
| 5200 | copyright |
| 5200 | c |
| 5200 | 2002 |
| 5200 | david |
| 5200 | wyllie |
Bigrams
And here’s an example of tokenizing by bigrams. Notice that the only argument in unnest_tokens() that needs to change from the unigram example is n. (The first argument is the name of the output column, which can be whatever you want.)
bigrams = meta %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
# the lines below are purely for visualization purposes
bigrams %>%
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | bigram |
|---|---|
| 5200 | copyright c |
| 5200 | c 2002 |
| 5200 | 2002 david |
| 5200 | david wyllie |
| 5200 | wyllie metamorphosis |
Removing Stop Words
Stop words are words that matter more to the grammar of a sentence than to its meaning. They are helper words that string a sentence together without carrying much content themselves.
For humans, more words are needed but for NLP they add nothing. Let’s look at that last sentence as an example.
For humans, more words are needed but for NLP they add nothing.
Now without the stop words!
humans more words needed NLP add nothing.
As you can see, computers really only need caveman speak to get at the semantic meaning of a sentence.
How do we know if something is a stop word?
As you might guess, what exactly constitutes a stop word is somewhat subjective. When thinking about defining a stop word, we need to think about generalities and scale. For example, “no” can be a really important word semantically, and for certain tasks, like sentiment analysis, you may want to keep certain stop words, especially words like “no” or “not”. Handling these words is called negation handling. But we will get into that when we cover sentiment analysis.
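If you do want to keep negations for a later sentiment analysis, one simple approach is to filter them out of the stop word lexicon before using it. This is a sketch, not part of the original workflow; the `negations` vector is an illustrative choice of words to keep.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# hypothetical list of negation words to preserve
negations = c("no", "not", "never", "without")

my_stop_words = stop_words %>%
  filter(!word %in% negations)
```

You would then filter against `my_stop_words$word` instead of `stop_words$word` in the steps below.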
Let’s use our bigram example from before. We are going to base our methodology for removing stop words on the approach from tidytextmining. It’s pretty simple and somewhat scalable.
First, split the n-grams into their own columns. In this case, since we’re working with bigrams, the columns will be “word1” and “word2”:
# separate() comes from the tidyr package
library(tidyr)
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
Next, for each word column, filter out all the stop words:
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
Finally, concatenate the columns back together with unite():
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
And there you have it! A clean dataframe of bigrams without stop words the tidytext way.
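For unigrams there is no need to separate and unite at all; a single `anti_join()` (from dplyr) against the stop word lexicon does the job. A quick sketch, assuming the `unigrams` data frame from earlier:

```r
library(dplyr)
library(tidytext)

data(stop_words)

# drop every row whose word appears in the stop word lexicon
unigrams_filtered = unigrams %>%
  anti_join(stop_words, by = "word")
```

This is the standard tidytext pattern for single-word tokens; the separate/filter/unite dance above is only needed once each token spans multiple words.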
Counting n-grams
While it may seem simple, counting the frequency of n-grams (in our case, bigrams) tells us a lot about the content of a text. Let’s look at the bigram frequencies in Metamorphosis with and without stop words removed:
With Stop Words
bigrams %>%
count(bigram, sort = TRUE) %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| bigram | n |
|---|---|
| of the | 96 |
| he had | 92 |
| in the | 88 |
| it was | 87 |
| he was | 82 |
Without Stop Words
bigrams_united %>%
count(bigram, sort = TRUE) %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| bigram | n |
|---|---|
| chief clerk | 34 |
| gregor’s father | 24 |
| gregor’s mother | 19 |
| gregor’s sister | 15 |
| earn money | 4 |
So as we can see, before removing stop words we largely get unremarkable bigrams at the top of the list. In fact, I would guess that “of the” would rank near the top in nearly any book or paper.
Once we remove the stop words, we begin to generate content that is specific to our book. Interestingly, “earn money” is one of the top five bigrams in the book!
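If you want to see more than the top five, a quick bar chart is often easier to read than a table. Here is a sketch using ggplot2, assuming the `bigrams_united` data frame built above; the choice of ten bars is arbitrary.

```r
library(dplyr)
library(ggplot2)

bigrams_united %>%
  count(bigram, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  # reorder() sorts the bars by count so the most frequent bigram is on top
  ggplot(aes(x = n, y = reorder(bigram, n))) +
  geom_col() +
  labs(x = "count", y = NULL,
       title = "Top bigrams in Metamorphosis (stop words removed)")
```

Because the counts come straight from count(), the same snippet works unchanged for the unigram or trigram data frames.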
Example: Trigrams for Heart of Darkness
Overall, this tutorial is simply meant to acquaint you with the basic premises of preparing text data for analysis, specifically in the tidytext universe.
Now, try to repeat the above steps for “Heart of Darkness” by Joseph Conrad. But this time, let's use trigrams instead of bigrams. The Gutenberg id is 219.
Problem
We will be organizing the book “Heart of Darkness” into trigrams. You will need all the packages and skill sets described above to prepare “Heart of Darkness” for analysis.
Hints:
- tidytext is mostly scalable but will require some minor edits to handle trigrams.
- Make sure you remove stop words from all three word positions of each trigram.
Solution
Step 1: Download the Book
darkness = gutenberg_download(219)
Step 2: Tokenize the Dataset
darkness %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | trigram |
|---|---|
| 219 | heart of darkness |
| 219 | of darkness by |
| 219 | darkness by joseph |
| 219 | by joseph conrad |
| 219 | joseph conrad i |
Step 3: Remove the Stopwords
darkness %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word) %>%
unite(trigram, word1, word2, word3, sep = " ") %>%
# the lines below are purely for visualization purposes
slice(1:5) %>%
kable() %>%
kable_styling("striped")
| gutenberg_id | trigram |
|---|---|
| 219 | cruising yawl swung |
| 219 | canvas sharply peaked |
| 219 | mournful gloom brooding |
| 219 | gloom brooding motionless |
| 219 | bones marlow sat |
Conclusion
As you can see, handling text data with tidytext is very simple. The big advantage of using this package is that it works with other tidyverse functions, and this framework can even be used for some analytic tasks such as sentiment analysis and topic modeling.