Tokenizing by n-grams

Tokenization is the process of breaking your text into pieces, or tokens. As defined in Introduction to Information Retrieval (Manning, Raghavan, and Schütze), a token is “an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.” In practice, tokens are most often words.

Unigrams

An n-gram is a contiguous sequence of n items from a sample of text. An n-gram of size 1 is called a unigram, size 2 is a bigram, and size 3 is a trigram.
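To make the definition concrete, here is a minimal base R sketch of how contiguous n-grams can be built from a vector of words. The helper make_ngrams() and the sample sentence are illustrative assumptions, not part of tidytext; on real data, unnest_tokens() does this work for you.

```r
# make_ngrams() is a hypothetical helper written for this illustration.
# It slides a window of size n over the words and pastes each window together.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# A made-up sample sentence, lowercased and split on spaces
words <- strsplit(tolower("One morning Gregor Samsa woke"), " ", fixed = TRUE)[[1]]

unigrams_demo <- make_ngrams(words, 1)
bigrams_demo  <- make_ngrams(words, 2)
trigrams_demo <- make_ngrams(words, 3)

print(bigrams_demo)
# "one morning" "morning gregor" "gregor samsa" "samsa woke"
```

Note that a text of k words yields k - n + 1 n-grams, because neighboring n-grams overlap.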

Here’s an example of how to tokenize by unigrams. We’ll use our text from Metamorphosis.

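# load the packages used throughout this tutorial
library(dplyr)       # %>%, slice(), filter(), count()
library(tidyr)       # separate(), unite()
library(tidytext)    # unnest_tokens(), stop_words
library(gutenbergr)  # gutenberg_download()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# download Metamorphosis (Project Gutenberg id 5200)
meta = gutenberg_download("5200")

# tokenize into unigrams: one row per word
unigrams = meta %>% unnest_tokens(word, text, token = "ngrams", n = 1)

unigrams %>%
  # below is purely for visualization purposes
  slice(1:5) %>%
  kable() %>%
  kable_styling("striped")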
gutenberg_id word
5200 copyright
5200 c
5200 2002
5200 david
5200 wyllie

Bigrams

And here’s an example of tokenizing by bigrams. Notice that the only argument to unnest_tokens() that has to change from our unigram example is n. (The first argument, here bigram, is the name of the output column, which can be whatever you want.)

bigrams = meta %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

# below is purely for visualization purposes
bigrams %>%
  slice(1:5) %>%
  kable() %>%
  kable_styling("striped")
gutenberg_id bigram
5200 copyright c
5200 c 2002
5200 2002 david
5200 david wyllie
5200 wyllie metamorphosis
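If you want to see the effect of n without downloading a whole book, the following sketch runs unnest_tokens() on a single made-up sentence. The sentence and the demo_* names are assumptions for illustration only. It also shows that, by default, unnest_tokens() lowercases the text and strips punctuation.

```r
library(dplyr)
library(tidytext)

# A one-row data frame with a made-up sentence (8 words)
demo_text <- tibble(text = "One morning, Gregor Samsa woke from troubled dreams.")

# Only n changes between the three calls (plus the output column name)
demo_unigrams <- demo_text %>% unnest_tokens(word, text, token = "ngrams", n = 1)
demo_bigrams  <- demo_text %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
demo_trigrams <- demo_text %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)

print(demo_bigrams)
```

With 8 words in the sentence, this should give 8 unigrams, 7 bigrams, and 6 trigrams.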

Removing Stop Words

Stop words are words that are typically more important to the grammar of a sentence than to its meaning. They are helper words that string meaning together without providing much meaning themselves.

For humans, more words are needed but for NLP they add nothing. Let’s look at that last sentence as an example.

For humans, more words are needed but for NLP they add nothing.

Now without the stop words!

humans more words needed NLP add nothing.

As you can see, a computer really only needs caveman speak to get at the semantic meaning of a sentence.
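The example above can be reproduced with a few lines of base R. This is only a sketch: toy_stop_words is a tiny hand-picked list chosen to match the example, not a real stop word lexicon. A real list, such as the stop_words data frame in tidytext, is much larger and may drop additional words.

```r
sentence <- "For humans, more words are needed but for NLP they add nothing."

# Lowercase, strip punctuation, and split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(sentence)), "\\s+")[[1]]

# A tiny, hand-picked stop list (an assumption for this illustration)
toy_stop_words <- c("for", "are", "but", "they")

# Keep only the tokens that are not in the stop list
kept <- tokens[!tokens %in% toy_stop_words]

print(paste(kept, collapse = " "))
# "humans more words needed nlp add nothing"
```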

How do we know if something is a stop word?

As you might guess, what exactly constitutes a stop word is somewhat subjective. When thinking about defining a stop word, we need to think about generalities and scale. For example, “no” can be a really important word semantically, and for certain tasks, like sentiment analysis, you may want to keep certain stop words, especially words like “no” or “not”. Handling these words is called negation handling. But we will get into that when we cover sentiment analysis.
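One simple way to keep negations is to remove them from the stop word list before filtering. The sketch below assumes the stop_words data frame that ships with tidytext; the negations vector and the name sentiment_stop_words are illustrative choices, and which words you protect depends on your task.

```r
library(dplyr)
library(tidytext)

# Words we want to keep even though they appear in the stop word list
# (an illustrative choice, adjust for your own task)
negations <- c("no", "not", "never")

# Drop the negations from the stop word list itself
sentiment_stop_words <- stop_words %>%
  filter(!word %in% negations)

# "not" is no longer treated as a stop word, but "the" still is
"not" %in% sentiment_stop_words$word
"the" %in% sentiment_stop_words$word
```

You would then filter against sentiment_stop_words$word instead of stop_words$word.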

Let’s use our bigram example from before. We will base our method for removing stop words on the approach from tidytextmining (the Text Mining with R book). It’s pretty simple and somewhat scalable.

First, split each n-gram into one column per word. Since these are bigrams, the columns will be “word1” and “word2”.

bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

Next, for each “word” column, filter out all the stop words.

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

Finally, concatenate the columns back together with unite().

bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ") 

And there you have it! A clean data frame of bigrams without stop words, the tidytext way.
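The three steps can also be chained into one pipeline. The sketch below runs on a small made-up data frame with a hand-picked stop list, both of which are assumptions for illustration. It also drops NA bigrams first: depending on your tidytext version, lines with fewer than two words can produce NA bigrams, which would otherwise survive the filter and come out of unite() as “NA NA”.

```r
library(dplyr)
library(tidyr)

# Made-up bigrams, including one NA, standing in for real tokenizer output
toy_bigrams <- tibble(
  bigram = c("of the", "chief clerk", "the father", "earn money", NA)
)

# Hand-picked stop list for illustration; use stop_words$word on real data
toy_stop <- c("of", "the")

toy_clean <- toy_bigrams %>%
  filter(!is.na(bigram)) %>%                               # drop empty bigrams
  separate(bigram, c("word1", "word2"), sep = " ") %>%     # split
  filter(!word1 %in% toy_stop, !word2 %in% toy_stop) %>%   # filter
  unite(bigram, word1, word2, sep = " ")                   # re-join

print(toy_clean$bigram)
# "chief clerk" "earn money"
```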

Counting n-grams

While it may seem simple, counting the frequency of n-grams (in our case, bigrams) can tell us a lot about the content of a text. Let’s look at the bigram frequencies of Metamorphosis with and without removing stop words:

With Stop Words

bigrams %>%
  count(bigram, sort = TRUE) %>%
  # below is purely for visualization purposes
  slice(1:5) %>%
  kable() %>%
  kable_styling("striped")
bigram n
of the 96
he had 92
in the 88
it was 87
he was 82

Without Stop Words

bigrams_united %>%
  count(bigram, sort = TRUE) %>%
  # below is purely for visualization purposes
  slice(1:5) %>%
  kable() %>%
  kable_styling("striped")
bigram n
chief clerk 34
gregor’s father 24
gregor’s mother 19
gregor’s sister 15
earn money 4

So as we can see, before we remove the stop words, we largely get unremarkable bigrams at the top of the list. In fact, I would guess that you would see “of the” as one of the top bigrams in nearly any book or paper.

Once we remove the stop words, the top bigrams become specific to our book. Interestingly, “earn money” is one of the top five bigrams in the book!
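To see exactly what count() is doing, here is a small sketch on made-up data. The toy bigrams and the share column are assumptions added for illustration; share simply expresses each count as a fraction of all bigrams.

```r
library(dplyr)

# Six made-up bigram rows, as if they came out of the cleaning steps
toy_rows <- tibble(
  bigram = c("chief clerk", "earn money", "chief clerk",
             "gregor's father", "chief clerk", "earn money")
)

toy_counts <- toy_rows %>%
  count(bigram, sort = TRUE) %>%   # one row per distinct bigram, largest n first
  mutate(share = n / sum(n))       # relative frequency of each bigram

print(toy_counts)
```

Here “chief clerk” appears 3 times out of 6 rows, so it comes first with a share of 0.5.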

Example: Trigrams for Heart of Darkness

Overall, this tutorial is simply meant to acquaint you with the basic premises of preparing text data for analysis, specifically in the tidytext universe.

Now, try to repeat the above steps for “Heart of Darkness” by Joseph Conrad. But this time, let's use trigrams instead of bigrams. The Gutenberg id is 219.

Problem

We will be organizing the book “Heart of Darkness” into trigrams. You will need all the packages and skill sets described above to prepare “Heart of Darkness” for analysis.

Hints

- The tidytext approach mostly scales, but it needs some minor edits to handle trigrams.
- Make sure you remove stop words from all three words of each trigram.

Solution

Step 1: Download the Book

darkness = gutenberg_download("219")

Step 2: Tokenize the Dataset

darkness %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  # below is purely for visualization purposes
  slice(1:5) %>%
  kable() %>%
  kable_styling("striped")
gutenberg_id trigram
219 heart of darkness
219 of darkness by
219 darkness by joseph
219 by joseph conrad
219 joseph conrad i

Step 3: Remove the Stop Words

darkness %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  # split each trigram into one column per word
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  # drop the trigram if any of its three words is a stop word
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ") %>%
  # below is purely for visualization purposes
  slice(1:5) %>%
  kable() %>%
  kable_styling("striped")
gutenberg_id trigram
219 cruising yawl swung
219 canvas sharply peaked
219 mournful gloom brooding
219 gloom brooding motionless
219 bones marlow sat
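Because the bigram and trigram versions differ only in the number of word columns, the steps can be wrapped in one function that works for any n. This is a sketch, not part of tidytext: drop_stop_ngrams() is a hypothetical helper, it assumes the n-gram column is named ngram, and it relies on if_all() from a reasonably recent dplyr. The toy trigrams and stop list are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: remove every n-gram that contains a stop word.
# Assumes df has a character column called `ngram`.
drop_stop_ngrams <- function(df, stop_list, n) {
  cols <- paste0("word", seq_len(n))   # "word1", "word2", ...
  df %>%
    filter(!is.na(ngram)) %>%
    separate(ngram, into = cols, sep = " ") %>%
    filter(if_all(all_of(cols), ~ !.x %in% stop_list)) %>%
    unite("ngram", all_of(cols), sep = " ")
}

toy_trigrams <- tibble(
  ngram = c("heart of darkness", "cruising yawl swung",
            "by joseph conrad", "mournful gloom brooding")
)

toy_result <- drop_stop_ngrams(toy_trigrams, c("of", "by"), n = 3)
print(toy_result$ngram)
# "cruising yawl swung" "mournful gloom brooding"
```

On real data you would pass stop_words$word as stop_list and name the unnest_tokens() output column ngram.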

Conclusion

As you can see, handling text data with tidytext is very simple. The big advantage of using this package is that it works with other tidyverse functions, and this framework can even be used for some analytic tasks such as sentiment analysis and topic modeling.