Exploring Text Data

Summary Statistics with Words
Word Clouds with wordcloud
TF-IDF

Summary Statistics with Words

The first step in any data science project should always be to get a sense of your data. For NLP, a lot of exploratory data analysis revolves around counting the frequencies of different terms and plotting them in different ways. This can range from something as simple as a bar chart of word counts, to word clouds, to something as complex as TF-IDF.

Before we go any further, we first need to set up our workspace. We will primarily be working with tidytext to keep ourselves in the tidyverse, and we will pull data with gutenbergr, an R package for downloading texts from Project Gutenberg, a repository of free classic texts. Let’s start by loading in our data, Metamorphosis by Kafka, unnesting it into tokens, and removing stop words.

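library(gutenbergr)
library(wordcloud)
library(ggplot2)
library(tidytext)
library(dplyr)
library(kableExtra)

# Download Metamorphosis (Project Gutenberg ID 5200)
meta = gutenberg_download("5200")

# One word per row, with common stop words removed
tokens = meta %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")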
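To see what these two steps do, here is a minimal sketch on two made-up sentences. The toy data frame and its text are assumptions for illustration, not lines from the book. unnest_tokens() lowercases the text, strips punctuation, and returns one word per row; anti_join() then drops every row whose word appears in the stop_words lexicon.

```r
library(dplyr)
library(tidytext)

# Two made-up sentences standing in for lines of the book
toy <- tibble(text = c("The door opened slowly.", "Gregor looked at the door."))

# One lowercase word per row, punctuation removed
toy_raw <- toy %>%
  unnest_tokens(word, text)

# Drop common stop words such as "the" and "at"
toy_tokens <- toy_raw %>%
  anti_join(stop_words, by = "word")

toy_raw$word     # every token, including stop words
toy_tokens$word  # only the tokens that survive the stop word filter
```

Comparing the two vectors is a quick way to check that the stop word list is not removing words you care about.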

Next we can create a simple bar plot looking at the top ten words (not including stop words of course) used in the book.

# Count each word and keep the ten most frequent
word_count = tokens %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10)

# Bar chart of the top ten words
ggplot(data = word_count) +
  geom_bar(aes(x = word, y = count), stat = "identity", fill = "#6699cc") +
  theme_classic()
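Because ggplot2 places a character x-axis in alphabetical order, the bars in a plot like this are not sorted by frequency. One possible tweak, sketched here on made-up counts (toy_count and its numbers are hypothetical, not results from the book), is to wrap the word in reorder() so the bars are sorted by count.

```r
library(ggplot2)

# Made-up counts for illustration only; in the tutorial this would be word_count
toy_count <- data.frame(
  word = c("door", "room", "father"),
  count = c(5, 9, 7)
)

# reorder() turns word into a factor whose levels are sorted by count
sorted_levels <- levels(reorder(toy_count$word, toy_count$count))

# Same style of bar chart, with bars in ascending order of count
p <- ggplot(data = toy_count) +
  geom_col(aes(x = reorder(word, count), y = count), fill = "#6699cc") +
  labs(x = "word") +
  theme_classic()

p
```

Using reorder(word, -count) instead would sort the bars from most to least frequent.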

Word Clouds with wordcloud

Another way to present summary statistics is with a word cloud. For the uninitiated, a word cloud is just a collection of words in which the size of each word is determined by its frequency. The natural disadvantage is that longer words take up more space than shorter words drawn at the same size, which makes them look more prevalent.

Creating the word cloud is very easy: just pass a character vector to wordcloud(). We can drop some of the less frequent words using either the min.freq argument or the max.words argument.

wordcloud(tokens$word, max.words = 75, colors=brewer.pal(6, "Dark2"))
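wordcloud() also accepts a vector of words together with their frequencies, which makes the counting step explicit. The sketch below uses a made-up vector of tokens (toy_words is hypothetical, standing in for tokens$word) and sets a seed, since the layout involves random placement.

```r
library(wordcloud)

# Made-up tokens standing in for tokens$word
toy_words <- c(rep("door", 5), rep("room", 3), rep("father", 2), "apple")

# Count each word once up front
freq_table <- table(toy_words)

set.seed(42)  # fix the random layout so the plot is reproducible
wordcloud(
  words = names(freq_table),
  freq = as.vector(freq_table),
  min.freq = 2,          # words seen fewer than 2 times are not plotted
  random.order = FALSE,  # plot words in decreasing order of frequency
  colors = brewer.pal(6, "Dark2")
)
```

Here min.freq filters by count, while max.words (used in the call on the book) caps the total number of words drawn.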

TF-IDF

So far, we have looked at simply counting words and displaying this information in interesting ways. Next we will try to contextualize this information by comparing how frequent a word is in one document with how widely it appears across a collection of documents.

TF-IDF, or Term Frequency-Inverse Document Frequency, is a way to numerically give importance to a word or phrase in a given text document relative to a collection of text documents.

In this example, we will examine how important a word is in a chapter of Metamorphosis in comparison to the entire novel. Our dataframe 'meta' contains the text from Metamorphosis; however, there is no column designating chapters in the novel. By looking through the 'meta' dataframe, we can see that there are chapter divides at rows 640 and 1296. Let’s start by adding a chapter column to our dataframe, dividing Metamorphosis into its three chapters.

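# Label each row with its chapter, using the chapter divides at rows 640 and 1296
meta$chapter = NA
meta$chapter[1:639] = 1
meta$chapter[640:1295] = 2
meta$chapter[1296:nrow(meta)] = 3

# Keep only the text and chapter columns
meta <- meta[,2:3]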
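The same labelling can be written without repeating the row ranges by hand. This is a sketch of one alternative, not the approach used above: it runs on a ten-row toy data frame with made-up break rows (toy and breaks are hypothetical) and uses findInterval() to count how many chapter starts each row has passed.

```r
library(dplyr)

# Ten placeholder lines standing in for the rows of meta
toy <- tibble(text = paste("line", 1:10))

# Rows where chapters 2 and 3 begin (made up; the tutorial uses 640 and 1296)
breaks <- c(4, 8)

# findInterval() returns 0 before the first break, 1 after it, 2 after the second
toy <- toy %>%
  mutate(chapter = findInterval(row_number(), breaks) + 1)

toy$chapter
```

With breaks set to c(640, 1296), the same expression assigns rows 1 to 639 to chapter 1, rows 640 to 1295 to chapter 2, and the rest to chapter 3.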

Term Frequency (tf)

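First, we need to create a dataframe that breaks down our text into one word per row using unnest_tokens(). Here, we can count how many times each word occurs in each chapter. These raw counts are the basis of our term frequency (tf); when a score is computed later, bind_tf_idf() divides each count by the total number of words in its chapter.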

book_words <- meta%>%
  unnest_tokens(word, text) %>%
  group_by(chapter, word)%>%
  summarise(n = n())%>%
  arrange(desc(n))

book_words%>%
  slice(1:2)%>%
  kable()%>%
  kable_styling("striped")

chapter	word	n
1	the	386
1	to	251
2	the	382
2	to	254
3	the	380
3	and	253
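Raw counts depend on how long each chapter is, so term frequency is often expressed as a proportion: a word's count divided by the total number of words in its document. A minimal sketch on made-up counts (toy_counts and its numbers are hypothetical, not taken from the book):

```r
library(dplyr)

# Made-up per-chapter word counts for illustration only
toy_counts <- tibble(
  chapter = c(1, 1, 1, 2, 2),
  word = c("the", "door", "gregor", "the", "milk"),
  n = c(6, 3, 1, 8, 2)
)

# tf = count of the word in the chapter / total words in that chapter
toy_tf <- toy_counts %>%
  group_by(chapter) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup()

toy_tf
```

Within each chapter the tf values sum to 1, which makes chapters of different lengths comparable.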

Term frequency alone can tell us which words or phrases occur the most in a given document or collection of documents. This is helpful to some extent; however, some of the words or phrases with the highest term frequency may not be that important, or rather they may not give us much insight into what the document or collection of documents is about. In this example, we can see that “the” is the most frequent term in each chapter, giving us no insight into the contents of these chapters.

To fully illustrate this, let’s see how common words are across the entire book compared to any given chapter.

meta%>%
  unnest_tokens(word, text) %>%
  group_by(word)%>%
  summarise(book_count = n())%>%
  arrange(desc(book_count))%>%
  right_join(book_words, by = "word")%>%
  select(word, chapter, chapter_count = n, book_count)%>%
  ungroup()%>%
  slice(1:10)%>%
  kable()%>%
  kable_styling("striped")

word	chapter	chapter_count	book_count
the	1	386	1148
the	2	382	1148
the	3	380	1148
to	2	254	753
and	3	253	642
to	1	251	753
to	3	248	753
he	1	230	577
and	2	206	642
his	2	206	550

Inverse Document Frequency (idf)

Where the term frequency shows how common a word is within a document, the inverse document frequency discounts words for being common across documents.

Mathematically, the IDF (inverse document frequency) of a word in a collection of documents can be understood as:

idf(word) = ln(total number of documents / number of documents containing word)

It may have been a minute since you took a math class, so let’s take a step back and think about what that natural log (ln) is doing there. The natural log of a number can be thought of as the amount of time it takes something growing exponentially to get from 1 to that number. So ln(1) will be 0, since it takes no time to get to where you currently are.

In other words, if a word appears in all three documents, the ratio inside the log is 3/3 = 1 and the word will have an IDF of 0. The more common the word across documents, the more it is discounted. This is in fact a good method to find context-specific stop words in a collection of documents. Words that appear frequently in every document (and have an IDF score of 0) may be good candidates for stop words.
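The formula is easy to check by hand in R, where log() is the natural log. The sketch below assumes three documents (our three chapters) and made-up document counts for three words; docs_containing is hypothetical and not computed from the book.

```r
# Total number of documents (three chapters)
n_docs <- 3

# Made-up number of chapters each word appears in
docs_containing <- c(gregor = 3, furniture = 2, milk = 1)

# idf(word) = ln(total documents / documents containing word)
idf <- log(n_docs / docs_containing)

round(idf, 3)
# gregor = 0, furniture = 0.405, milk = 1.099
```

A word found in every chapter gets an IDF of exactly 0, while a word found in only one chapter gets the largest possible value for three documents, ln(3).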

TF-IDF

The TF-IDF score is the term frequency multiplied by the inverse document frequency. In practice, this means that the TF boosts words that are common within a document, and the IDF discounts words that are common across documents.

Of course, in R we can do all of this with one function. The bind_tf_idf() function computes the TF, IDF, and TF-IDF scores for each word in our dataset.
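To make the multiplication concrete, here is a sketch that computes the score step by step with dplyr on made-up counts. The toy_counts data frame and its numbers are hypothetical, and tf is taken to be a word's share of its chapter's words.

```r
library(dplyr)

# Made-up per-chapter word counts for illustration only
toy_counts <- tibble(
  chapter = c(1, 1, 2, 2, 3, 3),
  word = c("gregor", "milk", "gregor", "couch", "gregor", "samsa"),
  n = c(8, 2, 6, 4, 5, 5)
)

n_chapters <- n_distinct(toy_counts$chapter)

toy_tf_idf <- toy_counts %>%
  group_by(chapter) %>%
  mutate(tf = n / sum(n)) %>%                             # share of the chapter
  group_by(word) %>%
  mutate(idf = log(n_chapters / n_distinct(chapter))) %>% # rarity across chapters
  ungroup() %>%
  mutate(tf_idf = tf * idf)

toy_tf_idf
```

In this toy data "gregor" is the most frequent word in every chapter, yet its TF-IDF is 0 because it appears in all three; "samsa" appears only in chapter 3 and scores highest there.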

book_words.2 <- book_words %>%
  bind_tf_idf(word, chapter, n)

book_words.2 %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(chapter) %>% 
  slice(1:10) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = chapter)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~chapter, ncol = 2, scales = "free") +
  coord_flip()+
  theme_classic()

Looking at the words with the highest TF-IDF scores in each chapter, we can see which words are more important to individual chapters. Gregor Samsa is one of the main characters in this novel. The term “Gregor” occurs throughout the entire book, so the IDF discounts it and it does not stand out in any one chapter. His last name, “Samsa,” is used far less evenly: it has the highest TF-IDF score for chapter 3, which suggests it is concentrated there.

From this, we could make some guesses as to what the chapters in this book are about. Perhaps milk was spilled on the couch and someone needed money to buy new furniture in chapter 2.

The only way to know for sure would be to read the book, but this is of course more time-efficient.

Team Hanley NLP Working Group
