Introduction to Web Scraping in R

What is rvest

For the uninitiated, web scraping is how to pull data down from a website and clean it so that it can be used for analysis, and rvest is the primary package used for web scraping in R.

From the rstudio blog:

“rvest is new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.”

Even though we are elite hackers, we will only have access to the same information that is publicly available for everyone. To get a better sense of what we have access to, go to any webpage, right-click, and select “Inspect”. You will see a bunch of html tags. This is what is available to us for scraping.

Before we get into examples, let's go over how rvest works at its most basic level:

Web Scraping Basics in rvest

First, how do we get a webpage all the way from the internet onto our local computers? As you can guess by the name, the read_html() function will do this for us. Simply enter the name of the page you would like to scrape (in quotes) and put it in the function.

read_html or Getting the HTML

library(rvest)
read_html("https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table")
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

This returns an html object, and while this technically contains all the information we need, it doesn’t really help us. Luckily, rvest comes with simple-to-use helper functions that will let us get this webpage into the right format.

Finding a Page Element

But before we go over how to use rvest’s helper functions, we need to go over how to find a particular element on a page. This will be done with the “Inspect” option in Chrome.

If you are looking to pull back multiple page elements you will need to find the CSS Class that applies to those elements, and if you are looking to pull back a single element, you will need its xpath.

You may be thinking, what is an xpath? An xpath is a path to a particular web element. In order to find one complete the following steps (in Chrome):

  • Find the element that you want on a webpage
  • Inspect it and look for its tag in the html console
  • Right-click on its tag and go to “Copy” -> “Copy XPath”.
  • Paste it into html_node().

Or as seen on TV:

Finding the CSS class is even easier. For the web element that you are looking to retrieve, inspect it, and see what CSS class is being applied. Insert that class name into html_nodes().

html_nodes and html_table

Once you have pulled down the html and know the CSS Class or xpath that you want to use, you can retrieve it/them using html_nodes() or html_node().

In this case, we are using an xpath to pull out the table. Because we are looking for one thing, use html_node() rather than html_nodes(), which will return elements back in a list.

For our example, go to the Wikipedia page and inspect the 2016 Summer Olympics medal table. Next, look for a <table> tag. Right-click this and go to Copy XPath. You should get the same value as you see in the html_node() below.

After that, it's smooth sailing. You have a table, so pipe it into the html_table() function and set fill = TRUE. I used the kableExtra package to make this look nicer for this markdown, but that's not necessary for web scraping. Your results should look something like the table below.

library(kableExtra) # this is just to make the tables look nice

url <- "https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table"

medal_tally <- url %>% 
  read_html() %>% # read_html pulls in the entire html for a webpage
  # html_nodes finds a specified part of the page
  html_node(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
  # if that node is a table, you can pull it out with html_table
  html_table(fill=TRUE)

kable(head(medal_tally), format = "html")%>%
  kable_styling("striped")
Rank NOC Gold Silver Bronze Total
1 United States (USA) 46 37 38 121
2 Great Britain (GBR) 27 23 17 67
3 China (CHN) 26 18 26 70
4 Russia (RUS) 19 17 20 56
5 Germany (GER) 17 10 15 42
6 Japan (JPN) 12 8 21 41

html_nodes and html_text

So what if the data you are looking to pull back isn’t already in a table? What if you want to create a table from information on the webpage? In that case, html_table() won’t work. Instead, you will have to assemble the table yourself.

Enter, html_text(). This function extracts the text from inside a tag. Let’s look at a quick example. First, without html_text, then with it.

url = "http://www.imdb.com/search/title?year=2017&title_type=feature&"

# This CSS class contains all the title names for movies.
titles = read_html(url)%>%
  html_nodes(css = '.lister-item-header a')

titles[1:2]
## {xml_nodeset (2)}
## [1] <a href="/title/tt2283362/?ref_=adv_li_tt">Jumanji: Welcome to the Jungle ...
## [2] <a href="/title/tt1856101/?ref_=adv_li_tt">Blade Runner 2049</a>

As you can see this returned a list of html tags. While this could be useful, what we are really interested in is the text within them. As previously mentioned, we can use html_text() to extract this.

url = "http://www.imdb.com/search/title?year=2017&title_type=feature&"

# This CSS class contains all the title names for movies.
titles = read_html(url)%>%
  html_nodes(css = '.lister-item-header a')%>%
  html_text()

titles[1:2]
## [1] "Jumanji: Welcome to the Jungle" "Blade Runner 2049"

Notice what this returns: a character vector of the movie titles. You can use the data.frame() to consolidate a few of these into a single dataframe (which is the way to solve the first example problem).

Problem 1: Creating a Table from a Webpage

Problem

From the following hyperlink, create a dataframe that contains the below columns:
http://www.imdb.com/search/title?year=2017&title_type=feature&

Columns:

  • rank
  • title
  • run_time
  • genre
  • primary genre (the first genre returned)

Hints

  • Find the CSS classes of the elements you want using inspect
  • Use html_text() to extract the text
  • You will need pattern replacement to clean up the returned strings
  • extract_numeric() from tidyr will take just the numeric values from a character string

Solution

######## IMDB WEBSITE
library(tidyr)

url = "http://www.imdb.com/search/title?year=2017&title_type=feature&"

#Reading the HTML code from the website
webpage = read_html(url)

movies = data.frame(
  rank = html_nodes(webpage, '.text-primary')%>%html_text()%>%extract_numeric(),
  title = html_nodes(webpage,'.lister-item-header a')%>%html_text(),
  run_time = html_nodes(webpage,' .runtime')%>%html_text()%>%extract_numeric(),
  genre = gsub("\n", "", html_nodes(webpage,'.genre')%>%html_text()),
  primary_genre = gsub(",.*", "", gsub("\n", "", html_nodes(webpage,'.genre')%>%html_text()))
)

kable(head(movies), format = "html")%>%
  kable_styling("striped")
rank title run_time genre primary_genre
1 Jumanji: Welcome to the Jungle 119 Action, Adventure, Comedy Action
2 Blade Runner 2049 164 Action, Drama, Mystery Action
3 It 135 Horror Horror
4 Thor: Ragnarok 130 Action, Adventure, Comedy Action
5 Call Me by Your Name 132 Drama, Romance Drama
6 Guardians of the Galaxy Vol. 2 136 Action, Adventure, Comedy Action

Problem 2: Creating a Table from Unstructured Text

Problem

This problem is admittedly much more difficult and will test your ability to work with text data as much, if not more, than your ability to web scrape.

The following link contains the transcript of the first presidential debate between Hiliary Clinton and Donald Trump. Build a table that contains both the speaker and what was said.

While there is probably multiple ways to tackle this, I am going to show you how I handled it.

Hints

  • First, you will return a giant block of text. You will need to use str_locate_all() to return the locations of the speakers (Holt, Clinton, and Trump)
  • Using these locations, you can then extract the correct text
  • You will need to use a for loop to assemble the whole thing into a dataframe.

As this is a considerably more complicated example, we will have a step by step walk through of how to solve it in the solution tab.

Solution

Step 1: Finding the Location of These Speakers

Using the str_locate_all() function from the stringr package, you can return the start and end position of a character.

So for example, if you wanted to find the start and end position of the word “HOLT”, in the following string: "HOLT: I am the moderator, good sir.", you would say: str_locate_all(string, "HOLT"), which would return: start = 1 and end = 4.

library(stringr)

# new transcript source
url="https://www.debates.org/voter-education/debate-transcripts/september-26-2016-debate-transcript/"

# once again we read the url
webpage = read_html(url)

# use the xpath to the container for the entire debate!
transcript = webpage %>% html_nodes(xpath= '//*[@id="content-sm"]') %>% html_text()

# We have 3 different patterns/speakers to search for 
# which will include in our expression using the or operator '|'
# str_locate_all of stringr will give the index (start and end position) of all the matches
markers <- str_locate_all(transcript, pattern = "CLINTON|TRUMP|HOLT")[[1]]%>%as.data.frame()

head(markers)
##   start  end
## 1   251  254
## 2  1526 1532
## 3  1567 1570
## 4  2720 2726
## 5  4569 4572
## 6  4744 4748

Step 2: Extract the Name and Responses by Location

Now that we have the start and end position of each matched speaker in a dataframe, we can retun their names and what they said using the row numbers of the data frame we created.

For example, markers$start[1] will return the first value in the start variable in the markers dataframe, which is 251 or H for HOLT. markers$end[1] will return the position 254, which is the T in HOLT.

To get what each speaker said, we will need the first character past the speaker's name or markers$end[1] + 1 and the last character before for the next speaker's name or markers$start[2] - 1.

# how we can locate a name
# This is the same as: substr(transcript, 251, 254)
substr(transcript, markers$start[1], markers$end[1])
## [1] "HOLT"
# how we can locate what they said
# This is the same as: substr(transcript, 255, 1525)
substr(transcript, markers$end[1]+1, markers$start[2]-1)
## [1] ": Good evening from Hofstra University in Hempstead, New York. I’m Lester Holt, anchor of “NBC Nightly News.” I want to welcome you to the first presidential debate.\nThe participants tonight are Donald Trump and Hillary Clinton. This debate is sponsored by the Commission on Presidential Debates, a nonpartisan, nonprofit organization. The commission drafted tonight’s format, and the rules have been agreed to by the campaigns.\nThe 90-minute debate is divided into six segments, each 15 minutes long. We’ll explore three topic areas tonight: Achieving prosperity; America’s direction; and securing America. At the start of each segment, I will ask the same lead-off question to both candidates, and they will each have up to two minutes to respond. From that point until the end of the segment, we’ll have an open discussion.\nThe questions are mine and have not been shared with the commission or the campaigns. The audience here in the room has agreed to remain silent so that we can focus on what the candidates are saying.\nI will invite you to applaud, however, at this moment, as we welcome the candidates: Democratic nominee for president of the United States, Hillary Clinton, and Republican nominee for president of the United States, Donald J. Trump. [applause]\n"

Step 3: Loop through Each Speaker to Assemble a Dataframe

The goal of this step is return each speaker and what they said in a dataframe, all in the order in which they said it.

We can use a loop to do this. Using the indexing technique described above, for each row number, we will return the name and comment into a one row dataframe. As you may expect, each time the for loop iterates, the row will be overwritten.

To get around this, we will initiate a dataframe outside of the for loop and then append the rows to that datafame each time that we loop through.

# initiate data frame outside of loop
answers = data.frame()

# for each element in the loop, get the name of the speaker and the content 
for(i in 1:nrow(markers)){
  temp = data.frame(
            name = substr(transcript, markers$start[i], markers$end[i]),
            comment = substr(transcript, markers$end[i]+1, markers$start[i+1]-1)
            )
  answers = rbind(answers, temp) # my guess is that this is the inefficient part
}


rm(temp)

# lets get rid of the : after the speakers name as it is the first character at the 
# beginning of each response

answers$comment = gsub(": ", "", answers$comment)

# Ok now we have the answers to the questions and the name of the respondent 
kable(head(answers, n = 3), format = "html")%>%
  kable_styling("striped")
name comment
HOLT Good evening from Hofstra University in Hempstead, New York. I’m Lester Holt, anchor of “NBC Nightly News.” I want to welcome you to the first presidential debate. The participants tonight are Donald Trump and Hillary Clinton. This debate is sponsored by the Commission on Presidential Debates, a nonpartisan, nonprofit organization. The commission drafted tonight’s format, and the rules have been agreed to by the campaigns. The 90-minute debate is divided into six segments, each 15 minutes long. We’ll explore three topic areas tonightAchieving prosperity; America’s direction; and securing America. At the start of each segment, I will ask the same lead-off question to both candidates, and they will each have up to two minutes to respond. From that point until the end of the segment, we’ll have an open discussion. The questions are mine and have not been shared with the commission or the campaigns. The audience here in the room has agreed to remain silent so that we can focus on what the candidates are saying. I will invite you to applaud, however, at this moment, as we welcome the candidatesDemocratic nominee for president of the United States, Hillary Clinton, and Republican nominee for president of the United States, Donald J. Trump. [applause]
CLINTON How are you, Donald? [applause]
HOLT Good luck to you. [applause] Well, I don’t expect us to cover all the issues of this campaign tonight, but I remind everyone, there are two more presidential debates scheduled. We are going to focus on many of the issues that voters tell us are most important, and we’re going to press for specifics. I am honored to have this role, but this evening belongs to the candidates and, just as important, to the American people. Candidates, we look forward to hearing you articulate your policies and your positions, as well as your visions and your values. So, let’s begin. We’re calling this opening segment “Achieving Prosperity.” And central to that is jobs. There are two economic realities in America today. There’s been a record six straight years of job growth, and new census numbers show incomes have increased at a record rate after years of stagnation. However, income inequality remains significant, and nearly half of Americans are living paycheck to paycheck. Beginning with you, Secretary Clinton, why are you a better choice than your opponent to create the kinds of jobs that will put more money into the pockets of American workers?

Conclusion

And there you go! That’s a mild introduction to how to web scrape with R. We went over how to find the correct xpath or CSS Class to isolate what data we would like to pull back.

html_node() and html_table() can be used to pull back a clean html table into R.

html_nodes() and html_text() can be used to pull back a character vector. Using data.frame(), we can assemble a dataframe out of the different vectors we pull back.

Finally, working with unstructured text can be tough, but through the stringr package, we can clean and provide structure, putting this data into a dataframe. This will allow us to perform NLP more easily later.

One thing to note: webpages are constantly changing and are being reworked, and web scraping is a brute force approach to retrieving data from the web. APIs tend to be more static and should be the first approach for getting data from the web.