Scraping website content with R


Setup

There are many ways you can do web scraping with R. I'm using this example because it worked for me in my job (and I had many enterprise restrictions regarding packages and internet connections).

The only requirement to get this working is rvest.
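If rvest is not already available in your environment, it can usually be installed from CRAN (or from whatever internal mirror your organization allows):

  install.packages("rvest")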

We will build a function that prints a stock's closing value from the previous day (its "Previous Close").

How

I'll explain as we go.

First, load the required package:

  library(rvest)

Next, we will declare our function and build it line by line. The first line is the base URL we want to use:

  print_quote_value <- function(quote_symbol){
    base_url = "http://finance.yahoo.com/q?s="
    }

Now we need to concatenate the quote symbol to the base URL. As of the time of writing this post, Yahoo Finance also requires appending an extra expression ("&x=0&y=0") for the search to work properly:

  print_quote_value <- function(quote_symbol){
    base_url = "http://finance.yahoo.com/q?s="
    url_to_use = paste(base_url, quote_symbol, "&x=0&y=0", sep = "")
    }

Read the content of the page into a variable. We will have a full copy of the HTML page in this new object:

  print_quote_value <- function(quote_symbol){
    base_url = "http://finance.yahoo.com/q?s="
    url_to_use = paste(base_url, quote_symbol, "&x=0&y=0", sep = "")
    raw_html = readLines(url_to_use)
    }

However, the full page, as is, is not very useful. We want a very specific piece of content from the page. Let's get it.
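Before digging in, it helps to remember what readLines returned: a character vector with one element per line of the page source. A quick interactive look (just a sketch, run after the readLines call above):

  length(raw_html)               # how many lines of HTML source we got
  substr(raw_html[1], 1, 80)     # first 80 characters of the first line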

Step 1: Find the line number where the expression we are looking for appears (we are looking for "Previous Close").

  line_number = grep("Previous Close", raw_html)

Step 2: If there is more than one result, inspect the page source and pick the line that actually contains the information you want. In our example, it is the first line returned by grep. To inspect the code more comfortably, consider saving the page to a file and opening it in an HTML editor, as sketched below.
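For example (both calls are safe to run interactively after the steps above; the file name is just an illustration):

  length(line_number)   # how many lines matched "Previous Close"?
  # save a local copy of the page to inspect it in an editor or browser
  download.file(url_to_use, "downloaded_page.html")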

Next, we need to know the character position of our expression within that line. Let's find it and store it in a variable:

  char_position = gregexpr("Previous Close", raw_html[line_number[1]])

Check the output: gregexpr returns a list (one element per input string), and we are interested in the first match position inside its first element.
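That position can be pulled out like this (the exact number will depend on the page source at the moment you run it):

  char_position[[1]][1]   # character position of "Previous Close" in that line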

Step 3: Subset the line down to the region where our information sits (this makes extracting it easier).

The starting point is right after the expression we searched for ("Previous Close") and its closing </span> tag. The end point is far enough ahead (175 characters here) to include the <span> element that holds the stock value.

  text_block_region = substring(raw_html[line_number[1]],
                                char_position[[1]][1] + 21,
                                char_position[[1]][1] + 175)

Finally, we will extract only the content with the value. This is where rvest comes in: we parse the block region as HTML, select the <span> node inside it, and extract its text.

  page_excerpt = read_html(text_block_region)
  quote = html_text(html_node(page_excerpt, "span"))

And we will print the result nicely. This is how the final code should look:

      print_quote_value <- function(quote_symbol){
        # build the search URL for the quote symbol
        base_url = "http://finance.yahoo.com/q?s="
        url_to_use = paste(base_url, quote_symbol, "&x=0&y=0", sep = "")
        # read the full HTML source, one line per element
        raw_html = readLines(url_to_use)
        # locate the line and character position of "Previous Close"
        line_number = grep("Previous Close", raw_html)
        char_position = gregexpr("Previous Close", raw_html[line_number[1]])
        # cut out the small region that contains the value
        text_block_region = substring(raw_html[line_number[1]],
                                      char_position[[1]][1] + 21,
                                      char_position[[1]][1] + 175)
        # parse that region and extract the text inside the <span> tag
        page_excerpt = read_html(text_block_region)
        quote = html_text(html_node(page_excerpt, "span"))
        print(paste(quote_symbol, "=", quote))
      }

(Un)fortunately, web pages change constantly. It is possible that Yahoo's page will look completely different by the time someone else tests this code, so adjust accordingly. The example should still provide some general guidelines and ideas to work from; one such idea is sketched below.
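For instance, the function can fail gracefully when the marker text is no longer found, instead of erroring on an empty match. This is only a sketch (the name print_quote_value_safe is my own), built on the same steps as above:

  print_quote_value_safe <- function(quote_symbol){
    base_url = "http://finance.yahoo.com/q?s="
    url_to_use = paste(base_url, quote_symbol, "&x=0&y=0", sep = "")
    raw_html = readLines(url_to_use)
    line_number = grep("Previous Close", raw_html)
    # if the marker text is gone, the page layout probably changed
    if (length(line_number) == 0) {
      print(paste(quote_symbol, "= not found (page layout may have changed)"))
      return(invisible(NULL))
    }
    char_position = gregexpr("Previous Close", raw_html[line_number[1]])
    text_block_region = substring(raw_html[line_number[1]],
                                  char_position[[1]][1] + 21,
                                  char_position[[1]][1] + 175)
    page_excerpt = read_html(text_block_region)
    quote = html_text(html_node(page_excerpt, "span"))
    print(paste(quote_symbol, "=", quote))
  }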

  print_quote_value("NVDA")
  ### [1] "NVDA = 231.66"

Happy scraping.