Setup
There are many ways to do web scraping in R. I'm using this example because it worked for me at my job (where I had many enterprise restrictions on packages and internet connections).
The only requirement to get this working is rvest.
We will build a function that prints the value of a stock from the previous day.
How
I'll explain as we go.
First, load the required package:
library(rvest)
Next, we will declare our function and go line by line. The first line is the base URL we want to use:
print_quote_value <- function(quote_symbol){
  base_url = "http://finance.yahoo.com/q?s="
}
Now, we need to concatenate the search string to the base URL. To use Yahoo Finance, as of the time of writing this post, we need to append an expression for the search to work properly:
print_quote_value <- function(quote_symbol){
  base_url = "http://finance.yahoo.com/q?s="
  url_to_use = paste(base_url, quote_symbol, "&x=0&y=0", sep = "")
}
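To make the concatenation concrete, here is what the assembled URL looks like for one symbol ("NVDA" is just an example input):

```r
base_url = "http://finance.yahoo.com/q?s="
# paste() with sep = "" glues the pieces together with nothing in between
url_to_use = paste(base_url, "NVDA", "&x=0&y=0", sep = "")
url_to_use
# [1] "http://finance.yahoo.com/q?s=NVDA&x=0&y=0"
```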
Read the content of the page into a variable. We will have a full copy of the HTML page in this new object:
print_quote_value <- function(quote_symbol){
  base_url = "http://finance.yahoo.com/q?s="
  url_to_use = paste(base_url, quote_symbol, "&x=0&y=0", sep = "")
  raw_html = readLines(url_to_use)
}
However, the full page, as is, is not very useful. We want a very specific piece of content from the page. Let's get it.
Step 1: Let's find the line number where the expression we are looking for appears (we are looking for "Previous Close").
line_number = grep("Previous Close", raw_html)
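If grep is new to you, here is a minimal, network-free sketch of what it does (the three-element vector is invented for illustration):

```r
# grep() returns the indices of the elements that match the pattern
lines = c("<html>",
          "<span>Previous Close</span>",
          "<span>230.00</span>")
grep("Previous Close", lines)
# [1] 2
```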
Step 2: If there's more than one result, check the page source to pick the line that has the information you want. In our example, the information is in the first line of the grep results. (To better inspect the code, consider saving it to a file and opening it in an HTML editor: download.file(url_to_use, "downloaded_page.html").)
We need to know at which character position our expression starts within that line. Let's find it and store it in a variable:
char_position = gregexpr("Previous Close", raw_html[line_number[1]])
Check the output. It's a list. We are interested in the first item.
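A toy example (the input string is invented) shows the list structure gregexpr returns:

```r
# gregexpr() returns a list with one element per input string;
# each element holds the match start positions within that string
pos = gregexpr("Previous Close", "<span>Previous Close</span>")
pos[[1]][1]
# [1] 7  ("Previous Close" starts right after the 6-character "<span>")
```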
Step 3: We will subset the line down to the region where our information sits (this will make extracting it easier). The starting point is just past the end of the expression we searched for ("Previous Close") and its closing </span> tag. The end point is far enough ahead to include the closing </span> tag that wraps the value of the stock.
text_block_region = substring(raw_html[line_number[1]],
                              char_position[[1]][1] + 21,
                              char_position[[1]][1] + 175)
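The offsets 21 and 175 are specific to Yahoo's markup at the time. Here is the same subsetting idea on an invented one-line example, with offsets 23 and 41 worked out by hand for this toy string:

```r
line = "<td>Previous Close</td><td><span>231.66</span></td>"
pos = gregexpr("Previous Close", line)[[1]][1]   # 5
# Skip past the expression and the tags that follow it,
# keeping just the <span> that holds the value
substring(line, pos + 23, pos + 41)
# [1] "<span>231.66</span>"
```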
Finally, we will extract only the content with the value. Here is where we use rvest. We will convert the block region to HTML and extract the content inside its <span> tags.
page_excerpt = read_html(text_block_region)
quote = html_text(html_node(page_excerpt, "span"))
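read_html also accepts a raw HTML string, so the extraction step can be tried without touching the network (the snippet below is an invented stand-in for Yahoo's markup):

```r
library(rvest)

excerpt = read_html("<td><span>231.66</span></td>")
# html_node() selects the first <span>; html_text() returns its text
html_text(html_node(excerpt, "span"))
# [1] "231.66"
```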
And we will print the result nicely. This is what the final code should look like:
print_quote_value <- function(quote_symbol){
  base_url = "http://finance.yahoo.com/q?s="
  url_to_use = paste(base_url, quote_symbol, "&x=0&y=0", sep = "")
  raw_html = readLines(url_to_use)
  line_number = grep("Previous Close", raw_html)
  char_position = gregexpr("Previous Close", raw_html[line_number[1]])
  text_block_region = substring(raw_html[line_number[1]],
                                char_position[[1]][1] + 21,
                                char_position[[1]][1] + 175)
  page_excerpt = read_html(text_block_region)
  quote = html_text(html_node(page_excerpt, "span"))
  print(paste(quote_symbol, "=", quote))
}
(Un)Fortunately, web pages change constantly. Yahoo's page may have changed completely by the time someone else tests this code; adjust accordingly. The example should still provide some general guidelines and ideas.
print_quote_value("NVDA")
### [1] "NVDA = 231.66"
Happy scraping.