Chapter 6 Web-Scraping: Polling use case
A very important metric to keep track of is how your candidate is polling. Are they gaining a lead in the polls or falling behind? This data is often reported via traditional news organizations or some other mediums. The supposed demi-God and mythical pollster Nate Silver’s organization FiveThirtyEight does a wonderful job aggregating polls. Their page National 2020 Democratic Presidential Primary Polls has a table of the most recent polls from many different pollsters.
In this use case we will acquire this data by web scraping using rvest
. We will also go over ways to programatically save polls results to a text file. Saving polling results can allow you present a long term view of your candidate’s growth during the quarter.
6.1 Understanding rvest
This use case will provide a cursory overview of the package rvest
. To learn more go here.
Web scraping is the process of extracting data from a website. Websites are written in HTML and CSS. There are a few aspects of these languages that are used in web scraping that is important to know. HTML is written in a series of what are call tags. A tag is a set of characters wrapped in angle brackets—i.e. <img>
.
With CSS (cascading style sheets), web developers can give unique identifiers to a tag. Classes can also be assigned to a tag. Think of these as group. With web scraping we can specify a particular part of a website by it’s HTML tag and perhaps it’s class or ID. rvest
provides a large set of functions to make this simpler.
6.2 Example
For this example we will be scraping FiveThirtyEight’s aggregated poll table. The table can be found at https://projects.fivethirtyeight.com/2020-primaries/democratic/national/.
Before we begin, we must always prepare our workspace. Mise en place.
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
library(tidyverse)
The first thing we will have to do is specify what page we will be scraping from. html_session()
will simulate a session in an html browser. By providing a URL to html_session()
we will then be able to access the underlying code of that page. Create an object called session
by providing the FiveThirtyEight URL to html_session()
.
session <- html_session("https://projects.fivethirtyeight.com/2020-primaries/democratic/national/")
The next and most important step is to identify which piece of HTML code contains the table. The easiest way to do this is to open up the webpage in Chrome and open up the Inspect Elements view (on Mac - ⌘ + Shift + C). Now that this is open, click the select element button at the top left corner of the inspection pane. Now hover over the table.
You will see that the HTML element is highlighted. We can see that it is a table
tag. Additionally we see that there are two different classes polls-table
and tracker
. To specify a class we put a preceding .
to the class name—i.e. .class-name
. If there are multiple classes we just append the second class name to it—i.e. .first-class.second-class
. Be aware that these selectors can be quite finicky and be a bit difficult to figure out. You might need to do some googling or playing around with the selector.
To actually access the content of this HTML element, we must specify the element using the proper selector. html_node()
will be used to do this. Provide the html session and the CSS selector to html_node()
to extract the HTML element.
session %>%
html_node(".polls-table.tracker")
## {xml_node}
## <table class="polls-table tracker">
## [1] <thead class="hide-mobile" id="table-header"><tr>\n<th class="new">< ...
## [2] <tbody>\n<tr class="visible-row" data-id="102929">\n<!-- Shared--><t ...
Here we see that this returns on object of class xml_node
. This object returns some HTML code but it is still not entirely workable. Since this is an HTML table we want to extract we can use the handy html_table()
. Note that if this wasn’t a table but rather text, you can use html_text()
.
session %>%
html_node(".polls-table.tracker") %>%
html_table()
Take note of the extremely informative error. It appears we might have to deal with mismatching columns.
session %>%
html_node(".polls-table.tracker") %>%
html_table(fill = TRUE)
## Dates Pollster
## 1 • Sep 15-18, 2019768 LV Sep 15-18, 2019
## 2 • Sep 15-18, 20191,326 RV Sep 15-18, 2019
## 3 • Sep 15-17, 2019480 LV Sep 15-17, 2019
## 4 • Sep 14-17, 2019603 LV Sep 14-17, 2019
## 5 • Sep 13-16, 20191,291 LV Sep 13-16, 2019
## 6 • Sep 13-16, 20191,017 LV Sep 13-16, 2019
## 7 • Sep 13-16, 2019506 LV Sep 13-16, 2019
## 8 • Sep 13-15, 20197,487 LV Sep 13-15, 2019
## 9 • Sep 10-13, 2019799 LV Sep 10-13, 2019
## 10 • Sep 10-13, 20191,302 RV Sep 10-13, 2019
## 11 • Sep 6-13, 20191,865 LV Sep 6-13, 2019
## 12 • Sep 6-13, 20193,074 RV Sep 6-13, 2019
## 13 • Sep 10-12, 20191,784 LV Sep 10-12, 2019
## 14 • Sep 7-11, 2019241 LV Sep 7-11, 2019
## 15 • Sep 9-10, 2019557 RV Sep 9-10, 2019
## 16 • Sep 9-10, 2019615 A Sep 9-10, 2019
## Sample Sample Biden Sanders Warren Harris
## 1 C+HarrisX 768 LV 34% 14% 18%
## 2 C+HarrisX 1,326 RV 27% 15% 16%
## 3 AFox News 480 LV 29% 18% 16%
## 4 BYouGov 603 LV 25% 15% 19%
## 5 Civiqs 1,291 LV 24% 14% 30%
## 6 ASurveyUSA 1,017 LV 33% 17% 19%
## 7 A-NBC News/Wall Street Journal 506 LV 31% 14% 25%
## 8 B-Morning Consult 7,487 LV 32% 20% 18%
## 9 C+HarrisX 799 LV 38% 14% 15%
## 10 C+HarrisX 1,302 RV 31% 15% 12%
## 11 C+HarrisX 1,865 LV 35% 15% 15%
## 12 C+HarrisX 3,074 RV 29% 16% 12%
## 13 Civiqs 1,784 LV 23% 15% 28%
## 14 B-GQR Research 241 LV 30% 21% 19%
## 15 B+Ipsos 557 RV 22% 16% 11%
## 16 B+Ipsos 615 A 22% 16% 11%
## Buttigieg O'Rourke Booker Yang Klobuchar Castro Gabbard Ryan Delaney
## 1 5% 5% 2% 2% 2% 1% 0% 0% 1%
## 2 4% 3% 3% 2% 3% 1% 1% 1% 1%
## 3 7% 5% 4% 3% 2% 2% 0% 0% 0%
## 4 5% 8% 3% 2% 3% 1% 1% 2% 0%
## 5 6% 7% 2% 1% 2% 1% 1% 2% 0%
## 6 6% 5% 4% 4% 3% 1% 1%
## 7 5% 7% 1% 2% 4% 2% 1% 1% 0%
## 8 6% 5% 4% 3% 3% 2% 1% 1% 1%
## 9 5% 4% 3% 1% 2% 1% 1% 0% 1%
## 10 5% 4% 3% 2% 2% 1% 1% 1% 2%
## 11 6% 5% 3% 2% 3% 1% 1% 0% 1%
## 12 5% 4% 3% 2% 3% 1% 1% 1% 1%
## 13 7% 7% 2% 1% 2% 1% 1% 2% 1%
## 14 4% 4% 1% 4% 2% 2% 1% 2% 1%
## 15 4% 4% 2% 3% 3% 1% 1% 1% 1%
## 16 4% 4% 3% 3%
## Bennet de Blasio Williamson Bullock Steyer Sestak Messam Gillibrand
## 1 1% 0% 1% 1% 1% 1% 1% 0%
## 2 1% 1% 1% 1% 1% 1% 1% 0%
## 3 1% 1% 0% 0% 0% 1% 0% 0%
## 4 1% 1% 0% 1% 1% 0% 0% 0%
## 5 0% 0% 0% 0% 1% 1%
## 6
## 7 1% 1% 1% 0% 0% 1%
## 8 1% 0% 0% 1% 1% 1%
## 9 1% 1% 1% 0% 1% 2% 0% 0%
## 10 1% 1% 1% 1% 1% 1% 1% 0%
## 11 1% 1% 0% 0% 1% 1% 0% 0%
## 12 1% 1% 1% 0% 1% 1% 1% 0%
## 13 0% 0% 0% 1% 0% 0%
## 14 0% 0% 0% 1% 0% 0% 0% 0%
## 15 0% 0% 0% 0% 1% 0% 0%
## 16
## Hickenlooper Inslee H. Clinton Bloomberg M. Obama Brown Moulton Gravel
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## Swalwell Kerry Abrams Holder McAuliffe Winfrey Ojeda Trump Cuomo
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## Avenatti Kennedy Patrick Zuckerberg Pelosi Garcetti Newsom Schultz
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## Kaine Johnson Kucinich Lee Scott Sinema Warner NA
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## NA
## 1 Biden34%Warren18%Sanders14%Harris5%Buttigieg5%O'Rourke2%Booker2%Yang2%Klobuchar1%Delaney1%Williamson1%Ryan1%Bullock1%de Blasio1%Sestak1%Steyer1%Castro0%Gabbard0%Messam0%Bennet0%
## 2 Biden27%Warren16%Sanders15%Harris4%Buttigieg3%O'Rourke3%Yang3%Booker2%Castro1%Klobuchar1%Delaney1%Gabbard1%Williamson1%Ryan1%Bennet1%Bullock1%de Blasio1%Sestak1%Steyer1%Messam0%
## 3 Biden29%Sanders18%Warren16%Harris7%Buttigieg5%O'Rourke4%Booker3%Klobuchar2%Yang2%Bennet1%Delaney1%Steyer1%Bullock0%Castro0%de Blasio0%Gabbard0%Messam0%Ryan0%Sestak0%Williamson0%
## 4 Biden25%Warren19%Sanders15%Buttigieg8%Harris5%O'Rourke3%Yang3%Booker2%Gabbard2%Klobuchar1%Castro1%Bennet1%Williamson1%Bullock1%Delaney1%de Blasio0%Steyer0%Ryan0%Messam0%Sestak0%
## 5 Warren30%Biden24%Sanders14%Buttigieg7%Harris6%O'Rourke2%Yang2%Gabbard2%Booker1%Klobuchar1%Castro1%Steyer1%Bullock1%Williamson0%de Blasio0%Delaney0%Bennet0%Ryan0%
## 6 Biden33%Warren19%Sanders17%Harris6%Buttigieg5%O'Rourke4%Booker4%Yang3%Klobuchar1%Castro1%
## 7 Biden31%Warren25%Sanders14%Buttigieg7%Harris5%Yang4%Klobuchar2%Booker2%O'Rourke1%Gabbard1%Delaney1%Steyer1%Bennet1%de Blasio1%Castro1%Bullock0%Ryan0%Williamson0%
## 8 Biden32%Sanders20%Warren18%Harris6%Buttigieg5%O'Rourke4%Booker3%Yang3%Klobuchar2%Bullock1%Castro1%Delaney1%Gabbard1%Ryan1%Steyer1%Williamson1%Bennet0%de Blasio0%
## 9 Biden38%Warren15%Sanders14%Harris5%Buttigieg4%O'Rourke3%Yang2%Steyer2%Booker1%Castro1%Klobuchar1%Delaney1%Ryan1%Bennet1%Bullock1%de Blasio1%Gabbard0%Williamson0%Messam0%Sestak0%
## 10 Biden31%Sanders15%Warren12%Harris5%Buttigieg4%O'Rourke3%Booker2%Yang2%Ryan2%Castro1%Klobuchar1%Delaney1%Gabbard1%Williamson1%Bennet1%Bullock1%de Blasio1%Sestak1%Steyer1%Messam0%
## 11 Biden35%Sanders15%Warren15%Harris6%Buttigieg5%O'Rourke3%Yang3%Booker2%Castro1%Klobuchar1%Delaney1%Ryan1%Bennet1%Bullock1%Steyer1%Gabbard0%Williamson0%Messam0%de Blasio0%Sestak0%
## 12 Biden29%Sanders16%Warren12%Harris5%Buttigieg4%O'Rourke3%Yang3%Booker2%Castro1%Klobuchar1%Delaney1%Gabbard1%Ryan1%Bennet1%Bullock1%de Blasio1%Sestak1%Steyer1%Williamson0%Messam0%
## 13 Warren28%Biden23%Sanders15%Harris7%Buttigieg7%Yang2%O'Rourke2%Gabbard2%Booker1%Klobuchar1%Castro1%Williamson1%Ryan1%Steyer0%de Blasio0%Bullock0%Delaney0%Bennet0%
## 14 Biden30%Sanders21%Warren19%Booker4%Harris4%Buttigieg4%Klobuchar2%Gabbard2%Yang2%O'Rourke1%Castro1%Williamson1%Ryan1%Bennet0%de Blasio0%Delaney0%Steyer0%Bullock0%Messam0%Sestak0%
## 15 Biden22%Sanders16%Warren11%Buttigieg4%Harris4%Yang3%Booker3%O'Rourke2%Ryan1%Steyer1%Klobuchar1%Castro1%Gabbard1%Williamson0%Sestak0%Bullock0%Messam0%de Blasio0%Bennet0%
## 16 Biden22%Sanders16%Warren11%Harris4%Buttigieg4%Yang3%Booker3%
## [ reached 'max' / getOption("max.print") -- omitted 232 rows ]
This is much better! But based on visual inspection the column headers are not properly matched. There are a few things that need to be sorted out: there are two date columns, there are commas and percents where numeric columns should be, the column headers are a little messy, and the table isn’t a tibble (this is just personal preference).
We will handle the final two issues first as they are easiest to deal with. The function clean_names()
from janitor
will handle the column headers, and as_tibble()
will coerce the data.frame into a proper tibble. Save this semi-clean tibble into an object called polls
.
polls <- session %>%
html_node(".polls-table.tracker") %>%
html_table(fill = TRUE) %>%
janitor::clean_names() %>%
as_tibble()
polls
## # A tibble: 248 x 60
## x dates pollster sample sample_2 biden sanders warren harris
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 • Sep … Sep 15-… C+Har… 768 LV 34% 14% 18%
## 2 • Sep … Sep 15-… C+Har… 1,326 RV 27% 15% 16%
## 3 • Sep … Sep 15-… AFox … 480 LV 29% 18% 16%
## 4 • Sep … Sep 14-… BYouG… 603 LV 25% 15% 19%
## 5 • Sep … Sep 13-… Civiqs 1,291 LV 24% 14% 30%
## 6 • Sep … Sep 13-… ASurv… 1,017 LV 33% 17% 19%
## 7 • Sep … Sep 13-… A-NBC… 506 LV 31% 14% 25%
## 8 • Sep … Sep 13-… B-Mor… 7,487 LV 32% 20% 18%
## 9 • Sep … Sep 10-… C+Har… 799 LV 38% 14% 15%
## 10 • Sep … Sep 10-… C+Har… 1,302 RV 31% 15% 12%
## # … with 238 more rows, and 51 more variables: buttigieg <chr>,
## # o_rourke <chr>, booker <chr>, yang <chr>, klobuchar <chr>,
## # castro <chr>, gabbard <chr>, ryan <chr>, delaney <chr>, bennet <chr>,
## # de_blasio <chr>, williamson <chr>, bullock <chr>, steyer <chr>,
## # sestak <chr>, messam <chr>, gillibrand <chr>, hickenlooper <chr>,
## # inslee <chr>, h_clinton <chr>, bloomberg <chr>, m_obama <chr>,
## # brown <chr>, moulton <chr>, gravel <chr>, swalwell <chr>, kerry <chr>,
## # abrams <chr>, holder <chr>, mc_auliffe <chr>, winfrey <chr>,
## # ojeda <chr>, trump <chr>, cuomo <chr>, avenatti <chr>, kennedy <chr>,
## # patrick <chr>, zuckerberg <chr>, pelosi <chr>, garcetti <chr>,
## # newsom <chr>, schultz <chr>, kaine <chr>, johnson <chr>,
## # kucinich <chr>, lee <chr>, scott <chr>, sinema <chr>, warner <chr>,
## # na <chr>, na_2 <chr>
We want to shift over the column names to the right just once. Unfortunately there is no elegant way to do this (that I am aware of). We can see that the first column is completely useless so that can be removed. Once that column is removed we can reset the names this way they will be well aligned.
We will start by creating a vector of the original column names.
col_names <- names(polls)
col_names
## [1] "x" "dates" "pollster" "sample"
## [5] "sample_2" "biden" "sanders" "warren"
## [9] "harris" "buttigieg" "o_rourke" "booker"
## [13] "yang" "klobuchar" "castro" "gabbard"
## [17] "ryan" "delaney" "bennet" "de_blasio"
## [21] "williamson" "bullock" "steyer" "sestak"
## [25] "messam" "gillibrand" "hickenlooper" "inslee"
## [29] "h_clinton" "bloomberg" "m_obama" "brown"
## [33] "moulton" "gravel" "swalwell" "kerry"
## [37] "abrams" "holder" "mc_auliffe" "winfrey"
## [41] "ojeda" "trump" "cuomo" "avenatti"
## [45] "kennedy" "patrick" "zuckerberg" "pelosi"
## [49] "garcetti" "newsom" "schultz" "kaine"
## [53] "johnson" "kucinich" "lee" "scott"
## [57] "sinema" "warner" "na" "na_2"
Unfortunately this also presents another issue. Once a column is deselected, there will be one more column name than column. So we will need to select all but the last element of the original names. We will create a vector called new_names
.
# identify the integer number of the last column
last_col <- length(col_names) - 1
# create a vector which will be used for the new names
new_names <- col_names[1:last_col]
Now we can try implementing the hacky solution. Here we will deselect the first column and reset the names using setNames()
. Following, we will use the mutate_at()
variant to remove the percent sign from every candidate column and coerce them into integer columns. Here we will specify which variables to not mutate at within vars()
.
polls %>%
select(-1) %>%
setNames(new_names)%>%
select(-1) %>%
mutate_at(vars(-c("dates", "pollster", "sample", "sample_2")),
~as.integer(str_remove(., "%")))
## Warning in ~as.integer(str_remove(., "%")): NAs introduced by coercion
## # A tibble: 248 x 58
## dates pollster sample sample_2 biden sanders warren harris buttigieg
## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Sep … C+Harri… 768 LV 34 14 18 5 5
## 2 Sep … C+Harri… 1,326 RV 27 15 16 4 3
## 3 Sep … AFox Ne… 480 LV 29 18 16 7 5
## 4 Sep … BYouGov 603 LV 25 15 19 5 8
## 5 Sep … Civiqs 1,291 LV 24 14 30 6 7
## 6 Sep … ASurvey… 1,017 LV 33 17 19 6 5
## 7 Sep … A-NBC N… 506 LV 31 14 25 5 7
## 8 Sep … B-Morni… 7,487 LV 32 20 18 6 5
## 9 Sep … C+Harri… 799 LV 38 14 15 5 4
## 10 Sep … C+Harri… 1,302 RV 31 15 12 5 4
## # … with 238 more rows, and 49 more variables: o_rourke <int>,
## # booker <int>, yang <int>, klobuchar <int>, castro <int>,
## # gabbard <int>, ryan <int>, delaney <int>, bennet <int>,
## # de_blasio <int>, williamson <int>, bullock <int>, steyer <int>,
## # sestak <int>, messam <int>, gillibrand <int>, hickenlooper <int>,
## # inslee <int>, h_clinton <int>, bloomberg <int>, m_obama <int>,
## # brown <int>, moulton <int>, gravel <int>, swalwell <int>, kerry <int>,
## # abrams <int>, holder <int>, mc_auliffe <int>, winfrey <int>,
## # ojeda <int>, trump <int>, cuomo <int>, avenatti <int>, kennedy <int>,
## # patrick <int>, zuckerberg <int>, pelosi <int>, garcetti <int>,
## # newsom <int>, schultz <int>, kaine <int>, johnson <int>,
## # kucinich <int>, lee <int>, scott <int>, sinema <int>, warner <int>,
## # na <int>
Now we must tidy the data. We will use tidyr::gather()
to transform the data from wide to long. In short, gather takes the column headers (the key
argument) and creates a new variable from the values of the columns (the value
argument). In this case, we will create a new column called candidate
from the column headers and a second column called points
which are a candidates polling percentage. Next we deselect any columns that we do not want to be gathered.
polls %>%
select(-1) %>%
setNames(new_names)%>%
select(-1) %>%
mutate_at(vars(-c("dates", "pollster", "sample", "sample_2")),
~as.integer(str_remove(., "%"))) %>%
gather(candidate, points, -dates, -pollster, -sample, -sample_2)
## Warning in ~as.integer(str_remove(., "%")): NAs introduced by coercion
## # A tibble: 13,392 x 6
## dates pollster sample sample_2 candidate points
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 Sep 15-18, 2… C+HarrisX 768 LV biden 34
## 2 Sep 15-18, 2… C+HarrisX 1,326 RV biden 27
## 3 Sep 15-17, 2… AFox News 480 LV biden 29
## 4 Sep 14-17, 2… BYouGov 603 LV biden 25
## 5 Sep 13-16, 2… Civiqs 1,291 LV biden 24
## 6 Sep 13-16, 2… ASurveyUSA 1,017 LV biden 33
## 7 Sep 13-16, 2… A-NBC News/Wall Street J… 506 LV biden 31
## 8 Sep 13-15, 2… B-Morning Consult 7,487 LV biden 32
## 9 Sep 10-13, 2… C+HarrisX 799 LV biden 38
## 10 Sep 10-13, 2… C+HarrisX 1,302 RV biden 31
## # … with 13,382 more rows
There are a few more house-keeping things that need to be done to improve this data set. sample_2
is rather uninformative. On the FiveThirtyEight website there is a key which describes what these values represent (A = ADULTS, RV = REGISTERED VOTERS, V = VOTERS, LV = LIKELY VOTERS
). This should be specified in our data set. In addition the sample
column ought to be cast into an integer column. And finally, those messy dates will need to be cleaned. My approach to this requires creating a function to handle this cleaning. First, the simple stuff.
To do the first two above steps, we will continue our function chain and save it to a new variable polls_tidy
.
polls_tidy <- polls %>%
select(-1) %>%
setNames(new_names)%>%
select(-1) %>%
mutate_at(vars(-c("dates", "pollster", "sample", "sample_2")),
~as.integer(str_remove(., "%"))) %>%
gather(candidate, points, -dates, -pollster, -sample, -sample_2) %>%
mutate(sample_2 = case_when(
sample_2 == "RV" ~ "Registered Voters",
sample_2 == "LV" ~ "Likely Voters",
sample_2 == "A" ~ "Adults",
sample_2 == "V" ~ "Voters"
),
sample = as.integer(str_remove(sample, ",")))
## Warning in ~as.integer(str_remove(., "%")): NAs introduced by coercion
polls_tidy
## # A tibble: 13,392 x 6
## dates pollster sample sample_2 candidate points
## <chr> <chr> <int> <chr> <chr> <int>
## 1 Sep 15-18, … C+HarrisX 768 Likely Voters biden 34
## 2 Sep 15-18, … C+HarrisX 1326 Registered V… biden 27
## 3 Sep 15-17, … AFox News 480 Likely Voters biden 29
## 4 Sep 14-17, … BYouGov 603 Likely Voters biden 25
## 5 Sep 13-16, … Civiqs 1291 Likely Voters biden 24
## 6 Sep 13-16, … ASurveyUSA 1017 Likely Voters biden 33
## 7 Sep 13-16, … A-NBC News/Wall Stre… 506 Likely Voters biden 31
## 8 Sep 13-15, … B-Morning Consult 7487 Likely Voters biden 32
## 9 Sep 10-13, … C+HarrisX 799 Likely Voters biden 38
## 10 Sep 10-13, … C+HarrisX 1302 Registered V… biden 31
## # … with 13,382 more rows
6.2.1 Date cleaning
Next we must work to clean the date field. I find that when working with a messy column, creating a single function which handles the cleaning is one of the most effective approaches. Here we will create a function which takes a value provided from the dates
field and return a cleaned date. There are two unique cases I identified. There are poll dates which occurred during a single month, or a poll that spanned two months. The dates are separated by a single hyphen -
. If we split the date at -
we will either receive two elements with a month indicated or one month with a day and a day number. In the latter case we will have to carry over the month. Then the year can be appended to it and parsed as a date using the lubridate
package. For more on lubridate
visit here.
The function will only return one date at a time. The two arguments will be date
and .return
to indicate whether the first or second date should be provided. The internals of this function rely heavily on the stringr
package (see R for Data Science Chapter 14). switch()
at the end of the function determines which date should be returned (see Advanced R Chapter 5).
clean_date <- function(date, .return = "first") {
# take date and split at the comma to get the year and the month-day combo
date_split <- str_split(date, ",") %>%
# remove from list / coerce to vector
unlist() %>%
# remove extra white space
str_trim()
# extract the year
date_year <- date_split[2]
# split the month day portion and coerce to vector
dates <- unlist(str_split(date_split[1], "-"))
# paste the month day and year together then parse as date using `mdy()`
first_date <- paste(dates[1], date_year) %>%
lubridate::mdy()
second_date <- ifelse(!str_detect(dates[2], "[A-z]+"),
yes = paste(str_extract(dates[1], "[A-z]+"),
dates[2],
date_year),
no = paste(dates[2], date_year)) %>%
lubridate::mdy()
switch(.return,
first = return(first_date),
second = return(second_date)
)
}
# test on a date
clean_date(polls_tidy$dates[10], .return = "first")
## [1] "2019-09-10"
clean_date(polls_tidy$dates[10], .return = "second")
## [1] "2019-09-13"
We can use this new function to create two new columns poll_start
and poll_end
using mutate()
. Following this we can deselect the original dates
column, remove any observations missing a points
value, remove duplicates using distinct()
, and save this to polls_clean
.
polls_clean <- polls_tidy %>%
mutate(poll_start = clean_date(dates, "first"),
poll_end = clean_date(dates, "second")) %>%
select(-dates) %>%
filter(!is.na(points)) %>%
distinct()
polls_clean
## # A tibble: 4,911 x 7
## pollster sample sample_2 candidate points poll_start poll_end
## <chr> <int> <chr> <chr> <int> <date> <date>
## 1 C+HarrisX 768 Likely Vo… biden 34 2019-09-15 2019-09-18
## 2 C+HarrisX 1326 Registere… biden 27 2019-09-15 2019-09-18
## 3 AFox News 480 Likely Vo… biden 29 2019-09-15 2019-09-18
## 4 BYouGov 603 Likely Vo… biden 25 2019-09-15 2019-09-18
## 5 Civiqs 1291 Likely Vo… biden 24 2019-09-15 2019-09-18
## 6 ASurveyUSA 1017 Likely Vo… biden 33 2019-09-15 2019-09-18
## 7 A-NBC News/Wal… 506 Likely Vo… biden 31 2019-09-15 2019-09-18
## 8 B-Morning Cons… 7487 Likely Vo… biden 32 2019-09-15 2019-09-18
## 9 C+HarrisX 799 Likely Vo… biden 38 2019-09-15 2019-09-18
## 10 C+HarrisX 1302 Registere… biden 31 2019-09-15 2019-09-18
## # … with 4,901 more rows
6.2.2 Visualization
The cleaned data can be aggregated and visualized.
avg_polls <- polls_clean %>%
group_by(candidate) %>%
summarise(avg_points = mean(points, na.rm = TRUE),
min_points = min(points, na.rm = TRUE),
max_points = max(points, na.rm = TRUE),
n_polls = n() - sum(is.na(points))) %>% # identify how many polls candidate is in
# remove candidates who appear in 50 or fewer polls: i.e. HRC
filter(n_polls > 50) %>%
arrange(-avg_points)
avg_polls
## # A tibble: 26 x 5
## candidate avg_points min_points max_points n_polls
## <chr> <dbl> <int> <int> <int>
## 1 biden 29.7 9 46 242
## 2 sanders 17.5 4 36 247
## 3 warren 11.6 2 43 246
## 4 harris 8.39 2 21 247
## 5 buttigieg 5.17 0 21 218
## 6 o_rourke 4.40 0 21 240
## 7 booker 2.66 0 9 243
## 8 yang 1.60 0 5 197
## 9 klobuchar 1.34 0 5 228
## 10 castro 1.12 0 12 234
## # … with 16 more rows
avg_polls %>%
mutate(candidate = fct_reorder(candidate, avg_points)) %>%
ggplot(aes(candidate, avg_points)) +
geom_col() +
theme_minimal() +
coord_flip() +
labs(title = "Polls Standings", x = "", y = "%")
6.3 Creating historic polling data
It may become useful to have a running history of how candidates have been polling. We can use R to write a csv file of the data from FiveThirtyEight. However, what happens when the polls update? How we can we keep the previous data and the new data? We will work through an example using a combination of bind_rows()
and distinct()
. I want to emphasize that this is not a good practice if you need to scale to hundreds of thousand of rows. This works in this case as the data are inherently small.
To start, I have created a sample dataset which contains 80% of these polls (maybe less by the time you do this!). Note that is probably best to version control this or have multiple copies as a failsafe.
The approach we will take is to read in the historic polls data set and bind rows with the polls_clean
data we have scraped. Next we remove duplicate rows using distinct()
.
old_polls <- read_csv("data/polls.csv")
## Parsed with column specification:
## cols(
## pollster = col_character(),
## sample = col_double(),
## sample_2 = col_character(),
## candidate = col_character(),
## points = col_double(),
## poll_start = col_date(format = ""),
## poll_end = col_date(format = "")
## )
old_polls
## # A tibble: 1,564 x 7
## pollster sample sample_2 candidate points poll_start poll_end
## <chr> <dbl> <chr> <chr> <dbl> <date> <date>
## 1 C+HarrisX 370 Registered… klobuchar 2 2019-06-06 2019-06-10
## 2 C+HarrisX 448 Registered… gillibra… 1 2019-06-06 2019-06-10
## 3 B-Morning Con… 11627 Likely Vot… harris 13 2019-06-06 2019-06-10
## 4 B-Morning Con… 699 Registered… delaney 0 2019-06-06 2019-06-10
## 5 C+HarrisX 743 Registered… williams… 1 2019-06-06 2019-06-10
## 6 A-Quinnipiac … 559 Registered… gabbard 0 2019-06-06 2019-06-10
## 7 B-Morning Con… 14250 Likely Vot… gillibra… 2 2019-06-06 2019-06-10
## 8 A-Quinnipiac … 559 Registered… gillibra… 0 2019-06-06 2019-06-10
## 9 B-Morning Con… 14250 Likely Vot… harris 6 2019-06-06 2019-06-10
## 10 A+Monmouth Un… 330 Registered… warren 8 2019-06-06 2019-06-10
## # … with 1,554 more rows
updated_polls <- bind_rows(old_polls, polls_clean) %>%
distinct()
updated_polls
## # A tibble: 6,475 x 7
## pollster sample sample_2 candidate points poll_start poll_end
## <chr> <dbl> <chr> <chr> <dbl> <date> <date>
## 1 C+HarrisX 370 Registered… klobuchar 2 2019-06-06 2019-06-10
## 2 C+HarrisX 448 Registered… gillibra… 1 2019-06-06 2019-06-10
## 3 B-Morning Con… 11627 Likely Vot… harris 13 2019-06-06 2019-06-10
## 4 B-Morning Con… 699 Registered… delaney 0 2019-06-06 2019-06-10
## 5 C+HarrisX 743 Registered… williams… 1 2019-06-06 2019-06-10
## 6 A-Quinnipiac … 559 Registered… gabbard 0 2019-06-06 2019-06-10
## 7 B-Morning Con… 14250 Likely Vot… gillibra… 2 2019-06-06 2019-06-10
## 8 A-Quinnipiac … 559 Registered… gillibra… 0 2019-06-06 2019-06-10
## 9 B-Morning Con… 14250 Likely Vot… harris 6 2019-06-06 2019-06-10
## 10 A+Monmouth Un… 330 Registered… warren 8 2019-06-06 2019-06-10
## # … with 6,465 more rows
Now you have a cleaned data set which has been integrated with the recently scraped data. Write this to a csv using write_csv()
for later use.