Extracting views, likes, and dislikes by web scraping the JuliaCon 2019 YouTube videos

I am writing a blog post about JuliaCon 2019 and was looking into how to scrape data from all 111 JuliaCon YouTube videos, extracting the title, view, like, and dislike figures for each.

I couldn’t figure out how to do it in Julia, so I reached for R’s rvest, with the code below:

library(rvest)
library(RSelenium)

rs = RSelenium::rsDriver(browser = "chrome", port=4567L)

rsc = rs$client

rsc$navigate("https://www.youtube.com/playlist?list=PLP8iPy9hna6StY9tIJIUN3F_co9A0zh0H")

ht = rsc$getPageSource()

ok <- xml2::read_html(ht[[1]])

ok %>%
  html_nodes("h3.style-scope.ytd-playlist-video-renderer") %>%
  html_text() -> texts

length(texts)

titles = texts %>%
  strsplit("[|]") %>%
  purrr::map_chr(~ifelse(length(.x) == 1, .x, .x[2]) %>% trimws)


urls = ok %>% 
  html_nodes("a.yt-simple-endpoint.style-scope.ytd-playlist-video-renderer") %>%
  html_attr("href")



# visit one video's page and extract its views, likes, and dislikes
get_views <- function(url) {
  rsc$navigate(paste0("https://www.youtube.com", url))
  
  ht2 = rsc$getPageSource()
  
  ok2 <- xml2::read_html(ht2[[1]])
  
  ok2 %>% 
    html_nodes("div.style-scope.ytd-menu-renderer a.yt-simple-endpoint.style-scope") %>%
    html_text %>%
    strsplit("\n") %>%
    purrr::map(~.x[1]) %>%
    unlist -> ok3
  
  like_dislike = as.integer(ok3[1:2])
  
  views = ok2 %>%
    html_node("span.view-count.style-scope.yt-view-count-renderer") %>%
    html_text %>%
    strsplit(" ")
  
  # remove every thousands separator (str_remove would drop only the first comma)
  views = stringr::str_remove_all(views[[1]][1], ",") %>% as.integer
  
  data.table::data.table(views = views, likes = like_dislike[1], dislikes = like_dislike[2])
}

library(data.table)
pt = proc.time()
the_data = purrr::map_dfr(urls, get_views)
print(timetaken(pt))
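
The count strings scraped above can carry thousands separators and, for larger numbers, abbreviations such as "1.2K", which as.integer turns into NA. Below is a minimal sketch of a more tolerant parser. It assumes YouTube's K/M/B abbreviation scheme, and parse_count is a hypothetical helper of my own naming, not part of rvest, stringr, or the code above.

```r
# Hypothetical helper: turn a scraped count string such as
# "1,234 views", "1.2K", or "3M" into an integer.
parse_count <- function(x) {
  # Keep only the first whitespace-separated token, e.g. "1,234" from "1,234 views"
  token <- sub("\\s.*$", "", trimws(x))
  # Drop every thousands separator, not just the first one
  token <- gsub(",", "", token, fixed = TRUE)
  # Split off an optional K/M/B suffix (assumed abbreviation scheme)
  suffix <- toupper(sub("^[0-9.]+", "", token))
  number <- suppressWarnings(as.numeric(sub("[A-Za-z]+$", "", token)))
  # Look up the multiplier; an unrecognised suffix yields NA rather than a wrong number
  multipliers <- c("1" = 1, K = 1e3, M = 1e6, B = 1e9)
  multiplier <- unname(multipliers[ifelse(suffix == "", "1", suffix)])
  as.integer(round(number * multiplier))
}
```

Inside get_views, this could stand in for the strsplit / str_remove / as.integer steps, for example parse_count(html_text(...)) on the view-count node. Text that is not a number at all comes back as NA, so it is worth checking the result for missing values.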

Ideally, I want to do the whole thing in Julia, so I'd be happy for someone to chime in and show how it can be done with Julia’s Gumbo.jl etc. Otherwise, I will just do the web scraping through RCall.jl.
