I am writing a blog post about JuliaCon 2019. I was looking into how to scrape data from all 111 JuliaCon YouTube videos, extracting the title, views, likes, and dislikes for each one.
I couldn’t figure out how to do it in Julia, so I reached for R’s rvest, with the code below:
library(rvest)
library(RSelenium)
rs = RSelenium::rsDriver(browser = "chrome", port=4567L)
rsc = rs$client
rsc$navigate("https://www.youtube.com/playlist?list=PLP8iPy9hna6StY9tIJIUN3F_co9A0zh0H")
ht = rsc$getPageSource()
ok <- xml2::read_html(ht[[1]])
ok %>%
  html_nodes("h3.style-scope.ytd-playlist-video-renderer") %>%
  html_text() -> texts
length(texts)
titles = texts %>%
  strsplit("[|]") %>%
  purrr::map_chr(~ifelse(length(.x) == 1, .x, .x[2]) %>% trimws)
urls = ok %>%
  html_nodes("a.yt-simple-endpoint.style-scope.ytd-playlist-video-renderer") %>%
  html_attr("href")
# go to the video's url and pull out its view, like, and dislike counts
get_views <- function(url) {
  rsc$navigate(paste0("https://www.youtube.com", url))
  ht2 = rsc$getPageSource()
  ok2 <- xml2::read_html(ht2[[1]])
  # the first two menu links are taken as the like and dislike counts
  ok2 %>%
    html_nodes("div.style-scope.ytd-menu-renderer a.yt-simple-endpoint.style-scope") %>%
    html_text %>%
    strsplit("\n") %>%
    purrr::map(~.x[1]) %>%
    unlist -> ok3
  like_dislike = as.integer(ok3[1:2])
  # keep the first space-separated token of the view counter text
  views = ok2 %>%
    html_node("span.view-count.style-scope.yt-view-count-renderer") %>%
    html_text %>%
    strsplit(" ")
  # remove every thousands separator, not just the first one
  views = stringr::str_remove_all(views[[1]][1], ",") %>% as.integer
  data.table::data.table(views = views, likes = like_dislike[1], dislikes = like_dislike[2])
}
library(data.table)
pt = proc.time()
the_data = purrr::map_dfr(urls, get_views)
print(timetaken(pt))
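The string handling in get_views and in the titles step can be checked offline, without a browser session. Below is a minimal sketch of that idea in base R. The helper names `parse_count` and `clean_title` and all of the sample strings are made up for illustration; the text YouTube actually returns may look different.

```r
# Hypothetical helpers; the sample strings further down are invented, not scraped.

# Turn text such as "1,234 views" into an integer.
parse_count <- function(x) {
  # keep the first whitespace-separated token, e.g. "1,234"
  token <- strsplit(trimws(x), "\\s+")[[1]][1]
  # drop every comma before converting
  as.integer(gsub(",", "", token, fixed = TRUE))
}

# Keep the second "|"-separated field of a title when there is one,
# otherwise keep the whole title; trim surrounding whitespace either way.
clean_title <- function(x) {
  parts <- strsplit(x, "|", fixed = TRUE)[[1]]
  trimws(if (length(parts) == 1) parts else parts[2])
}

print(parse_count("1,234,567 views"))
print(clean_title("JuliaCon 2019 | Some Talk Title | Speaker Name"))
print(clean_title("\n  A title without separators\n"))
```

Running the helpers on a few hand-written strings like this makes it easier to tell whether an NA in the final table comes from the parsing or from the page not having loaded yet.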
Ideally, I want to do the whole thing in Julia. So I'd be happy for someone to chime in and show how it can be done with Julia’s Gumbo.jl etc., but otherwise I will just do the web scraping through RCall.jl.