[ANN] Harbest.jl v0.4 - Simple Web Scraping

Harbest.jl v0.4

Harbest.jl is a package for simple web scraping.

What’s new?

  • html_table function
  • different and cleaner html_text3 output
  • html_elements function now accepts a Vector{String} as an argument
  • actual documentation
  • improved docstrings

Example

using Harbest

starwars = read_html("https://rvest.tidyverse.org/articles/starwars.html")

titles = html_elements(starwars, ["section", "h2"]) |> html_text3
titles
# 7-element Vector{String}:
#  "The Phantom Menace"
#  "Attack of the Clones"
#  "Revenge of the Sith"
#  ⋮
#  "Return of the Jedi"
#  "The Force Awakens"

html = read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")
table = html_elements(html, ".tracklist") |> html_table
table
# 28×4 DataFrame
#  Row │ No.     Title                              Performer(s)                       Length
#      │ String  String                             String                             String
# ─────┼──────────────────────────────────────────────────────────────────────────────────────
#    1 │ 1.      "Everything Is Awesome"            Tegan and Sara featuring The Lon…  2:43
#    2 │ 2.      "Prologue"                                                            2:28
#    3 │ 3.      "Emmett's Morning"                                                    2:00
#    4 │ 4.      "Emmett Falls in Love"                                                1:11
#    5 │ 5.      "Escape"                                                              3:26
#   ⋮  │   ⋮                     ⋮                                  ⋮                    ⋮
#   25 │ 25.     "Everything Is Awesome"            Jo Li (Joshua Bartholomew and Li…  1:26
#   26 │ 26.     "Everything Is Awesome (unplugge…  Shawn Patterson and Sammy Allen    1:24
#   27 │ 27.     "Untitled Self Portrait"           Will Arnett                        1:08
#   28 │ 28.     "Everything Is Awesome (instrume…                                     2:41
#                                                                              19 rows omitted
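
In the output above every column is a String, including Length, so a likely next step is converting those values to numbers. Below is a minimal sketch of that, assuming DataFrames.jl is installed. It uses a small hand-copied sample of the rows above instead of the scraped table, so it runs offline. The helper name to_seconds is hypothetical and not part of Harbest.jl.

```julia
using DataFrames

# Hand-copied sample of the first three rows shown above (not scraped live).
tracks = DataFrame(
    "No." => ["1.", "2.", "3."],
    "Title" => ["\"Everything Is Awesome\"", "\"Prologue\"", "\"Emmett's Morning\""],
    "Length" => ["2:43", "2:28", "2:00"],
)

# Hypothetical helper (not part of Harbest.jl): convert "m:ss" to total seconds.
function to_seconds(duration::AbstractString)
    minutes, seconds = parse.(Int, split(duration, ":"))
    return 60 * minutes + seconds
end

# Add a numeric column next to the original String column.
tracks.Seconds = to_seconds.(tracks.Length)
```

The same broadcasted call should work on the scraped table as long as every Length entry has the m:ss form.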