How to extract links from HTML

edas · December 3, 2022, 12:16pm

I want to extract links from HTML.

using Cascadia
using Gumbo
using HTTP, AbstractTrees

r = HTTP.get("https://example.com/")
h = parsehtml(String(r.body))
body = h.root[2]
link = eachmatch(Selector("a"), body)

Result:
1-element Vector{HTMLNode}:
HTMLElement{:a}:
More information…

I want to extract this URL:
https://www.iana.org/domains/example

Thanks,

cormullion · December 3, 2022, 3:51pm

You can explore it at the REPL:

julia-1.8> link = eachmatch(Selector("a"), body)
1-element Vector{HTMLNode}:
 HTMLElement{:a}:<a href="https://www.iana.org/domains/example">
  More information...
</a>

julia-1.8> e1 = first(link)
HTMLElement{:a}:<a href="https://www.iana.org/domains/example">
  More information...
</a>

julia-1.8> e1.    <TAB>
attributes  children    parent

julia-1.8> fieldnames(typeof(e1))
(:children, :parent, :attributes)

julia-1.8> e1.parent
HTMLElement{:p}:<p>
  <a href="https://www.iana.org/domains/example">
    More information...
  </a>
</p>

julia-1.8> e1.attributes
Dict{AbstractString, AbstractString} with 1 entry:
  "href" => "https://www.iana.org/domains/example"

julia-1.8> e1.attributes["href"]
"https://www.iana.org/domains/example"
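
If the page has more than one link, the same attribute lookup can be applied to every match. Here is a minimal sketch of that idea. The inline `html` string is a made-up stand-in for `String(r.body)`, so it runs without a network request, and the names `anchors` and `links` are only illustrative. An `<a>` tag without an `href` would throw a `KeyError` on `attributes["href"]`, so the sketch checks with `haskey` first.

```julia
using Gumbo, Cascadia

# Hypothetical page content, standing in for String(r.body) from the question.
html = """
<html><body>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
<p><a name="top">anchor without href</a></p>
<p><a href="/relative/path">relative link</a></p>
</body></html>
"""

doc = parsehtml(html)

# All <a> elements anywhere under the document root.
anchors = eachmatch(Selector("a"), doc.root)

# Keep only the anchors that actually carry an href attribute.
links = [a.attributes["href"] for a in anchors if haskey(a.attributes, "href")]

foreach(println, links)
```

Note that relative links such as `/relative/path` come back exactly as written in the page, so they still need to be joined with the base URL if you want to fetch them.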

edas · December 4, 2022, 1:53pm

I expanded my script according to this code, and it helped.
Thanks
