How to extract links from HTML

I want to extract links from HTML.

using Cascadia
using Gumbo
using HTTP, AbstractTrees

r = HTTP.get("https://example.com/")
h = parsehtml(String(r.body))
body = h.root[2]
link = eachmatch(Selector("a"), body)

Result:
1-element Vector{HTMLNode}:
HTMLElement{:a}:
More information…

I want to extract this URL from it:
https://www.iana.org/domains/example

Thanks,

You can explore it at the REPL:

julia-1.8> link = eachmatch(Selector("a"), body)
1-element Vector{HTMLNode}:
 HTMLElement{:a}:<a href="https://www.iana.org/domains/example">
  More information...
</a>

julia-1.8> e1 = first(link)
HTMLElement{:a}:<a href="https://www.iana.org/domains/example">
  More information...
</a>

julia-1.8> e1.    <TAB>
attributes  children    parent

julia-1.8> fieldnames(typeof(e1))
(:children, :parent, :attributes)

julia-1.8> e1.attributes
Dict{AbstractString, AbstractString} with 1 entry:
  "href" => "https://www.iana.org/domains/example"

julia-1.8> e1.attributes["href"]
"https://www.iana.org/domains/example"

julia-1.8> e1.parent
HTMLElement{:p}:<p>
  <a href="https://www.iana.org/domains/example">
    More information...
  </a>
</p>


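Putting the REPL steps above together, here is a minimal sketch that collects the href of every <a> element on a page. The helper name extract_links and the inline sample HTML are made up for illustration; only parsehtml, Selector, eachmatch and the attributes field come from the session above. It skips anchors that have no href, since indexing attributes["href"] on those would throw a KeyError.

```julia
using Gumbo      # parsehtml
using Cascadia   # Selector, eachmatch

# Made-up sample HTML so the sketch runs without network access.
html = """
<html><body>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
<p><a name="anchor-without-href">No link here</a></p>
<p><a href="/relative/path">Relative link</a></p>
</body></html>
"""

# Hypothetical helper: return the href of every <a> element that has one.
function extract_links(html::AbstractString)
    doc = parsehtml(html)
    # Search from the document root, so there is no need to pick out the body first.
    anchors = eachmatch(Selector("a"), doc.root)
    # `attributes` is a Dict, as shown in the REPL session; skip anchors without href.
    return [a.attributes["href"] for a in anchors if haskey(a.attributes, "href")]
end

links = extract_links(html)
foreach(println, links)
```

For a live page, the same function should work on the downloaded body, for example extract_links(String(HTTP.get("https://example.com/").body)) after `using HTTP`. Relative links such as "/relative/path" are returned as written; resolving them against the page URL is left out here.
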
I expanded my script based on this code.
It helped.
Thanks!
