edas
December 3, 2022, 12:16pm
1
I want to extract links from HTML
using Cascadia
using Gumbo
using HTTP,AbstractTrees
r = HTTP.get("https://example.com/")
h = parsehtml(String(r.body))
body = h.root[2]
link = eachmatch(Selector("a"), body)
Result:
1-element Vector{HTMLNode}:
HTMLElement{:a}:
More information…
I want to extract this URL from it:
https://www.iana.org/domains/example
Thanks,
You can explore it at the REPL:
julia-1.8> link = eachmatch(Selector("a"), body)
1-element Vector{HTMLNode}:
HTMLElement{:a}:<a href="https://www.iana.org/domains/example">
More information...
</a>
julia-1.8> e1 = first(link)
HTMLElement{:a}:<a href="https://www.iana.org/domains/example">
More information...
</a>
julia-1.8> e1. <TAB>
attributes children parent
julia-1.8> fieldnames(typeof(e1))
(:children, :parent, :attributes)
julia-1.8> e1.attributes
Dict{AbstractString, AbstractString} with 1 entry:
"href" => "https://www.iana.org/domains/example"
julia-1.8> e1.attributes["href"]
"https://www.iana.org/domains/example"
julia-1.8> e1.parent
HTMLElement{:p}:<p>
<a href="https://www.iana.org/domains/example">
More information...
</a>
</p>
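Putting the REPL steps together, here is a minimal sketch that collects the `href` of every matched `<a>` element. It assumes Gumbo and Cascadia are installed, and it parses an inline HTML string (made up for illustration, including a second anchor with no `href`) instead of fetching the page, so it runs without network access. The variable names `html`, `doc`, and `links` are just placeholders.

```julia
using Gumbo, Cascadia

# Illustrative HTML only; the second anchor has no href on purpose.
html = """
<html><body>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
<p><a name="no-href">anchor without a link</a></p>
</body></html>
"""

doc = parsehtml(html)

# eachmatch returns every element matching the selector; each element
# keeps its attributes in a Dict, as seen in the REPL session above.
links = [a.attributes["href"]
         for a in eachmatch(Selector("a"), doc.root)
         if haskey(a.attributes, "href")]

println(links)
```

The `haskey` check is there because indexing `attributes["href"]` on an anchor without that attribute would throw a `KeyError`. With a real page you would pass `String(r.body)` to `parsehtml` as in the original post.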
edas
December 4, 2022, 1:53pm
3
I expanded my script according to this code.
It helped.
Thanks