Gumbo.jl seems to do a good job of parsing HTML, but I can’t find an easy way to extract the text from the parsed document (akin to Beautiful Soup in Python).
Has anyone encountered this problem? How did you turn a whole HTMLDocument into a text string?
I cobbled together this code, which does roughly what I want:
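using Gumbo
using AbstractTrees
import Gumbo.text  # extend Gumbo's own text function with a method for whole documents

function text(cur_doc::HTMLDocument)
    string_parts = String[]
    # Walk the tree in document order and keep only the text nodes
    for elem in PreOrderDFS(cur_doc.root)
        isa(elem, HTMLText) || continue
        push!(string_parts, Gumbo.text(elem))
    end
    return join(string_parts, " ")
end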
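
For anyone who wants to try this out, here is a minimal self-contained sketch of the same traversal applied to a small inline HTML string. The function name `extract_text` and the sample markup are made up for illustration, and stripping whitespace and dropping empty pieces is my own assumption about what "text" should mean here, not something Gumbo does for you.

```julia
using Gumbo
using AbstractTrees

# Hypothetical helper (not part of Gumbo): collect the contents of every
# HTMLText node in document order and join them with single spaces.
function extract_text(doc::HTMLDocument)
    parts = String[]
    for elem in PreOrderDFS(doc.root)
        elem isa HTMLText || continue
        # Assumption: surrounding whitespace in each text node is not wanted
        push!(parts, strip(Gumbo.text(elem)))
    end
    # Drop nodes that were whitespace only before joining
    return join(filter(!isempty, parts), " ")
end

# Sample markup invented for this example
doc = parsehtml("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>")
println(extract_text(doc))
```

Joining with a space means inline tags such as `<b>` also introduce a space between pieces, so this is probably only good enough for rough text extraction rather than faithful rendering.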