Way to transform HTML into Text?

djsegal · March 3, 2020, 12:01am

It seems like Gumbo.jl does a good job of parsing HTML, but there doesn’t seem to be an easy way to extract the text from the parsed document (akin to Beautiful Soup in Python).

Has anyone encountered this problem? How did you turn a whole HTMLDocument into a text string?

djsegal · March 3, 2020, 12:43am

I cobbled together this code, which roughly does what I want:

using Gumbo
using AbstractTrees

import Gumbo.text

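# Extend Gumbo.text so it also accepts a whole document.
function text(cur_doc::HTMLDocument)
    string_parts = String[]

    # Walk the tree depth-first and keep only the text nodes.
    for elem in PreOrderDFS(cur_doc.root)
        isa(elem, HTMLText) || continue
        push!(string_parts, Gumbo.text(elem))
    end

    # Join the pieces with single spaces.
    return join(string_parts, " ")
end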
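
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch of the same traversal applied to a small HTML string. The helper name `html_to_text` and the sample markup are made up for illustration and are not part of Gumbo.jl. It also assumes that trimming whitespace and dropping empty text nodes is what you want, which the code above does not do.

```julia
using Gumbo
using AbstractTrees

# Hypothetical helper (not part of Gumbo.jl): collect the text nodes of a
# parsed document in document order and join them with single spaces.
function html_to_text(doc::HTMLDocument)
    parts = String[]
    for elem in PreOrderDFS(doc.root)
        elem isa HTMLText || continue      # skip element nodes
        piece = strip(elem.text)           # trim surrounding whitespace
        isempty(piece) || push!(parts, piece)
    end
    return join(parts, " ")
end

# Sample input, invented for this example.
doc = parsehtml("<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>")
println(html_to_text(doc))
```

Using a separate function name avoids adding a method to `Gumbo.text` for a type that Gumbo itself owns, at the cost of not being able to call `text(doc)` directly.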
