What is the simplest way to extract the raw text from a website?

programista · January 30, 2018, 3:52pm

What is the simplest way to extract the raw text from a website?
Paul

oxinabox · January 30, 2018, 11:44pm

Simplified from https://github.com/oxinabox/DataDepsGenerators.jl/blob/master/src/utils.jl

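using Gumbo
using AbstractTrees  # provides Leaves, used below

# Download the page and parse it into a Gumbo HTMLDocument
getpage(url) = parsehtml(String(read(download(url))))

# Join the text of every HTMLText leaf, separated by spaces
text_only(doc::HTMLDocument) = text_only(doc.root)
text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")

get_page_text(url) = text_only(getpage(url))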

sdmcallister · January 31, 2018, 12:14am

Nice, thanks.

programista · January 31, 2018, 7:48am

text_only(doc.root)
ERROR: UndefVarError: text_only not defined

Version 0.6.0 (2017-06-19 13:05 UTC)
Official http://julialang.org/ release
x86_64-w64-mingw32

using Gumbo
using DataDeps
using DataDepsGenerators

Paul


programista · February 9, 2018, 4:16pm

ERROR: MethodError: no method matching tag(::Gumbo.HTMLText)
Closest candidates are:
tag(::Gumbo.HTMLElement{T}) where T at
C:\Users\PC.julia\v0.6\Gumbo\src\manipulation.jl:6

for:
url = "http://www.rp.pl" (and others)
doc=parsehtml(String(read(download(url))))

Any new ideas?
Paul
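
The MethodError above says that `tag` is only defined for `HTMLElement`, not for `HTMLText`. It is not clear from the post which call triggered it, so the following is only a sketch of one way to avoid the problem: walk the tree with one method per node type, so that `tag` is never called on a text node. The names `collect_text!` and `page_text` are made up for this example and are not part of Gumbo, and skipping `script` and `style` elements is an extra assumption about what "raw text" should mean. The HTML is an inline string so the example runs without a network connection.

```julia
using Gumbo

# Text node: just record its content. `tag` is never called here.
collect_text!(out, node::HTMLText) = push!(out, node.text)

# Element node: skip script/style (an assumption of this sketch),
# otherwise recurse into the children.
function collect_text!(out, node::HTMLElement)
    tag(node) in (:script, :style) && return out
    for child in node.children
        collect_text!(out, child)
    end
    return out
end

# Hypothetical helper: all visible text of a parsed document, space-separated.
page_text(doc::HTMLDocument) = join(collect_text!(String[], doc.root), " ")

doc = parsehtml("<html><body><h1>Hello</h1><script>var x = 1;</script><p>world</p></body></html>")
println(page_text(doc))
```

For a real page, the inline string could be replaced by `String(read(download(url)))` as in the earlier reply.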
