What is the simplest way to extract the raw text from a website?
Paul
Simplified from https://github.com/oxinabox/DataDepsGenerators.jl/blob/master/src/utils.jl
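using Gumbo
using AbstractTrees  # provides the Leaves iterator used below
# Download the page to a temporary file and parse it into an HTMLDocument
getpage(url) = parsehtml(String(read(download(url))))
# Start from the document root, then keep only HTMLText leaves and join them with spaces
text_only(doc::HTMLDocument) = text_only(doc.root)
text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")
# Convenience wrapper: URL in, plain text out
get_page_text(url) = text_only(getpage(url))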
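To try the helper without touching the network, here is a minimal sketch that runs the same leaf-filtering idea on an inline HTML string. The `sample_html` string and the variable names are made up for illustration, and it assumes both Gumbo and AbstractTrees are installed; treat it as a sketch rather than the package's official usage.

```julia
using Gumbo
using AbstractTrees  # assumed to be needed for Leaves

# Same definitions as above, repeated so this snippet is self-contained
text_only(doc::HTMLDocument) = text_only(doc.root)
text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")

# Hypothetical sample page, used here instead of a real download
sample_html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
sample_doc = parsehtml(sample_html)

# Elements without children (an empty <head>, for example) are leaves too,
# which is why the `leaf isa HTMLText` filter matters
println(text_only(sample_doc))
```

The filter is the important part: functions such as `tag` are only defined for `HTMLElement`, so code that walks the tree should check the node type before calling them.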
Nice, thanks.
text_only(doc.root)
ERROR: UndefVarError: text_only not defined
Version 0.6.0 (2017-06-19 13:05 UTC)
Official http://julialang.org/ release
x86_64-w64-mingw32
using Gumbo
using DataDeps
using DataDepsGenerators
Paul
On 2018-01-31 at 00:49, Lyndon White wrote:
ERROR: MethodError: no method matching tag(::Gumbo.HTMLText)
Closest candidates are:
  tag(::Gumbo.HTMLElement{T}) where T at C:\Users\PC.julia\v0.6\Gumbo\src\manipulation.jl:6
for:
url = "http://www.rp.pl" (and others)
doc=parsehtml(String(read(download(url))))
Any new ideas?
Paul
On 2018-01-31 at 01:20, Sdmcallister wrote: