What is the simplest way to extract the raw text from a website?


#1

What is the simplest way to extract the raw text from a website?
Paul


#2

Simplified from https://github.com/oxinabox/DataDepsGenerators.jl/blob/master/src/utils.jl

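using Gumbo
using AbstractTrees  # provides Leaves

# Download a page and parse it into a Gumbo HTMLDocument
getpage(url) = parsehtml(String(read(download(url))))

# Join the contents of all text nodes, separated by spaces
text_only(doc::HTMLDocument) = text_only(doc.root)
text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")

get_page_text(url) = text_only(getpage(url))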
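
For comparison, here is a minimal sketch of the same idea that walks the tree by hand instead of relying on Leaves, and that also skips the contents of script and style elements. The names collect_text! and plain_text are hypothetical, made up for this illustration; they are not part of Gumbo or DataDepsGenerators. The sketch assumes only Gumbo's tag function and the children and text fields of its node types, and it parses an in-memory string so it can be tried without a network connection.

```julia
using Gumbo

# Hypothetical helper (not from Gumbo): append the stripped contents of every
# text node below `node` to `parts`, ignoring <script> and <style> subtrees.
function collect_text!(parts::Vector{String}, node)
    if node isa HTMLText
        s = strip(node.text)
        isempty(s) || push!(parts, String(s))
    elseif node isa HTMLElement && !(tag(node) in (:script, :style))
        for child in node.children
            collect_text!(parts, child)
        end
    end
    return parts
end

# Hypothetical wrapper: all visible text of a document, separated by spaces.
plain_text(doc::HTMLDocument) = join(collect_text!(String[], doc.root), " ")

html = "<html><body><h1>Title</h1><script>var x = 1;</script><p>Hello <b>world</b></p></body></html>"
doc = parsehtml(html)
println(plain_text(doc))
```

To use it on a live page, the string passed to parsehtml can come from String(read(download(url))) as in the snippet above.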

#3

Nice, thanks.


#4

text_only(doc.root) fails for me:
ERROR: UndefVarError: text_only not defined

My Julia version:
Version 0.6.0 (2017-06-19 13:05 UTC)
Official http://julialang.org/ release
x86_64-w64-mingw32

Packages loaded:
using Gumbo
using DataDeps
using DataDepsGenerators

Paul

On 2018-01-31 at 00:49, Lyndon White wrote:


#5

ERROR: MethodError: no method matching tag(::Gumbo.HTMLText)
Closest candidates are:
  tag(::Gumbo.HTMLElement{T}) where T at C:\Users\PC\.julia\v0.6\Gumbo\src\manipulation.jl:6

This happens for:
url = "http://www.rp.pl"  # and other sites as well
doc = parsehtml(String(read(download(url))))

Any new ideas?
Paul

On 2018-01-31 at 01:20, Sdmcallister wrote: