Hi,
I would like to download HTML from URLs. For this purpose I use HTTP.jl and BufferedStreams.jl.
using HTTP
using BufferedStreams
url = "https://discourse.julialang.org/"
htmlData = String[]
io = nothing
HTTP.open("GET", url) do http
    io = BufferedInputStream(http)
    while !eof(http)
        push!(htmlData, read(http, String))
    end
end
htmlData[1]
With this code I get the following data:
"\x1f\x8b\b\0\0\0\0\0\0\x03\xcd\\xedr … "
This looks like an encoding problem.
I tinkered around with StringEncodings.jl but could not get it to work.
My guess is that you are getting a gzip stream: the leading bytes \x1f\x8b are the gzip magic number.
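To check that guess without any extra packages, you can compare the first two bytes against 0x1f 0x8b. A minimal sketch; the helper name isgzip and the sample byte vectors are made up for illustration:

```julia
# Hypothetical helper: true if the data starts with the gzip magic number 0x1f 0x8b
isgzip(data::AbstractVector{UInt8}) = length(data) >= 2 && data[1] == 0x1f && data[2] == 0x8b

compressed_like = UInt8[0x1f, 0x8b, 0x08, 0x00]   # same leading bytes as in the output above
plain_html = Vector{UInt8}("<!DOCTYPE html>")     # uncompressed HTML starts with '<'

println(isgzip(compressed_like))  # true
println(isgzip(plain_html))       # false
```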
There are some more flaws in your code, e.g. you never use your io, probably a leftover from experiments while you were hunting for the error.
Last but not least, HTTP.jl typically returns the body as a Vector{UInt8}.
Try this code:
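using CodecZlib, HTTP, BufferedStreams
url = "https://discourse.julialang.org/"
htmlData = UInt8[]  # the decompressed bytes are collected here
HTTP.open("GET", url) do http
    # wrap the response stream so the gzip data is decompressed while reading
    io = GzipDecompressorStream(BufferedInputStream(http))
    while !eof(io)
        push!(htmlData, read(io, UInt8))  # read one decompressed byte at a time
    end
end
body = String(htmlData)  # turn the bytes into a String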
If there is no special reason for using a BufferedInputStream, it gets much easier:
using HTTP
url = "https://discourse.julialang.org/"
res = HTTP.request("GET", url)
body = String(res.body)
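If a body obtained this way should still start with \x1f\x8b, the complete byte vector can be decompressed in one call with transcode from CodecZlib instead of wrapping a stream. A rough sketch that runs offline; the sample HTML string stands in for a compressed res.body and is an assumption, not real server output:

```julia
using CodecZlib

# Stand-in for a gzip-compressed response body (assumption: made-up sample data)
html = "<html><body>hello</body></html>"
compressed = transcode(GzipCompressor, Vector{UInt8}(html))

# One-shot decompression of the complete byte vector
decompressed = transcode(GzipDecompressor, compressed)
body = String(decompressed)
println(body == html)
```

This only makes sense once all bytes are in memory; for large downloads the streaming version with GzipDecompressorStream is probably the better fit.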