Downloading website content (HTTP, BufferedStreams)

Hi,
I would like to download HTML from URLs. For this purpose I use

  • HTTP.jl and
  • BufferedStreams.jl.

using HTTP
using BufferedStreams

url = "https://discourse.julialang.org/"

htmlData = String[]

io = nothing
HTTP.open("GET", url) do http
    io = BufferedInputStream(http)
    while !eof(http)
        push!(htmlData, read(http, String))
    end
end

htmlData[1]

With this code I get the following data:
"\x1f\x8b\b\0\0\0\0\0\0\x03\xcd\\xedr … "

This looks like an encoding problem.

I tinkered around with StringEncodings.jl but could not get it to work.

How can I retrieve clean HTML?

Maybe try the Downloads standard library (Downloads · The Julia Language).
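A minimal sketch of what that could look like, downloading into an in-memory buffer instead of a file. The helper name `fetch_text` is made up for illustration, and whether the server sends plain or compressed HTML to this client is an assumption that would need checking against the real response.

```julia
using Downloads

# Hypothetical helper: download `url` into memory and return it as a String.
function fetch_text(url::AbstractString)
    buf = IOBuffer()
    Downloads.download(url, buf)   # writes the response body into `buf`
    return String(take!(buf))
end

# Usage (needs network access):
# html = fetch_text("https://discourse.julialang.org/")
# println(first(html, 200))
```

If only a file on disk is needed, `Downloads.download(url, path)` with a file path as the second argument does the same without the buffer.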


My guess is that you are getting a gzip stream.
There are a few more flaws in your code, e.g. you never use your io, probably a leftover from experiments while you were hunting for the error.
Last but not least, HTTP typically returns the body as a Vector{UInt8}.

Try this code:

using CodecZlib, HTTP, BufferedStreams

url = "https://discourse.julialang.org/"

htmlData = UInt8[]
HTTP.open("GET", url) do http
    # Wrap the response stream so that the gzip data is decompressed while reading
    io = GzipDecompressorStream(BufferedInputStream(http))
    while !eof(io)
        push!(htmlData, read(io, UInt8))  # append one decompressed byte at a time
    end
end
body = String(htmlData)
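One detail about the final `String(htmlData)` call, shown here as a small standalone sketch: converting a `Vector{UInt8}` to a `String` takes over the vector's memory and leaves the vector empty. If the raw bytes are still needed afterwards, convert a copy instead. The variable names below are only illustrative.

```julia
# Sample bytes standing in for a downloaded body
bytes = Vector{UInt8}("<html></html>")

# Convert a copy: `bytes` keeps its contents
text = String(copy(bytes))
println(length(bytes))

# Convert the vector itself: `bytes` is emptied by the conversion
taken = String(bytes)
println(isempty(bytes))
```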

If there is no special reason for using a BufferedInputStream, it can be much simpler:

using HTTP

url = "https://discourse.julialang.org/"

res = HTTP.request("GET", url)
body = String(res.body)

Yep, that’s right. The output starts with 0x1f 0x8b, the gzip magic bytes, followed by 0x08 (printed as \b), which marks the deflate compression method.
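That check can be sketched in a few lines with CodecZlib. The helpers `is_gzip` and `body_to_string` are hypothetical names, and the compressed sample is generated locally as a stand-in for a real response body, so nothing here touches the network.

```julia
using CodecZlib

# The two-byte gzip magic number
const GZIP_MAGIC = (0x1f, 0x8b)

# Hypothetical helper: true if `bytes` starts with the gzip magic number
is_gzip(bytes::Vector{UInt8}) =
    length(bytes) >= 2 && bytes[1] == GZIP_MAGIC[1] && bytes[2] == GZIP_MAGIC[2]

# Hypothetical helper: decompress only when the payload looks like gzip
function body_to_string(bytes::Vector{UInt8})
    raw = is_gzip(bytes) ? transcode(GzipDecompressor, bytes) : bytes
    return String(copy(raw))   # copy so the caller's vector is left intact
end

# Stand-in for a response body: compress a small HTML snippet locally
html = "<html><body>Hello</body></html>"
compressed = transcode(GzipCompressor, Vector{UInt8}(html))

println(is_gzip(compressed))
println(body_to_string(compressed))
```

Checking the magic bytes is only a heuristic. Where the response headers are available, the Content-Encoding header is the more reliable signal.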
