Downloading website content (HTTP, BufferedStreams)

Anton_Hinneck · April 6, 2024, 1:14am

Hi,
I would like to download HTML from URLs. For this purpose I use

HTTP.jl and
BufferedStreams.jl.


using HTTP
using BufferedStreams

url = "https://discourse.julialang.org/"

htmlData = String[]

io = nothing
HTTP.open("GET", url) do http
    io = BufferedInputStream(http)
    while !eof(http)
        push!(htmlData, read(http, String))
    end
end

htmlData[1]

With this code I get the following data:
"\x1f\x8b\b\0\0\0\0\0\0\x03\xcd\\xedr … "

This looks like some kind of encoding problem.

I tinkered around with StringEncodings.jl but could not get it to work.

How can I retrieve clean HTML?

jules · April 6, 2024, 6:54am

Maybe try the Downloads standard library (see "Downloads" in the Julia documentation).
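
A minimal sketch of that suggestion, assuming the Downloads standard library that ships with Julia. The helper name `fetch_html` is hypothetical and not from this thread. Whether the result arrives already decompressed depends on the server and on the library defaults, so it is worth inspecting the first bytes of what comes back.

```julia
using Downloads

# Hypothetical helper (not part of Downloads): fetch a URL and
# return the response body as a String.
function fetch_html(url::AbstractString)
    buf = IOBuffer()
    Downloads.download(url, buf)   # writes the response body into buf
    return String(take!(buf))
end

# Usage (requires network access):
# html = fetch_html("https://discourse.julialang.org/")
```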

oheil · April 6, 2024, 9:12am

My guess is that you are getting a gzip stream.
Your code also has a few other flaws; for example, you never use your io, probably a leftover from experiments while you were hunting the error.
Last but not least, HTTP typically returns the body as a Vector{UInt8}.

Try this code:

using CodecZlib, HTTP, BufferedStreams

url = "https://discourse.julialang.org/"

htmlData = UInt8[]
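HTTP.open("GET", url) do http
    # Wrap the response stream so the gzip data is decompressed while reading
    io = GzipDecompressorStream(BufferedInputStream(http))
    while !eof(io)
        push!(htmlData, read(io, UInt8))
    end
end
body = String(htmlData)  # convert the decompressed bytes to a String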

If there is no special reason for using a BufferedInputStream, you can do this much more simply:

using HTTP

url = "https://discourse.julialang.org/"

res = HTTP.request("GET", url)
body = String(res.body)

abraemer · April 6, 2024, 8:09pm

Yep, that’s right. The first two bytes (\x1f\x8b) are the gzip magic number, and the third (\b, i.e. 0x08) is the compression method, deflate.
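
A quick way to confirm this on any downloaded data is to compare its first two bytes against the gzip magic number 0x1f 0x8b. The sketch below uses only Base Julia; `is_gzip` is a hypothetical helper name, not part of any package, and the sample bytes are the start of the output quoted in the question.

```julia
# Hypothetical helper: true if `bytes` starts with the gzip magic number 0x1f 0x8b.
is_gzip(bytes::AbstractVector{UInt8}) =
    length(bytes) >= 2 && bytes[1] == 0x1f && bytes[2] == 0x8b

# First bytes of the output shown in the question
raw = codeunits("\x1f\x8b\b\0\0\0\0\0\0\x03")

println(is_gzip(raw))                           # true
println(is_gzip(codeunits("<!DOCTYPE html>")))  # false
```

If the check returns true, the data still needs to be decompressed (for example with CodecZlib, as in the answer above) before it can be read as HTML.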