How to plumb together download -> uncompress -> untar without writing full downloaded file

which sounds like some problem with Tar waiting for the buffer to fill

Yes, I thought there was a short read happening here. If all doesn’t work, I was hoping that just calling read again with the remaining size would be OK, but this doesn’t seem to work either.

The main trouble we’re having here is that Base.BufferStream isn’t a public API, so it’s not that well tested: some things like skip are missing, and the precise blocking behavior isn’t really documented. (I noticed something a bit weird about blocking: BufferStream seems to block only on read, not on write, so the internal buffer might be able to grow indefinitely. That is fine in this case because the Tar reader is likely faster than the writer doing the download, but really not great if you were tarring and uploading!)
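
To make the write side concrete, here is a minimal sketch, assuming the internal Base.BufferStream accepts writes with no reader attached (it is unexported, so this may differ between Julia versions). The chunk size and count are arbitrary values chosen for illustration.

```julia
# Sketch only: Base.BufferStream is internal, unexported API.
bs = Base.BufferStream()

chunk = zeros(UInt8, 1024 * 1024)   # 1 MiB of dummy data (arbitrary size)

# No task is reading from `bs`, yet these writes return without blocking
# and the internal buffer simply grows.
for _ in 1:4
    write(bs, chunk)
end

buffered = bytesavailable(bs)       # bytes queued and waiting for a reader
@info "Buffered without any reader" buffered

close(bs)
```

With a slow consumer and a fast producer, the same pattern would keep accumulating data in memory.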

Another alternative is to use Pipe, which is a publicly defined API and widely used for several things. IIUC the cost of blocking on a read or write to a Pipe will be a lot higher than with BufferStream because the Pipe needs to go through the operating system kernel. But the pipe should have a fixed-size buffer, so it should block on both the read and the write side, which should be a lot more sensible in general.
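
As a minimal sketch of the Pipe wiring on its own, here is a small round trip between a writer task and a reader task. It assumes Base.link_pipe! and the `in` field of Pipe behave as in current Julia releases (both are internal details), and the payload string is a made-up example.

```julia
# Sketch only: Base.link_pipe! and the `in` field of Pipe are internal details.
io = Pipe()
Base.link_pipe!(io)

payload = "hello through the kernel pipe"   # hypothetical small payload
received = Ref("")

@sync begin
    # Writer task: push the payload, then close the write end so the
    # reader sees end-of-file.
    @async begin
        write(io, payload)
        close(io.in)
    end
    # Reader task: blocks until data arrives and returns once the
    # write end has been closed.
    @async begin
        received[] = read(io, String)
    end
end
close(io)

@info "Round trip finished" received[]
```

Closing only the write end lets the reader drain whatever is still buffered before it sees end-of-file.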

The following seemed to work for me:

using Tar, Downloads

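# Neither stream type is seekable, so skip forward by reading and discarding bytes.
# Note: this adds a method to a Base function on Base types (type piracy),
# which is acceptable for a script but not for a package.
function Base.skip(io::Union{Base.BufferStream,Pipe}, n)
    n < 0 && error("Can't skip backward in Pipe or BufferStream")
    while n > 0 && isopen(io)
        # read may return fewer than n bytes (a short read), so keep going
        buf = read(io, n)
        n -= length(buf)
    end
    return io
end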

io = Pipe()
# Initialize the pipe. I'm not sure there's a public API for this ??
Base.link_pipe!(io)
# Alternatively, use BufferStream... which should work
# but seems to get stuck for some reason
# io = Base.BufferStream()
@sync begin
    @async try
        Downloads.download("https://data.proteindiffraction.org/ssgcid/3lls.tar", io)
        @info "Download complete"
    catch exc
        @error "Caught exception" exc
    finally
        close(io)
    end
    @async try
        # The predicate decides, per tar entry, whether to extract it
        loc = Tar.extract(io) do x
            if x.path == "3lls/series/200873f12_x0181.img.bz2"
                @info "Extracting" x
                true
            else
                @info "Ignoring" x
                false
            end
        end
        @info "Untar complete" loc
    catch exc
        @error "Caught exception" exc
    finally
        close(io)
    end
end

If you want to abort the download once the file of interest has been read and extracted, you may be able to call close(io).
