How to download, extract and import a zipped or tgz CSV file from the internet?

For a CSV file, I am using a combination of Pipe, HTTP and CSV to download, optionally modify, and import the file as a DataFrame without writing a temporary file to disk:

urlData = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.csv"
data = @pipe HTTP.get(urlData).body                |>
             replace!(_, UInt8(';') => UInt8(' ')) |>  # if we need to apply modifications to the file before importing 
             CSV.File(_, delim=' ')                |>
             DataFrame;

How can I use the same general approach (without saving a temporary file to disk) when the file has been compressed with zip or tar/gz (assuming a single file in the archive)?

For example:

urlDataZ = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.zip"
urlDataT = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.tgz"

Cross-posted on Stack Overflow: "How to use Pipe/HTTP/CSV to download, extract and import a zipped or tgz csv file from internet?"

Here is an example with a gzip-compressed CSV file (if the archive holds a single file there isn’t really a point in using Tar, but the example can be modified to put a Tar.jl pipeline in there too):

Testfile:

$ gzip -dc file.csv.gz 
a,b
1,a
2,b
3,c

Example 1: download → decompress → CSV → DataFrame:

import Downloads, SimpleBufferStream, CodecZlib, CSV, DataFrames

url = "file://localhost$(pwd())/file.csv.gz";

df = @sync begin
    # BufferStream for in-flight bytes
    bs = SimpleBufferStream.BufferStream()
    # Download bytes into a decompressor stream
    @async begin
        io = CodecZlib.GzipDecompressorStream(bs)
        Downloads.download(url, io)
        close(io) # close to signal we are done
    end
    # Read decompressed bytes from bs into a DataFrame
    csv_task = @async begin
        f = CSV.File(bs)
        DataFrames.DataFrame(f)
    end
    fetch(csv_task) # value of the @sync block, assigned to df above
end

This gives:

julia> df
3×2 DataFrame
 Row │ a      b       
     │ Int64  String1 
─────┼────────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c
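
If the compressed file fits comfortably in memory, the streaming setup is not strictly required. Below is a minimal sketch of a non-streaming variant, assuming the downloaded bytes are already available as a Vector{UInt8} (for instance from HTTP.get(url).body, as in the question). To keep it runnable without a network, the test file is gzipped in memory as a stand-in; the variable names are illustrative only.

```julia
import CodecZlib, CSV, DataFrames

# Stand-in for the downloaded bytes (assumption: in real use this would be
# something like HTTP.get(url).body). We gzip the test file in memory.
raw = Vector{UInt8}("a,b\n1,a\n2,b\n3,c\n")
gz = transcode(CodecZlib.GzipCompressor, raw)

# Decompress the whole payload in one call, then parse the bytes directly.
bytes = transcode(CodecZlib.GzipDecompressor, gz)
df = DataFrames.DataFrame(CSV.File(bytes))
```

The trade-off is that both the compressed and the decompressed data are held in memory at once, whereas the streaming version above only buffers the bytes in flight.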

Example 2: download → decompress → modify some bytes → CSV → DataFrame

import Downloads, SimpleBufferStream, CodecZlib, CSV, DataFrames

url = "file://localhost$(pwd())/file.csv.gz";

df = @sync begin
    # BufferStream for in-flight bytes
    bs1 = SimpleBufferStream.BufferStream()
    bs2 = SimpleBufferStream.BufferStream()
    # Download bytes into a decompressor stream
    @async begin
        io = CodecZlib.GzipDecompressorStream(bs1)
        Downloads.download(url, io)
        close(io) # close to signal we are done
    end
    # Rewrite 'a'-bytes to 'z'-bytes
    @async begin
        while !eof(bs1)
            bytes = readavailable(bs1)
            for i in eachindex(bytes)
                b = bytes[i]
                if b == UInt8('a')
                    bytes[i] = UInt8('z')
                end
            end
            write(bs2, bytes)
        end
        close(bs2) # close to signal we are done
    end
    # Read modified bytes from bs2 into a DataFrame
    csv_task = @async begin
        # Read decompressed bytes from bs2
        f = CSV.File(bs2)
        # Create a DataFrame
        DataFrames.DataFrame(f)
    end
    fetch(csv_task) # value of the @sync block, assigned to df above
end

This gives:

julia> df
3×2 DataFrame
 Row │ z      b       
     │ Int64  String1 
─────┼────────────────
   1 │     1  z
   2 │     2  b
   3 │     3  c
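
The byte-rewriting stage in Example 2 does not depend on the download or the decompressor, so it can be tried in isolation. The sketch below uses only Base, with plain IOBuffers standing in for the BufferStreams; rewrite_bytes! is a hypothetical helper name, not part of any package.

```julia
# Hypothetical helper: copy `src` to `dst`, replacing every `from` byte with `to`.
function rewrite_bytes!(dst::IO, src::IO, from::UInt8, to::UInt8)
    while !eof(src)
        bytes = readavailable(src)   # whatever is currently buffered
        replace!(bytes, from => to)  # in-place byte substitution
        write(dst, bytes)
    end
    return dst
end

# IOBuffers stand in for bs1 and bs2 from Example 2.
src = IOBuffer("a,b\n1,a\n2,b\n3,c\n")
dst = IOBuffer()
rewrite_bytes!(dst, src, UInt8('a'), UInt8('z'))
result = String(take!(dst))
```

Because the substitution is byte for byte, it is safe for single-byte (ASCII) characters; replacing multi-byte UTF-8 characters would need a different approach.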

Note that you can use HTTP.get(url; response_stream = io) instead of Downloads.download(url, io), but HTTP.jl doesn’t support file:// URLs, so I used Downloads.jl in this example.
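
For the tar/gz case from the question, one possible sketch chains a gzip decompressor stream into Tar.extract. Note the caveat: Tar.extract unpacks into a directory (a temporary one if none is given), so this variant does touch the disk for the extracted file, even though the archive itself is never saved. The small archive is built locally as a stand-in for the downloaded bytes, and the file and variable names are assumptions for illustration.

```julia
import Tar, CodecZlib, CSV, DataFrames

# Build a small .tgz in memory as a stand-in for the downloaded bytes.
src_dir = mktempdir()
write(joinpath(src_dir, "data.csv"), "a,b\n1,a\n2,b\n3,c\n")
tar_io = IOBuffer()
Tar.create(src_dir, tar_io)
tgz = transcode(CodecZlib.GzipCompressor, take!(tar_io))

# Decompress on the fly; Tar.extract unpacks into a temporary directory
# and returns its path.
out_dir = Tar.extract(CodecZlib.GzipDecompressorStream(IOBuffer(tgz)))

# Assumes the archive contains exactly one file, as stated in the question.
csv_path = joinpath(out_dir, only(readdir(out_dir)))
df = DataFrames.DataFrame(CSV.File(csv_path))
```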
