Reading .csv.gz with CSV does not find readavailable(::GZipStream)

I had an original piece of code like this, which worked nicely:

        filepaths = [joinpath(root, f)
                    for (root, dirs, files) in walkdir(root)
                    for f in files[occursin.(fnfeature, files) .& occursin.(r"csv$", files)]]
        df = let OptFloat64=Union{Missing, Float64}, OptInt32=Union{Missing, Int32}
            reduce(vcat, [CSV.read(fp,
                        header=[:domain, :host, :feature, :oid, :largeversion, :clientid,
                                :from, :to, :aggrlevel, :firstocc, :lastocc, :livesuntil,
                                :ct, :sum, :min, :max, :g_lower, :g_upper, :g_ct, :g_sum],
                        types=[String, String, String, Int64, String, String,
                                DateTime, DateTime, Int8, DateTime, DateTime, DateTime,
                                Int32, Float64, Float64, Float64, OptFloat64, OptFloat64, OptInt32, OptFloat64],
                        delim='|')
                        for fp in filepaths]) |> DataFrame
        end

To read .csv.gz files instead, I followed kmundnic’s suggestion on Stack Overflow and rewrote it as follows (so that I’d not have to learn CSVFiles …):

        filepaths = [joinpath(root, f)
                    for (root, dirs, files) in walkdir(root)
                    for f in files[occursin.(fnfeature, files) .& occursin.(r"csv\.gz$", files)]]
        df = let OptFloat64=Union{Missing, Float64}, OptInt32=Union{Missing, Int32}
            reduce(vcat, [GZip.open(fp, "r") do io
                            CSV.read(io,
                                header=[:domain, :host, :feature, :oid, :largeversion, :clientid,
                                        :from, :to, :aggrlevel, :firstocc, :lastocc, :livesuntil,
                                        :ct, :sum, :min, :max, :g_lower, :g_upper, :g_ct, :g_sum],
                                types=[String, String, String, Int64, String, String,
                                        DateTime, DateTime, Int8, DateTime, DateTime, DateTime,
                                        Int32, Float64, Float64, Float64, OptFloat64, OptFloat64, OptInt32, OptFloat64],
                                delim='|')
                          end
                        for fp in filepaths]) |> DataFrame
        end

However, with this I get

ERROR: LoadError: MethodError: no method matching readavailable(::GZipStream)
Closest candidates are:
  readavailable(::Base.Filesystem.File) at filesystem.jl:199
  readavailable(::IOStream) at iostream.jl:396
  readavailable(::Base.AbstractPipe) at io.jl:243
  ...
Stacktrace:
 [1] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::GZipStream) at .\io.jl:579

What am I missing here? Thanks for any help!

// That OptFloat thing is unnecessary, isn’t it? - as the DataValues behind a DataFrame handle “empty values” anyway … But so be it, for the moment …

The problem is that the GZip.jl package doesn’t properly implement the IO interface from Base (hence the missing `readavailable` method for `GZipStream`) and has received very little maintenance over the last few years. I’d recommend using https://github.com/bicycle1885/CodecZlib.jl instead, which is actively maintained and implements the interfaces that CSV.jl needs.
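
For illustration, here is a minimal sketch of how the switch might look. It assumes CodecZlib.jl, CSV.jl and DataFrames.jl are installed; `read_gz_csv` is a hypothetical helper name, and the three-column demo file only stands in for your real 20-column layout. The exact CSV.jl call may need adjusting for your CSV.jl version.

```julia
using CodecZlib, CSV, DataFrames

# Hypothetical helper (not part of any package): decompress a .csv.gz file
# and parse it with CSV.jl. Keyword arguments are passed through to CSV.File.
function read_gz_csv(path; kwargs...)
    open(GzipDecompressorStream, path) do stream
        # Read all decompressed bytes, then parse them from memory.
        CSV.File(IOBuffer(read(stream)); kwargs...) |> DataFrame
    end
end

# Small round trip so the sketch is self-contained:
# write a compressed two-row, pipe-delimited file, then read it back.
path = joinpath(mktempdir(), "demo.csv.gz")
open(GzipCompressorStream, path, "w") do stream
    write(stream, "a|1|2.5\nb|2|3.5\n")
end

df = read_gz_csv(path;
                 header=[:host, :ct, :sum],
                 types=[String, Int32, Float64],
                 delim='|')
```

In your code, that would mean replacing the `GZip.open(fp, "r") do io … end` block with a call like `read_gz_csv(fp; header=…, types=…, delim='|')` inside the comprehension.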


CodecZlib.jl is also much faster than GZip.jl. I use CodecZlib in production and am very happy with it.


Many thanks - I’ll try it tomorrow!

… and I’m happy now, too. Thanks!