Reading .csv.gz with CSV does not find readavailable(::GZipStream)

hmmueller · August 25, 2019, 5:24pm

I had an original piece of code like this, which worked nicely:

        filepaths = [joinpath(root, f)
                    for (root, dirs, files) in walkdir(root)
                    for f in files[occursin.(fnfeature, files) .& occursin.(r"csv$", files)]]
        df = let OptFloat64=Union{Missing, Float64}, OptInt32=Union{Missing, Int32}
            reduce(vcat, [CSV.read(fp,
                        header=[:domain, :host, :feature, :oid, :largeversion, :clientid,
                                :from, :to, :aggrlevel, :firstocc, :lastocc, :livesuntil,
                                :ct, :sum, :min, :max, :g_lower, :g_upper, :g_ct, :g_sum],
                        types=[String, String, String, Int64, String, String,
                                DateTime, DateTime, Int8, DateTime, DateTime, DateTime,
                                Int32, Float64, Float64, Float64, OptFloat64, OptFloat64, OptInt32, OptFloat64],
                        delim='|')
                        for fp in filepaths]) |> DataFrame
        end

To read csv.gz files instead, I followed kmundnic’s suggestion on Stack Overflow (so that I’d not have to learn CSVFiles …) and rewrote this as

        filepaths = [joinpath(root, f)
                    for (root, dirs, files) in walkdir(root)
                    for f in files[occursin.(fnfeature, files) .& occursin.(r"csv\.gz$", files)]]
        df = let OptFloat64=Union{Missing, Float64}, OptInt32=Union{Missing, Int32}
            reduce(vcat, [GZip.open(fp, "r") do io
                            CSV.read(io,
                                header=[:domain, :host, :feature, :oid, :largeversion, :clientid,
                                        :from, :to, :aggrlevel, :firstocc, :lastocc, :livesuntil,
                                        :ct, :sum, :min, :max, :g_lower, :g_upper, :g_ct, :g_sum],
                                types=[String, String, String, Int64, String, String,
                                        DateTime, DateTime, Int8, DateTime, DateTime, DateTime,
                                        Int32, Float64, Float64, Float64, OptFloat64, OptFloat64, OptInt32, OptFloat64],
                                delim='|')
                          end
                        for fp in filepaths]) |> DataFrame
        end

However, with this I get

ERROR: LoadError: MethodError: no method matching readavailable(::GZipStream)
Closest candidates are:
  readavailable(::Base.Filesystem.File) at filesystem.jl:199
  readavailable(::IOStream) at iostream.jl:396
  readavailable(::Base.AbstractPipe) at io.jl:243
  ...
Stacktrace:
 [1] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::GZipStream) at .\io.jl:579

What is it that I don’t understand? Thanks for your help!

// That OptFloat thing is unnecessary, isn’t it? - as the DataValues behind a DataFrame handle “empty values” anyway … But so be it, for the moment …

quinnj · August 26, 2019, 4:14pm

The problem is that the GZip.jl package doesn’t properly implement the IO interface from Base and has received very little maintenance over the last few years. I’d recommend using https://github.com/bicycle1885/CodecZlib.jl instead, which is actively maintained and includes the proper interfaces for CSV.jl.
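
As a rough sketch of what that could look like (not tested against the original data): wrap the opened file in CodecZlib's `GzipDecompressorStream` and hand that stream to CSV. The helper name `read_gz_csv` is made up for illustration, and the `CSV.read(source, DataFrame; ...)` form assumes a reasonably recent CSV.jl; older versions took only the source argument.

        using CSV, CodecZlib, DataFrames

        # Hypothetical helper: read one gzip-compressed CSV file into a DataFrame.
        # Keyword arguments (delim, header, types, ...) are passed through to CSV.read.
        function read_gz_csv(path::AbstractString; kwargs...)
            # Wrap the raw file handle in a stream that decompresses on the fly.
            stream = GzipDecompressorStream(open(path))
            try
                return CSV.read(stream, DataFrame; kwargs...)
            finally
                # Closing the wrapper also closes the underlying file.
                close(stream)
            end
        end

In the snippet from the question, the `GZip.open(fp, "r") do io ... end` block would then become something like `read_gz_csv(fp; header=..., types=..., delim='|')`, with the rest of the `reduce(vcat, ...)` left unchanged.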

js135005 · August 26, 2019, 4:18pm

CodecZlib.jl is also much faster than GZip.jl. I use CodecZlib in production and am very happy with it.

hmmueller · August 26, 2019, 7:21pm

Many thanks - I’ll try it tomorrow!

hmmueller · August 28, 2019, 5:49pm

… and I’m happy now, too. Thanks!
