I had an original piece of code like this, which worked nicely:
using CSV, DataFrames, Dates

# Collect every file under `root` whose name matches `fnfeature` and ends in .csv
filepaths = [joinpath(root, f)
             for (root, dirs, files) in walkdir(root)
             for f in files[occursin.(fnfeature, files) .& occursin.(r"\.csv$", files)]]

# Read each pipe-delimited file with explicit column names and types, then stack the results
df = let OptFloat64=Union{Missing, Float64}, OptInt32=Union{Missing, Int32}
    reduce(vcat, [CSV.read(fp,
                           header=[:domain, :host, :feature, :oid, :largeversion, :clientid,
                                   :from, :to, :aggrlevel, :firstocc, :lastocc, :livesuntil,
                                   :ct, :sum, :min, :max, :g_lower, :g_upper, :g_ct, :g_sum],
                           types=[String, String, String, Int64, String, String,
                                  DateTime, DateTime, Int8, DateTime, DateTime, DateTime,
                                  Int32, Float64, Float64, Float64, OptFloat64, OptFloat64, OptInt32, OptFloat64],
                           delim='|')
                  for fp in filepaths]) |> DataFrame
end
To read csv.gz files instead, I followed kmundnic’s suggestion on Stack Overflow (so that I would not have to learn CSVFiles …) and rewrote this as:
using GZip  # in addition to CSV, DataFrames and Dates

# Same file search as before, but for names ending in .csv.gz
filepaths = [joinpath(root, f)
             for (root, dirs, files) in walkdir(root)
             for f in files[occursin.(fnfeature, files) .& occursin.(r"\.csv\.gz$", files)]]

# Open each gzipped file and pass the resulting GZipStream to CSV.read
df = let OptFloat64=Union{Missing, Float64}, OptInt32=Union{Missing, Int32}
    reduce(vcat, [GZip.open(fp, "r") do io
                      CSV.read(io,
                               header=[:domain, :host, :feature, :oid, :largeversion, :clientid,
                                       :from, :to, :aggrlevel, :firstocc, :lastocc, :livesuntil,
                                       :ct, :sum, :min, :max, :g_lower, :g_upper, :g_ct, :g_sum],
                               types=[String, String, String, Int64, String, String,
                                      DateTime, DateTime, Int8, DateTime, DateTime, DateTime,
                                      Int32, Float64, Float64, Float64, OptFloat64, OptFloat64, OptInt32, OptFloat64],
                               delim='|')
                  end
                  for fp in filepaths]) |> DataFrame
end
However, with this I get
ERROR: LoadError: MethodError: no method matching readavailable(::GZipStream)
Closest candidates are:
  readavailable(::Base.Filesystem.File) at filesystem.jl:199
  readavailable(::IOStream) at iostream.jl:396
  readavailable(::Base.AbstractPipe) at io.jl:243
  ...
Stacktrace:
 [1] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::GZipStream) at .\io.jl:579
What is it that I don’t understand? Thanks for any help!
P.S. The OptFloat64/OptInt32 aliases are probably unnecessary, aren’t they? As far as I understand, the DataValues behind a DataFrame handle “empty values” anyway. But so be it, for the moment.