julia> l = d["Y2D_report_CORE_20180215203106.csv"];
but I cannot do
julia> using FileIO, CSVFiles, DataFrames
julia> l = load(d["Y2D_report_CORE_20180215203106.csv"]) |> DataFrame
ERROR: stat: name too long (ENAMETOOLONG)
Stacktrace:
[1] stat(::String) at ./stat.jl:69
[2] isfile at ./stat.jl:279 [inlined]
[3] query(::String) at /home/js/.julia/v0.6/FileIO/src/query.jl:377
[4] #load#13(::Array{Any,1}, ::Function, ::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52
[5] load(::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52
Now my question: How can I read those CSV-files into DataFrames and how do I use GZip on the others?
julia> fh = GZip.open(d["WR_2004_20180215203106_CORE_0001.xml.gz"])
ERROR: MethodError: no method matching gzopen(::Array{UInt8,1})
Closest candidates are:
gzopen(::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:263
gzopen(::AbstractString, ::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:262
gzopen(::AbstractString, ::AbstractString, ::Integer) at /home/js/.julia/v0.6/GZip/src/GZip.jl:244
...
Stacktrace:
[1] open(::Array{UInt8,1}, ::Vararg{Array{UInt8,1},N} where N) at /home/js/.julia/v0.6/GZip/src/GZip.jl:264
It is not very clear what you are trying to do. What are the values of d, e.g. d["WR_2004_20180215203106_CORE_0001.xml.gz"]? It looks like a byte vector. Why not just use a vector of strings?
FWIW, I would use
to open a stream, and pass it to whatever function I would use otherwise (assuming it accepts streams).
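A minimal sketch of that stream-based idea, assuming the CodecZlib package is installed and that the value stored in d is the raw gzip data as a Vector{UInt8}. It is not necessarily the call the reply had in mind; the sample XML text and the names gz_bytes and stream are made up for illustration.

```julia
using CodecZlib  # provides GzipCompressor and GzipDecompressorStream

# Stand-in for d["....xml.gz"]: gzip-compressed bytes already held in memory.
# (Built here from a small string so the example runs on its own.)
xml_text = "<report>\n  <row id=\"1\"/>\n</report>\n"
gz_bytes = transcode(GzipCompressor, Vector{UInt8}(xml_text))

# GZip.open expects a file name, so wrap the bytes in an IOBuffer instead
# and let CodecZlib decompress them lazily as they are read.
stream = GzipDecompressorStream(IOBuffer(gz_bytes))
lines = collect(eachline(stream))   # any function that accepts an IO works here
close(stream)
```

Because the decompressor is an ordinary IO, it can be handed to a parser that reads from streams, which avoids holding a second, uncompressed copy of a gigabyte-sized member in memory.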
julia> for (filename, data) in open_zip(fn)
println("$filename has $(length(data)) bytes and is of type($(typeof(data))")
end
WR_2004_20180215203106_CORE_0001.xml.gz has 1205519728 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0002.xml.gz has 1206399676 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0003.xml.gz has 1203918434 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0004.xml.gz has 80509285 bytes and is of type(Array{UInt8,1}
Daily_report_CORE_20180215203106.csv has 1499 bytes and is of type(String
Y2D_report_CORE_20180215203106.csv has 1682 bytes and is of type(String
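The listing shows that the CSV entries are Strings holding the file contents, while load(::String) treats its argument as a path (hence the stat/isfile frames and ENAMETOOLONG in the earlier trace). A hedged sketch of one way around that, assuming a recent CSV.jl and DataFrames.jl rather than the v0.6-era CSVFiles setup used above; the column names and values are invented:

```julia
using CSV, DataFrames

# Stand-in for d["Y2D_report_....csv"]: the file's contents as a String.
# The column names and values are made up for illustration.
csv_text = "name,count\nalpha,1\nbeta,2\n"

# Wrap the text in an IOBuffer so it is parsed as data rather than treated as a path.
df = CSV.read(IOBuffer(csv_text), DataFrame)
```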
from io import BytesIO
from mimetypes import guess_type
import gzip
import tarfile
import zipfile

def openfile(filepath):
    """
    f = file as opened
    t = type of file ('z' for zipfile, 'tgz' for tar.gz, 'gz')
    """
    f = None
    t = None
    if guess_type(filepath) == ('application/x-tar', 'gzip'):
        t = 'tgz'
        f = tarfile.open(filepath, "r:gz")
    elif guess_type(filepath)[1] == 'gzip':
        try:
            f = open(filepath, 'rb')
            t = 'gz'
        except IOError as e:
            print('Oh dear.')
    elif guess_type(filepath)[0] == 'application/zip':
        try:
            f = zipfile.ZipFile(filepath, 'r')
            t = 'z'
        except IOError as e:
            print('Oh dear.')
    else:
        try:
            f = open(filepath, 'rb')
            t = None
        except IOError as e:
            print('Oh dear.')
    return f, t

f, t = openfile(filename)
if t == 'z':
    # Collect the gzipped members of the zip archive
    gzipfiles = [each for each in f.namelist() if each.endswith('.gz')]
    for gzipf in gzipfiles:
        gzipf_object = f.open(gzipf)
        # Read the member into memory and decompress it as a gzip stream
        xml = gzip.GzipFile(fileobj=BytesIO(gzipf_object.read()))
        # et cetera
I want to get to the point using Julia where I can use the file embedded in a zip-file like this.
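A rough Julia counterpart of the Python loop above, as a sketch only. It assumes the ZipFile.jl and CodecZlib packages; the demo archive, the member name inner.xml.gz and the contents dictionary are made up so that the example runs on its own.

```julia
using ZipFile    # ZipFile.Reader / ZipFile.Writer
using CodecZlib  # GzipCompressor, GzipDecompressorStream

# Build a small zip containing one gzipped member so the example is runnable.
zip_path = joinpath(mktempdir(), "demo.zip")
w = ZipFile.Writer(zip_path)
entry = ZipFile.addfile(w, "inner.xml.gz")
write(entry, transcode(GzipCompressor, Vector{UInt8}("<a>hello</a>\n")))
close(w)

# Rough equivalent of the Python loop: pick the .gz members and decompress each.
r = ZipFile.Reader(zip_path)
contents = Dict{String,String}()
for f in r.files
    if endswith(f.name, ".gz")
        # Like the Python version, read the member fully, then gunzip from memory.
        stream = GzipDecompressorStream(IOBuffer(read(f)))
        contents[f.name] = read(stream, String)
        close(stream)
    end
end
close(r)
```

For members as large as the ones listed earlier, reading each into a String is probably not what you want; passing the decompressor stream to a streaming parser instead would keep memory use down.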
For what it’s worth, I have found TranscodingStreams together with CodecZlib to be very efficient in handling gzip files, with less than 10% slowdown compared to reading the uncompressed files. However, for zip archives (.zip), ZipFiles is outdated and is over 20 times slower than processing uncompressed files.
I had hoped to either modernize ZipFiles.jl or to write a CodecZlib-like package for zip archives, but I haven’t been able to make the time to do so. As a result, I have dropped support for zip archives and am recommending that our users use gzip compression instead.
My lack of experience in both Julia and using this type of streaming becomes clear.
julia> using CodecZlib
julia> text = open("2004_CORE.zip")
IOStream(<file 2004_CORE.zip>)
From previous experience (see previous post in this thread) I know there is more than one .gz file and more than one plain-text .csv file in this zip file.
Using streaming, how do I find out which files are present, and what their names and types are? And then how do I read them separately, or just one of them?
This is the issue that I’ve hit as well. A zip archive has a directory listing all the included files and the offsets to get to them. The ZipFiles package handles this and allows you to read/write the component files, but it is abysmally slow, so I’ve abandoned it because I’m handling far too much data.
It desperately needs an update or to be replaced entirely.
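For reference, a small sketch of how the archive directory can be inspected and a single member read, assuming the package meant here is ZipFile.jl and leaving the speed concerns above aside. The archive and member names are invented. Note that a zip directory records names and sizes but no content type, so the file extension is the only hint about what a member holds.

```julia
using ZipFile

# Create a small archive with two members so the listing has something to show.
zip_path = joinpath(mktempdir(), "listing.zip")
w = ZipFile.Writer(zip_path)
write(ZipFile.addfile(w, "report.csv"), "a,b\n1,2\n")
write(ZipFile.addfile(w, "notes.txt"), "hello")
close(w)

r = ZipFile.Reader(zip_path)
# r.files is built from the archive's directory; no member data is read yet.
member_names = [f.name for f in r.files]
sizes = Dict(f.name => Int(f.uncompressedsize) for f in r.files)

# Read just one member by name.
idx = findfirst(f -> f.name == "report.csv", r.files)
report = read(r.files[idx], String)
close(r)
```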