I am exploring InfoZIP.
Variable “fn” contains the filename.
julia> d = open_zip(fn)
julia> d.keys
6-element Array{AbstractString,1}:
"WR_2004_20180215203106_CORE_0001.xml.gz"
"WR_2004_20180215203106_CORE_0002.xml.gz"
"WR_2004_20180215203106_CORE_0003.xml.gz"
"WR_2004_20180215203106_CORE_0004.xml.gz"
"Daily_report_CORE_20180215203106.csv"
"Y2D_report_CORE_20180215203106.csv"
From there I can do
julia> l = d["Y2D_report_CORE_20180215203106.csv"];
but I cannot do
julia> using FileIO, CSVFiles, DataFrames
julia> l = load(d["Y2D_report_CORE_20180215203106.csv"]) |> DataFrame
ERROR: stat: name too long (ENAMETOOLONG)
Stacktrace:
[1] stat(::String) at ./stat.jl:69
[2] isfile at ./stat.jl:279 [inlined]
[3] query(::String) at /home/js/.julia/v0.6/FileIO/src/query.jl:377
[4] #load#13(::Array{Any,1}, ::Function, ::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52
[5] load(::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52
Now my question: How can I read those CSV-files into DataFrames and how do I use GZip on the others?
julia> fh = GZip.open(d["WR_2004_20180215203106_CORE_0001.xml.gz"])
ERROR: MethodError: no method matching gzopen(::Array{UInt8,1})
Closest candidates are:
gzopen(::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:263
gzopen(::AbstractString, ::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:262
gzopen(::AbstractString, ::AbstractString, ::Integer) at /home/js/.julia/v0.6/GZip/src/GZip.jl:244
...
Stacktrace:
[1] open(::Array{UInt8,1}, ::Vararg{Array{UInt8,1},N} where N) at /home/js/.julia/v0.6/GZip/src/GZip.jl:264
It is not very clear what you are trying to do. What are the values of d, e.g. d["WR_2004_20180215203106_CORE_0001.xml.gz"]? It looks like a byte vector. Why not just use a vector of strings?
FWIW, I would use CodecZlib.jl to open a stream, and pass it to whatever function I would use otherwise (assuming it accepts streams).
What I want to do is to parse the gzipped XML file(“WR_2004_20180215203106_CORE_0001.xml.gz”) or put the csv-file in a DataFrame.
Your remark “Why not just use a vector of strings?” is a bit over my head. I have no idea what you mean.
I still do not know how to open it even using CodecZlib.jl:
julia> using CodecZlib
julia> stream = GzipDecompressorStream(open(d["WR_2004_20180215203106_CORE_0001.xml.gz"]))
ERROR: MethodError: no method matching open(::Array{UInt8,1})
Closest candidates are:
open(::AbstractString) at iostream.jl:113
open(::AbstractString, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at iostream.jl:103
open(::AbstractString, ::AbstractString) at iostream.jl:132
...
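The MethodError above arises because open expects a file path, whereas d["WR_2004_20180215203106_CORE_0001.xml.gz"] holds the compressed bytes themselves. One way around it, sketched here under the assumption that CodecZlib is installed, is to wrap the byte vector in an IOBuffer (from Julia's Base) and hand that to GzipDecompressorStream. The variables xml_text and gz_bytes are made-up stand-ins for the archive entry, not data from this thread.

```julia
using CodecZlib

# Stand-in for d["...xml.gz"]: gzip-compressed bytes held in memory.
xml_text = "<root><item>1</item></root>"
gz_bytes = transcode(GzipCompressor, Vector{UInt8}(xml_text))

# open() wants a path; IOBuffer turns the byte vector into a readable IO.
stream = GzipDecompressorStream(IOBuffer(gz_bytes))
decoded = read(stream, String)
close(stream)

println(decoded)
```

For entries in the gigabyte range, reading the stream incrementally (for example with eachline(stream)) is likely to be kinder to memory than read(stream, String). The decoded text could then go to a string-based parser such as LightXML's parse_string.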
Sorry, but I do not know what an MWE is.
Here is more information of the zip-archive:
julia> for (filename, data) in open_zip(fn)
           println("$filename has $(length(data)) bytes and is of type($(typeof(data))")
       end
WR_2004_20180215203106_CORE_0001.xml.gz has 1205519728 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0002.xml.gz has 1206399676 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0003.xml.gz has 1203918434 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0004.xml.gz has 80509285 bytes and is of type(Array{UInt8,1}
Daily_report_CORE_20180215203106.csv has 1499 bytes and is of type(String
Y2D_report_CORE_20180215203106.csv has 1682 bytes and is of type(String
I am planning to use LightXML to parse it.
Regards
Johann
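Since the listing shows the CSV entries arriving as String, the earlier ENAMETOOLONG error is probably FileIO treating the file's contents as a file name (the stack trace goes through isfile and stat). A possible workaround, assuming the CSV.jl and DataFrames packages rather than the CSVFiles loader used above, is to wrap the text in an IOBuffer so the reader sees a stream. The csv_text value here is an invented stand-in, not the real report.

```julia
using CSV, DataFrames

# Stand-in for d["Y2D_report_CORE_20180215203106.csv"], which is a String.
csv_text = "name,count\nalpha,1\nbeta,2\n"

# An IOBuffer makes the reader parse the text itself instead of
# interpreting it as a path on disk.
df = CSV.File(IOBuffer(csv_text)) |> DataFrame

println(df)
```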
In Python I would do this:
import gzip
import tarfile
import zipfile
from io import BytesIO
from mimetypes import guess_type

def openfile(filepath):
    """
    f = file as opened
    t = type of file ('z' for zipfile, 'tgz' for tar.gz, 'gz')
    """
    f = None
    t = None
    if guess_type(filepath) == ('application/x-tar', 'gzip'):
        t = 'tgz'
        f = tarfile.open(filepath, "r:gz")
    elif guess_type(filepath)[1] == 'gzip':
        try:
            f = open(filepath, 'rb')
            t = 'gz'
        except IOError as e:
            print('Oh dear.')
    elif guess_type(filepath)[0] == 'application/zip':
        try:
            f = zipfile.ZipFile(filepath, 'r')
            t = 'z'
        except IOError as e:
            print('Oh dear.')
    else:
        try:
            f = open(filepath, 'rb')
            t = None
        except IOError as e:
            print('Oh dear.')
    return f, t

f, t = openfile(filename)
if t == 'z':
    # Pick out the gzip members of the zip archive.
    gzipfiles = [each for each in f.namelist() if each.endswith('.gz')]
    for gzipf in gzipfiles:
        gzipf_object = f.open(gzipf)
        # Wrap the member's bytes so gzip can decompress them in memory.
        xml = gzip.GzipFile(fileobj=BytesIO(gzipf_object.read()))
        # et cetera
I want to get to the point using Julia where I can use the file embedded in a zip-file like this.
MWE: Minimum Working Example
For what it’s worth, I have found TranscodingStreams together with CodecZlib to be very efficient in handling gzip files, with less than 10% slowdown compared to reading the uncompressed files. However, for zip archives (.zip), ZipFiles is outdated and is over 20 times slower than processing the uncompressed files.
I had hoped to either modernize ZipFiles.jl or to write a CodecZlib-like package for zip archives but haven’t been able to make the time to do so. As a result I have dropped support for zip archives and am recommending that our users use gzip compression instead.
Thanks @jandehaan and @js135005.
My lack of experience in both Julia and using this type of streaming becomes clear.
julia> using CodecZlib
julia> text = open("2004_CORE.zip")
IOStream(<file 2004_CORE.zip>)
From previous experience (see my earlier post in this thread) I know there is more than one .gz file and more than one plain-text .csv file in this zip file.
Using streaming, how do I find out which files are present and what their names and types are? And then how do I read them separately, or just one of them?
This is the issue that I’ve hit as well. A zip archive has a directory listing all the included files and the offsets to get to them. The ZipFiles package handles this and allows you to read/write the component files but is abysmally slow so I’ve abandoned using it because I’m handling way too much data.
It desperately needs an update or to be replaced entirely.
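For reference, listing the directory and reading a single entry might look roughly like this. It is a sketch that assumes the ZipFile.jl package (presumably the one called ZipFiles above) and builds a small throwaway archive with invented entry names so that it runs on its own; the performance caveat described above still applies.

```julia
using ZipFile

# Build a tiny archive so the sketch is self-contained (names are made up).
path = joinpath(mktempdir(), "sample.zip")
w = ZipFile.Writer(path)
write(ZipFile.addfile(w, "report.csv"), "a,b\n1,2\n")
write(ZipFile.addfile(w, "notes.txt"; method=ZipFile.Deflate), "hello")
close(w)

# The Reader parses the archive's directory; entries are read on demand.
r = ZipFile.Reader(path)
entry_names = [f.name for f in r.files]
for f in r.files
    println(f.name, ": ", f.uncompressedsize, " bytes")
end

# Read just one entry, selected by name.
idx = findfirst(f -> f.name == "report.csv", r.files)
csv_text = read(r.files[idx], String)
close(r)
```

The entry type is not stored in the directory as such, so in practice it would have to be inferred from the name, for example with endswith(f.name, ".gz").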