Reading files embedded in a Zip-file

I am exploring InfoZIP.

Variable “fn” contains the filename.

julia> d = open_zip(fn)
julia> d.keys
6-element Array{AbstractString,1}:
 "WR_2004_20180215203106_CORE_0001.xml.gz"
 "WR_2004_20180215203106_CORE_0002.xml.gz"
 "WR_2004_20180215203106_CORE_0003.xml.gz"
 "WR_2004_20180215203106_CORE_0004.xml.gz"
 "Daily_report_CORE_20180215203106.csv"   
 "Y2D_report_CORE_20180215203106.csv"    

From there I can do

julia> l = d["Y2D_report_CORE_20180215203106.csv"];

but I cannot do

julia> using FileIO, CSVFiles, DataFrames
julia> l = load(d["Y2D_report_CORE_20180215203106.csv"]) |> DataFrame
ERROR: stat: name too long (ENAMETOOLONG)
Stacktrace:
 [1] stat(::String) at ./stat.jl:69
 [2] isfile at ./stat.jl:279 [inlined]
 [3] query(::String) at /home/js/.julia/v0.6/FileIO/src/query.jl:377
 [4] #load#13(::Array{Any,1}, ::Function, ::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52
 [5] load(::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52

Now my question: How can I read those CSV-files into DataFrames and how do I use GZip on the others?

julia> fh = GZip.open(d["WR_2004_20180215203106_CORE_0001.xml.gz"])

ERROR: MethodError: no method matching gzopen(::Array{UInt8,1})
Closest candidates are:
  gzopen(::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:263
  gzopen(::AbstractString, ::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:262
  gzopen(::AbstractString, ::AbstractString, ::Integer) at /home/js/.julia/v0.6/GZip/src/GZip.jl:244
  ...
Stacktrace:
 [1] open(::Array{UInt8,1}, ::Vararg{Array{UInt8,1},N} where N) at /home/js/.julia/v0.6/GZip/src/GZip.jl:264

It is not very clear what you are trying to do. What are the values of d, e.g. d["WR_2004_20180215203106_CORE_0001.xml.gz"]? It looks like a byte vector. Why not just use a vector of strings?

FWIW, I would use

to open a stream, and pass it to whatever function I would use otherwise (assuming it accepts streams).

What I want to do is to parse the gzipped XML file (“WR_2004_20180215203106_CORE_0001.xml.gz”) or put the CSV file in a DataFrame.

Your remark “Why not just use a vector of strings?” is a bit over my head. I have no idea what you mean.

I still do not know how to open it even using CodecZlib.jl:

julia> using CodecZlib
julia> stream = GzipDecompressorStream(open(d["WR_2004_20180215203106_CORE_0001.xml.gz"]))
ERROR: MethodError: no method matching open(::Array{UInt8,1})
Closest candidates are:
  open(::AbstractString) at iostream.jl:113
  open(::AbstractString, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at iostream.jl:103
  open(::AbstractString, ::AbstractString) at iostream.jl:132
  ...

  1. What tool would you want to use for XML parsing? XML is a generic format.

  2. Again, what are the values of d? You just listed the keys.

  3. I would just use a vector of strings to pass the filenames, eg

    ["1.csv.gz", "2.csv.gz", ...]
    

    but clearly you have something else going on with the XML. An MWE would help.

Sorry, but I do not know what MWE is.

Here is more information about the zip archive:

julia> for (filename, data) in open_zip(fn)
           println("$filename has $(length(data)) bytes and is of type($(typeof(data))")
       end
WR_2004_20180215203106_CORE_0001.xml.gz has 1205519728 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0002.xml.gz has 1206399676 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0003.xml.gz has 1203918434 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0004.xml.gz has 80509285 bytes and is of type(Array{UInt8,1}
Daily_report_CORE_20180215203106.csv has 1499 bytes and is of type(String
Y2D_report_CORE_20180215203106.csv has 1682 bytes and is of type(String

I am planning to use LightXML to parse it.

Regards
Johann

In Python I would do this:

import gzip
import tarfile
import zipfile
from io import BytesIO
from mimetypes import guess_type


def openfile(filepath):
    """
    f = file as opened (None if opening failed)
    t = type of file ('z' for zipfile, 'tgz' for tar.gz, 'gz')
    """
    f = None
    t = None
    if guess_type(filepath) == ('application/x-tar', 'gzip'):
        t = 'tgz'
        f = tarfile.open(filepath, "r:gz")
    elif guess_type(filepath)[1] == 'gzip':
        try:
            f = open(filepath, 'rb')
            t = 'gz'
        except IOError as e:
            print('Oh dear.')
    elif guess_type(filepath)[0] == 'application/zip':
        try:
            f = zipfile.ZipFile(filepath, 'r')
            t = 'z'
        except IOError as e:
            print('Oh dear.')
    else:
        try:
            f = open(filepath, 'rb')
            t = None
        except IOError as e:
            print('Oh dear.')
    return f, t


# filename holds the path to the archive
f, t = openfile(filename)
if t == 'z':
    # names of the gzipped members inside the zip archive
    gzipfiles = [each for each in f.namelist() if each.endswith('.gz')]
    for gzipf in gzipfiles:
        gzipf_object = f.open(gzipf)
        # decompress the member from an in-memory copy of its bytes
        xml = gzip.GzipFile(fileobj=BytesIO(gzipf_object.read()))
# et cetera

I want to get to the point using Julia where I can use the file embedded in a zip-file like this.
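A side note on the Python version above: with members of more than a gigabyte each, copying every member into a BytesIO is costly. The sketch below shows one way to avoid that, assuming only the standard library. The helper names (iter_gz_members, count_elements, make_demo_archive) and the tiny demo archive are made up for illustration and are not part of the original code; counting elements stands in for whatever XML processing is actually needed.

```python
import gzip
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_gz_members(zip_source):
    """Yield (member_name, decompressed_stream) for every .gz member of a zip archive."""
    with zipfile.ZipFile(zip_source, "r") as archive:
        for name in archive.namelist():
            if name.endswith(".gz"):
                # archive.open returns a file-like object, so GzipFile can
                # decompress from it incrementally; no BytesIO copy is needed.
                with archive.open(name) as member, gzip.GzipFile(fileobj=member) as xml_stream:
                    yield name, xml_stream


def count_elements(xml_stream):
    """Count XML elements incrementally instead of building the whole tree."""
    count = 0
    for _event, element in ET.iterparse(xml_stream, events=("end",)):
        count += 1
        element.clear()  # drop the element's content once it has been seen
    return count


def make_demo_archive():
    """Build a tiny in-memory zip that mimics the layout described in the thread (hypothetical data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo_0001.xml.gz", gzip.compress(b"<root><row/><row/></root>"))
        archive.writestr("demo_report.csv", "a,b\n1,2\n")
    buffer.seek(0)
    return buffer


# Each stream is only valid inside the loop body, so it is consumed right away.
results = {name: count_elements(stream) for name, stream in iter_gz_members(make_demo_archive())}
print(results)
```

The same shape (open the archive, open one member as a stream, wrap it in a gzip decompressor, hand the stream to the parser) is what I am looking for in Julia.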

MWE: Minimal Working Example

For what it’s worth, I have found TranscodingStreams together with CodecZlib to be very efficient at handling gzip files, with less than 10% slowdown compared to reading the uncompressed files. For zip archives (.zip), however, ZipFile is outdated and over 20 times slower than processing the uncompressed files.

I had hoped to either modernize ZipFile.jl or to write a CodecZlib-like package for zip archives, but haven’t been able to make the time to do so. As a result I have dropped support for zip archives and am recommending that our users use gzip compression instead.


Thanks @jandehaan and @js135005.

My lack of experience in both Julia and using this type of streaming becomes clear.

julia> using CodecZlib

julia> text = open("2004_CORE.zip")
IOStream(<file 2004_CORE.zip>)

From previous experience (see my earlier post in this thread) I know there is more than one .gz file and more than one plain-text .csv file in this zip file.

Using streaming, how do I find out which files are present, and what their names and types are? And then how do I read them separately, or just one of them?

This is the issue that I’ve hit as well. A zip archive has a directory listing all the included files and the offsets to get to them. The ZipFile package handles this and lets you read and write the component files, but it is abysmally slow. I’m handling far too much data for that, so I’ve abandoned it.

It desperately needs an update or to be replaced entirely.
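To make the directory idea concrete, here is a minimal sketch in Python (the language of the snippet earlier in the thread), using only the standard library's zipfile module. It lists each member's name, header offset and sizes, then reads a single member by name. The helper names and the demo archive are hypothetical; this illustrates the archive layout and is not a statement about how any Julia package works.

```python
import gzip
import io
import zipfile


def list_members(zip_source):
    """Return (name, header_offset, compressed_size, uncompressed_size) for each member."""
    with zipfile.ZipFile(zip_source, "r") as archive:
        # infolist() is populated from the archive's directory; no member data is read here.
        return [
            (info.filename, info.header_offset, info.compress_size, info.file_size)
            for info in archive.infolist()
        ]


def read_member(zip_source, name):
    """Read one member's bytes by name, leaving the other members untouched."""
    with zipfile.ZipFile(zip_source, "r") as archive:
        return archive.read(name)


def make_demo_archive():
    """Build a tiny in-memory zip with one CSV and one gzipped XML member (hypothetical data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo_report.csv", "a,b\n1,2\n")
        archive.writestr("demo_0001.xml.gz", gzip.compress(b"<root/>"))
    buffer.seek(0)
    return buffer


archive_bytes = make_demo_archive()
listing = list_members(archive_bytes)
for name, offset, compressed, uncompressed in listing:
    print(f"{name}: offset={offset}, compressed={compressed}, uncompressed={uncompressed}")

# Pick out just one member by its name.
csv_bytes = read_member(archive_bytes, "demo_report.csv")
print(csv_bytes.decode("utf-8"))
```

Because the names and offsets come from the directory, finding out which files are present does not require decompressing anything; only the member that is actually requested gets read.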
