Reading files embedded in a Zip-file

johann.spies · May 3, 2018, 9:23am

I am exploring InfoZIP.

Variable “fn” contains the filename.

julia> d = open_zip(fn)
julia> d.keys
6-element Array{AbstractString,1}:
 "WR_2004_20180215203106_CORE_0001.xml.gz"
 "WR_2004_20180215203106_CORE_0002.xml.gz"
 "WR_2004_20180215203106_CORE_0003.xml.gz"
 "WR_2004_20180215203106_CORE_0004.xml.gz"
 "Daily_report_CORE_20180215203106.csv"   
 "Y2D_report_CORE_20180215203106.csv"

From there I can do

julia> l = d["Y2D_report_CORE_20180215203106.csv"];

but I cannot do

julia> using FileIO, CSVFiles, DataFrames
julia> l = load(d["Y2D_report_CORE_20180215203106.csv"]) |> DataFrame
ERROR: stat: name too long (ENAMETOOLONG)
Stacktrace:
 [1] stat(::String) at ./stat.jl:69
 [2] isfile at ./stat.jl:279 [inlined]
 [3] query(::String) at /home/js/.julia/v0.6/FileIO/src/query.jl:377
 [4] #load#13(::Array{Any,1}, ::Function, ::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52
 [5] load(::String) at /home/js/.julia/v0.6/FileIO/src/loadsave.jl:52

Now my question: How can I read those CSV-files into DataFrames and how do I use GZip on the others?

julia> fh = GZip.open(d["WR_2004_20180215203106_CORE_0001.xml.gz"])

ERROR: MethodError: no method matching gzopen(::Array{UInt8,1})
Closest candidates are:
  gzopen(::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:263
  gzopen(::AbstractString, ::AbstractString) at /home/js/.julia/v0.6/GZip/src/GZip.jl:262
  gzopen(::AbstractString, ::AbstractString, ::Integer) at /home/js/.julia/v0.6/GZip/src/GZip.jl:244
  ...
Stacktrace:
 [1] open(::Array{UInt8,1}, ::Vararg{Array{UInt8,1},N} where N) at /home/js/.julia/v0.6/GZip/src/GZip.jl:264
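
A note on the two errors above: the stack traces show that the archive already returns file contents (a `String` for the CSV entry, an `Array{UInt8,1}` for the `.gz` entry), while `load` and `GZip.open` expect a file name. For the CSV case, a minimal sketch is to wrap the text in an `IOBuffer`. This assumes CSV.jl, which the original post does not use, and made-up sample data:

```julia
using CSV, DataFrames

# Hypothetical stand-in for d["Y2D_report_CORE_20180215203106.csv"],
# which the archive returns as a String holding the file's text.
csv_text = "name,value\nalpha,1\nbeta,2\n"

# Wrap the text in an IOBuffer so it is read as data, not treated as a path.
df = CSV.read(IOBuffer(csv_text), DataFrame)
```

The `ENAMETOOLONG` error is consistent with the whole CSV text being passed to `stat` as if it were a path.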

Tamas_Papp · May 3, 2018, 9:31am

It is not very clear what you are trying to do. What are the values of d, e.g. d["WR_2004_20180215203106_CORE_0001.xml.gz"]? It looks like a byte vector. Why not just use a vector of strings?

FWIW, I would use CodecZlib.jl to open a stream, and pass it to whatever function I would use otherwise (assuming it accepts streams).

johann.spies · May 3, 2018, 9:44am

What I want to do is to parse the gzipped XML file (“WR_2004_20180215203106_CORE_0001.xml.gz”) or put the CSV file in a DataFrame.

Your remark “Why not just use a vector of strings?” is a bit over my head. I have no idea what you mean.

I still do not know how to open it even using CodecZlib.jl:

julia> using CodecZlib
julia> stream = GzipDecompressorStream(open(d["WR_2004_20180215203106_CORE_0001.xml.gz"]))
ERROR: MethodError: no method matching open(::Array{UInt8,1})
Closest candidates are:
  open(::AbstractString) at iostream.jl:113
  open(::AbstractString, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at iostream.jl:103
  open(::AbstractString, ::AbstractString) at iostream.jl:132
  ...
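
`open` fails here for the same reason as before: it expects a path, and the entry is a byte vector. Wrapping the bytes in an `IOBuffer` gives CodecZlib an `IO` to read from. A small sketch, with made-up data standing in for the archive entry:

```julia
using CodecZlib

# Hypothetical stand-in for the Array{UInt8,1} that the archive returns.
xml_text = "<records>\n<rec id=\"1\"/>\n</records>\n"
gz_bytes = transcode(GzipCompressor, Vector{UInt8}(xml_text))

# IOBuffer turns the byte vector into an IO; the stream decompresses on read.
stream = GzipDecompressorStream(IOBuffer(gz_bytes))
lines = collect(eachline(stream))
close(stream)
```

The resulting `stream` can be passed to any function that accepts an `IO`.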

Tamas_Papp · May 3, 2018, 9:49am

What tool would you want to use for XML parsing? XML is a generic format.
Again, what are the values of d? You just listed the keys.
I would just use a vector of strings to pass the filenames, eg
```
["1.csv.gz", "2.csv.gz", ...]
```
but clearly you have something else going on with the XML. An MWE would help.

johann.spies · May 3, 2018, 9:56am

Sorry, but I do not know what MWE is.

Here is more information of the zip-archive:

julia> for (filename, data) in open_zip(fn)
           println("$filename has $(length(data)) bytes and is of type($(typeof(data))")
       end
WR_2004_20180215203106_CORE_0001.xml.gz has 1205519728 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0002.xml.gz has 1206399676 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0003.xml.gz has 1203918434 bytes and is of type(Array{UInt8,1}
WR_2004_20180215203106_CORE_0004.xml.gz has 80509285 bytes and is of type(Array{UInt8,1}
Daily_report_CORE_20180215203106.csv has 1499 bytes and is of type(String
Y2D_report_CORE_20180215203106.csv has 1682 bytes and is of type(String

I am planning to use LightXML to parse it.

Regards
Johann
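
Given the types listed above, one possible route to LightXML is to decompress the byte vector and hand the resulting text to `parse_string`. The sketch below uses made-up data. It builds the whole document in memory, which may not be practical for entries of roughly 1.2 GB compressed:

```julia
using CodecZlib, LightXML

# Hypothetical stand-in for one gzipped XML entry from the archive.
xml_text = "<records><rec id=\"1\"/><rec id=\"2\"/></records>"
gz_bytes = transcode(GzipCompressor, Vector{UInt8}(xml_text))

# Decompress in one call, then parse the text as an XML document.
decompressed = String(transcode(GzipDecompressor, gz_bytes))
xdoc = parse_string(decompressed)
xroot = root(xdoc)
root_name = LightXML.name(xroot)
ids = [attribute(e, "id") for e in child_elements(xroot)]
free(xdoc)  # LightXML wraps libxml2, so the document is freed explicitly
```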

johann.spies · May 3, 2018, 10:07am

In Python I would do this:

import gzip
import tarfile
import zipfile
from io import BytesIO
from mimetypes import guess_type

def openfile(filepath):
    """
    f = file as opened
    t = type of file ('z' for zipfile, 'tgz' for tar.gz, 'gz')
    """
    f = None
    t = None
    if guess_type(filepath) == ('application/x-tar', 'gzip'):
        t = 'tgz'
        f = tarfile.open(filepath,"r:gz")

    elif  guess_type(filepath)[1] == 'gzip':
        try:
            f =  open(filepath, 'rb')
            t = 'gz'
        except IOError as e:
            print ('Oh dear.')
    elif  guess_type(filepath)[0] == 'application/zip':
        try:
            f =  zipfile.ZipFile(filepath, 'r')
            t = 'z'
        except IOError as e:
            print ('Oh dear.')

    else:
        try:
            f =  open(filepath, 'rb')
            t = None
        except IOError as e:
            print ('Oh dear.')
    return f,t
f, t = openfile(filename)
if t == 'z':
    # Pick out the gzipped members of the zip archive
    gzipfiles = [each for each in f.namelist() if each.endswith('.gz')]
    for gzipf in gzipfiles:
        gzipf_object = f.open(gzipf)
        # Decompress the member from an in-memory buffer
        xml = gzip.GzipFile(fileobj=BytesIO(gzipf_object.read()))
        # et cetera

I want to get to the point using Julia where I can use the file embedded in a zip-file like this.

jandehaan · May 3, 2018, 3:38pm

MWE: Minimal Working Example

js135005 · May 3, 2018, 3:56pm

For what it’s worth, I have found TranscodingStreams together with CodecZlib to be very efficient in handling gzip files, with less than 10% slowdown compared to reading the uncompressed files. However, for zip archives (.zip), ZipFiles is outdated and is over 20 times slower than processing uncompressed files.

I had hoped to either modernize ZipFiles.jl or to write a CodecZlib-like package for zip archives but haven’t been able to make the time to do so. As a result I have dropped support for zip archives and am recommending that our users use gzip compression instead.
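
For a standalone `.gz` file, the streaming approach described here might look like the following sketch. The file is a made-up example written to a temporary path:

```julia
using CodecZlib

# Write a small gzip file so the example is self-contained.
path = tempname() * ".gz"
write(path, transcode(GzipCompressor, Vector{UInt8}("first\nsecond\nthird\n")))

# Decompress while reading, one line at a time, so the uncompressed
# content never has to be held in memory all at once.
stream = GzipDecompressorStream(open(path))
n_lines = count(_ -> true, eachline(stream))
close(stream)  # also closes the underlying file
```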

johann.spies · May 7, 2018, 8:01am

Thanks @jandehaan and @js135005.

My lack of experience in both Julia and using this type of streaming becomes clear.

julia> using CodecZlib

julia> text = open("2004_CORE.zip")
IOStream(<file 2004_CORE.zip>)

From previous experience (see my previous post in this thread) I know there is more than one .gz file and more than one plain-text .csv file in this zip file.

Using streaming, how do I find out which files are present and what their names and types are? And then how do I read them separately, or read just one of them?

js135005 · May 8, 2018, 3:28pm

This is the issue that I’ve hit as well. A zip archive has a directory listing all the included files and the offsets to get to them. The ZipFiles package handles this and allows you to read/write the component files but is abysmally slow so I’ve abandoned using it because I’m handling way too much data.
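
The directory described here is what lets a reader list the entries without decompressing any of them. A sketch using ZipArchives.jl (the package announced at the end of this thread); the archive and entry names are made up:

```julia
using ZipArchives: ZipWriter, ZipReader, zip_newfile, zip_names

# Build a small throwaway archive so the example is self-contained.
zip_path = tempname() * ".zip"
ZipWriter(zip_path) do w
    zip_newfile(w, "part_0001.xml.gz")
    write(w, "placeholder bytes")
    zip_newfile(w, "report.csv")
    write(w, "name,value\nalpha,1\n")
end

# ZipReader parses the archive's directory from the raw bytes.
archive = ZipReader(read(zip_path))
entry_names = zip_names(archive)

# Classify entries by extension as a rough guide to how to read them.
gz_entries = filter(n -> endswith(n, ".gz"), entry_names)
csv_entries = filter(n -> endswith(n, ".csv"), entry_names)
```

Classifying by extension is only a heuristic; the archive directory itself does not record a content type.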

It desperately needs an update or to be replaced entirely.

nhz2 · September 2, 2024, 3:46pm

I created ZipArchives.jl (GitHub: JuliaIO/ZipArchives.jl, "Read and write Zip archive files in Julia"), which uses CodecZlib.jl internally for decompression and compression. It can efficiently handle reading and writing large amounts of data.
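
Applied to the original question (gzipped XML files and plain CSV files inside one zip archive), usage might look like the sketch below. The archive contents and entry names are made up, and everything is read into memory for simplicity:

```julia
using ZipArchives: ZipWriter, ZipReader, zip_newfile, zip_names, zip_readentry
using CodecZlib

# Build a throwaway archive holding one gzipped XML entry and one CSV entry.
xml_text = "<records><rec id=\"1\"/></records>"
zip_path = tempname() * ".zip"
ZipWriter(zip_path) do w
    zip_newfile(w, "part_0001.xml.gz")
    write(w, transcode(GzipCompressor, Vector{UInt8}(xml_text)))
    zip_newfile(w, "report.csv")
    write(w, "name,value\nalpha,1\n")
end

archive = ZipReader(read(zip_path))

# Plain-text entry: ask for a String directly.
csv_text = zip_readentry(archive, "report.csv", String)

# Gzipped entries: read the raw bytes, then gunzip them in memory.
unzipped = Dict{String,String}()
for entry in zip_names(archive)
    endswith(entry, ".gz") || continue
    gz_bytes = zip_readentry(archive, entry)
    unzipped[entry] = String(transcode(GzipDecompressor, gz_bytes))
end
```

For entries as large as the ones in this thread, reading each entry fully into memory may be too costly. The package also exports `zip_openentry`, which returns a stream for an entry instead of a byte vector.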