File zipping taking longer for large files

I am trying to zip the files in a directory, around 300-400 MB in total, and it takes more than 15 minutes to finish. Initially I used a custom function written with the ZipFile package. I tried Base.zip, which didn’t help, and also the zip_files method from the ZipStreams package. That worked fine, but the zipped file contains sub-directories if the input paths include sub-directories. I also tried zipsink, but no luck. Any suggestions?

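using ZipFile

# `files` may be any vector of paths (e.g. Vector{String} or Vector{Any}).
function zip(archive_name::String, files::AbstractVector=String[])
    @info "Zip all files"
    isempty(files) && return
    w = ZipFile.Writer(archive_name)
    for file in files
        @show file
        # Entry name: the 4th path component if there are more than 3, otherwise the 3rd.
        ff = split(file, "/")
        ff = length(ff) > 3 ? ff[4] : ff[3]
        f = ZipFile.addfile(w, ff, method=ZipFile.Deflate)
        # Copy the raw bytes; there is no need to decode the file as a String.
        write(f, read(file))
    end
    close(w)
end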

This is some code I have used to call the 7z binary, which comes bundled with Julia.

using p7zip_jll: p7zip
# Explanation of 7z options:
# `a`:           Add files to an archive.
# `-tzip`:       Create a zip archive instead of a 7z archive.
# `-mm=deflate`: Compress with DEFLATE algorithm.
# `-mx=9`:       Set compression level to maximum.
run(pipeline(`$(p7zip()) a -tzip -mm=deflate -mx=9 $(archive_name) .`,
             stdout = devnull))

You can probably adapt it to your needs but you need to find the 7z documentation externally.

@Sandy45 That seems very slow. I use ZipArchives.jl and it will create an archive over 1GB in much less time than that.

using ZipArchives

function zipfiles(zip, fnames) # write files in fnames to an open zip file.
    if length(fnames) > 0
        for name in fnames
            if !ismissing(name)
                if !isfile(name)
                    throw(LoadError("", 0, "Specified file not found: $name"))
                end
                f = open(name, "r")
                content = read(f, String)
                close(f)
                name = trimpath(name)
                zip_newfile(zip, name; compress=true)
                write(zip, content)
            end
        end
    end
    return nothing
end
function addtoZIP(zipname, fnames; append=false) # add or append files to a zip file
    if append
        if isfile(zipname)
            zip_append_archive(zipname) do zip
                zipfiles(zip, fnames)
            end
        else
            throw(LoadError("", 0, "Specified file not found: $zipname"))
        end
    else
        ZipWriter(zipname) do zip
            zipfiles(zip, fnames)
        end
    end
    return nothing
end

I chose ZipArchives.jl specifically because it allows me to append extra files to an existing archive.


Oh, I see. My bad!

function trimpath(file) # remove the leading path to leave just the filename remaining.
    l = findlast("\\", file)
    if l !== nothing
        file = file[nextind(file, first(l)):end]
    end
    return file
end

This may rely on a Windows path separator - but you seem to be on Windows…

I hope that works now!

Edit: Just discovered the splitpath() function in Base that you could use, too.

@TimG Got it - Thanks!


If I’m not overlooking something, trimpath is the same as basename in Base.
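For what it's worth, here is a tiny sketch of that using only Base. The sample path is made up for illustration; `basename` splits on the separator of the platform it runs on, so it also works outside Windows.

```julia
# Build a path with the platform's separator, then strip the directories.
path = joinpath("data", "reports", "summary.csv")  # hypothetical sample path
name = basename(path)

println(name)
println(last(splitpath(path)))  # same result via splitpath
```

One caveat: on Linux or macOS a backslash is not a path separator, so `basename` applied there to a Windows-style string returns the string unchanged.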


If you want to go extra fast, you could try:

ZipStreams.jl and ZipArchives.jl both use CodecZlib.jl for compression so at least in theory should have similar performance.

In ZipArchives.jl you can also set the compression level manually when adding a new file, for example:

zip_newfile(zip, name; compress=true, compression_level=1)

The level can be 1 to 9 where 1 is fastest and 9 is smallest file size. By default this is 6 as a compromise.
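Putting that together with the earlier suggestions, a minimal sketch might look like this. The helper name `zip_fast` and its `level` keyword are made up for illustration; it assumes ZipArchives.jl is installed and uses only the `ZipWriter` and `zip_newfile` calls already shown in this thread.

```julia
using ZipArchives: ZipWriter, zip_newfile

# Hypothetical helper: write `files` into a new archive at `zipname`,
# storing each file under its bare name with a fast compression level.
function zip_fast(zipname, files; level=1)
    ZipWriter(zipname) do zip
        for path in files
            zip_newfile(zip, basename(path); compress=true, compression_level=level)
            write(zip, read(path))  # raw bytes, no String decoding
        end
    end
    return nothing
end
```

Because every entry is stored under its `basename`, two input files with the same name in different folders would collide, so this only suits flat file lists.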

Also, yes, it is a very bad idea to save absolute paths in a ZIP archive. Extracting such an archive may cause errors, and a zip extractor that isn’t carefully written may even delete unexpected files in the filesystem.
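One way to avoid absolute paths while still keeping sub-directories is to derive each entry name relative to a chosen root folder. This is only a sketch using Base functions; the helper name `entry_name` and the `project` folder are made up for illustration.

```julia
# Hypothetical helper: turn a file path into a relative, '/'-separated entry name.
function entry_name(path::AbstractString, root::AbstractString)
    rel = relpath(path, root)
    parts = splitpath(rel)
    # Refuse anything that would land outside `root` when extracted.
    if isabspath(rel) || first(parts) == ".."
        throw(ArgumentError("$path is not inside $root"))
    end
    return join(parts, '/')
end

root = joinpath(pwd(), "project")  # made-up directory, does not need to exist
name = entry_name(joinpath(root, "sub", "a.txt"), root)
println(name)
```

The result can be passed as the entry name to `zip_newfile` instead of the full path.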

If you really know what you are doing, you can disable all entry name checks, for example:

open("test.zip"; write=true) do fileio
    ZipWriter(fileio; check_names=false) do zip
        zip_newfile(zip, ":::")
    end
end

This leads to the following error when extracting on Windows:


Not you but me! So much to learn in Julia! :upside_down_face:

Thank you for the correction! I mostly blindly trust ZipStreams.jl’s README because I happen to know the maintainer and trust him. I have no benchmarks to substantiate my comment here :sweat_smile:

That’s true, I am using the same.

I am trying to run this and want to check whether “.” means that all files in the working directory get zipped. Can you give me an example of how to zip all the files in a directory, or a vector of file paths?

fname=readdir(pwd())

returns a vector of the names of all entries (files and sub-directories) in the current directory.
You can pass this array to the functions given above to put the files in pwd() into the zip.

If you want files in another folder then you might try

fname = joinpath.(dir_name, readdir(dir_name))

where dir_name is a string containing the folder path.
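Since `readdir` also lists sub-directories, and the `zipfiles` function above throws for anything that is not a regular file, it may help to filter first. A small sketch, assuming Julia 1.4 or later for the `join` keyword; `list_files` is a made-up name, and the temporary directory is only there to make the example self-contained.

```julia
# Hypothetical helper: full paths of the regular files directly inside `dir_name`.
list_files(dir_name) = filter(isfile, readdir(dir_name; join=true))

# Demo in a throwaway directory containing one file and one sub-directory.
dir = mktempdir()
write(joinpath(dir, "a.txt"), "hello")
mkdir(joinpath(dir, "sub"))

files = list_files(dir)
println(basename.(files))
```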

@TimG Thanks Tim. Yes, I figured this out. I was asking @GunnarFarneback regarding the command he posted as I just wanted to give a try.

The code I posted zips all files in the current directory. But it has been some time since I wrote it, and I have never learned more of the 7z functionality than I needed at the time.
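For a vector of file paths instead of ".", something along these lines should work. It is only a sketch: `zip_with_7z` is a made-up name, the 7z options are the same ones as in the command above, and it relies on Julia expanding a vector interpolated into a command into one argument per element.

```julia
using p7zip_jll: p7zip

# Hypothetical helper: zip an explicit list of files with the bundled 7z binary.
function zip_with_7z(archive_name::AbstractString, files::AbstractVector{<:AbstractString})
    isempty(files) && return nothing
    # `$(files)` expands to one command-line argument per file path.
    cmd = `$(p7zip()) a -tzip -mm=deflate -mx=9 $(archive_name) $(files)`
    run(pipeline(cmd, stdout = devnull))
    return nothing
end
```

How 7z names the entries for the given paths is something I have not checked, so it is worth listing the archive contents after a first run.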

Okay - Thank You!

bench.jl (2.6 KB)
Manifest.toml (46.8 KB)
Project.toml (323 Bytes)

I’ve made a simple benchmark, and p7zip_jll is much faster and results in a smaller file on an 8-core AMD Ryzen 7 7800X3D CPU. When restricted to a single thread, ZipArchives with compression level 1 is the fastest, but it creates a larger file. I’m not sure how to set the compression level using ZipStreams.
