InexactError when saving compressed CSV file (but not if I save it uncompressed)

x is a roughly 14M × 243 DataFrame:

julia> x
14259735×243 DataFrame
      Row │ C      R      Y      pr_m1_yl4  pr_m2_yl4  pr_m3_yl4  pr_m4_yl4  p ⋯
          │ Int64  Int64  Int64  Float64    Float64    Float64    Float64    F ⋯
──────────┼─────────────────────────────────────────────────────────────────────
        1 │     1      1   1984   135.01     159.569   134.448      25.8061    ⋯
        2 │     1      1   1985    57.1593    66.5214  182.197      28.1082  1
        3 │     1      1   1986   136.758    149.088   108.003      34.1564
        4 │     1      1   1987   121.977     85.8327   64.9958     95.0051
        5 │     1      1   1988   197.148     78.5767   69.6241     36.9885    ⋯
        6 │     1      1   1989    93.3988   130.05    127.21       50.8063
        7 │     1      1   1990   150.033     39.6072  122.061      55.024   1
        8 │     1      1   1991   158.353    105.28    103.605      50.3801
    ⋮     │   ⋮      ⋮      ⋮        ⋮          ⋮          ⋮          ⋮        ⋱
 14259728 │   669    609   2011    10.3363    44.7045   76.2623     76.8465    ⋯
 14259729 │   669    609   2012    37.849     22.5278   76.5935     11.6821  1
 14259730 │   669    609   2013    74.8778    35.5059   19.4006     45.5431
 14259731 │   669    609   2014   150.525     69.9397   32.5978     44.1507
 14259732 │   669    609   2015    70.5953    55.0444  103.461      78.7517    ⋯
 14259733 │   669    609   2016    28.8887    44.8978    3.20769    38.7676
 14259734 │   669    609   2017    52.9068    68.933   111.106      33.3569
 14259735 │   669    609   2018    44.0931    23.2465   76.31       18.3548

If I save it uncompressed, it works, producing a ~56GB file:

CSV.write("xclim.csv",x)

However, if I try to save it zipped (as in the CSV doc example), I get an InexactError:

z = ZipFile.Writer("xclim.zip")
f = ZipFile.addfile(z, "xclim.csv", method=ZipFile.Deflate)
x |> CSV.write(f) # error here after a while
close(z)

ERROR: InexactError: trunc(UInt32, 4298836469)
Stacktrace:
  [1] throw_inexacterror(::Symbol, ::Vararg{Any})
    @ Core ./boot.jl:750
  [2] checked_trunc_uint
    @ ./boot.jl:772 [inlined]
  [3] toUInt32
    @ ./boot.jl:856 [inlined]
  [4] UInt32
    @ ./boot.jl:896 [inlined]
  [5] convert(::Type{UInt32}, x::Int64)
    @ Base ./number.jl:7
  [6] setproperty!(x::ZipFile.WritableFile, f::Symbol, v::Int64)
    @ Base ./Base.jl:52
  [7] unsafe_write(f::ZipFile.WritableFile, p::Ptr{UInt8}, nb::UInt64)
    @ ZipFile ~/.julia/packages/ZipFile/yQ7yx/src/ZipFile.jl:682
  [8] unsafe_write
    @ ./io.jl:803 [inlined]
  [9] write
    @ ./io.jl:837 [inlined]
 [10] writecell
    @ ~/.julia/packages/CSV/XLcqT/src/write.jl:306 [inlined]
 [11] (::CSV.var"#114#115"{…})(val::Float64, col::Int64, nm::Symbol)
    @ CSV ~/.julia/packages/CSV/XLcqT/src/write.jl:362
 [12] eachcolumn
    @ ~/.julia/packages/Tables/8p03y/src/utils.jl:75 [inlined]
 [13] writerow(buf::Vector{…}, pos::Base.RefValue{…}, len::Int64, io::ZipFile.WritableFile, sch::Tables.Schema{…}, row::DataFrameRow{…}, cols::Int64, opts::CSV.Options{…})
    @ CSV ~/.julia/packages/CSV/XLcqT/src/write.jl:358
 [14] (::CSV.var"#107#108"{…})(io::ZipFile.WritableFile)
    @ CSV ~/.julia/packages/CSV/XLcqT/src/write.jl:225
 [15] with(f::CSV.var"#107#108"{…}, io::Any, append::Bool, compress::Bool)
    @ CSV ~/.julia/packages/CSV/XLcqT/src/write.jl:294
 [16] #write#106
    @ ~/.julia/packages/CSV/XLcqT/src/write.jl:215 [inlined]
 [17] write(file::ZipFile.WritableFile, itr::DataFrame; append::Bool, compress::Bool, writeheader::Nothing, partition::Bool, kw::@Kwargs{})
    @ CSV ~/.julia/packages/CSV/XLcqT/src/write.jl:199
 [18] write
    @ ~/.julia/packages/CSV/XLcqT/src/write.jl:162 [inlined]
 [19] #101
    @ ~/.julia/packages/CSV/XLcqT/src/write.jl:161 [inlined]
 [20] |>(x::DataFrame, f::CSV.var"#101#102"{@Kwargs{}, ZipFile.WritableFile})
    @ Base ./operators.jl:926
 [21] top-level scope
    @ ~/.julia/dev/GenFSM/dev_local/autoerncode_clim_nyears.jl:121
Some type information was truncated. Use `show(err)` to see complete types.

Could this be a memory issue? (I have 64 GB on my laptop.)

I’m surprised that this doesn’t have a better error message, but I think the reason is that your data is simply too large: the original ZIP format has a maximum size of 4 GB because it stores sizes in 32-bit fields.
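The number in the error message can be checked against that 32-bit limit. This is only a small arithmetic sketch; the variable names are made up for illustration, and the value is copied from the `InexactError` above.

```julia
# Value copied from the error message: trunc(UInt32, 4298836469)
written = 4298836469

# Largest value a 32-bit unsigned field can hold
limit = Int64(typemax(UInt32))      # 4294967295

# Does the value still fit in a UInt32?
fits = written <= limit             # false

# How far past 2^32 the value is
overflow = written - (limit + 1)    # 3869173

# Converting it to UInt32 raises the same kind of error as in the stack trace
threw = try
    UInt32(written)
    false
catch err
    err isa InexactError
end
```

So the failure happens right after the counter crosses 4 GiB, which is consistent with a format limit rather than a memory problem.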

OK, thank you. I have read that page, but I thought that limit applied only to very old zip implementations, and that modern implementations had a limit of 16 exabytes…

What other options are there for saving large DataFrames in compressed form?

EDIT: I managed to get a 5 GB tgz from the command line (I am on Linux), but I would prefer an OS-independent way to do this from my code…

EDIT2:

It seems the standard way to read/write compressed CSV files is to use GZip:

using DataFrames, CSV, CodecZlib

a = DataFrame(a=rand(4000), b=rand(4000))
CSV.write("a.csv.gz", a; compress=true)   # gzip-compress while writing
a_copy = CSV.read("a.csv.gz", DataFrame)  # the gzipped file is decompressed on read
a == a_copy # true

Strange that it isn’t easier to find… I will now try it with my real dataset…

You might try ZipArchives.jl. I don’t think it has this 4GB limitation.
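A minimal sketch of what that could look like, assuming `ZipWriter` and `zip_newfile` behave as described in the ZipArchives README and that `CSV.write` accepts the writer as a plain `IO`. The small DataFrame and the file names are stand-ins, and this has not been tried on a table of the size discussed above.

```julia
using CSV, DataFrames
using ZipArchives: ZipWriter, zip_newfile

# Small stand-in table; the real one would be the large DataFrame `x`
df = DataFrame(a = rand(1000), b = rand(1000))

# Hypothetical output location, placed in a temp dir for illustration
zip_path = joinpath(mktempdir(), "small.zip")

ZipWriter(zip_path) do w
    # Start a new Deflate-compressed entry inside the archive
    zip_newfile(w, "small.csv"; compress = true)
    # Write the table into the current entry
    CSV.write(w, df)
end
```

The do-block form closes the archive when the block exits, so no separate `close` call is needed.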

Currently, ZipArchives has the following benefits over ZipFile:

  1. Full ZIP64 support: archives larger than 4GB can be written.