~7x compliation time regression in the following data reading code

I noticed a big regression in my data reading code (read csv.zip files from a folder). Stripping it down I get significant time difference between 1.7.3 and 1.8.0.

This setup is quite fragile, for example, DataFrames is not even used, but removing it from the script makes a big difference.

Dummy data folder can be made with mkdir data && echo "HI" > hi.txt && zip data/hi.zip hi.txt && rm hi.txt or just put any .zip file in data folder next to the script.

Would be interested to hear what’s going on. Thanks!

# precompzip.jl
using CSV, DataFrames, ZipFile

function readzipcsv(fpath; file_ix=1, ntasks=1)
    z = ZipFile.Reader(fpath)
    try
        res = CSV.File(read(z.files[file_ix]); ntasks)
    catch e
        res = (CSV.File(read(`unzip -p $fpath`); ntasks))
    finally
        close(z)
    end # try
end

function read_data_folder(folder)
    fs = readdir(folder)
    map(f -> readzipcsv(joinpath(folder, f); ntasks=1), fs)
end
    
@time dfs = read_data_folder("data")
println(size(dfs))
time ../julia-1.7.3/bin/julia precompzip.jl 
  9.866564 seconds (988.95 k allocations: 49.336 MiB, 0.56% gc time, 99.97% compilation time)
(1,)

real    0m28.447s
user    0m27.024s
sys     0m1.198s
time ../julia-1.8.0/bin/julia precompzip.jl 
 67.760195 seconds (71.89 M allocations: 8.032 GiB, 5.66% gc time, 100.00% compilation time)
(1,)

real    1m53.343s
user    1m40.979s
sys     0m15.560s

And without DataFrames

time ../julia-1.8.0/bin/julia precompzip.jl # NO DATAFRAMES import
  8.462947 seconds (846.61 k allocations: 40.318 MiB, 99.98% compilation time)
(1,)

real    0m24.061s
user    0m24.587s
sys     0m4.997s

Package versions

  [336ed68f] CSV v0.10.4
  [a93c6f00] DataFrames v1.3.4
  [a5390f91] ZipFile v0.10.0
2 Likes

did you run this twice? maybe you’re just precompipling DataFrame…

Time is the same on the second run. Also, library imports are outside @time call, so precompile probably wouldn’t show up.

~/temp$ time ../julia-1.8.0/bin/julia precompzip.jl 
 65.496983 seconds (71.89 M allocations: 8.032 GiB, 4.68% gc time, 100.00% compilation time)
(1,)

real    1m50.818s
user    1m41.421s
sys     0m12.727s

~/temp$ time ../julia-1.8.0/bin/julia precompzip.jl 
 63.962627 seconds (71.89 M allocations: 8.032 GiB, 100.00% compilation time)
(1,)

real    1m50.555s
user    1m39.640s
sys     0m14.313s

You have to run your function twice in the same Julia session, not sucessively call the script on the command line.
Then you probably want to use Julia internal timing facilities, e.g. @time macro or @btime macro from BenchmarkTools.

This is besides the point. The OP is talking about a compile-time regression, and compile times are measured the first time you call a function. If DataFrames has been precompiled and there is still a regression in compile times, I think it warrants an issue, similar to this one.

6 Likes

It might even be the same issue - I tested the script on nightly julia build and it runs on par with 1.7.3, so I consider my problem solved. Thanks everybody who’ve looked into it!

time ../julia-b9df45e4d8/bin/julia precompzip.jl
  9.175202 seconds (577.02 k allocations: 32.091 MiB, 99.99% compilation time)
(1,)

real    0m29.731s
user    0m28.895s
sys     0m3.455s
time ../julia-1.7.3/bin/julia precompzip.jl 
  9.824413 seconds (988.95 k allocations: 49.336 MiB, 99.96% compilation time)
(1,)

real    0m28.350s
user    0m26.955s
sys     0m1.158s

Nightly used is

Version 1.9.0-DEV.1320 (2022-09-08)
Commit b9df45e4d81
2 Likes

if this is the problem I think it is, it will be fixed in 1.8.2.

1 Like