I noticed a big regression in my data-reading code (it reads `csv.zip` files from a folder). Stripping it down, I see a significant time difference between 1.7.3 and 1.8.0.
This setup is quite fragile: for example, DataFrames is not even used, but removing it from the script makes a big difference. A dummy data folder can be made with `mkdir data && echo "HI" > hi.txt && zip data/hi.zip hi.txt && rm hi.txt`, or just put any `.zip` file in a `data` folder next to the script.
I'd be interested to hear what's going on. Thanks!
```julia
# precompzip.jl
using CSV, DataFrames, ZipFile

function readzipcsv(fpath; file_ix=1, ntasks=1)
    z = ZipFile.Reader(fpath)
    try
        res = CSV.File(read(z.files[file_ix]); ntasks)
    catch e
        res = CSV.File(read(`unzip -p $fpath`); ntasks)
    finally
        close(z)
    end # try
end

function read_data_folder(folder)
    fs = readdir(folder)
    map(f -> readzipcsv(joinpath(folder, f); ntasks=1), fs)
end

@time dfs = read_data_folder("data")
println(size(dfs))
```
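As a side note on the script above: `readzipcsv` relies on a `try`/`catch`/`finally` expression returning a value. In Julia, the value of such an expression is the value of the `try` (or `catch`) branch; the `finally` block runs only for cleanup and its value is discarded. A minimal, package-free sketch of that behavior:

```julia
# The try/catch/finally expression evaluates to the value of whichever
# of try or catch actually produced the result; finally runs either way
# but only for side effects.
function demo(fail::Bool)
    try
        fail ? error("boom") : "from try"
    catch
        "from catch"
    finally
        # cleanup (e.g. close(z)) would go here; its value is discarded
    end
end
```

So `demo(false)` returns `"from try"` and `demo(true)` returns `"from catch"`, which is why `readzipcsv` still returns the parsed `CSV.File` even though `close(z)` is the last code to run.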
```text
$ time ../julia-1.7.3/bin/julia precompzip.jl
  9.866564 seconds (988.95 k allocations: 49.336 MiB, 0.56% gc time, 99.97% compilation time)
(1,)

real	0m28.447s
user	0m27.024s
sys	0m1.198s
```
```text
$ time ../julia-1.8.0/bin/julia precompzip.jl
 67.760195 seconds (71.89 M allocations: 8.032 GiB, 5.66% gc time, 100.00% compilation time)
(1,)

real	1m53.343s
user	1m40.979s
sys	0m15.560s
```
And without DataFrames:

```text
$ time ../julia-1.8.0/bin/julia precompzip.jl  # NO DATAFRAMES import
  8.462947 seconds (846.61 k allocations: 40.318 MiB, 99.98% compilation time)
(1,)

real	0m24.061s
user	0m24.587s
sys	0m4.997s
```
Package versions:

```text
  [336ed68f] CSV v0.10.4
  [a93c6f00] DataFrames v1.3.4
  [a5390f91] ZipFile v0.10.0
```