Slow compilation, I think; not sure how to debug

I have a very simple program, on the order of 50 lines, which uses CSV.jl to read a 448-line CSV file with two entries per line.
I have put @time on the using statements and they look normal (oh why oh why is PyPlot 8 s?!).
Total time: about 11 s.

The program does not appear to start running for at least another 10-15 s. I can't provide the program because work, blah, blah, blah; I'm hoping to create a separate copy of what it's doing so I can post it. I thought for sure it was a compilation issue, but then I stuck an @time on the CSV.File statement and it came back at about 8.5 s, which seemed impossible. So I ran the CSV.File call by hand in the REPL and it came back at 0.2 s, which is what you would expect.

So now I can't figure out whether the @time in the program is correct and there's just some other weird run-time issue, or whether @time is returning a bogus number.

Regardless, I'm waiting 30-35 s for a 50-line program to read a 448-line CSV file and perform some really simple math on the contents.

I was hoping to get some help on how to start debugging this. My plan right now is to sprinkle a lot of @time calls through the program to see whether it's more run time or compilation.
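Something like this, where fname and process are just stand-ins for my actual code:

@time using CSV
@time using DataFrames

@time data = CSV.File(fname)     # time of the read, including any first-call compilation
@time result = process(data)     # same for the actual computation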

Is there some (relatively simple) way to get metrics on what the compiler is doing?

It's a very weird situation. I've waited this long for some programs, but they were considerably more complicated and pulled in a LOT of other modules. This one is ridiculously simple and is taking a really long time.

What version of Julia are you using? Recent ones have improved a lot in these areas.


Aargh, I should know better than to post a question like this without the version.

I’m using 1.4.1

I'm using 1.3.1 at home. I'll try to get the code onto my home machine and see what 1.3.1 does with it.

This might be the long compile time of CSV.jl. Have you tried readdlm from DelimitedFiles? That should be a lot faster, at least for small files, since DelimitedFiles is part of the standard library and thus fully compiled (not just precompiled).
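For a simple two-column file like yours, it would look something like this (the file name is just a placeholder):

using DelimitedFiles

data = readdlm("yourfile.csv", ',')   # returns a plain Matrix, e.g. 448×2 in your case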

I think this is the (pre?)compilation time of CSV.jl, which is always rather slow on first use.

Is it possible for you to (re)use an existing Julia session to circumvent this?
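For example, keep one REPL open and re-run the script in it, so the using cost is paid only once per session (yourscript.jl and the data path are placeholders; the ARGS fiddling is only needed if the script reads command-line arguments):

empty!(ARGS); push!(ARGS, "data/mydata.csv")   # set the script's arguments by hand
include("yourscript.jl")   # slow the first time (using + compilation)
include("yourscript.jl")   # much faster on subsequent runs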

For reference, it takes me 15.7 seconds to read a 116,719-row file (43 columns) the first time. The second time takes 2.78 seconds.

Wow, absolutely crazy results.
Everything is being run twice.

First run, with “using CSV” in place and the file read via CSV:

$ time julia pulse.jl data/pulse_20200514_234617.csv
real 1m39.590s
user 0m0.015s
sys 0m0.093s

This run still has the “using CSV” statement in it, but I am using a simple hand-rolled reader, i.e. CSV is NOT being used to read the file. This shows that there is still something very wrong with the CSV.File function itself, as the run time drops by about 27 s.

$ time julia pulse.jl data/pulse_20200514_234617.csv

real 1m12.716s
user 0m0.015s
sys 0m0.077s

Removing “using CSV” but leaving in “using DataFrames”: -40 seconds. Ouch.

$ time julia pulse.jl data/pulse_20200514_234617.csv

real 0m31.559s
user 0m0.031s
sys 0m0.061s

And finally, I get rid of both “using CSV” and “using DataFrames”, since I was only using DataFrames because I couldn't figure out how to make CSV give me an array. lol.
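(As an aside, I have since seen that Tables.jl can supposedly turn a CSV.File directly into a plain array, something like the following, though I haven't actually tried it:

using CSV, Tables

data = Tables.matrix(CSV.File("data/pulse_20200514_234617.csv"))   # untested

so maybe I never needed DataFrames at all.)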

$ time julia pulse.jl data/pulse_20200514_234617.csv

real 0m21.711s
user 0m0.000s
sys 0m0.140s

CSV is effectively unusable for me :frowning:

DataFrames isn't so great either, but the good news is that my routines that use DataFrames are processing DB accesses and doing a fair amount of work, so the overhead is not as obvious.

I’ll try out DelimitedFiles. Didn’t even know it existed.

Thanks for your help.

With DelimitedFiles, I’m still at about 23s

Then (I can't remember for the life of me where I saw it) I remembered that you can set the optimization level, so I “downgraded” my optimization:

julia -O0 …

and this took another 6 seconds off, I'm assuming because it's not trying so hard to optimize.
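While poking at the command-line flags I also noticed --compile=min, which supposedly tells Julia to compile as little as possible; I haven't tried combining the two yet:

$ julia -O0 --compile=min pulse.jl data/pulse_20200514_234617.csv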

I posted this question hoping to learn some techniques. For example, even though CSV was, sadly, a culprit, I had to figure that out by not using it.

I'm still interested in knowing whether there is a way to get any statistics, at a basic level, on where the compiler is spending its time.
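The closest thing I have found so far is the --trace-compile flag, which prints a precompile(...) statement for each method compiled while the program runs, so you can at least see what is being compiled, if not how long each one takes:

$ julia --trace-compile=stderr pulse.jl data/pulse_20200514_234617.csv

SnoopCompile.jl can apparently attribute actual inference time to individual methods, but I haven't tried it yet.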


I cannot reproduce the long load times of DelimitedFiles:

@show fl = download("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")

@time using DelimitedFiles #   0.000452 seconds (690 allocations: 37.062 KiB)

@time d,h = readdlm(fl, ',', header=true) # 1.163442 seconds (2.12 M allocations: 105.665 MiB, 1.47% gc time)
@time d,h = readdlm(fl, ',', header=true) # 0.000198 seconds (75 allocations: 44.297 KiB)

@show h
@show d

Although 1 s is still too long.

I think this might be because all the tests for DelimitedFiles seem to use some sort of IOBuffer rather than a file-path string:

Thus the automated precompile script never picks up the file-path method to compile.
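If that is right, the two entry points should behave quite differently on first use; something like this should show it (fl being the downloaded file from above):

using DelimitedFiles

@time readdlm(IOBuffer("1,2\n3,4"), ',')   # IO method: covered by the precompile script
@time readdlm(fl, ',')                     # file-path method: compiles fresh on first call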

I may have misled you. The overall compile/run time of my program is long, but it's definitely not the fault of DelimitedFiles.

DelimitedFiles loads in 3 ms, lol, and the file read only takes 0.2 s.

My program now takes approximately 18 s, total, to run:

  1. 3 s of Julia startup time, i.e. the time where I see nothing
  2. 7.2 s for “using PyPlot”
  3. 1.2 s for ALL of the other “using ModuleName” statements
  4. 0.23 s for the file read
  5. ~7 s unaccounted for

However, I have noticed another appreciable delay between the time all the using statements complete and the start of the program. I'm thinking that must be compilation time for the body of my program itself.
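To test that, my plan is to wrap the body of the script in a function and time the calls explicitly, roughly like this (main just standing in for my actual code):

function main(fname)
    # ... the file read and the simple math go here ...
end

@time main(ARGS[1])   # first call: compilation + run
@time main(ARGS[1])   # second call: run time only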

Also, I am using ArgParse, and I've become suspicious that it is contributing to the overhead, maybe through compilation of the command-line argument processing machinery?

I would like to point out something that will be very familiar to Julia users: processing 1 file takes my program about 18 s, but processing 36 files (I do it in a loop inside the program, so it doesn't have to re-run) takes only 2 s more, i.e. 20 s.

Yes, methods will still need to be compiled for each set of argument types. This incurs further delays after using, often of comparable size.
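You can see this with any freshly defined function; the first call pays the compilation cost, later calls do not:

f(x) = x + 1

@time f(1)   # first call: mostly compilation
@time f(1)   # second call: essentially free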

Yes, ArgParse was slow the last time I used it (a few years ago). Just posted today:

Maybe it's better?

Comment out everything to do with plotting, run it, and report back.

This is a pretty big pain point for me with Julia right now. I think some pretty comprehensive work still needs to be done around writing scripts with quick startup in Julia, e.g. compilation not always extending to all of the methods called by the methods that do get compiled. Using -O0 as a Julia argument made a pretty noticeable impact on some scripts I was working on too, so that's a very good find.

I thought I had; however, the “using PyPlot” was still in there, and when I commented it out I found that I had also missed commenting out the plot line. The “using PyPlot” plus the errant plot() statement were worth about 8 s (7 s for the using + 1 s for the plot), so that brought my run time down to about 9 s.

And now for the interesting part, as @mauro3 alluded to regarding ArgParse…

$ time julia args.jl test1
0.392707 seconds (46.13 k allocations: 3.572 MiB)
Dict{String,Any}("filenames" => ["test1"], "graph" => false)

real 0m7.928s
user 0m0.031s
sys 0m0.078s

That's right: 8 s to do nothing but process command-line args. Ouch.
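For anyone who wants to reproduce this, args.jl is essentially nothing but the argument parsing, roughly like this (reconstructed, details may differ):

using ArgParse

s = ArgParseSettings()
@add_arg_table! s begin   # @add_arg_table (no !) on older ArgParse versions
    "--graph"
        action = :store_true
    "filenames"
        nargs = '+'
end

@time args = parse_args(s)
println(args)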

Well, that explains a lot. I think I have ArgParse in a LOT of scripts. It never occurred to me to check it.

So without ArgParse the script is at about 4s.


Not sure if the ArgMacros module that was linked above works for your use case, but I wrote it basically because of this kind of speed issue. It might help here.


Yes, thanks for that! I will definitely try it out.

You might want to also check out argparse2, which was written to address the speed issues of other existing argument parsing solutions.
