Slow compilation, I think; not sure how to debug

I have a very simple program, on the order of 50 lines, which uses CSV.jl to read a 448-line CSV file with two entries per line.
I have put @time on the using statements and they look normal (oh why oh why is PyPlot 8 s?!).
Total time: about 11 s.

The program does not appear to start running for at least another 10-15 s. I can't provide the program because work, blah, blah, blah; I'm hoping to create a separate copy of what it's doing so I can post it. I thought for sure it was a compilation issue, but then I stuck an @time on the CSV.File statement and it came back at about 8.5 s, which seemed impossible. So I ran the CSV.File call by hand in the REPL and it came back at 0.2 s, which is what you would expect.

So now I can't figure out whether the @time in the program is correct and there's just some other weird run-time issue, or whether @time is returning a bogus number.

Regardless, I'm waiting 30-35 s for a 50-line program to read a 448-line CSV file and perform some really simple math on the contents.

I was hoping to get some help on how to start debugging this. My plan right now is to sprinkle a lot of @time calls through the program to see whether it's more run time or compilation.
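Something like this, where fname and process are just stand-ins for my actual code:

@time using CSV
@time using DataFrames

@time data = CSV.File(fname)     # time of the read, including any first-call compilation
@time result = process(data)     # same for the actual computation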

Is there some (relatively simple) way to get metrics on what the compiler is doing?

It's a very weird situation. I've waited this long for some programs, but they were considerably more complicated and pulled in a LOT of other modules. This one is ridiculously simple and is taking a really long time.

What version of Julia are you using? Recent ones have improved a lot in these areas.


Aargh, I should know better than to post a question like this without the version.

I’m using 1.4.1

I'm using 1.3.1 at home. I'll try to get the code onto my home machine and see what 1.3.1 does with it.

This might be the long compile time of CSV.jl. Have you tried readdlm from DelimitedFiles? That should be a lot faster, at least for small files, since DelimitedFiles is part of the standard library and thus fully compiled (not just precompiled).
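For a simple two-column file like yours, it would look something like this (the file name is just a placeholder):

using DelimitedFiles

data = readdlm("yourfile.csv", ',')   # returns a plain Matrix, e.g. 448×2 in your case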

I think this is the (pre?)compilation time of CSV.jl, which is always rather slow on first use.

Is it possible for you to (re)use an existing Julia session to circumvent this?
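For example, keep one REPL open and re-run the script in it, so the using cost is paid only once per session (yourscript.jl and the data path are placeholders; the ARGS fiddling is only needed if the script reads command-line arguments):

empty!(ARGS); push!(ARGS, "data/mydata.csv")   # set the script's arguments by hand
include("yourscript.jl")   # slow the first time (using + compilation)
include("yourscript.jl")   # much faster on subsequent runs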

For reference, it takes me 15.7 seconds to read a 116,719-row file (43 columns) the first time. The second time takes 2.78 seconds.

Wow, absolutely crazy results.
Everything is being run twice.

First run, with “using CSV” in place and the file read via CSV:

$ time julia pulse.jl data/pulse_20200514_234617.csv
real 1m39.590s
user 0m0.015s
sys 0m0.093s

This run still has the “using CSV” statement in it, but I am using a simple hand-rolled reader, i.e. CSV is NOT being used to read the file. This shows that there is still something very wrong with the CSV.File function itself, as the run time drops by about 27 s.

$ time julia pulse.jl data/pulse_20200514_234617.csv

real 1m12.716s
user 0m0.015s
sys 0m0.077s

Removing “using CSV” but leaving in “using DataFrames”: -40 seconds. Ouch.

$ time julia pulse.jl data/pulse_20200514_234617.csv

real 0m31.559s
user 0m0.031s
sys 0m0.061s

And finally, I get rid of both “using CSV” and “using DataFrames”, since I was only using DataFrames because I couldn't figure out how to make CSV give me an array. lol.
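(As an aside, I have since seen that Tables.jl can supposedly turn a CSV.File directly into a plain array, something like the following, though I haven't actually tried it:

using CSV, Tables

data = Tables.matrix(CSV.File("data/pulse_20200514_234617.csv"))   # untested

so maybe I never needed DataFrames at all.)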

$ time julia pulse.jl data/pulse_20200514_234617.csv

real 0m21.711s
user 0m0.000s
sys 0m0.140s

CSV is effectively unusable for me :frowning:

DataFrames isn't so great either, but the good news is that my routines that use DataFrames are processing DB accesses and doing a fair amount of work, so the overhead is not as obvious.

I’ll try out DelimitedFiles. Didn’t even know it existed.

Thanks for your help.

With DelimitedFiles, I’m still at about 23s

Then (I can't remember for the life of me where I saw it) I remembered that you can set the optimization level, so I “downgraded” my optimization:

julia -O0 …

and this took another 6 seconds off, I'm assuming because it's not trying so hard to optimize.
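While poking at the command-line flags I also noticed --compile=min, which supposedly tells Julia to compile as little as possible; I haven't tried combining the two yet:

$ julia -O0 --compile=min pulse.jl data/pulse_20200514_234617.csv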

I posted this question hoping to learn some techniques. For example, even though CSV was, sadly, a culprit, I had to figure that out by not using it.

I'm still interested in knowing whether there is a way to get any statistics, at a basic level, on where the compiler is spending its time.
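The closest thing I have found so far is the --trace-compile flag, which prints a precompile(...) statement for each method compiled while the program runs, so you can at least see what is being compiled, if not how long each one takes:

$ julia --trace-compile=stderr pulse.jl data/pulse_20200514_234617.csv

SnoopCompile.jl can apparently attribute actual inference time to individual methods, but I haven't tried it yet.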


I cannot reproduce the long load times of DelimitedFiles:

@show fl = download("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")

@time using DelimitedFiles #   0.000452 seconds (690 allocations: 37.062 KiB)

@time d,h = readdlm(fl, ',', header=true) # 1.163442 seconds (2.12 M allocations: 105.665 MiB, 1.47% gc time)
@time d,h = readdlm(fl, ',', header=true) # 0.000198 seconds (75 allocations: 44.297 KiB)

@show h
@show d

Although 1 s is still too long.

I think this might be because all the tests for DelimitedFiles seem to use some sort of IOBuffer rather than a file-path string:

Thus the automated precompile script never picks up the file-path method to compile.
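If that is right, the two entry points should behave quite differently on first use; something like this should show it (fl being the downloaded file from above):

using DelimitedFiles

@time readdlm(IOBuffer("1,2\n3,4"), ',')   # IO method: covered by the precompile script
@time readdlm(fl, ',')                     # file-path method: compiles fresh on first call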

I may have misled you. The overall compile/run time of my program is long, but it's definitely not the fault of DelimitedFiles.

DelimitedFiles loads in 3 ms, lol, and the file read only takes 0.2 s.

My program now takes approximately 18 s, total, to run:

  1. 3 s of Julia startup time, i.e. the time where I see nothing
  2. 7.2 s for “using PyPlot”
  3. 1.2 s for ALL of the other “using ModuleName” statements
  4. 0.23 s for the file read
  5. ~7 s unaccounted for

However, I have noticed another appreciable delay between the time all the using statements complete and the start of the program. I'm thinking that must be compilation time for the body of my program itself.
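To test that, my plan is to wrap the body of the script in a function and time the calls explicitly, roughly like this (main just standing in for my actual code):

function main(fname)
    # ... the file read and the simple math go here ...
end

@time main(ARGS[1])   # first call: compilation + run
@time main(ARGS[1])   # second call: run time only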

Also, I am using ArgParse, and I've become suspicious that it is contributing to the overhead, maybe through compilation of the command-line argument processing machinery?

I would like to point out something that will be very familiar to Julia users: processing 1 file takes my program about 18 s, but processing 36 files (I do it in a loop inside the program, so it doesn't have to re-run) takes only 2 s more, i.e. 20 s.

Yes, methods will still need to be compiled for each set of argument types. This incurs further delays after using, often of comparable size.
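You can see this with any freshly defined function; the first call pays the compilation cost, later calls do not:

f(x) = x + 1

@time f(1)   # first call: mostly compilation
@time f(1)   # second call: essentially free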

Yes, ArgParse was slow the last time I used it (a few years ago). Just posted today:

Maybe it's better?

Comment out everything to do with plotting, run it, and report back.

This is a pretty big pain point for me with Julia right now. I think some pretty comprehensive work still needs to be done around writing scripts with quick startup in Julia, e.g. compilation not always extending to all of the methods called by the methods that do get compiled. Using -O0 as a Julia argument made a pretty noticeable impact on some scripts I was working on too, so that's a very good find.

I thought I had; however, the “using PyPlot” was still in there, and when I commented it out I found that I had also missed commenting out the plot line. The “using PyPlot” plus the errant plot() statement were worth about 8 s (7 s for the using + 1 s for the plot), so that brought my run time down to about 9 s.

And now for the interesting part, as @mauro3 alluded to regarding ArgParse…

$ time julia args.jl test1
0.392707 seconds (46.13 k allocations: 3.572 MiB)
Dict{String,Any}("filenames" => ["test1"], "graph" => false)

real 0m7.928s
user 0m0.031s
sys 0m0.078s

That's right: 8 s to do nothing but process command-line args. Ouch.
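For anyone who wants to reproduce this, args.jl is essentially nothing but the argument parsing, roughly like this (reconstructed, details may differ):

using ArgParse

s = ArgParseSettings()
@add_arg_table! s begin   # @add_arg_table (no !) on older ArgParse versions
    "--graph"
        action = :store_true
    "filenames"
        nargs = '+'
end

@time args = parse_args(s)
println(args)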

Well, that explains a lot. I think I have ArgParse in a LOT of scripts. It never occurred to me to check it.

So without ArgParse the script is at about 4s.


Not sure if the ArgMacros module that was linked above works for your use case, but I wrote it basically because of this kind of speed issue. It might help here.


Yes, thanks for that! I will definitely try it out.

You might want to also check out argparse2, which was written to address the speed issues of other existing argument parsing solutions.
