I don’t know if this is the correct category, but here goes:
Since Julia 0.6 was released I have seen major performance issues with Julia. After analyzing the issue a little, I found that it seems to be compile performance that has regressed a lot. It seems this went unnoticed during the transition from 0.5 to 0.6, and I cannot really find any deeper discussion of the subject.
I now have some examples that regressed, and I wonder if there is some “compiler benchmark suite” where the compiler performance of 0.5 vs 0.6 vs 0.7 is systematically compared? If so, I could contribute some examples to such a suite; I am currently collecting them while analyzing my code. Here is a simple example:
You should run these experiments with precompile.jl disabled. It is fast on 0.7 because I ran @time when generating the file this time. Or perhaps the new broadcasting API is better covered in precompile.jl now…
I do think that, overall, compiler performance has regressed, and we badly need to work on it. Part of it is that LLVM has been getting slower over time, but whatever the cause, we need to address it. A 100x slowdown is certainly unusual, though, and because of that I would also guess that precompile.jl is involved.
Yes, very much. We are very careful about runtime performance (with Nanosoldier) but have nothing really running for compile times. Adding a test group for compile times has been on my mind for quite some time, but I haven’t had time to do it, nor am I sure exactly how it is best measured or whether it is possible to avoid the overhead of restarting julia for each measurement.
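To make the measurement question concrete, here is a rough sketch of two ways one might time compilation. It is only an illustration, not an existing benchmark suite: `work` and the one-liner in `code` are hypothetical placeholders, and subtracting the second call from the first is just an approximation of compile overhead.

```julia
# Hypothetical workload: any function not yet called in this session will do.
work(x) = sum(abs2, x .+ 1)

x = rand(1000)

# First call: inference + codegen + execution.
t_first = @elapsed work(x)
# Second call: the method is already compiled, so this is execution only.
t_second = @elapsed work(x)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
println("approx. compile overhead: ", t_first - t_second, " s")

# To repeat the measurement from a clean state, a fresh process is needed.
# This times the whole child process, so it also includes Julia startup time.
code = "work(x) = sum(abs2, x .+ 1); work(rand(1000))"
t_process = @elapsed run(`$(Base.julia_cmd()) --startup-file=no -e $code`)
println("fresh process total: ", t_process, " s")
```

The in-process variant is cheap but can only be run once per method signature in a session; the fresh-process variant is repeatable but pays the restart overhead mentioned above.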
It’s great that a future focus will be on compiler performance, but I wonder about the following: which compilation pass is so slow? LLVM would be the optimizer and codegen, right? But when I profile this I see only calls in inference. Doesn’t that mean this might be a performance issue in inference? I am no compiler expert, but those would be two different corners, no?
The new Broadcasting API is indeed much more precompile-friendly as it no longer creates new anonymous functions for each dot-broadcast expression. Creating new anonymous functions is a pretty bad case requiring lots of recompilation.
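A small sketch of why fresh anonymous functions are costly for the compiler. It uses `map` as a stand-in rather than reproducing how 0.6 actually lowered dot-broadcast expressions, and the names `f1`, `f2` and `v` are just for illustration:

```julia
f1 = x -> x + 1
f2 = x -> x + 1   # same body, but a separate literal in the source

# Each anonymous function literal gets its own type.
println(typeof(f1) == typeof(f2))   # false

v = collect(1:1000)
t1 = @elapsed map(f1, v)   # compiles map for typeof(f1)
t2 = @elapsed map(f2, v)   # compiles map again for typeof(f2)
t3 = @elapsed map(f1, v)   # reuses the code compiled for typeof(f1)
println((t1, t2, t3))
```

Because higher-order functions are specialized on the type of the function argument, every new literal forces another round of compilation, and a precompiled specialization for one literal cannot be reused for another.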
I don’t believe our profiler is set up to track time spent in C++/LLVM. We’ve been calling Keno’s new pure-Julia IR passes the “optimizer”; their output is what gets passed off to C++ to generate LLVM IR, and then LLVM runs a number of optimization passes of its own.
I haven’t tested it, but I’m getting the feeling that on 0.7 the initial precompilation that happens with using is waaaay slower, but after that a lot of things actually seem to be a little bit faster. Am I perceiving this correctly? What changed? Is it just precompiling more stuff?
Has anybody encountered inconsistent allocations/timings on recent masters? I have been puzzled by identical code allocating vastly different amounts of memory when run in two different, newly launched REPL sessions. I’ve not managed to pin down the problem or produce an MWE, but your comment @Tamas_Papp sounded familiar…