Compiler Performance

I don’t know if this is the correct category, but maybe:

Since Julia 0.6 was released I have seen major performance issues with Julia. After analyzing the issue a little, I found that it is the compile performance that seems to have regressed a lot. This apparently was not noticed during the transition from 0.5 to 0.6, and I cannot really find deeper discussions on the subject.

I now have some examples that regressed, and I wonder whether there is a “compiler benchmark suite” where the compiler performance of 0.5 vs 0.6 vs 0.7 is systematically compared. If so, I could contribute some examples that I am currently collecting while analyzing my code. Here is a simple one:
Julia 0.5.2:

julia> u=[1]
1-element Array{Int64,1}:
 1

julia> @time u.== 1
  0.023830 seconds (8.52 k allocations: 371.236 KB)
1-element BitArray{1}:
 true

Julia 0.6.3:

julia> u=[1]
1-element Array{Int64,1}:
 1

julia> @time u.== 1
  2.614611 seconds (835.32 k allocations: 44.224 MiB, 0.66% gc time)
1-element BitArray{1}:
 true

This is roughly a 100-fold regression in compile performance.


FWIW, I get

julia> u=[1]
1-element Array{Int64,1}:
 1

julia> @time u.== 1
  1.455042 seconds (165.26 k allocations: 8.545 MiB)
1-element BitArray{1}:
 true

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

and

julia> u=[1]
1-element Array{Int64,1}:
 1

julia> @time u.== 1
  0.045010 seconds (16.64 k allocations: 996.639 KiB)
1-element BitArray{1}:
 true

julia> versioninfo()
Julia Version 0.7.0-alpha.0
Commit 22590d529d* (2018-05-31 00:07 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)

But I don’t know whether this is because of changed makefiles for the sysimg or genuine compile-speed updates.

I have not tested on 0.7, so it would be great if that operation has gotten faster again. Still, I am wondering whether there are systematic tools to check compiler performance and catch regressions.

You should run these experiments with precompile.jl disabled. It is fast on 0.7 because I ran @time when generating the file this time. Or perhaps the new broadcasting API is better covered in precompile.jl now…
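
For context, precompile.jl is a list of precompile statements that get executed while the system image is built, so any signature it covers is compiled at build time rather than at the first call in a session. A purely illustrative sketch of such a statement (the signature below is made up for the example and is not taken from the actual file):

precompile(broadcast, (typeof(==), Vector{Int}, Int))

Whether a given first call is fast therefore depends heavily on what the file happens to cover, which is why timings like the ones above are so sensitive to it.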

Yes, indeed it can be different things. But that implies we need tooling for finding regressions (either in compile time or in precompile.jl statements).

I do think that, overall, compiler performance has regressed, and we badly need to work on it. Partly LLVM has been getting slower over time, but whatever the cause we need to address it. 100x is certainly unusual though, and I would also guess that precompile.jl is involved because of that.


Yes, very much. We are very careful about runtime performance (with Nanosoldier) but have nothing really running for compile times. Adding a test group for compile times has been on my mind for quite some time, but I haven’t gotten around to it, nor am I sure exactly how it is best measured and whether it is possible to avoid the overhead of restarting Julia for each measurement.
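
One low-tech sketch, assuming julia is on the PATH and accepting the process-restart overhead: time the first evaluation in a fresh process, so nothing compiled earlier in the session can skew the measurement. first_call_time is a hypothetical helper, not an existing Nanosoldier test group.

# Time the first (compiling) evaluation of an expression in a fresh
# Julia process started without a startup file.
function first_call_time(expr::String)
    script = "u = [1]; t = @elapsed ($expr); println(t)"
    parse(Float64, readchomp(`julia --startup-file=no -e $script`))
end

first_call_time("u .== 1")   # seconds, dominated by compile time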


It’s great that a future focus will be on compiler performance, but I wonder: which compilation pass is so slow? LLVM would be the optimizer and codegen, right? But when I profile this, I see only calls in inference. Doesn’t that mean this might be a performance issue during inference? I am no compiler expert, but these would be different corners, no?

The new Broadcasting API is indeed much more precompile-friendly as it no longer creates new anonymous functions for each dot-broadcast expression. Creating new anonymous functions is a pretty bad case requiring lots of recompilation.
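
A rough illustration of what the new API looks like on 0.7 (hedged: the exact lowering differs between versions, but Broadcast.broadcasted and Broadcast.materialize are the generic entry points that dot-expressions now go through):

u = [1]
bc = Base.Broadcast.broadcasted(==, u, 1)   # lazy Broadcasted object
Base.Broadcast.materialize(bc)              # evaluates it; same result as u .== 1

Because the same generic functions are reused by every dot-expression, precompile statements for them can cover many call sites at once, which is what makes the new design more precompile-friendly.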

I don’t believe our profiler is set up to track time spent in C++/LLVM. We’ve been calling Keno’s new pure-Julia IR passes the “optimizer” — this is then what gets passed off to C++ to generate LLVM IR, and then LLVM has a number of optimization passes that it runs.
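
Those stages roughly line up with the reflection macros, which can help narrow down where the time is going (a sketch on a trivial function, just for illustration):

f(x) = x == 1

@code_lowered f(1)   # lowered IR: the input to inference and the Julia-level optimizer
@code_typed f(1)     # after type inference and Julia-level optimization
@code_llvm f(1)      # LLVM IR emitted by codegen (the C++ side)
@code_native f(1)    # machine code after LLVM's own optimization passes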

Profile.print takes a C = true keyword which will display C stack frames as well. Not sure how well that will work on a non-debug build though.
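
For example, something like the following (a sketch; on 0.7 Profile is a standard library, while on 0.6 @profile is available without the using):

using Profile                # needed on 0.7+

u = [1]
@profile (u .== 1)           # profile the first call in a fresh session so compilation shows up
Profile.print(C = true)      # include C/C++ frames, e.g. inside libLLVM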


This was a good suggestion. Now I can see, in the leaves of the deep inference calls, many, many calls into libLLVM.

I haven’t tested it, but I’m getting the feeling that on 0.7 the initial pre-compilation that happens with using is way slower, but after that a lot of things actually seem to be a little bit faster. Am I perceiving this correctly? What changed? Is it just pre-compiling more stuff?

The most striking thing I noticed is that I have a package that takes about 100 s to compile (i.e., the first run) on Julia 0.6, but only 10-15 s on Julia 0.7.

julia> u=[1]
1-element Array{Int64,1}:
 1

julia> @time u.== 1
  2.463846 seconds (165.28 k allocations: 8.558 MiB)
1-element BitArray{1}:
 true

Julia 0.6.2, Windows 7, 64-bit, Intel i5-3320M 2.60GHz

Compared with Python, I feel the compile time undermines the benefits and convenience of the language’s elegance, especially when running the code again and again for debugging purposes.

I get similar results in v0.6.3, but note that in current master,

julia> @time u.==1
  0.275278 seconds (642.14 k allocations: 32.539 MiB, 6.35% gc time)

so things will improve soon for everyone.

Here’s alpha:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-alpha.0 (2018-05-31 00:07 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia-0.7> u=[1];

> @time u .== 1
  0.066827 seconds (16.64 k allocations: 997.857 KiB)

So what’s wrong with master?


Haha! Marvellous!

It was some kind of quirk I cannot reproduce; now it is approximately as fast as your benchmark.

Has anybody encountered inconsistent allocations/timings on recent masters? I have been puzzled by identical code allocating vastly different amounts of memory when run in two different, newly launched REPL sessions. I’ve not managed to pin down the problem or produce an MWE, but your comment, @Tamas_Papp, sounded familiar…

I think I was running multiple MCMC chains on that machine when I did the benchmark, so it probably was my fault.