Performance depends dramatically on compilation order

Using ProfileView, I have realized that our code base, Gridap.jl, suffers from runtime dispatch in some performance-critical parts. Luckily, I have been able to reduce the problem to a simple MWE and even found a workaround (although significant effort was required). I hope this post is useful to others encountering the same problem!

I have realized that whether runtime dispatch happens depends on the compilation order. Here is the MWE:

struct LazyArray{G,F}
  op::G
  args::F
  function LazyArray(op, f...)
    G = typeof(op)
    F = typeof(f)
    new{G,F}(op, f)
  end
end

index(a,i) = a[i]
index(a::LazyArray,i) = index_and_call(a.op,i,a.args)

function index_and_call(op,i,args::Tuple)
  fi = map(fj -> index(fj,i),args)
  op(fi...)
end

function loop(a,m)
  for i in 1:m
    index(a,1)
  end
end

using BenchmarkTools

m = 10^8
a = [rand(1:40,4)]
b = LazyArray(first,a)
isempty(ARGS) || index(b,1)
c = LazyArray(i->3*i,b)
@btime loop($c,$m)

Note the difference depending on whether index(b,1) is executed before measuring times:

$ julia mwe.jl doit
  38.858 ms (0 allocations: 0 bytes)
$ julia mwe.jl
  6.634 s (400000000 allocations: 5.96 GiB)

@code_warntype index(c,1) looks “green” in both cases, but @profview loop(c,m) reveals some unwanted runtime dispatch within index_and_call in the second case; see the red bars in the flame graph for the second case:

It seems the problem is related to the splatting of the arguments in op(fi...). If I add these extra methods to circumvent the splatting for particular tuple lengths, the problem is fixed for those particular cases:

function index_and_call(op,i,args::Tuple{Any})
  a1 = index(args[1],i)
  op(a1)
end

function index_and_call(op,i,args::Tuple{Any,Any})
  a1 = index(args[1],i)
  a2 = index(args[2],i)
  op(a1,a2)
end

function index_and_call(op,i,args::Tuple{Any,Any,Any})
  a1 = index(args[1],i)
  a2 = index(args[2],i)
  a3 = index(args[3],i)
  op(a1,a2,a3)
end
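
For what it’s worth, here is a sketch of how these hand-written methods could be generalized to any tuple length (my own idea, which I have not verified against the compilation-order issue): a @generated function that unrolls the indexing calls, so no intermediate tuple is splatted at runtime.

```julia
index(a, i) = a[i]  # same generic fallback as in the MWE above

# The generated body unrolls to op(index(args[1], i), ..., index(args[N], i)),
# with the tuple length N known at compile time.
@generated function index_and_call(op, i, args::Tuple)
  N = length(args.parameters)                   # number of wrapped containers
  calls = [:(index(args[$j], i)) for j in 1:N]  # index(args[1], i), index(args[2], i), ...
  :(op($(calls...)))
end
```

For example, index_and_call(+, 1, ([10, 20], [1, 2])) unrolls to +(10, 1) and returns 11.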

Has anybody encountered similar problems? Is this a known performance bug in Julia?

Thanks for helping!


For completeness, these are the times obtained when including the extra methods:

$ julia mwe.jl doit
  38.858 ms (0 allocations: 0 bytes)
$ julia mwe.jl
  69.949 ms (0 allocations: 0 bytes)

The big problem is fixed, but there is still some room for improvement.

And this is the flame graph for the fix:

The only runtime dispatch left is at the REPL, as expected.

That’s very interesting, @fverdugo! Thanks for posting this observation :+1: Did you report this as an issue in the Julia repository? As far as I know, the core Julia developers do not necessarily see everything posted here, but issues are less likely to get lost. To me, this seems to be an interesting performance question.

Indeed, I have recently encountered a similar problem that I haven’t been able to reduce to an MWE yet. There, I also observed huge performance differences between different compilation orders: a single function call could take either a few microseconds or exactly (!) five seconds. I really have no clue what’s going on in my case, but chances are that it is related to your nice observation. Thus, if your problem can be fixed in Julia, I might be lucky as well…


Some discussion that might be relevant: sporadic type instability with mapreduce / norm · Issue #35800 · JuliaLang/julia · GitHub
