Performance depends dramatically on compilation order

Using ProfileView, I have realized that our code base, Gridap.jl, suffers from runtime dispatch in some performance-critical parts. Luckily, I have been able to reduce the problem to a simple MWE and even found a workaround (although significant effort was required). I hope this post is useful to others encountering the same problem!

I have realized that whether runtime dispatch happens depends on the compilation order. Here is the MWE:

struct LazyArray{G,F}
  op::G
  args::F
  function LazyArray(op, f...)
    G = typeof(op)
    F = typeof(f)
    new{G,F}(op, f)
  end
end

index(a,i) = a[i]
index(a::LazyArray,i) = index_and_call(a.op,i,a.args)

function index_and_call(op,i,args::Tuple)
  fi = map(fj -> index(fj,i),args)
  op(fi...)
end

function loop(a,m)
  for i in 1:m
    index(a,1)
  end
end

using BenchmarkTools

m = 10^8
a = [rand(1:40,4)]
b = LazyArray(first,a)
isempty(ARGS) || index(b,1)
c = LazyArray(i->3*i,b)
@btime loop($c,$m)

Note the difference depending on whether index(b,1) is executed before measuring times:

$ julia mwe.jl doit
  38.858 ms (0 allocations: 0 bytes)
$ julia mwe.jl
  6.634 s (400000000 allocations: 5.96 GiB)

@code_warntype index(c,1) looks “green” in both cases, but @profview loop(c,m) reveals some unwanted runtime dispatch within index_and_call in the second case; see the red bars in the flame graph for the second case:

It seems the problem is related to the splatting of the arguments in op(fi...). If I add these extra methods to circumvent the splatting for particular tuple lengths, the problem is fixed for those particular cases:

function index_and_call(op,i,args::Tuple{Any})
  a1 = index(args[1],i)
  op(a1)
end

function index_and_call(op,i,args::Tuple{Any,Any})
  a1 = index(args[1],i)
  a2 = index(args[2],i)
  op(a1,a2)
end

function index_and_call(op,i,args::Tuple{Any,Any,Any})
  a1 = index(args[1],i)
  a2 = index(args[2],i)
  a3 = index(args[3],i)
  op(a1,a2,a3)
end
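
For what it’s worth, here is a sketch of how these hand-written methods could be generalized to any tuple length (my own idea, which I have not verified against the compilation-order issue): a @generated function that unrolls the indexing calls, so no intermediate tuple is splatted at runtime.

```julia
index(a, i) = a[i]  # same generic fallback as in the MWE above

# The generated body unrolls to op(index(args[1], i), ..., index(args[N], i)),
# with the tuple length N known at compile time.
@generated function index_and_call(op, i, args::Tuple)
  N = length(args.parameters)                   # number of wrapped containers
  calls = [:(index(args[$j], i)) for j in 1:N]  # index(args[1], i), index(args[2], i), ...
  :(op($(calls...)))
end
```

For example, index_and_call(+, 1, ([10, 20], [1, 2])) unrolls to +(10, 1) and returns 11.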

Has anybody encountered similar problems? Is this a known performance bug in Julia?

Thanks for helping!


For completeness, these are the times obtained when including the extra methods:

$ julia mwe.jl doit
  38.858 ms (0 allocations: 0 bytes)
$ julia mwe.jl
  69.949 ms (0 allocations: 0 bytes)

The big problem is fixed, but there is still some room for improvement.

And this is the flame graph for the fix:

The only runtime dispatch left is at the REPL, as expected.

That’s very interesting, @fverdugo! Thanks for posting this observation :+1: Did you report this as an issue in the Julia repository? As far as I know, the core Julia developers do not necessarily see everything posted here, but issues are less likely to get lost. To me, this seems to be an interesting performance question.

Indeed, I have recently encountered a similar problem that I haven’t been able to reduce to an MWE yet. There, I also observed huge performance differences between different compilation orders: a single function call could take either a few microseconds or exactly (!) five seconds. I really have no clue what’s going on in my case, but chances are that it is related to your nice observation. Thus, if your problem can be fixed in Julia, I might be lucky as well…


Some discussion that might be relevant: sporadic type instability with mapreduce / norm · Issue #35800 · JuliaLang/julia · GitHub
