Using ProfileView, I have realized that our code base Gridap.jl suffers from runtime dispatch at some performance critical parts. Luckily, I have been able to reduce the problem to a simple MWE and even found a workaround (significant effort was required though). Hope this post is useful to others encountering the same problem!
I have realized that runtime dispatch happens or not depending on compilation order. Here the MWE:
struct LazyArray{G,F}
op::G
args::F
function LazyArray(op, f...)
G = typeof(op)
F = typeof(f)
new{G,F}(op, f)
end
end
index(a,i) = a[i]
index(a::LazyArray,i) = index_and_call(a.op,i,a.args)
function index_and_call(op,i,args::Tuple)
fi = map(fj -> index(fj,i),args)
op(fi...)
end
function loop(a,m)
for i in 1:m
index(a,1)
end
end
using BenchmarkTools
m = 10^8
a = [rand(1:40,4)]
b = LazyArray(first,a)
isempty(ARGS) || index(b,1)
c = LazyArray(i->3*i,b)
@btime loop($c,$m)
Note the difference between executing or not index(b,1)
before measuring times:
$ julia mwe.jl doit
38.858 ms (0 allocations: 0 bytes)
$ julia mwe.jl
6.634 s (400000000 allocations: 5.96 GiB)
@code_warntype index(c,1)
looks “green” in both cases, but @profview loop(c,m)
reveals that there is some unwanted runtime dispatch in the second case within function index_and_call
, see the red bars in the flame graph for the second case:
It seems the problem is related with the unpack of the arguments in op(fi...)
. If I add these extra methods to circumvent unpacking for particular cases, the problem is fixed for those particular cases
function index_and_call(op,i,args::Tuple{Any})
a1 = index(args[1],i)
op(a1)
end
function index_and_call(op,i,args::Tuple{Any,Any})
a1 = index(args[1],i)
a2 = index(args[2],i)
op(a1,a2)
end
function index_and_call(op,i,args::Tuple{Any,Any,Any})
a1 = index(args[1],i)
a2 = index(args[2],i)
a3 = index(args[3],i)
op(a1,a2,a3)
end
Has somebody encountered similar problems? is this a known performance bug in Julia ?
Thanks for helping!