Please help: Poor performance, what am I missing?

Hello clever people,

I’m having trouble dissecting a performance problem.

MWE below. The computation in the 3rd benchmark is the composition of the computations in the first two, so I expected it to take roughly the sum of their times (about 22 ns). Instead it takes about twice that (45 ns), and there’s an unexpected allocation. Clearly I’m missing something fundamental. What’s happening here?


using BenchmarkTools

"Sum the numeric values, count the non-numeric values"
function processcollection(t)
    total = 0.0
    ncat  = 0
    for x in t
        total, ncat = processvalue(x, total, ncat)
    end
    total, ncat
end

processvalue(x::Real, total, ncat) = total + x, ncat
processvalue(x, total, ncat) = total, ncat + 1

d = Dict("a" => (1,2), "b" => ("a", "b", "c"), "c" => (1.1, 2.2, 3.3, 4.4, 5.5, 6.6))

k = "c"
v = d[k]
@benchmark $d[$k]                     # 20ns,  0 bytes
@benchmark processcollection($v)      #  2ns,  0 bytes
@benchmark processcollection($d[$k])  # 45ns, 32 bytes, 1 alloc

If you move the variables into a function:

function main()
    d = Dict("a" => (1,2), "b" => ("a", "b", "c"), "c" => (1.1, 2.2, 3.3, 4.4, 5.5, 6.6))
    k = "c"
    @benchmark processcollection($d[$k])
end
main()

then I get 27 ns.

In general this will still perform a dynamic dispatch. Even though the compiler knows the types of d and k (because you interpolated them), it cannot infer a concrete type for d[k], because your dictionary is heterogeneous: its values have different types.

Doesn’t seem to help - I’m still getting 45ns on my machine.

Thanks Steven.
I still don’t get why the 3rd benchmark is slower than the sum of the first two. The dynamic dispatch should happen in both the 1st and the 3rd benchmarks, right?

No, the dynamic dispatch is determining (at runtime) which compiled method of processcollection to call based on the type of d[k], and this only happens in the third benchmark.
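
To make that concrete, here is a small sketch (using the same dictionary as in the original post, nothing else assumed) that checks whether the dictionary's value type is concrete. It only illustrates why the compiler cannot pick a method of processcollection ahead of time; it is not a measurement of the dispatch cost itself.

```julia
# Same dictionary as in the original post.
d = Dict("a" => (1, 2), "b" => ("a", "b", "c"), "c" => (1.1, 2.2, 3.3, 4.4, 5.5, 6.6))

# The three values have three different concrete tuple types, so the
# dictionary's declared value type cannot be concrete.
println(valtype(d))
println(isconcretetype(valtype(d)))      # false

# The concrete type of a particular value is only known at run time.
println(typeof(d["c"]))
println(isconcretetype(typeof(d["c"])))  # true
```

Because valtype(d) is abstract, a call like processcollection(d[k]) has to look up the right specialization when it runs.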

Ah ok, so we’re essentially talking about the distinction between @benchmark processcollection($v) and @benchmark processcollection(v), which use compile-time dispatch and run-time dispatch respectively, and the latter is the same as the 3rd benchmark.
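
For anyone who wants to try this comparison themselves, a minimal sketch is below. It assumes BenchmarkTools is installed and that v is a non-const global, as in the original post; the function definitions are copied from there with explicit parentheses and return added. Timings will vary by machine and Julia version, so none are shown.

```julia
using BenchmarkTools

# Numeric values are added to the total; anything else bumps the counter.
processvalue(x::Real, total, ncat) = (total + x, ncat)
processvalue(x, total, ncat) = (total, ncat + 1)

function processcollection(t)
    total = 0.0
    ncat = 0
    for x in t
        total, ncat = processvalue(x, total, ncat)
    end
    return total, ncat
end

v = (1.1, 2.2, 3.3, 4.4, 5.5, 6.6)  # non-const global, as in the original post

# Interpolated: the benchmark sees a value of known concrete type.
@btime processcollection($v)

# Not interpolated: v is looked up as an untyped global on each evaluation,
# so the method to call is chosen at run time.
@btime processcollection(v)
```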

Thanks again, most helpful.
Jock