Help Requested: Understanding cause of allocations for dynamic dispatch returning `nothing`, which vary based on argument type

Can someone help me understand why passing a Ptr{UInt} to this dynamic dispatch leads to allocations, whereas passing a Ref{UInt} does not?

I have tried the Allocation Profiler, Cthulhu, and @code_*, and I can’t figure out where the allocation is coming from. (Which separately speaks to how hard it can still be to do this kind of perf investigation.)

Here are two implementations of the same idea, hash1 and hash2. The idea is to reduce allocations when hashing a collection with an abstract element type, by eliminating the allocations caused by the dynamic dispatch.

This is related to the discussion here: `hash(::Vector{<abstract type>}, ::UInt)` allocates quite a bit due to type instability: why can't hash(::Any) infer `UInt` return type? · Issue #44244 · JuliaLang/julia · GitHub.

My understanding is: dynamic dispatch will (almost) always heap allocate its return value, since it doesn’t know in advance what type the return value will be, since it doesn’t know which method will be selected (that’s what makes it a dynamic dispatch). So the idea here was to use an output parameter instead of returning a value, specifically to avoid allocating the result.
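To make that concrete, here is a minimal sketch (my own illustration, not code from the linked issue): with an Any-eltype array, every hash(x, h) call is a dynamic dispatch, and @allocated shows the cost of the boxed return values:

```julia
# Naive loop over an Any-eltype array: hash(x, h) is a dynamic dispatch
# whose UInt return value gets heap-allocated (boxed) on each iteration,
# because the call site can't know the return type in advance.
naive_hash(a, h::UInt = UInt(0)) = (for x in a; h = hash(x, h); end; h)

xs = Any[i for i in 1:1000]
naive_hash(xs)              # warm up so @allocated doesn't count compilation
@allocated naive_hash(xs)   # nonzero: boxed results from the dynamic calls
```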

hash1 uses a Ref{UInt} as the output parameter, and indeed it doesn’t allocate on each iteration. But it does need to allocate the Ref, since it passes it through to a dynamically chosen function, which lets that Ref escape.

The idea in hash2 was to pass a Ptr{UInt} instead, to allow escape analysis to elide the allocation of the Ref! That should get us to 0 allocations instead of 1. However, now every iteration of the for-loop somehow allocates, and I don’t know why! (But indeed the alloc for the Ref is gone.)

Any help would be much appreciated!

function hash1(a::AbstractArray, h::UInt = UInt(0))
    ref = Ref{UInt}(h)
    for x in a
        hash_into!(ref, x)::Nothing
    end
    return ref[]::UInt
end
function hash_into!(ref::Ref{UInt}, v::T) where T
    ref[] = hash(v, ref[])
    return nothing
end
function hash2(a::AbstractArray, h::UInt = UInt(0))
    r = Ref(h)
    p = Base.unsafe_convert(Ptr{UInt}, r)
    GC.@preserve r begin
        for x in a
            hash_into!(p, x)::Nothing
        end
    end
    return r[]::UInt
end
function hash_into!(p::Ptr{UInt}, x::T) where T
    h = hash(x, unsafe_load(p))
    unsafe_store!(p, h)
    return nothing
end

You can see that for Arrays with a concrete element type, where there is no type instability and thus no dynamic dispatch, neither one allocates, as expected:

julia> a = collect(1:Int(1e3));

julia> @btime hash1($(a))
  1.079 μs (0 allocations: 0 bytes)
0x246980f127272a2c

julia> @btime hash2($(a))
  1.079 μs (0 allocations: 0 bytes)
0x246980f127272a2c

But for arrays with an abstract element type, where we need a dynamic dispatch, hash1 makes a single allocation for the Ref and none per iteration, while hash2 stack-allocates the Ref but allocates on every iteration for some mystery reason:

julia> a2 = Any[i for i in 1:Int(1e3)];

julia> @btime hash1($(a2))
  7.594 μs (1 allocation: 16 bytes)
0x246980f127272a2c

julia> @btime hash2($(a2))
  11.000 μs (1000 allocations: 15.62 KiB)
0x246980f127272a2c

Please help! :sob:

Here are pprof screenshots from an allocation profile and a CPU profile. (Discourse won’t let me upload the profiles)

Allocation: (pprof screenshot)

CPU: (pprof screenshot)
And to me, their Cthulhu warntypes look identical.

Maybe try @noinline to see if that helps track down the actual line which allocates?

Because to perform dynamic dispatch, we need to pack each argument into a jl_value_t*.

function hash2(a::AbstractArray, h::UInt = UInt(0))
    r = Ref(h)
    p = Base.unsafe_convert(Ptr{UInt}, r)
    GC.@preserve r begin
        for x in a
            hash_into!(p, x)::Nothing
        end
    end
    return r[]::UInt
end

p is immutable, so it is represented as a plain integer without pointer identity. When you perform a dynamic dispatch, an immutable needs to be wrapped in a jl_value_t*; that’s why there is an allocation in the loop, and you get 1000 allocations for 1000 iterations (I am surprised that this allocation is not hoisted out of the loop). In contrast, in hash1, ref is already mutable, so there is no need to wrap the value.
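You can see this argument-boxing cost in isolation, independent of hash (a sketch of my own; sink and loop are made-up names, and Base.inferencebarrier is used only to force a dynamic call):

```julia
# A callee that just reads through the pointer.
sink(p::Ptr{UInt}) = unsafe_load(p)

function loop(p::Ptr{UInt}, n)
    f = Base.inferencebarrier(sink)  # f is inferred as Any, so each call below is dynamic
    s = UInt(0)
    for _ in 1:n
        # The dynamic call must pass its arguments as jl_value_t*, so the
        # isbits Ptr is boxed on every call (and the UInt return is boxed too).
        s += f(p)::UInt
    end
    return s
end

r = Ref(UInt(42))
GC.@preserve r begin
    p = Base.unsafe_convert(Ptr{UInt}, r)
    loop(p, 10)                # warm up (compile)
    @allocated loop(p, 1000)   # grows with n: boxes are allocated per call
end
```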


Thank you! Yep, that is exactly the explanation! I also asked on slack, and @Sukera explained the same thing. Thank you @anon56330260! :slight_smile: I learned something new today.

Excellent explanation!


Here was the thread where we discussed it on slack.

The followup questions I asked there were:

  1. Is there any way to work around this / fix it? Can you do a dynamic dispatch that doesn’t need to box its arguments?
  2. How could the tools have helped me reach this conclusion without needing your expert knowledge?
  3. Why doesn’t the optimizer move the box for the Ptr out of the loop in hash2? It’ll always be exactly the same object with the same value. Why is it allocated over and over in the loop?

We discussed answers to 1., and we decided that there is probably an opportunity to improve Julia by delaying the boxing of immutables, so that it only happens when the callee isn’t specialized on the arguments and therefore actually needs a jl_value_t*.

For 2., it would be nice if something like Cthulhu were able to surface the expert knowledge shared in this thread!


The answers to your questions are simple. Don’t use dynamic dispatch in performance-critical code and you won’t be bothered by all of these strange corner cases.

No, unless you use mutables or avoid dynamic dispatch altogether.
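For completeness, the “use mutables” route can be sketched like this (hash3 and HashState are hypothetical names; this is essentially hash1 with the Ref spelled out as a mutable struct, keeping the single up-front allocation):

```julia
# A mutable struct is a heap object with identity, so a dynamic dispatch can
# pass it as a jl_value_t* directly; the only allocation is the one HashState,
# made once, outside the loop.
mutable struct HashState
    h::UInt
end

# `where T` forces specialization on the element type, so with x::Any the
# dispatch happens at runtime; but both arguments are already heap objects,
# and the `nothing` return is a singleton, so nothing is boxed per iteration.
function hash_into!(s::HashState, x::T) where T
    s.h = hash(x, s.h)
    return nothing
end

function hash3(a::AbstractArray, h::UInt = UInt(0))
    s = HashState(h)           # one small allocation total
    for x in a
        hash_into!(s, x)::Nothing
    end
    return s.h
end
```

This should behave like hash1: one small allocation up front and none per iteration, while still hashing through a dynamic dispatch.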

Unreliable and unpredictable, therefore unacceptable. Loop optimizations, even ones based on simple control-flow facts like a variable being constant in a loop, are non-trivial. You would need to inspect LLVM’s extremely complicated pipeline to tell whether and why such optimizations are (not) performed.

There are no such tools (and maybe there never will be), because developing them is hard, and the performance of dynamic dispatch is expected to be bad anyway. If you insist on doing this, becoming a compiler expert is the easier way to reach this conclusion.