Help Requested: Understanding cause of allocations for dynamic dispatch returning `nothing`, which vary based on argument type

Can someone help me understand why passing a Ptr{UInt} to this dynamic dispatch leads to allocations, whereas passing a Ref{UInt} does not?

I have tried the Allocation Profiler, Cthulhu, and @code_*, and I can’t figure out where the allocation is coming from. (Which separately speaks to how hard it can still be to do this kind of perf investigation.)

Here are two implementations of the same idea, hash1 and hash2. The idea is to reduce allocations when hashing a collection with an abstract element type, by eliminating the allocations caused by the dynamic dispatch.

This is related to the discussion here: `hash(::Vector{<abstract type>}, ::UInt)` allocates quite a bit due to type instability: why can't hash(::Any) infer `UInt` return type? · Issue #44244 · JuliaLang/julia · GitHub.

My understanding is: dynamic dispatch will (almost) always heap allocate its return value, since it doesn’t know in advance what type the return value will be, since it doesn’t know which method will be selected (that’s what makes it a dynamic dispatch). So the idea here was to use an output parameter instead of returning a value, specifically to avoid allocating the result.
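To make that concrete, here is a minimal sketch (my own illustration, not code from the linked issue): with an Any-eltype array, every hash(x, h) call is a dynamic dispatch, and @allocated shows the cost of the boxed return values:

```julia
# Naive loop over an Any-eltype array: hash(x, h) is a dynamic dispatch
# whose UInt return value gets heap-allocated (boxed) on each iteration,
# because the call site can't know the return type in advance.
naive_hash(a, h::UInt = UInt(0)) = (for x in a; h = hash(x, h); end; h)

xs = Any[i for i in 1:1000]
naive_hash(xs)              # warm up so @allocated doesn't count compilation
@allocated naive_hash(xs)   # nonzero: boxed results from the dynamic calls
```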

hash1 uses a Ref{UInt} as the output parameter, and indeed it doesn’t allocate on each iteration. But it does need to allocate the Ref, since it passes it through to a dynamically chosen function, which lets that Ref escape.

The idea in hash2 was to pass a Ptr{UInt} instead, to allow escape analysis to elide the allocation of the Ref! That should get us to 0 allocations instead of 1. However, now every iteration of the for-loop somehow allocates, and I don’t know why! (But indeed the alloc for the Ref is gone.)

Any help would be much appreciated!

function hash1(a::AbstractArray, h::UInt = UInt(0))
    ref = Ref{UInt}(h)
    for x in a
        hash_into!(ref, x)::Nothing
    end
    return ref[]::UInt
end
function hash_into!(ref::Ref{UInt}, v::T) where T
    ref[] = hash(v, ref[])
    return nothing
end
function hash2(a::AbstractArray, h::UInt = UInt(0))
    r = Ref(h)
    p = Base.unsafe_convert(Ptr{UInt}, r)
    GC.@preserve r begin
        for x in a
            hash_into!(p, x)::Nothing
        end
    end
    return r[]::UInt
end
function hash_into!(p::Ptr{UInt}, x::T) where T
    h = hash(x, unsafe_load(p))
    unsafe_store!(p, h)
    return nothing
end

You can see that for Arrays with a concrete element type, where there is no type instability and thus no dynamic dispatch, neither one allocates, as expected:

julia> a = collect(1:Int(1e3));

julia> @btime hash1($(a))
  1.079 μs (0 allocations: 0 bytes)
0x246980f127272a2c

julia> @btime hash2($(a))
  1.079 μs (0 allocations: 0 bytes)
0x246980f127272a2c

But for arrays with an abstract element type, where we need a dynamic dispatch, hash1 makes a single allocation for the Ref and none per iteration, while hash2 stack-allocates the Ref but allocates on every iteration for some mystery reason:

julia> a2 = Any[i for i in 1:Int(1e3)];

julia> @btime hash1($(a2))
  7.594 μs (1 allocation: 16 bytes)
0x246980f127272a2c

julia> @btime hash2($(a2))
  11.000 μs (1000 allocations: 15.62 KiB)
0x246980f127272a2c

Please help! :sob:

Here are pprof screenshots from an allocation profile and a CPU profile. (Discourse won’t let me upload the profiles)

Allocation: (pprof screenshot)

CPU: (pprof screenshot)
And to me, their Cthulhu warntypes look identical.

Maybe try @noinline to see if that helps track down the actual line which allocates?

Because to perform dynamic dispatch, we need to pack each argument into a jl_value_t*.

function hash2(a::AbstractArray, h::UInt = UInt(0))
    r = Ref(h)
    p = Base.unsafe_convert(Ptr{UInt}, r)
    GC.@preserve r begin
        for x in a
            hash_into!(p, x)::Nothing
        end
    end
    return r[]::UInt
end

p is immutable, so it is represented as a plain integer without pointer identity. When you perform a dynamic dispatch, an immutable needs to be wrapped in a jl_value_t*; that’s why there is an allocation in the loop, and you get 1000 allocations for 1000 iterations (I am surprised that this allocation is not hoisted out of the loop). In contrast, in hash1, ref is already mutable, so there is no need to wrap the value.
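You can see this argument-boxing cost in isolation, independent of hash (a sketch of my own; sink and loop are made-up names, and Base.inferencebarrier is used only to force a dynamic call):

```julia
# A callee that just reads through the pointer.
sink(p::Ptr{UInt}) = unsafe_load(p)

function loop(p::Ptr{UInt}, n)
    f = Base.inferencebarrier(sink)  # f is inferred as Any, so each call below is dynamic
    s = UInt(0)
    for _ in 1:n
        # The dynamic call must pass its arguments as jl_value_t*, so the
        # isbits Ptr is boxed on every call (and the UInt return is boxed too).
        s += f(p)::UInt
    end
    return s
end

r = Ref(UInt(42))
GC.@preserve r begin
    p = Base.unsafe_convert(Ptr{UInt}, r)
    loop(p, 10)                # warm up (compile)
    @allocated loop(p, 1000)   # grows with n: boxes are allocated per call
end
```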


Thank you! Yep, that is exactly the explanation! I also asked on slack, and @Sukera explained the same thing. Thank you @anon56330260! :slight_smile: I learned something new today.

Excellent explanation!


Here was the thread where we discussed it on slack.

The followup questions I asked there were:

  1. Is there any way to work around this / fix it? Can you do a dynamic dispatch that doesn’t need to box its arguments?
  2. How could the tools have helped me reach this conclusion without needing your expert knowledge?
  3. Why doesn’t the optimizer move the box for the Ptr out of the loop in hash2? It’ll always be exactly the same object with the same value. Why is it allocated over and over in the loop?

We discussed answers to 1., and we decided that there is probably an opportunity to improve Julia by delaying the boxing of immutables, so that it only happens when the callee isn’t specialized on the arguments and therefore actually needs a jl_value_t*.

For 2., it would be nice if something like Cthulhu were able to surface the expert knowledge shared in this thread!


The answers to your questions are simple. Don’t use dynamic dispatch in performance-critical code and you won’t be bothered by all of these strange corner cases.

No, unless you use mutables or avoid dynamic dispatch altogether.
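For completeness, the “use mutables” route can be sketched like this (hash3 and HashState are hypothetical names; this is essentially hash1 with the Ref spelled out as a mutable struct, keeping the single up-front allocation):

```julia
# A mutable struct is a heap object with identity, so a dynamic dispatch can
# pass it as a jl_value_t* directly; the only allocation is the one HashState,
# made once, outside the loop.
mutable struct HashState
    h::UInt
end

# `where T` forces specialization on the element type, so with x::Any the
# dispatch happens at runtime; but both arguments are already heap objects,
# and the `nothing` return is a singleton, so nothing is boxed per iteration.
function hash_into!(s::HashState, x::T) where T
    s.h = hash(x, s.h)
    return nothing
end

function hash3(a::AbstractArray, h::UInt = UInt(0))
    s = HashState(h)           # one small allocation total
    for x in a
        hash_into!(s, x)::Nothing
    end
    return s.h
end
```

This should behave like hash1: one small allocation up front and none per iteration, while still hashing through a dynamic dispatch.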

Unreliable and unpredictable, therefore unacceptable. Loop optimizations, even ones based on simple control-flow facts like a variable being constant in a loop, are non-trivial. You would need to inspect LLVM’s extremely complicated pipeline to tell whether and why such optimizations are (not) performed.

There are no such tools (and maybe there never will be), because developing them is hard, and the performance of dynamic dispatch is expected to be bad anyway. If you insist on doing this, becoming a compiler expert is the easier way to reach this conclusion.