Spooky allocations - linked to Varargs?

Hi folks,
today I found a strange behavior that eludes my understanding. Could someone please tell me where the allocations are coming from in this example? I am probably overlooking something, but I have been staring at this snippet for far longer than I’d like to admit and cannot figure it out.

My goal is to compute multinomial coefficients. Consider these functions:

# computes a binomial coefficient as Float64
function mybinomial(N::T, k::T) where T
	(k > N || k < 0) && return 0.0
	2k > N && (k = N-k)
	(isinteger(N) && isinteger(k)) || error("N and k need to be integers! Got: ($N, $k)")
	ret = 1.0
	for i in 1:Int(k)
		ret *= (N+1-i)/i
	end
	return ret
end

# compute the multinomial coefficient iteratively
multinomial() = 1.0
function multinomial(a, vals...)
	s = a
	ret = 1.0
	for v in vals
		s += v
		ret *= mybinomial(s,v)
	end
	return ret
end
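(For reference, the loop relies on a telescoping identity: the product of binomial(sᵢ, vᵢ) over the running sums sᵢ equals the usual multinomial coefficient n!/(k₁!⋯kₘ!). A quick self-contained check of that identity using only Base, with the same counts I benchmark below:)

```julia
# Check the telescoping identity behind the loop: the product of binomials of
# running sums equals the multinomial coefficient n! / (k1! * k2! * ... * km!).
counts = (1, 2, 3, 4, 3)
prod_binom = let s = 0, p = 1
    for v in counts
        s += v
        p *= binomial(s, v)
    end
    p
end
multi = factorial(sum(counts)) ÷ prod(factorial, counts)
println((prod_binom, multi))  # → (3603600, 3603600)
```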

Strangely, I see an allocation:

julia> using BenchmarkTools

julia> @btime multinomial($1,$2,$3,$4,$3)
  26.745 ns (1 allocation: 16 bytes)
3.6036e6

I cannot understand where this comes from:

  1. I checked @code_warntype multinomial(1,2,3,4,3) and everything is inferred perfectly.
  2. I checked @code_llvm multinomial(1,2,3,4,3) and @code_native multinomial(1,2,3,4,3), which show that the loop has been unrolled, and I cannot find anything that looks like an allocation to me.
  3. Interestingly, the allocation seems to be very sensitive to seemingly random things… E.g. replacing mybinomial with binomial from Base removes the allocation (but is overall slower and will overflow in my application). I also had a slightly different version of my function with 2 allocations that vanish when changing mybinomial to binomial.
  4. I somewhat suspect that these allocations have to do with iteration over the varargs. Maybe mybinomial has different compiler effects than binomial, which cause the compiler not to elide some allocation?
Variant with 2 allocations
julia> function multinomial3(a, vals...)
           s = sum(vals)
           ret = binomial(s+a,a)
           for v in vals
               ret *= mybinomial(s,v) # 0 allocs with binomial
               s -= v
           end
           return ret
       end

julia> @btime multinomial3($1,$2,$3,$4,$3)
  121.435 ns (2 allocations: 64 bytes)

For reference, I now use a generated function to generate functions like this:

julia> function multinomial_gen(a,b,c,d,e)
    ret = mybinomial(a+b,b)
    ret *= mybinomial(a+b+c,c)
    ret *= mybinomial(a+b+c+d,d)
    ret *= mybinomial(a+b+c+d+e,e)
    return ret
end

julia> @btime multinomial_gen($1,$2,$3,$4,$3)
  14.487 ns (0 allocations: 0 bytes)
3.6036e6

which is quite a bit faster and has never given me any allocations (I actually need to compute A LOT of these so that performance increase actually was very noticeable :sweat_smile: )

All the results shown above were obtained on Julia 1.11.1, but I see the same behavior on 1.10.5. If someone can elucidate this for me I’d be very grateful :slight_smile:

Interesting.

If I do a

f(x...) = multinomial(x...)
@code_native f(1,2,3,4,5)

There’s a sequence:

        movabs  rbx, offset ijl_box_int64
        lea     rdx, [rbp - 112]
        mov     r12, rsi
        mov     rcx, qword ptr [rax - 8]
        mov     qword ptr [rbp - 112], 20
        mov     rax, qword ptr [rcx]
        mov     qword ptr [rbp - 56], rcx       # 8-byte Spill
        mov     qword ptr [rbp - 104], rax
        mov     qword ptr [rcx], rdx
        vzeroupper
        call    rbx

So it seems something is being boxed. It’s likely something with the Vararg.

Incidentally, I see the same boxing with:

@noinline f(x...) = (s = 0; for i in x; s += i; end; s)
g(x...) = f(x...)
@code_native g(1,2,3)

though I do not get any allocations with @btime g(1,2,3)

I get this odd thing:

julia> @code_typed g(1,2,3)
CodeInfo(
1 ─ %1 = Core.getfield(x, 1)::Int64
β”‚   %2 = Core.getfield(x, 2)::Int64
β”‚   %3 = Core.getfield(x, 3)::Int64
β”‚   %4 = invoke Main.f(%1::Int64, %2::Vararg{Int64}, %3)::Int64
└──      return %4
) => Int64

(v1.11.1)

I entered an issue at github: Weird type inference with `Vararg` Β· Issue #56485 Β· JuliaLang/julia Β· GitHub

1 Like

Could it just be a missing type annotation on Vararg?

EDIT: yes, that’s it

function multinomial_fast(a, vals::Vararg{Int,N}) where N
	s = a
	ret = 1.0
	for v in vals
		s += v
		ret *= mybinomial(s,v)
	end
	return ret
end
julia> @btime multinomial($1,$2,$3,$4,$3)
  32.190 ns (1 allocation: 16 bytes)
3.6036e6

julia> @btime multinomial_fast($1,$2,$3,$4,$3)
  17.924 ns (0 allocations: 0 bytes)
3.6036e6
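For anyone else who hits this: as a heuristic, Julia does not specialize on the arity of an unbounded Vararg that is only passed through, while naming the arity as Vararg{Int,N} forces one specialization per argument count. A minimal toy sketch of the difference (my own example, adapted from the pattern in the performance tips, not the code above):

```julia
# Unbounded vararg that is only passed through: the heuristic compiles a
# single specialization covering all arities.
unbounded(xs::Int...) = tuple(xs...)

# Binding the arity N in the signature forces specialization per arity.
bounded(xs::Vararg{Int,N}) where {N} = tuple(xs...)

unbounded(1, 2); unbounded(1, 2, 3)
bounded(1, 2); bounded(1, 2, 3)

# Inspect what was actually compiled (exact output varies by Julia version):
for m in (only(methods(unbounded)), only(methods(bounded)))
    println(m.sig, " => ", collect(Base.specializations(m)))
end
```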
4 Likes

Seconding the nonspecialization heuristic. It’s also worth mentioning that the @code_ reflection macros assume specialization, even when it doesn’t actually happen, and repeat type inference and compilation accordingly:
Can we support a way to make @code_typed stop lying to us all? :slight_smile: · Issue #32834 · JuliaLang/julia

1 Like

facepalm
Thanks a lot! I knew that Julia avoids specialization on Types and Functions but didn’t recall that this applies to Varargs as well.
And I also knew that in these cases @code_warntype and the like are not helpful (which is what @Benny mentions). We really need better diagnostics for this scenario…

While that fixes the performance and saves me from resorting to a @generated function, the allocations caused by this are still quite mysterious. What is the difference between mybinomial and Base.binomial that causes allocations with the former but not the latter?

Hard to tell because it’s not clear what that 16-byte allocation is for, and none of the @code_ reflection methods show the code with it (this extends to JET and Cthulhu as well). This is what β€œnot specializing on Vararg” results in:

julia> multinomial(1,2,3,4,5)
3.783779999999999e7

julia> Base.specializations(@which(multinomial(1,2,3,4,5)))
Base.MethodSpecializations(MethodInstance for multinomial(::Int64, ::Int64, ::Vararg{Int64}))

So it actually does specialize on the 1st element’s type and the matching trailing elements’ type, just not on the arity. Not sure if the compiler pulls it off, but it does have the necessary information to infer v::Int64, as if vals::AbstractVector{Int64}.

Second, the most immediate difference in Base.binomial is:

Base.@assume_effects :terminates_locally function binomial(n::T, k::T) where T<:Integer

but I don’t know what that does except hint at compile-time binomial calls, but that’s not possible here without the runtime values.
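If someone wants to isolate that difference, the effect annotation could be applied to a copy of mybinomial and re-benchmarked; a sketch of the experiment (I have not verified whether this alone removes the allocation):

```julia
# Copy of mybinomial from above, carrying the same effect hint that
# Base.binomial has. Whether this changes the allocation behavior of the
# vararg caller is exactly the open question.
Base.@assume_effects :terminates_locally function mybinomial_hinted(N::T, k::T) where T
    (k > N || k < 0) && return 0.0
    2k > N && (k = N - k)
    (isinteger(N) && isinteger(k)) || error("N and k need to be integers! Got: ($N, $k)")
    ret = 1.0
    for i in 1:Int(k)
        ret *= (N + 1 - i) / i
    end
    return ret
end
```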

1 Like

:invoke with Vararg (static dispatch) is actually implemented as dynamic dispatch on the codegen level, which causes allocations to occur. You can verify this using AllocCheck.jl.

julia> # computes a binomial coefficient as Float64
       function mybinomial(N::T, k::T) where T
           (k > N || k < 0) && return 0.0
           2k > N && (k = N-k)
           (isinteger(N) && isinteger(k)) || error("N and k need to be integers! Got: ($N, $k)")
           ret = 1.0
           for i in 1:Int(k)
               ret *= (N+1-i)/i
           end
           return ret
       end;
       
       # compute the multinomial coefficient iteratively

julia> multinomial() = 1.0;

julia> function multinomial(a, vals...)
           s = a
           ret = 1.0
           for v in vals
               s += v
               ret *= mybinomial(s,v)
           end
           return ret
       end;

julia> topfunc(a, b, c, d, e) = multinomial(a,b,c,d,e);

julia> using AllocCheck

julia> check_allocs(topfunc, (Int,Int,Int,Int,Int))
2-element Vector{Any}:
 Allocation of Int64 in ./REPL[12]:1
  | (source not available)
Stacktrace:
 [1] topfunc(a::Int64, b::Int64, c::Int64, d::Int64, e::Int64)
   @ Main ./REPL[12]:1

 Dynamic dispatch to function multinomial in ./REPL[12]:1
  | (source not available)
Stacktrace:
 [1] topfunc(a::Int64, b::Int64, c::Int64, d::Int64, e::Int64)
   @ Main ./REPL[12]:1

Another fix is to add @inline at the call site, which allows the compiler to eliminate the dispatch:

julia> topfunc(a, b, c, d, e) = @inline multinomial(a,b,c,d,e);

julia> check_allocs(topfunc, (Int,Int,Int,Int,Int))
Any[]

julia> @benchmark topfunc(1,2,3,4,3)
[ Info: Loading BenchmarkTools...
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  8.216 ns … 418.418 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     8.634 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.486 ns Β±   6.440 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–„β–ˆβ–…β–‚β–‚β–β–β–‚β–ƒβ–‚β–‚                                                 ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–‡β–‡β–‡β–‡β–‡β–‡β–†β–†β–†β–…β–†β–…β–†β–…β–†β–†β–…β–†β–…β–…β–…β–…β–…β–…β–…β–„β–…β–†β–…β–…β–β–ƒβ–„β–…β–„β–„β–„β–„β–„β–β–ƒβ–ƒβ–β–…β–„β–„ β–ˆ
  8.22 ns      Histogram: log(frequency) by time      24.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
4 Likes

Hobby horse plug: Be less aware of when Julia avoids specializing? and #51423.

1 Like