Allocations due to Boolean keyword arguments - how to avoid them?

Krastanov · September 22, 2022, 2:43pm

Deep in one of my projects I have a function that slightly changes behavior depending on a boolean keyword. It seems if the keyword is Val{Bool} there are no allocations but if it is Bool, then it allocates. Here is a contrived minimal example:

using BenchmarkTools

""" Some work object - it seems the example actually needs something as complicated """
struct Object{VecType<:AbstractVector{<:Unsigned}, MatType<:AbstractMatrix{<:Unsigned}}
    v::VecType
    m::MatType
end

""" Work function that happens to slightly change behavior depending on Val keyword """
function work_val(obj::Object; extrawork::Val{B}=Val(true)) where B
    r,c = size(obj.m)
    @inbounds for i in 1:r
        @inbounds for j in i+1:r
            for k in 1:c obj.m[i,k] ⊻= obj.m[j,k] end
            if B
                obj.v[i] ⊻= obj.m[i,j]
            end
        end
    end
end

"""A "public interface" function that uses a Bool instead of Val{Bool} """
function work_bool_to_val(obj::Object; extrawork::Bool=true)
    work_val(obj; extrawork=Val(extrawork))
end

n = 10
obj = Object(rand(UInt,n),rand(UInt,n,n))

The one that uses Val does not allocate.

@benchmark work_val($obj)
BenchmarkTools.Trial: 10000 samples with 244 evaluations.
 Range (min … max):  310.254 ns … 753.303 ns  ┊ GC (min … max): 0.00% … 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

But the one that uses a Bool and then puts it inside of a Val does allocate even though it simply calls a non-allocating inner function.

@benchmark work_bool_to_val($obj)
BenchmarkTools.Trial: 10000 samples with 160 evaluations.
 Range (min … max):  668.881 ns …  15.615 μs  ┊ GC (min … max): 0.00% … 93.26%
 Memory estimate: 32 bytes, allocs estimate: 1.

If I simply rewrite work_val to use a Bool directly, things work fine. However, in the real case where I see this problem, such a rewrite is not possible. Thus my question is: if I cannot modify work_val, what can I do to make work_bool_to_val not allocate?

This is not a question about refactoring and rethinking the structure of a code base, rather a very targeted question about why a Bool keyword argument causes an allocation when the rest of the body of the function is not allocating.

By the way, a minor simplification of the example makes everything non-allocating. This is deeply confusing to me. If `work_val` never allocates, why do changes to it matter to whether `work_bool_to_val` allocates? Here is the simplified example:

using BenchmarkTools

"""Just some contrived work function that happens to slightly change behavior depending on Val keyword"""
function simple_work_val(obj::Vector; extrawork::Val{B}=Val(true)) where B
    l = size(obj,1)
    @inbounds for i in 1:l
        @inbounds for j in i+1:l
            obj[i] ⊻= obj[j]
            if B
                obj[i] += obj[i]
            end
        end
    end
end

"""A "public interface" function that uses a Bool keyword instead of a Val keyword"""
function simple_work_bool_to_val(obj::Vector; extrawork::Bool=true)
    simple_work_val(obj; extrawork=Val(extrawork))
end

n = 10
obj = rand(UInt,n)
@benchmark simple_work_val($obj)
@benchmark simple_work_bool_to_val($obj)

So, maybe it has something to do with the complicated Object type? Insight on this would be greatly appreciated.

Version Info: 1.8.0

Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver1)
  Threads: 1 on 16 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

lmiq · September 22, 2022, 2:54pm

The allocation is probably occurring because, by using the Val trick, we are causing dynamic dispatch on the call to simple_work_val (not inside it). If you instead just use a boolean flag, the allocation disappears:

julia> function work_val(obj::Object; extrawork::Bool=true) 
           r,c = size(obj.m)
           @inbounds for i in 1:r
               @inbounds for j in i+1:r
                   for k in 1:c obj.m[i,k] ⊻= obj.m[j,k] end
                   if extrawork
                       obj.v[i] ⊻= obj.m[i,j]
                   end
               end
           end
       end
work_val (generic function with 1 method)

julia> @btime work_bool_to_val($obj)
  218.308 ns (0 allocations: 0 bytes)

stevengj · September 22, 2022, 2:58pm

The problem is that this requires a runtime dynamic dispatch, since the compiler doesn’t know the type (it doesn’t know whether it is Val{true} or Val{false}) until runtime (unless you get lucky with constant propagation, which I’m guessing doesn’t happen here).
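To make the point about the unknown type concrete, here is a minimal sketch. The helper name `to_val` is hypothetical and not from the thread; it only mirrors the `Val(extrawork)` expression in the wrapper. `Base.return_types` is an unexported reflection helper, so treat this as an illustration rather than a stable API.

```julia
# Hypothetical helper (not from the thread): wraps a runtime Bool in a Val,
# just like `Val(extrawork)` does inside work_bool_to_val.
to_val(flag::Bool) = Val(flag)

# With a literal argument the result is a concrete singleton type.
literal_type = typeof(Val(true))

# When only `Bool` is known, inference cannot choose between
# Val{true} and Val{false}, so the inferred return type is not concrete.
inferred_type = only(Base.return_types(to_val, (Bool,)))

println("literal:  ", literal_type, "  concrete? ", isconcretetype(literal_type))
println("inferred: ", inferred_type, "  concrete? ", isconcretetype(inferred_type))
```

A call whose argument type is not concrete is what may end up being resolved at runtime, which matches the dynamic dispatch described above.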

I don’t see the point of Val in the code you quoted — you might as well use a Bool everywhere, since the if B is not in your innermost loop so it should have a negligible runtime cost.

Krastanov · September 22, 2022, 3:37pm

But then why is the second example working fine without allocations? Why is there no dynamic dispatch, even though it is the exact same flow of code?

I copy it here (it was in the <details> tag of the original post):

"""Just some contrived work function that happens to slightly change behavior depending on Val keyword"""
function simple_work_val(obj::Vector; extrawork::Val{B}=Val(true)) where B
    l = size(obj,1)
    @inbounds for i in 1:l
        @inbounds for j in i+1:l
            obj[i] ⊻= obj[j]
            if B
                obj[i] += obj[i]
            end
        end
    end
end

"""A "public interface" function that uses a Bool keyword instead of a Val keyword"""
function simple_work_bool_to_val(obj::Vector; extrawork::Bool=true)
    simple_work_val(obj; extrawork=Val(extrawork))
end

The reason for the Val{true} is that in the real, non-minimal case it has noticeably different performance, presumably because it forces specialization. It is something I originally saw in this thread with examples of its use in SciML and Polyester. And here is a real-world example where simply switching from Bool to Val{Bool} eliminated allocations because of this specialization issue. That example is rather long, so I could not use it as an MWE.

Krastanov · September 22, 2022, 3:47pm

Oh, maybe I can answer my own question. Using Base.@constprop :aggressive makes the allocations go away in both cases.
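The post does not say where the annotation was placed, so the following is only a sketch under the assumption that it goes on the Bool-taking wrapper. The names `inner!` and `outer!` and the simplified kernel are hypothetical stand-ins for work_val and work_bool_to_val. `Base.@constprop` requires Julia 1.7 or later.

```julia
# Hypothetical stand-in for work_val: behavior depends on a Val type parameter.
function inner!(v::Vector{UInt}; extrawork::Val{B}=Val(true)) where B
    for i in eachindex(v)
        v[i] ⊻= UInt(1)
        if B            # B is a compile-time constant inside each specialization
            v[i] += UInt(1)
        end
    end
    return v
end

# Assumed placement of the annotation: on the wrapper that converts Bool to Val,
# asking the compiler to propagate constant arguments through it more eagerly.
Base.@constprop :aggressive function outer!(v::Vector{UInt}; extrawork::Bool=true)
    return inner!(v; extrawork=Val(extrawork))
end

println(outer!(UInt[0, 2]))
println(outer!(UInt[0, 2]; extrawork=false))
```

Whether this removes the allocation in a given code base is best checked directly, for example with `@btime` from BenchmarkTools on the wrapper call.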

Is there a standard way to say “I want aggressive constant propagation on this particular keyword argument”? Then presumably I would not need the keyword::Val{Bool}=Val(true) and could just use some @constprop keyword::Bool=true?

lmiq · September 22, 2022, 4:32pm

Does that really make any difference for performance in this case?

What could make sense is to specialize both the outer and the inner calls with Val, to completely eliminate a branch from the code. In this case, it would be:

julia> function work_bool_to_val(obj::Object; extrawork::Val{B}=Val(true)) where {B}
           work_val(obj; extrawork)
       end
work_bool_to_val (generic function with 1 method)

julia> function work_val(obj::Object; extrawork::Val{B}=Val(true)) where {B} 
           r,c = size(obj.m)
           @inbounds for i in 1:r
               @inbounds for j in i+1:r
                   for k in 1:c obj.m[i,k] ⊻= obj.m[j,k] end
                   if B
                       obj.v[i] ⊻= obj.m[i,j]
                   end
               end
           end
       end
work_val (generic function with 1 method)

julia> @btime work_bool_to_val($obj)
  258.361 ns (0 allocations: 0 bytes)

julia> @btime work_bool_to_val($obj; extrawork=Val(false))
  214.609 ns (0 allocations: 0 bytes)

I didn't check the lowered code, but I would expect the branch to be eliminated in this code when extrawork is set to false. You could also define two independent versions, with and without the branch, and dispatch on them, to be sure that the branch is eliminated in the non-extrawork case. The advantage of this approach is that you could, for example, try to use LoopVectorization in the non-branched version and make it much faster.

At the same time, compared to the function that simply has that branch, there is no perceivable performance difference with the current code.
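The idea of two independent versions selected by dispatch could be sketched as follows. The names `kernel!` and `run_kernel!` and the simplified loop body are hypothetical; they only illustrate the pattern on a plain vector, not the Object type from the original post.

```julia
# Version with the extra work: no branch on a flag appears in the loop body.
function kernel!(v::Vector{UInt}, ::Val{true})
    for i in eachindex(v)
        v[i] ⊻= UInt(1)
        v[i] += UInt(1)
    end
    return v
end

# Version without the extra work: a plain loop that could be tuned separately.
function kernel!(v::Vector{UInt}, ::Val{false})
    for i in eachindex(v)
        v[i] ⊻= UInt(1)
    end
    return v
end

# Keyword front end: the Val type selects the method at compile time
# whenever the caller passes a literal Val(true) or Val(false).
run_kernel!(v::Vector{UInt}; extrawork::Val=Val(true)) = kernel!(v, extrawork)

println(run_kernel!(UInt[0, 2]))
println(run_kernel!(UInt[0, 2]; extrawork=Val(false)))
```

The cost of this design is some duplicated loop code; the benefit is that each method body can be optimized on its own.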

Elrod · September 22, 2022, 4:48pm

function work_bool_to_val(obj::Object; extrawork::Bool=true)
    if extrawork
        work_val(obj; extrawork=Val(true))
    else
        work_val(obj; extrawork=Val(false))
    end
end

Krastanov · September 22, 2022, 5:11pm

No, it does not make a difference in the toy example, as I warned in the first post. It makes a pretty substantial difference in the real-world case that I have linked to, but that one is much too big to share here.

Krastanov · September 22, 2022, 5:13pm

@Elrod, are you saying that the if-else branch with literal Val(true) and Val(false) basically lets the compiler do the constant propagation? Is this just a fluke of which style lets constprop work with the current compiler, or is there something more fundamental here?

lmiq · September 22, 2022, 6:00pm

I think it is less fundamental. There you are just avoiding the dynamic dispatch that was causing the allocation.
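One way to see what the explicit branch changes is to ask inference what it knows about the Val produced by an if-else with literals. This is a minimal sketch; the helper name `val_branch` is hypothetical, and `Base.return_types` is an unexported reflection helper.

```julia
# Hypothetical helper (not from the thread): each branch constructs a literal Val,
# mirroring the if-else wrapper posted above.
val_branch(flag::Bool) = flag ? Val(true) : Val(false)

# Ask inference for the return type when only `Bool` is known.
branch_type = only(Base.return_types(val_branch, (Bool,)))

println(branch_type)
println(branch_type isa Union)
```

In the if-else wrapper each call to work_val sits inside one branch, so its argument type is a known literal and the call target can be resolved statically, which is consistent with the explanation that the dynamic dispatch is simply avoided.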
