Unable to eliminate an unwanted allocation

I have spend several hours to eliminate a very elusive allocation that happens in a package that I intend to run on a GPU, which any allocation is fatal. I have tried all the standard methods to eliminate it (@code_warntype,--track-allocation=user) unsuccessfully. I have managed to produce a minimal reproducer that perhaps somebody could have a look for an idea. The central object is MyState, which is a mutable struct that behaves as a stack. Here is the code:

Reproducer code
using StaticArrays
using Rotations
using BenchmarkTools

const Point3 = SVector{3}

mutable struct MyState{T<:AbstractFloat}
    currentDepth::Int64
    volstack::SVector{32,UInt32}
    function MyState{T}() where T<:AbstractFloat
        state = new{T}()
        reset!(state)
        return state
    end
end

function reset!(state::MyState{T}) where T<:AbstractFloat
    state.currentDepth = 0
    state.volstack = zero(SVector{32,UInt32})
    nothing
end

function pushIn!(state::MyState{T}, vol::Int64) where T<:AbstractFloat
    state.currentDepth += 1
    state.volstack = setindex(state.volstack, vol, state.currentDepth)
    nothing
end

function popOut!(state::MyState{T}) where T<:AbstractFloat
    if state.currentDepth > 0
        state.currentDepth -= 1
    end
    nothing
end

function MylocateGlobalPoint!(state::MyState{T}, point::Point3{T}) where T<:AbstractFloat
    reset!(state)
    if point[1] > 0.
        #return 2
        pushIn!(state,999)
    end
    return state.currentDepth
end


function test(point::Point3{T}) where T<:AbstractFloat
    state = MyState{T}()
    MylocateGlobalPoint!(state, point)
end

point = Point3{Float64}(1,1,1)
@btime test(point)

Running it I obtain 1 allocation of 144 bytes (probably the size of MyState). The strange thing is that tracing the allocation tells me that it happens in the object constructor, in the line state = MyState{T}(). However if I comment the line pushIn!(state,999) and uncomment return 2 in the function MylocateGlobalPoint!(...) there is no allocation.

I could perhaps understand that the type MyState is badly implemented using a static vector to emulate a stack but what I do not understand is that the behaviour changes changing the function that is called after the instantiation.

Any hint will be most welcome.

Shouldn’t you be using SetField.jl or an MArray for this to work as expected?

yeah this line looks like it’s allocating at first glance. also what is setindex? I only know setindex!() from base

The setindex function works OK on immutable types like SVector, returning a modified variable rather than mutating the original.

Isn’t the allocation due to the fact that mutable structs are allocated on the heap?

isn’t that the definition of “allocation”… I mean allocation on stack is still allocation and once the immutable object doesn’t fit on stack, it will be allocated on the heap, it’s not a guarantee Julia makes about immutable objects.

1 Like

As I understand it, stack “allocations” are not tracked by @btime and friends. Here is a demonstration:

julia> using BenchmarkTools, StaticArrays

julia> function testalloc(x)
         return setindex(x, 2, 3)
       end
testalloc (generic function with 1 method)

julia> x = @SVector [1,1,1]
3-element SVector{3, Int64} with indices SOneTo(3):
 1
 1
 1

julia> testalloc(x)
3-element SVector{3, Int64} with indices SOneTo(3):
 1
 1
 2

julia> @btime testalloc($x);
  1.100 ns (0 allocations: 0 bytes)

Even though setindex creates a new SVector, it is not an “allocation” according to @btime. But if we use instead a mutable object…

julia> y = @MArray [1,1,1]
3-element MVector{3, Int64} with indices SOneTo(3):
 1
 1
 1

julia> @btime testalloc($y);
  6.400 ns (1 allocation: 32 bytes)

Because an MVector is allocated on the heap, its creation via a call to setindex is counted as an allocation by @btime.

3 Likes

This shows that the allocation occurs when instantiating the mutable struct:

julia> @btime MyState{Float64}();
  9.309 ns (1 allocation: 144 bytes)
1 Like

Oh yes, that is indeed expected. It is similar to an array in that sense. You either have to make them immutable or preallocate them.

1 Like

I would guess there is a difference between creating the object in the global space (interpreter) or in a function.

julia> function f()
         state = MyState{Float64}()
         state.currentDepth
       end
f (generic function with 1 method)

julia> @btime f()
  1.559 ns (0 allocations: 0 bytes)
0

What happens if you return state instead of state.currentDepth?

I noticed that using MVector instead of SVector the number of allocations is much higher.

It allocates because it results an object to the global space.

I don’t think that’s anything to do with it. In that case your code that returns the Int should also allocate.

I think state is allocated, on the heap, but when you just pick out and return the one bitstype field, without using anything else, the rest of the object is optimized away, because it’s not needed.

This can happen with MArrays too, if you convert them into an SArray, it can also be ‘optimized away’.

1 Like

I do no think it can optimise all away. Have a look at

julia> function f()
         state = MyState{Float64}()
         pushIn!(state,9)
         pushIn!(state,99)
         return sum(state.volstack)
         end
f (generic function with 1 method)

julia> @btime f()
  2.624 ns (0 allocations: 0 bytes)
0x0000006c

I simplified your code a bit (taking away the unneded type parameter):

using StaticArrays

mutable struct MyStateMut
    currentDepth::Int64
    volstack::SVector{32,UInt32}
    MyStateMut() = new(0, zero(SVector{32, UInt32}))
end

struct MyStateImm
    currentDepth::Int64
    volstack::SVector{32,UInt32}
    MyStateImm() = new(0, zero(SVector{32, UInt32}))
end

function f()
    state = MyStateMut()
    state.currentDepth
end

g() = MyStateMut()
h() = MyStateImm()

Only difference is mutable vs immutable struct. Then

julia> @btime g();  # allocates
  12.826 ns (1 allocation: 144 bytes)

julia> @btime h();  # no allocation
  0.001 ns (0 allocations: 0 bytes)

But if I just take out one part of the mutable struct:

julia> @btime f();
  1.500 ns (0 allocations: 0 bytes)

The allocation is optimized away.

It is indeed optimized away:

julia> function f()
         state = MyState{Float64}()
         pushIn!(state,9)
         pushIn!(state,99)
         return sum(state.volstack)
         end
f (generic function with 1 method)

julia> @code_llvm f()
;  @ REPL[4]:1 within `f`
; Function Attrs: uwtable
define i32 @julia_f_573() #0 {
L188:
;  @ REPL[4]:5 within `f`
  ret i32 108
}
3 Likes

This is what happens, afaiu, when one uses MArrays to avoid allocations, by converting them to SArrays on the return of functions. In a minimal case:

ulia> struct A
           x::Int
       end

julia> mutable struct B
           x::Int
       end

julia> B() = B(0);

julia> A() = A(0);

julia> @btime B()
  5.170 ns (1 allocation: 16 bytes)
B(0)

julia> @btime A()
  0.022 ns (0 allocations: 0 bytes)
A(0)

julia> f() = A(B().x)
f (generic function with 1 method)

julia> @btime f()
  1.692 ns (0 allocations: 0 bytes)
A(0)

Thus, in this case:

julia> using StaticArrays

julia> mutable struct MyState
           c::Int
           v::SVector{32,Int}
       end

julia> struct MyStateImm
           c::Int
           v::SVector{32,Int}
       end

julia> function f()
           x = MyState(0, zero(SVector{32,Int})) # mutable
           x.c += 1
           x.v = setindex(x.v, rand(1:10), x.c)
           MyStateImm(x.c, x.v) # return immutable
       end
f (generic function with 1 method)

julia> @btime f()
  20.798 ns (0 allocations: 0 bytes)
MyStateImm(1, [3, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Thus you can use the convenience of the syntax of a mutable struct and still allow the compiler to optimize away the allocations.

2 Likes

Thank-you very much, but I do not need to return the State mutable or immutable. The State is created in a function and used in subsequent functions and loops and can be destroyed when returning. It is really a temporary and fixed size, and this is exactly why the allocation in the stack is ideal for this. I really do not understand why Julia does not do it. It is ridiculous that in oder to avoid allocations I need to pass a State instance (i.e. a temporary working object) from the function caller.

If State is mutable and it is passed to other functions (inside the scope it was created, I am not talking about returning), then Julia probably loses sight of it and it is not able to rule out that, for example, some of these functions has saved it in a global variable. To save something mutable in the stack Julia needs to assure itself that the reference cannot escape by any means possible. Which includes both global variables and it becoming the value of a field in another structure that is returned.

Thanks very much. I understand that Julia needs to assure any leaking, in the original reproducer I do not see what makes Julia to think that a reference is escaping and therefore needs to allocate the object State. Probably what happens then is that to be absolutely sure it allocates in the heap always. Is this correct? So, there is no way to have a mutable object in the stack? Is there a pragma or similar to tell the compiler how to allocate the object?