Understanding source of allocations when profiling

I’m a little confused about how to attribute some of the allocations I see, depending on how I profile the code. In particular, when I profile this code with julia --track-allocation=user (running testfunc() once, clearing with Profile.clear_malloc_data(), and running testfunc() again before exiting the REPL, as suggested here), I get the following:

        - using StructArrays
        -
        - struct S
        -     a::Float64
        -     b::Float64
        - end
        -
        - function testfunc()
        0     A = rand(1001,2,3);
        -
    48144     B = zeros(1001,2,3);
        -
       16     SA = StructArray{S}(A, dims=3);
        -
       16     SB = StructArray{S}(B, dims=3);
        -
      224     V = [(A,SA),(B,SB)];
        -
      640     circshift!(V[2][2].a, V[1][2].a, (1,0));
        - end

This confuses me both because I expect A = rand(1001,2,3); to allocate memory, and because I expect circshift!(V[2][2].a, V[1][2].a, (1,0)); to have no allocations.
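As a cross-check independent of both the tracker and @btime, one can measure the circshift! call directly with Base's @allocated macro. This is a sketch I added for illustration (shift_allocs is a hypothetical helper, not from the original code); the measurement is wrapped in a function so that no non-const globals are touched inside it, and a warm-up call excludes compilation effects.

```julia
# Hypothetical helper: measure allocations of circshift! on plain arrays.
# Wrapping the measurement in a function keeps all variables local and
# concretely typed, so @allocated sees only the call itself.
function shift_allocs(dest, src)
    circshift!(dest, src, (1, 0))               # warm-up: compile / initialize
    return @allocated circshift!(dest, src, (1, 0))
end

dest = zeros(1001, 2)
src  = rand(1001, 2)
shift_allocs(dest, src)                         # 0 for plain Arrays
```

This agrees with the interpolated @btime results below: the in-place circshift! itself is allocation-free.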

In addition, I tried using @btime to make sense of this with a simplified example in the REPL as follows:

julia> struct S
           a::Float64
           b::Float64
       end

julia> A = rand(1001,2,3);

julia> B = zeros(1001,2,3);

julia> SA = StructArray{S}(A, dims=3);

julia> SB = StructArray{S}(B, dims=3);

julia> V = [(A,SA),(B,SB)];

julia> @btime circshift!($V[2][2].a, $V[1][2].a, (1,0));
  445.631 ns (0 allocations: 0 bytes)

julia> @btime circshift!($(V)[2][2].a, $(V)[1][2].a, (1,0));
  449.279 ns (0 allocations: 0 bytes)

julia> @btime circshift!(V[2][2].a, V[1][2].a, (1,0));
  1.501 μs (11 allocations: 992 bytes)

However, this doesn’t really clear it up for me. Does the memory I’m seeing when I don’t interpolate the variables relate to the same source of allocation as when I’m doing the runtime profiling method, or are these just totally unrelated?

The allocations reported by @btime stem from the fact that you’re accessing a non-const global variable in your benchmark expression, not from circshift! itself.
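The same effect can be reproduced without BenchmarkTools. The sketch below (my own illustration, with plain arrays and hypothetical names read_global / read_arg / measure) compares a function that reads a non-const global against one that receives the same array as an argument; only the former allocates, because the compiler cannot know the global's type and must box the result of the dynamically dispatched call.

```julia
x = rand(1001, 2)             # non-const global: type unknown to the compiler

read_global() = sum(x)        # reads the global -> dynamic dispatch + boxing
read_arg(y)   = sum(y)        # concrete type comes in via the argument

# Measure inside a function so the measurement itself is type stable.
function measure(y)
    read_global(); read_arg(y)                  # warm-up to exclude compilation
    return (@allocated read_global()), (@allocated read_arg(y))
end

a_global, a_arg = measure(x)   # a_global > 0, a_arg == 0
```

Interpolating with `$V` in @btime plays the same role as passing `y` here: it hands the benchmark a value of known concrete type instead of a global binding.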


What does @code_warntype say about your testfunc()? Could it be that it’s type unstable because of the dims=3 in the construction of the StructArrays, so that the compiler can’t infer the types of the accesses to V in the last line?

@code_warntype gives the following output.

julia> @code_warntype testfunc()
Variables
  #self#::Core.Const(testfunc)
  V::Vector{_A} where _A
  SB::StructArray{S, _A, _B, _C} where {_A, _B<:Union{Tuple, NamedTuple}, _C}
  SA::StructArray{S, _A, _B, _C} where {_A, _B<:Union{Tuple, NamedTuple}, _C}
  B::Array{Float64, 3}
  A::Array{Float64, 3}

Body::AbstractArray
1 ─       (A = Main.rand(1001, 2, 3))
│         (B = Main.zeros(1001, 2, 3))
│   %3  = Core.apply_type(Main.StructArray, Main.S)::Core.Const(StructArray{S, N, C, I} where {N, C<:Union{Tuple, NamedTuple}, I})
│   %4  = (:dims,)::Core.Const((:dims,))
│   %5  = Core.apply_type(Core.NamedTuple, %4)::Core.Const(NamedTuple{(:dims,), T} where T<:Tuple)
│   %6  = Core.tuple(3)::Core.Const((3,))
│   %7  = (%5)(%6)::Core.Const((dims = 3,))
│   %8  = Core.kwfunc(%3)::Core.Const(Core.var"#Type##kw"())
│         (SA = (%8)(%7, %3, A))
│   %10 = Core.apply_type(Main.StructArray, Main.S)::Core.Const(StructArray{S, N, C, I} where {N, C<:Union{Tuple, NamedTuple}, I})
│   %11 = (:dims,)::Core.Const((:dims,))
│   %12 = Core.apply_type(Core.NamedTuple, %11)::Core.Const(NamedTuple{(:dims,), T} where T<:Tuple)
│   %13 = Core.tuple(3)::Core.Const((3,))
│   %14 = (%12)(%13)::Core.Const((dims = 3,))
│   %15 = Core.kwfunc(%10)::Core.Const(Core.var"#Type##kw"())
│         (SB = (%15)(%14, %10, B))
│   %17 = Core.tuple(A, SA)::Tuple{Array{Float64, 3}, StructArray{S, _A, _B, _C} where {_A, _B<:Union{Tuple, NamedTuple}, _C}}
│   %18 = Core.tuple(B, SB)::Tuple{Array{Float64, 3}, StructArray{S, _A, _B, _C} where {_A, _B<:Union{Tuple, NamedTuple}, _C}}
│         (V = Base.vect(%17, %18))
│   %20 = Base.getindex(V, 2)::Any
│   %21 = Base.getindex(%20, 2)::Any
│   %22 = Base.getproperty(%21, :a)::Any
│   %23 = Base.getindex(V, 1)::Any
│   %24 = Base.getindex(%23, 2)::Any
│   %25 = Base.getproperty(%24, :a)::Any
│   %26 = Core.tuple(1, 0)::Core.Const((1, 0))
│   %27 = Main.circshift!(%22, %25, %26)::AbstractArray
└──       return %27

I’m still too new to Julia to really understand this output. If I read this right, though, it seems that V is not recognized as a Vector-of-Tuples. Is this something I can annotate in the code, and would that help performance?

All those Any there mean that the compiler can’t figure out what type will come out of accessing V. The core problem lies in

%17 = Core.tuple(A, SA)::Tuple{Array{Float64, 3}, StructArray{S, _A, _B, _C} where {_A, _B<:Union{Tuple, NamedTuple}, _C}}
%18 = Core.tuple(B, SB)::Tuple{Array{Float64, 3}, StructArray{S, _A, _B, _C} where {_A, _B<:Union{Tuple, NamedTuple}, _C}}

because it’s here that the compiler doesn’t know enough about what kind of StructArray you’re creating. I’m kind of confused why you’d create a StructArray{S} from an Array{Float64,3} in the first place, and it seems the compiler agrees. It tries to create a common type for

V = [(A,SA),(B,SB)]

and can only come up with an abstract common supertype for the two entries, since each is a tuple of an Array and an incompletely inferred StructArray, so the later accesses all come out as Any.

One way around this problem would be to create a function barrier by putting everything after the creation of your StructArrays into its own function (or to create the StructArrays outside of your testfunc and pass them in, which is a little more Julian and fits the common pattern of preallocating your arrays).
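A minimal sketch of the barrier pattern, with plain arrays standing in for the (Array, StructArray) tuples and hypothetical names kernel! / outer: the container's element type is deliberately abstract, so the outer function can't infer anything about its contents, but the inner function is specialized on the concrete types it actually receives, and everything past the single dynamic dispatch at the call site runs fast.

```julia
# Function barrier: inside here, dst and src have concrete types,
# so the broadcast compiles to tight, allocation-free code.
function kernel!(dst, src)
    dst .= src .* 2
    return dst
end

function outer()
    A = [1.0, 2.0, 3.0, 4.0]
    B = zeros(4)
    V = Any[(A, B)]                    # deliberately abstract, like the Vector above
    pair = V[1]                        # inferred as Any at this point
    return kernel!(pair[2], pair[1])   # one dynamic dispatch, then specialized code
end
```

The compiler pays for the unknown types exactly once, at the kernel! call; without the barrier, every operation on pair would be dispatched dynamically.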


The (A,SA) construct arose from the discussion in another thread where it’s useful for me to operate on the data as a matrix with a particular alignment, but also have the StructArray view for convenience. To avoid repeated allocations, my thought was just to keep both views of the data around as a tuple. The use of V in my actual application is then just to keep a front- and back-buffer for operations, again, to minimize allocations. Basically, I do one “operation” that takes V[1] as input and writes the output to V[2], then I just swap the entries in V so that the current state is in the front buffer at the end of the operation.
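The swap step described here can be sketched as follows (my illustration with plain vectors in place of the (Array, StructArray) tuples; step! and the increment "operation" are hypothetical stand-ins): the operation reads the front buffer V[1], writes into the back buffer V[2], and then the two entries are exchanged so the result becomes the new front buffer, with no new arrays created.

```julia
# Front/back-buffer pattern: read V[1], write V[2], then swap.
function step!(V)
    front, back = V[1], V[2]
    back .= front .+ 1                 # stand-in for the real operation
    V[1], V[2] = back, front           # swap entries: no allocation of new arrays
    return V
end

V = [zeros(3), zeros(3)]
step!(V)                               # V[1] now holds the result
```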

So, in that case, I’d generally expect to take V as my argument to a function. If I make an innertestfunc that just takes V and does the circshift on it, then I get the following profiling output:

        - using StructArrays
        -
        - struct S
        -     a::Float64
        -     b::Float64
        - end
        -
        - function testfunc()
        0     A = rand(1001,2,3);
        -
    48144     B = zeros(1001,2,3);
        -
       16     SA = StructArray{S}(A, dims=3);
        -
       16     SB = StructArray{S}(B, dims=3);
        -
      224     V = [(A,SA),(B,SB)];
        -
       64     innertestfunc(V)
        - end
        -
        - function innertestfunc(V)
        0     circshift!(V[2][2].a, V[1][2].a, (1,0));
        - end

Moreover, @code_warntype for the inner function gives

julia> A = rand(1001,2,3);

julia> B = zeros(1001,2,3);

julia> SA = StructArray{S}(A, dims=3);
julia> SB = StructArray{S}(B, dims=3);

julia> V = [(A,SA),(B,SB)];

julia> @code_warntype innertestfunc(V)
Variables
  #self#::Core.Const(innertestfunc)
  V::Vector{Tuple{Array{Float64, 3}, StructArray{S, 2, NamedTuple{(:a, :b), Tuple{SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}, SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}, Int64}}}

Body::SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}
1 ─ %1 = Base.getindex(V, 2)::Tuple{Array{Float64, 3}, StructArray{S, 2, NamedTuple{(:a, :b), Tuple{SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}, SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}, Int64}}
│   %2 = Base.getindex(%1, 2)::StructArray{S, 2, NamedTuple{(:a, :b), Tuple{SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}, SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}, Int64}
│   %3 = Base.getproperty(%2, :a)::SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}
│   %4 = Base.getindex(V, 1)::Tuple{Array{Float64, 3}, StructArray{S, 2, NamedTuple{(:a, :b), Tuple{SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}, SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}, Int64}}
│   %5 = Base.getindex(%4, 2)::StructArray{S, 2, NamedTuple{(:a, :b), Tuple{SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}, SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}, Int64}
│   %6 = Base.getproperty(%5, :a)::SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}
│   %7 = Core.tuple(1, 0)::Core.Const((1, 0))
│   %8 = Main.circshift!(%3, %6, %7)::SubArray{Float64, 2, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}
└──      return %8

The good news is that this seems to have gotten rid of the extra allocations. I guess this leaves me with two questions:

  1. I can see in the REPL how I’ve got a concrete V that can be used for type inference on innertestfunc. However, I don’t see how wrapping the inner step in a separate function changes the interpretation of the code at compile time when I call it from testfunc(). I would have thought (perhaps still thinking in C/C++ idiom) that this couldn’t improve type inference at compile time, since anything known when innertestfunc() is called could just as well be known if it’s manually inlined back into testfunc().

  2. I’m still unclear on why A = rand(1001,2,3); appears not to be associated with any allocation.