Remove allocations from splatting?

mihalybaci · July 18, 2023, 7:51pm

I have a simple function and cannot figure out where allocations happening.

using BenchmarkTools

p2s(x, y) = (360.0 + x) % 360, (360.0 + y)  % 360.0   # 0 allocations

function p2s!(p::T, s::T) where T <: VecOrMat
    for (i, coord) in enumerate(eachcol(p))
#        @inbounds s[:, i] .= (0, 0)  # Line A - does not allocate
         @inbounds s[:, i] .= p2s(coord...)  # Line B - allocates
    end
    return s
end

ps = rand(2, 200)*365
ss = similar(ps)

@benchmark p2s!($ps, $ss)

The function p2s does not allocate any memory when run on its own, yet when it runs (Line B) in p2s! it does allocate. I know its that line because if I run Line A instead, and comment out Line B, the function does not allocate. Where are the allocations coming from?

Oscar_Smith · July 18, 2023, 8:03pm

This code as posted errors.

algunion · July 18, 2023, 8:44pm

@mihalybaci, Can you update your example with the exact version that caused the allocation issue? This one throws BoundsError.

mihalybaci · July 18, 2023, 9:16pm

Sorry about that, transcription error. I fixed the code.

danielwe · July 18, 2023, 10:19pm

It’s the splatting. The following does not allocate:

function p2s!(p::T, s::T) where T <: VecOrMat
    for (i, coord) in enumerate(eachcol(p))
#        @inbounds s[:, i] .= (0, 0)  # Line A - does not allocate
         @inbounds s[:, i] .= p2s(coord[1], coord[2])  # Line B - allocates no more
    end
    return s
end

I think this is because the type of coord is an array, which has arbitrary length, so the compiler can’t know how many arguments the splatting will produce.

If you want to keep the splatting but avoid allocation, you can use a vector of SVectors instead of a matrix. It would look something like this:

using BenchmarkTools
using StaticArrays

p2s(x, y) = (360.0 + x) % 360, (360.0 + y)  % 360.0   # 0 allocations

function p2s!(p::T, s::T) where {T <: AbstractVector{<:SVector}}
    for i in eachindex(p, s)
        s[i] = p2s(p[i]...)
    end
    return s
end

ps = rand(SVector{2}, 200) * 365
ss = similar(ps)

@benchmark p2s!($ps, $ss)

You could even use splat to replace the loop with a broadcast:

function p2s!(p::T, s::T) where {T <: AbstractVector{<:SVector}}
    s .= splat(p2s).(p)
    return s
end

lmiq · July 18, 2023, 10:40pm

One other alternative is to provide the length at compile time, with:

julia> function p2s!(p::T, s::T, ::Val{N}) where {N, T <: VecOrMat}
           for (i, coord) in enumerate(eachcol(p))
                @inbounds s[:, i] .= p2s(ntuple(j->coord[j], N)...)  # Line B - allocates
           end
           return s
       end
p2s! (generic function with 2 methods)

julia> @btime p2s!($ps, $ss, $(Val(2)))
  5.486 μs (0 allocations: 0 bytes)

And you can keep your clean call by adding a function barrier (if this is not called itself from within a loop):

julia> p2s!(p::T, s::T) where {T <: VecOrMat} = p2s!(p, s, Val(size(p,1)))
p2s! (generic function with 2 methods)

julia> @btime p2s!($ps, $ss)
  5.883 μs (0 allocations: 0 bytes)

(but if you are working with 2D coordinates of something, you are probably better of with static arrays)

Benny · July 18, 2023, 10:43pm

More specifically, a view/SubArray whose type parameters do not fix its axes’ lengths. And it fundamentally couldn’t because its parent array is a Matrix type that does not fix its dimensions’ lengths. Even if a particular instance can’t change its size, instances of different sizes belong to the same type.

julia> typeof(( first(enumerate(eachcol(ps))) )[2])
SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}

Using StaticArrays.MMatrix introduces this information. That fixes all axes’ lengths though, I have not heard of an array type with only some axes’ lengths fixed, I’ve only seen Arrays containing StaticArrays.

DNF · July 19, 2023, 11:01am

I would suggest writing this as

p2s(x, y) = (360 + x) % 360, (360 + y) % 360

It will work equally well for Float64 input, but much better for Int or Float32 etc. If you give single precision inputs they will be promoted to double, for example, which is probably undesirable. For integers there is also a large performance loss if you don’t do this.

In general, it’s better to use integer literals everywhere, unless there’s some specific reason not to.

mihalybaci · July 19, 2023, 1:56pm

Thanks for all the replies!

aahhhhh, that makes sense.

Yeah, these are coordinate pairs, so ps and ss will always be 2xN. But it turns out that StaticArrays doesn’t buy any extra speed when I tried. In general there will be anywhere from 100 to 10000 pairs, and there is no speed-up for me at lower values, and @danielwe 's version with Svector is slower at 10000 pairs, which seems to be expected for something of that size.

The p2s function is really just MWE, in the real code a bunch of other math happens to x and y. So even if the inputs to p2s are Int, they are almost guaranteed to be Float by the time they hit the % function. I figured then it would avoid type conversions to specify 360.0, but maybe that’s not the case?

DNF · July 19, 2023, 2:18pm

I see, but this is a general principle. You should get used to using integers in situations like this, because a) it tends to preserve types better, avoiding unnecessary promotions, and b) it will be either the same speed or much faster.

For example, if you use a float other than Float64, this will force it to become a Float64, which is probably undesirable.

Unless you have reason to fear integer overflow, use integer literals.

mihalybaci · July 19, 2023, 2:45pm

Yeah, I checked and there is no difference between 360 and 360.0, so I switched back to the integer literals. I also decided to go with @danielwe’s first non-SVector solution because it’s the simplest and because coord can be guaranteed to have 2-elements (and if it doesn’t, then it should error anyway).

The new version of the full code now runs in 11 microseconds, compared to 200 for the original. So, that’s a great result since p2s! will get called hundreds of thousands (maybe millions) of times.

mihalybaci · July 19, 2023, 2:47pm

FYI, I changed the name of the thread to hopefully make it more distinct from other allocation-related posts.

lmiq · July 19, 2023, 2:50pm

StaticArrays should not be slower, it is basically the same thing, internally. The advantage is that it makes your functions more convenient to write. You could use, for instance:

using BenchmarkTools

using StaticArrays
struct Point2D{T} <: FieldVector{2,T}
    x::T
    y::T
end

p2s(x) = @. (360 + x) % 360

function p2s!(p, s) 
    for i in eachindex(p,s)
          @inbounds s[i] = p2s(p[i])
    end
    return s
end

ps = rand(Point2D{Float64}, 200) .* 366
ss = similar(ps)

@benchmark p2s!($ps, $ss)

gives:

julia> @btime p2s!($ps, $ss)
  5.386 μs (0 allocations: 0 bytes)

The computations are the same, but you are operating over points all the time, and the codes is cleaner then.

mihalybaci · July 19, 2023, 2:55pm

hmmm… When I tested for 2x200 and 2x1000, matrices there was essentially no difference. But for 2x10000 standard Array takes 284 us, while Svector takes 320 us. I don’t anticipate runs where N>10000, but it may be possible.

lmiq · July 19, 2023, 2:58pm

That may be only a benchmarking fluctuation. It shouldn’t be slower. Matrices can be advantageous if you can express critical computations with the dimensions shifted (i. e., using Nx2 matrices instead of 2xN), but usually that is not the case for coordinates. In that cases it is possible that linear algebra computations can be further accelerated by BLAS calls or the large arrays.

with 10^6 points, I get, here:

with matrices:

julia> @benchmark p2s!($ps, $ss)
BenchmarkTools.Trial: 166 samples with 1 evaluation.
 Range (min … max):  28.010 ms … 42.877 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.948 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.212 ms ±  3.025 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▃ ▁▁   ▁▅▄                                                 
  ██████▅▆▁███▆▇▅▆▆▆▁▅▅▁▁▁▁▅▅▁▅▁▁▅▅▁▁▁▁▅▁▁▁▅▆▆▆▅▆▅▅▇▁▆▁▅▁▁▁▁▅ ▅
  28 ms        Histogram: log(frequency) by time      38.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

with static arrays:


julia> @benchmark p2s!($ps, $ss)
BenchmarkTools.Trial: 179 samples with 1 evaluation.
 Range (min … max):  27.439 ms …  37.717 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.647 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.958 ms ± 975.401 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▄▅▃▄ ▁     ▁                                                
  █████████▇▇▇▆█▇▆▁▆▆▅▁▆▅▆▁▁▅▁▅▁▅▆▆▅▅▅▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
  27.4 ms       Histogram: log(frequency) by time      31.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

It is a matter of taste.

mihalybaci · July 19, 2023, 4:46pm

I get 27 ms for Array and 30 ms SVector. So not a big difference but still the same percentage slower as with 10^4 samples. Maybe I missed something in my StaticArrays version…

lmiq · July 19, 2023, 4:57pm

@inbounds in the inner loop, maybe? danielwe version didn´t have @inbounds there, while in the matrix version there is.

(is it the same code I posted above?)

mihalybaci · July 19, 2023, 5:14pm

I was using one of the other codes, but I ran your code with the same result, roughly 30.8 ms for a million points, just slighly, bur predicably, slower. @inbounds doesn’t seem to have a noticeable effect.

lmiq · July 19, 2023, 5:36pm

Uhm, I don´t see any difference here. For the records, I’m using Julia 1.9.2, but I wouldn’t bet this changes in earlier versions. Maybe something very platform specific is going on.