Remove allocations from splatting?

I have a simple function and cannot figure out where allocations happening.

using BenchmarkTools

p2s(x, y) = (360.0 + x) % 360, (360.0 + y)  % 360.0   # 0 allocations

function p2s!(p::T, s::T) where T <: VecOrMat
    for (i, coord) in enumerate(eachcol(p))
#        @inbounds s[:, i] .= (0, 0)  # Line A - does not allocate
         @inbounds s[:, i] .= p2s(coord...)  # Line B - allocates
    end
    return s
end

ps = rand(2, 200)*365
ss = similar(ps)

@benchmark p2s!($ps, $ss)

The function p2s does not allocate any memory when run on its own, yet when it runs (Line B) in p2s! it does allocate. I know its that line because if I run Line A instead, and comment out Line B, the function does not allocate. Where are the allocations coming from?

1 Like

This code as posted errors.

3 Likes

@mihalybaci, Can you update your example with the exact version that caused the allocation issue? This one throws BoundsError.

Sorry about that, transcription error. I fixed the code.

It’s the splatting. The following does not allocate:

function p2s!(p::T, s::T) where T <: VecOrMat
    for (i, coord) in enumerate(eachcol(p))
#        @inbounds s[:, i] .= (0, 0)  # Line A - does not allocate
         @inbounds s[:, i] .= p2s(coord[1], coord[2])  # Line B - allocates no more
    end
    return s
end

I think this is because the type of coord is an array, which has arbitrary length, so the compiler can’t know how many arguments the splatting will produce.


If you want to keep the splatting but avoid allocation, you can use a vector of SVectors instead of a matrix. It would look something like this:

using BenchmarkTools
using StaticArrays

p2s(x, y) = (360.0 + x) % 360, (360.0 + y)  % 360.0   # 0 allocations

function p2s!(p::T, s::T) where {T <: AbstractVector{<:SVector}}
    for i in eachindex(p, s)
        s[i] = p2s(p[i]...)
    end
    return s
end

ps = rand(SVector{2}, 200) * 365
ss = similar(ps)

@benchmark p2s!($ps, $ss)

You could even use splat to replace the loop with a broadcast:

function p2s!(p::T, s::T) where {T <: AbstractVector{<:SVector}}
    s .= splat(p2s).(p)
    return s
end
3 Likes

One other alternative is to provide the length at compile time, with:

julia> function p2s!(p::T, s::T, ::Val{N}) where {N, T <: VecOrMat}
           for (i, coord) in enumerate(eachcol(p))
                @inbounds s[:, i] .= p2s(ntuple(j->coord[j], N)...)  # Line B - allocates
           end
           return s
       end
p2s! (generic function with 2 methods)

julia> @btime p2s!($ps, $ss, $(Val(2)))
  5.486 ΞΌs (0 allocations: 0 bytes)

And you can keep your clean call by adding a function barrier (if this is not called itself from within a loop):

julia> p2s!(p::T, s::T) where {T <: VecOrMat} = p2s!(p, s, Val(size(p,1)))
p2s! (generic function with 2 methods)

julia> @btime p2s!($ps, $ss)
  5.883 ΞΌs (0 allocations: 0 bytes)

(but if you are working with 2D coordinates of something, you are probably better of with static arrays)

2 Likes

More specifically, a view/SubArray whose type parameters do not fix its axes’ lengths. And it fundamentally couldn’t because its parent array is a Matrix type that does not fix its dimensions’ lengths. Even if a particular instance can’t change its size, instances of different sizes belong to the same type.

julia> typeof(( first(enumerate(eachcol(ps))) )[2])
SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}

Using StaticArrays.MMatrix introduces this information. That fixes all axes’ lengths though, I have not heard of an array type with only some axes’ lengths fixed, I’ve only seen Arrays containing StaticArrays.

1 Like

I would suggest writing this as

p2s(x, y) = (360 + x) % 360, (360 + y) % 360

It will work equally well for Float64 input, but much better for Int or Float32 etc. If you give single precision inputs they will be promoted to double, for example, which is probably undesirable. For integers there is also a large performance loss if you don’t do this.

In general, it’s better to use integer literals everywhere, unless there’s some specific reason not to.

Thanks for all the replies!

aahhhhh, that makes sense.

Yeah, these are coordinate pairs, so ps and ss will always be 2xN. But it turns out that StaticArrays doesn’t buy any extra speed when I tried. In general there will be anywhere from 100 to 10000 pairs, and there is no speed-up for me at lower values, and @danielwe 's version with Svector is slower at 10000 pairs, which seems to be expected for something of that size.

The p2s function is really just MWE, in the real code a bunch of other math happens to x and y. So even if the inputs to p2s are Int, they are almost guaranteed to be Float by the time they hit the % function. I figured then it would avoid type conversions to specify 360.0, but maybe that’s not the case?

I see, but this is a general principle. You should get used to using integers in situations like this, because a) it tends to preserve types better, avoiding unnecessary promotions, and b) it will be either the same speed or much faster.

For example, if you use a float other than Float64, this will force it to become a Float64, which is probably undesirable.

Unless you have reason to fear integer overflow, use integer literals.

1 Like

Yeah, I checked and there is no difference between 360 and 360.0, so I switched back to the integer literals. I also decided to go with @danielwe’s first non-SVector solution because it’s the simplest and because coord can be guaranteed to have 2-elements (and if it doesn’t, then it should error anyway).

The new version of the full code now runs in 11 microseconds, compared to 200 for the original. So, that’s a great result since p2s! will get called hundreds of thousands (maybe millions) of times.

FYI, I changed the name of the thread to hopefully make it more distinct from other allocation-related posts.

StaticArrays should not be slower, it is basically the same thing, internally. The advantage is that it makes your functions more convenient to write. You could use, for instance:

using BenchmarkTools

using StaticArrays
struct Point2D{T} <: FieldVector{2,T}
    x::T
    y::T
end

p2s(x) = @. (360 + x) % 360

function p2s!(p, s) 
    for i in eachindex(p,s)
          @inbounds s[i] = p2s(p[i])
    end
    return s
end

ps = rand(Point2D{Float64}, 200) .* 366
ss = similar(ps)

@benchmark p2s!($ps, $ss)

gives:

julia> @btime p2s!($ps, $ss)
  5.386 ΞΌs (0 allocations: 0 bytes)

The computations are the same, but you are operating over points all the time, and the codes is cleaner then.

hmmm… When I tested for 2x200 and 2x1000, matrices there was essentially no difference. But for 2x10000 standard Array takes 284 us, while Svector takes 320 us. I don’t anticipate runs where N>10000, but it may be possible.

That may be only a benchmarking fluctuation. It shouldn’t be slower. Matrices can be advantageous if you can express critical computations with the dimensions shifted (i. e., using Nx2 matrices instead of 2xN), but usually that is not the case for coordinates. In that cases it is possible that linear algebra computations can be further accelerated by BLAS calls or the large arrays.

with 10^6 points, I get, here:

with matrices:

julia> @benchmark p2s!($ps, $ss)
BenchmarkTools.Trial: 166 samples with 1 evaluation.
 Range (min … max):  28.010 ms … 42.877 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     28.948 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   30.212 ms Β±  3.025 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–„β–ƒ ▁▁   ▁▅▄                                                 
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–†β–β–ˆβ–ˆβ–ˆβ–†β–‡β–…β–†β–†β–†β–β–…β–…β–β–β–β–β–…β–…β–β–…β–β–β–…β–…β–β–β–β–β–…β–β–β–β–…β–†β–†β–†β–…β–†β–…β–…β–‡β–β–†β–β–…β–β–β–β–β–… β–…
  28 ms        Histogram: log(frequency) by time      38.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

with static arrays:


julia> @benchmark p2s!($ps, $ss)
BenchmarkTools.Trial: 179 samples with 1 evaluation.
 Range (min … max):  27.439 ms …  37.717 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     27.647 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   27.958 ms Β± 975.401 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–†β–ˆβ–„β–…β–ƒβ–„ ▁     ▁                                                
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–ˆβ–‡β–†β–β–†β–†β–…β–β–†β–…β–†β–β–β–…β–β–…β–β–…β–†β–†β–…β–…β–…β–β–…β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–… β–…
  27.4 ms       Histogram: log(frequency) by time      31.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

It is a matter of taste.

I get 27 ms for Array and 30 ms SVector. So not a big difference but still the same percentage slower as with 10^4 samples. Maybe I missed something in my StaticArrays version…

@inbounds in the inner loop, maybe? danielwe version didnΒ΄t have @inbounds there, while in the matrix version there is.

(is it the same code I posted above?)

I was using one of the other codes, but I ran your code with the same result, roughly 30.8 ms for a million points, just slighly, bur predicably, slower. @inbounds doesn’t seem to have a noticeable effect.

Uhm, I donΒ΄t see any difference here. For the records, I’m using Julia 1.9.2, but I wouldn’t bet this changes in earlier versions. Maybe something very platform specific is going on.