I have a simple function and cannot figure out where the allocations are happening.

using BenchmarkTools

p2s(x, y) = (360.0 + x) % 360, (360.0 + y) % 360.0 # 0 allocations

function p2s!(p::T, s::T) where T <: VecOrMat
    for (i, coord) in enumerate(eachcol(p))
        # @inbounds s[:, i] .= (0, 0)      # Line A - does not allocate
        @inbounds s[:, i] .= p2s(coord...) # Line B - allocates
    end
    return s
end

ps = rand(2, 200) * 365
ss = similar(ps)
@benchmark p2s!($ps, $ss)

The function p2s does not allocate any memory when run on its own, yet when it runs inside p2s! (Line B) it does allocate. I know it's that line because if I run Line A instead and comment out Line B, the function does not allocate. Where are the allocations coming from?

It’s the splatting. The following does not allocate:

function p2s!(p::T, s::T) where T <: VecOrMat
    for (i, coord) in enumerate(eachcol(p))
        # @inbounds s[:, i] .= (0, 0)                # Line A - does not allocate
        @inbounds s[:, i] .= p2s(coord[1], coord[2]) # Line B - allocates no more
    end
    return s
end

I think this is because the type of coord is an array, which has arbitrary length, so the compiler can’t know how many arguments the splatting will produce.
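A quick way to see it (a self-contained sketch, not from the thread, using `@allocated`): splatting a view, whose length is only known at run time, typically allocates, while splatting a tuple, whose length is part of its type, does not.

```julia
p2s(x, y) = (360.0 + x) % 360, (360.0 + y) % 360.0

splat_view(v)  = p2s(v...)  # length of v is not in its type
splat_tuple(t) = p2s(t...)  # length of t is in its type

v = view(rand(2, 3), :, 1)  # a SubArray column, like `coord` above
t = (v[1], v[2])

splat_view(v); splat_tuple(t)    # warm up so compilation isn't counted
@show @allocated splat_view(v)   # typically > 0: arity only known at run time
@show @allocated splat_tuple(t)  # 0: arity known to the compiler
```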

If you want to keep the splatting but avoid allocation, you can use a vector of SVectors instead of a matrix. It would look something like this:

using BenchmarkTools
using StaticArrays

p2s(x, y) = (360.0 + x) % 360, (360.0 + y) % 360.0 # 0 allocations

function p2s!(p::T, s::T) where {T <: AbstractVector{<:SVector}}
    for i in eachindex(p, s)
        s[i] = p2s(p[i]...)
    end
    return s
end

ps = rand(SVector{2}, 200) * 365
ss = similar(ps)
@benchmark p2s!($ps, $ss)

You could even use splat to replace the loop with a broadcast:

function p2s!(p::T, s::T) where {T <: AbstractVector{<:SVector}}
    s .= splat(p2s).(p)
    return s
end

One other alternative is to provide the length at compile time, with:

julia> function p2s!(p::T, s::T, ::Val{N}) where {N, T <: VecOrMat}
           for (i, coord) in enumerate(eachcol(p))
               @inbounds s[:, i] .= p2s(ntuple(j -> coord[j], N)...) # Line B - no longer allocates
           end
           return s
       end
p2s! (generic function with 2 methods)

julia> @btime p2s!($ps, $ss, $(Val(2)))
  5.486 μs (0 allocations: 0 bytes)

And you can keep your clean call by adding a function barrier (if this is not called itself from within a loop):

julia> p2s!(p::T, s::T) where {T <: VecOrMat} = p2s!(p, s, Val(size(p, 1)))
p2s! (generic function with 2 methods)

julia> @btime p2s!($ps, $ss)
  5.883 μs (0 allocations: 0 bytes)

(but if you are working with 2D coordinates of something, you are probably better off with static arrays)

More specifically, a view/SubArray whose type parameters do not fix its axes’ lengths. And it fundamentally couldn’t because its parent array is a Matrix type that does not fix its dimensions’ lengths. Even if a particular instance can’t change its size, instances of different sizes belong to the same type.
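A minimal illustration of this point (a sketch, assuming StaticArrays is available): two columns taken from matrices of different widths have exactly the same SubArray type, whereas an SVector's length is a type parameter.

```julia
using StaticArrays

a = rand(2, 10)
b = rand(2, 1000)

# The length is not part of the view's type: both columns share one type.
@assert typeof(view(a, :, 1)) === typeof(view(b, :, 1))

# An SVector encodes its length as a type parameter instead:
@assert SVector(1.0, 2.0) isa SVector{2, Float64}
@assert !(SVector(1.0, 2.0, 3.0) isa SVector{2})
```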

Using StaticArrays.MMatrix introduces this information, though it fixes all axes' lengths. I have not heard of an array type with only some axes' lengths fixed; I've only seen Arrays containing StaticArrays.

It will work equally well for Float64 input, but much better for Int or Float32, etc. If you give single-precision inputs they will be promoted to double precision, for example, which is probably undesirable. For integers there is also a large performance loss if you don't do this.

In general, it’s better to use integer literals everywhere, unless there’s some specific reason not to.
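For instance (a quick check, not from the thread): with an integer literal the result keeps the input's precision, while a `Float64` literal promotes everything.

```julia
x32 = 1.5f0  # Float32 input

# Integer literal: result stays in single precision
@assert (360 + x32) % 360 isa Float32

# Float64 literal: promoted to double precision
@assert (360.0 + x32) % 360.0 isa Float64

# For Float64 inputs, both forms give identical results:
@assert (360 + 1.5) % 360 === (360.0 + 1.5) % 360.0
```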

Yeah, these are coordinate pairs, so ps and ss will always be 2xN. But it turns out that StaticArrays doesn't buy any extra speed when I tried. In general there will be anywhere from 100 to 10000 pairs; there is no speed-up for me at lower values, and @danielwe's version with SVector is slower at 10000 pairs, which seems to be expected for something of that size.

The p2s function is really just an MWE; in the real code a bunch of other math happens to x and y. So even if the inputs to p2s are Int, they are almost guaranteed to be Float by the time they hit the % function. I figured it would then avoid type conversions to specify 360.0, but maybe that's not the case?

I see, but this is a general principle. You should get used to using integers in situations like this, because a) it tends to preserve types better, avoiding unnecessary promotions, and b) it will be either the same speed or much faster.

For example, if you use a float other than Float64, this will force it to become a Float64, which is probably undesirable.

Unless you have reason to fear integer overflow, use integer literals.
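Both points in a nutshell (a quick sketch, not from the thread): integer literals keep Int inputs in Int arithmetic, and the overflow caveat only bites near `typemax`.

```julia
# Integer literals keep Int inputs in Int arithmetic:
@assert (360 + 725) % 360 === 5

# The caveat: Int arithmetic wraps around silently at the extremes,
@assert typemax(Int) + 1 == typemin(Int)
# which is why you'd switch to floats (or Int128) if overflow were a real risk.
```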

Yeah, I checked and there is no difference between 360 and 360.0, so I switched back to the integer literals. I also decided to go with @danielwe’s first non-SVector solution because it’s the simplest and because coord can be guaranteed to have 2-elements (and if it doesn’t, then it should error anyway).

The new version of the full code now runs in 11 microseconds, compared to 200 for the original. So, that’s a great result since p2s! will get called hundreds of thousands (maybe millions) of times.

StaticArrays should not be slower, it is basically the same thing, internally. The advantage is that it makes your functions more convenient to write. You could use, for instance:

using BenchmarkTools
using StaticArrays

struct Point2D{T} <: FieldVector{2,T}
    x::T
    y::T
end

p2s(x) = @. (360 + x) % 360

function p2s!(p, s)
    for i in eachindex(p, s)
        @inbounds s[i] = p2s(p[i])
    end
    return s
end

ps = rand(Point2D{Float64}, 200) .* 366
ss = similar(ps)
@benchmark p2s!($ps, $ss)

Hmmm… When I tested 2x200 and 2x1000 matrices there was essentially no difference. But for 2x10000 the standard Array takes 284 μs, while the SVector version takes 320 μs. I don't anticipate runs where N > 10000, but it may be possible.

That may just be benchmarking fluctuation; it shouldn't be slower. Matrices can be advantageous if you can express critical computations with the dimensions shifted (i.e., using Nx2 matrices instead of 2xN), but usually that is not the case for coordinates. In those cases it is possible that linear algebra computations can be further accelerated by BLAS calls for large arrays.

with 10^6 points, I get, here:

with matrices:

julia> @benchmark p2s!($ps, $ss)
BenchmarkTools.Trial: 166 samples with 1 evaluation.
Range (min … max): 28.010 ms … 42.877 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 28.948 ms ┊ GC (median): 0.00%
Time (mean ± σ): 30.212 ms ± 3.025 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▄▃ ▁▁ ▁▅▄
██████▅▆▁███▆▇▅▆▆▆▁▅▅▁▁▁▁▅▅▁▅▁▁▅▅▁▁▁▁▅▁▁▁▅▆▆▆▅▆▅▅▇▁▆▁▅▁▁▁▁▅ ▅
28 ms Histogram: log(frequency) by time 38.8 ms <
Memory estimate: 0 bytes, allocs estimate: 0.

with static arrays:

julia> @benchmark p2s!($ps, $ss)
BenchmarkTools.Trial: 179 samples with 1 evaluation.
Range (min … max): 27.439 ms … 37.717 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.647 ms ┊ GC (median): 0.00%
Time (mean ± σ): 27.958 ms ± 975.401 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▄▅▃▄ ▁ ▁
█████████▇▇▇▆█▇▆▁▆▆▅▁▆▅▆▁▁▅▁▅▁▅▆▆▅▅▅▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
27.4 ms Histogram: log(frequency) by time 31.2 ms <
Memory estimate: 0 bytes, allocs estimate: 0.

I get 27 ms for Array and 30 ms for SVector. So not a big difference, but still the same percentage slower as with 10^4 samples. Maybe I missed something in my StaticArrays version…

I was using one of the other codes, but I ran your code with the same result: roughly 30.8 ms for a million points, just slightly, but predictably, slower. @inbounds doesn't seem to have a noticeable effect.

Uhm, I don't see any difference here. For the record, I'm using Julia 1.9.2, but I wouldn't bet this changes in earlier versions. Maybe something very platform-specific is going on.