Avoiding memory allocation when repeatedly using small arrays

I often find myself needing fairly simple, small autodiff vector operations applied repeatedly, and I don’t know how to do this without allocating on each call. Is it possible?

Here is an example where I’m integrating an array p over a surface implicitly defined by f(x)=0.

using BenchmarkTools
using ForwardDiff
using StaticArrays

n = 2^7
p = rand(n,n)

f(x) where N = √sum(abs2,x .- 2^6) - 2^5
function n_direct(x)
    v = x .- 2^6
    return v/√sum(abs2,v)
end
n_auto(x) = ForwardDiff.gradient(f,x)

function oint(f,n,p::Array{T,N}) where {T,N}
    s = zeros(T,N)
    for I ∈ CartesianIndices(p)
        x = SA_F64[I.I...]
        d = f(x)::Float64
        abs(d) ≤ 1 && (s .+= n(x).*p[I]*(1+cos(π*d))/2)
    end
    return s
end

I typically want to use ForwardDiff.gradient to define the normal n, so the user doesn’t need to supply it, but this causes allocations.

julia> @benchmark oint(f,n_direct,p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  127.100 μs … 499.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     127.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   130.284 μs ±  13.055 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂ ▃                    ▂                                     ▁
  █████▇▇▄▄▄▃▇▇▇▆▅▄▃▃▃▄▂▅▄█▆▄▄▄▄▄▅▄▅▃▂▄▂▃▂▃▂▄▄▂▂▄▂▅▃▃▂▃▄▄▅▃▂▃▄▃ █
  127 μs        Histogram: log(frequency) by time        200 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

julia> @benchmark oint(f,n_auto,p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  142.400 μs …  4.375 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     143.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   150.388 μs ± 96.927 μs  ┊ GC (mean ± σ):  1.60% ± 2.69%

  █▂ ▁  ▂▂▂ ▁▁        ▁                                        ▁
  ████▆████████▇▆▆▄▅▃▅██▆▄▅▃▃▄▅▅▄▄▄▃▃▃▄▃▄▄▄▄▄▄▃▄▃▅▄▄▅▃▄▃▃▄▄▄▅▄ █
  142 μs        Histogram: log(frequency) by time       228 μs <

 Memory estimate: 38.72 KiB, allocs estimate: 825.

I really enjoyed the Workshop yesterday on tools to improve performance, but I’m not sure what to do to fix this…

julia> using JETTest
julia> @report_dispatch oint(f,n_auto,p)
═════ 2 possible errors found ═════
┌ @ C:\Users\admin\Documents\GitHub\WaterLily\temp.jl:22 n(x)
│┌ @ C:\Users\admin\Documents\GitHub\WaterLily\temp.jl:14 ForwardDiff.gradient(Main.f, x)
││┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\gradient.jl:44 ForwardDiff.vector_mode_gradient(f, x)
│││┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\gradient.jl:120 ForwardDiff.static_dual_eval(T, f, x)
││││┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\apiutils.jl:32 f(%8)
│││││ runtime dispatch detected: f::typeof(f)(%8::SVector{2, ForwardDiff.Dual{ForwardDiff.Tag{typeof(f), Float64}, Float64, 2}})
││││└───────────────────────────────────────────────────────────────────────
│┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\apiutils.jl:32 %1(%9)
││ runtime dispatch detected: %1::typeof(f)(%9::SVector{2, ForwardDiff.Dual{ForwardDiff.Tag{typeof(f), Float64}, Float64, 2}})
│└───────────────────────────────────────────────────────────────────────
(Vector{Float64}, 2)

center and radius are global variables.


You’re right, that was a bad example. I’ll correct it.

For some reason the where N in f(x) is causing allocations:

julia> f(x) where N = √sum(abs2,x .- 2^6) - 2^5
f (generic function with 1 method)

julia> @btime f($x)
  22.725 ns (2 allocations: 48 bytes)
78.32565875726226

julia> f(x) = √sum(abs2,x .- 2^6) - 2^5
f (generic function with 1 method)

julia> @btime f($x)
  5.862 ns (0 allocations: 0 bytes)
78.32565875726226

I had to change that to get what you report:

julia> @benchmark oint($f,$n_auto,$p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  114.826 μs … 284.301 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     115.015 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   117.534 μs ±   9.510 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁     ▁   ▁                                             ▁   ▁
  █▆█▆▇▇▆▆█▃▄▃█▄▁▄▁▇▆▇▇▇▇▅▅▆▅▇▅▅▆▅▇▆▃▅▅▅▇▆▆▆▅▅▆▆▅▆▅▅▅█▄▅▅▅▅▇█▇▆ █
  115 μs        Histogram: log(frequency) by time        160 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

Then, with these changes, you get 0 allocations:

julia> function oint(f,n,p::Array{T,N}) where {T,N}
           s = zeros(SVector{N,T}) # static vector here: immutable and stack-allocated
           for I ∈ CartesianIndices(p)
               x = SVector{N,T}(ntuple(i->I[i],N)) # initialize the SVector with its size fixed at compile time
               d = f(x)::Float64
               abs(d) ≤ 1 && (s += n(x).*p[I]*(1+cos(π*d))/2) # += instead of .+= : rebind s rather than mutate it
           end
           return s
       end
oint (generic function with 1 method)

julia> @benchmark oint($f,$n_auto,$p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  105.190 μs … 221.240 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     105.372 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   107.170 μs ±   7.259 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █      ▃▂    ▁                                                ▁
  █▆▅▆▇████▆▆▄▅█▄▁▄▆▆▄▅▄▅▇▅▅▅▅▇▁▅▃▃▄▇▄▃▄▅▄▇▅▅▄▅▃▇▄▃▅▃▄▄▇▄▅▃▃▄▅█ █
  105 μs        Histogram: log(frequency) by time        145 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
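For anyone puzzled by the removed dot: an SVector is immutable, so `s .+= …` (in-place mutation) would throw, while `s += …` simply rebinds `s` to a fresh stack-allocated value. A minimal sketch, assuming StaticArrays is available:

```julia
using StaticArrays

s = zeros(SVector{2,Float64})  # immutable, stack-allocated: no heap allocation
s += SA[1.0, 2.0]              # `+=` rebinds `s` to a brand-new SVector
# s .+= SA[1.0, 2.0]           # would throw: an SVector cannot be mutated in place
s                              # == SA[1.0, 2.0]
```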



Thanks. That was a typo left in from previous attempts!

Yeah. I recommend always using Test.detect_unbound_args(YourPackage) or Aqua to check for this.
Of course, these only work if your code is organized into packages.
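As a sketch (the module name Demo is made up), this catches the same kind of unbound type variable as the f(x) where N example above:

```julia
using Test

module Demo
    f(x) where N = x + 1   # `N` appears only after `where`: always unbound
    g(x) = x + 1           # no type variables: fine
end

unbound = Test.detect_unbound_args(Demo)
# `unbound` is a vector of Methods containing `f`'s method; `g` is not listed
```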

The one annoying thing is that f(x::NTuple{N,T}) where {N,T} will be reported as having T unbound, since T genuinely is unbound if you call it as f(()) (i.e., with an empty Tuple{}).
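To illustrate that caveat (Demo2 is a made-up module name):

```julia
using Test

module Demo2
    # `T` is bound for any non-empty tuple, but f(()) would leave it unbound,
    # so detect_unbound_args flags this method even though it is usually fine.
    f(x::NTuple{N,T}) where {N,T} = N
end

Demo2.f((1.0, 2.0))              # returns 2, with T == Float64
Test.detect_unbound_args(Demo2)  # still reports `f`
```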
