Avoiding memory allocation when repeatedly using small arrays

I often find myself needing fairly simple, small autodiff vector operations applied repeatedly, and I don’t know how to do this without allocating on each call. Is it possible?

Here is an example where I’m integrating an array p over a surface implicitly defined by f(x)=0.

using BenchmarkTools
using ForwardDiff
using StaticArrays

n = 2^7
p = rand(n,n)

f(x) where N = √sum(abs2,x .- 2^6) - 2^5
function n_direct(x)
    v = x .- 2^6
    return v/√sum(abs2,v)
end
n_auto(x) = ForwardDiff.gradient(f,x)

function oint(f,n,p::Array{T,N}) where {T,N}
    s = zeros(T,N)
    for I ∈ CartesianIndices(p)
        x = SA_F64[I.I...]
        d = f(x)::Float64
        abs(d) ≤ 1 && (s .+= n(x).*p[I]*(1+cos(π*d))/2)
    end
    return s
end

I typically want to use ForwardDiff.gradient to define the normal n, so the user doesn’t need to supply it, but this causes allocations.

julia> @benchmark oint(f,n_direct,p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  127.100 μs … 499.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     127.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   130.284 μs ±  13.055 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂ ▃                    ▂                                     ▁
  █████▇▇▄▄▄▃▇▇▇▆▅▄▃▃▃▄▂▅▄█▆▄▄▄▄▄▅▄▅▃▂▄▂▃▂▃▂▄▄▂▂▄▂▅▃▃▂▃▄▄▅▃▂▃▄▃ █
  127 μs        Histogram: log(frequency) by time        200 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

julia> @benchmark oint(f,n_auto,p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  142.400 μs …  4.375 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     143.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   150.388 μs ± 96.927 μs  ┊ GC (mean ± σ):  1.60% ± 2.69%

  █▂ ▁  ▂▂▂ ▁▁        ▁                                        ▁
  ████▆████████▇▆▆▄▅▃▅██▆▄▅▃▃▄▅▅▄▄▄▃▃▃▄▃▄▄▄▄▄▄▃▄▃▅▄▄▅▃▄▃▃▄▄▄▅▄ █
  142 μs        Histogram: log(frequency) by time       228 μs <

 Memory estimate: 38.72 KiB, allocs estimate: 825.

I really enjoyed the Workshop yesterday on tools to improve performance, but I’m not sure what to do to fix this…

julia> using JETTest
julia> @report_dispatch oint(f,n_auto,p)
═════ 2 possible errors found ═════
┌ @ C:\Users\admin\Documents\GitHub\WaterLily\temp.jl:22 n(x)
│┌ @ C:\Users\admin\Documents\GitHub\WaterLily\temp.jl:14 ForwardDiff.gradient(Main.f, x)
││┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\gradient.jl:44 ForwardDiff.vector_mode_gradient(f, x)
│││┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\gradient.jl:120 ForwardDiff.static_dual_eval(T, f, x)
││││┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\apiutils.jl:32 f(%8)
│││││ runtime dispatch detected: f::typeof(f)(%8::SVector{2, ForwardDiff.Dual{ForwardDiff.Tag{typeof(f), Float64}, Float64, 2}})
││││└───────────────────────────────────────────────────────────────────────
│┌ @ C:\Users\admin\.julia\packages\ForwardDiff\QOqCN\src\apiutils.jl:32 %1(%9)
││ runtime dispatch detected: %1::typeof(f)(%9::SVector{2, ForwardDiff.Dual{ForwardDiff.Tag{typeof(f), Float64}, Float64, 2}})
│└───────────────────────────────────────────────────────────────────────
(Vector{Float64}, 2)

center and radius are global variables.


You’re right, that was a bad example. I’ll correct it.

For some reason the where N in f(x) is causing allocations:

julia> f(x) where N = √sum(abs2,x .- 2^6) - 2^5
f (generic function with 1 method)

julia> @btime f($x)
  22.725 ns (2 allocations: 48 bytes)
78.32565875726226

julia> f(x) = √sum(abs2,x .- 2^6) - 2^5
f (generic function with 1 method)

julia> @btime f($x)
  5.862 ns (0 allocations: 0 bytes)
78.32565875726226

I had to change that to get what you report:

julia> @benchmark oint($f,$n_auto,$p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  114.826 μs … 284.301 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     115.015 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   117.534 μs ±   9.510 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁     ▁   ▁                                             ▁   ▁
  █▆█▆▇▇▆▆█▃▄▃█▄▁▄▁▇▆▇▇▇▇▅▅▆▅▇▅▅▆▅▇▆▃▅▅▅▇▆▆▆▅▅▆▆▅▆▅▅▅█▄▅▅▅▅▇█▇▆ █
  115 μs        Histogram: log(frequency) by time        160 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

Then, with these changes, you get 0 allocations:

julia> function oint(f,n,p::Array{T,N}) where {T,N}
           s = zeros(SVector{N,T}) # static vector here: immutable and stack-allocated
           for I ∈ CartesianIndices(p)
               x = SVector{N,T}(ntuple(i->I[i],N)) # initialize the SVector with its size fixed at compile time
               d = f(x)::Float64
               abs(d) ≤ 1 && (s += n(x).*p[I]*(1+cos(π*d))/2) # += instead of .+= : rebind s rather than mutate it
           end
           return s
       end
oint (generic function with 1 method)

julia> @benchmark oint($f,$n_auto,$p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  105.190 μs … 221.240 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     105.372 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   107.170 μs ±   7.259 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █      ▃▂    ▁                                                ▁
  █▆▅▆▇████▆▆▄▅█▄▁▄▆▆▄▅▄▅▇▅▅▅▅▇▁▅▃▃▄▇▄▃▄▅▄▇▅▅▄▅▃▇▄▃▅▃▄▄▇▄▅▃▃▄▅█ █
  105 μs        Histogram: log(frequency) by time        145 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
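For anyone puzzled by the removed dot: an SVector is immutable, so `s .+= …` (in-place mutation) would throw, while `s += …` simply rebinds `s` to a fresh stack-allocated value. A minimal sketch, assuming StaticArrays is available:

```julia
using StaticArrays

s = zeros(SVector{2,Float64})  # immutable, stack-allocated: no heap allocation
s += SA[1.0, 2.0]              # `+=` rebinds `s` to a brand-new SVector
# s .+= SA[1.0, 2.0]           # would throw: an SVector cannot be mutated in place
s                              # == SA[1.0, 2.0]
```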



Thanks. That was a typo left in from previous attempts!

Yeah. I recommend always using Test.detect_unbound_args(YourPackage) or Aqua to check for this.
Of course, these only work if your code is organized into packages.
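As a sketch (the module name Demo is made up), this catches the same kind of unbound type variable as the f(x) where N example above:

```julia
using Test

module Demo
    f(x) where N = x + 1   # `N` appears only after `where`: always unbound
    g(x) = x + 1           # no type variables: fine
end

unbound = Test.detect_unbound_args(Demo)
# `unbound` is a vector of Methods containing `f`'s method; `g` is not listed
```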

The one annoying thing is that f(x::NTuple{N,T}) where {N,T} will be reported as having T unbound, since T genuinely is unbound if you call it as f(()) (i.e., with an empty Tuple{}).
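To illustrate that caveat (Demo2 is a made-up module name):

```julia
using Test

module Demo2
    # `T` is bound for any non-empty tuple, but f(()) would leave it unbound,
    # so detect_unbound_args flags this method even though it is usually fine.
    f(x::NTuple{N,T}) where {N,T} = N
end

Demo2.f((1.0, 2.0))              # returns 2, with T == Float64
Test.detect_unbound_args(Demo2)  # still reports `f`
```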
