Memory allocation in broadcast assignment

Hi! I am struggling to keep a certain level of flexibility and abstraction while optimizing for performance. For example, in the following function, I am doing some processing and modify an array arr in-place.

arr = zeros(2, 4)
p = [1.3, 6.8]

function f1!(arr, p)
    fp1, fp2 = floor.((p[1], p[2]))
    cp1, cp2 = ceil.((p[1], p[2]))
    arr .= [fp1 fp2 fp1 fp2 ; cp1 cp1 cp2 cp2]
end

However, the intermediate array created just before broadcasting allocates memory:

julia> @benchmark f1!($a, $p)
BenchmarkTools.Trial: 
  memory estimate:  272 bytes
  allocs estimate:  9
  --------------
  minimum time:     72.401 ns (0.00% GC)
  median time:      146.003 ns (0.00% GC)
  mean time:        152.632 ns (16.16% GC)
  maximum time:     15.043 μs (99.33% GC)
  --------------
  samples:          10000
  evals/sample:     976

and I can avoid this by simply assigning by hand each value in arr, as done in f2!:

function f2!(arr, p)
    fp1, fp2 = floor.((p[1], p[2]))
    cp1, cp2 = ceil.((p[1], p[2]))
    arr[1, 1] = fp1
    arr[1, 2] = cp1
    arr[1, 3] = fp1
    arr[1, 4] = cp1
    arr[2, 1] = fp2
    arr[2, 2] = fp2
    arr[2, 3] = cp2
    arr[2, 4] = cp2
end
julia> @benchmark f2!($a, $p)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.305 ns (0.00% GC)
  median time:      4.412 ns (0.00% GC)
  mean time:        4.574 ns (0.00% GC)
  maximum time:     29.816 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Is there a possibility to avoid memory allocation while keeping the clarity of f1!?

You could use Array types where the size is known at compile time.

For example https://github.com/JuliaArrays/StaticArrays.jl
or (if only a few dimensions are static)
https://github.com/mateuszbaran/HybridArrays.jl

It could be that this array type is not flexible enough for your situation.
Maybe there is also an easier way to speed this up.

Notice that the proposed functions are not really in-place.

using StaticArrays, BenchmarkTools

function f1_s!(arr, p)
    fp1, fp2 = floor.((p[1], p[2]))
    cp1, cp2 = ceil.((p[1], p[2]))
    arr = @SArray [fp1 fp2 fp1 fp2 ; cp1 cp1 cp2 cp2]
end

arr = zeros(2, 4)
arr_s = @SArray zeros(2, 4)
p = [1.3, 6.8]
p_s = @SArray [1.3, 6.8]

@benchmark f1!($arr, $p)
BenchmarkTools.Trial: 
  memory estimate:  272 bytes
  allocs estimate:  9
  --------------
  minimum time:     94.080 ns (0.00% GC)
  median time:      99.577 ns (0.00% GC)
  mean time:        119.891 ns (6.83% GC)
  maximum time:     2.611 μs (91.40% GC)
  --------------
  samples:          10000
  evals/sample:     946

@benchmark f1_s!($arr_s, $p)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.099 ns (0.00% GC)
  median time:      4.701 ns (0.00% GC)
  mean time:        4.883 ns (0.00% GC)
  maximum time:     405.500 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

### EDIT: see the answer form Skoffer, the following benchmark is not realistic. (But I leave it here for reference).
@benchmark f1_s!($arr_s, $p_s)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.399 ns (0.00% GC)
  median time:      0.500 ns (0.00% GC)
  mean time:        0.512 ns (0.00% GC)
  maximum time:     15.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
4 Likes

To enhance this answer.

  1. sub nanoseconds timings is an indication that compiler elides calculations, so to benchmark properly, one should use Ref trick
@benchmark f1_s!($arr, Ref($p_s)[])

julia> @benchmark f1_s!($arr, Ref($p_s)[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.386 ns (0.00% GC)
  median time:      2.397 ns (0.00% GC)
  mean time:        2.477 ns (0.00% GC)
  maximum time:     32.285 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

which is more realistic.

  1. If it is absolutely necessary, it is possible to keep original arrays, just add broadcasting in the last line
function f2_s!(arr, p)
    fp1, fp2 = floor.((p[1], p[2]))
    cp1, cp2 = ceil.((p[1], p[2]))
    arr .= @SArray [fp1 fp2 fp1 fp2 ; cp1 cp1 cp2 cp2]
end

@benchmark f2_s!($arr, Ref($p_s)[])

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.763 ns (0.00% GC)
  median time:      4.776 ns (0.00% GC)
  mean time:        5.057 ns (0.00% GC)
  maximum time:     35.714 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  1. But at the same time, one should be careful with this kind of microbenchmarks, since they omit the creation of p_s object. Of course, it doesn’t allocate in the heap, but it still allocates and requires some computational resources. So, if it is created very rarely, this approach is fine, but if p vector is generated in a loop, then the corresponding benchmark would look like
function f3_s!(arr, p)
    p_s = SVector{2}(p)
    f2_s!(arr, p_s)
end
@benchmark f3_s!($arr, $p)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.698 ns (0.00% GC)
  median time:      5.758 ns (0.00% GC)
  mean time:        6.197 ns (0.00% GC)
  maximum time:     36.615 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

In the end it is not as fast as handwritten assignments, which is reasonable, since StaticArrays add small overhead. But compactness/speed ratio is in favor of StaticArrays approach I think.

5 Likes

Thank you @SteffenPL and @Skoffer for your explanations! It works great. I would like to mark Steffen’s answer as a solution, but I am not finding how… :sweat_smile: