Faster zeros with calloc

More often than not, memory is the bottleneck.
Why pass over memory twice, first to zero-initialize it and then again to actually use it (which is what `zeros` does), when you could make just one pass, zeroing as you go (which is what `calloc` enables)?
Fusion is great. Theoretically, calloc seems like it has nothing but upsides.
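For reference, the benchmarks below assume a `zeros_via_calloc` helper along these lines (a minimal sketch; the name and the `Libc.calloc`-based implementation are assumptions, not Base API):

```julia
# Hypothetical helper: allocate through Libc.calloc, so the OS can hand back
# pages that are already zeroed (often lazily, on first touch), then wrap the
# raw pointer in a Julia Array that frees the memory when collected (own = true).
function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    ptr == C_NULL && throw(OutOfMemoryError())
    return unsafe_wrap(Array{T}, ptr, dims; own = true)
end
```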

On Linux, comparing an `undef` allocation with `zeros_via_calloc`:

julia> @benchmark zeros_via_calloc(Float64, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   22.352 μs …  1.114 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     335.558 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   337.057 μs ± 12.094 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                            █
  ▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂█▅ ▂
  22.4 μs         Histogram: frequency by time          345 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark Array{Float64}(undef, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  20.666 μs … 48.211 μs  ┊ GC (min … max): 95.95% … 95.08%
 Time  (median):     41.557 μs              ┊ GC (median):    96.05%
 Time  (mean ± σ):   34.945 μs ±  9.908 μs  ┊ GC (mean ± σ):  95.93% ±  0.19%

  █                                                      ▅▁▅
  █▆▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▆ ▃
  20.7 μs         Histogram: frequency by time        42.8 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Similar minimum time, very different median and mean.

For me, fill!(zeros_via_calloc(...), ...) is about 2x faster than using zeros on Linux for 1024x1024:

julia> @benchmark fill!(zeros_via_calloc(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 6266 samples with 1 evaluation.
 Range (min … max):  600.150 μs … 936.917 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     679.553 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   674.231 μs ±  43.092 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▄█         ▁                  ▁▂▁      ▁
  ███▅▃▁▂▁▂▂▄▆█▆▄▂▂▄▄▄▃▂▃▅▇█▆▅▅▆█████▇▆▆▇████▆▅▆▇▇▆▆▅▄▃▃▃▄▆▆▇▆▅ ▄
  600 μs           Histogram: frequency by time          746 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark fill!(Array{Float64}(undef, 1024, 1024), 5.0)
BenchmarkTools.Trial: 7129 samples with 1 evaluation.
 Range (min … max):  392.945 μs …   2.997 ms  ┊ GC (min … max):  0.00% … 75.62%
 Time  (median):     585.374 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   697.015 μs ± 265.796 μs  ┊ GC (mean ± σ):  11.60% ± 16.13%

         ▁█▇▃▂
  ▂▂▃▃▃▄▅█████▇▅▃▂▄█▃▂▃▄▃▂▂▂▁▁▁▂▁▂▄▄▂▂▂▂▁▂▂▃▄▃▂▂▂▂▁▂▂▂▂▂▃▄▃▂▃▃▃ ▃
  393 μs           Histogram: frequency by time         1.48 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark fill!(zeros(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 3600 samples with 1 evaluation.
 Range (min … max):  1.120 ms …   4.245 ms  ┊ GC (min … max): 0.00% … 68.09%
 Time  (median):     1.226 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.384 ms ± 340.695 μs  ┊ GC (mean ± σ):  9.07% ± 13.24%

    ▁█▃
  ▃▅████▄▄▄▅▇▅▆▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▄▅▄▃▂▁▁▁▁▁▁▁▂▄ ▃
  1.12 ms         Histogram: frequency by time        2.31 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Maybe this is more like your WSL2 results than your Windows ones, though, in that `zeros` is still slower than filling an `undef` array?
