More often than not, memory is the bottleneck.
Why pass over memory twice, first to zero-initialize it and then again to interact with it (i.e., zeros), when you could pass over it just once, zeroing as you interact with it (calloc)?
Fusion is great. Theoretically, calloc seems like it has nothing but upsides.
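(For reference, a minimal sketch of what a zeros_via_calloc could look like; the exact definition used here may differ. The idea is to ask libc for zeroed memory and hand ownership of the buffer to a Julia Array.)

function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))    # zeroed allocation from libc
    ptr == C_NULL && throw(OutOfMemoryError())
    return unsafe_wrap(Array{T}, ptr, dims; own = true) # Julia frees the buffer when the Array is GC'd
end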
On Linux, comparing an undef allocation with zeros_via_calloc:
julia> @benchmark zeros_via_calloc(Float64, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 22.352 μs … 1.114 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 335.558 μs ┊ GC (median): 0.00%
Time (mean ± σ): 337.057 μs ± 12.094 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂█▅ ▂
22.4 μs Histogram: frequency by time 345 μs <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> @benchmark Array{Float64}(undef, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 20.666 μs … 48.211 μs ┊ GC (min … max): 95.95% … 95.08%
Time (median): 41.557 μs ┊ GC (median): 96.05%
Time (mean ± σ): 34.945 μs ± 9.908 μs ┊ GC (mean ± σ): 95.93% ± 0.19%
█ ▅▁▅
█▆▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▆ ▃
20.7 μs Histogram: frequency by time 42.8 μs <
Memory estimate: 8.00 MiB, allocs estimate: 2.
Similar minimum time, but very different median and mean (note the ~96% GC time on the undef benchmark).
For me, fill!(zeros_via_calloc(...), ...) is about 2x faster than fill!(zeros(...), ...) on Linux for 1024x1024:
julia> @benchmark fill!(zeros_via_calloc(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 6266 samples with 1 evaluation.
Range (min … max): 600.150 μs … 936.917 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 679.553 μs ┊ GC (median): 0.00%
Time (mean ± σ): 674.231 μs ± 43.092 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▄█ ▁ ▁▂▁ ▁
███▅▃▁▂▁▂▂▄▆█▆▄▂▂▄▄▄▃▂▃▅▇█▆▅▅▆█████▇▆▆▇████▆▅▆▇▇▆▆▅▄▃▃▃▄▆▆▇▆▅ ▄
600 μs Histogram: frequency by time 746 μs <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> @benchmark fill!(Array{Float64}(undef, 1024, 1024), 5.0)
BenchmarkTools.Trial: 7129 samples with 1 evaluation.
Range (min … max): 392.945 μs … 2.997 ms ┊ GC (min … max): 0.00% … 75.62%
Time (median): 585.374 μs ┊ GC (median): 0.00%
Time (mean ± σ): 697.015 μs ± 265.796 μs ┊ GC (mean ± σ): 11.60% ± 16.13%
▁█▇▃▂
▂▂▃▃▃▄▅█████▇▅▃▂▄█▃▂▃▄▃▂▂▂▁▁▁▂▁▂▄▄▂▂▂▂▁▂▂▃▄▃▂▂▂▂▁▂▂▂▂▂▃▄▃▂▃▃▃ ▃
393 μs Histogram: frequency by time 1.48 ms <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> @benchmark fill!(zeros(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 3600 samples with 1 evaluation.
Range (min … max): 1.120 ms … 4.245 ms ┊ GC (min … max): 0.00% … 68.09%
Time (median): 1.226 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.384 ms ± 340.695 μs ┊ GC (mean ± σ): 9.07% ± 13.24%
▁█▃
▃▅████▄▄▄▅▇▅▆▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▄▅▄▃▂▁▁▁▁▁▁▁▂▄ ▃
1.12 ms Histogram: frequency by time 2.31 ms <
Memory estimate: 8.00 MiB, allocs estimate: 2.
Maybe this is more like your WSL2 results than your Windows results, though, in that filling the calloc array is still slower than filling an undef array?
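(If it helps narrow that down: one way to separate the pure write cost from allocation and first-touch overhead is to benchmark fill! on a pre-allocated, interpolated array, e.g.:)

julia> using BenchmarkTools

julia> A = Array{Float64}(undef, 1024, 1024);

julia> @benchmark fill!($A, 5.0)  # writes only; no allocation, and no page faults after the first run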