Faster zeros with calloc

Seems like calloc really should be implemented in a way that benefits from multithreading:

julia> using LoopVectorization, BenchmarkTools

julia> function zeros_via_calloc(::Type{T}, dims::Integer...) where T
          ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
          ptr == C_NULL && throw(OutOfMemoryError()) # calloc returns NULL on failure
          return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_via_calloc (generic function with 1 method)

julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
           A = f(Float64, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = A[i]
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 1 method)

julia> function alloctest(dims::Vararg{Integer,N}) where {N}
           A = Array{Float64}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = 0.0
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 2 methods)

julia> @benchmark zeros(8192, 8192)
BenchmarkTools.Trial: 94 samples with 1 evaluation.
 Range (min … max):  49.796 ms … 146.351 ms  ┊ GC (min … max): 0.00% … 65.90%
 Time  (median):     51.714 ms               ┊ GC (median):    3.40%
 Time  (mean ± σ):   53.449 ms ±  12.958 ms  ┊ GC (mean ± σ):  6.49% ±  9.56%

  █ █
  █▁█▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  49.8 ms       Histogram: log(frequency) by time       133 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark alloctest(8192, 8192) # undef init
BenchmarkTools.Trial: 134 samples with 1 evaluation.
 Range (min … max):  33.063 ms … 144.814 ms  ┊ GC (min … max): 0.00% … 77.03%
 Time  (median):     36.147 ms               ┊ GC (median):    2.81%
 Time  (mean ± σ):   37.299 ms ±  12.321 ms  ┊ GC (mean ± σ):  9.25% ± 10.36%

  █ ▇▄
  ████▇▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  33.1 ms       Histogram: log(frequency) by time       122 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 184.

julia> @benchmark alloctest(zeros, 8192, 8192) # zeros
BenchmarkTools.Trial: 55 samples with 1 evaluation.
 Range (min … max):  85.009 ms … 197.439 ms  ┊ GC (min … max): 0.00% … 56.90%
 Time  (median):     87.641 ms               ┊ GC (median):    0.88%
 Time  (mean ± σ):   91.011 ms ±  19.084 ms  ┊ GC (mean ± σ):  6.29% ± 10.42%

  █ ▅
  █▃█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  85 ms           Histogram: frequency by time          174 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 183.

julia> @benchmark alloctest(zeros_via_calloc, 8192, 8192) # zeros_via_calloc
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):   81.286 ms … 251.090 ms  ┊ GC (min … max):  0.00% … 67.55%
 Time  (median):      81.993 ms               ┊ GC (median):     0.46%
 Time  (mean ± σ):   100.907 ms ±  31.495 ms  ┊ GC (mean ± σ):  19.33% ± 17.28%

  █          ▇
  █▁▁▁▁▁▁▁▁▁▁█▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  81.3 ms       Histogram: log(frequency) by time        251 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 187.

That is, it seems like it should be possible for the thread that performs the first write to a page to also do the zeroing. If that were the case, calloc would have two benefits:

  1. single pass over the array instead of two passes
  2. implicit (local) multithreading of the zeroing when combined with multithreaded code.

However, that does not appear to be the case.

This is a reasonably common pattern: an array is initialized and then passed over multiple times to update its values. Here I deliberately update it only once, to make the potential benefit of saving a pass easier to detect.
