Faster zeros with calloc

Seems like calloc really should be implemented in a way that benefits from multithreading:

julia> using LoopVectorization, BenchmarkTools

julia> function zeros_via_calloc(::Type{T}, dims::Integer...) where T
          ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
          ptr == C_NULL && throw(OutOfMemoryError()) # calloc returns NULL on failure
          return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_via_calloc (generic function with 1 method)

julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
           A = f(Float64, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = A[i]
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 1 method)

julia> function alloctest(dims::Vararg{Integer,N}) where {N}
           A = Array{Float64}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = 0.0
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 2 methods)

julia> @benchmark zeros(8192, 8192)
BenchmarkTools.Trial: 94 samples with 1 evaluation.
 Range (min … max):  49.796 ms … 146.351 ms  ┊ GC (min … max): 0.00% … 65.90%
 Time  (median):     51.714 ms               ┊ GC (median):    3.40%
 Time  (mean ± σ):   53.449 ms ±  12.958 ms  ┊ GC (mean ± σ):  6.49% ±  9.56%

  █ █
  █▁█▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  49.8 ms       Histogram: log(frequency) by time       133 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark alloctest(8192, 8192) # undef init
BenchmarkTools.Trial: 134 samples with 1 evaluation.
 Range (min … max):  33.063 ms … 144.814 ms  ┊ GC (min … max): 0.00% … 77.03%
 Time  (median):     36.147 ms               ┊ GC (median):    2.81%
 Time  (mean ± σ):   37.299 ms ±  12.321 ms  ┊ GC (mean ± σ):  9.25% ± 10.36%

  █ ▇▄
  ████▇▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  33.1 ms       Histogram: log(frequency) by time       122 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 184.

julia> @benchmark alloctest(zeros, 8192, 8192) # zeros
BenchmarkTools.Trial: 55 samples with 1 evaluation.
 Range (min … max):  85.009 ms … 197.439 ms  ┊ GC (min … max): 0.00% … 56.90%
 Time  (median):     87.641 ms               ┊ GC (median):    0.88%
 Time  (mean ± σ):   91.011 ms ±  19.084 ms  ┊ GC (mean ± σ):  6.29% ± 10.42%

  █ ▅
  █▃█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  85 ms           Histogram: frequency by time          174 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 183.

julia> @benchmark alloctest(zeros_via_calloc, 8192, 8192) # zeros_via_calloc
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):   81.286 ms … 251.090 ms  ┊ GC (min … max):  0.00% … 67.55%
 Time  (median):      81.993 ms               ┊ GC (median):     0.46%
 Time  (mean ± σ):   100.907 ms ±  31.495 ms  ┊ GC (mean ± σ):  19.33% ± 17.28%

  █          ▇
  █▁▁▁▁▁▁▁▁▁▁█▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  81.3 ms       Histogram: log(frequency) by time        251 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 187.

That is, it seems like it should be possible for the thread that performs the first write to a page to also do the zeroing. If that were the case, calloc would have two benefits:

  1. single pass over the array instead of two passes
  2. implicit (local) multithreading of the zeroing when combined with multithreaded code.

However, that does not appear to be the case.

This is a reasonably common pattern: an array is initialized and then passed over multiple times to update its values. Here I deliberately update it only once, to make the potential benefit of saving a pass easier to detect.
