It seems like calloc really should be implemented in a way that benefits from multithreading:
julia> using LoopVectorization, BenchmarkTools
julia> function zeros_via_calloc(::Type{T}, dims::Integer...) where T
           ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
           return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_via_calloc (generic function with 1 method)
julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
           A = f(Float64, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = A[i]
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 1 method)
julia> function alloctest(dims::Vararg{Integer,N}) where {N}
           A = Array{Float64}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = 0.0
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 2 methods)
julia> @benchmark zeros(8192, 8192)
BenchmarkTools.Trial: 94 samples with 1 evaluation.
Range (min … max): 49.796 ms … 146.351 ms ┊ GC (min … max): 0.00% … 65.90%
Time (median): 51.714 ms ┊ GC (median): 3.40%
Time (mean ± σ): 53.449 ms ± 12.958 ms ┊ GC (mean ± σ): 6.49% ± 9.56%
█ █
█▁█▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
49.8 ms Histogram: log(frequency) by time 133 ms <
Memory estimate: 512.00 MiB, allocs estimate: 2.
julia> @benchmark alloctest(8192, 8192) # undef init
BenchmarkTools.Trial: 134 samples with 1 evaluation.
Range (min … max): 33.063 ms … 144.814 ms ┊ GC (min … max): 0.00% … 77.03%
Time (median): 36.147 ms ┊ GC (median): 2.81%
Time (mean ± σ): 37.299 ms ± 12.321 ms ┊ GC (mean ± σ): 9.25% ± 10.36%
█ ▇▄
████▇▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
33.1 ms Histogram: log(frequency) by time 122 ms <
Memory estimate: 512.02 MiB, allocs estimate: 184.
julia> @benchmark alloctest(zeros, 8192, 8192) # zeros
BenchmarkTools.Trial: 55 samples with 1 evaluation.
Range (min … max): 85.009 ms … 197.439 ms ┊ GC (min … max): 0.00% … 56.90%
Time (median): 87.641 ms ┊ GC (median): 0.88%
Time (mean ± σ): 91.011 ms ± 19.084 ms ┊ GC (mean ± σ): 6.29% ± 10.42%
█ ▅
█▃█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
85 ms Histogram: frequency by time 174 ms <
Memory estimate: 512.02 MiB, allocs estimate: 183.
julia> @benchmark alloctest(zeros_via_calloc, 8192, 8192) # zeros_via_calloc
BenchmarkTools.Trial: 50 samples with 1 evaluation.
Range (min … max): 81.286 ms … 251.090 ms ┊ GC (min … max): 0.00% … 67.55%
Time (median): 81.993 ms ┊ GC (median): 0.46%
Time (mean ± σ): 100.907 ms ± 31.495 ms ┊ GC (mean ± σ): 19.33% ± 17.28%
█ ▇
█▁▁▁▁▁▁▁▁▁▁█▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
81.3 ms Histogram: log(frequency) by time 251 ms <
Memory estimate: 512.02 MiB, allocs estimate: 187.
That is, it seems like it should be possible for the specific thread doing the first write to a page to also be the one that zeroes it. If that were the case, calloc would have two benefits:
- single pass over the array instead of two passes
- implicit (local) multithreading of the zeroing when combined with multithreaded code.
However, that does not appear to be the case.
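For what it's worth, here is a minimal sketch of how one might try to separate the allocation cost from the first-touch (zeroing) cost. It reuses the zeros_via_calloc definition above and assumes a 4 KiB page size and that large calloc allocations are served by demand-zero pages from the OS (typical for glibc, but not guaranteed); the helper first_touch! is mine, not part of any library:

function first_touch!(A::AbstractArray{T}) where {T}
    stride = 4096 ÷ sizeof(T)               # elements per (assumed) 4 KiB page
    Threads.@threads for i in 1:stride:length(A)
        @inbounds A[i] = zero(T)             # fault in one element per page
    end
    return A
end

# Allocation alone: if calloc only maps demand-zero pages, this should be cheap.
@benchmark zeros_via_calloc(Float64, 8192, 8192)

# Allocation plus a threaded first touch: the zeroing cost should land here,
# spread across the threads doing the touching.
@benchmark first_touch!(zeros_via_calloc(Float64, 8192, 8192))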
This is a reasonably common pattern: an array is initialized and then passed over multiple times to update its values. Here I deliberately update only once, of course, to make the potential benefits easier to detect.
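For context, a sketch of the multi-pass pattern I mean (the update body and the number of sweeps are just illustrative placeholders):

function multipass!(A, nsweeps)
    for _ in 1:nsweeps
        Threads.@threads for i in eachindex(A)
            @inbounds A[i] += exp(-i)        # stand-in for the real per-element update
        end
    end
    return A
end

A = zeros(8192, 8192)                        # or zeros_via_calloc(Float64, 8192, 8192)
multipass!(A, 4)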