Faster zeros with calloc

Ah, that issue - while technically true, in the end that issue was closed precisely because it was decided to keep it. I guess what I meant was that I haven’t seen this idea of getting rid of it entirely been brought up recently.

Right, but that’s only relevant for the Array{T}(undef, ...) case, since that’s when a user will be exposed to it. I think the GC doesn’t distinguish between initialized and not-initialized memory at all right now, unlike rust (well, at least during allocation).

To expand on what I mean, for those boxed user types T, a call to zeros must initialize them properly with a call to zero, else you may not receive a valid initialization at all when returning. All entries must ultimately point to initialized zeros after all, since we cannot intercept the page faulting mechanism to do our own custom zero initialization. In that case, going the route of zeros -> calloc -> fill with zero(T) makes the call to calloc completely redundant, since it has to be filled with concrete values either way. Incidentally, I believe the same is true for non-boxed types or inline stored types, when their representation of zero(T) is not just a zeroed block of memory, as is the case for integers and floats.

This is a fundamental difference that you can’t circumvent without introducing a new array type which guarantees that the first read from an index has a result that’s == zero(T), if you want it to be done lazily. The current eager implementation does the correct thing, always, with no special care for whether a type is ok with having all zero bits as its zero or not.

I’m not surprised by this - the zeroing is not happening in user space after all. It’s happening in the kernel. The kernel routines in question are, as far as I know, not threaded and they certainly don’t let user space handle zeroing of memory that the kernel guarantees. How could they be threaded in the first place? They’re writing to a single page, threading doesn’t really make sense since that’s the smallest amount of memory any running code can access at a given time. This is what I’m talking about when I say that the page faulting mechanism is opaque to julia code. We don’t gain any benefit from multithreading user space code here, except when we do the zeroing ourselves:

Function definitions
julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
                  A = f(Float64, dims...)
                  Threads.@threads for i in eachindex(A)
                      Ai = A[i]
                      @turbo for j in 1:16
                          Ai += exp(i-j)
                      A[i] = Ai
alloctest (generic function with 2 methods)

julia> function thread_zeros(::Type{T}, dims::Integer...) where T
           A = Array{T}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               A[i] = zero(T)
           return A
thread_zeros (generic function with 1 method)

julia> function zeros_with_calloc(::Type{T}, dims::Integer...) where T
          ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
          return unsafe_wrap(Array{T}, ptr, dims; own=true)
zeros_with_calloc (generic function with 1 method)
`alloctest` Benchmarks
julia> @benchmark alloctest(zeros, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  845.501 ms … 915.111 ms  ┊ GC (min … max): 0.00% … 6.42%
 Time  (median):     857.200 ms               ┊ GC (median):    0.09%
 Time  (mean ± σ):   865.460 ms ±  26.023 ms  ┊ GC (mean ± σ):  1.51% ± 2.57%

  ██       ██           █                                     █
  ██▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  846 ms           Histogram: frequency by time          915 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 24.

julia> @benchmark alloctest(th, 8192, 8192)
thisind      thread_zeros  throw
julia> @benchmark alloctest(thread_zeros, 8192, 8192)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  759.736 ms … 824.374 ms  ┊ GC (min … max): 0.00% … 7.35%
 Time  (median):     760.735 ms               ┊ GC (median):    0.09%
 Time  (mean ± σ):   772.535 ms ±  23.736 ms  ┊ GC (mean ± σ):  1.50% ± 2.76%

  █▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  760 ms           Histogram: frequency by time          824 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 52.

julia> @benchmark alloctest(zeros_with_calloc, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  941.039 ms … 992.792 ms  ┊ GC (min … max): 0.00% … 4.31%
 Time  (median):     949.372 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   955.610 ms ±  18.949 ms  ┊ GC (mean ± σ):  0.77% ± 1.75%

  █   █  █    █    █                                          █
  █▁▁▁█▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  941 ms           Histogram: frequency by time          993 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 24.
`zeros` vs. `thread_zeros`
julia> @benchmark zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 26 samples with 1 evaluation.
 Range (min … max):  170.848 ms … 315.340 ms  ┊ GC (min … max):  0.00% … 43.64
 Time  (median):     176.870 ms               ┊ GC (median):     0.43%
 Time  (mean ± σ):   196.879 ms ±  36.527 ms  ┊ GC (mean ± σ):  12.27% ± 12.73

  █              ▄
  █▅▁▁▃▁▁▁▁▁▁▁▁▁▃█▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃ ▁
  171 ms           Histogram: frequency by time          315 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark thread_zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 48 samples with 1 evaluation.
 Range (min … max):   80.754 ms … 245.534 ms  ┊ GC (min … max):  0.00% … 65.21
 Time  (median):      84.810 ms               ┊ GC (median):     0.91%
 Time  (mean ± σ):   104.350 ms ±  31.993 ms  ┊ GC (mean ± σ):  21.82% ± 18.62

  █            ▂▅
  █▁▅▁▁▁▁▅▁▁▁▁▁██▅▁▁▅▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  80.8 ms       Histogram: log(frequency) by time        246 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 28.

I’m running with julia -t 4, but the speedup is only 2x - this is probably due to oversubscription, I’m running a Intel(R) Core™ i7-6600U CPU. You could create a custom LazyZeroInitArray type that can take advantage of already existing julia threads to leverage them to initialize, but that’s a completely different solution to having zeros call calloc.

Note that it doesn’t matter whether julia “threads” are OS threads or not - the difference lies in the fact that at the point of initialization during the page fault in calloc memory, no user space (i.e. julia) code or threads are active. It’s the kernel that’s doing the initialization, not julia code.

That might depend on which libc you are using and the calloc implementation.

Addendum: Even if calloc or the page faulting mechanism were threaded (they’re not for fundamental reasons central to how current OS work, but let’s assume it anyway for sake of argument), you immediately run into another problem: while julia threads compose easily with other julia threads, they do not compose with non-julia threads, at all. So to make a hypothetical threaded calloc/initialization compose with julia would require something like a julia machine with a julia OS, akin to lisp machines

I haven’t brought up musl because as far as I know, julia basically requires libc and doesn’t run on musl based systems.

Also, as far as I can tell, the code you linked does a simple naive memset in kernel space anyway, so I’m not sure how that is a counterexample to me saying that it’s not julia code and not julia threads that are running this code?

There are currently official Julia musl builds at Download Julia including for 1.6.3 and 1.7.0-rc1.

1 Like

I guess that’s technically true, but musl is still only Tier 3 in terms of support:

Tier 3: Julia may or may not build. If it does, it is unlikely to pass tests. Binaries may be available in some cases. When they are, they should be considered experimental. Ongoing support is dependent on community efforts.

I started a package that includes a calloc based AbstractArrayAllocator

julia> using ArrayAllocators

julia> @time zeros(UInt8, 2048, 2048);
  0.000514 seconds (2 allocations: 4.000 MiB)

julia> @time Array{UInt8}(calloc, 2048, 2048);
  0.000015 seconds (2 allocations: 4.000 MiB)