Faster zeros with calloc

Ah, that issue - while technically true, it was ultimately closed precisely because it was decided to keep the behavior. What I meant is that I haven't seen the idea of getting rid of it entirely brought up recently.

Right, but that's only relevant for the `Array{T}(undef, ...)` case, since that's the only time a user is exposed to it. As far as I know, the GC currently doesn't distinguish between initialized and uninitialized memory at all, unlike Rust (at least during allocation).

To expand on what I mean: for those boxed user types `T`, a call to `zeros` must initialize every element with a call to `zero(T)`, otherwise the returned array may not be validly initialized at all. All entries must ultimately point to initialized zeros, since we cannot intercept the page-faulting mechanism to do our own custom zero initialization. In that case, going the route of `zeros` -> `calloc` -> fill with `zero(T)` makes the call to `calloc` completely redundant, since the memory has to be overwritten with concrete values either way. Incidentally, I believe the same holds for non-boxed (inline-stored) types whenever their `zero(T)` is not simply a zeroed block of memory, as it is for integers and floats.
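
For illustration, a minimal REPL sketch using `BigInt` (a boxed type whose zero is a heap-allocated object rather than an all-zero bit pattern):

julia> Array{BigInt}(undef, 2)   # boxed eltype: entries start as undefined references
2-element Vector{BigInt}:
 #undef
 #undef

julia> zeros(BigInt, 2)          # zeros has to call zero(BigInt) and store it in every slot
2-element Vector{BigInt}:
 0
 0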

This is a fundamental difference that you can't circumvent, short of introducing a new array type which guarantees that the first read from any index returns a result that is `== zero(T)`, if you want the initialization to happen lazily. The current eager implementation always does the correct thing, without needing to care whether a type's zero happens to be all zero bits or not.
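
To make that concrete, here is a rough sketch of what such a lazy type could look like (LazyZeroArray is a hypothetical name, not an existing package; the sketch is not thread-safe and ignores performance concerns, it just demonstrates the "first read is `zero(T)`" guarantee):

struct LazyZeroArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
    isset::BitArray{N}   # tracks which entries have been explicitly written
end

LazyZeroArray{T}(dims::Integer...) where T =
    LazyZeroArray(Array{T}(undef, dims...), falses(dims...))

Base.size(A::LazyZeroArray) = size(A.data)
Base.IndexStyle(::Type{<:LazyZeroArray}) = IndexLinear()

# the first read of an untouched index returns zero(T), as required
Base.getindex(A::LazyZeroArray{T}, i::Int) where T =
    A.isset[i] ? A.data[i] : zero(T)

function Base.setindex!(A::LazyZeroArray, v, i::Int)
    A.isset[i] = true
    A.data[i] = v
end

# usage: reads of untouched entries are zero, without any eager zeroing of the data
L = LazyZeroArray{Float64}(3, 3)
L[2, 2]        # == 0.0
L[1, 1] = 4.2
L[1, 1]        # == 4.2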

I'm not surprised by this - the zeroing is not happening in user space at all; it's happening in the kernel. The kernel routines in question are, as far as I know, not threaded, and they certainly don't let user space handle the zeroing of memory that the kernel guarantees. How could they be threaded in the first place? Each fault zeroes a single page, and threading doesn't make sense for the smallest unit of memory any running code can access at a given time. This is what I mean when I say that the page-faulting mechanism is opaque to julia code. We gain no benefit from multithreading user-space code here, except when we do the zeroing ourselves:

Function definitions
julia> using BenchmarkTools, LoopVectorization  # provides @benchmark and @turbo used below

julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
           A = f(Float64, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = A[i]
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 2 methods)

julia> function thread_zeros(::Type{T}, dims::Integer...) where T
           A = Array{T}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               A[i] = zero(T)
           end
           return A
       end
thread_zeros (generic function with 1 method)

julia> function zeros_with_calloc(::Type{T}, dims::Integer...) where T
           ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
           # own=true: julia takes ownership and frees the calloc'd buffer when the array is collected
           return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_with_calloc (generic function with 1 method)
`alloctest` Benchmarks
julia> @benchmark alloctest(zeros, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  845.501 ms … 915.111 ms  ┊ GC (min … max): 0.00% … 6.42%
 Time  (median):     857.200 ms               ┊ GC (median):    0.09%
 Time  (mean ± σ):   865.460 ms ±  26.023 ms  ┊ GC (mean ± σ):  1.51% ± 2.57%

  ██       ██           █                                     █
  ██▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  846 ms           Histogram: frequency by time          915 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 24.

julia> @benchmark alloctest(thread_zeros, 8192, 8192)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  759.736 ms … 824.374 ms  ┊ GC (min … max): 0.00% … 7.35%
 Time  (median):     760.735 ms               ┊ GC (median):    0.09%
 Time  (mean ± σ):   772.535 ms ±  23.736 ms  ┊ GC (mean ± σ):  1.50% ± 2.76%

  █
  █▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  760 ms           Histogram: frequency by time          824 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 52.

julia> @benchmark alloctest(zeros_with_calloc, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  941.039 ms … 992.792 ms  ┊ GC (min … max): 0.00% … 4.31%
 Time  (median):     949.372 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   955.610 ms ±  18.949 ms  ┊ GC (mean ± σ):  0.77% ± 1.75%

  █   █  █    █    █                                          █
  █▁▁▁█▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  941 ms           Histogram: frequency by time          993 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 24.
`zeros` vs. `thread_zeros`
julia> @benchmark zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 26 samples with 1 evaluation.
 Range (min … max):  170.848 ms … 315.340 ms  ┊ GC (min … max):  0.00% … 43.64%
 Time  (median):     176.870 ms               ┊ GC (median):     0.43%
 Time  (mean ± σ):   196.879 ms ±  36.527 ms  ┊ GC (mean ± σ):  12.27% ± 12.73%

  █              ▄
  █▅▁▁▃▁▁▁▁▁▁▁▁▁▃█▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃ ▁
  171 ms           Histogram: frequency by time          315 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark thread_zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 48 samples with 1 evaluation.
 Range (min … max):   80.754 ms … 245.534 ms  ┊ GC (min … max):  0.00% … 65.21%
 Time  (median):      84.810 ms               ┊ GC (median):     0.91%
 Time  (mean ± σ):   104.350 ms ±  31.993 ms  ┊ GC (mean ± σ):  21.82% ± 18.62%

  █            ▂▅
  █▁▅▁▁▁▁▅▁▁▁▁▁██▅▁▁▅▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  80.8 ms       Histogram: log(frequency) by time        246 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 28.

I'm running with `julia -t 4`, but the speedup is only about 2x - probably due to oversubscription, since I'm on an Intel(R) Core™ i7-6600U (2 physical cores). You could create a custom LazyZeroInitArray type that leverages the already-running julia threads to do the initialization, but that's a completely different solution from having zeros call calloc.
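
As a quick sanity check on the hardware side, one can compare the number of julia threads with the logical CPU count (the values shown are what one would expect on a 2-core/4-thread i7-6600U started with -t 4, not copied from a session):

julia> Threads.nthreads()   # julia threads started via -t
4

julia> Sys.CPU_THREADS      # logical CPUs; only 2 of them are physical cores on this chip
4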

Note that it doesn't matter whether julia "threads" are OS threads or not - the point is that when calloc'd memory is initialized during a page fault, no user-space (i.e. julia) code or threads are running. It's the kernel doing the initialization, not julia code.

That might depend on which libc you are using and the calloc implementation.

Addendum: even if calloc or the page-faulting mechanism were threaded (they're not, for reasons fundamental to how current operating systems work, but let's assume it for the sake of argument), you immediately run into another problem: julia threads compose easily with other julia threads, but they do not compose with non-julia threads at all. Making a hypothetical threaded calloc/initialization compose with julia would require something like a julia machine with a julia OS, akin to lisp machines.

I haven't brought up musl because, as far as I know, julia basically requires glibc and doesn't run on musl-based systems.

Also, as far as I can tell, the code you linked does a simple, naive memset in kernel space anyway, so I'm not sure how that is a counterexample to my point that it's not julia code and not julia threads running it.

There are currently official Julia musl builds on the Download Julia page, including for 1.6.3 and 1.7.0-rc1.


I guess that’s technically true, but musl is still only Tier 3 in terms of support:

Tier 3: Julia may or may not build. If it does, it is unlikely to pass tests. Binaries may be available in some cases. When they are, they should be considered experimental. Ongoing support is dependent on community efforts.

I started a package that includes a calloc-based AbstractArrayAllocator:

julia> using ArrayAllocators

julia> @time zeros(UInt8, 2048, 2048);
  0.000514 seconds (2 allocations: 4.000 MiB)

julia> @time Array{UInt8}(calloc, 2048, 2048);
  0.000015 seconds (2 allocations: 4.000 MiB)