Ah, that issue - while technically true, in the end it was closed precisely because it was decided to keep the behavior. I guess what I meant is that I haven’t seen the idea of getting rid of it entirely brought up recently.
Right, but that’s only relevant for the `Array{T}(undef, ...)` case, since that’s when a user will be exposed to it. I think the GC doesn’t distinguish between initialized and uninitialized memory at all right now, unlike Rust (well, at least during allocation).
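As a small illustration (any boxed element type will do, `String` is just an example): with `Array{T}(undef, ...)` the slots really are unassigned, and reading one throws.

```julia
julia> A = Array{String}(undef, 2)
2-element Vector{String}:
 #undef
 #undef

julia> A[1]
ERROR: UndefRefError: access to undefined reference
```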
To expand on what I mean: for those boxed user types `T`, a call to `zeros` must initialize the elements properly with a call to `zero`, else you may not get a valid initialization at all when the array is returned. All entries must ultimately point to initialized `zero`s, since we cannot intercept the page faulting mechanism to do our own custom `zero` initialization. In that case, going the route of `zeros` -> `calloc` -> fill with `zero(T)` makes the call to `calloc` completely redundant, since the memory has to be filled with concrete values either way. Incidentally, I believe the same is true for non-boxed or inline-stored types whose representation of `zero(T)` is not just a zeroed block of memory (unlike integers and floats, where it is).
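A concrete example of such an inline-stored type is `Rational{Int}`: it is isbits, but its zero is `0//1`, whose in-memory representation (numerator 0, denominator 1) is not all-zero bits, so `calloc`ed memory cannot stand in for it.

```julia
julia> zero(Rational{Int})    # numerator 0, denominator 1 -- not an all-zero bit pattern
0//1

julia> zeros(Rational{Int}, 2, 2)    # every slot has to be explicitly filled with 0//1
2×2 Matrix{Rational{Int64}}:
 0//1  0//1
 0//1  0//1
```

If `zeros` just handed back `calloc`ed memory reinterpreted as `Rational{Int}`, every entry would have the bit pattern of `0//0`, which is not `zero(Rational{Int})` and not even a value the `//` constructor would let you create.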
This is a fundamental difference that you can’t circumvent without introducing a new array type which guarantees that the first read from an index gives a result that is `== zero(T)`, if you want the initialization to be done lazily. The current eager implementation always does the correct thing, with no special care needed for whether a type is ok with having all-zero bits as its `zero` or not.
I’m not surprised by this - the zeroing is not happening in user space after all, it’s happening in the kernel. The kernel routines in question are, as far as I know, not threaded, and they certainly don’t let user space handle zeroing of memory that the kernel guarantees to be zero. How could they be threaded in the first place? They zero a single page per fault, and threading doesn’t really make sense there, since a page is the smallest unit the faulting machinery hands out at a time. This is what I’m talking about when I say that the page faulting mechanism is opaque to julia code. We don’t gain any benefit from multithreading user space code here, except when we do the zeroing ourselves:
Function definitions
julia> using BenchmarkTools, LoopVectorization  # provides @benchmark and @turbo

julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
           A = f(Float64, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = A[i]
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 2 methods)
julia> function thread_zeros(::Type{T}, dims::Integer...) where T
           A = Array{T}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               A[i] = zero(T)
           end
           return A
       end
thread_zeros (generic function with 1 method)
julia> function zeros_with_calloc(::Type{T}, dims::Integer...) where T
           # own=true: julia takes ownership and will call free on this pointer when the array is collected
           ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
           return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_with_calloc (generic function with 1 method)
`alloctest` Benchmarks
julia> @benchmark alloctest(zeros, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
Range (min … max): 845.501 ms … 915.111 ms ┊ GC (min … max): 0.00% … 6.42%
Time (median): 857.200 ms ┊ GC (median): 0.09%
Time (mean ± σ): 865.460 ms ± 26.023 ms ┊ GC (mean ± σ): 1.51% ± 2.57%
██ ██ █ █
██▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
846 ms Histogram: frequency by time 915 ms <
Memory estimate: 512.00 MiB, allocs estimate: 24.
julia> @benchmark alloctest(thread_zeros, 8192, 8192)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
Range (min … max): 759.736 ms … 824.374 ms ┊ GC (min … max): 0.00% … 7.35%
Time (median): 760.735 ms ┊ GC (median): 0.09%
Time (mean ± σ): 772.535 ms ± 23.736 ms ┊ GC (mean ± σ): 1.50% ± 2.76%
█
█▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
760 ms Histogram: frequency by time 824 ms <
Memory estimate: 512.00 MiB, allocs estimate: 52.
julia> @benchmark alloctest(zeros_with_calloc, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
Range (min … max): 941.039 ms … 992.792 ms ┊ GC (min … max): 0.00% … 4.31%
Time (median): 949.372 ms ┊ GC (median): 0.04%
Time (mean ± σ): 955.610 ms ± 18.949 ms ┊ GC (mean ± σ): 0.77% ± 1.75%
█ █ █ █ █ █
█▁▁▁█▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
941 ms Histogram: frequency by time 993 ms <
Memory estimate: 512.00 MiB, allocs estimate: 24.
`zeros` vs. `thread_zeros`
julia> @benchmark zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 26 samples with 1 evaluation.
Range (min … max): 170.848 ms … 315.340 ms ┊ GC (min … max): 0.00% … 43.64%
Time (median): 176.870 ms ┊ GC (median): 0.43%
Time (mean ± σ): 196.879 ms ± 36.527 ms ┊ GC (mean ± σ): 12.27% ± 12.73%
█ ▄
█▅▁▁▃▁▁▁▁▁▁▁▁▁▃█▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃ ▁
171 ms Histogram: frequency by time 315 ms <
Memory estimate: 512.00 MiB, allocs estimate: 2.
julia> @benchmark thread_zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 48 samples with 1 evaluation.
Range (min … max): 80.754 ms … 245.534 ms ┊ GC (min … max): 0.00% … 65.21%
Time (median): 84.810 ms ┊ GC (median): 0.91%
Time (mean ± σ): 104.350 ms ± 31.993 ms ┊ GC (mean ± σ): 21.82% ± 18.62%
█ ▂▅
█▁▅▁▁▁▁▅▁▁▁▁▁██▅▁▁▅▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
80.8 ms Histogram: log(frequency) by time 246 ms <
Memory estimate: 512.00 MiB, allocs estimate: 28.
I’m running with `julia -t 4`, but the speedup is only about 2x - this is probably due to oversubscription, since I’m running an Intel(R) Core™ i7-6600U (2 physical cores, 4 hardware threads). You could create a custom `LazyZeroInitArray` type that takes advantage of the already existing julia threads to do the initialization, but that’s a completely different solution from having `zeros` call `calloc`.
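Purely as an illustration of that idea - `LazyZeroInitArray` is not an existing type, and everything below is a hypothetical sketch, not thread-safe or optimized - such a wrapper could track which slots have been written and hand out `zero(T)` on the first read:

```julia
# Hypothetical sketch: undef backing storage plus a bitmask recording which slots
# have been assigned. Guarantees that the first read of any index is == zero(T)
# without eagerly zeroing the (much larger) data array.
struct LazyZeroInitArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}        # uninitialized backing storage
    written::BitArray{N}    # true once a slot has been written to
end

function LazyZeroInitArray{T}(dims::Integer...) where T
    # The bitmask itself still gets zero-filled, but it's 64x smaller than a Float64 array
    LazyZeroInitArray{T,length(dims)}(Array{T}(undef, dims...), falses(dims...))
end

Base.size(A::LazyZeroInitArray) = size(A.data)
Base.IndexStyle(::Type{<:LazyZeroInitArray}) = IndexLinear()

Base.@propagate_inbounds function Base.getindex(A::LazyZeroInitArray{T}, i::Int) where T
    # First read of an untouched slot yields zero(T), never the undef garbage underneath
    A.written[i] ? A.data[i] : zero(T)
end

Base.@propagate_inbounds function Base.setindex!(A::LazyZeroInitArray, v, i::Int)
    A.written[i] = true
    A.data[i] = v
end
```

With that, `LazyZeroInitArray{Float64}(8192, 8192)` would behave like `zeros(Float64, 8192, 8192)` on reads while skipping the up-front fill, at the cost of a branch and a bitmask lookup on every access - which is exactly why it’s a different trade-off from `zeros` calling `calloc`.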
Note that it doesn’t matter whether julia “threads” are OS threads or not - the difference lies in the fact that at the point of initialization, during the page fault on `calloc`ed memory, no user space (i.e. julia) code or threads are active. It’s the kernel that’s doing the initialization, not julia code.
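One way to see where the cost actually lands, from the julia side (just a way to poke at it - the exact timings will obviously vary): time the allocation and the first touch separately. With the `calloc` route the allocation itself returns almost immediately, and the zeroing only happens when the pages are first faulted in - single-threaded, inside the kernel.

```julia
julia> @time A = zeros_with_calloc(Float64, 8192, 8192);  # returns quickly: pages are not mapped yet

julia> @time fill!(A, 1.0);  # first touch: the page faults (and the kernel's zero-filling) happen here

julia> @time fill!(A, 2.0);  # second pass is much cheaper, the pages are already mapped
```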