Abstract
TL;DR: zeros_via_calloc is a potentially faster version of zeros whose creation time is comparable to Array{T}(undef, ...) and numpy.zeros.
function zeros_via_calloc(::Type{T}, dims::Integer...) where T
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end
# Windows benchmark
julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  12.400 μs (2 allocations: 8.00 MiB)
julia> @btime zeros(Float64, 1024, 1024);
1.652 ms (2 allocations: 8.00 MiB)
# Windows Subsystem for Linux 2 Benchmark
julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  24.300 μs (2 allocations: 8.00 MiB)
julia> @btime zeros(Float64, 1024, 1024);
474.500 ΞΌs (2 allocations: 8.00 MiB)
Introduction
In Julia, there are several ways to create large arrays. The fastest form is Array{T}(undef, dims...) where T is the element type of the array and dims is a series of integers describing the size of the array. There are also a few convenience constructors such as zeros and ones, which use the fastest form followed by fill!(array, zero(T)) or fill!(array, one(T)), respectively. Below we see that zeros and ones are much slower than array creation using Array{T}(undef, ...).
julia> @benchmark Array{Float64}(undef, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.800 μs …   1.085 ms  ┊ GC (min … max): 0.00% … 96.56%
 Time  (median):     13.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   104.670 μs ± 200.995 μs ┊ GC (mean ± σ):  85.28% ± 36.22%
julia> @benchmark zeros(1024, 1024)
BenchmarkTools.Trial: 2187 samples with 1 evaluation.
 Range (min … max):  1.661 ms …  6.932 ms  ┊ GC (min … max): 0.00% … 60.38%
 Time  (median):     1.838 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.281 ms ±  1.010 ms  ┊ GC (mean ± σ):  19.35% ± 21.94%
julia> @benchmark ones(1024, 1024)
BenchmarkTools.Trial: 2186 samples with 1 evaluation.
 Range (min … max):  1.657 ms …  6.561 ms  ┊ GC (min … max): 0.00% … 58.89%
 Time  (median):     1.835 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.283 ms ±  1.021 ms  ┊ GC (mean ± σ):  19.40% ± 21.87%
Above we see that on my Windows-based machine, uninitialized array creation takes a median time of 13.9 microseconds, while ones and zeros take a median of 1.8 milliseconds, about 100 times as long as uninitialized array creation.
However, if we compare to NumPy, we see that numpy.zeros takes a mean of 92 microseconds and is actually faster than numpy.ones as well as Julia's zeros:
In [60]: %timeit numpy.zeros((1024,1024), dtype = np.float64)
92 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [61]: %timeit numpy.ones((1024,1024), dtype = np.float64)
2.5 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The reason that numpy.zeros is faster is that, since 2013, numpy.zeros has used the C routine calloc, which initializes the underlying memory to zero. Using calloc for Julia's zeros has been pending since 2011:
https://github.com/JuliaLang/julia/issues/130
calloc can be faster than creating an uninitialized array and filling it with zeros. The operating system may be able to manage its memory such that it knows where chunks of zeroed memory already exist. If zeroed memory does not exist, the operating system may choose to defer zero initialization until you actually start to use the memory. See this StackOverflow answer for a longer explanation.
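To get a rough sense of this, here is an illustrative sketch of my own (not one of the benchmarks in this post) comparing calloc against malloc plus an explicit memset; the exact numbers will depend on the libc allocator and operating system:
using BenchmarkTools
n = 1024 * 1024   # number of 8-byte elements, i.e. 8 MiB in total
# calloc: the allocator/OS may hand back pages it already knows are zero,
# or defer the zeroing until the memory is first touched.
@btime Libc.free(Libc.calloc($n, 8))
# malloc + memset: every byte is written up front.
@btime begin
    p = Libc.malloc($n * 8)
    ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t), p, 0, $n * 8)
    Libc.free(p)
end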
In this post, we explore whether we can implement zeros based on calloc from Julia right now, and evaluate whether it actually has any performance benefits.
Developing a calloc-based zeros in Julia
Julia is kind enough to expose basic C routines via the module Libc. Libc.calloc(num::Integer, size::Integer) takes two arguments: num is the number of elements, while size is the size of each element in bytes. The routine returns a Ptr{Nothing}, the equivalent of a void pointer in C. This allocates zeroed memory that is not managed by the garbage collector.
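As a quick illustration (my own REPL sketch, not from the benchmarks in this post), we can call Libc.calloc directly; since nothing owns this memory yet, we must free it ourselves:
p = Libc.calloc(4, sizeof(Float64))   # space for 4 Float64 values, zero-initialized
typeof(p)                             # Ptr{Nothing}, i.e. a C void pointer
unsafe_load(Ptr{Float64}(p), 1)       # returns 0.0, the memory is already zeroed
Libc.free(p)                          # not GC-managed, so free it manually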
To incorporate this memory into a Julia array, we will use unsafe_wrap(Array, pointer::Ptr{T}, dims; own). This wraps an Array with the specified dimensions around the pointer. The keyword argument own specifies whether Julia should take ownership of the memory and be responsible for freeing it when we are done with it.
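For example, here is a minimal sketch of just the wrapping step, using malloc'd (and therefore uninitialized) memory purely to show how own=true hands responsibility over to Julia:
p = Ptr{Float64}(Libc.malloc(16 * sizeof(Float64)))
A = unsafe_wrap(Array{Float64}, p, (4, 4); own=true)   # a 4x4 Matrix{Float64}
# With own=true, Julia's garbage collector will free this memory; no explicit Libc.free is needed.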
Implementing zeros_via_calloc is a matter of calling Libc.calloc with the appropriate arguments. We can get the number of elements by taking the product of the dimensions. The size of each element is just the size of the type T. For unsafe_wrap, the resulting Ptr{Nothing} needs to be converted to a Ptr{T}. We then let Julia own the memory so that we do not have to free it ourselves.
function zeros_via_calloc(::Type{T}, dims::Integer...) where T
    # Allocate prod(dims) zero-initialized elements of sizeof(T) bytes each,
    # and reinterpret the returned void pointer as a Ptr{T}.
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    # Wrap the memory in an Array; own=true lets Julia's GC free it for us.
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end
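One caveat worth noting: this approach only makes sense for bits types whose zero(T) is represented by all-zero bytes, as it is for Float64, Int, and similar primitive types. A hedged variant with explicit guards might look like the following sketch; the name zeros_via_calloc_checked and the checks are my additions, not part of the implementation above.
function zeros_via_calloc_checked(::Type{T}, dims::Integer...) where T
    # calloc yields all-zero bytes; that only corresponds to zero(T) for
    # plain bits types such as Float64 or Int.
    isbitstype(T) || throw(ArgumentError("$T is not a bits type"))
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    # calloc returns C_NULL when the allocation fails.
    ptr == C_NULL && throw(OutOfMemoryError())
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end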
The performance is faster, as expected. There are some differences in performance between Windows and Windows Subsystem for Linux 2 on the same hardware, though.
# Windows
julia> @benchmark zeros_via_calloc(Float64, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.300 μs … 80.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.137 μs ±  5.399 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 12.3 μs        Histogram: log(frequency) by time        41.5 μs <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> @benchmark zeros(Float64, 1024, 1024)
BenchmarkTools.Trial: 2142 samples with 1 evaluation.
 Range (min … max):  1.639 ms …  9.415 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.843 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.330 ms ±  1.121 ms  ┊ GC (mean ± σ):  20.28% ± 22.35%
 1.64 ms        Histogram: log(frequency) by time        5.5 ms <
Memory estimate: 8.00 MiB, allocs estimate: 2.
# Linux
julia> @benchmark zeros_via_calloc(Float64, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.100 μs …  1.364 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     273.400 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   292.431 μs ± 62.129 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 22.1 μs        Histogram: frequency by time        493 μs <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> @benchmark zeros(Float64, 1024, 1024)
BenchmarkTools.Trial: 7878 samples with 1 evaluation.
 Range (min … max):  458.300 μs …  2.069 ms   ┊ GC (min … max): 0.00% … 31.44%
 Time  (median):     532.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   626.414 μs ± 191.989 μs  ┊ GC (mean ± σ):  8.12% ± 12.52%
 458 μs        Histogram: frequency by time        1.15 ms <
Memory estimate: 8.00 MiB, allocs estimate: 2.
Incorporating this into Base Julia is a bit more complicated since Julia usually uses aligned memory when allocating arrays of this size. The most recent attempt at this was four years ago.
https://github.com/JuliaLang/julia/pull/22953
Is zeros_via_calloc actually faster in practice?
In discussing this with @Sukera on Zulip and @tim.holy on GitHub, the contention was that zeros_via_calloc would not be faster in practice. The idea is that with calloc the operating system is just being lazy and deferring the zero-initialization work until later. Thus, when we actually try to iterate over or otherwise use the array, performance would suffer as the operating system actually does the work. However, as noted in the introduction, it is also possible that the operating system has done the work ahead of time and may be offering memory that has already been initialized with zeros.
Summation
Let's begin by just accessing the memory, summing all the elements together. The tests below include array creation time.
# Windows
julia> @benchmark sum(zeros_via_calloc(Float64, 1024, 1024))
BenchmarkTools.Trial: 2134 samples with 1 evaluation.
 Range (min … max):  1.680 ms …  5.428 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.847 ms                ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.885 ms ± 204.224 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%
 1.68 ms        Histogram: frequency by time        2.82 ms <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> @benchmark sum(zeros(Float64, 1024, 1024))
BenchmarkTools.Trial: 1563 samples with 1 evaluation.
 Range (min … max):  2.131 ms … 10.005 ms  ┊ GC (min … max): 0.00% … 59.89%
 Time  (median):     2.419 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.189 ms ±  1.589 ms  ┊ GC (mean ± σ):  20.99% ± 23.03%
 2.13 ms        Histogram: log(frequency) by time        8.86 ms <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> sum(zeros_via_calloc(Float64, 1024, 1024))
0.0
julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
# Windows Subsystem for Linux 2
julia> @benchmark sum(zeros_via_calloc(Float64, 1024, 1024))
BenchmarkTools.Trial: 8838 samples with 1 evaluation.
 Range (min … max):  418.500 μs …  2.682 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     471.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   496.652 μs ± 80.999 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 418 μs        Histogram: frequency by time        774 μs <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> @benchmark sum(zeros(Float64, 1024, 1024))
BenchmarkTools.Trial: 4866 samples with 1 evaluation.
 Range (min … max):  821.900 μs …  3.744 ms   ┊ GC (min … max): 0.00% … 8.10%
 Time  (median):     934.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.018 ms   ± 202.350 μs  ┊ GC (mean ± σ):  6.28% ± 10.47%
 822 μs        Histogram: frequency by time        1.53 ms <
Memory estimate: 8.00 MiB, allocs estimate: 2.
julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
On both Windows and WSL2, we see benefits when looking at the combined time of creating the array and summation. We can also isolate the summation step in the benchmarks.
# Windows
julia> @benchmark sum(C) setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 2131 samples with 1 evaluation.
 Range (min … max):  1.669 ms …  3.298 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.834 ms                ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.863 ms ± 163.536 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%
 1.67 ms        Histogram: frequency by time        2.54 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sum(A) setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 1686 samples with 1 evaluation.
 Range (min … max):  319.700 μs …  1.030 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     394.600 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   417.203 μs ± 64.969 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 320 μs        Histogram: frequency by time        682 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
# Windows Subsystem for Linux 2
julia> @benchmark sum(C) setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 8958 samples with 1 evaluation.
 Range (min … max):  171.600 μs … 687.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     192.350 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   203.257 μs ±  34.846 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 172 μs        Histogram: log(frequency) by time        350 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sum(A) setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 4855 samples with 1 evaluation.
 Range (min … max):  184.900 μs …  2.484 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     399.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   387.149 μs ± 91.006 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 185 μs        Histogram: frequency by time        567 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
On Windows, we see that the summation in isolation is slower using zeros_via_calloc than with zeros. On WSL2, the summation in isolation actually appears faster using zeros_via_calloc than with zeros.
Writing
Next we will try writing to the array. The tests below do not include array creation times.
julia> inds = CartesianIndices(1:5:1024*1024);
# Windows
julia> @benchmark C[$inds] .= $inds setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 8756 samples with 1 evaluation.
 Range (min … max):  365.000 μs …  2.712 ms   ┊ GC (min … max): 0.00% … 73.24%
 Time  (median):     388.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   551.171 μs ± 349.209 μs  ┊ GC (mean ± σ):  27.37% ± 25.49%
 365 μs        Histogram: log(frequency) by time        1.42 ms <
Memory estimate: 80 bytes, allocs estimate: 2.
julia> @benchmark A[$inds] .= $inds setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 2016 samples with 1 evaluation.
 Range (min … max):  143.800 μs … 365.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     149.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.769 μs ±  24.008 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 144 μs        Histogram: log(frequency) by time        254 μs <
Memory estimate: 80 bytes, allocs estimate: 2.
# Windows Subsystem for Linux 2
julia> inds = CartesianIndices(1:5:1024*1024);
julia> @benchmark C[$inds] .= $inds setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 9029 samples with 1 evaluation.
 Range (min … max):  142.000 μs …  3.562 ms   ┊ GC (min … max): 0.00% … 59.63%
 Time  (median):     148.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   269.166 μs ± 215.394 μs  ┊ GC (mean ± σ):  26.06% ± 23.50%
 142 μs        Histogram: log(frequency) by time        838 μs <
Memory estimate: 80 bytes, allocs estimate: 2.
julia> @benchmark A[$inds] .= $inds setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 6052 samples with 1 evaluation.
 Range (min … max):  140.400 μs … 417.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     149.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   159.025 μs ±  27.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 140 μs        Histogram: log(frequency) by time        280 μs <
Memory estimate: 80 bytes, allocs estimate: 2.
Here, writing does appear to be slower when the array is created with zeros_via_calloc.
Combined create, write, sum
In the last benchmarks of this initial post, I will benchmark the combined creation, writing, and summation of the resulting arrays.
julia> function c_func()
A = zeros_via_calloc(Float64, 1024, 1024)
A[1:5:length(A)] .= 2.0
sum(A)
end
c_func (generic function with 1 method)
julia> function a_func()
A = zeros(Float64, 1024, 1024)
A[1:5:length(A)] .= 2.0
sum(A)
end
a_func (generic function with 1 method)
# Windows
julia> @benchmark c_func()
BenchmarkTools.Trial: 1825 samples with 1 evaluation.
 Range (min … max):  2.066 ms …  9.072 ms  ┊ GC (min … max): 0.00% … 52.24%
 Time  (median):     2.233 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.734 ms ±  1.126 ms  ┊ GC (mean ± σ):  17.22% ± 20.25%
 2.07 ms        Histogram: frequency by time        5.62 ms <
Memory estimate: 8.00 MiB, allocs estimate: 4.
julia> @benchmark a_func()
BenchmarkTools.Trial: 1500 samples with 1 evaluation.
 Range (min … max):  2.375 ms … 11.108 ms  ┊ GC (min … max): 0.00% … 48.42%
 Time  (median):     2.686 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.325 ms ±  1.433 ms  ┊ GC (mean ± σ):  19.19% ± 22.26%
 2.38 ms        Histogram: log(frequency) by time        7.08 ms <
Memory estimate: 8.00 MiB, allocs estimate: 4.
# Windows Subsystem for Linux 2
julia> @benchmark c_func()
BenchmarkTools.Trial: 6228 samples with 1 evaluation.
 Range (min … max):  618.600 μs …  2.404 ms   ┊ GC (min … max): 0.00% … 27.01%
 Time  (median):     710.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   787.481 μs ± 173.925 μs  ┊ GC (mean ± σ):  7.60% ± 11.97%
 619 μs        Histogram: frequency by time        1.28 ms <
Memory estimate: 8.00 MiB, allocs estimate: 4.
julia> @benchmark a_func()
BenchmarkTools.Trial: 3429 samples with 1 evaluation.
 Range (min … max):  1.157 ms …  3.570 ms    ┊ GC (min … max): 0.00% … 25.64%
 Time  (median):     1.366 ms                ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.446 ms ± 229.285 μs   ┊ GC (mean ± σ):  5.05% ± 9.18%
 1.16 ms        Histogram: frequency by time        2.22 ms <
Memory estimate: 8.00 MiB, allocs estimate: 4.
In the combined tests, we see benefits for both Windows and WSL 2.
Conclusions and Discussion
zeros_via_calloc offers some performance benefits over the current implementation of zeros on both Windows and Windows Subsystem for Linux 2.
Array creation with zeros_via_calloc on Windows is faster, almost on par with uninitialized array creation, but array access and writes are slower. The combined operation on Windows is still overall faster, taking 82% of the time compared to when the array is created with zeros.
On Windows Subsystem for Linux 2, array creation with zeros_via_calloc is also faster, though not as dramatically as on Windows. Writing to the array is slower, but the overall combined operation on WSL2 takes only 54% of the time compared to when the array is created with zeros.
Overall, for the conditions tested here, zeros_via_calloc does appear to have performance benefits over the current implementation of zeros on Windows and Windows Subsystem for Linux 2. Prior testing suggests that the benefit is reduced for smaller arrays.
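If you want to check smaller sizes on your own machine, something along these lines works; the 64x64 size here is just an arbitrary example:
using BenchmarkTools
@btime zeros_via_calloc(Float64, 64, 64);
@btime zeros(Float64, 64, 64);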
Is zeros_via_calloc faster for your applications?