Faster zeros with calloc

mkitti · October 15, 2021, 10:07pm

Abstract

TL; DR zeros_via_calloc is a potentially faster version of zeros that is comparable to Array{T}(undef, ...) and numpy.zeros.

function zeros_via_calloc(::Type{T}, dims::Integer...) where T
   ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
   return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

# Windows benchmark

julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  12.400 μs (2 allocations: 8.00 MiB)

julia> @btime zeros(Float64, 1024, 1024);
  1.652 ms (2 allocations: 8.00 MiB)

# Windows Subsystem for Linux 2 Benchmark

julia> @btime zeros_via_calloc(Float64, 1024, 1024);
  24.300 μs (2 allocations: 8.00 MiB)

julia> @btime zeros(Float64, 1024, 1024);
  474.500 μs (2 allocations: 8.00 MiB)

Introduction

In Julia, there are several ways to create large arrays. The fastest form is Array{T}(undef, dims...) where T is the element type of the array and dims is a series of integers describing the size of the array. There are also a few convenience constructors such as zeros and ones which use the fastest form followed by fill!(array, zero(T)) or fill!(array, one(T)), respectively. Below see that zeros and ones are much slower than array creation using Array{T}(undef, ...).

julia> @benchmark Array{Float64}(undef, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   12.800 μs …   1.085 ms  ┊ GC (min … max):  0.00% … 96.56%
 Time  (median):      13.900 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   104.670 μs ± 200.995 μs  ┊ GC (mean ± σ):  85.28% ± 36.22%

julia> @benchmark zeros(1024, 1024)
BenchmarkTools.Trial: 2187 samples with 1 evaluation.
 Range (min … max):  1.661 ms … 6.932 ms  ┊ GC (min … max):  0.00% … 60.38%
 Time  (median):     1.838 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.281 ms ± 1.010 ms  ┊ GC (mean ± σ):  19.35% ± 21.94%

julia> @benchmark ones(1024, 1024)
BenchmarkTools.Trial: 2186 samples with 1 evaluation.
 Range (min … max):  1.657 ms … 6.561 ms  ┊ GC (min … max):  0.00% … 58.89%
 Time  (median):     1.835 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.283 ms ± 1.021 ms  ┊ GC (mean ± σ):  19.40% ± 21.87%

Above we see that on my Windows-based machine, uninitialized array creation takes a median time 13.9 microseconds. While ones and zeros takes a median of 1.8 milliseconds or 100 times as long as uninitialized array creation to execute.

However, if we compare to Numpy, we see that numpy.zeros takes a mean of 92 microseconds and is actually faster than numpy.ones as well as the Julia’s zeros:

In [60]: %timeit numpy.zeros((1024,1024), dtype = np.float64)
92 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [61]: %timeit numpy.ones((1024,1024), dtype = np.float64)
2.5 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The reason that numpy.zeros is faster is that since 2013, numpy.zeros uses the C routine calloc which initializes the underlying memory to zero. Using calloc for Julia’s zeros has been pending since 2011.

https://github.com/JuliaLang/julia/issues/130

calloc can be faster than creating an uninitialized array and filling it with zeros. The operating system may be able to manage its memory such that it knows where chunks of zeros already exist in memory. If the zeroed memory does not exist, the operating system may choose to defer zero initialization until you actually start to use the memory. See this StackOverflow for a longer explanation.

In this post, we explore whether we can implement zeros based on calloc from Julia, right now and evaluate whether it actually has any performance benefits.

Developing a `calloc` based `zeros` in Julia

Julia is kind enough to expose basic C routines via the module Libc. Libc.calloc(num::Integer, size::Integer) takes two arguments. num is the number of elements, while size is the size of each element. The routine returns a Ptr{Nothing} the equivalent of a void pointer in C. This will allocate zeroed memory that is not managed by the garbage collector.

To incorporate this memory into a Julia array, we will use unsafe_wrap(Array, pointer::Ptr{T}, dims; own). This wraps an Array with specified dimensions around the pointer. The keyword argument own specifies whether Julia should take ownership of the memory and be responsible for freeing the memory when we are done with it.

Implementing zeros_via_calloc is a matter of calling Libc.calloc with the appropriate arguments. We can get the number of elements by taking the product of the dimensions. The size of each element is just the size of the the type T. For unsafe_wrap the resulting Ptr{Nothing} needs to be converted a Ptr{T}. We will then let Julia own the memory so that we do not have free the memory ourselves.

function zeros_via_calloc(::Type{T}, dims::Integer...) where T
   ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
   return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

The performance is about is faster as expected. There are some differences in performance on Windows versus Windows Subsystem for Linux 2 on the same hardware though.

# Windows

julia> @benchmark zeros_via_calloc(Float64, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.300 μs … 80.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.137 μs ±  5.399 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█▇▄▁▄▄▄▂▂▃▂                                               ▂
  ██████████████▆▇▆▆▇▆▇▆▆▆▅▅▅▅▅▄▅▅▃▂▅▆▇▇▆▆▅▆▅▅▄▅▅▅▅▅▄▅▅▃▂▅▅▄▄ █
  12.3 μs      Histogram: log(frequency) by time      41.5 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark zeros(Float64, 1024, 1024)
BenchmarkTools.Trial: 2142 samples with 1 evaluation.
 Range (min … max):  1.639 ms … 9.415 ms  ┊ GC (min … max):  0.00% …  0.00%
 Time  (median):     1.843 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.330 ms ± 1.121 ms  ┊ GC (mean ± σ):  20.28% ± 22.35%

  ▆███▆▄▂▁                               ▂▃▂▂▁▁ ▁▁▁ ▁       ▁
  █████████▇▅▆▆▆▅▅▅▄▁▁▁▅▄▁▁▁▄▄▁▁▁▁▁▁▁▁▁▅█████████████▆▇█▄▄▇ █
  1.64 ms     Histogram: log(frequency) by time      5.5 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

# Linux

julia> @benchmark zeros_via_calloc(Float64, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   22.100 μs …  1.364 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     273.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   292.431 μs ± 62.129 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▂█▆▄▃▁
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃██████▇▆▅▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂ ▃
  22.1 μs         Histogram: frequency by time          493 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark zeros(Float64, 1024, 1024)
BenchmarkTools.Trial: 7878 samples with 1 evaluation.
 Range (min … max):  458.300 μs …   2.069 ms  ┊ GC (min … max): 0.00% … 31.44%
 Time  (median):     532.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   626.414 μs ± 191.989 μs  ┊ GC (mean ± σ):  8.12% ± 12.52%

    ██▂
  ▂▇███▅▃▄▄▄▄▄▃▃▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁ ▂
  458 μs           Histogram: frequency by time         1.15 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Incorporating this into Base Julia is a bit more complicated since Julia usually uses aligned memory when allocating arrays of this size. The most recent attempt at this was four years ago.

https://github.com/JuliaLang/julia/pull/22953

Is `zeros_via_calloc` actually faster in practice?

In discussing this @Sukera on Zulip and @tim.holy on Github, the contention was that zeros_via_calloc would not be faster in practice. The idea is that via calloc the operating system is just being lazy and deferring the zero initialization operating until later. Thus when we actually try to iterate or otherwise use the array that performance would suffer as the operating system actually does the work. However, as noted in the introduction, it is also possible that the operating system has done the work ahead of time and may be offering memory that already has been initialized with zeros.

Summation

Let’s begin by just accessing the memory by summing it together. The tests below include array creation time.

# Windows
julia> @benchmark sum(zeros_via_calloc(Float64, 1024, 1024))
BenchmarkTools.Trial: 2134 samples with 1 evaluation.
 Range (min … max):  1.680 ms …   5.428 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.847 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.885 ms ± 204.224 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▁▆▅▇█▅▆▆▅▄▃▁
  ▅█████████████▇▆▅▄▄▃▄▃▃▃▃▂▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▁▂▂▂ ▄
  1.68 ms         Histogram: frequency by time        2.82 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark sum(zeros(Float64, 1024, 1024))
BenchmarkTools.Trial: 1563 samples with 1 evaluation.
 Range (min … max):  2.131 ms … 10.005 ms  ┊ GC (min … max):  0.00% … 59.89%
 Time  (median):     2.419 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.189 ms ±  1.589 ms  ┊ GC (mean ± σ):  20.99% ± 23.03%

  ▇█▇▆▃▃▂▂▁▁                     ▂▃▂▂▁
  ██████████████▇█▆▅█▆▅▁▁▄▅▁▁▁▁▄▆████████▇▄▅▇▆▆▅▁▅▅▅▁▅▅▁▁▄▄▅ █
  2.13 ms      Histogram: log(frequency) by time     8.86 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> sum(zeros_via_calloc(Float64, 1024, 1024))
0.0

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

# Windows Subsystem for Linux 2

julia> @benchmark sum(zeros_via_calloc(Float64, 1024, 1024))
BenchmarkTools.Trial: 8838 samples with 1 evaluation.
 Range (min … max):  418.500 μs …  2.682 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     471.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   496.652 μs ± 80.999 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄▆██▄▄▂▁▁
  ▂▄█████████▇▇▆▆▄▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  418 μs          Histogram: frequency by time          774 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark sum(zeros(Float64, 1024, 1024))
BenchmarkTools.Trial: 4866 samples with 1 evaluation.
 Range (min … max):  821.900 μs …   3.744 ms  ┊ GC (min … max): 0.00% …  8.10%
 Time  (median):     934.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.018 ms ± 202.350 μs  ┊ GC (mean ± σ):  6.28% ± 10.47%

    ▃▇█▃▂▃▅▁
  ▁▄████████▇█▇▆▅▃▃▄▄▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▄▃▃▃▃▃▃▃▂▂▂▂▁▂▁ ▃
  822 μs           Histogram: frequency by time         1.53 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

On both Windows and WSL2, we see benefits when looking at the combined time of creating the array and summation. We can also isolate the summation step in the benchmarks.

# Windows

julia> @benchmark sum(C) setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 2131 samples with 1 evaluation.
 Range (min … max):  1.669 ms …   3.298 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.834 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.863 ms ± 163.536 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▄ ▂▃▅▇█▅▃▇▅▃▃▄▂
  ▄██████████████████▆▆▅▄▄▄▄▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂ ▄
  1.67 ms         Histogram: frequency by time        2.54 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(A) setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 1686 samples with 1 evaluation.
 Range (min … max):  319.700 μs …  1.030 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     394.600 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   417.203 μs ± 64.969 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            █▇▆▁
  ▂▂▂▃▃▃▃▃▃▇████▅▄▃▃▃▄▄▅▃▃▃▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  320 μs          Histogram: frequency by time          682 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# Windows Subsystem for Linux 2

julia> @benchmark sum(C) setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 8958 samples with 1 evaluation.
 Range (min … max):  171.600 μs … 687.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     192.350 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   203.257 μs ±  34.846 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▅▇████▇▇▆▆▅▅▅▅▄▄▃▃▂▃▂▂▁ ▁▁ ▁                                 ▃
  █████████████████████████████▇█▇██▇█▇█▇███▇▇█▆█▇█▇▇▆▇▇▆▆▇▆▅▆▅ █
  172 μs        Histogram: log(frequency) by time        350 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(A) setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 4855 samples with 1 evaluation.
 Range (min … max):  184.900 μs …  2.484 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     399.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   387.149 μs ± 91.006 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                ▄▁▇█▄▃▁
  ▂▄▄▃▂▂▃▅▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███████▅▄▄▅▄▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁ ▂
  185 μs          Histogram: frequency by time          567 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

On Windows, we see the summation in isolation is slower using zeros_with_calloc than with zeros. On WSL2, the summation in isolation actually appears faster using zeros_with_calloc than with zeros.

Writing

Next we will try writing to the array. The tests below do not include array creation times.

julia> inds = CartesianIndices(1:5:1024*1024);

# Windows

julia> @benchmark C[$inds] .= $inds setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 8756 samples with 1 evaluation.
 Range (min … max):  365.000 μs …   2.712 ms  ┊ GC (min … max):  0.00% … 73.24%
 Time  (median):     388.700 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   551.171 μs ± 349.209 μs  ┊ GC (mean ± σ):  27.37% ± 25.49%

  ▆█▄▂▄▂▂                                            ▁▃▄▂▃▂▁▁   ▁
  ████████▆▇▆▅▅▅▆▄▅▅▅▄▄▁▄▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▅█████████▇ █
  365 μs        Histogram: log(frequency) by time       1.42 ms <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark A[$inds] .= $inds setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 2016 samples with 1 evaluation.
 Range (min … max):  143.800 μs … 365.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     149.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.769 μs ±  24.008 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▇██▅▂                                                        ▁
  ██████▇▄▅▄▆▇▇█▅▄▄▁▄▅▆▇▇▇█▇▆▆▆██▇▆▅▇▅▇▄▄▅▄▆▇▇▆▄▇▅▄▄▁▆▄▄▄▆▄▄▅▆▄ █
  144 μs        Histogram: log(frequency) by time        254 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

# Windows Subsystem for Linux 2

julia> inds = CartesianIndices(1:5:1024*1024);

julia> @benchmark C[$inds] .= $inds setup = ( C = zeros_via_calloc(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 9029 samples with 1 evaluation.
 Range (min … max):  142.000 μs …   3.562 ms  ┊ GC (min … max):  0.00% … 59.63%
 Time  (median):     148.100 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   269.166 μs ± 215.394 μs  ┊ GC (mean ± σ):  26.06% ± 23.50%

  █▅▂▂▁▁▁▁            ▃▂▂                ▂▂▂       ▂▃▁          ▁
  ██████████▇▇▆▆▅▄▃▅▂▃███▇▇▆▆▅▆▆▄▄▅▃▃▄▂▄▃████▇▇▇▇▇██████▇▇▇▇▆▅▆ █
  142 μs        Histogram: log(frequency) by time        838 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark A[$inds] .= $inds setup = ( A = zeros(Float64, 1024, 1024) ) evals=1
BenchmarkTools.Trial: 6052 samples with 1 evaluation.
 Range (min … max):  140.400 μs … 417.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     149.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   159.025 μs ±  27.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▅█▇▇▆▅▄▃▂▂                                                   ▁
  ██████████████▇▇▇▇▇█▇▇███▇██▇▇▇█▇▇▇▇▆▇▆▇█▇▇▆▆▆▆▆▆▆▆▅▅▅▅▄▅▄▅▅▅ █
  140 μs        Histogram: log(frequency) by time        280 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

Here writing does appear be slower.

Combined create, write, sum

In the last benchmarks of this initial post, I will benchmark the combined creation, writing, and summation of the resulting arrays.


julia> function c_func()
           A = zeros_with_calloc(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
c_func (generic function with 1 method)

julia> function a_func()
           A = zeros(Float64, 1024, 1024)
           A[1:5:length(A)] .= 2.0
           sum(A)
       end
a_func (generic function with 1 method)

# Windows

julia> @benchmark c_func()
BenchmarkTools.Trial: 1825 samples with 1 evaluation.
 Range (min … max):  2.066 ms … 9.072 ms  ┊ GC (min … max):  0.00% … 52.24%
 Time  (median):     2.233 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.734 ms ± 1.126 ms  ┊ GC (mean ± σ):  17.22% ± 20.25%

  ▁▇█▂
  ████▆▄▃▃▂▂▂▂▂▁▂▂▁▂▂▂▁▂▂▁▂▂▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▃▃▄▃▃▃▃▃▂▂▂ ▃
  2.07 ms        Histogram: frequency by time       5.62 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 4.

julia> @benchmark a_func()
BenchmarkTools.Trial: 1500 samples with 1 evaluation.
 Range (min … max):  2.375 ms … 11.108 ms  ┊ GC (min … max):  0.00% … 48.42%
 Time  (median):     2.686 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.325 ms ±  1.433 ms  ┊ GC (mean ± σ):  19.19% ± 22.26%

  ▄▇██▇▆▃▁                                     ▂▃▃▂▂▂        ▁
  █████████▇▄▅▇▆▆▆▅▆▆▅▄▇▄▆▄▄▁▁▄▅▄▁▁▁▁▁▁▄▄▁▁▁▄▄███████▇█▆▄▇▄▅ █
  2.38 ms      Histogram: log(frequency) by time     7.08 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 4.

# Windows Subsystem for Linux 2

julia> @benchmark c_func()
BenchmarkTools.Trial: 6228 samples with 1 evaluation.
 Range (min … max):  618.600 μs …   2.404 ms  ┊ GC (min … max): 0.00% … 27.01%
 Time  (median):     710.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   787.481 μs ± 173.925 μs  ┊ GC (mean ± σ):  7.60% ± 11.97%

    ▁▆▇█▇▄▂
  ▃▆███████▇▆▅▅▅▅▅▄▄▄▃▃▃▂▂▂▂▁▁▂▁▁▁▁▁▁▁▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁ ▃
  619 μs           Histogram: frequency by time         1.28 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 4.

julia> @benchmark a_func()
BenchmarkTools.Trial: 3429 samples with 1 evaluation.
 Range (min … max):  1.157 ms …   3.570 ms  ┊ GC (min … max): 0.00% … 25.64%
 Time  (median):     1.366 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.446 ms ± 229.285 μs  ┊ GC (mean ± σ):  5.05% ±  9.18%

      ▃▂██▅▅▄▆▃▂
  ▁▁▂▄███████████▇▅▅▄▃▃▂▃▃▄▃▂▃▄▃▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁ ▃
  1.16 ms         Histogram: frequency by time        2.22 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 4.

In the combined tests, we see benefits for both Windows and WSL 2.

Conclusions and Discussion

zeros_via_calloc offers some performance benefits over the current implementation of zeros on both Windows and Windows Subsystem for Linux 2.

Array creation with zeros_via_calloc on Windows is faster, almost on par with uninitialized array creation, but array access and write is slower. The combined operation on Windows is still overall faster taking 82% of the time compared to when the array is created with zeros.

On Windows Subsystem for Linux 2, array creation with zeros_via_calloc is also faster but not as dramatic as on Windows. Array access and write is slower, but the overall combined operation on WSL2 only takes 54% of the time compared to when the array is created with zeros.

Overall, for the conditions tested here, zeros_via_calloc does appear to have performance benefits over the current implementation of zeros on Windows and Windows Subsystem for Linux 2. Prior testing suggests that the benefit is reduced for smaller arrays.

Is zeros_via_calloc faster for your applications?

mkitti · October 15, 2021, 10:12pm

One implication from the Windows benchmarks is that zeros_via_calloc may be comparable to uninitialized array creation in some circumstances, such as when one fills the array with something else.

julia> @benchmark fill!(zeros_with_calloc(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 2160 samples with 1 evaluation.
 Range (min … max):  1.677 ms …   3.807 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.821 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.850 ms ± 160.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁ ▃▆▄▇█▇▆▅▃▂
  ▅▄████████████▇▇▆▆▅▄▄▄▃▃▃▂▃▃▂▂▃▂▂▂▂▂▂▂▁▂▂▂▂▁▂▁▂▂▂▂▁▂▂▂▂▂▂▁▂ ▄
  1.68 ms         Histogram: frequency by time        2.57 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark fill!(Array{Float64}(undef, 1024, 1024), 5.0)
BenchmarkTools.Trial: 2186 samples with 1 evaluation.
 Range (min … max):  1.665 ms … 7.798 ms  ┊ GC (min … max):  0.00% … 67.40%
 Time  (median):     1.829 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.282 ms ± 1.035 ms  ┊ GC (mean ± σ):  19.40% ± 21.79%

  ▇██▇▆▃▂▁                                   ▂▃▃▃▃▁▁        ▁
  █████████▇▆▅▆▅▆▄▄▄▅▄▆▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆█████████▇▇▅▆▆▆ █
  1.66 ms     Histogram: log(frequency) by time     5.06 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark fill!(zeros(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 1634 samples with 1 evaluation.
 Range (min … max):  2.230 ms … 9.976 ms  ┊ GC (min … max):  0.00% … 61.09%
 Time  (median):     2.445 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.053 ms ± 1.359 ms  ┊ GC (mean ± σ):  20.17% ± 22.88%

  ▆██▆▅▄▂                                    ▂▃▃▂▂▁▁        ▁
  ████████▇▇▄▆▆▄▆▅▄▄▅▄▄▁▅▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▅█████████▄▅▄▅▄▅ █
  2.23 ms     Histogram: log(frequency) by time     6.79 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Elrod · October 15, 2021, 11:05pm

More often than not, memory is the bottleneck.
Why pass over memory twice – first to zero initialize, then later to interact with it (i.e., zeros) – if you could just once – zeroing as you interact with it (calloc)?
Fusion is great. Theoretically, calloc seems like it has nothing but upsides.

On Linux, comparing undef initializing with zeros_via_calloc:

julia> @benchmark zeros_via_calloc(Float64, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   22.352 μs …  1.114 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     335.558 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   337.057 μs ± 12.094 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                            █
  ▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂█▅ ▂
  22.4 μs         Histogram: frequency by time          345 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark Array{Float64}(undef, 1024, 1024)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  20.666 μs … 48.211 μs  ┊ GC (min … max): 95.95% … 95.08%
 Time  (median):     41.557 μs              ┊ GC (median):    96.05%
 Time  (mean ± σ):   34.945 μs ±  9.908 μs  ┊ GC (mean ± σ):  95.93% ±  0.19%

  █                                                      ▅▁▅
  █▆▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▆ ▃
  20.7 μs         Histogram: frequency by time        42.8 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Similar minimum time, very different median and mean.

For me, fill!(zeros_via_calloc(...), ...) is about 2x faster than using zeros on Linux for 1024x1024:

julia> @benchmark fill!(zeros_via_calloc(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 6266 samples with 1 evaluation.
 Range (min … max):  600.150 μs … 936.917 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     679.553 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   674.231 μs ±  43.092 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▄█         ▁                  ▁▂▁      ▁
  ███▅▃▁▂▁▂▂▄▆█▆▄▂▂▄▄▄▃▂▃▅▇█▆▅▅▆█████▇▆▆▇████▆▅▆▇▇▆▆▅▄▃▃▃▄▆▆▇▆▅ ▄
  600 μs           Histogram: frequency by time          746 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark fill!(Array{Float64}(undef, 1024, 1024), 5.0)
BenchmarkTools.Trial: 7129 samples with 1 evaluation.
 Range (min … max):  392.945 μs …   2.997 ms  ┊ GC (min … max):  0.00% … 75.62%
 Time  (median):     585.374 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   697.015 μs ± 265.796 μs  ┊ GC (mean ± σ):  11.60% ± 16.13%

         ▁█▇▃▂
  ▂▂▃▃▃▄▅█████▇▅▃▂▄█▃▂▃▄▃▂▂▂▁▁▁▂▁▂▄▄▂▂▂▂▁▂▂▃▄▃▂▂▂▂▁▂▂▂▂▂▃▄▃▂▃▃▃ ▃
  393 μs           Histogram: frequency by time         1.48 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark fill!(zeros(Float64, 1024, 1024), 5.0)
BenchmarkTools.Trial: 3600 samples with 1 evaluation.
 Range (min … max):  1.120 ms …   4.245 ms  ┊ GC (min … max): 0.00% … 68.09%
 Time  (median):     1.226 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.384 ms ± 340.695 μs  ┊ GC (mean ± σ):  9.07% ± 13.24%

    ▁█▃
  ▃▅████▄▄▄▅▇▅▆▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▄▅▄▃▂▁▁▁▁▁▁▁▂▄ ▃
  1.12 ms         Histogram: frequency by time        2.31 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Maybe more like your WSL2 results than Windows though, in that it is still slower than an undef array?

stevengj · October 15, 2021, 11:16pm

Slight downside: there is no analogue of posix_memalign for calloc, so allocating 16-byte aligned memory (which Julia does to ease SIMD utilization) is more painful (but still possible).

mkitti · October 15, 2021, 11:57pm

In considering the interface, we may now want to consider a ZeroInitializer with singleton zeroinit to enable Array{T}(zeroinit, dims). This might then allocate memory using an aligned calloc.

@Sukera mentioned a desire for eager and lazy versions. Perhaps we also then need a FillInitializer that just uses fill! to populate array elements as we currently do in ones and zeros. This would be the eager version.

tkf · October 16, 2021, 12:06am

As I commented in Zulip, this is also useful in Array{T}(undef, ...) when T is a boxed value since we need to fill the buffer with null pointers (i.e., zeros). We need an interface that can support this use since zeroing out the array is unavoidable in this case. (Or maybe the calloc version can be the default for boxed eltype.)

Also, I’m going to guess that the over-allocation technique can still be used in the pure-Julia implementation in the OP if you manually free it in the finalizer (i.e., use own = false)?

Sukera · October 16, 2021, 8:27am

To clarify (since it seems to have been lost in translation from zulip/github to discourse), I’m not in principal opposed to calloc, I just think the situations where it can provide a benefit are dubious at best and niche at worst.

I’m not 100% sure how it works on windows (closed source and all that), but on linux calloc usually works by having a dedicated page of zeros, which is used as backing memory when reading from a block of memory that’s been allocated using calloc. When a program writes to that memory, the kernel intercepts the write, finally allocates a “real” page of memory, fills it with zeros and finally writes whatever data you want to write. This moves the time spent for initializing the block of memory with zeros from allocation time to use-time. As far as I remember, this works similarly on windows, but the process of intercepting the write and having the kernel do the work of allocating a page and filling it is a bit more expensive than on linux, which is why the timing on windows may be off sometimes compared to linux.

I’ll go through each example now and repeat my thoughts.

Summation+allocation

Take the example with summation+allocation, i.e. sum(zero_func(Float64, 1024, 1024)). Yes it’s faster, but think about what this operation is doing - you’re summing nothing at all. In the case of calloc, you’re not even hitting different memory, so I’m not sure what this is supposed to show - it’s like an idealized case, where all accessed data is already in a hot L1 cache and you can access it in any random pattern and still have magically fast speed. After real data has been written here, the speed advantage of calloc vanishes - after all, there are now real pages backing the allocation, which go in and out of hot caches, leading to the memory bottleneck @Elrod mentioned.

Summation

A similar case can be made for pure summation, with allocation taken out of the picture, i.e. sum(C) setup = ( C = zeros_func(Float64, 1024, 1024) ). I’m again not surprised calloc is faster here - it’s the same situation as above, hitting magic fast speed because of an idealized caching scenario. The only thing that’s different is that the allocation itself is not counted as part of the benchmark (which, if I recall correctly from zulip, was ~200µs on your machine @mkitti ? Please check, as it would explain the difference to summation+allocation right away). In case of the current zeros, I’m again not surprised that it’s a little slower here. It’s using malloc in the background and fills in the memory with 0 manually, forcing the kernel to create real pages to back this memory. This leads to a more realistic situation regarding caches and doesn’t behave differently from real data (well, except for branch prediction having an idealized situation, but we’ll ignore that for now).

Writing

As for writing, the results again make sense when we think about what’s happening in the background. When iterating and writing over memory, the calloc version page faults (i.e. has to load memory from some other cache than L1-3) probably just as often as the zeros version, BUT the zeros version doesn’t have to initialize the memory, since that’s already been done at creation time. It can just directly write out to memory and let the CPU and it’s page fault prediction manage pre-loading of memory etc.

Combined case

For the combined case, the benefits seem okayish, with 0.5-1ms in favor of calloc when comparing min and max. However, I’m not yet convinced this generalizes to more complex initialization than setting a single value, especially considering we’re not making use of the zero memory since we’re writing to the page, forcing a page fault and thus forcing the allocation of a real page. We don’t even get to keep magically fast read-cache, depending on our writing pattern. To take advantage of that long term, our array would have to be sparse-ish, with less than one write per page of memory on average and probably much fewer than that to see real benefits when reading again (are real sparse arrays already faster here? Probably, but I have no numbers to back that up).

If the question of whether to use calloc or malloc dominates your overall runtime (i.e. your problem is allocation bound), then maybe the first step should not be to think about whether to use calloc or malloc but “why am I allocating so much” and “can I restructure my problem to reuse buffers” or “is this really a performance bottleneck”. Don’t get me wrong, I’m a lover of performance as much as the next person on this forum, but at some point it’s time to think about what you’re optimizing for.

All in all, my personal reason for why I’m not inclined to say “yeah let’s calloc everything” is the following: ease of use and predictability of performance. As was shown both in the benchmarks and in my attempt at explanation of them above, calloc mostly shifts the burden of initialization from the allocation to the computation. This muddies the water when profiling and I’d much rather have a clear indication “oh yeah we’re spending time zeroing memory” when looking at a flame graph than have the cost be hidden behind the first iteration in my loop. When profiling, it forces me as a user to think about whether I really need a zero initialized array in my code. If the conclusion is yes, I really do need it because I’m reading a zero before writing some other data and that’s the fastest way I can express this, then great! That’s the use case I had in mind with the suggestion of “maybe a switch to zeros for lazy initialization?” came from. If zeros were lazy by default, this would be much harder to spot, because it requires deep knowledge about how an OS handles memory to even identify that as a possibility when debugging a “slow” loop that should be faster. I don’t think that’s a reasonable expectation we should have here, as evidenced by the time spent digging into this by @mkitti.

@Elrod I’m not surprised by your fill! comparison - conceptually, both zeros_via_calloc as well as zeros do the same amount of work, zeroing memory and immediately overwriting it again, whereas the Array{Float64}(undef, ...) version only writes, it never reads a zero, making the initialization useless. I think this only strengthens my point about knowing the code in question and how it is used to squeeze out optimal performance. That’s also why I like the eager behavior of zeros - it makes reasoning about the code much easier and the cost is immediate and obvious, it’s not hidden behind the first use.

I think the canonical issue for this sort of thing is implement `@fill([0, 0], 2)` · Issue #41209 · JuliaLang/julia · GitHub. It also has a bunch of links to further discussion, and probably not quite appropriate here.

I’m not sure this is relevant to the discussion about semantics & behavior of zeros though? As far as that allocation goes, you never want to read from an explicitly undef array anyway, so from my POV the first use should always be a write, making the initialization guarantee a performance hit on first write of a page. I’m not the one for a call here though, there is already plenty discussion about this e.g. here zero out memory of uninitialized fields · Issue #9147 · JuliaLang/julia · GitHub which mentions that the situation for arrays is different from types (you never want to read from an undef array and if you do you have a bug).

mkitti · October 16, 2021, 2:35pm

Just to be clear, the main question at hand is whether zeros should use calloc by default rather than the current default of malloc followed by memset. The question transcends programming languages.

https://vorpus.org/blog/why-does-calloc-exist/

For primitive types, Array{T}(undef, ...) should continue to use malloc. fill!(Array{T}(undef, ...), zero(T)) should continue to be available for an explicitly eager version. The question is if zeros should have an eager keyword option or if there should be another shorter method of eagerly making zero initialized memory available.

Sukera · October 16, 2021, 3:40pm

That article basically gives two reasons:

It checks for overflow when allocating and
it doesn’t actually allocate memory it doesn’t need when writing.

The first reason is not relevant for regular use of julia - memory allocation is abstracted away already.

The second reason gives the example of np.eye, which apparently calls calloc (since that’s the default for numpy). I don’t know why numpy does this allocation, but I don’t think the situation can be mapped 1-to-1 to julia, since we have vastly more kinds of specialized array types, in this case Diagonal, which only allocates its diagonal in the first place (which replaced eye back in julia 0.7 days!). This is vastly more efficient if you only ever read from it. If you intend to write, you can use spdiagm from the SparseArrays stdlib instead, again only using what you really need. I understand that for people coming from python this may be unfamiliar territory, but to me this power to express explicitly what I want to do is one of the great advantages of julia that I just don’t quite get with other languages, at least not in such a terse form.

I want to make it clear that I’m not in principal opposed to a version of calloc being exposed - but I am opposed to making it the default behavior, because it muddies the water when actually trying to find out where and why time is spent.

All in all, julia offers a lot of different tools with different use cases for different purposes. The obligation to know what they’re doing and what they want is placed on the user - after all, they know best what their algorithm requires. They are in the best position to make decisions about how to optimize. As such, having to dig into how julia allocates memory just to find out why the first iteration of a loop is slow feels wrong and more like a gotcha to me than seeing one large block of time spent in a zeros call (where the expectation should be that it’s slow because it has to zero out memory - there is no magic “go fast” switch).

mkitti · October 16, 2021, 5:10pm

What would be the best interface for calloc then? Since some say zeros shouldn’t exist, how should we make it available?

The idea that loosely referred to on Github was that may need a unique initializer that would then suggest the usage of calloc rather than malloc. Perhaps something like Array{T}(lazyzeroinit, dims...)?

mkitti · October 16, 2021, 5:17pm

If anyone is interested, here is where glibc decides whether to clear the memory itself or not in calloc

glibc calloc implementation, LGPL licensed

https://github.com/bminor/glibc/blob/30891f35fa7da832b66d80d0807610df361851f3/malloc/malloc.c#L3537-L3557

ToucheSir · October 16, 2021, 6:49pm

Adapting some of my thoughts from the Zulip topic as well:

Why even care?

For the uninitiated, this topic didn’t just come out of nowhere. I think most on this forum are familiar with the semi-frequent comparison threads with other languages where someone claims “Julia is slow” but didn’t write a representative benchmark. The problem is, the more you have to convince them to change in their benchmark to put forward a “fair” comparison, the harder it becomes to defend those changes. This is especially true when said changes appear to deviate from default language/stdlib functionality, which is exactly the case with zeros and calloc. The more friction a potential user experiences with trying to make their Julia port perform well, the more likely we are to see acrimonious forum posts or follow-up tweets. This is not a hypothetical scenario either, I think many of us can think of an example within recent memory.

I like to take the pit of success argument here.
Julia is a such a nice language because it not only provides you enough depth to write optimized programs, but has enough taste injected into the design to choose what works for the majority of users by default even if it’s not the most optimized path in every dimension. One big example of this is (im)mutable structs and heap allocation.

What are users more likely to want? Slow setup time but faster time for the first iteration of a hot loop, or faster time overall?
I’d argue the first is the more niche case, and for that we have fill(0 ...) + fill!(Array{T}(undef, ...), 0). Whereas expecting new users to figure out that they should write unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true) to match the performance of other languages is a tall order.

With apologies to Einstein

The argument put forward in previous posts is that the “spooky action at a distance” caused by calloced memory faulting on first read rather than init is not with the increased overall performance, and that cases where said performance matters are at best niche. I’ve already talked about the “this doesn’t come up enough” part of that argument (namely folks coming from Numpy), but let’s approach it from a different angle.

Say in some alternate timeline, Julia did use calloc by default for zeros. How likely would we have a discussion on the opposite behaviour come up? i.e. a “I much prefer deterministic O(N) init for zeros even at the cost of overall performance” thread? I posit to you that it likely wouldn’t surface at all! If such a discussion did come up, imagine the responses on this forum. “why not just use fill, that has deterministic runtime”, “you probably shouldn’t be using Julia for crypto”, “this seems like a niche use case to reduce average performance for”, and of course “small arrays don’t undergo lazy init already”. Instead, we might get discussions like https://github.com/numpy/numpy/issues/16498#issuecomment-639179593 (“wow, zeros is fast, but I shouldn’t use it for certain benchmarks because it behaves slightly differently for those watching the perf counters”).

Control and lines in the sand

But back to language design. It seems odd that this is the bar where we say “no, X feature feels un-Julian”, when Boxing, escape analysis and heap-to-stack allocation hoisting all exist. A similar argument would be that this kind of control is necessary in a high-performance language, but that doesn’t explain Rust’s eager use of calloc. Even C provides an equally accessible interface for allocating cleared and uninitialized memory, while Julia only puts the latter front-and-centre with similar.

In closing, if you:

Dislike wasting time on correcting misleading benchmark comparisons with other languages
Would prefer to write fewer fenceposts or fun incantations in performance critical code
Have to work with dense arrays or libraries that do sparse writes to dense arrays (e.g. some ADs)

Then my pitch is that having zeros use calloc when appropriate is a good idea.

Sukera · October 16, 2021, 8:14pm

I feel like interacting with the posts and arguments presented in here would be a better mode of communication, as just plainly copy & pasting an argument presented on another platform feels bad, since I now feel like I have to copy & paste my answer to your answer again… Instead I’ll write out a fully formed, new post, as done in the OP.

Personally, I dislike the culture of “changing the benchmark until its correct” precisely because it feels very bad when you’re on the receiving end of it. I have been guilty of this in the past as well and as such, I nowadays try to explain why the presented benchmark gives the observed result without implying they’re doing it wrong unless explicitly asked for. This can either happen in the post directly (“am I missing something/doing it wrong? what’s going on?”) or in a follow up post (“Ah, I see! How could this be benchmarked better?”).

I’m not sure I follow here - are you referring to julia’s stdlib with “default language […] functionality”? From my POV, changing zeros to be implicitly lazy (by using calloc) would precisely introduce such an inconsistency. Across the board, laziness and eagerness are strictly seperated, often by living in the Base.Iterators submodule. Making zeros lazy but having ones be eager feels just as arbitrary - why is one lazy and not the other? The answer would be “because there’s no equivalent to calloc for ones”.

I don’t like the argument about the “pit of success” here for two reasons

it’s very subjective
it implies there’s only “one true way”

The “pit of success” is different for every use case. What’s ideal for one program and developer is an annoyance for another - making a blanket statement about what ought to be done is what feels un-julian to me, not the discussion about “should we expose calloc at all”. I’ve reiterated that point both on zulip and up above - I’m not opposed to calloc!

I don’t understand why this should be an either/or case? I’m arguing that we should have BOTH, but default to the former because making code behave in a way that makes it easier to reason about is a good thing. When you write zeros(Float64, ...), write to some locations and pass that array on to another piece of code that you may or may not have written, you don’t necessarily control how the resulting array is used. Conversely, when you use zeros in library code you have no idea how a user/receiver of that array will use it. Both as a user of a library as well as a developer of a library I’d like to have an easier time following where time is spent and why. With a lazy zeros using calloc, this is not possible - you do not necessarily know when or how that array is used/accessed. The performance cost of calloc can manifest itself whenever any element that happens to be on a new page is accessed, which may not be close to the allocation site at all. How would you identify that this was spent as a zeroing page fault as a result of calloc at all? The information that this is happening is not even present in a strace of system calls. All you can easily observe is that some loop over some indices somewhere in the stack is slower than you expect, and figuring out why is a gargantuan task. Having to pay the cost up front makes it trivial - a flamegraph identifies the aggregate time spent in the zeros call.

Note that I’m not saying that we shouldn’t use calloc ever - that is not the case! What I am saying is that this should only be used when the performance implications about how the cost is distributed are understood and accepted, hence the default for eager initialization.

My complaint would be the same - calloc is intransparent when it comes to where the cost of initialization moves to and thus should be avoided, except in cases where it’s understood to be ok. Naive uses of different allocation schemes (fill(zero(T), ..) vs. a naive zeros(T, ...)) should not have an impact on the performance of a loop in a completely different part of the stack.

If this is really the kind of answer you expect on this forum, I’m extremely saddened that this is the impression the julia community gives off. All of these sound extremely dismissive and, frankly, come across as “get off my lawn” instead of embracing others’ POV. I for one would be happy if people were to implement crypto (and maybe prove correctness?) in julia. I also want julia to succeed in areas other than ML, AD, DiffEq, Modeling and similar.

This feels like a strawman - I’m equally missing tools for investigating the compiler and what it’s doing in these cases. I’m not sure why that would be an argument against having a choice between lazy and eager zeros.

I don’t like just pointing to some implementation somewhere and saying “look, X is doing it too” anymore when the comparison is python or rust. Rust conveniently gives you the choice to use an allocator that doesn’t use calloc but instead always uses malloc - a choice neither julia nor python expose, as the GC is treated as blackbox, not to be meddled with by the user.

Reading through the source of Vec (Rusts’ equivalent to julia vectors), I couldn’t help but notice that growing a vector doesn’t guarantee anything about whether or not the new memory is zeroed or not. In contrast, julia already documents that resize! with a larger size does not initialize the resulting memory. How should this be dealt with should a zeros array be allocated via calloc? As far as I’m aware, there is no realloc for calloced memory.

I don’t get what’s specific about zeros that this blanket statement should have weight here. This can be said about any difference between julia and any other language.
A few points:
a) How do off-by-one errors factor into this discussion at all?
b) I don’t see how having false as a sense of strong zero is relevant to whether or not zeros is lazy or eager. If you get something that wasn’t 0, by definition, that array wasn’t zeroed - no matter the initialization method.
c) I still don’t understand how that line of code is a strong argument either way. Having false as a strong zero not working for user defined types is, from my POV, not a problem of Flux or DiffEq or Zygote, but of the user defined type implementing multiplication wrong. That this is very loosely and badly (or not at all) defined is also bad, but has nothing to do with whether zeros is eager or lazy. In fact, for user defined types zeros has to use malloc with initialization because a zero-filled block of memory may not be valid for zero of a user defined type! As you note, it may be boxed or a number of other things, like not being isbits or simply not having a block of zeros as its zero representation.
That’s a single use case which I really hope is not the exclusive intended use case for julia as a whole.

I don’t think the points you put in a list support either lazy or eager zeros or even the ability for the user to choose. To me it just comes across as “I want this because it’s helpful in my case” to which I say “ok, let’s have a user controlled switch then since users can’t affect the memory allocator meaningfully here”.

I haven’t seen anyone claim “zeros shouldn’t exist” yet, so I’m not quite sure what you mean here.

The trouble with that is that it’s not as simple as switching between calloc and malloc behind the scenes, especially not for user defined types (for which zeros also has to work). For example, suppose we’d add a switch between lazy and eager initialization, do we then provide a LazyInitializedMatrix type that gets returned for custom user defined types that don’t neatly map to calloc? The page faulting process is completely invisible to julia user code as far as I know, so we can’t even hook into it to lazily initialize user types.

Adding onto this that this would be (as far as I know) the only place in julia where user code could explicitly influence how something is allocated, my gut tells me “let’s be careful what we want here and what the implications are”.

artkuo · October 16, 2021, 9:08pm

Would it be appropriate to take the “usual” approach for new features, i.e. start with a package, gather use cases and data, and let that inform how to eventually fold into core? It seems like zeros_via_calloc is faster on some systems and some cases, but uncertain where it might not be better, and potentially confusing to benchmarkers. The power users will surely figure all that out given an option like @fastzeros.

As an analogy, I gather that LoopVectorization is a prototype for eventual core language, with the documentation clearly warning about potential misuses. It sounds like lazy zeros could be similarly deployed. Meanwhile, I thoroughly appreciate @mkitti’s well-researched post, and the well-reasoned discussion from @sukera and others.

mkitti · October 16, 2021, 9:21pm

See Deprecate `ones`? · Issue #24444 · JuliaLang/julia · GitHub

tkf · October 16, 2021, 9:56pm

Julia already needs to zero out boxed value array elements and fields for GC. That’s why I mentioned that it’s already unavoidable. As such, excluding this case will underestimate the importance of the optimization discussed here.

Elrod · October 17, 2021, 6:51am

Seems like calloc really should be implemented in a way to benefit from multithreading:

julia> using LoopVectorization, BenchmarkTools

julia> function zeros_via_calloc(::Type{T}, dims::Integer...) where T
          ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
          return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_via_calloc (generic function with 1 method)

julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
           A = f(Float64, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = A[i]
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 1 method)

julia> function alloctest(dims::Vararg{Integer,N}) where {F,N}
           A = Array{Float64}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = 0.0
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 2 methods)

julia> @benchmark zeros(8192, 8192)
BenchmarkTools.Trial: 94 samples with 1 evaluation.
 Range (min … max):  49.796 ms … 146.351 ms  ┊ GC (min … max): 0.00% … 65.90%
 Time  (median):     51.714 ms               ┊ GC (median):    3.40%
 Time  (mean ± σ):   53.449 ms ±  12.958 ms  ┊ GC (mean ± σ):  6.49% ±  9.56%

  █ █
  █▁█▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  49.8 ms       Histogram: log(frequency) by time       133 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark alloctest(8192, 8192) # undef init
BenchmarkTools.Trial: 134 samples with 1 evaluation.
 Range (min … max):  33.063 ms … 144.814 ms  ┊ GC (min … max): 0.00% … 77.03%
 Time  (median):     36.147 ms               ┊ GC (median):    2.81%
 Time  (mean ± σ):   37.299 ms ±  12.321 ms  ┊ GC (mean ± σ):  9.25% ± 10.36%

  █ ▇▄
  ████▇▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  33.1 ms       Histogram: log(frequency) by time       122 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 184.

julia> @benchmark alloctest(zeros, 8192, 8192) # zeros
BenchmarkTools.Trial: 55 samples with 1 evaluation.
 Range (min … max):  85.009 ms … 197.439 ms  ┊ GC (min … max): 0.00% … 56.90%
 Time  (median):     87.641 ms               ┊ GC (median):    0.88%
 Time  (mean ± σ):   91.011 ms ±  19.084 ms  ┊ GC (mean ± σ):  6.29% ± 10.42%

  █ ▅
  █▃█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  85 ms           Histogram: frequency by time          174 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 183.

julia> @benchmark alloctest(zeros_via_calloc, 8192, 8192) # zeros_via_calloc
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):   81.286 ms … 251.090 ms  ┊ GC (min … max):  0.00% … 67.55%
 Time  (median):      81.993 ms               ┊ GC (median):     0.46%
 Time  (mean ± σ):   100.907 ms ±  31.495 ms  ┊ GC (mean ± σ):  19.33% ± 17.28%

  █          ▇
  █▁▁▁▁▁▁▁▁▁▁█▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  81.3 ms       Histogram: log(frequency) by time        251 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 187.

I.e., seems like it should be possible for the specific thread doing the first write to a page to also do the zeroing. If this were the case, calloc would have two benefits:

single pass over the array instead of two passes
implicit (local) multithreading of the zeroing when combined with multithreaded code.

However, that does not appear to be the case.

This is a reasonably common pattern, where an array is initialized and then passed over multiple times to update the values. Here I wanted to only update once of course, to make the potential benefits easier to detect.

mkitti · October 17, 2021, 7:04am

Are one of those “page” words supposed to be “thread”?

Elrod · October 17, 2021, 7:08am

Yes (fixed).

mkitti · October 17, 2021, 7:23am

Here are my results on Windows with 16 threads. alloctest(8192, 8192) and alloctest(zeros_via_calloc, 8192, 8192) seem to take about the same time.

julia> @benchmark zeros(8192, 8192)
BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  124.103 ms … 228.608 ms  ┊ GC (min … max):  0.00% … 43.77%
 Time  (median):     157.009 ms               ┊ GC (median):    17.05%
 Time  (mean ± σ):   169.346 ms ±  44.435 ms  ┊ GC (mean ± σ):  23.65% ± 19.20%

  █                                                  ▃
  █▇▅▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▅█▁▁▁▁▁▁▄ ▁
  124 ms           Histogram: frequency by time          229 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark alloctest(8192, 8192)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  150.885 ms … 346.944 ms  ┊ GC (min … max):  0.28% … 55.35%
 Time  (median):     180.302 ms               ┊ GC (median):     0.29%
 Time  (mean ± σ):   211.417 ms ±  55.850 ms  ┊ GC (mean ± σ):  20.95% ± 19.02%

    ▄ █▁                      ▁    ▁
  ▆▁█▁██▆▁▆▆▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▆█▆▁▁▆▁▁▁▆▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  151 ms           Histogram: frequency by time          347 ms <

 Memory estimate: 512.01 MiB, allocs estimate: 83.

julia> @benchmark alloctest(zeros, 8192, 8192)
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  267.953 ms … 468.588 ms  ┊ GC (min … max):  0.23% … 41.70%
 Time  (median):     280.005 ms               ┊ GC (median):     0.18%
 Time  (mean ± σ):   318.286 ms ±  61.175 ms  ┊ GC (mean ± σ):  14.20% ± 14.64%

  █▃▃                       ▃
  ███▇▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁█▇▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  268 ms           Histogram: frequency by time          469 ms <

 Memory estimate: 512.01 MiB, allocs estimate: 83.

julia> @benchmark alloctest(zeros_via_calloc, 8192, 8192)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  159.149 ms … 357.459 ms  ┊ GC (min … max):  0.40% … 54.82%
 Time  (median):     181.814 ms               ┊ GC (median):     0.28%
 Time  (mean ± σ):   214.017 ms ±  56.574 ms  ┊ GC (mean ± σ):  20.51% ± 18.43%

  █ █▃▃                    █ ▃
  █▇███▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▇█▇▁▁▇▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▇ ▁
  159 ms           Histogram: frequency by time          357 ms <

 Memory estimate: 512.01 MiB, allocs estimate: 83.

julia> Threads.nthreads()
16

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

Topic		Replies	Views
Performance of zeros() vs. Array{T}()? Performance	4	1165	September 6, 2018
Zeros slower than Array comprehension Performance question	3	682	October 7, 2018
[ANN] ArrayAllocators.jl: Integrating calloc and aligned memory into Array construction Package Announcements package , announcement , array , zeros	14	1936	May 5, 2022
Julia 1.0, tight-binding benchmark and array slices Performance	9	1876	September 22, 2018
Fastest way of getting a long zero vector? General Usage	29	845	March 15, 2024

Faster zeros with calloc

Abstract

Introduction

Developing a calloc based zeros in Julia

Is zeros_via_calloc actually faster in practice?

Summation

Writing

Combined create, write, sum

Conclusions and Discussion

Summation+allocation

Summation

Writing

Combined case

Why even care?

With apologies to Einstein

Control and lines in the sand

In closing, if you:

Related topics

Developing a `calloc` based `zeros` in Julia

Is `zeros_via_calloc` actually faster in practice?