Fastest way of getting a long zero vector?

I have these three ways of creating a zero vector.

module m

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
@time a = fill(0.0, N);

println("undefined, a .= 0.0;")
@time a = Vector{Float64}(undef, N);
@time a .= 0.0;

println("undefined, zeroout(a);")
@time a = Vector{Float64}(undef, N);
@time zeroout(a)

nothing
end

I have a multicore machine. Is there something faster?

You could use calloc. See e.g. Faster zeros with calloc and ArrayAllocators.jl
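
For concreteness, a minimal sketch of the Libc.calloc route (this mirrors the code in the linked thread; own=true hands the memory over to Julia to free later):

N = 10^6
ptr = Ptr{Float64}(Libc.calloc(N, sizeof(Float64)))  # memory arrives zero-initialized
a = unsafe_wrap(Array, ptr, N; own=true)             # wrap as a Vector{Float64}; Julia will free it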


I do need to write into it later. So it looks like the threaded version is the fastest overall.

You can write into a calloc-allocated array.

Correct. But at that point the OS will do any delayed zeroing out, and things will slow down.

That’s a pretty surprising conclusion; I’d think your benchmarks aren’t as representative as you’d want them to be before coming to that conclusion. As they’re written there, you’re likely intermingling GC and compilation times.

@mbauman You are right. I had to be a bit more sophisticated. I obtained

pkrysl@firenze:~/testheat$ ../julia-1.10.1/bin/julia -t 4 m.jl
a = fill(0.0, N);
  4.156 s (2 allocations: 7.45 GiB)
a = zeros(N);
  4.156 s (2 allocations: 7.45 GiB)
undefined, a .= 0.0;
  108.077 μs (2 allocations: 7.45 GiB)
  4.147 s (0 allocations: 0 bytes)
undefined, zeroout(a);
  108.736 μs (2 allocations: 7.45 GiB)
  1.125 s (21 allocations: 2.08 KiB)

with this script:

module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
end

println("a = zeros(N);")
let
@btime a = zeros($N);
end

println("a = zeros(N);")
let
@btime a = zeros($N);
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

The multithreaded function for zeroing out the vector pays off even for moderately long (really, quite short) vectors. For instance:

 ../julia-1.10.1/bin/julia -t 4 m.jl
N = 100000
a = fill(0.0, N);
  76.781 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  76.991 μs (2 allocations: 781.30 KiB)
undefined, a .= 0.0;
  1.705 μs (2 allocations: 781.30 KiB)
  75.407 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.677 μs (2 allocations: 781.30 KiB)
  23.214 μs (21 allocations: 2.08 KiB)

It is possible that the OS or libc implementation already has a pool of zeroed memory pages that it can just assign.

I’m afraid expecting that to happen will not be very robust, will it?

You can rely on the OS providing zeroed-out memory, at least on modern OSes; if it didn’t, that would be a security risk, since you could read data left behind by other processes.

But most allocations you make (except at the start of your program) do not come from the OS; they are your own memory being reused. It doesn’t seem very helpful for libc to zero out memory, because if you reuse memory from your own process it’s not a security risk (at least it shouldn’t be… you shouldn’t be reading that not-zeroed-out memory anyway).

Since asking for zeroed memory isn’t too uncommon, maybe Julia should do it for you preemptively (I believe Java may do that). Julia has a threaded GC by now, and it seems plausible to me that not just freeing but also zeroing could be done there. You could keep such a pool of zeroed memory, and maybe another pool for undef; the latter could always be zeroed on demand if you need more of the former.

[My preference would be undef with known read-bounds, which need not be the same as the write-bounds; the read-bounds could be lazily expanded and zeroed, but that is harder to implement.]
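
To make the idea concrete, a purely illustrative user-level sketch of such a pool (hypothetical names; not anything Julia or its GC actually does today):

const ZERO_POOL = Vector{Vector{Float64}}()  # hypothetical pool of pre-zeroed buffers

function take_zeroed(n::Integer)
    i = findfirst(b -> length(b) == n, ZERO_POOL)
    return i === nothing ? zeros(n) : popat!(ZERO_POOL, i)
end

function release!(b::Vector{Float64})
    fill!(b, 0.0)   # re-zero now (or on a GC/background thread), so the next take_zeroed is free
    push!(ZERO_POOL, b)
    return nothing
end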

If it is known that the memory is already zeroed, do you really want to zero it again?

As Palli notes, when memory is initially given to a process it must be zeroed for security reasons. The memory that may not be zeroed is memory that was freed and is being reallocated to the same process. For Julia, this is usually memory that has been garbage collected.

If you carefully control allocations and do not have a long-running process, much of your memory will likely be zeroed before it is allocated to Julia.
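
One way to see this from Julia is a hedged sketch using Mmap's anonymous-mapping form: memory mapped straight from the OS arrives already zeroed.

using Mmap

# Anonymous mapping: the pages are handed out by the OS and arrive zero-filled.
a = Mmap.mmap(Vector{Float64}, 10^6)
@assert all(iszero, a)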

NumPy and other libraries use calloc by default when zeroed memory is requested.

I don’t want to touch the vector twice. I want to get the vector filled with zeros. As demonstrated in my experiments, allocating the vector as undef and then running the initialization on threads is the fastest approach (as far as I know).

Maybe Polyester.@batch will be faster for that.
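
For reference, a hedged sketch of what that might look like, assuming Polyester.jl’s @batch loop macro (it spawns cheaper tasks than Threads.@threads):

using Polyester

function zeroout_batch(a)
    # Same loop as zeroout, but using Polyester's lower-overhead threading.
    @batch for k in eachindex(a)
        a[k] = 0.0
    end
    return a
end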

I’m not sure I understand what you mean. calloc will give you an array filled with zeros. Below, calloc beats zeroout by a factor of about 2x on my machine.

Here’s my setup.

julia> using BenchmarkTools

julia> function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
zeroout (generic function with 1 method)

julia> function demo_zeroout(N)
           a = Vector{Float64}(undef, N)
           zeroout(a)
           return sum(a)
       end
demo_zeroout (generic function with 1 method)

julia> function demo_calloc(N)
           ptr = Libc.calloc(N, sizeof(Float64)) |> Ptr{Float64}
           a = unsafe_wrap(Array, ptr, N; own=false)
           ret = sum(a)
           Libc.free(ptr)
           return ret
       end
demo_calloc (generic function with 1 method)

Here are the results on my computer:

julia> @btime demo_zeroout($N)
  84.700 μs (83 allocations: 790.97 KiB)
0.0

julia> @btime demo_calloc($N)
  35.900 μs (1 allocation: 48 bytes)
0.0

julia> Threads.nthreads()
16

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 16 default, 0 interactive, 8 GC (on 16 virtual cores)

If you want a more compact syntax, you can try ArrayAllocators.jl

julia> using ArrayAllocators

julia> function demo_arrayallocators(N)
           a = Vector{Float64}(calloc, N)
           return sum(a)
       end
demo_arrayallocators (generic function with 1 method)

julia> @btime demo_arrayallocators($N)
  83.300 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  69.100 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  78.000 μs (5 allocations: 781.36 KiB)
0.0

The timing is a bit wobbly since we let Julia own the array in this case, but you don’t have to deal with the pointer directly.


Interesting. What happens when you run this on your machine?

module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end

function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

N = 100000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end

println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end

println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

I get

../julia-1.10.1/bin/julia -t 2 m.jl
a = fill(0.0, N);
  73.067 μs (2 allocations: 781.30 KiB)
  104.308 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  74.800 μs (2 allocations: 781.30 KiB)
  104.306 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  35.291 μs (2 allocations: 781.31 KiB)
  103.690 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  1.464 μs (2 allocations: 781.30 KiB)
  75.019 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.448 μs (2 allocations: 781.30 KiB)
  46.265 μs (11 allocations: 1.06 KiB)

But also

 ../julia-1.10.1/bin/julia -t 8 m.jl
a = fill(0.0, N);
  74.336 μs (2 allocations: 781.30 KiB)
  101.080 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  71.916 μs (2 allocations: 781.30 KiB)
  101.121 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  72.413 μs (2 allocations: 781.31 KiB)
  102.255 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  2.376 μs (2 allocations: 781.30 KiB)
  73.545 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  2.131 μs (2 allocations: 781.30 KiB)
  22.844 μs (41 allocations: 4.11 KiB)
julia> versioninfo()
Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, sandybridge)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

Why aren’t you summing either undefined variant?


Because those arrays are explicitly set to zero, so presumably there would be no difference relative to the tests above. The summing was only introduced because of the possible delayed action of calloc.

Sure, but shouldn’t you compare to using sum on the “eagerly” zeroed memory? We’re not really comparing apples to apples here.

PS> julia -t 2 bench.jl
a = fill(0.0, N);
  16.600 μs (2 allocations: 781.30 KiB)
  48.400 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  59.100 μs (2 allocations: 781.30 KiB)
  58.200 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  60.100 μs (2 allocations: 781.31 KiB)
  79.100 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  550.000 ns (2 allocations: 781.30 KiB)
  49.800 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  470.000 ns (2 allocations: 781.30 KiB)
  49.500 μs (11 allocations: 1.25 KiB)

PS> julia -t 8 bench.jl
a = fill(0.0, N);
  68.600 μs (2 allocations: 781.30 KiB)
  64.100 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  28.400 μs (2 allocations: 781.30 KiB)
  93.400 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  71.800 μs (2 allocations: 781.31 KiB)
  114.700 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  5.459 μs (2 allocations: 781.30 KiB)
  66.900 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  4.132 μs (2 allocations: 781.30 KiB)
  34.200 μs (41 allocations: 4.86 KiB)

Also, we should compare zeroout to fill! rather than fill.

julia> @btime fill!(a, 0.0) setup=(a = Vector{Float64}(undef, $N););
  35.300 μs (0 allocations: 0 bytes)

Passing over data is expensive.

julia> module m
       using BenchmarkTools
       
       function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
       
       function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
           ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
           return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
       
       N = 100000
       
       println("a = fill(0.0, N);")
       let
       @btime a = fill(0.0, $N);
       @btime begin a = fill(0.0, $N); sum(a) end
       end
       
       println("a = zeros(N);")
       let
       @btime a = zeros($N);
       @btime begin a = zeros($N); sum(a) end
       end
       
       println("a = zeros_via_calloc(N);")
       let
       @btime a = zeros_via_calloc(Float64, $N);
       @btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
       end
       
       println("undefined, a .= 0.0;")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime sum(a .= 0.0) setup=(a = Vector{Float64}(undef, $N);)
       end
       println("undefined, zeroout(a);")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime (zeroout(a); sum(a)) setup=(a = Vector{Float64}(undef, $N);)
       end
       nothing
       end
        
a = fill(0.0, N);
  29.135 μs (2 allocations: 781.30 KiB)
  68.500 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  29.740 μs (2 allocations: 781.30 KiB)
  68.408 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  29.544 μs (2 allocations: 781.31 KiB)
  66.764 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  3.326 μs (2 allocations: 781.30 KiB)
  32.861 μs (0 allocations: 0 bytes)
  67.446 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  3.602 μs (2 allocations: 781.30 KiB)
  44.711 μs (181 allocations: 18.33 KiB)
  105.163 μs (181 allocations: 18.33 KiB)
Main.m
