Fastest way of getting a long zero vector?

I have these three ways of creating a zero vector.

module m

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
@time a = fill(0.0, N);

println("undefined, a .= 0.0;")
@time a = Vector{Float64}(undef, N);
@time a .= 0.0;

println("undefined, zeroout(a);")
@time a = Vector{Float64}(undef, N);
@time zeroout(a)

nothing
end

I have a multicore machine. Is there something faster?

You could use calloc. See e.g. Faster zeros with calloc and ArrayAllocators.jl
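For example, a minimal sketch using the calloc allocator object from ArrayAllocators.jl (the same approach appears further down this thread):

using ArrayAllocators  # provides the `calloc` allocator object

N = 1_000_000_000
a = Vector{Float64}(calloc, N)  # zero-filled; the OS typically zeroes the pages lazily on first touch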


I do need to write into it later. So it looks like the threaded version is the fastest overall.

You can write into a calloc-allocated array.

Correct. But at that point the OS will do any delayed zeroing out, and things will slow down.
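A rough way to see that lazy-zeroing cost (a sketch, not a careful benchmark; the exact behaviour depends on the OS and the allocator):

# Sketch: calloc returns quickly, but the first pass that writes the pages
# pays the page-fault/zeroing cost; a second pass over mapped pages is cheaper.
N = 100_000_000
@time p = Ptr{Float64}(Libc.calloc(N, sizeof(Float64)))  # fast: pages not touched yet
a = unsafe_wrap(Array, p, N; own = false)
@time a .= 1.0   # first touch: OS faults in zeroed pages
@time a .= 1.0   # already-mapped pages: faster
Libc.free(p)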

That’s a pretty surprising conclusion; I’d think your benchmarks aren’t as representative as you’d want them to be before coming to that conclusion. As they’re written there, you’re likely intermingling GC and compilation times.

@mbauman You are right. I had to be a bit more sophisticated. I obtained

pkrysl@firenze:~/testheat$ ../julia-1.10.1/bin/julia -t 4 m.jl
a = fill(0.0, N);
  4.156 s (2 allocations: 7.45 GiB)
a = zeros(N);
  4.156 s (2 allocations: 7.45 GiB)
undefined, a .= 0.0;
  108.077 μs (2 allocations: 7.45 GiB)
  4.147 s (0 allocations: 0 bytes)
undefined, zeroout(a);
  108.736 μs (2 allocations: 7.45 GiB)
  1.125 s (21 allocations: 2.08 KiB)

with



module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
end

println("a = zeros(N);")
let
@btime a = zeros($N);
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

The multithreaded function for zeroing out the vector pays off even for moderately long (actually rather short) vectors. For instance:

 ../julia-1.10.1/bin/julia -t 4 m.jl
N = 100000
a = fill(0.0, N);
  76.781 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  76.991 μs (2 allocations: 781.30 KiB)
undefined, a .= 0.0;
  1.705 μs (2 allocations: 781.30 KiB)
  75.407 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.677 μs (2 allocations: 781.30 KiB)
  23.214 μs (21 allocations: 2.08 KiB)

It is possible that the OS or libc implementation already has a pool of zeroed memory pages that it can just assign.

I’m afraid expecting that to happen will not be very robust, will it?

You can rely on the OS providing zeroed-out memory. All modern OSes do this (or at least the mainstream ones); otherwise it would be a security risk, since you could read data left behind by other processes.

But most allocations you do (except at the start of your program) do not come from the OS; they reuse your process’s own memory. It doesn’t seem very helpful for libc to zero out that memory, because reusing memory within your own process isn’t a security risk (at least it shouldn’t be… you shouldn’t be reading that not-zeroed-out memory anyway).

Since asking for zeros isn’t too uncommon, maybe Julia should preemptively do it for you (I believe Java does something along those lines). Julia has a multithreaded GC by now, and it seems plausible to me that not just freeing but also zeroing memory in the background could be helpful. You could keep such a pool of pre-zeroed memory, and maybe a separate one for undef allocations; the latter pool could always be zeroed on demand if you need more of the former.

[My preference would be undef with known read-bounds, which need not be the same as the write-bounds; the read-bounds could be lazily expanded and zeroed, but that is harder to implement.]
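Something along the lines of that pool could be sketched in user code today (hypothetical names, not an existing Julia facility): keep a Channel of vectors and have a background task re-zero them after use.

# Hypothetical pool of pre-zeroed buffers; ZeroPool, acquire! and release!
# are made-up names for illustration only.
struct ZeroPool
    ch::Channel{Vector{Float64}}
end

function ZeroPool(len::Int, nbufs::Int)
    ch = Channel{Vector{Float64}}(nbufs)
    for _ in 1:nbufs
        put!(ch, zeros(len))       # fill the pool with zeroed buffers up front
    end
    return ZeroPool(ch)
end

acquire!(pool::ZeroPool) = take!(pool.ch)   # hand out an already-zeroed buffer

function release!(pool::ZeroPool, a::Vector{Float64})
    Threads.@spawn begin            # re-zero off the critical path
        fill!(a, 0.0)
        put!(pool.ch, a)
    end
    return nothing
end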

If it is known that the memory is already zeroed, do you really want to zero it again?

As Palli notes, when memory is initially given to a process it must be zeroed for security reasons. The memory that may not be zeroed is memory that was freed and is being reallocated to the same process. For Julia, this is usually memory that has been garbage collected.

If you carefully control allocations and do not have a long running process, much of your memory will likely be zeroed before it is allocated to Julia.

Numpy and other languages use calloc by default when zeroed memory is requested.

I don’t want to mess with the vector twice. I want to get the vector filled with zeroes. As demonstrated in my experiments, allocating the vector as undef and then running the initialization on threads is the fastest approach (as far as I know).

Maybe Polyester.@batch will be faster for that.
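For reference, a Polyester-based variant might look like this (an untested sketch; @batch has lower scheduling overhead than Threads.@threads, which may or may not matter for a memory-bound loop):

using Polyester

function zeroout_batch(a)
    @batch for k in eachindex(a)  # Polyester's lightweight threaded loop
        a[k] = 0.0
    end
    return a
end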

I’m not sure I understand what you mean. calloc will obtain an array filled with zeros. Below calloc beats zeroout by a factor of 2x on my machine.

Here’s my setup.

julia> using BenchmarkTools

julia> function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
zeroout (generic function with 1 method)

julia> function demo_zeroout(N)
           a = Vector{Float64}(undef, N)
           zeroout(a)
           return sum(a)
       end
demo_zeroout (generic function with 1 method)

julia> function demo_calloc(N)
           ptr = Libc.calloc(N, sizeof(Float64)) |> Ptr{Float64}
           a = unsafe_wrap(Array, ptr, N; own=false)
           ret = sum(a)
           Libc.free(ptr)
           return ret
       end
demo_calloc (generic function with 1 method)

Here are the results on my computer:

julia> @btime demo_zeroout($N)
  84.700 μs (83 allocations: 790.97 KiB)
0.0

julia> @btime demo_calloc($N)
  35.900 μs (1 allocation: 48 bytes)
0.0

julia> Threads.nthreads()
16

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 16 default, 0 interactive, 8 GC (on 16 virtual cores)

If you want a more compact syntax, you can try ArrayAllocators.jl

julia> using ArrayAllocators

julia> function demo_arrayallocators(N)
           a = Vector{Float64}(calloc, N)
           return sum(a)
       end
demo_arrayallocators (generic function with 1 method)

julia> @btime demo_arrayallocators($N)
  83.300 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  69.100 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  78.000 μs (5 allocations: 781.36 KiB)
0.0

The timing is a bit wobbly since we let Julia own the array in this case, but you don’t have to deal with the pointer directly.


Interesting. What happens when you run this on your machine?

module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end

function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

N = 100000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end

println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end

println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

I get

../julia-1.10.1/bin/julia -t 2 m.jl
a = fill(0.0, N);
  73.067 μs (2 allocations: 781.30 KiB)
  104.308 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  74.800 μs (2 allocations: 781.30 KiB)
  104.306 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  35.291 μs (2 allocations: 781.31 KiB)
  103.690 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  1.464 μs (2 allocations: 781.30 KiB)
  75.019 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.448 μs (2 allocations: 781.30 KiB)
  46.265 μs (11 allocations: 1.06 KiB)

But also

 ../julia-1.10.1/bin/julia -t 8 m.jl
a = fill(0.0, N);
  74.336 μs (2 allocations: 781.30 KiB)
  101.080 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  71.916 μs (2 allocations: 781.30 KiB)
  101.121 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  72.413 μs (2 allocations: 781.31 KiB)
  102.255 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  2.376 μs (2 allocations: 781.30 KiB)
  73.545 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  2.131 μs (2 allocations: 781.30 KiB)
  22.844 μs (41 allocations: 4.11 KiB)

julia> versioninfo()
Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, sandybridge)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

Why aren’t you summing either undefined variant?


Because the arrays are explicitly set to zero, so presumably there would be no difference relative to the tests above. The summing was introduced because of the possible delayed action of calloc.

Sure, but shouldn’t you compare to using sum on the “eagerly” zeroed memory? We’re not really comparing apples to apples here.

PS> julia -t 2 bench.jl
a = fill(0.0, N);
  16.600 μs (2 allocations: 781.30 KiB)
  48.400 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  59.100 μs (2 allocations: 781.30 KiB)
  58.200 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  60.100 μs (2 allocations: 781.31 KiB)
  79.100 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  550.000 ns (2 allocations: 781.30 KiB)
  49.800 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  470.000 ns (2 allocations: 781.30 KiB)
  49.500 μs (11 allocations: 1.25 KiB)

PS> julia -t 8 bench.jl
a = fill(0.0, N);
  68.600 μs (2 allocations: 781.30 KiB)
  64.100 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  28.400 μs (2 allocations: 781.30 KiB)
  93.400 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  71.800 μs (2 allocations: 781.31 KiB)
  114.700 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  5.459 μs (2 allocations: 781.30 KiB)
  66.900 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  4.132 μs (2 allocations: 781.30 KiB)
  34.200 μs (41 allocations: 4.86 KiB)

Also we should compare zeroout to fill! rather than fill.

julia> @btime fill!(a, 0.0) setup=(a = Vector{Float64}(undef, $N););
  35.300 μs (0 allocations: 0 bytes)

Passing over data is expensive.

julia> module m
       using BenchmarkTools
       
       function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
       
       function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
           ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
           return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
       
       N = 100000
       
       println("a = fill(0.0, N);")
       let
       @btime a = fill(0.0, $N);
       @btime begin a = fill(0.0, $N); sum(a) end
       end
       
       println("a = zeros(N);")
       let
       @btime a = zeros($N);
       @btime begin a = zeros($N); sum(a) end
       end
       
       println("a = zeros_via_calloc(N);")
       let
       @btime a = zeros_via_calloc(Float64, $N);
       @btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
       end
       
       println("undefined, a .= 0.0;")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime sum(a .= 0.0) setup=(a = Vector{Float64}(undef, $N);)
       end
       println("undefined, zeroout(a);")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime (zeroout(a); sum(a)) setup=(a = Vector{Float64}(undef, $N);)
       end
       nothing
       end
        
a = fill(0.0, N);
  29.135 μs (2 allocations: 781.30 KiB)
  68.500 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  29.740 μs (2 allocations: 781.30 KiB)
  68.408 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  29.544 μs (2 allocations: 781.31 KiB)
  66.764 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  3.326 μs (2 allocations: 781.30 KiB)
  32.861 μs (0 allocations: 0 bytes)
  67.446 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  3.602 μs (2 allocations: 781.30 KiB)
  44.711 μs (181 allocations: 18.33 KiB)
  105.163 μs (181 allocations: 18.33 KiB)
Main.m
