Fastest way of getting a long zero vector?

I have these three ways of creating a zero vector.

module m

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
@time a = fill(0.0, N);

println("undefined, a .= 0.0;")
@time a = Vector{Float64}(undef, N);
@time a .= 0.0;

println("undefined, zeroout(a);")
@time a = Vector{Float64}(undef, N);
@time zeroout(a)

nothing
end

I have a multicore machine. Is there something faster?

You could use calloc. See e.g. Faster zeros with calloc and ArrayAllocators.jl

3 Likes

I do need to write into it later. So it looks like the threaded version is the fastest overall.

You can write into a calloc-allocated array.

Correct. But at that point the OS will do any delayed zeroing out, and things will slow down.
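A quick way to see that deferred cost (a sketch, not a rigorous benchmark; it uses the standard `Libc.calloc`/`unsafe_wrap` pattern, with `own=true` handing the buffer to Julia's GC):

```julia
# calloc returns almost immediately: pages are only reserved, not written.
# The first pass over the data pays the page-fault/zeroing cost; the
# second pass is a plain read over resident memory.
N = 10^8
ptr = Ptr{Float64}(Libc.calloc(N, sizeof(Float64)))
a = unsafe_wrap(Array, ptr, N; own=true)
@time sum(a)   # first touch: includes the deferred zeroing
@time sum(a)   # second pass: memory is now resident
```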

That’s a pretty surprising conclusion; I’d think your benchmarks aren’t as representative as you’d want them to be before coming to that conclusion. As they’re written there, you’re likely intermingling GC and compilation times.

@mbauman You are right. I had to be a bit more sophisticated. I obtained

pkrysl@firenze:~/testheat$ ../julia-1.10.1/bin/julia -t 4 m.jl
a = fill(0.0, N);
  4.156 s (2 allocations: 7.45 GiB)
a = zeros(N);
  4.156 s (2 allocations: 7.45 GiB)
undefined, a .= 0.0;
  108.077 μs (2 allocations: 7.45 GiB)
  4.147 s (0 allocations: 0 bytes)
undefined, zeroout(a);
  108.736 μs (2 allocations: 7.45 GiB)
  1.125 s (21 allocations: 2.08 KiB)

with



module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
end

println("a = zeros(N);")
let
@btime a = zeros($N);
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

The multithreaded function for zeroing out the vector pays off even for moderately long (in fact, quite short) vectors. For instance:

 ../julia-1.10.1/bin/julia -t 4 m.jl
N = 100000
a = fill(0.0, N);
  76.781 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  76.991 μs (2 allocations: 781.30 KiB)
undefined, a .= 0.0;
  1.705 μs (2 allocations: 781.30 KiB)
  75.407 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.677 μs (2 allocations: 781.30 KiB)
  23.214 μs (21 allocations: 2.08 KiB)
2 Likes

It is possible that the OS or libc implementation already has a pool of zeroed memory pages that it can just assign.

I’m afraid expecting that to happen will not be very robust, will it?

You can rely on the OS providing zeroed-out memory, at least on modern OSes; if it didn't, that would be a security risk, since you could read data from other processes.

But most allocations you make (except at the start of your program) do not come from the OS; they reuse your own process's memory. It doesn't seem very helpful for libc to zero out memory there, because reusing memory within your own process is not a security risk (at least it shouldn't be… you shouldn't be reading that non-zeroed memory).

Since asking for zeroed memory isn't too uncommon, maybe Julia should preemptively do it for you (I believe Java does something like that). Julia has a threaded GC by now, and it seems plausible to me that not just freeing but also zeroing could be helpful. You could have a pool of pre-zeroed memory, and perhaps a separate one for undef; the latter could always be zeroed on demand if you need more of the former.

[My preference would be undef with known read-bounds that need not be the same as the write-bounds; the read-bounds could be lazily expanded and zeroed, but that's harder to implement.]

If it is known that the memory is already zeroed, do you really want to zero it again?

As Palli notes, when memory is initially given to a process it must be zeroed for security reasons. The memory that may not be zeroed is memory that was freed and is being reallocated to the same process. For Julia, this is usually memory that has been garbage collected.

If you carefully control allocations and do not have a long running process, much of your memory will likely be zeroed before it is allocated to Julia.
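One way to get a feel for this is to inspect a fresh undef buffer (purely illustrative; what you see depends on the allocator, the allocation size, and the process history):

```julia
# Large undef allocations usually come straight from the OS as fresh
# (already-zeroed) pages; smaller ones may reuse dirty memory that the
# GC reclaimed earlier in the process.
a = Vector{Float64}(undef, 10^7)
println(count(!iszero, a), " nonzero entries out of ", length(a))
```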

NumPy and other libraries use calloc by default when zeroed memory is requested.

I don’t want to mess with the vector twice. I want to get the vector filled with zeroes. As demonstrated in my experiments, allocating the vector filled with undef and then running the initialization on threads is the fastest approach (as far as I know).

Maybe Polyester.@batch will be faster for that.
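For reference, a minimal sketch of that variant, assuming Polyester.jl is installed (`zeroout_batch` is just an illustrative name):

```julia
using Polyester

# Same loop as zeroout, but scheduled with Polyester's lightweight
# @batch threads instead of Threads.@threads.
function zeroout_batch(a)
    @batch for k in eachindex(a)
        a[k] = 0.0
    end
    return a
end
```

Polyester's threads have lower startup overhead than Threads.@threads, so whether this wins will depend on N and the thread count.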

I’m not sure I understand what you mean. calloc will obtain an array filled with zeros. Below calloc beats zeroout by a factor of 2x on my machine.

Here’s my setup.

julia> using BenchmarkTools

julia> function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
zeroout (generic function with 1 method)

julia> function demo_zeroout(N)
           a = Vector{Float64}(undef, N)
           zeroout(a)
           return sum(a)
       end
demo_zeroout (generic function with 1 method)

julia> function demo_calloc(N)
           ptr = Libc.calloc(N, sizeof(Float64)) |> Ptr{Float64}
           a = unsafe_wrap(Array, ptr, N; own=false)
           ret = sum(a)
           Libc.free(ptr)
           return ret
       end
demo_calloc (generic function with 1 method)

Here are the results on my computer:

julia> @btime demo_zeroout($N)
  84.700 μs (83 allocations: 790.97 KiB)
0.0

julia> @btime demo_calloc($N)
  35.900 μs (1 allocation: 48 bytes)
0.0

julia> Threads.nthreads()
16

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 16 default, 0 interactive, 8 GC (on 16 virtual cores)

If you want a more compact syntax, you can try ArrayAllocators.jl

julia> using ArrayAllocators

julia> function demo_arrayallocators(N)
           a = Vector{Float64}(calloc, N)
           return sum(a)
       end
demo_arrayallocators (generic function with 1 method)

julia> @btime demo_arrayallocators($N)
  83.300 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  69.100 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  78.000 μs (5 allocations: 781.36 KiB)
0.0

The timing is a bit wobbly since we let Julia own the array in this case, but you don’t have to deal with the pointer directly.

3 Likes

Interesting. What happens when you run this on your machine?

module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end

function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

N = 100000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end

println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end

println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

I get

../julia-1.10.1/bin/julia -t 2 m.jl
a = fill(0.0, N);
  73.067 μs (2 allocations: 781.30 KiB)
  104.308 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  74.800 μs (2 allocations: 781.30 KiB)
  104.306 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  35.291 μs (2 allocations: 781.31 KiB)
  103.690 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  1.464 μs (2 allocations: 781.30 KiB)
  75.019 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.448 μs (2 allocations: 781.30 KiB)
  46.265 μs (11 allocations: 1.06 KiB)

But also

 ../julia-1.10.1/bin/julia -t 8 m.jl
a = fill(0.0, N);
  74.336 μs (2 allocations: 781.30 KiB)
  101.080 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  71.916 μs (2 allocations: 781.30 KiB)
  101.121 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  72.413 μs (2 allocations: 781.31 KiB)
  102.255 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  2.376 μs (2 allocations: 781.30 KiB)
  73.545 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  2.131 μs (2 allocations: 781.30 KiB)
  22.844 μs (41 allocations: 4.11 KiB)
julia> versioninfo()
Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, sandybridge)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

Why aren’t you summing either undefined variant?

1 Like

Because the arrays are explicitly set to zero, so presumably there would be no difference relative to the tests above. The summing was introduced only because of the possible delayed action of calloc.

Sure, but shouldn’t you compare to using sum on the “eagerly” zeroed memory as well? We’re not really comparing apples to apples here.

PS> julia -t 2 bench.jl
a = fill(0.0, N);
  16.600 μs (2 allocations: 781.30 KiB)
  48.400 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  59.100 μs (2 allocations: 781.30 KiB)
  58.200 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  60.100 μs (2 allocations: 781.31 KiB)
  79.100 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  550.000 ns (2 allocations: 781.30 KiB)
  49.800 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  470.000 ns (2 allocations: 781.30 KiB)
  49.500 μs (11 allocations: 1.25 KiB)

PS> julia -t 8 bench.jl
a = fill(0.0, N);
  68.600 μs (2 allocations: 781.30 KiB)
  64.100 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  28.400 μs (2 allocations: 781.30 KiB)
  93.400 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  71.800 μs (2 allocations: 781.31 KiB)
  114.700 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  5.459 μs (2 allocations: 781.30 KiB)
  66.900 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  4.132 μs (2 allocations: 781.30 KiB)
  34.200 μs (41 allocations: 4.86 KiB)

Also we should compare zeroout to fill! rather than fill.

julia> @btime fill!(a, 0.0) setup=(a = Vector{Float64}(undef, $N););
  35.300 μs (0 allocations: 0 bytes)
2 Likes

Passing over data is expensive.

julia> module m
       using BenchmarkTools
       
       function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
       
       function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
           ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
           return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
       
       N = 100000
       
       println("a = fill(0.0, N);")
       let
       @btime a = fill(0.0, $N);
       @btime begin a = fill(0.0, $N); sum(a) end
       end
       
       println("a = zeros(N);")
       let
       @btime a = zeros($N);
       @btime begin a = zeros($N); sum(a) end
       end
       
       println("a = zeros_via_calloc(N);")
       let
       @btime a = zeros_via_calloc(Float64, $N);
       @btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
       end
       
       println("undefined, a .= 0.0;")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime sum(a .= 0.0) setup=(a = Vector{Float64}(undef, $N);)
       end
       println("undefined, zeroout(a);")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime (zeroout(a); sum(a)) setup=(a = Vector{Float64}(undef, $N);)
       end
       nothing
       end
        
a = fill(0.0, N);
  29.135 μs (2 allocations: 781.30 KiB)
  68.500 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  29.740 μs (2 allocations: 781.30 KiB)
  68.408 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  29.544 μs (2 allocations: 781.31 KiB)
  66.764 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  3.326 μs (2 allocations: 781.30 KiB)
  32.861 μs (0 allocations: 0 bytes)
  67.446 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  3.602 μs (2 allocations: 781.30 KiB)
  44.711 μs (181 allocations: 18.33 KiB)
  105.163 μs (181 allocations: 18.33 KiB)
Main.m

1 Like