Fastest way of getting a long zero vector?

I have these three ways of creating a zero vector.

module m

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
@time a = fill(0.0, N);

println("undefined, a .= 0.0;")
@time a = Vector{Float64}(undef, N);
@time a .= 0.0;

println("undefined, zeroout(a);")
@time a = Vector{Float64}(undef, N);
@time zeroout(a)

nothing
end

I have a multicore machine. Is there something faster?

You could use calloc. See e.g. Faster zeros with calloc and ArrayAllocators.jl
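
For concreteness, a minimal sketch of the Libc.calloc route (this mirrors the code in the linked thread; own=true hands the memory over to Julia to free later):

N = 10^6
ptr = Ptr{Float64}(Libc.calloc(N, sizeof(Float64)))  # memory arrives zero-initialized
a = unsafe_wrap(Array, ptr, N; own=true)             # wrap as a Vector{Float64}; Julia will free it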


I do need to write into it later. So it looks like the threaded version is the fastest overall.

You can write into a calloc-allocated array.

Correct. But at that point the OS will do any delayed zeroing out, and things will slow down.

That’s a pretty surprising conclusion; I’d think your benchmarks aren’t as representative as you’d want them to be before coming to that conclusion. As they’re written there, you’re likely intermingling GC and compilation times.

@mbauman You are right. I had to be a bit more sophisticated. I obtained

pkrysl@firenze:~/testheat$ ../julia-1.10.1/bin/julia -t 4 m.jl
a = fill(0.0, N);
  4.156 s (2 allocations: 7.45 GiB)
a = zeros(N);
  4.156 s (2 allocations: 7.45 GiB)
undefined, a .= 0.0;
  108.077 μs (2 allocations: 7.45 GiB)
  4.147 s (0 allocations: 0 bytes)
undefined, zeroout(a);
  108.736 μs (2 allocations: 7.45 GiB)
  1.125 s (21 allocations: 2.08 KiB)

with this script:

module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end
N = 1000000000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
end

println("a = zeros(N);")
let
@btime a = zeros($N);
end

println("a = zeros(N);")
let
@btime a = zeros($N);
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

The multithreaded function for zeroing out the vector pays off even for moderately long (really, quite short) vectors. For instance:

 ../julia-1.10.1/bin/julia -t 4 m.jl
N = 100000
a = fill(0.0, N);
  76.781 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  76.991 μs (2 allocations: 781.30 KiB)
undefined, a .= 0.0;
  1.705 μs (2 allocations: 781.30 KiB)
  75.407 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.677 μs (2 allocations: 781.30 KiB)
  23.214 μs (21 allocations: 2.08 KiB)

It is possible that the OS or libc implementation already has a pool of zeroed memory pages that it can just assign.

I’m afraid expecting that to happen will not be very robust, will it?

You can rely on the OS providing zeroed-out memory, at least on modern OSes; if it didn’t, that would be a security risk, since you could read data left behind by other processes.

But most allocations you make (except at the start of your program) do not come from the OS; they are your own memory being reused. It doesn’t seem very helpful for libc to zero out memory, because if you reuse memory from your own process it’s not a security risk (at least it shouldn’t be… you shouldn’t be reading that not-zeroed-out memory anyway).

Since asking for zeroed memory isn’t too uncommon, maybe Julia should do it for you preemptively (I believe Java may do that). Julia has a threaded GC by now, and it seems plausible to me that not just freeing but also zeroing could be done there. You could keep such a pool of zeroed memory, and maybe another pool for undef; the latter could always be zeroed on demand if you need more of the former.

[My preference would be undef with known read-bounds, which need not be the same as the write-bounds; the read-bounds could be lazily expanded and zeroed, but that is harder to implement.]
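
To make the idea concrete, a purely illustrative user-level sketch of such a pool (hypothetical names; not anything Julia or its GC actually does today):

const ZERO_POOL = Vector{Vector{Float64}}()  # hypothetical pool of pre-zeroed buffers

function take_zeroed(n::Integer)
    i = findfirst(b -> length(b) == n, ZERO_POOL)
    return i === nothing ? zeros(n) : popat!(ZERO_POOL, i)
end

function release!(b::Vector{Float64})
    fill!(b, 0.0)   # re-zero now (or on a GC/background thread), so the next take_zeroed is free
    push!(ZERO_POOL, b)
    return nothing
end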

If it is known that the memory is already zeroed, do you really want to zero it again?

As Palli notes, when memory is initially given to a process it must be zeroed for security reasons. The memory that may not be zeroed is memory that was freed and is being reallocated to the same process. For Julia, this is usually memory that has been garbage collected.

If you carefully control allocations and do not have a long-running process, much of your memory will likely be zeroed before it is allocated to Julia.
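
One way to see this from Julia is a hedged sketch using Mmap's anonymous-mapping form: memory mapped straight from the OS arrives already zeroed.

using Mmap

# Anonymous mapping: the pages are handed out by the OS and arrive zero-filled.
a = Mmap.mmap(Vector{Float64}, 10^6)
@assert all(iszero, a)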

NumPy and other libraries use calloc by default when zeroed memory is requested.

I don’t want to touch the vector twice. I want to get the vector filled with zeros. As demonstrated in my experiments, allocating the vector as undef and then running the initialization on threads is the fastest approach (as far as I know).

Maybe Polyester.@batch will be faster for that.
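
For reference, a hedged sketch of what that might look like, assuming Polyester.jl’s @batch loop macro (it spawns cheaper tasks than Threads.@threads):

using Polyester

function zeroout_batch(a)
    # Same loop as zeroout, but using Polyester's lower-overhead threading.
    @batch for k in eachindex(a)
        a[k] = 0.0
    end
    return a
end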

I’m not sure I understand what you mean. calloc will give you an array filled with zeros. Below, calloc beats zeroout by a factor of about 2x on my machine.

Here’s my setup.

julia> using BenchmarkTools

julia> function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
zeroout (generic function with 1 method)

julia> function demo_zeroout(N)
           a = Vector{Float64}(undef, N)
           zeroout(a)
           return sum(a)
       end
demo_zeroout (generic function with 1 method)

julia> function demo_calloc(N)
           ptr = Libc.calloc(N, sizeof(Float64)) |> Ptr{Float64}
           a = unsafe_wrap(Array, ptr, N; own=false)
           ret = sum(a)
           Libc.free(ptr)
           return ret
       end
demo_calloc (generic function with 1 method)

Here are the results on my computer:

julia> @btime demo_zeroout($N)
  84.700 μs (83 allocations: 790.97 KiB)
0.0

julia> @btime demo_calloc($N)
  35.900 μs (1 allocation: 48 bytes)
0.0

julia> Threads.nthreads()
16

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 16 default, 0 interactive, 8 GC (on 16 virtual cores)

If you want a more compact syntax, you can try ArrayAllocators.jl

julia> using ArrayAllocators

julia> function demo_arrayallocators(N)
           a = Vector{Float64}(calloc, N)
           return sum(a)
       end
demo_arrayallocators (generic function with 1 method)

julia> @btime demo_arrayallocators($N)
  83.300 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  69.100 μs (5 allocations: 781.36 KiB)
0.0

julia> @btime demo_arrayallocators($N)
  78.000 μs (5 allocations: 781.36 KiB)
0.0

The timing is a bit wobbly since we let Julia own the array in this case, but you don’t have to deal with the pointer directly.


Interesting. What happens when you run this on your machine?

module m
using BenchmarkTools

function zeroout(a)
    @inbounds Threads.@threads for k in eachindex(a)
        a[k] = 0.0
    end
end

function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

N = 100000

println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end

println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end

println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end

println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end

println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end

nothing
end

I get

../julia-1.10.1/bin/julia -t 2 m.jl
a = fill(0.0, N);
  73.067 μs (2 allocations: 781.30 KiB)
  104.308 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  74.800 μs (2 allocations: 781.30 KiB)
  104.306 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  35.291 μs (2 allocations: 781.31 KiB)
  103.690 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  1.464 μs (2 allocations: 781.30 KiB)
  75.019 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  1.448 μs (2 allocations: 781.30 KiB)
  46.265 μs (11 allocations: 1.06 KiB)

But also

 ../julia-1.10.1/bin/julia -t 8 m.jl
a = fill(0.0, N);
  74.336 μs (2 allocations: 781.30 KiB)
  101.080 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  71.916 μs (2 allocations: 781.30 KiB)
  101.121 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  72.413 μs (2 allocations: 781.31 KiB)
  102.255 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  2.376 μs (2 allocations: 781.30 KiB)
  73.545 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  2.131 μs (2 allocations: 781.30 KiB)
  22.844 μs (41 allocations: 4.11 KiB)
julia> versioninfo()
Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, sandybridge)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

Why aren’t you summing either undefined variant?


Because those arrays are explicitly set to zero, so presumably there would be no difference relative to the tests above. The summing was only introduced because of the possible delayed action of calloc.

Sure, but shouldn’t you compare to using sum on the “eagerly” zeroed memory? We’re not really comparing apples to apples here.

PS> julia -t 2 bench.jl
a = fill(0.0, N);
  16.600 μs (2 allocations: 781.30 KiB)
  48.400 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  59.100 μs (2 allocations: 781.30 KiB)
  58.200 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  60.100 μs (2 allocations: 781.31 KiB)
  79.100 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  550.000 ns (2 allocations: 781.30 KiB)
  49.800 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  470.000 ns (2 allocations: 781.30 KiB)
  49.500 μs (11 allocations: 1.25 KiB)

PS> julia -t 8 bench.jl
a = fill(0.0, N);
  68.600 μs (2 allocations: 781.30 KiB)
  64.100 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  28.400 μs (2 allocations: 781.30 KiB)
  93.400 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  71.800 μs (2 allocations: 781.31 KiB)
  114.700 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  5.459 μs (2 allocations: 781.30 KiB)
  66.900 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  4.132 μs (2 allocations: 781.30 KiB)
  34.200 μs (41 allocations: 4.86 KiB)

Also, we should compare zeroout to fill! rather than fill.

julia> @btime fill!(a, 0.0) setup=(a = Vector{Float64}(undef, $N););
  35.300 μs (0 allocations: 0 bytes)

Passing over data is expensive.

julia> module m
       using BenchmarkTools
       
       function zeroout(a)
           @inbounds Threads.@threads for k in eachindex(a)
               a[k] = 0.0
           end
       end
       
       function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
           ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
           return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
       
       N = 100000
       
       println("a = fill(0.0, N);")
       let
       @btime a = fill(0.0, $N);
       @btime begin a = fill(0.0, $N); sum(a) end
       end
       
       println("a = zeros(N);")
       let
       @btime a = zeros($N);
       @btime begin a = zeros($N); sum(a) end
       end
       
       println("a = zeros_via_calloc(N);")
       let
       @btime a = zeros_via_calloc(Float64, $N);
       @btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
       end
       
       println("undefined, a .= 0.0;")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime sum(a .= 0.0) setup=(a = Vector{Float64}(undef, $N);)
       end
       println("undefined, zeroout(a);")
       let
       @btime a = Vector{Float64}(undef, $N);
       end
       let
       @btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
       end
       let
       @btime (zeroout(a); sum(a)) setup=(a = Vector{Float64}(undef, $N);)
       end
       nothing
       end
        
a = fill(0.0, N);
  29.135 μs (2 allocations: 781.30 KiB)
  68.500 μs (2 allocations: 781.30 KiB)
a = zeros(N);
  29.740 μs (2 allocations: 781.30 KiB)
  68.408 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
  29.544 μs (2 allocations: 781.31 KiB)
  66.764 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
  3.326 μs (2 allocations: 781.30 KiB)
  32.861 μs (0 allocations: 0 bytes)
  67.446 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
  3.602 μs (2 allocations: 781.30 KiB)
  44.711 μs (181 allocations: 18.33 KiB)
  105.163 μs (181 allocations: 18.33 KiB)
Main.m
