I have these three ways of creating a zero vector.
module m
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
N = 1000000000
println("a = fill(0.0, N);")
@time a = fill(0.0, N);
println("undefined, a .= 0.0;")
@time a = Vector{Float64}(undef, N);
@time a .= 0.0;
println("undefined, zeroout(a);")
@time a = Vector{Float64}(undef, N);
@time zeroout(a)
nothing
end
I have a multicore machine. Is there something faster?
That's a pretty surprising conclusion; I'd think your benchmarks aren't as representative as you'd want them to be before coming to that conclusion. As they're written there, you're likely intermingling GC and compilation times.
@mbauman You are right. I had to be a bit more sophisticated. I obtained
pkrysl@firenze:~/testheat$ ../julia-1.10.1/bin/julia -t 4 m.jl
a = fill(0.0, N);
4.156 s (2 allocations: 7.45 GiB)
a = zeros(N);
4.156 s (2 allocations: 7.45 GiB)
undefined, a .= 0.0;
108.077 μs (2 allocations: 7.45 GiB)
4.147 s (0 allocations: 0 bytes)
undefined, zeroout(a);
108.736 μs (2 allocations: 7.45 GiB)
1.125 s (21 allocations: 2.08 KiB)
with
using BenchmarkTools
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
N = 1000000000
println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
end
println("a = zeros(N);")
let
@btime a = zeros($N);
end
println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end
println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end
nothing
end
You can rely on the OS providing zeroed-out memory. All mainstream modern OSes do this; if they didn't, it would be a security risk, since you could read data left behind by other processes.
But most allocations you make (except at the start of your program) do not come from the OS. They reuse your own process's memory, and it doesn't seem very helpful for libc to zero that out, because reusing memory within your own process isn't a security risk (at least it shouldn't be… you shouldn't be reading that not-zeroed-out memory anyway).
Since asking for zeros isn't too uncommon, maybe Julia should preemptively do it for you (I believe Java may do that). Julia has a threaded GC by now, and it seems plausible to me that not just freeing but also zeroing freed memory could be helpful. You could have such a pool of pre-zeroed memory, and maybe a separate pool for undef; the latter pool could always be zeroed on demand if you need more of the former.
[My preference would be undef with known read-bounds, which need not be the same as the write-bounds; the read-bounds could be lazily expanded and zeroed, but that's harder to implement.]
If it is known that the memory is already zeroed, do you really want to zero it again?
As Palli notes, when memory is initially given to a process it must be zeroed for security reasons. The memory that may not be zeroed is memory that was freed and is being reallocated to the same process. For Julia, this is usually memory that has been garbage collected.
If you carefully control allocations and do not have a long running process, much of your memory will likely be zeroed before it is allocated to Julia.
NumPy and other libraries use calloc by default when zeroed memory is requested.
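For illustration, a minimal sketch of requesting zeroed memory via calloc from Julia might look like this (the helper name `zeros_calloc` is mine, not from NumPy or Julia):

```julia
# Sketch: wrap zeroed memory from libc's calloc in a Julia Vector.
# With own = true, Julia's GC calls free on the pointer when the
# array is collected, so no manual free is needed.
function zeros_calloc(::Type{T}, n::Integer) where {T}
    ptr = Ptr{T}(Libc.calloc(n, sizeof(T)))
    ptr == C_NULL && throw(OutOfMemoryError())
    return unsafe_wrap(Array, ptr, n; own = true)
end

a = zeros_calloc(Float64, 10)  # a == zeros(10)
```

Note that calloc can hand back lazily zeroed pages, so part of the zeroing cost may only show up when the memory is first touched.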
I don't want to touch the vector twice. I want to get the vector filled with zeroes. As demonstrated in my experiments, allocating the vector as undef and then running the initialization on threads is the fastest approach (as far as I know).
I'm not sure I understand what you mean. calloc will give you an array filled with zeros. Below, calloc beats zeroout by a factor of 2x on my machine.
Here's my setup.
julia> using BenchmarkTools
julia> function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
zeroout (generic function with 1 method)
julia> function demo_zeroout(N)
a = Vector{Float64}(undef, N)
zeroout(a)
return sum(a)
end
demo_zeroout (generic function with 1 method)
julia> function demo_calloc(N)
ptr = Libc.calloc(N, sizeof(Float64)) |> Ptr{Float64}
a = unsafe_wrap(Array, ptr, N; own=false)
ret = sum(a)
Libc.free(ptr)
return ret
end
demo_calloc (generic function with 1 method)
Interesting. What happens when you run this on your machine?
module m
using BenchmarkTools
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
return unsafe_wrap(Array{T}, ptr, dims; own=true)
end
N = 100000
println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end
println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end
println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end
println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end
println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end
nothing
end
Because the arrays are explicitly set to zero, there should presumably be no difference relative to the tests above. The summing was introduced because of the possibly delayed (lazy) zeroing performed by calloc.
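To make that delayed action concrete, here is a small sketch of my own (not part of the benchmarks above): the first traversal of a freshly calloc'd array pays the page-fault cost of materializing the zero pages, while a second traversal does not.

```julia
n = 10_000_000
# calloc returns zeroed memory, but the pages are typically mapped lazily
ptr = Ptr{Float64}(Libc.calloc(n, sizeof(Float64)))
a = unsafe_wrap(Array, ptr, n; own = false)
@time s1 = sum(a)  # first touch: includes the page faults mapping zero pages
@time s2 = sum(a)  # pages already mapped; typically much faster
Libc.free(ptr)
```

Both sums return 0.0, since calloc guarantees the memory reads as zeros either way; only the timing differs.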
julia> module m
using BenchmarkTools
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
return unsafe_wrap(Array{T}, ptr, dims; own=true)
end
N = 100000
println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end
println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end
println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end
println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end
let
@btime sum(a .= 0.0) setup=(a = Vector{Float64}(undef, $N);)
end
println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end
let
@btime (zeroout(a); sum(a)) setup=(a = Vector{Float64}(undef, $N);)
end
nothing
end
a = fill(0.0, N);
29.135 μs (2 allocations: 781.30 KiB)
68.500 μs (2 allocations: 781.30 KiB)
a = zeros(N);
29.740 μs (2 allocations: 781.30 KiB)
68.408 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
29.544 μs (2 allocations: 781.31 KiB)
66.764 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
3.326 μs (2 allocations: 781.30 KiB)
32.861 μs (0 allocations: 0 bytes)
67.446 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
3.602 μs (2 allocations: 781.30 KiB)
44.711 μs (181 allocations: 18.33 KiB)
105.163 μs (181 allocations: 18.33 KiB)
Main.m