I have these three ways of creating a zero vector.
module m
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
N = 1000000000
println("a = fill(0.0, N);")
@time a = fill(0.0, N);
println("undefined, a .= 0.0;")
@time a = Vector{Float64}(undef, N);
@time a .= 0.0;
println("undefined, zeroout(a);")
@time a = Vector{Float64}(undef, N);
@time zeroout(a)
nothing
end
I have a multicore machine. Is there something faster?
That’s a pretty surprising conclusion; I’d wager your benchmarks aren’t as representative as you’d want them to be. As written there, you’re likely intermingling GC and compilation times.
@mbauman You are right. I had to be a bit more sophisticated. I obtained
pkrysl@firenze:~/testheat$ ../julia-1.10.1/bin/julia -t 4 m.jl
a = fill(0.0, N);
4.156 s (2 allocations: 7.45 GiB)
a = zeros(N);
4.156 s (2 allocations: 7.45 GiB)
undefined, a .= 0.0;
108.077 μs (2 allocations: 7.45 GiB)
4.147 s (0 allocations: 0 bytes)
undefined, zeroout(a);
108.736 μs (2 allocations: 7.45 GiB)
1.125 s (21 allocations: 2.08 KiB)
with
using BenchmarkTools
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
N = 1000000000
println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
end
println("a = zeros(N);")
let
@btime a = zeros($N);
end
println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end
println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end
nothing
end
You can rely on the OS providing zeroed-out memory. In all modern OSes it must be zeroed; if it weren’t, it would be a security risk, since you could read data from other processes.
But most allocations you make (except at the start of your program) do not come from the OS; they reuse your own process’s memory. It doesn’t seem very helpful for libc to zero out memory there, because reusing memory within your own process isn’t a security risk (at least it shouldn’t be; you shouldn’t be reading that not-zeroed-out memory anyway).
Since asking for zeroed memory isn’t too uncommon, maybe Julia should preemptively do it for you (I believe Java may do that). Julia has threaded GC by now, and it seems plausible to me that not just freeing but also zeroing could be helpful. You could have a pool of pre-zeroed memory, and maybe another pool for undef; the latter pool could always be zeroed on demand if you need more zeroed memory.
[My preference would be undef arrays with known read-bounds, which need not be the same as the write-bounds; the read-bounds could be lazily expanded and zeroed, but that’s harder to implement.]
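As an aside, one way to observe this OS behavior from Julia is an anonymous memory map: the kernel hands the process fresh pages, which it must zero for the security reason above, so no explicit fill is needed. A minimal sketch (the `N` here is arbitrary, just for illustration; this is not code from the thread):

```julia
using Mmap

# Anonymous (not file-backed) mmap: the kernel supplies fresh pages,
# which it zeroes before handing them to the process.
N = 1_000_000
a = Mmap.mmap(Vector{Float64}, N)

# Every element reads as zero without any explicit initialization.
all(==(0.0), a)
```

Like `calloc`, the actual zeroing may be deferred to the first touch of each page, so the allocation itself is cheap.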
If it is known that the memory is already zeroed, do you really want to zero it again?
As Palli notes, when memory is initially given to a process it must be zeroed for security reasons. The memory that may not be zeroed is memory that was freed and is being reallocated to the same process. For Julia, this is usually memory that has been garbage collected.
If you carefully control allocations and do not have a long running process, much of your memory will likely be zeroed before it is allocated to Julia.
Numpy and other languages use calloc by default when zeroed memory is requested.
I don’t want to touch the vector twice. I want to get the vector filled with zeroes. As demonstrated in my experiments, allocating the vector as undef and then running the initialization on threads is the fastest approach (as far as I know).
I’m not sure I understand what you mean. calloc will obtain an array filled with zeros. Below calloc beats zeroout by a factor of 2x on my machine.
Here’s my setup.
julia> using BenchmarkTools
julia> function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
zeroout (generic function with 1 method)
julia> function demo_zeroout(N)
a = Vector{Float64}(undef, N)
zeroout(a)
return sum(a)
end
demo_zeroout (generic function with 1 method)
julia> function demo_calloc(N)
ptr = Libc.calloc(N, sizeof(Float64)) |> Ptr{Float64}
a = unsafe_wrap(Array, ptr, N; own=false)
ret = sum(a)
Libc.free(ptr)
return ret
end
demo_calloc (generic function with 1 method)
Interesting. What happens when you run this on your machine?
module m
using BenchmarkTools
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
return unsafe_wrap(Array{T}, ptr, dims; own=true)
end
N = 100000
println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end
println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end
println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end
println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end
println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end
nothing
end
Because the arrays are explicitly set to zero, so presumably there will be no difference relative to the tests above. The summing was introduced because of the possibly delayed action of calloc: the zeroing may not actually happen until the memory is first touched.
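That lazy behavior can be seen directly. A sketch (not from the thread; the two sums are purely illustrative): calloc returns logically zeroed memory, but the OS may back it with copy-on-write zero pages, so the physical work happens on first touch, which is why a first full read can be slower than subsequent ones.

```julia
# calloc-backed array: logically zero immediately, but pages may only be
# materialized when first touched (e.g. by the first sum below).
N = 1_000_000
ptr = Ptr{Float64}(Libc.calloc(N, sizeof(Float64)))
a = unsafe_wrap(Array, ptr, N; own = false)

s1 = sum(a)  # first touch: page faults happen here
s2 = sum(a)  # pages already resident; typically faster

Libc.free(ptr)  # own = false above, so we free manually
```

Both sums return 0.0 either way; only the timing differs, which is what benchmarking the allocation alone (without a read) hides.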
julia> module m
using BenchmarkTools
function zeroout(a)
@inbounds Threads.@threads for k in eachindex(a)
a[k] = 0.0
end
end
function zeros_via_calloc(::Type{T}, dims::Integer...) where {T}
ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
return unsafe_wrap(Array{T}, ptr, dims; own=true)
end
N = 100000
println("a = fill(0.0, N);")
let
@btime a = fill(0.0, $N);
@btime begin a = fill(0.0, $N); sum(a) end
end
println("a = zeros(N);")
let
@btime a = zeros($N);
@btime begin a = zeros($N); sum(a) end
end
println("a = zeros_via_calloc(N);")
let
@btime a = zeros_via_calloc(Float64, $N);
@btime begin a = zeros_via_calloc(Float64, $N); sum(a) end
end
println("undefined, a .= 0.0;")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime a .= 0.0 setup=(a = Vector{Float64}(undef, $N);)
end
let
@btime sum(a .= 0.0) setup=(a = Vector{Float64}(undef, $N);)
end
println("undefined, zeroout(a);")
let
@btime a = Vector{Float64}(undef, $N);
end
let
@btime zeroout(a) setup=(a = Vector{Float64}(undef, $N);)
end
let
@btime (zeroout(a); sum(a)) setup=(a = Vector{Float64}(undef, $N);)
end
nothing
end
a = fill(0.0, N);
29.135 μs (2 allocations: 781.30 KiB)
68.500 μs (2 allocations: 781.30 KiB)
a = zeros(N);
29.740 μs (2 allocations: 781.30 KiB)
68.408 μs (2 allocations: 781.30 KiB)
a = zeros_via_calloc(N);
29.544 μs (2 allocations: 781.31 KiB)
66.764 μs (2 allocations: 781.31 KiB)
undefined, a .= 0.0;
3.326 μs (2 allocations: 781.30 KiB)
32.861 μs (0 allocations: 0 bytes)
67.446 μs (0 allocations: 0 bytes)
undefined, zeroout(a);
3.602 μs (2 allocations: 781.30 KiB)
44.711 μs (181 allocations: 18.33 KiB)
105.163 μs (181 allocations: 18.33 KiB)
Main.m