Fastest way of getting a long zero vector?

The OS does this by setting a bit in the page table, so that the page is zeroed on first access. You could ensure this is done before use by zeroing one element per page right after calloc, i.e. every 512th Float64 element for a 4 KiB page. (Check the page size with @ccall getpagesize()::Cint.)
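A minimal sketch of that idea, assuming Float64 elements (4096 bytes per page / 8 bytes per element = 512 elements per page); calloc_zeros is just an illustrative name, not an existing function:

  # Sketch only: calloc returns lazily zeroed pages; touching one element per
  # 4 KiB page (every 512th Float64) makes the OS map and zero them up front
  # instead of on first use.
  function calloc_zeros(N::Integer)
      p = Ptr{Float64}(Libc.calloc(N, sizeof(Float64)))
      p == C_NULL && throw(OutOfMemoryError())
      v = unsafe_wrap(Array, p, N; own = true)   # freed with Libc.free when GC'd
      @inbounds for i in 1:512:N
          v[i] = 0.0                             # touch one element per page
      end
      return v
  end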

1 Like

Why, though?
In theory, zeroing just before use should mean a single pass of the data through the cache instead of two.
This is expensive. If you care about performance, you’d want to avoid this. (However, my benchmark above did not support this.)

If you don’t, you wouldn’t take on a dependency and would instead maximize readability via just .= 0 or similar.
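To make the single-pass point above concrete, here is a hedged sketch; compute_block is a hypothetical stand-in for whatever actually produces the data, and only the placement of the zeroing matters:

  # Hedged sketch: compute_block is a hypothetical placeholder kernel.
  N, B = 10^7, 4096
  compute_block(lo, len) = fill(1.0, len)

  # Two passes over `out`: one to zero it, a second to accumulate into it.
  out = zeros(N)
  for lo in 1:B:N
      hi = min(lo + B - 1, N)
      out[lo:hi] .+= compute_block(lo, hi - lo + 1)
  end

  # One pass: each block is zeroed right before it is written, while it is hot.
  out = Vector{Float64}(undef, N)
  for lo in 1:B:N
      hi = min(lo + B - 1, N)
      blk = view(out, lo:hi)
      blk .= 0.0
      blk .+= compute_block(lo, hi - lo + 1)
  end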

2 Likes

I was comparing zeroing out to allocating undef + setting to zero. Hence, fill.

I think I see what you are saying: once I start using the array, each page will eventually be zeroed when it is first accessed. When I look at it, it is zero; before then it might not be, but that is OK. True!

So, why isn’t fill(0, N) like that? Why doesn’t it use calloc?
Actually, zeros should be like that.

Wait, what? Why is fill! faster than a .= 0.0?
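For what it’s worth, here is one way to put numbers on these variants on a given machine (a BenchmarkTools sketch; results depend on the machine, the array size, and the Julia version):

  using BenchmarkTools

  N = 10^8
  @btime zeros($N);                  # allocate + fill with zero
  @btime fill(0.0, $N);              # same pattern, explicit value
  a = Vector{Float64}(undef, N)
  @btime fill!($a, 0.0);             # in-place fill
  @btime $a .= 0.0;                  # in-place broadcast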

A priori, I’d expect:

  • calloc to be fastest
  • memcpy to be next fastest
  • iterative methods to be next

And benchmarks of such a large array will necessarily be constrained by memory bandwidth on many systems.
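A back-of-envelope illustration of that constraint (the bandwidth figure is an assumption, not a measurement):

  # Assumed numbers: a 10^9-element Float64 vector is 8 GB; at ~20 GB/s of
  # sustained write bandwidth, one pass over it takes ~0.4 s no matter how
  # the zeroing is spelled.
  N = 10^9
  bytes = N * sizeof(Float64)      # 8.0e9 bytes
  bandwidth = 20e9                 # assumed bytes/second, machine-dependent
  bytes / bandwidth                # ≈ 0.4 seconds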

There are some “peephole” kinds of optimizations that explicitly switch to the lower-level memcpy for some methods that would naively be iterative. Exactly which methods will have gotten that treatment will vary based on what folks have done.
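One way to check whether a given pattern got that treatment on your Julia version is to look for memset/memcpy-style calls in the generated code, e.g.:

  using InteractiveUtils   # for @code_llvm outside the REPL

  # Look for `memset`/`llvm.memset` calls in the output; what you see
  # depends on the Julia version.
  a = Vector{Float64}(undef, 1024)
  @code_llvm debuginfo=:none fill!(a, 0.0)
  @code_llvm debuginfo=:none (x -> (x .= 0.0))(a)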

As far as always using calloc goes, it’s a rare still-open three-digit issue: implement zeros() by calling calloc · Issue #130 · JuliaLang/julia · GitHub. Someone just needs to devote time to it with some careful benchmarking.

1 Like

How about memset? This used to be the same as fill!, but I’m not sure if that is still the case.

help?> Libc.memset
  memset(dst::Ptr, val, n::Integer) -> Ptr{Cvoid}

  Call memset from the C standard library.

  │ Julia 1.10
  │
  │  Support for memset requires at least Julia 1.10.
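For concreteness, a minimal usage sketch (assumes Julia 1.10+ for Libc.memset; GC.@preserve keeps v rooted while its raw pointer is in use):

  # Zero an existing Float64 vector by setting all of its bytes to 0.
  v = Vector{Float64}(undef, 10^6)
  GC.@preserve v Libc.memset(pointer(v), 0, sizeof(v))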

Er, yeah, I meant a ccall to memset. Or bzero. It’s just my naive expectation — something that’s able to talk more directly to the OS might have some advantages.

They have optimized implementations that branch on sizes, for example.
They’ll use non-temporal stores when the memory is too big to fit in cache.

It should in general be the fastest way to set memory (but calloc should still be better for 0).

4 Likes

I can only mark one response as the solution. Sorry @mkitti @mbauman