Faster zeros with calloc

Adapting some of my thoughts from the Zulip topic as well:

Why even care?

For the uninitiated, this topic didn’t just come out of nowhere. I think most on this forum are familiar with the semi-frequent comparison threads with other languages, where someone claims “Julia is slow” on the back of an unrepresentative benchmark. The problem is that the more changes you have to convince them to make for a “fair” comparison, the harder those changes become to defend. This is especially true when said changes appear to deviate from default language/stdlib functionality, which is exactly the case with `zeros` and `calloc`. The more friction a potential user experiences when trying to make their Julia port perform well, the more likely we are to see acrimonious forum posts or follow-up tweets. This is not a hypothetical scenario either; I think many of us can recall an example within recent memory.

I like to take the “pit of success” argument here.
Julia is such a nice language because it not only provides enough depth to write optimized programs, but also has enough taste injected into the design to choose what works for the majority of users by default, even if that’s not the most optimized path in every dimension. One big example of this is (im)mutable structs and heap allocation.

What are users more likely to want: slow setup but a faster first pass over the array, or faster time overall?
I’d argue the first is the more niche case, and for it we already have `fill(0, ...)` and `fill!(Array{T}(undef, ...), 0)`. Expecting new users to figure out that they should write `unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true)` to match the performance of other languages, on the other hand, is a tall order.
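
To make that contrast concrete, here is a minimal sketch of both paths. The `calloc_zeros` helper below is purely illustrative (the name is mine, not an existing or proposed API), and it is only valid for bits types whose zero value is all-zero bits:

```julia
# Eager zeroing: allocate, then pay an O(N) pass writing zeros up front.
eager = fill(0.0, 1024, 1024)
# equivalently: fill!(Matrix{Float64}(undef, 1024, 1024), 0.0)

# Lazy zeroing: calloc returns pages the OS zeroes on first touch.
# NOTE: `calloc_zeros` is an illustrative helper, not an existing API.
function calloc_zeros(::Type{T}, dims::Integer...) where {T}
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    ptr == C_NULL && throw(OutOfMemoryError())
    # own = true: Julia's GC takes ownership and will free() the buffer.
    return unsafe_wrap(Array, ptr, map(Int, dims); own = true)
end

lazy = calloc_zeros(Float64, 1024, 1024)
```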

With apologies to Einstein

The argument put forward in previous posts is that the “spooky action at a distance” caused by calloced memory faulting on first read rather than at initialization is not worth the increased overall performance, and that the cases where said performance matters are at best niche. I’ve already talked about the “this doesn’t come up enough” part of that argument (namely folks coming from NumPy), but let’s approach it from a different angle.
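
To see where the cost actually lands, here is a rough timing sketch. The numbers are machine- and OS-dependent (and assume the allocation is large enough to be mmap-backed); the point is only where the work shows up:

```julia
n = 10^8

# Eager: zeros() pays for writing every element before returning.
@time zeros(Float64, n)

# Lazy: calloc hands back pages the OS zeroes on demand, so the
# allocation itself looks nearly free...
A = @time unsafe_wrap(Array, Ptr{Float64}(Libc.calloc(n, sizeof(Float64))), n; own = true)

@time sum(A)  # ...and the page faults surface here, on the first pass
@time sum(A)  # second pass: pages are resident; this is the steady-state cost
```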

Say in some alternate timeline, Julia did use `calloc` by default for `zeros`. How likely is it that a discussion about the opposite behaviour would come up, i.e. an “I much prefer deterministic O(N) init for `zeros`, even at the cost of overall performance” thread? I posit that it likely wouldn’t surface at all! And if such a discussion did come up, imagine the responses on this forum: “why not just use `fill`, that has deterministic runtime”, “you probably shouldn’t be using Julia for crypto”, “this seems like a niche use case to reduce average performance for”, and of course “small arrays don’t undergo lazy init anyway”. Instead, we might get discussions like https://github.com/numpy/numpy/issues/16498#issuecomment-639179593 (“wow, zeros is fast, but I shouldn’t use it for certain benchmarks because it behaves slightly differently for those watching the perf counters”).

Control and lines in the sand

But back to language design. It seems odd that this is the bar where we say “no, feature X feels un-Julian”, when boxing, escape analysis, and heap-to-stack allocation hoisting all exist. A similar argument would be that this kind of explicit control is necessary in a high-performance language, but that doesn’t explain Rust’s eager use of `calloc` (`vec![0; n]`, for example, lowers to a zeroed allocation). Even C provides equally accessible interfaces for allocating cleared and uninitialized memory (`calloc` and `malloc`), while Julia only puts the latter front-and-centre with `similar`.

In closing, if you:

  1. Dislike wasting time on correcting misleading benchmark comparisons with other languages
  2. Would prefer to write fewer fenceposts or fun incantations in performance-critical code
  3. Have to work with dense arrays, or libraries that do sparse writes to dense arrays (e.g. some ADs; see the sketch below)

Then my pitch is that having `zeros` use `calloc` when appropriate is a good idea.
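
On point 3, the pattern looks roughly like this (reusing the illustrative `calloc_zeros` helper from above):

```julia
# Sparse writes into a dense zero buffer: the one-hot / scatter pattern
# common in reverse-mode AD. With lazily zeroed memory, only the touched
# pages are ever materialized; with eager zeros(), we pay a full O(N)
# write before storing a handful of adjoints.
grad = calloc_zeros(Float64, 10_000_000)
for i in (17, 42_000, 9_999_999)
    grad[i] += 1.0
end
```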
