To clarify (since it seems to have been lost in translation from Zulip/GitHub to Discourse): I’m not opposed to `calloc` in principle, I just think the situations where it can provide a benefit are niche at best and dubious at worst.
I’m not 100% sure how it works on Windows (closed source and all that), but on Linux `calloc` usually works by having a dedicated page of zeros, which serves as backing memory for reads from a block allocated with `calloc`. When a program writes to that memory, the kernel intercepts the write, allocates a “real” page of memory, fills it with zeros, and only then lets the write of the actual data go through. This moves the time spent initializing the block with zeros from allocation time to use time. As far as I remember, this works similarly on Windows, but the process of intercepting the write and having the kernel allocate and fill a page is a bit more expensive there, which is why timings on Windows can sometimes differ from Linux.
I’ll now go through each example and give my thoughts.
Summation+allocation
Take the example with summation+allocation, i.e. `sum(zero_func(Float64, 1024, 1024))`. Yes, it’s faster, but think about what this operation is doing: you’re summing nothing at all. In the case of `calloc`, you’re not even hitting different memory, so I’m not sure what this is supposed to show. It’s like an idealized case where all accessed data is already in a hot L1 cache and you can access it in any random pattern and still get magically fast speeds. Once real data has been written here, the speed advantage of `calloc` vanishes: there are now real pages backing the allocation, which go in and out of hot caches, leading to the memory bottleneck @Elrod mentioned.
Summation
A similar case can be made for pure summation, with allocation taken out of the picture, i.e. `sum(C) setup = ( C = zeros_func(Float64, 1024, 1024) )`. I’m again not surprised `calloc` is faster here; it’s the same situation as above, hitting magically fast speeds because of an idealized caching scenario. The only difference is that the allocation itself is not counted as part of the benchmark (which, if I recall correctly from Zulip, was ~200µs on your machine, @mkitti? Please check, as it would explain the difference to summation+allocation right away). In the case of the current `zeros`, I’m again not surprised that it’s a little slower here. It uses `malloc` in the background and fills the memory with zeros manually, forcing the kernel to create real pages backing the memory. This leads to a more realistic situation regarding caches and doesn’t behave differently from real data (well, except for branch prediction having an idealized situation, but we’ll ignore that for now).
Writing
As for writing, the results again make sense when we think about what’s happening in the background. When iterating over the memory and writing to it, the `calloc` version has to page fault on the first touch of each page (i.e. the kernel still has to allocate and zero-fill a real page in the middle of the loop), BUT the `zeros` version doesn’t have to initialize the memory at that point, since that’s already been done at creation time. It can just write straight out to memory and let the CPU and its prefetchers manage pre-loading of cache lines etc.
Combined case
For the combined case, the benefits seem okay-ish, with 0.5-1ms in favor of `calloc` when comparing min and max. However, I’m not yet convinced this generalizes to more complex initialization than setting a single value, especially considering we’re not making use of the zeroed memory: we’re writing to the page, forcing a page fault and thus the allocation of a real page. We don’t even get to keep the magically fast read cache, depending on our writing pattern. To take advantage of that long term, our array would have to be sparse-ish, with fewer than one write per page of memory on average, and probably much fewer than that to see real benefits when reading again (are real sparse arrays already faster here? Probably, but I have no numbers to back that up).
If the question of whether to use `calloc` or `malloc` dominates your overall runtime (i.e. your problem is allocation bound), then maybe the first step should not be to think about `calloc` vs. `malloc`, but to ask “why am I allocating so much?”, “can I restructure my problem to reuse buffers?”, or “is this really a performance bottleneck?”. Don’t get me wrong, I’m a lover of performance as much as the next person on this forum, but at some point it’s time to think about what you’re optimizing for.
All in all, my personal reason for not being inclined to say “yeah, let’s `calloc` everything” is the following: ease of use and predictability of performance. As was shown both in the benchmarks and in my attempt at explaining them above, `calloc` mostly shifts the burden of initialization from allocation to computation. This muddies the water when profiling, and I’d much rather have a clear indication of “oh, we’re spending time zeroing memory” when looking at a flame graph than have the cost hidden behind the first iteration of my loop. When profiling, it forces me as a user to think about whether I really need a zero-initialized array in my code. If the conclusion is yes, I really do need it because I’m reading a zero before writing some other data and that’s the fastest way I can express this, then great! That’s the use case my suggestion of “maybe a switch to `zeros` for lazy initialization?” came from. If `zeros` were lazy by default, this would be much harder to spot, because it requires deep knowledge about how an OS handles memory to even identify that as a possibility when debugging a “slow” loop that should be faster. I don’t think that’s a reasonable expectation to have here, as evidenced by the time @mkitti spent digging into this.
@Elrod I’m not surprised by your `fill!` comparison. Conceptually, both `zeros_via_calloc` and `zeros` do the same amount of work (zeroing memory and immediately overwriting it again), whereas the `Array{Float64}(undef, ...)` version only writes; it never reads a zero, making the initialization useless. I think this only strengthens my point about knowing the code in question and how it is used in order to squeeze out optimal performance. That’s also why I like the eager behavior of `zeros`: it makes reasoning about the code much easier, and the cost is immediate and obvious rather than hidden behind the first use.
I think the canonical issue for this sort of thing is implement `@fill([0, 0], 2)` · Issue #41209 · JuliaLang/julia · GitHub. It also has a bunch of links to further discussion, though that’s probably not quite on-topic here.
I’m not sure this is relevant to the discussion about the semantics & behavior of `zeros`, though? As far as that allocation goes, you never want to read from an explicitly `undef` array anyway, so from my POV the first use should always be a write, making an initialization guarantee a performance hit on the first write of each page. I’m not the one to make a call here though; there is already plenty of discussion about this, e.g. in zero out memory of uninitialized fields · Issue #9147 · JuliaLang/julia · GitHub, which mentions that the situation for arrays is different from that for types (you never want to read from an `undef` array, and if you do, you have a bug).