Faster zeros with calloc

To clarify (since it seems to have been lost in translation from zulip/github to discourse), I’m not opposed to calloc in principle, I just think the situations where it can provide a benefit are niche at best and dubious at worst.

I’m not 100% sure how it works on windows (closed source and all that), but on linux calloc usually works by having a dedicated page of zeros, which is used as the backing memory when reading from a block that was allocated with calloc. When a program writes to that memory, the kernel intercepts the write, allocates a “real” page of memory, fills it with zeros, and only then performs the write you asked for. This moves the time spent initializing the block with zeros from allocation time to use time. As far as I remember, this works similarly on windows, but the process of intercepting the write and having the kernel allocate and fill a page is a bit more expensive than on linux, which is why timings on windows can sometimes differ from linux.
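A rough way to see this shifted cost (a sketch only, reusing the zeros_via_calloc helper defined later in this thread; exact numbers depend on OS, libc and hardware):

function zeros_via_calloc(::Type{T}, dims::Integer...) where T
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

n = 4096
A = @time zeros_via_calloc(Float64, n, n)  # cheap: pages are not physically backed yet
@time fill!(A, 1.0)                        # first full write pays the page faults + zeroing

B = @time zeros(Float64, n, n)             # eager: the zeroing (memset) happens here
@time fill!(B, 1.0)                        # later writes hit already-backed pages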


I’ll go through each example now and repeat my thoughts.

Summation+allocation

Take the example with summation+allocation, i.e. sum(zeros_func(Float64, 1024, 1024)). Yes it’s faster, but think about what this operation is doing - you’re summing nothing at all. In the case of calloc, you’re not even hitting different memory, so I’m not sure what this is supposed to show - it’s like an idealized case, where all accessed data is already in a hot L1 cache and you can access it in any random pattern and still have magically fast speed. After real data has been written here, the speed advantage of calloc vanishes - after all, there are now real pages backing the allocation, which go in and out of hot caches, leading to the memory bottleneck @Elrod mentioned.

Summation

A similar case can be made for pure summation, with allocation taken out of the picture, i.e. @benchmark sum(C) setup = (C = zeros_func(Float64, 1024, 1024)). I’m again not surprised calloc is faster here - it’s the same situation as above, hitting magic fast speed because of an idealized caching scenario. The only thing that’s different is that the allocation itself is not counted as part of the benchmark (which, if I recall correctly from zulip, was ~200µs on your machine, @mkitti? Please check, as it would explain the difference to summation+allocation right away). In case of the current zeros, I’m again not surprised that it’s a little slower here. It’s using malloc in the background and fills in the memory with 0 manually, forcing the kernel to create real pages to back this memory. This leads to a more realistic situation regarding caches and doesn’t behave differently from real data (well, except for branch prediction having an idealized situation, but we’ll ignore that for now).
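Spelled out with BenchmarkTools, the two comparisons discussed so far look roughly like this (a sketch; zeros_via_calloc is the calloc-backed helper defined later in the thread):

using BenchmarkTools

# summation + allocation: the allocation is part of every timed sample
@benchmark sum(zeros(Float64, 1024, 1024))
@benchmark sum(zeros_via_calloc(Float64, 1024, 1024))

# pure summation: the array is created in setup, outside the timed region
@benchmark sum(C) setup = (C = zeros(Float64, 1024, 1024))
@benchmark sum(C) setup = (C = zeros_via_calloc(Float64, 1024, 1024))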

Writing

As for writing, the results again make sense when we think about what’s happening in the background. When iterating over and writing to memory, the calloc version page faults (i.e. has to fetch memory that isn’t already in the L1-3 caches) probably just as often as the zeros version, BUT the zeros version doesn’t have to initialize the memory, since that’s already been done at creation time. It can just write out to memory directly and let the CPU and its prefetching manage the pre-loading of memory etc.
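Roughly, the writing comparison looks like this (a sketch; evals = 1 matters here, because with multiple evaluations per setup only the first write pass would actually fault in the calloc’ed pages):

using BenchmarkTools
@benchmark fill!(A, 1.0) setup = (A = zeros(Float64, 1024, 1024)) evals = 1
@benchmark fill!(A, 1.0) setup = (A = zeros_via_calloc(Float64, 1024, 1024)) evals = 1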

Combined case

For the combined case, the benefits seem okayish, with 0.5-1ms in favor of calloc when comparing min and max. However, I’m not yet convinced this generalizes to more complex initialization than setting a single value, especially considering we’re not making use of the zero memory since we’re writing to the page, forcing a page fault and thus forcing the allocation of a real page. We don’t even get to keep magically fast read-cache, depending on our writing pattern. To take advantage of that long term, our array would have to be sparse-ish, with less than one write per page of memory on average and probably much fewer than that to see real benefits when reading again (are real sparse arrays already faster here? Probably, but I have no numbers to back that up).
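To make the “sparse-ish” pattern concrete, here is a sketch of what it would take for reads to keep benefiting from the shared zero page (assuming 4 KiB pages and the zeros_via_calloc helper from later in the thread):

A = zeros_via_calloc(Float64, 1024, 1024)
pagelen = 4096 ÷ sizeof(Float64)       # 512 Float64s per 4 KiB page
for i in 1:(8 * pagelen):length(A)
    A[i] = 1.0                         # each write faults in (and zeroes) one page
end
s = sum(A)                             # pages never written to are still backed by the shared zero page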

If the question of whether to use calloc or malloc dominates your overall runtime (i.e. your problem is allocation bound), then maybe the first step should not be to think about whether to use calloc or malloc but “why am I allocating so much” and “can I restructure my problem to reuse buffers” or “is this really a performance bottleneck”. Don’t get me wrong, I’m a lover of performance as much as the next person on this forum, but at some point it’s time to think about what you’re optimizing for.

All in all, my personal reason for why I’m not inclined to say “yeah let’s calloc everything” is the following: ease of use and predictability of performance. As was shown both in the benchmarks and in my attempted explanation of them above, calloc mostly shifts the burden of initialization from the allocation to the computation. This muddies the water when profiling and I’d much rather have a clear indication “oh yeah we’re spending time zeroing memory” when looking at a flame graph than have the cost be hidden behind the first iteration in my loop. When profiling, it forces me as a user to think about whether I really need a zero-initialized array in my code. If the conclusion is yes, I really do need it because I’m reading a zero before writing some other data and that’s the fastest way I can express this, then great! That’s the use case my suggestion of “maybe add a switch to zeros for lazy initialization?” came from. If zeros were lazy by default, this would be much harder to spot, because it requires deep knowledge about how an OS handles memory to even identify that as a possibility when debugging a “slow” loop that should be faster. I don’t think that’s a reasonable expectation we should have here, as evidenced by the time spent digging into this by @mkitti.


@Elrod I’m not surprised by your fill! comparison - conceptually, both zeros_via_calloc and zeros do the same amount of work, zeroing memory and immediately overwriting it again, whereas the Array{Float64}(undef, ...) version only writes; it never reads a zero, making the zero-initialization useless. I think this only strengthens my point about knowing the code in question and how it is used to squeeze out optimal performance. That’s also why I like the eager behavior of zeros - it makes reasoning about the code much easier and the cost is immediate and obvious, it’s not hidden behind the first use.


I think the canonical issue for this sort of thing is implement `@fill([0, 0], 2)` · Issue #41209 · JuliaLang/julia · GitHub. It also has a bunch of links to further discussion, though it’s probably not quite on-topic here.


I’m not sure this is relevant to the discussion about semantics & behavior of zeros though? As far as that allocation goes, you never want to read from an explicitly undef array anyway, so from my POV the first use should always be a write, making the initialization guarantee a performance hit on the first write of a page. I’m not the one to make that call here though; there is already plenty of discussion about this, e.g. in zero out memory of uninitialized fields · Issue #9147 · JuliaLang/julia · GitHub, which mentions that the situation for arrays is different from that for types (you never want to read from an undef array, and if you do, you have a bug).

4 Likes

Just to be clear, the main question at hand is whether zeros should use calloc by default rather than the current default of malloc followed by memset. The question transcends programming languages.

https://vorpus.org/blog/why-does-calloc-exist/

For primitive types, Array{T}(undef, ...) should continue to use malloc. fill!(Array{T}(undef, ...), zero(T)) should continue to be available for an explicitly eager version. The question is if zeros should have an eager keyword option or if there should be another shorter method of eagerly making zero initialized memory available.

1 Like

That article basically gives two reasons:

  1. It checks for overflow when allocating and
  2. it doesn’t actually allocate physical memory for pages you never write to.

The first reason is not relevant for regular use of julia - memory allocation is abstracted away already.

The second reason gives the example of np.eye, which apparently calls calloc (since that’s the default for numpy). I don’t know why numpy does this allocation, but I don’t think the situation can be mapped 1-to-1 to julia, since we have vastly more kinds of specialized array types, in this case Diagonal, which only allocates its diagonal in the first place (which replaced eye back in julia 0.7 days!). This is vastly more efficient if you only ever read from it. If you intend to write, you can use spdiagm from the SparseArrays stdlib instead, again only using what you really need. I understand that for people coming from python this may be unfamiliar territory, but to me this power to express explicitly what I want to do is one of the great advantages of julia that I just don’t quite get with other languages, at least not in such a terse form.
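For context, the np.eye case expressed with Julia’s specialized array types (a sketch; sizes arbitrary):

using LinearAlgebra, SparseArrays

D = Diagonal(ones(1000))            # stores only the 1000 diagonal entries
S = spdiagm(0 => ones(1000))        # sparse: only the structural nonzeros, but writable
M = Matrix{Float64}(I, 1000, 1000)  # dense identity: allocates and fills all 10^6 entries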

I want to make it clear that I’m not opposed in principle to a version of calloc being exposed - but I am opposed to making it the default behavior, because it muddies the water when actually trying to find out where and why time is spent.

All in all, julia offers a lot of different tools with different use cases for different purposes. The obligation to know what they’re doing and what they want is placed on the user - after all, they know best what their algorithm requires. They are in the best position to make decisions about how to optimize. As such, having to dig into how julia allocates memory just to find out why the first iteration of a loop is slow feels wrong and more like a gotcha to me than seeing one large block of time spent in a zeros call (where the expectation should be that it’s slow because it has to zero out memory - there is no magic “go fast” switch).

2 Likes

What would be the best interface for calloc then? Since some say zeros shouldn’t exist, how should we make it available?

The idea loosely referred to on GitHub was that we may need a unique initializer that would then suggest the usage of calloc rather than malloc. Perhaps something like Array{T}(lazyzeroinit, dims...)?
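A minimal sketch of what such an initializer could look like at the user level (the names LazyZeroInit / lazyzeroinit are hypothetical, not an existing API, and this is only valid for element types whose zero really is the all-zero bit pattern):

struct LazyZeroInit end
const lazyzeroinit = LazyZeroInit()

# Hypothetical constructor method: dispatch on the initializer and defer zeroing to calloc.
function (::Type{Array{T}})(::LazyZeroInit, dims::Integer...) where T
    ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
    return unsafe_wrap(Array{T}, ptr, dims; own=true)
end

A = Array{Float64}(lazyzeroinit, 1024, 1024)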

If anyone is interested, here is where glibc decides whether to clear the memory itself or not in calloc

glibc calloc implementation, LGPL licensed https://github.com/bminor/glibc/blob/30891f35fa7da832b66d80d0807610df361851f3/malloc/malloc.c#L3537-L3557

Adapting some of my thoughts from the Zulip topic as well:

Why even care?

For the uninitiated, this topic didn’t just come out of nowhere. I think most on this forum are familiar with the semi-frequent comparison threads with other languages where someone claims “Julia is slow” but didn’t write a representative benchmark. The problem is, the more you have to convince them to change in their benchmark to put forward a “fair” comparison, the harder it becomes to defend those changes. This is especially true when said changes appear to deviate from default language/stdlib functionality, which is exactly the case with zeros and calloc. The more friction a potential user experiences with trying to make their Julia port perform well, the more likely we are to see acrimonious forum posts or follow-up tweets. This is not a hypothetical scenario either, I think many of us can think of an example within recent memory.

I like to take the pit of success argument here.
Julia is such a nice language because it not only provides you enough depth to write optimized programs, but also has enough taste injected into the design to choose what works for the majority of users by default, even if it’s not the most optimized path in every dimension. One big example of this is (im)mutable structs and heap allocation.

What are users more likely to want? Slow setup time but faster time for the first iteration of a hot loop, or faster time overall?
I’d argue the first is the more niche case, and for that we have fill(0, ...) + fill!(Array{T}(undef, ...), 0). Whereas expecting new users to figure out that they should write unsafe_wrap(Array{T}, Ptr{T}(Libc.calloc(prod(dims), sizeof(T))), dims; own = true) to match the performance of other languages is a tall order.

With apologies to Einstein

The argument put forward in previous posts is that the “spooky action at a distance” caused by calloced memory faulting on first use rather than at initialization is not worth the increased overall performance, and that cases where said performance matters are at best niche. I’ve already talked about the “this doesn’t come up enough” part of that argument (namely folks coming from Numpy), but let’s approach it from a different angle.

Say in some alternate timeline, Julia did use calloc by default for zeros. How likely would we have a discussion on the opposite behaviour come up? i.e. a “I much prefer deterministic O(N) init for zeros even at the cost of overall performance” thread? I posit to you that it likely wouldn’t surface at all! If such a discussion did come up, imagine the responses on this forum. “why not just use fill, that has deterministic runtime”, “you probably shouldn’t be using Julia for crypto”, “this seems like a niche use case to reduce average performance for”, and of course “small arrays don’t undergo lazy init already”. Instead, we might get discussions like https://github.com/numpy/numpy/issues/16498#issuecomment-639179593 (“wow, zeros is fast, but I shouldn’t use it for certain benchmarks because it behaves slightly differently for those watching the perf counters”).

Control and lines in the sand

But back to language design. It seems odd that this is the bar where we say “no, X feature feels un-Julian”, when Boxing, escape analysis and heap-to-stack allocation hoisting all exist. A similar argument would be that this kind of control is necessary in a high-performance language, but that doesn’t explain Rust’s eager use of calloc. Even C provides an equally accessible interface for allocating cleared and uninitialized memory, while Julia only puts the latter front-and-centre with similar.

In closing, if you:

  1. Dislike wasting time on correcting misleading benchmark comparisons with other languages
  2. Would prefer to write fewer fenceposts or fun incantations in performance critical code
  3. Have to work with dense arrays or libraries that do sparse writes to dense arrays (e.g. some ADs)

Then my pitch is that having zeros use calloc when appropriate is a good idea.

5 Likes

I feel like interacting with the posts and arguments presented in here would be a better mode of communication, as just plainly copy & pasting an argument presented on another platform feels bad, since I now feel like I have to copy & paste my answer to your answer again… Instead I’ll write out a fully formed, new post, as done in the OP.


Personally, I dislike the culture of “changing the benchmark until it’s correct” precisely because it feels very bad when you’re on the receiving end of it. I have been guilty of this in the past as well and as such, I nowadays try to explain why the presented benchmark gives the observed result without implying they’re doing it wrong unless explicitly asked. This can either happen in the post directly (“am I missing something/doing it wrong? what’s going on?”) or in a follow-up post (“Ah, I see! How could this be benchmarked better?”).

I’m not sure I follow here - are you referring to julia’s stdlib with “default language […] functionality”? From my POV, changing zeros to be implicitly lazy (by using calloc) would precisely introduce such an inconsistency. Across the board, laziness and eagerness are strictly separated, often by living in the Base.Iterators submodule. Making zeros lazy but having ones be eager feels just as arbitrary - why is one lazy and not the other? The answer would be “because there’s no equivalent to calloc for ones”.

I don’t like the argument about the “pit of success” here for two reasons

  1. it’s very subjective
  2. it implies there’s only “one true way”

The “pit of success” is different for every use case. What’s ideal for one program and developer is an annoyance for another - making a blanket statement about what ought to be done is what feels un-julian to me, not the discussion about “should we expose calloc at all”. I’ve reiterated that point both on zulip and up above - I’m not opposed to calloc!

I don’t understand why this should be an either/or case? I’m arguing that we should have BOTH, but default to the former, because making code behave in a way that’s easier to reason about is a good thing.

When you write zeros(Float64, ...), write to some locations and pass that array on to another piece of code that you may or may not have written, you don’t necessarily control how the resulting array is used. Conversely, when you use zeros in library code, you have no idea how a user/receiver of that array will use it. Both as a user of a library and as a developer of a library, I’d like to have an easier time following where time is spent and why.

With a lazy zeros using calloc, this is not possible - you do not necessarily know when or how that array is used/accessed. The performance cost of calloc can manifest whenever any element that happens to be on a new page is accessed, which may not be close to the allocation site at all. How would you identify that this time was spent on a zeroing page fault caused by calloc? The information that this is happening is not even present in a strace of system calls. All you can easily observe is that some loop over some indices somewhere in the stack is slower than you expect, and figuring out why is a gargantuan task. Having to pay the cost up front makes it trivial - a flame graph identifies the aggregate time spent in the zeros call.

Note that I’m not saying that we shouldn’t use calloc ever - that is not the case! What I am saying is that this should only be used when the performance implications about how the cost is distributed are understood and accepted, hence the default for eager initialization.

My complaint would be the same - calloc is not transparent about where the cost of initialization moves to, and thus should be avoided except in cases where it’s understood to be ok. Naive uses of different allocation schemes (fill(zero(T), ..) vs. a naive zeros(T, ...)) should not have an impact on the performance of a loop in a completely different part of the stack.

If this is really the kind of answer you expect on this forum, I’m extremely saddened that this is the impression the julia community gives off. All of these sound extremely dismissive and, frankly, come across as “get off my lawn” instead of embracing others’ POV. I for one would be happy if people were to implement crypto (and maybe prove correctness?) in julia. I also want julia to succeed in areas other than ML, AD, DiffEq, Modeling and similar.

This feels like a strawman - I’m equally missing tools for investigating the compiler and what it’s doing in these cases. I’m not sure why that would be an argument against having a choice between lazy and eager zeros.

I don’t like just pointing to some implementation somewhere and saying “look, X is doing it too” anymore when the comparison is python or rust. Rust conveniently gives you the choice to use an allocator that doesn’t use calloc but instead always uses malloc - a choice neither julia nor python expose, as the GC is treated as a black box, not to be meddled with by the user.

Reading through the source of Vec (Rust’s equivalent to julia vectors), I couldn’t help but notice that growing a vector doesn’t guarantee anything about whether or not the new memory is zeroed. In contrast, julia already documents that resize! with a larger size does not initialize the resulting memory. How should this be dealt with should a zeros array be allocated via calloc? As far as I’m aware, there is no realloc equivalent that guarantees the grown portion of a calloced block is zeroed.
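Concretely, the documented resize! behaviour being referred to (a sketch):

v = zeros(Float64, 4)
resize!(v, 8)   # v[5:8] are not guaranteed to be initialized; they may hold arbitrary bits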

  1. I don’t get what’s specific about zeros that this blanket statement should have weight here. This can be said about any difference between julia and any other language.
  2. A few points:
    a) How do off-by-one errors factor into this discussion at all?
    b) I don’t see how having false as a sense of strong zero is relevant to whether or not zeros is lazy or eager. If you get something that wasn’t 0, by definition, that array wasn’t zeroed - no matter the initialization method.
    c) I still don’t understand how that line of code is a strong argument either way. Having false as a strong zero not working for user defined types is, from my POV, not a problem of Flux or DiffEq or Zygote, but of the user defined type implementing multiplication wrong. That this is very loosely and badly (or not at all) defined is also bad, but has nothing to do with whether zeros is eager or lazy. In fact, for user defined types zeros has to use malloc with initialization because a zero-filled block of memory may not be valid for zero of a user defined type! As you note, it may be boxed or a number of other things, like not being isbits or simply not having a block of zeros as its zero representation.
  3. That’s a single use case which I really hope is not the exclusive intended use case for julia as a whole.

I don’t think the points you put in a list support either lazy or eager zeros or even the ability for the user to choose. To me it just comes across as “I want this because it’s helpful in my case” to which I say “ok, let’s have a user controlled switch then since users can’t affect the memory allocator meaningfully here”.


I haven’t seen anyone claim “zeros shouldn’t exist” yet, so I’m not quite sure what you mean here.

The trouble with that is that it’s not as simple as switching between calloc and malloc behind the scenes, especially not for user defined types (for which zeros also has to work). For example, suppose we’d add a switch between lazy and eager initialization, do we then provide a LazyInitializedMatrix type that gets returned for custom user defined types that don’t neatly map to calloc? The page faulting process is completely invisible to julia user code as far as I know, so we can’t even hook into it to lazily initialize user types.

Add to this that this would be (as far as I know) the only place in julia where user code could explicitly influence how something is allocated, and my gut tells me “let’s be careful what we want here and what the implications are”.

1 Like

Would it be appropriate to take the “usual” approach for new features, i.e. start with a package, gather use cases and data, and let that inform how to eventually fold it into core? It seems like zeros_via_calloc is faster on some systems and in some cases, but it’s uncertain where it might not be better, and it’s potentially confusing to benchmarkers. The power users will surely figure all that out given an option like @fastzeros.

As an analogy, I gather that LoopVectorization is a prototype for eventual core language functionality, with the documentation clearly warning about potential misuses. It sounds like lazy zeros could be similarly deployed. Meanwhile, I thoroughly appreciate @mkitti’s well-researched post, and the well-reasoned discussion from @sukera and others.

2 Likes

See Deprecate `ones`? · Issue #24444 · JuliaLang/julia · GitHub

Julia already needs to zero out boxed value array elements and fields for GC. That’s why I mentioned that it’s already unavoidable. As such, excluding this case will underestimate the importance of the optimization discussed here.
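For illustration (a sketch): for boxed element types the underlying pointer slots are already zeroed at allocation, which is what makes isassigned meaningful:

v = Vector{String}(undef, 2)
isassigned(v, 1)   # false - the slot is a zero-initialized (null) pointer, not garbage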

Seems like calloc really should be implemented in a way to benefit from multithreading:

julia> using LoopVectorization, BenchmarkTools

julia> function zeros_via_calloc(::Type{T}, dims::Integer...) where T
          ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
          return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_via_calloc (generic function with 1 method)

julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
           A = f(Float64, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = A[i]
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 1 method)

julia> function alloctest(dims::Vararg{Integer,N}) where {F,N}
           A = Array{Float64}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               Ai = 0.0
               @turbo for j in 1:16
                   Ai += exp(i-j)
               end
               A[i] = Ai
           end
           A
       end
alloctest (generic function with 2 methods)

julia> @benchmark zeros(8192, 8192)
BenchmarkTools.Trial: 94 samples with 1 evaluation.
 Range (min … max):  49.796 ms … 146.351 ms  ┊ GC (min … max): 0.00% … 65.90%
 Time  (median):     51.714 ms               ┊ GC (median):    3.40%
 Time  (mean ± σ):   53.449 ms ±  12.958 ms  ┊ GC (mean ± σ):  6.49% ±  9.56%

  █ █
  █▁█▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  49.8 ms       Histogram: log(frequency) by time       133 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark alloctest(8192, 8192) # undef init
BenchmarkTools.Trial: 134 samples with 1 evaluation.
 Range (min … max):  33.063 ms … 144.814 ms  ┊ GC (min … max): 0.00% … 77.03%
 Time  (median):     36.147 ms               ┊ GC (median):    2.81%
 Time  (mean ± σ):   37.299 ms ±  12.321 ms  ┊ GC (mean ± σ):  9.25% ± 10.36%

  █ ▇▄
  ████▇▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  33.1 ms       Histogram: log(frequency) by time       122 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 184.

julia> @benchmark alloctest(zeros, 8192, 8192) # zeros
BenchmarkTools.Trial: 55 samples with 1 evaluation.
 Range (min … max):  85.009 ms … 197.439 ms  ┊ GC (min … max): 0.00% … 56.90%
 Time  (median):     87.641 ms               ┊ GC (median):    0.88%
 Time  (mean ± σ):   91.011 ms ±  19.084 ms  ┊ GC (mean ± σ):  6.29% ± 10.42%

  █ ▅
  █▃█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  85 ms           Histogram: frequency by time          174 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 183.

julia> @benchmark alloctest(zeros_via_calloc, 8192, 8192) # zeros_via_calloc
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):   81.286 ms … 251.090 ms  ┊ GC (min … max):  0.00% … 67.55%
 Time  (median):      81.993 ms               ┊ GC (median):     0.46%
 Time  (mean ± σ):   100.907 ms ±  31.495 ms  ┊ GC (mean ± σ):  19.33% ± 17.28%

  █          ▇
  █▁▁▁▁▁▁▁▁▁▁█▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  81.3 ms       Histogram: log(frequency) by time        251 ms <

 Memory estimate: 512.02 MiB, allocs estimate: 187.

I.e., seems like it should be possible for the specific thread doing the first write to a page to also do the zeroing. If this were the case, calloc would have two benefits:

  1. single pass over the array instead of two passes
  2. implicit (local) multithreading of the zeroing when combined with multithreaded code.

However, that does not appear to be the case.

This is a reasonably common pattern, where an array is initialized and then passed over multiple times to update the values. Here I wanted to only update once of course, to make the potential benefits easier to detect.

2 Likes

Is one of those “page” words supposed to be “thread”?

Yes (fixed).

Here are my results on Windows with 16 threads. alloctest(8192, 8192) and alloctest(zeros_via_calloc, 8192, 8192) seem to take about the same time.

julia> @benchmark zeros(8192, 8192)
BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  124.103 ms … 228.608 ms  ┊ GC (min … max):  0.00% … 43.77%
 Time  (median):     157.009 ms               ┊ GC (median):    17.05%
 Time  (mean ± σ):   169.346 ms ±  44.435 ms  ┊ GC (mean ± σ):  23.65% ± 19.20%

  █                                                  ▃
  █▇▅▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▅█▁▁▁▁▁▁▄ ▁
  124 ms           Histogram: frequency by time          229 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark alloctest(8192, 8192)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  150.885 ms … 346.944 ms  ┊ GC (min … max):  0.28% … 55.35%
 Time  (median):     180.302 ms               ┊ GC (median):     0.29%
 Time  (mean ± σ):   211.417 ms ±  55.850 ms  ┊ GC (mean ± σ):  20.95% ± 19.02%

    ▄ █▁                      ▁    ▁
  ▆▁█▁██▆▁▆▆▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▆█▆▁▁▆▁▁▁▆▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  151 ms           Histogram: frequency by time          347 ms <

 Memory estimate: 512.01 MiB, allocs estimate: 83.

julia> @benchmark alloctest(zeros, 8192, 8192)
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  267.953 ms … 468.588 ms  ┊ GC (min … max):  0.23% … 41.70%
 Time  (median):     280.005 ms               ┊ GC (median):     0.18%
 Time  (mean ± σ):   318.286 ms ±  61.175 ms  ┊ GC (mean ± σ):  14.20% ± 14.64%

  █▃▃                       ▃
  ███▇▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁█▇▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  268 ms           Histogram: frequency by time          469 ms <

 Memory estimate: 512.01 MiB, allocs estimate: 83.

julia> @benchmark alloctest(zeros_via_calloc, 8192, 8192)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  159.149 ms … 357.459 ms  ┊ GC (min … max):  0.40% … 54.82%
 Time  (median):     181.814 ms               ┊ GC (median):     0.28%
 Time  (mean ± σ):   214.017 ms ±  56.574 ms  ┊ GC (mean ± σ):  20.51% ± 18.43%

  █ █▃▃                    █ ▃
  █▇███▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▇█▇▁▁▇▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▇ ▁
  159 ms           Histogram: frequency by time          357 ms <

 Memory estimate: 512.01 MiB, allocs estimate: 83.

julia> Threads.nthreads()
16

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

Ah, that issue - while technically true, in the end that issue was closed precisely because it was decided to keep it. I guess what I meant was that I haven’t seen this idea of getting rid of it entirely been brought up recently.

Right, but that’s only relevant for the Array{T}(undef, ...) case, since that’s when a user will be exposed to it. I think the GC doesn’t distinguish between initialized and not-initialized memory at all right now, unlike rust (well, at least during allocation).

To expand on what I mean: for those boxed user types T, a call to zeros must initialize them properly with a call to zero, else you may not receive a valid initialization at all when returning. All entries must ultimately point to initialized zeros after all, since we cannot intercept the page faulting mechanism to do our own custom zero initialization. In that case, going the route of zeros -> calloc -> fill with zero(T) makes the call to calloc completely redundant, since the memory has to be filled with concrete values either way. Incidentally, I believe the same is true for non-boxed or inline-stored types whose representation of zero(T) is not just a zeroed block of memory (unlike integers and floats, where it is).
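A concrete example from Base (a sketch): zero(Rational{Int64}) is 0//1, whose bit pattern (numerator 0, denominator 1) is not all zeros, so a calloc-style block of zero bytes would not be a valid zero for it:

zero(Rational{Int64})                           # 0//1 - the denominator is 1, not 0
reinterpret(Rational{Int64}, zeros(UInt64, 2))  # all-zero bits read back as 0//0, not a valid zero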

This is a fundamental difference that you can’t circumvent without introducing a new array type which guarantees that the first read from an index has a result that’s == zero(T), if you want it to be done lazily. The current eager implementation does the correct thing, always, with no special care for whether a type is ok with having all zero bits as its zero or not.

I’m not surprised by this - the zeroing is not happening in user space after all. It’s happening in the kernel. The kernel routines in question are, as far as I know, not threaded, and they certainly don’t let user space handle zeroing of memory that the kernel guarantees. How could they be threaded in the first place? They’re writing to a single page, and threading doesn’t really make sense there, since a page is the smallest unit of memory the kernel manages here. This is what I’m talking about when I say that the page faulting mechanism is opaque to julia code. We don’t gain any benefit from multithreading user space code here, except when we do the zeroing ourselves:

Function definitions
julia> function alloctest(f::F, dims::Vararg{Integer,N}) where {F,N}
                  A = f(Float64, dims...)
                  Threads.@threads for i in eachindex(A)
                      Ai = A[i]
                      @turbo for j in 1:16
                          Ai += exp(i-j)
                      end
                      A[i] = Ai
                  end
                  A
              end
alloctest (generic function with 2 methods)

julia> function thread_zeros(::Type{T}, dims::Integer...) where T
           A = Array{T}(undef, dims...)
           Threads.@threads for i in eachindex(A)
               A[i] = zero(T)
           end
           return A
       end
thread_zeros (generic function with 1 method)

julia> function zeros_with_calloc(::Type{T}, dims::Integer...) where T
          ptr = Ptr{T}(Libc.calloc(prod(dims), sizeof(T)))
          return unsafe_wrap(Array{T}, ptr, dims; own=true)
       end
zeros_with_calloc (generic function with 1 method)
`alloctest` Benchmarks
julia> @benchmark alloctest(zeros, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  845.501 ms … 915.111 ms  ┊ GC (min … max): 0.00% … 6.42%
 Time  (median):     857.200 ms               ┊ GC (median):    0.09%
 Time  (mean ± σ):   865.460 ms ±  26.023 ms  ┊ GC (mean ± σ):  1.51% ± 2.57%

  ██       ██           █                                     █
  ██▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  846 ms           Histogram: frequency by time          915 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 24.

julia> @benchmark alloctest(thread_zeros, 8192, 8192)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  759.736 ms … 824.374 ms  ┊ GC (min … max): 0.00% … 7.35%
 Time  (median):     760.735 ms               ┊ GC (median):    0.09%
 Time  (mean ± σ):   772.535 ms ±  23.736 ms  ┊ GC (mean ± σ):  1.50% ± 2.76%

  █
  █▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  760 ms           Histogram: frequency by time          824 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 52.

julia> @benchmark alloctest(zeros_with_calloc, 8192, 8192)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  941.039 ms … 992.792 ms  ┊ GC (min … max): 0.00% … 4.31%
 Time  (median):     949.372 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   955.610 ms ±  18.949 ms  ┊ GC (mean ± σ):  0.77% ± 1.75%

  █   █  █    █    █                                          █
  █▁▁▁█▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  941 ms           Histogram: frequency by time          993 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 24.
`zeros` vs. `thread_zeros`
julia> @benchmark zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 26 samples with 1 evaluation.
 Range (min … max):  170.848 ms … 315.340 ms  ┊ GC (min … max):  0.00% … 43.64
 Time  (median):     176.870 ms               ┊ GC (median):     0.43%
 Time  (mean ± σ):   196.879 ms ±  36.527 ms  ┊ GC (mean ± σ):  12.27% ± 12.73

  █              ▄
  █▅▁▁▃▁▁▁▁▁▁▁▁▁▃█▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃ ▁
  171 ms           Histogram: frequency by time          315 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 2.

julia> @benchmark thread_zeros(Float64, 8192,8192)
BenchmarkTools.Trial: 48 samples with 1 evaluation.
 Range (min … max):   80.754 ms … 245.534 ms  ┊ GC (min … max):  0.00% … 65.21
 Time  (median):      84.810 ms               ┊ GC (median):     0.91%
 Time  (mean ± σ):   104.350 ms ±  31.993 ms  ┊ GC (mean ± σ):  21.82% ± 18.62

  █            ▂▅
  █▁▅▁▁▁▁▅▁▁▁▁▁██▅▁▁▅▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  80.8 ms       Histogram: log(frequency) by time        246 ms <

 Memory estimate: 512.00 MiB, allocs estimate: 28.

I’m running with julia -t 4, but the speedup is only 2x - this is probably due to oversubscription, as I’m running an Intel(R) Core™ i7-6600U CPU (2 cores, 4 threads). You could create a custom LazyZeroInitArray type that leverages existing julia threads to do the initialization, but that’s a completely different solution from having zeros call calloc.

Note that it doesn’t matter whether julia “threads” are OS threads or not - the difference lies in the fact that at the point of initialization, during the page fault on calloc’ed memory, no user space (i.e. julia) code or threads are active. It’s the kernel that’s doing the initialization, not julia code.

That might depend on which libc you are using and the calloc implementation.

Addendum: even if calloc or the page faulting mechanism were threaded (they’re not, for fundamental reasons central to how current OSes work, but let’s assume it anyway for the sake of argument), you immediately run into another problem: while julia threads compose easily with other julia threads, they do not compose with non-julia threads at all. So making a hypothetical threaded calloc/initialization compose with julia would require something like a julia machine with a julia OS, akin to lisp machines.

I haven’t brought up musl because, as far as I know, julia basically requires glibc and doesn’t run on musl-based systems.

Also, as far as I can tell, the code you linked does a simple naive memset in kernel space anyway, so I’m not sure how that is a counterexample to me saying that it’s not julia code and not julia threads that are running this code?

There are currently official Julia musl builds at Download Julia including for 1.6.3 and 1.7.0-rc1.

1 Like

I guess that’s technically true, but musl is still only Tier 3 in terms of support:

Tier 3: Julia may or may not build. If it does, it is unlikely to pass tests. Binaries may be available in some cases. When they are, they should be considered experimental. Ongoing support is dependent on community efforts.