What exactly is "allocation" in Julia?

As many here have already pointed out: allocating and freeing memory is just one of the many things the GC has to do. Even keeping track of all the (potentially fragmented) pieces of memory and all references to them, in order to decide at regular intervals whether some of those fragments can be released yet, can result in a huge overhead, though it depends on your algorithms, of course. Overall, I would say it is more often huge than small. :wink:

(Edit: …in the case of heap-allocated memory, I mean.)
(Edit #2: I remember digging into some GC problems in Java with a colleague some years ago, where we came to realize that the GC in use even reshuffled some heap-allocated memory (for whatever reason), resulting in terrible performance. Not sure the Julia GC would ever do that, though.)

I doubt malloc and free in C/C++ have a lot less overhead; it’s just that this overhead is paid in tiny increments rather than all at once during a “garbage collection event”. Also, memory allocation is more directly under the programmer’s control: they call malloc or new, so they know they are allocating. In Julia you have to learn which expressions allocate. For example:

foo = bar[1:10]

Allocates a new array and copies the contents of bar. Compare this with @views, which allocates just a view that essentially holds pointers into bar’s memory.
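A quick way to see the difference (a minimal sketch; the variable names are just illustrative) is to mutate through a view:

```julia
bar = collect(1.0:100.0)

foo = bar[1:10]      # slicing allocates a new 10-element Array and copies
foo[1] = -1.0        # mutating the copy does not touch bar
bar[1] == 1.0        # still true

v = @view bar[1:10]  # a SubArray: a small wrapper holding a reference into bar
v[1] = -1.0          # writes through to bar's memory
bar[1] == -1.0       # now true
```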

To the extent that C/C++ programs have less overhead from allocation, much of it comes from allocation being explicit and therefore easier for a naive beginner to avoid.


I would guess it reports heap allocations only. Otherwise the outputs I’ve seen wouldn’t make sense to me. But someone else should confirm. :sweat_smile:

Not sure why you’re quoting me about malloc and free?

I haven’t even mentioned those functions (for a reason!).

This is (part of) what I said about the management of memory when it is heap-allocated…

heap memory needs to be managed with much more overhead, including keeping track of the different fragments of memory (which result from dynamically growing or shrinking data structures) through a bunch of pointers, which themselves need to be managed just to keep the data accessible. Also, in the case of multithreading, access to heap memory has to be synchronized somehow, etc. (as heap memory is accessible to all threads).

…and this is completely independent of your question of whether or not malloc and free are fast in C.

Here is the larger quote, where you say that non-GC languages such as C (which uses malloc) have better performance. This is not necessarily true. I’m not trying to be argumentative here; I’m just saying that mileage varies, and C isn’t necessarily going to spend less time overall.


Here is the larger quote, where you say that non-GC languages such as C (which uses malloc) have better performance.

Yes, I said that non-GC languages typically have a lot less overhead, since they simply do not manage your memory for you. But this has (from my pov) nothing to do with malloc itself; I just hope I’m now able to get that across. You could just as well quote me as having said that languages with curly braces have better performance, which applies to C just as much as having malloc does. :wink:

My point was simply that non-GC’d languages don’t have the management overhead that GC’d languages have for checking all the things a GC has to check (in case you write code that uses a lot of heap memory). It is the management of that memory, not the allocation or freeing: all the work necessary just to decide whether or not some piece of memory needs to be freed. :wink:

And I also totally agree that if you are able to code in a way where little heap management is necessary, then it shouldn’t be much of a factor. But if it is, then, very roughly speaking: the GC has to take care at runtime of what a typical C programmer would have had to think through at design time to make sure no memory leaks from their code. :sweat_smile:

Right, and what I’m saying is that typical C code will have a bunch of extra instructions sprinkled around to decide whether to call free on something. This may add up to about as much overhead as a mark-sweep cycle, but it is paid in microsecond increments all around the code rather than in 50-millisecond increments all at once during a GC cycle. The total number of machine cycles spent on management is not necessarily bigger in GC languages, at least if you spend the same design effort. In GC languages it is generally possible to get a correct program with much less thought, and this is where GC overhead usually grows: when the programmer doesn’t think about allocation at all. The GC’d program will run correctly but slowly; the C/C++ program will just not run.


Non-GC languages, at least C (and C++), have other overheads. E.g. strings are frequently copied (since they are mutable), while GC languages are more likely to reduce copying overhead, i.e. to share data (plus there are other speed and security issues with brain-dead C strings). But yes, GC languages DO have GC overhead, which need not be huge (Java and C# are often faster than some C code; the most important speed differences are related to cache locality and structure-of-arrays vs. array-of-structures, which Java handles poorly but Julia is excellent at).

There are real-time GCs available (just not in Julia). Some GCs are very fast (in other languages), with little overhead (generational GC; Julia has one too, but with issues around threads).

Ironically, Julia is fast as a GC language when you avoid the GC (which is easy, just not always done), but I believe it could also be fast when the GC is not avoided, and focusing on heap allocations (and thus the GC) as the problem is a red herring.

See also Mimalloc test by gbaraldi · Pull Request #47062 · JuliaLang/julia · GitHub, which is an experiment to switch the malloc implementation Julia uses. It shows some fairly major advantages.


I see what you’re saying, but I tend to disagree. When you code the allocation and freeing of memory manually, I think there is typically much less to be done in terms of branching, chasing references, etc. The logic of where to free (or not free) memory is (from my humble experience) typically implicit and follows automatically in most cases (assuming you know your code well enough). This is opposed to the machine instructions a GC has to execute to achieve the same goal automatically and generically, for any piece of code a programmer throws at the compiler / GC / runtime, without any “understanding” of that code.

That being said, I do agree that someone with similar experience in, say, Julia and C++ usually gets a “correct program with much less thought” (in Julia). And I also agree that when people don’t think about the GC (in Julia), they sometimes pay a high price in performance. Similarly, when people miss memory leaks in C(++), the consequences can be more severe than “just” some lost performance (the worst case being a system-wide loss of performance, or segmentation faults).
From my pov, every language offers a solution to a set of tradeoffs. One of these is ease of use and expressiveness (where Julia shines) vs. the ability to control everything yourself, with all consequences (including the responsibility for both top performance and failure).

You have to do just as many (or fewer) allocations in a GC language. And in a non-GC language you do one free for each malloc (more is a bug; fewer is potentially a bug, a leak). In a GC language the runtime does just as many (or fewer) frees as in a non-GC language. It’s only about the “GC overhead”.

If you have a generational GC then there’s not a lot of overhead. There may be indirect threading effects, but there is work on that, both in this new PR and in older ones:

It’s known that mimalloc is a much better allocator than our default one (languages like C and C++ profit from it now, but we could have the same gain). I see this issue:

The GCBenchmarks suite doesn’t seem to show much of a difference here since most allocations it does don’t use the malloc allocator but our own.

The solution is to not use our own allocator to bypass malloc (and thus mimalloc). It’s somewhat more invasive work, but not hugely complex, and it will pay off.

There is nothing ironic about that. It is a well-known fact that if you are able to program in a way where no GC is needed, and you switch it off, then your program runs faster. Simple as that, because there is no GC overhead.

And one more thing: in C or C++, if your code is serial, then you force your frees (and mallocs) to be done serially, at essentially random times.

In Julia and other GC languages, frees can be batched up (and are faster because of cache effects; I don’t believe C and C++ would do that, though it seems theoretically possible), and with threads the GC work (and the frees) could run in parallel (I don’t think it does already) even if your code is otherwise serial, for 1/n-th the overhead in time.

I mean, it’s a bit ironic that you have to avoid the GC when the GC can be fast.

In many cases this is true; in the general case it is not. For example, some GC languages do generational copying garbage collection, which includes heap compaction. There are workloads where, by compacting the data you use frequently together, the cache is much more effective and the GC’d program runs faster. Also, the logic inside malloc might be quite complicated in C, whereas allocation in a GC language might be simpler, such as just incrementing a pointer to the end of the used memory region. Especially with copying collectors, less work has to be done to figure out which pieces of memory are in use and which aren’t and can be reused: copying collectors just use up the next chunk of memory until the region gets too big, then mark and copy all the live objects to a new region and free the entire previous region all at once.

Basically, what I’m saying is that GC is not just one thing; it’s a bunch of different strategies, and it can be faster than explicit memory management. It often isn’t, but it’s not a given that it must be slower, which a lot of people believe.

All that being said, Julia’s collector will probably never be a copying collector, because we want good compatibility when calling code in other languages such as C, R, or Python, and those languages will not be happy if Julia moves memory out from under them. Still, it’s entirely plausible that Julia’s GC could become, say, parallelized, and optimized to be much faster than it is, with much less “pause latency”. If that happens, the advice about pre-allocation etc. may still be applicable, but the speed-ups involved may be much smaller, possibly even zero.

Yep, I’m aware. I’ve (loosely) been following the evolution of Java’s (multiple) GCs ever since ’96/’97; some of them are quite complex (including generational strategies, etc.), and their intricacies go far beyond what either of us has even hinted at, so I’m no expert on all of them. However, for the sake of this thread, I was trying to keep things on topic and as simple as possible, to clear up what a GC does. (Edit: as it relates to heap-memory management.)

There are workloads where, by compacting the data you use frequently together, the cache is much more effective and the GC’d program runs faster.

I’m sure there are instances like you describe, but I would also suggest that the average user, who is not into any of these details, is better off avoiding the GC (i.e., heap-allocated memory) as much as possible in 99.9% of cases if they are interested in better performance. :sweat_smile:

Is that ever true (in Julia, at least)? No: if you avoid allocations (i.e. the GC), then you don’t need to turn the GC off, because the GC is only triggered on allocations. I don’t know if this changes once we have threaded GC. If you turn off the GC and DO have some allocations, then it’s dangerous: you can run out of memory. So it’s not recommended; you have to be very careful and turn it back on again (regularly), and then you’re seemingly doing the work the GC should be doing.
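For reference, `GC.enable` is the switch being discussed. A minimal sketch of the (careful!) pattern, where the protected region happens not to allocate anyway, so disabling the GC buys nothing:

```julia
function hot_loop(n)
    s = 0.0
    for i in 1:n
        s += i        # pure arithmetic: no heap allocation in here
    end
    return s
end

hot_loop(1)           # warm up (compilation itself allocates)
GC.enable(false)      # returns the previous state; collection is now off
result = hot_loop(10^6)
GC.enable(true)       # turn it back on promptly...
GC.gc()               # ...and let the collector catch up
```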

There isn’t much of an alternative. For real-time work you want to avoid allocating in the loop, but you need memory, so you allocate it (on the heap) outside of it.

There’s really no need to avoid the heap otherwise, and the stack isn’t workable (in most cases) as a substitute; it is used instead where possible. What you may want is to allocate differently, e.g. fewer, larger objects rather than ragged arrays (as in Java), or to pre-allocate.
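A minimal sketch of the pre-allocation pattern (`step!` and the variable names are made up for illustration):

```julia
# One heap allocation up front, then the hot loop reuses the buffer.
function step!(out, x)
    @. out = 2x + 1   # in-place broadcast: writes into `out`, allocates nothing
    return out
end

x = rand(1000)
out = similar(x)      # the single up-front heap allocation
for _ in 1:100        # hot loop: no per-iteration allocation
    step!(out, x)
end
```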

Hehe, this has made me curious…

[…] if you avoid allocations (i.e. the GC) then you don’t need to turn the GC off. Because the GC is only triggered on allocations. […]

…because it is not always so in Java, but I don’t know enough about Julia’s inner workings.

So I started with a very trivial example of a recursive function, to be extended until finally I would include some code requiring a heap allocation (to check when, and how often, the GC would be called). But I failed to realize that even prior to any extension, just calling @btime would already result in the GC being called (as is the case for using and precompiling any packages).

It’s a fun example of just how much a GC does anyway, though (which we often don’t realize)…

julia> using BenchmarkTools

julia> GC.enable_logging(true)

julia> function GC.gc(full::Bool=true)
           println("==> gc was called with full= ",full)
           ccall(:jl_gc_collect, Cvoid, (Cint,), full ? 1 : 2)
       end

julia> f(x) = x<=1 ? x : f(x-1) + 1

…this is the output that is generated…

julia> @btime f(10)
==> gc was called with full= true
GC: pause 57.34ms. collected 10.118183MB. full recollect
GC: pause 252.49ms. collected 0.023080MB. incr 
==> gc was called with full= true
GC: pause 49.95ms. collected 0.000832MB. full recollect
GC: pause 250.39ms. collected 0.000000MB. incr
==> gc was called with full= true
GC: pause 45.77ms. collected 0.000160MB. full recollect
GC: pause 245.63ms. collected 0.000000MB. incr
==> gc was called with full= true
GC: pause 46.93ms. collected 0.000160MB. full recollect
GC: pause 239.19ms. collected 0.000000MB. incr
==> gc was called with full= true
GC: pause 45.38ms. collected 0.080160MB. full recollect
GC: pause 243.06ms. collected 0.008192MB. incr
==> gc was called with full= true
GC: pause 46.80ms. collected 0.000160MB. full recollect
GC: pause 249.88ms. collected 0.000000MB. incr
==> gc was called with full= true
GC: pause 44.94ms. collected 0.000160MB. full recollect
GC: pause 236.46ms. collected 0.000000MB. incr
==> gc was called with full= true
GC: pause 47.28ms. collected 0.000160MB. full recollect
GC: pause 240.18ms. collected 0.000304MB. incr
  13.828 ns (0 allocations: 0 bytes)
10

Not surprising to those who know the inner workings of Julia well enough, probably. Nor to me, after thinking about it a bit, but it was when invoking @btime f(10) for the first time. :sweat_smile:


BenchmarkTools does a bunch of GC’ing before running the benchmark: BenchmarkTools.jl/execution.jl at 40865d7c1ec5a3033d67c329e10c8235f13c6024 · JuliaCI/BenchmarkTools.jl · GitHub. My guess is that this is so that allocations made in your session prior to running the benchmark don’t cause too much extra GC time during the benchmark.


This might be helpful:


The Rust book’s explanation is indeed a very good one.
In fact, every language (besides perhaps Fortran 77) works with a stack and a heap, and the way C works still provides the best mental model for this. E.g. working with std::string in idiomatic modern C++ creates heap allocations, which may hurt performance in hot loops.

The discussion in Julia is IMHO a bit confused, because memory management gets mixed up with the concept of immutability, probably due to the fact that it is easy to handle small immutable objects on the stack.
AFAIU there is no 1:1 mapping “immutable == stack” and “mutable == heap”; it is more that small immutable objects can always live on the stack, while for mutable objects the compiler decides.
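For example (with the caveat that the compiler may also stack-allocate non-escaping mutable objects on recent Julia versions):

```julia
struct Point           # immutable and isbits: plain data, can live on the
    x::Float64         # stack or even in registers
    y::Float64
end

mutable struct MPoint  # mutable: has identity, typically heap-allocated
    x::Float64
    y::Float64
end

isbitstype(Point)      # true
ismutabletype(MPoint)  # true
```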

The other Julia-specific mix-in into the stack-heap dichotomy is that boxing, due to type uncertainty at runtime, leads to heap allocations, probably because Core.Box is mutable.
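The classic place to see Core.Box is a closure that reassigns a captured variable (a small sketch; `make_counter` is just an illustrative name):

```julia
# A captured variable that is reassigned inside a closure is wrapped
# in a (mutable, heap-allocated) Core.Box:
function make_counter()
    n = 0
    inc() = (n += 1; n)   # `n` is captured *and* reassigned -> boxed
    return inc
end

c = make_counter()
c()                       # returns 1
c()                       # returns 2
fieldtype(typeof(c), :n)  # Core.Box
```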
