Multithreaded memory usage

That would be regrettable, and I think not needed. I believe this GC issue with threading is known, and being worked on, and maybe actually solved already in 1.10? Did you try it?

That’s exactly what I wanted to hear, i.e. that it’s not a problem when single-threaded. You could try monitoring max memory use to see if some pattern emerges with e.g. 1, 2, 3, 4, or 8 threads, or with the default or max number of threads you use, on 1.9 vs 1.10 (on 1.10 I think the defaults are ok; or maybe you need to fiddle with --gcthreads?). I’m not sure if threading was always a problem, so you could also consider testing or using e.g. 1.6.
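To compare runs, something like the following could be started once per thread count (e.g. `julia -t 1 script.jl`, `julia -t 4 script.jl`) on each Julia version. This is only a sketch: `workload` is a hypothetical stand-in for your real code, and `Sys.maxrss()` reports the peak resident set size of the whole process, which is a rough proxy for max memory use.

```julia
using Base.Threads

# Hypothetical stand-in for the real workload: every iteration
# allocates a temporary array that becomes garbage right away.
function workload(n)
    totals = zeros(Float64, n)
    @threads for i in 1:n
        tmp = rand(10_000)
        totals[i] = sum(tmp)
    end
    return sum(totals)
end

result = workload(2_000)

# Peak resident set size of this process, converted from bytes to MiB.
peak_mib = Sys.maxrss() / 2^20
println("Julia ", VERSION, ", threads = ", nthreads(),
        ", peak RSS = ", round(peak_mib; digits=1), " MiB")
```

Running each configuration in a fresh process matters, since the peak never goes down within one session.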

Yes, I dislike adding such calls, as ideally the GC itself would figure out the best times to do this automatically. But you could try adding GC.gc(false) (the docs write the signature as GC.gc([full=true]); full is a positional argument, so GC.gc(full=false) with a keyword will not work). I believe it’s much lighter since it may only sweep “so-called young objects”, i.e. I would think the garbage you just created. full=true is the default when you call GC.gc() explicitly, and the “Excessive use will likely lead to poor performance” warning in the docs applies to that, maybe exclusively. Either way I don’t really like sprinkling code with either, since it shouldn’t be needed… but it could be a workaround until the GC is fixed, i.e. until 1.10 has shipped for production use (I hesitate to recommend an alpha or master, though I usually use master with good effect; hopefully the release of 1.10 isn’t too far off).
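For reference, this is how the two kinds of collection are called. Note that `full` is a positional Bool, so the incremental form is written `GC.gc(false)`. The helper `make_garbage` is made up for illustration, and the timings will vary a lot between machines and runs, so treat this as a rough sketch rather than a benchmark.

```julia
# Hypothetical helper that creates short-lived ("young") garbage.
make_garbage(n) = sum(sum(rand(1_000)) for _ in 1:n)

make_garbage(1_000)
t_incremental = @elapsed GC.gc(false)  # incremental: may only sweep young objects

make_garbage(1_000)
t_full = @elapsed GC.gc(true)          # full collection, same as plain GC.gc()

println("incremental: ", t_incremental, " s, full: ", t_full, " s")
```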

If you do add GC.gc, then you could try calling it after each known allocation in the loop (or use Bumper.jl for those allocations, in which case the call isn’t needed), and after each call to a library that (potentially) allocates.
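A minimal sketch of that pattern, assuming the work can be split into batches: `allocating_step` is a hypothetical placeholder for a library call that allocates, and the incremental collection runs between the threaded batches rather than inside them.

```julia
using Base.Threads

# Hypothetical placeholder for a library call that allocates internally.
allocating_step(i) = sum(rand(50_000)) + i

function run_batches(nbatches, batchsize)
    results = zeros(Float64, nbatches * batchsize)
    for b in 1:nbatches
        offset = (b - 1) * batchsize
        @threads for j in 1:batchsize
            results[offset + j] = allocating_step(offset + j)
        end
        # Incremental collection once per batch, after the allocating calls.
        GC.gc(false)
    end
    return results
end

results = run_batches(10, 50)
```

How often to call it is a tuning question: once per batch is just a starting point, and calling it on every iteration may cost more than it saves.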

If you’re thinking of going fully manual, then you could consider this library, which is an indirect dependency of Bumper.jl:

I think Bumper.jl would in most cases be preferable. While I may have implied that it allocates on the stack, it actually allocates from its own pool, which lives in the heap (since it uses the above library, which indirectly uses Libc.malloc). I don’t think you or anyone needs to know about, or care to use, Libc.malloc directly any more, even though it ships with Julia, since those packages are more user-friendly.
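For comparison only, this is roughly what going through Libc.malloc by hand looks like. It is a sketch, not a recommendation: `manual_sum` is a made-up example function, and forgetting the `Libc.free` (or using `y` after it) is exactly the kind of bug the packages are meant to spare you.

```julia
# Made-up example: compute sum(x .+ 1) using a manually managed scratch buffer.
function manual_sum(x::Vector{Float64})
    n = length(x)
    p = convert(Ptr{Float64}, Libc.malloc(n * sizeof(Float64)))
    p == C_NULL && throw(OutOfMemoryError())
    try
        # Wrap the raw memory as an Array; own=false means Julia's GC will not free it.
        y = unsafe_wrap(Array, p, n; own=false)
        y .= x .+ 1
        return sum(y)
    finally
        Libc.free(p)  # we are responsible for releasing the buffer
    end
end

manual_total = manual_sum([1.0, 2.0, 3.0])
```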

Even if Bumper uses the heap (it allocates there only once, or was it once per thread?), in effect it’s as fast as if you had allocated in the stack frame itself. It’s a package-managed stack in effect.
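A small sketch of what that looks like in use, assuming a recent Bumper.jl release that exports `@no_escape` and `@alloc` (older releases had a different interface, so check the README for your version). The function name `bumper_sum` is made up for illustration.

```julia
using Bumper

# Made-up example: compute sum(x .+ 1) with a temporary taken from Bumper's buffer.
function bumper_sum(x::Vector{Float64})
    @no_escape begin
        # Scratch array from Bumper's pool; it is reclaimed when the block ends,
        # so it must not escape the block.
        y = @alloc(Float64, length(x))
        y .= x .+ 1
        sum(y)
    end
end

bumper_total = bumper_sum([1.0, 2.0, 3.0])
```

Only the scalar result leaves the `@no_escape` block, which is what makes this safe; returning `y` itself would not be.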