Scaling of @threads for "embarrassingly parallel" problem

For future reference for other readers here:
Everyone who takes a parallel computing course is told the golden rule: first squeeze the most performance out of your serial code, then parallelize it.
Allocations have to do syscalls, which inevitably hand control over to the operating system at unpredictable moments and are thus a problem in hot loops.
I agree that tooling could be improved to pinpoint more quickly where such allocations happen, but perhaps this @deallocate path is not the best way forward.

7 Likes

I usually share the same kind of concerns. However, it’s not clear to me that the main problem here (or in general) is the GC per se, as the percentage of time spent on it was low.

You’ve also reduced the allocations from 4 GB to 500 MB, which is a big deal. And your single-threaded performance seems to have improved significantly (from 250 s to 54 s, if you were indeed running the same test case). So to me, this seems to have been a problem with memory usage and memory bandwidth.

Julia, by default, allocates arrays on the heap and can create copies of arrays when you don’t expect them. A general way to reduce (costly) allocations is to use stack-allocated arrays, which is, however, not so easy to do in Julia.
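For small arrays whose size is fixed and known up front, StaticArrays.jl (not mentioned above, but the usual workaround) already gets you stack-allocated behavior; a minimal sketch:

  using StaticArrays

  # An SVector's length is part of its type, so the compiler can keep it
  # in registers or on the stack instead of allocating on the heap.
  heap_norm2(n) = sum(abs2, rand(n))           # rand(n) heap-allocates a Vector
  stack_norm2() = sum(abs2, @SVector rand(3))  # no heap allocation

  stack_norm2()  # after this warm-up call, @allocated stack_norm2() is 0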

So I also wish working with stack-allocated arrays (which, in particular, might require manual memory management) were more straightforward in Julia. I’m currently experimenting with StrideArrays.jl and Bumper.jl, which might well be the solution to this problem, so we don’t need to wait for Julia 2.0.
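For what it’s worth, the Bumper.jl pattern looks roughly like this (a sketch from my reading of its docs; sumsq is a made-up example):

  using Bumper

  # Bump-allocate a scratch array that is reclaimed when the block exits,
  # without going through the GC.
  function sumsq(n)
      @no_escape begin
          x = @alloc(Float64, n)  # valid only inside this block
          x .= 2.0
          sum(abs2, x)            # the @alloc'd memory must not escape
      end
  end

  sumsq(10^4)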

Allocations almost never have to do syscalls (they do, but only for fairly large allocations, and even then it should be somewhat rare).

1 Like

Whether arrays live on the stack or the heap doesn’t matter per se; they’re both just memory. What does matter is how they are allocated and cleaned up. One specific thing Julia could do that would help a bunch is better analysis of array lifetimes, which would let the compiler replace GCing and allocating a new array with simply reusing the memory (but doing this isn’t trivial).

I wonder: if each thread allocates memory, can there be slowdowns, since it has to be ensured that they don’t allocate the same memory twice? I’ve never heard whether that’s a thing, but it made me curious.

Multi-threaded allocators aren’t that complicated. The TL;DR is that each thread has its own pool of pages into which it can allocate small objects. For big objects it will call into the system allocator, and there will be some overhead, but it’s relatively minor compared to the cost of actually using that memory for anything.
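An illustrative experiment (not a rigorous benchmark) if you want to see this for yourself:

  using Base.Threads

  # Many small allocations per task: these are served from each thread's
  # own page pool, so the tasks rarely contend with one another.
  function churn(n)
      acc = 0.0
      for _ in 1:n
          acc += sum(rand(8))  # rand(8) is a small heap allocation
      end
      return acc
  end

  churn_threaded(n) = sum(fetch.([Threads.@spawn churn(n) for _ in 1:nthreads()]))

  # Compare @time churn(10^6) with @time churn_threaded(10^6): the threaded
  # version does nthreads() times the work, and how close it stays to the
  # single-task wall time shows how well the allocator scales.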

1 Like

But there are different kinds of memory: cache memory and RAM. In my somewhat limited experience, I tend to think (I’m a practitioner, not a compiler developer) that whenever I manage to allocate a small amount of memory on the stack, the compiler figures out it should use the cache.

Sure, that would be great. But maybe we could help it in the meantime with some allocate/deallocate for performance-critical sections of code. I still haven’t found a nice way to do so.

Cache isn’t managed explicitly by the compiler. All cache decisions are made by the CPU. Memory on the stack often ends up in cache because it is frequently referenced, but heap memory will also do that.

In general, rather than manually allocating/deallocating, you can get the same (or slightly better) benefit by just reusing the memory and not reallocating in the first place.
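For example (function names are made up), hoisting the allocation out of the hot loop and mutating in place:

  step(x)       = 2 .* x .+ 1           # allocates a fresh array every call
  step!(out, x) = (out .= 2 .* x .+ 1)  # writes into an existing buffer

  function iterate_reusing(x, iters)
      buf = similar(x)     # one allocation, outside the loop
      for _ in 1:iters
          step!(buf, x)
          x, buf = buf, x  # swap buffers (the input array is used as scratch)
      end
      return x
  end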

3 Likes

Ah, my bad, thanks for the correction.

It is possible to run GC manually, and also to disable and enable it, today. You can even use Libc.malloc and Libc.free to manually allocate and free memory. If you need an array, you can use unsafe_wrap (see the sketch after the help entries below).

help?> GC.enable
  GC.enable(on::Bool)

  Control whether garbage collection is enabled using a boolean argument (true for enabled, false for disabled). Return previous GC state.

  │ Warning
  │
  │  Disabling garbage collection should be used only with caution, as it can cause memory use to grow without bound.

help?> GC.gc
  GC.gc([full=true])

  Perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps all objects, which makes the next GC scan much slower, while an incremental
  collection may only sweep so-called young objects.

  │ Warning
  │
  │  Excessive use will likely lead to poor performance.
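Putting those together, a sketch of what the manual route can look like (the hot loop is a placeholder):

  n = 1_000
  ptr = Ptr{Float64}(Libc.malloc(n * sizeof(Float64)))
  ptr == C_NULL && error("malloc failed")
  A = unsafe_wrap(Array, ptr, n; own = false)  # own = false: the GC won't free it

  GC.enable(false)         # no collections during the critical section
  try
      A .= 1.0
      # ... hot, ideally allocation-free work on A ...
  finally
      GC.enable(true)      # restore the GC
      Libc.free(ptr)       # our responsibility; A must not be touched afterwards
  end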

None of that will really address the core issue of excessive allocations, though. Fortunately, Julia offers pretty nice constructs and tools to help you avoid allocations.
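For instance, views plus the timing macros make copies easy to spot and remove (illustrative):

  f(x) = sum(x[2:end])        # the slice makes a copy -> allocates
  g(x) = sum(@view x[2:end])  # a view reuses the parent's memory

  x = rand(10^6); f(x); g(x)  # warm up compilation first
  @time f(x)                  # reports several MB allocated
  @time g(x)                  # reports (nearly) zero allocation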

1 Like