[quote=“Benny, post:139, topic:101711”]
Looks like it doesn’t scale well despite working more. I’m surprised at how much more, 2.41s * 52.7 → 127s for only ~36x the garbage,[/quote]
I’m not sure. I thought it might be because of the generational assumption being violated in multithreaded contexts, but enabling `GC.enable_logging(true)` only ever reports incremental collections (which is good).
I’m on a different computer now than earlier (10980XE instead of 7980XE), but they’re basically the exact same CPU.
Note that these are both actually 18-core CPUs.
So we have twice as much work per physical core in the multithreaded case; anything close to 2x the single-threaded time therefore means the mallocs are getting really good multithreaded scaling.
Baseline is similar on the 10980XE, except (surprisingly) it is a bit slower:
```
julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
 21.603470 seconds (30.00 M allocations: 71.526 GiB, 11.78% gc time)
1.3620400542987349e10

julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
  3.164538 seconds (1 allocation: 16 bytes)
1.3620400542987349e10

julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
  2.128713 seconds (1 allocation: 16 bytes)
1.3620400542987349e10

julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
  1.976689 seconds (1 allocation: 16 bytes)
1.3620400542987349e10

julia> @show Threads.nthreads();
Threads.nthreads() = 36

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
222.812451 seconds (1.08 G allocations: 2.515 TiB, 59.32% gc time)
4.903344195475447e11

julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
  8.182727 seconds (222 allocations: 20.703 KiB)
4.903344195475447e11

julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
  4.208087 seconds (222 allocations: 20.703 KiB)
4.903344195475447e11

julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
  4.512129 seconds (223 allocations: 20.734 KiB)
4.903344195475447e11
```
```
julia> versioninfo()
Julia Version 1.11.0-DEV.142
Commit d1be33d4bc (2023-07-22 20:20 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 53 on 36 virtual cores
```
Now, enabling GC logging…
```
julia> GC.enable_logging(true);

julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
# huge wall of GC pauses that look just like the below:
GC: pause 1.55ms. collected 45.875200MB. incr
GC: pause 1.44ms. collected 45.875200MB. incr
GC: pause 1.53ms. collected 45.875200MB. incr
GC: pause 1.53ms. collected 45.875200MB. incr
 22.042343 seconds (30.00 M allocations: 71.526 GiB, 12.45% gc time)
1.3620400542987349e10

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
# the end contained single threaded GCs
# when we were down to 1 task, but the
# bulk contained collections like:
GC: pause 67.41ms. collected 1397.212160MB. incr
GC: pause 70.31ms. collected 1454.510080MB. incr
GC: pause 73.16ms. collected 1324.771840MB. incr
GC: pause 69.26ms. collected 1434.995200MB. incr
GC: pause 70.86ms. collected 1469.299200MB. incr
226.461490 seconds (1.08 G allocations: 2.515 TiB, 59.97% gc time)
4.903344195475447e11
```
They were all `incr`; none of the collections during these runs were full.
For the multithreaded case, my computer was at only 40% average utilization (according to btop).
Yes, I think that would let us replicate the performance of manual frees. We may even be able to do better in some specialized circumstances like this benchmark, by having fewer checks on a reuse fast-path (one implementation of that could `get!` the buffer from task-local storage under the hood, and use a weakref to allow the memory to be reclaimed).
Depends. Worst-case scenario, it gets copied. In those cases, you can/should generally move the memory out, so that the destination takes ownership.
More commonly, (Named) Return Value Optimization [i.e. (N)RVO] should apply. When this optimization applies, instead of the callee both allocating and filling, the caller actually does the allocation and passes in a reference to the callee, which then fills it.