Threads are not speeding up computation a lot

It seems like a solved problem now.

$ killall -SIGSTOP firefox-bin

$ julia +1.11 -t 8 --project=. main.jl
  0.084158 seconds (45 allocations: 6.108 MiB)

$ killall -SIGCONT firefox-bin

I get a 7.47x speedup for 8 threads, 93% of the ideal speedup, but going from 8 to 16 threads adds little extra speed, most likely because of hyper-threading: a second hardware thread on the same core is simply known to be worse than a full core, at least for CPU-bound work.

I.e. with the improved code from @yolhan_mannes Threads are not speeding up computation a lot - #8 by yolhan_mannes

Is this as fast as possible?

using ThreadPinning
# GC.gc(); GC.gc(); GC.gc()
# pinthreads(:cores)  # at best minimally helping
# GC.enable(false)
@time test_gen_keys_range();
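A minimal sketch of how the thread-count sweep could be scripted (a hypothetical driver, not from the thread; the thread count is fixed at Julia startup, so each measurement restarts the process):

# Hypothetical driver script: sweep -t values and time main.jl for each.
# Assumes juliaup (for the +1.11 channel) and the project/main.jl used above.
for t in (1, 2, 4, 8, 16)
    println("-t $t:")
    run(`julia +1.11 -t $t --project=. main.jl`)
end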

I thought it wasn’t just “not as good”, but that there was zero benefit when the CPU is fully utilized.

Yes, but it’s pretty hard to “fully utilize” a core (always have all execution units busy, never branch mispredict, never wait on memory).

To be cheeky, I’d say that workloads that fully utilize a core are so homogeneous and non-branchy that they should run on a GPU :wink:


It’s probably hard to tell whether you are fully utilizing all functional units, and whether another thread running on the same core via hyper-threading slows you down for that reason. But the two hardware threads at least share the L1 cache, and it seems obvious that this could be a problem with a cache that small.
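One way to isolate that effect (a sketch, assuming ThreadPinning.jl and the test function from earlier in the thread) is to time the same 8-thread run pinned to 8 separate physical cores versus packed onto 4 cores with both hardware threads each, so every pair shares an L1 cache:

using ThreadPinning

# Start Julia with -t 8 on the 8-core/16-thread CPU discussed in this thread.
pinthreads(:cores)           # one Julia thread per physical core, no shared L1
@time test_gen_keys_range()

pinthreads(:cputhreads)      # compact pinning: should pack 8 threads onto 4 cores,
@time test_gen_keys_range()  # two per core, so SMT siblings share L1/L2

# The gap between the two timings gives a rough estimate of the
# hyper-threading / cache-sharing penalty for this workload.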

Note that after 20 years of hyper-threading, Intel is eliminating it (at least from the Lion Cove performance cores in Lunar Lake):

https://overclock3d.net/news/cpu_mainboard/why-are-intel-ditching-hyperthreading-with-lion-cove-and-lunar-lake/

If we compare a P-core with Hyperthreading disabled and an efficiency optimised P-core that lacks hyperthreading, we have a core that is smaller and more power efficient. Hyperthreading takes up space, and even when disabled, that feature consumes power.

It’s easy: CPU utilization with Julia (BigInt) code stays at 30-50%.

I have an AMD Ryzen 7 5800H. It’s 8 cores / 16 threads.
With Python I get 8x scaling on the same algorithm (and it is much faster), but I am forced to use multiprocessing and copy data to and from the processes.
I also see 6-8x scaling in other compute-intensive applications, so I think these are the limits of my system.
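For contrast, a minimal Julia-side sketch (do_work is a placeholder, not from the thread) of why no such copying is needed with threads, since they share memory:

# Threads all see the same arrays, so results are written in place
# rather than copied back from worker processes.
results = Vector{Float64}(undef, Threads.nthreads())
Threads.@threads for i in eachindex(results)
    results[i] = do_work(i)   # do_work is a hypothetical, thread-safe kernel
end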
P.S.

I made some tests with Python, using a C library for key generation, and with OpenCL on the CPU.

As you can see, it scales perfectly up to 8 threads and then has a hard time going to 16. Maybe that is because of the CPU’s compute-unit configuration, the cache size, or the mobile CPU’s power limit.
The part of the graph beyond 16 threads shows the actual “overhead” (since some people here insist that my task is not parallel enough and that I don’t understand how computers work).

P.P.S.
OpenCL is a strange puppy, probably using AVX and vectorization.
I think it shows results that are close to the limit of my hardware without resorting to crazy cache-optimized assembly.


Maybe I am not understanding something, but to my knowledge you cannot just resize a variable, no matter where it is located, on the heap or on the stack.
It always involves copying the data to a new region of memory.

And GMP probably has some API to handle all of this…
Probably returning a carry/borrow at least, or something.
It probably also has the ability to set a preallocated size on creation?

GPUs are actually good at branching; they have a very long pipeline.
The problem with branching on a GPU occurs when it breaks coalesced memory transfers. But that may or may not be a big deal, depending on your algorithm, code, and memory locality.

I never liked this stuff. As I understand it, HT is good for code with lots of branching and waiting. A proper computational load can saturate a whole core’s AVX compute bandwidth with one thread.

Funnily enough, though, the first Bulldozer APU cores had more vector bandwidth than one “core” could utilize. I tested this, and 4 threads per module (2 cores) is actually what saturated them.
So it probably depends.

As CPUs have gotten beefier, HT has become increasingly beneficial. For anything more complicated than GEMM, it’s very hard to fully utilize a modern CPU core. On Zen 5, for example, you need roughly 200 instructions per cache miss to avoid being cache-latency bound, and fewer than 2 branches per 8 instructions or you are branch-predictor bottlenecked (and those 2 branches need 95%+ predictability, or else you lose out to the branch misses). For the very best code, hyper-threading still doesn’t help much, but, especially on AMD, it is often a free 20-30% performance boost.
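Those rules of thumb can be checked against hardware counters; a rough sketch, assuming LinuxPerf.jl’s @pstats macro (Linux only) and a placeholder workload foo():

using LinuxPerf   # wraps perf_event_open; Linux only

# Count instructions, branches, and cache misses for one run of the workload,
# then compare instructions per cache miss and the branch-miss rate with the
# rules of thumb above. foo() stands in for the real kernel.
@pstats "(instructions,branch-instructions,branch-misses),(cache-references,cache-misses)" begin
    foo()
end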


Your terminology is mixing up some language-level features with implementation details, but I’ll cut to the chase. A GMP integer is implemented with an internal dynamic array, which allows the value to grow cheaply (internally resizing the array) in most cases, by preallocating more memory than necessary and only doing the more expensive reallocation when that memory isn’t enough. The user can reallocate manually, but reallocations are typically handled automatically. Because the allocated size isn’t fixed, these allocations must be on the heap. A dynamic array is more directly implemented by Julia’s Vector, and an array’s size/length can easily be observed to not match its allocated size.
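For instance, a rough sketch of manual preallocation using Julia’s internal (non-public) Base.GMP.MPZ wrappers around GMP’s mpz_realloc2; the alloc/size/d fields are implementation details:

using Base.GMP.MPZ: realloc2, add!   # internal wrappers, not a public API

x = realloc2(4096)     # BigInt with room for at least 4096 bits, value 0
x.alloc                # limbs allocated up front (4096/64 on 64-bit limbs)
add!(x, big(2)^1000)   # in-place add fits in the existing allocation
x.alloc                # unchanged: no reallocation was needed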

Julia’s Base treats Numbers like immutable objects even when they’re not, so the BigInt API doesn’t actually let us leverage GMP’s dynamic reallocation. Julia does, however, internally wrap some of GMP’s in-place arithmetic API:

julia> x = big(123); y = x; # same object

julia> using Base.GMP.MPZ: add!

julia> for i in 1:5
          println(y, y.d) # value, dynamic array address
          add!(y, y) # double value
       end
123Ptr{UInt64} @0x0000022bf42a49d0
246Ptr{UInt64} @0x0000022bfd098f40
492Ptr{UInt64} @0x0000022bfd098f40
984Ptr{UInt64} @0x0000022bfd098f40
1968Ptr{UInt64} @0x0000022bfd098f40

julia> using BenchmarkTools

julia> @btime add!($y, $(big(typemax(Int128)))) # reallocated by GMP, not Julia
  11.011 ns (0 allocations: 0 bytes)
1784866255233235935950497391318785927013203163

julia> y.d # address proves reallocation
Ptr{UInt64} @0x0000022b903eb890

julia> x === y # still the same object, all in-place mutations
true

But there’s no implementation of wider in-place operations that would involve conversions to BigFloat, e.g. sin.
