Threads are not speeding up computation a lot

It seems like a solved problem now.

$ killall -SIGSTOP firefox-bin

$ julia +1.11 -t 8 --project=. main.jl
  0.084158 seconds (45 allocations: 6.108 MiB)

$ killall -SIGCONT firefox-bin

I get a 7.47x speedup for 8 threads, 93% of the ideal speedup, but going from 8 to 16 threads adds little extra speed, most likely because of hyper-threading: a second hardware thread on the same core is simply known to be worse than a full core, at least for CPU-bound work.

I.e. with the improved code from @yolhan_mannes Threads are not speeding up computation a lot - #8 by yolhan_mannes

Is this as fast as possible?

using ThreadPinning
# GC.gc(); GC.gc(); GC.gc()
# pinthreads(:cores)  # at best minimally helping
# GC.enable(false)
@time test_gen_keys_range();
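A minimal sketch of how the thread-count sweep could be scripted (a hypothetical driver, not from the thread; the thread count is fixed at Julia startup, so each measurement restarts the process):

# Hypothetical driver script: sweep -t values and time main.jl for each.
# Assumes juliaup (for the +1.11 channel) and the project/main.jl used above.
for t in (1, 2, 4, 8, 16)
    println("-t $t:")
    run(`julia +1.11 -t $t --project=. main.jl`)
end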

I thought it wasn’t just “not as good”, but that there was zero benefit when the CPU is fully utilized.

Yes, but it’s pretty hard to “fully utilize” a core (always have all execution units busy, never branch mispredict, never wait on memory).

To be cheeky, I’d say that workloads that fully utilize a core are so homogeneous and non-branchy that they should run on a GPU :wink:


It’s probably hard to tell whether you are fully utilizing all functional units, and whether another thread running on the same core via hyper-threading slows you down for that reason. But the two hardware threads at least share the L1 cache, and it seems obvious that this could be a problem with a cache that small.
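One way to isolate that effect (a sketch, assuming ThreadPinning.jl and the test function from earlier in the thread) is to time the same 8-thread run pinned to 8 separate physical cores versus packed onto 4 cores with both hardware threads each, so every pair shares an L1 cache:

using ThreadPinning

# Start Julia with -t 8 on the 8-core/16-thread CPU discussed in this thread.
pinthreads(:cores)           # one Julia thread per physical core, no shared L1
@time test_gen_keys_range()

pinthreads(:cputhreads)      # compact pinning: should pack 8 threads onto 4 cores,
@time test_gen_keys_range()  # two per core, so SMT siblings share L1/L2

# The gap between the two timings gives a rough estimate of the
# hyper-threading / cache-sharing penalty for this workload.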

Note that after 20 years of hyper-threading, Intel is eliminating it (at least from the Lion Cove performance cores in Lunar Lake):

https://overclock3d.net/news/cpu_mainboard/why-are-intel-ditching-hyperthreading-with-lion-cove-and-lunar-lake/

If we compare a P-core with Hyperthreading disabled and an efficiency optimised P-core that lacks hyperthreading, we have a core that is smaller and more power efficient. Hyperthreading takes up space, and even when disabled, that feature consumes power.

It’s easy: CPU utilization with Julia (BigInt) code stays at 30-50%.

I have an AMD Ryzen 7 5800H. It’s 8 cores / 16 threads.
With Python I get 8x scaling on the same algorithm (and it is much faster), but I am forced to use multiprocessing and copy data to and from the processes.
I also see 6-8x scaling in other compute-intensive applications, so I think these are the limits of my system.
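For contrast, a minimal Julia-side sketch (do_work is a placeholder, not from the thread) of why no such copying is needed with threads, since they share memory:

# Threads all see the same arrays, so results are written in place
# rather than copied back from worker processes.
results = Vector{Float64}(undef, Threads.nthreads())
Threads.@threads for i in eachindex(results)
    results[i] = do_work(i)   # do_work is a hypothetical, thread-safe kernel
end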
P.S.

I made some tests with Python, using a C library for key generation, and with OpenCL on the CPU.

As you can see, it scales perfectly up to 8 threads and then has a hard time going to 16. Maybe that is because of the CPU’s compute-unit configuration, the cache size, or the mobile CPU’s power limit.
The part of the graph beyond 16 threads shows the actual “overhead” (since some people here insist that my task is not parallel enough and that I don’t understand how computers work).

P.P.S.
OpenCL is a strange puppy, probably using AVX and vectorization.
I think it shows results that are close to the limit of my hardware without resorting to crazy cache-optimized assembly.


Maybe I am not understanding something, but to my knowledge you cannot just resize a variable, no matter where it is located, on the heap or on the stack.
It always involves copying the data to a new region of memory.

And GMP probably has some API to handle all of this…
Probably returning a carry/borrow at least, or something.
It probably also has the ability to set a preallocated size on creation?

GPUs are actually good at branching; they have a very long pipeline.
The problem with branching on a GPU occurs when it breaks coalesced memory transfers. But that may or may not be a big deal, depending on your algorithm, code, and memory locality.

I never liked this stuff. As I understand it, HT is good for code with lots of branching and waiting. A proper computational load can saturate a whole core’s AVX compute bandwidth with one thread.

Funnily enough, though, the first Bulldozer APU cores had more vector bandwidth than one “core” could utilize. I tested this, and 4 threads per module (2 cores) is actually what saturated them.
So it probably depends.

As CPUs have gotten beefier, HT has become increasingly beneficial. For anything more complicated than GEMM, it’s very hard to fully utilize a modern CPU core. On Zen 5, for example, you need roughly 200 instructions per cache miss to avoid being cache-latency bound, and fewer than 2 branches per 8 instructions or you are branch-predictor bottlenecked (and those 2 branches need 95%+ predictability, or else you lose out to the branch misses). For the very best code, hyper-threading still doesn’t help much, but, especially on AMD, it is often a free 20-30% performance boost.
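Those rules of thumb can be checked against hardware counters; a rough sketch, assuming LinuxPerf.jl’s @pstats macro (Linux only) and a placeholder workload foo():

using LinuxPerf   # wraps perf_event_open; Linux only

# Count instructions, branches, and cache misses for one run of the workload,
# then compare instructions per cache miss and the branch-miss rate with the
# rules of thumb above. foo() stands in for the real kernel.
@pstats "(instructions,branch-instructions,branch-misses),(cache-references,cache-misses)" begin
    foo()
end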


Your terminology is mixing up some language-level features with implementation details, but I’ll cut to the chase. A GMP integer is implemented with an internal dynamic array, which allows the value to grow cheaply (internally resizing the array) in most cases, by preallocating more memory than necessary and only doing the more expensive reallocation when that memory isn’t enough. The user can reallocate manually, but reallocations are typically handled automatically. Because the allocated size isn’t fixed, these allocations must be on the heap. A dynamic array is more directly implemented by Julia’s Vector, and an array’s size/length can easily be observed to not match its allocated size.
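For instance, a rough sketch of manual preallocation using Julia’s internal (non-public) Base.GMP.MPZ wrappers around GMP’s mpz_realloc2; the alloc/size/d fields are implementation details:

using Base.GMP.MPZ: realloc2, add!   # internal wrappers, not a public API

x = realloc2(4096)     # BigInt with room for at least 4096 bits, value 0
x.alloc                # limbs allocated up front (4096/64 on 64-bit limbs)
add!(x, big(2)^1000)   # in-place add fits in the existing allocation
x.alloc                # unchanged: no reallocation was needed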

Julia’s Base treats Numbers like immutable objects even when they’re not, so the BigInt API doesn’t actually let us leverage GMP’s dynamic reallocation. Julia does, however, internally wrap some of GMP’s in-place arithmetic API:

julia> x = big(123); y = x; # same object

julia> using Base.GMP.MPZ: add!

julia> for i in 1:5
          println(y, y.d) # value, dynamic array address
          add!(y, y) # double value
       end
123Ptr{UInt64} @0x0000022bf42a49d0
246Ptr{UInt64} @0x0000022bfd098f40
492Ptr{UInt64} @0x0000022bfd098f40
984Ptr{UInt64} @0x0000022bfd098f40
1968Ptr{UInt64} @0x0000022bfd098f40

julia> using BenchmarkTools

julia> @btime add!($y, $(big(typemax(Int128)))) # reallocated by GMP, not Julia
  11.011 ns (0 allocations: 0 bytes)
1784866255233235935950497391318785927013203163

julia> y.d # address proves reallocation
Ptr{UInt64} @0x0000022b903eb890

julia> x === y # still the same object, all in-place mutations
true

But there’s no implementation of wider in-place operations that would involve conversions to BigFloat, e.g. sin.
