Multi-thread speed in very large vectors

Hi, I am trying to get my feet wet with some multi-threaded code using the great OhMyThreads package. But there is a behavior that I don’t understand. I can illustrate it using the sum function. The most natural implementation of a multi-threaded sum, for me, is

tsum(x) = treduce(+, x)

It works great, up to medium sized vectors:

julia> n = 1_000_000; x = rand(n);

julia> @btime sum($x);
  95.586 μs (0 allocations: 0 bytes)

julia> @btime tsum($x);
  18.560 μs (186 allocations: 13.80 KiB)

A quite reasonable speedup of 5.15x (my computer has 8 threads).

But if I go to really huge vectors I get:

julia> n = 1_000_000_000; x = rand(n);

julia> @btime sum($x);
  238.121 ms (0 allocations: 0 bytes)

julia> @btime tsum($x);
  188.209 ms (186 allocations: 13.80 KiB) 

The speedup is basically gone. Why? I don’t see why very large vectors should hurt the speedup. What is happening here? Is there a way around it?

PS: I have tried many variations, including manually dividing the vector into 8 equal-sized chunks and creating a task for each (with individual local accumulators to avoid false sharing, as suggested in the OhMyThreads documentation). I always get the same behavior.
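For reference, the manual-chunking variant described in the PS might look roughly like this (a sketch with a hypothetical `tsum_chunks`, using only Base.Threads; the chunk boundaries and accumulator strategy are my assumptions, not the exact code I ran):

```julia
using Base.Threads

# Sketch: split the vector into `ntasks` contiguous ranges, reduce each
# range in its own task with a task-local accumulator (avoiding false
# sharing), then combine the partial sums.
function tsum_chunks(x; ntasks = nthreads())
    n = length(x)
    ranges = [round(Int, (i - 1) * n / ntasks) + 1 : round(Int, i * n / ntasks)
              for i in 1:ntasks]
    tasks = map(ranges) do r
        @spawn begin
            acc = zero(eltype(x))        # task-local accumulator
            @inbounds @simd for i in r
                acc += x[i]
            end
            acc
        end
    end
    return sum(fetch, tasks)
end
```

With `x = rand(1_000_000)`, `tsum_chunks(x) ≈ sum(x)` should hold (up to floating-point reassociation), and the timings behave like the `treduce` version above.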


You might be reaching the limits of your memory bandwidth.

You should post some more info about your system (CPU and memory) to figure that out.

It’s quite system-specific how many threads/cores are needed to saturate your full memory bandwidth (i.e. the quotient total bandwidth / bandwidth-per-core).

Your numbers suggest ~30 GB/s per core and ~40 GB/s total.

Memory bandwidth doesn’t bind in the small example, since its 8 MB fit snugly into L3 cache.
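One way to check this is to time the reduction with increasing task counts; if the time stops improving well before the core count, the reduction is memory-bound rather than compute-bound. A rough sketch (assuming the `ntasks` keyword available in recent OhMyThreads versions, and note the array needs ~8 GB of RAM):

```julia
using OhMyThreads, BenchmarkTools

x = rand(1_000_000_000)   # ~8 GB of Float64, large enough to miss L3

# If the time plateaus at a low task count, added cores are just
# waiting on memory, not doing useful work.
for nt in (1, 2, 4, 8)
    t = @belapsed treduce(+, $x; ntasks = $nt)
    println(nt, " tasks: ", round(t * 1e3; digits = 1), " ms")
end
```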


If memory bandwidth limits you, as foobar_lv2 suggests, you may still get a reasonable speedup on real-world problems more complicated than a sum. That is, if your computation takes more time per element, you will still manage to fetch data from memory fast enough.


My computer has an Intel Core i9-12900K with 128 GB of memory (DDR4, I believe 3200).

According to Intel, it has a maximum memory bandwidth of 76.8 GB/s.
So you are probably right, thanks; it makes sense. A vector of 1 billion Float64s uses 8 GB of memory, and the computation takes about 0.2 seconds, so it is moving roughly the 40 GB/s you mentioned. Close to the limit, hence a single core can nearly saturate the memory bandwidth.
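Spelling out that back-of-the-envelope arithmetic with the timings from this thread:

```julia
# Effective bandwidth = bytes streamed / elapsed time.
n      = 1_000_000_000
bytes  = 8n                       # Float64 vector: 8 GB
serial = bytes / 0.238 / 1e9      # ≈ 33.6 GB/s for the serial sum
multi  = bytes / 0.188 / 1e9      # ≈ 42.5 GB/s for the threaded sum
peak   = 76.8                     # GB/s, Intel's spec for the i9-12900K
```

So a single core already achieves a large fraction of what the threaded version reaches in practice, which is why adding cores barely helps.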

The actual code will do something more complex than simply adding up the numbers, but not by much. Anyhow, I still have some hope for a reasonable speedup.

Thanks!


Some time ago I did some experiments in this respect:

On a laptop you can quite probably hit the memory bandwidth limit. Multicore servers may have two, four, or more memory channels; depending on the task, the speedup there could be considerably larger.


For what it’s worth, this is what I get with a more costly computation (for your simple sum I get the same result as you):

julia> using OhMyThreads

julia> ssum(x) = mapreduce(el -> log(el) * exp(el), +, x)
ssum (generic function with 1 method)

julia> tsum(x) = tmapreduce(el -> log(el) * exp(el), +, x)
tsum (generic function with 1 method)

julia> n = 1_000_000; x = rand(n);

julia> @btime ssum($x);
  6.018 ms (0 allocations: 0 bytes)

julia> @btime tsum($x);
  1.078 ms (228 allocations: 17.12 KiB)

julia> n = 1_000_000_000; x = rand(n);

julia> @btime ssum($x);
  6.140 s (0 allocations: 0 bytes)

julia> @btime tsum($x);
  988.239 ms (228 allocations: 17.12 KiB)

julia> Threads.nthreads()
10

Thank you all. It is a shame that the method I am trying to parallelize has very tight and simple loops. I will certainly hit the memory bandwidth wall. Let me see if I can circumvent that with some cleverness.

Do you have a discrete GPU? If so, doing the calculation on the GPU is probably your best bet here. Discrete GPUs typically have much higher memory bandwidth than CPUs, thanks to their use of GDDR VRAM instead of (LP)DDR, and to the VRAM being placed physically very close to the chip.
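A minimal sketch of this with CUDA.jl (assuming an NVIDIA card; AMD or Apple hardware would use AMDGPU.jl or Metal.jl analogously):

```julia
using CUDA

x  = rand(1_000_000)     # host data (Float64)
xd = CuArray(x)          # copy to GPU memory (VRAM)
s  = sum(xd)             # the reduction runs on the device

# When timing, synchronize so the measurement waits for the GPU:
CUDA.@sync sum(xd)
```

Note the host-to-device copy itself goes over PCIe, so this only pays off if the data lives on the GPU across many operations.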

Also, is it possible for you to do your calculation on Float32 vectors instead of Float64? That’ll give a handy 2x increase in throughput right there.
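Concretely, the saving comes from storing the data as Float32 from the start; converting Float64 data on the fly would still stream the 8-byte elements from memory. A sketch, reusing the `tsum` definition from earlier in the thread:

```julia
using OhMyThreads

tsum(x) = treduce(+, x)            # as defined earlier in the thread

# Float32 elements are 4 bytes instead of 8, so the reduction has to
# stream half as many bytes for the same number of elements.
x32 = rand(Float32, 1_000_000)
tsum(x32)
```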


This is on the to-do list for the paper. I plan to have both multi-CPU and GPU implementations. I am starting with multi-CPU because it looks easier. After that, I will try to learn how to write GPU kernels in Julia. The kernels will be quite simple, as I said, so hopefully it will be “easy”.

As for moving to Float32 on the CPU: that does not help much. The problem is that the main loop is so simple that a single core is already capable of using all the memory bandwidth for very large instances. I will get twice the throughput, but the computation will also finish twice as fast, so I will probably hit the same limitation.


Isn’t the whole goal to make the computation faster? If you get twice the performance without needing to use more threads, that sounds like a win to me.

It is certainly a win for the actual application. On the other hand, the paper is about parallelizing a method, so if a single core saturates the memory bandwidth, I won’t be able to show any speedup. Moving to big servers with more memory bandwidth, or to GPUs, is therefore key. Thank you all!