Multi-thread speed in very large vectors

Hi, I am trying to get my feet wet with some multi-threaded code using the great OhMyThreads package. But there is a behavior that I don’t understand. I can illustrate it using the sum function. The most natural implementation of a multi-threaded sum, for me, is

tsum(x) = treduce(+, x)

It works great, up to medium sized vectors:

julia> n = 1_000_000; x = rand(n);

julia> @btime sum($x);
  95.586 μs (0 allocations: 0 bytes)

julia> @btime tsum($x);
  18.560 μs (186 allocations: 13.80 KiB)

A quite reasonable speedup of 5.15x (my computer has 8 threads).

But if I go to really huge vectors I get:

julia> n = 1_000_000_000; x = rand(n);

julia> @btime sum($x);
  238.121 ms (0 allocations: 0 bytes)

julia> @btime tsum($x);
  188.209 ms (186 allocations: 13.80 KiB) 

The speedup is basically gone. Why? I don’t see why very large vectors should hurt the speedup. What is happening here? Is there a way around it?

PS: I have tried many variations, including manually dividing the vector into 8 equal-sized chunks and creating a task for each (with individual local accumulators to avoid false sharing, as suggested in the OhMyThreads documentation). I always get the same behavior.
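For reference, the manual-chunking variant described in the PS might look roughly like this (a sketch with a hypothetical `tsum_chunks`, using only Base.Threads; the chunk boundaries and accumulator strategy are my assumptions, not the exact code I ran):

```julia
using Base.Threads

# Sketch: split the vector into `ntasks` contiguous ranges, reduce each
# range in its own task with a task-local accumulator (avoiding false
# sharing), then combine the partial sums.
function tsum_chunks(x; ntasks = nthreads())
    n = length(x)
    ranges = [round(Int, (i - 1) * n / ntasks) + 1 : round(Int, i * n / ntasks)
              for i in 1:ntasks]
    tasks = map(ranges) do r
        @spawn begin
            acc = zero(eltype(x))        # task-local accumulator
            @inbounds @simd for i in r
                acc += x[i]
            end
            acc
        end
    end
    return sum(fetch, tasks)
end
```

With `x = rand(1_000_000)`, `tsum_chunks(x) ≈ sum(x)` should hold (up to floating-point reassociation), and the timings behave like the `treduce` version above.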


You might be reaching the limits of your memory bandwidth.

You should post some more info about your system (CPU and memory) to figure that out.

It’s quite system-specific how many threads/cores are needed to saturate your full memory bandwidth (i.e. the quotient total bandwidth / bandwidth-per-core).

Your numbers suggest ~30 GB/s per core and ~40 GB/s total.

Memory bandwidth doesn’t bind in the small example, since its 8 MB fit snugly into L3 cache.
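One way to check this is to time the reduction with increasing task counts; if the time stops improving well before the core count, the reduction is memory-bound rather than compute-bound. A rough sketch (assuming the `ntasks` keyword available in recent OhMyThreads versions, and note the array needs ~8 GB of RAM):

```julia
using OhMyThreads, BenchmarkTools

x = rand(1_000_000_000)   # ~8 GB of Float64, large enough to miss L3

# If the time plateaus at a low task count, added cores are just
# waiting on memory, not doing useful work.
for nt in (1, 2, 4, 8)
    t = @belapsed treduce(+, $x; ntasks = $nt)
    println(nt, " tasks: ", round(t * 1e3; digits = 1), " ms")
end
```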


If memory bandwidth limits you, as foobar_lv2 suggests, you may still get a reasonable speedup on real-world problems more complicated than a sum. That is, if your computation takes more time per element, you will still manage to fetch data from memory fast enough.


My computer has an Intel Core i9-12900K with 128 GB of memory (DDR4, I believe 3200).

According to Intel, it has a maximum memory bandwidth of 76.8 GB/s.
So you are probably right, thanks; it makes sense. A vector of 1 billion Float64s uses 8 GB of memory, and the computation takes about 0.2 seconds, so it is moving roughly the 40 GB/s you mentioned. Close to the limit, hence a single core can nearly saturate the memory bandwidth.
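Spelling out that back-of-the-envelope arithmetic with the timings from this thread:

```julia
# Effective bandwidth = bytes streamed / elapsed time.
n      = 1_000_000_000
bytes  = 8n                       # Float64 vector: 8 GB
serial = bytes / 0.238 / 1e9      # ≈ 33.6 GB/s for the serial sum
multi  = bytes / 0.188 / 1e9      # ≈ 42.5 GB/s for the threaded sum
peak   = 76.8                     # GB/s, Intel's spec for the i9-12900K
```

So a single core already achieves a large fraction of what the threaded version reaches in practice, which is why adding cores barely helps.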

The actual code will do something more complex than simply adding up the numbers, but not by much. Anyhow, I still have some hope for a reasonable speedup.

Thanks!


Some time ago I did some experiments in this respect:

On a laptop you can quite probably hit the memory bandwidth limit. Multicore servers may have two, four, or more memory channels; depending on the task, the speedup there could be considerably larger.


For what it’s worth, this is what I get with a more costly computation (for your simple sum I get the same result as you):

julia> using OhMyThreads

julia> ssum(x) = mapreduce(el -> log(el) * exp(el), +, x)
ssum (generic function with 1 method)

julia> tsum(x) = tmapreduce(el -> log(el) * exp(el), +, x)
tsum (generic function with 1 method)

julia> n = 1_000_000; x = rand(n);

julia> @btime ssum($x);
  6.018 ms (0 allocations: 0 bytes)

julia> @btime tsum($x);
  1.078 ms (228 allocations: 17.12 KiB)

julia> n = 1_000_000_000; x = rand(n);

julia> @btime ssum($x);
  6.140 s (0 allocations: 0 bytes)

julia> @btime tsum($x);
  988.239 ms (228 allocations: 17.12 KiB)

julia> Threads.nthreads()
10

Thank you all. It is a shame that the method I am trying to parallelize has very tight and simple loops. I will certainly hit the memory bandwidth wall. Let me see if I can circumvent that with some cleverness.

Do you have a discrete GPU? If so, doing the calculation on the GPU is probably your best bet here. Discrete GPUs typically have much higher memory bandwidth than CPUs, thanks to their use of GDDR VRAM instead of (LP)DDR, and to the VRAM being placed physically very close to the chip.
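A minimal sketch of this with CUDA.jl (assuming an NVIDIA card; AMD or Apple hardware would use AMDGPU.jl or Metal.jl analogously):

```julia
using CUDA

x  = rand(1_000_000)     # host data (Float64)
xd = CuArray(x)          # copy to GPU memory (VRAM)
s  = sum(xd)             # the reduction runs on the device

# When timing, synchronize so the measurement waits for the GPU:
CUDA.@sync sum(xd)
```

Note the host-to-device copy itself goes over PCIe, so this only pays off if the data lives on the GPU across many operations.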

Also, is it possible for you to do your calculation on Float32 vectors instead of Float64? That’ll give a handy 2x increase in throughput right there.
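Concretely, the saving comes from storing the data as Float32 from the start; converting Float64 data on the fly would still stream the 8-byte elements from memory. A sketch, reusing the `tsum` definition from earlier in the thread:

```julia
using OhMyThreads

tsum(x) = treduce(+, x)            # as defined earlier in the thread

# Float32 elements are 4 bytes instead of 8, so the reduction has to
# stream half as many bytes for the same number of elements.
x32 = rand(Float32, 1_000_000)
tsum(x32)
```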


This is on the to-do list for the paper. I plan to have both multi-CPU and GPU implementations. I am starting with multi-CPU because it looks easier. After that, I will try to learn how to write GPU kernels in Julia. The kernels will be quite simple, as I said, so hopefully it will be “easy”.

As for moving to Float32 on the CPU: that does not help much. The problem is that the main loop is so simple that a single core is already capable of using all the memory bandwidth for very large instances. I will get twice the throughput, but the computation will also finish twice as fast, so I will probably hit the same limitation.


Isn’t the whole goal to make the computation faster? If you get twice the performance without needing to use more threads, that sounds like a win to me.

It is certainly a win for the actual application. On the other hand, the paper is about parallelizing a method, so if a single core saturates the memory bandwidth, I won’t be able to show any speedup. Moving to big servers with more memory bandwidth, or to GPUs, is therefore key. Thank you all!