Dot product not parallelized on cluster

I am trying to optimize code that involves a dot product which is called many times.

I run the code on a cluster node with 256 threads, using the built-in dot.

I check the threads' activity with htop, and only one or two (out of 256) seem to be working.

Also, I use

BLAS.set_num_threads(32)

as I've found that this is the maximum number of BLAS threads I can use. Is this true, or could I use all available threads with BLAS?

When I benchmark the dot product alone (not the entire loop), the 32 threads (not 256) do work, but only for big vectors with more than ~1e7 elements.

I tried many things, like custom loops with @turbo, @simd, etc., but nothing seems to improve performance or get all threads working.

Any ideas?
Many thanks

Yeah, I don't think you benefit from multiple threads when the array is small…

1e7 elements is a tiny amount.
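To see why, here's a sketch of a hand-rolled threaded dot product (assuming Julia was started with multiple threads, e.g. julia -t 4 — the function name and chunking scheme are just for illustration). For small vectors, the overhead of distributing the work across tasks exceeds the work itself, so it can't beat the serial loop:

```julia
# Hand-rolled threaded dot product: split the index range into one chunk
# per thread, reduce each chunk serially, then sum the partial results.
function threaded_dot(a, b)
    n = length(a)
    nt = Threads.nthreads()
    chunks = collect(Iterators.partition(1:n, cld(n, nt)))
    partials = Vector{Float64}(undef, length(chunks))
    Threads.@threads for i in eachindex(chunks)
        s = 0.0
        @inbounds @simd for j in chunks[i]
            s += a[j] * b[j]
        end
        partials[i] = s
    end
    return sum(partials)
end

a = rand(1000); b = rand(1000)
threaded_dot(a, b) ≈ sum(a .* b)  # correct, but no speedup at this size
```

For n in the thousands, the per-task scheduling cost dominates; the threaded version only starts paying off once each chunk holds enough work to amortize it.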


Dot products are memory-bandwidth bound, so adding cores doesn't help much.
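A back-of-envelope estimate shows why (the ~20 GB/s per-core figure is an assumed ballpark, not a measurement — it varies a lot by machine):

```julia
# A dot product streams both vectors from RAM exactly once: 16 bytes per
# element pair (two Float64s). At an assumed ~20 GB/s of bandwidth per
# core, one or two cores already saturate what the memory system can feed.
n = 10^7
bytes = 2 * 8 * n                 # 160 MB of data to stream
per_core_bw = 20e9                # assumed ~20 GB/s per core
t_one_core = bytes / per_core_bw  # ≈ 8 ms — and extra cores can't make
                                  # the data arrive any faster
```

The FLOPs are essentially free by comparison; the runtime is set by how fast the bytes move, which is a property of the memory system, not the core count.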


It doesn't scale indefinitely, but surely 2 is better than 1?

julia> const a = rand(10^9);

julia> const b = rand(10^9);

julia> using LinearAlgebra

julia> using BenchmarkTools

julia> BLAS.set_num_threads(1)

julia> @btime dot(a,b)
  529.519 ms (0 allocations: 0 bytes)
2.5000706579325372e8

julia> BLAS.set_num_threads(2)

julia> @btime dot(a,b)
  357.474 ms (0 allocations: 0 bytes)
2.5000706579323804e8


Yeah. It depends a lot on the CPU and memory. As a very rough guideline: one core per channel of RAM.
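To make that guideline concrete, here's a sizing sketch with assumed hardware numbers (8 channels of DDR4-3200 — check your actual node's configuration, these are illustrative):

```julia
# Rule-of-thumb sizing for a streaming kernel like dot: total memory
# bandwidth is channels × per-channel bandwidth, and roughly one core
# per channel is enough to saturate it.
channels = 8
bw_per_channel = 25.6e9               # DDR4-3200: ~25.6 GB/s per channel
total_bw = channels * bw_per_channel  # ~205 GB/s for the whole node
useful_cores = channels               # ≈ cores that add anything; the
                                      # remaining ~248 threads just idle
```

So on a 256-thread node, only a handful of cores can contribute to a dot product, which matches what you're seeing in htop.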