nbakas
1
I am trying to optimize code that calls a dot product many times.
I run the code on a cluster node with 256 threads, using the built-in dot.
When I check the threads' activity with htop, only one or two (out of 256) seem to be working.
Also, I use
BLAS.set_num_threads(32)
as I've found that this is the maximum number of BLAS threads I can use. Is this true, or could I use all available threads with BLAS?
When I benchmark the dot product alone (and not the entire loop), the 32 (not 256) threads do work, but only for big vectors, with more than roughly 1e7 elements.
I have tried many things, like custom loops with @turbo, @simd, etc., but nothing seems to improve performance or make all threads work.
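The kind of custom threaded loop I mean looks roughly like this (a simplified sketch, not my exact code; tdot is just an illustrative name):

```julia
using Base.Threads

# Hand-threaded dot product sketch: each task reduces its own chunk,
# then the per-chunk partial sums are combined at the end.
function tdot(a, b)
    n = length(a)
    nt = nthreads()
    partials = zeros(nt)
    @threads for t in 1:nt
        lo = div((t - 1) * n, nt) + 1   # chunk boundaries partition 1:n
        hi = div(t * n, nt)
        s = 0.0
        @inbounds @simd for i in lo:hi
            s += a[i] * b[i]
        end
        partials[t] = s
    end
    return sum(partials)
end

tdot(rand(10^6), rand(10^6))
```

Even with this, htop shows the same picture: the extra threads are mostly idle.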
Any ideas?
Many thanks
jling
2
Yeah, I don't think you benefit from multiple threads when the array is small…
1e7 elements is a tiny amount.
dot products are memory-bottlenecked, so adding cores doesn't help.
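A quick back-of-the-envelope check (assuming Float64, i.e. 8 bytes per element, and an illustrative ~0.5 s single-threaded timing for length-10^9 vectors):

```julia
# A dot product reads both vectors once from memory.
n = 10^9
bytes_streamed = 2n * 8          # 16 GB traversed per call
t_single = 0.5                   # illustrative single-thread timing, seconds
println(bytes_streamed / t_single / 1e9, " GB/s")  # ~32 GB/s
```

That is already in the ballpark of a single socket's memory bandwidth, so there is little headroom for more cores to exploit.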
jling
4
It doesn't scale indefinitely, but surely 2 is better than 1?
julia> const a = rand(10^9);
julia> const b = rand(10^9);
julia> using LinearAlgebra, BenchmarkTools
julia> BLAS.set_num_threads(1)
julia> @btime dot(a,b)
529.519 ms (0 allocations: 0 bytes)
2.5000706579325372e8
julia> BLAS.set_num_threads(2)
julia> @btime dot(a,b)
357.474 ms (0 allocations: 0 bytes)
2.5000706579323804e8
Yeah. It depends a lot on the CPU and memory. As a very rough guideline: about one core per channel of RAM.
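One way to find the knee on a particular machine is to sweep the BLAS thread count and watch where the timing stops improving (a sketch; exact counts and sizes will depend on your hardware):

```julia
using LinearAlgebra

# Time dot at increasing BLAS thread counts to see where
# memory bandwidth saturates on this machine.
a = rand(10^7); b = rand(10^7)
dot(a, b)  # warm up / compile
for nt in (1, 2, 4, 8)
    BLAS.set_num_threads(nt)
    t = @elapsed dot(a, b)
    println("$nt threads: $(round(t * 1e3; digits = 2)) ms")
end
```

Once the times plateau, extra threads are only contending for the same bandwidth.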