I am trying to optimize code involving a dot product that is called many times.
I run the code on a cluster node with 256 threads, using the built-in dot.
When I check thread activity with htop, only one or two threads (out of 256) seem to be doing any work.
Also, I set the number of BLAS threads to 32, since I’ve found that this is the maximum number of BLAS threads I can use. Is this true, or could I use all available threads with BLAS?
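For reference, here is a minimal sketch of how the thread counts can be inspected and set from Julia, assuming the standard LinearAlgebra library. The value 32 is the one mentioned in this post; the BLAS backend may cap the number it actually uses, so it is worth reading the value back after setting it.

```julia
using LinearAlgebra

# Number of Julia threads (set with `julia -t N` or JULIA_NUM_THREADS)
println("Julia threads: ", Threads.nthreads())

# Number of threads the BLAS backend is currently configured to use
println("BLAS threads before request: ", BLAS.get_num_threads())

# Request 32 BLAS threads, then read the value back to see what was accepted
BLAS.set_num_threads(32)
println("BLAS threads after request:  ", BLAS.get_num_threads())
```

Note that Julia threads and BLAS threads are separate pools, so htop can show few busy cores even when Julia itself was started with many threads.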
When I benchmark the dot product alone (not the entire loop), the threads (32, not 256) are active, but only for large vectors with more than about 1E7 elements.
I tried many things, such as custom loops with @turbo, @simd, etc., but nothing seems to improve performance or make all threads work.
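To make the kind of custom loop concrete, here is a minimal sketch of a chunked, multithreaded dot product. The name `threaded_dot` and the `ntasks` keyword are hypothetical and chosen only for illustration; this is not the code from my actual program, and it assumes Julia was started with more than one thread.

```julia
using LinearAlgebra
using Base.Threads: @spawn, nthreads

# Hypothetical helper: split the index range into chunks and sum each chunk in its own task
function threaded_dot(x::Vector{T}, y::Vector{T}; ntasks::Int = nthreads()) where {T<:Real}
    length(x) == length(y) || throw(DimensionMismatch("vectors must have equal length"))
    n = length(x)
    n == 0 && return zero(T)
    chunk = cld(n, ntasks)                 # chunk length, rounded up
    tasks = map(1:chunk:n) do lo
        hi = min(lo + chunk - 1, n)
        @spawn begin
            s = zero(T)
            @inbounds @simd for i in lo:hi # SIMD-friendly partial sum over one chunk
                s += x[i] * y[i]
            end
            s
        end
    end
    return sum(fetch, tasks)               # combine the partial sums
end

x = rand(10^6); y = rand(10^6)
println(threaded_dot(x, y) ≈ dot(x, y))    # compare against the built-in dot
```

For short vectors the cost of spawning tasks can easily exceed the cost of the arithmetic, and a dot product is typically limited by memory bandwidth rather than by compute, so a sketch like this is not guaranteed to keep all cores busy.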