nbakas
I am trying to optimize code involving a dot product that is called many times.
I run the code on a cluster node with 256 threads, using the built-in dot.
When I check the threads with htop, only one or two (out of 256) seem to be working.
I also use
BLAS.set_num_threads(32)
since I’ve found that 32 is the maximum number of BLAS threads I can use. Is this true, or could I use all available threads with BLAS?
When I benchmark only the dot product (and not the entire loop), the threads (32, not 256) do work, but only for big vectors with more than about 1e7 elements.
I tried many things, such as custom loops with @turbo and @simd, but nothing seems to improve performance or make all threads work.
Any ideas?
Many thanks
              
jling
Yeah, I don’t think you benefit from multiple threads when the array is small.
1e7 elements is a tiny amount.
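If the individual vectors are too small for multithreaded BLAS to pay off, one option is to parallelize across the calls instead of inside each one. The sketch below assumes the dot products in the loop are independent of each other, which the thread does not show; `many_dots!`, `xs`, and `ys` are hypothetical names, and Julia must be started with several threads (for example `julia -t 32`) for the loop to run in parallel.

```julia
using LinearAlgebra

# Hypothetical workload: many independent dot products of modest length.
# `many_dots!`, `xs`, and `ys` are illustrative names, not from the thread.
function many_dots!(out, xs, ys)
    # Each iteration writes to its own slot of `out`, so there is no data race.
    Threads.@threads for i in eachindex(out, xs, ys)
        out[i] = dot(xs[i], ys[i])
    end
    return out
end

# Keep BLAS single-threaded so the Julia threads are the only source of
# parallelism and the two thread pools do not compete for cores.
BLAS.set_num_threads(1)

n, len = 1_000, 10_000            # arbitrary sizes for illustration
xs = [rand(len) for _ in 1:n]
ys = [rand(len) for _ in 1:n]
out = zeros(n)
many_dots!(out, xs, ys)
```

Because each dot product still streams its inputs from memory, the gain from this approach is likely to level off once memory bandwidth is saturated.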
            
Dot products are memory-bottlenecked, so adding cores doesn’t help.
              
jling
It doesn’t scale indefinitely, but surely 2 is better than 1?
julia> const a = rand(10^9);
julia> const b = rand(10^9);
julia> using LinearAlgebra, BenchmarkTools
julia> BLAS.set_num_threads(1)
julia> @btime dot(a,b)
  529.519 ms (0 allocations: 0 bytes)
2.5000706579325372e8
julia> BLAS.set_num_threads(2)
julia> @btime dot(a,b)
  357.474 ms (0 allocations: 0 bytes)
2.5000706579323804e8
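To relate these timings to the memory-bandwidth argument, here is a small back-of-the-envelope calculation. It assumes each call streams both vectors from main memory exactly once, which is reasonable at this size but is still an assumption. Under it, the timings correspond to roughly 30 GB/s with one thread and 45 GB/s with two, a speedup of about 1.48× rather than 2×.

```julia
# Effective memory bandwidth implied by the two timings quoted above.
n = 10^9                                  # elements per vector
bytes_read = 2 * n * sizeof(Float64)      # both vectors read once: 16 GB

t1 = 529.519e-3                           # seconds with 1 BLAS thread
t2 = 357.474e-3                           # seconds with 2 BLAS threads

bw1 = bytes_read / t1 / 1e9               # GB/s with 1 thread
bw2 = bytes_read / t2 / 1e9               # GB/s with 2 threads
speedup = t1 / t2

println("1 thread:  ", round(bw1, digits = 1), " GB/s")
println("2 threads: ", round(bw2, digits = 1), " GB/s")
println("speedup:   ", round(speedup, digits = 2), "x")
```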
            
Yeah. It depends a lot on the CPU and memory. As a very rough guideline, figure on 1 core per channel of RAM.
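One way to check where scaling stops on a particular machine is to sweep the BLAS thread count and watch the effective bandwidth. This is only a sketch: `dot_scaling`, the vector length, and the thread counts are arbitrary choices for illustration, and `@elapsed` is a coarser timer than `@btime`.

```julia
using LinearAlgebra

# Time `dot` for several BLAS thread counts and report effective bandwidth.
# `dot_scaling` is a hypothetical helper, not part of any library.
function dot_scaling(n; threads = (1, 2, 4, 8))
    a, b = rand(n), rand(n)
    original = BLAS.get_num_threads()
    results = Tuple{Int,Float64}[]
    try
        for t in threads
            BLAS.set_num_threads(t)
            dot(a, b)                                   # warm-up call
            secs = minimum(@elapsed(dot(a, b)) for _ in 1:5)
            gbs = 2 * n * sizeof(Float64) / secs / 1e9  # both vectors read once
            push!(results, (t, gbs))
        end
    finally
        BLAS.set_num_threads(original)                  # restore the previous setting
    end
    return results
end

for (t, gbs) in dot_scaling(10^7)
    println(t, " BLAS thread(s): ", round(gbs, digits = 1), " GB/s")
end
```

The thread count at which the GB/s figure stops growing is roughly where memory bandwidth saturates. On CPUs with large caches, a vector length well above 10^7 may be needed for the numbers to reflect main memory.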