Dot product not parallelized on cluster

nbakas · January 3, 2023, 11:16pm

I am trying to optimize some code involving a dot product that is called many times.

I run the code on a cluster node with 256 threads, using the built-in dot.

Watching the threads in htop, only one or two (out of 256) seem to be doing any work.

Also, I use

BLAS.set_num_threads(32)

since, as far as I’ve found, this is the maximum number of BLAS threads I can use. Is this true, or could I use all available threads with BLAS?

When I benchmark only the dot product (not the entire loop), the threads (32, not 256) do work, but only for big vectors with more than about 1E7 elements.

I tried many things like custom loops with @turbo, @simd, etc., but nothing seems to improve performance and make all threads work.

Any ideas?
Many thanks
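
On the question of whether 32 is really the BLAS thread limit: one way to find out on a given node is to request more threads and read back what the library reports. This is only a minimal sketch; it assumes Julia 1.6 or later (for `BLAS.get_num_threads`), and the value it reports depends on the BLAS build in use, so nothing here confirms or refutes the 32-thread figure.

```julia
using LinearAlgebra

# Hardware threads reported by the OS, and Julia threads started for this session
println("CPU threads:   ", Sys.CPU_THREADS)
println("Julia threads: ", Threads.nthreads())

# Request as many BLAS threads as there are hardware threads,
# then read back the number the BLAS library says is in effect.
requested = Sys.CPU_THREADS
BLAS.set_num_threads(requested)
granted = BLAS.get_num_threads()
println("BLAS threads requested: ", requested, ", in effect: ", granted)
```

If `granted` comes back smaller than `requested`, the library has capped the count on that machine.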

jling · January 3, 2023, 11:20pm

Yeah, I don’t think you benefit from multiple threads when the array is small…

1e7 elements is a tiny amount.

Oscar_Smith · January 3, 2023, 11:41pm

Dot products are memory-bottlenecked, so adding cores doesn’t help.

jling · January 4, 2023, 2:42am

It doesn’t scale indefinitely, but surely 2 is better than 1?

julia> const a = rand(10^9);

julia> const b = rand(10^9);

julia> using LinearAlgebra, BenchmarkTools

julia> BLAS.set_num_threads(1)

julia> @btime dot(a,b)
  529.519 ms (0 allocations: 0 bytes)
2.5000706579325372e8

julia> BLAS.set_num_threads(2)

julia> @btime dot(a,b)
  357.474 ms (0 allocations: 0 bytes)
2.5000706579323804e8
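
To put those two timings in terms of memory traffic, here is a rough back-of-the-envelope calculation. It assumes both vectors are `Float64` (the default for `rand`) and that each element is streamed from main memory exactly once, which ignores caches and any other traffic; the helper name `bandwidth_gbs` is made up for this sketch.

```julia
n = 10^9                               # elements per vector, as in the benchmark above
bytes_read = 2 * n * sizeof(Float64)   # both vectors are read once: 16 GB in total

t1 = 529.519e-3                        # seconds with 1 BLAS thread
t2 = 357.474e-3                        # seconds with 2 BLAS threads

# Effective read bandwidth in GB/s (hypothetical helper, 1 GB = 1e9 bytes)
bandwidth_gbs(t) = bytes_read / t / 1e9

println("1 thread:  ", round(bandwidth_gbs(t1), digits = 1), " GB/s")
println("2 threads: ", round(bandwidth_gbs(t2), digits = 1), " GB/s")
println("speedup:   ", round(t1 / t2, digits = 2), "x")
```

Under those assumptions the run moves roughly 30 GB/s with one thread and roughly 45 GB/s with two, a speedup of about 1.5x rather than 2x.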

Oscar_Smith · January 4, 2023, 3:12am

Yeah. It depends a lot on the CPU and memory. In general, a rough guideline is 1 core per channel of RAM (this is very rough).
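
To see where scaling flattens out on a particular machine, one option is a hand-rolled chunked dot product where the number of chunks can be varied independently of BLAS. The following is a sketch only, not something from the posts above: `chunked_dot` and `partial_dot` are hypothetical names, it assumes plain 1-based `Vector`s, and Julia must be started with several threads (for example `julia -t 8`) for the tasks to run in parallel.

```julia
using LinearAlgebra

# Partial dot product over the index range r (hypothetical helper)
function partial_dot(x, y, r)
    s = zero(eltype(x)) * zero(eltype(y))
    @inbounds @simd for i in r
        s += x[i] * y[i]
    end
    return s
end

# Split 1:n into at most `nchunks` contiguous pieces, compute each partial sum
# on its own task, then add the partial sums together.
function chunked_dot(x::Vector, y::Vector; nchunks::Int = Threads.nthreads())
    length(x) == length(y) || throw(DimensionMismatch("vectors must have equal length"))
    n = length(x)
    n == 0 && return zero(eltype(x)) * zero(eltype(y))
    chunk = cld(n, clamp(nchunks, 1, n))          # elements per piece
    tasks = map(Iterators.partition(1:n, chunk)) do r
        Threads.@spawn partial_dot(x, y, r)
    end
    return sum(fetch, tasks)
end

x = rand(10^6); y = rand(10^6)
println(chunked_dot(x, y) ≈ dot(x, y))            # results should agree up to rounding
```

Timing `chunked_dot(a, b; nchunks = k)` for increasing `k` shows at what thread count the time stops improving, which can then be compared with the one-core-per-memory-channel guideline.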
