Slowdown when running several parallel Julia processes that use BLAS (MWE provided)


#1

I observe a strange problem. I am doing Monte Carlo simulations, so for now I start multiple Julia processes from bash, each of which gathers statistics.
Running 4 processes in parallel, I see a 4-fold slowdown in each of them compared to when I start only a single process.

The processes use BLAS.gemv! for the dense matrix-vector product, and profiling in both cases shows that the time spent in BLAS.gemv! increases significantly when running multiple processes, so my guess is that this is the primary problem here. But I can’t understand why it happens.

Each process calls BLAS.set_num_threads(1) at startup, so there should be no problem with too many threads being used.
Does anyone have ideas about this?


#2

Is the 4-fold slowdown based on a comparison of multi-threaded BLAS on one process vs single-threaded BLAS on 4 processes? If so, that would probably explain it.

There are also other potential causes: BLAS libraries tend to be tuned on the assumption that they are the main compute-intensive thing running on the machine (so as to take full advantage of available resources), so running multiple instances at the same time may interfere with things like hyperthreading, caching, etc.


#3

No. In both cases I run BLAS in single-threaded mode. That is why I can’t understand what is going on.


#4

Ok. Here is a simple MWE to test this out.

Create a file “task1.jl” with the following code:

Base.BLAS.set_num_threads(1)
@inline blas_A_mul_B!(α::T, A::Matrix{T}, B::Vector{T}, β::T, C::Vector{T}) where T<:Number = Base.BLAS.gemv!('N', α, A, B, β, C)

srand(1234)

const D = 720
const A = rand(D,D)

function test_matvec(IM::Matrix{Float64}, N::Int64)
    x = rand(size(IM,1))
    y = rand(size(IM,1))
    t1 = time()
    @inbounds for i=1:N
        blas_A_mul_B!(one(Float64), IM, x, zero(Float64), y)
    end
    return time()-t1
end

test_matvec(A, 1)

t = test_matvec(A,500000)

io = open("out1.txt","w")
print(io, "$t\n")
close(io)

After that, run in bash:

for i in `seq 2 4`; do cp task1.jl "task$i.jl"; sed -i -e "s/out1/out$i/g" "task$i.jl";  done

You can replace 4 with the number of processes you want.
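As a sanity check that the copy-and-sed step does what it should, here is a self-contained sketch: it uses a one-line stand-in for task1.jl (just the line sed needs to rewrite), so it can be run in an empty scratch directory.

```shell
# Stand-in for task1.jl: the only line the sed command needs to rewrite.
echo 'io = open("out1.txt","w")' > task1.jl
for i in `seq 2 4`; do
    cp task1.jl "task$i.jl"
    sed -i -e "s/out1/out$i/g" "task$i.jl"   # point each copy at its own output file
done
cat task3.jl   # prints: io = open("out3.txt","w")
```

Note that task1.jl itself is never edited, so it still writes to out1.txt.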

After that, I can start one process with julia task1.jl,
or I can start all 4 processes in parallel with

for i in `seq 1 4`; do julia "task$i.jl" & done

In the former case, the file “out1.txt” reports an execution time of approximately 36 seconds on my machine.
In the latter case, I get approximately 250 seconds in each of the out… files.
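A small, hedged variation on the launch loop above: adding wait blocks until every background copy has exited, so wrapping the loop in time measures the whole batch as one unit. Here sleep 1 stands in for julia "task$i.jl" to keep the sketch self-contained.

```shell
# Four background jobs started together should take ~1s of wall time, not ~4s;
# `wait` returns only after all of them have exited.
time (
    for i in `seq 1 4`; do
        sleep 1 &        # replace with: julia "task$i.jl" &
    done
    wait
)
```

If hyperthread siblings or scheduler migration are a concern, prefixing each command with taskset -c <core> (Linux) pins every copy to its own core.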


#5

Gregstrq, sorry to ask, but it is not clear what the specifications of your machine are.

cat /proc/cpuinfo please
If you are running on only one core, then you would expect running 4 things at the same time to slow them down.
I may well be horribly misunderstanding what is happening here, so please forgive me.


#6

My bad. I have an Intel i7 with 4 physical cores, or 8 logical cores in hyperthreading mode.
The output of /proc/cpuinfo is too long, so I don’t think it’s a good idea to copy it here, but believe me,
it lists those 8 logical cores.

If I had a single-core machine, this kind of behaviour wouldn’t be a surprise to me.
Could you try to test it on your machine?


#7

Greg, I ran your code snippet on 4 cores reserved on a server with twin Xeon E5-2667 CPUs.
I looped as you do and ran 4 copies of the code: 56 seconds each time, which is the same time as running one copy of the code.

Reserving 10 CPUs and running 10 copies of the code, the run times range between 56 and 69 seconds.


#8

I think one problem fits in your L3 cache, but 4 do not. John’s Xeon has a bigger cache and higher bandwidth to DRAM. You can check your cache sizes with lscpu.
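To put rough numbers on that (shell arithmetic only; the 6 MB and 25.6 MB L3 sizes are the ones reported in the lscpu outputs further down the thread): the MWE’s matrix is 720×720 Float64, so one copy fits in a 6 MB L3 while four concurrent copies do not.

```shell
one=$((720 * 720 * 8))    # bytes in one 720x720 Float64 matrix (8 bytes/element)
four=$((4 * one))         # four processes, one matrix each
echo $one                 # 4147200  (~4.0 MB: fits in a 6 MB L3)
echo $four                # 16588800 (~16.6 MB: spills a 6 MB L3, fits in 25.6 MB)
```

This also matches John’s 10-copy run: 10 matrices are ~41 MB, more than his 25.6 MB L3, which is consistent with the mild 56→69 s slowdown he saw.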


#9

Ralph, good point.

L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15


#10

L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7

It is true that my L3 cache is smaller.
By the way, is there a way to measure bandwidth to DRAM and to monitor the bus load?
On Linux, to be specific.
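(For the bus-load part, one guarded sketch with stock Linux tooling, assuming perf from the linux-tools package is installed; exact event names vary by CPU and kernel, see perf list. Last-level cache misses are a rough proxy for DRAM traffic.)

```shell
# Count cache activity over a 1-second window; misses at the last level
# mostly turn into DRAM accesses. Falls back gracefully without perf.
if command -v perf >/dev/null 2>&1; then
    perf stat -e cache-references,cache-misses -- sleep 1
else
    echo "perf not installed"
fi
```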


#11

I would say the STREAM benchmark: https://sites.utexas.edu/jdm4372/tag/stream-benchmark/

I may be laughed out of court of course.