Slowdown when running several parallel Julia processes that use BLAS (MWE provided)


I observe a strange problem. I am doing Monte Carlo, so for now I start multiple Julia processes from bash, each of which gathers statistics.
Running 4 processes in parallel, I see a 4-fold slowdown in the execution of each of them compared to when I start only a single process.

The processes use BLAS.gemv! for a dense matrix-vector product, and profiling in both cases shows that the time spent in BLAS.gemv! increases significantly when running multiple processes, so my guess is that this is the primary problem. But I can’t understand why it happens.

Each process calls BLAS.set_num_threads(1) at startup, so there should be no problem of too many threads being used.
Does anyone have ideas about this?
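
For reference, a minimal sketch of pinning BLAS to one thread and checking that the setting took effect. It assumes Julia 1.6 or newer, where BLAS lives in the LinearAlgebra standard library and BLAS.get_num_threads() is available; on older versions the module path differs.

```julia
using LinearAlgebra

# Restrict the BLAS library to a single thread for this process.
BLAS.set_num_threads(1)

# Read the setting back to confirm it was applied.
println("BLAS threads: ", BLAS.get_num_threads())
```

Running this at the top of each task file makes it easy to rule out thread oversubscription before looking at other causes.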


Is the 4-fold slowdown based on a comparison of multi-threaded BLAS on one process vs single-threaded BLAS on 4 processes? If so, that would probably explain it.

There are also other potential causes: BLAS libraries tend to be tuned based on the assumption that they’re the main compute intensive thing running on the machine (so as to take full advantage of available resources), so running multiple instances at the same time may mess up things like hyperthreading, caching, etc.


No. In both cases I run BLAS in single-threaded mode. That is why I can’t understand what is going on.


Ok. Here is a simple MWE to test this out.

Create a file “task1.jl” with the following code

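using LinearAlgebra  # BLAS lives in LinearAlgebra on Julia 0.7 and later (it was Base.BLAS before)

# C = α*A*B + β*C via BLAS gemv! ('N' means A is not transposed)
@inline blas_A_mul_B!(α::T, A::Matrix{T}, B::Vector{T}, β::T, C::Vector{T}) where T<:Number = BLAS.gemv!('N', α, A, B, β, C)


const D = 720
const A = rand(D,D)

function test_matvec(IM::Matrix{Float64}, N::Int64)
    x = rand(size(IM,1))
    y = rand(size(IM,1))
    t1 = time()
    @inbounds for i=1:N
        blas_A_mul_B!(one(Float64), IM, x, zero(Float64), y)
    end
    return time()-t1
end

test_matvec(A, 1)  # warm-up call so that compilation is not timed

t = test_matvec(A,500000)

# write the elapsed time in seconds to the output file
io = open("out1.txt","w")
print(io, "$t\n")
close(io)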

After that, run in bash:

for i in `seq 2 4`; do cp task1.jl "task$i.jl"; sed -i -e "s/out1/out$i/g" "task$i.jl";  done

You can replace 4 with the number of processes you want.

After that I can start one process with julia task1.jl.
Or I can start all 4 processes in parallel by

for i in `seq 1 4`; do julia "task$i.jl" & done

In the former case, the file “out1.txt” shows an execution time of approximately 36 seconds on my machine.
In the latter case I got approximately 250 seconds in each of the out… files.


Gregstrq, sorry to ask but it is not clear what the specifications of your machine are.

cat /proc/cpuinfo please
If you are running on one core only, then you would expect running 4 things at the same time to slow everything down.
I may well be horribly misunderstanding what is happening here, so please forgive me.
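
If copying /proc/cpuinfo is too unwieldy, a short summary can also be obtained from within Julia. This is only a sketch, assuming Julia 1.x, where Sys.CPU_THREADS and Sys.cpu_info() are available; note that it reports logical CPUs, so hyperthreads are counted too.

```julia
# Number of logical CPUs (hardware threads) that Julia detects.
println("logical CPUs: ", Sys.CPU_THREADS)

# Model string of the first CPU entry, if the platform reports any.
info = Sys.cpu_info()
if !isempty(info)
    println("CPU model: ", info[1].model)
end
```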


My bad. I have an Intel i7 with 4 physical cores, or 8 logical cores with hyperthreading.
The output of /proc/cpuinfo is too long, so I don’t think it’s a good idea to copy it here, but believe me,
it lists those 8 logical cores.

If I had a single-core machine, this kind of behaviour wouldn’t surprise me.
Could you try and test it on your machine?


Greg, I ran your code snippet on 4 cores reserved on a server with twin Xeon E5-2667 CPUs.
I looped as you do and ran 4 copies of the code: 56 seconds each time, which is the same time as running one copy of the code.

Reserving 10 CPUs and running 10 copies of the code, run times range between 56 and 69 seconds.


I think one problem fits in your L3 cache, 4 do not. John’s Xeon has a bigger cache and a higher bandwidth to DRAM. You can check your cache sizes with lscpu.
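
The arithmetic behind this can be sketched as follows. It is a rough estimate only: it counts just the 720×720 Float64 matrix from the MWE, ignores the two vectors and everything else competing for cache, and assumes the L3 cache is shared by all processes. The helper names are made up for illustration, and the two L3 sizes are the lscpu values reported further down in this thread.

```julia
# Bytes occupied by one dense D×D Float64 matrix.
matrix_bytes(D) = D * D * sizeof(Float64)

# true when `nproc` such matrices fit into an L3 cache of `l3_kib` KiB.
fits_in_l3(D, nproc, l3_kib) = nproc * matrix_bytes(D) <= l3_kib * 1024

D = 720
println("one matrix: ", round(matrix_bytes(D) / 2^20, digits = 2), " MiB")

for (label, l3_kib) in (("6144K L3", 6144), ("25600K L3", 25600))
    for nproc in (1, 4)
        println(label, ", ", nproc, " process(es): fits = ",
                fits_in_l3(D, nproc, l3_kib))
    end
end
```

One matrix is a bit under 4 MiB, so a single copy fits into a 6144K L3 while four copies (close to 16 MiB) do not; with 25600K all four still fit.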


Ralph, good point.

L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15


L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7

It is true that my L3 cache is smaller.
By the way, is there a way to measure bandwidth to DRAM and to monitor the bus load?
On Linux, to be specific.


I would say the STREAM benchmark.

I may be laughed out of court of course.
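
As a quick stand-in, something in the spirit of the STREAM "triad" kernel can be written directly in Julia. This is not the official benchmark, just a rough sketch: the function name and default parameters are made up for illustration, and the byte count assumes two reads and one write per element, which ignores effects such as write-allocate traffic.

```julia
# Rough triad-style estimate of memory bandwidth in GB/s.
function triad_bandwidth(n::Int; reps::Int = 5, s::Float64 = 3.0)
    a = zeros(n)
    b = rand(n)
    c = rand(n)
    best = Inf
    for _ in 1:reps
        # a[i] = b[i] + s * c[i]: reads b and c, writes a
        t = @elapsed (a .= b .+ s .* c)
        best = min(best, t)               # keep the fastest repetition
    end
    bytes = 3 * sizeof(Float64) * n       # two reads plus one write per element
    return bytes / best / 1e9
end

# The arrays should be much larger than the L3 cache so that DRAM is exercised.
println("approx. bandwidth: ",
        round(triad_bandwidth(5_000_000), digits = 1), " GB/s")
```

Running one copy and then several copies at once (with the same bash loop as above) should show whether the per-process bandwidth drops when the processes compete for memory.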