Were you able to solve the problem?
I am the author of the thread "Slow down when running several parallel julia processes which use BLAS (MWE is provided)".
In that example, each process was doing many matrix-vector products. The problem was that all the matrices combined could not fit into the cache. As a result, each time a process accessed matrix elements, it forced a cache reload.
Eventually, I dealt with the problem simply by making the matrix shared between processes (using SharedArrays from stdlib).
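For illustration, here is a minimal sketch of that kind of setup. It is not the code from the thread: the worker count, the matrix size `n`, and the `work` function are made up for the example. It assumes all workers run on the same machine, since a SharedArray lives in shared memory on one host.

```julia
using Distributed
addprocs(2)                        # arbitrary worker count for this sketch

@everywhere using SharedArrays

n = 500                            # hypothetical matrix size
A = SharedArray{Float64}(n, n)     # one copy in shared memory, visible to all local workers
copyto!(A, rand(n, n))             # fill it once from the master process

# Hypothetical workload: a matrix-vector product against the shared matrix.
# sdata(A) returns the plain Array backing the SharedArray, so no copy is made.
@everywhere function work(A, seed)
    x = fill(Float64(seed), size(A, 2))
    return sum(sdata(A) * x)
end

# Only a reference to A is sent to the workers, not the matrix data.
results = pmap(seed -> work(A, seed), 1:4)
```

The point of the design is that every worker reads the same memory, so the combined working set is one matrix rather than one matrix per process.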
Do you start the Julia processes independently? It may be more beneficial to use the Distributed stdlib.