Hi, I am trying to enhance the performance of my Julia code by using all the cores on my computer. For this, I am using the FLoops package. My code is simple: I am generating 100 instances of a random density matrix of size 1000 x 1000 and calculating the mean entanglement entropy. Unfortunately, I cannot get the desired speed up for the below code:

I am using the Distributions for generating random matrices, and the QuantumInformation package for calculating the entropy. Is there any way to speed it up, or I have got it wrong?

You likely wonât get significant speedups from paralellism, since the matrix multiplication and matrix logarithm are already multithreaded (by BLAS). That said, I think the algorithm can likely be sped up significantly.

For one, thing C*C' can be sampled directly (itâs a Wishart distribution). Iâm not actually convinced that there isnât a relatively simple scalar distribution you can sample directly that would give the same result.

Well as far as I know, this is the procedure to generate a Wishart ensemble, because we have to make sure that the matrix is positive semidefinite, and any positive semidefinite matrix has this form C*C'.

But I still think even Distributions.Wishart internally would implement X*X'. But, I was expecting to parallelize the code like this: I need to generate 100 instances of the matrix, and I have 10 threads on my computer. Is it not possible that the work is parallelly divided among all the threads such that each gets to handle 10 matrices and I get a 10x speed up? I am new to this, so this might sound naive.

julia> function g()
e = 0.0
C = Matrix{Float64}(undef, 1000, 1000)
Ď = similar(C)
for _ = 1:100
rand!(Normal(), C)
LinearAlgebra.mul!(Ď, C, C')
Ď ./= tr(Ď)
e += vonneumann_entropy(Ď)/log(2)
end
e
end

doesnât seem to make it faster but at least it reduces memory allocation by 70%

itâs nothing special, just the name of the function ending with ! is a hint that it modifies one of its arguments, thatâs it, just a naming convention for functions.

I am not sure if this helps, but you can tell BLAS manually how many threads to use. Doing this in conjunction with using @floop resulted in much faster code for me:

function baseline()
e = 0
for _ in 1:100
C = rand(Normal(), 1000, 1000)
p = C*C'
p = p / tr(p)
e += vonneumann_entropy(p)/log(2)
end
e/100
end
function baseline_floop()
e = 0
@floop for _ in 1:100
C = rand(Normal(), 1000, 1000)
p = C*C'
p = p / tr(p)
@reduce e += vonneumann_entropy(p)/log(2)
end
e/100
end

I obtain the following when setting BLAS.set_num_threads(1)

Hi, thanks for your response, it does help. But I am just wondering why setting the num of threads for BLAS to 1 is giving such a speed up? Shouldnât we increase the number of threads to get speed up?