Parallel processing using FLoops

Hi, I am trying to improve the performance of my Julia code by using all the cores on my computer. For this, I am using the FLoops package. My code is simple: I generate 100 instances of a random 1000 x 1000 density matrix and calculate the mean entanglement entropy. Unfortunately, I cannot get the desired speed-up with the code below.
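Roughly, the code looks like this (the function name is just for illustration; vonneumann_entropy is from QuantumInformation, Normal from Distributions):

using Distributions, QuantumInformation, LinearAlgebra, FLoops

function mean_entropy()
    e = 0.0
    @floop for _ in 1:100
        C = rand(Normal(), 1000, 1000)   # 1000 x 1000 matrix of standard normals
        ρ = C * C'                       # positive semidefinite by construction
        ρ = ρ / tr(ρ)                    # normalize to unit trace -> density matrix
        @reduce e += vonneumann_entropy(ρ) / log(2)   # entropy in bits
    end
    e / 100
end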

I am using the Distributions package for generating the random matrices and the QuantumInformation package for calculating the entropy. Is there any way to speed it up, or have I got something wrong?

You likely won’t get significant speedups from parallelism, since the matrix multiplication and the matrix logarithm are already multithreaded (by BLAS). That said, I think the algorithm itself can likely be sped up significantly.

Can you elaborate on this, please?

Is there not a way to do this without materializing a huge matrix, and then another one, and a third one?

I am not really sure what you mean, but since what we have is a random matrix, we need to take several instances of it and then take the average.

For one thing, C*C' can be sampled directly (it’s a Wishart distribution). I’m also not convinced that there isn’t a relatively simple scalar distribution you could sample directly that would give the same result.

Well, as far as I know, this is the procedure to generate a Wishart ensemble: we have to make sure that the matrix is positive semidefinite, and any positive semidefinite matrix has the form C*C'.

https://juliastats.org/Distributions.jl/stable/matrix/#Distributions.Wishart
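For example, the product can be sampled in one call (a sketch; for a 1000 x 1000 standard normal C, C*C' is Wishart-distributed with 1000 degrees of freedom and identity scale):

using Distributions, LinearAlgebra

W = rand(Wishart(1000, Matrix(1.0I, 1000, 1000)))   # same distribution as C*C'
ρ = W / tr(W)                                       # normalize to unit trace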

But I would still suspect that even Distributions.Wishart implements X*X' internally. In any case, I was expecting to parallelize the code like this: I need to generate 100 instances of the matrix, and I have 10 threads on my computer. Is it not possible to divide the work among the threads so that each handles 10 matrices, giving me a 10x speed-up? I am new to this, so this might sound naive.

you might be correct.

No, as Oscar said, the C * C' is already multithreaded because BLAS is multithreaded, so dividing the work like this shouldn’t give you a linear speed-up.
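You can check both thread pools from the REPL:

using LinearAlgebra
BLAS.get_num_threads()   # how many threads BLAS uses for calls like C * C'
Threads.nthreads()       # how many threads Julia itself was started with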

Thanks @Oscar_Smith and @jling . I get it now.

function g()
    e = 0.0
    # allocate the work buffers once, outside the loop
    C = Matrix{Float64}(undef, 1000, 1000)
    ρ = similar(C)
    for _ = 1:100
        rand!(Normal(), C)            # fill C with standard normals in place
        LinearAlgebra.mul!(ρ, C, C')  # ρ = C * C' without allocating
        ρ ./= tr(ρ)                   # normalize to unit trace in place
        e += vonneumann_entropy(ρ) / log(2)
    end
    e
end

This doesn’t seem to make it faster, but at least it reduces memory allocations by 70%.


Why do I get a rand! not defined error?

using Distributions, QuantumInformation, Random, LinearAlgebra   # rand! comes from Random

OK, I see, thanks again! Also, can you please suggest an article which explains the usage of ! in function names in Julia?

https://docs.julialang.org/en/v1/manual/variables/#Stylistic-Conventions

It’s nothing special: a function name ending with ! is just a hint that the function modifies one or more of its arguments. That’s it, just a naming convention.
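For example:

v = [3, 1, 2]
sort(v)    # returns a new sorted vector; v is unchanged
sort!(v)   # sorts v in place; afterwards v == [1, 2, 3]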

Oh Alright! Thank you!

I am not sure if this helps, but you can tell BLAS manually how many threads to use. Doing this in conjunction with using @floop resulted in much faster code for me:

using Distributions, QuantumInformation, LinearAlgebra, FLoops

function baseline()
    e = 0.0
    for _ in 1:100
        C = rand(Normal(), 1000, 1000)
        p = C * C'
        p = p / tr(p)                       # normalize to unit trace
        e += vonneumann_entropy(p) / log(2)
    end
    e / 100
end

function baseline_floop()
    e = 0.0
    @floop for _ in 1:100
        C = rand(Normal(), 1000, 1000)
        p = C * C'
        p = p / tr(p)
        @reduce e += vonneumann_entropy(p) / log(2)
    end
    e / 100
end

I obtain the following when setting BLAS.set_num_threads(1):

@time baseline_floop()
  6.060738 seconds (2.23 k allocations: 3.017 GiB, 0.15% gc time)

Compared to what I get when BLAS uses 8 threads (the default on my machine):

@time baseline_floop()
 11.550902 seconds (2.23 k allocations: 3.017 GiB, 0.09% gc time)

My Julia was started with 4 threads.


Hi, thanks for your response, it does help. But I am just wondering why setting the number of BLAS threads to 1 gives such a speed-up? Shouldn’t we increase the number of threads to get a speed-up?

BLAS doesn’t give perfect parallelization, so if FLoops can parallelize the outer loop perfectly (the 100 samples are fully independent), that will be more efficient than relying on BLAS threading. With both pools active, the 4 Julia threads each spawning multithreaded BLAS calls also oversubscribe the cores and end up fighting each other.
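So when you parallelize the outer loop yourself, the usual pattern is to give BLAS a single thread:

using LinearAlgebra
BLAS.set_num_threads(1)   # one BLAS thread per Julia thread avoids oversubscription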