Parallel processing using FLoops

Hi, I am trying to improve the performance of my Julia code by using all the cores on my computer. For this, I am using the FLoops package. My code is simple: I generate 100 instances of a random 1000 x 1000 density matrix and calculate the mean entanglement entropy. Unfortunately, I cannot get the desired speed-up with the code below.
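Roughly, the code looks like this (the function name is just for illustration; vonneumann_entropy is from QuantumInformation, Normal from Distributions):

using Distributions, QuantumInformation, LinearAlgebra, FLoops

function mean_entropy()
    e = 0.0
    @floop for _ in 1:100
        C = rand(Normal(), 1000, 1000)   # 1000 x 1000 matrix of standard normals
        ρ = C * C'                       # positive semidefinite by construction
        ρ = ρ / tr(ρ)                    # normalize to unit trace -> density matrix
        @reduce e += vonneumann_entropy(ρ) / log(2)   # entropy in bits
    end
    e / 100
end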

I am using the Distributions package for generating the random matrices and the QuantumInformation package for calculating the entropy. Is there any way to speed it up, or have I got something wrong?

You likely won’t get significant speedups from parallelism, since the matrix multiplication and the matrix logarithm are already multithreaded (by BLAS). That said, I think the algorithm itself can likely be sped up significantly.

Can you elaborate on this, please?

Is there not a way to do this without materializing a huge matrix, and then another one, and a third one?

I am not really sure what you mean, but since what we have is a random matrix, we need to take several instances of it and then take the average.

For one thing, C*C' can be sampled directly (it’s a Wishart distribution). I’m also not convinced that there isn’t a relatively simple scalar distribution you could sample directly that would give the same result.

Well, as far as I know, this is the procedure to generate a Wishart ensemble: we have to make sure that the matrix is positive semidefinite, and any positive semidefinite matrix has the form C*C'.

https://juliastats.org/Distributions.jl/stable/matrix/#Distributions.Wishart
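For example, the product can be sampled in one call (a sketch; for a 1000 x 1000 standard normal C, C*C' is Wishart-distributed with 1000 degrees of freedom and identity scale):

using Distributions, LinearAlgebra

W = rand(Wishart(1000, Matrix(1.0I, 1000, 1000)))   # same distribution as C*C'
ρ = W / tr(W)                                       # normalize to unit trace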

But I would still suspect that even Distributions.Wishart implements X*X' internally. In any case, I was expecting to parallelize the code like this: I need to generate 100 instances of the matrix, and I have 10 threads on my computer. Is it not possible to divide the work among the threads so that each handles 10 matrices, giving me a 10x speed-up? I am new to this, so this might sound naive.

you might be correct.

No, as Oscar said, the C * C' is already multithreaded because BLAS is multithreaded, so dividing the work like this shouldn’t give you a linear speed-up.
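You can check both thread pools from the REPL:

using LinearAlgebra
BLAS.get_num_threads()   # how many threads BLAS uses for calls like C * C'
Threads.nthreads()       # how many threads Julia itself was started with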

Thanks @Oscar_Smith and @jling . I get it now.

function g()
    e = 0.0
    # allocate the work buffers once, outside the loop
    C = Matrix{Float64}(undef, 1000, 1000)
    ρ = similar(C)
    for _ = 1:100
        rand!(Normal(), C)            # fill C with standard normals in place
        LinearAlgebra.mul!(ρ, C, C')  # ρ = C * C' without allocating
        ρ ./= tr(ρ)                   # normalize to unit trace in place
        e += vonneumann_entropy(ρ) / log(2)
    end
    e
end

This doesn’t seem to make it faster, but at least it reduces memory allocations by 70%.


Why do I get a rand! not defined error?

using Distributions, QuantumInformation, Random, LinearAlgebra   # rand! comes from Random

OK, I see, thanks again! Also, can you please suggest an article which explains the usage of ! in function names in Julia?

https://docs.julialang.org/en/v1/manual/variables/#Stylistic-Conventions

It’s nothing special: a function name ending with ! is just a hint that the function modifies one or more of its arguments. That’s it, just a naming convention.
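For example:

v = [3, 1, 2]
sort(v)    # returns a new sorted vector; v is unchanged
sort!(v)   # sorts v in place; afterwards v == [1, 2, 3]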

Oh Alright! Thank you!

I am not sure if this helps, but you can tell BLAS manually how many threads to use. Doing this in conjunction with using @floop resulted in much faster code for me:

using Distributions, QuantumInformation, LinearAlgebra, FLoops

function baseline()
    e = 0.0
    for _ in 1:100
        C = rand(Normal(), 1000, 1000)
        p = C * C'
        p = p / tr(p)                       # normalize to unit trace
        e += vonneumann_entropy(p) / log(2)
    end
    e / 100
end

function baseline_floop()
    e = 0.0
    @floop for _ in 1:100
        C = rand(Normal(), 1000, 1000)
        p = C * C'
        p = p / tr(p)
        @reduce e += vonneumann_entropy(p) / log(2)
    end
    e / 100
end

I obtain the following when setting BLAS.set_num_threads(1):

@time baseline_floop()
  6.060738 seconds (2.23 k allocations: 3.017 GiB, 0.15% gc time)

Compared to what I get when BLAS uses 8 threads (the default on my machine):

@time baseline_floop()
 11.550902 seconds (2.23 k allocations: 3.017 GiB, 0.09% gc time)

My Julia was started with 4 threads.


Hi, thanks for your response, it does help. But I am just wondering why setting the number of BLAS threads to 1 gives such a speed-up? Shouldn’t we increase the number of threads to get a speed-up?

BLAS doesn’t give perfect parallelization, so if FLoops can parallelize the outer loop perfectly (the 100 samples are fully independent), that will be more efficient than relying on BLAS threading. With both pools active, the 4 Julia threads each spawning multithreaded BLAS calls also oversubscribe the cores and end up fighting each other.
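So when you parallelize the outer loop yourself, the usual pattern is to give BLAS a single thread:

using LinearAlgebra
BLAS.set_num_threads(1)   # one BLAS thread per Julia thread avoids oversubscription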