Code speed decreases when loop number increases

Hi

I am using a small code to generate some results. Now I want to take the average of different realizations. So I simply add a more for loop with loop number Nr. But I find that when I increase Nr, the code speed becomes slow. When Nr = 100, the total time obtained by @time is roughly 1500 seconds. But when I increase Nr = 500, the speed is suddenly much slower (it runs more than four hours and haven’t finished yet.)

The code is attached below. I want to know the reason why the code speed decreases.

using LinearAlgebra
using StatsBase
using Statistics
using Random
using Plots
using RandomMatrices

function main()
    theta = 1.39
    #println(theta)
    Nk = 800
    Nr = 500
    ps = 3
    sum_z = zeros(Nk,1)
    sum_zlog = zeros(Nk,1)
    sum_s = zeros(Nk,1)
    sum_slog = zeros(Nk,1)
    for i in 1:Nr
        Npool = Int(floor(10^(ps)))
        pool = zeros(Npool, 1)
        pool[:,1] .= 0.5

        pool_sample = zeros(Npool,1)

        z_list = zeros(Nk,1) 

        s_list = zeros(Nk,1) 

        z_log_list = zeros(Nk,1)

        s_log_list = zeros(Nk,1)

        Kup= [cos(theta) 0; 0 sin(theta)]
        Kdown = [sin(theta) 0; 0 cos(theta)]
        
        P2up = kron([1. 0.;0. 1.], [1 0; 0 0])
        P2down = kron([1 0;0 1],[0 0;0 1])

        h = Haar(2)
        for k in 1:Nk
            poolcount  =0
            while poolcount < Npool
                z1 = pool[rand(1:Npool),1]
                rho1 = diagm(0=>[z1,1-z1])

                z2 = pool[rand(1:Npool),1]
                rho2 = diagm(0=>[z2,1-z2])

                u1 = rand(h,2)
                u2 = rand(h,2)

                K1up = u1*Kup*u1'
                K1down = u1*Kdown*u1'
                
                K2up = u2*Kup*u2'
                K2down = u2*Kdown*u2'

                rho1p = K1up*rho1*K1up'
                rho2p = K2up*rho2*K2up'

                p1 = real(tr(rho1p+rho1p'))/2
                p2 = real(tr(rho2p+rho2p'))/2

          
                rho1p = (rho1p+rho1p')/(2*p1)
            
                    
                rho2p = (rho2p+rho2p')/(2*p2)
             

                rho = kron(rho1p,rho2p)

                U = rand(h,4)
                
                rho_p = P2up*U*rho*U'*P2up'
                
                p = real(tr(rho_p+rho_p'))/2

             
                temp =(rho_p+rho_p')/2
                rho_f = [tr(temp[1:2,1:2]) tr(temp[1:2,3:4]);tr(temp[3:4,1:2]) tr(temp[3:4,3:4])]/(p)
             
                    
              
                rho_f = (rho_f+rho_f')/2
                t = abs(tr(rho_f*rho_f))
                z = (1-t)/(1+sqrt(2*t-1))
                if abs(z)>0
                    poolcount = poolcount+1
                    pool_sample[poolcount,1] = abs(z)
                end
            end

            pool = copy(pool_sample)

            pool_sample[:,1] .= 0


            z_log_list[k,1] = mean(log.(pool),dims =1)[1]

            s_log_list[k,1] = std(log.(pool),dims = 1)[1]/sqrt(Npool)

            z_list[k,1] = mean(pool,dims =1)[1]

            s_list[k,1] = std(pool,dims = 1)[1]/sqrt(Npool)


        end

        sum_z += z_list
        sum_zlog += z_log_list
        sum_s += s_list.^2
        sum_slog += s_log_list.^2
    end
    return vec(sum_z)/Nr, sqrt.(vec(sum_s))/Nr,vec(sum_zlog)/Nr, sqrt.(vec(sum_slog))/Nr

end

@time z0,s0, z1,s1 = main();

Hm, this is quite a big kernel. I tried running it with

    Nk = 80
    Nr = 1

through Juno profiler and here is what I see

So where should we start optimizing this?

1 Like

Hi sorry I have very limited coding background. I think the part which takes some of the time is generating random matrices. I guess my question is that why when I change Nr from 100 to 500, the speed decreases a lot. (The total time with Nr =500 is much larger than 5 times that with Nr = 100). Thanks.

Hi, I have very limited background in your domain. From a cursory glance I’d hope this could improve your program (wondering why this isn’t mentioned in the documentation)?

A basic issue is that you are allocating tons of little matrices over and over, and so at some point it will start spending all of its time doing garbage collection.

The solution is to (a) allocate arrays like pool once and re-use and (b) use StaticArrays for all of your little 2x2 and 4x4 matrices, since those are fixed sizes that you can tell the compiler to exploit.

I translated your code to use these principles, without changing what it does (I think), and also removed some Matlab-isms like allocating 1-column matrices when what you really want is a 1d array. It should be much faster:

using LinearAlgebra
using StatsBase
using Statistics
using Random
using StaticArrays

# adapted from rand(Haar(2)) in RandomMatrices, modified
# to use staticarrays, only for beta=2
function rand_haar2(::Val{n}) where n
    M = @SMatrix randn(ComplexF64, n,n) # = rand(Ginibre(2, n))
    q = qr(M).Q
    L = cispi.(2 .* @SVector(rand(n)))
    return q*diagm(L)
end

function main2(; ps=3, theta = 1.39, Nk = 800, Nr = 500)
    sum_z = zeros(Nk)
    sum_zlog = zeros(Nk)
    sum_s = zeros(Nk)
    sum_slog = zeros(Nk)
    
    Npool = 10^ps
    pool = zeros(Npool)
    pool_sample = zeros(Npool)
    logpool = zeros(Npool)
    z_list = zeros(Nk) 
    s_list = zeros(Nk) 
    z_log_list = zeros(Nk)
    s_log_list = zeros(Nk)

    Kup= @SMatrix[cos(theta) 0; 0 sin(theta)]
    Kdown = @SMatrix[sin(theta) 0; 0 cos(theta)]
    P2up = kron(@SMatrix[1. 0.;0. 1.], @SMatrix[1 0; 0 0])
    P2down = kron(@SMatrix[1 0;0 1], @SMatrix[0 0;0 1])
    
    for i in 1:Nr
        pool .= 0.5
        pool_sample .= 0
        z_list .= 0
        s_list .= 0
        z_log_list .= 0
        z_log_list .= 0

        for k in 1:Nk
            poolcount = 0
            while poolcount < Npool
                z1 = pool[rand(1:Npool)]
                rho1 = diagm(@SVector[z1,1-z1])

                z2 = pool[rand(1:Npool)]
                rho2 = diagm(@SVector[z2,1-z2])

                u1 = rand_haar2(Val{2}())
                u2 = rand_haar2(Val{2}())

                K1up = u1*Kup*u1'
                K1down = u1*Kdown*u1'
                
                K2up = u2*Kup*u2'
                K2down = u2*Kdown*u2'

                rho1p = K1up*rho1*K1up'
                rho2p = K2up*rho2*K2up'

                p1 = real(tr(rho1p+rho1p'))/2
                p2 = real(tr(rho2p+rho2p'))/2
          
                rho1p = (rho1p+rho1p')/(2*p1)
                rho2p = (rho2p+rho2p')/(2*p2)
             
                rho = kron(rho1p,rho2p)

                U = rand_haar2(Val{4}())
                rho_p = P2up*U*rho*U'*P2up'
                p = real(tr(rho_p+rho_p'))/2
             
                temp =(rho_p+rho_p')/2
                #rho_f = @SMatrix[tr(temp[1:2,1:2]) tr(temp[1:2,3:4]); tr(temp[3:4,1:2]) tr(temp[3:4,3:4])]/(p)
                rho_f = @SMatrix[temp[1,1]+temp[2,2] temp[1,3]+temp[2,4]; temp[3,1]+temp[4,2] temp[3,3]+temp[4,4]]/(p)
                
                rho_f = (rho_f+rho_f')/2
                t = abs(tr(rho_f*rho_f))
                z = (1-t)/(1+sqrt(2*t-1))
                if !iszero(z)
                    poolcount = poolcount+1
                    pool_sample[poolcount] = abs(z)
                end
            end

            pool .= pool_sample
            pool_sample .= 0

            logpool .= log.(pool)
            z_log_list[k] = mean(logpool)
            s_log_list[k] = std(logpool)/sqrt(Npool)
            z_list[k] = mean(pool)
            s_list[k] = std(pool)/sqrt(Npool)
        end

        sum_z .+= z_list
        sum_zlog .+= z_log_list
        sum_s .+= s_list.^2
        sum_slog .+= s_log_list.^2
    end
    return sum_z/Nr, sqrt.(sum_s)/Nr, sum_zlog/Nr, sqrt.(sum_slog)/Nr
end
8 Likes

The results are convincing:

Edit: around 20 x speedup
Edit: what about function rand_haar2(::Val{n}) where n though, is this a defect?
Edit:

This is something we only see indirectly (via allocations in red) reflected in the profiler, right?

I’m not sure what you mean by a “defect”.

The problem is that RandomMatrices.jl doesn’t currently support returning a StaticArray (RandomMatrices.jl#64). The only way to do it was to copy the source code for the function you were calling and adapt it to return an SMatrix. Fortunately, the function was extremely short and simple, since you only needed the Haar(2) case.

1 Like

Defect in RandomMatrices as you pointed out.

Looks very good! I will try it on my computer. A dumb question is what is the difference between StaticArrays and regular arrays. Thanks

It is not a dumb question. Static arrays are immutable and regular arrays are mutable. Understanding the implications of that is not obvious. I wrote this when I first started to grasp the utility and importance of the difference: Immutable variables · JuliaNotes.jl.

Maybe it helps.

1 Like

Thanks, this note is really useful!

Hi guys. The code using StaticArrays performs really good, much faster than my previous version. Now I try to random seed the function rand_haar2, so I tried

const local_rng = Future.randjump(MersenneTwister(123),1)

function rand_haar2(::Val{n}) where n
    M = @SMatrix randn(local_rng,ComplexF64, (n,n)) # = rand(Ginibre(2, n))
    q = qr(M).Q
    L = cispi.(2 .* @SVector(rand(n)))
    return q*diagm(L)
end

But it doesn’t work. I got

ERROR: TypeError: in Type, in parameter, expected Type, got a value of type MersenneTwister

Anyone knows why? The reason I want to seed the random number generator is because I try to use parallel computing. Thanks!

That’s not entirely true, is it? StaticArrays also provides mutable versions of arrays, although the SMatrix used in this case is of course the immutable kind.

1 Like

Indeed, the difference is more about in what type of memory the array is stored. How mutable static arrays are implemented to be stored on the stack is beyond my understanding.

Another crucial difference, perhaps the main difference, is that StaticArrays have fixed size, encoded into their types.

4 Likes

I actually don’t see what mutability has to do with it. A piece of memory on the stack is no different in the way it is addressed compared to one on the heap. Maybe LLVM treats the stack wildly different, but in C/C++ an array on the stack can be used in the same way as one on the heap. It’s just that it can’t outlive a certain scope.

1 Like

Here is a better source, seems didactical: Chapter 6: Static arrays

1 Like

It is somewhat unfortunate that this page talks about ‘static arrays’ in the context of C++. The concept ‘static’ in C++ has quite a few different meanings (with new ones being defined with every specification update it seems). But ‘static’ arrays as described on that page would have better been called ‘fixed-size’ or ‘constant-size’ arrays (which then matches Julia’s StaticArrays.jl types).

A static array in C++ usually implies either having static linkage (meaning it’s not visible outside the defining translation unit, i.e. C++ file), or having static initialization (meaning it will keep it’s value over multiple calls of the function in which it is defined). But this is getting pretty far off-topic for a Julia forum :slight_smile:

2 Likes

Note that @jisutich resolved this problem in Error when @SMatrix randn(rng,ComplexF64,(2,2)) · Issue #1019 · JuliaArrays/StaticArrays.jl · GitHub — apparently the preferred syntax is now:

M = randn(local_rng, SMatrix{n,n,ComplexF64})

which works with a local_rng argument, unlike @SMatrix.