Very slow loop

Hi all,

I am trying to speed up this loop. I am providing a MWE and I wanted to know if any of you could spot any obvious performance improvements.

using LinearAlgebra, Statistics, Random

C=rand([0,1],120,13,8000)

D = rand(120)

E=findall(sum(sum(C,dims=1), dims=2)[1,1,:].>0)

P=rand(120,8000)
H=(P.>0.5)
H_D=(D.>0.5)
G_C=(D.<0.05)
A=ones(8000,13)
NF=8000


@views function fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
    J=size(C)[1]
    Tmax=size(C)[2]
    T=size(C)[2]-1
    N=size(C)[3]
    num_large0 = zeros(2)
    num_small0 = zeros(2)
    num_large1 = zeros(2)
    num_small1 = zeros(2)
    num2 = zeros(1)
    num_b0=zeros(1)
    num_b1=zeros(1)
    K = size(E)[1]

    if N>NF
        A= vcat(A,A,A)
        H = cat(H ,H ,H ,dims=2)
        P= cat(P,P,P,dims=2)
    end

    @fastmath @inbounds for n=1:K, t=2:Tmax, j=1:J
            c0 = C[j,t,E[n]]*A[E[n],t]
            c1 = C[j,t-1,E[n]]*(C[j,t,E[n]])*A[E[n],t]

            num_large0[1] = num_large0[1] .+ c0.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small0[1] = num_small0[1] .+ c0.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]

            num_large1[1] = num_large1[1] .+ c1.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small1[1] = num_small1[1] .+ c1.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]

            num_b0[1] = num_b0[1] .+ c0.*(1-G_C[j])*P[j,E[n]]*D[j]
            num_b1[1] = num_b1[1] .+ c1.*(1-G_C[j])*P[j,E[n]]*D[j]


            if  H_D[j]==1
            num_large0[2] = num_large0[2] .+ c0.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small0[2] = num_small0[2] .+ c0.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]

            num_large1[2] = num_large1[2] .+ c1.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small1[2] = num_small1[2] .+ c1.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]
            end
    end
    return vcat(num_large0./(J*N),  num_large1./(J*N), num_small0./(J*N), num_small1./(J*N), num_b0./(J*N), num_b1./(J*N))
end


TEST=@time fun1(C,D,E,P,H,H_D)

Performance result:

91.233291 seconds (1.18 G allocations: 23.470 GiB, 1.82% gc time)

EDIT: Provided MWE. Please disregard how the data is being generated.

Thanks in advance!

In these two lines you seem to be creating new vectors c0 and c1 at every iteration of the loop. If you can predict their size in advance, create them before the loop and mutate their values, that will be better. If those vectors are small, you may want to assign the values to immutable types (tuples, for example) instead of arrays.
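As a generic illustration of that advice (a sketch with made-up functions, not taken from the code above), compare allocating a fresh vector inside the loop with using a small immutable tuple:

```julia
# Allocating a fresh Vector on every iteration stresses the GC:
function f_alloc(n)
    s = 0.0
    for i in 1:n
        tmp = [Float64(i), 2.0 * i]  # heap allocation each iteration
        s += sum(tmp)
    end
    return s
end

# For small, fixed-size values, an immutable tuple avoids the allocation:
function f_tuple(n)
    s = 0.0
    for i in 1:n
        tmp = (Float64(i), 2.0 * i)  # no heap allocation
        s += sum(tmp)
    end
    return s
end
```

Both compute the same result, but the tuple version allocates nothing per iteration.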


Hi @lmiq, thanks for the reply!

Actually, those are scalars. Can you be more precise on how to write this?

That is the problem of not providing a MWE 🙂 The effort required to understand what you are doing is much greater.

I don’t understand why you are defining many variables as zeros(1), for example. If those are scalars, why not just simply use scalars?
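For instance (a minimal sketch, unrelated to the actual data):

```julia
# Accumulating into a 1-element vector means every update goes
# through array indexing for no benefit:
function acc_vec(xs)
    acc = zeros(1)
    for x in xs
        acc[1] = acc[1] + x
    end
    return acc[1]
end

# A plain scalar accumulator says the same thing more simply:
function acc_scalar(xs)
    acc = 0.0
    for x in xs
        acc += x
    end
    return acc
end
```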


What is the type of the elements in num_large0? They seem to be Ints, but at the same time, in that line you use .+ as if you were adding Vectors elementwise.

Also, what’s the purpose of expanding A, H and P artificially? It’s just the same data repeated three times along one dimension…

You are totally right, I added a MWE above!

It might be the case that the matrix C is three times larger than A, H and P. Since these are data, I need to replicate these matrices three times. Nonetheless, as you can see in the MWE I am providing above, NF==N.
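As an aside, if the replication really is needed, repeat expresses it more directly than chaining vcat/cat (a sketch with illustrative sizes):

```julia
A = ones(4, 3)
P = rand(2, 5)

A3 = repeat(A, 3, 1)  # three copies stacked along dim 1, same as vcat(A, A, A)
P3 = repeat(P, 1, 3)  # three copies along dim 2, same as cat(P, P, P, dims=2)

@assert A3 == vcat(A, A, A)
@assert P3 == cat(P, P, P, dims=2)
```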

I just added a MWE above to make these things clearer!

That’s still not a real MWE, because it contains several variables and is not really clear…
However, there are many operations which can be merged:

num_large0[1] = num_large0[1] .+ c0.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
num_small0[1] = num_small0[1] .+ c0.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]
       
num_large1[1] = num_large1[1] .+ c1.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
num_small1[1] = num_small1[1] .+ c1.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]

Can be rewritten to:

c0_n = c0 * P[j,E[n]] * D[j]
c1_n = c1 * P[j,E[n]] * D[j]

H2 = (H[j,E[n]]==2)
H1 = (H[j,E[n]]==1)

num_large0[1] += c0_n * H2
num_small0[1] += c0_n * H1
       
num_large1[1] += c1_n * H2
num_small1[1] += c1_n * H1

Such transformations make your code easier to read and modify, and also slightly faster, because you are reusing computations. Also, by using += we operate in place instead of creating a new array.

Without knowing the problem we don’t know whether these 3 nested loops are the best approach or not. You are already executing the inner part very often:

julia> size(C)[1] * size(C)[2] * size(E)[1]
12480000

Removing the unnecessary broadcast operations and precomputing the tests on H, as above, reduced the time here from 69 to 14 seconds:

Code
function fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
    J=size(C)[1]
    Tmax=size(C)[2]
    T=size(C)[2]-1
    N=size(C)[3]
    num_large0 = zeros(2)
    num_small0 = zeros(2)
    num_large1 = zeros(2)
    num_small1 = zeros(2)
    num_b0=0.
    num_b1=0.
    K = size(E)[1]

    if N>NF
        A= vcat(A,A,A)
        H = cat(H ,H ,H ,dims=2)
        P= cat(P,P,P,dims=2)
    end

    for n=1:K, t=2:Tmax, j=1:J
            c0 = C[j,t,E[n]]*A[E[n],t]
            c1 = C[j,t-1,E[n]]*(C[j,t,E[n]])*A[E[n],t]

            Htest1 = (H[j,E[n]]==1)
            Htest2 = (H[j,E[n]]==2)

            num_large0[1] = num_large0[1] + c0*Htest2*P[j,E[n]]*D[j]
            num_small0[1] = num_small0[1] + c0*Htest1*P[j,E[n]]*D[j]

            num_large1[1] = num_large1[1] + c1*Htest2*P[j,E[n]]*D[j]
            num_small1[1] = num_small1[1] + c1*Htest1*P[j,E[n]]*D[j]

            num_b0 = num_b0 + c0*(1-G_C[j])*P[j,E[n]]*D[j]
            num_b1 = num_b1 + c1*(1-G_C[j])*P[j,E[n]]*D[j]


            if  H_D[j]==1
            num_large0[2] = num_large0[2] + c0*Htest2*P[j,E[n]]*D[j]
            num_small0[2] = num_small0[2] + c0*Htest1*P[j,E[n]]*D[j]

            num_large1[2] = num_large1[2] + c1*Htest2*P[j,E[n]]*D[j]
            num_small1[2] = num_small1[2] + c1*Htest1*P[j,E[n]]*D[j]
            end
    end
    return vcat(num_large0./(J*N),  num_large1./(J*N), num_small0./(J*N), num_small1./(J*N), num_b0/(J*N), num_b1/(J*N))
end

The other important thing is that you are running the function with:

@time fun1(C,D,E,P,H,H_D)

But the function expects a series of optional parameters, which are not being passed:

function fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)

In this case G_C=G_C, for example, does not mean anything concrete. But when you run it, it takes G_C from the global scope, making it type-unstable. Either initialize the optional G_C with a concrete value (for example G_C = D .< 0.05) or pass it as a parameter when calling the function. If you do that, with:

julia> @time fun1(C,D,E,P,H,H_D,G_C=G_C,A=A,NF=NF)
 12.283477 seconds (682.22 M allocations: 10.170 GiB, 7.61% gc time)

time is reduced further. (The same arguments are valid for the other optional parameters you are passing to the function.) You can find out that your function is type-unstable with the previous function call with:
(Edit: I ran a wrong version before and reported a much smaller time.)

julia> @code_warntype fun1(C,D,E,P,H,H_D)
Variables
  #self#::Core.Compiler.Const(fun1, false)
  C::Array{Int64,3}
  D::Array{Float64,1}
  E::Array{Int64,1}
  P::Array{Float64,2}
  H::BitArray{2}
  H_D::BitArray{1}
  G_C::Any
  A::Any
  NF::Any

....

I have a version that runs in 7 s on my machine. Not sure I did not mess something up; it would be good to check that it gives the same answer.

using LinearAlgebra, Statistics, Random

C=rand([0,1],120,13,8000)

D = rand(120)

E=findall(sum(sum(C,dims=1), dims=2)[1,1,:].>0)

P=rand(120,8000)
H=convert.(Float64, (P.>0.5))
H_D=convert.(Float64, (D.>0.5))
G_C=convert.(Float64, (D.<0.05))
A=ones(8000,13)
NF=8000

@views function fun1(C,D,E,P,H,H_D,G_C,A,NF)
    J=size(C)[1]
    Tmax=size(C)[2]
    T=size(C)[2]-1
    N=size(C)[3]
    num_large0 = zeros(2)
    num_large1 = zeros(2)
    num_small0 = zeros(2)
    num_small1 = zeros(2)
    # num2 = zeros(1) # not used?
    num_b0 = num_b1 = 0.0 # changed from 1-element vectors to scalars
    K = size(E)[1]

    if N>NF
        A= vcat(A,A,A)
        H = cat(H ,H ,H ,dims=2)
        P= cat(P,P,P,dims=2)
    end

    @fastmath @inbounds for n=1:K, t=2:Tmax, j=1:J
            c0 = C[j,t,E[n]]*A[E[n],t]
            c1 = C[j,t-1,E[n]]*C[j,t,E[n]]*A[E[n],t] # removed unnecessary ()

            HjEn2 = convert(Float64, H[j,E[n]]==2)
            HjEn1 = convert(Float64, H[j,E[n]]==1)
            PjEnDj = P[j,E[n]] * D[j]
            term0 = c0 * PjEnDj
            term1 = c1 * PjEnDj
            c0H1 = term0 * HjEn1
            c0H2 = term0 * HjEn2
            c1H1 = term1 * HjEn1
            c1H2 = term1 * HjEn2

            num_large0[1] += c0H2
            num_small0[1] += c0H1

            num_large1[1] += c1H2
            num_small1[1] += c1H1

            num_b0 += c0 * (1.0 - G_C[j]) * PjEnDj
            num_b1 += c1 * (1.0 - G_C[j]) * PjEnDj

            if  H_D[j]==1
                num_large0[2] += c0H2
                num_small0[2] += c0H1

                num_large1[2] += c1H2
                num_small1[2] += c1H1
            end
    end
    return vcat(num_large0./(J*N),  num_large1./(J*N), num_small0./(J*N), num_small1./(J*N), num_b0./(J*N), num_b1./(J*N))
end

@time fun1(C,D,E,P,H,H_D, G_C, A, NF)
@time fun1(C,D,E,P,H,H_D, G_C, A, NF)

This part of the function is hugely detrimental to performance:

    if N>NF
        A= vcat(A,A,A)
        H = cat(H ,H ,H ,dims=2)
        P= cat(P,P,P,dims=2)
    end

All those matrices become type-unstable if you leave that there. The time goes down to 0.2 seconds here if I comment that out, and the result is the same.

Thus, my suggestion would be to put that in an outer function, and nest the call to the main function within it:

function outer(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
    N=size(C)[3]
    if N>NF
        A= vcat(A,A,A)
        H = cat(H ,H ,H ,dims=2)
        P= cat(P,P,P,dims=2)
    end
    fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
end


Result:

julia> @time outer(C,D,E,P,H,H_D,G_C=G_C,A=A,NF=NF)
  0.231706 seconds (113 allocations: 3.781 KiB)

Here is the complete code I am running:

Code
using LinearAlgebra, Statistics, Random

C=rand([0,1],120,13,8000)

D = rand(120)

E=findall(sum(sum(C,dims=1), dims=2)[1,1,:].>0)

P=rand(120,8000)
H=(P.>0.5)
H_D=(D.>0.5)
G_C=(D.<0.05)
A=ones(8000,13)
NF=8000

function outer(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
    N=size(C)[3]
    if N>NF
        A= vcat(A,A,A)
        H = cat(H ,H ,H ,dims=2)
        P= cat(P,P,P,dims=2)
    end
    fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF) 
end

function fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
    J=size(C)[1]
    Tmax=size(C)[2]
    T=size(C)[2]-1
    N=size(C)[3]
    num_large0 = zeros(2)
    num_small0 = zeros(2)
    num_large1 = zeros(2)
    num_small1 = zeros(2)
    num_b0=0.
    num_b1=0.
    K = size(E)[1]

    for n=1:K, t=2:Tmax, j=1:J
            c0 = C[j,t,E[n]]*A[E[n],t]
            c1 = C[j,t-1,E[n]]*(C[j,t,E[n]])*A[E[n],t]

            Htest1 = (H[j,E[n]]==1)
            Htest2 = (H[j,E[n]]==2)

            num_large0[1] = num_large0[1] + c0*Htest2*P[j,E[n]]*D[j]
            num_small0[1] = num_small0[1] + c0*Htest1*P[j,E[n]]*D[j]

            num_large1[1] = num_large1[1] + c1*Htest2*P[j,E[n]]*D[j]
            num_small1[1] = num_small1[1] + c1*Htest1*P[j,E[n]]*D[j]

            num_b0 = num_b0 + c0*(1-G_C[j])*P[j,E[n]]*D[j]
            num_b1 = num_b1 + c1*(1-G_C[j])*P[j,E[n]]*D[j]


            if  H_D[j]==1
            num_large0[2] = num_large0[2] + c0*Htest2*P[j,E[n]]*D[j]
            num_small0[2] = num_small0[2] + c0*Htest1*P[j,E[n]]*D[j]

            num_large1[2] = num_large1[2] + c1*Htest2*P[j,E[n]]*D[j]
            num_small1[2] = num_small1[2] + c1*Htest1*P[j,E[n]]*D[j]
            end
    end
    return vcat(num_large0./(J*N),  num_large1./(J*N), num_small0./(J*N), num_small1./(J*N), num_b0/(J*N), num_b1/(J*N))
end



Here is your original code with minimal modifications, just adding this outer function; it runs in 0.5 s:

Code
using LinearAlgebra, Statistics, Random

C=rand([0,1],120,13,8000)

D = rand(120)

E=findall(sum(sum(C,dims=1), dims=2)[1,1,:].>0)

P=rand(120,8000)
H=(P.>0.5)
H_D=(D.>0.5)
G_C=(D.<0.05)
A=ones(8000,13)
NF=8000

function outer(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
  N=size(C)[3]
  if N>NF
      A= vcat(A,A,A)
      H = cat(H ,H ,H ,dims=2)
      P= cat(P,P,P,dims=2)
  end
  fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
end

function fun1(C,D,E,P,H,H_D;G_C=G_C,A=A,NF=NF)
    J=size(C)[1]
    Tmax=size(C)[2]
    T=size(C)[2]-1
    N=size(C)[3]
    num_large0 = zeros(2)
    num_small0 = zeros(2)
    num_large1 = zeros(2)
    num_small1 = zeros(2)
    num2 = zeros(1)
    num_b0=zeros(1)
    num_b1=zeros(1)
    K = size(E)[1]

    for n=1:K, t=2:Tmax, j=1:J
            c0 = C[j,t,E[n]]*A[E[n],t]
            c1 = C[j,t-1,E[n]]*(C[j,t,E[n]])*A[E[n],t]

            num_large0[1] = num_large0[1] .+ c0.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small0[1] = num_small0[1] .+ c0.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]

            num_large1[1] = num_large1[1] .+ c1.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small1[1] = num_small1[1] .+ c1.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]

            num_b0[1] = num_b0[1] .+ c0.*(1-G_C[j])*P[j,E[n]]*D[j]
            num_b1[1] = num_b1[1] .+ c1.*(1-G_C[j])*P[j,E[n]]*D[j]


            if  H_D[j]==1
            num_large0[2] = num_large0[2] .+ c0.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small0[2] = num_small0[2] .+ c0.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]

            num_large1[2] = num_large1[2] .+ c1.*(H[j,E[n]]==2)*P[j,E[n]]*D[j]
            num_small1[2] = num_small1[2] .+ c1.*(H[j,E[n]]==1)*P[j,E[n]]*D[j]
            end
    end
    return vcat(num_large0./(J*N),  num_large1./(J*N), num_small0./(J*N), num_small1./(J*N), num_b0./(J*N), num_b1./(J*N))
end

@time outer(C,D,E,P,H,H_D,G_C=G_C,A=A,NF=NF)

The code contains the equality, H[j,E[n]]==2, however, the array is defined as H=(P.>0.5). In this case, H can only be true or false (0 or 1), so the equality is always false. Therefore, all the lines that depend on that equality are also zero. When I run the code the first four elements of the output are all 0.0, and those are the elements that depend on that equality statement.

If this is a bug, then you could try to fix the equality for what you really need. Or, if it’s not a bug, then you could simply remove all the lines that depend on it, saving a ton of processing.
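The observation can be checked in isolation (a minimal sketch):

```julia
P = rand(3, 3)
H = P .> 0.5           # a BitMatrix: every element is a Bool (false or true)

@assert eltype(H) == Bool
@assert !any(H .== 2)  # a Bool never equals 2, so the condition is always false
```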

Edit: Running @lmiq’s latest code with the always-zero lines commented out, I get an additional 30% speed-up. Also, as was mentioned but not very explicitly, it doesn’t really make sense to make G_C, A, and NF keyword arguments, which are typically optional. In this function they are definitely not optional, so it would be simpler to pass them as regular positional arguments.


Dear Leandro,

Thanks for the response. Could you please explain when it is okay to pass optional keyword parameters to a function, and when it is better to pass them directly as positional arguments?

Is there any case in which I can pass the parameters after the semi-colon without affecting performance?

When I tested it, having G_C, A, and NF as keywords versus standard variables didn’t result in a speed-up; it was more a matter of good practice. You just don’t want to have non-constant global variables within your function; that is when things slow down.
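A minimal sketch of that difference (illustrative names, not from the code above):

```julia
scale = 2.0             # non-constant global: its type may change at any time
f_global() = scale * 3  # compiler cannot assume a type; dynamic dispatch inside

const SCALE = 2.0       # const global: type is fixed, so the call specializes
f_const() = SCALE * 3

f_arg(x) = x * 3        # best: pass the value in; specializes per argument type
```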

The problem there was that you were doing something like:

julia> function f(;x=x)
         2*x
       end
f (generic function with 1 method)

julia> x = 1
1

julia> @code_warntype f()
Variables
  #self#::Core.Compiler.Const(f, false)
  x::Any

Body::Any
1 ─ %1 = Main.x::Any
│        (x = %1)
│   %3 = Main.:(var"#f#1")(x, #self#)::Any
└──      return %3


x=x there does not really assign any default value to x, and thus that value will be taken from the global variable x, which is not constant. Those Anys mean that the types are not stable, and things will be slow.

If you want x to be optional but have a default value, use an actual concrete value:

julia> function g(;x=1)
          2*x
       end
g (generic function with 1 method)

julia> @code_warntype g()
Variables
  #self#::Core.Compiler.Const(g, false)

Body::Int64
1 ─ %1 = Main.:(var"#g#2")(1, #self#)::Core.Compiler.Const(2, false)
└──      return %1

Note that with the first option, your function simply cannot be run without explicitly passing x if x was not set globally beforehand (new Julia session):

julia> function f(;x=x)
         2*x
       end
f (generic function with 1 method)

julia> f()
ERROR: UndefVarError: x not defined
Stacktrace:
 [1] f() at ./REPL[1]:2
 [2] top-level scope at REPL[2]:1



The problem was not the use of keyword arguments; it was not initializing them properly with default values nor passing the values explicitly.


Thanks so much. I see. If I were to define x as a constant outside of the function, would I have a problem?

If you mean:

julia> f(;x=x) = 2*x
f (generic function with 1 method)

julia> const x = 1
1

julia> @code_warntype f()
Variables
  #self#::Core.Compiler.Const(f, false)
  x::Int64

Body::Int64
1 ─ %1 = Main.x::Core.Compiler.Const(1, false)
│        (x = %1)
│   %3 = Main.:(var"#f#1")(x::Core.Compiler.Const(1, false), #self#)::Core.Compiler.Const(2, false)
└──      return %3


That will perform well. Yet, using x=x does not really make sense. You can initialize the variable with something meaningful, dependent on the other variables, if you want:

julia> g(y;x=2*y) = x + y
g (generic function with 1 method)

julia> g(1)
3

or, more generally:

julia> initx(y) = 3*y
initx (generic function with 1 method)

julia> h(y;x=initx(y)) = x + y
h (generic function with 1 method)

julia> h(1)
4

julia> h(1,x=5)
6