Contents of the function run well when timed alone, but not when they are in a loop inside the function

Hello all, I am new to Julia and have been making progress with implementing its performance utilities. Generally, I am able to get some results or an understanding of how things are going, but not in this case.

The following code:

@btime circshift!(rickshift, rick0, shift[iz,iy,ix,it])
@btime mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
@btime Threads.@threads for i in 1:nT
    Dd= view(dd, :, i)
    Pd= view(pd, :, i)
    circshift!(Dd, Pd, Tgrid[i]) 

returns (I am using 4 threads throughout)

 200.706 ns (2 allocations: 48 bytes)
  8.112 ÎĽs (2 allocations: 928 bytes)
  29.826 ÎĽs (520 allocations: 22.14 KiB)

EDIT: I have defined rick0 as a sparse vector earlier in the code, and rickshift= similar(rick0), shift is a 4-dimensional array of size nz,ny,nx,nT containing Float32 values, dd is defined as dd= view(d, :, ir, :) as is also in the next snippet, pd= zeros(nt, nT), Tshift= -49:50

which implies that the following function

#size(d) = nt, nr, nT
#size(m) = nz,ny,nx,nT

function Gnew!(d, m)
    for ir in 1:Nr
        for ix in 1:nx, iy in 1:ny, iz in 1:nz
            circshift!(rickshift, rick0, shift[iz,iy,ix,ir])
            mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
            Threads.@threads for i in 1:nT
                Dd= view(dd, :, i)
                Pd= view(pd, :, i)
                circshift!(Dd, Pd, Tshift[i]) 

would run in some 20 minutes for nx=ny=nz=50, nT= 100, Nr= 250 (39e-6 x50x50x50x250/60= 20.468)

but when I run it using the following:

D= zeros(nt, nr, nT)
m_init= randn(nx,ny,nz,nT)
t1= time()
Gnew!(D, m_init)
time()- t1

the output is 4375 secs

And mul_alt2!(...) is a function, to give matrix output of 2 vectors, one of them being sparse vector, which are not defined as matrices, defined as

function mul_alt2!(C::Matrix, X::Vector, A::SparseVector)
    @inbounds for i in A.nzind
        BLAS.axpy!(A[i], X, cc)

Since the A and X in the above function definition are vectors and not matrices, I did not find a relevant BLAS function or any relevant efficient function that does not require me to reshape the vectors into arrays, since that step again is not efficient.
I hope I have posted enough information. In case I haven’t, please let me know. Also if you have some suggestions, do share them. Thanks for your time!

One suspicious thing is the value of nT. It should be 4, shouldn’t it? However, you have nT=100 in the second case. You have too many threads declared, I think. You really only have 4 real processing things.

1 Like

Given your code I can only speculate that you are referencing (non constant) global variables, which is a well known performance pitfall in Julia. Maybe it would be helpful to compile a complete MWE?

1 Like
  1. There are many global variables in Gnew!. Non-const globals inhibit type inference during compilation and hurt performance at runtime. I can’t say for sure, but it could very well be causing overhead that stretched the expected ~20 minutes to ~73 minutes. Even 1 non-const global can seriously hamper a method. The usual fix is to create method arguments and to pass the values to them in the function call.
  2. You also didn’t post or describe the values for many global variables: rickshift, rick0, shift, pd, dd, Tshift.
  3. You don’t say what the arguments D and m_init in Gnew!(D, m_init) are.
  4. There are things in Gnew! I can’t make sense of.
    • The argument d (which the function call passes D to) isn’t used.
    • m is commented as a Tuple (nz,ny,nx,nT) of integers, which is fine. But m[iz,iy,ix, :] is either multidimensional indexing or making a Vector with type m, and neither of those things is consistent with the comment.
1 Like

Hello @Benny, thanks for the reply, I think the pitfall is due to the non-const globals. I have still edited the post for another person that may come across it.
I will test my code and let you know if the non-const globals were the problem.

My bad, I have edited the snippet.

That was a way to let know that D and m_init which are passed into Gnew!(d, m) as d and m are matrices of the respective shapes. I have edited the comment, and also the snippet, to make it clearer.

It looks like that the variables rickshift,rick0,pd,m,dd,pd of the single functions circshift!,mul_alt2! in the outer function Gnew! are not passed into Gnew!. And nx,ny,nz in the for loop are not passed into the Gnew! too, maybe you can declare them as constant.