Contents of the function run well when timed alone, but not when they are in a loop inside the function

Hello all, I am new to Julia and have been making progress with its performance utilities. Usually I can get some results, or at least an understanding of how things are going, but not in this case.

The following code:

@btime circshift!(rickshift, rick0, shift[iz,iy,ix,ir])
@btime mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
@btime Threads.@threads for i in 1:nT
    Dd= view(dd, :, i)
    Pd= view(pd, :, i)
    circshift!(Dd, Pd, Tshift[i]) 
end

returns (I am using 4 threads throughout)

 200.706 ns (2 allocations: 48 bytes)
  8.112 μs (2 allocations: 928 bytes)
  29.826 μs (520 allocations: 22.14 KiB)

EDIT: I have defined rick0 as a sparse vector earlier in the code, and rickshift = similar(rick0). shift is a 4-dimensional array of size (nz, ny, nx, nT) containing Float32 values. dd is defined as dd = view(d, :, ir, :), as in the next snippet. pd = zeros(nt, nT), and Tshift = -49:50.
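
For concreteness, the setup looks roughly like this (nt, the sparsity of rick0, and the value of ir are placeholders I picked for illustration, not my actual values):

using SparseArrays, BenchmarkTools

# sizes from further down in the post; nt = 500 is a placeholder
nz, ny, nx, nT, Nr, nt = 50, 50, 50, 100, 250, 500

rick0     = sprandn(nt, 0.05)                        # sparse Ricker wavelet (placeholder sparsity)
rickshift = similar(rick0)
shift     = Float32.(rand(-49:50, nz, ny, nx, nT))   # integer-valued Float32 shifts
pd        = zeros(nt, nT)
d         = zeros(nt, Nr, nT)
ir        = 1                                        # placeholder receiver index
dd        = view(d, :, ir, :)
Tshift    = -49:50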

which implies that the following function

#size(d) = nt, nr, nT
#size(m) = nz,ny,nx,nT

function Gnew!(d, m)
    for ir in 1:Nr
        for ix in 1:nx, iy in 1:ny, iz in 1:nz
            circshift!(rickshift, rick0, shift[iz,iy,ix,ir])
            mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
            Threads.@threads for i in 1:nT
                Dd= view(dd, :, i)
                Pd= view(pd, :, i)
                circshift!(Dd, Pd, Tshift[i]) 
            end
        end
    end
end

would run in some 20 minutes for nx=ny=nz=50, nT= 100, Nr= 250 (≈ 39e-6 s × 50 × 50 × 50 × 250 / 60 ≈ 20.3 minutes)

but when I run it using the following:

D= zeros(nt, nr, nT)
m_init= randn(nx,ny,nz,nT)
t1= time()
Gnew!(D, m_init)
time()- t1

the output is about 4375 seconds (roughly 73 minutes)

And mul_alt2!(...) is a function that writes the outer-product-style update of two vectors (one of them a sparse vector) into a preallocated matrix, without reshaping the vectors into matrices. It is defined as

using LinearAlgebra, SparseArrays   # BLAS is re-exported by LinearAlgebra

# writes C[i, :] += A[i] * X for every stored index i of A
function mul_alt2!(C::Matrix, X::Vector, A::SparseVector)
    @inbounds for i in A.nzind
        cc= view(C, i, :)
        BLAS.axpy!(A[i], X, cc)   # cc .+= A[i] .* X
    end
end

Since A and X in the above definition are vectors and not matrices, I did not find a relevant BLAS routine, or any other efficient function, that works without reshaping the vectors into matrices, and that reshaping step is itself not efficient.
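
To make the operation concrete: for every stored index i of A, mul_alt2! does C[i, :] += A[i] * X, i.e. the dense rank-1 update C .+= Vector(A) .* X'. A quick check with made-up sizes:

using LinearAlgebra, SparseArrays

C = zeros(10, 4)
X = randn(4)
A = sprandn(10, 0.3)

mul_alt2!(C, X, A)
C ≈ Vector(A) .* X'   # true
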
I hope I have posted enough information. In case I haven’t, please let me know. Also if you have some suggestions, do share them. Thanks for your time!

One suspicious thing is the value of nT. It should be 4, shouldn’t it? However, you have nT=100 in the second case. I think you have declared too many threads; you really only have 4 physical cores to work with.


Given your code I can only speculate that you are referencing (non-constant) global variables, which is a well-known performance pitfall in Julia. Maybe it would be helpful to put together a complete MWE?

  1. There are many global variables in Gnew!. Non-const globals inhibit type inference during compilation and hurt performance at runtime. I can’t say for sure, but this could very well be the overhead that stretched the expected ~20 minutes to ~73 minutes; even one non-const global can seriously hamper a method. The usual fix is to turn them into method arguments and pass the values in the function call (a sketch follows this list).
  2. You also didn’t post or describe the values for many global variables: rickshift, rick0, shift, pd, dd, Tshift.
  3. You don’t say what the arguments D and m_init in Gnew!(D, m_init) are.
  4. There are things in Gnew! I can’t make sense of.
    • The argument d (which the function call passes D to) isn’t used.
    • m is commented as a Tuple (nz,ny,nx,nT) of integers, which is fine. But m[iz,iy,ix, :] is either multidimensional indexing or making a Vector with type m, and neither of those things is consistent with the comment.
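
Roughly what I mean in point 1, as a sketch only (I am guessing at where dd is meant to be set, and otherwise keeping your loop structure):

function Gnew!(d, m, rick0, rickshift, shift, pd, Tshift)
    nz, ny, nx, nT = size(m)
    Nr = size(d, 2)
    for ir in 1:Nr
        dd = view(d, :, ir, :)   # per-receiver slice of the output, now a local variable
        for ix in 1:nx, iy in 1:ny, iz in 1:nz
            circshift!(rickshift, rick0, shift[iz, iy, ix, ir])
            mul_alt2!(pd, m[iz, iy, ix, :], rickshift)
            Threads.@threads for i in 1:nT
                circshift!(view(dd, :, i), view(pd, :, i), Tshift[i])
            end
        end
    end
    return d
end

# every buffer is passed explicitly at the call site:
# Gnew!(D, m_init, rick0, rickshift, shift, pd, Tshift)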

Hello @Benny, thanks for the reply. I think the pitfall is indeed the non-const globals. I have edited the post anyway for anyone else who comes across it.
I will test my code and let you know whether the non-const globals were the problem.

My bad, I have edited the snippet.

That was my way of indicating that D and m_init, which are passed into Gnew!(d, m) as d and m, are arrays of the respective shapes. I have edited the comment, and also the snippet, to make it clearer.

It looks like the variables rickshift, rick0, pd, dd used by circshift! and mul_alt2! inside the outer function Gnew! are not passed into Gnew!. The loop bounds nx, ny, nz are not passed into Gnew! either; maybe you can declare them as constants.
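
For example, if you keep them as globals rather than passing them in (values taken from your post):

const nx = 50
const ny = 50
const nz = 50
const Nr = 250
const nT = 100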

Hello all, sorry to disturb you again; it looks like I have run into a similar problem. I hope it’s not the same silly mistake as last time.
Let’s define the variables first:

using Flux, CUDA, LinearAlgebra   # `gpu` comes from Flux; mul! from LinearAlgebra

rickfg= fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1) |>gpu; # FFT of ricker
fg= fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1)|>gpu; # fgrid
shiftg= fill(0.0 + 0.0im, 1, nr, nz, ny, nx, 1)|>gpu; # time shifts
Tshiftg= fill(0.0 + 0.0im, 1, 1, 1, 1, 1, nT)|>gpu; # time shifts for the temporal domain
G_vec_per_rec= fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |>gpu;
G_unvec_per_rec= fill(0.0 + 0.0im, nf, nz*ny*nx*nT) |>gpu;
delay= fill(0.0 + 0.0im, 1, 1, nz, ny, nx, nT) |>gpu;
delay_rf= fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |>gpu;
dunvec= fill(0.0 + 0.0im, nf, nr) |>gpu;
munvec= fill(0.0 + 0.0im, nz*ny*nx*nT, 1) |>gpu;

Later on these are updated with actual values; to reproduce, you can fill them with random complex numbers (there is no sparsity in any of the arrays here). I am using Flux.jl along with CUDA.jl to run them on the GPU, and took nr= 125, nz= 61, ny= 1, nx= 61, nf= 401, nT= 141.

The following lines of code

@time broadcast!(+, delay, view(shiftg, :, ir:ir, :, :, :, :), Tshiftg);
@time broadcast!(*, delay_rf, -im, 2π, fg, delay);
@time broadcast!(exp, delay_rf, delay_rf);
@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
@time copyto!(G_unvec_per_rec, G_vec_per_rec);
@time dd= view(dunvec, :, ir:ir)
# @time # mm= view(munvec, :,ir:ir)
@time mul!(dd, G_unvec_per_rec, munvec);

returns

  0.000101 seconds (51 allocations: 1.734 KiB)
  0.000052 seconds (7 allocations: 512 bytes)
  0.000072 seconds (26 allocations: 2.250 KiB)
  0.000058 seconds (7 allocations: 512 bytes)
  0.000053 seconds (2 allocations: 32 bytes)
  0.000011 seconds (5 allocations: 176 bytes)
  0.000139 seconds (17 allocations: 304 bytes)

In all, that is a total of approximately 0.000500 seconds.
But the following

for i in 1:10
    @time for ir in 1:nr
        broadcast!(+, delay, view(shiftg, :, ir:ir, :, :, :, :), Tshiftg);
        broadcast!(*, delay_rf, -im, 2π, fg, delay);
        broadcast!(exp, delay_rf, delay_rf);
        broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
        copyto!(G_unvec_per_rec, G_vec_per_rec);
        dd= view(dunvec, :, ir:ir)
        # mm= view(munvec, :,ir:ir)
        mul!(dd, G_unvec_per_rec, munvec);
    end
end

returns

  0.644182 seconds (14.27 k allocations: 678.453 KiB)
  7.918085 seconds (14.27 k allocations: 678.453 KiB)
  7.953713 seconds (14.27 k allocations: 678.453 KiB)
  7.958340 seconds (14.27 k allocations: 678.453 KiB)
  7.879235 seconds (14.27 k allocations: 678.453 KiB)
  7.940350 seconds (14.27 k allocations: 678.453 KiB)
  8.226722 seconds (14.27 k allocations: 678.453 KiB)
  8.210767 seconds (14.27 k allocations: 678.453 KiB)
  8.279914 seconds (14.27 k allocations: 678.453 KiB)
  8.243146 seconds (14.27 k allocations: 678.453 KiB)

Things add up for the first timing: 0.000500*nr(=125)*10= 0.625 seconds, but after that the numbers blow up. I don’t think it’s the global-variables issue this time, because I’m running this directly without defining any functions. Am I again missing something basic? Please let me know, and apologies if it’s again a simple thing.
I’m tagging you all so that you receive the notifications @goerch @Benny @maphdze @FrancisKing
Thanks for your time!

Non-const globals can explain this sort of timing discrepancy, but I’m not sure if it’s the right or only reason in this new example because of the different setup (no @threads, no function containing the for-loop, yes Flux, yes CUDA).

The quickest way to check whether it’s a globals issue is to run the code in a local scope and see if that resolves the discrepancy. Designing a useful function takes a bit of time and isn’t necessary; just throw a let ... end or a function blah() ... end around all of your code. I prefer the function version because I can type blah() in the REPL to repeat the run instead of include-ing the let ... end script again.
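
Something like this, as a skeleton (blah is just a placeholder name, and the body is your script pasted in):

function blah()
    nf, nr, nz, ny, nx, nT = 401, 125, 61, 1, 61, 141
    rickfg= fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1) |>gpu
    shiftg= fill(0.0 + 0.0im, 1, nr, nz, ny, nx, 1)|>gpu
    # ... the rest of the arrays and the timed `for i in 1:10 ... end` loop go here,
    # so none of them are non-const globals any more ...
    return nothing
end

blah()   # re-run by calling blah() again instead of include-ing the script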

So I figured out that the issue has something to do with the CUDA arrays. broadcast!(...) and copyto!() behave oddly and somehow do some allocations as well, which should not be the case, and which is not the case when using CPU arrays. I’m not sure what’s going wrong, so I guess I’ll go through the documentation again and see what I’m missing; probably I’m not using the right functions? Let’s see…
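
One thing I still plan to check (just a guess on my part) is whether the asynchronous execution of GPU kernels is skewing the per-line timings: kernel launches return immediately, so @time on a single broadcast! may only measure the launch, not the actual work. Timing the whole loop with an explicit synchronization, using the arrays defined above, would look like:

using CUDA, LinearAlgebra

# CUDA.@sync blocks until all queued GPU work has finished, so the wall time
# reported by @time includes the kernels themselves, not just their launches
@time CUDA.@sync for ir in 1:nr
    broadcast!(+, delay, view(shiftg, :, ir:ir, :, :, :, :), Tshiftg)
    broadcast!(*, delay_rf, -im, 2π, fg, delay)
    broadcast!(exp, delay_rf, delay_rf)
    broadcast!(*, G_vec_per_rec, rickfg, delay_rf)
    copyto!(G_unvec_per_rec, G_vec_per_rec)
    dd = view(dunvec, :, ir:ir)
    mul!(dd, G_unvec_per_rec, munvec)
end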