Contents of the function run well when timed alone, but not when they are in a loop inside the function

Hello all, I am new to Julia and have been making progress with its performance utilities. Usually I can get some results, or at least an understanding of how things are going, but not in this case.

The following code:

@btime circshift!(rickshift, rick0, shift[iz,iy,ix,ir])
@btime mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
@btime Threads.@threads for i in 1:nT
    Dd= view(dd, :, i)
    Pd= view(pd, :, i)
    circshift!(Dd, Pd, Tshift[i]) 
end

returns (I am using 4 threads throughout)

 200.706 ns (2 allocations: 48 bytes)
  8.112 μs (2 allocations: 928 bytes)
  29.826 μs (520 allocations: 22.14 KiB)

EDIT: I have defined rick0 as a sparse vector earlier in the code, and rickshift = similar(rick0). shift is a 4-dimensional array of size (nz, ny, nx, nT) containing Float32 values. dd is defined as dd = view(d, :, ir, :), as in the next snippet. pd = zeros(nt, nT), and Tshift = -49:50.
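
For concreteness, the setup looks roughly like this (nt, the sparsity of rick0, and the value of ir are placeholders I picked for illustration, not my actual values):

using SparseArrays, BenchmarkTools

# sizes from further down in the post; nt = 500 is a placeholder
nz, ny, nx, nT, Nr, nt = 50, 50, 50, 100, 250, 500

rick0     = sprandn(nt, 0.05)                        # sparse Ricker wavelet (placeholder sparsity)
rickshift = similar(rick0)
shift     = Float32.(rand(-49:50, nz, ny, nx, nT))   # integer-valued Float32 shifts
pd        = zeros(nt, nT)
d         = zeros(nt, Nr, nT)
ir        = 1                                        # placeholder receiver index
dd        = view(d, :, ir, :)
Tshift    = -49:50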

which implies that the following function

#size(d) = nt, nr, nT
#size(m) = nz,ny,nx,nT

function Gnew!(d, m)
    for ir in 1:Nr
        for ix in 1:nx, iy in 1:ny, iz in 1:nz
            circshift!(rickshift, rick0, shift[iz,iy,ix,ir])
            mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
            Threads.@threads for i in 1:nT
                Dd= view(dd, :, i)
                Pd= view(pd, :, i)
                circshift!(Dd, Pd, Tshift[i]) 
            end
        end
    end
end

would run in some 20 minutes for nx=ny=nz=50, nT= 100, Nr= 250 (≈ 39e-6 s × 50 × 50 × 50 × 250 / 60 ≈ 20.3 minutes)

but when I run it using the following:

D= zeros(nt, nr, nT)
m_init= randn(nx,ny,nz,nT)
t1= time()
Gnew!(D, m_init)
time()- t1

the output is about 4375 seconds (roughly 73 minutes)

And mul_alt2!(...) is a function that writes the outer-product-style update of two vectors (one of them a sparse vector) into a preallocated matrix, without reshaping the vectors into matrices. It is defined as

using LinearAlgebra, SparseArrays   # BLAS is re-exported by LinearAlgebra

# writes C[i, :] += A[i] * X for every stored index i of A
function mul_alt2!(C::Matrix, X::Vector, A::SparseVector)
    @inbounds for i in A.nzind
        cc= view(C, i, :)
        BLAS.axpy!(A[i], X, cc)   # cc .+= A[i] .* X
    end
end

Since A and X in the above definition are vectors and not matrices, I did not find a relevant BLAS routine, or any other efficient function, that works without reshaping the vectors into matrices, and that reshaping step is itself not efficient.
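
To make the operation concrete: for every stored index i of A, mul_alt2! does C[i, :] += A[i] * X, i.e. the dense rank-1 update C .+= Vector(A) .* X'. A quick check with made-up sizes:

using LinearAlgebra, SparseArrays

C = zeros(10, 4)
X = randn(4)
A = sprandn(10, 0.3)

mul_alt2!(C, X, A)
C ≈ Vector(A) .* X'   # true
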
I hope I have posted enough information. In case I haven’t, please let me know. Also if you have some suggestions, do share them. Thanks for your time!

One suspicious thing is the value of nT. It should be 4, shouldn’t it? However, you have nT=100 in the second case. I think you have declared too many threads; you really only have 4 physical cores to work with.


Given your code I can only speculate that you are referencing (non-constant) global variables, which is a well-known performance pitfall in Julia. Maybe it would be helpful to put together a complete MWE?

  1. There are many global variables in Gnew!. Non-const globals inhibit type inference during compilation and hurt performance at runtime. I can’t say for sure, but this could very well be the overhead that stretched the expected ~20 minutes to ~73 minutes; even one non-const global can seriously hamper a method. The usual fix is to turn them into method arguments and pass the values in the function call (a sketch follows this list).
  2. You also didn’t post or describe the values for many global variables: rickshift, rick0, shift, pd, dd, Tshift.
  3. You don’t say what the arguments D and m_init in Gnew!(D, m_init) are.
  4. There are things in Gnew! I can’t make sense of.
    • The argument d (which the function call passes D to) isn’t used.
    • m is commented as a Tuple (nz,ny,nx,nT) of integers, which is fine. But m[iz,iy,ix, :] is either multidimensional indexing or making a Vector with type m, and neither of those things is consistent with the comment.
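
Roughly what I mean in point 1, as a sketch only (I am guessing at where dd is meant to be set, and otherwise keeping your loop structure):

function Gnew!(d, m, rick0, rickshift, shift, pd, Tshift)
    nz, ny, nx, nT = size(m)
    Nr = size(d, 2)
    for ir in 1:Nr
        dd = view(d, :, ir, :)   # per-receiver slice of the output, now a local variable
        for ix in 1:nx, iy in 1:ny, iz in 1:nz
            circshift!(rickshift, rick0, shift[iz, iy, ix, ir])
            mul_alt2!(pd, m[iz, iy, ix, :], rickshift)
            Threads.@threads for i in 1:nT
                circshift!(view(dd, :, i), view(pd, :, i), Tshift[i])
            end
        end
    end
    return d
end

# every buffer is passed explicitly at the call site:
# Gnew!(D, m_init, rick0, rickshift, shift, pd, Tshift)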

Hello @Benny, thanks for the reply. I think the pitfall is indeed the non-const globals. I have edited the post anyway for anyone else who comes across it.
I will test my code and let you know whether the non-const globals were the problem.

My bad, I have edited the snippet.

That was my way of indicating that D and m_init, which are passed into Gnew!(d, m) as d and m, are arrays of the respective shapes. I have edited the comment, and also the snippet, to make it clearer.

It looks like the variables rickshift, rick0, pd, dd used by circshift! and mul_alt2! inside the outer function Gnew! are not passed into Gnew!. The loop bounds nx, ny, nz are not passed into Gnew! either; maybe you can declare them as constants.
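
For example, if you keep them as globals rather than passing them in (values taken from your post):

const nx = 50
const ny = 50
const nz = 50
const Nr = 250
const nT = 100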

Hello all, sorry to disturb you again; it looks like I have run into a similar problem. I hope it’s not the same silly mistake as last time.
Let’s define the variables first:

using Flux, CUDA, LinearAlgebra   # `gpu` comes from Flux; mul! from LinearAlgebra

rickfg= fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1) |>gpu; # FFT of ricker
fg= fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1)|>gpu; # fgrid
shiftg= fill(0.0 + 0.0im, 1, nr, nz, ny, nx, 1)|>gpu; # time shifts
Tshiftg= fill(0.0 + 0.0im, 1, 1, 1, 1, 1, nT)|>gpu; # time shifts for the temporal domain
G_vec_per_rec= fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |>gpu;
G_unvec_per_rec= fill(0.0 + 0.0im, nf, nz*ny*nx*nT) |>gpu;
delay= fill(0.0 + 0.0im, 1, 1, nz, ny, nx, nT) |>gpu;
delay_rf= fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |>gpu;
dunvec= fill(0.0 + 0.0im, nf, nr) |>gpu;
munvec= fill(0.0 + 0.0im, nz*ny*nx*nT, 1) |>gpu;

Later on these are updated with actual values; to reproduce, you can fill them with random complex numbers (there is no sparsity in any of the arrays here). I am using Flux.jl along with CUDA.jl to run them on the GPU, and took nr= 125, nz= 61, ny= 1, nx= 61, nf= 401, nT= 141.

The following lines of code

@time broadcast!(+, delay, view(shiftg, :, ir:ir, :, :, :, :), Tshiftg);
@time broadcast!(*, delay_rf, -im, 2π, fg, delay);
@time broadcast!(exp, delay_rf, delay_rf);
@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
@time copyto!(G_unvec_per_rec, G_vec_per_rec);
@time dd= view(dunvec, :, ir:ir)
# @time # mm= view(munvec, :,ir:ir)
@time mul!(dd, G_unvec_per_rec, munvec);

returns

  0.000101 seconds (51 allocations: 1.734 KiB)
  0.000052 seconds (7 allocations: 512 bytes)
  0.000072 seconds (26 allocations: 2.250 KiB)
  0.000058 seconds (7 allocations: 512 bytes)
  0.000053 seconds (2 allocations: 32 bytes)
  0.000011 seconds (5 allocations: 176 bytes)
  0.000139 seconds (17 allocations: 304 bytes)

In all, that is a total of approximately 0.000500 seconds.
But the following

for i in 1:10
    @time for ir in 1:nr
        broadcast!(+, delay, view(shiftg, :, ir:ir, :, :, :, :), Tshiftg);
        broadcast!(*, delay_rf, -im, 2π, fg, delay);
        broadcast!(exp, delay_rf, delay_rf);
        broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
        copyto!(G_unvec_per_rec, G_vec_per_rec);
        dd= view(dunvec, :, ir:ir)
        # mm= view(munvec, :,ir:ir)
        mul!(dd, G_unvec_per_rec, munvec);
    end
end

returns

  0.644182 seconds (14.27 k allocations: 678.453 KiB)
  7.918085 seconds (14.27 k allocations: 678.453 KiB)
  7.953713 seconds (14.27 k allocations: 678.453 KiB)
  7.958340 seconds (14.27 k allocations: 678.453 KiB)
  7.879235 seconds (14.27 k allocations: 678.453 KiB)
  7.940350 seconds (14.27 k allocations: 678.453 KiB)
  8.226722 seconds (14.27 k allocations: 678.453 KiB)
  8.210767 seconds (14.27 k allocations: 678.453 KiB)
  8.279914 seconds (14.27 k allocations: 678.453 KiB)
  8.243146 seconds (14.27 k allocations: 678.453 KiB)

Things add up for the first timing: 0.000500*nr(=125)*10= 0.625 seconds, but after that the numbers blow up. I don’t think it’s the global-variables issue this time, because I’m running this directly without defining any functions. Am I again missing something basic? Please let me know, and apologies if it’s again a simple thing.
I’m tagging you all so that you receive the notifications @goerch @Benny @maphdze @FrancisKing
Thanks for your time!

Non-const globals can explain this sort of timing discrepancy, but I’m not sure if it’s the right or only reason in this new example because of the different setup (no @threads, no function containing the for-loop, yes Flux, yes CUDA).

The quickest way to check whether it’s a globals issue is to run the code in a local scope and see if that resolves the discrepancy. Designing a useful function takes a bit of time and isn’t necessary; just throw a let ... end or a function blah() ... end around all of your code. I prefer the function version because I can type blah() in the REPL to repeat the run instead of include-ing the let ... end script again.
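
Something like this, as a skeleton (blah is just a placeholder name, and the body is your script pasted in):

function blah()
    nf, nr, nz, ny, nx, nT = 401, 125, 61, 1, 61, 141
    rickfg= fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1) |>gpu
    shiftg= fill(0.0 + 0.0im, 1, nr, nz, ny, nx, 1)|>gpu
    # ... the rest of the arrays and the timed `for i in 1:10 ... end` loop go here,
    # so none of them are non-const globals any more ...
    return nothing
end

blah()   # re-run by calling blah() again instead of include-ing the script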

So I figured out that the issue has something to do with the CUDA arrays. broadcast!(...) and copyto!() behave oddly and somehow do some allocations as well, which should not be the case, and which is not the case when using CPU arrays. I’m not sure what’s going wrong, so I guess I’ll go through the documentation again and see what I’m missing; probably I’m not using the right functions? Let’s see…
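
One thing I still plan to check (just a guess on my part) is whether the asynchronous execution of GPU kernels is skewing the per-line timings: kernel launches return immediately, so @time on a single broadcast! may only measure the launch, not the actual work. Timing the whole loop with an explicit synchronization, using the arrays defined above, would look like:

using CUDA, LinearAlgebra

# CUDA.@sync blocks until all queued GPU work has finished, so the wall time
# reported by @time includes the kernels themselves, not just their launches
@time CUDA.@sync for ir in 1:nr
    broadcast!(+, delay, view(shiftg, :, ir:ir, :, :, :, :), Tshiftg)
    broadcast!(*, delay_rf, -im, 2π, fg, delay)
    broadcast!(exp, delay_rf, delay_rf)
    broadcast!(*, G_vec_per_rec, rickfg, delay_rf)
    copyto!(G_unvec_per_rec, G_vec_per_rec)
    dd = view(dunvec, :, ir:ir)
    mul!(dd, G_unvec_per_rec, munvec)
end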