CUDA arrays not working well with broadcast!(), and other in-place operations inside a loop

Hey guys,
I was working on using GPUs for my codes, and came across this unusual anomaly. I am using Flux.jl along with CUDA.jl.

Let’s define the variables first:

rickfg= fill(0 .+ im*0., nf, 1, 1, 1, 1, 1) |>gpu; # FFT of ricker
G_vec_per_rec= fill(0. + im* 0., nf, 1, nz, ny, nx, nT) |>gpu;
delay_rf= fill(0. + im* 0., nf, 1, nz, ny, nx, nT) |>gpu;

where, nf= 401, nz=nx=61, ny=1, nT= 141.
The following line

@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf);

returns

  0.000061 seconds (7 allocations: 512 bytes)

In all, a total of approx. 0.000500 seconds.
but the following, with nr= 123,

for i in 1:10
    @time for ir in 1:nr
        broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
    end
end

returns

  1.379863 seconds (985 allocations: 65.375 KiB)
  1.378438 seconds (985 allocations: 65.375 KiB)
  1.375190 seconds (985 allocations: 65.375 KiB)
  1.372815 seconds (985 allocations: 65.375 KiB)
  1.374234 seconds (985 allocations: 65.375 KiB)
  1.374774 seconds (985 allocations: 65.375 KiB)
  1.375210 seconds (985 allocations: 65.375 KiB)
  1.407133 seconds (985 allocations: 65.375 KiB)
  1.442426 seconds (985 allocations: 65.375 KiB)
  1.389992 seconds (985 allocations: 65.375 KiB)


When it should be taking around nr (=123) * 0.000061 = 0.007 seconds approx., it’s taking so much more time. The other thing that confuses me is why is the broadcast!() function allocating memory. When I use the CPU arrays, instead of CUDA arrays, no memory is allocated, though it’s taking more time because of being run on CPU.

rickfg= fill(0 .+ im*0., nf, 1, 1, 1, 1, 1) |>cpu; # FFT of ricker
G_vec_per_rec= fill(0. + im* 0., nf, 1, nz, ny, nx, nT) |>cpu;
delay_rf= fill(0. + im* 0., nf, 1, nz, ny, nx, nT) |>cpu;

and
@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf); returns
0.607953 seconds, no memory allocations

Please let me know what I might be missing out here, and also if you know any other function that can get the work done. I’ve posted something similar on a previous question, and if you want, you can take a look here:

Thanks in advance for your help!

GPU operations are asynchronous. If you don’t call synchronize() or call CUDA.@sync, your measurement is wrong. See Benchmarking & profiling · CUDA.jl

Right, so I realised that all the allocations are on CPU, and I guess the transition between CPU and GPU is taking time.
CUDA.@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf) returns
0.020132 seconds (54 CPU allocations: 3.422 KiB)

As an end user, I am more concerned about using GPUs to reduce the computation time without going into the depths of it. Flux.jl had made my life easy because not much was required to do to shift from CPUs to GPUs, but it has come to haunt me now.

While we are at it, could you please also explain why the first iteration runs fast (and as expected) while the later ones don’t.

for i in 1:10
    @time for ir in 1:nr
        broadcast!(+, delay, view(shiftg, :, ir:ir, :, :, :, :), Tshiftg);
        broadcast!(*, delay_rf, -im, 2π, fg, delay);
        broadcast!(exp, delay_rf, delay_rf);
        broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
        copyto!(G_unvec_per_rec, G_vec_per_rec);
        dd= view(dunvec, :, ir:ir)
        # mm= view(munvec, :,ir:ir)
        mul!(dd, G_unvec_per_rec, munvec);
    end
end

  0.644182 seconds (14.27 k allocations: 678.453 KiB)
  7.918085 seconds (14.27 k allocations: 678.453 KiB)
  7.953713 seconds (14.27 k allocations: 678.453 KiB)
  7.958340 seconds (14.27 k allocations: 678.453 KiB)
  7.879235 seconds (14.27 k allocations: 678.453 KiB)
  7.940350 seconds (14.27 k allocations: 678.453 KiB)
  8.226722 seconds (14.27 k allocations: 678.453 KiB)
  8.210767 seconds (14.27 k allocations: 678.453 KiB)
  8.279914 seconds (14.27 k allocations: 678.453 KiB)
  8.243146 seconds (14.27 k allocations: 678.453 KiB)

If you want to know the variables, they are as:

rickfg= fill(0 .+ im*0., nf, 1, 1, 1, 1, 1) |>gpu; # FFT of ricker
fg= fill(0 .+ im*0., nf, 1, 1,1,1,1)|>gpu; # fgrid
shiftg= fill(0 .+ im*0., 1, nr, nz, ny, nx, 1)|>gpu; # time shifts
Tshiftg= fill(0 .+ im*0., 1,1,1,1,1,nT)|>gpu; #time shits for temporal domain
G_vec_per_rec= fill(0. + im* 0., nf, 1, nz, ny, nx, nT) |>gpu;
G_unvec_per_rec= fill(0. + im* 0., nf, nz*ny*nx*nT) |>gpu;
delay= fill(0. + im.*0, 1, 1, nz, ny, nx, nT) |>gpu;
delay_rf= fill(0. + im* 0., nf, 1, nz, ny, nx, nT) |>gpu;
dunvec= fill(0. + im.*0, nf,nr) |>gpu;
munvec= fill(0. + im.*0, nz*ny*nx*nT, 1) |>gpu;

with nr= 125, nz= 61, ny= 1, nx= 61, nf= 401, nT= 141.
Again, if instead of using @time, I use

@time for i in 1:nr
        blah
end

I get similar time consumption. If there was some issue with measuring time, I guess the first iteration would also have given approx. 7 seconds. There is something happening after the first iteration which is slowing down further operations. People are running GPUs in loops so I guess I am doing something wrong.

The weird thing is when I compile this lot into a function, the first time, after compilation, it gives the expected low computation time, but after that something changes and I get 10 times high computation times.

I’m not sure what’s the problem that’s haunting you here?

Same explanation, you’re measuring time wrong. You can only use @time when you synchronize the GPU, which you aren’t.

Why to you assume that? The GPU can only queue a number of asynchronous operations, so it’s likely that the first sequence of operations without a synchronize() gets queued up without anything more, while subsequent iterations will need to wait until there’s space in the queue.

3 Likes

Yeah, I got it now. My operation takes around 0.61 seconds and, as you rightly pointed out, I was measuring time without synchronise(), thus showing me the time for my code plus the synchronisation time. Since the synchronise() is taking some 6 seconds, my code gets slowed down.

I had started with the GPU computation without understanding much about the working of the GPUs because my work lied more in the implementation. Now I guess, I should learn and develop a good understanding of the environments before using them.
Thanks a lot for your time!