Hey guys,
I was working on using GPUs for my code and came across this unusual anomaly. I am using Flux.jl along with CUDA.jl.
Let’s define the variables first:
rickfg = fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1) |> gpu;  # FFT of ricker
G_vec_per_rec = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |> gpu;
delay_rf = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |> gpu;
where nf = 401, nz = nx = 61, ny = 1, nT = 141.
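For reference, the broadcast itself is straightforward: the singleton dimensions of rickfg expand against delay_rf. A minimal CPU-only sketch (with much smaller stand-in dimensions, just to illustrate the shapes):

```julia
# Small stand-in dimensions; the real ones are nf = 401, nz = nx = 61, ny = 1, nT = 141.
nf, nz, ny, nx, nT = 4, 3, 1, 3, 2

rickfg = fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1)            # singleton dims 3:6 broadcast
G_vec_per_rec = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT)
delay_rf = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT)

# In-place elementwise product; the output shape matches the larger array.
broadcast!(*, G_vec_per_rec, rickfg, delay_rf)
size(G_vec_per_rec)  # (4, 1, 3, 1, 3, 2)
```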
The following line
@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
returns
0.000061 seconds (7 allocations: 512 bytes)
All in all, a total of approx. 0.000500 seconds.
but the following, with nr = 123,
for i in 1:10
    @time for ir in 1:nr
        broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
    end
end
returns
1.379863 seconds (985 allocations: 65.375 KiB)
1.378438 seconds (985 allocations: 65.375 KiB)
1.375190 seconds (985 allocations: 65.375 KiB)
1.372815 seconds (985 allocations: 65.375 KiB)
1.374234 seconds (985 allocations: 65.375 KiB)
1.374774 seconds (985 allocations: 65.375 KiB)
1.375210 seconds (985 allocations: 65.375 KiB)
1.407133 seconds (985 allocations: 65.375 KiB)
1.442426 seconds (985 allocations: 65.375 KiB)
1.389992 seconds (985 allocations: 65.375 KiB)
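One thing I'm not sure about (so take this as an assumption on my part): CUDA.jl launches kernels asynchronously, which would mean a plain @time on a single broadcast! only measures the time to queue the kernel, not to run it. If that's the case, a synchronized timing might be more comparable to the loop numbers. A sketch, using the arrays as defined above (requires a GPU):

```julia
using CUDA

# Block until the GPU has actually finished before @time stops the clock.
@time CUDA.@sync broadcast!(*, G_vec_per_rec, rickfg, delay_rf);

# CUDA.@time additionally reports GPU-side allocations next to CPU ones.
CUDA.@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
```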
Since a single call takes 0.000061 seconds, the loop should take around nr (= 123) * 0.000061 ≈ 0.0075 seconds, yet it takes far longer. The other thing that confuses me is why the broadcast!() function is allocating memory at all. When I use CPU arrays instead of CUDA arrays, no memory is allocated, though it takes more time because it runs on the CPU:
rickfg = fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1) |> cpu;  # FFT of ricker
G_vec_per_rec = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |> cpu;
delay_rf = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT) |> cpu;
and
@time broadcast!(*, G_vec_per_rec, rickfg, delay_rf);
returns
0.607953 seconds, with no memory allocations.
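For what it's worth, the zero-allocation CPU behavior can be checked directly with @allocated, wrapped in a function so that global-scope dynamic dispatch doesn't add spurious allocations (again with shrunken stand-in dimensions):

```julia
nf, nz, ny, nx, nT = 4, 3, 1, 3, 2
rickfg = fill(0.0 + 0.0im, nf, 1, 1, 1, 1, 1)
G = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT)
delay = fill(0.0 + 0.0im, nf, 1, nz, ny, nx, nT)

function alloc_of_broadcast!(G, rick, delay)
    broadcast!(*, G, rick, delay)              # warm up: compilation allocates once
    return @allocated broadcast!(*, G, rick, delay)
end

alloc_of_broadcast!(G, rickfg, delay)  # 0 bytes for the in-place CPU broadcast
```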
Please let me know what I might be missing here, and also if you know of any other function that can get the work done. I've posted a similar question before; if you want, you can take a look here:
Thanks in advance for your help!