I have written a code to run on GPUs with ParallelStencil, but I don’t see a significant speedup from moving to GPUs. I am not sure whether I am using the framework as intended, or whether there is an easy fix for this. Any help to get more performance would be greatly appreciated.
To illustrate my problem, I have a minimal example code below. The code launches 2 big kernels (`UpdateA!` and `UpdateB!`), which are supposed to do most of the computation, and a number of small kernels, which each do only a fraction of the actual computation. The code ends up spending far more time on the smaller kernels.
```julia
const USE_GPU = true
# using BenchmarkTools
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 2);
else
    @init_parallel_stencil(Threads, Float64, 2);
end

function main2D()
    # Numerics
    nx, ny = 1024, 512;  # Number of gridpoints in dimensions x and y
    nt     = 10000;      # Number of time steps
    c0     = 10.0
    # Array initializations
    A = @zeros(nx, ny);
    B = @zeros(nx, ny);
    # A2 = @zeros(nx, ny);
    C = @rand(nx, ny);
    # Initial conditions
    A .= 1.5;
    # Time loop
    dt = 1/nt;
    for it = 1:nt
        if (it == 11)
            GC.enable(false)
            global t_tic = time()  # Start measuring time.
        end
        @parallel UpdateA!(A, B, C)
        # The for loop is just to launch many small kernels;
        # in the actual code, these are different small kernels not called in a loop.
        for i in 1:10:100
            @parallel (i:i+10, 1:ny) ASubset!(A)
        end
        @parallel UpdateB!(A, B, C)
        # The for loop is just to launch many small kernels;
        # in the actual code, these are different small kernels not called in a loop.
        for i in 1:10:300
            @parallel (1:nx, i:i+10) BSubset!(B)
        end
    end
    time_s = time() - t_tic
end

@parallel_indices (ix, iy) function UpdateA!(A, B, C)
    A[ix, iy] = A[ix, iy] + C[ix, iy] * B[ix, iy]
    return
end

@parallel_indices (ix, iy) function UpdateB!(A, B, C)
    B[ix, iy] = B[ix, iy] + C[ix, iy] * A[ix, iy]
    return
end

@parallel_indices (ix, iy) function ASubset!(A)
    # Reduce the value by 10%
    A[ix, iy] = 0.9 * A[ix, iy]
    return
end

@parallel_indices (ix, iy) function BSubset!(B)
    # Increase the value by 10%
    B[ix, iy] = 1.1 * B[ix, iy]
    return
end
```
The smaller kernels are not run in a for loop in the actual code; there, the same kernel is called several times with different sets of arguments.
My guess is that synchronizing a kernel takes a lot more time than running it. But the smaller kernels (`ASubset!` and `BSubset!`) do not need to be synchronized. Is it possible to launch them without syncing?
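If I understand the documentation correctly, ParallelStencil exports `@parallel_async` and `@synchronize` for exactly this; the following sketch is what I have in mind, though I am not sure it is the intended usage:

```julia
# Sketch only: assumes @parallel_async launches without blocking and
# @synchronize then waits once for all pending kernels on the stream.
@parallel UpdateA!(A, B, C)
for i in 1:10:100
    # launch each small kernel without waiting for it to finish
    @parallel_async (i:i+10, 1:ny) ASubset!(A)
end
@synchronize  # wait once before UpdateB! reads A
@parallel UpdateB!(A, B, C)
```

If that is valid, the per-kernel synchronization cost would be paid only once per batch of small kernels instead of once per launch.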
This is a screenshot from the profiler which shows the problem.
Alternatively, I might be completely wrong about what kills the performance and would be happy to know what is wrong and how to get more performance.
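One workaround I have also been considering is to fuse the small updates into the big kernels with a branch on the index, so each time step needs only two launches. A rough sketch (not my actual code, and not exactly equivalent at the overlapping range boundaries, where the loop version applies the factor twice) would be:

```julia
# Rough sketch: fold the ASubset! scaling into UpdateA! via an index test,
# so no extra kernel launches are needed. The subset bound (here the
# hard-coded 1:101 from the example loop) is an assumption.
@parallel_indices (ix, iy) function UpdateAFused!(A, B, C)
    A[ix, iy] = A[ix, iy] + C[ix, iy] * B[ix, iy]
    if ix <= 101
        A[ix, iy] = 0.9 * A[ix, iy]  # former ASubset! work
    end
    return
end
```

Would fusing like this be the recommended approach, or does it waste too much work on threads that fail the branch?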