Hi everyone, I am developing a 3D finite volume ocean model in Julia (Oceananigans.jl).
For now I have the model running reasonably fast on a CPU, and I'm trying to get it to run efficiently on a GPU using CUDAnative and CuArrays. I'd really like to have the same code/kernel for both the CPU and GPU (with zero memory allocations), but I'm not sure how to do this since most operators involve stencil-like computations.
An almost minimal working example: a global operator that calculates the x-difference (with periodic boundary conditions) of a 3D array `f` of size `(Nx, Ny, Nz)` and stores it in `δxf` might look like:
```julia
function δx!(f, δxf, Nx, Ny, Nz)
    for k in 1:Nz, j in 1:Ny
        # Interior points: forward difference in x.
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] = f[i+1, j, k] - f[i, j, k]
        end
        # Last point wraps around to i = 1 (periodic in x).
        @inbounds δxf[end, j, k] = f[1, j, k] - f[end, j, k]
    end
    nothing
end
```
Benchmarking shows it running decently fast on a CPU, but it's terrible on a GPU:
```
~ julia

julia> using BenchmarkTools

julia> Nx, Ny, Nz = 128, 128, 128;

julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);

julia> @benchmark δx!(xc, yc, Nx, Ny, Nz)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.381 ms (0.00% GC)
  median time:      1.391 ms (0.00% GC)
  mean time:        1.393 ms (0.00% GC)
  maximum time:     3.173 ms (0.00% GC)
  --------------
  samples:          3580
  evals/sample:     1
```
```
~ nvprof --profile-from-start off /home/alir/julia/bin/julia

julia> using CUDAnative, CUDAdrv, CuArrays

julia> Nx, Ny, Nz = 128, 128, 128;

julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));

julia> CUDAdrv.@profile @cuda threads=128 δx!(xg, yg, Nx, Ny, Nz)

julia> CUDAdrv.@profile @cuda threads=128 δx!(xg, yg, Nx, Ny, Nz)

==54821== Profiling application: /home/alir/julia/bin/julia
==54821== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  590.15ms         2  295.08ms  294.44ms  295.72ms  ptxcall__x__1

==54821== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.85%  9.3003ms         1  9.3003ms  9.3003ms  9.3003ms  cuModuleLoadDataEx
  0.86%  81.203us         2  40.601us  40.423us  40.780us  cuLaunchKernel
  0.08%  7.6330us         1  7.6330us  7.6330us  7.6330us  cuProfilerStart
  0.08%  7.6270us         4  1.9060us     303ns  4.2190us  cuCtxGetCurrent
  0.06%  5.6430us         1  5.6430us  5.6430us  5.6430us  cuDeviceGetCount
  0.04%  3.7120us         4     928ns     196ns  2.3920us  cuDeviceGetAttribute
  0.01%     863ns         1     863ns     863ns     863ns  cuModuleGetFunction
  0.01%     758ns         2     379ns     314ns     444ns  cuDeviceGet
  0.01%     625ns         1     625ns     625ns     625ns  cuCtxGetDevice
```
I expected it to be slow: the kernel never uses its thread index, so (I think) each of the 128 threads redundantly executes the entire triple loop and there's no actual parallelization.
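(If I only cared about the GPU, I think the fix would be to split the loop across threads. Something like this untested sketch, with my own placeholder names, where each thread takes one `j`-column and the kernel is launched with `threads=Ny`:)

```julia
using CUDAnative

# Untested sketch (placeholder name): each thread handles a single j-column
# instead of every thread redundantly running the full triple loop.
# Assumes the kernel is launched with threads = Ny.
function δx_threaded!(f, δxf, Nx, Ny, Nz)
    j = threadIdx().x
    for k in 1:Nz
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] = f[i+1, j, k] - f[i, j, k]
        end
        @inbounds δxf[Nx, j, k] = f[1, j, k] - f[Nx, j, k]
    end
    return nothing
end

# @cuda threads=Ny δx_threaded!(xg, yg, Nx, Ny, Nz)
```

But `threadIdx()` only exists on the GPU, so this is exactly the kind of device-specific kernel I'm trying to avoid.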
Then I tried using an element-wise or local operator that calculates the x-difference for a single element of `f` at `[i, j, k]` and returns it:
```julia
# Not sure how to code an element-wise δx without branching.
@inline incmod1(a, n) = a == n ? 1 : a + 1
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]

# Basically just applies δx to all elements.
function time_stepping(f, δxf)
    Nx, Ny, Nz = size(f)
    for k in 1:Nz, j in 1:Ny, i in 1:Nx
        δxf[i, j, k] = δx(f, Nx, i, j, k)
    end
end
```
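(As an aside, two branch-free formulations I could imagine, just untested sketches with my own names: `mod1` wraps the neighbour index arithmetically, and `ifelse` evaluates both alternatives and selects one without a conditional jump.)

```julia
# Untested sketches (my own names), as alternatives to the ternary in incmod1:
# mod1 wraps i+1 back to 1 when i == Nx, with no explicit comparison,
δx_mod(f, Nx, i, j, k)    = f[mod1(i + 1, Nx), j, k] - f[i, j, k]
# while ifelse computes both candidate indices and selects one branchlessly.
δx_ifelse(f, Nx, i, j, k) = f[ifelse(i == Nx, 1, i + 1), j, k] - f[i, j, k]
```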
In any case, this version is even slower:
```
~ julia

julia> using BenchmarkTools

julia> Nx, Ny, Nz = 128, 128, 128;

julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);

julia> @benchmark time_stepping(xc, yc)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.377 ms (0.00% GC)
  median time:      3.387 ms (0.00% GC)
  mean time:        3.449 ms (0.00% GC)
  maximum time:     3.865 ms (0.00% GC)
  --------------
  samples:          1449
  evals/sample:     1
```
```
~ nvprof --profile-from-start off /home/alir/julia/bin/julia

julia> using CUDAnative, CUDAdrv, CuArrays

julia> Nx, Ny, Nz = 128, 128, 128;

julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));

julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)

julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)

==55354== Profiling application: /home/alir/julia/bin/julia
==55354== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.86728s         2  1.43364s  1.42919s  1.43808s  ptxcall_time_stepping_2

==55354== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.82%  98.188ms         1  98.188ms  98.188ms  98.188ms  cuModuleLoadDataEx
  0.14%  138.82us         2  69.411us  51.497us  87.326us  cuLaunchKernel
  0.02%  14.819us         1  14.819us  14.819us  14.819us  cuProfilerStart
  0.01%  11.707us         6  1.9510us     173ns  9.1930us  cuDeviceGetAttribute
  0.01%  7.3640us         6  1.2270us     200ns  2.9550us  cuCtxGetCurrent
  0.00%  4.5960us         1  4.5960us  4.5960us  4.5960us  cuDeviceGetCount
  0.00%     977ns         2     488ns     486ns     491ns  cuCtxGetDevice
  0.00%     930ns         1     930ns     930ns     930ns  cuModuleGetFunction
  0.00%     807ns         2     403ns     288ns     519ns  cuDeviceGet
```
I know I can get the GPU kernel running much faster if I explicitly use blocks and threads, but then I don't think it would run on a CPU.
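For concreteness, what I have in mind is something like this untested sketch (the name `δx_gpu!` is mine), where each GPU thread handles one linear index and reuses the element-wise `δx` from above:

```julia
using CUDAnative

# Untested sketch (my own name): one thread per element, with the linear
# index decomposed into (i, j, k) for a column-major (Nx, Ny, Nz) array.
function δx_gpu!(f, δxf, Nx, Ny, Nz)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= Nx * Ny * Nz
        k, r = divrem(idx - 1, Nx * Ny)  # k, then the remainder within the slab
        j, i = divrem(r, Nx)             # j, then i within the column
        @inbounds δxf[i+1, j+1, k+1] = δx(f, Nx, i+1, j+1, k+1)
    end
    return nothing
end

# @cuda threads=256 blocks=cld(Nx*Ny*Nz, 256) δx_gpu!(xg, yg, Nx, Ny, Nz)
```

But again, `threadIdx()` and `blockIdx()` are only defined inside GPU kernels, so this can't be the code path the CPU takes.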
Other approaches I considered:
- I’ve seen other threads around here that suggest using `CuArrays.@sync map!(...)`, but I don’t think stencil operations can be mapped.
- From reading CUDA tutorials on stencil computations, I think the usual CUDA way of doing this is by using 16×16 thread blocks and careful shared memory management, but I don’t think a kernel written this way would work on a CPU.
- I’ve also been looking at Cassette.jl; maybe it’s useful here in case different kernels must be written for the CPU and GPU? Maybe the “API” doesn’t need to know which backend it’s running on (see the sketch below)?
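For example (untested, my own names), I'm imagining the element-wise `δx` staying backend-agnostic while only a thin driver differs per device: an ordinary (optionally multithreaded) loop on the CPU, and a kernel like `δx_gpu!` above on the GPU.

```julia
# Untested sketch (my own name): a CPU driver that reuses the same
# element-wise δx; the GPU driver would be a kernel like δx_gpu! above.
function time_stepping_cpu!(f, δxf)
    Nx, Ny, Nz = size(f)
    Threads.@threads for k in 1:Nz
        for j in 1:Ny, i in 1:Nx
            @inbounds δxf[i, j, k] = δx(f, Nx, i, j, k)
        end
    end
    return nothing
end
```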
If it matters, I am using an Nvidia Tesla P100 and Julia 1.0.1 (CUDAapi v0.5.3, CUDAdrv v0.8.6, CUDAnative v0.9.1, CuArrays v0.8.1).
Apologies for the wall of text! All this GPU stuff is new to me so please excuse me if I’ve made some rookie mistakes.