Hi everyone, I am developing a 3D finite-volume ocean model in Julia (Oceananigans.jl).
For now I have the model running reasonably fast on a CPU and I’m trying to get it to run efficiently on a GPU using CUDAnative and CuArrays. I’d really like to use the same code/kernel for both the CPU and GPU (with zero memory allocations), but I’m not sure how to do this since most operators involve stencil-like computations.
Almost minimal working example: a global operator that calculates an x-difference (with periodic boundary conditions) of a 3D array f of size (Nx, Ny, Nz) and stores the result in δxf might look like
function δx!(f, δxf, Nx, Ny, Nz)
    for k in 1:Nz, j in 1:Ny
        # Interior points: forward difference in x.
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] = f[i+1, j, k] - f[i, j, k]
        end
        # Periodic boundary: the last point wraps around to i = 1.
        @inbounds δxf[end, j, k] = f[1, j, k] - f[end, j, k]
    end
    nothing
end
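(As a quick sanity check, not part of the model, the result can be compared against circshift; this holds for any array size:

julia> f = rand(4, 4, 4); δxf = similar(f);
julia> δx!(f, δxf, size(f)...)
julia> δxf ≈ circshift(f, (-1, 0, 0)) .- f
true
)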
Benchmarking shows it runs decently fast on the CPU but terribly on the GPU:
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!(xc, yc, Nx, Ny, Nz)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.381 ms (0.00% GC)
  median time:      1.391 ms (0.00% GC)
  mean time:        1.393 ms (0.00% GC)
  maximum time:     3.173 ms (0.00% GC)
  --------------
  samples:          3580
  evals/sample:     1
~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> using CUDAnative, CUDAdrv, CuArrays
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));
julia> CUDAdrv.@profile @cuda threads=128 δx!(xg, yg, Nx, Ny, Nz)
julia> CUDAdrv.@profile @cuda threads=128 δx!(xg, yg, Nx, Ny, Nz)
==54821== Profiling application: /home/alir/julia/bin/julia
==54821== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  590.15ms         2  295.08ms  294.44ms  295.72ms  ptxcall__x__1
==54821== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.85%  9.3003ms         1  9.3003ms  9.3003ms  9.3003ms  cuModuleLoadDataEx
  0.86%  81.203us         2  40.601us  40.423us  40.780us  cuLaunchKernel
  0.08%  7.6330us         1  7.6330us  7.6330us  7.6330us  cuProfilerStart
  0.08%  7.6270us         4  1.9060us     303ns  4.2190us  cuCtxGetCurrent
  0.06%  5.6430us         1  5.6430us  5.6430us  5.6430us  cuDeviceGetCount
  0.04%  3.7120us         4     928ns     196ns  2.3920us  cuDeviceGetAttribute
  0.01%     863ns         1     863ns     863ns     863ns  cuModuleGetFunction
  0.01%     758ns         2     379ns     314ns     444ns  cuDeviceGet
  0.01%     625ns         1     625ns     625ns     625ns  cuCtxGetDevice
I expected this to be slow: the kernel never reads its thread index, so each of the 128 threads just executes the entire serial loop redundantly and there’s no parallelism at all.
Then I tried using an element-wise or local operator that calculates the x-difference for a single element of f at [i, j, k] and returns it.
# Not sure how to code an element-wise δx without branching
# (though see the mod1 sketch below).
@inline incmod1(a, n) = a == n ? 1 : a + 1
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]

# Basically just applies δx to all elements.
function time_stepping(f, δxf)
    Nx, Ny, Nz = size(f)
    for k in 1:Nz, j in 1:Ny, i in 1:Nx
        δxf[i, j, k] = δx(f, Nx, i, j, k)
    end
    nothing
end
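As an aside, the wrap-around index can also be written without an explicit branch using Base.mod1 (δx_mod1 is just a hypothetical name; I’d guess the ternary compiles to a select anyway):

# Hypothetical branch-free variant: mod1(Nx + 1, Nx) == 1, so the index wraps.
δx_mod1(f, Nx, i, j, k) = f[mod1(i + 1, Nx), j, k] - f[i, j, k]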
but this is even slower:
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark time_stepping(xc, yc)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.377 ms (0.00% GC)
  median time:      3.387 ms (0.00% GC)
  mean time:        3.449 ms (0.00% GC)
  maximum time:     3.865 ms (0.00% GC)
  --------------
  samples:          1449
  evals/sample:     1
~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> using CUDAnative, CUDAdrv, CuArrays
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));
julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)
julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)
==55354== Profiling application: /home/alir/julia/bin/julia
==55354== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.86728s         2  1.43364s  1.42919s  1.43808s  ptxcall_time_stepping_2
==55354== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.82%  98.188ms         1  98.188ms  98.188ms  98.188ms  cuModuleLoadDataEx
  0.14%  138.82us         2  69.411us  51.497us  87.326us  cuLaunchKernel
  0.02%  14.819us         1  14.819us  14.819us  14.819us  cuProfilerStart
  0.01%  11.707us         6  1.9510us     173ns  9.1930us  cuDeviceGetAttribute
  0.01%  7.3640us         6  1.2270us     200ns  2.9550us  cuCtxGetCurrent
  0.00%  4.5960us         1  4.5960us  4.5960us  4.5960us  cuDeviceGetCount
  0.00%     977ns         2     488ns     486ns     491ns  cuCtxGetDevice
  0.00%     930ns         1     930ns     930ns     930ns  cuModuleGetFunction
  0.00%     807ns         2     403ns     288ns     519ns  cuDeviceGet
I know I can get the GPU kernel running much faster if I explicitly use blocks and threads, but then I don’t think it would run on a CPU. Something like the sketch below is what I mean.
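To make that concrete, here is a rough sketch of the explicit-index style, assuming one thread per x-index (δx_gpu! and the launch configuration are mine, just for illustration):

# Hypothetical GPU-only kernel: each thread handles one i and loops over j, k.
function δx_gpu!(f, δxf, Nx, Ny, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= Nx
        for k in 1:Nz, j in 1:Ny
            @inbounds δxf[i, j, k] = f[incmod1(i, Nx), j, k] - f[i, j, k]
        end
    end
    return nothing
end

# launched with something like:
# @cuda threads=128 blocks=cld(Nx, 128) δx_gpu!(xg, yg, Nx, Ny, Nz)

But blockIdx()/threadIdx() only exist on the GPU, so this kernel can’t also be the CPU code path.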
Other approaches I considered:
- I’ve seen other threads around here that suggest using CuArrays.@sync map!(...), but I don’t think stencil operations can be expressed as a map since every output element needs to read its neighbors.
- From reading CUDA tutorials on stencil computations, I think the usual CUDA way of doing this is to use 16×16 thread blocks and careful shared memory management, but I don’t think a kernel written this way would work on a CPU (see the sketch after this list).
- I’ve also been looking at Cassette.jl; maybe it’s useful here in case different kernels must be written for the CPU and GPU? Maybe the “API” doesn’t need to know?
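For completeness, here is my rough understanding of that shared-memory pattern applied to the x-difference (a sketch only, assuming blockDim().x == Nx == 128 and the Float32 arrays that cu produces; δx_shared! is a hypothetical name):

# Hypothetical shared-memory kernel: one block per (j, k) line of the array.
function δx_shared!(f, δxf)
    Nx = size(f, 1)
    tile = @cuStaticSharedMem(Float32, 128)   # one x-line staged per block
    i, j, k = threadIdx().x, blockIdx().x, blockIdx().y
    @inbounds tile[i] = f[i, j, k]
    sync_threads()                            # all loads done before reading
    i⁺ = i == Nx ? 1 : i + 1                  # periodic neighbor
    @inbounds δxf[i, j, k] = tile[i⁺] - tile[i]
    return nothing
end

# launched with something like:
# @cuda threads=128 blocks=(Ny, Nz) δx_shared!(xg, yg)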
If it matters, I am using an Nvidia Tesla P100 and Julia 1.0.1 (CUDAapi v0.5.3, CUDAdrv v0.8.6, CUDAnative v0.9.1, CuArrays v0.8.1).
Apologies for the wall of text! All this GPU stuff is new to me so please excuse me if I’ve made some rookie mistakes.