I think I may have found an answer using `@views` after skimming through Tim’s awesome presentation!
```julia
function δx!(f, δxf)
    # Forward difference along the first (x) dimension: δxf[i] = f[i+1] - f[i]
    @views @. δxf[1:end-1, :, :] = f[2:end, :, :] - f[1:end-1, :, :]
    # Periodic wrap-around at the last index
    @views @. δxf[end, :, :] = f[1, :, :] - f[end, :, :]
    nothing
end
```
It's almost as fast on the CPU and gives a decent ~10x speedup on the GPU:
```julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!($xc, $yc)
BenchmarkTools.Trial:
  memory estimate:  384 bytes
  allocs estimate:  6
  --------------
  minimum time:     1.645 ms (0.00% GC)
  median time:      1.662 ms (0.00% GC)
  mean time:        1.667 ms (0.00% GC)
  maximum time:     3.740 ms (0.00% GC)
  --------------
  samples:          2996
  evals/sample:     1
```

```julia
julia> using CUDAnative, CUDAdrv, CuArrays, BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(zeros(Nx, Ny, Nz));
julia> @benchmark δx!($xg, $yg)
BenchmarkTools.Trial:
  memory estimate:  4.84 KiB
  allocs estimate:  70
  --------------
  minimum time:     9.721 μs (0.00% GC)
  median time:      178.504 μs (0.00% GC)
  mean time:        169.339 μs (0.27% GC)
  maximum time:     2.521 ms (99.15% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
I'm still interested in looking into why there's such a big spread in the GPU run times, though. And apparently `@inbounds` slows things down on the CPU.
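For concreteness, by an `@inbounds` variant I mean something along these lines (a rough sketch for illustration only; the name `δx_inbounds!` is made up and this version isn't the one benchmarked above):

```julia
# Sketch of an @inbounds variant of the same kernel (hypothetical name).
function δx_inbounds!(f, δxf)
    # Forward difference along x with bounds checking disabled
    @inbounds @views @. δxf[1:end-1, :, :] = f[2:end, :, :] - f[1:end-1, :, :]
    # Periodic wrap-around at the last index
    @inbounds @views @. δxf[end, :, :] = f[1, :, :] - f[end, :, :]
    nothing
end
```

Benchmarking it the same way, e.g. `@benchmark δx_inbounds!($xc, $yc)`, would show the difference on the CPU.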
I know double posting is somewhat taboo, but I figured this is closer to a solution than an edit.