Optimizing CUDA.jl performance for small array operations

marius311 · February 8, 2021, 5:32am

Quick note but I think those benchmark times are in nanoseconds, so its not 25,000 microseconds, its 25 microseconds, on the order of the kernel launch.

Anyway, thanks for the info, that helps. Are the inputs to the ~1000 equations the same at least? As in, does this reduce to applying an array of functions to some input? If so, search these foums for “array of functions” and you’ll get several links discussing this, some even mentioning GPU (this is probalby a good one to follow links from, also, this may be the identical question). This may be naive since I haven’t really read those threads, but here’s my solution messing around with it briefly:

using CUDA

fs = (sin, x->x^2, x->tan(x^2))

apply_funcs(fs, x) = ntuple(i->fs[i](x), length(fs))

apply_funcs.(Ref(fs), cu(rand(1000)))

I haven’t done a super proper profile, but I’m pretty sure that final call results in a single kernel launch. Certainly benchmarking it on a V100, its faster than 3 consecutive ones:

julia> x = cu(rand(Float32, 1000));

julia> @btime CUDA.@sync apply_funcs.($(Ref(fs)), $x);
  22.183 μs (47 allocations: 1.55 KiB)

julia> @btime CUDA.@sync (fs[1].($x); fs[2].($x); fs[3].($x));
  39.731 μs (78 allocations: 1.84 KiB)

Something else which is cool is that if you look at e.g. @code_llvm apply_funcs(fs, 2.) you’ll see that x^2 is only calculated once, I believe in this case the compiler is smart enough to eliminate the common sub-expression. I’d guess the CUDA compiler would do the same thing, although I’m not familiar enough with how to check exactly.

A limitation of this is that it stops working once your tuple of functions is longer than length 10 because of this. But maybe batching 10 functions together like this is enough to saturate the GPU, and if not, you can always write your own ntuple-like function and grow that if-statement a bit larger.

Topic		Replies	Views
Improving GPU performance for symbolic regression Performance question , gpu , cuda , symbolic-regression	14	1093	September 12, 2024
Is is possible to merge multiple kernels in CUDAnative to minimize launch overhead and execution overhead? GPU	12	1618	November 11, 2018
Presentation on effective use of CUDAnative/CuArrays GPU	7	2397	January 3, 2019
Performance of view with cuArrays GPU	11	2693	November 11, 2018
GPU kernel optimization (GPU vs CPU) GPU	3	1517	December 14, 2018

Optimizing CUDA.jl performance for small array operations

Related topics