Optimizing CUDA.jl performance for small array operations

Quick note but I think those benchmark times are in nanoseconds, so its not 25,000 microseconds, its 25 microseconds, on the order of the kernel launch.

Anyway, thanks for the info, that helps. Are the inputs to the ~1000 equations the same at least? As in, does this reduce to applying an array of functions to some input? If so, search these foums for “array of functions” and you’ll get several links discussing this, some even mentioning GPU (this is probalby a good one to follow links from, also, this may be the identical question). This may be naive since I haven’t really read those threads, but here’s my solution messing around with it briefly:

using CUDA

fs = (sin, x->x^2, x->tan(x^2))

apply_funcs(fs, x) = ntuple(i->fs[i](x), length(fs))

apply_funcs.(Ref(fs), cu(rand(1000)))

I haven’t done a super proper profile, but I’m pretty sure that final call results in a single kernel launch. Certainly benchmarking it on a V100, its faster than 3 consecutive ones:

julia> x = cu(rand(Float32, 1000));

julia> @btime CUDA.@sync apply_funcs.($(Ref(fs)), $x);
  22.183 μs (47 allocations: 1.55 KiB)

julia> @btime CUDA.@sync (fs[1].($x); fs[2].($x); fs[3].($x));
  39.731 μs (78 allocations: 1.84 KiB)

Something else which is cool is that if you look at e.g. @code_llvm apply_funcs(fs, 2.) you’ll see that x^2 is only calculated once, I believe in this case the compiler is smart enough to eliminate the common sub-expression. I’d guess the CUDA compiler would do the same thing, although I’m not familiar enough with how to check exactly.

A limitation of this is that it stops working once your tuple of functions is longer than length 10 because of this. But maybe batching 10 functions together like this is enough to saturate the GPU, and if not, you can always write your own ntuple-like function and grow that if-statement a bit larger.

2 Likes