Optimizing CUDA.jl performance for small array operations

Hi,

I’m trying to get better GPU performance with CUDA.jl for small array operations.

So, I’ve started to port SymbolicRegression.jl to the GPU using CUDA.jl. It seems I’ve gotten the main evaluation part of the code to use the corresponding CUDA operations (which was REALLY straightforward by the way, great job!) , but it’s slower than I would like.

Part of the problem is that during symbolic regression, you typically work on small amounts of data; maybe a matrix of size 5x1000. Without some clever fusion of tree evaluations, this means one needs to worry about the time it takes to launch kernels which makes things tricky.

As a MWE, consider the following code:

using CUDA, BenchmarkTools, Statistics
for N in [1000, 10000]
    c1 = CUDA.ones(Float32, N)
    c2 = ones(Float32, N)
    res1 = @benchmark CUDA.@sync cos.($c1);
    res2 = @benchmark cos.($c2);
    println("Size $N: CUDA=$(median(res1.times)); CPU=$(median(res2.times))")
end

On my v100, this gives me (in microseconds):

Size 1000: CUDA=26021.0; CPU=9086.0
Size 10000: CUDA=24419.5; CPU=87287.0

The GPU scales so well the array size is negligible. But the baseline time it takes to launch a kernel means I can’t exploit the full power of the GPU for evaluating trees on these small arrays.

Is there something I can do to improve the kernel launch speed here?

I also tried:

res1 = @benchmark CUDA.@sync blocking=false cos.($c1);

which was mentioned in the CUDA.jl docs to be better for profiling short executions. This lowers the the evaluation time to ~11000, but unfortunately this is still not enough.

Thanks for any advice!
Cheers,
Miles

2 Likes

This is the order I’d probably try things in:

  • I’m not at all familiar with the algorithms, but could the fusion be easier than it appears? For one, you can always add “batch” dimensions easily like:

    c1a = CUDA.ones(Float32, N)
    c1b = CUDA.ones(Float32, N)
    c1 = cat(c1a, c1b, dims=2) 
    res1a, res2a = eachslice(cos.(c1), dims=2) 
    

    Also, you can “lazily” create broadcasted objects then evaluate them with a single “fused” kernel call like:

    using Base: broadcasted, materialize
    bc1 = broadcasted(cos, c1)
    bc2 = broadcasted(sin, c2)
    bc = broadcasted(+, bc1, bc2)
    res = materialize(bc) # one kernel call, w/o temp arrays
    
  • You could use multiple threads. I think you may need to be on CUDA#master (might need to do some reading in https://github.com/JuliaGPU/CUDA.jl/pull/662 and links therein too) but the latest versions each thread uses its own stream, which can then execute concurrently. In general it’d be good to do some profiling to make sure the GPU is doing what you think it is.

  • You could combine more of your code into single kernels by writing custom kernels (still pure Julia). KernelAbstractions.jl offers a pretty nice interface to it, although even CUDA.@cuda isn’t too bad.

2 Likes

Thanks @marius311. Replies to each point numbered below.

I should mention first that the C++ CUDA kernel launch cost is only ~10 microseconds, so I definitely think the seeming ~25,000 microseconds launch cost here can be cut down. i.e., this seem to be a software problem rather than hardware or algorithm. But these other tricks to combine kernels might be enough to account for it.

  1. (Re: batching) Since you reference the algorithm; this kind of fusion would be really really hard. Basically: Imagine you generate 1000 random equations of different sizes, different operators, different topology (i.e., some nodes degree=2, others degree=1), different features, different constants. How should one decide which nodes in different equations to fuse and evaluate together? I guess you could start by grouping the binary root nodes containing the same operator into one, and so on… but just seems like this would add a lot of complexity without much improvement in performance. e.g., say there are 6 operators and 100 equations evaluated at once (the maximum for typical hyperparameters). This means there’s only (100/6) root nodes you could fuse on average, so instead of 1000 points, you get 17,000 points… So not sure this would solve the issue.
  2. (Re: lazy kernels) This is a really nice suggestion that I haven’t seen this before, thanks! I will try it out.
  3. (Re: CUDA streams). Also a good suggestion, thanks. Equations are typical depth of 5-10 nodes though, so I still feel like the kernel launch cost is going to eat up most of the potential performance gain. But might actually beat CPU with this.
  4. (Re: custom kernel) Thanks, I will look more into this too. Maybe writing a custom kernel will help eliminate the launch cost over the automatic . vectorization kernels.

Best,
Miles

Quick note but I think those benchmark times are in nanoseconds, so its not 25,000 microseconds, its 25 microseconds, on the order of the kernel launch.

Anyway, thanks for the info, that helps. Are the inputs to the ~1000 equations the same at least? As in, does this reduce to applying an array of functions to some input? If so, search these foums for “array of functions” and you’ll get several links discussing this, some even mentioning GPU (this is probalby a good one to follow links from, also, this may be the identical question). This may be naive since I haven’t really read those threads, but here’s my solution messing around with it briefly:

using CUDA

fs = (sin, x->x^2, x->tan(x^2))

apply_funcs(fs, x) = ntuple(i->fs[i](x), length(fs))

apply_funcs.(Ref(fs), cu(rand(1000)))

I haven’t done a super proper profile, but I’m pretty sure that final call results in a single kernel launch. Certainly benchmarking it on a V100, its faster than 3 consecutive ones:

julia> x = cu(rand(Float32, 1000));

julia> @btime CUDA.@sync apply_funcs.($(Ref(fs)), $x);
  22.183 μs (47 allocations: 1.55 KiB)

julia> @btime CUDA.@sync (fs[1].($x); fs[2].($x); fs[3].($x));
  39.731 μs (78 allocations: 1.84 KiB)

Something else which is cool is that if you look at e.g. @code_llvm apply_funcs(fs, 2.) you’ll see that x^2 is only calculated once, I believe in this case the compiler is smart enough to eliminate the common sub-expression. I’d guess the CUDA compiler would do the same thing, although I’m not familiar enough with how to check exactly.

A limitation of this is that it stops working once your tuple of functions is longer than length 10 because of this. But maybe batching 10 functions together like this is enough to saturate the GPU, and if not, you can always write your own ntuple-like function and grow that if-statement a bit larger.

1 Like

Thanks! This is very helpful.

Okay, good to know re: units. I’m not sure why I thought it was in microseconds… but yeah 25 us seems much more reasonable.

Are the inputs to the ~1000 equations the same at least?

Only at the leafs of each equation. Two equations might both use the variable x1, for instance. But equations also have constants, and these will always be slightly different due to mutations. In the branch nodes, the inputs end up being completely different unless that particular subtree is identical, which is rare. I tried using memoization for small subtrees at one point, but this didn’t help… there’s just so many different subtrees possible in a typical search.

Re: apply_funcs, good idea. Will try this. To be honest, thinking more about this, I wonder how doable it would be to completely batch evaluation of many equations…

Actually, maybe this make things easier for your point 1!

Right now my recursive evaluation essentially looks like this:

function evalTree(X, tree)
    if tree.degree == 0
        if tree.constant
            return fill(tree.value, size(X, 2))
        end
        return X[tree.feature, :]
    elseif tree.degree == 1
        x = evalTree(X, tree.left)
        op = unary_operators[tree.op]
        return op.(x)
    else
        x = evalTree(X, tree.left)
        y = evalTree(X, tree.right)
        op = binary_operators[tree.op]
        return op.(x, y)
    end
end

(though I do some operator fusing for small subtrees, and also run things behind a function barrier, per your helpful advice in the other thread :slight_smile: ).

I guess I am wondering if it would be efficient to: (1) pass a list of trees to evaluate, and (2) walk the binary tree for ALL trees up to the maximum depth, using an identity operator for i when tree[i]==nothing at the current depth, and converting all unary operators to binary via (x, y) -> op(x)

What do you think?

Although, maybe the fact that the tuple of operators changes every single time means that this would be re-compiling the kernel each launch… I guess one could instead pass an array of indices for the functions to call, and manage all of that inside the kernel assuming fixed operators?