Hi,

I’m trying to get better GPU performance with CUDA.jl for small array operations.

I’ve started porting SymbolicRegression.jl to the GPU using CUDA.jl. I’ve gotten the main evaluation part of the code to use the corresponding CUDA operations (which was REALLY straightforward by the way, great job!), but it’s slower than I would like.

Part of the problem is that during symbolic regression, you typically work on small amounts of data, maybe a matrix of size 5x1000. Without some clever fusion of tree evaluations, every operator in a tree becomes its own kernel launch, so the launch time itself starts to dominate, which makes things tricky.
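To make "fusion" concrete, here is a minimal sketch. The function names `unfused` and `fused` are only for illustration and are not from SymbolicRegression.jl. It runs on plain CPU arrays; my understanding is that with `CuArray` inputs each separate broadcast statement becomes its own kernel launch, while a single dotted expression is compiled into one kernel.

```julia
# Hypothetical example: the same expression written unfused and fused.
x = ones(Float32, 5, 1000)

# Three separate broadcasts: three passes over the data
# (and, on a CuArray, three kernel launches plus two temporaries).
function unfused(x)
    t1 = cos.(x)
    t2 = t1 .* 2f0
    return t2 .+ 1f0
end

# One fused broadcast: a single pass (a single kernel on a CuArray).
fused(x) = @. cos(x) * 2f0 + 1f0

# Both versions compute the same values.
@assert unfused(x) ≈ fused(x)
```

The catch for symbolic regression is that the expression is only known at runtime as a tree, so it can't simply be written as one dotted expression like this.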

As an MWE, consider the following code:

```
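using CUDA, BenchmarkTools, Statistics

for N in [1000, 10000]
    c1 = CUDA.ones(Float32, N)  # array on the GPU
    c2 = ones(Float32, N)       # array on the CPU
    # CUDA.@sync waits for the kernel to finish, so the full execution is timed
    res1 = @benchmark CUDA.@sync cos.($c1)
    res2 = @benchmark cos.($c2)
    # `times` holds the raw samples in nanoseconds
    println("Size $N: CUDA=$(median(res1.times)); CPU=$(median(res2.times))")
end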
```

On my V100, this gives me (median times in nanoseconds, which is the unit BenchmarkTools uses for `times`):

```
Size 1000: CUDA=26021.0; CPU=9086.0
Size 10000: CUDA=24419.5; CPU=87287.0
```

The GPU scales so well that the array size makes no measurable difference here: both sizes take about the same time. But that fixed baseline cost of launching a kernel means I can’t exploit the full power of the GPU for evaluating trees on these small arrays.
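As a rough back-of-the-envelope check, the numbers above can be turned into a break-even array size. This is only a sketch under two assumptions of mine: that the GPU time at these sizes is a constant overhead, and that the CPU time is proportional to `N`. The inputs are the medians printed above, which BenchmarkTools reports in nanoseconds.

```julia
# Median times from the benchmark output above, in nanoseconds.
gpu_ns = Dict(1_000 => 26021.0, 10_000 => 24419.5)
cpu_ns = Dict(1_000 => 9086.0, 10_000 => 87287.0)

# Assumption: GPU time is a constant launch + sync overhead at these sizes.
gpu_overhead_ns = (gpu_ns[1_000] + gpu_ns[10_000]) / 2

# Assumption: CPU time scales linearly with the number of elements.
cpu_ns_per_elem = cpu_ns[10_000] / 10_000

# Array size at which the CPU time equals the fixed GPU overhead.
breakeven_N = gpu_overhead_ns / cpu_ns_per_elem

println("GPU overhead per call: $(round(gpu_overhead_ns / 1000, digits=1)) μs")
println("CPU cost per element:  $(round(cpu_ns_per_elem, digits=2)) ns")
println("Break-even N:          $(round(Int, breakeven_N))")
```

Under those assumptions the break-even point is somewhere around 3000 elements per operation, so a single 1000-element row sits well below it.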

Is there something I can do to improve the kernel launch speed here?

I also tried:

```
res1 = @benchmark CUDA.@sync blocking=false cos.($c1);
```

which the CUDA.jl docs mention as being better for profiling short executions. This lowers the evaluation time to `~11000`, which is still slower than the CPU for the size-1000 case, so unfortunately it is still not enough.

Thanks for any advice!

Cheers,

Miles