CUDA.jl tutorial code kernel slower than broadcast

The stride loop is unnecessary, right, since you’re launching as many threads as there are elements? That wouldn’t explain a 10x slowdown, though. These are just some quick suggestions; I can take a closer look later.
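To show what I mean about the stride, here is a rough sketch. It assumes a vector-add kernel like the one in the CUDA.jl introduction tutorial; the kernel name `gpu_add_nostride!`, the array size, and the 256-thread block size are my own placeholders, not taken from your code. With one thread per element, each thread handles a single index and a bounds check replaces the loop:

```julia
using CUDA

# Hypothetical kernel: one thread per element, so no grid-stride loop.
function gpu_add_nostride!(y, x)
    # Global 1-based index of this thread across all blocks.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Guard against the extra threads in the last, partially filled block.
    if i <= length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

N = 2^20                      # assumed problem size
x_d = CUDA.fill(1.0f0, N)
y_d = CUDA.fill(2.0f0, N)

threads = 256                 # assumed block size
blocks = cld(N, threads)      # enough blocks to cover every element

# CUDA.@sync waits for the kernel to finish, which matters when timing.
CUDA.@sync @cuda threads=threads blocks=blocks gpu_add_nostride!(y_d, x_d)
```

I wouldn't expect this to change the timing much by itself. When comparing against `y_d .+= x_d`, it is worth checking that both versions are timed with `CUDA.@sync` and that the first call, which includes compilation, is excluded.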
