GPU kernel?

It should probably look something like `@btime CUDA.@sync f1($(cu(x)), $(cu(A)));`. This appears to be slower than the CPU at n=1000 (and much slower if you include the time to transfer).

Tullio is not always fast at GPU stuff. You can also make use of the sparsity without Tullio, e.g. with `f4(x, js) = sum(sin.(x .- @view x[js]); dims=2) |> vec`
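To make the shape of `js` concrete, here is a minimal CPU sketch of how `f4` could be checked against a plain loop. The layout of `js` is my assumption (an n×k integer matrix whose row `i` lists the indices coupled to `x[i]`), and `f4_loop` and the random test data are made up purely for illustration:

```julia
# Assumption: js is an n×k matrix of indices; row i holds the k neighbours of x[i].
f4(x, js) = sum(sin.(x .- @view x[js]); dims=2) |> vec

# Hypothetical reference implementation: the same sum written as an explicit loop.
function f4_loop(x, js)
    out = zeros(eltype(x), length(x))
    for i in axes(js, 1), j in axes(js, 2)
        out[i] += sin(x[i] - x[js[i, j]])
    end
    return out
end

n, k = 1000, 5
x = rand(n)
js = rand(1:n, n, k)   # random neighbour indices, for illustration only

@assert f4(x, js) ≈ f4_loop(x, js)
```

The broadcast only touches the n×k entries named in `js`, rather than a dense n×n matrix, which is where the saving would come from when k is much smaller than n.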
