Yup, this looks about as expected.
That was fixed up.
What does the profile say? Share a flame graph. If most of the time is spent in the parts that aren’t allocating, then the allocations are not the issue. Allocations can even improve performance in some cases.
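Something like the following is one way to get that profile. It is only a sketch: the ODE, the solver choice (`Vern7` from OrdinaryDiffEq.jl), and the helper `run_solves` are hypothetical stand-ins for your actual setup.

```julia
using OrdinaryDiffEq, Profile

# Hypothetical stand-in problem; substitute your own ODE and solver settings.
f(u, p, t) = 1.01 * u
prob = ODEProblem(f, 0.5, (0.0, 1.0))

# Repeat the solve inside a function so the sampling profiler collects enough samples.
function run_solves(prob, n)
    acc = 0.0
    for _ in 1:n
        sol = solve(prob, Vern7(); save_everystep = false)
        acc += sol.u[end]
    end
    return acc
end

run_solves(prob, 10)                  # warm-up so compilation is not profiled
Profile.clear()
@profile run_solves(prob, 10_000)

# Flat text report, sorted by sample count, hiding lines with fewer than 10 samples.
Profile.print(format = :flat, sortedby = :count, mincount = 10)

# Bytes allocated over 100 solves, to compare against where the time actually goes.
allocated = @allocated run_solves(prob, 100)
```

For the flame graph itself, ProfileView.jl's `@profview run_solves(prob, 10_000)` should show the same data graphically, assuming that package is installed.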
For the lowest-overhead case, try the Vern implementations in SimpleDiffEq.jl: GPUVern7 and GPUVern9. If it’s a dead simple ODE like this, then those should have essentially zero overhead since they are just the loop.
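A minimal sketch of what that looks like, assuming an out-of-place problem with a static-array state and a fixed step size passed as `dt`. The ODE and the value of `dt` are placeholders chosen for illustration, not taken from your code.

```julia
using SimpleDiffEq, StaticArrays

# Hypothetical test problem: u' = 1.01u, written out-of-place on an SVector.
f(u, p, t) = 1.01 * u
u0 = SVector(0.5)
prob = ODEProblem(f, u0, (0.0, 1.0))

# Assumption: GPUVern7 is used here with a fixed time step, so dt is given explicitly.
sol = solve(prob, GPUVern7(), dt = 0.01)

# Final state, to compare against the analytic solution 0.5 * exp(1.01).
final_state = sol.u[end]
```

Swapping in `GPUVern9()` should be the only change needed to try the ninth-order variant.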
The big thing to ask is whether the Verner methods are the right ones for the job here. At the tolerances you’re choosing, the answer is probably no.