Some CUDA functions suddenly become very slow

CUDA APIs are asynchronous, so your initial timing only measures the time to launch the kernel, not the time to run it. The kernel really does take 60s to finish, because it's implemented badly: you're iterating over the elements on a single GPU thread, which is the wrong way to use a GPU. Please read the CUDA.jl introductory tutorial, where this exact pitfall is explained: Introduction · CUDA.jl
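
To make both points concrete, here is a minimal sketch along the lines of that tutorial, not your actual code. The kernel name `gpu_add!`, the array size, and the launch configuration are assumptions picked for illustration. It spreads the loop over many threads with a grid-stride loop, and it times the launch with and without synchronization:

```julia
using CUDA

# Hypothetical example kernel: every thread starts at its own global index
# and steps through the array by the total number of launched threads.
function gpu_add!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

N = 2^20                      # assumed problem size
x_d = CUDA.fill(1.0f0, N)
y_d = CUDA.fill(2.0f0, N)

nthreads = 256                # assumed launch configuration
nblocks = cld(N, nthreads)

# Warm-up launch, so that kernel compilation is not part of the timings below.
CUDA.@sync @cuda threads=nthreads blocks=nblocks gpu_add!(y_d, x_d)

# Without synchronization, this only measures how long the launch takes.
t_launch = @elapsed begin
    @cuda threads=nthreads blocks=nblocks gpu_add!(y_d, x_d)
end
CUDA.synchronize()            # let that kernel finish before timing again

# With CUDA.@sync, the measurement also covers the kernel's execution.
t_total = @elapsed begin
    CUDA.@sync @cuda threads=nthreads blocks=nblocks gpu_add!(y_d, x_d)
end

println("launch only: ", t_launch, " s, launch + execution: ", t_total, " s")
```

A plain `@cuda gpu_add!(y_d, x_d)` with no `threads` or `blocks` runs the whole loop on one thread, which is the slow pattern described above. For benchmarking, always wrap the GPU work in `CUDA.@sync` (or use `CUDA.@elapsed`), otherwise you only time the launch.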