CUDA APIs are asynchronous, so your initial timing only measures the time to launch the kernel, not the time to run it. The kernel really does take 60s to finish, because it's implemented badly: you're iterating over the elements on a single GPU thread, which is the wrong way to use a GPU. Please read the CUDA.jl introductory tutorial, where this exact pitfall is explained: Introduction · CUDA.jl
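
To make both points concrete, here is a minimal sketch. Your kernel isn't reproduced here, so I'm assuming a simple element-wise addition as a stand-in; `add_serial!`, `add_parallel!`, the array size, and the thread count are all made up for illustration.

```julia
using CUDA

# Hypothetical stand-in for the slow kernel: launched with the default
# configuration (1 thread, 1 block), so a single GPU thread loops over everything.
function add_serial!(y, x)
    for i in 1:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Same operation, but each thread handles exactly one element.
function add_parallel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

N = 2^20                      # assumed problem size
x = CUDA.fill(1.0f0, N)
y = CUDA.fill(2.0f0, N)

# Warm-up launch so the timings below don't include compilation.
CUDA.@sync @cuda add_serial!(y, x)

# @cuda returns as soon as the kernel is queued: this only times the launch.
t_launch = @elapsed @cuda add_serial!(y, x)
CUDA.synchronize()            # drain the queue before the next measurement

# CUDA.@sync blocks until the GPU is done: this times the kernel itself.
t_serial = @elapsed CUDA.@sync @cuda add_serial!(y, x)

# Parallel version: enough blocks of 256 threads to cover all N elements.
threads = 256
blocks = cld(N, threads)
CUDA.@sync @cuda threads=threads blocks=blocks add_parallel!(y, x)   # warm-up
t_parallel = @elapsed CUDA.@sync @cuda threads=threads blocks=blocks add_parallel!(y, x)

println("launch only:     ", t_launch, " s")
println("serial kernel:   ", t_serial, " s")
println("parallel kernel: ", t_parallel, " s")
```

The exact numbers depend on your GPU, but the launch-only timing should be tiny compared to the synchronized serial one. For something this simple you wouldn't normally write a kernel at all: broadcasting (`y .+= x`) on `CuArray`s already runs in parallel.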