DiffEq documentation example slower on GPU (33 sec) than on CPU (0.14 sec)

Hello,

We were surprised to see that the example code in the online documentation ran slower on the GPU than on the CPU.

https://docs.sciml.ai/DiffEqGPU/stable/getting_started/#Simple-Example-of-Within-Method-GPU-Parallelism

using OrdinaryDiffEq, CUDA, LinearAlgebra

# CPU baseline: the linear ODE du/dt = A*u with a dense 1000x1000 matrix
u0 = rand(1000)
A = randn(1000, 1000)
f(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f, u0, (0.0, 1.0))
@time "Tsit5 on CPU" sol = solve(prob, Tsit5())

# GPU version: move u0 and A to the device as CuArrays
u0 = cu(rand(1000))
A = cu(randn(1000, 1000))
f(du, u, p, t) = mul!(du, A, u) # same body; the non-const global A is looked up at call time
prob = ODEProblem(f, u0, (0.0f0, 1.0f0)) # Float32 is better on GPUs!
@time "Tsit5 on GPU" sol = solve(prob, Tsit5())

Here is the command and corresponding output on our system:

$ srun -G 4 --pty .local/bin/julia ./research/diffeq/sandbox/mwe_within_method.jl
Tsit5 on CPU: 0.146780 seconds (115.22 k allocations: 10.860 MiB)
Tsit5 on GPU: 33.247299 seconds (31.20 M allocations: 2.100 GiB, 5.22% gc time, 0.42% compilation time)

Is this expected? We were wondering whether there is an implementation issue in the library, or whether our system is just unusual. Below is some more information:

(@v1.9) pkg> status
Status `~/.julia/environments/v1.9/Project.toml`
  [052768ef] CUDA v4.3.0
  [071ae1c0] DiffEqGPU v2.2.1
  [1dea7af3] OrdinaryDiffEq v6.51.2
  [37e2e46d] LinearAlgebra

julia> using CUDA

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 470.161.3, originally for CUDA 11.4

CUDA libraries: 
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 11.0.0+470.161.3

Julia packages: 
- CUDA.jl: 4.3.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
- CUDA_Runtime_Discovery: 0.2.2

Toolchain:
- Julia: 1.9.0
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  1: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  2: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  3: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)

$ lscpu                                                                          
Architecture:                    x86_64                                                        
CPU op-mode(s):                  32-bit, 64-bit                                                
Byte Order:                      Little Endian                                                 
Address sizes:                   46 bits physical, 48 bits virtual                             
CPU(s):                          80                                                            
On-line CPU(s) list:             0-79                                                          
Thread(s) per core:              2                                                             
Core(s) per socket:              20                                                            
Socket(s):                       2                                                             
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:                        4
CPU MHz:                         3005.529
CPU max MHz:                     3700.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4800.00
Virtualization:                  VT-x
L1d cache:                       1.3 MiB
L1i cache:                       1.3 MiB
L2 cache:                        40 MiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0-19,40-59
NUMA node1 CPU(s):               20-39,60-79

I apologize for the consecutive posts on this topic; please do let me know if there’s any way for me to improve this question or if it is somehow inappropriate.

I would imagine 99%+ of that 33s is compilation time for GPU stuff, much of which unfortunately isn’t reported in @time.
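
For example (a sketch, reusing the GPU MWE from above): timing the same solve twice in one session makes the split visible, since only the first call pays the compilation cost:

using OrdinaryDiffEq, CUDA, LinearAlgebra

u0 = cu(rand(1000))
A = cu(randn(1000, 1000))
f(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f, u0, (0.0f0, 1.0f0))
@time "Tsit5 on GPU, cold" sol = solve(prob, Tsit5()) # pays Julia + GPU kernel compilation
@time "Tsit5 on GPU, hot" sol = solve(prob, Tsit5())  # just the solve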

Thank you for your reply! I suppose it’s implied by the 115 k CPU allocations for Tsit5 on CPU vs the 31 M CPU allocations for Tsit5 on GPU. Could I use nsys to conclusively confirm how much time is spent in compilation?
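
Something along these lines is what I had in mind (untested; same script path as before):

$ nsys profile -o mwe_report --stats=true .local/bin/julia ./research/diffeq/sandbox/mwe_within_method.jl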

The easiest way would be to just run the solve a second time in the same session and see how much faster it is.

Ok, that’s very helpful. Thank you!
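
We appended a second solve, timed with CUDA.@time, to the end of the script, along these lines (a reconstruction; the label is printed manually):

print("Tsit5/CUDA.@time: ")           # manual label for the output line below
CUDA.@time sol = solve(prob, Tsit5()) # second solve in the same session, so no recompilation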

$ srun -G 4 --pty .local/bin/julia ./research/diffeq/sandbox/mwe_within_method.jl
Tsit5 on CPU: 0.150899 seconds (115.23 k allocations: 10.984 MiB)
Tsit5 on GPU: 31.783274 seconds (31.20 M allocations: 2.100 GiB, 4.97% gc time, 0.44% compilation time)
Tsit5/CUDA.@time:   0.017282 seconds (48.59 k CPU allocations: 3.285 MiB) (606 GPU allocations: 1.534 MiB, 8.90% memmgmt time)