DiffEq documentation example slower on GPU (33 sec) than on CPU (0.14 sec)

Hello,

We were surprised to see that the example code in the online documentation ran slower on the GPU than on the CPU.

https://docs.sciml.ai/DiffEqGPU/stable/getting_started/#Simple-Example-of-Within-Method-GPU-Parallelism

using OrdinaryDiffEq, CUDA, LinearAlgebra

# CPU baseline: the linear ODE du/dt = A*u with a dense 1000x1000 matrix
u0 = rand(1000)
A = randn(1000, 1000)
f(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f, u0, (0.0, 1.0))
@time "Tsit5 on CPU" sol = solve(prob, Tsit5())

# GPU version: move u0 and A to the device as CuArrays
u0 = cu(rand(1000))
A = cu(randn(1000, 1000))
f(du, u, p, t) = mul!(du, A, u) # same body; the non-const global A is looked up at call time
prob = ODEProblem(f, u0, (0.0f0, 1.0f0)) # Float32 is better on GPUs!
@time "Tsit5 on GPU" sol = solve(prob, Tsit5())

Here is the command and corresponding output on our system:

$ srun -G 4 --pty .local/bin/julia ./research/diffeq/sandbox/mwe_within_method.jl
Tsit5 on CPU: 0.146780 seconds (115.22 k allocations: 10.860 MiB)
Tsit5 on GPU: 33.247299 seconds (31.20 M allocations: 2.100 GiB, 5.22% gc time, 0.42% compilation time)

Is this expected? We were wondering whether there is an implementation issue in the library, or whether our system is just unusual. Below is some more information:

(@v1.9) pkg> status
Status `~/.julia/environments/v1.9/Project.toml`
  [052768ef] CUDA v4.3.0
  [071ae1c0] DiffEqGPU v2.2.1
  [1dea7af3] OrdinaryDiffEq v6.51.2
  [37e2e46d] LinearAlgebra

julia> using CUDA

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 470.161.3, originally for CUDA 11.4

CUDA libraries: 
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 11.0.0+470.161.3

Julia packages: 
- CUDA.jl: 4.3.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
- CUDA_Runtime_Discovery: 0.2.2

Toolchain:
- Julia: 1.9.0
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  1: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  2: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  3: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)

$ lscpu                                                                          
Architecture:                    x86_64                                                        
CPU op-mode(s):                  32-bit, 64-bit                                                
Byte Order:                      Little Endian                                                 
Address sizes:                   46 bits physical, 48 bits virtual                             
CPU(s):                          80                                                            
On-line CPU(s) list:             0-79                                                          
Thread(s) per core:              2                                                             
Core(s) per socket:              20                                                            
Socket(s):                       2                                                             
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:                        4
CPU MHz:                         3005.529
CPU max MHz:                     3700.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4800.00
Virtualization:                  VT-x
L1d cache:                       1.3 MiB
L1i cache:                       1.3 MiB
L2 cache:                        40 MiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0-19,40-59
NUMA node1 CPU(s):               20-39,60-79

I apologize for the consecutive posts on this topic; please do let me know if there’s any way for me to improve this question or if it is somehow inappropriate.

I would imagine 99%+ of that 33s is compilation time for GPU stuff, much of which unfortunately isn’t reported in @time.
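
For example (a sketch, reusing the GPU MWE from above): timing the same solve twice in one session makes the split visible, since only the first call pays the compilation cost:

using OrdinaryDiffEq, CUDA, LinearAlgebra

u0 = cu(rand(1000))
A = cu(randn(1000, 1000))
f(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f, u0, (0.0f0, 1.0f0))
@time "Tsit5 on GPU, cold" sol = solve(prob, Tsit5()) # pays Julia + GPU kernel compilation
@time "Tsit5 on GPU, hot" sol = solve(prob, Tsit5())  # just the solve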

Thank you for your reply! I suppose it’s implied by the 115 k CPU allocations for Tsit5 on CPU vs the 31 M CPU allocations for Tsit5 on GPU. Could I use nsys to conclusively confirm how much time is spent in compilation?
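
Something along these lines is what I had in mind (untested; same script path as before):

$ nsys profile -o mwe_report --stats=true .local/bin/julia ./research/diffeq/sandbox/mwe_within_method.jl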

The easiest way would be to just run the solve a second time in the same session and see how much faster it is.

Ok, that’s very helpful. Thank you!
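
We appended a second solve, timed with CUDA.@time, to the end of the script, along these lines (a reconstruction; the label is printed manually):

print("Tsit5/CUDA.@time: ")           # manual label for the output line below
CUDA.@time sol = solve(prob, Tsit5()) # second solve in the same session, so no recompilation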

$ srun -G 4 --pty .local/bin/julia ./research/diffeq/sandbox/mwe_within_method.jl
Tsit5 on CPU: 0.150899 seconds (115.23 k allocations: 10.984 MiB)
Tsit5 on GPU: 31.783274 seconds (31.20 M allocations: 2.100 GiB, 4.97% gc time, 0.44% compilation time)
Tsit5/CUDA.@time:   0.017282 seconds (48.59 k CPU allocations: 3.285 MiB) (606 GPU allocations: 1.534 MiB, 8.90% memmgmt time)