Hello,
We were surprised to see that the example code from the online documentation ran slower on the GPU than on the CPU:
```julia
using OrdinaryDiffEq, CUDA, LinearAlgebra

# CPU version: dense 1000×1000 matrix-vector product as the ODE right-hand side
u0 = rand(1000)
A = randn(1000, 1000)
f(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f, u0, (0.0, 1.0))
@time "Tsit5 on CPU" sol = solve(prob, Tsit5())

# GPU version: move the data to the device and solve again
u0 = cu(rand(1000))
A = cu(randn(1000, 1000))
f(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f, u0, (0.0f0, 1.0f0)) # Float32 is better on GPUs!
@time "Tsit5 on GPU" sol = solve(prob, Tsit5())
```
Here is the command and corresponding output on our system:
```text
$ srun -G 4 --pty .local/bin/julia ./research/diffeq/sandbox/mwe_within_method.jl
Tsit5 on CPU: 0.146780 seconds (115.22 k allocations: 10.860 MiB)
Tsit5 on GPU: 33.247299 seconds (31.20 M allocations: 2.100 GiB, 5.22% gc time, 0.42% compilation time)
```
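The 31.20 M allocations in the GPU run made us wonder whether some operation is silently falling back to scalar indexing on the device. Here is a sketch of two checks we could run to narrow this down; the `BenchmarkTools` dependency and the `du` buffer are our additions, not part of the documentation example:

```julia
using CUDA, LinearAlgebra, BenchmarkTools

# Fail loudly instead of silently falling back to slow element-wise GPU access
CUDA.allowscalar(false)

A = cu(randn(1000, 1000))
u = cu(rand(1000))
du = similar(u)  # preallocated output buffer for the matvec

# Benchmark just the matrix-vector product; CUDA.@sync waits for the
# asynchronous kernel to finish so the measurement is meaningful
@btime CUDA.@sync mul!($du, $A, $u)
```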
Is this expected? We are wondering whether there is an implementation issue in the library, or whether our system is simply unusual. Here is some more information:
```text
(@v1.9) pkg> status
Status `~/.julia/environments/v1.9/Project.toml`
  [052768ef] CUDA v4.3.0
  [071ae1c0] DiffEqGPU v2.2.1
  [1dea7af3] OrdinaryDiffEq v6.51.2
  [37e2e46d] LinearAlgebra
```
```text
julia> using CUDA

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 470.161.3, originally for CUDA 11.4

CUDA libraries:
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 11.0.0+470.161.3

Julia packages:
- CUDA.jl: 4.3.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
- CUDA_Runtime_Discovery: 0.2.2

Toolchain:
- Julia: 1.9.0
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  1: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  2: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
  3: NVIDIA RTX A4000 (sm_86, 14.742 GiB / 14.746 GiB available)
```
```text
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:            4
CPU MHz:             3005.529
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4800.00
Virtualization:      VT-x
L1d cache:           1.3 MiB
L1i cache:           1.3 MiB
L2 cache:            40 MiB
L3 cache:            55 MiB
NUMA node0 CPU(s):   0-19,40-59
NUMA node1 CPU(s):   20-39,60-79
```
I apologize for the consecutive posts on this topic; please do let me know if there is any way for me to improve this question, or if it is somehow inappropriate.