DiffEqFlux GPU example slow

acse-ogb119 · January 14, 2021, 1:23pm

I copy and paste the DiffEqFlux.jl GPU example, slap in an @time on the sciml_train call at the end and spin it up. The runtime is reported on my system as around 38 seconds.

I then edit the example to be fully CPU, and the runtime is around 2.5 seconds.

Is this example expected to be much slower on the GPU? For instance, is the problem size so small that the overhead of copying to and from GPU memory becomes the bottleneck? Do others see the same performance with this example?

I had CUDA Toolkit already installed on my device before I started trying out Julia. Output of CUDA.versioninfo():

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 456.43.0

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+456.43
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: Quadro P2000 (sm_61, 2.859 GiB / 4.000 GiB available)

ChrisRackauckas · January 14, 2021, 1:29pm

It’s not a performance issue. Matmuls of that size are simply faster on the CPU than on the GPU. Did you try

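using BenchmarkTools, CUDA, Flux  # @btime is from BenchmarkTools, gpu is from Flux

A = rand(50,50)
b = rand(50)
@btime $A*$b                # interpolate globals so only the matmul is timed
gA = gpu(A)
gb = gpu(b)
@btime CUDA.@sync $gA*$gb   # synchronize so the timing waits for the GPU kernel to finish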

You can make the example bigger and bigger until GPUs finally make sense. You need pretty big problems for GPUs to make sense if the CPU code is optimized. You can also batch data points which would help here.
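To make the batching point concrete, here is a minimal CPU-only sketch. It is not taken from the DiffEqFlux example, and the sizes (a 50×50 matrix and 1000 data points) are made up for illustration. It shows that many small matrix-vector products can be replaced by one matrix-matrix product with the same result. On a GPU the batched form is a single kernel launch instead of one launch per data point, which is where the benefit would come from.

```julia
using LinearAlgebra

# Hypothetical sizes for illustration: a 50x50 weight matrix and 1000 data points.
W = rand(Float32, 50, 50)
X = rand(Float32, 50, 1000)   # each column is one data point

# Unbatched: one matrix-vector product per data point
# (on a GPU this would be 1000 separate small kernel launches).
unbatched = reduce(hcat, [W * X[:, i] for i in 1:size(X, 2)])

# Batched: a single matrix-matrix product over all data points at once.
batched = W * X

# Both approaches give the same result up to floating-point rounding.
@assert size(batched) == (50, 1000)
@assert unbatched ≈ batched
```

Wrapping `W` and `X` in `gpu(...)` would run the same two variants on the GPU, assuming Flux and CUDA are loaded as in the snippet above.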

acse-ogb119 · January 14, 2021, 1:43pm

Indeed the matrix multiplication test gives appropriate speed-up on my device as the size grows. So yes my GPU is working as expected - I just wanted to confirm that example was meant primarily to show you how to use the GPU, and not for demonstrating performance gains.

Thanks
(P.S. if you can point me to GPU example using DiffEqFlux that does yield performance gains, that would be great!)

ChrisRackauckas · January 14, 2021, 1:53pm

https://www.stochasticlifestyle.com/solving-systems-stochastic-pdes-using-gpus-julia/ is a good example for GPUs, though not necessarily a neural ODE one. The problem here is that very optimized CPU code does pretty well on neural ODEs and UDEs until you make them asymptotically big. MNIST and the like can be a good example, though PDE-constrained optimization is probably a more realistic one. I think this is probably a good starting point for a GSoC. IIRC the PDE example from the UDE paper is a good one for GPUs.

https://github.com/ChrisRackauckas/universal_differential_equations/blob/master/Climate/NeuralPDE/npde.jl
