DiffEqFlux GPU example slow

I copy and paste the DiffEqFlux.jl GPU example, slap in an @time on the sciml_train call at the end and spin it up. The runtime is reported on my system as around 38 seconds.

I then edit the example to be fully CPU, and the runtime is around 2.5 seconds.

Is this example expected to be much slower on the GPU? e.g. because the problem size is small and the overhead of copying to and from GPU memory is the bottleneck? Do others experience the same performance issue with this example?

I had CUDA Toolkit already installed on my device before I started trying out Julia. Output of CUDA.versioninfo():

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 456.43.0

- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+456.43
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: Quadro P2000 (sm_61, 2.859 GiB / 4.000 GiB available)

It’s not a performance issue; matmuls of that size are just faster on the CPU. Did you try something like:

using Flux, CUDA, BenchmarkTools

A = rand(50, 50)
b = rand(50)
@btime $A * $b                # interpolate globals so @btime measures only the matmul

gA = gpu(A)
gb = gpu(b)
@btime CUDA.@sync $gA * $gb   # sync, since GPU kernels launch asynchronously
You can make the example bigger and bigger until the GPU finally wins out. If the CPU code is well optimized, you need pretty big problems before a GPU makes sense. You can also batch data points, which would help here.
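To illustrate the batching point, here is a minimal CPU-only sketch (plain Julia, nothing DiffEqFlux-specific; the 50×50 size and batch of 256 are made up for the example). Stacking inputs as columns of a matrix turns many small matvecs into one large matmul, which is the kind of work that actually keeps a GPU busy:

```julia
# Hypothetical sketch: process a batch of inputs with one matmul
# instead of one matvec per data point.
A  = rand(Float32, 50, 50)                # layer weights
xs = [rand(Float32, 50) for _ in 1:256]   # 256 individual inputs

# one-at-a-time: 256 small matvecs
ys_loop = [A * x for x in xs]

# batched: stack inputs as columns, then a single 50×256 matmul
X = reduce(hcat, xs)
Y = A * X

# the batched result matches the per-sample results column by column
@assert all(isapprox(Y[:, i], ys_loop[i]) for i in eachindex(xs))
```

On the GPU the same pattern applies: move `A` and `X` over once (e.g. with `gpu`), and the one large matmul amortizes launch and transfer overhead across the whole batch.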

Indeed, the matrix multiplication test shows the expected speed-up on my device as the size grows. So yes, my GPU is working as expected; I just wanted to confirm that the example is meant primarily to show how to use the GPU, not to demonstrate performance gains.

(P.S. if you can point me to a GPU example using DiffEqFlux that does yield performance gains, that would be great!)

Solving Systems of Stochastic PDEs and using GPUs in Julia - Stochastic Lifestyle is a good example for GPUs, though not necessarily a neural ODE one. The problem here is that very optimized CPU code does pretty well on neural ODEs and UDEs until you make them asymptotically big. MNIST and the like can be a good example, or PDE-constrained optimization is probably a more realistic one. I think this is probably a good starting point for a GSoC. IIRC the PDE example from the UDE paper is a good one for GPUs.
