I copied and pasted the DiffEqFlux.jl GPU example, slapped an @time on the sciml_train call at the end (sketched below), and spun it up. The reported runtime on my system is around 38 seconds.
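For reference, this is roughly where the timing macro went. It's a minimal sketch only: `loss` and `p` are placeholders for the loss function and parameters the tutorial actually defines, and it assumes the DiffEqFlux v1-era `sciml_train` API:

```julia
using DiffEqFlux, Flux

# Placeholder loss and parameters, standing in for the ones
# defined in the DiffEqFlux.jl GPU tutorial.
loss(p) = sum(abs2, p)
p = rand(Float32, 10)

# The only change to the example: wrap the training call in @time.
result = @time DiffEqFlux.sciml_train(loss, p, ADAM(0.05), maxiters = 100)
```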
I then edited the example to run fully on the CPU, and the runtime drops to around 2.5 seconds.
Is this example expected to be much slower on the GPU? For instance, is the problem small enough that the overhead of copies to and from GPU memory becomes the bottleneck? Do others see the same performance with this example?
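To check whether transfer and kernel-launch overhead could plausibly explain the gap, here is a rough probe I'd use (my own illustration, not part of the example); run each line twice so compilation time isn't counted:

```julia
using CUDA

A = rand(Float32, 50, 50)
B = rand(Float32, 50, 50)
Ad, Bd = CuArray(A), CuArray(B)

@time A * B                    # CPU matmul at a small size
CUDA.@time Ad * Bd             # same matmul on the GPU, including launch overhead
CUDA.@time Array(CuArray(A))   # round-trip host -> device -> host copy
```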
I already had the CUDA Toolkit installed on my device before I started trying out Julia. Output of CUDA.versioninfo():
```
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 456.43.0

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+456.43
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: Quadro P2000 (sm_61, 2.859 GiB / 4.000 GiB available)
```