CUDA.jl vs. ArrayFire.jl on a Windows 10 VM running on VMware vSphere

We need to speed up some double-precision calculations that are mostly complex-array operations and FFT/IFFT. We have been using an NVIDIA GeForce RTX 2080 Ti in a desktop PC; the speedup over the CPU is very good but not sufficient for our needs.

We are now remotely testing a server that hosts a Windows 10 VM on VMware vSphere 5.x, with an NVIDIA Tesla V100 dedicated to the VM via GPU passthrough. The system was configured by a vendor and we don't know (and wouldn't understand) the details of the vSphere settings. We have no physical access either; all we can do is RDP into the VM.

Given the quoted FP64 specs of the V100 (~7 TFLOP/s) vs. the RTX 2080 Ti (~0.5 TFLOP/s), we'd expect roughly a 14x speedup. The actual results are much less impressive:

(1) With CUDA.jl, the calculation time on the V100 VM is about the same as on the RTX 2080 Ti desktop.

(2) With ArrayFire.jl, the calculation is about 3x faster on the V100 VM than on the RTX 2080 Ti desktop.

(3) For reference, CUDA.jl and ArrayFire.jl take about the same time on the RTX 2080 Ti desktop.

Does anybody know:

(1) Why is the speedup of the V100 over the RTX 2080 Ti so much smaller than the specs suggest? My two initial guesses:

  • Data movement between CPU and GPU takes a significant share of the overall time, and host-device transfer speed is about the same for the V100 and the RTX 2080 Ti (a quick check for this is sketched after this list).
  • The GPU passthrough introduces a bottleneck that slows down the V100. What could that be, though?
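
For the first guess, I could time the host-device copies in isolation on both machines. A rough sketch of what I have in mind (assuming CUDA.jl and BenchmarkTools.jl are installed; the array size and element type are just placeholders):

```julia
using CUDA, BenchmarkTools

n = 16 * 1024 * 1024                 # ~256 MiB of ComplexF64 (16 bytes per element)
h = rand(ComplexF64, n)              # host array
d = CuArray(h)                       # device array, allocated once outside the timing

# Host -> device, synchronized so the copy has actually finished when timed
t_h2d = @belapsed CUDA.@sync copyto!($d, $h)
# Device -> host
t_d2h = @belapsed CUDA.@sync copyto!($h, $d)

println("H2D: ", sizeof(h) / t_h2d / 1e9, " GB/s")
println("D2H: ", sizeof(h) / t_d2h / 1e9, " GB/s")
```

If those bandwidth numbers come out about the same on the desktop and on the VM, that would support guess one.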

(2) Why does ArrayFire.jl perform better than CUDA.jl on the V100 VM?

BTW, I'm aware of JIT compilation overhead, so all performance comparisons are done on the 2nd run, not the 1st. I also know that something that runs for only milliseconds needs many repetitions to be statistically stable, but this problem takes several seconds per run with minimal I/O, so statistical fluctuation shouldn't be large; repeated runs give about the same result. The heavy lifting is almost entirely on the GPU, though each run still requires moving multiple arrays back and forth between CPU and GPU, about three round trips in total.

Thanks and Happy Holidays in advance!

That's not a real answer, but I think CUDA.jl and ArrayFire.jl both call cuFFT underneath. I can't imagine CUDA.jl being much worse at that task, so how do you benchmark? Maybe there is still something fishy. Note that you have to synchronize in both CUDA and AF after every kernel call; CUDA.jl's introduction shows how, and for AF it's maybe sync()? Also, you should use BenchmarkTools for timing; it makes life easier. And start with a minimal example, e.g. FFTing just one array.
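
Roughly what I mean by a minimal example (untested sketch; assumes BenchmarkTools.jl is installed, and I think `using CUDA.CUFFT` brings the `fft`/`plan_fft` names into scope, otherwise `using FFTW` does):

```julia
using CUDA, CUDA.CUFFT     # cuFFT provides the FFT methods for CuArray
using BenchmarkTools

x = CuArray(rand(ComplexF64, 4096, 4096))   # double-precision complex array on the GPU
p = plan_fft(x)                              # build the cuFFT plan once and reuse it

# Without CUDA.@sync you only time the asynchronous kernel launch,
# not the actual FFT execution on the device.
@btime CUDA.@sync $p * $x
```

If that already shows the V100 barely beating the 2080 Ti, the problem is in the setup rather than in your actual calculation.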

Merry Christmas to you too! :slight_smile: