CUDA.jl - Better GPU but Worse Performance

moukann · June 25, 2022, 4:39pm

Hello all,

I have a question regarding general usage of CUDA.jl, not a code problem.

We have 2 different computers, both running Ubuntu. One has an Nvidia Quadro P2000 (5GB GDDR5, 160-bit) graphics card and the other has a GeForce RTX 3080Ti (12GB GDDR6X, 384-bit) graphics card.

We have a GPU-parallelized material simulation program and we run the exact same code on both machines. Then this strange thing happens.

Although the 3080Ti is a much better GPU (at least according to this benchmark → UserBenchmark: Nvidia Quadro P2000 vs RTX 3080-Ti), the code runs about 4 times slower on it than on the P2000.

I could not solve this problem, so I updated all the packages, reinstalled Julia, reinstalled CUDA.jl, ran all the tests, etc. But no, the problem has not gone away.

And I am sure that, although the computers are different, the bottleneck for the simulation is not related to the CPU or CPU RAM. The GPU computation itself, meaning the kernels, runs much slower on the 3080Ti.

Is it caused by the GPU architectures, e.g. the 3080Ti being designed for gaming and the P2000 being a workstation GPU? I cannot think of any answer, and I am not an expert on GPUs. Can someone explain or at least give me some insights? How can I detect the problem? What tests can I run? Thank you.
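
As a first check on the "what tests can I run" question, a minimal sketch that prints what CUDA.jl actually sees on each machine. It assumes CUDA.jl is installed, and it only reports the device and versions in use; it does not diagnose performance by itself.

```julia
using CUDA

# Report what CUDA.jl sees; run this on both machines and compare the output.
if CUDA.functional()
    CUDA.versioninfo()   # driver, toolkit and package versions
    for dev in CUDA.devices()
        println(CUDA.name(dev),
                " | compute capability ", CUDA.capability(dev),
                " | memory (GiB): ", round(CUDA.totalmem(dev) / 2^30; digits = 1))
    end
else
    println("CUDA.jl cannot find a usable GPU on this machine")
end
```

If both machines report the expected card and comparable driver and toolkit versions, the difference is more likely to come from the workload itself than from the installation.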

LaurentPlagne · June 25, 2022, 5:04pm

Do you use Float64? Double precision is slow on gaming cards.

moukann · June 25, 2022, 5:17pm

Exactly, we are using Float64 to get lower errors. Float32 can only reduce the error to the order of 10^-8, whereas Float64 can reduce it to the order of 10^-17.
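
The precision gap between the two types can be illustrated with a short sketch. It only shows the resolution of each format near 1.0 (machine epsilon); the error a given simulation actually reaches depends on the algorithm, so the figures quoted above are order-of-magnitude statements about that simulation, not something this snippet verifies.

```julia
# Machine epsilon: the gap between 1.0 and the next representable number.
eps32 = eps(Float32)   # 2^-23, roughly 1.2e-7
eps64 = eps(Float64)   # 2^-52, roughly 2.2e-16

# A perturbation below half an epsilon is lost when added to 1.
lost32 = 1.0f0 + 1.0f-8 == 1.0f0   # Float32 cannot resolve 1e-8 next to 1
kept64 = 1.0 + 1.0e-8 != 1.0       # Float64 still resolves it

println("eps(Float32) = ", eps32)
println("eps(Float64) = ", eps64)
println("1 + 1e-8 lost in Float32: ", lost32)
println("1 + 1e-8 kept in Float64: ", kept64)
```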

Do you think this is the main reason? I knew Float32 was faster than Float64, but for accuracy reasons I have never tried Float32 on the machines I mentioned above. Thank you for the answer.

LaurentPlagne · June 25, 2022, 5:23pm

Check the GFLOPS figures, for example on NVIDIA GeForce RTX 3080 Ti Specs | TechPowerUp GPU Database.

On professional cards, double-precision (DP) performance is only half of single-precision (SP) performance.

moukann · June 25, 2022, 5:25pm

Indeed, this could be the problem. I will try both precisions and post the results here. Thank you so much.
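
One way to run that comparison is sketched below. It assumes CUDA.jl is installed and that a dense matrix multiply is an acceptable stand-in for the simulation kernels, which it may not be; `gpu_matmul_seconds` is a hypothetical helper name, and the matrix size and repetition count are arbitrary choices.

```julia
using CUDA

# Average time in seconds for an n×n matrix multiply on the GPU with element type T.
# (Hypothetical helper; matmul is only a proxy for the real simulation kernels.)
function gpu_matmul_seconds(::Type{T}; n = 4096, reps = 5) where {T}
    A = CUDA.rand(T, n, n)
    B = CUDA.rand(T, n, n)
    A * B                  # warm-up run, so compilation and library setup are not timed
    CUDA.synchronize()
    t = CUDA.@elapsed begin
        for _ in 1:reps
            A * B
        end
    end
    return t / reps
end

if CUDA.functional()
    t32 = gpu_matmul_seconds(Float32)
    t64 = gpu_matmul_seconds(Float64)
    println("Float32: ", t32, " s | Float64: ", t64, " s | Float64/Float32: ", t64 / t32)
else
    println("No usable GPU found; nothing to benchmark")
end
```

Running this on both machines gives a Float64/Float32 time ratio per card. A ratio far above 2 on a card would be consistent with the double-precision explanation, though timing the actual simulation in both precisions is the more direct test.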

gvijqb · June 27, 2022, 7:01pm

+1 to @LaurentPlagne

Gaming GPUs are best for FP16 and FP32. They are not performant for FP64 and are not designed for simulation workloads per se.

For FP64 you’d want to explore GPUs like Tesla V100 and A100s.

The P2000 would also be quite slow by comparison if you have a large simulation workload.

moukann · June 29, 2022, 4:07pm

Hello gvijqb,

Can we infer that, because the P2000 is a workstation GPU, it is suited to heavy computational workloads? Is that the reason the 3080Ti is slower (or in other words, is it because the 3080Ti is a gaming GPU)? I specifically need this answer because we might consider upgrading our P2000 to a better workstation GPU. Thank you.

Red-Portal · June 29, 2022, 5:30pm

Yes, the Quadro line of products was originally designed with large VRAM and FP64 operations in mind. The Tesla line of products came later for machine learning applications, but it kinda covers the same range of workloads except for FP16. So yes, the P2000 is supposed to do FP64 well compared to the 3080Ti.

moukann · June 29, 2022, 6:10pm

Thank you for your clear answer. We had better buy a workstation GPU to get the desired performance.

Red-Portal · June 29, 2022, 6:45pm

The funny thing is that the Quadros and the GeForces used to be the exact same hardware. People could hack the firmware (“soft mods”) to unlock the power of Quadros from GeForces. Nvidia later figured out that the secret had been revealed and moved to hardware locks (think a few resistors acting as switches). People realized that too and started soldering (“hard mods”). I wonder how this works these days…
