CUDA.jl - Better GPU but Worse Performance

Hello all,

I have a question regarding the general usage of CUDA.jl, not a specific code problem.

So we have 2 different computers, both running Ubuntu. One of them has an Nvidia Quadro P2000 (5 GB GDDR5, 160-bit) graphics card and the other one has a GeForce RTX 3080 Ti (12 GB GDDR6X, 384-bit) graphics card.

We have a GPU-parallelized material simulation program, and we run the exact same code on both machines. Then this strange thing happens.

Although the 3080 Ti is a much better GPU (at least according to this benchmark → UserBenchmark: Nvidia Quadro P2000 vs RTX 3080-Ti), the code runs about 4 times slower than on the P2000.

I could not solve this problem, so I updated all the packages, reinstalled Julia, reinstalled CUDA.jl, ran all the tests, etc. But the problem has not gone away.

And I am sure that although the computers are different, the bottleneck for the simulation is not the CPU or CPU RAM. The GPU computation itself, i.e. the kernels, runs much slower on the 3080 Ti.

Is it caused by the GPU architectures, e.g. the 3080 Ti being designed for gaming while the P2000 is a workstation GPU? I cannot think of any answer; I am not an expert on GPUs. Can someone explain or at least give me some insights? How can I detect the problem? What tests can I run? Thank you.
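One way to narrow this down is to time the device work itself on both machines, so host-side differences are excluded. A minimal sketch (the `axpy_kernel!` below is a generic stand-in, not your actual simulation kernel):

```julia
using CUDA

# Stand-in element-wise kernel: replace with one of your simulation's kernels.
function axpy_kernel!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

n = 10^7
x = CUDA.rand(Float64, n)
y = CUDA.rand(Float64, n)

# Compile without launching, then pick a launch configuration.
kernel  = @cuda launch=false axpy_kernel!(y, x, 2.0)
config  = launch_configuration(kernel.fun)
threads = min(n, config.threads)
blocks  = cld(n, threads)

# CUDA.@elapsed brackets the launch with GPU events and synchronizes,
# so this measures device time only, not Julia/CPU overhead.
t = CUDA.@elapsed kernel(y, x, 2.0; threads, blocks)
println("kernel time: ", t, " s")
```

Running the same script on both machines tells you whether the kernels themselves are slower. `CUDA.@profile` (or Nsight Systems) can additionally show whether time goes to kernels or to memory transfers.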

Do you use Float64? Double precision is slow on gaming cards.


Exactly, we are using Float64 for better error rates. Float32 can only reduce the error down to the order of 10^-8, whereas Float64 can reduce it down to the order of 10^-17.
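For reference, the machine epsilons line up with those orders of magnitude (you can check them in Julia directly; note the achievable error in a long simulation also depends on conditioning and error accumulation, not just the base resolution):

```julia
# Relative spacing of floating-point numbers around 1.0:
println(eps(Float32))  # ~1.19e-7  → roughly 7 significant decimal digits
println(eps(Float64))  # ~2.22e-16 → roughly 16 significant decimal digits
```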

Do you think this is the main reason? I knew Float32 was faster than Float64, but for accuracy purposes I have never compared the two precisions on the actual machines I mentioned above. Thank you for the answer.
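To quantify the gap on the actual hardware, one could compare the same compute-bound operation in both precisions. A sketch using matrix multiplication as a stand-in for the real kernels (the function name is made up for illustration):

```julia
using CUDA, LinearAlgebra

# Time one GEMM in precision T on the current device (after a warm-up run).
function matmul_time(::Type{T}; n = 4096) where {T}
    A = CUDA.rand(T, n, n)
    B = CUDA.rand(T, n, n)
    C = CUDA.zeros(T, n, n)
    CUDA.@sync mul!(C, A, B)               # warm-up / compilation
    return CUDA.@elapsed CUDA.@sync mul!(C, A, B)
end

t32 = matmul_time(Float32)
t64 = matmul_time(Float64)
println("Float64 / Float32 time ratio: ", t64 / t32)
```

On a card with crippled double precision the ratio should be large; on cards with a 1:2 FP64:FP32 ratio (e.g. V100/A100) it should be close to 2. Note GEMM is a best case for the GPU; your own kernels may behave differently if they are memory-bound.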

Check the GFLOPS numbers, for example at NVIDIA GeForce RTX 3080 Ti Specs | TechPowerUp GPU Database.

Even on professional cards, DP performance is only half of SP performance.


Indeed, this could be the problem. I will try both precisions and post an answer here. Thank you so much.


+1 to @LaurentPlagne

Gaming GPUs are optimized for FP16 and FP32. They are not performant at FP64 and are not designed for simulation workloads per se.

For FP64 you’d want to explore GPUs like the Tesla V100 and A100.

The P2000 would be quite slow as well in comparison, if you have a large simulation workload.


Hello gvijqb,

Can we infer that because the P2000 is a workstation GPU, it is better suited to heavy computational workloads? Is that the reason the 3080 Ti is slower (in other words, because the 3080 Ti is a gaming GPU)? I specifically need this answer because we might consider upgrading our P2000 to a better workstation GPU. Thank you.

Yes, the Quadro line of products was originally designed with large VRAM and FP64 operations in mind. The Tesla line of products came later for machine-learning applications, but it covers roughly the same range of workloads, except for FP16. So yes, the P2000 is supposed to do FP64 well compared to the 3080 Ti.


Thank you for your clear answer. We had better buy a workstation GPU to get the desired performance.

The funny thing is that the Quadros and the GeForces used to be exactly the same hardware. People could hack the firmware (“soft mods”) to unlock the power of Quadros on GeForces. Nvidia later figured out that the secret had been revealed and moved to hardware locks (think a few resistors acting as switches). People realized that too and started soldering (“hard mods”). I wonder how this works these days…
