Help with a performance issue when upgrading CUDA.jl

Hi all. I have an issue that I’m hoping somebody here can help me with. It’s simple to state yet hard (at least for me) to diagnose: every time I upgrade to a version of CUDA.jl past 5.5.2, the performance of my code suffers. In some situations the slowdown is around 40%, but sometimes it’s a massive 5x or more! It was shocking the first time it happened, and still is, to tell the truth, and I have no idea what’s causing it or how to fix it. So I’m essentially stuck on version 5.5.2, because I can’t accept such a huge performance decrease after an upgrade.

A little about me. I am NOT a professional software developer, and I don’t have a degree. I’m basically a hobbyist, and essentially self-taught both in what I’m working on, which at the moment is machine learning, and in the tools I use, i.e., Julia and CUDA.jl. I can write basic GPU kernels with CUDA.jl using 1D, 2D, or 3D thread blocks, and most of the time they work as designed and give speedups of around 10x to 100x over my CPU code.

The code in question is my custom implementation of standard feed-forward network training with backpropagation, and I’ve written several GPU kernels to handle the computation of all the derivatives (or gradients, if you prefer) needed for the backprop portion of the algorithm. (Yes, I’m aware that automatic differentiation systems exist.) An example of what my modest code can do: it completes the entire training process on the full MNIST dataset of 70,000 28x28 grayscale images in about 70 ms. That’s with my “high-performance” hyperparameter setting (especially the hidden layer sizes and the batch size); the “high-accuracy” setting takes about 450 ms. But that’s using CUDA.jl 5.5.2. Any later version takes significantly longer: the most recent version I tested, 5.9.4, took about 100 ms and 625 ms, respectively.
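To give an idea of the kind of kernel I mean, here’s an illustrative sketch (not my actual code; the sigmoid derivative and the sizes are just made up for the example):

```julia
using CUDA

# Elementwise kernel with a 2D launch: computes the sigmoid derivative
# s .* (1 .- s) for a matrix of activations.
function sigmoid_deriv_kernel!(out, s)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(s, 1) && j <= size(s, 2)
        @inbounds out[i, j] = s[i, j] * (1.0f0 - s[i, j])
    end
    return nothing
end

s   = CUDA.rand(Float32, 128, 256)   # made-up layer activations
out = similar(s)
threads = (16, 16)
blocks  = (cld(size(s, 1), 16), cld(size(s, 2), 16))
@cuda threads=threads blocks=blocks sigmoid_deriv_kernel!(out, s)
```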

What’s worse, after doing data augmentation to produce a dataset of 970,000 images (done on the CPU prior to training), 5.5.2 takes about 2.5 minutes to complete 20 epochs of training, while 5.9.4 takes a whopping 11 minutes for the same training! What’s going on here?! Something is seriously not right. How can there be such a huge difference in performance, especially for a minor release change and not a major one?

Please note, I am in no way criticizing the developers of CUDA.jl. I think it’s amazing what they’ve been able to accomplish, and I commend them for all the work that went into making GPU programming in Julia possible for someone like me. I just don’t understand how a minor release can make such a huge performance difference; I’d think a difference like this would be a head-scratcher even for a major release. I don’t know. Maybe it’s my system; maybe I just have a long-in-the-tooth video card and need to upgrade. In any event, I’d appreciate any feedback the community could provide. Thanks a bunch.

FYI: I do my programming on a laptop with a GTX 1660 Ti Max-Q.

Can you provide a minimal working example (MWE) that shows the slowdown?

Without a piece of code, it is not possible to debug this issue.

Hi, and thanks for the reply. I’m not sure how that would work, since the slowdown affects the entire program, which includes multiple GPU functions/kernels handling all the derivative computations. I could send you an example function if you think that would help. But that just gave me an idea: maybe I should benchmark the individual functions to see whether all of them are affected, or whether there’s just one problem child. I think I’ll do that and update this thread with what I find.

Yes, benchmarking every function individually is a good idea. However, benchmarking code that runs on the GPU is not trivial, as there’s also the data transfer time between the CPU and GPU to be taken into account. Feel free to ask if you’re unsure how to benchmark a function correctly.
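For instance, a pattern like the following usually works well (placeholder arrays, not your code): interpolate the arguments with $ so setup isn’t measured, and wrap the call in CUDA.@sync so you time the kernel itself rather than just the asynchronous launch.

```julia
using CUDA, BenchmarkTools

# Placeholder data, already resident on the GPU so host-device transfers
# aren't part of the measurement.
A = CUDA.rand(Float32, 1024, 1024)
B = CUDA.rand(Float32, 1024, 1024)

# CUDA.@sync blocks until the GPU work has finished; $-interpolation keeps
# the construction of A and B out of the measured time.
@btime CUDA.@sync $A .* $B;
```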

Quick update: I’ve done a bit of testing on a whole slew of different GPU functions, using @btime CUDA.@sync, and I found that, surprisingly, one of the largest deltas between 5.5.2 and 5.9.4 occurred when broadcasting a function (written for scalar input) over a CuMatrix. So, to work around this, I think I’m going to have to write GPU functions/kernels that take the entire CuMatrix as input, instead of using broadcasting.
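For reference, this is roughly the comparison I have in mind, with a made-up scalar function standing in for one of mine (just a sketch, not my real code):

```julia
using CUDA, BenchmarkTools

# Made-up scalar function, stand-in for one of my activation/derivative functions.
myact(x) = x / (1.0f0 + abs(x))

X = CUDA.rand(Float32, 2048, 2048)

# Current approach: broadcast the scalar function over the CuMatrix.
@btime CUDA.@sync myact.($X);

# Planned approach: a kernel that takes the whole matrix and uses a grid-stride loop.
function myact_kernel!(out, x)
    idx    = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in idx:stride:length(x)
        @inbounds out[i] = x[i] / (1.0f0 + abs(x[i]))
    end
    return nothing
end

out     = similar(X)
kernel  = @cuda launch=false myact_kernel!(out, X)
config  = launch_configuration(kernel.fun)
threads = min(length(X), config.threads)
blocks  = cld(length(X), threads)
@btime CUDA.@sync $kernel($out, $X; threads=$threads, blocks=$blocks);
```

If it turns out that broadcasting itself is what regressed between versions, a comparison like this should at least make the delta easy to demonstrate.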

I’ve done a little bit of benchmarking over the course of this little project, including with @btime, @benchmark, CUDA.@profile, and the external profiler Nsight Compute, but there is always something new to learn. I mostly just use @btime to get the relative difference in performance between functions.
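In case it’s useful to anyone else reading, this is roughly how I’ve been using the integrated profiler (placeholder function and data, not my actual code):

```julia
using CUDA

X = CUDA.rand(Float32, 2048, 2048)   # placeholder input
f(x) = x / (1.0f0 + abs(x))          # stand-in for one of my scalar functions

# Prints a summary of host/device time per kernel and per API call.
CUDA.@profile f.(X)

# trace=true instead lists every individual kernel launch and memory operation.
CUDA.@profile trace=true f.(X)
```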
