Hi all. I have an issue that I’m hoping somebody here can help me with. It’s simple to state yet hard (at least for me) to diagnose: every time I upgrade CUDA.jl past version 5.5.2, the performance of my code drops. In some situations the slowdown is around 40%, but in others it’s a massive 5x or more! It was shocking to see the first time it happened, and still is, to tell the truth, and I have no idea what’s causing it or how to fix it. So I’m essentially stuck on 5.5.2, because I can’t accept such a huge performance hit after an upgrade.
A little about me: I am NOT a professional software developer, and I don’t have a degree. I’m a hobbyist and essentially self-taught, both in what I’m currently working on (machine learning) and in the tools I use (Julia and CUDA.jl). I can write basic GPU kernels with CUDA.jl using 1D, 2D, or 3D thread blocks, and most of the time they work as designed, typically giving speedups of 10x to 100x over my CPU code.

The code in question is my own implementation of a feed-forward neural network trained with backpropagation, and I’ve written several GPU kernels to compute all the derivatives (gradients, if you prefer) needed for the backprop pass. (Yes, I’m aware that automatic differentiation systems exist.) An example of what my modest code can do: it can complete the entire training process for an FFN on the full MNIST dataset of 70,000 28×28 grayscale images in about 70 ms. That’s with the “high-performance” hyperparameter setting (mainly the hidden-layer sizes and the batch size); the “high-accuracy” setting takes about 450 ms. But that’s on CUDA.jl 5.5.2. Every later version I’ve tried takes significantly longer, and the most recent one I tested, 5.9.4, took about 100 ms and 625 ms, respectively.
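To give a sense of the kind of kernels I’m writing, here’s a stripped-down sketch in the same style (illustrative only; `sigmoid_delta_kernel!` is a made-up name, and my real kernels are more involved):

```julia
using CUDA

# Illustrative elementwise backprop kernel: gradient of a sigmoid
# output layer, delta = (a - y) * a * (1 - a), one thread per element.
function sigmoid_delta_kernel!(delta, a, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # 1D global index
    if i <= length(delta)
        @inbounds delta[i] = (a[i] - y[i]) * a[i] * (1 - a[i])
    end
    return nothing
end

a     = CUDA.rand(Float32, 1024)
y     = CUDA.rand(Float32, 1024)
delta = CUDA.zeros(Float32, 1024)

threads = 256
blocks  = cld(length(delta), threads)  # enough blocks to cover every element
@cuda threads=threads blocks=blocks sigmoid_delta_kernel!(delta, a, y)
```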
What’s worse: after data augmentation to produce a dataset of 970,000 images (done on the CPU prior to training), 5.5.2 takes about 2.5 minutes to complete 20 epochs of training, while 5.9.4 takes a whopping 11 minutes for the same training! What’s going on here?! Something is seriously not right. How can there be such a huge difference in performance, especially across a minor release change rather than a major one? For what it’s worth, here’s roughly how I’m timing things, in case a measurement artifact is to blame (see the sketch below).
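This is a simplified stand-in, not my actual training loop; the matmul is just a placeholder workload to show the measurement pattern:

```julia
using CUDA, BenchmarkTools

# Placeholder workload: my real code launches the FF/BP kernels here,
# but a matmul is enough to demonstrate the measurement pattern.
W = CUDA.rand(Float32, 1024, 1024)
x = CUDA.rand(Float32, 1024, 256)

W * x                                # warm up: exclude JIT compile time
t = @belapsed CUDA.@sync $W * $x     # sync so we time GPU work, not just the launch
println("step time: $(round(t * 1000; digits = 3)) ms")
```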
Please note, I’m in no way criticizing the developers of CUDA.jl. It’s amazing what they’ve accomplished, and I commend them for all the work that went into making GPU programming in Julia possible for someone like me. I just don’t understand how such a huge performance difference is possible for a minor release; honestly, it would be a head-scratcher even for a major one. I don’t know. Maybe it’s my system; maybe my video card is long in the tooth and I need to upgrade. In any event, I’d appreciate any feedback the community can provide. Thanks a bunch.