GPU performance issues with an ML-from-scratch tutorial

oschulz · April 12, 2023, 5:42pm

I put together (yet another) Julia ML-from-scratch tutorial for a machine learning course we held in Munich recently: GitHub - odsl-team/julia-ml-from-scratch: Machine learning from scratch in Julia

It has very minimal dependencies (autodiff and optimizer are also done from scratch), and is fairly short (single script/notebook, 500 lines incl. I/O and plotting).

I’ve run this now on CPU (ANN training takes about a minute on a laptop), CUDA.jl, AMDGPU.jl, oneAPI.jl and Metal.jl. The good news: It runs on all of them to some degree. The bad news: It’s only fast on CUDA.jl (and currently stalls on oneAPI and Metal). I’ve collected some issues and performance results.

The results are not that surprising, of course, given the maturity of the respective packages, but maybe this tutorial can be useful in tracking down some underlying issues, since it’s so simple - GPU code should see nothing but Julia Base and StdLib functionality plus StructArrays.jl in the broadcast pullback.

@viralbshah suggested I post this here, in case someone would like to take a deeper look at one of the GPU frameworks. Also, maybe I’m just doing something in the tutorial that is very unfriendly to non-CUDA GPUs - suggestions/PRs/etc. are very welcome, of course (thanks to Christian Guinard for first hints regarding Metal.jl).

jpsamaroo · April 15, 2023, 3:31pm

If you can provide the Profile output of running the code under both CUDA and AMDGPU/Metal/oneAPI (as well as system details for where you run it), we can probably start to figure out where the latter 3 packages are slowing down. I would suspect that the slowdown is due to CUDA’s memory allocation/freeing behavior being so well-tuned, while the remaining backends still have room to catch up (this is certainly true for AMDGPU, in particular).

oschulz · April 15, 2023, 5:48pm

If you can provide the Profile output of running the code under both CUDA and AMDGPU/Metal/oneAPI

Sure, I’ll do my best (though for Intel I can’t offer more than a small laptop integrated GPU). Which profiling commands would you suggest?

jpsamaroo · April 16, 2023, 9:25pm

Using the standard Profile.@profile and Profile.Allocs.@profile would be useful to characterize any performance or allocation differences. If those don’t help, then we could pull out NSight and ROCm’s new Omnitrace profiler and see what the kernel performance and synchronization situation looks like.
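For reference, a minimal sketch of what those two commands could look like on the host side. The function `train_step!` and the arrays `W` and `x` are hypothetical stand-ins, not code from the tutorial; the sketch runs on the CPU with plain `Array`s and assumes Julia 1.8 or later for `Profile.Allocs`. With GPU arrays, the sampling profiler would mostly show host-side time (kernel launches, allocation, synchronization), not time spent inside kernels.

```julia
using Profile

# Hypothetical stand-in for one training step; replace with the tutorial's own code.
function train_step!(W, x)
    y = tanh.(W * x)              # allocates temporaries, like a forward pass
    W .-= 0.01f0 .* (y * x')      # in-place update of the weights
    return sum(y)
end

W = randn(Float32, 64, 64)
x = randn(Float32, 64, 32)

train_step!(W, x)  # run once first so that compilation is not profiled

# Sampling profile: shows where host-side time is spent.
Profile.clear()
Profile.@profile for _ in 1:200
    train_step!(W, x)
end
Profile.print(format = :flat, sortedby = :count, mincount = 10)

# Allocation profile: sample_rate=1 records every allocation.
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 train_step!(W, x)
allocs = Profile.Allocs.fetch().allocs
println("recorded allocations: ", length(allocs))
println("bytes allocated: ", sum(a -> a.size, allocs; init = 0))
```

Running the same script once per backend (with `W` and `x` converted to that backend's array type) and comparing the two outputs would be one way to produce the comparison asked for above.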

maleadt · April 17, 2023, 7:34am

For oneAPI that would be VTune (not yet documented), for Metal you use Xcode (Profiling · Metal.jl, still needs some work), and CUDA.jl uses NSight Systems (Profiling · CUDA.jl).

Both oneAPI.jl and Metal.jl could use improvements to their docs, so if you try out the profilers, please write down usage notes and add them to the repository, or file them in an issue!

oschulz · April 17, 2023, 11:04am

This kind of performance profiling might best be done by the experts (e.g. GPU package developers) themselves though, I guess?

The tutorial itself is self-contained and should run out-of-the-box very easily.

I’ll run the CUDA vs. ROCm comparison as @jpsamaroo suggested, but I’m not familiar enough with the internals of the individual GPU packages to dig into this very efficiently.

maleadt · April 17, 2023, 2:05pm

Those docs are there to make it possible for users to profile their GPU code, and not have to defer to the GPU package maintainers (or at least be able to file more actionable issues). Because although I’d love to have a look, I don’t have the time to do so for all of the back-ends I’m maintaining.

oschulz · April 17, 2023, 7:39pm

Those docs are there to make it possible for users to profile their GPU code

Sure, @maleadt, I fully agree. When it comes to understanding and addressing the causes, a more expert hand might be required though.
