I put together (yet another) Julia ML-from-scratch tutorial for a machine learning course we held in Munich recently: GitHub - odsl-team/julia-ml-from-scratch: Machine learning from scratch in Julia
It has very minimal dependencies (autodiff and optimizer are also done from scratch), and is fairly short (single script/notebook, 500 lines incl. I/O and plotting).
I’ve run this now on CPU (ANN training takes about a minute on a laptop), CUDA.jl, AMDGPU.jl, oneAPI.jl and Metal.jl. The good news: It runs on all of them to some degree. The bad news: It’s only fast on CUDA.jl (and currently stalls on oneAPI and Metal). I’ve collected some issues and performance results.
The results are not that surprising, of course, given the maturity of the respective packages, but maybe this tutorial can be useful in tracking down some underlying issues, since it’s so simple - GPU code should see nothing but Julia Base and StdLib functionality plus StructArrays.jl in the broadcast pullback.
@viralbshah suggested I post this here, in case someone would like to take a deeper look with respect to one of the GPU frameworks. Also, maybe I’m just doing something very “non-CUDA-GPU unfriendly” in the tutorial - suggestions/PRs/etc. are very welcome, of course (thanks to Christian Guinard for first hints regarding Metal.jl).
If you can provide the Profile output of running the code under both CUDA and AMDGPU/Metal/oneAPI (as well as system details for where you run it), we can probably start to figure out where the latter 3 packages are slowing down. I would suspect that the slowdown is due to CUDA’s memory allocation/freeing behavior being so well-tuned, while the remaining backends still have room to catch up (this is certainly true for AMDGPU, in particular).
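A quick way to check whether allocation pressure is a plausible culprit, before reaching for a profiler, is to compare an allocating and an in-place version of the same step. The following is only a minimal CPU sketch: `dense_alloc` and `dense_inplace!` are hypothetical stand-ins for a dense layer and are not taken from the tutorial.

```julia
using LinearAlgebra: mul!

# Hypothetical stand-in for a dense layer; not taken from the tutorial.
dense_alloc(w, x, b) = tanh.(w * x .+ b)      # allocates w*x and the result

function dense_inplace!(y, w, x, b)
    mul!(y, w, x)          # writes w*x into the preallocated y
    y .= tanh.(y .+ b)     # fused broadcast, reuses y
    return y
end

# Wrap @allocated in functions so global-scope overhead is not counted.
bytes_alloc(w, x, b) = @allocated dense_alloc(w, x, b)
bytes_inplace(y, w, x, b) = @allocated dense_inplace!(y, w, x, b)

n = 256
w, x, b = rand(Float32, n, n), rand(Float32, n, n), rand(Float32, n)
y = similar(x)

# Warm up once so compilation is not measured.
bytes_alloc(w, x, b); bytes_inplace(y, w, x, b)

println("allocating version: ", bytes_alloc(w, x, b), " bytes")
println("in-place version:   ", bytes_inplace(y, w, x, b), " bytes")
```

If the allocating version dominates, a backend with a slower allocator would be expected to suffer more from it, which would be consistent with the suspicion above. It does not prove it, though.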
If you can provide the Profile output of running the code under both CUDA and AMDGPU/Metal/oneAPI
Sure, I’ll do my best (though for Intel I can’t offer more than a small laptop integrated GPU). Which profiling commands would you suggest?
Using the standard Profile.Allocs.@profile would be useful to characterize any performance or allocation differences. If those don’t help, then we could pull out Nsight and ROCm’s new Omnitrace profiler and see what the kernel performance and synchronization situation looks like.
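For reference, here is a minimal sketch of what collecting both a sampling profile and an allocation profile could look like with the Profile standard library (Profile.Allocs needs Julia 1.8 or newer). `fake_training_step` is a hypothetical CPU placeholder; one would substitute the actual training call from the tutorial. Since GPU kernel launches are asynchronous, the profiled code on a GPU should also wait for the device to finish before returning.

```julia
using Profile

# Hypothetical placeholder for one training step; replace with the real call.
function fake_training_step(n)
    x = rand(Float32, n, n)
    w = rand(Float32, n, n)
    y = tanh.(w * x)       # allocates temporaries, like a dense layer would
    return sum(y)
end

fake_training_step(64)     # warm up so compilation is not profiled

# Sampling profiler: where is the time spent?
Profile.clear()
@profile for _ in 1:200
    fake_training_step(256)
end
Profile.print(format = :flat, sortedby = :count, mincount = 10)

# Allocation profiler: what gets allocated, and how much?
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate = 1.0 fake_training_step(256)
results = Profile.Allocs.fetch()
total_bytes = sum(a.size for a in results.allocs; init = 0)
println("recorded allocations: ", length(results.allocs))
println("recorded bytes:       ", total_bytes)
```

A `sample_rate` of 1.0 records every allocation, which is fine for a single step but too slow for a full training run; a lower rate is the usual choice there. Posting the flat profile output together with the backend, GPU model and Julia version should make a report easier to act on.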
For oneAPI that would be VTune (not yet documented), for Metal you use Xcode (Profiling · Metal.jl, still needs some work), and CUDA.jl uses Nsight Systems (Profiling · CUDA.jl).
Both oneAPI.jl and Metal.jl could use improvements to their docs, so if you try out the profilers, please write down usage notes and add them to the repository, or file them in an issue!
This kind of performance profiling might best be done by the experts (e.g. GPU package developers) themselves though, I guess?
The tutorial itself is self-contained and should run out-of-the-box very easily.
I’ll run the CUDA vs. ROC case as @jpsamaroo suggested, but I’m not familiar enough with the internals of the individual GPU packages to dig deep very efficiently.
Those docs are there to make it possible for users to profile their GPU code, and not have to defer to the GPU package maintainers (or at least be able to file more actionable issues). Because although I’d love to have a look, I don’t have the time to do so for all of the back-ends I’m maintaining.
Those docs are there to make it possible for users to profile their GPU code
Sure, @maleadt, I fully agree. When it comes to understanding and addressing the causes, a more expert hand might be required though.