I’ve reached the conclusion that Flux is no longer usable if the system does not have a GPU. Given that Flux is strongly dependent on CUDA, that means ML apps using Flux on IoT devices such as the Raspberry Pi are no longer viable. The newest CUDA release apparently blocks this (until recently it worked).
My setup is an AArch64 device with a clean Ubuntu 22.04 install, a clean CUDA toolkit install, and a clean Julia 1.9.1 install. Pkg.add("CUDA") works, but using CUDA fails:
julia> using CUDA
┌ Error: Failed to initialize CUDA
│ exception =
│ CUDA error (code 100, CUDA_ERROR_NO_DEVICE)
Did you actually open an issue on the CUDA.jl side as recommended on GitHub? using CUDA has intentionally been designed to be a no-op on platforms which don’t have a driver installed, so I’m pretty sure this counts as a bug.
To me it seems like it probably would be a better idea for Flux to make the entire GPU stack a weak dep now that we have those in 1.9. There’s no reason to depend on all the GPU stuff for users that don’t have GPUs.
Easier said than done, unfortunately. We’ve run into a number of roadblocks trying to make package extensions work while maintaining backwards compat (Flux supports 1.6 and I see people on 1.7 worryingly often) without creating backport branches on a bunch of repos (which did not work out well the last time it was tried). The latest one we ran into is in FluxML/NNlib.jl pull request #492, "make NNlibCUDA an extension" by CarloLucibello. Given I’ve already read complaints about import word salad when it comes to using FluxML packages, you can see why these blockers are non-trivial to resolve.
As a researcher, I’m open to discussing how to collaborate on a solution for combining Flux and embedded Julia. This is a topic of great interest. If the solution involves a Flux 2.0 with less CUDA dependency, so be it.
As a product developer who planned to use embedded Julia in a product, I’m afraid this path is no longer possible. One option in sight would be Python/TinyML: a number of chip suppliers have already made it work, even on the Cortex-M family.
No, that’s not the suggestion. What appears to be happening is that CUDA.jl is detecting that you have the proprietary Nvidia driver installed on your system for some reason. Since you presumably don’t have an Nvidia GPU attached to said system, you’ll want to uninstall the driver and then the issue should go away. If for some reason you must have the driver installed despite not having a GPU attached to the system (edit: or you don’t have it installed), I would mention that on the GitHub issue.
This is not my understanding. Being the man in the middle is not practical at all. Perhaps both teams, Flux and JuliaGPU, could try to reproduce the issue together?
There is definitely an issue to be solved by experts.
I know it’s not evident from this discussion, but you’re not the man-in-the-middle here. Tim and I have seen this issue many times already, hence why we’re both asking if you can test a couple more things.
The problem is that we need an environment like yours to replicate the issue! If you know the exact specifications of the machine you’re working on you could share them here and hope someone else has a similar one to test, but otherwise I’d recommend trying to answer the following questions from my post above:
Do you have Nvidia drivers installed on this machine?
If so, can you uninstall those drivers? Does uninstalling make the error go away?
I added some details on the issue, but essentially, you’ll want to figure out which package on your system provides libcuda.so. Only removing what provides nvidia.ko, aka. the kernel-level driver, is not sufficient. You also need to remove the user-space driver.
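To make that check concrete, here is a minimal sketch that looks for the user-space driver from within Julia and, on Debian/Ubuntu, asks dpkg which package owns it. The helper names `cuda_driver_path` and `owning_package` are hypothetical, and the sketch assumes `libcuda` is on the default library search path and that `dpkg` is the system package manager.

```julia
using Libdl

# Hypothetical helper: resolved on-disk path of libcuda, or `nothing` if the
# dynamic loader cannot find it on the default search path.
function cuda_driver_path()
    name = Libdl.find_library(["libcuda", "libcuda.so.1"])
    isempty(name) && return nothing
    return realpath(Libdl.dlpath(name))
end

# Hypothetical helper: ask dpkg which package owns `path` (Debian/Ubuntu only).
# Returns an empty string if dpkg does not know the file.
function owning_package(path::AbstractString)
    Sys.which("dpkg") === nothing && return ""
    return strip(read(ignorestatus(`dpkg -S $path`), String))
end

path = cuda_driver_path()
if path === nothing
    println("No libcuda found on the default library search path")
else
    println("libcuda resolves to ", path)
    println("dpkg reports: ", owning_package(path))
end
```

If a package name is printed, that is the user-space driver package to remove (or to mention in the GitHub issue if it has to stay).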
But once more, the output you showed in the issue is just an @error message. It should not break Flux. If it does, then please provide more details (what broke? can you provide a backtrace?) so that we can help you better.
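A minimal sketch of how to gather those details, assuming Flux and CUDA.jl are both installed in the active environment. `CUDA.functional()` reports whether CUDA.jl found a usable driver and device; the tiny `Dense` model is only a placeholder for whatever code actually breaks.

```julia
using Flux, CUDA   # the @error logged during `using CUDA` is a log message, not an exception

# Report whether CUDA.jl considers the GPU stack usable on this machine.
if CUDA.functional()
    @info "CUDA is functional; GPU code paths are available"
else
    @info "CUDA is not functional; expecting Flux to run on the CPU"
end

# Run a small CPU-only model and print a full backtrace if anything throws.
try
    model = Dense(2 => 1)            # placeholder model, not from the original report
    y = model(rand(Float32, 2))
    @info "CPU forward pass succeeded" size(y)
catch err
    showerror(stderr, err, catch_backtrace())
    println(stderr)
end
```

If the `catch` branch fires, the printed backtrace is what to paste into the issue.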
This is where package extensions/weakdeps will hopefully get adopted more. I had a related problem where JuliaGNSS depends on CUDA.jl just to check whether the hardware has an Nvidia GPU and suggest enabling GPU acceleration, while I’m running it on a Raspberry Pi.
Yeah, we will probably need a tiny package to detect the availability of GPU software and hardware so that people can check whether CUDA.jl is likely to work without having to depend on all of CUDA.jl.
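As a rough illustration of what such a check could look like, the sketch below uses only the `Libdl` standard library and calls the CUDA driver API functions `cuInit` and `cuDeviceGetCount` directly. The function name `cuda_device_count` is hypothetical and this is not an existing package; it treats every failure (no library, or a non-zero status such as code 100, `CUDA_ERROR_NO_DEVICE`) as "zero devices".

```julia
using Libdl

# Hypothetical helper: number of CUDA devices visible to the driver,
# or 0 if the driver library is missing or reports an error.
function cuda_device_count()
    name = Libdl.find_library(["libcuda", "libcuda.so.1"])
    isempty(name) && return 0                      # no user-space driver found

    lib = Libdl.dlopen(name)
    status = ccall(Libdl.dlsym(lib, :cuInit), Cint, (Cuint,), 0)
    status == 0 || return 0                        # e.g. 100 = CUDA_ERROR_NO_DEVICE

    count = Ref{Cint}(0)
    status = ccall(Libdl.dlsym(lib, :cuDeviceGetCount), Cint, (Ptr{Cint},), count)
    return status == 0 ? Int(count[]) : 0
end

println("CUDA devices visible: ", cuda_device_count())
```

A package could call something like this before deciding whether to suggest GPU acceleration, without pulling in CUDA.jl itself.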
Anyway, this is pretty off-topic to what was originally being discussed here.
Until the issue is fixed, I have downgraded the CUDA package to v3.13.1. Development and compilation remain on a large AArch64 instance without a GPU, and the result is then rsynced to the Raspberry Pi.
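For anyone wanting to try the same workaround, a minimal sketch using the standard Pkg API, assuming the active environment's other packages (Flux included) are compatible with that CUDA.jl version:

```julia
using Pkg

# Install the specific CUDA.jl version mentioned above ...
Pkg.add(name="CUDA", version="3.13.1")

# ... and pin it so a later `Pkg.update()` does not move it forward again.
Pkg.pin("CUDA")

# Show what was actually resolved.
Pkg.status("CUDA")
```

Once the upstream issue is resolved, `Pkg.free("CUDA")` removes the pin.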