Suggestion: abstraction for integrated GPUs?

I have had the following thought: many people do not have regular access to HPC or high-end GPUs (where Nvidia seems most popular), and/or prioritize the benefit of interactive sessions on their personal laptops and low-end machines. The latter may or may not have a discrete GPU, but are almost guaranteed[1] to have an integrated GPU (typically Intel/AMD/Apple silicon). I suppose that, in principle, these integrated GPUs can already offer a significant performance uplift for certain workloads. The near-ubiquity of integrated GPUs might also mean that package developers can assume the availability of this resource and take advantage of it in a way that is transparent to the end user.

Would it be worthwhile to have functionality that abstracts over the local integrated GPU? As a minimal functionality, something like GPUArrays.IntegratedGPUArray(...) that inspects the local machine and returns the appropriate array type from oneAPI/Metal/ROCm? (Some other methods, like an abstracted synchronize(), would also need to be exposed.) Then dispatch on generically written code would do the rest.
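Roughly, I imagine something along these lines (purely hypothetical: integrated_array_type does not exist anywhere, and the OS check is only a crude stand-in for real device detection):

```julia
# Hypothetical sketch of the proposed abstraction; none of these names exist today.
# A crude OS-based heuristic stands in for proper device detection.
function integrated_array_type()
    if Sys.isapple()
        return :Metal      # Apple silicon -> Metal.jl's MtlArray
    elseif Sys.islinux() || Sys.iswindows()
        return :oneAPI     # Intel iGPU -> oneAPI.jl's oneArray (AMD iGPU -> AMDGPU.jl)
    else
        return :CPU        # fall back to a plain Array
    end
end

# The proposed entry point would then copy data to whichever integrated GPU is available:
# IntegratedGPUArray(x) = <convert x to MtlArray / oneArray / ROCArray / Array>
```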

[1] I’m not sure which CPU-attached resources are presented to a job running on a managed cluster.

I am unsure why an integrated GPU would need to be treated differently from a discrete one.

KernelAbstractions.jl works over AMD/Intel/Apple and provides the necessary abstractions. It sounds like what you would like is a package that automatically detects which GPU the user has.
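For example, a single KernelAbstractions.jl kernel runs on whatever backend the input array lives on; here is a minimal sketch (saxpy_kernel! and the wrapper are just illustrative names):

```julia
using KernelAbstractions

# One kernel definition that works on CPU, CUDA, Metal, oneAPI, and AMDGPU backends.
@kernel function saxpy_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

# The backend is inferred from the array type, so the caller never names a vendor.
function saxpy!(y::AbstractArray, a::Number, x::AbstractArray)
    backend = KernelAbstractions.get_backend(y)
    saxpy_kernel!(backend)(y, a, x; ndrange = length(y))
    KernelAbstractions.synchronize(backend)
    return y
end
```

Passing plain Arrays runs this multithreaded on the CPU; passing MtlArrays or oneArrays runs the same kernel on the corresponding GPU.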

That’s a hard question, and it’s particularly hard since the GPU packages are very “heavy”. E.g. we recently had a user with less-than-ideal internet who was frustrated by the fact that a package depended directly on CUDA, so they had to download the CUDA runtime. We have tried to minimize or delay these costs, but they do still exist.

My preferred approach is for users to install a library plus an accelerator package; then, using package extensions and KernelAbstractions.jl, libraries can take advantage of GPUs if the user requests it.
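Concretely, the extension side can be as small as this sketch (MyLib and its to_device hook are made-up names; Metal would be declared under [weakdeps] and [extensions] in MyLib’s Project.toml):

```julia
# ext/MyLibMetalExt.jl — loaded automatically once the end user runs `using Metal`,
# because Metal is listed as a weak dependency / extension trigger of MyLib.
module MyLibMetalExt

using MyLib, Metal

# Override the library's (hypothetical) device-placement hook so inputs land on the Apple GPU.
MyLib.to_device(x::AbstractArray) = MtlArray(x)

end
```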

1 Like

Yes, I had expected that detection would be the relatively straightforward part and that the crux of the matter would be the packaging and distribution. Is a scenario possible where a (new?) package detects the local machine architecture when it is built, and accordingly decides which GPU package to add and download?

To illustrate my use case: I am the primary developer of an academic (private) package which is used in a scientific collaboration among 6 collaborators at different institutions. [Side Julia plug: we were initially working with a legacy codebase that required ~72 hours to run per dataset; I ported the functionality to Julia and reduced it to ~1 minute.] Since our calculations are now not too demanding on wall-clock time, we keep using our personal machines, which are a mixture of Apple and Intel-based laptops. As the package developer, I wondered if I could improve it by utilizing the heterogeneous GPUs on the different end-user machines. Granted, since this is not a public package, I could take the liberty of installing both oneAPI and Metal and deciding dynamically which to use. However, I thought a more elegant solution would be the kind of scenario I listed above.

Right now, we ask users to dispatch off of the array type.

So they install the package and then also add / using Metal. After that, they can run the kernel on an MtlArray. This also allows you (the programmer) to write specialized extensions for certain backends (for example, CUDA might need a slightly different kernel than AMDGPU). It also allows users to specify whether they want their KernelAbstractions kernel to use their GPU or run in parallel on the CPU (the same kernel can do both).
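In code, the user-facing side looks roughly like this (MyLibrary and solve are placeholder names for your package and its entry point):

```julia
using MyLibrary   # placeholder: defines KernelAbstractions kernels over AbstractArray
using Metal       # the user opts in to the Apple GPU backend

x = rand(Float32, 10_000)

solve(x)            # plain Array  -> kernel runs in parallel on the CPU
solve(MtlArray(x))  # MtlArray     -> the same kernel runs on the integrated GPU
```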

You could try Hwloc.jl to detect GPUs on the user’s system, but a package is not allowed to modify the user’s environment. (E.g. you can tell the user how to use Metal.jl, but you can’t auto-install it for them by running Pkg.add from an __init__ function.)

For me it’s also about user agency: they might want to use the GPU for something else, and they should opt in before an application uses it.

For backend settings, Preferences.jl (for static) or ScopedValues.jl (for runtime-dynamic) configuration might help, but as @leios said, the most common approach is to dispatch on an array type.
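For the Preferences.jl route, a library can persist the chosen backend in its LocalPreferences.toml, roughly like this sketch (the “backend” key and choose_backend! are made-up names; the macros must be used from inside a package module):

```julia
using Preferences

# Inside the library: read the persisted choice, defaulting to the CPU.
const BACKEND = @load_preference("backend", "cpu")

# Called by the user once; takes effect only after a restart,
# because preferences are baked in at precompilation time.
function choose_backend!(name::AbstractString)
    @set_preferences!("backend" => name)
    @info "Backend set to $name; restart Julia for it to take effect."
end
```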

1 Like

I agree with the perspective of user agency and autonomy on a general level. My goal in the continued development of the package is to empower my collaborators (some of whom are not familiar with Julia at all) to use it on their own instead of relying on me to run more calculations. Because of this, I would not expect them to tinker much with installing extra packages themselves, or to fiddle with the different array types that my package would dispatch on. I was hoping to make that transparent to them.

Perhaps the reason I am chafing against this is that my package, in terms of functionality, is very high-level and much closer to an end-to-end application: a single function invocation with ~10 scalar arguments launches the entire calculation and spits out the ~10 quantities of interest and a few plots. In particular, the end user never manipulates any arrays. I was hoping to retain this clean user interface for my collaborators. In that case, I don’t see any coding model other than my package adding oneAPI and Metal as hard dependencies, plus some gymnastics with Preferences etc.

It would be reasonable to implement a package which auto-detects GPUs with Hwloc and suggests that users install the appropriate packages in their global environment (which it can automate if they approve) - if users decline, then it stops asking. Then it can also conditionally load those GPU packages if the user previously requested to install them. You can do this on the first run of an application, and tell users they need to restart the application to pick up the changes - although with package extensions things may work within the same session.
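A minimal sketch of that prompt-and-install flow (using plain Pkg, skipping the actual hardware probing, and assuming the probe already decided that Metal is the right backend):

```julia
import Pkg

# Ask once whether the matching GPU backend may be installed into the active environment.
function maybe_install_backend(pkg::AbstractString = "Metal")
    Base.find_package(pkg) === nothing || return true   # already installed
    print("Install $pkg to enable GPU acceleration? [y/N] ")
    if lowercase(strip(readline())) == "y"
        Pkg.add(pkg)   # allowed here because the user explicitly approved it
        @info "Please restart the application so $pkg can be picked up."
        return true
    end
    return false   # remember this refusal (e.g. via Preferences) so we stop asking
end
```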

There is some level of multi-GPU support in, e.g., Flux [GPU Support · Flux (sciml.ai); perhaps more generally in SciML?], I think. I have never used GPUs for my work, but I find the possibility interesting.

Integrated GPUs are also getting speedier, I understand. New PCs come with NPUs; these seem to support INT8 instead of INT32/INT64, and I don’t know how useful that would be in the Julia community.

Even though the new “Copilot+ PCs” with the Qualcomm X Plus/Elite have been criticized for poor GPU performance, in a recent comparison of a Dell XPS 13 with the X Elite vs. one with an Intel i7 14th-gen H processor, the X Elite was faster than the Intel-based machine on both single-core and multi-core benchmarks, while the Intel-based machine was 50+% faster in graphics performance.

But if the Intel GPU is deemed fast enough for scientific computations, the Qualcomm GPU at 60-67% of its performance may also be useful. Even if the upcoming Lunar Lake processor turns out to be faster still, I expect the current and future Qualcomm ARM processors (and the rumored NVIDIA ARM processor in 2025) may also be of interest.

In the Dell comparison, it was also interesting to see that the X Elite-based machine had more than double the battery life of the Intel-based version.