Generalised CUDA/CPU

So this is the old question of using exponents and similar functions in code that runs on CUDA. I'm wondering what solutions people are using in late 2019.

There are answers like this one, but they depend on CuArrays.

How can we release packages that don’t depend on CuArrays but can still use these GPU methods?
Is there a reason there isn't a small base package that we could depend on that takes care of it? Or is this a task for Cassette?
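To make the question concrete, here is a minimal sketch of the kind of package code I mean. The function name `soft_exp!` is made up for illustration. It imports no GPU package and is generic over the array type, so it runs on a plain `Array`. Whether the same method works unchanged on a `CuArray` depends on the GPU package swapping `exp` and `sin` for device versions, which is exactly the problem being asked about.

```julia
# Hypothetical package function: no GPU dependency, generic over AbstractArray.
function soft_exp!(out::AbstractArray, x::AbstractArray)
    # One fused broadcast; on the CPU this calls Base.exp and Base.sin.
    out .= exp.(sin.(x))
    return out
end

x = [3.1, 4.2]
out = soft_exp!(similar(x), x)
```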


Maybe check GPUifyLoops.jl?

Ah, I didn't realise that was already using Cassette.jl underneath.

Are there any plans for this to work on broadcast?

Edit: also, are there any good examples of packages that build on GPUifyLoops?

The three primary users of GPUifyLoops that I am aware of are:

My long-term plan is to move most of the Cassette business to CUDAnative, but that work has stalled due to a lack of time on my side. The issue right now is that there are still a couple of constructs for which Cassette hinders inference, so we can't make it the default.

Looking forward to CUDAnative working with Cassette! Thanks for all the work getting it organised.

Unfortunately I found out that my main use cases for GPUs don't really gel with GPUifyLoops as well as they do with broadcast and CuArrays. Broadcast dot fusion composes nicely with inline recursive methods, but it's hard to replicate with loops.
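As a rough illustration of what I mean by broadcast composing with recursive methods, here is a CPU-only sketch. The names `applyall` and `rules` are made up for this post. The recursion walks a tuple of functions, so the compiler can unroll it, and the whole chain is applied per element inside a single broadcast.

```julia
# Apply each function in a tuple in turn, recursing on the tail of the tuple.
applyall(x, fs::Tuple) = applyall(first(fs)(x), Base.tail(fs))
applyall(x, ::Tuple{}) = x   # base case: no functions left

rules = (sin, exp, x -> 2x)
v = [3.1, 4.2]

# Ref keeps the tuple from being broadcast over; each element of v
# goes through the whole chain.
out = applyall.(v, Ref(rules))
```

Writing the same thing as an explicit loop kernel means threading the tuple of rules through the kernel by hand, which is the part I find awkward.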

I'm still waiting for the day when I won't have to care about the underlying hardware, and the language or framework just does what's best given the host system and the code I wrote. We have FPGAs, CPUs, GPUs and TPUs. Keeping all of them in mind during implementation takes a lot of effort, which is not really interesting to an AI/ML guy like me.

I think this is something that actually isn't too far off, with a bit of work. There already seems to be at least one package investigating this idea, and GPUifyLoops and GPUArrays both implement operations (loops vs. array ops) that can essentially be written once and executed on different devices without substantial changes.

In my mind, all that one would need to do to achieve efficient, automated offload of computations to whatever devices are available is the following:

  • A unified mechanism to query all available compute devices, their topologies, and then load the appropriate packages if available (for example, Hwloc.jl plus detection methods from CUDAdrv/CUDAapi)
  • A means to annotate or statically/dynamically analyze code for data access and compute patterns (probably the hardest part, but some solutions definitely exist in current literature)
  • A package which can tie the above together, and make the actual resource allocations and compute assignments, probably with some scheduling for longer-running, dynamic computations (also difficult to do well, but for simple problems, “obvious” solutions may exist)
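A toy sketch of how the three pieces might fit together, just to make the idea concrete. Everything here is hypothetical: `Device`, `Workload`, `available_devices`, `choose_device` and `execute` are names invented for this post, not an existing API, and the GPU detection is a stub flag where a real version would presumably query something like Hwloc.jl or CUDAdrv/CUDAapi.

```julia
# Hypothetical device types; not taken from any existing package.
abstract type Device end
struct CPUDevice <: Device end
struct GPUDevice <: Device end

# Piece 1: device query. The keyword is a stand-in for real hardware detection.
available_devices(; has_gpu::Bool=false) =
    has_gpu ? Device[CPUDevice(), GPUDevice()] : Device[CPUDevice()]

# Piece 2: a hand-written annotation of the compute pattern.
struct Workload
    n::Int               # number of elements
    flops_per_elem::Int  # rough arithmetic cost per element
end

# Piece 3: a trivial "scheduler": offload only when the work is large enough
# and a GPU is present. The threshold is an arbitrary illustrative number.
function choose_device(w::Workload, devices::Vector{Device}; threshold::Int=10^6)
    cost = w.n * w.flops_per_elem
    if cost >= threshold && any(d -> d isa GPUDevice, devices)
        return GPUDevice()
    end
    return CPUDevice()
end

# Dispatch on the chosen device. Only the CPU path is implemented here;
# a GPU method would move the data and launch the kernel.
execute(::CPUDevice, f, x) = f.(x)

small = Workload(10, 1)
big = Workload(10^7, 10)
dev_small = choose_device(small, available_devices(has_gpu=true))
dev_big = choose_device(big, available_devices(has_gpu=true))
dev_nogpu = choose_device(big, available_devices())
result = execute(dev_small, sin, [0.0, 1.0])
```

The real difficulty is in the second piece: here the workload is annotated by hand, whereas the interesting version would infer it from the code.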

GPUifyLoops contains the contextualize function that replaces exp by CUDAnative.exp etc. This works fine with broadcast. Do you have a MWE?


(I think that function is still undocumented…)

OK, I wasn't aware contextualize could work with a CuArrays broadcast. I don't really have the bandwidth to explore that myself. I really just want it to work out of the box with CuArrays!

julia> using CuArrays, GPUifyLoops

julia> f(x) = exp(sin(x));

julia> v = CuArray([3.1, 4.2]);

julia> f.(v)   # fails
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called sin(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:30, maybe you intended to call sin(x::Float64) in CUDAnative at /home/dpsanders/.julia/packages/CUDAnative/X4QWM/src/device/cuda/math.jl:12 instead?

julia> contextualize(f).(v)   # succeeds
2-element CuArray{Float64,1}: