Generalised CUDA/CPU

Raf · October 8, 2019, 4:12am

So this is the old question of using exponents and similar functions in code that runs on CUDA. I’m wondering what solutions people are using in late 2019.

There are answers like this: https://github.com/JuliaGPU/CuArrays.jl/issues/283 but they depend on CuArrays.

How can we release packages that don’t depend on CuArrays but can still use these GPU methods?
Is there a reason there isn’t a small base package that we could depend on that takes care of it? Or is this a task for Cassette?

Roger-luo · October 8, 2019, 5:19am

Maybe check GPUifyLoops.jl?

Raf · October 8, 2019, 5:43am

Ah, I didn’t realise that was already using Cassette.jl underneath.

Are there any plans for this to work on broadcast?

Edit: also are there some good examples of packages that build on GPUifyLoops?

vchuravy · October 8, 2019, 1:59pm

The three primary users of GPUifyLoops that I am aware of are:

My long-term plan is to move most of the Cassette business to CUDAnative (see https://github.com/JuliaGPU/CUDAnative.jl/pull/334), but that work has stalled due to a lack of time on my side. The issue right now is that there are still a couple of constructs for which Cassette hinders inference, and as such we can’t make it the default.

Raf · October 8, 2019, 2:15pm

Looking forward to CUDAnative working with Cassette! Thanks for all the work getting it organised.

Unfortunately I found out that my main use cases for GPUs don’t really gel with GPUifyLoops as well as they do with broadcast and CuArrays. Broadcast dot fusion composes nicely with inline recursive methods, but it’s hard to replicate with loops.
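A minimal CPU-only sketch of the pattern this presumably refers to: a recursive, inlinable method applied inside a fused dot call. The names `applyall` and `rules` are hypothetical and only for illustration, and a plain `Vector` stands in for a GPU array.

```julia
# Recurse over a tuple of functions. Each step is small enough to inline,
# so the chain can be unrolled for a tuple of fixed length.
@inline applyall(x, ::Tuple{}) = x
@inline applyall(x, fs::Tuple) = applyall(first(fs)(x), Base.tail(fs))

rules = (sin, x -> 2x, exp)   # hypothetical per-element rules
v = [0.1, 0.2, 0.3]           # plain CPU array for this sketch

# Ref(rules) makes broadcast treat the tuple as a scalar, and the trailing
# `.+ 1` fuses with the recursive call into a single broadcast.
out = applyall.(v, Ref(rules)) .+ 1
```

Writing the same thing as an explicit loop means threading `rules` and the index through by hand, which is the part that is awkward to replicate.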

DoktorMike · October 8, 2019, 3:50pm

I’m still waiting for the day when I won’t have to care about the underlying hardware and the language or framework just does what’s best given the host system and the code I wrote. We have FPGAs, CPUs, GPUs and TPUs; keeping all of them in mind during implementation takes a lot of effort, which is not really interesting to an AI/ML guy like me.

jpsamaroo · October 9, 2019, 4:54pm

I think this is something that actually isn’t too far off, with a bit of work. There already seems to be at least one package investigating this idea (https://github.com/JuliaDiffEq/AutoOffload.jl), and GPUifyLoops and GPUArrays both implement operations (loops vs. array ops) that can be essentially written once, and executed on different devices without substantial changes.

In my mind, all that one would need to do to achieve efficient, automated offload of computations to whatever devices are available is the following:

1. A unified mechanism to query all available compute devices, their topologies, and then load the appropriate packages if available (for example, Hwloc.jl plus detection methods from CUDAdrv/CUDAapi)
2. A means to annotate or statically/dynamically analyze code for data access and compute patterns (probably the hardest part, but some solutions definitely exist in current literature)
3. A package which can tie the above together, and make the actual resource allocations and compute assignments, probably with some scheduling for longer-running, dynamic computations (also difficult to do well, but for simple problems, “obvious” solutions may exist)
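The three pieces above can be sketched in a few lines, with heavy simplification. Everything here is hypothetical: `Backend`, `available_backends` and `choose_backend` are not taken from any existing package, the availability probes are stubs, and the "analysis" step is reduced to a single problem-size threshold.

```julia
# Hypothetical description of a compute backend.
struct Backend
    name::Symbol
    available::Function   # () -> Bool, stands in for a real driver/device probe
    min_length::Int       # smallest problem size assumed worth offloading
end

# Step 1: query which backends are usable on this host.
available_backends(backends) = filter(b -> b.available(), backends)

# Steps 2 and 3, reduced to a naive rule: among usable backends whose size
# threshold is met, pick the one with the highest threshold.
function choose_backend(backends, n::Integer)
    candidates = [b for b in available_backends(backends) if n >= b.min_length]
    isempty(candidates) && error("no backend available")
    return candidates[argmax([b.min_length for b in candidates])]
end

backends = [
    Backend(:cpu, () -> true, 0),
    Backend(:gpu, () -> false, 10_000),   # pretend no GPU was detected
]

chosen = choose_backend(backends, 1_000_000)
```

With the GPU probe returning `false`, `chosen.name` is `:cpu`. A real scheduler would also have to account for data transfer costs and device topology, which this sketch ignores.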

dpsanders · October 9, 2019, 5:00pm

GPUifyLoops contains the contextualize function that replaces exp by CUDAnative.exp etc. This works fine with broadcast. Do you have a MWE?

dpsanders · October 9, 2019, 5:01pm

(I think that function is still undocumented…)

Raf · October 10, 2019, 1:14am

OK, I wasn’t aware contextualize could work with a CuArrays broadcast. I don’t really have the bandwidth to explore that myself; I really just want it to work out of the box with CuArrays!

dpsanders · October 10, 2019, 2:53pm

julia> using CuArrays, GPUifyLoops

julia> f(x) = exp(sin(x));

julia> v = CuArray([3.1, 4.2]);

julia> f.(v)   # fails
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called sin(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:30, maybe you intended to call sin(x::Float64) in CUDAnative at /home/dpsanders/.julia/packages/CUDAnative/X4QWM/src/device/cuda/math.jl:12 instead?

julia> contextualize(f).(v)   # succeeds
2-element CuArray{Float64,1}:
 1.0424572455982593
 0.4182918968212489
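
To illustrate the general idea of swapping `exp` and `sin` for device versions, here is a CPU-only sketch that rewrites calls in an expression. This is not how GPUifyLoops does it (it works through Cassette on the compiled code); `device_sin`, `device_exp`, `rewrite` and `@device` are hypothetical names that stand in for `CUDAnative.sin`, `CUDAnative.exp` and the substitution step.

```julia
# Hypothetical stand-ins for device math functions.
device_sin(x) = sin(x)
device_exp(x) = exp(x)

const REPLACEMENTS = Dict(:sin => :device_sin, :exp => :device_exp)

# Leave anything that is not an expression (symbols, literals, line nodes) alone.
rewrite(x) = x

# Walk an expression and swap the callee of every call found in the table.
function rewrite(ex::Expr)
    args = Any[rewrite(a) for a in ex.args]
    if ex.head == :call && args[1] isa Symbol && haskey(REPLACEMENTS, args[1])
        args[1] = REPLACEMENTS[args[1]]
    end
    return Expr(ex.head, args...)
end

# Apply the rewrite to a definition at macro-expansion time.
macro device(ex)
    esc(rewrite(ex))
end

@device g(x) = exp(sin(x))   # defined as device_exp(device_sin(x))

w = g.([3.1, 4.2])           # broadcasts over a plain CPU array here
```

A syntactic rewrite like this only sees calls written directly in the annotated code, which is one reason a compiler-level approach is attractive for this problem.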
