Best way to deal with broadcasting of intrinsics on CuArrays?

Let’s say we have a function that uses Base intrinsics like exp:

logistic(x) = 1 / (1 + exp(-x))

If we broadcast it over CuArrays, e.g.:

using CuArrays

A = cu(rand(5, 10))
logistic.(A)

we get the expected warning (and even crashes in more complicated cases):

┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, maybe you intended to call exp(x::Float32) in CUDAnative at /home/dfdx/.julia/packages/CUDAnative/Mdd3w/src/device/libdevice.jl:90 instead?
│    Stacktrace:
│     [1] exp at special/exp.jl:75
│     [2] #23 at /home/dfdx/.julia/packages/GPUArrays/t8tJB/src/broadcast.jl:49
└ @ CUDAnative ~/.julia/packages/CUDAnative/Mdd3w/src/compiler/irgen.jl:68
5×10 CuArray{Float32,2}:
...

I can, of course, rewrite the logistic function specifically for CuArrays:

logistic(x) = 1 / (1 + CUDAnative.exp(-x))

But then it’s not generic anymore.

Another option is to override broadcasted for each function, e.g. something like:

# the generic fallback can stay as it is; only the CuArray case needs an override
Base.Broadcast.broadcasted(::typeof(logistic), x::CuArray) = 1 ./ (1 .+ CUDAnative.exp.(.-x))

But it would be pretty hard to extend this to all such functions.
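At best I could mechanize those overrides with something like the loop below (purely hypothetical: the CUDA_RULES table and the commented-out softplus entry are made up), but every rule still has to be translated to CUDAnative by hand:

using CuArrays, CUDAnative

# hypothetical table mapping scalar functions to hand-written CuArray broadcast rules
const CUDA_RULES = (
    (logistic, x -> 1 ./ (1 .+ CUDAnative.exp.(.-x))),
    # (softplus, x -> CUDAnative.log.(1 .+ CUDAnative.exp.(x))),  # ...and so on for every function
)

# generate one broadcasted override per entry
for (f, rule) in CUDA_RULES
    @eval Base.Broadcast.broadcasted(::typeof($f), x::CuArray) = $rule(x)
end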

My current plan is to use Cassette.jl to rewrite all such intrinsic functions on the fly, but I may be missing some obviously simpler solution. Am I?
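For reference, here's a minimal sketch of the Cassette idea (the CuCtx context and the cufunc wrapper are names I made up, and I haven't verified yet that the overdubbed closure actually compiles into a GPU kernel):

using Cassette, CUDAnative

Cassette.@context CuCtx

# redirect selected Base intrinsics to their CUDAnative counterparts
Cassette.overdub(::CuCtx, ::typeof(exp), x::Union{Float32,Float64}) = CUDAnative.exp(x)
Cassette.overdub(::CuCtx, ::typeof(log), x::Union{Float32,Float64}) = CUDAnative.log(x)

# wrap a scalar function so that every call inside it goes through the context
cufunc(f) = (xs...) -> Cassette.overdub(CuCtx(), f, xs...)

# hopefully cufunc(logistic).(A) then hits CUDAnative.exp instead of Base.exp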

Interesting. What is the eltype of a cu array? The Flux.jl docs talk about fusing all of that into a kernel, but the kernel is written as

kernel(x) = exp.(x) .+ 1

And I assume kernel dispatches on CuArray. Is the right answer to define a GPUFloat32 type so that logistic can dispatch on it?

For GPUs it makes more sense to define operations (kernels) on arrays. So maybe that’s why? I agree it does feel less generic.

That works if you are the one defining the functions, but it doesn’t work when you use someone else’s function, like this:

function predict(W, x)
    return logistic.(W * x)
end

Yeah, I agree. I saw that they are now defining a GPUPtr type. The way I think this can be solved is to define a GPUFloat32 type that dispatches differently.

I’m not quite sure that would help: the GPU doesn’t work on a per-element basis; instead, the whole broadcasted call over CuArrays is compiled into a single kernel and applied to the underlying memory on the GPU. If I’m missing something and dispatch on the eltype really can replace dispatch on the broadcast, could you please give an example of that strategy?
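To illustrate what I mean by "compiled into a kernel", here is a rough mental model of what logistic.(A) turns into (a simplified sketch, not the actual GPUArrays implementation; broadcast_kernel and the launch parameters are made up):

using CUDAnative

# simplified stand-in for the kernel that broadcasting generates for logistic.(A)
function broadcast_kernel(f, dest, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dest)
        @inbounds dest[i] = f(src[i])  # f is the plain logistic, which calls Base.exp
    end
    return nothing
end

# conceptually: @cuda threads=256 blocks=cld(length(A), 256) broadcast_kernel(logistic, out, A)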