Best way to deal with broadcasting of intrinsics on CuArrays?

Let’s say we have a function that uses Base intrinsics like exp:

logistic(x) = 1 / (1 + exp(-x))

If we broadcast it over CuArrays, e.g.:

using CuArrays

A = cu(rand(5, 10))
logistic.(A)

we get the expected warning (and even crashes in more complicated cases):

┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, maybe you intended to call exp(x::Float32) in CUDAnative at /home/dfdx/.julia/packages/CUDAnative/Mdd3w/src/device/libdevice.jl:90 instead?
│    Stacktrace:
│     [1] exp at special/exp.jl:75
│     [2] #23 at /home/dfdx/.julia/packages/GPUArrays/t8tJB/src/broadcast.jl:49
└ @ CUDAnative ~/.julia/packages/CUDAnative/Mdd3w/src/compiler/irgen.jl:68
5×10 CuArray{Float32,2}:
...

I can, of course, rewrite the logistic function specifically for CuArrays:

logistic(x) = 1 / (1 + CUDAnative.exp(-x))

But then it’s not generic anymore.

Another option is to override broadcasted for each function, e.g. something like:

# the generic fallback can stay as it is; only the CuArray case needs an override
Base.Broadcast.broadcasted(::typeof(logistic), x::CuArray) = 1 ./ (1 .+ CUDAnative.exp.(.-x))

But it would be pretty hard to extend this to all such functions.
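At best I could mechanize those overrides with something like the loop below (purely hypothetical: the CUDA_RULES table and the commented-out softplus entry are made up), but every rule still has to be translated to CUDAnative by hand:

using CuArrays, CUDAnative

# hypothetical table mapping scalar functions to hand-written CuArray broadcast rules
const CUDA_RULES = (
    (logistic, x -> 1 ./ (1 .+ CUDAnative.exp.(.-x))),
    # (softplus, x -> CUDAnative.log.(1 .+ CUDAnative.exp.(x))),  # ...and so on for every function
)

# generate one broadcasted override per entry
for (f, rule) in CUDA_RULES
    @eval Base.Broadcast.broadcasted(::typeof($f), x::CuArray) = $rule(x)
end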

My current plan is to use Cassette.jl to rewrite all such intrinsic functions on the fly, but I may be missing some obviously simpler solution. Am I?
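For reference, here's a minimal sketch of the Cassette idea (the CuCtx context and the cufunc wrapper are names I made up, and I haven't verified yet that the overdubbed closure actually compiles into a GPU kernel):

using Cassette, CUDAnative

Cassette.@context CuCtx

# redirect selected Base intrinsics to their CUDAnative counterparts
Cassette.overdub(::CuCtx, ::typeof(exp), x::Union{Float32,Float64}) = CUDAnative.exp(x)
Cassette.overdub(::CuCtx, ::typeof(log), x::Union{Float32,Float64}) = CUDAnative.log(x)

# wrap a scalar function so that every call inside it goes through the context
cufunc(f) = (xs...) -> Cassette.overdub(CuCtx(), f, xs...)

# hopefully cufunc(logistic).(A) then hits CUDAnative.exp instead of Base.exp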

Interesting. What is the eltype of a cu array? The Flux.jl docs talk about fusing all of that into a kernel, but the kernel is written as

kernel(x) = exp.(x) .+ 1

And I assume kernel dispatches on CuArray. Is the right answer to define a GPUFloat32 type so that logistic can dispatch on it?

For GPUs it makes more sense to define operations (kernels) on arrays. So maybe that’s why? I agree it does feel less generic.

That works if you are the one defining the functions, but it doesn’t work when you use someone else’s function, like this:

function predict(W, x)
    return logistic.(W * x)
end

Yeah, I agree. I saw that they are now defining a GPUPtr type. The way I think this can be solved is to define a GPUFloat32 type that dispatches differently.

I’m not quite sure that would help: the GPU doesn’t work on a per-element basis; instead, the whole broadcasted call over CuArrays is compiled into a single kernel and applied to the underlying memory on the GPU. If I’m missing something and dispatch on the eltype really can replace dispatch on the broadcast, could you please give an example of that strategy?
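To illustrate what I mean by "compiled into a kernel", here is a rough mental model of what logistic.(A) turns into (a simplified sketch, not the actual GPUArrays implementation; broadcast_kernel and the launch parameters are made up):

using CUDAnative

# simplified stand-in for the kernel that broadcasting generates for logistic.(A)
function broadcast_kernel(f, dest, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dest)
        @inbounds dest[i] = f(src[i])  # f is the plain logistic, which calls Base.exp
    end
    return nothing
end

# conceptually: @cuda threads=256 blocks=cld(length(A), 256) broadcast_kernel(logistic, out, A)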