CuArrays not working as expected when broadcasting a function

using CuArrays

t = cu(rand(150_000))
w = cu(rand(150_000))

t .+ w

I define the following function

logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))

but broadcasting it for CuArrays doesn’t work, see

logloss.(t, w)
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, maybe you intended to call exp(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:99 instead?
│    Stacktrace:
│     [1] exp at special/exp.jl:75
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called log(x::Float32) in Base.Math at special/log.jl:290, maybe you intended to call log(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:71 instead?
│    Stacktrace:
│     [1] log at special/log.jl:290
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, maybe you intended to call exp(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:99 instead?
│    Stacktrace:
│     [1] exp at special/exp.jl:75
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called log(x::Float32) in Base.Math at special/log.jl:290, maybe you intended to call log(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:71 instead?
│    Stacktrace:
│     [1] log at special/log.jl:290
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116

But logloss_v2, which broadcasts inside the function instead of broadcasting the function itself, works:

logloss_v2(t, w) = -(t .* log.(1 ./ (1 .+ exp.(-w))) .+ (1 .- t) .* log.(1 .- 1 ./ (1 .+ exp.(-w))))

logloss_v2(t, w) #works

Here is my setup info:

(cu-test) pkg> st
    Status `C:\scratch\cu-test\Project.toml`
  [c5f51814] CUDAdrv v4.0.4
  [be33ccc6] CUDAnative v2.5.5
  [3a865a2d] CuArrays v1.4.7

Activating environment at `c:\scratch\cu-test\Project.toml`
Julia Version 1.3.0-rc5.1
Commit 36c4eb251e (2019-11-17 19:04 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\RTX2080\AppData\Local\atom\app-1.41.0\atom.exe"  -a
  JULIA_NUM_THREADS = 6
  JULIA_PKG_DEVDIR = c:/git/

Try the following

logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))
CuArrays.@cufunc logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))

Now it’s working. So I need to use the @cufunc macro now?

Functions like log/sin only work when broadcast directly over a CuArray, since the type information is available then. If you loop over the vector, which is what happens when you broadcast your outer function, you are calling the Base version of log/sin with a scalar that lives on the GPU. The macro rewrites the function to use the CUDAnative versions of log/sin so that it works and is performant.

This is a little odd to me. Of course, I don’t understand the technical reasons, but this makes Julia less generic in a way. I have to sprinkle CuArrays.@cufunc everywhere now, right?

No, you only need to place it where appropriate. I think it’s quite remarkable that all you need to do is add @cufunc to your Julia code to have it operate on a GPU. A GPU fundamentally operates by doing similar operations over large arrays. You could restrict your function logloss to only accept arrays and do the broadcasting internally, sacrificing some potential performance from fusing. However, if you do intend to use your code for GPU computations, @cufunc is probably what’s recommended at the moment.
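
To make the fusing trade-off concrete, here is a minimal CPU-only sketch using plain Arrays, so it runs without a GPU. The name logloss_arrays is made up for this example, and the scaling by 2 just stands in for whatever you would do with the result afterwards.

```julia
# Scalar definition from the thread; callers broadcast it with a dot.
logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))

# Array-only variant (hypothetical name): the broadcasting happens inside.
logloss_arrays(t::AbstractArray, w::AbstractArray) =
    -(t .* log.(1 ./ (1 .+ exp.(-w))) .+ (1 .- t) .* log.(1 .- 1 ./ (1 .+ exp.(-w))))

t = rand(Float32, 1_000)
w = rand(Float32, 1_000)

# One fused loop: the scaling by 2 joins the same broadcast as logloss.
a = 2 .* logloss.(t, w)

# Separate passes: logloss_arrays returns a finished array before the scaling runs.
b = 2 .* logloss_arrays(t, w)

a ≈ b  # both approaches compute the same values
```

The results match either way. The difference is only that the dotted scalar version can fuse with the surrounding dotted operations, while the array-only version always allocates its own result first.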

You can also write your code using GPUifyLoops, which is supposed to make it run on both CPU and GPU.
@cufunc essentially writes the CUDA kernel for you, so that you do not have to write CUDA C yourself.

I don’t quite understand this bit

broadcasted(sin, a) can be overloaded for various types of a to use sin on the GPU.
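
Here is a rough CPU-only sketch of that overloading idea. MyVec and device_sin are hypothetical names invented for this example and are not part of CuArrays; device_sin just stands in for a device-specific sin such as the CUDAnative one. This is not how CuArrays implements it internally, only an illustration of the dispatch mechanism.

```julia
# Hypothetical wrapper type standing in for a GPU array.
struct MyVec
    data::Vector{Float32}
end

# Hypothetical stand-in for a device-specific sin.
device_sin(x::Float32) = sin(x)

# Count how often the overload is hit, just to show that it is used.
const overload_calls = Ref(0)

# `sin.(a)` lowers to `materialize(broadcasted(sin, a))`, so this method
# intercepts the broadcast for MyVec and swaps in device_sin.
function Base.Broadcast.broadcasted(::typeof(sin), a::MyVec)
    overload_calls[] += 1
    return MyVec(map(device_sin, a.data))
end

v = MyVec(Float32[0.0, 1.0])
r = sin.(v)  # dispatches to the overload above and returns a MyVec
```

The overload only fires when sin itself is the function being broadcast over the array. When you broadcast an outer function like logloss, the log and exp calls inside it are ordinary scalar calls, so no such overload is involved.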
