CuArrays not working as expected when broadcasting a function

xiaodai · November 22, 2019, 10:45am

using CuArrays

t = cu(rand(150_000))
w = cu(rand(150_000))

t .+ w

I define the following function

logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))

but broadcasting it over CuArrays doesn’t work:

logloss.(t, w)

┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, maybe you intended to call exp(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:99 instead?
│    Stacktrace:
│     [1] exp at special/exp.jl:75
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called log(x::Float32) in Base.Math at special/log.jl:290, maybe you intended to call log(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:71 instead?
│    Stacktrace:
│     [1] log at special/log.jl:290
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, maybe you intended to call exp(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:99 instead?
│    Stacktrace:
│     [1] exp at special/exp.jl:75
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called log(x::Float32) in Base.Math at special/log.jl:290, maybe you intended to call log(x::Float32) in CUDAnative at C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\device\cuda\math.jl:71 instead?
│    Stacktrace:
│     [1] log at special/log.jl:290
│     [2] logloss at C:\scratch\cu-test\ok.jl:11
│     [3] #25 at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:49
└ @ CUDAnative C:\Users\RTX2080\.julia\packages\CUDAnative\2WQzk\src\compiler\irgen.jl:116

But v2, which broadcasts inside the function instead of broadcasting the function itself, works:

logloss_v2(t, w) = -(t .* log.(1 ./ (1 .+ exp.(-w))) .+ (1 .- t) .* log.(1 .- 1 ./ (1 .+ exp.(-w))))

logloss_v2(t, w) #works
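
As a sanity check that the two formulations compute the same values, independent of the GPU issue, here is a minimal CPU-only sketch using plain Arrays. The Float32 element type is an assumption chosen to mirror what cu produces.

```julia
# CPU-only check with plain Arrays; no GPU packages involved.
logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))
logloss_v2(t, w) = -(t .* log.(1 ./ (1 .+ exp.(-w))) .+ (1 .- t) .* log.(1 .- 1 ./ (1 .+ exp.(-w))))

t = rand(Float32, 1_000)
w = rand(Float32, 1_000)

a = logloss.(t, w)     # broadcast the scalar function over the arrays
b = logloss_v2(t, w)   # broadcast inside the function

@assert a ≈ b
```

So the difference on the GPU is not in the math, only in which version of exp and log ends up being called.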

Here is my setup info:

(cu-test) pkg> st
    Status `C:\scratch\cu-test\Project.toml`
  [c5f51814] CUDAdrv v4.0.4
  [be33ccc6] CUDAnative v2.5.5
  [3a865a2d] CuArrays v1.4.7

Activating environment at `c:\scratch\cu-test\Project.toml`
Julia Version 1.3.0-rc5.1
Commit 36c4eb251e (2019-11-17 19:04 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\RTX2080\AppData\Local\atom\app-1.41.0\atom.exe"  -a
  JULIA_NUM_THREADS = 6
  JULIA_PKG_DEVDIR = c:/git/

baggepinnen · November 22, 2019, 11:05am

Try the following

logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))
CuArrays.@cufunc logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))

xiaodai · November 22, 2019, 11:08am

Now it’s working. So I need to use the @cufunc macro now?

baggepinnen · November 22, 2019, 11:20am

Functions like log/sin only work when broadcast directly over a CuArray, since the type information is available then. If you loop over the vector, as happens when you broadcast your outer function, you are calling the Base version of log/sin with a scalar that lives on the GPU. The macro rewrites the function to use the CUDAnative versions of log/sin, so that it works and is performant.
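
As a rough illustration of that rewrite, the same formula can be written by hand with the CUDAnative functions that the warnings point to. This is only a sketch: it assumes the package versions from this thread and a CUDA-capable GPU, logloss_dev is a hypothetical name, and the code the macro actually generates may differ.

```julia
using CuArrays, CUDAnative

# Same formula as logloss, but calling the CUDAnative device functions
# named in the warnings instead of Base.exp and Base.log.
# logloss_dev is a hypothetical name used only for this sketch.
logloss_dev(t, w) = -(t * CUDAnative.log(1 / (1 + CUDAnative.exp(-w))) +
                      (1 - t) * CUDAnative.log(1 - 1 / (1 + CUDAnative.exp(-w))))

t = cu(rand(150_000))
w = cu(rand(150_000))

# Broadcasting the scalar function now only hits GPU-compatible calls.
loss = logloss_dev.(t, w)
```

Unlike the original logloss, this hand-written variant is only meant to be called inside GPU code, which is the bookkeeping the macro takes off your hands.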

xiaodai · November 22, 2019, 11:22am

This is a little odd to me. Of course, I don’t understand the technical reasons, but this makes Julia less generic in a way. I have to sprinkle CuArrays.@cufunc everywhere now, right?

baggepinnen · November 22, 2019, 11:28am

No, you need to place it where appropriate. I think it’s quite remarkable that all you need to do is add @cufunc to your Julia code to have it operate on a GPU. A GPU fundamentally operates by doing similar operations over large arrays. You could restrict your function logloss to only accept arrays and do the broadcasting internally, sacrificing some potential performance from fusing. However, if you do intend to use your code for GPU computations, @cufunc is probably what’s recommended at the moment.
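
To make the fusing trade-off concrete, here is a CPU-only sketch with plain Arrays. logloss_arr is a hypothetical array-only variant, and the scaling by 2 is just an arbitrary surrounding operation chosen for illustration.

```julia
# Scalar definition, as in the original post.
logloss(t, w) = -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))

# Hypothetical array-only variant: @. adds the dots inside the function.
function logloss_arr(t::AbstractArray, w::AbstractArray)
    return @. -(t * log(1 / (1 + exp(-w))) + (1 - t) * log(1 - 1 / (1 + exp(-w))))
end

t = rand(1_000)
w = rand(1_000)

# One fused loop: the scaling and the loss are computed in a single pass.
fused = 2 .* logloss.(t, w)

# Two passes: logloss_arr first materializes its result, then it is scaled.
unfused = 2 .* logloss_arr(t, w)

@assert fused ≈ unfused
```

Both give the same values; the array-only version simply cannot fuse with the dot operations around the call, so it allocates an intermediate array.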

You can also write stuff using GPUifyLoops, which is supposed to make stuff run on both CPU and GPU.
@cufunc essentially writes the CUDA kernel for you, so that you do not have to write CUDA C yourself.

xiaodai · December 1, 2019, 9:58am

I don’t quite understand this bit

baggepinnen · December 1, 2019, 11:02am

broadcasted(sin, a) can be overloaded for various types of a to use sin on the GPU.
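
A toy, CPU-only sketch of that mechanism follows. DeviceVec and device_sin are hypothetical stand-ins for a GPU array type and a device-specific sin, and this is not how CuArrays itself is implemented; it only shows that a dot call can be intercepted per array type.

```julia
import Base.Broadcast: broadcasted

# Hypothetical wrapper type standing in for a GPU array type.
struct DeviceVec
    data::Vector{Float32}
end

# Hypothetical stand-in for a device-specific sin.
device_sin(x::Float32) = sin(x)

# `sin.(a)` lowers to `materialize(broadcasted(sin, a))`, so this method
# intercepts the dot call for DeviceVec and swaps in device_sin.
broadcasted(::typeof(sin), a::DeviceVec) = DeviceVec(map(device_sin, a.data))

a = DeviceVec(Float32[0.0, 0.5, 1.0])
b = sin.(a)   # dispatches to the method above and returns a DeviceVec
```

When sin is called inside your own scalar function, the broadcast machinery only sees the outer function, so an overload like this never gets a chance to apply.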
