PTX JIT compilation failed with Flux and CUDAnative while using Juno

This one is a bit weird; I'm not sure whether it should be reported here or to Juno. Anyway, this code runs fine from the REPL in a terminal, but I get an error when running it in Juno once it starts using the GPU. The second part errors out.

using Flux
using CuArrays

m = Chain(flatten, Dense(784, 10))
x = rand(28, 28, 1, 7)   # renamed from `in` to avoid shadowing Base.in
m(x)

m_gpu = gpu(m)
x_gpu = gpu(x)
m_gpu(x_gpu)             # this is the call that fails

The error I am getting (this is the most I can get out of Juno) is:

CUDA error: a PTX JIT compilation failed (code 218, ERROR_INVALID_PTX)
ptxas application ptx input, line 488; error   : Call has wrong number of parameters
ptxas fatal   : Ptx assembly aborted due to errors
CUDAdrv.CuModule(::String, ::Dict{CUDAdrv.CUjit_option_enum,Any}) at module.jl:40
_cufunction(::GPUCompiler.FunctionSpec{GPUArrays.var"#20#21",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Broadcasted{CuArrays.CuArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at execution.jl:335
_cufunction at execution.jl:302 [inlined]
#77 at cache.jl:21 [inlined]
get!(::GPUCompiler.var"#77#78"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},typeof(CUDAnative._cufunction),GPUCompiler.FunctionSpec{GPUArrays.var"#20#21",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Broadcasted{CuArrays.CuArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}}}}, ::Dict{UInt64,Any}, ::UInt64) at dict.jl:452
macro expansion at lock.jl:183 [inlined]
check_cache(::typeof(CUDAnative._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#20#21",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Broadcasted{CuArrays.CuArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at cache.jl:19
(::GPUCompiler.var"#check_cache##kw")(::NamedTuple{(),Tuple{}}, ::typeof(GPUCompiler.check_cache), ::Function, ::GPUCompiler.FunctionSpec{GPUArrays.var"#20#21",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Broadcasted{CuArrays.CuArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}}}, ::UInt64) at cache.jl:11
+ at int.jl:53 [inlined]
hash_64_64 at hashing.jl:35 [inlined]
hash_uint64 at hashing.jl:62 [inlined]
hx at float.jl:568 [inlined]
hash at float.jl:571 [inlined]
cached_compilation(::typeof(CUDAnative._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#20#21",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Broadcasted{CuArrays.CuArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at cache.jl:0
cached_compilation(::Function, ::GPUCompiler.FunctionSpec{GPUArrays.var"#20#21",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Broadcasted{CuArrays.CuArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}}}, ::UInt64) at cache.jl:37
cufunction(::Function, ::Type; name::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at execution.jl:296
cufunction at execution.jl:291 [inlined]
macro expansion at execution.jl:108 [inlined]
gpu_call(::CuArrays.CuArrayBackend, ::Function, ::Tuple{CuArray{Float32,2,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.On...

(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
  [c52e3926] Atom v0.12.14
  [fbb218c0] BSON v0.2.6
  [336ed68f] CSV v0.6.1
  [c5f51814] CUDAdrv v6.3.0
  [be33ccc6] CUDAnative v3.1.0
  [5ae59095] Colors v0.11.2
  [34da2185] Compat v2.2.0
  [3a865a2d] CuArrays v2.2.1
  [717857b8] DSP v0.6.7
  [a93c6f00] DataFrames v0.20.2
  [7a1cc6ca] FFTW v1.2.2
  [5789e2e9] FileIO v1.3.0
  [587475ba] Flux v0.10.4
  [28b8d3ca] GR v0.48.0
  [c91e804a] Gadfly v1.2.1
  [7073ff75] IJulia v1.21.2
  [82e4d734] ImageIO v0.2.0
  [6218d12a] ImageMagick v1.1.5
  [916415d5] Images v0.22.2
  [682c06a0] JSON v0.21.0
  [b9914132] JSONTables v1.0.0
  [e5e0dc1b] Juno v0.8.2
  [b13ce0c6] LibSndFile v2.3.0
  [9c8b4983] LightXML v0.9.0
  [ca7b5df7] MFCC v0.3.1
  [cc2ba9b6] MLDataUtils v0.5.1
  [eb30cadb] MLDatasets v0.4.0
  [add582a8] MLJ v0.11.2
  [dbeba491] Metalhead v0.5.0
  [3b7a836e] PGFPlots v3.2.1
  [eadc2687] Pandas v1.4.0
  [d96e819e] Parameters v0.12.1
  [91a5bcdd] Plots v1.0.14
  [d330b81b] PyPlot v2.9.0
  [295af30f] Revise v2.7.2
  [bd7594eb] SampledSignals v2.1.0
  [4d633899] SignalOperators v0.4.0
  [b8865327] UnicodePlots v1.1.0
  [1986cc42] Unitful v0.17.0
  [e88e6eb3] Zygote v0.4.20
  [de0858da] Printf
  [9e88b42a] Serialization
julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

What does Juno have to do with this error? Please wrap your code with CUDAnative's @device_code_ptx and file an issue with the PTX assembly that fails to compile.
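For reference, a minimal sketch of how that wrapping might look, assuming the model and input from the snippet above (run it in a fresh session so the kernels are not already cached; by default the macro prints to stdout):

using Flux, CuArrays, CUDAnative

m_gpu = gpu(Chain(flatten, Dense(784, 10)))
x_gpu = gpu(rand(Float32, 28, 28, 1, 7))

# Prints the PTX of every kernel compiled during this call;
# copy that output into the issue.
@device_code_ptx m_gpu(x_gpu)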

Sorry, my description was off. The code runs fine in the REPL from a terminal but errors out when using Juno.

Here you go!