GPU gradient issues with matrix transpose

Hello,

I recently stumbled upon the following issue:

using Flux
m = Chain(
	Dense(100, 10, relu),
	Dense(10, 10)) |> gpu

s = (X = gpu(rand(100)), k = 0.99, y = gpu(rand(10)))
# Works fine
gs = gradient(() -> log(s.k * sum(s.y .* m(s.X))), Flux.params(m))
# Does not work
gs = gradient(() -> log(s.k * s.y'm(s.X)), Flux.params(m))

I can’t work out why the second expression, which uses the transpose, does not work. If I set CUDA.allowscalar(false), I get the following error:

ERROR: scalar getindex is disallowed
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] assertscalar(::String) at /home/jch/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:41
 [3] getindex(::CuArray{Float32,1}, ::Int64) at /home/jch/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:96
 [4] _broadcast_getindex at ./broadcast.jl:614 [inlined]
 [5] _getindex at ./broadcast.jl:644 [inlined]
 [6] _broadcast_getindex at ./broadcast.jl:620 [inlined]
 [7] getindex at ./broadcast.jl:575 [inlined]
 [8] copy at ./broadcast.jl:876 [inlined]
 [9] materialize at ./broadcast.jl:837 [inlined]
 [10] (::Zygote.var"#1110#1111"{CuArray{Float32,1}})(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/lib/nnlib.jl:7
 [11] #3914#back at /home/jch/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59 [inlined]
 [12] Dense at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:123 [inlined]
 [13] (::typeof(∂(invoke)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [14] Dense at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:134 [inlined]
 [15] (::typeof(∂(λ)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [16] applychain at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:36 [inlined]
 [17] (::typeof(∂(applychain)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [18] Chain at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:38 [inlined]
 [19] (::typeof(∂(λ)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [20] #29 at ./REPL[138]:1 [inlined]
 [21] (::typeof(∂(#29)))(::Float64) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [22] (::Zygote.var"#54#55"{Zygote.Params,Zygote.Context,typeof(∂(#29))})(::Float64) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface.jl:172
 [23] gradient(::Function, ::Zygote.Params) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface.jl:49
 [24] top-level scope at REPL[138]:1

If I instead set CUDA.allowscalar(true), I get this warning and error:

┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:43
ERROR: CuArray only supports bits types
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] CuArray{AbstractFloat,1}(::UndefInitializer, ::Tuple{Int64}) at /home/jch/.julia/packages/CUDA/dZvbp/src/array.jl:115
 [3] similar(::CuArray{Float32,1}, ::Type{AbstractFloat}, ::Tuple{Int64}) at /home/jch/.julia/packages/CUDA/dZvbp/src/array.jl:147
 [4] similar(::CuArray{Float32,1}, ::Type{AbstractFloat}) at ./abstractarray.jl:629
 [5] copyto_nonleaf!(::CuArray{Float32,1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(Zygote.drelu),Tuple{Base.Broadcast.Extruded{CuArray{Float32,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuArray{Float64,1},Tuple{Bool},Tuple{Int64}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:1032
 [6] copy at ./broadcast.jl:880 [inlined]
 [7] materialize at ./broadcast.jl:837 [inlined]
 [8] (::Zygote.var"#1110#1111"{CuArray{Float32,1}})(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/lib/nnlib.jl:7
 [9] #3914#back at /home/jch/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59 [inlined]
 [10] Dense at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:123 [inlined]
 [11] (::typeof(∂(invoke)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [12] Dense at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:134 [inlined]
 [13] (::typeof(∂(λ)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [14] applychain at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:36 [inlined]
 [15] (::typeof(∂(applychain)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [16] Chain at /home/jch/.julia/packages/Flux/05b38/src/layers/basic.jl:38 [inlined]
 [17] (::typeof(∂(λ)))(::CuArray{Float64,1}) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [18] #31 at ./REPL[140]:1 [inlined]
 [19] (::typeof(∂(#31)))(::Float64) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface2.jl:0
 [20] (::Zygote.var"#54#55"{Zygote.Params,Zygote.Context,typeof(∂(#31))})(::Float64) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface.jl:172
 [21] gradient(::Function, ::Zygote.Params) at /home/jch/.julia/packages/Zygote/ggM8Z/src/compiler/interface.jl:49
 [22] top-level scope at REPL[140]:1

So the second gradient formulation fails with either setting. Is there a reason why the version with the transpose doesn’t work? Thanks!