CuArray and Optim

I was giving it another try, preceding the optimization with the following:

    using LinearAlgebra, Optim, CuArrays
    CuArrays.allowscalar(false);
    LinearAlgebra.norm1(x::CuArray{T,N}) where {T,N} = sum(abs, x); # specializes the one-norm
    LinearAlgebra.normInf(x::CuArray{T,N}) where {T,N} = maximum(abs, x); # specializes the infinity-norm
    Optim.maxdiff(x::CuArray{T,N}, y::CuArray{T,N}) where {T,N} = maximum(abs.(x - y));
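As a quick sanity check (just a sketch, assuming a CUDA-capable GPU; `x` and `y` are throwaway test arrays, not from the actual problem), the specializations can be exercised directly to confirm they dispatch without triggering scalar indexing:

```julia
using LinearAlgebra, Optim, CuArrays
CuArrays.allowscalar(false)

x = cu(randn(Float32, 8, 8))  # small throwaway arrays on the GPU
y = cu(randn(Float32, 8, 8))

LinearAlgebra.norm1(x)    # sum(abs, x), a reduction, no scalar getindex
LinearAlgebra.normInf(x)  # maximum(abs, x)
Optim.maxdiff(x, y)       # maximum(abs.(x - y))
```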

This allowed the line

    result2 = optimize(fg, cimg, GradientDescent(), myOptions);

to run fine (not super fast, but that is probably due to the relatively small arrays).
There is, however, an output problem:

julia> result2
Results of Optimization Algorithm
 * Algorithm: Gradient Descent
Error showing value of type Optim.MultivariateOptimizationResults{GradientDescent{LineSearches.InitialPrevious{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Nothing,getfield(Optim, Symbol("##12#14"))},Float64,CuArray{Float32,3},Float32,Float32,Array{OptimizationState{Float32,GradientDescent{LineSearches.InitialPrevious{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Nothing,getfield(Optim, Symbol("##12#14"))}},1}}:
ERROR: scalar getindex is disabled
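One possible workaround (a sketch, not verified on this setup): avoid the default `show`, which apparently indexes the GPU minimizer element by element, and instead read the result out through Optim's accessors with an explicit device-to-host copy:

```julia
fmin = Optim.minimum(result2)           # scalar objective value, safe to print
xmin = Array(Optim.minimizer(result2))  # explicit copy of the CuArray to the CPU
```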

There were also frequent out-of-memory errors, even though the arrays were quite small (512x512x1):

ERROR: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory)
Stacktrace:
 [1] macro expansion at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\error.jl:56 [inlined]
 [2] macro expansion at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\error.jl:57 [inlined]
 [3] _mkplan(::UInt8, ::Tuple{Int64,Int64,Int64}, ::UnitRange{Int64}) at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\CUFFT.jl:109
 [4] plan_rfft(::CuArray{Float64,3}, ::UnitRange{Int64}) at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\CUFFT.jl:405

Finally, I (naively?) tried to convert the result back to the CPU:

julia> Float32.(result2.minimizer)
ERROR: GPU compilation failed, try inspecting generated code with any of the @device_code_... macros
CompilerError: could not compile #19(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}},Type{Float32},Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},Tuple{Bool,Bool,Bool},Tuple{Int64,Int64,Int64}}}}); passing and using non-bitstype argument
- argument_type = Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}},Type{Float32},Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},Tuple{Bool,Bool,Bool},Tuple{Int64,Int64,Int64}}}}
- argument = 4
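The failure seems to be the broadcast kernel receiving `Type{Float32}` as a (non-bitstype) argument. A possible workaround (again just a sketch): copy to the host first, then convert there, so no GPU kernel is compiled at all:

```julia
xmin_cpu = Array(result2.minimizer)  # device-to-host copy
Float32.(xmin_cpu)                   # elementwise conversion now runs on the CPU
```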

Any ideas?