I am trying to write some deconvolution code in Julia by combining LBFGS optimization (Optim.jl), automatic differentiation (AutoGrad.jl) and CUDA support (CuArray). I am using Julia 1.0 on Windows 10.
I finally have some runnable code, but there seem to be major problems with CuArray and Optim.jl:

If I accidentally provide a non-CUDA array as one of the arguments, I get an LLVM crash:
…
optimize! at C:\Users\pi96doc.julia\packages\CUDAnative\opsly\src\compiler.jl:606
… 
If I provide only CuArray arrays, there is no crash, but everything run through the LBFGS optimization routine is extremely slow (at least 10x slower than without CUDA). There are also some minor disagreements in the results, e.g. the number of iterations is different. However, what really puzzles me is that from that moment on, Julia as a whole becomes incredibly slow. This is only remedied by restarting Julia after exit().

So far I have not been able to provide AutoGrad with a correct derivative for the fft or rfft routine. I thought it would simply be
@primitive rfft(x1),dy,y (irfft(dy,size(x1)))
@primitive irfft(x1,sz),dy,y (rfft(dy))
but this did not seem to work. Providing the gradient for the whole real-to-real convolution operation, on the other hand, was no problem. Are there similar ways to provide such operations for JuliaDiff? This would be nice, as it seems to be more native to Optim.jl than AutoGrad.jl.
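One possible reason the pair above fails (an assumption on my side, not something I have confirmed against AutoGrad's conventions for complex gradients) is that irfft is not the adjoint of rfft: irfft divides by n and counts the interior frequencies twice. A minimal 1-D sketch with FFTW.jl on the CPU that checks this with a dot-product test is below; `rfft_adjoint` is a hypothetical helper name, not part of any package.

```julia
using FFTW, LinearAlgebra

# Hypothetical helper: adjoint of the 1-D rfft with respect to the real
# inner product real(dot(a, b)); n is the length of the original real signal.
function rfft_adjoint(v::AbstractVector{<:Complex}, n::Integer)
    w = fill(0.5, length(v))   # irfft counts interior frequencies twice
    w[1] = 1.0                 # the DC term is counted only once
    if iseven(n)
        w[end] = 1.0           # for even n the Nyquist term is counted only once
    end
    z = w .* v
    z[1] = real(z[1])          # only the real part of DC contributes
    if iseven(n)
        z[end] = real(z[end])  # same for the Nyquist term
    end
    return n .* irfft(z, n)    # undo the 1/n normalization of irfft
end

n = 8
x = randn(n)                          # real test signal
v = randn(ComplexF64, n ÷ 2 + 1)      # arbitrary test "gradient" in frequency space

lhs   = real(dot(rfft(x), v))         # <rfft(x), v>
rhs   = dot(x, rfft_adjoint(v, n))    # <x, adjoint(v)>, should match lhs
naive = dot(x, irfft(v, n))           # plain irfft guess, generally does not match

@show lhs rhs naive
```

If the weighted version passes this test for your array sizes, the same weighting could be tried inside the @primitive definition; the multi-dimensional case would apply the weights along the first transformed dimension only.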
A minor point: CuArray only seems to support fft and rfft for up to 3 dimensions and complains even if the trailing dimensions are singleton.
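A workaround I could imagine for the singleton case is to reshape to at most 3 dimensions before the transform and reshape back afterwards, since a DFT over a length-1 dimension is the identity. The sketch below only demonstrates the idea with FFTW.jl on a CPU array; `fft_trailing_singletons` is a hypothetical helper name, and I am assuming (untested) that the same reshape calls behave the same way on a CuArray.

```julia
using FFTW

# Hypothetical helper: FFT of an array whose dimensions beyond the third
# are all singleton, computed on a reshaped array with at most 3 dimensions.
function fft_trailing_singletons(x::AbstractArray)
    nd = min(ndims(x), 3)
    trailing = size(x)[nd+1:end]
    all(s -> s == 1, trailing) ||
        throw(ArgumentError("dimensions beyond the third must be singleton"))
    x3 = reshape(x, size(x)[1:nd])     # drop the trailing singleton dimensions
    return reshape(fft(x3), size(x))   # transform, then restore the original shape
end

x = randn(4, 3, 2, 1, 1)
y = fft_trailing_singletons(x)
```

On the CPU the result should agree with calling fft on the 5-dimensional array directly, which makes it easy to check before moving to the GPU.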
Any ideas on how to track down where the problems with CuArray in Optim's LBFGS method come from?