# CuArray and Optim

#1

I am trying to write some deconvolution code in Julia via combining LBFGS Optimization (Optim.jl), automatic differentiation (AutoGrad.jl) and Cuda support (CuArray). Julia 1.0 on Windows 10.
I finally have some runnable code, but there seem to be major problems with CuArray and Optim.jl:

1. If I accidentally provide a non-Cuda array as one of the arguments, I get an LLVM crash:

optimize! at C:\Users\pi96doc.julia\packages\CUDAnative\opsly\src\compiler.jl:606

2. If I provide only CuArray arrays, there is no crash, but via the LBFGS optimization routine everything runs extremely slowly (at least 10x slower than without Cuda). There are also some minor disagreements in the results. E.g. the number of iterations is different. However, what really puzzles me, is that from that moment on, Jullia as a whole becomes incredibly slow. This is only remedied if Julia is restarted using exit().

3. I was so far not able to provide AutoGrad with a correct differentiation for the fft or rft routine. I thought it would simply be
@primitive rfft(x1),dy,y (irfft(dy,size(x1)))
@primitive irfft(x1,sz),dy,y (rfft(dy))
but this did not seem to work. Anyway providing the for the whole real->real convolution operation was no problem. Are there similar ways to provide such operations for JuliaDiff? This would be nice as this seems to be more native to Optim.jl than AutoGrad.jl.

Minor things: CuArray only seems to support fft and rfft up to 3 dimensions and complains, even if the trailing dimensions are singleton.

Any ideas about how to find out where the problems wrt. CuArray in the LBFGS method in Optim could appear?

#2
1. Fixed on CuArrays/CUDAnative master
2. Try running with allowscalar=false, copying from and to the GPU is very slow

#3

Hmm. This does not work. The following happens:

result2=optimize(fg,cimg, GradientDescent(),myOptions)

ERROR: scalar getindex is disabled

Stacktrace:

[1] assertscalar at C:\Users\pi96doc\.julia\packages\GPUArrays\3E1qk\src\indexing.jl:6 [inlined]

[2] getindex at C:\Users\pi96doc\.julia\packages\GPUArrays\3E1qk\src\indexing.jl:29 [inlined]

[3] iterate at .\abstractarray.jl:838 [inlined]

[4] iterate at .\abstractarray.jl:836 [inlined]

[9] optimize at C:\Users\pi96doc\.julia\packages\Optim\fabGe\src\multivariate\optimize\optimize.jl:33 [inlined]

[12] top-level scope at none:0


#4

Looks like only norm(::CuArray, p=2) is implemented in CuArrays: https://github.com/JuliaGPU/CuArrays.jl/blob/7cb56765a634259fc1bcedf43aa0c603afd0aa58/src/blas/highlevel.jl#L40
It doesn’t look like CUBLAS has other norms, so we probably need a GPU-compatible generic fallback. The docs state The second argument p is not necessarily a part of the interface for norm, i.e. a custom type may only implement norm(A) without second argument, so maybe that’s not necessary and there’s other ways to express that computation?

Also, please use triple quotes to delimit code and output in your post.

#5

maximum(abs, x) would work on a CuArray, right?

#6

Thanks for trying this out! I’ve been meaning to see if there were problems with Optim and GPU’s to be fixed, but I never got my CuArrays up and running (only tried for a day), so never found out if there were things we needed to change. Can you try changing the inf-norm to maximum(abs, x) instead? A PR is also welcome, but I can also implement the changes we find necessary in this thread myself.

#7

was giving it another try preceeding the optimization with the following

    using LinearAlgebra,Optim,CuArrays
CuArrays.allowscalar(false);
LinearAlgebra.norm1(x::CuArray{T,N}) where {T,N} = sum(abs, x); # specializes the one-norm
LinearAlgebra.normInf(x::CuArray{T,N}) where {T,N} = maximum(abs, x); # specializes the one-norm
Optim.maxdiff(x::CuArray{T,N},y::CuArray{T,N}) where {T,N} = maximum(abs.(x-y));


This allowed the line

result2=optimize(fg,cimg, GradientDescent(),myOptions);


to run fine (not superfast, but this is probably due to the relatively small arrays.
There is however an output problem:

julia> result2
Results of Optimization Algorithm
ERROR: scalar getindex is disabled


There also were frequent out-of-memory errors, even though the arrays were quite small (512x512x1):

ERROR: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory)
Stacktrace:
[1] macro expansion at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\error.jl:56 [inlined]
[2] macro expansion at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\error.jl:57 [inlined]
[3] _mkplan(::UInt8, ::Tuple{Int64,Int64,Int64}, ::UnitRange{Int64}) at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\CUFFT.jl:109
[4] plan_rfft(::CuArray{Float64,3}, ::UnitRange{Int64}) at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\CUFFT.jl:405


Finally I was (naively?) trying to convert them back to the main CPU:

julia> Float32.(result2.minimizer)
ERROR: GPU compilation failed, try inspecting generated code with any of the @device_code_... macros
- argument = 4


Any ideas?

#8
ERROR: scalar getindex is disabled


That’s because you disabled scalar indexing (which is good, preventing slow code that processes GPU values on the CPU), but many default show methods rely on iterating over the container. We’ve implemented alternative methods for some default containers (https://github.com/JuliaGPU/GPUArrays.jl/blob/7470e90152e51ad4cca7d80e02590eb95cf6fce5/src/abstractarray.jl#L53-L74), but Optim seems to wrap arrays differently hence triggering this error. This probably needs a redesign in order to cope with wrapper types without hardcoding these methods in GPUArrays/CuArrays.

ERROR: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory)


Interesting. We have a memory allocator sitting between Julia and CUDA to handle failing memory allocations (since GPU memory pressure isn’t known by the GC), but that doesn’t apply to libraries like cuFFT. I’ve created an issue: https://github.com/JuliaGPU/CuArrays.jl/issues/130.

passing and using non-bitstype argument


That Broadcast argument is not bits indeed, probably due to the Nothing in there. A result of the recent Broadcast changes in CuArrays, maybe? cc @vchuravy

#9

could you be specific here? what are we packing differently (so I can maybe remove the problem in Optim-land)

#10

It’s just that we need to add support for each wrapper specifically, so we’ve only done that for common Base wrappers (Adjoint, Transpose, SubArray). That obviously doesn’t scale, and we need a better solution, but it doesn’t seem easy (see e.g. https://github.com/JuliaGPU/GPUArrays.jl/issues/147).