Modifying CuArray elements in-place with index arrays

I’m trying to implement the vectorized square root for complex CUDA arrays, and am starting from the commented-out implementation in https://github.com/JuliaLang/julia/blob/5d8225a1116685e449fcc08b207acdd0f5106409/base/complex.jl#L508, just to get something working; it probably has underflow/overflow problems but it’s easier for a beginner to vectorize, also it’s horribly slow and maybe I should just write a kernel from the start. My current effort is

using CUDA

CUDA.allowscalar(false)

Xz_h = rand(Complex{Float64}, 5, 5)
Xz_d = CuMatrix(Xz_h)

function sqrt2(z::CuArray{Complex{T}}) where T
    ret = CuArray{Complex{T}}(undef, size(z)...)
    rz = float(real(z))
    iz = float(imag(z))
    r = sqrt.((hypot.(rz, iz) + abs.(rz)) / 2)
    ix1 = (r .== 0)
    # ret[ix1] = (zero(iz) + iz*im)[ix1]
    ix2 = (rz .>= 0)
    # ret[ix2] = (r + (iz ./ r / 2)*im)[ix2]
    ix3 = iszero.(ix1 + ix2)
    # ret[ix3] = ((abs.(iz) ./ r / 2) + copysign.(r, iz)*im)[ix3]
    return ret
end

function sqrt3(z::AbstractArray{Complex{T}}) where T
    ret = Array{Complex{T}}(undef, size(z)...)
    rz = float(real(z))
    iz = float(imag(z))
    r = sqrt.((hypot.(rz, iz) + abs.(rz)) / 2)
    ix1 = (r .== 0)
    ret[ix1] = (zero(iz) + iz*im)[ix1]
    ix2 = (rz .>= 0)
    ret[ix2] = (r + (iz ./ r / 2)*im)[ix2]
    ix3 = iszero.(ix1 + ix2)
    ret[ix3] = ((abs.(iz) ./ r / 2) + copysign.(r, iz)*im)[ix3]
    return ret
end

where the CPU version sqrt3 matches the sqrt. result, but the ret[ix] lines in the GPU version give this traceback:

julia> sqrt2(Xz_d)
ERROR: ReadOnlyMemoryError()
Stacktrace:
 [1] macro expansion at /home/eric/.julia/packages/LLVM/AGEVG/src/interop/base.jl:80
[inlined]
 [2] macro expansion at /home/eric/.julia/packages/LLVM/AGEVG/src/interop/pointer.jl:
7 [inlined]
 [3] pointerref at /home/eric/.julia/packages/LLVM/AGEVG/src/interop/pointer.jl:7 [in
lined]
 [4] unsafe_load at /home/eric/.julia/packages/LLVM/AGEVG/src/interop/pointer.jl:79 [
inlined]
 [5] arrayref at /home/eric/.julia/packages/CUDA/gKMm0/src/device/array.jl:80 [inline
d]
 [6] getindex at /home/eric/.julia/packages/CUDA/gKMm0/src/device/array.jl:99 [inline
d]
 [7] iterate at /home/eric/.julia/packages/CUDA/gKMm0/src/device/array.jl:145 [inline
d]
 [8] iterate at /home/eric/.julia/packages/CUDA/gKMm0/src/device/array.jl:144 [inline
d]
 [9] _simple_count(::typeof(identity), ::CuDeviceArray{Bool,2,1}) at ./reduce.jl:868
 [10] _count at ./reducedim.jl:392 [inlined]
 [11] #count#624 at ./reducedim.jl:390 [inlined]
 [12] #count#623 at ./reducedim.jl:389 [inlined]
 [13] count at ./reducedim.jl:389 [inlined]
 [14] LogicalIndex at ./multidimensional.jl:636 [inlined]
 [15] LogicalIndex at ./multidimensional.jl:639 [inlined]
 [16] adapt_structure at /home/eric/.julia/packages/Adapt/8kQMV/src/wrappers.jl:12 [i
nlined]
 [17] adapt at /home/eric/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
 [18] cudaconvert at /home/eric/.julia/packages/CUDA/gKMm0/src/compiler/execution.jl:
143 [inlined]
 [19] map at ./tuple.jl:159 [inlined]
 [20] map at ./tuple.jl:160 [inlined] (repeats 2 times)
 [21] #launch_heuristic#845 at /home/eric/.julia/packages/CUDA/gKMm0/src/gpuarrays.jl
:17 [inlined]
 [22] launch_heuristic at /home/eric/.julia/packages/CUDA/gKMm0/src/gpuarrays.jl:17 [
inlined]
 [23] gpu_call(::typeof(GPUArrays.setindex_kernel), ::CuArray{Complex{Float64},2}, ::
CuArray{Complex{Float64},1}, ::Tuple{Int64}, ::Int64, ::Base.LogicalIndex{CartesianIn
dex{2},CuArray{Bool,2}}; target::CuArray{Complex{Float64},2}, total_threads::Int64, t
hreads::Nothing, blocks::Nothing, name::Nothing) at /home/eric/.julia/packages/GPUArr
ays/Ck0bk/src/device/execution.jl:61
 [24] _setindex!(::CuArray{Complex{Float64},2}, ::CuArray{Complex{Float64},1}, ::Base
.LogicalIndex{CartesianIndex{2},CuArray{Bool,2}}) at /home/eric/.julia/packages/GPUAr
rays/Ck0bk/src/host/indexing.jl:161
 [25] setindex! at /home/eric/.julia/packages/GPUArrays/Ck0bk/src/host/indexing.jl:15
1 [inlined]
 [26] sqrt2(::CuArray{Complex{Float64},2}) at /tmp/load.jl:16
 [27] top-level scope at REPL[4]:1
 [28] run_repl(::REPL.AbstractREPL, ::Any) at /build/julia/src/julia-1.5.2/usr/share/
 julia/stdlib/v1.5/REPL/src/REPL.jl:288

and even this doesn’t work:

julia> Xz_d[1:3] = 2.2
2.2

julia> ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.

Nothing should be wrong with my GPU, since all the CUDA.jl tests pass.

That doesn’t work on the host either:

julia> Xz_h[1:3] = 2.2
ERROR: ArgumentError: indexed assignment with a single value to many locations is not supported; perhaps use broadcasting `.=` instead?

Using broadcast instead:

julia> Xz_h[1:3] .= 2.2
3-element view(::Array{Complex{Float64},1}, 1:3) with eltype Complex{Float64}:
 2.2 + 0.0im
 2.2 + 0.0im
 2.2 + 0.0im

julia> Xz_d[1:3] .= 2.2
3-element CuArray{Complex{Float64},1}:
 2.2 + 0.0im
 2.2 + 0.0im
 2.2 + 0.0im

That said, your sqrt2 definition works here. The ReadOnlyMemoryError happens because you’re executing GPU code on the CPU. That shouldn’t happen, somehow a device array is being iterated here when constructing a device-side LogicalIndex.

D’oh, missed the broadcast.

I think it’s the real/imaginary extraction:

1|debug> n                                                                                                                                                                   
In sqrt2(z) at /home/eric/development/forks/julia/CUDA.jl/mwe.jl:8                                                                                                           
  8  function sqrt2(z::CuArray{Complex{T}}) where T                                                                                                                          
  9      ret = CuArray{Complex{T}}(undef, size(z)...)                                                                                                                        
 10      # rz = float(real(z))                                                                                                                                               
 11      # iz = float(imag(z))                                                                                                                                               
>12      rz = real(z)                                                                                                                                                        
 13      iz = imag(z)                                                                                                                                                        
 14      r = sqrt.((hypot.(rz, iz) + abs.(rz)) / 2)                                                                                                                          
 15      ix1 = (r .== 0)                                                                                                                                                     
 16      ret[ix1] = (zero(iz) + iz*im)[ix1]                                                                                                                                  
                                                                                                                                                                             
About to run: <(real)(Complex{Float64}[0.31856560284195146 + 0.5734671308800006im 0.27744395249118337 + 0.5332794520...>                                                     
1|debug>                                                                                                                                                                     
Unreachable reached at 0x7ff684672600                                                                                                                                        
                                                                                                                                                                             
signal (4): Illegal instruction                                                                                                                                              
in expression starting at REPL[10]:1                                                                                                                                         
##compiledcall#591 at /home/eric/.julia/packages/JuliaInterpreter/Ha7hq/src/optimize.jl:358                                                                                  
jl_f__apply_latest at /usr/bin/../lib/libjulia.so.1 (unknown line)                                                                                                           
#invokelatest#1 at ./essentials.jl:710                                                                                                                                       
unknown function (ip: 0x7ff84c210214)                                                                                                                                        
invokelatest at ./essentials.jl:709                                                                                                                                          
unknown function (ip: 0x7ff84c210214)                                                                                                                                        
bypass_builtins at /home/eric/.julia/packages/JuliaInterpreter/Ha7hq/src/interpret.jl:179                                                                                    
#evaluate_call_recurse!#58 at /home/eric/.julia/packages/JuliaInterpreter/Ha7hq/src/interpret.jl:203                                                                         
evaluate_call_recurse! at /home/eric/.julia/packages/JuliaInterpreter/Ha7hq/src/interpret.jl:202 [inlined]                                                                   
                                                                                                        
...

which crashes the REPL completely (illegal hardware instruction (core dumped)). It works fine in the REPL outside the debugger.

Oh, this only happens in the debugger. Yes, there’s a couple of known issues debugging kernel calls, and I haven’t looked into it recently.