Mapping threadIdx().x to a 5D array?

I have the following big-picture aim. Suppose I have a high-dimensional array, e.g. 5D or 6D. At each index of that array, an independent computation delivers a different value that needs to be stored at that index. For example,

V[i1,i2,i3,i4,i5] = fun(x, i1,i2,i3,i4,i5)

I want to try handing the computation of fun(x, i1,i2,i3,i4,i5) at each index to a single thread on a GPU, so that many threads work on this job in parallel. So far so good.
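
For reference, a serial sketch of what I want to parallelize (fun here is just a hypothetical stand-in for the real computation):

fun(x, i1,i2,i3,i4,i5) = x * (i1 + i2 + i3 + i4 + i5)  # placeholder

dims = (2,3,4,2,3)
x = 2.0
V = zeros(dims...)
for i5 in 1:dims[5], i4 in 1:dims[4], i3 in 1:dims[3], i2 in 1:dims[2], i1 in 1:dims[1]
	V[i1,i2,i3,i4,i5] = fun(x, i1,i2,i3,i4,i5)
end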

Immediate task: map threadIdx().x to array indices

Ideally I would just use the Julia function ind2sub to get a Cartesian index from a linear index, but that function is not exposed via CUDAnative, so I rewrote it for a particular case. Never mind that this is completely inefficient; at this point I just want to understand what’s going on. So, I have the following example, which compiles without error. I am having trouble seeing the output, though: why do all of my attempts to copy the device array back to the host give me an error? First the version that works, then the error at the bottom.

  • I am learning a lot here, so any comments are most welcome. In particular, is this the right way to attack a 6D array?
  • It seems that the @assert call verifies that I am doing this correctly, but I want to get this array back.
  • In general, I am very confused that something like mod or tuple or @assert works on the device, but something like sub2ind does not. What’s the difference?
using CUDAnative, CUDAdrv

function do3D()

	V = zeros(Int64,2,3,4)
	d_V = CuArray(V)
	@cuda threads=10 ind2sub3Dkernel(d_V)
	# all of those error:
	# copy!(V,d_V)
	# x = Array(d_V)
	# println(x[1])
	# return d_V

end

# the kernel. cannot return a value, hence write into the supplied vector
# at each thread index, get the corresponding CartesianIndex, and 
# recompose the linear index manually to check that this is correct.
function ind2sub3Dkernel(V::CuDeviceArray{Int64})
	idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
	m = size(V)
	n = myind2sub3D(m, idx)
	# n = mytuple()  # "works"
	@assert idx == n[1] + (n[2]-1)*m[1] + (n[3]-1)*m[1]*m[2]
	V[idx] = n[1] + (n[2]-1)*m[1] + (n[3]-1)*m[1]*m[2]
	return  # a kernel must not return a value
end
# can I have a *device function* that returns a tuple?
# seems I can.
function mytuple()
	(1,2)
end

# my *implementation* of an old ind2sub version.
# had to specialize to 3D because I cannot do splatting on the device?
# I know this is terrible code, but that's not the point. (I hope!)
function myind2sub3D(dims::Tuple{Integer,Vararg{Integer}}, ind::Int)
    ndims = length(dims)
    @assert ndims==3
    stride = dims[1]
    for i=2:ndims-1
        stride *= dims[i]
    end
    i2 = 0
    i3 = 0
    # a manual loop over i
    i = 2
    rest = rem(ind-1, stride) + 1
    i3 = div(ind - rest, stride) + 1
    ind = rest
    stride = div(stride, dims[i])
    i = 1
    rest = rem(ind-1, stride) + 1
    i2 = div(ind - rest, stride) + 1
    ind = rest
    stride = div(stride, dims[i])
    o = tuple(ind,i2,i3)
    # printing does not work
	# @cuprintf("my indices are %ld, %ld, %ld\n",o[1],o[2],o[3])
    # @cuprintf("i have ")
    return o

    # original implementation
    # sub = ()
    # for i=(ndims-1):-1:1
    #     rest = rem(ind-1, stride) + 1
    #     sub = tuple(div(ind - rest, stride) + 1, sub...)
    #     ind = rest
    #     stride = div(stride, dims[i])
    # end
    # return tuple(ind, sub...)
end
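
For what it's worth, here is an untested sketch of a generic version that avoids runtime splatting by recursing on the dims tuple; the recursion unrolls at compile time, so I would hope it infers on the device too:

# generic ind2sub sketch, NOT verified on the GPU:
# peel off one dimension per call; the tuple shrinks at compile
# time, so the result length stays inferable.
@inline _myind2sub(::Tuple{}, ind) = ()
@inline function _myind2sub(dims::Tuple, ind)
    d = dims[1]
    (rem(ind, d) + 1, _myind2sub(Base.tail(dims), div(ind, d))...)
end
@inline myind2sub(dims::Tuple, ind::Int) = _myind2sub(dims, ind - 1)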

Modifying the top-level function to convert back to an Array does this:

...
	x = Array(d_V)
...

julia> x=cudaVFI.do3D()
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
Stacktrace:
 [1] macro expansion at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/base.jl:145 [inlined]
 [2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/memory.jl:161
 [3] alloc at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/memory.jl:157 [inlined] (repeats 2 times)
 [4] CUDAdrv.CuArray{Int64,3}(::Tuple{Int64,Int64,Int64}) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/array.jl:33
 [5] CUDAdrv.CuArray(::Array{Int64,3}) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/array.jl:217
 [6] do3D() at /home/floswald/git/VFI/Julia/cudaVFI/src/cutest.jl:192
 [7] top-level scope

thanks!

I’m pretty sure @assert isn’t supported on the GPU!

I thought that was weird :grinning: Well, so much for that. So not even the fact that it compiles is an indication that the function exists?

Have a look at:

julia> test(ndims) =  @assert ndims==3
test (generic function with 1 method)

julia> @code_warntype test(2)
Variables:
  #self# <optimized out>
  ndims::Int64

Body:
  begin
      unless (ndims::Int64 === 3)::Bool goto 3
      return
      3:
      return (Base.throw)(((Core.getfield)((Core.getfield)(Base.Main, :Base)::Any, :AssertionError)::Any)("ndims == 3")::Any)::Union{}
  end::Void

So first of all, throws are not supported, and secondly, Any/Union{} aren’t either :wink:
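
If you need the check on the device anyway, one workaround is to avoid throw entirely and record failures in the output instead, e.g. something like this (untested sketch based on your kernel):

# throw-free variant of the check: write a sentinel instead of asserting
function ind2sub3Dkernel_checked(V::CuDeviceArray{Int64})
	idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
	m = size(V)
	n = myind2sub3D(m, idx)
	lin = n[1] + (n[2]-1)*m[1] + (n[3]-1)*m[1]*m[2]
	V[idx] = idx == lin ? lin : -1  # -1 flags a failed round-trip; inspect on the host
	return
end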

So not even the fact that it compiles is an indication that the function exists?

I think unsupported code gives quite a few different errors with CUDAnative on Julia 0.6 - including corrupting the context, if I remember correctly.
But on 0.7, @maleadt has improved error messages a lot, so it might give you a jl_throw not supported error…

With CLArrays (0.6 only) you should also get better error messages in some cases, although this particular code might “work”, since I’m just ignoring throws^^

When you write your kernels with GPUArrays, you can also run them on the CPU as a sanity check, where e.g. the asserts will work :wink: My strategy is usually: run with GPUArrays.JLArray on the CPU until the algorithm yields the correct results, then make sure everything infers correctly, then check whether I use any CUDAnative intrinsics (cos, sin, sqrt, etc.), and then run it on the GPU by just switching from JLArray to CLArray/CuArray :wink:
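
As a tiny illustration of that switch (fun is a made-up scalar function, and I haven't run this exact snippet):

using GPUArrays
function fun(x, i)
    @assert i > 0   # fires normally on the CPU backend
    x + i
end
A = JLArray(collect(1:24))  # CPU-backed test array
B = fun.(2.0, A)            # debug here, then swap JLArray for CuArray/CLArray
Array(B)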

Your entire kernel compiles to the following IR (as seen with @device_code_llvm do3D()):

; Function Attrs: noreturn nounwind
declare void @llvm.trap() #0

; Function Attrs: noreturn nounwind
define void @ptxcall_ind2sub3Dkernel_1({ [3 x i64], { i64 } }) local_unnamed_addr #0 {
fail10.i:
  tail call void @llvm.trap(), !noalias !2
  unreachable
}

So yeah, not much I can report about that at compile time :slightly_smiling_face: (That is, there isn’t any truly invalid code; it just happened that you did something that unconditionally leads to a throw, which the optimizer detected and used to throw all the other code away.)

That’s just because memory operations are synchronizing. Manually calling CUDAdrv.synchronize() flushes these errors explicitly.
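
For example, right after the launch in do3D():

	@cuda threads=10 ind2sub3Dkernel(d_V)
	CUDAdrv.synchronize()  # the pending kernel error surfaces here, not at the next allocation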

That’s not strictly true: most instances of throw are reduced to trap, but since we’ve been outlining throw to @noinline functions on 0.7 (to get rid of GC frames in hot code), certain GPU-incompatible objects/arguments aren’t optimized away anymore.

Should be improved once we can overdub Core.throw and let the Julia optimizer do its job.

Alright guys, thanks for sorting me out. Seems the only safe conclusion for now is that most of this was rubbish! :grinning:
Any comments on the overall aim (5D etc.)? Completely crazy?
I will give GPUArrays a try.
Thanks!

You probably don’t need to implement this yourself, as the Base.Broadcast machinery provides it already and is extensible for e.g. GPU arrays. See CuArrays.jl, or this minimal 0.7-compatible reimplementation. That of course only works if you’re mapping a scalar function, i.e. processing a single item per thread, but that looks like what you’re doing.
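
Sketching what the broadcast route could look like for the 5D case (fun is a hypothetical stand-in, and I haven't checked whether reshaped ranges broadcast on the GPU here; worst case, collect them into CuArrays first):

using CuArrays
fun(x, i1,i2,i3,i4,i5) = x * (i1 + i2 + i3 + i4 + i5)  # placeholder scalar function

dims = (2,3,4,2,3)
# index ranges reshaped so broadcasting expands them across all five dimensions
ix = ntuple(d -> reshape(1:dims[d], ntuple(_ -> 1, d-1)..., dims[d]), 5)
d_V = CuArray(zeros(dims...))
d_V .= fun.(2.0, ix...)  # fuses into a single GPU kernel, one element per thread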