Mapping ThreadIdx().x to a 5D array?

floswald · June 15, 2018, 10:33am

I have the following big picture aim. Suppose I have a high dimensional array, e.g. 5D or 6D. On each index of those arrays, an independent computation delivers a different value that needs to be stored at this index. For example,

V[i1,i2,i3,i4,i5] = fun(x, i1,i2,i3,i4,i5)

I wanted to try whether I can hand over computation of fun(x, i1,i2,i3,i4,i5) to a single thread on a GPU, hence have many threads work in parallel on this job. so far so good.

Immediate Task: map `ThreadIdx().x` to array indices

Ideally I wanted to just use the julia function ind2sub to get a cartesian index from a linear index. that function is not exposed via CUDAnative. to I rewrote it for a particular case. Nevermind if this is completely inefficient, at this point I just want to understand what’s going on. So, I have the following example, which compiles without error. I am having trouble seeing the output though. why do all of my attempts to copy the device array back to the host give me an error? First the version that works, then the error at the bottom.

I am learning a lot here, so any comments most welcome. In particular, is that the right way to attack a 6D array?
it seems that the @assert call verifies that I’m doing this correctly, but I want to get this array back.
In general, I am very confused that a function like mod or tuple or @assert works on the device, but something like sub2ind does not. what’s the difference?

using CUDAnative, CUDAdrv

function do3D()

	V = zeros(Int64,2,3,4)
	d_V = CuArray(V)
	@cuda threads=10 ind2sub3Dkernel(d_V)
        # all of those error.
	# copy!(V,d_V)
	# x = Array(d_V)
	# println(x[1])
	# return d_V

end

# the kernel. cannot return a value, hence write into the supplied vector
# at each thread index, get the corresponding CartesianIndex, and 
# recompose the linear index manually to check that this is correct.
function ind2sub3Dkernel(V::CuDeviceArray{Int64})
	idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
	m = size(V)
	n = myind2sub3D(size(V),idx)
	# n = mytuple()  # "works"
	@assert idx == n[1] + (n[2]-1)*m[1] + (n[3]-1)*m[1]*m[2]
	V[idx] = n[1] + (n[2]-1)*m[1] + (n[3]-1)*m[1]*m[2]
end
# can I have a *device function* that returns a tuple?
# seems I can.
function mytuple()
	(1,2)
end

# my *implementation* of an old ind2sub version.
# had to specialize to 3D because I cannot do splatting on the device?
# I know this is terrible code, but that's not the point. (I hope!)
function myind2sub3D(dims::Tuple{Integer,Vararg{Integer}}, ind::Int)
    ndims = length(dims)
    @assert ndims==3
    stride = dims[1]
    for i=2:ndims-1
        stride *= dims[i]
    end
    i2 = 0
    i3 = 0
# a manual loop over i
    i = 2
        rest = rem(ind-1, stride) + 1
        i3 = div(ind - rest, stride) + 1
        ind = rest
        stride = div(stride, dims[i])
    i = 1
        rest = rem(ind-1, stride) + 1
        i2 = div(ind - rest, stride) + 1
        ind = rest
        stride = div(stride, dims[i])
    o = tuple(ind,i2,i3)
    # printing does not work
	# @cuprintf("my indices are %ld, %ld, %ld\n",o[1],o[2],o[3])
    # @cuprintf("i have ")
    return o

    # original implementation
    # sub = ()
    # for i=(ndims-1):-1:1
    #     rest = rem(ind-1, stride) + 1
    #     sub = tuple(div(ind - rest, stride) + 1, sub...)
    #     ind = rest
    #     stride = div(stride, dims[i])
    # end
    # return tuple(ind, sub...)
end

modifying the top level function to convert back to Array does this:

...
	x = Array(d_V)
...

julia> x=cudaVFI.do3D()
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
Stacktrace:
 [1] macro expansion at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/base.jl:145 [inlined]
 [2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/memory.jl:161
 [3] alloc at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/memory.jl:157 [inlined] (repeats 2 times)
 [4] CUDAdrv.CuArray{Int64,3}(::Tuple{Int64,Int64,Int64}) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/array.jl:33
 [5] CUDAdrv.CuArray(::Array{Int64,3}) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/array.jl:217
 [6] do3D() at /home/floswald/git/VFI/Julia/cudaVFI/src/cutest.jl:192
 [7] top-level scope

thanks!

sdanisch · June 15, 2018, 10:38am

I’m pretty sure @assert isn’t supported on the GPU!

floswald · June 15, 2018, 11:29am

I thought that was weird well so much for that. So not even the fact that it compiles is an indication that the function exists?

sdanisch · June 15, 2018, 11:42am

have a look at:

julia> test(ndims) =  @assert ndims==3
test (generic function with 1 method)

julia> @code_warntype test(2)
Variables:
  #self# <optimized out>
  ndims::Int64

Body:
  begin
      unless (ndims::Int64 === 3)::Bool goto 3
      return
      3:
      return (Base.throw)(((Core.getfield)((Core.getfield)(Base.Main, :Base)::Any, :AssertionError)::Any)("ndims == 3")::Any)::Union{}
  end::Void

So first of all throws are not supported, and secondly Any/Union{} aren’t

So not even the fact that it compiles is an indication that the function exists?

I think errors related to unsupported code are giving quite a few different errors with CUDAnative on julia 0.6 - including corrupting the context, if I remember correctly.
But, on 0.7, @maleadt has improved error messages by a lot - so it might give you a jl_throw not supported error…

With CLArrays (0.6 only) you should also get better error messages in some cases. Although, this particular code might “work”, since I’m just ignoring throws^^

sdanisch · June 15, 2018, 11:48am

When you write your kernels with GPUArrays, you can also run them on the cpu as a sanity check - where e.g. the asserts will work My strategy is usually: run with GPUArrays.JLArray on the CPU untill the algorithm yields the correct results, then make sure everything infers correctly, then check if I use any CUDAnative intrinsics (cos, sin, sqrt etc) and then run it on the GPU by just switching from JLArray to CLArray/CuArray

maleadt · June 15, 2018, 12:02pm

Your entire kernel compiles to the following IR (as seen with @device_code_llvm do3D()):

; Function Attrs: noreturn nounwind
declare void @llvm.trap() #0

; Function Attrs: noreturn nounwind
define void @ptxcall_ind2sub3Dkernel_1({ [3 x i64], { i64 } }) local_unnamed_addr #0 {
fail10.i:
  tail call void @llvm.trap(), !noalias !2
  unreachable
}

So yeah, not much I can report about that at compile-time (that is, there isn’t any truly invalid code. it just happened that you did something that unconditionally leads to a throw, which the optimizer detected and used to throw all other code away)

That’s just because memory operations are synchronizing. Manually calling CUDAdrv.synchronize() flushes these errors explicitly.

maleadt · June 15, 2018, 12:04pm

That’s not strictly true, most instances of throw are reduced to trap, but since we’ve been outlining throw to @noinline functions in 0.7 (to get rid of GC frames in hot code) certain GPU-incompatible objects/arguments aren’t optimized away anymore.

Should be improved once we can overdub Core.throw and let the Julia optimizer do its job.

floswald · June 15, 2018, 12:12pm

alright guys, thanks for sorting me out. seems the only safe conclusion for now is that most of this was rubbish!
any comments on the overall aim (5D etc)? completely crazy?
I will give GPUArrays a try.
thanks!

maleadt · June 15, 2018, 5:59pm

You probably don’t need to implement this yourself as the Base.Broadcast machinery provides this already, and is extensible for eg. GPU arrays. See CuArrays.jl, or this minimal 0.7-compatible reimplementation. That of course only works if you’re mapping a scalar function, ie. processing a single item per thread, but that looks like what you’re doing.

Topic		Replies	Views
Use GPU subfunction in a bigger-function? New to Julia cudanative , cuarrays	5	754	April 19, 2020
GPU Map without reduction on multiple arrays indices GPU	1	693	February 8, 2019
Is it possible to index a CuArray with a CuArray? GPU question	1	835	January 11, 2019
CUDA.jl - Sub-Vector Indexing Problem Inside CUDA Kernel GPU cuda , error , cuarrays , error-message , staticarrays	2	1244	March 28, 2022
CUDAnative: kernel multidimensional access GPU cudanative	3	1169	February 3, 2017

Mapping ThreadIdx().x to a 5D array?

Immediate Task: map ThreadIdx().x to array indices

Related topics

Immediate Task: map `ThreadIdx().x` to array indices