I have the following big-picture aim. Suppose I have a high-dimensional array, e.g. 5D or 6D. At each index of that array, an independent computation delivers a value that needs to be stored at that index. For example,

```julia
V[i1,i2,i3,i4,i5] = fun(x, i1,i2,i3,i4,i5)
```

I want to try whether I can hand the computation of `fun(x, i1,i2,i3,i4,i5)` over to a single thread on a GPU, and hence have many threads work on this job in parallel. So far so good.
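For concreteness, here is a serial CPU sketch of that pattern; `fun` is just a dummy stand-in for the real per-index computation:

```julia
# Serial version of the target computation. `fun` is a placeholder;
# the real one is an independent computation per index.
fun(x, I...) = x * sum(I)

function fill_serial!(V::AbstractArray, x)
    for I in CartesianIndices(V)     # all (i1,...,iN) index tuples
        V[I] = fun(x, Tuple(I)...)
    end
    return V
end

V = fill_serial!(zeros(Int, 2, 3, 4, 2, 2), 1)   # small 5D example
```

This is the loop I would like each GPU thread to take one iteration of.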
Immediate task: map `threadIdx().x` to array indices.

Ideally I wanted to just use the Julia function `ind2sub` to get a Cartesian index from a linear index, but that function is not exposed via CUDAnative, so I rewrote it for a particular case. Never mind that this is completely inefficient; at this point I just want to understand what's going on. So, I have the following example, which compiles without error. I am having trouble seeing the output, though: why do all of my attempts to copy the device array back to the host give me an error? First the version that works, then the error at the bottom.
- I am learning a lot here, so any comments are most welcome. In particular, is this the right way to attack a 6D array?
- It seems that the `@assert` call verifies that I'm doing this correctly, but I want to get this array back.
- In general, I am very confused that a function like `mod` or `tuple` or `@assert` works on the device, but something like `sub2ind` does not. What's the difference? (See the host-side check right after this list.)
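For reference, here is the mapping I am trying to reproduce, checked on the host. In Julia ≥ 0.7 the `ind2sub`/`sub2ind` functionality lives in `CartesianIndices`/`LinearIndices`; I don't know whether those work inside a kernel, so this is purely a CPU sanity check:

```julia
# Host-side round trip: linear index -> Cartesian tuple -> linear index,
# for the same 2x3x4 array used in the kernel below.
dims = (2, 3, 4)
ci = CartesianIndices(dims)
li = LinearIndices(dims)
for k in 1:prod(dims)
    i1, i2, i3 = Tuple(ci[k])                       # "ind2sub"
    @assert li[i1, i2, i3] == k                     # "sub2ind"
    @assert k == i1 + (i2-1)*dims[1] + (i3-1)*dims[1]*dims[2]
end
```

Now the actual code: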
```julia
using CUDAnative, CUDAdrv
function do3D()
    V = zeros(Int64, 2, 3, 4)
    d_V = CuArray(V)
    @cuda threads=10 ind2sub3Dkernel(d_V)
    # all of these error:
    # copy!(V, d_V)
    # x = Array(d_V)
    # println(x[1])
    # return d_V
end
# the kernel. it cannot return a value, hence it writes into the supplied array:
# at each thread index, get the corresponding Cartesian index, and
# recompose the linear index manually to check that this is correct.
function ind2sub3Dkernel(V::CuDeviceArray{Int64})
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    m = size(V)
    n = myind2sub3D(size(V), idx)
    # n = mytuple() # "works"
    @assert idx == n[1] + (n[2]-1)*m[1] + (n[3]-1)*m[1]*m[2]
    V[idx] = n[1] + (n[2]-1)*m[1] + (n[3]-1)*m[1]*m[2]
    return nothing # a kernel must not return a value
end
# can I have a *device function* that returns a tuple?
# seems I can.
function mytuple()
    (1, 2)
end
# my *implementation* of an old ind2sub version.
# I had to specialize to 3D because I cannot do splatting on the device(?)
# I know this is terrible code, but that's not the point (I hope!);
# see the dimension-generic sketch after this code block.
function myind2sub3D(dims::Tuple{Integer,Vararg{Integer}}, ind::Int)
    ndims = length(dims)
    @assert ndims == 3
    stride = dims[1]
    for i = 2:ndims-1
        stride *= dims[i]
    end
    i2 = 0
    i3 = 0
    # the loop over i, manually unrolled
    i = 2
    rest = rem(ind - 1, stride) + 1
    i3 = div(ind - rest, stride) + 1
    ind = rest
    stride = div(stride, dims[i])
    i = 1
    rest = rem(ind - 1, stride) + 1
    i2 = div(ind - rest, stride) + 1
    ind = rest
    stride = div(stride, dims[i])
    o = tuple(ind, i2, i3)
    # printing does not work:
    # @cuprintf("my indices are %ld, %ld, %ld\n", o[1], o[2], o[3])
    # @cuprintf("i have ")
    return o
    # original implementation:
    # sub = ()
    # for i = (ndims-1):-1:1
    #     rest = rem(ind-1, stride) + 1
    #     sub = tuple(div(ind - rest, stride) + 1, sub...)
    #     ind = rest
    #     stride = div(stride, dims[i])
    # end
    # return tuple(ind, sub...)
end
```
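As an aside on the splatting question: I believe the replacement in current Base (the internal `Base._ind2sub`) recurses over the `dims` tuple, and a dimension-generic version along those lines avoids the hand-unrolled 3D special case. I have only checked this on the host; whether the tuple splicing in it compiles on the device is exactly what I am unsure about:

```julia
# Sketch of a dimension-generic ind2sub, recursing over the dims tuple.
# Host-tested only; device behaviour is the open question.
_ind2sub(::Tuple{}, ind::Int) = ()
function _ind2sub(dims::Tuple, ind::Int)
    d, r = divrem(ind, dims[1])      # peel off the first coordinate
    return (r + 1, _ind2sub(Base.tail(dims), d)...)
end
myind2sub(dims::Tuple, ind::Int) = _ind2sub(dims, ind - 1)

@assert myind2sub((2, 3, 4), 5) == (1, 3, 1)
```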
Modifying the top-level function to convert back to an `Array`, i.e.

```julia
...
x = Array(d_V)
...
```

does this:

```
julia> x=cudaVFI.do3D()
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
Stacktrace:
[1] macro expansion at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/base.jl:145 [inlined]
[2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/memory.jl:161
[3] alloc at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/memory.jl:157 [inlined] (repeats 2 times)
[4] CUDAdrv.CuArray{Int64,3}(::Tuple{Int64,Int64,Int64}) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/array.jl:33
[5] CUDAdrv.CuArray(::Array{Int64,3}) at /home/floswald/.julia/packages/CUDAdrv/GyXD/src/array.jl:217
[6] do3D() at /home/floswald/git/VFI/Julia/cudaVFI/src/cutest.jl:192
[7] top-level scope
```
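For what it's worth, the copy round trip with no kernel launch in between is what I would isolate first; a minimal check along these lines, using the same calls as above:

```julia
# Round trip host -> device -> host without launching a kernel.
using CUDAdrv
V   = zeros(Int64, 2, 3, 4)
d_V = CuArray(V)      # upload
x   = Array(d_V)      # download
@assert x == V
```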
thanks!