Casting, annotations and numeric types for CUDAnative

type
parametric-types

#1

Hello,

I’ve begun exploring the CUDAnative package with modest success. I’ve run into a couple of limitations.

I wanted to allocate an array from within my kernel, as is shown in some of the project examples. The first limitation appears to be that this must be a 1-dimensional collection. OK, I have a 2D grid and want to perform a vector operation for each cell, so I need to use a linear array for what I’d ideally index as 3D. I wrote a function to do this, and it only takes integers from the block/grid dimensions and adds/multiplies them.

I can see that CUDA is giving me the results back as floats (if I attempt to store this converted index as grid output, it crashes if the container expects Int32). I’ve attempted adding type annotations and casts, but neither compiles.

I have a minimal working example below. My questions are:

  1. Given the goal of allocating/using something like a multidimensional array, is there a better way to proceed? Are there index functions to handle this?
  2. Are there ever times when I can cast/annotate types onto my variables?

Random potentially useful tidbits:
Julia version 1.0.3
Geforce 755m
CUDAnative 1.0.0
CUDAdrv 1.0.1
CuArrays 0.9.0

Thanks,
Alex

Example:

using CUDAdrv, CUDAnative, CuArrays

function l_0(x, y, z, w, h)
    return x + y*w + z*w*h
end

function l(x, y, z, w, h)
    _x = x - 1
    _y = y - 1
    _z = z - 1
    return l_0(_x, _y, _z, w, h) + 1
end

function kernel(out)
	x = blockIdx().x
	y = blockIdx().y
	w = gridDim().x
	h = gridDim().y
	z = 1

	# apparently @cuDynamicSharedMem can only be 1 dimensional?
	arr = @cuDynamicSharedMem(Int32, w * h * 3)
	# it allows me to say `linear_index :: Int..` but not `linear_index :: Int32..`
	linear_index = l(x,y,z,w,h)

	out[x, y] = linear_index
	return nothing
end

function make_matrix(width :: Int, height :: Int)
	grid = (width, height)
	threads = (1,)

	# I can't change this Float32 -> Int32
	cu_out = CuArray{Float32, 2}(undef, width, height)

	@cuda blocks=grid threads=threads kernel(cu_out)
	out = Array{Float32, 2}(cu_out)
	return out
end

function main()
	width = 10
	height = 10
	matrix = make_matrix(width, height)
	println(matrix)
end

main()

#2

I can’t change this Float32 -> Int32

Why not? Doing so just works…

apparently @cuDynamicSharedMem can only be 1 dimensional?

No, it can be higher-dimensional, but you need to pass the size as a tuple. The @cu...SharedMem macros are a bit of a hack and really should be a proper array type, at which point they should probably implement all of the common constructor syntaxes.
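As for index helpers: on the CPU side, Base’s LinearIndices implements exactly the flattening your l/l_0 helpers compute. A quick plain-Julia sketch (not kernel code) to illustrate the correspondence:

```julia
# Plain-Julia sketch: Base's LinearIndices computes the same 1-based
# (x, y, z) -> linear mapping as the hand-written l/l_0 helpers.
l_0(x, y, z, w, h) = x + y*w + z*w*h                   # 0-based flattening
l(x, y, z, w, h) = l_0(x - 1, y - 1, z - 1, w, h) + 1  # 1-based wrapper

w, h, d = 4, 3, 2
lin = LinearIndices((w, h, d))
@assert all(l(x, y, z, w, h) == lin[x, y, z]
            for x in 1:w, y in 1:h, z in 1:d)
```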

it allows me to say linear_index :: Int.. but not linear_index :: Int32..

Again, that just works here. Do know that you generally want to avoid such patterns, since they generate quite ugly code (branches for checked arithmetic, allocations, and calls to exception methods) that is typically unwanted in hot code. GPU intrinsics return Int32 values, but the Julia wrapper functions return Int (for codegen reasons), so you can safely unsafe_trunc(Int32, ...) if you really want a 32-bit value.
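For illustration, here’s a quick CPU-side sketch of the difference between the checked Int32(...) conversion and unsafe_trunc (plain Julia, not kernel code):

```julia
i = Int64(42)
Int32(i)                    # checked conversion: throws InexactError on overflow
unsafe_trunc(Int32, i)      # unchecked: just keeps the low 32 bits, no branches

# The unchecked version silently wraps instead of throwing:
unsafe_trunc(Int32, typemax(Int64))  # == Int32(-1)
```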

Maybe you forgot to reserve memory for the dynamic shared memory (specified using the shmem keyword argument to @cuda)? Do note that your use of dynamic shared memory is kind of peculiar: it’s normally shared between threads, but you only use a single thread. Doing so will probably hurt your occupancy. If you want truly local memory, why not use StaticArrays?

Anyway, here’s the code that works:

using CUDAdrv, CUDAnative, CuArrays

function l_0(x, y, z, w, h)
    return x + y*w + z*w*h
end

function l(x, y, z, w, h)
    _x = x - 1
    _y = y - 1
    _z = z - 1
    return l_0(_x, _y, _z, w, h) + 1
end

function kernel(out)
    x = blockIdx().x
    y = blockIdx().y
    w = gridDim().x
    h = gridDim().y
    z = 1

    arr = @cuDynamicSharedMem(Int32, (w, h, 3))
    arr[x,y,z] = Int32(1)
    linear_index::Int32 = l(x,y,z,w,h)
    arr[linear_index] = Int32(1) # still works

    out[x, y] = linear_index
    return nothing
end

function make_matrix(width :: Int, height :: Int)
    grid = (width, height)
    threads = (1,)

    cu_out = CuArray{Int32, 2}(undef, width, height)

    @cuda blocks=grid threads=threads shmem=sizeof(Int32)*prod(grid) kernel(cu_out)
    out = Array{Int32, 2}(cu_out)
    return out
end

function main()
    width = 10
    height = 10
    matrix = make_matrix(width, height)
    println(matrix)
end

main()

#3

Hey @maleadt thanks for the response.

Regarding the @cu...SharedMem macros, tuple syntax indeed works, and I had forgotten the shmem keyword when invoking @cuda. That completely frees me from the attempt at converting to a linear index; I’m sure that looked weird, especially in a MWE. Static arrays are working great for my use case.

Speaking of the MWE, if I take the code snippet you provided, it actually does not work for me; I get an error about boxing ints.

The error:
ERROR: LoadError: CUDAnative.InvalidIRError(CUDAnative.CompilerContext(kernel, Tuple{CuDeviceArray{Int32,2,CUDAnative.AS.Global}}, v"3.0.0", true, nothing, nothing, nothing, nothing), Tuple{String,Array{Base.StackTraces.StackFrame,1},Any}[("call to the Julia runtime", [throw_inexacterror at boot.jl:567, checked_trunc_sint at boot.jl:589, toInt32 at boot.jl:626, Type at boot.jl:716, convert at number.jl:7, kernel at minimal_example.jl:23], "jl_box_int64")])

I’ve been able to get along just fine from here by making the kernel parameters explicitly Int32, so I don’t consider this a problem. Out of curiosity, are you running a 32-bit OS that would have the output of l_0 in your example be an Int32?

Finally, since you mentioned occupancy, how smart is CUDAnative at mapping grids to cores? On my machine, CUDAnative tells me my max grid dimensions are 5 x 6 x 7 = 210. If I invoked a 2D grid of size 10 x 21, would I be getting high occupancy?

Thanks again


#4

Also, RE occupancy, if the following property is true

julia> CUDAdrv.MAX_THREADS_PER_BLOCK
MAX_THREADS_PER_BLOCK::CUdevice_attribute = 1

then using more than one thread per block won’t increase my occupancy, right?


#5

Oh interesting, that should work on master :slight_smile: but code like that is slow and will allocate (which currently leaks memory), so it’s better to work with explicit conversions.

CUDAnative doesn’t do anything in that regard; it strictly works at the CUDA abstraction level, where you have to deal with that yourself. CuArrays has some simplistic heuristics.

That’s just the enum value. You need to query the device with that value:

julia> CUDAdrv.MAX_THREADS_PER_BLOCK
MAX_THREADS_PER_BLOCK::CUdevice_attribute = 1

julia> attribute(device(), CUDAdrv.MAX_THREADS_PER_BLOCK)
1024

Alternatively, you can query the maximum number of threads for any given kernel (since it also depends on the kernel’s register usage) using CUDAnative’s low-level APIs. Check the docstring for @cuda.


#6

Enum. facepalm That makes so much more sense; now I don’t have to worry about all kinds of nonsense like a warp size of 10 and values that don’t add up.