Sparse arrays and function arrays in CUDAnative.jl



Is there any simple way to make, write to and read from sparse arrays on the device through CUDAnative.jl?

Also, is it possible to define an array of device functions that is moved to the device and then each function can be accessed and called by its index during a kernel call?


We haven’t looked at sparse arrays yet, but CUDAnative supports arbitrary structs (as long as they are not mutable) so you can redefine your own sparse array type if necessary. You might also be able to use Cxx.jl and CUDAnativelib.jl to call cuSPARSE on the device, but I haven’t tried that yet so your mileage might vary.
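As a rough illustration of the "define your own sparse array type" suggestion, here is a minimal sketch of an immutable CSR container. The names `CSRMatrix`, `getentry` and `setentry!` are hypothetical and not part of CUDAnative.jl. The sketch runs on the CPU with plain `Vector`s; on the device the fields would presumably have to be device arrays, which has not been tested here.

```julia
# Hypothetical immutable CSR container (illustrative names, not a CUDAnative.jl API).
# The struct itself is immutable; only the arrays it refers to are written to.
struct CSRMatrix{Tv,Vv<:AbstractVector{Tv},Vi<:AbstractVector{Int}}
    m::Int       # number of rows
    n::Int       # number of columns
    rowptr::Vi   # rowptr[i]:(rowptr[i+1]-1) indexes the stored entries of row i
    colval::Vi   # column index of each stored entry
    nzval::Vv    # value of each stored entry
end

# Read entry (i, j) by scanning the stored columns of row i.
function getentry(A::CSRMatrix{Tv}, i, j) where {Tv}
    for k in A.rowptr[i]:(A.rowptr[i + 1] - 1)
        A.colval[k] == j && return A.nzval[k]
    end
    return zero(Tv)
end

# Overwrite an entry that is already in the sparsity pattern.
# Returns false (and changes nothing) if (i, j) is not stored.
function setentry!(A::CSRMatrix, v, i, j)
    for k in A.rowptr[i]:(A.rowptr[i + 1] - 1)
        if A.colval[k] == j
            A.nzval[k] = v
            return true
        end
    end
    return false
end

# The 2x3 matrix [1 0 2; 0 3 0]
A = CSRMatrix(2, 3, [1, 3, 4], [1, 3, 2], [1.0, 2.0, 3.0])
```

Keeping the sparsity pattern fixed avoids any allocation when writing, which is the part that would be hard to do inside a kernel.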

Can you give a bit more context on why you want to do that?


That would be awesome if it works.

I would like to use SymEngine to get symbolic derivative expressions then use them to define functions on the GPU to use for some optimization algorithm on the GPU. The functions should be accessible by the variable index, with respect to which the derivative was taken. There could be as many as 1M functions btw.


Just double-checked, and cuSPARSE doesn’t define a device-level API, so direct calls are not possible.
But you can call cuSPARSE from the host and interweave that with CUDAnative code as long as the data formats are compatible.

Hm, I see. The problem is that you have to compile that code, then take the function pointer (e.g. what cfunction does but GPU aware) and then emit a ccall to that function pointer. It might be doable, but I don’t think we have the infrastructure for that yet and I don’t know if that will be efficient.

If I understand you correctly, you effectively want a switch statement, since you don’t dynamically add new functions? For CUDAnative.jl the code has to be statically known and we can’t do dynamic dispatch. So if you can precompute your function selection to something like

if fidx == 1
    # call the first function
elseif fidx == 2
    # call the second function
end

you could generate that code and wouldn’t need to deal with function pointers on the GPU.
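A minimal sketch of how such a branch chain could be generated with metaprogramming, shown on the CPU only. The functions `d1`, `d2`, `d3` and the helper `build_dispatch` are hypothetical stand-ins; in the real use case the functions would come from the symbolic derivatives.

```julia
# Stand-in derivative functions (hypothetical examples).
d1(x) = 2x[1] + x[2]
d2(x) = x[2]^2
d3(x) = x[1] * x[2]
derivs = [d1, d2, d3]

# Build the nested expression
#   fidx == 1 ? d1(x) : (fidx == 2 ? d2(x) : d3(x))
# Every call target is spliced in as a constant, so dispatch stays static.
# The last function acts as the fallthrough for any other value of fidx.
function build_dispatch(fs)
    ex = :($(fs[end])(x))
    for i in (length(fs) - 1):-1:1
        ex = :(fidx == $i ? $(fs[i])(x) : $ex)
    end
    return ex
end

# Define dispatch(fidx, x) with the generated chain as its body.
@eval dispatch(fidx, x) = $(build_dispatch(derivs))
```

With `x = [1.0, 3.0]`, `dispatch(2, x)` evaluates `d2(x)`. The chain is linear in the number of functions, so it is only a starting point for a very large table.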


Would it work for you to just directly splice the function call from the vector of functions into an expression, then eval the expression and execute that function (now with static dispatch) on the GPU?



So if I pass a sparse matrix to cuSPARSE and apply some decompositions, then solve some system of equations, the pointer to the solution I get back should be possible to pass to CUDAnative directly through a second kernel call so that basically all the work is done on the GPU, the solution doesn’t have to be moved back to the CPU, right?

That was my first thought: a horrendously long if statement in a single function. But I was worried about the practicality of every thread/warp having to go through all the conditions to reach the intended one, which may not be the most efficient approach. So I thought that if I could define many device functions using metaprogramming and then pass them as an array, it would be better. But if I understand your point correctly, until the functions are compiled we cannot define a GPU pointer to them, which is why ccall is necessary as an after-step. Please correct me if I am wrong.


You mean to use metaprogramming in between kernel calls to define functions as necessary, which then get compiled and spliced into a kernel call? That seems like it could work for me. But I am not sure I understand how this being possible fits with @vchuravy’s answer. Could you please elaborate further?


It’s just a different solution that only works if you can statically decide the dispatch for each function :wink:
@vchuravy’s solution would also work when your function calls depend on the values of the variables, but it introduces branches, which would be especially bad if you have ~1 million branches :wink:


Note that generated functions work on the GPU as well. So as long as your meta code is pure, this can also work:

const look_up_table = Function[...]
@generated function derivative_kernel(a, b)
    # static_index has to be computed from the argument types alone
    :($(look_up_table[static_index])(a, b))
end

You just need to figure out how to derive a static index just from the function argument types.
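One possible way to get a static index from the argument types is to wrap the index in `Val`. The sketch below is a CPU-only illustration under that assumption. The table contents `d1`, `d2` and the name `derivative` are hypothetical, and it has not been tried inside a CUDAnative kernel.

```julia
# Stand-in derivative functions (hypothetical examples).
d1(x) = 2x[1] + x[2]
d2(x) = x[2]^2

# The table must be fully populated before the generated function is first
# called and must not change afterwards, so that the generator stays pure.
const look_up_table = Function[d1, d2]

# N is part of the type Val{N}, so it is known while the body is generated.
@generated function derivative(::Val{N}, x) where {N}
    f = look_up_table[N]   # looked up once per distinct N, at generation time
    return :($f(x))        # the emitted body is a direct call to that function
end
```

Calling `derivative(Val(2), x)` then compiles down to a plain call to `d2(x)`. Each distinct index gets its own specialization, so a table with very many entries means just as many compiled methods.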


How about something like this

dfdx1(x) = 2x[1]+x[2]
dfdx2(x) = x[2]^2

dfdx = [dfdx1, dfdx2]

for i in 1:2
    # build the names kernel_derivative_x1, kernel_derivative_x2
    name = Symbol("kernel_derivative_x", i)
    @eval begin
        function $name(x)
            return $(dfdx[i])(x)
        end
    end
end

Actually, in this code kernel_derivative_x1 and dfdx1 are the same function, my bad. What I really wanted is to take dfdx to the GPU so that all the functions are compiled and callable as dfdx[i](x) from inside a kernel call.

I believe a generated function is what I might be looking for, so your previous comment kind of answers my question. I will try and see what happens, thanks :slight_smile:


Better do this in Julia, as @vchuravy and @sdanisch are explaining above, because our compiler doesn’t support taking the (device) pointer of a device function. Indirect calls also are a somewhat underused feature of CUDA C, and I wouldn’t know if NVPTX supports lowering it.


Doesn’t CUDAnative support Windows? :sweat_smile:


Nothing holding it back, but none of the people actively working with CUDAnative uses Windows so it probably won’t work out of the box :frowning: Definitely needs a build from source, and I’d presume that any/all of LLVM.jl/CUDAdrv.jl/CUDAnative.jl might need some tweaks to the build scripts to get everything working (@musm has been doing some work on this). But other than that, I don’t expect any big issues, unless Windows codegen would be somehow incompatible with our assumptions for reusing most of host codegen for the GPU.


I see, I was looking for a good reason to go back to Linux anyways, seems like I found one :slight_smile: