Julia CUDA matrix multiplication



I’m relatively new to Julia and want to implement a numerical method using the CUDA libraries for Julia. I worked through the introduction files on GitHub and picked up enough of the basics to start writing my own code.

The thing is that I want to write code that is at least not completely inefficient. My aim is therefore to avoid unnecessary communication overhead between the GPU and the CPU, or their memories. And here is my, maybe stupid, question.

Imagine I have some data

A = rand(ComplexF64, (N,N)) ,
B = rand(ComplexF64, (N,N)) ,

where N is some fixed integer, and I upload my data to the NVIDIA GPU using

A_gpu = CuArrays.cu(A),
B_gpu = CuArrays.cu(B).

What I’m asking myself is: when I perform a simple matrix multiplication

A_gpu * B_gpu

does this calculation take place on the GPU? I mean, is this a standard implemented feature of Julia when one multiplies CuArrays, or do I need to write an extra kernel function for a “parallel matrix multiplication” and call it with @cuda…?

It would be great if a Julia expert could answer this question for me.




It will indeed take place on the GPU. This is not really a “standard implemented feature of Julia”; it is just that * can be overloaded, and the guys writing CuArrays overloaded * between two CuArrays (CuMatrices specifically) to call the CUBLAS version of matrix multiply. No hand-written kernel or @cuda call is needed for this.
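
If you want to convince yourself, here is a minimal sketch. It assumes the newer CUDA.jl package (CuArrays.jl has since been merged into it) and a working NVIDIA GPU. The size N = 256 is an arbitrary choice, and I use CuArray(A) rather than cu(A) so that the ComplexF64 element type is kept as-is.

```julia
using CUDA  # assumption: CUDA.jl, the successor of CuArrays.jl

N = 256  # arbitrary size, chosen only for illustration

# Data created on the CPU
A = rand(ComplexF64, N, N)
B = rand(ComplexF64, N, N)

# Upload once; CuArray(...) keeps the element type unchanged
A_gpu = CuArray(A)
B_gpu = CuArray(B)

# Plain `*` on two GPU matrices; the result stays in GPU memory
C_gpu = A_gpu * B_gpu

# Copy back to the CPU only when the result is actually needed there
C = Array(C_gpu)

println(typeof(C_gpu))   # a CuArray type, i.e. the product lives on the GPU
println(C ≈ A * B)       # compare against the CPU result
```

As long as you keep working with A_gpu, B_gpu and C_gpu, nothing is transferred back to the CPU; the only transfers in this sketch are the two uploads and the final Array(C_gpu).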



Specifically, this is where the CUBLAS implementations are dispatched to: https://github.com/JuliaGPU/CuArrays.jl/blob/cee6253edeca2029d8d0522a46e2cdbb638e0a50/src/blas/highlevel.jl#L90-L145

And this is the fallback generically-typed implementation (e.g. for use with Dual numbers or other types that are not supported by CUBLAS): https://github.com/JuliaGPU/CuArrays.jl/blob/cee6253edeca2029d8d0522a46e2cdbb638e0a50/src/matmul.jl#L4-L50
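
To illustrate the idea of a specialized method plus a generic fallback, here is a toy, CPU-only sketch. It is not the CuArrays code: ToyMatrix, VendorEltype and last_path are made-up names, and the "vendor" path just calls the ordinary CPU matrix product as a stand-in for a CUBLAS call.

```julia
import Base: *

# Hypothetical wrapper type standing in for a GPU matrix type
struct ToyMatrix{T}
    data::Matrix{T}
end

# Element types the (pretend) vendor library supports
const VendorEltype = Union{Float32, Float64, ComplexF32, ComplexF64}

# Records which method ran last, only so the dispatch is observable
const last_path = Ref(:none)

# Specialized method: both operands share a vendor-supported element type
function *(A::ToyMatrix{T}, B::ToyMatrix{T}) where {T<:VendorEltype}
    last_path[] = :vendor
    return ToyMatrix(A.data * B.data)  # stand-in for a CUBLAS gemm call
end

# Generic fallback: works for any element types that support * and +
function *(A::ToyMatrix{T}, B::ToyMatrix{S}) where {T,S}
    last_path[] = :generic
    m, k = size(A.data)
    k2, n = size(B.data)
    k == k2 || throw(DimensionMismatch("inner dimensions must match"))
    R = typeof(zero(T) * zero(S) + zero(T) * zero(S))
    C = zeros(R, m, n)
    for j in 1:n, i in 1:m
        acc = zero(R)
        for l in 1:k
            acc += A.data[i, l] * B.data[l, j]
        end
        C[i, j] = acc
    end
    return ToyMatrix(C)
end

# Float64 is in VendorEltype, so the specialized method is chosen
C = ToyMatrix([1.0 2.0; 3.0 4.0]) * ToyMatrix([5.0 6.0; 7.0 8.0])
path_C = last_path[]

# Rational{Int} is not, so dispatch falls through to the generic method
D = ToyMatrix(Rational{Int}[1 2; 3 4]) * ToyMatrix(Rational{Int}[5 6; 7 8])
path_D = last_path[]

println((path_C, path_D))
```

The caller writes the same A * B in both cases; Julia picks the most specific method for the argument types, which is the mechanism the two links above rely on.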