Julia CUDA matrix multiplication

Hi,

I’m relatively new to Julia and want to implement a numerical method using the CUDA libraries for Julia. I’ve worked my way through the introductory material on GitHub and picked up the basics needed to write my own code.

The thing is, I’d like the code to be at least reasonably efficient, so my aim is to avoid unnecessary communication between the GPU and the CPU (and their memories). And here is my, maybe stupid, question.

Imagine I have some data

A = rand(ComplexF64, (N, N))
B = rand(ComplexF64, (N, N))

where N is some fixed integer, and upload my data to the Nvidia GPU using

A_gpu = CuArrays.cu(A)
B_gpu = CuArrays.cu(B)

What I’m asking myself is: when I perform a simple matrix multiplication

A_gpu*B_gpu

does this calculation take place on the GPU? I mean, is this a feature that comes standard in Julia when multiplying CuArrays, or do I need to write an extra kernel function for a “parallel matrix multiplication” and call it with @cuda…?
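
For completeness, here is a minimal, self-contained version of what I mean (N = 1024 is just a placeholder size):

using CuArrays

N = 1024                       # placeholder size
A = rand(ComplexF64, (N, N))
B = rand(ComplexF64, (N, N))

A_gpu = CuArrays.cu(A)         # upload to the GPU
B_gpu = CuArrays.cu(B)

C_gpu = A_gpu * B_gpu          # does this line actually run on the GPU?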

It would be great if some Julia expert could answer this question for me.

Thanks

It will indeed take place on the GPU. It’s not really a “standard implemented feature of Julia”; it’s just that * can be overloaded, and the people writing CuArrays overloaded * between two CuArrays (CuMatrices specifically) to call the CUBLAS version of matrix multiplication.
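
You can convince yourself of this directly in the REPL; here is a quick sketch (assuming CuArrays.jl is installed and a GPU is available; note that cu may down-convert to 32-bit precision, so use CuArray(A) if you want to keep ComplexF64):

using CuArrays

N = 512
A_gpu = CuArray(rand(ComplexF64, (N, N)))   # CuArray(...) keeps the ComplexF64 element type
B_gpu = CuArray(rand(ComplexF64, (N, N)))

C_gpu = A_gpu * B_gpu    # dispatches to the CUBLAS gemm wrapper
typeof(C_gpu)            # still a CuArray, so the result never left GPU memory
@which A_gpu * B_gpu     # shows that the * method defined by CuArrays is the one being called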

Specifically, this is where the CUBLAS implementations are dispatched to: https://github.com/JuliaGPU/CuArrays.jl/blob/cee6253edeca2029d8d0522a46e2cdbb638e0a50/src/blas/highlevel.jl#L90-L145

And this is the fallback generically-typed implementation (e.g. for use with Dual numbers or other types that are not supported by CUBLAS): https://github.com/JuliaGPU/CuArrays.jl/blob/cee6253edeca2029d8d0522a46e2cdbb638e0a50/src/matmul.jl#L4-L50
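
For example, element types that CUBLAS has no gemm routine for should end up in that generic kernel; a rough sketch (Int64 is just one such type, as far as I know):

using CuArrays

# CUBLAS has no integer gemm, so (I believe) this product falls back to
# the generic CuArrays matmul kernel instead of erroring.
A_int = CuArray(rand(1:10, 256, 256))
B_int = CuArray(rand(1:10, 256, 256))
C_int = A_int * B_int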

This helps. Can I ask whether this is the most efficient way of multiplying two matrices, or is there any “trick” one might employ to speed up the calculation even more?