Calling cublas within a kernel?

bcsj · January 12, 2023, 8:39am

Hi, I’m working on speeding up an iterative algorithm by running it on the GPU. Depending on the problem structure, it may at times take on the order of 10^6 iterations to reach a good solution, and I’d like to avoid incurring the launch overhead (small as it is per call) of calling multiple GPU kernels hundreds of thousands of times. Hence I’d like to port the whole iteration into a single kernel call.

Now, the iteration performs various linear algebra computations in steps. Some are rather simple to write a good kernel for. In other steps I need standard matrix-matrix multiplications, and while I could whip something up, I doubt that I can figure out how to truly optimize it.
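To make that concrete, here is a minimal sketch of the kind of matrix-matrix multiplication kernel I could write myself with CUDA.jl. The name `naive_matmul!`, the matrix sizes, and the 16×16 launch configuration are placeholders chosen for illustration. It uses one thread per output element and no shared-memory tiling, so I would not expect it to come anywhere near a tuned library routine.

```julia
using CUDA

# Naive C = A * B: one thread per output element, no shared-memory tiling.
# `naive_matmul!` is a made-up name for this sketch, not a library function.
function naive_matmul!(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x   # row of C
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y   # column of C
    if i <= size(C, 1) && j <= size(C, 2)
        acc = zero(eltype(C))
        for k in 1:size(A, 2)
            @inbounds acc += A[i, k] * B[k, j]
        end
        @inbounds C[i, j] = acc
    end
    return nothing
end

# Arbitrary sizes, chosen only for the example.
m, k, n = 128, 64, 96
A_d = CUDA.rand(Float32, m, k)
B_d = CUDA.rand(Float32, k, n)
C_d = CUDA.zeros(Float32, m, n)

threads = (16, 16)
blocks = (cld(m, threads[1]), cld(n, threads[2]))
@cuda threads=threads blocks=blocks naive_matmul!(C_d, A_d, B_d)

# Sanity check against the library product on the same inputs.
@assert Array(C_d) ≈ Array(A_d * B_d)
```

This gives the right answer, but every thread reads a full row of A and a full column of B straight from global memory, which is exactly the part I don't know how to optimize properly.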

So I figured: cuBLAS already exists, and one of its kernels gets called when I write something simple like

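using CUDA, LinearAlgebra

A_d = CuArray(A)  # copy the host matrices to the GPU
B_d = CuArray(B)
C_d = CuArray(C)
mul!(C_d, A_d, B_d)  # C_d = A_d * B_d, dispatched to cuBLAS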

So, is it possible to harness cuBLAS directly within my own kernel, without incurring additional overhead per iteration?
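For context, the host-driven version I'd like to get away from looks roughly like the sketch below. The update rule (multiply by `A_d`, then rescale) is only a stand-in for my real iteration, and `iterate_on_host!`, the sizes, and the iteration count are made up for illustration. The point is that every pass through the loop launches several kernels from the host.

```julia
using CUDA, LinearAlgebra

# Hypothetical stand-in for the real iteration: X <- A * X, rescaled each step.
function iterate_on_host!(X_d, Y_d, A_d, niter)
    for _ in 1:niter
        mul!(Y_d, A_d, X_d)     # cuBLAS GEMM, launched from the host
        Y_d ./= norm(Y_d)       # reduction plus broadcast kernel, also from the host
        X_d, Y_d = Y_d, X_d     # swap the buffers; no data is copied
    end
    return X_d
end

n = 256
A_d = CUDA.rand(Float32, n, n)
X_d = CUDA.rand(Float32, n, n)
Y_d = similar(X_d)

X_d = iterate_on_host!(X_d, Y_d, A_d, 1_000)
```

With on the order of 10^6 iterations, it is these per-iteration launches that I'd like to fold into a single kernel.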

From what I understood from this Stack Overflow question, it should at least be possible to call cuBLAS from within a kernel.

Alternatively, is there some website with a library of written and optimized GPU “kernel snippets”, where I could find a well-written and optimized matmatmul kernel?

edit: a typo
