Hi, I’m working on speeding up an iterative algorithm by running it on the GPU. Depending on the problem structure, it can take on the order of 10^6 iterations to reach a good solution, and I’d like to avoid incurring the launch overhead (small per call, but it adds up) of calling multiple GPU kernels hundreds of thousands of times. Hence I’d like to port the whole iteration into a single kernel call.

Now, the iteration performs various linear algebra computations in steps. Some are rather simple to write a good kernel for. In other steps I need standard matrix-matrix multiplications, and while I could whip something up, I doubt that I can figure out how to truly optimize it.

So I figured, CUBLAS already exists, and one of its kernels gets called when I write something simple like

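```
using CUDA, LinearAlgebra

# A, B and C are ordinary host matrices with compatible sizes
A_d = CuArray(A)
B_d = CuArray(B)
C_d = CuArray(C)
mul!(C_d, A_d, B_d)  # C_d = A_d * B_d, computed on the GPU
```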

So is it possible to use CUBLAS directly from within my own kernel, without incurring additional overhead per iteration?
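To make the overhead concern concrete, here is a rough sketch of the host-side loop I’m trying to get away from. The function name `run_iteration!`, the keyword `n_iters`, and the scaling step are made up for illustration and stand in for my real iteration; the only real call is `mul!` from LinearAlgebra.

```julia
using LinearAlgebra

# Hypothetical stand-in for the real iteration (names are illustrative only).
# Works on plain Arrays; with CuArray arguments (after `using CUDA`) every
# line in the loop body becomes at least one separate GPU kernel launch.
function run_iteration!(C, A, B; n_iters = 1000)
    for _ in 1:n_iters
        mul!(C, A, B)   # overwrites C with A * B; one library call per iteration
        C ./= 2         # placeholder for the simpler steps; one broadcast kernel
    end
    return C
end
```

Called as `run_iteration!(C_d, A_d, B_d; n_iters = 10^6)`, this would issue on the order of two million launches, which is exactly what I’d like to collapse into one kernel.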

From what I understood in this Stack Overflow question, it should at least be possible to call CUBLAS from within a kernel.

Alternatively, is there some website with a library of well-written, optimized GPU “kernel snippets”, where I could find a good matrix-matrix multiplication kernel?

edit: a typo