BLAS vs CUBLAS benchmark

I don’t suppose rocblas_get_stream(handle()) is the correct one to synchronize? I.e., would hipStreamSynchronize(rocblas_get_stream(handle())) be correct?
I’m also new to GPUs and don’t actually know what a stream is.

So for now, I used

gmul!(C,A,B) = (mul!(C,A,B); AMDGPU.HIP.hipDeviceSynchronize())

New results:


>10 TFLOPS is pretty good.
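
For anyone wanting to reproduce a figure like this, here is a minimal sketch of turning a synchronized matmul timing into TFLOPS. It is not necessarily how the numbers above were measured. The function name matmul_tflops and its sync and reps keywords are made up for illustration; for GPU arrays you would pass the device synchronization call (e.g. AMDGPU.HIP.hipDeviceSynchronize) as sync. The FLOP count assumes the usual 2*m*k*n for a dense m×k by k×n product.

```julia
using LinearAlgebra

# Hypothetical helper: time mul!(C, A, B) and convert the best run to TFLOPS.
# `sync` is an assumed hook that blocks until the device has finished;
# the default does nothing, which is appropriate for CPU arrays.
function matmul_tflops(C, A, B; sync = () -> nothing, reps = 5)
    # Warm-up run so compilation and library initialization are not timed.
    mul!(C, A, B)
    sync()
    best = Inf
    for _ in 1:reps
        t = @elapsed begin
            mul!(C, A, B)
            sync()  # without this, only the kernel launch would be timed
        end
        best = min(best, t)
    end
    m, k = size(A)
    n = size(B, 2)
    # A dense matmul performs about 2*m*k*n floating-point operations.
    return 2 * m * k * n / best / 1e12
end

# CPU usage example; swap in GPU arrays and a real `sync` to measure the device.
A = rand(Float32, 512, 512)
B = rand(Float32, 512, 512)
C = similar(A)
println(matmul_tflops(C, A, B), " TFLOPS")
```

Taking the minimum over a few repetitions is just one choice to reduce noise; BenchmarkTools would be the more careful option.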

I’ll test your PR with build system updates.
