I don’t suppose rocblas_get_stream(handle()) is the correct one to synchronize? I.e., would hipStreamSynchronize(rocblas_get_stream(handle())) be correct?
I’m also new to GPUs and don’t actually know what a stream is.
So for now, I used
gmul!(C,A,B) = (mul!(C,A,B); AMDGPU.HIP.hipDeviceSynchronize())
New results:
>10 TFLOPS is pretty good.
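For anyone wanting to reproduce the number: a rough sketch of how a TFLOPS figure for a dense matmul is usually derived, assuming the conventional count of 2*M*K*N floating-point operations (one multiply and one add per inner-product term). The helper names `matmul_flops` and `tflops` are made up for illustration, and the timing below runs on the CPU with plain `Array`s just so it works without a GPU. On the GPU the timed call would need to include the synchronization, as in `gmul!` above.

```julia
using LinearAlgebra

# Hypothetical helper: operation count for C = A*B with A of size M×K and B of size K×N.
# Each of the M*N outputs needs K multiplies and K adds.
matmul_flops(M, K, N) = 2 * M * K * N

# Hypothetical helper: convert a measured wall time in seconds to TFLOPS.
tflops(M, K, N, seconds) = matmul_flops(M, K, N) / seconds / 1e12

# CPU-only illustration (sizes chosen arbitrarily).
M, K, N = 512, 512, 512
A = rand(Float32, M, K)
B = rand(Float32, K, N)
C = Matrix{Float32}(undef, M, N)

mul!(C, A, B)                 # warm-up run so compilation is not timed
t = @elapsed mul!(C, A, B)    # time a single multiplication
println("approx. TFLOPS: ", tflops(M, K, N, t))
```

A single `@elapsed` sample is noisy; something like BenchmarkTools would give a steadier estimate.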
I’ll test your PR with build system updates.
