BLAS vs CUBLAS benchmark

Elrod · September 10, 2020, 5:21pm

I don’t suppose rocblas_get_stream(handle()) is the correct stream to synchronize? I.e., would hipStreamSynchronize(rocblas_get_stream(handle())) be correct?
I’m also new to GPUs and don’t actually know what a stream is.

So for now, I used

gmul!(C,A,B) = (mul!(C,A,B); AMDGPU.HIP.hipDeviceSynchronize())

New results:

>10 TFLOPS is pretty good.
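
For anyone wondering how a TFLOPS number like that is derived from a timing, here is a minimal sketch. It assumes the usual 2·m·n·k operation count for a dense matrix multiply. The helper names `gemm_flops` and `tflops` and the size `n = 512` are made up for illustration and are not taken from the benchmark above. It runs on the CPU only; on a GPU the timed call has to include the synchronization, as `gmul!` does, otherwise only the kernel launch gets timed.

```julia
using LinearAlgebra

# FLOP count for C = A*B with A (m×k) and B (k×n):
# one multiply and one add per term of each inner product.
gemm_flops(m, n, k) = 2 * m * n * k

# Convert an elapsed time in seconds into TFLOPS.
tflops(m, n, k, seconds) = gemm_flops(m, n, k) / seconds / 1e12

n = 512                                  # hypothetical size, small enough for a CPU
A = rand(Float32, n, n)
B = rand(Float32, n, n)
C = zeros(Float32, n, n)

mul!(C, A, B)                            # warm-up so compilation is not timed
t = minimum(@elapsed(mul!(C, A, B)) for _ in 1:5)   # best of 5 runs

println("n = $n: ", round(tflops(n, n, n, t), digits = 3), " TFLOPS")
```

Taking the minimum over a few runs is just a cheap way to reduce noise; BenchmarkTools.jl would be the more careful choice.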

I’ll test your PR with build system updates.
