BLAS vs CUBLAS benchmark


I’m trying to compare BLAS and CUBLAS performance with Julia.
For example, I want to compare matrix multiplication time.
Let A, B, and C be N×N matrices.
Which function should I use to get something like C = AB?
Will the standard A*B implementation (using BLAS) be the fastest one?
Is it parallelized by default?

Thanks for your help,

It really depends on what you want to measure. If you want to include the time it takes to allocate memory for the output, you can just use @btime $C = $A * $B and @btime CUDA.@sync $C = $A * $B. If you only care about the performance of the matrix multiplication itself and don't want your results to be skewed by allocation times, then you should do something like

C = similar(A, size(A, 1), size(B, 2))
@btime mul!($C, $A, $B)

Here’s an example:

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> let N = 1_000
           A = randn(N, N)
           B = randn(N, N)
           C = zeros(N, N)
           cuA, cuB, cuC = cu(A), cu(B), cu(C)
           @btime mul!($C, $A, $B)
           @btime CUDA.@sync mul!($cuC, $cuA, $cuB)
       end
  13.795 ms (0 allocations: 0 bytes)
  406.637 μs (7 allocations: 288 bytes)

Here’s an example of how I’d benchmark the scaling of OpenBLAS vs CuBLAS:
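A minimal sketch of that kind of scaling comparison (assuming CUDA.jl and a working GPU; the sizes and Float32 choice here are just illustrative) might look like:

```julia
using CUDA, LinearAlgebra, BenchmarkTools

# Compare OpenBLAS (CPU) and cuBLAS (GPU) GEMM across matrix sizes.
for N in (256, 512, 1024, 2048, 4096)
    A, B = randn(Float32, N, N), randn(Float32, N, N)
    C = zeros(Float32, N, N)
    cuA, cuB, cuC = cu(A), cu(B), cu(C)

    t_cpu = @belapsed mul!($C, $A, $B)
    t_gpu = @belapsed CUDA.@sync mul!($cuC, $cuA, $cuB)

    gflops(t) = 2e-9 * N^3 / t  # GEMM does 2*N^3 flops
    println("N = $N: CPU $(round(gflops(t_cpu), digits=1)) GFLOPS, ",
            "GPU $(round(gflops(t_gpu), digits=1)) GFLOPS")
end
```

Plotting GFLOPS against N for both backends makes the crossover point easy to spot.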


I wanted to measure actual calculation time (without allocation). Thanks!

Will this measure only the multiplication time? I was wondering whether everything gets compiled before the kernel is called (to keep compilation time out of the actual results).

Yes, everything should be compiled here before the benchmark loop is run.
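BenchmarkTools evaluates the expression before timing it, which triggers compilation; if you want to be explicit about it, you can also call the kernel once yourself before benchmarking (a minimal sketch, assuming CUDA.jl):

```julia
using CUDA, LinearAlgebra, BenchmarkTools

A, B = CUDA.randn(256, 256), CUDA.randn(256, 256)
C = similar(A)

mul!(C, A, B)             # first call: triggers any compilation
CUDA.@sync mul!(C, A, B)  # warmed up and synchronized

@btime CUDA.@sync mul!($C, $A, $B)  # compile time is now excluded
```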


The GPU is faster than my CPU, but at least the CPU is in the same ballpark:

julia> using LinearAlgebra; BLAS.vendor()

julia> M = K = N = 10^3;

julia> A = randn(Float32, M, K); B = randn(Float32, K, N); C = Matrix{Float32}(undef, M, N);

julia> @benchmark mul!($C, $A, $B)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     625.013 μs (0.00% GC)
  median time:      630.032 μs (0.00% GC)
  mean time:        629.326 μs (0.00% GC)
  maximum time:     982.538 μs (0.00% GC)
  samples:          7921
  evals/sample:     1

julia> 2e-9M*K*N / 625e-6 # 3.2 TFLOPS of single precision

julia> M = K = N = 10^4;

julia> A = randn(Float32, M, K); B = randn(Float32, K, N); C = Matrix{Float32}(undef, M, N);

julia> @benchmark mul!($C, $A, $B)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     483.007 ms (0.00% GC)
  median time:      483.146 ms (0.00% GC)
  mean time:        483.265 ms (0.00% GC)
  maximum time:     483.815 ms (0.00% GC)
  samples:          11
  evals/sample:     1

julia> 2e-9M*K*N / 483e-3 # 4.1 TFLOPS of single precision

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

630 (CPU) vs 410 (GPU) microseconds at 10^3, and
0.48s (CPU) vs 0.3s or so (GPU) for 10^4.

GPUs win at gemm of course, because they have more raw FLOPS and it’s possible to get close to 100% of peak. But it’d be interesting to see when the “crossing over” point is, where the GPU attains higher FLOPS than the CPU (using the same precision).


@jpsamaroo Any ideas?

CPU was a 7980XE, and GPU a Vega 64. I think the GPU should manage more than a couple GFLOPS.

FWIW, I also have these problems:

(@v1.5) pkg> build AMDGPU
   Building AMDGPU → `~/.julia/packages/AMDGPU/nnddY/deps/build.log`
┌ Error: Error building `AMDGPU`:
│ WARNING: redefinition of constant config_path. This may fail, cause incorrect answers, or produce other errors.
│ WARNING: redefinition of constant previous_config_path. This may fail, cause incorrect answers, or produce other errors.
│ Inconsistency detected by dl-close.c: 223: _dl_close_worker: Assertion `(*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded' failed!
└ @ Pkg.Operations ~/Documents/languages/julia/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:949

julia> using AMDGPU
┌ Warning: AMDGPU dependencies have not been built, some functionality may be missing.
│ Please run Pkg.build("AMDGPU") and reload AMDGPU.
└ @ AMDGPU ~/.julia/packages/AMDGPU/nnddY/src/AMDGPU.jl:86

I haven’t benchmarked rocBLAS at all, but have you tried increasing the matrix size even further? Memory transfer costs could be dominating at such a small size. Also, what AMDGPU.jl version are you on? You should test on 0.2.0 at a minimum, to get the ROCArray changes.

I’m not sure what’s up with that error, but the build step is exceptionally noisy and poorly implemented, and is something that I need to fix. However, the fact that you can run rocBLAS at all means that things are generally working :slight_smile:


It turns out that after building AMDGPU, isnothing(AMDGPU.librocblas) is true.
It must’ve been running a fallback.

If I instead include the build script from the REPL (after using BinaryProvider), suddenly it’s defined and seems to run much faster.

But I’m getting approximate equality failures between the CPU BLAS libs and ROCm. Any idea what may cause that?
Are the BLAS calls asynchronous, and they just didn’t finish writing the results yet? Do I have to wait on something (they don’t return an event)?

I’m on AMDGPU v0.2.0.


I posted a PR to at least make the builds quieter.

That’s probably GPUArrays generating a matmul kernel.

Yes, you need to wait on the call to complete. You can do so for the default device (which the external libs like rocBLAS use by default) with AMDGPU.HIP.hipDeviceSynchronize(). I’m not sure this will give the best results for benchmarking, though, because it syncs the whole device. You might be able to call into HIP to synchronize just the stream, but I haven’t tried it (check src/hip/libhip.jl for the calls you’ll need).

I filed an issue about returning an event token from external library calls, where possible.

I don’t suppose rocblas_get_stream(handle()) is the correct one to synchronize? I.e., that hipStreamSynchronize(rocblas_get_stream(handle())) would be correct?
I’m also new to GPUs and don’t actually know what a stream is.

So for now, I used

gmul!(C,A,B) = (mul!(C,A,B); AMDGPU.HIP.hipDeviceSynchronize())
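With that helper, a full timing run might look like the following hypothetical sketch (assumes AMDGPU.jl ≥ 0.2 and a working ROCm stack; the size is just illustrative):

```julia
using AMDGPU, LinearAlgebra, BenchmarkTools

N = 4096
A, B = ROCArray(randn(Float32, N, N)), ROCArray(randn(Float32, N, N))
C = similar(A)

# Synchronize the whole device after the multiply so timing is accurate.
gmul!(C, A, B) = (mul!(C, A, B); AMDGPU.HIP.hipDeviceSynchronize())

t = @belapsed gmul!($C, $A, $B)
println(2e-9 * N^3 / t, " GFLOPS")
```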

New results:

>10 TFLOPS is pretty good.

I’ll test your PR with build system updates.


Can you also test the performance with double precision?

Sure, but note of course that the Vega 64, like most consumer GPUs, is bad at Float64:

This claims the Vega 64 has a max of 791 GFLOPS of Float64, but I’ve measured more than that here. I think my card is clocked higher, but I haven’t fiddled with its speed, unlike the CPU.

Also worth pointing out that I originally had the 7980XE running at 3.8 GHz for all-core AVX-512, but it crashed while running the benchmark, so I lowered it to 3.7 GHz. If you want to test an overclock, MKL is the way to go :wink: .

If you want a GPU that’s good at Float64, get the Nvidia Titan V. It has about 7450 GFLOPS of Float64, so it should handily trounce the CPUs in the above benchmark. Or use the cloud, where you’ll probably get something like a V100, which has 7000 GFLOPS of Float64.

With AMD, you’d need an MI series card, which you’d probably have a hard time finding on the cloud, and costs a lot more money than a Titan V for similar performance (while also being a data center card, making it awkward for personal use).

Also, for fun, my 10980XE again, which retails for half the price of the 7980XE:

It did exceed 2000 GFLOPS of Float64.

The gap between MKL and OpenBLAS is also much larger, which makes me realize I didn’t overclock the cache (uncore) to match the core, so I’m guessing MKL is much more cache-friendly than OpenBLAS, and that they perhaps fare more similarly at closer-to-stock settings.
I should try overclocking the uncore a bit.