BLAS vs CUBLAS benchmark

Szymon_Zak · September 7, 2020, 3:17pm

Hello.

I’m trying to compare BLAS and CUBLAS performance with Julia.
For example, I want to compare matrix multiplication time.
Let A, B, C will be [NxN] matrices.
Which function, should I use to get something like C = AB?
Will standard AB implementation be the fastest one (using BLAS)?
Is it parallelized by default?

Thanks for your help,
Szymon

Mason · September 7, 2020, 7:38pm

It really depends on what you want to measure. If you care about the time it’ll take to allocate memory for the output, you can just use @btime $C = $A*$B and @btime CUDA.@sync $C = $A * $B. If you actually just care about the performance of the actual matrix multiplication and don’t want your results to be skewed by allocation times, then you should do something like

C = similar(A, size(A, 1), size(B, 2)
@btime mul!($C, $A, $B)

Here’s an example:

julia> using CUDA, BenchmarkTools

julia> let N = 1_000
           A = randn(N, N)
           B = randn(N, N)
           C = zeros(N, N)
           cuA, cuB, cuC = cu(A), cu(B), cu(C)
           
           @btime mul!($C, $A, $B)
           @btime CUDA.@sync mul!($cuC, $cuA, $cuB)
       end;
  13.795 ms (0 allocations: 0 bytes)
  406.637 μs (7 allocations: 288 bytes)

Mason · September 7, 2020, 8:01pm

Here’s an example of how I’d bencmark the scaling of OpenBLAS vs CuBLAS

Szymon_Zak · September 7, 2020, 8:24pm

I wanted to measure actual calculation time (without allocation). Thanks!

Szymon_Zak · September 7, 2020, 8:26pm

Will this calculate only multipilcation time? Looking at https://juliagpu.gitlab.io/CUDA.jl/ I was wondering if compilation before calling kernel will be OK (to avoid compiling time in actual results).

Mason · September 7, 2020, 8:30pm

Yes, everything should be compiled here before the benchmark loop is run.

Elrod · September 8, 2020, 6:44am

Faster than my CPU, but at least it’s in the same ballpark:

julia> using LinearAlgebra; BLAS.vendor()
 :mkl

julia> M = K = N = 10^3;

julia> A = randn(Float32, M, K); B = randn(Float32, K, N); C = Matrix{Float32}(undef, M, N);

julia> @benchmark mul!($C, $A, $B)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     625.013 μs (0.00% GC)
  median time:      630.032 μs (0.00% GC)
  mean time:        629.326 μs (0.00% GC)
  maximum time:     982.538 μs (0.00% GC)
  --------------
  samples:          7921
  evals/sample:     1

julia> 2e-9M*K*N / 625e-6 # 3.2 TFLOPS of single precision
 3200.0000000000005

julia> M = K = N = 10^4;

julia> A = randn(Float32, M, K); B = randn(Float32, K, N); C = Matrix{Float32}(undef, M, N);

julia> @benchmark mul!($C, $A, $B)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     483.007 ms (0.00% GC)
  median time:      483.146 ms (0.00% GC)
  mean time:        483.265 ms (0.00% GC)
  maximum time:     483.815 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1

julia> 2e-9M*K*N / 483e-3 # 4.1 TFLOPS of single precision
 4140.786749482402

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

630 (CPU) vs 410 (GPU) microseconds at 10^3, and
0.48s (CPU) vs 0.3s or so (GPU) for 10^4.

GPUs win at gemm of course, because they have more raw FLOPS and it’s possible to get close to 100% of peak. But it’d be interesting to see when the “crossing over” point is, where the GPU attains higher FLOPS than the CPU (using the same precision).

Elrod · September 10, 2020, 8:07am

@jpsamaroo Any ideas?

CPU was a 7980XE, and GPU a Vega 64. I think the GPU should manage more than a couple GFLOPS.

FWIW, I also have these problems:

(@v1.5) pkg> build AMDGPU
   Building AMDGPU → `~/.julia/packages/AMDGPU/nnddY/deps/build.log`
┌ Error: Error building `AMDGPU`:
│ WARNING: redefinition of constant config_path. This may fail, cause incorrect answers, or produce other errors.
│ WARNING: redefinition of constant previous_config_path. This may fail, cause incorrect answers, or produce other errors.
│ Inconsistency detected by ld.so: dl-close.c: 223: _dl_close_worker: Assertion `(*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded' failed!
└ @ Pkg.Operations ~/Documents/languages/julia/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:949

julia> using AMDGPU
┌ Warning: AMDGPU dependencies have not been built, some functionality may be missing.
│ Please run Pkg.build("AMDGPU") and reload AMDGPU.
└ @ AMDGPU ~/.julia/packages/AMDGPU/nnddY/src/AMDGPU.jl:86

jpsamaroo · September 10, 2020, 12:46pm

I haven’t benchmarked rocBLAS at all, but have you tried increasing the matrix size even further? Memory transfer costs could be dominating at such a small size. Also, what AMDGPU.jl version are you on? You should test on 0.2.0 at a minimum, to get the ROCArray changes.

I’m not sure what’s up with that ld.so error, but the build step is exceptionally noisy and poorly implemented, and is something that I need to fix. However, the fact that you can run rocBLAS at all means that things are generally working

Elrod · September 10, 2020, 12:54pm

It turns out after building AMDGPU, isnothing(AMDGPU.librocblas).
It must’ve been running a fallback.

If I instead include the build script from the REPL (after using BinaryProvider), suddenly it’s defined and seems to run much faster.

But I’m getting approximate equality failures between the CPU BLAS libs and ROCm. Any idea what may cause that?
Are the BLAS calls asynchronous, and they just didn’t finish writing the results yet? Do I have to wait on something (they don’t return an event)?

I’m on AMDGPU v0.2.0.

jpsamaroo · September 10, 2020, 2:58pm

I posted a PR to at least make the builds quieter.

That’s probably GPUArrays generating a matmul kernel.

Yes, you need to wait on the call to complete. You can do so for the default device (which the external libs like rocBLAS use by default) with AMDGPU.HIP.hipDeviceSynchronize(). I’m not sure this will give the best results for benchmarking, though, because it syncs the whole device. You might be able to call into HIP to synchronize just the stream, but I haven’t tried it (check src/hip/libhip.jl for the calls you’ll need).

I filed an issue about returning an event token from external library calls, where possible.

Elrod · September 10, 2020, 5:21pm

I don’t suppose rockblas_get_stream(handle()) the correct one to synchronize? I.e., that hipStreamSynchronize(rockblas_get_stream(handle())) would be correct?
I’m also new to GPUs and don’t actually know what a stream is.

So for now, I used

gmul!(C,A,B) = (mul!(C,A,B); AMDGPU.HIP.hipDeviceSynchronize())

New results:

>10 TFLOPS is pretty good.

I’ll test your PR with build system updates.

abulak · September 10, 2020, 9:56pm

can you also test the perf. with double precision?

Elrod · September 11, 2020, 4:17am

Sure, but note of course that the Vega 64, like most consumer GPUs, is bad at Float64:

This claims the Vega 64 has a max of 791 GFLOPS of Float64, but I’ve measured more than that here. I think my card is clocked higher, but I haven’t fiddled with its speed, unlike the CPU.

Also worth pointing out that I originally had the 7980XE running at 3.8 GHz for all-core AVX-512, but it crashed while running the benchmark, so I lowered it to 3.7 GHz. If you want to test an overclock, MKL is the way to go .

If you want a GPU that’s good at Float64, get the Nvidea Titan V. It has about 7450 GFLOPS of Float64, so it should handily trounce the CPUs in the above benchmark. Or use the cloud, where you’ll probably get something like a V100, which has 7000 GFLOPS of Float64.

With AMD, you’d need an MI series card, which you’d probably have a hard time finding on the cloud, and costs a lot more money than a Titan V for similar performance (while also being a data center card, making it awkward for personal use).

Also, for fun, my 10980XE again, which retails for half the price of the 7980XE:

It did exceed 2000 GFLOPS of Float64.

The gap between MKL and OpenBLAS is also much larger, which makes me realize I didn’t overclock the cache (uncore) to match the core, so I’m guessing MKL is much cache friendlier than OpenBLAS, and that they perhaps fair more similarly on closer-to-stock settings.
I should try overclocking the uncore a bit.

Topic		Replies	Views
Performance issue with multithreaded computation with matrix operations at its heart (Threads.@threads vs. BLAS threads) Performance blas , parallel , multithreading , linearalgebra , threads	7	391	November 13, 2023
Parallel computing with * Performance question	27	1095	December 29, 2022
Julia matrix-multiplication performance Performance linearalgebra	20	8493	October 30, 2022
Matrix vector multiplication Performance question	4	895	September 27, 2020
Alternate BLAS libraries? General Usage blas	22	2855	July 4, 2020

BLAS vs CUBLAS benchmark

Related topics