CLBlast, a tuned OpenCL BLAS library

ranocha · August 9, 2018, 11:26am

I’ve written a wrapper for CLBlast, a "tuned OpenCL BLAS library", which can be found at ranocha/CLBlast.jl on GitHub. Most parts seem to work and there is a performance benefit compared to CLBLAS.jl, e.g.

$ julia examples/matrix_matrix_multiplication.jl 

m = 1024, n = 1024, k = 1024, eltype = Float32
BLAS:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.199 ms (0.00% GC)
  median time:      4.849 ms (0.00% GC)
  mean time:        4.854 ms (0.00% GC)
  maximum time:     29.058 ms (0.00% GC)
  --------------
  samples:          1029
  evals/sample:     1
----------------------------------------------------------------------
Platform name   : NVIDIA CUDA
Platform version: OpenCL 1.2 CUDA 9.1.84
Device name     : GeForce GTX 1070 Ti
Device type     : gpu

CLBLAS:
BenchmarkTools.Trial: 
  memory estimate:  1.14 KiB
  allocs estimate:  52
  --------------
  minimum time:     7.463 μs (0.00% GC)
  median time:      853.560 μs (0.00% GC)
  mean time:        837.311 μs (0.00% GC)
  maximum time:     1.305 ms (0.00% GC)
  --------------
  samples:          1493
  evals/sample:     4
CLBlast:
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  7
  --------------
  minimum time:     722.555 μs (0.00% GC)
  median time:      737.823 μs (0.00% GC)
  mean time:        744.368 μs (0.00% GC)
  maximum time:     1.704 ms (0.00% GC)
  --------------
  samples:          6709
  evals/sample:     1

Is there any interest in having such a wrapper in JuliaGPU? It might also be possible to use CLBlast for CLArrays.jl etc.

ChrisRackauckas · August 9, 2018, 12:50pm

This looks awesome! Thanks @ranocha. Have you tested it on PDE work yet?

ranocha · August 9, 2018, 1:03pm

One of my main motivations has been to enable IterativeSolvers.jl for CLArrays.jl. This requires implementations of dot and nrm2, which works better with CLBlast than with CLBLAS. The methods still have to be added to CLArrays.jl. I haven’t tested the matrix multiplication etc. for PDEs, because I use custom OpenCL kernels there instead.
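To illustrate why dot and nrm2 matter for iterative solvers, here is a minimal conjugate gradient sketch on ordinary CPU arrays. It is not the IterativeSolvers.jl implementation and does not touch CLBlast or CLArrays; the function name `cg_sketch` and the small test system are made up for this example. The comments mark which BLAS routine each step corresponds to, and those routines are what a GPU array type would need to provide.

```julia
using LinearAlgebra

# Minimal conjugate gradient for a symmetric positive definite matrix A.
# Hypothetical helper for illustration only; not taken from IterativeSolvers.jl.
function cg_sketch(A, b; tol = 1e-10, maxiter = 10 * length(b))
    x = zero(b)
    r = copy(b)              # residual b - A*x for the initial guess x = 0
    p = copy(r)              # search direction
    Ap = similar(b)
    rr = dot(r, r)           # BLAS level 1: dot
    for _ in 1:maxiter
        norm(r) <= tol * norm(b) && break   # BLAS level 1: nrm2
        mul!(Ap, A, p)                      # BLAS level 2: gemv
        α = rr / dot(p, Ap)                 # BLAS level 1: dot
        axpy!(α, p, x)                      # BLAS level 1: axpy, x += α*p
        axpy!(-α, Ap, r)                    # BLAS level 1: axpy, r -= α*Ap
        rr_new = dot(r, r)
        β = rr_new / rr
        p .= r .+ β .* p                    # new search direction
        rr = rr_new
    end
    return x
end

A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x = cg_sketch(A, b)
println("residual norm: ", norm(A * x - b))
```

Every iteration calls dot and the norm at least once, so a slow or missing reduction on the device would dominate the solver even if the matrix-vector product is fast.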

maleadt · August 9, 2018, 1:13pm

It might make sense to integrate this functionality with CLArrays, like we’ve been integrating CuBLAS/Cu* into CuArrays (cc @sdanisch).

PetrKryslUCSD · August 9, 2018, 2:50pm

Really?

ranocha · August 9, 2018, 2:52pm

I know, this is really weird. I think there is some error in the CLBLAS test: a minimum time of 7.463 μs is just too fast. The median times seem to be okay, also in other tests.
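A rough sanity check of the timings reported above, assuming the usual count of 2·m·n·k floating-point operations for a matrix-matrix product. The helper names `gemm_flops` and `tflops` are made up for this sketch; the times are copied from the benchmark output in the first post.

```julia
# Floating-point operations for C = A*B with A of size m×k and B of size k×n
# (one multiply and one add per inner-product term).
gemm_flops(m, n, k) = 2 * m * n * k

# Throughput in TFLOP/s for a run that took `seconds`.
tflops(m, n, k, seconds) = gemm_flops(m, n, k) / seconds / 1e12

m = n = k = 1024

# Times in seconds, taken from the benchmark output in this thread.
timings = [
    ("BLAS median",    4.849e-3),
    ("CLBLAS minimum", 7.463e-6),
    ("CLBLAS median",  853.560e-6),
    ("CLBlast median", 737.823e-6),
]

for (label, t) in timings
    println(rpad(label, 16), round(tflops(m, n, k, t), digits = 2), " TFLOP/s")
end
```

The CLBLAS minimum corresponds to several hundred TFLOP/s, while the single-precision peak of a GTX 1070 Ti is roughly 8 TFLOP/s (an approximate figure that is not from this thread). That sample therefore cannot be a completed multiplication, whereas the median times of about 2.5 to 3 TFLOP/s are plausible.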

ranocha · August 9, 2018, 2:54pm

The ownership of CLBlast.jl has been transferred to JuliaGPU: https://github.com/JuliaGPU/CLBlast.jl.
