CLBlast, a tuned OpenCL BLAS library

I’ve written a wrapper for CLBlast, a "tuned OpenCL BLAS library". It can be found on GitHub at ranocha/CLBlast.jl. Most parts seem to work, and there is a performance benefit compared to CLBLAS.jl (a median of about 738 μs versus 854 μs in the Float32 run below), e.g.

$ julia examples/matrix_matrix_multiplication.jl 

m = 1024, n = 1024, k = 1024, eltype = Float32
BLAS:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.199 ms (0.00% GC)
  median time:      4.849 ms (0.00% GC)
  mean time:        4.854 ms (0.00% GC)
  maximum time:     29.058 ms (0.00% GC)
  --------------
  samples:          1029
  evals/sample:     1
----------------------------------------------------------------------
Platform name   : NVIDIA CUDA
Platform version: OpenCL 1.2 CUDA 9.1.84
Device name     : GeForce GTX 1070 Ti
Device type     : gpu

CLBLAS:
BenchmarkTools.Trial: 
  memory estimate:  1.14 KiB
  allocs estimate:  52
  --------------
  minimum time:     7.463 μs (0.00% GC)
  median time:      853.560 μs (0.00% GC)
  mean time:        837.311 μs (0.00% GC)
  maximum time:     1.305 ms (0.00% GC)
  --------------
  samples:          1493
  evals/sample:     4
CLBlast:
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  7
  --------------
  minimum time:     722.555 μs (0.00% GC)
  median time:      737.823 μs (0.00% GC)
  mean time:        744.368 μs (0.00% GC)
  maximum time:     1.704 ms (0.00% GC)
  --------------
  samples:          6709
  evals/sample:     1

Is there any interest in having such a wrapper in JuliaGPU? It might also be possible to use CLBlast for CLArrays.jl etc.

This looks awesome! Thanks @ranocha. Have you tested it on PDE work yet?

One of my main motivations has been to enable IterativeSolvers.jl for CLArrays.jl. For that, dot and nrm2 have to be implemented, which works better with CLBlast than with CLBLAS. The methods still have to be added to CLArrays.jl. I haven’t tested the matrix multiplication etc. for PDEs, because I use custom OpenCL kernels there instead.
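To illustrate why dot and nrm2 matter for iterative solvers, here is a minimal conjugate gradient sketch on ordinary CPU arrays, using only Julia's LinearAlgebra standard library. It is not the IterativeSolvers.jl implementation and does not use the CLBlast.jl API; the name `cg_sketch` and its keyword arguments are made up for this example. Apart from the matrix-vector product, the only reductions it needs are `dot` and `norm` (the latter being what nrm2 computes for a vector).

```julia
using LinearAlgebra

# Minimal conjugate gradient for a symmetric positive definite matrix A.
# Hypothetical helper for illustration only.
function cg_sketch(A, b; tol = 1e-8, maxiter = length(b))
    x = zero(b)
    r = copy(b)              # residual b - A*x for the initial guess x = 0
    p = copy(r)              # first search direction
    rs = dot(r, r)           # BLAS level-1 dot product
    for _ in 1:maxiter
        Ap = A * p
        alpha = rs / dot(p, Ap)
        x .+= alpha .* p
        r .-= alpha .* Ap
        # Convergence check on the Euclidean norm (what nrm2 computes)
        norm(r) <= tol * norm(b) && break
        rs_new = dot(r, r)
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return x
end

A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x = cg_sketch(A, b)
println(norm(A * x - b))     # residual norm of the computed solution
```

A GPU array type that provides these two reductions plus a matrix-vector product has, in principle, what a loop like this requires.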

It might make sense to integrate this functionality with CLArrays, like we’ve been integrating CuBLAS/Cu* into CuArrays (cc @sdanisch).

Really?

I know, this is really weird. I think there is some error in the test with CLBLAS: a minimum time of 7.463 μs is just too fast for a 1024×1024 matrix multiplication. The median times seem to be okay, also in other tests.
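As a rough sanity check, the timings can be converted into throughput, assuming the usual count of 2·m·n·k floating-point operations for a matrix-matrix product. The helper names `gemm_flops` and `gflops` below are made up for this sketch; the times are copied from the benchmark output above.

```julia
# Floating-point operations for C = A*B with A of size m×k and B of size k×n
# (one multiply and one add per inner-product term).
gemm_flops(m, n, k) = 2 * m * n * k

# Throughput in GFLOP/s for a given runtime in seconds.
gflops(m, n, k, seconds) = gemm_flops(m, n, k) / seconds / 1e9

m = n = k = 1024
timings = [
    ("CLBLAS minimum",  7.463e-6),
    ("CLBLAS median",   853.560e-6),
    ("CLBlast median",  737.823e-6),
    ("CPU BLAS median", 4.849e-3),
]

for (label, t) in timings
    println(rpad(label, 18), round(gflops(m, n, k, t); digits = 1), " GFLOP/s")
end
```

The CLBLAS minimum works out to roughly 288 TFLOP/s, while both GPU medians land between about 2.5 and 3 TFLOP/s. The former is about a hundred times the latter and far beyond what a consumer GPU of that generation should be able to sustain, which supports the suspicion that the minimum is a measurement artifact rather than a real runtime.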

The ownership of CLBlast.jl has been transferred to JuliaGPU: https://github.com/JuliaGPU/CLBlast.jl.
