I’ve written a wrapper for CLBlast, a "tuned OpenCL BLAS library", which can be found at https://github.com/ranocha/CLBlast.jl. Most parts seem to work, and there is a performance benefit compared to CLBLAS.jl, e.g.
```
$ julia examples/matrix_matrix_multiplication.jl
m = 1024, n = 1024, k = 1024, eltype = Float32
BLAS:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.199 ms (0.00% GC)
  median time:      4.849 ms (0.00% GC)
  mean time:        4.854 ms (0.00% GC)
  maximum time:     29.058 ms (0.00% GC)
  --------------
  samples:          1029
  evals/sample:     1
----------------------------------------------------------------------
Platform name   : NVIDIA CUDA
Platform version: OpenCL 1.2 CUDA 9.1.84
Device name     : GeForce GTX 1070 Ti
Device type     : gpu
CLBLAS:
BenchmarkTools.Trial:
  memory estimate:  1.14 KiB
  allocs estimate:  52
  --------------
  minimum time:     7.463 μs (0.00% GC)
  median time:      853.560 μs (0.00% GC)
  mean time:        837.311 μs (0.00% GC)
  maximum time:     1.305 ms (0.00% GC)
  --------------
  samples:          1493
  evals/sample:     4
CLBlast:
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  7
  --------------
  minimum time:     722.555 μs (0.00% GC)
  median time:      737.823 μs (0.00% GC)
  mean time:        744.368 μs (0.00% GC)
  maximum time:     1.704 ms (0.00% GC)
  --------------
  samples:          6709
  evals/sample:     1
```
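For context, a minimal usage sketch of such a wrapper might look like the following. This is only an illustration: the exact `CLBlast.gemm!` name and argument order are assumptions based on the usual BLAS naming conventions, and the OpenCL.jl setup calls are from memory, so the real wrapper's API may differ.

```julia
using OpenCL, CLBlast  # assumes the CLBlast.jl wrapper is installed

# Set up an OpenCL context and command queue with OpenCL.jl.
device = first(cl.devices(:gpu))
ctx    = cl.Context(device)
queue  = cl.CmdQueue(ctx)

# Host matrices for C = A * B.
A = rand(Float32, 1024, 1024)
B = rand(Float32, 1024, 1024)
C = zeros(Float32, 1024, 1024)

# Copy the data into device buffers.
A_cl = cl.Buffer(Float32, ctx, :copy, hostbuf=A)
B_cl = cl.Buffer(Float32, ctx, :copy, hostbuf=B)
C_cl = cl.Buffer(Float32, ctx, :copy, hostbuf=C)

# Hypothetical gemm! call following BLAS conventions:
# C = alpha * A * B + beta * C, computed on the device.
CLBlast.gemm!('N', 'N', 1.f0, A_cl, B_cl, 0.f0, C_cl, queue=queue)

# Read the result back to the host.
cl.copy!(queue, C, C_cl)
```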
Is there any interest in having such a wrapper in JuliaGPU? It could also be possible to use CLBlast as a backend for CLArrays.jl etc.