Best way to call an OpenCL kernel with arguments of type CLArray



I would like to perform some computations on GPUs using CLArrays, since these support standard array syntax. However, I would also like to call some existing OpenCL kernels on these CLArrays. Some approaches that seemed natural (at least to me) do not work.

using OpenCL, CLArrays
device, ctx, queue = cl.create_compute_context()
mult_kernel = """
kernel void mult(global float const* a, global float* b)
{
  int gid = get_global_id(0);
  b[gid] = 2*a[gid];
}
"""
p = cl.Program(ctx, source=mult_kernel) |> cl.build!
mult_cl = cl.Kernel(p, "mult")

# using buffers: calling kernels works, but buffers do not support the array interface
a = rand(Float32, 50_000)
a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buff = cl.Buffer(Float32, ctx, :rw, length(a))
queue(mult_cl, size(a), nothing, a_buff, b_buff)
b = cl.read(queue, b_buff)
@show norm(b - 2a)

# calling queue with arguments of type CLArray throws an error
d_a = CLArray(a)
d_b = CLArray(similar(a))
queue(mult_cl, size(a), nothing, d_a, d_b)

# using gpu_call with a julia function works, but I would like to call an existing OpenCL kernel
function mult_julia(state, a, b)
  idx = GPUArrays.@linearidx a state
  @inbounds b[idx] = 2*a[idx]
end
gpu_call(mult_julia, d_a, (d_a, d_b))
mapreduce(x->x^2, +, d_b-2*d_a)

# calling gpu_call with an OpenCL kernel throws an error
gpu_call(mult_cl, d_a, (d_a, d_b))

What is the best way to call existing OpenCL kernels with CLArrays as arguments?


After browsing the source of CLArrays.jl and OpenCL.jl, I might have found a solution.

# this works
ctx = CLArrays.context(d_a)
queue = CLArrays.global_queue(d_a)
p = cl.Program(ctx, source=mult_kernel) |> cl.build!
mult_cl = cl.Kernel(p, "mult")
queue(mult_cl, size(d_a), nothing, pointer(d_a), pointer(d_b))
mapreduce(x->x^2, +, d_b-2*d_a)

Here, it is essential that the command queue (queue) and the context (ctx) are the ones belonging to the CLArrays. Otherwise, I get CLError(code=-38, CL_INVALID_MEM_OBJECT).

Nevertheless, I would like to know whether this approach works in general and whether there is some better possibility.


I made a PR to integrate this a bit more nicely:


Your solution is also fine! :slight_smile:


Thank you very much, Simon!

In your PR, you wrote: “Note, that the caching of the functor is not very nice, so for repeated calls, one might want to do this part manually:”
So, for repeated calls, I should call clfunc = CLFunction(f, _args, ctx) only once and then use clfunc(_args, global_size, threads), correct?


Yes! Or benchmark the difference :wink: It would be interesting to know how bad the dictionary lookup really is :wink:
Another side effect is that I’m not hashing the actual kernel string, only the function name + function argument types.
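
To make that side effect concrete, here is a minimal sketch in plain Julia (no OpenCL needed). It is not the code from the PR: FakeKernel, KERNEL_CACHE, COMPILE_COUNT and cached_kernel are hypothetical names, and the only assumption taken from the thread is that the cache key consists of the function name and the argument types, not the kernel source.

```julia
# Hypothetical stand-in for a compiled kernel; it only remembers its source string.
struct FakeKernel
    name::String
    source::String
end

# Cache keyed by (kernel name, argument types); the source is NOT part of the key.
const KERNEL_CACHE = Dict{Any,FakeKernel}()

# Counts how often the "compilation" branch runs, to make cache hits visible.
const COMPILE_COUNT = Ref(0)

function cached_kernel(name::String, source::String, args...)
    key = (name, map(typeof, args))
    # get! runs the do-block only when the key is missing from the cache.
    get!(KERNEL_CACHE, key) do
        COMPILE_COUNT[] += 1
        FakeKernel(name, source)
    end
end

a = Float32[1, 2, 3]
b = similar(a)

k1 = cached_kernel("mult", "b[gid] = 2*a[gid];", a, b)
# Same name and argument types, but a different kernel body:
k2 = cached_kernel("mult", "b[gid] = 3*a[gid];", a, b)

println(COMPILE_COUNT[])   # number of times the do-block ran
println(k2.source)         # source of the kernel that was actually returned
```

With a key like this, the second call is a cache hit and returns the kernel built from the first source string. If the real cache behaves the same way, changing a kernel's body while keeping its name would require a new name or a manually constructed functor.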


Okay, thank you again. I will test it when I’m back at a machine running OpenCL…