Pointer to CuArray as a parameter for DifferentialEquations right-hand-side function

Thank you for the suggestion. It seems KernelAbstractions allows me to do what I want:

import CUDA
using KernelAbstractions

CUDA.allowscalar(false)


@kernel function func(u, p)
  f, = p
  I = @index(Global)
  u[I] = f[I]
end


N = 2048
f = ones(Float32, N)

u_gpu = CUDA.zeros(N)
f_gpu = CUDA.CuArray(f)
p = (f_gpu, )

kernel = func(CUDADevice(), 32)   # instantiate the kernel for CUDA with workgroup size 32

# launch kernel with original f ------------------------------------------------
event = kernel(u_gpu, p; ndrange=size(u_gpu))
wait(event)

u = CUDA.collect(u_gpu)
@show isequal(u, f)   # -> isequal(u, f) = true

# change f outside of kernel and launch the kernel once more -------------------
@. f_gpu = 2 * f_gpu
ev = Event(CUDADevice())   # needed for synchronization
event = kernel(u_gpu, p; ndrange=size(u_gpu), dependencies=ev)
wait(event)

u = CUDA.collect(u_gpu)
@show isequal(u, 2 .* f)   # -> isequal(u, 2 .* f) = true

As far as I understand, somewhere under the hood KernelAbstractions converts f into a CuDeviceArray. I would like to understand how I can do this conversion myself. If someone could help me with that, e.g. point me to the proper line in the sources, I would be very grateful (of course, I will keep trying to find it myself, though I failed after a brief look). A direct conversion, f_cda = CUDA.CuDeviceArray(f_gpu.dims, CUDA.DevicePtr(pointer(f_gpu))), does not work: I obtain a ReadOnlyMemoryError() when I try to access f_cda.
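
For what it's worth, my current guess (which I have not verified against the KernelAbstractions sources) is that the conversion is done with something like CUDA.cudaconvert, which is what the @cuda macro applies to its arguments:

import CUDA

f_gpu = CUDA.CuArray(ones(Float32, 2048))

# my guess: cudaconvert turns the CuArray into the CuDeviceArray that a kernel sees
f_dev = CUDA.cudaconvert(f_gpu)
@show typeof(f_dev)   # a CuDeviceArray type; valid only when used inside a kernel

But I am not sure this is the mechanism KernelAbstractions actually relies on, hence the question.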

I also have a question about the choice of the number of CUDA threads for KernelAbstractions (equal to 32 in the code above). For example, in DiffEqGPU.jl the number of CUDA threads is hard-coded to 256 (https://github.com/SciML/DiffEqGPU.jl/blob/7c4398df1d069dea17c8292ae067329670a6e4fc/src/DiffEqGPU.jl#L40). Can I choose it more wisely, e.g. as in CUDA.jl, where I can use a get_config function (see The most general way to estimate the optimal arguments for @cuda macro)? A sketch of the kind of choice I mean is below.
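
For reference, this is the occupancy-based choice I can make with plain CUDA.jl (a sketch with a trivial hypothetical gpu_copy! kernel of mine); I do not know how to do the equivalent through KernelAbstractions:

import CUDA

function gpu_copy!(u, f)
    I = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    if I <= length(u)
        @inbounds u[I] = f[I]
    end
    return nothing
end

N = 2048
u_gpu = CUDA.zeros(Float32, N)
f_gpu = CUDA.CuArray(ones(Float32, N))

# compile without launching, then query the occupancy API for a good block size
kernel = CUDA.@cuda launch=false gpu_copy!(u_gpu, f_gpu)
config = CUDA.launch_configuration(kernel.fun)
threads = min(N, config.threads)
blocks = cld(N, threads)
kernel(u_gpu, f_gpu; threads=threads, blocks=blocks)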