CUDA Array broadcasting: possible to use a different stream for a code block?

Is it possible to specify the stream for a code block whose array computations rely on CuArray’s broadcasting, e.g. A .= A .+ B in this snippet?

using CUDA
A = CUDA.zeros(2,3)
B = CUDA.ones(2,3)
A .= A .+ B


There currently isn’t. We need to set up some global but task-local state to set/get the stream, and make all functions (like broadcast) use that. For now though, we’ve switched to using the implicit per-thread stream, so if you perform those computations on a separate thread you should get the same effect.
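As a rough sketch of the separate-thread approach, something like the following might do. It assumes Julia was started with more than one thread (e.g. `julia -t 2`); note that `Threads.@spawn` does not guarantee the task lands on a different thread than the caller, so treat this as an illustration rather than a guarantee of a distinct stream.

```julia
using CUDA

A = CUDA.zeros(2, 3)
B = CUDA.ones(2, 3)

# Schedule the broadcast on a Julia task that may run on another thread.
# Assumption: work launched from that thread goes to its implicit per-thread stream.
t = Threads.@spawn begin
    A .= A .+ B
    CUDA.synchronize()  # wait for the GPU work queued above to finish
    nothing
end

wait(t)  # block the calling task until the spawned task is done
```

After `wait(t)` returns, `A` can be used from the main task as usual.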

Thanks @maleadt for the reply. However, we would like to run these computations on a high-priority stream, whereas the implicit per-thread streams have normal priority, I assume.