From this topic we can see that CUDA streams are supported. Are there any code examples using CUDA streams in CUDAnative to help with first steps?
I.e. how to create a stream, use it in a kernel call, sync on the stream, etc.
Lacking proper documentation (which I hope I’ll be able to get to in a couple of weeks), have a look at the tests:
- CUDAnative.jl/execution.jl at cfc9c5f0eb42824199b73afcdd6ada470b274010 · JuliaGPU/CUDAnative.jl · GitHub
- CUDAdrv.jl/stream.jl at 84dc2737bcb87ceea35e4dfa5bf4b3ff36ab09b0 · JuliaGPU/CUDAdrv.jl · GitHub
Basically, stream creation etc. is part of CUDAdrv, and in CUDAnative you just pass a stream argument to @cuda. And FYI, there isn’t a good mechanism to use streams with CuArrays yet.
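For readers looking for a starting point, here is a minimal sketch of the create / launch / sync steps asked about above. It is not taken from the linked tests. It assumes a CUDAnative release where @cuda accepts launch options as keywords (`threads=`, `blocks=`, `stream=`); older releases used a different launch syntax, so check the tests for your version. The kernel `add_one!` is a hypothetical example, and CuArrays is used only to allocate the buffer (the upload itself goes through the default stream).

```julia
using CUDAdrv, CUDAnative, CuArrays

# Hypothetical kernel for illustration: add 1 to every element of `a`.
function add_one!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] += 1f0
    end
    return nothing
end

n = 1024
d_a = CuArray(zeros(Float32, n))   # device buffer, uploaded on the default stream

s = CuStream()                     # stream creation lives in CUDAdrv

# Assumption: keyword launch syntax; the kernel is queued on stream `s`.
@cuda threads=256 blocks=cld(n, 256) stream=s add_one!(d_a)

CUDAdrv.synchronize(s)             # block until the work queued on `s` has finished

a = Array(d_a)                     # copy the result back to the host
```

Launching on a separate stream only makes overlap possible; whether two kernels actually run concurrently depends on the hardware and on how many resources each kernel occupies.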
Thanks for the examples!
@maleadt, to my understanding, these tests only check that CuStream creates a new, distinct stream at every invocation, but they do not test that these streams actually overlap during execution, i.e. that the kernels on these streams run concurrently (or am I getting it wrong?). Is there any test that checks this functionality?
I am asking because I cannot get streams to overlap, as reported in this topic. This is fundamental for overlapping communication and computation in my application…
No. If you have any suggestions for such tests, let me know.
I do not have any suggestions right now, but I will let you know if I come up with something while investigating the overlapping of streams.
Any idea/hints when/if streams will be supported in CuArrays?
I am using CuArrays, but the GPU is not fully utilized; I think there is room for 2-3 more on the GPU. As far as I understand and have measured (on Windows), running several Julia apps doesn’t load the GPU any further, due to the lack of MPS, and using threads or tasks wouldn’t get us far either, since streams are not supported by CuArrays and everything gets serialized on the default stream.
Are Linux and MPS the only way right now to fully load the GPU with CuArrays, or is there anything else? (For the case where one’s functions are small or inefficient code.)
I think it really depends on what you’re doing. Have you tried to see whether you can implement the behavior you want using the tools in CUDAnative? IIRC you can set the stream on kernel launch…
Yes, the first version of the code used CUDAnative. However, I really liked the simplicity of CuArrays and all the functionality one gets for “free”, notably fusion and easy testing against the CPU, so I switched to CuArrays. Now, trying to load the GPU more, I am thinking that running several streams should get us to the result faster, since I see the GPU is not 100% loaded.