Reporting allocations per stream in a multithreaded CUDA.jl application

Is there a reliable way to monitor CUDA.jl allocations for each stream in a multithreaded application? While trying to track down the source of some excessive allocations using the `CUDA.@time` macro, I am seeing some operations appear to allocate when they clearly don't, e.g.:

    CUDA.synchronize()
    @info "Not doing anything here..."
    @CUDA.time sleep(0.5)
    @info "Done"
    CUDA.synchronize()

which results in:

    [ Info: Not doing anything here...
      0.500947 seconds (857 CPU allocations: 43.234 KiB) (5 GPU allocations: 5.008 MiB, 0.01% memmgmt time)
    [ Info: Done

I assume that what is being reported here are allocations occurring on another stream/thread? Is this how `CUDA.@time` works, or is something else wacky going on here? Is there a better method for reporting allocations specifically for each stream/thread?

Yeah, `CUDA.@time` currently doesn't take tasks or threads into account. That would be a useful addition, though. Could you open an issue on the CUDA.jl repository?
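
For illustration, here's a minimal sketch that reproduces the effect (the task body and allocation sizes are made up for the demo, not taken from the original application): a concurrent task allocates on the GPU while the main task times a plain `sleep`, and those allocations show up in the report anyway, since the stats aren't tracked per task.

    using CUDA

    # A concurrent task that allocates on the GPU while we time something else.
    # `sleep` yields to the scheduler, so this task gets a chance to run.
    t = @async begin
        for _ in 1:5
            CUDA.zeros(Float32, 256, 1024)  # ~1 MiB per allocation
            sleep(0.05)
        end
    end

    # Times a CPU-only `sleep`, yet the GPU allocations made by the task
    # above are counted as well, because the counters are global.
    CUDA.@time sleep(0.5)

    wait(t)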