Is there a reliable way to monitor CUDA.jl allocations per stream in a multithreaded application? While trying to track down the source of some excessive allocations with the `@CUDA.time` macro, I am seeing operations appear to allocate that clearly don't, e.g.:
```julia
CUDA.synchronize()
@info "Not doing anything here..."
@CUDA.time sleep(0.5)
@info "Done"
CUDA.synchronize()
```
which results in:
```
[ Info: Not doing anything here...
  0.500947 seconds (857 CPU allocations: 43.234 KiB) (5 GPU allocations: 5.008 MiB, 0.01% memmgmt time)
[ Info: Done
```
I assume that what is being reported here are allocations occurring on another stream/thread during the timed region? Is this how `@CUDA.time` works, or is something else wacky going on here? Is there a better method for reporting allocations specifically for each stream/thread?
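For what it's worth, a minimal sketch of how I imagine this could be reproduced deliberately, assuming the hypothesis is right that `@CUDA.time` counts allocations process-wide rather than per task. This assumes Julia was started with multiple threads and relies on CUDA.jl's default of giving each Julia task its own stream:

```julia
using CUDA

# Spawn a task that allocates on the GPU from its own task-local stream
# while the main task is timing a region that does no GPU work itself.
t = Threads.@spawn begin
    xs = [CUDA.zeros(Float32, 256, 1024) for _ in 1:5]  # ~1 MiB each
    CUDA.synchronize()
    xs
end

# If allocation tracking is global, the spawned task's GPU allocations
# should show up in this report even though sleep() allocates nothing.
@CUDA.time sleep(0.5)

wait(t)
```

If that's indeed the behavior, the numbers above would just be whatever any other task happened to allocate during the timed window.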