You mentioned a slowdown because of compilation, but now you're saying it's GC-related?
Anyway, disabling the GC is not a magic bullet, and it requires you to do your own memory management, which is going to be very tricky. If you want to go that far, just use CUDA.unsafe_free! to inform CUDA.jl about allocations that can be collected; that should get you pretty far without actually doing your own memory management. Do note that this only drops the allocation's refcount, so if you have multiple outstanding objects using that buffer (e.g. a view), calling that function on a single instance won't do anything.
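Roughly, that could look like the sketch below. The function name `sum_of_squares` and the array sizes are made up for illustration; the only CUDA.jl-specific call that matters here is `CUDA.unsafe_free!`, and the example needs a working GPU to run.

```julia
using CUDA

# Hypothetical helper: allocates a temporary on the GPU and releases it
# eagerly instead of waiting for the GC to notice it is dead.
function sum_of_squares(x::CuArray{Float32})
    tmp = x .^ 2              # temporary GPU allocation
    result = sum(tmp)         # reduce to a host scalar
    CUDA.unsafe_free!(tmp)    # tell CUDA.jl this buffer can be reclaimed now
    return result             # tmp must not be touched after this point
end

x = CUDA.ones(Float32, 1024)
total = sum_of_squares(x)
```

This only pays off for temporaries that you are sure nothing else refers to. If `tmp` had been handed to other objects that share its buffer, each of them would need to be freed as well before the memory is actually released.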