Does the multithreaded func2 (without CUDA.unsafe_free! or CUDA.reclaim) actually run out of memory if you call it in a loop?
I guess you have seen this comment.
It would also be interesting to see what happens if you call CUDA.reclaim on all threads.