I just want to mention that these types of issues have been a serious headache for AlphaZero.jl from the very start. To be fair, the situation is much better now than it used to be: I remember a time under CuArrays 1.2 when 90% of training time was spent in GC! But I am still probably leaving a 2x performance factor on the table just because of bad memory management (AlphaZero.jl may be hit even harder than @Fabrice_Rosay’s implementation, as it performs more allocations for the sake of modularity).
As was discussed in this thread, this is one of the rare places where Python’s ref-counting strategy is a genuine win. And I have never been completely satisfied by the answers provided in the thread I cited, which basically come down to “this is not a big deal in Julia, since Julia makes it pretty easy to avoid allocating when necessary”.
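To make the ref-counting point concrete: in CPython, an object is reclaimed the moment its last reference dies, rather than at some later GC cycle, which is exactly the property that keeps large buffers (e.g. GPU arrays) from piling up between collections. Here is a minimal, CPython-specific toy illustration (the `Buffer` class is a made-up stand-in, not a real GPU type):

```python
import weakref

class Buffer:
    """Hypothetical stand-in for a large GPU allocation."""
    def __init__(self, name):
        self.name = name

freed = []
buf = Buffer("activations")
# Register a callback that fires the moment the object is destroyed.
weakref.finalize(buf, freed.append, "activations")

assert freed == []                # still referenced, still alive
buf = None                        # drop the last reference...
assert freed == ["activations"]   # ...and CPython reclaims it immediately
```

With a tracing GC, the second assertion would only hold after an (unpredictable) collection; with ref-counting, reclamation is deterministic and prompt.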
I am still wondering how much of this problem could be solved simply by a better runtime, and how much will always come down to developers eliminating allocations in their code and freeing resources manually when needed. In the latter case, having powerful tooling to identify and fix memory management issues strikes me as particularly important.