Does the OOM only happen in a shared cluster environment? If so, the thread "Garbage collection not aggressive enough on Slurm Cluster" might be relevant.
Did you check that a single sample does not cause an OOM on its own? Nothing in your code should prevent the garbage collector from freeing the memory from the previous iteration.
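
To separate the two cases (one sample is already too large vs. memory piling up across iterations), something like the following could help. It is only a rough sketch in Python, since I don't know your exact setup: `process_sample` is a made-up stand-in for one iteration of your loop, and the sizes are arbitrary. It compares the peak allocation of a single sample with the peak over several iterations. If the two numbers are close, memory from earlier iterations is being released and the problem is more likely the environment than your code.

```python
import gc
import tracemalloc


def process_sample(n):
    # Hypothetical stand-in for one iteration of the real workload.
    data = [float(i) for i in range(n)]
    return sum(data)


def peak_bytes(fn, *args):
    """Run fn(*args) and return (result, peak traced allocation in bytes)."""
    gc.collect()  # start from a clean state
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def run_many(count, n):
    # Process several samples in a row, forcing a collection after each one.
    for _ in range(count):
        process_sample(n)
        gc.collect()


# Step 1: how much does a single sample need on its own?
_, single_peak = peak_bytes(process_sample, 100_000)

# Step 2: does the peak grow when several samples are processed in sequence?
_, loop_peak = peak_bytes(run_many, 5, 100_000)

print(f"single sample peak: {single_peak} bytes")
print(f"five samples peak:  {loop_peak} bytes")
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries would not show up here. In that case watching the process RSS from outside would be the better check.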