Thanks for the responses. There are only a few places where I can reduce allocations further, because most of them happen in the Flux model and gradient calls. I was not able to reliably reproduce the exact behavior, but the issue was resolved by adding the line
GC.gc(true)
after each optimize_model! call.
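
For anyone hitting the same thing, here is a minimal sketch of where the call goes. The body of optimize_model! below is a hypothetical stand-in that only allocates temporaries (my real function wraps the Flux model and gradient calls), and run_all! and state are names made up for the example:

```julia
# Hypothetical stand-in for optimize_model!; the real function is not shown here.
# It only allocates a temporary matrix, roughly like a model/gradient call would.
function optimize_model!(state::Vector{Float64})
    tmp = rand(length(state), length(state))  # short-lived allocation
    state .+= vec(sum(tmp; dims = 1))         # fold column sums into the state
    return state
end

# Illustrative driver loop: force a full garbage collection after every call.
function run_all!(state::Vector{Float64}, n_runs::Integer)
    for _ in 1:n_runs
        optimize_model!(state)
        GC.gc(true)  # full collection, so garbage from this call is freed before the next one
    end
    return state
end

state = zeros(100)
run_all!(state, 5)
```

A full collection on every iteration costs some time, so this is probably only worth it when each call produces a lot of garbage, as it does in my case.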