Zygote gradient accumulation?

jw3126 · February 19, 2021, 9:15pm

I ported a neural network from pytorch to Flux and now I am immediately out of GPU memory.
I would like to better understand allocations in the backward pass in Flux/Zygote.

Say I have the following toy function:

function densenet(x0)
    x1 = x0
    x2 = x0 + x1
    x3 = x0 + x1 + x2
    x4 = x0 + x1 + x2 + x3
    x5 = x0 + x1 + x2 + x3 + x4
    ...
    return xn
end

Now I want to compute the pullback of densenet. I am interested in the peak memory usage if I do this.

Mathematically, the differential x̄3 at say x3 is the sum of the differentials x̄4…x̄n.
Is it correct, that x̄4…x̄n need to be alive in memory, for Zygote to compute x̄3?
I think in pytorch, this is not the case and x̄3 will be accumulated incrementally as the x̄i materialize. Is that correct?

ToucheSir · February 20, 2021, 12:58am

If it’s not necessary for PyTorch, then in theory it’s not necessary for Zygote either since both are performing reverse mode AD. However, there are some implementation differences that likely affect memory usage. For one, Zygote does quite a bit of copying for certain operations. In-place accumulation is on the roadmap, but it’s not fully fleshed out yet. Also, PyTorch’s autograd engine is (according to their docs and my observations in practice) very proactive about buffer reuse and freeing allocations early. Lastly, CUDA.jl’s allocator integrates with the Julia GC and thus isn’t guaranteed to immediately “free” (or release to a pool/arena) temporary allocations like a reference counting system would. This doesn’t matter in most cases, but can lead to OOMs when you’re close to the VRAM limit and a library like CuDNN decides to carry out its own little allocation on the side.

To help evaluate where your network allocates on the backwards pass, you could use Zygote.@code_adjoint or @code_{typed/lowered/warntype} on the pullback itself. Do post your findings in #domain:ml or GitHub—it would be great if we could catalog and prioritize some of these performance gaps.

jw3126 · February 20, 2021, 7:15pm

Thanks for the answer! Following your suggestion I opened an issue Incremental accumulation of gradients? · Issue #905 · FluxML/Zygote.jl · GitHub

Topic		Replies	Views
Memory allocations with Zygote Numerics question , memory-allocation , zygote , gradient	8	725	November 15, 2023
Memory usage seems extremely high in standard Flux, is it any better with Zygote? Machine Learning	2	1313	August 12, 2019
Zygote Performance Machine Learning question	22	4981	September 23, 2019
How can I reduce memory consumption when using Zygote and Integrals together General Usage question , package , memory , zygote , integral	0	306	July 22, 2022
Zygote pullback vs gradient New to Julia flux , zygote	2	2166	December 25, 2020

Zygote gradient accumulation?

Related topics