Speed up compilation with Zygote and PINN

Matthieu_BARREAU · May 31, 2023, 7:43am

Hi!

I am relatively new to Julia and I am trying to move all my research projects to this new ecosystem instead of Python.

However, I have trouble when my project deals with Physics informed neural network. Since I am working on how to improve PINN, I need to keep control of what is happening and I cannot use any package. I am trying on my own but I am facing some compilation time that are excessively long. Here is a very simple example:

using Flux, Zygote

x_train = [2; 0;; 0; 1] * rand(Float32, 2, 10)
target = rand(Float32, 1, 10)

model = Chain(
    Dense(2 => 15, tanh),
    Dense(15 => 15, tanh),
    Dense(15 => 1)
)

function get_residual(f::Chain)
    df(u) = Zygote.gradient(u -> sum(f(u)), u)[1]
    #ddf(u) = Zygote.gradient(u -> sum(df(u)), u)[1]
    return u -> df(u)[1] .+ (1.0f0 .- 2.0f0 .* f(u)) .* df(u)[2] #.- 0.001f0 .* ddf(u)[2]
end

r = get_residual(model)
@info "Getting residuals"
timed = @timed r(x_train)
@show timed.time

function withgradient(model, x_train)
    loss, grads = Flux.withgradient(model) do m
        y_hat = m(x_train)
        loss = Flux.Losses.mse(y_hat, target)
        penalty = Flux.Losses.mse(get_residual(m)(x_train), zeros(Float32, 1, 10))
        loss + penalty
    end
    (loss, grads)
end

@info "Getting gradients"
timed = @timed withgradient(model, x_train)
@show timed.time

It takes a very long time to compile first time, even if the model is very simple (25s for getting the residuals and 264s for the gradients). Moreover, if I uncomment the lines in the function get_residual then it almost never finishes compilation. Such a code was running in a couple of seconds in Tensorflow before.

Do you have some suggestions to improve this code? Is there anything I am doing wrong? I tried using Yota (which is faster in my case) but it cannot compute the gradients in the end…

Best,
Matthieu

Matthieu_BARREAU · May 31, 2023, 3:03pm

I found that using ForwardDiff for getting the residuals is slightly more efficient (but can deal with the additional viscosity).
Any idea on how to use forward mode to get the residual and reverse mode for the grads?
I saw this post: How to achieve good performance with Zygote.pushforward on a neural network which might solve this issue, however, it is outdated and the codes there are not working anymore.

de-souza · May 31, 2023, 3:20pm

Hi,

Based on what was said in a recent topic, I would recommend the approach taken by TaylorDiff.jl for higher order automatic differentiation. See the discussion and links in that topic:

Matthieu_BARREAU · June 1, 2023, 7:57am

Thanks @de-souza for this reference. Based on this work, I propose the following (not really optimized) solution that seems to work well and fast. However, when we compare with the fully Zygote solution, there is a difference in the gradients that is not that negligible. But this is difficult to say which one is the best since we do not have any reference.

using Flux, Zygote, ForwardDiff, TaylorDiff, SliceMap

x_train = [2; 0;; 0; 1] * rand(Float32, 2, 10)
target = rand(Float32, 1, 10)

model = Chain(
    Dense(2 => 15, tanh),
    Dense(15 => 15, tanh),
    Dense(15 => 1)
)

function get_residual(f::Chain)
    df(u) = Zygote.gradient(u -> sum(f(u)), u)[1]
    return u -> df(u)[[1], :] .+ (1.0f0 .- 2.0f0 .* f(u)) .* df(u)[[2], :] # Cannot do higher order
end

function get_residual_forward(f::Chain)
    df(u) = ForwardDiff.gradient(u -> sum(f(u)), u)
    #ddf(u) = ForwardDiff.gradient(u -> sum(df(u)), u)
    return u -> df(u)[[1], :] .+ (1.0f0 .- 2.0f0 .* f(u)) .* df(u)[[2], :] #.- 0.001f0 .* ddf(u)[2]
end

function get_residual_Taylor(f::Chain, x)
    x = convert(Vector{Float32}, x)
    dfdt(u) = TaylorDiff.derivative(u -> sum(f(u)), u, [1.0f0, 0.0f0], 1)
    dfdx(u) = TaylorDiff.derivative(u -> sum(f(u)), u, [0.0f0, 1.0f0], 1)
    #dfdxx(u) = TaylorDiff.derivative(u -> sum(f(u)), u, [0.0f0, 1.0f0], 2)
    return dfdt(x) .+ (1.0f0 .- 2.0f0 .* f(x)) .* dfdx(x) #.- 0.001f0 .* dfdxx(x)
end

function get_residual_fd(f::Chain)
    ε = cbrt(eps(Float32))
    ε₁ = [ε; 0]
    ε₂ = [0; ε]
    V(x) = (1.0f0 .- 2.0f0 .* f(x))
    return x -> (f(x .+ ε₁) - f(x)) / ε .+ V(x) .* (f(x .+ ε₂) - f(x)) / ε
end

r = get_residual(model)
@info "Getting residuals (Zygote)"
timed = @timed r(x_train)
@show timed

r_forward = get_residual_forward(model)
@info "Getting residuals (ForwardDiff)"
timed = @timed r_forward(x_train)
@show timed

r_taylor(x) = get_residual_Taylor(model, x)
@info "Getting residuals (TaylorDiff)"
timed = @timed mapcols(x -> get_residual_Taylor(model, x), x_train)
@show timed

r_fd = get_residual_fd(model)
@info "Getting residuals (FiniteDifference)"
timed = @timed r_fd(x_train)
@show timed

@info "Getting gradients (Zygote)"
function withgradient_Zygote(model, x_train)
    loss, grads = Zygote.withgradient(model) do m
        y_hat = m(x_train)
        loss = Flux.Losses.mse(y_hat, target)
        penalty = Flux.Losses.mse(get_residual(m)(x_train), zeros(Float32, 1, 10))
        loss + penalty
    end
    (loss, grads)
end

timed_Zygote = @timed withgradient_Zygote(model, x_train)
@show timed_Zygote

@info "Getting gradients (TaylorDiff)"
function withgradient_taylor(model, x_train)
    loss, grads = Zygote.withgradient(model) do m
        y_hat = m(x_train)
        loss = Flux.Losses.mse(y_hat, target)
        r = mapcols(x -> get_residual_Taylor(model, x), x_train)
        penalty = Flux.Losses.mse(r, zeros(Float32, 1, 10))
        loss + penalty
    end
    (loss, grads)
end

timed_taylor = @timed withgradient_taylor(model, x_train)
@show timed_taylor

The solution with Zygote takes around 175sec to compile and run (first time) while the solution with TaylorDiff is about 10 times less (14sec).

One solution would be to compare on a simple example with an analytical solution but I am not sure I have the will for doing that now.

I hope that can help you too @de-souza , tell me what you think!

de-souza · June 1, 2023, 1:01pm

Thanks a lot for your solution. Perhaps the gradients are different because the residuals are not exactly equal when computed with Taylor or Zygote. Anyways I will save your solution for when I need it so thanks again!

Topic		Replies	Views
Autograd through loss function with derivatives in Machine Learning	1	881	March 25, 2021
Is it possible to do Nested AD ~elegantly~ in Julia? (PINNs) General Usage machine-learning	43	3209	September 27, 2024
How to achieve good performance with Zygote.pushforward on a neural network Performance zygote	3	1018	May 31, 2023
[Flux / Zygote] Circumvent long compilation times Machine Learning flux , zygote	1	470	April 12, 2022
[Zygote] Derivative of derivative compilation time Machine Learning	4	509	March 4, 2023

Speed up compilation with Zygote and PINN

Related topics