Translating TensorFlow to Flux and SimpleChains and not getting the same results

dmetivie · March 15, 2023, 1:36pm

A few days ago I started working on an ML project.
I tried Flux.jl and SimpleChains.jl for pure-Julia deep learning, alongside TensorFlow, but I could not get the three to agree!

The data: let's say the model tries to infer the mean (and possibly the standard deviation) of a sample distribution.

using Flux
using SimpleChains
using PyCall, Conda
using Distributions
using Plots
n = 20
N = 5000

T = Float32
x_train = rand(T, n, N)
y_train = mean(x_train, dims=1)
# y_train = vcat(mean(x_train, dims=1), std(x_train, dims=1))
dim_output = size(y_train, 1)

Nepoch = 5

With TensorFlow (using PyCall)

#* tensorflow *#
py"""
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Sequential
"""

py"""# data to tensorflow format
y_train = tf.constant($(permutedims(y_train)), dtype=tf.float32)
x_train = tf.constant($(permutedims(x_train)), dtype=tf.float32)
"""

py"""# Model definition
inputs = Input(shape=($n,))
x = Dense(512, activation="relu")(inputs)
x = Dropout(0.1)(x, training=True)
x = Dense(512, activation="relu")(x)
x = Dropout(0.5)(x, training=True)
outputs = Dense($dim_output)(x)
model = tf.keras.Model(inputs, outputs)
"""

py"""# optimiser
model.compile(loss="mean_squared_error", optimizer="adam")
"""

# training
history_tf = py"model.fit(x_train, y_train, epochs=$Nepoch, verbose=0)"
losses_TF = history_tf.history["loss"]

I understand there were some major changes to Flux recently, and I was a bit lost as to which training style to use for my model. I decided to write an explicit for loop so that I could save the loss at each iteration (I could not find an option for that with train!).

With Flux.jl

#* Flux *#

model_flux = Chain(
    Dense(n => 512, Flux.relu),
    Flux.Dropout(0.1),
    Dense(512 => 512, Flux.relu),
    Flux.Dropout(0.5),
    Dense(512 => dim_output)
)

optim_flux = Flux.setup(Adam(), model_flux)

losses_Flux = []

for epoch in 1:Nepoch
    for (x, y) in [(x_train, y_train)]
        loss, grads = Flux.withgradient(model_flux) do m
            # Evaluate model and loss inside gradient context:
            y_hat = m(x)
            Flux.mse(y_hat, y)
        end
        Flux.update!(optim_flux, model_flux, grads[1])
        push!(losses_Flux, loss)  
    end
end
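
One difference that may matter when comparing against the Keras run: model.fit uses batch_size=32 unless told otherwise, so each Keras epoch takes many optimiser steps, whereas the loop above takes a single full-batch step per epoch. Below is a minimal sketch of a mini-batched Flux loop. The names loader, batch_losses and epoch_losses are illustrative only, and averaging the batch losses is only an approximation of the per-epoch loss that Keras reports.

```julia
using Flux, Statistics

n, N = 20, 5000
x_train = rand(Float32, n, N)
y_train = mean(x_train, dims=1)

model = Chain(
    Dense(n => 512, relu),
    Dropout(0.1),
    Dense(512 => 512, relu),
    Dropout(0.5),
    Dense(512 => 1),
)
opt_state = Flux.setup(Adam(), model)

# Mini-batches of 32 samples, reshuffled every epoch (assumed to mirror the Keras defaults)
loader = Flux.DataLoader((x_train, y_train); batchsize=32, shuffle=true)

epoch_losses = Float32[]
for epoch in 1:5
    batch_losses = Float32[]
    for (x, y) in loader
        # One optimiser step per mini-batch
        loss, grads = Flux.withgradient(m -> Flux.mse(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
        push!(batch_losses, loss)
    end
    # Average of the batch losses seen during this epoch
    push!(epoch_losses, mean(batch_losses))
end
```

With N = 5000 and a batch size of 32 this takes 157 steps per epoch instead of one, which by itself could account for a large gap between the loss curves.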

With SimpleChains.jl. Like train!, train_unbatched! does not seem to have an option to save the loss automatically, so I call train_unbatched! for one iteration at a time and record the loss after each call.
BTW: it is really unclear from the repository and the docs that the result of model_SC(x_train, p) is the loss and not y_hat.
I had to figure out that model_SC_evaluate(x_train, p) (the chain without the loss) returns the predictions.

# * SimpleChains * #
model_SC_evaluate = SimpleChain(
    static(n), # input dimension (optional)
    TurboDense{true}(SimpleChains.relu, 512), # dense layer with bias 
    SimpleChains.Dropout(0.1), # dropout layer
    TurboDense{true}(SimpleChains.relu, 512), # dense layer with bias 
    SimpleChains.Dropout(0.5), # dropout layer
    TurboDense{false}(identity, dim_output) # dense layer without bias 
)

model_SC = SimpleChains.add_loss(model_SC_evaluate, SquaredLoss(y_train))
p = SimpleChains.init_params(model_SC)
g = similar(p)

losses_SC = []
for epoch in 1:Nepoch
    SimpleChains.train_unbatched!(g, p, model_SC, x_train, SimpleChains.ADAM(), 1)
    push!(losses_SC, model_SC(x_train, p))
end

Now the 3 losses are wildly different. OK, they have random initializations, but the difference here is too large to be explained by that.

What am I doing wrong?

ToucheSir · March 18, 2023, 10:14pm

How different is wildly different? Do the models converge to a similar loss value and accuracy?

bertschi · March 19, 2023, 9:01am

Have not worked through all details (and in order to really debug such issues you will need to use the same weights across all models), but part of the problem seems to be that SimpleChains does not compute the mean squared loss (despite claiming otherwise in the documentation):

julia> x = [1.0, 2.0, 3.0];

julia> y = [4.0, 5.0, 6.0];

julia> mean((x .- y).^2)
9.0

julia> Flux.mse(x, y)
9.0

julia> SquaredLoss(y)(x, nothing, nothing)
(13.5, nothing, nothing)

According to the source (check with @less), it actually computes $\frac{1}{2} \sum_i (y_i - \hat{y}_i)^2$ instead.

dmetivie · March 23, 2023, 4:19pm

Thanks! Now it is better when I do

push!(losses_SC, model_SC(x_train, p)/N)
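
A note on the scaling: if SquaredLoss returns half the sum of squared errors, as reported above, then dividing by N alone still leaves a factor of 1/2 relative to the mean squared error. A minimal sketch of the conversion in plain Julia, reusing the numbers from the example above (half_sse_to_mse is a hypothetical helper, not part of SimpleChains):

```julia
using Statistics

y    = [4.0, 5.0, 6.0]
yhat = [1.0, 2.0, 3.0]

half_sse = 0.5 * sum(abs2, yhat .- y)   # the quantity SquaredLoss is reported to compute
mse      = mean(abs2, yhat .- y)        # what Flux.mse and Keras "mean_squared_error" compute

# Hypothetical helper: rescale half the sum of squared errors to a mean squared error
half_sse_to_mse(l, nelements) = 2 * l / nelements

@assert half_sse == 13.5
@assert mse == 9.0
@assert half_sse_to_mse(half_sse, length(y)) ≈ mse
```

For the 1×N target used here, that would mean pushing 2 * model_SC(x_train, p) / N rather than model_SC(x_train, p) / N.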

dmetivie · March 23, 2023, 4:25pm

Here is what it looks like with @bertschi rescaling:

plot(losses_TF, label="losses_TF")
plot!(losses_Flux, label="losses_Flux")
plot!(losses_SC, label="losses_SC")

[Plot: loss per epoch for losses_TF, losses_Flux and losses_SC]

It looks like only the TensorFlow loss decreases steadily, while the other two decrease but oscillate.

I agree with @ToucheSir, I should start with the same initial conditions (I need to understand how to do that in the 3 frameworks).
However, all 3 are initialized with the same distributions of weights, so I would not expect such differences.

bertschi · March 23, 2023, 10:31pm

How do you know from just three samples? To get a feel for the initial error distribution you probably want to draw thousands of initial weights for each setup and check if the loss distributions are similar.
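
A rough sketch of that check for the Flux side only, assuming the same layer sizes as in the original post. Dropout is left out here so that only the weight initialisation varies, and ndraws is an illustrative value that you would want to increase:

```julia
using Flux, Statistics

n, N = 20, 5000
x = rand(Float32, n, N)
y = mean(x, dims=1)

# A fresh model with newly drawn initial weights on every call (no dropout)
build_model() = Chain(
    Dense(n => 512, relu),
    Dense(512 => 512, relu),
    Dense(512 => 1),
)

ndraws = 200  # illustrative; use more draws for a smoother picture
initial_losses = [Flux.mse(build_model()(x), y) for _ in 1:ndraws]

println("initial loss: mean = ", mean(initial_losses),
        ", std = ", std(initial_losses))
```

Collecting the same statistic from the untrained Keras and SimpleChains models would give three distributions to compare, for example as histograms.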

ToucheSir · March 24, 2023, 3:08pm

It should not be too difficult to transplant a copy of the TF weights to the Flux model either. I do agree that this is something best quantified over many different draws.
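
A minimal sketch of such a transplant for a single Dense layer. It relies on the fact that Keras stores a Dense kernel as (inputs, units) while Flux stores the weight as (out, in), so the kernel has to be transposed. load_keras_dense! is a hypothetical helper, and the random arrays below are stand-ins for the real Keras weights:

```julia
using Flux

# Hypothetical helper: copy one Keras Dense layer's parameters into a Flux Dense layer.
# Keras kernel has size (in, out); Flux weight has size (out, in).
function load_keras_dense!(layer::Dense, kernel::AbstractMatrix, bias::AbstractVector)
    layer.weight .= permutedims(kernel)
    layer.bias .= bias
    return layer
end

# Stand-in arrays with the shapes Keras would give for Dense(512) on 20 inputs
kernel = rand(Float32, 20, 512)
bias   = rand(Float32, 512)

layer = Dense(20 => 512, relu)
load_keras_dense!(layer, kernel, bias)

# The Flux layer now computes relu(kernelᵀ * x + bias) on column-major batches
x = rand(Float32, 20, 3)
@assert layer(x) ≈ relu.(permutedims(kernel) * x .+ bias)
```

On the Python side, model.get_weights() returns the arrays in the order kernel, bias, kernel, bias, and so on; I am assuming PyCall hands these back as ordinary Julia arrays, which is worth checking before relying on it.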
