Translating TensorFlow to Flux and SimpleChains and not getting the same results

For the past few days, I have been working on an ML project.
I wanted to compare TensorFlow with Flux.jl and SimpleChains.jl (for pure-Julia deep learning), but I could not make them agree!

The data: let's say the model tries to infer the mean (and possibly the standard deviation) of each sample drawn from a distribution.

using Flux
using SimpleChains
using PyCall, Conda
using Distributions
using Plots
n = 20
N = 5000

T = Float32
x_train = rand(T, n, N)
y_train = mean(x_train, dims=1)
# y_train = vcat(mean(x_train, dims=1), std(x_train, dims=1))
dim_output = size(y_train, 1)

Nepoch = 5

With TensorFlow (using PyCall)

#* tensorflow *#
py"""
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Sequential
"""

py"""# data to tensorflow format
y_train = tf.constant($(permutedims(y_train)), dtype=tf.float32)
x_train = tf.constant($(permutedims(x_train)), dtype=tf.float32)
"""

py"""# Model definition
inputs = Input(shape=($n,))
x = Dense(512, activation="relu")(inputs)
x = Dropout(0.1)(x, training=True)
x = Dense(512, activation="relu")(x)
x = Dropout(0.5)(x, training=True)
outputs = Dense($dim_output)(x)
model = tf.keras.Model(inputs, outputs)
"""

py"""# optimiser
model.compile(loss="mean_squared_error", optimizer="adam")
"""

# training
history_tf = py"model.fit(x_train, y_train, epochs=$Nepoch, verbose=0)"
losses_TF = history_tf.history["loss"]

I understand there were some major changes in Flux recently, and I was not sure which training style to use for my model. I settled on an explicit for loop so that I can save the loss at each iteration (I could not find an option for that with train!).

With Flux.jl

#* Flux *#

model_flux = Chain(
    Dense(n => 512, Flux.relu),
    Flux.Dropout(0.1),
    Dense(512 => 512, Flux.relu),
    Flux.Dropout(0.5),
    Dense(512 => dim_output)
)

optim_flux = Flux.setup(Adam(), model_flux)

losses_Flux = []

for epoch in 1:Nepoch
    for (x, y) in [(x_train, y_train)]
        loss, grads = Flux.withgradient(model_flux) do m
            # Evaluate model and loss inside gradient context:
            y_hat = m(x)
            Flux.mse(y_hat, y)
        end
        Flux.update!(optim_flux, model_flux, grads[1])
        push!(losses_Flux, loss)  
    end
end

With SimpleChains.jl. Like train!, train_unbatched! does not seem to have an option to save the loss automatically, so I call train_unbatched! for a single iteration inside a loop and record the loss after each call.
BTW: it is really unclear from the GitHub repository and the docs that model_SC(x_train, p) returns the loss and not y_hat.
I had to figure out that model_SC_evaluate(x_train, p) (the chain without the loss) returns y_hat.

# * SimpleChains * #
model_SC_evaluate = SimpleChain(
    static(n), # input dimension (optional)
    TurboDense{true}(SimpleChains.relu, 512), # dense layer with bias 
    SimpleChains.Dropout(0.1), # dropout layer
    TurboDense{true}(SimpleChains.relu, 512), # dense layer with bias 
    SimpleChains.Dropout(0.5), # dropout layer
    TurboDense{false}(identity, dim_output) # dense layer without bias 
)

model_SC = SimpleChains.add_loss(model_SC_evaluate, SquaredLoss(y_train))
p = SimpleChains.init_params(model_SC)
g = similar(p)

losses_SC = []
for epoch in 1:Nepoch
    SimpleChains.train_unbatched!(g, p, model_SC, x_train, SimpleChains.ADAM(), 1)
    push!(losses_SC, model_SC(x_train, p))
end

Now the three losses are wildly different. OK, the models are randomly initialized, but the difference is too large to be explained by that.

What am I doing wrong?

How different is wildly different? Do the models converge to a similar loss value and accuracy?

I have not worked through all the details (and to really debug such issues you will need to use the same weights across all models), but part of the problem seems to be that SimpleChains does not compute the mean squared error (despite claiming otherwise in the documentation):

julia> x = [1.0, 2.0, 3.0];

julia> y = [4.0, 5.0, 6.0];

julia> mean((x .- y).^2)
9.0

julia> Flux.mse(x, y)
9.0

julia> SquaredLoss(y)(x, nothing, nothing)
(13.5, nothing, nothing)

According to the source (check with @less), it actually computes $\frac{1}{2} \sum_i (y_i - \hat{y}_i)^2$ instead.
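To make the scaling concrete, here is a minimal sketch in plain Julia, assuming the formula above is what SquaredLoss computes. The names half_sse and mse are hypothetical stand-ins defined here and are not part of either library.

```julia
using Statistics

y_hat = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# Hypothetical stand-in for the half sum of squared errors described above
half_sse(ŷ, y) = sum(abs2, ŷ .- y) / 2

# Plain mean squared error, matching mean((x .- y).^2)
mse(ŷ, y) = mean(abs2, ŷ .- y)

raw = half_sse(y_hat, y)           # 13.5, same value as SquaredLoss(y)(x, ...) above
rescaled = 2 * raw / length(y)     # 9.0, same value as Flux.mse(x, y) above
```

Under that assumption, dividing by the number of elements alone would still leave a factor of 1/2 relative to the mean squared error.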

Thanks! Now it is better when I do

push!(losses_SC, model_SC(x_train, p)/N)

Here is what it looks like with @bertschi's rescaling:

plot(losses_TF, label="losses_TF")
plot!(losses_Flux, label="losses_Flux")
plot!(losses_SC, label="losses_SC")

[Figure: loss per epoch for losses_TF, losses_Flux, and losses_SC]

It looks like only the TensorFlow loss decreases steadily, while the other two decrease overall but oscillate.

I agree with @ToucheSir: I should start from the same initial conditions (I still need to work out how to do that in the three frameworks).
However, all three are initialized with the same distribution of weights, so I would not expect such differences.

How do you know from just three samples? To get a feel for the initial error distribution you probably want to draw thousands of initial weights for each setup and check if the loss distributions are similar.
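As a rough sketch of what that could look like on the Flux side, the following draws many fresh initializations of the architecture from the first post and collects the loss before any training. The names build_model, initial_loss, and ndraws are hypothetical, the output size is fixed to 1, and ndraws is kept small here to keep the run short. The same experiment would have to be repeated in the other two frameworks to compare distributions.

```julia
using Flux, Statistics, Random

Random.seed!(0)
n, N = 20, 5000
x = rand(Float32, n, N)
y = mean(x, dims=1)

# Same architecture as the Flux model in the first post (single output assumed)
build_model() = Chain(
    Dense(n => 512, relu),
    Dropout(0.1),
    Dense(512 => 512, relu),
    Dropout(0.5),
    Dense(512 => 1),
)

# Loss of a freshly initialized model, before any training step
function initial_loss()
    m = build_model()
    Flux.trainmode!(m)   # keep dropout active, as it is during training
    return Flux.mse(m(x), y)
end

ndraws = 200             # increase to a few thousand for a smoother picture
initial_losses = [initial_loss() for _ in 1:ndraws]

println("mean = ", mean(initial_losses), ", std = ", std(initial_losses))
```

A histogram of initial_losses for each framework would then show whether the starting points are really comparable.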

It should not be too difficult to transplant a copy of the TF weights into the Flux model either. I do agree that this is something best quantified over many different draws.
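A minimal sketch of the transplant for one layer, assuming the weights have already been pulled into Julia as plain arrays (for example from Keras' model.get_weights() through PyCall, which is not shown or tested here). The helper load_keras_dense! is hypothetical. The one thing to watch is the layout: Keras stores a Dense kernel as (in, out), while Flux stores the weight as (out, in).

```julia
using Flux

# Hypothetical helper: copy a Keras-style Dense kernel and bias into a Flux Dense layer.
# Keras kernel shape is (in, out); Flux weight shape is (out, in), hence the permutedims.
function load_keras_dense!(layer::Dense, kernel::AbstractMatrix, bias::AbstractVector)
    layer.weight .= permutedims(kernel)
    layer.bias .= bias
    return layer
end

# Stand-in arrays with the shapes Keras would return for the first layer
kernel = rand(Float32, 20, 512)
bias = rand(Float32, 512)

layer = Dense(20 => 512, relu)
load_keras_dense!(layer, kernel, bias)

# The layer now reproduces the Keras computation relu(x * kernel + bias), column-wise
x = rand(Float32, 20, 3)
expected = relu.(permutedims(kernel) * x .+ bias)
```

For the model in this thread, the same call would be applied to the Dense layers at positions 1, 3, and 5 of the Chain, using consecutive kernel/bias pairs.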