The same network performs differently in Flux.jl and TensorFlow

Hi,

I am writing a toy model to test the performance of Flux.jl. I generated some dummy data with the following code

import numpy as np

traindata=np.random.random((10000,50))
target=np.random.random(10000)

np.savetxt("traindata.csv",traindata,delimiter=',')
np.savetxt("target.csv",target,delimiter=',')

and then wrote a single-dense-layer model with a relu activation to perform a non-linear regression.

In Python with TensorFlow, the code is

import numpy as np
import tensorflow as tf

traindata=np.loadtxt("traindata.csv",delimiter=',')
target=np.loadtxt("target.csv",delimiter=',')
print(traindata.shape,target.shape)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1,input_shape=(50,),activation='relu',kernel_initializer='glorot_uniform'),
])
model.compile(optimizer='adam',loss='mean_squared_error',metrics=['mean_squared_error'])
model.fit(traindata,target,epochs=100,verbose=2)

and in Julia with Flux.jl, it is

using Base.Iterators: repeated
using CSV,Random,Printf
using Flux
using Flux: glorot_uniform

# transpose so that samples are stored as columns, which is what Flux expects
traindata = Matrix(CSV.read("traindata.csv"; header=false))'
target = Matrix(CSV.read("target.csv"; header=false))'

model = Chain(Dense(50, 1, relu, initW = glorot_uniform))
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM()
dataset = repeated((traindata, target), 100)   # the full dataset repeated 100 times, one gradient step each
evalcb = () -> @show(loss(traindata, target))
Flux.train!(loss, params(model), dataset, opt, cb = evalcb)

However, their results are very different. In Python with TensorFlow, the MSE loss decreases very quickly

Epoch 1/100
10000/10000 - 0s - loss: 0.1981 - mean_squared_error: 0.1981
Epoch 2/100
10000/10000 - 0s - loss: 0.1423 - mean_squared_error: 0.1423
Epoch 3/100
10000/10000 - 0s - loss: 0.1033 - mean_squared_error: 0.1033
Epoch 4/100
10000/10000 - 0s - loss: 0.0896 - mean_squared_error: 0.0896
Epoch 5/100
10000/10000 - 0s - loss: 0.0861 - mean_squared_error: 0.0861
Epoch 6/100
10000/10000 - 0s - loss: 0.0851 - mean_squared_error: 0.0851
Epoch 7/100
10000/10000 - 0s - loss: 0.0845 - mean_squared_error: 0.0845
Epoch 8/100
10000/10000 - 0s - loss: 0.0847 - mean_squared_error: 0.0847
Epoch 9/100
10000/10000 - 0s - loss: 0.0843 - mean_squared_error: 0.0843
Epoch 10/100
10000/10000 - 0s - loss: 0.0844 - mean_squared_error: 0.0844

and the final loss after 100 epochs is about 0.08.

But in Julia with Flux.jl, the loss decreases slowly and seems to be trapped in a local minimum.

loss(traindata, target) = 0.20698824682017267 (tracked)
loss(traindata, target) = 0.20629590458383318 (tracked)
loss(traindata, target) = 0.20560309354360407 (tracked)
loss(traindata, target) = 0.2049097923861889 (tracked)
loss(traindata, target) = 0.20421840230183272 (tracked)
loss(traindata, target) = 0.20352757445130545 (tracked)
loss(traindata, target) = 0.20283026868343568 (tracked)
loss(traindata, target) = 0.20213053943995535 (tracked)
loss(traindata, target) = 0.20142913955620284 (tracked)
loss(traindata, target) = 0.20072485457048353 (tracked)

The final loss after 100 epochs is still about 0.17.

The experiment has been repeated several times to rule out the influence of the random seed, but the trend is the same: the model built with TensorFlow performs better than the model built with Flux.jl, even though they have the same structure, activation, and initialization. What is the reason behind this frustrating phenomenon?

Thank you very much!


As there is no relationship between the input and the output, the best the NN can do is return the mean, i.e. 0.5. The expected MSE is therefore 1/12 ≈ 0.0833 (the variance of a standard uniform distribution). So TensorFlow converges to the correct result, while Flux still seems to give essentially random predictions, which is indeed strange.
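As a quick sanity check of that baseline (a minimal sketch: it draws fresh uniform targets rather than reading the posted target.csv), the MSE of the best constant predictor can be computed directly:

using Statistics

# targets drawn from Uniform(0, 1): mean ≈ 0.5, variance ≈ 1/12 ≈ 0.0833
y = rand(10_000)

best_constant = mean(y)                     # the optimal constant prediction
mse_best = mean((y .- best_constant).^2)    # ≈ var(y) ≈ 1/12

println(mse_best)   # ~0.083
println(1 / 12)     # 0.08333…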

As a test, I would try a different activation function, as relu has a zero gradient for negative values.
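For illustration, relu's zero gradient on negative inputs can be seen directly (a minimal sketch; Flux.gradient assumes a Zygote-based Flux release, while older Tracker-based versions expose Tracker.gradient instead):

using Flux

relu(-2.0)                   # 0.0: negative pre-activations are clamped to zero
Flux.gradient(relu, -2.0)    # (0.0,): no gradient flows back, so a "dead" unit cannot recover
Flux.gradient(relu, 2.0)     # (1.0,): positive inputs pass the gradient through unchanged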

Thank you. As you suggested, I tried a linear activation (in TensorFlow it is 'linear' and in Flux.jl it is identity), but the trend remains the same.

In Flux.jl the loss is

loss(traindata, target) = 0.26388273830343645 (tracked)
loss(traindata, target) = 0.254745100985269 (tracked)
loss(traindata, target) = 0.24702623084351585 (tracked)
loss(traindata, target) = 0.24071751586112944 (tracked)
loss(traindata, target) = 0.23578285877461885 (tracked)
loss(traindata, target) = 0.23215070433838075 (tracked)
loss(traindata, target) = 0.2297068552543899 (tracked)
loss(traindata, target) = 0.22829085699168883 (tracked)
loss(traindata, target) = 0.22769951121435944 (tracked)
loss(traindata, target) = 0.22770057233706334 (tracked)

The final loss after 100 epochs is about 0.17.

In TensorFlow the loss is

Epoch 1/100
10000/10000 - 1s - loss: 0.2868 - mean_squared_error: 0.2868
Epoch 2/100
10000/10000 - 1s - loss: 0.1848 - mean_squared_error: 0.1848
Epoch 3/100
10000/10000 - 1s - loss: 0.1390 - mean_squared_error: 0.1390
Epoch 4/100
10000/10000 - 1s - loss: 0.1101 - mean_squared_error: 0.1101
Epoch 5/100
10000/10000 - 1s - loss: 0.0951 - mean_squared_error: 0.0951
Epoch 6/100
10000/10000 - 1s - loss: 0.0883 - mean_squared_error: 0.0883
Epoch 7/100
10000/10000 - 1s - loss: 0.0858 - mean_squared_error: 0.0858
Epoch 8/100
10000/10000 - 1s - loss: 0.0847 - mean_squared_error: 0.0847
Epoch 9/100
10000/10000 - 1s - loss: 0.0844 - mean_squared_error: 0.0844
Epoch 10/100
10000/10000 - 1s - loss: 0.0844 - mean_squared_error: 0.0844

The final loss after 100 epochs is 0.0833.

Could the batch size be the issue? It seems that Keras defaults to 32 if it is unspecified (https://keras.io/models/model/).

It seems to work with a batch size of 32 (and still the relu activation function):

using Base.Iterators: partition
using CSV, Random, Printf
using Flux
using Flux: glorot_uniform

traindata = Matrix(CSV.read("traindata.csv"; header=false))'
target = Matrix(CSV.read("target.csv"; header=false))'

model = Chain(Dense(50, 1, relu, initW = glorot_uniform))
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM()

# split the 10000 samples (columns) into mini-batches of 32
dataset_batch = [(traindata[:, ind], target[:, ind]) for ind in partition(1:length(target), 32)];

for epoch = 1:100
    Flux.train!(loss, params(model), dataset_batch, opt)
    @show epoch, loss(traindata, target)
end

After 10 epochs I now get:

(epoch, loss(traindata, target)) = (1, 0.15714580257711558 (tracked))
(epoch, loss(traindata, target)) = (2, 0.11063598723179667 (tracked))
(epoch, loss(traindata, target)) = (3, 0.09125624982175756 (tracked))
(epoch, loss(traindata, target)) = (4, 0.08590421903194571 (tracked))
(epoch, loss(traindata, target)) = (5, 0.0845466921288617 (tracked))
(epoch, loss(traindata, target)) = (6, 0.08419594869003737 (tracked))
(epoch, loss(traindata, target)) = (7, 0.08407588604495675 (tracked))
(epoch, loss(traindata, target)) = (8, 0.08400488181621797 (tracked))
(epoch, loss(traindata, target)) = (9, 0.08394463978187157 (tracked))
(epoch, loss(traindata, target)) = (10, 0.08388823812964562 (tracked))
[...]

Friends don't let friends use minibatches larger than 32.


This is the reason. The official documentation of Flux.jl does not seem to mention how to set the batch size. Perhaps I should open an issue asking them to add that information. Thank you very much!


I agree, that information is not so easy to find.
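For reference, besides the Iterators.partition approach above, here is a hedged sketch of mini-batching with Flux's DataLoader (the Flux.Data.DataLoader constructor and its keyword names assume a reasonably recent Flux version, so check the documentation of the release you are using); it reuses model, traindata and target from the earlier posts:

using Flux

# traindata is 50×10000 (features × samples), target is 1×10000
loader = Flux.Data.DataLoader((traindata, target), batchsize = 32, shuffle = true)

loss(x, y) = Flux.mse(model(x), y)
opt = ADAM()
for epoch = 1:100
    Flux.train!(loss, params(model), loader, opt)
end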