Hi,

I am writing a toy model to test the performance of `Flux.jl`. I generated some dummy data with the following code:

```python
import numpy as np
traindata=np.random.random((10000,50))
target=np.random.random(10000)
np.savetxt("traindata.csv",traindata,delimiter=',')
np.savetxt("target.csv",target,delimiter=',')
```

and then wrote a single dense layer model with ReLU activation to perform a non-linear regression.

In Python with `tensorflow`, the code is:

```python
import numpy as np
import tensorflow as tf
traindata=np.loadtxt("traindata.csv",delimiter=',')
target=np.loadtxt("target.csv",delimiter=',')
print(traindata.shape,target.shape)
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(1,input_shape=(50,),activation='relu',kernel_initializer='glorot_uniform'),
])
model.compile(optimizer='adam',loss='mean_squared_error',metrics=['mean_squared_error'])
model.fit(traindata,target,epochs=100,verbose=2)
```

and in Julia with `Flux.jl`, it is:

```julia
using Base.Iterators: repeated
using CSV,Random,Printf
using Flux
using Flux: glorot_uniform
traindata = Matrix(CSV.read("traindata.csv"; header=false))'  # 50×10000: Flux expects features × samples
target = Matrix(CSV.read("target.csv"; header=false))'        # 1×10000
model=Chain(Dense(50,1,relu,initW = glorot_uniform))
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM()
dataset = repeated((traindata, target), 100)  # 100 full-batch gradient steps
evalcb = () -> @show(loss(traindata, target))
Flux.train!(loss, params(model), dataset, opt, cb=evalcb)
```
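One setup difference I am unsure about (assuming Keras's default `batch_size` of 32, since `fit` is called without one): `model.fit` performs one optimizer update per mini-batch, whereas the `repeated((traindata, target), 100)` dataset above gives Flux one full-batch update per repetition. A quick count of the updates each run performs, just for comparison:

```python
import math

n_samples = 10000   # rows in traindata.csv
keras_batch = 32    # Keras fit() default batch_size (assumption, not set explicitly above)
epochs = 100

# Keras: one optimizer update per mini-batch, per epoch
keras_updates = math.ceil(n_samples / keras_batch) * epochs

# Flux code above: repeated((traindata, target), 100)
# -> one full-batch update per repetition
flux_updates = 100

print(keras_updates)  # 31300
print(flux_updates)   # 100
```

If that is right, the two runs are not doing the same number of ADAM steps, but I have not checked whether this alone explains the gap.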

However, their results are very different. In Python with `tensorflow`, the MSE loss decreases very fast:

```
Epoch 1/100
10000/10000 - 0s - loss: 0.1981 - mean_squared_error: 0.1981
Epoch 2/100
10000/10000 - 0s - loss: 0.1423 - mean_squared_error: 0.1423
Epoch 3/100
10000/10000 - 0s - loss: 0.1033 - mean_squared_error: 0.1033
Epoch 4/100
10000/10000 - 0s - loss: 0.0896 - mean_squared_error: 0.0896
Epoch 5/100
10000/10000 - 0s - loss: 0.0861 - mean_squared_error: 0.0861
Epoch 6/100
10000/10000 - 0s - loss: 0.0851 - mean_squared_error: 0.0851
Epoch 7/100
10000/10000 - 0s - loss: 0.0845 - mean_squared_error: 0.0845
Epoch 8/100
10000/10000 - 0s - loss: 0.0847 - mean_squared_error: 0.0847
Epoch 9/100
10000/10000 - 0s - loss: 0.0843 - mean_squared_error: 0.0843
Epoch 10/100
10000/10000 - 0s - loss: 0.0844 - mean_squared_error: 0.0844
```

and the final loss after 100 epochs is about 0.08.

But in Julia with `Flux.jl`, the loss decreases slowly and seems to be trapped in a local minimum:

```
loss(traindata, target) = 0.20698824682017267 (tracked)
loss(traindata, target) = 0.20629590458383318 (tracked)
loss(traindata, target) = 0.20560309354360407 (tracked)
loss(traindata, target) = 0.2049097923861889 (tracked)
loss(traindata, target) = 0.20421840230183272 (tracked)
loss(traindata, target) = 0.20352757445130545 (tracked)
loss(traindata, target) = 0.20283026868343568 (tracked)
loss(traindata, target) = 0.20213053943995535 (tracked)
loss(traindata, target) = 0.20142913955620284 (tracked)
loss(traindata, target) = 0.20072485457048353 (tracked)
```

The final loss after 100 epochs remains around 0.17.

I have repeated the experiment several times to rule out the influence of the random seed, but the trend is the same: the model built with `tensorflow` performs better than the one built with `Flux.jl`, even though they have the same structure, activation, and initialization. What is the reason behind this frustrating discrepancy?

Thank you very much!