Comparison with Flux leads to odd results in Flux

Hello,
I’m developing a toy ML library (BetaML) to learn a bit about ML algorithms (I’m pretty much a newbie).

I tested BetaML with a bike sharing demand forecast example and compared it with Flux:

Runnable Binder notebook: https://mybinder.org/v2/gh/sylvaticus/BetaML.jl/master?filepath=notebooks%NN%20-%20Bike%20sharing%20demand%20forecast%20(daily%20db).ipynb

However, when I use the same model structure, data, training algorithm and hyperparameters, I experience strange behaviour in Flux: e.g. the data predicted by Flux for the training sample seems to be truncated, and the predictions do not vary as much as the BetaML results:

BetaML output:
image
Flux output:
image
BetaML output:
image
Flux output:
image
BetaML output:
image
Flux output:
image

Note that the BetaML results also tend to underestimate the high demand levels observed in the validation period, but in a less pronounced way than Flux.

I am wondering what causes this difference… weight initialisation?

Are you using relu activation functions? It looks like the output has been clamped in a way typical for relu. If this is the case, the problem tends to go away with further training or using some strategy to make it easier to find a better minimum, such as residual connections etc.
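To illustrate the clamping effect mentioned above, a minimal sketch (with made-up values, not data from this thread): relu zeroes every negative pre-activation, so a network whose final pre-activations are mostly negative produces flat, truncated-looking output early in training.

```julia
using Flux

# relu(x) = max(0, x): every negative pre-activation is clamped to zero,
# which can make an undertrained network's predictions look truncated.
preactivations = [-2.0, -0.5, 0.0, 1.5]
outputs = Flux.relu.(preactivations)
println(outputs)   # the three non-positive values all collapse to 0.0
```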

You can also try to normalize your data prior to training so that it has mean zero and variance one; that also helps with this problem.
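A minimal sketch of that standardisation, using a toy observations × features matrix as a stand-in for the real training data: each feature gets mean zero and standard deviation one, and the same training-set statistics must be reused to transform validation data and to invert the scaling on predictions.

```julia
using Statistics

# Toy training matrix: 4 observations × 2 features (stand-in for xtrain).
xtrain = [1.0 1000.0; 2.0 2000.0; 3.0 3000.0; 4.0 4000.0]

# Column-wise standardisation: subtract each feature's mean and divide by
# its standard deviation, both computed on the training set only.
μ = mean(xtrain, dims=1)
σ = std(xtrain, dims=1)
xtrain_s = (xtrain .- μ) ./ σ

# Each column of xtrain_s now has mean ≈ 0 and standard deviation ≈ 1.
# Validation data must be transformed with the same μ and σ, and
# predictions rescaled back with y .* σ .+ μ.
```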

Thank you, I am using sigmoid for the hidden layer and identity for the output one:


using Flux

# Define the net model and load it with data...
Flux_nn = Chain(Dense(23, 12, Flux.sigmoid),
                Dense(12, 1, identity))
loss(x, y) = Flux.mse(Flux_nn(x), y)
ps = Flux.params(Flux_nn)
nndata = Flux.Data.DataLoader(xtrain', ytrain', batchsize=8)

What I found strange is that, using another library with the same parameters - including batch size, optimizer and number of epochs - I don’t get this effect.

I will try increasing the epochs. But it could also be that Flux doesn’t apply certain tricks by default, like Xavier weight initialisation or random sampling of the batches… I did notice that Flux has a philosophy of not providing default arguments/optimisations. I understand it, but for newcomers it could be useful to have, for example, a default optimizer and loss function.

No, Flux does not do any random sampling for you; it’s all up to the iterator you pass it. The weight initialisation should be one of the standard ones, though.
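For reference, Flux’s `Dense` layers default to Glorot (Xavier) uniform initialisation, and the `init` keyword lets you swap in another scheme; a sketch (the layer sizes here just mirror the model above):

```julia
using Flux

# Dense uses glorot_uniform (Xavier) initialisation by default;
# the init keyword overrides it, e.g. with Kaiming initialisation.
default_layer = Dense(23, 12, Flux.sigmoid)
kaiming_layer = Dense(23, 12, Flux.sigmoid, init=Flux.kaiming_uniform)

# Both map a length-23 input to a length-12 output.
y = default_layer(randn(Float32, 23))
```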

I would still normalize the data. Since the data you are predicting is in the 1000s, the initial gradients will drive all the activation functions to saturation before the linear output layer has caught up. By normalizing, you’ll probably have a much faster convergence and to a better minimum.

Yes, actually both X and Y are scaled in the script (for Y they are just divided by 1000, because if I normalise to mean 0, s.d. 1 I may get some negative demand when I rescale them back).

Edit: you were right; to get similar results it was enough to load the data with shuffling: nndata = Flux.Data.DataLoader(xtrain', ytrain', batchsize=8, shuffle=true) (in BetaML I do it by default unless you opt out with sequential=true)