# Generic Function to train NN w/ Flux

Updated:
I would like to write a generic function to train a model in Flux.

An ordinary linear regression w/ intercept is a special case of a neural network. If the optimizers work well I should get the same result (esp from such a simple model).
Unfortunately I don’t.

First: prepare Boston housing data

``````using MLJ: @load_boston, partition
using Flux, Statistics
X, y =  @load_boston; X = hcat(X...);
train, test = partition(eachindex(y), .7, rng=333);
#
Xm =  hcat(X, size(X,1) |> ones ); #add intercept
XmT= Xm[train,:]; XmH= Xm[test,:];
# Flux automatically includes intercept...
XT= X[train,:]; YT= y[train];
XH= X[test,:];  YH= y[test];
``````

Now train a linear model w/ intercept using matrix algebra:
In-sample RMSE= 4.90. Out-of-sample RMSE = 4.4242.
Below I denote these scores (IS, OS).

``````β̂ = XmT \ YT
ŷ = (XmT)*β̂
Flux.mse(ŷ, YT) |> sqrt
ŷ = (XmH)*β̂
Flux.mse(ŷ, YH) |> sqrt
``````

Now train a generic Flux model:

``````function f(d, XT, YT, XH, YH;
m = Chain(Dense(size(XT,2), 1)), #Model/Activation
ℓ = Flux.mse,                    #Loss: mse, crossentropy...
nE = 200                         #Number epochs
)
loss(x, y) = ℓ(m(x), y)
Flux.@epochs nE Flux.train!(loss, params(m), d, opt)
IS = Flux.mse(m(XT'), YT') |> sqrt
OS = Flux.mse(m(XH'), YH') |> sqrt
return IS, OS
end
``````

Using the ADAGrad() optimiser my scores are too far off, but improve w/ more epochs:

``````f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=    200)  #(5.57, 5.18)
f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=1_000)    #(5.30, 4.85)
f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=10_000)   #(5.14, 4.48)
``````

Next I try all default optimizers w/ 1000 epochs:

``````Opt_All =[ADAMW(), ADAGrad(0.1), AdaMax(), ADADelta(0.9), AMSGrad(),
Momentum()];
for opt in Opt_All
println(" ", opt)
IS, OS= f(d, XT, YT, XH, YH, opt=opt, nE=1_000)
println(IS, " ", OS)
println(" ")
end
``````

Still not great:

``````Flux.Optimise.Optimiser(Any[ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}()), WeightDecay(0)])
5.38384461137456 4.682120351510056
5.27592040673716 4.793918203133885
5.295374172910945 4.833882858869897
55.62990418302113 57.01656766967905
5.320812530474836 4.881181082991089
5.3934106199655245 4.690819240537613
1.394841024573495e12 1.393338515504848e12
Descent(0.1)
NaN NaN
5.392202731987643 4.693330726613823
Nesterov(0.001, 0.9, IdDict{Any,Any}())
NaN NaN
RMSProp(0.001, 0.9, IdDict{Any,Any}())
5.42999749585886 4.658016527255777
Momentum(0.01, 0.9, IdDict{Any,Any}())
NaN NaN
``````

Now FluxOptTools.jl:
(IS, OS) = (5.18, 4.488) better (claims status: success) but still not great.

``````using Flux, Zygote, Optim, FluxOptTools, Statistics
m = Chain(Dense(size(XT,2), 1)) #LM
loss4() = mean(abs2, m(XT') .- YT')
Zygote.refresh()
pars   = Flux.params(m)
lossfun, gradfun, fg!, p0 = optfuns(loss4, pars)
res = Optim.optimize(Optim.only_fg!(fg!), p0, Optim.Options(iterations=10_000, store_trace=true))

res
Flux.params(m)
IS = Flux.mse(m(XT'), YT') |> sqrt
OS = Flux.mse(m(XH'), YH') |> sqrt
``````

I’m unable to use the BFGS() option `MethodError: no method matching length(::GlobalRef)`.
@baggepinnen is there a smarter way to use your tool to find the minimum?

2 Likes

Btw, if someone wants to try diff optimizers w/ Flux

``````Opt_All =[ADAMW(), ADAGrad(0.1), AdaMax(), ADADelta(0.9), AMSGrad(),
Momentum()];
for opt in Opt_All
println(" ", opt)
IS, OS= f(d, XT, YT, XH, YH, opt=opt)
println(IS, " ", OS)
println(" ")
end
``````

I don’t believe Flux has an option to automatically create a vector w/ all optimizers…

To further clarify:
my understanding is `Chain(Dense(size(XT,2), 1))` is a linear regression.

``````#AZ: apply to Boston Housing!
using Flux, Statistics
X, y =  @load_boston; X = hcat(X...);
X = (X .- mean(X, dims = 2)) ./ std(X, dims = 2);
train, test = partition(eachindex(y), .7, rng=333);
XT= X[train,:]; YT= y[train];
XH= X[test,:];  YH= y[test];
``````

Regular linear regression:

``````β̂ = XT \ YT
ŷ = (XT)*β̂
Flux.mse(ŷ, YT) |> sqrt.          #RMSE=4.95
ŷ = (XH)*β̂
Flux.mse(ŷ, YH) |> sqrt           #RMSE=4.24
``````

Flux

``````#
m = Chain(Dense(size(XT,2), 1)) #Linear Model
ℓ = Flux.mse
nE = 10_000
#
loss(x, y) = ℓ(m(x), y)
Flux.@epochs nE Flux.train!(loss, params(m), [(XT',YT')], opt)
IS = Flux.mse(m(XT'), YT') |> sqrt      #RMSE=7.95
OS = Flux.mse(m(XH'), YH') |> sqrt      #RMSE=7.75
``````

Using @baggepinnen’s FluxOptTools:

``````using Flux, Zygote, Optim, FluxOptTools, Statistics
m = Chain(Dense(size(XT,2), 1)) #LM
loss4() = mean(abs2, m(XT') .- YT')
Zygote.refresh()
pars   = Flux.params(m)
lossfun, gradfun, fg!, p0 = optfuns(loss4, pars)
res = Optim.optimize(Optim.only_fg!(fg!), p0, Optim.Options(iterations=10_000, store_trace=true))

res
Flux.params(m)
IS = Flux.mse(m(XT'), YT') |> sqrt         #RMSE = 4.93
OS = Flux.mse(m(XH'), YH') |> sqrt.        #RMSE = 4.31
``````

why isn’t the RMSE the same as w/ linear regression?

I’m not an expert on Flux, but your code looks good to me.

Note that `Chain(Dense(size(XT,2), 1))` is actually an affine function, so the result could be slightly different from a linear regression.

If I understood correctly, when using FluxOptTools, you have almost the RMSE that you expect. But when you are using the optimizers built-in with Flux, you don’t.

Just a guess : although Flux is converging towards the right value, the first-order gradient descent is slow and the minimum is not yet reached. Whereas the higher-order quasi-Newton method of Optim would find the minimum of this quadratic problem almost immediately.
One way to check this hypothesis would be to plot the RMSE after each epoch of the training and see how it is converging.

1 Like

@mancellin
The intercept slightly changes the RMSE for linear regression using the analytic solution (as it should), but doesn’t change much for Flux… The oos RMSE =4.33, instead of the correct 4.22.
For such a simple linear model I’d expect more precision…

I tried running all default algorithms w/ 1000 epochs, & still no luck:

``````Opt_All =[ADAMW(), ADAGrad(0.1), AdaMax(), ADADelta(0.9), AMSGrad(),
Momentum()];
for opt in Opt_All
println(" ", opt)
IS, OS= f(d, XT, YT, XH, YH, opt=opt, nE=1_000)
println(IS, " ", OS)
println(" ")
end

7.444893070930646 7.333962459428459
7.770329465994163 7.597092298014306
7.801277739928222 7.610828911047076
25.60402302221672 25.543785997118032
7.579390098369818 7.448915711251212
7.440810833610845 7.3250896967053585
7.4372402871817735 7.322248828372182
Descent(0.1)
NaN NaN
7.441556218264752 7.33237354987947
Nesterov(0.001, 0.9, IdDict{Any,Any}())
5.6646968093079 5.750612117929154
RMSProp(0.001, 0.9, IdDict{Any,Any}())
7.5386542107074455 7.446910911160881
Momentum(0.01, 0.9, IdDict{Any,Any}())
6.566732406268816 6.457858653475062
``````

I suspect a lot of the issues here stem from the problem setup and not just Flux. You may want to start by examining the condition number of your data matrix. I found that k = 14992.546605226838 (the lower the condition number, the better conditioned your matrix). Such a large k indicates that for solving Ax = b with your data matrix, the solution will be very sensitive to errors. This makes iterative numerical methods prone to accumulating mistakes. A sign that this is happening is evident in the results you posted where simple optimizers like gradient descent or momentum result in NaNs.

Why does `X \ y` work? If Julia does what other languages do, then it is analytically computing the solution for `\` using the pseudo inverse. This result is the exact solution to the minimum loss problem for regression + MSE. And since it is computed using an analytical approach, it won’t accumulate error in the same way as iterative methods (though it is also not without numerical error).

To make sure I’m not just making stuff up, I used the Landweber iteration to find the solution to the problem (which is basically gradient descent for Ax = b & MSE). Here is the code:

``````using MLJ: @load_boston, partition
using Flux: mse, train!
using Flux, Statistics

# X, y =  @load_boston; X = hcat(X...)
X = rand(354, 12)
w = rand(12)
y = sign.(X * w)
train, test = partition(eachindex(y), 0.7, rng=333)

Xtrain = X[train, :]
Xtest = X[test, :]
Xbtrain = [Xtrain ones(size(Xtrain, 1))]
Xbtest = [Xtest ones(size(Xtest, 1))]
ytrain = y[train]
ytest = y[test]

rmse(ŷ, y) = sqrt(mse(ŷ, y))

function analytical(X, y, Xtest, ytest)
ŵ = X \ y
ŷ = X * ŵ
ŷtest = Xtest * ŵ

println(ŵ)

return rmse(ŷ, y), rmse(ŷtest, ytest)
end

function flux(X, y, Xtest, ytest; opt, nepochs = 1000)
m = Dense(size(X, 2), 1)
traindata = Flux.Data.DataLoader(permutedims(X), y; batchsize = 32)
for epoch in 1:nepochs
train!((x, y) -> mse(m(x), y), params(m), traindata, opt)
end

return rmse(m(permutedims(X)), y), rmse(m(permutedims(Xtest)), ytest)
end

function landweber(X, y, Xtest, ytest; λ = 0.1, nepochs = 1000)
traindata = Flux.Data.DataLoader(permutedims(X), y; batchsize = 32)
ŵ = rand(size(X, 2))
for epoch in 1:nepochs
for (X, y) in traindata
ŵ -= λ * X * (transpose(X) * ŵ - y)
end
end

ŷ = X * ŵ
ŷtest = Xtest * ŵ

return rmse(ŷ, y), rmse(ŷtest, ytest)
end

evalvector = [
("Analytical", () -> analytical(Xbtrain, ytrain, Xbtest, ytest)),
("Flux GD (8e-3)", () -> flux(Xtrain, ytrain, Xtest, ytest; opt = Descent(8e-3))),
("Landweber (1e-2)", () -> landweber(Xbtrain, ytrain, Xbtest, ytest; λ = 1e-2))
]

for (name, evalfunc) in evalvector
errtrain, errtest = evalfunc()
println("\$name:")
println("  RMSE (train): \$errtrain")
println("  RMSE (test):  \$errtest")
end
``````

If you comment out the line at the top and use your dataset instead of the synthetic data, you’ll see that both Flux and the manually specified Landweber iteration diverge. The learning rate needs to be 1e-8 to 1e-10 to get them to converge with high loss. And at that point, we can’t really slow down the algorithms further, since the learning rate is approaching machine epsilon.

If you use the synthetic data, you’ll see that all three approaches converge. Flux still suffers more error perhaps due to sources of numerical error in the AD (whereas for the Landweber iteration I manually wrote out the gradient).

The fact that ADAM etc get close is really a testament to how clever some of these optimization techniques are.

2 Likes

@darsnack Thank you! I didn’t think to check the condition number.
I reran w/ just the first two variables

``````X= X[:, 1:2]
``````

And it’s much more accurate! Though I still get some NaNs for some optimization algorithms…

Is it possible to decrease the learning rate for those algorithms? That would be my first attempt. Also, since you are only using two features now, you can visualize the data in a scatter plot. Does it appear linearly separable? If not, then adding another layer or two the NN will help the most.