Generic Function to train NN w/ Flux

Updated:
I would like to write a generic function to train a model in Flux.

An ordinary linear regression w/ intercept is a special case of a neural network. If the optimizers work well I should get the same result (esp from such a simple model).
Unfortunately I don’t.

First: prepare Boston housing data

using MLJ: @load_boston, partition
using Flux, Statistics
X, y =  @load_boston; X = hcat(X...);
train, test = partition(eachindex(y), .7, rng=333);
#
Xm =  hcat(X, size(X,1) |> ones ); #add intercept
XmT= Xm[train,:]; XmH= Xm[test,:];  
# Flux automatically includes intercept... 
XT= X[train,:]; YT= y[train];
XH= X[test,:];  YH= y[test];
d = Flux.Data.DataLoader(XT', YT');

Now train a linear model w/ intercept using matrix algebra:
In-sample RMSE = 4.90; out-of-sample RMSE = 4.4242.
Below I denote these scores as (IS, OS).

β̂ = XmT \ YT
ŷ = (XmT)*β̂
Flux.mse(ŷ, YT) |> sqrt
ŷ = (XmH)*β̂
Flux.mse(ŷ, YH) |> sqrt

Now train a generic Flux model:

function f(d, XT, YT, XH, YH;
  m = Chain(Dense(size(XT,2), 1)), #Model/Activation
  ℓ = Flux.mse,                    #Loss: mse, crossentropy...
  #                                #Penalty: add later...
  opt = ADAGrad(),                 #Optimiser
  nE = 200                         #Number epochs   
  )
  loss(x, y) = ℓ(m(x), y)
  Flux.@epochs nE Flux.train!(loss, params(m), d, opt)
  IS = Flux.mse(m(XT'), YT') |> sqrt
  OS = Flux.mse(m(XH'), YH') |> sqrt
  return IS, OS
end

Using the ADAGrad() optimiser my scores are well off the analytic ones, but they improve w/ more epochs:

f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=    200)  #(5.57, 5.18)
f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=1_000)    #(5.30, 4.85)
f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=10_000)   #(5.14, 4.48)

Next I try all default optimizers w/ 1000 epochs:

Opt_All =[ADAMW(), ADAGrad(0.1), AdaMax(), ADADelta(0.9), AMSGrad(),
            NADAM(), RADAM(), Descent(0.1), ADAM(), Nesterov(), RMSProp(),
            Momentum()];
for opt in Opt_All
  println(" ", opt)
  IS, OS= f(d, XT, YT, XH, YH, opt=opt, nE=1_000)
  println(IS, " ", OS)
  println(" ")
end

Still not great:

Flux.Optimise.Optimiser(Any[ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}()), WeightDecay(0)])
5.38384461137456 4.682120351510056
ADAGrad(0.1, IdDict{Any,Any}())
5.27592040673716 4.793918203133885
AdaMax(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.295374172910945 4.833882858869897
ADADelta(0.9, IdDict{Any,Any}())
55.62990418302113 57.01656766967905
AMSGrad(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.320812530474836 4.881181082991089
NADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.3934106199655245 4.690819240537613
RADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
1.394841024573495e12 1.393338515504848e12
Descent(0.1)
NaN NaN
ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.392202731987643 4.693330726613823
Nesterov(0.001, 0.9, IdDict{Any,Any}())
NaN NaN
RMSProp(0.001, 0.9, IdDict{Any,Any}())
5.42999749585886 4.658016527255777
Momentum(0.01, 0.9, IdDict{Any,Any}())
NaN NaN

Now FluxOptTools.jl:
(IS, OS) = (5.18, 4.488): better (Optim claims status: success), but still not great.

using Flux, Zygote, Optim, FluxOptTools, Statistics
m = Chain(Dense(size(XT,2), 1)) #LM
loss4() = mean(abs2, m(XT') .- YT')
Zygote.refresh()
pars   = Flux.params(m)
lossfun, gradfun, fg!, p0 = optfuns(loss4, pars)
res = Optim.optimize(Optim.only_fg!(fg!), p0, Optim.Options(iterations=10_000, store_trace=true))

res
Flux.params(m)
IS = Flux.mse(m(XT'), YT') |> sqrt
OS = Flux.mse(m(XH'), YH') |> sqrt
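
To see what the reported "status: success" is based on, the result object can be queried directly (a small sketch using standard Optim.jl accessors on the res returned above):

Optim.converged(res)        # convergence flag reported by Optim
sqrt(Optim.minimum(res))    # the objective is the training MSE, so this is the training RMSE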

I’m unable to use the BFGS() option: it fails with MethodError: no method matching length(::GlobalRef).
@baggepinnen is there a smarter way to use your tool to find the minimum?
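
For reference, the call pattern in the FluxOptTools README passes the Optim method positionally; a sketch with LBFGS (I have not checked whether BFGS avoids the GlobalRef error this way):

res = Optim.optimize(Optim.only_fg!(fg!), p0, LBFGS(),
                     Optim.Options(iterations=1_000, store_trace=true))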


Btw, if someone wants to try different optimizers w/ Flux:

Opt_All =[ADAMW(), ADAGrad(0.1), AdaMax(), ADADelta(0.9), AMSGrad(),
            NADAM(), RADAM(), Descent(0.1), ADAM(), Nesterov(), RMSProp(),
            Momentum()];
for opt in Opt_All
  println(" ", opt)
  IS, OS= f(d, XT, YT, XH, YH, opt=opt)
  println(IS, " ", OS)
  println(" ")
end

I don’t believe Flux has an option to automatically create a vector w/ all optimizers…

To further clarify:
my understanding is that Chain(Dense(size(XT,2), 1)) is a linear regression (with intercept).
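
A quick standalone sanity check of that claim (a sketch on a toy 3-feature input, not the Boston data):

layer = Dense(3, 1)                    # a single Dense layer with one output
x     = rand(Float32, 3)
W, b  = collect(Flux.params(layer))    # weight matrix (1×3) and bias (length-1 vector)
layer(x) ≈ W * x .+ b                  # true: exactly a linear regression with an intercept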

#AZ: apply to Boston Housing!
using MLJ: @load_boston, partition
using Flux, Statistics
X, y =  @load_boston; X = hcat(X...);
X = (X .- mean(X, dims = 2)) ./ std(X, dims = 2);
train, test = partition(eachindex(y), .7, rng=333);
XT= X[train,:]; YT= y[train];
XH= X[test,:];  YH= y[test];

Regular linear regression:

β̂ = XT \ YT
ŷ = (XT)*β̂
Flux.mse(ŷ, YT) |> sqrt           #RMSE=4.95
ŷ = (XH)*β̂
Flux.mse(ŷ, YH) |> sqrt           #RMSE=4.24

Flux:

#
m = Chain(Dense(size(XT,2), 1)) #Linear Model
ℓ = Flux.mse
opt = ADAGrad()
nE = 10_000
#
loss(x, y) = ℓ(m(x), y)
Flux.@epochs nE Flux.train!(loss, params(m), [(XT',YT')], opt)
IS = Flux.mse(m(XT'), YT') |> sqrt      #RMSE=7.95
OS = Flux.mse(m(XH'), YH') |> sqrt      #RMSE=7.75
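
One way to see where the gap comes from is to compare the fitted coefficients directly (a sketch; params(m) holds the weight matrix and the bias):

W, b = collect(Flux.params(m))    # Flux's fitted weight matrix (1×12) and bias
hcat(vec(W), XT \ YT)             # slopes side by side with the analytic β̂ (which has no intercept)
b                                 # Flux's fitted intercept term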

Using @baggepinnen’s FluxOptTools:

using Flux, Zygote, Optim, FluxOptTools, Statistics
m = Chain(Dense(size(XT,2), 1)) #LM
loss4() = mean(abs2, m(XT') .- YT')
Zygote.refresh()
pars   = Flux.params(m)
lossfun, gradfun, fg!, p0 = optfuns(loss4, pars)
res = Optim.optimize(Optim.only_fg!(fg!), p0, Optim.Options(iterations=10_000, store_trace=true))

res
Flux.params(m)
IS = Flux.mse(m(XT'), YT') |> sqrt         #RMSE = 4.93
OS = Flux.mse(m(XH'), YH') |> sqrt         #RMSE = 4.31

Why isn’t the RMSE the same as w/ linear regression?

I’m not an expert on Flux, but your code looks good to me.

Note that Chain(Dense(size(XT,2), 1)) is actually an affine function (it includes a bias term), so the result could be slightly different from a linear regression without an intercept.
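
(For a strict comparison, the analytic fit could include an intercept too; a sketch reusing XT and YT from above:)

XmT = hcat(XT, ones(size(XT, 1)))   # add an intercept column, matching the bias in Dense
β̂ = XmT \ YT
sqrt(Flux.mse(XmT * β̂, YT))         # in-sample RMSE with the intercept included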

If I understood correctly, when using FluxOptTools you get almost the RMSE you expect, but with the optimizers built into Flux you don’t.

Just a guess: although Flux is converging towards the right values, first-order gradient descent is slow and the minimum has not been reached yet, whereas the quasi-Newton method used by Optim (which exploits curvature information) finds the minimum of this quadratic problem almost immediately.
One way to check this hypothesis would be to plot the RMSE after each epoch of the training and see how it is converging.
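
Something like this would do it (a sketch, assuming the m, loss, opt, XT, YT defined earlier in the thread):

rmse_history = Float64[]
for epoch in 1:1_000
    Flux.train!(loss, Flux.params(m), [(XT', YT')], opt)    # one pass over the (full-batch) data
    push!(rmse_history, sqrt(Flux.mse(m(XT'), YT')))        # record the training RMSE after this epoch
end
# using Plots; plot(rmse_history, xlabel = "epoch", ylabel = "training RMSE")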


@mancellin
The intercept slightly changes the RMSE for the analytic linear regression (as it should), but doesn’t change much for Flux… the out-of-sample RMSE = 4.33, instead of the correct 4.22.
For such a simple linear model I’d expect more precision…

I tried running all default algorithms w/ 1000 epochs, & still no luck:

Opt_All =[ADAMW(), ADAGrad(0.1), AdaMax(), ADADelta(0.9), AMSGrad(),
            NADAM(), RADAM(), Descent(0.1), ADAM(), Nesterov(), RMSProp(),
            Momentum()];
for opt in Opt_All
  println(" ", opt)
  IS, OS= f(d, XT, YT, XH, YH, opt=opt, nE=1_000)
  println(IS, " ", OS)
  println(" ")
end

Flux.Optimise.Optimiser(Any[ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}()), WeightDecay(0)])
7.444893070930646 7.333962459428459
ADAGrad(0.1, IdDict{Any,Any}())
7.770329465994163 7.597092298014306
AdaMax(0.001, (0.9, 0.999), IdDict{Any,Any}())
7.801277739928222 7.610828911047076
ADADelta(0.9, IdDict{Any,Any}())
25.60402302221672 25.543785997118032
AMSGrad(0.001, (0.9, 0.999), IdDict{Any,Any}()) 
7.579390098369818 7.448915711251212
NADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
7.440810833610845 7.3250896967053585
RADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
7.4372402871817735 7.322248828372182
Descent(0.1)
NaN NaN
ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
7.441556218264752 7.33237354987947
Nesterov(0.001, 0.9, IdDict{Any,Any}())
5.6646968093079 5.750612117929154
RMSProp(0.001, 0.9, IdDict{Any,Any}())
7.5386542107074455 7.446910911160881
Momentum(0.01, 0.9, IdDict{Any,Any}())
6.566732406268816 6.457858653475062

I suspect a lot of the issues here stem from the problem setup and not just Flux. You may want to start by examining the condition number of your data matrix: I found k = 14992.546605226838 (the lower the condition number, the better conditioned the matrix). Such a large k indicates that, when solving Ax = b with your data matrix, the solution is very sensitive to errors, which makes iterative numerical methods prone to accumulating them. A sign that this is happening shows up in the results you posted, where simple optimizers like gradient descent or momentum produce NaNs.
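
For reference, here is how that can be checked (a sketch assuming the raw Boston X from the first post; the exact value depends on which matrix is used and whether the intercept column is included):

using LinearAlgebra
cond(hcat(X, ones(size(X, 1))))    # 2-norm condition number of the design matrix with intercept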

Why does X \ y work? For a rectangular matrix, Julia’s \ solves the least-squares problem directly via a pivoted QR factorization, which for a full-column-rank matrix like this one gives the same result as applying the pseudoinverse. This result is the exact solution to the minimum-loss problem for regression + MSE, and since it is computed with a direct factorization it won’t accumulate error the way iterative methods do (though it is not without numerical error either).
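
A quick check of that equivalence (a sketch reusing XmT and YT from the first post):

using LinearAlgebra
β_qr   = XmT \ YT          # least-squares solution via a (pivoted) QR factorization
β_pinv = pinv(XmT) * YT    # Moore–Penrose pseudoinverse solution
β_qr ≈ β_pinv              # agree up to floating-point error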

To make sure I’m not just making stuff up, I used the Landweber iteration to find the solution to the problem (which is basically gradient descent for Ax = b & MSE). Here is the code:

using MLJ: @load_boston, partition
using Flux: mse, train!
using Flux, Statistics

# X, y =  @load_boston; X = hcat(X...)
X = rand(354, 12)
w = rand(12)
y = sign.(X * w)
train, test = partition(eachindex(y), 0.7, rng=333)

Xtrain = X[train, :]
Xtest = X[test, :]
Xbtrain = [Xtrain ones(size(Xtrain, 1))]
Xbtest = [Xtest ones(size(Xtest, 1))]
ytrain = y[train]
ytest = y[test]

rmse(ŷ, y) = sqrt(mse(ŷ, y))

function analytical(X, y, Xtest, ytest)
    ŵ = X \ y
    ŷ = X * ŵ
    ŷtest = Xtest * ŵ
    
    println(ŵ)

    return rmse(ŷ, y), rmse(ŷtest, ytest)
end

function flux(X, y, Xtest, ytest; opt, nepochs = 1000)
    m = Dense(size(X, 2), 1)
    traindata = Flux.Data.DataLoader(permutedims(X), y; batchsize = 32)
    for epoch in 1:nepochs
        train!((x, y) -> mse(m(x), y), params(m), traindata, opt)
    end

    return rmse(m(permutedims(X)), y), rmse(m(permutedims(Xtest)), ytest)
end

function landweber(X, y, Xtest, ytest; λ = 0.1, nepochs = 1000)
    traindata = Flux.Data.DataLoader(permutedims(X), y; batchsize = 32)
    ŵ = rand(size(X, 2))
    for epoch in 1:nepochs
        for (X, y) in traindata
            ŵ -= λ * X * (transpose(X) * ŵ - y)
        end
    end
    
    ŷ = X * ŵ
    ŷtest = Xtest * ŵ
    
    return rmse(ŷ, y), rmse(ŷtest, ytest)
end

evalvector = [
    ("Analytical", () -> analytical(Xbtrain, ytrain, Xbtest, ytest)),
    ("Flux GD (8e-3)", () -> flux(Xtrain, ytrain, Xtest, ytest; opt = Descent(8e-3))),
    ("Landweber (1e-2)", () -> landweber(Xbtrain, ytrain, Xbtest, ytest; λ = 1e-2))
]

for (name, evalfunc) in evalvector
    errtrain, errtest = evalfunc()
    println("$name:")
    println("  RMSE (train): $errtrain")
    println("  RMSE (test):  $errtest")
end

If you uncomment the line at the top (and drop the synthetic data) so that your dataset is used instead, you’ll see that both Flux and the manually specified Landweber iteration diverge. The learning rate needs to be around 1e-8 to 1e-10 to get them to converge, and then only with a high loss. At that point we can’t really slow the algorithms down any further, since the learning rate is approaching machine epsilon.
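
For example (a sketch using the flux helper defined above, with the Boston data swapped in for the synthetic block; the exact threshold will vary):

flux(Xtrain, ytrain, Xtest, ytest; opt = Descent(1e-9))    # no longer diverges, but converges to a high loss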

If you use the synthetic data, you’ll see that all three approaches converge. Flux still suffers more error, perhaps due to sources of numerical error in the AD (whereas for the Landweber iteration I wrote out the gradient manually).

The fact that ADAM etc get close is really a testament to how clever some of these optimization techniques are.


@darsnack Thank you! I didn’t think to check the condition number.
I reran w/ just the first two variables

X= X[:, 1:2]

And it’s much more accurate! Though I still get some NaNs for some optimization algorithms…

Is it possible to decrease the learning rate for those algorithms? That would be my first attempt. Also, since you are only using two features now, you can visualize the data in a scatter plot. Does it appear linearly separable? If not, then adding another layer or two to the NN will help the most.
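
For example (a sketch; it assumes d, XT, YT, XH, YH were rebuilt after X = X[:, 1:2], and the learning rates / layer sizes are just starting points):

# Smaller learning rates for the optimizers that returned NaN:
f(d, XT, YT, XH, YH, opt = Descent(1e-4), nE = 1_000)
f(d, XT, YT, XH, YH, opt = Momentum(1e-4, 0.9), nE = 1_000)

# If a line does not fit the two features well, add a hidden layer:
f(d, XT, YT, XH, YH, m = Chain(Dense(2, 16, relu), Dense(16, 1)), nE = 1_000)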