**Updated**:

I would like to write a generic function to train a model in Flux.

An ordinary linear regression w/ intercept is a special case of a neural network. If the optimizers work well, I should get the same result (especially from such a simple model).

Unfortunately I don’t.
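To spell out the equivalence: a `Dense(n, 1)` layer with the default identity activation computes `W*x .+ b`, i.e. exactly a linear model with intercept. A minimal sketch (assumes `Flux.params` iterates the weight then the bias, as it does here):

```
using Flux
m = Chain(Dense(3, 1))    # identity activation by default
x = randn(Float32, 3)
W, b = Flux.params(m)     # 1×3 weight matrix and length-1 bias
@assert m(x) ≈ W * x .+ b # the layer is just a linear model w/ intercept
```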

First: prepare Boston housing data

```
using MLJ: @load_boston, partition
using Flux, Statistics
X, y = @load_boston; X = hcat(X...);
train, test = partition(eachindex(y), .7, rng=333);

Xm  = hcat(X, ones(size(X, 1)));  # add an intercept column
XmT = Xm[train, :]; XmH = Xm[test, :];
# Flux's Dense layer includes an intercept (bias) automatically...
XT = X[train, :]; YT = y[train];
XH = X[test, :];  YH = y[test];
d  = Flux.Data.DataLoader(XT', YT');
```
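One detail that may matter (my assumption, not something I verified above): `DataLoader`'s default `batchsize` is 1, so every gradient step sees a single observation, which makes first-order optimizers noisy. A hedged alternative is an explicit mini-batch size with shuffling:

```
# Sketch: larger shuffled mini-batches usually give smoother convergence
# than the default batchsize = 1 (API as in the Flux version used here).
d = Flux.Data.DataLoader(XT', YT', batchsize = 32, shuffle = true)
```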

Now train a linear model w/ intercept using matrix algebra:

In-sample RMSE = 4.90; out-of-sample RMSE = 4.4242.

Below I denote these scores as (IS, OS).

```
β̂ = XmT \ YT
ŷ = (XmT)*β̂
Flux.mse(ŷ, YT) |> sqrt
ŷ = (XmH)*β̂
Flux.mse(ŷ, YH) |> sqrt
```
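As a sanity check on the backslash solve above: the same coefficients come out of the normal equations `(XᵀX)β = Xᵀy`. A self-contained sketch on synthetic data (names `Xs`, `ys`, `β_true` are illustrative, chosen so as not to clobber the variables above):

```
using LinearAlgebra

# Synthetic design matrix with an intercept column, plus noisy responses.
Xs = hcat(ones(100), randn(100, 3))
β_true = [1.0, 2.0, -3.0, 0.5]
ys = Xs * β_true + 0.1 * randn(100)

β̂_backslash = Xs \ ys              # what the post uses
β̂_normal    = (Xs'Xs) \ (Xs'ys)    # normal equations
@assert β̂_backslash ≈ β̂_normal
```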

Now train a generic Flux model:

```
function f(d, XT, YT, XH, YH;
           m   = Chain(Dense(size(XT, 2), 1)),  # model/activation
           ℓ   = Flux.mse,                      # loss: mse, crossentropy, ...
           # penalty: add later...
           opt = ADAGrad(),                     # optimiser
           nE  = 200)                           # number of epochs
    loss(x, y) = ℓ(m(x), y)
    Flux.@epochs nE Flux.train!(loss, Flux.params(m), d, opt)
    IS = Flux.mse(m(XT'), YT') |> sqrt
    OS = Flux.mse(m(XH'), YH') |> sqrt
    return IS, OS
end
```

Using the ADAGrad() optimiser my scores are too far off, though they improve with more epochs:

```
f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE= 200) #(5.57, 5.18)
f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=1_000) #(5.30, 4.85)
f(d, XT, YT, XH, YH, m = Chain(Dense(12,1)), nE=10_000) #(5.14, 4.48)
```

Next I try all default optimizers w/ 1000 epochs:

```
Opt_All = [ADAMW(), ADAGrad(0.1), AdaMax(), ADADelta(0.9), AMSGrad(),
           NADAM(), RADAM(), Descent(0.1), ADAM(), Nesterov(), RMSProp(),
           Momentum()];
for opt in Opt_All
    println(" ", opt)
    IS, OS = f(d, XT, YT, XH, YH, opt=opt, nE=1_000)
    println(IS, " ", OS)
    println(" ")
end
```

Still not great:

```
Flux.Optimise.Optimiser(Any[ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}()), WeightDecay(0)])
5.38384461137456 4.682120351510056
ADAGrad(0.1, IdDict{Any,Any}())
5.27592040673716 4.793918203133885
AdaMax(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.295374172910945 4.833882858869897
ADADelta(0.9, IdDict{Any,Any}())
55.62990418302113 57.01656766967905
AMSGrad(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.320812530474836 4.881181082991089
NADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.3934106199655245 4.690819240537613
RADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
1.394841024573495e12 1.393338515504848e12
Descent(0.1)
NaN NaN
ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}())
5.392202731987643 4.693330726613823
Nesterov(0.001, 0.9, IdDict{Any,Any}())
NaN NaN
RMSProp(0.001, 0.9, IdDict{Any,Any}())
5.42999749585886 4.658016527255777
Momentum(0.01, 0.9, IdDict{Any,Any}())
NaN NaN
```
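The NaNs from Descent, Nesterov and Momentum look like step-size divergence on the raw features (the Boston columns have very different scales), and RADAM's blow-up points the same way. A hedged thing to try before training: standardize each column using training-set statistics only, so nothing leaks from the holdout set:

```
using Statistics

# Column-wise mean/std from the TRAINING set...
μ = mean(XT, dims = 1)
σ = std(XT, dims = 1)

# ...applied to both sets (same transform, no test-set leakage).
XTs = (XT .- μ) ./ σ
XHs = (XH .- μ) ./ σ
```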

Now FluxOptTools.jl:

(IS, OS) = (5.18, 4.488), which is better (and Optim reports convergence status: success), but still not matching the matrix-algebra solution.

```
using Flux, Zygote, Optim, FluxOptTools, Statistics
m = Chain(Dense(size(XT, 2), 1))  # linear model
loss4() = mean(abs2, m(XT') .- YT')
Zygote.refresh()
pars = Flux.params(m)
lossfun, gradfun, fg!, p0 = optfuns(loss4, pars)
res = Optim.optimize(Optim.only_fg!(fg!), p0, Optim.Options(iterations=10_000, store_trace=true))
res
Flux.params(m)
IS = Flux.mse(m(XT'), YT') |> sqrt
OS = Flux.mse(m(XH'), YH') |> sqrt
```

I’m unable to use the BFGS() option: it fails with `MethodError: no method matching length(::GlobalRef)`.
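As a possible workaround for the BFGS error, one could bypass Flux and FluxOptTools entirely and hand Optim the least-squares objective over an explicit coefficient vector (a sketch using the intercept-augmented `XmT`/`YT` from above; `autodiff = :forward` uses ForwardDiff for gradients):

```
using Optim

# Mean-squared-error loss over an explicit coefficient vector β.
lsq(β) = sum(abs2, XmT * β .- YT) / length(YT)

res = Optim.optimize(lsq, zeros(size(XmT, 2)), BFGS(); autodiff = :forward)
β_bfgs = Optim.minimizer(res)
# For this convex quadratic, β_bfgs should agree with XmT \ YT.
```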


@baggepinnen is there a smarter way to use your tool to find the minimum?