Generic Function to train NN w/ Flux

I suspect a lot of the issues here stem from the problem setup and not just Flux. You may want to start by examining the condition number of your data matrix. I found k ≈ 14992.5 (the lower the condition number, the better conditioned the matrix). Such a large k means that when solving Ax = b with your data matrix, the solution is very sensitive to errors, which makes iterative numerical methods prone to accumulating them. You can see a sign of this in the results you posted, where simple optimizers like gradient descent or momentum produce NaNs.
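As a quick check (a minimal sketch; cond comes from the standard LinearAlgebra library, and Xbtrain is the bias-augmented training matrix defined in the code further down):

using LinearAlgebra

# Condition number of the bias-augmented training matrix. Roughly speaking,
# errors in the data or in intermediate computations can be amplified by a
# factor on the order of k when solving the least-squares problem.
k = cond(Xbtrain)
println("condition number: ", k)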

Why does X \ y work? For a non-square matrix, Julia's \ computes the least-squares solution directly (via a QR factorization), which is equivalent to applying the pseudoinverse. That is the exact solution to the minimum-loss problem for linear regression + MSE, and since it is computed by a direct method rather than an iterative one, it doesn't accumulate error in the same way (though it is not entirely free of numerical error either).
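To convince yourself that the direct solve and the pseudoinverse agree (another small sketch, again assuming the Xbtrain / ytrain variables from the code below):

using LinearAlgebra

ŵ_direct = Xbtrain \ ytrain          # QR-based least-squares solve
ŵ_pinv   = pinv(Xbtrain) * ytrain    # explicit Moore-Penrose pseudoinverse
# The two should differ only at the level of floating-point round-off.
println(maximum(abs.(ŵ_direct - ŵ_pinv)))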

To make sure I’m not just making stuff up, I used the Landweber iteration (which is essentially gradient descent for Ax = b with an MSE loss) to solve the problem. Here is the code:

using MLJ: @load_boston, partition
using Flux: mse, train!
using Flux, Statistics

# Uncomment the next line (and drop the synthetic data below) to use the real dataset:
# X, y = @load_boston; X = hcat(X...)
X = rand(354, 12)    # synthetic, well-conditioned data matrix
w = rand(12)
y = sign.(X * w)
train, test = partition(eachindex(y), 0.7, rng=333)

Xtrain = X[train, :]
Xtest = X[test, :]
Xbtrain = [Xtrain ones(size(Xtrain, 1))]
Xbtest = [Xtest ones(size(Xtest, 1))]
ytrain = y[train]
ytest = y[test]

rmse(ŷ, y) = sqrt(mse(ŷ, y))

function analytical(X, y, Xtest, ytest)
    ŵ = X \ y
    ŷ = X * ŵ
    ŷtest = Xtest * ŵ
    
    println(ŵ)

    return rmse(ŷ, y), rmse(ŷtest, ytest)
end

function flux(X, y, Xtest, ytest; opt, nepochs = 1000)
    m = Dense(size(X, 2), 1)
    traindata = Flux.Data.DataLoader(permutedims(X), y; batchsize = 32)
    for epoch in 1:nepochs
        # vec(...) so the 1×N model output matches the shape of the target vector
        train!((x, y) -> mse(vec(m(x)), y), params(m), traindata, opt)
    end

    return rmse(vec(m(permutedims(X))), y), rmse(vec(m(permutedims(Xtest))), ytest)
end

function landweber(X, y, Xtest, ytest; λ = 0.1, nepochs = 1000)
    traindata = Flux.Data.DataLoader(permutedims(X), y; batchsize = 32)
    ŵ = rand(size(X, 2))
    for epoch in 1:nepochs
        for (Xb, yb) in traindata
            # gradient step for the batch least-squares objective ‖transpose(Xb) * ŵ - yb‖²
            ŵ -= λ * Xb * (transpose(Xb) * ŵ - yb)
        end
    end
    
    ŷ = X * ŵ
    ŷtest = Xtest * ŵ
    
    return rmse(ŷ, y), rmse(ŷtest, ytest)
end

evalvector = [
    ("Analytical", () -> analytical(Xbtrain, ytrain, Xbtest, ytest)),
    ("Flux GD (8e-3)", () -> flux(Xtrain, ytrain, Xtest, ytest; opt = Descent(8e-3))),
    ("Landweber (1e-2)", () -> landweber(Xbtrain, ytrain, Xbtest, ytest; λ = 1e-2))
]

for (name, evalfunc) in evalvector
    errtrain, errtest = evalfunc()
    println("$name:")
    println("  RMSE (train): $errtrain")
    println("  RMSE (test):  $errtest")
end

If you uncomment the line at the top and use your dataset instead of the synthetic data, you’ll see that both Flux and the manually written Landweber iteration diverge. The learning rate needs to be around 1e-8 to 1e-10 to get them to converge, and even then the loss stays high. At that point, we can’t really slow the algorithms down any further, since the learning rate is approaching machine epsilon.
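One way to see why the step size has to be that small (a rough sketch; opnorm is from LinearAlgebra, and Xbtrain here would be the bias-augmented matrix built from your real data rather than the synthetic one): for plain gradient descent on a least-squares objective, the largest stable step size scales like 1 / σmax(X)², up to constant factors that depend on whether the loss is summed or averaged over the batch.

using LinearAlgebra

σmax = opnorm(Xbtrain)   # largest singular value of the data matrix
println("rough stable step-size scale: ", 1 / σmax^2)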

If you use the synthetic data, you’ll see that all three approaches converge. Flux still ends up with somewhat more error, perhaps due to sources of numerical error in the AD (whereas for the Landweber iteration I wrote out the gradient by hand).

The fact that ADAM and similar optimizers still get close is really a testament to how clever some of these optimization techniques are.
