No changes with Flux NN regression training

Problem Description

Hi all, I’m having trouble using Flux to learn a non-linear function of two independent variables, x1 and x2. Everything is running, but I have a feeling that the parameters aren’t actually being updated and every time that I train, I’m starting off from the initialized values again.

I’ve made up a function that is roughly similar to my real data and it shows the same issues. Notice in the plots of the results that the shape is only vaguely there and the scale is way off. The predictions also seem to stay more or less the same as what the model produced after the first training iteration. An earlier version did actually manage to capture the shape well, but where the range of predicted values should have spanned [0, 1], the predicted range was more like [0.225, 0.235], and nothing I did would make it budge.

I’m not sure if this has to do with the configuration of the NN itself, the activation functions, the batch sizes, how the parameters are updated in train!, or something else.

I tried to use the DataLoader but was having issues, so I rolled my own batching. Similarly, I wasn’t sure if the @epochs macro was causing problems, so I wrote my own iteration loop instead. For reference, roughly what I was attempting is sketched below.
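(A reconstruction from memory, reusing the normalized x1/x2/y and the loss, ps, and opt defined in the code further down, so the exact calls may not match what I actually ran; Flux versions also differ on whether DataLoader takes a tuple or separate arrays.)

# reconstructed sketch of my DataLoader/@epochs attempt; assumes the
# normalized x1, x2, y and the loss, ps, opt defined in the code below
X_all = transpose(hcat(x1, x2)) |> Array   # 2 × N input matrix
Y_all = y |> Array                         # N-element target vector

# tuple form of the constructor; some Flux versions take DataLoader(X_all, Y_all; ...)
loader = Flux.Data.DataLoader((X_all, Y_all), batchsize=32, shuffle=true)

Flux.@epochs 250 Flux.train!(loss, ps, loader, opt)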

Any suggestions are welcome even if they don’t pertain to the particular problem. Thanks in advance!

Code

using DataFrames, Plots, Flux
using Flux: @epochs


x1 = DataFrame!(x1 = [0.2, 0.5, 1, 2, 5, 10, 15, 25, 50, 100, 200, 300, 400,
                    500, 700, 900, 1000, 1250, 1500, 1800, 2000, 3000, 5000])
x2 = DataFrame!(x2 = range(0., 90., length=30) |> collect)
df = crossjoin(x1, x2)

# faking up some data based on a nonlinear function that should be somewhat
# similar to mine. My actual data is a little more complex but I can't share it
df[:, :y] .= 0.
for row in eachrow(df)
    row.y = (row.x1)^(1/2) * (row.x2)^2
end

stats = describe(df, :min, :max, :mean, :std)

# Normalize the data
x1 = (df.x1 .- stats.min[1]) / (stats.max[1] - stats.min[1])
x2 = (df.x2 .- stats.min[2]) / (stats.max[2] - stats.min[2])
y = (df.y .- stats.min[3]) / (stats.max[3] - stats.min[3])

scatter(df.x1, df.x2, df.y, lab="True Values- unscaled")
scatter(x1, x2, y, lab="True Values- normalized")

Unscaled scatter plot

[scatter plot: true_unscaled]

Normalized scatter plot

[scatter plot: true_normalized]

z = 5
m = Chain(
    Dense(2, z),
    Dense(z, z, tanh),
    Dense(z, z, σ),
    Dense(z, 1),
)
ps = params(m)
opt = Descent()
loss(X, y) = Flux.Losses.mse(m(X)[1], y)

n = 1000                    # how many batches I want
batches = 1:1:n             # range to iterate on the batches
batch_size = 32             # number of random data in each batch
num_epochs = 250            # number of times to train on each batch
num_datum = size(df)[1]     # getting total number of data

for batch in batches
    # making a random index to make random minibatches
    rd_idx = []     # empty list
    # randomly select the batch size number of points within the range of data
    for i in 1:1:batch_size
        new = rand(1:num_datum)
        push!(rd_idx, new)
    end

    # creating new X and Y minibatch based on the random index selected
    x1_minibatch = x1[rd_idx,:]
    x2_minibatch = x2[rd_idx,:]
    y_minibatch = y[rd_idx, :]

    # putting X in correct dimensions
    X = transpose(hcat(x1_minibatch, x2_minibatch)) |> Array
    Y = y_minibatch |> Array
    data = [(X, Y)]
    for epoch in 1:1:num_epochs
        Flux.train!(loss, ps, data, opt)
    end
end


# Plotting Results
x1_test = 0.0:0.1:1.0
x2_test = 0.0:0.1:1.0
ŷ(x1_test, x2_test) = m([x1_test, x2_test])[1]
plot(x1_test, x2_test, ŷ, st=:surface)

After one complete training iteration

[surface plot: function_trained]

After several

[surface plot: function_more_trained]

After many, many iterations

[surface plot: function_most_trained]

Environment

(jgpr) pkg> st
Status `C:\Users\~\jgpr\Project.toml`
  [336ed68f] CSV v0.7.7
  [052768ef] CUDA v1.3.3
  [a93c6f00] DataFrames v0.21.7
  [587475ba] Flux v0.11.1
  [91a5bcdd] Plots v1.6.7
  [08abe8d2] PrettyTables v0.9.1
  [bd369af6] Tables v1.0.5
  [37e2e46d] LinearAlgebra

The biggest thing that jumps out to me is that the SGD loop is inside out 🙂. Currently you have

for batch in randsamples(data)
  for i in 1...nepoch
    sgd_step(model, batch, opt) 

Whereas the correct ordering should be

for i in 1...nepoch
  for batch in randsamples(data)
    sgd_step(model, batch, opt) 

Conceptually, you can think of the first approach as beating the model over the head with a single batch, then swapping to a completely different one and repeating. There may be some weight adaptation that is conducive to generalization, but most likely you’ll have overfit on the last minibatch because the model is being fed that batch repeatedly for 250 iterations/optimization steps right before evaluation. Unlike the semi-directed random walk on the loss landscape you would expect from SGD, this will look like a series of dramatic jerks without any real sense of direction.
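Concretely, here’s a minimal sketch of that reordering using the names from your post. I’ve kept your loss, ps and opt as-is and only swapped your push!-based index building for rand(1:num_datum, batch_size), which should be equivalent:

for epoch in 1:num_epochs
    for batch in 1:n
        # draw a fresh random minibatch for every optimization step
        rd_idx = rand(1:num_datum, batch_size)
        X = transpose(hcat(x1[rd_idx], x2[rd_idx])) |> Array   # 2 × batch_size inputs
        Y = y[rd_idx]                                           # matching targets
        Flux.train!(loss, ps, [(X, Y)], opt)                    # one SGD step per minibatch
    end
end

With the loop this way around you may also find you need far fewer than n × num_epochs total steps, so it’s worth dialing those back once the loss starts moving in the right direction.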


@ToucheSir, thanks so much for the detailed explanation; it makes perfect sense. I’m getting results that actually look like they’re improving now.