Two questions on Flux

I’ve watched videos in JuliaAcademy (I recommend them), and continued playing around with Flux. To test Flux, I’ve tried to re-create simple results from:

  • Bishop, Chris M. (1994). “Neural networks and their applications”. Rev. Sci. Instrum., Vol 65, no. 6, pp. 1803–1832.
    Bishop uses a feedforward network with 1 input, 1 output, and one hidden layer of 5 units, and trains the network over 1000 “cycles” using a BFGS quasi-Newton learning algorithm:

    Here is my attempt to re-create Bishop’s results, using the ADAM algorithm and 3000 epochs:
using Flux, Statistics  # Statistics provides mean
# Generating 50 data points
x_d = reshape(range(-1,1,length=50),1,50)
y_d_a = x_d.^2
D_a = [(x_d,y_d_a)]
#
# Model
mod = Chain(Dense(1,5,tanh),Dense(5,1))
# Loss/cost function
loss(x, y) = mean((mod(x).-y).^2)
# Optimization algorithm
opt = ADAM(0.002, (0.99, 0.999))
# Parameters of the model
par = params(mod);
# Running 3000 epochs, and generating model output
for i in 1:3000
    Flux.train!(loss,par,D_a,opt)
end
y_m_a = Tracker.data(mod(x_d));

The results in Flux using the four functions of Bishop are:

Clearly, Bishop’s results are far better using 1000 “cycles” (= epochs?) than what I get with Flux using 3000 epochs.

Questions:

  1. Is the main reason for Bishop’s superior results that he uses a smarter learning algorithm (BFGS)? Would it be possible to augment Flux with BFGS, or other methods (conjugate gradient)? Or will this complicate matters or make it unable to run on GPUs?
  2. Suppose that I have N data points X = [x1, x2, …, xN] and Y = [y1,y2, …, yN] where xi and yi are vectors, thus X and Y are vectors of vectors. With the latest version of Flux, it seems that X and Y must be reshaped into matrices with dim(X) = dim(xi) x N and dim(Y) = dim(yi) x N. The Flux.batch function does this reshaping, but only if dim(xi) > 1; with dim(xi) = 1, Flux.batch produces a vector/doesn’t change the argument.
    My question is the following… The Flux examples I have seen in the Flux documentation and in JuliaAcademy seem to illustrate training of the network by computing the gradient of the loss function mse(model(xi),yi), i.e., at a single data point (xi,yi).
    I’m used to gradient descent being based on a loss function mse(model(X),Y), i.e., deviation from the entire batch of data, not just on individual data points.
  • Does Flux, in fact, base the loss function on the entire batch of data points? Thus, computing the gradient of mse(model(X),Y)? Or…
  • Does Flux base the loss function on each data point, i.e., does N updates based on the gradient of mse(model(xi),yi) where i runs from 1 to N?
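For what it's worth, `Flux.train!` loops over the data collection it is given and makes one gradient update per element, so which of the two bullets applies depends on how the data is packaged. A minimal sketch in plain Julia (no Flux calls; the names `full_batch` and `per_sample` are made up for illustration):

```julia
# 50 points as a 1×50 matrix, as in the post above
x_d = reshape(collect(range(-1, 1, length=50)), 1, 50)
y_d = x_d .^ 2

# One element holding all 50 points: one update per pass,
# using the gradient of the loss over the whole batch
full_batch = [(x_d, y_d)]

# 50 elements of one point each (kept as 1×1 matrices): 50 updates per pass
per_sample = [(x_d[:, i:i], y_d[:, i:i]) for i in 1:size(x_d, 2)]

println(length(full_batch), " update(s) per pass vs ", length(per_sample))
```

With `D_a = [(x_d,y_d_a)]` as in the post, each call to `Flux.train!` would therefore be a single full-batch update.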

Btw., I also tried with 30 000 epochs in Flux, and then I get the following results:

BFGS is indeed a more potent, but more complex optimizer, so I would not be surprised if it improves the convergence in this simple case. How did Bishop come up with the learning rate, and how did you come up with yours? Is it a simple matter of tuning the parameters?

Q1, I think BFGS-type methods normally aren’t used because there are too many parameters – an expert may correct me, but I think you end up with a Hessian approximation of N^2 numbers for N parameters, which won’t work for a neural network where N only just fits in memory. (L-BFGS reduces this by keeping only a few recent update vectors.)

But for small problems you totally can. Do something like CatView(Flux.data.(Flux.params(mod).order)...) and then you can use Optim.jl on this vector.
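To make the "one vector" idea concrete, here is a rough sketch in plain Julia of packing several parameter arrays into a single vector and writing it back. It does not use CatView, Flux, or Optim.jl; `flatten` and `unflatten!` are hypothetical helper names, and the arrays only mimic the shapes of `Chain(Dense(1,5,tanh),Dense(5,1))`.

```julia
# Hypothetical helpers: pack several parameter arrays into one vector and back
flatten(ps) = vcat((vec(p) for p in ps)...)

function unflatten!(ps, θ)
    offset = 0
    for p in ps
        n = length(p)
        # copy the matching slice of θ back, in the array's own shape
        p .= reshape(view(θ, offset+1:offset+n), size(p))
        offset += n
    end
    return ps
end

# Plain arrays with the shapes of the two Dense layers (weights and biases)
ps = [rand(5, 1), rand(5), rand(1, 5), rand(1)]
θ = flatten(ps)            # 16-element vector an optimizer could work on
unflatten!(ps, 2 .* θ)     # write a modified vector back into the arrays
```

An optimizer that wants a single vector would work on `θ`; the gradients would need to be flattened in the same order.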

Q2, I think received wisdom on neural networks is to train on batches which are much more than one data point, but much less than all your data. For each such batch you compute the gradient & update. One reason is that your data set might not fit in memory. Another reason is that the resulting gradient (and hence updates) are a bit noisy, and this is thought to help find better solutions.

I think such noisy gradients also confuse the hell out of BFGS and friends, which expect to be evaluating the same deterministic function at each step.

I just chose the parameters for the ADAM (SGD?) algorithm from an example published in another thread in the Julia Discourse forum. I could probably fine-tune those parameters.

Bishop discusses the Newton method (rejected because of complexity/storage requirement), the conjugate gradient method, and the BFGS method. I guess there is less tuning with those methods. I’d assume that, say, the BFGS method combined with line search could give “fast” convergence, but I don’t know if the added complexity is worth it…

If anyone has some insight into how Flux handles gradient computations in the case of batch data, that would really clarify my understanding. Suppose x_i \in \mathbb{R}^{n_x} and y_i \in \mathbb{R}^{n_y} with available data (x_i,y_i), i \in \{1,\ldots, N\}, and suppose the loss/cost function is the sum of squared errors \cal{J} given as \cal{J} = \sum_{i=1}^N \lVert y_i - \cal{M}(x_i)\rVert_Q^2 for the batch of data where \cal{M}(x_i) is the model prediction.

In standard gradient descent for least squares, \theta_{k+1} = \theta_k - \eta \frac{\partial \cal{J}}{\partial \theta}\rvert_{\theta_k}. But the examples in the Flux documentation seem to indicate that the per-sample loss function \cal{J}_i = \lVert y_i - \cal{M}(x_i)\rVert_Q^2 is used instead. So I’m wondering whether this is the case, and whether the update \theta_{i+1} = \theta_i - \hat{\eta}\frac{\partial \cal{J}_i}{\partial \theta}\rvert_{\theta_i} with \hat{\eta} \ll \eta gives the same overall learning…
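One way to relate the two updates, as a sketch in the notation above (the intermediate iterates \theta^{(i)} are introduced here only for illustration and do not come from the Flux documentation):

```latex
% The batch loss is the sum of the per-sample losses, so the gradients add up:
\cal{J} = \sum_{i=1}^N \cal{J}_i
\quad\Rightarrow\quad
\frac{\partial \cal{J}}{\partial \theta} = \sum_{i=1}^N \frac{\partial \cal{J}_i}{\partial \theta}

% One full-batch step: every gradient is evaluated at the same point \theta_k
\theta_{k+1} = \theta_k - \eta \sum_{i=1}^N \frac{\partial \cal{J}_i}{\partial \theta}\Big\rvert_{\theta_k}

% One pass of N per-sample steps, starting from \theta^{(0)} = \theta_k:
% each gradient is evaluated at the most recent iterate
\theta^{(N)} = \theta_k - \hat{\eta} \sum_{i=1}^N \frac{\partial \cal{J}_i}{\partial \theta}\Big\rvert_{\theta^{(i-1)}}
```

The two expressions differ only in where the gradients are evaluated. For small \hat{\eta} the iterates \theta^{(i-1)} stay close to \theta_k, so one pass over the data is, to first order in \hat{\eta}, the same as one full-batch step with \eta = \hat{\eta}. If \cal{J} were defined as the mean instead of the sum, the matching step size would be \hat{\eta} \approx \eta / N.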

Yes, I realize that – normally – the data are split into training data and … validation data (?).

Personally, I like to use names for concepts that allow for indices that are mnemonic. Thus, I don’t like to talk about “training” and “test” data, because both start with “t”. I don’t know what would be the preferred names, though. “Training” data is an ok name for data used to train the network. Then a data set is used for cross-validation to choose the configuration/sizing of hidden layers.

One could also include a third data set to test the model after the model is finalized. And one could use a fourth data set to do bootstrapping to find the distributions of the weights and biases. Maybe the third data set and the fourth data set can be collapsed into one.

So one could consider splitting the data set X_\mathrm{d}, Y_\mathrm{d} into:

  • training data X_\mathrm{t},Y_\mathrm{t}
  • data used for determination of network configuration X_\mathrm{c},Y_\mathrm{c}
  • data used for determining parameter statistics, X_\mathrm{s},Y_\mathrm{s}

Anyway, I’m sure there are other conceptual names for these data sets in the various sciences… If anyone has good names for these 3-4 datasets, I’m interested in that.

@improbable22 and @BLI About “ … But for small problems you totally can. Do something like CatView(Flux.data.(Flux.params(mod).order)...) and then you can use Optim.jl on this vector. …”

@improbable22 and @BLI About “ … Bishop discusses both the Newton method (rejected because of complexity/storage requirement), the conjugate gradient method, and the BFGS method. … “

I’m guessing you are bumping up against the arbitrary 11GB memory limits of GPUs? Would a solution where the memory model becomes one large, flat, fast memory of, say, 1TB change the practical big-O time complexity (as N goes to infinity; see Big O notation - Wikipedia) for you?

@improbable22 and @BLI Also, can I test it for you? “ … Do something like CatView(Flux.data.(Flux.params(mod).order)...) and then you can use Optim.jl on this vector. …” Is that code/pseudo-code outline complete, or can you provide the code or a complete outline for me to test?

What I’m trying to say is that the training data is normally further divided, e.g. into 60 batches each of 1000 images. Each gradient step involves one such batch, and each epoch is a cycle through all the training data. This is what I thought perhaps you were missing. The gradient is computed using a loss function which contains a sum from 1:1000, but also has an index 1:60 labelling which batch you are dealing with.
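A scaled-down sketch of that batching in plain Julia, with 12 samples in batches of 4 standing in for 60 batches of 1000 images (the variable names are made up for illustration, and no Flux calls are used):

```julia
# Toy data: 12 samples stored column-wise
X = reshape(collect(1.0:12.0), 1, 12)
Y = X .^ 2
batchsize = 4

# Split the column indices into consecutive chunks, one (x, y) tuple per mini-batch
batches = [(X[:, idx], Y[:, idx]) for idx in Iterators.partition(1:size(X, 2), batchsize)]

# One epoch = one pass over all mini-batches, with one gradient step per batch
for (b, (x, y)) in enumerate(batches)
    println("batch ", b, ": ", size(x, 2), " samples")  # a gradient step would go here
end
```

A collection like `batches` has the same shape as the data argument in the posts above: a list of `(x, y)` tuples, one per gradient step.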

Agree that test/train is a horrible choice of words, for that reason… this field is full of such choices, sadly.

Thanks, the idea of dividing the batch data into sub-batches clarifies things. What I’m “hinting” at is whether using even a single data point per batch with a tiny learning rate could make some sense, in that a tiny learning rate doesn’t give a fatal step because things smooth out over many such small batches… (is that the “stochastic” part of gradient descent?).

I cannot :frowning: I just ran into this issue elsewhere, that Optim likes one vector of parameters, while Flux is designed to handle them living in many different arrays. My CatView one-liner is just a sketch of how you might start hooking them together; you will have to similarly concatenate the gradients.

Right, my take on the received wisdom here was that, if you had (11GB)^2, you’d spend it on having many more parameters, not on storing the Hessian. But this is hearsay.

Yes that’s precisely what’s meant by “stochastic” here. Each batch/minibatch/sub-batch gives a gradient which you can think of as a noisy approximation to the true (all-data) gradient. But what people say is that this noisy gradient may actually be superior to the true gradient, in that it finds a better minimum.
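A small numerical illustration of the "noisy approximation" point, as a sketch on a made-up one-parameter problem (fitting `y = w*x`; nothing here comes from Flux). Each mini-batch gradient differs from the all-data gradient, but with equally sized batches and a mean-based loss their average matches it:

```julia
using Statistics

# Toy problem: fit y = w*x with a mean-squared-error loss
x = collect(range(-1, 1, length=12))
y = 3 .* x
w = 0.5

# Analytic gradient of mean((w*x - y)^2) with respect to w
grad(w, x, y) = mean(2 .* (w .* x .- y) .* x)

g_full = grad(w, x, y)
g_batches = [grad(w, x[idx], y[idx]) for idx in Iterators.partition(1:length(x), 4)]

println("full gradient:        ", g_full)
println("mini-batch gradients: ", g_batches)
println("their average:        ", mean(g_batches))
```

Here the batches are consecutive slices, so the spread between them is systematic; shuffling the sample order before batching is what makes the deviations behave like random noise.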

Do all learning algorithms (not just ADAM) use stochastic gradient descent, i.e., divide the batch up into these mini/sub-batches?

I assume that one epoch is the “major iteration” when you have either done one iteration over the whole batch (with back-propagation, etc.) or done all (minor) iterations of the sub batches of the batch?

I suggest testing it again with a bigger learning rate, for example ADAM(0.01, (0.99, 0.999)) or even bigger.

Will try tomorrow – it could be interesting to see how large I can make the learning rate.

You might find the following paper from my thesis opponent interesting

It demonstrates a quasi-Newton approach for stochastic optimization of large models (he naturally wondered why I was using ADAM and not BFGS while training my models ;).

OK… re-running, with ADAM(0.01,(0.99,0.999)). These are my loops:

# Setting up a sequence of updates/Epochs
for i in 1:5000
    Flux.train!(loss,par,D_a,opt)
end
y_m_a = Tracker.data(mod(x_d));
#
# Setting up a sequence of updates/Epochs
for i in 1:5000
    Flux.train!(loss,par,D_b,opt)
end
y_m_b = Tracker.data(mod(x_d));
#
# Setting up a sequence of updates/Epochs
for i in 1:5000
    Flux.train!(loss,par,D_c,opt)
end
y_m_c = Tracker.data(mod(x_d));
#
# Setting up a sequence of updates/Epochs
for i in 1:5000
    Flux.train!(loss,par,D_d,opt)
end
y_m_d = Tracker.data(mod(x_d));

Running one time:

If I re-run the loops, I get different results. The more times I run the loops, the “worse” the fit gets. After, say, 10 re-runs:

Questions:

  • When I re-run the loops above several times, will the weights/bias estimates of the model be re-set, or does the training continue with the current values of weights/bias?
  • What is the correct way to “re-set” the weights/bias values, i.e., make sure that they are seeded with random numbers between each run?

I assume we are not talking about over-training, since I compare the model with the training data all the time.

Btw. I also tried with \eta = 0.2. In that case, the result was good sometimes, and horrendous other times…

If you continue the loop, your params continue to be updated. In addition, as I have seen above, you use only one model (mod) to fit 4 curves. It means the next curve you fit would take the params of the previous as the initial point. I suppose it is better to use 4 models (mod_1,…, mod_4), one model for one curve and I guess the results would be different. I don’t know the optimal way to reset params, I usually just set model = nothing and assign the model again.

Thanks for the suggestion – I did that, and now it works better. (I need a separate mod_a, …, mod_d as well as par_a,…, par_d, and separate loss functions loss_a, …, loss_d.)

Resetting the model parameters – it is possible to do it as follows, although this is somewhat clumsy (I only show it for mod_a):

mod_a.layers[1].W.data .= rand(5)
mod_a.layers[1].b.data .= rand(5)
mod_a.layers[2].W.data[:] .= rand(5)
mod_a.layers[2].b.data .= rand()

Thanks, indeed interesting. “Thesis opponent” is also a new phrase for me :slight_smile:

The opponent is the person asking questions about your thesis, to whom you have to defend the thesis during your defense.

Only one opponent? Or several? At my place, one tends to have one international (“first opponent”) and at least a second opponent. Some places they have a big group of “sharks”; I’ve heard of one university (or country?) where 6 persons ask intensively … but after one hour, a time keeper rings a bell and the questioning has to stop. Once I had to rent a tailcoat, and borrow a “doctoral hat” – which was required equipment for the opponent…

We have a similar arrangement with one main opponent and a committee of 3-4 members. All committee members, as well as the audience (event is public) are allowed to ask questions until they are happy, with no upper limit on the time. The entire defense usually takes 2.5-3 hrs. No special clothing required :stuck_out_tongue:
