I’ve watched the videos on JuliaAcademy (I recommend them), and continued playing around with Flux. To test Flux, I’ve tried to re-create a simple result from:
- Bishop, Chris M. (1994). “Neural networks and their applications”. Rev. Sci. Instrum., Vol. 65, No. 6, pp. 1803–1832.
Bishop uses a feedforward network with 1 input, 1 output, and a single hidden layer of 5 units, and trains the network over 1000 “cycles” using a BFGS quasi-Newton learning algorithm:
Here is my attempt at re-creating Bishop’s results, using the ADAM optimizer and 3000 epochs:
```julia
using Flux, Statistics
using Flux.Tracker   # Tracker is the AD used by older Flux versions (< 0.10)

# Generating 50 data points
x_d = reshape(range(-1, 1, length=50), 1, 50)
y_d_a = x_d.^2
D_a = [(x_d, y_d_a)]

# Model: 1 input, one hidden layer of 5 tanh units, 1 linear output
mod = Chain(Dense(1, 5, tanh), Dense(5, 1))

# Loss/cost function
loss(x, y) = mean((mod(x) .- y).^2)

# Optimization algorithm
opt = ADAM(0.002, (0.99, 0.999))

# Parameters of the model
par = params(mod)

# Running 3000 epochs, then generating the model output
for i in 1:3000
    Flux.train!(loss, par, D_a, opt)
end
y_m_a = Tracker.data(mod(x_d))
```
The results from Flux for Bishop’s four test functions are:
Clearly, Bishop’s results after 1000 “cycles” (= epochs?) are far better than what I get from Flux after 3000 epochs.
- Is the main reason for Bishop’s superior results that he uses a smarter learning algorithm (BFGS)? Would it be possible to augment Flux with BFGS or other methods (e.g., conjugate gradient)? Or would this complicate matters / make it unable to run on GPUs?
- Suppose that I have N data points X = [x1, x2, …, xN] and Y = [y1, y2, …, yN], where xi and yi are vectors; thus X and Y are vectors of vectors. With the latest version of Flux, it seems that X and Y must be reshaped into matrices with dim(X) = dim(xi) × N and dim(Y) = dim(yi) × N. The `Flux.batch` function does this reshaping, but only if dim(xi) > 1; with dim(xi) = 1, `Flux.batch` produces a vector / doesn’t change the argument.
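On the BFGS question: Flux’s built-in optimisers are all first-order, but quasi-Newton and conjugate-gradient methods are available in the separate Optim.jl package. Below is a minimal sketch of fitting the same 1–5–1 tanh network with BFGS; the network and its parameter packing are hand-rolled here so the example doesn’t depend on Flux internals, and Optim.jl is assumed to be installed:

```julia
using Optim  # provides BFGS(), LBFGS(), ConjugateGradient()

# Same data as above, as plain vectors
x = collect(range(-1, 1, length=50))
y = x.^2

# Hand-rolled 1-5-1 tanh network; p packs all 16 parameters:
# 5 input weights, 5 hidden biases, 5 output weights, 1 output bias.
function predict(p, x)
    W1 = p[1:5]; b1 = p[6:10]; W2 = p[11:15]; b2 = p[16]
    [sum(W2 .* tanh.(W1 .* xi .+ b1)) + b2 for xi in x]
end

# Mean-squared-error loss as a function of the parameter vector only
loss(p) = sum(abs2, predict(p, x) .- y) / length(x)

p0 = 0.1 .* randn(16)
res = optimize(loss, p0, BFGS(); autodiff = :forward)
println(Optim.minimum(res))  # final MSE found by BFGS
```

Optim.jl also offers `LBFGS()` and `ConjugateGradient()` as drop-in replacements for `BFGS()`. These methods rely on line searches over a dense parameter vector, so they are typically run on the CPU; hooking them up to a GPU-resident Flux model would take extra work.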
My question is the following… The Flux examples I have seen in the Flux documentation and on JuliaAcademy seem to illustrate training the network by computing the gradient of the loss function `mse(model(xi), yi)`, i.e., at a single data point (xi, yi). I’m used to gradient descent being based on a loss function `mse(model(X), Y)`, i.e., the deviation over the entire batch of data, not just individual data points.
- Does Flux, in fact, base the loss function on the entire batch of data points, i.e., compute the gradient of `mse(model(X), Y)`?
- Or does Flux base the loss function on each data point, i.e., do N updates based on the gradient of `mse(model(xi), yi)`, where i runs from 1 to N?
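For what it’s worth, `Flux.train!(loss, par, data, opt)` performs one gradient step per element of `data`. In the snippet above, `D_a = [(x_d, y_d_a)]` holds the whole 1×50 batch as a single tuple, so each call does one full-batch update of `mse(model(X), Y)`; a collection of N per-point tuples would instead give N updates per call. The two views are related in that the full-batch MSE gradient is exactly the average of the per-point gradients, which a pure-Julia check with a hypothetical scalar linear model `w*x` (analytic gradients, no Flux needed) confirms:

```julia
using Statistics

# Hypothetical scalar linear model y_hat = w * x with MSE loss
X = collect(range(-1, 1, length=50))
Y = X.^2
w = 0.3

# Gradient of the per-point loss (w*xi - yi)^2 w.r.t. w is 2*(w*xi - yi)*xi
per_point_grads = [2 * (w * xi - yi) * xi for (xi, yi) in zip(X, Y)]

# Gradient of the full-batch loss mean((w .* X .- Y).^2) w.r.t. w
batch_grad = mean(2 .* (w .* X .- Y) .* X)

# The full-batch gradient equals the mean of the per-point gradients
isapprox(batch_grad, mean(per_point_grads))  # → true
```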
Btw., I also tried 30 000 epochs in Flux, and then I get the following results: