I’ve watched videos on JuliaAcademy (I recommend them) and continued playing around with Flux. To test Flux, I’ve tried to re-create simple results from:

- Bishop, Chris M. (1994). “Neural networks and their applications”. Rev. Sci. Instrum., Vol 65, no. 6, pp. 1803–1832.

Bishop uses a feedforward network with 1 input, 1 output, and a single hidden layer of 5 units, and trains the network over 1000 “cycles” using a BFGS quasi-Newton learning algorithm:

Here is my attempt at re-creating Bishop’s results, using the ADAM algorithm and 3000 epochs:

```julia
using Flux, Statistics
using Flux: Tracker   # pre-Zygote Flux: model outputs are tracked arrays

# Generate 50 data points on [-1, 1]
x_d = reshape(range(-1, 1, length=50), 1, 50)
y_d_a = x_d.^2
# Dataset: a single (input, output) pair covering the whole batch
D_a = [(x_d, y_d_a)]

# Model: 1 input, one hidden layer of 5 tanh units, 1 linear output
mod = Chain(Dense(1, 5, tanh), Dense(5, 1))
# Loss/cost function: mean squared error over the batch
loss(x, y) = mean((mod(x) .- y).^2)
# Optimization algorithm
opt = ADAM(0.002, (0.99, 0.999))
# Parameters of the model
par = params(mod);

# Run 3000 epochs, then generate the model output
for i in 1:3000
    Flux.train!(loss, par, D_a, opt)
end
y_m_a = Tracker.data(mod(x_d));
```

The results in Flux using the four functions of Bishop are:

Clearly, Bishop’s results are far better using 1000 “cycles” (= epochs?) than what I get with Flux using 3000 epochs.

Questions:

- Is the main reason for Bishop’s superior results that he uses a smarter learning algorithm (BFGS)? Would it be possible to augment Flux with BFGS, or with other methods (conjugate gradient)? Or would this complicate matters / make it unable to run on GPUs?
- Suppose I have N data points X = [x1, x2, …, xN] and Y = [y1, y2, …, yN], where xi and yi are vectors; thus X and Y are vectors of vectors. With the latest version of Flux, it seems that X and Y must be reshaped into matrices with dim(X) = dim(xi) × N and dim(Y) = dim(yi) × N. The `Flux.batch` function does this reshaping, but only if dim(xi) > 1; with dim(xi) = 1, `Flux.batch` produces a vector / doesn’t change the argument.

  My *question* is the following… The Flux examples I have seen in the Flux documentation and in JuliaAcademy seem to illustrate training the network by computing the gradient of the loss function `mse(model(xi),yi)`, i.e., at a single data point (xi, yi). I’m used to gradient descent being based on a loss function `mse(model(X),Y)`, i.e., on the deviation from the entire *batch* of data, not just on individual data points.

  - Does Flux, in fact, base the loss function on the entire batch of data points, i.e., compute the gradient of `mse(model(X),Y)`? Or…
  - Does Flux base the loss function on each data point, i.e., perform N updates based on the gradient of `mse(model(xi),yi)`, where i runs from 1 to N?
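As an aside on the reshaping itself: independently of `Flux.batch`, stacking N length-d sample vectors into the d × N matrix that Flux's `Dense` layers expect can be done in plain Julia with `reduce(hcat, …)`. A minimal sketch (the variable names and sizes are made up for illustration):

```julia
# Stack N sample vectors (each of length d) into a d × N matrix,
# the batched column-per-sample layout that Flux's Dense layers expect.
xs = [rand(3) for _ in 1:5]   # N = 5 samples, each a length-3 (d = 3) vector
X  = reduce(hcat, xs)         # 3×5 matrix; column i is sample i
```

This works the same way for d = 1, where each sample is a length-1 vector.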
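To make the distinction in the last two bullets concrete, here is a toy example in plain Julia (no Flux; the linear "model" and the numbers are invented) showing that, when all samples have the same dimension, the full-batch MSE equals the mean of the per-sample MSEs. Note, however, that one gradient step on the batch loss is not the same as N successive per-sample gradient steps:

```julia
# Toy linear model with deliberately wrong parameters
W, b = 1.5, 0.0
model(x) = W .* x .+ b
mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)

X = reshape(collect(1.0:5.0), 1, 5)   # N = 5 samples of dimension 1, as a 1×5 matrix
Y = 2 .* X .+ 1                        # targets

L_batch = mse(model(X), Y)                            # one loss over the whole batch
L_each  = [mse(model(X[:, i]), Y[:, i]) for i in 1:5] # one loss per data point
# L_batch equals the mean of L_each
```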

Btw., I also tried 30 000 epochs in Flux, and then got the following results: