Understanding Neural Networks (and Lux)

0. Overview

I’m trying to understand neural networks, and Lux in particular, and how they relate to interpolation and least squares regression. So this post is more about fundamental understanding than Lux technicalities.

I find it interesting to consider the relationship between NN and regression, and whether it is possible to speed up the training process.

I have tried to structure my understanding via (1) spline interpolation, (2) interpolation/least squares using the Heaviside and rectified functions as basis functions, (3) using this as a motivation for NNs, (4) the relationship between NNs and linear regression for simple NNs, and (5)-(6) how the choice of initial guess for the NN parameters affects the training.

I close with (7) some questions on my understanding. Hopefully, someone with a deeper understanding of NN can enlighten me.

1. Spline interpolation

Here, I use a simple data set. Standard zero-order and first-order spline interpolation is straightforward:

Next, observing that zero-order and first-order splines can be expressed via differences of Heaviside functions and rectified functions, respectively, we have, for example:
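Below is a minimal sketch of this construction in Julia. The data set `xd`, `yd` is an assumed stand-in for the simple data set above, with `H` and `r` denoting the Heaviside and rectified functions:

```julia
# Zero- and first-order "splines" written as sums of Heaviside steps and
# rectified (ReLU-like) kinks. The toy data below is an assumed stand-in.
H(t) = t >= 0 ? 1.0 : 0.0                     # Heaviside step
r(t) = max(t, 0.0)                            # rectified function

xd = [0.0, 1.0, 2.0, 3.0, 4.0]                # assumed data abscissae
yd = [1.0, 3.0, 2.0, 5.0, 4.0]                # assumed data ordinates

# Zero-order spline: start value plus a jump at each interior knot
spline0(x) = yd[1] + sum((yd[i] - yd[i-1]) * H(x - xd[i]) for i in 2:length(xd))

# First-order spline: start value plus a slope change at each knot
slopes  = diff(yd) ./ diff(xd)                # slope on each interval
dslopes = [slopes[1]; diff(slopes)]           # change in slope at each knot
spline1(x) = yd[1] + sum(dslopes[i] * r(x - xd[i]) for i in eachindex(dslopes))

spline0(2.5), spline1(2.5)                    # piecewise-constant vs. piecewise-linear value
```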

2. Interpolation using Heaviside and rectified function

I can then do “least squares” fitting with the Heaviside function and the rectified function as basis functions (I use the same number of basis functions as data points, and would thus expect interpolation):
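Here is a minimal sketch of this fit, reusing `H`, `r`, `xd`, `yd` from the snippet above. With one basis function per data point the system is square, so solving it reproduces the data exactly. Note that I shift the rectified basis functions one grid spacing to the left of the data points to keep the matrix invertible; that placement is my own choice for the sketch, not necessarily what was used for the fit shown here.

```julia
# Square "least squares" systems: one basis function per data point.
A_h = [H(x - c) for x in xd, c in xd]          # Heaviside basis centred at the data points
h   = xd[2] - xd[1]                            # grid spacing of the toy data
A_r = [r(x - c) for x in xd, c in xd .- h]     # rectified basis, shifted left to avoid a zero column

w_h = A_h \ yd                                 # basis coefficients (Heaviside)
w_r = A_r \ yd                                 # basis coefficients (rectified)

fit_h(x) = sum(w_h[i] * H(x - xd[i])     for i in eachindex(xd))
fit_r(x) = sum(w_r[i] * r(x - xd[i] + h) for i in eachindex(xd))

maximum(abs.(fit_h.(xd) .- yd)), maximum(abs.(fit_r.(xd) .- yd))   # both ≈ 0: interpolation
```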

3. Forward Neural Network

As far as I understand it, a (forward) neural network can be considered a chain of simple layers, where each layer can be written as:

\xi^{(\ell)} = \alpha^{(\ell)}.\left(W^{(\ell)}\xi^{(\ell-1)} + b^{(\ell)} \right)

where \ell is the layer index, \alpha is an activation function, \alpha.() implies broadcasting over the argument array, W is the weight matrix, \xi is the (hidden) layer variable, and b is the bias. With n_\ell layers, the input data is \xi^{(0)} = x and the output data is y=\xi^{(n_\ell)}.
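In plain Julia (my own notation, not Lux internals), a single such layer is just:

```julia
# One layer: affine map followed by broadcasting the activation function
layer(ξ, W, b, α) = α.(W * ξ .+ b)

# e.g. a hidden layer with 3 nodes acting on a scalar input, with tanh activation
W1, b1 = randn(3, 1), randn(3)
ξ1 = layer([0.5], W1, b1, tanh)
```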

With a single layer and identity activation function, we simply have:

y = Wx + b

or an affine model.

With two layers where the output layer has identity activation function, we have:

y = I\cdot\left( W^{(2)} \xi^{(1)} + b^{(2)}\right) = W^{(2)} \alpha^{(1)}.\left(W^{(1)} x + b^{(1)}\right) + b^{(2)}

Obviously, if we fix the parameters of the first layer (W^{(1)}, b^{(1)}), we are left with standard linear regression in the output-layer parameters. Many available activation functions (e.g., in Lux) fall into either the “Heaviside-type” group (S-shaped functions such as sigmoid and tanh) or the “rectified-function type” (ReLU and its variants).
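For reference, here is a hedged Lux sketch of such a two-layer network; the hidden width of 5 and the tanh activation are just example choices, and the output `Dense` layer uses the identity activation by default:

```julia
using Lux, Random

rng   = Random.default_rng()
model = Chain(Dense(1 => 5, tanh),   # hidden layer, "Heaviside-type" activation
              Dense(5 => 1))         # output layer, identity activation by default
ps, st = Lux.setup(rng, model)       # random initial parameters and states

x      = reshape(collect(0.0:0.5:4.0), 1, :)   # Lux expects features × samples
ŷ, st_ = model(x, ps, st)
```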

4. Neural Network and (linear) regression

If I use the same number of internal nodes as I have data points (not realistic in general!), spread the biases b^{(1)} evenly along the range of the x values, choose fixed weights W^{(1)}, and set b^{(2)} = 0, then finding W^{(2)} simply involves solving a single matrix equation, which is very fast. The result is:


where interpolation has been achieved for the “regression” estimator \hat{y}_\mathrm{reg}. This can be compared with the fit obtained when the Lux.setup function is used to make an initial (random) guess for the NN parameters (\hat{y}_0).
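Below is a sketch of this construction in plain Julia, with the same assumed toy data as in the earlier snippets. The logistic (“Heaviside-type”) activation, the centre placement, and the fixed first-layer weight s = 5 are my own example choices; the point is only that the output weights follow from a single linear solve.

```julia
# Regression-style "initialization": fix the first layer, solve for the output weights.
σ(t) = 1 / (1 + exp(-t))                       # logistic ("Heaviside-type") activation

xd = [0.0, 1.0, 2.0, 3.0, 4.0]                 # assumed toy data, as above
yd = [1.0, 3.0, 2.0, 5.0, 4.0]

n  = length(xd)                                # as many hidden nodes as data points
s  = 5.0                                       # assumed fixed first-layer weight
c  = collect(range(minimum(xd), maximum(xd); length = n))   # centres spread over the x range
W1 = fill(s, n, 1)
b1 = -s .* c                                   # so node i "switches" near x = c[i]

Φ  = [σ(s * (x - ci)) for x in xd, ci in c]    # n×n matrix of hidden-layer outputs
w2 = Φ \ yd                                    # output weights from one linear solve (b² = 0)

ŷ_reg(x) = sum(w2[i] * σ(s * (x - c[i])) for i in 1:n)
maximum(abs.(ŷ_reg.(xd) .- yd))                # ≈ 0: interpolation, no training needed
```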

5. Iterating on NN from regression parameters

If I use \hat{y}_\mathrm{reg} as initial guess for iteration of the NN, I get the following evolution of the loss function:


with fit:

In other words: the change is minor.
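For completeness, here is a hedged sketch of how such a training run can be set up with Lux, Optimisers, and Zygote. It is my own minimal loop (with assumed stand-in data and hyperparameters), not the exact code behind the plots above, and the same loop covers the next section by simply keeping the random parameters from Lux.setup instead of replacing them.

```julia
using Lux, Optimisers, Zygote, Random

# Assumed stand-in data and a small two-layer model, as in the sketches above
xd = reshape(collect(0.0:1.0:4.0), 1, :)          # features × samples
yd = reshape([1.0, 3.0, 2.0, 5.0, 4.0], 1, :)

model  = Chain(Dense(1 => 5, tanh), Dense(5 => 1))
ps, st = Lux.setup(Random.default_rng(), model)   # random init; replace ps to start from the regression fit instead

# Plain mean-squared-error training loop with Adam
function train(model, ps, st, x, y; epochs = 5_000, η = 0.01)
    loss(p) = sum(abs2, first(model(x, p, st)) .- y) / length(y)
    opt_state = Optimisers.setup(Adam(η), ps)
    for _ in 1:epochs
        g = Zygote.gradient(loss, ps)[1]          # gradient w.r.t. the parameter NamedTuple
        opt_state, ps = Optimisers.update(opt_state, ps, g)
    end
    return ps, loss(ps)
end

ps_trained, final_loss = train(model, ps, st, xd, yd)
```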

6. Iterating on NN from random initialization

If I instead iterate on the NN from a random initialization of parameters, I get:


with fit:

Obviously, this fit does not achieve “interpolation”.

If I run the training for more epochs:

Even after half a million epochs, interpolation is not achieved, although, as seen above, it is possible.

7. Some questions

  1. In general, the data considered will be noisy, so it doesn’t make sense to strive for “interpolation”; a least squares solution is fine. Does that mean fewer epochs are needed?
  2. For one- or two-layer networks, does it make sense to initialize the NN using linear regression instead of a random initializer? Does such a linear-regression initialization scale well with multiple inputs and outputs?
  3. For “deep learning” with multiple layers, is initialization via regression impractical, or at least more difficult?
  4. Is it possible to say something intelligent about the required number of internal nodes in a two-layer network, based on the data? Clearly, if the data exhibits multiple “humps”/“modes” (e.g., a sine function over a couple of periods), then in linear regression with a bell-shaped basis one must have at least as many basis functions as there are “humps”. Can this be generalized to say that the hidden layer must have at least as many nodes as there are “humps”?

Interesting.

Two points:

  • The number of nodes is related more to the dimension of the data points than to the number of data points.
  • How do the two initialization methods (random vs. regression-based) compare in terms of the loss computed on data unseen during training (test data)? Also, consider comparing with heuristic weight-initialization methods, such as the Xavier one.

Is the Xavier method implemented in Lux.jl?
