Unstable learning of neural net with few nodes using Flux

I have a neural network with only one binary input node, one hidden layer and one output node. When I try to train the network, such that it outputs the value x_1 when the input is 0 and x_2 when the input is 1, it will sometimes work and sometimes not. I use the Flux library in Julia:

using Flux
using Flux.Optimise: update!

function train_v(v, initial_lr, target1, target2)
	for i in 1:1000
		lr = initial_lr * (1 - i/1000)
		update_v(v, [0], target1, lr)
		update_v(v, [1], target2, lr)
	end
end

function update_v(v, input, target, lr)
	ps = params(v)
	gs = gradient(ps) do
		Flux.Losses.mae(v(input), target)
	end
	update!(Descent(lr), ps, gs)
end

function test(target1, target2, nodes=4, initial_lr=0.1, print=true, return_v=false)
	v = Chain(Dense(1, nodes, relu), Dense(nodes, 1))
	if print
		println("Before training: v(0):", v([0])[1], " / v(1):", v([1])[1])
		#for i in 1:3
		#	println(params(v)[i])
		#end
	end
	train_v(v, initial_lr, target1, target2)
	if print
		println("After training: v(0):", v([0])[1], " / v(1):", v([1])[1])
		#for i in 1:3
		#	println(params(v)[i])
		#end
	end
	if return_v
		return v
	end
end

Running, for example, test(1,2) will sometimes work just fine and result in

julia> test(1,2)
Before training: v(0):0.0 / v(1):-0.35872597
After training: v(0):0.99979866 / v(1):1.9998001

and sometimes not work as expected and result in outputting the same values for both inputs

julia> test(1,2)
Before training: v(0):0.0 / v(1):-0.33133784
After training: v(0):1.1958001 / v(1):1.1958001

Observations I made:

It is working properly more often, if…

  • the absolute values of x_1 and x_2 are rather small.
  • the difference between x_1 and x_2 is rather small.
  • I use more nodes in the hidden layer.

Especially using a lot of nodes seems to make it work consistently. This one, however, I find especially counterintuitive, because I thought that for non-complex functions it is better to use few nodes. Even with just one node in the hidden layer it is easy to find weights such that the function is exact. Can someone explain this behaviour to me? The function I used for testing different settings was:

function get_success_probability(target1, target2, nodes=4, initial_lr=0.1, n=100, epsilon=abs(target1-target2)*0.1)
	success_count = 0
	for i in 1:n
		v = test(target1, target2, nodes, initial_lr, false, true)
		if abs(v([0])[1]-target1)<epsilon && abs(v([1])[1]-target2)<epsilon
			success_count += 1
		end
	end
	return success_count/n
end

I use Julia v1.6.2.

Hi @Handam ,

are you sure your training loop guarantees convergence?
Have you plotted learning curves (loss vs iteration) and compared them between runs? I think that should shed some light on what’s going on.

Also note (not sure whether that was unclear) that the weights in the Dense layers are initialized randomly, which is the reason for the non-determinism between runs that you observe. (You can already see that fact if you compare the “Before training” outputs of your two runs).
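In case it is useful: runs can be made comparable while debugging by fixing the seed of Julia's global RNG before building the model, since Flux (as far as I know) draws its initial Dense weights from that RNG by default. A minimal sketch, with the Flux line left as a comment:

```julia
using Random

# Flux initializes Dense weights from Julia's global RNG, so fixing the
# seed before constructing the Chain makes the initial weights, and
# hence the whole training run, reproducible:
Random.seed!(1234)
# v = Chain(Dense(1, 4, relu), Dense(4, 1))   # same initial weights every time

# The underlying mechanism, shown without Flux: re-seeding with the
# same value makes the random draws identical.
Random.seed!(1234); a = rand(4)
Random.seed!(1234); b = rand(4)
a == b   # true
```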

First of all, thanks for the answer.
I am aware of the fact that the non-determinism comes from the random initial weights. However, I was not able to find a pattern for which initial weights it is converging and for which it is not.

I don’t know if my learning rate decay guarantees convergence, but I don’t see why it should ever result in outputting the same values for both binary inputs. Using a more “traditional” decay like \frac{\alpha}{episode} results in the same issue.

sorry, I missed that point (even though you explicitly mentioned it).
However I still think you’ll probably easily understand what’s going on, once you plot things.
(Again something you might already know, but just in case: GitHub - JuliaLogging/TensorBoardLogger.jl: Easy peasy logging to TensorBoard with Julia)

A plot when it is not working:

So it jumps to the final value within the first training loop, which I have to admit is interesting! However, I still fail to see the reason why this is happening for some initial weights :see_no_evil:

Dead ReLUs, maybe.
Plotting the loss, the gradients, and histograms over the weights will give you more clues.
Try ELUs instead (they should not die) and/or some regularization.
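To illustrate why ELUs should not die, here is a plain-Julia sketch (no Flux needed; `myrelu`/`myelu` are hand-written stand-ins for the activations Flux provides, with α = 1 for the ELU) comparing the two derivatives at a negative pre-activation:

```julia
# Hand-written activations and their derivatives, for illustration only.
myrelu(x) = max(zero(x), x)
drelu(x)  = x > 0 ? one(x) : zero(x)

myelu(x, α=1.0) = x > 0 ? x : α * (exp(x) - 1)
delu(x, α=1.0)  = x > 0 ? one(x) : α * exp(x)

z = -0.7          # a negative pre-activation
drelu(z)          # 0.0: a ReLU unit receives no gradient here and stays dead
delu(z)           # exp(-0.7) > 0: an ELU unit can still recover during training
```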

Just a side note:
You’re calling params() and Descent() on each iteration (actually even twice per iteration).
While that’s not wrong, it is still unnecessary (and thus inefficient).

Yep, exactly what’s happening:

julia> v[1].weight
4×1 Matrix{Float32}:

julia> v[1].bias
4-element Vector{Float32}:

If you look directly at the weights of the first layer, you see that for input ∈ [0, 1], the pre-activation that goes into the ReLU is always negative. The output of the ReLU is therefore zero, and hence the gradient with respect to the parameters is always zero as well.
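The mechanism can be checked in a few lines of plain Julia, using hypothetical negative parameters (the actual values are run-dependent, since initialization is random):

```julia
relu(x) = max(zero(x), x)

w, b = -0.5f0, -0.1f0          # hypothetical "dead" weight and bias
for input in (0f0, 1f0)
    z = w * input + b          # pre-activation: negative for both inputs
    println(relu(z))           # 0.0 in both cases, so no gradient flows back
end
```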

As suggested above: A different activation function and/or regularization will help!


Thanks for all your answers! They helped a lot; everything is clear to me now! And also thanks for the hint about calling params() more often than necessary: I checked it with @btime and it does indeed make a difference! So the best solution is to pass ps = params(v) as an argument to the update_v() function, right?


yes that would be reasonable
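One possible refactor along those lines, a sketch reusing the thread's own names: both params() and the optimiser are hoisted out of the loop, and the decay mutates Descent's eta field instead of constructing a new optimiser each step.

```julia
using Flux
using Flux.Optimise: update!

function train_v(v, initial_lr, target1, target2)
	ps = params(v)              # collect the parameters once
	opt = Descent(initial_lr)   # one optimiser object, reused every step
	for i in 1:1000
		opt.eta = initial_lr * (1 - i/1000)   # learning-rate decay
		update_v(v, ps, opt, [0], target1)
		update_v(v, ps, opt, [1], target2)
	end
end

function update_v(v, ps, opt, input, target)
	gs = gradient(ps) do
		Flux.Losses.mae(v(input), target)
	end
	update!(opt, ps, gs)
end
```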