Superexpressive activation functions (paper) and the "Three Hidden Layers Are Enough" paper

I came across superexpressive activation functions (the elementary kind) on arXiv; the paper on them is also available at:

A.
Elementary Superexpressive Activations http://proceedings.mlr.press/v139/yarotsky21a/yarotsky21a.pdf

we prove that the family {sin, arcsin} is superexpressive. We also show that most practical activations (not involving periodic functions) are not superexpressive.

from his website, http://yarotsky.info/

Since I’m kind of obsessed with activation functions, and with whether there can be one best kind, and I had never heard of this kind (the idea is actually from 1999, but the elementary kind is from 2021), I wanted to code it in Julia. Please try it out.

Note: if you read the paper, the formula as printed has a bug (it doesn’t match Fig. 2 and dips into negative values): a “+ 2” term is missing, and sin likely must have meant sinpi. I did contact the author, but he hasn’t yet answered to confirm my fixed version (nor whether the “hard” simplified version keeps the superexpressive property):

function sigma_3(x)
  if abs(x) <= one(x)
    # smooth middle piece; the trailing "+ 2" is my fix (missing in the paper, compare Fig. 2)
    (one(x)/π) * (x*asin(x) + sqrt(one(x) - x^2)) + (3//2)*x + 2
  else
    r = one(x)/x  # reciprocal, reused below
    if x >= one(x)
      # slightly wavy tail for x > 1; sinpi(x), i.e. sin(πx), where the paper writes sin
      7 - 3r + (one(x)/π)*r^2*sinpi(x)
    else
      # x < -1
      -r
    end
  end
end

function sigma_3_hard(x)
  if abs(x) <= one(x)
    # "hard" (piecewise-linear) middle piece replacing the smooth one above
    (3//2)*x + 2.5
  else
    r = one(x)/x
    if x >= one(x)
      7 - 3r + (one(x)/π)*r^2*sinpi(x)
    else
      -r
    end
  end
end
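
If you want to compare the fixed version against Fig. 2 in the paper, broadcasting over a range is enough; here is a quick check (the plotting part assumes Plots.jl, which is just my choice, any plotting package works):

using Plots  # assumed here, not required by the functions above

xs = range(-4, 4; length=801)
plot(xs, sigma_3.(xs), label="sigma_3")
plot!(xs, sigma_3_hard.(xs), label="sigma_3_hard", linestyle=:dash)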

Note, this is one activation function you could use (I do not know if it trains well; it’s very similar to a sigmoid, except for the barely noticeable, but needed, wavy part).

The paper also includes simpler functions that are superexpressive, and references interesting papers, including this one with similar ideas:

B.
Neural Network Approximation: Three Hidden Layers Are Enough https://arxiv.org/pdf/2010.14075.pdf

The most important message of Shen et al. (2021) (and probably also of Yarotsky and Zhevnerchuk (2020)) is that the combination of simple activation functions can create super approximation power. In the Floor-ReLU networks mentioned above, the power of depth is fully reflected in the approximation rate […]. However, the power of width is much weaker and the approximation rate is polynomial in width if depth is fixed. This seems to be inconsistent with recent development of network optimization theory Jacot et al. (2018); Du et al. (2019); Mei et al. (2018); Wu et al. (2018); Chen et al. (2019b); Lu et al. (2020); Luo and Yang (2020), where larger width instead of depth can ease the challenge of highly nonconvex optimization. The mystery of the power of width and depth remains and it motivates us to demonstrate that width can also enable super approximation power when armed with appropriate activation functions. In particular, we explore the floor function, the exponential function (2^x), the step function (1_{x≥0}), or their compositions as activation functions to build fully-connected feed-forward neural networks. These networks are called Floor-Exponential-Step (FLES) networks.

5. Conclusion

This paper has introduced a theoretical framework to show that three hidden layers are enough for neural network approximation to achieve exponential convergence and avoid the curse of dimensionality for approximating functions as general as (Hölder) continuous functions.
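
For concreteness, the three FLES building blocks mentioned in the quote are trivial to write down in Julia; this is just my own illustration of the activations (not of the networks or of the approximation construction in the paper):

# the three activations used in Floor-Exponential-Step (FLES) networks, as I read the quote
floor_act(x) = floor(x)                    # ⌊x⌋
exp2_act(x)  = exp2(x)                     # 2^x
step_act(x)  = x >= 0 ? one(x) : zero(x)   # indicator 1_{x ≥ 0}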


I’ll post one more very intriguing paper:

Two-argument activation functions learn soft XOR operations like cortical neurons https://arxiv.org/pdf/2110.06871.pdf

Neurons in the brain are complex machines with distinct functional compartments that interact nonlinearly. In contrast, neurons in artificial neural networks abstract away this complexity, typically down to a scalar activation function of a weighted sum of inputs. Here we emulate more biologically realistic neurons by learning canonical activation functions with two input arguments, analogous to basal and apical dendrites. We use a network-in-network architecture […]
Remarkably, the resultant nonlinearities often produce soft XOR functions, consistent with recent experimental observations about interactions between inputs in human cortical neurons. When hyperparameters are optimized, networks with these nonlinearities learn faster and perform better than conventional ReLU nonlinearities with matched parameter counts, and they are more robust to natural and adversarial perturbations.

What’s not to like: networks that learn faster and are also more robust. It seems this can’t be a plain activation function that would work as a drop-in replacement for others (unlike the one in my original post). There’s no relation to the papers I posted originally; I just wasn’t sure people liked me posting lots of new threads.

The paper references this Science paper (Gidon et al., 2020): Dendritic action potentials and computation in human layer 2/3 cortical neurons.

I do know XOR and the XOR problem, and while I have some idea, can anyone tell me what a “soft XOR function” would be?
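
My rough guess, and this is purely my own reading, not a definition from the paper: a smooth two-argument function that is large when exactly one of its inputs is “on”, something like

# my guess at a "soft XOR" of two inputs (not from the paper):
# squash each input to (0, 1), then take the probability that exactly one is "on"
logistic(x) = 1 / (1 + exp(-x))
soft_xor(a, b) = logistic(a) + logistic(b) - 2 * logistic(a) * logistic(b)

soft_xor(5, -5)  # ≈ 0.99 (exactly one input "on")
soft_xor(5, 5)   # ≈ 0.01 (both "on")
soft_xor(0, 0)   # = 0.5  (undecided)

Please correct me if that’s off.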

