I came across superexpressive activation functions (the elementary kind) on arXiv; the paper is also available at:
A. Elementary Superexpressive Activations, http://proceedings.mlr.press/v139/yarotsky21a/yarotsky21a.pdf
“we prove that the family {sin, arcsin} is superexpressive. We also show that most practical activations (not involving periodic functions) are not superexpressive.”
(from his website, http://yarotsky.info/)
Since I’m kind of obsessed with activation functions, and with whether there can be one best kind, and had never heard of this kind (the idea is actually from 1999, but the elementary kind is from 2021), I wanted to code it in Julia. Please try it out.
Note: the formula in the paper has a bug (it doesn’t match Fig. 2 and dips into negative values); it is missing a “+ 2”, and sin most likely should be sinpi. I contacted the author, but he hasn’t yet answered to confirm my fixed version (nor whether the “hard” simplified version keeps the superexpressive property):
# Fixed version of the sigma_3 activation from the paper
# (with the missing “+ 2” added and sin replaced by sinpi, so it matches Fig. 2).
function sigma_3(x)
    if abs(x) <= one(x)
        # smooth arcsin-based piece on [-1, 1]
        (one(x)/π) * (x*asin(x) + sqrt(one(x) - x^2)) + (3//2)*x + 2
    else
        invx = one(x)/x
        if x >= one(x)
            # decaying wavy piece for x > 1
            7 - 3invx + (one(x)/π)*invx^2*sinpi(x)
        else
            # x < -1: decays toward 0 from above
            -invx
        end
    end
end
# “Hard” simplified version: the arcsin piece on [-1, 1] is replaced by a
# straight line (3/2)x + 5/2 that matches sigma_3 at x = ±1.
function sigma_3_hard(x)
    if abs(x) <= one(x)
        (3//2)*x + 5//2
    else
        invx = one(x)/x
        if x >= one(x)
            7 - 3invx + (one(x)/π)*invx^2*sinpi(x)
        else
            -invx
        end
    end
end
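A quick sanity check I ran (my own, not from the paper): the fixed version is continuous at the breakpoints x = ±1 and never goes negative, which is exactly where the formula as printed went wrong:

# continuity at x = ±1 and non-negativity on [-5, 5] (all of these pass)
@assert sigma_3(1.0) ≈ 4.0 && sigma_3(nextfloat(1.0)) ≈ 4.0
@assert sigma_3(-1.0) ≈ 1.0 && sigma_3(prevfloat(-1.0)) ≈ 1.0
@assert minimum(sigma_3.(-5:0.01:5)) > 0
@assert minimum(sigma_3_hard.(-5:0.01:5)) > 0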
Note, this is one activation function you could use (I do not know if it trains well; it’s very similar to a sigmoid, except for the barely noticeable, but needed, wavy part).
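If you want to try it in an actual network, here’s a minimal sketch using Flux (my own example, nothing from the paper; just a forward pass, assuming Flux is installed). The scalar definition above broadcasts like any built-in activation:

using Flux

# small MLP with sigma_3 as the hidden-layer activation (applied elementwise by Flux)
model = Chain(Dense(2 => 16, sigma_3), Dense(16 => 1))
x = randn(Float32, 2, 8)   # 8 random 2-dimensional inputs (features × batch)
y = model(x)               # forward pass; size(y) == (1, 8)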
The paper also includes simpler functions that are superexpressive, and references interesting papers, including this one with similar ideas:
B. Neural Network Approximation: Three Hidden Layers Are Enough, https://arxiv.org/pdf/2010.14075.pdf
The most important message of Shen et al. (2021) (and probably also of Yarotsky and Zhevnerchuk (2020)) is that the combination of simple activation functions can create super approximation power. In the Floor-ReLU networks mentioned above, the power of depth is fully reflected in the approximation rate […]. However, the power of width is much weaker and the approximation rate is polynomial in width if depth is fixed. This seems to be inconsistent with recent development of network optimization theory Jacot et al. (2018); Du et al. (2019); Mei et al. (2018); Wu et al. (2018); Chen et al. (2019b); Lu et al. (2020); Luo and Yang (2020), where larger width instead of depth can ease the challenge of highly nonconvex optimization. The mystery of the power of width and depth remains and it motivates us to demonstrate that width can also enable super approximation power when armed with appropriate activation functions. In particular, we explore the floor function, the exponential function (2^x), the step function (1_{x≥0}), or their compositions as activation functions to build fully-connected feed-forward neural networks. These networks are called Floor-Exponential-Step (FLES) networks.
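Just to make that notation concrete, the three FLES building blocks are easy to write down in Julia (my transcription of the quoted description, not code from that paper):

fles_floor(x) = floor(x)                         # the floor function ⌊x⌋
fles_exp(x)   = exp2(x)                          # the exponential function 2^x
fles_step(x)  = x >= zero(x) ? one(x) : zero(x)  # the step function 1_{x ≥ 0}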
From its conclusion (section 5):
This paper has introduced a theoretical framework to show that three hidden layers are enough for neural network approximation to achieve exponential convergence and avoid the curse of dimensionality for approximating functions as general as (Hölder) continuous functions.