A new paper advocates using the TeLU activation function, x \cdot \tanh(e^x), for neural networks; the same function was also proposed in this paper under the name TanhExp.
The intermediate result e^x grows very fast for large x, even though \tanh(e^x) remains bounded. Is this a concern for accurate numerical evaluation? If so, is there a better alternative formula for evaluating this function? (The papers above seem to focus on deep-learning outcomes rather than on the numerical implementation of this function itself.)
P.S. I found out later that the accuracy problem is with the gradient, not the function itself.
Doesn't seem like it should be an issue. Long before exp(x) overflows to Inf, tanh(exp(x)) has already rounded to 1.0 (and tanh(Inf) == 1.0), so you aren't losing anything to overflow.
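For concreteness, a quick Float32 sanity check along these lines (using just the direct definition of the function) illustrates the saturation kicking in well before the overflow:

julia> telu(x) = x * tanh(exp(x));

julia> tanh(20f0)   # tanh(y) already rounds to 1.0f0 for moderate y, far below overflow
1.0f0

julia> exp(89f0)    # exp(x) overflows near x ≈ 88.7 in Float32
Inf32

julia> telu(89f0)   # tanh(Inf) == 1.0, so the overflowed intermediate is harmless
89.0f0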
On the other hand, you could probably compute it more efficiently by designing your own special-function implementation.
(But I'm guessing that the activation function is not usually performance-critical? On the other hand, the TeLU authors claim that their function is better than some other alternatives like "Mish" and "Smish" in part because the latter have "increased computational complexity." They don't seem to quantify the real-world impact of the activation-function cost, but if this is actually a realistic concern then a specialized implementation would be a good idea.)
For a while, the cost of tanh(::Float32) was most of the cost of running simple neural networks on the CPU. So we wrote a simpler, slightly lower-precision version (a rational approximation found using Remez.jl). I'm sure something similar could be done for this combined function:
julia> telu(x) = x * tanh(exp(x));
julia> using NNlib, BenchmarkTools  # @btime comes from BenchmarkTools
julia> let N = 1000
           x = randn(Float32, N)
           y = similar(x)
           @btime $y .= tanh.($x)
           @btime $y .= tanh_fast.($x)  # NNlib's simpler version
           @btime $y .= telu.($x)
       end;
3.203 μs (0 allocations: 0 bytes)
450.086 ns (0 allocations: 0 bytes)
14.291 μs (0 allocations: 0 bytes)
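A minimal sketch of what a cheaper combined version could look like, simply reusing NNlib's existing tanh_fast (the name telu_fast is mine, and a production version would presumably want its own Remez-style approximation of the whole function, with the accuracy checked):

julia> using NNlib: tanh_fast

julia> telu_fast(x) = x * tanh_fast(exp(x));  # hypothetical cheaper variant, accuracy unverified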
On the GPU, to my knowledge none of this matters: the computation is free compared to memory / launch costs.
It looks like Zygote has problems computing the gradient at large x, though the original function indeed remains accurate.
julia> using Zygote
julia> telu(x) = x * tanh(exp(x))
telu (generic function with 1 method)
julia> withgradient(telu, 89f0)
(val = 89.0f0, grad = (NaN32,)) # NaN gradient
julia> withgradient(telu, 88f0)
(val = 88.0f0, grad = (1.0f0,)) # gradient OK for smaller x
Zygote may be running into an issue that can also be understood by considering the analytical gradient:
\frac{\partial}{\partial x} x \tanh(e^x) = \tanh(e^x) + \frac{x e^x}{\cosh(e^x)^2}
The NaN comes from the second term: once e^x overflows, the numerator x e^x becomes Inf, and \cosh(e^x)^2 in the denominator is also Inf, so the term evaluates to Inf/Inf, which gives NaN. To fix this, notice that the second term is always supposed to be small, so we can write it as a small number divided by a big number: expand the hyperbolic cosine in terms of exponentials, \cosh(y) = (e^y + e^{-y})/2, and absorb the e^x from the numerator into the denominator using the exponent rule e^a e^b = e^{a+b}:
\frac{\partial}{\partial x} x \tanh(e^x) = \tanh(e^x) + \frac{x}{\left(\left(e^{e^x - x/2} + e^{-e^x - x/2}\right)/2\right)^2}
You could register the following custom gradient with Zygote:
julia> dtelu(x) = tanh(exp(x)) + x/((exp(exp(x)-x/2) + exp(-exp(x)-x/2))/2)^2
dtelu (generic function with 1 method)
julia> dtelu(300f0)
1.0f0
julia> dtelu(-300f0)
0.0f0
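To actually register it, one option is a scalar ChainRulesCore.rrule, which Zygote picks up; this is a sketch under that assumption (see the next reply for why the broadcasted case may still need its own rule):

julia> using ChainRulesCore

julia> function ChainRulesCore.rrule(::typeof(telu), x::Real)
           telu_pullback(Δ) = (NoTangent(), Δ * dtelu(x))  # use the overflow-safe derivative
           return telu(x), telu_pullback
       end;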
Maybe worth noting that any neural net is going to broadcast this function, and Zygote's broadcasting uses ForwardDiff rather than ChainRules internally, so it will miss this rule.
NNlib's activations file defines lots of rrule(broadcasted, f, x) methods (at the bottom). Surely this telu could be added if someone is motivated. It could use something like dtelu, but perhaps branching (something like x>4f0 ? 1f0 : ...) would be quicker.
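A rough sketch of what such a broadcasted rule might look like, written by hand here rather than through NNlib's own machinery; the helper dtelu_branch and the 4f0 cutoff are illustrative guesses, not tuned or benchmarked:

julia> using ChainRulesCore

julia> import Base.Broadcast: broadcasted

julia> dtelu_branch(x) = x > 4f0 ? one(x) : dtelu(x);  # hypothetical branch: derivative is 1 to Float32 accuracy for large x

julia> function ChainRulesCore.rrule(::typeof(broadcasted), ::typeof(telu), x::AbstractArray)
           telu_bc_pullback(Δ) = (NoTangent(), NoTangent(), unthunk(Δ) .* dtelu_branch.(x))
           return telu.(x), telu_bc_pullback
       end;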
It's an optimisation: for broadcasts of R → R functions it switches to forward mode and outsources this to ForwardDiff. For more general broadcasts (e.g. f.(eachcol(rand(10,10))) where f acts on vectors) there's a path which uses reverse mode, but this is much slower (and I think doesn't work on CuArrays).