How to compute TanhExp / TeLU function accurately?

A new paper advocates using the TeLU activation function, x \cdot \tanh (e^x), for neural networks; the same function was also proposed in this paper under the name TanhExp.

The intermediate result e^x grows very fast for large x, even though \tanh (e^x) remains bounded. Would this be a concern for accurate numerical evaluations? If so, is there a better alternative formula for evaluating this function? (The above papers seem to focus on deep learning outcomes rather than the numerical implementation of this function itself.)

P.S. I found out later that the accuracy problem is with the gradient, not the function itself.

FWIW, the authors of the linked paper used a straightforward implementation in PyTorch.

Doesn't seem like it should be an issue. Long before exp(x) overflows to Inf, tanh(x) rounds to 1.0 (and tanh(Inf) == 1.0), so you aren't losing anything to overflow.
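
For instance, in Float32 (this is just a quick sanity check, with the expected values noted in the comments):

telu(x) = x * tanh(exp(x))

tanh(exp(3f0))    # exp(3f0) ≈ 20.1, and tanh of that already rounds to 1.0f0
exp(89f0)         # overflows to Inf32 ...
tanh(exp(89f0))   # ... but tanh(Inf32) == 1.0f0
telu(89f0)        # so the result is still exactly 89.0f0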

On the other hand, you could probably compute it more efficiently by designing your own special-function implementation.

(But I'm guessing that the activation function is not usually performance-critical? On the other hand, the TeLU authors claim that their function is better than some other alternatives like "Mish" and "Smish" in part because the latter have "increased computational complexity." They don't seem to quantify the real-world impact of the activation-function cost, but if this is actually a realistic concern then a specialized implementation would be a good idea.)


For a while, the cost of tanh(::Float32) was most of the cost of running simple neural networks on the CPU. So we wrote a simpler, slightly lower-precision version (a rational approximation found using Remez.jl). I'm sure something similar could be done for this combined function:

julia> telu(x) = x * tanh(exp(x));

julia> using NNlib, BenchmarkTools  # BenchmarkTools for @btime

julia> let N = 1000
         x = randn(Float32, N)
         y = similar(x)
         @btime $y .= tanh.($x)
         @btime $y .= tanh_fast.($x)  # NNlib's simpler version
         @btime $y .= telu.($x)
       end;
  3.203 μs (0 allocations: 0 bytes)
  450.086 ns (0 allocations: 0 bytes)
  14.291 μs (0 allocations: 0 bytes)
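
Even without going as far as a dedicated rational approximation, a cheap shortcut might be to branch on x, since tanh(exp(x)) is already 1.0f0 to Float32 precision once x is modestly large (somewhere around x ≈ 2.2). A rough sketch (the 3f0 cutoff is a guess on my part, and nothing here is benchmarked):

using NNlib: tanh_fast

# tanh(exp(x)) rounds to 1.0f0 well before exp(x) overflows, so the branch
# skips both transcendental calls for larger inputs.
telu_fast(x::Float32) = x > 3f0 ? x : x * tanh_fast(exp(x))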

On the GPU, to my knowledge none of this matters – the computation is free compared to memory / launch costs.


It looks like Zygote has problems computing the gradient at large x, though the original function indeed remains accurate.

julia> using Zygote

julia> telu(x) = x * tanh(exp(x))
telu (generic function with 1 method)

julia> withgradient(telu, 89f0)
(val = 89.0f0, grad = (NaN32,)) # NaN gradient

julia> withgradient(telu, 88f0)
(val = 88.0f0, grad = (1.0f0,)) # gradient OK for smaller x

The NaN gradient problem is gone if I manually switch to ReLU for large x:

telu(x::Float32) = x > 88.7f0 ? x : x * tanh(exp(x))  # exp(x) overflows to Inf32 just above x ≈ 88.72
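
With that method, the problematic input just takes the linear branch, so the gradient comes out as 1 rather than NaN:

withgradient(telu, 89f0)   # expected: (val = 89.0f0, grad = (1.0f0,))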

Zygote may be running into an issue that can also be understood by considering the analytical gradient

\frac{\partial}{\partial x} x \tanh(e^x) = \tanh(e^x) + \frac{x e^x}{\cosh(e^x)^2}

The NaN comes from the second term: once x is large enough, the numerator x e^x overflows to Inf, and by then \cosh(e^x)^2 in the denominator is already Inf, so we get Inf/Inf = NaN. To fix this, notice that the second term is always supposed to be small, so we can write it as a small number divided by a big number: expand the hyperbolic cosine in terms of exponentials, \cosh(y) = (e^y + e^{-y})/2, and fold the e^x from the numerator into the denominator using the exponent rule e^a e^b = e^{a+b}:
\frac{\partial}{\partial x} x \tanh(e^x) = \tanh(e^x) + \frac{x}{\left(\left(e^{e^x - x/2} + e^{-e^x - x/2}\right)/2\right)^2}
You could register the following custom gradient with Zygote:

julia> dtelu(x) = tanh(exp(x)) + x/((exp(exp(x)-x/2) + exp(-exp(x)-x/2))/2)^2
dtelu (generic function with 1 method)

julia> dtelu(300f0)
1.0f0

julia> dtelu(-300f0)
0.0f0

Thanks! Your code makes the gradient more accurate even before the NaN problem kicks in.

julia> using Zygote

julia> import ChainRulesCore: @scalar_rule

julia> telu(x) = x * tanh(exp(x))
telu (generic function with 1 method)

julia> dtelu(x) = tanh(exp(x)) + x/((exp(exp(x)-x/2) + exp(-exp(x)-x/2))/2)^2
dtelu (generic function with 1 method)

julia> @scalar_rule telu(x::Real) dtelu(x)

julia> withgradient(telu, 2.2f0)
(val = 2.2f0, grad = (1.0000012f0,))

Without the custom rule, the last evaluation gives

(val = 2.2f0, grad = (1.0f0,))

which is less accurate.

Maybe worth noting that any neural net is going to broadcast this function, and Zygote's broadcasting uses ForwardDiff not ChainRules internally, so it will miss this rule:

julia> gradient(telu, 2.2f0)  # using rule above
(1.0000012f0,)

julia> gradient(x -> sum(telu.(x)), [2.2f0, 89f0])
(Float32[1.0, NaN],)

NNlib's activations file defines lots of rrule(broadcasted, f, x) methods (at the bottom). Surely this telu could be added if someone is motivated. It could use something like dtelu, but perhaps branching (something like x>4f0 ? 1f0 : ...) would be quicker.
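
A rough sketch of what such a method could look like for this telu, following the same rrule(broadcasted, ...) pattern (untested, and the x > 4 cutoff is just the guess above):

using ChainRulesCore
import Base.Broadcast: broadcasted

telu(x) = x * tanh(exp(x))

# derivative, branching to 1 for large x where both terms have saturated
dtelu(x) = x > 4 ? one(x) :
    tanh(exp(x)) + x / ((exp(exp(x) - x/2) + exp(-exp(x) - x/2)) / 2)^2

function ChainRulesCore.rrule(::typeof(broadcasted), ::typeof(telu), x::AbstractArray{<:Real})
    y = telu.(x)
    telu_broadcast_pullback(ȳ) = (NoTangent(), NoTangent(), unthunk(ȳ) .* dtelu.(x))
    return y, telu_broadcast_pullback
end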


Is this fixable on the Zygote side? The derivative of a broadcast is just the broadcast of the derivative, right?

It's an optimisation: for broadcasts of R → R functions, Zygote switches to forward mode and outsources the work to ForwardDiff. For more general broadcasts (e.g. f.(eachcol(rand(10,10))) where f acts on vectors) there's a path which uses reverse mode, but this is much slower (and I think doesn't work on CuArrays).
