How to compute TanhExp / TeLU function accurately?

A new paper advocates using the TeLU activation function, x \cdot \tanh (e^x), for neural networks; the same function was also proposed in this paper under the name TanhExp.

The intermediate result e^x grows very fast for large x, even though \tanh (e^x) remains bounded. Would this be a concern for accurate numerical evaluations? If so, is there a better alternative formula for evaluating this function? (The above papers seem to focus on deep learning outcomes rather than the numerical implementation of this function itself.)

P.S. I found out later that the accuracy problem is with the gradient, not the function itself.

FWIW, the authors of the linked paper used a straightforward implementation in PyTorch.

Doesn't seem like it should be an issue. Long before exp(x) overflows to Inf, tanh(x) rounds to 1.0 (and tanh(Inf) == 1.0), so you aren't losing anything to overflow.
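
For instance, in Float32 (this is just a quick sanity check, with the expected values noted in the comments):

telu(x) = x * tanh(exp(x))

tanh(exp(3f0))    # exp(3f0) ≈ 20.1, and tanh of that already rounds to 1.0f0
exp(89f0)         # overflows to Inf32 ...
tanh(exp(89f0))   # ... but tanh(Inf32) == 1.0f0
telu(89f0)        # so the result is still exactly 89.0f0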

On the other hand, you could probably compute it more efficiently by designing your own special-function implementation.

(But I'm guessing that the activation function is not usually performance-critical? On the other hand, the TeLU authors claim that their function is better than some other alternatives like "Mish" and "Smish" in part because the latter have "increased computational complexity." They don't seem to quantify the real-world impact of the activation-function cost, but if this is actually a realistic concern then a specialized implementation would be a good idea.)


For a while, the cost of tanh(::Float32) was most of the cost of running simple neural networks on the CPU. So we wrote a simpler, slightly lower-precision version (a rational approximation found using Remez.jl). I'm sure something similar could be done for this combined function:

julia> telu(x) = x * tanh(exp(x));

julia> using NNlib, BenchmarkTools  # BenchmarkTools for @btime

julia> let N = 1000
         x = randn(Float32, N)
         y = similar(x)
         @btime $y .= tanh.($x)
         @btime $y .= tanh_fast.($x)  # NNlib's simpler version
         @btime $y .= telu.($x)
       end;
  3.203 μs (0 allocations: 0 bytes)
  450.086 ns (0 allocations: 0 bytes)
  14.291 μs (0 allocations: 0 bytes)
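
Even without going as far as a dedicated rational approximation, a cheap shortcut might be to branch on x, since tanh(exp(x)) is already 1.0f0 to Float32 precision once x is modestly large (somewhere around x ≈ 2.2). A rough sketch (the 3f0 cutoff is a guess on my part, and nothing here is benchmarked):

using NNlib: tanh_fast

# tanh(exp(x)) rounds to 1.0f0 well before exp(x) overflows, so the branch
# skips both transcendental calls for larger inputs.
telu_fast(x::Float32) = x > 3f0 ? x : x * tanh_fast(exp(x))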

On the GPU, to my knowledge none of this matters – the computation is free compared to memory / launch costs.


It looks like Zygote has problems computing the gradient at large x, though the original function indeed remains accurate.

julia> using Zygote

julia> telu(x) = x * tanh(exp(x))
telu (generic function with 1 method)

julia> withgradient(telu, 89f0)
(val = 89.0f0, grad = (NaN32,)) # NaN gradient

julia> withgradient(telu, 88f0)
(val = 88.0f0, grad = (1.0f0,)) # gradient OK for smaller x

The NaN gradient problem is gone if I manually switch to ReLU for large x:

telu(x::Float32) = x > 88.7f0 ? x : x * tanh(exp(x))  # exp(x) overflows to Inf32 just above x ≈ 88.72
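
With that method, the problematic input just takes the linear branch, so the gradient comes out as 1 rather than NaN:

withgradient(telu, 89f0)   # expected: (val = 89.0f0, grad = (1.0f0,))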

Zygote may be running into an issue that can also be understood by considering the analytical gradient

\frac{\partial}{\partial x} x \tanh(e^x) = \tanh(e^x) + \frac{x e^x}{\cosh(e^x)^2}

The NaN comes from the second term: once x is large enough, the numerator x e^x overflows to Inf, and by then \cosh(e^x)^2 in the denominator is already Inf, so we get Inf/Inf = NaN. To fix this, notice that the second term is always supposed to be small, so we can write it as a small number divided by a big number: expand the hyperbolic cosine in terms of exponentials, \cosh(y) = (e^y + e^{-y})/2, and fold the e^x from the numerator into the denominator using the exponent rule e^a e^b = e^{a+b}:
\frac{\partial}{\partial x} x \tanh(e^x) = \tanh(e^x) + \frac{x}{\left(\left(e^{e^x - x/2} + e^{-e^x - x/2}\right)/2\right)^2}
You could register the following custom gradient with Zygote:

julia> dtelu(x) = tanh(exp(x)) + x/((exp(exp(x)-x/2) + exp(-exp(x)-x/2))/2)^2
dtelu (generic function with 1 method)

julia> dtelu(300f0)
1.0f0

julia> dtelu(-300f0)
0.0f0

Thanks! Your code makes the gradient more accurate even before the NaN problem kicks in.

julia> using Zygote

julia> import ChainRulesCore: @scalar_rule

julia> telu(x) = x * tanh(exp(x))
telu (generic function with 1 method)

julia> dtelu(x) = tanh(exp(x)) + x/((exp(exp(x)-x/2) + exp(-exp(x)-x/2))/2)^2
dtelu (generic function with 1 method)

julia> @scalar_rule telu(x::Real) dtelu(x)

julia> withgradient(telu, 2.2f0)
(val = 2.2f0, grad = (1.0000012f0,))

Without the custom rule, the last evaluation gives

(val = 2.2f0, grad = (1.0f0,))

which is less accurate.

Maybe worth noting that any neural net is going to broadcast this function, and Zygote's broadcasting uses ForwardDiff not ChainRules internally, so it will miss this rule:

julia> gradient(telu, 2.2f0)  # using rule above
(1.0000012f0,)

julia> gradient(x -> sum(telu.(x)), [2.2f0, 89f0])
(Float32[1.0, NaN],)

NNlib's activations file defines lots of rrule(broadcasted, f, x) methods (at the bottom). Surely this telu could be added if someone is motivated. It could use something like dtelu, but perhaps branching (something like x>4f0 ? 1f0 : ...) would be quicker.
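
A rough sketch of what such a method could look like for this telu, following the same rrule(broadcasted, ...) pattern (untested, and the x > 4 cutoff is just the guess above):

using ChainRulesCore
import Base.Broadcast: broadcasted

telu(x) = x * tanh(exp(x))

# derivative, branching to 1 for large x where both terms have saturated
dtelu(x) = x > 4 ? one(x) :
    tanh(exp(x)) + x / ((exp(exp(x) - x/2) + exp(-exp(x) - x/2)) / 2)^2

function ChainRulesCore.rrule(::typeof(broadcasted), ::typeof(telu), x::AbstractArray{<:Real})
    y = telu.(x)
    telu_broadcast_pullback(ȳ) = (NoTangent(), NoTangent(), unthunk(ȳ) .* dtelu.(x))
    return y, telu_broadcast_pullback
end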


Is this fixable on the Zygote side? The derivative of a broadcast is just the broadcast of the derivative, right?

It's an optimisation: for broadcasts of R → R functions, Zygote switches to forward mode and outsources the work to ForwardDiff. For more general broadcasts (e.g. f.(eachcol(rand(10,10))) where f acts on vectors) there's a path which uses reverse mode, but this is much slower (and I think doesn't work on CuArrays).
