Accurate derivative for tanh

When debugging a computation, I found that the derivative of tanh as defined in DiffRules.jl suffers from catastrophic cancellation even for fairly mild arguments (e.g. for |x| > 20; examples below).

I propose a simple fix, but would like to solicit alternative ideas before making a PR. Recall that

\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}

and its derivative is

\tanh'(x) = \frac{4 e^{2x}}{(e^{2x} + 1)^2}
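
(For reference, the familiar forms are algebraically identical to this expression:

1 - \tanh(x)^2 = \mathrm{sech}(x)^2 = \frac{4 e^{2x}}{(e^{2x} + 1)^2}

so any differences between the formulations below are purely floating-point effects.)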
dth1(x) = 1 - abs2(tanh(x))      # what we now have in DiffRules
dth0(x) = oftype(float(x), dth1(BigFloat(x))) # more precise calculation
function dth2(x)                              # proposed fix
    z = 2*x
    ez = exp(z)
    abs(z) > 0.5 ? 4 / (ez + 2 + exp(-z)) : 4 * ez / abs2(1 + ez)
end

using PrettyTables              # examples
tab = [(; x, d0 = dth0(x), d1 = dth1(x), d2 = dth2(x)) for x in 0:40];
pretty_table(IOContext(stdout, :limit => false), tab)

┌───────┬─────────────┬─────────────┬─────────────┐
β”‚     x β”‚          d0 β”‚          d1 β”‚          d2 β”‚
β”‚ Int64 β”‚     Float64 β”‚     Float64 β”‚     Float64 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚     0 β”‚         1.0 β”‚         1.0 β”‚         1.0 β”‚
β”‚     1 β”‚    0.419974 β”‚    0.419974 β”‚    0.419974 β”‚
β”‚     2 β”‚   0.0706508 β”‚   0.0706508 β”‚   0.0706508 β”‚
β”‚     3 β”‚  0.00986604 β”‚  0.00986604 β”‚  0.00986604 β”‚
β”‚     4 β”‚  0.00134095 β”‚  0.00134095 β”‚  0.00134095 β”‚
β”‚     5 β”‚ 0.000181583 β”‚ 0.000181583 β”‚ 0.000181583 β”‚
β”‚     6 β”‚  2.45765e-5 β”‚  2.45765e-5 β”‚  2.45765e-5 β”‚
β”‚     7 β”‚  3.32611e-6 β”‚  3.32611e-6 β”‚  3.32611e-6 β”‚
β”‚     8 β”‚  4.50141e-7 β”‚  4.50141e-7 β”‚  4.50141e-7 β”‚
β”‚     9 β”‚  6.09199e-8 β”‚  6.09199e-8 β”‚  6.09199e-8 β”‚
β”‚    10 β”‚  8.24461e-9 β”‚  8.24461e-9 β”‚  8.24461e-9 β”‚
β”‚    11 β”‚  1.11579e-9 β”‚  1.11579e-9 β”‚  1.11579e-9 β”‚
β”‚    12 β”‚ 1.51005e-10 β”‚ 1.51005e-10 β”‚ 1.51005e-10 β”‚
β”‚    13 β”‚ 2.04364e-11 β”‚ 2.04363e-11 β”‚ 2.04364e-11 β”‚
β”‚    14 β”‚ 2.76576e-12 β”‚ 2.76579e-12 β”‚ 2.76576e-12 β”‚
β”‚    15 β”‚ 3.74305e-13 β”‚ 3.74367e-13 β”‚ 3.74305e-13 β”‚
β”‚    16 β”‚ 5.06567e-14 β”‚ 5.06262e-14 β”‚ 5.06567e-14 β”‚
β”‚    17 β”‚ 6.85563e-15 β”‚ 6.88338e-15 β”‚ 6.85563e-15 β”‚
β”‚    18 β”‚ 9.27809e-16 β”‚ 8.88178e-16 β”‚ 9.27809e-16 β”‚
β”‚    19 β”‚ 1.25565e-16 β”‚ 2.22045e-16 β”‚ 1.25565e-16 β”‚
β”‚    20 β”‚ 1.69934e-17 β”‚         0.0 β”‚ 1.69934e-17 β”‚
β”‚    21 β”‚ 2.29981e-18 β”‚         0.0 β”‚ 2.29981e-18 β”‚
β”‚    22 β”‚ 3.11245e-19 β”‚         0.0 β”‚ 3.11245e-19 β”‚
β”‚    23 β”‚ 4.21225e-20 β”‚         0.0 β”‚ 4.21225e-20 β”‚
β”‚    24 β”‚ 5.70066e-21 β”‚         0.0 β”‚ 5.70066e-21 β”‚
β”‚    25 β”‚   7.715e-22 β”‚         0.0 β”‚   7.715e-22 β”‚
β”‚    26 β”‚ 1.04411e-22 β”‚         0.0 β”‚ 1.04411e-22 β”‚
β”‚    27 β”‚ 1.41305e-23 β”‚         0.0 β”‚ 1.41305e-23 β”‚
β”‚    28 β”‚ 1.91236e-24 β”‚         0.0 β”‚ 1.91236e-24 β”‚
β”‚    29 β”‚ 2.58809e-25 β”‚         0.0 β”‚ 2.58809e-25 β”‚
β”‚    30 β”‚  3.5026e-26 β”‚         0.0 β”‚  3.5026e-26 β”‚
β”‚    31 β”‚ 4.74026e-27 β”‚         0.0 β”‚ 4.74026e-27 β”‚
β”‚    32 β”‚ 6.41524e-28 β”‚         0.0 β”‚ 6.41524e-28 β”‚
β”‚    33 β”‚ 8.68209e-29 β”‚         0.0 β”‚ 8.68209e-29 β”‚
β”‚    34 β”‚ 1.17499e-29 β”‚         0.0 β”‚ 1.17499e-29 β”‚
β”‚    35 β”‚ 1.59018e-30 β”‚         0.0 β”‚ 1.59018e-30 β”‚
β”‚    36 β”‚ 2.15207e-31 β”‚         0.0 β”‚ 2.15207e-31 β”‚
β”‚    37 β”‚ 2.91252e-32 β”‚         0.0 β”‚ 2.91252e-32 β”‚
β”‚    38 β”‚ 3.94166e-33 β”‚         0.0 β”‚ 3.94166e-33 β”‚
β”‚    39 β”‚ 5.33446e-34 β”‚         0.0 β”‚ 5.33446e-34 β”‚
β”‚    40 β”‚ 7.21941e-35 β”‚         0.0 β”‚ 7.21941e-35 β”‚
└───────┴─────────────┴─────────────┴─────────────┘

Why not just use \tanh'(x) = \mathrm{sech}(x)^2?

PS. Note that using abs2(z) here is wrong for complex arguments, and for real arguments has no advantage over z^2 anyway.
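
As a quick sanity check (a sketch; dref is just an illustrative helper, not part of any package), sech(x)^2 stays in agreement with a BigFloat reference well past the point where 1 - tanh(x)^2 has degraded:

dref(x) = Float64(1 - tanh(BigFloat(x))^2)   # high-precision reference value

for x in (1.0, 20.0, 40.0)
    println((x, sech(x)^2, 1 - tanh(x)^2, dref(x)))
end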


That’s how it was before

but it was changed for CSE. I don’t know how relevant that is though at the moment.

It’s not relevant for most applications of DiffRules, but it definitely is relevant to how it’s computed in ChainRules.jl since ChainRules does explicitly re-use the answer computed for the forwards pass:

Perhaps what we should do here is add a branch like

abs(real(x)) <= 14 ? (1 - Ω^2) : sech(x)^2

And then maybe remove that branch only for the @fastmath version?

With your new formulation you’ll lose any CSE too, so you might as well go back to using sech.


That’s way too aggressive. By x = 4 the 1 - tanh(x)^2 formula has already lost 2 digits, by x = 6 it has lost 4 digits, and by x = 14 it has lost 10 digits.

Doing a few quick experiments, the crossover point where sech(x)^2 becomes more accurate seems to be around x = 1, so you could do abs(real(x)) <= 1 ? (1 - Ω^2) : sech(x)^2 (or real(x)^2 <= 1 … not sure which is faster).
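
In code, that branch might look something like this (a sketch with an illustrative name; Ω denotes the already-computed primal tanh(x), following the ChainRules convention):

function dtanh(x, Ω = tanh(x))
    # reuse the primal only in the regime where 1 - Ω^2 is accurate
    abs(real(x)) <= 1 ? 1 - Ω^2 : sech(x)^2
end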

Yeah, it was just an example value taken from Tamas’ table.

I don’t think accuracy is the only concern here though, since AD people are typically also very performance-sensitive. My assumption is that so long as values can be re-used from the primal calculation without truly catastrophic losses of accuracy, that’s typically going to be preferred.

ForwardDiff applies CSE to the rules generated from the DiffRules expressions.

Yes, the issue also affects ChainRules: rule for tanh has catastrophic cancellation for |x| > 20 · Issue #102 · JuliaDiff/DiffRules.jl · GitHub

I disagree. Some applications of AD might need only low accuracy, but in general, when a standard library computes a math function, the programmer should expect it to be computed to close to machine precision.

If the programmer wants lower accuracy, they should use lower precision, or a different function name like tanh_approx (or some macro @approx that rewrites tanh to tanh_approx, etc.).


Yes, debugging speed issues is much easier than debugging numerical issues. Whenever there is a trade-off, inaccurate but fast should be opt-in.

I should have expanded on this, but I thought we could keep CSE by also calculating tanh from exp(2*x) etc. But maybe I misunderstand how that works, or how relevant CSE is.
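
Something like the following is what I had in mind (a sketch; it would still need a guard for large |x|, where exp(2*x) overflows to Inf and the primal ratio becomes NaN):

function tanh_and_deriv(x)
    ez = exp(2 * x)              # shared subexpression for primal and derivative
    t  = (ez - 1) / (ez + 1)     # tanh(x)
    dt = 4 / (ez + 2 + 1 / ez)   # tanh'(x), no cancellation
    return t, dt
end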

I think the simplest solution is to define the derivative as sech(x)^2 without reusing the primal. IMO, as @stevengj suggested, if users or a package want a faster but less accurate version, they should define and use a tanh_approx function with a correspondingly less accurate derivative.

Branches discussed here would be optimized for Float64 but users would still run into the same problems with e.g. Float32 (or they might be inefficient for something like Float128).

No, in this case my suggested crossover point abs(real(x)) < 1 should be independent of precision (at least for real x … I haven’t thought too much about the complex case), because that’s where tanh(x)^2 is around 0.5 or less.
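
A quick way to see why the crossover is precision-independent (sketch):

for T in (Float32, Float64)
    x = one(T)
    # tanh(1)^2 β‰ˆ 0.58 in any precision, so 1 - tanh(x)^2 cancels at most
    # about one bit at the proposed crossover, regardless of the type T
    println((T, tanh(x)^2, 1 - tanh(x)^2))
end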

tanh(x) is computed from polynomial approximations for many values of x. Worse, as far as I can tell, tanh is not inlined by the compiler, which will prevent the compiler from CSE-ing expressions computed inside tanh with expressions outside tanh:

julia> f(x) = 1+tanh(x)
f (generic function with 1 method)

julia> @code_llvm f(0.2)
;  @ REPL[42]:1 within `f`
define double @julia_f_445(double %0) #0 {
top:
  %1 = call double @j_tanh_447(double %0) #0
; ┌ @ promotion.jl:410 within `+` @ float.jl:408
   %2 = fadd double %1, 1.000000e+00
; β””
  ret double %2
}

True, I was thinking about the suggestion of a cutoff at 14. I’m not sure how much we would gain from a branch using abs(real(x)) <= 1 in practice though.