Flux vs PyTorch CPU performance

Did you use @avx with SLEEFPirates? Without it, the calls will not reliably be vectorized with SIMD instructions.

EDIT:
Also, FWIW, the relative error in the example we provided is:

julia> tanh(0.0001)
9.999999966666668e-5

julia> (SLEEFPirates.tanh_fast(0.0001) - ans)/ans
2.8135046469782325e-13

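Ideally, we want to be within a few units in the last place (ulp). I.e., prevfloat(x, n) should get you the exact answer with abs(n) <= 4 or so. For comparison, eps(Float64) is about 2.2e-16, so a relative error of 2.8e-13 is on the order of a couple of thousand ulp.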
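To make the ulp criterion concrete, here is a minimal sketch that uses only Base Julia. The helper ulp_steps is a name made up for this illustration (it is not part of SLEEFPirates), and it assumes both inputs are positive, finite Float64 values. The "approximation" is a stand-in built with nextfloat, not the output of tanh_fast.

```julia
# Signed distance in ulp between two Float64 values, using the convention that
# prevfloat(approx, n) == exact. Hypothetical helper for illustration only.
# Assumes both arguments are positive and finite, so that adjacent floats
# have adjacent bit patterns.
function ulp_steps(approx::Float64, exact::Float64)
    return reinterpret(Int64, approx) - reinterpret(Int64, exact)
end

# Reference value: evaluate tanh in BigFloat, then round to Float64.
x = 0.0001
exact = Float64(tanh(big(x)))

# Stand-in for a fast approximation: deliberately 3 ulp above the reference.
approx = nextfloat(exact, 3)

n = ulp_steps(approx, exact)
println("off by ", n, " ulp; prevfloat recovers it: ", prevfloat(approx, n) == exact)

# Rough conversion of the relative error quoted above into ulp:
# absolute error divided by the spacing of floats near `exact`.
relerr = 2.8135046469782325e-13
est = relerr * exact / eps(exact)
println("quoted relative error is roughly ", round(Int, est), " ulp")
```

Swapping the stand-in for an actual result (for example, the value returned by SLEEFPirates.tanh_fast(x)) would give the measured distance for that function.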