Flux vs pytorch cpu performance

Oh that’s quite a long way from 3.5! But may be good enough for Flux?

Using expm1 for more accuracy removes most of the speedup:

julia> epses(tanh, SLEEFPirates.tanh_fast, 0.0001)
n = -2076

julia> @inline function tanh_fast_2(x)
           exp2xm1 = SLEEFPirates.expm1(x + x)
           exp2xm1 / (exp2xm1 + one(x) + one(x))
tanh_fast_2 (generic function with 1 method)

julia> epses(tanh, tanh_fast_2, 0.0001)
n = 0

julia> @btime @avx $y .= tanh.($x);
  434.116 ns (0 allocations: 0 bytes)

julia> @btime @avx $y .= SLEEFPirates.tanh_fast.($x);
  134.825 ns (0 allocations: 0 bytes)

julia> @btime @avx $y .= tanh_fast_2.($x);
  377.855 ns (0 allocations: 0 bytes)

Is expm1 fully optimized?

Based on @ChrisRackauckas’s comment here, fusing the matrix multiplication + add in Dense should allow the existing vectorized tanh broadcast to kick in. It seems like IntelVectorMath.jl also accomplishes this, but AIUI using fused mul! should also remove the intermediate allocation incurred when adding the bias vector (this is what PyTorch does).

Do we have a fused operation like that available? I thought 5-arg mul! dealt with C .= α.*A*B .+ β.*C, but Dense has W*X + b where b is a vector. Although I guess you could define C by repeating b before you called it…

Fusing the other two steps tanh.(WX .+ b) is done natively of course but still not (I think) in Zygote.

I’m definitely not qualified to say, just assumed it was equivalent to torch.addmm. Does the broadcasting prevent the usual dimensionality reduction you get from A*B?

It is, but that can’t take advantage of the vectorized implementation in https://github.com/FluxML/NNlib.jl/pull/199. In other words, optimizing the vectorized performance of tanh under broadcast is moot if tanh.(WX .+ b) is dispatching to +(b) ∘ tanh instead. Presumably there’s a way more ergonomic way to hook into the fused broadcast than defining Base.broadcasted(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Array{Float32,2},Array{Float32,1}}}, ...)?

OK, didn’t follow the details of the PR, but I guess if you want it to pick up on tanh.(mat .+ b) written just like that then the dispatch is going to be about that ugly! You could just have a function dense(tanh, mat, bias) though.

The alternative fusion is like this I suppose? (And perhaps this should be in Base anyway?)

function Base.muladd(mat::AbstractMatrix, x::AbstractVecOrMat, z)
    res = similar(mat, ndims(x) == 1 ? (size(mat,1),) : (size(mat,1), size(x,2)))
    res .= z
    LinearAlgebra.mul!(res, mat, x, 1.0, 1.0)

Yeah, I think this thread has diverged into “make tanh fast” and “make Dense fast”. The former seems to be well on its way while the latter probably needs some changes to the implementation of Dense (e.g. your muladd implementation)`


No. The most optimized functions are probably those provided by GLIBC on Linux, which includes some basics like exp but not tanh.
It’s open source (although GPLed) so we could translate it to Julia or LLVM bitcode.
The functions are written in x86 assembly (different versions for different instruction sets), and make use of look up tables referenced by address, so it wouldn’t easy to translate.

The biggest advantage of that would be that Windows and Mac users can benefit, as well as people who haven’t upgraded their Linux distro in years.

I thought the expm1 definition just used double-doubles (or double-singles) for a more accurate exp - 1. Maybe it isn’t dispatching to the right exp.

@mcabbott Are those benchmarks with Julia 1.4? You can try Julia 1.6 (or 1.5) for the fast @avx $y .= tanh.($x).

DhairyaLGandhi, Chris R., and I had a meeting today where we discussed vectorizing activation functions. The basic plan is a whitelist:

# For evaluating dense activation function `f` on CPU
if canavx(f)
    @avx f.(WX .+ b)
    f.(WX .+ b)

# elsewhere
canavx(::typeof(tanh)) = true
1 Like

I’ve done a little experimenting, and it appears that

myexpm1(x::Float64)=abs(x)>1e-4 ? exp(x)-1 : x*(1+(x/2)*(1+x/3))

is noticeably faster and has a maximum error of around 3e-16 (maximum relative error is around 4e-12). Would this be a good replacement? My speed measurement is that for a length 10000000 Vector, on my computer it takes 95ms compared to SLEEFPirates’ 177ms.

Did you use @avx with SLEEFPirates? It will not reliably SIMD without it.

Also, FWIW, the relative error in the example we provided is:

julia> tanh(0.0001)

julia> (SLEEFPirates.tanh_fast(0.0001) - ans)/ans

Ideally, we want to be within a few units in last place (ulp). I.e., prevfload(x, n) should get you the exact answer with abs(n) <= 4 or so.

That creates some interesting results. If I have small numbers (less than the cutoff, my implimentation is about 3x faster. For bigger numbers, I think the fact that I’m using a non-vectorized exp is slowing things down since I’m about 3x slower. I also can’t use a SLEEFPirate’s exp with avx due to the ternary expression. It throws an error.

The problem with SIMD implementations is that they have to be branch-free.
If you have conditionals, the runtime will generally equal the runtime of the sum of all of them. That is because instead of branching, you evaluate the SIMD vector for each, and then select from the correct vector.
Normally it won’t get much more complicated than a main function plus range checks to set outputs to 0 or Inf as appropriate.

Based on my testing, evaluating both branches should end up faster than the SLEEFPirates version.

julia> x = rand(200); y = similar(x);

julia> @benchmark @avx $y .= expm1.($x)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     382.970 ns (0.00% GC)
  median time:      391.426 ns (0.00% GC)
  mean time:        391.832 ns (0.00% GC)
  maximum time:     625.404 ns (0.00% GC)
  samples:          10000
  evals/sample:     203

julia> @benchmark $y .= myexpm1.($x)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     929.519 ns (0.00% GC)
  median time:      938.148 ns (0.00% GC)
  mean time:        941.006 ns (0.00% GC)
  maximum time:     1.432 μs (0.00% GC)
  samples:          10000
  evals/sample:     27

I’ll look into improving the expm1 implementation, but it is already much faster for me.

I get a small improvement from myexpm1(x)= LoopVectorization.vifelse(abs(x)>1e-4, exp(x)-1, x*(1+(x/2)*(1+x/3))). But in the context of Flux, my guess is that the accuracy of tanh_fast is plenty. The implementation which calls canavx(f) inside Dense etc. could substitute this, without affecting broadcast globally.

Thanks for the reference to vifelse This is my first time using LoopVectorization for more than just a simple slap @avx on and hope. I’m seeing times about 40% faster for data of all sizes (when timing both with @avx!

I copied some more definitions from xsimd, but they didn’t really help:

julia> using VectorizationBase, SLEEFPirates, BenchmarkTools

julia> sxd = SVec(ntuple(VectorizationBase.pick_vector_width_val(Float64)) do _ Core.VecElement(10randn(Float64)) end)
SVec{8,Float64}<-4.493990314636461, 0.6408752683289985, -9.44142565823924, 5.136363736280279, -1.9103961417226556, 1.9076343186759728, -10.806148050494356, 10.19521898096773>

julia> sxf = SVec(ntuple(VectorizationBase.pick_vector_width_val(Float32)) do _ Core.VecElement(10randn(Float32)) end)
SVec{16,Float32}<10.296403f0, 8.404062f0, -0.96928066f0, -3.6089196f0, -4.552932f0, 4.7623963f0, -10.001701f0, 15.561426f0, 13.540369f0, 15.255327f0, -9.867335f0, 18.804873f0, 11.846057f0, 7.934388f0, -5.5608225f0, -12.819666f0>

julia> @btime exp($sxd)
  4.564 ns (0 allocations: 0 bytes)
SVec{8,Float64}<0.011175959122636744, 1.8981415356090743, 7.936717738489916e-5, 170.09612803709564, 0.14802173739348998, 6.737132024244336, 2.0274470980824412e-5, 26774.86841864976>

julia> @btime expm1($sxd)
  11.826 ns (0 allocations: 0 bytes)
SVec{8,Float64}<-0.9888240408773633, 0.8981415356090742, -0.9999206328226151, 169.09612803709564, -0.85197826260651, 5.737132024244335, -0.9999797255290191, 26773.86841864976>

julia> @btime exp($sxf)
  3.929 ns (0 allocations: 0 bytes)
SVec{16,Float32}<29625.86f0, 4465.1685f0, 0.37935588f0, 0.027081087f0, 0.01053627f0, 117.02602f0, 4.532276f-5, 5.731147f6, 759464.7f0, 4.219922f6, 5.184068f-5, 1.4684269f8, 139533.06f0, 2791.6504f0, 0.003845612f0, 2.7070096f-6>

julia> @btime expm1($sxf)
  7.896 ns (0 allocations: 0 bytes)
SVec{16,Float32}<29624.861f0, 4464.1685f0, -0.62064415f0, -0.9729189f0, -0.98946375f0, 116.026024f0, -0.9999547f0, 5.7311455f6, 759463.6f0, 4.219921f6, -0.99994814f0, 1.468427f8, 139532.08f0, 2790.6501f0, -0.99615437f0, -0.9999973f0>

The GLIBC exp functions are just so absurdly fast that tacking on extra computation (like x*(1+(x/2)*(1+x/3))) will still be fast than alternative approaches.
We really should implement exp functions following the same strategy they use.

I have two plans for the future to improve the performance of these functions:

  1. Finally steal the (GPL-ed) GLIBC definitions. This probably means creating a new GPL-ed library, unless some people want to try one of the work arounds (i.e., one person reads and explains how it is implemented, someone else implements).
  2. Define versions defined on NTuple{N,<:SVec} that interleave all the intermediate calculations, so that we can take advantage of super scalar parallelism. This could significantly speed up evaluating vectors.

My background isn’t in machine learning, but given that many people use 16-bit numbers, I’d guess so too.

If/when hope fails you, issues are welcome. :slight_smile:


I would be happy to do the gpl dance to get the glibc functions copied over. If you wanna to be the other one, dm me and we can set it up.


At this point, I have a PR open on base that gets 4x speedup for expm1 relative to current. @Elrod has also merged a vectored version with slightly worse accuracy and higher speed into LoopVectorization.


But I don’t have a released/working version of LoopVectorization supporting this yet. I’d like to do a more thorough rewrite, which will take some time.