I copied some more definitions from xsimd, but they didn’t really help:
```julia
julia> using VectorizationBase, SLEEFPirates, BenchmarkTools

julia> sxd = SVec(ntuple(VectorizationBase.pick_vector_width_val(Float64)) do _
           Core.VecElement(10randn(Float64))
       end)
SVec{8,Float64}<-4.493990314636461, 0.6408752683289985, -9.44142565823924, 5.136363736280279, -1.9103961417226556, 1.9076343186759728, -10.806148050494356, 10.19521898096773>

julia> sxf = SVec(ntuple(VectorizationBase.pick_vector_width_val(Float32)) do _
           Core.VecElement(10randn(Float32))
       end)
SVec{16,Float32}<10.296403f0, 8.404062f0, -0.96928066f0, -3.6089196f0, -4.552932f0, 4.7623963f0, -10.001701f0, 15.561426f0, 13.540369f0, 15.255327f0, -9.867335f0, 18.804873f0, 11.846057f0, 7.934388f0, -5.5608225f0, -12.819666f0>

julia> @btime exp($sxd)
  4.564 ns (0 allocations: 0 bytes)
SVec{8,Float64}<0.011175959122636744, 1.8981415356090743, 7.936717738489916e-5, 170.09612803709564, 0.14802173739348998, 6.737132024244336, 2.0274470980824412e-5, 26774.86841864976>

julia> @btime expm1($sxd)
  11.826 ns (0 allocations: 0 bytes)
SVec{8,Float64}<-0.9888240408773633, 0.8981415356090742, -0.9999206328226151, 169.09612803709564, -0.85197826260651, 5.737132024244335, -0.9999797255290191, 26773.86841864976>

julia> @btime exp($sxf)
  3.929 ns (0 allocations: 0 bytes)
SVec{16,Float32}<29625.86f0, 4465.1685f0, 0.37935588f0, 0.027081087f0, 0.01053627f0, 117.02602f0, 4.532276f-5, 5.731147f6, 759464.7f0, 4.219922f6, 5.184068f-5, 1.4684269f8, 139533.06f0, 2791.6504f0, 0.003845612f0, 2.7070096f-6>

julia> @btime expm1($sxf)
  7.896 ns (0 allocations: 0 bytes)
SVec{16,Float32}<29624.861f0, 4464.1685f0, -0.62064415f0, -0.9729189f0, -0.98946375f0, 116.026024f0, -0.9999547f0, 5.7311455f6, 759463.6f0, 4.219921f6, -0.99994814f0, 1.468427f8, 139532.08f0, 2790.6501f0, -0.99615437f0, -0.9999973f0>
```
The GLIBC `exp` functions are just so absurdly fast that tacking on extra computation (like `x*(1+(x/2)*(1+x/3))`) will still be faster than alternative approaches. We really should implement `exp` functions following the same strategy they use.
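For reference, the extra computation mentioned above is just a nested (Horner-style) form of the degree-3 Taylor polynomial of `expm1`; a minimal sketch (the function name `expm1_taylor3` is mine, not part of any library):

```julia
# Hypothetical sketch: degree-3 Taylor approximation of expm1(x) = exp(x) - 1,
# written in the nested form quoted above:
#   x*(1 + (x/2)*(1 + x/3)) = x + x^2/2 + x^3/6
expm1_taylor3(x) = x * (1 + (x / 2) * (1 + x / 3))

# For small x this agrees closely with the exact expm1:
x = 0.01
expm1_taylor3(x)   # ≈ 0.01005016666...
expm1(x)           # ≈ 0.01005016708...
```

The nested form costs only two multiplies and two adds per element beyond the `exp` itself, which is why the post argues it would still beat a separate `expm1` kernel.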
I have two plans for the future to improve the performance of these functions:
- Finally steal the (GPL-ed) GLIBC definitions. This probably means creating a new GPL-ed library, unless some people want to try one of the workarounds (i.e., one person reads the code and explains how it is implemented, and someone else implements it).
- Define versions on `NTuple{N,<:SVec}` that interleave all the intermediate calculations, so that we can take advantage of superscalar parallelism. This could significantly speed up evaluating vectors.
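To illustrate the interleaving idea, here is a minimal sketch in plain Julia (the name `horner_each` and the use of scalars instead of `SVec` lanes are my assumptions, not SLEEFPirates code): evaluating a polynomial on every element of a tuple inside one loop keeps the per-element dependency chains independent, so a superscalar CPU can overlap them.

```julia
# Hypothetical sketch of interleaved evaluation: one Horner recurrence per
# tuple element, advanced in lockstep. Each element's muladd depends only on
# its own accumulator, so the chains can execute in parallel.
@inline function horner_each(xs::NTuple{N,T}, coeffs) where {N,T}
    acc = ntuple(_ -> T(coeffs[end]), Val(N))
    for i in length(coeffs)-1:-1:1
        # independent fused multiply-adds across the tuple
        acc = ntuple(j -> muladd(xs[j], acc[j], T(coeffs[i])), Val(N))
    end
    return acc
end

# Example: degree-4 Taylor coefficients of exp, lowest order first.
coeffs = (1.0, 1.0, 1/2, 1/6, 1/24)
horner_each((0.0, 0.5, 1.0), coeffs)
```

In the real version each tuple element would be an `SVec`, so the interleaving stacks instruction-level parallelism on top of the SIMD lanes.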
My background isn’t in machine learning, but given that many people use 16-bit numbers, I’d guess so too.
If/when hope fails you, issues are welcome.