Trig functions very slow

There are a couple of things we should do:

  1. Write faster scalar functions. We use translated versions of openlibm (which were based on FreeBSD’s libm, which in turn descends from fdlibm, written by Sun in the 90s). These give respectable performance, but there is some room for improvement (e.g. by exploiting FMA instructions on newer architectures).

  2. Provide hooks to use vectorised kernels, such as SVML, SLEEF, Yeppp or Apple’s Accelerate library, for operations like broadcast and reduce. LLVM provides such hooks for its own intrinsics: we don’t currently use these because they are hard-coded to call the system math library, not our own scalar functions. Apparently this is fixable, but the best option would be to have a general framework to do this for arbitrary functions, not just those blessed by LLVM.

  3. A framework to write our own vectorised kernels. SIMD.jl provides some low-level functionality, but ideally we would have a higher-level way to write SPMD code like ISPC, which would do things like rewrite branches into bitmasks.
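As a toy illustration of the branch-to-bitmask rewriting mentioned in point 3 (plain Julia, no SIMD.jl; the function names here are made up for the example): both arms of the branch are evaluated for every element and a branch-free select picks the result, which is what lets the loop vectorise.

```julia
# Scalar version: the branch makes auto-vectorisation hard.
piecewise_branch(x) = x < 0.5 ? 2x : x^2

# SPMD-style rewrite: compute both arms, then blend with a mask.
# `ifelse` evaluates both sides and is branch-free, so a broadcast
# over an array can be turned into masked SIMD instructions.
piecewise_masked(x) = ifelse(x < 0.5, 2x, x^2)

xs = [0.1, 0.4, 0.6, 0.9]
piecewise_masked.(xs) == piecewise_branch.(xs)  # identical results, no per-lane branch
```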

Unfortunately the discussion of this issue has become somewhat fragmented, but the main issues are:


For a pure Julia port, see GitHub - musm/SLEEF.jl: A pure Julia port of the SLEEF math library. For SLEEF wrappers, see GitHub - oxinabox/SLEEF_WRAPPER.jl: Julia wrapper for SLEEF - SIMD Library for Evaluating Elementary Functions (NB: this library exists only for evaluating LibM.jl and the pure Julia Sleef.jl implementation), which might need an update.

@musm might be willing to comment on how this project has gone. Last I heard these were doing quite well.

Here are the timings on my machine using SLEEF:

function f()
    r = 0.
    for i in float(0:100_000_000 - 1)
        r += sin(i)
    end
    return r
end

using SLEEF
function g()
    r = 0.
    for i in float(0:100_000_000 - 1)
        r += SLEEF.sin_fast(i)
    end
    return r
end

using BenchmarkTools
@btime f() # 3.595 s (8 allocations: 352 bytes)
@btime g() # 1.401 s (8 allocations: 352 bytes)

f() - g() # 8.544276397515205e-13

whereas in Numba I get

import numba
import numpy as np
import timeit

@numba.njit
def f():
    r = 0
    for i in range(100_000_000):
        r += np.sin(i)
    return r
        
%timeit f() # 2.59 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Numba answer gives:

pyf = 0.7820103194607919

f() - g() # 8.544276397515205e-13
pyf - g() # 1.0347278589506459e-12
pyf - f() # 1.8030021919912542e-13

so relative accuracies are still pretty tight, even using SLEEF.sin_fast.

I guess the moral of the story is if you want to sin and you gotta do it fast, choose SLEEF.

My apologies to @oxinabox for overlooking his prior work. The SLEEF developers have done a lot since his effort (adding functions, extending domains, etc.), so it might be worth reviving that project while we’re waiting for the long-term solutions outlined by Simon.

Last I heard, SLEEF itself was wildly inaccurate, which is why @musm moved towards a new solution in GitHub - musm/Amal.jl: A pure Julia math library

It was a while ago, but it used to be that SLEEF sin was not accurate to 1 ulp; you needed to call sin_u1 for that.

And further, even sin_u1 was completely inaccurate outside of some small range (well under 10_000) either side of the origin.

That may have changed.


I was talking about Sleef.jl and Amal.jl, not SLEEF itself. Yeah, the tests showed SLEEF was fast but had quite a few accuracy issues, and @musm was working to fix those and made some strides.


@ChrisRackauckas yeah, I figured, but in case anyone else misunderstood.

We can summon up the LLVM intrinsic for sin via:

macro llvm_ufun(outername, name, jltype, llvmtype, suffix)
    declarations = """declare $llvmtype @llvm.$name.$suffix($llvmtype)"""
    body = """%2 = call $llvmtype @llvm.$name.$suffix($llvmtype %0)
              ret $llvmtype %2"""
    quote
        function $outername(x::$jltype)
            Base.llvmcall(($declarations, $body),
                          $jltype, Tuple{$jltype}, x)
        end
    end |> esc
end

@llvm_ufun(llvm_sin, sin, Float64, double, f64)

This eventually falls back to the system libm,
or so @simonbyrne tells me:
See [WIP] LLVM Instrinsics by oxinabox · Pull Request #8 · JuliaMath/Libm.jl · GitHub

But they seem to be getting optimizations that none of the other ways do; I think it is constant propagation.
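A quick way to convince yourself that constant folding is what’s happening (a sketch; the exact IR depends on your Julia/LLVM version): wrap the literal call in a function and inspect the generated code.

```julia
# If the argument is a compile-time constant, the whole sin call can be
# folded away, leaving just a `ret` of a precomputed double. That would
# explain ~2 ns timings for literal arguments.
const_sin() = sin(0.7)

# Uncomment to inspect the IR; when folding fires, the body is a single `ret`.
# @code_llvm const_sin()

const_sin() == sin(0.7)   # the folded constant matches the runtime value
```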

And so watch this:

julia> @btime sin(0.7)
  2.399 ns (0 allocations: 0 bytes)
0.644217687237691

julia> @btime system_sin(0.7)
  17.803 ns (0 allocations: 0 bytes)
0.644217687237691

julia> @btime llvm_sin(0.7)
  2.401 ns (0 allocations: 0 bytes)
0.644217687237691

#########################################

julia> @btime sin(2.7)
  16.614 ns (0 allocations: 0 bytes)
0.4273798802338298

julia> @btime system_sin(2.7)
  35.172 ns (0 allocations: 0 bytes)
0.4273798802338298

julia> @btime llvm_sin(2.7)
  2.399 ns (0 allocations: 0 bytes)
0.4273798802338298

But it is clearly cheating as:

julia> @btime sin(rand())
  17.662 ns (0 allocations: 0 bytes)
0.8022669929477917

julia> @btime system_sin(rand())
  30.625 ns (0 allocations: 0 bytes)
0.6335114182107534

julia> @btime llvm_sin(rand())
  31.624 ns (0 allocations: 0 bytes)
0.08711740121816128
julia> @btime sin(rand())
  12.489 ns (0 allocations: 0 bytes)
0.2769714155360552

julia> @btime sin($(rand()))
  5.539 ns (0 allocations: 0 bytes)
0.35678525349374

Yes, one should normally interpolate the rand in @btime but, in that final test, I was trying to stop it from being able to run as many optimizations.

Because the LLVM sin results are clearly getting better optimization than the others.

OK, but in that case, are you not benchmarking rand instead?

How about:

julia> @btime sin(r) setup=(r=rand())
  5.539 ns (0 allocations: 0 bytes)

Edit: I guess for the purpose of comparison, it doesn’t matter, since all three use rand as input.


That seems to work and not let LLVM auto-win:

julia> @btime sin(r) setup=(r=rand())
  8.334 ns (0 allocations: 0 bytes)
0.792247666812219

julia> @btime system_sin(r) setup=(r=rand())
  14.631 ns (0 allocations: 0 bytes)
0.21983066934225526

julia> @btime llvm_sin(r) setup=(r=rand())
  13.066 ns (0 allocations: 0 bytes)
0.3440570745049619

julia> @btime llvm_sin(r) setup=(r=100rand())
  13.647 ns (0 allocations: 0 bytes)
0.883268232887517

julia> @btime system_sin(r) setup=(r=100rand())
  14.684 ns (0 allocations: 0 bytes)
-0.6734981589154067

julia> @btime sin(r) setup=(r=100rand())
  8.718 ns (0 allocations: 0 bytes)
-0.3478630385874136

BTW, when benchmarking with r = 100rand(), and so on, it might be better to use @benchmark rather than @btime. With @btime you’ll just get the minimum time, so if any of the random r values are less than π/4, you’ll get a low estimate:

julia> @benchmark sin(r) setup=(r=100rand())
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.870 ns (0.00% GC)
  median time:      11.267 ns (0.00% GC)
  mean time:        11.453 ns (0.00% GC)
  maximum time:     71.176 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

vs

julia> @benchmark sin(r) setup=(r=max(π/4, 100rand()))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.429 ns (0.00% GC)
  median time:      11.081 ns (0.00% GC)
  mean time:        11.473 ns (0.00% GC)
  maximum time:     49.098 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

That is, the output of @btime becomes randomly dependent on the minimum value of your input r.


Normally you would want to use Random.seed! too, but it does not change much in these tests.

What bothers me is this:

julia> f = llvm_sin;

julia> Random.seed!(0); @btime f(r) setup=(r=rand())
  38.268 ns (2 allocations: 32 bytes)
0.23809634978032534

julia> Random.seed!(0); @btime llvm_sin(r) setup=(r=rand())
  8.353 ns (0 allocations: 0 bytes)
0.23809634978032534

julia> f === llvm_sin
true

Is this what I should expect?

You’re forgetting to interpolate the (non-constant) variable f:

julia> f = sin
sin (generic function with 12 methods)

julia> @btime sin(r) setup=(r=rand())
  5.538 ns (0 allocations: 0 bytes)
0.1495824053430901

julia> @btime f(r) setup=(r=rand())
  38.765 ns (2 allocations: 32 bytes)
0.22395138641869694

julia> @btime $f(r) setup=(r=rand())
  5.837 ns (0 allocations: 0 bytes)
0.7879221208633208

If I remember @pkofod’s lecture at the last JuliaCon (Native Elementary Functions in Julia) correctly, he said this is the trick of SLEEF: supporting a smaller range of values for speed.

That means it would be an issue to make SLEEF the default.
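To see why restricting the input range buys speed, here is a naive argument reduction (a sketch using only Base functions): for large x, the rounding error in the double-precision 2π gets multiplied by a huge quotient, so accurate kernels need an expensive extended-precision reduction (which Base.rem2pi performs), while fast kernels simply limit the supported range.

```julia
# Naive reduction of x into roughly [-π, π] before a small-angle kernel.
naive_reduce(x) = x - 2π * round(x / (2π))

# Base.rem2pi does the reduction with an extended-precision π.
naive_reduce(10.0) - rem2pi(10.0, RoundNearest)   # a few ulps: fine
naive_reduce(1e15) - rem2pi(1e15, RoundNearest)   # way off: the 2π error is amplified
```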

@Mason, you compared two different levels of accuracy there.
Try using the fast-math flag for Numba as well and then compare it to SLEEF (make sure you’re on Numba 0.4, which is compiled with MKL to use SVML).

The current version of SLEEF includes an extensive set of tests against MPFR, covering wide domains for most (maybe all) functions. As I mentioned, it has been actively developed over the last couple of years.


Chiming in.

Last I heard, SLEEF itself was wildly inaccurate, which is why @musm moved towards a new solution in GitHub - musm/Amal.jl: A pure Julia math library

Actually, SLEEF and SLEEF.jl are extensively tested and are very accurate and fast. The latest version of SLEEF does have extended range on trig functions, matching libm’s range. I need to update the SLEEF.jl library to include these changes, which should not be difficult. (If there is interest in this.)

If SLEEF.jl gives you the best speeds for your application, there’s absolutely no reason not to use it. So I’m glad that for this specific case, if you do not need <1 ulp accuracy, sin_fast can be used.

As an aside, I tried creating Amal.jl to develop a more extensible elementary math library that takes stronger advantage of Julia’s unique features. However, this is a lot easier said than done, plus there is the huge time commitment of such a project.


Will SLEEF.exp eventually handle complex values as well? Currently it does not seem supported:

julia> SLEEF.exp(1.0im)
ERROR: MethodError: no method matching exp(::Complex{Float64})
You may have intended to import Base.exp

EDIT:
I just noticed that SLEEF.cis is implemented. It was not documented in the README.md.

julia> SLEEF.cis(1)
0.5403023058681398 + 0.8414709848078965im

Cool!
Unfortunately there is no cis_fast, but that should be fairly easy to implement:

cis_fast(x) = (v = SLEEF.sincos_fast(x); complex(v[2], v[1]))
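For reference, the same one-liner works with Base.sincos, which also makes the tuple ordering easy to sanity-check (sincos returns (sin, cos), and cis(x) = cos(x) + im*sin(x)):

```julia
# Hypothetical cis built from Base.sincos, mirroring the cis_fast one-liner above.
my_cis(x) = (sc = sincos(x); complex(sc[2], sc[1]))  # (sin, cos) -> cos + im*sin

my_cis(1.0) ≈ cis(1.0)   # agrees with Base.cis
```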

SLEEF.cis resolves to Base.cis, since SLEEF.cis is not implemented yet. It’s an easy add, however.

Complex values can also be easily handled by copying the base definitions and replacing all the elementary function calls with their SLEEF variants.

I can put these on my TODO.
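As a sketch of what “copying the base definitions” means for complex exp (Base kernels here; in a SLEEF variant one would swap in SLEEF.exp and SLEEF.sincos_fast, names assumed from the package):

```julia
# exp(a + b*im) == exp(a) * (cos(b) + im*sin(b))
function my_cexp(z::Complex{Float64})
    s, c = sincos(imag(z))   # -> SLEEF.sincos_fast in a SLEEF variant
    er = exp(real(z))        # -> SLEEF.exp
    return complex(er * c, er * s)
end

my_cexp(1.0 + 1.0im) ≈ exp(1.0 + 1.0im)
```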


I wanted to make updated SLEEF wrappers, but I am getting lots of segfaults.
https://github.com/chriselrod/SLEEFwrap.jl

A few months ago, when I experimented with using NTuple{N,Core.VecElement{T}} as a data type, I ran into lots of random segfaults. I don’t recall all the details, but I do recall unsafe_loading them, and that the segfaults went away when I replaced it with NTuple{N,T}.
However, I haven’t run into segfaults with SIMD.jl, and it would be nice for SLEEF and SIMD.jl to play well together.

I cannot use regular NTuple{N,T} with SLEEF, because then I get garbage answers (looks like uninitialized memory).

NTuple{N,Core.VecElement{T}} mostly seems to work. SIMD.jl’s Vec type is also just a wrapper for this, yet benchmarking any use of SIMD’s Vec type combined with SLEEF throws instant segfaults.

julia> using SLEEFwrap

julia> using SIMD, BenchmarkTools

julia> const Vec8 = Vec{8,Float64}
Vec{8,Float64}

julia> const V8 = NTuple{8,Core.VecElement{Float64}}
NTuple{8,VecElement{Float64}}

julia> xv = ntuple(i -> Core.VecElement(rand()), Val(8))
(VecElement{Float64}(0.8583526847040217), VecElement{Float64}(0.2687536152757337), VecElement{Float64}(0.8884708219081905), VecElement{Float64}(0.7902480329753863), VecElement{Float64}(0.34023049622713186), VecElement{Float64}(0.5061351730568031), VecElement{Float64}(0.011915278521906103), VecElement{Float64}(0.8420847167831675))

julia> # xvec = Vec8(ntuple(i -> rand(), Val(8)))
       xvec = Vec8(xv)
8-element Vec{8,Float64}:
Float64⟨0.8583526847040217, 0.2687536152757337, 0.8884708219081905, 0.7902480329753863, 0.34023049622713186, 0.5061351730568031, 0.011915278521906103, 0.8420847167831675⟩

julia> SLEEFwrap.sincos(xv)
((VecElement{Float64}(0.7567667649017668), VecElement{Float64}(0.26553000049665487), VecElement{Float64}(0.7761083562736459), VecElement{Float64}(0.7105278274130774), VecElement{Float64}(0.3337043846735693), VecElement{Float64}(0.48480060287253957), VecElement{Float64}(0.011914996580888924), VecElement{Float64}(0.7460329717928119)), (VecElement{Float64}(0.6536849880027183), VecElement{Float64}(0.9641025976711434), VecElement{Float64}(0.6305995712987914), VecElement{Float64}(0.7036691029679022), VecElement{Float64}(0.9426777729689156), VecElement{Float64}(0.8746247054905448), VecElement{Float64}(0.9999290139087261), VecElement{Float64}(0.6659090065451776)))

julia> SLEEFwrap.sincos(xvec)
(Float64⟨0.7567667649017668, 0.26553000049665487, 0.7761083562736459, 0.7105278274130774, 0.3337043846735693, 0.48480060287253957, 0.011914996580888924, 0.7460329717928119⟩, Float64⟨0.6536849880027183, 0.9641025976711434, 0.6305995712987914, 0.7036691029679022, 0.9426777729689156, 0.8746247054905448, 0.9999290139087261, 0.6659090065451776⟩)

julia> SLEEFwrap.log(xv)
(VecElement{Float64}(-0.15274020952031248), VecElement{Float64}(-1.3139602474419678), VecElement{Float64}(-0.11825347164732995), VecElement{Float64}(-0.23540841700638693), VecElement{Float64}(-1.0781319609829232), VecElement{Float64}(-0.6809515049418576), VecElement{Float64}(-4.429933792967661), VecElement{Float64}(-0.17187465604430383))

julia> SLEEFwrap.log(xvec)
8-element Vec{8,Float64}:
Float64⟨-0.15274020952031248, -1.3139602474419678, -0.11825347164732995, -0.23540841700638693, -1.0781319609829232, -0.6809515049418576, -4.429933792967661, -0.17187465604430383⟩

julia> SLEEFwrap.exp(xv)
(VecElement{Float64}(2.359271027581049), VecElement{Float64}(1.3083327481283562), VecElement{Float64}(2.4314087500618013), VecElement{Float64}(2.2039430090099676), VecElement{Float64}(1.4052714630067908), VecElement{Float64}(1.6588675537938549), VecElement{Float64}(1.011986548237915), VecElement{Float64}(2.321200982909682))

julia> SLEEFwrap.exp(xvec)
8-element Vec{8,Float64}:
Float64⟨2.359271027581049, 1.3083327481283562, 2.4314087500618013, 2.2039430090099676, 1.4052714630067908, 1.6588675537938549, 1.011986548237915, 2.321200982909682⟩

julia> @btime SLEEFwrap.log($xv)
  10.203 ns (-4628700970558799018 allocations: 0 bytes)
(VecElement{Float64}(-0.15274020952031248), VecElement{Float64}(-1.3139602474419678), VecElement{Float64}(-0.11825347164732995), VecElement{Float64}(-0.23540841700638693), VecElement{Float64}(-1.0781319609829232), VecElement{Float64}(-0.6809515049418576), VecElement{Float64}(-4.429933792967661), VecElement{Float64}(-0.17187465604430383))

julia> @btime SLEEFwrap.log($xvec)

signal (11): Segmentation fault
in expression starting at no file:0
#_run#8 at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:336
unknown function (ip: 0x7f3ca00b30bb)
jl_fptr_trampoline at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:1843
jl_apply_generic at /home/chriselrod/Documents/languages/julia-ob/src/gf.c:2198
inner at ./none:0

I’ve tried:

  1. Letting ccall use SIMD.jl’s convert method to convert the Vec{N,T} into a NTuple{N,Core.VecElement{T}}.
  2. Extracting the field from the struct, ie a.elts where a is an instance of Vec{N,T}.
  3. Telling ccall I was passing it a Vec{N,T}.

but these all led to segfaults.

For the sincos method, which works from the REPL, using either the NTuple{N,Core.VecElement{T}} or the Vec{N,T} results in segfaults when trying to benchmark.

Anyone have an idea what is going on, and how I may be able to address this?

I also don’t understand enough about Julia or BenchmarkTools internals to interpret the segfault stack trace. The line #_run#8 at /home/chriselrod/.julia/packages/BenchmarkTools/dtwnm/src/execution.jl:336 sometimes says #_run#7 or #_run#9.

Pretty fast though:

julia> @btime SLEEFwrap.log($xv)
  10.205 ns (-4623537962973049998 allocations: 0 bytes)
(VecElement{Float64}(-0.34208491394127016), VecElement{Float64}(-0.1277428509493214), VecElement{Float64}(-0.19021957729279487), VecElement{Float64}(-0.8016561211097802), VecElement{Float64}(-2.067217212156899), VecElement{Float64}(-0.1172006596374793), VecElement{Float64}(-2.668476338119487), VecElement{Float64}(-0.5178181325967083))

julia> @btime SLEEFwrap.log_fast($xv)
  5.435 ns (-4623537962973049999 allocations: 0 bytes)
(VecElement{Float64}(-0.3420849139412701), VecElement{Float64}(-0.12774285094932142), VecElement{Float64}(-0.19021957729279487), VecElement{Float64}(-0.8016561211097802), VecElement{Float64}(-2.0672172121568986), VecElement{Float64}(-0.1172006596374793), VecElement{Float64}(-2.6684763381194867), VecElement{Float64}(-0.5178181325967083))

julia> @btime SLEEFwrap.exp($xv)
  7.829 ns (4611763878698450332 allocations: 0 bytes)
(VecElement{Float64}(2.0345769062548253), VecElement{Float64}(2.4110917704333157), VecElement{Float64}(2.2859405809045037), VecElement{Float64}(1.5660952771191212), VecElement{Float64}(1.1348919169060723), VecElement{Float64}(2.4336853349651544), VecElement{Float64}(1.071819661958549), VecElement{Float64}(1.8145166598251141))

julia> @btime SLEEFwrap.sin($xv)
  10.824 ns (4604048382445319899 allocations: 0 bytes)
(VecElement{Float64}(0.6520520679002532), VecElement{Float64}(0.7707896329958752), VecElement{Float64}(0.7357528052669569), VecElement{Float64}(0.4336913600120062), VecElement{Float64}(0.12620000898890807), VecElement{Float64}(0.7766981851802598), VecElement{Float64}(0.06930222836378491), VecElement{Float64}(0.5611869269352823))

julia> @btime SLEEFwrap.sin_fast($xv)
  7.702 ns (4604048382445319899 allocations: 0 bytes)
(VecElement{Float64}(0.6520520679002532), VecElement{Float64}(0.7707896329958752), VecElement{Float64}(0.7357528052669569), VecElement{Float64}(0.4336913600120062), VecElement{Float64}(0.12620000898890807), VecElement{Float64}(0.7766981851802598), VecElement{Float64}(0.06930222836378491), VecElement{Float64}(0.5611869269352823))

julia> @btime SLEEFwrap.cos($xv)
  12.725 ns (4605004245534695369 allocations: 0 bytes)
(VecElement{Float64}(0.7581741889216512), VecElement{Float64}(0.6370897438085814), VecElement{Float64}(0.6772501823859506), VecElement{Float64}(0.9010614874973497), VecElement{Float64}(0.9920048173931413), VecElement{Float64}(0.629872946821572), VecElement{Float64}(0.9975957102673476), VecElement{Float64}(0.8276890920127764))

julia> @btime SLEEFwrap.cos_fast($xv)
  7.756 ns (4605004245534695369 allocations: 0 bytes)
(VecElement{Float64}(0.7581741889216512), VecElement{Float64}(0.6370897438085814), VecElement{Float64}(0.6772501823859507), VecElement{Float64}(0.9010614874973497), VecElement{Float64}(0.9920048173931413), VecElement{Float64}(0.629872946821572), VecElement{Float64}(0.9975957102673476), VecElement{Float64}(0.8276890920127764))

julia> @btime SLEEFwrap.tan($xv)
  17.598 ns (4605921675420273776 allocations: 0 bytes)
(VecElement{Float64}(0.8600293671664883), VecElement{Float64}(1.2098603697306811), VecElement{Float64}(1.086382587118547), VecElement{Float64}(0.481311615277844), VecElement{Float64}(0.12721713319955963), VecElement{Float64}(1.2331029441724537), VecElement{Float64}(0.06946925257448479), VecElement{Float64}(0.6780165793541951))

julia> @btime SLEEFwrap.tan_fast($xv)
  12.264 ns (4605921675420273776 allocations: 0 bytes)
(VecElement{Float64}(0.8600293671664883), VecElement{Float64}(1.2098603697306811), VecElement{Float64}(1.0863825871185468), VecElement{Float64}(0.481311615277844), VecElement{Float64}(0.1272171331995596), VecElement{Float64}(1.2331029441724537), VecElement{Float64}(0.06946925257448477), VecElement{Float64}(0.6780165793541951))

julia> @btime SLEEFwrap.erf($xv)
  54.018 ns (4604343905645784797 allocations: 0 bytes)
(VecElement{Float64}(0.6848617340469577), VecElement{Float64}(0.7867287531423094), VecElement{Float64}(0.7576930731379372), VecElement{Float64}(0.47417732523365935), VecElement{Float64}(0.14202377140324365), VecElement{Float64}(0.7915398934983942), VecElement{Float64}(0.07813660993221432), VecElement{Float64}(0.6005564766788369))

julia> @btime SLEEFwrap.erfc($xv)
  68.593 ns (4599348646226371142 allocations: 0 bytes)
(VecElement{Float64}(0.3151382659530423), VecElement{Float64}(0.21327124685769064), VecElement{Float64}(0.24230692686206282), VecElement{Float64}(0.5258226747663407), VecElement{Float64}(0.8579762285967564), VecElement{Float64}(0.2084601065016058), VecElement{Float64}(0.9218633900677857), VecElement{Float64}(0.399443523321163))

julia> @btime SLEEFwrap.gamma($xv)
  120.838 ns (4608452683576307722 allocations: 0 bytes)
(VecElement{Float64}(1.282055440401566), VecElement{Float64}(1.0852391465483837), VecElement{Float64}(1.1354506736208367), VecElement{Float64}(1.9743762331544568), VecElement{Float64}(7.437973193493231), VecElement{Float64}(1.0773258578660838), VecElement{Float64}(13.905308301696595), VecElement{Float64}(1.4988631037761113))

julia> @btime SLEEFwrap.lgamma($xv)
  128.714 ns (4598119901031446912 allocations: 0 bytes)
(VecElement{Float64}(0.248464602806461), VecElement{Float64}(0.08180037427069305), VecElement{Float64}(0.12702964139168493), VecElement{Float64}(0.6802525171536727), VecElement{Float64}(2.006598391502465), VecElement{Float64}(0.07448191310473816), VecElement{Float64}(2.6322706594427645), VecElement{Float64}(0.4047068899174618))

julia> @btime SLEEFwrap.sqrt($xv)
  6.018 ns (4605766359090098911 allocations: 0 bytes)
(VecElement{Float64}(0.8427857905804464), VecElement{Float64}(0.9381256108319143), VecElement{Float64}(0.9092731011250698), VecElement{Float64}(0.6697652101969048), VecElement{Float64}(0.35572098476673925), VecElement{Float64}(0.9430836162446596), VecElement{Float64}(0.2633587338823797), VecElement{Float64}(0.7718932109593795))

julia> @btime SLEEFwrap.sqrt_fast($xv)
  13.171 ns (4606226399220695390 allocations: 0 bytes)
(VecElement{Float64}(0.8938605051044239), VecElement{Float64}(0.6748056558334341), VecElement{Float64}(0.8640737421502825), VecElement{Float64}(0.7593243813245429), VecElement{Float64}(0.6839764178162592), VecElement{Float64}(0.9716645016644863), VecElement{Float64}(0.41321471619861017), VecElement{Float64}(0.20800569859917845))

julia> @btime SLEEFwrap.cbrt($xv)
  26.468 ns (4606211737266519962 allocations: 0 bytes)
(VecElement{Float64}(0.892232701193268), VecElement{Float64}(0.958312886649549), VecElement{Float64}(0.9385618465517778), VecElement{Float64}(0.7655056316877106), VecElement{Float64}(0.5020415448485336), VecElement{Float64}(0.9616863831147828), VecElement{Float64}(0.41086437256436426), VecElement{Float64}(0.8414690511688571))

julia> @btime SLEEFwrap.cbrt_fast($xv)
  13.819 ns (4606211737266519962 allocations: 0 bytes)
(VecElement{Float64}(0.892232701193268), VecElement{Float64}(0.958312886649549), VecElement{Float64}(0.9385618465517778), VecElement{Float64}(0.7655056316877107), VecElement{Float64}(0.5020415448485336), VecElement{Float64}(0.9616863831147828), VecElement{Float64}(0.41086437256436426), VecElement{Float64}(0.8414690511688571))

I’m also not inclined to believe the number of allocations. =/