Intel Core I7 6500U.

Thatās incorrect. For example, check out: `@code_llvm (^).(arr, 2)`

. As @Oscar_Smith above also said, the broadcasting loop is being done in the context of the `@simd`

macro.

Looking further into it, it seems `@simd`

is not compatible with `exp`

(and `sin`

, `cos`

), and the LoopVectorization is doing some magic to make these functions compatible.

I wonder if thereās any plans to upsteam these changes into base Julia to that `@simd`

works with these standard math functions?

I wonder if thereās any plans to upsteam these changes into base Julia to that

`@simd`

works with these standard math functions?

already done, see previous example and linked PR?

This isnāt explicit optimization. Whatās happening here is that for simple functions (ones small enough to be inlined), the compiler is able to automatically figure out vectorized versions. For a call like `exp.(v)`

that is a much harder optimization to make, and getting optimal performance would definitely require a handwritten vectorized `exp`

function (like the one in LoopVectorization).

The optimisation is done normally, but it depends on several things, LV do a lot of things more aggressively because it makes more assumptions and have some hand written implementationās of functions.

Respect to `exp`

, this two are related #44865 and #46338 which are in `master`

.

already done, see previous example and linked PR?

Wow, yes actually - itās about 4x times faster. Still not quite as fast as the 6x speedup of `@turbo`

but a big difference. And I think (?) I can see simd instructions in the llvm output.

It doesnāt appear this has also been applied to `sin()`

or `cos()`

though.

It doesnāt appear this has also been applied to

`sin()`

or`cos()`

though.

Does any language SIMD compile sin and cos by default (maybe Julia can (could [be made to]?) and should do that, but only if `@simd`

applied, even if using intrinsics/functions)? Itās unclear if only available with Intel compiler or done by default, since not always better:

there is a vector version using SSE/AVX!

But the catch is that Intel C++ compiler must be used.This is called

Intel small vector math library(intrinsics):for 128bit SSE please use (double precision):

_mm_sin_pdfor 256bit AVX please use (double precision):

_mm256_sin_pdThe two intrinsics are actually very small functions consists of hand written SSE/AVX assemblies, and now you can process 4 sine calculations at once by using AVX :=) the latency is about ~10 clock cycles (if I remember correctly) on Haswell CPU.

By the way, the CPU needs to execute about 100 such intrinsics to warm up and reach its peak performance, if there is only a few sin functions needs to be evaluated, itās better to use plain sin() instead.

### Laszlo - .NET Developer, Personal Blog

In this post I will look into how someone can implement Sin/Cosin functions with SIMD in NET 5. NET 5 does not have a wrapper on SIMD trigonometric functions, so it seems a good exercise to implement it.

#### Where is Clang's '_mm256_pow_ps' intrinsic?

[ā¦]

Agner Fogās VCL has some math functions like. (Formerly GPL licensed, now Apache).`exp`

and`log`

`https://github.com/microsoft/DirectXMath`

(MIT license) - I think portable to non-Windows, and doesnāt require DirectX. [ā¦]

There is also Sleef. It supports ARM as well, so it is great.

But from my experience, it is less performant than Intel SVML.

I think SVML is now, to some degree, embodied in Clang and also in GCC.

Yet for sure it is available for free with Intel OneAPI package.

Related:

### GitHub - JuliaSIMD/SLEEFPirates.jl: Pirated SLEEF port.

Pirated SLEEF port. Contribute to JuliaSIMD/SLEEFPirates.jl development by creating an account on GitHub.

Itās been hinted at already in the thread, but to be completely clear: here LoopVectorization uses a different, manually vectorized implementation of `exp`

, this is why itās faster, not because itās optimizing the base julia `exp`

.

It is likely that numpy has a similar vectorization built-it, hence why it is faster.

@Oscar_Smith

Julia `exp`

regressed recently.

Earlier, you could use call-site inlining to get it to vectorize, via `@inline exp(x[i])`

, and itād vectorize.

This now no longer works because `pow_body`

does not inline, and we need the entire function to inline.

What do you mean? `exp`

doesnāt call `pow_body`

ā¦

Itās been hinted at already in the thread, but to be completely clear: here LoopVectorization uses a different, manually vectorized implementation of

`exp`

, this is why itās faster, not because itās optimizing the base julia`exp`

.

It uses the same basic algorithm, has slightly higher error tolerance (around 1.5 ULP IIRC, instead of 0.5 for Base?)

It does also have some special optimizations for AVX512 that the compiler wonāt be able to figure out on its own.

But this is a problem with Juliaās compiler.

Julia cannot vectorize code unless it is entirely inlined.

Juliaās inliner also does not consider the possibility of vectorization.

I have a non-inlined call to `pow_body`

when looking at the typed code.

Do you mean for floating point powers?

Yes

```
ā¢ %82 = < constprop > pow_body(::Core.Const(2.0),::Core.Const(-53))::Float64
```

The constant 2.0^-53 prevents SIMD.

Please do not rely on the compiler any more than necessaryā¦

But Iāll rebuild the latest Julia master. Might be fixed with the effects analysis improvements.

If so, thatās really dumb because it totally could just be `0x1p-53`

Yeah. It seems silly that a literal constant like `1e10`

vs `10.0^10`

or `0x1p-53`

vs `2^-53`

makes a big difference.

But itās been fixed on the latest master already.

Actually, even without `@inline`

things look really good on the latest Julia master!

```
julia> using LoopVectorization, BenchmarkTools
julia> function myfunc(arr)
result = similar(arr)
@simd for i in eachindex(arr)
@inbounds result[i] = exp(arr[i])
end
return result
end
myfunc (generic function with 1 method)
julia> function myfunc2(arr)
result = similar(arr)
@simd for i in eachindex(arr)
@inbounds result[i] = @inline exp(arr[i])
end
return result
end
myfunc2 (generic function with 1 method)
julia> function myfunc3(arr)
result = similar(arr)
@turbo for i in eachindex(arr)
result[i] = exp(arr[i])
end
return result
end
myfunc3 (generic function with 1 method)
julia> arr = rand(100_000);
julia> @btime myfunc($arr);
76.586 Ī¼s (2 allocations: 781.30 KiB)
julia> @btime myfunc2($arr);
65.682 Ī¼s (2 allocations: 781.30 KiB)
julia> @btime myfunc3($arr);
67.174 Ī¼s (2 allocations: 781.30 KiB)
```

Itād be great if we could have regression tests to make sure it does not regress.

Iām not sure why `myfunc`

is slower than `myfunc2`

/would have to spend more time looking at it to see what is actually different.

But it seems to inline on its own.