julia> @benchmark cos.($a)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
--------------
minimum time: 14.445 μs (0.00% GC)
median time: 14.776 μs (0.00% GC)
mean time: 15.341 μs (0.00% GC)
maximum time: 102.015 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark cos_sleef($b)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
--------------
minimum time: 1.982 μs (0.00% GC)
median time: 2.217 μs (0.00% GC)
mean time: 3.817 μs (8.40% GC)
maximum time: 188.719 μs (97.31% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark cos_sleef!($a, $b)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.805 μs (0.00% GC)
median time: 1.858 μs (0.00% GC)
mean time: 1.875 μs (0.00% GC)
maximum time: 5.114 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
You forgot to delete the reference to a in eachindex, making it type unstable: for i in eachindex(a, inp) should be for i in eachindex(inp). Mind rerunning the benchmark?
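For reference, a minimal sketch of the fixed kernel (the function body is my reconstruction, with Base.cos standing in for the actual SLEEF call, since the original definition isn't quoted here):

```julia
# Hypothetical reconstruction of the fixed loop. The point is that the
# loop should iterate only over `inp`'s indices; the stale reference to
# the non-const global `a` is what made the original type unstable.
function cos_sleef!(out, inp)
    for i in eachindex(inp)  # was: eachindex(a, inp)
        out[i] = cos(inp[i])
    end
    out
end
```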
Viral just merged the PR, but I ran into problems installing it (I don’t have MKL.jl installed).
I’m pretty sure my gfortran uses a vectorized math library (glibc’s libmvec rather than VML, judging by the symbol names), so I’ll test that instead. Compiling
module trig
  use ISO_C_BINDING
  implicit none
contains
  subroutine vsin(a, b, N) bind(C, name = "vsin")
    integer(c_int64_t), intent(in) :: N
    real(C_double), intent(in), dimension(N) :: b
    real(C_double), intent(out), dimension(N) :: a
    a = sin(b)
  end subroutine vsin
  subroutine vcos(a, b, N) bind(C, name = "vcos")
    integer(c_int64_t), intent(in) :: N
    real(C_double), intent(in), dimension(N) :: b
    real(C_double), intent(out), dimension(N) :: a
    a = cos(b)
  end subroutine vcos
end module trig
with
gfortran -S -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC /home/chriselrod/Documents/progwork/fortran/trig.f90 -o vtrig.s
Yields assembly containing the following loop:
.L4:
movq %rbx, %r12
salq $6, %r12
vmovupd (%r14,%r12), %zmm0
incq %rbx
call _ZGVeN8v_cos@PLT
vmovupd %zmm0, 0(%r13,%r12)
cmpq %rbx, %r15
jne .L4
Note that it is calling a vectorized cos on zmm (512-bit) registers.
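For comparison, you can inspect what Julia's own broadcast compiles to (a quick sketch; whether you actually see zmm registers depends on your CPU and Julia's target settings):

```julia
# Print the native code for a broadcasted cos; on AVX-512 hardware you
# can look here for zmm registers (or their absence).
bcos!(a, b) = (a .= cos.(b); a)
@code_native bcos!(zeros(8), zeros(8))
```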
Producing a shared library with:
gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC /home/chriselrod/Documents/progwork/fortran/trig.f90 -o libvtrig.so
and now
const TRIGFORTRAN = "/home/chriselrod/Documents/progwork/fortran/libvtrig.so";
function cos_gfortran!(a, b)
    ccall(
        (:vcos, TRIGFORTRAN),
        Cvoid, (Ref{Float64}, Ref{Float64}, Ref{Int}),
        a, b, Ref(length(a))
    )
    a
end
cos_gfortran(a) = cos_gfortran!(similar(a), a)
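As a sanity check, a pure-Julia baseline with the same in-place/out-of-place split is handy for verifying the ccall wrappers (cos_julia! is just a name I picked here):

```julia
# Pure-Julia reference with the same calling convention as the Fortran
# wrappers: writes cos.(b) into a and returns a.
function cos_julia!(a, b)
    @assert length(a) == length(b)
    @inbounds @simd for i in eachindex(a, b)
        a[i] = cos(b[i])
    end
    a
end
cos_julia(b) = cos_julia!(similar(b), b)
```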
This yields even better times than I got with SLEEF:
julia> a = randn(100, 20); # no need to make it so large
julia> b = similar(a);
julia> @benchmark cos_gfortran($a)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
--------------
minimum time: 1.562 μs (0.00% GC)
median time: 1.811 μs (0.00% GC)
mean time: 2.251 μs (10.21% GC)
maximum time: 97.660 μs (92.97% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark cos_gfortran!($b, $a)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.434 μs (0.00% GC)
median time: 1.463 μs (0.00% GC)
mean time: 1.499 μs (0.00% GC)
maximum time: 4.733 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
Putting them all together with @btime:
julia> @btime @. $b = cos($a);
13.181 μs (0 allocations: 0 bytes)
julia> @btime cos_gfortran!($b, $a); # VML?
1.467 μs (0 allocations: 0 bytes)
julia> @btime cos_sleef!($b, $a);
2.048 μs (0 allocations: 0 bytes)
julia> @btime cos_xsimd!($b, $a);
2.504 μs (0 allocations: 0 bytes)
So gcc is the clear winner here. This probably requires recent versions of the compiler.
Given the roughly 9× speed difference, I figure I should point out:
julia> all(cos_gfortran(a) .≈ cos.(a))
true
julia> all(cos_gfortran(1000a) .≈ cos.(1000 .* a))
true
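isapprox is fairly loose; a stricter check is to measure the worst-case error against a BigFloat reference (a pure-Julia sketch shown with Base.cos; substitute cos_gfortran to test the library):

```julia
# Worst-case absolute error of Float64 cos against a high-precision
# reference, rounded back to Float64.
x = randn(10_000)
ref = Float64.(cos.(big.(x)))
maxerr = maximum(abs.(cos.(x) .- ref))
```

Relative error in ulps would be more informative near the zeros of cos, but the maximum absolute error is enough to show agreement to within a few ulps.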
Using the same size inputs as sw72:
julia> a = randn(2000,2000);
julia> b = similar(a);
julia> @btime @. $b = cos($a);
35.750 ms (0 allocations: 0 bytes)
julia> @btime cos_gfortran!($b, $a);
4.360 ms (0 allocations: 0 bytes)
julia> @btime cos_sleef!($b, $a);
5.304 ms (0 allocations: 0 bytes)
julia> @btime cos_xsimd!($b, $a);
6.901 ms (0 allocations: 0 bytes)
julia> all(cos_gfortran(a) .≈ cos.(a))
true
What was the error installing xsimdwrap? You need git and a C++ compiler (it defaults to trying g++, but clang++ should work too).
I think you can answer why xsimdwrap is separated from the others ;).
I did just rebuild it on two different computers, but not everyone gets to enjoy the easy life of running Linux, such as myself at work.
The libraries all do different things, although VectorizationBase could probably be combined with SIMDPirates.
While developing libraries, splitting them up also cuts down on re-precompile times. E.g., VectorizationBase only precompiles when I actually change it, because it doesn’t depend on any of my other libraries.
I plan on registering them all within the next few months, so that it’ll just require ] add ${NAME}. Some of them, in particular LoopVectorization, are going to see some major rework before that happens.
Also, it would be really cool to get gcc’s implementation of these vector functions ported directly into Julia. I wouldn’t mind a GPL’d library.