Vectorized Yeppp definitely improved exp a lot.
But there may be some performance left on the table for serial evaluation too.
There’ve been a lot of performance optimizations released to glibc
recently for single precision.
Saw this on Phoronix:
https://sourceware.org/git/?p=glibc.git;a=commit;h=fe596486d694e657413d0d4c5a04598674ff71b1
https://sourceware.org/git/?p=glibc.git;a=commit;h=ac817e083b37a5c25d05cde8bde302d7a93ffc5e
These could translate to a big gain for things like single precision ffts.
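One way to check whether the glibc routines themselves got faster on a given system is to ccall sinf from libm directly and compare against Julia's sin; a quick sketch (the libm.so.6 library name assumes Linux with glibc):
using BenchmarkTools
csinf(x::Float32) = ccall((:sinf, "libm.so.6"), Float32, (Float32,), x)   # glibc's single-precision sin
@benchmark csinf(0.23f0)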
I think what follows may be an example of these future performance improvements, but I'm not really sure. When I try to locate [the file names from the link], I don't see any in what look like the right places for them to be the cause, but I'm short on alternative explanations.
julia> @benchmark sin(0.23f0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.548 ns (0.00% GC)
median time: 4.559 ns (0.00% GC)
mean time: 4.564 ns (0.00% GC)
maximum time: 17.363 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
So 10 million evaluations ought to take roughly 45 ms.
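(That is, just scaling the per-call minimum:)
4.548e-9 * 10_000_000 * 1e3   # 4.548 ns per call × 10^7 calls ≈ 45.5 ms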
A function using pairwise summation to add up sin evaluated at 10 million different points, with explicit SIMD vectorization (which bought about 2 ms, and may be a substantial chunk of the time not spent in sin itself):
julia> @benchmark integrate($serial)
BenchmarkTools.Trial:
memory estimate: 400 bytes
allocs estimate: 4
--------------
minimum time: 43.200 ms (0.00% GC)
median time: 43.604 ms (0.00% GC)
mean time: 43.678 ms (0.00% GC)
maximum time: 44.549 ms (0.00% GC)
--------------
samples: 115
evals/sample: 1
Cutting to the chase: I compiled translations into shared libraries (all with -Ofast -march=native -shared -fPIC). This was g++ 8.0 (trunk):
julia> @benchmark intsotest($intf)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 24.057 ms (0.00% GC)
median time: 24.263 ms (0.00% GC)
mean time: 24.336 ms (0.00% GC)
maximum time: 25.292 ms (0.00% GC)
--------------
samples: 206
evals/sample: 1
This only happens with -Ofast. At -O3, the results are comparable to those that follow, so it may just be an improved sin_fast-type function.
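One way to check would be to compare which libm symbols the -Ofast and -O3 builds actually import; a sketch (the library names here are placeholders, not the ones I used):
for lib in ("libintegrate_Ofast.so", "libintegrate_O3.so")
    println(lib, ":")
    run(`nm -D --undefined-only $lib`)   # look for sinf vs. any alternative entry point
end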
Other C++ compilers (including older g++) and Fortran were slower than Julia:
julia> @benchmark intsotest($intf72) #g++ 7-2
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 53.198 ms (0.00% GC)
median time: 53.866 ms (0.00% GC)
mean time: 54.567 ms (0.00% GC)
maximum time: 77.023 ms (0.00% GC)
--------------
samples: 92
evals/sample: 1
julia> @benchmark intsotest($intclang) #Clang++ 5.0.1
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 117.949 ms (0.00% GC)
median time: 118.069 ms (0.00% GC)
mean time: 118.180 ms (0.00% GC)
maximum time: 119.378 ms (0.00% GC)
--------------
samples: 43
evals/sample: 1
julia> @benchmark intsotestf(intfort1) #gfortran - 8.0 trunk
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 51.365 ms (0.00% GC)
median time: 51.784 ms (0.00% GC)
mean time: 51.851 ms (0.00% GC)
maximum time: 53.317 ms (0.00% GC)
--------------
samples: 97
evals/sample: 1
julia> @benchmark intsotestf(intfort72) #gfortran 7-2
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 52.060 ms (0.00% GC)
median time: 52.273 ms (0.00% GC)
mean time: 52.369 ms (0.00% GC)
maximum time: 53.613 ms (0.00% GC)
--------------
samples: 96
evals/sample: 1
The shared library from g++ 8.0, though, was a decent chunk larger (sizes in bytes):
20504 - g++ 8.0
16856 - g++ 7.2
17944 - clang++ 5.0.1
10688 - gfortran 8.0
8704 - gfortran 7.2
Could it be inlining sin? No idea; something different is definitely going on. I didn't see any such difference for double precision, and I'm still not totally sure about the cause.
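For the double-precision comparison, the analogue of the Float32 wrapper at the bottom is enough; roughly this, reusing the libcpp handle defined there:
intd = Libdl.dlsym(libcpp, :integrated)   # double-precision entry point from the C++ code below
function intsotestd(f, l::Float64 = 0.0, u::Float64 = Float64(π), n::Int64 = 10_000_000)
    ccall(f, Float64, (Float64, Float64, Int64), l, u, n)
end
@benchmark intsotestd($intd)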
PS. Compiling with -fopenmp produced threaded executables, but the shared libraries ran serially. I'll look into that a little more. But using Threads.@threads in Julia:
julia> @benchmark integrate($threads)
BenchmarkTools.Trial:
memory estimate: 1.34 KiB
allocs estimate: 53
--------------
minimum time: 3.540 ms (0.00% GC)
median time: 4.035 ms (0.00% GC)
mean time: 5.471 ms (0.00% GC)
maximum time: 10.514 ms (0.00% GC)
--------------
samples: 915
evals/sample: 1
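(Those numbers of course depend on how many threads Julia was started with; I'm assuming something like JULIA_NUM_THREADS=16 here, which you can check with:)
Threads.nthreads()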
Code (Julia first, then the C++ and Fortran translations):
using SIMD

abstract type mode end
struct serial <: mode end
struct threads <: mode end

# Recursive pairwise summation of sin(n_i * dx) for n_i = l, ..., l+n-1.
# The serial Float32 method handles leaves of 16-32 points with two explicit
# 8-wide SIMD vectors plus a scalar loop for the first n-16 points.
function pairint(l, n, dx::Float32, ::Type{serial})
    if n < 33
        u = l + n - 17
        sum_x = sum( Vec{8,Float32}( ( Base.Cartesian.@ntuple 8 i -> sin((u+i) * dx) ) ) +
                     Vec{8,Float32}( ( Base.Cartesian.@ntuple 8 i -> sin((u+8+i) * dx) ) ) )
        for n_i ∈ l:u
            sum_x = sum_x + sin( n_i * dx )
        end
    else
        n_i = n ÷ 2
        sum_x = pairint(l, n_i, dx, serial) + pairint(l + n_i, n - n_i, dx, serial)
    end
    sum_x
end

# Fallback method (used by the threaded version): plain scalar leaves of < 20 points.
function pairint(l, n, dx, ::Any)
    if n < 20
        u = l + n - 1
        sum_x = zero(dx)
        for n_i ∈ l:u
            sum_x = sum_x + sin( n_i * dx )
        end
    else
        n_i = n ÷ 2
        sum_x = pairint(l, n_i, dx, threads) + pairint(l + n_i, n - n_i, dx, threads)
    end
    sum_x
end

function vec_sum(v)
    out = zero(eltype(v))
    for vᵢ ∈ v
        out += vᵢ
    end
    out
end

# Composite trapezoid rule for sin over [0, π] with 10^7 intervals,
# split into nthread chunks that are each summed pairwise.
function integrate(::Type{M} = serial, ::Type{T} = Float32) where {M<:mode, T}
    n_lim = 10_000_000
    nthread = 16
    a = acos( - one(T) )
    dx = (a - zero(T)) / n_lim
    sinx_v = Vector{T}(nthread)
    nrange = Vector{Int}(nthread+1)
    sinx = T(0.5) * ( sin(zero(T)) + sin(a) )
    for i ∈ 0:nthread
        nrange[i+1] = 1 + (n_lim-1) * i ÷ nthread
    end
    if M == threads
        Threads.@threads for i ∈ 1:nthread
            sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx, M)
        end
    else
        for i ∈ 1:nthread
            sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx, M)
        end
    end
    sinx = dx * ( vec_sum(sinx_v) + sinx )
    sinx
end
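For reference, the benchmarks above call this with the mode passed as a type:
integrate(serial)    # single-threaded; should come out ≈ 2, i.e. ∫₀^π sin(x) dx
integrate(threads)   # the Threads.@threads version timed above
The C++ translation: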
#include <cmath>
#include <algorithm>

// Recursive pairwise summation of sin(n_i * dx) for n_i = l, ..., l+n-1.
template<class T>
T pairint(int l, int n, T dx){
    T sum_x ;
    int n_i ;
    if ( n < 20 ){
        int u = l + n - 1 ;
        sum_x = 0.0 ;
        for( n_i = l; n_i <= u; ++n_i){
            sum_x = sum_x + std::sin( n_i * dx ) ;
        }
    } else {
        n_i = n / 2 ;
        sum_x = pairint(l, n_i, dx) + pairint(l + n_i, n - n_i, dx) ;
    }
    return sum_x ;
}

template<class T>
T vec_sum(T v[], int l){
    T sum = 0 ;
    for(int i = 0; i < l; i++){
        sum = sum + v[i] ;
    }
    return sum ;
}

extern "C" float integratef(const float, const float, const long);
extern "C" double integrated(const double, const double, const long);

// Composite trapezoid rule over [l, u], split into nthread chunks.
template<class T>
T integrate(const T l, const T u, const long n_lim)
{
    const int nthread = 16;
    const T dx = (u - l) / n_lim;
    T sinx_v[nthread];
    int nrange[nthread+1];
    T sinx = 0.5 * ( std::sin(l) + std::sin(u) ) ;
    for (int i = 0; i <= nthread; ++i){
        nrange[i] = 1 + (n_lim-1) * i / nthread;
    }
    #pragma omp parallel for
    for (int i = 0; i < nthread; ++i){
        sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx);
    }
    sinx = dx * ( vec_sum(sinx_v, nthread) + sinx ) ;
    return sinx;
}

float integratef(const float l, const float u, const long n_lim){
    return integrate(l, u, n_lim);
}
double integrated(const double l, const double u, const long n_lim){
    return integrate(l, u, n_lim);
}
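This was compiled with the flags mentioned above, along these lines (the source and output file names here are placeholders; the gfortran module further down was built analogously):
run(`g++ -Ofast -march=native -shared -fPIC integrate.cpp -o sharedlib.so`)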
ccall wrapper:
libcpp = Libdl.dlopen("path/to/sharedlib.so")
intf = Libdl.dlsym(libcpp, :integratef)
function intsotest(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10_000_000)
    ccall(f, Float32, (Float32, Float32, Int64), l, u, n)
end
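As a sanity check on any of these, the result should be very close to ∫₀^π sin(x) dx = 2:
intsotest(intf)   # composite trapezoid with 10^7 intervals, so ≈ 2.0f0
The Fortran translation: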
module integral
implicit none
integer, parameter :: nthread = 16

contains

! Composite trapezoid rule over [l, u] with n_lim intervals,
! split into nthread chunks that are each summed pairwise.
! Arguments are passed by reference to match the ccall wrapper below.
function integrate(l, u, n_lim) result(sinx)
    real, intent(in) :: l, u
    integer(8), intent(in) :: n_lim
    real :: dx, sinx
    real, dimension(nthread) :: sinx_v
    integer, dimension(0:nthread) :: nrange
    integer :: i
    dx = (u - l) / n_lim
    sinx = 0.5 * ( sin(l) + sin(u) )
    do i = 0, nthread
        nrange(i) = 1 + (n_lim-1) * i / nthread
    end do
    !$OMP parallel do
    do i = 1, nthread
        sinx_v(i) = pairint(nrange(i-1), nrange(i)-nrange(i-1), dx)
    end do
    !$OMP end parallel do
    sinx = dx * ( sinx + sum(sinx_v) )
end function integrate

! Recursive pairwise summation of sin(n1 * dx) for n1 = l, ..., l+n-1.
recursive function pairint(l, n, dx) result(sum_x)
    implicit none
    integer, intent(in) :: l, n
    real, intent(in) :: dx
    integer :: n1
    real :: sum_x
    if ( n < 20 ) then
        sum_x = 0.0
        do n1 = l, l + n - 1
            sum_x = sum_x + sin( n1 * dx )
        end do
    else
        n1 = n / 2
        sum_x = pairint(l, n1, dx) + pairint(l + n1, n - n1, dx)
    end if
end function pairint

end module integral
ccall wrapper:
libfort1 = Libdl.dlopen("path/to/sharedlib.so")
intfort1 = Libdl.dlsym(libfort1, :__integral_MOD_integrate)
function intsotestf(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10_000_000)
    ccall(f, Float32, (Ptr{Float32}, Ptr{Float32}, Ptr{Int64}), &l, &u, &n)
end