EDIT: Please disregard what I say about ifort here. I should have used `-fast`

instead of `-Ofast`

. I will test with the correct flag later.

I’ve routinely found ifort to be a factor 2-3 faster compared to gfortran. So comparisons between Julia and Fortran are only fair if the same compiler is used.

It seems highly variable. I’ve been writing Fortran for a consulting project where the user is dedicated to MATLAB. Fortran seemed a lot more interesting, so I decided to work in that.

I’m now on a computer with a 3.6 GHz i7:

```
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
WORD_SIZE: 64
BLAS: libmkl_rt
LAPACK: libmkl_rt
LIBM: libimf
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
```

Now…

```
$ ifort -Ofast -xHost -shared -fPIC cholesky.F90 normal.F90 -o foster.so
$ gfortran-7 -Ofast -march=native -shared -fPIC cholesky.F90 normal.F90 -o gfoster.so
```

```
julia> using Compat, BenchmarkTools
julia> libfort1 = Libdl.dlopen("/home/celrod/Documents/ProbCol/foster.so");
julia> intfort1 = Libdl.dlsym(libfort1, :probcol_mp_pc2dfoster_circle_);
julia> libgfort1 = Libdl.dlopen("/home/celrod/Documents/ProbCol/gfoster.so");
julia> gfort1 = Libdl.dlsym(libgfort1, :__probcol_MOD_pc2dfoster_circle);
julia> function intsotestf!(Pc::Ref{Float64}, f,r1,v1,cov1,r2,v2,cov2,HBR::Ref{Float64})
ccall(f, Cvoid, (Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}), Pc, r1, v1, cov1, r2, v2, cov2, HBR)
Pc
end
intsotestf! (generic function with 1 method)
###Defined inputs
julia> intsotestf!(Pc, intfort1, r1, v1, cov1, r2, v2, cov2, rHBR)
Base.RefValue{Float64}(2.7060234765689587e-5)
julia> intsotestf!(Pc, gfort1, r1, v1, cov1, r2, v2, cov2, rHBR)
Base.RefValue{Float64}(2.706023476568956e-5)
julia> @benchmark intsotestf!($Pc, $intfort1, $r1, $v1, $cov1, $r2, $v2, $cov2, $rHBR)
BenchmarkTools.Trial:
memory estimate: 192 bytes
allocs estimate: 6
--------------
minimum time: 1.831 μs (0.00% GC)
median time: 1.840 μs (0.00% GC)
mean time: 1.874 μs (0.00% GC)
maximum time: 3.204 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark intsotestf!($Pc, $gfort1, $r1, $v1, $cov1, $r2, $v2, $cov2, $rHBR)
BenchmarkTools.Trial:
memory estimate: 192 bytes
allocs estimate: 6
--------------
minimum time: 631.165 ns (0.00% GC)
median time: 635.118 ns (0.00% GC)
mean time: 651.982 ns (1.55% GC)
maximum time: 8.420 μs (87.74% GC)
--------------
samples: 10000
evals/sample: 170
```

I also translated all the code to Julia:

```
julia> pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,HBR)
2.7060234765689807e-5
julia> @benchmark pc2dfoster_circle($r1,$v1,$cov1,$r2,$v2,$cov2,$HBR)
BenchmarkTools.Trial:
memory estimate: 672 bytes
allocs estimate: 11
--------------
minimum time: 1.835 μs (0.00% GC)
median time: 1.873 μs (0.00% GC)
mean time: 1.986 μs (1.25% GC)
maximum time: 135.751 μs (92.27% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark erf(2.0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 44.628 ns (0.00% GC)
median time: 44.743 ns (0.00% GC)
mean time: 46.014 ns (0.00% GC)
maximum time: 98.411 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 990
```

Julia built with libopenlibm and OpenBLAS:

```
julia> pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,HBR)
2.7060234765689814e-5
julia> @benchmark pc2dfoster_circle($r1,$v1,$cov1,$r2,$v2,$cov2,$HBR)
BenchmarkTools.Trial:
memory estimate: 672 bytes
allocs estimate: 11
--------------
minimum time: 700.407 ns (0.00% GC)
median time: 714.893 ns (0.00% GC)
mean time: 768.352 ns (3.42% GC)
maximum time: 8.002 μs (81.71% GC)
--------------
samples: 10000
evals/sample: 145
julia> @benchmark erf(2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 34.302 ns (0.00% GC)
median time: 34.449 ns (0.00% GC)
mean time: 37.365 ns (0.00% GC)
maximum time: 115.258 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 993
```

Performance is broadly comparable.

My Julia implementation was a direct translation, without any extra decorators for performance like `@fastmath`

(turned on in both Fortran compilations with `Ofast`

) or `@inbounds`

(Fortran does not check bounds by default).

```
using SpecialFunctions
function gdensity(w, hbr2=1.0, x0=2.0, u11=1.9979, u12=-0.230327, u22=2.61677)
nu12w = -u12*w
radical = sqrt(hbr2 - abs2(u11*w-x0))
exp(-abs2(w)/2)*( erf((nu12w+radical)*u22) - erf((nu12w-radical)*u22) )
end
gdensity(w, hbr2, x0, u::AbstractArray) = gdensity(w, hbr2, x0, u[1], u[2], u[3])
function gaussianarea(hbr, x0, U)
n = ( 0.09226835946330202, 0.273662990072083, 0.4457383557765383, 0.6026346363792564, 0.7390089172206591, 0.8502171357296142, 0.9324722294043558, 0.9829730996839018 )
w = ( 0.036704932945846216,0.035454990477090234,0.032997670831211, 0.029416655081518854,0.02483389042441457,0.01940543741387547, 0.01331615550646369,0.006773407894247478 )
hbr2 = abs2(hbr)
s = hbr / U[1]
mid = x0 / U[1]
ga = w[1] * ( gdensity(mid-s*n[1], hbr2,x0,U) + gdensity(mid+s*n[1], hbr2,x0,U) )
for i = 2:length(n)
ga += w[i] * ( gdensity(mid-s*n[i], hbr2,x0,U) + gdensity(mid+s*n[i], hbr2,x0,U) )
end
ga * s
end
function normalize!(x)
x0 = norm(x)
x ./= x0
end
function pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,hbr)
Cp = r1 .- r2
x0 = norm(Cp)
y = v1 - v2
z = cross(Cp,y)
normalize!(y)
normalize!(z)
combinecov!(Cp, cov1, cov2, y, z)
cholesky2x2!(Cp)
Cp[3] = 1/(Cp[3]*√2)
gaussianarea(hbr, x0, Cp)
end
function invert2x2triangle!(Ui, U)
t11 = 1 / U[1]
t12 = - U[2] * t11 / U[3]
Ui[1] = t11
Ui[2] = t12
Ui[3] = 1 / U[3]
Ui
end
function combinecov!(Cp, cov1, cov2, y, z)
cov = cov1 .+ cov2
x = cross(y, z)
a = x[1]*cov[1] + x[2]*cov[2] + x[3]*cov[4]
b = x[1]*cov[2] + x[2]*cov[3] + x[3]*cov[5]
c = x[1]*cov[4] + x[2]*cov[5] + x[3]*cov[6]
Cp[1] = a*x[1] + b*x[2] + c*x[3]
Cp[2] = a*z[1] + b*z[2] + c*z[3]
a = z[1]^2*cov[1] + 2*z[1]*z[2]*cov[2] + 2*z[1]*z[3]*cov[4]
b = z[2]^2*cov[3] + 2*z[2]*z[3]*cov[5] + z[3]^2*cov[6]
Cp[3] = a + b
end
function cholesky2x2!(S)
s11 = sqrt(S[1])
t12 = S[2] / s11
S[1] = s11
S[2] = t12
S[3] = sqrt( S[3] - t12^2 )
end
###data
r1 = [378.39559, 4305.721887, 5752.767554];
v1 = [2.360800244, 5.580331936, -4.322349039];
r2 = [374.5180598, 4307.560983, 5751.130418];
v2 = [-5.388125081, -3.946827739, 3.322820358];
cov1 = [44.5757544811362, 81.6751751052616, 158.453402956163, -67.8687662707124, -128.616921644857, 105.490542562701];
cov2 = [2.31067077720423, 1.69905293875632, 1.24957388457206, -1.4170164577661, -1.04174164279599, 0.869260558223714];
HBR = 0.020;
pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,HBR)
```

I would be happy to share the Fotran code too if anyone is willing to look over it.

Maybe I’m missing something, and making blunders I’m unaware of. But Julia seems broadly comparable.