Does anyone know why this simple dot product is faster in Fortran than in Julia?
(LinearAlgebra.dot is faster, and with LoopVectorization the Julia code also gets faster, but that is not what I am asking about. I am curious why, without those packages, the Fortran compiler does a better job than the Julia one.)
Here the Julia version takes about 2x the time of the Fortran one (compiled with -O3).
Julia:
using BenchmarkTools

function mydot(a, b, N)
    c = 0
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

N = 500_000_000
a = rand(Float32, N)
b = rand(Float32, N)

@btime mydot($a, $b, $N)
Fortran:
program mydot_main
   use, intrinsic :: iso_fortran_env, only: real32
   implicit none
   real(real32) :: c
   integer, parameter :: N = 500000000
   real(real32), allocatable :: a(:), b(:)
   real(real32) :: time1, time2

   allocate(a(N), b(N))
   call random_number(a)
   call random_number(b)

   call cpu_time(time1)
   c = mydot(a, b, N)
   call cpu_time(time2)

   print *, c
   print '("Time: ",f19.17," s")', time2 - time1

contains

   function mydot(a, b, N) result(c)
      real(real32), intent(in) :: a(:)
      real(real32), intent(in) :: b(:)
      integer, intent(in) :: N
      real(real32) :: c
      integer :: i
      c = 0.
      do i = 1, N
         c = c + a(i)*b(i)
      end do
   end function mydot

end program mydot_main
I thought that in such a case the compiler would figure that out and do union-splitting automatically. (The problem is clearly shown by @code_warntype, which I didn't run because my first feeling was that this promotion would be harmless.)
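For reference, the fix would presumably just be to start the accumulator in the arrays' element type, e.g. something like the sketch below (mydot_fixed is only an illustrative name):

using BenchmarkTools

function mydot_fixed(a, b, N)
    c = zero(eltype(a))        # Float32 accumulator for Float32 arrays, so no promotion
    @inbounds @simd for i = 1:N
        c += a[i] * b[i]
    end
    c
end

a = rand(Float32, 1_000_000)
b = rand(Float32, 1_000_000)
# @code_warntype mydot_fixed(a, b, length(a))   # return type is now a concrete Float32
@btime mydot_fixed($a, $b, $(length(a)))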
Those are some long arrays, so this will be completely memory bound.
If you try smaller arrays (once the Julia code is fixed), gfortran will be slower unless you dive deep into optimization options, using -funroll-loops -fvariable-expansion-in-unroller.
By default, LLVM will unroll the loop and use separate accumulators to split the dependency chain.
Both these optimizations need to be turned on manually with gfortran.
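For illustration, the transformation looks roughly like this when written out by hand (a sketch only; the actual compiler output uses SIMD registers and typically more accumulators):

# Hand-written sketch of the "separate accumulators" idea: the four partial
# sums are independent, so the floating-point adds no longer form one long
# dependency chain that each iteration has to wait on.
function mydot_unrolled(a::Vector{Float32}, b::Vector{Float32})
    c1 = c2 = c3 = c4 = 0.0f0
    n = length(a)
    i = 1
    @inbounds while i + 3 <= n
        c1 += a[i]   * b[i]
        c2 += a[i+1] * b[i+1]
        c3 += a[i+2] * b[i+2]
        c4 += a[i+3] * b[i+3]
        i += 4
    end
    @inbounds while i <= n        # remainder loop
        c1 += a[i] * b[i]
        i += 1
    end
    (c1 + c2) + (c3 + c4)
end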
The problem isn't the type instability, it's the promotion type. By accumulating in Float64, you lose the ability to use fma and you halve your vector width.
The fused multiply-add machine instruction on modern CPUs (with the appropriate SIMD extensions) performs several operations of the form a = b*c + a in a single clock cycle, but the compiler can only use it if all three variables involved have the same type.
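From Julia this is exposed as fma (and muladd, which leaves the fusion decision to the compiler); a tiny illustration:

x, y, z = 1.0f0, 2.0f0, 3.0f0
fma(x, y, z)      # x*y + z with a single rounding; lowers to the hardware FMA when available
muladd(x, y, z)   # same value here; the compiler may or may not fuse it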
While you're correct that Float64 will have half the throughput of Float32, I am not sure about your statement regarding the FMA. As far as I know, FMA for Float64 is supported on AVX2 / AVX512.
On most x86 CPUs, the conversion has a reciprocal throughput of 1, while the arithmetic tends to have a reciprocal throughput of 0.5 or 1 (depending on the CPU).
Thus promoting once (converting the Float32 product) tends to be at least as fast as promoting twice (converting both Float32 inputs before a Float64 multiply).
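If you want to check on your own machine, here is an illustrative benchmark (results depend on the CPU; the arrays are kept small so the kernel stays compute-bound rather than memory-bound):

using BenchmarkTools

# Same kernel, parameterized by the accumulator type, so the only difference
# between the two runs is whether the Float32 elements get promoted to Float64.
function mydot_acc(::Type{T}, a, b) where {T}
    c = zero(T)
    @inbounds @simd for i in eachindex(a, b)
        c += a[i] * b[i]
    end
    c
end

a = rand(Float32, 10_000)
b = rand(Float32, 10_000)

@btime mydot_acc(Float32, $a, $b)   # Float32 accumulator: no conversions, full vector width
@btime mydot_acc(Float64, $a, $b)   # Float64 accumulator: extra conversions, half vector width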