Naive dot product faster in Fortran than in Juila

Does anyone knows why this simple dot product is faster in Fortran then in Julia?

(LinearAlgebra.dot is faster and with LoopVectorization the Julia code gets also faster. but this is not what I am questioning. I am curious about why without those packages the Fortran compiler is doing a better job than the Julia one).

Here Julia takes about 2x the time of the Fortran execution (compiled with -O3).

Julia:

using BenchmarkTools

function mydot(a, b, N)
  c = 0
  @inbounds @simd for i = 1:N
    c += a[i]*b[i]
  end
  c
end
N = 500_000_000
a = rand(Float32,N) 
b = rand(Float32,N)

@btime mydot($a,$b,$N)

Fortran:

implicit none

real(real32) :: c
integer, parameter :: N = 500000000
real(real32), allocatable :: a(:), b(:)
real(real32) :: time1, time2

allocate(a(N), b(N))

call random_number(a)
call random_number(b)

call cpu_time(time1)
c = mydot(a, b, N)
call cpu_time(time2)
print *, c
print '("Time: ",f19.17," s")', time2 - time1

contains

function mydot(a, b, N) result(c)
   real(real32), intent(in) :: a(:)
   real(real32), intent(in) :: b(:)
   integer, intent(in) :: N
   real(real32) :: c
   integer :: i
   c = 0.
   do i = 1, N
      c = c + a(i)*b(i)
   enddo
endfunction
endprogram

1 Like

Type stability. You should declare c=zero(eltype(a)). The reason this was slowing you down is adding an Int to a Float32 promotes to Float64.

6 Likes

Yeah for me that took the execution time down by a full factor of 5

I thought in such a case the compiler would figure that out and do union-splitting automatically. (the problem is clearly shown by @code_warntype, which I didn’t because my first feeling was that it this promoting would be harmless).

Those are some long arrays, so this will be completely memory bound.

If you try smaller arrays (once the Julia code is fixed), gfortran will be slower unless you dive deep into optimization options, using -funroll-loops -fvariable-expansion-in-unroller.

By default, LLVM will unroll the loop and use separate accumulators to split the dependency chain.
Both these optimizations need to be turned on manually with gfortran.

5 Likes

The problem isn’t the type instability. It’s the promotion type. By using Float64, you lose the ability to use fma, and get halve your vector with.

7 Likes

Ah, I see. That makes sense :-).

Could you elaborate a bit, please? Only a bit will perhaps suffice.

The fused multiply–add machine instruction in modern CPUs (with such SIMD extensions) performs several operations of the form a = b*c + a in a single clock cycle, but the compiler can only use it if all three variables involved have the same type.

9 Likes

The following type annotation may prevent you from calling an SIMD-unfriendly method instance of this function:

function mydot(a::AbstractVector{T},
               b::AbstractVector{T}, N) where T
  c::T = 0
  @inbounds @simd for i = 1:N
    c += a[i]*b[i]
  end
  c
end
4 Likes

While you’re correct that Float64 will have half throughput of Float32 I am not sure about your statement regarding the FMA. As far as I know, FMA for Float64 is supported on AVX2 / AVX512.

I should have been more clear probably. Float64 fma exists, but there isn’t fma between Float64 and Float32

2 Likes

More specifically, in calculating

c::Float64 = c::Float64 + a::Float32 * b::Float32

either

  1. promote a to Float64
  2. promote b to Float64
  3. c = fma(a64, b64, c)
    or
  4. multiply ab = a*b
  5. promote ab to Float64
  6. add ab64 + c.

On most x86 CPUs, the conversion has a reciprocal throughput of 1, while the arithmetic tends to have a r-throughput of 0.5 or 1 (depending on the CPU).
Thus promoting once tends to be at least as fast as promoting twice.

1 Like