Does anyone know why this simple dot product is faster in Fortran than in Julia?
(LinearAlgebra.dot is faster, and with LoopVectorization the Julia code also gets faster, but that is not what I am asking about. I am curious why, without those packages, the Fortran compiler does a better job than the Julia one.)
Here the Julia version takes about 2x the time of the Fortran one (compiled with -O3).
Julia:
using BenchmarkTools

function mydot(a, b, N)
    c = 0
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

N = 500_000_000
a = rand(Float32, N)
b = rand(Float32, N)

@btime mydot($a, $b, $N)
Fortran:
program mydot_main
   use, intrinsic :: iso_fortran_env, only: real32
   implicit none
   real(real32) :: c
   integer, parameter :: N = 500000000
   real(real32), allocatable :: a(:), b(:)
   real(real32) :: time1, time2

   allocate(a(N), b(N))
   call random_number(a)
   call random_number(b)

   call cpu_time(time1)
   c = mydot(a, b, N)
   call cpu_time(time2)

   print *, c
   print '("Time: ",f19.17," s")', time2 - time1

contains

   function mydot(a, b, N) result(c)
      real(real32), intent(in) :: a(:)
      real(real32), intent(in) :: b(:)
      integer, intent(in) :: N
      real(real32) :: c
      integer :: i
      c = 0.
      do i = 1, N
         c = c + a(i)*b(i)
      end do
   end function mydot

end program mydot_main
I thought that in such a case the compiler would figure that out and do union-splitting automatically. (The problem is clearly shown by @code_warntype, which I didn't run because my first feeling was that this promotion would be harmless.)
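For reference, the fix would presumably just be to start the accumulator in the arrays' element type, e.g. something like the sketch below (mydot_fixed is only an illustrative name):

using BenchmarkTools

function mydot_fixed(a, b, N)
    c = zero(eltype(a))        # Float32 accumulator for Float32 arrays, so no promotion
    @inbounds @simd for i = 1:N
        c += a[i] * b[i]
    end
    c
end

a = rand(Float32, 1_000_000)
b = rand(Float32, 1_000_000)
# @code_warntype mydot_fixed(a, b, length(a))   # return type is now a concrete Float32
@btime mydot_fixed($a, $b, $(length(a)))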
Those are some long arrays, so this will be completely memory bound.
If you try smaller arrays (once the Julia code is fixed), gfortran will be slower unless you dive deep into optimization options, using -funroll-loops -fvariable-expansion-in-unroller.
By default, LLVM will unroll the loop and use separate accumulators to split the dependency chain.
Both these optimizations need to be turned on manually with gfortran.
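For illustration, the transformation looks roughly like this when written out by hand (a sketch only; the actual compiler output uses SIMD registers and typically more accumulators):

# Hand-written sketch of the "separate accumulators" idea: the four partial
# sums are independent, so the floating-point adds no longer form one long
# dependency chain that each iteration has to wait on.
function mydot_unrolled(a::Vector{Float32}, b::Vector{Float32})
    c1 = c2 = c3 = c4 = 0.0f0
    n = length(a)
    i = 1
    @inbounds while i + 3 <= n
        c1 += a[i]   * b[i]
        c2 += a[i+1] * b[i+1]
        c3 += a[i+2] * b[i+2]
        c4 += a[i+3] * b[i+3]
        i += 4
    end
    @inbounds while i <= n        # remainder loop
        c1 += a[i] * b[i]
        i += 1
    end
    (c1 + c2) + (c3 + c4)
end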
The problem isn't the type instability, it's the promotion type. By accumulating in Float64, you lose the ability to use fma and you halve your vector width.
The fused multiply-add machine instruction on modern CPUs (with the appropriate SIMD extensions) performs several operations of the form a = b*c + a in a single clock cycle, but the compiler can only use it if all three variables involved have the same type.
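From Julia this is exposed as fma (and muladd, which leaves the fusion decision to the compiler); a tiny illustration:

x, y, z = 1.0f0, 2.0f0, 3.0f0
fma(x, y, z)      # x*y + z with a single rounding; lowers to the hardware FMA when available
muladd(x, y, z)   # same value here; the compiler may or may not fuse it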
While you're correct that Float64 will have half the throughput of Float32, I am not sure about your statement regarding the FMA. As far as I know, FMA for Float64 is supported on AVX2 / AVX512.
On most x86 CPUs, the conversion has a reciprocal throughput of 1, while the arithmetic tends to have a reciprocal throughput of 0.5 or 1 (depending on the CPU).
Thus promoting once (converting the Float32 product) tends to be at least as fast as promoting twice (converting both Float32 inputs before a Float64 multiply).
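If you want to check on your own machine, here is an illustrative benchmark (results depend on the CPU; the arrays are kept small so the kernel stays compute-bound rather than memory-bound):

using BenchmarkTools

# Same kernel, parameterized by the accumulator type, so the only difference
# between the two runs is whether the Float32 elements get promoted to Float64.
function mydot_acc(::Type{T}, a, b) where {T}
    c = zero(T)
    @inbounds @simd for i in eachindex(a, b)
        c += a[i] * b[i]
    end
    c
end

a = rand(Float32, 10_000)
b = rand(Float32, 10_000)

@btime mydot_acc(Float32, $a, $b)   # Float32 accumulator: no conversions, full vector width
@btime mydot_acc(Float64, $a, $b)   # Float64 accumulator: extra conversions, half vector width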