# Naive dot product faster in Fortran than in Julia

Does anyone know why this simple dot product is faster in Fortran than in Julia?

(`LinearAlgebra.dot` is faster, and with `LoopVectorization` the Julia code also gets faster, but that is not what I am asking about. I am curious why, without those packages, the Fortran compiler is doing a better job than the Julia one.)

Here Julia takes about 2x the time of the Fortran execution (compiled with `-O3`).

Julia:

```julia
using BenchmarkTools

function mydot(a, b, N)
    c = 0
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

N = 500_000_000
a = rand(Float32, N)
b = rand(Float32, N)

@btime mydot($a, $b, $N)
```

Fortran:

```fortran
program main
use, intrinsic :: iso_fortran_env, only: real32
implicit none

real(real32) :: c
integer, parameter :: N = 500000000
real(real32), allocatable :: a(:), b(:)
real(real32) :: time1, time2

allocate(a(N), b(N))

call random_number(a)
call random_number(b)

call cpu_time(time1)
c = mydot(a, b, N)
call cpu_time(time2)
print *, c
print '("Time: ",f19.17," s")', time2 - time1

contains

function mydot(a, b, N) result(c)
    real(real32), intent(in) :: a(:)
    real(real32), intent(in) :: b(:)
    integer, intent(in) :: N
    real(real32) :: c
    integer :: i
    c = 0.
    do i = 1, N
        c = c + a(i)*b(i)
    end do
end function
end program
```
1 Like

Type stability. You should declare `c = zero(eltype(a))`. The reason this was slowing you down is that adding an `Int` to a `Float32` promotes the result to `Float64`.
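To make the fix concrete, here is a sketch of the corrected function (the name `mydot_fixed` is just for illustration): the only change from the original is the accumulator's initial value.

```julia
# Initialize the accumulator with the element type of the input array
# instead of the Int literal 0, so `c` stays Float32 throughout the loop.
function mydot_fixed(a, b, N)
    c = zero(eltype(a))
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

a = rand(Float32, 1000)
b = rand(Float32, 1000)
@assert mydot_fixed(a, b, length(a)) isa Float32
```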

6 Likes

Yeah for me that took the execution time down by a full factor of 5

I thought in such a case the compiler would figure that out and do union-splitting automatically. (The problem is clearly shown by `@code_warntype`, which I didn't check because my first feeling was that this promotion would be harmless.)

Those are some long arrays, so this will be completely memory bound.

If you try smaller arrays (once the Julia code is fixed), `gfortran` will be slower unless you dive deep into optimization options, using `-funroll-loops -fvariable-expansion-in-unroller`.

By default, LLVM will unroll the loop and use separate accumulators to split the dependency chain.
Both these optimizations need to be turned on manually with `gfortran`.
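For intuition, here is roughly what that dependency-chain splitting looks like when written by hand (a sketch with made-up names, not what either compiler literally emits):

```julia
# Four independent accumulators break the serial dependency on a single
# `c`, letting the CPU keep several multiply-adds in flight per cycle.
function mydot_unrolled(a::Vector{Float32}, b::Vector{Float32})
    c1 = c2 = c3 = c4 = 0.0f0
    n = length(a)
    i = 1
    @inbounds while i + 3 <= n
        c1 += a[i]   * b[i]
        c2 += a[i+1] * b[i+1]
        c3 += a[i+2] * b[i+2]
        c4 += a[i+3] * b[i+3]
        i += 4
    end
    c = c1 + c2 + c3 + c4
    @inbounds while i <= n      # remainder loop for leftover elements
        c += a[i] * b[i]
        i += 1
    end
    c
end
```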

5 Likes

The problem isn't the type instability. It's the promotion type. By using `Float64`, you lose the ability to use `fma`, and you halve your vector width.

7 Likes

Ah, I see. That makes sense :-).

Could you elaborate a bit, please? Only a bit will perhaps suffice.

The fused multiply-add machine instruction in modern CPUs (with SIMD extensions) performs several operations of the form `a = b*c + a` in a single clock cycle, but the compiler can only use it if all three variables involved have the same type.
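In Julia this pattern is expressed with `fma` (or `muladd`, which lets the compiler use the fused instruction when available). A small illustration of the same-type requirement, with values chosen to be exactly representable:

```julia
x, y, c = 1.5f0, 2.0f0, 0.25f0

# All three arguments Float32: a single fused multiply-add applies.
@assert fma(x, y, c) === 3.25f0

# Mixed types: Julia promotes everything to Float64 first, so the
# result is a Float64 and the Float32-wide FMA can no longer be used.
@assert fma(x, y, Float64(c)) === 3.25
```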

9 Likes

The following type annotation may prevent you from calling an SIMD-unfriendly method instance of this function:

```julia
function mydot(a::AbstractVector{T},
               b::AbstractVector{T}, N) where T
    c::T = 0
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end
```
4 Likes

While you're correct that `Float64` will have half the throughput of `Float32`, I am not sure about your statement regarding the `FMA`. As far as I know, `FMA` for `Float64` is supported on `AVX2` / `AVX512`.

I should probably have been clearer. `Float64` `fma` exists, but there isn't an `fma` between `Float64` and `Float32`.

2 Likes

More specifically, in calculating

```julia
c::Float64 = c::Float64 + a::Float32 * b::Float32
```

you can either

1. promote `a` to `Float64`,
2. promote `b` to `Float64`,
3. `c = fma(a64, b64, c)`;

or

1. multiply `ab = a*b`,
2. promote `ab` to `Float64`,
3. add `ab64 + c`.

On most x86 CPUs, the conversion has a reciprocal throughput of 1, while the arithmetic tends to have a reciprocal throughput of 0.5 or 1 (depending on the CPU).
Thus promoting once tends to be at least as fast as promoting twice.
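The two strategies can be written out directly (values here are exactly representable, so both give the same answer; in general the second strategy rounds the product in `Float32` first):

```julia
a32, b32 = 1.25f0, 2.0f0
c64 = 10.0

# Strategy 1: two conversions, then a Float64 fma.
r1 = fma(Float64(a32), Float64(b32), c64)

# Strategy 2: one Float32 multiply, one conversion, then a Float64 add.
# (The product is rounded in Float32 before the promotion.)
r2 = Float64(a32 * b32) + c64

@assert r1 == r2 == 12.5
```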

1 Like