EDIT: I get exactly the same performance from Julia and Fortran, with and without fast-math (and -march=native doesn't seem to matter).
Compiling with
gfortran -O3 -march=native rootloop.f90 -shared -fPIC -o libloop.so
gfortran -Ofast -march=native rootloop.f90 -shared -fPIC -o libloopfast.so
Yields:
julia> using BenchmarkTools
julia> function loop(x,n)
         s = 0.
         for i in 1:n
           r = sqrt(i)
           s = s + x*r
         end
         s
       end
loop (generic function with 1 method)
julia> function loopfast(x,n)
         s = 0.
         @fastmath for i in 1:n
           r = sqrt(i)
           s = s + x*r
         end
         s
       end
loopfast (generic function with 1 method)
julia> floop(x,n) = ccall((:loop_,"libloop.so"),Float64,(Ref{Float64},Ref{Int64}),x,n)
floop (generic function with 1 method)
julia> floopfast(x,n) = ccall((:loop_,"libloopfast.so"),Float64,(Ref{Float64},Ref{Int64}),x,n)
floopfast (generic function with 1 method)
julia> x = rand() ; n = 10_000_000;
julia> @btime loop($x, $n)
  13.054 ms (0 allocations: 0 bytes)
7.112252395049944e9
julia> @btime floop($x, $n)
  13.054 ms (0 allocations: 0 bytes)
7.112252395049944e9
julia> @btime loopfast($x, $n)
  6.982 ms (0 allocations: 0 bytes)
7.112252395049599e9
julia> @btime floopfast($x, $n)
  6.982 ms (0 allocations: 0 bytes)
7.1122523950496235e9
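
Note that the fast-math results differ from the plain ones in the last few digits. A likely reason is that fast-math allows the compiler to reorder the floating-point additions (for example, to vectorize the loop), and floating-point addition is not associative. Here is a minimal sketch of that effect, not what either compiler literally emits: `loop4` is a hypothetical helper that splits the sum into four independent accumulators by hand, and the four-way split is an assumption chosen only for illustration.

```julia
# Reference version: strictly sequential accumulation, as in `loop` above.
function loop(x, n)
    s = 0.0
    for i in 1:n
        s += x * sqrt(i)
    end
    return s
end

# Hypothetical helper (not from the original post): the same sum, but
# accumulated in four independent partial sums, mimicking the kind of
# reordering a vectorizing compiler is allowed to do under fast-math.
function loop4(x, n)
    s1 = s2 = s3 = s4 = 0.0
    i = 1
    while i + 3 <= n
        s1 += x * sqrt(i)
        s2 += x * sqrt(i + 1)
        s3 += x * sqrt(i + 2)
        s4 += x * sqrt(i + 3)
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    # Handle the remaining terms when n is not a multiple of 4.
    while i <= n
        s += x * sqrt(i)
        i += 1
    end
    return s
end

x = 0.5; n = 10_000_000
# The two sums agree to within rounding error...
println(loop4(x, n) ≈ loop(x, n))
# ...but the last digits may differ, because the additions happen in a different order.
println(loop4(x, n) - loop(x, n))
```

If the reordering explanation is right, the difference printed at the end should be tiny relative to the sum itself, similar in scale to the discrepancy between the loop and loopfast results above.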