You should add -march=native to the gfortran compile flags in your deps/build.jl. This allows gfortran to emit SIMD-vectorized code that fits your CPU architecture. (-O3 does not imply any CPU architecture setting.) Otherwise it'll default to slower code that targets a generic baseline CPU. Most likely the speed difference then goes away, since Julia by default instructs LLVM to use the same "native" setting when optimisation is enabled (as it is by default).
To see the difference, you can use objdump -d array_routine.so to show the disassembled code, which is essentially what you would see from @code_native in Julia.
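As a rough way to compare two builds without reading the whole listing, one could count scalar versus packed instructions in the objdump output. The helper below is only a sketch: count_simd is a made-up name, and its mnemonic list covers just a handful of double-precision operations, so treat the numbers as a hint rather than a verdict.

```shell
#!/bin/sh
# count_simd (hypothetical helper): reads objdump -d output on stdin and
# counts scalar vs. packed double-precision instructions.
# Scalar mnemonics end in "sd" (movsd, addsd, vaddsd);
# packed (vectorised) ones end in "pd" (movupd, addpd, vaddpd).
count_simd() {
    awk '
        {
            # Check every whitespace-separated field for a known mnemonic.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^v?(mov[au]?|add|sub|mul|div)sd$/) scalar++
                else if ($i ~ /^v?(mov[au]?|add|sub|mul|div)pd$/) packed++
            }
        }
        END { printf "scalar=%d packed=%d\n", scalar, packed }
    '
}
```

Usage would look like objdump -d array_routine.so | count_simd. A build whose loop was vectorised should report a non-zero packed count.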
Also, -O3 can lead to suboptimal code when the optimiser doesn't have enough information about exactly how functions are used and ends up making overly aggressive assumptions. This is a particular problem when writing shared libraries that iterate over arrays, so -O2 sometimes gives faster code. Be sure to check both settings.
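To check the settings side by side, one option is to build a separate shared library per flag combination and time each from Julia. The loop below is a minimal sketch under some assumptions: the source file name array_routine.f90 and the function name build_variants are hypothetical, and -shared -fPIC is assumed to be how the library is built.

```shell
#!/bin/sh
# Build one shared library per flag combination so each can be timed separately.
# SRC is a hypothetical source file name; FC can be overridden (defaults to gfortran).
SRC=${SRC:-array_routine.f90}
FC=${FC:-gfortran}

build_variants() {
    for flags in "-O2" "-O3" "-O2 -march=native" "-O3 -march=native"; do
        # Turn the flags into a file-name suffix,
        # e.g. "-O3 -march=native" becomes "O3_march_native".
        tag=$(printf '%s' "$flags" | sed 's/[^A-Za-z0-9]\{1,\}/_/g; s/^_//')
        # $flags is intentionally unquoted so it splits into separate options.
        $FC $flags -shared -fPIC -o "array_routine_${tag}.so" "$SRC" || return 1
    done
}
```

Calling build_variants in the directory containing the source would produce four libraries, for example array_routine_O2.so and array_routine_O3_march_native.so, which can then be loaded and benchmarked one at a time.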
Addendum:
To clarify, the above is with regard to the comment "The array function is much slower, perhaps due to memory allocation in the fortran subroutine."
So, here's the disassembly of the Fortran code using -O2 (left) and -O2 -march=native (right); lines that differ are marked with |:
0000000000000610 <array_routine_>: 0000000000000610 <array_routine_>:
610: 48 8b 0a mov (%rdx),%rcx 610: 48 8b 0a mov (%rdx),%rcx
613: b8 01 00 00 00 mov $0x1,%eax 613: b8 01 00 00 00 mov $0x1,%eax
618: 48 85 c9 test %rcx,%rcx 618: 48 85 c9 test %rcx,%rcx
61b: 48 8d 51 01 lea 0x1(%rcx),%rdx 61b: 48 8d 51 01 lea 0x1(%rcx),%rdx
61f: 7e 20 jle 641 <array_routine_+0x 61f: 7e 20 jle 641 <array_routine_+0x
621: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 621: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
628: f2 0f 10 44 c7 f8 movsd -0x8(%rdi,%rax,8),%xmm | 628: c5 fb 10 44 c7 f8 vmovsd -0x8(%rdi,%rax,8),%xmm
62e: f2 0f 58 c0 addsd %xmm0,%xmm0 | 62e: c5 fb 58 c0 vaddsd %xmm0,%xmm0,%xmm0
632: f2 0f 11 44 c6 f8 movsd %xmm0,-0x8(%rsi,%rax,8 | 632: c5 fb 11 44 c6 f8 vmovsd %xmm0,-0x8(%rsi,%rax,8
638: 48 83 c0 01 add $0x1,%rax 638: 48 83 c0 01 add $0x1,%rax
63c: 48 39 d0 cmp %rdx,%rax 63c: 48 39 d0 cmp %rdx,%rax
63f: 75 e7 jne 628 <array_routine_+0x 63f: 75 e7 jne 628 <array_routine_+0x
641: f3 c3 repz retq 641: f3 c3 repz retq
And it appears that -O3 -march=native is actually significantly faster on an AVX-capable machine.