That is a good question, but it’ll take someone with a better understanding of compiler internals and assembly language than me to answer it.
The thing to do would be to compare the x86-64 assembly that LLVM produces from the Julia code with the assembly that gcc produces from the Fortran code. But that’s a lot of assembly language.
Running @code_llvm ksintegrateUnrolled(u,Lx, dt, Nt) produces about a thousand lines of LLVM intermediate representation (IR), and @code_native produces several times that much assembly.
It’s possible to dig through the LLVM IR or assembly and focus on code for the time-stepping loop, but it’s still beyond my understanding.
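For anyone who wants to try that kind of inspection on something smaller first, here is a minimal sketch. The function axpy_loop! is a hypothetical stand-in I made up for illustration, not the benchmark's time-stepping loop; the only real API used is code_llvm and code_native from InteractiveUtils, called in their function form so the output can be captured as a string.

```julia
using InteractiveUtils  # provides code_llvm and code_native outside the REPL

# Hypothetical toy loop (an assumption for illustration, not the KS integrator).
function axpy_loop!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @inbounds for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

# Argument types for which we want to see the compiled code.
argtypes = (Vector{Float64}, Float64, Vector{Float64})

# sprint(f, args...) calls f(io, args...) and returns what was written to io.
llvm_ir = sprint(code_llvm, axpy_loop!, argtypes)
native  = sprint(code_native, axpy_loop!, argtypes)

# Line counts give a rough sense of how much there is to read.
println("LLVM IR lines:  ", count(==('\n'), llvm_ir))
println("assembly lines: ", count(==('\n'), native))
```

Writing the two strings to files makes it easier to search for the loop body, and the same approach should work on the real function with its actual argument types.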
Even so, the ksintegrateUnrolled function in Julia 1.2.0 is only 15% slower than the equivalent Fortran code.