Given that both Julia and Fortran are asymptoting to pretty much exactly the same runtime, my lazy shot at a guess here is that you’re approaching your system’s memory bandwidth. It’s a relatively cheap operation over lots of data (800*800*800*4 or 8 == 2 or 4GB); the cost of shuffling the data around is likely to be the limiting factor.
This is also where GPUs excel.