Naive dot product faster in Fortran than in Julia

While you’re correct that Float64 will have half the throughput of Float32, I am not sure about your statement regarding FMA. As far as I know, FMA for Float64 is supported on AVX2 / AVX512 hardware.
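For what it's worth, here is a minimal sketch of how one might check this from Julia. The function name `naive_dot` is made up for illustration and is not the code from the original post. It uses `muladd`, which lets the compiler fuse the multiply and add when the target supports it, and `fma`, which always rounds once (falling back to software if the hardware lacks the instruction).

```julia
# Hypothetical naive dot product for Float64, not the original poster's code.
function naive_dot(x::Vector{Float64}, y::Vector{Float64})
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    s = 0.0
    @inbounds for i in eachindex(x, y)
        # muladd permits, but does not require, a fused multiply-add
        s = muladd(x[i], y[i], s)
    end
    return s
end

# fma rounds only once, so it keeps the representation error of 0.1,
# whereas the unfused product 0.1 * 10.0 rounds to exactly 1.0 first.
fused   = fma(0.1, 10.0, -1.0)
unfused = 0.1 * 10.0 - 1.0

println(naive_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
println(fused, "  ", unfused)
```

On an x86 machine, running `@code_native naive_dot(x, y)` in the REPL and looking for `vfmadd` instructions should show whether a Float64 FMA was actually emitted; I would expect it on AVX2-era CPUs, but that depends on the CPU and on the target Julia was started with.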