I’ve been working on “translating” my 3d finite difference codes for (magneto)hydrodynamics, and am at the stage where for at least one set of parameters, the codes are giving similar results (damped propagating waves). However code run times are completely different. For a 91x91x91 grid, I have

32 mins : F77 code with reals

(old graduate school code! with do 10 i=1:nx; do 10 j=1:ny; do 10 k=1:nz … code block … 10 continue

16 mins : F90 code (above code “modernized” with implicit loops)

64 mins : Julia code with Float 32; based on earlier tests the Float 64 will be about 1.5-2 times slower than the Float32 version

So, now I’m really curious about how to speed up the matrix multiplies, because 90 % of the code is

of the form

function update_xmom!(…)

forc[iaX,iaY,iaZ] .= -namelist.t0*ddx(deni,mhd_paramlist,namelist,numx,numy,numz)

fx[iaX,iaY,iaZ] .= (fxp1[iaX,iaY,iaZ].-fxm1[iaX,iaY,iaZ]).*mhd_paramlist.dxt23 .- (fxp2[iaX,iaY,iaZ].-fxm2[iaX,iaY,iaZ]).*mhd_paramlist.dxt12

pux .= pux .+ fx .+ forc

while the boundary conditions are applied using eg

function bcy3sym!(foo::Array{Float32,3},numx::Int,numy::Int,numz::Int)

foo[:,2,:] .= foo[:,4,:]

foo[:,1,:] .= foo[:,5,:]

foo[:,numy-1,:] .= foo[:,numy-3,:]

foo[:,numy-0,:] .= foo[:,numy-4,:]

return

end

Yesterday based on earlier replies, I wrote up this piece of Julia code to see what is the fastest way to do the matrix multiplies (dot element or loops) and seems to me the dot element is the fastest?

```
using LinearAlgebra,Tullio, LoopVectorization, Einsum, Printf, BenchmarkTools
function A_dotmul_B!(C, A, B)
@turbo for n ∈ indices(C,2), m ∈ indices(C, 1)
C[m,n] = A[m,n] * B[m,n]
end
end
function matdot!(s, a, b)
@turbo for i ∈ eachindex(a)
s[i] = a[i] * b[i]
end
end
function eindot!(mat3,mat1,mat2)
@einsum mat3[i,j] = mat1[i,j] * mat2[i,j];
end
function mygemm!(C, A, B)
@turbo for m ∈ axes(A,1), n ∈ axes(A,2)
C[m,n] = A[m,n] * B[m,n]
end
end
########################################################################
nx = 9
nx = 91
nG = 2
mat1 = Array{Float64}(undef,nx,nx,nx);
mat2 = Array{Float64}(undef,nx,nx,nx);
mat3A = Array{Float64}(undef,nx,nx,nx);
mat3B = Array{Float64}(undef,nx,nx,nx);
iLen = nx - nG*2
iLen = nx
for i in 1:iLen, j in 1:iLen, k in 1:iLen
mat1[i,j,k] = 15*i*rand() + 11*j*rand() + 7*k*rand()
mat2[i,j,k] = 15*i*rand() + 11*j*rand() + 7*k*rand()
end
@time mat3A .= mat1 .* mat2
#########################
@time (for i in 1:nx, j in 1:nx, k in 1:nx
mat3B[i,j,k] = mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 1 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time (for i in 1:nx, j in 1:nx, k in 1:nx
mat3B[i,j,k] = @views mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 2 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time (for i in 1:nx, j in 1:nx, k in 1:nx
@views mat3B[i,j,k] = mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 3 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time A_dotmul_B!(mat3B, mat1, mat2)
@printf("vers 4 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time matdot!(mat3B, mat1, mat2)
@printf("vers 5 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time eindot!(mat3B, mat1, mat2);
@printf("vers 6 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time mygemm!(mat3B, mat1, mat2);
@printf("vers 7 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
```

the output is

julia> include(“time_mult_arrays.jl”)

0.003848 seconds (2 allocations: 64 bytes)

0.230426 seconds (3.03 M allocations: 58.004 MiB, 6.34% gc time)

vers 1 : sum(mat3A.-mat3B) = 0.000000000000

0.621630 seconds (6.05 M allocations: 126.995 MiB, 1.91% gc time)

vers 2 : sum(mat3A.-mat3B) = 0.000000000000

0.598883 seconds (6.05 M allocations: 126.995 MiB, 4.16% gc time)

vers 3 : sum(mat3A.-mat3B) = 0.000000000000

0.058982 seconds (85.90 k allocations: 3.283 MiB, 99.89% compilation time)

vers 4 : sum(mat3A.-mat3B) = 0.000000000000

0.029835 seconds (34.93 k allocations: 1.478 MiB, 94.30% compilation time)

vers 5 : sum(mat3A.-mat3B) = 0.000000000000

0.022794 seconds (21.83 k allocations: 1.159 MiB, 99.82% compilation time)

vers 6 : sum(mat3A.-mat3B) = 0.000000000000

0.076860 seconds (85.56 k allocations: 3.271 MiB, 99.91% compilation time)

vers 7 : sum(mat3A.-mat3B) = 0.000000000000

So, I was wondering if there are other ways to speed up the code, or if I made mistakes in what I have coded up? Or perhaps I need to be looking elsewhere for the bottlenecks. By the way, I typically start my simulation code using “julia” at the Linux prompt, but trying eg “julia --math-mode=fast -O3 --check-bounds=no” makes hardly any difference to the run time