I’ve been working on “translating” my 3d finite difference codes for (magneto)hydrodynamics, and am at the stage where for at least one set of parameters, the codes are giving similar results (damped propagating waves). However code run times are completely different. For a 91x91x91 grid, I have
32 mins : F77 code with reals
(old graduate school code! with do 10 i=1:nx; do 10 j=1:ny; do 10 k=1:nz … code block … 10 continue
16 mins : F90 code (above code “modernized” with implicit loops)
64 mins : Julia code with Float 32; based on earlier tests the Float 64 will be about 1.5-2 times slower than the Float32 version
So, now I’m really curious about how to speed up the matrix multiplies, because 90 % of the code is
of the form
function update_xmom!(…)
forc[iaX,iaY,iaZ] .= -namelist.t0*ddx(deni,mhd_paramlist,namelist,numx,numy,numz)
fx[iaX,iaY,iaZ] .= (fxp1[iaX,iaY,iaZ].-fxm1[iaX,iaY,iaZ]).*mhd_paramlist.dxt23 .- (fxp2[iaX,iaY,iaZ].-fxm2[iaX,iaY,iaZ]).*mhd_paramlist.dxt12
pux .= pux .+ fx .+ forc
while the boundary conditions are applied using eg
function bcy3sym!(foo::Array{Float32,3},numx::Int,numy::Int,numz::Int)
foo[:,2,:] .= foo[:,4,:]
foo[:,1,:] .= foo[:,5,:]
foo[:,numy-1,:] .= foo[:,numy-3,:]
foo[:,numy-0,:] .= foo[:,numy-4,:]
return
end
Yesterday based on earlier replies, I wrote up this piece of Julia code to see what is the fastest way to do the matrix multiplies (dot element or loops) and seems to me the dot element is the fastest?
using LinearAlgebra,Tullio, LoopVectorization, Einsum, Printf, BenchmarkTools
function A_dotmul_B!(C, A, B)
@turbo for n ∈ indices(C,2), m ∈ indices(C, 1)
C[m,n] = A[m,n] * B[m,n]
end
end
function matdot!(s, a, b)
@turbo for i ∈ eachindex(a)
s[i] = a[i] * b[i]
end
end
function eindot!(mat3,mat1,mat2)
@einsum mat3[i,j] = mat1[i,j] * mat2[i,j];
end
function mygemm!(C, A, B)
@turbo for m ∈ axes(A,1), n ∈ axes(A,2)
C[m,n] = A[m,n] * B[m,n]
end
end
########################################################################
nx = 9
nx = 91
nG = 2
mat1 = Array{Float64}(undef,nx,nx,nx);
mat2 = Array{Float64}(undef,nx,nx,nx);
mat3A = Array{Float64}(undef,nx,nx,nx);
mat3B = Array{Float64}(undef,nx,nx,nx);
iLen = nx - nG*2
iLen = nx
for i in 1:iLen, j in 1:iLen, k in 1:iLen
mat1[i,j,k] = 15*i*rand() + 11*j*rand() + 7*k*rand()
mat2[i,j,k] = 15*i*rand() + 11*j*rand() + 7*k*rand()
end
@time mat3A .= mat1 .* mat2
#########################
@time (for i in 1:nx, j in 1:nx, k in 1:nx
mat3B[i,j,k] = mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 1 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time (for i in 1:nx, j in 1:nx, k in 1:nx
mat3B[i,j,k] = @views mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 2 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time (for i in 1:nx, j in 1:nx, k in 1:nx
@views mat3B[i,j,k] = mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 3 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time A_dotmul_B!(mat3B, mat1, mat2)
@printf("vers 4 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time matdot!(mat3B, mat1, mat2)
@printf("vers 5 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time eindot!(mat3B, mat1, mat2);
@printf("vers 6 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
@time mygemm!(mat3B, mat1, mat2);
@printf("vers 7 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))
the output is
julia> include(“time_mult_arrays.jl”)
0.003848 seconds (2 allocations: 64 bytes)
0.230426 seconds (3.03 M allocations: 58.004 MiB, 6.34% gc time)
vers 1 : sum(mat3A.-mat3B) = 0.000000000000
0.621630 seconds (6.05 M allocations: 126.995 MiB, 1.91% gc time)
vers 2 : sum(mat3A.-mat3B) = 0.000000000000
0.598883 seconds (6.05 M allocations: 126.995 MiB, 4.16% gc time)
vers 3 : sum(mat3A.-mat3B) = 0.000000000000
0.058982 seconds (85.90 k allocations: 3.283 MiB, 99.89% compilation time)
vers 4 : sum(mat3A.-mat3B) = 0.000000000000
0.029835 seconds (34.93 k allocations: 1.478 MiB, 94.30% compilation time)
vers 5 : sum(mat3A.-mat3B) = 0.000000000000
0.022794 seconds (21.83 k allocations: 1.159 MiB, 99.82% compilation time)
vers 6 : sum(mat3A.-mat3B) = 0.000000000000
0.076860 seconds (85.56 k allocations: 3.271 MiB, 99.91% compilation time)
vers 7 : sum(mat3A.-mat3B) = 0.000000000000
So, I was wondering if there are other ways to speed up the code, or if I made mistakes in what I have coded up? Or perhaps I need to be looking elsewhere for the bottlenecks. By the way, I typically start my simulation code using “julia” at the Linux prompt, but trying eg “julia --math-mode=fast -O3 --check-bounds=no” makes hardly any difference to the run time