Speeding up multiplication, addition, and subtraction of 3D arrays

I’ve been working on “translating” my 3D finite-difference codes for (magneto)hydrodynamics, and am at the stage where, for at least one set of parameters, the codes give similar results (damped propagating waves). However, the run times are completely different. For a 91x91x91 grid, I have

32 mins : F77 code with reals
(old graduate school code! with do 10 i=1,nx; do 10 j=1,ny; do 10 k=1,nz … code block … 10 continue)
16 mins : F90 code (the above code “modernized” with implicit loops)
64 mins : Julia code with Float32; based on earlier tests, the Float64 version will be about 1.5-2 times slower than the Float32 version

So now I’m really curious about how to speed up these element-wise array multiplies, because 90% of the code is of the form

function update_xmom!(…)

forc[iaX,iaY,iaZ] .= -namelist.t0*ddx(deni,mhd_paramlist,namelist,numx,numy,numz)
fx[iaX,iaY,iaZ] .= (fxp1[iaX,iaY,iaZ].-fxm1[iaX,iaY,iaZ]).*mhd_paramlist.dxt23 .- (fxp2[iaX,iaY,iaZ].-fxm2[iaX,iaY,iaZ]).*mhd_paramlist.dxt12
pux .= pux .+ fx .+ forc

while the boundary conditions are applied using eg

function bcy3sym!(foo::Array{Float32,3},numx::Int,numy::Int,numz::Int)
foo[:,2,:] .= foo[:,4,:]
foo[:,1,:] .= foo[:,5,:]
foo[:,numy-1,:] .= foo[:,numy-3,:]
foo[:,numy-0,:] .= foo[:,numy-4,:]
return
end
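
(A side note on this function: each right-hand-side slice like foo[:,4,:] allocates a copy before the assignment. A sketch of the same function with @views, which avoids those temporaries, is below; the name bcy3sym_noalloc! is just a placeholder.)

@views function bcy3sym_noalloc!(foo::Array{Float32,3},numx::Int,numy::Int,numz::Int)
    # identical boundary copies, but the right-hand sides are now views,
    # so no temporary arrays are allocated
    foo[:,2,:] .= foo[:,4,:]
    foo[:,1,:] .= foo[:,5,:]
    foo[:,numy-1,:] .= foo[:,numy-3,:]
    foo[:,numy-0,:] .= foo[:,numy-4,:]
    return
end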

Yesterday, based on earlier replies, I wrote up this piece of Julia code to see what the fastest way is to do the element-wise multiplies (broadcast “dot” syntax or explicit loops), and it seems to me the broadcast version is the fastest?

using LinearAlgebra, Tullio, LoopVectorization, Einsum, Printf, BenchmarkTools

# NB: two-subscript indexing on a 3D array only touches the first nx-by-nx
# slice (the trailing index is linear over the remaining dimensions); the
# checksum below still comes out zero only because mat3B already holds the
# correct values from the earlier loop versions.
function A_dotmul_B!(C, A, B)
    @turbo for n ∈ indices(C,2), m ∈ indices(C,1)
        C[m,n] = A[m,n] * B[m,n]
    end
end

function matdot!(s, a, b)
    @turbo for i ∈ eachindex(a)
        s[i] = a[i] * b[i]
    end
end

function eindot!(mat3, mat1, mat2)
    # same two-subscript caveat as A_dotmul_B! above
    @einsum mat3[i,j] = mat1[i,j] * mat2[i,j]
end

function mygemm!(C, A, B)
    # despite the name, this is element-wise, not a GEMM; same two-subscript caveat
    @turbo for m ∈ axes(A,1), n ∈ axes(A,2)
        C[m,n] = A[m,n] * B[m,n]
    end
end


########################################################################

nx = 91   # (nx = 9 is handy for quick checks)
nG = 2
mat1 = Array{Float64}(undef,nx,nx,nx);
mat2 = Array{Float64}(undef,nx,nx,nx);
mat3A = Array{Float64}(undef,nx,nx,nx);
mat3B = Array{Float64}(undef,nx,nx,nx);

iLen = nx   # (nx - nG*2 would restrict the fill to the interior, excluding guard cells)
for i in 1:iLen, j in 1:iLen, k in 1:iLen
   mat1[i,j,k] = 15*i*rand() + 11*j*rand() + 7*k*rand()
   mat2[i,j,k] = 15*i*rand() + 11*j*rand() + 7*k*rand()
end

@time mat3A .= mat1 .* mat2

#########################

@time (for i in 1:nx, j in 1:nx, k in 1:nx
   mat3B[i,j,k] = mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 1 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))

@time (for i in 1:nx, j in 1:nx, k in 1:nx
   mat3B[i,j,k] = @views mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 2 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))

@time (for i in 1:nx, j in 1:nx, k in 1:nx
   @views mat3B[i,j,k] = mat1[i,j,k] * mat2[i,j,k]
end)
@printf("vers 3 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))

@time A_dotmul_B!(mat3B, mat1, mat2)
@printf("vers 4 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))

@time matdot!(mat3B, mat1, mat2)
@printf("vers 5 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))

@time eindot!(mat3B, mat1, mat2);
@printf("vers 6 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))

@time mygemm!(mat3B, mat1, mat2);
@printf("vers 7 : sum(mat3A.-mat3B) = %18.12f \n\n",sum(mat3A.-mat3B))

the output is

julia> include("time_mult_arrays.jl")
0.003848 seconds (2 allocations: 64 bytes)
0.230426 seconds (3.03 M allocations: 58.004 MiB, 6.34% gc time)
vers 1 : sum(mat3A.-mat3B) = 0.000000000000

0.621630 seconds (6.05 M allocations: 126.995 MiB, 1.91% gc time)
vers 2 : sum(mat3A.-mat3B) = 0.000000000000

0.598883 seconds (6.05 M allocations: 126.995 MiB, 4.16% gc time)
vers 3 : sum(mat3A.-mat3B) = 0.000000000000

0.058982 seconds (85.90 k allocations: 3.283 MiB, 99.89% compilation time)
vers 4 : sum(mat3A.-mat3B) = 0.000000000000

0.029835 seconds (34.93 k allocations: 1.478 MiB, 94.30% compilation time)
vers 5 : sum(mat3A.-mat3B) = 0.000000000000

0.022794 seconds (21.83 k allocations: 1.159 MiB, 99.82% compilation time)
vers 6 : sum(mat3A.-mat3B) = 0.000000000000

0.076860 seconds (85.56 k allocations: 3.271 MiB, 99.91% compilation time)
vers 7 : sum(mat3A.-mat3B) = 0.000000000000

So, I was wondering if there are other ways to speed up the code, or if I have made mistakes in what I coded up? Or perhaps I need to be looking elsewhere for the bottlenecks. By the way, I typically start my simulation code with “julia” at the Linux prompt, but trying e.g. “julia --math-mode=fast -O3 --check-bounds=no” makes hardly any difference to the run time.

If your code is stencil-based or relies mostly (or only) on point-wise operations, you could benefit from GPU acceleration. Checking some examples in ParallelStencil may be insightful.
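
For an element-wise update like yours, a ParallelStencil kernel would look roughly like the sketch below (based on the documented examples, not tested here; elmul! is a placeholder name, and swapping Threads for CUDA targets a GPU):

using ParallelStencil
@init_parallel_stencil(Threads, Float64, 3)

@parallel function elmul!(C, A, B)
    # element-wise product over the whole 3D array
    @all(C) = @all(A) * @all(B)
    return
end

# called as: @parallel elmul!(C, A, B)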

Have you seen Performance Tips · The Julia Language?

Among other things, for benchmarking you probably want to use BenchmarkTools instead of naive @time commands.

To optimize your program, you should start with a more high-level approach: Profiler · Julia in VS Code. (Although not documented on that page, the Julia VS Code extension also supports the allocation profiler; the interface is the same: @profview and @profview_allocs.)
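
For example, profiling one time step from the VS Code REPL would look something like this (the step! arguments are placeholders for your actual arrays):

@profview step!(den, pux, puy, puz, bx, by, bz)          # CPU flame graph
@profview_allocs step!(den, pux, puy, puz, bx, by, bz)   # allocation flame graph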

Thanks for the replies. I’ve worked through the Profiler and Performance Tips suggestions, and even with pre-allocating the 3D matrices, it seems to me that the broadcast (“dot”) way of multiplying
mat1 .= matA .* matB .- matC .* matD
is faster than writing loops. So in a nutshell, for 91x91x91 matrices, the Fortran code is about 3-4x faster than the Julia code, and much, much faster than Matlab.

Short of trying to learn parallelization, are there any additional speedups I could try, to make the Julia code as fast as the Fortran code?

I’d say 90% of the lines of code I have are generically of the form given above, namely
mat1 .= (matA .* matB) .* scalar1 .- (matC .* matD) .* scalar2
Some lines of code are of the form
mat1 .= (matA .* matB) .* functionOne!(matE) .* scalar1 .- (matC .* matD) .* functionTwo(matF) .* scalar2

I’ve written the following snippet of code to test the timings; I’d appreciate everyone’s suggestions.

########################################################################
using LinearAlgebra, Tullio, LoopVectorization, Einsum, Printf, BenchmarkTools, ProfileView

function A_dotmul_B_2!(C::Array{Float64,3}, A::Array{Float64,3}, B::Array{Float64,3})
    @turbo for o ∈ indices(A,3), n ∈ indices(A,2), m ∈ indices(A, 1)
      C[m,n,o] = A[m,n,o] * B[m,n,o]
    end
end

function matdot_2!(C::Array{Float64,3}, A::Array{Float64,3}, B::Array{Float64,3})
    @turbo for i ∈ eachindex(A)
        C[i] = A[i] * B[i]
    end
end

function another_A_dotmul_B_2!(C::Array{Float64,3}, A::Array{Float64,3}, B::Array{Float64,3})
    @turbo for o ∈ axes(A,3), n ∈ axes(A,2), m ∈ axes(A,1)
        C[m,n,o] = A[m,n,o] * B[m,n,o]
    end
end

function eindot_2!(C::Array{Float64,3}, A::Array{Float64,3}, B::Array{Float64,3})
  @einsum C[i,j,k] = A[i,j,k] * B[i,j,k]
end

########################################################################

nx = 91   # (nx = 9 is handy for quick checks)
nG = 2
mat1 = Array{Float64}(undef,nx,nx,nx);
mat2 = Array{Float64}(undef,nx,nx,nx);
mat3 = Array{Float64}(undef,nx,nx,nx);
mat4 = Array{Float64}(undef,nx,nx,nx);
mat5A = Array{Float64}(undef,nx,nx,nx);
mat5B = Array{Float64}(undef,nx,nx,nx);
mat5C = Array{Float64}(undef,nx,nx,nx);

for i in 1:nx, j in 1:nx, k in 1:nx
   mat1[i,j,k] = 15*i*rand() + 11*j*rand() + 7*k*rand()
   mat2[i,j,k] = 23*i*rand() + 19*j*rand() + 17*k*rand()
   mat3[i,j,k] = 07*i*rand() + 05*j*rand() + 03*k*rand()
   mat4[i,j,k] = 37*i*rand() + 31*j*rand() + 29*k*rand()
end


@time mat5B .= mat1 .* mat2;
@printf("did mat5B .= mat1 .* mat2 \n\n")

@time @turbo @. mat5B .= mat1 .* mat2;
@printf("did @turbo @. mat5B .= mat1 .* mat2 \n\n ")

@time mat5A .= mat1 .* mat2 .- mat3 .* mat4
@printf("did mat5A .= mat1 .* mat2 .- mat3 .* mat4 \n\n")

@time mat5A .= @views mat1 .* mat2 .- mat3 .* mat4
@printf("did mat5A .= @views mat1 .* mat2 .- mat3 .* mat4 \n\n")

@time mat5A .= @views (mat1[:,:,:] .* mat2[:,:,:] .- mat3[:,:,:] .* mat4[:,:,:])
@printf("did mat5A .= @views mat1[:,:,:] .* mat2[:,:,:] .- mat3[:,:,:] .* mat4[:,:,:] \n\n")


#########################

@time (for k in 1:nx, j in 1:nx, i in 1:nx
   mat5B[i,j,k] = mat1[i,j,k] * mat2[i,j,k] - mat3[i,j,k] * mat4[i,j,k]
end)
@printf("vers 1 : for loops,  mat5B[i,j,k] = mat1[i,j,k] * mat2[i,j,k] - mat3[i,j,k] * mat4[i,j,k] : sum(mat5A) = %12.9e sum(mat5A.-mat5B) = %12.9e \n\n",sum(mat5A),sum(mat5A.-mat5B))

@time (for k in 1:nx, j in 1:nx, i in 1:nx
   mat5B[i,j,k] = @views mat1[i,j,k] * mat2[i,j,k] - mat3[i,j,k] * mat4[i,j,k]
end)
@printf("vers 2 : flor loops, mat5B[i,j,k] = @views mat1[i,j,k] * mat2[i,j,k] - mat3[i,j,k] * mat4[i,j,k], sum(mat5A) = %12.9e sum(mat5A.-mat5B) = %12.9e \n\n",sum(mat5A),sum(mat5A.-mat5B))

@time (for k in 1:nx, j in 1:nx, i in 1:nx
   @views (mat5B[i,j,k] = mat1[i,j,k] * mat2[i,j,k] - mat3[i,j,k] * mat4[i,j,k])
end)
@printf("vers 3 : for loops  @views (mat5B[i,j,k] = mat1[i,j,k] * mat2[i,j,k] - mat3[i,j,k] * mat4[i,j,k]) : sum(mat5A) = %12.9e sum(mat5A.-mat5B) = %12.9e \n\n",sum(mat5A),sum(mat5A.-mat5B))

@time (A_dotmul_B_2!(mat5B, mat1, mat2); A_dotmul_B_2!(mat5C, mat3, mat4); mat5B .= mat5B .- mat5C)
@printf("vers 4 : AdotmulB : sum(mat5A) = %12.9e sum(mat5A.-mat5B) = %12.9e \n\n",sum(mat5A),sum(mat5A.-mat5B))

@time (matdot_2!(mat5B, mat1, mat2); matdot_2!(mat5C, mat3, mat4); mat5B .= mat5B .- mat5C)
@printf("vers 5 : matdot2 : sum(mat5A) = %12.9e sum(mat5A.-mat5B) = %12.9e \n\n",sum(mat5A),sum(mat5A.-mat5B))

@time (eindot_2!(mat5B, mat1, mat2); eindot_2!(mat5C, mat3, mat4); mat5B .= mat5B .- mat5C)
@printf("vers 6 : eindot : sum(mat5A) = %12.9e sum(mat5A.-mat5B) = %12.9e \n\n",sum(mat5A),sum(mat5A.-mat5B))

@time (another_A_dotmul_B_2!(mat5B, mat1, mat2); another_A_dotmul_B_2!(mat5C, mat3, mat4); mat5B .= mat5B .- mat5C)
@printf("vers 7 : another_A_dotmul_B : sum(mat5A) = %12.9e sum(mat5A.-mat5B) = %12.9e \n\n",sum(mat5A),sum(mat5A.-mat5B))

Have you tried using @btime? The functions are compiled the first time they are called.

The standard dot product calls a BLAS library that should be fast. Sometimes using MKL helps.

(Note that all the examples outside functions will be slow, because the matrices are global variables. With proper benchmarking, I think most of the attempts will perform similarly, and not slower than Fortran.)


For example:

julia> C, A, B = (rand(91,91,91) for _ in 1:3);

julia> function matdot_2!(C::Array{Float64,3}, A::Array{Float64,3}, B::Array{Float64,3})
           @turbo for i ∈ eachindex(A)
               C[i] = A[i] * B[i]
           end
       end
matdot_2! (generic function with 1 method)

julia> @btime matdot_2!($C,$A,$B);
  757.099 μs (0 allocations: 0 bytes)

julia> @btime $C .= $A .* $B;
  733.973 μs (0 allocations: 0 bytes)

(the interpolations, $, are required by the @btime macro, so that the global arrays are benchmarked as local, concretely typed values)

Yes, sorry, I forgot to mention I have tried @btime. The times are slightly different from @time, but in the end they all show that the first line (the .+, .*, .-) is faster than the looping functions I have written.

Can you explain what you mean by “examples outside functions are slow since the matrices are global”? I thought I had written my code so that I predeclare all the matrices beforehand.

In the code I’m actually comparing to F90, this also holds true (e.g.

density = ones(Float32,namelist.nx,namelist.ny,namelist.nz), 
pux = zeros(Float32,namelist.nx,namelist.ny,namelist.nz) 

etc).

Then I call a function which steps forward in time, by calling the relevant physics equations

for nt in 1:ntime
    step!(den, pux, puy, puz, bx, by, bz)
end

The function step! basically calls the relevant “forcings”

function step!(den::Array{Float32,3}, pux::Array{Float32,3}, ..., bz::Array{Float32,3})
    step_den!(den, pux, puy, puz, bx, by, bz)
    step_pux!(den, pux, puy, puz, bx, by, bz)
    ...
    step_bz!(den, pux, puy, puz, bx, by, bz)
end

where now each
function step_X!(den,pux,puy,puz,bx,by,bz)
contains the lines of code doing the dot multiplies/adds.

Does this mean I have made the matrices global? I am passing them in and out of the routines, and I thought that would mean they are not global.

Thanks

The matrices are global variables if you define them in the global scope (as you do, and as I do in my example) and you don’t pass them as parameters to functions. I’m not sure if you are doing that in your actual code or not.

In general, non-constant global variables will completely kill performance (but in this case it may not be the greatest issue).
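
A minimal illustration of the difference (x, f, and g are throwaway names):

using BenchmarkTools

x = rand(10^6)   # non-constant global: its binding (and type) can change at any time
f() = sum(x)     # the compiler cannot assume a type for x, so this is slow
g(v) = sum(v)    # v is a typed argument, so this compiles to fast code

@btime f()     # allocates, much slower
@btime g($x)   # 0 allocations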

From what you posted there, step_X! and similar functions are receiving all matrices as parameters, thus that should not be an issue.

In that case, you should not see any major difference in performance in your code relative to Fortran, for that operation.

Yes, all matrices come in as parameters. The only thing I can think of that is slightly different from what I have detailed above is this: since this is fluid dynamics, I have two guard channels (ghost cells) on every face, so as to take derivatives on the edges/faces and also to apply boundary conditions. So if the cube dimensions are nx, ny, nz and there are nG = 2 guard channels, I am really doing the following

(these are pre computed and also sent in as parameters)

iX   = nG+1:nx-nG
iXp1 = iX .+ 1
iXm1 = iX .- 1
iXp2 = iX .+ 2
iXm2 = iX .- 2

(and similarly for y- and z- directions)

matr1[iX,iY,iZ] .= A[iXp1,iY,iZ].*B[iXp1,iY,iZ].*scalar1  .- C[iXp2,iY,iZ].*D[iXp2,iY,iZ].*scalar2

or sometimes I have eg

matr2[iX,iY,iZ] .= diffusion(den) .*B[iXp1,iY,iZ].*scalar1

So each line of code inside step_den!, step_pux!, etc. subsets the core of the matrices as above, not the entire matrix. Do you think these are potential performance hitters?

I think you are already using @views for those, so that should be fixed. The thing that may be relevant is that you should try to have the largest dimension be the innermost (first) one. If, in your case, the first dimension turns out to be small, the performance will be greatly degraded, because the computations will be jumping from place to place in memory. But that would happen with Fortran as well, so that should not be the difference.
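
Concretely, Julia arrays are column-major, so the first index is contiguous in memory and should be the innermost loop (a sketch with placeholder names):

# good: i (the first index) varies fastest, matching column-major layout
for k in iZ, j in iY, i in iX
    out[i,j,k] = A[i,j,k] * B[i,j,k]
end

# bad: k varies fastest, so consecutive iterations stride through memory
for i in iX, j in iY, k in iZ
    out[i,j,k] = A[i,j,k] * B[i,j,k]
end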

Hmm, I have to admit I have botched @views. I tried them in the test code I cut and pasted above earlier today, but have not got round to trying them in this fluid dynamics code.

The reason is as follows: going back to one of your earlier replies, you mentioned you guessed iaX/iaXm1 etc. were indices/ranges. Then you suggested putting @views in front of the function definition, e.g.
@views function step_den!(den,p,B)

I remember being quite puzzled by that: will that change den? In other words, I thought I should use @views as follows:

function step_den!(den,p,B)
@views mat1[iaX,iaY,iaZ] .= (matA[iaX,iaY,iaZ] .* matB[iaX,iaY,iaZ]) .* scalar1 .- (matC[iaX,iaY,iaZ] .* matD[iaX,iaY,iaZ]).*scalar2

Sorry about this lapse, I meant to ask you about it.

There is no indexing in that last expression, so there’s nothing for @views to do.

This would be much easier if you can provide a self-contained example with runnable code and dummy input data (a so-called MWE).

Any chance the Fortran implementations are implicitly using multithreading?

julia> using LoopVectorization, BenchmarkTools

julia> function f1_dot!(c, a, b)
           @. c = a * b
       end
f1_dot! (generic function with 1 method)

julia> function f2_loop!(c, a, b)
           @inbounds for i in eachindex(a, b, c)
               c[i] = a[i] * b[i]
           end
       end
f2_loop! (generic function with 1 method)

julia> function f3_turbo!(c, a, b)
           @turbo for i in eachindex(a, b, c)
               c[i] = a[i] * b[i]
           end
       end
f3_turbo! (generic function with 1 method)

julia> function f4_turbo_threads!(c, a, b)
           @turbo thread=true for i in eachindex(a, b, c)
               c[i] = a[i] * b[i]
           end
       end
f4_turbo_threads! (generic function with 1 method)

julia> nx = 91
91

julia> mat1, mat2, mat3 = [Array{Float64}(undef, nx, nx, nx) for _=1:3];

julia> @btime f1_dot!($mat3, $mat1, $mat2);
  678.500 μs (0 allocations: 0 bytes)

julia> @btime f2_loop!($mat3, $mat1, $mat2);
  632.800 μs (0 allocations: 0 bytes)

julia> @btime f3_turbo!($mat3, $mat1, $mat2);
  653.700 μs (0 allocations: 0 bytes)

julia> @btime f4_turbo_threads!($mat3, $mat1, $mat2);
  121.400 μs (0 allocations: 0 bytes)
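
Note: for the thread=true version to actually use more than one core, Julia itself must be started with threads enabled, e.g.

$ julia -t auto

julia> Threads.nthreads()   # should report > 1, otherwise thread=true cannot help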

Oops, sorry about the typo; I’ve put in the indices now. It’s actually code that many researchers have worked on over the years, so I can’t just post it here without their permission. I’ll look into pulling out bits and pieces and providing an MWE.

This is what I mean:

julia> C, A, B = (rand(91,91,91) for _ in 1:3); range = 1:30;

julia> function f!(C,A,B,r)
           C[r,r,r] .= A[r,r,r] .* B[r,r,r]
       end
f! (generic function with 1 method)

julia> @btime f!($C, $A, $B, $range);
  67.778 μs (4 allocations: 422.00 KiB)

julia> @views function f!(C,A,B,r)
           C[r,r,r] .= A[r,r,r] .* B[r,r,r]
       end
f! (generic function with 1 method)

julia> @btime f!($C, $A, $B, $range);
  21.476 μs (0 allocations: 0 bytes)

and, please, follow the guidelines here: Please read: make it easier to help you

I’m reasonably confident the Fortran code is not doing threading, mostly because (a) I do not know parallel programming, so the code is standard if/endif and do/enddo loops! (b) I did not use any compile-time options to enable auto-parallelization, and (c) while the Fortran code runs, top -H shows only one thread running (using 99.9% of the CPU, as it does when I run the Julia code). These are the compiler options:

gfortran -c -Ofast -extend_source mhd.f

Having said all that, I realize the above is not a guarantee that gfortran did not implement threading (for what it’s worth, gfortran only auto-parallelizes loops when -ftree-parallelize-loops=n is given explicitly, and that is not among the flags above).

I’ll also confirm that I did “ps -T -p” on the job ID of a really long (F90) run, as well as top -H … both only show one thread running.