# Element-wise multiplication and addition/subtraction of 3D matrices

Hello, I’m new to Julia, with prior experience in Fortran and Matlab. I have a 3D magnetohydrodynamics code which does all the time stepping and spatial derivatives using 3D matrices, with the operations carried out element by element (dot multiplications, additions, and subtractions) on the coupled density matrices, 3D momentum matrices, and 3D magnetic field matrices.

So, for example, three lines of the code would be:

```julia
fxm2[iaX,iaY,iaZ] .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
fxm1[iaX,iaY,iaZ] .= vx[iaXm1,iaY,iaZ].*bbyi[iaXm1,iaY,iaZ] .- vy[iaXm1,iaY,iaZ].*bbxi[iaXm1,iaY,iaZ]
fxp1[iaX,iaY,iaZ] .= vx[iaXp1,iaY,iaZ].*bbyi[iaXp1,iaY,iaZ] .- vy[iaXp1,iaY,iaZ].*bbxi[iaXp1,iaY,iaZ]
```

Here all the matrices, e.g. `vx` and `bbyi`, have size (nx,ny,nz), with nx, ny, nz typically on the order of 30-100 (though I could go higher). The indices `iaX,iaY,iaZ` into the matrix select the “interior” points away from the boundaries, e.g. `iaX = nG+1:nx-nG`, where `nG = 2` is the number of guard cells.
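For concreteness, here is a minimal sketch (with made-up sizes) of how such index ranges can be set up in Julia; broadcasting a scalar shift over a `UnitRange` yields another range, so the shifted stencil indices stay cheap:

```julia
# Hypothetical grid sizes, mirroring the description above (nG = 2 guard cells)
nx, ny, nz = 31, 31, 31
nG = 2

# Interior points, away from the guard cells
iaX = nG+1:nx-nG
iaY = nG+1:ny-nG
iaZ = nG+1:nz-nG

# Shifted ranges used for the finite-difference stencils
iaXm2 = iaX .- 2
iaXm1 = iaX .- 1
iaXp1 = iaX .+ 1
iaXp2 = iaX .+ 2
```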

So I was wondering: what is the fastest way of doing this in Julia? I did a few tests comparing gfortran -O3 compiled code versus Matlab versus Julia. For T = 100 timesteps, repeated runs of the codes show the timings are roughly:

```
                 N=31      N=91
Matlab           11 sec    6 min
Julia (Float64)  8-10 sec  2 min
Julia (Float32)  7-9 sec   1.5 min
F90 (real)       1.5 sec   1 min
```

If you’re wondering what the translations entailed, it was literally getting those lines of code to work in the relevant programming language, e.g. for Julia:

```julia
fxm2[iaX,iaY,iaZ] .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
```

for Matlab:

```matlab
fxm2(iaX,iaY,iaZ) = vx(iaXm2,iaY,iaZ).*bbyi(iaXm2,iaY,iaZ) - vy(iaXm2,iaY,iaZ).*bbxi(iaXm2,iaY,iaZ)
```

for F90:

```fortran
fxm2(iaX,iaY,iaZ) = vx(iaXm2,iaY,iaZ)*bbyi(iaXm2,iaY,iaZ) - vy(iaXm2,iaY,iaZ)*bbxi(iaXm2,iaY,iaZ)
```

I have zero experience in parallel programming, so these codes run only on a single processor. I was wondering if there are any suggestions for speeding up the Julia code?

Thanks

We have conducted extensive tests comparing Fortran to Julia for a high-order DG scheme and came to the firm conclusion that there is no inherent performance advantage to using Fortran over Julia: as expected, they are on par when using an “efficient” implementation. However, one notable difference between the two languages is that you need to develop a new intuition for what is fast and what is not in Julia, on top of your existing knowledge of this for Fortran.

Thus, most likely in your example application you have stumbled into one of the performance traps that we ourselves have fallen into several times before:

• Avoid type instabilities in your innermost kernels by using function barriers
• Avoid unnecessary allocations, e.g. by avoiding type instabilities and by using views instead of slices when accessing arrays
• Mark your kernel loops with `@inbounds` or start Julia with `--check-bounds=no`
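To illustrate the first and third points, here is a minimal sketch (all names here are hypothetical, not from the original code): reading sizes out of an untyped `Dict` makes the caller type-unstable, but passing everything through a function call (a “function barrier”) lets the compiler specialize the hot loop, and `@inbounds` drops bounds checks inside it:

```julia
# Hypothetical kernel behind a function barrier: its arguments arrive
# concretely typed, so this loop compiles to fast specialized code even
# though the caller's variables were type-unstable.
function kernel!(f, vx, vy, bbx, bby, ia)
    @inbounds for k in ia, j in ia, i in ia
        f[i,j,k] = vx[i,j,k]*bby[i,j,k] - vy[i,j,k]*bbx[i,j,k]
    end
    return f
end

# An untyped container like this is a classic source of type instability
params = Dict{Symbol,Any}(:n => 8, :nG => 2)
n, nG = params[:n], params[:nG]

f = zeros(n, n, n)
vx, vy   = rand(n, n, n), rand(n, n, n)
bbx, bby = rand(n, n, n), rand(n, n, n)

# Type inference recovers at the call boundary: this is the function barrier
kernel!(f, vx, vy, bbx, bby, nG+1:n-nG)
```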

Some more performance tips can be found, e.g., in the docs: Performance Tips · The Julia Language

Other than that, it is hard to say what the root cause of your performance issues is. From the snippet you posted in your example above, nothing looks overly suspicious, but it is hard to say more without a complete MWE.


Slices in Julia allocate intermediate arrays. Add `@views` in front of your function, like this:

```julia
@views function f(x,y,i,j)
    x .= y[i:j]
end
```

(This is one of the performance tips above, and it is likely related to the slowness you are observing.)
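A quick way to see the effect (the function names here are made up for illustration) is to compare allocations for the same assignment with and without `@views`:

```julia
x = zeros(10)
y = rand(100)

# Plain slice on the right-hand side: y[1:10] materializes a temporary array
copy_slice!(x, y) = (x .= y[1:10])

# Same line under @views: the slice becomes a non-allocating view
@views view_slice!(x, y) = (x .= y[1:10])

copy_slice!(x, y); view_slice!(x, y)   # warm up (compilation itself allocates)

a_slice = @allocated copy_slice!(x, y)
a_view  = @allocated view_slice!(x, y)
```

On a recent Julia, `a_view` is typically 0 bytes, while `a_slice` pays for the temporary copy on every call.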


Have you checked GitHub - MHDFlows/MHDFlows.jl: Three Dimensional Magnetohydrodynamic(MHD) pseudospectral solvers written in julia with FourierFlows.jl? Also, please post real code examples for us to copy paste and show you some optimization techniques. As @sloede mentioned, Julia code should not be any slower than Fortran’s if written carefully.


Thank you, everyone! I restarted Julia with `--check-bounds=no`, but that made no difference. So I went over my code more carefully and found places (other than the snippet I showed) where I had matrix-times-scalar products without the `.*` dotted multiplication. Fixing that made a huge difference, and now the Float32 runtimes in Julia are essentially the same as those of the F90 code compiled with gfortran -O3.

If you don’t mind, I have a couple of questions about views vs. slices. At each time step, my code updates momentum, magnetic field, and density. Each function assigns to temporary variables on the LHS most of the way through, until the last line.

So, for example, the function `update_den!` would consist of many lines like the snippet of code I showed:

```julia
function update_den!(den,vx,vy,bbx,bby)
    fxm2[iaX,iaY,iaZ] .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
    factor1[iaX,iaY,iaZ] .= den[iaXp2,iaY,iaZ] .- den[iaXm1,iaY,iaZ] .- (den[iaXp1,iaY,iaZ] .- den[iaX,iaY,iaZ]).*3
    factor2[iaX,iaY,iaZ] .= -den[iaXp2,iaY,iaZ] .- 2 .* vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
    work[iaX,iaY,iaZm1] .= den[iaX,iaY,iaZm1] .+ fxm2[iaX,iaY,iaZ] .+ factor1[iaX,iaY,iaZ]
    den .= work
end
```

where for brevity I omitted a lot of code (and made up some of the lines). So, would you put `@view` or `@views` in front of any of these lines?

Also, notice that on the third line (which assigns a value to `factor2`), the first term to the right of the equals sign is `-den`; I was wondering, would writing it as

`-1 .* den[…]`

make a difference?

Finally, it seems to me that if the speeds are now very comparable to optimized F90, the next step in speed improvements (after thinking about views etc.) would be to parallelize the code. I understand Julia can do MPI, multithreading, or “distributed” computing, as well as `SharedArrays`. Since I have to take derivatives, I always need two ghost points along each face of my cuboid, so I was wondering if you have any suggestions for the best approach, given I have almost no experience in parallel computing.

Thanks!

For parallel computing, I highly recommend starting with multithreading. It is the best option for single machines in most circumstances, since shared memory means you don’t have to copy things around.

For code like this, you might want to try out `Tullio.jl`, which lets you write your code in Einstein notation and can do the parallelization for you. It generally produces really clean code that is very fast for typical tensor operations.
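As a sketch of the multithreading route (start Julia with e.g. `julia --threads=4`; all names below are hypothetical, not from the original code), one can parallelize the outermost loop of a stencil-style update so that each thread owns distinct `k` slices:

```julia
using Base.Threads

# Each thread writes to its own set of k-slices of `f`, so the threads
# never touch the same memory and no locking is needed.
function update!(f, vx, vy, bbx, bby, ia)
    @threads for k in ia
        @inbounds for j in ia, i in ia
            f[i,j,k] = vx[i,j,k]*bby[i,j,k] - vy[i,j,k]*bbx[i,j,k]
        end
    end
    return f
end

n, nG = 16, 2
f = zeros(n, n, n)
vx, vy   = rand(n, n, n), rand(n, n, n)
bbx, bby = rand(n, n, n), rand(n, n, n)
update!(f, vx, vy, bbx, bby, nG+1:n-nG)
```

With a single thread this runs serially and produces the same result, so it is a low-risk first step.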


No. Generally speaking, you only need views to avoid a copy when a slice operation (i.e., anything with a `:` or a range in the index expression) appears on the right side of an assignment. That is, here you need a view:

```julia
# This avoids a temporary copy of `b[:, i, j]`
@views a .= b[:, i, j]
```

You do not need a view if the slice is on the left side of the expression or no slices are involved:

```julia
# This does not create temporary copies
a[:, i, j] .= c
# This does not even involve a slice
a[v, i, j] = b[v, i, j]
```

No. Just go with the most natural and clean-looking implementation, which is `-den`.

Note, depending on whether you want to develop an MHD solver or just use one, there is a fast, high-order DG solver available for the MHD equations in Trixi.jl, including adaptive mesh support and shock capturing (full disclosure: I am one of the main authors of said package).


The `@.` macro is useful when you want to “dot” all functions and operators in an expression. Using it, there is no risk of accidentally missing some operation, as happened in your case.
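For example, with toy arrays:

```julia
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = similar(a)

# By hand, every single operator must carry a dot...
c .= a .* b .- 2 .* a

# ...whereas @. dots the entire expression at once, so nothing gets missed
@. c = a * b - 2a
```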


From the original post I had the impression that your `iaXp2` (etc.) indices were actually ranges given as inputs to your function. In that case, yes, using `@views` can help; you can add `@views` in front of your whole function (`@views function update_den!(...)`) to remove those intermediates.


Putting aside the issue of views etc., I should point out that this kind of “vectorized” style à la Matlab can be a pretty suboptimal way of implementing finite-difference algorithms like this. You are making multiple passes over the arrays, which is hard on the cache. Have you considered just writing a loop? Loops are fast in Julia, and they allow you to perform all your finite-difference operations in a single pass over the array(s), with minimal intermediate/temporary arrays. (And packages like LoopVectorization.jl can make loops faster still.)
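For instance, the three flux lines from the original post could be fused into one loop nest along these lines (a sketch; the `fluxes!` name is my invention, and the arrays are sized arbitrarily for the demo):

```julia
# Single-pass version of the three vectorized flux updates: all shifted
# reads (i-2, i-1, i+1) happen while the relevant cache lines are still
# hot, and no temporary arrays are created.
function fluxes!(fxm2, fxm1, fxp1, vx, vy, bbxi, bbyi, nG)
    nx, ny, nz = size(vx)
    @inbounds for k in nG+1:nz-nG, j in nG+1:ny-nG, i in nG+1:nx-nG
        fxm2[i,j,k] = vx[i-2,j,k]*bbyi[i-2,j,k] - vy[i-2,j,k]*bbxi[i-2,j,k]
        fxm1[i,j,k] = vx[i-1,j,k]*bbyi[i-1,j,k] - vy[i-1,j,k]*bbxi[i-1,j,k]
        fxp1[i,j,k] = vx[i+1,j,k]*bbyi[i+1,j,k] - vy[i+1,j,k]*bbxi[i+1,j,k]
    end
    return nothing
end

# Usage sketch with small arrays
nx, nG = 8, 2
vx, vy     = rand(nx,nx,nx), rand(nx,nx,nx)
bbxi, bbyi = rand(nx,nx,nx), rand(nx,nx,nx)
fxm2, fxm1, fxp1 = (zeros(nx,nx,nx) for _ in 1:3)
fluxes!(fxm2, fxm1, fxp1, vx, vy, bbxi, bbyi, nG)
```

Note the `k, j, i` loop order: Julia arrays are column-major, so the innermost loop runs over the first (fastest-varying) index.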


Hello everybody again,

Just to confirm, `iaX = nG+1:nx-nG` is indeed a range (and similarly for `iaY`, `iaZ`).

So, for this line (or a whole number of similar lines in each function), `@views` could help, or even actually going back to loops?

```julia
fxm2[iaX,iaY,iaZ] .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
```

Thanks for pointing me to the Trixi package, I’ll certainly take a look at it. I’m in the atmospheric physics field and want to change my slow/looping Matlab (retrieval and radiative transfer) code to faster Julia code, so I used the MHD code which I helped develop as a graduate student, as a starting point to learn Julia. And I’ve certainly already learned a lot from all the replies posted to my question.

Thanks again!

I was looking at some of the suggestions you all mentioned (views etc.) when I realized, much to my embarrassment, that the F90 code was running for more timesteps than the Julia code (I had modified the driver files during debugging and forgot to revert to the original number of timesteps).

So I decided that was an opportunity to try, e.g., the `@views` suggestion, which helped a lot (thanks!), and as of now the Julia code takes roughly double the run time of the F90 code.

However, I realized the Julia code is also suffering from some sort of instability (which the F90 code does not have), which I’m trying to nail down. One oddity I noticed is in some finite-difference derivative functions I have written. Focusing on the x direction, and simplifying what I use:

```julia
function ddx(foo::Array{Float64,3},nG::Int,nx::Int,ny::Int,nz::Int,dx23::Float64,dx12::Float64)

    iaX = nG+1:nx-nG
    iaY = nG+1:ny-nG
    iaZ = nG+1:nz-nG

    iaXp1 .= iaX .+ 1
    iaXp2 .= iaX .+ 2
    iaXm1 .= iaX .- 1
    iaXm2 .= iaX .- 2

    ## predeclare output to be zeros
    sddx = zeros(Float64,nx-(2*nG),ny-(2*nG),nz-(2*nG))

    ## do the derivatives
    sddx .= (foo[iaXp1,iaY,iaZ] .- foo[iaXm1,iaY,iaZ]).*dx23 .- (foo[iaXp2,iaY,iaZ] .- foo[iaXm2,iaY,iaZ]).*dx12
    return sddx
end
```

So the output matrix should be slightly smaller than the input `foo` 3D matrix, by nG elements on each side in every direction.

I call it using:

```julia
forc1 = zeros(Float64,nx,ny,nz)
forc1[iaX,iaY,iaZ] = ddx(density,nG,nx,ny,nz,dx23,dx12)
```

Shouldn’t this just fill the interior of the matrix (i.e. `iaX,iaY,iaZ`, where e.g. `iaX = nG+1:nx-nG`, and similarly for the y and z directions)? I’m asking because, even though I initialized the matrix with zeros, surprisingly `sum(forc1)` differs from `sum(forc1[iaX,iaY,iaZ])`.

Is there some issue with Julia summing the full matrix (including the zeros) vs. summing only over the interior points? I have written this example; running it a few times shows the differences are typically zero, but about 1/4 of the time the differences are (curiously) +/- 5.82076609e-11:

```julia
using Printf

########################################################################

function matrRand2(nx,nG)

    iLen = nx-nG*2

    matrXYZ = zeros(Float64,iLen,iLen,iLen)
    matrXYZ .= rand(iLen,iLen,iLen)

    return matrXYZ
end

########################################################################

function printdiff(M0,MX,nx,nG)

    iaRange = nG-1:nx-nG

    str = zeros(1,2);
    str1 = [sum(M0) sum(M0)]
    str2 = [sum(MX) sum(M0[iaRange,iaRange,iaRange])]
    str .= str1 .- str2

    @printf("The final answer = %15.8e %15.8e \n",str[1],str[2])
    return nothing
end

########################################################################
########################################################################

nx = 91;
nG = 2;

#########################

println("VERS X")

matr = zeros(Float64,nx,nx,nx)

matrX = zeros(Float64,nx-2*nG,nx-2*nG,nx-2*nG)
matrX = matrRand2(nx,nG)

matr[nG+1:nx-nG,nG+1:nx-nG,nG+1:nx-nG] = matrX;

printdiff(matr,matrX,nx,nG)

#########################

println("VERS Y")

matr = zeros(Float64,nx,nx,nx)

matrY = zeros(Float64,nx-2*nG,nx-2*nG,nx-2*nG)
matrY .= matrRand2(nx,nG)

matr[nG+1:nx-nG,nG+1:nx-nG,nG+1:nx-nG] = matrY;

printdiff(matr,matrY,nx,nG)
```

Thanks!

If you are calling this function repeatedly, you probably want to preallocate `matrXYZ` to avoid creating it every time; you can then write it like this:

```julia
function matRand2!(matrXYZ,nx,nG)
    iLen = nx - nG*2
    for k in 1:iLen, j in 1:iLen, i in 1:iLen
        matrXYZ[i,j,k] = rand()
    end
    return matrXYZ
end
```

and then allocate `matrXYZ` further up in the program and pass it in as a parameter.

You are updating `iaXp1` etc. on the left side here, but I cannot see where they are defined. They are not inputs to the function `ddx`. Are they global variables?

Also note that here you are creating two matrices (one with zeros, the other with random numbers). Without further complications, you just need:

```julia
matrXYZ = rand(iLen,iLen,iLen)
```

or, if `matrXYZ` was preallocated (but that only makes sense if you did it outside the function) use a loop to fill it with random values, as I mentioned above.

```julia
using Random  # rand! lives in the Random stdlib

matRand2!(matrXYZ) = rand!(matrXYZ)
# or
matRand2!(matrXYZ) = (matrXYZ .= rand.())
```

?

Preallocation inside the function has no benefit; in fact, it should be slower, because you are filling `sddx` with zeros that will just be overwritten later.

Either pre-allocate it outside, or just do this:

```julia
sddx = (foo[iaXp1,iaY,iaZ] .- foo[iaXm1,iaY,iaZ]).*dx23 .- (foo[iaXp2,iaY,iaZ] .- foo[iaXm2,iaY,iaZ]).*dx12
```

with `=`, not `.=`.
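Pulling the preallocation advice together, here is a hedged sketch of a fully in-place variant of the derivative routine (the `ddx!` name and signature are my invention); with `dx23 = 2/3` and `dx12 = 1/12` it reduces to the standard fourth-order centered first derivative on a unit grid:

```julia
# In-place fourth-order centered x-derivative: the caller preallocates the
# interior-sized output `sddx` and passes the interior ranges in explicitly,
# so this function allocates nothing.
function ddx!(sddx, foo, iaX, iaY, iaZ, dx23, dx12)
    i0, j0, k0 = first(iaX) - 1, first(iaY) - 1, first(iaZ) - 1
    @inbounds for k in iaZ, j in iaY, i in iaX
        sddx[i-i0, j-j0, k-k0] = (foo[i+1,j,k] - foo[i-1,j,k])*dx23 -
                                 (foo[i+2,j,k] - foo[i-2,j,k])*dx12
    end
    return sddx
end

# Usage sketch: the derivative of foo(x) = x should be 1 everywhere
nx, nG = 9, 2
foo = Float64[i for i in 1:nx, j in 1:nx, k in 1:nx]
ia = nG+1:nx-nG
sddx = zeros(length(ia), length(ia), length(ia))
ddx!(sddx, foo, ia, ia, ia, 2/3, 1/12)
```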

I think the range that gets randomized changes between calls. Thus it could be:

```julia
using Random  # for rand!

function matRand2!(matrXYZ,nx,nG)
    r = 1:(nx - nG*2)
    matrXYZ[r,r,r] .= rand.()
    # or:
    # rand!(@view(matrXYZ[r,r,r]))
    return matrXYZ
end
```

(but yes, those notations are nicer; the `rand.()` form is something I don’t have internalized as a possibility)

Yes, sorry; I cut and pasted the important part but left out the fact that `iaX`, `iaY`, `iaZ` are predefined and come in as arguments to the function.

Thanks for pointing this out. I preallocate the 3D matrices for the actual MHD code and then pass them in as part of the argument list, but for this example, since I was trying to isolate the `sum(matr)` vs. `sum(matr[iaX,iaY,iaZ])` question, I did not preallocate.