Element-wise multiplication and addition/subtraction of 3D arrays

Hello, I’m new to Julia, with prior experience in Fortran and Matlab. I have a 3D magnetohydrodynamics code that does all the time stepping and spatial derivatives on 3D arrays, with the operations carried out as element-by-element (dot) multiplications, additions, and subtractions of the coupled density arrays, 3D momentum arrays, and 3D magnetic field arrays.

So for example three lines in the code would be
fxm2[iaX,iaY,iaZ] .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
fxm1[iaX,iaY,iaZ] .= vx[iaXm1,iaY,iaZ].*bbyi[iaXm1,iaY,iaZ] .- vy[iaXm1,iaY,iaZ].*bbxi[iaXm1,iaY,iaZ]
fxp1[iaX,iaY,iaZ] .= vx[iaXp1,iaY,iaZ].*bbyi[iaXp1,iaY,iaZ] .- vy[iaXp1,iaY,iaZ].*bbxi[iaXp1,iaY,iaZ]

Here all the arrays, e.g. vx, bbyi, are of size (nx,ny,nz), with nx,ny,nz typically on the order of 30-100 (though I could go higher). The indices iaX,iaY,iaZ select the “interior” points away from the boundaries, e.g. iaX = nG+1:nx-nG, where nG = 2 is the number of guard (ghost) cells.
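
Concretely, the index setup looks roughly like this (sizes here are just an example):

    nG = 2                  # guard (ghost) cells on each face
    nx, ny, nz = 31, 31, 31
    iaX   = nG+1:nx-nG      # interior points in x
    iaXm2 = iaX .- 2        # shifted ranges used in the stencils
    iaXm1 = iaX .- 1
    iaXp1 = iaX .+ 1
    iaXp2 = iaX .+ 2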

So I was wondering: what is the fastest way of doing this in Julia? I did a few tests comparing gfortran -O3 compiled code, Matlab, and Julia. For T = 100 timesteps, repeated runs of the codes show the timings are roughly:

                    N=31        N=91
Matlab              11 sec      6 min
Julia (Float64)     8-10 sec    2 min
Julia (Float32)     7-9 sec     1.5 min
F90 (real)          1.5 sec     1 min

If you’re wondering what the translations entailed, it was literally getting those lines of code to work in the relevant programming language,
e.g. for Julia
fxm2[iaX,iaY,iaZ] .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
for Matlab
fxm2(iaX,iaY,iaZ) = vx(iaXm2,iaY,iaZ).*bbyi(iaXm2,iaY,iaZ) - vy(iaXm2,iaY,iaZ).*bbxi(iaXm2,iaY,iaZ)
for F90
fxm2(iaX,iaY,iaZ) = vx(iaXm2,iaY,iaZ)*bbyi(iaXm2,iaY,iaZ) - vy(iaXm2,iaY,iaZ)*bbxi(iaXm2,iaY,iaZ)

I have zero experience in parallel programming, so these codes run only on a single processor. I was wondering if there are any suggestions for speeding up the Julia code?

Thanks

We have conducted extensive tests comparing Fortran to Julia for a high-order DG scheme and came to the firm conclusion that there is no inherent performance advantage of Fortran over Julia - as expected, they are on par when using an “efficient” implementation. However, one notable difference between the two languages is that you need to develop a new intuition for what is fast and what is not in Julia, on top of your existing knowledge of this for Fortran.

Thus, most likely your example application has stumbled into one of the performance traps that we have fallen into several times before:

  • Avoid type instabilities in your innermost kernels by using function barriers (see the sketch after this list)
  • Avoid unnecessary allocations, e.g. by avoiding type instabilities and by using views instead of slices when accessing arrays
  • Mark your kernel loops with @inbounds or start Julia with --check-bounds=no
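
To make the first and third bullets concrete, here is a minimal sketch of a function barrier wrapping an @inbounds kernel; all names in it are illustrative and not from your code:

    # The element type stored in the Dict is not inferrable in `step!`, so the
    # hot loop lives in its own function that specializes on the concrete
    # array type it receives.
    function step!(state::Dict)
        u = state["u"]            # the compiler cannot infer the type of `u` here
        return kernel_sum(u)      # function barrier: kernel_sum specializes on typeof(u)
    end

    function kernel_sum(u::AbstractArray)
        s = zero(eltype(u))
        @inbounds for x in u      # bounds checks elided inside the hot loop
            s += x
        end
        return s
    end

    state = Dict{String,Any}("u" => rand(31, 31, 31))
    step!(state)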

Some more performance tips can be found, e.g., in the docs: Performance Tips · The Julia Language

Other than that, it is hard to say what the root cause of your performance issues is. From the snippet you posted above, nothing looks overly suspicious, but it is hard to tell without a more complete MWE.

8 Likes

Slices in Julia allocate intermediate arrays (copies). Add @views in front of your function, like this:

@views function f(x,y,i,j)
    x .= y[i:j]
end

(This is one of the performance tips above, and likely related to the slowness you are observing)

4 Likes

Have you checked GitHub - MHDFlows/MHDFlows.jl: Three Dimensional Magnetohydrodynamic(MHD) pseudospectral solvers written in julia with FourierFlows.jl? Also, please post real code examples for us to copy-paste, so we can show you some optimization techniques. As @sloede mentioned, Julia code should not be any slower than Fortran’s if written carefully.

1 Like

Thank you, everyone! I restarted Julia with --check-bounds=no, but that made no difference. So I went over my code more carefully and found places (other than the snippet I showed) where I had a matrix times a scalar and had not used .* for the multiplication … fixing that made a huge difference, and now the Float32 runtimes with Julia are essentially the same as the F90 code compiled with gfortran -O3.

If you don’t mind, I have a couple of questions about views vs slices. At each time step, my code updates the momentum, magnetic field and density. Each update function uses temporary variables on the LHS most of the way through, until the last line.

So, for example, the density update function would consist of many lines like the snippet of code I showed:

function update_den!(den,vx,vy,bbxi,bbyi)
    fxm2[iaX,iaY,iaZ]    .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
    factor1[iaX,iaY,iaZ] .= den[iaXp2,iaY,iaZ] .- den[iaXm1,iaY,iaZ] .- (den[iaXp1,iaY,iaZ] .- den[iaX,iaY,iaZ]) .* 3
    factor2[iaX,iaY,iaZ] .= -den[iaXp2,iaY,iaZ] .- 2 .* vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]
    work[iaX,iaY,iaZm1]  .= den[iaX,iaY,iaZm1] .+ fxm2[iaX,iaY,iaZ] .+ factor1[iaX,iaY,iaZ]
    den .= work
end

where for brevity I omitted a lot of code (and made up some fake lines there). So would you put @view or @views in front of any of these lines?

Also, if you notice, on the third line (which assigns a value to “factor2”) the first term to the right of the equals sign is -den; I was wondering, would
-1 .* den[…]
make a difference?

Finally, it seems to me that if the speeds are now very comparable to optimized F90, the next step in speed improvements (after thinking about views etc.) would be to parallelize the code. I understand Julia can do MPI, multithreading, or “distributed” computing, as well as SharedArrays. Since I have to take derivatives, I always need two ghost points along each face of my cuboid, so I was wondering if you have any suggestions on the best approach, given that I have almost no experience in parallel computing.

Thanks!

For parallel computing, I highly recommend starting with multithreading. It is the best option for single machines in most circumstances (since shared memory means you don’t have to copy things around).
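
For instance, a minimal sketch (illustrative names, not your MHD code): start Julia with julia -t auto or set JULIA_NUM_THREADS, then thread the outermost loop of a kernel:

    function axpy_threads!(y, a, x)
        Threads.@threads for i in eachindex(y, x)
            @inbounds y[i] += a * x[i]
        end
        return y
    end

    y = zeros(10^6); x = rand(10^6);
    axpy_threads!(y, 2.0, x)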

For code like this, you might want to try out Tullio.jl, which lets you write your code in Einstein notation and can do the parallelization for you. It generally produces really clean code that’s very fast for typical tensor operations.
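
For instance, one of your elementwise updates could look roughly like this with Tullio (the function name cross_flux! is made up here); if Julia is started with several threads, @tullio will multithread the fused loop for you:

    using Tullio

    function cross_flux!(f, vx, vy, bbxi, bbyi)
        @tullio f[i, j, k] = vx[i, j, k] * bbyi[i, j, k] - vy[i, j, k] * bbxi[i, j, k]
        return f
    end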

2 Likes

No. Generally speaking, you only need views to avoid a copy when you have a slice operation (i.e., anything with a : in the index expression) on the right side of an assignment. That is, here you need a view

# This avoids a temporary copy of `b[:, i, j]`
@views a .= b[:, i, j]

You do not need a view if the slice is on the left side of the expression or no slices are involved:

# This does not create temporary copies
a[:, i, j] .= c
# This does not even involve a slice (scalar indices only)
a[k, i, j] = b[k, i, j]

No. Just go with the most natural and clean-looking implementation, which is -den.

Note, depending on whether you want to develop an MHD solver or just use one, there is a fast, high-order DG solver available for the MHD equations in Trixi.jl, including adaptive mesh support and shock capturing (full disclosure: I am one of the main authors of said package).

1 Like

The @. macro is useful when you want to “dot” all functions and operators in an expression. Using it, there is no risk of accidentally missing some operation, as happened in your case.
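
For example (toy arrays, not your code):

    x = rand(31, 31, 31); y = similar(x)
    y .= 2 .* x .+ sin.(x)   # explicit dots everywhere
    @. y = 2 * x + sin(x)    # same thing; @. inserts all the dots for you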

2 Likes

From the original post I had the impression that your iaXp2 (etc.) indices were actually ranges given as inputs to your function. In that case, yes, using @views can help. If so, you can add @views in front of your whole function, @views function update_den!(...), to remove those intermediates.
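
Roughly like this (argument names follow your snippets; the function name flux_x! is made up):

    # With @views in front of the whole function, every slice in its body,
    # including range indices like iaXm2, becomes a view instead of a copy.
    @views function flux_x!(fxm2, vx, vy, bbxi, bbyi, iaX, iaY, iaZ, iaXm2)
        fxm2[iaX, iaY, iaZ] .= vx[iaXm2, iaY, iaZ] .* bbyi[iaXm2, iaY, iaZ] .-
                               vy[iaXm2, iaY, iaZ] .* bbxi[iaXm2, iaY, iaZ]
        return fxm2
    end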

2 Likes

Putting aside the issue of views etcetera, I should point out that this kind of “vectorized” style à la Matlab can be a pretty suboptimal way of implementing finite-difference algorithms like this. You are making multiple passes over the arrays, which is hard on the cache, etc. Have you considered just writing a loop? Loops are fast in Julia, and they allow you to perform all your finite-difference operations in a single pass over the array(s), with minimal intermediate/temporary arrays. (And packages like LoopVectorization.jl can make loops faster still.)
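
For instance, one of your flux lines could become a single loop roughly like this (recall iaXm2 = iaX .- 2, so the shift becomes an explicit i-2; the function name is made up):

    function flux_x_loop!(fxm2, vx, vy, bbxi, bbyi, nG, nx, ny, nz)
        # innermost loop over i, the first index, to match Julia's column-major layout;
        # @inbounds assumes nG >= 2 so all shifted accesses stay in range
        @inbounds for k in nG+1:nz-nG, j in nG+1:ny-nG, i in nG+1:nx-nG
            fxm2[i, j, k] = vx[i-2, j, k] * bbyi[i-2, j, k] -
                            vy[i-2, j, k] * bbxi[i-2, j, k]
        end
        return fxm2
    end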

1 Like

Hello everybody again,

Just to confirm, iaX = nG+1:nx-nG is indeed a range (and similarly for iaY,iaZ).

So for this line (or a whole number of similar lines in each function), @views could help, or even actually going back to loops?
fxm2[iaX,iaY,iaZ] .= vx[iaXm2,iaY,iaZ].*bbyi[iaXm2,iaY,iaZ] .- vy[iaXm2,iaY,iaZ].*bbxi[iaXm2,iaY,iaZ]

Thanks for pointing me to the Trixi package, I’ll certainly take a look at it. I’m in the atmospheric physics field and want to change my slow/looping Matlab (retrieval and radiative transfer) code to faster Julia code, so I used the MHD code which I helped develop as a graduate student, as a starting point to learn Julia. And I’ve certainly already learned a lot from all the replies posted to my question.

Thanks again!

I was looking at some of the suggestions mentioned by all of you (views etc.) when I realized, much to my embarrassment, that the F90 code was running for more timesteps than the Julia code (I had modified the driver files during debugging and forgot to revert to the original number of timesteps).

So I decided that this was an opportunity to try, e.g., @views, which helped a lot (thanks!), and as of now the Julia code runs in roughly double the time of the F90 code.

However, I realized the Julia code is also suffering from some sort of instability (which the F90 code does not have), which I’m trying to nail down. One oddity I noticed is in some finite-difference derivative functions I have written. Focusing on the x direction, and simplifying what I use:

function ddx(foo::Array{Float64,3},nG::Int,nx::Int,ny::Int,nz::Int,dx23::Float64,dx12::Float64)

    iaX = nG+1:nx-nG
    iaY = nG+1:ny-nG
    iaZ = nG+1:nz-nG

    iaXp1 .= iaX .+ 1
    iaXp2 .= iaX .+ 2
    iaXm1 .= iaX .- 1
    iaXm2 .= iaX .- 2

    ## predeclare the output to be zeros
    sddx = zeros(Float64,nx-(2*nG),ny-(2*nG),nz-(2*nG))

    ## do the derivatives in x
    sddx .= (foo[iaXp1,iaY,iaZ] .- foo[iaXm1,iaY,iaZ]).*dx23 .- (foo[iaXp2,iaY,iaZ] .- foo[iaXm2,iaY,iaZ]).*dx12

    return sddx
end

So the output matrix should be slightly smaller than the input foo 3D matrix, by 2*nG elements (nG on each side) in every direction.

I call it using
forc1 = zeros(Float64,nx,ny,nz)
forc1[iaX,iaY,iaZ] = ddx(density,nG,nx,ny,nz,dx23,dx12)

Shouldn’t this just be filling the interior of the matrix (i.e. iaX,iaY,iaZ where e.g. iaX = nG+1:nx-nG, and similarly for the y and z directions)? I’m asking because even though I initialized the matrix with zeros, surprisingly sum(forc1) is different from sum(forc1[iaX,iaY,iaZ]).

Is there some issue with Julia doing the sum over the full matrix (including the zeros) vs just doing the sum over the interior points? I have written the example below; running it a few times shows the differences are typically zero, but about 1/4 of the time the differences are (curiously) +/- 5.82076609e-11.

using Printf

########################################################################

function matrRand2(nx,nG)

iLen = nx-nG*2

matrXYZ = zeros(Float64,iLen,iLen,iLen)
matrXYZ .= rand(iLen,iLen,iLen)

return matrXYZ
end

########################################################################

function printdiff(M0,MX,nx,nG)

iaRange = nG-1:nx-nG

str = zeros(1,2);
str1 = [sum(M0) sum(M0)]
str2 = [sum(MX) sum(M0[iaRange,iaRange,iaRange])]
str .= str1 .- str2

@printf("The final answer = %15.8e %15.8e \n",str[1],str[2])
return nothing
end

########################################################################
########################################################################

nx = 91;
nG = 2;

#########################

println("VERS X")

matr = zeros(Float64,nx,nx,nx)

matrX = zeros(Float64,nx-2*nG,nx-2*nG,nx-2*nG)
matrX = matrRand2(nx,nG)

matr[nG+1:nx-nG,nG+1:nx-nG,nG+1:nx-nG] = matrX;

printdiff(matr,matrX,nx,nG)

#########################
println("VERS Y")
matr = zeros(Float64,nx,nx,nx)

matrY = zeros(Float64,nx-2*nG,nx-2*nG,nx-2*nG)
matrY .= matrRand2(nx,nG)

matr[nG+1:nx-nG,nG+1:nx-nG,nG+1:nx-nG] = matrY;

printdiff(matr,matrY,nx,nG)

Thanks!

If you are calling this function repeatedly, you probably want to preallocate matrXYZ to avoid having to create it every time, and then you can write it like this:

function matRand2!(matrXYZ,nx,nG)
    iLen = nx - nG*2
    for i in 1:iLen, j in 1:iLen, k in 1:iLen
        matrXYZ[i,j,k] = rand()
    end
    return matrXYZ
end

and then you allocate matrXYZ higher up in the program and pass it in as an argument.

But, please, read the forum’s guide on how to ask questions before posting, because otherwise it is hard to help.

You are updating iaXp1 etc. on the left-hand sides here, but I cannot see where they are defined. They are not inputs to the function ddx. Are they global variables?

Also note that here you are creating two matrices (one with zeros, the other with random numbers). Without further complications, you just need this:

matrXYZ = rand(iLen,iLen,iLen)

or, if matrXYZ was preallocated (but that only makes sense if you did it outside the function), use a loop to fill it with random values, as I mentioned above.

What about just

matRand2!(matrXYZ) = rand!(matrXYZ)   # rand! comes from the Random stdlib (using Random)
# or
matRand2!(matrXYZ) = (matrXYZ .= rand.())

?

Preallocation inside the function has no benefit; in fact it should be slower, because you are filling sddx with zeros that will just be overwritten later.

Either pre-allocate it outside, or just do this

sddx = (foo[iaXp1,iaY,iaZ].- foo[iaXm1,iaY,iaZ]).*dx23 .-  (foo[iaXp2,iaY,iaZ] .-foo[iaXm2,iaY,iaZ]).*dx12

= not .=
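
A toy illustration of the difference:

    a = zeros(3)       # preallocate
    a .= 1:3           # .= fills the existing array in place
    b = collect(1:3)   # plain = just binds `b` to a newly created array,
                       # so any zeros(...) before it would be wasted work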

I think the ranges that are being randomized change. Thus it could be

function matRand2!(matrXYZ,nx,nG)
    r = 1:(nx - nG*2)
    matrXYZ[r,r,r] .= rand.()
    # or
    # rand!(@view(matrXYZ[r,r,r]))
    return matrXYZ
end

(but yes, those notations are nicer - the rand.() is something I don't have internalized as a possibility)

Yes, sorry, I cut and pasted the important part but left out the fact that iaX, iaY, iaZ are predefined and come in as arguments to the function.

Thanks for pointing this out. I preallocate the 3D matrices for the actual MHD code and then pass them in as part of the argument list, but for this example, since I was trying to isolate the sum(matr) vs sum(matr[iaX,iaY,iaZ]) issue, I did not pre-allocate.