Matrix construction operations - increasing speed

performance

#1

Below I will copy in some code from a larger application; in this part of the code the primary goal is to set up several somewhat large matrices utilized for some subsequent finite element operations. There are a number of repmat,reshape, zeros() calls, for examples.

When executing this with @benchmark (detailed step list below), it runs about 20% slower than the original Matlab code - it also has a large % assigned to GC:

BenchmarkTools.Trial:
  memory estimate:  457.35 MiB
  allocs estimate:  114800
  --------------
  minimum time:     556.528 ms (67.65% GC)
  median time:      568.348 ms (66.76% GC)
  mean time:        578.302 ms (66.46% GC)
  maximum time:     642.323 ms (68.25% GC)
  --------------
  samples:          9
  evals/sample:     1

The questions are: what can I do to make it faster? I tried allocating arrays in advance, redoing functions to operate on variables in place (e.g., function name with !, getIxIy!() ).

Is it correct for me to assume that I am losing a lot of time in GC?

One issue is that the matrix stiffCoeffXX has complex values in for matrix elements within 10 locations from its edges, but the interior is entirely real - I don’t know have Matlab does this, but maybe it can do real arithmetic in the middle automatically?

Any ideas would be appreciated.

To run the code saved within a file named test.jl, I do:

include("test.jl")

@benchmark assembleMatrix(3,3,stiffCoeffXX,elementStiffMatrixXX)

(the latter two arguments are set up with in the program).

Here is the complete code in the file test.jl:

nX=120
nY=120

elementStiffMatrixXX = [0.361111 -0.406535 0.0593126 -0.0138889 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; -0.406535 0.694444 -0.347222 0.0593126 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0593126 -0.347222 0.694444 -0.406535 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; -0.0138889 0.0593126 -0.406535 0.361111 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 1.80556 -2.03267 0.296563 -0.0694444 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 -2.03267 3.47222 -1.73611 0.296563 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.296563 -1.73611 3.47222 -2.03267 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 -0.0694444 0.296563 -2.03267 1.80556 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.80556 -2.03267 0.296563 -0.0694444 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -2.03267 3.47222 -1.73611 0.296563 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.296563 -1.73611 3.47222 -2.03267 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.0694444 0.296563 -2.03267 1.80556 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.361111 -0.406535 0.0593126 -0.0138889; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.406535 0.694444 -0.347222 0.0593126; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0593126 -0.347222 0.694444 -0.406535; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.0138889 0.0593126 -0.406535 0.361111];

stiffCoeffXX = rand(nX,nY)+rand(nX,nY)*im



function getIxIy(basisOrder::Int64,nx0::Int64,ny0::Int64)
    localElementNode = getDOFInElement(basisOrder,nx0,ny0)::Array{Int64,2}

    totalDeg = (basisOrder+1)*(basisOrder+1)
    iX = repmat(localElementNode,totalDeg,1)

    iY = reshape( (repmat(localElementNode[:],1,totalDeg)')[:],totalDeg^2,nx0*ny0)

    return(iX,iY)
end


function getDOFInElement(basisOrder::Int64,nx::Int64,ny::Int64)::Array{Int64,2}
    nodes = zeros(Int64, (basisOrder+1)^2, nx*ny)
    nodeMatrix = reshape( 1:(nx*basisOrder+1)*(ny*basisOrder+1), nx*basisOrder+1, ny*basisOrder+1)

    for j in 1:ny
        for  i in 1:nx
            nodes[:,(j-1)*nx+i] = (nodeMatrix[ (j-1)*basisOrder+1:j*basisOrder+1, (i-1)*basisOrder+1:i*basisOrder+1 ]')[:]
        end
    end
    return(nodes)
end

function assembleMatrix(basisOrder::Int64,coeff::Array{Complex{Float64},2},
    localElementMatrix::Array{Float64,2})

    iX,iY = getIxIy(basisOrder,size(coeff)[1],size(coeff)[2])

    coeff = (coeff.')[:]
    A = sparse(iX[:], iY[:], ( (localElementMatrix[:])*( coeff.'))[:] )

    return(A)
end

#2

Hmmmm… after posting this I had the bright idea of trying the same thing in the latest Julia 0.7 I downloaded.

Got some interesting deprecation warnings - I tweaked the code to (try to) take care of those, and the speed up is impressive. Mainly deprecation of .’

I need to double check that I’m still getting the same answer, but the speed now seems to be about identical to Matlab.


#3

I think the larger issue is that you’re using a bunch of heap allocated arrays for the elements instead of stack-allocated (static arrays) which is why you’re hitting the GC so much and getting painfully slow results, as slow as MATLAB. If you fix that by using StaticArrays.jl in the right places then you should be able to substantially improve the performance. The place to change would be the spot so that

nodes[:,(j-1)*nx+i] = (nodeMatrix[ (j-1)*basisOrder+1:j*basisOrder+1, (i-1)*basisOrder+1:i*basisOrder+1 ]')[:]

isn’t allocating a bunch of temporaries.


#4

Thanks - I’m sure you are right - I know how I would do this in C to make things clean, but I didn’t know what to look for in Julia. As noted, I’d tried to deal with this in a few places by pre-allocating arrays as discussed in the Performance Tips documentation here but it didn’t help.

I’ll dig into StaticArrays.jl and see if I can figure out a way to apply that in the code line you note (and possibly elsewhere). I’ll report back on my progress.


#5

I have solved a similar problem in https://github.com/PetrKryslUCSD/FinEtools.jl by passing around matrix buffers. They are on the heap, but since there are only a few buffers, the memory usage is quite low. https://github.com/KristofferC/JuAFEM.jl addresses the memory issue by using static arrays. We did some timing tests with Kristoffer and we got similar speeds. For instance, my code assembles the conductivity matrix for two million linear triangles in about two seconds (Julia 0.7).


#6

I think that the larger issue is not heap- vs stack-allocated arrays, but the basic structure of your algorithm. Your Julia code, if I understand correctly, simply apes the original Matlab code, which does some incredibly contortionist vectorization to achieve acceptable performance.

For example, getDOFInElement would be much faster if you pre-allocate the output matrix, and then simply loop over every element (in column order) and calculate the correct value for each entry, according to whatever rule applies. Now, you are allocating a redundant nodeMatrix matrix, transposing, reshaping and slicing all over the place. This is typical of performant Matlab code, but still far suboptimal for Julia code.

I can suggest some band-aids that will, for example, speed up getDOFInElement by a factor of 2 or more, but I think you should instead re-write it completely, without the auxiliary nodeMatrix array.

I would really like to be more helpful, but I just find the code hard to fathom (typical for vectorized code Edit: I should say, typical of unnaturally vectorized code. Vectorized code can in some cases of course be both clear and fast), and I cannot grasp the meaning of the parameters basisOrder, nx and ny, and how they relate to the structure of the output matrix. Also, not all values of basisOrder, nx and ny are allowed:

julia> getDOFInElement(2,2,3)
ERROR: BoundsError: attempt to access 5×7 reshape(::UnitRange{Int64}, 5, 7) with eltype Int64 at index [5:7, 1:3]

suggesting that there can be some other, clearer way of parameterizing the function.

Without understanding your code at all, I will risk suggesting that the entire thing should be re-written from top to bottom. If you can do that without any use of auxiliary arrays, transposes, reshapes, repmats and slices, I think you will achieve dramatic speedups.

Sorry, that’s a lot of text with little concrete help, so I will just throw out this code, which gives me a +2x speed-up on ver0.7 for getDOFInElement, but which still is in the terrible vectorized style:

function getDOFInElement(basisOrder::Int64,nx::Int64,ny::Int64)::Array{Int64,2}
    nodes = Matrix{Int64}(undef, (basisOrder+1)^2, nx*ny)
    nodeMatrix = transpose(reshape( 1:(nx*basisOrder+1)*(ny*basisOrder+1), nx*basisOrder+1, ny*basisOrder+1))
    for j in 1:ny
        for  i in 1:nx
            nodes[:,(j-1)*nx+i] .= vec(@view nodeMatrix[(i-1)*basisOrder+1:i*basisOrder+1, (j-1)*basisOrder+1:j*basisOrder+1])
        end
    end
    return nodes
end

#7

If you know how to do this in C, you can probably do something similar in Julia. It’s the vectorization that is causing the problems here.


#8

OK, I’ve not had time to even get through the rest of your post, but you’ve helped get my day off to a fun start with your description of the original Matlab code and its “incredibly contortionist vectorization”! Thanks! :joy:

Essentially I’m taking some existing Matlab for a new algorithm, and I’m rewriting to learn Julia and the FEM algorithm itself - but also to get a more less contortionist (and maintainable) code.

Now back to reading the rest for solid ideas…


#9

I fully agree with your comments about the unnecessary complexity of the code - as noted in my previous reply, one of the motivations for porting the matlab stuff to Julia is to make something readable and maintainable. In fact, I did try to preallocate some of the output arrays but it didn’t help too much.

I had hoped, however, that there might be some steps that might be obvious to more experienced Julia coders that could avoid a full rewrite - but the latter seems to be the way to go (and, with the addition of some useful comments that are 100% missing in the Matlab code could be comprehensible as well as fast).

I should note too that one more motivation for porting to Julia is that if I can convince the original matlab coder that Julia is faster, more readable and free he will start to use it too :grin:


#10

Thanks for the ideas - can you comment more about what you mean by passing around matrix buffers? A simple one-liner example, perhaps, to show the idea?

I took a look at StaticArrays documentation, and it suggest that if array size > 100 elements then it is best to go with std Array - normally the array sizes will be 10s or 100s of thousands I may still give it a try.


#11

The full arrays, but not all of your temporaries?


#12

Code that is written in matlab style and then almost automatically transferred to Julia will in general not see much performance improvement.

However, it is possible to write quite expressive code in julia that has very good performance. As an example, the loop for assemble the global stiffness for the heat equation can is in JuAFEM.jl (snippet taken from https://nbviewer.jupyter.org/github/KristofferC/JuAFEM.jl/blob/gh-pages/latest/examples/generated/heat_equation.ipynb) written as:

function doassemble(cellvalues::CellScalarValues{dim}, K::SparseMatrixCSC, dh::DofHandler) where {dim}

    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)

    f = zeros(ndofs(dh))
    assembler = start_assemble(K, f)
    @inbounds for cell in CellIterator(dh)
        fill!(Ke, 0)
        fill!(fe, 0)
        reinit!(cellvalues, cell)
        for q_point in 1:getnquadpoints(cellvalues)
            dΩ = getdetJdV(cellvalues, q_point)
            for i in 1:n_basefuncs
                v  = shape_value(cellvalues, q_point, i)
                ∇v = shape_gradient(cellvalues, q_point, i)
                fe[i] += v * dΩ
                for j in 1:n_basefuncs
                    ∇u = shape_gradient(cellvalues, q_point, j)
                    Ke[i, j] += (∇v ⋅ ∇u) * dΩ
                end
            end
        end
        assemble!(assembler, celldofs(cell), fe, Ke)
    end
    return K, f
end

which is quite similar how you write it down theoretically:

  • Loop over elements (cells)
  • Loop over quadrature points
  • Loop over degrees of freedom and assemble element force vector.
  • Second loop over degrees of freedom to assemble element stiffness matrix
  • When done looping over quadrature points, assemble element force vector and stiffness matrix into global force vector and stiffness matrix.
  • When done looping over all elements, return global force and stiffness

Running this with the 120x120 grid is indeed quite fast and has low allocations.

julia> @time doassemble(cellvalues, K, dh);
  0.007465 seconds (20 allocations: 115.797 KiB)

#13

Good point - given other feedback, I’ll be diving into a full rewrite, but StaticArrays might help along the way.

This is a very cool example - the ability to have the code follow the notation and structure of the equations (e.g., (∇v ⋅ ∇u) * dΩ) is a big win for Julia. Thanks!


#14

So for instance (see the source):

function  buffers(self::FEMMHeatDiff, geom::NodalField{FFlt}, temp::NodalField{FFlt})
    ...
    Kedim = ndn*nne;      # dimension of the element matrix
    elmat = fill(zero(FFlt), Kedim, Kedim); # buffer
    ...
    return dofnums, loc, J, RmTJ, gradN, kappa_bargradNT, elmat, elvec, elvecfix
end

Then the code asks for the buffers

dofnums, loc, J, RmTJ, gradN, kappa_bargradNT, elmat = buffers(self, geom, temp)

which are then used in the loop over all the finite elements

for i = 1:count(fes) # Loop over elements
        fill!(elmat,  0.0); # Initialize element matrix
        ...
        for j=1:npts # Loop over quadrature points
            ....
            # Add the product gradN*kappa_bar*gradNT*(Jac*w[j])
            add_gkgt_ut_only!(elmat, gradN, (Jac*w[j]), kappa_bar, kappa_bargradNT)
        end
end

Notice that I have a little utility there to fill an existing matrix (the buffer elmat), well actually only the upper triangle is calculated– saves time since the matrix is known to be symmetric.

Concerning the static arrays: there is a fairly small number for the break-even point, something like a matrix of 11 x 11 is just about the right size. Anything bigger you might be better of with a heap-allocated array.