Parallel Assembly of Matrices in FEM

Hi. I am writing a CFD solver using the finite element method. I am running into issues when assembling the sparse matrices (especially with regard to performance). I construct the sparse matrices in CSC form. The function below works fine, but it is too slow for the performance I am aiming for.
I tried multithreading: I created rows, cols and values arrays for each thread, then concatenated the results from all the threads and used those arrays to construct the CSC matrices.
There is some performance improvement, but not as much as I expected (a speedup of around 3x when using 8 cores).

function assembling_matrices(ndim,total_nodes_domain,element_data_array, fixed_u_nodes, fixed_v_nodes, fixed_w_nodes, fixed_p_nodes)

M = spzeros(ndim*total_nodes_domain,ndim*total_nodes_domain)
K1 = spzeros(ndim*total_nodes_domain,ndim*total_nodes_domain)
K2 = spzeros(ndim*total_nodes_domain,ndim*total_nodes_domain) 
HV = spzeros(total_nodes_domain,ndim*total_nodes_domain) 
GG = spzeros(ndim*total_nodes_domain,total_nodes_domain) 
G = spzeros(ndim*total_nodes_domain,total_nodes_domain) 

function global_stiffness_matrices(k, K_size, rows, cols, values, DOFs, NEN)
    # Determine ranges for rows and columns based on K_size
    row_range = K_size[1] == ndim * total_nodes_domain ? ndim * NEN : NEN
    col_range = K_size[2] == ndim * total_nodes_domain ? ndim * NEN : NEN

    for i in 1:row_range
        row_index = DOFs[i]
        for j in 1:col_range
            col_index = DOFs[j]
            push!(rows, row_index)
            push!(cols, col_index)
            push!(values, k[i, j])
        end
    end
    return rows, cols, values
end

rows_m = [Vector{Int64}() for _ in 1:Threads.nthreads()]
rows_hv = [Vector{Int64}() for _ in 1:Threads.nthreads()]

# The same thread-local rows/cols/values vectors exist for mca, k1, k2, gg and g (not included here).

cols_m = [Vector{Int64}() for _ in 1:Threads.nthreads()]
cols_hv = [Vector{Int64}() for _ in 1:Threads.nthreads()]

values_m = [Vector{Float64}() for _ in 1:Threads.nthreads()]
values_hv = [Vector{Float64}() for _ in 1:Threads.nthreads()]


Threads.@threads for element in element_data_array 
    tid = Threads.threadid()
    NEN = element.NEN

    # This includes the effect of periodic nodes. 
    DOFs = element.DOFs_assembly

    m = element.m
    hv = element.hv 
    k1 = element.k1 
    k2 = element.k2 
    gg = element.gg 
    g = element.g

    rows_m[tid], cols_m[tid], values_m[tid] =       global_stiffness_matrices(m, (size(M,1), size(M,2)), rows_m[tid], cols_m[tid], values_m[tid], DOFs, NEN)
    rows_hv[tid], cols_hv[tid], values_hv[tid] =    global_stiffness_matrices(hv, (size(HV,1), size(HV,2)), rows_hv[tid], cols_hv[tid], values_hv[tid], DOFs, NEN)
    rows_k1[tid], cols_k1[tid], values_k1[tid] =    global_stiffness_matrices(k1, (size(K1,1), size(K1,2)), rows_k1[tid], cols_k1[tid], values_k1[tid], DOFs, NEN)
    rows_k2[tid], cols_k2[tid], values_k2[tid] =    global_stiffness_matrices(k2, (size(K2,1), size(K2,2)), rows_k2[tid], cols_k2[tid], values_k2[tid], DOFs, NEN)
    rows_gg[tid], cols_gg[tid], values_gg[tid] =    global_stiffness_matrices(gg, (size(GG,1), size(GG,2)), rows_gg[tid], cols_gg[tid], values_gg[tid], DOFs, NEN)
    rows_g[tid], cols_g[tid], values_g[tid] =       global_stiffness_matrices(g, (size(G,1), size(G,2)), rows_g[tid], cols_g[tid], values_g[tid], DOFs, NEN)
end 

array_5 = [rows_m, rows_mca, rows_hv, rows_k1, rows_k2, rows_gg, rows_g]
array_6 = [cols_m, cols_mca, cols_hv, cols_k1, cols_k2, cols_gg, cols_g]
array_7 = [values_m, values_mca, values_hv, values_k1, values_k2, values_gg, values_g]

function concatenate_vectors(array_5, array_6, array_7)
    results_row = Vector{Vector{Int64}}(undef, 11)
    Threads.@threads for index in 1:length(array_5)
        results_row[index] = vcat(array_5[index]...)
    end 

    results_col = Vector{Vector{Int64}}(undef, 11)
    Threads.@threads for index in 1:length(array_6)
        results_col[index] = vcat(array_6[index]...)
    end 

    results_val = Vector{Vector{Float64}}(undef, 11)
    Threads.@threads for index in 1:length(array_7)
        results_val[index] = vcat(array_7[index]...)
    end 
    return results_row, results_col, results_val
end 

# Concatenate the individual rows, cols and values array. 
rows_mat, cols_mat, values_mat = concatenate_vectors(array_5, array_6, array_7)

# Now, construct the sparse CSC matrix. 
main_array = [M, MCA, HV, K1, K2, GG, G]
Threads.@threads for index in 1:length(main_array)
    row = rows_mat[index]
    col = cols_mat[index]
    val = values_mat[index]
    size_1 = size(main_array[index],1)
    size_2 = size(main_array[index],2)
    main_array[index] = sparse(row, col, val, size_1 , size_2)
end
M, MCA, HV, K1, K2, GG, G = main_array

return M,MCA,HV,K1,K2,GG,G

end #end of function.

There are actually 11 matrices like this; I have only included a few of them, and the sizes of the matrices are not all the same. Hence, I had to create variables like row_range and col_range.
element_data_array is an array of structs; each struct contains all the information for an element.
Right now, for total_nodes_domain = 23,000, it takes about 1.9 seconds to assemble all the matrices, which is still a lot for me since I aim to solve problems with at least 200,000 nodes.
In addition, there are significant allocations (5.950 GiB), which I also aim to reduce since my memory limit is 188 GB.
I tried my best to optimize this further but could not manage it. It would be of great help if you could assist me in making this code better.
Thank you for your suggestions. I apologize if this information is not enough; if any extra information is needed, please let me know.

Hi @priyanshuFSI ,

The first thing you probably want to take a look at is this post. Essentially, Julia threads are OS threads (typically one per core), and Julia schedules multiple tasks onto them concurrently. Multiple tasks may therefore share a threadid(), and since tasks can yield (and migrate between threads) at any time, indexing buffers by threadid() gives you no guaranteed behavior. Anyway, it's not totally essential to understand the threading model; what's important is to make sure your code is safe and deterministic. One simple fix that I used in my own package is ChunkSplitters.jl. This ensures that each task will be working on truly local, private data.
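
In case it helps, here is a rough sketch of the pattern (the index splitting is written out by hand here, which is what ChunkSplitters.jl automates for you; the names and the toy element "contribution" are placeholders, not your actual code):

using SparseArrays

function assemble_parallel(nelem::Int, n::Int; nchunks::Int = Threads.nthreads())
    # Split 1:nelem into contiguous chunks, one chunk per task.
    bounds = round.(Int, range(0, nelem; length = nchunks + 1))
    tasks = map(1:nchunks) do c
        Threads.@spawn begin
            # Buffers are created *inside* the task, so they are private to it:
            # no threadid() indexing, no sharing between tasks.
            rows = Int[]; cols = Int[]; vals = Float64[]
            for e in (bounds[c] + 1):bounds[c + 1]
                # Toy "element contribution"; replace with the real element loop.
                i = mod1(e, n); j = mod1(e + 1, n)
                push!(rows, i); push!(cols, j); push!(vals, 1.0)
            end
            (rows, cols, vals)
        end
    end
    results = fetch.(tasks)                  # deterministic order: chunk 1, 2, ...
    rows = reduce(vcat, first.(results))
    cols = reduce(vcat, getindex.(results, 2))
    vals = reduce(vcat, last.(results))
    return sparse(rows, cols, vals, n, n)    # sparse() sums duplicate entries
end

Because each chunk is a fixed, contiguous block of elements and the results are fetched in chunk order, the assembled matrix is the same regardless of how the tasks get scheduled.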

For your performance issues, it would be great if you could provide a runnable example so I can see where the hot spots are… but more generally, is there a reason you are writing your own custom FEM package? There are already several very good packages in Julia, like Gridap.jl and Ferrite.jl. I know for sure Gridap.jl can handle problems on the order of 200,000 nodes quite easily.


Thank you. I will look into that package. For now, for some reason, my code doesn't scale well with the number of cores used, even for easily parallelizable problems (like the construction of the element stiffness matrices). I get a performance improvement of about 4-5x when using 8 cores.

By runnable example, do you mean the whole code that you can run? It might be too large (it also involves importing the mesh from GMSH and so on). Basically, it solves FEM problems for fluids. The other parts of the code work fine; the problem is the assembly (which I posted above), which is a bit slow for my liking.

I was aware of those packages. I am not developing my own package, but I need a code to solve my own problems (FSI problems), for which I need the ALE formulation. I don't think ALE is available in Gridap; that is why.

Perhaps this will help? GitHub - PetrKryslUCSD/FinEtoolsMultithreading.jl: Parallel finite element library


I get a performance improvement of about 4-5x when using 8 cores.

A 4-5x speedup is actually quite reasonable depending on the problem size and other parameters. Note that in @PetrKryslUCSD’s paper he is getting a parallel efficiency between 0.6 and 0.8 for 8 cores for most of his kernels (assuming I am interpreting e.g. Figure 18 correctly). Since speedup is just parallel efficiency times core count, this corresponds to a speedup between 4.8x and 6.4x, which is very good!

By runnable example, do you mean the whole code that you can run?

No, ideally just a toy example with all the variables necessary to execute the function you provided, e.g. element_data_array, rows_k1, rows_k2, etc. By the way, it’s a bit of a red flag to me that some of the variables here are neither declared inside this function nor passed as arguments to it. Are these global variables? This might be part of your performance/allocation issues.

EDIT: I’m not super familiar with ALE, but it looks like @oriol.colomes has a project to solve FSI problems with Gridap.

Thank you. That means I should be content with a performance improvement of 4-5x on 8 cores. Another thing: once I go beyond 16 cores, the performance does not improve at all (even for large problems; I have tried problems with around a million degrees of freedom). Initially, I was planning to move to multiprocessing, but I didn't see the advantage since the performance was the same beyond 16 cores.

I am sorry, but I am unsure how I should upload these variables. For example, element_data_array is an array of structs and each struct has around 30 fields, so it is quite a large array. I can upload it in whatever format you prefer.

Do you mean declaring the types of the arguments? I had checked that the function is type-stable; I thought that if it is type-stable, it is okay not to declare the types (please correct me if I am wrong).

About global variables: for example, I have one function (in one file) from which I get fixed_u_nodes, and then I use that variable in another function. Does that make it a global variable? (Sorry if a lot of these queries are too basic.)

What machine were you using? If you are using a dual-socket (16×16) machine, then NUMA effects can easily outweigh the benefit of adding more threads. For really large-scale problems, you should seriously consider distributed-memory parallelism à la MPI.

Can you serialize all the necessary inputs for benchmarking using, e.g., JLD2? I believe you can attach files to a comment here as long as the file size is reasonable.
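
Something along these lines should work (the file name and the variable list are just examples, and note that JLD2 needs your element struct definitions to be available when the file is read back):

using JLD2

# Save whatever the assembly function needs; keyword names become the keys in the file.
jldsave("assembly_inputs.jld2"; element_data_array, ndim, total_nodes_domain)

# Reading it back later (the struct definitions for the elements must be defined first):
@load "assembly_inputs.jld2" element_data_array ndim total_nodes_domain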

What I mean by a global variable is any (non-const) variable defined outside the scope of any function. See the relevant docs.

The compiler cannot reason about the types of non-const global variables, and thus cannot perform optimizations. It seems that in the function you provided, a variable such as values_k1 is either a global or the function is nested inside yet another function.
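
As a toy illustration (nothing to do with your actual variables), compare:

# A non-const global: its type can change at any time, so the compiler
# must treat every access to it as type `Any`.
coeffs = rand(1000)

function sum_global()
    s = 0.0
    for c in coeffs        # type-unstable access to the global
        s += c
    end
    return s
end

# Passing the data as an argument lets the compiler specialize on its type.
function sum_arg(x)
    s = 0.0
    for c in x
        s += c
    end
    return s
end

Running @code_warntype sum_global() shows Any annotations (and the loop allocates), whereas sum_arg(coeffs) is fully typed. The same applies to your assembly function: everything it touches should come in through its arguments or be defined locally inside it.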

Don’t worry 🙂 that’s what Discourse is here for.

The CPU is an Intel Xeon Gold 6252, with 2 sockets (probably 24 cores each, since the total core count is 48). I have one question: even if I plan to use only one node, can there be a benefit to using distributed computing over multithreading?

I can save the data as a JLD2 file, but it seems that this forum does not let me upload .jld2 files; there are only a few file types that I can upload. I can send it to you through another medium if you are okay with that.

I find it curious that you are not trying to build your implementation on top of an existing FEM package: you could use custom weak forms to take the material mappings or other ALE specifics into account, while leveraging existing (parallel) assembly routines and (parallel) solvers, without having to redo everything.
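
For instance, in Gridap the weak form is just a user-supplied Julia function, so extra terms can in principle be written directly into it. A minimal sketch along the lines of the standard Gridap Poisson tutorial (not ALE, and the exact API may differ between versions):

using Gridap

model = CartesianDiscreteModel((0, 1, 0, 1), (10, 10))   # toy 2D mesh
order = 1
reffe = ReferenceFE(lagrangian, Float64, order)
V = TestFESpace(model, reffe; dirichlet_tags = "boundary")
U = TrialFESpace(V, 0.0)

Ω = Triangulation(model)
dΩ = Measure(Ω, 2 * order)

# The weak form is plain code; this is where custom (e.g. ALE) terms would go.
a(u, v) = ∫( ∇(v) ⋅ ∇(u) )*dΩ
l(v) = ∫( 1.0 * v )*dΩ

op = AffineFEOperator(a, l, U, V)
uh = solve(op)                                           # assembly and solve handled by Gridap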

Are you solving a classic problem?

Trixi.jl also seems to have many fluid simulation capabilities.