Improving Performance of For-Loop and Matrix Multiplication for Time Integration Solver

I am writing some code that performs a simple time-history integration solver (Newmark method for structural dynamics) and I looking for some suggestions in how I can improve the performance of my code. The below snippet of code is part of a larger function that is called from another script.

Currently, I am calculating a parameter called displacements for many time steps (think on the order of 5,000,000 steps). From observing the code and running ProfileView to profile the code it is apparent that the matrix multiplication seems to be slowing down the script in the for-loop. Some details about the input:

dofs = number of degrees of freedom in the system (at most 1282)
num_time_steps = 5,000,000
yF = vector of size (num_steps, 1)
loading_vec = vector of size (dofs-2, 1)
a1, a2, a3, k_hat_inv = matrices of size (dofs-2, dofs-2) (a1, a2, a3 are sparsely populated matrices, k_hat_inv is a full matrix)
vel_const1, vel_const2, vel_const3 = Float64 constant
acc_const1, acc_const1, acc_const1= Float64 constant

For a simple example when dofs = 12 this code runs very quickly (~3-4s) but as dofs increases to 1282 the run time increases very significantly (still running after 1/2 day) and I would like to find some ways to decrease this run time. Are there any other ways that I can improve the performance of this code?

You can also see in the for-loop some commented out lines, these are all things I have tried but resulted in poorer performance compared to the uncommented lines.

    # Initial state, forst time step at t = 0s
    cdisp = zeros(dofs - 2)
    cvel = zeros(dofs - 2)
    #caccel = m_inv * (loading_vec .- (c * cvel) .- k * cdisp) -> Always zero for zero initial conditions
    caccel = zeros(dofs - 2)
    u_i_p1 = zeros(dofs-2)
    udot_i_p1 = zeros(dofs-2)
    u2dot_i_p1 = zeros(dofs-2)
    a1_vec = zeros(dofs-2)
    a2_vec = zeros(dofs-2)
    a3_vec = zeros(dofs-2)
    P_i_p1_hat = zeros(dofs-2)

    # Initialize the displacement vector
    displacements = zeros(dofs - 2, disp_steps)

    # Start the time history analysis as per Chopra, 2017
    disp_counter = 1
    println("     Time history started")
    for ii = 1:num_time_steps-1

        # Obtain the force for the next time step
        @inbounds loading_vec[impact_dof] = yF[ii+1]

        # Obtain P_i_p1_hat
        #P_i_p1_hat = loading_vec + a1 * cdisp + a2 * cvel + a3 * caccel
        P_i_p1_hat .= loading_vec .+ mul!(a1_vec, a1, cdisp) .+ mul!(a2_vec, a2, cvel) .+ mul!(a3_vec, a3, caccel)

        # Calculate the displacement due to this force, u_i+1 = k_hat_inv * P_i_p1
        #u_i_p1 = k_hat_inv * P_i_p1_hat
        mul!(u_i_p1, k_hat_inv, P_i_p1_hat)
        #u_i_p1 = k_hat_inv * (loading_vec + a1 * cdisp + a2 * cvel + a3 * caccel)
        #mul!(u_i_p1, k_hat_inv, (loading_vec .+ (a1 * cdisp) .+ (a2 * cvel) .+ (a3 * caccel)))
        #mul!(u_i_p1, k_hat_inv, (loading_vec .+ mul!(a1_vec, a1, cdisp) .+ mul!(a2_vec, a2, cvel) .+ mul!(a3_vec, a3, caccel)))

        # Calculate the velocity due to this displacement, u_dot_i+1
        #udot_i_p1 = vel_const1 * (u_i_p1 - cdisp) + vel_const2 * cvel + vel_const3 * caccel
        udot_i_p1 .= (vel_const1 .* (u_i_p1 .- cdisp)) .+ (vel_const2 .* cvel) .+ (vel_const3 .* caccel)

        # Calculate the acceleration due to this displacement, u_2dot_i+1
        #u2dot_i_p1 = acc_const1 * (u_i_p1 - cdisp) - acc_const2 * cvel - acc_const3 * caccel
        u2dot_i_p1 .= (acc_const1 .* (u_i_p1 .- cdisp)) .- (acc_const2 .* cvel) .- (acc_const3 .* caccel)

        # Get current values for the next step
        cdisp .= u_i_p1
        cvel .= udot_i_p1
        caccel .= u2dot_i_p1

        # Write the displacement vector
        if ii % disp_save == 0
            disp_counter += 1
            displacements[:, disp_counter] .= u_i_p1

It’s kind of hard do to formatting, but it appears you are running all this code in the global scope. You should probably read

Hi @Oscar_Smith, just updated the question but this code snippet is part of a larger function that is called from the main run file. I believe this would mean these are local variables but please correct me if I am wrong (I have only been using Julia for about a week).

Are you using an actual inverse? That would be a mistake. The “fast” way is via factorization:

I am doing exactly what you are doing in line 69-70, essentially just solving the left hand side of the equation. But I am not exactly sure what you mean by an actual inverse?

Don’t use inv(K). Use factorization and backwards and forwards solve.

This is especially critical if you are inverting a sparse matrix K, in which case you can use a sparse factorization instead of the dense inverse. On the other hand, if K is dense then it’s not terrible to use the inverse since it is computed only once. Without more information it’s hard to say.

a1, a2, a3 are sparsely populated matrices

So you are using sparse-matrix data structures for these?

It will be faster to just write a single loop to do all of these updates than a bunch of vectorized operations, even with Julia’s fused-in-place broadcasting. In a single loop, you can compute quantities like u_i_p1[i] - cdisp[i] only once and re-use them for both udot_i_p1 and u2dot_i_p1, for example. You will also get better cache-line locality.


@stevengj, I am using sparse-matrix data structures for the variables that I can. As for the k matrix I am directly using the inverse so calculating it only once is far for efficient than calculating it at every loop. Additionally, placing the calculation of udot_i_p1 and u2dot_i_p1 into a for loop also increased the performance slightly, thank you.

Calculating the inverse of the sparse matrix is the wrong thing to do. It results in a dense matrix, so that the multiplication of this matrix with a vector is more costly than doing a backwards and forward substitution. Not to mention that for sufficiently large problem it becomes impossible to store the matrix in the first place.