Can I make my code faster with parallelism, or just plain better coding?

jgreener64 · February 15, 2019, 10:37am

Ages ago I played around with some LJ code but didn’t make serious efforts to optimise it: GitHub - jgreener64/LennardJones.jl: Molecular dynamics simulation of a simple Lennard Jones system

I had a bit more of a crack with Molly, an MD program that includes an LJ term: GitHub - JuliaMolSim/Molly.jl: Molecular simulation in Julia

You might find something useful in there.

Depending on your algorithmic flexibility you could consider a neighbour list, which is updated every n steps either by distance or on a cubic grid. This will seriously reduce compute time.

MrRobot · February 15, 2019, 3:21pm

Of course, I reference your code at the top of the main code I provide up above

@kristoffer.carlsson has a pretty neat package that makes an array of structures into a structure of arrays. It actually makes it so you can reference variables either in SoA, or AoS simultaneously. I think you might like it as well… https://github.com/piever/StructArrays.jl. It looks like it can be used for GPU’s too.

The big test will be how fast I can make ewalds. I am hoping once I learn my Julia lessons from LJ, I will know what to do for a fast and efficient ewalds.

jgreener64 · February 15, 2019, 3:31pm

I totally missed that.

Thanks, I’ll take a look.

MrRobot · February 15, 2019, 3:36pm

I got a huge speedup when I added the advice of @ffevotte and @bennedich

Also, using a premade table for the LJ parameters saved me 40% at the time I tested the difference in speed. The additional sqrt and multiplication each call actually added up alot.

Seif_Shebl · February 18, 2019, 9:16pm

And once again, Julia with SArrays challenges the king (aka Intel Fortran)!

I wrote the OP’s code using StaticArrays after removing unused variables and with no structs. I also wrote the equivalent Fortran version using arrays. Here is the amazing result:

Time:   85.384 ms      # Julia (-O3 --check-bounds=no)
Time:   88.630 ms      # Intel Fortran (-Ofast)

For many years, I used to trust nothing but ifort for writing any performance sensitive code, now Julia is becoming a real game changer. The only thing I dream of is being able to write vectorized code in Julia that doesn’t allocate without the need for adding @views, @uviews, or other ugly annotations. Here is the full code for both languages for reference:

# Julia Version
function vector1D(c1::Float64, c2::Float64, box_size::Float64)
    if c1 < c2
        return (c2 - c1) < (c1 - c2 + box_size) ? (c2 - c1) : (c2 - c1 - box_size)
    else
        return (c1 - c2) < (c2 - c1 + box_size) ? (c2 - c1) : (c2 - c1 + box_size)
    end
end

@inline function SingleLJ2!(i, j, f1, coord1, coord2, ϵ::Float64, σ::Float64, box_size)
    dx = vector1D(coord1[1], coord2[1], box_size[1])
    dy = vector1D(coord1[2], coord2[2], box_size[2])
    dz = vector1D(coord1[3], coord2[3], box_size[3])

    rij_sq = dx*dx + dy*dy + dz*dz
    (rij_sq > 1.0)  && return

    sr2 = σ^2 / rij_sq
    # rmag = sqrt(rij_sq)
    sr6 = sr2 ^ 3
    sr12 = sr6 ^ 2
    f = ((24 * ϵ) / rij_sq) * (2 * sr12 - sr6)

    x = -f * dx
    y = -f * dy
    z = -f * dz

    f1[i] += SVector(x,y,z)
    f1[j] -= SVector(x,y,z)    
    nothing
end

function TotalEnergyStructOfArrays(f, r, typ, vdwTable, n, box_size)
    for i = 1:(n-1)
        ti = typ[i]
        for j = (i+1):n
            tj = typ[j]
            SingleLJ2!(i, j, f, r[i], r[j], vdwTable[1][ti,tj], vdwTable[2][ti,tj], box_size)
        end
    end
    return
end

################################################################################
#                  Some general parameters used by all trials
################################################################################

using BenchmarkTools, StaticArrays

box_size = (2.0,2.0,2.0)
x = 3011 # dummy number of atoms
m = 4    # dummy number of atom types (for table of parameters...)
f = [SVector(0.0,0,0) for i=1:x] 
r = [SVector(rand(3)...) .* box_size[1] for i=1:x]
typ = trunc.(Int, [ceil(rand())*m for i=1:x])
vdwTable = [@SMatrix rand(m,m) for i=1:2] 

################################################################################
#                           Testing Struct of Arrays
################################################################################

println("------------------------Start of Testing SoA------------------------------")
print("Struct of Arrays: ")
@btime TotalEnergyStructOfArrays($f, $r, $typ, $vdwTable, $x, $box_size)
println("--------------------------End of Testing SoA------------------------------")

 # Output:
 # ------------------------Start of Testing SoA------------------------------
 # Struct of Arrays:   85.384 ms (0 allocations: 0 bytes)
 # --------------------------End of Testing SoA------------------------------

And

! Fortran version
program structOfArrays

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!                  Some general parameters used by all trials
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
integer, parameter :: x = 3011  ! dummy number of atoms
integer, parameter :: m = 4     ! dummy number of atom types (for table of parameters...)
real(8) :: box_size(3) = [2.0,2.0,2.0], r(3,x), f(3,x) = 0, vdwTable(m,m,2)
integer :: typ(x)
integer :: t0, t1, count_rate, count_max

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!                   Start of trials
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

call random_number(r)
typ = m * ceiling(r(3,:))
r   = r * box_size(1)
call random_number(vdwTable)

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!                           Testing Struct of Arrays
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

print *, "------------------------Start of Testing SoA------------------------------"
print *, "Struct of Arrays: "

call system_clock(t0, count_rate, count_max)
do iter = 1,100
  call TotalEnergyStructOfArrays(typ, f, r, vdwTable, x, box_size)
end do 
call system_clock(t1)

print*, 'Time:', real(t1-t0)/count_rate
print *, "--------------------------End of Testing SoA------------------------------"

contains

real(8) function vector1D(c1, c2, box_size)
    real(8) :: c1, c2, box_size
    if (c1 < c2) then
        if (c2 - c1 < c1 - c2 + box_size) then 
            vector1D = c2 - c1
        else 
            vector1D = c2 - c1 - box_size
        end if 
    else
        if (c1 - c2 < c2 - c1 + box_size) then 
            vector1D = c2 - c1 
        else 
            vector1D = c2 - c1 + box_size
        end if 
    end if 
end

subroutine SingleLJ2(f1, f2, coord1, coord2, epsilon, sigma, box_size)
    real(8) :: f1(:), f2(:), coord1(:), coord2(:), box_size(:), epsilon, sigma, dx, dy, dz, rij_sq, x, y, z, f 
    dx = vector1D( coord1(1), coord2(1), box_size(1) )
    dy = vector1D( coord1(2), coord2(2), box_size(2) )
    dz = vector1D( coord1(3), coord2(3), box_size(3) )

    rij_sq = dx*dx + dy*dy + dz*dz
    if (rij_sq > 1.0) return 
    
    sr2  = sigma**2 / rij_sq  
    ! rmag = sqrt(rij_sq)
    sr6  = sr2**3
    sr12 = sr6**2
    f = 24*epsilon / rij_sq * (2*sr12 - sr6)  

    x = -f * dx
    y = -f * dy
    z = -f * dz

    f1 = f1 + [x, y, z]
    f2 = f2 - [x, y, z]    
end

subroutine TotalEnergyStructOfArrays(typ, f, r, vdwTable, n, box_size)
    real(8) :: f(:,:), r(:,:), vdwTable(:,:,:), box_size(:)
    integer :: typ(:), n 
    do i = 1, (n-1)
        ti = typ(i)
        do j = (i+1), n
            tj = typ(j)
            call SingleLJ2(f(:,i), f(:,j), r(:,i), r(:,j), vdwTable(ti,tj,1), vdwTable(ti,tj,2), box_size)
        end do
    end do 
end

end program structOfArrays

 ! Output:
 ! ------------------------Start of Testing SoA------------------------------
 ! Struct of Arrays: 
 ! Time:   8.863000        ! x 10 = 88.63 ms
 ! --------------------------End of Testing SoA------------------------------

MrRobot · February 18, 2019, 9:29pm

Nice work!

I will post a link to a github I made that has I think over 8 different versions of this code I need to polish it a bit.

it has:
Array of Structs XYZ version
Struct of Array XYZ version
Array of Structs (static arrays)
Struct of Arrays (static arrays)
static arrays … Vector{SVector{3,Float64}}
mutable matrices (basically just 2 dimensional arrays)
StructArrays (from @piever github library)

And another one I don’t remember. The results are interesting. You are right, once you get Julia honed in, it is very fast! The XYZ isn’t the fastest one

Edit

On my machine your version clocks in at 153 ms

A few of the other versions I have run give me

--------------------Start of Testing StructArray-------------------------
StructArray:   132.161 ms (0 allocations: 0 bytes)
--------------------Start of Testing StructArray-------------------------
------------------------Start of Testing SoSA------------------------------
Struct of Static Arrays:   131.807 ms (0 allocations: 0 bytes)
--------------------------End of Testing SoSA------------------------------
------------------------Start of Testing AoS------------------------------
array of structs (XYZ):   145.657 ms (0 allocations: 0 bytes)
--------------------------End of Testing AoS------------------------------
------------------------Start of Testing SoA------------------------------
Struct of Arrays (XYZ):   158.322 ms (0 allocations: 0 bytes)
--------------------------End of Testing SoA------------------------------
----------------------Start of Testing MArrays----------------------------
Static MArrays  :   118.291 ms (0 allocations: 0 bytes)
------------------------End of Testing MArrays----------------------------
----------------------Start of Testing MMAtrix----------------------------

It is interesting that the mutable static arrays give better performance

In my real code which does not generate coordinates using random numbers I have the StructArray down to 90 ms for the same computation.

bennedich · February 18, 2019, 10:12pm

Nice! Something to keep in mind though: @btime runs the code many many times and returns the minimum of all runtimes, whereas the Fortran code (if I read it correctly) looks like the average runtime? Might be interesting to compare the minimum for both versions.

Seif_Shebl · February 18, 2019, 10:35pm

Yes, good observation. I ran the Fortran version many times and found the minimum to be 87.8 ms, so the above numbers still hold true. From my own experience with Fortran, timing differences between runs are usually negligible.

Seif_Shebl · February 18, 2019, 10:48pm

Very interesting. Can you share the link please after you finalize things?

MrRobot · February 19, 2019, 2:07am

I have put some of them up here GitHub - Zarasthustra/Need-For-Speed: Test sets for Molecular simulation algorithms

The latest set have a _V2 on them.

I have more, but need to wait, I am always toying with them, and don’t remember what I did to mess them up :S

These ones are alright. I have not tested to see if they all yield correct results, which makes a big difference. When I have time…

Elrod · February 19, 2019, 5:23am

Struct of arrays data layouts help when they can enable vectorization.
In this case, there are probably a lot of non-inlined function calls plus branches the compiler can’t easily turn into masked operations, meaning it wont be vectorized.

It might be possible to rewrite things in a way that this can happen, but it would probably take a fair bit of playing with it.

MrRobot · February 19, 2019, 5:31am

I will send you a case of beer for every order of magnitude you get any of the github codes down by

I have ulterior reasons for using the StructArray version though, it is a very handy data structure that will pay dividends for monte carlo with reactions. Maybe 2 cases if you do it on that one…

bennedich · February 19, 2019, 6:35am

Exactly. The reason for the speedup is not that mutable static arrays are faster, it’s that your code and benchmarks are flawed. This doesn’t imply that you’re a bad coder – it’s almost impossible to refactor code like this without making any mistakes along the way. This is why you absolutely have to employ some form of testing. It’s the very first thing you do.

So what is wrong the the mutable array version? First there’s this:

SingleLJ3!(f[j], f[j], diff,

Which should be:

SingleLJ3!(f[i], f[j], diff,

This is the same issue I pointed out in my previous post, which caused your “struct of arrays” version to run 20 % faster (and which is why initially this thread was dedicated to optimizing that version, since we all thought it was faster, when in fact it was not).

More importantly, you generate different input data. The mutable array does this:

r = [MVector{3}(rand(), rand(), rand()) for i = 1:x]

While the other versions do this:

[XYZ(box_size[1] * rand(), box_size[2] * rand(), box_size[3] * rand()) for i=1:x]

With box_size equal to 2. Since the performance of your algorithm is heavily dependent on the distribution of these random numbers, this most likely explains the differences you’re seeing.

So now, before you do any more work on this, add a simple test to your code. If you follow the suggestion in my previous post, this takes a minute or so to implement, and would have caught all of these issues.

MrRobot · February 19, 2019, 7:00am

I test my main code rigorously against known results. The only reason I posted these was because of the recent activity, but I noted specifically that I had not tested them to verify there was no bugs. I don’t have time tonight, but figured if the curious person did, they would take the time to look it over first. A test on these is necessary for sure before any victors can be claimed.

Topic		Replies	Views
Speeding up force calculations and mutable structs Performance	20	1160	August 5, 2020
Can this be written even faster? (CPU) Performance	34	808	February 9, 2024
Multiple-threading does not increase speed in my 16-threads workstation, I am wondering is there something wrong in my code? Performance question , performance , parallel	3	308	August 22, 2023
Speeding up operations on large arrays Performance	6	486	December 21, 2022
Parallelization from v1.1.1 to v1.4 Performance	0	417	March 22, 2020

Can I make my code faster with parallelism, or just plain better coding?

Edit

Related topics