Can I make my code faster with parallelism, or just plain better coding?

I have been wondering about if using the XYZ struct is the way to go… it makes it hard to vectorize things, but variable.x is easier to read and understand… I will try using static arrays on my test set and see how it goes.

Have you thought about using an algorithm that is not O(n^2)? Say a fast-multipole algorithm or similar?

4 Likes

There are a lot of things I want to try, multiple time steps (RESPA) is one of them as well, especially for the electrostatics… I am not looking forward to being crushed by Ewalds, but first I must master the basics, and then I can move on.

Also in simulation we cheat and make neighborlists so we don’t bother interacting with the particles far far away, but alot of time, is still spent on the ones in range… too much time :slight_smile:

I believe people have been playing around with fast multipole for lennard jones, I know they use them for electrostatics… I should revisit what their progress has been.

When optimizing or refactoring code like this, I always employ a simple form of Golden Master Testing. Basically, all you’re testing then, is that your working version is equal to the original version. Whether it was correct or not to begin with, who knows, but at least it works the same.

This is usually trivial to set up. Example:

Random.seed!(0)          # fix random seed, so you always get the same data

A = rand(100, 100)        # create input data
B = complex_algorithm(A)  # output data returned as a matrix

# Now calculate some hash of the output (problem specific)
hash = sum(sum(B))

# And test against what you got with the original version of your code
expected = 8425.2178752625
hash ≈ expected || println("Expected $expected, got $hash")

Of course, a sum is not a perfect hash (and if you have NaNs it’s worthless), but I don’t think you need to go overboard in creating a perfect hash. A simple sum will almost always catch refactoring errors (like the one you had), and the above takes a minute or so to implement.

(Note: Just pay attention that if you rearrange floating point operations, floats may differ slightly due to rounding artifacts, so don’t do exact comparisons.)

There are many variations of this. A more robust option is to keep a copy of you original code, and call both that and your new code with the given input, and make sure that the results are equal.

5 Likes

@aaowens

How did you get the SVector to add inplace in a for loop?

When I use your code, it works on my computer, and the timing I get is the same as yours.

But, if I try to update a SVector in my SingleLJ2! function, it fails, I get ERROR: LoadError: setindex! ::SArray{Tuple{3},Float64,1,3}, value, ::Int) is not defined.

In this case I am passing a SVector{3} to a function (SingleLJ2!), and in that function I am trying to add to each of its 3 indices in a for loop, just like in your example. You example works… mine does not :S

I get that SVectors are immutable, but why does your example seem to break this? If I add to my SVector a Vector of size 3, that works, but I get a ton of allocations and memory usage when I do this inside my function!

Actually, all of the extra allocations are because I have to fill the SVector manually as f1 = [-fdx,-fdy,-f*dz]… i then fill the second SVector as f2 = -f1, and this generates no allocations… So I also am curious about this too… how do I avoid allocations if I fill the SVector with an array…

I think you misread @aaowens’s code, this A[i] += abc means "copy abc to location i of vector A". This works because A is mutable (even though its elements are not). Thus A[i][1] += B[i][1] would not work because A[i][1] is a field of a SVector and thus cannot be updated.

To fill an SVector, see Constructing SVector with a loop - #7 by tkf. But for your example just do SVector(-fdx, -fdy, f*dz).

2 Likes

Not all allocations are equal. “Allocations” of isbitstype structs are on the stack and since they are created from scratch the compiler knows that nothing holds a reference to them and can optimize based on that. Mutable structs can have other references to them so the compiler must defensive.

For things like 3-dimensional vectors in this case I would recommend always using immutable struct and writing the code in a more functional manner (similar as how you would write it in math / physics).

3 Likes

Ages ago I played around with some LJ code but didn’t make serious efforts to optimise it: https://github.com/jgreener64/LennardJones.jl

I had a bit more of a crack with Molly, an MD program that includes an LJ term: https://github.com/jgreener64/Molly.jl

You might find something useful in there.

Depending on your algorithmic flexibility you could consider a neighbour list, which is updated every n steps either by distance or on a cubic grid. This will seriously reduce compute time.

1 Like

Of course, I reference your code at the top of the main code I provide up above :slight_smile:

@kristoffer.carlsson has a pretty neat package that makes an array of structures into a structure of arrays. It actually makes it so you can reference variables either in SoA, or AoS simultaneously. I think you might like it as well… https://github.com/piever/StructArrays.jl. It looks like it can be used for GPU’s too.

The big test will be how fast I can make ewalds. I am hoping once I learn my Julia lessons from LJ, I will know what to do for a fast and efficient ewalds.

1 Like

:joy: I totally missed that.

Thanks, I’ll take a look.

I got a huge speedup when I added the advice of @ffevotte and @bennedich

Also, using a premade table for the LJ parameters saved me 40% at the time I tested the difference in speed. The additional sqrt and multiplication each call actually added up alot.

And once again, Julia with SArrays challenges the king (aka Intel Fortran)!

I wrote the OP’s code using StaticArrays after removing unused variables and with no structs. I also wrote the equivalent Fortran version using arrays. Here is the amazing result:

Time:   85.384 ms      # Julia (-O3 --check-bounds=no)
Time:   88.630 ms      # Intel Fortran (-Ofast)

For many years, I used to trust nothing but ifort for writing any performance sensitive code, now Julia is becoming a real game changer. The only thing I dream of is being able to write vectorized code in Julia that doesn’t allocate without the need for adding @views, @uviews, or other ugly annotations. Here is the full code for both languages for reference:

# Julia Version
function vector1D(c1::Float64, c2::Float64, box_size::Float64)
    if c1 < c2
        return (c2 - c1) < (c1 - c2 + box_size) ? (c2 - c1) : (c2 - c1 - box_size)
    else
        return (c1 - c2) < (c2 - c1 + box_size) ? (c2 - c1) : (c2 - c1 + box_size)
    end
end

@inline function SingleLJ2!(i, j, f1, coord1, coord2, ϵ::Float64, σ::Float64, box_size)
    dx = vector1D(coord1[1], coord2[1], box_size[1])
    dy = vector1D(coord1[2], coord2[2], box_size[2])
    dz = vector1D(coord1[3], coord2[3], box_size[3])

    rij_sq = dx*dx + dy*dy + dz*dz
    (rij_sq > 1.0)  && return

    sr2 = σ^2 / rij_sq
    # rmag = sqrt(rij_sq)
    sr6 = sr2 ^ 3
    sr12 = sr6 ^ 2
    f = ((24 * ϵ) / rij_sq) * (2 * sr12 - sr6)

    x = -f * dx
    y = -f * dy
    z = -f * dz

    f1[i] += SVector(x,y,z)
    f1[j] -= SVector(x,y,z)    
    nothing
end

function TotalEnergyStructOfArrays(f, r, typ, vdwTable, n, box_size)
    for i = 1:(n-1)
        ti = typ[i]
        for j = (i+1):n
            tj = typ[j]
            SingleLJ2!(i, j, f, r[i], r[j], vdwTable[1][ti,tj], vdwTable[2][ti,tj], box_size)
        end
    end
    return
end

################################################################################
#                  Some general parameters used by all trials
################################################################################

using BenchmarkTools, StaticArrays

box_size = (2.0,2.0,2.0)
x = 3011 # dummy number of atoms
m = 4    # dummy number of atom types (for table of parameters...)
f = [SVector(0.0,0,0) for i=1:x] 
r = [SVector(rand(3)...) .* box_size[1] for i=1:x]
typ = trunc.(Int, [ceil(rand())*m for i=1:x])
vdwTable = [@SMatrix rand(m,m) for i=1:2] 

################################################################################
#                           Testing Struct of Arrays
################################################################################

println("------------------------Start of Testing SoA------------------------------")
print("Struct of Arrays: ")
@btime TotalEnergyStructOfArrays($f, $r, $typ, $vdwTable, $x, $box_size)
println("--------------------------End of Testing SoA------------------------------")

 # Output:
 # ------------------------Start of Testing SoA------------------------------
 # Struct of Arrays:   85.384 ms (0 allocations: 0 bytes)
 # --------------------------End of Testing SoA------------------------------

And

! Fortran version
program structOfArrays

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!                  Some general parameters used by all trials
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
integer, parameter :: x = 3011  ! dummy number of atoms
integer, parameter :: m = 4     ! dummy number of atom types (for table of parameters...)
real(8) :: box_size(3) = [2.0,2.0,2.0], r(3,x), f(3,x) = 0, vdwTable(m,m,2)
integer :: typ(x)
integer :: t0, t1, count_rate, count_max

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!                   Start of trials
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

call random_number(r)
typ = m * ceiling(r(3,:))
r   = r * box_size(1)
call random_number(vdwTable)

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!                           Testing Struct of Arrays
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

print *, "------------------------Start of Testing SoA------------------------------"
print *, "Struct of Arrays: "

call system_clock(t0, count_rate, count_max)
do iter = 1,100
  call TotalEnergyStructOfArrays(typ, f, r, vdwTable, x, box_size)
end do 
call system_clock(t1)

print*, 'Time:', real(t1-t0)/count_rate
print *, "--------------------------End of Testing SoA------------------------------"

contains

real(8) function vector1D(c1, c2, box_size)
    real(8) :: c1, c2, box_size
    if (c1 < c2) then
        if (c2 - c1 < c1 - c2 + box_size) then 
            vector1D = c2 - c1
        else 
            vector1D = c2 - c1 - box_size
        end if 
    else
        if (c1 - c2 < c2 - c1 + box_size) then 
            vector1D = c2 - c1 
        else 
            vector1D = c2 - c1 + box_size
        end if 
    end if 
end

subroutine SingleLJ2(f1, f2, coord1, coord2, epsilon, sigma, box_size)
    real(8) :: f1(:), f2(:), coord1(:), coord2(:), box_size(:), epsilon, sigma, dx, dy, dz, rij_sq, x, y, z, f 
    dx = vector1D( coord1(1), coord2(1), box_size(1) )
    dy = vector1D( coord1(2), coord2(2), box_size(2) )
    dz = vector1D( coord1(3), coord2(3), box_size(3) )

    rij_sq = dx*dx + dy*dy + dz*dz
    if (rij_sq > 1.0) return 
    
    sr2  = sigma**2 / rij_sq  
    ! rmag = sqrt(rij_sq)
    sr6  = sr2**3
    sr12 = sr6**2
    f = 24*epsilon / rij_sq * (2*sr12 - sr6)  

    x = -f * dx
    y = -f * dy
    z = -f * dz

    f1 = f1 + [x, y, z]
    f2 = f2 - [x, y, z]    
end

subroutine TotalEnergyStructOfArrays(typ, f, r, vdwTable, n, box_size)
    real(8) :: f(:,:), r(:,:), vdwTable(:,:,:), box_size(:)
    integer :: typ(:), n 
    do i = 1, (n-1)
        ti = typ(i)
        do j = (i+1), n
            tj = typ(j)
            call SingleLJ2(f(:,i), f(:,j), r(:,i), r(:,j), vdwTable(ti,tj,1), vdwTable(ti,tj,2), box_size)
        end do
    end do 
end

end program structOfArrays

 ! Output:
 ! ------------------------Start of Testing SoA------------------------------
 ! Struct of Arrays: 
 ! Time:   8.863000        ! x 10 = 88.63 ms
 ! --------------------------End of Testing SoA------------------------------
4 Likes

Nice work!

I will post a link to a github I made that has I think over 8 different versions of this code :slight_smile: I need to polish it a bit.

it has:
Array of Structs XYZ version
Struct of Array XYZ version
Array of Structs (static arrays)
Struct of Arrays (static arrays)
static arraysVector{SVector{3,Float64}}
mutable matrices (basically just 2 dimensional arrays)
StructArrays (from @piever github library)

And another one I don’t remember. The results are interesting. You are right, once you get Julia honed in, it is very fast! The XYZ isn’t the fastest one :wink:

Edit

On my machine your version clocks in at 153 ms

A few of the other versions I have run give me

--------------------Start of Testing StructArray-------------------------
StructArray:   132.161 ms (0 allocations: 0 bytes)
--------------------Start of Testing StructArray-------------------------
------------------------Start of Testing SoSA------------------------------
Struct of Static Arrays:   131.807 ms (0 allocations: 0 bytes)
--------------------------End of Testing SoSA------------------------------
------------------------Start of Testing AoS------------------------------
array of structs (XYZ):   145.657 ms (0 allocations: 0 bytes)
--------------------------End of Testing AoS------------------------------
------------------------Start of Testing SoA------------------------------
Struct of Arrays (XYZ):   158.322 ms (0 allocations: 0 bytes)
--------------------------End of Testing SoA------------------------------
----------------------Start of Testing MArrays----------------------------
Static MArrays  :   118.291 ms (0 allocations: 0 bytes)
------------------------End of Testing MArrays----------------------------
----------------------Start of Testing MMAtrix----------------------------

It is interesting that the mutable static arrays give better performance

In my real code which does not generate coordinates using random numbers I have the StructArray down to 90 ms for the same computation.

Nice! Something to keep in mind though: @btime runs the code many many times and returns the minimum of all runtimes, whereas the Fortran code (if I read it correctly) looks like the average runtime? Might be interesting to compare the minimum for both versions.

Yes, good observation. I ran the Fortran version many times and found the minimum to be 87.8 ms, so the above numbers still hold true. From my own experience with Fortran, timing differences between runs are usually negligible.

Very interesting. Can you share the link please after you finalize things?

I have put some of them up here https://github.com/Zarasthustra/Need-For-Speed/tree/master

The latest set have a _V2 on them.

I have more, but need to wait, I am always toying with them, and don’t remember what I did to mess them up :S

These ones are alright. I have not tested to see if they all yield correct results, which makes a big difference. When I have time…

Struct of arrays data layouts help when they can enable vectorization.
In this case, there are probably a lot of non-inlined function calls plus branches the compiler can’t easily turn into masked operations, meaning it wont be vectorized.

It might be possible to rewrite things in a way that this can happen, but it would probably take a fair bit of playing with it.

I will send you a case of beer for every order of magnitude you get any of the github codes down by :slight_smile:

I have ulterior reasons for using the StructArray version though, it is a very handy data structure that will pay dividends for monte carlo with reactions. Maybe 2 cases if you do it on that one…

Exactly. The reason for the speedup is not that mutable static arrays are faster, it’s that your code and benchmarks are flawed. This doesn’t imply that you’re a bad coder – it’s almost impossible to refactor code like this without making any mistakes along the way. This is why you absolutely have to employ some form of testing. It’s the very first thing you do.

So what is wrong the the mutable array version? First there’s this:

SingleLJ3!(f[j], f[j], diff,

Which should be:

SingleLJ3!(f[i], f[j], diff,

This is the same issue I pointed out in my previous post, which caused your “struct of arrays” version to run 20 % faster (and which is why initially this thread was dedicated to optimizing that version, since we all thought it was faster, when in fact it was not).

More importantly, you generate different input data. The mutable array does this:

r = [MVector{3}(rand(), rand(), rand()) for i = 1:x]

While the other versions do this:

[XYZ(box_size[1] * rand(), box_size[2] * rand(), box_size[3] * rand()) for i=1:x]

With box_size equal to 2. Since the performance of your algorithm is heavily dependent on the distribution of these random numbers, this most likely explains the differences you’re seeing.

So now, before you do any more work on this, add a simple test to your code. If you follow the suggestion in my previous post, this takes a minute or so to implement, and would have caught all of these issues.

3 Likes